Complex Disallow pattern in robots.txt
I have a URL like this:
example.com/freelance-jobs-new-york
I had a problem, and many duplicate pages were created like this:
example.com/freelance-jobs-new-york-php-php
www.example.com/freelance-jobs-new-york-php-php-php
example.com/freelance-jobs-new-york-php-php-php-php
And so on. Those pages have the same content as the main one, so to fix it I redirected every page where the php keyword appears two or more times in the URL to the main URL.
But I did this too late, so Google now has to follow redirects for maybe more than 20,000 pages that have already been crawled.
So I want to set up a Disallow rule in robots.txt to stop Google from spending crawl resources on those URLs.
My question is: what pattern should I use to disallow pages where the keyword php appears two or more times in the URL?
Will Disallow: /*php*php* work as expected? I am asking because I don't want to accidentally block good URLs.
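From Google's documentation, * matches any sequence of characters and $ anchors the end of the URL. As a rough local check (this is only my own approximation of that matching, not Googlebot itself, and the single-php path below is just a hypothetical good page), I can translate the rule into a regular expression and test it against example paths:

import re

def blocked(rule_path, url_path):
    # Approximate Google's documented wildcard handling:
    # '*' matches any run of characters, '$' anchors the end of the URL,
    # and the rule must match starting at the beginning of the path.
    pattern = re.escape(rule_path).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(pattern, url_path) is not None

rule = "/*php*php*"
for path in [
    "/freelance-jobs-new-york",              # good URL, should stay crawlable
    "/freelance-jobs-new-york-php",          # hypothetical good URL with one php
    "/freelance-jobs-new-york-php-php",      # duplicate, should be blocked
    "/freelance-jobs-new-york-php-php-php",  # duplicate, should be blocked
]:
    print(path, "blocked" if blocked(rule, path) else "allowed")

This prints "blocked" only for the paths that contain php at least twice, which is what I want to confirm before deploying the rule.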
2 Comments
You can simply use:
Disallow: /freelance-jobs-new-york-php-php*
See this Google help page: support.google.com/webmasters/answer/6062596?hl=en&ref_topic=6061961
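Since every Disallow path already matches as a prefix, the rule works even without the trailing wildcard, and you can sanity-check that plain prefix form locally with Python's standard library (note that urllib.robotparser follows the original robots.txt specification and does not understand Google's * wildcard, so this only demonstrates prefix matching):

import urllib.robotparser

robots_txt = """\
User-agent: *
Disallow: /freelance-jobs-new-york-php-php
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

for url in [
    "https://www.example.com/freelance-jobs-new-york",              # True (allowed)
    "https://www.example.com/freelance-jobs-new-york-php-php",      # False (blocked)
    "https://www.example.com/freelance-jobs-new-york-php-php-php",  # False (blocked)
]:
    print(url, rp.can_fetch("*", url))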
Googlebot does support wildcards in robots.txt; Google announced this on their blog: googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html
Not all other crawlers support wildcards, though, so that syntax is not universal.
However, putting URLs into robots.txt does not prevent Googlebot from indexing them. Your 301 redirects are a much better way to get them out of the index; a rel=canonical tag pointing at the main URL would also work.
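For the redirect side, the exact rule depends on your server or framework, but the mapping itself is simple. As a purely illustrative sketch (the (-php){2,} suffix pattern and the target URL are assumptions based on your examples), something like this computes the main URL each duplicate should 301 to:

import re

# Assumption: a path ending in two or more "-php" segments is a duplicate
# of the page without that suffix and should redirect (301) to it.
DUPLICATE_SUFFIX = re.compile(r"(-php){2,}$")

def canonical_path(path):
    # Strip the repeated "-php" run so the duplicate maps back to the main URL.
    return DUPLICATE_SUFFIX.sub("", path)

print(canonical_path("/freelance-jobs-new-york-php-php"))      # /freelance-jobs-new-york
print(canonical_path("/freelance-jobs-new-york-php-php-php"))  # /freelance-jobs-new-york
print(canonical_path("/freelance-jobs-new-york"))              # unchanged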