Am I covering the major search engine spiders in my anti-crawl protection white list?
I have a system which blacklists users who request too many pages too quickly if they are not in my white list. We only care about the main search engines, and to be honest Google is the only one my bosses are worried about.
The White List:
crawler_name - crawler_host
Googlebot - .googlebot.com
Yahoo! Slurp - crawl.yahoo.net
MSNBot - search.msn.com
If the request's User-Agent header contains the crawler_name and the hostname (from a reverse DNS lookup of the IP) contains the crawler_host of any of the entries above, then we let them request as many pages as they want.
Is this list good enough? Will this cover the main search engine spiders? Or might we accidentally block one?
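
For reference, here is a minimal sketch of the check described above, assuming the caller passes in the request's User-Agent string and remote IP; the function name and the way the request data is obtained are illustrative, not part of the actual system.

```python
import socket

# (crawler_name, crawler_host) pairs from the white list above
WHITE_LIST = [
    ("Googlebot", ".googlebot.com"),
    ("Yahoo! Slurp", "crawl.yahoo.net"),
    ("MSNBot", "search.msn.com"),
]

def is_whitelisted_crawler(user_agent: str, remote_ip: str) -> bool:
    """True if the User-Agent and the reverse-DNS hostname both match a white list entry."""
    try:
        hostname, _, _ = socket.gethostbyaddr(remote_ip)  # reverse (PTR) lookup
    except socket.herror:
        return False  # no PTR record: treat as an ordinary user
    hostname = hostname.lower()
    for crawler_name, crawler_host in WHITE_LIST:
        if crawler_name.lower() in user_agent.lower() and crawler_host in hostname:
            return True
    return False
```
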
Edit:
I've tested it using the "Fetch as Googlebot" feature in Google Webmaster Tools and it's working as expected.
According to Microsoft, "Bing operates three crawlers today: bingbot, adidxbot, msnbot." That's fine; I can add bingbot and adidxbot, but will their resolved hostnames still contain "search.msn.com"?
This is the wrong strategy, and request headers are trivial to spoof anyway.
Honestly, anti-crawler protections are fragile and generally unwise. You may end up blocking legitimate users (who will be annoyed), or the code may be forgotten, go stale, and start blocking crawlers you do want to allow.
You can, however, verify whether a bot really belongs to Google; Google's advice is here: www.google.com/support/webmasters/bin/answer.py?answer=80553 (in short: do a reverse DNS lookup on the requesting IP, check the hostname is under googlebot.com, then do a forward DNS lookup and confirm it resolves back to the same IP).
I'm not sure if the same is possible with other crawlers. Frankly, this isn't a strategy I'd employ.
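
As a rough illustration of the reverse-plus-forward DNS check Google describes at that link, here is a hedged Python sketch; the function name and the exact set of accepted domains are assumptions for the example, not an official implementation.

```python
import socket

def is_verified_googlebot(remote_ip: str) -> bool:
    """Verify a claimed Googlebot IP: reverse DNS, domain check, then forward-confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(remote_ip)  # reverse (PTR) lookup
    except socket.herror:
        return False
    # Google documents that its crawlers resolve under googlebot.com (or google.com).
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward lookup: the name must resolve back to the original IP,
        # otherwise the PTR record could have been forged.
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except socket.gaierror:
        return False
    return remote_ip in forward_ips
```

The forward-confirmation step is the important part: anyone who controls the reverse DNS for their own IP range can make it claim to be googlebot.com, but they cannot make Google's forward DNS point back at their IP.
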