
Am I covering the major search engine spiders in my anti-crawl protection white list?

@Reiling115

Posted in: #Bing #Google #Googlebot #WebCrawlers

I have a system that blacklists users who request too many pages too quickly, unless they are on my white list. We only care about the main search engines and, to be honest, Google is the only one my bosses are worried about.
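For readers who want something concrete, here is a minimal sketch of the kind of rate-based blacklisting described above. It is not the actual system from this post: the class name `RateLimiter`, the sliding-window approach, and the thresholds (60 requests per 10 seconds) are all assumptions made up for illustration.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Blacklists clients that make more than max_requests within window_seconds."""

    def __init__(self, max_requests=60, window_seconds=10.0, clock=time.monotonic):
        # Both thresholds are illustrative placeholders, not values from the post.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blacklist = set()

    def allow(self, ip, whitelisted=False):
        """Record one request; return False if the client is blacklisted."""
        if whitelisted:
            return True  # white-listed crawlers are never counted
        if ip in self.blacklist:
            return False
        now = self.clock()
        window = self.hits[ip]
        window.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()
        if len(window) > self.max_requests:
            self.blacklist.add(ip)
            return False
        return True
```

The `whitelisted` flag is where the result of the white list check would be passed in, so a verified crawler skips the counting entirely.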

The White List:

crawler_name - crawler_host
Googlebot - .googlebot.com
Yahoo! Slurp - crawl.yahoo.net
MSNBot - search.msn.com

If the HTTP_REQUEST_HEADER contains the crawler_name and the hostname (from a reverse DNS lookup of the IP) contains the crawler_host of any of the above, then we let them request as many pages as they want.
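The check described above could be sketched roughly like this. Two assumptions are worth flagging: the header being inspected is taken to be the User-Agent, and the hostname test is a suffix match rather than the "contains" test in the post, since a plain substring test would also accept something like `crawl.googlebot.com.evil.example`. The function names and the example hostnames are hypothetical.

```python
# (crawler_name, crawler_host) pairs copied from the white list in the post.
WHITELIST = [
    ("Googlebot", ".googlebot.com"),
    ("Yahoo! Slurp", "crawl.yahoo.net"),
    ("MSNBot", "search.msn.com"),
]


def host_matches(host, suffix):
    """True if host equals the domain or is a subdomain of it."""
    bare = suffix.lstrip(".")
    return host == bare or host.endswith("." + bare)


def is_whitelisted(user_agent, hostname, whitelist=WHITELIST):
    """Match the User-Agent (assumed header) and the reverse DNS hostname."""
    ua = user_agent.lower()
    host = hostname.lower().rstrip(".")  # tolerate a trailing root dot
    for name, suffix in whitelist:
        if name.lower() in ua and host_matches(host, suffix):
            return True
    return False


# Illustrative values only; real crawler hostnames may differ.
print(is_whitelisted("Mozilla/5.0 (compatible; Googlebot/2.1)",
                     "crawl-66-249-66-1.googlebot.com"))
print(is_whitelisted("Mozilla/5.0 (compatible; Googlebot/2.1)",
                     "crawl.googlebot.com.evil.example"))
```

Requiring the name and the host to come from the same row means a client claiming to be Googlebot cannot pass by resolving to an MSN hostname.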

Is this list good enough? Will this cover the main search engine spiders? Or might we accidentally block one?

Edit:
I've tested it using the "Fetch as GoogleBot" feature in Google Webmaster Tools and it's working as expected.
According to Microsoft, "Bing operates three crawlers today: bingbot, adidxbot, msnbot". That's fine, I can add bingbot and adidxbot, but will their resolved hostnames still contain "search.msn.com"?


1 Comments


@Harper822

This is the wrong strategy. Also, headers are trivial to spoof.

Honestly, anti-crawler protections are very fragile and generally unwise. You may end up blocking legitimate users (who'll be annoyed), or your code might be forgotten about, go stale, and start blocking crawlers you want to allow.

You can, however, verify whether a bot really belongs to Google by doing a reverse DNS lookup on its IP and then a forward lookup on the returned hostname. Advice here: www.google.com/support/webmasters/bin/answer.py?answer=80553
I'm not sure if the same is possible with other crawlers. Frankly, this isn't a strategy I'd employ.
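If you do go down this road anyway, verifying Googlebot comes down to a forward-confirmed reverse DNS check, which ignores the headers entirely. Here is a rough sketch, not production code. The function name and the injectable `reverse_lookup` / `forward_lookup` parameters are my own, added so it can be tried without touching real DNS, and the allowed suffixes reflect my understanding of Google's guidance (googlebot.com or google.com).

```python
import socket


def verify_crawler_ip(ip,
                      allowed_suffixes=(".googlebot.com", ".google.com"),
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs must include IP."""
    try:
        hostname = reverse_lookup(ip)[0]  # PTR lookup
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    host = hostname.lower().rstrip(".")
    # The PTR record is controlled by whoever owns the IP block, so it
    # proves nothing until the forward lookup below confirms it.
    if not host.endswith(tuple(allowed_suffixes)):
        return False
    try:
        addresses = forward_lookup(host)[2]  # IPv4 only with gethostbyname_ex
    except OSError:
        return False
    return ip in addresses
```

Since this costs two DNS lookups per new IP, you'd want to cache the result per address rather than run it on every request.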
