Is there a reference listing of the IP addresses used by indexing bots? I have a page that gets minimal traffic, but I set up notifications when it gets hit. Now I want bots to be ignored, so what I'm doing at the moment is adding each bot I see to a "no notify" list.
e.g. a list like:
$no_mail = array(
    '67.195.115.105', // Yahoo bot
    '207.46.199.50',  // MSN bot
    '61.135.249.246', // Youdao bot
    '207.46.199.32',  // MSN bot
);
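For context, the way such a list gets used is a simple membership check before the notification goes out. A minimal sketch (variable and comment wording are my own, not from the question):

```php
<?php
// Suppress the notification when the visitor's IP is on the no-notify list.
$no_mail = array(
    '67.195.115.105', // Yahoo bot
    '207.46.199.50',  // MSN bot
);

$visitor = isset( $_SERVER['REMOTE_ADDR'] ) ? $_SERVER['REMOTE_ADDR'] : '';

if ( !in_array( $visitor, $no_mail, true ) ) {
    // send the "page was hit" notification here
}
```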
One way or another, if you are serious about filtering out bots you will need to maintain a local list as well. Sometimes seemingly random IPs get obsessed with a website I administer: university projects, poorly implemented experimental bots that aren't generally recognized, that sort of thing.
Also: the Cuil bot (Twiceler) is the devil.
All of the search engines use a huge number of IP addresses. You'll want to look at the user agent string instead. Check this page for a good list of all crawlers.
In PHP, something like this would work:
$bots = array( 'googlebot', 'msnbot', 'slurp', 'mediapartners-google' );

$isRobot = false;
$ua = strtolower( $_SERVER['HTTP_USER_AGENT'] );
foreach ( $bots as $bot ) {
    if ( strpos( $ua, $bot ) !== false ) {
        $isRobot = true;
        break;
    }
}

if ( !$isRobot ) {
    // do your thing
}
There's some code to recognize bots at ekstreme.com/phplabs/search-engine-authentication (as well as the Google Help Center article at www.google.com/support/webmasters/bin/answer.py?answer=80553 on verifying Googlebot). There's also some code at ekstreme.com/phplabs/crawlercontroller.php that can be used to recognize crawlers, which you could easily extend to recognize "good" crawlers as well as the spammy ones it recognizes now.
In general, it's important not to rely on either user-agent name or IP address alone, since some user-agents may be used by normal users and some IP addresses may be shared.
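When you do need certainty, the Google help article mentioned above describes a reverse-then-forward DNS check for verifying Googlebot. A sketch of that check (the function name and structure are my own, not from the answer):

```php
<?php
// Verify that an IP really belongs to Googlebot using the
// reverse-then-forward DNS check Google documents:
// 1. reverse DNS on the IP must yield a googlebot.com/google.com host,
// 2. forward DNS on that host must resolve back to the same IP.
function isVerifiedGooglebot( $ip ) {
    $host = gethostbyaddr( $ip ); // reverse DNS lookup
    if ( $host === false || $host === $ip ) {
        return false;             // no PTR record
    }
    if ( !preg_match( '/\.(googlebot|google)\.com$/', $host ) ) {
        return false;             // hostname not in Google's domains
    }
    return gethostbyname( $host ) === $ip; // forward lookup must match
}
```

Both lookups cost a network round trip, so cache the result per IP rather than repeating the check on every request.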
That said, if you're only using this for email notifications, I'd probably just ignore simple known patterns in the user-agent and live with the false positives & false negatives. Check your log files for the most common crawlers that are active on your site and just check for a unique part of the user-agent name (it might be enough to just use "googlebot|slurp|msnbot|bingbot").
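That filter can be a one-liner. A sketch using the alternation suggested above:

```php
<?php
// Skip the notification email when the user-agent matches a known
// crawler pattern (the alternation is the one suggested above).
$ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';

if ( !preg_match( '/googlebot|slurp|msnbot|bingbot/i', $ua ) ) {
    // not a known crawler: send the notification
}
```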
Can you access the user agent? That seems to me a better way of working out who is a real user and what is a bot: it's more resilient to legitimate crawlers changing addresses, and if anything is masquerading as a bot, you probably don't want the email anyway.
Why don't you just put this in your robots.txt file?
User-agent: *
Disallow: /path/page-you-dont-want-crawled.html
That way you won't need to keep hunting for bots. I would bet anything that Google, Yahoo, and MSN have hundreds of bots, probably on different IP addresses, with new ones being created all the time. Adding the above should keep well-behaved crawlers off the page without all of the hassle.