
Is there an index of the IP addresses used by indexing bots?

@Dunderdale272

Posted in: #SearchEngines #WebCrawlers

I have a page that gets minimal traffic, and I set up notifications for whenever it gets hit. Now I want bots to be ignored, so what I'm doing at the moment is adding each bot I see to a "no notify" list.

Is there a reference listing of the IP addresses used by indexing robots?

E.g., a list like:

$no_mail = array(
    '67.195.115.105', // yahoo bot
    '207.46.199.50',  // msn bot
    '61.135.249.246', // youdao bot
    '207.46.199.32',  // msn bot
);


6 Comments


 

@Dunderdale272

One way or another, if you are serious about filtering out bots, you will need to maintain a local list as well. Sometimes seemingly random IPs get obsessed with a website I am administering: university projects, poorly implemented bots that seem experimental but are not generally recognized, those sorts of things.

Also: the Cuil bot (Twiceler) is the devil.
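A local list like the one described above could sit alongside a user-agent check. This is only a minimal sketch: the function name `shouldNotify`, its parameters, and the sample entries are assumptions for illustration, and the list would be maintained by hand from your own logs.

```php
<?php
/**
 * Decide whether a hit should trigger a notification.
 * Hypothetical helper: $ignoredIps is the hand-maintained local list,
 * $botNames holds lowercase fragments of known crawler user agents.
 */
function shouldNotify(string $ip, string $userAgent, array $ignoredIps, array $botNames): bool
{
    // Exact match against the locally maintained IP list.
    if (in_array($ip, $ignoredIps, true)) {
        return false;
    }

    // Case-insensitive substring match against known crawler names.
    $ua = strtolower($userAgent);
    foreach ($botNames as $bot) {
        if ($bot !== '' && strpos($ua, strtolower($bot)) !== false) {
            return false;
        }
    }

    return true;
}

// Example values; the IP is one of the addresses from the question.
$ignoredIps = array('67.195.115.105');
$botNames   = array('googlebot', 'msnbot', 'slurp');
```

The IP list catches the oddballs that do not announce themselves, while the name list covers the well-known crawlers without having to track their addresses.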



 

@Cofer257

All of the search engines use a huge number of IP addresses. You'll want to look at the user agent string instead. Check this page for a good list of all crawlers.

In PHP, something like this would work:

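$bots = array( 'googlebot', 'msnbot', 'slurp', 'mediapartners-google' );
$isRobot = false;
// The header can be missing, so fall back to an empty string.
$ua = strtolower( $_SERVER['HTTP_USER_AGENT'] ?? '' );

foreach ( $bots as $bot ) {
    // Substring match of each known crawler name against the lowercased user agent.
    if ( strpos( $ua, $bot ) !== false ) {
        $isRobot = true;
        break;
    }
}

if ( !$isRobot ) {
    // do your thing
}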



 

@Welton855

There's some code to recognize bots at ekstreme.com/phplabs/search-engine-authentication (as well as the Google Help Center article at www.google.com/support/webmasters/bin/answer.py?answer=80553 on verifying Googlebot). There's also some code at ekstreme.com/phplabs/crawlercontroller.php that can be used to recognize crawlers, which you could easily extend to recognize "good" crawlers as well as the spammy ones it recognizes now.

In general, it's important not to rely on either user-agent name or IP address alone, since some user-agents may be used by normal users and some IP addresses may be shared.
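For cases where the user agent alone is not trusted, the verification approach in the Google article boils down to a reverse DNS lookup followed by a forward lookup that must return the original IP. The sketch below is one way to write that in PHP, not code from either linked page. The function name, the injectable `$reverse` and `$forward` parameters (there so the logic can be exercised without network access), and the hostname suffixes you pass in are all assumptions; check each search engine's current documentation for the domains it actually uses.

```php
<?php
/**
 * Reverse-then-forward DNS check for an IP that claims to be a crawler.
 * Hypothetical helper. By default it uses PHP's gethostbyaddr() and
 * gethostbynamel(); the latter only returns IPv4 addresses.
 */
function isVerifiedCrawlerIp(string $ip, array $allowedSuffixes, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // gethostbyaddr() returns the unmodified IP on failure, false on malformed input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === '' || $host === $ip) {
        return false;
    }
    $host = strtolower(rtrim($host, '.'));

    // The hostname must end in ".<suffix>", so "evilgooglebot.com" does not pass.
    $suffixMatches = false;
    foreach ($allowedSuffixes as $suffix) {
        $suffix = '.' . ltrim(strtolower($suffix), '.');
        if (substr($host, -strlen($suffix)) === $suffix) {
            $suffixMatches = true;
            break;
        }
    }
    if (!$suffixMatches) {
        return false;
    }

    // The forward lookup of that hostname must include the original IP.
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

A call such as `isVerifiedCrawlerIp($_SERVER['REMOTE_ADDR'], array('googlebot.com'))` performs two DNS lookups per request, so in practice the result would probably be cached per IP.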

That said, if you're only using this for email notifications, I'd probably just ignore simple known patterns in the user-agent and live with the false positives & false negatives. Check your log files for the most common crawlers that are active on your site and just check for a unique part of the user-agent name (it might be enough to just use "googlebot|slurp|msnbot|bingbot").
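The simple pattern check suggested above might look like the following in PHP. It is a sketch only: the function name `looksLikeCrawler` is made up for illustration, and the pattern is the one quoted in the paragraph, which you would extend based on your own log files.

```php
<?php
/**
 * Returns true when the user agent contains one of a few well-known
 * crawler names (case-insensitive). Hypothetical helper name.
 */
function looksLikeCrawler(string $userAgent): bool
{
    return preg_match('/googlebot|slurp|msnbot|bingbot/i', $userAgent) === 1;
}

// The header may be absent, so default to an empty string.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (!looksLikeCrawler($ua)) {
    // send the notification email here
}
```

An empty or missing user agent is treated as a normal visitor here, which errs on the side of sending the notification.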



 

@Gail5422790

www.user-agents.org/ might be what you are looking for.



 

@Berryessa370

Can you access the user agent? That seems to me a better way of working out who is a real user and what is a bot. It's more resilient to legitimate crawlers changing addresses, and if anything is masquerading as a bot, you probably don't want to get the email anyway.



 

@RJPawlick198

Why don't you just put this in your robots.txt file?

User-agent: *
Disallow: /path/page-you-dont-want-crawled.html


That way you won't need to keep hunting for bots. I would bet anything that Google, Yahoo, and MSN have hundreds of bots, probably with different IP addresses and new ones being added all the time. Adding the above should keep crawlers that honor robots.txt away from your page without all of the hassle.


