Is there a reference listing of the IP addresses used by indexing bots? I have a page that gets minimal traffic, but I set up notifications when it gets hit. Now I want bots to be ignored, so what I'm doing at the moment is adding each bot I see to a "no notify" list.
e.g. a list like:
$no_mail = array(
    '67.195.115.105', // Yahoo bot
    '207.46.199.50',  // MSN bot
    '61.135.249.246', // Youdao bot
    '207.46.199.32',  // MSN bot
);
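For context, the way such a list gets used is a simple membership check before the notification goes out. A minimal sketch (variable and comment wording are my own, not from the question):

```php
<?php
// Suppress the notification when the visitor's IP is on the no-notify list.
$no_mail = array(
    '67.195.115.105', // Yahoo bot
    '207.46.199.50',  // MSN bot
);

$visitor = isset( $_SERVER['REMOTE_ADDR'] ) ? $_SERVER['REMOTE_ADDR'] : '';

if ( !in_array( $visitor, $no_mail, true ) ) {
    // send the "page was hit" notification here
}
```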
One way or another, if you are serious about filtering out bots you will need to maintain a local list as well. Sometimes seemingly random IPs get obsessed with a website I administer: university projects, poorly implemented experimental bots that aren't generally recognized, that sort of thing.
Also: the Cuil bot (Twiceler) is the devil.
All of the search engines use a huge number of IP addresses. You'll want to look at the user agent string instead. Check this page for a good list of all crawlers.
In PHP, something like this would work:
$bots = array( 'googlebot', 'msnbot', 'slurp', 'mediapartners-google' );

$isRobot = false;
$ua = strtolower( $_SERVER['HTTP_USER_AGENT'] );
foreach ( $bots as $bot ) {
    if ( strpos( $ua, $bot ) !== false ) {
        $isRobot = true;
        break;
    }
}

if ( !$isRobot ) {
    // do your thing
}
There's some code to recognize bots at ekstreme.com/phplabs/search-engine-authentication (as well as the Google Help Center article at www.google.com/support/webmasters/bin/answer.py?answer=80553 on verifying Googlebot). There's also some code at ekstreme.com/phplabs/crawlercontroller.php that can be used to recognize crawlers, which you could easily extend to recognize "good" crawlers as well as the spammy ones it recognizes now.
In general, it's important not to rely on either user-agent name or IP address alone, since some user-agents may be used by normal users and some IP addresses may be shared.
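When you do need certainty, the Google help article mentioned above describes a reverse-then-forward DNS check for verifying Googlebot. A sketch of that check (the function name and structure are my own, not from the answer):

```php
<?php
// Verify that an IP really belongs to Googlebot using the
// reverse-then-forward DNS check Google documents:
// 1. reverse DNS on the IP must yield a googlebot.com/google.com host,
// 2. forward DNS on that host must resolve back to the same IP.
function isVerifiedGooglebot( $ip ) {
    $host = gethostbyaddr( $ip ); // reverse DNS lookup
    if ( $host === false || $host === $ip ) {
        return false;             // no PTR record
    }
    if ( !preg_match( '/\.(googlebot|google)\.com$/', $host ) ) {
        return false;             // hostname not in Google's domains
    }
    return gethostbyname( $host ) === $ip; // forward lookup must match
}
```

Both lookups cost a network round trip, so cache the result per IP rather than repeating the check on every request.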
That said, if you're only using this for email notifications, I'd probably just ignore simple known patterns in the user-agent and live with the false positives & false negatives. Check your log files for the most common crawlers that are active on your site and just check for a unique part of the user-agent name (it might be enough to just use "googlebot|slurp|msnbot|bingbot").
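That filter can be a one-liner. A sketch using the alternation suggested above:

```php
<?php
// Skip the notification email when the user-agent matches a known
// crawler pattern (the alternation is the one suggested above).
$ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';

if ( !preg_match( '/googlebot|slurp|msnbot|bingbot/i', $ua ) ) {
    // not a known crawler: send the notification
}
```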
Can you access the user agent? That seems to me a better way of working out who is a real user and what is a bot: it's more resilient to legitimate crawlers changing addresses, and if anything is masquerading as a bot, you probably don't want the email anyway.
Why don't you just put this in your robots.txt file?
User-agent: *
Disallow: /path/page-you-dont-want-crawled.html
That way you won't need to keep hunting for bots. I would bet anything that Google, Yahoo, and MSN have hundreds of bots, probably on different IP addresses, with new ones being created all the time. Adding the above should keep well-behaved crawlers off the page without all of the hassle.