
How To Track Down and Stop Rogue Bots?

@Welton855

Posted in: #Bandwidth #WebCrawlers

Most of the bandwidth of one site is being consumed by an unidentified bot. AWStats reports:
Unknown robot (identified by 'bot*') consumed 164 GB this month.

By comparison, Googlebot consumed 10 GB and visitors (viewed traffic) consumed 25 GB.
This means rogue bots are consuming over 6X the bandwidth of visitors (164 GB vs. 25 GB). On the other sites
I run (about a dozen), bot traffic is normally about 25% of viewed traffic, so for 25 GB of viewed traffic, all bots together take about 6 GB.

The question therefore is: how can I identify which bot(s) are generating this huge volume of requests, and how can I stop them, or slow them down if they turn out to be useful?

Obviously, many of the bots that visit the site are important, including Googlebot, Yahoo Slurp, MSNBot and the AdSense/DoubleClick bots, so I cannot simply block all bots.

The reason I am investigating this is that I am close to my host's bandwidth limit and have already exceeded its CPU usage limit, so I was sent a notice.


2 Comments


@Shakeerah822

Block Unwanted Robots/Spiders visitors via PHP

Instructions:

Place the following PHP code at the beginning of your index.php file.

The idea is to put the code in the site's main PHP entry point, which is normally the home page.

If other PHP files are accessed directly via a URL (as opposed to support files that are only pulled in with include or require), place the code at the beginning of those files as well.
For most PHP sites and PHP-based CMS sites, the root index.php file is the main entry point.

Keep in mind that your site statistics, e.g. AWStats, will still log the hits under Unknown robot (identified by 'bot' followed by a space or one of the following characters _+:,.;/-), but these bots will be blocked from accessing your site's content.

<?php
// ---------------------------------------------------------------------------------------------------------------

// Banned IP Addresses and Bots - Redirects banned visitors who make it past the .htaccess and/or robots.txt files to a URL.
// The $banned_ip_addresses array can contain both full and partial IP addresses, e.g. Full = 203.0.113.101, Partial = 203.0.113. or 203.0. or 203.
// Use a partial IP address to ban every IP address that begins with it. A partial IP address must end with a period.
// The $banned_bots, $banned_unknown_bots, and $good_bots arrays should contain keyword strings found within the User Agent string.
// The $banned_unknown_bots array is used to identify unknown robots (identified by 'bot' followed by a space or one of the following characters _+:,.;/-).
// The $good_bots array contains keyword strings used as exemptions when checking for $banned_unknown_bots. If you leave the $good_bots array empty,
// i.e. $good_bots = array(), then you must remove the keyword strings 'bot.','bot/','bot-' from the $banned_unknown_bots array or else the good bots will also be banned.
$banned_ip_addresses = array('41.','64.79.100.23','5.254.97.75','148.251.236.167','88.180.102.124','62.210.172.77','45.','195.206.253.146');
$banned_bots = array('.ru','AhrefsBot','crawl','crawler','DotBot','linkdex','majestic','meanpath','PageAnalyzer','robot','rogerbot','semalt','SeznamBot','spider');
$banned_unknown_bots = array('bot ','bot_','bot+','bot:','bot,','bot;','bot.','bot/','bot-');
$good_bots = array('Google','MSN','bing','Slurp','Yahoo','DuckDuck');
$banned_redirect_url = 'http://english-1329329990.spampoison.com';

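// Visitor's IP address and Browser (User Agent); the User-Agent header may be missing, so default to an empty string
$ip_address = $_SERVER['REMOTE_ADDR'];
$browser = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';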

// Declared Temporary Variables
$ipfound = $piece = $botfound = $gbotfound = $ubotfound = '';

// Checks for Banned IP Addresses and Bots
if($banned_redirect_url != ''){
    // Checks for Banned IP Address
    if(!empty($banned_ip_addresses)){
        if(in_array($ip_address, $banned_ip_addresses)){$ipfound = 'found';}
        if($ipfound != 'found'){
            $ip_pieces = explode('.', $ip_address);
            foreach ($ip_pieces as $value){
                $piece = $piece.$value.'.';
                if(in_array($piece, $banned_ip_addresses)){$ipfound = 'found'; break;}
            }
        }
        if($ipfound == 'found'){header("Location: $banned_redirect_url"); exit();}
    }

    // Checks for Banned Bots
    if(!empty($banned_bots)){
        foreach ($banned_bots as $bbvalue){
            $pos1 = stripos($browser, $bbvalue);
            if($pos1 !== false){$botfound = 'found'; break;}
        }
        if($botfound == 'found'){header("Location: $banned_redirect_url"); exit();}
    }

    // Checks for Banned Unknown Bots
    if(!empty($good_bots)){
        foreach ($good_bots as $gbvalue){
            $pos2 = stripos($browser, $gbvalue);
            if($pos2 !== false){$gbotfound = 'found'; break;}
        }
    }
    if($gbotfound != 'found'){
        if(!empty($banned_unknown_bots)){
            foreach ($banned_unknown_bots as $bubvalue){
                $pos3 = stripos($browser, $bubvalue);
                if($pos3 !== false){$ubotfound = 'found'; break;}
            }
            if($ubotfound == 'found'){header("Location: $banned_redirect_url"); exit();}
        }
    }
}

// ---------------------------------------------------------------------------------------------------------------
?>
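
To decide which keywords actually belong in $banned_bots, it helps to see which user agents are using the bandwidth. The following is only a rough sketch: it assumes the server writes its access log in the Apache "combined" format, and the function name tally_bytes_by_agent and the sample log lines are made up for illustration.

```php
<?php
// Hypothetical helper: sums response bytes per User-Agent from log lines
// in the Apache "combined" format (an assumption about the server's logs).
function tally_bytes_by_agent(iterable $lines): array
{
    // host ident user [time] "request" status bytes "referer" "user-agent"
    $pattern = '/^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-) "[^"]*" "([^"]*)"/';
    $totals = [];
    foreach ($lines as $line) {
        if (!preg_match($pattern, $line, $m)) {
            continue; // skip lines that do not match the assumed format
        }
        $bytes = ($m[1] === '-') ? 0 : (int) $m[1];
        $agent = ($m[2] === '') ? '(empty)' : $m[2];
        $totals[$agent] = ($totals[$agent] ?? 0) + $bytes;
    }
    arsort($totals); // largest consumers first
    return $totals;
}

// Example usage with two made-up log lines (not real traffic).
$sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /a HTTP/1.1" 200 5000 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:55:40 +0000] "GET /b HTTP/1.1" 200 1200 "-" "Mozilla/5.0"',
];
foreach (tally_bytes_by_agent($sample) as $agent => $bytes) {
    echo $bytes, "\t", $agent, "\n";
}
```

For a real log, the array can be swapped for new SplFileObject($path), which yields the file line by line; the path depends on the host. Lines whose request field contains escaped quotes are skipped by this simple pattern, so the totals are a lower bound.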


@Kristi941

Create a page that captures the IP address of anyone who visits it. Add those IPs to an .htaccess file that blocks them. (see example here)
Link to that page in the footer of your website using a 1px transparent image.
Block that page in robots.txt so good robots won't find it.
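
The three steps above could be sketched roughly as follows. This is a minimal illustration, not a drop-in script: the file name trap.php, the function names, and the keyword list are assumptions, and the "Deny from" lines use Apache 2.2-style syntax, which on Apache 2.4 requires mod_access_compat. The matching robots.txt entry would be a Disallow: /trap.php line.

```php
<?php
// Hypothetical trap page (e.g. /trap.php): robots.txt disallows it, so only
// crawlers that ignore robots.txt should ever request it.

// Returns true when the User-Agent contains a whitelisted keyword.
function is_whitelisted_agent(string $agent, array $good_bots): bool
{
    foreach ($good_bots as $keyword) {
        if (stripos($agent, $keyword) !== false) {
            return true;
        }
    }
    return false;
}

// Appends a "Deny from <ip>" line to $file unless it is already listed.
// Returns true only if a new line was written.
function record_trapped_ip(string $ip, string $file): bool
{
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false; // never write unvalidated input into a config file
    }
    $line = "Deny from $ip";
    $existing = is_file($file) ? (file($file, FILE_IGNORE_NEW_LINES) ?: []) : [];
    if (in_array($line, $existing, true)) {
        return false;
    }
    return file_put_contents($file, $line . "\n", FILE_APPEND | LOCK_EX) !== false;
}

// Only act on real web requests, so the functions can be loaded on their own.
if (PHP_SAPI !== 'cli' && isset($_SERVER['REMOTE_ADDR'])) {
    $agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
    $good_bots = ['Google', 'bing', 'Slurp', 'DuckDuck']; // assumed whitelist keywords
    if (!is_whitelisted_agent($agent, $good_bots)) {
        record_trapped_ip($_SERVER['REMOTE_ADDR'], __DIR__ . '/.htaccess');
    }
    http_response_code(403);
}
```

The web server user needs write access to the target file for this to work, which is a trade-off worth weighing; writing to a separate blocklist file and reviewing it before copying entries into .htaccess is a more cautious variant.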


Note: whitelisting good IPs and/or user agents is also a good idea (IP Addresses of Search Engine Spiders)
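
Because a User-Agent string is easy to fake, one way to whitelist by IP without maintaining a list is a reverse-then-forward DNS check. The sketch below is an illustration only: the function name is_verified_crawler is made up, and the default suffix list is an assumption that should be confirmed against each search engine's own documentation. The two lookup functions are passed in as parameters so the logic can be exercised without network access.

```php
<?php
// Hypothetical check: a crawler IP is trusted only if its reverse-DNS name
// ends in an expected domain AND that name resolves back to the same IP.
function is_verified_crawler(
    string $ip,
    array $trusted_suffixes = ['.googlebot.com', '.google.com', '.search.msn.com'],
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';  // IP -> host name
    $forward = $forward ?? 'gethostbynamel'; // host name -> list of IPv4 addresses
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false; // lookup failed (gethostbyaddr returns the IP unchanged)
    }
    $host = strtolower($host);
    $matches = false;
    foreach ($trusted_suffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            $matches = true;
            break;
        }
    }
    if (!$matches) {
        return false; // host name is not under a trusted domain
    }
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

A trap page or ban script could call is_verified_crawler($_SERVER['REMOTE_ADDR']) before blocking a visitor that claims to be a search engine. DNS lookups are slow, so caching the result per IP would be sensible, and gethostbynamel only returns IPv4 addresses, so IPv6 visitors would need a different forward lookup.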
