
Rules in .htaccess to block spiders don't appear to be effective: I still see the crawlers in AWStats

@Samaraweera270

Posted in: #Htaccess

I have put this code in .htaccess to prevent search engines from accessing my site.
However, I still see them listed daily in the AWStats report on my server.

Does this mean they are still crawling my site? Or have they not actually got in, and each request is just logged as an attempt?

# Stop the Nasties!!
RewriteEngine on

RewriteCond %{HTTP_USER_AGENT} ^autoemailspider [OR]
RewriteCond %{HTTP_USER_AGENT} ^baidu [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider* [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bingbot[OR]
RewriteCond %{HTTP_USER_AGENT} ^Yandex [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sosospider [OR]
RewriteCond %{HTTP_USER_AGENT} ^AhrefsBot[OR]
RewriteCond %{HTTP_USER_AGENT} ^AITCSRobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Arachnophilia [OR]
RewriteCond %{HTTP_USER_AGENT} ^archive.org_bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot[OR]
RewriteCond %{HTTP_USER_AGENT} ^BSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^CFNetwork[OR]
RewriteCond %{HTTP_USER_AGENT} ^CyberPatrol [OR]
RewriteCond %{HTTP_USER_AGENT} ^DeuSu[OR]
RewriteCond %{HTTP_USER_AGENT} ^DotBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^FeedlyBot[OR]
RewriteCond %{HTTP_USER_AGENT} ^Genieo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gluten Free Crawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrapeshotCrawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^MaxPointCrawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^meanpathbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^PagesInventory [OR]
RewriteCond %{HTTP_USER_AGENT} ^PHP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Plukkie [OR]
RewriteCond %{HTTP_USER_AGENT} ^Qwantify [OR]
RewriteCond %{HTTP_USER_AGENT} ^SemrushBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SentiBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SEOkicks-Robot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SeznamBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^WeSEE_Bot [OR]
RewriteCond %{HTTP_USER_AGENT} ^worldwebheritage.org [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu Link Sleuth [OR]
RewriteCond %{HTTP_USER_AGENT} ^Yahoo! Slurp[OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^SogouwebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^360Spider [OR]
RewriteRule ^.* - [F,L]


Yes, this is the code I am using, copied from here. Looking at my raw log I don't see any Forbidden (403) entries. Here are two examples of today's entries. Should I delete the OR entries, and what do they mean?

207.46.13.186 - - [30/Nov/2016:12:05:19 +0000] "GET /comrades/comrades%20football%20team2.jpg HTTP/1.1" 200 47649 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

52.213.197.166 - - [03/Dec/2016:14:54:02 +0000] "GET /robots.txt HTTP/1.0" 200 1473 "-" "IDG/UK (http://spaziodati.eu/)"


1 Comment


 

@Ogunnowo487

:
RewriteCond %{HTTP_USER_AGENT} ^Yahoo! Slurp[OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^SogouwebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^360Spider [OR]
RewriteRule ^.* - [F,L]



This code is actually "broken" in several places and will never work as intended. In fact, it won't block anything in its current state, which explains what you are seeing in your access log.


1. You need to remove the OR flag from the last RewriteCond directive (the ^360Spider line). That trailing OR flag would ordinarily cause all traffic to be blocked!
(But since you have further errors - see #2 - this does not happen.)
2. RewriteCond %{HTTP_USER_AGENT} ^Bingbot[OR]

You are missing a space between the CondPattern (^Bingbot) and the flags argument ([OR]). It should be ^Bingbot [OR]. As written, [OR] is parsed as part of the regex, so the condition won't match "Bingbot". But, crucially, with no OR flag the condition is joined to the next one by an implicit AND - so the whole rule block can never succeed and no bot is blocked at all! I count at least 7 directives in your code above where this space is missing.
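The missing space is easy to overlook, so here are the broken and the corrected directive side by side (the annotations are mine):

# Broken: no space before [OR], so Apache reads the whole of ^Bingbot[OR]
# as the pattern ([OR] becomes a character class) and sees no flags at all
RewriteCond %{HTTP_USER_AGENT} ^Bingbot[OR]

# Fixed: the pattern and the flags are separate arguments
RewriteCond %{HTTP_USER_AGENT} ^Bingbot [OR]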
3. As Stephen has already pointed out in the comments, the regexes used to match these bots are not necessarily correct. For example, a pattern such as ^Bingbot matches the exact string "Bingbot" (capital "B") at the start of the user-agent (^ being a start-of-string anchor). But the log entry you've shown contains "bingbot" (all lowercase) in the middle of the user-agent string, so it will not match. You probably need a condition like the following, without the ^ prefix and with the NC flag for a case-insensitive match:

RewriteCond %{HTTP_USER_AGENT} bingbot [NC,OR]


You'll need to check the other regexes as well, to see whether they actually match the User-Agent strings you are trying to target. Should you be matching at the start of the UA (^)? Should the match be case-insensitive (NC)?
4. Minor point... given the following two directives, the first one is superfluous (the second matches everything the first does). However, the second one looks like an error: in regex, the trailing * applies only to the preceding "r", so ^Baiduspider* is not a wildcard - it also matches "Baiduspide".

RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider* [OR]
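Putting the fixes above together, a corrected block might look something like the following. This is only a sketch covering a handful of the bots - the grouped, unanchored, case-insensitive patterns are my suggestion, not your list verbatim:

RewriteEngine on

# Match anywhere in the UA string, case-insensitively (NC);
# every condition except the last carries the OR flag
RewriteCond %{HTTP_USER_AGENT} (bingbot|baiduspider|yandex|ahrefsbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (mj12bot|semrushbot|dotbot|360spider) [NC]
RewriteRule ^ - [F]

Note the space before every flags argument, and that the final condition has no OR flag.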




However, I still see them listed daily in the AWStats report on my server.


Yes, even if you block the bots (once your code is working), they will still "hit" your server and be logged in the server's access log from which AWStats builds its reports.

However, check your raw access log and you should see a 403 (Forbidden) in the response status for these requests (this is probably reported in AWStats as well). If not, then something is wrong.

The RewriteRule can also be simplified:

RewriteRule ^ - [F]


The L flag is implied when you use the F flag.


