Mobile app version of vmapp.org
YK1175434

Prevent crawlers that don't honor robots.txt

@YK1175434

Posted in: #Indexing #RobotsTxt #WebCrawlers

I have a problem writing robots.txt for my site.

While searching on Google I found discussions about crawlers that honor robots.txt and crawlers that do not. How can I block the crawlers that ignore robots.txt? Can I do it with .htaccess, or some other way?
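For reference, robots.txt is only a request: well-behaved crawlers read it and stay out of the listed paths, but nothing enforces it. A minimal sketch (the bot name and paths are placeholders):

```text
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
```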





2 Comments


@Jessie594

Simple: ban them all, with PHP and a regular expression. For example:

// Case-insensitive match (the /i flag) against the User-Agent header.
if (preg_match('/badbot1|badbot2|badbot3/i', $_SERVER['HTTP_USER_AGENT'] ?? '')) {
    // Matched a known bad bot: refuse the request and stop.
    header('HTTP/1.1 403 Forbidden');
    exit();
}


The header() call is optional; the script stops either way, but sending 403 Forbidden makes the refusal explicit to the client.

Be careful never to leave a leading or trailing pipe "|" in the pattern. An empty alternative matches every string, so you would ban all of your traffic.
So, use "badbot1|badbot2|badbot3".

Never "|badbot1|badbot2|badbot3" and
never "badbot1|badbot2|badbot3|".
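Since the question mentions .htaccess, the same user-agent check can be done at the Apache level with mod_rewrite, before PHP ever runs. A sketch, assuming mod_rewrite is enabled; the bot names are placeholders, and note there is no trailing pipe here either:

```apache
RewriteEngine On
# Case-insensitive ([NC]) match against the User-Agent header.
RewriteCond %{HTTP_USER_AGENT} (badbot1|badbot2|badbot3) [NC]
# Refuse the request with 403 Forbidden.
RewriteRule .* - [F,L]
```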

Good luck



 

@Jessie594

If crawlers are not following your robots.txt rules, you will need to ban them by IP address. Listing their user agents in robots.txt does nothing if they ignore the file.
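On Apache 2.4, an IP ban can go in .htaccess with the Require directive of mod_authz_core. A sketch with placeholder addresses:

```apache
<RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except the offending crawler's addresses (placeholders).
    Require not ip 192.0.2.10
    Require not ip 203.0.113.0/24
</RequireAll>
```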


