Why is Yahoo bot hitting a page when my robots.txt file is configured to disallow all bots?

@Yeniel560

Posted in: #RobotsTxt #WebCrawlers #Yahoo

My robots.txt:

User-agent: *
Disallow: /


A page two directories below the root is being hit by a Yahoo bot and getting a 404:

HTTP_REFERER: [empty string]
HTTP_USER_AGENT: Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; help.yahoo.com/help/us/ysearch/slurp) NOT Firefox/3.5
QUERY_STRING: [empty string]
REMOTE_ADDR: 98.137.206.112
REMOTE_HOST: 98.137.206.112
REMOTE_USER: [empty string]
REQUEST_METHOD: GET


How is this possible and how can I prevent this?

WHOIS for 98.137.206.112


2 Comments


@Eichhorn148

In addition to looking up the WHOIS record for the IP address, you can follow Yahoo's published procedure for verifying Slurp:



1. For each page view request, check the user-agent and IP address. All requests from Yahoo! Search utilize a user-agent starting with ‘Yahoo! Slurp.’
2. For each request from ‘Yahoo! Slurp’ user-agent, you can start with the IP address (i.e. 74.6.67.218) and use reverse DNS lookup to find out the registered name of the machine.
3. Once you have the host name (in this case, lj612134.crawl.yahoo.net), you can then check if it really is coming from Yahoo! Search. The name of all Yahoo! Search crawlers will end with ‘crawl.yahoo.net,’ so if the name doesn’t end with this, you know it’s not really our crawler.
4. Finally, you need to verify the name is accurate. In order to do this, you can use Forward DNS to see the IP address associated with the host name. This should match the IP address you used in Step 2. If it doesn’t, it means the name was fake.



Following that procedure:

$ host 98.137.206.112
112.206.137.98.in-addr.arpa domain name pointer h174.hlfs.bf1.yahoo.com.
$ ping h174.hlfs.bf1.yahoo.com
PING h174.hlfs.bf1.yahoo.com (98.137.206.112) 56(84) bytes of data.


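The forward lookup returns the same IP address, which verifies that Yahoo is indeed in control of that address and that the request to your website really does come from Yahoo. Note, however, that the host name ends in yahoo.com rather than crawl.yahoo.net, so by a strict reading of the quoted procedure it is a Yahoo machine but not necessarily one of the dedicated Slurp crawl hosts.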

Yahoo is generally a very well-behaved bot that follows robots.txt.
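
The reverse-then-forward DNS check quoted above can also be scripted. The following is a minimal Python sketch, not Yahoo's own tooling. The function name `verify_crawler_ip` and its `reverse_lookup` / `forward_lookup` parameters are hypothetical, added so the logic can be tried without network access. The `crawl.yahoo.net` suffix is the one named in the quoted procedure.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes=("crawl.yahoo.net",),
                      reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse + forward DNS check.

    The lookup parameters are hypothetical hooks for testing; by default
    the standard library resolver functions are used.
    """
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist)
        forward_lookup = lambda name: socket.gethostbyname_ex(name)[2]

    # Reverse DNS: IP address -> registered host name.
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False

    # The host name must be the allowed domain or a subdomain of it.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False

    # Forward DNS: the host name must resolve back to the same IP,
    # otherwise the reverse record could have been faked.
    try:
        addresses = forward_lookup(host)
    except OSError:
        return False
    return ip in addresses
```

Called with the default resolvers, for example `verify_crawler_ip("98.137.206.112")`, it performs real DNS queries, so the result depends on the current DNS records. With the default suffix, a host name such as h174.hlfs.bf1.yahoo.com would not pass the suffix test, because it ends in yahoo.com; passing `allowed_suffixes=("yahoo.com",)` would loosen the check to any Yahoo host.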


@Welton855

A robots.txt file tells crawlers how you would like them to behave, and most reputable crawlers try to comply, but it is purely advisory: nothing on your server forces a crawler to obey it.
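
To illustrate that the check happens on the crawler's side, here is a small sketch using Python's standard `urllib.robotparser`, which is roughly what a cooperative crawler does before fetching a URL. The helper name `is_allowed` and the URL `http://example.com/a/b/page.html` are made up for illustration; the rules are the ones from the question.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Ask a robots.txt rule set whether `user_agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# The robots.txt from the question: every agent, every path disallowed.
RULES = ["User-agent: *", "Disallow: /"]

if __name__ == "__main__":
    # A page two directories below the root (hypothetical URL).
    print(is_allowed(RULES, "Yahoo! Slurp", "http://example.com/a/b/page.html"))
```

A crawler that runs this check gets a negative answer and skips the request. A crawler that never runs it, or that is working from an older cached copy of the file, is not stopped by anything in robots.txt itself.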

Typically, if a crawler is not following your robots file, either it is a rude crawler, perhaps even one sending a user agent that masquerades as someone else, or it is a legitimate, reputable crawler that has not yet seen a recently updated robots file. In this case, it appears that the source IP really belongs to the agent indicated, and I would generally expect Yahoo! to follow robots directives.

So, without further information, I would guess that you recently updated robots.txt to block all agents and Yahoo! has not fetched your robots.txt file since that update. I would expect it to do so within a few hours or days and then begin to follow the instructions.

However, whether my guess is correct or not, if you want to force the blocking of crawlers, regardless of how well they behave, you should look into server-side methods such as .htaccess rules.
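
.htaccess is specific to Apache. As a general illustration of server-side enforcement, here is a sketch of the same idea as Python WSGI middleware. It assumes the site is served by a WSGI application, which nothing in the question states. The names `block_crawlers` and `BLOCKED_AGENT_SUBSTRINGS` are hypothetical.

```python
# Hypothetical list of user-agent substrings to refuse (lowercase).
BLOCKED_AGENT_SUBSTRINGS = ("yahoo! slurp",)


def block_crawlers(app, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Wrap a WSGI app so matching user agents receive 403 Forbidden."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Any other client is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware
```

Unlike robots.txt, this refuses the request whether or not the client cooperates. It still relies on the User-Agent header, which a client can set to anything, so it only stops crawlers that identify themselves honestly.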

Also note that, unless you have a specific reason, it is generally not recommended to block crawlers indiscriminately from a public-facing website, as bots like Google, Bing, or Yahoo may index your site and potentially send you a lot of traffic.
