
Robots.txt with only Disallow and Allow directives is not preventing crawling of disallowed resources

@Barnes591

Posted in: #RobotsTxt #WebCrawlers

I have a robots.txt file:

User-agent:*
Disallow:/path/page
Disallow:/path/
Allow:/


The disallowed path is still getting crawled.

I have searched for this problem, and from what I've read the order of precedence in Google does not matter. So technically the Disallow directives should work, but now I'm wondering whether it is because Allow:/ is overriding them?


2 Comments


@Jamie184

As already mentioned, the Allow: directive in this instance is superfluous. The default action is to allow all crawling, so explicitly stating Allow: / (i.e. allow all) is entirely redundant.

However, contrary to what has been suggested, neither would the Allow: / directive cause you any problems. The Allow: / directive will not "override" other Disallow: directives, because it is the least specific, regardless of the apparent order.


order of precedence in google does not matter.


Yes, sort of. You mean the "order of the directives does not matter". There is always an order of precedence (unless you are using "wildcards", in which case it is officially "undefined"). This is why the Allow: / directive does not override the more specific Disallow: directives above it. Google defines the order of precedence:


for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule.


And this is confirmed by Google's robots.txt Tester: when testing a disallowed path, e.g. /path/page, the Tester reports it as blocked.



This is at least how Googlebot and Bingbot work (the most specific path wins). However, some (old) bots reportedly use a "first match" rule. So, for greatest compatibility it is recommended to include any Allow: directives first. Reference: What's the proper way to handle Allow and Disallow in robots.txt?
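
To make the "most specific path wins" behaviour concrete, here is a minimal Python sketch of that precedence rule (an illustration only, not Google's actual implementation; it ignores wildcards and the tie-breaking that applies when an Allow and a Disallow path are equally long):

def is_allowed(url_path, rules):
    # rules is a list of (directive, path) pairs, e.g. [("allow", "/"), ("disallow", "/path/")]
    # robots.txt paths are prefixes, so keep every rule whose path prefixes the URL path
    matches = [(directive, path) for directive, path in rules if url_path.startswith(path)]
    if not matches:
        return True  # no rule matches: crawling is allowed by default
    # the most specific (longest) matching path wins, regardless of the order of the directives
    directive, _ = max(matches, key=lambda match: len(match[1]))
    return directive == "allow"

rules = [("disallow", "/path/page"), ("disallow", "/path/"), ("allow", "/")]
print(is_allowed("/path/page", rules))  # False - "/path/page" is more specific than "/"
print(is_allowed("/other", rules))      # True - only "Allow: /" matches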

Also, since robots.txt uses prefix matching, the Disallow: /path/page directive is also superfluous: Disallow: /path/ will block /path/page as well. So, in summary, your robots.txt file only needs the one Disallow directive; the others are simply superfluous but will not actually cause any harm:

User-agent: *
Disallow: /path/


White space before the path is entirely optional, although, as noted in the other answer, it is much more common to include it and it arguably makes the file more readable.
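
As a quick local check, Python's standard-library robots.txt parser (urllib.robotparser) applies the same prefix matching; example.com below is just a placeholder host:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /path/",
])
# the single Disallow: /path/ rule also blocks /path/page, by prefix
print(rp.can_fetch("*", "https://example.com/path/page"))  # False (blocked)
print(rp.can_fetch("*", "https://example.com/elsewhere"))  # True (allowed)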

The only time you would need an Allow: directive is if you need to make an exception and allow a URL that would otherwise be blocked by a Disallow: directive. For example, if you wanted to allow /path/foo in the above robots.txt file then you would need to explicitly include an Allow: /path/foo directive somewhere in the group.
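
For instance, using the hypothetical /path/foo from above (and listing the Allow exception first for compatibility with "first match" bots), the group would look like this:

User-agent: *
Allow: /path/foo
Disallow: /path/

Under the longest-match rule, /path/foo is allowed because Allow: /path/foo is more specific (longer) than Disallow: /path/, while everything else under /path/ remains blocked.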


The disallowed path is still getting crawled.


If this is still the case, then something else is going on...


Do you have any other directives in your robots.txt file? Test the URL in Google's robots.txt Tester.
When was the current robots.txt file implemented? Google only picks up changes to the robots.txt file every day or so. In Google Search Console (GSC) you can identify which version Googlebot is currently using.
As the other answer has already pointed out, robots.txt is only honoured by the "good bots". Many (bad) bots will simply ignore it and crawl your URLs regardless. You can check your access logs to see whether the "good" bots are still crawling these disallowed URLs (a rough sketch of such a check follows this list).
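
For example, a rough sketch of such a log check in Python, assuming a typical combined/common log format and a log file named access.log (both assumptions; adjust for your server):

# print Googlebot requests for URLs under the disallowed /path/ prefix
with open("access.log") as log:
    for line in log:
        if "Googlebot" in line and '"GET /path/' in line:
            print(line.rstrip())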


@BetL925

Putting your page through Google's robots.txt tester reveals two problems:


You have no User-agent line, so your rules don't apply to any crawler.
Once you put in the User-agent line, the Allow line overrides the Disallow.


The correct robots.txt file would be:

User-agent: *
Disallow: /path/page
Disallow: /path/


DO NOT use Allow:. Allowing crawling is the default. You only need to include the items you don't want crawled.

Include the User-agent: line to specify that the rules apply to all crawlers. Otherwise, they will apply to none.

I don't think that having a space after the colon actually matters, but all the examples I see have it.

I should also add that robots.txt is only for crawlers that choose to obey it. Search engine spiders like Googlebot should obey robots.txt; however, not all other spiders will.
