What's the proper way to handle Allow and Disallow in robots.txt?

@Kristi941

Posted in: #RobotsTxt

I run a fairly large-scale Web crawler. We try very hard to operate the crawler within accepted community standards, and that includes respecting robots.txt. We get very few complaints about the crawler, but when we do, the majority are about our handling of robots.txt. Most often the Webmaster has made a mistake in his robots.txt and we kindly point out the error. But periodically we run into grey areas that involve the handling of Allow and Disallow.

The original robots.txt specification doesn't cover Allow. I've seen other pages, some of which say that crawlers use a "first matching" rule, and others that don't specify one at all. That leads to some confusion. For example, Google's page about robots.txt used to have this example:

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html


Obviously, a "first matching" rule here wouldn't work because the crawler would see the Disallow and go away, never crawling the file that was specifically allowed.
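To make that concrete, here's a minimal sketch (not our production code) of a first-matching evaluator, using simple prefix matching. Given the record above, it returns False for /folder1/myfile.html, which is exactly the problem:

def first_match_allowed(rules, path):
    # rules: ordered list of ("allow" | "disallow", path_prefix) tuples
    # taken from the record that applies to our user-agent.
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"  # the first matching line decides
    return True  # nothing matched: access is allowed by default

rules = [("disallow", "/folder1/"), ("allow", "/folder1/myfile.html")]
print(first_match_allowed(rules, "/folder1/myfile.html"))  # False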

We're in the clear if we ignore all Allow lines, but then we might not crawl something that we're allowed to crawl. We'll miss things.

We've had great success by checking Allow first, and then checking Disallow, the idea being that Allow was intended to be more specific than Disallow. That's because, by default (i.e. in the absence of instructions to the contrary), all access is allowed. But then we run across something like this:

User-agent: *
Disallow: /norobots/
Allow: /


The intent here is obvious, but that Allow: / will cause a bot that checks Allow first to think it can crawl anything on the site.
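Sketched the same way, the Allow-first check amounts to this, and with the record above it reports True for any path under /norobots/ (the path below is just an example):

def allow_first_allowed(rules, path):
    # Consult every Allow line before any Disallow line, on the theory
    # that Allow is the more specific instruction.
    for kind, prefix in rules:
        if kind == "allow" and path.startswith(prefix):
            return True
    for kind, prefix in rules:
        if kind == "disallow" and path.startswith(prefix):
            return False
    return True  # allowed by default

rules = [("disallow", "/norobots/"), ("allow", "/")]
print(allow_first_allowed(rules, "/norobots/index.html"))  # True (wrong)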

Even that can be worked around in this case. We can compare the matching Allow with the matching Disallow and determine that we're not allowed to crawl anything in /norobots/. But that breaks down in the face of wildcards:

User-agent: *
Disallow: /norobots/
Allow: /*.html$


The question, then: is the bot allowed to crawl /norobots/index.html?

The "first matching" rule eliminates all ambiguity, but I often see sites that show something like the old Google example, putting the more specific Allow after the Disallow. That syntax requires more processing by the bot and leads to ambiguities that can't be resolved.

My question, then, is what's the right way to do things? What do Webmasters expect from a well-behaved bot when it comes to robots.txt handling?


2 Comments


@Kaufman445

One very important note: the Allow statement should come before the Disallow statement, no matter how specific your statements are. So in your third example, no, the bots won't crawl /norobots/index.html.

Generally, as a personal rule, I put allow statements first and then I list the disallowed pages and folders.
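For example, a record laid out that way (the paths are just placeholders):

User-agent: *
Allow: /public/
Disallow: /private/
Disallow: /tmp/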


@Pope3001725

Here's my take on what I see in those three examples.

Example 1
I would ignore the entire /folder1/ directory except the myfile.html file. Since they explicitly allow it, I would assume it was simply easier to block the entire directory and explicitly allow that one file than to list every file they wanted blocked. If that directory contained a lot of files and subdirectories, that robots.txt file could get unwieldy fast.

Example 2
I would assume the /norobots/ directory is off limits and everything else is available to be crawled. I read this as "crawl everything except the /norobots/ directory".

Example 3
Similar to example 2, I would assume the /norobots/ directory is off limits and all .html files not in that directory are available to be crawled. I read this as "crawl all .html files but do not crawl any content in the /norobots/ directory".

Hopefully your bot's user-agent string contains a URL where webmasters can find more information about your crawling habits, make removal requests, or give you feedback about how they want their robots.txt interpreted.
