Why do Google search results include pages disallowed in robots.txt?

@Michele947

Posted in: #GoogleSearch #RobotsTxt

I have some pages on my site that I want to keep search engines away from, so I disallowed them in my robots.txt file like this:

User-Agent: *
Disallow: /email


Yet I recently noticed that Google still sometimes returns links to those pages in their search results. Why does this happen, and how can I stop it?

Background:

Several years ago, I made a simple web site for a club that a relative of mine was involved in. They wanted e-mail links on their pages, so, to keep those addresses from ending up on spam lists, I made the links point to a simple redirector / address harvester trap script running on my own site instead of using direct mailto: links. The script returns either a 301 redirect to the actual mailto: URL or, if it detects a suspicious access pattern, a page containing lots of random fake e-mail addresses and links to more such pages. To keep legitimate search bots away from the trap, I set up the robots.txt rule shown above, disallowing the entire space of both the legitimate redirector links and the trap pages.
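
For illustration, here is a heavily simplified sketch of what the script does (this is not the actual code; the threshold, paths and addresses below are made up), written as a small Python WSGI app:

import random
import string
from wsgiref.simple_server import make_server

SUSPICIOUS_HITS = 20      # made-up threshold for a "suspicious access pattern"
hits_per_ip = {}          # naive per-client-IP request counter

def fake_address():
    return "".join(random.choices(string.ascii_lowercase, k=8)) + "@example.com"

def app(environ, start_response):
    ip = environ.get("REMOTE_ADDR", "?")
    hits_per_ip[ip] = hits_per_ip.get(ip, 0) + 1

    if hits_per_ip[ip] > SUSPICIOUS_HITS:
        # Harvester trap: a page full of fake addresses plus links to more trap pages.
        rows = []
        for _ in range(50):
            rows.append('<a href="mailto:%s">%s</a> <a href="/email/%d">more</a><br>'
                        % (fake_address(), fake_address(), random.randrange(10**6)))
        start_response("200 OK", [("Content-Type", "text/html")])
        return ["".join(rows).encode()]

    # Normal case: redirect to the real address (looked up from the request path
    # in the real script; hard-coded here).
    start_response("301 Moved Permanently",
                   [("Location", "mailto:club-contact@example.org")])
    return [b""]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()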

Just recently, however, one of the people in the club searched Google for their own name and was quite surprised when one of the results on the first page was a link to the redirector script, with a title consisting of their e-mail address followed by my name. Of course, they immediately e-mailed me and wanted to know how to get their address out of Google's index. I was quite surprised too, since I had no idea that Google would index such URLs at all, seemingly in violation of my robots.txt rule.

I did manage to submit a removal request to Google, and it seems to have worked, but I'd like to know why and how Google is circumventing my robots.txt like that and how to make sure that none of the disallowed pages will show up in their search results.

PS. I actually found a possible explanation and solution while preparing this question, which I'll post below, but I thought I'd ask anyway in case someone else runs into the same problem. Please do feel free to post your own answers. I'd also be interested to know whether other search engines do this too, and whether the same solutions work for them.

2 Comments

@Alves908

It is true that whilst this should prevent Google (and other well-behaved bots) from crawling these pages and reading their content, they can still show a URL-only link in the SERPs if the page is linked to from elsewhere: a bare result with no title or description, literally just the URL. Naturally, this type of result is usually omitted from the SERPs unless you explicitly search for it.

And as you mention in your answer, if you don't want the URL to appear in the SERPs at all, then you need to allow robots to crawl the page, but include a noindex meta tag.
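
For reference, that is the standard robots meta tag in the page's head:

<meta name="robots" content="noindex">

or, equivalently for crawlers, the X-Robots-Tag response header:

X-Robots-Tag: noindex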


@Michele947

It seems that Google deliberately includes URLs disallowed in robots.txt in their index if there are links to those URLs from other pages they've crawled. To quote their Webmaster Tools help pages:


"While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results."


Apparently, Google interprets a Disallow directive in robots.txt as a prohibition against crawling the page, not against indexing it. I suppose that's technically a valid interpretation, even if it does smack of rules lawyering to me.

In this interview article, Matt Cutts from Google gives a bit more background and does provide a reasonable-sounding explanation for why they do this:


"In the early days, lots of very popular websites didn't want to be crawled at all. For example, eBay and the New York Times did not allow any search engine, or at least not Google to crawl any pages from it. The Library of Congress had various sections that said you are not allowed to crawl with a search engine. And so, when someone came to Google and they typed in eBay, and we haven't crawled eBay, and we couldn't return eBay, we looked kind of suboptimal. So, the compromise that we decided to come up with was, we wouldn't crawl you from robots.txt, but we could return that URL reference that we saw."


The solution recommended on both of those pages is to add a noindex meta tag to the pages you don't want indexed. (The X-Robots-Tag HTTP header should also work for non-HTML pages. I'm not sure if it works on redirects, though.) Paradoxically, this means that you have to allow Googlebot to crawl those pages (either by removing them from robots.txt entirely, or by adding a separate, more permissive set of rules for Googlebot), since otherwise it can't see the meta tag in the first place.
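
For example, with the redirector living under /email as in the rule above, the robots.txt could give Googlebot its own, more permissive group while keeping the blanket rule for everyone else:

User-Agent: Googlebot
Disallow:

User-Agent: *
Disallow: /email

A crawler only obeys the most specific group that matches its user agent, so Googlebot follows the empty Disallow (i.e. crawl everything) and ignores the * group, which lets it fetch the pages and see the noindex tag.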

I've edited my redirect / spider trap script to send both the meta tag and the X-Robots-Tag header with the value noindex,nofollow and allowed Googlebot to crawl the script's URL in my robots.txt. We'll see if it works once Google re-indexes my site.
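
In terms of the sketch above (hypothetical names again), the edit amounts to attaching the header to every response the script sends and the meta tag to any HTML it generates, roughly:

NOINDEX_HEADER = ("X-Robots-Tag", "noindex,nofollow")
NOINDEX_META = '<meta name="robots" content="noindex,nofollow">'

def trap_page(body_html):
    # Wrap any generated trap content so the meta tag is always present.
    return ("<html><head>%s</head><body>%s</body></html>"
            % (NOINDEX_META, body_html)).encode()

# ...and in the handler, attach the header to both kinds of response:
#   start_response("200 OK", [("Content-Type", "text/html"), NOINDEX_HEADER])
#   start_response("301 Moved Permanently",
#                  [("Location", "mailto:club-contact@example.org"), NOINDEX_HEADER])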
