
URLs with 'Noindex' in robots.txt are being indexed by Google

@Speyer207

Posted in: #GoogleSearch #Noindex #RobotsTxt #Serps

In my robots.txt file (http://www.tutorvista.com/robots.txt), I'm using Noindex: /content/... to disallow indexing:



This should mean that www.tutorvista.com/content/ and anything below this URL shouldn't be indexed. But in the image of my search results below, you can see that pages under this URL are being indexed:



Additionally, I'm using Disallow: /biology/ which means that www.tutorvista.com/biology/ and anything below this shouldn't be crawled. But in the image of my search results, you can see that pages under this URL are being crawled and indexed.



So can anyone tell me what's wrong with my robots.txt directives?


2 Comments


 

@Ann8826881

Note that Noindex is not part of the original robots.txt specification. Google supported it as an experimental feature (see: How does “Noindex:” in robots.txt work?), but it’s not clear whether that is still the case (as they didn’t document it to begin with). But let’s assume it is.

Your robots.txt has two problems.

Empty lines

A record must not contain empty lines. Empty lines are used to separate records.

A conforming bot (which doesn’t identify as Googlebot-Image/Adsbot-Google/Mediapartners-Google) uses this record:

User-agent: *
Allow: /


So none of the following Disallow/Allow/Noindex lines apply.

Of course a bot may try to "fix" this and interpret the following lines to be part of this record (i.e., ignoring the blank lines), but the robots.txt spec doesn’t define this, so I wouldn’t count on it.
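To illustrate the record boundary, here is a minimal sketch of how a strict parser could split a robots.txt file into records at blank lines. The function name split_records and the sample file are hypothetical (the sample only mimics the situation described above, it is not a copy of your file), and real crawlers may be more lenient.

```python
def split_records(robots_txt: str) -> list:
    """Split robots.txt text into records, using blank lines as separators."""
    records, current = [], []
    for raw in robots_txt.splitlines():
        stripped = raw.strip()
        if stripped.startswith("#"):
            continue  # comment-only lines do not end a record
        line = stripped.split("#", 1)[0].strip()  # drop trailing comments
        if line:
            current.append(line)
        elif current:
            # A blank line closes the record collected so far.
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records


# Hypothetical sample: a blank line follows "Allow: /".
sample = """User-agent: *
Allow: /

Disallow: /biology/
Noindex: /content/
"""

records = split_records(sample)
first_record = records[0]
for number, record in enumerate(records, start=1):
    print(number, record)
```

Under this strict reading, the second group of lines has no User-agent line of its own, so it does not belong to the record that applies to the bot.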

... in Noindex values

If Noindex works like Disallow (which we don’t know for sure, as Noindex is not specified/documented, but I guess it wouldn’t make sense to specify it differently), the "..." you appended to the values means that a literal "..." must appear in the URLs you want to noindex.

The line

Noindex: /content/biology/...


would apply to a URL like /content/biology/.../foobar, but not to a URL like /content/biology/foobar or /content/biology/.

So if you want every URL whose path starts with /content/biology/ to be noindexed, you would have to specify:

Noindex: /content/biology/
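The difference can be checked with a small sketch. It assumes that Noindex values are matched like Disallow values in the original robots.txt spec, i.e. as a plain prefix of the URL path, and it ignores wildcard extensions that some crawlers support. The helper name rule_applies is hypothetical.

```python
def rule_applies(rule_path: str, url_path: str) -> bool:
    """Return True if url_path starts with rule_path (plain prefix match)."""
    return url_path.startswith(rule_path)


paths = [
    "/content/biology/.../foobar",
    "/content/biology/foobar",
    "/content/biology/",
]

for rule in ("/content/biology/...", "/content/biology/"):
    for path in paths:
        # Show which paths each rule value would cover.
        print(f"{rule!r} applies to {path!r}: {rule_applies(rule, path)}")
```

Only the value without the trailing "..." covers all three paths.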



 

@Yeniel560

"noindex" directives should not be used in your robots.txt file. Instead, add a noindex meta tag to any page that you don't want indexed in Google.

A noindex tag looks like the one below, and it should be placed in the <head> section of any page you do not want indexed:

<meta name="robots" content="noindex">


More information can be found here.
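If you want to verify that a page actually carries the tag, a rough check can be written with Python's standard html.parser module. This is only a sketch: the class and function names are hypothetical, and it looks at the robots meta tag only, not at HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        # The content attribute may list several comma-separated values.
        values = [value.strip() for value in content.split(",")]
        if name == "robots" and "noindex" in values:
            self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(has_noindex(page))
```

Keep in mind that a crawler can only see this tag if it is allowed to fetch the page, so a page blocked with Disallow will not have its meta tag read.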

In the second example, while you do have "Disallow: /biology/" in your robots.txt file, a few lines above it you also have "Allow: /biology/animations/", which is why this page is indexed in your example.
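The interaction of the two lines can be tried out with Python's standard urllib.robotparser. The rules below are a reduced, hypothetical excerpt containing just the two lines quoted above, and the page paths are made up for illustration. Python's parser applies the first matching rule, whereas Google documents that the most specific (longest) matching rule wins; for these two lines both approaches should give the same answer.

```python
from urllib.robotparser import RobotFileParser

# Reduced excerpt: only the two rules discussed above.
rules = """User-agent: *
Allow: /biology/animations/
Disallow: /biology/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path: str) -> bool:
    """Return True if a generic bot may fetch the given path."""
    return parser.can_fetch("*", "http://www.tutorvista.com" + path)


# Hypothetical page paths, chosen only to exercise each rule.
for path in ("/biology/animations/example.html", "/biology/example.html"):
    print(path, allowed(path))
```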

Hope this helps!
