
Does Google ignore robots.txt?

@Caterina187

Posted in: #Google #Googlebot #RobotsTxt #WebCrawlers

I know that w3.org/TR/html4/appendix/notes.html#h-B.4.1.1 says spiders always check robots.txt before fetching a page. However, I was recently told that Google crawls every single URL it can find on a site and only then looks at the robots.txt file and filters out what is disallowed. Is this true?


4 Comments


 

@Bryan171

robots.txt is an instruction, not a compulsion. Google will often index a page that you have blocked in robots.txt, especially if links point to the blocked page, even if that page has a noindex tag and the links have nofollow attributes.

Matt Cutts explained this in one of his official videos, giving the example of the eBay and White House (.gov) websites. A few years back they blocked search engines, but because of the large number of requests Google had to crawl and index the websites. It is now normal practice for Google.
I think the video I am talking about is the one below. www.mattcutts.com/blog/robots-txt-remove-url/
If you want to block Google, try .htaccess rules or password protection.



 

@Karen161

Google will still see sites blocked by robots.txt, and may even list them in search results.

This is especially the case when entire domains or subdomains are blocked. Google will list links to these along with the text "A description for this result is not available because of this site's robots.txt – learn more", with a link to support.google.com/webmasters/answer/156449.



Google tells us that while it won't crawl or index the content of pages blocked by robots.txt, it may still index the URLs if it finds links to them elsewhere. They also give this helpful advice:


To entirely prevent a page's contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag or x-robots-tag. As long as Googlebot fetches the page, it will see the noindex meta tag and prevent that page from showing up in the web index. The x-robots-tag HTTP header is particularly useful if you wish to limit indexing of non-HTML files like graphics or other kinds of documents.


So if you really don't want your pages indexed then make sure to use a META tag or HTTP header. I've found <meta name="robots" content="noindex, nofollow"> particularly helpful for back-end admin areas and control panels when I don't trust Disallow: /admin to be good enough.
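For non-HTML responses, or when editing templates is awkward, the same directive can be sent as an HTTP header. Here is a minimal sketch, assuming a Python WSGI application; the names noindex_middleware, demo_app and the /admin prefix are hypothetical and only for illustration. Keep in mind that a crawler only sees this header if it is allowed to fetch the URL in the first place.

```python
def noindex_middleware(app, protected_prefix="/admin"):
    """Wrap a WSGI app so responses under protected_prefix carry X-Robots-Tag.

    protected_prefix is an assumed example path, not something from the thread.
    """
    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def custom_start_response(status, headers, exc_info=None):
            if path.startswith(protected_prefix):
                # Drop any existing X-Robots-Tag so the header is not duplicated.
                headers = [(k, v) for k, v in headers
                           if k.lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", "noindex, nofollow"))
            return start_response(status, headers, exc_info)

        return app(environ, custom_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Tiny placeholder application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


application = noindex_middleware(demo_app)
```

Any WSGI server could serve application; requests outside /admin pass through with their headers unchanged.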



 

@Eichhorn148

Google does not ignore robots.txt. If you were to find Googlebot crawling a page blocked by robots.txt you should report it to Google in their "crawling, indexing, and ranking" product forum.

There are some cases in which it may look like Googlebot disobeys robots.txt:


The robots.txt file was recently updated -- Googlebot may only fetch it once a day.
A robot claims to be Googlebot but is not actually run by Google -- How to verify Googlebot
There is an error in your robots.txt file -- Test it in Google Webmaster Tools
A page is listed in search results even when blocked -- Google may list pages that are blocked in robots.txt when there are several external links to them. When this happens, Googlebot does not crawl the page, but rather uses third-party information (such as link anchor text) to determine what the page is about.
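The "verify Googlebot" case above is usually handled with a reverse DNS lookup followed by a forward lookup. The sketch below is one way to do that in Python, not an official implementation. The function name is hypothetical, the accepted hostname suffixes are an assumption that should be checked against Google's current documentation, and the default forward lookup covers IPv4 only.

```python
import socket

# Assumed suffixes for genuine Googlebot hostnames; verify against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip.

    The lookup functions can be injected (for example in tests); by default
    they use the system resolver.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses).
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must return the original address, otherwise
        # the reverse DNS record could simply be forged.
        return ip in forward_lookup(host)
    except OSError:
        # Covers socket.herror and socket.gaierror (failed lookups).
        return False
```

Checking the User-Agent string alone is not enough, because any client can send "Googlebot" in that header.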


While Google is good at following robots.txt, not all web crawlers are as friendly. It is not uncommon to see other, less well-mannered robots crawling blocked pages.
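For comparison, this is roughly what a well-mannered crawler does before requesting a URL. It is a minimal sketch using Python's standard urllib.robotparser; the rules, the ExampleBot user agent and the example.com URLs are made up for illustration. The same approach can be used to sanity-check your own robots.txt rules offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the parsed rules allow user_agent to request url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(may_fetch(parser, "ExampleBot", "https://example.com/admin/login"))
print(may_fetch(parser, "ExampleBot", "https://example.com/about"))
```

A crawler that skips this check can still request /admin, which is why robots.txt should not be treated as access control.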



 

@Angela700

Google may index the URL, but not the contents, of a page that is restricted by robots.txt or a robots meta directive. That holds provided that no other page on the web links to the same destination without a nofollow link relationship.

You can read more on how Google listens to robots here.


