
Robots.txt vs Sitemap -- Who wins in a Conflict

@Holmes151

Posted in: #RobotsTxt #XmlSitemap

If I block off the directory /foo in robots.txt, but my XML sitemap contains URLs under /foo, will the URLs in the sitemap get picked up by Google and other search engines? In other words, does the sitemap trump robots.txt? I think so, but am not sure.


4 Comments


@Carla537

In Google Webmaster Tools, this shows up as an error on your XML sitemap, to the effect that the sitemap contains a link that your robots.txt file prevents from being crawled.
Google gives robots.txt precedence over the sitemap.
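
A conflict like this can be spotted before a search engine reports it. The following is a minimal sketch using only the Python standard library; the robots.txt rules, the example.com URLs, and the helper name blocked_sitemap_urls are all made up for illustration, and the standard library parser only approximates how any particular search engine interprets robots.txt.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt matching the question: /foo is blocked for everyone.
ROBOTS_TXT = """\
User-agent: *
Disallow: /foo
"""

# Hypothetical sitemap that (mistakenly) lists a URL under /foo.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/foo/page.html</loc></url>
  <url><loc>https://example.com/bar/page.html</loc></url>
</urlset>
"""

# Namespace used by the sitemap protocol's <urlset>, <url> and <loc> elements.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="*"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

    # A URL is in conflict if it is listed in the sitemap but may not be fetched.
    return [url for url in urls if not rules.can_fetch(user_agent, url)]


if __name__ == "__main__":
    for url in blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML):
        print("listed in sitemap but disallowed by robots.txt:", url)
```

Any URL this prints should either be removed from the sitemap or unblocked in robots.txt, depending on which of the two reflects what you actually want.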


@Michele947

No Robots Exclusion Protocol compliant search engine may crawl any URL disallowed in robots.txt, no matter where else it might be listed.

However, Google doesn't necessarily have to crawl your URLs in order to index them. If they believe they have sufficient evidence that there actually is a page at that URL (and a sitemap listing very likely counts as such evidence) then they may simply decide to add the URL to their index without any content. To quote Google's Webmaster Tools help pages:


"While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results."


Such pages can turn up as search results e.g. for words included in the URL itself, or for words used in links pointing to the page.

Thus, if you both list a page in a sitemap and disallow it in robots.txt, it's likely that Google will index the URL of that page — but not its content.


@Angela700

Itai's answer is correct, so there is nothing major to add to it, but in reply to your specific question...

A sitemap cannot trump robots.txt, because a sitemap provides no instructions or directives to crawlers; the two aren't even comparable. If you have instructed robots not to visit /foo, then any bot that obeys your robots directives will simply not visit that directory, regardless of how it discovered the URLs (sitemap or otherwise).
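
To make the "regardless of how it discovered the URLs" point concrete, here is a rough sketch of how a well-behaved crawler might gate its requests. The bot name ExampleBot, the frontier list, and the may_request helper are hypothetical; real crawlers are far more involved, but the idea is that the robots.txt check never looks at where a URL came from.

```python
from urllib.robotparser import RobotFileParser

# Rules from the question: everything under /foo is off limits.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /foo"])

# (url, where the hypothetical crawler discovered it)
frontier = [
    ("https://example.com/foo/a.html", "sitemap"),
    ("https://example.com/foo/a.html", "inbound link"),
    ("https://example.com/bar/b.html", "sitemap"),
]


def may_request(url, user_agent="ExampleBot"):
    """Single gate for every request; the discovery source is not an input."""
    return rules.can_fetch(user_agent, url)


decisions = [(url, source, may_request(url)) for url, source in frontier]

if __name__ == "__main__":
    for url, source, allowed in decisions:
        verdict = "fetch" if allowed else "skip"
        print(f"{verdict}: {url} (found via {source})")
```

The /foo URL is skipped both times, whether it was found in the sitemap or through a link, while the /bar URL is fetched.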


@Frith620

Robots.txt defines what conforming bots are and are not allowed to request. Even if a particular link is present in a sitemap, a bot is not allowed to request it if robots.txt disallows it.

Remember that sitemaps are optional, and even when one is provided, crawlers may skip URLs listed in it and crawl URLs that are not. You can see this in Google Webmaster Tools, which shows that not all the URLs in a sitemap get crawled, and whether any of them are blocked by robots.txt.
