The Sitemap Paradox
We use a sitemap on Stack Overflow, but I have mixed feelings about it.
Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.
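For reference, here is a minimal sketch of generating such a file in Python. The `build_sitemap` helper and the example.com URLs are made up for illustration; only the `urlset`/`url`/`loc`/`lastmod` element names and the namespace come from the Sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap XML string.

    entries: iterable of (loc, lastmod) pairs; lastmod may be None,
    in which case the optional <lastmod> element is left out.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    print(build_sitemap([
        ("https://example.com/questions/1", "2010-11-02"),
        ("https://example.com/questions/2", None),
    ]))
```

Leaving out metadata you cannot fill in truthfully is allowed by the protocol; only `loc` is required per URL.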
Based on our two years' experience with sitemaps, there's something fundamentally paradoxical about the sitemap:
Sitemaps are intended for sites that are hard to crawl properly.
If Google can't successfully crawl your site to find a link, but is able to find it in the sitemap, it gives the sitemap link no weight and will not index it!
That's the sitemap paradox -- if your site isn't being properly crawled (for whatever reason), using a sitemap will not help you!
Google goes out of their way to make no sitemap guarantees:
"We cannot make any predictions or guarantees about when or if your URLs will be crawled or added to our index" citation
"We don't guarantee that we'll crawl or index all of your URLs. For example, we won't crawl or index image URLs contained in your Sitemap." citation
"submitting a Sitemap doesn't guarantee that all pages of your site will be crawled or included in our search results" citation
Given that links found in sitemaps are merely recommendations, whereas links found on your own website proper are considered canonical ... it seems the only logical thing to do is avoid having a sitemap and make damn sure that Google and any other search engine can properly spider your site using the plain old standard web pages everyone else sees.
By the time you have done that, and are getting spidered nice and thoroughly so Google can see that your own site links to these pages, and would be willing to crawl the links -- uh, why do we need a sitemap, again? The sitemap can be actively harmful, because it distracts you from ensuring that search engine spiders are able to successfully crawl your whole site. "Oh, it doesn't matter if the crawler can see it, we'll just slap those links in the sitemap!" Reality is quite the opposite in our experience.
That seems more than a little ironic considering sitemaps were intended for sites that have a very deep collection of links or complex UI that may be hard to spider. In our experience, the sitemap does not help, because if Google can't find the link on your site proper, it won't index it from the sitemap anyway. We've seen this proven time and time again with Stack Overflow questions.
Am I wrong? Do sitemaps make sense, and we're somehow just using them incorrectly?
I recently restructured a site that I am still working on. Because there was no good way I could see to link 500,000 pages to help users, I decided to use an XML sitemap, submit it to Google, and use site search instead. Google had no problem indexing my site earlier; however, since adding the sitemap, Google has been very aggressive in spidering my site and indexing the pages extremely fast. Google has used the sitemap to find new pages (about 3300 per week) and revisit updated pages. It has been a real win in my book. I still want to figure out a new way to link my pages and use AJAX for look-up, but that is a project for another day. So far, so good! It has been a good solution for me. All in all, I have gained and not lost. Which is interesting, since I have always felt that sitemaps could actually be more useful but are limited by their design.
Disclaimer: I work together with the Sitemaps team at Google, so I'm somewhat biased :-).
In addition to using Sitemaps extensively for "non-web-index" content (images, videos, News, etc.) we use information from URLs included in Sitemaps files for these main purposes:
Discovering new and updated content (I guess this is the obvious one, and yes, we do pick up and index otherwise unlinked URLs from there too)
Recognizing preferred URLs for canonicalization (there are other ways to handle canonicalization too)
Providing a useful indexed URL count in Google Webmaster Tools (approximations from site:-queries are not usable as a metric)
Providing a basis for useful crawl errors (if a URL included in a Sitemap file has a crawl error, that's usually a bigger issue & shown separately in Webmaster Tools)
On the webmaster-side, I've also found Sitemaps files extremely useful:
If you use a crawler to create the Sitemaps file, then you can easily check that your site is crawlable and see first-hand what kind of URLs are found. Is the crawler finding your preferred URLs, or is something incorrectly configured? Is the crawler getting stuck in infinite spaces (eg endless calendar scripts) somewhere? Is your server able to handle the load?
How many pages does your site really have? If your Sitemap file is "clean" (no duplicates, etc), then that's easy to check.
Is your site really cleanly crawlable without running into duplicate content? Compare the server logs left behind by Googlebot with your Sitemaps file -- if Googlebot is crawling URLs that aren't in your Sitemap file, you might want to double-check your internal linking.
Is your server running into problems with your preferred URLs? Cross-checking your server error log with the Sitemaps URLs can be quite useful.
How many of your pages are really indexed? As mentioned above, this count is visible in Webmaster Tools.
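The log cross-check described above could be sketched roughly like this in Python. It assumes an access log in the common/combined format and a single plain sitemap file; the function names are invented for this example, and matching on the "Googlebot" substring is a simplification, since user agents can be spoofed.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Request part of a common/combined log line, e.g. "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def request_target(url):
    """Reduce a URL or request target to its path plus query string."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def sitemap_targets(sitemap_xml):
    """Collect path+query for every <loc> in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return {request_target(loc.text.strip())
            for loc in root.findall("sm:url/sm:loc", NS)}


def googlebot_targets(log_lines):
    """Collect path+query for every request whose log line mentions Googlebot."""
    targets = set()
    for line in log_lines:
        if "Googlebot" not in line:  # assumption: naive user-agent match
            continue
        match = REQUEST_RE.search(line)
        if match:
            targets.add(request_target(match.group(1)))
    return targets


def compare(sitemap_xml, log_lines):
    """Return URLs crawled but not listed, and listed but not crawled."""
    listed = sitemap_targets(sitemap_xml)
    crawled = googlebot_targets(log_lines)
    return {
        "crawled_not_listed": sorted(crawled - listed),
        "listed_not_crawled": sorted(listed - crawled),
    }
```

Query strings are kept on purpose: URLs that differ only by a parameter are a typical source of the duplicate content mentioned above, so they should show up under `crawled_not_listed`.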
Granted, for really small, static, easily crawlable sites, using Sitemaps may be unnecessary from Google's point of view once the site has been crawled and indexed. For anything else, I'd really recommend using them.
FWIW There are some misconceptions that I'd like to cover as well:
The Sitemap file isn't meant to "fix" crawlability issues. If your site can't be crawled, fix that first.
We don't use Sitemap files for ranking.
Using a Sitemap file won't reduce our normal crawling of your site. It's additional information, not a replacement for crawling. Similarly, not having a URL in a Sitemap file doesn't mean that it won't be indexed.
Don't fuss over the meta-data. If you can't provide useful values (eg for priority), leave them out & don't worry about that.
This was (first?) written about by Randfish over at SEOmoz back in the good old year of 2007. The first time around he came to the same types of conclusions, but then time did its thing... and passed.
He has since (Jan 2009) added a postscript to the article stating that any possible downsides are simply outweighed by the overall positive results of generating, verifying, and submitting sitemaps.
Update Jan. 5, 2009 - I've actually significantly changed my mind about this advice. Yes, sitemaps can still obfuscate architectural issues, but given the experience I've had over the last 1.5 years, I now recommend to all of our clients (and nearly everyone else who asks) that sitemaps be submitted. The positives in terms of crawling, indexation and traffic simply outweigh the downsides.
Sitemaps can save your ass.
On one of my sites, I have a large number of links that I prevent search engines from spidering. Long story short, Google was mis-interpreting JS in my forum and triggering lots of 500 and 403 response codes, which I believed were affecting the site's position. I worked around this by excluding the problematic URLs via robots.txt.
One day, I messed up and did something that prevented Google from crawling some pages on that site I really wanted indexed. Because of the forum exclusions, the Webmaster Tools error section for "Restricted by robots.txt" had over 4000 pages in it, so I would not have picked this error up until it was way too late.
Fortunately, because all of the "important" pages on my site are in sitemaps, I was able to quickly detect this problem in the special error category that Webmaster Tools has for problems with pages in sitemaps.
As an aside, I also get a lot of benefit from using a Sitemap Index to determine indexing quality of various sections of my sites, as mentioned by @AJ Kohn.
We use sitemaps (not submitted to search engines, but linked in robots.txt) mainly for making sure the homepage has the highest <priority>. I'm not sure whether they have much other use.
Jeff, I have no idea about Stack Overflow because I have never had the opportunity in my life to be a webmaster of such a huge and so frequently updated website.
For small websites that do not change frequently, I think sitemaps are quite useful (not saying that a sitemap is the most important thing, but quite useful, yes) for two reasons:
The site is crawled quickly (same reason explained in Joshak's answer above), and in my small experience I noticed this many times with small sites (up to 30/50 pages)
A few weeks after I submit a sitemap, I look in "Google Webmaster Tools - Sitemaps" and I can see the number of URLs submitted in the sitemap vs. the number of URLs in the web index. If I see that they are the same, then good. Otherwise I can check immediately which pages on my website are not getting indexed and why.
I heard that sitemaps put your pages into the supplemental index faster. But I haven't even heard the supplemental index mentioned in ages, so they may not be using it anymore.
P.S. in case my statement isn't clear enough, being in the supplemental index is (or was) a BAD thing...therefore a sitemap is (or was) BAD.
if you care about this topic, please read this great google paper googlewebmastercentral.blogspot.com/2009/04/research-study-of-sitemaps.html ( april 2009 ) - read the complete paper, not only the blogpost.
from the paper
ok, basically google struggled with the same question.
they do not disclose how they determine value within the sitemap, but they mention the concept of a virtual link from the start page to the sitemap.
lots of other interesting stuff
but yeah, the sitemap is mostly used for discovery (the process of google discovering your stuff), not for value determination. if you struggle with discovery, use a sitemap. discovery is a precondition to crawling, but does not touch value determination.
from my experience
there are a sh*tload of sites that just use HTML and XML sitemaps for interlinking of their pages
and of these, the XML sitemap is much much much better crawled than the HTML sitemap. (i took a really good look at some really big ones)
there are even very successful sites that just use an XML sitemap.
when i implement a SEO strategy for a site with more than half a million pages i go for
everything else is just "ballast" - yeah, other stuff might have positive SEO value, but definitely has a negative value: it makes the site harder to manage. (p.s.: for value determination i interlink the landingpages in a sensemaking way (big impact), but that's already the second step).
about your question: please do not confuse discovery, crawling, indexing and ranking. you can track all of them separately, and you can optimize all of them separately. and you can enhance discovery and crawling in a major way with a great (i.e.: real time) sitemap.
Sitemaps are incredibly valuable if you use them correctly.
First off, the fact that Google says they are hints is only there to a) ensure that webmasters aren't under the false impression that sitemap = indexation and b) give Google the ability to ignore certain sitemaps if they deem them to be unreliable (e.g. lastmod is the current date for all URLs each day they're accessed).
However, Google generally likes and consumes sitemaps (in fact they'll sometimes find their own and add them to Google Webmaster Tools). Why? It increases the efficiency with which they can crawl.
Instead of starting at a seed site and crawling the web, they can allocate an appropriate amount of their crawl budget to a site based on the submitted sitemaps. They can also build up a large history of your site with associated error data (500, 404 etc.)
"Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it."
What they don't say is that crawling the web is time-consuming and they'd prefer to have a cheat sheet (aka sitemap).
Sure, your site might be just fine from a crawl perspective, but if you want to introduce new content, dropping that content into a sitemap with a high priority is a quicker way to get crawled and indexed.
And this works for Google too, since they want to find, crawl and index new content - fast. Now, even if you don't think Google prefers the beaten path to the machete-in-the-jungle approach, there's another reason why sitemaps are valuable - tracking.
In particular, using a sitemap index (http://sitemaps.org/protocol.php#index) you can break your site down into sections - sitemap by sitemap. By doing so you can then look at the indexation rate of your site section by section.
One section or content type might have an 87% indexation rate while another could have a 46% indexation rate. It's then your job to figure out why.
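As a rough sketch of that section-by-section setup in Python: one helper writes a sitemap index pointing at per-section sitemaps, the other turns submitted/indexed counts into rates. The helper names, the section names, the example.com URLs and the counts are all made up here; the counts would have to be read off Webmaster Tools by hand or however you collect them.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Build a sitemap index XML string listing one sitemap per section."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = url
    return ET.tostring(index, encoding="unicode")


def indexation_rates(counts):
    """Map {section: (submitted, indexed)} to {section: percent indexed}.

    Sections with zero submitted URLs are skipped to avoid dividing by zero.
    """
    return {
        section: round(100 * indexed / submitted, 1)
        for section, (submitted, indexed) in counts.items()
        if submitted
    }


if __name__ == "__main__":
    # Hypothetical sections and counts, for illustration only.
    print(build_sitemap_index([
        "https://example.com/sitemap-questions.xml",
        "https://example.com/sitemap-tags.xml",
    ]))
    print(indexation_rates({"questions": (1000, 870), "tags": (500, 230)}))
```

Splitting by content type rather than by arbitrary chunks is what makes the rates comparable: a low number then points at one template or one kind of page.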
To get full use out of sitemaps you'll want to track Googlebot (and Bingbot) crawl on your site (via weblogs), match those to your sitemaps and then follow it all through to traffic.
Don't go to sleep on sitemaps - invest in them.
Let it crawl.
I do the following:
make the site crawlable in the old way.
make sure I do have a robots.txt with a Sitemap line in it.
make an XML sitemap, but do not submit it. Let the crawler discover it and use it as needed, as part of its discovery and indexing process.
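To check that the robots.txt really advertises the sitemap, something like the following should do. It is only a sketch: it relies on `RobotFileParser.site_maps()` from the Python standard library (available from Python 3.8), and the robots.txt content and example.com URL are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the Sitemap URLs declared in a robots.txt string (may be empty)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    # Hypothetical robots.txt content, for illustration only.
    robots_txt = (
        "User-agent: *\n"
        "Disallow: /forum/private/\n"
        "Sitemap: https://example.com/sitemap.xml\n"
    )
    print(sitemaps_from_robots(robots_txt))
```

The Sitemap line is independent of the User-agent groups, so it can sit anywhere in the file.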
I generate an extended XML file, which serves as a base for many things:
Generating the HTML sitemap
Helping the 404 (not found) page
Helping with other tiny tasks, like making breadcrumbs, or getting some metadata on my Facade pattern for a page.
Since I already have all this, why not also serve an XML sitemap and let the crawler do what it would like to do, if it would like to do it?
DO NOT USE SITEMAPS
Sitemaps are mainly for sites that do not timestamp indexes and nodes.... SE does both for its core content, so having a sitemap will slow a crawler down... Yes, that's right, it will slow it down, because the sitemap lacks the metadata that the core indexes have. On the flipside, I have no real idea how Google builds its bots, just know if I was going to bot SE, I would NOT use the sitemap. Plus, some sites don't even notice that their sitemaps are all %!@$ -- and if you've built a profile on a sitemap that's all of a sudden not working, you've got to create a new profile off the real site.
So, you're right -- DO NOT USE SITEMAPS!
TIP: One thing you should do though is keep the semantics of the tags the same over time as much as possible, meaning if "Asked One Hour Ago" has a metadata embed in it like:
title="2010-11-02 00:07:15Z" class="relativetime"
never change the string name relativetime, unless the meaning of the data in title has changed. NEVER... :-)
If you know you have good site architecture and Google would find your pages naturally, the only benefit I'm aware of is faster indexing; if your site is getting indexed fast enough for you, then there's no need.
Here's an article from 2009 where a gentleman tested how fast Google crawled his site with a sitemap and without. www.seomoz.org/blog/do-sitemaps-effect-crawlers
My rule of thumb: if you're launching something new and untested, you want to see how Google crawls your site to make sure there is nothing that needs to be fixed, so don't submit. However, if you're making changes and want Google to see them faster, do submit. The same goes if you have other time-sensitive information such as breaking news: submit, because you want to do whatever you can to make sure you're the first that Google sees. Otherwise it's a matter of preference.
I believe that search engines use the sitemap not so much to find pages, but to optimize how often they check them for updates. They look at <changefreq> and <lastmod>. Google probably spiders the entire website very often (check your logs!), but not all search engines have the resources to do that (Has anyone tried Blekko?). In any case, since there is no penalty for using them and they can be created automatically and easily, I'd keep doing it.
In Google's words: "In most cases, webmasters will benefit from Sitemap submission, and in no case will you be penalized for it."
But I agree that the best thing you can do if you want your website pages to appear in search engines is to make sure they are crawlable from the site proper.
I've not run into this myself, but the majority of my projects are applications or sites that otherwise require user accounts so indexing by search engines isn't a focus.
That said, I've heard before that SEO has basically rendered sitemaps useless. If you look at the protocol, it's sort of an "honor system" to tell how often a page changes and what the relative priority of each page is. It stands to reason that dime-a-dozen SEO firms misuse the fields - every page is top priority! every page changes hourly! - and rendered sitemaps effectively useless.
This article from 2008 says basically that and seems to come to the same conclusion that you do: the sitemap is pretty well useless and you would be better off optimizing the content to be indexed and ditching the sitemap.
I suspect: for Google, sitemaps are necessary to keep track of updates in the fastest way possible. E.g., let's say you have added new content to some deep location of your web site, which takes more than 10-20 clicks to reach from your home page. Google would be less likely to reach this new page in a short time - so instead, until a path to this page is fully determined, its existence is announced. After all, PageRank is not calculated immediately, it requires time to evaluate user behavior and such - so, until then, why shouldn't the engine crawl and index a page with fresh content?