
Why did Google stop indexing pages from our sitemap.xml?

@Pierce454

Posted in: #Google #Sitemap

We're seeing some pages that exist in our sitemap.xml but are inexplicably missing from Google's public search index.

You can't download superuser.com/sitemap.xml -- we protect this file because there have been issues with it in the past -- but googlebot can. We have verified via Google Webmaster Tools that the sitemap.xml file was pulled down today and is rated OK with no errors (green checkmark).



The sitemap.xml contains a list of the last 50,000 questions asked on our site. For example, this question ...
superuser.com/questions/201610/how-to-see-the-end-of-a-long-chain-of-symbolic-links
... exists in the sitemap.xml as ...

<url>
<loc>https://superuser.com/questions/201610/how-to-see-the-end-of-a-long-chain-of-symbolic-links</loc>
<lastmod>2010-10-20</lastmod>
<changefreq>daily</changefreq>
<priority>0.2</priority>
</url>


Searching for "How to see the end of a long chain of symbolic links" gives only one result, from questionhub.com, which is scraping our data (a whole different problem).

You can increment the question number and do an exact search for the question title, and you will see this pattern persist.

These URLs are in sitemap.xml but they are not showing up in Google's index -- and yet they show up on sites that scrape our Creative Commons data. Why would that be?
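
(For anyone checking their own site the same way, here is a rough Python sketch that confirms a given URL really is listed in a sitemap file. It assumes you have a local copy of the sitemap, since ours isn't publicly downloadable; the filename and target URL are just the examples from above.)

# Rough sketch: confirm that a given URL is actually listed in a sitemap file.
# Assumes a local copy of the sitemap (the public superuser.com file is blocked).
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(path):
    # Yield every <loc> value in the sitemap.
    tree = ET.parse(path)
    for loc in tree.getroot().findall(".//sm:url/sm:loc", NS):
        yield loc.text.strip()

target = ("https://superuser.com/questions/201610/"
          "how-to-see-the-end-of-a-long-chain-of-symbolic-links")
urls = set(sitemap_urls("sitemap.xml"))
print(len(urls), target in urls)  # expect roughly 50000 and True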


6 Comments


 

@Voss4911412

It looks like Google was having some technical crawl problems this week that sound remarkably like what we were experiencing:
searchengineland.com/is-google-broken-sites-big-small-seeing-indexing-problems-53701

No one seems to be immune from a Google indexing problem that has many site owners baffled. Blogs and websites, big and small, aren’t being indexed as quickly as they normally are — if they’re being indexed at all.

...

John from Google replied to the thread in the Webmaster forums saying:


Just to be clear, the issues from this thread, which I have reviewed in detail, are not due to changes in our policies or changes in our algorithms; they are due to a technical issue on our side that will be visibly resolved as soon as possible (it may take up to a few days to be visible for all sites though)



 

@Lengel546

With this type of thing there are a lot of potential answers.

I'd start by asking how many pages you actually have. (You submitted 50,000 URLs, but a quick site:superuser.com search shows 125,000 indexed. Do you really have only 50K URLs and are submitting all of them, with Google finding 2-3 copies of each page? Or do you have 1 million URLs and only 12.5% of them are getting indexed?) Getting the big picture helps direct where to look for issues.
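
To make that concrete, here is a quick back-of-the-envelope sketch in Python; the counts are just the hypothetical figures above, not real measurements:

# Back-of-the-envelope indexation check; the numbers are hypothetical examples.
submitted = 50_000    # URLs listed in the sitemap
indexed = 125_000     # approximate count from a site:superuser.com query

if indexed > submitted:
    # More indexed than submitted suggests duplicate copies of each page.
    print(f"~{indexed / submitted:.1f} indexed copies per submitted URL")
else:
    print(f"{indexed / submitted:.1%} of submitted URLs are indexed")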

If nothing seems wrong with step one, I'd move on to content. It looks like QH has a whole lot more content on their page and links out to many other "resources"; despite the fact that all their content is scraped, it's possible Google considers their page more useful, since they provide more resources/information to the user. If they are considered the authority and all your content is the same as theirs, it's possible Google won't index yours even though you are the original.

If you're convinced that is not the issue, build some high-quality links to it: blog this question on some popular employee blogs or ask some friends to blog about it; if you have SEO friends who run popular blogs, perhaps they'd write a case study about it, etc.

If you get a lot of strong links and it's still not getting indexed, look for reasons it might be penalized (in most cases this won't be the issue, but it never hurts to check).

If none of this works, then 9 times out of 10 it's a simple technical issue that's been overlooked (a robots exclusion or something similar).
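
As a sketch of that last check, something like this works with the standard library only; the URL is the example from the question, and the meta-tag test is just a crude string match, not a real HTML parse:

# Check the usual technical blockers for a single URL.
import urllib.request
import urllib.robotparser

url = ("https://superuser.com/questions/201610/"
       "how-to-see-the-end-of-a-long-chain-of-symbolic-links")

# 1. robots.txt: is Googlebot even allowed to fetch the URL?
rp = urllib.robotparser.RobotFileParser("https://superuser.com/robots.txt")
rp.read()
print("robots.txt allows Googlebot:", rp.can_fetch("Googlebot", url))

# 2. X-Robots-Tag header and meta robots tag: is indexing blocked?
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    print("X-Robots-Tag header:", resp.headers.get("X-Robots-Tag"))
    body = resp.read().decode("utf-8", errors="replace").lower()
print("looks like a meta noindex:",
      '<meta name="robots"' in body and "noindex" in body)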

If you still have no answer after going through all of this, ask Google and hope they give you one.



 

@Jamie184

The question was just asked yesterday - give googlebot a chance, you aren't the only site on the Internet that he has to crawl ya know :)

If questions are normally indexed within a day or so, and a week goes by and that one still isn't indexed, then I might be concerned. But certainly not after 1 day.



 

@XinRu657

I think Google might be having a hard time indexing your web pages; 50,000 is a lot. So my suggestion would be to break your sitemap down into pieces, like so:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>


If you break it down, you will have better luck getting those 50,000 URLs indexed.

Sitemaps.org explanation of the issue


You can provide multiple Sitemap files, but each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 10MB (10,485,760 bytes). If you would like, you may compress your Sitemap files using gzip to reduce your bandwidth requirement; however the sitemap file once uncompressed must be no larger than 10MB. If you want to list more than 50,000 URLs, you must create multiple Sitemap files.

If you do provide multiple Sitemaps, you should then list each Sitemap file in a Sitemap index file. Sitemap index files may not list more than 50,000 Sitemaps and must be no larger than 10MB (10,485,760 bytes) and can be compressed. You can have more than one Sitemap index file. The XML format of a Sitemap index file is very similar to the XML format of a Sitemap file.

sitemaps.org/protocol.php
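
Here is a rough Python sketch of doing that split. The filenames and the base URL are made up, and it only writes <loc> entries, so you would still carry over lastmod / changefreq / priority yourself:

# Split a large URL list into several sitemap files plus a sitemap index,
# per the sitemaps.org limits quoted above (50,000 URLs per file).
from datetime import date
from xml.sax.saxutils import escape

BASE = "http://www.example.com"   # made-up host; use your own
CHUNK = 50_000                    # sitemaps.org limit per file

def write_sitemaps(urls):
    names = []
    for i in range(0, len(urls), CHUNK):
        name = f"sitemap{i // CHUNK + 1}.xml"
        with open(name, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for u in urls[i:i + CHUNK]:
                f.write(f"   <url><loc>{escape(u)}</loc></url>\n")
            f.write('</urlset>\n')
        names.append(name)
    # Index file pointing at each chunk.
    with open("sitemap_index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in names:
            f.write(f"   <sitemap><loc>{BASE}/{name}</loc>"
                    f"<lastmod>{date.today().isoformat()}</lastmod></sitemap>\n")
        f.write('</sitemapindex>\n')

# Usage: write_sitemaps(list_of_all_question_urls)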



 

@Phylliss660

It appears that Google is stating that 46,514 submitted links are in the index. Could it be an issue with (I hate to say it) page ranking? The scraping sites may be doing a better job of cross-linking, etc., and getting ranked higher. Just a thought.

Google also appears to be fetching your sitemap.xml correctly, although this search -- site:superuser.com "How to see the end of a long chain of symbolic links" -- does not return the expected results.



 

@Chiappetta492

Google doesn't make any offer or guarantee that pages in a sitemap will be indexed.

My experience has been that a page has to be linked-to (from a page of some authority) to show up. Is that page/question linked to directly/indirectly from a page with some authority?

E.g. if the superuser.com homepage (which presumably has many inlinks) linked directly to this question, or linked to it indirectly through a number of other pages, then you could expect it to be indexed.
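
A quick way to check the direct case (one hop only -- an indirect path would need an actual crawl) is something like this Python sketch, using only the standard library:

# Does the homepage link directly to the question URL?
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collects every href on a page.
    def __init__(self):
        super().__init__()
        self.links = set()
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)

home = "https://superuser.com/"
target = "/questions/201610/how-to-see-the-end-of-a-long-chain-of-symbolic-links"

req = urllib.request.Request(home, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")
collector = LinkCollector()
collector.feed(html)
print("homepage links directly to the question:",
      any(target in href for href in collector.links))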

From google:


Google doesn't guarantee that we'll crawl or index all of your URLs. However, we use the data in your Sitemap to learn about your site's structure, which will allow us to improve our crawler schedule and do a better job crawling your site in the future. In most cases, webmasters will benefit from Sitemap submission, and in no case will you be penalized for it.

www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156184


