
Google Search Console reports warnings for URLs that have been removed from the sitemap for two weeks

@Sue5673885

Posted in: #Sitemap #XmlSitemap

We have a sitemap that is regenerated daily based on the records in our database. There are about 55 million records, and each record is accessible as a separate page. However, records are sometimes deleted, and after 1 to 2 weeks Google Search Console complains, for a couple (but not all) of the deleted items, that their URLs return a 404. This is shown as a warning under the console's Sitemap errors section, with a link to the sitemap where the URL used to be but has since been removed.
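For scale: at the sitemaps.org limit of 50,000 URLs per file, 55 million records works out to roughly 1,100 sitemap files plus an index. Below is a minimal sketch of one way such a daily regeneration could be done; the /record/<id> URL pattern and fetch_active_record_ids() are illustrative placeholders, not a description of the real setup.

# Illustrative sketch only: split active record IDs into sitemap files of at
# most 50,000 URLs (the sitemaps.org limit) and write a sitemap index.
# fetch_active_record_ids() and the /record/<id> URL pattern are placeholders.
from datetime import date

SITEMAP_URL_LIMIT = 50_000
BASE = "https://www.example.com"

def fetch_active_record_ids():
    # Placeholder for the real database query. Deleted records are simply
    # not returned, so their URLs drop out of the next day's sitemap.
    return range(1, 150_001)

def write_sitemap_file(file_no, urls):
    name = f"sitemap-{file_no}.xml"
    with open(name, "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        f.writelines(f"  <url><loc>{u}</loc></url>\n" for u in urls)
        f.write("</urlset>\n")
    return name

def write_sitemaps():
    today = date.today().isoformat()
    index_entries = []
    chunk, file_no = [], 0
    for record_id in fetch_active_record_ids():
        chunk.append(f"{BASE}/record/{record_id}")
        if len(chunk) == SITEMAP_URL_LIMIT:
            file_no += 1
            index_entries.append((write_sitemap_file(file_no, chunk), today))
            chunk = []
    if chunk:
        file_no += 1
        index_entries.append((write_sitemap_file(file_no, chunk), today))
    with open("sitemap_index.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        f.writelines(
            f"  <sitemap><loc>{BASE}/{name}</loc><lastmod>{mod}</lastmod></sitemap>\n"
            for name, mod in index_entries
        )
        f.write("</sitemapindex>\n")

if __name__ == "__main__":
    write_sitemaps()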

I suspect the cause is that Google doesn't fetch the sitemap every day. It seems to cache it for a couple of weeks, despite our HTTP response headers:

Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Expires: 0
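
For illustration, a minimal sketch of an endpoint that sends those headers; Flask is assumed here purely for the example, since the question does not say what the site actually runs on.

# Illustrative only: serve a sitemap file with the no-cache headers shown
# above. Flask is an assumption made for the sake of the example.
from flask import Flask, send_file

app = Flask(__name__)

@app.route("/sitemap-<int:n>.xml")
def sitemap(n):
    resp = send_file(f"sitemap-{n}.xml", mimetype="application/xml")
    resp.headers["Cache-Control"] = "no-cache, no-store, max-age=0, must-revalidate"
    resp.headers["Expires"] = "0"
    return resp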


Google then complains when it checks cached URLs that have been removed both from the site and the latest version of the sitemap. Can someone confirm that Google always caches sitemaps?

Looking at the dates on which the pieces of our sitemap are processed, it seems the entire sitemap also takes about 2 weeks to work through. Is it possible to tell Google to fetch and use the latest version of a sitemap page daily?

I've read a similar question, but that question asks where Google gets its old URLs from. I know exactly where they come from (Google told me so). I understand the 404s are probably not a big deal, but I'd like to prevent them if possible.


1 Comment


@Megan663

It isn't that Google is caching the sitemap file itself. When Google downloads the sitemap, it parses it and adds the URLs to a database. It then decides whether it needs to crawl those URLs soon. Google also queries that database to show you the information in Search Console. It is Google's database that holds this out-of-date information.

There is no way to force Google to re-fetch sitemaps more often than it already does. When you have 1,000+ sitemap files, there is no way Google is going to fetch them all every day. Like most files on your website, Google is going to fetch them every couple of weeks. Only very high-PageRank pages (such as your home page) tend to get fetched more often. Even on a high-PageRank site, you have more important pages to spend links on than sitemap files in the hope of getting them downloaded more frequently.

My suggestion would be to return a "410 Gone" status for the removed pages rather than "404 Not Found". You may still see warnings about them, but you will then be able to differentiate intentional removals from unintentional problems.
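
A minimal sketch of that approach (Flask is assumed purely for illustration, and the in-memory sets stand in for real database lookups):

# Illustrative only: answer 410 for records that were deliberately removed
# and 404 for IDs that never existed. The in-memory sets stand in for real
# database lookups.
from flask import Flask, abort

app = Flask(__name__)

RECORDS = {1: "first record"}   # stand-in for the live records table
DELETED_IDS = {2, 3}            # stand-in for a deletions log or "deleted" flag

@app.route("/record/<int:record_id>")
def show_record(record_id):
    if record_id in RECORDS:
        return RECORDS[record_id]
    if record_id in DELETED_IDS:
        abort(410)   # Gone: the record was removed on purpose
    abort(404)       # Not Found: this ID never existed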

Google also treats the 410 status differently. It removes the page from the index immediately upon crawling, as opposed to a 404, where it gives the page a 24-hour grace period. Google also comes back to re-crawl 410 URLs much less frequently than URLs that return a 404.
