Accidentally submitted broken URLs in a sitemap. How can I stop Google crawling them?

@Murray155

Posted in: #GoogleSearchConsole #Seo

We recently added a sitemap to our site but didn't exclude unpublished items (which, unless you're logged in as an admin, will 404). Google started trying to crawl them and, as a result, we saw a huge increase in the number of 404s in Google Webmaster Tools.



We've since corrected the sitemap so those items are no longer included; however, the graph still shows a high number of 404s. We can mark these URLs as "fixed" in Webmaster Tools, but they aren't fixed - they still 404; they just shouldn't be crawled and indexed.

Is it possible to tell Google not to crawl these? Will I need to add all 40,000 links to our robots.txt file? And will these 404s cause an issue for SEO?


2 Comments


@Caterina187

Block the links using robots.txt.

This works well for a large set of URLs that share a common path prefix.


Block a directory and its contents by following the directory name with a forward slash:

Disallow: /sample-directory/


From Google Search Console Help.
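
For example, a single directory rule can cover all 40,000 URLs at once, provided the unpublished items share a common path. This is a minimal sketch assuming a hypothetical /unpublished/ directory; substitute whatever path your unpublished items actually live under:

User-agent: *
# Stop all well-behaved crawlers from fetching unpublished items (hypothetical path)
Disallow: /unpublished/

Keep in mind this only stops crawling: URLs Google has already discovered may take a while to age out of the crawl errors report.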


@Heady270

Googlebot rarely stops crawling URLs once it has discovered them; it may return to check on those URLs every once in a while, indefinitely. For reference, here is an article about Google's big memory for 404 pages.

You say that you already removed the URLs from the sitemap. That is good. If you hadn't already done so, that would be the first step.

You don't want to add 40,000 individual URLs to your robots.txt file. That could produce a file that is too big: the maximum robots.txt size Google will process is 500 KB, and other crawlers may not handle even that much.

Having "404 Not Found" error pages on your site is not a problem. Google's John Mueller says:


404 errors on invalid URLs do not harm your site’s indexing or ranking in any way. It doesn’t matter if there are 100 or 10 million, they won’t harm your site’s ranking. googlewebmastercentral.blogspot.ch/2011/05/do-404s-hurt-my-site.html


Having that many 404 errors will make that report in Google Search Console much less usable. One way to remove them from it would be to return a more appropriate status. I would suggest "401 Unauthorized", which indicates that the content exists but the user must log in to see it. If the user is logged in but not an admin, a "403 Forbidden" status would be appropriate.
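
If the unpublished items are served by an application you control, you can choose that status per request. The sketch below is only an illustration using Flask; the route, the find_item lookup, and the session checks are hypothetical stand-ins for whatever your CMS actually provides.

from flask import Flask, abort, session

app = Flask(__name__)
app.secret_key = "change-me"  # needed only so this sketch can read the session

def find_item(slug):
    """Hypothetical lookup; return None if the URL doesn't exist at all."""
    return None  # placeholder

def logged_in():
    return "user_id" in session

def is_admin():
    return session.get("role") == "admin"

@app.route("/items/<slug>")
def show_item(slug):
    item = find_item(slug)
    if item is None:
        abort(404)  # genuinely nonexistent URL: 404 is correct
    if not item["published"] and not is_admin():
        # The content exists but is restricted, so 401 (not logged in)
        # or 403 (logged in, not an admin) describes it better than 404.
        abort(401 if not logged_in() else 403)
    return item["body"]

With something like this in place, Googlebot sees 401 or 403 for unpublished items instead of 404, so those URLs no longer pile up in the "Not found" section of the report.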
