Sitemap and crawler

@Kevin317

Posted in: #Googlebot #GoogleSearchConsole #Sitemap

I am still a newbie in the whole world of SEO, so let's look at my example. I have a website that works much like a blog-style site. There are a few URL patterns, all of them very simple and all allowed to be crawled.

Every day I submit a new sitemap covering the accessible content, with the change frequency set to daily (changefreq=daily). The content on the site changes very frequently. In addition, I have many catalogs of blogs with pagination, and one blog can appear in several catalogs. In the sitemap I submit only the main page of each catalog. The catalogs are allowed to be crawled.
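For context, a typical entry in my sitemap looks roughly like this (the domain and path are placeholders, not my real URLs):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/blog/some-post</loc>
        <changefreq>daily</changefreq>
      </url>
    </urlset>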

I have the feeling that the crawler is doing a lot of unnecessary work, because it has to scan every catalog even though the content there is already indexed: every blog can be reached from an earlier catalog page or from the sitemap. Because of this, it looks like the crawler will never finish its work. In Webmaster Tools I see many pages that were submitted months ago, have since been deleted, and are still in the index, even though they now return 404 with a noindex option. What makes no sense to me is that I can even find old sitemap files in the index. How come? I thought the sitemap files, at least, would be re-indexed daily.

What is the preferred strategy in my case? Should I just mark the pagination in the catalogs as "nofollow" and leave only the main page of each catalog? Really, I only need the content listed in the sitemap to be indexed, because everything except the blogs themselves is just thousands of pagination pages pointing to the same blogs. I also heard an interesting idea: instead of tons of catalogs, use an archive with date navigation and allow only that for the crawler.
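To be clear, by "nofollow" I mean marking the pagination links roughly like this (the URL is only an example):

    <a href="/catalog/some-catalog/page/2" rel="nofollow">Next page</a>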

What's your opinion?
Thanks!


2 Comments


@Hamaas447

Why do your URLs churn so much? You write:


"[I]n webmaster tools I see a lot pages that were submitted months ago, currently deleted, and still in index even thought I retrieve 404 with noindex option."


If you keep your pages up only for a month or two before deleting them, it's no wonder that Google can't keep up.

In particular, keep in mind that Google will not immediately drop pages from their index when they get a 404 response — they'll wait a while, in case the error was temporary and the content comes back later.

Besides updating your sitemap regularly to reflect new content, there are a few other things you could do that might help Google (and other search engines) keep up with a frequently changing site structure:


Have your removed pages serve 410 Gone responses instead of 404 Not Found; a server-configuration sketch for this and the next point follows this list. For the last few years, Google has treated such responses as "a bit more permanent" than 404s, and may remove such pages from their index faster. (Alternatively, 301 permanent redirects to some stable target page may also work.)
If you know in advance when a page is likely to be removed or changed, send an appropriate HTTP Expires header. This is primarily meant for browsers and proxies, but search engines may pay attention to it too.
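How exactly you implement these two points depends on your server. As a minimal sketch, assuming an Apache server with mod_expires enabled and a hypothetical removed URL /blog/removed-post, the .htaccess rules could look roughly like this:

    # Return 410 Gone for a removed post (the path is hypothetical)
    Redirect gone /blog/removed-post

    # Hint that HTML pages are expected to expire after a week
    <IfModule mod_expires.c>
      ExpiresActive On
      ExpiresByType text/html "access plus 7 days"
    </IfModule>

The "7 days" value is just an example; pick whatever matches how long your pages actually stay up.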


You may indeed also want to consider tagging short-lived or frequently changing pages with a noindex meta tag, especially if the content on those pages is also available at more stable URLs elsewhere.
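The tag goes in the <head> of each such page; for example, with "noindex, follow" so that the links on the page can still be followed:

    <meta name="robots" content="noindex, follow">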

Using a robots.txt file to keep bots away from frequently changing parts of your site might also help the bots focus on those parts of your site that you want them to index — but keep in mind that this also prevents the disallowed pages from passing on PageRank. You could also try using the <priority> tags in your sitemaps to guide bots to the pages you want indexed most.
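As a rough sketch, assuming your paginated catalog URLs follow a hypothetical /catalog-name/page/N pattern (example.com and the paths are placeholders):

    # robots.txt -- keep bots out of the pagination, not the catalog main pages
    User-agent: *
    Disallow: /*/page/

    <!-- sitemap entry with an explicit priority for a blog post -->
    <url>
      <loc>https://example.com/blog/some-post</loc>
      <priority>0.8</priority>
    </url>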

However, I think the real issue is simply that Google works best with URLs that don't have the lifespan of a mayfly. On a typical blog, which you say your site resembles, once a post goes up, it stays up, and each post will generally have a stable URL pointing to it. Without knowing more about what your site actually is, it's hard to say how practical or not that would be for you, but in general, if the content doesn't disappear from your site entirely, you should try to design your URL structure so that links that used to work in the past will keep leading to the same content whenever possible.



Edit: One other thing you might try for temporarily removed pages would be to return a 200 OK response code with a noindex meta tag (and a short explanation for users, of course). Google normally discourages such "soft 404" pages, but if you expect the content to come back shortly, they might be appropriate. In particular, this page seems to imply that Google drops noindex-ed pages from their results immediately after seeing the tag, while the comments I linked to above suggest that they may also re-crawl such pages more frequently than ones that have been removed as 404s.
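A minimal sketch of such a temporary placeholder page, served with a 200 OK status (the wording and markup are just an example):

    <!DOCTYPE html>
    <html>
    <head>
      <meta name="robots" content="noindex">
      <title>This entry is temporarily unavailable</title>
    </head>
    <body>
      <p>This entry has been temporarily removed and should be back soon.</p>
    </body>
    </html>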


@Cofer257

First, stop resubmitting your sitemap every day. You can keep updating it, of course, and Google will periodically check it for updates; what matters more is that your site itself stays crawlable. You can also submit your RSS feed, which will help Google find your newer content.
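If you do that, one simple option is to list the feed next to the sitemap in your robots.txt, since Google treats RSS/Atom feeds as a supported sitemap format (the URLs below are placeholders):

    Sitemap: https://example.com/sitemap.xml
    Sitemap: https://example.com/feed.rss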

Regarding the "catalog", I think how you are currently doing it is fine, as long as there is a finite number of catalog pages, and they are not duplicated.
