Blocking vs noindex to reduce crawl requests
I have observed that Googlebot makes a lot of duplicate requests for the same URLs on my website within a week. The majority of these requests were for low-value, thin pages (little or no content, little or no SERP presence).
Therefore, I want to optimize how Google spends its crawl bandwidth on my website. Apart from a few unnecessary resources that I can block, I want to limit the bot's focus to crawling and recrawling high-value pages only.
After a lot of discussion I have three options:
1. Return 404 for the low-value pages. Not an option for me.
2. Add noindex to the low-value pages. This should (although it is not confirmed) reduce how often those pages are requested during crawling; see the sketch after this list.
3. Block the URLs via robots.txt. There is no particular pattern to the low-value pages and I would have to block 150,000+ URLs, so I cannot use wildcards in robots.txt. Robots.txt is therefore almost out of the picture.
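For reference, since option 2 is the front-runner: noindex can be set either in the page's HTML or as an HTTP response header (the header form is useful for non-HTML resources). A minimal sketch, with placeholder markup:

    <!-- in the <head> of each low-value page -->
    <meta name="robots" content="noindex">

or, sent as a response header for those URLs:

    X-Robots-Tag: noindex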
Looking at these options, the 2nd one is the most feasible. But my concern is that, as per Google's documentation, crawling and indexing are independent:
Robots.txt should be used to limit crawling.
noindex should be used to prevent indexing.
So perhaps adding noindex would not help my case. Any suggestions or alternatives?
1 Comment
Google has many crawl triggers: backlinks, PageRank, sitemaps, recrawl requests submitted through Google Webmaster Tools / Search Console, and periodic re-crawls of the same URL after a few weeks or months.
By using noindex, Google may crawl those URLs less frequently, but it will not block them permanently, because noindex pages remain crawlable and still pass PageRank when they are linked from somewhere; so crawls triggered by backlinks and PageRank will still reach those pages.
So my first advice is to link to those pages as rarely as possible.
Second, remove those pages from your sitemap or feed URLs, as in the example below.
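For instance, a trimmed sitemap that lists only the high-value URLs (the example.com paths are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://www.example.com/high-value-page-1</loc></url>
      <url><loc>https://www.example.com/high-value-page-2</loc></url>
    </urlset>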
Third, use the Last-Modified HTTP header, because when Google crawls a page it will recrawl the same URL after some time (maybe after a few weeks) to check for changes.
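As a rough sketch of that exchange (the URL and dates are illustrative): if the server also honours If-Modified-Since, it can answer an unchanged page with 304 Not Modified and no body, which keeps those recrawls cheap:

    Googlebot request:
      GET /some-page HTTP/1.1
      Host: www.example.com
      If-Modified-Since: Tue, 01 Oct 2024 10:00:00 GMT

    Server response when the page has not changed:
      HTTP/1.1 304 Not Modified
      Last-Modified: Tue, 01 Oct 2024 10:00:00 GMT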
I don't see any other solution for you. If it is possible, move your thin content into a subdirectory and block that specific directory in robots.txt.
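A minimal sketch of that robots.txt rule, assuming the thin pages were moved under a /thin/ directory (the path is only a placeholder):

    User-agent: *
    Disallow: /thin/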