
Stop Google crawling links completely

@Courtney195

Posted in: #Googlebot #Links #WebCrawlers

I would like to stop Google crawling certain links on a site. When I say this, I mean completely stop the crawling. I'm aware that I can use noindex/nofollow and other methods such as robots.txt, but I'm told that whilst these will be respected in terms of the content of the page, the URL itself is still crawled.

The reason I'm so bothered about this is the crawl budget allocated to the site. The site uses layered/faceted navigation, and the combination of filters creates hundreds of thousands of unique links.
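To illustrate the scale of the problem, the layered navigation generates URL combinations along these lines (the paths and parameter names here are made up for illustration, not from the actual site):

myshop.com/shoes?colour=red
myshop.com/shoes?colour=red&size=9
myshop.com/shoes?colour=red&size=9&brand=acme
myshop.com/shoes?size=9&brand=acme&sort=price

With a dozen or so filters, the combinations multiply quickly, and that's where the crawl budget goes.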

I want to make sure Google ignores these links completely, and concentrates on the important pages that I want to be crawled and indexed.

I've been made aware of a couple of options:

AJAX checkboxes - replace the HTML links with AJAX checkboxes (see the sketch after this list). The problem here is that these are no good for accessibility or for browsers with JavaScript disabled. I'm also aware that Google can see this as cloaking, and can penalise accordingly.
Hashbang #! URLs - from what I can find, if you add a hashbang to a URL, Google will not crawl it. For example, myshop.com/shoes#!colour-red. I can't find any sites using it in this form though, so I'm not convinced.
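For the first option, this is roughly the kind of change I mean - a minimal sketch, where the markup and the updateFilters() handler are hypothetical:

<!-- Crawlable filter link: Googlebot can follow the href -->
<a href="/shoes?colour=red">Red</a>

<!-- AJAX checkbox: nothing for a crawler to follow, but useless for
     visitors with JavaScript disabled. updateFilters() is a hypothetical
     handler that would fetch and re-render the filtered results. -->
<input type="checkbox" onchange="updateFilters('colour', 'red')"> Red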






@Heady270

Using Disallow: in robots.txt will prevent Googlebot from crawling the URLs at all. In fact, robots.txt is more of a tool for controlling your crawl budget than for controlling which pages are indexed. In some (albeit fairly rare) cases, Google chooses to include pages in the index even though they are disallowed in robots.txt, based on the number of inbound links to those pages and the anchor text of those links.

So if you have a section of the site that you don't want Google crawling and using up your server resources, create a robots.txt file like this:

User-agent: *
Disallow: /folder1/


Here is Google's full documentation for robots.txt. They even support wildcards.
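For layered navigation, wildcards let you block the filter parameters while leaving the base category pages crawlable. A sketch, assuming your filters are query parameters named colour, size, and brand (substitute your actual parameter names):

User-agent: *
# Block any URL whose query string contains a filter parameter.
# colour, size, and brand are assumed names - adjust to your site.
Disallow: /*?colour=
Disallow: /*&colour=
Disallow: /*?size=
Disallow: /*&size=
Disallow: /*?brand=
Disallow: /*&brand=

The * matches any run of characters, so these patterns catch the parameter wherever it appears in the query string.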

AJAX Checkboxes

I would not rely on JavaScript to prevent Googlebot from crawling anything. Googlebot now executes some JavaScript, and its ability to do so is likely to improve in the future.

Hashbang #! URLs

Hashbang URLs are specifically designed to be crawlable. When Googlebot encounters #! in a URL, it rewrites it as an ?_escaped_fragment_= URL and requests that version from your server. So using #! is still going to use up your crawl budget.

Using just a hash without the bang (#) in your URLs might work for you. Everything after a plain # is a fragment that browsers never send to the server, so Googlebot doesn't make requests for those URLs without the !.
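A minimal sketch of that approach, with the filter state kept after a plain # and applied client-side (the markup and the applyFilter() function are hypothetical, not a tested implementation):

<!-- Everything after # is a fragment; Googlebot won't request it from the server -->
<a href="#colour-red">Red</a>

<script>
  // applyFilter() is hypothetical - it would re-render the product
  // list client-side based on the current fragment.
  window.addEventListener('hashchange', function () {
    applyFilter(window.location.hash.slice(1)); // e.g. "colour-red"
  });
</script>

The trade-off is the same as with the AJAX checkboxes: visitors without JavaScript only ever see the unfiltered page.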


