
Stop Google crawling links completely

@Courtney195

Posted in: #Googlebot #Links #WebCrawlers

I would like to stop Google crawling certain links on a site. When I say this, I mean completely stop the crawling. I'm aware that I can use noindex/nofollow and other methods such as robots.txt, but I'm told that whilst these will be respected in terms of the content of the page, the URL itself is still crawled.

The reason I'm so bothered about this is the crawl budget allocated to the site. The site uses layered/faceted navigation, and the combination of filters creates hundreds of thousands of unique links.
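To illustrate the scale of the problem, the layered navigation generates URL combinations along these lines (the paths and parameter names here are made up for illustration, not from the actual site):

myshop.com/shoes?colour=red
myshop.com/shoes?colour=red&size=9
myshop.com/shoes?colour=red&size=9&brand=acme
myshop.com/shoes?size=9&brand=acme&sort=price

With a dozen or so filters, the combinations multiply quickly, and that's where the crawl budget goes.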

I want to make sure Google ignores these links completely, and concentrates on the important pages that I want to be crawled and indexed.

I've been made aware of a couple of options:

AJAX checkboxes - replace the HTML links with AJAX checkboxes (see the sketch after this list). The problem here is that these are no good for accessibility or for browsers with JavaScript disabled. I'm also aware that Google can see this as cloaking, and can penalise accordingly.
Hashbang #! URLs - from what I can find, if you add a hashbang to a URL, Google will not crawl it. For example, myshop.com/shoes#!colour-red. I can't find any sites using it in this form though, so I'm not convinced.
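For the first option, this is roughly the kind of change I mean - a minimal sketch, where the markup and the updateFilters() handler are hypothetical:

<!-- Crawlable filter link: Googlebot can follow the href -->
<a href="/shoes?colour=red">Red</a>

<!-- AJAX checkbox: nothing for a crawler to follow, but useless for
     visitors with JavaScript disabled. updateFilters() is a hypothetical
     handler that would fetch and re-render the filtered results. -->
<input type="checkbox" onchange="updateFilters('colour', 'red')"> Red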






@Heady270

Using Disallow: in robots.txt will prevent Googlebot from crawling the URLs at all. In fact, robots.txt is more of a tool for controlling your crawl budget than for controlling which pages are indexed. In some (albeit fairly rare) cases, Google chooses to include pages in the index even though they are disallowed in robots.txt, based on the number of inbound links to those pages and the anchor text of those links.

So if you have a section of the site that you don't want Google crawling and using up your server resources, create a robots.txt file like this:

User-agent: *
Disallow: /folder1/


Here is Google's full documentation for robots.txt. They even support wildcards.
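For layered navigation, wildcards let you block the filter parameters while leaving the base category pages crawlable. A sketch, assuming your filters are query parameters named colour, size, and brand (substitute your actual parameter names):

User-agent: *
# Block any URL whose query string contains a filter parameter.
# colour, size, and brand are assumed names - adjust to your site.
Disallow: /*?colour=
Disallow: /*&colour=
Disallow: /*?size=
Disallow: /*&size=
Disallow: /*?brand=
Disallow: /*&brand=

The * matches any run of characters, so these patterns catch the parameter wherever it appears in the query string.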

AJAX Checkboxes

I would not rely on JavaScript to prevent Googlebot from crawling anything. Googlebot now executes some JavaScript, and its ability to do so is likely to improve in the future.

Hashbang #! URLs

Hashbang URLs are specifically designed to be crawlable. When Googlebot encounters #! in a URL, it rewrites it as an ?_escaped_fragment_= URL and requests that version from your server. So using #! is still going to use up your crawl budget.

Using just a hash without the bang (#) in your URLs might work for you. Everything after a plain # is a fragment that browsers never send to the server, so Googlebot doesn't make requests for those URLs without the !.
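A minimal sketch of that approach, with the filter state kept after a plain # and applied client-side (the markup and the applyFilter() function are hypothetical, not a tested implementation):

<!-- Everything after # is a fragment; Googlebot won't request it from the server -->
<a href="#colour-red">Red</a>

<script>
  // applyFilter() is hypothetical - it would re-render the product
  // list client-side based on the current fragment.
  window.addEventListener('hashchange', function () {
    applyFilter(window.location.hash.slice(1)); // e.g. "colour-red"
  });
</script>

The trade-off is the same as with the AJAX checkboxes: visitors without JavaScript only ever see the unfiltered page.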


