Is there a way to disallow crawling of only HTTPS in robots.txt?

@Berryessa370

Posted in: #Https #RobotsTxt

I just realized that Bingbot is crawling my company's website's pages over https. Bing already crawls the site over http, so this seems frivolous. Is there a way to specify Disallow: / for https only?

According to Wikipedia, each protocol has its own robots.txt

And according to Google's Robots.txt Specification, the robots.txt applies to http AND https

I don't want to Disallow: / for Bing totally, just over https.
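To illustrate the per-protocol rule: the robots.txt location is derived only from the scheme and host of a page URL, so http:// and https:// pages resolve to two distinct files. A minimal sketch using Python's standard library (example.com is a placeholder):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the scheme and host that serve page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://example.com/some/page"))   # http://example.com/robots.txt
print(robots_url("https://example.com/some/page"))  # https://example.com/robots.txt
```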


4 Comments


 

@Heady270

Add a .htaccess file to redirect HTTPS to HTTP, and to rewrite requests for the robots.txt file to one that disallows HTTPS crawling:

RewriteEngine On

# Serve the restrictive robots file over HTTPS
# (this rule must come before the redirect, so robots.txt is not redirected itself)
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_https.txt [L]

# Redirect everything else from HTTPS to HTTP
RewriteCond %{HTTP:X-Forwarded-Proto} =https
RewriteRule ^ http://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]


Then add a robots_https.txt with this in it:

User-agent: *
Disallow: /
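As a quick sanity check, Python's standard-library robots.txt parser confirms that these two lines block every path for every crawler (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Parse the deny-all rules that robots_https.txt would serve over HTTPS
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

print(rp.can_fetch("Bingbot", "https://example.com/any-page"))  # False
```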



 

@Jessie594

Create a separate robots.txt for HTTPS requests, for example robots_https.txt, and place it in the root of your website.

Then add the following lines to your root .htaccess file to internally rewrite any request for robots.txt made over HTTPS to robots_https.txt instead:

RewriteEngine On
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_https.txt [L]



 

@Moriarity557

Before you try to manipulate robots.txt, ensure that you have defined canonical link elements on your pages.

Web crawlers should treat:

<link rel="canonical" href="…" />


as a very strong hint that two pages should be considered to have the same content, and that one of the URLs is the preferred address for the content.

As stated in RFC 6596 Section 3:

    The target (canonical) IRI MAY:
    Have different scheme names, such as "http" to "https"…

With the canonical link hints, a reasonably intelligent web crawler should be able to avoid crawling the site a second time over HTTPS.
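RFC 6596 also permits delivering the same hint as an HTTP Link header rather than an HTML element, which is useful for non-HTML resources such as PDFs (example.com is a placeholder):

```
Link: <http://example.com/page>; rel="canonical"
```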



 

@Heady270

There is no way to do it within robots.txt itself.

You could, however, serve a different robots file entirely for secure HTTPS connections. Here is one way of doing so, using rewrite rules in your .htaccess file:

RewriteEngine On
RewriteCond %{HTTPS} =on
RewriteRule ^robots\.txt$ robots-deny-all.txt [L]


Where robots-deny-all.txt has the contents:

User-agent: *
Disallow: /
