Is there a way to disallow crawling of only HTTPS in robots.txt?

@Berryessa370

Posted in: #Https #RobotsTxt

I just realized that Bingbot is crawling my company's website's pages over https. Bing already crawls the site over http, so this seems frivolous. Is there a way to specify Disallow: / for https only?

According to Wikipedia, each protocol has its own robots.txt

And according to Google's Robots.txt Specification, the robots.txt applies to http AND https

I don't want to Disallow: / for Bing totally, just over https.
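To illustrate the per-protocol rule: the robots.txt location is derived only from the scheme and host of a page URL, so http:// and https:// pages resolve to two distinct files. A minimal sketch using Python's standard library (example.com is a placeholder):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the scheme and host that serve page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://example.com/some/page"))   # http://example.com/robots.txt
print(robots_url("https://example.com/some/page"))  # https://example.com/robots.txt
```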


4 Comments


 

@Heady270

Add a .htaccess file to redirect HTTPS to HTTP, and to rewrite requests for the robots.txt file to one that disallows HTTPS crawling:

RewriteEngine On

# Serve the restrictive robots file over HTTPS
# (this rule must come before the redirect, so robots.txt is not redirected itself)
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_https.txt [L]

# Redirect everything else from HTTPS to HTTP
RewriteCond %{HTTP:X-Forwarded-Proto} =https
RewriteRule ^ http://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]


Then add a robots_https.txt with this in it:

User-agent: *
Disallow: /
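As a quick sanity check, Python's standard-library robots.txt parser confirms that these two lines block every path for every crawler (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Parse the deny-all rules that robots_https.txt would serve over HTTPS
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

print(rp.can_fetch("Bingbot", "https://example.com/any-page"))  # False
```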



 

@Jessie594

Create a separate robots.txt for HTTPS requests, for example robots_https.txt, and place it in the root of your website.

Then add the following lines to your root .htaccess file to internally rewrite any request for robots.txt made over HTTPS to robots_https.txt instead:

RewriteEngine On
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_https.txt [L]



 

@Moriarity557

Before you try to manipulate robots.txt, ensure that you have defined canonical link elements on your pages.

Web crawlers should treat:

<link rel="canonical" href="…" />


as a very strong hint that two pages should be considered to have the same content, and that one of the URLs is the preferred address for the content.

As stated in RFC 6596 Section 3:

    The target (canonical) IRI MAY:
    Have different scheme names, such as "http" to "https"…

With the canonical link hints, a reasonably intelligent web crawler should be able to avoid crawling the site a second time over HTTPS.
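RFC 6596 also permits delivering the same hint as an HTTP Link header rather than an HTML element, which is useful for non-HTML resources such as PDFs (example.com is a placeholder):

```
Link: <http://example.com/page>; rel="canonical"
```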



 

@Heady270

There is no way to do it within robots.txt itself.

You could, however, serve a different robots file entirely for secure HTTPS connections. Here is one way of doing so, using rewrite rules in your .htaccess file:

RewriteEngine On
RewriteCond %{HTTPS} =on
RewriteRule ^robots\.txt$ robots-deny-all.txt [L]


Where robots-deny-all.txt has the contents:

User-agent: *
Disallow: /
