Is there a way to disallow crawling of only HTTPS in robots.txt? I just realized that Bingbot is crawling my company's website's pages over HTTPS. Bing already crawls the site over HTTP, so this seems redundant. Is there a way to specify Disallow: / for HTTPS only?
According to Wikipedia, each protocol has its own robots.txt.
And according to Google's robots.txt specification, the same robots.txt applies to both HTTP and HTTPS.
I don't want to Disallow: / for Bing totally, just over https.
Add a .htaccess file that serves a deny-all robots file over HTTPS and redirects all other HTTPS requests to HTTP. The robots.txt rule must come first, otherwise the [L] redirect catches requests for robots.txt as well:
RewriteEngine On
# Serve the deny-all robots file over HTTPS
RewriteCond %{HTTP:X-Forwarded-Proto} =https [OR]
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_https.txt [L]
# Redirect all other HTTPS requests to HTTP
RewriteCond %{HTTP:X-Forwarded-Proto} =https [OR]
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^(.*)$ http://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
Then add a robots_https.txt with this in it:
User-agent: *
Disallow: /
Create a separate robots.txt for HTTPS requests, for example robots_https.txt, and place it in the root of your website.
Then add the following lines to your root .htaccess file to rewrite requests for robots.txt over HTTPS to robots_https.txt instead.
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_https.txt [L]
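As in the answer above, robots_https.txt would contain the standard deny-all directives, which block every user agent from every path:

```
User-agent: *
Disallow: /
```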
Before you try to manipulate robots.txt, ensure that you have defined canonical link elements on your pages.
Web crawlers should treat:
<link rel="canonical" href="…" />
as a very strong hint that two pages should be considered to have the same content, and that one of the URLs is the preferred address for the content.
As stated in RFC 6596 Section 3:
The target (canonical) IRI MAY:
…
Have different scheme names, such as "http" to "https"…
With the canonical link hints, a reasonably intelligent web crawler should be able to avoid crawling the site a second time over HTTPS.
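For example, each HTTPS page could declare its HTTP counterpart as the canonical address (example.com and the path are placeholders):

```html
<!-- In the <head> of https://example.com/page.html -->
<link rel="canonical" href="http://example.com/page.html" />
```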
There is no way to do it in robots.txt itself as served over HTTP.
You could serve a different robots file entirely for secure HTTPS connections. Here is one way of doing so using rewrite rules in your .htaccess file:
RewriteEngine On
RewriteCond %{HTTPS} =on
RewriteRule ^robots\.txt$ robots-deny-all.txt [L]
Where robots-deny-all.txt has the contents:
User-agent: *
Disallow: /
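To confirm that these deny-all directives block every path for every crawler, you can feed the file's contents to Python's standard-library robots.txt parser (the user agent and URL below are just illustrative):

```python
from urllib.robotparser import RobotFileParser

# Parse the deny-all rules exactly as they appear in robots-deny-all.txt
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# No user agent may fetch any path
print(rp.can_fetch("Bingbot", "https://example.com/"))           # False
print(rp.can_fetch("Bingbot", "https://example.com/some/page"))  # False
```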