Protocol Agnostic Robots Sitemap

Recently, I have enabled all my servers to serve everything over both HTTP and HTTPS. Users can access any site via http://www.example.com or https://www.example.com. All pages are identical between the two versions, so http://www.example.com/about.php is the same as https://www.example.com/about.php, and so on.
URLs are relative, so they never mention the protocol, with one exception. In other words, if a page is loaded over HTTP, it links to other pages, images, CSS, and JavaScript over HTTP, and likewise with HTTPS, to avoid mixed-content warnings.
Now about that exception. It is in robots.txt:
Sitemap: http://www.example.com/sitemap.php
Apparently this URL must be absolute.
Now the problem I see is that when Google reads https://www.example.com/robots.txt it gets an HTTP sitemap! The sitemaps.org documentation says that one can specify multiple sitemaps, but I am not sure that listing both the HTTP and the HTTPS sitemap is a good idea, since each would contain a list of identical pages (one over HTTP and one over HTTPS).
How should Sitemap in robots.txt be handled for websites that accept HTTP and HTTPS?
Some ideas that came to mind:
Specify both sitemaps (as mentioned above). I am afraid this would cause duplicate-content issues.
Only specify the HTTPS Sitemap. That gives access to all unique pages anyway.
Find a magical (Apache) way to send a different robots.txt via HTTP and HTTPS. Is that even possible? Could it cause issues?
A sitemap at http://www.example.com/sitemap.php can only contain URLs from http://www.example.com/.¹ The scheme and the host must be the same.
So if you 1) want to provide sitemaps for both protocols, and 2) link both sitemaps via the Sitemap field in the robots.txt, you have to provide separate robots.txt files for HTTP and HTTPS:
# http://www.example.com/robots.txt
Sitemap: http://www.example.com/sitemap.php

# https://www.example.com/robots.txt
Sitemap: https://www.example.com/sitemap.php
(It should be easy to achieve this with Apache, see for example the answers to Is there a way to disallow crawling of only HTTPS in robots.txt?)
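As a minimal sketch of that Apache approach, one way is a mod_rewrite rule keyed on the `%{HTTPS}` variable. The file names `robots-http.txt` and `robots-https.txt` are assumptions for illustration; any two files in the document root would do:

```apache
# Sketch only: serve a different robots.txt depending on the scheme.
# Assumes robots-http.txt and robots-https.txt exist in the docroot.
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^robots\.txt$ robots-http.txt [L]
RewriteCond %{HTTPS} on
RewriteRule ^robots\.txt$ robots-https.txt [L]
```

With this in place, http://www.example.com/robots.txt and https://www.example.com/robots.txt can each advertise the sitemap for their own scheme.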
But you might want to provide a sitemap only for the canonical variant (e.g., only for HTTPS), because there is not much point in letting search engines parse the sitemap for the non-canonical variant, as they typically wouldn’t want to index any of its URLs. So if HTTPS should be canonical:
On each HTTP page, link to its HTTPS version with the canonical link type.
Provide a sitemap only on HTTPS, listing only the HTTPS URLs.
Link the sitemap (ideally only) from the HTTPS robots.txt.
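The first step above, sketched as an HTML fragment (the URL is just an example page from the question):

```html
<!-- On every page served over HTTP, declare the HTTPS version as canonical -->
<link rel="canonical" href="https://www.example.com/about.php">
```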
¹ Except when cross-submits are used.
http://www.example.com/about
http://example.com/about
https://www.example.com/about
Google has been handling this kind of duplicate content for many years, so first of all, don't worry about duplicate-content issues.
It is totally fine to serve the HTTP and HTTPS versions of a site at the same time, especially while you are migrating from HTTP to HTTPS; Stack Overflow did the same in the past.
Google will index only one version of your page; it is not going to index both http://www.example.com/about.php and https://www.example.com/about.php. Most of the time it will choose HTTPS by default.
And again, there is no need to add your sitemap to robots.txt, especially when you think about Google (it is not ask.com), because it gives us the option to submit sitemaps in Search Console. So create two properties in Search Console, one for http://www.example.com and one for https://www.example.com, and submit an individual sitemap for each.
I don't know why you are so concerned about sitemaps, robots.txt, and all that. Google can crawl and index a website without a sitemap; Wikipedia, for example, does not have one, yet it is crawled often because it has a good internal link structure.