Googlebot robots.txt access error on HTTPS when redirected from HTTP

This might be a stupid question, but I have never come across this issue before and was not able to find any definite answers to it on the web:
Our client migrated to HTTPS a few months ago, initially running their HTTP sites alongside their HTTPS sites. We told them to 301 redirect the HTTP sites to their corresponding HTTPS sites. So far everything was fine...
UNTIL we got an error message in Google Webmaster Tools for www.example.com/robots.txt:
Googlebot encountered 5429 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 12.9%.
We asked their IT service provider to look into the issue, assuming that they had made some mistake when setting up the 301 redirect in the .htaccess file. However, they referred back to us, stating that redirecting robots.txt might generally be discouraged by Google (see here) and that this might well be the issue. They recommend serving the HTTP robots.txt with a 200.
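For illustration, the setup they are recommending would look roughly like this in .htaccess (a minimal sketch assuming Apache with mod_rewrite, not their actual configuration):

    RewriteEngine On

    # Leave robots.txt reachable over plain HTTP with a 200;
    # 301 everything else to the same URL on HTTPS.
    RewriteCond %{HTTPS} off
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]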
I have actually never come across this issue before.
Do you have any idea what might be causing the issue?
I figure that if we no longer redirect the HTTP robots.txt file, Googlebot might try to crawl the HTTP versions of the website. That shouldn't really be an issue if all HTTP URLs are properly 301 redirected to their HTTPS versions, it just doesn't feel right ;) I'm more interested in fixing the issue (by finding out the cause) than in a quick fix.
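For comparison, the clean single-hop catch-all redirect described above, robots.txt included, would be just this (again a sketch for Apache/mod_rewrite):

    RewriteEngine On

    # Send every HTTP request, robots.txt included, to the same
    # path on HTTPS in a single 301 hop (no chained redirects).
    RewriteCond %{HTTPS} off
    RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]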
3 Comments
My hypothesis on the 5429 errors is that Google is trying to parse an HTML doc (i.e., the redirect is working, but ending up in the wrong place). It could be a 404 page, an error page, or even the home page.
I had this very problem yesterday, where example.com/robots.txt was redirecting to /index.php and then again to my home page due to a dodgy .htaccess.
If that's the case, it means Google likely does follow redirects on robots.txt.
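To illustrate the kind of misconfiguration that can do this (a hypothetical reconstruction, not my actual file): a CMS front-controller rule without a file-exists guard rewrites every request, robots.txt included, to index.php:

    RewriteEngine On

    # Broken: rewrites *everything* to the front controller, so a
    # request for /robots.txt comes back as an HTML page:
    #RewriteRule ^ /index.php [L]

    # Fixed: skip the rewrite for paths that exist as real files or
    # directories, so /robots.txt is served as-is:
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^ /index.php [L]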
First, make sure you set your preferred site(s) to HTTPS in GWT. This may require you to add a new property and re-verify it.
Once it's looking at the SSL version, go to the sidebar and navigate to "Crawl > robots.txt Tester". You should see a field at the bottom that starts with yoursite.com, followed by a text box and a red "TEST" button.
You should see your robots directives loaded up. Run the test on both an allowed and a disallowed page and see what it says. If it still can't access the file, yet you can, then you should test the headers: open the Chrome inspector, then the Network tab. Refresh the page, then click the first or second entry to expose the headers. You are looking for any fishy-looking responses or non-200 codes...there may be a hint there as to why G is not able to get in.
If you find a redirect issue, it's all on the shoulders of the "IT service provider". They should be able to route to HTTPS correctly....if they can't, I would suggest to the client that they find a new "IT service provider" who understands how forwarding works.
As a bonus, they should be setting an HSTS header as well. HSTS uses a client-side 307 redirect and is stricter/more stateful than 301-style redirects. It is also better at mitigating [blocking] insecure elements.
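In .htaccess that could look something like this (a sketch assuming mod_headers is available; the max-age value is just an example):

    <IfModule mod_headers.c>
        # Tell browsers to require HTTPS for a year, subdomains included.
        # Per the HSTS spec the header should only be sent over HTTPS,
        # so apply it in the HTTPS vhost (or guard it accordingly).
        Header always set Strict-Transport-Security "max-age=31536000; includeSubDomains"
    </IfModule>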
The best way to determine why Google can't access a page (including robots.txt) is to use the Fetch as Google feature in Google Webmaster Tools:
Log into Google Webmaster Tools
Select your site (make sure you have the HTTPS version registered)
Navigate to "Crawl" -> "Fetch as Google"
Enter /robots.txt in the text box
Click the "Fetch" button
Google will then give you more detailed information about why it is not able to get your robots.txt file.