Googlebot robots.txt access error on HTTPS when redirected from HTTP

This might be a stupid question, but I have never come across this issue before and was not able to find any definite answers to it on the web:
Our client migrated to HTTPS a few months ago, initially running their HTTP sites alongside their HTTPS sites. We told them to 301 redirect the HTTP sites to their corresponding HTTPS sites. So far everything was fine...
UNTIL we got an error message in Google Webmaster Tools for www.example.com/robots.txt:
Googlebot encountered 5429 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 12.9%.
We asked their IT service provider to look into the issue, assuming that they had made some mistake when setting up the 301 redirect in the .htaccess file. However, they referred back to us, stating that redirecting robots.txt might generally be discouraged by Google (see here) and that this might well be the issue. They recommend serving the HTTP robots.txt with a 200.
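For illustration, the setup they are recommending would look roughly like this in .htaccess (a minimal sketch assuming Apache with mod_rewrite, not their actual configuration):

    RewriteEngine On

    # Leave robots.txt reachable over plain HTTP with a 200;
    # 301 everything else to the same URL on HTTPS.
    RewriteCond %{HTTPS} off
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]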
I have actually never come across this issue before.
Do you have any idea what might be causing the issue?
I figure that if we no longer redirect the HTTP robots.txt file, Googlebot might try to crawl the HTTP versions of the website. That shouldn't really be an issue if all HTTP URLs are properly 301 redirected to their HTTPS versions, it just doesn't feel right ;) I'm more interested in fixing the issue (by finding out the cause) than in a quick fix.
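For comparison, the clean single-hop catch-all redirect described above, robots.txt included, would be just this (again a sketch for Apache/mod_rewrite):

    RewriteEngine On

    # Send every HTTP request, robots.txt included, to the same
    # path on HTTPS in a single 301 hop (no chained redirects).
    RewriteCond %{HTTPS} off
    RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]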
3 Comments
My hypothesis on the 5429 errors is that Google is trying to parse an HTML doc (i.e., the redirect is working, but ending up in the wrong place). It could be a 404 page, an error page, or even the home page.
I had this very problem yesterday, where example.com/robots.txt was redirecting to /index.php and then again to my home page due to a dodgy .htaccess.
If that's the case, it means Google likely does follow redirects on robots.txt.
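To illustrate the kind of misconfiguration that can do this (a hypothetical reconstruction, not my actual file): a CMS front-controller rule without a file-exists guard rewrites every request, robots.txt included, to index.php:

    RewriteEngine On

    # Broken: rewrites *everything* to the front controller, so a
    # request for /robots.txt comes back as an HTML page:
    #RewriteRule ^ /index.php [L]

    # Fixed: skip the rewrite for paths that exist as real files or
    # directories, so /robots.txt is served as-is:
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^ /index.php [L]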
First, make sure you set your preferred site(s) to HTTPS in GWT. This may require you to add a new property and re-verify it.
Once it's looking at the SSL version, go to the sidebar and navigate to "Crawl > robots.txt Tester". You should see a field at the bottom that starts with yoursite.com, followed by a text box and a red "TEST" button.
You should see your robots directives loaded up. Run the test on both an allowed and a disallowed page and see what it says. If it still can't access the file, yet you can, then you should test the headers: open the Chrome inspector, then the Network tab. Refresh the page, then click the first or second entry to expose the headers. You are looking for any fishy-looking responses or non-200 codes...there may be a hint there as to why G is not able to get in.
If you find a redirect issue, it's all on the shoulders of the "IT service provider". They should be able to route to HTTPS correctly....if they can't, I would suggest to the client that they find a new "IT service provider" who understands how forwarding works.
As a bonus, they should be setting an HSTS header as well. HSTS uses a client-side 307 redirect and is stricter/more stateful than 301-style redirects. It is also better at mitigating [blocking] insecure elements.
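In .htaccess that could look something like this (a sketch assuming mod_headers is available; the max-age value is just an example):

    <IfModule mod_headers.c>
        # Tell browsers to require HTTPS for a year, subdomains included.
        # Per the HSTS spec the header should only be sent over HTTPS,
        # so apply it in the HTTPS vhost (or guard it accordingly).
        Header always set Strict-Transport-Security "max-age=31536000; includeSubDomains"
    </IfModule>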
The best way to determine why Google can't access a page (including robots.txt) is to use the Fetch as Google feature in Google Webmaster Tools:
Log into Google Webmaster Tools
Select your site (make sure you have the HTTPS version registered)
Navigate to "Crawl" -> "Fetch as Google"
Enter /robots.txt in the text box
Click the "Fetch" button
Google will then give you more detailed information about why it is not able to get your robots.txt file.