Google doesn't crawl CDN files
I've noticed that Google Webmaster Tools is reporting a lot of blocked resources on my website.
Right now all the "blocked resources" are .css, .js and image files (.jpg, .png) that I serve from the CloudFront CDN.
I've spent a lot of time testing and trying to figure out why Google doesn't crawl these files and reports them with a "blocked resource" status.
Currently I serve these files from several hostnames, like: cdn1.example.com, cdn2.example.com, …
cdn1, cdn2 and the others are CNAMEs to the CloudFront distribution name.
Test: I've tried using the CloudFront distribution domain directly (no CNAME), but the problem persists.
Currently my robots.txt looks like this:
# Google AdSense
User-agent: Mediapartners-Google
Disallow:
#Google images
User-agent: Googlebot-Image
Disallow: /
User-agent: *
Disallow: /homepage
Disallow: /index.php*
Disallow: /uncategorized*
Disallow: /tag/*
Disallow: *feed
Disallow: */page/*
Disallow: *author*
Disallow: *archive*
Disallow: */category*
Disallow: *tag=*
Disallow: /test*
Allow: /
Here are examples of blocked files on one example page:
cdn1.example.com/wp-content/plugins/wp-forecast/wp-forecast-default.css
cdn9.example.com/wp-content/plugins/bwp-minify/min/?f=wp-content/themes/magazine/css/font-awesome.min.css,wp-content/themes/magazine/css/responsive.css
cdn5.example.com/wp-content/themes/magazine/images/nobg.png
cdn6.example.com/wp-content/plugins/floating-social-bar/images/fsb-sprite.png
cdn5.example.com/wp-content/uploads/2013/11/Design-Hotel-3-80x80.jpg
cdn5.example.com/wp-content/uploads/2013/11/Marta-Hotel-7-270x225.jpg
I've even tried allowing everything in robots.txt, but I always get the same result.
I've also looked carefully at the CloudFront settings in the AWS console and see nothing that could be related (I don't use, and have never used, the "Restrict Viewer Access (Use Signed URLs or Signed Cookies)" option).
At this point I've spent a lot of time looking into this and have no more ideas.
Can anyone think of a reason why Googlebot would be blocked from crawling files hosted on Amazon CloudFront?
4 Comments
Create a robots.txt file in an S3 bucket.
Add the bucket as another origin for your CloudFront distribution.
Give the bucket's behavior higher precedence than your website's.
Invalidate your site's robots.txt on CloudFront.
After doing the above, Google will read the site's robots.txt when crawling your site, and will see the different robots.txt when following links to your CDN.
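As a rough sketch of that setup (assuming the AWS CLI is configured, and with "your-cdn-bucket" as a placeholder for the actual bucket name), a fully permissive robots.txt could be created and uploaded like this:
# Write a permissive robots.txt that lets crawlers fetch everything from the CDN
printf 'User-agent: *\nDisallow:\n' > robots.txt
# Upload it to the S3 bucket that backs the CDN origin (placeholder bucket name)
aws s3 cp robots.txt s3://your-cdn-bucket/robots.txt
The idea is that requests for /robots.txt on the CDN hostnames are then answered from the bucket rather than from your website.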
So the solution seems to be that Amazon CloudFront also evaluates my robots.txt, and somehow uses different syntax rules than Google does.
The working version of my robots.txt is the following:
User-agent: Googlebot-Image
Disallow: /
User-agent: *
Disallow: /homepage
Disallow: /uncategorized
Disallow: /page
Disallow: /category
Disallow: /author
Disallow: /feed
Disallow: /tags
Disallow: /test
A very important note: this isn't performing the exact same functions as before. In fact, I took out all blank lines, wildcards and "Allow" directives, meaning that the end result is not the same... but I think it's close enough for me.
For example, it doesn't exclude tag pages when the tag is passed in a query string...
Three important notes:
If you're testing with this, don't forget to invalidate robots.txt in the CloudFront distribution on each iteration (see the sketch after these notes). Just checking that you're being served the latest version is not enough.
I couldn't find a definition anywhere of the robots.txt syntax understood by Amazon CloudFront, so it was trial and error.
To test the results, use the "Fetch and Render" tool in Google Webmaster Tools and their Mobile-Friendly Test (https://www.google.com/webmasters/tools/mobile-friendly/).
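As a sketch of that invalidation step (the distribution ID below is a placeholder), the AWS CLI can invalidate just robots.txt:
# Invalidate only robots.txt so the next request pulls a fresh copy from the origin
aws cloudfront create-invalidation --distribution-id E1234EXAMPLE --paths "/robots.txt"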
I don't understand why CloudFront is validating and evaluating my robots.txt. This file is a "deal" between me and the crawlers that come to my site. Amazon has no business in the middle. Messing with my robots.txt is just plain stupid.
It never crossed my mind that CloudFront could be second-guessing my robots.txt syntax.
Found out the problem:
CloudFront reads the robots.txt and prevents serving the content, but it somehow parses it differently from how robots should, I guess.
For instance, the following content on robots.txt:
Disallow: */wp-contents/
Allow: */wp-contents/themes/
When Googlebot reads it itself, it indexes the content;
When CloudFront reads it, it doesn't consider the 'Allow' directive, and refuses to serve anything inside */wp-contents/themes/.
Short answer: check the robots.txt on your CloudFront distribution; it might be the problem. Update it with a corrected version, invalidate it, and it should work!
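A quick way to check which robots.txt the distribution is actually serving (the hostname below is a placeholder for your distribution domain or CNAME):
# Fetch robots.txt as served through CloudFront rather than straight from the origin
curl https://cdn1.example.com/robots.txt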
Google does not block external resources from being indexed via a robots.txt in the root of the main site. A subdomain, a CDN or similar is classed as an external domain, therefore the only way to block the content is via a header response on the file served by the CDN itself, or by using a robots.txt on the CDN or subdomain.
Using:
#Google images
User-agent: Googlebot-Image
Disallow: /
should only block images that are local; you will need to do the same on the CDN.
Chances are it's a header response problem, and you should do a curl on one of the files on the CDN. The response should look something like:
HTTP/1.0 200 OK
Cache-Control: max-age=86400, public
Date: Thu, 10 May 2012 07:43:51 GMT
ETag: b784a8d162cd0b45fcb6d8933e8640b457392b46
Last-Modified: Tue, 08 May 2012 16:46:33 GMT
X-Powered-By: Express
Age: 7
Content-Length: 0
X-Cache: Hit from cloudfront
X-Amz-Cf-Id: V_da8LHRj269JyqkEO143FLpm8kS7xRh4Wa5acB6xa0Qz3rW3P7-Uw==,iFg6qa2KnhUTQ_xRjuhgUIhj8ubAiBrCs6TXJ_L66YJR583xXWAy-Q==
Via: 1.0 d2625240b33e8b85b3cbea9bb40abb10.cloudfront.net (CloudFront)
Connection: close
Things to look out for are:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
X-Robots-Tag: googlebot: noindex
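As a rough check (the URL below is one of the blocked files from the question, used as a placeholder), you can fetch only the headers and look for an X-Robots-Tag:
# -s silences progress output, -I requests headers only; grep for a robots directive header
curl -sI https://cdn5.example.com/wp-content/themes/magazine/images/nobg.png | grep -i x-robots-tag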