Google doesn't crawl CDN files
I've noticed that Google Webmaster Tools is reporting a lot of blocked resources on my website.
Right now all the "blocked resources" are .css, .js and image files (.jpg, .png) that I serve from the CloudFront CDN.
I've spent a lot of time testing and trying to figure out why Google doesn't crawl these files and reports them with a "blocked resource" status.
Currently I serve these files from several hostnames, like: cdn1.example.com, cdn2.example.com, …
cdn1, cdn2 and the others are CNAMEs to the CloudFront distribution name.
Test: I've tried using the CloudFront distribution domain directly (no CNAME), but the problem persists.
Currently my robots.txt looks like this:
# Google AdSense
User-agent: Mediapartners-Google
Disallow:
#Google images
User-agent: Googlebot-Image
Disallow: /
User-agent: *
Disallow: /homepage
Disallow: /index.php*
Disallow: /uncategorized*
Disallow: /tag/*
Disallow: *feed
Disallow: */page/*
Disallow: *author*
Disallow: *archive*
Disallow: */category*
Disallow: *tag=*
Disallow: /test*
Allow: /
Here are examples of blocked files on one example page:
cdn1.example.com/wp-content/plugins/wp-forecast/wp-forecast-default.css
cdn9.example.com/wp-content/plugins/bwp-minify/min/?f=wp-content/themes/magazine/css/font-awesome.min.css,wp-content/themes/magazine/css/responsive.css
cdn5.example.com/wp-content/themes/magazine/images/nobg.png
cdn6.example.com/wp-content/plugins/floating-social-bar/images/fsb-sprite.png
cdn5.example.com/wp-content/uploads/2013/11/Design-Hotel-3-80x80.jpg
cdn5.example.com/wp-content/uploads/2013/11/Marta-Hotel-7-270x225.jpg
I've even tried allowing everything in robots.txt, but I always get the same result.
I've also looked carefully at the CloudFront settings in the AWS console and see nothing that could be related (I don't use, and have never used, the "Restrict Viewer Access (Use Signed URLs or Signed Cookies)" option).
At this point I've spent a lot of time looking into this and have no more ideas.
Can anyone think of a reason why Googlebot would be blocked from crawling files hosted on Amazon CloudFront?
4 Comments
Create a robots.txt file in an S3 bucket.
Add the bucket as another origin for your CloudFront distribution.
Give the bucket's behavior higher precedence than your website's.
Invalidate your site's robots.txt on CloudFront.
After doing the above, Google will read the site's robots.txt when crawling your site, and will see the different robots.txt when following links to your CDN.
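As a rough sketch of that setup (assuming the AWS CLI is configured, and with "your-cdn-bucket" as a placeholder for the actual bucket name), a fully permissive robots.txt could be created and uploaded like this:
# Write a permissive robots.txt that lets crawlers fetch everything from the CDN
printf 'User-agent: *\nDisallow:\n' > robots.txt
# Upload it to the S3 bucket that backs the CDN origin (placeholder bucket name)
aws s3 cp robots.txt s3://your-cdn-bucket/robots.txt
The idea is that requests for /robots.txt on the CDN hostnames are then answered from the bucket rather than from your website.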
So the solution seems to be that Amazon CloudFront also evaluates my robots.txt, and somehow uses different syntax rules than Google does.
The working version of my robots.txt is the following:
User-agent: Googlebot-Image
Disallow: /
User-agent: *
Disallow: /homepage
Disallow: /uncategorized
Disallow: /page
Disallow: /category
Disallow: /author
Disallow: /feed
Disallow: /tags
Disallow: /test
A very important note: this isn't performing the exact same functions as before. In fact, I took out all blank lines, wildcards and "Allow" directives, meaning that the end result is not the same... but I think it's close enough for me.
For example, it doesn't exclude tag pages when the tag is passed in a query string...
Three important notes:
If you're testing with this, don't forget to invalidate robots.txt in the CloudFront distribution on each iteration (see the sketch after these notes). Just checking that you're being served the latest version is not enough.
I couldn't find a definition anywhere of the robots.txt syntax understood by Amazon CloudFront, so it was trial and error.
To test the results, use the "Fetch and Render" tool in Google Webmaster Tools and their Mobile-Friendly Test (https://www.google.com/webmasters/tools/mobile-friendly/).
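As a sketch of that invalidation step (the distribution ID below is a placeholder), the AWS CLI can invalidate just robots.txt:
# Invalidate only robots.txt so the next request pulls a fresh copy from the origin
aws cloudfront create-invalidation --distribution-id E1234EXAMPLE --paths "/robots.txt"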
I don't understand why CloudFront is validating and evaluating my robots.txt. This file is a "deal" between me and the crawlers that come to my site. Amazon has no business in the middle. Messing with my robots.txt is just plain stupid.
It never crossed my mind that CloudFront could be second-guessing my robots.txt syntax.
Found out the problem:
CloudFront reads the robots.txt and prevents serving the content, but it somehow parses it differently from how robots should, I guess.
For instance, the following content on robots.txt:
Disallow: */wp-contents/
Allow: */wp-contents/themes/
When Googlebot reads it itself, it indexes the content;
When CloudFront reads it, it doesn't consider the 'Allow' directive, and refuses to serve anything inside */wp-contents/themes/.
Short answer: check the robots.txt on your CloudFront distribution; it might be the problem. Update it with a corrected version, invalidate it, and it should work!
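A quick way to check which robots.txt the distribution is actually serving (the hostname below is a placeholder for your distribution domain or CNAME):
# Fetch robots.txt as served through CloudFront rather than straight from the origin
curl https://cdn1.example.com/robots.txt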
Google does not block external resources from being indexed via a robots.txt in the root of the main site. A subdomain, a CDN or similar is classed as an external domain, therefore the only way to block the content is via a header response on the file served by the CDN itself, or by using a robots.txt on the CDN or subdomain.
Using:
#Google images
User-agent: Googlebot-Image
Disallow: /
should only block images that are local; you will need to do the same on the CDN.
Chances are it's a header response problem, and you should do a curl on one of the files on the CDN. The response should look something like:
HTTP/1.0 200 OK
Cache-Control: max-age=86400, public
Date: Thu, 10 May 2012 07:43:51 GMT
ETag: b784a8d162cd0b45fcb6d8933e8640b457392b46
Last-Modified: Tue, 08 May 2012 16:46:33 GMT
X-Powered-By: Express
Age: 7
Content-Length: 0
X-Cache: Hit from cloudfront
X-Amz-Cf-Id: V_da8LHRj269JyqkEO143FLpm8kS7xRh4Wa5acB6xa0Qz3rW3P7-Uw==,iFg6qa2KnhUTQ_xRjuhgUIhj8ubAiBrCs6TXJ_L66YJR583xXWAy-Q==
Via: 1.0 d2625240b33e8b85b3cbea9bb40abb10.cloudfront.net (CloudFront)
Connection: close
Things to look out for are:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
X-Robots-Tag: googlebot: noindex
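As a rough check (the URL below is one of the blocked files from the question, used as a placeholder), you can fetch only the headers and look for an X-Robots-Tag:
# -s silences progress output, -I requests headers only; grep for a robots directive header
curl -sI https://cdn5.example.com/wp-content/themes/magazine/images/nobg.png | grep -i x-robots-tag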