
What are the .htaccess mechanics of preventing search engines from indexing PDF files?

@Lengel546

Posted in: #Indexing #Pdf #Seo

There are already a variety of posts on how to block certain files (in my case, PDFs) from a search engine like Google. The most relevant for this post was here: How to protect PDF file from indexing. However, in that post, the final answer was never quite clear. Based on these three sites:


Playing with the X-Robots-Tag
Preventing your site from being indexed, the right way
Google Developers Robots Meta Tag


I think I understand the recommendation: we should not use robots.txt to block crawling of files we want kept out of the index; we should send the X-Robots-Tag HTTP header instead. Since a PDF cannot carry a robots meta tag the way an HTML page can, the header is the only way to deliver the noindex directive for such files.

This brings me to three questions, mostly so I can be absolutely sure that what follows would work.

Question 1: Suppose I want to disallow search engine indexing of any files within a subfolder of my site, mysite.com/secret
I would create a .htaccess file in that subfolder with the following:

Header set X-Robots-Tag "noindex, nofollow"
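
As I understand it, this relies on mod_headers being enabled. Once it is in place, the header can be checked with curl -I https://mysite.com/secret/some.pdf (the file name there is just a placeholder) and should show X-Robots-Tag: noindex, nofollow in the response.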


Alternatively, if I wish to disallow indexing only of the PDFs in the secret subfolder, I would use (again within a separate .htaccess in that subfolder):

<FilesMatch ".doc$">
Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>


Question 2: Is there any advantage to doing the same for the main .htaccess file in the website root directory? If so, how do you alter the above two statements for subdirectories? On Google's site they suggest:

<Files ~ ".pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>


Do I change it to "secret/.pdf$" instead? I am unsure of forward vs. backward slashes.

Question 3: Suppose a separate page, possibly on a third-party site, links to the PDF in the secret folder. Even with the .htaccess X-Robots-Tag block in place, does that external link break the noindex directive?


@Heady270

You have done your research and seem to have a good handle on the situation. To sum up:

Using robots.txt would prevent search engines from crawling the PDF files. If third-party sites linked directly to the PDF files, search engines might still include the URLs in the search index (but would not be able to index their contents).

Using X-Robots-Tag "noindex, nofollow" will prevent search engines from indexing the PDF files even though they may crawl them. Third-party sites linking directly to the files will not cause the PDFs to get indexed.

You cannot use both methods together. If you block the PDF files with robots.txt, search engines will never see the X-Robots-Tag header and may still index the URLs.
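
For illustration, a robots.txt rule like the following is exactly the kind of block to avoid here, since it stops crawlers from ever fetching the PDFs and therefore from ever seeing the header (the /secret/ path mirrors the example in the question):

# robots.txt - this blocks crawling of /secret/, which would hide the X-Robots-Tag header
User-agent: *
Disallow: /secret/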



Your FilesMatch block looks correct if you substitute pdf for doc. The directives inside it, though, look like they would allow indexing (index rather than noindex), so you may have pasted in the wrong thing.
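
A corrected version of that block, assuming the goal is to keep the PDFs in that folder out of the index, would be:

# Mark every file ending in .pdf as noindex, nofollow
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>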

If you wanted to centralize this in the root .htaccess instead, be aware that Files and FilesMatch match only the file name, never the directory path, so a pattern such as secret/.*\.pdf$ would never match anything. The per-directory .htaccess is the simpler approach; the only advantage of centralizing is having all your rules in one place, and that requires a path-aware mechanism, as sketched below.
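
One path-aware option, sketched here assuming Apache 2.4 or later with mod_headers enabled and the /secret/ folder from the question, is an If section in the root .htaccess that tests the request path:

# Send noindex, nofollow for any PDF requested under /secret/
<If "%{REQUEST_URI} =~ m#^/secret/.*\.pdf$#i">
Header set X-Robots-Tag "noindex, nofollow"
</If>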
