
What are the .htaccess mechanics of preventing search engines from indexing PDF files?

@Lengel546

Posted in: #Indexing #Pdf #Seo

There are already a variety of posts on how to block certain files (in my case, PDFs) from a search engine like Google. The most relevant for this post was here: How to protect PDF file from indexing. However, in that post, the final answer was never quite clear. Based on these three sites:


Playing with the X-Robots-Tag
Preventing your site from being indexed, the right way
Google Developers Robots Meta Tag


I think I understand the recommendation: we should not use robots.txt to block crawling of files we want kept out of the index; we should send the X-Robots-Tag HTTP header instead. Since a PDF cannot carry a robots meta tag the way an HTML page can, the header is the only way to deliver the noindex directive for such files.

This brings me to three questions, mostly so I can be absolutely sure that what follows would work.

Question 1: Suppose I want to disallow search engine indexing of any files within a subfolder of my site, mysite.com/secret
I would create a .htaccess file in that subfolder with the following:

Header set X-Robots-Tag "noindex, nofollow"
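
As I understand it, this relies on mod_headers being enabled. Once it is in place, the header can be checked with curl -I https://mysite.com/secret/some.pdf (the file name there is just a placeholder) and should show X-Robots-Tag: noindex, nofollow in the response.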


Alternatively, if I wish to disallow indexing only of the PDFs in the secret subfolder, I would use (again within a separate .htaccess in that subfolder):

<FilesMatch ".doc$">
Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>


Question 2: Is there any advantage to doing the same for the main .htaccess file in the website root directory? If so, how do you alter the above two statements for subdirectories? On Google's site they suggest:

<Files ~ ".pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>


Do I change it to "secret/.pdf$" instead? I am unsure of forward vs. backward slashes.

Question 3: Suppose a separate page, possibly on a third-party site, links to the PDF in the secret folder. Even with the .htaccess X-Robots-Tag block in place, does that external link break the noindex directive?


@Heady270

You have done your research and seem to have a good handle on the situation. To sum up:

Using robots.txt would prevent search engines from crawling the PDF files. If third-party sites linked directly to the PDF files, search engines might still include the URLs in the search index (but would not be able to index their contents).

Using X-Robots-Tag "noindex, nofollow" will prevent search engines from indexing the PDF files even though they may crawl them. Third-party sites linking directly to the files will not cause the PDFs to get indexed.

You cannot use both methods together. If you block the PDF files with robots.txt, search engines will never see the X-Robots-Tag header and may still index the URLs.
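
For illustration, a robots.txt rule like the following is exactly the kind of block to avoid here, since it stops crawlers from ever fetching the PDFs and therefore from ever seeing the header (the /secret/ path mirrors the example in the question):

# robots.txt - this blocks crawling of /secret/, which would hide the X-Robots-Tag header
User-agent: *
Disallow: /secret/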



Your FilesMatch block looks correct if you substitute pdf for doc. The directives inside it, though, look like they would allow indexing (index rather than noindex), so you may have pasted in the wrong thing.
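
A corrected version of that block, assuming the goal is to keep the PDFs in that folder out of the index, would be:

# Mark every file ending in .pdf as noindex, nofollow
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>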

If you wanted to centralize this in the root .htaccess instead, be aware that Files and FilesMatch match only the file name, never the directory path, so a pattern such as secret/.*\.pdf$ would never match anything. The per-directory .htaccess is the simpler approach; the only advantage of centralizing is having all your rules in one place, and that requires a path-aware mechanism, as sketched below.
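
One path-aware option, sketched here assuming Apache 2.4 or later with mod_headers enabled and the /secret/ folder from the question, is an If section in the root .htaccess that tests the request path:

# Send noindex, nofollow for any PDF requested under /secret/
<If "%{REQUEST_URI} =~ m#^/secret/.*\.pdf$#i">
Header set X-Robots-Tag "noindex, nofollow"
</If>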
