
Should robots.txt be in the root directory, or can it be in a sub-directory?

@Jennifer507

Posted in: #RobotsTxt #WebCrawlers

I have a sub-directory that I would like not to be visible to the search engine Web crawlers.

One way to do that is to use a robots.txt in the root directory of the server, but I want to avoid that. The reason is that anyone who knows the website URL can read the robots.txt contents and discover the disallowed directories.

I thought of a way to avoid this.
Let X be the name of the sub-directory that I do not want indexed. One way to stop web crawlers from indexing the X directory, and at the same time make it harder for someone to identify the X directory from the root's robots.txt, is to put the robots.txt in the X directory instead of the root directory.

If I follow this solution I have the following questions:


Will the web crawlers "read" the robots.txt if it is in a sub-directory? (given that a robots.txt already exists in the root directory)
If robots.txt is in the X sub-directory, which of these should I use:

User-agent: *
Disallow: /X/


or this

User-agent: *
Disallow: /


2 Comments


@Carla537

No, web crawlers will not read or obey a robots.txt file in a subdirectory. As described on the quasi-official robotstxt.org site:


Where to put it

The short answer: in the top-level directory of your web server.


or on Google's help pages (emphasis mine):


A robots.txt file is a file at the root of your site that indicates those parts of your site you don’t want accessed by search engine crawlers.


In any case, using robots.txt to hide sensitive pages from search results is a bad idea, since search engines can index pages disallowed in robots.txt if other pages link to them. Or, as described on the Google help page linked above:


You should not use robots.txt as a means to hide your web pages from Google Search results. This is because other pages might point to your page, and your page could get indexed that way, avoiding the robots.txt file.


So what should you do instead?


You can let search engines crawl the pages (if they find them), but include a robots meta tag with the content noindex,nofollow. This will tell search engines not to index those pages even if they do find links to them, and not to follow any further links from those pages. (Of course, this will only work for HTML web pages, and only if the pages are not also disallowed in robots.txt, since a crawler has to fetch a page to see the tag.)
For non-HTML resources, you can configure your web server (e.g. using an .htaccess file) to send the X-Robots-Tag HTTP header with the same content.
You can set up password authentication to protect the sensitive pages. Besides protecting the pages from unauthorized human visitors, it will also effectively keep web crawlers away.
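As a rough illustration of the first two options, here is a minimal Python WSGI sketch. It assumes a hypothetical private prefix /X/ (the directory name from the question) and a made-up page body; a real site would more likely set the header in its web server configuration, as mentioned above. It shows the robots meta tag inside an HTML page and the X-Robots-Tag header on the response.

```python
# Hypothetical prefix for the directory that should stay out of search results.
PRIVATE_PREFIX = "/X/"

# Robots meta tag for HTML pages (goes inside <head>).
ROBOTS_META = '<meta name="robots" content="noindex,nofollow">'


def app(environ, start_response):
    """Tiny WSGI app: mark everything under PRIVATE_PREFIX as noindex,nofollow."""
    path = environ.get("PATH_INFO", "/")
    private = path.startswith(PRIVATE_PREFIX)

    headers = [("Content-Type", "text/html; charset=utf-8")]
    if private:
        # The header works for any resource type, not only HTML.
        headers.append(("X-Robots-Tag", "noindex, nofollow"))

    head = ROBOTS_META if private else ""
    body = f"<html><head>{head}</head><body>{path}</body></html>"

    start_response("200 OK", headers)
    return [body.encode("utf-8")]
```

To try it locally, the app can be passed to wsgiref.simple_server.make_server from the standard library. Note that this only asks well-behaved crawlers not to index the pages; it does not restrict access the way password authentication does.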


@Lee4591628

Your robots.txt should be in the root directory and should not have any other name. According to the standard specification:


This file must be accessible via HTTP on the local URL "/robots.txt".
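To make this concrete, here is a small Python sketch using the standard library's urllib.robotparser. The host example.com and the page names are placeholders, and robots_url is a hypothetical helper written for this example. It shows that the robots.txt location is derived from the host alone, whatever the page's directory, and that a rule in the root file is written relative to the site root (Disallow: /X/).

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL a crawler would check for page_url."""
    parts = urlsplit(page_url)
    # Only scheme and host are kept; the page's own directory is ignored.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Rules as they would appear in the root robots.txt.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /X/"])

print(robots_url("https://example.com/X/page.html"))  # https://example.com/robots.txt
print(rules.can_fetch("*", "https://example.com/X/page.html"))  # False
print(rules.can_fetch("*", "https://example.com/other.html"))   # True
```

A file at /X/robots.txt is never requested in this scheme, so its contents would have no effect on crawling.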
