
Google Webmaster Tools: Robots disallow does not seem to be working for staging site

@Rivera981

Posted in: #GoogleSearchConsole #RobotsTxt

I've added robots.txt on my staging site at staging.mydomain.com as:

User-agent: *
Disallow: /


I then added and verified the staging site in Google Webmaster Tools.
In Crawl > Blocked URLs, I can see robots.txt listed with the status 200 (Success).
Further down that page, when I clicked the Test button to test staging.mydomain.com/, it gave me this result:


Allowed
Detected as a directory; specific files may have different restrictions


This looks like the wrong result. What have I done wrong? Do I have to wait some time for Google to read the robots.txt?

Within the staging site, I have other folders such as:
staging.mydomain.com/test1/
staging.mydomain.com/test2/


Obviously I want to disallow indexing of all of these, but when I do a test for these folders, the result also shows up as Allowed. Do I need to add a robots.txt within each of the sub-directories?





1 Comment


@Fox8124981

It looks like what you've done is perfectly fine.

A typical robots.txt for a production site might be as simple as:

User-agent: *
Disallow:


This is the least restrictive. It says that all crawlers are allowed to crawl the entire site.

For our dev or staging site, we want to use the following:

User-agent: *
Disallow: /


This requests that the entire site not be crawled.
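
If you want to double-check how a standards-compliant crawler interprets those rules without waiting for Webmaster Tools, Python's urllib.robotparser can evaluate them locally. This is only an illustrative sketch; the staging.mydomain.com URLs are the placeholders from the question:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the robots.txt body as a list of lines, so the rules
# can be tested without fetching anything over the network.
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

for url in ("http://staging.mydomain.com/",
            "http://staging.mydomain.com/test1/",
            "http://staging.mydomain.com/test2/"):
    print(url, "allowed" if rp.can_fetch("Googlebot", url) else "blocked")


All three URLs come back as blocked, which also answers the sub-directory question: a single Disallow: / at the root covers every path, so no extra robots.txt files are needed inside /test1/ or /test2/.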

But sometimes, if you didn't take proper precautions before creating your dev or staging site, there's a good chance that the search engines have already found your work in progress.

What now? Well, let's be careful here.

1. First, understand that search engines will cache your site for a certain length of time.

2. Second, you'll need to keep in mind that restricting crawling of your site does not mean that existing indexed pages will disappear from search engine results.

If you find your staging site pages in search results, it's a good idea to go ahead and tell the search engines not to index each page. The best way is to add a "noindex" meta tag to all of your pages. The noindex tag looks like this:

<meta name="robots" content="noindex" />


Or, the advised approach:

1. Add authentication (HTTP or otherwise) in front of requests.

2. Respond with an appropriate response code if access is not permitted (e.g. 401 Unauthorized).

3. Keep everything else from the basic approach above (robots.txt and the noindex tag).

Adding a robots.txt prevents search engines from crawling the content. However, that doesn't mean they won't index the URL: if a search engine knows about a given URL, it may still add it to the search results index. You'll sometimes see these in the search results; the title tends to be the URL with no description. To prevent this from happening, the search engines need to be told not to show the content or the URLs. By adding authentication in front and not responding with a 200 OK status code, you send a strong signal to the engines not to add these URLs to their index. In my experience, I have never seen a page that returns a 401 response code listed in a search engine index.
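
In practice you'd usually configure this in the web server itself (e.g. Apache or nginx basic auth), but as a self-contained sketch of the idea, here is a minimal Python server that answers 401 Unauthorized unless the request carries valid HTTP Basic credentials; the username, password, and port are placeholders:

import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder credentials for the staging site.
EXPECTED = "Basic " + base64.b64encode(b"staging:secret").decode()

class StagingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") != EXPECTED:
            # Missing or wrong credentials: reply 401 and request Basic auth.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="staging"')
            self.end_headers()
            return
        # Authenticated: serve the (placeholder) staging content.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"<html><body>Staging content</body></html>")

if __name__ == "__main__":
    HTTPServer(("", 8000), StagingHandler).serve_forever()


A crawler hitting any URL on this server without credentials only ever sees a 401 response and never the page content, which is exactly the signal described above.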


