
Block test domains from being indexed

@Sue5673885

Posted in: #Iis #Index #RobotsTxt #Seo

We have a web application that handles many websites, each with a test and a production environment.

This application has only one folder and it's hosted with IIS.
We already have noindex nofollow, index follow on the respective domains, but it's not working for static files (images, PDF documents, etc.).

Is there a way to set up robots.txt to disallow every single test domain?

Example

Disallow: test.domain1.com
Disallow: test.domain2.com


etc.

Take note that there are 200+ domains and they are added and removed very quickly, so robots.txt would be a good solution for us. We do not have access to Google Search Console for each domain to request removal from Google's index. The solution should also work for Bing and other search engines.

Can you see any way to achieve this?


2 Comments


 

@Jamie184

We already have noindex nofollow, index follow on respective domains


Presumably this is implemented using a <meta name="robots" ...> element in the HEAD section of the HTML? (Strictly speaking these values should be comma-separated, i.e. "noindex, follow".)
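On a test domain, for instance, that element would look like:

<meta name="robots" content="noindex, nofollow">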


but it's not working for static files


For static files you would need to use the corresponding X-Robots-Tag HTTP response header. For example:

X-Robots-Tag: noindex, nofollow
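
Since you're hosting with IIS, one way to send this header is a custom header in web.config. A minimal sketch (note this adds the header to every response from the site, so it only suits the test sites; with all your domains sharing one folder you would have to apply it conditionally, e.g. with a URL Rewrite outbound rule keyed on the Host header):

<configuration>
  <system.webServer>
    <httpProtocol>
      <customHeaders>
        <!-- Tell crawlers not to index or follow anything this site serves -->
        <add name="X-Robots-Tag" value="noindex, nofollow" />
      </customHeaders>
    </httpProtocol>
  </system.webServer>
</configuration>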


If the robots meta tag is currently working for you, you could instead use the X-Robots-Tag header for everything on that domain.

Reference: developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag#using-the-x-robots-tag-http-header


Disallow: test.domain1.com
Disallow: test.domain2.com


Aside: robots.txt doesn't work like this. It works on URLs, not hosts/domains. To disallow all crawling, you would have the same robots.txt in the root of each domain. For example:

User-agent: *
Disallow: /
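
Since all 200+ domains are served from the same physical folder, you can't drop a different robots.txt into each root. One sketch, assuming your test hosts share a naming pattern like test.* and using a hypothetical disallow-all file named robots-disallow.txt, is an IIS URL Rewrite rule (inside <system.webServer> in web.config; requires the URL Rewrite module) that serves that file only on test hosts:

<rewrite>
  <rules>
    <!-- Serve a disallow-all robots.txt on test hosts only -->
    <rule name="TestRobotsTxt" stopProcessing="true">
      <match url="^robots\.txt$" />
      <conditions>
        <!-- Hypothetical host pattern: adjust to how your test hosts are named -->
        <add input="{HTTP_HOST}" pattern="^test\." />
      </conditions>
      <action type="Rewrite" url="robots-disallow.txt" />
    </rule>
  </rules>
</rewrite>

where robots-disallow.txt contains the two lines above.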


Note, however, that robots.txt blocks crawling. It doesn't necessarily prevent URLs from being indexed if the URLs get linked to. To specifically prevent indexing you use the robots meta tag (as you are already doing) and/or the X-Robots-Tag header. Don't combine these with a robots.txt Disallow, though, since blocking crawling prevents the crawler from ever seeing the robots meta tag or header.



 

@Pierce454

I would not recommend relying on the robots.txt file. Search engines may or may not respect it; Disallow is a recommendation rather than a rule.

If you want to be sure your development environments are hidden from the public, in my opinion the only reliable way to block robots and search engines from indexing pages is to use password protection, e.g. via an .htaccess file. As far as I know, .htaccess files and password rules can also be generated on the fly.
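A minimal sketch of that kind of protection (Apache syntax; the .htpasswd path is a placeholder, and since the question describes an IIS host you would use IIS's built-in authentication instead):

# Require a login for everything under this directory
AuthType Basic
AuthName "Development environment"
AuthUserFile /path/to/.htpasswd
Require valid-user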


