Block test domains from being indexed
We have a web application that serves many websites, each of which has a test and a production environment.
The application consists of a single folder and is hosted on IIS.
We already apply noindex, nofollow on the test domains and index, follow on the production domains, but this is not working for static files (images, PDF documents, etc.).
Is there a way to set up robots.txt so that every single test domain is disallowed?
Example
Disallow: test.domain1.com
Disallow: test.domain2.com
etc.
Note that there are 200+ domains and they are added and removed very quickly, so robots.txt would be a good solution for us. We do not have access to each domain's Google Search Console to request removal from Google's index. The solution should also work for Bing and other search engines.
Can you see any way to achieve this?
We already have noindex nofollow, index follow on respective domains
Presumably this is implemented using a <meta name="robots" ...> element in the HEAD section of the HTML? (Strictly speaking these values should be comma-separated, i.e. "noindex, follow".)
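On the test domains, such an element might look like this (a sketch; pick the directives you actually need):
<meta name="robots" content="noindex, nofollow">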
but it's not working for static files
For static files you would need to use the corresponding X-Robots-Tag HTTP response header. For example:
X-Robots-Tag: noindex, nofollow
If the robots meta tag is currently working OK for you, you could instead use the X-Robots-Tag header for everything on that domain.
Reference: developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag#using-the-x-robots-tag-http-header
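Since your sites are hosted on IIS, one way to add this header only on the test domains is an outbound rule with the URL Rewrite module. This is only a sketch: it assumes the URL Rewrite module is installed and that all test hostnames start with "test." (adjust the {HTTP_HOST} pattern to your actual naming scheme):

<configuration>
  <system.webServer>
    <rewrite>
      <outboundRules>
        <!-- Set the X-Robots-Tag response header on test hosts only -->
        <rule name="NoindexTestDomains">
          <match serverVariable="RESPONSE_X_ROBOTS_TAG" pattern=".*" />
          <conditions>
            <add input="{HTTP_HOST}" pattern="^test\." />
          </conditions>
          <action type="Rewrite" value="noindex, nofollow" />
        </rule>
      </outboundRules>
    </rewrite>
  </system.webServer>
</configuration>

Because IIS applies the rule to every response, this would cover static files (images, PDFs) as well as HTML pages.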
Disallow: test.domain1.com
Disallow: test.domain2.com
Aside: robots.txt doesn't work like this. It operates on URL paths, not hosts/domains. To disallow all crawling you would serve the same robots.txt from the root of each test domain. For example:
User-agent: *
Disallow: /
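Because all 200+ sites are served from the same folder, you can't drop a different physical robots.txt into each test site. One way around that on IIS, again assuming the URL Rewrite module and a "test." hostname prefix, is to rewrite robots.txt requests on test hosts to a second file (robots-test.txt is a hypothetical name; it would contain the two lines above):

<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <!-- Serve a blanket-disallow robots file on test hosts only -->
        <rule name="TestRobotsTxt" stopProcessing="true">
          <match url="^robots\.txt$" />
          <conditions>
            <add input="{HTTP_HOST}" pattern="^test\." />
          </conditions>
          <action type="Rewrite" url="/robots-test.txt" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>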
Note, however, that robots.txt blocks crawling. It doesn't necessarily prevent URLs from being indexed if they get linked to. To specifically prevent indexing you use the robots meta tag (as you are already doing) and/or the X-Robots-Tag header. Don't combine those with a robots.txt Disallow, since blocking crawling prevents the crawler from ever seeing the noindex directive.
I would not recommend relying on the robots.txt file. Search engines may respect it, or they may not; Disallow is a recommendation rather than a rule.
If you want to be sure your development environments are hidden from the public, in my opinion the only reliable way to block robots / search engines from indexing pages is password protection, e.g. via a .htaccess file on Apache or the equivalent authentication settings on IIS. AFAIK .htaccess files and password rules can also be generated on the fly.
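Since the asker's sites run on IIS, where .htaccess does not apply, the same idea might look roughly like this in web.config. A sketch only: the authentication sections are usually locked at the server level and must be unlocked before a site-level web.config can set them, and you would scope this to the test sites:

<configuration>
  <system.webServer>
    <security>
      <authentication>
        <!-- Deny anonymous access so crawlers receive 401 responses and index nothing -->
        <anonymousAuthentication enabled="false" />
        <basicAuthentication enabled="true" />
      </authentication>
    </security>
  </system.webServer>
</configuration>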