Using canonical links to keep a site out of Google search results
I have two copies of a website: a live site at www.example.com and a test site at dev.example.net. (Note that the test site is a subdomain of a different parent domain.) Both sites have exactly the same URL structure and the same page content, but different actual HTML. I do not want dev.example.net to appear in search results.
In the <head> of every single page on dev.example.net there is a <link rel="canonical" href="https://www.example.com/PATH"> (i.e., a link to the equivalent page on the site we want people to see). By my reckoning, this should keep dev.example.net from appearing in search results at all. And yet it persistently shows up. (A search for our company name shows example.com as the first result, and dev.example.net as the second.)
Am I misunderstanding what I'm doing here? Should I add noindex tags to the pages on dev.example.net?
Use noindex to keep pages out of Google’s index
The only correct way to keep pages out of Google’s index is to use noindex.
At the risk of being pedantic, Google’s (or any search engine’s) search results are composed of items that have been indexed. Google honors a couple of ways of being told to omit a page from its index. If you don’t use these methods, don’t be surprised if your page ends up in the search results.
So the short answer is yes, use noindex to keep things out of the index. Or better yet, use the X-Robots-Tag HTTP header (see below).
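For reference, the two forms Google documents look like this. The first goes in each page’s <head>; the second is sent as an HTTP response header:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex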
Don’t use robots.txt for this
robots.txt prevents pages from being spidered, which is a related, but distinct, concept from indexing. Many pages that have never been spidered but have strong backlinks can and do rank well in Google’s search results.
You may have seen some; they look like the example at the bottom of this Moz.com article.
Google explains:
robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.
Canonical URLs don’t exclude anything from Google’s index
Canonical URLs tell Google that the referring and referred pages represent the same content, “consolidating link signals for the duplicate or similar content”; that is, they help with SEO.
But to really drive traffic to one particular URL, Google suggests:
It's a good idea to pick one of those URLs as your preferred (canonical) destination, and use 301 redirects to send traffic from the other URLs to your preferred URL. A server-side 301 redirect is the best way to ensure that users and search engines are directed to the correct page. The 301 status code means that a page has permanently moved to a new location.
But this 301 solution won’t help you, because you need users to be able to see the dev. site.
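(For reference, if you ever do want such a redirect: assuming Apache with mod_alias enabled, a single directive on the non-canonical host sends every path to its counterpart on the canonical host.

# In the non-canonical host's configuration or .htaccess:
Redirect 301 / https://www.example.com/

Other servers have equivalents; the point is that the redirect happens server-side, before any page is served.)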
A note on canonical and alternative URLs
Note, it is perfectly reasonable for Google to send traffic to non-canonical URLs; different presentations of the same content can be appropriate in different contexts. Consider content you serve at both your regular “www.” site and a mobile “m.” site that is highly optimized for phones. Google might present a non-canonical PDF version if the user included “PDF” in their search phrase.
But why does Google like your “dev.” site anyway?
Google’s algorithm doesn’t care that your dev site might have unapproved content, and your users probably don’t either. (It also doesn’t much care how you or your bosses feel about this.)
Here are a few things Google does care about:
Google rewards freshness of content. If your dev site changes much more often (it does, doesn’t it?), that may be a positive SEO signal.
People on the web might have discovered your dev site and be linking to it for one reason or another.
If your dev site has significant technical upgrades, or gets less traffic than your production site, it might be faster, and Google rewards speed.
Why an HTTP header solution would be better for you than a meta tag
If you use the X-Robots-Tag HTTP header to return the noindex instruction, it can be configured on the web server rather than in your HTML files or other artifacts. So you won’t need to change anything when you promote the files to your production site.
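For example, assuming Apache with mod_headers enabled, one directive in the dev site’s virtual host (or its .htaccess) covers every response it serves:

# dev.example.net only: tell crawlers not to index anything served here
Header set X-Robots-Tag "noindex"

The nginx equivalent is add_header X-Robots-Tag "noindex" always;. Either way, the directive lives in configuration specific to the dev host, so nothing in the site’s files needs to change at promotion time.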
A sub-domain is a separate site and can be treated as such.
There are two things you can do.
1] Create a robots.txt file in the root of the sub-domain with:
User-agent: *
Disallow: /
This will disallow crawling of the entire site.
Here is a link that should be helpful for understanding robots.txt files:
www.robotstxt.org/robotstxt.html
2] If you are able to, it would be wise to add a NoIndex meta-tag with:
<meta name="robots" content="noindex">
This code will prevent the page from being indexed.
Here is a link that should be helpful for understanding the NoIndex meta-tag:
en.wikipedia.org/wiki/Noindex
Either one should work; however, if you can do both without much effort, that may help. Option 1 is the easiest to implement. (Note Google’s caveat quoted above, though: a page disallowed in robots.txt may never be crawled, so a noindex tag on it may never be seen.)