De-index URL parameters by value

Upon reading this over, I realize this question is lengthy, so allow me to provide a one-sentence summary: I need to get Google to de-index URLs that have parameters with certain values appended.
I have a website example.com with language translations.
There used to be many translations but I deleted them all so that only English (Default) and French options remain.
When one selects a language option, a parameter is added to the URL. For example, the home page:
example.com (default)
example.com/main?l=fr_FR (French)
I added a robots.txt to stop Google from crawling any of the language translations:
# robots.txt generated at www.mcanerin.com
User-agent: *
Disallow:
Disallow: /cgi-bin/
Disallow: /*?l=
So any pages containing "?l=" should not be crawled. I checked in Google Webmaster Tools (GWT) using the robots.txt testing tool, and it works.
But under HTML Improvements, the previously crawled language translation URLs remain indexed. The internet says to return a 404 in the header of the removed URLs so Google knows to de-index them.
I checked to see what my CMS would throw up if I visited one of the URLs that should no longer exist.
This URL was listed in GWT under duplicate title tags (one of the reasons I want to scrub up my URLs):
example.com/reports/view/884?l=vi_VN&l=hy_AM
This URL should not exist - I removed the language translations. The page loads when it should not! I played around. I typed example.com?whatever123
It seems that parameters always load as long as everything before the question mark is a real URL.
So if Google has indexed all these URLs with parameters, how do I remove them? I cannot check whether a 404 is being returned, because the page always loads; it's the parameter that needs to be de-indexed.
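A rough sketch of how the status code could be checked outside the browser, using Python's requests library (example.com is the placeholder domain from above, and the parameter value is just one of the removed languages):

import requests

# Fetch one of the parameterized URLs and report the raw status code.
resp = requests.get(
    "https://example.com/reports/view/884",
    params={"l": "vi_VN"},      # one of the removed language values
    allow_redirects=False,      # surface 301/302 responses instead of following them
)
print(resp.status_code)         # 200 means the page still resolves; a 404/410 would not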
3 Comments
I would rewrite your URLs so that the language is a directory:
/main
/fr_FR/main
That has two advantages:
robots.txt can be used to block certain languages but not others (without resorting to wildcards)
You can add directories to Google Webmaster Tools and change the geographic targeting of the directories (in this case the French directory should be geo-targeted to France)
See: How should I structure my URLs for both SEO and localization?
If you can't do that, you could noindex content based on parameter values. The easiest option is to use a meta robots noindex tag. You would emit the tag only when the parameter has a value that you don't want Google to include in the index.
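A minimal sketch of that check, assuming a Python-backed template layer (the actual CMS language is unknown, and KEPT_LANGUAGES / robots_meta_tag are made-up names for illustration):

from urllib.parse import parse_qs, urlparse

KEPT_LANGUAGES = {"fr_FR"}  # language values that should stay in the index

def robots_meta_tag(url: str) -> str:
    """Return a noindex meta tag when the ?l= value is a removed language."""
    languages = parse_qs(urlparse(url).query).get("l", [])
    if languages and not all(lang in KEPT_LANGUAGES for lang in languages):
        return '<meta name="robots" content="noindex">'
    return ""  # parameter-free URLs and kept languages get no tag

print(robots_meta_tag("https://example.com/reports/view/884?l=vi_VN&l=hy_AM"))
# -> <meta name="robots" content="noindex">
print(robots_meta_tag("https://example.com/main?l=fr_FR"))
# -> (empty string, so French stays indexed)

Keep in mind that Google can only see a noindex tag on pages it is allowed to crawl, so the Disallow: /*?l= rule would need to be lifted while the tag does its work.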
You can do several things:
One is robots.txt, although take good care what you add in there, as it can ALSO de-index some correct URLs (i.e. if Google did index https://example.com/reports/view/884?l=fr_FR, you don't want to just lose it, right?).
Which brings me to the second:
Since you want to "de-index" old URLs, why not 301 them to "correct" ones, or return a hard 404 or 410 on the genuinely incorrect ones? (See the sketch after this list.)
Use rel="canonical" so you tell Google what the correct page is (although it is "other" content).
Use Google Webmaster Tools and specify what a parameter is used for. For the l= parameter you could specify "translates".
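A rough sketch of the 301/410 idea from the second point, using Flask purely for illustration (the asker's CMS is unknown, and the route and REMOVED_LANGUAGES values are hypothetical):

from flask import Flask, abort, redirect, request

app = Flask(__name__)
REMOVED_LANGUAGES = {"vi_VN", "hy_AM"}  # example values, not an exhaustive list

@app.route("/reports/view/<int:report_id>")
def view_report(report_id):
    lang = request.args.get("l")
    if lang in REMOVED_LANGUAGES:
        # Hard 410 Gone: tells Google the translated URL was removed on purpose.
        abort(410)
        # Alternatively, 301 to the canonical English URL instead of the 410:
        # return redirect(f"/reports/view/{report_id}", code=301)
    # ...otherwise render the report as usual, ideally emitting a
    # rel="canonical" link that points at the parameter-free URL...
    return f"Report {report_id}"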
All permalinks will be indexed by search engines, excluding permalinks that contain a ?, =, or & symbol:
User-agent: *
Disallow:
Disallow: /cgi-bin/
Disallow: /*?*
Disallow: /*?
Disallow: /*=*
Disallow: /*=
Disallow: /*&*
Disallow: /*&
Allow: /
This will block all permalink formats like /main?l=de_DE or /main?l=Any_Value_Here from search engines, excluding /main?l=fr_FR:
User-agent: *
Disallow:
Disallow: /cgi-bin/
Disallow: /main?l=*
Allow: /
Allow: /main?l=fr_FR