Murray155

How to correctly remove MULTIPLE parameters from the Google index using htaccess and canonical / noindex?

@Murray155

Posted in: #CanonicalUrl #Htaccess #Noindex #Parameters #Url

I found a similar question with a great answer for removing a single URL parameter. However, what if I have 20+ URL parameters that I don't want indexed?

Also, the example solution below assumes a numeric parameter value (e.g. ?id=0 to ?id=9). In my situation I simply don't want anything with the ?id parameter indexed at all, regardless of what follows in the URL string. Let's also say that I don't want the ?start and ?Page parameters indexed either. Can someone help me out with a revised version of the following code?

For NOINDEX:

<IfModule mod_rewrite.c>
RewriteCond %{QUERY_STRING} ^id=([0-9]*)$
RewriteRule .* - [E=NOINDEX_HEADER:1]
</IfModule>

<IfModule mod_headers.c>
Header set X-Robots-Tag "noindex, follow" env=NOINDEX_HEADER
</IfModule>


For CANONICAL:

<IfModule mod_rewrite.c>
RewriteCond %{QUERY_STRING} ^id=([0-9]*)$
RewriteRule .* - [E=CANONICAL_HEADER:1]
</IfModule>

<IfModule mod_headers.c>
Header set Link '%{HTTP_HOST}%{REQUEST_URI}e; rel="canonical"' env=CANONICAL_HEADER
</IfModule>


Thank you @Evgeniy and @JohnMueller for the above code.
Reference: How to correctly remove parameters from the Google index?


2 Comments


@BetL925

In addition to redirects and meta tags, you can prevent Google from indexing specific URL parameters by configuring them in Google Search Console: support.google.com/webmasters/answer/6080550?hl=en
You can set a parameter to "Doesn't affect page content." This is for tracking parameters that don't actually change the page. This setting allows Googlebot to crawl those URLs but then index the version without the parameter.

You can also set parameters to "Changes page content" and then "Crawl no URLs". This causes Googlebot to not crawl those URLs with the parameters on them at all. They would then mostly fall out of the index. You could use this for pagination and sorting parameters that cause duplicate content compared to the page without any parameters.


@Ann8826881

You only really need to set the rel="canonical" header. This should be sufficient to ensure that only the canonical URL (i.e. the one with no URL params) appears in the SERPs. Also setting a noindex X-Robots-Tag header for such URLs would seem to be overkill (and a tad risky) IMO.

Presumably you are unable to set a rel="canonical" link element in the HTML itself?


...what if I have 20+ URL parameters that I don't want indexed?


Is it safe to say that you don't want any URL with any URL params (ie. any query string) indexed? In which case you can simply change your RewriteCond directive to read:

RewriteCond %{QUERY_STRING} .


That is, the condition matches whenever there is a non-empty query string, whatever it contains.

If, however, you want to exclude 20 specific URL params then you are going to have to name every one of them. For example:

RewriteCond %{QUERY_STRING} (?:^|&)(id|start|page|another)=


The (?:^|&) is a non-capturing group to ensure we only match these specific param names and not something like sid or lastpage, etc. (if they could possibly be URL param names).
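
If you still want the noindex variant from the question for just those named parameters, a minimal sketch could look like the following. It assumes the parameters are id, start and page (the names used in the question), that the file is the site's root .htaccess, and that mod_rewrite and mod_headers are both enabled. The NC flag is an addition not in the original snippets: it makes the match case-insensitive so that ?Page= is caught as well as ?page=.

```apache
# Sketch only: flag requests whose query string contains one of the
# named parameters, then send a noindex header for the flagged requests.
RewriteEngine On

# (?:^|&) anchors the name to the start of a parameter, so "sid" or
# "lastpage" do not match. [NC] also matches "Page", "ID", etc.
RewriteCond %{QUERY_STRING} (?:^|&)(?:id|start|page)= [NC]
RewriteRule ^ - [E=NOINDEX_HEADER:1]

# Only sent when the rule above set NOINDEX_HEADER.
Header set X-Robots-Tag "noindex, follow" env=NOINDEX_HEADER
```

To cover 20+ parameters, extend the alternation inside the second group. Nothing else needs to change.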


Header set Link '%{HTTP_HOST}%{REQUEST_URI}e; rel="canonical"' env=CANONICAL_HEADER



This is invalid. It is missing the scheme (e.g. http), the e suffix after the %{HTTP_HOST} variable (without it you get a 500 error) and angle brackets (<..>) around the URL. It should be of the form:

Header set Link '<http://%{HTTP_HOST}e%{REQUEST_URI}e>; rel="canonical"'


Reference: (Google's support doc regarding canonical URLs) support.google.com/webmasters/answer/139066?hl=en
UPDATE: However, the %{REQUEST_URI}e environment variable, when used in this context, includes the query string, which really defeats the object of this exercise. This whole block should be rewritten as:

RewriteCond %{QUERY_STRING} .
RewriteRule (.*) - [E=CANONICAL_URI:$1]
Header set Link '<http://%{HTTP_HOST}e/%{CANONICAL_URI}e>; rel="canonical"' env=CANONICAL_URI


Instead of using the REQUEST_URI variable, we capture the URL-path only (which excludes the query string) using the RewriteRule directive and store this in the CANONICAL_URI variable. This is then used in the Header directive instead.
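
Combining this with the named-parameter condition from earlier gives a rough sketch of a canonical header that is only sent for the unwanted parameters. The parameter names (id, start, page) and the NC flag are assumptions based on the question, the scheme is hard-coded to http as in the snippet above, and the rules are assumed to live in the root .htaccess, where the captured path has no leading slash.

```apache
# Sketch only: send a canonical Link header, pointing at the URL-path
# without any query string, for requests that carry a named parameter.
RewriteEngine On

# Match id, start or page as a whole parameter name, case-insensitively.
RewriteCond %{QUERY_STRING} (?:^|&)(?:id|start|page)= [NC]
# $1 is the URL-path captured by (.*); the query string is not included.
RewriteRule (.*) - [E=CANONICAL_URI:$1]

# Only sent when the rule above set CANONICAL_URI.
Header set Link '<http://%{HTTP_HOST}e/%{CANONICAL_URI}e>; rel="canonical"' env=CANONICAL_URI
```

Note that the canonical URL built this way drops the entire query string, not just the named parameters. If some other parameter does change the page content, that URL would need to be handled separately.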

There is also no need for the <IfModule> containers here. It either works or it breaks; these directives are not intended to be optional (are they?).
