Si4351233

Does URL encoding create duplicate content?

@Si4351233

Posted in: #CanonicalUrl #Seo #Url #UrlEncoding

An SEO expert was testing my site and noticed that my URLs contained the special character :. He said that would create duplicate content, because Google would interpret any URL containing : as two separate URLs: one with : and one with %3A. Is he right?


2 Comments


@Si4351233

Your "SEO expert" might be a lying bastard, but this isn't the evidence for it. He's absolutely right about this. It is a little-known edge case in URL construction.

RFC 3986 is the official definition of the URL format and of the rules for encoding and decoding URLs. Any URL parser should follow it as closely as possible to avoid errors and to interoperate with the rest of the Internet. That includes search engines, which, if they apply the rules incorrectly, won't be able to crawl or index certain resources at all (e.g. they will get a 404 because they mangled your URL, or your application will misinterpret the URL or query string).

The RFC gives rules on how to do the percent encoding and decoding, which we're all familiar with, but it also explains when to do the encoding and decoding, and to which characters.

Note that search engines normalize URLs (so that they can be compared) but they do not dereference them. Web servers dereference URLs to locate documents and to pass decoded data to your web application. When a URL is normalized, only a subset of percent encoded characters are decoded; when it is dereferenced, all of them are decoded.

In particular, it specifies how to compare two URLs for equivalence (all of section 6) and which characters must be percent-decoded before doing so (sections 6.2.1 and 6.2.2). Here we find that the only characters to be decoded before comparing URLs for equivalence are the so-called unreserved characters. These are defined (in section 2.3) as "uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde." Percent encoding is meant to prevent browsers, search engines, and the like from misinterpreting special characters in URLs, but since none of the unreserved characters has a special meaning in a URL, they can be decoded by anyone at any time.
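To make the distinction concrete, here is a rough Python sketch of that one normalization step. It is my own illustration, not how any search engine actually does it: the names normalize_percent_encoding and equivalent are made up, the example.com URLs are placeholders, and it covers only percent-encoding normalization (no case folding of scheme or host, no dot-segment removal).

```python
import re
import string
from urllib.parse import unquote

# Unreserved characters as listed in RFC 3986 section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Matches one percent-encoded octet such as %3A or %2d.
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(url: str) -> str:
    """Decode only percent-encoded unreserved characters.

    Every other escape is left encoded; its hex digits are uppercased
    so that %3a and %3A compare equal. (Hypothetical helper.)
    """
    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return _PCT.sub(fix, url)


def equivalent(a: str, b: str) -> bool:
    """Compare two URLs after percent-encoding normalization only."""
    return normalize_percent_encoding(a) == normalize_percent_encoding(b)


# Hyphen is unreserved, so %2d is decoded and the URLs match.
print(equivalent("http://example.com/seo%2durl", "http://example.com/seo-url"))  # True
# Colon is reserved, so %3A stays encoded and the URLs differ.
print(equivalent("http://example.com/a%3Ab", "http://example.com/a:b"))  # False
# Dereferencing (what a web server does) decodes everything.
print(unquote("http://example.com/a%3Ab") == "http://example.com/a:b")  # True
```

The last line is the dereferencing side: a server that fully decodes the path will serve the same document for both URLs, which is exactly why the two forms can end up looking like duplicates.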

So, the %3A is not decoded to a colon : before two URLs are compared for equivalence. The colon actually has some unusual rules that apply to its use in the path component of a URL (explained in section 4.2); it cannot appear in the first path component of a relative URL (but is allowed in subsequent components) because it could be confused for a URL scheme.

To construct valid relative URLs with a colon as the first path component, we would either have to encode it some of the time and not others, prefix all such relative URLs with ./, or forgo relative URLs entirely (which is usually what happens, but relative URLs are much more common than you might think).
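You can see the relative-URL problem with Python's standard urllib.parse, used here only as a convenient parser; other parsers may differ in details. The base URL and the this:that segment are placeholders I picked for the demonstration.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical page that contains the relative links below.
BASE = "http://example.com/a/b"

# A colon in the first segment makes the parser see a scheme named "this".
print(urlsplit("this:that").scheme)      # this
# With a leading "./" there is no scheme; the whole thing is a path.
print(urlsplit("./this:that").scheme)    # (empty string)

# Resolved against the base, the bare form is treated as an absolute
# URL in an unknown scheme and is returned unchanged.
print(urljoin(BASE, "this:that"))        # this:that
# The "./" prefix and the encoded colon both resolve as relative paths,
# but they produce two different strings.
print(urljoin(BASE, "./this:that"))      # http://example.com/a/this:that
print(urljoin(BASE, "this%3Athat"))      # http://example.com/a/this%3Athat
```

Note that the last two results point at the same document once a server decodes them, yet they differ as strings, which is the situation the question is about.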

This part of the URL spec could use some clarification, but given the circumstances I would strongly recommend that, if you are going to use colons in your URLs, you always encode them. This removes any possible ambiguity with respect to URL equivalence and ensures that you won't hit this edge case even if you use relative URLs.


@Jessie594

I suspect this is a lie or a misinterpretation from your "SEO expert" (such roles do not exist, IMO). Essentially, %3A and : are exactly the same thing; one is just the encoded form of the other, and anything that reads a URL will know that.

Otherwise you could argue that any non-alphanumeric character could cause duplicate content, as they all have a URL-encoded equivalent (e.g. %2d is -).

For example:
webmasters.stackexchange.com/questions/31499/seo-whould-i-use-in-url

and
webmasters.stackexchange.com/questions/31499/seo%2dwhould%2di%2duse%2din%2durl

Both resolve to the same place; the only difference is that - is URL-encoded in the latter, and browsers and search engines will honour both forms.
