Sitemap.xml generates 404's for URLs with single quotes and commas

@Correia994

Posted in: #GoogleSearchConsole #Sitemap #UrlEncoding #Xml #XmlSitemap

I'm going to try and keep this as terse as possible: when it comes to single quotes and commas in the URL, I'm damned if I encode and damned if I don't.

If I leave the single quote unencoded in the sitemap.xml loc entry, some crawlers (most notably, Bing) truncate the URL up to the point just before the single quote.

If I encode the single quote as &apos; according to this guide, some crawlers truncate the URL up to and including the ampersand. Bing used to do this until I contacted their tech support.

However, now that my sitemap.xml is "proper" according to the guide, Google Webmaster shows a crap-tonne of 404's, most of which show that the Google crawler is using the XML-encoded form of the URL (ex., example.com/someone&apos;s-lucky-day) instead of the decoded form (http://example.com/someone's-lucky-day). The other 404'd URLs contain commas (ex., example.com/someone,-really-hates-me becomes example.com/someone).

One thing to note: Whenever my web app raises a 500 server error, I get emailed a copy of the error. The email includes the URL attempted by the visitor (or crawler in this case). After switching my sitemap.xml to encode the single quotes, I haven't received any more of these error reports; for now, it's just Google Webmaster complaining.

2 Comments

@Ogunnowo487

FWIW... on the face of it Google would seem to be incorrect, in my opinion. Or rather, its implementation of the standard (RFC 3986) is too strict. (Although systems do vary in this respect.)

URLs always need to be suitably URL encoded / percent-encoded (as @mike states) by encoding characters that have special meaning, and then XML entity encoded when used in an XML document (or HTML entity encoded if used in an HTML document).
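The two layers described above can be sketched in Python. This is only an illustration under stated assumptions: sitemap_loc is a hypothetical helper name, the base URL is an example, and the choice to leave the apostrophe and comma unencoded in the path mirrors the reasoning in this answer rather than anything Google or Bing documents.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def sitemap_loc(base: str, path: str) -> str:
    """Build a <loc> element: percent-encode first, XML-escape second.

    Hypothetical helper for illustration only.
    """
    # Layer 1: percent-encode the path. "/" plus the apostrophe and comma
    # are listed as safe here, so they pass through unchanged.
    encoded_path = quote(path, safe="/',")
    # Layer 2: XML-escape the finished URL. escape() handles &, < and >;
    # the extra mapping also turns the apostrophe into &apos;.
    xml_safe_url = escape(base + encoded_path, {"'": "&apos;"})
    return "<loc>" + xml_safe_url + "</loc>"


print(sitemap_loc("http://example.com", "/someone's-lucky-day"))
```

A crawler that parses the XML properly should decode &apos; back to a plain apostrophe before requesting the URL; the 404's in the question suggest that step is being skipped somewhere.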

Whilst single quotes and commas are considered "reserved characters" in a URL, they have no special meaning in the path part of the URL and can be used as-is, without being percent-encoded. So, a URL such as example.com/someone's-lucky-day is perfectly valid as it is - the ' does not need to be percent-encoded here (it would still need to be XML encoded in an XML sitemap). Just to clarify, there is no harm in percent-encoding these characters; in fact, you can percent-encode everything if you want to!
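Since percent-encoding these characters is harmless, one possible workaround is to percent-encode the apostrophe and comma as well, so that nothing in the URL needs an XML entity. A minimal sketch, assuming the server treats the percent-encoded and literal forms as the same resource (worth verifying on your own app); percent_encode_path is a hypothetical name:

```python
from urllib.parse import quote, unquote


def percent_encode_path(path: str) -> str:
    # Only "/" is kept as-is; the apostrophe and comma are percent-encoded.
    return quote(path, safe="/")


for path in ("/someone's-lucky-day", "/someone,-really-hates-me"):
    encoded = percent_encode_path(path)
    # Decoding gives back the original path, so no information is lost.
    assert unquote(encoded) == path
    print(encoded)
```

The apostrophe comes out as %27 and the comma as %2C, neither of which contains a character that XML needs escaped.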

Reference: StackOverflow question - Valid Characters for Directory part of a URL.

Also conflicting with Google's implementation is that the JavaScript method encodeURI() (for encoding the path parts of a URL) does not percent-encode the single quote and comma characters. However, the corresponding PHP function rawurlencode() does. On examining the output of these functions, it seems that JavaScript closely follows the standard; PHP does not.
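To see the difference side by side, here is a rough Python approximation of the two behaviours. These are not the real functions: like_encode_uri and like_rawurlencode are hypothetical names, and the safe sets are my reading of which characters each function leaves alone.

```python
from urllib.parse import quote

# Characters that JavaScript's encodeURI() leaves unencoded, beyond the
# letters, digits and "-_.~" that quote() never encodes (assumed set).
ENCODE_URI_SAFE = ";,/?:@&=+$!*'()#"


def like_encode_uri(text: str) -> str:
    # Approximates encodeURI(): reserved characters pass through.
    return quote(text, safe=ENCODE_URI_SAFE)


def like_rawurlencode(text: str) -> str:
    # Approximates PHP's rawurlencode(): everything except letters,
    # digits and "-_.~" is percent-encoded.
    return quote(text, safe="")


sample = "someone's,day"
print(like_encode_uri(sample))
print(like_rawurlencode(sample))
```

The first call leaves the sample untouched, while the second encodes both the apostrophe and the comma.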

However, another thought... Is there an encoding issue? Is the XML document UTF-8 encoded and are these characters really apostrophes and commas and not curly quotes or something that "looks" similar?!
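A quick way to rule this out is to list every non-ASCII character in the URL along with its Unicode name. A minimal sketch (non_ascii_chars is a hypothetical helper; the sample URL deliberately contains a curly quote):

```python
import unicodedata


def non_ascii_chars(url: str) -> list:
    """Return (position, character, Unicode name) for each non-ASCII character."""
    return [
        (index, ch, unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(url)
        if ord(ch) > 127
    ]


# \u2019 is a curly apostrophe that looks much like a plain one.
for index, ch, name in non_ascii_chars("example.com/someone\u2019s-lucky-day"):
    print(index, repr(ch), name)
```

An empty result means the URL only contains plain ASCII, so the apostrophes and commas are the real thing.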


@LarsenBagley505

HTML coding in pages can't be used as part of a URL.

You have to use special character coding for symbols that could possibly wreck the URL.

For the encoding, you start with a percent sign, followed by two hexadecimal digits giving the byte value of the character you're trying to use (for plain ASCII characters, that is the character's ASCII code).
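As a rough illustration of that rule, the encoding can be done by hand in Python. This assumes UTF-8 for characters outside ASCII, which is the usual convention; percent_encode_char is a hypothetical name:

```python
def percent_encode_char(ch: str) -> str:
    # One "%XX" group per byte of the character's UTF-8 encoding.
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))


print(percent_encode_char("'"))  # apostrophe, ASCII code 39 (hex 27)
print(percent_encode_char(","))  # comma, ASCII code 44 (hex 2C)
```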

Go to this page and enter the URLs in question into the form to see how they should be encoded:
www.w3schools.com/tags/ref_urlencode.asp
