Jessie594

CMS text encoding and common browser decoding; recent change?

@Jessie594

Posted in: #ContentEncoding

I'm managing a CMS-based website. The primary language of the CMS is English. Our content is mostly in English, but user postings are sometimes in other languages. The supporting MySQL database of my site is set to UTF-8 encoding.

Until recently, "other language" (e.g. Russian) postings displayed correctly. This in spite of the fact that the CMS declares in the header of each generated page:

<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' />


My best guess is that in the past, browsers commonly took this header declaration as advice, and also looked at the content to determine the actual encoding. If, for example, UTF-8 encoding was detected, the header declaration was overridden. That's how "other language" text was correctly displayed. It helps that iso-8859-1-encoded text will be correctly interpreted by a UTF-8 decoder, that is, iso-8859-1 is effectively a subset of UTF-8.

Question 1: iso-8859-1 is effectively a subset of UTF-8, right?

As far as I know, nothing about the site, the CMS installation, or the server has changed recently. But, of course, browser tech is constantly evolving and new browser releases arrive all the time.

In the past few weeks, "other language" content has been showing up as garbage (mojibake) in the latest versions of Safari (5.1), Chrome (15.0.874.81), and Firefox (7.0.1), at least. All indications are that the "auto" text encoding mode of each now decides that the site's pages are iso-8859-1 encoded, end of story. When I manually set the browser to use UTF-8, the garbage disappears, and all text is correctly rendered, as before.

Question 2: Has the standard/practice changed, so that recent browser releases obey the header declaration of text encoding and won't override it no matter what encoding is actually present?

Question 3: If not, what other factors could account for the sudden appearance of garbage where there was none before? (For example, could a subtle change to Apache made by the hosting provider have this effect?)

I have no idea why the CMS primary distribution is iso-8859-1 encoded. I've looked at the CMS docs and submitted several queries to support channels, but … no answer. The CMS is definitely capable of supporting other languages; many "official" alternates are available, and so far as I've checked, they all declare "utf-8" in the headers of generated pages.

There may be some 8859-1 encoding-dependent code deep in the CMS, I suppose. (No, I'm not going to try to find it!) But the existence of a large number of alternative language packs would seem to argue against that.

Question 4: (bonus!) If there is no encoding-dependent code in the CMS, what kind of technical reasons might make the CMS developers reluctant to move their primary distribution to UTF-8?

Question 5: Am I totally missing some point? Am I totally or partially confused about how text encoding works?


3 Comments


@Murphy175

Summary

Your server used to have an AddDefaultCharset directive (namely AddDefaultCharset utf-8) and no longer has it.

That alone explains both the earlier and the current behavior.

Try restoring it via .htaccess.

Bonus answer on q.4


Charset-dependent changes to text-manipulation code are always needed: most string functions have multi-byte counterparts, the mb*() functions.
When you start to think about 8-bit text in (My)SQL, you have to think about the database charset, client charset, connection charset, natural sorting, and string size. Closing your eyes and forgetting about it seems the easier choice (the "English rules the world" mantra).


Bonus answer on q.5

No, you are not missing anything. The earlier good results were only masking the problem; otherwise you would have had headaches with 8-bit text from the very first strings on the site.


@Jamie184

As I understand it, the character encoding used by the browser is decided in the following order:


1. The Content-Type response header as sent from the server.
2. If not #1, then the Content-Type META tag.
3. If neither of the above, then the browser default, which I assume is based initially on the default language on the system.
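The order above can be sketched in a few lines of Python. This is only a model of the three steps as listed, not what any real browser does internally (browsers also look at byte-order marks and may sniff content). The function name `pick_encoding`, the regular expressions, and the `browser_default` parameter are made up for illustration.

```python
import re

# Hypothetical helpers: crude patterns for pulling out a charset label.
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
_META_RE = re.compile(r"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def pick_encoding(content_type_header, html_head, browser_default="iso-8859-1"):
    """Model the three-step order: HTTP header, then META tag, then default."""
    # Step 1: a charset in the Content-Type response header wins.
    if content_type_header:
        match = _CHARSET_RE.search(content_type_header)
        if match:
            return match.group(1).lower()
    # Step 2: otherwise use the charset declared in a META tag, if any.
    match = _META_RE.search(html_head)
    if match:
        return match.group(1).lower()
    # Step 3: otherwise fall back to the (assumed) browser default.
    return browser_default


meta = "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' />"
print(pick_encoding("text/html; charset=UTF-8", meta))  # header overrides META
print(pick_encoding("text/html", meta))                 # no header charset: META is used
```

If this model is right, a server that sends `charset=UTF-8` in the header would hide a wrong META declaration, and removing the header charset would expose it.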


AFAIK the default encoding in the browser has not changed. And I very much doubt that simply upgrading to the latest version of a browser would touch the default that was already set.

Also, I don't think it is even possible to accurately deduce the correct character encoding by analysing the page content. The browser needs to be told, or it falls back to its default. Incidentally, on the machine I'm sat at, Firefox 3.6 and Chrome 14 both default to ISO-8859-1.


Question 1: iso-8859-1 is effectively a subset of UTF-8, right?


Unfortunately, that's not the case - US-ASCII is a subset of utf-8, but non-ASCII characters in iso-8859-1 are encoded differently than in utf-8.
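A quick way to see this is to compare the bytes directly. The sketch below uses Python purely for illustration; the sample strings are arbitrary.

```python
ascii_text = "CMS"
accented = "é"  # U+00E9, a non-ASCII character that exists in iso-8859-1

# Pure ASCII text produces identical bytes in all three encodings.
same_for_ascii = (
    ascii_text.encode("ascii")
    == ascii_text.encode("iso-8859-1")
    == ascii_text.encode("utf-8")
)

# A non-ASCII character is one byte in iso-8859-1 but two bytes in UTF-8.
latin1_bytes = accented.encode("iso-8859-1")  # b'\xe9'
utf8_bytes = accented.encode("utf-8")         # b'\xc3\xa9'

print(same_for_ascii, latin1_bytes, utf8_bytes)
```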


Question 2: Has the standard/practice changed, so that recent browser
releases obey the header declaration of text encoding and won't
override it no matter what encoding is actually present?


I'm unaware of any change. This would surely break a lot of sites? A browser cannot override the character encoding, as specified by the site, by what it thinks should be the character encoding! Can it?!


Question 3: If not, what other factors could account for the sudden
appearance of garbage where there was none before? (For example, could
a subtle change to Apache made by the hosting provider have this
effect?)


Yes, I would guess a change in server config could be causing this. The Content-Type response header (as mentioned above) might have changed.


Question 4: (bonus!) If there is no encoding-dependent code in the
CMS, what kind of technical reasons might make the CMS developers
reluctant to move their primary distribution to UTF-8?


How old is the CMS? What is it written in? Historically, PHP has not handled multi-byte strings well. Many PHP functions only work on single-byte strings, and some of the multi-byte functions are only available with PHP 5.
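To illustrate why byte-oriented string functions cause trouble with UTF-8, here is a small Python sketch. It is an analogy only: operating on the `bytes` object stands in for a single-byte string function, and operating on the `str` object stands in for a multi-byte-aware one. The sample word is arbitrary.

```python
text = "Привет"               # six Cyrillic characters
raw = text.encode("utf-8")    # the bytes a byte-oriented function would see

byte_len = len(raw)           # counts bytes: 12
char_len = len(text)          # counts characters: 6

# Cutting at a byte offset can split a character in half.
# errors="replace" turns the broken tail into U+FFFD instead of raising.
truncated = raw[:3].decode("utf-8", errors="replace")

print(byte_len, char_len, repr(truncated))
```

Code that truncates, pads, or measures strings by bytes will behave like the `raw` case once the content is UTF-8, which is one plausible reason for a project to stay on a single-byte encoding.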


@Shelley277

UTF-8-encoded non-English characters do not display correctly when the page is decoded as iso-8859-1, and non-ASCII iso-8859-1 characters do not display correctly when the page is decoded as UTF-8.
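Both directions can be reproduced with a short Python sketch. The sample strings are arbitrary, and Python's `iso-8859-1` codec is used as a stand-in for what a browser does with that label.

```python
russian = "Привет"

# UTF-8 bytes read as iso-8859-1: every byte becomes its own character,
# so six Cyrillic letters turn into twelve unrelated ones (mojibake).
mojibake = russian.encode("utf-8").decode("iso-8859-1")

# The damage is reversible as long as the bytes themselves were not altered.
recovered = mojibake.encode("iso-8859-1").decode("utf-8")

# iso-8859-1 bytes read as UTF-8: the lone 0xE9 byte is not valid UTF-8,
# so with errors="replace" it becomes the replacement character U+FFFD.
broken = "café".encode("iso-8859-1").decode("utf-8", errors="replace")

print(len(mojibake), recovered, repr(broken))
```

The reversible case matches what happens when the browser is switched to UTF-8 by hand and the text renders correctly again.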

If you want to know the encoding your browser has detected and is using on a page:

In Firefox just go to the menu "View" > "Character encoding".
support.mozilla.com/en-US/kb/Menu%20Reference

Make sure UTF-8 is automatically selected when you read a page and when you enter text in the CMS to be posted.
If possible, configure the CMS to use HTML character entities.
Try changing the charset in the meta tag to utf-8, or to whichever encoding displays your site correctly.
Please note that invisible characters may appear if you are using the wrong encoding.
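As a rough illustration of the character-entity suggestion, the Python sketch below converts non-ASCII characters to numeric character references, so the resulting markup is pure ASCII and reads the same under either charset declaration. The function name `to_ascii_html` is made up for this example, and it does not escape `<`, `>`, or `&`; how a given CMS would do this is not something the thread tells us.

```python
import html


def to_ascii_html(text):
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


encoded = to_ascii_html("Привет")
print(encoded)                  # &#1055;&#1088;&#1080;&#1074;&#1077;&#1090;
print(html.unescape(encoded))   # Привет
```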
