
Are there downsides to serving UTF-8 with a BOM?

@Cooney921

Posted in: #Unicode

Working with a client, I've just noticed that all of their files are being saved as Windows-1252, but they're serving them with charset=utf-8 on the Content-Type header (e.g., Content-Type: text/html; charset=utf-8 and similar for their JS and CSS).

I've recommended to them that they actually use UTF-8, which they're happy to do. But their primary authoring tool is VS.Net 2012, which defaults to Windows-1252 (in English locale Windows installs) unless the file has a signature telling it otherwise. (I was very surprised not to find a setting for this, but I've found multiple answers on Stack Overflow that seem to confirm there isn't one.)

So we can fix this by saving their files as UTF-8 with a BOM (and possibly updating the templates similarly so new files get created that way), because if VS.Net sees the BOM it remembers to save them that way later. The Unicode standard (PDF) says that using a BOM with UTF-8 is allowed but (oddly, to my mind) "not recommended":


Use of a BOM is neither required nor recommended for UTF-8, but may
be encountered...where the BOM is used as a UTF-8 signature.


Are there any significant downsides to serving UTF-8 with a BOM to general web users? Issues with user agents somehow getting it wrong, or...? I mean, anything that understands Unicode is required to understand the BOM, so it should be okay, but we all know that reality sometimes diverges from theory...
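For reference, the re-save step described above could be sketched in Python roughly like this. It is only an illustration, assuming the source bytes really are Windows-1252; the helper name `cp1252_to_utf8_bom` is made up for this example, and Python's `cp1252` codec will raise an error on the few byte values that code page leaves undefined.

```python
def cp1252_to_utf8_bom(data: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8 with a leading BOM (signature)."""
    text = data.decode("cp1252")
    # The "utf-8-sig" codec prepends the three BOM bytes EF BB BF when encoding.
    return text.encode("utf-8-sig")


sample = "café".encode("cp1252")       # b'caf\xe9' as stored on disk today
converted = cp1252_to_utf8_bom(sample)
print(converted)                       # b'\xef\xbb\xbfcaf\xc3\xa9'
```

Applying this to real files would mean reading and writing them in binary mode, ideally after taking a backup.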


2 Comments


@Bethany197

No, there are no significant downsides to serving HTML documents as UTF-8 with a BOM. Statements to the contrary are still common, but they are based on misunderstanding. Some very early browsers, which you might now find in a museum if you are very lucky, rendered a BOM literally in some encoding. Even today, PHP still cannot handle a BOM properly, so you should not use a BOM at the start of a PHP file, as it may cause trouble when such a file is concatenated or inserted by PHP. But this is a problem intrinsic to PHP.
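The concatenation problem can be illustrated without PHP at all. The following Python sketch is not a reproduction of PHP's behaviour; it only simulates joining two BOM-prefixed files byte for byte (the file contents are invented) and shows that a BOM-aware decoder removes only the leading signature, leaving the second one in the middle of the text as U+FEFF.

```python
import codecs

# Two hypothetical template fragments, each saved as UTF-8 with a BOM.
header = codecs.BOM_UTF8 + "<p>header</p>".encode("utf-8")
body = codecs.BOM_UTF8 + "<p>body</p>".encode("utf-8")

# Naive byte-level concatenation, as an include mechanism might do.
combined = header + body

# "utf-8-sig" strips a BOM only at the very start of the data.
text = combined.decode("utf-8-sig")

print(text.startswith("\ufeff"))  # False: the leading BOM was removed
print(text.count("\ufeff"))       # 1: the second BOM survives mid-document
```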

Software used to operate on HTML documents needs to handle the BOM. It’s a rather basic requirement, and UTF-8 with a BOM is so common that software that fails to handle it should be avoided. Inconveniencing people who still use such programs should not be counted as a significant downside.

The W3C page The byte-order mark (BOM) in HTML no longer mentions any browser problems. It mentions issues in processing HTML documents with program code, but this just means that when you write code to process UTF-8 encoded HTML pages, or anything UTF-8 encoded, you need to be prepared for the BOM.
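As a minimal sketch of what "being prepared" can look like, here is one way to do it in Python. The function name `decode_utf8` is just for illustration; the point is that the input is accepted with or without the signature and the result is the same either way.

```python
import codecs


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


with_bom = codecs.BOM_UTF8 + b"<!DOCTYPE html>"
without_bom = b"<!DOCTYPE html>"

# Both inputs decode to the same string.
print(decode_utf8(with_bom) == decode_utf8(without_bom))  # True

# A plain "utf-8" decode keeps the BOM as a leading U+FEFF character.
print(with_bom.decode("utf-8").startswith("\ufeff"))      # True
```

In Python specifically, `data.decode("utf-8-sig")` does the same job as the helper above.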


@Eichhorn148

Part of the advantage of UTF-8 is that software that only knows about ASCII can still read the files. When a byte order mark is present in the file, some software that expects ASCII text may complain that the file is "binary".

Modern web browsers are all capable of consuming UTF-8 with a BOM. I would still recommend omitting the BOM because it makes compatibility with Unix tools such as grep less straightforward.
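To make the tooling concern concrete, here is a small Python sketch that mimics a byte-oriented tool. It is not a model of how grep itself behaves (that depends on the version and locale); the helper names are invented for the example. It shows that the three BOM bytes make an otherwise pure-ASCII file non-ASCII and defeat a pattern anchored at the start of the file.

```python
import codecs
import re

DOCTYPE_AT_START = re.compile(rb"^<!DOCTYPE")


def starts_with_doctype(data: bytes) -> bool:
    """Byte-level check, roughly like an anchored pattern in a text tool."""
    return DOCTYPE_AT_START.match(data) is not None


def is_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


plain = b"<!DOCTYPE html>\n"
signed = codecs.BOM_UTF8 + plain   # same content with EF BB BF in front

print(starts_with_doctype(plain), is_ascii(plain))    # True True
print(starts_with_doctype(signed), is_ascii(signed))  # False False
```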

Also, I am not aware of any advantages of including a BOM for UTF-8, so it seems like a no-brainer to omit it. (This is different from UTF-16, which has big-endian and little-endian variants that need to be distinguished with a BOM.)
