How does UTF-8 represent more than 256 characters, and when should UTF-16 be used?

@Moriarity557

Posted in: #ContentEncoding #Html

We all know what character encoding is. For example, simple 7-bit ASCII is used for representing the basic 128 characters, and UTF-8 is used for representing 256 characters.

I have 2 questions:


Some people say UTF-8 can represent more than 256 characters. How is this possible?
When should UTF-16 be used? Under which conditions, e.g. if we have to use Japanese or some other language?

3 Comments

@Deb1703797

Note that US-ASCII uses 7 bits, covering code points 0-127.

Extended ASCII uses 8 bits (1 byte), covering code points 0-255.

The extended area (128-255) is interpreted by loading a different code page depending on the language, so that other characters can be displayed. ISO/IEC 8859-1 is the 8-bit code page for Latin-1, ISO/IEC 8859-5 is the 8-bit code page for Latin/Cyrillic, and ISO/IEC 8859-9 is the 8-bit code page for Turkish.
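
To make the code page idea concrete, here is a minimal Python sketch (my own illustration, not tied to any particular system). It decodes the same byte under the three code pages named above; the byte value 0xFD and the variable names are arbitrary choices.

```python
import unicodedata

# One byte value above 127, interpreted under three ISO/IEC 8859 code pages.
raw = bytes([0xFD])

decoded = {
    name: raw.decode(name)
    for name in ("iso-8859-1", "iso-8859-5", "iso-8859-9")
}

for name, char in decoded.items():
    # Print the code point and its Unicode name rather than the glyph itself.
    print(f"{name}: U+{ord(char):04X} {unicodedata.name(char, '?')}")
```

The same byte comes out as a different character each time, which is why the receiver has to know which code page the sender used.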

Then we get into displaying Japanese and Korean, where double-byte encodings are used, or Big5 for Chinese.

UTF-8 is variable width: it starts from 7-bit US-ASCII and uses one or more bytes (up to four) per code point.

UTF-8 eliminates the need to load different code pages to display the character sets of different languages. It repurposes byte values in the upper half of the range (0x80-0xFF) as the lead and continuation bytes of multi-byte sequences, so that one to four bytes can encode a vastly larger number of characters. Because byte values 0-127 keep their US-ASCII meaning, it is backwards compatible with US-ASCII and extensible to cover all language character sets, symbols, punctuation, etc.
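
The multi-byte scheme can be sketched by hand. The function below is an illustrative encoder of my own (the name `utf8_encode` is hypothetical, and it skips the validation a real codec performs, such as rejecting surrogates and out-of-range values); it is checked against Python's built-in codec.

```python
def utf8_encode(code_point: int) -> bytes:
    """Hand-rolled UTF-8 encoder for one code point (illustrative, unvalidated)."""
    if code_point < 0x80:            # 1 byte:  0xxxxxxx (plain US-ASCII)
        return bytes([code_point])
    if code_point < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Letter A, e-acute, a kanji, an emoji (written as escapes to stay ASCII-safe).
for char in "A\u00e9\u65e5\U0001F600":
    encoded = utf8_encode(ord(char))
    assert encoded == char.encode("utf-8")   # agrees with the built-in codec
    print(f"U+{ord(char):04X}", " ".join(f"{b:08b}" for b in encoded))
```

In the printed bit patterns, every byte of a multi-byte sequence has its top bit set, so none of them can be mistaken for an ASCII character.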

@Heady270

UTF-8 can represent all languages supported by Unicode, covering every one of its million-plus code points. It uses one byte for ASCII characters (0-127), but up to 4 bytes for other characters.

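UTF-16 can also represent all Unicode characters. It uses 2 bytes for most characters in common use and 4 bytes (a surrogate pair) for characters outside the Basic Multilingual Plane, i.e. above U+FFFF.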
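
As a quick check of UTF-16 sizes, this small Python sketch (sample characters chosen arbitrarily by me) encodes three code points. Note that a code point above U+FFFF needs two 16-bit code units, a surrogate pair, so it takes 4 bytes rather than 2.

```python
# Letter A, a kanji, and an emoji outside the Basic Multilingual Plane.
samples = ("A", "\u65e5", "\U0001F600")

# Big-endian UTF-16 without a byte order mark, keyed by code point.
utf16_bytes = {f"U+{ord(c):04X}": c.encode("utf-16-be") for c in samples}

for code_point, encoded in utf16_bytes.items():
    print(f"{code_point}: {len(encoded)} bytes -> {encoded.hex(' ', 2)}")
```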

I would recommend using UTF-8 exclusively. It has several advantages over UTF-16:


It is ASCII compatible -- even programs that are not Unicode-aware can usually read the files (even if they don't render the non-ASCII characters properly).
It produces smaller files for text that is in English or contains HTML markup. Only files consisting mostly of certain non-Latin characters (for example, East Asian text) are bigger. In an HTML file, even for other languages, the amount of markup is often greater than the number of such characters.
UTF-16 comes in big-endian and little-endian variants, which reduces compatibility even further. UTF-8 has a single byte order.
Unix-like operating systems such as Linux and macOS generally choose UTF-8 over UTF-16 as the system default character encoding.
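
The size and byte-order points can be checked with a short Python sketch. The sample strings (including the HTML fragment) are made up for illustration.

```python
import codecs

english = "Hello, world"
# Hypothetical HTML fragment: ASCII markup around five Japanese characters.
html_ja = '<p class="greeting">\u3053\u3093\u306b\u3061\u306f</p>'

# (UTF-8 byte count, UTF-16 byte count without BOM) for each sample.
sizes = {
    label: (len(text.encode("utf-8")), len(text.encode("utf-16-le")))
    for label, text in (("english", english), ("html_ja", html_ja))
}
for label, (utf8_len, utf16_len) in sizes.items():
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

# UTF-16 has two byte orders; the same letter yields different byte sequences.
print("A".encode("utf-16-le").hex())  # little-endian
print("A".encode("utf-16-be").hex())  # big-endian

# Python's plain "utf-16" codec prepends a byte order mark (BOM) so a reader
# can tell which order was used.
has_bom = "A".encode("utf-16")[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)
```

In both samples UTF-8 comes out smaller, because the ASCII markup costs one byte per character instead of two.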

@Margaret670

UTF-8 is an encoding (a Unicode transformation format) that covers the whole Unicode code space of 1,114,112 code points (that is, all Unicode characters and also code points not assigned to characters).

You may have been misled by the information that in UTF-8, a single code unit is 8 bits and thus has 256 possible values. But the representation of a character uses a variable number (one to four) of code units.
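
These numbers are easy to verify. A minimal Python sketch, with sample characters picked by me to hit each length:

```python
import sys

# Unicode's code space: 17 planes of 65,536 code points each.
code_space = sys.maxunicode + 1
print(code_space)

# Letter A, e-acute, euro sign, and an emoji (written as escapes).
samples = ("A", "\u00e9", "\u20ac", "\U0001F600")

# Number of 8-bit code units (bytes) each one needs in UTF-8.
code_units = {f"U+{ord(c):04X}": len(c.encode("utf-8")) for c in samples}
for code_point, count in code_units.items():
    print(f"{code_point}: {count} code unit(s)")
```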

UTF-16 can represent exactly the same character repertoire as UTF-8. The choice between UTF-8 and UTF-16 depends on technologies rather than languages. For example, on the Internet, UTF-8 is the dominant encoding, whereas UTF-16 is used internally in, for example, Windows and many programming languages.

For some languages, UTF-16 can be more efficient than UTF-8. But this is usually not relevant, especially for web pages. All web browsers and search engines support UTF-8, whereas support for UTF-16 varies.
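
For a sense of scale, here is a small Python sketch comparing the two encodings on plain text with no markup. The helper name `encoded_sizes` and the sample strings are my own.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without BOM) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

japanese = "\u65e5\u672c\u8a9e"   # three kanji meaning "Japanese language"
latin = "abc"

print("japanese:", encoded_sizes(japanese))
print("latin:   ", encoded_sizes(latin))
```

Each kanji takes 3 bytes in UTF-8 but 2 in UTF-16, while each ASCII letter takes 1 byte in UTF-8 and 2 in UTF-16, so the advantage depends entirely on the mix of characters.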
