UTF-16

From MobileRead

UTF-16, the default encoding form of the Unicode Standard, uses 16-bit code units. Code point values for the most common characters are in the range 0 to FFFF₁₆ and are encoded with a single 16-bit unit of the same value. Code points from 10000₁₆ to 10FFFF₁₆ are encoded with two code units that are often called "surrogates"; together they are called a "surrogate pair" when they correctly encode one Unicode character. The first (lead) surrogate in a pair must be in the range D800₁₆ to DBFF₁₆, and the second (trail) surrogate must be in the range DC00₁₆ to DFFF₁₆. Every Unicode code point has exactly one UTF-16 encoding: either one code unit that is not a surrogate, or one correct pair of surrogates. The code point values D800₁₆ to DFFF₁₆ are set aside for this mechanism and will never, by themselves, be assigned any characters.
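The encoding rule described above can be sketched in a few lines of Python. This is an illustrative sketch only; the function names `encode_utf16` and `decode_surrogate_pair` are made up for this example and do not come from any library.

```python
def encode_utf16(code_point):
    """Return the list of 16-bit code units that encode one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are reserved")
    if code_point <= 0xFFFF:
        # One code unit with the same value as the code point.
        return [code_point]
    # Supplementary code point: split the 20-bit offset into two halves.
    offset = code_point - 0x10000
    lead = 0xD800 + (offset >> 10)     # upper 10 bits, D800..DBFF
    trail = 0xDC00 + (offset & 0x3FF)  # lower 10 bits, DC00..DFFF
    return [lead, trail]


def decode_surrogate_pair(lead, trail):
    """Return the code point encoded by a correct surrogate pair."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a correct surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


print([hex(u) for u in encode_utf16(0x41)])     # ['0x41']
print([hex(u) for u in encode_utf16(0x1F600)])  # ['0xd83d', '0xde00']
print(hex(decode_surrogate_pair(0xD83D, 0xDE00)))  # 0x1f600
```

Because the lead and trail ranges do not overlap each other or any other code unit value, a decoder can tell from a single code unit whether it stands alone, starts a pair, or finishes one.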

Most commonly used characters have code points below FFFF₁₆, but Unicode 3.1 assigns more than 40,000 supplementary characters that make use of surrogate pairs in UTF-16.

Note that comparing UTF-16 strings lexically by their 16-bit code units does not give the same order as comparing their code points. The difference arises because surrogate code units (D800₁₆ to DFFF₁₆) sort below the code units E000₁₆ to FFFF₁₆, while the supplementary code points they encode sort above them. This is not usually an issue, since only rarely used characters are affected and most processes do not depend on the two orderings agreeing. Where necessary, a simple modification to a string comparison still allows efficient code unit-based comparisons and makes them compatible with code point comparisons. ICU has C and C++ API functions for this.
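One commonly described form of such a modification remaps each code unit at the first point of difference so that surrogates sort above E000 to FFFF. The Python sketch below illustrates that idea for well-formed UTF-16 only. It is not ICU's implementation, and the helper names `to_code_units`, `fixup` and `compare_code_point_order` are hypothetical.

```python
import struct


def to_code_units(text):
    """Return the UTF-16 code units of a Python string as a list of ints."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def fixup(unit):
    """Remap a code unit so that surrogates sort above E000..FFFF."""
    if unit >= 0xE000:
        return unit - 0x800   # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000  # D800..DFFF -> F800..FFFF
    return unit


def compare_code_point_order(a, b):
    """Compare two code unit lists in code point order; return -1, 0 or 1."""
    for unit_a, unit_b in zip(a, b):
        if unit_a != unit_b:
            # Only the first differing pair of units needs the fix-up.
            return -1 if fixup(unit_a) < fixup(unit_b) else 1
    # One sequence is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


bmp = to_code_units("\uff5e")               # U+FF5E, one code unit
supplementary = to_code_units("\U00010000")  # U+10000, a surrogate pair
print(bmp < supplementary)                        # False
print(compare_code_point_order(bmp, supplementary))  # -1
```

The loop still walks the strings one code unit at a time, so the cost over a plain code unit comparison is a single remapping at the first mismatch.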

While UTF-8 is relatively compact for text in European scripts, it uses 50% more space than UTF-16 for East Asian text. In UTF-8, code points up to 7F₁₆ take one byte, code points up to 7FF₁₆ take two bytes, code points up to FFFF₁₆ take three bytes (50% more memory than the two bytes UTF-16 needs), and all others take four.
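The size difference can be checked with a short Python sketch. The functions `utf8_length` and `utf16_length` are illustrative names, and the sample string is an arbitrary choice of three CJK ideographs.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def utf16_length(code_point):
    """Number of bytes UTF-16 uses for one code point."""
    return 2 if code_point <= 0xFFFF else 4


sample = "\u65e5\u672c\u8a9e"  # three CJK ideographs, all above U+07FF
utf8_size = len(sample.encode("utf-8"))
utf16_size = len(sample.encode("utf-16-le"))  # "-le" avoids a byte order mark
print(utf8_size, utf16_size, utf8_size / utf16_size)  # 9 6 1.5
```

For supplementary code points both encodings use four bytes, so the 50% difference applies only to the range 800₁₆ to FFFF₁₆.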
