From MobileRead

UTF-16, the default encoding form of the Unicode Standard, uses 16-bit code units. Their values are generally written in hexadecimal notation.

Overview

Code point values for the most common characters are in the range 0 to FFFF and are encoded with a single 16-bit code unit of the same value. Unicode 3.1 assigned more than 40,000 supplementary characters, which are encoded with surrogate pairs in UTF-16.

Code points from 10000 to 10FFFF are encoded with two code units that are often called "surrogates", and they are called a "surrogate pair" when, together, they correctly encode one Unicode character. The first surrogate in a pair must be in the range D800 to DBFF, and the second one must be in the range DC00 to DFFF. Every Unicode code point has exactly one UTF-16 encoding: either a single code unit that is not a surrogate, or a correct pair of surrogates. The code point values D800 to DFFF are set aside just for this mechanism and will never, by themselves, be assigned any characters.
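The surrogate arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder; the function names `encode_utf16_units` and `decode_surrogate_pair` are made up for this example.

```python
def encode_utf16_units(code_point: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if code_point <= 0xFFFF:
        # One code unit with the same value as the code point.
        return [code_point]
    # Supplementary code point: split the 20-bit offset into two halves.
    offset = code_point - 0x10000
    lead = 0xD800 + (offset >> 10)     # first surrogate, D800..DBFF
    trail = 0xDC00 + (offset & 0x3FF)  # second surrogate, DC00..DFFF
    return [lead, trail]


def decode_surrogate_pair(lead: int, trail: int) -> int:
    """Combine a correct surrogate pair back into one code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
```

For example, `encode_utf16_units(0x1F600)` gives the pair D83D DE00, which matches what Python's built-in `utf-16-be` codec produces for the same character.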

Note that comparing or sorting UTF-16 strings by their 16-bit code units does not always give the same order as comparing them by code points. The two orders differ only when supplementary characters are compared with characters in the range E000 to FFFF: surrogate code units (D800 to DFFF) sort below E000, although the code points they encode lie above FFFF. This is not usually an issue, since only rarely used characters are affected and most processes do not rely on the two orders agreeing. Where necessary, a simple modification to the string comparison keeps efficient code unit-based comparison while making the result compatible with code point order. ICU has C and C++ API functions for this.
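One commonly described form of such a modification remaps code units at or above D800 before comparing, so that surrogates sort after E000 to FFFF. The Python sketch below illustrates the idea only; it is not ICU's code, and the helper names `utf16_units`, `fixup` and `code_point_order_key` are invented for this example.

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of a string as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit: int) -> int:
    """Remap one code unit so that unit order matches code point order."""
    if unit >= 0xE000:
        return unit - 0x800    # E000..FFFF moves down to D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000   # surrogates D800..DFFF move up to F800..FFFF
    return unit                # everything below D800 is unchanged


def code_point_order_key(text: str) -> list:
    """Sort key: code units remapped so comparison follows code point order."""
    return [fixup(u) for u in utf16_units(text)]


a = "\uFF5E"        # one code unit: FF5E
b = "\U00010000"    # surrogate pair: D800 DC00

by_units = sorted([a, b], key=utf16_units)
by_code_points = sorted([a, b], key=code_point_order_key)
```

Here `by_units` places `b` first, because D800 is less than FF5E, while `by_code_points` places `a` first, which agrees with the code point values FF5E and 10000.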

Compared to UTF-8

While UTF-8 is relatively compact for text in European scripts, it uses 50% more space than UTF-16 for East Asian text. In UTF-8, code points up to 7F take one byte, code points up to 7FF take two bytes, code points up to FFFF take three (50% more than the two bytes UTF-16 needs), and all others take four.
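The size difference is easy to check with Python's built-in codecs. The helper name `encoded_sizes` and the sample strings are chosen only for illustration.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for a string, without BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "ASCII letter": "A",               # code point up to 7F
    "Latin letter with accent": "é",   # code point up to 7FF
    "three CJK characters": "日本語",    # code points up to FFFF
    "supplementary character": "\U0001F600",
}

for label, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

The three CJK characters take 9 bytes in UTF-8 and 6 bytes in UTF-16, which is the 50% difference described above, while plain ASCII text is half the size in UTF-8.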

Byte order marks

A 2-byte sequence can be stored in two different ways. These are called big-endian and little-endian, depending on whether the most significant byte comes first in the byte stream or last. A UTF-16 implementation must know which order is in use to identify the characters correctly. For this reason the first 2 bytes of the file should contain a byte order mark (BOM) that shows which kind of file it is. The mark is the character FEFF (hexadecimal). A reader that finds the bytes in the order FF FE knows the file is little-endian, because FFFE is never assigned to a character. If the editor used to view or edit the file cannot recognize UTF-16, it will display a couple of stray characters at the beginning of the file.

encoding      hexadecimal   decimal   ISO-8859-1
UTF-16 (BE)   FE FF         254 255   þÿ
UTF-16 (LE)   FF FE         255 254   ÿþ

UTF-8 can also have a byte order mark. Since UTF-8 has no byte order to signal, the mark is mainly used by editing and viewing programs to show that the file is encoded as UTF-8. The bytes EF BB BF are used. These would show up as ï»¿ in ISO-8859-1 if the editor does not support Unicode.
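A reader can check for these marks before decoding. The following Python sketch uses the BOM constants from the standard `codecs` module; the names `detect_bom` and `decode_with_bom` are hypothetical, and falling back to UTF-8 when no mark is found is an assumption made for this example.

```python
import codecs

# (byte order mark, codec to use for the bytes that follow it)
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]
# Note: FF FE 00 00 is also the UTF-32 (LE) mark; this sketch ignores UTF-32.


def detect_bom(data: bytes):
    """Return the codec name implied by a leading BOM, or None if absent."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return None


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode bytes, stripping a recognized BOM; use `default` if none is found."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)
```

Decoding the mark bytes as ISO-8859-1 instead, for example `b"\xfe\xff".decode("latin-1")`, reproduces the stray characters listed in the table above.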
