UTF-8
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.
Overview
UTF-8 encodes each character in one to four octets (8-bit bytes), each shown in this discussion as two hexadecimal digits (the decimal equivalents are often accepted as well):
- One byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F).
- Two bytes are needed for Latin letters with diacritics and for characters from the Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF). The code points U+0080 to U+00FF are assigned the same characters as the upper half of ISO-8859-1, although UTF-8 encodes them in two bytes rather than one. Some examples of the rest, along with some three-byte entries, are shown in the article on special characters.
- Three bytes are needed for the rest of the Basic Multilingual Plane (Unicode range U+0800 to U+FFFF), which contains virtually all characters in common use.
- Four bytes are needed for characters in the other planes of Unicode (Unicode range U+10000 to U+10FFFF), which are rarely used in practice.
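The byte counts in the list above can be checked with Python's built-in `str.encode`. This is only a minimal sketch: the sample characters and the helper name `utf8_length` are chosen here for illustration and do not come from the article.

```python
# Sample characters (chosen for illustration), one or two from each range
# in the list above, mapped to the number of bytes UTF-8 should need.
samples = {
    "A": 1,           # U+0041, US-ASCII
    "\u00e9": 2,      # U+00E9, Latin small letter e with acute
    "\u03b1": 2,      # U+03B1, Greek small letter alpha
    "\u20ac": 3,      # U+20AC, euro sign (rest of the BMP)
    "\U0001d11e": 4,  # U+1D11E, musical symbol G clef (outside the BMP)
}


def utf8_length(ch: str) -> int:
    """Return the number of octets UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes} ({utf8_length(ch)} of {expected} expected bytes)")
```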
Every byte of a multibyte code has its most significant bit set to 1, so none of them can be confused with a 7-bit ASCII character. This keeps standard byte-oriented string processing safe.
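As a rough illustration of why this matters, the sketch below splits an encoded string on an ASCII delimiter at the byte level and still gets valid UTF-8 pieces back. The sample text and variable names are assumptions made for this example.

```python
# Illustrative text mixing ASCII with two- and three-byte characters.
text = "caf\u00e9/\u03b1\u03b2\u03b3/\u20ac"
data = text.encode("utf-8")

# Collect every byte that belongs to a non-ASCII character.
multibyte_bytes = [
    b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")
]

# Each of those bytes has its top bit set, so none can look like ASCII.
all_high = all(b & 0x80 for b in multibyte_bytes)

# Splitting on the ASCII byte for "/" therefore never cuts a character in half.
parts = [piece.decode("utf-8") for piece in data.split(b"/")]

print(all_high)
print(parts)
```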
For the two-byte codes the first byte has the form 110xxxxx and the following byte has the form 10xxxxxx, so the first byte always begins with hex C or D (in practice C2 to DF, since C0 and C1 would only produce overlong encodings of ASCII characters). There are 1920 possible codes. (Unicode values are always shown as 4 or more hexadecimal digits.)
For the three-byte codes the first byte begins with 1110 and the following 2 bytes begin with 10, so the first byte always begins with hex E. There are 61,440 codes used in this set (the 2,048 surrogate code points U+D800 to U+DFFF are excluded).
For the four-byte codes the first byte begins with 11110 and the following 3 bytes begin with 10. There are 1,048,576 possible codes, but these are rarely used. The bit pattern would allow a first byte from hex F0 to F7, but because Unicode ends at U+10FFFF only F0 to F4 occur in valid UTF-8.
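The bit patterns described above can be sketched as a small hand-written encoder. This is an illustration under stated assumptions, not a replacement for a library codec: the function name `encode_utf8` is hypothetical, and real programs would normally call `str.encode("utf-8")` instead.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point by applying the bit patterns by hand."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The code counts quoted above follow from the range boundaries.
two_byte_count = 0x7FF - 0x80 + 1                                # 1920
three_byte_count = (0xFFFF - 0x800 + 1) - (0xDFFF - 0xD800 + 1)  # 61,440
four_byte_count = 0x10FFFF - 0x10000 + 1                         # 1,048,576

print(encode_utf8(0x03B1).hex())  # Greek small letter alpha
```

Comparing the result with `chr(code_point).encode("utf-8")` for a few values is a quick way to confirm that the sketch agrees with Python's own codec.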
Although the Unicode standard neither requires nor recommends it, many Windows programs (including Windows Notepad) write the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and Web browsers not prepared to handle UTF-8. Mobipocket also uses this convention in its OPF file.
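A program that reads such files may want to detect and drop the marker. The following is a minimal sketch using Python's standard `codecs` module; the helper name `strip_utf8_bom` and the sample bytes are assumptions made for this example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading EF BB BF sequence, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file content as an editor such as Notepad might save it.
raw = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")

without_bom = strip_utf8_bom(raw).decode("utf-8")

# The "utf-8-sig" codec performs the same removal while decoding.
via_codec = raw.decode("utf-8-sig")

# Misreading the marker as ISO-8859-1 yields the three stray characters.
misread = codecs.BOM_UTF8.decode("iso-8859-1")

print(without_bom, via_codec, misread)
```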
Translating special characters
The numeric character references used to code special characters in HTML files use the Unicode numbers, except that they are usually written in decimal. For example, the Unicode character U+03B1 translates to &#945; which is the lowercase letter alpha (α). The hexadecimal form &#x3B1; is also accepted.
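The translation can be sketched in Python as follows. The helper name `to_decimal_reference` is hypothetical; `html.unescape` is the standard-library function that converts such references back to characters.

```python
import html


def to_decimal_reference(ch: str) -> str:
    """Build the decimal HTML numeric character reference for one character."""
    return f"&#{ord(ch)};"


alpha = "\u03b1"                         # U+03B1, Greek small letter alpha
reference = to_decimal_reference(alpha)  # the decimal form of hex 03B1

# Going the other way, both the decimal and hexadecimal forms decode to alpha.
from_decimal = html.unescape("&#945;")
from_hex = html.unescape("&#x3B1;")

print(reference, from_decimal, from_hex)
```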
The Cyrillic alphabet translations to UTF-8 are shown in the article on Windows-1251.