UTF-8

From MobileRead

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, yet its encoding of the first 128 characters is byte-for-byte identical to ASCII, which makes it backwards compatible. For these reasons, it is steadily becoming the preferred encoding for email, web pages, and other places where characters are stored or streamed. It is the default encoding for ePub files.

Overview

UTF-8 encodes each character in one to four octets (8-bit bytes). In this discussion each octet is shown as two hexadecimal digits, although the decimal equivalents are often accepted as well:

  1. One byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F).
  2. Two bytes are needed for Latin letters with diacritics and for characters from the Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF). U+00A0 to U+00FF are the same characters as in ISO-8859-1. The article on special characters shows examples of the remaining two-byte characters, along with some three-byte entries.
  3. Three bytes are needed for the rest of the Basic Multilingual Plane (Unicode range U+0800 to U+FFFF), which contains virtually all characters in common use.
  4. Four bytes are needed for characters in the other planes of Unicode (Unicode range U+10000 to U+10FFFF), which are rarely used in practice.
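The four length classes listed above can be checked with Python's built-in UTF-8 encoder. This is only a small sketch: the sample characters and the helper name `utf8_length` are illustrative choices, not part of the original article.

```python
# Sample characters are illustrative choices, one per length class.
samples = {
    "A": "U+0041, US-ASCII",
    "é": "U+00E9, Latin letter with diacritic",
    "€": "U+20AC, rest of the Basic Multilingual Plane",
    "😀": "U+1F600, outside the Basic Multilingual Plane",
}


def utf8_length(ch: str) -> int:
    """Return the number of octets UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


for ch, note in samples.items():
    encoded = ch.encode("utf-8")
    # bytes.hex(sep) needs Python 3.8 or later.
    print(f"{note}: {utf8_length(ch)} byte(s), hex {encoded.hex(' ').upper()}")
```

Running it should list one, two, three and four bytes respectively, with the octets shown as pairs of hexadecimal digits as in the rest of this discussion.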

Every byte of a multibyte code has its most significant bit set to 1. This prevents confusion with 7-bit ASCII characters and therefore keeps standard byte-oriented string processing safe.
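As a rough illustration of why this matters, the sketch below splits an encoded string on the ASCII space byte without decoding it first. The sample text and the helper `is_ascii_byte` are assumptions made for the example.

```python
text = "naïve café"            # mixes ASCII with two-byte characters
data = text.encode("utf-8")


def is_ascii_byte(b: int) -> bool:
    """True when the most significant bit is 0, i.e. a 7-bit ASCII code."""
    return (b & 0x80) == 0


ascii_bytes = bytes(b for b in data if is_ascii_byte(b))
multibyte_bytes = bytes(b for b in data if not is_ascii_byte(b))

# Byte-oriented processing: split on the ASCII space byte (hex 20).
# No byte of a multibyte code can equal 0x20, so no character is cut in half.
words = [piece.decode("utf-8") for piece in data.split(b" ")]

print(ascii_bytes)                         # only the plain ASCII letters and the space
print(multibyte_bytes.hex(" ").upper())    # every value here is 0x80 or above
print(words)
```

Each piece produced by the split still decodes cleanly, which is the safety property the paragraph above describes.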

For the two-byte codes the first byte has the form 110xxxxx and the following byte has the form 10xxxxxx. There are 1920 possible codes. The first byte therefore always begins with hex C or D (it lies in the range C2 to DF). (Unicode values are always shown as 4 or more hexadecimal digits.)
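The count and the leading hex digit can be confirmed by brute force. The following sketch assumes the two-byte range is U+0080 to U+07FF and relies on Python's own encoder; the variable names are illustrative.

```python
TWO_BYTE_RANGE = range(0x80, 0x800)   # U+0080 to U+07FF

encodings = [chr(cp).encode("utf-8") for cp in TWO_BYTE_RANGE]

code_count = len(encodings)
all_two_bytes = all(len(e) == 2 for e in encodings)

# First byte must look like 110xxxxx, second byte like 10xxxxxx.
patterns_ok = all((e[0] >> 5) == 0b110 and (e[1] >> 6) == 0b10 for e in encodings)

lead_bytes = sorted({e[0] for e in encodings})
first_hex_digits = sorted({f"{b:02X}"[0] for b in lead_bytes})

print(code_count)                                       # 1920
print(f"{lead_bytes[0]:02X} to {lead_bytes[-1]:02X}")   # C2 to DF
print(first_hex_digits)                                 # ['C', 'D']
```

The lead bytes C0 and C1 never appear because they could only encode values that already fit in one byte.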

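For the three-byte codes the first byte begins with 1110 and the following two bytes each begin with 10. There are 61,440 codes used in this set (U+0800 to U+FFFF, not counting the 2,048 surrogate code points U+D800 to U+DFFF, which are never encoded). The first byte therefore always begins with hex E.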
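The figure of 61,440 can be reproduced by counting the code points from U+0800 to U+FFFF that Python is willing to encode. This is a sketch under that assumption; `try_encode` is a hypothetical helper name.

```python
def try_encode(cp: int):
    """Return the UTF-8 bytes for a code point, or None if it cannot be encoded."""
    try:
        return chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return None   # the surrogates U+D800 to U+DFFF end up here


candidates = range(0x800, 0x10000)    # U+0800 to U+FFFF
encodings = [e for e in map(try_encode, candidates) if e is not None]

three_byte_count = len(encodings)
skipped = len(candidates) - three_byte_count
lead_hex_digits = {f"{e[0]:02X}"[0] for e in encodings}

print(three_byte_count)   # 61440
print(skipped)            # 2048 surrogate code points
print(lead_hex_digits)    # {'E'}
```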

For the four-byte codes the first byte begins with 11110 and the following three bytes each begin with 10. There are 1,048,576 possible codes (U+10000 to U+10FFFF), but these are rarely used. The bit pattern allows a first byte in the range hex F0 to F7, but because Unicode ends at U+10FFFF only F0 to F4 occur in valid UTF-8.
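The bit patterns described above are enough to write an encoder by hand. The function below is a minimal sketch, not a reference implementation: `encode_utf8` is a hypothetical name, and the result is compared with Python's built-in encoder as a sanity check.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point using the UTF-8 bit patterns."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against the built-in encoder for one sample per length class.
for cp in (0x41, 0x3B1, 0x20AC, 0x1F600, 0x10FFFF):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ').upper()}")

four_byte_codes = 0x10FFFF - 0x10000 + 1
print(four_byte_codes)                  # 1048576
```

The last code point, U+10FFFF, encodes with a first byte of F4, which is why first bytes above F4 do not appear in well-formed UTF-8 even though the 11110xxx pattern has room for them.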

Although the Unicode standard neither requires nor recommends it, many Windows programs (including Windows Notepad) write the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark U+FEFF encoded in UTF-8; it appears as the ISO-8859-1 characters "ï»¿" in text editors and web browsers that are not prepared to handle UTF-8. Mobipocket also uses this convention in its OPF file.

As HTML5 becomes more prevalent on the Internet, support for UTF-8 is now fairly good. Many characters have been given names to make them easier to enter; see Named character references for a list. These supersede the older Entity reference used in earlier versions of HTML.
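The EF BB BF marker mentioned above can be detected and stripped in a few lines. This sketch uses Python's standard `codecs.BOM_UTF8` constant and the `utf-8-sig` codec; the sample content is made up for illustration.

```python
import codecs

# A made-up file body, prefixed with the marker as Notepad would write it.
raw = codecs.BOM_UTF8 + "héllo".encode("utf-8")

has_bom = raw.startswith(codecs.BOM_UTF8)
print(raw[:3].hex(" ").upper())          # EF BB BF

# A reader that wrongly assumes ISO-8859-1 shows three stray characters.
misread = raw.decode("latin-1")
print(misread[:3])                        # ï»¿

# Plain "utf-8" keeps the mark as U+FEFF; "utf-8-sig" removes it if present.
kept = raw.decode("utf-8")
stripped = raw.decode("utf-8-sig")
print(repr(kept[0]), repr(stripped))
```

The `utf-8-sig` codec also decodes input that has no marker, so it is a reasonable default when reading files of unknown origin.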

Translating special characters

The special character codes used to represent characters in HTML files use the Unicode numbers, except that they are translated to decimal. For example, the Unicode character U+03B1 translates to &#945;, which is the lowercase letter alpha (α).
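The hexadecimal-to-decimal translation in this example can be reproduced as follows. This is a small sketch using Python's standard `html` module; `to_decimal_reference` is a hypothetical helper name.

```python
import html


def to_decimal_reference(ch: str) -> str:
    """Build the decimal HTML character reference for one character."""
    return f"&#{ord(ch)};"


# U+03B1 written in hexadecimal is 945 in decimal.
decimal_value = int("03B1", 16)
print(decimal_value)                      # 945

reference = to_decimal_reference("α")
print(reference)                          # &#945;

# html.unescape turns a reference back into the character it stands for.
print(html.unescape(reference))           # α
print(html.unescape("&alpha;"))           # α (named form)
```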

The Cyrillic alphabet translations to UTF-8 are shown in the article on Windows-1251.
