UTF-32

From MobileRead

UTF-32 is a straightforward way to encode every Unicode character as a single 32-bit integer, but it is wasteful of storage space, since every Unicode character can be coded in 21 bits.
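The 21-bit figure can be checked with a little arithmetic. The following is a minimal sketch in Python; the constant name `MAX_CODE_POINT` is chosen here for illustration, and its value, U+10FFFF, is the highest code point Unicode defines.

```python
import sys

# Highest code point defined by Unicode (U+10FFFF).
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Bits of each 32-bit UTF-32 code unit that are always zero.
unused_bits = 32 - bits_needed

print(bits_needed)                        # 21
print(unused_bits)                        # 11
print(sys.maxunicode == MAX_CODE_POINT)   # True
```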

Overview

Because every Unicode character takes exactly the same amount of space, this encoding scheme makes it easy to access any single character directly, unlike the competing systems UTF-8 and UTF-16. (In practice direct access is rarely needed, since text is normally processed sequentially.) Its disadvantage is that it wastes memory, since every character takes up a full 32-bit word. It is not compatible with either of the other systems, but conversion is possible. UTF-32, like UTF-16, has to take byte order into account, since there are both big-endian (BE) and little-endian (LE) processors. This is handled with a byte order mark (BOM) at the beginning of the sequence. There are compression schemes that can reduce the size of the words by about 25% for transmission.
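The points above about direct access, byte order and the byte order mark can be illustrated with a short Python sketch. It uses only the standard `codecs` module and the built-in `utf-32` codecs; the helper names `code_point_at` and `detect_byte_order` and the sample string are hypothetical, made up for this example.

```python
import codecs

# Sample text: "A", e-acute, the euro sign and an emoji (4 characters).
text = "A\u00e9\u20ac\U0001F600"

be = text.encode("utf-32-be")     # big endian, no BOM
le = text.encode("utf-32-le")     # little endian, no BOM
with_bom = text.encode("utf-32")  # BOM followed by native byte order


def code_point_at(data: bytes, index: int, byteorder: str = "big") -> int:
    """Read the character at `index` directly: it always starts at index * 4."""
    start = index * 4
    return int.from_bytes(data[start:start + 4], byteorder)


def detect_byte_order(data: bytes) -> str:
    """Return "big" or "little" according to the leading byte order mark."""
    if data.startswith(codecs.BOM_UTF32_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF32_LE):
        return "little"
    raise ValueError("no UTF-32 byte order mark found")


print(len(be), len(text.encode("utf-8")))       # 16 10
print(hex(code_point_at(be, 3)))                # 0x1f600
print(hex(code_point_at(le, 2, "little")))      # 0x20ac
print(with_bom.decode("utf-32") == text)        # True
```

The same four characters take 16 bytes in UTF-32 but 10 bytes in UTF-8, which shows the storage cost of the fixed width. The byte order reported by `detect_byte_order(with_bom)` depends on the machine that runs the sketch.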

Encoding

Although every character takes 4 bytes, the ASCII characters keep exactly the same order and values, with the control characters in the first 32 entries (0-31) and the regular ASCII characters following. Beyond these 128 characters, the values are standardized to conform to the U-xxxxxxxx hexadecimal scheme.
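As a rough illustration of the paragraph above, the following Python sketch prints the 4-byte big-endian form of a few characters as 8 hexadecimal digits and checks that every ASCII value is simply padded with zero bytes. The helper name `utf32_be_hex` is hypothetical, and big-endian order is an assumption made so that the digits read in the same order as the code point.

```python
def utf32_be_hex(ch: str) -> str:
    """Return the big-endian UTF-32 bytes of one character as 8 hex digits."""
    return ch.encode("utf-32-be").hex()


print(utf32_be_hex("\n"))       # 0000000a  (control character, value 10)
print(utf32_be_hex("A"))        # 00000041  (same value as ASCII 65)
print(utf32_be_hex("\u20ac"))   # 000020ac  (euro sign, outside ASCII)

# Every ASCII value 0-127 keeps its value, padded to 4 bytes.
ascii_preserved = all(
    chr(value).encode("utf-32-be") == value.to_bytes(4, "big")
    for value in range(128)
)
print(ascii_preserved)          # True
```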

See UTF-32 file list for a full list of the coding.
