From MobileRead

Unicode (and the parallel ISO 10646 standard) defines the character set necessary for efficiently processing text in any language and for maintaining text data integrity.


Overview

Unicode maps the letters and symbols of all the world's languages into a single character set in which each symbol has a unique numeric value, called a code point. In addition to global character coverage, the Unicode standard is unique among character set standards because it also defines data and algorithms for efficient and consistent text processing. It also attempts to standardize symbols that are in use worldwide.

To help keep implementations consistent there is ICU, the International Components for Unicode. ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

History

The name Unicode was chosen to suggest a unique, unified, universal encoding: an attempt to unify all the characters in languages throughout the world into one system. This effort began in the late 1980s. It was designed to be simple for computers to process and to avoid escape sequences. With escape sequences, the same binary codes are reused for different character sets, and a special escape character switches how the codes that follow are interpreted.

Unicode was developed as a single-coded character set that contains support for all languages in the world. The first version of Unicode used 16-bit numbers, which allowed for encoding 65,536 characters without complicated multibyte schemes. With the inclusion of more characters, and following implementation needs of many different platforms, Unicode was extended to allow more than one million characters. Several other encoding schemes were added. This introduced more complexity into the Unicode standard, but far less than managing a large number of different encodings.

Starting with Unicode 2.0 (published in 1996), the Unicode standard began assigning numbers from 0 to 10FFFF (hexadecimal), which requires 21 bits but does not use them completely. This gives more than enough room for all written languages in the world. The original repertoire covered all major languages commonly used in computing. Unicode continues to grow and to add more scripts. At the time of writing, Unicode was at version 9.
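The arithmetic behind the 21-bit figure can be checked directly. The following is a minimal sketch in Python; the variable names are illustrative and not part of any standard.

```python
# Upper bound of the Unicode code space, as stated in the text above.
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Total number of code point values, 0 through 0x10FFFF inclusive.
total_code_points = MAX_CODE_POINT + 1

# Number of values a full 21-bit field could hold.
capacity_21_bits = 2 ** 21

print(bits_needed)        # 21
print(total_code_points)  # 1114112
print(capacity_21_bits)   # 2097152
```

The code space therefore holds a little over 1.1 million values, roughly half of what 21 bits could address, which is why the 21 bits are "not used completely".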

Design of Unicode

The design of Unicode differs in several ways from traditional character sets and encoding schemes:

  1. Its repertoire enables users to include text efficiently in almost all languages within a single document.
  2. It can be encoded in a byte-based way with one or more bytes per character, but the encoding scheme usually treated as the default uses 16-bit units, which allow much simpler processing for all common characters.
  3. Many characters, such as letters with accents and umlauts, can be combined from the base character and accent or umlaut modifiers. This combining reduces the number of different characters that need to be encoded separately. "Precomposed" variants for characters that existed in common character sets at the time were included for compatibility.
  4. Characters and their usage are well-defined and described. While traditional character sets typically only provide the name or a picture of a character and its number and byte encoding, Unicode has a comprehensive database of properties available for download. It also defines a number of processes and algorithms for dealing with many aspects of text processing to make it more interoperable.
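Points 3 and 4 can be illustrated with Python's standard unicodedata module, which exposes part of the Unicode character database. This is only a sketch: the choice of "é" as the sample character is an assumption for illustration, and normalization is just one of the algorithms the standard defines.

```python
import unicodedata

# The same visible character written two ways.
precomposed = "\u00e9"   # single precomposed code point
combined = "e\u0301"     # base letter "e" + COMBINING ACUTE ACCENT

# The two strings differ code point by code point...
same_raw = precomposed == combined

# ...but normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", combined)     # compose
nfd = unicodedata.normalize("NFD", precomposed)  # decompose

# Character properties come from the Unicode character database.
name = unicodedata.name(precomposed)
accent_category = unicodedata.category("\u0301")
accent_class = unicodedata.combining("\u0301")

print(same_raw)            # False
print(nfc == precomposed)  # True
print(nfd == combined)     # True
print(name)                # LATIN SMALL LETTER E WITH ACUTE
print(accent_category)     # Mn
print(accent_class)        # 230
```

Comparing text after normalizing both sides to the same form avoids treating the precomposed and combined spellings as different strings.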

The early inclusion of all characters of commonly used character sets makes Unicode a useful "pivot" point for converting between traditional character sets. It also makes it feasible to process non-Unicode text by first converting it into Unicode, processing the text, and converting it back to the original encoding without loss of data.
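The pivot idea can be sketched in a few lines of Python, where a str holds Unicode text and bytes holds encoded data. The two legacy encodings used here (ISO 8859-1 and the old DOS code page 437) and the sample word are assumptions chosen for illustration.

```python
# Text as it might arrive from a legacy system, encoded in ISO 8859-1.
legacy_bytes = "caf\u00e9".encode("latin-1")

# Step 1: convert into Unicode.
text = legacy_bytes.decode("latin-1")

# Step 2: process the text as Unicode.
processed = text.upper()

# Step 3: convert back to the original encoding.
back_to_legacy = processed.encode("latin-1")

# Unicode also serves as the pivot between two legacy encodings.
as_cp437 = text.encode("cp437")

# A plain round trip loses nothing.
round_trip = text.encode("latin-1")

print(legacy_bytes)    # b'caf\xe9'
print(back_to_legacy)  # b'CAF\xc9'
print(as_cp437)        # b'caf\x82'
print(round_trip == legacy_bytes)  # True
```

The conversion is lossless only when the target encoding contains every character in the text; otherwise the encode step raises an error unless a replacement policy is chosen.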

Implementation

One encoding of Unicode characters, favored in Western countries, is UTF-8, because it is transparent to ASCII: plain ASCII text is already valid UTF-8. In this format a character is one to four bytes long. UTF-16, often treated as the default format, is favored in East Asian countries because it is more compact for their text (most East Asian characters take two bytes in UTF-16 but three in UTF-8). In UTF-16 a character takes at least 2 bytes, and characters outside the 16-bit range take 4 bytes.

UTF-32 also exists. It is straightforward to process, since every character fits in one 32-bit integer, but it is wasteful of storage space. (All of the Unicode characters can be coded in 21 bits.)
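The size trade-offs between the three formats can be seen by encoding a few sample characters. This is a small sketch; the helper name encoded_sizes and the sample characters are hypothetical choices for illustration. The "-le" codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))

samples = {
    "LATIN CAPITAL A": "A",           # ASCII
    "E WITH ACUTE": "\u00e9",         # Latin-1 range
    "CJK IDEOGRAPH": "\u4e2d",        # East Asian character
    "EMOJI": "\U0001F600",            # outside the 16-bit range
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
# LATIN CAPITAL A (1, 2, 4)
# E WITH ACUTE (2, 2, 4)
# CJK IDEOGRAPH (3, 2, 4)
# EMOJI (4, 4, 4)
```

ASCII-heavy text is smallest in UTF-8, East Asian text is smallest in UTF-16, and UTF-32 always uses four bytes per character.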

Prior to the invention of Unicode, multiple character sets were used to handle glyphs in multiple languages. Even today some viewers and editors can get confused by files coded in Unicode. Files should identify which character encoding they use. The way this is done varies depending on the file format. With the release of HTML5 there is support for both Unicode characters themselves and named character references, which can be used if the editor does not have the character support. See also alphabet for various Unicode alphabets available.
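The relationship between a Unicode character and its HTML character references can be sketched with Python's standard html module. The sample word is an assumption for illustration.

```python
import html

# Three ways to write the same character in HTML source:
# a named reference, a decimal reference, and a hexadecimal reference.
from_named = html.unescape("caf&eacute;")
from_decimal = html.unescape("caf&#233;")
from_hex = html.unescape("caf&#xE9;")

# Going the other way: produce ASCII-only markup for a tool that
# cannot handle non-ASCII characters directly.
ascii_only = "caf\u00e9".encode("ascii", "xmlcharrefreplace")

print(from_named == from_decimal == from_hex == "caf\u00e9")  # True
print(ascii_only)  # b'caf&#233;'
```

The numeric forms work for any code point, while named references exist only for the characters that HTML defines names for.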

The area from E000 to F8FF is a private use area containing 6400 code points. Anyone can define Unicode characters in this range for any purpose. Of course, a font that supports these definitions is needed to display them.
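The size of this range, and a simple test for whether a character falls in it, can be sketched as follows. The constant and function names are hypothetical; the "Co" general category is how the Unicode character database labels private use code points.

```python
import unicodedata

# Bounds of the private use area described above.
PUA_START = 0xE000
PUA_END = 0xF8FF

# Number of code points in the range, inclusive of both ends.
pua_size = PUA_END - PUA_START + 1

def is_private_use(ch):
    """Return True if ch lies in the E000-F8FF private use area."""
    return PUA_START <= ord(ch) <= PUA_END

print(pua_size)                              # 6400
print(is_private_use("\ue000"))              # True
print(is_private_use("A"))                   # False
print(unicodedata.category("\ue000"))        # Co
print(unicodedata.name("\ue000", None))      # None (no standard name)
```

Because these code points have no standard meaning, text that uses them is only readable by parties who agree on the definitions and share a matching font.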

Finding the font

Sometimes it can be difficult to find the font you want with all of the possible characters out there in Unicode. There is a free program called BabelMap that can be of help. It can be found at: http://www.babelstone.co.uk/Software/BabelMap.html. The results are bitmapped fonts. You can also use the online resource http://unicode-table.com/en/ to tap on a character to place it in your clipboard, and then paste it into any document that supports Unicode. The text editor BabelPad is a Unicode editor that contains BabelMap, with the addition of text editing and font conversion tools.

It can also be difficult to find a font set that contains the glyphs you need. Many font sets now include reasonable subsets, but none contains them all. Particularly good font sets in this regard are available at: Unicode Fonts for Ancient Scripts. The Symbola set has a good selection of symbols.

For more information
