Unicode
Unicode (and the parallel ISO 10646 standard) defines the character set necessary for efficiently processing text in any language and for maintaining text data integrity.
Overview
Unicode maps the letters and symbols of all the world's languages into a single character set in which each symbol has a unique numeric value, called a code point. In addition to global character coverage, the Unicode standard is unique among character set standards because it also defines data and algorithms for efficient and consistent text processing. It also attempts to standardize symbols that are in use worldwide.
A common way to get a consistent implementation is ICU, the International Components for Unicode. ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and globalization support for software applications. It is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.
History
The name Unicode was chosen to suggest a unique, unified, universal encoding: an effort to bring the characters of all the world's languages into one system. This effort began in the late 1980s. Unicode was designed to be simple for computers to process and to avoid escape sequences. An escape sequence switches the interpretation of the codes that follow it, so the same binary code can stand for different characters depending on which escape character was last seen.
Unicode was developed as a single-coded character set that contains support for all languages in the world. The first version of Unicode used 16-bit numbers, which allowed for encoding 65,536 characters without complicated multibyte schemes. With the inclusion of more characters, and following implementation needs of many different platforms, Unicode was extended to allow more than one million characters. Several other encoding schemes were added. This introduced more complexity into the Unicode standard, but far less than managing a large number of different encodings.
Starting with Unicode 2.0 (published in 1996), the Unicode standard began assigning numbers from 0 to 10FFFF (hexadecimal), a range that requires 21 bits but does not use them completely. This gives more than enough room for all written languages in the world. The original repertoire covered all major languages commonly used in computing. Unicode continues to grow as more scripts are added; the version list under "For more information" gives character counts for recent releases.
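The arithmetic behind that range can be checked with a short sketch. The variable names are illustrative only; the sketch assumes nothing beyond Python's built-in integers and `chr`.

```python
# Highest code point Unicode assigns numbers up to (hexadecimal 10FFFF).
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to represent the highest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Code points run from 0 through MAX_CODE_POINT inclusive.
total_code_points = MAX_CODE_POINT + 1

# Values a 21-bit number could hold that fall outside the Unicode range.
unused_21_bit_values = 2 ** bits_needed - total_code_points

print(bits_needed)                    # 21
print(f"{total_code_points:,}")       # 1,114,112
print(f"{unused_21_bit_values:,}")    # 983,040

# Python's chr() accepts the top of the range and rejects anything above it.
top = chr(MAX_CODE_POINT)
try:
    chr(MAX_CODE_POINT + 1)
    above_range_rejected = False
except ValueError:
    above_range_rejected = True
```

This is why the text says the 21 bits are not used completely: a little under half of the 21-bit values lie above 10FFFF.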
Design of Unicode
The design of Unicode differs in several ways from traditional character sets and encoding schemes:
- Its repertoire enables users to include text efficiently in almost all languages within a single document.
- It can be encoded in a byte-based way with one or more bytes per character, but the default encoding scheme uses 16-bit units that allow much simpler processing for all common characters.
- Many characters, such as letters with accents and umlauts, can be combined from the base character and accent or umlaut modifiers. This combining reduces the number of different characters that need to be encoded separately. "Precomposed" variants for characters that existed in common character sets at the time were included for compatibility.
- Characters and their usage are well-defined and described. While traditional character sets typically only provide the name or a picture of a character and its number and byte encoding, Unicode has a comprehensive database of properties available for download. It also defines a number of processes and algorithms for dealing with many aspects of text processing to make it more interoperable.
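As a minimal sketch of the last two points, Python's standard `unicodedata` module exposes part of the Unicode character database and the standard's normalization algorithm. The choice of "é" as the sample character is an assumption made for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as one precomposed code point
combined = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

# The two strings look the same when displayed but are different sequences.
same_before = precomposed == combined            # False

# Normalization converts between the composed and decomposed forms.
composed = unicodedata.normalize("NFC", combined)       # back to one code point
decomposed = unicodedata.normalize("NFD", precomposed)  # base letter + accent

# Properties from the character database.
name = unicodedata.name(precomposed)             # LATIN SMALL LETTER E WITH ACUTE
letter_category = unicodedata.category(precomposed)  # "Ll": lowercase letter
accent_category = unicodedata.category("\u0301")     # "Mn": nonspacing mark

print(same_before, composed == precomposed, decomposed == combined)
print(name, letter_category, accent_category)
```

Comparing text only after normalizing both sides to the same form is a common way to treat the precomposed and combined spellings as equal.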
The early inclusion of all characters of commonly used character sets makes Unicode a useful "pivot" point for converting between traditional character sets. It also makes it feasible to process non-Unicode text by first converting it into Unicode, processing the text, and converting it back to the original encoding without loss of data.
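The pivot idea can be sketched with Python's built-in codecs. The helper name `process_via_unicode`, the sample bytes, and the choice of ISO-8859-1 and Mac Roman as the legacy encodings are all assumptions made for this example.

```python
def process_via_unicode(data: bytes, encoding: str) -> bytes:
    """Decode legacy bytes to Unicode, process the text, and encode it back."""
    text = data.decode(encoding)       # legacy encoding -> Unicode
    processed = text.upper()           # any text processing happens on Unicode
    return processed.encode(encoding)  # Unicode -> original encoding

# "café" stored in ISO-8859-1, where "é" is the single byte 0xE9.
legacy_bytes = b"caf\xe9"

# Process the text and return it in its original encoding.
upper_bytes = process_via_unicode(legacy_bytes, "iso-8859-1")

# Convert between two legacy encodings by passing through Unicode.
mac_bytes = legacy_bytes.decode("iso-8859-1").encode("mac_roman")

print(upper_bytes)                    # b'CAF\xc9'
print(mac_bytes.decode("mac_roman"))  # café
```

The conversion is lossless only when the target encoding contains every character in the text; otherwise `encode` raises `UnicodeEncodeError` unless an error handler is supplied.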
Implementation
One encoding of Unicode characters, favored in western countries, is UTF-8, because ASCII text is already valid UTF-8. In this format a character is one to four bytes long. The default format is UTF-16, which is favored in East Asian countries because it is more compact for their scripts. In UTF-16 the minimum size of a character is 2 bytes, and characters outside the original 16-bit range take 4 bytes.
UTF-32 also exists. It is straightforward to code because every character fits in one 32-bit integer, but it is wasteful of storage space. (All of the Unicode characters can be coded in 21 bits.)
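The size trade-offs can be seen by encoding a few sample characters. This is a minimal sketch; the four sample characters are arbitrary choices, and the little-endian codec variants are used so that no byte order mark is counted.

```python
# Sample characters: ASCII, accented Latin, CJK, and an emoji beyond 16 bits.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Map each character to its encoded length in bytes under each encoding.
sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, by_encoding in sizes.items():
    print(f"U+{ord(ch):04X}", by_encoding)

# U+0041  utf-8: 1, utf-16-le: 2, utf-32-le: 4
# U+00E9  utf-8: 2, utf-16-le: 2, utf-32-le: 4
# U+4E2D  utf-8: 3, utf-16-le: 2, utf-32-le: 4
# U+1F600 utf-8: 4, utf-16-le: 4, utf-32-le: 4
```

The CJK row shows the case where UTF-16 is smaller than UTF-8, and the ASCII row shows the opposite.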
Prior to the invention of Unicode, multiple character sets were used to handle glyphs in multiple languages. Even today some viewers and editors can get confused by files coded in Unicode. Files should therefore identify which character encoding they use. The way this is done varies depending on the file format. HTML5 supports both Unicode characters themselves and named character references, which can be used if the editor does not support the character directly. See also alphabet for various Unicode alphabets available.
The area from E000 to F8FF is a private use area containing 6,400 code points. Anyone can define Unicode characters in this range for any purpose. Of course, a font is needed to support these definitions.
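As a rough illustration of character references, Python's standard `html` module can convert between a character and its reference forms. The sample character "é" is an arbitrary choice.

```python
import html
from html.entities import codepoint2name

# Three ways HTML can refer to the same character, U+00E9.
from_named = html.unescape("&eacute;")   # named character reference
from_decimal = html.unescape("&#233;")   # decimal numeric reference
from_hex = html.unescape("&#xE9;")       # hexadecimal numeric reference

# Look up the entity name for a code point.
entity_name = codepoint2name[0xE9]       # "eacute"

# Write text for a tool that handles only ASCII: characters outside ASCII
# are replaced by numeric character references.
ascii_safe = "caf\u00e9".encode("ascii", "xmlcharrefreplace")

print(from_named == from_decimal == from_hex)  # True
print(entity_name)                             # eacute
print(ascii_safe)                              # b'caf&#233;'
```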
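The size of this range and the private-use status of its code points can be checked with a small sketch. The helper `is_private_use` is a hypothetical name; it relies on the general category "Co" that Python's `unicodedata` module reports for private-use code points.

```python
import unicodedata

PUA_START = 0xE000
PUA_END = 0xF8FF

# Number of code points in the range, counting both ends.
pua_size = PUA_END - PUA_START + 1

def is_private_use(ch: str) -> bool:
    """Return True if the character has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

# Every code point in the range should report as private use.
all_private = all(is_private_use(chr(cp)) for cp in range(PUA_START, PUA_END + 1))

print(pua_size)             # 6400
print(all_private)          # True
print(is_private_use("A"))  # False
```

Because the meaning of these code points is not defined by the standard, text that uses them is only portable between parties that agree on the same definitions and font.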
Finding the font
Sometimes it can be difficult to find the font you want with all of the possible characters out there in Unicode. There is a free program called BabelMap that can be of help. It can be found at: http://www.babelstone.co.uk/Software/BabelMap.html. The results are bitmapped fonts. You can also use the online resource http://unicode-table.com/en/ to tap on a character, place it in your clipboard, and then paste it into any document that supports Unicode. The text editor BabelPad is a Unicode editor that contains BabelMap, with the addition of text editing and font conversion tools.
It can also be difficult to find a font set that contains the glyphs you need. Many font sets now include reasonable subsets, but none contains them all. Particularly good font sets in this regard are available at: Unicode Fonts for Ancient Scripts. The Symbola set has a good selection of symbols.
For more information
- http://source.icu-project.org/
- http://www.unicode.org/charts/ - Charts of all the symbols, kept up to date with new releases and organized into script sets and separate groups of symbols and punctuation.
- http://www.unicode.org/versions/enumeratedversions.html - Lists all the versions. A few samples:
- http://www.unicode.org/versions/Unicode5.1.0/ - standard released April 4, 2008
- http://www.unicode.org/versions/Unicode6.0.0/ - standard released Oct 11, 2010
- http://www.unicode.org/versions/Unicode7.0.0/ - standard released June 16, 2014 with a total of 112,956 characters.
- http://www.unicode.org/versions/Unicode8.0.0/ - standard released June 15, 2015. Unicode 8.0 adds 7,716 characters for a total of 120,672 characters. These additions include six new scripts and many new symbols, as well as character additions to several existing scripts.
- http://www.unicode.org/versions/Unicode9.0.0/ - standard released June 21, 2016. Version 9.0 adds exactly 7,500 characters, for a total of 128,172 characters. These additions include six new scripts and 72 new emoji characters.
- http://www.unicode.org/versions/Unicode10.0.0/ - Version 10 released June 20, 2017. Unicode 10.0 adds 8,518 characters, for a total of 136,690 characters. These additions include 4 new scripts, for a total of 139 scripts, as well as 56 new emoji characters.
- http://www.unicode.org/versions/Unicode11.0.0/ - Version 11 released June 10, 2018. Version 11.0 adds 684 characters, for a total of 137,374 characters. These additions include 7 new scripts, for a total of 146 scripts, as well as 66 new emoji characters.
- http://www.unicode.org/versions/latest/ Always refers to the latest version.
- For more detail about the mappings between Unicode and other formats, you can view:
- Unicode <--> ISO-8859 mappings at ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/
- Unicode <--> Windows mappings at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/
- Unicode <--> Apple mappings at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/
- http://www.fileformat.info/info/unicode/index.htm provides samples that may be copied and more information on specific codes. Use the search to view the symbol for a particular Unicode value U+xxxx.
- http://unicode-table.com/en/ A list of codings and their symbols.
- http://www.unicode.org/charts/ The latest charts for Unicode. It includes Scripts, Symbols, and Notes
- https://en.wikipedia.org/wiki/Unicode lots of pages link off this one.