From MobileRead

Characters are the basic unit of written communication; the technical term for them is graphemes. In computer terminology, characters include letters (the alphabet), digits, punctuation, and symbols. In some languages, diacritical marks (accents) are also part of some letters.


Character sets

A character set is a collection of characters. For example, ASCII is a well-known collection used in computer systems. It defines 128 characters, including the basic Latin (also known as Western) alphabet. The basic storage unit in modern computers is the byte, which can store 256 distinct values. When a byte is used to store characters, the ASCII characters occupy the first 128 slots, while the remaining 128 slots may be mapped to special symbols or additional characters. Most encoding schemes leave the first 128 characters in their ASCII positions for compatibility. The term ANSI is sometimes used, loosely, for an 8-bit character code to contrast it with Unicode.
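The split between the shared lower half and the encoding-specific upper half can be checked with a short Python sketch. The variable names are only illustrative, and ISO-8859-15 is brought in purely as a second 8-bit character set for comparison; it is not mentioned in the text above.

```python
# Printable ASCII bytes (values 32..126) decode to the same text
# under every ASCII-compatible 8-bit character set.
ascii_bytes = bytes(range(32, 127))
as_ascii = ascii_bytes.decode("ascii")
as_latin1 = ascii_bytes.decode("iso-8859-1")
as_cp1252 = ascii_bytes.decode("cp1252")

# A byte in the upper half (128..255) means different things
# depending on which character set is assumed.
high_byte = bytes([0xA4])
in_latin1 = high_byte.decode("iso-8859-1")    # U+00A4, the currency sign
in_latin9 = high_byte.decode("iso-8859-15")   # U+20AC, the euro sign

print(as_ascii == as_latin1 == as_cp1252)
print(f"U+{ord(in_latin1):04X} versus U+{ord(in_latin9):04X}")
```

The same stored byte therefore shows up as two different characters, which is why a document has to say which character set it uses.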

Some encoding schemes, most notably the UTF-8 encoding of Unicode, extend beyond 256 locations by reserving certain values within the first 256 codes to signal that the character continues beyond one byte. This technique permits characters to be represented by a variable number of bytes. The computer or electronic device displaying the characters translates the binary value into a graphic representation (a glyph) using fonts. Note that a glyph can represent more than one character (a ligature, for example) and may change depending on adjacent characters.
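As a rough illustration of variable byte lengths, the following Python sketch encodes four sample characters as UTF-8 and reads the sequence length from the first byte. The helper name utf8_sequence_length is hypothetical and the function is deliberately simplified: it inspects only the lead byte and does not validate the bytes that follow.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its lead byte."""
    if lead < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC0 <= lead < 0xE0:    # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead < 0xF0:    # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead < 0xF8:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

# Sample characters: A, e with acute accent, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch in samples:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes} "
          f"({utf8_sequence_length(encoded[0])} byte(s))")
```

ASCII characters keep their one-byte form, so plain ASCII text is already valid UTF-8.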

Web browsers, and therefore HTML documents, support a defined collection of character sets (also called character encodings). Some of the more popular ones include:

To let the browser know which character set is being used, a special entry is placed in the head element. For example:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

would indicate that the character set for this page is UTF-8, which can represent any Unicode character, including many special characters.
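How a program might read that declaration can be sketched in Python with the standard html.parser module. The class MetaCharsetFinder and the function sniff_meta_charset are hypothetical names invented for this illustration; real browsers follow a more elaborate detection procedure, so treat this only as a minimal sketch.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a meta element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="...">
            self.charset = attributes["charset"]
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: content="text/html; charset=..."
            for part in (attributes.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset":
                    self.charset = value.strip()


def sniff_meta_charset(raw: bytes):
    # ISO-8859-1 maps every byte to some character, so this provisional
    # decode cannot fail; it is only used to locate the ASCII markup.
    finder = MetaCharsetFinder()
    finder.feed(raw.decode("iso-8859-1"))
    return finder.charset


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8" /></head></html>')
print(sniff_meta_charset(page))
```

The provisional decode works only because the markup itself is plain ASCII, which is the compatibility property described earlier.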

An XML file may also declare its encoding on the first line. For example:

<?xml version="1.0" encoding="ISO-8859-1"?>

Having this on the first line helps avoid the problem of needing to parse a file in some character set just to find out what the character set should be. A meta element, by contrast, can be many lines from the beginning.
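A small Python sketch shows an XML parser honouring the declaration. The sample document and its word element are made up for the example; the parser is the standard xml.etree.ElementTree module, which is given raw bytes so that it has to rely on the declared encoding.

```python
import xml.etree.ElementTree as ET

# Build the document as ISO-8859-1 bytes: the accented letter is
# stored as the single byte 0xE9, which is not valid UTF-8 on its own.
raw = ('<?xml version="1.0" encoding="ISO-8859-1"?>\n'
       '<word>caf\u00e9</word>').encode("iso-8859-1")

# The parser reads the declaration on the first line and decodes
# the rest of the file accordingly.
root = ET.fromstring(raw)
text = root.text
print(text)
```

Without the declaration an XML parser would assume UTF-8 and reject the lone 0xE9 byte.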

An XHTML document, as specified for ePUB, should use the XML syntax.

Character entry

The most common way to enter characters is the keyboard. This is generally adequate for the ASCII character set, but the keyboard will often be missing the extra characters defined in the character encoding. To generate these characters, the user will often resort to entering their numeric codes directly, or a menu may be provided. In some cases the user can copy and paste the desired character from somewhere it is already displayed. For HTML, specially designed word codes, called entity references, can be used to generate some characters. These word codes are shown for various character sets in the links shown above. Some additional special characters are shown in the link.
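Entity references can be tried out with the Python standard library, as in this short sketch. The sample strings are arbitrary; html.unescape, html.escape and html.entities.name2codepoint are the standard functions and table for this purpose.

```python
import html
from html.entities import name2codepoint

# Three ways of writing the same character (e with an acute accent):
named = html.unescape("&eacute;")        # word code (named entity)
decimal = html.unescape("&#233;")        # decimal numeric reference
hexadecimal = html.unescape("&#xE9;")    # hexadecimal numeric reference

# The word code maps to a numeric character code.
codepoint = name2codepoint["eacute"]

# Going the other way: characters with special meaning in HTML
# are replaced by their entity references.
escaped = html.escape("a < b & c")

print(named, decimal, hexadecimal, codepoint, escaped)
```

Because entity references are written in plain ASCII, they survive regardless of the character set the page is saved in.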

In the English version of Windows, the characters from Windows-1252 and ISO-8859-1 can be inserted by holding down the Alt key and entering a zero followed by the character's three-digit decimal code on the numpad. You can also download a table that contains many character maps from

Displaying Characters

Whether a particular character will be displayed on the screen depends on the fonts used for the display. Not every font has all of the characters that might have been defined. In addition, a particular font might have the character but map it to a different character code. For this reason the characters you see might look very different from the characters you were expecting. Knowing and using the fonts intended by the originator will usually resolve this problem.

If a font does not have a glyph for a particular character, the display will often show a ? where the character belongs. In some cases a box or other graphic may be used, or the character may be omitted.
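Font rendering itself cannot be shown in a few lines, but the same three outcomes (a ? substitute, an omitted character, or an outright failure) appear one level down, when text is converted to a character set that lacks a character. The Python sketch below illustrates that analogous situation only; the sample string is arbitrary.

```python
text = "price: 5\u20ac"   # ends with the euro sign, which ASCII lacks

# Substitute a question mark for anything that cannot be represented.
replaced = text.encode("ascii", errors="replace")

# Silently drop anything that cannot be represented.
ignored = text.encode("ascii", errors="ignore")

# The default behaviour is to refuse and report the position.
try:
    text.encode("ascii")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start   # index of the first unencodable character

print(replaced, ignored, failed_at)
```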

Conversion

It is possible to convert text from one character set to another using the iconv command, available on Unix and GNU systems and in Cygwin for Windows. This is a command-line (CLI) utility.
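A typical invocation looks like iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt, where the file names are placeholders. The same conversion can be sketched in Python for readers without iconv; the function name convert is hypothetical, and the example works on in-memory bytes rather than files.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character set to another."""
    # Decode to abstract characters first, then encode in the target set.
    return data.decode(source).encode(target)


# "cafe" with an acute accent on the e, stored as ISO-8859-1.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
utf8_bytes = convert(latin1_bytes, "iso-8859-1", "utf-8")

print(latin1_bytes, utf8_bytes)
```

The accented letter takes one byte in ISO-8859-1 and two in UTF-8, so the converted data is one byte longer.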

For more information
