From MobileRead

Character sets and character encodings

These two concepts are often confused. UTF-8 and UTF-16, for instance, are encodings of the Unicode character set. As long as no character set had more than 256 characters, the difference between the two was mainly academic, but with wide character sets it becomes increasingly important to make the distinction.

A character set is, I believe, best described as an abstract collection of characters, each associated with a code point (for instance, in ISO 8859-1 the character 'A' is associated with the code point 0x41, or 65).
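To make the character/code point association concrete, here is a minimal Python sketch. It only assumes Python 3 and its built-in `ord` and `bytes.decode`; the variable names are made up for illustration.

```python
# The code point for 'A' as a plain number, independent of storage.
ch = "A"
code_point = ord(ch)
print(code_point, hex(code_point))

# In ISO 8859-1 the byte value equals the code point, so the
# single byte 0x41 decodes back to the character 'A'.
decoded = bytes([0x41]).decode("iso-8859-1")
print(decoded)
```

Nothing here says how the number is stored; that is the encoding's job.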

A character encoding defines how a code point is represented in some particular type of data storage. With ISO 8859-1, it is simple: each code point maps one-to-one to a single byte. With Unicode (where the code point for 'A' is U+0041, and the code space as a whole is far too large for every code point to fit in a single byte), it becomes important to know whether the code point is stored as one 16-bit unit or as two 8-bit bytes (in which case, is 00 stored in the first byte or the last?), or with some other encoding method (such as UTF-8 or even UTF-7). Athulin 14:43, 23 September 2008 (EDT)
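As a rough illustration of the point about storage, the following Python sketch encodes the same character 'A' in several ways and prints the resulting bytes. The codec names are the ones Python's standard library uses; the helper `as_hex` is a hypothetical name introduced only for this example.

```python
def as_hex(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

ch = "A"  # code point U+0041

# Same code point, different byte sequences depending on the encoding.
encoded = {
    name: ch.encode(name)
    for name in ("iso-8859-1", "utf-8", "utf-16-be", "utf-16-le")
}

for name, data in encoded.items():
    print(f"{name:11} {as_hex(data)}")
```

The two UTF-16 variants differ only in whether the 00 byte comes first (big endian) or last (little endian), which is exactly the question raised above.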

defining new terms

I was hoping to avoid defining a bunch of new terms like code point and code space, but perhaps it is necessary. Let me think about this and see if I can still keep it as simple as possible while using the more precise terms. In addition, we are getting into big-endian and little-endian concepts, which are beyond the scope of an introductory text. --DaleDe 18:03, 23 September 2008 (EDT)
