ASCII

From MobileRead
ASCII (American Standard Code for Information Interchange) is a standard that defines the basic character encoding used for TXT files. The letters it covers are the unaccented letters of the western (Latin) alphabet.

Overview

On current computer systems, the basic unit of information is a byte, which is made up of eight bits. A single bit has two states: it may be on or off. There are 256 (2 to the 8th power) possible combinations of on and off in a string of eight bits, so a byte can represent 256 different values.

The ASCII standard defines what is represented by the first 128 values of a byte (7 bits). ASCII 0 through 31 are reserved for "control" characters, like tabs, line feeds, and carriage returns. The value 127 (all seven bits set to one) is a correction character used in some systems to "rubout" an existing character. The other values, 32 through 126, represent printable characters, that is, upper and lower case letters (the alphabet), numbers, punctuation, a space, basic mathematical symbols, and a few other symbols. Values between 128 and 255 are not defined by the ASCII standard and may vary depending upon the system used. On PCs, they are used for mathematical symbols, Greek letters, accented "national" characters (extended), and the like.
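The ranges described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify` and its category labels are made up for this example and are not part of the ASCII standard.

```python
def classify(value: int) -> str:
    """Return a rough ASCII category for a byte value (0-255)."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values 0 through 255")
    if value <= 31:
        return "control"      # tabs, line feeds, carriage returns, ...
    if value == 127:
        return "delete"       # the "rubout" character
    if value <= 126:
        return "printable"    # space, letters, digits, punctuation
    return "not ASCII"        # 128-255: meaning depends on the system


print(2 ** 8)  # values a byte can hold
print(2 ** 7)  # values that ASCII defines
for v in (9, 65, 127, 200):
    print(v, classify(v))
```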

Plain ASCII text files contain only the values 0 through 127. ASCII is a subset of the ISO-8859-1 and UTF-8 standard character sets as well as Windows-1252 and many others. The UTF-16 and UTF-32 standards reserve the exact same code values for ASCII characters even though the width of each character is more than 8 bits; the leading bits are all zeros.
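A quick way to see this compatibility is to encode the same text with several codecs. The sketch below uses Python's built-in codecs; the sample string and variable names are arbitrary, and the big-endian variants of UTF-16 and UTF-32 are chosen so that the zero padding appears in front of each ASCII value.

```python
text = "Az"
ascii_bytes = text.encode("ascii")

# The 8-bit and variable-width encodings produce identical bytes for ASCII text.
same_as_ascii = {
    name: text.encode(name) == ascii_bytes
    for name in ("iso-8859-1", "cp1252", "utf-8")
}

# The wider encodings keep the same values, padded with zero bytes.
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")

print(same_as_ascii)
print(ascii_bytes.hex(), utf16.hex(), utf32.hex())
# 417a 0041007a 000000410000007a
```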

History

The history of ASCII is tied to the computer. Initially the idea of electronic text was not in the minds of the early computer designers of the 1940s. Early computers were concerned with number crunching. A 4 bit code called BCD (binary coded decimal) was developed to represent the numbers. By the 1950s letters had appeared, but designers considered text useful only for messages. Adding letters increased the BCD coding scheme to 6 bits (64 combinations). Uppercase letters and some symbols were all that were supported.

Meanwhile companies working on the electric communication of messages (Telex, TWX) in the 50s were also struggling to solve some of their own coding problems. They were using a 5 bit code derived from the Baudot code with only 32 combinations. They used two of those codes to switch between numbers/symbols and letters. As communication hardware and computer hardware began to interact a better system for coding was needed.

In the 60s a committee was assigned to solve the problem and the result was the seven bit ASCII code that we still use today. An initial design was implemented in 1963 but without lower case letters. In 1967 this was corrected. Part of the design was to select a letter order for the coding that permitted easy alphabetical sorting by computer. IBM knew of this standard but resisted adopting it on their mainframe computers, preferring a scheme based on an extended version of BCD called EBCDIC.

By the late 60s the guys at Bell Labs were designing a new OS justified, in part, by the need for a decent text editing capability to create electronic documents to be printed out. The result was Unix, which used the ASCII coding scheme.

Today ASCII is used on computers and is also defined for communication between devices and even long distance communication over wire or wireless. It is the standard for email. It can also be used as a way to transmit binary data using Base64 encoding.
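As a small illustration of the Base64 point, the sketch below uses Python's standard `base64` module to turn arbitrary bytes, including values outside the ASCII range, into ASCII-only text and back. The sample bytes are arbitrary.

```python
import base64

raw = bytes([0x00, 0xFF, 0x80, 0x7F])   # includes values that are not ASCII
encoded = base64.b64encode(raw)          # result uses only ASCII characters
decoded = base64.b64decode(encoded)      # round trip restores the original

print(encoded.decode("ascii"))  # AP+Afw==
print(decoded == raw)           # True
```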

ASCII Chart

In the chart below the decimal value, the hexadecimal value, and the visible character (or standard keyboard character) for each ASCII character are shown. The control codes can be generated by holding down the Ctrl (control) key (shown as ^) while striking the key shown. The ASCII codes are:

000  0x00  ^@   032  0x20       064  0x40  @    096  0x60  `
001  0x01  ^A   033  0x21  !    065  0x41  A    097  0x61  a
002  0x02  ^B   034  0x22  "    066  0x42  B    098  0x62  b
003  0x03  ^C   035  0x23  #    067  0x43  C    099  0x63  c
004  0x04  ^D   036  0x24  $    068  0x44  D    100  0x64  d
005  0x05  ^E   037  0x25  %    069  0x45  E    101  0x65  e
006  0x06  ^F   038  0x26  &    070  0x46  F    102  0x66  f
007  0x07  ^G   039  0x27  '    071  0x47  G    103  0x67  g
008  0x08  ^H   040  0x28  (    072  0x48  H    104  0x68  h
009  0x09  ^I   041  0x29  )    073  0x49  I    105  0x69  i
010  0x0a  ^J   042  0x2a  *    074  0x4a  J    106  0x6a  j
011  0x0b  ^K   043  0x2b  +    075  0x4b  K    107  0x6b  k
012  0x0c  ^L   044  0x2c  ,    076  0x4c  L    108  0x6c  l
013  0x0d  ^M   045  0x2d  -    077  0x4d  M    109  0x6d  m
014  0x0e  ^N   046  0x2e  .    078  0x4e  N    110  0x6e  n
015  0x0f  ^O   047  0x2f  /    079  0x4f  O    111  0x6f  o
016  0x10  ^P   048  0x30  0    080  0x50  P    112  0x70  p
017  0x11  ^Q   049  0x31  1    081  0x51  Q    113  0x71  q
018  0x12  ^R   050  0x32  2    082  0x52  R    114  0x72  r
019  0x13  ^S   051  0x33  3    083  0x53  S    115  0x73  s
020  0x14  ^T   052  0x34  4    084  0x54  T    116  0x74  t
021  0x15  ^U   053  0x35  5    085  0x55  U    117  0x75  u
022  0x16  ^V   054  0x36  6    086  0x56  V    118  0x76  v
023  0x17  ^W   055  0x37  7    087  0x57  W    119  0x77  w
024  0x18  ^X   056  0x38  8    088  0x58  X    120  0x78  x
025  0x19  ^Y   057  0x39  9    089  0x59  Y    121  0x79  y
026  0x1a  ^Z   058  0x3a  :    090  0x5a  Z    122  0x7a  z
027  0x1b  ^[   059  0x3b  ;    091  0x5b  [    123  0x7b  {
028  0x1c  ^\   060  0x3c  <    092  0x5c  \    124  0x7c  |
029  0x1d  ^]   061  0x3d  =    093  0x5d  ]    125  0x7d  }
030  0x1e  ^^   062  0x3e  >    094  0x5e  ^    126  0x7e  ~
031  0x1f  ^_   063  0x3f  ?    095  0x5f  _    127  0x7f  ⌂
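The chart shows a regular pattern: each control code in the first column sits exactly 64 (0x40) below the character in the third column of the same row, so ^A is 65 - 64 = 1 and ^[ is 91 - 64 = 27. The helper names `ctrl` and `caret` below are hypothetical, chosen only to illustrate that relationship.

```python
def ctrl(key: str) -> int:
    """Return the control code produced by Ctrl plus the given key."""
    code = ord(key.upper())
    if not 0x40 <= code <= 0x5F:
        raise ValueError("key must be one of @ A-Z [ \\ ] ^ _")
    return code - 0x40


def caret(code: int) -> str:
    """Return the caret notation (such as ^J) for a control code 0-31."""
    if not 0 <= code <= 31:
        raise ValueError("control codes run from 0 through 31")
    return "^" + chr(code + 0x40)


print(ctrl("A"), ctrl("["), ctrl("i"))
print(caret(10), caret(13))
```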

Control Codes

The first 32 ASCII characters (0 through 31) are called control codes, as they cause an action to happen rather than just printing a character. Most of the control codes are intended for communication protocol use but some are used in computers as well. Those that have a key on a standard keyboard are identified as such below.

Some of the more important control codes are:

007 is BEL (ring a bell) to get the attention of the operator.
008 is generally backspace key
009 is tab key
010 is line feed (new line)
011 is vertical tab (not often used)
012 is form feed (new page)
013 is carriage return (often called the enter key)
027 is the Esc key
127 is typically the delete key (rubout)

In many applications the Enter key inserts the carriage return and line feed sequence. In some of them the carriage return by itself can be entered with the Shift+Enter key combination.

There is often a need to embed control codes in text, for example in program source code, so that they are interpreted later. The most common method is to use an escape sequence:

\0 null
\n new line
\r carriage return
\t tab
\b backspace
\f form feed
\v vertical tab

A more general approach is to use \x## where ## is a two-digit hexadecimal value for the control code. JSON supports neither \v nor \x##, so in JSON a vertical tab must be written as \u000B. The sequence \0 is not in JSON either and may not be supported in other systems as well.
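The sketch below shows how Python and its standard `json` module treat these escapes. It is one concrete example of a system's behavior, not a general rule for every language; the helper `is_valid_json` is a hypothetical name used only here.

```python
import json

# In Python source, the short and hexadecimal escapes name the same character.
same_char = "\v" == "\x0b"
tab_code = ord("\t")

# Python's json module writes a vertical tab as a \u escape.
encoded = json.dumps("a\vb")
print(encoded)  # "a\u000bb"


def is_valid_json(document: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(document)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(r'"\u000B"'))  # True
print(is_valid_json(r'"\x0B"'))    # False
print(is_valid_json(r'"\v"'))      # False
```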

ASCII names

Web browsers and other software using HTML define HTML entity names for some of the ASCII characters, for use when HTML decoding would otherwise interfere with the proper display of the actual character. Names are available for 4 characters (quot, amp, lt, and gt in lower or upper case) and in XHTML a fifth was added (apos). In HTML5 nearly all of the ASCII symbol and punctuation characters have names. These named character references along with their Unicode designation are listed below. To use a name you must precede it with &, for example &lt; for the < character.

Name      Character(s)  Glyph  Notes
Tab;      U+00009
NewLine;  U+0000A
excl;     U+00021       !
quot;     U+00022       "      entity reference; also QUOT QUOT; quot
num;      U+00023       #
dollar;   U+00024       $
percnt;   U+00025       %
amp;      U+00026       &      entity reference; also AMP AMP; amp
apos;     U+00027       '      entity reference
lpar;     U+00028       (
rpar;     U+00029       )
ast;      U+0002A       *      also midast;
plus;     U+0002B       +
comma;    U+0002C       ,
(none)    U+0002D       -
period;   U+0002E       .
sol;      U+0002F       /
colon;    U+0003A       :
semi;     U+0003B       ;
lt;       U+0003C       <      entity reference; also LT; LT lt
equals;   U+0003D       =
gt;       U+0003E       >      entity reference; also GT; GT gt
quest;    U+0003F       ?
commat;   U+00040       @
lbrack;   U+0005B       [      also lsqb;
bsol;     U+0005C       \
rbrack;   U+0005D       ]      also rsqb;
Hat;      U+0005E       ^      caret
lowbar;   U+0005F       _      also UnderBar;
grave;    U+00060       `      also DiacriticalGrave;
lbrace;   U+0007B       {      also lcub;
vert;     U+0007C       |      also VerticalLine; verbar;
rbrace;   U+0007D       }      also rcub;
(none)    U+0007E       ~
(none)    U+0007F
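For experimenting with these names, Python's standard `html` module can convert in both directions. This is a minimal sketch with arbitrary sample strings. `html.unescape` recognizes the HTML5 names, while `html.escape` only replaces the few characters that would interfere with HTML parsing.

```python
import html

# Escaping: only the characters that HTML treats specially are replaced.
escaped = html.escape('<a href="x">Q&A</a>')
print(escaped)  # &lt;a href=&quot;x&quot;&gt;Q&amp;A&lt;/a&gt;

# Unescaping: named references from the table are turned back into ASCII.
decoded = html.unescape("&lt;&excl;&lbrack;&commat;&rbrack;&gt;")
print(decoded)  # <![@]>

# The two control code names decode to the tab and line feed characters.
controls = html.unescape("&Tab;&NewLine;")
print([ord(c) for c in controls])  # [9, 10]
```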

Two of the control codes are included, and all of the symbol characters have names except the space, the minus (dash, hyphen) character, and the tilde. The HTML5 names minus; and Tilde; refer to other Unicode characters instead of the ASCII ones (minus; is U+02212 and Tilde; is U+0223C). The letters and digits are represented by themselves. Note that the alphabet is laid out so that the uppercase and lowercase forms of a letter differ by a single bit in the byte. For example, A is 0x41 (0100 0001) and a is 0x61 (0110 0001).
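The single-bit difference between upper and lower case can be checked directly. In the sketch below, `CASE_BIT` and `toggle_case` are illustrative names, not standard functions; the bit in question has the value 0x20 (0010 0000).

```python
CASE_BIT = 0x20  # the one bit that differs between 'A' (0x41) and 'a' (0x61)


def toggle_case(ch: str) -> str:
    """Flip the case of a single ASCII letter; return other characters unchanged."""
    if len(ch) == 1 and ch.isascii() and ch.isalpha():
        return chr(ord(ch) ^ CASE_BIT)
    return ch


print(format(ord("A"), "08b"), format(ord("a"), "08b"))  # 01000001 01100001
print(toggle_case("A"), toggle_case("z"), toggle_case("1"))  # a Z 1
```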
