ASCII
ASCII (American Standard Code for Information Interchange) is a standard that defines the basic digital character interpretation for TXT. The letters it covers are collectively known as the western or Latin alphabet.
Overview
On current computer systems, the basic unit of information is a byte, which is made up of eight bits. A single bit has two states: it may be on or off. There are 256 possible combinations of on and off in a string of eight bits, so a byte can represent 256 different values.
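The count of 256 is simply 2 raised to the power 8. A minimal Python sketch can confirm it by listing every on/off pattern; the variable name `patterns` is only illustrative.

```python
from itertools import product

# Every way of setting 8 bits to off (0) or on (1).
patterns = list(product((0, 1), repeat=8))

print(len(patterns))   # 256
print(2 ** 8)          # 256, the same count computed directly
print(patterns[65])    # (0, 1, 0, 0, 0, 0, 0, 1), the bit pattern for value 65
```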
The ASCII standard defines what is represented by the first 128 values of a byte (7 bits). ASCII 0 through 31 are reserved for "control" characters, such as tabs, line feeds, and carriage returns. The value 127 (all seven bits set to one) is a correction character used in some systems to "rub out" an existing character. The other values, 32 through 126, represent printable characters: upper and lower case letters (the alphabet), digits, punctuation, the space, and basic mathematical and a few other symbols. Values from 128 through 255 are not defined by the ASCII standard and may vary depending upon the system used. On PCs they are used for mathematical symbols, Greek letters, accented "national" characters (extended), and the like.
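The ranges described above can be sketched as a small classifier. The function name `classify` and the category labels are hypothetical, chosen only for this illustration.

```python
def classify(value: int) -> str:
    """Return the ASCII category of a byte value (0-255)."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values 0 through 255")
    if value <= 31:
        return "control"
    if value == 127:
        return "delete (rubout)"
    if value <= 126:
        return "printable"
    return "not defined by ASCII"   # 128-255 depend on the system in use

for v in (9, 32, 65, 127, 200):
    print(v, classify(v))
```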
Plain ASCII text files contain only the values 0 through 127. ASCII is a subset of the ISO-8859-1 and UTF-8 character sets, as well as Windows-1252 and many others. UTF-16 and UTF-32 assign the same values to the ASCII characters even though each character is wider than 8 bits; the leading bits are all zeros.
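The subset relationship can be checked with Python's built-in codecs. This is a sketch rather than a proof; the sample string is arbitrary, and `cp1252` is Python's codec name for Windows-1252.

```python
text = "Plain ASCII"

# ASCII text produces identical bytes in these ASCII-compatible encodings.
ascii_bytes = text.encode("ascii")
for name in ("utf-8", "iso-8859-1", "cp1252"):
    assert text.encode(name) == ascii_bytes

# UTF-16 and UTF-32 keep the same value but pad it with leading zero bits.
print("A".encode("ascii").hex())      # 41
print("A".encode("utf-16-be").hex())  # 0041
print("A".encode("utf-32-be").hex())  # 00000041
```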
History
The history of ASCII is tied to that of the computer. Electronic text was not on the minds of the early computer designers of the 1940s; early computers were concerned with number crunching. A 4-bit code called BCD (binary-coded decimal) was developed to represent numbers. By the 1950s letters had appeared, but text was considered useful only for messages. Adding letters extended the BCD coding scheme to 6 bits (64 combinations). Only uppercase letters and some symbols were supported.
Meanwhile, companies working on the electrical communication of messages (Telex, TWX) in the 1950s were struggling with coding problems of their own. They used a 5-bit code derived from the Baudot code, with only 32 combinations, two of which were used to switch between numbers/symbols and letters. As communication hardware and computer hardware began to interact, a better coding system was needed.
In the 1960s a committee was assigned to solve the problem, and the result was the 7-bit ASCII code that we still use today. An initial design was implemented in 1963 without lowercase letters; this was corrected in 1967. Part of the design was to choose a letter order that made alphabetical sorting by computer easy. IBM knew of the standard but resisted adopting it on its mainframe computers, preferring EBCDIC, a scheme based on an extended version of BCD.
By the late 1960s researchers at Bell Labs were designing a new operating system, justified in part by the need for a decent text editing capability for creating electronic documents to be printed out. The result was Unix, which used the ASCII coding scheme.
Today ASCII is used on computers and is also defined for communication between devices, including long-distance communication over wired or wireless links. It is the standard for email. Binary data can also be carried as ASCII text by using Base64 encoding.
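As an illustration of the Base64 point, the following sketch uses Python's standard `base64` module; the four sample bytes are arbitrary and chosen only because two of them fall outside the ASCII range.

```python
import base64

raw = bytes([0x00, 0xFF, 0x10, 0x80])   # arbitrary binary data, not valid ASCII text
encoded = base64.b64encode(raw)

print(encoded)                           # b'AP8QgA=='
print(all(b < 128 for b in encoded))     # True: every output byte is an ASCII value
print(base64.b64decode(encoded) == raw)  # True: the original bytes are recovered
```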
ASCII Chart
In the chart below the decimal value, the hexadecimal value, and the visible character (or standard keyboard character) for each ASCII character are shown. The control codes can be generated by holding down the Ctrl (control) key (shown as ^) while striking the key shown. The ASCII codes are:
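Dec | Hex | Ctrl | Dec | Hex | Char | Dec | Hex | Char | Dec | Hex | Char |
---|---|---|---|---|---|---|---|---|---|---|---|
000 | 0x00 | ^@ | 032 | 0x20 | (space) | 064 | 0x40 | @ | 096 | 0x60 | ` |
001 | 0x01 | ^A | 033 | 0x21 | ! | 065 | 0x41 | A | 097 | 0x61 | a |
002 | 0x02 | ^B | 034 | 0x22 | " | 066 | 0x42 | B | 098 | 0x62 | b |
003 | 0x03 | ^C | 035 | 0x23 | # | 067 | 0x43 | C | 099 | 0x63 | c |
004 | 0x04 | ^D | 036 | 0x24 | $ | 068 | 0x44 | D | 100 | 0x64 | d |
005 | 0x05 | ^E | 037 | 0x25 | % | 069 | 0x45 | E | 101 | 0x65 | e |
006 | 0x06 | ^F | 038 | 0x26 | & | 070 | 0x46 | F | 102 | 0x66 | f |
007 | 0x07 | ^G | 039 | 0x27 | ' | 071 | 0x47 | G | 103 | 0x67 | g |
008 | 0x08 | ^H | 040 | 0x28 | ( | 072 | 0x48 | H | 104 | 0x68 | h |
009 | 0x09 | ^I | 041 | 0x29 | ) | 073 | 0x49 | I | 105 | 0x69 | i |
010 | 0x0a | ^J | 042 | 0x2a | * | 074 | 0x4a | J | 106 | 0x6a | j |
011 | 0x0b | ^K | 043 | 0x2b | + | 075 | 0x4b | K | 107 | 0x6b | k |
012 | 0x0c | ^L | 044 | 0x2c | , | 076 | 0x4c | L | 108 | 0x6c | l |
013 | 0x0d | ^M | 045 | 0x2d | - | 077 | 0x4d | M | 109 | 0x6d | m |
014 | 0x0e | ^N | 046 | 0x2e | . | 078 | 0x4e | N | 110 | 0x6e | n |
015 | 0x0f | ^O | 047 | 0x2f | / | 079 | 0x4f | O | 111 | 0x6f | o |
016 | 0x10 | ^P | 048 | 0x30 | 0 | 080 | 0x50 | P | 112 | 0x70 | p |
017 | 0x11 | ^Q | 049 | 0x31 | 1 | 081 | 0x51 | Q | 113 | 0x71 | q |
018 | 0x12 | ^R | 050 | 0x32 | 2 | 082 | 0x52 | R | 114 | 0x72 | r |
019 | 0x13 | ^S | 051 | 0x33 | 3 | 083 | 0x53 | S | 115 | 0x73 | s |
020 | 0x14 | ^T | 052 | 0x34 | 4 | 084 | 0x54 | T | 116 | 0x74 | t |
021 | 0x15 | ^U | 053 | 0x35 | 5 | 085 | 0x55 | U | 117 | 0x75 | u |
022 | 0x16 | ^V | 054 | 0x36 | 6 | 086 | 0x56 | V | 118 | 0x76 | v |
023 | 0x17 | ^W | 055 | 0x37 | 7 | 087 | 0x57 | W | 119 | 0x77 | w |
024 | 0x18 | ^X | 056 | 0x38 | 8 | 088 | 0x58 | X | 120 | 0x78 | x |
025 | 0x19 | ^Y | 057 | 0x39 | 9 | 089 | 0x59 | Y | 121 | 0x79 | y |
026 | 0x1a | ^Z | 058 | 0x3a | : | 090 | 0x5a | Z | 122 | 0x7a | z |
027 | 0x1b | ^[ | 059 | 0x3b | ; | 091 | 0x5b | [ | 123 | 0x7b | { |
028 | 0x1c | ^\ | 060 | 0x3c | < | 092 | 0x5c | \ | 124 | 0x7c | \| |
029 | 0x1d | ^] | 061 | 0x3d | = | 093 | 0x5d | ] | 125 | 0x7d | } |
030 | 0x1e | ^^ | 062 | 0x3e | > | 094 | 0x5e | ^ | 126 | 0x7e | ~ |
031 | 0x1f | ^_ | 063 | 0x3f | ? | 095 | 0x5f | _ | 127 | 0x7f | ⌂ |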
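The pairing of Ctrl keys with control codes in the chart follows a simple pattern: each control code has the same low five bits as the key shown next to it (for example, G is 0x47 and ^G is 0x07). A minimal sketch of that relationship follows; the helper names `control_code` and `caret_notation` are hypothetical.

```python
def control_code(key: str) -> int:
    """Return the code the chart pairs with Ctrl plus the given key."""
    value = ord(key.upper())
    if not 0x40 <= value <= 0x5F:        # only @, A-Z, [, \, ], ^ and _ qualify
        raise ValueError("no control code is paired with this key")
    return value & 0x1F                  # keep the low five bits

def caret_notation(code: int) -> str:
    """Return the ^X form used in the chart for a control code 0-31."""
    if not 0 <= code <= 31:
        raise ValueError("control codes are 0 through 31")
    return "^" + chr(code | 0x40)

print(control_code("G"))    # 7  (BEL)
print(control_code("["))    # 27 (Esc)
print(caret_notation(10))   # ^J (line feed)
```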
Control Codes
The first 32 ASCII characters (0 through 31) are called control codes because they cause an action to happen rather than just printing a character. Most of the control codes are intended for communication protocol use, but some are used in computers as well. Those found on a standard keyboard are identified below as a key.
Some of the more important control codes are:
- 007 is BEL (ring a bell) to get the attention of the operator
- 008 is generally the backspace key
- 009 is the tab key
- 010 is line feed (new line)
- 011 is vertical tab (not often used)
- 012 is form feed (new page)
- 013 is carriage return (often called the enter key)
- 027 is the Esc key
- 127 is typically the delete key (rubout)
The enter key usually enters the carriage return and line feed sequence. The carriage return by itself can be entered with a shift key/enter key combination.
There is often a need to embed control codes in text so that they take effect later, when the text is processed. The most common method is to use an escape sequence:
- \0 null
- \n new line
- \r carriage return
- \t tab
- \b backspace
- \f form feed
- \v vertical tab
A more general approach is to use \x##, where ## is the two-digit hexadecimal value of the control code; for example, \x0B is the vertical tab. Not every system accepts every sequence: JSON supports neither \v nor \x##, so a vertical tab must be written there as \u000B. The sequence \0 is also not in JSON and may not be supported in other systems.
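The difference between systems can be seen in a short sketch using Python, whose string literals accept both \v and \x0b, and its standard `json` module, which writes the same character in JSON's \u form. The sample strings are arbitrary.

```python
import json

# In a Python string literal, \v and \x0b are two spellings of character 11.
assert "\v" == "\x0b" == chr(11)

print(json.dumps("\v"))                  # "\u000b"
print(json.dumps("a\tb\n"))              # "a\tb\n"
print(json.loads('"\\u000b"') == "\v")   # True
```

JSON does define \b, \f, \n, \r, and \t, which is why the tab and new line above keep their short forms.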
ASCII names
Web browsers and other software that use HTML define named HTML entities for some of the ASCII characters, for use when HTML decoding would otherwise interfere with the proper display of the character itself. Originally names were available for four characters (quot, amp, lt, and gt, in lower or upper case), and XHTML added a fifth (apos). In HTML5 nearly all of the ASCII symbol and punctuation characters have names. These named character references, along with their Unicode designations, are listed below. To use a name, precede it with &; the trailing semicolon shown in the table is part of the reference, so the less-than sign is written &lt;.
Name | Character(s) | Glyph | Notes |
---|---|---|---|
Tab; | U+00009 | ␉ | |
NewLine; | U+0000A | ␊ | |
excl; | U+00021 | ! | |
quot; | U+00022 | " | entity reference also QUOT QUOT; quot |
num; | U+00023 | # | |
dollar; | U+00024 | $ | |
percnt; | U+00025 | % | |
amp; | U+00026 | & | entity reference also AMP AMP; amp |
apos; | U+00027 | ' | entity reference |
lpar; | U+00028 | ( | |
rpar; | U+00029 | ) | |
ast; | U+0002A | * | also midast; |
plus; | U+0002B | + | |
comma; | U+0002C | , | |
(none) | U+0002D | - | |
period; | U+0002E | . | |
sol; | U+0002F | / | |
colon; | U+0003A | : | |
semi; | U+0003B | ; | |
lt; | U+0003C | < | entity reference also LT; LT lt |
equals; | U+0003D | = | |
gt; | U+0003E | > | entity reference also GT; GT gt |
quest; | U+0003F | ? | |
commat; | U+00040 | @ | |
lbrack; | U+0005B | [ | also lsqb; |
bsol; | U+0005C | \ | |
rbrack; | U+0005D | ] | also rsqb; |
Hat; | U+0005E | ^ | caret |
lowbar; | U+0005F | _ | also UnderBar; |
grave; | U+00060 | ` | also DiacriticalGrave; |
lbrace; | U+0007B | { | also lcub; |
vert; | U+0007C | \| | also VerticalLine; verbar; |
rbrace; | U+0007D | } | also rcub; |
(none) | U+0007E | ~ | |
(none) | U+0007F | ⌂ | |
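The named references in the table can be tried out with Python's standard `html` module, whose `unescape` function recognizes the HTML5 names. This is only a sketch; the sample strings are arbitrary.

```python
import html

# Decoding: named references become the ASCII characters they stand for.
print(html.unescape("&lt;b&gt; &amp; &quot;x&quot;"))     # <b> & "x"
print(html.unescape("&lbrace;&rbrace; &commat; &Hat;"))   # {} @ ^

# Encoding: only the characters that would confuse an HTML parser are replaced.
print(html.escape("<a href='x'>"))                        # &lt;a href=&#x27;x&#x27;&gt;
```

Note that `html.escape` writes the apostrophe as the numeric reference &#x27; rather than as apos.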
In the table above, two of the control codes are named, and every symbol character has a name except the space, the minus (dash, hyphen) character, and the tilde. The names that suggest the minus and tilde refer to other Unicode characters instead of the ASCII ones: the minus sign is U+02212 and the tilde operator is U+0223C. The letters and digits are represented by themselves. Note that ASCII is laid out so that each uppercase letter and its lowercase counterpart differ by a single bit in the byte. For example, A is 0x41 (0100 0001) and a is 0x61 (0110 0001); only the bit with value 0x20 differs.
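The single-bit difference between cases can be demonstrated with a short sketch. The names `CASE_BIT` and `toggle_case` are hypothetical, introduced only for this example.

```python
CASE_BIT = 0x20   # the one bit that separates 'A' (0x41) from 'a' (0x61)

def toggle_case(letter: str) -> str:
    """Flip the case of a single ASCII letter by toggling bit 0x20."""
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return chr(ord(letter) ^ CASE_BIT)

print(format(ord("A"), "08b"))              # 01000001
print(format(ord("a"), "08b"))              # 01100001
print(toggle_case("A"), toggle_case("z"))   # a Z
```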
For more information
- ASCII chart - includes minimal control code values.
- http://en.wikipedia.org/wiki/Control_character full explanation of control codes
- http://en.wikipedia.org/wiki/ASCII for history and other details.