On current computer systems, the basic unit of information is a byte, which is made up of eight bits. A bit may be on or off. There are 256 possible combinations of on and off in a string of eight bits, so a byte can represent 256 different values. TXT files are made up of these basic symbols, called characters, primarily the ones defined in the ASCII character set, which uses only 128 of those values.
Text files will often use a .txt extension, particularly if they contain a fairly long document. Other times a more semantic extension is used, such as .log for a computer-generated log file or .eml for an email. Text files can also be referred to as etext.
Here are some issues when trying to read TXT.
Line end convention
One issue with ASCII files is the line ending convention. The system using the file needs to know what terminates any particular line of text. Unfortunately, this can vary. Systems running a flavor of the Unix operating system expect lines to be terminated by a Line Feed character, ASCII 10 (0x0a). Classic Mac OS machines (before Mac OS X) used the Carriage Return character, ASCII 13 (0x0d); Mac OS X and later use Line Feed like other Unix systems. PCs running DOS or Windows use both, with a "CRLF" combination as the line terminator.
Depending on the origin of the ASCII file, it may be necessary to adjust the line endings for the system you use. On Windows, for example, the default plain text file editor is Notepad. Older versions of Notepad (before Unix line ending support was added to Windows 10 in 2018) do not understand text files using only LF as the line ending and will not display them properly, so text files brought over to such a PC from a Unix system need to have CRs added to all line ends.
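As an illustration of the adjustment described above, here is a minimal Python sketch that converts between the three conventions. The function name normalize_newlines and the sample strings are assumptions made for this example. It works on raw bytes so that Python's own newline translation does not interfere.

```python
def normalize_newlines(data: bytes, newline: bytes = b"\n") -> bytes:
    """Convert CRLF, lone CR, and lone LF line endings to `newline`."""
    # Collapse CRLF first so its CR is not counted as a second line end.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    if newline == b"\n":
        return unified
    return unified.replace(b"\n", newline)


unix_text = b"first line\nsecond line\n"
old_mac_text = b"first line\rsecond line\r"

# Prepare a Unix file for a program that expects CRLF.
windows_text = normalize_newlines(unix_text, b"\r\n")
```

To convert a whole file, read it with open(path, "rb"), pass the bytes through the function, and write the result back in binary mode.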
Another issue when trying to read TXT files on an eBook reader is the presumed length of a line. Most Hardware Readers do not have pages wide enough to show a full letter size or A4 page without wrapping the text. If the source file text assumes a particular line width, it is likely to have line ending characters at the end of each line. Combined with the device's own wrapping, this causes extra line breaks, and the leftover pieces are often very short lines that make the reading experience less pleasant. Text files generated from programs supporting variable width characters are prone to generate these kinds of problems.
Typically a Hardware Reader device wants line ends only at the ends of paragraphs so that it can wrap the text to the edges of the device. Some devices may attempt to achieve this by ignoring line end characters unless they see two sets in a row (an empty line), which marks the end of the paragraph. An additional check would be to mark the end of a paragraph if the next line begins with a TAB (ASCII 9) character or even multiple spaces. (This last check could be confused if multiple lines are indented the same amount.) These techniques are also the methods used by some conversion programs to determine paragraphs. When text is rearranged in this fashion it is called reflowing the text. One way to quickly perform this conversion using Perl is:
perl -lp00e's/\n/ /g' input.txt > output.txt
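The same heuristics can also be sketched in Python. This is only an illustration, not the method any particular device or converter uses. The function name reflow and the sample text are assumptions. A paragraph ends at an empty line, or when the next line starts with a TAB or at least two spaces.

```python
import re


def reflow(text: str) -> str:
    """Join hard-wrapped lines so that each paragraph becomes one line."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    current = []
    for line in text.split("\n"):
        if not line.strip():
            # An empty line ends the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        # A leading TAB or two or more leading spaces starts a new paragraph.
        if current and re.match(r"\t| {2,}", line):
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.strip())
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs) + "\n" if paragraphs else ""


sample = (
    "It was a dark\nand stormy night.\n\n"
    "The rain fell\nin torrents.\n"
    "\tA new paragraph\nstarts here.\n"
)
reflowed = reflow(sample)
```

Note that this sketch discards the leading indent itself, and, as mentioned above, the indent check will split a block of text whose lines are all indented by the same amount.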
Cut and paste will often allow you to remove the CR/LF combinations by selecting one paragraph at a time. Word will allow you to convert any txt file you import.
Paragraphs in most books are indented. A file may try to show this indenting by placing one or more spaces at the beginning of each paragraph. Unfortunately, some conversion tools eliminate extra spaces at the beginning of a line because they are often used to provide a left margin in the text as well. The solution is to indicate indentation using a tab character (0x09) instead of blanks. If the conversion program eliminates all so-called 'white space', it will also remove the tabs and this technique will fail.
Chapters are another area where there may be problems in handling text. Gutenberg uses a convention of 4 blank lines in a row to identify new chapters or breaks in the flow of text. Another way that is often better is to use the form feed character (0x0c) to cause a new page.
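A converter could recognise both chapter conventions with a single pattern. The sketch below is an assumption about how that might be done, not the behaviour of a specific tool. The names CHAPTER_BREAK and split_chapters are invented for the example.

```python
import re

# A form feed, or a line end followed by four or more blank lines.
CHAPTER_BREAK = re.compile(r"\f|\n(?:[ \t]*\n){4,}")


def split_chapters(text: str) -> list[str]:
    """Split text into chapters at form feeds or runs of 4+ blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    parts = CHAPTER_BREAK.split(text)
    # Drop empty pieces and trim stray line ends around each chapter.
    return [part.strip("\n") for part in parts if part.strip()]


book = (
    "Chapter 1\nText one.\n\n\n\n\n"
    "Chapter 2\nText two.\f"
    "Chapter 3\nText three.\n"
)
chapters = split_chapters(book)
```

An ordinary paragraph break of one blank line does not match the pattern, so paragraphs inside a chapter stay together.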
ASCII TXT files can try to simulate a fixed relationship between characters on a page. Often the text is assumed to use a fixed-width font like Courier and will simulate a table by using multiple spaces to keep each line aligned with the previous one. TXT does not support adjustable tabs, so this is the only way to make the appearance of a table. Other times people might even use ASCII Art to simulate a picture. These kinds of usage will not transfer well to a reflowable document, and even changing the font size can cause problems due to fixed length lines. Reflowable formats sometimes make an allowance for this use by providing a way to mark the text as using fixed-width fonts and fixed lines. The user will have to recognize this and manually fix the document to match the original spacing.
Plain ASCII text is really only suitable for English and a few other western languages. Most Latin derived languages need accented characters to properly support the words in their alphabet. This generally requires 8 bit encoding using the ISO-8859-1 standard alphabet. Windows-1252 is another 8 bit encoding scheme. It is not an ISO standard but does permit more Latin derived languages to be supported while still restricting characters to 8 total bits. (8 bits is a standard computer character size called a Byte.) See also ANSI.
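The difference between these encodings can be seen by decoding the same bytes in several ways. The following Python sketch uses sample bytes chosen for this example. The codec names "ascii", "iso-8859-1" and "cp1252" are Python's standard names for ASCII, ISO-8859-1 and Windows-1252.

```python
raw = bytes([0x63, 0x61, 0x66, 0xE9])  # "caf" followed by the byte 0xE9

# 0xE9 is outside the ASCII range, so a strict ASCII decode fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Both 8 bit encodings map 0xE9 to the accented letter.
latin1_text = raw.decode("iso-8859-1")
cp1252_text = raw.decode("cp1252")

# The two encodings disagree in the range 0x80 to 0x9F.
euro_cp1252 = b"\x80".decode("cp1252")         # the euro sign
control_latin1 = b"\x80".decode("iso-8859-1")  # a control character
```

Because the byte values alone do not say which encoding was used, a reader has to be told, or has to guess, which table to apply.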
The newest character set is Unicode (encoded as UTF-8 or UTF-16). It is intended to cover all languages. It defines many more than 256 different characters, so most characters must be represented using more than one byte. That said, UTF-8 is a variable length character code that maintains compatibility with ASCII in that the ASCII character set is included in the code as one byte characters. The high (8th) bit is set on every byte of a multi-byte character. Although not part of the original standard, many Windows programs (including Windows Notepad) use the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark U+FEFF encoded in UTF-8. In text editors and Web Browsers which do not support UTF-8, the Mark will often appear as the ISO-8859-1 characters "ï»¿".
The use of extended characters can cause problems in text files in that you need to know what character set was used to define the symbols and you may need to enter characters that are not even on your keyboard. The answer is to tag the characters with an identifying name for each character. This issue is covered on the Special characters page and the ISO-8859-1 page. Most eBook Readers understand these tags and will display the characters correctly. However, viewing these in a simple text editor can be problematic. If you use an editor that can understand these characters, you can get around the keyboard problem by displaying the character set on the screen and then copying and pasting the characters into your editor. Most computers provide a method of displaying the characters in a particular font set. On Windows, you can open a command line or the search box and type charmap.
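These properties of UTF-8 are easy to check in Python. The sketch below is illustrative only. The helper name strip_utf8_bom is an assumption, while codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the Python standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading EF BB BF sequence if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


ascii_bytes = "A".encode("utf-8")        # one byte, identical to ASCII
accent_bytes = "\u00e9".encode("utf-8")  # e with acute accent, two bytes

# Every byte of a multi-byte sequence has its high bit set.
high_bits_set = all(byte & 0x80 for byte in accent_bytes)

with_bom = codecs.BOM_UTF8 + "text".encode("utf-8")
without_bom = strip_utf8_bom(with_bom)

# The "utf-8-sig" codec drops the mark while decoding.
decoded = with_bom.decode("utf-8-sig")

# Decoding the mark as ISO-8859-1 yields the three stray characters.
stray = codecs.BOM_UTF8.decode("iso-8859-1")
```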
Text format does not provide a way to specify anything other than content. There are some conventions that can be used to provide paragraphs but basically anything else requires some markup conventions. There are several different specialized markup languages that can be used to provide markup data within a text file.
- xTxT is simply a syntax for marking up plain-text documents. Using a minimal amount of markup, you can style and structure your text documents for display in a variety of output formats -- most commonly HTML and XHTML.
- AsciiDoc is a text based language that provides simple markup conventions for ASCII text that can be used to make a fully formatted book. Its advantage is the text itself is easily read in the original text.
- Markdown is a plain text format and a Perl script that creates HTML.
- Textile is another plain text markup language.
- reStructuredText is another plain text markup, often used for Python documentation.
- Pandoc can create an eBook from AsciiDoc, Textile, Markdown or other pure TXT input.
- Gutenberg has very simple text markup conventions. It is expected to be read in its text format.
- Haddock is a plain text format similar in some ways to the Wiki format.
There are more complicated markups available as well. The generic standard for markups is SGML, which defines the rules for more complex forms. Technically HTML and XML are markup languages. Other markup languages include TeX and its cousin LaTeX. Even this Wiki is built using a markup language. This form of markup can describe a data text format for even advanced word processor databases such as FrameMaker. One reason to use text markup on a complex file format is to provide files that can be managed with a version control system, treating the source markup files as simple text.
Tools for TXT files
- TXTcollector is a tool to combine text files into one file.
- Text editors.org - contains information on all text editors you might want to use for editing TXT files.
- TXTZ files are zipped TXT files, sometimes in Textile format. They are plain ASCII text documents that are compressed to save space. Very few programs can read these directly.
- TextWrangler is the little brother to BBEdit and is a free application for the Macintosh. It is a Text editor that has HTML assistance.
- Notepad++ is a free Windows text editor that has support for other formats as well.
- BabelPad is a free Windows text editor that supports full Unicode characters.