From MobileRead

Project Gutenberg was the first producer of free electronic books (eBooks). They have a goal of releasing every public domain book in existence. They release all of their eBooks in .txt format for simple reading using a text editor. (They also release some books in other formats, namely ePub and Kindle.)


Overview

Project Gutenberg began in 1971 when Michael Hart was given an operator's account with $100,000,000 of computer time in it by the operators of the Xerox Sigma V mainframe at the Materials Research Lab at the University of Illinois. (Actually there was a surplus of time available and Michael knew several of the operators.) An hour and 47 minutes later, he announced that the greatest value created by computers would not be computing, but would be the storage, retrieval, and searching of what was stored in our libraries.

He then proceeded to type in the "Declaration of Independence" and tried to send it to everyone on the network. Initially all of the eBooks in Project Gutenberg were typed in, but today they are generally entered via OCR and then corrected. This has led to far fewer errors. All of the books except very specialized ones are available in TXT format. Many are also available in other, more specific eBook formats.

Problems and solutions

The problem with .txt file eBooks is that they do not lend themselves to elaborate or easy-to-read formatting. They often have fixed-length lines that do not wrap well on the Pocket PC (PPC) or other small screens. If the reading program does not re-wrap the lines, they require scrolling sideways. In addition, there is no graphics support, font size control, or choice of character set. It is for these reasons that Project Gutenberg has released many eBooks in more advanced formats.
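
The wrapping problem described above can be worked around by re-flowing the text to the width of the target screen. The following is a minimal sketch in Python 3, not a tool supplied by Project Gutenberg; the function name `rewrap` and the default width of 40 are illustrative assumptions, and it assumes paragraphs are separated by blank lines.

```python
import re
import textwrap


def rewrap(text, width=40):
    """Re-flow hard-wrapped plain text to a new line width.

    Assumes paragraphs are separated by one or more blank lines.
    """
    # Normalise DOS (CR/LF) line endings so blank lines are easy to detect.
    text = text.replace("\r\n", "\n")
    # Split on runs of blank lines; each piece is one paragraph.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = []
    for para in paragraphs:
        # Join the fixed-length lines into one long line, then re-wrap it.
        joined = " ".join(para.split())
        wrapped.append(textwrap.fill(joined, width=width))
    return "\n\n".join(wrapped)
```

Because every paragraph is re-flowed, this approach would also flatten poetry, tables, and indented quotations, so it is only suitable for ordinary prose.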

On a Pocket PC, .txt files can be read easily with Pocket Word. However, an editor is probably not the best tool for reading books. It is typically not designed for reading a page at a time and does not support features such as bookmarking your progress. It is also easy to accidentally modify a book you are trying to read in an editor. You can set the file to read-only to prevent accidental modifications from being saved. Very few, if any, serious eBook readers consider .txt files to be their preferred reading format.
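
As one way to set a file to read-only before copying it to a device, the write permission can be cleared from a desktop script. This is a small sketch in Python 3 using only the standard library; the function name `make_read_only` is an illustrative assumption, and whether the read-only flag survives the transfer depends on the device and the sync software.

```python
import os
import stat


def make_read_only(path):
    """Clear every write-permission bit on the file at `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    # On Windows, os.chmod only honours the read-only flag, which is
    # what clearing the owner write bit sets.
    os.chmod(path, mode & ~write_bits)
```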

Programs to help convert etext files

E-Book Tidy is a useful conversion program to aid in translating to and from Palm Docs and other formats. It is particularly useful for converting Gutenberg text files. It can be used as a PC reader for Palm Docs. It will also convert Word and RTF files, but only fairly simple ones. It can be used as part of an HTML conversion.

Book Designer can be used to convert Gutenberg .txt files and many other formats.

GuteBook is a script that can automate fixing some Gutenberg problems and create nicely formatted eBooks.

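Here is a Perl script to fix broken lines in a paragraph. A line is joined to the next one unless it ends in punctuation:

#!/usr/bin/perl -w
die "USAGE\n$0 filein fileout\n\n" if $#ARGV != 1;
open(A, "<", $ARGV[0]) or die "cannot read $ARGV[0]: $!\n";
my @a = <A>; close(A);
open(B, ">", $ARGV[1]) or die "cannot write $ARGV[1]: $!\n";
foreach my $l (@a) {
        if (not defined $l) { print "problems at line -$l-\n" }
        $l =~ s/\r//g;   # if the file was in DOS mode
        chomp $l;        # drop the newline so lines can be joined
        # no closing punctuation: join with the next line
        if ($l !~ /[\.:,;\"!\?\'\)-]$/) { print(B "$l ") }
        else                            { print(B "$l\n") }
}
close(B);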

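Here is a Python script (Python 3) to fix broken lines in a paragraph, similar to the Perl script above. It asks for a file with a dialog box and writes the result to a file with the same name prefixed by out_:

import os
from tkinter import Tk
from tkinter.filedialog import askopenfilename

rootwin = Tk(); rootwin.withdraw()    # hide the empty main window
path = askopenfilename()              # returns '' if the dialog is cancelled
if path:
    folder, fname = os.path.split(path)
    with open(path) as src:
        text = src.read()
    # protect paragraph breaks, join the remaining line breaks, then restore them
    text = text.replace('\n\n', '#uNiQuE#').replace('\n', ' ').replace('#uNiQuE#', '\n')
    with open(os.path.join(folder, 'out_' + fname), 'w') as dst:
        dst.write(text + '\n')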

Conventions for TXT

Conventions for TXT used in Project Gutenberg have changed over the years. The most current version was formulated in 2004. It requires 7-bit ASCII text unless that is impossible. An 8-bit TXT (extended text) file can also be submitted, using ISO-8859-1, when accented characters are needed. Minimal markup is supported. The standard calls for:

  • Line length of 60-70 characters (no less than 55 nor more than 80)
  • each line ends with a CR/LF sequence
  • paragraphs are indicated by a blank line
  • for minor use of Greek words transliterate to ASCII.
  • for poetry make it look as much like the original as possible using spaces
  • indent for quotations
  • one space or two at the end of a sentence (user choice)
  • Do not indent for paragraph start
  • Use -- without spaces to indicate an emdash (—)
  • Use ------ with spaces before and after to show missing words
  • Remove - at the end of line used to split a word. Move the full word
  • In the past italics were shown as ITALICS, _italics_, and /italics/. The current preferred method is _italics_ but the others are accepted. Be consistent.
  • Do not capitalize the first word in each chapter
  • If transcriber's text is needed use [ and ] around the text
  • Use transcriber's notes for important images and captions
  • Do not include page numbers unless page number references are used in the text. If you use them put them inside brackets
  • Keep TOC but remove page numbers
  • Keep Index and glossary only if in original (do not do it for republished works)
  • Handle breaks in the text with spaced * such as * * * * *
  • Do not use tabs
  • Handle tables by evenly spacing the mono-spaced font.
  • for footnotes:
    • really short ones can be [inline]
    • longer ones need their own bracketed paragraph after the reference with a [1] style reference
    • Lots of footnotes can be collected at the end of a chapter but this may require renumbering the footnotes themselves.
  • for symbols use the word or abbreviation - for example use deg. or degrees
  • for British pound use the word or a leading L
  • Handle . . . ellipses with spaces and periods
  • Use 4 empty lines to separate chapters
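
A few of the mechanical rules in the list above (7-bit ASCII, no tabs, CR/LF line endings, and the 80-character upper limit) can be checked automatically. The following is a rough sketch in Python 3, not an official Project Gutenberg tool; the function name `check_conventions` and the wording of the messages are illustrative assumptions. It works on raw bytes so that the line endings are seen exactly as stored in the file.

```python
def check_conventions(data):
    """Return a list of problems found in the raw bytes of a .txt eBook."""
    problems = []
    # 7-bit ASCII: every byte value must be below 128.
    if any(b > 127 for b in data):
        problems.append("contains non-ASCII bytes (8-bit text?)")
    # Tabs are not allowed.
    if b"\t" in data:
        problems.append("contains tab characters")
    lines = data.split(b"\n")
    # A trailing newline leaves an empty final element; ignore it.
    if lines and lines[-1] == b"":
        lines.pop()
    for number, line in enumerate(lines, start=1):
        if not line.endswith(b"\r"):
            problems.append(f"line {number}: does not end with CR/LF")
        if len(line.rstrip(b"\r")) > 80:
            problems.append(f"line {number}: longer than 80 characters")
    return problems
```

It could be called as `check_conventions(open("book.txt", "rb").read())`, where `book.txt` is a placeholder file name. An empty list means none of these four checks failed; the remaining conventions, such as footnote and italics style, still need a human reader.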

For more information

Gutenberg sites

Copyright laws vary from country to country. Be sure the download is legal in your country.
