PDF

From MobileRead

PDF stands for Portable Document Format and was created in 1993 by Adobe Systems for the interchange of documents. This article will focus on the use of PDF files for Mobile eBook reading.

Overview

Initially PDF was designed as a print format similar to PostScript, and even today it is often used to exchange data that will be printed. Because it was designed as a print format, it specifies the size of the paper needed to reproduce (render) the original. PDF is now an open standard, ISO 32000-1, although the standard does not encompass all versions or all capabilities.

Since PDF is designed for printing, pages generally default to letter size for US documents and A4 for Europe. Displaying such a page at full size on a computer or eBook screen would require a 14" or larger display. Most portable devices do not have screens nearly this big, so they will often shrink the page to the point of not being readable, or zoom in to just part of the page, forcing the reader to pan around the page to read it.
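The 14" figure can be checked with a little arithmetic. The sketch below computes the diagonal of a US letter page (8.5" x 11") and an A4 page (210mm x 297mm); the function name is only illustrative.

```python
import math

MM_PER_INCH = 25.4

def diagonal_inches(width_in: float, height_in: float) -> float:
    """Diagonal of a rectangle in inches (Pythagorean theorem)."""
    return math.hypot(width_in, height_in)

letter = diagonal_inches(8.5, 11.0)                          # US letter
a4 = diagonal_inches(210 / MM_PER_INCH, 297 / MM_PER_INCH)   # ISO A4

print(f"Letter diagonal: {letter:.1f} in")
print(f"A4 diagonal:     {a4:.1f} in")
```

Both diagonals come out close to 14", which is why a 5" or 6" screen cannot show such a page at its original size.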

Content

A PDF file may contain several types of data. These include text, raster graphics, vector graphics (drawn with PDF's own PostScript-derived path operators, not SVG), fonts (glyphs), and metadata. Not all PDF files contain all of these types of data, and some PDF files are different from what they seem. For example, a PDF file might look like it has text in it while in fact the PDF is displaying an image (picture) that contains text. Text displayed in this fashion behaves differently from regular text when the file is manipulated by the display software.

A PDF file may contain tables but there is usually no intelligence to a table's construction. It consists of text placed at specific places on the page together with some line graphics. This looks like a table to the reader, but it is not possible to reverse the process and extract a table from the file. A table may also exist as an image in the file.

A PDF file, by its very nature, contains pages of information that are to be displayed or printed. The size and rotation of these pages are determined when the file is built, although they can change from page to page. It is possible to zoom into or out of a page to make it seem larger or smaller using the capabilities of some viewing software. This sets PDF files apart from many other eBook formats, because most of the latter provide data that is not constrained to a page boundary or required to conform to placement on a fixed size page.

Fonts and Text

Text in PDF is referenced to a particular font and font size. This reference may be to fonts that are embedded in the file itself or to external fonts that are expected to be available to the rendering software. If the fonts are not available the output may not render properly. Embedding a font increases the size of the file but means that the characters will be available, which can be especially important for unusual character sets. To limit the size increase, sometimes only the subset of a font's characters that the document uses is embedded. Adobe Type 1, TrueType (TTF), and Type 3 fonts (which can be bitmapped) can all be embedded.

Images

Images (Graphics) may be either raster or vector based. A raster image (also known as bitmapped) is often created by scanning a picture. Even a digital camera scans the image to create the file containing the picture. The resolution of the image (pixel width and height) is determined when the file is created but may be larger than is needed to permit zooming in with high fidelity. The rendering software scales the picture based on information in the file. Normally zooming in beyond 100% is done by replicating pixels and results in a blocky appearance to the image. Zooming out can sometimes leave out narrow lines in the image.
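The two zoom effects described above can be imitated on a tiny bitmap. This is a deliberately naive sketch using pixel replication and plain decimation; real viewers usually apply smarter filtering, and the function names are illustrative only.

```python
def zoom_in(image, factor):
    """Enlarge by an integer factor by repeating every pixel."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def zoom_out(image, step):
    """Shrink by keeping every `step`-th pixel and discarding the rest."""
    return [row[::step] for row in image[::step]]

def show(image):
    """Render the bitmap as text: '#' for dark pixels, '.' for light ones."""
    return "\n".join("".join("#" if p else "." for p in row) for row in image)

# 1 = dark pixel, 0 = light pixel. Column 1 holds a one-pixel-wide vertical line.
image = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]

print(show(zoom_in(image, 2)))   # the line becomes a blocky 2-pixel-wide bar
print()
print(show(zoom_out(image, 2)))  # only columns 0 and 2 are kept, so the line is lost
```

Zooming in adds no detail, it only makes each pixel bigger. Zooming out by simple decimation can skip a thin line entirely.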

Vector based images are built with lines and mathematical curves. These types of images can be zoomed without losing quality, but are suitable mostly for line drawings. The Boogie Board Sync uses this format for saved screens.

As already mentioned some documents are built totally from scanned images using the PDF format as a container for the images. There may not be any content text inside the file.
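A rough calculation shows why image-only pages are so much heavier than text pages. The sketch below gives the uncompressed size of one scanned letter-size page; the 300 dpi resolution and the figure of 3,000 characters per page are assumptions chosen for illustration, and real PDF files compress their images, so actual sizes will be smaller.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of a scanned page in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

page = (8.5, 11.0)  # US letter, inches

color = raw_scan_bytes(*page, dpi=300, bits_per_pixel=24)
bilevel = raw_scan_bytes(*page, dpi=300, bits_per_pixel=1)
text = 3000  # assumed: roughly 3,000 characters of plain text on a full page

print(f"24-bit colour scan: {color / 1e6:.1f} MB")
print(f"1-bit black/white:  {bilevel / 1e6:.2f} MB")
print(f"Plain text:         {text / 1e3:.0f} kB")
```

Even the black and white scan is hundreds of times larger than the same page stored as text, before compression is taken into account.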

Tags

Text may or may not be tagged in a PDF file. Tags are metadata that provide information about the text itself (semantics). They allow the rendering software to move or resize the data in an intelligent way without losing the content. Tags are normally placed in the file when the document is converted to PDF, but there are also ways to add tags after the fact. Tagging after the fact may not be successful in adding appropriate semantics.

Reflow

Being able to rearrange the text is called reflowing the document, and it permits a PDF designed for a full sized piece of paper to be easily read on a small device such as a PDA or eBook Reader. Tags are used to facilitate reflow. Some editing tools that create PDF files only when saving or printing the document cannot create tag data, and these files will not support reflow in some programs. PDF files with text in image format cannot be reflowed directly.

Just because the tags are present does not mean that the rendering program (reader) is able to use them. In fact most non-Adobe readers cannot reflow documents. However, there are some programs that support reflow even in documents that are not tagged. (Technically these programs are generating temporary tags to facilitate the reflow.) Adobe has released a portable version of Adobe Digital Editions (ADE) that now supports reflow on PDF files. Many of the latest dedicated eBook Readers now support ADE. Most eBook reading programs depend, instead, on zooming and panning the document. Zooming out to make the whole page fit on the screen is likely to make the text too small to read easily.

Reflow is related to zooming in the sense that the text is resized to the original requested text size in the document even when the page is smaller. In fact the reflow option in the PC version of Adobe Reader is shown on the Zoom menu. Even if a document is reflowed, the zoom feature can still be used to vary the text size.

One good use of reflow is to display a multicolumn document in one column which can make it much easier to read on electronic devices where the full page cannot be seen at once.

With the release of ADE, with support for non-tagged documents, there are three levels of reflow available. Which one you have is device and implementation dependent.

  • line reflow: This minimal level can wrap lines to a smaller screen however each line is independent and behaves like a separate paragraph. It allows reading but does not present a well balanced page since it results in many short lines and arbitrary unintelligent line breaks.
  • paragraph reflow: This level detects the paragraph boundaries within a document and wraps the full paragraph so long as it is on the same physical page in the original document. Detection without tags can cause mistakes and some paragraphs may have breaks in them. In addition this level does not wrap a paragraph that crosses a page boundary in the original document. The notion of original page boundaries causes a break in the virtual page in the reader.
  • full reflow: This level is typically not supported in ADE. It ignores the original page boundaries and wraps based on paragraphs. It also includes intelligent centering and other format features. It offers the best look and feel for the document with improved reading enjoyment where paragraphs are not split. This really needs tags to be successful. An untagged document will exhibit problems in doing the reflow from time to time, particularly where headings and images are included.
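The difference between the first two levels can be sketched with Python's standard textwrap module. This is only an illustration, not how ADE or any other reader is implemented: the sample text is made up, and a blank line stands in for a paragraph boundary, whereas a real reader has to infer boundaries from tags or from spacing on the page.

```python
import textwrap

# Lines as they were laid out on the original wide page (two paragraphs).
page_lines = [
    "Initially PDF was designed as a print format and even",
    "today it is often used for printing.",
    "",
    "Most portable devices have much smaller screens.",
]

def line_reflow(lines, width):
    """Wrap every original line on its own, as if it were a paragraph."""
    out = []
    for line in lines:
        out.extend(textwrap.wrap(line, width) or [""])
    return out

def paragraph_reflow(lines, width):
    """Join lines into paragraphs (blank line = boundary), then wrap each one."""
    out, paragraph = [], []
    for line in lines + [""]:
        if line.strip():
            paragraph.append(line.strip())
        elif paragraph:
            out.extend(textwrap.wrap(" ".join(paragraph), width))
            out.append("")
            paragraph = []
    return out[:-1]  # drop the trailing blank line

for name, reflow in [("line reflow", line_reflow), ("paragraph reflow", paragraph_reflow)]:
    print(f"--- {name} ---")
    print("\n".join(reflow(page_lines, 30)))
```

With a 30 character screen, line reflow leaves short leftover lines wherever an original line ended, while paragraph reflow fills each line before breaking.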

Of course none of the reflow techniques can match the original typography of the document and they will generally have one zoom mode that shows the look of the original page or the ability to turn off reflow entirely when needed.

Other features

A PDF file can contain hot links to other places in the file or links to objects outside the file. In addition a PDF can contain a TOC or index (called bookmarks) with links to places or data in the file. Not all readers can support these and not all editors are capable of adding this data to the file in the first place. Some documents support comments that can be added by reviewers.

A PDF can also be an archive document for preserving the content. These kinds of documents always have embedded fonts.

PDF/A

PDF/A is a subset of the full PDF capability. It was originally developed for archiving, where its key feature is the guaranteed ability to be read by future versions of PDF tools. However, the nature of the format also makes it ideal as an exchange format, and it has been adopted by Adobe Digital Editions (ADE) for portable devices. ADE makes one main addition, the inclusion of DRM support, which is not in the PDF/A standard.

The nature of PDF/A makes it the ideal form for PDF eBooks.

PDF Creation

PDF is an ISO standard and its makeup is documented, so there are many tools that can create PDF files.

  • Adobe Acrobat, Adobe InDesign, Adobe PageMaker, and Adobe FrameMaker can make intelligent PDF files with TOCs and tagging.
  • PDFCreator is a software PDF Printer and is Open Source.
  • pdfTeX and pdfLaTeX are widely used to produce PDF files in physics and mathematics, and in other fields involving considerable mathematics, such as computer science and quantitative biology or finance.
  • Open Office 2.4.1 can create PDF files and does support adding tags to the files.
  • Software PDF printers intercept the print command and create a PDF image instead of actual printer output. These PDF files cannot have much inherent intelligence about the content. One such tool is CutePDF Writer, which comes in both a free version and a professional one.
  • Neevia makes a product called docuprinter LT that can use Macros in Word and other Microsoft Office files to generate a PDF with TOC and links. It also has variable compression ability.
  • Another popular tool for PDF is made by Foxit. They have readers and PDF editor programs.
  • Tomahawk PDF is an editor designed to generate a PDF.
  • Many commercial and business copiers can now output PDF files directly as part of the scanning process.

Manipulate PDF

  • k2pdfopt optimizes a PDF for a smaller screen display.
  • Ghostscript offers various tools for manipulation of PDF files.
  • BRISS is a PDF cropping program to remove extra margins.
  • PDFtk is a PDF toolkit. The free version can be used for quickly merging and splitting PDF documents and pages. It also includes a command line version. A pro version is also available.

Online PDF converters

PDF Viewers

There are many PDF viewers available. Adobe makes free viewers called Adobe Reader for many platforms, and many 3rd party programs exist. All of the versions from Adobe, other than the Palm version, can read PDF files directly without requiring conversion. A third party application called PalmPDF is available to read PDF files directly on Palm OS 5 units. It even supports reflow.

Adobe Digital Editions is specifically targeted for eBook Reading. It can read ePUB and PDF documents.

Foxit Reader is also very popular. It is available for Windows, Windows Mobile and Linux. The company offers other PDF tools, too.

Ghostscript can also read PDF documents. It loads quickly and is available on a wide variety of platforms.

MuPDF: a lightweight, open source PDF viewer and parser/rendering library. Available for Linux and Windows.

Nitro PDF reader: Freeware PDF viewer for Windows.

PDF-XChange Viewer is a freeware PDF viewer for Windows. The company offers more complex PDF tools, too.

STDU Viewer: Free PDF reader for Windows. It also supports DjVu, Comic Book Archive (CBR or CBZ), XPS, FB2, TIFF, TXT and image file formats.

Sumatra PDF viewer: slim, free, open-source PDF reader for Windows.

Xpdf: a very popular open source PDF viewer for Linux, but also ported to a number of other platforms.

Evince can also read PDF files. Open source, available for Linux and Windows.

ePDFView: a lightweight open source PDF viewer for Linux.

Most 3rd party rendering programs do not support reflow and may not even have TOC capabilities. Some also have very limited zoom and do not support panning.

eBook Readers

Most dedicated eBook Readers claim support for PDF but most do a poor job of delivering on the promise due to the basic problem of attempting to read a 14" diagonal page on a 5" or 6" diagonal device. Of course the larger devices (8" and above) have an easier time of it but there are some notable attempts on smaller devices.

The main approach is to display the whole page on the screen, which is fine to get an idea of what the page looks like but impossible to read. Even removing the margins and just showing the text does little to improve the situation. Switching to landscape mode and splitting the page in two helps but is not enough to make a readable document on a device of less than 8".
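The effect of each approach on text size can be estimated with simple scaling. The sketch below assumes a 6" screen with a 3:4 aspect ratio (3.6" x 4.8"), a US letter page, 1" margins on every side, and 11 point body text; the margin and text size are illustrative assumptions, not values from any particular document.

```python
def fit_scale(content_w, content_h, screen_w, screen_h):
    """Largest scale at which the content fits entirely on the screen."""
    return min(screen_w / content_w, screen_h / content_h)

SCREEN_W, SCREEN_H = 3.6, 4.8      # 6" diagonal screen, 3:4 aspect ratio, inches
PAGE_W, PAGE_H = 8.5, 11.0         # US letter page, inches
MARGIN = 1.0                       # assumed margin on every side, inches
BODY_PT = 11.0                     # assumed body text size, points

text_w, text_h = PAGE_W - 2 * MARGIN, PAGE_H - 2 * MARGIN

modes = {
    "whole page": fit_scale(PAGE_W, PAGE_H, SCREEN_W, SCREEN_H),
    "margins cropped": fit_scale(text_w, text_h, SCREEN_W, SCREEN_H),
    # Landscape: the text width fills the long side; the page is read in slices.
    "landscape, fit width": SCREEN_H / text_w,
}

for name, scale in modes.items():
    print(f"{name:22s} scale {scale:.2f} -> text appears as {BODY_PT * scale:.1f} pt")
```

Under these assumptions the text shrinks to under 5 points for the whole page, just under 6 points with margins cropped, and about 8 points in landscape, which matches the observation that none of these modes is comfortable on a small screen.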

The Sony PRS-505 was the first device to implement mobile Adobe Digital Editions, which can use page by page reflow to solve the problem for documents that lend themselves to this solution (predominantly text documents). Many E Ink devices now implement mobile ADE, and all but a few of these support reflow. The Bookeen Cybook Gen3 attacks the problem by permitting the user to zoom in and then pan around on the page in fixed increments. This permits viewing a large document page but becomes too cumbersome for reading a full book.

Note that these problems go away if you are able to reformat the entire document to a size that fits the screen.

eBook Reader PDF Capabilities

The E-book Reader Matrix and the newer eBook Reader Matrix include a row for PDF reading capabilities, which can include the following.

Capability              Description
Full Page View          Each page is shown on the screen full size.
Landscape View          Portrait pages are shown as several landscape screens. Left and right margins are often cropped.
Continuous View         There are no gaps between pages, so parts of two pages can be on the screen.
Two Column View         One column fills the screen, at the bottom of the page the 2nd column is started.
Manual Crop Margins     Zoom (magnify) in small increments to crop margins, i.e. reader customizable cropping.
Autocrop Margins        Zoom to remove whitespace in margins. Can be defeated by headers and footers and by scan artifacts.
Fixed Zoom              Zooming to fixed parts of the page via a menu.
Arbitrary Zoom          Zoom to any reader-specified part of the page.
Single Page Reflow      Reflow the text on a page. Number of font sizes allowed is typically 3 to 8.
Entire Document Reflow  Reflow the text across page boundaries. Number of font sizes allowed is typically 3 to 8.
Zoom Images in Reflow   Reflow "magnifies" text, so also magnify the images.
Passwords               PDFs with passwords can be opened.
Table of Contents       The table of contents, if any, is available for navigation.
Hyperlinks              Hyperlinks within the document can be followed and there is a "go back" option to unwind links.
Text to Speech          The text in the PDF, if any, can be read aloud.
Text Search             The text in the PDF, if any, can be searched.
Dictionary Lookup       Words in the PDF, if any, can be looked-up in a dictionary.
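The note that autocrop can be defeated by headers and footers is easy to see in a toy model. The sketch below treats a page as rows of characters where any non-space character is ink, and crops to the bounding box of all ink. It is an assumption about how a simple autocrop might work, not a description of any particular reader.

```python
def autocrop_box(page):
    """Bounding box (top, left, bottom, right) of all non-blank cells, or None."""
    rows = [r for r, line in enumerate(page) if line.strip()]
    if not rows:
        return None
    cols = [c for line in page for c, ch in enumerate(line) if ch != " "]
    return rows[0], min(cols), rows[-1], max(cols)

clean = [
    "            ",
    "            ",
    "   xxxxxx   ",
    "   xxxxxx   ",
    "   xxxxxx   ",
    "            ",
    "            ",
]

# The same page with a page number printed near the bottom right corner.
with_footer = clean[:-1] + ["          7 "]

print(autocrop_box(clean))        # tight box around the text block
print(autocrop_box(with_footer))  # box stretched out to include the page number
```

A single stray mark, whether a page number or a speck from scanning, is enough to pull the crop box back out toward the page edge.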

Metadata

Metadata is stored in the file by the creation program. It can be viewed from the File > Properties dialog in most viewing programs. For a PDF file the data includes:

  • File name
  • Title
  • Author
  • Subject
  • Keywords
  • Application used to create the file
  • PDF producer program
  • PDF version
  • File Size
  • Page Size
  • Tagged PDF status (Yes, No)
  • Document Security (what you can do with the document)
  • Fonts
  • Misc other data

Not all of this data can be viewed in all Readers. The application BeCyPDFmetaedit can be used to modify or create this data in an existing PDF file.
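Several items in the list above correspond to entries in the PDF document information dictionary: Title, Author, Subject, Keywords, Creator (the application used to create the file) and Producer (the PDF producer program). The sketch below pulls those entries out of a hand-written sample fragment with a regular expression. It is a toy under strong assumptions: it only handles plain literal strings without escapes or nested parentheses, and it will not work on encrypted files, compressed object streams, or hexadecimal and Unicode strings. A real tool would use a proper PDF library.

```python
import re

# A hand-written fragment in the style of a PDF document information dictionary.
# It is illustrative only and not taken from a real file.
sample = b"""
5 0 obj
<< /Title (Sample Book) /Author (A. Writer)
   /Creator (Hypothetical Word Processor) /Producer (Hypothetical PDF Library) >>
endobj
"""

INFO_KEYS = ["Title", "Author", "Subject", "Keywords", "Creator", "Producer"]

def read_info(data: bytes) -> dict:
    """Pull simple literal-string values for the standard Info keys."""
    info = {}
    for key in INFO_KEYS:
        match = re.search(rb"/" + key.encode() + rb"\s*\(([^()\\]*)\)", data)
        if match:
            info[key] = match.group(1).decode("latin-1")
    return info

for key, value in read_info(sample).items():
    print(f"{key}: {value}")
```

Keys that are absent from the file, such as Subject and Keywords in this sample, are simply omitted from the result.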

Tips

The most important tip for eBook users is that you can create custom sized pages for PDF use when you build your own PDF files. Generally this custom sized paper should be 5.24" x 6.63" (or possibly 6.69"), or for true sizes 9cm × 12cm, for good results on a 6" reader (800x600 pixel). You can build this once as a template for your printer and then reference it when needed. Margins can be set to whatever you like. A Reader Guide to creating PDF files is available from Sony.
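The 9cm × 12cm figure can be compared with the physical size of a 6" 800x600 screen, and converted to the points (1/72 inch) in which PDF page sizes are normally expressed. The sketch below assumes square pixels; it does not attempt to reconcile the inch figures quoted above.

```python
import math

POINTS_PER_INCH = 72   # default PDF user space unit is 1/72 inch
CM_PER_INCH = 2.54

def screen_size_inches(diagonal_in, px_w, px_h):
    """Physical width and height of a screen from its diagonal and pixel grid."""
    per_pixel = diagonal_in / math.hypot(px_w, px_h)
    return px_w * per_pixel, px_h * per_pixel

def cm_to_points(cm):
    """Convert centimetres to PDF points."""
    return cm / CM_PER_INCH * POINTS_PER_INCH

w_in, h_in = screen_size_inches(6.0, 600, 800)   # 6" reader held in portrait

print(f"Screen: {w_in:.2f} x {h_in:.2f} in "
      f"({w_in * CM_PER_INCH:.1f} x {h_in * CM_PER_INCH:.1f} cm)")
print(f"9 x 12 cm page: {cm_to_points(9):.1f} x {cm_to_points(12):.1f} pt")
```

The screen works out to roughly 9.1cm x 12.2cm, so a 9cm × 12cm page is very close to a one to one fit.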

Always tag the document if you can. Microsoft ActiveSync will add the tags needed for reflow to a PDF file that does not have them while transferring it to your PDA. Note that this requires installation of Adobe Reader for Pocket PC devices. Once added, these tags are permanent. Adobe Acrobat Professional can also add tagging to a PDF file after the fact.

While it is possible to edit a PDF file using some editors, it is much better to edit a source file and regenerate the PDF. Direct PDF editing can result in a convoluted arrangement of text that increases the file size and may confuse some display tools.

As of late 2009, most PDF support on eBook readers is either ADE based or Xpdf based.

One way to attempt to read PDF pages on small devices is to crop the margins off the edges of the pages. This technique is used by some eBook Readers, but there are tools that can help you to do this yourself. More advanced is the ability to separate 2 columns of data and make a single narrower data stream. Tools and tips to do these kinds of things include:

  • BRISS - a margin cropper and columns separating tool. It is free. (Briss doesn't rasterize the PDF)
  • PaperCrop - creates cropped PDF page images. It is free. (PaperCrop rasterizes the PDF)
  • PDFCropper - discussed on MobileRead forum.
  • Use k2pdfopt to optimize a PDF for a smaller screen display or even to minimize the number of pages.

Limitations

While PDF is a very popular format for sharing files on computer systems it does have limitations when used with a portable eBook Reader. Some of these are inherent in the format and some are because the rendering software does not support all of the features available in PDF.

  • A PDF file will generally be much larger than a file in many other eBook formats. This can cause problems in how many eBooks you can have on your device, and it can cause the rendering software to behave sluggishly or even not work on some files. A document in PDF can be significantly different in size depending on how it was created. Some of the issues are:
    • Graphic images mimicking text are much larger.
    • Graphics can be much larger (higher resolution) than is needed.
    • Compression can be adjusted when the file is built to provide for typical viewing or professional typesetting.
    • Editing of the PDF can leave artifacts in the file.
    • Embedding fonts will increase the file size.
  • The complexity of the supported capabilities can be beyond the ability of some rendering programs. This can be very confusing to the user since some PDF files work fine and some will not display properly or may not even load. Some of these reasons include:
    • PDF supports a wide variety of image encodings and some rendering software may not support all of them. JPEG2000 is one suspect.
    • A PDF file can be built from multiple PDF files. This ability to append and insert files can cause data to not be linear in a file and metadata to be scattered all over the file. This can require loading an entire document into memory to find and display the information properly.
    • The fonts used by the rendering program may not match the source file.
    • PDF file formats have gone through many revisions and some readers may not handle all versions.
  • Page numbers referenced by the program may not match the printed document. This is usually a result of front matter in the document being identified with roman numeral numbered pages while the main document starts the numbering again or uses a chapter based numbering system.
  • PDF files are sometimes read with a different program on the reader causing the user interface to be different and have different features. This can make it inconsistent with user expectations.
  • Readers will typically squeeze the PDF page to fit the reader screen size. This can make many documents impossible to read due to extremely small print.
  • Multicolumn PDF files can be difficult to read due to the need to back up to get to the top of the page when the reader splits the page.
  • Images may be distorted or unreadable due to resizing in the rendering software. Good software should provide the ability to separately zoom a particular image.
  • When designing an eBook of poetry, attention should be given to reflow tags that various software adds automatically. Poetry and reflowing text are not always a good combination, as many poets work with text more in an image fashion than as simply words following words.
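The page number mismatch mentioned in the list above can be illustrated with a small mapping from a physical page position to the label printed on the page. The sketch assumes the simplest case, roman-numbered front matter followed by a body that restarts at 1; the function names and the count of 12 front matter pages are made up for the example.

```python
def to_roman(n: int) -> str:
    """Lower-case roman numeral for 1 <= n < 4000."""
    table = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"), (90, "xc"),
             (50, "l"), (40, "xl"), (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, numeral in table:
        count, n = divmod(n, value)
        out.append(numeral * count)
    return "".join(out)

def printed_label(index: int, front_matter_pages: int) -> str:
    """Printed page label for a 1-based physical page index."""
    if index <= front_matter_pages:
        return to_roman(index)
    return str(index - front_matter_pages)

FRONT_MATTER = 12  # assumed number of roman-numbered pages

for physical in (4, 12, 13, 15):
    print(f"reader shows page {physical:2d} -> printed as {printed_label(physical, FRONT_MATTER)}")
```

A reader that counts physical pages will report page 15 where the book itself prints page 3, which is the discrepancy users run into when following a printed index or citation.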

Some of these limitations can be overcome when the PDF is specifically designed with eBook reading in mind. See tips.

Converting from PDF files

PDF is a terminal format. That is to say, it is intended to be the final output and is not a very good source format. Having said that, there are many programs that can manipulate PDF files or convert them to another format, with varying degrees of success. See E-book conversion for many such programs. A few are listed below:

  • RubyPDF Technologies has several software tools to manipulate or convert PDF files to another format.
  • PDF Converter with OCR has OCR technology built in and can extract the contents of scanned PDFs and images as well as normal PDF files.