MOBI

From MobileRead

MOBI is the format used by the MobiPocket Reader. A file may have either a .mobi or a .prc extension, and the user can rename it from one to the other. In either case it may be DRM protected or DRM free. The .prc extension is used because PalmOS does not support any file extensions except .prc and .pdb. Note that Mobipocket prohibits its DRM format from being used on dedicated eBook readers that support other DRM formats.

Description

The MOBI format was originally an extension of the PalmDOC format that added certain HTML-like tags to the data. Many MOBI formatted documents still use this form. However, there is also a high-compression version of the format that compresses data further in a proprietary manner. Several third-party programs can read eBooks in the original MOBI format, but only a few can read eBooks in the newer compressed form. The higher compression mode uses a Huffman coding scheme that has been called the Huff/cdic algorithm. For a description in Python, see huffdic.py, available as part of the Calibre project.

From time to time features have been added to the format, so new files may cause problems if you try to read them with an older reader. Currently the source files follow the guidelines in the Open eBook format.

Format

Like PalmDOC, the Mobipocket file format is a standard Palm Database Format file. The header of that format includes the name of the database (usually the book title and sometimes a portion of the author's name), which is up to 31 bytes of data. The files are identified by a Creator ID of MOBI and a Type of BOOK.
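
As a rough illustration of the identification step, the sketch below reads the database name, Type and Creator ID from the start of a file. The field offsets (name in bytes 0-31, Type at offset 60, Creator ID at offset 64, a 78-byte fixed header) are taken from the general Palm Database layout rather than from this page, and the function names are hypothetical.

```python
import struct

PDB_HEADER_SIZE = 78  # fixed PDB header size (assumption from the PDB layout)


def read_pdb_identity(data: bytes):
    """Return (name, type, creator) from the start of a Palm Database file."""
    if len(data) < PDB_HEADER_SIZE:
        raise ValueError("too short to be a Palm Database file")
    # 32-byte name field: up to 31 bytes of name followed by NUL padding.
    name = data[0:32].split(b"\x00", 1)[0]
    # Type and Creator ID are two 4-byte ASCII codes at offsets 60 and 64.
    pdb_type, creator = struct.unpack_from(">4s4s", data, 60)
    return name, pdb_type, creator


def looks_like_mobi(data: bytes) -> bool:
    """True if the file carries Type BOOK and Creator ID MOBI."""
    _, pdb_type, creator = read_pdb_identity(data)
    return pdb_type == b"BOOK" and creator == b"MOBI"
```

A caller would typically read the first 78 bytes of a .mobi or .prc file and pass them to `looks_like_mobi` before attempting to parse record 0.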

Mobipocket have some minimal file format info, mainly about the html encoding they use in the text of the book, at http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen

PalmDOC Header

The first record in the Palm Database Format gives more information about the Mobipocket file. The first 16 bytes are almost identical to the first sixteen bytes of a PalmDOC format file.

| bytes | content | comments |
|---|---|---|
| 2 | Compression | 1 = no compression, 2 = PalmDOC compression, 17480 = HUFF/CDIC compression |
| 2 | Unused | Always zero |
| 4 | text length | Uncompressed length of the entire text of the book |
| 2 | record count | Number of PDB records used for the text of the book |
| 2 | record size | Maximum size of each record containing text, always 4096 |
| 4 | Current Position | Current reading position, as an offset into the uncompressed text |

There are two differences from a Palm DOC file. There's an additional compression type (17480), and the Current Position bytes are used for a different purpose:

| bytes | content | comments |
|---|---|---|
| 2 | Encryption Type | 0 = no encryption, 1 = Old Mobipocket Encryption, 2 = Mobipocket Encryption |
| 2 | Unused | Always zero |

The old Mobipocket Encryption scheme only allows the file to be registered with one PID, unlike the current encryption scheme that allows multiple PIDs to be used in a single file. Unless specifically mentioned, all the encryption information on this page refers to the current scheme.
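
The 16-byte layout described above can be unpacked in one call. This is a minimal sketch, assuming big-endian integers (as is usual for Palm databases) and the Mobipocket interpretation of the last four bytes; the `PalmDocHeader` tuple and its field names are made up for illustration.

```python
import struct
from collections import namedtuple

# Field names are illustrative; the layout follows the two tables above.
PalmDocHeader = namedtuple(
    "PalmDocHeader",
    "compression unused text_length record_count record_size "
    "encryption_type unused2",
)


def parse_palmdoc_header(record0: bytes) -> PalmDocHeader:
    """Unpack the first 16 bytes of record 0 (big-endian is an assumption)."""
    if len(record0) < 16:
        raise ValueError("record 0 is shorter than 16 bytes")
    # H = 2-byte unsigned, I = 4-byte unsigned: 2+2+4+2+2+2+2 = 16 bytes.
    return PalmDocHeader(*struct.unpack_from(">HHIHHHH", record0, 0))
```

Checking `encryption_type != 0` is then enough to tell whether the text records need to be decrypted before they can be decompressed.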

MOBI Header

Most Mobipocket files also have a MOBI header in record 0 that follows these 16 bytes, and newer formats also have an EXTH header following the MOBI header, again all in record 0 of the PDB file format.

The MOBI header is of variable length and is not documented. Some fields have been tentatively identified as follows:

| offset | bytes | content | comments |
|---|---|---|---|
| 16 | 4 | identifier | the characters M O B I |
| 20 | 4 | header length | the length of the MOBI header, including the previous 4 bytes |
| 24 | 4 | Mobi type | The kind of Mobipocket file this is: 2 = Mobipocket Book, 3 = PalmDoc Book, 4 = Audio, 257 = News, 258 = News_Feed, 259 = News_Magazine, 513 = PICS, 514 = WORD, 515 = XLS, 516 = PPT, 517 = TEXT, 518 = HTML |
| 28 | 4 | text Encoding | 1252 = CP1252 (WinLatin1); 65001 = UTF-8 |
| 32 | 4 | Unique-ID | Some kind of unique ID number (random?) |
| 36 | 4 | Generator version | Potentially the version of the Mobipocket-generation tool. Always >= the value of the "format version" field and <= the version of mobigen used to produce the file. |
| 40 | 40 | Reserved | All 0xFF. In the case of a dictionary, or some newer file formats, a few bytes of this range are used |
| 80 | 4 | First Non-book index? | First record number (starting with 0) that's not the book's text |
| 84 | 4 | Full Name Offset | Offset in record 0 (not from start of file) of the full name of the book |
| 88 | 4 | Full Name Length | Length in bytes of the full name of the book |
| 92 | 4 | Language | Book language code. Low byte is main language (09 = English), next byte is dialect (08 = British, 04 = US) |
| 96 | 4 | Input Language | Input language for a dictionary |
| 100 | 4 | Output Language | Output language for a dictionary |
| 104 | 4 | Format version | Potentially the version of the Mobipocket format used in this file. Always >= 1 and <= the value of the "generator version" field. |
| 108 | 4 | First Image index? | First record number (starting with 0) that contains an image. Image records must be sequential. |
| 112 | 16 | ? | Sixteen bytes, often zeros |
| 128 | 4 | EXTH flags | Bitfield. If bit 6 (0x40) is set, then there's an EXTH record |
| 132 | 36 | ? | Unknown bytes, if the MOBI header is long enough |
| 168 | 4 | DRM Offset | Offset to DRM key info in DRMed files. 0xFFFFFFFF if no DRM |
| 172 | 4 | DRM Count | Number of entries in DRM info |
| 176 | 4 | DRM Size | Number of bytes in DRM info |
| 180 | 4 | DRM Flags | Some flags concerning the DRM info |
| 184 | ? | ? | Bytes to the end of the MOBI header, including the following if the header is long enough |
| 242 | 2 | Extra Data Flags | A set of binary flags, some of which indicate extra data at the end of each text block |
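
To show how the tentatively identified offsets fit together, here is a minimal sketch that reads a handful of fields from record 0 and extracts the full name of the book. It assumes big-endian integers, falls back to CP1252 for unlisted encodings (an arbitrary choice), and uses a hypothetical function name and dictionary keys.

```python
import struct

ENCODINGS = {1252: "cp1252", 65001: "utf-8"}  # values from the table above


def parse_mobi_header(record0: bytes) -> dict:
    """Read a few tentatively identified MOBI header fields from record 0."""
    if len(record0) < 132 or record0[16:20] != b"MOBI":
        raise ValueError("record 0 has no usable MOBI header")
    header_length, mobi_type, encoding, unique_id, generator = (
        struct.unpack_from(">5I", record0, 20))
    first_non_book, name_offset, name_length, language = (
        struct.unpack_from(">4I", record0, 80))
    (exth_flags,) = struct.unpack_from(">I", record0, 128)

    codec = ENCODINGS.get(encoding, "cp1252")  # fallback is an assumption
    raw_name = record0[name_offset:name_offset + name_length]
    info = {
        "header_length": header_length,
        "mobi_type": mobi_type,
        "codec": codec,
        "first_non_book_record": first_non_book,
        "full_name": raw_name.decode(codec, errors="replace"),
        "language_main": language & 0xFF,           # low byte
        "language_dialect": (language >> 8) & 0xFF,  # next byte
        "has_exth": bool(exth_flags & 0x40),         # bit 6
        "extra_data_flags": 0,
    }
    # Extra Data Flags sit at offset 242 only if the header reaches that far.
    if header_length >= 228 and len(record0) >= 244:
        (info["extra_data_flags"],) = struct.unpack_from(">H", record0, 242)
    return info
```

Because the header length is variable, the sketch treats a short header as having no Extra Data Flags rather than reading past its end.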

EXTH Header

If the MOBI header indicates that there's an EXTH header, it follows immediately after the MOBI header. Since the MOBI header is of variable length, the EXTH header isn't at any fixed offset in record 0. Note that some readers will ignore any EXTH header info if the Mobipocket version number specified in the MOBI header is 2 or less (perhaps 3 or less).

The EXTH header is also undocumented, so some of this is guesswork.

| bytes | content | comments |
|---|---|---|
| 4 | identifier | the characters E X T H |
| 4 | header length | the length of the EXTH header, including the previous 4 bytes |
| 4 | record count | The number of records in the EXTH header. The rest of the EXTH header consists of repeated EXTH records to the end of the EXTH length. |

Each EXTH record has the following layout, repeated until done:

| bytes | content | comments |
|---|---|---|
| 4 | record type | EXTH record type, one of the values listed in the next table |
| 4 | record length | Length of EXTH record = L, including the 8 bytes in the type and length fields |
| L-8 | record data | Data |

The known record types are:

| record type | name |
|---|---|
| 1 | drm_server_id |
| 2 | drm_commerce_id |
| 3 | drm_ebookbase_book_id |
| 100 | author |
| 101 | publisher |
| 102 | imprint |
| 103 | description |
| 104 | isbn |
| 105 | subject |
| 106 | publishingdate |
| 107 | review |
| 108 | contributor |
| 109 | rights |
| 110 | subjectcode |
| 111 | type |
| 112 | source |
| 113 | asin |
| 114 | versionnumber |
| 115 | sample |
| 116 | startreading |
| 201 | coveroffset |
| 202 | thumboffset |
| 203 | hasfakecover |
| 204 | Unknown |
| 205 | Unknown |
| 206 | Unknown |
| 207 | Unknown |
| 300 | Unknown |
| 401 | clippinglimit |
| 402 | publisherlimit |
| 403 | Unknown |
| 404 | ttsflag |
| 501 | cdetype |
| 502 | lastupdatetime |
| 503 | updatedtitle |
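
The record structure above lends itself to a simple loop. The following is a sketch only, assuming big-endian integers and that the caller has already computed where the EXTH header starts (16 plus the MOBI header length); `parse_exth` is a hypothetical name.

```python
import struct


def parse_exth(record0: bytes, exth_start: int):
    """Return a list of (record_type, data) pairs from an EXTH header.

    exth_start is the offset of the 'EXTH' identifier inside record 0,
    which the caller derives from the MOBI header length.
    """
    if record0[exth_start:exth_start + 4] != b"EXTH":
        raise ValueError("no EXTH header at the given offset")
    header_length, record_count = struct.unpack_from(
        ">II", record0, exth_start + 4)
    records = []
    pos = exth_start + 12  # first record follows the three 4-byte fields
    for _ in range(record_count):
        rec_type, rec_length = struct.unpack_from(">II", record0, pos)
        if rec_length < 8:
            raise ValueError("corrupt EXTH record length")
        # rec_length counts the 8 bytes of type and length as well.
        records.append((rec_type, record0[pos + 8:pos + rec_length]))
        pos += rec_length
    return records
```

The data is returned as raw bytes because the interpretation depends on the record type: text fields such as author (100) would be decoded with the book's text encoding, while others hold binary values.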

Finally, at the end of record 0 of the PDB file, we usually get the full name of the book, the offset of which is given in the MOBI header.

Variable-width integers

Some parts of the Mobipocket format encode data as variable-width integers. These integers are represented big-endian with 7 bits per byte in bits 1-7. They may be either forward-encoded, in which case only the LSB has bit 8 set, or backward-encoded, in which case only the MSB has bit 8 set. For example, the number 0x11111 would be represented forward-encoded as:

   0x04 0x22 0x91

And backward-encoded as:

   0x84 0x22 0x11
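
The two encodings can be sketched in a few lines. The function names are hypothetical; the behaviour follows the description above, with bit 8 (0x80) marking the last byte when forward-encoded and the first byte when backward-encoded.

```python
def encode_varint(value: int, forward: bool = True) -> bytes:
    """Encode value as a Mobipocket variable-width integer."""
    if value < 0:
        raise ValueError("only non-negative integers can be encoded")
    groups = []
    while True:
        groups.append(value & 0x7F)  # take 7 bits at a time
        value >>= 7
        if value == 0:
            break
    groups.reverse()  # big-endian: most significant group first
    if forward:
        groups[-1] |= 0x80  # flag on the least significant byte
    else:
        groups[0] |= 0x80   # flag on the most significant byte
    return bytes(groups)


def decode_forward(data: bytes, pos: int = 0):
    """Read forwards from pos; return (value, bytes consumed)."""
    value = 0
    consumed = 0
    for byte in data[pos:]:
        consumed += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return value, consumed
    raise ValueError("unterminated forward-encoded integer")


def decode_backward(data: bytes):
    """Read backwards from the end of data; return (value, bytes consumed)."""
    value = 0
    shift = 0
    consumed = 0
    for byte in reversed(data):
        consumed += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:
            return value, consumed
    raise ValueError("unterminated backward-encoded integer")
```

The backward form is what makes it possible to find a size field by scanning from the end of a record, without knowing in advance where the field begins.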

Trailing entries

The Extra Data Flags field of the MOBI header indicates which, if any, trailing entries are appended to the end of each text record. Each set bit in the field indicates a trailing entry. The entries appear to occur in bit order; e.g., trailing entry 1 immediately follows the text content and entry 16 occurs at the very end of the record. The effect and exact details of most of these entries are unknown. The trailing entries indicated by bits 2-16 appear to follow a common format. That format is:

   <data><size>

Where <size> is the size of the entire trailing entry (including the size of <size>) as a backward-encoded Mobipocket variable-width integer.
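
Since every entry for bits 2-16 ends with its own size, the total amount of trailing data can be found by walking backwards from the end of the record. The sketch below assumes that bit 1 is the least significant bit (0x0001) of the Extra Data Flags; the function names are hypothetical.

```python
def backward_varint_before(data: bytes, end: int) -> int:
    """Decode the backward-encoded integer whose last byte is data[end - 1]."""
    value = 0
    shift = 0
    pos = end
    while pos > 0:
        pos -= 1
        byte = data[pos]
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:  # reached the most significant byte
            return value
    raise ValueError("no terminated size field found")


def trailing_entries_size(record: bytes, extra_data_flags: int) -> int:
    """Total size in bytes of the trailing entries for bits 2-16."""
    total = 0
    flags = extra_data_flags >> 1  # skip bit 1, which uses another format
    while flags:
        if flags & 1:
            remaining = len(record) - total
            size = backward_varint_before(record, remaining)
            if size == 0 or size > remaining:
                raise ValueError("implausible trailing entry size")
            total += size
        flags >>= 1
    return total
```

Because all of these entries share one format, the order in which the set bits are visited does not affect the total; `record[:len(record) - total]` is then the record without those entries.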

Multibyte character overlap

When bit 1 of the Extra Data Flags field is set, each record is followed by a trailing entry containing any extra bytes necessary to complete a multibyte character which crosses the record boundary. The bytes do not participate in compression, regardless of which compression scheme is used for the file. The overlapping bytes then re-appear as normal content at the beginning of the following record. The trailing entry ends with a byte containing a count of the overlapping bytes plus additional flags.

| offset | bytes | content | comments |
|---|---|---|---|
| 0 | 0-3 | N terminal bytes of a multibyte character | |
| N | 1 | Size & flags | bits 1-2 encode N, use of bits 3-8 is unknown |
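
Removing this entry only requires looking at the final byte. The sketch below assumes that any trailing entries for bits 2-16 have already been stripped, so that the multibyte entry is the last thing in the record, and that bit 1 means 0x0001; the function name is hypothetical.

```python
def strip_multibyte_overlap(record: bytes, extra_data_flags: int) -> bytes:
    """Drop the multibyte-overlap trailing entry if bit 1 of the flags is set."""
    if not (extra_data_flags & 0x0001) or not record:
        return record
    n = record[-1] & 0x03  # bits 1-2 of the last byte encode N
    entry_size = n + 1     # N overlap bytes plus the size & flags byte
    if entry_size > len(record):
        raise ValueError("overlap entry is larger than the record")
    return record[:-entry_size]
```

The stripped record may end in the middle of a multibyte character. Since the overlapping bytes reappear at the start of the next record, concatenating the stripped (and decompressed) records yields the complete text.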

PalmDOC Compression

PalmDOC uses LZ77 compression techniques. DOC files can contain only compressed text. The format does not allow for any text formatting. This keeps files small, in keeping with the Palm philosophy. However, extensions to the format can use tags, such as HTML or PML, to include formatting within text. These extensions to PalmDoc are not interchangeable and are the basis for most eBook Reader formats on Palm devices.

LZ77 algorithms achieve compression by replacing portions of the data with references to matching data that has already passed through both encoder and decoder. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the character exactly distance characters behind it in the uncompressed stream." (The "distance" is sometimes called the "offset" instead.)

In the PalmDoc format, a length-distance pair is always encoded by a two-byte sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding the distance, 3 go to encoding the length, and the remaining two are used to make sure the decoder can identify the first byte as the beginning of such a two-byte sequence.
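
A decoder for this scheme is short. The sketch below is not taken from this page in full: the 11-bit distance and 3-bit length come from the description above, while the remaining details (bytes 0x80-0xBF start a pair, the stored length is biased by 3, bytes 0x01-0x08 introduce a run of literal bytes, and bytes 0xC0-0xFF stand for a space followed by the byte with its top bit cleared) are assumptions based on common descriptions of PalmDOC compression.

```python
def palmdoc_decompress(data: bytes) -> bytes:
    """Decompress one PalmDOC-compressed record (sketch, see assumptions)."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        c = data[i]
        i += 1
        if c == 0x00 or 0x09 <= c <= 0x7F:
            out.append(c)               # plain literal byte
        elif c <= 0x08:
            out += data[i:i + c]        # the next c bytes are literals
            i += c
        elif c <= 0xBF:
            if i >= n:
                raise ValueError("truncated length-distance pair")
            pair = (c << 8) | data[i]   # top two bits are the '10' marker
            i += 1
            distance = (pair >> 3) & 0x07FF  # 11 bits
            length = (pair & 0x07) + 3       # 3 bits, biased by 3
            if distance == 0 or distance > len(out):
                raise ValueError("distance points outside the output")
            for _ in range(length):
                # Copy byte by byte so overlapping matches repeat correctly.
                out.append(out[-distance])
        else:
            out.append(0x20)            # space ...
            out.append(c ^ 0x80)        # ... plus the character
    return bytes(out)
```

For example, the input `b"abc\x80\x18"` encodes the literals `abc` followed by a pair with distance 3 and length 3, which expands to `b"abcabc"`.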

PalmDOC data is always divided into 4096-byte blocks, and each block is compressed and decompressed independently.

PalmDOC does have support for bookmarks. These pointers are named and refer to an offset location in a file. If the file is edited these locations may no longer refer to the correct locations. Some reading programs allow the user to enter or edit these bookmarks while others treat them as a TOC. Some reading programs may ignore them entirely. They are stored at the end of the file itself so the full file needs to be scanned when loaded to find them.

MBP

This is the extension used on a side file (auxiliary) for MOBI formatted eBooks. It is used to store metadata used by the library software and also to store user entered data like bookmarks, annotations, last read position. This file is created automatically by the reader program when the eBook is first opened and has a .mbp extension. The Library management software in MobiPocket uses this file to get information displayed in the library window such as title and author so that it won't have to open the larger eBook file.

There is an ongoing effort to describe the binary MBP file format (see this site).

eBook Creation

There are several ways to create eBooks in the MOBI format. The rules for the format of the source files needed to create eBooks in MOBI are spelled out in documents on the MobiPocket web site. The recommended tool, MobiPocket Creator, is available as a download from the web site.

EBooks can also be converted from other forms using the Windows version of the MobiPocket Reader. Once converted the file can be used on any device supported by MobiPocket Reader.

Guidelines

In order to better support the features of the MobiPocket Reader there are some guidelines that need to be followed when creating a book in this format.

  • Do not specify a default font family, font size or other font attributes such as weight or color. This is a choice the person reading the eBook should be able to make. Font sizes and attributes can be specified for special headings and other specific items. Use only generic font families.
  • Do not impose justification for standard text. It may be needed for captions and other special text.
  • Do not use tables for anything except table data. Nested tables are not supported.
  • Do not use blank lines to try to force page changes. Use the <mbp:pagebreak/> tag.
  • Do not use multiple books for different devices. Instead use advanced features such as multi resolution images and platform specific frames.

Adapting images to various PDA screen resolutions

The IMG tag in Mobipocket publications supports up to three source attributes for various resolutions: src, losrc and hisrc. This makes it possible to optimize the same ebook for various devices. The image to be displayed is dynamically selected by the Reader according to the resolution of the screen on the actual device:

| Attribute | screen smallest size | example devices |
|---|---|---|
| losrc | <= 239 pixels | Low rez 160x160 Palm devices (PalmVx, Treo 600, Zire), Smartphones (Nokia 3650, Sony Ericsson P800/900, Microsoft smartphones) |
| src | >= 240 pixels (handhelds) | Pocket PC, Hi rez Palm devices (Sony Clie, Tungsten, Zire 71) |
| hisrc | >= 480 pixels | any desktop or tablet PC |

Example:

<img hisrc="cover480x640.gif" src="cover220x300.gif" losrc="cover140x140.gif"/>
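
The selection rule in the table can be expressed as a small function. This is only a sketch of the thresholds as listed; how the Reader behaves when the preferred attribute is missing from a tag is not described here, so no fallback is modelled, and the function name is hypothetical.

```python
def pick_image_attribute(smallest_screen_side: int) -> str:
    """Return the IMG attribute matching the screen's smallest dimension."""
    if smallest_screen_side >= 480:
        return "hisrc"  # desktop or tablet PC
    if smallest_screen_side >= 240:
        return "src"    # Pocket PC and high-resolution handhelds
    return "losrc"      # 239 pixels or fewer
```

With the example tag above, a 160x160 Palm device would load cover140x140.gif and a 240x320 Pocket PC would load cover220x300.gif.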

Please also note that there is a 63KB internal limit for images (a restriction of the Mobipocket .PRC format). GIFs have to be smaller than 63KB. You can use GIF optimization programs such as Ulead Smart Saver to get GIFs below that size. (If images are bigger than 63KB, MobiGEN automatically resizes them to fit within the limit, but you might not like the result.) JPEG images will instead use a lower quality setting to get the file size down without reducing the pixel dimensions.

Format limitations

There are many limitations in the MOBI format. A few are listed here.

  • Blocks of text can never have a greater than normal margin on their right side.
  • Left margins can only be specified in 1em increments. Text can only have a hanging indent if it has no left margin.
  • Text cannot flow around images taller than one line of text.
  • Image sizes cannot be scaled with font size.
  • In some -- but not all -- Mobipocket renderers, text with a left margin changes that margin value per line based upon the font-size at which point the preceding line-break occurred.
  • Many measures, such as the indent of a hanging indent, cannot be specified in ems.
  • Individual items of text cannot be displayed in a monospace font.
  • Tables display wildly differently on different Mobipocket renderers, especially tables which cross more than one screen.
  • Nested tables are not supported at all.
  • In addition you only get the full range of Mobipocket's formatting capabilities if you have markup written to use Mobipocket's non-standard, extended, and under-documented implementation of HTML 3.2. See: File tag reference on the mobipocket web site.

MOBI DRM

Mobi DRM can optionally be applied to this file format. The standard scheme, supported by Mobipocket and Overdrive servers, is based on an ID (the PID) derived from the reading device or program. This PID must be known to the server when an eBook is purchased; it is embedded in the file, which is then locked to the device. The licensing scheme does permit multiple devices (usually up to 4) to be supported. In this case the server needs to know the device IDs of all the devices. If you add a device you must tell the server and redownload the eBook to be able to read it on the new device. Normally there is no charge to add a device or to redownload the eBook. If the dealer goes out of business you may not be able to add a device, since there would be no way to redownload the file.

A second, simpler scheme, only requires knowledge of the account login name and password used to purchase the eBook. Once this data is entered the eBook can be read. Entering this data is only required once per device. This is a new scheme and some readers may not have support for this method.

A third method used on some eBooks is a generic MOBI key. The file is encrypted, but only with the generic MOBI key (not a PID-specific key). This means that the eBook can be read by any MobiPocket Reader software, on any device, but not by any non-MobiPocket software.

The DRM applies only to the eBook itself and not to the metadata. A library routine can read the metadata without having to unlock the eBook. Some programs have been devised to even be able to change this information without touching the DRM portion of the file.

MOBI eBook Readers and converters

In addition to the readers supplied by MobiPocket, there are also third-party readers and converters.

MOBI eBook Hardware Readers

Not all eBook readers that support Mobi format have the same features. Check Mobi Comparison for details on what is actually supported.
