Metadata
Metadata is data about data. For example, if an eBook is considered to be data, then the information describing the book is metadata. This data should be identified separately with keywords so that library software can use it to describe the eBook and retailers can use it to identify the contents.
Overview
Most library management programs extract this data from the original eBook file or container so that it can be referenced separately. In some cases they get the data from other sources instead of the file itself.
This is the equivalent of a catalog card in a public library.
Metadata should include at least:
- TITLE for the title of the book.
- AUTHOR for the author's name
- PUBLISHER for the publisher's name
- COPYRIGHT for copyright information or published date
- EISBN or ISBN for the book identifier
It may also include many other things such as a cover image, a description (subject) of the book, the genre and the language of the book. See Open eBook for an example of a package file containing metadata.
HTML metadata
The <head> section of an HTML document can be the source of metadata for some applications. Here is a sample of some of the data that may appear.
- <meta name="gemstar-legacy" content="publisher 2.0" />
- <title>The Title</title>
- <meta name="Author" content="Authors Name" />
- <meta name="Description" content="Mystery, Suspense, History, Gothic, Literature, Books, Arts" />
- <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
- <meta name="GENERATOR" content="rbmake v0.99d using the rbmake library v0.99d" />
Here is a more elaborate form, using Dublin Core terms, for HTML that is intended to be converted to MOBI:
<meta http-equiv="Content-Type" content="text/html;" />
<title>MyBookTitle</title>
<meta name="DCTERMS.title" content="MyBookTitle" />
<meta name="DCTERMS.language" content="en-US" scheme="DCTERMS.RFC4646" />
<meta name="DCTERMS.source" content="http://xml.openoffice.org/odf2xhtml" />
<meta name="DCTERMS.issued" content="TimeOf Creation eg:2006-05-25T16:53:56" scheme="DCTERMS.W3CDTF"/>
<meta name="DCTERMS.creator" content="Author's Name"/>
<meta name="DCTERMS.contributor" content="Other Contributor's Name" />
<meta name="DCTERMS.modified" content="Last Modified eg:2010-02-06T13:28:08.71" scheme="DCTERMS.W3CDTF"/>
<meta name="DCTERMS.provenance" content="Your ISBN Number" />
<meta name="DCTERMS.subject" content="What book is about" />
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" hreflang="en" />
<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" hreflang="en" />
<link rel="schema.DCTYPE" href="http://purl.org/dc/dcmitype/" hreflang="en" />
<link rel="schema.DCAM" href="http://purl.org/dc/dcam/" hreflang="en" />
<link rel="stylesheet" type="text/css" href="ebook.css" />
<base href="." />
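As a rough illustration of how an application might collect this data, the sketch below pulls the title and the named meta entries out of an HTML head using Python's standard html.parser module. The class and function names (HeadMetadataParser, extract_head_metadata) are made up for this example, and real library software may well work differently.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collects the <title> text and every <meta name=... content=...> pair."""

    def __init__(self):
        super().__init__()
        self.metadata = []        # list of (name, value) pairs, in document order
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attributes.get("name")
            if name:              # skips http-equiv entries, which have no name
                self.metadata.append((name, attributes.get("content") or ""))

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.metadata.append(("title", "".join(self._title_parts).strip()))
            self._title_parts = []

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_head_metadata(html_text):
    """Hypothetical helper: return the (name, value) pairs found in html_text."""
    parser = HeadMetadataParser()
    parser.feed(html_text)
    parser.close()
    return parser.metadata


SAMPLE_HEAD = (
    '<head><title>The Title</title>'
    '<meta name="Author" content="Authors Name" />'
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />'
    '</head>'
)

for name, value in extract_head_metadata(SAMPLE_HEAD):
    print(name, "=", value)
```

The result is kept as a list of pairs rather than a dictionary so that repeated names, such as several DCTERMS.contributor entries, are not lost.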
Metadata on files
Operating systems that recognize an eBook file type provide a way to attach metadata to the file. For example, on Windows you can right-click the file in a file browser window and select Properties. A General tab and a Summary tab will appear. Select the Summary tab to fill in the data:
- Title
- Subject
- Author
- Category (Genre)
- Keywords (used for searches)
- Comments
- Source (only in advanced view)
- Revision Number (only in advanced view)
On some operating systems (Mac OS X, Linux) this information can be used to correct a title or author's name in the metadata contained in the file itself, but Windows does not store it in the file. This data is shown when the mouse cursor is placed over the icon on systems that support the hover feature.
PDF files also show their metadata in a tab of the Properties dialog, but you may need to open the file itself to change the data. This data is not synchronized with the Summary tab on Windows systems, so the two can differ.
Culture data
Culture data is an important component of metadata about an eBook. The most obvious item is the language in which the eBook is written, which implies the character set. Other culture items are usually subdivisions of the language. Culture data includes, but is not limited to, the sorting order of the alphabet, the conventions used in writing dates, and the formatting of numbers.
The culture names follow the RFC 1766 standard in the format "<languagecode2>-<country/regioncode2>", where <languagecode2> is a lowercase two-letter code derived from ISO 639-1 and <country/regioncode2> is an uppercase two-letter code derived from ISO 3166. For example, U.S. English is "en-US". In cases where a two-letter language code is not available, the three-letter code derived from ISO 639-2 is used.
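The naming pattern described above can be checked mechanically. The following is a minimal sketch, assuming only the shape given in the text (a lowercase two- or three-letter language code, optionally followed by a hyphen and an uppercase two-letter region code). The function name parse_culture_name is hypothetical, and the code does not verify that the codes are actually registered in ISO 639 or ISO 3166.

```python
import re

# Shape only: "en", "en-US", or "fil-PH". Registered-code lookups are out of scope.
CULTURE_NAME = re.compile(r"^([a-z]{2,3})(?:-([A-Z]{2}))?$")


def parse_culture_name(name):
    """Split a culture name into (language, region); region is None if absent."""
    match = CULTURE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a culture name of the expected form: {name!r}")
    return match.group(1), match.group(2)


print(parse_culture_name("en-US"))
print(parse_culture_name("en"))
```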
ePUB metadata
The latest and most complete eBook publishing standard is called ePUB. ePUB rules allow users to add publication information according to DCMI terms. The metadata is placed in an OPF file inside the ePUB and follows the rules of Dublin Core (DC). Order is not significant, and duplicate tags are allowed. A language attribute can be added where needed. The standard defines an assortment of metadata, including the following:
Required terms:
- title
- language - use an RFC 3066 language code
- identifier - use a string that is likely to be unique; a URI or ISBN would be a good choice. An attribute can be used to specify the type of identifier, such as "ISBN" or "DOI". Multiple identifiers are allowed. Possible identifier types include:
  - DEWEY: Dewey Decimal System
  - DOI: Digital Object Identifier
  - ISBN: International Standard Book Number (ISBN.org)
  - ISSN: International Standard Serial Number
  - LCC: Library of Congress Classification
  - LCCN: Library of Congress Control Number (also known as "Library of Congress Card Number")
  - OSIS: Open Scriptural Information Standard
  - SICI: Serial Item and Contribution Identifier
  - URI: Uniform Resource Identifier
  - URL: Uniform Resource Locator
  - URN: Uniform Resource Name
Optional terms:
Although optional, these terms should be present if known. The standard permits reusing a term. For example, if there are two authors, there would be a separate creator entry for each.
- creator - The principal author of the work. One author per entry.
  - creator and contributor can have an attribute that defines the role (opf:role); see http://www.loc.gov/marc/relators/relacode.html for values. These include Author (aut), Illustrator (ill), Editor (edt, for compilations), etc.
- contributor - Identifies other persons who had a less important role, such as writing the prelude or providing illustrations where images play a minor role in the publication. This entry can also include an attribute that defines the role. Note that Book producer (bkp) would be a good role for the person who created the ePUB from another source.
- publisher - Examples of Publisher include a person, an organization, or a service.
- subject - no standard form but it could use the Library of Congress Subject Heading System for example.
- description - This field normally contains the description that is used to describe the book contents on a retail site or press release.
- date - The format is based on ISO 8601. In particular, dates without times are represented in the form YYYY[-MM[-DD]]: a required 4-digit year, an optional 2-digit month, and, if the month is given, an optional 2-digit day of the month.
  - The date can have an opf:event attribute whose values are not defined by the standard. Typical values might include creation, publication, and modification.
- type - general categories, functions, genres
- format - mimetype
- source - Information regarding a prior resource from which the publication was derived.
- relation - auxiliary resource and its relationship to the publication.
- coverage - The extent or scope of the publication’s content.
- rights - copyright, CC (Creative Commons), public domain, etc.
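The YYYY[-MM[-DD]] form given for the date term lends itself to a simple check. The sketch below is one possible way to validate such a value; the function name is_valid_opf_date is made up for this example, and it deliberately handles only dates without times.

```python
import re
from datetime import date

# A required 4-digit year, an optional 2-digit month, and an optional 2-digit day.
DATE_FORM = re.compile(r"^([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?$")


def is_valid_opf_date(text):
    """Return True if text has the form YYYY[-MM[-DD]] and names a real date."""
    match = DATE_FORM.match(text.strip())
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing parts default to 1 so the calendar check can still run.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True


for candidate in ("2006", "2006-05", "2006-05-25", "2006-13", "2006-5-25"):
    print(candidate, is_valid_opf_date(candidate))
```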
See the ePUB specification 2.0.1 for more information on this topic.
Example
An example would be:
<package version="2.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="BookId">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:title>Alice in Wonderland</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="BookId" opf:scheme="ISBN">123456789X</dc:identifier>
    <dc:creator opf:role="aut">Lewis Carroll</dc:creator>
  </metadata>
  ...
</package>
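To show how a program might read these entries, here is a minimal sketch using Python's standard xml.etree.ElementTree module on the example above (with the elided "..." part left out). The function name read_opf_metadata is hypothetical. Because duplicate terms are allowed, the entries are returned as a list rather than a dictionary.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"

OPF_SAMPLE = """<package version="2.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="BookId">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:title>Alice in Wonderland</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="BookId" opf:scheme="ISBN">123456789X</dc:identifier>
    <dc:creator opf:role="aut">Lewis Carroll</dc:creator>
  </metadata>
</package>"""


def read_opf_metadata(opf_xml):
    """Return (term, value, attributes) for each Dublin Core element, in order."""
    root = ET.fromstring(opf_xml)
    metadata = root.find(f"{{{OPF_NS}}}metadata")
    entries = []
    if metadata is None:
        return entries
    for element in metadata:
        if not element.tag.startswith(f"{{{DC_NS}}}"):
            continue  # ignore anything that is not a dc: element
        term = element.tag.split("}", 1)[1]
        value = (element.text or "").strip()
        # Drop the namespace part of attribute names: opf:role becomes "role".
        attributes = {key.split("}", 1)[-1]: val for key, val in element.attrib.items()}
        entries.append((term, value, attributes))
    return entries


for term, value, attributes in read_opf_metadata(OPF_SAMPLE):
    print(term, value, attributes)
```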
Viewer/Editor
EXIF metadata
See Exif#Metadata for JPG metadata information.
Tools
ExifTool is a free multipurpose tool for manipulating metadata. It is a platform-independent Perl library plus a command-line application for reading, writing and editing meta information in a wide variety of files. ExifTool supports many different metadata formats including EXIF, GPS, IPTC, XMP, JFIF, GeoTIFF, ICC Profile, Photoshop IRB, FlashPix, AFCP and ID3, as well as the maker notes of many digital cameras by Canon, Casio, DJI, FLIR, FujiFilm, GE, HP, JVC/Victor, Kodak, Leaf, Minolta/Konica-Minolta, Motorola, Nikon, Nintendo, Olympus/Epson, Panasonic/Leica, Pentax/Asahi, Phase One, Reconyx, Ricoh, Samsung, Sanyo, Sigma/Foveon and Sony.
MP3 metadata
The ID3v1.1 metadata for an MP3 file is 128 bytes of data that is appended to the end of the file. It consists of:
- The letters "TAG" - 3 bytes
- Song Title - 30 characters (note that the title is used for sorting on some devices)
- Artist - 30 characters
- Album - 30 characters
- Year - 4 characters
- Comment - 28 characters (a comment that uses characters 29-30 overwrites the zero byte and the track number)
- Binary 0 - 1 byte
- Track - 1 byte
- Genre - 1 byte (a lookup value from a defined table of 80 entries)
If a field needs fewer than the full number of characters, the remainder is filled with binary 0s.
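The fixed 128-byte layout listed above maps directly onto a struct format. The following is a minimal sketch of decoding such a tag in Python; the function name parse_id3v1 is made up for this example, and decoding the text as Latin-1 is an assumption, since the layout above does not name a character encoding.

```python
import struct

# "TAG", title, artist, album, year, comment, zero byte, track, genre = 128 bytes.
ID3V1_LAYOUT = struct.Struct("<3s30s30s30s4s28sBBB")


def _text(field):
    """Cut a field at its first binary 0 and decode it (Latin-1 is assumed)."""
    return field.split(b"\x00", 1)[0].decode("latin-1").strip()


def parse_id3v1(tag):
    """Decode a 128-byte ID3v1.1 tag; return None if the bytes are not a tag."""
    if len(tag) != ID3V1_LAYOUT.size or not tag.startswith(b"TAG"):
        return None
    _, title, artist, album, year, comment, zero, track, genre = ID3V1_LAYOUT.unpack(tag)
    if zero != 0:
        # The comment ran into bytes 29-30, so there is no track number.
        comment += bytes([zero, track])
        track = None
    return {
        "title": _text(title),
        "artist": _text(artist),
        "album": _text(album),
        "year": _text(year),
        "comment": _text(comment),
        "track": track,
        "genre": genre,  # index into the genre lookup table
    }


# Build a sample tag by hand so the example runs without an MP3 file.
sample = (
    b"TAG"
    + b"Song".ljust(30, b"\x00")
    + b"Artist".ljust(30, b"\x00")
    + b"Album".ljust(30, b"\x00")
    + b"1999"
    + b"Note".ljust(28, b"\x00")
    + bytes([0, 7, 12])
)
print(parse_id3v1(sample))
```

For a real file, the bytes to pass in would be the last 128 bytes of the file.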
Other embedded data includes:
- bit rate in kbps (an unusual number indicates the average of a variable bit rate)
- length: hours:minutes:seconds
- channels: mono, stereo
- sample rate: kHz
As of ID3 version 2 the metadata is placed at the front of the file. The file begins with the letters "ID3", followed by the metadata as described on the ID3 web site. This is a complex specification that can embed a lot of information in the file, including the cover image. Here is a list of features:
- The ID3v2 tag is a container format, just like IFF or PNG files, allowing new frames (chunks) as evolution proceeds.
- Residing in the beginning of the audio file makes it suitable for streaming.
- Has an 'unsynchronization scheme' to prevent ID3v2-incompatible players from attempting to play the tag.
- Maximum tag size is 256 megabytes and maximum frame size is 16 megabytes.
- Is byte conservative and can compress data, which keeps the files small.
- The tag supports Unicode but is ISO-8859-1 by default.
- Isn't focused entirely on musical audio; it also covers other types of audio such as audio books.
- Has several new text fields such as composer, conductor, media type, BPM, copyright message, etc. and the possibility to design your own as you see fit.
- Can contain lyrics as well as music-synced lyrics (karaoke) in almost any language.
- Is able to contain volume, balance, equalizer and reverb settings.
- Could be linked to CD-databases such as CDDB and FreeDB.
- Is able to contain images and just about any file you want to include.
- Supports enciphered information, linked information and weblinks.
- While intended for MP3 it can be used for other sound formats.
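The 256 megabyte limit mentioned above comes from the way the tag size is stored. The sketch below reads the 10-byte header at the front of the file. The header layout used here ("ID3", two version bytes, one flags byte, then four size bytes that each carry only 7 bits) is taken from the ID3v2 specification rather than from this article, and the function names are hypothetical.

```python
def decode_synchsafe(data):
    """Combine bytes that each carry 7 significant bits into one integer."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
    return value


def parse_id3v2_header(header):
    """Decode the first 10 bytes of a file; return None if no ID3v2 tag starts there."""
    if len(header) < 10 or header[:3] != b"ID3":
        return None
    return {
        "major_version": header[3],   # e.g. 3 for ID3v2.3
        "revision": header[4],
        "flags": header[5],
        # Size of the tag body in bytes, not counting this 10-byte header.
        "size": decode_synchsafe(header[6:10]),
    }


sample_header = b"ID3" + bytes([3, 0, 0]) + bytes([0, 0, 2, 1])
print(parse_id3v2_header(sample_header))

# Four 7-bit bytes give 28 bits, which is where the 256 megabyte ceiling comes from.
print(decode_synchsafe(bytes([0x7F] * 4)) + 1 == 256 * 1024 * 1024)
```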
Tools
- MP3 Tag Editor - supports lots of formats
Further References
- CBR and CBZ#Metadata shows the available metadata for these formats. There is no standard.
- BeCyPDFmetaedit can be used to view and edit metadata in a PDF file.
- Dublin Core contains the standards adopted by many publishers for metadata for eBooks. Feedbooks also uses this standard.
- Microsoft Culture Data reference
- XMP - An Adobe standard for adding metadata. This data is in an open format.
- ID3.org - MP3 standard
- JPG metadata - tools and specifications. JPG#Metadata shows the metadata in a JPG file.
- Library of Congress Standards - US government reference.
- IPTC - Photo Metadata, core data is exchangeable with Adobe XMP.
- Document Metadata Extraction - Tools to see metadata in various formats.
- ONIX - a publisher standard for the interchange of metadata for books.
- RIFF Resource Interchange File Format is for video and audio.