XML
XML (Extensible Markup Language) is a generalized markup language for the exchange of information. It is generalized in that it allows users to define their own tags, so a Document Type Definition (DTD) or other schema may be needed to interpret the data. The DTD is referenced by a DOCTYPE declaration near the top of the file. XML is also used in private applications where the DTD is known internally and other typically required declarations may be missing.
Overview
Historically XML was intended as a simplification of SGML (Standard Generalized Markup Language) and drew on experience with HTML. Compared to HTML it differs in that tags can be custom defined and the tag structure has to be 'well formed', meaning that there must be a close tag for every open tag unless the tag is self closing. XML is also case sensitive. An XML file may be identified by an XML declaration on its first line, such as:
<?xml version="1.0" ?>
XML is the language of choice for defining metadata. In eBooks the main metadata use is in OPF (Open Packaging Format) package files.
XHTML is a special form of XML where the DTD is already known. It is a predefined XML vocabulary based on HTML and obeys all XML rules. The tags are all lower case.
Tags starting with <? are processing instructions, and they end with ?>. They are generally used in a tagged document that may serve dual purposes and be read by different parsers. If a parser sees a <? followed by a target name it does not understand, it is to ignore the instruction. These instructions will still pass standard check routines such as epubcheck. A <!-- begins a comment. Everything up to the matching -->, even across several lines, is ignored, and the string -- may not appear inside the comment. Note that neither of these constructs requires a separate closing tag.
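As a minimal sketch of how a parser treats these constructs, the following Python example uses the standard library's xml.etree.ElementTree. The processing instruction target my-reader is a made-up name for illustration. With the default parser settings, both the comment and the processing instruction are skipped and only the real element remains.

```python
import xml.etree.ElementTree as ET

# "my-reader" is a hypothetical processing instruction target.
doc = """<?xml version="1.0"?>
<doc>
  <!-- a comment
       spanning two lines -->
  <?my-reader page-break?>
  <p>text</p>
</doc>"""

root = ET.fromstring(doc)

# The default parser drops comments and processing instructions,
# so only the <p> element shows up as a child of <doc>.
child_tags = [child.tag for child in root]
print(child_tags)
```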
All tags must begin with < and end with >. The tag name is inside. There can be optional attributes of the form name="value", with the value always enclosed in single or double quotes. There must be a closing tag with the same name as the opening tag, preceded by a /. If the element has no content it can be self closed by ending the tag with /> (a space before the /> is optional, but customary in XHTML for compatibility with older browsers). Line endings (CR, LF) between tags are normalized by the parser and are usually insignificant to the application, so they may be added for human reading of the file.
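The well-formedness rules above can be checked mechanically. The sketch below uses Python's standard xml.etree.ElementTree parser. The helper name is_well_formed and the sample strings are illustrative assumptions, not part of any eBook toolchain.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "matched tags": '<book id="1"><title>Example</title></book>',
    "self-closed tag": "<br/>",
    "missing close tag": "<book><title>Example</book>",
    "case mismatch": "<Book></book>",
    "unquoted attribute": "<book id=1></book>",
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```

The first two samples parse, while the last three are rejected for a missing close tag, a case mismatch between open and close tags, and an unquoted attribute value.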
Character encoding
Every XML document has a character encoding. By default, XML documents are assumed to use UTF-8 (or UTF-16 when a Byte Order Mark is present). Any other encoding must be specified in the XML declaration at the beginning of the file, as in the examples below (it is also legal to state UTF-8 explicitly):
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" encoding="EUC-JP"?>
All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents. If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16.
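The Byte Order Mark check described above can be sketched in a few lines of Python. The function name sniff_bom is a hypothetical helper. It only distinguishes UTF-8 from the two UTF-16 byte orders and does not attempt the full encoding detection a real XML processor performs (UTF-32, for example, is not considered).

```python
import codecs
from typing import Optional

def sniff_bom(data: bytes) -> Optional[str]:
    """Guess the encoding from a leading Byte Order Mark, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    return None  # no BOM: the document is assumed to be UTF-8 unless declared otherwise

utf8_doc = codecs.BOM_UTF8 + "<a/>".encode("utf-8")
utf16_doc = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
plain_doc = b"<a/>"

print(sniff_bom(utf8_doc), sniff_bom(utf16_doc), sniff_bom(plain_doc))
```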
XML eBook Formats
In practice, for eBook devices XML has taken the lead in defining the structure of the source document for most modern books.
- The standardization effort in the international community is contained in the ePUB specifications maintained by the International Digital Publishing Forum (IDPF), whose content documents are based on XHTML 1.1. See http://www.idpf.org/specs.htm. This standard defines the book data and also a container mechanism to hold all of the various pieces of a book.
- A second standard is RSS which is used as a distribution standard for many news releases and blogs on the Internet. As eBooks attempt to move into daily news reading the RSS format will become very important. It is also based on XML.
- A third XML standard is being used by the Russian community to define the Fiction Book standard. For more information see http://www.fictionbook.org/index.php/Eng:FictionBook. The format is called FB2.
- A fourth XML standard is used in the publication of the Sony BBeB format for eBooks. Its XML source format, LRS, is compiled into LRF files or, if protected with DRM, LRX files. This format is also known as the Xylog XML format.
Other XML formats for documents
Besides the formats being proposed and implemented in the eBook community, there is an ongoing debate on XML-based formats for document exchange. These are similar to the eBook formats, so they are listed here for awareness.
- ODF - The OASIS Open Document Format is an XML-based format backed by Sun, IBM and others. The OpenDocument Foundation, a former advocacy group for the format (not the OASIS committee itself), later jumped ship in favor of the CDF format proposed by W3C. ODF encapsulates XML in a zip file to avoid large file sizes. Text documents use .ODT as the file name extension. OpenOffice.org uses this format.
- CDF - Compound Document Formats is proposed as an XML format by the W3C, the body that controls such important standards as HTML and XHTML. See http://www.w3.org/2004/CDF/
- CDFML - The XML exchange form of the Common Data Format, proposed by NASA (http://cdf.gsfc.nasa.gov/) for the open exchange of scientific data. Despite the similar name it is unrelated to the W3C CDF.
- Microsoft Office Open XML - The format being promoted by Microsoft for document exchange. It is used as a save format in Word 2007. The XML is packaged in a zip file and used in its compressed form to save space. Word documents in this format use a .DOCX file name extension.
- OSIS is an XML Schema definition for Bibles and other Biblical research texts. It finds its way into several Bible study tools.
- ABW is the XML-based native format of AbiWord, a freely available word processor released under the GNU General Public License.
- XPS is the XML Paper Specification, used in Windows Vista as the printer spool format.
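Several formats in this list (ODF and Office Open XML) wrap their XML in a zip archive. The Python sketch below shows the general idea of reading one XML member out of such a container. It builds a tiny stand-in archive in memory: the member name content.xml matches the main content file of an ODF document, but the XML inside is made up for illustration and is not valid ODF. The helper name read_xml_member is also hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def read_xml_member(archive_bytes: bytes, member: str) -> ET.Element:
    """Parse one XML file stored inside a zip archive and return its root."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        with zf.open(member) as fh:
            return ET.parse(fh).getroot()

# Build a small stand-in container in memory (not a real ODF document).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("content.xml", '<?xml version="1.0"?><doc><p>Hello</p></doc>')

root = read_xml_member(buffer.getvalue(), "content.xml")
print(root.tag, root.find("p").text)
```

For a real .DOCX file the main document part is word/document.xml, and its elements are namespace-qualified, so element lookups need the namespace as well.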
XML character entity references
Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:
- &amp; → & (ampersand, U+0026)
- &lt; → < (less-than sign, U+003C)
- &gt; → > (greater-than sign, U+003E)
- &quot; → " (quotation mark, U+0022)
- &apos; → ' (apostrophe, U+0027)
All other character entity references have to be defined before they can be used. However, use of &apos; in XHTML should generally be avoided for compatibility reasons, as it was not defined for HTML. &#39; or &#x27; may be used instead.
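When generating XML from a program, the escaping is usually left to a library. The sketch below uses escape from Python's standard xml.sax.saxutils module, which always escapes the ampersand and the angle brackets. The extra mapping for the two quote characters is an illustrative choice that follows the advice to prefer a numeric reference over the named apostrophe entity.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

original = 'Tom & "Jerry" <1> it\'s'

# &, < and > are always escaped; the quote characters are added explicitly.
escaped = escape(original, {'"': "&quot;", "'": "&#39;"})
print(escaped)

# Round trip: a parser turns the references back into the original characters.
round_trip = ET.fromstring("<t>" + escaped + "</t>").text
print(round_trip == original)
```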
Here is the syntax for creating an ENTITY:
<!ENTITY greeting1 "Hello world">
<!ENTITY nbsp "&#160;">
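To show such declarations in use, the following Python sketch places them in an internal DTD subset and lets the standard xml.etree.ElementTree parser expand the references. The root element name note is an assumption chosen for the example, and the nbsp entity is assumed to stand for the no-break space character (U+00A0).

```python
import xml.etree.ElementTree as ET

# The DOCTYPE's internal subset declares two entities for this document only.
doc = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY greeting1 "Hello world">
  <!ENTITY nbsp "&#160;">
]>
<note>&greeting1;&nbsp;!</note>"""

root = ET.fromstring(doc)

# The parser replaces each reference with its declared replacement text.
text = root.text
print(repr(text))
```

Because entity expansion happens inside the parser, documents from untrusted sources are best handled with a parser configured to limit or disable it.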
Tools
- XML Copy Editor - Useful on any format derived from XML to check that it is well formed. It is also an editor and can provide pretty printing and other features.
Related Information
While not specific to the XML format as used in eBooks, the following articles are related.
- Metadata is used to describe eBooks and is generally in XML format even if the eBook isn't.
- MathML is an XML format specifically designed to add mathematical equations for use in eBooks and Web browsers. It is not required for ePub 2 but is required for ePub 3.
- ePub is the current focus of an eBook format that embodies and embraces the XML capabilities.
- DTBook is a standard using XML to support Digital Talking Books.
- SVG is a specialized language for defining vector graphics.