HTML

From MobileRead

HTML is the generic name for several web-based formats. HTML as a specific format is no longer used anywhere; it has been replaced by XHTML or HTML5. The extensions .htm and .html may still be used to refer to the newer formats, but .xhtml is preferred. This page is now archived.

Introduction

HTML stands for HyperText Markup Language and is the primary language used by all Internet web sites. It became the favored format for text in 1994, due in part to its flexibility of formatting. Several versions of HTML have been used in eBooks, from version 3.2 all the way to HTML5. The term HTML is also used generically to refer to any version, including XHTML.

There are plenty of other places to find out how HTML works in a web browser, so this page focuses on its use in an eBook reader or as an eBook source file. HTML files usually have an HTM or HTML extension. This page covers HTML from version 3.2 through version 4. The latest version is called HTML5. See also XHTML. If you are not familiar with HTML, try the HTML Tutorial at W3schools.

Ignoring Tags

A web browser is normally designed to ignore tags that it does not understand. However, it expects a file to contain two sections, <Head> and <Body>, which are contained within an overall section called <HTML>. eBook readers behave similarly, but when a file is used as a source file the eBook generation program will often complain about entries it does not understand. More and more, HTML is expected to conform to the idea of XHTML, where the rules are more stringent.

Most eBook generation programs do not require a head section and may not even need the body tag. Tags that are recognized but not needed will simply be ignored.

For example, in a Tome Raider source file anything within the following tag pairs will be skipped, along with the tags themselves:

  • <HEAD> ... </HEAD>
  • <SCRIPT> ... </SCRIPT>
  • <TITLE> ... </TITLE>
  • <STYLE> ... </STYLE>

If a <new> tag is found within the scope of any of the above tags, a new subject will be started. <BODY>, <HTML> and <META> tags will be skipped. Note that tags are often shown in uppercase, but HTML is not case sensitive, although later standards want lowercase.
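The skipping rules described above can be approximated in a few lines of Python. This is only a rough sketch of the idea, not Tome Raider's actual implementation: the function name strip_ignored is hypothetical, the matching is done with simple regular expressions, and the special handling of the <new> tag is not modeled.

```python
import re

# Tag pairs whose content is dropped together with the tags themselves.
SKIPPED_PAIRS = ("head", "script", "title", "style")
# Tags that are dropped on their own while the content between them is kept.
SKIPPED_TAGS = ("body", "html", "meta")


def strip_ignored(html: str) -> str:
    """Remove skipped tag pairs with their content, then the stand-alone tags."""
    for name in SKIPPED_PAIRS:
        # Non-greedy match from the opening tag to its closing tag, any case.
        pattern = rf"<{name}\b[^>]*>.*?</\s*{name}\s*>"
        html = re.sub(pattern, "", html, flags=re.IGNORECASE | re.DOTALL)
    for name in SKIPPED_TAGS:
        # Drop only the opening and closing tags, keeping what lies between.
        html = re.sub(rf"</?{name}\b[^>]*>", "", html, flags=re.IGNORECASE)
    return html


source = "<HTML><HEAD><TITLE>T</TITLE></HEAD><BODY><p>Hi</p></BODY></HTML>"
print(strip_ignored(source))
```

Because the match ignores case, <HEAD> and <head> are treated alike, which mirrors the note above that HTML tags are not case sensitive.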

Fonts and Characters

Data in HTML files will be displayed using a default set of fonts and a default character set unless others are specified in the file or in a CSS file. The W3C standard requires that a character set be specified. For example:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

This line belongs in the <head> section of the file. The encoding can also be declared at the beginning of the file in the style of XML. However, many eBook readers are not able to decode meta statements. Certain special characters can be shown using the entity reference capability.
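A conversion tool that does want to honor the declared character set could look for it roughly as follows. This is a minimal sketch under stated assumptions: the function name declared_charset, the 1024-byte search window, and the fallback to iso-8859-1 are all choices made for illustration, not the behavior of any particular reader.

```python
import re

# charset=... inside a <meta> tag, with or without quotes around the value.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)
# encoding="..." inside an XML declaration at the start of the file.
XML_ENCODING = re.compile(
    rb"""<\?xml[^>]*encoding\s*=\s*["']([A-Za-z0-9_.:-]+)["']""", re.IGNORECASE)


def declared_charset(raw: bytes, default: str = "iso-8859-1") -> str:
    """Return the character set named near the top of the file, or a default."""
    head = raw[:1024]  # assumed window: the declaration should come early
    for pattern in (XML_ENCODING, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return default


line = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(declared_charset(line))
```

The file is searched as bytes because its encoding is not yet known at this point; only the ASCII declaration itself needs to be readable.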

Supported tags

Almost no eBook reader fully supports all of the HTML tags. See ePub tags for a list of supported tags.

For the MOBI and OEB formats, see EBook HTML for the HTML tags that they support. This is a good example of HTML 3.2 and 4.0 support prior to the use of CSS.

The table below lists HTML features to highlight the capabilities of various eBook readers to render raw HTML code. Note that eBook readers generally expect one file to contain the full data, while HTML uses many external files such as CSS files, image files, other HTML pages, etc.

The list of devices should be expanded. For now the Bookeen Cybook Gen3 (CY), FBReader (FB), Hanlin V3 (H3) and Cool Reader (CR) are listed.

A blank entry means that the item has not been filled in yet. Some items are marked with a P to mean partial support.

Database Features
Devices CY FB H3 CR
Images N
Metadata
CSS P N
Internal Links N
External Links N
Internet Links N
Tables support
UTF-8 encoding
ZIP encapsulated Y Y Y
Font bold/italics Y N

Special Tags

eBook generation programs will almost always have tags that are not standard HTML tags. This is because the requirements for a page-based reader are very different from those of a browser.

For example the eBook Publisher tool used to generate IMP files for the eBookwise-1150 can use a tag <PB> to force a page break (new page) and uses <a name="toc"></a> specially to find the start of the TOC data.

To force a page break, use <p style="page-break-after: always"> (preferred) or <p style="page-break-before: always">, which is recognized by some eReader software, notably eBook Publisher.
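When a source file uses the special <PB> tag but the target software only recognizes the CSS form, the tags can be rewritten before conversion. The sketch below assumes that an empty paragraph carrying the style is an acceptable replacement; the function name pb_to_css is hypothetical, and whether a given reader honors the result has to be tested on that reader.

```python
import re

# CSS-based page break, in the preferred page-break-after form.
CSS_BREAK = '<p style="page-break-after: always"></p>'


def pb_to_css(html: str) -> str:
    """Replace every <PB> tag (any case, optional slash) with a CSS page break."""
    return re.sub(r"<pb\s*/?>", CSS_BREAK, html, flags=re.IGNORECASE)


print(pb_to_css("<p>End of chapter one</p><PB><p>Chapter two</p>"))
```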

MobiPocket files have many special customized tags.

Style Sheets

eBook standards encourage the use of style sheets, and CSS (Cascading Style Sheets) is the preferred form. However, their use is optional and not even supported in all eBook formats. CSS style sheets can be in one or more separate files and can also be listed between <STYLE> and </STYLE> tags in the <HEAD> section of the document. If they are in a separate document, then a pointer is needed in the <HEAD> section. For example:

<link rel="stylesheet" href="style.css" type="text/css" />

The style attribute can also be used in other tags within the <BODY> of the document. When more than one style could apply, the most specific reference will override the same entry in other places.
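The override rule can be pictured as layers of declarations applied from the most general to the most specific. The following is a deliberately simplified sketch: real CSS also weighs selector specificity and source order, and the function name resolve_style and the sample declarations are made up for illustration.

```python
def resolve_style(*layers: dict) -> dict:
    """Merge style declarations; later (more specific) layers win."""
    resolved = {}
    for layer in layers:
        resolved.update(layer)
    return resolved


# Hypothetical declarations, ordered from most general to most specific.
body_rule = {"text-align": "left", "font-family": "serif"}  # rule for body
h2_rule = {"text-align": "center"}                           # rule for h2
inline = {"font-family": "sans-serif"}                       # style attribute on one h2

final = resolve_style(body_rule, h2_rule, inline)
print(final)
```

Each property keeps the value from the most specific layer that sets it, so the heading here ends up centered and in a sans-serif font.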

Style sheets do not have to be complicated. A simple style sheet might look like:

body {margin-left:2; x-sbp-widow-push:2; x-sbp-orphan-pull:1;
      margin-right:1
     }
h2, h1 {text-align:center}

Attributes

A tag can contain attributes within the opening tag. These are property=value pairs. Some attributes are unique to particular tags, but some can be used in any tag, although they may not serve any useful purpose there. Among the universal attributes are:

  • lang - specifies the target language for the tag and overrides any less specific designation. It is a good idea to specify the language of the data at the beginning of the file. For example, <HTML lang=en> could be used.
  • style - As mentioned, style sheets are the preferred method of specifying style, but a style attribute can be used anywhere to override more general settings.
  • id - Any section can have an id attribute to provide a unique name for the section.
  • title - A title can provide text to display when you hover over a tag. It can also be used to provide the equivalent of a footnote for the data within a particular tag.
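To see which of these universal attributes a source file actually uses, the standard Python html.parser module can walk the tags. The class below is a small sketch; the name AttributeCollector and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record the universal attributes found on each opening tag."""

    UNIVERSAL = ("lang", "style", "id", "title")

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag name, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lowercase.
        picked = {name: value for name, value in attrs if name in self.UNIVERSAL}
        if picked:
            self.found.append((tag, picked))


collector = AttributeCollector()
collector.feed('<HTML lang=en><p id="ch1" title="Chapter one">Text</p></HTML>')
print(collector.found)
```

The parser accepts the unquoted lang=en form and uppercase tag names, so older HTML 3.2 and 4 sources can be inspected as they are.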

Coding HTML

It is possible to hand code HTML, but often it is generated as a translation from some other format. For example, Word DOC files can be converted to HTML by Word itself. Such conversions are often ugly, with all kinds of extra coding, and sometimes do not conform to the actual requirements of the HTML standards. One way to fix this code is a program called Htmltidy.

There are many specific tags for sections of a document such as h1, h2, h3, etc. for headers and p for paragraphs. A universal div tag can be used anytime to divide the section for any purpose. Many other tags are defined in the standard.

HTML Tools

  • v HTML Merger - A free program to merge multiple HTML files into one. Also known as SoftSnow Merger
  • HTTrack can download a full web site to your local computer.
  • HTML Merge - A GUI program that merges HTML files in a folder or linked from an index file. Called html-merge at http://htmlmg.sourceforge.net/
  • BBEdit is the leading professional HTML and text editor for the Macintosh.
  • TextWrangler is the little brother to BBEdit and is a free application for the Macintosh. It is a text editor that has HTML assistance.
  • Notepad++ is a free Windows text editor that has an HTML mode.

For more information
