EBML is short for Extensible Binary Meta Language. EBML specifies a binary, byte (octet) aligned format inspired by the principles of XML. EBML itself is a generalized description of the technique of binary markup. Like XML, it is completely agnostic to any data that it might contain.
The Matroska project is a specific implementation using the rules of EBML: it seeks to define a subset of the EBML language in the context of audio and video data (though it is not limited to this purpose). The format is made of two parts: the semantics and the syntax. The semantics specify a number of IDs and their basic types, and are not included in the data file/stream. A separate project deals with EBML in more detail and carries more recent updates.
Just like XML, the specific "tags" (IDs in EBML parlance) used in an EBML implementation are arbitrary. However, the semantics of EBML outline general data types and IDs.
The known basic types are:
- Signed Integer - Big-endian, any size from 1 to 8 Bytes
- Unsigned Integer - Big-endian, any size from 1 to 8 Bytes
- Float - Big-endian, defined for 4 and 8 Bytes (32, 64 bits)
- String - Printable ASCII (0x20 to 0x7E), zero-padded when needed
- UTF-8 - Unicode string, zero-padded when needed (RFC 2279)
- Date - signed 8-octet integer in nanoseconds, with 0 indicating the precise beginning of the millennium (at 2001-01-01T00:00:00,000000000 UTC)
- Master-Element - contains other EBML sub-elements of the next lower level
- Binary - not interpreted by the parser
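As an illustration of the basic types listed above, here is a minimal Python sketch of how the payload of an element might be decoded once its bytes have been extracted. The function names (`decode_uint`, `decode_int`, `decode_float`, `decode_string`, `decode_utf8`, `decode_date`) are hypothetical and not part of any EBML library. The sketch also assumes that zero padding means the text ends at the first null octet, and it truncates dates to microseconds because Python's `datetime` cannot hold nanoseconds.

```python
import struct
from datetime import datetime, timedelta, timezone

# Date value 0 corresponds to 2001-01-01T00:00:00 UTC.
EBML_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def _check_int_size(data: bytes) -> None:
    # Integers may be any size from 1 to 8 bytes.
    if not 1 <= len(data) <= 8:
        raise ValueError("integer payload must be 1 to 8 bytes long")


def decode_uint(data: bytes) -> int:
    """Decode a big-endian unsigned integer."""
    _check_int_size(data)
    return int.from_bytes(data, "big", signed=False)


def decode_int(data: bytes) -> int:
    """Decode a big-endian signed (two's complement) integer."""
    _check_int_size(data)
    return int.from_bytes(data, "big", signed=True)


def decode_float(data: bytes) -> float:
    """Decode a big-endian float; only 4 and 8 bytes are defined."""
    if len(data) == 4:
        return struct.unpack(">f", data)[0]
    if len(data) == 8:
        return struct.unpack(">d", data)[0]
    raise ValueError("float payload must be 4 or 8 bytes long")


def decode_string(data: bytes) -> str:
    """Decode an ASCII string (assumption: text ends at the first null octet)."""
    return data.split(b"\x00", 1)[0].decode("ascii")


def decode_utf8(data: bytes) -> str:
    """Decode a UTF-8 string (assumption: text ends at the first null octet)."""
    return data.split(b"\x00", 1)[0].decode("utf-8")


def decode_date(data: bytes) -> datetime:
    """Decode a signed 8-octet nanosecond count relative to 2001-01-01 UTC.

    The result is floored to whole microseconds.
    """
    if len(data) != 8:
        raise ValueError("date payload must be exactly 8 bytes long")
    nanoseconds = int.from_bytes(data, "big", signed=True)
    return EBML_EPOCH + timedelta(microseconds=nanoseconds // 1000)
```

For example, `decode_int(b"\xff")` yields -1 while `decode_uint(b"\xff")` yields 255, which shows why a parser needs the semantics to know how a given payload should be interpreted.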
As well as defining standard data types, EBML uses a system of Elements to make up an EBML "document." Each Element consists of an Element ID, a descriptor for the size of the element, and the binary data itself. Further, Elements can be nested: an Element can contain Elements of a lower "level."
Elements are laid out as follows: first the Element ID (also called EBML ID), then the Data Size, and then the non-interpreted Binary data itself.
The Element ID is coded with a UTF-8-like system:
bits, big-endian
1xxx xxxx - Class A IDs (2^7-1 possible values) (base 0x8X)
01xx xxxx xxxx xxxx - Class B IDs (2^14-1 possible values) (base 0x4X 0xXX)
001x xxxx xxxx xxxx xxxx xxxx - Class C IDs (2^21-1 possible values) (base 0x2X 0xXX 0xXX)
0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx - Class D IDs (2^28-1 possible values) (base 0x1X 0xXX 0xXX 0xXX)
- The leading bits of the EBML IDs are used to identify the length of the ID. The number of leading 0's + 1 is the length of the ID in octets. We will refer to the leading bits as the Length Descriptor.
- Any ID whose x bits are all set to 1 is a Reserved ID, hence the -1 in the definitions above.
- The Reserved IDs (all x set to 1) are the only IDs that may change the Length Descriptor.
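The Length Descriptor rule above can be sketched in Python as follows. This is only an illustrative sketch covering Class A to D IDs (1 to 4 octets); the function names `id_length`, `read_id`, and `is_reserved` are hypothetical, and the returned ID value keeps the Length Descriptor bits, which is an assumption about how a parser might store IDs.

```python
from typing import Tuple


def id_length(first_byte: int) -> int:
    """Return the ID length in octets: number of leading 0 bits + 1."""
    for length in range(1, 5):
        # Test bit 7 for length 1, bit 6 for length 2, and so on.
        if first_byte & (0x80 >> (length - 1)):
            return length
    raise ValueError("first byte does not start a Class A-D ID")


def read_id(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Read an Element ID at pos; return (id_value, length_in_octets)."""
    if pos >= len(buf):
        raise ValueError("no data at the given position")
    length = id_length(buf[pos])
    raw = buf[pos:pos + length]
    if len(raw) < length:
        raise ValueError("truncated Element ID")
    # The Length Descriptor bits are kept as part of the value.
    return int.from_bytes(raw, "big"), length


def is_reserved(id_value: int, length: int) -> bool:
    """Return True when every x bit of the ID is set to 1."""
    # Each octet of the ID contributes 7 x bits (7, 14, 21, or 28 in total).
    mask = (1 << (7 * length)) - 1
    return id_value & mask == mask
```

For instance, the octets 1A 45 DF A3 begin with the bits 0001, so `read_id` treats them as a single 4-octet Class D ID, while FF (1 followed by seven 1 bits) is reported as a Reserved Class A ID by `is_reserved`.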