2011-03-26 22:01:25 -04:00

from (http://wiki.mobileread.com/wiki/MOBI)
About
-----
MOBI is the format used by the MobiPocket Reader. It may have a .mobi
extension or it may have a .prc extension. The extension can be changed by the
user to either of the accepted forms. In either case it may be DRM protected or
non-DRM. The .prc extension is used because PalmOS doesn't support any file
extensions except .prc or .pdb. Note that Mobipocket prohibits the use of its
DRM format on dedicated eBook readers that support other DRM formats.
Description
-----------
The MOBI format was originally an extension of the PalmDOC format, adding
certain HTML-like tags to the data. Many MOBI formatted documents still use
this form. However, there is also a high compression version of this file format
that compresses data to a larger degree in a proprietary manner. There are some
third party programs that can read eBooks in the original MOBI format, but
only a few third party programs can read eBooks in the new compressed form.
The higher compression mode uses a Huffman coding scheme that has been called
the Huff/cdic algorithm.
From time to time features have been added to the format, so new files may have
problems if you try to read them with a down-level reader. Currently the
source files follow the guidelines in the Open eBook format.
Note that AZW for the Amazon Kindle is the same format as MOBI except that it
uses a slightly different DRM scheme.
Format
------
Like PalmDOC, the Mobipocket file format is that of a standard Palm Database
Format file. The header of that format includes the name of the database
(usually the book title and sometimes a portion of the author's name), which is
up to 31 bytes of data. The files are identified by a Creator ID of MOBI and a
Type of BOOK.
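The identification check above can be sketched as follows. This is a minimal
sketch that assumes the usual Palm Database header layout (a 32-byte
NUL-padded name at offset 0, the type at offset 60, the creator at offset 64
and a 2-byte record count at offset 76, all big-endian). Those offsets come
from the general PDB format rather than from this page, and the function names
are purely illustrative.

```python
import struct


def read_pdb_identity(header: bytes):
    """Return (name, type, creator, record_count) from a PDB header.

    Offsets are an assumption based on the common PDB layout, not on
    anything stated in this document.
    """
    name = header[:32].split(b"\x00", 1)[0]          # NUL-terminated name
    db_type, creator = struct.unpack_from(">4s4s", header, 60)
    (record_count,) = struct.unpack_from(">H", header, 76)
    return name, db_type, creator, record_count


def looks_like_mobi(header: bytes) -> bool:
    """True if the header carries Type BOOK and Creator ID MOBI."""
    _, db_type, creator, _ = read_pdb_identity(header)
    return db_type == b"BOOK" and creator == b"MOBI"
```

Only the first 78 bytes of the file are needed for this check.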
PalmDOC Header
--------------
The first record in the Palm Database Format gives more information about the
Mobipocket file. The first 16 bytes are almost identical to the first sixteen
bytes of a PalmDOC format file.
bytes  content           comments
2      Compression       1 = no compression, 2 = PalmDOC compression,
                         17480 = HUFF/CDIC compression.
2      Unused            Always zero
4      text length       Uncompressed length of the entire text of the book
2      record count      Number of PDB records used for the text of the book.
2      record size       Maximum size of each record containing text, always
                         4096.
4      Current Position  Current reading position, as an offset into the
                         uncompressed text
There are two differences from a Palm DOC file. There's an additional
compression type (17480), and the Current Position bytes are used for a
different purpose:
bytes  content          comments
2      Encryption Type  0 = no encryption, 1 = Old Mobipocket Encryption,
                        2 = Mobipocket Encryption.
2      Unknown          Usually zero
The old Mobipocket Encryption scheme only allows the file to be registered
with one PID, unlike the current encryption scheme that allows multiple PIDs to
be used in a single file. Unless specifically mentioned, all the encryption
information on this page refers to the current scheme.
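As an illustration, the 16 bytes described above could be unpacked like this.
It is a sketch only: it assumes the fields are big-endian (as is usual for Palm
databases), and the tuple and dictionary names are made up for the example.

```python
import struct
from collections import namedtuple

# Field names are illustrative; the layout follows the two tables above.
PalmDocHeader = namedtuple(
    "PalmDocHeader",
    "compression unused text_length record_count record_size "
    "encryption unknown",
)

COMPRESSION = {1: "none", 2: "PalmDOC", 17480: "HUFF/CDIC"}
ENCRYPTION = {0: "none", 1: "old Mobipocket", 2: "Mobipocket"}


def parse_palmdoc_header(record0: bytes) -> PalmDocHeader:
    """Unpack the first 16 bytes of record 0 (2+2+4+2+2+2+2 bytes)."""
    return PalmDocHeader(*struct.unpack_from(">HHIHHHH", record0, 0))
```

A caller would then look up COMPRESSION[header.compression] and
ENCRYPTION[header.encryption] to decide how to process the text records.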
MOBI Header
-----------
Most Mobipocket files also have a MOBI header in record 0 that follows these
16 bytes, and newer formats also have an EXTH header following the MOBI header,
again all in record 0 of the PDB file format.
The MOBI header is of variable length and is not documented. Some fields have
been tentatively identified as follows:
offset  bytes  content                comments
16      4      identifier             The characters M O B I
20      4      header length          The length of the MOBI header, including
                                      the previous 4 bytes
24      4      Mobi type              The kind of Mobipocket file this is:
                                        2    Mobipocket Book
                                        3    PalmDoc Book
                                        4    Audio
                                        257  News
                                        258  News_Feed
                                        259  News_Magazine
                                        513  PICS
                                        514  WORD
                                        515  XLS
                                        516  PPT
                                        517  TEXT
                                        518  HTML
28      4      text Encoding          1252 = CP1252 (WinLatin1); 65001 = UTF-8
32      4      Unique-ID              Some kind of unique ID number (random?)
36      4      Generator version      Potentially the version of the
                                      Mobipocket-generation tool. Always >=
                                      the value of the "format version" field
                                      and <= the version of mobigen used to
                                      produce the file.
40      40     Reserved               All 0xFF. In case of a dictionary, or
                                      some newer file formats, a few bytes are
                                      used from this range of 40 0xFFs
80      4      First Non-book index?  First record number (starting with 0)
                                      that's not the book's text
84      4      Full Name Offset       Offset in record 0 (not from start of
                                      file) of the full name of the book
88      4      Full Name Length       Length in bytes of the full name of the
                                      book
92      4      Language               Book language code. Low byte is main
                                      language, 09 = English; next byte is
                                      dialect, 08 = British, 04 = US
96      4      Input Language         Input language for a dictionary
100     4      Output Language        Output language for a dictionary
104     4      Format version         Potentially the version of the
                                      Mobipocket format used in this file.
                                      Always >= 1 and <= the value of the
                                      "generator version" field.
108     4      First Image record     First record number (starting with 0)
                                      that contains an image. Image records
                                      should be sequential. If there are
                                      no images this will be 0xffffffff.
112     4      HUFF record            Record containing Huff information
                                      used in HUFF/CDIC decompression.
116     4      HUFF count             Number of Huff records.
120     4      DATP record            Unknown: Record starts with DATP.
124     4      DATP count             Number of DATP records.
128     4      EXTH flags             Bitfield. If bit 6 (0x40) is set, then
                                      there's an EXTH record
The following fields are only present if the MOBI header is long enough.
132     36     ?                      36 unknown bytes, if MOBI is long enough
168     4      DRM Offset             Offset to DRM key info in DRMed files.
                                      0xFFFFFFFF if no DRM
172     4      DRM Count              Number of entries in DRM info.
176     4      DRM Size               Number of bytes in DRM info.
180     4      DRM Flags              Some flags concerning the DRM info.
184     2      ?
186     2      Last Image record      Possibly the number of the last image
                                      record. If there are no images in the
                                      book this will be 0xffff.
188     4      ?
192     4      FCIS record            Unknown. Record starts with FCIS.
196     4      ?
200     4      FLIS record            Unknown. Record starts with FLIS.
204     ?      ?                      Bytes to the end of the MOBI header,
                                      including the following if the header
                                      length >= 228 (244 from start of
                                      record)
242     2      Extra Data Flags       A set of binary flags, some of which
                                      indicate extra data at the end of each
                                      text block. This only seems to be valid
                                      for Mobipocket format version 5 and 6
                                      (and higher?), when the header length
                                      is 228 (0xE4) or 232 (0xE8).
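A sketch of how a reader might pull a handful of these fields out of record 0
is shown below. It assumes big-endian integers and relies only on offsets
listed above that are measured from the start of record 0. Since the header is
undocumented, the function name, the dictionary keys and the choice of fields
are illustrative rather than authoritative.

```python
import struct


def parse_mobi_header(record0: bytes) -> dict:
    """Extract a few tentatively identified MOBI header fields."""
    if record0[16:20] != b"MOBI":
        raise ValueError("record 0 has no MOBI header")
    length, mobi_type, encoding, unique_id, generator = struct.unpack_from(
        ">5I", record0, 20)
    first_non_book, name_offset, name_length, language = struct.unpack_from(
        ">4I", record0, 80)
    format_version, first_image = struct.unpack_from(">2I", record0, 104)
    (exth_flags,) = struct.unpack_from(">I", record0, 128)
    info = {
        "header_length": length,
        "mobi_type": mobi_type,
        "encoding": encoding,
        "unique_id": unique_id,
        "generator_version": generator,
        "first_non_book": first_non_book,
        # The name offset is relative to record 0, not to the file.
        "full_name": record0[name_offset:name_offset + name_length],
        # Low byte is the main language, the next byte is the dialect.
        "language": (language & 0xFF, (language >> 8) & 0xFF),
        "format_version": format_version,
        "first_image": None if first_image == 0xFFFFFFFF else first_image,
        "has_exth": bool(exth_flags & 0x40),
    }
    # The Extra Data Flags only exist in sufficiently long headers.
    if length >= 228:
        (info["extra_data_flags"],) = struct.unpack_from(">H", record0, 242)
    return info
```

The full name is returned as raw bytes; decoding it with the codec named by
the text Encoding field (1252 or 65001) is left to the caller.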
EXTH Header
-----------
If the MOBI header indicates that there's an EXTH header, it follows immediately
after the MOBI header. Since the MOBI header is of variable length, this isn't
at any fixed offset in record 0. Note that some readers will ignore any EXTH
header info if the Mobipocket version number specified in the MOBI header is 2
or less (perhaps 3 or less).
The EXTH header is also undocumented, so some of this is guesswork.
bytes  content        comments
4      identifier     The characters E X T H
4      header length  The length of the EXTH header, including the previous
                      4 bytes
4      record count   The number of records in the EXTH header. The rest of
                      the EXTH header consists of repeated EXTH records to
                      the end of the EXTH length.
       EXTH record start. Repeat until done.
4      record type    EXTH record type. Just a number identifying what's
                      stored in the record
4      record length  Length of the EXTH record = L, including the 8 bytes
                      in the type and length fields
L-8    record data    Data.
       EXTH record end. Repeat until done.
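The record loop described in this table can be sketched as follows. The sketch
assumes big-endian integers and that the EXTH header starts at 16 plus the
MOBI header length, which follows from the MOBI header beginning at offset 16
of record 0. The function name is hypothetical, and because the format is
undocumented the loop stops quietly on anything that looks malformed.

```python
import struct


def parse_exth(record0: bytes, mobi_header_length: int):
    """Return a list of (record_type, data) tuples from the EXTH header."""
    start = 16 + mobi_header_length
    ident, _exth_length, count = struct.unpack_from(">4sII", record0, start)
    if ident != b"EXTH":
        raise ValueError("no EXTH header after the MOBI header")
    records = []
    pos = start + 12                      # first EXTH record
    for _ in range(count):
        if pos + 8 > len(record0):
            break                         # truncated record 0
        rec_type, rec_length = struct.unpack_from(">II", record0, pos)
        if rec_length < 8:
            break                         # length must cover type + length
        records.append((rec_type, record0[pos + 8:pos + rec_length]))
        pos += rec_length
    return records
```

For example, an author entry comes back as (100, b"...") and a cover offset as
(201, four bytes) that can be unpacked with struct.unpack(">I", data).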
There are lots of different EXTH Records types. Ones found so far in Mobipocket
files are listed here, with possible meanings. Hopefully the table will be
filled in as more information comes to light.
record type  usual length  name                   comments
1                          drm_server_id
2                          drm_commerce_id
3                          drm_ebookbase_book_id
100                        author
101                        publisher
102                        imprint
103                        description
104                        isbn
105                        subject
106                        publishingdate
107                        review
108                        contributor
109                        rights
110                        subjectcode
111                        type
112                        source
113                        asin
114                        versionnumber
115                        sample
116                        startreading
117          3             adult                  Mobipocket Creator adds this if
                                                  Adult only is checked;
                                                  contents: "yes"
118                        retail price           As text, e.g. "4.99"
119                        retail price currency  As text, e.g. "USD"
201          4             coveroffset            Add to first image field in
                                                  Mobi Header to find PDB record
                                                  containing the cover image
202          4             thumboffset            Add to first image field in
                                                  Mobi Header to find PDB record
                                                  containing the thumbnail cover
                                                  image
203                        hasfakecover
204          4             Creator Software       Records 204-207 are usually
                                                  the same for all books from a
                                                  certain source, e.g. 1-6-2-41
                                                  for Baen, 201-1-0-85 for
                                                  Project Gutenberg, and
                                                  200-1-0-85 for Amazon, when
                                                  converted to a 32 bit integer.
205          4             Creator Major Version
206          4             Creator Minor Version
207          4             Creator Build Number
208                        watermark
209                        tamper proof keys      Used by the Kindle (and
                                                  Android app) for generating
                                                  book-specific PIDs.
300                        fontsignature
401          1             clippinglimit
402                        publisherlimit
403                        Unknown
404          1             ttsflag                1 - Text to Speech disabled;
                                                  0 - Text to Speech enabled
501          4             cdetype                PDOC - Personal Doc;
                                                  EBOK - ebook;
502                        lastupdatetime
503                        updatedtitle
Finally, at the end of record 0 of the PDB file format, we usually get the full
name of the book, the offset of which is given in the MOBI header.
Variable-width integers
-----------------------
Some parts of the Mobipocket format encode data as variable-width integers.
These integers are represented big-endian with 7 bits per byte in bits 1-7. They
may be either forward-encoded, in which case only the least significant byte
(the last one) has bit 8 set, or backward-encoded, in which case only the most
significant byte (the first one) has bit 8 set. For example, the number 0x11111
would be represented forward-encoded as:
0x04 0x22 0x91
And backward-encoded as:
0x84 0x22 0x11
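Both encodings can be sketched in a few lines. The function names below are
illustrative, and the decoders assume well-formed input. A forward-encoded
value is read left to right until a byte with bit 8 set; a backward-encoded
value is read right to left from a known end position until a byte with bit 8
set.

```python
def encode_varint(value: int, forward: bool = True) -> bytes:
    """Encode a non-negative integer as a Mobipocket variable-width integer."""
    groups = []
    while True:
        groups.append(value & 0x7F)       # 7 payload bits per byte
        value >>= 7
        if value == 0:
            break
    groups.reverse()                      # big-endian: high group first
    if forward:
        groups[-1] |= 0x80                # mark the last byte
    else:
        groups[0] |= 0x80                 # mark the first byte
    return bytes(groups)


def decode_forward(data: bytes, pos: int = 0):
    """Read a forward-encoded value at pos; return (value, next_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return value, pos


def decode_backward(data: bytes, end: int = None):
    """Read a backward-encoded value ending at end (exclusive).

    Returns (value, start), where start is the index of its first byte.
    """
    end = len(data) if end is None else end
    value, shift = 0, 0
    while True:
        end -= 1
        byte = data[end]
        value |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80 or end == 0:
            return value, end
```

With these helpers, encode_varint(0x11111) gives the bytes 0x04 0x22 0x91 and
encode_varint(0x11111, forward=False) gives 0x84 0x22 0x11, matching the
example above.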
Trailing entries
----------------
The Extra Data Flags field of the MOBI header indicates which, if any, trailing
entries are appended to the end of each text record. Each set bit in the field
indicates a trailing entry. The entries appear to occur in bit-order; e.g.,
trailing entry 1 immediately follows the text content and entry 16 occurs at
the very end of the record. The effect and exact details of most of these
entries are unknown. The trailing entries indicated by bits 2-16 appear to
follow a common format. That format is:
<data><size>
Where <size> is the size of the entire trailing entry (including the size of
<size>) as a backward-encoded Mobipocket variable-width integer.
Only a few bits have been identified:
bit     Data at end of records
0x0001  Multi-byte character overlaps
0x0002  Some data to help with indexing
0x0004  Some data about uncrossable breaks
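Because each <size> is backward-encoded and covers the whole entry, the
entries for bits 2-16 can be peeled off a record from the end without knowing
what they contain. The following is a sketch of that idea. The function names
are made up, and the 28-bit cap on the size is a defensive assumption rather
than something this page specifies. The entry for bit 1 has a different
format and is deliberately left in place.

```python
def trailing_entry_size(record: bytes, end: int) -> int:
    """Decode the backward-encoded <size> that ends at index end."""
    size, shift = 0, 0
    while end > 0 and shift < 28:         # cap is a safety assumption
        end -= 1
        byte = record[end]
        size |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:                   # first byte of the integer
            break
    return size


def strip_sized_entries(record: bytes, extra_data_flags: int) -> bytes:
    """Drop the trailing entries for bits 2-16 of the Extra Data Flags."""
    end = len(record)
    # Entry 16 sits at the very end, so walk from the highest bit down.
    for bit in range(15, 0, -1):          # zero-based 15..1 = bits 16..2
        if extra_data_flags & (1 << bit):
            end -= trailing_entry_size(record, end)
    return record[:max(end, 0)]
```

For instance, a record holding b"text" followed by the entries b"AB\x83" and
b"XYZ\x84", with flags 0x0006, is reduced to b"text".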
Multibyte character overlap
---------------------------
When bit 1 of the Extra Data Flags field is set, each record is followed by a
trailing entry containing any extra bytes necessary to complete a multibyte
character which crosses the record boundary. The bytes do not participate in
compression regardless of which compression scheme is used for the file.
However, unlike the trailing data bytes, the multibyte bytes (including the
count byte) do get included in any encryption. The overlapping bytes then
re-appear as normal content at the beginning of the following record. The
trailing entry ends with a byte containing a count of the overlapping bytes
plus additional flags.
offset  bytes  content                    comments
0       0-3    N terminal bytes of a
               multibyte character
N       1      Size & flags               bits 1-2 encode N, use of bits 3-8 is
                                          unknown
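Removing this entry is a matter of reading N from the last byte and dropping
N + 1 bytes. The sketch below assumes that the record has already been
decrypted and that any trailing entries for higher bits, which come after this
one, have already been removed. The function name is illustrative.

```python
def strip_multibyte_overlap(record: bytes) -> bytes:
    """Drop the multibyte overlap entry (N bytes plus the count byte)."""
    if not record:
        return record
    n = record[-1] & 0x03                 # bits 1-2 hold N
    return record[:-(n + 1)]
```

As an example, if a record's text ends with the first byte of a two-byte UTF-8
character (0xC3), the entry holds the second byte followed by a count of 1, so
b"caf\xc3\xa9\x01" is reduced to b"caf\xc3".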
PalmDOC Compression
-------------------
PalmDOC uses LZ77 compression techniques. DOC files can contain only compressed
text. The format does not allow for any text formatting. This keeps files small,
in keeping with the Palm philosophy. However, extensions to the format can use
tags, such as HTML or PML, to include formatting within text. These extensions
to PalmDoc are not interchangeable and are the basis for most eBook Reader
formats on Palm devices.
LZ77 algorithms achieve compression by replacing portions of the data with
references to matching data that has already passed through both encoder and
decoder. A match is encoded by a pair of numbers called a length-distance pair,
which is equivalent to the statement "each of the next length characters is
equal to the character exactly distance characters behind it in the uncompressed
stream." (The "distance" is sometimes called the "offset" instead.)
In the PalmDoc format, a length-distance pair is always encoded by a two-byte
sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding
the distance, 3 go to encoding the length, and the remaining two are used to
make sure the decoder can identify the first byte as the beginning of such a
two-byte sequence. The exact algorithm needed to decode the compressed text can
be found on the PalmDOC page.
PalmDOC data is always divided into 4096 byte blocks and the blocks are acted
upon independently.
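A decoder for one block can be sketched as below. Only the 11/3/2 bit split of
the length-distance pair comes from the text above. The byte ranges for
literals and space-plus-character codes, and the bias of 3 added to the stored
length, follow the PalmDOC scheme as it is commonly described elsewhere, so
treat them as assumptions here. The function name is illustrative.

```python
def palmdoc_decompress(data: bytes) -> bytes:
    """Decompress a single PalmDOC-compressed block."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i]
        i += 1
        if 1 <= c <= 8:
            # The next c bytes are copied through unchanged.
            out += data[i:i + c]
            i += c
        elif c < 0x80:
            # 0x00 and 0x09-0x7F stand for themselves.
            out.append(c)
        elif c >= 0xC0:
            # A space followed by the character c XOR 0x80.
            out.append(0x20)
            out.append(c ^ 0x80)
        else:
            # 0x80-0xBF starts a two-byte length-distance pair.
            if i >= len(data):
                raise ValueError("truncated length-distance pair")
            pair = ((c << 8) | data[i]) & 0x3FFF   # drop the 2 marker bits
            i += 1
            distance, length = pair >> 3, (pair & 0x07) + 3
            if not 0 < distance <= len(out):
                raise ValueError("distance points before start of block")
            # Copy byte by byte so overlapping matches repeat correctly.
            for _ in range(length):
                out.append(out[-distance])
    return bytes(out)
```

Since the blocks are independent, a reader would call this once per text
record and concatenate the results.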
PalmDOC does have support for bookmarks. These pointers are named and refer to
an offset location in a file. If the file is edited these locations may no
longer refer to the correct locations. Some reading programs allow the user to
enter or edit these bookmarks while others treat them as a TOC. Some reading
programs may ignore them entirely. They are stored at the end of the file itself
so the full file needs to be scanned when loaded to find them.
Image Records
-------------
If the file contains images, they follow the text blocks, with each image using a
single block. The 4096-byte record size in the PalmDoc header applies only to
text records; image records may be larger.
Magic Records
-------------
In some cases, MobiPocket Creator adds a 2-zero-byte record after the text
records in a file. This record is not included in the "record count" of text
records in the PalmDoc header, and is also not used as the "first non-book
index" in the MOBI header. (If the 2-zero-byte record is present, the index of
the following block is used as the "first non-book index".)
MobiPocket Creator also ends files with three records: 'FLIS', 'FCIS', and
'end-of-file', in that order. The 'FLIS' and 'FCIS' records do not seem to be
necessary for MobiPocket Reader or the Amazon Kindle 2 to read the file. The
'end-of-file' record might be necessary.
FLIS Record
-----------
The FLIS record appears to have a fixed value. The meaning of the values is not known.
offset  bytes  content     comments
0       4      identifier  the characters F L I S (0x46 0x4c 0x49 0x53)
4       4      ?           fixed value: 8
8       2      ?           fixed value: 65
10      2      ?           fixed value: 0
12      4      ?           fixed value: 0
16      4      ?           fixed value: -1
20      2      ?           fixed value: 1
22      2      ?           fixed value: 3
24      4      ?           fixed value: 3
28      4      ?           fixed value: 1
32      4      ?           fixed value: -1
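Since every field is fixed, the whole 36-byte record can be written as a single
constant. The sketch below assumes big-endian fields, with the two -1 values
stored as signed 32-bit integers (0xFFFFFFFF). The names are illustrative.

```python
import struct

# Fields in table order: identifier, 8, 65, 0, 0, -1, 1, 3, 3, 1, -1.
FLIS_RECORD = struct.pack(
    ">4sIHHIiHHIIi", b"FLIS", 8, 65, 0, 0, -1, 1, 3, 3, 1, -1)


def is_standard_flis(record: bytes) -> bool:
    """True if the record matches the fixed FLIS layout above."""
    return record == FLIS_RECORD
```

A writer could append FLIS_RECORD as-is, and a reader could use
is_standard_flis() to flag files that deviate from the observed values.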
FCIS Record
-----------
The FCIS record appears to have mostly fixed values.
offset  bytes  content     comments
0       4      identifier  the characters F C I S (0x46 0x43 0x49 0x53)
4       4      ?           fixed value: 20
8       4      ?           fixed value: 16
12      4      ?           fixed value: 1
16      4      ?           fixed value: 0
20      4      ?           text length (the same value as "text length" in
                           the PalmDoc header)
24      4      ?           fixed value: 0
28      4      ?           fixed value: 32
32      4      ?           fixed value: 8
36      2      ?           fixed value: 1
38      2      ?           fixed value: 1
40      4      ?           fixed value: 0
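The only variable field is the text length, so a 44-byte FCIS record can be
built from that one value. This is a sketch that assumes big-endian fields;
the function name is illustrative.

```python
import struct


def build_fcis(text_length: int) -> bytes:
    """Build an FCIS record for a book with the given text length."""
    return struct.pack(
        ">4sIIIIIIIIHHI",
        b"FCIS", 20, 16, 1, 0,
        text_length,          # same value as in the PalmDoc header
        0, 32, 8, 1, 1, 0,
    )
```

The text length passed in should be the uncompressed length stored in the
PalmDoc header of record 0.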
End-of-file Record
------------------
The end-of-file record is a fixed 4-byte record. While the last two bytes
appear to be a CRLF marker, the meaning of the first two bytes is unknown.
offset  bytes  content  comments
0       1      ?        fixed value: 233 (0xe9)
1       1      ?        fixed value: 142 (0x8e)
2       1      ?        fixed value: 13 (0x0d)
3       1      ?        fixed value: 10 (0x0a)
SRCS Record
-----------
kindlegen creates a record whose content is a zip archive of all source files
(i.e., .opf, .ncx, .htm, .jpg, ...) given to the command and puts it in the
generated MOBI file. The record begins with the "SRCS" signature and is
located just before the End-of-file record.
MOBI files created with Mobipocket Creator, Amazon's Personal Document Service,
or Kindle Direct Publishing (formerly Amazon DTP) do not include a SRCS record.
In the past, kindlegen had an undocumented option to suppress this record, but
the option was removed in 2010.
offset  bytes  content     comments
0       4      identifier  "SRCS" (0x53 0x52 0x43 0x53)
4       4      ?           fixed value(?): 0x00000010
8       4      ?           fixed value(?): 0x0000002f
12      4      ?           fixed value(?): 0x00000001
16             zip         The zip archive continues to the end of this
                           record
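Given the raw bytes of such a record, the embedded archive can be opened
directly with a standard zip reader. The following sketch assumes the archive
always starts at offset 16, as in the table above; the function name is
illustrative.

```python
import io
import struct
import zipfile


def list_srcs_files(record: bytes):
    """Return the file names stored in the zip archive of a SRCS record."""
    if record[:4] != b"SRCS":
        raise ValueError("not a SRCS record")
    # Everything after the 16-byte header is the zip archive.
    with zipfile.ZipFile(io.BytesIO(record[16:])) as archive:
        return archive.namelist()
```

The same ZipFile object could be used with extractall() to recover the
original .opf, .ncx and content files.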
MBP
---
MBP is the extension used for an auxiliary side file for MOBI formatted eBooks.
It is used to store metadata used by the library software and also to store
user-entered data such as bookmarks, annotations, and the last read position.
This file is created automatically by the reader program when the eBook is
first opened. The library management software in MobiPocket uses this file to
get the information displayed in the library window, such as title and author,
so that it doesn't have to open the larger eBook file.