from (http://wiki.mobileread.com/wiki/MOBI)

About
-----

MOBI is the format used by the MobiPocket Reader. It may have a .mobi
extension or it may have a .prc extension. The extension can be changed by the
user to either of the accepted forms. In either case it may be DRM protected or
non-DRM. The .prc extension is used because PalmOS doesn't support any file
extensions except .prc or .pdb. Note that Mobipocket prohibits its DRM format
from being used on dedicated eBook readers that support other DRM formats.


Description
-----------

MOBI format was originally an extension of the PalmDOC format, adding
certain HTML-like tags to the data. Many MOBI formatted documents still use
this form. However, there is also a high-compression version of this file format
that compresses data to a larger degree in a proprietary manner. There are some
third-party programs that can read eBooks in the original MOBI format, but
only a few third-party programs can read eBooks in the new
compressed form. The higher compression mode uses a Huffman coding scheme
that has been called the Huff/cdic algorithm.

From time to time features have been added to the format, so new files may have
problems if you try to read them with an older reader. Currently the
source files follow the guidelines in the Open eBook format.

Note that AZW for the Amazon Kindle is the same format as MOBI except that it
uses a slightly different DRM scheme.


Format
------

Like PalmDOC, the Mobipocket file format is that of a standard Palm Database
Format file. The header of that format includes the name of the database
(usually the book title and sometimes a portion of the author's name), which is
up to 31 bytes of data. The files are identified by a Creator ID of MOBI and a
Type of BOOK.

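As a rough illustration of the identification described above, the following
Python sketch reads the database name, Type and Creator ID from the start of a
file. The offsets used (name at 0, Type at 60, Creator ID at 64, record count
at 76, 8-byte record entries starting at 78) come from the general Palm
Database layout rather than from this page, and the function name
pdb_identity is made up for the example.

```python
import struct


def pdb_identity(raw: bytes):
    """Return (name, type, creator, record_offsets) from a Palm Database header.

    Assumption: standard Palm Database layout, all integers big-endian.
    """
    # The name field is 32 bytes, NUL-padded, so at most 31 bytes of data.
    name = raw[:32].split(b"\x00", 1)[0]
    db_type, creator = raw[60:64], raw[64:68]
    (num_records,) = struct.unpack_from(">H", raw, 76)
    # Each record entry is 8 bytes; the first 4 are the record's file offset.
    offsets = [struct.unpack_from(">I", raw, 78 + 8 * i)[0]
               for i in range(num_records)]
    return name, db_type, creator, offsets
```

A file would be treated as Mobipocket when the returned type is b"BOOK" and
the creator is b"MOBI"; the first offset in the list locates record 0.

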
PalmDOC Header
--------------

The first record in the Palm Database Format gives more information about the
Mobipocket file. The first 16 bytes are almost identical to the first sixteen
bytes of a PalmDOC format file.

bytes   content             comments
2       Compression         1 = no compression, 2 = PalmDOC compression,
                            17480 = HUFF/CDIC compression.
2       Unused              Always zero
4       text length         Uncompressed length of the entire text of the book
2       record count        Number of PDB records used for the text of the book.
2       record size         Maximum size of each record containing text, always
                            4096.
4       Current Position    Current reading position, as an offset into the
                            uncompressed text

There are two differences from a PalmDOC file. There's an additional
compression type (17480), and the Current Position bytes are used for a
different purpose:

bytes   content             comments
2       Encryption Type     0 = no encryption, 1 = Old Mobipocket Encryption,
                            2 = Mobipocket Encryption.
2       Unknown             Usually zero

The old Mobipocket Encryption scheme only allows the file to be registered
with one PID, unlike the current encryption scheme that allows multiple PIDs to
be used in a single file. Unless specifically mentioned, all the encryption
information on this page refers to the current scheme.

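The two tables above can be read in one pass. The Python sketch below unpacks
the first 16 bytes of record 0 in table order, with the last four bytes taken
as Encryption Type and Unknown as in a Mobipocket file. Big-endian byte order
is assumed (it is usual for Palm formats but not stated on this page), and the
names PalmDocHeader and parse_palmdoc_header are invented for the example.

```python
import struct
from collections import namedtuple

PalmDocHeader = namedtuple(
    "PalmDocHeader",
    "compression unused text_length record_count record_size encryption unknown",
)

# Compression values from the table; 17480 is 0x4448, the characters "DH".
COMPRESSION = {1: "none", 2: "PalmDOC", 17480: "HUFF/CDIC"}


def parse_palmdoc_header(record0: bytes) -> PalmDocHeader:
    # Field sizes 2+2+4+2+2+2+2 = 16 bytes, assumed big-endian.
    return PalmDocHeader(*struct.unpack_from(">HHIHHHH", record0, 0))
```

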
MOBI Header
-----------

Most Mobipocket files also have a MOBI header in record 0 that follows these
16 bytes, and newer formats also have an EXTH header following the MOBI header,
again all in record 0 of the PDB file format.

The MOBI header is of variable length and is not documented. Some fields have
been tentatively identified as follows:

offset  bytes   content                 comments
16      4       identifier              The characters M O B I
20      4       header length           The length of the MOBI header, including
                                        the previous 4 bytes
24      4       Mobi type               The kind of Mobipocket file this is
                                        2 Mobipocket Book
                                        3 PalmDoc Book
                                        4 Audio
                                        257 News
                                        258 News_Feed
                                        259 News_Magazine
                                        513 PICS
                                        514 WORD
                                        515 XLS
                                        516 PPT
                                        517 TEXT
                                        518 HTML
28      4       text Encoding           1252 = CP1252 (WinLatin1); 65001 = UTF-8
32      4       Unique-ID               Some kind of unique ID number (random?)
36      4       Generator version       Potentially the version of the
                                        Mobipocket-generation tool. Always >=
                                        the value of the "format version" field
                                        and <= the version of mobigen used to
                                        produce the file.
40      40      Reserved                All 0xFF. In case of a dictionary, or
                                        some newer file formats, a few bytes are
                                        used from this range of 40 0xFFs
80      4       First Non-book index?   First record number (starting with 0)
                                        that's not the book's text
84      4       Full Name Offset        Offset in record 0 (not from start of
                                        file) of the full name of the book
88      4       Full Name Length        Length in bytes of the full name of the
                                        book
92      4       Language                Book language code. Low byte is main
                                        language, 09 = English; next byte is
                                        dialect, 08 = British, 04 = US
96      4       Input Language          Input language for a dictionary
100     4       Output Language         Output language for a dictionary
104     4       Format version          Potentially the version of the
                                        Mobipocket format used in this file.
                                        Always >= 1 and <= the value of the
                                        "generator version" field.
108     4       First Image record      First record number (starting with 0)
                                        that contains an image. Image records
                                        should be sequential. If there are
                                        no images this will be 0xffffffff.
112     4       HUFF record             Record containing Huff information
                                        used in HUFF/CDIC decompression.
116     4       HUFF count              Number of Huff records.
120     4       DATP record             Unknown: Record starts with DATP.
124     4       DATP count              Number of DATP records.
128     4       EXTH flags              Bitfield. If bit 6 (0x40) is set, then
                                        there's an EXTH record
The following fields are only present if the MOBI header is long enough.
132     36      ?                       36 unknown bytes, if MOBI is long enough
168     4       DRM Offset              Offset to DRM key info in DRMed files.
                                        0xFFFFFFFF if no DRM
172     4       DRM Count               Number of entries in DRM info.
176     4       DRM Size                Number of bytes in DRM info.
180     4       DRM Flags               Some flags concerning the DRM info.
184     2       ?
186     2       Last Image record       Possibly the number of the last image
                                        record. If there are no images in the
                                        book this will be 0xffff.
188     4       ?
192     4       FCIS record             Unknown. Record starts with FCIS.
196     4       ?
200     4       FLIS record             Unknown. Record starts with FLIS.
204     ?       ?                       Bytes to the end of the MOBI header,
                                        including the following if the header
                                        length >= 228 (244 from start of
                                        record)
242     2       Extra Data Flags        A set of binary flags, some of which
                                        indicate extra data at the end of each
                                        text block. This only seems to be valid
                                        for Mobipocket format version 5 and 6
                                        (and higher?), when the header length
                                        is 228 (0xE4) or 232 (0xE8).

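To make the offsets concrete, here is a small Python sketch that pulls a few
of the tentatively identified fields out of record 0. It only touches fields
up to offset 128, assumes big-endian integers, and falls back to CP1252 when
the encoding value is not one of the two listed; that fallback and the name
parse_mobi_header are choices made for this example, not part of the format.

```python
import struct


def parse_mobi_header(record0: bytes) -> dict:
    """Read some MOBI header fields, using offsets from the start of record 0."""
    if record0[16:20] != b"MOBI":
        raise ValueError("no MOBI identifier at offset 16")
    header_length, mobi_type, encoding = struct.unpack_from(">3I", record0, 20)
    first_non_book, name_offset, name_length, language = struct.unpack_from(
        ">4I", record0, 80)
    format_version, first_image = struct.unpack_from(">2I", record0, 104)
    (exth_flags,) = struct.unpack_from(">I", record0, 128)
    # Assumption: treat any encoding value not in the table as CP1252.
    codec = {1252: "cp1252", 65001: "utf-8"}.get(encoding, "cp1252")
    full_name = record0[name_offset:name_offset + name_length].decode(
        codec, "replace")
    return {
        "header_length": header_length,
        "mobi_type": mobi_type,
        "encoding": encoding,
        "first_non_book": first_non_book,
        "full_name": full_name,
        # (main language, dialect): low byte, then the next byte.
        "language": (language & 0xFF, (language >> 8) & 0xFF),
        "format_version": format_version,
        "first_image": first_image,
        "has_exth": bool(exth_flags & 0x40),
    }
```

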
EXTH Header
-----------

If the MOBI header indicates that there's an EXTH header, it follows immediately
after the MOBI header. Since the MOBI header is of variable length, this isn't
at any fixed offset in record 0. Note that some readers will ignore any EXTH
header info if the mobipocket version number specified in the MOBI header is 2
or less (perhaps 3 or less).

The EXTH header is also undocumented, so some of this is guesswork.

bytes   content             comments
4       identifier          the characters E X T H
4       header length       the length of the EXTH header, including the
                            previous 4 bytes
4       record Count        The number of records in the EXTH header. The rest
                            of the EXTH header consists of repeated EXTH
                            records to the end of the EXTH length.
        EXTH record start   Repeat until done.
4       record type         EXTH record type. Just a number identifying what's
                            stored in the record
4       record length       length of EXTH record = L, including the 8 bytes
                            in the type and length fields
L-8     record data         Data.
        EXTH record end     Repeat until done.

There are lots of different EXTH record types. Ones found so far in Mobipocket
files are listed here, with possible meanings. Hopefully the table will be
filled in as more information comes to light.

record type  usual length  name                    comments
1                          drm_server_id
2                          drm_commerce_id
3                          drm_ebookbase_book_id
100                        author
101                        publisher
102                        imprint
103                        description
104                        isbn
105                        subject
106                        publishingdate
107                        review
108                        contributor
109                        rights
110                        subjectcode
111                        type
112                        source
113                        asin
114                        versionnumber
115                        sample
116                        startreading
117          3             adult                   Mobipocket Creator adds this if Adult only
                                                   is checked; contents: "yes"
118                        retail price            As text, e.g. "4.99"
119                        retail price currency   As text, e.g. "USD"
201          4             coveroffset             Add to first image field in Mobi Header to
                                                   find PDB record containing the cover image
202          4             thumboffset             Add to first image field in Mobi Header to
                                                   find PDB record containing the thumbnail
                                                   cover image
203                        hasfakecover
204          4             Creator Software        Records 204-207 are usually the same for
                                                   all books from a certain source, e.g.
                                                   1-6-2-41 for Baen and 201-1-0-85 for
                                                   project gutenberg, 200-1-0-85 for amazon
                                                   when converted to a 32 bit integer.
205          4             Creator Major Version
206          4             Creator Minor Version
207          4             Creator Build Number
208                        watermark
209                        tamper proof keys       Used by the Kindle (and Android app) for
                                                   generating book-specific PIDs.
300                        fontsignature
401          1             clippinglimit
402                        publisherlimit
403                        Unknown
404          1             ttsflag                 1 - Text to Speech disabled; 0 - Text to
                                                   Speech enabled
501          4             cdetype                 PDOC - Personal Doc;
                                                   EBOK - ebook;
502                        lastupdatetime
503                        updatedtitle

And now, at the end of Record 0 of the PDB file format, we usually get the full
name of the book, the offset of which is given in the MOBI header.

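The record layout above lends itself to a simple loop. The following Python
sketch walks the EXTH records and returns them as (type, data) pairs. It
assumes big-endian integers and that the EXTH header starts at offset 16 plus
the MOBI header length, which is how "follows immediately after the MOBI
header" is read here; parse_exth is a name made up for the example.

```python
import struct


def parse_exth(record0: bytes, mobi_header_length: int) -> list:
    """Return a list of (record_type, data) pairs from the EXTH header."""
    start = 16 + mobi_header_length  # MOBI header begins at offset 16
    if record0[start:start + 4] != b"EXTH":
        raise ValueError("no EXTH identifier after the MOBI header")
    _exth_length, count = struct.unpack_from(">II", record0, start + 4)
    pos = start + 12
    records = []
    for _ in range(count):
        rtype, rlen = struct.unpack_from(">II", record0, pos)
        if rlen < 8:
            raise ValueError("EXTH record shorter than its own 8-byte header")
        records.append((rtype, record0[pos + 8:pos + rlen]))
        pos += rlen  # rlen includes the type and length fields
    return records
```

With the records in hand, the cover image record would be found by unpacking
the 4-byte data of record type 201 as an integer and adding it to the First
Image record field of the MOBI header.

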
Variable-width integers
-----------------------

Some parts of the Mobipocket format encode data as variable-width integers.
These integers are represented big-endian with 7 bits per byte in bits 1-7. They
may be either forward-encoded, in which case only the least significant byte
(the last one) has bit 8 set, or backward-encoded, in which case only the most
significant byte (the first one) has bit 8 set. For example, the number 0x11111
would be represented forward-encoded as:

0x04 0x22 0x91

And backward-encoded as:

0x84 0x22 0x11

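A short Python sketch of both encodings may help; it reproduces the 0x11111
example above. The function names are invented for illustration, and the
backward decoder assumes the integer sits at the very end of the buffer it is
given, which is how the size field of a trailing entry is laid out.

```python
def encode_vwi(value: int, forward: bool = True) -> bytes:
    """Encode value as 7-bit groups, most significant group first."""
    groups = []
    while True:
        groups.insert(0, value & 0x7F)
        value >>= 7
        if value == 0:
            break
    # Forward: flag the last byte. Backward: flag the first byte.
    groups[-1 if forward else 0] |= 0x80
    return bytes(groups)


def decode_forward(data: bytes, pos: int = 0):
    """Read left to right until a byte with bit 8 set; return (value, new_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return value, pos


def decode_backward(data: bytes):
    """Read right to left from the end of data; return (value, bytes_used)."""
    value, shift, pos = 0, 0, len(data)
    while pos > 0:
        pos -= 1
        value |= (data[pos] & 0x7F) << shift
        shift += 7
        if data[pos] & 0x80:
            break
    return value, len(data) - pos
```

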
Trailing entries
----------------

The Extra Data Flags field of the MOBI header indicates which, if any, trailing
entries are appended to the end of each text record. Each set bit in the field
indicates a trailing entry. The entries appear to occur in bit-order; e.g.,
trailing entry 1 immediately follows the text content and entry 16 occurs at
the very end of the record. The effect and exact details of most of these
entries are unknown. The trailing entries indicated by bits 2-16 appear to
follow a common format. That format is:

<data><size>

Where <size> is the size of the entire trailing entry (including the size of
<size>) as a backward-encoded Mobipocket variable-width integer.

Only a few bits have been identified:

bit     Data at end of records
0x0001  Multi-byte character overlaps
0x0002  Some data to help with indexing
0x0004  Some data about uncrossable breaks


Multibyte character overlap
---------------------------

When bit 1 of the Extra Data Flags field is set, each record is followed by a
trailing entry containing any extra bytes necessary to complete a multibyte
character which crosses the record boundary. The bytes do not participate in
compression regardless of which compression scheme is used for the file.
However, unlike the trailing data bytes, the multibytes (including the count
byte) do get included in any encryption. The overlapping bytes then re-appear
as normal content at the beginning of the following record. The trailing entry
ends with a byte containing a count of the overlapping bytes plus additional
flags.

offset  bytes   content             comments
0       0-3     N terminal bytes
                of a multibyte
                character
N       1       Size & flags        bits 1-2 encode N, use of bits 3-8 is unknown

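Putting the trailing-entry format and the multibyte overlap entry together, a
reader has to work backwards from the end of each text record to find where
the text stops. The Python sketch below is one way to do that under the
description above: entries for bits 2-16 are peeled off from the end using
their backward-encoded size, then the multibyte entry (bit 1) is removed using
the low two bits of its final byte. The name trailing_size is hypothetical,
and decryption is not considered.

```python
def trailing_size(record: bytes, extra_flags: int) -> int:
    """Return how many bytes at the end of record are trailing entries."""
    end = len(record)
    # Bits 2-16: entry 16 is last in the record, so strip from the high bit down.
    for bit in range(15, 0, -1):
        if extra_flags & (1 << bit):
            size, shift, pos = 0, 0, end
            # Backward-encoded integer: read right to left until bit 8 is set.
            while pos > 0:
                pos -= 1
                size |= (record[pos] & 0x7F) << shift
                shift += 7
                if record[pos] & 0x80:
                    break
            end -= size  # size covers the data and the size field itself
    # Bit 1: N overlap bytes followed by one byte whose low 2 bits encode N.
    if extra_flags & 1 and end > 0:
        end -= (record[end - 1] & 0x03) + 1
    return len(record) - end
```

The text to decompress would then be record[:len(record) - trailing_size(record, flags)].

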
PalmDOC Compression
-------------------

PalmDOC uses LZ77 compression techniques. DOC files can contain only compressed
text. The format does not allow for any text formatting. This keeps files small,
in keeping with the Palm philosophy. However, extensions to the format can use
tags, such as HTML or PML, to include formatting within text. These extensions
to PalmDoc are not interchangeable and are the basis for most eBook Reader
formats on Palm devices.

LZ77 algorithms achieve compression by replacing portions of the data with
references to matching data that has already passed through both encoder and
decoder. A match is encoded by a pair of numbers called a length-distance pair,
which is equivalent to the statement "each of the next length characters is
equal to the character exactly distance characters behind it in the uncompressed
stream." (The "distance" is sometimes called the "offset" instead.)

In the PalmDoc format, a length-distance pair is always encoded by a two-byte
sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding
the distance, 3 go to encoding the length, and the remaining two are used to
make sure the decoder can identify the first byte as the beginning of such a
two-byte sequence. The exact algorithm needed to decode the compressed text can
be found on the PalmDOC page.

PalmDOC data is always divided into 4096 byte blocks and the blocks are acted
upon independently.

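For orientation, here is a compact Python sketch of a decoder for one block.
The 11-bit distance and 3-bit length split follows the description above; the
remaining details (which byte ranges are literals, the "space plus character"
bytes 0xC0-0xFF, and the stored length being 3 less than the real length) are
taken from the commonly described PalmDOC scheme rather than from this page,
so treat them as assumptions. The name palmdoc_decompress is made up.

```python
def palmdoc_decompress(data: bytes) -> bytes:
    """Decompress one PalmDOC-compressed block (sketch, see assumptions above)."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i]
        i += 1
        if 1 <= c <= 8:
            # Copy the next c bytes unchanged.
            out += data[i:i + c]
            i += c
        elif c < 0x80:
            # 0x00 and 0x09-0x7F stand for themselves.
            out.append(c)
        elif c >= 0xC0:
            # A space followed by the character with the top bit cleared.
            out += bytes((0x20, c ^ 0x80))
        else:
            # 0x80-0xBF: first byte of a two-byte length-distance pair.
            pair = ((c << 8) | data[i]) & 0x3FFF
            i += 1
            distance, length = pair >> 3, (pair & 0x07) + 3
            if distance == 0 or distance > len(out):
                raise ValueError("invalid distance in length-distance pair")
            for _ in range(length):
                # Byte-by-byte so a match may overlap the bytes it produces.
                out.append(out[-distance])
    return bytes(out)
```

Since the blocks are independent, each text record would be passed through
this function on its own (after removing any trailing entries) and the
results concatenated.

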
PalmDOC does have support for bookmarks. These pointers are named and refer to
an offset location in a file. If the file is edited these locations may no
longer refer to the correct locations. Some reading programs allow the user to
enter or edit these bookmarks while others treat them as a TOC. Some reading
programs may ignore them entirely. They are stored at the end of the file itself
so the full file needs to be scanned when loaded to find them.


Image Records
-------------

If the file contains images, they follow the text blocks, with each image using a
single block. The 4096-byte record size in the PalmDoc header applies only to
text records; image records may be larger.


Magic Records
-------------

In some cases, MobiPocket Creator adds a 2-zero-byte record after the text
records in a file. This record is not included in the "record count" of text
records in the PalmDoc header, and is also not used as the "first non-book
index" in the MOBI header. (If the 2-zero-byte record is present, the index of
the following block is used as the "first non-book index".)

MobiPocket Creator also ends files with three records: 'FLIS', 'FCIS', and
'end-of-file', in that order. The 'FLIS' and 'FCIS' records do not seem to be
necessary for MobiPocket Reader or the Amazon Kindle 2 to read the file. The
'end-of-file' record might be necessary.


FLIS Record
-----------

The FLIS record appears to have a fixed value. The meaning of the values is not known.

offset  bytes   content     comments
0       4       identifier  the characters F L I S (0x46 0x4c 0x49 0x53)
4       4       ?           fixed value: 8
8       2       ?           fixed value: 65
10      2       ?           fixed value: 0
12      4       ?           fixed value: 0
16      4       ?           fixed value: -1
20      2       ?           fixed value: 1
22      2       ?           fixed value: 3
24      4       ?           fixed value: 3
28      4       ?           fixed value: 1
32      4       ?           fixed value: -1


FCIS Record
-----------

The FCIS record appears to have mostly fixed values.

offset  bytes   content     comments
0       4       identifier  the characters F C I S (0x46 0x43 0x49 0x53)
4       4       ?           fixed value: 20
8       4       ?           fixed value: 16
12      4       ?           fixed value: 1
16      4       ?           fixed value: 0
20      4       ?           text length (the same value as "text length" in
                            the PalmDoc header)
24      4       ?           fixed value: 0
28      4       ?           fixed value: 32
32      4       ?           fixed value: 8
36      2       ?           fixed value: 1
38      2       ?           fixed value: 1
40      4       ?           fixed value: 0


End-of-file Record
------------------

The end-of-file record is a fixed 4-byte record. While the last two bytes
appear to be a CRLF marker, the meaning of the first two bytes is unknown.

offset  bytes   content     comments
0       1       ?           fixed value: 233 (0xe9)
1       1       ?           fixed value: 142 (0x8e)
2       1       ?           fixed value: 13 (0x0d)
3       1       ?           fixed value: 10 (0x0a)

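Because the FLIS, FCIS and end-of-file records are almost entirely fixed, a
writer can generate them directly from the tables. The Python sketch below
does so, assuming big-endian integers and that the -1 values are stored as
32-bit two's complement (0xFFFFFFFF); the function and constant names are
invented for the example.

```python
import struct


def build_flis() -> bytes:
    """FLIS record: identifier plus the fixed values from the table (36 bytes)."""
    return b"FLIS" + struct.pack(">IHHIiHHIIi", 8, 65, 0, 0, -1, 1, 3, 3, 1, -1)


def build_fcis(text_length: int) -> bytes:
    """FCIS record: fixed values with the text length at offset 20 (44 bytes)."""
    return b"FCIS" + struct.pack(
        ">8IHHI", 20, 16, 1, 0, text_length, 0, 32, 8, 1, 1, 0)


# End-of-file record: two unknown bytes followed by CR LF.
EOF_RECORD = bytes((0xE9, 0x8E, 0x0D, 0x0A))
```

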
SRCS Record
-----------

kindlegen creates a record whose content is a zip archive of all source files
(i.e., .opf, .ncx, .htm, .jpg, ...) given to the command and puts it in the
generated MOBI file. The record begins with the "SRCS" signature and is
located just before the End-of-file record.

MOBI files created with Mobipocket Creator, Amazon's Personal Document Service,
or Kindle Direct Publishing (formerly Amazon DTP) don't include an SRCS record.
In the past, kindlegen had an undocumented option to suppress this record, but
the option was removed in 2010.

offset  bytes   content     comments
0       4       identifier  "SRCS" (0x53 0x52 0x43 0x53)
4       4       ?           fixed value(?): 0x00000010
8       4       ?           fixed value(?): 0x0000002f
12      4       ?           fixed value(?): 0x00000001
16              zip         The zip archive continues to the end of this record

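Given the layout above, getting at the embedded sources is mostly a matter of
skipping the 16-byte prefix. The Python sketch below opens the archive with
the standard zipfile module; read_srcs is a made-up name, and the three
unknown fields are simply ignored.

```python
import io
import struct
import zipfile


def read_srcs(record: bytes) -> zipfile.ZipFile:
    """Open the zip archive stored in an SRCS record."""
    if record[:4] != b"SRCS":
        raise ValueError("record does not start with the SRCS signature")
    # Bytes 4-15 hold three values of unknown meaning; the zip starts at 16.
    return zipfile.ZipFile(io.BytesIO(record[16:]))
```

Calling namelist() on the result would list the source files that kindlegen
packed into the book.

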
MBP
---

This is the extension used on a side (auxiliary) file for MOBI formatted eBooks.
It is used to store metadata used by the library software and also to store
user-entered data such as bookmarks, annotations, and the last read position.
This file is created automatically by the reader program when the eBook is
first opened and has a .mbp extension. The Library management software in
MobiPocket uses this file to get information displayed in the library window,
such as title and author, so that it won't have to open the larger eBook file.