mirror of
https://github.com/kovidgoyal/calibre.git
synced 2025-06-21 14:30:57 -04:00
55 lines
2.4 KiB
Plaintext
About
-----

PalmDoc uses LZ77 compression techniques. DOC files can contain only compressed
text. The format does not allow for any text formatting. This keeps files
small, in keeping with the Palm philosophy. However, extensions to the format
can use tags, such as HTML or PML, to include formatting within text. These
extensions to PalmDoc are not interchangeable and are the basis for most eBook
Reader formats on Palm devices.

LZ77 algorithms achieve compression by replacing portions of the data with
references to matching data that has already passed through both encoder and
decoder. A match is encoded by a pair of numbers called a length-distance pair,
which is equivalent to the statement "each of the next length characters is
equal to the character exactly distance characters behind it in the
uncompressed stream." (The "distance" is sometimes called the "offset" instead.)

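The length-distance statement above can be sketched in a few lines of Python.
This is only an illustration: the name copy_match is hypothetical and does not
come from any PalmDoc implementation. Copying one byte at a time matters,
because the distance may be smaller than the length, in which case the match
overlaps the bytes being written and repeats a short pattern.

```python
def copy_match(out: bytearray, distance: int, length: int) -> None:
    """Append `length` bytes, each equal to the byte `distance` positions back.

    `copy_match` is a hypothetical helper used for illustration only.
    """
    if not 0 < distance <= len(out):
        raise ValueError("distance points outside the data decoded so far")
    for _ in range(length):
        # One byte per step, so the source region may overlap the bytes
        # that this same loop has just appended.
        out.append(out[-distance])


buf = bytearray(b"abc")
copy_match(buf, distance=3, length=6)  # buf now holds b"abcabcabc"
```
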
In the PalmDoc format, a length-distance pair is always encoded by a two-byte
sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding
the distance, 3 go to encoding the length, and the remaining two are used to
make sure the decoder can identify the first byte as the beginning of such a
two-byte sequence.

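As a rough sketch of that 16-bit layout, the Python below packs and unpacks a
pair. The function names are hypothetical. The bit order (marker bits '10'
first, then the 11 distance bits, then the 3 length bits) and the storage of
the length as length minus 3 are assumptions taken from the decoding rules
given later in this document.

```python
from typing import Tuple


def pack_pair(distance: int, length: int) -> bytes:
    """Pack a match into two bytes (hypothetical helper, layout assumed)."""
    if not 1 <= distance <= 0x7FF:
        raise ValueError("distance must be non-zero and fit in 11 bits")
    if not 3 <= length <= 10:
        raise ValueError("length must be between 3 and 10")
    # 2 marker bits '10' | 11 bits of distance | 3 bits of (length - 3)
    word = 0x8000 | (distance << 3) | (length - 3)
    return word.to_bytes(2, "big")


def unpack_pair(pair: bytes) -> Tuple[int, int]:
    """Return (distance, length) from a two-byte sequence."""
    word = int.from_bytes(pair, "big")
    if len(pair) != 2 or word >> 14 != 0b10:
        raise ValueError("not a length-distance pair")
    return (word >> 3) & 0x7FF, (word & 0x07) + 3
```

With this layout the first byte always falls in the range 0x80 to 0xbf, which
is how a decoder can tell a pair from a literal.
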
PalmDoc combines LZ77 with a simple kind of byte pair compression: a single
byte can stand for a space character followed by one other character.

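A minimal sketch of the byte pair idea, assuming the rule from the decoding
steps in this document (a byte from 0xc0 to 0xff means a space plus that byte
XORed with 0x80). The helper names are hypothetical. Only characters 0x40 to
0x7f can follow the space, since those are the values that XOR to 0xc0..0xff.

```python
def encode_space_pair(ch: int) -> int:
    """Return the one byte standing for a space followed by `ch`."""
    if not 0x40 <= ch <= 0x7F:
        raise ValueError("only characters 0x40..0x7f can follow the space")
    return ch ^ 0x80


def decode_space_pair(code: int) -> bytes:
    """Expand a byte in 0xc0..0xff back into a space plus one character."""
    if not 0xC0 <= code <= 0xFF:
        raise ValueError("not a byte pair code")
    return bytes((0x20, code ^ 0x80))
```

For example, encode_space_pair(ord("a")) gives 0xe1, and
decode_space_pair(0xe1) gives b" a".
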
PalmDoc files are decoded as follows:
-------------------------------------

Read a byte from the compressed stream. If the byte is

0x00: "1 literal": copy that byte unmodified to the decompressed stream.

0x09 to 0x7f: "1 literal": copy that byte unmodified to the decompressed
stream.

0x01 to 0x08: "literals": the byte is interpreted as a count from 1 to 8, and
that many literals are copied unmodified from the compressed stream to the
decompressed stream.

0x80 to 0xbf: "distance, length" pair: the 2 leftmost bits of this byte ('10')
are discarded, and the following 6 bits are combined with the 8 bits of the
next byte to make a 14-bit "distance, length" item. The upper 11 bits give the
distance backwards from the current location in the uncompressed text, and
the lower 3 bits give a value n; n+3 bytes (3 to 10 bytes) are copied from
that point.

0xc0 to 0xff: "byte pair": this byte is decoded into 2 characters: a space
character, and a character formed from this byte XORed with 0x80.

Repeat from the beginning until there are no more bytes in the compressed
stream.

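The decoding steps above can be put together as a short Python function. This
is a sketch written from the rules in this document, not the code of any
particular reader; the name decompress is hypothetical, and the error handling
for truncated or invalid input is an assumption.

```python
def decompress(data: bytes) -> bytes:
    """Decode one PalmDoc-compressed block (illustrative sketch)."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i]
        i += 1
        if c == 0x00 or 0x09 <= c <= 0x7F:
            # A single literal byte.
            out.append(c)
        elif c <= 0x08:
            # A run of c literal bytes (1 to 8); a truncated run is
            # copied as far as the input goes.
            out += data[i:i + c]
            i += c
        elif c <= 0xBF:
            # Distance, length pair spread over this byte and the next.
            if i >= len(data):
                raise ValueError("truncated distance, length pair")
            bits = ((c & 0x3F) << 8) | data[i]
            i += 1
            distance = bits >> 3          # upper 11 bits
            length = (bits & 0x07) + 3    # lower 3 bits, plus 3
            if not 0 < distance <= len(out):
                raise ValueError("distance points before start of output")
            for _ in range(length):
                # Byte by byte, so overlapping matches work.
                out.append(out[-distance])
        else:
            # Byte pair: a space, then this byte XORed with 0x80.
            out.append(0x20)
            out.append(c ^ 0x80)
    return bytes(out)
```

For example, decompress(b"Hi\xf4here") returns b"Hi there", and
decompress(b"abc\x80\x1b") returns b"abcabcabc" (distance 3, length 6).
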
PalmDoc data is always divided into 4096-byte blocks, and each block is
compressed and decompressed independently, so a length-distance pair never
refers to data in another block.

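A small sketch of the block handling, assuming the 4096-byte limit applies to
the uncompressed text. The names BLOCK_SIZE, split_blocks and decode_blocks
are hypothetical, and decode_block stands for whatever per-block decoder is in
use.

```python
from typing import Callable, Iterable, List

BLOCK_SIZE = 4096  # assumed to be the size of a block before compression


def split_blocks(text: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut uncompressed text into blocks of at most `block_size` bytes."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def decode_blocks(blocks: Iterable[bytes],
                  decode_block: Callable[[bytes], bytes]) -> bytes:
    """Decode each block on its own and join the results in order."""
    # No state is carried from one block to the next.
    return b"".join(decode_block(block) for block in blocks)
```

Because each block is handled alone, a reader can decode any one block without
touching the blocks before it.
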