mirror of https://github.com/kovidgoyal/calibre.git (synced 2025-10-31 10:37:00 -04:00)
| Rocket eBook File Format
 | |
| ------------------------
 | |
| 
 | |
| from http://rbmake.sourceforge.net/rb_format.html
 | |
| 
 | |
| 
 | |
| Overview
 | |
| --------
 | |
| 
 | |
| This document attempts to describe the format of a .rb file -- the book
 | |
| format that is downloaded into NuvoMedia's <http://www.nuvomedia.com>
 | |
| hand-held wonder, the Rocket eBook
 | |
| <http://www.rocket-ebook.com/enter.html>.
 | |
| 
 | |
| *Note:* All multi-byte integers are stored in Vax/Intel order (the
 | |
| opposite of network byte order). Most integers are 4 bytes (an int32),
 | |
| but there are some minor exceptions (as detailed below).
 | |
| 
 | |
| Also, the following document refers to the .rb file sections as "pages".
 | |
| 
 | |
| 
 | |
| Details
 | |
| -------
 | |
| 
 | |
| The first 4 bytes of the file seem to be a magic number (in hex): B0 0C
 | |
| B0 0C. I like to think of this as a hexadecimal pun on the word "book"
 | |
| (repeated). [Matt Greenwood has reported seeing a magic number of "B0 0C
 | |
| F0 0D" in another type of ReB-related file -- i.e. "book food".]
 | |
| 
 | |
| The next two bytes appear to be a version number, currently "02 00". I
 | |
| assume this means major version 2, minor version 0.
 | |
| 
 | |
| The next 4 bytes are the string "NUVO", followed by 4 bytes of 00h. (I
 | |
| have also seen an old title that had 0s in place of the "NUVO".)
 | |
| 
 | |
| This brings us up to offset 0Eh, at which point we have a 4-byte
 | |
| representation of the date the book was created (Matt Greenwood pointed
 | |
| this out to me -- thanks!). The year is encoded as an int16. An older
 | |
| version of the RocketLibrary encoded the year's full value (e.g.
 | |
| 1999 was "CF 07" and 2000 was "D0 07"), but a more recent version is now
 | |
| using the tm_year value verbatim -- i.e. it's storing 100 for the year
 | |
| 2000 ("64 00"). The year is followed by an int8 for the 1-relative month
 | |
| number, and an int8 for the day of the month.
 | |
| 
 | |
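The two year encodings described above can be told apart with a simple
threshold. The following is only a sketch: the function name
decode_rb_date and the "less than 1900 means tm_year" rule are
assumptions made for illustration, not something the RocketLibrary is
known to do.

```python
import struct


def decode_rb_date(raw):
    """Decode the 4-byte creation date stored at offset 0Eh.

    Returns a (year, month, day) tuple.
    """
    # int16 year, int8 1-relative month, int8 day, all little-endian.
    year, month, day = struct.unpack("<HBB", raw)
    # Assumption: a value below 1900 is a tm_year (years since 1900),
    # as written by the more recent software; larger values are the
    # full year, as written by the older software.
    if year < 1900:
        year += 1900
    return year, month, day
```

With this rule both "CF 07" (the full value 1999) and "64 00" (a tm_year
of 100) decode to the expected calendar year.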
| After that is 6 bytes of 00h. These may be reserved for setting the time
 | |
| of creation (at a guess).
 | |
| 
 | |
| Then, at offset 18h, we have an int32 that contains the absolute offset
 | |
| of the "Table of Contents" (the directory of the pages contained within
 | |
| this .rb file). In all of the .rb files I've seen, this remains
 | |
| constant with a value of 128h. However, I have tested an atypical .rb
 | |
| file where I placed the ToC at the end of the file (after all the file
 | |
| contents), and it worked fine. (I've chosen not to build any books in
 | |
| such a non-standard format, however.)
 | |
| 
 | |
| Immediately following this is an int32 with the length of the .rb file
 | |
| (so we can check if the file is complete or not).
 | |
| 
 | |
| All the bytes from here (offset 20h) up to offset 128h appear to only be
 | |
| used by an encrypted title. In a non-encrypted title, they are always 0.
 | |
| 
 | |
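Putting the fixed-offset fields together, a header reader might look
like the sketch below. The function name parse_rb_header and the
dictionary keys are hypothetical; the offsets are the ones given in the
text above, and the bytes from 20h to 128h (used by encrypted titles)
are ignored.

```python
import struct

MAGIC = b"\xb0\x0c\xb0\x0c"


def parse_rb_header(data):
    """Read the fixed-offset header fields from the start of a .rb file."""
    if data[:4] != MAGIC:
        raise ValueError("bad magic number, not a .rb file")
    # Two version bytes, taken here as major and minor (a guess, as in
    # the text).
    major, minor = data[4], data[5]
    # "NUVO", or zeros in at least one old title.
    vendor = data[6:10]
    # Creation date at 0Eh: int16 year, int8 month, int8 day.
    year, month, day = struct.unpack_from("<HBB", data, 0x0E)
    # ToC offset at 18h, followed by the total file length at 1Ch.
    toc_offset, file_length = struct.unpack_from("<II", data, 0x18)
    return {
        "version": (major, minor),
        "vendor": vendor,
        "date": (year, month, day),  # year is left undecoded here
        "toc_offset": toc_offset,
        "file_length": file_length,
    }
```

Comparing the returned file_length with the actual size of the file is
the completeness check mentioned above.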
| The table of contents typically comes next (at offset 128h). It starts
 | |
| with an int32 count of the number of "page" entries (.rb-file sections)
 | |
| in the ToC. Each entry consists of a name (zero-padded to 32 bytes),
 | |
| followed by 3 int32s: the length of this entry's data segment, the
 | |
| absolute offset of the data in the .rb file, and a flag. The known flag
 | |
| values are: 1 (encrypted), 2 (info page), and 8 (deflated). The names
 | |
| are tweaked as needed to ensure that they are all unique. The current
 | |
| RocketWriter software uses a unique 6-digit number, a dash, up to 8
 | |
| characters from the filename, and then the re-mapped suffix for the data
 | |
| (.html, .hidx, .png, .info, etc.). My rbmake library simply ensures that
 | |
| the names are no longer than 15 characters (not counting the suffix) and
 | |
| are all unique.
 | |
| 
 | |
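The ToC layout described above (an int32 count, then 44-byte entries
made of a 32-byte name and three int32s) can be read as follows. This is
a sketch under stated assumptions: parse_rb_toc is a hypothetical name,
and decoding the names as ASCII is a guess, since the text does not say
which character set they use.

```python
import struct

# 32-byte zero-padded name, then length, absolute offset and flag.
ENTRY = struct.Struct("<32sIII")


def parse_rb_toc(data, toc_offset=0x128):
    """Return a list of (name, length, offset, flag) tuples."""
    (count,) = struct.unpack_from("<I", data, toc_offset)
    entries = []
    pos = toc_offset + 4
    for _ in range(count):
        raw_name, length, offset, flag = ENTRY.unpack_from(data, pos)
        # Drop the zero padding after the name.
        name = raw_name.split(b"\0", 1)[0].decode("ascii", "replace")
        entries.append((name, length, offset, flag))
        pos += ENTRY.size
    return entries
```

The toc_offset argument defaults to the usual 128h, but should really be
taken from the int32 at offset 18h, since the ToC can live elsewhere.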
| Often the first item in the ToC is the info page, but it doesn't have to
 | |
| be. This page of information contains NAME=VALUE pairs that note the
 | |
| author, title, what the root-page's name is, etc. (See appendix A). This
 | |
| data is never encrypted nor compressed, so this entry's flag value is
 | |
| always "2".
 | |
| 
 | |
| An image page is always stored as a B&W image in PNG format. Since it
 | |
| has its own compression, it is stored without any additional attempt at
 | |
| deflation. I have also never seen an encrypted image, so its flag value
 | |
| is always 0.
 | |
| 
 | |
| An HTML page contains the tags and text that were re-written into a
 | |
| consistent syntax (this presumably makes the HTML renderer in the ReB
 | |
| itself simpler). HTML pages are typically compressed (See appendix B).
 | |
| Every HTML page appears to use the suffix .html no matter what the file
 | |
| name was on import (but I have seen older files with .htm used as the
 | |
| suffix, so the rocket appears to support both).
 | |
| 
 | |
| For every HTML page there is a corresponding .hidx page that contains a
 | |
| summary of the paragraph formatting and the position of the anchor names
 | |
| in the associated .html page (See appendix C). This page is sometimes
 | |
| compressed, depending on length (See appendix B).
 | |
| 
 | |
| There are also reference titles that have a .hkey page that contains a
 | |
| list of words that can be looked up in the associated .html page (See
 | |
| appendix D).
 | |
| 
 | |
| Immediately following the ToC is the data for each piece mentioned in
 | |
| the ToC, in the same order as it appeared in the ToC.
 | |
| 
 | |
| Finally, the end of the file appears to be padded with 20 bytes of 01h.
 | |
| 
 | |
| 
 | |
| Appendix A: Info Page Format
 | |
| ----------------------------
 | |
| 
 | |
| The info page consists of a series of lines that contain "NAME=VALUE"
 | |
| strings. Each line is terminated by a single newline. Here are the
 | |
| values that the RocketWriter generates:
 | |
| 
 | |
|     COMMENT=Info file for <title>
 | |
|     TYPE=2
 | |
|     TITLE=<title>
 | |
|     AUTHOR=<author>
 | |
|     URL=ebook:<long, unique string used for the file's name by the librarian>
 | |
|     GENERATOR=<e.g. RocketLibrarian 1.3.216>
 | |
|     PARSE=1
 | |
|     OUTPUT=1
 | |
|     BODY=<name of root HTML page (as it appears in the ToC)>
 | |
|     MENUMARK=menumark.html
 | |
|     SuggestedRetailPrice=<usually empty>
 | |
| 
 | |
| Encrypted titles have a few more entries (in addition to those listed above):
 | |
| 
 | |
|     ISBN=<ISBN number, including dashes>
 | |
|     REVISION=<digits>
 | |
|     TITLE_LANGUAGE=<en-us>
 | |
|     PUB_NAME=<Publisher's name>
 | |
|     PUBSERVER_ID=<digits>
 | |
|     GENERATOR=<e.g. RocketPress 1.3.121>
 | |
|     VERSION=<digits>
 | |
|     USERNAME=<rocket-ID>
 | |
|     COPY_ID=<digits>
 | |
|     COPYRIGHT=<copyright>
 | |
|     COPYTITLE=<another copyright?>
 | |
| 
 | |
| A reference title also has an indication that there is a .hkey page
 | |
| present, and may also have a GENRE of "Reference":
 | |
| 
 | |
|     HKEY=1
 | |
|     GENRE=Reference
 | |
| 
 | |
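Since the info page is just newline-terminated NAME=VALUE lines, reading
it takes very little code. A minimal sketch, assuming the page has
already been decoded to text (the character set is not specified here)
and using the hypothetical name parse_info_page:

```python
def parse_info_page(text):
    """Turn the NAME=VALUE lines of an info page into a dict."""
    info = {}
    for line in text.split("\n"):
        if "=" not in line:
            continue  # skip blank or malformed lines
        # Split on the first "=" only, so values may contain "=".
        name, value = line.split("=", 1)
        info[name] = value
    return info
```

The BODY entry of the result names the root HTML page as it appears in
the ToC, so it is the natural starting point for reading a book.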
| 
 | |
| Appendix B: The format of compressed data
 | |
| -----------------------------------------
 | |
| 
 | |
| Compressed pages have a data section in the .rb file with the following
 | |
| format:
 | |
| 
 | |
| The first int32 is a count of the number of 4096-byte chunks of data we
 | |
| broke the uncompressed page into (the last chunk can be shorter than
 | |
| 4096 bytes, of course).
 | |
| 
 | |
| This is immediately followed by an int32 with the length of the entire
 | |
| uncompressed data.
 | |
| 
 | |
| After this there are <count> int32s that indicate the size of each
 | |
| chunk's compressed data.
 | |
| 
 | |
| Following these length int32s is the output from a deflation (the
 | |
| algorithm used in gzip) for each 4096-byte chunk of the original data.
 | |
| It appears that you must use a window-bit size of 13 and a compression
 | |
| level of "best" to be compatible with the Rocket eBook's system software.
 | |
| 
 | |
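The chunked layout above can be exercised with a small round trip. This
sketch uses Python's zlib module with a window-bit size of 13 and level
9 ("best"). It assumes that each chunk is stored as a complete zlib
stream (with the zlib header and checksum) rather than raw deflate
output, which the description above does not settle. The names
deflate_page and inflate_page are hypothetical.

```python
import struct
import zlib

CHUNK = 4096


def deflate_page(data):
    """Build a compressed data section from an uncompressed page."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    packed = []
    for chunk in chunks:
        # Level 9 ("best"), window-bit size 13; each chunk is
        # compressed independently of the others.
        comp = zlib.compressobj(9, zlib.DEFLATED, 13)
        packed.append(comp.compress(chunk) + comp.flush())
    header = struct.pack("<II", len(chunks), len(data))
    sizes = struct.pack("<%dI" % len(packed), *[len(p) for p in packed])
    return header + sizes + b"".join(packed)


def inflate_page(blob):
    """Recover the uncompressed page from a compressed data section."""
    count, total = struct.unpack_from("<II", blob, 0)
    sizes = struct.unpack_from("<%dI" % count, blob, 8)
    pos = 8 + 4 * count
    parts = []
    for size in sizes:
        parts.append(zlib.decompress(blob[pos:pos + size]))
        pos += size
    data = b"".join(parts)
    if len(data) != total:
        raise ValueError("uncompressed length does not match the header")
    return data
```

Because every chunk is a separate stream, a reader can decompress just
the chunk that covers a given character offset instead of the whole
page.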
| 
 | |
| Appendix C: HTML-index Page Format
 | |
| ----------------------------------
 | |
| 
 | |
| The .hidx page's purpose is to allow the renderer to quickly look up the
 | |
| format of each paragraph (useful for random access to the data), and the
 | |
| position of the anchor names.
 | |
| 
 | |
| The first section lists the various paragraph-producing tags. It is
 | |
| headed by a line of "[tags <count>]", where <count> is the number of
 | |
| tags that follow this header. The tags are listed one per line, and have
 | |
| an implied enumeration from 0 to N-1 (which the other tags and the
 | |
| upcoming paragraph sections reference).
 | |
| 
 | |
| The first tag is typically (always?) "<HTML> -1". The number trailing
 | |
| the tag is the index of the tag (and, through it, the chain of tags) in
 | |
| which this tag is nested. So, if we have a <BR> nested inside a <P
 | |
| ALIGN="center">, it would be listed separately from a <BR> that was
 | |
| nested inside a normal paragraph, and each one would have a different
 | |
| trailing index number.
 | |
| 
 | |
| Following the tag section is the paragraph section. The heading is
 | |
| "[paragraphs <count>]", and is followed by a line for each paragraph.
 | |
| These lines consist of a character offset into the .html page for the
 | |
| start of the paragraph followed by a 0-relative offset into the tag
 | |
| section (indicating what kind of formatting to use for the indicated
 | |
| paragraph).
 | |
| 
 | |
| The paragraph-section character offsets point to the first bit of text
 | |
| after the associated tag.
 | |
| 
 | |
| The last section details the anchor names. The heading is
 | |
| "[names <count>]", and each item that follows is a quoted string of the
 | |
| anchor name, followed by a character offset into the .html page where
 | |
| we'll find that name. If there are no names in the associated HTML
 | |
| section, the heading is included with a 0 count (i.e. "[names 0]").
 | |
| 
 | |
| The name-section character offsets point to the start of the anchor tag
 | |
| (not after the tag, like the offsets in the "paragraphs" section).
 | |
| 
 | |
| The lines are terminated by newlines (in standard unix fashion).
 | |
| 
 | |
| For example:
 | |
| 
 | |
|     [tags 10]
 | |
|     <HTML> -1
 | |
|     <BODY> 0
 | |
|     <P ALIGN="right"> 1
 | |
|     <P ALIGN="left"> 1
 | |
|     <P> 1
 | |
|     <H3 ALIGN="center"> 1
 | |
|     <P ALIGN="center"> 1
 | |
|     <BR> 6
 | |
|     <H2 ALIGN="center"> 1
 | |
|     <BR> 1
 | |
| 
 | |
|     [paragraphs 42]
 | |
|     160 9
 | |
|     164 9
 | |
|     184 8
 | |
|     220 8
 | |
|     261 6
 | |
|     316 5
 | |
|     359 1
 | |
|     379 6
 | |
|     410 6
 | |
|     460 7
 | |
|     511 7
 | |
|     564 7
 | |
|     616 7
 | |
|     668 7
 | |
|     720 7
 | |
|     773 7
 | |
|     827 7
 | |
|     880 7
 | |
|     933 7
 | |
|     988 7
 | |
|     1043 7
 | |
|     1100 7
 | |
|     1157 7
 | |
|     1214 7
 | |
|     1270 7
 | |
|     1328 7
 | |
|     1385 7
 | |
|     1442 7
 | |
|     1497 7
 | |
|     1556 7
 | |
|     1561 7
 | |
|     1635 1
 | |
|     1656 5
 | |
|     1690 6
 | |
|     1737 7
 | |
|     1773 5
 | |
|     1798 4
 | |
|     1826 3
 | |
|     2663 1
 | |
|     2668 4
 | |
|     2689 2
 | |
|     2730 8
 | |
| 
 | |
|     [names 1]
 | |
|     "ch1" 2689
 | |
| 
 | |
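The three sections can be read with a small line-oriented parser. The
sketch below assumes that the lines in a real .hidx page are not
indented (the indentation above is only for display), that blank lines
may be skipped, and that the number trailing each tag is the index of
its enclosing tag. The names parse_hidx and tag_chain are hypothetical.

```python
import re

SECTION = re.compile(r"^\[(tags|paragraphs|names) (\d+)\]$")


def parse_hidx(text):
    """Split a .hidx page into its tags, paragraphs and names lists."""
    result = {"tags": [], "paragraphs": [], "names": []}
    section = None
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        match = SECTION.match(line)
        if match:
            section = match.group(1)
            continue
        if section is None:
            continue  # ignore anything before the first heading
        # Every entry ends with a space and an integer.
        head, tail = line.rsplit(" ", 1)
        if section == "tags":
            result["tags"].append((head, int(tail)))
        elif section == "paragraphs":
            result["paragraphs"].append((int(head), int(tail)))
        else:
            result["names"].append((head.strip('"'), int(tail)))
    return result


def tag_chain(tags, index):
    """Follow the parent indexes from one tag up to the outermost tag."""
    chain = []
    while index != -1:
        tag, index = tags[index]
        chain.append(tag)
    return chain[::-1]
```

For a paragraph entry such as "460 7" in the example above, calling
tag_chain with index 7 would walk from the <BR> entry through its
enclosing <P ALIGN="center"> and <BODY> up to <HTML>.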
| 
 | |
| Appendix D: HTML-key Page Format
 | |
| --------------------------------
 | |
| 
 | |
| The .hkey page contains a list of words, one per line, sorted in a
 | |
| strict ASCII sequence, each one followed by a tab and the offset in the
 | |
| .html page of the word's data. I presume that the .hkey page must share
 | |
| the same name prefix as its related .html page.
 | |
| 
 | |
| If the names contain high-bit characters, they are translated into
 | |
| regular ASCII in the .hkey file, since this allows the user to search
 | |
| for the words using unaccented characters.
 | |
| 
 | |
| The lines are terminated with a newline (in standard unix fashion).
 | |
| 
 | |
| An example:
 | |
| 
 | |
|     a	5
 | |
|     apple	38
 | |
|     b	84
 | |
|     book	104
 | |
| 
 | |
| Each of these offsets points to a paragraph tag in the associated .html
 | |
| page. I have only seen this sequence of tags used so far:
 | |
| 
 | |
|     <P><BIG><B>word</B></BIG> other stuff</P>
 | |
| 
 | |
| I have seen multiple <B>...</B> tags in the middle of the single set of
 | |
| <BIG>...</BIG> tags, but this is the basic tag format.
 | |
| 
 | |
| The offset in the .hkey page points to the start of the <P> tag.
 | |
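Because the words are sorted in ASCII order, a reader can find an entry
with a binary search. A minimal sketch, assuming the page has been
decoded to text and using the hypothetical names parse_hkey and
lookup_word:

```python
import bisect


def parse_hkey(text):
    """Return the (word, offset) pairs of a .hkey page, in file order."""
    entries = []
    for line in text.split("\n"):
        if not line:
            continue
        # The offset follows the last tab on the line.
        word, offset = line.rsplit("\t", 1)
        entries.append((word, int(offset)))
    return entries


def lookup_word(entries, word):
    """Return the .html offset for word, or None if it is not listed."""
    words = [w for w, _ in entries]
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return entries[i][1]
    return None
```

The offset returned by lookup_word is the position of the <P> tag that
starts the word's entry in the associated .html page.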
| 
 |