mirror of
https://github.com/kovidgoyal/calibre.git
synced 2025-07-09 03:04:10 -04:00
IGN:More conversion documentation
parent
10c9f4032a
commit
53f7cdec0b
@ -301,11 +301,17 @@ Removing headers and footers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These options are useful primarily for conversion of PDF documents. Often, the conversion leaves
behind page headers and footers in the text. These options use regular expressions to try to detect
the headers and footers and remove them. Remember that they operate on the intermediate XHTML produced
by the conversion pipeline. There is also a wizard to help you customize the regular expressions for
your document.

The header and footer regular expressions are used in conjunction with the remove header and footer options.
If the remove option is not enabled, the regular expression will not be applied to remove the matched text.
The removal works by using a Python regular expression. All matched text is simply removed from
the document. You can learn more about regular expressions and their syntax at
http://docs.python.org/library/re.html.
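
As a rough illustration of how such a removal behaves, the following sketch applies a Python regular expression to a fragment of intermediate XHTML. The sample markup and the header pattern are invented for this example; the expression you need depends on what your own document looks like after conversion.

```python
import re

# Hypothetical fragment of intermediate XHTML produced from a PDF, where
# every page starts with a running header such as "My Book Title 12".
xhtml = (
    "<p>My Book Title 12</p>\n"
    "<p>It was a dark and stormy night.</p>\n"
    "<p>My Book Title 13</p>\n"
    "<p>The rain fell in torrents.</p>\n"
)

# Assumed header pattern: a paragraph holding the title followed by a page number.
header_regex = r"<p>My Book Title \d+</p>\n?"

# All matched text is simply removed from the document.
cleaned = re.sub(header_regex, "", xhtml)
print(cleaned)
```

Because the expression runs against XHTML rather than the visible text, the pattern usually has to account for the surrounding tags as well as the header wording.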

Miscellaneous
~~~~~~~~~~~~~~

@ -403,7 +409,9 @@ This will result in an automatically generated two level Table of Contents that
Format specific tips
----------------------

Here you will find tips specific to the conversion of particular formats. Options specific to a particular
format, whether input or output, are available in the conversion dialog under their own section, for example
`TXT Input` or `EPUB Output`.

Convert Microsoft Word documents
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -419,7 +427,57 @@ generating the Table of Contents much simpler. It is called BookCreator and is a
Convert TXT documents
~~~~~~~~~~~~~~~~~~~~~~

TXT documents have no well-defined way to specify formatting like bold, italics, etc., or document structure like paragraphs, headings, sections and so on.
Since TXT documents provide no way to explicitly mark parts of
the text, by default |app| only groups lines in the input document into paragraphs. The default is to assume one or
more blank lines are a paragraph boundary::

    This is the first.

    This is the
    second paragraph.
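
The default grouping rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the code |app| uses; the function name ``split_paragraphs`` is made up for the example.

```python
import re

def split_paragraphs(text):
    """Group lines into paragraphs, treating one or more blank lines as a boundary."""
    # Split on runs of blank (or whitespace-only) lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Join the lines inside each block with single spaces.
    return [" ".join(line.strip() for line in block.splitlines()) for block in blocks]

sample = "This is the first.\n\nThis is the\nsecond paragraph.\n"
print(split_paragraphs(sample))
```

Lines that are not separated by a blank line end up in the same paragraph, which is why the second paragraph in the sample is joined into a single line.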

TXT input supports a number of options to differentiate how paragraphs are detected.

:guilabel:`Treat each line as a paragraph`
    Assumes that every line is a paragraph::

        This is the first.
        This is the second.
        This is the third.

:guilabel:`Assume print formatting`
    Assumes that every paragraph starts with an indent (either a tab or 2+ spaces). Paragraphs end when
    the next line that starts with an indent is reached::

            This is the
        first.
            This is the second.

            This is the
        third.
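
The indent-based rule can be sketched as follows. This is a minimal illustration under the assumptions stated in the description (a tab or two or more spaces starts a paragraph); the helper name ``split_print_paragraphs`` is hypothetical and the real implementation may differ.

```python
import re

def split_print_paragraphs(text):
    """Start a new paragraph at every line that begins with a tab or 2+ spaces."""
    paragraphs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no text
        if re.match(r"(\t| {2,})", line) or not paragraphs:
            # An indented line (or the very first line) opens a new paragraph.
            paragraphs.append(line.strip())
        else:
            # An unindented line continues the current paragraph.
            paragraphs[-1] += " " + line.strip()
    return paragraphs

sample = "    This is the\nfirst.\n    This is the second.\n\n    This is the\nthird.\n"
print(split_print_paragraphs(sample))
```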

:guilabel:`Process using markdown`
    |app| also supports running TXT input through a transformation preprocessor known as markdown. Markdown
    allows for basic formatting to be added to TXT documents, such as bold, italics, section headings, tables,
    lists, a Table of Contents, etc. Marking chapter headings with a leading # and setting the chapter XPath detection
    expression to "//h:h1" is the easiest way to have a proper table of contents generated from a TXT document.
    You can learn more about the markdown syntax at http://daringfireball.net/projects/markdown/syntax.
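
To see what an expression like "//h:h1" selects, the sketch below runs an equivalent query with Python's standard library. It assumes that the ``h`` prefix is bound to the XHTML namespace and that markdown turns a line such as ``# Chapter One`` into an ``<h1>`` element; the sample XHTML is invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical XHTML, roughly what markdown might produce from lines such as
# "# Chapter One" in the TXT input.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h1>Chapter One</h1><p>Some text.</p>"
    "<h1>Chapter Two</h1><p>More text.</p>"
    "</body></html>"
)

root = ET.fromstring(xhtml)
# ElementTree's limited XPath uses ".//" where full XPath writes "//".
chapters = [h1.text for h1 in root.findall(".//h:h1", {"h": XHTML_NS})]
print(chapters)
```

Each matched heading would become one chapter entry, so consistent use of a single leading # for chapter titles keeps the generated table of contents clean.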

Convert PDF documents
~~~~~~~~~~~~~~~~~~~~~~~~~~~

PDF documents are one of the worst formats to convert from. They are a fixed page size and text placement format,
which means it is very difficult to determine where one paragraph ends and another begins. |app| will try to unwrap
paragraphs using a configurable :guilabel:`Line Un-Wrapping Factor`. This is a scale used to determine the length at which a line should be unwrapped. Valid values are a decimal
between 0 and 1. The default is 0.5, which is the median line length. Lower this value to include more
text in the unwrapping. Increase it to include less.
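
One way to picture the factor is as a position in the sorted list of line lengths: 0.5 picks the median, and lines at least that long are joined with the line that follows. The sketch below follows that reading only; it is not the heuristic |app| actually implements, and the function names are hypothetical.

```python
def unwrap_length(lines, factor=0.5):
    """Return the line length found at the given position (0 to 1) in the sorted lengths."""
    lengths = sorted(len(line) for line in lines if line.strip())
    if not lengths:
        return 0
    index = min(int(len(lengths) * factor), len(lengths) - 1)
    return lengths[index]

def unwrap(lines, factor=0.5):
    """Join a line with the next one when it is at least as long as the unwrap length."""
    threshold = unwrap_length(lines, factor)
    paragraphs, current = [], ""
    for line in lines:
        text = line.strip()
        if not text:
            continue
        current = f"{current} {text}".strip()
        if len(text) < threshold:
            # A short line is taken to be the end of a paragraph.
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs

lines = [
    "It was a dark and stormy night and the",
    "rain fell in torrents, except at times",
    "when it was checked.",
    "Another paragraph begins here and runs",
    "on to a second line.",
]
for paragraph in unwrap(lines):
    print(paragraph)
```

In this sketch a lower factor gives a shorter threshold, so more lines count as wrapped and are joined, which matches the advice to lower the value to include more text in the unwrapping.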

Also, PDF documents often have headers and footers as part of the document that will become included with the text.
Use the options to remove headers and footers to mitigate this issue. If the headers and footers are not
removed from the text, they can throw off the paragraph unwrapping.

Some limitations of PDF input are that complex, multi-column, and image-based documents are not supported.
Extraction of vector images and tables from within the document is also not supported.