mirror of
https://github.com/kovidgoyal/calibre.git
synced 2025-07-09 03:04:10 -04:00
IGN:More conversion documentation
parent
10c9f4032a
commit
53f7cdec0b
@ -301,11 +301,17 @@ Removing headers and footers
|
|||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
These options are useful primarily for conversion of PDF documents. Often, the conversion leaves
|
These options are useful primarily for conversion of PDF documents. Often, the conversion leaves
|
||||||
behing page headers and footers in the text. These options use regular expressions to try and detect
|
behind page headers and footers in the text. These options use regular expressions to try and detect
|
||||||
the headers and footers and remove them. Remember that they operate on the intermediate XHTML produced
|
the headers and footers and remove them. Remember that they operate on the intermediate XHTML produced
|
||||||
by the conversion pipeline. There is also a wizard to help you customize the regular expressions for
|
by the conversion pipeline. There is also a wizard to help you customize the regular expressions for
|
||||||
your document.
|
your document.
|
||||||
|
|
||||||
|
The header and footer regular expressions are used in conjunction with the remove header and footer options.
|
||||||
|
If the remove option is not enabled, the regular expression is not applied and the matched text stays in the document.
|
||||||
|
The removal works by using a Python regular expression. All matched text is simply removed from
|
||||||
|
the document. You can learn more about regular expressions and their syntax at
|
||||||
|
http://docs.python.org/library/re.html.
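As a rough illustration of the removal described above, the following sketch applies two regular expressions to a fragment of intermediate XHTML with Python's ``re`` module. The sample markup and both patterns are made up for this example; they are not |app|'s defaults, and a real document needs expressions tailored to its own headers and footers.

```python
import re

# Hypothetical intermediate XHTML with a running header and a page-number footer.
xhtml = (
    "<p>My Book Title</p>\n"
    "<p>It was a dark and stormy night.</p>\n"
    "<p>Page 12</p>\n"
)

# Illustrative patterns only (assumptions, not |app| defaults).
header_re = re.compile(r"<p>My Book Title</p>\n?")
footer_re = re.compile(r"<p>Page \d+</p>\n?")

# Everything a pattern matches is replaced with the empty string.
cleaned = footer_re.sub("", header_re.sub("", xhtml))
print(cleaned)
```

Because all matched text is deleted, a pattern that is too broad also deletes body text, so it is worth testing an expression on a sample before converting.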
|
||||||
|
|
||||||
Miscellaneous
|
Miscellaneous
|
||||||
~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~
|
||||||
|
|
||||||
@ -403,7 +409,9 @@ This will result in an automatically generated two level Table of Contents that
|
|||||||
Format specific tips
|
Format specific tips
|
||||||
----------------------
|
----------------------
|
||||||
|
|
||||||
Here you will find tips specific to the conversion of particular formats.
|
Here you will find tips specific to the conversion of particular formats. Options specific to a particular
|
||||||
|
format, whether input or output, are available in the conversion dialog under their own section, for example
|
||||||
|
`TXT Input` or `EPUB Output`.
|
||||||
|
|
||||||
Convert Microsoft Word documents
|
Convert Microsoft Word documents
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
@ -419,7 +427,57 @@ generating the Table of Contents much simpler. It is called BookCreator and is a
|
|||||||
Convert TXT documents
|
Convert TXT documents
|
||||||
~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
TXT documents have no well-defined way to specify formatting like bold, italics, etc., or document structure like paragraphs, headings, sections and so on.
|
||||||
|
Since TXT documents provide no way to explicitly mark parts of
|
||||||
|
the text, by default |app| only groups lines in the input document into paragraphs. The default is to assume one or
|
||||||
|
more blank lines are a paragraph boundary::
|
||||||
|
|
||||||
|
This is the first.
|
||||||
|
|
||||||
|
This is the
|
||||||
|
second paragraph.
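The default rule above can be sketched in a few lines of Python. This is only an illustration of "one or more blank lines are a paragraph boundary", not |app|'s actual code; the function name ``split_paragraphs`` is hypothetical.

```python
import re

def split_paragraphs(text):
    """Group lines into paragraphs; one or more blank lines mark a boundary."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Join the wrapped lines inside each block with single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]

sample = "This is the first.\n\nThis is the\nsecond paragraph.\n"
print(split_paragraphs(sample))
```

For the sample text this yields two paragraphs, with the wrapped lines of the second joined into one.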
|
||||||
|
|
||||||
|
TXT input supports a number of options to control how paragraphs are detected.
|
||||||
|
|
||||||
|
:guilabel:`Treat each line as a paragraph`
|
||||||
|
Assumes that every line is a paragraph::
|
||||||
|
|
||||||
|
This is the first.
|
||||||
|
This is the second.
|
||||||
|
This is the third.
|
||||||
|
|
||||||
|
:guilabel:`Assume print formatting`
|
||||||
|
Assumes that every paragraph starts with an indent (either a tab or 2+ spaces). Paragraphs end when
|
||||||
|
the next line that starts with an indent is reached::
|
||||||
|
|
||||||
|
This is the
|
||||||
|
first.
|
||||||
|
This is the second.
|
||||||
|
|
||||||
|
This is the
|
||||||
|
third.
|
||||||
|
|
||||||
|
:guilabel:`Process using markdown`
|
||||||
|
|app| also supports running TXT input through a transformation preprocessor known as markdown. Markdown
|
||||||
|
allows for basic formatting to be added to TXT documents, such as bold, italics, section headings, tables,
|
||||||
|
lists, a Table of Contents, etc. Marking chapter headings with a leading # and setting the chapter XPath detection
|
||||||
|
expression to "//h:h1" is the easiest way to have a proper table of contents generated from a TXT document.
|
||||||
|
You can learn more about the markdown syntax at http://daringfireball.net/projects/markdown/syntax.
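The two line-based options above, :guilabel:`Treat each line as a paragraph` and :guilabel:`Assume print formatting`, can be sketched as follows. This is a minimal illustration of the rules as described, not |app|'s implementation; the function names and the ``INDENT`` pattern are assumptions made for the example.

```python
import re

# An indent is a tab or two or more spaces at the start of a line.
INDENT = re.compile(r"^(\t| {2,})")

def paragraphs_per_line(text):
    """Treat every non-blank line as its own paragraph."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def paragraphs_print_formatting(text):
    """Start a new paragraph at each indented line; other lines continue it."""
    paragraphs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no meaning in this mode
        if INDENT.match(line) or not paragraphs:
            paragraphs.append(line.strip())
        else:
            paragraphs[-1] += " " + line.strip()
    return paragraphs

sample = "  This is the\nfirst.\n  This is the second.\n\n  This is the\nthird.\n"
print(paragraphs_print_formatting(sample))
```

With the indented sample, the print-formatting rule produces three paragraphs, while the per-line rule would produce five.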
|
||||||
|
|
||||||
|
|
||||||
Convert PDF documents
|
Convert PDF documents
|
||||||
~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
PDF documents are one of the worst formats to convert from. They use a fixed page size and fixed text placement.
|
||||||
|
This means it is very difficult to determine where one paragraph ends and another begins. |app| will try to unwrap
|
||||||
|
paragraphs using a configurable :guilabel:`Line Un-Wrapping Factor`. This is a scale used to determine the length at which a line should be unwrapped. Valid values are a decimal
|
||||||
|
between 0 and 1. The default is 0.5, which corresponds to the median line length. Lower this value to include more
|
||||||
|
text in the unwrapping. Increase it to include less.
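One way to picture the factor is as a percentile of the line lengths in the document: 0.5 picks the median length, and any line at least that long is assumed to have been wrapped and is joined to the line that follows. The sketch below works under that assumption only; it is not |app|'s unwrapping code, and the names ``unwrap_threshold`` and ``unwrap`` are hypothetical.

```python
def unwrap_threshold(lines, factor=0.5):
    """Return the line length found at the given percentile (0 to 1)."""
    lengths = sorted(len(line.strip()) for line in lines if line.strip())
    if not lengths:
        return 0
    index = min(int(len(lengths) * factor), len(lengths) - 1)
    return lengths[index]

def unwrap(lines, factor=0.5):
    """Join lines into paragraphs; a line shorter than the threshold ends one."""
    threshold = unwrap_threshold(lines, factor)
    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        current.append(stripped)
        if len(stripped) < threshold:
            # A short line is taken to be the last line of a paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

lines = [
    "It was a dark and stormy night and the",
    "rain fell in torrents except at",
    "intervals.",
    "The second paragraph is also wrapped",
    "across lines.",
]
for paragraph in unwrap(lines):
    print(paragraph)
```

In this sketch, lowering the factor lowers the threshold, so more lines count as wrapped and more text is joined; raising it has the opposite effect, which matches the behaviour described above.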
|
||||||
|
|
||||||
|
PDF documents also often have headers and footers that end up included in the converted text.
|
||||||
|
Use the options to remove headers and footers to mitigate this issue. If the headers and footers are not
|
||||||
|
removed from the text, they can throw off the paragraph unwrapping.
|
||||||
|
|
||||||
|
PDF input has some limitations: complex, multi-column, and image-based documents are not supported.
|
||||||
|
Extraction of vector images and tables from within the document is also not supported.
|
||||||
|
|
||||||
|