IGN:More conversion documentation

This commit is contained in:
Kovid Goyal 2009-10-27 13:06:24 -06:00
parent 10c9f4032a
commit 53f7cdec0b

@ -301,11 +301,17 @@ Removing headers and footers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These options are useful primarily for conversion of PDF documents. Often, the conversion leaves
behind page headers and footers in the text. These options use regular expressions to try to detect
the headers and footers and remove them. Remember that they operate on the intermediate XHTML produced
by the conversion pipeline. There is also a wizard to help you customize the regular expressions for
your document.
The header and footer regular expressions are used in conjunction with the remove header and footer options.
If the remove option is not enabled, the regular expression is not applied and the matched text is left in place.
The removal works by using a Python regular expression: all matched text is simply removed from
the document. You can learn more about regular expressions and their syntax at
http://docs.python.org/library/re.html.
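As a rough illustration of the behaviour described above, the following minimal sketch removes every match of a regular expression from a piece of XHTML, and does nothing when the remove option is off. It is not the code used by the conversion pipeline; the sample markup, the pattern and the ``remove_matches`` helper are all hypothetical.

```python
import re

# Hypothetical fragment of intermediate XHTML; the header text is invented.
xhtml = (
    "<p>Chapter text begins here.</p>\n"
    "<p>My Book Title | Page 12</p>\n"
    "<p>Chapter text continues.</p>\n"
)

# Assumed pattern: a whole paragraph holding a running header and a page number.
header_re = re.compile(r"<p>My Book Title \| Page \d+</p>\n?")


def remove_matches(text, pattern, enabled=True):
    """Remove every match of pattern, but only if the remove option is enabled."""
    if not enabled:
        return text
    return pattern.sub("", text)


cleaned = remove_matches(xhtml, header_re)
print(cleaned)
```

Anchoring the pattern on the surrounding ``<p>`` tags, as assumed here, keeps it from also deleting the same words where they appear in the body text.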
Miscellaneous
~~~~~~~~~~~~~~
@ -403,7 +409,9 @@ This will result in an automatically generated two level Table of Contents that
Format specific tips
----------------------
Here you will find tips specific to the conversion of particular formats. Options specific to a particular
format, whether input or output, are available in the conversion dialog under their own section, for example
`TXT Input` or `EPUB Output`.
Convert Microsoft Word documents
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -419,7 +427,57 @@ generating the Table of Contents much simpler. It is called BookCreator and is a
Convert TXT documents
~~~~~~~~~~~~~~~~~~~~~~
TXT documents have no well-defined way to specify formatting like bold, italics, etc., or document structure like paragraphs, headings, sections and so on.
Since TXT documents provide no way to explicitly mark parts of
the text, by default |app| only groups lines in the input document into paragraphs. The default is to assume that one or
more blank lines are a paragraph boundary::
This is the first.

This is the
second paragraph.
TXT input supports a number of options to control how paragraphs are detected.
:guilabel:`Treat each line as a paragraph`
Assumes that every line is a paragraph::
This is the first.
This is the second.
This is the third.
:guilabel:`Assume print formatting`
Assumes that every paragraph starts with an indent (either a tab or 2+ spaces). Paragraphs end when
the next line that starts with an indent is reached::
  This is the
first.
  This is the second.
  This is the
third.
:guilabel:`Process using markdown`
|app| also supports running TXT input through a transformation preprocessor known as Markdown. Markdown
allows for basic formatting to be added to TXT documents, such as bold, italics, section headings, tables,
lists, a Table of Contents, etc. Marking chapter headings with a leading # and setting the chapter XPath detection
expression to "//h:h1" is the easiest way to have a proper table of contents generated from a TXT document.
You can learn more about the Markdown syntax at http://daringfireball.net/projects/markdown/syntax.
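The three paragraph detection modes described above (blank lines, one line per paragraph, and print formatting) can be sketched as follows. This is only an illustration of the rules as stated in this section, not the code |app| actually uses; the function names are hypothetical.

```python
import re


def split_blank_lines(text):
    """Default mode: one or more blank lines mark a paragraph boundary."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Join the wrapped lines of each block into a single paragraph.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def split_single_line(text):
    """'Treat each line as a paragraph': every non-empty line is a paragraph."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_print_formatting(text):
    """'Assume print formatting': a tab or 2+ leading spaces starts a paragraph."""
    paragraphs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        starts_new = line.startswith("\t") or line.startswith("  ")
        if starts_new or not paragraphs:
            paragraphs.append(line.strip())
        else:
            # An unindented line continues the current paragraph.
            paragraphs[-1] += " " + line.strip()
    return paragraphs


print(split_blank_lines("This is the first.\n\nThis is the\nsecond paragraph."))
print(split_print_formatting("  This is the\nfirst.\n  This is the second."))
```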
Convert PDF documents
~~~~~~~~~~~~~~~~~~~~~~~~~~~
PDF documents are one of the worst formats to convert from. They are a fixed page size and text placement format,
which means it is very difficult to determine where one paragraph ends and another begins. |app| will try to unwrap
paragraphs using a configurable :guilabel:`Line Un-Wrapping Factor`. This is a scale used to determine the length at which a line should be unwrapped. Valid values are decimals
between 0 and 1. The default is 0.5, which corresponds to the median line length. Lower this value to include more
text in the unwrapping; increase it to include less.
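One way to picture the unwrapping factor is as a position in the sorted list of line lengths: lines at least as long as the length found at that position are assumed to have been wrapped and are joined to the line that follows. The sketch below works under that assumption only; it is not the algorithm |app| implements, and the function names and sample text are hypothetical.

```python
def unwrap_threshold(lines, factor=0.5):
    """Line length found at the given fraction of the sorted lengths (0.5 is roughly the median)."""
    lengths = sorted(len(line) for line in lines if line.strip())
    if not lengths:
        return 0
    index = min(int(len(lengths) * factor), len(lengths) - 1)
    return lengths[index]


def unwrap(lines, factor=0.5):
    """Join lines that reach the threshold length with the line that follows them."""
    threshold = unwrap_threshold(lines, factor)
    paragraphs, current = [], []
    for line in lines:
        if not line.strip():
            continue
        current.append(line.strip())
        if len(line) < threshold:
            # A short line is assumed to end the paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Invented sample of extracted PDF text, one list item per physical line.
lines = [
    "The quick brown fox jumps over the",
    "lazy dog and keeps running far",
    "away.",
    "A second paragraph starts here and",
    "ends soon.",
]
print(unwrap(lines))
```

In this sketch a lower factor gives a lower threshold, so more lines count as wrapped and get joined, which matches the direction described above.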
PDF documents also often have headers and footers as part of the document, and these end up included in the text.
Use the options to remove headers and footers to mitigate this issue. If the headers and footers are not
removed from the text, they can throw off the paragraph unwrapping.
PDF input has some limitations: complex, multi-column, and image-based documents are not supported.
Extraction of vector images and tables from within the document is also not supported.