diff --git a/src/calibre/manual/conversion.rst b/src/calibre/manual/conversion.rst
index 1f23d43419..6124d24aa2 100644
--- a/src/calibre/manual/conversion.rst
+++ b/src/calibre/manual/conversion.rst
@@ -301,11 +301,17 @@ Removing headers and footers
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 These options are useful primarily for conversion of PDF documents. Often, the conversion leaves
-behing page headers and footers in the text. These options use regular expressions to try and detect
+behind page headers and footers in the text. These options use regular expressions to try and detect
 the headers and footers and remove them. Remember that they operate on the intermediate XHTML produced
 by the conversion pipeline. There is also a wizard to help you customize the regular expressions for
 your document.
 
+The header and footer regular expressions are used in conjunction with the remove header and footer options.
+If the remove option is not enabled, the regular expression will not be applied to remove the matched text.
+The removal works by using a Python regular expression. All matched text is simply removed from
+the document. You can learn more about regular expressions and their syntax at
+http://docs.python.org/library/re.html.
+
 Miscellaneous
 ~~~~~~~~~~~~~~
 
@@ -403,7 +409,9 @@ This will result in an automatically generated two level Table of Contents that
 Format specific tips
 ----------------------
 
-Here you will find tips specific to the conversion of particular formats.
+Here you will find tips specific to the conversion of particular formats. Options specific to a particular
+format, whether input or output, are available in the conversion dialog under their own section, for example
+`TXT Input` or `EPUB Output`.
 
 Convert Microsoft Word documents
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -419,7 +427,57 @@ generating the Table of Contents much simpler. It is called BookCreator and is a
 Convert TXT documents
 ~~~~~~~~~~~~~~~~~~~~~~
 
+TXT documents have no well-defined way to specify formatting like bold, italics, etc., or document structure like paragraphs, headings, sections and so on.
+Since TXT documents provide no way to explicitly mark parts of
+the text, by default |app| only groups lines in the input document into paragraphs. The default is to assume one or
+more blank lines are a paragraph boundary::
+
+    This is the first.
+
+    This is the
+    second paragraph.
+
+TXT input supports a number of options to differentiate how paragraphs are detected.
+
+    :guilabel:`Treat each line as a paragraph`
+        Assumes that every line is a paragraph::
+
+            This is the first.
+            This is the second.
+            This is the third.
+
+    :guilabel:`Assume print formatting`
+        Assumes that every paragraph starts with an indent (either a tab or 2+ spaces). Paragraphs end when
+        the next line that starts with an indent is reached::
+
+              This is the
+            first.
+              This is the second.
+
+              This is the
+            third.
+
+    :guilabel:`Process using markdown`
+        |app| also supports running TXT input through a transformation preprocessor known as Markdown. Markdown
+        allows for basic formatting to be added to TXT documents, such as bold, italics, section headings, tables,
+        lists, a Table of Contents, etc. Marking chapter headings with a leading # and setting the chapter XPath detection
+        expression to "//h:h1" is the easiest way to have a proper table of contents generated from a TXT document.
+        You can learn more about the Markdown syntax at http://daringfireball.net/projects/markdown/syntax.
+
+
 Convert PDF documents
-~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+PDF documents are one of the worst formats to convert from. They are a fixed page size and text placement format,
+which makes it very difficult to determine where one paragraph ends and another begins. |app| will try to unwrap
+paragraphs using a configurable :guilabel:`Line Un-Wrapping Factor`. This is a scale used to determine the length at which a line should be unwrapped. Valid values are decimals
+between 0 and 1. The default is 0.5, which corresponds to the median line length. Lower this value to include more
+text in the unwrapping. Increase it to include less.
+
+PDF documents also often have headers and footers as part of the document that end up included in the text.
+Use the options to remove headers and footers to mitigate this issue. If the headers and footers are not
+removed from the text, they can throw off the paragraph unwrapping.
+
+PDF input has some limitations: complex, multi-column, and image-based documents are not supported.
+Extraction of vector images and tables from within the document is also not supported.
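
Note on the header and footer removal text added in the first hunk: the behaviour it describes (every match of a Python regular expression is deleted from the intermediate XHTML, and nothing happens unless the remove option is enabled) can be illustrated with a minimal sketch. This is not calibre's actual implementation; the function name `remove_matches`, the `enabled` flag, the sample XHTML, and the footer pattern are all hypothetical and exist only for illustration.

```python
import re


def remove_matches(xhtml: str, pattern: str, enabled: bool = True) -> str:
    """Delete every match of `pattern` from `xhtml`.

    Hypothetical helper: `enabled` stands in for the "remove header" /
    "remove footer" option; when it is off, the text is returned unchanged.
    """
    if not enabled:
        return xhtml
    return re.sub(pattern, "", xhtml)


# Made-up intermediate XHTML with a footer repeated on every page.
sample = (
    "<p>First page text.</p><p>Page 1 of 20</p>"
    "<p>Second page text.</p><p>Page 2 of 20</p>"
)

# Assumed footer pattern for this sample only.
footer_regex = r"<p>Page \d+ of \d+</p>"

cleaned = remove_matches(sample, footer_regex)
untouched = remove_matches(sample, footer_regex, enabled=False)
```

Because the expression runs against the XHTML rather than the visible text, a pattern that includes the surrounding tags, as here, avoids leaving empty paragraphs behind.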