doc tweaks, delete empty paragraphs during Heuristics

ldolse 2011-01-24 12:20:50 +08:00
parent a53f1148c2
commit 6e64f5ec4e
2 changed files with 6 additions and 4 deletions

@@ -367,6 +367,8 @@ class HeuristicProcessor(object):
html = re.sub(ur'\s*<o:p>\s*</o:p>', ' ', html)
# Delete microsoft 'smart' tags
html = re.sub('(?i)</?st1:\w+>', '', html)
+ # Delete self closing paragraph tags
+ html = re.sub('<p\s?/>', '', html)
# Get rid of empty span, bold, font, em, & italics tags
html = re.sub(r"\s*<span[^>]*>\s*(<span[^>]*>\s*</span>){0,2}\s*</span>\s*", " ", html)
html = re.sub(r"\s*<(font|[ibu]|em)[^>]*>\s*(<(font|[ibu]|em)[^>]*>\s*</(font|[ibu]|em)>\s*){0,2}\s*</(font|[ibu]|em)>", " ", html)
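
Note on the new substitution: the sketch below isolates the `<p\s?/>` pattern from this hunk so its behaviour can be checked outside `HeuristicProcessor`. The helper name `strip_self_closing_paragraphs` and the sample string are assumptions made for illustration; they are not part of the commit, and the sketch uses a raw string rather than the plain literal in the diff.

```python
import re

# Hypothetical helper, not part of the commit: it wraps the same pattern
# the hunk adds so the substitution can be exercised in isolation.
SELF_CLOSING_P = re.compile(r'<p\s?/>')


def strip_self_closing_paragraphs(html):
    """Remove <p/> and <p /> tags, leaving all other markup untouched."""
    return SELF_CLOSING_P.sub('', html)


# Illustrative input: two bare self-closing tags, one with an attribute.
sample = '<p>One</p><p/><p />\n<p class="x"/><p>Two</p>'
cleaned = strip_self_closing_paragraphs(sample)
print(cleaned)
```

As written, the pattern is case-sensitive and allows at most one whitespace character before the slash, so `<P/>`, `<p  />` (two spaces), and self-closing tags that carry attributes such as `<p class="x"/>` pass through unchanged.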

@@ -603,7 +603,7 @@ TXT input supports a number of options to differentiate how paragraphs are detec
formatting will be applied.
:guilabel:`Formatting Style: Heuristic`
- Analyses the document for common chapter headings, scene breaks, and italicized words and applies the
+ Analyzes the document for common chapter headings, scene breaks, and italicized words and applies the
appropriate html markup during conversion.
:guilabel:`Formatting Style: Markdown`