doc tweaks, delete empty paragraphs during Heuristics

ldolse 2011-01-24 12:20:50 +08:00
parent a53f1148c2
commit 6e64f5ec4e
2 changed files with 6 additions and 4 deletions


@@ -367,6 +367,8 @@ class HeuristicProcessor(object):
 html = re.sub(ur'\s*<o:p>\s*</o:p>', ' ', html)
 # Delete microsoft 'smart' tags
 html = re.sub('(?i)</?st1:\w+>', '', html)
+# Delete self closing paragraph tags
+html = re.sub('<p\s?/>', '', html)
 # Get rid of empty span, bold, font, em, & italics tags
 html = re.sub(r"\s*<span[^>]*>\s*(<span[^>]*>\s*</span>){0,2}\s*</span>\s*", " ", html)
 html = re.sub(r"\s*<(font|[ibu]|em)[^>]*>\s*(<(font|[ibu]|em)[^>]*>\s*</(font|[ibu]|em)>\s*){0,2}\s*</(font|[ibu]|em)>", " ", html)
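The self-closing paragraph substitution added in this hunk can be tried in isolation. The following is a minimal sketch, not the project's code: the helper name `strip_self_closing_paragraphs` and the sample string are hypothetical, and it uses a Python 3 raw string, whereas the commit applies the pattern inline in Python 2 code.

```python
import re

# Same pattern as the added line, written as a raw string.
# It matches "<p/>" and "<p />" (at most one whitespace character).
SELF_CLOSING_P = re.compile(r'<p\s?/>')


def strip_self_closing_paragraphs(html):
    """Hypothetical helper: remove bare self-closing <p/> tags only."""
    return SELF_CLOSING_P.sub('', html)


# Illustrative input, not taken from the commit.
sample = '<p>First</p><p/><p />\n<p class="x"/><p>Second</p>'
cleaned = strip_self_closing_paragraphs(sample)
print(cleaned)
```

Because the pattern allows at most one whitespace character and no attributes, tags such as `<p class="x"/>` or `<p  />` (two spaces) are left untouched in this sketch.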


@@ -603,7 +603,7 @@ TXT input supports a number of options to differentiate how paragraphs are detec
 formatting will be applied.
 :guilabel:`Formatting Style: Heuristic`
-Analyses the document for common chapter headings, scene breaks, and italicized words and applies the
+Analyzes the document for common chapter headings, scene breaks, and italicized words and applies the
 appropriate html markup during conversion.
 :guilabel:`Formatting Style: Markdown`