Conversion pipeline: Strip out large blocks of contiguous whitespace (10000 or more consecutive whitespace characters), as these slow down the conversion process and almost always indicate an error in the input document.

Kovid Goyal 2011-07-18 10:53:31 -06:00
parent 08f5775f65
commit 59d9e15580

@@ -303,6 +303,9 @@ class CSSPreProcessor(object):
 class HTMLPreProcessor(object):
     PREPROCESS = [
+        # Remove huge block of contiguous spaces as they slow down
+        # the following regexes pretty badly
+        (re.compile(r'\s{10000,}'), lambda m: ''),
         # Some idiotic HTML generators (Frontpage I'm looking at you)
         # Put all sorts of crap into <head>. This messes up lxml
         (re.compile(r'<head[^>]*>\n*(.*?)\n*</head>', re.IGNORECASE|re.DOTALL),
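
To illustrate what the new rule does, here is a minimal sketch that applies the same pattern and replacement outside of calibre. The `apply_rules` helper and the sample strings are hypothetical and exist only for this illustration; how `HTMLPreProcessor` actually walks its `PREPROCESS` list is not shown in this diff.

```python
import re

# Same (pattern, replacement) pair as the entry added in this commit.
STRIP_HUGE_WHITESPACE = (re.compile(r'\s{10000,}'), lambda m: '')


def apply_rules(html, rules):
    # Hypothetical helper: apply each (compiled pattern, replacement) pair in order.
    for pattern, repl in rules:
        html = pattern.sub(repl, html)
    return html


# 9999 spaces: one short of the threshold, so the text is left alone.
short_run = 'a' + ' ' * 9999 + 'b'
# Exactly 10000 spaces: the whole run is removed.
long_run = 'a' + ' ' * 10000 + 'b'
# \s matches any whitespace, so mixed spaces, newlines and tabs count too
# (12000 whitespace characters here).
mixed_run = 'a' + ' \n\t' * 4000 + 'b'

result_short = apply_rules(short_run, [STRIP_HUGE_WHITESPACE])
result_long = apply_rules(long_run, [STRIP_HUGE_WHITESPACE])
result_mixed = apply_rules(mixed_run, [STRIP_HUGE_WHITESPACE])

print(len(result_short))  # length is unchanged
print(result_long)
print(result_mixed)
```

Note that the replacement is the empty string, so the text on either side of a removed run ends up joined with no separator. The threshold is inclusive: a run of exactly 10000 whitespace characters is stripped, while 9999 is kept.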