recipes: lwn_weekly: improve table handling

The site uses table layout heavily, both for page formatting
and within article text. The recipe currently strips all tags
before and after the article text and linearizes whatever
tables remain in between, which also destroys the useful
tables often embedded in articles. A better approach is to
keep only the parts we are actually interested in,
PageHeadline (the article's title) and ArticleText, and not
to linearize tables within the ArticleText tag, thus
preserving useful tables.

Signed-off-by: Sergiy Kibrik <sakib@meta.ua>
Sergiy Kibrik 2014-11-11 10:10:24 +02:00 committed by Kovid Goyal
parent 29fd4d5b2e
commit c98eb806f5

@@ -30,8 +30,7 @@ class WeeklyLWN(BasicNewsRecipe):
     # masthead_url = 'http://lwn.net/images/lcorner.png'
     publication_type = 'magazine'
-    remove_tags_before = dict(attrs={'class':'PageHeadline'})
-    remove_tags_after = dict(attrs={'class':'ArticleText'})
+    keep_only_tags = [dict(attrs={'class':['PageHeadline','ArticleText']})]
     remove_tags = [dict(name=['h2', 'form'])]
     preprocess_regexps = [
@@ -40,7 +39,6 @@ class WeeklyLWN(BasicNewsRecipe):
     ]
     conversion_options = {
-        'linearize_tables' : True,
         'no_inline_navbars': True,
     }
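
The effect of selecting by class can be illustrated outside calibre. The following is a minimal sketch using only the Python standard library; it is not calibre's implementation, and the names `KeepOnly`, `keep_only` and `PAGE` (including its `Page` and `LeftColumn` classes) are hypothetical. It keeps only subtrees rooted at an element with one of the wanted classes, so a table nested inside ArticleText survives while the outer layout table is dropped. It assumes reasonably well-nested markup.

```python
from html import escape
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class KeepOnly(HTMLParser):
    """Keep only subtrees rooted at an element with a wanted class."""

    def __init__(self, classes):
        super().__init__(convert_charrefs=True)
        self.classes = set(classes)
        self.depth = 0   # nesting depth inside a kept subtree (0 = outside)
        self.out = []

    def handle_starttag(self, tag, attrs):
        found = set((dict(attrs).get("class") or "").split())
        if self.depth or (self.classes & found):
            self.out.append(self.get_starttag_text())
            if tag not in VOID:
                self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied only inside a kept subtree.
        if self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID:
            self.out.append("</%s>" % tag)
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.out.append(escape(data, quote=False))


def keep_only(page, classes):
    parser = KeepOnly(classes)
    parser.feed(page)
    parser.close()
    return "".join(parser.out)


# Hypothetical page: a layout table wrapping the headline and the article,
# with a useful data table embedded in the article text.
PAGE = (
    '<table class="Page"><tr><td class="LeftColumn">nav</td><td>'
    '<div class="PageHeadline"><h1>Title</h1></div>'
    '<div class="ArticleText"><p>Body</p>'
    '<table><tr><td>1</td></tr></table></div>'
    '</td></tr></table>'
)

kept = keep_only(PAGE, ["PageHeadline", "ArticleText"])
print(kept)
```

With this selection there is nothing left of the layout table to linearize, which is consistent with the commit dropping 'linearize_tables' from conversion_options.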