recipes: lwn_weekly: improve table handling

The site uses table layout heavily, both for page
formatting and within article text, yet the recipe
strips all tags before and after the article text and
linearizes whatever tables are left in between, which
also destroys the useful tables often embedded in
articles. A better approach seems to be to keep only
the parts we are actually interested in, PageHeadline
(the article's title) and ArticleText, and to not
linearize tables within the ArticleText tag, thus
preserving useful tables.

Signed-off-by: Sergiy Kibrik <sakib@meta.ua>
Sergiy Kibrik 2014-11-11 10:10:24 +02:00 committed by Kovid Goyal
parent 29fd4d5b2e
commit c98eb806f5

@@ -30,8 +30,7 @@ class WeeklyLWN(BasicNewsRecipe):
     # masthead_url = 'http://lwn.net/images/lcorner.png'
     publication_type = 'magazine'
-    remove_tags_before = dict(attrs={'class':'PageHeadline'})
-    remove_tags_after = dict(attrs={'class':'ArticleText'})
+    keep_only_tags = [dict(attrs={'class':['PageHeadline','ArticleText']})]
     remove_tags = [dict(name=['h2', 'form'])]
     preprocess_regexps = [
@@ -40,7 +39,6 @@ class WeeklyLWN(BasicNewsRecipe):
     ]
     conversion_options = {
-        'linearize_tables' : True,
         'no_inline_navbars': True,
     }
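
Note: the selection described in the commit message can be illustrated with a small standalone sketch. This is not calibre's implementation; the `matches` and `keep_only` helpers, the page model, and the `Page` and `Footer` class names are all hypothetical, and the matching rules are a simplification. It only shows the idea that elements whose class is PageHeadline or ArticleText survive while the surrounding layout is dropped.

```python
# Simplified, hypothetical model of tag filtering; NOT calibre's actual code.
# Only the spec below is taken from the recipe.
KEEP_ONLY_TAGS = [dict(attrs={'class': ['PageHeadline', 'ArticleText']})]


def matches(spec, name, attrs):
    """Return True if an element (tag name + attribute dict) satisfies spec."""
    wanted_names = spec.get('name')
    if wanted_names is not None:
        if isinstance(wanted_names, str):
            wanted_names = [wanted_names]
        if name not in wanted_names:
            return False
    for key, wanted in spec.get('attrs', {}).items():
        if isinstance(wanted, str):
            wanted = [wanted]
        # Assumption: treat attribute values as space-separated tokens,
        # which is how HTML class attributes behave.
        actual = attrs.get(key, '').split()
        if not any(value in actual for value in wanted):
            return False
    return True


def keep_only(elements, specs):
    """Keep the elements that match at least one spec; drop the rest."""
    return [el for el in elements
            if any(matches(spec, el[0], el[1]) for spec in specs)]


# Hypothetical flat model of a page: (tag name, attributes, description).
page = [
    ('table', {'class': 'Page'}, 'layout table'),
    ('h1', {'class': 'PageHeadline'}, 'article title'),
    ('div', {'class': 'ArticleText'}, 'article body with embedded tables'),
    ('table', {'class': 'Footer'}, 'footer'),
]

kept = keep_only(page, KEEP_ONLY_TAGS)
print([description for _, _, description in kept])
```

In this model the article body is kept as a whole, so any tables nested inside it are left untouched, which is the effect the commit aims for by also dropping 'linearize_tables'.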