as ADE cannot handle them."
+ tickets: [794427]
+
+ improved recipes:
+ - Le Temps
+ - Perfil
+ - Financial Times UK
+
+ new recipes:
+ - title: "Daytona Beach Journal"
+ author: BRGriff
+
+ - title: "El club del ebook and Frontline"
+ author: Darko Miletic
+
+
+- version: 0.8.6
+ date: 2011-06-17
+
+ new features:
+ - title: "Builtin support for downloading and installing/updating calibre plugins. Go to Preferences->Plugins and click 'Get new plugins'"
+ description: "When updates for installed plugins are available, calibre will automatically (unobtrusively) notify you"
+ type: major
+
+  - title: "Metadata download configuration: Allow defining a set of 'default' fields for metadata download and quickly switching to/from them"
+
+ - title: "Allow clicking on the news category in the Tag Browser to display all downloaded periodicals"
+
+ - title: "Driver for the Libre Air"
+
+  - title: "Email sending: Allow the user to stop email jobs (note that stopping may not actually prevent the email from being sent, depending on when the stop happens). Also automatically abort email sending if it takes longer than 15 minutes."
+ tickets: [795960]
+
+ bug fixes:
+  - title: "MOBI Output: Allow setting of background color on tables. Also set the border attribute on the table if the table has any border related CSS defined."
+ tickets: [797580]
+
+  - title: "Nook TSR: Put news sent to the device in My Files/Newspapers instead of My Files/Books."
+ tickets: [796674]
+
+ - title: "MOBI Output: Fix a bug where linking to the very first element in an HTML file could sometimes result in the link pointing to the last element in the previous file."
+ tickets: [797214]
+
+ - title: "CSV catalog: Convert HTML comments to plain text"
+
+ - title: "HTML Input: Ignore links to text files."
+ tickets: [791568]
+
+ - title: "EPUB Output: Change orphaned tags to as they cause ADE to crash."
+
+ - title: "Fix 'Stop selected jobs' button trying to stop the same job multiple times"
+
+ - title: "Database: Explicitly test for case sensitivity on OS X instead of assuming a case insensitive filesystem."
+ tickets: [796258]
+
+ - title: "Get Books: More fixes to the Amazon store plugin"
+
+ - title: "FB2 Input: Do not specify font families/background colors"
+
+
+ improved recipes:
+ - Philadelphia Inquirer
+  - Macleans Magazine
+ - Metro UK
+
+ new recipes:
+ - title: "Christian Post, Down To Earth and Words Without Borders"
+ author: sexymax15
+
+ - title: "Noticias R7"
+ author: Diniz Bortolotto
+
+ - title: "UK Daily Mirror"
+ author: Dave Asbury
+
+ - title: "New Musical Express Magazine"
+ author: scissors
+
+
+- version: 0.8.5
+ date: 2011-06-10
+
+ new features:
+ - title: "A new 'portable' calibre build, useful if you like to carry around calibre and its library on a USB key"
+ type: major
+ description: "For details, see: http://calibre-ebook.com/download_portable"
+
+ - title: "E-book viewer: Remember the last used font size multiplier."
+ tickets: [774343]
+
+ - title: "Preliminary support for the Kobo Touch. Drivers for the ZTE v9 tablet, Samsung S2, Notion Ink Adam and PocketBook 360+"
+
+  - title: "When downloading metadata, merge rather than replace tags"
+
+  - title: "Edit metadata dialog: When pasting in an ISBN, if no valid ISBN is present on the clipboard, popup a box for the user to enter the ISBN"
+
+ - title: "Windows build: Add code to load .pyd python extensions from a zip file. This allows many more files in the calibre installation to be zipped up, speeding up the installer."
+ - title: "Add an action to remove all formats from the selected books to the remove books button"
+
+
+ bug fixes:
+ - title: "Various minor bug fixes to the column coloring code"
+
+ - title: "Fix the not() template function"
+
+ - title: "Nook Color/TSR: When sending books to the storage card place them in the My Files/Books subdirectory. Also do not upload cover thumbnails as users report that the NC/TSR don't use them."
+ tickets: [792842]
+
+ - title: "Get Books: Update plugins for Amazon and B&N stores to handle website changes. Enable some stores by default on first run. Add Zixo store"
+ tickets: [792762]
+
+  - title: "Comic Input: Replace the # character in filenames as it can cause problems with conversion/viewing."
+ tickets: [792723]
+
+ - title: "When writing files to zipfile, reset timestamp if it doesn't fit in 1980's vintage storage structures"
+
+  - title: "Amazon metadata plugin: Fix parsing of the published date from amazon.de when it contains 'Februar'"
+
+ improved recipes:
+ - Ambito
+ - GoComics
+ - Le Monde Diplomatique
+ - Max Planck
+ - express.de
+
+ new recipes:
+ - title: Ambito Financiero
+ author: Darko Miletic
+
+  - title: Stiinta si Tehnica
+ author: Silviu Cotoara
+
+ - title: "Metro News NL"
+ author: DrMerry
+
+ - title: "Brigitte.de, Polizeipresse DE and Heise Online"
+ author: schuster
+
+
+
- version: 0.8.4
date: 2011-06-03
diff --git a/recipes/ambito.recipe b/recipes/ambito.recipe
index dd92ee19b3..55a532bb9e 100644
--- a/recipes/ambito.recipe
+++ b/recipes/ambito.recipe
@@ -1,7 +1,5 @@
-#!/usr/bin/env python
-
__license__ = 'GPL v3'
-__copyright__ = '2008-2009, Darko Miletic '
+__copyright__ = '2008-2011, Darko Miletic '
'''
ambito.com
'''
@@ -11,51 +9,56 @@ from calibre.web.feeds.news import BasicNewsRecipe
class Ambito(BasicNewsRecipe):
title = 'Ambito.com'
__author__ = 'Darko Miletic'
- description = 'Informacion Libre las 24 horas'
- publisher = 'Ambito.com'
- category = 'news, politics, Argentina'
+ description = 'Ambito.com con noticias del Diario Ambito Financiero de Buenos Aires'
+ publisher = 'Editorial Nefir S.A.'
+ category = 'news, politics, economy, finances, Argentina'
oldest_article = 2
- max_articles_per_feed = 100
no_stylesheets = True
- encoding = 'iso-8859-1'
- cover_url = 'http://www.ambito.com/img/logo_.jpg'
- remove_javascript = True
+ encoding = 'cp1252'
+ masthead_url = 'http://www.ambito.com/img/logo_.jpg'
use_embedded_content = False
+ language = 'es_AR'
+ publication_type = 'newsportal'
+ extra_css = """
+ body{font-family: "Trebuchet MS",Verdana,sans-serif}
+ .volanta{font-size: small}
+ .t2_portada{font-size: xx-large; font-family: Georgia,serif; color: #026698}
+ """
- html2lrf_options = [
- '--comment', description
- , '--category', category
- , '--publisher', publisher
- ]
- html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"'
+ conversion_options = {
+ 'comment' : description
+ , 'tags' : category
+ , 'publisher' : publisher
+ , 'language' : language
+ }
keep_only_tags = [dict(name='div', attrs={'align':'justify'})]
-
- remove_tags = [dict(name=['object','link'])]
+ remove_tags = [dict(name=['object','link','embed','iframe','meta','link','table','img'])]
+ remove_attributes = ['align']
feeds = [
(u'Principales Noticias', u'http://www.ambito.com/rss/noticiasp.asp' )
,(u'Economia' , u'http://www.ambito.com/rss/noticias.asp?S=Econom%EDa' )
,(u'Politica' , u'http://www.ambito.com/rss/noticias.asp?S=Pol%EDtica' )
,(u'Informacion General' , u'http://www.ambito.com/rss/noticias.asp?S=Informaci%F3n%20General')
- ,(u'Agro' , u'http://www.ambito.com/rss/noticias.asp?S=Agro' )
+ ,(u'Campo' , u'http://www.ambito.com/rss/noticias.asp?S=Agro' )
,(u'Internacionales' , u'http://www.ambito.com/rss/noticias.asp?S=Internacionales' )
,(u'Deportes' , u'http://www.ambito.com/rss/noticias.asp?S=Deportes' )
,(u'Espectaculos' , u'http://www.ambito.com/rss/noticias.asp?S=Espect%E1culos' )
- ,(u'Tecnologia' , u'http://www.ambito.com/rss/noticias.asp?S=Tecnologia' )
- ,(u'Salud' , u'http://www.ambito.com/rss/noticias.asp?S=Salud' )
+ ,(u'Tecnologia' , u'http://www.ambito.com/rss/noticias.asp?S=Tecnolog%EDa' )
,(u'Ambito Nacional' , u'http://www.ambito.com/rss/noticias.asp?S=Ambito%20Nacional' )
]
def print_version(self, url):
- return url.replace('http://www.ambito.com/noticia.asp?','http://www.ambito.com/noticias/imprimir.asp?')
+ return url.replace('/noticia.asp?','/noticias/imprimir.asp?')
def preprocess_html(self, soup):
- mtag = ''
- soup.head.insert(0,mtag)
for item in soup.findAll(style=True):
del item['style']
+ for item in soup.findAll('a'):
+ str = item.string
+ if str is None:
+ str = self.tag_to_string(item)
+ item.replaceWith(str)
return soup
-
- language = 'es_AR'
diff --git a/recipes/ambito_financiero.recipe b/recipes/ambito_financiero.recipe
new file mode 100644
index 0000000000..08c056e8ee
--- /dev/null
+++ b/recipes/ambito_financiero.recipe
@@ -0,0 +1,87 @@
+__license__ = 'GPL v3'
+__copyright__ = '2011, Darko Miletic '
+'''
+ambito.com/diario
+'''
+
+import time
+from calibre import strftime
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class Ambito_Financiero(BasicNewsRecipe):
+ title = 'Ambito Financiero'
+ __author__ = 'Darko Miletic'
+ description = 'Informacion Libre las 24 horas'
+ publisher = 'Editorial Nefir S.A.'
+ category = 'news, politics, economy, Argentina'
+ no_stylesheets = True
+ encoding = 'cp1252'
+ masthead_url = 'http://www.ambito.com/diario/img/logo_af.gif'
+ publication_type = 'newspaper'
+ needs_subscription = 'optional'
+ use_embedded_content = False
+ language = 'es_AR'
+ PREFIX = 'http://www.ambito.com'
+ INDEX = PREFIX + '/diario/index.asp'
+ LOGIN = PREFIX + '/diario/login/entrada.asp'
+ extra_css = """
+ body{font-family: "Trebuchet MS",Verdana,sans-serif}
+ .volanta{font-size: small}
+ .t2_portada{font-size: xx-large; font-family: Georgia,serif; color: #026698}
+ """
+
+ conversion_options = {
+ 'comment' : description
+ , 'tags' : category
+ , 'publisher' : publisher
+ , 'language' : language
+ }
+
+ keep_only_tags = [dict(name='div', attrs={'align':'justify'})]
+ remove_tags = [dict(name=['object','link','embed','iframe','meta','link','table','img'])]
+ remove_attributes = ['align']
+
+ def get_browser(self):
+ br = BasicNewsRecipe.get_browser()
+ br.open(self.INDEX)
+ if self.username is not None and self.password is not None:
+ br.open(self.LOGIN)
+ br.select_form(name='frmlogin')
+ br['USER_NAME'] = self.username
+ br['USER_PASS'] = self.password
+ br.submit()
+ return br
+
+ def print_version(self, url):
+ return url.replace('/diario/noticia.asp?','/noticias/imprimir.asp?')
+
+ def preprocess_html(self, soup):
+ for item in soup.findAll(style=True):
+ del item['style']
+ for item in soup.findAll('a'):
+ str = item.string
+ if str is None:
+ str = self.tag_to_string(item)
+ item.replaceWith(str)
+ return soup
+
+ def parse_index(self):
+ soup = self.index_to_soup(self.INDEX)
+ cover_item = soup.find('img',attrs={'class':'fotodespliegue'})
+ if cover_item:
+ self.cover_url = self.PREFIX + cover_item['src']
+ articles = []
+ checker = []
+ for feed_link in soup.findAll('a', attrs={'class':['t0_portada','t2_portada','bajada']}):
+ url = self.PREFIX + feed_link['href']
+ title = self.tag_to_string(feed_link)
+ date = strftime("%a, %d %b %Y %H:%M:%S +0000",time.gmtime())
+ if url not in checker:
+ checker.append(url)
+ articles.append({
+ 'title' :title
+ ,'date' :date
+ ,'url' :url
+ ,'description':u''
+ })
+ return [(self.title, articles)]
diff --git a/recipes/arizona_republic.recipe b/recipes/arizona_republic.recipe
new file mode 100644
index 0000000000..5bc2140946
--- /dev/null
+++ b/recipes/arizona_republic.recipe
@@ -0,0 +1,68 @@
+__license__ = 'GPL v3'
+__copyright__ = '2010, jolo'
+'''
+azrepublic.com
+'''
+from calibre.web.feeds.recipes import BasicNewsRecipe
+
+class AdvancedUserRecipe1307301031(BasicNewsRecipe):
+ title = u'AZRepublic'
+ __author__ = 'Jim Olo'
+ language = 'en'
+ description = "The Arizona Republic is Arizona's leading provider of news and information, and has published a daily newspaper in Phoenix for more than 110 years"
+ publisher = 'AZRepublic/AZCentral'
+ masthead_url = 'http://freedom2t.com/wp-content/uploads/press_az_republic_v2.gif'
+ cover_url = 'http://www.valleyleadership.org/Common/Img/2line4c_AZRepublic%20with%20azcentral%20logo.jpg'
+ category = 'news, politics, USA, AZ, Arizona'
+
+ oldest_article = 7
+ max_articles_per_feed = 100
+ remove_empty_feeds = True
+ no_stylesheets = True
+ remove_javascript = True
+# extra_css = '.headline {font-size: medium;} \n .fact { padding-top: 10pt }'
+ extra_css = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .headline {font-size: medium} .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
+
+ remove_attributes = ['width','height','h2','subHeadline','style']
+ remove_tags = [
+ dict(name='div', attrs={'id':['slidingBillboard', 'top728x90', 'subindex-header', 'topSearch']}),
+ dict(name='div', attrs={'id':['simplesearch', 'azcLoginBox', 'azcLoginBoxInner', 'topNav']}),
+ dict(name='div', attrs={'id':['carsDrop', 'homesDrop', 'rentalsDrop', 'classifiedDrop']}),
+ dict(name='div', attrs={'id':['nav', 'mp', 'subnav', 'jobsDrop']}),
+ dict(name='h6', attrs={'class':['section-header']}),
+ dict(name='a', attrs={'href':['#comments']}),
+ dict(name='div', attrs={'class':['articletools clearfix', 'floatRight']}),
+ dict(name='div', attrs={'id':['fbFrame', 'ob', 'storyComments', 'storyGoogleAdBox']}),
+ dict(name='div', attrs={'id':['storyTopHomes', 'openRight', 'footerwrap', 'copyright']}),
+ dict(name='div', attrs={'id':['blogsHed', 'blog_comments', 'blogByline','blogTopics']}),
+ dict(name='div', attrs={'id':['membersRightMain', 'dealsfooter', 'azrTopHed', 'azrRightCol']}),
+ dict(name='div', attrs={'id':['ttdHeader', 'ttdTimeWeather']}),
+ dict(name='div', attrs={'id':['membersRightMain', 'deals-header-wrap']}),
+ dict(name='div', attrs={'id':['todoTopSearchBar', 'byline clearfix', 'subdex-topnav']}),
+ dict(name='h1', attrs={'id':['SEOtext']}),
+ dict(name='table', attrs={'class':['ap-mediabox-table']}),
+ dict(name='p', attrs={'class':['ap_para']}),
+ dict(name='span', attrs={'class':['source-org vcard', 'org fn']}),
+ dict(name='a', attrs={'href':['http://hosted2.ap.org/APDEFAULT/privacy']}),
+ dict(name='a', attrs={'href':['http://hosted2.ap.org/APDEFAULT/terms']}),
+ dict(name='div', attrs={'id':['onespot_nextclick']}),
+ ]
+
+ feeds = [
+ (u'FrontPage', u'http://www.azcentral.com/rss/feeds/republicfront.xml'),
+ (u'TopUS-News', u'http://hosted.ap.org/lineups/USHEADS.rss?SITE=AZPHG&SECTION=HOME'),
+ (u'WorldNews', u'http://hosted.ap.org/lineups/WORLDHEADS.rss?SITE=AZPHG&SECTION=HOME'),
+ (u'TopBusiness', u'http://hosted.ap.org/lineups/BUSINESSHEADS.rss?SITE=AZPHG&SECTION=HOME'),
+ (u'Entertainment', u'http://hosted.ap.org/lineups/ENTERTAINMENT.rss?SITE=AZPHG&SECTION=HOME'),
+ (u'ArizonaNews', u'http://www.azcentral.com/rss/feeds/news.xml'),
+ (u'Gilbert', u'http://www.azcentral.com/rss/feeds/gilbert.xml'),
+ (u'Chandler', u'http://www.azcentral.com/rss/feeds/chandler.xml'),
+ (u'DiningReviews', u'http://www.azcentral.com/rss/feeds/diningreviews.xml'),
+ (u'AZBusiness', u'http://www.azcentral.com/rss/feeds/business.xml'),
+ (u'ArizonaDeals', u'http://www.azcentral.com/members/Blog%7E/RealDealsblog'),
+ (u'GroceryDeals', u'http://www.azcentral.com/members/Blog%7E/RealDealsblog/tag/2646')
+ ]
+
+
+
+
diff --git a/recipes/athens_news.recipe b/recipes/athens_news.recipe
new file mode 100644
index 0000000000..6667faaf0c
--- /dev/null
+++ b/recipes/athens_news.recipe
@@ -0,0 +1,70 @@
+__license__ = 'GPL v3'
+__copyright__ = '2011, Darko Miletic '
+'''
+www.athensnews.gr
+'''
+
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class AthensNews(BasicNewsRecipe):
+ title = 'Athens News'
+ __author__ = 'Darko Miletic'
+ description = 'Greece in English since 1952'
+ publisher = 'NEP Publishing Company SA'
+ category = 'news, politics, Greece, Athens'
+ oldest_article = 1
+ max_articles_per_feed = 200
+ no_stylesheets = True
+ encoding = 'utf8'
+ use_embedded_content = False
+ language = 'en_GR'
+ remove_empty_feeds = True
+ publication_type = 'newspaper'
+ masthead_url = 'http://www.athensnews.gr/sites/athensnews/themes/athensnewsv3/images/logo.jpg'
+ extra_css = """
+ body{font-family: Arial,Helvetica,sans-serif }
+ img{margin-bottom: 0.4em; display:block}
+ .big{font-size: xx-large; font-family: Georgia,serif}
+ .articlepubdate{font-size: small; color: gray; font-family: Georgia,serif}
+ .lezanta{font-size: x-small; font-weight: bold; text-align: left; margin-bottom: 1em; display: block}
+ """
+
+ conversion_options = {
+ 'comment' : description
+ , 'tags' : category
+ , 'publisher' : publisher
+ , 'language' : language
+ , 'linearize_tables' : True
+ }
+
+ remove_tags = [
+ dict(name=['meta','link'])
+ ]
+ keep_only_tags=[
+ dict(name='span',attrs={'class':'big'})
+ ,dict(name='td', attrs={'class':['articlepubdate','text']})
+ ]
+ remove_attributes=['lang']
+
+
+ feeds = [
+ (u'News' , u'http://www.athensnews.gr/category/1/feed' )
+ ,(u'Politics' , u'http://www.athensnews.gr/category/8/feed' )
+ ,(u'Business' , u'http://www.athensnews.gr/category/2/feed' )
+ ,(u'Economy' , u'http://www.athensnews.gr/category/11/feed')
+ ,(u'Community' , u'http://www.athensnews.gr/category/5/feed' )
+ ,(u'Arts' , u'http://www.athensnews.gr/category/3/feed' )
+ ,(u'Living in Athens', u'http://www.athensnews.gr/category/7/feed' )
+ ,(u'Sports' , u'http://www.athensnews.gr/category/4/feed' )
+ ,(u'Travel' , u'http://www.athensnews.gr/category/6/feed' )
+ ,(u'Letters' , u'http://www.athensnews.gr/category/44/feed')
+ ,(u'Media' , u'http://www.athensnews.gr/multimedia/feed' )
+ ]
+
+ def print_version(self, url):
+ return url + '?action=print'
+
+ def preprocess_html(self, soup):
+ for item in soup.findAll(style=True):
+ del item['style']
+ return soup
diff --git a/recipes/brigitte_de.recipe b/recipes/brigitte_de.recipe
new file mode 100644
index 0000000000..860d5176ac
--- /dev/null
+++ b/recipes/brigitte_de.recipe
@@ -0,0 +1,36 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class AdvancedUserRecipe(BasicNewsRecipe):
+
+ title = u'Brigitte.de'
+ __author__ = 'schuster'
+ oldest_article = 14
+ max_articles_per_feed = 100
+ no_stylesheets = True
+ use_embedded_content = False
+ language = 'de'
+ remove_javascript = True
+ remove_empty_feeds = True
+ timeout = 10
+ cover_url = 'http://www.medienmilch.de/typo3temp/pics/Brigitte-Logo_d5feb4a6e4.jpg'
+ masthead_url = 'http://www.medienmilch.de/typo3temp/pics/Brigitte-Logo_d5feb4a6e4.jpg'
+
+
+ remove_tags = [dict(attrs={'class':['linklist', 'head', 'indent right relatedContent', 'artikel-meta segment', 'segment', 'comment commentFormWrapper segment borderBG', 'segment borderBG comments', 'segment borderBG box', 'center', 'segment nextPageLink', 'inCar']}),
+ dict(id=['header', 'artTools', 'context', 'interact', 'footer-navigation', 'bwNet', 'copy', 'keyboardNavigationHint']),
+ dict(name=['hjtrs', 'kud'])]
+
+ feeds = [(u'Mode', u'http://www.brigitte.de/mode/feed.rss'),
+ (u'Beauty', u'http://www.brigitte.de/beauty/feed.rss'),
+ (u'Luxus', u'http://www.brigitte.de/luxus/feed.rss'),
+ (u'Figur', u'http://www.brigitte.de/figur/feed.rss'),
+ (u'Gesundheit', u'http://www.brigitte.de/gesundheit/feed.rss'),
+ (u'Liebe&Sex', u'http://www.brigitte.de/liebe-sex/feed.rss'),
+ (u'Gesellschaft', u'http://www.brigitte.de/gesellschaft/feed.rss'),
+ (u'Kultur', u'http://www.brigitte.de/kultur/feed.rss'),
+ (u'Reise', u'http://www.brigitte.de/reise/feed.rss'),
+ (u'Kochen', u'http://www.brigitte.de/kochen/feed.rss'),
+ (u'Wohnen', u'http://www.brigitte.de/wohnen/feed.rss'),
+ (u'Job', u'http://www.brigitte.de/job/feed.rss'),
+ (u'Erfahrungen', u'http://www.brigitte.de/erfahrungen/feed.rss'),
+]
diff --git a/recipes/buenosaireseconomico.recipe b/recipes/buenosaireseconomico.recipe
index 782358e6d3..ccfdd5aca0 100644
--- a/recipes/buenosaireseconomico.recipe
+++ b/recipes/buenosaireseconomico.recipe
@@ -1,72 +1,59 @@
-#!/usr/bin/env python
-
__license__ = 'GPL v3'
-__copyright__ = '2009, Darko Miletic '
+__copyright__ = '2009-2011, Darko Miletic '
'''
-elargentino.com
+www.diariobae.com
'''
-
+from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe
-from calibre.ebooks.BeautifulSoup import Tag
class BsAsEconomico(BasicNewsRecipe):
title = 'Buenos Aires Economico'
__author__ = 'Darko Miletic'
- description = 'Revista Argentina'
- publisher = 'ElArgentino.com'
+ description = 'Diario BAE es el diario economico-politico con mas influencia en la Argentina. Fuente de empresarios y politicos del pais y el exterior.'
+ publisher = 'Diario BAE'
category = 'news, politics, economy, Argentina'
oldest_article = 2
max_articles_per_feed = 100
no_stylesheets = True
use_embedded_content = False
encoding = 'utf-8'
- language = 'es_AR'
+ language = 'es_AR'
+ cover_url = strftime('http://www.diariobae.com/imgs_portadas/%Y%m%d_portadasBAE.jpg')
+ masthead_url = 'http://www.diariobae.com/img/logo_bae.png'
+ remove_empty_feeds = True
+ publication_type = 'newspaper'
+ extra_css = """
+ body{font-family: Georgia,"Times New Roman",Times,serif}
+ #titulo{font-size: x-large}
+ #epi{font-size: small; font-style: italic; font-weight: bold}
+ img{display: block; margin-top: 1em}
+ """
+ conversion_options = {
+ 'comment' : description
+ , 'tags' : category
+ , 'publisher' : publisher
+ , 'language' : language
+ }
- lang = 'es-AR'
- direction = 'ltr'
- INDEX = 'http://www.elargentino.com/medios/121/Buenos-Aires-Economico.html'
- extra_css = ' .titulo{font-size: x-large; font-weight: bold} .volantaImp{font-size: small; font-weight: bold} '
-
- html2lrf_options = [
- '--comment' , description
- , '--category' , category
- , '--publisher', publisher
+ remove_tags_before= dict(attrs={'id':'titulo'})
+ remove_tags_after = dict(attrs={'id':'autor' })
+ remove_tags = [
+ dict(name=['meta','base','iframe','link','lang'])
+ ,dict(attrs={'id':'barra_tw'})
]
+ remove_attributes = ['data-count','data-via']
- html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"\noverride_css=" p {text-indent: 0cm; margin-top: 0em; margin-bottom: 0.5em} "'
-
- keep_only_tags = [dict(name='div', attrs={'class':'ContainerPop'})]
-
- remove_tags = [dict(name='link')]
-
- feeds = [(u'Articulos', u'http://www.elargentino.com/Highlights.aspx?ParentType=Section&ParentId=121&Content-Type=text/xml&ChannelDesc=Buenos%20Aires%20Econ%C3%B3mico')]
-
- def print_version(self, url):
- main, sep, article_part = url.partition('/nota-')
- article_id, rsep, rrest = article_part.partition('-')
- return u'http://www.elargentino.com/Impresion.aspx?Id=' + article_id
+ feeds = [
+ (u'Argentina' , u'http://www.diariobae.com/rss/argentina.xml' )
+ ,(u'Valores' , u'http://www.diariobae.com/rss/valores.xml' )
+ ,(u'Finanzas' , u'http://www.diariobae.com/rss/finanzas.xml' )
+ ,(u'Negocios' , u'http://www.diariobae.com/rss/negocios.xml' )
+ ,(u'Mundo' , u'http://www.diariobae.com/rss/mundo.xml' )
+ ,(u'5 dias' , u'http://www.diariobae.com/rss/5dias.xml' )
+ ,(u'Espectaculos', u'http://www.diariobae.com/rss/espectaculos.xml')
+ ]
def preprocess_html(self, soup):
for item in soup.findAll(style=True):
del item['style']
- soup.html['lang'] = self.lang
- soup.html['dir' ] = self.direction
- mlang = Tag(soup,'meta',[("http-equiv","Content-Language"),("content",self.lang)])
- mcharset = Tag(soup,'meta',[("http-equiv","Content-Type"),("content","text/html; charset=utf-8")])
- soup.head.insert(0,mlang)
- soup.head.insert(1,mcharset)
return soup
-
- def get_cover_url(self):
- cover_url = None
- soup = self.index_to_soup(self.INDEX)
- cover_item = soup.find('div',attrs={'class':'colder'})
- if cover_item:
- clean_url = self.image_url_processor(None,cover_item.div.img['src'])
- cover_url = 'http://www.elargentino.com' + clean_url + '&height=600'
- return cover_url
-
- def image_url_processor(self, baseurl, url):
- base, sep, rest = url.rpartition('?Id=')
- img, sep2, rrest = rest.partition('&')
- return base + sep + img
diff --git a/recipes/catholic_news_agency.recipe b/recipes/catholic_news_agency.recipe
new file mode 100644
index 0000000000..43b7755f07
--- /dev/null
+++ b/recipes/catholic_news_agency.recipe
@@ -0,0 +1,13 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class AdvancedUserRecipe1301972345(BasicNewsRecipe):
+ title = u'Catholic News Agency'
+ language = 'en'
+ __author__ = 'Jetkey'
+ oldest_article = 5
+ max_articles_per_feed = 20
+
+ feeds = [(u'U.S. News', u'http://feeds.feedburner.com/catholicnewsagency/dailynews-us'),
+ (u'Vatican', u'http://feeds.feedburner.com/catholicnewsagency/dailynews-vatican'),
+ (u'Bishops Corner', u'http://feeds.feedburner.com/catholicnewsagency/columns/bishopscorner'),
+ (u'Saint of the Day', u'http://feeds.feedburner.com/catholicnewsagency/saintoftheday')]
diff --git a/recipes/christian_post.recipe b/recipes/christian_post.recipe
new file mode 100644
index 0000000000..7bb08e88dc
--- /dev/null
+++ b/recipes/christian_post.recipe
@@ -0,0 +1,37 @@
+#created by sexymax15 ....sexymax15@gmail.com
+#christian post recipe
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class ChristianPost(BasicNewsRecipe):
+
+ title = 'The Christian Post'
+ __author__ = 'sexymax15'
+ description = 'Homepage'
+ language = 'en'
+ no_stylesheets = True
+ use_embedded_content = False
+ oldest_article = 30
+ max_articles_per_feed = 15
+
+ remove_empty_feeds = True
+ no_stylesheets = True
+ remove_javascript = True
+
+ extra_css = '''
+ h1 {color:#008852;font-family:Arial,Helvetica,sans-serif; font-size:20px; font-size-adjust:none; font-stretch:normal; font-style:normal; font-variant:normal; font-weight:bold; line-height:18px;}
+ h2 {color:#4D4D4D;font-family:Arial,Helvetica,sans-serif; font-size:16px; font-size-adjust:none; font-stretch:normal; font-style:normal; font-variant:normal; font-weight:bold; line-height:16px; } '''
+
+
+ feeds = [
+ ('Homepage', 'http://www.christianpost.com/services/rss/feed/'),
+ ('Most Popular', 'http://www.christianpost.com/services/rss/feed/most-popular'),
+ ('Entertainment', 'http://www.christianpost.com/services/rss/feed/entertainment/'),
+ ('Politics', 'http://www.christianpost.com/services/rss/feed/politics/'),
+ ('Living', 'http://www.christianpost.com/services/rss/feed/living/'),
+ ('Business', 'http://www.christianpost.com/services/rss/feed/business/'),
+ ('Opinion', 'http://www.christianpost.com/services/rss/feed/opinion/')
+ ]
+
+ def print_version(self, url):
+ return url +'print.html'
+
diff --git a/recipes/criticadigital.recipe b/recipes/criticadigital.recipe
deleted file mode 100644
index 3cb72e6be4..0000000000
--- a/recipes/criticadigital.recipe
+++ /dev/null
@@ -1,69 +0,0 @@
-#!/usr/bin/env python
-
-__license__ = 'GPL v3'
-__copyright__ = '2008, Darko Miletic '
-'''
-criticadigital.com
-'''
-
-from calibre.web.feeds.news import BasicNewsRecipe
-
-class CriticaDigital(BasicNewsRecipe):
- title = 'Critica de la Argentina'
- __author__ = 'Darko Miletic and Sujata Raman'
- description = 'Noticias de Argentina'
- oldest_article = 2
- max_articles_per_feed = 100
- language = 'es_AR'
-
- no_stylesheets = True
- use_embedded_content = False
- encoding = 'cp1252'
-
- extra_css = '''
- h1{font-family:"Trebuchet MS";}
- h3{color:#9A0000; font-family:Tahoma; font-size:x-small;}
- h2{color:#504E53; font-family:Arial,Helvetica,sans-serif ;font-size:small;}
- #epigrafe{font-family:Arial,Helvetica,sans-serif ;color:#666666 ; font-size:x-small;}
- p {font-family:Arial,Helvetica,sans-serif;}
- #fecha{color:#858585; font-family:Tahoma; font-size:x-small;}
- #autor{color:#858585; font-family:Tahoma; font-size:x-small;}
- #hora{color:#F00000;font-family:Tahoma; font-size:x-small;}
- '''
- keep_only_tags = [
- dict(name='div', attrs={'class':['bloqueTitulosNoticia','cfotonota']})
- ,dict(name='div', attrs={'id':'boxautor'})
- ,dict(name='p', attrs={'id':'textoNota'})
- ]
-
- remove_tags = [
- dict(name='div', attrs={'class':'box300' })
- ,dict(name='div', style=True )
- ,dict(name='div', attrs={'class':'titcomentario'})
- ,dict(name='div', attrs={'class':'comentario' })
- ,dict(name='div', attrs={'class':'paginador' })
- ]
-
- feeds = [
- (u'Politica', u'http://www.criticadigital.com/herramientas/rss.php?ch=politica' )
- ,(u'Economia', u'http://www.criticadigital.com/herramientas/rss.php?ch=economia' )
- ,(u'Deportes', u'http://www.criticadigital.com/herramientas/rss.php?ch=deportes' )
- ,(u'Espectaculos', u'http://www.criticadigital.com/herramientas/rss.php?ch=espectaculos')
- ,(u'Mundo', u'http://www.criticadigital.com/herramientas/rss.php?ch=mundo' )
- ,(u'Policiales', u'http://www.criticadigital.com/herramientas/rss.php?ch=policiales' )
- ,(u'Sociedad', u'http://www.criticadigital.com/herramientas/rss.php?ch=sociedad' )
- ,(u'Salud', u'http://www.criticadigital.com/herramientas/rss.php?ch=salud' )
- ,(u'Tecnologia', u'http://www.criticadigital.com/herramientas/rss.php?ch=tecnologia' )
- ,(u'Santa Fe', u'http://www.criticadigital.com/herramientas/rss.php?ch=santa_fe' )
- ]
-
- def get_cover_url(self):
- cover_url = None
- index = 'http://www.criticadigital.com/impresa/'
- soup = self.index_to_soup(index)
- link_item = soup.find('div',attrs={'class':'tapa'})
- if link_item:
- cover_url = index + link_item.img['src']
- return cover_url
-
-
diff --git a/recipes/daily_mirror.recipe b/recipes/daily_mirror.recipe
new file mode 100644
index 0000000000..5d4dbe3f4b
--- /dev/null
+++ b/recipes/daily_mirror.recipe
@@ -0,0 +1,52 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class AdvancedUserRecipe1306061239(BasicNewsRecipe):
+ title = u'The Daily Mirror'
+ description = 'News as provided by The Daily Mirror - UK'
+
+ __author__ = 'Dave Asbury'
+ language = 'en_GB'
+
+ cover_url = 'http://yookeo.com/screens/m/i/mirror.co.uk.jpg'
+
+ masthead_url = 'http://www.nmauk.co.uk/nma/images/daily_mirror.gif'
+
+
+ oldest_article = 1
+ max_articles_per_feed = 100
+ remove_empty_feeds = True
+ remove_javascript = True
+ no_stylesheets = True
+
+ keep_only_tags = [
+ dict(name='h1'),
+ dict(attrs={'class':['article-attr']}),
+ dict(name='div', attrs={'class' : [ 'article-body', 'crosshead']})
+
+
+ ]
+
+ remove_tags = [
+ dict(name='div', attrs={'class' : ['caption', 'article-resize']}),
+ dict( attrs={'class':'append-html'})
+ ]
+
+
+
+
+ feeds = [
+
+ (u'News', u'http://www.mirror.co.uk/news/rss.xml')
+ ,(u'Tech News', u'http://www.mirror.co.uk/news/technology/rss.xml')
+ ,(u'Weird World','http://www.mirror.co.uk/news/weird-world/rss.xml')
+ ,(u'Film Gossip','http://www.mirror.co.uk/celebs/film/rss.xml')
+ ,(u'Music News','http://www.mirror.co.uk/celebs/music/rss.xml')
+ ,(u'Celebs and Tv Gossip','http://www.mirror.co.uk/celebs/tv/rss.xml')
+ ,(u'Sport','http://www.mirror.co.uk/sport/rss.xml')
+ ,(u'Life Style','http://www.mirror.co.uk/life-style/rss.xml')
+ ,(u'Advice','http://www.mirror.co.uk/advice/rss.xml')
+ ,(u'Travel','http://www.mirror.co.uk/advice/travel/rss.xml')
+
+ # example of commented out feed not needed ,(u'Travel','http://www.mirror.co.uk/advice/travel/rss.xml')
+ ]
+
diff --git a/recipes/daytona_beach.recipe b/recipes/daytona_beach.recipe
new file mode 100644
index 0000000000..1230c1d8ed
--- /dev/null
+++ b/recipes/daytona_beach.recipe
@@ -0,0 +1,78 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class DaytonBeachNewsJournal(BasicNewsRecipe):
+ title ='Daytona Beach News Journal'
+ __author__ = 'BRGriff'
+ publisher = 'News-JournalOnline.com'
+ description = 'Daytona Beach, Florida, Newspaper'
+ category = 'News, Daytona Beach, Florida'
+ oldest_article = 1
+ max_articles_per_feed = 100
+ remove_javascript = True
+ use_embedded_content = False
+ no_stylesheets = True
+ language = 'en'
+ filterDuplicates = True
+ remove_attributes = ['style']
+
+ keep_only_tags = [dict(name='div', attrs={'class':'page-header'}),
+ dict(name='div', attrs={'class':'asset-body'})
+ ]
+ remove_tags = [dict(name='div', attrs={'class':['byline-section', 'asset-meta']})
+ ]
+
+ feeds = [
+ #####NEWS#####
+ (u"News", u"http://www.news-journalonline.com/rss.xml"),
+ (u"Breaking News", u"http://www.news-journalonline.com/breakingnews/rss.xml"),
+ (u"Local - East Volusia", u"http://www.news-journalonline.com/news/local/east-volusia/rss.xml"),
+ (u"Local - West Volusia", u"http://www.news-journalonline.com/news/local/west-volusia/rss.xml"),
+ (u"Local - Southeast", u"http://www.news-journalonline.com/news/local/southeast-volusia/rss.xml"),
+ (u"Local - Flagler", u"http://www.news-journalonline.com/news/local/flagler/rss.xml"),
+ (u"Florida", u"http://www.news-journalonline.com/news/florida/rss.xml"),
+ (u"National/World", u"http://www.news-journalonline.com/news/nationworld/rss.xml"),
+ (u"Politics", u"http://www.news-journalonline.com/news/politics/rss.xml"),
+ (u"News of Record", u"http://www.news-journalonline.com/news/news-of-record/rss.xml"),
+ ####BUSINESS####
+ (u"Business", u"http://www.news-journalonline.com/business/rss.xml"),
+ #(u"Jobs", u"http://www.news-journalonline.com/business/jobs/rss.xml"),
+ #(u"Markets", u"http://www.news-journalonline.com/business/markets/rss.xml"),
+ #(u"Real Estate", u"http://www.news-journalonline.com/business/real-estate/rss.xml"),
+ #(u"Technology", u"http://www.news-journalonline.com/business/technology/rss.xml"),
+ ####SPORTS####
+ (u"Sports", u"http://www.news-journalonline.com/sports/rss.xml"),
+ (u"Racing", u"http://www.news-journalonline.com/racing/rss.xml"),
+ (u"High School", u"http://www.news-journalonline.com/sports/highschool/rss.xml"),
+ (u"College", u"http://www.news-journalonline.com/sports/college/rss.xml"),
+ (u"Basketball", u"http://www.news-journalonline.com/sports/basketball/rss.xml"),
+ (u"Football", u"http://www.news-journalonline.com/sports/football/rss.xml"),
+ (u"Golf", u"http://www.news-journalonline.com/sports/golf/rss.xml"),
+ (u"Other Sports", u"http://www.news-journalonline.com/sports/other/rss.xml"),
+ ####LIFESTYLE####
+ (u"Lifestyle", u"http://www.news-journalonline.com/lifestyle/rss.xml"),
+ #(u"Fashion", u"http://www.news-journalonline.com/lifestyle/fashion/rss.xml"),
+ (u"Food", u"http://www.news-journalonline.com/lifestyle/food/rss.xml"),
+ #(u"Health", u"http://www.news-journalonline.com/lifestyle/health/rss.xml"),
+ (u"Home and Garden", u"http://www.news-journalonline.com/lifestyle/home-and-garden/rss.xml"),
+ (u"Living", u"http://www.news-journalonline.com/lifestyle/living/rss.xml"),
+ (u"Religion", u"http://www.news-journalonline.com/lifestyle/religion/rss.xml"),
+ #(u"Travel", u"http://www.news-journalonline.com/lifestyle/travel/rss.xml"),
+ ####OPINION####
+ #(u"Opinion", u"http://www.news-journalonline.com/opinion/rss.xml"),
+ #(u"Letters to Editor", u"http://www.news-journalonline.com/opinion/letters-to-the-editor/rss.xml"),
+ #(u"Columns", u"http://www.news-journalonline.com/columns/rss.xml"),
+ #(u"Podcasts", u"http://www.news-journalonline.com/podcasts/rss.xml"),
+ ####ENTERTAINMENT#### ##Weekly Feature##
+ (u"Entertainment", u"http://www.go386.com/rss.xml"),
+ (u"Go Out", u"http://www.go386.com/go/rss.xml"),
+ (u"Music", u"http://www.go386.com/music/rss.xml"),
+ (u"Movies", u"http://www.go386.com/movies/rss.xml"),
+ #(u"Culture", u"http://www.go386.com/culture/rss.xml"),
+
+ ]
+
+ extra_css = '''
+ .page-header{font-family:Arial,Helvetica,sans-serif; font-weight:bold; font-size:22pt;}
+ .asset-body{font-family:Helvetica,Arial,sans-serif; font-size:16pt;}
+
+ '''
diff --git a/recipes/down_to_earth.recipe b/recipes/down_to_earth.recipe
new file mode 100644
index 0000000000..bc37514d3e
--- /dev/null
+++ b/recipes/down_to_earth.recipe
@@ -0,0 +1,18 @@
+from calibre.web.feeds.recipes import BasicNewsRecipe
+
+class AdvancedUserRecipe1307834113(BasicNewsRecipe):
+
+ title = u'Down To Earth'
+ oldest_article = 300
+ __author__ = 'sexymax15'
+ max_articles_per_feed = 30
+ no_stylesheets = True
+ remove_javascript = True
+ remove_attributes = ['width','height']
+ use_embedded_content = False
+ language = 'en_IN'
+ remove_empty_feeds = True
+ remove_tags_before = dict(name='div', id='PageContent')
+ remove_tags_after = dict(name='div', attrs={'class':'box'})
+ remove_tags =[{'class':'box'}]
+ feeds = [(u'editor', u'http://www.downtoearth.org.in/taxonomy/term/20348/0/feed'), (u'cover story', u'http://www.downtoearth.org.in/taxonomy/term/20345/0/feed'), (u'special report', u'http://www.downtoearth.org.in/taxonomy/term/20384/0/feed'), (u'features', u'http://www.downtoearth.org.in/taxonomy/term/20350/0/feed'), (u'news', u'http://www.downtoearth.org.in/taxonomy/term/20366/0/feed'), (u'debate', u'http://www.downtoearth.org.in/taxonomy/term/20347/0/feed'), (u'natural disasters', u'http://www.downtoearth.org.in/taxonomy/term/20822/0/feed')]
diff --git a/recipes/elclubdelebook.recipe b/recipes/elclubdelebook.recipe
new file mode 100644
index 0000000000..e05b176cc5
--- /dev/null
+++ b/recipes/elclubdelebook.recipe
@@ -0,0 +1,61 @@
+
+__license__ = 'GPL v3'
+__copyright__ = '2011, Darko Miletic '
+'''
+www.clubdelebook.com
+'''
+
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class ElClubDelEbook(BasicNewsRecipe):
+ title = 'El club del ebook'
+ __author__ = 'Darko Miletic'
+ description = 'El Club del eBook, es la primera fuente de informacion sobre ebooks de Argentina. Aca vas a encontrar noticias, tips, tutoriales, recursos y opiniones sobre el mundo de los libros electronicos.'
+ tags = 'ebook, libro electronico, e-book, ebooks, libros electronicos, e-books'
+ oldest_article = 7
+ max_articles_per_feed = 100
+ language = 'es_AR'
+ encoding = 'utf-8'
+ no_stylesheets = True
+ use_embedded_content = True
+ publication_type = 'blog'
+ masthead_url = 'http://dl.dropbox.com/u/2845131/elclubdelebook.png'
+ extra_css = """
+ body{font-family: Arial,Helvetica,sans-serif}
+ img{ margin-bottom: 0.8em;
+ border: 1px solid #333333;
+ padding: 4px; display: block
+ }
+ """
+
+ conversion_options = {
+ 'comment' : description
+ , 'tags' : tags
+ , 'publisher': title
+ , 'language' : language
+ }
+
+ remove_tags = [dict(attrs={'id':'crp_related'})]
+ remove_tags_after = dict(attrs={'id':'crp_related'})
+
+ feeds = [(u'Articulos', u'http://feeds.feedburner.com/ElClubDelEbook')]
+
+ def preprocess_html(self, soup):
+ for item in soup.findAll(style=True):
+ del item['style']
+ for item in soup.findAll('a'):
+ limg = item.find('img')
+ if item.string is not None:
+ str = item.string
+ item.replaceWith(str)
+ else:
+ if limg:
+ item.name = 'div'
+ item.attrs = []
+ else:
+ str = self.tag_to_string(item)
+ item.replaceWith(str)
+ for item in soup.findAll('img'):
+ if not item.has_key('alt'):
+ item['alt'] = 'image'
+ return soup
diff --git a/recipes/elcronista.recipe b/recipes/elcronista.recipe
index 93615f8f42..f8da81c4bb 100644
--- a/recipes/elcronista.recipe
+++ b/recipes/elcronista.recipe
@@ -1,72 +1,59 @@
-#!/usr/bin/env python
-
__license__ = 'GPL v3'
-__copyright__ = '2008, Darko Miletic '
+__copyright__ = '2008-2011, Darko Miletic '
'''
-cronista.com
+www.cronista.com
'''
from calibre.web.feeds.news import BasicNewsRecipe
-class ElCronista(BasicNewsRecipe):
- title = 'El Cronista'
+class ElCronistaComercial(BasicNewsRecipe):
+ title = 'El Cronista Comercial'
__author__ = 'Darko Miletic'
- description = 'Noticias de Argentina'
+ description = 'El Cronista Comercial es el Diario economico-politico mas valorado. Es la fuente mas confiable de informacion en temas de economia, finanzas y negocios enmarcados politicamente.'
+ publisher = 'Cronista.com'
+ category = 'news, politics, economy, finances, Argentina'
oldest_article = 2
- language = 'es_AR'
-
- max_articles_per_feed = 100
+ max_articles_per_feed = 200
no_stylesheets = True
+ encoding = 'utf8'
use_embedded_content = False
- encoding = 'cp1252'
+ language = 'es_AR'
+ remove_empty_feeds = True
+ publication_type = 'newspaper'
+ masthead_url = 'http://www.cronista.com/export/sites/diarioelcronista/arte/header-logo.gif'
+ extra_css = """
+ body{font-family: Arial,Helvetica,sans-serif }
+ h2{font-family: Georgia,"Times New Roman",Times,serif }
+ img{margin-bottom: 0.4em; display:block}
+ .nom{font-weight: bold; vertical-align: baseline}
+ .autor-cfoto{border-bottom: 1px solid #D2D2D2;
+ border-top: 1px solid #D2D2D2;
+ display: inline-block;
+ margin: 0 10px 10px 0;
+ padding: 10px;
+ width: 210px}
+ .under{font-weight: bold}
+ .time{font-size: small}
+ """
- html2lrf_options = [
- '--comment' , description
- , '--category' , 'news, Argentina'
- , '--publisher' , title
- ]
+ conversion_options = {
+ 'comment' : description
+ , 'tags' : category
+ , 'publisher' : publisher
+ , 'language' : language
+ }
- keep_only_tags = [
- dict(name='table', attrs={'width':'100%' })
- ,dict(name='h1' , attrs={'class':'Arialgris16normal'})
- ]
+ remove_tags = [
+ dict(name=['meta','link','base','iframe','object','embed'])
+ ,dict(attrs={'class':['user-tools','tabsmedia']})
+ ]
+ remove_attributes = ['lang']
+ remove_tags_before = dict(attrs={'class':'top'})
+ remove_tags_after = dict(attrs={'class':'content-nota'})
+ feeds = [(u'Ultimas noticias', u'http://www.cronista.com/rss.html')]
- remove_tags = [dict(name='a', attrs={'class':'Arialazul12'})]
-
- feeds = [
- (u'Economia' , u'http://www.cronista.com/adjuntos/8/rss/Economia_EI.xml' )
- ,(u'Negocios' , u'http://www.cronista.com/adjuntos/8/rss/negocios_EI.xml' )
- ,(u'Ultimo momento' , u'http://www.cronista.com/adjuntos/8/rss/ultimo_momento.xml' )
- ,(u'Finanzas y Mercados' , u'http://www.cronista.com/adjuntos/8/rss/Finanzas_Mercados_EI.xml' )
- ,(u'Financial Times' , u'http://www.cronista.com/adjuntos/8/rss/FT_EI.xml' )
- ,(u'Opinion edicion impresa' , u'http://www.cronista.com/adjuntos/8/rss/opinion_edicion_impresa.xml' )
- ,(u'Socialmente Responsables', u'http://www.cronista.com/adjuntos/8/rss/Socialmente_Responsables.xml')
- ,(u'Asuntos Legales' , u'http://www.cronista.com/adjuntos/8/rss/asuntoslegales.xml' )
- ,(u'IT Business' , u'http://www.cronista.com/adjuntos/8/rss/itbusiness.xml' )
- ,(u'Management y RR.HH.' , u'http://www.cronista.com/adjuntos/8/rss/management.xml' )
- ,(u'Inversiones Personales' , u'http://www.cronista.com/adjuntos/8/rss/inversionespersonales.xml' )
- ]
-
- def print_version(self, url):
- main, sep, rest = url.partition('.com/notas/')
- article_id, lsep, rrest = rest.partition('-')
- return 'http://www.cronista.com/interior/index.php?p=imprimir_nota&idNota=' + article_id
def preprocess_html(self, soup):
- mtag = ''
- soup.head.insert(0,mtag)
- soup.head.base.extract()
- htext = soup.find('h1',attrs={'class':'Arialgris16normal'})
- htext.name = 'p'
- soup.prettify()
+ for item in soup.findAll(style=True):
+ del item['style']
return soup
-
- def get_cover_url(self):
- cover_url = None
- index = 'http://www.cronista.com/contenidos/'
- soup = self.index_to_soup(index + 'ee.html')
- link_item = soup.find('a',attrs={'href':"javascript:Close()"})
- if link_item:
- cover_url = index + link_item.img['src']
- return cover_url
-
diff --git a/recipes/eluniversal_ve.recipe b/recipes/eluniversal_ve.recipe
index 28667cd39b..d7c2c4710b 100644
--- a/recipes/eluniversal_ve.recipe
+++ b/recipes/eluniversal_ve.recipe
@@ -1,5 +1,5 @@
__license__ = 'GPL v3'
-__copyright__ = '2010, Darko Miletic '
+__copyright__ = '2010-2011, Darko Miletic '
'''
www.eluniversal.com
'''
@@ -15,12 +15,20 @@ class ElUniversal(BasicNewsRecipe):
max_articles_per_feed = 100
no_stylesheets = True
use_embedded_content = False
+ remove_empty_feeds = True
encoding = 'cp1252'
publisher = 'El Universal'
category = 'news, Caracas, Venezuela, world'
language = 'es_VE'
+ publication_type = 'newspaper'
cover_url = strftime('http://static.eluniversal.com/%Y/%m/%d/portada.jpg')
-
+ extra_css = """
+ .txt60{font-family: Tahoma,Geneva,sans-serif; font-size: small}
+ .txt29{font-family: Tahoma,Geneva,sans-serif; font-size: small; color: gray}
+ .txt38{font-family: Georgia,"Times New Roman",Times,serif; font-size: xx-large}
+ .txt35{font-family: Georgia,"Times New Roman",Times,serif; font-size: large}
+ body{font-family: Verdana,Arial,Helvetica,sans-serif}
+ """
conversion_options = {
'comments' : description
,'tags' : category
@@ -28,10 +36,11 @@ class ElUniversal(BasicNewsRecipe):
,'publisher' : publisher
}
- keep_only_tags = [dict(name='div', attrs={'class':'Nota'})]
+ remove_tags_before=dict(attrs={'class':'header-print MB10'})
+ remove_tags_after= dict(attrs={'id':'SizeText'})
remove_tags = [
- dict(name=['object','link','script','iframe'])
- ,dict(name='div',attrs={'class':'Herramientas'})
+ dict(name=['object','link','script','iframe','meta'])
+ ,dict(attrs={'class':'header-print MB10'})
]
feeds = [
diff --git a/recipes/endgadget.recipe b/recipes/endgadget.recipe
index 8a2181fdc3..83d994a6da 100644
--- a/recipes/endgadget.recipe
+++ b/recipes/endgadget.recipe
@@ -1,7 +1,7 @@
#!/usr/bin/env python
__license__ = 'GPL v3'
-__copyright__ = '2008 - 2009, Darko Miletic '
+__copyright__ = 'Copyright 2011 Starson17'
'''
engadget.com
'''
@@ -9,14 +9,29 @@ engadget.com
from calibre.web.feeds.news import BasicNewsRecipe
class Engadget(BasicNewsRecipe):
- title = u'Engadget'
- __author__ = 'Darko Miletic'
+ title = u'Engadget_Full'
+ __author__ = 'Starson17'
+ __version__ = 'v1.00'
+ __date__ = '02, July 2011'
description = 'Tech news'
language = 'en'
oldest_article = 7
max_articles_per_feed = 100
no_stylesheets = True
- use_embedded_content = True
+ use_embedded_content = False
+ remove_javascript = True
+ remove_empty_feeds = True
- feeds = [ (u'Posts', u'http://www.engadget.com/rss.xml')]
+ keep_only_tags = [dict(name='div', attrs={'class':['post_content permalink ','post_content permalink alt-post-full']})]
+ remove_tags = [dict(name='div', attrs={'class':['filed_under','post_footer']})]
+ remove_tags_after = [dict(name='div', attrs={'class':['post_footer']})]
+
+ feeds = [(u'Posts', u'http://www.engadget.com/rss.xml')]
+
+ extra_css = '''
+ h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
+ h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
+ p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
+ body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
+ '''
diff --git a/recipes/financial_times.recipe b/recipes/financial_times.recipe
index e750b6f113..0079b2be3a 100644
--- a/recipes/financial_times.recipe
+++ b/recipes/financial_times.recipe
@@ -1,32 +1,41 @@
-#!/usr/bin/env python
-
__license__ = 'GPL v3'
-__copyright__ = '2008, Darko Miletic '
+__copyright__ = '2010-2011, Darko Miletic '
'''
-ft.com
+www.ft.com
'''
+import datetime
from calibre.web.feeds.news import BasicNewsRecipe
-class FinancialTimes(BasicNewsRecipe):
- title = u'Financial Times'
- __author__ = 'Darko Miletic and Sujata Raman'
- description = ('Financial world news. Available after 5AM '
- 'GMT, daily.')
+class FinancialTimes_rss(BasicNewsRecipe):
+ title = 'Financial Times'
+ __author__ = 'Darko Miletic'
+ description = "The Financial Times (FT) is one of the world's leading business news and information organisations, recognised internationally for its authority, integrity and accuracy."
+ publisher = 'The Financial Times Ltd.'
+ category = 'news, finances, politics, World'
oldest_article = 2
- language = 'en'
-
- max_articles_per_feed = 100
+ language = 'en'
+ max_articles_per_feed = 250
no_stylesheets = True
use_embedded_content = False
needs_subscription = True
- simultaneous_downloads= 1
- delay = 1
+ encoding = 'utf8'
+ publication_type = 'newspaper'
+ masthead_url = 'http://im.media.ft.com/m/img/masthead_main.jpg'
+ LOGIN = 'https://registration.ft.com/registration/barrier/login'
+ INDEX = 'http://www.ft.com'
- LOGIN = 'https://registration.ft.com/registration/barrier/login'
+ conversion_options = {
+ 'comment' : description
+ , 'tags' : category
+ , 'publisher' : publisher
+ , 'language' : language
+ , 'linearize_tables' : True
+ }
def get_browser(self):
br = BasicNewsRecipe.get_browser()
+ br.open(self.INDEX)
if self.username is not None and self.password is not None:
br.open(self.LOGIN)
br.select_form(name='loginForm')
@@ -35,31 +44,63 @@ class FinancialTimes(BasicNewsRecipe):
br.submit()
return br
- keep_only_tags = [ dict(name='div', attrs={'id':'cont'}) ]
- remove_tags_after = dict(name='p', attrs={'class':'copyright'})
+ keep_only_tags = [dict(name='div', attrs={'class':['fullstory fullstoryHeader','fullstory fullstoryBody','ft-story-header','ft-story-body','index-detail']})]
remove_tags = [
- dict(name='div', attrs={'id':'floating-con'})
+ dict(name='div', attrs={'id':'floating-con'})
+ ,dict(name=['meta','iframe','base','object','embed','link'])
+ ,dict(attrs={'class':['storyTools','story-package','screen-copy','story-package separator','expandable-image']})
]
+ remove_attributes = ['width','height','lang']
- extra_css = '''
- body{font-family:Arial,Helvetica,sans-serif;}
- h2(font-size:large;}
- .ft-story-header(font-size:xx-small;}
- .ft-story-body(font-size:small;}
- a{color:#003399;}
+ extra_css = """
+ body{font-family: Georgia,Times,"Times New Roman",serif}
+ h2{font-size:large}
+ .ft-story-header{font-size: x-small}
.container{font-size:x-small;}
h3{font-size:x-small;color:#003399;}
- '''
+ .copyright{font-size: x-small}
+ img{margin-top: 0.8em; display: block}
+ .lastUpdated{font-family: Arial,Helvetica,sans-serif; font-size: x-small}
+ .byline,.ft-story-body,.ft-story-header{font-family: Arial,Helvetica,sans-serif}
+ """
+
feeds = [
(u'UK' , u'http://www.ft.com/rss/home/uk' )
,(u'US' , u'http://www.ft.com/rss/home/us' )
- ,(u'Europe' , u'http://www.ft.com/rss/home/europe' )
,(u'Asia' , u'http://www.ft.com/rss/home/asia' )
,(u'Middle East', u'http://www.ft.com/rss/home/middleeast')
]
def preprocess_html(self, soup):
- content_type = soup.find('meta', {'http-equiv':'Content-Type'})
- if content_type:
- content_type['content'] = 'text/html; charset=utf-8'
+ items = ['promo-box','promo-title',
+ 'promo-headline','promo-image',
+ 'promo-intro','promo-link','subhead']
+ for item in items:
+ for it in soup.findAll(item):
+ it.name = 'div'
+ it.attrs = []
+ for item in soup.findAll(style=True):
+ del item['style']
+ for item in soup.findAll('a'):
+ limg = item.find('img')
+ if item.string is not None:
+ str = item.string
+ item.replaceWith(str)
+ else:
+ if limg:
+ item.name = 'div'
+ item.attrs = []
+ else:
+ str = self.tag_to_string(item)
+ item.replaceWith(str)
+ for item in soup.findAll('img'):
+ if not item.has_key('alt'):
+ item['alt'] = 'image'
return soup
+
+ def get_cover_url(self):
+ cdate = datetime.date.today()
+ if cdate.isoweekday() == 7:
+ cdate -= datetime.timedelta(days=1)
+ return cdate.strftime('http://specials.ft.com/vtf_pdf/%d%m%y_FRONT1_USA.pdf')
+
diff --git a/recipes/financial_times_uk.recipe b/recipes/financial_times_uk.recipe
index cf219cfda1..e06eb0dc77 100644
--- a/recipes/financial_times_uk.recipe
+++ b/recipes/financial_times_uk.recipe
@@ -1,15 +1,19 @@
__license__ = 'GPL v3'
__copyright__ = '2010-2011, Darko Miletic '
'''
-ft.com
+www.ft.com/uk-edition
'''
+
+import datetime
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe
class FinancialTimes(BasicNewsRecipe):
- title = u'Financial Times - UK printed edition'
+ title = 'Financial Times - UK printed edition'
__author__ = 'Darko Miletic'
- description = 'Financial world news'
+ description = "The Financial Times (FT) is one of the world's leading business news and information organisations, recognised internationally for its authority, integrity and accuracy."
+ publisher = 'The Financial Times Ltd.'
+ category = 'news, finances, politics, UK, World'
oldest_article = 2
language = 'en_GB'
max_articles_per_feed = 250
@@ -17,14 +21,23 @@ class FinancialTimes(BasicNewsRecipe):
use_embedded_content = False
needs_subscription = True
encoding = 'utf8'
- simultaneous_downloads= 1
- delay = 1
+ publication_type = 'newspaper'
+ masthead_url = 'http://im.media.ft.com/m/img/masthead_main.jpg'
LOGIN = 'https://registration.ft.com/registration/barrier/login'
INDEX = 'http://www.ft.com/uk-edition'
PREFIX = 'http://www.ft.com'
+ conversion_options = {
+ 'comment' : description
+ , 'tags' : category
+ , 'publisher' : publisher
+ , 'language' : language
+ , 'linearize_tables' : True
+ }
+
def get_browser(self):
br = BasicNewsRecipe.get_browser()
+ br.open(self.INDEX)
if self.username is not None and self.password is not None:
br.open(self.LOGIN)
br.select_form(name='loginForm')
@@ -33,29 +46,34 @@ class FinancialTimes(BasicNewsRecipe):
br.submit()
return br
- keep_only_tags = [ dict(name='div', attrs={'id':'cont'}) ]
- remove_tags_after = dict(name='p', attrs={'class':'copyright'})
+ keep_only_tags = [dict(name='div', attrs={'class':['fullstory fullstoryHeader','fullstory fullstoryBody','ft-story-header','ft-story-body','index-detail']})]
remove_tags = [
dict(name='div', attrs={'id':'floating-con'})
,dict(name=['meta','iframe','base','object','embed','link'])
+ ,dict(attrs={'class':['storyTools','story-package','screen-copy','story-package separator','expandable-image']})
]
remove_attributes = ['width','height','lang']
extra_css = """
- body{font-family:Arial,Helvetica,sans-serif;}
- h2{font-size:large;}
- .ft-story-header{font-size:xx-small;}
- .ft-story-body{font-size:small;}
- a{color:#003399;}
+ body{font-family: Georgia,Times,"Times New Roman",serif}
+ h2{font-size:large}
+ .ft-story-header{font-size: x-small}
.container{font-size:x-small;}
h3{font-size:x-small;color:#003399;}
.copyright{font-size: x-small}
+ img{margin-top: 0.8em; display: block}
+ .lastUpdated{font-family: Arial,Helvetica,sans-serif; font-size: x-small}
+ .byline,.ft-story-body,.ft-story-header{font-family: Arial,Helvetica,sans-serif}
"""
def get_artlinks(self, elem):
articles = []
for item in elem.findAll('a',href=True):
- url = self.PREFIX + item['href']
+ rawlink = item['href']
+ if rawlink.startswith('http://'):
+ url = rawlink
+ else:
+ url = self.PREFIX + rawlink
title = self.tag_to_string(item)
date = strftime(self.timefmt)
articles.append({
@@ -65,7 +83,7 @@ class FinancialTimes(BasicNewsRecipe):
,'description':''
})
return articles
-
+
def parse_index(self):
feeds = []
soup = self.index_to_soup(self.INDEX)
@@ -80,11 +98,41 @@ class FinancialTimes(BasicNewsRecipe):
strest.insert(0,st)
for item in strest:
ftitle = self.tag_to_string(item)
- self.report_progress(0, _('Fetching feed')+' %s...'%(ftitle))
+ self.report_progress(0, _('Fetching feed')+' %s...'%(ftitle))
feedarts = self.get_artlinks(item.parent.ul)
feeds.append((ftitle,feedarts))
return feeds
def preprocess_html(self, soup):
- return self.adeify_images(soup)
+ items = ['promo-box','promo-title',
+ 'promo-headline','promo-image',
+ 'promo-intro','promo-link','subhead']
+ for item in items:
+ for it in soup.findAll(item):
+ it.name = 'div'
+ it.attrs = []
+ for item in soup.findAll(style=True):
+ del item['style']
+ for item in soup.findAll('a'):
+ limg = item.find('img')
+ if item.string is not None:
+ str = item.string
+ item.replaceWith(str)
+ else:
+ if limg:
+ item.name = 'div'
+ item.attrs = []
+ else:
+ str = self.tag_to_string(item)
+ item.replaceWith(str)
+ for item in soup.findAll('img'):
+ if not item.has_key('alt'):
+ item['alt'] = 'image'
+ return soup
+ def get_cover_url(self):
+ cdate = datetime.date.today()
+ if cdate.isoweekday() == 7:
+ cdate -= datetime.timedelta(days=1)
+ return cdate.strftime('http://specials.ft.com/vtf_pdf/%d%m%y_FRONT1_LON.pdf')
+
\ No newline at end of file
diff --git a/recipes/frontlineonnet.recipe b/recipes/frontlineonnet.recipe
new file mode 100644
index 0000000000..3b65e4bb18
--- /dev/null
+++ b/recipes/frontlineonnet.recipe
@@ -0,0 +1,81 @@
+__license__ = 'GPL v3'
+__copyright__ = '2011, Darko Miletic '
+'''
+frontlineonnet.com
+'''
+
+import re
+from calibre import strftime
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class Frontlineonnet(BasicNewsRecipe):
+ title = 'Frontline'
+ __author__ = 'Darko Miletic'
+ description = "India's national magazine"
+ publisher = 'Frontline'
+ category = 'news, politics, India'
+ no_stylesheets = True
+ delay = 1
+ INDEX = 'http://frontlineonnet.com/'
+ use_embedded_content = False
+ encoding = 'cp1252'
+ language = 'en_IN'
+ publication_type = 'magazine'
+ masthead_url = 'http://frontlineonnet.com/images/newfline.jpg'
+ extra_css = """
+ body{font-family: Verdana,Arial,Helvetica,sans-serif}
+ img{margin-top:0.5em; margin-bottom: 0.7em; display: block}
+ """
+
+ conversion_options = {
+ 'comment' : description
+ , 'tags' : category
+ , 'publisher' : publisher
+ , 'language' : language
+ , 'linearize_tables' : True
+ }
+
+ preprocess_regexps = [
+ (re.compile(r'.*?title', re.DOTALL|re.IGNORECASE),lambda match: '')
+ ,(re.compile(r'', re.DOTALL|re.IGNORECASE),lambda match: '')
+ ,(re.compile(r'', re.DOTALL|re.IGNORECASE),lambda match: ' ')
+ ,(re.compile(r'', re.DOTALL|re.IGNORECASE),lambda match: '')
+ ,(re.compile(r'', re.DOTALL|re.IGNORECASE),lambda match: ' ')
+ ]
+
+ keep_only_tags= [
+ dict(name='font', attrs={'class':'storyhead'})
+ ,dict(attrs={'class':'byline'})
+ ]
+ remove_attributes=['size','noshade','border']
+
+ def preprocess_html(self, soup):
+ for item in soup.findAll(style=True):
+ del item['style']
+ for item in soup.findAll('img'):
+ if not item.has_key('alt'):
+ item['alt'] = 'image'
+ return soup
+
+ def parse_index(self):
+ articles = []
+ soup = self.index_to_soup(self.INDEX)
+ for feed_link in soup.findAll('a',href=True):
+ if feed_link['href'].startswith('stories/'):
+ url = self.INDEX + feed_link['href']
+ title = self.tag_to_string(feed_link)
+ date = strftime(self.timefmt)
+ articles.append({
+ 'title' :title
+ ,'date' :date
+ ,'url' :url
+ ,'description':''
+ })
+ return [('Frontline', articles)]
+
+ def print_version(self, url):
+ return "http://www.hinduonnet.com/thehindu/thscrip/print.pl?prd=fline&file=" + url.rpartition('/')[2]
+
+ def image_url_processor(self, baseurl, url):
+ return url.replace('../images/', self.INDEX + 'images/').strip()
diff --git a/recipes/go_comics.recipe b/recipes/go_comics.recipe
index a30ae1e94d..7062c0913d 100644
--- a/recipes/go_comics.recipe
+++ b/recipes/go_comics.recipe
@@ -11,8 +11,8 @@ import mechanize, re
class GoComics(BasicNewsRecipe):
title = 'GoComics'
__author__ = 'Starson17'
- __version__ = '1.05'
- __date__ = '19 may 2011'
+ __version__ = '1.06'
+ __date__ = '07 June 2011'
description = u'200+ Comics - Customize for more days/comics: Defaults to 7 days, 25 comics - 20 general, 5 editorial.'
category = 'news, comics'
language = 'en'
@@ -56,225 +56,318 @@ class GoComics(BasicNewsRecipe):
def parse_index(self):
feeds = []
for title, url in [
- ######## COMICS - GENERAL ########
- (u"2 Cows and a Chicken", u"http://www.gocomics.com/2cowsandachicken"),
- # (u"9 to 5", u"http://www.gocomics.com/9to5"),
- # (u"The Academia Waltz", u"http://www.gocomics.com/academiawaltz"),
- # (u"Adam@Home", u"http://www.gocomics.com/adamathome"),
- # (u"Agnes", u"http://www.gocomics.com/agnes"),
- # (u"Andy Capp", u"http://www.gocomics.com/andycapp"),
- # (u"Animal Crackers", u"http://www.gocomics.com/animalcrackers"),
- # (u"Annie", u"http://www.gocomics.com/annie"),
- (u"The Argyle Sweater", u"http://www.gocomics.com/theargylesweater"),
- # (u"Ask Shagg", u"http://www.gocomics.com/askshagg"),
- (u"B.C.", u"http://www.gocomics.com/bc"),
- # (u"Back in the Day", u"http://www.gocomics.com/backintheday"),
- # (u"Bad Reporter", u"http://www.gocomics.com/badreporter"),
- # (u"Baldo", u"http://www.gocomics.com/baldo"),
- # (u"Ballard Street", u"http://www.gocomics.com/ballardstreet"),
- # (u"Barkeater Lake", u"http://www.gocomics.com/barkeaterlake"),
- # (u"The Barn", u"http://www.gocomics.com/thebarn"),
- # (u"Basic Instructions", u"http://www.gocomics.com/basicinstructions"),
- # (u"Bewley", u"http://www.gocomics.com/bewley"),
- # (u"Big Top", u"http://www.gocomics.com/bigtop"),
- # (u"Biographic", u"http://www.gocomics.com/biographic"),
- (u"Birdbrains", u"http://www.gocomics.com/birdbrains"),
- # (u"Bleeker: The Rechargeable Dog", u"http://www.gocomics.com/bleeker"),
- # (u"Bliss", u"http://www.gocomics.com/bliss"),
- (u"Bloom County", u"http://www.gocomics.com/bloomcounty"),
- # (u"Bo Nanas", u"http://www.gocomics.com/bonanas"),
- # (u"Bob the Squirrel", u"http://www.gocomics.com/bobthesquirrel"),
- # (u"The Boiling Point", u"http://www.gocomics.com/theboilingpoint"),
- # (u"Boomerangs", u"http://www.gocomics.com/boomerangs"),
- # (u"The Boondocks", u"http://www.gocomics.com/boondocks"),
- # (u"Bottomliners", u"http://www.gocomics.com/bottomliners"),
- # (u"Bound and Gagged", u"http://www.gocomics.com/boundandgagged"),
- # (u"Brainwaves", u"http://www.gocomics.com/brainwaves"),
- # (u"Brenda Starr", u"http://www.gocomics.com/brendastarr"),
- # (u"Brewster Rockit", u"http://www.gocomics.com/brewsterrockit"),
- # (u"Broom Hilda", u"http://www.gocomics.com/broomhilda"),
- (u"Calvin and Hobbes", u"http://www.gocomics.com/calvinandhobbes"),
- # (u"Candorville", u"http://www.gocomics.com/candorville"),
- # (u"Cathy", u"http://www.gocomics.com/cathy"),
- # (u"C'est la Vie", u"http://www.gocomics.com/cestlavie"),
- # (u"Chuckle Bros", u"http://www.gocomics.com/chucklebros"),
- # (u"Citizen Dog", u"http://www.gocomics.com/citizendog"),
- # (u"The City", u"http://www.gocomics.com/thecity"),
- # (u"Cleats", u"http://www.gocomics.com/cleats"),
- # (u"Close to Home", u"http://www.gocomics.com/closetohome"),
- # (u"Compu-toon", u"http://www.gocomics.com/compu-toon"),
- # (u"Cornered", u"http://www.gocomics.com/cornered"),
- (u"Cul de Sac", u"http://www.gocomics.com/culdesac"),
- # (u"Daddy's Home", u"http://www.gocomics.com/daddyshome"),
- # (u"Deep Cover", u"http://www.gocomics.com/deepcover"),
- # (u"Dick Tracy", u"http://www.gocomics.com/dicktracy"),
- # (u"The Dinette Set", u"http://www.gocomics.com/dinetteset"),
- # (u"Dog Eat Doug", u"http://www.gocomics.com/dogeatdoug"),
- # (u"Domestic Abuse", u"http://www.gocomics.com/domesticabuse"),
- # (u"Doodles", u"http://www.gocomics.com/doodles"),
- (u"Doonesbury", u"http://www.gocomics.com/doonesbury"),
- # (u"The Doozies", u"http://www.gocomics.com/thedoozies"),
- # (u"The Duplex", u"http://www.gocomics.com/duplex"),
- # (u"Eek!", u"http://www.gocomics.com/eek"),
- # (u"The Elderberries", u"http://www.gocomics.com/theelderberries"),
- # (u"Flight Deck", u"http://www.gocomics.com/flightdeck"),
- # (u"Flo and Friends", u"http://www.gocomics.com/floandfriends"),
- # (u"The Flying McCoys", u"http://www.gocomics.com/theflyingmccoys"),
- (u"For Better or For Worse", u"http://www.gocomics.com/forbetterorforworse"),
- # (u"For Heaven's Sake", u"http://www.gocomics.com/forheavenssake"),
- # (u"Fort Knox", u"http://www.gocomics.com/fortknox"),
- # (u"FoxTrot", u"http://www.gocomics.com/foxtrot"),
- (u"FoxTrot Classics", u"http://www.gocomics.com/foxtrotclassics"),
- # (u"Frank & Ernest", u"http://www.gocomics.com/frankandernest"),
- # (u"Fred Basset", u"http://www.gocomics.com/fredbasset"),
- # (u"Free Range", u"http://www.gocomics.com/freerange"),
- # (u"Frog Applause", u"http://www.gocomics.com/frogapplause"),
- # (u"The Fusco Brothers", u"http://www.gocomics.com/thefuscobrothers"),
- (u"Garfield", u"http://www.gocomics.com/garfield"),
- # (u"Garfield Minus Garfield", u"http://www.gocomics.com/garfieldminusgarfield"),
- # (u"Gasoline Alley", u"http://www.gocomics.com/gasolinealley"),
- # (u"Gil Thorp", u"http://www.gocomics.com/gilthorp"),
- # (u"Ginger Meggs", u"http://www.gocomics.com/gingermeggs"),
- # (u"Girls & Sports", u"http://www.gocomics.com/girlsandsports"),
- # (u"Haiku Ewe", u"http://www.gocomics.com/haikuewe"),
- # (u"Heart of the City", u"http://www.gocomics.com/heartofthecity"),
- # (u"Heathcliff", u"http://www.gocomics.com/heathcliff"),
- # (u"Herb and Jamaal", u"http://www.gocomics.com/herbandjamaal"),
- # (u"Home and Away", u"http://www.gocomics.com/homeandaway"),
- # (u"Housebroken", u"http://www.gocomics.com/housebroken"),
- # (u"Hubert and Abby", u"http://www.gocomics.com/hubertandabby"),
- # (u"Imagine This", u"http://www.gocomics.com/imaginethis"),
- # (u"In the Bleachers", u"http://www.gocomics.com/inthebleachers"),
- # (u"In the Sticks", u"http://www.gocomics.com/inthesticks"),
- # (u"Ink Pen", u"http://www.gocomics.com/inkpen"),
- # (u"It's All About You", u"http://www.gocomics.com/itsallaboutyou"),
- # (u"Joe Vanilla", u"http://www.gocomics.com/joevanilla"),
- # (u"La Cucaracha", u"http://www.gocomics.com/lacucaracha"),
- # (u"Last Kiss", u"http://www.gocomics.com/lastkiss"),
- # (u"Legend of Bill", u"http://www.gocomics.com/legendofbill"),
- # (u"Liberty Meadows", u"http://www.gocomics.com/libertymeadows"),
- (u"Lio", u"http://www.gocomics.com/lio"),
- # (u"Little Dog Lost", u"http://www.gocomics.com/littledoglost"),
- # (u"Little Otto", u"http://www.gocomics.com/littleotto"),
- # (u"Loose Parts", u"http://www.gocomics.com/looseparts"),
- # (u"Love Is...", u"http://www.gocomics.com/loveis"),
- # (u"Maintaining", u"http://www.gocomics.com/maintaining"),
- # (u"The Meaning of Lila", u"http://www.gocomics.com/meaningoflila"),
- # (u"Middle-Aged White Guy", u"http://www.gocomics.com/middleagedwhiteguy"),
- # (u"The Middletons", u"http://www.gocomics.com/themiddletons"),
- # (u"Momma", u"http://www.gocomics.com/momma"),
- # (u"Mutt & Jeff", u"http://www.gocomics.com/muttandjeff"),
- # (u"Mythtickle", u"http://www.gocomics.com/mythtickle"),
- # (u"Nest Heads", u"http://www.gocomics.com/nestheads"),
- # (u"NEUROTICA", u"http://www.gocomics.com/neurotica"),
- (u"New Adventures of Queen Victoria", u"http://www.gocomics.com/thenewadventuresofqueenvictoria"),
- (u"Non Sequitur", u"http://www.gocomics.com/nonsequitur"),
- # (u"The Norm", u"http://www.gocomics.com/thenorm"),
- # (u"On A Claire Day", u"http://www.gocomics.com/onaclaireday"),
- # (u"One Big Happy", u"http://www.gocomics.com/onebighappy"),
- # (u"The Other Coast", u"http://www.gocomics.com/theothercoast"),
- # (u"Out of the Gene Pool Re-Runs", u"http://www.gocomics.com/outofthegenepool"),
- # (u"Overboard", u"http://www.gocomics.com/overboard"),
- # (u"Pibgorn", u"http://www.gocomics.com/pibgorn"),
- # (u"Pibgorn Sketches", u"http://www.gocomics.com/pibgornsketches"),
- (u"Pickles", u"http://www.gocomics.com/pickles"),
- # (u"Pinkerton", u"http://www.gocomics.com/pinkerton"),
- # (u"Pluggers", u"http://www.gocomics.com/pluggers"),
- (u"Pooch Cafe", u"http://www.gocomics.com/poochcafe"),
- # (u"PreTeena", u"http://www.gocomics.com/preteena"),
- # (u"The Quigmans", u"http://www.gocomics.com/thequigmans"),
- # (u"Rabbits Against Magic", u"http://www.gocomics.com/rabbitsagainstmagic"),
- (u"Real Life Adventures", u"http://www.gocomics.com/reallifeadventures"),
- # (u"Red and Rover", u"http://www.gocomics.com/redandrover"),
- # (u"Red Meat", u"http://www.gocomics.com/redmeat"),
- # (u"Reynolds Unwrapped", u"http://www.gocomics.com/reynoldsunwrapped"),
- # (u"Ronaldinho Gaucho", u"http://www.gocomics.com/ronaldinhogaucho"),
- # (u"Rubes", u"http://www.gocomics.com/rubes"),
- # (u"Scary Gary", u"http://www.gocomics.com/scarygary"),
- (u"Shoe", u"http://www.gocomics.com/shoe"),
- # (u"Shoecabbage", u"http://www.gocomics.com/shoecabbage"),
- # (u"Skin Horse", u"http://www.gocomics.com/skinhorse"),
- # (u"Slowpoke", u"http://www.gocomics.com/slowpoke"),
- # (u"Speed Bump", u"http://www.gocomics.com/speedbump"),
- # (u"State of the Union", u"http://www.gocomics.com/stateoftheunion"),
- (u"Stone Soup", u"http://www.gocomics.com/stonesoup"),
- # (u"Strange Brew", u"http://www.gocomics.com/strangebrew"),
- # (u"Sylvia", u"http://www.gocomics.com/sylvia"),
- # (u"Tank McNamara", u"http://www.gocomics.com/tankmcnamara"),
- # (u"Tiny Sepuku", u"http://www.gocomics.com/tinysepuku"),
- # (u"TOBY", u"http://www.gocomics.com/toby"),
- # (u"Tom the Dancing Bug", u"http://www.gocomics.com/tomthedancingbug"),
- # (u"Too Much Coffee Man", u"http://www.gocomics.com/toomuchcoffeeman"),
- # (u"W.T. Duck", u"http://www.gocomics.com/wtduck"),
- # (u"Watch Your Head", u"http://www.gocomics.com/watchyourhead"),
- # (u"Wee Pals", u"http://www.gocomics.com/weepals"),
- # (u"Winnie the Pooh", u"http://www.gocomics.com/winniethepooh"),
- (u"Wizard of Id", u"http://www.gocomics.com/wizardofid"),
- # (u"Working It Out", u"http://www.gocomics.com/workingitout"),
- # (u"Yenny", u"http://www.gocomics.com/yenny"),
- # (u"Zack Hill", u"http://www.gocomics.com/zackhill"),
- (u"Ziggy", u"http://www.gocomics.com/ziggy"),
- ######## COMICS - EDITORIAL ########
- ("Lalo Alcaraz","http://www.gocomics.com/laloalcaraz"),
- ("Nick Anderson","http://www.gocomics.com/nickanderson"),
- ("Chuck Asay","http://www.gocomics.com/chuckasay"),
- ("Tony Auth","http://www.gocomics.com/tonyauth"),
- ("Donna Barstow","http://www.gocomics.com/donnabarstow"),
- # ("Bruce Beattie","http://www.gocomics.com/brucebeattie"),
- # ("Clay Bennett","http://www.gocomics.com/claybennett"),
- # ("Lisa Benson","http://www.gocomics.com/lisabenson"),
- # ("Steve Benson","http://www.gocomics.com/stevebenson"),
- # ("Chip Bok","http://www.gocomics.com/chipbok"),
- # ("Steve Breen","http://www.gocomics.com/stevebreen"),
- # ("Chris Britt","http://www.gocomics.com/chrisbritt"),
- # ("Stuart Carlson","http://www.gocomics.com/stuartcarlson"),
- # ("Ken Catalino","http://www.gocomics.com/kencatalino"),
- # ("Paul Conrad","http://www.gocomics.com/paulconrad"),
- # ("Jeff Danziger","http://www.gocomics.com/jeffdanziger"),
- # ("Matt Davies","http://www.gocomics.com/mattdavies"),
- # ("John Deering","http://www.gocomics.com/johndeering"),
- # ("Bob Gorrell","http://www.gocomics.com/bobgorrell"),
- # ("Walt Handelsman","http://www.gocomics.com/walthandelsman"),
- # ("Clay Jones","http://www.gocomics.com/clayjones"),
- # ("Kevin Kallaugher","http://www.gocomics.com/kevinkallaugher"),
- # ("Steve Kelley","http://www.gocomics.com/stevekelley"),
- # ("Dick Locher","http://www.gocomics.com/dicklocher"),
- # ("Chan Lowe","http://www.gocomics.com/chanlowe"),
- # ("Mike Luckovich","http://www.gocomics.com/mikeluckovich"),
- # ("Gary Markstein","http://www.gocomics.com/garymarkstein"),
- # ("Glenn McCoy","http://www.gocomics.com/glennmccoy"),
- # ("Jim Morin","http://www.gocomics.com/jimmorin"),
- # ("Jack Ohman","http://www.gocomics.com/jackohman"),
- # ("Pat Oliphant","http://www.gocomics.com/patoliphant"),
- # ("Joel Pett","http://www.gocomics.com/joelpett"),
- # ("Ted Rall","http://www.gocomics.com/tedrall"),
- # ("Michael Ramirez","http://www.gocomics.com/michaelramirez"),
- # ("Marshall Ramsey","http://www.gocomics.com/marshallramsey"),
- # ("Steve Sack","http://www.gocomics.com/stevesack"),
- # ("Ben Sargent","http://www.gocomics.com/bensargent"),
- # ("Drew Sheneman","http://www.gocomics.com/drewsheneman"),
- # ("John Sherffius","http://www.gocomics.com/johnsherffius"),
- # ("Small World","http://www.gocomics.com/smallworld"),
- # ("Scott Stantis","http://www.gocomics.com/scottstantis"),
- # ("Wayne Stayskal","http://www.gocomics.com/waynestayskal"),
- # ("Dana Summers","http://www.gocomics.com/danasummers"),
- # ("Paul Szep","http://www.gocomics.com/paulszep"),
- # ("Mike Thompson","http://www.gocomics.com/mikethompson"),
- # ("Tom Toles","http://www.gocomics.com/tomtoles"),
- # ("Gary Varvel","http://www.gocomics.com/garyvarvel"),
- # ("ViewsAfrica","http://www.gocomics.com/viewsafrica"),
- # ("ViewsAmerica","http://www.gocomics.com/viewsamerica"),
- # ("ViewsAsia","http://www.gocomics.com/viewsasia"),
- # ("ViewsBusiness","http://www.gocomics.com/viewsbusiness"),
- # ("ViewsEurope","http://www.gocomics.com/viewseurope"),
- # ("ViewsLatinAmerica","http://www.gocomics.com/viewslatinamerica"),
- # ("ViewsMidEast","http://www.gocomics.com/viewsmideast"),
- # ("Views of the World","http://www.gocomics.com/viewsoftheworld"),
- # ("Kerry Waghorn","http://www.gocomics.com/facesinthenews"),
- # ("Dan Wasserman","http://www.gocomics.com/danwasserman"),
- # ("Signe Wilkinson","http://www.gocomics.com/signewilkinson"),
- # ("Wit of the World","http://www.gocomics.com/witoftheworld"),
- # ("Don Wright","http://www.gocomics.com/donwright"),
+ (u"2 Cows and a Chicken", u"http://www.gocomics.com/2cowsandachicken"),
+ #(u"9 Chickweed Lane", u"http://www.gocomics.com/9chickweedlane"),
+ (u"9 to 5", u"http://www.gocomics.com/9to5"),
+ #(u"Adam At Home", u"http://www.gocomics.com/adamathome"),
+ (u"Agnes", u"http://www.gocomics.com/agnes"),
+ #(u"Alley Oop", u"http://www.gocomics.com/alleyoop"),
+ #(u"Andy Capp", u"http://www.gocomics.com/andycapp"),
+ #(u"Animal Crackers", u"http://www.gocomics.com/animalcrackers"),
+ #(u"Annie", u"http://www.gocomics.com/annie"),
+ #(u"Arlo & Janis", u"http://www.gocomics.com/arloandjanis"),
+ #(u"Ask Shagg", u"http://www.gocomics.com/askshagg"),
+ (u"B.C.", u"http://www.gocomics.com/bc"),
+ #(u"Back in the Day", u"http://www.gocomics.com/backintheday"),
+ #(u"Bad Reporter", u"http://www.gocomics.com/badreporter"),
+ #(u"Baldo", u"http://www.gocomics.com/baldo"),
+ #(u"Ballard Street", u"http://www.gocomics.com/ballardstreet"),
+ #(u"Barkeater Lake", u"http://www.gocomics.com/barkeaterlake"),
+ #(u"Basic Instructions", u"http://www.gocomics.com/basicinstructions"),
+ #(u"Ben", u"http://www.gocomics.com/ben"),
+ #(u"Betty", u"http://www.gocomics.com/betty"),
+ #(u"Bewley", u"http://www.gocomics.com/bewley"),
+ #(u"Big Nate", u"http://www.gocomics.com/bignate"),
+ #(u"Big Top", u"http://www.gocomics.com/bigtop"),
+ #(u"Biographic", u"http://www.gocomics.com/biographic"),
+ #(u"Birdbrains", u"http://www.gocomics.com/birdbrains"),
+ #(u"Bleeker: The Rechargeable Dog", u"http://www.gocomics.com/bleeker"),
+ #(u"Bliss", u"http://www.gocomics.com/bliss"),
+ (u"Bloom County", u"http://www.gocomics.com/bloomcounty"),
+ #(u"Bo Nanas", u"http://www.gocomics.com/bonanas"),
+ #(u"Bob the Squirrel", u"http://www.gocomics.com/bobthesquirrel"),
+ #(u"Boomerangs", u"http://www.gocomics.com/boomerangs"),
+ #(u"Bottomliners", u"http://www.gocomics.com/bottomliners"),
+ #(u"Bound and Gagged", u"http://www.gocomics.com/boundandgagged"),
+ #(u"Brainwaves", u"http://www.gocomics.com/brainwaves"),
+ #(u"Brenda Starr", u"http://www.gocomics.com/brendastarr"),
+ #(u"Brevity", u"http://www.gocomics.com/brevity"),
+ #(u"Brewster Rockit", u"http://www.gocomics.com/brewsterrockit"),
+ #(u"Broom Hilda", u"http://www.gocomics.com/broomhilda"),
+ (u"Calvin and Hobbes", u"http://www.gocomics.com/calvinandhobbes"),
+ #(u"Candorville", u"http://www.gocomics.com/candorville"),
+ #(u"Cathy", u"http://www.gocomics.com/cathy"),
+ #(u"C'est la Vie", u"http://www.gocomics.com/cestlavie"),
+ #(u"Cheap Thrills", u"http://www.gocomics.com/cheapthrills"),
+ #(u"Chuckle Bros", u"http://www.gocomics.com/chucklebros"),
+ #(u"Citizen Dog", u"http://www.gocomics.com/citizendog"),
+ #(u"Cleats", u"http://www.gocomics.com/cleats"),
+ #(u"Close to Home", u"http://www.gocomics.com/closetohome"),
+ #(u"Committed", u"http://www.gocomics.com/committed"),
+ #(u"Compu-toon", u"http://www.gocomics.com/compu-toon"),
+ #(u"Cornered", u"http://www.gocomics.com/cornered"),
+ #(u"Cow & Boy", u"http://www.gocomics.com/cow&boy"),
+ #(u"Cul de Sac", u"http://www.gocomics.com/culdesac"),
+ #(u"Daddy's Home", u"http://www.gocomics.com/daddyshome"),
+ #(u"Deep Cover", u"http://www.gocomics.com/deepcover"),
+ #(u"Dick Tracy", u"http://www.gocomics.com/dicktracy"),
+ (u"Dog Eat Doug", u"http://www.gocomics.com/dogeatdoug"),
+ #(u"Domestic Abuse", u"http://www.gocomics.com/domesticabuse"),
+ (u"Doodles", u"http://www.gocomics.com/doodles"),
+ (u"Doonesbury", u"http://www.gocomics.com/doonesbury"),
+ #(u"Drabble", u"http://www.gocomics.com/drabble"),
+ #(u"Eek!", u"http://www.gocomics.com/eek"),
+ #(u"F Minus", u"http://www.gocomics.com/fminus"),
+ #(u"Family Tree", u"http://www.gocomics.com/familytree"),
+ #(u"Farcus", u"http://www.gocomics.com/farcus"),
+ (u"Fat Cats Classics", u"http://www.gocomics.com/fatcatsclassics"),
+ #(u"Ferd'nand", u"http://www.gocomics.com/ferdnand"),
+ #(u"Flight Deck", u"http://www.gocomics.com/flightdeck"),
+ (u"Flo and Friends", u"http://www.gocomics.com/floandfriends"),
+ #(u"For Better or For Worse", u"http://www.gocomics.com/forbetterorforworse"),
+ #(u"For Heaven's Sake", u"http://www.gocomics.com/forheavenssake"),
+ #(u"Fort Knox", u"http://www.gocomics.com/fortknox"),
+ #(u"FoxTrot Classics", u"http://www.gocomics.com/foxtrotclassics"),
+ (u"FoxTrot", u"http://www.gocomics.com/foxtrot"),
+ #(u"Frank & Ernest", u"http://www.gocomics.com/frankandernest"),
+ #(u"Frazz", u"http://www.gocomics.com/frazz"),
+ #(u"Fred Basset", u"http://www.gocomics.com/fredbasset"),
+ #(u"Free Range", u"http://www.gocomics.com/freerange"),
+ #(u"Frog Applause", u"http://www.gocomics.com/frogapplause"),
+ #(u"Garfield Minus Garfield", u"http://www.gocomics.com/garfieldminusgarfield"),
+ (u"Garfield", u"http://www.gocomics.com/garfield"),
+ #(u"Gasoline Alley", u"http://www.gocomics.com/gasolinealley"),
+ #(u"Geech Classics", u"http://www.gocomics.com/geechclassics"),
+ #(u"Get Fuzzy", u"http://www.gocomics.com/getfuzzy"),
+ #(u"Gil Thorp", u"http://www.gocomics.com/gilthorp"),
+ #(u"Ginger Meggs", u"http://www.gocomics.com/gingermeggs"),
+ #(u"Girls & Sports", u"http://www.gocomics.com/girlsandsports"),
+ #(u"Graffiti", u"http://www.gocomics.com/graffiti"),
+ #(u"Grand Avenue", u"http://www.gocomics.com/grandavenue"),
+ #(u"Haiku Ewe", u"http://www.gocomics.com/haikuewe"),
+ #(u"Heart of the City", u"http://www.gocomics.com/heartofthecity"),
+ (u"Heathcliff", u"http://www.gocomics.com/heathcliff"),
+ #(u"Herb and Jamaal", u"http://www.gocomics.com/herbandjamaal"),
+ #(u"Herman", u"http://www.gocomics.com/herman"),
+ #(u"Home and Away", u"http://www.gocomics.com/homeandaway"),
+ #(u"Housebroken", u"http://www.gocomics.com/housebroken"),
+ #(u"Hubert and Abby", u"http://www.gocomics.com/hubertandabby"),
+ #(u"Imagine This", u"http://www.gocomics.com/imaginethis"),
+ #(u"In the Bleachers", u"http://www.gocomics.com/inthebleachers"),
+ #(u"In the Sticks", u"http://www.gocomics.com/inthesticks"),
+ #(u"Ink Pen", u"http://www.gocomics.com/inkpen"),
+ #(u"It's All About You", u"http://www.gocomics.com/itsallaboutyou"),
+ #(u"Jane's World", u"http://www.gocomics.com/janesworld"),
+ #(u"Joe Vanilla", u"http://www.gocomics.com/joevanilla"),
+ #(u"Jump Start", u"http://www.gocomics.com/jumpstart"),
+ #(u"Kit 'N' Carlyle", u"http://www.gocomics.com/kitandcarlyle"),
+ #(u"La Cucaracha", u"http://www.gocomics.com/lacucaracha"),
+ #(u"Last Kiss", u"http://www.gocomics.com/lastkiss"),
+ #(u"Legend of Bill", u"http://www.gocomics.com/legendofbill"),
+ #(u"Liberty Meadows", u"http://www.gocomics.com/libertymeadows"),
+ #(u"Li'l Abner Classics", u"http://www.gocomics.com/lilabnerclassics"),
+ #(u"Lio", u"http://www.gocomics.com/lio"),
+ #(u"Little Dog Lost", u"http://www.gocomics.com/littledoglost"),
+ #(u"Little Otto", u"http://www.gocomics.com/littleotto"),
+ #(u"Lola", u"http://www.gocomics.com/lola"),
+ #(u"Loose Parts", u"http://www.gocomics.com/looseparts"),
+ #(u"Love Is...", u"http://www.gocomics.com/loveis"),
+ #(u"Luann", u"http://www.gocomics.com/luann"),
+ #(u"Maintaining", u"http://www.gocomics.com/maintaining"),
+ (u"Marmaduke", u"http://www.gocomics.com/marmaduke"),
+ #(u"Meg! Classics", u"http://www.gocomics.com/megclassics"),
+ #(u"Middle-Aged White Guy", u"http://www.gocomics.com/middleagedwhiteguy"),
+ #(u"Minimum Security", u"http://www.gocomics.com/minimumsecurity"),
+ #(u"Moderately Confused", u"http://www.gocomics.com/moderatelyconfused"),
+ (u"Momma", u"http://www.gocomics.com/momma"),
+ #(u"Monty", u"http://www.gocomics.com/monty"),
+ #(u"Motley Classics", u"http://www.gocomics.com/motleyclassics"),
+ (u"Mutt & Jeff", u"http://www.gocomics.com/muttandjeff"),
+ #(u"Mythtickle", u"http://www.gocomics.com/mythtickle"),
+ #(u"Nancy", u"http://www.gocomics.com/nancy"),
+ #(u"Natural Selection", u"http://www.gocomics.com/naturalselection"),
+ #(u"Nest Heads", u"http://www.gocomics.com/nestheads"),
+ #(u"NEUROTICA", u"http://www.gocomics.com/neurotica"),
+ #(u"New Adventures of Queen Victoria", u"http://www.gocomics.com/thenewadventuresofqueenvictoria"),
+ #(u"Non Sequitur", u"http://www.gocomics.com/nonsequitur"),
+ #(u"Off The Mark", u"http://www.gocomics.com/offthemark"),
+ #(u"On A Claire Day", u"http://www.gocomics.com/onaclaireday"),
+ #(u"One Big Happy Classics", u"http://www.gocomics.com/onebighappyclassics"),
+ #(u"One Big Happy", u"http://www.gocomics.com/onebighappy"),
+ #(u"Out of the Gene Pool Re-Runs", u"http://www.gocomics.com/outofthegenepool"),
+ #(u"Over the Hedge", u"http://www.gocomics.com/overthehedge"),
+ #(u"Overboard", u"http://www.gocomics.com/overboard"),
+ #(u"PC and Pixel", u"http://www.gocomics.com/pcandpixel"),
+ (u"Peanuts", u"http://www.gocomics.com/peanuts"),
+ #(u"Pearls Before Swine", u"http://www.gocomics.com/pearlsbeforeswine"),
+ #(u"Pibgorn Sketches", u"http://www.gocomics.com/pibgornsketches"),
+ #(u"Pibgorn", u"http://www.gocomics.com/pibgorn"),
+ (u"Pickles", u"http://www.gocomics.com/pickles"),
+ #(u"Pinkerton", u"http://www.gocomics.com/pinkerton"),
+ #(u"Pluggers", u"http://www.gocomics.com/pluggers"),
+ #(u"Pooch Cafe", u"http://www.gocomics.com/poochcafe"),
+ #(u"PreTeena", u"http://www.gocomics.com/preteena"),
+ #(u"Prickly City", u"http://www.gocomics.com/pricklycity"),
+ #(u"Rabbits Against Magic", u"http://www.gocomics.com/rabbitsagainstmagic"),
+ #(u"Raising Duncan Classics", u"http://www.gocomics.com/raisingduncanclassics"),
+ #(u"Real Life Adventures", u"http://www.gocomics.com/reallifeadventures"),
+ #(u"Reality Check", u"http://www.gocomics.com/realitycheck"),
+ #(u"Red and Rover", u"http://www.gocomics.com/redandrover"),
+ #(u"Red Meat", u"http://www.gocomics.com/redmeat"),
+ #(u"Reynolds Unwrapped", u"http://www.gocomics.com/reynoldsunwrapped"),
+ #(u"Rip Haywire", u"http://www.gocomics.com/riphaywire"),
+ #(u"Ripley's Believe It or Not!", u"http://www.gocomics.com/ripleysbelieveitornot"),
+ #(u"Ronaldinho Gaucho", u"http://www.gocomics.com/ronaldinhogaucho"),
+ #(u"Rose Is Rose", u"http://www.gocomics.com/roseisrose"),
+ #(u"Rubes", u"http://www.gocomics.com/rubes"),
+ #(u"Rudy Park", u"http://www.gocomics.com/rudypark"),
+ #(u"Scary Gary", u"http://www.gocomics.com/scarygary"),
+ #(u"Shirley and Son Classics", u"http://www.gocomics.com/shirleyandsonclassics"),
+ #(u"Shoe", u"http://www.gocomics.com/shoe"),
+ #(u"Shoecabbage", u"http://www.gocomics.com/shoecabbage"),
+ #(u"Skin Horse", u"http://www.gocomics.com/skinhorse"),
+ #(u"Slowpoke", u"http://www.gocomics.com/slowpoke"),
+ #(u"Soup To Nutz", u"http://www.gocomics.com/souptonutz"),
+ #(u"Speed Bump", u"http://www.gocomics.com/speedbump"),
+ #(u"Spot The Frog", u"http://www.gocomics.com/spotthefrog"),
+ #(u"State of the Union", u"http://www.gocomics.com/stateoftheunion"),
+ #(u"Stone Soup", u"http://www.gocomics.com/stonesoup"),
+ #(u"Strange Brew", u"http://www.gocomics.com/strangebrew"),
+ #(u"Sylvia", u"http://www.gocomics.com/sylvia"),
+ #(u"Tank McNamara", u"http://www.gocomics.com/tankmcnamara"),
+ #(u"Tarzan Classics", u"http://www.gocomics.com/tarzanclassics"),
+ #(u"That's Life", u"http://www.gocomics.com/thatslife"),
+ #(u"The Academia Waltz", u"http://www.gocomics.com/academiawaltz"),
+ #(u"The Argyle Sweater", u"http://www.gocomics.com/theargylesweater"),
+ #(u"The Barn", u"http://www.gocomics.com/thebarn"),
+ #(u"The Boiling Point", u"http://www.gocomics.com/theboilingpoint"),
+ #(u"The Boondocks", u"http://www.gocomics.com/boondocks"),
+ #(u"The Born Loser", u"http://www.gocomics.com/thebornloser"),
+ #(u"The Buckets", u"http://www.gocomics.com/thebuckets"),
+ #(u"The City", u"http://www.gocomics.com/thecity"),
+ #(u"The Dinette Set", u"http://www.gocomics.com/dinetteset"),
+ #(u"The Doozies", u"http://www.gocomics.com/thedoozies"),
+ #(u"The Duplex", u"http://www.gocomics.com/duplex"),
+ #(u"The Elderberries", u"http://www.gocomics.com/theelderberries"),
+ #(u"The Flying McCoys", u"http://www.gocomics.com/theflyingmccoys"),
+ #(u"The Fusco Brothers", u"http://www.gocomics.com/thefuscobrothers"),
+ #(u"The Grizzwells", u"http://www.gocomics.com/thegrizzwells"),
+ #(u"The Humble Stumble", u"http://www.gocomics.com/thehumblestumble"),
+ #(u"The Knight Life", u"http://www.gocomics.com/theknightlife"),
+ #(u"The Meaning of Lila", u"http://www.gocomics.com/meaningoflila"),
+ #(u"The Middletons", u"http://www.gocomics.com/themiddletons"),
+ #(u"The Norm", u"http://www.gocomics.com/thenorm"),
+ #(u"The Other Coast", u"http://www.gocomics.com/theothercoast"),
+ #(u"The Quigmans", u"http://www.gocomics.com/thequigmans"),
+ #(u"The Sunshine Club", u"http://www.gocomics.com/thesunshineclub"),
+ #(u"Tiny Sepuku", u"http://www.gocomics.com/tinysepuku"),
+ #(u"TOBY", u"http://www.gocomics.com/toby"),
+ #(u"Tom the Dancing Bug", u"http://www.gocomics.com/tomthedancingbug"),
+ #(u"Too Much Coffee Man", u"http://www.gocomics.com/toomuchcoffeeman"),
+ #(u"Unstrange Phenomena", u"http://www.gocomics.com/unstrangephenomena"),
+ #(u"W.T. Duck", u"http://www.gocomics.com/wtduck"),
+ #(u"Watch Your Head", u"http://www.gocomics.com/watchyourhead"),
+ #(u"Wee Pals", u"http://www.gocomics.com/weepals"),
+ #(u"Winnie the Pooh", u"http://www.gocomics.com/winniethepooh"),
+ #(u"Wizard of Id", u"http://www.gocomics.com/wizardofid"),
+ #(u"Working Daze", u"http://www.gocomics.com/workingdaze"),
+ #(u"Working It Out", u"http://www.gocomics.com/workingitout"),
+ #(u"Yenny", u"http://www.gocomics.com/yenny"),
+ #(u"Zack Hill", u"http://www.gocomics.com/zackhill"),
+ (u"Ziggy", u"http://www.gocomics.com/ziggy"),
+ #
+ ######## EDITORIAL CARTOONS #####################
+ (u"Adam Zyglis", u"http://www.gocomics.com/adamzyglis"),
+ #(u"Andy Singer", u"http://www.gocomics.com/andysinger"),
+ #(u"Ben Sargent",u"http://www.gocomics.com/bensargent"),
+ #(u"Bill Day", u"http://www.gocomics.com/billday"),
+ #(u"Bill Schorr", u"http://www.gocomics.com/billschorr"),
+ #(u"Bob Englehart", u"http://www.gocomics.com/bobenglehart"),
+ (u"Bob Gorrell",u"http://www.gocomics.com/bobgorrell"),
+ #(u"Brian Fairrington", u"http://www.gocomics.com/brianfairrington"),
+ #(u"Bruce Beattie", u"http://www.gocomics.com/brucebeattie"),
+ #(u"Cam Cardow", u"http://www.gocomics.com/camcardow"),
+ #(u"Chan Lowe",u"http://www.gocomics.com/chanlowe"),
+ #(u"Chip Bok",u"http://www.gocomics.com/chipbok"),
+ #(u"Chris Britt",u"http://www.gocomics.com/chrisbritt"),
+ #(u"Chuck Asay",u"http://www.gocomics.com/chuckasay"),
+ #(u"Clay Bennett",u"http://www.gocomics.com/claybennett"),
+ #(u"Clay Jones",u"http://www.gocomics.com/clayjones"),
+ #(u"Dan Wasserman",u"http://www.gocomics.com/danwasserman"),
+ #(u"Dana Summers",u"http://www.gocomics.com/danasummers"),
+ #(u"Daryl Cagle", u"http://www.gocomics.com/darylcagle"),
+ #(u"David Fitzsimmons", u"http://www.gocomics.com/davidfitzsimmons"),
+ (u"Dick Locher",u"http://www.gocomics.com/dicklocher"),
+ #(u"Don Wright",u"http://www.gocomics.com/donwright"),
+ #(u"Donna Barstow",u"http://www.gocomics.com/donnabarstow"),
+ #(u"Drew Litton", u"http://www.gocomics.com/drewlitton"),
+ #(u"Drew Sheneman",u"http://www.gocomics.com/drewsheneman"),
+ #(u"Ed Stein", u"http://www.gocomics.com/edstein"),
+ #(u"Eric Allie", u"http://www.gocomics.com/ericallie"),
+ #(u"Gary Markstein", u"http://www.gocomics.com/garymarkstein"),
+ #(u"Gary McCoy", u"http://www.gocomics.com/garymccoy"),
+ #(u"Gary Varvel", u"http://www.gocomics.com/garyvarvel"),
+ #(u"Glenn McCoy",u"http://www.gocomics.com/glennmccoy"),
+ #(u"Henry Payne", u"http://www.gocomics.com/henrypayne"),
+ #(u"Jack Ohman",u"http://www.gocomics.com/jackohman"),
+ #(u"JD Crowe", u"http://www.gocomics.com/jdcrowe"),
+ #(u"Jeff Danziger",u"http://www.gocomics.com/jeffdanziger"),
+ #(u"Jeff Parker", u"http://www.gocomics.com/jeffparker"),
+ #(u"Jeff Stahler", u"http://www.gocomics.com/jeffstahler"),
+ #(u"Jerry Holbert", u"http://www.gocomics.com/jerryholbert"),
+ #(u"Jim Morin",u"http://www.gocomics.com/jimmorin"),
+ #(u"Joel Pett",u"http://www.gocomics.com/joelpett"),
+ #(u"John Cole", u"http://www.gocomics.com/johncole"),
+ #(u"John Darkow", u"http://www.gocomics.com/johndarkow"),
+ #(u"John Deering",u"http://www.gocomics.com/johndeering"),
+ #(u"John Sherffius", u"http://www.gocomics.com/johnsherffius"),
+ #(u"Ken Catalino",u"http://www.gocomics.com/kencatalino"),
+ #(u"Kerry Waghorn",u"http://www.gocomics.com/facesinthenews"),
+ #(u"Kevin Kallaugher",u"http://www.gocomics.com/kevinkallaugher"),
+ #(u"Lalo Alcaraz",u"http://www.gocomics.com/laloalcaraz"),
+ #(u"Larry Wright", u"http://www.gocomics.com/larrywright"),
+ #(u"Lisa Benson", u"http://www.gocomics.com/lisabenson"),
+ #(u"Marshall Ramsey", u"http://www.gocomics.com/marshallramsey"),
+ #(u"Matt Bors", u"http://www.gocomics.com/mattbors"),
+ #(u"Matt Davies",u"http://www.gocomics.com/mattdavies"),
+ #(u"Michael Ramirez", u"http://www.gocomics.com/michaelramirez"),
+ #(u"Mike Keefe", u"http://www.gocomics.com/mikekeefe"),
+ #(u"Mike Luckovich", u"http://www.gocomics.com/mikeluckovich"),
+ #(u"Mike Thompson", u"http://www.gocomics.com/mikethompson"),
+ #(u"Monte Wolverton", u"http://www.gocomics.com/montewolverton"),
+ #(u"Mr. Fish", u"http://www.gocomics.com/mrfish"),
+ #(u"Nate Beeler", u"http://www.gocomics.com/natebeeler"),
+ #(u"Nick Anderson", u"http://www.gocomics.com/nickanderson"),
+ #(u"Pat Bagley", u"http://www.gocomics.com/patbagley"),
+ #(u"Pat Oliphant",u"http://www.gocomics.com/patoliphant"),
+ #(u"Paul Conrad",u"http://www.gocomics.com/paulconrad"),
+ #(u"Paul Szep", u"http://www.gocomics.com/paulszep"),
+ #(u"RJ Matson", u"http://www.gocomics.com/rjmatson"),
+ #(u"Rob Rogers", u"http://www.gocomics.com/robrogers"),
+ #(u"Robert Ariail", u"http://www.gocomics.com/robertariail"),
+ #(u"Scott Stantis", u"http://www.gocomics.com/scottstantis"),
+ #(u"Signe Wilkinson", u"http://www.gocomics.com/signewilkinson"),
+ #(u"Small World",u"http://www.gocomics.com/smallworld"),
+ #(u"Steve Benson", u"http://www.gocomics.com/stevebenson"),
+ #(u"Steve Breen", u"http://www.gocomics.com/stevebreen"),
+ #(u"Steve Kelley", u"http://www.gocomics.com/stevekelley"),
+ #(u"Steve Sack", u"http://www.gocomics.com/stevesack"),
+ #(u"Stuart Carlson",u"http://www.gocomics.com/stuartcarlson"),
+ #(u"Ted Rall",u"http://www.gocomics.com/tedrall"),
+ #(u"(Th)ink", u"http://www.gocomics.com/think"),
+ #(u"Tom Toles",u"http://www.gocomics.com/tomtoles"),
+ (u"Tony Auth",u"http://www.gocomics.com/tonyauth"),
+ #(u"Views of the World",u"http://www.gocomics.com/viewsoftheworld"),
+ #(u"ViewsAfrica",u"http://www.gocomics.com/viewsafrica"),
+ #(u"ViewsAmerica",u"http://www.gocomics.com/viewsamerica"),
+ #(u"ViewsAsia",u"http://www.gocomics.com/viewsasia"),
+ #(u"ViewsBusiness",u"http://www.gocomics.com/viewsbusiness"),
+ #(u"ViewsEurope",u"http://www.gocomics.com/viewseurope"),
+ #(u"ViewsLatinAmerica",u"http://www.gocomics.com/viewslatinamerica"),
+ #(u"ViewsMidEast",u"http://www.gocomics.com/viewsmideast"),
+ (u"Walt Handelsman",u"http://www.gocomics.com/walthandelsman"),
+ #(u"Wayne Stayskal",u"http://www.gocomics.com/waynestayskal"),
+ #(u"Wit of the World",u"http://www.gocomics.com/witoftheworld"),
]:
print 'Working on: ', title
articles = self.make_links(url)
@@ -352,3 +445,4 @@ class GoComics(BasicNewsRecipe):
p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
'''
+
diff --git a/recipes/hbr.recipe b/recipes/hbr.recipe
index cd7dcd2061..1152a48784 100644
--- a/recipes/hbr.recipe
+++ b/recipes/hbr.recipe
@@ -1,5 +1,6 @@
from calibre.web.feeds.news import BasicNewsRecipe
import re
+from datetime import date, timedelta
class HBR(BasicNewsRecipe):
@@ -12,13 +13,14 @@ class HBR(BasicNewsRecipe):
no_stylesheets = True
LOGIN_URL = 'http://hbr.org/login?request_url=/'
- INDEX = 'http://hbr.org/current'
+ INDEX = 'http://hbr.org/archive-toc/BR'
keep_only_tags = [dict(name='div', id='pageContainer')]
remove_tags = [dict(id=['mastheadContainer', 'magazineHeadline',
'articleToolbarTopRD', 'pageRightSubColumn', 'pageRightColumn',
'todayOnHBRListWidget', 'mostWidget', 'keepUpWithHBR',
'mailingListTout', 'partnerCenter', 'pageFooter',
+ 'superNavHeadContainer', 'hbrDisqus',
'articleToolbarTop', 'articleToolbarBottom', 'articleToolbarRD']),
dict(name='iframe')]
extra_css = '''
@@ -55,9 +57,14 @@ class HBR(BasicNewsRecipe):
def hbr_get_toc(self):
- soup = self.index_to_soup(self.INDEX)
- url = soup.find('a', text=lambda t:'Full Table of Contents' in t).parent.get('href')
- return self.index_to_soup('http://hbr.org'+url)
+ today = date.today()
+ future = today + timedelta(days=30)
+ for x in [x.strftime('%y%m') for x in (future, today)]:
+ url = self.INDEX + x
+ soup = self.index_to_soup(url)
+ if not soup.find(text='Issue Not Found'):
+ return soup
+ raise Exception('Could not find current issue')
def hbr_parse_section(self, container, feeds):
current_section = None
diff --git a/recipes/hbr_blogs.recipe b/recipes/hbr_blogs.recipe
index bd72a95ebf..acee567d8d 100644
--- a/recipes/hbr_blogs.recipe
+++ b/recipes/hbr_blogs.recipe
@@ -6,7 +6,7 @@ class HBR(BasicNewsRecipe):
title = 'Harvard Business Review Blogs'
description = 'To subscribe go to http://hbr.harvardbusiness.org'
needs_subscription = True
- __author__ = 'Kovid Goyal and Sujata Raman, enhanced by BrianG'
+ __author__ = 'Kovid Goyal, enhanced by BrianG'
language = 'en'
no_stylesheets = True
diff --git a/recipes/icons/ambito_financiero.png b/recipes/icons/ambito_financiero.png
new file mode 100644
index 0000000000..e0a6f409cf
Binary files /dev/null and b/recipes/icons/ambito_financiero.png differ
diff --git a/recipes/icons/athens_news.png b/recipes/icons/athens_news.png
new file mode 100644
index 0000000000..499a11dbe2
Binary files /dev/null and b/recipes/icons/athens_news.png differ
diff --git a/recipes/icons/buenosaireseconomico.png b/recipes/icons/buenosaireseconomico.png
new file mode 100644
index 0000000000..d84f7483ae
Binary files /dev/null and b/recipes/icons/buenosaireseconomico.png differ
diff --git a/recipes/icons/elclubdelebook.png b/recipes/icons/elclubdelebook.png
new file mode 100644
index 0000000000..c43f045484
Binary files /dev/null and b/recipes/icons/elclubdelebook.png differ
diff --git a/recipes/icons/elcronista.png b/recipes/icons/elcronista.png
index 0be856345e..ca64756de1 100644
Binary files a/recipes/icons/elcronista.png and b/recipes/icons/elcronista.png differ
diff --git a/recipes/icons/financial_times.png b/recipes/icons/financial_times.png
new file mode 100644
index 0000000000..2a769d9dbb
Binary files /dev/null and b/recipes/icons/financial_times.png differ
diff --git a/recipes/icons/financial_times_uk.png b/recipes/icons/financial_times_uk.png
new file mode 100644
index 0000000000..2a769d9dbb
Binary files /dev/null and b/recipes/icons/financial_times_uk.png differ
diff --git a/recipes/icons/observatorul_cultural.png b/recipes/icons/observatorul_cultural.png
new file mode 100644
index 0000000000..f322bd01dc
Binary files /dev/null and b/recipes/icons/observatorul_cultural.png differ
diff --git a/recipes/icons/stiintasitehnica.png b/recipes/icons/stiintasitehnica.png
new file mode 100644
index 0000000000..eb16ec3a0e
Binary files /dev/null and b/recipes/icons/stiintasitehnica.png differ
diff --git a/recipes/independent.recipe b/recipes/independent.recipe
index 2ce6b24c4f..0a94384b37 100644
--- a/recipes/independent.recipe
+++ b/recipes/independent.recipe
@@ -6,7 +6,7 @@ class TheIndependent(BasicNewsRecipe):
language = 'en_GB'
__author__ = 'Krittika Goyal'
oldest_article = 1 #days
- max_articles_per_feed = 25
+ max_articles_per_feed = 30
encoding = 'latin1'
no_stylesheets = True
@@ -25,24 +25,39 @@ class TheIndependent(BasicNewsRecipe):
'http://www.independent.co.uk/news/uk/rss'),
('World',
'http://www.independent.co.uk/news/world/rss'),
- ('Sport',
- 'http://www.independent.co.uk/sport/rss'),
- ('Arts and Entertainment',
- 'http://www.independent.co.uk/arts-entertainment/rss'),
('Business',
'http://www.independent.co.uk/news/business/rss'),
- ('Life and Style',
- 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/rss'),
- ('Science',
- 'http://www.independent.co.uk/news/science/rss'),
('People',
'http://www.independent.co.uk/news/people/rss'),
+ ('Science',
+ 'http://www.independent.co.uk/news/science/rss'),
('Media',
'http://www.independent.co.uk/news/media/rss'),
- ('Health and Families',
- 'http://www.independent.co.uk/life-style/health-and-families/rss'),
+ ('Education',
+ 'http://www.independent.co.uk/news/education/rss'),
('Obituaries',
'http://www.independent.co.uk/news/obituaries/rss'),
+
+ ('Opinion',
+ 'http://www.independent.co.uk/opinion/rss'),
+
+ ('Environment',
+ 'http://www.independent.co.uk/environment/rss'),
+
+ ('Sport',
+ 'http://www.independent.co.uk/sport/rss'),
+
+ ('Life and Style',
+ 'http://www.independent.co.uk/life-style/rss'),
+
+ ('Arts and Entertainment',
+ 'http://www.independent.co.uk/arts-entertainment/rss'),
+
+ ('Travel',
+ 'http://www.independent.co.uk/travel/rss'),
+
+ ('Money',
+ 'http://www.independent.co.uk/money/rss'),
]
def preprocess_html(self, soup):
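Reviewer note on the feed reshuffle above: calibre feeds are plain (title, url) tuples, so a reorder like this is easy to sanity-check outside calibre. A minimal sketch (the feed list is abbreviated here, not the full set from the patch; `check_feeds` is a hypothetical helper, not a calibre API):

```python
# Sanity check for a calibre-style feeds list: every entry must be a
# (title, url) pair with a well-formed http(s) URL and a unique title.
from urllib.parse import urlparse

feeds = [
    ('UK', 'http://www.independent.co.uk/news/uk/rss'),
    ('World', 'http://www.independent.co.uk/news/world/rss'),
    ('Travel', 'http://www.independent.co.uk/travel/rss'),
]

def check_feeds(feeds):
    seen = set()
    for title, url in feeds:
        parts = urlparse(url)
        assert parts.scheme in ('http', 'https') and parts.netloc, url
        assert title not in seen, 'duplicate section: %s' % title
        seen.add(title)
    return len(feeds)

print(check_feeds(feeds))  # number of validated sections
```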
diff --git a/recipes/infobae.recipe b/recipes/infobae.recipe
index 9553746449..b577988347 100644
--- a/recipes/infobae.recipe
+++ b/recipes/infobae.recipe
@@ -1,5 +1,5 @@
__license__ = 'GPL v3'
-__copyright__ = '2008-2010, Darko Miletic '
+__copyright__ = '2008-2011, Darko Miletic '
'''
infobae.com
'''
@@ -9,7 +9,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
class Infobae(BasicNewsRecipe):
title = 'Infobae.com'
__author__ = 'Darko Miletic and Sujata Raman'
- description = 'Informacion Libre las 24 horas'
+ description = 'Infobae.com es el sitio de noticias con mayor actualizacion de Latinoamérica. Noticias actualizadas las 24 horas, los 365 días del año.'
publisher = 'Infobae.com'
category = 'news, politics, Argentina'
oldest_article = 1
@@ -17,13 +17,13 @@ class Infobae(BasicNewsRecipe):
no_stylesheets = True
use_embedded_content = False
language = 'es_AR'
- encoding = 'cp1252'
- masthead_url = 'http://www.infobae.com/imgs/header/header.gif'
- remove_javascript = True
+ encoding = 'utf8'
+ masthead_url = 'http://www.infobae.com/media/img/static/logo-infobae.gif'
remove_empty_feeds = True
extra_css = '''
- body{font-family:Arial,Helvetica,sans-serif;}
- .popUpTitulo{color:#0D4261; font-size: xx-large}
+ body{font-family: Arial,Helvetica,sans-serif}
+ img{display: block}
+ .categoria{font-size: small; text-transform: uppercase}
'''
conversion_options = {
@@ -31,26 +31,44 @@ class Infobae(BasicNewsRecipe):
, 'tags' : category
, 'publisher' : publisher
, 'language' : language
- , 'linearize_tables' : True
}
-
-
+
+ keep_only_tags = [dict(attrs={'class':['titularnota','nota','post-title','post-entry','entry-title','entry-info','entry-content']})]
+ remove_tags_after = dict(attrs={'class':['interior-noticia','nota-desc','tags']})
+ remove_tags = [
+ dict(name=['base','meta','link','iframe','object','embed','ins'])
+ ,dict(attrs={'class':['barranota','tags']})
+ ]
+
feeds = [
- (u'Noticias' , u'http://www.infobae.com/adjuntos/html/RSS/hoy.xml' )
- ,(u'Salud' , u'http://www.infobae.com/adjuntos/html/RSS/salud.xml' )
- ,(u'Tecnologia', u'http://www.infobae.com/adjuntos/html/RSS/tecnologia.xml')
- ,(u'Deportes' , u'http://www.infobae.com/adjuntos/html/RSS/deportes.xml' )
+ (u'Saludable' , u'http://www.infobae.com/rss/saludable.xml')
+ ,(u'Economia' , u'http://www.infobae.com/rss/economia.xml' )
+ ,(u'En Numeros', u'http://www.infobae.com/rss/rating.xml' )
+ ,(u'Finanzas' , u'http://www.infobae.com/rss/finanzas.xml' )
+ ,(u'Mundo' , u'http://www.infobae.com/rss/mundo.xml' )
+ ,(u'Sociedad' , u'http://www.infobae.com/rss/sociedad.xml' )
+ ,(u'Politica' , u'http://www.infobae.com/rss/politica.xml' )
+ ,(u'Deportes' , u'http://www.infobae.com/rss/deportes.xml' )
]
- def print_version(self, url):
- article_part = url.rpartition('/')[2]
- article_id= article_part.partition('-')[0]
- return 'http://www.infobae.com/notas/nota_imprimir.php?Idx=' + article_id
-
- def postprocess_html(self, soup, first):
- for tag in soup.findAll(name='strong'):
- tag.name = 'b'
+ def preprocess_html(self, soup):
+ for item in soup.findAll(style=True):
+ del item['style']
+ for item in soup.findAll('a'):
+ limg = item.find('img')
+ if item.string is not None:
+ str = item.string
+ item.replaceWith(str)
+ else:
+ if limg:
+ item.name = 'div'
+ item.attrs = []
+ else:
+ str = self.tag_to_string(item)
+ item.replaceWith(str)
+ for item in soup.findAll('img'):
+ if not item.has_key('alt'):
+ item['alt'] = 'image'
return soup
-
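The new `preprocess_html` above flattens anchors: a link wrapping only text is replaced by its text, and a link wrapping an image becomes a plain `<div>` container. A standalone sketch of the same transformation, using regexes as a stand-in for calibre's bundled BeautifulSoup tree walk (fine for illustrative, well-formed input; the real recipe walks the parse tree):

```python
import re

# Flatten anchors the way the new infobae preprocess_html does:
#   <a ...><img ...></a>  ->  <div><img ...></div>
#   <a ...>text</a>       ->  text
def flatten_links(html):
    html = re.sub(r'<a\b[^>]*>\s*(<img\b[^>]*>)\s*</a>', r'<div>\1</div>', html)
    html = re.sub(r'<a\b[^>]*>(.*?)</a>', r'\1', html)
    return html

sample = '<p><a href="/x">Nota</a> y <a href="/y"><img src="f.jpg"></a></p>'
print(flatten_links(sample))
```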
diff --git a/recipes/le_monde.recipe b/recipes/le_monde.recipe
index cf1f858dfe..8fcdf9c870 100644
--- a/recipes/le_monde.recipe
+++ b/recipes/le_monde.recipe
@@ -99,7 +99,7 @@ class LeMonde(BasicNewsRecipe):
keep_only_tags = [
dict(name='div', attrs={'class':['contenu']})
]
-
+ remove_tags = [dict(name='div', attrs={'class':['LM_atome']})]
remove_tags_after = [dict(id='appel_temoignage')]
def get_article_url(self, article):
diff --git a/recipes/le_temps.recipe b/recipes/le_temps.recipe
index 7e320fe710..367dd4fc50 100644
--- a/recipes/le_temps.recipe
+++ b/recipes/le_temps.recipe
@@ -14,7 +14,7 @@ class LeTemps(BasicNewsRecipe):
title = u'Le Temps'
oldest_article = 7
max_articles_per_feed = 100
- __author__ = 'Sujata Raman'
+ __author__ = 'Kovid Goyal'
description = 'French news. Needs a subscription from http://www.letemps.ch'
no_stylesheets = True
remove_javascript = True
@@ -27,6 +27,7 @@ class LeTemps(BasicNewsRecipe):
def get_browser(self):
br = BasicNewsRecipe.get_browser(self)
br.open('http://www.letemps.ch/login')
+ br.select_form(nr=1)
br['username'] = self.username
br['password'] = self.password
raw = br.submit().read()
diff --git a/recipes/lemonde_dip.recipe b/recipes/lemonde_dip.recipe
index 9845c207fc..8e61e24cdc 100644
--- a/recipes/lemonde_dip.recipe
+++ b/recipes/lemonde_dip.recipe
@@ -1,5 +1,5 @@
__license__ = 'GPL v3'
-__copyright__ = '2008-2010, Darko Miletic '
+__copyright__ = '2008-2011, Darko Miletic '
'''
mondediplo.com
'''
@@ -11,7 +11,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
class LeMondeDiplomatiqueEn(BasicNewsRecipe):
title = 'Le Monde diplomatique - English edition'
__author__ = 'Darko Miletic'
- description = 'Real journalism making sense of the world around us'
+ description = "Le Monde diplomatique is the place you go when you want to know what's really happening. This is a major international paper that is truly independent, that sees the world in fresh ways, that focuses on places no other publications reach. We offer a clear, considered view of the conflicting interests and complexities of a modern global world. LMD in English is a concise version of the Paris-based parent edition, publishing all the major stories each month, expertly translated, and with some London-based commissions too. We offer a taster of LMD quality on our website where a selection of articles are available each month."
publisher = 'Le Monde diplomatique'
category = 'news, politics, world'
no_stylesheets = True
@@ -26,13 +26,19 @@ class LeMondeDiplomatiqueEn(BasicNewsRecipe):
INDEX = PREFIX + strftime('%Y/%m/')
use_embedded_content = False
language = 'en'
- extra_css = ' body{font-family: "Luxi sans","Lucida sans","Lucida Grande",Lucida,"Lucida Sans Unicode",sans-serif} .surtitre{font-size: 1.2em; font-variant: small-caps; margin-bottom: 0.5em} .chapo{font-size: 1.2em; font-weight: bold; margin: 1em 0 0.5em} .texte{font-family: Georgia,"Times New Roman",serif} h1{color: #990000} .notes{border-top: 1px solid #CCCCCC; font-size: 0.9em; line-height: 1.4em} '
+ extra_css = """
+ body{font-family: "Luxi sans","Lucida sans","Lucida Grande",Lucida,"Lucida Sans Unicode",sans-serif}
+ .surtitre{font-size: 1.2em; font-variant: small-caps; margin-bottom: 0.5em}
+ .chapo{font-size: 1.2em; font-weight: bold; margin: 1em 0 0.5em}
+ .texte{font-family: Georgia,"Times New Roman",serif} h1{color: #990000}
+ .notes{border-top: 1px solid #CCCCCC; font-size: 0.9em; line-height: 1.4em}
+ """
conversion_options = {
- 'comment' : description
- , 'tags' : category
- , 'publisher' : publisher
- , 'language' : language
+ 'comment' : description
+ , 'tags' : category
+ , 'publisher' : publisher
+ , 'language' : language
}
def get_browser(self):
@@ -46,12 +52,12 @@ class LeMondeDiplomatiqueEn(BasicNewsRecipe):
br.open(self.LOGIN,data)
return br
- keep_only_tags =[
+ keep_only_tags =[
dict(name='div', attrs={'id':'contenu'})
, dict(name='div',attrs={'class':'notes surlignable'})
]
remove_tags = [dict(name=['object','link','script','iframe','base'])]
- remove_attributes = ['height','width']
+ remove_attributes = ['height','width','name','lang']
def parse_index(self):
articles = []
@@ -75,3 +81,24 @@ class LeMondeDiplomatiqueEn(BasicNewsRecipe):
})
return [(self.title, articles)]
+ def get_cover_url(self):
+ cover_url = None
+ soup = self.index_to_soup(self.INDEX)
+ cover_item = soup.find('div',attrs={'class':'current'})
+ if cover_item:
+ ap = cover_item.find('img',attrs={'class':'spip_logos'})
+ if ap:
+ cover_url = self.INDEX + ap['src']
+ return cover_url
+
+ def preprocess_html(self, soup):
+ for item in soup.findAll(style=True):
+ del item['style']
+ for item in soup.findAll('a'):
+ if item.string is not None:
+ str = item.string
+ item.replaceWith(str)
+ else:
+ str = self.tag_to_string(item)
+ item.replaceWith(str)
+ return soup
diff --git a/recipes/macleans.recipe b/recipes/macleans.recipe
index 296a56f5f3..22f94638d9 100644
--- a/recipes/macleans.recipe
+++ b/recipes/macleans.recipe
@@ -1,239 +1,28 @@
#!/usr/bin/env python
+from calibre.web.feeds.news import BasicNewsRecipe
-__license__ = 'GPL v3'
-
-'''
-macleans.ca
-'''
-from calibre.web.feeds.recipes import BasicNewsRecipe
-from calibre.ebooks.BeautifulSoup import Tag
-from datetime import timedelta, date
-
-class Macleans(BasicNewsRecipe):
+class AdvancedUserRecipe1308306308(BasicNewsRecipe):
title = u'Macleans Magazine'
- __author__ = 'Nick Redding'
language = 'en_CA'
- description = ('Macleans Magazine')
+ __author__ = 'sexymax15'
+ oldest_article = 30
+ max_articles_per_feed = 12
+ use_embedded_content = False
+
+ remove_empty_feeds = True
no_stylesheets = True
- timefmt = ' [%b %d]'
+ remove_javascript = True
+ remove_tags = [dict(name ='img'),dict (id='header'),{'class':'postmetadata'}]
+ remove_tags_after = {'class':'postmetadata'}
- # customization notes: delete sections you are not interested in
- # set oldest_article to the maximum number of days back from today to include articles
- sectionlist = [
- ['http://www2.macleans.ca/','Front Page'],
- ['http://www2.macleans.ca/category/canada/','Canada'],
- ['http://www2.macleans.ca/category/world-from-the-magazine/','World'],
- ['http://www2.macleans.ca/category/business','Business'],
- ['http://www2.macleans.ca/category/arts-culture/','Culture'],
- ['http://www2.macleans.ca/category/opinion','Opinion'],
- ['http://www2.macleans.ca/category/health-from-the-magazine/','Health'],
- ['http://www2.macleans.ca/category/environment-from-the-magazine/','Environment'],
- ['http://www2.macleans.ca/category/education/','On Campus'],
- ['http://www2.macleans.ca/category/travel-from-the-magazine/','Travel']
- ]
- oldest_article = 7
-
- # formatting for print version of articles
- extra_css = '''h2{font-family:Times,serif; font-size:large;}
- small {font-family:Times,serif; font-size:xx-small; list-style-type: none;}
- '''
-
- # tag handling for print version of articles
- keep_only_tags = [dict(id='tw-print')]
- remove_tags = [dict({'class':'postmetadata'})]
-
-
- def preprocess_html(self,soup):
- for img_tag in soup.findAll('img'):
- parent_tag = img_tag.parent
- if parent_tag.name == 'a':
- new_tag = Tag(soup,'p')
- new_tag.insert(0,img_tag)
- parent_tag.replaceWith(new_tag)
- elif parent_tag.name == 'p':
- if not self.tag_to_string(parent_tag) == '':
- new_div = Tag(soup,'div')
- new_tag = Tag(soup,'p')
- new_tag.insert(0,img_tag)
- parent_tag.replaceWith(new_div)
- new_div.insert(0,new_tag)
- new_div.insert(1,parent_tag)
- return soup
-
- def parse_index(self):
-
-
-
- articles = {}
- key = None
- ans = []
-
- def parse_index_page(page_url,page_title):
-
- def decode_date(datestr):
- dmysplit = datestr.strip().lower().split(',')
- mdsplit = dmysplit[1].split()
- m = ['january','february','march','april','may','june','july','august','september','october','november','december'].index(mdsplit[0])+1
- d = int(mdsplit[1])
- y = int(dmysplit[2].split()[0])
- return date(y,m,d)
-
- def article_title(tag):
- atag = tag.find('a',href=True)
- if not atag:
- return ''
- return self.tag_to_string(atag)
-
- def article_url(tag):
- atag = tag.find('a',href=True)
- if not atag:
- return ''
- return atag['href']+'print/'
-
- def article_description(tag):
- for p_tag in tag.findAll('p'):
- d = self.tag_to_string(p_tag,False)
- if not d == '':
- return d
- return ''
-
- def compound_h4_h3_title(tag):
- if tag.h4:
- if tag.h3:
- return self.tag_to_string(tag.h4,False)+u'\u2014'+self.tag_to_string(tag.h3,False)
- else:
- return self.tag_to_string(tag.h4,False)
- elif tag.h3:
- return self.tag_to_string(tag.h3,False)
- else:
- return ''
-
- def compound_h2_h4_title(tag):
- if tag.h2:
- if tag.h4:
- return self.tag_to_string(tag.h2,False)+u'\u2014'+self.tag_to_string(tag.h4,False)
- else:
- return self.tag_to_string(tag.h2,False)
- elif tag.h4:
- return self.tag_to_string(tag.h4,False)
- else:
- return ''
-
-
- def handle_article(header_tag, outer_tag):
- if header_tag:
- url = article_url(header_tag)
- title = article_title(header_tag)
- author_date_tag = outer_tag.h4
- if author_date_tag:
- author_date = self.tag_to_string(author_date_tag,False).split(' - ')
- author = author_date[0].strip()
- article_date = decode_date(author_date[1])
- earliest_date = date.today() - timedelta(days=self.oldest_article)
- if article_date < earliest_date:
- self.log("Skipping article dated %s" % author_date[1])
- else:
- excerpt_div = outer_tag.find('div','excerpt')
- if excerpt_div:
- description = article_description(excerpt_div)
- else:
- description = ''
- if not articles.has_key(page_title):
- articles[page_title] = []
- articles[page_title].append(dict(title=title,url=url,date=author_date[1],description=description,author=author,content=''))
-
- def handle_category_article(cat, header_tag, outer_tag):
- url = article_url(header_tag)
- title = article_title(header_tag)
- if not title == '':
- title = cat+u'\u2014'+title
- a_tag = outer_tag.find('span','authorLink')
- if a_tag:
- author = self.tag_to_string(a_tag,False)
- a_tag.parent.extract()
- else:
- author = ''
- description = article_description(outer_tag)
- if not articles.has_key(page_title):
- articles[page_title] = []
- articles[page_title].append(dict(title=title,url=url,date='',description=description,author=author,content=''))
-
-
- soup = self.index_to_soup(page_url)
-
- if page_title == 'Front Page':
- # special processing for the front page
- top_stories = soup.find('div',{ "id" : "macleansFeatured" })
- if top_stories:
- for div_slide in top_stories.findAll('div','slide'):
- url = article_url(div_slide)
- div_title = div_slide.find('div','header')
- if div_title:
- title = self.tag_to_string(div_title,False)
- else:
- title = ''
- description = article_description(div_slide)
- if not articles.has_key(page_title):
- articles[page_title] = []
- articles[page_title].append(dict(title=title,url=url,date='',description=description,author='',content=''))
-
- from_macleans = soup.find('div',{ "id" : "fromMacleans" })
- if from_macleans:
- for li_tag in from_macleans.findAll('li','fromMacleansArticle'):
- title = compound_h4_h3_title(li_tag)
- url = article_url(li_tag)
- description = article_description(li_tag)
- if not articles.has_key(page_title):
- articles[page_title] = []
- articles[page_title].append(dict(title=title,url=url,date='',description=description,author='',content=''))
-
- blog_central = soup.find('div',{ "id" : "bloglist" })
- if blog_central:
- for li_tag in blog_central.findAll('li'):
- title = compound_h2_h4_title(li_tag)
- if li_tag.h4:
- url = article_url(li_tag.h4)
- if not articles.has_key(page_title):
- articles[page_title] = []
- articles[page_title].append(dict(title=title,url=url,date='',description='',author='',content=''))
-
-# need_to_know = soup.find('div',{ "id" : "needToKnow" })
-# if need_to_know:
-# for div_tag in need_to_know('div',attrs={'class' : re.compile("^needToKnowArticle")}):
-# title = compound_h4_h3_title(div_tag)
-# url = article_url(div_tag)
-# description = article_description(div_tag)
-# if not articles.has_key(page_title):
-# articles[page_title] = []
-# articles[page_title].append(dict(title=title,url=url,date='',description=description,author='',content=''))
-
- for news_category in soup.findAll('div','newsCategory'):
- news_cat = self.tag_to_string(news_category.h4,False)
- handle_category_article(news_cat, news_category.find('h2'), news_category.find('div'))
- for news_item in news_category.findAll('li'):
- handle_category_article(news_cat,news_item.h3,news_item)
-
- return
-
- # find the div containing the highlight article
- div_post = soup.find('div','post')
- if div_post:
- h1_tag = div_post.h1
- handle_article(h1_tag,div_post)
-
- # find the divs containing the rest of the articles
- div_other = div_post.find('div', { "id" : "categoryOtherPosts" })
- if div_other:
- for div_entry in div_other.findAll('div','entry'):
- h2_tag = div_entry.h2
- handle_article(h2_tag,div_entry)
-
-
-
- for page_name,page_title in self.sectionlist:
- parse_index_page(page_name,page_title)
- ans.append(page_title)
-
- ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
- return ans
+ feeds = [(u'Blog Central', u'http://www2.macleans.ca/category/blog-central/feed/'),
+          (u'Canada', u'http://www2.macleans.ca/category/canada/feed/'),
+          (u'World', u'http://www2.macleans.ca/category/world-from-the-magazine/feed/'),
+          (u'Business', u'http://www2.macleans.ca/category/business/feed/'),
+          (u'Arts & Culture', u'http://www2.macleans.ca/category/arts-culture/feed/'),
+          (u'Opinion', u'http://www2.macleans.ca/category/opinion/feed/'),
+          (u'Health', u'http://www2.macleans.ca/category/health-from-the-magazine/feed/'),
+          (u'Environment', u'http://www2.macleans.ca/category/environment-from-the-magazine/feed/')]
+ def print_version(self, url):
+ return url + 'print/'
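The rewritten Macleans recipe relies on the site's print pages via `print_version`, which simply appends `print/` and therefore assumes every feed URL ends with a slash. A slightly defensive standalone variant (hypothetical; the patch itself keeps the simple form):

```python
# Append 'print/' to a Macleans article URL, tolerating a missing
# trailing slash (the RSS URLs normally end with '/').
def print_version(url):
    return url.rstrip('/') + '/print/'

print(print_version('http://www2.macleans.ca/2011/06/17/some-story/'))
```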
diff --git a/recipes/mail_and_guardian.recipe b/recipes/mail_and_guardian.recipe
index 5b58f3f938..ddfce97765 100644
--- a/recipes/mail_and_guardian.recipe
+++ b/recipes/mail_and_guardian.recipe
@@ -3,7 +3,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
class AdvancedUserRecipe1295081935(BasicNewsRecipe):
title = u'Mail & Guardian ZA News'
__author__ = '77ja65'
- language = 'en'
+ language = 'en_ZA'
oldest_article = 7
max_articles_per_feed = 30
no_stylesheets = True
diff --git a/recipes/metro_news_nl.recipe b/recipes/metro_news_nl.recipe
new file mode 100644
index 0000000000..cfdd9e5441
--- /dev/null
+++ b/recipes/metro_news_nl.recipe
@@ -0,0 +1,45 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class AdvancedUserRecipe1306097511(BasicNewsRecipe):
+ title = u'Metro Nieuws NL'
+ oldest_article = 2
+ max_articles_per_feed = 100
+ __author__ = u'DrMerry'
+ description = u'Metro Nederland'
+ language = u'nl'
+ simultaneous_downloads = 5
+ delay = 1
+# timefmt = ' [%A, %d %B, %Y]'
+ timefmt = ''
+ no_stylesheets = True
+ remove_javascript = True
+ remove_empty_feeds = True
+ cover_url = 'http://www.readmetro.com/img/en/metroholland/last/1/small.jpg'
+ publication_type = 'newspaper'
+ remove_tags_before = dict(name='div', attrs={'id':'date'})
+ remove_tags_after = dict(name='div', attrs={'id':'column-1-3'})
+ encoding = 'utf-8'
+ extra_css = '#date {font-size: 10px} .article-image-caption {font-size: 8px}'
+
+ remove_tags = [dict(name='div', attrs={'class':[ 'metroCommentFormWrap',
+ 'commentForm', 'metroCommentInnerWrap', 'article-slideshow-counter-container', 'article-slideshow-control', 'ad', 'header-links',
+ 'art-rgt','pluck-app pluck-comm', 'share-and-byline', 'article-tools-below-title', 'col-179 ', 'related-links', 'clear padding-top-15', 'share-tools', 'article-page-auto-pushes', 'footer-edit']}),
+ dict(name='div', attrs={'id':['article-2', 'article-4', 'article-1', 'navigation', 'footer', 'header', 'comments', 'sidebar']}),
+ dict(name='iframe')]
+
+ feeds = [
+ (u'Binnenland', u'http://www.metronieuws.nl/rss.xml?c=1277377288-3'),
+ (u'Economie', u'http://www.metronieuws.nl/rss.xml?c=1278070988-0'),
+ (u'Den Haag', u'http://www.metronieuws.nl/rss.xml?c=1289013337-3'),
+ (u'Rotterdam', u'http://www.metronieuws.nl/rss.xml?c=1289013337-2'),
+ (u'Amsterdam', u'http://www.metronieuws.nl/rss.xml?c=1289013337-1'),
+ (u'Columns', u'http://www.metronieuws.nl/rss.xml?c=1277377288-17'),
+ (u'Entertainment', u'http://www.metronieuws.nl/rss.xml?c=1277377288-2'),
+ (u'Dot', u'http://www.metronieuws.nl/rss.xml?c=1283166782-12'),
+ (u'Familie', u'http://www.metronieuws.nl/rss.xml?c=1283166782-9'),
+ (u'Blogs', u'http://www.metronieuws.nl/rss.xml?c=1295586825-6'),
+ (u'Reizen', u'http://www.metronieuws.nl/rss.xml?c=1277377288-13'),
+ (u'Carrière', u'http://www.metronieuws.nl/rss.xml?c=1278070988-1'),
+ (u'Sport', u'http://www.metronieuws.nl/rss.xml?c=1277377288-12')
+ ]
diff --git a/recipes/metro_uk.recipe b/recipes/metro_uk.recipe
index deced5976b..287af47f5c 100644
--- a/recipes/metro_uk.recipe
+++ b/recipes/metro_uk.recipe
@@ -1,29 +1,34 @@
+import re
from calibre.web.feeds.news import BasicNewsRecipe
class AdvancedUserRecipe1306097511(BasicNewsRecipe):
title = u'Metro UK'
-
- no_stylesheets = True
- oldest_article = 1
- max_articles_per_feed = 200
+ description = 'News as provided by The Metro - UK'
__author__ = 'Dave Asbury'
+ no_stylesheets = True
+ oldest_article = 1
+ max_articles_per_feed = 25
+ remove_empty_feeds = True
+ remove_javascript = True
+
+ preprocess_regexps = [(re.compile(r'Tweet'), lambda a : '')]
+
language = 'en_GB'
- simultaneous_downloads= 3
+
masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'
+ extra_css = 'h2 {font: medium sans-serif;}'
keep_only_tags = [
+ dict(name='h1'),dict(name='h2', attrs={'class':'h2'}),
dict(attrs={'class':['img-cnt figure']}),
- dict(attrs={'class':['art-img']}),
- dict(name='h1'),
- dict(name='h2', attrs={'class':'h2'}),
+ dict(attrs={'class':['art-img']}),
+
dict(name='div', attrs={'class':'art-lft'})
]
- remove_tags = [dict(name='div', attrs={'class':[ 'metroCommentFormWrap',
- 'commentForm', 'metroCommentInnerWrap',
- 'art-rgt','pluck-app pluck-comm','news m12 clrd clr-l p5t', 'flt-r' ]})]
-
+ remove_tags = [dict(name='div', attrs={'class':[ 'news m12 clrd clr-b p5t shareBtm', 'commentForm', 'metroCommentInnerWrap',
+ 'art-rgt','pluck-app pluck-comm','news m12 clrd clr-l p5t', 'flt-r' ]}),
+ dict(attrs={'class':[ 'metroCommentFormWrap','commentText','commentsNav','avatar','submDateAndTime']})
+ ]
feeds = [
(u'News', u'http://www.metro.co.uk/rss/news/'), (u'Money', u'http://www.metro.co.uk/rss/money/'), (u'Sport', u'http://www.metro.co.uk/rss/sport/'), (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'), (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'), (u'TV', u'http://www.metro.co.uk/rss/tv/'), (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'), (u'Weird News', u'http://www.metro.co.uk/rss/weird/'), (u'Travel', u'http://www.metro.co.uk/rss/travel/'), (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'), (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'), (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/')]
-
-
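On the new `preprocess_regexps` entry above: these are (compiled pattern, replacement callable) pairs that calibre applies to the raw HTML before parsing; the metro_uk entry strips the literal word `Tweet`. Applying such a list is essentially the loop below (a sketch of the documented behaviour, not calibre's actual internals):

```python
import re

# (compiled_pattern, replacement_callable) pairs, as in the recipe.
preprocess_regexps = [(re.compile(r'Tweet'), lambda m: '')]

def apply_regexps(html, rules):
    # Run each substitution over the raw page source in order.
    for pat, func in rules:
        html = pat.sub(func, html)
    return html

print(apply_regexps('<p>Tweet this story</p>', preprocess_regexps))
```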
diff --git a/recipes/ming_pao.recipe b/recipes/ming_pao.recipe
index 08ee20cb15..947d85692f 100644
--- a/recipes/ming_pao.recipe
+++ b/recipes/ming_pao.recipe
@@ -1,17 +1,23 @@
-# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2010-2011, Eddie Lau'
+# Region - Hong Kong, Vancouver, Toronto
+__Region__ = 'Hong Kong'
# Users of Kindle 3 with limited system-level CJK support
# please replace the following "True" with "False".
__MakePeriodical__ = True
# Turn below to true if your device supports display of CJK titles
__UseChineseTitle__ = False
-# Trun below to true if you wish to use life.mingpao.com as the main article source
+# Set it to False if you want to skip images
+__KeepImages__ = True
+# (HK only) Turn below to true if you wish to use life.mingpao.com as the main article source
__UseLife__ = True
+
'''
Change Log:
+2011/06/26: add fetching Vancouver and Toronto versions of the paper, also provide captions for images using life.mingpao fetch source
+ provide options to remove all images in the file
2011/05/12: switch the main parse source to life.mingpao.com, which has more photos on the article pages
2011/03/06: add new articles for finance section, also a new section "Columns"
2011/02/28: rearrange the sections
@@ -34,21 +40,96 @@ Change Log:
import os, datetime, re
from calibre.web.feeds.recipes import BasicNewsRecipe
from contextlib import nested
-
-
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre.ebooks.metadata.opf2 import OPFCreator
from calibre.ebooks.metadata.toc import TOC
from calibre.ebooks.metadata import MetaInformation
-class MPHKRecipe(BasicNewsRecipe):
- title = 'Ming Pao - Hong Kong'
+# MAIN CLASS
+class MPRecipe(BasicNewsRecipe):
+ if __Region__ == 'Hong Kong':
+ title = 'Ming Pao - Hong Kong'
+ description = 'Hong Kong Chinese Newspaper (http://news.mingpao.com)'
+ category = 'Chinese, News, Hong Kong'
+ extra_css = 'img {display: block; margin-left: auto; margin-right: auto; margin-top: 10px; margin-bottom: 10px;} font>b {font-size:200%; font-weight:bold;}'
+ masthead_url = 'http://news.mingpao.com/image/portals_top_logo_news.gif'
+ keep_only_tags = [dict(name='h1'),
+ dict(name='font', attrs={'style':['font-size:14pt; line-height:160%;']}), # for entertainment page title
+ dict(name='font', attrs={'color':['AA0000']}), # for column articles title
+ dict(attrs={'id':['newscontent']}), # entertainment and column page content
+ dict(attrs={'id':['newscontent01','newscontent02']}),
+ dict(attrs={'class':['photo']}),
+ dict(name='table', attrs={'width':['100%'], 'border':['0'], 'cellspacing':['5'], 'cellpadding':['0']}), # content in printed version of life.mingpao.com
+ dict(name='img', attrs={'width':['180'], 'alt':['按圖放大']}) # images for source from life.mingpao.com
+ ]
+ if __KeepImages__:
+ remove_tags = [dict(name='style'),
+ dict(attrs={'id':['newscontent135']}), # for the finance page from mpfinance.com
+ dict(name='font', attrs={'size':['2'], 'color':['666666']}), # article date in life.mingpao.com article
+ #dict(name='table') # for content fetched from life.mingpao.com
+ ]
+ else:
+ remove_tags = [dict(name='style'),
+ dict(attrs={'id':['newscontent135']}), # for the finance page from mpfinance.com
+ dict(name='font', attrs={'size':['2'], 'color':['666666']}), # article date in life.mingpao.com article
+ dict(name='img'),
+ #dict(name='table') # for content fetched from life.mingpao.com
+ ]
+ remove_attributes = ['width']
+ preprocess_regexps = [
+ (re.compile(r'', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ (re.compile(r'', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ (re.compile(r' ', re.DOTALL|re.IGNORECASE), # for entertainment page
+ lambda match: ''),
+ # skip after title in life.mingpao.com fetched article
+ (re.compile(r"", re.DOTALL|re.IGNORECASE),
+ lambda match: " "),
+ (re.compile(r" ", re.DOTALL|re.IGNORECASE),
+ lambda match: "")
+ ]
+ elif __Region__ == 'Vancouver':
+ title = 'Ming Pao - Vancouver'
+ description = 'Vancouver Chinese Newspaper (http://www.mingpaovan.com)'
+ category = 'Chinese, News, Vancouver'
+ extra_css = 'img {display: block; margin-left: auto; margin-right: auto; margin-top: 10px; margin-bottom: 10px;} b>font {font-size:200%; font-weight:bold;}'
+ masthead_url = 'http://www.mingpaovan.com/image/mainlogo2_VAN2.gif'
+ keep_only_tags = [dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['1']}),
+ dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['3'], 'cellpadding':['3'], 'id':['tblContent3']}),
+ dict(name='table', attrs={'width':['180'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['0'], 'bgcolor':['F0F0F0']}),
+ ]
+ if __KeepImages__:
+ remove_tags = [dict(name='img', attrs={'src':['../../../image/magnifier.gif']})] # the magnifier icon
+ else:
+ remove_tags = [dict(name='img')]
+ remove_attributes = ['width']
+ preprocess_regexps = [(re.compile(r' ', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ ]
+ elif __Region__ == 'Toronto':
+ title = 'Ming Pao - Toronto'
+ description = 'Toronto Chinese Newspaper (http://www.mingpaotor.com)'
+ category = 'Chinese, News, Toronto'
+ extra_css = 'img {display: block; margin-left: auto; margin-right: auto; margin-top: 10px; margin-bottom: 10px;} b>font {font-size:200%; font-weight:bold;}'
+ masthead_url = 'http://www.mingpaotor.com/image/mainlogo2_TOR2.gif'
+ keep_only_tags = [dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['1']}),
+ dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['3'], 'cellpadding':['3'], 'id':['tblContent3']}),
+ dict(name='table', attrs={'width':['180'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['0'], 'bgcolor':['F0F0F0']}),
+ ]
+ if __KeepImages__:
+ remove_tags = [dict(name='img', attrs={'src':['../../../image/magnifier.gif']})] # the magnifier icon
+ else:
+ remove_tags = [dict(name='img')]
+ remove_attributes = ['width']
+ preprocess_regexps = [(re.compile(r' ', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ ]
+
oldest_article = 1
max_articles_per_feed = 100
__author__ = 'Eddie Lau'
- description = 'Hong Kong Chinese Newspaper (http://news.mingpao.com)'
publisher = 'MingPao'
- category = 'Chinese, News, Hong Kong'
remove_javascript = True
use_embedded_content = False
no_stylesheets = True
@@ -57,33 +138,6 @@ class MPHKRecipe(BasicNewsRecipe):
recursions = 0
conversion_options = {'linearize_tables':True}
timefmt = ''
- extra_css = 'img {display: block; margin-left: auto; margin-right: auto; margin-top: 10px; margin-bottom: 10px;} font>b {font-size:200%; font-weight:bold;}'
- masthead_url = 'http://news.mingpao.com/image/portals_top_logo_news.gif'
- keep_only_tags = [dict(name='h1'),
- dict(name='font', attrs={'style':['font-size:14pt; line-height:160%;']}), # for entertainment page title
- dict(name='font', attrs={'color':['AA0000']}), # for column articles title
- dict(attrs={'id':['newscontent']}), # entertainment and column page content
- dict(attrs={'id':['newscontent01','newscontent02']}),
- dict(attrs={'class':['photo']}),
- dict(name='img', attrs={'width':['180'], 'alt':['按圖放大']}) # images for source from life.mingpao.com
- ]
- remove_tags = [dict(name='style'),
- dict(attrs={'id':['newscontent135']}), # for the finance page from mpfinance.com
- dict(name='table')] # for content fetched from life.mingpao.com
- remove_attributes = ['width']
- preprocess_regexps = [
- (re.compile(r' ', re.DOTALL|re.IGNORECASE),
- lambda match: ''),
- (re.compile(r'', re.DOTALL|re.IGNORECASE),
- lambda match: ''),
- (re.compile(r'
', re.DOTALL|re.IGNORECASE), # for entertainment page
- lambda match: ''),
- # skip after title in life.mingpao.com fetched article
- (re.compile(r" ", re.DOTALL|re.IGNORECASE),
- lambda match: " "),
- (re.compile(r" ", re.DOTALL|re.IGNORECASE),
- lambda match: "")
- ]
def image_url_processor(cls, baseurl, url):
# trick: break the url at the first occurance of digit, add an additional
@@ -124,8 +178,18 @@ class MPHKRecipe(BasicNewsRecipe):
def get_dtlocal(self):
dt_utc = datetime.datetime.utcnow()
- # convert UTC to local hk time - at around HKT 6.00am, all news are available
- dt_local = dt_utc - datetime.timedelta(-2.0/24)
+ if __Region__ == 'Hong Kong':
+ # convert UTC to local HK time - at HKT 5.30am, all news is available
+ dt_local = dt_utc + datetime.timedelta(8.0/24) - datetime.timedelta(5.5/24)
+ # dt_local = dt_utc.astimezone(pytz.timezone('Asia/Hong_Kong')) - datetime.timedelta(5.5/24)
+ elif __Region__ == 'Vancouver':
+ # convert UTC to local Vancouver time - at PST 5.30am, all news is available
+ dt_local = dt_utc + datetime.timedelta(-8.0/24) - datetime.timedelta(5.5/24)
+ #dt_local = dt_utc.astimezone(pytz.timezone('America/Vancouver')) - datetime.timedelta(5.5/24)
+ elif __Region__ == 'Toronto':
+ # convert UTC to local Toronto time - at EST 8.30am, all news is available
+ dt_local = dt_utc + datetime.timedelta(-5.0/24) - datetime.timedelta(8.5/24)
+ #dt_local = dt_utc.astimezone(pytz.timezone('America/Toronto')) - datetime.timedelta(8.5/24)
return dt_local
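The fixed-offset arithmetic above can be sketched on its own. This is an illustration only, not part of the recipe; `dtlocal` and `REGION_OFFSETS` are hypothetical names, and like the recipe it ignores DST (the commented-out pytz lines would handle that correctly):

```python
import datetime

# Fixed-offset sketch of the conversion above: shift UTC to the region's
# standard time, then subtract the hour at which the day's paper is complete,
# so times before that hour resolve to the previous day's edition.
REGION_OFFSETS = {  # (UTC offset in hours, news-ready hour) - DST is ignored
    'Hong Kong': (8.0, 5.5),
    'Vancouver': (-8.0, 5.5),
    'Toronto': (-5.0, 8.5),
}

def dtlocal(region, dt_utc=None):
    if dt_utc is None:
        dt_utc = datetime.datetime.utcnow()
    utc_off, ready = REGION_OFFSETS[region]
    return (dt_utc + datetime.timedelta(hours=utc_off)
                   - datetime.timedelta(hours=ready))
```

For a UTC time of noon, Hong Kong resolves to the same day's edition while both Canadian editions fall back to the previous day's.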
def get_fetchdate(self):
@@ -135,13 +199,15 @@ class MPHKRecipe(BasicNewsRecipe):
return self.get_dtlocal().strftime("%Y-%m-%d")
def get_fetchday(self):
- # dt_utc = datetime.datetime.utcnow()
- # convert UTC to local hk time - at around HKT 6.00am, all news are available
- # dt_local = dt_utc - datetime.timedelta(-2.0/24)
return self.get_dtlocal().strftime("%d")
def get_cover_url(self):
- cover = 'http://news.mingpao.com/' + self.get_fetchdate() + '/' + self.get_fetchdate() + '_' + self.get_fetchday() + 'gacov.jpg'
+ if __Region__ == 'Hong Kong':
+ cover = 'http://news.mingpao.com/' + self.get_fetchdate() + '/' + self.get_fetchdate() + '_' + self.get_fetchday() + 'gacov.jpg'
+ elif __Region__ == 'Vancouver':
+ cover = 'http://www.mingpaovan.com/ftp/News/' + self.get_fetchdate() + '/' + self.get_fetchday() + 'pgva1s.jpg'
+ elif __Region__ == 'Toronto':
+ cover = 'http://www.mingpaotor.com/ftp/News/' + self.get_fetchdate() + '/' + self.get_fetchday() + 'pgtas.jpg'
br = BasicNewsRecipe.get_browser()
try:
br.open(cover)
@@ -153,76 +219,104 @@ class MPHKRecipe(BasicNewsRecipe):
feeds = []
dateStr = self.get_fetchdate()
- if __UseLife__:
- for title, url, keystr in [(u'\u8981\u805e Headline', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalga', 'nal'),
- (u'\u6e2f\u805e Local', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalgb', 'nal'),
- (u'\u6559\u80b2 Education', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalgf', 'nal'),
- (u'\u793e\u8a55/\u7b46\u9663 Editorial', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalmr', 'nal'),
- (u'\u8ad6\u58c7 Forum', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalfa', 'nal'),
- (u'\u4e2d\u570b China', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalca', 'nal'),
- (u'\u570b\u969b World', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalta', 'nal'),
- (u'\u7d93\u6fdf Finance', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalea', 'nal'),
- (u'\u9ad4\u80b2 Sport', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalsp', 'nal'),
- (u'\u5f71\u8996 Film/TV', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalma', 'nal'),
- (u'\u5c08\u6b04 Columns', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=ncolumn', 'ncl')]:
- articles = self.parse_section2(url, keystr)
+ if __Region__ == 'Hong Kong':
+ if __UseLife__:
+ for title, url, keystr in [(u'\u8981\u805e Headline', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalga', 'nal'),
+ (u'\u6e2f\u805e Local', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalgb', 'nal'),
+ (u'\u6559\u80b2 Education', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalgf', 'nal'),
+ (u'\u793e\u8a55/\u7b46\u9663 Editorial', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalmr', 'nal'),
+ (u'\u8ad6\u58c7 Forum', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalfa', 'nal'),
+ (u'\u4e2d\u570b China', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalca', 'nal'),
+ (u'\u570b\u969b World', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalta', 'nal'),
+ (u'\u7d93\u6fdf Finance', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalea', 'nal'),
+ (u'\u9ad4\u80b2 Sport', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalsp', 'nal'),
+ (u'\u5f71\u8996 Film/TV', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalma', 'nal'),
+ (u'\u5c08\u6b04 Columns', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=ncolumn', 'ncl')]:
+ articles = self.parse_section2(url, keystr)
+ if articles:
+ feeds.append((title, articles))
+
+ for title, url in [(u'\u526f\u520a Supplement', 'http://news.mingpao.com/' + dateStr + '/jaindex.htm'),
+ (u'\u82f1\u6587 English', 'http://news.mingpao.com/' + dateStr + '/emindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+ else:
+ for title, url in [(u'\u8981\u805e Headline', 'http://news.mingpao.com/' + dateStr + '/gaindex.htm'),
+ (u'\u6e2f\u805e Local', 'http://news.mingpao.com/' + dateStr + '/gbindex.htm'),
+ (u'\u6559\u80b2 Education', 'http://news.mingpao.com/' + dateStr + '/gfindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+ # special- editorial
+ ed_articles = self.parse_ed_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalmr')
+ if ed_articles:
+ feeds.append((u'\u793e\u8a55/\u7b46\u9663 Editorial', ed_articles))
+
+ for title, url in [(u'\u8ad6\u58c7 Forum', 'http://news.mingpao.com/' + dateStr + '/faindex.htm'),
+ (u'\u4e2d\u570b China', 'http://news.mingpao.com/' + dateStr + '/caindex.htm'),
+ (u'\u570b\u969b World', 'http://news.mingpao.com/' + dateStr + '/taindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+ # special - finance
+ #fin_articles = self.parse_fin_section('http://www.mpfinance.com/htm/Finance/' + dateStr + '/News/ea,eb,ecindex.htm')
+ fin_articles = self.parse_fin_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalea')
+ if fin_articles:
+ feeds.append((u'\u7d93\u6fdf Finance', fin_articles))
+
+ for title, url in [('Tech News', 'http://news.mingpao.com/' + dateStr + '/naindex.htm'),
+ (u'\u9ad4\u80b2 Sport', 'http://news.mingpao.com/' + dateStr + '/spindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+ # special - entertainment
+ ent_articles = self.parse_ent_section('http://ol.mingpao.com/cfm/star1.cfm')
+ if ent_articles:
+ feeds.append((u'\u5f71\u8996 Film/TV', ent_articles))
+
+ for title, url in [(u'\u526f\u520a Supplement', 'http://news.mingpao.com/' + dateStr + '/jaindex.htm'),
+ (u'\u82f1\u6587 English', 'http://news.mingpao.com/' + dateStr + '/emindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+
+ # special- columns
+ col_articles = self.parse_col_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=ncolumn')
+ if col_articles:
+ feeds.append((u'\u5c08\u6b04 Columns', col_articles))
+ elif __Region__ == 'Vancouver':
+ for title, url in [(u'\u8981\u805e Headline', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VAindex.htm'),
+ (u'\u52a0\u570b Canada', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VBindex.htm'),
+ (u'\u793e\u5340 Local', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VDindex.htm'),
+ (u'\u6e2f\u805e Hong Kong', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/HK-VGindex.htm'),
+ (u'\u570b\u969b World', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VTindex.htm'),
+ (u'\u4e2d\u570b China', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VCindex.htm'),
+ (u'\u7d93\u6fdf Economics', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VEindex.htm'),
+ (u'\u9ad4\u80b2 Sports', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VSindex.htm'),
+ (u'\u5f71\u8996 Film/TV', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/HK-MAindex.htm'),
+ (u'\u526f\u520a Supplements', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/WWindex.htm'),]:
+ articles = self.parse_section3(url, 'http://www.mingpaovan.com/')
if articles:
feeds.append((title, articles))
-
- for title, url in [(u'\u526f\u520a Supplement', 'http://news.mingpao.com/' + dateStr + '/jaindex.htm'),
- (u'\u82f1\u6587 English', 'http://news.mingpao.com/' + dateStr + '/emindex.htm')]:
- articles = self.parse_section(url)
+ elif __Region__ == 'Toronto':
+ for title, url in [(u'\u8981\u805e Headline', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TAindex.htm'),
+ (u'\u52a0\u570b Canada', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TDindex.htm'),
+ (u'\u793e\u5340 Local', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TFindex.htm'),
+ (u'\u4e2d\u570b China', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TCAindex.htm'),
+ (u'\u570b\u969b World', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TTAindex.htm'),
+ (u'\u6e2f\u805e Hong Kong', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/HK-GAindex.htm'),
+ (u'\u7d93\u6fdf Economics', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/THindex.htm'),
+ (u'\u9ad4\u80b2 Sports', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TSindex.htm'),
+ (u'\u5f71\u8996 Film/TV', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/HK-MAindex.htm'),
+ (u'\u526f\u520a Supplements', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/WWindex.htm'),]:
+ articles = self.parse_section3(url, 'http://www.mingpaotor.com/')
if articles:
feeds.append((title, articles))
- else:
- for title, url in [(u'\u8981\u805e Headline', 'http://news.mingpao.com/' + dateStr + '/gaindex.htm'),
- (u'\u6e2f\u805e Local', 'http://news.mingpao.com/' + dateStr + '/gbindex.htm'),
- (u'\u6559\u80b2 Education', 'http://news.mingpao.com/' + dateStr + '/gfindex.htm')]:
- articles = self.parse_section(url)
- if articles:
- feeds.append((title, articles))
-
- # special- editorial
- ed_articles = self.parse_ed_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalmr')
- if ed_articles:
- feeds.append((u'\u793e\u8a55/\u7b46\u9663 Editorial', ed_articles))
-
- for title, url in [(u'\u8ad6\u58c7 Forum', 'http://news.mingpao.com/' + dateStr + '/faindex.htm'),
- (u'\u4e2d\u570b China', 'http://news.mingpao.com/' + dateStr + '/caindex.htm'),
- (u'\u570b\u969b World', 'http://news.mingpao.com/' + dateStr + '/taindex.htm')]:
- articles = self.parse_section(url)
- if articles:
- feeds.append((title, articles))
-
- # special - finance
- #fin_articles = self.parse_fin_section('http://www.mpfinance.com/htm/Finance/' + dateStr + '/News/ea,eb,ecindex.htm')
- fin_articles = self.parse_fin_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalea')
- if fin_articles:
- feeds.append((u'\u7d93\u6fdf Finance', fin_articles))
-
- for title, url in [('Tech News', 'http://news.mingpao.com/' + dateStr + '/naindex.htm'),
- (u'\u9ad4\u80b2 Sport', 'http://news.mingpao.com/' + dateStr + '/spindex.htm')]:
- articles = self.parse_section(url)
- if articles:
- feeds.append((title, articles))
-
- # special - entertainment
- ent_articles = self.parse_ent_section('http://ol.mingpao.com/cfm/star1.cfm')
- if ent_articles:
- feeds.append((u'\u5f71\u8996 Film/TV', ent_articles))
-
- for title, url in [(u'\u526f\u520a Supplement', 'http://news.mingpao.com/' + dateStr + '/jaindex.htm'),
- (u'\u82f1\u6587 English', 'http://news.mingpao.com/' + dateStr + '/emindex.htm')]:
- articles = self.parse_section(url)
- if articles:
- feeds.append((title, articles))
-
-
- # special- columns
- col_articles = self.parse_col_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=ncolumn')
- if col_articles:
- feeds.append((u'\u5c08\u6b04 Columns', col_articles))
-
return feeds
# parse from news.mingpao.com
@@ -256,11 +350,30 @@ class MPHKRecipe(BasicNewsRecipe):
title = self.tag_to_string(i)
url = 'http://life.mingpao.com/cfm/' + i.get('href', False)
if (url not in included_urls) and (not url.rfind('.txt') == -1) and (not url.rfind(keystr) == -1):
+ url = url.replace('dailynews3.cfm', 'dailynews3a.cfm') # use printed version of the article
current_articles.append({'title': title, 'url': url, 'description': ''})
included_urls.append(url)
current_articles.reverse()
return current_articles
+ # parse from www.mingpaovan.com
+ def parse_section3(self, url, baseUrl):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ divs = soup.findAll(attrs={'class': ['ListContentLargeLink']})
+ current_articles = []
+ included_urls = []
+ divs.reverse()
+ for i in divs:
+ title = self.tag_to_string(i)
+ urlstr = i.get('href', False)
+ urlstr = baseUrl + '/' + urlstr.replace('../../../', '')
+ if urlstr not in included_urls:
+ current_articles.append({'title': title, 'url': urlstr, 'description': '', 'date': ''})
+ included_urls.append(urlstr)
+ current_articles.reverse()
+ return current_articles
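The link handling in `parse_section3` above rebases relative hrefs onto the site root. A minimal sketch of that normalization (the URLs in the test are hypothetical examples, not fetched data):

```python
def normalize(base_url, href):
    # Mirrors parse_section3: strip the relative '../../../' prefix and
    # rebase the link onto the given site root.
    return base_url + '/' + href.replace('../../../', '')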
+
def parse_ed_section(self, url):
self.get_fetchdate()
soup = self.index_to_soup(url)
@@ -338,7 +451,12 @@ class MPHKRecipe(BasicNewsRecipe):
if dir is None:
dir = self.output_dir
if __UseChineseTitle__ == True:
- title = u'\u660e\u5831 (\u9999\u6e2f)'
+ if __Region__ == 'Hong Kong':
+ title = u'\u660e\u5831 (\u9999\u6e2f)'
+ elif __Region__ == 'Vancouver':
+ title = u'\u660e\u5831 (\u6eab\u54e5\u83ef)'
+ elif __Region__ == 'Toronto':
+ title = u'\u660e\u5831 (\u591a\u502b\u591a)'
else:
title = self.short_title()
# if not generating a periodical, force date to apply in title
diff --git a/recipes/ming_pao_toronto.recipe b/recipes/ming_pao_toronto.recipe
new file mode 100644
index 0000000000..9f3d7f510c
--- /dev/null
+++ b/recipes/ming_pao_toronto.recipe
@@ -0,0 +1,594 @@
+__license__ = 'GPL v3'
+__copyright__ = '2010-2011, Eddie Lau'
+
+# Region - Hong Kong, Vancouver, Toronto
+__Region__ = 'Toronto'
+# Users of Kindle 3 with limited system-level CJK support,
+# please replace the following "True" with "False".
+__MakePeriodical__ = True
+# Set the following to True if your device supports display of CJK titles
+__UseChineseTitle__ = False
+# Set the following to False if you want to skip images
+__KeepImages__ = True
+# (HK only) Set the following to True to use life.mingpao.com as the main article source
+__UseLife__ = True
+
+
+'''
+Change Log:
+2011/06/26: add fetching of the Vancouver and Toronto versions of the paper; also provide captions for images when using the life.mingpao
+            fetch source, and provide an option to remove all images in the file
+2011/05/12: switch the main parse source to life.mingpao.com, which has more photos on the article pages
+2011/03/06: add new articles for the finance section, and a new section "Columns"
+2011/02/28: rearrange the sections
+            [Disabled until Kindle has better CJK support and can remember the last (section, article) read in the Sections & Articles
+            view] use the same title when generating a periodical, so past issues are automatically put into the "Past Issues"
+            folder on the Kindle 3
+2011/02/20: skip duplicated links in the finance section, move photos which may span a whole page to the end of the articles,
+            clean up the indentation
+2010/12/07: add entertainment section, use newspaper front page as ebook cover, suppress date display in section list
+ (to avoid wrong date display in case the user generates the ebook in a time zone different from HKT)
+2010/11/22: add English section, remove eco-news section which is not updated daily, correct
+ ordering of articles
+2010/11/12: add news image and eco-news section
+2010/11/08: add parsing of finance section
+2010/11/06: temporary work-around for Kindle device having no capability to display unicode
+ in section/article list.
+2010/10/31: skip repeated articles in section pages
+'''
+
+import os, datetime, re
+from calibre.web.feeds.recipes import BasicNewsRecipe
+from contextlib import nested
+from calibre.ebooks.BeautifulSoup import BeautifulSoup
+from calibre.ebooks.metadata.opf2 import OPFCreator
+from calibre.ebooks.metadata.toc import TOC
+from calibre.ebooks.metadata import MetaInformation
+
+# MAIN CLASS
+class MPRecipe(BasicNewsRecipe):
+ if __Region__ == 'Hong Kong':
+ title = 'Ming Pao - Hong Kong'
+ description = 'Hong Kong Chinese Newspaper (http://news.mingpao.com)'
+ category = 'Chinese, News, Hong Kong'
+ extra_css = 'img {display: block; margin-left: auto; margin-right: auto; margin-top: 10px; margin-bottom: 10px;} font>b {font-size:200%; font-weight:bold;}'
+ masthead_url = 'http://news.mingpao.com/image/portals_top_logo_news.gif'
+ keep_only_tags = [dict(name='h1'),
+ dict(name='font', attrs={'style':['font-size:14pt; line-height:160%;']}), # for entertainment page title
+ dict(name='font', attrs={'color':['AA0000']}), # for column articles title
+ dict(attrs={'id':['newscontent']}), # entertainment and column page content
+ dict(attrs={'id':['newscontent01','newscontent02']}),
+ dict(attrs={'class':['photo']}),
+ dict(name='table', attrs={'width':['100%'], 'border':['0'], 'cellspacing':['5'], 'cellpadding':['0']}), # content in printed version of life.mingpao.com
+ dict(name='img', attrs={'width':['180'], 'alt':['按圖放大']}) # images for source from life.mingpao.com
+ ]
+ if __KeepImages__:
+ remove_tags = [dict(name='style'),
+ dict(attrs={'id':['newscontent135']}), # for the finance page from mpfinance.com
+ dict(name='font', attrs={'size':['2'], 'color':['666666']}), # article date in life.mingpao.com article
+ #dict(name='table') # for content fetched from life.mingpao.com
+ ]
+ else:
+ remove_tags = [dict(name='style'),
+ dict(attrs={'id':['newscontent135']}), # for the finance page from mpfinance.com
+ dict(name='font', attrs={'size':['2'], 'color':['666666']}), # article date in life.mingpao.com article
+ dict(name='img'),
+ #dict(name='table') # for content fetched from life.mingpao.com
+ ]
+ remove_attributes = ['width']
+ preprocess_regexps = [
+ (re.compile(r'<h5>', re.DOTALL|re.IGNORECASE),
+ lambda match: '<h1>'),
+ (re.compile(r'</h5>', re.DOTALL|re.IGNORECASE),
+ lambda match: '</h1>'),
+ (re.compile(r'<p><a href=.+?</a></p>', re.DOTALL|re.IGNORECASE), # for entertainment page
+ lambda match: ''),
+ # skip <br> after title in life.mingpao.com fetched article
+ (re.compile(r"<div id='newscontent'><br>", re.DOTALL|re.IGNORECASE),
+ lambda match: "<div id='newscontent'>"),
+ (re.compile(r"<br><br></b>", re.DOTALL|re.IGNORECASE),
+ lambda match: "</b>")
+ ]
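calibre applies each `(pattern, function)` pair in `preprocess_regexps` to the raw HTML in order, via `pattern.sub(function, html)`. A self-contained illustration of that mechanism (the rule shown is hypothetical, not one of the recipe's):

```python
import re

# Hypothetical calibre-style rule list: each entry is a compiled pattern plus
# a callable that receives the match object and returns the replacement.
preprocess_regexps = [
    (re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE),
     lambda match: ''),  # drop inline scripts (example rule)
]

def apply_regexps(html, rules):
    # Apply the rules in order, the way BasicNewsRecipe preprocesses HTML.
    for pattern, fn in rules:
        html = pattern.sub(fn, html)
    return html
```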
+ elif __Region__ == 'Vancouver':
+ title = 'Ming Pao - Vancouver'
+ description = 'Vancouver Chinese Newspaper (http://www.mingpaovan.com)'
+ category = 'Chinese, News, Vancouver'
+ extra_css = 'img {display: block; margin-left: auto; margin-right: auto; margin-top: 10px; margin-bottom: 10px;} b>font {font-size:200%; font-weight:bold;}'
+ masthead_url = 'http://www.mingpaovan.com/image/mainlogo2_VAN2.gif'
+ keep_only_tags = [dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['1']}),
+ dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['3'], 'cellpadding':['3'], 'id':['tblContent3']}),
+ dict(name='table', attrs={'width':['180'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['0'], 'bgcolor':['F0F0F0']}),
+ ]
+ if __KeepImages__:
+ remove_tags = [dict(name='img', attrs={'src':['../../../image/magnifier.gif']})] # the magnifier icon
+ else:
+ remove_tags = [dict(name='img')]
+ remove_attributes = ['width']
+ preprocess_regexps = [(re.compile(r'&nbsp;', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ ]
+ elif __Region__ == 'Toronto':
+ title = 'Ming Pao - Toronto'
+ description = 'Toronto Chinese Newspaper (http://www.mingpaotor.com)'
+ category = 'Chinese, News, Toronto'
+ extra_css = 'img {display: block; margin-left: auto; margin-right: auto; margin-top: 10px; margin-bottom: 10px;} b>font {font-size:200%; font-weight:bold;}'
+ masthead_url = 'http://www.mingpaotor.com/image/mainlogo2_TOR2.gif'
+ keep_only_tags = [dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['1']}),
+ dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['3'], 'cellpadding':['3'], 'id':['tblContent3']}),
+ dict(name='table', attrs={'width':['180'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['0'], 'bgcolor':['F0F0F0']}),
+ ]
+ if __KeepImages__:
+ remove_tags = [dict(name='img', attrs={'src':['../../../image/magnifier.gif']})] # the magnifier icon
+ else:
+ remove_tags = [dict(name='img')]
+ remove_attributes = ['width']
+ preprocess_regexps = [(re.compile(r'&nbsp;', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ ]
+
+ oldest_article = 1
+ max_articles_per_feed = 100
+ __author__ = 'Eddie Lau'
+ publisher = 'MingPao'
+ remove_javascript = True
+ use_embedded_content = False
+ no_stylesheets = True
+ language = 'zh'
+ encoding = 'Big5-HKSCS'
+ recursions = 0
+ conversion_options = {'linearize_tables':True}
+ timefmt = ''
+
+ def image_url_processor(cls, baseurl, url):
+ # trick: break the url at the first occurrence of a digit, add an
+ # additional '_' at the front
+ # not working, may need to move this to preprocess_html() method
+ # (condensed form of the disabled search: index of the first decimal
+ # digit in the url, or 10000 if the url contains no digits)
+ # minIdx = min([url.find(d) for d in '0123456789' if d in url] or [10000])
+ return url
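The disabled search in `image_url_processor` is just "position of the first digit in the URL". A compact equivalent, shown here as an illustration (the function name is hypothetical):

```python
def first_digit_index(url):
    # Equivalent of the recipe's commented-out minIdx search: position of the
    # first decimal digit in the URL, or None if the URL has no digits.
    return next((i for i, ch in enumerate(url) if ch.isdigit()), None)
```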
+
+ def get_dtlocal(self):
+ dt_utc = datetime.datetime.utcnow()
+ if __Region__ == 'Hong Kong':
+ # convert UTC to local HK time - at HKT 5.30am, all news is available
+ dt_local = dt_utc + datetime.timedelta(8.0/24) - datetime.timedelta(5.5/24)
+ # dt_local = dt_utc.astimezone(pytz.timezone('Asia/Hong_Kong')) - datetime.timedelta(5.5/24)
+ elif __Region__ == 'Vancouver':
+ # convert UTC to local Vancouver time - at PST 5.30am, all news is available
+ dt_local = dt_utc + datetime.timedelta(-8.0/24) - datetime.timedelta(5.5/24)
+ #dt_local = dt_utc.astimezone(pytz.timezone('America/Vancouver')) - datetime.timedelta(5.5/24)
+ elif __Region__ == 'Toronto':
+ # convert UTC to local Toronto time - at EST 8.30am, all news is available
+ dt_local = dt_utc + datetime.timedelta(-5.0/24) - datetime.timedelta(8.5/24)
+ #dt_local = dt_utc.astimezone(pytz.timezone('America/Toronto')) - datetime.timedelta(8.5/24)
+ return dt_local
+
+ def get_fetchdate(self):
+ return self.get_dtlocal().strftime("%Y%m%d")
+
+ def get_fetchformatteddate(self):
+ return self.get_dtlocal().strftime("%Y-%m-%d")
+
+ def get_fetchday(self):
+ return self.get_dtlocal().strftime("%d")
+
+ def get_cover_url(self):
+ if __Region__ == 'Hong Kong':
+ cover = 'http://news.mingpao.com/' + self.get_fetchdate() + '/' + self.get_fetchdate() + '_' + self.get_fetchday() + 'gacov.jpg'
+ elif __Region__ == 'Vancouver':
+ cover = 'http://www.mingpaovan.com/ftp/News/' + self.get_fetchdate() + '/' + self.get_fetchday() + 'pgva1s.jpg'
+ elif __Region__ == 'Toronto':
+ cover = 'http://www.mingpaotor.com/ftp/News/' + self.get_fetchdate() + '/' + self.get_fetchday() + 'pgtas.jpg'
+ br = BasicNewsRecipe.get_browser()
+ try:
+ br.open(cover)
+ except:
+ cover = None
+ return cover
+
+ def parse_index(self):
+ feeds = []
+ dateStr = self.get_fetchdate()
+
+ if __Region__ == 'Hong Kong':
+ if __UseLife__:
+ for title, url, keystr in [(u'\u8981\u805e Headline', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalga', 'nal'),
+ (u'\u6e2f\u805e Local', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalgb', 'nal'),
+ (u'\u6559\u80b2 Education', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalgf', 'nal'),
+ (u'\u793e\u8a55/\u7b46\u9663 Editorial', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalmr', 'nal'),
+ (u'\u8ad6\u58c7 Forum', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalfa', 'nal'),
+ (u'\u4e2d\u570b China', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalca', 'nal'),
+ (u'\u570b\u969b World', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalta', 'nal'),
+ (u'\u7d93\u6fdf Finance', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalea', 'nal'),
+ (u'\u9ad4\u80b2 Sport', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalsp', 'nal'),
+ (u'\u5f71\u8996 Film/TV', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalma', 'nal'),
+ (u'\u5c08\u6b04 Columns', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=ncolumn', 'ncl')]:
+ articles = self.parse_section2(url, keystr)
+ if articles:
+ feeds.append((title, articles))
+
+ for title, url in [(u'\u526f\u520a Supplement', 'http://news.mingpao.com/' + dateStr + '/jaindex.htm'),
+ (u'\u82f1\u6587 English', 'http://news.mingpao.com/' + dateStr + '/emindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+ else:
+ for title, url in [(u'\u8981\u805e Headline', 'http://news.mingpao.com/' + dateStr + '/gaindex.htm'),
+ (u'\u6e2f\u805e Local', 'http://news.mingpao.com/' + dateStr + '/gbindex.htm'),
+ (u'\u6559\u80b2 Education', 'http://news.mingpao.com/' + dateStr + '/gfindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+ # special- editorial
+ ed_articles = self.parse_ed_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalmr')
+ if ed_articles:
+ feeds.append((u'\u793e\u8a55/\u7b46\u9663 Editorial', ed_articles))
+
+ for title, url in [(u'\u8ad6\u58c7 Forum', 'http://news.mingpao.com/' + dateStr + '/faindex.htm'),
+ (u'\u4e2d\u570b China', 'http://news.mingpao.com/' + dateStr + '/caindex.htm'),
+ (u'\u570b\u969b World', 'http://news.mingpao.com/' + dateStr + '/taindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+ # special - finance
+ #fin_articles = self.parse_fin_section('http://www.mpfinance.com/htm/Finance/' + dateStr + '/News/ea,eb,ecindex.htm')
+ fin_articles = self.parse_fin_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalea')
+ if fin_articles:
+ feeds.append((u'\u7d93\u6fdf Finance', fin_articles))
+
+ for title, url in [('Tech News', 'http://news.mingpao.com/' + dateStr + '/naindex.htm'),
+ (u'\u9ad4\u80b2 Sport', 'http://news.mingpao.com/' + dateStr + '/spindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+ # special - entertainment
+ ent_articles = self.parse_ent_section('http://ol.mingpao.com/cfm/star1.cfm')
+ if ent_articles:
+ feeds.append((u'\u5f71\u8996 Film/TV', ent_articles))
+
+ for title, url in [(u'\u526f\u520a Supplement', 'http://news.mingpao.com/' + dateStr + '/jaindex.htm'),
+ (u'\u82f1\u6587 English', 'http://news.mingpao.com/' + dateStr + '/emindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+
+ # special- columns
+ col_articles = self.parse_col_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=ncolumn')
+ if col_articles:
+ feeds.append((u'\u5c08\u6b04 Columns', col_articles))
+ elif __Region__ == 'Vancouver':
+ for title, url in [(u'\u8981\u805e Headline', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VAindex.htm'),
+ (u'\u52a0\u570b Canada', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VBindex.htm'),
+ (u'\u793e\u5340 Local', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VDindex.htm'),
+ (u'\u6e2f\u805e Hong Kong', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/HK-VGindex.htm'),
+ (u'\u570b\u969b World', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VTindex.htm'),
+ (u'\u4e2d\u570b China', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VCindex.htm'),
+ (u'\u7d93\u6fdf Economics', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VEindex.htm'),
+ (u'\u9ad4\u80b2 Sports', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VSindex.htm'),
+ (u'\u5f71\u8996 Film/TV', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/HK-MAindex.htm'),
+ (u'\u526f\u520a Supplements', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/WWindex.htm'),]:
+ articles = self.parse_section3(url, 'http://www.mingpaovan.com/')
+ if articles:
+ feeds.append((title, articles))
+ elif __Region__ == 'Toronto':
+ for title, url in [(u'\u8981\u805e Headline', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TAindex.htm'),
+ (u'\u52a0\u570b Canada', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TDindex.htm'),
+ (u'\u793e\u5340 Local', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TFindex.htm'),
+ (u'\u4e2d\u570b China', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TCAindex.htm'),
+ (u'\u570b\u969b World', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TTAindex.htm'),
+ (u'\u6e2f\u805e Hong Kong', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/HK-GAindex.htm'),
+ (u'\u7d93\u6fdf Economics', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/THindex.htm'),
+ (u'\u9ad4\u80b2 Sports', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TSindex.htm'),
+ (u'\u5f71\u8996 Film/TV', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/HK-MAindex.htm'),
+ (u'\u526f\u520a Supplements', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/WWindex.htm'),]:
+ articles = self.parse_section3(url, 'http://www.mingpaotor.com/')
+ if articles:
+ feeds.append((title, articles))
+ return feeds
+
+ # parse from news.mingpao.com
+ def parse_section(self, url):
+ dateStr = self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ divs = soup.findAll(attrs={'class': ['bullet','bullet_grey']})
+ current_articles = []
+ included_urls = []
+ divs.reverse()
+ for i in divs:
+ a = i.find('a', href = True)
+ title = self.tag_to_string(a)
+ url = a.get('href', False)
+ url = 'http://news.mingpao.com/' + dateStr + '/' +url
+ if url not in included_urls and url.rfind('Redirect') == -1:
+ current_articles.append({'title': title, 'url': url, 'description':'', 'date':''})
+ included_urls.append(url)
+ current_articles.reverse()
+ return current_articles
+
+ # parse from life.mingpao.com
+ def parse_section2(self, url, keystr):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ a = soup.findAll('a', href=True)
+ a.reverse()
+ current_articles = []
+ included_urls = []
+ for i in a:
+ title = self.tag_to_string(i)
+ url = 'http://life.mingpao.com/cfm/' + i.get('href', False)
+ if (url not in included_urls) and (not url.rfind('.txt') == -1) and (not url.rfind(keystr) == -1):
+ url = url.replace('dailynews3.cfm', 'dailynews3a.cfm') # use printed version of the article
+ current_articles.append({'title': title, 'url': url, 'description': ''})
+ included_urls.append(url)
+ current_articles.reverse()
+ return current_articles
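The link filter in `parse_section2` above uses `not url.rfind(x) == -1`, which is simply substring containment. A clearer equivalent, shown as an illustration (the function name and test URLs are hypothetical):

```python
def is_article_link(url, keystr):
    # Equivalent to the recipe's rfind() tests: keep the link only when the
    # URL contains both '.txt' (an article file) and the section key string.
    return '.txt' in url and keystr in url
```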
+
+ # parse from www.mingpaovan.com
+ def parse_section3(self, url, baseUrl):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ divs = soup.findAll(attrs={'class': ['ListContentLargeLink']})
+ current_articles = []
+ included_urls = []
+ divs.reverse()
+ for i in divs:
+ title = self.tag_to_string(i)
+ urlstr = i.get('href', False)
+ urlstr = baseUrl + '/' + urlstr.replace('../../../', '')
+ if urlstr not in included_urls:
+ current_articles.append({'title': title, 'url': urlstr, 'description': '', 'date': ''})
+ included_urls.append(urlstr)
+ current_articles.reverse()
+ return current_articles
+
+ def parse_ed_section(self, url):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ a = soup.findAll('a', href=True)
+ a.reverse()
+ current_articles = []
+ included_urls = []
+ for i in a:
+ title = self.tag_to_string(i)
+ url = 'http://life.mingpao.com/cfm/' + i.get('href', False)
+ if (url not in included_urls) and (not url.rfind('.txt') == -1) and (not url.rfind('nal') == -1):
+ current_articles.append({'title': title, 'url': url, 'description': ''})
+ included_urls.append(url)
+ current_articles.reverse()
+ return current_articles
+
+ def parse_fin_section(self, url):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ a = soup.findAll('a', href=True)
+ current_articles = []
+ included_urls = []
+ for i in a:
+ #url = 'http://www.mpfinance.com/cfm/' + i.get('href', False)
+ url = 'http://life.mingpao.com/cfm/' + i.get('href', False)
+ #if url not in included_urls and not url.rfind(dateStr) == -1 and url.rfind('index') == -1:
+ if url not in included_urls and (url.rfind('txt') != -1) and (url.rfind('nal') != -1):
+ title = self.tag_to_string(i)
+ current_articles.append({'title': title, 'url': url, 'description':''})
+ included_urls.append(url)
+ return current_articles
+
+ def parse_ent_section(self, url):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ a = soup.findAll('a', href=True)
+ a.reverse()
+ current_articles = []
+ included_urls = []
+ for i in a:
+ title = self.tag_to_string(i)
+ url = 'http://ol.mingpao.com/cfm/' + i.get('href', False)
+ if (url not in included_urls) and (url.rfind('.txt') != -1) and (url.rfind('star') != -1):
+ current_articles.append({'title': title, 'url': url, 'description': ''})
+ included_urls.append(url)
+ current_articles.reverse()
+ return current_articles
+
+ def parse_col_section(self, url):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ a = soup.findAll('a', href=True)
+ a.reverse()
+ current_articles = []
+ included_urls = []
+ for i in a:
+ title = self.tag_to_string(i)
+ url = 'http://life.mingpao.com/cfm/' + i.get('href', False)
+ if (url not in included_urls) and (url.rfind('.txt') != -1) and (url.rfind('ncl') != -1):
+ current_articles.append({'title': title, 'url': url, 'description': ''})
+ included_urls.append(url)
+ current_articles.reverse()
+ return current_articles
+
+ def preprocess_html(self, soup):
+ for item in soup.findAll(style=True):
+ del item['style']
+ for item in soup.findAll(width=True):
+ del item['width']
+ for item in soup.findAll(align='absmiddle'): # 'absmiddle' is an align value, not an attribute name
+ del item['align']
+ return soup
+
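`preprocess_html` strips presentation attributes via BeautifulSoup. The same idea can be sketched with only the standard library's `html.parser`; the class and function names here are illustrative, and valueless (boolean) attributes are not handled in this sketch:

```python
from html.parser import HTMLParser

class AttrStripper(HTMLParser):
    """Re-emit HTML with the given attributes dropped from every tag,
    roughly what preprocess_html does with BeautifulSoup."""
    def __init__(self, drop=('style', 'width')):
        super().__init__(convert_charrefs=False)
        self.drop = set(drop)
        self.out = []

    def handle_starttag(self, tag, attrs):
        kept = ' '.join('%s="%s"' % (k, v) for k, v in attrs if k not in self.drop)
        self.out.append('<%s%s>' % (tag, ' ' + kept if kept else ''))

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

def strip_attrs(html):
    parser = AttrStripper()
    parser.feed(html)
    return ''.join(parser.out)
```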
+ def create_opf(self, feeds, dir=None):
+ if dir is None:
+ dir = self.output_dir
+ if __UseChineseTitle__:
+ if __Region__ == 'Hong Kong':
+ title = u'\u660e\u5831 (\u9999\u6e2f)'
+ elif __Region__ == 'Vancouver':
+ title = u'\u660e\u5831 (\u6eab\u54e5\u83ef)'
+ elif __Region__ == 'Toronto':
+ title = u'\u660e\u5831 (\u591a\u502b\u591a)'
+ else:
+ title = self.short_title()
+ # if not generating a periodical, force date to apply in title
+ if not __MakePeriodical__:
+ title = title + ' ' + self.get_fetchformatteddate()
+ if True:
+ mi = MetaInformation(title, [self.publisher])
+ mi.publisher = self.publisher
+ mi.author_sort = self.publisher
+ if __MakePeriodical__:
+ mi.publication_type = 'periodical:'+self.publication_type+':'+self.short_title()
+ else:
+ mi.publication_type = self.publication_type+':'+self.short_title()
+ #mi.timestamp = nowf()
+ mi.timestamp = self.get_dtlocal()
+ mi.comments = self.description
+ if not isinstance(mi.comments, unicode):
+ mi.comments = mi.comments.decode('utf-8', 'replace')
+ #mi.pubdate = nowf()
+ mi.pubdate = self.get_dtlocal()
+ opf_path = os.path.join(dir, 'index.opf')
+ ncx_path = os.path.join(dir, 'index.ncx')
+ opf = OPFCreator(dir, mi)
+ # Add mastheadImage entry to section
+ mp = getattr(self, 'masthead_path', None)
+ if mp is not None and os.access(mp, os.R_OK):
+ from calibre.ebooks.metadata.opf2 import Guide
+ ref = Guide.Reference(os.path.basename(self.masthead_path), os.getcwdu())
+ ref.type = 'masthead'
+ ref.title = 'Masthead Image'
+ opf.guide.append(ref)
+
+ manifest = [os.path.join(dir, 'feed_%d'%i) for i in range(len(feeds))]
+ manifest.append(os.path.join(dir, 'index.html'))
+ manifest.append(os.path.join(dir, 'index.ncx'))
+
+ # Get cover
+ cpath = getattr(self, 'cover_path', None)
+ if cpath is None:
+ pf = open(os.path.join(dir, 'cover.jpg'), 'wb')
+ if self.default_cover(pf):
+ cpath = pf.name
+ if cpath is not None and os.access(cpath, os.R_OK):
+ opf.cover = cpath
+ manifest.append(cpath)
+
+ # Get masthead
+ mpath = getattr(self, 'masthead_path', None)
+ if mpath is not None and os.access(mpath, os.R_OK):
+ manifest.append(mpath)
+
+ opf.create_manifest_from_files_in(manifest)
+ for mani in opf.manifest:
+ if mani.path.endswith('.ncx'):
+ mani.id = 'ncx'
+ if mani.path.endswith('mastheadImage.jpg'):
+ mani.id = 'masthead-image'
+ entries = ['index.html']
+ toc = TOC(base_path=dir)
+ self.play_order_counter = 0
+ self.play_order_map = {}
+
+ def feed_index(num, parent):
+ f = feeds[num]
+ for j, a in enumerate(f):
+ if getattr(a, 'downloaded', False):
+ adir = 'feed_%d/article_%d/'%(num, j)
+ auth = a.author
+ if not auth:
+ auth = None
+ desc = a.text_summary
+ if not desc:
+ desc = None
+ else:
+ desc = self.description_limiter(desc)
+ entries.append('%sindex.html'%adir)
+ po = self.play_order_map.get(entries[-1], None)
+ if po is None:
+ self.play_order_counter += 1
+ po = self.play_order_counter
+ parent.add_item('%sindex.html'%adir, None, a.title if a.title else _('Untitled Article'),
+ play_order=po, author=auth, description=desc)
+ last = os.path.join(self.output_dir, ('%sindex.html'%adir).replace('/', os.sep))
+ for sp in a.sub_pages:
+ prefix = os.path.commonprefix([opf_path, sp])
+ relp = sp[len(prefix):]
+ entries.append(relp.replace(os.sep, '/'))
+ last = sp
+
+ if os.path.exists(last):
+ with open(last, 'rb') as fi:
+ src = fi.read().decode('utf-8')
+ soup = BeautifulSoup(src)
+ body = soup.find('body')
+ if body is not None:
+ prefix = '/'.join('..' for i in range(2*len(re.findall(r'link\d+', last))))
+ templ = self.navbar.generate(True, num, j, len(f),
+ not self.has_single_feed,
+ a.orig_url, self.publisher, prefix=prefix,
+ center=self.center_navbar)
+ elem = BeautifulSoup(templ.render(doctype='xhtml').decode('utf-8')).find('div')
+ body.insert(len(body.contents), elem)
+ with open(last, 'wb') as fi:
+ fi.write(unicode(soup).encode('utf-8'))
+ if len(feeds) == 0:
+ raise Exception('All feeds are empty, aborting.')
+
+ if len(feeds) > 1:
+ for i, f in enumerate(feeds):
+ entries.append('feed_%d/index.html'%i)
+ po = self.play_order_map.get(entries[-1], None)
+ if po is None:
+ self.play_order_counter += 1
+ po = self.play_order_counter
+ auth = getattr(f, 'author', None)
+ if not auth:
+ auth = None
+ desc = getattr(f, 'description', None)
+ if not desc:
+ desc = None
+ feed_index(i, toc.add_item('feed_%d/index.html'%i, None,
+ f.title, play_order=po, description=desc, author=auth))
+
+ else:
+ entries.append('feed_%d/index.html'%0)
+ feed_index(0, toc)
+
+ for i, p in enumerate(entries):
+ entries[i] = os.path.join(dir, p.replace('/', os.sep))
+ opf.create_spine(entries)
+ opf.set_toc(toc)
+
+ with nested(open(opf_path, 'wb'), open(ncx_path, 'wb')) as (opf_file, ncx_file):
+ opf.render(opf_file, ncx_file)
+
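The play-order bookkeeping inside `create_opf` (look the entry path up in a map, otherwise take the next counter value) is self-contained enough to isolate. The class below is a sketch of that pattern, not calibre API:

```python
class PlayOrder:
    """Assign stable, monotonically increasing play-order numbers to paths,
    reusing the number on repeat lookups, as create_opf does with
    play_order_map and play_order_counter."""
    def __init__(self):
        self.counter = 0
        self.order = {}

    def get(self, path):
        po = self.order.get(path)
        if po is None:
            self.counter += 1
            po = self.counter
            self.order[path] = po
        return po
```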
diff --git a/recipes/ming_pao_vancouver.recipe b/recipes/ming_pao_vancouver.recipe
new file mode 100644
index 0000000000..3b13211d01
--- /dev/null
+++ b/recipes/ming_pao_vancouver.recipe
@@ -0,0 +1,594 @@
+__license__ = 'GPL v3'
+__copyright__ = '2010-2011, Eddie Lau'
+
+# Region - Hong Kong, Vancouver, Toronto
+__Region__ = 'Vancouver'
+# Users of Kindle 3 with limited system-level CJK support
+# should replace the following "True" with "False".
+__MakePeriodical__ = True
+# Set to True if your device can display CJK titles
+__UseChineseTitle__ = False
+# Set to False if you want to skip images
+__KeepImages__ = True
+# (HK only) Set to True to use life.mingpao.com as the main article source
+__UseLife__ = True
+
+
+'''
+Change Log:
+2011/06/26: add fetching Vancouver and Toronto versions of the paper, also provide captions for images using life.mingpao fetch source
+ provide options to remove all images in the file
+2011/05/12: switch the main parse source to life.mingpao.com, which has more photos on the article pages
+2011/03/06: add new articles for finance section, also a new section "Columns"
+2011/02/28: rearrange the sections
+ [Disabled until Kindle has better CJK support and can remember last (section,article) read in Sections & Articles
+ View] make it the same title if generating a periodical, so past issue will be automatically put into "Past Issues"
+ folder in Kindle 3
+2011/02/20: skip duplicated links in finance section, put photos which may extend a whole page to the back of the articles
+ clean up the indentation
+2010/12/07: add entertainment section, use newspaper front page as ebook cover, suppress date display in section list
+ (to avoid wrong date display in case the user generates the ebook in a time zone different from HKT)
+2010/11/22: add English section, remove eco-news section which is not updated daily, correct
+ ordering of articles
+2010/11/12: add news image and eco-news section
+2010/11/08: add parsing of finance section
+2010/11/06: temporary work-around for Kindle device having no capability to display unicode
+ in section/article list.
+2010/10/31: skip repeated articles in section pages
+'''
+
+import os, datetime, re
+from calibre.web.feeds.recipes import BasicNewsRecipe
+from contextlib import nested
+from calibre.ebooks.BeautifulSoup import BeautifulSoup
+from calibre.ebooks.metadata.opf2 import OPFCreator
+from calibre.ebooks.metadata.toc import TOC
+from calibre.ebooks.metadata import MetaInformation
+
+# MAIN CLASS
+class MPRecipe(BasicNewsRecipe):
+ if __Region__ == 'Hong Kong':
+ title = 'Ming Pao - Hong Kong'
+ description = 'Hong Kong Chinese Newspaper (http://news.mingpao.com)'
+ category = 'Chinese, News, Hong Kong'
+ extra_css = 'img {display: block; margin-left: auto; margin-right: auto; margin-top: 10px; margin-bottom: 10px;} font>b {font-size:200%; font-weight:bold;}'
+ masthead_url = 'http://news.mingpao.com/image/portals_top_logo_news.gif'
+ keep_only_tags = [dict(name='h1'),
+ dict(name='font', attrs={'style':['font-size:14pt; line-height:160%;']}), # for entertainment page title
+ dict(name='font', attrs={'color':['AA0000']}), # for column articles title
+ dict(attrs={'id':['newscontent']}), # entertainment and column page content
+ dict(attrs={'id':['newscontent01','newscontent02']}),
+ dict(attrs={'class':['photo']}),
+ dict(name='table', attrs={'width':['100%'], 'border':['0'], 'cellspacing':['5'], 'cellpadding':['0']}), # content in printed version of life.mingpao.com
+ dict(name='img', attrs={'width':['180'], 'alt':['按圖放大']}) # images for source from life.mingpao.com
+ ]
+ if __KeepImages__:
+ remove_tags = [dict(name='style'),
+ dict(attrs={'id':['newscontent135']}), # for the finance page from mpfinance.com
+ dict(name='font', attrs={'size':['2'], 'color':['666666']}), # article date in life.mingpao.com article
+ #dict(name='table') # for content fetched from life.mingpao.com
+ ]
+ else:
+ remove_tags = [dict(name='style'),
+ dict(attrs={'id':['newscontent135']}), # for the finance page from mpfinance.com
+ dict(name='font', attrs={'size':['2'], 'color':['666666']}), # article date in life.mingpao.com article
+ dict(name='img'),
+ #dict(name='table') # for content fetched from life.mingpao.com
+ ]
+ remove_attributes = ['width']
+ preprocess_regexps = [
+ (re.compile(r'', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ (re.compile(r'', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ (re.compile(r' ', re.DOTALL|re.IGNORECASE), # for entertainment page
+ lambda match: ''),
+ # skip after title in life.mingpao.com fetched article
+ (re.compile(r"", re.DOTALL|re.IGNORECASE),
+ lambda match: " "),
+ (re.compile(r" ", re.DOTALL|re.IGNORECASE),
+ lambda match: "")
+ ]
+ elif __Region__ == 'Vancouver':
+ title = 'Ming Pao - Vancouver'
+ description = 'Vancouver Chinese Newspaper (http://www.mingpaovan.com)'
+ category = 'Chinese, News, Vancouver'
+ extra_css = 'img {display: block; margin-left: auto; margin-right: auto; margin-top: 10px; margin-bottom: 10px;} b>font {font-size:200%; font-weight:bold;}'
+ masthead_url = 'http://www.mingpaovan.com/image/mainlogo2_VAN2.gif'
+ keep_only_tags = [dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['1']}),
+ dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['3'], 'cellpadding':['3'], 'id':['tblContent3']}),
+ dict(name='table', attrs={'width':['180'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['0'], 'bgcolor':['F0F0F0']}),
+ ]
+ if __KeepImages__:
+ remove_tags = [dict(name='img', attrs={'src':['../../../image/magnifier.gif']})] # the magnifier icon
+ else:
+ remove_tags = [dict(name='img')]
+ remove_attributes = ['width']
+ preprocess_regexps = [(re.compile(r' ', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ ]
+ elif __Region__ == 'Toronto':
+ title = 'Ming Pao - Toronto'
+ description = 'Toronto Chinese Newspaper (http://www.mingpaotor.com)'
+ category = 'Chinese, News, Toronto'
+ extra_css = 'img {display: block; margin-left: auto; margin-right: auto; margin-top: 10px; margin-bottom: 10px;} b>font {font-size:200%; font-weight:bold;}'
+ masthead_url = 'http://www.mingpaotor.com/image/mainlogo2_TOR2.gif'
+ keep_only_tags = [dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['1']}),
+ dict(name='table', attrs={'width':['450'], 'border':['0'], 'cellspacing':['3'], 'cellpadding':['3'], 'id':['tblContent3']}),
+ dict(name='table', attrs={'width':['180'], 'border':['0'], 'cellspacing':['0'], 'cellpadding':['0'], 'bgcolor':['F0F0F0']}),
+ ]
+ if __KeepImages__:
+ remove_tags = [dict(name='img', attrs={'src':['../../../image/magnifier.gif']})] # the magnifier icon
+ else:
+ remove_tags = [dict(name='img')]
+ remove_attributes = ['width']
+ preprocess_regexps = [(re.compile(r' ', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ ]
+
+ oldest_article = 1
+ max_articles_per_feed = 100
+ __author__ = 'Eddie Lau'
+ publisher = 'MingPao'
+ remove_javascript = True
+ use_embedded_content = False
+ no_stylesheets = True
+ language = 'zh'
+ encoding = 'Big5-HKSCS'
+ recursions = 0
+ conversion_options = {'linearize_tables':True}
+ timefmt = ''
+
+ def image_url_processor(cls, baseurl, url):
+ # Trick: break the URL at the first occurrence of a digit and add an
+ # extra '_' in front of it. The longhand digit-by-digit scan previously
+ # kept here (commented out) never worked and may need to move to the
+ # preprocess_html() method instead.
+ return url
+
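The transform `image_url_processor` describes (insert an underscore before the first digit in the URL) can be done in one step with `re.search`; the helper name is hypothetical and this is a sketch, not what the recipe ships:

```python
import re

def underscore_before_first_digit(url):
    """Insert '_' immediately before the first digit in the URL,
    returning the URL unchanged when it contains no digits."""
    m = re.search(r'\d', url)
    if m is None:
        return url
    i = m.start()
    return url[:i] + '_' + url[i:]
```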
+ def get_dtlocal(self):
+ dt_utc = datetime.datetime.utcnow()
+ if __Region__ == 'Hong Kong':
+ # convert UTC to local hk time - at HKT 5.30am, all news are available
+ dt_local = dt_utc + datetime.timedelta(8.0/24) - datetime.timedelta(5.5/24)
+ # dt_local = dt_utc.astimezone(pytz.timezone('Asia/Hong_Kong')) - datetime.timedelta(5.5/24)
+ elif __Region__ == 'Vancouver':
+ # convert UTC to local Vancouver time - at PST time 5.30am, all news are available
+ dt_local = dt_utc + datetime.timedelta(-8.0/24) - datetime.timedelta(5.5/24)
+ #dt_local = dt_utc.astimezone(pytz.timezone('America/Vancouver')) - datetime.timedelta(5.5/24)
+ elif __Region__ == 'Toronto':
+ # convert UTC to local Toronto time - at EST time 8.30am, all news are available
+ dt_local = dt_utc + datetime.timedelta(-5.0/24) - datetime.timedelta(8.5/24)
+ #dt_local = dt_utc.astimezone(pytz.timezone('America/Toronto')) - datetime.timedelta(8.5/24)
+ return dt_local
+
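`get_dtlocal` applies a fixed UTC offset plus a "news is ready by this hour" cutoff. With timezone-aware datetimes the same arithmetic reads as below; the offsets mirror the recipe's hard-coded values and, like the recipe, ignore daylight-saving changes (names are illustrative):

```python
import datetime

# (UTC offset in hours, cutoff hours) per region, as hard-coded in the recipe.
REGION_PARAMS = {
    'Hong Kong': (8, 5.5),
    'Vancouver': (-8, 5.5),
    'Toronto': (-5, 8.5),
}

def dt_local(region, dt_utc):
    """Convert a naive UTC datetime to the region's local time, shifted back
    by the cutoff so the 'fetch date' only rolls over once news is available."""
    offset, cutoff = REGION_PARAMS[region]
    tz = datetime.timezone(datetime.timedelta(hours=offset))
    local = dt_utc.replace(tzinfo=datetime.timezone.utc).astimezone(tz)
    return local - datetime.timedelta(hours=cutoff)
```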
+ def get_fetchdate(self):
+ return self.get_dtlocal().strftime("%Y%m%d")
+
+ def get_fetchformatteddate(self):
+ return self.get_dtlocal().strftime("%Y-%m-%d")
+
+ def get_fetchday(self):
+ return self.get_dtlocal().strftime("%d")
+
+ def get_cover_url(self):
+ if __Region__ == 'Hong Kong':
+ cover = 'http://news.mingpao.com/' + self.get_fetchdate() + '/' + self.get_fetchdate() + '_' + self.get_fetchday() + 'gacov.jpg'
+ elif __Region__ == 'Vancouver':
+ cover = 'http://www.mingpaovan.com/ftp/News/' + self.get_fetchdate() + '/' + self.get_fetchday() + 'pgva1s.jpg'
+ elif __Region__ == 'Toronto':
+ cover = 'http://www.mingpaotor.com/ftp/News/' + self.get_fetchdate() + '/' + self.get_fetchday() + 'pgtas.jpg'
+ br = BasicNewsRecipe.get_browser()
+ try:
+ br.open(cover)
+ except:
+ cover = None
+ return cover
+
+ def parse_index(self):
+ feeds = []
+ dateStr = self.get_fetchdate()
+
+ if __Region__ == 'Hong Kong':
+ if __UseLife__:
+ for title, url, keystr in [(u'\u8981\u805e Headline', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalga', 'nal'),
+ (u'\u6e2f\u805e Local', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalgb', 'nal'),
+ (u'\u6559\u80b2 Education', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalgf', 'nal'),
+ (u'\u793e\u8a55/\u7b46\u9663 Editorial', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalmr', 'nal'),
+ (u'\u8ad6\u58c7 Forum', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalfa', 'nal'),
+ (u'\u4e2d\u570b China', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalca', 'nal'),
+ (u'\u570b\u969b World', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalta', 'nal'),
+ (u'\u7d93\u6fdf Finance', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalea', 'nal'),
+ (u'\u9ad4\u80b2 Sport', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalsp', 'nal'),
+ (u'\u5f71\u8996 Film/TV', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalma', 'nal'),
+ (u'\u5c08\u6b04 Columns', 'http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=ncolumn', 'ncl')]:
+ articles = self.parse_section2(url, keystr)
+ if articles:
+ feeds.append((title, articles))
+
+ for title, url in [(u'\u526f\u520a Supplement', 'http://news.mingpao.com/' + dateStr + '/jaindex.htm'),
+ (u'\u82f1\u6587 English', 'http://news.mingpao.com/' + dateStr + '/emindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+ else:
+ for title, url in [(u'\u8981\u805e Headline', 'http://news.mingpao.com/' + dateStr + '/gaindex.htm'),
+ (u'\u6e2f\u805e Local', 'http://news.mingpao.com/' + dateStr + '/gbindex.htm'),
+ (u'\u6559\u80b2 Education', 'http://news.mingpao.com/' + dateStr + '/gfindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+ # special- editorial
+ ed_articles = self.parse_ed_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=nalmr')
+ if ed_articles:
+ feeds.append((u'\u793e\u8a55/\u7b46\u9663 Editorial', ed_articles))
+
+ for title, url in [(u'\u8ad6\u58c7 Forum', 'http://news.mingpao.com/' + dateStr + '/faindex.htm'),
+ (u'\u4e2d\u570b China', 'http://news.mingpao.com/' + dateStr + '/caindex.htm'),
+ (u'\u570b\u969b World', 'http://news.mingpao.com/' + dateStr + '/taindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+ # special - finance
+ #fin_articles = self.parse_fin_section('http://www.mpfinance.com/htm/Finance/' + dateStr + '/News/ea,eb,ecindex.htm')
+ fin_articles = self.parse_fin_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr + '&Category=nalea')
+ if fin_articles:
+ feeds.append((u'\u7d93\u6fdf Finance', fin_articles))
+
+ for title, url in [('Tech News', 'http://news.mingpao.com/' + dateStr + '/naindex.htm'),
+ (u'\u9ad4\u80b2 Sport', 'http://news.mingpao.com/' + dateStr + '/spindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+ # special - entertainment
+ ent_articles = self.parse_ent_section('http://ol.mingpao.com/cfm/star1.cfm')
+ if ent_articles:
+ feeds.append((u'\u5f71\u8996 Film/TV', ent_articles))
+
+ for title, url in [(u'\u526f\u520a Supplement', 'http://news.mingpao.com/' + dateStr + '/jaindex.htm'),
+ (u'\u82f1\u6587 English', 'http://news.mingpao.com/' + dateStr + '/emindex.htm')]:
+ articles = self.parse_section(url)
+ if articles:
+ feeds.append((title, articles))
+
+
+ # special- columns
+ col_articles = self.parse_col_section('http://life.mingpao.com/cfm/dailynews2.cfm?Issue=' + dateStr +'&Category=ncolumn')
+ if col_articles:
+ feeds.append((u'\u5c08\u6b04 Columns', col_articles))
+ elif __Region__ == 'Vancouver':
+ for title, url in [(u'\u8981\u805e Headline', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VAindex.htm'),
+ (u'\u52a0\u570b Canada', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VBindex.htm'),
+ (u'\u793e\u5340 Local', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VDindex.htm'),
+ (u'\u6e2f\u805e Hong Kong', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/HK-VGindex.htm'),
+ (u'\u570b\u969b World', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VTindex.htm'),
+ (u'\u4e2d\u570b China', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VCindex.htm'),
+ (u'\u7d93\u6fdf Economics', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VEindex.htm'),
+ (u'\u9ad4\u80b2 Sports', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/VSindex.htm'),
+ (u'\u5f71\u8996 Film/TV', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/HK-MAindex.htm'),
+ (u'\u526f\u520a Supplements', 'http://www.mingpaovan.com/htm/News/' + dateStr + '/WWindex.htm'),]:
+ articles = self.parse_section3(url, 'http://www.mingpaovan.com/')
+ if articles:
+ feeds.append((title, articles))
+ elif __Region__ == 'Toronto':
+ for title, url in [(u'\u8981\u805e Headline', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TAindex.htm'),
+ (u'\u52a0\u570b Canada', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TDindex.htm'),
+ (u'\u793e\u5340 Local', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TFindex.htm'),
+ (u'\u4e2d\u570b China', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TCAindex.htm'),
+ (u'\u570b\u969b World', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TTAindex.htm'),
+ (u'\u6e2f\u805e Hong Kong', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/HK-GAindex.htm'),
+ (u'\u7d93\u6fdf Economics', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/THindex.htm'),
+ (u'\u9ad4\u80b2 Sports', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/TSindex.htm'),
+ (u'\u5f71\u8996 Film/TV', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/HK-MAindex.htm'),
+ (u'\u526f\u520a Supplements', 'http://www.mingpaotor.com/htm/News/' + dateStr + '/WWindex.htm'),]:
+ articles = self.parse_section3(url, 'http://www.mingpaotor.com/')
+ if articles:
+ feeds.append((title, articles))
+ return feeds
+
+ # parse from news.mingpao.com
+ def parse_section(self, url):
+ dateStr = self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ divs = soup.findAll(attrs={'class': ['bullet','bullet_grey']})
+ current_articles = []
+ included_urls = []
+ divs.reverse()
+ for i in divs:
+ a = i.find('a', href=True)
+ title = self.tag_to_string(a)
+ url = a.get('href', False)
+ url = 'http://news.mingpao.com/' + dateStr + '/' + url
+ if url not in included_urls and url.rfind('Redirect') == -1:
+ current_articles.append({'title': title, 'url': url, 'description':'', 'date':''})
+ included_urls.append(url)
+ current_articles.reverse()
+ return current_articles
+
+ # parse from life.mingpao.com
+ def parse_section2(self, url, keystr):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ a = soup.findAll('a', href=True)
+ a.reverse()
+ current_articles = []
+ included_urls = []
+ for i in a:
+ title = self.tag_to_string(i)
+ url = 'http://life.mingpao.com/cfm/' + i.get('href', False)
+ if (url not in included_urls) and (url.rfind('.txt') != -1) and (url.rfind(keystr) != -1):
+ url = url.replace('dailynews3.cfm', 'dailynews3a.cfm') # use printed version of the article
+ current_articles.append({'title': title, 'url': url, 'description': ''})
+ included_urls.append(url)
+ current_articles.reverse()
+ return current_articles
+
+ # parse from www.mingpaovan.com
+ def parse_section3(self, url, baseUrl):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ divs = soup.findAll(attrs={'class': ['ListContentLargeLink']})
+ current_articles = []
+ included_urls = []
+ divs.reverse()
+ for i in divs:
+ title = self.tag_to_string(i)
+ urlstr = i.get('href', False)
+ urlstr = baseUrl + '/' + urlstr.replace('../../../', '')
+ if urlstr not in included_urls:
+ current_articles.append({'title': title, 'url': urlstr, 'description': '', 'date': ''})
+ included_urls.append(urlstr)
+ current_articles.reverse()
+ return current_articles
+
+ def parse_ed_section(self, url):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ a = soup.findAll('a', href=True)
+ a.reverse()
+ current_articles = []
+ included_urls = []
+ for i in a:
+ title = self.tag_to_string(i)
+ url = 'http://life.mingpao.com/cfm/' + i.get('href', False)
+ if (url not in included_urls) and (url.rfind('.txt') != -1) and (url.rfind('nal') != -1):
+ current_articles.append({'title': title, 'url': url, 'description': ''})
+ included_urls.append(url)
+ current_articles.reverse()
+ return current_articles
+
+ def parse_fin_section(self, url):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ a = soup.findAll('a', href=True)
+ current_articles = []
+ included_urls = []
+ for i in a:
+ #url = 'http://www.mpfinance.com/cfm/' + i.get('href', False)
+ url = 'http://life.mingpao.com/cfm/' + i.get('href', False)
+ #if url not in included_urls and not url.rfind(dateStr) == -1 and url.rfind('index') == -1:
+ if url not in included_urls and (url.rfind('txt') != -1) and (url.rfind('nal') != -1):
+ title = self.tag_to_string(i)
+ current_articles.append({'title': title, 'url': url, 'description':''})
+ included_urls.append(url)
+ return current_articles
+
+ def parse_ent_section(self, url):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ a = soup.findAll('a', href=True)
+ a.reverse()
+ current_articles = []
+ included_urls = []
+ for i in a:
+ title = self.tag_to_string(i)
+ url = 'http://ol.mingpao.com/cfm/' + i.get('href', False)
+ if (url not in included_urls) and (url.rfind('.txt') != -1) and (url.rfind('star') != -1):
+ current_articles.append({'title': title, 'url': url, 'description': ''})
+ included_urls.append(url)
+ current_articles.reverse()
+ return current_articles
+
+ def parse_col_section(self, url):
+ self.get_fetchdate()
+ soup = self.index_to_soup(url)
+ a = soup.findAll('a', href=True)
+ a.reverse()
+ current_articles = []
+ included_urls = []
+ for i in a:
+ title = self.tag_to_string(i)
+ url = 'http://life.mingpao.com/cfm/' + i.get('href', False)
+ if (url not in included_urls) and (url.rfind('.txt') != -1) and (url.rfind('ncl') != -1):
+ current_articles.append({'title': title, 'url': url, 'description': ''})
+ included_urls.append(url)
+ current_articles.reverse()
+ return current_articles
+
+ def preprocess_html(self, soup):
+ for item in soup.findAll(style=True):
+ del item['style']
+ for item in soup.findAll(width=True):
+ del item['width']
+ for item in soup.findAll(align='absmiddle'): # 'absmiddle' is an align value, not an attribute name
+ del item['align']
+ return soup
+
+ def create_opf(self, feeds, dir=None):
+ if dir is None:
+ dir = self.output_dir
+ if __UseChineseTitle__:
+ if __Region__ == 'Hong Kong':
+ title = u'\u660e\u5831 (\u9999\u6e2f)'
+ elif __Region__ == 'Vancouver':
+ title = u'\u660e\u5831 (\u6eab\u54e5\u83ef)'
+ elif __Region__ == 'Toronto':
+ title = u'\u660e\u5831 (\u591a\u502b\u591a)'
+ else:
+ title = self.short_title()
+ # if not generating a periodical, force date to apply in title
+ if not __MakePeriodical__:
+ title = title + ' ' + self.get_fetchformatteddate()
+ if True:
+ mi = MetaInformation(title, [self.publisher])
+ mi.publisher = self.publisher
+ mi.author_sort = self.publisher
+ if __MakePeriodical__:
+ mi.publication_type = 'periodical:'+self.publication_type+':'+self.short_title()
+ else:
+ mi.publication_type = self.publication_type+':'+self.short_title()
+ #mi.timestamp = nowf()
+ mi.timestamp = self.get_dtlocal()
+ mi.comments = self.description
+ if not isinstance(mi.comments, unicode):
+ mi.comments = mi.comments.decode('utf-8', 'replace')
+ #mi.pubdate = nowf()
+ mi.pubdate = self.get_dtlocal()
+ opf_path = os.path.join(dir, 'index.opf')
+ ncx_path = os.path.join(dir, 'index.ncx')
+ opf = OPFCreator(dir, mi)
+ # Add mastheadImage entry to section
+ mp = getattr(self, 'masthead_path', None)
+ if mp is not None and os.access(mp, os.R_OK):
+ from calibre.ebooks.metadata.opf2 import Guide
+ ref = Guide.Reference(os.path.basename(self.masthead_path), os.getcwdu())
+ ref.type = 'masthead'
+ ref.title = 'Masthead Image'
+ opf.guide.append(ref)
+
+ manifest = [os.path.join(dir, 'feed_%d'%i) for i in range(len(feeds))]
+ manifest.append(os.path.join(dir, 'index.html'))
+ manifest.append(os.path.join(dir, 'index.ncx'))
+
+ # Get cover
+ cpath = getattr(self, 'cover_path', None)
+ if cpath is None:
+ pf = open(os.path.join(dir, 'cover.jpg'), 'wb')
+ if self.default_cover(pf):
+ cpath = pf.name
+ if cpath is not None and os.access(cpath, os.R_OK):
+ opf.cover = cpath
+ manifest.append(cpath)
+
+ # Get masthead
+ mpath = getattr(self, 'masthead_path', None)
+ if mpath is not None and os.access(mpath, os.R_OK):
+ manifest.append(mpath)
+
+ opf.create_manifest_from_files_in(manifest)
+ for mani in opf.manifest:
+ if mani.path.endswith('.ncx'):
+ mani.id = 'ncx'
+ if mani.path.endswith('mastheadImage.jpg'):
+ mani.id = 'masthead-image'
+ entries = ['index.html']
+ toc = TOC(base_path=dir)
+ self.play_order_counter = 0
+ self.play_order_map = {}
+
+ def feed_index(num, parent):
+ f = feeds[num]
+ for j, a in enumerate(f):
+ if getattr(a, 'downloaded', False):
+ adir = 'feed_%d/article_%d/'%(num, j)
+ auth = a.author
+ if not auth:
+ auth = None
+ desc = a.text_summary
+ if not desc:
+ desc = None
+ else:
+ desc = self.description_limiter(desc)
+ entries.append('%sindex.html'%adir)
+ po = self.play_order_map.get(entries[-1], None)
+ if po is None:
+ self.play_order_counter += 1
+ po = self.play_order_counter
+ parent.add_item('%sindex.html'%adir, None, a.title if a.title else _('Untitled Article'),
+ play_order=po, author=auth, description=desc)
+ last = os.path.join(self.output_dir, ('%sindex.html'%adir).replace('/', os.sep))
+ for sp in a.sub_pages:
+ prefix = os.path.commonprefix([opf_path, sp])
+ relp = sp[len(prefix):]
+ entries.append(relp.replace(os.sep, '/'))
+ last = sp
+
+ if os.path.exists(last):
+ with open(last, 'rb') as fi:
+ src = fi.read().decode('utf-8')
+ soup = BeautifulSoup(src)
+ body = soup.find('body')
+ if body is not None:
+ prefix = '/'.join('..' for i in range(2*len(re.findall(r'link\d+', last))))
+ templ = self.navbar.generate(True, num, j, len(f),
+ not self.has_single_feed,
+ a.orig_url, self.publisher, prefix=prefix,
+ center=self.center_navbar)
+ elem = BeautifulSoup(templ.render(doctype='xhtml').decode('utf-8')).find('div')
+ body.insert(len(body.contents), elem)
+ with open(last, 'wb') as fi:
+ fi.write(unicode(soup).encode('utf-8'))
+ if len(feeds) == 0:
+ raise Exception('All feeds are empty, aborting.')
+
+ if len(feeds) > 1:
+ for i, f in enumerate(feeds):
+ entries.append('feed_%d/index.html'%i)
+ po = self.play_order_map.get(entries[-1], None)
+ if po is None:
+ self.play_order_counter += 1
+ po = self.play_order_counter
+ auth = getattr(f, 'author', None)
+ if not auth:
+ auth = None
+ desc = getattr(f, 'description', None)
+ if not desc:
+ desc = None
+ feed_index(i, toc.add_item('feed_%d/index.html'%i, None,
+ f.title, play_order=po, description=desc, author=auth))
+
+ else:
+ entries.append('feed_%d/index.html'%0)
+ feed_index(0, toc)
+
+ for i, p in enumerate(entries):
+ entries[i] = os.path.join(dir, p.replace('/', os.sep))
+ opf.create_spine(entries)
+ opf.set_toc(toc)
+
+ with nested(open(opf_path, 'wb'), open(ncx_path, 'wb')) as (opf_file, ncx_file):
+ opf.render(opf_file, ncx_file)
+
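The TOC-building code above threads `self.play_order_counter` and `self.play_order_map` through every feed and article entry. The bookkeeping can be sketched in isolation (plain names stand in for the recipe object's attributes; this is an illustrative model, not calibre's API):

```python
# Illustrative sketch of the play-order bookkeeping above: each spine path
# gets a stable, monotonically increasing play order, and a path seen again
# reuses its previous number.
play_order_counter = 0
play_order_map = {}

def play_order_for(path):
    """Return the play order for path, assigning the next number if new."""
    global play_order_counter
    po = play_order_map.get(path)
    if po is None:
        play_order_counter += 1
        po = play_order_counter
        play_order_map[path] = po
    return po
```
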
diff --git a/recipes/nme.recipe b/recipes/nme.recipe
new file mode 100644
index 0000000000..70e8e24fde
--- /dev/null
+++ b/recipes/nme.recipe
@@ -0,0 +1,42 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class AdvancedUserRecipe1306061239(BasicNewsRecipe):
+ title = u'New Musical Express Magazine'
+ __author__ = "scissors"
+ language = 'en'
+ remove_empty_feeds = True
+ remove_javascript = True
+ no_stylesheets = True
+ oldest_article = 7
+ max_articles_per_feed = 100
+ cover_url = 'http://tawanda3000.files.wordpress.com/2011/02/nme-logo.jpg'
+
+ remove_tags = [
+ dict( attrs={'class':'clear_icons'}),
+ dict( attrs={'class':'share_links'}),
+ dict( attrs={'id':'right_panel'}),
+ dict( attrs={'class':'today box'})
+
+]
+
+ keep_only_tags = [
+
+ dict(name='h1'),
+ #dict(name='h3'),
+ dict(attrs={'class' : 'BText'}),
+ dict(attrs={'class' : 'Bmore'}),
+ dict(attrs={'class' : 'bPosts'}),
+ dict(attrs={'class' : 'text'}),
+ dict(attrs={'id' : 'article_gallery'}),
+ dict(attrs={'class' : 'article_text'})
+]
+
+
+
+
+ feeds = [
+ (u'NME News', u'http://feeds2.feedburner.com/nmecom/rss/newsxml'),
+ (u'Reviews', u'http://feeds2.feedburner.com/nme/SdML'),
+ (u'Blogs', u'http://www.nme.com/blog/index.php?blog=140&tempskin=_rss2'),
+
+ ]
diff --git a/recipes/noticias_r7.recipe b/recipes/noticias_r7.recipe
new file mode 100644
index 0000000000..b7495bb77e
--- /dev/null
+++ b/recipes/noticias_r7.recipe
@@ -0,0 +1,40 @@
+import re
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class PortalR7(BasicNewsRecipe):
+ title = 'Noticias R7'
+ __author__ = 'Diniz Bortolotto'
+ description = 'Noticias Portal R7'
+ oldest_article = 2
+ max_articles_per_feed = 20
+ encoding = 'utf8'
+ publisher = 'Rede Record'
+ category = 'news, Brazil'
+ language = 'pt_BR'
+ publication_type = 'newsportal'
+ use_embedded_content = False
+ no_stylesheets = True
+ remove_javascript = True
+ remove_attributes = ['style']
+
+ feeds = [
+ (u'Brasil', u'http://www.r7.com/data/rss/brasil.xml'),
+ (u'Economia', u'http://www.r7.com/data/rss/economia.xml'),
+ (u'Internacional', u'http://www.r7.com/data/rss/internacional.xml'),
+ (u'Tecnologia e Ci\xeancia', u'http://www.r7.com/data/rss/tecnologiaCiencia.xml')
+ ]
+ reverse_article_order = True
+
+ keep_only_tags = [dict(name='div', attrs={'class':'materia'})]
+ remove_tags = [
+ dict(id=['espalhe', 'report-erro']),
+ dict(name='ul', attrs={'class':'controles'}),
+ dict(name='ul', attrs={'class':'relacionados'}),
+ dict(name='div', attrs={'class':'materia_banner'}),
+ dict(name='div', attrs={'class':'materia_controles'})
+ ]
+
+ preprocess_regexps = [
+ (re.compile(r'.* ',re.DOTALL|re.IGNORECASE),
+ lambda match: ' ')
+ ]
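For context, calibre applies each `preprocess_regexps` pair to the raw article HTML in order, substituting the callback's return value for every match. A minimal sketch of that loop (the rule and sample HTML below are illustrative, not taken from this recipe):

```python
import re

# Minimal sketch of how (pattern, callback) pairs in preprocess_regexps are
# applied: each compiled pattern is substituted over the raw HTML in order.
rules = [
    (re.compile(r'style="display: none;"'), lambda m: ''),
]

def apply_preprocess_regexps(raw, rules):
    for pattern, repl in rules:
        raw = pattern.sub(repl, raw)
    return raw
```
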
diff --git a/recipes/perfil.recipe b/recipes/perfil.recipe
index 1104202318..af7072c6f6 100644
--- a/recipes/perfil.recipe
+++ b/recipes/perfil.recipe
@@ -26,6 +26,7 @@ class Perfil(BasicNewsRecipe):
.foto1 h1{font-size: x-small}
h1{font-family: Georgia,"Times New Roman",serif}
img{margin-bottom: 0.4em}
+ .hora{font-size: x-small; color: red}
"""
conversion_options = {
@@ -60,7 +61,26 @@ class Perfil(BasicNewsRecipe):
,(u'Tecnologia' , u'http://www.perfil.com/rss/tecnologia.xml' )
]
+ def get_article_url(self, article):
+ return article.get('guid', None)
+
def preprocess_html(self, soup):
for item in soup.findAll(style=True):
del item['style']
+ for item in soup.findAll('a'):
+ limg = item.find('img')
+ if item.string is not None:
+ str = item.string
+ item.replaceWith(str)
+ else:
+ if limg:
+ item.name = 'div'
+ item.attrs = []
+ else:
+ str = self.tag_to_string(item)
+ item.replaceWith(str)
+ for item in soup.findAll('img'):
+ if not item.has_key('alt'):
+ item['alt'] = 'image'
return soup
+
\ No newline at end of file
diff --git a/recipes/philly.recipe b/recipes/philly.recipe
index 80de2f3277..c6cad5d174 100644
--- a/recipes/philly.recipe
+++ b/recipes/philly.recipe
@@ -1,85 +1,45 @@
#!/usr/bin/env python
-__license__ = 'GPL v3'
-'''
-philly.com/inquirer/
-'''
-from calibre.web.feeds.recipes import BasicNewsRecipe
+from calibre.web.feeds.news import BasicNewsRecipe
-class Philly(BasicNewsRecipe):
-
- title = 'Philadelphia Inquirer'
- __author__ = 'RadikalDissent and Sujata Raman'
+class AdvancedUserRecipe1308312288(BasicNewsRecipe):
+ title = u'Philadelphia Inquirer'
+ __author__ = 'sexymax15'
language = 'en'
description = 'Daily news from the Philadelphia Inquirer'
- no_stylesheets = True
- use_embedded_content = False
- oldest_article = 1
- max_articles_per_feed = 25
+ oldest_article = 15
+ max_articles_per_feed = 20
+ use_embedded_content = False
+ remove_empty_feeds = True
+ no_stylesheets = True
+ remove_javascript = True
- extra_css = '''
- h1{font-family:verdana,arial,helvetica,sans-serif; font-size: large;}
- h2{font-family:verdana,arial,helvetica,sans-serif; font-size: small;}
- .body-content{font-family:verdana,arial,helvetica,sans-serif; font-size: small;}
- .byline {font-size: small; color: #666666; font-style:italic; }
- .lastline {font-size: small; color: #666666; font-style:italic;}
- .contact {font-size: small; color: #666666;}
- .contact p {font-size: small; color: #666666;}
- #photoCaption { font-family:verdana,arial,helvetica,sans-serif; font-size:x-small;}
- .photoCaption { font-family:verdana,arial,helvetica,sans-serif; font-size:x-small;}
- #photoCredit{ font-family:verdana,arial,helvetica,sans-serif; font-size:x-small; color:#666666;}
- .photoCredit{ font-family:verdana,arial,helvetica,sans-serif; font-size:x-small; color:#666666;}
- .article_timestamp{font-size:x-small; color:#666666;}
- a {font-family:verdana,arial,helvetica,sans-serif; font-size: x-small;}
- '''
+ # remove_tags_before = {'class':'article_timestamp'}
+ #remove_tags_after = {'class':'graylabel'}
+ keep_only_tags= [dict(name=['h1','p'])]
+ remove_tags = [dict(name=['hr','dl','dt','img','meta','iframe','link','script','form','input','label']),
+dict(id=['toggleConfirmEmailDiv','toggleTOS','toggleUsernameMsgDiv','toggleConfirmYear','navT1_philly','secondaryNav','navPlacement','globalPrimaryNav'
+,'ugc-footer-philly','bv_footer_include','footer','header',
+'container_rag_bottom','section_rectangle','contentrightside'])
+,{'class':['megamenu3 megamenu','container misc','container_inner misc_inner'
+,'misccontainer_left_32','headlineonly','misccontainer_middle_32'
+,'misccontainer_right_32','headline formBegin',
+'post_balloon','relatedlist','linkssubhead','b_sq','dotted-rule-above'
+,'container','headlines-digest','graylabel','container_inner'
+,'rlinks_colorbar1','rlinks_colorbar2','supercontainer','container_5col_left','container_image_left',
+'digest-headline2','digest-lead','container_5col_leftmiddle',
+'container_5col_middlemiddle','container_5col_rightmiddle'
+,'container_5col_right','divclear','supercontainer_outer force-width',
+'supercontainer','containertitle kicker-title',
+'pollquestion','pollchoice','photomore','pollbutton','container rssbox','containertitle video ',
+'containertitle_image ','container_tabtwo','selected'
+,'shadetabs','selected','tabcontentstyle','tabcontent','inner_container'
+,'arrow','container_ad','containertitlespacer','adUnit','tracking','sitemsg_911 clearfix']}]
- keep_only_tags = [
- dict(name='div', attrs={'class':'story-content'}),
- dict(name='div', attrs={'id': 'contentinside'})
- ]
+ extra_css = """
+ h1{font-family: Georgia,serif; font-size: xx-large}
- remove_tags = [
- dict(name='div', attrs={'class':['linkssubhead','post_balloon','relatedlist','pollquestion','b_sq']}),
- dict(name='dl', attrs={'class':'relatedlist'}),
- dict(name='div', attrs={'id':['photoNav','sidebar_adholder']}),
- dict(name='a', attrs={'class': ['headlineonly','bl']}),
- dict(name='img', attrs={'class':'img_noborder'})
- ]
- # def print_version(self, url):
- # return url + '?viewAll=y'
+ """
- feeds = [
- ('Front Page', 'http://www.philly.com/inquirer_front_page.rss'),
- ('Business', 'http://www.philly.com/inq_business.rss'),
- #('News', 'http://www.philly.com/inquirer/news/index.rss'),
- ('Nation', 'http://www.philly.com/inq_news_world_us.rss'),
- ('Local', 'http://www.philly.com/inquirer_local.rss'),
- ('Health', 'http://www.philly.com/inquirer_health_science.rss'),
- ('Education', 'http://www.philly.com/inquirer_education.rss'),
- ('Editorial and opinion', 'http://www.philly.com/inq_news_editorial.rss'),
- ('Sports', 'http://www.philly.com/inquirer_sports.rss')
- ]
+ feeds = [(u'News', u'http://www.philly.com/philly_news.rss')]
- def get_article_url(self, article):
- ans = article.link
-
- try:
- self.log('Looking for full story link in', ans)
- soup = self.index_to_soup(ans)
- x = soup.find(text="View All")
-
- if x is not None:
- ans = ans + '?viewAll=y'
- self.log('Found full story link', ans)
- except:
- pass
- return ans
-
- def postprocess_html(self, soup,first):
-
- for tag in soup.findAll(name='div',attrs={'class':"container_ate_qandatitle"}):
- tag.extract()
- for tag in soup.findAll(name='br'):
- tag.extract()
-
- return soup
diff --git a/recipes/scmp.recipe b/recipes/scmp.recipe
new file mode 100644
index 0000000000..1da7b9e1bc
--- /dev/null
+++ b/recipes/scmp.recipe
@@ -0,0 +1,80 @@
+__license__ = 'GPL v3'
+__copyright__ = '2010, Darko Miletic '
+'''
+scmp.com
+'''
+
+import re
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class SCMP(BasicNewsRecipe):
+ title = 'South China Morning Post'
+ __author__ = 'llam'
+ description = "SCMP.com, Hong Kong's premier online English daily provides exclusive up-to-date news, audio video news, podcasts, RSS Feeds, Blogs, breaking news, top stories, award winning news and analysis on Hong Kong and China."
+ publisher = 'South China Morning Post Publishers Ltd.'
+ category = 'SCMP, Online news, Hong Kong News, China news, Business news, English newspaper, daily newspaper, Lifestyle news, Sport news, Audio Video news, Asia news, World news, economy news, investor relations news, RSS Feeds'
+ oldest_article = 2
+ delay = 1
+ max_articles_per_feed = 200
+ no_stylesheets = True
+ encoding = 'utf-8'
+ use_embedded_content = False
+ language = 'en_CN'
+ remove_empty_feeds = True
+ needs_subscription = True
+ publication_type = 'newspaper'
+ masthead_url = 'http://www.scmp.com/images/logo_scmp_home.gif'
+ extra_css = ' body{font-family: Arial,Helvetica,sans-serif } '
+
+ conversion_options = {
+ 'comment' : description
+ , 'tags' : category
+ , 'publisher' : publisher
+ , 'language' : language
+ }
+
+ def get_browser(self):
+ br = BasicNewsRecipe.get_browser()
+ #br.set_debug_http(True)
+ #br.set_debug_responses(True)
+ #br.set_debug_redirects(True)
+ if self.username is not None and self.password is not None:
+ br.open('http://www.scmp.com/portal/site/SCMP/')
+ br.select_form(name='loginForm')
+ br['Login' ] = self.username
+ br['Password'] = self.password
+ br.submit()
+ return br
+
+ remove_attributes=['width','height','border']
+
+ keep_only_tags = [
+ dict(attrs={'id':['ART','photoBox']})
+ ,dict(attrs={'class':['article_label','article_byline','article_body']})
+ ]
+
+ preprocess_regexps = [
+ (re.compile(r'', re.DOTALL|re.IGNORECASE),
+ lambda match: ''),
+ ]
+
+ feeds = [
+ (u'Business' , u'http://www.scmp.com/rss/business.xml' )
+ ,(u'Hong Kong' , u'http://www.scmp.com/rss/hong_kong.xml' )
+ ,(u'China' , u'http://www.scmp.com/rss/china.xml' )
+ ,(u'Asia & World' , u'http://www.scmp.com/rss/news_asia_world.xml')
+ ,(u'Opinion' , u'http://www.scmp.com/rss/opinion.xml' )
+ ,(u'LifeSTYLE' , u'http://www.scmp.com/rss/lifestyle.xml' )
+ ,(u'Sport' , u'http://www.scmp.com/rss/sport.xml' )
+ ]
+
+ def print_version(self, url):
+ rpart, sep, rest = url.rpartition('&')
+ return rpart #+ sep + urllib.quote_plus(rest)
+
+ def preprocess_html(self, soup):
+ for item in soup.findAll(style=True):
+ del item['style']
+ items = soup.findAll(src="/images/label_icon.gif")
+ [item.extract() for item in items]
+ return self.adeify_images(soup)
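The `print_version` above relies on `str.rpartition` splitting on the *last* occurrence of the separator, which drops only the final query parameter. A quick illustration with a made-up URL:

```python
# str.rpartition splits on the LAST occurrence of the separator, so the
# print_version above keeps everything before the final '&' parameter.
# The URL below is made up for illustration.
url = 'http://www.scmp.com/portal/site/SCMP/template.PT?id=abc123&ss=2'
rpart, sep, rest = url.rpartition('&')
# rpart is the URL without its trailing parameter; rest is 'ss=2'
```
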
diff --git a/recipes/sizinti_derigisi.recipe b/recipes/sizinti_derigisi.recipe
new file mode 100644
index 0000000000..d05648170e
--- /dev/null
+++ b/recipes/sizinti_derigisi.recipe
@@ -0,0 +1,40 @@
+# -*- coding: utf-8 -*-
+
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class SizintiDergisi(BasicNewsRecipe):
+ title = u'Sızıntı Dergisi'
+ __author__ = u'thomass'
+ description = 'a Turkey based daily for national and international news in the fields of business, diplomacy, politics, culture, arts, sports and economics, in addition to commentaries, specials and features'
+ oldest_article = 30
+ max_articles_per_feed =80
+ no_stylesheets = True
+ #delay = 1
+ #use_embedded_content = False
+ encoding = 'utf-8'
+ #publisher = ' '
+ category = 'dergi, ilim, kültür, bilim,Türkçe'
+ language = 'tr'
+ publication_type = 'magazine'
+ #extra_css = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
+ #keep_only_tags = [dict(name='h1', attrs={'class':['georgia_30']})]
+
+ #remove_attributes = ['aria-describedby']
+ #remove_tags = [dict(name='div', attrs={'id':['renk10']}) ]
+ cover_img_url = 'http://www.sizinti.com.tr/images/sizintiprint.jpg'
+ masthead_url = 'http://www.sizinti.com.tr/images/sizintiprint.jpg'
+ remove_tags_before = dict(id='content-right')
+
+
+ #remove_empty_feeds= True
+ #remove_attributes = ['width','height']
+
+ feeds = [
+ ( u'Sızıntı', u'http://www.sizinti.com.tr/rss'),
+ ]
+
+ #def preprocess_html(self, soup):
+ # return self.adeify_images(soup)
+    #def print_version(self, url): #there is a problem caused by table format
+ #return url.replace('http://www.todayszaman.com/newsDetail_getNewsById.action?load=detay&', 'http://www.todayszaman.com/newsDetail_openPrintPage.action?')
+
diff --git a/recipes/stiintasitehnica.recipe b/recipes/stiintasitehnica.recipe
new file mode 100644
index 0000000000..c58a115b56
--- /dev/null
+++ b/recipes/stiintasitehnica.recipe
@@ -0,0 +1,56 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+__license__ = 'GPL v3'
+__copyright__ = u'2011, Silviu Cotoar\u0103'
+'''
+stiintasitehnica.com
+'''
+
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class Stiintasitehnica(BasicNewsRecipe):
+ title = u'\u0218tiin\u021b\u0103 \u015fi Tehnic\u0103'
+ __author__ = u'Silviu Cotoar\u0103'
+ description = u'\u0218tiin\u021b\u0103 \u015fi Tehnic\u0103'
+ publisher = u'\u0218tiin\u021b\u0103 \u015fi Tehnic\u0103'
+ oldest_article = 50
+ language = 'ro'
+ max_articles_per_feed = 100
+ no_stylesheets = True
+ use_embedded_content = False
+ category = u'Ziare,Reviste,Stiinta,Tehnica'
+ encoding = 'utf-8'
+ cover_url = 'http://www.stiintasitehnica.com/images/logo.jpg'
+
+ conversion_options = {
+ 'comments' : description
+ ,'tags' : category
+ ,'language' : language
+ ,'publisher' : publisher
+ }
+
+ keep_only_tags = [
+ dict(name='div', attrs={'id':'mainColumn2'})
+ ]
+
+ remove_tags = [
+ dict(name='span', attrs={'class':['redEar']})
+ , dict(name='table', attrs={'class':['connect_widget_interactive_area']})
+ , dict(name='div', attrs={'class':['panel-overlay']})
+ , dict(name='div', attrs={'id':['pointer']})
+ , dict(name='img', attrs={'class':['nav-next', 'nav-prev']})
+ , dict(name='table', attrs={'class':['connect_widget_interactive_area']})
+ , dict(name='hr', attrs={'class':['dotted']})
+ ]
+
+ remove_tags_after = [
+ dict(name='hr', attrs={'class':['dotted']})
+ ]
+
+ feeds = [
+ (u'Feeds', u'http://www.stiintasitehnica.com/rss/stiri.xml')
+ ]
+
+ def preprocess_html(self, soup):
+ return self.adeify_images(soup)
diff --git a/recipes/telegraph_uk.recipe b/recipes/telegraph_uk.recipe
index 5fe5b168b8..157cfa99e9 100644
--- a/recipes/telegraph_uk.recipe
+++ b/recipes/telegraph_uk.recipe
@@ -56,6 +56,7 @@ class TelegraphUK(BasicNewsRecipe):
,(u'Sport' , u'http://www.telegraph.co.uk/sport/rss' )
,(u'Earth News' , u'http://www.telegraph.co.uk/earth/earthnews/rss' )
,(u'Comment' , u'http://www.telegraph.co.uk/comment/rss' )
+ ,(u'Travel' , u'http://www.telegraph.co.uk/travel/rss' )
,(u'How about that?', u'http://www.telegraph.co.uk/news/newstopics/howaboutthat/rss' )
]
diff --git a/recipes/todays_zaman.recipe b/recipes/todays_zaman.recipe
new file mode 100644
index 0000000000..5f3b85131a
--- /dev/null
+++ b/recipes/todays_zaman.recipe
@@ -0,0 +1,53 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class TodaysZaman_en(BasicNewsRecipe):
+    title = u"Today's Zaman"
+ __author__ = u'thomass'
+ description = 'a Turkey based daily for national and international news in the fields of business, diplomacy, politics, culture, arts, sports and economics, in addition to commentaries, specials and features'
+ oldest_article = 2
+ max_articles_per_feed =100
+ no_stylesheets = True
+ #delay = 1
+ #use_embedded_content = False
+ encoding = 'utf-8'
+ #publisher = ' '
+ category = 'news, haberler,TR,gazete'
+ language = 'en_TR'
+ publication_type = 'newspaper'
+ #extra_css = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
+ #keep_only_tags = [dict(name='font', attrs={'class':['newsDetail','agenda2NewsSpot']}),dict(name='span', attrs={'class':['agenda2Title']}),dict(name='div', attrs={'id':['gallery']})]
+ keep_only_tags = [dict(name='h1', attrs={'class':['georgia_30']}),dict(name='span', attrs={'class':['left-date','detailDate','detailCName']}),dict(name='td', attrs={'id':['newsSpot','newsText']})] #resim ekleme: ,dict(name='div', attrs={'id':['gallery','detailDate',]})
+
+ remove_attributes = ['aria-describedby']
+ remove_tags = [dict(name='img', attrs={'src':['/images/icon_print.gif','http://gmodules.com/ig/images/plus_google.gif','/images/template/jazz/agenda/i1.jpg', 'http://medya.todayszaman.com/todayszaman/images/logo/logo.bmp']}),dict(name='hr', attrs={'class':[ 'interactive-hr']}),dict(name='div', attrs={'class':[ 'empty_height_18','empty_height_9']}) ,dict(name='td', attrs={'id':[ 'superTitle']}),dict(name='span', attrs={'class':[ 't-count enabled t-count-focus']}),dict(name='a', attrs={'id':[ 'count']}),dict(name='td', attrs={'class':[ 'left-date']}) ]
+ cover_img_url = 'http://medya.todayszaman.com/todayszaman/images/logo/logo.bmp'
+ masthead_url = 'http://medya.todayszaman.com/todayszaman/images/logo/logo.bmp'
+ remove_empty_feeds= True
+ # remove_attributes = ['width','height']
+
+ feeds = [
+ ( u'Home', u'http://www.todayszaman.com/rss?sectionId=0'),
+ ( u'News', u'http://www.todayszaman.com/rss?sectionId=100'),
+ ( u'Business', u'http://www.todayszaman.com/rss?sectionId=105'),
+ ( u'Interviews', u'http://www.todayszaman.com/rss?sectionId=8'),
+ ( u'Columnists', u'http://www.todayszaman.com/rss?sectionId=6'),
+ ( u'Op-Ed', u'http://www.todayszaman.com/rss?sectionId=109'),
+ ( u'Arts & Culture', u'http://www.todayszaman.com/rss?sectionId=110'),
+ ( u'Expat Zone', u'http://www.todayszaman.com/rss?sectionId=132'),
+ ( u'Sports', u'http://www.todayszaman.com/rss?sectionId=5'),
+ ( u'Features', u'http://www.todayszaman.com/rss?sectionId=116'),
+ ( u'Travel', u'http://www.todayszaman.com/rss?sectionId=117'),
+ ( u'Leisure', u'http://www.todayszaman.com/rss?sectionId=118'),
+ ( u'Weird But True', u'http://www.todayszaman.com/rss?sectionId=134'),
+ ( u'Life', u'http://www.todayszaman.com/rss?sectionId=133'),
+ ( u'Health', u'http://www.todayszaman.com/rss?sectionId=126'),
+ ( u'Press Review', u'http://www.todayszaman.com/rss?sectionId=130'),
+ ( u'Todays think tanks', u'http://www.todayszaman.com/rss?sectionId=159'),
+
+ ]
+
+ #def preprocess_html(self, soup):
+ # return self.adeify_images(soup)
+    #def print_version(self, url): #there is a problem caused by table format
+ #return url.replace('http://www.todayszaman.com/newsDetail_getNewsById.action?load=detay&', 'http://www.todayszaman.com/newsDetail_openPrintPage.action?')
+
diff --git a/recipes/words_without_borders.recipe b/recipes/words_without_borders.recipe
new file mode 100644
index 0000000000..7f51f9cb6b
--- /dev/null
+++ b/recipes/words_without_borders.recipe
@@ -0,0 +1,25 @@
+#recipe created by sexymax15.....sexymax15@gmail.com
+#Words without Borders recipe
+
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class AdvancedUserRecipe1308302002(BasicNewsRecipe):
+ title = u'Words Without Borders'
+ language = 'en'
+ __author__ = 'sexymax15'
+ oldest_article = 90
+ max_articles_per_feed = 30
+ use_embedded_content = False
+
+ remove_empty_feeds = True
+ no_stylesheets = True
+ remove_javascript = True
+    keep_only_tags = [{'class':'span-14 article'}]
+ remove_tags_after = [{'class':'addthis_toolbox addthis_default_style no_print'}]
+ remove_tags = [{'class':['posterous_quote_citation','button']}]
+ extra_css = """
+ h1{font-family: Georgia,serif; font-size: large}h2{font-family: Georgia,serif; font-size: large} """
+
+
+
+ feeds = [(u'wwb', u'http://feeds.feedburner.com/wwborders?format=xml')]
diff --git a/recipes/wprost.recipe b/recipes/wprost.recipe
index b317571981..b271665125 100644
--- a/recipes/wprost.recipe
+++ b/recipes/wprost.recipe
@@ -2,90 +2,92 @@
__license__ = 'GPL v3'
__copyright__ = '2010, matek09, matek09@gmail.com'
+__copyright__ = 'Modified 2011, Mariusz Wolek '
from calibre.web.feeds.news import BasicNewsRecipe
import re
class Wprost(BasicNewsRecipe):
- EDITION = 0
- FIND_LAST_FULL_ISSUE = True
- EXCLUDE_LOCKED = True
- ICO_BLOCKED = 'http://www.wprost.pl/G/icons/ico_blocked.gif'
+ EDITION = 0
+ FIND_LAST_FULL_ISSUE = True
+ EXCLUDE_LOCKED = True
+ ICO_BLOCKED = 'http://www.wprost.pl/G/icons/ico_blocked.gif'
- title = u'Wprost'
- __author__ = 'matek09'
- description = 'Weekly magazine'
- encoding = 'ISO-8859-2'
- no_stylesheets = True
- language = 'pl'
- remove_javascript = True
+ title = u'Wprost'
+ __author__ = 'matek09'
+ description = 'Weekly magazine'
+ encoding = 'ISO-8859-2'
+ no_stylesheets = True
+ language = 'pl'
+ remove_javascript = True
- remove_tags_before = dict(dict(name = 'div', attrs = {'id' : 'print-layer'}))
- remove_tags_after = dict(dict(name = 'div', attrs = {'id' : 'print-layer'}))
+ remove_tags_before = dict(dict(name = 'div', attrs = {'id' : 'print-layer'}))
+ remove_tags_after = dict(dict(name = 'div', attrs = {'id' : 'print-layer'}))
- '''keep_only_tags =[]
- keep_only_tags.append(dict(name = 'table', attrs = {'id' : 'title-table'}))
- keep_only_tags.append(dict(name = 'div', attrs = {'class' : 'div-header'}))
- keep_only_tags.append(dict(name = 'div', attrs = {'class' : 'div-content'}))
- keep_only_tags.append(dict(name = 'div', attrs = {'class' : 'def element-autor'}))'''
+ '''keep_only_tags =[]
+ keep_only_tags.append(dict(name = 'table', attrs = {'id' : 'title-table'}))
+ keep_only_tags.append(dict(name = 'div', attrs = {'class' : 'div-header'}))
+ keep_only_tags.append(dict(name = 'div', attrs = {'class' : 'div-content'}))
+ keep_only_tags.append(dict(name = 'div', attrs = {'class' : 'def element-autor'}))'''
- preprocess_regexps = [(re.compile(r'style="display: none;"'), lambda match: ''),
- (re.compile(r'display: block;'), lambda match: '')]
+ preprocess_regexps = [(re.compile(r'style="display: none;"'), lambda match: ''),
+ (re.compile(r'display: block;'), lambda match: ''),
+ (re.compile(r'\\ | \<\/table\>'), lambda match: ''),
+ (re.compile(r'\'), lambda match: ''),
+ (re.compile(r'\'), lambda match: ''),
+ (re.compile(r'\'), lambda match: '')]
+ remove_tags =[]
+ remove_tags.append(dict(name = 'div', attrs = {'class' : 'def element-date'}))
+ remove_tags.append(dict(name = 'div', attrs = {'class' : 'def silver'}))
+ remove_tags.append(dict(name = 'div', attrs = {'id' : 'content-main-column-right'}))
- remove_tags =[]
- remove_tags.append(dict(name = 'div', attrs = {'class' : 'def element-date'}))
- remove_tags.append(dict(name = 'div', attrs = {'class' : 'def silver'}))
- remove_tags.append(dict(name = 'div', attrs = {'id' : 'content-main-column-right'}))
-
-
- extra_css = '''
- .div-header {font-size: x-small; font-weight: bold}
- '''
+ extra_css = '''
+ .div-header {font-size: x-small; font-weight: bold}
+ '''
#h2 {font-size: x-large; font-weight: bold}
- def is_blocked(self, a):
- if a.findNextSibling('img') is None:
- return False
- else:
- return True
+ def is_blocked(self, a):
+ if a.findNextSibling('img') is None:
+ return False
+ else:
+ return True
- def find_last_issue(self):
- soup = self.index_to_soup('http://www.wprost.pl/archiwum/')
- a = 0
- if self.FIND_LAST_FULL_ISSUE:
- ico_blocked = soup.findAll('img', attrs={'src' : self.ICO_BLOCKED})
- a = ico_blocked[-1].findNext('a', attrs={'title' : re.compile('Zobacz spis tre.ci')})
- else:
- a = soup.find('a', attrs={'title' : re.compile('Zobacz spis tre.ci')})
- self.EDITION = a['href'].replace('/tygodnik/?I=', '')
- self.cover_url = a.img['src']
+ def find_last_issue(self):
+ soup = self.index_to_soup('http://www.wprost.pl/archiwum/')
+ a = 0
+ if self.FIND_LAST_FULL_ISSUE:
+ ico_blocked = soup.findAll('img', attrs={'src' : self.ICO_BLOCKED})
+ a = ico_blocked[-1].findNext('a', attrs={'title' : re.compile('Zobacz spis tre.ci')})
+ else:
+ a = soup.find('a', attrs={'title' : re.compile('Zobacz spis tre.ci')})
+ self.EDITION = a['href'].replace('/tygodnik/?I=', '')
+ self.cover_url = a.img['src']
- def parse_index(self):
- self.find_last_issue()
- soup = self.index_to_soup('http://www.wprost.pl/tygodnik/?I=' + self.EDITION)
- feeds = []
- for main_block in soup.findAll(attrs={'class':'main-block-s3 s3-head head-red3'}):
- articles = list(self.find_articles(main_block))
- if len(articles) > 0:
- section = self.tag_to_string(main_block)
- feeds.append((section, articles))
- return feeds
-
- def find_articles(self, main_block):
- for a in main_block.findAllNext( attrs={'style':['','padding-top: 15px;']}):
- if a.name in "td":
- break
- if self.EXCLUDE_LOCKED & self.is_blocked(a):
- continue
- yield {
- 'title' : self.tag_to_string(a),
- 'url' : 'http://www.wprost.pl' + a['href'],
- 'date' : '',
- 'description' : ''
- }
+ def parse_index(self):
+ self.find_last_issue()
+ soup = self.index_to_soup('http://www.wprost.pl/tygodnik/?I=' + self.EDITION)
+ feeds = []
+ for main_block in soup.findAll(attrs={'class':'main-block-s3 s3-head head-red3'}):
+ articles = list(self.find_articles(main_block))
+ if len(articles) > 0:
+ section = self.tag_to_string(main_block)
+ feeds.append((section, articles))
+ return feeds
+ def find_articles(self, main_block):
+ for a in main_block.findAllNext( attrs={'style':['','padding-top: 15px;']}):
+ if a.name in "td":
+ break
+            if self.EXCLUDE_LOCKED and self.is_blocked(a):
+ continue
+ yield {
+ 'title' : self.tag_to_string(a),
+ 'url' : 'http://www.wprost.pl' + a['href'],
+ 'date' : '',
+ 'description' : ''
+ }
diff --git a/recipes/wsj.recipe b/recipes/wsj.recipe
index cf84722bac..a3bc041d25 100644
--- a/recipes/wsj.recipe
+++ b/recipes/wsj.recipe
@@ -51,7 +51,7 @@ class WallStreetJournal(BasicNewsRecipe):
br['password'] = self.password
res = br.submit()
raw = res.read()
- if 'Welcome,' not in raw:
+ if 'Welcome,' not in raw and '>Logout<' not in raw:
raise ValueError('Failed to log in to wsj.com, check your '
'username and password')
return br
diff --git a/recipes/zaman.recipe b/recipes/zaman.recipe
index 064b2c265a..4a2e9e8069 100644
--- a/recipes/zaman.recipe
+++ b/recipes/zaman.recipe
@@ -1,20 +1,55 @@
+# -*- coding: utf-8 -*-
+
from calibre.web.feeds.news import BasicNewsRecipe
-class ZamanRecipe(BasicNewsRecipe):
- title = u'Zaman'
- __author__ = u'Deniz Og\xfcz'
- language = 'tr'
- oldest_article = 1
- max_articles_per_feed = 10
+class Zaman (BasicNewsRecipe):
- cover_url = 'http://medya.zaman.com.tr/zamantryeni/pics/zamanonline.gif'
- feeds = [(u'Gundem', u'http://www.zaman.com.tr/gundem.rss'),
- (u'Son Dakika', u'http://www.zaman.com.tr/sondakika.rss'),
- (u'Spor', u'http://www.zaman.com.tr/spor.rss'),
- (u'Ekonomi', u'http://www.zaman.com.tr/ekonomi.rss'),
- (u'Politika', u'http://www.zaman.com.tr/politika.rss'),
- (u'D\u0131\u015f Haberler', u'http://www.zaman.com.tr/dishaberler.rss'),
- (u'Yazarlar', u'http://www.zaman.com.tr/yazarlar.rss'),]
+ title = u'ZAMAN Gazetesi'
+ __author__ = u'thomass'
+ oldest_article = 2
+ max_articles_per_feed =100
+ # no_stylesheets = True
+ #delay = 1
+ #use_embedded_content = False
+ encoding = 'ISO 8859-9'
+ publisher = 'Zaman'
+ category = 'news, haberler,TR,gazete'
+ language = 'tr'
+    publication_type = 'newspaper'
+ extra_css = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
+ conversion_options = {
+ 'tags' : category
+ ,'language' : language
+ ,'publisher' : publisher
+ ,'linearize_tables': False
+ }
+ cover_img_url = 'https://fbcdn-profile-a.akamaihd.net/hprofile-ak-snc4/188140_81722291869_2111820_n.jpg'
+ masthead_url = 'http://medya.zaman.com.tr/extentions/zaman.com.tr/img/section/logo-section.png'
- def print_version(self, url):
- return url.replace('www.zaman.com.tr/haber.do?', 'www.zaman.com.tr/yazdir.do?')
+
+ keep_only_tags = [dict(name='div', attrs={'id':[ 'news-detail-content']}), dict(name='td', attrs={'class':['columnist-detail','columnist_head']}) ]
+ remove_tags = [ dict(name='div', attrs={'id':['news-detail-news-text-font-size','news-detail-gallery','news-detail-news-bottom-social']}),dict(name='div', attrs={'class':['radioEmbedBg','radyoProgramAdi']}),dict(name='a', attrs={'class':['webkit-html-attribute-value webkit-html-external-link']}),dict(name='table', attrs={'id':['yaziYorumTablosu']}),dict(name='img', attrs={'src':['http://medya.zaman.com.tr/pics/paylas.gif','http://medya.zaman.com.tr/extentions/zaman.com.tr/img/columnist/ma-16.png']})]
+
+
+ #remove_attributes = ['width','height']
+ remove_empty_feeds= True
+
+ feeds = [
+ ( u'Anasayfa', u'http://www.zaman.com.tr/anasayfa.rss'),
+ ( u'Son Dakika', u'http://www.zaman.com.tr/sondakika.rss'),
+ ( u'En çok Okunanlar', u'http://www.zaman.com.tr/max_all.rss'),
+ ( u'Gündem', u'http://www.zaman.com.tr/gundem.rss'),
+ ( u'Yazarlar', u'http://www.zaman.com.tr/yazarlar.rss'),
+ ( u'Politika', u'http://www.zaman.com.tr/politika.rss'),
+ ( u'Ekonomi', u'http://www.zaman.com.tr/ekonomi.rss'),
+ ( u'Dış Haberler', u'http://www.zaman.com.tr/dishaberler.rss'),
+ ( u'Yorumlar', u'http://www.zaman.com.tr/yorumlar.rss'),
+ ( u'Röportaj', u'http://www.zaman.com.tr/roportaj.rss'),
+ ( u'Spor', u'http://www.zaman.com.tr/spor.rss'),
+ ( u'Kürsü', u'http://www.zaman.com.tr/kursu.rss'),
+ ( u'Kültür Sanat', u'http://www.zaman.com.tr/kultursanat.rss'),
+ ( u'Televizyon', u'http://www.zaman.com.tr/televizyon.rss'),
+ ( u'Manşet', u'http://www.zaman.com.tr/manset.rss'),
+
+
+ ]
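The `keep_only_tags`/`remove_tags` rules used throughout these recipes are dicts matched against each tag's name and attributes: a tag matches when its name equals the rule's `name` (if given) and every listed attribute value is among the rule's allowed values. A simplified, self-contained model of that matching (tags are modeled as plain dicts here; calibre actually matches BeautifulSoup tag objects):

```python
# Simplified model of how a keep_only_tags / remove_tags rule such as
# dict(name='div', attrs={'id': ['news-detail-content']}) matches a tag.
def rule_matches(tag, rule):
    if 'name' in rule and tag.get('name') != rule['name']:
        return False
    for attr, wanted in rule.get('attrs', {}).items():
        value = tag.get('attrs', {}).get(attr)
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if value not in allowed:
            return False
    return True
```
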
diff --git a/resources/content_server/browse/browse.html b/resources/content_server/browse/browse.html
index de78e432d7..6a9697dc06 100644
--- a/resources/content_server/browse/browse.html
+++ b/resources/content_server/browse/browse.html
@@ -20,8 +20,8 @@
-
-
+
+
| |