Sync to trunk.
Changelog.yaml
@ -19,6 +19,162 @@
# new recipes:
# - title:

- version: 0.9.12
  date: 2012-12-28

  new features:
    - title: "Drivers for Kibano e-reader and Slick ER-700-2"
      tickets: [1093570, 1093732]

    - title: "Add support for downloading metadata from Amazon Brazil."
      tickets: [1092594]

    - title: "Copy to library: Allow specifying the destination library by path."
      tickets: [1093231]

    - title: "When adding empty books, allow setting of the series for the new books. Also select the newly added book records after adding."

    - title: "PDF Output: Add a checkbox to override the page size defined by the output profile. This allows you to specify a custom page size even if the output profile is not set to default."

    - title: "Add usb ids for newer kindle fire to the linux mtp driver"

  bug fixes:
    - title: "Linux: Temporarily redirect stdout to get rid of the annoying and pointless message about mtpz during libmtp initialization"

    - title: "Fix multiple 'All column' coloring rules not being applied"
      tickets: [1093574]

    - title: "Use custom icons in the content server as well."
      tickets: [1092098]

  improved recipes:
    - La Voce
    - Harpers Magazine (printed edition)
    - Pajamas Media
    - NSFW corp
    - The Hindu
    - Nikkei News

  new recipes:
    - title: Various Ukrainian news sources
      author: rpalyvoda

- version: 0.9.11
  date: 2012-12-21

  new features:
    - title: "Merry Christmas and Happy Holidays to all ☺"

    - title: "When connecting to MTP devices such as the Kindle Fire HD or the Nook HD, speed up the process by ignoring some folders."
      description: "calibre will now ignore folders for music, video, pictures, etc. when scanning the device. This can substantially speed up the connection process if you have thousands of non-ebook files on the device. The list of folders to be ignored can be customized by right clicking on the device icon in calibre and selecting 'Configure this device'."

    - title: "Allow changing the icons for categories in the Tag Browser. Right click on a category and choose 'Change category icon'."
      tickets: [1092098]

    - title: "Allow setting the color of all columns with a single rule in Preferences->Look & Feel->Column Coloring"

    - title: "MOBI: When reading metadata from mobi files, put the contents of the ASIN field into an identifier named mobi-asin. Note that this value is not used when downloading metadata as it is not possible to know which (country specific) amazon website the ASIN comes from."
      tickets: [1090394]

  bug fixes:
    - title: "Windows build: Fix a regression in 0.9.9 that caused calibre to not start on some windows systems that were missing the VC.90 dlls (some older XP systems)"

    - title: "Kobo driver: Workaround for invalid shelves created by bugs in the Kobo server"
      tickets: [1091932]

    - title: "Metadata download: Fix cover downloading from non-US amazon sites broken by a website change."
      tickets: [1090765]

  improved recipes:
    - Le Devoir
    - Nin online
    - countryfile
    - Birmingham Post
    - The Independent
    - Various Polish news sources

  new recipes:
    - title: MobileBulgaria
      author: Martin Tsanchev

    - title: Various Polish news sources
      author: fenuks

- version: 0.9.10
  date: 2012-12-14

  new features:
    - title: "Drivers for Nextbook Premium 8 se, HTC Desire X and Emerson EM 543"
      tickets: [1088149, 1088112, 1087978]

  bug fixes:
    - title: "Fix rich text delegate not working with Qt compiled in debug mode."
      tickets: [1089011]

    - title: "When deleting all books in the library, blank the book details panel"

    - title: "Conversion: Fix malformed values in the bgcolor attribute causing conversion to abort"

    - title: "Conversion: Fix heuristics applying incorrect style in some circumstances"
      tickets: [1066507]

    - title: "Possible fix for 64bit calibre not starting up on some Windows systems"
      tickets: [1087816]

  improved recipes:
    - Sivil Dusunce
    - Anchorage Daily News
    - Le Monde
    - Harpers

  new recipes:
    - title: Titanic
      author: Krittika Goyal

- version: 0.9.9
  date: 2012-12-07

  new features:
    - title: "64 bit build for windows"
      type: major
      description: "calibre now has a 64 bit version for windows, available at: http://calibre-ebook.com/download_windows64 The 64bit build is not limited to using only 3GB of RAM when converting large/complex documents. It may also be slightly faster for some tasks. You can have both the 32 bit and the 64 bit build installed at the same time, they will use the same libraries, plugins and settings."

    - title: "Content server: Make the identifiers in each book's metadata clickable."
      tickets: [1085726]

  bug fixes:
    - title: "EPUB Input: Fix an infinite loop while trying to recover a damaged EPUB file."
      tickets: [1086917]

    - title: "KF8 Input: Fix handling of links in files that link to the obsolete <a name> tags instead of tags with an id attribute."
      tickets: [1086705]

    - title: "Conversion: Fix a bug in removal of invalid entries from the spine, where not all invalid entries were removed, causing conversion to fail."
      tickets: [1086054]

    - title: "KF8 Input: Ignore invalid flow references in the KF8 document instead of erroring out on them."
      tickets: [1085306]

    - title: "Fix command line output on linux systems with incorrect LANG/LC_TYPE env vars."
      tickets: [1085103]

    - title: "KF8 Input: Fix page breaks specified using the data-AmznPageBreak attribute being ignored by calibre."

    - title: "PDF Output: Fix custom size field not accepting fractional numbers as sizes"

    - title: "Get Books: Update libre.de and publio for website changes"

    - title: "Wireless driver: Increase timeout interval, and when allocating a random port try 9090 first"

  improved recipes:
    - New York Times
    - Weblogs SL
    - Zaman Gazetesi
    - Aksiyon Dergisi
    - Engadget
    - Metro UK
    - Heise Online

- version: 0.9.8
  date: 2012-11-30

@ -49,7 +49,7 @@ All the |app| python code is in the ``calibre`` package. This package contains t
|
||||
* Metadata reading, writing, and downloading is all in ebooks.metadata
|
||||
* Conversion happens in a pipeline, for the structure of the pipeline,
|
||||
see :ref:`conversion-introduction`. The pipeline consists of an input
|
||||
plugin, various transforms and an output plugin. The code constructs
|
||||
plugin, various transforms and an output plugin. The code that constructs
|
||||
and drives the pipeline is in plumber.py. The pipeline works on a
|
||||
representation of an ebook that is like an unzipped epub, with
|
||||
manifest, spine, toc, guide, html content, etc. The
|
||||
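As a rough illustration of the metadata API that the excerpt above says lives in ``ebooks.metadata``, here is a minimal sketch; the exact import path and signature of ``get_metadata`` are assumptions based on common calibre plugin usage, not something introduced by this commit::

    # Hypothetical sketch: read the metadata of an ebook file with calibre's
    # metadata helpers (import path and signature assumed, check the API docs).
    from calibre.ebooks.metadata.meta import get_metadata

    with open('book.epub', 'rb') as stream:
        mi = get_metadata(stream, 'epub')  # returns a Metadata object
    print mi.title, mi.authors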
@ -74,10 +74,6 @@ After installing Bazaar, you can get the |app| source code with the command::
|
||||
|
||||
On Windows you will need the complete path name, that will be something like :file:`C:\\Program Files\\Bazaar\\bzr.exe`.
|
||||
|
||||
To update a branch to the latest code, use the command::
|
||||
|
||||
bzr merge
|
||||
|
||||
|app| is a very large project with a very long source control history, so the
|
||||
above can take a while (10mins to an hour depending on your internet speed).
|
||||
|
||||
@ -88,6 +84,11 @@ using::
|
||||
|
||||
bzr branch --stacked lp:calibre
|
||||
|
||||
|
||||
To update a branch to the latest code, use the command::
|
||||
|
||||
bzr merge
|
||||
|
||||
Submitting your changes to be included
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
|
@ -162,6 +162,8 @@ Follow these steps to find the problem:
|
||||
* If you are connecting an Apple iDevice (iPad, iPod Touch, iPhone), use the 'Connect to iTunes' method in the 'Getting started' instructions in `Calibre + Apple iDevices: Start here <http://www.mobileread.com/forums/showthread.php?t=118559>`_.
|
||||
* Make sure you are running the latest version of |app|. The latest version can always be downloaded from `the calibre website <http://calibre-ebook.com/download>`_.
|
||||
* Ensure your operating system is seeing the device. That is, the device should show up in Windows Explorer (in Windows) or Finder (in OS X).
|
||||
* In |app|, go to Preferences->Ignored Devices and check that your device
|
||||
is not being ignored
|
||||
* In |app|, go to Preferences->Plugins->Device Interface plugin and make sure the plugin for your device is enabled, the plugin icon next to it should be green when it is enabled.
|
||||
* If all the above steps fail, go to Preferences->Miscellaneous and click debug device detection with your device attached and post the output as a ticket on `the calibre bug tracker <http://bugs.calibre-ebook.com>`_.
|
||||
|
||||
@ -668,6 +670,9 @@ There are three possible things I know of, that can cause this:
|
||||
the blacklist of programs inside RoboForm to fix this. Or uninstall
|
||||
RoboForm.
|
||||
|
||||
* The Logitech SetPoint Settings application causes random crashes in
|
||||
|app| when it is open. Close it before starting |app|.
|
||||
|
||||
|app| is not starting on OS X?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
|
@ -9,11 +9,12 @@ class Adventure_zone(BasicNewsRecipe):
|
||||
no_stylesheets = True
|
||||
oldest_article = 20
|
||||
max_articles_per_feed = 100
|
||||
cover_url = 'http://www.adventure-zone.info/inne/logoaz_2012.png'
|
||||
index='http://www.adventure-zone.info/fusion/'
|
||||
use_embedded_content=False
|
||||
preprocess_regexps = [(re.compile(r"<td class='capmain'>Komentarze</td>", re.IGNORECASE), lambda m: ''),
|
||||
(re.compile(r'\<table .*?\>'), lambda match: ''),
|
||||
(re.compile(r'\<tbody\>'), lambda match: '')]
|
||||
(re.compile(r'</?table.*?>'), lambda match: ''),
|
||||
(re.compile(r'</?tbody.*?>'), lambda match: '')]
|
||||
remove_tags_before= dict(name='td', attrs={'class':'main-bg'})
|
||||
remove_tags= [dict(name='img', attrs={'alt':'Drukuj'})]
|
||||
remove_tags_after= dict(id='comments')
|
||||
@ -36,11 +37,11 @@ class Adventure_zone(BasicNewsRecipe):
|
||||
return feeds
|
||||
|
||||
|
||||
def get_cover_url(self):
|
||||
'''def get_cover_url(self):
|
||||
soup = self.index_to_soup('http://www.adventure-zone.info/fusion/news.php')
|
||||
cover=soup.find(id='box_OstatninumerAZ')
|
||||
self.cover_url='http://www.adventure-zone.info/fusion/'+ cover.center.a.img['src']
|
||||
return getattr(self, 'cover_url', self.cover_url)
|
||||
return getattr(self, 'cover_url', self.cover_url)'''
|
||||
|
||||
|
||||
def skip_ad_pages(self, soup):
|
||||
|
@ -5,6 +5,8 @@ class AdvancedUserRecipe1278347258(BasicNewsRecipe):
|
||||
__author__ = 'rty'
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
auto_cleanup = True
|
||||
|
||||
|
||||
feeds = [(u'Alaska News', u'http://www.adn.com/news/alaska/index.xml'),
|
||||
(u'Business', u'http://www.adn.com/money/index.xml'),
|
||||
@ -28,13 +30,13 @@ class AdvancedUserRecipe1278347258(BasicNewsRecipe):
|
||||
conversion_options = {'linearize_tables':True}
|
||||
masthead_url = 'http://media.adn.com/includes/assets/images/adn_logo.2.gif'
|
||||
|
||||
keep_only_tags = [
|
||||
dict(name='div', attrs={'class':'left_col story_mainbar'}),
|
||||
]
|
||||
remove_tags = [
|
||||
dict(name='div', attrs={'class':'story_tools'}),
|
||||
dict(name='p', attrs={'class':'ad_label'}),
|
||||
]
|
||||
remove_tags_after = [
|
||||
dict(name='div', attrs={'class':'advertisement'}),
|
||||
]
|
||||
#keep_only_tags = [
|
||||
#dict(name='div', attrs={'class':'left_col story_mainbar'}),
|
||||
#]
|
||||
#remove_tags = [
|
||||
#dict(name='div', attrs={'class':'story_tools'}),
|
||||
#dict(name='p', attrs={'class':'ad_label'}),
|
||||
#]
|
||||
#remove_tags_after = [
|
||||
#dict(name='div', attrs={'class':'advertisement'}),
|
||||
#]
|
||||
|
@ -3,11 +3,11 @@ from calibre.web.feeds.news import BasicNewsRecipe
|
||||
class Android_com_pl(BasicNewsRecipe):
|
||||
title = u'Android.com.pl'
|
||||
__author__ = 'fenuks'
|
||||
description = 'Android.com.pl - biggest polish Android site'
|
||||
description = u'Android.com.pl - to największe w Polsce centrum Android OS. Znajdziesz tu: nowości, forum, pomoc, recenzje, gry, aplikacje.'
|
||||
category = 'Android, mobile'
|
||||
language = 'pl'
|
||||
use_embedded_content=True
|
||||
cover_url =u'http://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Android_robot.svg/220px-Android_robot.svg.png'
|
||||
cover_url =u'http://android.com.pl/wp-content/themes/android/images/logo.png'
|
||||
oldest_article = 8
|
||||
max_articles_per_feed = 100
|
||||
feeds = [(u'Android', u'http://android.com.pl/component/content/frontpage/frontpage.feed?type=rss')]
|
||||
feeds = [(u'Android', u'http://android.com.pl/feed/')]
|
||||
|
recipes/astroflesz.recipe (new file)
@ -0,0 +1,19 @@
|
||||
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
|
||||
class Astroflesz(BasicNewsRecipe):
|
||||
title = u'Astroflesz'
|
||||
oldest_article = 7
|
||||
__author__ = 'fenuks'
|
||||
description = u'astroflesz.pl - to portal poświęcony astronomii. Informuje zarówno o aktualnych wydarzeniach i odkryciach naukowych, jak również zapowiada ciekawe zjawiska astronomiczne'
|
||||
category = 'astronomy'
|
||||
language = 'pl'
|
||||
cover_url = 'http://www.astroflesz.pl/templates/astroflesz/images/logo/logo.png'
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
max_articles_per_feed = 100
|
||||
no_stylesheets = True
|
||||
use_embedded_content = False
|
||||
keep_only_tags = [dict(id="k2Container")]
|
||||
remove_tags_after = dict(name='div', attrs={'class':'itemLinks'})
|
||||
remove_tags = [dict(name='div', attrs={'class':['itemLinks', 'itemToolbar', 'itemRatingBlock']})]
|
||||
feeds = [(u'Wszystkie', u'http://astroflesz.pl/?format=feed')]
|
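The recipe above is a standard BasicNewsRecipe; calibre's recipe development docs describe testing such a file from the command line before adding it to the tree, roughly ``ebook-convert astroflesz.recipe astroflesz.epub --test -vv`` (file name taken from this diff, flags as documented for recipe testing).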
@ -1,9 +1,11 @@
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
import re
|
||||
import mechanize
|
||||
|
||||
class AdvancedUserRecipe1306097511(BasicNewsRecipe):
|
||||
title = u'Birmingham post'
|
||||
description = 'Author D.Asbury. News for Birmingham UK'
|
||||
#timefmt = ''
|
||||
# last update 8/9/12
|
||||
__author__ = 'Dave Asbury'
|
||||
cover_url = 'http://profile.ak.fbcdn.net/hprofile-ak-snc4/161987_9010212100_2035706408_n.jpg'
|
||||
oldest_article = 2
|
||||
@ -15,8 +17,30 @@ class AdvancedUserRecipe1306097511(BasicNewsRecipe):
|
||||
#auto_cleanup = True
|
||||
language = 'en_GB'
|
||||
|
||||
cover_url = 'http://profile.ak.fbcdn.net/hprofile-ak-snc4/161987_9010212100_2035706408_n.jpg'
|
||||
|
||||
masthead_url = 'http://www.pressgazette.co.uk/Pictures/web/t/c/g/birmingham_post.jpg'
|
||||
masthead_url = 'http://www.trinitymirror.com/images/birminghampost-logo.gif'
|
||||
def get_cover_url(self):
|
||||
soup = self.index_to_soup('http://www.birminghampost.net')
|
||||
# look for the block containing the sun button and url
|
||||
cov = soup.find(attrs={'height' : re.compile('3'), 'alt' : re.compile('Birmingham Post')})
|
||||
print
|
||||
print '%%%%%%%%%%%%%%%',cov
|
||||
print
|
||||
cov2 = str(cov['src'])
|
||||
# cov2=cov2[7:]
|
||||
print '88888888 ',cov2,' 888888888888'
|
||||
|
||||
#cover_url=cov2
|
||||
#return cover_url
|
||||
br = mechanize.Browser()
|
||||
br.set_handle_redirect(False)
|
||||
try:
|
||||
br.open_novisit(cov2)
|
||||
cover_url = cov2
|
||||
except:
|
||||
cover_url = 'http://profile.ak.fbcdn.net/hprofile-ak-snc4/161987_9010212100_2035706408_n.jpg'
|
||||
return cover_url
|
||||
|
||||
|
||||
keep_only_tags = [
|
||||
|
@ -7,24 +7,29 @@ class AdvancedUserRecipe1325006965(BasicNewsRecipe):
|
||||
#cover_url = 'http://www.countryfile.com/sites/default/files/imagecache/160px_wide/cover/2_1.jpg'
|
||||
__author__ = 'Dave Asbury'
|
||||
description = 'The official website of Countryfile Magazine'
|
||||
# last updated 7/10/12
|
||||
# last updated 8/12/12
|
||||
language = 'en_GB'
|
||||
oldest_article = 30
|
||||
max_articles_per_feed = 25
|
||||
remove_empty_feeds = True
|
||||
no_stylesheets = True
|
||||
auto_cleanup = True
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
#articles_are_obfuscated = True
|
||||
ignore_duplicate_articles = {'title'}
|
||||
#article_already_exists = False
|
||||
#feed_hash = ''
|
||||
def get_cover_url(self):
|
||||
soup = self.index_to_soup('http://www.countryfile.com/')
|
||||
soup = self.index_to_soup('http://www.countryfile.com/magazine')
|
||||
cov = soup.find(attrs={'class' : re.compile('imagecache imagecache-250px_wide')})#'width' : '160',
|
||||
print '&&&&&&&& ',cov,' ***'
|
||||
cov=str(cov)
|
||||
#cov2 = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', cov)
|
||||
cov2 = re.findall('/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', cov)
|
||||
|
||||
cov2 = str(cov2)
|
||||
cov2= "http://www.countryfile.com"+cov2[2:len(cov2)-8]
|
||||
|
||||
cov = soup.find(attrs={'width' : '160', 'class' : re.compile('imagecache imagecache-160px_wide')})
|
||||
print '******** ',cov,' ***'
|
||||
cov2 = str(cov)
|
||||
cov2=cov2[10:101]
|
||||
print '******** ',cov2,' ***'
|
||||
#cov2='http://www.countryfile.com/sites/default/files/imagecache/160px_wide/cover/1b_0.jpg'
|
||||
# try to get cover - if can't get known cover
|
||||
br = browser()
|
||||
|
||||
@ -45,5 +50,3 @@ class AdvancedUserRecipe1325006965(BasicNewsRecipe):
|
||||
(u'Countryside', u'http://www.countryfile.com/rss/countryside'),
|
||||
]
|
||||
|
||||
|
||||
|
||||
|
recipes/czas_gentlemanow.recipe (new file)
@ -0,0 +1,20 @@
|
||||
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
|
||||
class CzasGentlemanow(BasicNewsRecipe):
|
||||
title = u'Czas Gentlemanów'
|
||||
__author__ = 'fenuks'
|
||||
description = u'Historia mężczyzn z dala od wielkiej polityki'
|
||||
category = 'blog'
|
||||
language = 'pl'
|
||||
cover_url = 'http://czasgentlemanow.pl/wp-content/uploads/2012/10/logo-Czas-Gentlemanow1.jpg'
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
no_stylesheets = True
|
||||
remove_empty_feeds = True
|
||||
use_embedded_content = False
|
||||
keep_only_tags = [dict(name='div', attrs={'class':'content'})]
|
||||
remove_tags = [dict(attrs={'class':'meta_comments'})]
|
||||
remove_tags_after = dict(name='div', attrs={'class':'fblikebutton_button'})
|
||||
feeds = [(u'M\u0119ski \u015awiat', u'http://czasgentlemanow.pl/category/meski-swiat/feed/'), (u'Styl', u'http://czasgentlemanow.pl/category/styl/feed/'), (u'Vademecum Gentlemana', u'http://czasgentlemanow.pl/category/vademecum/feed/'), (u'Dom i rodzina', u'http://czasgentlemanow.pl/category/dom-i-rodzina/feed/'), (u'Honor', u'http://czasgentlemanow.pl/category/honor/feed/'), (u'Gad\u017cety Gentlemana', u'http://czasgentlemanow.pl/category/gadzety-gentlemana/feed/')]
|
@ -7,6 +7,7 @@ class Dzieje(BasicNewsRecipe):
|
||||
cover_url = 'http://www.dzieje.pl/sites/default/files/dzieje_logo.png'
|
||||
category = 'history'
|
||||
language = 'pl'
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
index = 'http://dzieje.pl'
|
||||
oldest_article = 8
|
||||
max_articles_per_feed = 100
|
||||
@ -14,11 +15,56 @@ class Dzieje(BasicNewsRecipe):
|
||||
no_stylesheets= True
|
||||
keep_only_tags = [dict(name='h1', attrs={'class':'title'}), dict(id='content-area')]
|
||||
remove_tags = [dict(attrs={'class':'field field-type-computed field-field-tagi'}), dict(id='dogory')]
|
||||
feeds = [(u'Dzieje', u'http://dzieje.pl/rss.xml')]
|
||||
#feeds = [(u'Dzieje', u'http://dzieje.pl/rss.xml')]
|
||||
|
||||
def append_page(self, soup, appendtag):
|
||||
tag = appendtag.find('li', attrs={'class':'pager-next'})
|
||||
if tag:
|
||||
while tag:
|
||||
url = tag.a['href']
|
||||
if not url.startswith('http'):
|
||||
url = 'http://dzieje.pl'+tag.a['href']
|
||||
soup2 = self.index_to_soup(url)
|
||||
pagetext = soup2.find(id='content-area').find(attrs={'class':'content'})
|
||||
for r in pagetext.findAll(attrs={'class':['fieldgroup group-groupkul', 'fieldgroup group-zdjeciekult', 'fieldgroup group-zdjecieciekaw', 'fieldgroup group-zdjecieksiazka', 'fieldgroup group-zdjeciedu', 'field field-type-filefield field-field-zdjecieglownawyd']}):
|
||||
r.extract()
|
||||
pos = len(appendtag.contents)
|
||||
appendtag.insert(pos, pagetext)
|
||||
tag = soup2.find('li', attrs={'class':'pager-next'})
|
||||
for r in appendtag.findAll(attrs={'class':['item-list', 'field field-type-computed field-field-tagi', ]}):
|
||||
r.extract()
|
||||
|
||||
def find_articles(self, url):
|
||||
articles = []
|
||||
soup=self.index_to_soup(url)
|
||||
tag=soup.find(id='content-area').div.div
|
||||
for i in tag.findAll('div', recursive=False):
|
||||
temp = i.find(attrs={'class':'views-field-title'}).span.a
|
||||
title = temp.string
|
||||
url = self.index + temp['href']
|
||||
date = '' #i.find(attrs={'class':'views-field-created'}).span.string
|
||||
articles.append({'title' : title,
|
||||
'url' : url,
|
||||
'date' : date,
|
||||
'description' : ''
|
||||
})
|
||||
return articles
|
||||
|
||||
def parse_index(self):
|
||||
feeds = []
|
||||
feeds.append((u"Wiadomości", self.find_articles('http://dzieje.pl/wiadomosci')))
|
||||
feeds.append((u"Kultura i sztuka", self.find_articles('http://dzieje.pl/kulturaisztuka')))
|
||||
feeds.append((u"Film", self.find_articles('http://dzieje.pl/kino')))
|
||||
feeds.append((u"Rozmaitości historyczne", self.find_articles('http://dzieje.pl/rozmaitości')))
|
||||
feeds.append((u"Książka", self.find_articles('http://dzieje.pl/ksiazka')))
|
||||
feeds.append((u"Wystawa", self.find_articles('http://dzieje.pl/wystawa')))
|
||||
feeds.append((u"Edukacja", self.find_articles('http://dzieje.pl/edukacja')))
|
||||
feeds.append((u"Dzieje się", self.find_articles('http://dzieje.pl/wydarzenia')))
|
||||
return feeds
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
for a in soup('a'):
|
||||
if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
|
||||
a['href']=self.index + a['href']
|
||||
self.append_page(soup, soup.body)
|
||||
return soup
|
recipes/ekologia_pl.recipe (new file)
@ -0,0 +1,24 @@
|
||||
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
import re
|
||||
class EkologiaPl(BasicNewsRecipe):
|
||||
title = u'Ekologia.pl'
|
||||
__author__ = 'fenuks'
|
||||
description = u'Portal ekologiczny - eko, ekologia, ochrona przyrody, ochrona środowiska, przyroda, środowisko online. Ekologia i ochrona środowiska. Ekologia dla dzieci.'
|
||||
category = 'ecology'
|
||||
language = 'pl'
|
||||
cover_url = 'http://www.ekologia.pl/assets/images/logo/ekologia_pl_223x69.png'
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
extra_css = '.title {font-size: 200%;}'
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
no_stylesheets = True
|
||||
remove_empty_feeds = True
|
||||
use_embedded_content = False
|
||||
remove_tags = [dict(attrs={'class':['ekoLogo', 'powrocArt', 'butonDrukuj']})]
|
||||
|
||||
feeds = [(u'Wiadomo\u015bci', u'http://www.ekologia.pl/rss/20,53,0'), (u'\u015arodowisko', u'http://www.ekologia.pl/rss/20,56,0'), (u'Styl \u017cycia', u'http://www.ekologia.pl/rss/20,55,0')]
|
||||
|
||||
def print_version(self, url):
|
||||
id = re.search(r',(?P<id>\d+)\.html', url).group('id')
|
||||
return 'http://drukuj.ekologia.pl/artykul/' + id
|
@ -5,6 +5,7 @@ class AdvancedUserRecipe1341650280(BasicNewsRecipe):
|
||||
|
||||
title = u'Empire Magazine'
|
||||
description = 'Author D.Asbury. Film articles from Empire Mag. '
|
||||
language = 'en'
|
||||
__author__ = 'Dave Asbury'
|
||||
# last updated 7/7/12
|
||||
remove_empty_feeds = True
|
||||
|
recipes/film_org_pl.recipe (new file)
@ -0,0 +1,19 @@
|
||||
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
import re
|
||||
class FilmOrgPl(BasicNewsRecipe):
|
||||
title = u'Film.org.pl'
|
||||
__author__ = 'fenuks'
|
||||
description = u"Recenzje, analizy, artykuły, rankingi - wszystko o filmie dla miłośników kina. Opisy efektów specjalnych, wersji reżyserskich, remake'ów, sequeli. No i forum filmowe. Jedne z największych w Polsce."
|
||||
category = 'film'
|
||||
language = 'pl'
|
||||
cover_url = 'http://film.org.pl/wp-content/themes/KMF/images/logo_kmf10.png'
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
no_stylesheets = True
|
||||
remove_empty_feeds = True
|
||||
use_embedded_content = True
|
||||
preprocess_regexps = [(re.compile(ur'<h3>Przeczytaj także:</h3>.*', re.IGNORECASE|re.DOTALL), lambda m: '</body>'), (re.compile(ur'<div>Artykuł</div>', re.IGNORECASE), lambda m: ''), (re.compile(ur'<div>Ludzie filmu</div>', re.IGNORECASE), lambda m: '')]
|
||||
remove_tags = [dict(name='img', attrs={'alt':['Ludzie filmu', u'Artykuł']})]
|
||||
feeds = [(u'Recenzje', u'http://film.org.pl/r/recenzje/feed/'), (u'Artyku\u0142', u'http://film.org.pl/a/artykul/feed/'), (u'Analiza', u'http://film.org.pl/a/analiza/feed/'), (u'Ranking', u'http://film.org.pl/a/ranking/feed/'), (u'Blog', u'http://film.org.pl/kmf/blog/feed/'), (u'Ludzie', u'http://film.org.pl/a/ludzie/feed/'), (u'Seriale', u'http://film.org.pl/a/seriale/feed/'), (u'Oceanarium', u'http://film.org.pl/a/ocenarium/feed/'), (u'VHS', u'http://film.org.pl/a/vhs-a/feed/')]
|
@ -17,6 +17,7 @@ class FilmWebPl(BasicNewsRecipe):
|
||||
preprocess_regexps = [(re.compile(u'\(kliknij\,\ aby powiększyć\)', re.IGNORECASE), lambda m: ''), ]#(re.compile(ur' | ', re.IGNORECASE), lambda m: '')]
|
||||
extra_css = '.hdrBig {font-size:22px;} ul {list-style-type:none; padding: 0; margin: 0;}'
|
||||
remove_tags= [dict(name='div', attrs={'class':['recommendOthers']}), dict(name='ul', attrs={'class':'fontSizeSet'}), dict(attrs={'class':'userSurname anno'})]
|
||||
remove_attributes = ['style',]
|
||||
keep_only_tags= [dict(name='h1', attrs={'class':['hdrBig', 'hdrEntity']}), dict(name='div', attrs={'class':['newsInfo', 'newsInfoSmall', 'reviewContent description']})]
|
||||
feeds = [(u'News / Filmy w produkcji', 'http://www.filmweb.pl/feed/news/category/filminproduction'),
|
||||
(u'News / Festiwale, nagrody i przeglądy', u'http://www.filmweb.pl/feed/news/category/festival'),
|
||||
@ -50,4 +51,9 @@ class FilmWebPl(BasicNewsRecipe):
|
||||
for i in soup.findAll('sup'):
|
||||
if not i.string or i.string.startswith('(kliknij'):
|
||||
i.extract()
|
||||
tag = soup.find(name='ul', attrs={'class':'inline sep-line'})
|
||||
if tag:
|
||||
tag.name = 'div'
|
||||
for t in tag.findAll('li'):
|
||||
t.name = 'div'
|
||||
return soup
|
||||
|
@ -4,9 +4,10 @@ import re
|
||||
class Gildia(BasicNewsRecipe):
|
||||
title = u'Gildia.pl'
|
||||
__author__ = 'fenuks'
|
||||
description = 'Gildia - cultural site'
|
||||
description = u'Fantastyczny Portal Kulturalny - newsy, recenzje, galerie, wywiady. Literatura, film, gry komputerowe i planszowe, komiks, RPG, sklep. Nie lekceważ potęgi wyobraźni!'
|
||||
cover_url = 'http://www.film.gildia.pl/_n_/portal/redakcja/logo/logo-gildia.pl-500.jpg'
|
||||
category = 'culture'
|
||||
cover_url = 'http://gildia.pl/images/logo-main.png'
|
||||
language = 'pl'
|
||||
oldest_article = 8
|
||||
max_articles_per_feed = 100
|
||||
@ -23,10 +24,13 @@ class Gildia(BasicNewsRecipe):
|
||||
content = soup.find('div', attrs={'class':'news'})
|
||||
if 'recenzj' in soup.title.string.lower():
|
||||
for link in content.findAll(name='a'):
|
||||
if 'recenzj' in link['href']:
|
||||
self.log.warn('odnosnik')
|
||||
self.log.warn(link['href'])
|
||||
if 'recenzj' in link['href'] or 'muzyka/plyty' in link['href']:
|
||||
return self.index_to_soup(link['href'], raw=True)
|
||||
if 'fragmen' in soup.title.string.lower():
|
||||
for link in content.findAll(name='a'):
|
||||
if 'fragment' in link['href']:
|
||||
return self.index_to_soup(link['href'], raw=True)
|
||||
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
for a in soup('a'):
|
||||
|
@ -1,19 +1,20 @@
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
|
||||
from calibre.ebooks.BeautifulSoup import BeautifulSoup
|
||||
class Gram_pl(BasicNewsRecipe):
|
||||
title = u'Gram.pl'
|
||||
__author__ = 'fenuks'
|
||||
description = 'Gram.pl - site about computer games'
|
||||
description = u'Serwis społecznościowy o grach: recenzje, newsy, zapowiedzi, encyklopedia gier, forum. Gry PC, PS3, X360, PS Vita, sprzęt dla graczy.'
|
||||
category = 'games'
|
||||
language = 'pl'
|
||||
oldest_article = 8
|
||||
index='http://www.gram.pl'
|
||||
max_articles_per_feed = 100
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
no_stylesheets= True
|
||||
extra_css = 'h2 {font-style: italic; font-size:20px;} .picbox div {float: left;}'
|
||||
#extra_css = 'h2 {font-style: italic; font-size:20px;} .picbox div {float: left;}'
|
||||
cover_url=u'http://www.gram.pl/www/01/img/grampl_zima.png'
|
||||
remove_tags= [dict(name='p', attrs={'class':['extraText', 'must-log-in']}), dict(attrs={'class':['el', 'headline', 'post-info', 'entry-footer clearfix']}), dict(name='div', attrs={'class':['twojaOcena', 'comment-body', 'comment-author vcard', 'comment-meta commentmetadata', 'tw_button', 'entry-comment-counter', 'snap_nopreview sharing robots-nocontent', 'sharedaddy sd-sharing-enabled']}), dict(id=['igit_rpwt_css', 'comments', 'reply-title', 'igit_title'])]
|
||||
keep_only_tags= [dict(name='div', attrs={'class':['main', 'arkh-postmetadataheader', 'arkh-postcontent', 'post', 'content', 'news_header', 'news_subheader', 'news_text']}), dict(attrs={'class':['contentheading', 'contentpaneopen']}), dict(name='article')]
|
||||
keep_only_tags= [dict(id='articleModule')]
|
||||
remove_tags = [dict(attrs={'class':['breadCrump', 'dymek', 'articleFooter']})]
|
||||
feeds = [(u'Informacje', u'http://www.gram.pl/feed_news.asp'),
|
||||
(u'Publikacje', u'http://www.gram.pl/feed_news.asp?type=articles'),
|
||||
(u'Kolektyw- Indie Games', u'http://indie.gram.pl/feed/'),
|
||||
@ -28,35 +29,21 @@ class Gram_pl(BasicNewsRecipe):
|
||||
feed.articles.remove(article)
|
||||
return feeds
|
||||
|
||||
def append_page(self, soup, appendtag):
|
||||
nexturl = appendtag.find('a', attrs={'class':'cpn'})
|
||||
while nexturl:
|
||||
soup2 = self.index_to_soup('http://www.gram.pl'+ nexturl['href'])
|
||||
r=appendtag.find(id='pgbox')
|
||||
if r:
|
||||
r.extract()
|
||||
pagetext = soup2.find(attrs={'class':'main'})
|
||||
r=pagetext.find('h1')
|
||||
if r:
|
||||
r.extract()
|
||||
r=pagetext.find('h2')
|
||||
if r:
|
||||
r.extract()
|
||||
for r in pagetext.findAll('script'):
|
||||
r.extract()
|
||||
pos = len(appendtag.contents)
|
||||
appendtag.insert(pos, pagetext)
|
||||
nexturl = appendtag.find('a', attrs={'class':'cpn'})
|
||||
r=appendtag.find(id='pgbox')
|
||||
if r:
|
||||
r.extract()
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
self.append_page(soup, soup.body)
|
||||
tag=soup.findAll(name='div', attrs={'class':'picbox'})
|
||||
for t in tag:
|
||||
t['style']='float: left;'
|
||||
tag=soup.find(name='div', attrs={'class':'summary'})
|
||||
if tag:
|
||||
tag.find(attrs={'class':'pros'}).insert(0, BeautifulSoup('<h2>Plusy:</h2>').h2)
|
||||
tag.find(attrs={'class':'cons'}).insert(0, BeautifulSoup('<h2>Minusy:</h2>').h2)
|
||||
tag = soup.find(name='section', attrs={'class':'cenzurka'})
|
||||
if tag:
|
||||
rate = tag.p.img['data-ocena']
|
||||
tag.p.img.extract()
|
||||
tag.p.insert(len(tag.p.contents)-2, BeautifulSoup('<h2>Ocena: {0}</h2>'.format(rate)).h2)
|
||||
for a in soup('a'):
|
||||
if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
|
||||
a['href']=self.index + a['href']
|
||||
tag=soup.find(name='span', attrs={'class':'platforma'})
|
||||
if tag:
|
||||
tag.name = 'p'
|
||||
return soup
|
||||
|
@ -1,5 +1,5 @@
|
||||
__license__ = 'GPL v3'
|
||||
__copyright__ = '2008-2010, Darko Miletic <darko.miletic at gmail.com>'
|
||||
__copyright__ = '2008-2012, Darko Miletic <darko.miletic at gmail.com>'
|
||||
'''
|
||||
harpers.org
|
||||
'''
|
||||
@ -16,6 +16,7 @@ class Harpers(BasicNewsRecipe):
|
||||
max_articles_per_feed = 100
|
||||
no_stylesheets = True
|
||||
use_embedded_content = False
|
||||
masthead_url = 'http://harpers.org/wp-content/themes/harpers/images/pheader.gif'
|
||||
|
||||
conversion_options = {
|
||||
'comment' : description
|
||||
@ -31,27 +32,9 @@ class Harpers(BasicNewsRecipe):
|
||||
.caption{font-family:Verdana,sans-serif;font-size:x-small;color:#666666;}
|
||||
'''
|
||||
|
||||
keep_only_tags = [ dict(name='div', attrs={'id':'cached'}) ]
|
||||
remove_tags = [
|
||||
dict(name='table', attrs={'class':['rcnt','rcnt topline']})
|
||||
,dict(name=['link','object','embed','meta','base'])
|
||||
]
|
||||
keep_only_tags = [ dict(name='div', attrs={'class':['postdetailFull', 'articlePost']}) ]
|
||||
remove_tags = [dict(name=['link','object','embed','meta','base'])]
|
||||
remove_attributes = ['width','height']
|
||||
|
||||
feeds = [(u"Harper's Magazine", u'http://www.harpers.org/rss/frontpage-rss20.xml')]
|
||||
feeds = [(u"Harper's Magazine", u'http://harpers.org/feed/')]
|
||||
|
||||
def get_cover_url(self):
|
||||
cover_url = None
|
||||
index = 'http://harpers.org/'
|
||||
soup = self.index_to_soup(index)
|
||||
link_item = soup.find(name = 'img',attrs= {'class':"cover"})
|
||||
if link_item:
|
||||
cover_url = 'http://harpers.org' + link_item['src']
|
||||
return cover_url
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
for item in soup.findAll(style=True):
|
||||
del item['style']
|
||||
for item in soup.findAll(xmlns=True):
|
||||
del item['xmlns']
|
||||
return soup
|
||||
|
@ -1,18 +1,22 @@
|
||||
__license__ = 'GPL v3'
|
||||
__copyright__ = '2008-2010, Darko Miletic <darko.miletic at gmail.com>'
|
||||
__copyright__ = '2008-2012, Darko Miletic <darko.miletic at gmail.com>'
|
||||
'''
|
||||
harpers.org - paid subscription/ printed issue articles
|
||||
This recipe only get's article's published in text format
|
||||
images and pdf's are ignored
|
||||
If you have institutional subscription based on access IP you do not need to enter
|
||||
anything in username/password fields
|
||||
'''
|
||||
|
||||
import time, re
|
||||
import urllib
|
||||
from calibre import strftime
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
|
||||
class Harpers_full(BasicNewsRecipe):
|
||||
title = "Harper's Magazine - articles from printed edition"
|
||||
__author__ = 'Darko Miletic'
|
||||
description = "Harper's Magazine: Founded June 1850."
|
||||
description = "Harper's Magazine, the oldest general-interest monthly in America, explores the issues that drive our national conversation, through long-form narrative journalism and essays, and such celebrated features as the iconic Harper's Index."
|
||||
publisher = "Harpers's"
|
||||
category = 'news, politics, USA'
|
||||
oldest_article = 30
|
||||
@ -21,13 +25,16 @@ class Harpers_full(BasicNewsRecipe):
|
||||
use_embedded_content = False
|
||||
delay = 1
|
||||
language = 'en'
|
||||
needs_subscription = True
|
||||
masthead_url = 'http://www.harpers.org/media/image/Harpers_305x100.gif'
|
||||
encoding = 'utf8'
|
||||
needs_subscription = 'optional'
|
||||
masthead_url = 'http://harpers.org/wp-content/themes/harpers/images/pheader.gif'
|
||||
publication_type = 'magazine'
|
||||
INDEX = strftime('http://www.harpers.org/archive/%Y/%m')
|
||||
LOGIN = 'http://www.harpers.org'
|
||||
cover_url = strftime('http://www.harpers.org/media/pages/%Y/%m/gif/0001.gif')
|
||||
extra_css = ' body{font-family: "Georgia",serif} '
|
||||
LOGIN = 'http://harpers.org/wp-content/themes/harpers/ajax_login.php'
|
||||
extra_css = """
|
||||
body{font-family: adobe-caslon-pro,serif}
|
||||
.category{font-size: small}
|
||||
.articlePost p:first-letter{display: inline; font-size: xx-large; font-weight: bold}
|
||||
"""
|
||||
|
||||
conversion_options = {
|
||||
'comment' : description
|
||||
@ -36,32 +43,53 @@ class Harpers_full(BasicNewsRecipe):
|
||||
, 'language' : language
|
||||
}
|
||||
|
||||
keep_only_tags = [ dict(name='div', attrs={'id':'cached'}) ]
|
||||
keep_only_tags = [ dict(name='div', attrs={'class':['postdetailFull','articlePost']}) ]
|
||||
remove_tags = [
|
||||
dict(name='table', attrs={'class':['rcnt','rcnt topline']})
|
||||
,dict(name='link')
|
||||
dict(name='div', attrs={'class':'fRight rightDivPad'})
|
||||
,dict(name=['link','meta','object','embed','iframe'])
|
||||
]
|
||||
remove_attributes=['xmlns']
|
||||
|
||||
def get_browser(self):
|
||||
br = BasicNewsRecipe.get_browser()
|
||||
br.open('http://harpers.org/')
|
||||
if self.username is not None and self.password is not None:
|
||||
br.open(self.LOGIN)
|
||||
br.select_form(nr=1)
|
||||
br['handle' ] = self.username
|
||||
br['password'] = self.password
|
||||
br.submit()
|
||||
tt = time.localtime()*1000
|
||||
data = urllib.urlencode({ 'm':self.username
|
||||
,'p':self.password
|
||||
,'rt':'http://harpers.org/'
|
||||
,'tt':tt
|
||||
})
|
||||
br.open(self.LOGIN, data)
|
||||
return br
|
||||
|
||||
def parse_index(self):
|
||||
#find current issue
|
||||
soup = self.index_to_soup('http://harpers.org/')
|
||||
currentIssue=soup.find('div',attrs={'class':'mainNavi'}).find('li',attrs={'class':'curentIssue'})
|
||||
currentIssue_url=self.tag_to_string(currentIssue.a['href'])
|
||||
self.log(currentIssue_url)
|
||||
|
||||
#go to the current issue
|
||||
soup1 = self.index_to_soup(currentIssue_url)
|
||||
date = re.split('\s\|\s',self.tag_to_string(soup1.head.title.string))[0]
|
||||
self.timefmt = u' [%s]'%date
|
||||
|
||||
#get cover
|
||||
coverurl='http://harpers.org/wp-content/themes/harpers/ajax_microfiche.php?img=harpers-'+re.split('harpers.org/',currentIssue_url)[1]+'gif/0001.gif'
|
||||
soup2 = self.index_to_soup(coverurl)
|
||||
self.cover_url = self.tag_to_string(soup2.find('img')['src'])
|
||||
self.log(self.cover_url)
|
||||
articles = []
|
||||
print 'Processing ' + self.INDEX
|
||||
soup = self.index_to_soup(self.INDEX)
|
||||
for item in soup.findAll('div', attrs={'class':'title'}):
|
||||
text_link = item.parent.find('img',attrs={'alt':'Text'})
|
||||
if text_link:
|
||||
url = self.LOGIN + item.a['href']
|
||||
title = item.a.contents[0]
|
||||
count = 0
|
||||
for item in soup1.findAll('div', attrs={'class':'articleData'}):
|
||||
text_links = item.findAll('h2')
|
||||
for text_link in text_links:
|
||||
if count == 0:
|
||||
count = 1
|
||||
else:
|
||||
url = text_link.a['href']
|
||||
title = text_link.a.contents[0]
|
||||
date = strftime(' %B %Y')
|
||||
articles.append({
|
||||
'title' :title
|
||||
@ -69,4 +97,14 @@ class Harpers_full(BasicNewsRecipe):
|
||||
,'url' :url
|
||||
,'description':''
|
||||
})
|
||||
return [(soup.head.title.string, articles)]
|
||||
return [(soup1.head.title.string, articles)]
|
||||
|
||||
def print_version(self, url):
|
||||
return url + '?single=1'
|
||||
|
||||
def cleanup(self):
|
||||
soup = self.index_to_soup('http://harpers.org/')
|
||||
signouturl=self.tag_to_string(soup.find('li', attrs={'class':'subLogOut'}).findNext('li').a['href'])
|
||||
self.log(signouturl)
|
||||
self.browser.open(signouturl)
|
||||
|
||||
|
@ -15,23 +15,12 @@ class AdvancedUserRecipe(BasicNewsRecipe):
|
||||
timeout = 5
|
||||
no_stylesheets = True
|
||||
|
||||
keep_only_tags = [dict(name='div', attrs={'id':'mitte_news'}),
|
||||
dict(name='h1', attrs={'class':'clear'}),
|
||||
dict(name='div', attrs={'class':'meldung_wrapper'})]
|
||||
|
||||
remove_tags_after = dict(name ='p', attrs={'class':'editor'})
|
||||
remove_tags = [dict(id='navi_top_container'),
|
||||
dict(id='navi_bottom'),
|
||||
dict(id='mitte_rechts'),
|
||||
dict(id='navigation'),
|
||||
dict(id='subnavi'),
|
||||
dict(id='social_bookmarks'),
|
||||
dict(id='permalink'),
|
||||
dict(id='content_foren'),
|
||||
dict(id='seiten_navi'),
|
||||
dict(id='adbottom'),
|
||||
dict(id='sitemap'),
|
||||
dict(name='div', attrs={'id':'sitemap'}),
|
||||
dict(name='ul', attrs={'class':'erste_zeile'}),
|
||||
dict(name='ul', attrs={'class':'zweite_zeile'}),
|
||||
dict(name='div', attrs={'class':'navi_top_container'})]
|
||||
dict(name='p', attrs={'class':'size80'})]
|
||||
|
||||
feeds = [
|
||||
('Newsticker', 'http://www.heise.de/newsticker/heise.rdf'),
|
||||
@ -54,5 +43,3 @@ class AdvancedUserRecipe(BasicNewsRecipe):
|
||||
|
||||
def print_version(self, url):
|
||||
return url + '?view=print'
|
||||
|
||||
|
||||
|
@ -16,10 +16,14 @@ class TheHindu(BasicNewsRecipe):
|
||||
|
||||
keep_only_tags = [dict(id='content')]
|
||||
remove_tags = [dict(attrs={'class':['article-links', 'breadcr']}),
|
||||
dict(id=['email-section', 'right-column', 'printfooter'])]
|
||||
dict(id=['email-section', 'right-column', 'printfooter', 'topover',
|
||||
'slidebox', 'th_footer'])]
|
||||
|
||||
extra_css = '.photo-caption { font-size: smaller }'
|
||||
|
||||
def preprocess_raw_html(self, raw, url):
|
||||
return raw.replace('<body><p>', '<p>').replace('</p></body>', '</p>')
|
||||
|
||||
def postprocess_html(self, soup, first_fetch):
|
||||
for t in soup.findAll(['table', 'tr', 'td','center']):
|
||||
t.name = 'div'
|
||||
|
@ -3,7 +3,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
|
||||
class Historia_org_pl(BasicNewsRecipe):
|
||||
title = u'Historia.org.pl'
|
||||
__author__ = 'fenuks'
|
||||
description = u'history site'
|
||||
description = u'Artykuły dotyczące historii w układzie epok i tematów, forum. Najlepsza strona historii. Matura z historii i egzamin gimnazjalny z historii.'
|
||||
cover_url = 'http://lh3.googleusercontent.com/_QeRQus12wGg/TOvHsZ2GN7I/AAAAAAAAD_o/LY1JZDnq7ro/logo5.jpg'
|
||||
category = 'history'
|
||||
language = 'pl'
|
||||
@ -12,16 +12,15 @@ class Historia_org_pl(BasicNewsRecipe):
|
||||
no_stylesheets = True
|
||||
use_embedded_content = True
|
||||
max_articles_per_feed = 100
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
|
||||
feeds = [(u'Wszystkie', u'http://www.historia.org.pl/index.php?format=feed&type=atom'),
|
||||
(u'Wiadomości', u'http://www.historia.org.pl/index.php/wiadomosci.feed?type=atom'),
|
||||
(u'Publikacje', u'http://www.historia.org.pl/index.php/publikacje.feed?type=atom'),
|
||||
(u'Publicystyka', u'http://www.historia.org.pl/index.php/publicystyka.feed?type=atom'),
|
||||
(u'Recenzje', u'http://historia.org.pl/index.php/recenzje.feed?type=atom'),
|
||||
(u'Kultura i sztuka', u'http://www.historia.org.pl/index.php/kultura-i-sztuka.feed?type=atom'),
|
||||
(u'Rekonstykcje', u'http://www.historia.org.pl/index.php/rekonstrukcje.feed?type=atom'),
|
||||
(u'Projekty', u'http://www.historia.org.pl/index.php/projekty.feed?type=atom'),
|
||||
(u'Konkursy'), (u'http://www.historia.org.pl/index.php/konkursy.feed?type=atom')]
|
||||
|
||||
feeds = [(u'Wszystkie', u'http://historia.org.pl/feed/'),
|
||||
(u'Wiadomości', u'http://historia.org.pl/Kategoria/wiadomosci/feed/'),
|
||||
(u'Publikacje', u'http://historia.org.pl/Kategoria/artykuly/feed/'),
|
||||
(u'Publicystyka', u'http://historia.org.pl/Kategoria/publicystyka/feed/'),
|
||||
(u'Recenzje', u'http://historia.org.pl/Kategoria/recenzje/feed/'),
|
||||
(u'Projekty', u'http://historia.org.pl/Kategoria/projekty/feed/'),]
|
||||
|
||||
|
||||
def print_version(self, url):
|
||||
|
New binary files (recipe icons):
- recipes/icons/astroflesz.png (1.1 KiB)
- recipes/icons/czas_gentlemanow.png (24 KiB)
- recipes/icons/ekologia_pl.png (702 B)
- recipes/icons/poradnia_pwn.png (350 B)
- recipes/icons/tvp_info.png (329 B)
- recipes/icons/zaufana_trzecia_strona.png (412 B)
@ -47,9 +47,10 @@ class TheIndependentNew(BasicNewsRecipe):
|
||||
dict(name='img',attrs={'alt' : ['Get Adobe Flash player']}),
|
||||
dict(name='img',attrs={'alt' : ['view gallery']}),
|
||||
dict(attrs={'style' : re.compile('.*')}),
|
||||
dict(attrs={'class':lambda x: x and 'voicesRelatedTopics' in x.split()}),
|
||||
]
|
||||
|
||||
keep_only_tags =[dict(attrs={'id':'main'})]
|
||||
keep_only_tags =[dict(attrs={'id':['main','top']})]
|
||||
recursions = 0
|
||||
|
||||
# fixes non compliant html nesting and 'marks' article graphics links
|
||||
@ -69,7 +70,7 @@ class TheIndependentNew(BasicNewsRecipe):
|
||||
}
|
||||
|
||||
extra_css = """
|
||||
h1{font-family: Georgia,serif }
|
||||
h1{font-family: Georgia,serif ; font-size: x-large; }
|
||||
body{font-family: Verdana,Arial,Helvetica,sans-serif}
|
||||
img{margin-bottom: 0.4em; display:block}
|
||||
.starRating img {float: left}
|
||||
@ -77,16 +78,21 @@ class TheIndependentNew(BasicNewsRecipe):
|
||||
.image {clear:left; font-size: x-small; color:#888888;}
|
||||
.articleByTimeLocation {font-size: x-small; color:#888888;
|
||||
margin-bottom:0.2em ; margin-top:0.2em ; display:block}
|
||||
.subtitle {clear:left}
|
||||
.subtitle {clear:left ;}
|
||||
.column-1 h1 { color: #191919}
|
||||
.column-1 h2 { color: #333333}
|
||||
.column-1 h3 { color: #444444}
|
||||
.column-1 p { color: #777777}
|
||||
.column-1 p,a,h1,h2,h3 { margin: 0; }
|
||||
.column-1 div{color:#888888; margin: 0;}
|
||||
.subtitle { color: #777777; font-size: medium;}
|
||||
.column-1 a,h1,h2,h3 { margin: 0; }
|
||||
.column-1 div{margin: 0;}
|
||||
.articleContent {display: block; clear:left;}
|
||||
.articleContent {color: #000000; font-size: medium;}
|
||||
.ivDrip-section {color: #000000; font-size: medium;}
|
||||
.datetime {color: #888888}
|
||||
.title {font-weight:bold;}
|
||||
.storyTop{}
|
||||
.pictureContainer img { max-width: 400px; max-height: 400px;}
|
||||
.image img { max-width: 400px; max-height: 400px;}
|
||||
"""
|
||||
|
||||
oldest_article = 1
|
||||
@ -325,6 +331,20 @@ class TheIndependentNew(BasicNewsRecipe):
|
||||
item.contents[0] = ''
|
||||
|
||||
def postprocess_html(self,soup, first_fetch):
|
||||
|
||||
#mark subtitle parent as non-compliant nesting causes
|
||||
# p's to be 'popped out' of the h3 tag they are nested in.
|
||||
subtitle = soup.find('h3', attrs={'class' : 'subtitle'})
|
||||
subtitle_div = None
|
||||
if subtitle:
|
||||
subtitle_div = subtitle.parent
|
||||
if subtitle_div:
|
||||
clazz = ''
|
||||
if 'class' in subtitle_div:
|
||||
clazz = subtitle_div['class'] + ' '
|
||||
clazz = clazz + 'subtitle'
|
||||
subtitle_div['class'] = clazz
|
||||
|
||||
#find broken images and remove captions
|
||||
items_to_extract = []
|
||||
for item in soup.findAll('div', attrs={'class' : 'image'}):
|
||||
@ -501,6 +521,9 @@ class TheIndependentNew(BasicNewsRecipe):
|
||||
),
|
||||
(u'Opinion',
|
||||
u'http://www.independent.co.uk/opinion/?service=rss'),
|
||||
(u'Voices',
|
||||
u'http://www.independent.co.uk/voices/?service=rss'
|
||||
),
|
||||
(u'Environment',
|
||||
u'http://www.independent.co.uk/environment/?service=rss'),
|
||||
(u'Sport - Athletics',
|
||||
|
@ -9,6 +9,21 @@ class Kosmonauta(BasicNewsRecipe):
|
||||
language = 'pl'
|
||||
cover_url='http://bi.gazeta.pl/im/4/10393/z10393414X,Kosmonauta-net.jpg'
|
||||
no_stylesheets = True
|
||||
INDEX = 'http://www.kosmonauta.net'
|
||||
oldest_article = 7
|
||||
no_stylesheets = True
|
||||
max_articles_per_feed = 100
|
||||
feeds = [(u'Kosmonauta.net', u'http://www.kosmonauta.net/index.php/feed/rss.html')]
|
||||
keep_only_tags = [dict(name='div', attrs={'class':'item-page'})]
|
||||
remove_tags = [dict(attrs={'class':['article-tools clearfix', 'cedtag', 'nav clearfix', 'jwDisqusForm']})]
|
||||
remove_tags_after = dict(name='div', attrs={'class':'cedtag'})
|
||||
feeds = [(u'Kosmonauta.net', u'http://www.kosmonauta.net/?format=feed&type=atom')]
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
for a in soup.findAll(name='a'):
|
||||
if a.has_key('href'):
|
||||
href = a['href']
|
||||
if not href.startswith('http'):
|
||||
a['href'] = self.INDEX + href
|
||||
print '%%%%%%%%%%%%%%%%%%%%%%%%%', a['href']
|
||||
return soup
|
||||
|
@ -1,15 +1,16 @@
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
import re
|
||||
class Ksiazka_net_pl(BasicNewsRecipe):
|
||||
title = u'ksiazka.net.pl'
|
||||
title = u'książka.net.pl'
|
||||
__author__ = 'fenuks'
|
||||
description = u'Ksiazka.net.pl - book vortal'
|
||||
description = u'Portal Księgarski - tematyczny serwis o książkach. Wydarzenia z rynku księgarsko-wydawniczego, nowości, zapowiedzi, bestsellery, setki recenzji. Niezbędne informacje dla każdego miłośnika książek, księgarza, bibliotekarza i wydawcy.'
|
||||
cover_url = 'http://www.ksiazka.net.pl/fileadmin/templates/ksiazka.net.pl/images/1PortalKsiegarski-logo.jpg'
|
||||
category = 'books'
|
||||
language = 'pl'
|
||||
oldest_article = 8
|
||||
max_articles_per_feed = 100
|
||||
no_stylesheets= True
|
||||
remove_empty_feeds = True
|
||||
#extra_css = 'img {float: right;}'
|
||||
preprocess_regexps = [(re.compile(ur'Podoba mi się, kupuję:'), lambda match: '<br />')]
|
||||
remove_tags_before= dict(name='div', attrs={'class':'m-body'})
|
||||
|
@ -2,7 +2,7 @@
|
||||
__license__ = 'GPL v3'
|
||||
__author__ = 'Gabriele Marini, based on Darko Miletic'
|
||||
__copyright__ = '2009, Darko Miletic <darko.miletic at gmail.com>'
|
||||
__description__ = 'La Stampa 05/05/2010'
|
||||
__description__ = 'La Stampa 28/12/2012'
|
||||
|
||||
'''
|
||||
http://www.lastampa.it/
|
||||
@ -14,10 +14,11 @@ class LaStampa(BasicNewsRecipe):
|
||||
title = u'La Stampa'
|
||||
language = 'it'
|
||||
__author__ = 'Gabriele Marini'
|
||||
oldest_article = 15
|
||||
#oldest_article = 15
|
||||
oldest_articlce = 7 #for daily schedule
|
||||
max_articles_per_feed = 50
|
||||
recursion = 100
|
||||
cover_url = 'http://www.lastampa.it/edicola/PDF/1.pdf'
|
||||
cover_url = 'http://www1.lastampa.it/edicola/PDF/1.pdf'
|
||||
use_embedded_content = False
|
||||
remove_javascript = True
|
||||
no_stylesheets = True
|
||||
@ -33,35 +34,41 @@ class LaStampa(BasicNewsRecipe):
|
||||
if link:
|
||||
return link[0]['href']
|
||||
|
||||
keep_only_tags = [dict(attrs={'class':['boxocchiello2','titoloRub','titologir','catenaccio','sezione','articologirata']}),
|
||||
keep_only_tags = [dict(attrs={'class':['boxocchiello2','titoloRub','titologir','autore-girata','luogo-girata','catenaccio','sezione','articologirata','bodytext','news-single-img','ls-articoloCorpo','ls-blog-list-1col']}),
|
||||
dict(name='div', attrs={'id':'corpoarticolo'})
|
||||
]
|
||||
|
||||
remove_tags = [dict(name='div', attrs={'id':'menutop'}),
|
||||
dict(name='div', attrs={'id':'fwnetblocco'}),
|
||||
dict(name='table', attrs={'id':'strumenti'}),
|
||||
dict(name='table', attrs={'id':'imgesterna'}),
|
||||
dict(name='a', attrs={'class':'linkblu'}),
|
||||
dict(name='a', attrs={'class':'link'}),
|
||||
|
||||
remove_tags = [dict(name='div', attrs={'id':['menutop','fwnetblocco']}),
|
||||
dict(attrs={'class':['ls-toolbarCommenti','ls-boxCommentsBlog']}),
|
||||
dict(name='table', attrs={'id':['strumenti','imgesterna']}),
|
||||
dict(name='a', attrs={'class':['linkblu','link']}),
|
||||
dict(name='span', attrs={'class':['boxocchiello','boxocchiello2','sezione']})
|
||||
]
|
||||
|
||||
feeds = [
|
||||
(u'Home', u'http://www.lastampa.it/redazione/rss_home.xml'),
|
||||
(u'Editoriali', u'http://www.lastampa.it/cmstp/rubriche/oggetti/rss.asp?ID_blog=25'),
|
||||
(u'Politica', u'http://www.lastampa.it/redazione/cmssezioni/politica/rss_politica.xml'),
|
||||
(u'ArciItaliana', u'http://www.lastampa.it/cmstp/rubriche/oggetti/rss.asp?ID_blog=14'),
|
||||
(u'Cronache', u'http://www.lastampa.it/redazione/cmssezioni/cronache/rss_cronache.xml'),
|
||||
(u'Esteri', u'http://www.lastampa.it/redazione/cmssezioni/esteri/rss_esteri.xml'),
|
||||
(u'Danni Collaterali', u'http://www.lastampa.it/cmstp/rubriche/oggetti/rss.asp?ID_blog=90'),
|
||||
(u'Economia', u'http://www.lastampa.it/redazione/cmssezioni/economia/rss_economia.xml'),
|
||||
(u'Tecnologia ', u'http://www.lastampa.it/cmstp/rubriche/oggetti/rss.asp?ID_blog=30'),
|
||||
(u'Spettacoli', u'http://www.lastampa.it/redazione/cmssezioni/spettacoli/rss_spettacoli.xml'),
|
||||
(u'Sport', u'http://www.lastampa.it/sport/rss_home.xml'),
|
||||
(u'Torino', u'http://rss.feedsportal.com/c/32418/f/466938/index.rss'),
|
||||
(u'Motori', u'http://www.lastampa.it/cmstp/rubriche/oggetti/rss.asp?ID_blog=57'),
|
||||
(u'Scienza', u'http://www.lastampa.it/cmstp/rubriche/oggetti/rss.asp?ID_blog=38'),
|
||||
(u'Fotografia', u'http://rss.feedsportal.com/c/32418/f/478449/index.rss'),
|
||||
(u'Scuola', u'http://www.lastampa.it/cmstp/rubriche/oggetti/rss.asp?ID_blog=60'),
|
||||
(u'Tempo Libero', u'http://www.lastampa.it/tempolibero/rss_home.xml')
|
||||
feeds = [(u'BuonGiorno',u'http://www.lastampa.it/cultura/opinioni/buongiorno/rss.xml'),
|
||||
(u'Jena', u'http://www.lastampa.it/cultura/opinioni/jena/rss.xml'),
|
||||
(u'Editoriali', u'http://www.lastampa.it/cultura/opinioni/editoriali'),
|
||||
(u'Finestra sull America', u'http://lastampa.feedsportal.com/c/32418/f/625713/index.rss'),
|
||||
(u'HomePage', u'http://www.lastampa.it/rss.xml'),
|
||||
(u'Politica Italia', u'http://www.lastampa.it/italia/politica/rss.xml'),
|
||||
(u'ArciItaliana', u'http://www.lastampa.it/rss/blog/arcitaliana'),
|
||||
(u'Cronache', u'http://www.lastampa.it/italia/cronache/rss.xml'),
|
||||
(u'Esteri', u'http://www.lastampa.it/esteri/rss.xml'),
|
||||
(u'Danni Collaterali', u'http://www.lastampa.it/rss/blog/danni-collaterali'),
|
||||
(u'Economia', u'http://www.lastampa.it/economia/rss.xml'),
|
||||
(u'Tecnologia ', u'http://www.lastampa.it/tecnologia/rss.xml'),
|
||||
(u'Spettacoli', u'http://www.lastampa.it/spettacoli/rss.xml'),
|
||||
(u'Sport', u'http://www.lastampa.it/sport/rss.xml'),
|
||||
(u'Torino', u'http://www.lastampa.it/cronaca/rss.xml'),
|
||||
(u'Motori', u'http://www.lastampa.it/motori/rss.xml'),
|
||||
(u'Scienza', u'http://www.lastampa.it/scienza/rss.xml'),
|
||||
(u'Cultura', u'http://www.lastampa.it/cultura/rss.xml'),
|
||||
(u'Scuola', u'http://www.lastampa.it/cultura/scuola/rss.xml'),
|
||||
(u'Benessere', u'http://www.lastampa.it/scienza/benessere/rss.xml'),
|
||||
(u'Cucina', u'http://www.lastampa.it/societa/cucina/rss.xml'),
|
||||
(u'Casa', u'http://www.lastampa.it/societa/casa/rss.xml'),
|
||||
(u'Moda',u'http://www.lastampa.it/societa/moda/rss.xml'),
|
||||
(u'Giochi',u'http://www.lastampa.it/tecnologia/giochi/rss.xml'),
|
||||
(u'Viaggi',u'http://www.lastampa.it/societa/viaggi/rss.xml'),
|
||||
(u'Ambiente', u'http://www.lastampa.it/scienza/ambiente/rss.xml')
|
||||
]
|
||||
|
@ -7,9 +7,9 @@ class AdvancedUserRecipe1324114228(BasicNewsRecipe):
|
||||
max_articles_per_feed = 100
|
||||
auto_cleanup = True
|
||||
masthead_url = 'http://www.lavoce.info/binary/la_voce/testata/lavoce.1184661635.gif'
|
||||
feeds = [(u'La Voce', u'http://www.lavoce.info/feed_rss.php?id_feed=1')]
|
||||
feeds = [(u'La Voce', u'http://www.lavoce.info/feed/')]
|
||||
__author__ = 'faber1971'
|
||||
description = 'Italian website on Economy - v1.01 (17, December 2011)'
|
||||
description = 'Italian website on Economy - v1.02 (27, December 2012)'
|
||||
language = 'it'
|
||||
|
||||
|
||||
|
@ -22,13 +22,15 @@ class LeMonde(BasicNewsRecipe):
#publication_type = 'newsportal'
extra_css = '''
h1{font-size:130%;}
h2{font-size:100%;}
blockquote.aside {background-color: #DDD; padding: 0.5em;}
.ariane{font-size:xx-small;}
.source{font-size:xx-small;}
#.href{font-size:xx-small;}
#.figcaption style{color:#666666; font-size:x-small;}
#.main-article-info{font-family:Arial,Helvetica,sans-serif;}
#full-contents{font-size:small; font-family:Arial,Helvetica,sans-serif;font-weight:normal;}
#match-stats-summary{font-size:small; font-family:Arial,Helvetica,sans-serif;font-weight:normal;}
/*.href{font-size:xx-small;}*/
/*.figcaption style{color:#666666; font-size:x-small;}*/
/*.main-article-info{font-family:Arial,Helvetica,sans-serif;}*/
/*full-contents{font-size:small; font-family:Arial,Helvetica,sans-serif;font-weight:normal;}*/
/*match-stats-summary{font-size:small; font-family:Arial,Helvetica,sans-serif;font-weight:normal;}*/
'''
#preprocess_regexps = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
conversion_options = {
@ -44,6 +46,9 @@ class LeMonde(BasicNewsRecipe):
filterDuplicates = True

def preprocess_html(self, soup):
for aside in soup.findAll('aside'):
aside.name='blockquote'
aside['class'] = "aside"
for alink in soup.findAll('a'):
if alink.string is not None:
tstr = alink.string
@ -107,7 +112,9 @@ class LeMonde(BasicNewsRecipe):
]

remove_tags = [
dict(name='div', attrs={'class':['bloc_base meme_sujet']}),
dict(attrs={'class':['rubriques_liees']}),
dict(attrs={'class':['sociaux']}),
dict(attrs={'class':['bloc_base meme_sujet']}),
dict(name='p', attrs={'class':['lire']})
]
@ -32,26 +32,28 @@ class ledevoir(BasicNewsRecipe):
recursion = 10
needs_subscription = 'optional'

filterDuplicates = False
url_list = []

remove_javascript = True
no_stylesheets = True
auto_cleanup = True

preprocess_regexps = [(re.compile(r'(title|alt)=".*?>.*?"', re.DOTALL), lambda m: '')]

keep_only_tags = [
dict(name='div', attrs={'id':'article'}),
dict(name='div', attrs={'id':'colonne_principale'})
]
#keep_only_tags = [
#dict(name='div', attrs={'id':'article_detail'}),
#dict(name='div', attrs={'id':'colonne_principale'})
#]

remove_tags = [
dict(name='div', attrs={'id':'dialog'}),
dict(name='div', attrs={'class':['interesse_actions','reactions']}),
dict(name='ul', attrs={'class':'mots_cles'}),
dict(name='a', attrs={'class':'haut'}),
dict(name='h5', attrs={'class':'interesse_actions'})
]
#remove_tags = [
#dict(name='div', attrs={'id':'dialog'}),
#dict(name='div', attrs={'class':['interesse_actions','reactions','taille_du_texte right clearfix','partage_sociaux clearfix']}),
#dict(name='aside', attrs={'class':['article_actions clearfix','reactions','partage_sociaux_wrapper']}),
#dict(name='ul', attrs={'class':'mots_cles'}),
#dict(name='ul', attrs={'id':'commentaires'}),
#dict(name='a', attrs={'class':'haut'}),
#dict(name='h5', attrs={'class':'interesse_actions'})
#]

feeds = [
(u'A la une', 'http://www.ledevoir.com/rss/manchettes.xml'),
@ -95,10 +97,4 @@ class ledevoir(BasicNewsRecipe):
br.submit()
return br

def print_version(self, url):
if self.filterDuplicates:
if url in self.url_list:
return
self.url_list.append(url)
return url
12
recipes/lvivs_ks_ghazieta.recipe
Normal file
@ -0,0 +1,12 @@
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1356270446(BasicNewsRecipe):
title = u'\u041b\u044c\u0432\u0456\u0432\u0441\u044c\u043a\u0430 \u0433\u0430\u0437\u0435\u0442\u0430'
__author__ = 'rpalyvoda'
oldest_article = 7
max_articles_per_feed = 100
language = 'uk'
cover_url = 'http://lvivska.com/sites/all/themes/biblos/images/logo.png'
masthead_url = 'http://lvivska.com/sites/all/themes/biblos/images/logo.png'
auto_cleanup = True
feeds = [(u'\u041d\u043e\u0432\u0438\u043d\u0438', u'http://lvivska.com/rss/news.xml'), (u'\u041f\u043e\u043b\u0456\u0442\u0438\u043a\u0430', u'http://lvivska.com/rss/politic.xml'), (u'\u0415\u043a\u043e\u043d\u043e\u043c\u0456\u043a\u0430', u'http://lvivska.com/rss/economic.xml'), (u'\u041f\u0440\u0430\u0432\u043e', u'http://lvivska.com/rss/law.xml'), (u'\u0421\u0432\u0456\u0442', u'http://lvivska.com/rss/world.xml'), (u'\u0416\u0438\u0442\u0442\u044f', u'http://lvivska.com/rss/life.xml'), (u'\u041a\u0443\u043b\u044c\u0442\u0443\u0440\u0430', u'http://lvivska.com/rss/culture.xml'), (u'\u041b\u0430\u0441\u0443\u043d', u'http://lvivska.com/rss/cooking.xml'), (u'\u0421\u0442\u0438\u043b\u044c', u'http://lvivska.com/rss/style.xml'), (u'Galicia Incognita', u'http://lvivska.com/rss/galiciaincognita.xml'), (u'\u0421\u043f\u043e\u0440\u0442', u'http://lvivska.com/rss/sport.xml'), (u'\u0415\u043a\u043e\u043b\u043e\u0433\u0456\u044f', u'http://lvivska.com/rss/ecology.xml'), (u"\u0417\u0434\u043e\u0440\u043e\u0432'\u044f", u'http://lvivska.com/rss/health.xml'), (u'\u0410\u0432\u0442\u043e', u'http://lvivska.com/rss/auto.xml'), (u'\u0411\u043b\u043e\u0433\u0438', u'http://lvivska.com/rss/blog.xml')]
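The feed list in this new recipe stores the Ukrainian section names as escaped unicode literals. As a quick readability check — a minimal sketch, not part of the recipe, with the first two tuples copied from the feeds list above — the titles can be printed next to their URLs:

# Minimal sketch; the tuples are copied from the feeds list above.
feeds = [(u'\u041d\u043e\u0432\u0438\u043d\u0438', u'http://lvivska.com/rss/news.xml'),
         (u'\u041f\u043e\u043b\u0456\u0442\u0438\u043a\u0430', u'http://lvivska.com/rss/politic.xml')]
for title, url in feeds:
    print(u'%s -> %s' % (title, url))  # prints the human-readable section name and its RSS URL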
@ -1,43 +1,74 @@
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import strftime
import re
import datetime
import time

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
title = u'Metro UK'
description = 'Author Dave Asbury : News from The Metro - UK'
description = 'News as provided by The Metro -UK'
#timefmt = ''
__author__ = 'Dave Asbury'
#last update 9/9/12
#last update 9/6/12
cover_url = 'http://profile.ak.fbcdn.net/hprofile-ak-snc4/276636_117118184990145_2132092232_n.jpg'
no_stylesheets = True
oldest_article = 1
max_articles_per_feed = 12
remove_empty_feeds = True
remove_javascript = True
#auto_cleanup = True
auto_cleanup = True
encoding = 'UTF-8'
cover_url ='http://profile.ak.fbcdn.net/hprofile-ak-snc4/157897_117118184990145_840702264_n.jpg'

language = 'en_GB'
masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'
extra_css = '''
h1{font-family:Arial,Helvetica,sans-serif; font-weight:900;font-size:1.6em;}
h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:1.2em;}
p{font-family:Arial,Helvetica,sans-serif;font-size:1.0em;}
body{font-family:Helvetica,Arial,sans-serif;font-size:1.0em;}
'''
keep_only_tags = [
#dict(name='h1'),
#dict(name='h2'),
#dict(name='div', attrs={'class' : ['row','article','img-cnt figure','clrd']})
#dict(name='h3'),
#dict(attrs={'class' : 'BText'}),
]
remove_tags = [
dict(name='div',attrs={'class' : 'art-fd fd-gr1-b clrd'}),
dict(name='span',attrs={'class' : 'share'}),
dict(name='li'),
dict(attrs={'class' : ['twitter-share-button','header-forms','hdr-lnks','close','art-rgt','fd-gr1-b clrd google-article','news m12 clrd clr-b p5t shareBtm','item-ds csl-3-img news','c-1of3 c-last','c-1of1','pd','item-ds csl-3-img sport']}),
dict(attrs={'id' : ['','sky-left','sky-right','ftr-nav','and-ftr','notificationList','logo','miniLogo','comments-news','metro_extras']})
]
remove_tags_before = dict(name='h1')
#remove_tags_after = dict(attrs={'id':['topic-buttons']})

feeds = [
(u'News', u'http://www.metro.co.uk/rss/news/'), (u'Money', u'http://www.metro.co.uk/rss/money/'), (u'Sport', u'http://www.metro.co.uk/rss/sport/'), (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'), (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'), (u'TV', u'http://www.metro.co.uk/rss/tv/'), (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'), (u'Weird News', u'http://www.metro.co.uk/rss/weird/'), (u'Travel', u'http://www.metro.co.uk/rss/travel/'), (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'), (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'), (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/')]
def parse_index(self):
articles = {}
key = None
ans = []
feeds = [ ('UK', 'http://metro.co.uk/news/uk/'),
('World', 'http://metro.co.uk/news/world/'),
('Weird', 'http://metro.co.uk/news/weird/'),
('Money', 'http://metro.co.uk/news/money/'),
('Sport', 'http://metro.co.uk/sport/'),
('Guilty Pleasures', 'http://metro.co.uk/guilty-pleasures/')
]
for key, feed in feeds:
soup = self.index_to_soup(feed)
articles[key] = []
ans.append(key)

today = datetime.date.today()
today = time.mktime(today.timetuple())-60*60*24

for a in soup.findAll('a'):
for name, value in a.attrs:
if name == "class" and value=="post":
url = a['href']
title = a['title']
print title
description = ''
m = re.search('^.*uk/([^/]*)/([^/]*)/([^/]*)/', url)
skip = 1
if len(m.groups()) == 3:
g = m.groups()
dt = datetime.datetime.strptime(''+g[0]+'-'+g[1]+'-'+g[2], '%Y-%m-%d')
pubdate = time.strftime('%a, %d %b', dt.timetuple())

dt = time.mktime(dt.timetuple())
if dt >= today:
print pubdate
skip = 0
else:
pubdate = strftime('%a, %d %b')

summary = a.find(True, attrs={'class':'excerpt'})
if summary:
description = self.tag_to_string(summary, use_alt=False)

if skip == 0:
articles[key].append(
dict(title=title, url=url, date=pubdate,
description=description,
content=''))
#ans = self.sort_index_by(ans, {'The Front Page':-1, 'Dining In, Dining Out':1, 'Obituaries':2})
ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
return ans
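The parse_index method above derives the publication date from the article URL path (year/month/day segments) and keeps only articles from roughly the last day. As a standalone illustration of that check — a sketch using a hypothetical URL, not one scraped from the site — the same logic reads:

import re, time, datetime

url = 'http://metro.co.uk/news/uk/2012/12/27/some-article-slug/'  # hypothetical URL for illustration
m = re.search('^.*uk/([^/]*)/([^/]*)/([^/]*)/', url)
if m is not None:
    g = m.groups()
    dt = datetime.datetime.strptime(g[0] + '-' + g[1] + '-' + g[2], '%Y-%m-%d')
    cutoff = time.mktime(datetime.date.today().timetuple()) - 60*60*24  # one day ago
    if time.mktime(dt.timetuple()) >= cutoff:
        print(time.strftime('%a, %d %b', dt.timetuple()))  # recent enough to keep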
@ -2,7 +2,7 @@

from calibre.web.feeds.news import BasicNewsRecipe
class Mlody_technik(BasicNewsRecipe):
title = u'Mlody technik'
title = u'Młody technik'
__author__ = 'fenuks'
description = u'Młody technik'
category = 'science'
27
recipes/mobile_bulgaria.recipe
Normal file
@ -0,0 +1,27 @@
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1329123365(BasicNewsRecipe):
title = u'Mobilebulgaria.com'
__author__ = 'M3 Web'
description = 'The biggest Bulgarian site covering mobile consumer electronics. Offers detailed reviews, popular discussion forum, shop and platform for selling new and second hand phones and gadgets.'
category = 'News, Reviews, Offers, Forum'
oldest_article = 45
max_articles_per_feed = 10
language = 'bg'
encoding = 'windows-1251'
no_stylesheets = False
remove_javascript = True
keep_only_tags = [dict(name='div', attrs={'class':'bigblock'}),
dict(name='div', attrs={'class':'verybigblock'}),
dict(name='table', attrs={'class':'obiaviresults'}),
dict(name='div', attrs={'class':'forumblock'}),
dict(name='div', attrs={'class':'forumblock_b1'}),
dict(name='div', attrs={'class':'block2_2colswrap'})]

feeds = [(u'News', u'http://www.mobilebulgaria.com/rss_full.php'),
(u'Reviews', u'http://www.mobilebulgaria.com/rss_reviews.php'),
(u'Offers', u'http://www.mobilebulgaria.com/obiavi/rss.php'),
(u'Forum', u'http://www.mobilebulgaria.com/rss_forum_last10.php')]

extra_css = '''
#gallery1 div{display: block; float: left; margin: 0 10px 10px 0;} '''
@ -13,8 +13,11 @@ class NikkeiNet_paper_subscription(BasicNewsRecipe):
max_articles_per_feed = 30
language = 'ja'
no_stylesheets = True
cover_url = 'http://parts.nikkei.com/parts/ds/images/common/logo_r1.svg'
masthead_url = 'http://parts.nikkei.com/parts/ds/images/common/logo_r1.svg'
#cover_url = 'http://parts.nikkei.com/parts/ds/images/common/logo_r1.svg'
cover_url = 'http://cdn.nikkei.co.jp/parts/ds/images/common/st_nikkei_r1_20101003_1.gif'
#masthead_url = 'http://parts.nikkei.com/parts/ds/images/common/logo_r1.svg'
masthead_url = 'http://cdn.nikkei.co.jp/parts/ds/images/common/st_nikkei_r1_20101003_1.gif'
cover_margins = (10, 188, '#ffffff')

remove_tags_before = {'class':"cmn-indent"}
remove_tags = [
@ -40,8 +43,11 @@ class NikkeiNet_paper_subscription(BasicNewsRecipe):
print "-------------------------open top page-------------------------------------"
br.open('http://www.nikkei.com/')
print "-------------------------open first login form-----------------------------"
link = br.links(url_regex="www.nikkei.com/etc/accounts/login").next()
br.follow_link(link)
try:
url = br.links(url_regex="www.nikkei.com/etc/accounts/login").next().url
except StopIteration:
url = 'http://www.nikkei.com/etc/accounts/login?dps=3&pageflag=top&url=http%3A%2F%2Fwww.nikkei.com%2F'
br.open(url) #br.follow_link(link)
#response = br.response()
#print response.get_data()
print "-------------------------JS redirect(send autoPostForm)--------------------"
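The replacement login code above takes the first link that matches the login URL regex and, when no such link exists, falls back to a fixed login URL instead of crashing. Stripped of the browser object, the pattern is just an iterator plus a StopIteration handler — a minimal sketch, with a plain iterator standing in for br.links(url_regex=...):

# Sketch of the fallback pattern only; 'links' stands in for br.links(url_regex=...).
links = iter([])  # pretend the login link was not found on the page
try:
    url = next(links)
except StopIteration:
    url = 'http://www.nikkei.com/etc/accounts/login?dps=3&pageflag=top&url=http%3A%2F%2Fwww.nikkei.com%2F'
print(url)  # the hard-coded fallback is used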
@ -15,7 +15,7 @@ class Nin(BasicNewsRecipe):
publisher = 'NIN d.o.o. - Ringier d.o.o.'
category = 'news, politics, Serbia'
no_stylesheets = True
oldest_article = 15
oldest_article = 180
encoding = 'utf-8'
needs_subscription = True
remove_empty_feeds = True
@ -25,7 +25,7 @@ class Nin(BasicNewsRecipe):
use_embedded_content = False
language = 'sr'
publication_type = 'magazine'
masthead_url = 'http://www.nin.co.rs/img/head/logo.jpg'
masthead_url = 'http://www.nin.co.rs/img/logo_print.jpg'
extra_css = """
@font-face {font-family: "sans1";src:url(res:///opt/sony/ebook/FONT/tt0003m_.ttf)}
body{font-family: Verdana, Lucida, sans1, sans-serif}
@ -42,11 +42,11 @@ class Nin(BasicNewsRecipe):
, 'tags' : category
, 'publisher' : publisher
, 'language' : language
, 'linearize_tables': True
}

preprocess_regexps = [
(re.compile(r'</body>.*?<html>', re.DOTALL|re.IGNORECASE),lambda match: '</body>')
,(re.compile(r'</html>.*?</html>', re.DOTALL|re.IGNORECASE),lambda match: '</html>')
(re.compile(r'<div class="standardFont">.*', re.DOTALL|re.IGNORECASE),lambda match: '')
,(re.compile(u'\u0110'), lambda match: u'\u00D0')
]

@ -60,42 +60,21 @@ class Nin(BasicNewsRecipe):
br.submit()
return br

keep_only_tags =[dict(name='td', attrs={'width':'520'})]
remove_tags_before =dict(name='span', attrs={'class':'izjava'})
remove_tags_after =dict(name='html')
remove_tags = [
dict(name=['object','link','iframe','meta','base'])
,dict(attrs={'class':['fb-like','twitter-share-button']})
,dict(attrs={'rel':'nofollow'})
]
remove_tags_before = dict(name='div', attrs={'class':'titleFont'})
remove_tags_after = dict(name='div', attrs={'class':'standardFont'})
remove_tags = [dict(name=['object','link','iframe','meta','base'])]
remove_attributes = ['border','background','height','width','align','valign']

def get_cover_url(self):
cover_url = None
soup = self.index_to_soup(self.INDEX)
for item in soup.findAll('a', href=True):
if item['href'].startswith('/pages/issue.php?id='):
simg = item.find('img')
if simg:
return self.PREFIX + item.img['src']
cover = soup.find('img', attrs={'class':'issueImg'})
if cover:
return self.PREFIX + cover['src']
return cover_url

feeds = [(u'NIN Online', u'http://www.nin.co.rs/misc/rss.php?feed=RSS2.0')]

def preprocess_html(self, soup):
for item in soup.findAll(style=True):
del item['style']
for item in soup.findAll('div'):
if len(item.contents) == 0:
item.extract()
for item in soup.findAll(['td','tr']):
item.name='div'
for item in soup.findAll('img'):
if not item.has_key('alt'):
item['alt'] = 'image'
for tbl in soup.findAll('table'):
img = tbl.find('img')
if img:
img.extract()
tbl.replaceWith(img)
return soup
def print_version(self, url):
return url + '&pf=1'
@ -6,7 +6,6 @@ www.nsfwcorp.com
'''

import urllib
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class NotSafeForWork(BasicNewsRecipe):
@ -21,8 +20,9 @@ class NotSafeForWork(BasicNewsRecipe):
needs_subscription = True
auto_cleanup = False
INDEX = 'https://www.nsfwcorp.com'
LOGIN = INDEX + '/login'
use_embedded_content = False
LOGIN = INDEX + '/login/target/'
SETTINGS = INDEX + '/settings/'
use_embedded_content = True
language = 'en'
publication_type = 'magazine'
masthead_url = 'http://assets.nsfwcorp.com/media/headers/nsfw_banner.jpg'
@ -46,15 +46,6 @@ class NotSafeForWork(BasicNewsRecipe):
, 'language' : language
}

remove_tags_before = dict(attrs={'id':'fromToLine'})
remove_tags_after = dict(attrs={'id':'unlockButtonDiv'})
remove_tags=[
dict(name=['meta', 'link', 'iframe', 'embed', 'object'])
,dict(name='a', attrs={'class':'switchToDeskNotes'})
,dict(attrs={'id':'unlockButtonDiv'})
]
remove_attributes = ['lang']

def get_browser(self):
br = BasicNewsRecipe.get_browser()
br.open(self.LOGIN)
@ -65,30 +56,12 @@ class NotSafeForWork(BasicNewsRecipe):
br.open(self.LOGIN, data)
return br

def parse_index(self):
articles = []
soup = self.index_to_soup(self.INDEX)
dispatches = soup.find(attrs={'id':'dispatches'})
if dispatches:
for item in dispatches.findAll('h3'):
description = u''
title_link = item.find('span', attrs={'class':'dispatchTitle'})
description_link = item.find('span', attrs={'class':'dispatchSubtitle'})
feed_link = item.find('a', href=True)
if feed_link:
url = self.INDEX + feed_link['href']
title = self.tag_to_string(title_link)
description = self.tag_to_string(description_link)
date = strftime(self.timefmt)
articles.append({
'title' :title
,'date' :date
,'url' :url
,'description':description
})
return [('Dispatches', articles)]
def get_feeds(self):
self.feeds = []
soup = self.index_to_soup(self.SETTINGS)
for item in soup.findAll('input', attrs={'type':'text'}):
if item.has_key('value') and item['value'].startswith('http://www.nsfwcorp.com/feed/'):
self.feeds.append(item['value'])
return self.feeds
return self.feeds

def preprocess_html(self, soup):
for item in soup.findAll(style=True):
del item['style']
return soup
@ -6,22 +6,41 @@ __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
nytimes.com
'''
import re, string, time
from calibre import entity_to_unicode, strftime
from calibre import strftime
from datetime import timedelta, date
from time import sleep
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup


class NYTimes(BasicNewsRecipe):

recursions=1 # set this to zero to omit Related articles lists

# set getTechBlogs to True to include the technology blogs
# set tech_oldest_article to control article age
# set tech_max_articles_per_feed to control article count
getTechBlogs = True
remove_empty_feeds = True
tech_oldest_article = 14
tech_max_articles_per_feed = 25

# set headlinesOnly to True for the headlines-only version. If True, webEdition is ignored.
headlinesOnly = True

# set webEdition to True for the Web edition of the newspaper. Set oldest_article to the
# number of days old an article can be for inclusion. If oldest_article = 0 all articles
# will be included. Note: oldest_article is ignored if webEdition = False
# set webEdition to True for the Web edition of the newspaper. Set oldest_web_article to the
# number of days old an article can be for inclusion. If oldest_web_article = None all articles
# will be included. Note: oldest_web_article is ignored if webEdition = False
webEdition = False
oldest_article = 7
oldest_web_article = 7

# download higher resolution images than the small thumbnails typically included in the article
# the down side of having large beautiful images is the file size is much larger, on the order of 7MB per paper
useHighResImages = True

# replace paid Kindle Version: the name will be changed to "The New York Times" to cause
# previous paid versions of the new york times to best sent to the back issues folder on the kindle
replaceKindleVersion = False

# includeSections: List of sections to include. If empty, all sections found will be included.
# Otherwise, only the sections named will be included. For example,
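The comment block above ties the web edition to an age cutoff: oldest_web_article = None keeps every article, otherwise anything older than that many days is dropped. In isolation, the cutoff computation (mirroring the earliest_date logic that appears further down in this diff) is simply — a sketch, not the shipped recipe code:

from datetime import date, timedelta

# Sketch of the age cutoff described above; None means "keep everything".
oldest_web_article = 7
if oldest_web_article is None:
    earliest_date = date.today()
else:
    earliest_date = date.today() - timedelta(days=oldest_web_article)
print(earliest_date)  # web-edition articles dated before this are skipped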
@ -82,57 +101,68 @@ class NYTimes(BasicNewsRecipe):
|
||||
('Education',u'education'),
|
||||
('Multimedia',u'multimedia'),
|
||||
(u'Obituaries',u'obituaries'),
|
||||
(u'Sunday Magazine',u'magazine'),
|
||||
(u'Week in Review',u'weekinreview')]
|
||||
(u'Sunday Magazine',u'magazine')
|
||||
]
|
||||
|
||||
tech_feeds = [
|
||||
(u'Tech - Pogues Posts', u'http://pogue.blogs.nytimes.com/feed/'),
|
||||
(u'Tech - Bits', u'http://bits.blogs.nytimes.com/feed/'),
|
||||
(u'Tech - Gadgetwise', u'http://gadgetwise.blogs.nytimes.com/feed/'),
|
||||
(u'Tech - Open', u'http://open.blogs.nytimes.com/feed/')
|
||||
]
|
||||
|
||||
|
||||
if headlinesOnly:
|
||||
title='New York Times Headlines'
|
||||
description = 'Headlines from the New York Times. Needs a subscription from http://www.nytimes.com'
|
||||
needs_subscription = 'optional'
|
||||
description = 'Headlines from the New York Times'
|
||||
needs_subscription = False
|
||||
elif webEdition:
|
||||
title='New York Times (Web)'
|
||||
description = 'New York Times on the Web'
|
||||
needs_subscription = True
|
||||
needs_subscription = False
|
||||
elif replaceKindleVersion:
|
||||
title='The New York Times'
|
||||
description = 'Today\'s New York Times'
|
||||
needs_subscription = False
|
||||
else:
|
||||
title='New York Times'
|
||||
description = 'Today\'s New York Times'
|
||||
needs_subscription = True
|
||||
needs_subscription = False
|
||||
|
||||
|
||||
month_list = ['january','february','march','april','may','june','july','august','september','october','november','december']
|
||||
|
||||
def decode_us_date(self,datestr):
|
||||
udate = datestr.strip().lower().split()
|
||||
def decode_url_date(self,url):
|
||||
urlitems = url.split('/')
|
||||
try:
|
||||
m = self.month_list.index(udate[0])+1
|
||||
d = date(int(urlitems[3]),int(urlitems[4]),int(urlitems[5]))
|
||||
except:
|
||||
return date.today()
|
||||
d = int(udate[1])
|
||||
y = int(udate[2])
|
||||
try:
|
||||
d = date(y,m,d)
|
||||
d = date(int(urlitems[4]),int(urlitems[5]),int(urlitems[6]))
|
||||
except:
|
||||
d = date.today
|
||||
return None
|
||||
return d
|
||||
|
||||
earliest_date = date.today() - timedelta(days=oldest_article)
|
||||
if oldest_web_article is None:
|
||||
earliest_date = date.today()
|
||||
else:
|
||||
earliest_date = date.today() - timedelta(days=oldest_web_article)
|
||||
oldest_article = 365 # by default, a long time ago
|
||||
|
||||
__author__ = 'GRiker/Kovid Goyal/Nick Redding'
|
||||
language = 'en'
|
||||
requires_version = (0, 7, 5)
|
||||
|
||||
encoding = 'utf-8'
|
||||
|
||||
timefmt = ''
|
||||
masthead_url = 'http://graphics8.nytimes.com/images/misc/nytlogo379x64.gif'
|
||||
|
||||
#simultaneous_downloads = 1 # no longer required to deal with ads
|
||||
|
||||
cover_margins = (18,18,'grey99')
|
||||
|
||||
remove_tags_before = dict(id='article')
|
||||
remove_tags_after = dict(id='article')
|
||||
remove_tags = [dict(attrs={'class':[
|
||||
remove_tags = [
|
||||
dict(attrs={'class':[
|
||||
'articleFooter',
|
||||
'articleTools',
|
||||
'columnGroup doubleRule',
|
||||
'columnGroup singleRule',
|
||||
'columnGroup last',
|
||||
'columnGroup last',
|
||||
@ -140,7 +170,6 @@ class NYTimes(BasicNewsRecipe):
|
||||
'dottedLine',
|
||||
'entry-meta',
|
||||
'entry-response module',
|
||||
'icon enlargeThis',
|
||||
'leftNavTabs',
|
||||
'metaFootnote',
|
||||
'module box nav',
|
||||
@ -150,10 +179,44 @@ class NYTimes(BasicNewsRecipe):
|
||||
'relatedSearchesModule',
|
||||
'side_tool',
|
||||
'singleAd',
|
||||
'entry entry-utility', #added for DealBook
|
||||
'entry-tags', #added for DealBook
|
||||
'footer promos clearfix', #added for DealBook
|
||||
'footer links clearfix', #added for DealBook
|
||||
'tabsContainer', #added for other blog downloads
|
||||
'column lastColumn', #added for other blog downloads
|
||||
'pageHeaderWithLabel', #added for other gadgetwise downloads
|
||||
'column two', #added for other blog downloads
|
||||
'column two last', #added for other blog downloads
|
||||
'column three', #added for other blog downloads
|
||||
'column three last', #added for other blog downloads
|
||||
'column four',#added for other blog downloads
|
||||
'column four last',#added for other blog downloads
|
||||
'column last', #added for other blog downloads
|
||||
'entry entry-related',
|
||||
'subNavigation tabContent active', #caucus blog navigation
|
||||
'mediaOverlay slideshow',
|
||||
'wideThumb',
|
||||
'video', #added 02-11-2011
|
||||
'videoHeader',#added 02-11-2011
|
||||
'articleInlineVideoHolder', #added 02-11-2011
|
||||
'assetCompanionAd',
|
||||
re.compile('^subNavigation'),
|
||||
re.compile('^leaderboard'),
|
||||
re.compile('^module'),
|
||||
re.compile('commentCount'),
|
||||
'credit'
|
||||
]}),
|
||||
dict(name='div', attrs={'class':re.compile('toolsList')}), # bits
|
||||
dict(name='div', attrs={'class':re.compile('postNavigation')}), # bits
|
||||
dict(name='div', attrs={'class':'tweet'}),
|
||||
dict(name='span', attrs={'class':'commentCount meta'}),
|
||||
dict(name='div', attrs={'id':'header'}),
|
||||
dict(name='div', attrs={'id':re.compile('commentsContainer')}), # bits, pogue, gadgetwise, open
|
||||
dict(name='ul', attrs={'class':re.compile('entry-tools')}), # pogue, gadgetwise
|
||||
dict(name='div', attrs={'class':re.compile('nocontent')}), # pogue, gadgetwise
|
||||
dict(name='div', attrs={'id':re.compile('respond')}), # open
|
||||
dict(name='div', attrs={'class':re.compile('entry-tags')}), # pogue
|
||||
dict(id=[
|
||||
'adxLeaderboard',
|
||||
'adxSponLink',
|
||||
@ -183,22 +246,29 @@ class NYTimes(BasicNewsRecipe):
|
||||
'side_index',
|
||||
'side_tool',
|
||||
'toolsRight',
|
||||
'skybox', #added for DealBook
|
||||
'TopAd', #added for DealBook
|
||||
'related-content', #added for DealBook
|
||||
]),
|
||||
dict(name=['script', 'noscript', 'style','form','hr'])]
|
||||
no_stylesheets = True
|
||||
extra_css = '''
|
||||
.articleHeadline { text-align: left; margin-top:0.5em; margin-bottom:0.25em; }
|
||||
.credit { text-align: right; font-size: small; line-height:1em; margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.byline { text-align: left; font-size: small; line-height:1em; margin-top:10px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.dateline { text-align: left; font-size: small; line-height:1em;margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.kicker { font-size: small; line-height:1em;margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.timestamp { text-align: left; font-size: small; }
|
||||
.caption { font-size: small; font-style:italic; line-height:1em; margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.credit { font-weight: normal; text-align: right; font-size: 50%; line-height:1em; margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.byline { text-align: left; font-size: 50%; line-height:1em; margin-top:10px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.dateline { text-align: left; font-size: 50%; line-height:1em;margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.kicker { font-size: 50%; line-height:1em;margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.timestamp { font-weight: normal; text-align: left; font-size: 50%; }
|
||||
.caption { font-size: 50%; font-style:italic; line-height:1em; margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
a:link {text-decoration: none; }
|
||||
.date{font-size: 50%; }
|
||||
.update{font-size: 50%; }
|
||||
.articleBody { }
|
||||
.authorId {text-align: left; }
|
||||
.authorId {text-align: left; font-size: 50%; }
|
||||
.image {text-align: center;}
|
||||
.source {text-align: left; }'''
|
||||
.aside {color:blue;margin:0px 0px 0px 0px; padding: 0px 0px 0px 0px; font-size:100%;}
|
||||
.asidenote {color:blue;margin:0px 0px 0px 0px; padding: 0px 0px 0px 0px; font-size:100%;font-weight:bold;}
|
||||
.source {text-align: left; font-size: x-small; }'''
|
||||
|
||||
|
||||
articles = {}
|
||||
@ -222,11 +292,11 @@ class NYTimes(BasicNewsRecipe):
|
||||
del ans[idx]
|
||||
idx_max = idx_max-1
|
||||
continue
|
||||
if self.verbose:
|
||||
if True: #self.verbose
|
||||
self.log("Section %s: %d articles" % (ans[idx][0], len(ans[idx][1])) )
|
||||
for article in ans[idx][1]:
|
||||
total_article_count += 1
|
||||
if self.verbose:
|
||||
if True: #self.verbose
|
||||
self.log("\t%-40.40s... \t%-60.60s..." % (article['title'].encode('cp1252','replace'),
|
||||
article['url'].encode('cp1252','replace')))
|
||||
idx = idx+1
|
||||
@ -237,7 +307,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
def exclude_url(self,url):
|
||||
if not url.startswith("http"):
|
||||
return True
|
||||
if not url.endswith(".html"):
|
||||
if not url.endswith(".html") and 'dealbook.nytimes.com' not in url: #added for DealBook
|
||||
return True
|
||||
if 'nytimes.com' not in url:
|
||||
return True
|
||||
@ -280,88 +350,76 @@ class NYTimes(BasicNewsRecipe):
|
||||
|
||||
def get_browser(self):
|
||||
br = BasicNewsRecipe.get_browser()
|
||||
if self.username is not None and self.password is not None:
|
||||
br.open('http://www.nytimes.com/auth/login')
|
||||
br.form = br.forms().next()
|
||||
br['userid'] = self.username
|
||||
br['password'] = self.password
|
||||
raw = br.submit().read()
|
||||
if 'Please try again' in raw:
|
||||
raise Exception('Your username and password are incorrect')
|
||||
return br
|
||||
|
||||
def skip_ad_pages(self, soup):
|
||||
# Skip ad pages served before actual article
|
||||
skip_tag = soup.find(True, {'name':'skip'})
|
||||
if skip_tag is not None:
|
||||
self.log.warn("Found forwarding link: %s" % skip_tag.parent['href'])
|
||||
url = 'http://www.nytimes.com' + re.sub(r'\?.*', '', skip_tag.parent['href'])
|
||||
url += '?pagewanted=all'
|
||||
self.log.warn("Skipping ad to article at '%s'" % url)
|
||||
return self.index_to_soup(url, raw=True)
|
||||
|
||||
cover_tag = 'NY_NYT'
|
||||
def get_cover_url(self):
|
||||
cover = None
|
||||
st = time.localtime()
|
||||
year = str(st.tm_year)
|
||||
month = "%.2d" % st.tm_mon
|
||||
day = "%.2d" % st.tm_mday
|
||||
cover = 'http://graphics8.nytimes.com/images/' + year + '/' + month +'/' + day +'/nytfrontpage/scan.jpg'
|
||||
cover = 'http://webmedia.newseum.org/newseum-multimedia/dfp/jpg'+str(date.today().day)+'/lg/'+self.cover_tag+'.jpg'
|
||||
br = BasicNewsRecipe.get_browser()
|
||||
daysback=1
|
||||
try:
|
||||
br.open(cover)
|
||||
except:
|
||||
while daysback<7:
|
||||
cover = 'http://webmedia.newseum.org/newseum-multimedia/dfp/jpg'+str((date.today() - timedelta(days=daysback)).day)+'/lg/'+self.cover_tag+'.jpg'
|
||||
br = BasicNewsRecipe.get_browser()
|
||||
try:
|
||||
br.open(cover)
|
||||
except:
|
||||
daysback = daysback+1
|
||||
continue
|
||||
break
|
||||
if daysback==7:
|
||||
self.log("\nCover unavailable")
|
||||
cover = None
|
||||
return cover
|
||||
|
||||
masthead_url = 'http://graphics8.nytimes.com/images/misc/nytlogo379x64.gif'
|
||||
|
||||
def short_title(self):
|
||||
return self.title
|
||||
|
||||
def index_to_soup(self, url_or_raw, raw=False):
|
||||
'''
|
||||
OVERRIDE of class method
|
||||
deals with various page encodings between index and articles
|
||||
'''
|
||||
def get_the_soup(docEncoding, url_or_raw, raw=False) :
|
||||
|
||||
def article_to_soup(self, url_or_raw, raw=False):
|
||||
from contextlib import closing
|
||||
import copy
|
||||
from calibre.ebooks.chardet import xml_to_unicode
|
||||
if re.match(r'\w+://', url_or_raw):
|
||||
br = self.clone_browser(self.browser)
|
||||
f = br.open_novisit(url_or_raw)
|
||||
open_func = getattr(br, 'open_novisit', br.open)
|
||||
with closing(open_func(url_or_raw)) as f:
|
||||
_raw = f.read()
|
||||
f.close()
|
||||
if not _raw:
|
||||
raise RuntimeError('Could not fetch index from %s'%url_or_raw)
|
||||
else:
|
||||
_raw = url_or_raw
|
||||
if raw:
|
||||
return _raw
|
||||
|
||||
if not isinstance(_raw, unicode) and self.encoding:
|
||||
_raw = _raw.decode(docEncoding, 'replace')
|
||||
massage = list(BeautifulSoup.MARKUP_MASSAGE)
|
||||
massage.append((re.compile(r'&(\S+?);'), lambda match: entity_to_unicode(match, encoding=self.encoding)))
|
||||
return BeautifulSoup(_raw, markupMassage=massage)
|
||||
if callable(self.encoding):
|
||||
_raw = self.encoding(_raw)
|
||||
else:
|
||||
_raw = _raw.decode(self.encoding, 'replace')
|
||||
|
||||
# Entry point
|
||||
soup = get_the_soup( self.encoding, url_or_raw )
|
||||
contentType = soup.find(True,attrs={'http-equiv':'Content-Type'})
|
||||
docEncoding = str(contentType)[str(contentType).find('charset=') + len('charset='):str(contentType).rfind('"')]
|
||||
if docEncoding == '' :
|
||||
docEncoding = self.encoding
|
||||
nmassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
|
||||
nmassage.extend(self.preprocess_regexps)
|
||||
nmassage += [(re.compile(r'<!DOCTYPE .+?>', re.DOTALL), lambda m: '')]
|
||||
# Some websites have buggy doctype declarations that mess up beautifulsoup
|
||||
# Remove comments as they can leave detritus when extracting tags leaves
|
||||
# multiple nested comments
|
||||
nmassage.append((re.compile(r'<!--.*?-->', re.DOTALL), lambda m: ''))
|
||||
usrc = xml_to_unicode(_raw, self.verbose, strip_encoding_pats=True)[0]
|
||||
usrc = self.preprocess_raw_html(usrc, url_or_raw)
|
||||
return BeautifulSoup(usrc, markupMassage=nmassage)
|
||||
|
||||
if self.verbose > 2:
|
||||
self.log( " document encoding: '%s'" % docEncoding)
|
||||
if docEncoding != self.encoding :
|
||||
soup = get_the_soup(docEncoding, url_or_raw)
|
||||
|
||||
return soup
|
||||
|
||||
def massageNCXText(self, description):
|
||||
# Kindle TOC descriptions won't render certain characters
|
||||
if description:
|
||||
massaged = unicode(BeautifulStoneSoup(description, convertEntities=BeautifulStoneSoup.HTML_ENTITIES))
|
||||
# Replace '&' with '&'
|
||||
massaged = re.sub("&","&", massaged)
|
||||
massaged = re.sub("&","&", massaged)
|
||||
massaged = re.sub("&","&", massaged)
|
||||
return self.fixChars(massaged)
|
||||
else:
|
||||
return description
|
||||
@ -383,6 +441,16 @@ class NYTimes(BasicNewsRecipe):
|
||||
if self.filterDuplicates:
|
||||
if url in self.url_list:
|
||||
return
|
||||
if self.webEdition:
|
||||
date_tag = self.decode_url_date(url)
|
||||
if date_tag is not None:
|
||||
if self.oldest_web_article is not None:
|
||||
if date_tag < self.earliest_date:
|
||||
self.log("Skipping article %s" % url)
|
||||
return
|
||||
else:
|
||||
self.log("Skipping article %s" % url)
|
||||
return
|
||||
self.url_list.append(url)
|
||||
title = self.tag_to_string(a, use_alt=True).strip()
|
||||
description = ''
|
||||
@ -407,6 +475,31 @@ class NYTimes(BasicNewsRecipe):
|
||||
description=description, author=author,
|
||||
content=''))
|
||||
|
||||
def get_tech_feeds(self,ans):
|
||||
if self.getTechBlogs:
|
||||
tech_articles = {}
|
||||
key_list = []
|
||||
save_oldest_article = self.oldest_article
|
||||
save_max_articles_per_feed = self.max_articles_per_feed
|
||||
self.oldest_article = self.tech_oldest_article
|
||||
self.max_articles_per_feed = self.tech_max_articles_per_feed
|
||||
self.feeds = self.tech_feeds
|
||||
tech = self.parse_feeds()
|
||||
self.oldest_article = save_oldest_article
|
||||
self.max_articles_per_feed = save_max_articles_per_feed
|
||||
self.feeds = None
|
||||
for f in tech:
|
||||
key_list.append(f.title)
|
||||
tech_articles[f.title] = []
|
||||
for a in f.articles:
|
||||
tech_articles[f.title].append(
|
||||
dict(title=a.title, url=a.url, date=a.date,
|
||||
description=a.summary, author=a.author,
|
||||
content=a.content))
|
||||
tech_ans = [(k, tech_articles[k]) for k in key_list if tech_articles.has_key(k)]
|
||||
for x in tech_ans:
|
||||
ans.append(x)
|
||||
return ans
|
||||
|
||||
def parse_web_edition(self):
|
||||
|
||||
@ -418,31 +511,41 @@ class NYTimes(BasicNewsRecipe):
|
||||
if sec_title in self.excludeSections:
|
||||
print "SECTION EXCLUDED: ",sec_title
|
||||
continue
|
||||
print 'Index URL: '+'http://www.nytimes.com/pages/'+index_url+'/index.html'
|
||||
try:
|
||||
soup = self.index_to_soup('http://www.nytimes.com/pages/'+index_url+'/index.html')
|
||||
except:
|
||||
continue
|
||||
print 'Index URL: '+'http://www.nytimes.com/pages/'+index_url+'/index.html'
|
||||
|
||||
self.key = sec_title
|
||||
# Find each article
|
||||
for div in soup.findAll(True,
|
||||
attrs={'class':['section-headline', 'story', 'story headline','sectionHeader','headlinesOnly multiline flush']}):
|
||||
if div['class'] in ['story', 'story headline'] :
|
||||
attrs={'class':['section-headline', 'ledeStory', 'story', 'story headline','sectionHeader','headlinesOnly multiline flush']}):
|
||||
if div['class'] in ['story', 'story headline', 'storyHeader'] :
|
||||
self.handle_article(div)
|
||||
elif div['class'] == 'ledeStory':
|
||||
divsub = div.find('div','storyHeader')
|
||||
if divsub is not None:
|
||||
self.handle_article(divsub)
|
||||
ulrefer = div.find('ul','refer')
|
||||
if ulrefer is not None:
|
||||
for lidiv in ulrefer.findAll('li'):
|
||||
self.handle_article(lidiv)
|
||||
elif div['class'] == 'headlinesOnly multiline flush':
|
||||
for lidiv in div.findAll('li'):
|
||||
self.handle_article(lidiv)
|
||||
|
||||
self.ans = [(k, self.articles[k]) for k in self.ans if self.articles.has_key(k)]
|
||||
return self.filter_ans(self.ans)
|
||||
return self.filter_ans(self.get_tech_feeds(self.ans))
|
||||
|
||||
|
||||
def parse_todays_index(self):
|
||||
|
||||
soup = self.index_to_soup('http://www.nytimes.com/pages/todayspaper/index.html')
|
||||
|
||||
skipping = False
|
||||
# Find each article
|
||||
for div in soup.findAll(True,
|
||||
attrs={'class':['section-headline', 'story', 'story headline','sectionHeader','headlinesOnly multiline flush']}):
|
||||
|
||||
if div['class'] in ['section-headline','sectionHeader']:
|
||||
self.key = string.capwords(self.feed_title(div))
|
||||
self.key = self.key.replace('Op-ed','Op-Ed')
|
||||
@ -466,7 +569,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
self.handle_article(lidiv)
|
||||
|
||||
self.ans = [(k, self.articles[k]) for k in self.ans if self.articles.has_key(k)]
|
||||
return self.filter_ans(self.ans)
|
||||
return self.filter_ans(self.get_tech_feeds(self.ans))
|
||||
|
||||
def parse_headline_index(self):
|
||||
|
||||
@ -514,7 +617,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
for h3_item in search_div.findAll('h3'):
|
||||
byline = h3_item.h6
|
||||
if byline is not None:
|
||||
author = self.tag_to_string(byline,usa_alt=False)
|
||||
author = self.tag_to_string(byline,use_alt=False)
|
||||
else:
|
||||
author = ''
|
||||
a = h3_item.find('a', href=True)
|
||||
@ -540,7 +643,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
self.articles[section_name].append(dict(title=title, url=url, date=pubdate, description=description, author=author, content=''))
|
||||
|
||||
self.ans = [(k, self.articles[k]) for k in self.ans if self.articles.has_key(k)]
|
||||
return self.filter_ans(self.ans)
|
||||
return self.filter_ans(self.get_tech_feeds(self.ans))
|
||||
|
||||
def parse_index(self):
|
||||
if self.headlinesOnly:
|
||||
@ -550,32 +653,191 @@ class NYTimes(BasicNewsRecipe):
|
||||
else:
|
||||
return self.parse_todays_index()
|
||||
|
||||
def strip_anchors(self,soup):
|
||||
def strip_anchors(self,soup,kill_all=False):
|
||||
paras = soup.findAll(True)
|
||||
for para in paras:
|
||||
aTags = para.findAll('a')
|
||||
for a in aTags:
|
||||
if a.img is None:
|
||||
a.replaceWith(a.renderContents().decode('cp1252','replace'))
|
||||
if kill_all or (self.recursions==0):
|
||||
a.replaceWith(self.tag_to_string(a,False))
|
||||
else:
|
||||
if a.has_key('href'):
|
||||
if a['href'].startswith('http://www.nytimes'):
|
||||
if not a['href'].endswith('pagewanted=all'):
|
||||
url = re.sub(r'\?.*', '', a['href'])
|
||||
if self.exclude_url(url):
|
||||
a.replaceWith(self.tag_to_string(a,False))
|
||||
else:
|
||||
a['href'] = url+'?pagewanted=all'
|
||||
elif not (a['href'].startswith('http://pogue') or \
|
||||
a['href'].startswith('http://bits') or \
|
||||
a['href'].startswith('http://travel') or \
|
||||
a['href'].startswith('http://business') or \
|
||||
a['href'].startswith('http://tech') or \
|
||||
a['href'].startswith('http://health') or \
|
||||
a['href'].startswith('http://dealbook') or \
|
||||
a['href'].startswith('http://open')):
|
||||
a.replaceWith(self.tag_to_string(a,False))
|
||||
return soup
|
||||
|
||||
def handle_tags(self,soup):
|
||||
try:
|
||||
print("HANDLE TAGS: TITLE = "+self.tag_to_string(soup.title))
|
||||
except:
|
||||
print("HANDLE TAGS: NO TITLE")
|
||||
if soup is None:
|
||||
print("ERROR: handle_tags received NoneType")
|
||||
return None
|
||||
|
||||
## print("HANDLING AD FORWARD:")
|
||||
## print(soup)
|
||||
if self.keep_only_tags:
|
||||
body = Tag(soup, 'body')
|
||||
try:
|
||||
if isinstance(self.keep_only_tags, dict):
|
||||
self.keep_only_tags = [self.keep_only_tags]
|
||||
for spec in self.keep_only_tags:
|
||||
for tag in soup.find('body').findAll(**spec):
|
||||
body.insert(len(body.contents), tag)
|
||||
soup.find('body').replaceWith(body)
|
||||
except AttributeError: # soup has no body element
|
||||
pass
|
||||
|
||||
def remove_beyond(tag, next):
|
||||
while tag is not None and getattr(tag, 'name', None) != 'body':
|
||||
after = getattr(tag, next)
|
||||
while after is not None:
|
||||
ns = getattr(tag, next)
|
||||
after.extract()
|
||||
after = ns
|
||||
tag = tag.parent
|
||||
|
||||
if self.remove_tags_after is not None:
|
||||
rt = [self.remove_tags_after] if isinstance(self.remove_tags_after, dict) else self.remove_tags_after
|
||||
for spec in rt:
|
||||
tag = soup.find(**spec)
|
||||
remove_beyond(tag, 'nextSibling')
|
||||
|
||||
if self.remove_tags_before is not None:
|
||||
tag = soup.find(**self.remove_tags_before)
|
||||
remove_beyond(tag, 'previousSibling')
|
||||
|
||||
for kwds in self.remove_tags:
|
||||
for tag in soup.findAll(**kwds):
|
||||
tag.extract()
|
||||
|
||||
return soup
|
||||
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
#print("PREPROCESS TITLE="+self.tag_to_string(soup.title))
|
||||
skip_tag = soup.find(True, {'name':'skip'})
|
||||
if skip_tag is not None:
|
||||
#url = 'http://www.nytimes.com' + re.sub(r'\?.*', '', skip_tag.parent['href'])
|
||||
url = 'http://www.nytimes.com' + skip_tag.parent['href']
|
||||
#url += '?pagewanted=all'
|
||||
self.log.warn("Skipping ad to article at '%s'" % url)
|
||||
sleep(5)
|
||||
soup = self.handle_tags(self.article_to_soup(url))
|
||||
|
||||
if self.webEdition & (self.oldest_article>0):
|
||||
date_tag = soup.find(True,attrs={'class': ['dateline','date']})
|
||||
if date_tag:
|
||||
date_str = self.tag_to_string(date_tag,use_alt=False)
|
||||
date_str = date_str.replace('Published:','')
|
||||
date_items = date_str.split(',')
|
||||
try:
|
||||
datestring = date_items[0]+' '+date_items[1]
|
||||
article_date = self.decode_us_date(datestring)
|
||||
except:
|
||||
article_date = date.today()
|
||||
if article_date < self.earliest_date:
|
||||
self.log("Skipping article dated %s" % date_str)
|
||||
return None
|
||||
# check if the article is from one of the tech blogs
|
||||
blog=soup.find('div',attrs={'id':['pogue','bits','gadgetwise','open']})
|
||||
|
||||
if blog is not None:
|
||||
old_body = soup.find('body')
|
||||
new_body=Tag(soup,'body')
|
||||
new_body.append(soup.find('div',attrs={'id':'content'}))
|
||||
new_body.find('div',attrs={'id':'content'})['id']='blogcontent' # identify for postprocess_html
|
||||
old_body.replaceWith(new_body)
|
||||
for divr in soup.findAll('div',attrs={'class':re.compile('w190 right')}):
|
||||
if divr.find(text=re.compile('Sign up')):
|
||||
divr.extract()
|
||||
divr = soup.find('div',attrs={'id':re.compile('related-content')})
|
||||
if divr is not None:
|
||||
# handle related articles
|
||||
rlist = []
|
||||
ul = divr.find('ul')
|
||||
if ul is not None:
|
||||
for li in ul.findAll('li'):
|
||||
atag = li.find('a')
|
||||
if atag is not None:
|
||||
if atag['href'].startswith('http://pogue') or atag['href'].startswith('http://bits') or \
|
||||
atag['href'].startswith('http://open'):
|
||||
atag.find(text=True).replaceWith(self.massageNCXText(self.tag_to_string(atag,False)))
|
||||
rlist.append(atag)
|
||||
divr.extract()
|
||||
if rlist != []:
|
||||
asidediv = Tag(soup,'div',[('class','aside')])
|
||||
if soup.find('hr') is None:
|
||||
asidediv.append(Tag(soup,'hr'))
|
||||
h4 = Tag(soup,'h4',[('class','asidenote')])
|
||||
h4.insert(0,"Related Posts")
|
||||
asidediv.append(h4)
|
||||
ul = Tag(soup,'ul')
|
||||
for r in rlist:
|
||||
li = Tag(soup,'li',[('class','aside')])
|
||||
r['class'] = 'aside'
|
||||
li.append(r)
|
||||
ul.append(li)
|
||||
asidediv.append(ul)
|
||||
asidediv.append(Tag(soup,'hr'))
|
||||
smain = soup.find('body')
|
||||
smain.append(asidediv)
|
||||
for atag in soup.findAll('a'):
|
||||
img = atag.find('img')
|
||||
if img is not None:
|
||||
atag.replaceWith(img)
|
||||
elif not atag.has_key('href'):
|
||||
atag.replaceWith(atag.renderContents().decode('cp1252','replace'))
|
||||
elif not (atag['href'].startswith('http://www.nytimes') or atag['href'].startswith('http://pogue') or \
|
||||
atag['href'].startswith('http://bits') or atag['href'].startswith('http://open')):
|
||||
atag.replaceWith(atag.renderContents().decode('cp1252','replace'))
|
||||
hdr = soup.find('address')
|
||||
if hdr is not None:
|
||||
hdr.name='span'
|
||||
for span_credit in soup.findAll('span','credit'):
|
||||
sp = Tag(soup,'span')
|
||||
span_credit.replaceWith(sp)
|
||||
sp.append(Tag(soup,'br'))
|
||||
sp.append(span_credit)
|
||||
sp.append(Tag(soup,'br'))
|
||||
|
||||
else: # nytimes article
|
||||
|
||||
related = [] # these will be the related articles
|
||||
first_outer = None # first related outer tag
|
||||
first_related = None # first related tag
|
||||
for outerdiv in soup.findAll(attrs={'class': re.compile('articleInline runaroundLeft')}):
|
||||
for rdiv in soup.findAll('div','columnGroup doubleRule'):
|
||||
if rdiv.find('h3') is not None:
|
||||
if self.tag_to_string(rdiv.h3,False).startswith('Related'):
|
||||
rdiv.h3.find(text=True).replaceWith("Related articles")
|
||||
rdiv.h3['class'] = 'asidenote'
|
||||
for litag in rdiv.findAll('li'):
|
||||
if litag.find('a') is not None:
|
||||
if litag.find('a')['href'].startswith('http://www.nytimes.com'):
|
||||
url = re.sub(r'\?.*', '', litag.find('a')['href'])
|
||||
litag.find('a')['href'] = url+'?pagewanted=all'
|
||||
litag.extract()
|
||||
related.append(litag)
|
||||
if first_related is None:
|
||||
first_related = rdiv
|
||||
first_outer = outerdiv
|
||||
else:
|
||||
litag.extract()
|
||||
if related != []:
|
||||
for r in related:
|
||||
if r.h6: # don't want the anchor inside a h6 tag
|
||||
r.h6.replaceWith(r.h6.a)
|
||||
first_related.ul.append(r)
|
||||
first_related.insert(0,Tag(soup,'hr'))
|
||||
first_related.append(Tag(soup,'hr'))
|
||||
first_related['class'] = 'aside'
|
||||
first_outer.replaceWith(first_related) # replace the outer tag with the related tag
|
||||
|
||||
for rdiv in soup.findAll(attrs={'class': re.compile('articleInline runaroundLeft')}):
|
||||
rdiv.extract()
|
||||
|
||||
kicker_tag = soup.find(attrs={'class':'kicker'})
|
||||
if kicker_tag: # remove Op_Ed author head shots
|
||||
@ -584,9 +846,77 @@ class NYTimes(BasicNewsRecipe):
|
||||
img_div = soup.find('div','inlineImage module')
|
||||
if img_div:
|
||||
img_div.extract()
|
||||
return self.strip_anchors(soup)
|
||||
|
||||
def postprocess_html(self,soup, True):
|
||||
if self.useHighResImages:
|
||||
try:
|
||||
#open up all the "Enlarge this Image" pop-ups and download the full resolution jpegs
|
||||
enlargeThisList = soup.findAll('div',{'class':'icon enlargeThis'})
|
||||
if enlargeThisList:
|
||||
for popupref in enlargeThisList:
|
||||
popupreflink = popupref.find('a')
|
||||
if popupreflink:
|
||||
reflinkstring = str(popupreflink['href'])
|
||||
refstart = reflinkstring.find("javascript:pop_me_up2('") + len("javascript:pop_me_up2('")
|
||||
refend = reflinkstring.find(".html", refstart) + len(".html")
|
||||
reflinkstring = reflinkstring[refstart:refend]
|
||||
|
||||
popuppage = self.browser.open(reflinkstring)
|
||||
popuphtml = popuppage.read()
|
||||
popuppage.close()
|
||||
if popuphtml:
|
||||
st = time.localtime()
|
||||
year = str(st.tm_year)
|
||||
month = "%.2d" % st.tm_mon
|
||||
day = "%.2d" % st.tm_mday
|
||||
imgstartpos = popuphtml.find('http://graphics8.nytimes.com/images/' + year + '/' + month +'/' + day +'/') + len('http://graphics8.nytimes.com/images/' + year + '/' + month +'/' + day +'/')
|
||||
highResImageLink = 'http://graphics8.nytimes.com/images/' + year + '/' + month +'/' + day +'/' + popuphtml[imgstartpos:popuphtml.find('.jpg',imgstartpos)+4]
|
||||
popupSoup = BeautifulSoup(popuphtml)
|
||||
highResTag = popupSoup.find('img', {'src':highResImageLink})
|
||||
if highResTag:
|
||||
try:
|
||||
newWidth = highResTag['width']
|
||||
newHeight = highResTag['height']
|
||||
imageTag = popupref.parent.find("img")
|
||||
except:
|
||||
self.log("Error: finding width and height of img")
|
||||
popupref.extract()
|
||||
if imageTag:
|
||||
try:
|
||||
imageTag['src'] = highResImageLink
|
||||
imageTag['width'] = newWidth
|
||||
imageTag['height'] = newHeight
|
||||
except:
|
||||
self.log("Error setting the src width and height parameters")
|
||||
except Exception:
|
||||
self.log("Error pulling high resolution images")
|
||||
|
||||
try:
|
||||
#in case pulling images failed, delete the enlarge this text
|
||||
enlargeThisList = soup.findAll('div',{'class':'icon enlargeThis'})
|
||||
if enlargeThisList:
|
||||
for popupref in enlargeThisList:
|
||||
popupref.extract()
|
||||
except:
|
||||
self.log("Error removing Enlarge this text")
|
||||
|
||||
|
||||
return self.strip_anchors(soup,False)
|
||||
|
||||
def postprocess_html(self,soup,first_fetch):
|
||||
if not first_fetch: # remove Related links
|
||||
for aside in soup.findAll('div','aside'):
|
||||
aside.extract()
|
||||
soup = self.strip_anchors(soup,True)
|
||||
|
||||
if soup.find('div',attrs={'id':'blogcontent'}) is None:
|
||||
if first_fetch:
|
||||
aside = soup.find('div','aside')
|
||||
if aside is not None: # move the related list to the end of the article
|
||||
art = soup.find('div',attrs={'id':'article'})
|
||||
if art is None:
|
||||
art = soup.find('div',attrs={'class':'article'})
|
||||
if art is not None:
|
||||
art.append(aside)
|
||||
try:
|
||||
if self.one_picture_per_article:
|
||||
# Remove all images after first
|
||||
@ -642,6 +972,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
try:
|
||||
# Change <nyt_headline> to <h2>
|
||||
h1 = soup.find('h1')
|
||||
blogheadline = str(h1) #added for dealbook
|
||||
if h1:
|
||||
headline = h1.find("nyt_headline")
|
||||
if headline:
|
||||
@ -649,13 +980,19 @@ class NYTimes(BasicNewsRecipe):
|
||||
tag['class'] = "headline"
|
||||
tag.insert(0, self.fixChars(headline.contents[0]))
|
||||
h1.replaceWith(tag)
|
||||
elif blogheadline.find('entry-title'):#added for dealbook
|
||||
tag = Tag(soup, "h2")#added for dealbook
|
||||
tag['class'] = "headline"#added for dealbook
|
||||
tag.insert(0, self.fixChars(h1.contents[0]))#added for dealbook
|
||||
h1.replaceWith(tag)#added for dealbook
|
||||
|
||||
else:
|
||||
# Blog entry - replace headline, remove <hr> tags
|
||||
# Blog entry - replace headline, remove <hr> tags - BCC I think this is no longer functional 1-18-2011
|
||||
headline = soup.find('title')
|
||||
if headline:
|
||||
tag = Tag(soup, "h2")
|
||||
tag['class'] = "headline"
|
||||
tag.insert(0, self.fixChars(headline.contents[0]))
|
||||
tag.insert(0, self.fixChars(self.tag_to_string(headline,False)))
|
||||
soup.insert(0, tag)
|
||||
hrs = soup.findAll('hr')
|
||||
for hr in hrs:
|
||||
@ -663,6 +1000,29 @@ class NYTimes(BasicNewsRecipe):
|
||||
except:
|
||||
self.log("ERROR: Problem in Change <nyt_headline> to <h2>")
|
||||
|
||||
try:
|
||||
#if this is from a blog (dealbook, fix the byline format
|
||||
bylineauthor = soup.find('address',attrs={'class':'byline author vcard'})
|
||||
if bylineauthor:
|
||||
tag = Tag(soup, "h6")
|
||||
tag['class'] = "byline"
|
||||
tag.insert(0, self.fixChars(self.tag_to_string(bylineauthor,False)))
|
||||
bylineauthor.replaceWith(tag)
|
||||
except:
|
||||
self.log("ERROR: fixing byline author format")
|
||||
|
||||
try:
|
||||
#if this is a blog (dealbook) fix the credit style for the pictures
|
||||
blogcredit = soup.find('div',attrs={'class':'credit'})
|
||||
if blogcredit:
|
||||
tag = Tag(soup, "h6")
|
||||
tag['class'] = "credit"
|
||||
tag.insert(0, self.fixChars(self.tag_to_string(blogcredit,False)))
|
||||
blogcredit.replaceWith(tag)
|
||||
except:
|
||||
self.log("ERROR: fixing credit format")
|
||||
|
||||
|
||||
try:
|
||||
# Change <h1> to <h3> - used in editorial blogs
|
||||
masthead = soup.find("h1")
|
||||
@ -685,6 +1045,13 @@ class NYTimes(BasicNewsRecipe):
|
||||
subhead.replaceWith(bTag)
|
||||
except:
|
||||
self.log("ERROR: Problem in Change <h1> to <h3> - used in editorial blogs")
|
||||
try:
|
||||
#remove the <strong> update tag
|
||||
blogupdated = soup.find('span', {'class':'update'})
|
||||
if blogupdated:
|
||||
blogupdated.replaceWith("")
|
||||
except:
|
||||
self.log("ERROR: Removing strong tag")
|
||||
|
||||
try:
|
||||
divTag = soup.find('div',attrs={'id':'articleBody'})
|
||||
@ -708,16 +1075,16 @@ class NYTimes(BasicNewsRecipe):
|
||||
return soup
|
||||
|
||||
def populate_article_metadata(self, article, soup, first):
|
||||
if first and hasattr(self, 'add_toc_thumbnail'):
|
||||
if not first:
|
||||
return
|
||||
idxdiv = soup.find('div',attrs={'class':'articleSpanImage'})
|
||||
if idxdiv is not None:
|
||||
if idxdiv.img:
|
||||
self.add_toc_thumbnail(article, idxdiv.img['src'])
|
||||
self.add_toc_thumbnail(article, re.sub(r'links\\link\d+\\','',idxdiv.img['src']))
|
||||
else:
|
||||
img = soup.find('img')
|
||||
img = soup.find('body').find('img')
|
||||
if img is not None:
|
||||
self.add_toc_thumbnail(article, img['src'])
|
||||
|
||||
self.add_toc_thumbnail(article, re.sub(r'links\\link\d+\\','',img['src']))
|
||||
shortparagraph = ""
|
||||
try:
|
||||
if len(article.text_summary.strip()) == 0:
|
||||
@ -731,13 +1098,22 @@ class NYTimes(BasicNewsRecipe):
|
||||
#account for blank paragraphs and short paragraphs by appending them to longer ones
|
||||
if len(refparagraph) > 0:
|
||||
if len(refparagraph) > 70: #approximately one line of text
|
||||
article.summary = article.text_summary = shortparagraph + refparagraph
|
||||
newpara = shortparagraph + refparagraph
|
||||
newparaDateline,newparaEm,newparaDesc = newpara.partition('—')
|
||||
if newparaEm == '':
|
||||
newparaDateline,newparaEm,newparaDesc = newpara.partition('—')
|
||||
if newparaEm == '':
|
||||
newparaDesc = newparaDateline
|
||||
article.summary = article.text_summary = newparaDesc.strip()
|
||||
return
|
||||
else:
|
||||
shortparagraph = refparagraph + " "
|
||||
if shortparagraph.strip().find(" ") == -1 and not shortparagraph.strip().endswith(":"):
|
||||
shortparagraph = shortparagraph + "- "
|
||||
else:
|
||||
article.summary = article.text_summary = self.massageNCXText(article.text_summary)
|
||||
except:
|
||||
self.log("Error creating article descriptions")
|
||||
return
|
||||
|
||||
|
||||
|
@ -6,31 +6,42 @@ __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
nytimes.com
'''
import re, string, time
from calibre import entity_to_unicode, strftime
from calibre import strftime
from datetime import timedelta, date
from time import sleep
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup

class NYTimes(BasicNewsRecipe):

recursions=1 # set this to zero to omit Related articles lists

# set getTechBlogs to True to include the technology blogs
# set tech_oldest_article to control article age
# set tech_max_articles_per_feed to control article count
getTechBlogs = True
remove_empty_feeds = True
tech_oldest_article = 14
tech_max_articles_per_feed = 25

# set headlinesOnly to True for the headlines-only version. If True, webEdition is ignored.
headlinesOnly = False

# set webEdition to True for the Web edition of the newspaper. Set oldest_article to the
# number of days old an article can be for inclusion. If oldest_article = 0 all articles
# will be included. Note: oldest_article is ignored if webEdition = False
# set webEdition to True for the Web edition of the newspaper. Set oldest_web_article to the
# number of days old an article can be for inclusion. If oldest_web_article = None all articles
# will be included. Note: oldest_web_article is ignored if webEdition = False
webEdition = False
oldest_article = 7

# replace paid Kindle Version: the name will be changed to "The New York Times" to cause
# previous paid versions of the new york times to best sent to the back issues folder on the kindle
replaceKindleVersion = False
oldest_web_article = None

# download higher resolution images than the small thumbnails typically included in the article
# the down side of having large beautiful images is the file size is much larger, on the order of 7MB per paper
useHighResImages = True

# replace paid Kindle Version: the name will be changed to "The New York Times" to cause
# previous paid versions of the new york times to best sent to the back issues folder on the kindle
replaceKindleVersion = False

# includeSections: List of sections to include. If empty, all sections found will be included.
# Otherwise, only the sections named will be included. For example,
#
@ -90,60 +101,68 @@ class NYTimes(BasicNewsRecipe):
('Education',u'education'),
('Multimedia',u'multimedia'),
(u'Obituaries',u'obituaries'),
(u'Sunday Magazine',u'magazine'),
(u'Week in Review',u'weekinreview')]
(u'Sunday Magazine',u'magazine')
]

tech_feeds = [
(u'Tech - Pogues Posts', u'http://pogue.blogs.nytimes.com/feed/'),
(u'Tech - Bits', u'http://bits.blogs.nytimes.com/feed/'),
(u'Tech - Gadgetwise', u'http://gadgetwise.blogs.nytimes.com/feed/'),
(u'Tech - Open', u'http://open.blogs.nytimes.com/feed/')
]

if headlinesOnly:
title='New York Times Headlines'
description = 'Headlines from the New York Times'
needs_subscription = True
needs_subscription = False
elif webEdition:
title='New York Times (Web)'
description = 'New York Times on the Web'
needs_subscription = True
needs_subscription = False
elif replaceKindleVersion:
title='The New York Times'
description = 'Today\'s New York Times'
needs_subscription = True
needs_subscription = False
else:
title='New York Times'
description = 'Today\'s New York Times. Needs subscription from http://www.nytimes.com'
needs_subscription = True
description = 'Today\'s New York Times'
needs_subscription = False

month_list = ['january','february','march','april','may','june','july','august','september','october','november','december']
|
||||
|
||||
def decode_us_date(self,datestr):
|
||||
udate = datestr.strip().lower().split()
|
||||
def decode_url_date(self,url):
|
||||
urlitems = url.split('/')
|
||||
try:
|
||||
m = self.month_list.index(udate[0])+1
|
||||
d = date(int(urlitems[3]),int(urlitems[4]),int(urlitems[5]))
|
||||
except:
|
||||
return date.today()
|
||||
d = int(udate[1])
|
||||
y = int(udate[2])
|
||||
try:
|
||||
d = date(y,m,d)
|
||||
d = date(int(urlitems[4]),int(urlitems[5]),int(urlitems[6]))
|
||||
except:
|
||||
d = date.today
|
||||
return None
|
||||
return d
|
||||
|
||||
earliest_date = date.today() - timedelta(days=oldest_article)
|
||||
if oldest_web_article is None:
|
||||
earliest_date = date.today()
|
||||
else:
|
||||
earliest_date = date.today() - timedelta(days=oldest_web_article)
|
||||
oldest_article = 365 # by default, a long time ago
|
||||
|
||||
__author__ = 'GRiker/Kovid Goyal/Nick Redding/Ben Collier'
|
||||
__author__ = 'GRiker/Kovid Goyal/Nick Redding'
|
||||
language = 'en'
|
||||
requires_version = (0, 7, 5)
|
||||
|
||||
encoding = 'utf-8'
|
||||
|
||||
timefmt = ''
|
||||
masthead_url = 'http://graphics8.nytimes.com/images/misc/nytlogo379x64.gif'
|
||||
|
||||
#simultaneous_downloads = 1 # no longer required to deal with ads
|
||||
|
||||
cover_margins = (18,18,'grey99')
|
||||
|
||||
remove_tags_before = dict(id='article')
|
||||
remove_tags_after = dict(id='article')
|
||||
remove_tags = [dict(attrs={'class':[
|
||||
remove_tags = [
|
||||
dict(attrs={'class':[
|
||||
'articleFooter',
|
||||
'articleTools',
|
||||
'columnGroup doubleRule',
|
||||
'columnGroup singleRule',
|
||||
'columnGroup last',
|
||||
'columnGroup last',
|
||||
@ -151,7 +170,6 @@ class NYTimes(BasicNewsRecipe):
|
||||
'dottedLine',
|
||||
'entry-meta',
|
||||
'entry-response module',
|
||||
#'icon enlargeThis', #removed to provide option for high res images
|
||||
'leftNavTabs',
|
||||
'metaFootnote',
|
||||
'module box nav',
|
||||
@ -175,12 +193,9 @@ class NYTimes(BasicNewsRecipe):
|
||||
'column four',#added for other blog downloads
|
||||
'column four last',#added for other blog downloads
|
||||
'column last', #added for other blog downloads
|
||||
'timestamp published', #added for other blog downloads
|
||||
'entry entry-related',
|
||||
'subNavigation tabContent active', #caucus blog navigation
|
||||
'columnGroup doubleRule',
|
||||
'mediaOverlay slideshow',
|
||||
'headlinesOnly multiline flush',
|
||||
'wideThumb',
|
||||
'video', #added 02-11-2011
|
||||
'videoHeader',#added 02-11-2011
|
||||
@ -189,7 +204,19 @@ class NYTimes(BasicNewsRecipe):
|
||||
re.compile('^subNavigation'),
|
||||
re.compile('^leaderboard'),
|
||||
re.compile('^module'),
|
||||
re.compile('commentCount'),
|
||||
'credit'
|
||||
]}),
|
||||
dict(name='div', attrs={'class':re.compile('toolsList')}), # bits
|
||||
dict(name='div', attrs={'class':re.compile('postNavigation')}), # bits
|
||||
dict(name='div', attrs={'class':'tweet'}),
|
||||
dict(name='span', attrs={'class':'commentCount meta'}),
|
||||
dict(name='div', attrs={'id':'header'}),
|
||||
dict(name='div', attrs={'id':re.compile('commentsContainer')}), # bits, pogue, gadgetwise, open
|
||||
dict(name='ul', attrs={'class':re.compile('entry-tools')}), # pogue, gadgetwise
|
||||
dict(name='div', attrs={'class':re.compile('nocontent')}), # pogue, gadgetwise
|
||||
dict(name='div', attrs={'id':re.compile('respond')}), # open
|
||||
dict(name='div', attrs={'class':re.compile('entry-tags')}), # pogue
|
||||
dict(id=[
|
||||
'adxLeaderboard',
|
||||
'adxSponLink',
|
||||
@ -227,17 +254,21 @@ class NYTimes(BasicNewsRecipe):
|
||||
no_stylesheets = True
|
||||
extra_css = '''
|
||||
.articleHeadline { text-align: left; margin-top:0.5em; margin-bottom:0.25em; }
|
||||
.credit { text-align: right; font-size: small; line-height:1em; margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.byline { text-align: left; font-size: small; line-height:1em; margin-top:10px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.dateline { text-align: left; font-size: small; line-height:1em;margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.kicker { font-size: small; line-height:1em;margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.timestamp { text-align: left; font-size: small; }
|
||||
.caption { font-size: small; font-style:italic; line-height:1em; margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.credit { font-weight: normal; text-align: right; font-size: 50%; line-height:1em; margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.byline { text-align: left; font-size: 50%; line-height:1em; margin-top:10px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.dateline { text-align: left; font-size: 50%; line-height:1em;margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.kicker { font-size: 50%; line-height:1em;margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
.timestamp { font-weight: normal; text-align: left; font-size: 50%; }
|
||||
.caption { font-size: 50%; font-style:italic; line-height:1em; margin-top:5px; margin-left:0; margin-right:0; margin-bottom: 0; }
|
||||
a:link {text-decoration: none; }
|
||||
.date{font-size: 50%; }
|
||||
.update{font-size: 50%; }
|
||||
.articleBody { }
|
||||
.authorId {text-align: left; }
|
||||
.authorId {text-align: left; font-size: 50%; }
|
||||
.image {text-align: center;}
|
||||
.source {text-align: left; }'''
|
||||
.aside {color:blue;margin:0px 0px 0px 0px; padding: 0px 0px 0px 0px; font-size:100%;}
|
||||
.asidenote {color:blue;margin:0px 0px 0px 0px; padding: 0px 0px 0px 0px; font-size:100%;font-weight:bold;}
|
||||
.source {text-align: left; font-size: x-small; }'''
|
||||
|
||||
|
||||
articles = {}
|
||||
@ -261,11 +292,11 @@ class NYTimes(BasicNewsRecipe):
|
||||
del ans[idx]
|
||||
idx_max = idx_max-1
|
||||
continue
|
||||
if self.verbose:
|
||||
if True: #self.verbose
|
||||
self.log("Section %s: %d articles" % (ans[idx][0], len(ans[idx][1])) )
|
||||
for article in ans[idx][1]:
|
||||
total_article_count += 1
|
||||
if self.verbose:
|
||||
if True: #self.verbose
|
||||
self.log("\t%-40.40s... \t%-60.60s..." % (article['title'].encode('cp1252','replace'),
|
||||
article['url'].encode('cp1252','replace')))
|
||||
idx = idx+1
|
||||
@ -276,7 +307,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
def exclude_url(self,url):
|
||||
if not url.startswith("http"):
|
||||
return True
|
||||
if not url.endswith(".html") and 'dealbook.nytimes.com' not in url and 'blogs.nytimes.com' not in url: #added for DealBook
|
||||
if not url.endswith(".html") and 'dealbook.nytimes.com' not in url: #added for DealBook
|
||||
return True
|
||||
if 'nytimes.com' not in url:
|
||||
return True
|
||||
@ -319,88 +350,76 @@ class NYTimes(BasicNewsRecipe):
|
||||
|
||||
def get_browser(self):
|
||||
br = BasicNewsRecipe.get_browser()
|
||||
if self.username is not None and self.password is not None:
|
||||
br.open('http://www.nytimes.com/auth/login')
|
||||
br.form = br.forms().next()
|
||||
br['userid'] = self.username
|
||||
br['password'] = self.password
|
||||
raw = br.submit().read()
|
||||
if 'Please try again' in raw:
|
||||
raise Exception('Your username and password are incorrect')
|
||||
return br
|
||||
|
||||
def skip_ad_pages(self, soup):
|
||||
# Skip ad pages served before actual article
|
||||
skip_tag = soup.find(True, {'name':'skip'})
|
||||
if skip_tag is not None:
|
||||
self.log.warn("Found forwarding link: %s" % skip_tag.parent['href'])
|
||||
url = 'http://www.nytimes.com' + re.sub(r'\?.*', '', skip_tag.parent['href'])
|
||||
url += '?pagewanted=all'
|
||||
self.log.warn("Skipping ad to article at '%s'" % url)
|
||||
return self.index_to_soup(url, raw=True)
|
||||
|
||||
cover_tag = 'NY_NYT'
|
||||
def get_cover_url(self):
|
||||
cover = None
|
||||
st = time.localtime()
|
||||
year = str(st.tm_year)
|
||||
month = "%.2d" % st.tm_mon
|
||||
day = "%.2d" % st.tm_mday
|
||||
cover = 'http://graphics8.nytimes.com/images/' + year + '/' + month +'/' + day +'/nytfrontpage/scan.jpg'
|
||||
cover = 'http://webmedia.newseum.org/newseum-multimedia/dfp/jpg'+str(date.today().day)+'/lg/'+self.cover_tag+'.jpg'
|
||||
br = BasicNewsRecipe.get_browser()
|
||||
daysback=1
|
||||
try:
|
||||
br.open(cover)
|
||||
except:
|
||||
while daysback<7:
|
||||
cover = 'http://webmedia.newseum.org/newseum-multimedia/dfp/jpg'+str((date.today() - timedelta(days=daysback)).day)+'/lg/'+self.cover_tag+'.jpg'
|
||||
br = BasicNewsRecipe.get_browser()
|
||||
try:
|
||||
br.open(cover)
|
||||
except:
|
||||
daysback = daysback+1
|
||||
continue
|
||||
break
|
||||
if daysback==7:
|
||||
self.log("\nCover unavailable")
|
||||
cover = None
|
||||
return cover
|
||||
|
||||
masthead_url = 'http://graphics8.nytimes.com/images/misc/nytlogo379x64.gif'
|
||||
|
||||
def short_title(self):
|
||||
return self.title
|
||||
|
||||
def index_to_soup(self, url_or_raw, raw=False):
|
||||
'''
|
||||
OVERRIDE of class method
|
||||
deals with various page encodings between index and articles
|
||||
'''
|
||||
def get_the_soup(docEncoding, url_or_raw, raw=False) :
|
||||
|
||||
def article_to_soup(self, url_or_raw, raw=False):
|
||||
from contextlib import closing
|
||||
import copy
|
||||
from calibre.ebooks.chardet import xml_to_unicode
|
||||
if re.match(r'\w+://', url_or_raw):
|
||||
br = self.clone_browser(self.browser)
|
||||
f = br.open_novisit(url_or_raw)
|
||||
open_func = getattr(br, 'open_novisit', br.open)
|
||||
with closing(open_func(url_or_raw)) as f:
|
||||
_raw = f.read()
|
||||
f.close()
|
||||
if not _raw:
|
||||
raise RuntimeError('Could not fetch index from %s'%url_or_raw)
|
||||
else:
|
||||
_raw = url_or_raw
|
||||
if raw:
|
||||
return _raw
|
||||
|
||||
if not isinstance(_raw, unicode) and self.encoding:
|
||||
_raw = _raw.decode(docEncoding, 'replace')
|
||||
massage = list(BeautifulSoup.MARKUP_MASSAGE)
|
||||
massage.append((re.compile(r'&(\S+?);'), lambda match: entity_to_unicode(match, encoding=self.encoding)))
|
||||
return BeautifulSoup(_raw, markupMassage=massage)
|
||||
if callable(self.encoding):
|
||||
_raw = self.encoding(_raw)
|
||||
else:
|
||||
_raw = _raw.decode(self.encoding, 'replace')
|
||||
|
||||
# Entry point
|
||||
soup = get_the_soup( self.encoding, url_or_raw )
|
||||
contentType = soup.find(True,attrs={'http-equiv':'Content-Type'})
|
||||
docEncoding = str(contentType)[str(contentType).find('charset=') + len('charset='):str(contentType).rfind('"')]
|
||||
if docEncoding == '' :
|
||||
docEncoding = self.encoding
|
||||
nmassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
|
||||
nmassage.extend(self.preprocess_regexps)
|
||||
nmassage += [(re.compile(r'<!DOCTYPE .+?>', re.DOTALL), lambda m: '')]
|
||||
# Some websites have buggy doctype declarations that mess up beautifulsoup
|
||||
# Remove comments as they can leave detritus when extracting tags leaves
|
||||
# multiple nested comments
|
||||
nmassage.append((re.compile(r'<!--.*?-->', re.DOTALL), lambda m: ''))
|
||||
usrc = xml_to_unicode(_raw, self.verbose, strip_encoding_pats=True)[0]
|
||||
usrc = self.preprocess_raw_html(usrc, url_or_raw)
|
||||
return BeautifulSoup(usrc, markupMassage=nmassage)
|
||||
|
||||
if self.verbose > 2:
|
||||
self.log( " document encoding: '%s'" % docEncoding)
|
||||
if docEncoding != self.encoding :
|
||||
soup = get_the_soup(docEncoding, url_or_raw)
|
||||
|
||||
return soup
|
||||
|
||||
def massageNCXText(self, description):
|
||||
# Kindle TOC descriptions won't render certain characters
|
||||
if description:
|
||||
massaged = unicode(BeautifulStoneSoup(description, convertEntities=BeautifulStoneSoup.HTML_ENTITIES))
|
||||
# Replace '&' with '&#38;'
massaged = re.sub("&","&#38;", massaged)
|
||||
return self.fixChars(massaged)
|
||||
else:
|
||||
return description
|
||||
@ -422,6 +441,16 @@ class NYTimes(BasicNewsRecipe):
|
||||
if self.filterDuplicates:
|
||||
if url in self.url_list:
|
||||
return
|
||||
if self.webEdition:
|
||||
date_tag = self.decode_url_date(url)
|
||||
if date_tag is not None:
|
||||
if self.oldest_web_article is not None:
|
||||
if date_tag < self.earliest_date:
|
||||
self.log("Skipping article %s" % url)
|
||||
return
|
||||
else:
|
||||
self.log("Skipping article %s" % url)
|
||||
return
|
||||
self.url_list.append(url)
|
||||
title = self.tag_to_string(a, use_alt=True).strip()
|
||||
description = ''
|
||||
@ -446,6 +475,31 @@ class NYTimes(BasicNewsRecipe):
|
||||
description=description, author=author,
|
||||
content=''))
|
||||
|
||||
def get_tech_feeds(self,ans):
|
||||
if self.getTechBlogs:
|
||||
tech_articles = {}
|
||||
key_list = []
|
||||
save_oldest_article = self.oldest_article
|
||||
save_max_articles_per_feed = self.max_articles_per_feed
|
||||
self.oldest_article = self.tech_oldest_article
|
||||
self.max_articles_per_feed = self.tech_max_articles_per_feed
|
||||
self.feeds = self.tech_feeds
|
||||
tech = self.parse_feeds()
|
||||
self.oldest_article = save_oldest_article
|
||||
self.max_articles_per_feed = save_max_articles_per_feed
|
||||
self.feeds = None
|
||||
for f in tech:
|
||||
key_list.append(f.title)
|
||||
tech_articles[f.title] = []
|
||||
for a in f.articles:
|
||||
tech_articles[f.title].append(
|
||||
dict(title=a.title, url=a.url, date=a.date,
|
||||
description=a.summary, author=a.author,
|
||||
content=a.content))
|
||||
tech_ans = [(k, tech_articles[k]) for k in key_list if tech_articles.has_key(k)]
|
||||
for x in tech_ans:
|
||||
ans.append(x)
|
||||
return ans
|
||||
|
||||
def parse_web_edition(self):
|
||||
|
||||
@ -457,31 +511,41 @@ class NYTimes(BasicNewsRecipe):
|
||||
if sec_title in self.excludeSections:
|
||||
print "SECTION EXCLUDED: ",sec_title
|
||||
continue
|
||||
print 'Index URL: '+'http://www.nytimes.com/pages/'+index_url+'/index.html'
|
||||
try:
|
||||
soup = self.index_to_soup('http://www.nytimes.com/pages/'+index_url+'/index.html')
|
||||
except:
|
||||
continue
|
||||
print 'Index URL: '+'http://www.nytimes.com/pages/'+index_url+'/index.html'
|
||||
|
||||
self.key = sec_title
|
||||
# Find each article
|
||||
for div in soup.findAll(True,
|
||||
attrs={'class':['section-headline', 'story', 'story headline','sectionHeader','headlinesOnly multiline flush']}):
|
||||
if div['class'] in ['story', 'story headline'] :
|
||||
attrs={'class':['section-headline', 'ledeStory', 'story', 'story headline','sectionHeader','headlinesOnly multiline flush']}):
|
||||
if div['class'] in ['story', 'story headline', 'storyHeader'] :
|
||||
self.handle_article(div)
|
||||
elif div['class'] == 'ledeStory':
|
||||
divsub = div.find('div','storyHeader')
|
||||
if divsub is not None:
|
||||
self.handle_article(divsub)
|
||||
ulrefer = div.find('ul','refer')
|
||||
if ulrefer is not None:
|
||||
for lidiv in ulrefer.findAll('li'):
|
||||
self.handle_article(lidiv)
|
||||
elif div['class'] == 'headlinesOnly multiline flush':
|
||||
for lidiv in div.findAll('li'):
|
||||
self.handle_article(lidiv)
|
||||
|
||||
self.ans = [(k, self.articles[k]) for k in self.ans if self.articles.has_key(k)]
|
||||
return self.filter_ans(self.ans)
|
||||
return self.filter_ans(self.get_tech_feeds(self.ans))
|
||||
|
||||
|
||||
def parse_todays_index(self):
|
||||
|
||||
soup = self.index_to_soup('http://www.nytimes.com/pages/todayspaper/index.html')
|
||||
|
||||
skipping = False
|
||||
# Find each article
|
||||
for div in soup.findAll(True,
|
||||
attrs={'class':['section-headline', 'story', 'story headline','sectionHeader','headlinesOnly multiline flush']}):
|
||||
|
||||
if div['class'] in ['section-headline','sectionHeader']:
|
||||
self.key = string.capwords(self.feed_title(div))
|
||||
self.key = self.key.replace('Op-ed','Op-Ed')
|
||||
@ -505,7 +569,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
self.handle_article(lidiv)
|
||||
|
||||
self.ans = [(k, self.articles[k]) for k in self.ans if self.articles.has_key(k)]
|
||||
return self.filter_ans(self.ans)
|
||||
return self.filter_ans(self.get_tech_feeds(self.ans))
|
||||
|
||||
def parse_headline_index(self):
|
||||
|
||||
@ -553,7 +617,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
for h3_item in search_div.findAll('h3'):
|
||||
byline = h3_item.h6
|
||||
if byline is not None:
|
||||
author = self.tag_to_string(byline,usa_alt=False)
|
||||
author = self.tag_to_string(byline,use_alt=False)
|
||||
else:
|
||||
author = ''
|
||||
a = h3_item.find('a', href=True)
|
||||
@ -579,7 +643,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
self.articles[section_name].append(dict(title=title, url=url, date=pubdate, description=description, author=author, content=''))
|
||||
|
||||
self.ans = [(k, self.articles[k]) for k in self.ans if self.articles.has_key(k)]
|
||||
return self.filter_ans(self.ans)
|
||||
return self.filter_ans(self.get_tech_feeds(self.ans))
|
||||
|
||||
def parse_index(self):
|
||||
if self.headlinesOnly:
|
||||
@ -589,40 +653,199 @@ class NYTimes(BasicNewsRecipe):
|
||||
else:
|
||||
return self.parse_todays_index()
|
||||
|
||||
def strip_anchors(self,soup):
|
||||
def strip_anchors(self,soup,kill_all=False):
|
||||
paras = soup.findAll(True)
|
||||
for para in paras:
|
||||
aTags = para.findAll('a')
|
||||
for a in aTags:
|
||||
if a.img is None:
|
||||
a.replaceWith(a.renderContents().decode('cp1252','replace'))
|
||||
if kill_all or (self.recursions==0):
|
||||
a.replaceWith(self.tag_to_string(a,False))
|
||||
else:
|
||||
if a.has_key('href'):
|
||||
if a['href'].startswith('http://www.nytimes'):
|
||||
if not a['href'].endswith('pagewanted=all'):
|
||||
url = re.sub(r'\?.*', '', a['href'])
|
||||
if self.exclude_url(url):
|
||||
a.replaceWith(self.tag_to_string(a,False))
|
||||
else:
|
||||
a['href'] = url+'?pagewanted=all'
|
||||
elif not (a['href'].startswith('http://pogue') or \
|
||||
a['href'].startswith('http://bits') or \
|
||||
a['href'].startswith('http://travel') or \
|
||||
a['href'].startswith('http://business') or \
|
||||
a['href'].startswith('http://tech') or \
|
||||
a['href'].startswith('http://health') or \
|
||||
a['href'].startswith('http://dealbook') or \
|
||||
a['href'].startswith('http://open')):
|
||||
a.replaceWith(self.tag_to_string(a,False))
|
||||
return soup
|
||||
|
||||
def handle_tags(self,soup):
|
||||
try:
|
||||
print("HANDLE TAGS: TITLE = "+self.tag_to_string(soup.title))
|
||||
except:
|
||||
print("HANDLE TAGS: NO TITLE")
|
||||
if soup is None:
|
||||
print("ERROR: handle_tags received NoneType")
|
||||
return None
|
||||
|
||||
## print("HANDLING AD FORWARD:")
|
||||
## print(soup)
|
||||
if self.keep_only_tags:
|
||||
body = Tag(soup, 'body')
|
||||
try:
|
||||
if isinstance(self.keep_only_tags, dict):
|
||||
self.keep_only_tags = [self.keep_only_tags]
|
||||
for spec in self.keep_only_tags:
|
||||
for tag in soup.find('body').findAll(**spec):
|
||||
body.insert(len(body.contents), tag)
|
||||
soup.find('body').replaceWith(body)
|
||||
except AttributeError: # soup has no body element
|
||||
pass
|
||||
|
||||
def remove_beyond(tag, next):
|
||||
while tag is not None and getattr(tag, 'name', None) != 'body':
|
||||
after = getattr(tag, next)
|
||||
while after is not None:
|
||||
ns = getattr(tag, next)
|
||||
after.extract()
|
||||
after = ns
|
||||
tag = tag.parent
|
||||
|
||||
if self.remove_tags_after is not None:
|
||||
rt = [self.remove_tags_after] if isinstance(self.remove_tags_after, dict) else self.remove_tags_after
|
||||
for spec in rt:
|
||||
tag = soup.find(**spec)
|
||||
remove_beyond(tag, 'nextSibling')
|
||||
|
||||
if self.remove_tags_before is not None:
|
||||
tag = soup.find(**self.remove_tags_before)
|
||||
remove_beyond(tag, 'previousSibling')
|
||||
|
||||
for kwds in self.remove_tags:
|
||||
for tag in soup.findAll(**kwds):
|
||||
tag.extract()
|
||||
|
||||
return soup
|
||||
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
if self.webEdition & (self.oldest_article>0):
|
||||
date_tag = soup.find(True,attrs={'class': ['dateline','date']})
|
||||
if date_tag:
|
||||
date_str = self.tag_to_string(date_tag,use_alt=False)
|
||||
date_str = date_str.replace('Published:','')
|
||||
date_items = date_str.split(',')
|
||||
try:
|
||||
datestring = date_items[0]+' '+date_items[1]
|
||||
article_date = self.decode_us_date(datestring)
|
||||
except:
|
||||
article_date = date.today()
|
||||
if article_date < self.earliest_date:
|
||||
self.log("Skipping article dated %s" % date_str)
|
||||
return None
|
||||
#print("PREPROCESS TITLE="+self.tag_to_string(soup.title))
|
||||
skip_tag = soup.find(True, {'name':'skip'})
|
||||
if skip_tag is not None:
|
||||
#url = 'http://www.nytimes.com' + re.sub(r'\?.*', '', skip_tag.parent['href'])
|
||||
url = 'http://www.nytimes.com' + skip_tag.parent['href']
|
||||
#url += '?pagewanted=all'
|
||||
self.log.warn("Skipping ad to article at '%s'" % url)
|
||||
sleep(5)
|
||||
soup = self.handle_tags(self.article_to_soup(url))
|
||||
|
||||
#all articles are from today, no need to print the date on every page
|
||||
try:
|
||||
if not self.webEdition:
|
||||
date_tag = soup.find(True,attrs={'class': ['dateline','date']})
|
||||
if date_tag:
|
||||
date_tag.extract()
|
||||
except:
|
||||
self.log("Error removing the published date")
|
||||
# check if the article is from one of the tech blogs
|
||||
blog=soup.find('div',attrs={'id':['pogue','bits','gadgetwise','open']})
|
||||
|
||||
if blog is not None:
|
||||
old_body = soup.find('body')
|
||||
new_body=Tag(soup,'body')
|
||||
new_body.append(soup.find('div',attrs={'id':'content'}))
|
||||
new_body.find('div',attrs={'id':'content'})['id']='blogcontent' # identify for postprocess_html
|
||||
old_body.replaceWith(new_body)
|
||||
for divr in soup.findAll('div',attrs={'class':re.compile('w190 right')}):
|
||||
if divr.find(text=re.compile('Sign up')):
|
||||
divr.extract()
|
||||
divr = soup.find('div',attrs={'id':re.compile('related-content')})
|
||||
if divr is not None:
|
||||
# handle related articles
|
||||
rlist = []
|
||||
ul = divr.find('ul')
|
||||
if ul is not None:
|
||||
for li in ul.findAll('li'):
|
||||
atag = li.find('a')
|
||||
if atag is not None:
|
||||
if atag['href'].startswith('http://pogue') or atag['href'].startswith('http://bits') or \
|
||||
atag['href'].startswith('http://open'):
|
||||
atag.find(text=True).replaceWith(self.massageNCXText(self.tag_to_string(atag,False)))
|
||||
rlist.append(atag)
|
||||
divr.extract()
|
||||
if rlist != []:
|
||||
asidediv = Tag(soup,'div',[('class','aside')])
|
||||
if soup.find('hr') is None:
|
||||
asidediv.append(Tag(soup,'hr'))
|
||||
h4 = Tag(soup,'h4',[('class','asidenote')])
|
||||
h4.insert(0,"Related Posts")
|
||||
asidediv.append(h4)
|
||||
ul = Tag(soup,'ul')
|
||||
for r in rlist:
|
||||
li = Tag(soup,'li',[('class','aside')])
|
||||
r['class'] = 'aside'
|
||||
li.append(r)
|
||||
ul.append(li)
|
||||
asidediv.append(ul)
|
||||
asidediv.append(Tag(soup,'hr'))
|
||||
smain = soup.find('body')
|
||||
smain.append(asidediv)
|
||||
for atag in soup.findAll('a'):
|
||||
img = atag.find('img')
|
||||
if img is not None:
|
||||
atag.replaceWith(img)
|
||||
elif not atag.has_key('href'):
|
||||
atag.replaceWith(atag.renderContents().decode('cp1252','replace'))
|
||||
elif not (atag['href'].startswith('http://www.nytimes') or atag['href'].startswith('http://pogue') or \
|
||||
atag['href'].startswith('http://bits') or atag['href'].startswith('http://open')):
|
||||
atag.replaceWith(atag.renderContents().decode('cp1252','replace'))
|
||||
hdr = soup.find('address')
|
||||
if hdr is not None:
|
||||
hdr.name='span'
|
||||
for span_credit in soup.findAll('span','credit'):
|
||||
sp = Tag(soup,'span')
|
||||
span_credit.replaceWith(sp)
|
||||
sp.append(Tag(soup,'br'))
|
||||
sp.append(span_credit)
|
||||
sp.append(Tag(soup,'br'))
|
||||
|
||||
else: # nytimes article
|
||||
|
||||
related = [] # these will be the related articles
|
||||
first_outer = None # first related outer tag
|
||||
first_related = None # first related tag
|
||||
for outerdiv in soup.findAll(attrs={'class': re.compile('articleInline runaroundLeft')}):
|
||||
for rdiv in soup.findAll('div','columnGroup doubleRule'):
|
||||
if rdiv.find('h3') is not None:
|
||||
if self.tag_to_string(rdiv.h3,False).startswith('Related'):
|
||||
rdiv.h3.find(text=True).replaceWith("Related articles")
|
||||
rdiv.h3['class'] = 'asidenote'
|
||||
for litag in rdiv.findAll('li'):
|
||||
if litag.find('a') is not None:
|
||||
if litag.find('a')['href'].startswith('http://www.nytimes.com'):
|
||||
url = re.sub(r'\?.*', '', litag.find('a')['href'])
|
||||
litag.find('a')['href'] = url+'?pagewanted=all'
|
||||
litag.extract()
|
||||
related.append(litag)
|
||||
if first_related is None:
|
||||
first_related = rdiv
|
||||
first_outer = outerdiv
|
||||
else:
|
||||
litag.extract()
|
||||
if related != []:
|
||||
for r in related:
|
||||
if r.h6: # don't want the anchor inside a h6 tag
|
||||
r.h6.replaceWith(r.h6.a)
|
||||
first_related.ul.append(r)
|
||||
first_related.insert(0,Tag(soup,'hr'))
|
||||
first_related.append(Tag(soup,'hr'))
|
||||
first_related['class'] = 'aside'
|
||||
first_outer.replaceWith(first_related) # replace the outer tag with the related tag
|
||||
|
||||
for rdiv in soup.findAll(attrs={'class': re.compile('articleInline runaroundLeft')}):
|
||||
rdiv.extract()
|
||||
|
||||
kicker_tag = soup.find(attrs={'class':'kicker'})
|
||||
if kicker_tag: # remove Op_Ed author head shots
|
||||
tagline = self.tag_to_string(kicker_tag)
|
||||
if tagline=='Op-Ed Columnist':
|
||||
img_div = soup.find('div','inlineImage module')
|
||||
if img_div:
|
||||
img_div.extract()
|
||||
|
||||
if self.useHighResImages:
|
||||
try:
|
||||
@ -667,26 +890,6 @@ class NYTimes(BasicNewsRecipe):
|
||||
except Exception:
|
||||
self.log("Error pulling high resolution images")
|
||||
|
||||
try:
|
||||
#remove "Related content" bar
|
||||
runAroundsFound = soup.findAll('div',{'class':['articleInline runaroundLeft','articleInline doubleRule runaroundLeft','articleInline runaroundLeft firstArticleInline','articleInline runaroundLeft ','articleInline runaroundLeft lastArticleInline']})
|
||||
if runAroundsFound:
|
||||
for runAround in runAroundsFound:
|
||||
#find all section headers
|
||||
hlines = runAround.findAll(True ,{'class':['sectionHeader','sectionHeader flushBottom']})
|
||||
if hlines:
|
||||
for hline in hlines:
|
||||
hline.extract()
|
||||
|
||||
#find all section headers
|
||||
hlines = runAround.findAll('h6')
|
||||
if hlines:
|
||||
for hline in hlines:
|
||||
hline.extract()
|
||||
except:
|
||||
self.log("Error removing related content bar")
|
||||
|
||||
|
||||
try:
|
||||
#in case pulling images failed, delete the enlarge this text
|
||||
enlargeThisList = soup.findAll('div',{'class':'icon enlargeThis'})
|
||||
@ -696,9 +899,24 @@ class NYTimes(BasicNewsRecipe):
|
||||
except:
|
||||
self.log("Error removing Enlarge this text")
|
||||
|
||||
return self.strip_anchors(soup)
|
||||
|
||||
def postprocess_html(self,soup, True):
|
||||
return self.strip_anchors(soup,False)
|
||||
|
||||
def postprocess_html(self,soup,first_fetch):
|
||||
if not first_fetch: # remove Related links
|
||||
for aside in soup.findAll('div','aside'):
|
||||
aside.extract()
|
||||
soup = self.strip_anchors(soup,True)
|
||||
|
||||
if soup.find('div',attrs={'id':'blogcontent'}) is None:
|
||||
if first_fetch:
|
||||
aside = soup.find('div','aside')
|
||||
if aside is not None: # move the related list to the end of the article
|
||||
art = soup.find('div',attrs={'id':'article'})
|
||||
if art is None:
|
||||
art = soup.find('div',attrs={'class':'article'})
|
||||
if art is not None:
|
||||
art.append(aside)
|
||||
try:
|
||||
if self.one_picture_per_article:
|
||||
# Remove all images after first
|
||||
@ -774,7 +992,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
if headline:
|
||||
tag = Tag(soup, "h2")
|
||||
tag['class'] = "headline"
|
||||
tag.insert(0, self.fixChars(headline.renderContents()))
|
||||
tag.insert(0, self.fixChars(self.tag_to_string(headline,False)))
|
||||
soup.insert(0, tag)
|
||||
hrs = soup.findAll('hr')
|
||||
for hr in hrs:
|
||||
@ -788,7 +1006,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
if bylineauthor:
|
||||
tag = Tag(soup, "h6")
|
||||
tag['class'] = "byline"
|
||||
tag.insert(0, self.fixChars(bylineauthor.renderContents()))
|
||||
tag.insert(0, self.fixChars(self.tag_to_string(bylineauthor,False)))
|
||||
bylineauthor.replaceWith(tag)
|
||||
except:
|
||||
self.log("ERROR: fixing byline author format")
|
||||
@ -799,7 +1017,7 @@ class NYTimes(BasicNewsRecipe):
|
||||
if blogcredit:
|
||||
tag = Tag(soup, "h6")
|
||||
tag['class'] = "credit"
|
||||
tag.insert(0, self.fixChars(blogcredit.renderContents()))
|
||||
tag.insert(0, self.fixChars(self.tag_to_string(blogcredit,False)))
|
||||
blogcredit.replaceWith(tag)
|
||||
except:
|
||||
self.log("ERROR: fixing credit format")
|
||||
@ -855,23 +1073,22 @@ class NYTimes(BasicNewsRecipe):
|
||||
self.log("ERROR: Problem in Add class=authorId to <div> so we can format with CSS")
|
||||
|
||||
return soup
|
||||
|
||||
def populate_article_metadata(self, article, soup, first):
|
||||
if first and hasattr(self, 'add_toc_thumbnail'):
|
||||
if not first:
|
||||
return
|
||||
idxdiv = soup.find('div',attrs={'class':'articleSpanImage'})
|
||||
if idxdiv is not None:
|
||||
if idxdiv.img:
|
||||
self.add_toc_thumbnail(article, idxdiv.img['src'])
|
||||
self.add_toc_thumbnail(article, re.sub(r'links\\link\d+\\','',idxdiv.img['src']))
|
||||
else:
|
||||
img = soup.find('img')
|
||||
img = soup.find('body').find('img')
|
||||
if img is not None:
|
||||
self.add_toc_thumbnail(article, img['src'])
|
||||
|
||||
self.add_toc_thumbnail(article, re.sub(r'links\\link\d+\\','',img['src']))
|
||||
shortparagraph = ""
|
||||
try:
|
||||
if len(article.text_summary.strip()) == 0:
|
||||
articlebodies = soup.findAll('div',attrs={'class':'articleBody'})
|
||||
if not articlebodies: #added to account for blog formats
|
||||
articlebodies = soup.findAll('div', attrs={'class':'entry-content'}) #added to account for blog formats
|
||||
if articlebodies:
|
||||
for articlebody in articlebodies:
|
||||
if articlebody:
|
||||
@ -880,15 +1097,23 @@ class NYTimes(BasicNewsRecipe):
|
||||
refparagraph = self.massageNCXText(self.tag_to_string(p,use_alt=False)).strip()
|
||||
#account for blank paragraphs and short paragraphs by appending them to longer ones
|
||||
if len(refparagraph) > 0:
|
||||
if len(refparagraph) > 140: #approximately two lines of text
|
||||
article.summary = article.text_summary = shortparagraph + refparagraph
|
||||
if len(refparagraph) > 70: #approximately one line of text
|
||||
newpara = shortparagraph + refparagraph
|
||||
newparaDateline,newparaEm,newparaDesc = newpara.partition('&mdash;')
|
||||
if newparaEm == '':
|
||||
newparaDateline,newparaEm,newparaDesc = newpara.partition('—')
|
||||
if newparaEm == '':
|
||||
newparaDesc = newparaDateline
|
||||
article.summary = article.text_summary = newparaDesc.strip()
|
||||
return
|
||||
else:
|
||||
shortparagraph = refparagraph + " "
|
||||
if shortparagraph.strip().find(" ") == -1 and not shortparagraph.strip().endswith(":"):
|
||||
shortparagraph = shortparagraph + "- "
|
||||
|
||||
else:
|
||||
article.summary = article.text_summary = self.massageNCXText(article.text_summary)
|
||||
except:
|
||||
self.log("Error creating article descriptions")
|
||||
return
|
||||
|
||||
|
||||
|
@ -1,27 +1,27 @@
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
from calibre.ebooks.BeautifulSoup import BeautifulSoup
|
||||
|
||||
class PajamasMedia(BasicNewsRecipe):
|
||||
title = u'Pajamas Media'
|
||||
description = u'Provides exclusive news and opinion for forty countries.'
|
||||
language = 'en'
|
||||
__author__ = 'Krittika Goyal'
|
||||
oldest_article = 1 #days
|
||||
oldest_article = 2 #days
|
||||
max_articles_per_feed = 25
|
||||
recursions = 1
|
||||
match_regexps = [r'http://pajamasmedia.com/blog/.*/2/$']
|
||||
#encoding = 'latin1'
|
||||
|
||||
remove_stylesheets = True
|
||||
#remove_tags_before = dict(name='h1', attrs={'class':'heading'})
|
||||
remove_tags_after = dict(name='div', attrs={'class':'paged-nav'})
|
||||
remove_tags = [
|
||||
dict(name='iframe'),
|
||||
dict(name='div', attrs={'class':['pages']}),
|
||||
#dict(name='div', attrs={'id':['bookmark']}),
|
||||
#dict(name='span', attrs={'class':['related_link', 'slideshowcontrols']}),
|
||||
#dict(name='ul', attrs={'class':'articleTools'}),
|
||||
]
|
||||
auto_cleanup = True
|
||||
##remove_tags_before = dict(name='h1', attrs={'class':'heading'})
|
||||
#remove_tags_after = dict(name='div', attrs={'class':'paged-nav'})
|
||||
#remove_tags = [
|
||||
#dict(name='iframe'),
|
||||
#dict(name='div', attrs={'class':['pages']}),
|
||||
##dict(name='div', attrs={'id':['bookmark']}),
|
||||
##dict(name='span', attrs={'class':['related_link', 'slideshowcontrols']}),
|
||||
##dict(name='ul', attrs={'class':'articleTools'}),
|
||||
#]
|
||||
|
||||
feeds = [
|
||||
('pajamas Media',
|
||||
@ -29,20 +29,20 @@ class PajamasMedia(BasicNewsRecipe):
|
||||
|
||||
]
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
story = soup.find(name='div', attrs={'id':'innerpage-content'})
|
||||
#td = heading.findParent(name='td')
|
||||
#td.extract()
|
||||
#def preprocess_html(self, soup):
|
||||
#story = soup.find(name='div', attrs={'id':'innerpage-content'})
|
||||
##td = heading.findParent(name='td')
|
||||
##td.extract()
|
||||
|
||||
soup = BeautifulSoup('<html><head><title>t</title></head><body></body></html>')
|
||||
body = soup.find(name='body')
|
||||
body.insert(0, story)
|
||||
return soup
|
||||
#soup = BeautifulSoup('<html><head><title>t</title></head><body></body></html>')
|
||||
#body = soup.find(name='body')
|
||||
#body.insert(0, story)
|
||||
#return soup
|
||||
|
||||
def postprocess_html(self, soup, first):
|
||||
if not first:
|
||||
h = soup.find(attrs={'class':'innerpage-header'})
|
||||
if h: h.extract()
|
||||
auth = soup.find(attrs={'class':'author'})
|
||||
if auth: auth.extract()
|
||||
return soup
|
||||
#def postprocess_html(self, soup, first):
|
||||
#if not first:
|
||||
#h = soup.find(attrs={'class':'innerpage-header'})
|
||||
#if h: h.extract()
|
||||
#auth = soup.find(attrs={'class':'author'})
|
||||
#if auth: auth.extract()
|
||||
#return soup
|
||||
|
63
recipes/poradnia_pwn.recipe
Normal file
@ -0,0 +1,63 @@
|
||||
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
class PoradniaPWN(BasicNewsRecipe):
|
||||
title = u'Poradnia Językowa PWN'
|
||||
__author__ = 'fenuks'
|
||||
description = u'Internetowa poradnia językowa Wydawnictwa Naukowego PWN. Poradnię prowadzi Redaktor Naczelny Słowników Języka Polskiego, prof. Mirosław Bańko. Pomagają mu eksperci - znani polscy językoznawcy. Współpracuje z nami m.in. prof. Jerzy Bralczyk oraz dr Jan Grzenia.'
|
||||
category = 'language'
|
||||
language = 'pl'
|
||||
#cover_url = ''
|
||||
oldest_article = 14
|
||||
max_articles_per_feed = 100000
|
||||
INDEX = "http://poradnia.pwn.pl/"
|
||||
no_stylesheets = True
|
||||
remove_attributes = ['style']
|
||||
remove_javascript = True
|
||||
use_embedded_content = False
|
||||
#preprocess_regexps = [(re.compile('<li|ul', re.IGNORECASE), lambda m: '<div'),(re.compile('</li>', re.IGNORECASE), lambda m: '</div>'), (re.compile('</ul>', re.IGNORECASE), lambda m: '</div>')]
|
||||
keep_only_tags = [dict(name="div", attrs={"class":"searchhi"})]
|
||||
feeds = [(u'Poradnia', u'http://rss.pwn.pl/poradnia.rss')]
|
||||
|
||||
'''def find_articles(self, url):
|
||||
articles = []
|
||||
soup=self.index_to_soup(url)
|
||||
counter = int(soup.find(name='p', attrs={'class':'count'}).findAll('b')[-1].string)
|
||||
counter = 500
|
||||
pos = 0
|
||||
next = url
|
||||
while next:
|
||||
soup=self.index_to_soup(next)
|
||||
tag=soup.find(id="listapytan")
|
||||
art=tag.findAll(name='li')
|
||||
for i in art:
|
||||
if i.h4:
|
||||
title=i.h4.a.string
|
||||
url=self.INDEX+i.h4.a['href']
|
||||
#date=soup.find(id='footer').ul.li.string[41:-1]
|
||||
articles.append({'title' : title,
|
||||
'url' : url,
|
||||
'date' : '',
|
||||
'description' : ''
|
||||
})
|
||||
pos += 10
|
||||
if not pos >=counter:
|
||||
next = 'http://poradnia.pwn.pl/lista.php?kat=18&od=' + str(pos)
|
||||
print u'Tworzenie listy artykułów dla', next
|
||||
else:
|
||||
next = None
|
||||
print articles
|
||||
return articles
|
||||
|
||||
def parse_index(self):
|
||||
feeds = []
|
||||
feeds.append((u"Poradnia", self.find_articles('http://poradnia.pwn.pl/lista.php')))
|
||||
|
||||
return feeds'''

def preprocess_html(self, soup):
for i in soup.findAll(name=['ul', 'li']):
i.name="div"
for z in soup.findAll(name='a'):
if not z['href'].startswith('http'):
z['href'] = 'http://poradnia.pwn.pl/' + z['href']
return soup
@ -1,12 +1,13 @@
from calibre.web.feeds.news import BasicNewsRecipe
# -*- coding: utf-8 -*-

class BasicUserRecipe1324913680(BasicNewsRecipe):
from calibre.web.feeds.news import BasicNewsRecipe
class AdvancedUserRecipe1355341662(BasicNewsRecipe):
title = u'Sivil Dusunce'
language = 'tr'
__author__ = 'asalet_r'

oldest_article = 7
max_articles_per_feed = 20
max_articles_per_feed = 50
auto_cleanup = True

feeds = [(u'Sivil Dusunce', u'http://www.sivildusunce.com/feed/')]
feeds = [(u'Sivil Dusunce', u'http://www.sivildusunce.com/?t=rss&xml=1')]
@ -8,19 +8,19 @@ Fetch sueddeutsche.de
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
class Sueddeutsche(BasicNewsRecipe):
|
||||
|
||||
title = u'Süddeutsche.de' # 2012-01-26 AGe Correct Title
|
||||
description = 'News from Germany, Access to online content' # 2012-01-26 AGe
|
||||
__author__ = 'Oliver Niesner and Armin Geller' #Update AGe 2012-01-26
|
||||
publisher = u'Süddeutsche Zeitung' # 2012-01-26 AGe add
|
||||
category = 'news, politics, Germany' # 2012-01-26 AGe add
|
||||
timefmt = ' [%a, %d %b %Y]' # 2012-01-26 AGe add %a
|
||||
title = u'Süddeutsche.de'
|
||||
description = 'News from Germany, Access to online content'
|
||||
__author__ = 'Oliver Niesner and Armin Geller' #Update AGe 2012-12-05
|
||||
publisher = u'Süddeutsche Zeitung'
|
||||
category = 'news, politics, Germany'
|
||||
timefmt = ' [%a, %d %b %Y]'
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
language = 'de'
|
||||
encoding = 'utf-8'
|
||||
publication_type = 'newspaper' # 2012-01-26 add
|
||||
publication_type = 'newspaper'
|
||||
cover_source = 'http://www.sueddeutsche.de/verlag' # 2012-01-26 AGe add from Darko Miletic paid content source
|
||||
masthead_url = 'http://www.sueddeutsche.de/static_assets/build/img/sdesiteheader/logo_homepage.441d531c.png' # 2012-01-26 AGe add
|
||||
masthead_url = 'http://www.sueddeutsche.de/static_assets/img/sdesiteheader/logo_standard.a152b0df.png' # 2012-12-05 AGe add
|
||||
|
||||
use_embedded_content = False
|
||||
no_stylesheets = True
|
||||
@ -40,9 +40,9 @@ class Sueddeutsche(BasicNewsRecipe):
|
||||
(u'Sport', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ESport%24?output=rss'),
|
||||
(u'Leben', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5ELeben%24?output=rss'),
|
||||
(u'Karriere', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EKarriere%24?output=rss'),
|
||||
(u'Bildung', u'http://rss.sueddeutsche.de/rss/bildung'), #2012-01-26 AGe New
|
||||
(u'Gesundheit', u'http://rss.sueddeutsche.de/rss/gesundheit'), #2012-01-26 AGe New
|
||||
(u'Stil', u'http://rss.sueddeutsche.de/rss/stil'), #2012-01-26 AGe New
|
||||
(u'Bildung', u'http://rss.sueddeutsche.de/rss/bildung'),
|
||||
(u'Gesundheit', u'http://rss.sueddeutsche.de/rss/gesundheit'),
|
||||
(u'Stil', u'http://rss.sueddeutsche.de/rss/stil'),
|
||||
(u'München & Region', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EMünchen&Region%24?output=rss'),
|
||||
(u'Bayern', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EBayern%24?output=rss'),
|
||||
(u'Medien', u'http://suche.sueddeutsche.de/query/%23/sort/-docdatetime/drilldown/%C2%A7ressort%3A%5EMedien%24?output=rss'),
|
||||
|
20
recipes/titanic_de.recipe
Normal file
@ -0,0 +1,20 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Titanic(BasicNewsRecipe):
title = u'Titanic'
language = 'de'
__author__ = 'Krittika Goyal'
oldest_article = 14 #days
max_articles_per_feed = 25
#encoding = 'cp1252'
use_embedded_content = False

no_stylesheets = True
auto_cleanup = True


feeds = [
('News',
'http://www.titanic-magazin.de/ich.war.bei.der.waffen.rss'),
]
20
recipes/tvp_info.recipe
Normal file
@ -0,0 +1,20 @@
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
from calibre.web.feeds.news import BasicNewsRecipe
class TVPINFO(BasicNewsRecipe):
title = u'TVP.INFO'
__author__ = 'fenuks'
description = u'Serwis informacyjny TVP.INFO'
category = 'news'
language = 'pl'
cover_url = 'http://s.v3.tvp.pl/files/tvp-info/gfx/logo.png'
oldest_article = 7
max_articles_per_feed = 100
no_stylesheets = True
remove_empty_feeds = True
remove_javascript = True
use_embedded_content = False
ignore_duplicate_articles = {'title', 'url'}
keep_only_tags = [dict(id='contentNews')]
remove_tags = [dict(attrs={'class':['toolbox', 'modulBox read', 'modulBox social', 'videoPlayerBox']}), dict(id='belka')]
(u'\u015awiat', u'http://tvp.info/informacje/swiat?xslt=tvp-info/news/rss.xslt&src_id=191867'), (u'Biznes', u'http://tvp.info/informacje/biznes?xslt=tvp-info/news/rss.xslt&src_id=191868'), (u'Nauka', u'http://tvp.info/informacje/nauka?xslt=tvp-info/news/rss.xslt&src_id=191870'), (u'Kultura', u'http://tvp.info/informacje/kultura?xslt=tvp-info/news/rss.xslt&src_id=191869'), (u'Rozmaito\u015bci', u'http://tvp.info/informacje/rozmaitosci?xslt=tvp-info/news/rss.xslt&src_id=191872'), (u'Opinie', u'http://tvp.info/opinie?xslt=tvp-info/news/rss.xslt&src_id=191875'), (u'Komentarze', u'http://tvp.info/opinie/komentarze?xslt=tvp-info/news/rss.xslt&src_id=238200'), (u'Wywiady', u'http://tvp.info/opinie/wywiady?xslt=tvp-info/news/rss.xslt&src_id=236644')]
|
13
recipes/ukraiyns_kii_tizhdien.recipe
Normal file
@ -0,0 +1,13 @@
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1356283265(BasicNewsRecipe):
title = u'\u0423\u043a\u0440\u0430\u0457\u043d\u0441\u044c\u043a\u0438\u0439 \u0422\u0438\u0436\u0434\u0435\u043d\u044c'
__author__ = 'rpalyvoda'
oldest_article = 7
max_articles_per_feed = 100
language = 'uk'
cover_url = 'http://tyzhden.ua/Images/Style1/tyzhden.ua-logo2.gif'
masthead_url = 'http://tyzhden.ua/Images/Style1/tyzhden.ua-logo2.gif'
auto_cleanup = True

feeds = [(u'\u041d\u043e\u0432\u0438\u043d\u0438', u'http://tyzhden.ua/RSS/News/'), (u'\u041e\u0440\u0438\u0433\u0456\u043d\u0430\u043b\u044c\u043d\u0456 \u043d\u043e\u0432\u0438\u043d\u0438', u'http://tyzhden.ua/RSS/News.Original/'), (u'\u041f\u0443\u0431\u043b\u0456\u043a\u0430\u0446\u0456\u0457', u'http://tyzhden.ua/RSS/Publications/')]
|
@ -2,8 +2,8 @@
|
||||
__license__ = 'GPL v3'
|
||||
__copyright__ = '4 February 2011, desUBIKado'
|
||||
__author__ = 'desUBIKado'
|
||||
__version__ = 'v0.08'
|
||||
__date__ = '30, June 2012'
|
||||
__version__ = 'v0.09'
|
||||
__date__ = '02, December 2012'
|
||||
'''
|
||||
http://www.weblogssl.com/
|
||||
'''
|
||||
@ -37,6 +37,7 @@ class weblogssl(BasicNewsRecipe):
|
||||
,(u'Xataka Mexico', u'http://feeds.weblogssl.com/xatakamx')
|
||||
,(u'Xataka M\xf3vil', u'http://feeds.weblogssl.com/xatakamovil')
|
||||
,(u'Xataka Android', u'http://feeds.weblogssl.com/xatakandroid')
|
||||
,(u'Xataka Windows', u'http://feeds.weblogssl.com/xatakawindows')
|
||||
,(u'Xataka Foto', u'http://feeds.weblogssl.com/xatakafoto')
|
||||
,(u'Xataka ON', u'http://feeds.weblogssl.com/xatakaon')
|
||||
,(u'Xataka Ciencia', u'http://feeds.weblogssl.com/xatakaciencia')
|
||||
@ -80,19 +81,31 @@ class weblogssl(BasicNewsRecipe):
|
||||
|
||||
keep_only_tags = [dict(name='div', attrs={'id':'infoblock'}),
|
||||
dict(name='div', attrs={'class':'post'}),
|
||||
dict(name='div', attrs={'id':'blog-comments'})
|
||||
dict(name='div', attrs={'id':'blog-comments'}),
|
||||
dict(name='div', attrs={'class':'container'}) #m.xataka.com
|
||||
]
|
||||
|
||||
remove_tags = [dict(name='div', attrs={'id':'comment-nav'})]
|
||||
remove_tags = [dict(name='div', attrs={'id':'comment-nav'}),
|
||||
dict(name='menu', attrs={'class':'social-sharing'}), #m.xataka.com
|
||||
dict(name='section' , attrs={'class':'comments'}), #m.xataka.com
|
||||
dict(name='div' , attrs={'class':'article-comments'}), #m.xataka.com
|
||||
dict(name='nav' , attrs={'class':'article-taxonomy'}) #m.xataka.com
|
||||
]
|
||||
|
||||
remove_tags_after = dict(name='section' , attrs={'class':'comments'})
|
||||
|
||||
def print_version(self, url):
|
||||
return url.replace('http://www.', 'http://m.')
|
||||
|
||||
preprocess_regexps = [
|
||||
# Para poner una linea en blanco entre un comentario y el siguiente
|
||||
(re.compile(r'<li id="c', re.DOTALL|re.IGNORECASE), lambda match: '<br><br><li id="c')
|
||||
(re.compile(r'<li id="c', re.DOTALL|re.IGNORECASE), lambda match: '<br><br><li id="c'),
|
||||
# Para ver las imágenes en las noticias de m.xataka.com
|
||||
(re.compile(r'<noscript>', re.DOTALL|re.IGNORECASE), lambda m: ''),
|
||||
(re.compile(r'</noscript>', re.DOTALL|re.IGNORECASE), lambda m: '')
|
||||
]
|
||||
|
||||
|
||||
# Para sustituir el video incrustado de YouTube por una imagen
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
@ -108,14 +121,16 @@ class weblogssl(BasicNewsRecipe):
|
||||
|
||||
# Para obtener la url original del articulo a partir de la de "feedsportal"
|
||||
# El siguiente código es gracias al usuario "bosplans" de www.mobileread.com
|
||||
# http://www.mobileread.com/forums/sho...d.php?t=130297
|
||||
# http://www.mobileread.com/forums/showthread.php?t=130297
|
||||
|
||||
def get_article_url(self, article):
|
||||
link = article.get('link', None)
|
||||
if link is None:
|
||||
return article
|
||||
# if link.split('/')[-4]=="xataka2":
|
||||
# return article.get('feedburner_origlink', article.get('link', article.get('guid')))
|
||||
if link.split('/')[-4]=="xataka2":
|
||||
return article.get('feedburner_origlink', article.get('link', article.get('guid')))
|
||||
return article.get('guid', None)
|
||||
if link.split('/')[-1]=="story01.htm":
|
||||
link=link.split('/')[-2]
|
||||
a=['0B','0C','0D','0E','0F','0G','0N' ,'0L0S','0A']
|
||||
|
@ -9,15 +9,15 @@ class Zaman (BasicNewsRecipe):
|
||||
__author__ = u'thomass'
|
||||
oldest_article = 2
|
||||
max_articles_per_feed =50
|
||||
# no_stylesheets = True
|
||||
no_stylesheets = True
|
||||
#delay = 1
|
||||
#use_embedded_content = False
|
||||
encoding = 'ISO 8859-9'
|
||||
publisher = 'Zaman'
|
||||
use_embedded_content = False
|
||||
encoding = 'utf-8'
|
||||
publisher = 'Feza Gazetecilik'
|
||||
category = 'news, haberler,TR,gazete'
|
||||
language = 'tr'
|
||||
publication_type = 'newspaper '
|
||||
extra_css = '.buyukbaslik{font-weight: bold; font-size: 18px;color:#0000FF}'#body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
|
||||
extra_css = 'h1{text-transform: capitalize; font-weight: bold; font-size: 22px;color:#0000FF} p{text-align:justify} ' #.introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
|
||||
conversion_options = {
|
||||
'tags' : category
|
||||
,'language' : language
|
||||
@ -26,25 +26,26 @@ class Zaman (BasicNewsRecipe):
|
||||
}
|
||||
cover_img_url = 'https://fbcdn-profile-a.akamaihd.net/hprofile-ak-snc4/188140_81722291869_2111820_n.jpg'
|
||||
masthead_url = 'http://medya.zaman.com.tr/extentions/zaman.com.tr/img/section/logo-section.png'
|
||||
ignore_duplicate_articles = { 'title', 'url' }
|
||||
auto_cleanup = False
|
||||
remove_empty_feeds= True
|
||||
|
||||
|
||||
#keep_only_tags = [dict(name='div', attrs={'id':[ 'news-detail-content']}), dict(name='td', attrs={'class':['columnist-detail','columnist_head']}) ]
|
||||
remove_tags = [ dict(name='img', attrs={'src':['http://medya.zaman.com.tr/zamantryeni/pics/zamanonline.gif']})]#,dict(name='div', attrs={'class':['radioEmbedBg','radyoProgramAdi']}),dict(name='a', attrs={'class':['webkit-html-attribute-value webkit-html-external-link']}),dict(name='table', attrs={'id':['yaziYorumTablosu']}),dict(name='img', attrs={'src':['http://medya.zaman.com.tr/pics/paylas.gif','http://medya.zaman.com.tr/extentions/zaman.com.tr/img/columnist/ma-16.png']})
|
||||
#keep_only_tags = [dict(name='div', attrs={'id':[ 'contentposition19']})]#,dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'news-detail-content']}), dict(name='td', attrs={'class':['columnist-detail','columnist_head']}), ]
|
||||
remove_tags = [ dict(name='img', attrs={'src':['http://cmsmedya.zaman.com.tr/images/logo/logo.bmp']}),dict(name='hr', attrs={'class':['interactive-hr']})]# remove_tags = [ dict(name='div', attrs={'class':[ 'detayUyari']}),dict(name='div', attrs={'class':[ 'detayYorum']}),dict(name='div', attrs={'class':[ 'addthis_toolbox addthis_default_style ']}),dict(name='div', attrs={'id':[ 'tumYazi']})]#,dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'xxx']}),dict(name='div', attrs={'id':[ 'xxx']}),dict(name='img', attrs={'src':['http://medya.zaman.com.tr/zamantryeni/pics/zamanonline.gif']}),dict(name='div', attrs={'class':['radioEmbedBg','radyoProgramAdi']}),dict(name='a', attrs={'class':['webkit-html-attribute-value webkit-html-external-link']}),dict(name='table', attrs={'id':['yaziYorumTablosu']}),dict(name='img', attrs={'src':['http://medya.zaman.com.tr/pics/paylas.gif','http://medya.zaman.com.tr/extentions/zaman.com.tr/img/columnist/ma-16.png']}),dict(name='div', attrs={'id':[ 'news-detail-gallery']}),dict(name='div', attrs={'id':[ 'news-detail-title-bottom-part']}),dict(name='div', attrs={'id':[ 'news-detail-news-paging-main']})]#
#remove_attributes = ['width','height']

remove_empty_feeds= True

feeds = [
( u'Anasayfa', u'http://www.zaman.com.tr/anasayfa.rss'),
( u'Son Dakika', u'http://www.zaman.com.tr/sondakika.rss'),
#( u'En çok Okunanlar', u'http://www.zaman.com.tr/max_all.rss'),
#( u'Manşet', u'http://www.zaman.com.tr/manset.rss'),
( u'Gündem', u'http://www.zaman.com.tr/gundem.rss'),
( u'Manşet', u'http://www.zaman.com.tr/manset.rss'),
( u'Yazarlar', u'http://www.zaman.com.tr/yazarlar.rss'),
( u'Politika', u'http://www.zaman.com.tr/politika.rss'),
( u'Ekonomi', u'http://www.zaman.com.tr/ekonomi.rss'),
( u'Dış Haberler', u'http://www.zaman.com.tr/dishaberler.rss'),
( u'Son Dakika', u'http://www.zaman.com.tr/sondakika.rss'),
( u'Gündem', u'http://www.zaman.com.tr/gundem.rss'),
( u'Yorumlar', u'http://www.zaman.com.tr/yorumlar.rss'),
( u'Röportaj', u'http://www.zaman.com.tr/roportaj.rss'),
( u'Dizi Yazı', u'http://www.zaman.com.tr/dizi.rss'),
@ -59,8 +60,9 @@ class Zaman (BasicNewsRecipe):
( u'Cuma Eki', u'http://www.zaman.com.tr/cuma.rss'),
( u'Cumaertesi Eki', u'http://www.zaman.com.tr/cumaertesi.rss'),
( u'Pazar Eki', u'http://www.zaman.com.tr/pazar.rss'),
( u'En çok Okunanlar', u'http://www.zaman.com.tr/max_all.rss'),
( u'Anasayfa', u'http://www.zaman.com.tr/anasayfa.rss'),

]

def print_version(self, url):
return url.replace('http://www.zaman.com.tr/haber.do?haberno=', 'http://www.zaman.com.tr/yazdir.do?haberno=')
return url.replace('http://www.zaman.com.tr/newsDetail_getNewsById.action?newsId=', 'http://www.zaman.com.tr/newsDetail_openPrintPage.action?newsId=')
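For clarity, the updated print_version maps article URLs onto the site's print view with a plain string replacement over the new URL scheme; a minimal illustration (the newsId value is made up, not taken from a real feed):

    url = 'http://www.zaman.com.tr/newsDetail_getNewsById.action?newsId=12345'
    print_url = url.replace('http://www.zaman.com.tr/newsDetail_getNewsById.action?newsId=',
                            'http://www.zaman.com.tr/newsDetail_openPrintPage.action?newsId=')
    # print_url == 'http://www.zaman.com.tr/newsDetail_openPrintPage.action?newsId=12345'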
recipes/zaufana_trzecia_strona.recipe (new file)
@ -0,0 +1,16 @@
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
from calibre.web.feeds.news import BasicNewsRecipe

class ZTS(BasicNewsRecipe):
title = u'Zaufana Trzecia Strona'
__author__ = 'fenuks'
description = u'Niezależne źródło wiadomości o świecie bezpieczeństwa IT'
category = 'IT, security'
language = 'pl'
cover_url = 'http://www.zaufanatrzeciastrona.pl/wp-content/uploads/2012/08/z3s_h100.png'
oldest_article = 7
max_articles_per_feed = 100
no_stylesheets = True
remove_empty_feeds = True
keep_only_tags = [dict(name='div', attrs={'class':'post postcontent'})]
remove_tags = [dict(name='div', attrs={'class':'dolna-ramka'})]
feeds = [(u'Strona g\u0142\xf3wna', u'http://feeds.feedburner.com/ZaufanaTrzeciaStronaGlowna'), (u'Drobiazgi', u'http://feeds.feedburner.com/ZaufanaTrzeciaStronaDrobiazgi')]
recipes/zaxid_net.recipe (new file)
@ -0,0 +1,13 @@
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1356281741(BasicNewsRecipe):
title = u'Zaxid.net'
__author__ = 'rpalyvoda'
oldest_article = 7
max_articles_per_feed = 100
language = 'uk'
cover_url = 'http://upload.wikimedia.org/wikipedia/uk/b/bc/Zaxid-net.jpg'
masthead_url = 'http://upload.wikimedia.org/wikipedia/uk/b/bc/Zaxid-net.jpg'
auto_cleanup = True
feeds = [(u'\u0422\u043e\u043f \u043d\u043e\u0432\u0438\u043d\u0438', u'http://feeds.feedburner.com/zaxid/topNews'), (u'\u0421\u0442\u0440\u0456\u0447\u043a\u0430 \u043d\u043e\u0432\u0438\u043d', u'http://feeds.feedburner.com/zaxid/AllNews'), (u'\u041d\u043e\u0432\u0438\u043d\u0438 \u041b\u044c\u0432\u043e\u0432\u0430', u'http://feeds.feedburner.com/zaxid/Lviv'), (u'\u041d\u043e\u0432\u0438\u043d\u0438 \u0423\u043a\u0440\u0430\u0457\u043d\u0438', u'http://feeds.feedburner.com/zaxid/Ukraine'), (u'\u041d\u043e\u0432\u0438\u043d\u0438 \u0441\u0432\u0456\u0442\u0443', u'http://feeds.feedburner.com/zaxid/World'), (u'\u041d\u043e\u0432\u0438\u043d\u0438 - \u0420\u0430\u0434\u0456\u043e 24', u'\u0420\u0430\u0434\u0456\u043e 24'), (u'\u0411\u043b\u043e\u0433\u0438', u'http://feeds.feedburner.com/zaxid/Blogs'), (u"\u041f\u0443\u0431\u043b\u0456\u043a\u0430\u0446\u0456\u0457 - \u0406\u043d\u0442\u0435\u0440\u0432'\u044e", u'http://feeds.feedburner.com/zaxid/Interview'), (u'\u041f\u0443\u0431\u043b\u0456\u043a\u0430\u0446\u0456\u0457 - \u0421\u0442\u0430\u0442\u0442\u0456', u'http://feeds.feedburner.com/zaxid/Articles'), (u'\u0410\u0444\u0456\u0448\u0430', u'http://zaxid.net/rss/subcategory/140.xml'), (u'\u0413\u0430\u043b\u0438\u0447\u0438\u043d\u0430', u'http://feeds.feedburner.com/zaxid/Galicia'), (u'\u041a\u0443\u043b\u044c\u0442\u0443\u0440\u0430.NET', u'http://feeds.feedburner.com/zaxid/KulturaNET'), (u"\u043d\u0435\u0412\u0456\u0434\u043e\u043c\u0456 \u043b\u044c\u0432\u0456\u0432'\u044f\u043d\u0438", u'http://feeds.feedburner.com/zaxid/UnknownLviv'), (u'\u041b\u0435\u043e\u043f\u043e\u043b\u0456\u0441 MULTIPLEX', u'http://feeds.feedburner.com/zaxid/LeopolisMULTIPLEX'), (u'\u0411\u0438\u0442\u0432\u0430 \u0437\u0430 \u043c\u043e\u0432\u0443', u'http://zaxid.net/rss/subcategory/138.xml'), (u'\u0422\u0440\u0430\u043d\u0441\u043f\u043e\u0440\u0442\u043d\u0430 \u0441\u0445\u0435\u043c\u0430 \u041b\u044c\u0432\u043e\u0432\u0430', u'http://zaxid.net/rss/subcategory/132.xml'), (u'\u0414\u0435\u043c\u0456\u0444\u043e\u043b\u043e\u0433\u0456\u0437\u0430\u0446\u0456\u044f', u'http://zaxid.net/rss/subcategory/130.xml'), (u"\u041c\u0438 \u043f\u0430\u043c'\u044f\u0442\u0430\u0454\u043c\u043e", u'http://feeds.feedburner.com/zaxid/WeRemember'), (u'20 \u0440\u043e\u043a\u0456\u0432 \u041d\u0435\u0437\u0430\u043b\u0435\u0436\u043d\u043e\u0441\u0442\u0456', u'http://zaxid.net/rss/subcategory/129.xml'), (u'\u041f\u0440\u0430\u0432\u043e \u043d\u0430 \u0434\u0438\u0442\u0438\u043d\u0441\u0442\u0432\u043e', u'http://feeds.feedburner.com/zaxid/Childhood'), (u'\u0410\u043d\u043e\u043d\u0441\u0438', u'http://feeds.feedburner.com/zaxid/Announcements')]
@ -81,6 +81,7 @@ body {
background-color: #39a9cf;
-moz-border-radius: 5px;
-webkit-border-radius: 5px;
border-radius: 5px;
text-shadow: #27211b 1px 1px 1px;
-moz-box-shadow: 5px 5px 5px #222;
-webkit-box-shadow: 5px 5px 5px #222;
(binary image changed: 17 KiB before, 62 KiB after)
@ -12,6 +12,7 @@ let g:syntastic_cpp_include_dirs = [
\'/usr/include/fontconfig',
\'src/qtcurve/common', 'src/qtcurve',
\'src/unrar',
\'src/qt-harfbuzz/src',
\'/usr/include/ImageMagick',
\]
let g:syntastic_c_include_dirs = g:syntastic_cpp_include_dirs
@ -6,12 +6,13 @@ __license__ = 'GPL v3'
|
||||
__copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
|
||||
__docformat__ = 'restructuredtext en'
|
||||
|
||||
import os, socket, struct, subprocess, sys, glob
|
||||
import os, socket, struct, subprocess, glob
|
||||
from distutils.spawn import find_executable
|
||||
|
||||
from PyQt4 import pyqtconfig
|
||||
|
||||
from setup import isosx, iswindows, islinux, is64bit
|
||||
is64bit
|
||||
|
||||
OSX_SDK = '/Developer/SDKs/MacOSX10.5.sdk'
|
||||
|
||||
@ -81,6 +82,7 @@ def consolidate(envvar, default):
|
||||
pyqt = pyqtconfig.Configuration()
|
||||
|
||||
qt_inc = pyqt.qt_inc_dir
|
||||
qt_private_inc = []
|
||||
qt_lib = pyqt.qt_lib_dir
|
||||
ft_lib_dirs = []
|
||||
ft_libs = []
|
||||
@ -140,6 +142,8 @@ elif isosx:
|
||||
png_libs = ['png12']
|
||||
ft_libs = ['freetype']
|
||||
ft_inc_dirs = ['/sw/include/freetype2']
|
||||
bq = glob.glob('/sw/build/qt-*/include')[-1]
|
||||
qt_private_inc = ['%s/%s'%(bq, m) for m in ('QtGui', 'QtCore')]
|
||||
else:
|
||||
# Include directories
|
||||
png_inc_dirs = pkgconfig_include_dirs('libpng', 'PNG_INC_DIR',
|
||||
|
@ -102,7 +102,8 @@ class Check(Command):
|
||||
errors = True
|
||||
if errors:
|
||||
cPickle.dump(cache, open(self.CACHE, 'wb'), -1)
|
||||
subprocess.call(['gvim', '-f', f])
|
||||
subprocess.call(['gvim', '-S',
|
||||
self.j(self.SRC, '../session.vim'), '-f', f])
|
||||
raise SystemExit(1)
|
||||
cache[f] = mtime
|
||||
for x in builtins:
|
||||
|
@ -18,7 +18,7 @@ from setup.build_environment import (chmlib_inc_dirs,
|
||||
msvc, MT, win_inc, win_lib, win_ddk, magick_inc_dirs, magick_lib_dirs,
|
||||
magick_libs, chmlib_lib_dirs, sqlite_inc_dirs, icu_inc_dirs,
|
||||
icu_lib_dirs, win_ddk_lib_dirs, ft_libs, ft_lib_dirs, ft_inc_dirs,
|
||||
zlib_libs, zlib_lib_dirs, zlib_inc_dirs, is64bit)
|
||||
zlib_libs, zlib_lib_dirs, zlib_inc_dirs, is64bit, qt_private_inc)
|
||||
MT
|
||||
isunix = islinux or isosx or isbsd
|
||||
|
||||
@ -183,6 +183,13 @@ extensions = [
|
||||
sip_files = ['calibre/gui2/progress_indicator/QProgressIndicator.sip']
|
||||
),
|
||||
|
||||
Extension('qt_hack',
|
||||
['calibre/ebooks/pdf/render/qt_hack.cpp'],
|
||||
inc_dirs = qt_private_inc + ['calibre/ebooks/pdf/render', 'qt-harfbuzz/src'],
|
||||
headers = ['calibre/ebooks/pdf/render/qt_hack.h'],
|
||||
sip_files = ['calibre/ebooks/pdf/render/qt_hack.sip']
|
||||
),
|
||||
|
||||
Extension('unrar',
|
||||
['unrar/%s.cpp'%(x.partition('.')[0]) for x in '''
|
||||
rar.o strlist.o strfn.o pathfn.o savepos.o smallfn.o global.o file.o
|
||||
@ -545,6 +552,9 @@ class Build(Command):
|
||||
VERSION = 1.0.0
|
||||
CONFIG += %s
|
||||
''')%(ext.name, ' '.join(ext.headers), ' '.join(ext.sources), archs)
|
||||
if ext.inc_dirs:
|
||||
idir = ' '.join(ext.inc_dirs)
|
||||
pro += 'INCLUDEPATH = %s\n'%idir
|
||||
pro = pro.replace('\\', '\\\\')
|
||||
open(ext.name+'.pro', 'wb').write(pro)
|
||||
qmc = [QMAKE, '-o', 'Makefile']
|
||||
|
@ -39,18 +39,6 @@ class Win32(WinBase):
|
||||
def msi64(self):
|
||||
return installer_name('msi', is64bit=True)
|
||||
|
||||
def sign_msi(self):
|
||||
import xattr
|
||||
print ('Signing installers ...')
|
||||
sign64 = False
|
||||
msi64 = self.msi64
|
||||
if os.path.exists(msi64) and 'user.signed' not in xattr.list(msi64):
|
||||
subprocess.check_call(['scp', msi64, self.VM_NAME +
|
||||
':build/%s/%s'%(__appname__, msi64)])
|
||||
sign64 = True
|
||||
subprocess.check_call(['ssh', self.VM_NAME, '~/sign.sh'], shell=False)
|
||||
return sign64
|
||||
|
||||
def do_dl(self, installer, errmsg):
|
||||
subprocess.check_call(('scp',
|
||||
'%s:build/%s/%s'%(self.VM_NAME, __appname__, installer), 'dist'))
|
||||
@ -62,14 +50,8 @@ class Win32(WinBase):
|
||||
installer = self.installer()
|
||||
if os.path.exists('build/winfrozen'):
|
||||
shutil.rmtree('build/winfrozen')
|
||||
sign64 = self.sign_msi()
|
||||
if sign64:
|
||||
self.do_dl(self.msi64, 'Failed to d/l signed 64 bit installer')
|
||||
import xattr
|
||||
xattr.set(self.msi64, 'user.signed', 'true')
|
||||
|
||||
self.do_dl(installer, 'Failed to freeze')
|
||||
|
||||
installer = 'dist/%s-portable-installer-%s.exe'%(__appname__, __version__)
|
||||
self.do_dl(installer, 'Failed to get portable installer')
|
||||
|
||||
|
@ -91,6 +91,7 @@ class Win32Freeze(Command, WixMixIn):
|
||||
if not is64bit:
|
||||
self.build_portable()
|
||||
self.build_portable_installer()
|
||||
self.sign_installers()
|
||||
|
||||
def remove_CRT_from_manifests(self):
|
||||
'''
|
||||
@ -101,7 +102,8 @@ class Win32Freeze(Command, WixMixIn):
|
||||
repl_pat = re.compile(
|
||||
r'(?is)<dependency>.*?Microsoft\.VC\d+\.CRT.*?</dependency>')
|
||||
|
||||
for dll in glob.glob(self.j(self.dll_dir, '*.dll')):
|
||||
for dll in (glob.glob(self.j(self.dll_dir, '*.dll')) +
|
||||
glob.glob(self.j(self.plugins_dir, '*.pyd'))):
|
||||
bn = self.b(dll)
|
||||
with open(dll, 'rb') as f:
|
||||
raw = f.read()
|
||||
@ -488,6 +490,17 @@ class Win32Freeze(Command, WixMixIn):
|
||||
|
||||
subprocess.check_call([LZMA + r'\bin\elzma.exe', '-9', '--lzip', name])
|
||||
|
||||
def sign_installers(self):
|
||||
self.info('Signing installers...')
|
||||
files = glob.glob(self.j('dist', '*.msi')) + glob.glob(self.j('dist',
|
||||
'*.exe'))
|
||||
if not files:
|
||||
raise ValueError('No installers found')
|
||||
subprocess.check_call(['signtool.exe', 'sign', '/a', '/d',
|
||||
'calibre - E-book management', '/du',
|
||||
'http://calibre-ebook.com', '/t',
|
||||
'http://timestamp.verisign.com/scripts/timstamp.dll'] + files)
|
||||
|
||||
def add_dir_to_zip(self, zf, path, prefix=''):
|
||||
'''
|
||||
Add a directory recursively to the zip file with an optional prefix.
|
||||
@ -586,6 +599,10 @@ class Win32Freeze(Command, WixMixIn):
|
||||
# from files
|
||||
'unrar.pyd', 'wpd.pyd', 'podofo.pyd',
|
||||
'progress_indicator.pyd',
|
||||
# As per this https://bugs.launchpad.net/bugs/1087816
|
||||
# on some systems magick.pyd fails to load from memory
|
||||
# on 64 bit
|
||||
'magick.pyd',
|
||||
}:
|
||||
self.add_to_zipfile(zf, pyd, x)
|
||||
os.remove(self.j(x, pyd))
|
||||
|
setup/iso_639/ca.po
@ -9,14 +9,14 @@ msgstr ""
|
||||
"Project-Id-Version: calibre\n"
|
||||
"Report-Msgid-Bugs-To: FULL NAME <EMAIL@ADDRESS>\n"
|
||||
"POT-Creation-Date: 2011-11-25 14:01+0000\n"
|
||||
"PO-Revision-Date: 2012-08-15 10:30+0000\n"
|
||||
"Last-Translator: Jellby <Unknown>\n"
|
||||
"PO-Revision-Date: 2012-12-24 08:05+0000\n"
|
||||
"Last-Translator: Adolfo Jayme Barrientos <fitoschido@gmail.com>\n"
|
||||
"Language-Team: Español; Castellano <>\n"
|
||||
"MIME-Version: 1.0\n"
|
||||
"Content-Type: text/plain; charset=UTF-8\n"
|
||||
"Content-Transfer-Encoding: 8bit\n"
|
||||
"X-Launchpad-Export-Date: 2012-08-16 04:40+0000\n"
|
||||
"X-Generator: Launchpad (build 15810)\n"
|
||||
"X-Launchpad-Export-Date: 2012-12-25 04:46+0000\n"
|
||||
"X-Generator: Launchpad (build 16378)\n"
|
||||
|
||||
#. name for aaa
|
||||
msgid "Ghotuo"
|
||||
@ -9584,27 +9584,27 @@ msgstr "Holikachuk"
|
||||
|
||||
#. name for hoj
|
||||
msgid "Hadothi"
|
||||
msgstr ""
|
||||
msgstr "Hadothi"
|
||||
|
||||
#. name for hol
|
||||
msgid "Holu"
|
||||
msgstr ""
|
||||
msgstr "Holu"
|
||||
|
||||
#. name for hom
|
||||
msgid "Homa"
|
||||
msgstr ""
|
||||
msgstr "Homa"
|
||||
|
||||
#. name for hoo
|
||||
msgid "Holoholo"
|
||||
msgstr ""
|
||||
msgstr "Holoholo"
|
||||
|
||||
#. name for hop
|
||||
msgid "Hopi"
|
||||
msgstr ""
|
||||
msgstr "Hopi"
|
||||
|
||||
#. name for hor
|
||||
msgid "Horo"
|
||||
msgstr ""
|
||||
msgstr "Horo"
|
||||
|
||||
#. name for hos
|
||||
msgid "Ho Chi Minh City Sign Language"
|
||||
@ -9612,27 +9612,27 @@ msgstr "Lengua de signos de Ho Chi Minh"
|
||||
|
||||
#. name for hot
|
||||
msgid "Hote"
|
||||
msgstr ""
|
||||
msgstr "Hote"
|
||||
|
||||
#. name for hov
|
||||
msgid "Hovongan"
|
||||
msgstr ""
|
||||
msgstr "Hovongan"
|
||||
|
||||
#. name for how
|
||||
msgid "Honi"
|
||||
msgstr ""
|
||||
msgstr "Honi"
|
||||
|
||||
#. name for hoy
|
||||
msgid "Holiya"
|
||||
msgstr ""
|
||||
msgstr "Holiya"
|
||||
|
||||
#. name for hoz
|
||||
msgid "Hozo"
|
||||
msgstr ""
|
||||
msgstr "Hozo"
|
||||
|
||||
#. name for hpo
|
||||
msgid "Hpon"
|
||||
msgstr ""
|
||||
msgstr "Hpon"
|
||||
|
||||
#. name for hps
|
||||
msgid "Hawai'i Pidgin Sign Language"
|
||||
@ -9640,15 +9640,15 @@ msgstr "Lengua de signos pidyin hawaiana"
|
||||
|
||||
#. name for hra
|
||||
msgid "Hrangkhol"
|
||||
msgstr ""
|
||||
msgstr "Hrangkhol"
|
||||
|
||||
#. name for hre
|
||||
msgid "Hre"
|
||||
msgstr ""
|
||||
msgstr "Hre"
|
||||
|
||||
#. name for hrk
|
||||
msgid "Haruku"
|
||||
msgstr ""
|
||||
msgstr "Haruku"
|
||||
|
||||
#. name for hrm
|
||||
msgid "Miao; Horned"
|
||||
@ -9656,19 +9656,19 @@ msgstr ""
|
||||
|
||||
#. name for hro
|
||||
msgid "Haroi"
|
||||
msgstr ""
|
||||
msgstr "Haroi"
|
||||
|
||||
#. name for hrr
|
||||
msgid "Horuru"
|
||||
msgstr ""
|
||||
msgstr "Horuru"
|
||||
|
||||
#. name for hrt
|
||||
msgid "Hértevin"
|
||||
msgstr ""
|
||||
msgstr "Hértevin"
|
||||
|
||||
#. name for hru
|
||||
msgid "Hruso"
|
||||
msgstr ""
|
||||
msgstr "Hruso"
|
||||
|
||||
#. name for hrv
|
||||
msgid "Croatian"
|
||||
|
@ -12,14 +12,14 @@ msgstr ""
|
||||
"Report-Msgid-Bugs-To: Debian iso-codes team <pkg-isocodes-"
|
||||
"devel@lists.alioth.debian.org>\n"
|
||||
"POT-Creation-Date: 2011-11-25 14:01+0000\n"
|
||||
"PO-Revision-Date: 2011-09-27 15:44+0000\n"
|
||||
"Last-Translator: IIDA Yosiaki <iida@gnu.org>\n"
|
||||
"PO-Revision-Date: 2012-12-13 13:56+0000\n"
|
||||
"Last-Translator: Shushi Kurose <md81bird@hitaki.net>\n"
|
||||
"Language-Team: Japanese <translation-team-ja@lists.sourceforge.net>\n"
|
||||
"MIME-Version: 1.0\n"
|
||||
"Content-Type: text/plain; charset=UTF-8\n"
|
||||
"Content-Transfer-Encoding: 8bit\n"
|
||||
"X-Launchpad-Export-Date: 2011-11-26 05:21+0000\n"
|
||||
"X-Generator: Launchpad (build 14381)\n"
|
||||
"X-Launchpad-Export-Date: 2012-12-14 05:34+0000\n"
|
||||
"X-Generator: Launchpad (build 16369)\n"
|
||||
"Language: ja\n"
|
||||
|
||||
#. name for aaa
|
||||
@ -86,12 +86,9 @@ msgstr ""
|
||||
msgid "Abnaki; Eastern"
|
||||
msgstr ""
|
||||
|
||||
# 以下「国国」は、国立国会図書館のサイト。
|
||||
# ジブチ
|
||||
# マイペディア「ジブチ」の項に「アファル語」
|
||||
#. name for aar
|
||||
msgid "Afar"
|
||||
msgstr "アファール語"
|
||||
msgstr "アファル語"
|
||||
|
||||
#. name for aas
|
||||
msgid "Aasáx"
|
||||
|
setup/iso_639/ms.po
@ -4,7 +4,7 @@ __license__ = 'GPL v3'
|
||||
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
|
||||
__docformat__ = 'restructuredtext en'
|
||||
__appname__ = u'calibre'
|
||||
numeric_version = (0, 9, 8)
|
||||
numeric_version = (0, 9, 12)
|
||||
__version__ = u'.'.join(map(unicode, numeric_version))
|
||||
__author__ = u"Kovid Goyal <kovid@kovidgoyal.net>"
|
||||
|
||||
@ -100,6 +100,7 @@ class Plugins(collections.Mapping):
|
||||
'freetype',
|
||||
'woff',
|
||||
'unrar',
|
||||
'qt_hack',
|
||||
]
|
||||
if iswindows:
|
||||
plugins.extend(['winutil', 'wpd', 'winfonts'])
|
||||
|
@ -661,7 +661,7 @@ from calibre.devices.nuut2.driver import NUUT2
|
||||
from calibre.devices.iriver.driver import IRIVER_STORY
|
||||
from calibre.devices.binatone.driver import README
|
||||
from calibre.devices.hanvon.driver import (N516, EB511, ALEX, AZBOOKA, THEBOOK,
|
||||
LIBREAIR, ODYSSEY)
|
||||
LIBREAIR, ODYSSEY, KIBANO)
|
||||
from calibre.devices.edge.driver import EDGE
|
||||
from calibre.devices.teclast.driver import (TECLAST_K3, NEWSMY, IPAPYRUS,
|
||||
SOVOS, PICO, SUNSTECH_EB700, ARCHOS7O, STASH, WEXLER)
|
||||
@ -712,7 +712,7 @@ plugins += [
|
||||
BOOQ,
|
||||
EB600,
|
||||
README,
|
||||
N516,
|
||||
N516, KIBANO,
|
||||
THEBOOK, LIBREAIR,
|
||||
EB511,
|
||||
ELONEX,
|
||||
|
@ -121,6 +121,8 @@ def debug(ioreg_to_tmp=False, buf=None, plugins=None,
|
||||
out('\nDisabled plugins:', textwrap.fill(' '.join([x.__class__.__name__ for x in
|
||||
disabled_plugins])))
|
||||
out(' ')
|
||||
else:
|
||||
out('\nNo disabled plugins')
|
||||
found_dev = False
|
||||
for dev in devplugins:
|
||||
if not dev.MANAGES_DEVICE_PRESENCE: continue
|
||||
|
@ -10,7 +10,7 @@ import cStringIO
|
||||
|
||||
from calibre.devices.usbms.driver import USBMS
|
||||
|
||||
HTC_BCDS = [0x100, 0x0222, 0x0226, 0x227, 0x228, 0x229, 0x9999]
|
||||
HTC_BCDS = [0x100, 0x0222, 0x0226, 0x227, 0x228, 0x229, 0x0231, 0x9999]
|
||||
|
||||
class ANDROID(USBMS):
|
||||
|
||||
@ -48,6 +48,7 @@ class ANDROID(USBMS):
|
||||
0x2910 : HTC_BCDS,
|
||||
0xe77 : HTC_BCDS,
|
||||
0xff9 : HTC_BCDS,
|
||||
0x0001 : [0x255],
|
||||
},
|
||||
|
||||
# Eken
|
||||
@ -92,7 +93,7 @@ class ANDROID(USBMS):
|
||||
# Google
|
||||
0x18d1 : {
|
||||
0x0001 : [0x0223, 0x230, 0x9999],
|
||||
0x0003 : [0x0230],
|
||||
0x0003 : [0x0230, 0x9999],
|
||||
0x4e11 : [0x0100, 0x226, 0x227],
|
||||
0x4e12 : [0x0100, 0x226, 0x227],
|
||||
0x4e21 : [0x0100, 0x226, 0x227, 0x231],
|
||||
@ -212,7 +213,8 @@ class ANDROID(USBMS):
|
||||
'VIZIO', 'GOOGLE', 'FREESCAL', 'KOBO_INC', 'LENOVO', 'ROCKCHIP',
|
||||
'POCKET', 'ONDA_MID', 'ZENITHIN', 'INGENIC', 'PMID701C', 'PD',
|
||||
'PMP5097C', 'MASS', 'NOVO7', 'ZEKI', 'COBY', 'SXZ', 'USB_2.0',
|
||||
'COBY_MID', 'VS', 'AINOL', 'TOPWISE', 'PAD703']
|
||||
'COBY_MID', 'VS', 'AINOL', 'TOPWISE', 'PAD703', 'NEXT8D12',
|
||||
'MEDIATEK']
|
||||
WINDOWS_MAIN_MEM = ['ANDROID_PHONE', 'A855', 'A853', 'INC.NEXUS_ONE',
|
||||
'__UMS_COMPOSITE', '_MB200', 'MASS_STORAGE', '_-_CARD', 'SGH-I897',
|
||||
'GT-I9000', 'FILE-STOR_GADGET', 'SGH-T959_CARD', 'SGH-T959', 'SAMSUNG_ANDROID',
|
||||
@ -232,7 +234,7 @@ class ANDROID(USBMS):
|
||||
'THINKPAD_TABLET', 'SGH-T989', 'YP-G70', 'STORAGE_DEVICE',
|
||||
'ADVANCED', 'SGH-I727', 'USB_FLASH_DRIVER', 'ANDROID',
|
||||
'S5830I_CARD', 'MID7042', 'LINK-CREATE', '7035', 'VIEWPAD_7E',
|
||||
'NOVO7', 'MB526', '_USB#WYK7MSF8KE', 'TABLET_PC']
|
||||
'NOVO7', 'MB526', '_USB#WYK7MSF8KE', 'TABLET_PC', 'F', 'MT65XX_MS']
|
||||
WINDOWS_CARD_A_MEM = ['ANDROID_PHONE', 'GT-I9000_CARD', 'SGH-I897',
|
||||
'FILE-STOR_GADGET', 'SGH-T959_CARD', 'SGH-T959', 'SAMSUNG_ANDROID', 'GT-P1000_CARD',
|
||||
'A70S', 'A101IT', '7', 'INCREDIBLE', 'A7EB', 'SGH-T849_CARD',
|
||||
@ -243,7 +245,7 @@ class ANDROID(USBMS):
|
||||
'FILE-CD_GADGET', 'GT-I9001_CARD', 'USB_2.0', 'XT875',
|
||||
'UMS_COMPOSITE', 'PRO', '.KOBO_VOX', 'SGH-T989_CARD', 'SGH-I727',
|
||||
'USB_FLASH_DRIVER', 'ANDROID', 'MID7042', '7035', 'VIEWPAD_7E',
|
||||
'NOVO7', 'ADVANCED', 'TABLET_PC']
|
||||
'NOVO7', 'ADVANCED', 'TABLET_PC', 'F']
|
||||
|
||||
OSX_MAIN_MEM = 'Android Device Main Memory'
|
||||
|
||||
|
@ -41,6 +41,20 @@ class N516(USBMS):
|
||||
def can_handle(self, device_info, debug=False):
|
||||
return not is_alex(device_info)
|
||||
|
||||
class KIBANO(N516):
|
||||
|
||||
name = 'Kibano driver'
|
||||
gui_name = 'Kibano'
|
||||
description = _('Communicate with the Kibano eBook reader.')
|
||||
FORMATS = ['epub', 'pdf', 'txt']
|
||||
BCD = [0x323]
|
||||
|
||||
VENDOR_NAME = 'EBOOK'
|
||||
# We use EXTERNAL_SD_CARD for main mem as some devices have non-working
# main memories
|
||||
WINDOWS_MAIN_MEM = WINDOWS_CARD_A_MEM = ['INTERNAL_SD_CARD',
|
||||
'EXTERNAL_SD_CARD']
|
||||
|
||||
class THEBOOK(N516):
|
||||
name = 'The Book driver'
|
||||
gui_name = 'The Book'
|
||||
|
@ -199,6 +199,11 @@ class KTCollectionsBookList(CollectionsBookList):
|
||||
('series' in collection_attributes and
|
||||
book.get('series', None) == category):
|
||||
is_series = True
|
||||
|
||||
# The category should not be None, but, it has happened.
|
||||
if not category:
|
||||
continue
|
||||
|
||||
cat_name = category.strip(' ,')
|
||||
|
||||
if cat_name not in collections:
@ -1537,7 +1537,11 @@ class KOBOTOUCH(KOBO):
return bookshelves

cursor = connection.cursor()
query = "select ShelfName from ShelfContent where ContentId = ? and _IsDeleted = 'false'"
query = "select ShelfName " \
"from ShelfContent " \
"where ContentId = ? " \
"and _IsDeleted = 'false' " \
"and ShelfName is not null" # This should never be null, but it is protection against an error caused by a sync to the Kobo server
values = (ContentID, )
cursor.execute(query, values)
for i, row in enumerate(cursor):
@ -2357,6 +2361,8 @@ class KOBOTOUCH(KOBO):
update_query = 'UPDATE content SET Series=?, SeriesNumber==? where BookID is Null and ContentID = ?'
if book.series is None:
update_values = (None, None, book.contentID, )
elif book.series_index is None: # This should never happen, but...
update_values = (book.series, None, book.contentID, )
else:
update_values = (book.series, "%g"%book.series_index, book.contentID, )
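A side note on the series update above: the "%g" format keeps whole-number series indices free of a trailing .0 while leaving fractional indices intact. A quick illustrative check (the values are arbitrary):

    assert "%g" % 1.0 == '1'
    assert "%g" % 2.5 == '2.5'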
@ -13,6 +13,7 @@ from itertools import izip

from calibre import prints
from calibre.constants import iswindows, numeric_version
from calibre.devices.errors import PathError
from calibre.devices.mtp.base import debug
from calibre.devices.mtp.defaults import DeviceDefaults
from calibre.ptempfile import SpooledTemporaryFile, PersistentTemporaryDirectory
@ -23,6 +24,12 @@ from calibre.utils.filenames import shorten_components_to
BASE = importlib.import_module('calibre.devices.mtp.%s.driver'%(
'windows' if iswindows else 'unix')).MTP_DEVICE

class MTPInvalidSendPathError(PathError):

def __init__(self, folder):
PathError.__init__(self, 'Trying to send to ignored folder: %s'%folder)
self.folder = folder

class MTP_DEVICE(BASE):

METADATA_CACHE = 'metadata.calibre'
@ -46,6 +53,7 @@ class MTP_DEVICE(BASE):
self._prefs = None
self.device_defaults = DeviceDefaults()
self.current_device_defaults = {}
self.highlight_ignored_folders = False

@property
def prefs(self):
@ -59,9 +67,25 @@ class MTP_DEVICE(BASE):
p.defaults['blacklist'] = []
p.defaults['history'] = {}
p.defaults['rules'] = []
p.defaults['ignored_folders'] = {}

return self._prefs

def is_folder_ignored(self, storage_or_storage_id, name,
ignored_folders=None):
storage_id = unicode(getattr(storage_or_storage_id, 'object_id',
storage_or_storage_id))
name = icu_lower(name)
if ignored_folders is None:
ignored_folders = self.get_pref('ignored_folders')
if storage_id in ignored_folders:
return name in {icu_lower(x) for x in ignored_folders[storage_id]}

return name in {
'alarms', 'android', 'dcim', 'movies', 'music', 'notifications',
'pictures', 'ringtones', 'samsung', 'sony', 'htc', 'bluetooth',
'games', 'lost.dir', 'video', 'whatsapp', 'image'}
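A minimal sketch of how the per-storage ignore list interacts with the built-in defaults above; the storage id and folder names here are illustrative assumptions, not values read from a real device:

    # Hypothetical preference data: storage id '65537' carries its own ignore list.
    ignored = {'65537': ['Podcasts', 'Audiobooks']}

    # dev.is_folder_ignored('65537', 'podcasts', ignored_folders=ignored)  -> True
    # dev.is_folder_ignored('65537', 'music', ignored_folders=ignored)     -> False (the custom list replaces the defaults)
    # dev.is_folder_ignored('99999', 'music', ignored_folders=ignored)     -> True  (no custom list, so the default set applies)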
def configure_for_kindle_app(self):
|
||||
proxy = self.prefs
|
||||
with proxy:
|
||||
@ -371,6 +395,8 @@ class MTP_DEVICE(BASE):
|
||||
|
||||
for infile, fname, mi in izip(files, names, metadata):
|
||||
path = self.create_upload_path(prefix, mi, fname, routing)
|
||||
if path and self.is_folder_ignored(storage, path[0]):
|
||||
raise MTPInvalidSendPathError(path[0])
|
||||
parent = self.ensure_parent(storage, path)
|
||||
if hasattr(infile, 'read'):
|
||||
pos = infile.tell()
|
||||
@ -472,7 +498,7 @@ class MTP_DEVICE(BASE):
|
||||
|
||||
def config_widget(self):
|
||||
from calibre.gui2.device_drivers.mtp_config import MTPConfig
|
||||
return MTPConfig(self)
|
||||
return MTPConfig(self, highlight_ignored_folders=self.highlight_ignored_folders)
|
||||
|
||||
def save_settings(self, cw):
|
||||
cw.commit()
|
||||
|
@ -239,12 +239,12 @@ class TestDeviceInteraction(unittest.TestCase):
|
||||
|
||||
# Test get_filesystem
|
||||
used_by_one = self.measure_memory_usage(1,
|
||||
self.dev.dev.get_filesystem, self.storage.object_id, lambda x:
|
||||
x)
|
||||
self.dev.dev.get_filesystem, self.storage.object_id,
|
||||
lambda x, l:True)
|
||||
|
||||
used_by_many = self.measure_memory_usage(5,
|
||||
self.dev.dev.get_filesystem, self.storage.object_id, lambda x:
|
||||
x)
|
||||
self.dev.dev.get_filesystem, self.storage.object_id,
|
||||
lambda x, l: True)
|
||||
|
||||
self.check_memory(used_by_one, used_by_many,
|
||||
'Memory consumption during get_filesystem')
|
||||
|
@ -13,6 +13,8 @@ const calibre_device_entry_t calibre_mtp_device_table[] = {
|
||||
|
||||
// Amazon Kindle Fire HD
|
||||
, { "Amazon", 0x1949, "Fire HD", 0x0007, DEVICE_FLAGS_ANDROID_BUGS}
|
||||
, { "Amazon", 0x1949, "Fire HD", 0x0008, DEVICE_FLAGS_ANDROID_BUGS}
|
||||
, { "Amazon", 0x1949, "Fire HD", 0x000a, DEVICE_FLAGS_ANDROID_BUGS}
|
||||
|
||||
// Nexus 10
|
||||
, { "Google", 0x18d1, "Nexus 10", 0x4ee2, DEVICE_FLAGS_ANDROID_BUGS}
|
||||
|
@ -212,8 +212,13 @@ class MTP_DEVICE(MTPDeviceBase):
ans += pprint.pformat(storage)
return ans

def _filesystem_callback(self, entry):
self.filesystem_callback(_('Found object: %s')%entry.get('name', ''))
def _filesystem_callback(self, entry, level):
name = entry.get('name', '')
self.filesystem_callback(_('Found object: %s')%name)
if (level == 0 and
self.is_folder_ignored(self._currently_getting_sid, name)):
return False
return True

@property
def filesystem_cache(self):
@ -234,6 +239,7 @@ class MTP_DEVICE(MTPDeviceBase):
storage.append({'id':sid, 'size':capacity,
'is_folder':True, 'name':name, 'can_delete':False,
'is_system':True})
self._currently_getting_sid = unicode(sid)
items, errs = self.dev.get_filesystem(sid,
self._filesystem_callback)
all_items.extend(items), all_errs.extend(errs)
@ -8,7 +8,9 @@
|
||||
|
||||
#define UNICODE
|
||||
#include <Python.h>
|
||||
|
||||
#include <sys/types.h>
|
||||
#include <sys/stat.h>
|
||||
#include <fcntl.h>
|
||||
#include <stdlib.h>
|
||||
#include <libmtp.h>
|
||||
|
||||
@ -122,7 +124,7 @@ static PyObject* build_file_metadata(LIBMTP_file_t *nf, uint32_t storage_id) {
|
||||
PyObject *ans = NULL;
|
||||
|
||||
ans = Py_BuildValue("{s:s, s:k, s:k, s:k, s:K, s:L, s:O}",
|
||||
"name", (unsigned long)nf->filename,
|
||||
"name", nf->filename,
|
||||
"id", (unsigned long)nf->item_id,
|
||||
"parent_id", (unsigned long)nf->parent_id,
|
||||
"storage_id", (unsigned long)storage_id,
|
||||
@ -357,10 +359,10 @@ Device_storage_info(Device *self, void *closure) {
|
||||
|
||||
// Device.get_filesystem {{{
|
||||
|
||||
static int recursive_get_files(LIBMTP_mtpdevice_t *dev, uint32_t storage_id, uint32_t parent_id, PyObject *ans, PyObject *errs, PyObject *callback) {
|
||||
static int recursive_get_files(LIBMTP_mtpdevice_t *dev, uint32_t storage_id, uint32_t parent_id, PyObject *ans, PyObject *errs, PyObject *callback, unsigned int level) {
|
||||
LIBMTP_file_t *f, *files;
|
||||
PyObject *entry;
|
||||
int ok = 1;
|
||||
PyObject *entry, *r;
|
||||
int ok = 1, recurse;
|
||||
|
||||
Py_BEGIN_ALLOW_THREADS;
|
||||
files = LIBMTP_Get_Files_And_Folders(dev, storage_id, parent_id);
|
||||
@ -372,13 +374,15 @@ static int recursive_get_files(LIBMTP_mtpdevice_t *dev, uint32_t storage_id, uin
|
||||
entry = build_file_metadata(f, storage_id);
|
||||
if (entry == NULL) { ok = 0; }
|
||||
else {
|
||||
Py_XDECREF(PyObject_CallFunctionObjArgs(callback, entry, NULL));
|
||||
r = PyObject_CallFunction(callback, "OI", entry, level);
|
||||
recurse = (r != NULL && PyObject_IsTrue(r)) ? 1 : 0;
|
||||
Py_XDECREF(r);
|
||||
if (PyList_Append(ans, entry) != 0) { ok = 0; }
|
||||
Py_DECREF(entry);
|
||||
}
|
||||
|
||||
if (ok && f->filetype == LIBMTP_FILETYPE_FOLDER) {
|
||||
if (!recursive_get_files(dev, storage_id, f->item_id, ans, errs, callback)) {
|
||||
if (ok && recurse && f->filetype == LIBMTP_FILETYPE_FOLDER) {
|
||||
if (!recursive_get_files(dev, storage_id, f->item_id, ans, errs, callback, level+1)) {
|
||||
ok = 0;
|
||||
}
|
||||
}
|
||||
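The same pruning idea, sketched in Python for readability (an illustrative analogue of the C change, not the implementation itself):

    def walk(entries, callback, level=0):
        # Report every entry; descend into a folder only when the callback asks for it.
        for entry in entries:
            recurse = callback(entry, level)
            if recurse and entry.get('is_folder') and entry.get('children'):
                walk(entry['children'], callback, level + 1)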
@ -408,7 +412,7 @@ Device_get_filesystem(Device *self, PyObject *args) {
|
||||
if (errs == NULL || ans == NULL) { PyErr_NoMemory(); return NULL; }
|
||||
|
||||
LIBMTP_Clear_Errorstack(self->device);
|
||||
ok = recursive_get_files(self->device, (uint32_t)storage_id, 0, ans, errs, callback);
|
||||
ok = recursive_get_files(self->device, (uint32_t)storage_id, 0xFFFFFFFF, ans, errs, callback, 0);
|
||||
dump_errorstack(self->device, errs);
|
||||
if (!ok) {
|
||||
Py_DECREF(ans);
|
||||
@ -537,7 +541,7 @@ static PyMethodDef Device_methods[] = {
},

{"get_filesystem", (PyCFunction)Device_get_filesystem, METH_VARARGS,
"get_filesystem(storage_id, callback) -> Get the list of files and folders on the device in storage_id. Returns files, errors. callback must be a callable that accepts a single argument. It is called with every found object."
"get_filesystem(storage_id, callback) -> Get the list of files and folders on the device in storage_id. Returns files, errors. callback must be a callable that is called as callback(entry, level). It is called with every found object. If callback returns False and the object is a folder, it is not recursed into."
},
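A sketch of a callback that satisfies the contract described in the new docstring; the folder names used for pruning are arbitrary examples:

    def on_entry(entry, level):
        # Called for every object found; returning False for a top-level folder
        # stops get_filesystem from recursing into it.
        name = entry.get('name', '')
        if level == 0 and name.lower() in {'music', 'pictures', 'dcim'}:
            return False
        return True

    # files, errors = dev.get_filesystem(storage_id, on_entry)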
{"get_file", (PyCFunction)Device_get_file, METH_VARARGS,
|
||||
@ -726,7 +730,20 @@ initlibmtp(void) {
|
||||
if (MTPError == NULL) return;
|
||||
PyModule_AddObject(m, "MTPError", MTPError);
|
||||
|
||||
// Redirect stdout to get rid of the annoying message about mtpz. Really,
|
||||
// who designs a library without any way to control/redirect the debugging
|
||||
// output, and hardcoded paths that cannot be changed?
|
||||
int bak, new;
|
||||
fflush(stdout);
|
||||
bak = dup(STDOUT_FILENO);
|
||||
new = open("/dev/null", O_WRONLY);
|
||||
dup2(new, STDOUT_FILENO);
|
||||
close(new);
|
||||
LIBMTP_Init();
|
||||
fflush(stdout);
|
||||
dup2(bak, STDOUT_FILENO);
|
||||
close(bak);
|
||||
|
||||
LIBMTP_Set_Debug(LIBMTP_DEBUG_NONE);
|
||||
|
||||
Py_INCREF(&DeviceType);
|
||||
|
@ -133,12 +133,14 @@ class GetBulkCallback : public IPortableDevicePropertiesBulkCallback {
|
||||
|
||||
public:
|
||||
PyObject *items;
|
||||
PyObject *subfolders;
|
||||
unsigned int level;
|
||||
HANDLE complete;
|
||||
ULONG self_ref;
|
||||
PyThreadState *thread_state;
|
||||
PyObject *callback;
|
||||
|
||||
GetBulkCallback(PyObject *items_dict, HANDLE ev, PyObject* pycallback) : items(items_dict), complete(ev), self_ref(1), thread_state(NULL), callback(pycallback) {}
|
||||
GetBulkCallback(PyObject *items_dict, PyObject *subfolders, unsigned int level, HANDLE ev, PyObject* pycallback) : items(items_dict), subfolders(subfolders), level(level), complete(ev), self_ref(1), thread_state(NULL), callback(pycallback) {}
|
||||
~GetBulkCallback() {}
|
||||
|
||||
HRESULT __stdcall OnStart(REFGUID Context) { return S_OK; }
|
||||
@ -172,7 +174,7 @@ public:
|
||||
DWORD num = 0, i;
|
||||
wchar_t *property = NULL;
|
||||
IPortableDeviceValues *properties = NULL;
|
||||
PyObject *temp, *obj;
|
||||
PyObject *temp, *obj, *r;
|
||||
HRESULT hr;
|
||||
|
||||
if (SUCCEEDED(values->GetCount(&num))) {
|
||||
@ -196,7 +198,11 @@ public:
|
||||
Py_DECREF(temp);
|
||||
|
||||
set_properties(obj, properties);
|
||||
Py_XDECREF(PyObject_CallFunctionObjArgs(callback, obj, NULL));
|
||||
r = PyObject_CallFunction(callback, "OI", obj, this->level);
|
||||
if (r != NULL && PyObject_IsTrue(r)) {
|
||||
PyList_Append(this->subfolders, PyDict_GetItemString(obj, "id"));
|
||||
}
|
||||
Py_XDECREF(r);
|
||||
|
||||
properties->Release(); properties = NULL;
|
||||
}
|
||||
@ -209,8 +215,7 @@ public:
|
||||
|
||||
};
|
||||
|
||||
static PyObject* bulk_get_filesystem(IPortableDevice *device, IPortableDevicePropertiesBulk *bulk_properties, const wchar_t *storage_id, IPortableDevicePropVariantCollection *object_ids, PyObject *pycallback) {
|
||||
PyObject *folders = NULL;
|
||||
static bool bulk_get_filesystem(unsigned int level, IPortableDevice *device, IPortableDevicePropertiesBulk *bulk_properties, IPortableDevicePropVariantCollection *object_ids, PyObject *pycallback, PyObject *ans, PyObject *subfolders) {
|
||||
GUID guid_context = GUID_NULL;
|
||||
HANDLE ev = NULL;
|
||||
IPortableDeviceKeyCollection *properties;
|
||||
@ -218,18 +223,15 @@ static PyObject* bulk_get_filesystem(IPortableDevice *device, IPortableDevicePro
|
||||
HRESULT hr;
|
||||
DWORD wait_result;
|
||||
int pump_result;
|
||||
BOOL ok = TRUE;
|
||||
bool ok = true;
|
||||
|
||||
ev = CreateEvent(NULL, FALSE, FALSE, NULL);
|
||||
if (ev == NULL) return PyErr_NoMemory();
|
||||
|
||||
folders = PyDict_New();
|
||||
if (folders == NULL) {PyErr_NoMemory(); goto end;}
|
||||
if (ev == NULL) {PyErr_NoMemory(); return false; }
|
||||
|
||||
properties = create_filesystem_properties_collection();
|
||||
if (properties == NULL) goto end;
|
||||
|
||||
callback = new (std::nothrow) GetBulkCallback(folders, ev, pycallback);
|
||||
callback = new (std::nothrow) GetBulkCallback(ans, subfolders, level, ev, pycallback);
|
||||
if (callback == NULL) { PyErr_NoMemory(); goto end; }
|
||||
|
||||
hr = bulk_properties->QueueGetValuesByObjectList(object_ids, properties, callback, &guid_context);
|
||||
@ -245,13 +247,13 @@ static PyObject* bulk_get_filesystem(IPortableDevice *device, IPortableDevicePro
|
||||
break; // Event was signalled, bulk operation complete
|
||||
} else if (wait_result == WAIT_OBJECT_0 + 1) { // Messages need to be dispatched
|
||||
pump_result = pump_waiting_messages();
|
||||
if (pump_result == 1) { PyErr_SetString(PyExc_RuntimeError, "Application has been asked to quit."); ok = FALSE; break;}
|
||||
if (pump_result == 1) { PyErr_SetString(PyExc_RuntimeError, "Application has been asked to quit."); ok = false; break;}
|
||||
} else if (wait_result == WAIT_TIMEOUT) {
|
||||
// 60 seconds with no updates, looks bad
|
||||
PyErr_SetString(WPDError, "The device seems to have hung."); ok = FALSE; break;
|
||||
PyErr_SetString(WPDError, "The device seems to have hung."); ok = false; break;
|
||||
} else if (wait_result == WAIT_ABANDONED_0) {
|
||||
// This should never happen
|
||||
PyErr_SetString(WPDError, "An unknown error occurred (mutex abandoned)"); ok = FALSE; break;
|
||||
PyErr_SetString(WPDError, "An unknown error occurred (mutex abandoned)"); ok = false; break;
|
||||
} else {
|
||||
// The wait failed for some reason
|
||||
PyErr_SetFromWindowsErr(0); ok = FALSE; break;
|
||||
@ -261,22 +263,21 @@ static PyObject* bulk_get_filesystem(IPortableDevice *device, IPortableDevicePro
|
||||
if (!ok) {
|
||||
bulk_properties->Cancel(guid_context);
|
||||
pump_waiting_messages();
|
||||
Py_DECREF(folders); folders = NULL;
|
||||
}
|
||||
end:
|
||||
if (ev != NULL) CloseHandle(ev);
|
||||
if (properties != NULL) properties->Release();
|
||||
if (callback != NULL) callback->Release();
|
||||
|
||||
return folders;
|
||||
return ok;
|
||||
}
|
||||
|
||||
// }}}
|
||||
|
||||
// find_all_objects_in() {{{
|
||||
static BOOL find_all_objects_in(IPortableDeviceContent *content, IPortableDevicePropVariantCollection *object_ids, const wchar_t *parent_id, PyObject *callback) {
|
||||
// find_objects_in() {{{
|
||||
static bool find_objects_in(IPortableDeviceContent *content, IPortableDevicePropVariantCollection *object_ids, const wchar_t *parent_id) {
|
||||
/*
|
||||
* Find all children of the object identified by parent_id, recursively.
|
||||
* Find all children of the object identified by parent_id.
|
||||
* The child ids are put into object_ids. Returns False if any errors
|
||||
* occurred (also sets the python exception).
|
||||
*/
|
||||
@ -285,8 +286,7 @@ static BOOL find_all_objects_in(IPortableDeviceContent *content, IPortableDevice
|
||||
PWSTR child_ids[10];
|
||||
DWORD fetched, i;
|
||||
PROPVARIANT pv;
|
||||
BOOL ok = 1;
|
||||
PyObject *id;
|
||||
bool ok = true;
|
||||
|
||||
PropVariantInit(&pv);
|
||||
pv.vt = VT_LPWSTR;
|
||||
@ -295,7 +295,7 @@ static BOOL find_all_objects_in(IPortableDeviceContent *content, IPortableDevice
|
||||
hr = content->EnumObjects(0, parent_id, NULL, &children);
|
||||
Py_END_ALLOW_THREADS;
|
||||
|
||||
if (FAILED(hr)) {hresult_set_exc("Failed to get children from device", hr); ok = 0; goto end;}
|
||||
if (FAILED(hr)) {hresult_set_exc("Failed to get children from device", hr); ok = false; goto end;}
|
||||
|
||||
hr = S_OK;
|
||||
|
||||
@ -306,19 +306,12 @@ static BOOL find_all_objects_in(IPortableDeviceContent *content, IPortableDevice
|
||||
if (SUCCEEDED(hr)) {
|
||||
for(i = 0; i < fetched; i++) {
|
||||
pv.pwszVal = child_ids[i];
|
||||
id = wchar_to_unicode(pv.pwszVal);
|
||||
if (id != NULL) {
|
||||
Py_XDECREF(PyObject_CallFunctionObjArgs(callback, id, NULL));
|
||||
Py_DECREF(id);
|
||||
}
|
||||
hr2 = object_ids->Add(&pv);
|
||||
pv.pwszVal = NULL;
|
||||
if (FAILED(hr2)) { hresult_set_exc("Failed to add child ids to propvariantcollection", hr2); break; }
|
||||
ok = find_all_objects_in(content, object_ids, child_ids[i], callback);
|
||||
if (!ok) break;
|
||||
}
|
||||
for (i = 0; i < fetched; i++) { CoTaskMemFree(child_ids[i]); child_ids[i] = NULL; }
|
||||
if (FAILED(hr2) || !ok) { ok = 0; goto end; }
|
||||
if (FAILED(hr2) || !ok) { ok = false; goto end; }
|
||||
}
|
||||
}
|
||||
|
||||
@ -340,13 +333,8 @@ static PyObject* get_object_properties(IPortableDeviceProperties *devprops, IPor
|
||||
Py_END_ALLOW_THREADS;
|
||||
if (FAILED(hr)) { hresult_set_exc("Failed to get properties for object", hr); goto end; }
|
||||
|
||||
temp = wchar_to_unicode(object_id);
|
||||
if (temp == NULL) goto end;
|
||||
|
||||
ans = PyDict_New();
|
||||
if (ans == NULL) { PyErr_NoMemory(); goto end; }
|
||||
if (PyDict_SetItemString(ans, "id", temp) != 0) { Py_DECREF(ans); ans = NULL; PyErr_NoMemory(); goto end; }
|
||||
|
||||
ans = Py_BuildValue("{s:N}", "id", wchar_to_unicode(object_id));
|
||||
if (ans == NULL) goto end;
|
||||
set_properties(ans, values);
|
||||
|
||||
end:
|
||||
@ -355,12 +343,12 @@ end:
|
||||
return ans;
|
||||
}
|
||||
|
||||
static PyObject* single_get_filesystem(IPortableDeviceContent *content, const wchar_t *storage_id, IPortableDevicePropVariantCollection *object_ids, PyObject *callback) {
|
||||
static bool single_get_filesystem(unsigned int level, IPortableDeviceContent *content, IPortableDevicePropVariantCollection *object_ids, PyObject *callback, PyObject *ans, PyObject *subfolders) {
|
||||
DWORD num, i;
|
||||
PROPVARIANT pv;
|
||||
HRESULT hr;
|
||||
BOOL ok = 1;
|
||||
PyObject *ans = NULL, *item = NULL;
|
||||
bool ok = true;
|
||||
PyObject *item = NULL, *r = NULL, *recurse = NULL;
|
||||
IPortableDeviceProperties *devprops = NULL;
|
||||
IPortableDeviceKeyCollection *properties = NULL;
|
||||
|
||||
@ -373,32 +361,36 @@ static PyObject* single_get_filesystem(IPortableDeviceContent *content, const wc
|
||||
hr = object_ids->GetCount(&num);
|
||||
if (FAILED(hr)) { hresult_set_exc("Failed to get object id count", hr); goto end; }
|
||||
|
||||
ans = PyDict_New();
|
||||
if (ans == NULL) goto end;
|
||||
|
||||
for (i = 0; i < num; i++) {
|
||||
ok = 0;
|
||||
ok = false;
|
||||
recurse = NULL;
|
||||
PropVariantInit(&pv);
|
||||
hr = object_ids->GetAt(i, &pv);
|
||||
if (SUCCEEDED(hr) && pv.pwszVal != NULL) {
|
||||
item = get_object_properties(devprops, properties, pv.pwszVal);
|
||||
if (item != NULL) {
|
||||
Py_XDECREF(PyObject_CallFunctionObjArgs(callback, item, NULL));
|
||||
r = PyObject_CallFunction(callback, "OI", item, level);
|
||||
if (r != NULL && PyObject_IsTrue(r)) recurse = item;
|
||||
Py_XDECREF(r);
|
||||
PyDict_SetItem(ans, PyDict_GetItemString(item, "id"), item);
|
||||
Py_DECREF(item); item = NULL;
|
||||
ok = 1;
|
||||
ok = true;
|
||||
}
|
||||
} else hresult_set_exc("Failed to get item from IPortableDevicePropVariantCollection", hr);
|
||||
|
||||
PropVariantClear(&pv);
|
||||
if (!ok) { Py_DECREF(ans); ans = NULL; break; }
|
||||
if (!ok) break;
|
||||
if (recurse != NULL) {
|
||||
if (PyList_Append(subfolders, PyDict_GetItemString(recurse, "id")) == -1) ok = false;
|
||||
}
|
||||
if (!ok) break;
|
||||
}
|
||||
|
||||
end:
|
||||
if (devprops != NULL) devprops->Release();
|
||||
if (properties != NULL) properties->Release();
|
||||
|
||||
return ans;
|
||||
return ok;
|
||||
}
|
||||
// }}}
|
||||
|
||||
@ -438,35 +430,60 @@ end:
|
||||
return values;
|
||||
} // }}}
|
||||
|
||||
PyObject* wpd::get_filesystem(IPortableDevice *device, const wchar_t *storage_id, IPortableDevicePropertiesBulk *bulk_properties, PyObject *callback) { // {{{
|
||||
PyObject *folders = NULL;
|
||||
static bool get_files_and_folders(unsigned int level, IPortableDevice *device, IPortableDeviceContent *content, IPortableDevicePropertiesBulk *bulk_properties, const wchar_t *parent_id, PyObject *callback, PyObject *ans) { // {{{
|
||||
bool ok = true;
|
||||
IPortableDevicePropVariantCollection *object_ids = NULL;
|
||||
PyObject *subfolders = NULL;
|
||||
HRESULT hr;
|
||||
|
||||
subfolders = PyList_New(0);
|
||||
if (subfolders == NULL) { ok = false; goto end; }
|
||||
|
||||
Py_BEGIN_ALLOW_THREADS;
|
||||
hr = CoCreateInstance(CLSID_PortableDevicePropVariantCollection, NULL,
|
||||
CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&object_ids));
|
||||
Py_END_ALLOW_THREADS;
|
||||
if (FAILED(hr)) { hresult_set_exc("Failed to create propvariantcollection", hr); ok = false; goto end; }
|
||||
|
||||
ok = find_objects_in(content, object_ids, parent_id);
|
||||
if (!ok) goto end;
|
||||
|
||||
if (bulk_properties != NULL) ok = bulk_get_filesystem(level, device, bulk_properties, object_ids, callback, ans, subfolders);
|
||||
else ok = single_get_filesystem(level, content, object_ids, callback, ans, subfolders);
|
||||
if (!ok) goto end;
|
||||
|
||||
for (Py_ssize_t i = 0; i < PyList_GET_SIZE(subfolders); i++) {
|
||||
const wchar_t *child_id = unicode_to_wchar(PyList_GET_ITEM(subfolders, i));
|
||||
if (child_id == NULL) { ok = false; break; }
|
||||
ok = get_files_and_folders(level+1, device, content, bulk_properties, child_id, callback, ans);
|
||||
if (!ok) break;
|
||||
}
|
||||
end:
|
||||
if (object_ids != NULL) object_ids->Release();
|
||||
Py_XDECREF(subfolders);
|
||||
return ok;
|
||||
} // }}}
|
||||
|
||||
PyObject* wpd::get_filesystem(IPortableDevice *device, const wchar_t *storage_id, IPortableDevicePropertiesBulk *bulk_properties, PyObject *callback) { // {{{
|
||||
PyObject *ans = NULL;
|
||||
IPortableDeviceContent *content = NULL;
|
||||
HRESULT hr;
|
||||
BOOL ok;
|
||||
|
||||
ans = PyDict_New();
|
||||
if (ans == NULL) return PyErr_NoMemory();
|
||||
|
||||
Py_BEGIN_ALLOW_THREADS;
|
||||
hr = device->Content(&content);
|
||||
Py_END_ALLOW_THREADS;
|
||||
if (FAILED(hr)) { hresult_set_exc("Failed to create content interface", hr); goto end; }
|
||||
|
||||
Py_BEGIN_ALLOW_THREADS;
|
||||
hr = CoCreateInstance(CLSID_PortableDevicePropVariantCollection, NULL,
|
||||
CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&object_ids));
|
||||
Py_END_ALLOW_THREADS;
|
||||
if (FAILED(hr)) { hresult_set_exc("Failed to create propvariantcollection", hr); goto end; }
|
||||
|
||||
ok = find_all_objects_in(content, object_ids, storage_id, callback);
|
||||
if (!ok) goto end;
|
||||
|
||||
if (bulk_properties != NULL) folders = bulk_get_filesystem(device, bulk_properties, storage_id, object_ids, callback);
|
||||
else folders = single_get_filesystem(content, storage_id, object_ids, callback);
|
||||
if (!get_files_and_folders(0, device, content, bulk_properties, storage_id, callback, ans)) {
|
||||
Py_DECREF(ans); ans = NULL;
|
||||
}
|
||||
|
||||
end:
|
||||
if (content != NULL) content->Release();
|
||||
if (object_ids != NULL) object_ids->Release();
|
||||
|
||||
return folders;
|
||||
return ans;
|
||||
} // }}}
|
||||
|
||||
PyObject* wpd::get_file(IPortableDevice *device, const wchar_t *object_id, PyObject *dest, PyObject *callback) { // {{{
|
||||
|
@ -164,7 +164,7 @@ static PyMethodDef Device_methods[] = {
|
||||
},
|
||||
|
||||
{"get_filesystem", (PyCFunction)py_get_filesystem, METH_VARARGS,
|
||||
"get_filesystem(storage_id, callback) -> Get all files/folders on the storage identified by storage_id. Tries to use bulk operations when possible. callback must be a callable that accepts a single argument. It is called with every found id and then with the metadata for every id."
|
||||
"get_filesystem(storage_id, callback) -> Get all files/folders on the storage identified by storage_id. Tries to use bulk operations when possible. callback must be a callable that is called as (object, level). It is called with every found object. If the callback returns False and the object is a folder, it is not recursed into."
|
||||
},
|
||||
|
||||
{"get_file", (PyCFunction)py_get_file, METH_VARARGS,
|
||||
|
@ -214,13 +214,14 @@ class MTP_DEVICE(MTPDeviceBase):
|
||||
|
||||
return True
|
||||
|
||||
def _filesystem_callback(self, obj):
|
||||
if isinstance(obj, dict):
|
||||
def _filesystem_callback(self, obj, level):
|
||||
n = obj.get('name', '')
|
||||
msg = _('Found object: %s')%n
|
||||
else:
|
||||
msg = _('Found id: %s')%obj
|
||||
if (level == 0 and
|
||||
self.is_folder_ignored(self._currently_getting_sid, n)):
|
||||
return False
|
||||
self.filesystem_callback(msg)
|
||||
return obj.get('is_folder', False)
|
||||
|
||||
@property
|
||||
def filesystem_cache(self):
|
||||
@ -241,6 +242,7 @@ class MTP_DEVICE(MTPDeviceBase):
|
||||
break
|
||||
storage = {'id':storage_id, 'size':capacity, 'name':name,
|
||||
'is_folder':True, 'can_delete':False, 'is_system':True}
|
||||
self._currently_getting_sid = unicode(storage_id)
|
||||
id_map = self.dev.get_filesystem(storage_id,
|
||||
self._filesystem_callback)
|
||||
for x in id_map.itervalues(): x['storage_id'] = storage_id
|
||||
|
@ -12,24 +12,24 @@ pprint, io
|
||||
|
||||
def build(mod='wpd'):
|
||||
master = subprocess.Popen('ssh -MN getafix'.split())
|
||||
master2 = subprocess.Popen('ssh -MN xp_build'.split())
|
||||
master2 = subprocess.Popen('ssh -MN win64'.split())
|
||||
try:
|
||||
while not glob.glob(os.path.expanduser('~/.ssh/*kovid@xp_build*')):
|
||||
while not glob.glob(os.path.expanduser('~/.ssh/*kovid@win64*')):
|
||||
time.sleep(0.05)
|
||||
builder = subprocess.Popen('ssh xp_build ~/build-wpd'.split())
|
||||
builder = subprocess.Popen('ssh win64 ~/build-wpd'.split())
|
||||
if builder.wait() != 0:
|
||||
raise Exception('Failed to build plugin')
|
||||
while not glob.glob(os.path.expanduser('~/.ssh/*kovid@getafix*')):
|
||||
time.sleep(0.05)
|
||||
syncer = subprocess.Popen('ssh getafix ~/test-wpd'.split())
|
||||
syncer = subprocess.Popen('ssh getafix ~/update-calibre'.split())
|
||||
if syncer.wait() != 0:
|
||||
raise Exception('Failed to rsync to getafix')
|
||||
subprocess.check_call(
|
||||
('scp xp_build:build/calibre/src/calibre/plugins/%s.pyd /tmp'%mod).split())
|
||||
('scp win64:build/calibre/src/calibre/plugins/%s.pyd /tmp'%mod).split())
|
||||
subprocess.check_call(
|
||||
('scp /tmp/%s.pyd getafix:calibre/src/calibre/devices/mtp/windows'%mod).split())
|
||||
('scp /tmp/%s.pyd getafix:calibre-src/src/calibre/devices/mtp/windows'%mod).split())
|
||||
p = subprocess.Popen(
|
||||
'ssh getafix calibre-debug -e calibre/src/calibre/devices/mtp/windows/remote.py'.split())
|
||||
'ssh getafix calibre-debug -e calibre-src/src/calibre/devices/mtp/windows/remote.py'.split())
|
||||
p.wait()
|
||||
print()
|
||||
finally:
|
||||
@ -59,7 +59,7 @@ def main():
|
||||
# return
|
||||
|
||||
from calibre.devices.scanner import win_scanner
|
||||
from calibre.devices.mtp.windows.driver import MTP_DEVICE
|
||||
from calibre.devices.mtp.driver import MTP_DEVICE
|
||||
dev = MTP_DEVICE(None)
|
||||
dev.startup()
|
||||
print (dev.wpd, dev.wpd_error)
|
||||
|
@ -54,6 +54,8 @@ def synchronous(tlockname):
|
||||
|
||||
class ConnectionListener (Thread):
|
||||
|
||||
NOT_SERVICED_COUNT = 6
|
||||
|
||||
def __init__(self, driver):
|
||||
Thread.__init__(self)
|
||||
self.daemon = True
|
||||
@ -78,8 +80,8 @@ class ConnectionListener (Thread):
|
||||
|
||||
if not self.driver.connection_queue.empty():
|
||||
queue_not_serviced_count += 1
|
||||
if queue_not_serviced_count >= 3:
|
||||
self.driver._debug('queue not serviced')
|
||||
if queue_not_serviced_count >= self.NOT_SERVICED_COUNT:
|
||||
self.driver._debug('queue not serviced', queue_not_serviced_count)
|
||||
try:
|
||||
sock = self.driver.connection_queue.get_nowait()
|
||||
s = self.driver._json_encode(
|
||||
@ -1281,10 +1283,10 @@ class SMART_DEVICE_APP(DeviceConfig, DevicePlugin):
self._close_listen_socket()
return message
else:
while i < 100: # try up to 100 random port numbers
while i < 100: # try 9090 then up to 99 random port numbers
i += 1
port = self._attach_to_port(self.listen_socket,
random.randint(8192, 32000))
9090 if i == 1 else random.randint(8192, 32000))
if port != 0:
break
if port == 0:
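For illustration, the retry strategy above (preferred port first, then random fallbacks) can be sketched as a standalone helper; the bind-and-return-port behaviour of _attach_to_port is assumed here, so this is a sketch rather than calibre's implementation:

    import random, socket

    def pick_port(preferred=9090, attempts=100):
        # Probe the preferred port first, then random ports in the 8192-32000
        # range; return the first port that can be bound, or 0 if none could.
        # This only probes availability: the test socket is closed again.
        for i in range(attempts):
            candidate = preferred if i == 0 else random.randint(8192, 32000)
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                s.bind(('', candidate))
            except socket.error:
                continue
            finally:
                s.close()
            return candidate
        return 0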
@ -19,9 +19,10 @@ class TECLAST_K3(USBMS):
|
||||
PRODUCT_ID = [0x3203]
|
||||
BCD = [0x0000, 0x0100]
|
||||
|
||||
VENDOR_NAME = ['TECLAST', 'IMAGIN', 'RK28XX', 'PER3274B', 'BEBOOK']
|
||||
VENDOR_NAME = ['TECLAST', 'IMAGIN', 'RK28XX', 'PER3274B', 'BEBOOK',
|
||||
'RK2728', 'MR700']
|
||||
WINDOWS_MAIN_MEM = WINDOWS_CARD_A_MEM = ['DIGITAL_PLAYER', 'TL-K5',
|
||||
'EREADER', 'USB-MSC', 'PER3274B', 'BEBOOK']
|
||||
'EREADER', 'USB-MSC', 'PER3274B', 'BEBOOK', 'USER']
|
||||
|
||||
MAIN_MEMORY_VOLUME_LABEL = 'K3 Main Memory'
|
||||
STORAGE_CARD_VOLUME_LABEL = 'K3 Storage Card'
|
||||
|
@ -14,50 +14,32 @@ import os
|
||||
from calibre.customize.conversion import OutputFormatPlugin, \
|
||||
OptionRecommendation
|
||||
from calibre.ptempfile import TemporaryDirectory
|
||||
from calibre.constants import iswindows
|
||||
|
||||
UNITS = [
|
||||
'millimeter',
|
||||
'point',
|
||||
'inch' ,
|
||||
'pica' ,
|
||||
'didot',
|
||||
'cicero',
|
||||
'devicepixel',
|
||||
]
|
||||
UNITS = ['millimeter', 'centimeter', 'point', 'inch' , 'pica' , 'didot',
|
||||
'cicero', 'devicepixel']
|
||||
|
||||
PAPER_SIZES = ['b2',
|
||||
'a9',
|
||||
'executive',
|
||||
'tabloid',
|
||||
'b4',
|
||||
'b5',
|
||||
'b6',
|
||||
'b7',
|
||||
'b0',
|
||||
'b1',
|
||||
'letter',
|
||||
'b3',
|
||||
'a7',
|
||||
'a8',
|
||||
'b8',
|
||||
'b9',
|
||||
'a3',
|
||||
'a1',
|
||||
'folio',
|
||||
'c5e',
|
||||
'dle',
|
||||
'a0',
|
||||
'ledger',
|
||||
'legal',
|
||||
'a6',
|
||||
'a2',
|
||||
'b10',
|
||||
'a5',
|
||||
'comm10e',
|
||||
'a4']
|
||||
PAPER_SIZES = [u'a0', u'a1', u'a2', u'a3', u'a4', u'a5', u'a6', u'b0', u'b1',
|
||||
u'b2', u'b3', u'b4', u'b5', u'b6', u'legal', u'letter']
|
||||
|
||||
ORIENTATIONS = ['portrait', 'landscape']
|
||||
class PDFMetadata(object): # {{{
|
||||
def __init__(self, oeb_metadata=None):
|
||||
from calibre import force_unicode
|
||||
from calibre.ebooks.metadata import authors_to_string
|
||||
self.title = _(u'Unknown')
|
||||
self.author = _(u'Unknown')
|
||||
self.tags = u''
|
||||
|
||||
if oeb_metadata != None:
|
||||
if len(oeb_metadata.title) >= 1:
|
||||
self.title = oeb_metadata.title[0].value
|
||||
if len(oeb_metadata.creator) >= 1:
|
||||
self.author = authors_to_string([x.value for x in oeb_metadata.creator])
|
||||
if oeb_metadata.subject:
|
||||
self.tags = u', '.join(map(unicode, oeb_metadata.subject))
|
||||
|
||||
self.title = force_unicode(self.title)
|
||||
self.author = force_unicode(self.author)
|
||||
# }}}
|
||||
|
||||
class PDFOutput(OutputFormatPlugin):
|
||||
|
||||
@ -66,9 +48,14 @@ class PDFOutput(OutputFormatPlugin):
|
||||
file_type = 'pdf'
|
||||
|
||||
options = set([
|
||||
OptionRecommendation(name='override_profile_size', recommended_value=False,
|
||||
help=_('Normally, the PDF page size is set by the output profile'
|
||||
' chosen under page options. This option will cause the '
|
||||
' page size settings under PDF Output to override the '
|
||||
' size specified by the output profile.')),
|
||||
OptionRecommendation(name='unit', recommended_value='inch',
|
||||
level=OptionRecommendation.LOW, short_switch='u', choices=UNITS,
|
||||
help=_('The unit of measure. Default is inch. Choices '
|
||||
help=_('The unit of measure for page sizes. Default is inch. Choices '
|
||||
'are %s '
|
||||
'Note: This does not override the unit for margins!') % UNITS),
|
||||
OptionRecommendation(name='paper_size', recommended_value='letter',
|
||||
@ -80,10 +67,6 @@ class PDFOutput(OutputFormatPlugin):
|
||||
help=_('Custom size of the document. Use the form widthxheight '
|
||||
'EG. `123x321` to specify the width and height. '
|
||||
'This overrides any specified paper-size.')),
|
||||
OptionRecommendation(name='orientation', recommended_value='portrait',
|
||||
level=OptionRecommendation.LOW, choices=ORIENTATIONS,
|
||||
help=_('The orientation of the page. Default is portrait. Choices '
|
||||
'are %s') % ORIENTATIONS),
|
||||
OptionRecommendation(name='preserve_cover_aspect_ratio',
|
||||
recommended_value=False,
|
||||
help=_('Preserve the aspect ratio of the cover, instead'
|
||||
@ -108,6 +91,14 @@ class PDFOutput(OutputFormatPlugin):
|
||||
OptionRecommendation(name='pdf_mono_font_size',
|
||||
recommended_value=16, help=_(
|
||||
'The default font size for monospaced text')),
|
||||
OptionRecommendation(name='pdf_mark_links', recommended_value=False,
|
||||
help=_('Surround all links with a red box, useful for debugging.')),
|
||||
OptionRecommendation(name='old_pdf_engine', recommended_value=False,
|
||||
help=_('Use the old, less capable engine to generate the PDF')),
|
||||
OptionRecommendation(name='uncompressed_pdf',
|
||||
recommended_value=False, help=_(
|
||||
'Generate an uncompressed PDF, useful for debugging, '
|
||||
'only works with the new PDF engine.')),
|
||||
])
|
||||
|
||||
def convert(self, oeb_book, output_path, input_plugin, opts, log):
|
||||
@ -200,32 +191,17 @@ class PDFOutput(OutputFormatPlugin):
|
||||
if k in family_map:
|
||||
val[i].value = family_map[k]
|
||||
|
||||
def remove_font_specification(self):
|
||||
# Qt produces image based pdfs on windows when non-generic fonts are specified
|
||||
# This might change in Qt WebKit 2.3+ you will have to test.
|
||||
for item in self.oeb.manifest:
|
||||
if not hasattr(item.data, 'cssRules'): continue
|
||||
for i, rule in enumerate(item.data.cssRules):
|
||||
if rule.type != rule.STYLE_RULE: continue
|
||||
ff = rule.style.getProperty('font-family')
|
||||
if ff is None: continue
|
||||
val = ff.propertyValue
|
||||
for i in xrange(val.length):
|
||||
k = icu_lower(val[i].value)
|
||||
if k not in {'serif', 'sans', 'sans-serif', 'sansserif',
|
||||
'monospace', 'cursive', 'fantasy'}:
|
||||
val[i].value = ''
|
||||
|
||||
def convert_text(self, oeb_book):
|
||||
from calibre.ebooks.pdf.writer import PDFWriter
|
||||
from calibre.ebooks.metadata.opf2 import OPF
|
||||
if self.opts.old_pdf_engine:
|
||||
from calibre.ebooks.pdf.writer import PDFWriter
|
||||
PDFWriter
|
||||
else:
|
||||
from calibre.ebooks.pdf.render.from_html import PDFWriter
|
||||
|
||||
self.log.debug('Serializing oeb input to disk for processing...')
|
||||
self.get_cover_data()
|
||||
|
||||
if iswindows:
|
||||
self.remove_font_specification()
|
||||
else:
|
||||
self.handle_embedded_fonts()
|
||||
|
||||
with TemporaryDirectory('_pdf_out') as oeb_dir:
|
||||
@ -240,9 +216,9 @@ class PDFOutput(OutputFormatPlugin):
|
||||
'toc', None))
|
||||
|
||||
def write(self, Writer, items, toc):
|
||||
from calibre.ebooks.pdf.writer import PDFMetadata
|
||||
writer = Writer(self.opts, self.log, cover_data=self.cover_data,
|
||||
toc=toc)
|
||||
writer.report_progress = self.report_progress
|
||||
|
||||
close = False
|
||||
if not hasattr(self.output_path, 'write'):
|
||||
|
@@ -1125,7 +1125,7 @@ OptionRecommendation(name='search_replace',
RemoveFakeMargins()(self.oeb, self.log, self.opts)
RemoveAdobeMargins()(self.oeb, self.log, self.opts)

if self.opts.subset_embedded_fonts:
if self.opts.subset_embedded_fonts and self.output_plugin.file_type != 'pdf':
from calibre.ebooks.oeb.transforms.subset import SubsetFonts
SubsetFonts()(self.oeb, self.log, self.opts)

@@ -335,32 +335,50 @@ class HeuristicProcessor(object):
This function intentionally leaves hyphenated content alone as that is handled by the
dehyphenate routine in a separate step
'''
def style_unwrap(match):
style_close = match.group('style_close')
style_open = match.group('style_open')
if style_open and style_close:
return style_close+' '+style_open
elif style_open and not style_close:
return ' '+style_open
elif not style_open and style_close:
return style_close+' '
else:
return ' '

# define the pieces of the regex
lookahead = "(?<=.{"+str(length)+u"}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\IA\u00DF]|(?<!\&\w{4});))" # (?<!\&\w{4});) is a semicolon not part of an entity
em_en_lookahead = "(?<=.{"+str(length)+u"}[\u2013\u2014])"
soft_hyphen = u"\xad"
line_ending = "\s*</(span|[iubp]|div)>\s*(</(span|[iubp]|div)>)?"
line_ending = "\s*(?P<style_close></(span|[iub])>)?\s*(</(p|div)>)?"
blanklines = "\s*(?P<up2threeblanks><(p|span|div)[^>]*>\s*(<(p|span|div)[^>]*>\s*</(span|p|div)>\s*)</(span|p|div)>\s*){0,3}\s*"
line_opening = "<(span|[iubp]|div)[^>]*>\s*(<(span|[iubp]|div)[^>]*>)?\s*"
line_opening = "<(p|div)[^>]*>\s*(?P<style_open><(span|[iub])[^>]*>)?\s*"
txt_line_wrap = u"((\u0020|\u0009)*\n){1,4}"

unwrap_regex = lookahead+line_ending+blanklines+line_opening
em_en_unwrap_regex = em_en_lookahead+line_ending+blanklines+line_opening
shy_unwrap_regex = soft_hyphen+line_ending+blanklines+line_opening

if format == 'txt':
unwrap_regex = lookahead+txt_line_wrap
em_en_unwrap_regex = em_en_lookahead+txt_line_wrap
shy_unwrap_regex = soft_hyphen+txt_line_wrap
else:
unwrap_regex = lookahead+line_ending+blanklines+line_opening
em_en_unwrap_regex = em_en_lookahead+line_ending+blanklines+line_opening
shy_unwrap_regex = soft_hyphen+line_ending+blanklines+line_opening

unwrap = re.compile(u"%s" % unwrap_regex, re.UNICODE)
em_en_unwrap = re.compile(u"%s" % em_en_unwrap_regex, re.UNICODE)
shy_unwrap = re.compile(u"%s" % shy_unwrap_regex, re.UNICODE)

if format == 'txt':
content = unwrap.sub(' ', content)
content = em_en_unwrap.sub('', content)
content = shy_unwrap.sub('', content)
else:
content = unwrap.sub(style_unwrap, content)
content = em_en_unwrap.sub(style_unwrap, content)
content = shy_unwrap.sub(style_unwrap, content)

return content

def txt_process(self, match):

@@ -17,7 +17,7 @@ from urllib import unquote

from calibre.ebooks.chardet import detect_xml_encoding
from calibre.constants import iswindows
from calibre import unicode_path, as_unicode
from calibre import unicode_path, as_unicode, replace_entities

class Link(object):
'''
@@ -147,6 +147,7 @@ class HTMLFile(object):
url = match.group(i)
if url:
break
url = replace_entities(url)
try:
link = self.resolve(url)
except ValueError:

@@ -75,6 +75,20 @@ class Worker(Thread): # Get details {{{
9: ['sept'],
12: ['déc'],
},
'br': {
1: ['janeiro'],
2: ['fevereiro'],
3: ['março'],
4: ['abril'],
5: ['maio'],
6: ['junho'],
7: ['julho'],
8: ['agosto'],
9: ['setembro'],
10: ['outubro'],
11: ['novembro'],
12: ['dezembro'],
},
'es': {
1: ['enero'],
2: ['febrero'],
@@ -117,6 +131,7 @@ class Worker(Thread): # Get details {{{
text()="Product details" or \
text()="Détails sur le produit" or \
text()="Detalles del producto" or \
text()="Detalhes do produto" or \
text()="登録情報"]/../div[@class="content"]
'''
# Editor: is for Spanish
@@ -126,6 +141,7 @@ class Worker(Thread): # Get details {{{
starts-with(text(), "Editore:") or \
starts-with(text(), "Editeur") or \
starts-with(text(), "Editor:") or \
starts-with(text(), "Editora:") or \
starts-with(text(), "出版社:")]
'''
self.language_xpath = '''
@@ -141,7 +157,7 @@ class Worker(Thread): # Get details {{{
'''

self.ratings_pat = re.compile(
r'([0-9.]+) ?(out of|von|su|étoiles sur|つ星のうち|de un máximo de) ([\d\.]+)( (stars|Sternen|stelle|estrellas)){0,1}')
r'([0-9.]+) ?(out of|von|su|étoiles sur|つ星のうち|de un máximo de|de) ([\d\.]+)( (stars|Sternen|stelle|estrellas|estrelas)){0,1}')

lm = {
'eng': ('English', 'Englisch'),
@@ -150,6 +166,7 @@ class Worker(Thread): # Get details {{{
'deu': ('German', 'Deutsch'),
'spa': ('Spanish', 'Espa\xf1ol', 'Espaniol'),
'jpn': ('Japanese', u'日本語'),
'por': ('Portuguese', 'Português'),
}
self.lang_map = {}
for code, names in lm.iteritems():
@@ -435,7 +452,7 @@ class Worker(Thread): # Get details {{{


def parse_cover(self, root):
imgs = root.xpath('//img[@id="prodImage" and @src]')
imgs = root.xpath('//img[(@id="prodImage" or @id="original-main-image") and @src]')
if imgs:
src = imgs[0].get('src')
if '/no-image-avail' not in src:
@@ -505,6 +522,7 @@ class Amazon(Source):
'it' : _('Italy'),
'jp' : _('Japan'),
'es' : _('Spain'),
'br' : _('Brazil'),
}

options = (
@@ -570,6 +588,8 @@ class Amazon(Source):
url = 'http://amzn.com/'+asin
elif domain == 'uk':
url = 'http://www.amazon.co.uk/dp/'+asin
elif domain == 'br':
url = 'http://www.amazon.com.br/dp/'+asin
else:
url = 'http://www.amazon.%s/dp/%s'%(domain, asin)
if url:
@@ -629,7 +649,7 @@ class Amazon(Source):
q['field-isbn'] = isbn
else:
# Only return book results
q['search-alias'] = 'stripbooks'
q['search-alias'] = 'digital-text' if domain == 'br' else 'stripbooks'
if title:
title_tokens = list(self.get_title_tokens(title))
if title_tokens:
@@ -661,6 +681,8 @@ class Amazon(Source):
udomain = 'co.uk'
elif domain == 'jp':
udomain = 'co.jp'
elif domain == 'br':
udomain = 'com.br'
url = 'http://www.amazon.%s/s/?'%udomain + urlencode(encoded_q)
return url, domain

@@ -978,6 +1000,16 @@ if __name__ == '__main__': # tests {{{
),
] # }}}

br_tests = [ # {{{
(
{'title':'Guerra dos Tronos'},
[title_test('A Guerra dos Tronos - As Crônicas de Gelo e Fogo',
exact=True), authors_test(['George R. R. Martin'])
]

),
] # }}}

def do_test(domain, start=0, stop=None):
tests = globals().get(domain+'_tests')
if stop is None:

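As a side note, a minimal sketch of what the widened ratings_pat above now accepts: the pattern is copied verbatim from the hunk, while the Brazilian-style rating string is invented purely for illustration.

    import re

    # Pattern as shown in the hunk above, with the new 'de' and 'estrelas' alternatives
    ratings_pat = re.compile(
        r'([0-9.]+) ?(out of|von|su|étoiles sur|つ星のうち|de un máximo de|de) ([\d\.]+)( (stars|Sternen|stelle|estrellas|estrelas)){0,1}')

    m = ratings_pat.match('4.2 de 5 estrelas')  # made-up Amazon.com.br style rating text
    print(m.group(1) + ' / ' + m.group(3))      # prints: 4.2 / 5
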
@@ -484,7 +484,7 @@ def identify(log, abort, # {{{
log('The identify phase took %.2f seconds'%(time.time() - start_time))
log('The longest time (%f) was taken by:'%longest, lp)
log('Merging results from different sources and finding earliest ',
'publication dates from the xisbn service')
'publication dates from the worldcat.org service')
start_time = time.time()
results = merge_identify_results(results, log)

@@ -126,6 +126,7 @@ class EXTHHeader(object): # {{{
elif idx == 113: # ASIN or other id
try:
self.uuid = content.decode('ascii')
self.mi.set_identifier('mobi-asin', self.uuid)
except:
self.uuid = None
elif idx == 116:

@@ -74,11 +74,12 @@ def remove_kindlegen_markup(parts):
part = "".join(srcpieces)
parts[i] = part

# we can safely remove all of the Kindlegen generated data-AmznPageBreak tags
# we can safely remove all of the Kindlegen generated data-AmznPageBreak
# attributes
find_tag_with_AmznPageBreak_pattern = re.compile(
r'''(<[^>]*\sdata-AmznPageBreak=[^>]*>)''', re.IGNORECASE)
within_tag_AmznPageBreak_position_pattern = re.compile(
r'''\sdata-AmznPageBreak=['"][^'"]*['"]''')
r'''\sdata-AmznPageBreak=['"]([^'"]*)['"]''')

for i in xrange(len(parts)):
part = parts[i]
@@ -86,10 +87,8 @@ def remove_kindlegen_markup(parts):
for j in range(len(srcpieces)):
tag = srcpieces[j]
if tag.startswith('<'):
for m in within_tag_AmznPageBreak_position_pattern.finditer(tag):
replacement = ''
tag = within_tag_AmznPageBreak_position_pattern.sub(replacement, tag, 1)
srcpieces[j] = tag
srcpieces[j] = within_tag_AmznPageBreak_position_pattern.sub(
lambda m:' style="page-break-after:%s"'%m.group(1), tag)
part = "".join(srcpieces)
parts[i] = part

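For context, a standalone sketch of the effect of the reworked substitution above: the regex and the lambda are taken from the hunk, while the sample tag string is made up for illustration.

    import re

    # Regex as defined in the hunk above; group(1) captures the attribute value
    within_tag_AmznPageBreak_position_pattern = re.compile(
        r'''\sdata-AmznPageBreak=['"]([^'"]*)['"]''')

    tag = '<div class="chapter" data-AmznPageBreak="avoid">'  # hypothetical Kindlegen output
    print(within_tag_AmznPageBreak_position_pattern.sub(
        lambda m: ' style="page-break-after:%s"' % m.group(1), tag))
    # prints: <div class="chapter" style="page-break-after:avoid">
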
@@ -44,6 +44,18 @@ def locate_beg_end_of_tag(ml, aid):
return plt, pgt
return 0, 0

def reverse_tag_iter(block):
''' Iterate over all tags in block in reverse order, i.e. last tag
to first tag. '''
end = len(block)
while True:
pgt = block.rfind(b'>', 0, end)
if pgt == -1: break
plt = block.rfind(b'<', 0, pgt)
if plt == -1: break
yield block[plt:pgt+1]
end = plt

class Mobi8Reader(object):

def __init__(self, mobi6_reader, log):
@@ -275,13 +287,12 @@ class Mobi8Reader(object):
return '%s/%s'%(fi.type, fi.filename), idtext

def get_id_tag(self, pos):
# find the correct tag by actually searching in the destination
# textblock at position
# Find the first tag with a named anchor (name or id attribute) before
# pos
fi = self.get_file_info(pos)
if fi.num is None and fi.start is None:
raise ValueError('No file contains pos: %d'%pos)
textblock = self.parts[fi.num]
id_map = []
npos = pos - fi.start
pgt = textblock.find(b'>', npos)
plt = textblock.find(b'<', npos)
@@ -290,28 +301,15 @@ class Mobi8Reader(object):
if plt == npos or pgt < plt:
npos = pgt + 1
textblock = textblock[0:npos]
# find id links only inside of tags
# inside any < > pair find all "id=' and return whatever is inside
# the quotes
id_pattern = re.compile(br'''<[^>]*\sid\s*=\s*['"]([^'"]*)['"][^>]*>''',
re.IGNORECASE)
for m in re.finditer(id_pattern, textblock):
id_map.append((m.start(), m.group(1)))
id_re = re.compile(br'''<[^>]+\sid\s*=\s*['"]([^'"]+)['"]''')
name_re = re.compile(br'''<\s*a\s*\sname\s*=\s*['"]([^'"]+)['"]''')
for tag in reverse_tag_iter(textblock):
m = id_re.match(tag) or name_re.match(tag)
if m is not None:
return m.group(1)

if not id_map:
# Found no id in the textblock, link must be to top of file
# No tag found, link to start of file
return b''
# if npos is before first id= inside a tag, return the first
if npos < id_map[0][0]:
return id_map[0][1]
# if npos is after the last id= inside a tag, return the last
if npos > id_map[-1][0]:
return id_map[-1][1]
# otherwise find last id before npos
for i, item in enumerate(id_map):
if npos < item[0]:
return id_map[i-1][1]
return id_map[0][1]

def create_guide(self):
guide = Guide()

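As a quick illustration of the reverse_tag_iter helper added above, a minimal sketch: the function body is copied from the hunk, and the sample byte string is made up.

    def reverse_tag_iter(block):
        # Walk the byte string from the last tag back to the first
        end = len(block)
        while True:
            pgt = block.rfind(b'>', 0, end)
            if pgt == -1: break
            plt = block.rfind(b'<', 0, pgt)
            if plt == -1: break
            yield block[plt:pgt+1]
            end = plt

    block = b'<p id="start">text<a name="mid">more</a></p>'
    print(list(reverse_tag_iter(block)))
    # prints: [b'</p>', b'</a>', b'<a name="mid">', b'<p id="start">']
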
@@ -92,6 +92,31 @@ class BookIndexing
this.last_check = [body.scrollWidth, body.scrollHeight]
return ans

all_links_and_anchors: () ->
body = document.body
links = []
anchors = {}
for a in document.querySelectorAll("body a[href], body [id], body a[name]")
if window.paged_display?.in_paged_mode
geom = window.paged_display.column_location(a)
else
br = a.getBoundingClientRect()
[left, top] = viewport_to_document(br.left, br.top, a.ownerDocument)
geom = {'left':left, 'top':top, 'width':br.right-br.left, 'height':br.bottom-br.top}

href = a.getAttribute('href')
if href
links.push([href, geom])
id = a.getAttribute("id")
if id and id not in anchors
anchors[id] = geom
if a.tagName in ['A', "a"]
name = a.getAttribute("name")
if name and name not in anchors
anchors[name] = geom

return {'links':links, 'anchors':anchors}

if window?
window.book_indexing = new BookIndexing()

@@ -242,6 +242,18 @@ class PagedDisplay
# Return the number of the column that contains xpos
return Math.floor(xpos/this.page_width)

column_location: (elem) ->
# Return the location of elem relative to its containing column
br = elem.getBoundingClientRect()
[left, top] = calibre_utils.viewport_to_document(br.left, br.top, elem.ownerDocument)
c = this.column_at(left)
width = Math.min(br.right, (c+1)*this.page_width) - br.left
if br.bottom < br.top
br.bottom = window.innerHeight
height = Math.min(br.bottom, window.innerHeight) - br.top
left -= c*this.page_width
return {'column':c, 'left':left, 'top':top, 'width':width, 'height':height}

column_boundaries: () ->
# Return the column numbers at the left edge and after the right edge
# of the viewport

|
@ -320,13 +320,11 @@ class OEBReader(object):
|
||||
self.logger.warn(u'Spine item %r not found' % idref)
|
||||
continue
|
||||
item = manifest.ids[idref]
|
||||
if item.media_type.lower() in OEB_DOCS and hasattr(item.data, 'xpath'):
|
||||
spine.add(item, elem.get('linear'))
|
||||
for item in spine:
|
||||
if item.media_type.lower() not in OEB_DOCS:
|
||||
if not hasattr(item.data, 'xpath'):
|
||||
else:
|
||||
self.oeb.log.warn('The item %s is not a XML document.'
|
||||
' Removing it from spine.'%item.href)
|
||||
spine.remove(item)
|
||||
if len(spine) == 0:
|
||||
raise OEBError("Spine is empty")
|
||||
self._spine_add_extra()
|
||||
|