[Sync] Sync with trunk, revision 10607
COPYRIGHT (12 lines changed)
@@ -9,6 +9,12 @@ License: GPL-2 or later
The full text of the GPL is distributed as in
/usr/share/common-licenses/GPL-2 on Debian systems.

Files: setup/iso_639/*
Copyright: Various
License: LGPL 2.1
The full text of the LGPL is distributed as in
/usr/share/common-licenses/LGPL-2.1 on Debian systems.

Files: src/calibre/ebooks/BeautifulSoup.py
Copyright: Copyright (c) 2004-2007, Leonard Richardson
License: BSD
@@ -28,6 +34,12 @@ License: other
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.

Files: src/calibre/ebooks/readability/*
Copyright: Unknown
License: Apache 2.0
The full text of the Apache 2.0 license is available at:
http://www.apache.org/licenses/LICENSE-2.0

Files: /src/cherrypy/*
Copyright: Copyright (c) 2004-2007, CherryPy Team (team@cherrypy.org)
Copyright: Copyright (C) 2005, Tiago Cogumbreiro <cogumbreiro@users.sf.net>
Changelog.yaml (471 lines changed)
@@ -19,6 +19,477 @@
# new recipes:
#  - title:

- version: 0.8.21
  date: 2011-09-30

  new features:
    - title: "A Tips and Tricks blog at http://blog.calibre-ebook.com to introduce less well known calibre features in a simple way"

    - title: "News download: Add list of articles in the downloaded issue to the comments metadata of the generated ebook. Makes it possible to search for a particular article in the calibre library."
      tickets: [851717]

    - title: "Toolbar buttons: You can now also right click the buttons to bring up the popup of extra actions, in addition to clicking the small arrow next to the button."

    - title: "Amazon metadata download plugin: Add option to download metadata from amazon.es"

    - title: Driver for Vizio and iRobot A9 Android tablets
      tickets: [854408,862175]

    - title: "When switching to/starting with a library with a corrupted database, offer the user the option of rebuilding the database instead of erroring out."

    - title: "Template language: Add list_equals function"

    - title: "Add a special output profile for the PocketBook 900 as it does not resize images correctly by itself"

  bug fixes:
    - title: "Fix regression that caused PDF Output to generate very large files"

    - title: Fix Title Sort field not being displayed in Book details panel

    - title: Prevent renaming of languages in the Tag browser
      tickets: [860943]

    - title: "Get books: Fix getting price from Foyles"

    - title: "Content server: When a search matches no queries, do not show an error message"

    - title: "ODT Input: Add workaround for ADE to fix centering of block level images when converting to EPUB"
      tickets: [859343]

    - title: "Content server: When embedding via WSGI, fix handling of empty URLs"

    - title: "RTF Input: Fix spurious spaces inserted after some unicode characters"
      tickets: [851215]

    - title: "Fix regression that broke clicking on the first letter of author names in the Tag Browser when grouped"
      tickets: [860615]

    - title: "Fix reading metadata from filenames when the author regexp does not match anything"

    - title: "Fix incorrect display of the month September in Finnish calibre"
      tickets: [858737]

    - title: "Do not delete the file when the user tries to add a format to a book from a file already in the books directory"
      tickets: [856158]

    - title: "Fix regression that broke customization of the Kobo device plugin"

    - title: "Allow user defined templates to be used in save to disk"

  improved recipes:
    - Read It Later
    - American Spectator
    - Sydney Morning Herald
    - Chicago Tribune
    - American Prospect
    - DNA India
    - Times of India
    - Kurier
    - xkcd
    - Cnet

  new recipes:
    - title: Various Colombian news sources
      author: BIGO-CAVA

    - title: Gosc Niedzielny
      author: Piotr Kontek

    - title: Leipziger Volkszeitung
      author: a.peter

    - title: Folha de Sao Paulo (full edition)
      author: fluzao

    - title: Den of Geek
      author: Jaded

    - title: Republica
      author: Manish Bhattarai

    - title: Sign on San Diego
      author: Jay Kindle

- version: 0.8.20
  date: 2011-09-23

  new features:
    - title: "MOBI Output: Map a larger set of font names to sans-serif/monospace fonts in the MOBI file"

    - title: "Get Books: Allow searching on the DRM column in the results."
      tickets: [852514]

    - title: "Manage tags/series/etc dialog: Add a 'was' column to show the old value when changing values."
      tickets: [846538]

    - title: "Template language: Add new functions to manipulate language codes"
      tickets: [832084]

  bug fixes:
    - title: "MOBI Output: Don't set cdetype when the option to enable sharing instead of syncing is specified. This fixes the option."

    - title: "Conversion pipeline: Fix crash caused by empty <style> elements."
      tickets: [775277]

    - title: "Get Books: Fix Woblink store"

    - title: "MOBI Input: Correctly handle MOBI files that have been passed through a DRM removal tool that leaves the DRM fields in the header."
      tickets: [855732]

    - title: "Fix typo preventing the updating of metadata in MOBI files served by the content server"

    - title: "Get Books: Handle non ASCII filenames for downloaded books"
      tickets: [855109]

    - title: "When generating the title sort string and stripping a leading article, strip leading punctuation that remains after removing the article"
      tickets: [855070]

    - title: "Fix downloading metadata in the Edit metadata dialog resulting in off-by-one-day published dates in timezones behind GMT"
      tickets: [855143]

    - title: "Fix handling of title_sort and custom columns when creating a BiBTeX catalog."
      tickets: [853249]

    - title: "TXT Markdown Input: Change handling of _ to work mid-word."

    - title: "Fix Check library reporting unknown files as both missing and unknown"
      tickets: [846926]

    - title: "Search/Replace: Permit .* to match empty tag-like columns."
      tickets: [840517]

  improved recipes:
    - Cicero (DE)
    - Taz.de
    - Ming Pao - HK
    - Macleans Magazine
    - IDG.se
    - PC World (eng)
    - LA Times

  new recipes:
    - title: Ekantipur (Nepal)
      author: fab4.ilam

    - title: Various Polish news sources
      author: fenuks

    - title: Taipei Times and China Post
      author: Krittika Goyal

    - title: Berliner Zeitung
      author: ape

- version: 0.8.19
  date: 2011-09-16

  new features:
    - title: "Driver for Sony Ericsson Xperia Arc"

    - title: "MOBI Output: Add option in Preferences->Output Options->MOBI Output to enable the share via Facebook feature for calibre produced MOBI files. Note that enabling this disables the sync last read position across multiple devices feature. Don't ask me why, ask Amazon."

    - title: "Content server: Update metadata when sending MOBI as well as EPUB files"

    - title: "News download: Add an auto_cleanup_keep variable that allows recipe writers to tell the auto cleanup to never remove a specified element"

    - title: "Conversion: Remove paragraph spacing: If you set the indent size negative, calibre will now leave the indents specified in the input document"

  bug fixes:
    - title: "Fix regression in 0.8.18 that broke PDF Output"

    - title: "MOBI Output: Revert change in 0.8.18 that marked news downloads with a single section as blogs, as the Kindle does not auto archive them"

    - title: "PDF Output on OS X now generates proper non image based documents"

    - title: "RTF Input: Fix handling of internal links and underlined text"
      tickets: [845328]

    - title: "Fix language sometimes not getting set when downloading metadata in the edit metadata dialog"

    - title: "Fix regression that broke killing of multiple jobs"
      tickets: [850764]

    - title: "Fix bug processing author names with initials when downloading metadata from ozon.ru."
      tickets: [845420]

    - title: "Fix a memory leak in the Copy to library operation, which also fixes the metadata.db being held open in the destination library"
      tickets: [849469]

    - title: "Keyboard shortcuts: Allow use of symbol keys like >, *, etc."
      tickets: [847378]

    - title: "EPUB Output: When splitting, be a little cleverer about discarding 'empty' pages"

  improved recipes:
    - Twitch Films
    - Japan Times
    - People/US Magazine mashup
    - Business World India
    - Inquirer.net
    - Guardian/Observer

  new recipes:
    - title: RT
      author: Darko Miletic

    - title: CIO Magazine
      author: Julio Map

    - title: India Today and Hindustan Times
      author: Krittika Goyal

    - title: Pagina 12 Print Edition
      author: Pablo Marfil

- version: 0.8.18
  date: 2011-09-09

  new features:
    - title: "Kindle news download: On Kindle 3 and newer have the View Articles and Sections menu remember the current article."
      tickets: [748741]

    - title: "Conversion: Add option to unsmarten punctuation under Look & Feel"

    - title: "Drivers for Motorola Ex124G and Pandigital Nova Tablet"

    - title: "Allow downloading metadata from amazon.co.jp. To use it, configure the amazon metadata source to use the Japanese amazon site."
      tickets: [842447]

    - title: "When automatically generating author sort for an author name, ignore common prefixes like Mr., Dr., etc. Controllable via tweak. Also add a tweak to allow control of how a string is split up into multiple authors."
      tickets: [795984]

    - title: "TXT Output: Preserve as much formatting as possible when generating Markdown output, including various CSS styles"

  bug fixes:
    - title: "Fix pubdate incorrect when used in the save to disk template in timezones ahead of GMT."
      tickets: [844445]

    - title: "When attempting to stop multiple device jobs at once, only show a single error message"
      tickets: [841588]

    - title: "Fix conversion of large EPUB files to PDF erroring out on systems with a limited number of available file handles"
      tickets: [816616]

    - title: "EPUB catalog generation: Fix some entries going off the left edge of the page for unread/wishlist items"

    - title: "When setting the language in an EPUB file, always use the 2 letter language code in preference to the 3 letter code, when possible."
      tickets: [841201]

    - title: "Content server: Fix --url-prefix not used for links in the book details view."

    - title: "MOBI Input: When links in a MOBI file point to just before block elements, and there is a page break on the block element, the links can end up pointing to the wrong place on conversion. Adjust the location in such cases to point to the block element directly."

  improved recipes:
    - Kopalnia Wiedzy
    - FilmWeb.pl
    - Philadelphia Inquirer
    - Honolulu Star Advertiser
    - Counterpunch

  new recipes:
    - title: Various Polish news sources
      author: fenuks

- version: 0.8.17
  date: 2011-09-02

  new features:
    - title: "Basic support for the Amazon AZW4 format (a PDF wrapped inside a MOBI)"

    - title: "When showing the cover browser in a separate window, allow the use of the V, D shortcut keys to view the current book and send it to device, respectively."
      tickets: [836402]

    - title: "Add an option in Preferences->Miscellaneous to abort conversion jobs that take too long."
      tickets: [835233]

    - title: "Driver for HTC Evo and HP TouchPad (with kindle app)"

    - title: "Preferences->Adding books: detect when the user specifies a test expression with no file extension and popup a warning"

  bug fixes:
    - title: "E-book viewer: Ensure toolbars are always visible"

    - title: "Content server: Fix grouping of Tags/authors not working for some non english languages with Internet Explorer"
      tickets: [835238]

    - title: "When downloading metadata from amazon, fix italics inside brackets getting lost."
      tickets: [836857]

    - title: "Get Books: Add EscapeMagazine.pl and RW2010.pl stores"

    - title: "Conversion pipeline: Fix conversion of cm/mm to pts. Fixes use of cm as a length unit when converting to MOBI."

    - title: "When showing the cover browser in a separate window, focus the cover browser so that keyboard shortcuts work immediately."
      tickets: [835933]

    - title: "HTMLZ Output: Fix special chars like ampersands, etc. not being converted to entities"

    - title: "Keyboard shortcuts config: Fix clicking Done in the shortcut editor with shortcuts set to default causing the displayed shortcut to always be set to None"

    - title: "Fix bottom-most entries in keyboard shortcuts not editable"

  improved recipes:
    - Hacker News
    - Nikkei News

  new recipes:
    - title: "Haber 7 and Hira"
      author: thomass

    - title: "NTV and NTVSpor by A Erdogan"
      author: A Erdogan

- version: 0.8.16
  date: 2011-08-26

  new features:
    - title: "News download: Add algorithms to automatically clean up downloaded HTML"
      description: "Use the algorithms from the Readability project to automatically clean up downloaded HTML. You can turn this on in your own recipes by adding auto_cleanup=True to the recipe. It is turned on by default for basic recipes created via the GUI. This makes it a little easier for beginners to develop recipes."
      type: major

    - title: "Add an option to Preferences->Look and Feel->Cover Browser to show the cover browser full screen. When showing the cover browser in a separate window, you can make it fullscreen by pressing the F11 key."
      tickets: [829855]

    - title: "Show the languages currently used at the top of the drop down list of languages"

    - title: "When automatically computing author sort from the author's name, if the name contains certain words like Inc., Company, Team, etc., use the author name as the sort string directly. The list of such words can be controlled via Preferences->Tweaks."
      tickets: [797895]

    - title: "Add a search for individual tweaks to Preferences->Tweaks"

    - title: "Drivers for a few new android phones"

  bug fixes:
    - title: "Fix line unwrapping algorithms to account for some central European characters as well."
      tickets: [822744]

    - title: "Switch to using more modern language names/translations from the iso-codes package"

    - title: "Allow case-insensitive entering of language names for convenience."
      tickets: [832761]

    - title: "When adding a text indent to paragraphs as part of the remove spacing between paragraphs transformation, do not add an indent to paragraphs that are directly centered or right aligned."
      tickets: [830439]

    - title: "Conversion pipeline: More robust handling of case insensitive tag and class css selectors"

    - title: "MOBI Output: Add support for the start attribute on <ol> tags"

    - title: "When adding books that have no language specified, do not automatically set the language to calibre's interface language."
      tickets: [830092]

    - title: "Fix use of the tag browser to search for languages when calibre is translated to a non English language"
      tickets: [830078]

    - title: "When downloading news, set the language field correctly"

    - title: "Fix languages field in the Edit metadata dialog being too wide"
      tickets: [829912]

    - title: "Fix setting of languages that have commas in their names being broken"

    - title: "FB2 Input: When converting FB2 files, read the cover from the FB2 file correctly."
      tickets: [829240]

  improved recipes:
    - Politifact
    - Reuters
    - Sueddeutsche
    - CNN
    - Financial Times UK
    - MSDN Magazine
    - Houston Chronicle
    - Harvard Business Review

  new recipes:
    - title: CBN News and Fairbanks Daily
      author: Roger

    - title: Hacker News
      author: Tom Scholl

    - title: Various Turkish news sources
      author: thomass

    - title: Cvece Zla
      author: Darko Miletic

    - title: Various Polish news sources
      author: fenuks

    - title: Fluter
      author: Armin Geller

    - title: Brasil de Fato
      author: Alex Mitrani

- version: 0.8.15
  date: 2011-08-19

  new features:
    - title: "Add a 'languages' metadata field."
      type: major
      description: "This is useful if you have a multi-lingual book collection. You can now set one or more languages per book via the Edit Metadata dialog. If you want the languages column to be visible, go to Preferences->Add your own columns and unhide the languages column. You can also bulk set the languages on multiple books via the bulk edit metadata dialog. You can also have the languages show up in the book details panel on the right by going to Preferences->Look and Feel->Book details"

    - title: "Get Books: Add XinXii store."

    - title: "Metadata download plugin for ozon.ru, enabled only when the user selects russian as their language in the welcome wizard."

    - title: "Bambook driver: Allow direct transfer of PDF files to Bambook devices"

    - title: "Driver for Coby MID7015A and Asus EEE Note"

    - title: "Edit metadata dialog: The keyboard shortcut Ctrl+D can now be used to trigger a metadata download. Also show the row number of the book being edited in the titlebar"

    - title: "Add an option to not preserve the date when using the 'Copy to Library' function (found in Preferences->Adding books)"

  bug fixes:
    - title: "Linux binary: Use readlink -f rather than readlink -e in the launcher scripts so that they work with recent releases of busybox"

    - title: "When bulk downloading metadata for more than 100 books at a time, automatically split up the download into batches of 100."
      tickets: [828373]

    - title: "When deleting books from the Kindle, also delete 'sidecar' .apnx and .ph1 files, as the kindle does not clean them up automatically"
      tickets: [827684]

    - title: "Fix a subtle bug in the device drivers that caused calibre to lose track of some books on the device if you used author_sort in the send to device template and your books have author sort values that differ only in case."
      tickets: [825706]

    - title: "Fix scene break character pattern not saved in conversion preferences"
      tickets: [826038]

    - title: "Keyboard shortcuts: Fix a bug triggered by some third party plugins that made the keyboard preferences unusable on OS X."
      tickets: [826325]

    - title: "Search box: Fix completion no longer working after using the Tag Browser to do a search. Also ensure that the completer popup is always hidden when a search is performed."

    - title: "Fix pressing Enter in the search box causing the same search to be executed twice in the plugins and keyboard shortcuts preferences panels"

    - title: "Catalog generation: Fix error creating epub/mobi catalogs on non UTF-8 windows systems when the metadata contained non ASCII characters"

  improved recipes:
    - Financial Times UK
    - La Tercera
    - Folha de Sao Paolo
    - Metro niews NL
    - La Nacion
    - Juventud Rebelde
    - Rzeczpospolita Online
    - Newsweek Polska
    - CNET news

  new recipes:
    - title: El Mostrador and The Clinic
      author: Alex Mitrani

    - title: Patente de Corso
      author: Oscar Megia Lopez

- version: 0.8.14
  date: 2011-08-12

@@ -118,7 +118,7 @@ EBVS
<0x 00 00 00 00>
<0x 00 00 00 10>
...(rest of size of DATA block)
<0x FD EA = PAD? (ýê)>
<0x FD EA = PAD? (ýê)>
DATA
<0x 4 bytes = size of <marked text (see 3rd note)> >
<marked text (see 3rd note)>
@@ -155,7 +155,7 @@ EBVS
<0x 00 00 00 00>
<0x 00 00 00 10>
...(rest of size of DATA block)
<0x FD EA = PAD? (ýê)>
<0x FD EA = PAD? (ýê)>
[fi MARK || BOOKMARK]
//-------------------------------
[if CORRECTION]
@@ -174,7 +174,7 @@ EBVS
<0x 00 00 00 00>
<0x 00 00 00 10>
...(rest of size of DATA block)
<0x FD EA = PAD? (ýê)>
<0x FD EA = PAD? (ýê)>
DATA
<0x 4 bytes = size of <marked text (see 3rd note)> >
<marked text (see 3rd note)>
@@ -246,7 +246,7 @@ EBVS
<0x 00 00 00 00>
<0x 00 00 00 10>
...(size of DATA block - 30)
<0x FD EA = PAD? (ýê)>
<0x FD EA = PAD? (ýê)>
[fi DRAWING]
//-------------------------------
[next {NOTE,MARK,CORRECTION,DRAWING}]
@@ -308,7 +308,7 @@ EBVS
...4
...4
...4
<0x FD EA = PAD? (ýê)>
<0x FD EA = PAD? (ýê)>
//--------------------------------------------------------------------

// CATEGORY (if any)
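The DATA layout sketched in the notes above (a 4-byte size prefix, the marked text itself, then 0xFD 0xEA pad words) can be illustrated with a small reader. The big-endian byte order and the helper name are assumptions for illustration; the notes do not state byte order.

```python
import struct

def read_marked_text(buf, offset):
    # 4-byte size prefix, then `size` bytes of marked text (see 3rd note).
    # '>I' (big-endian unsigned int) is an assumption, not confirmed above.
    (size,) = struct.unpack_from('>I', buf, offset)
    text = buf[offset + 4: offset + 4 + size]
    return text, offset + 4 + size

# A fabricated 5-byte block followed by one 0xFD 0xEA pad word:
blob = struct.pack('>I', 5) + b'hello' + b'\xfd\xea'
text, end = read_marked_text(blob, 0)
```

After the returned offset, the reader would expect the 0xFD 0xEA padding shown in the hunks above.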
recipes/adventure_zone_pl.recipe (new file, 38 lines)
@@ -0,0 +1,38 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Adventure_zone(BasicNewsRecipe):
    title = u'Adventure Zone'
    __author__ = 'fenuks'
    description = 'Adventure zone - adventure games from A to Z'
    category = 'games'
    language = 'pl'
    oldest_article = 15
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_tags_before = dict(name='td', attrs={'class':'main-bg'})
    remove_tags_after = dict(name='td', attrs={'class':'main-body middle-border'})
    extra_css = '.main-bg{text-align: left;} td.capmain{ font-size: 22px; }'
    feeds = [(u'Nowinki', u'http://www.adventure-zone.info/fusion/feeds/news.php')]

    def get_cover_url(self):
        soup = self.index_to_soup('http://www.adventure-zone.info/fusion/news.php')
        cover = soup.find(id='box_OstatninumerAZ')
        self.cover_url = 'http://www.adventure-zone.info/fusion/' + cover.center.a.img['src']
        return self.cover_url

    def skip_ad_pages(self, soup):
        skip_tag = soup.body.findAll(name='a')
        if skip_tag is not None:
            for r in skip_tag:
                if 'articles.php?' in r['href']:
                    if r.strong is not None:
                        word = r.strong.string
                        # test both keywords explicitly; ('zapowied' or 'recenzj')
                        # would only ever test 'zapowied'
                        if 'zapowied' in word or 'recenzj' in word:
                            return self.index_to_soup('http://www.adventure-zone.info/fusion/print.php?type=A&item_id' + r['href'][r['href'].find('_id')+3:], raw=True)

    def print_version(self, url):
        return url.replace('news.php?readmore', 'print.php?type=N&item_id')
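A pitfall worth noting around the keyword test in skip_ad_pages: in Python, `('zapowied' or 'recenzj') in word` evaluates the parenthesised `or` first, which yields `'zapowied'`, so the second keyword is silently never checked. A minimal demonstration (the function names are illustrative, not part of the recipe):

```python
def matches_buggy(word):
    # ('zapowied' or 'recenzj') evaluates to 'zapowied', so 'recenzj'
    # is never tested against word
    return ('zapowied' or 'recenzj') in word

def matches_fixed(word):
    # the correct form tests each substring separately
    return 'zapowied' in word or 'recenzj' in word

print(matches_buggy('recenzja gry'))  # False: review links are silently missed
print(matches_fixed('recenzja gry'))  # True
```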
@@ -18,6 +18,8 @@ class TheAmericanSpectator(BasicNewsRecipe):
    use_embedded_content = False
    language = 'en'
    INDEX = 'http://spectator.org'
    auto_cleanup = True
    encoding = 'utf-8'

    conversion_options = {
        'comments' : description
@@ -26,17 +28,6 @@ class TheAmericanSpectator(BasicNewsRecipe):
        ,'publisher' : publisher
    }

    keep_only_tags = [
        dict(name='div', attrs={'class':'post inner'})
        ,dict(name='div', attrs={'class':'author-bio'})
    ]

    remove_tags = [
        dict(name='object')
        ,dict(name='div', attrs={'class':['col3','post-options','social']})
        ,dict(name='p' , attrs={'class':['letter-editor','meta']})
    ]

    feeds = [ (u'Articles', u'http://feeds.feedburner.com/amspecarticles')]

    def get_cover_url(self):
recipes/android_com_pl.recipe (new file, 12 lines)
@@ -0,0 +1,12 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Android_com_pl(BasicNewsRecipe):
    title = u'Android.com.pl'
    __author__ = 'fenuks'
    description = 'Android.com.pl - biggest polish Android site'
    category = 'Android, mobile'
    language = 'pl'
    cover_url = u'http://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Android_robot.svg/220px-Android_robot.svg.png'
    oldest_article = 8
    max_articles_per_feed = 100
    feeds = [(u'Android', u'http://android.com.pl/component/content/frontpage/frontpage.feed?type=rss')]
@@ -1,9 +1,10 @@
 import re
 from calibre.web.feeds.news import BasicNewsRecipe

 class AmericanProspect(BasicNewsRecipe):
     title = u'American Prospect'
-    __author__ = u'Michael Heinz'
+    __author__ = u'Michael Heinz, a.peter'
+    version = 2

     oldest_article = 30
     language = 'en'
     max_articles_per_feed = 100
@@ -11,16 +12,7 @@ class AmericanProspect(BasicNewsRecipe):
     no_stylesheets = True
     remove_javascript = True

-    preprocess_regexps = [
-       (re.compile(r'<body.*?<div class="pad_10L10R">', re.DOTALL|re.IGNORECASE), lambda match: '<body><div>'),
-       (re.compile(r'</div>.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</div></body>'),
-       (re.compile('\r'),lambda match: ''),
-       (re.compile(r'<!-- .+? -->', re.DOTALL|re.IGNORECASE), lambda match: ''),
-       (re.compile(r'<link .+?>', re.DOTALL|re.IGNORECASE), lambda match: ''),
-       (re.compile(r'<script.*?</script>', re.DOTALL|re.IGNORECASE), lambda match: ''),
-       (re.compile(r'<noscript.*?</noscript>', re.DOTALL|re.IGNORECASE), lambda match: ''),
-       (re.compile(r'<meta .*?/>', re.DOTALL|re.IGNORECASE), lambda match: ''),
-    ]
+    keep_only_tags = [dict(name='div', attrs={'class':'pad_10L10R'})]
+    remove_tags = [dict(name='form'), dict(name='div', attrs={'class':['bkt_caption','sharebox noprint','badgebox']})]

     feeds = [(u'Articles', u'feed://www.prospect.org/articles_rss.jsp')]
recipes/archeowiesci.recipe (new file, 21 lines)
@@ -0,0 +1,21 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Archeowiesci(BasicNewsRecipe):
    title = u'Archeowiesci'
    __author__ = 'fenuks'
    category = 'archeology'
    language = 'pl'
    cover_url = 'http://archeowiesci.pl/wp-content/uploads/2011/05/Archeowiesci2-115x115.jpg'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = True
    remove_tags = [dict(name='span', attrs={'class':['post-ratings', 'post-ratings-loading']})]
    feeds = [(u'Archeowieści', u'http://archeowiesci.pl/feed/')]

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            for article in feed.articles[:]:
                if 'subskrypcja' in article.title:
                    feed.articles.remove(article)
        return feeds
recipes/astro_news_pl.recipe (new file, 18 lines)
@@ -0,0 +1,18 @@
from calibre.web.feeds.news import BasicNewsRecipe

class AstroNEWS(BasicNewsRecipe):
    title = u'AstroNEWS'
    __author__ = 'fenuks'
    description = 'AstroNEWS - astronomy every day'
    category = 'astronomy, science'
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100
    auto_cleanup = True
    cover_url = 'http://news.astronet.pl/img/logo_news.jpg'
    # no_stylesheets = True
    feeds = [(u'Wiadomości', u'http://news.astronet.pl/rss.cgi')]

    def print_version(self, url):
        return url.replace('astronet.pl/', 'astronet.pl/print.cgi?')
recipes/astronomia_pl.recipe (new file, 15 lines)
@@ -0,0 +1,15 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Astronomia_pl(BasicNewsRecipe):
    title = u'Astronomia.pl'
    __author__ = 'fenuks'
    description = 'Astronomia - polish astronomy site'
    cover_url = 'http://www.astronomia.pl/grafika/logo.gif'
    category = 'astronomy, science'
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100
    #no_stylesheets = True
    remove_tags_before = dict(name='div', attrs={'id':'a1'})
    keep_only_tags = [dict(name='div', attrs={'id':['a1', 'h2']})]
    feeds = [(u'Wiadomości z astronomii i astronautyki', u'http://www.astronomia.pl/rss/')]
recipes/bash_org_pl.recipe (new file, 52 lines)
@@ -0,0 +1,52 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Bash_org_pl(BasicNewsRecipe):
    title = u'Bash.org.pl'
    __author__ = 'fenuks'
    description = 'Bash.org.pl - funny quotations from IRC discussions'
    category = 'funny quotations, humour'
    language = 'pl'
    cover_url = u'http://userlogos.org/files/logos/dzikiosiol/none_0.png'
    max_articles_per_feed = 50
    no_stylesheets = True
    keep_only_tags = [dict(name='a', attrs={'class':'qid click'}),
                      dict(name='div', attrs={'class':'quote post-content post-body'})]

    def latest_articles(self):
        articles = []
        soup = self.index_to_soup(u'http://bash.org.pl/latest/')
        #date = soup.find('div', attrs={'class':'right'}).string
        tags = soup.findAll('a', attrs={'class':'qid click'})
        for a in tags:
            title = a.string
            url = 'http://bash.org.pl' + a['href']
            articles.append({'title': title,
                             'url': url,
                             'date': '',
                             'description': ''
                             })
        return articles

    def random_articles(self):
        articles = []
        for i in range(self.max_articles_per_feed):
            soup = self.index_to_soup(u'http://bash.org.pl/random/')
            #date = soup.find('div', attrs={'class':'right'}).string
            url = soup.find('a', attrs={'class':'qid click'})
            title = url.string
            url = 'http://bash.org.pl' + url['href']
            articles.append({'title': title,
                             'url': url,
                             'date': '',
                             'description': ''
                             })
        return articles

    def parse_index(self):
        feeds = []
        feeds.append((u"Najnowsze", self.latest_articles()))
        feeds.append((u"Losowe", self.random_articles()))
        return feeds
@@ -36,8 +36,9 @@ class BBC(BasicNewsRecipe):
               ]

     remove_tags = [
-        dict(name='div', attrs={'class':['story-feature related narrow', 'share-help', 'embedded-hyper', \
-                                'story-feature wide ', 'story-feature narrow']})
+        dict(name='div', attrs={'class':['story-feature related narrow', 'share-help', 'embedded-hyper',
+                                'story-feature wide ', 'story-feature narrow']}),
+        dict(id=['hypertab', 'comment-form']),
     ]

     remove_attributes = ['width','height']
70
recipes/benchmark_pl.recipe
Normal file
@ -0,0 +1,70 @@
from calibre.web.feeds.news import BasicNewsRecipe
import re

class Benchmark_pl(BasicNewsRecipe):
    title = u'Benchmark.pl'
    __author__ = 'fenuks'
    description = u'benchmark.pl -IT site'
    cover_url = 'http://www.ieaddons.pl/benchmark/logo_benchmark_new.gif'
    category = 'IT'
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100
    no_stylesheets = True
    preprocess_regexps = [(re.compile(ur'\bWięcej o .*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</body>')]
    keep_only_tags = [dict(name='div', attrs={'class':['m_zwykly', 'gallery']})]
    remove_tags_after = dict(name='div', attrs={'class':'body'})
    remove_tags = [dict(name='div', attrs={'class':['kategoria', 'socialize', 'thumb', 'panelOcenaObserwowane', 'categoryNextToSocializeGallery']})]
    INDEX = 'http://www.benchmark.pl'
    feeds = [(u'Aktualności', u'http://www.benchmark.pl/rss/aktualnosci-pliki.xml'),
             (u'Testy i recenzje', u'http://www.benchmark.pl/rss/testy-recenzje-minirecenzje.xml')]

    def append_page(self, soup, appendtag):
        nexturl = soup.find('span', attrs={'class':'next'})
        while nexturl is not None:
            nexturl = self.INDEX + nexturl.parent['href']
            soup2 = self.index_to_soup(nexturl)
            nexturl = soup2.find('span', attrs={'class':'next'})
            pagetext = soup2.find(name='div', attrs={'class':'body'})
            appendtag.find('div', attrs={'class':'k_ster'}).extract()
            pos = len(appendtag.contents)
            appendtag.insert(pos, pagetext)
        if appendtag.find('div', attrs={'class':'k_ster'}) is not None:
            appendtag.find('div', attrs={'class':'k_ster'}).extract()

    def image_article(self, soup, appendtag):
        nexturl = soup.find('div', attrs={'class':'preview'})
        if nexturl is not None:
            nexturl = nexturl.find('a', attrs={'class':'move_next'})
            image = appendtag.find('div', attrs={'class':'preview'}).div['style'][16:]
            image = self.INDEX + image[:image.find("')")]
            appendtag.find(attrs={'class':'preview'}).name = 'img'
            appendtag.find(attrs={'class':'preview'})['src'] = image
            appendtag.find('a', attrs={'class':'move_next'}).extract()
        while nexturl is not None:
            nexturl = self.INDEX + nexturl['href']
            soup2 = self.index_to_soup(nexturl)
            nexturl = soup2.find('a', attrs={'class':'move_next'})
            image = soup2.find('div', attrs={'class':'preview'}).div['style'][16:]
            image = self.INDEX + image[:image.find("')")]
            soup2.find(attrs={'class':'preview'}).name = 'img'
            soup2.find(attrs={'class':'preview'})['src'] = image
            pagetext = soup2.find('div', attrs={'class':'gallery'})
            pagetext.find('div', attrs={'class':'title'}).extract()
            pagetext.find('div', attrs={'class':'thumb'}).extract()
            pagetext.find('div', attrs={'class':'panelOcenaObserwowane'}).extract()
            if nexturl is not None:
                pagetext.find('a', attrs={'class':'move_next'}).extract()
            pagetext.find('a', attrs={'class':'move_back'}).extract()
            pos = len(appendtag.contents)
            appendtag.insert(pos, pagetext)

    def preprocess_html(self, soup):
        if soup.find('div', attrs={'class':'preview'}) is not None:
            self.image_article(soup, soup.body)
        else:
            self.append_page(soup, soup.body)
        return soup
61
recipes/berliner_zeitung.recipe
Normal file
@ -0,0 +1,61 @@
from calibre.web.feeds.recipes import BasicNewsRecipe
import re

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__ = 'ape'
    __copyright__ = 'ape'
    __license__ = 'GPL v3'
    language = 'de'
    description = 'Berliner Zeitung'
    version = 2
    title = u'Berliner Zeitung'
    timefmt = ' [%d.%m.%Y]'

    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False
    publication_type = 'newspaper'

    keep_only_tags = [dict(name='div', attrs={'class':'teaser t_split t_artikel'})]

    INDEX = 'http://www.berlinonline.de/berliner-zeitung/'

    def parse_index(self):
        base = 'http://www.berlinonline.de'
        answer = []
        articles = {}
        more = 1

        soup = self.index_to_soup(self.INDEX)

        # Get list of links to ressorts from index page
        ressort_list = soup.findAll('ul', attrs={'class': re.compile('ressortlist')})
        for ressort in ressort_list[0].findAll('a'):
            feed_title = ressort.string
            print 'Analyzing', feed_title
            if not articles.has_key(feed_title):
                articles[feed_title] = []
                answer.append(feed_title)
            # Load ressort page.
            feed = self.index_to_soup('http://www.berlinonline.de' + ressort['href'])
            # find mainbar div which contains the list of all articles
            for article_container in feed.findAll('div', attrs={'class': re.compile('mainbar')}):
                # iterate over all articles
                for article_teaser in article_container.findAll('div', attrs={'class': re.compile('teaser')}):
                    # extract title of article
                    if article_teaser.h3 != None:
                        article = {'title' : article_teaser.h3.a.string, 'date' : u'', 'url' : base + article_teaser.h3.a['href'], 'description' : u''}
                        articles[feed_title].append(article)
                    else:
                        # Skip teasers for missing photos
                        if article_teaser.div.p.contents[0].find('Foto:') > -1:
                            continue
                        article = {'title': 'Weitere Artikel ' + str(more), 'date': u'', 'url': base + article_teaser.div.p.a['href'], 'description': u''}
                        articles[feed_title].append(article)
                        more += 1
        answer = [[key, articles[key]] for key in answer if articles.has_key(key)]
        return answer

    def get_masthead_url(self):
        return 'http://www.berlinonline.de/.img/berliner-zeitung/blz_logo.gif'
31
recipes/brasil_de_fato.recipe
Normal file
@ -0,0 +1,31 @@
# -*- coding: utf-8 -*-

from calibre.web.feeds.news import BasicNewsRecipe

class BrasilDeFato(BasicNewsRecipe):
    news = True
    title = u'Brasil de Fato'
    __author__ = 'Alex Mitrani'
    description = u'Uma visão popular do Brasil e do mundo.'
    publisher = u'SOCIEDADE EDITORIAL BRASIL DE FATO'
    category = 'news, politics, Brazil, rss, Portuguese'
    oldest_article = 10
    max_articles_per_feed = 100
    summary_length = 1000
    language = 'pt_BR'

    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    remove_empty_feeds = True
    masthead_url = 'http://www.brasildefato.com.br/sites/default/files/zeropoint_logo.jpg'
    keep_only_tags = [dict(name='div', attrs={'id':'main'})]
    remove_tags = [dict(name='div', attrs={'class':'links'})]
    remove_tags_after = [dict(name='div', attrs={'class':'links'})]

    feeds = [(u'Nacional', u'http://www.brasildefato.com.br/rss_nacional')
             ,(u'Internacional', u'http://www.brasildefato.com.br/rss_internacional')
             ,(u'Entrevista', u'http://www.brasildefato.com.br/rss_entrevista')
             ,(u'Cultura', u'http://www.brasildefato.com.br/rss_cultura')
             ,(u'Análise', u'http://www.brasildefato.com.br/rss_analise')
            ]
57
recipes/bugun_gazetesi.recipe
Normal file
@ -0,0 +1,57 @@
# -*- coding: utf-8 -*-

from calibre.web.feeds.news import BasicNewsRecipe

class Bugun (BasicNewsRecipe):

    title = u'BUGÜN Gazetesi'
    __author__ = u'thomass'
    oldest_article = 2
    max_articles_per_feed = 100
    #no_stylesheets = True
    #delay = 1
    use_embedded_content = False
    encoding = 'UTF-8'
    publisher = 'thomass'
    category = 'news, haberler,TR,gazete'
    language = 'tr'
    publication_type = 'newspaper '
    extra_css = ' div{font-size: small} h2{font-size: small;font-weight: bold} #ctl00_ortayer_haberBaslik{font-size:20px;font-weight: bold} '#h1{ font-size:10%;font-weight: bold} '#ctl00_ortayer_haberBaslik{ 'font-size:10%;font-weight: bold'}
    #introduction{} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
    conversion_options = {
        'tags'     : category
        ,'language' : language
        ,'publisher' : publisher
        ,'linearize_tables': True
    }
    cover_img_url = 'http://www.bugun.com.tr/images/bugunLogo2011.png'
    masthead_url = 'http://www.bugun.com.tr/images/bugunLogo2011.png'

    keep_only_tags = [dict(name='h1', attrs={'class':[ 'haberBaslik']}),dict(name='h2', attrs={'class':[ 'haberOzet']}), dict(name='div', attrs={'class':['haberGriDivvvv']}), dict(name='div', attrs={'id':[ 'haberTextDiv']}), ]

    #keep_only_tags = [dict(name='div', attrs={'id':[ 'news-detail-content']}), dict(name='td', attrs={'class':['columnist-detail','columnist_head']}) ]
    #remove_tags = [ dict(name='div', attrs={'id':['news-detail-news-text-font-size','news-detail-gallery','news-detail-news-bottom-social']}),dict(name='div', attrs={'class':['radioEmbedBg','radyoProgramAdi']}),dict(name='a', attrs={'class':['webkit-html-attribute-value webkit-html-external-link']}),dict(name='table', attrs={'id':['yaziYorumTablosu']}),dict(name='img', attrs={'src':['http://medya.zaman.com.tr/pics/paylas.gif','http://medya.zaman.com.tr/extentions/zaman.com.tr/img/columnist/ma-16.png']})]

    #remove_attributes = ['width','height']
    remove_empty_feeds = True

    feeds = [
        ( u'Son Dakika', u'http://www.bugun.com.tr/haberler.xml'),
        ( u'Yazarlar', u'http://www.bugun.com.tr/rss/yazarlar.xml'),
        ( u'Gündem', u'http://www.bugun.com.tr/rss/gundem.xml'),
        ( u'Ekonomi', u'http://www.bugun.com.tr/rss/ekonomi.xml'),
        ( u'Spor', u'http://www.bugun.com.tr/rss/spor.xml'),
        ( u'Magazin', u'http://www.bugun.com.tr/rss/magazin.xml'),
        ( u'Teknoloji', u'http://www.bugun.com.tr/rss/teknoloji.xml'),
        ( u'Yaşam', u'http://www.bugun.com.tr/rss/yasam.xml'),
        ( u'Medya', u'http://www.bugun.com.tr/rss/medya.xml'),
        ( u'Dünya', u'http://www.bugun.com.tr/rss/dunya.xml'),
        ( u'Politika', u'http://www.bugun.com.tr/rss/politika.xml'),
        ( u'Sağlık', u'http://www.bugun.com.tr/rss/saglik.xml'),
        ( u'Tarifler', u'http://www.bugun.com.tr/rss/yemek-tarifi.xml'),
    ]
@ -1,59 +1,54 @@
#!/usr/bin/env python
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
__docformat__ = 'restructuredtext en'

__copyright__ = '2008 Kovid Goyal kovid@kovidgoyal.net, 2010 Darko Miletic <darko.miletic at gmail.com>'
'''
businessweek.com
www.businessweek.com
'''

from calibre.web.feeds.news import BasicNewsRecipe

class BusinessWeek(BasicNewsRecipe):
    title = 'Business Week'
    description = 'Business News, Stock Market and Financial Advice'
    __author__ = 'ChuckEggDotCom and Sujata Raman'
    language = 'en'

    __author__ = 'Kovid Goyal and Darko Miletic'
    description = 'Read the latest international business news & stock market news. Get updated company profiles, financial advice, global economy and technology news.'
    publisher = 'Bloomberg L.P.'
    category = 'Business, business news, stock market, stock market news, financial advice, company profiles, financial advice, global economy, technology news'
    oldest_article = 7
    max_articles_per_feed = 10
    max_articles_per_feed = 200
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    remove_empty_feeds = True
    publication_type = 'magazine'
    cover_url = 'http://images.businessweek.com/mz/covers/current_120x160.jpg'
    masthead_url = 'http://assets.businessweek.com/images/bw-logo.png'
    extra_css = """
        body{font-family: Helvetica,Arial,sans-serif }
        img{margin-bottom: 0.4em; display:block}
        .tagline{color: gray; font-style: italic}
        .photoCredit{font-size: small; color: gray}
    """

    recursions = 1
    match_regexps = [r'http://www.businessweek.com/.*_page_[1-9].*']
    extra_css = '''
        h1{font-family :Arial,Helvetica,sans-serif; font-size:large;}
        .news_story_title{font-family :Arial,Helvetica,sans-serif; font-size:large;font-weight:bold;}
        h2{font-family :Arial,Helvetica,sans-serif; font-size:medium;color:#666666;}
        h3{text-transform:uppercase;font-family :Arial,Helvetica,sans-serif; font-size:large;font-weight:bold;}
        h4{font-family :Arial,Helvetica,sans-serif; font-size:small;font-weight:bold;}
        p{font-family :Arial,Helvetica,sans-serif; }
        #lede600{font-size:x-small;}
        #storybody{font-size:x-small;}
        p{font-family :Arial,Helvetica,sans-serif;}
        .strap{font-family :Arial,Helvetica,sans-serif; font-size:x-small; color:#064599;}
        .byline{font-family :Arial,Helvetica,sans-serif; font-size:x-small;}
        .postedBy{font-family :Arial,Helvetica,sans-serif; font-size:x-small;color:#666666;}
        .trackback{font-family :Arial,Helvetica,sans-serif; font-size:x-small;color:#666666;}
        .date{font-family :Arial,Helvetica,sans-serif; font-size:x-small;color:#666666;}
        .wrapper{font-family :Arial,Helvetica,sans-serif; font-size:x-small;}
        .photoCredit{font-family :Arial,Helvetica,sans-serif; font-size:x-small;color:#666666;}
        .tagline{font-family :Arial,Helvetica,sans-serif; font-size:x-small;color:#666666;}
        .pageCount{color:#666666;font-family :Arial,Helvetica,sans-serif; font-size:x-small;}
        .note{font-family :Arial,Helvetica,sans-serif; font-size:small;color:#666666;font-style:italic;}
        .highlight{font-family :Arial,Helvetica,sans-serif; font-size:small;background-color:#FFF200;}
        .annotation{font-family :Arial,Helvetica,sans-serif; font-size:x-small;color:#666666;}
    '''
    conversion_options = {
        'comment'   : description
        , 'tags'     : category
        , 'publisher' : publisher
        , 'language' : language
    }

    remove_tags = [ dict(name='div', attrs={'id':["log","feedback","footer","secondarynav","secondnavbar","header","email","bw2-header","column2","wrapper-bw2-footer","wrapper-mgh-footer","inset","commentForm","commentDisplay","bwExtras","bw2-umbrella","readerComments","leg","rightcol"]}),
        dict(name='div', attrs={'class':["menu",'sponsorbox smallertext',"TopNavTile","graybottom leaderboard"]}),
        dict(name='img', alt ="News"),
        dict(name='td', width ="1"),
    remove_tags = [
        dict(attrs={'class':'inStory'})
        ,dict(name=['meta','link','iframe','base','embed','object','table','th','tr','td'])
        ,dict(attrs={'id':['inset','videoDisplay']})
    ]
    keep_only_tags = [dict(name='div', attrs={'id':['story-body','storyBody','article_body','articleBody']})]
    remove_attributes = ['lang']
    match_regexps = [r'http://www.businessweek.com/.*_page_[1-9].*']

    feeds = [
        (u'Top Stories', u'http://www.businessweek.com/topStories/rss/topStories.rss'),
        (u'Top News', u'http://www.businessweek.com/rss/bwdaily.rss'),
        (u'Top News' , u'http://www.businessweek.com/rss/bwdaily.rss' ),
        (u'Asia', u'http://www.businessweek.com/rss/asia.rss'),
        (u'Autos', u'http://www.businessweek.com/rss/autos/index.rss'),
        (u'Classic Cars', u'http://rss.businessweek.com/bw_rss/classiccars'),
@ -75,19 +70,36 @@ class BusinessWeek(BasicNewsRecipe):
    ]

    def get_article_url(self, article):
        url = article.get('guid', None)
        if 'podcasts' in url:
            return None
        if 'surveys' in url:
            return None
        if 'images' in url:
            return None
        if 'feedroom' in url:
            return None
        if '/magazine/toc/' in url:
            return None
        rurl, sep, rest = url.rpartition('?')
        if rurl:
            return rurl
        return rest

        if 'podcasts' in url or 'surveys' in url:
            url = None

    def print_version(self, url):
        if '/news/' in url or '/blog/' in url:
            return url
        if '/magazine' in url:
            rurl = url.replace('http://www.businessweek.com/','http://www.businessweek.com/printer/')
        else:
            rurl = url.replace('http://www.businessweek.com/','http://www.businessweek.com/print/')
        return rurl.replace('/investing/','/investor/')

    def postprocess_html(self, soup, first):
        for tag in soup.findAll(name=['ul','li','table','td','tr','span']):
            tag.name = 'div'
        for tag in soup.findAll(name= 'div',attrs={ 'id':'pageNav'}):
            tag.extract()
    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup
@ -4,95 +4,73 @@ __copyright__ = '2009-2010, Darko Miletic <darko.miletic at gmail.com>'
www.businessworld.in
'''

from calibre import strftime
import re
from calibre.web.feeds.news import BasicNewsRecipe

class BusinessWorldMagazine(BasicNewsRecipe):
    title = 'Business World Magazine'
    __author__ = 'Darko Miletic'
    __author__ = 'Kovid Goyal'
    description = 'News from India'
    publisher = 'ABP Pvt Ltd Publication'
    category = 'news, politics, finances, India, Asia'
    delay = 1
    no_stylesheets = True
    INDEX = 'http://www.businessworld.in/bw/Magazine_Current_Issue'
    INDEX = 'http://www.businessworld.in/businessworld/magazine_latest_issue.php'
    ROOT = 'http://www.businessworld.in'
    use_embedded_content = False
    encoding = 'utf-8'
    language = 'en_IN'
    extra_css = """
        img{display: block; margin-bottom: 0.5em}
        body{font-family: Arial,Helvetica,sans-serif}
        h2{color: gray; display: block}
    """

    conversion_options = {
        'comment'   : description
        , 'tags'     : category
        , 'publisher' : publisher
        , 'language' : language
    }

    def is_in_list(self,linklist,url):
        for litem in linklist:
            if litem == url:
                return True
        return False

    auto_cleanup = True

    def parse_index(self):
        br = self.browser
        br.open(self.ROOT)
        raw = br.open(br.click_link(text_regex=re.compile('Current.*Issue',
            re.I))).read()
        soup = self.index_to_soup(raw)
        mc = soup.find(attrs={'class':'mag_cover'})
        if mc is not None:
            img = mc.find('img', src=True)
            if img is not None:
                self.cover_url = img['src']

        feeds = []
        current_section = None
        articles = []
        linklist = []
        soup = self.index_to_soup(self.INDEX)
        for tag in soup.findAll(['h3', 'h2']):
            inner_a = tag.find('a')
            if tag.name == 'h3' and inner_a is not None:
                continue
            if tag.name == 'h2' and (inner_a is None or current_section is None):
                continue

            if tag.name == 'h3':
                if current_section is not None and articles:
                    feeds.append((current_section, articles))
                current_section = self.tag_to_string(tag)
                self.log('Found section:', current_section)
                articles = []
            elif tag.name == 'h2':
                url = inner_a.get('href', None)
                if url is None: continue
                if url.startswith('/'): url = self.ROOT + url
                title = self.tag_to_string(inner_a)
                h1 = tag.findPreviousSibling('h1')
                if h1 is not None:
                    title = self.tag_to_string(h1) + title
                self.log('\tFound article:', title)
                articles.append({'title':title, 'url':url, 'date':'',
                    'description':''})

        if current_section and articles:
            feeds.append((current_section, articles))

        return feeds

        tough = soup.find('div', attrs={'id':'tough'})
        if tough:
            for item in tough.findAll('h1'):
                description = ''
                title_prefix = ''
                feed_link = item.find('a')
                if feed_link and feed_link.has_key('href'):
                    url = self.ROOT + feed_link['href']
                    if not self.is_in_list(linklist,url):
                        title = title_prefix + self.tag_to_string(feed_link)
                        date = strftime(self.timefmt)
                        articles.append({
                            'title'      :title
                            ,'date'       :date
                            ,'url'        :url
                            ,'description':description
                        })
                        linklist.append(url)

        for item in soup.findAll('div', attrs={'class':'nametitle'}):
            description = ''
            title_prefix = ''
            feed_link = item.find('a')
            if feed_link and feed_link.has_key('href'):
                url = self.ROOT + feed_link['href']
                if not self.is_in_list(linklist,url):
                    title = title_prefix + self.tag_to_string(feed_link)
                    date = strftime(self.timefmt)
                    articles.append({
                        'title'      :title
                        ,'date'       :date
                        ,'url'        :url
                        ,'description':description
                    })
                    linklist.append(url)
        return [(soup.head.title.string, articles)]

    keep_only_tags = [dict(name='div', attrs={'id':'printwrapper'})]
    remove_tags = [dict(name=['object','link','meta','base','iframe','link','table'])]

    def print_version(self, url):
        return url.replace('/bw/','/bw/storyContent/')

    def get_cover_url(self):
        cover_url = None
        soup = self.index_to_soup(self.INDEX)
        cover_item = soup.find('img',attrs={'class':'toughbor'})
        if cover_item:
            cover_url = self.ROOT + cover_item['src']
        return cover_url
73
recipes/cbn.recipe
Normal file
@ -0,0 +1,73 @@
from calibre.web.feeds.news import BasicNewsRecipe


class CBN(BasicNewsRecipe):
    title = u'CBN News'
    __author__ = 'Roger'
    # TODO: I just noticed this is downloading 25+ articles, while
    # the online site is only publishing at most 7 articles daily.
    # So, somehow this needs to be fixed so it only downloads at most 7 articles.
    oldest_article = 7
    max_articles_per_feed = 100

    description = 'The Christian Broadcasting Network'
    publisher = 'http://www.cbn.com/'
    category = 'news, religion, spiritual, christian'
    language = 'en'

    # Make article titles, author and date bold, italic or small font.
    # TODO: Could use a smaller title text
    # TODO: Italicize Author and Publisher?
    #
    # http://www.cbn.com/App_Themes/Common/base.css,
    # http://www.cbn.com/App_Themes/CBNNews/article.css",
    # ... and many more style sheets.
    #extra_css = '''
    #    .story_item_headline { font-size: medium; font-weight: bold; }
    #    .story_item_author { font-size: small; font-style:italic; }
    #    .signature_line { font-size: small; }
    #    '''

    remove_javascript = True
    use_embedded_content = False
    no_stylesheets = True
    language = 'en'
    encoding = 'iso-8859-1'
    conversion_options = {'linearize_tables':True}

    # TODO: No masthead_url for CBN, using one grepped from a news article.
    # (There's a better/higher contrast blue on white background image, but
    # can't get it or it's too big -- embedded into a larger jpeg?)
    masthead_url = 'http://www.cbn.com/templates/images/cbn_com_logo.jpg'

    keep_only_tags = [
        dict(name='h1', attrs={'id':'articleTitle'}),
        dict(name='div', attrs={'class':'articleAuthor'}),
        dict(name='div', attrs={'class':'articleDate'}),
        dict(name='div', attrs={'class':'articleText'}),
    ]

    remove_tags = [
        # The article image is usually an Adobe Flash Player image.
        # The snapshot .jpg image files of the video are found
        # within a URL folder named "PageFiles_Files";
        # filter this for now.
        # (The majority of images seem to be Adobe Flash.)
        dict(name='div', attrs={'class':'articleImage'}),
    ]

    # Comment out or uncomment any of the following RSS feeds according to your
    # liking.
    # A full list can be found here: http://www.cbn.com/rss.aspx

    feeds = [
        (u'World', u'http://www.cbn.com/cbnnews/world/feed/'),
        (u'US', u'http://www.cbn.com/cbnnews/us/feed/'),
        (u'Inside Israel', u'http://www.cbn.com/cbnnews/insideisrael/feed/'),
        (u'Politics', u'http://www.cbn.com/cbnnews/politics/feed/'),
        (u'Christian World News', u'http://www.cbn.com/cbnnews/shows/cwn/feed/'),
        (u'Health and Science', u'http://www.cbn.com/cbnnews/healthscience/feed/'),
        (u'Finance', u'http://www.cbn.com/cbnnews/finance/feed/'),
    ]
16
recipes/cd_action.recipe
Normal file
@ -0,0 +1,16 @@
from calibre.web.feeds.news import BasicNewsRecipe


class CD_Action(BasicNewsRecipe):
    title = u'CD-Action'
    __author__ = 'fenuks'
    description = 'cdaction.pl - polish magazine about games site'
    category = 'games'
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = u'http://s.cdaction.pl/obrazki/logo-CD-Action_172k9.JPG'
    keep_only_tags = dict(id='news_content')
    remove_tags_after = dict(name='div', attrs={'class':'tresc'})
    feeds = [(u'Newsy', u'http://www.cdaction.pl/rss_newsy.xml')]
43
recipes/cgm_pl.recipe
Normal file
@ -0,0 +1,43 @@
from calibre.web.feeds.news import BasicNewsRecipe

class CGM(BasicNewsRecipe):
    title = u'CGM'
    oldest_article = 7
    __author__ = 'fenuks'
    description = u'Codzienna Gazeta Muzyczna'
    cover_url = 'http://www.krafcy.com/foto/tinymce/Image/cgm%281%29.jpg'
    category = 'music'
    language = 'pl'
    use_embedded_content = False
    remove_empty_feeds = True
    max_articles_per_feed = 100
    no_stylesheets = True
    extra_css = 'div {color:black;} strong {color:black;} span {color:black;} p {color:black;} h2 {color:black;}'
    remove_tags_before = dict(id='mainContent')
    remove_tags_after = dict(name='div', attrs={'class':'fbContainer'})
    remove_tags = [dict(name='div', attrs={'class':'fbContainer'}),
                   dict(name='p', attrs={'class':['tagCloud', 'galleryAuthor']}),
                   dict(id=['movieShare', 'container'])]
    feeds = [(u'Informacje', u'http://www.cgm.pl/rss.xml'), (u'Polecamy', u'http://www.cgm.pl/rss,4,news.xml'),
             (u'Recenzje', u'http://www.cgm.pl/rss,1,news.xml')]

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        ad = soup.findAll('a')
        for r in ad:
            if 'http://www.hustla.pl' in r['href']:
                r.extract()
        gallery = soup.find('div', attrs={'class':'galleryFlash'})
        if gallery:
            img = gallery.find('embed')
            if img:
                img = img['src'][35:]
                img = 'http://www.cgm.pl/_vault/_gallery/_photo/' + img
                param = gallery.findAll(name='param')
                for i in param:
                    i.extract()
                gallery.contents[1].name = 'img'
                gallery.contents[1]['src'] = img
        return soup
@ -8,21 +8,25 @@ from calibre.web.feeds.news import BasicNewsRecipe
class ChicagoTribune(BasicNewsRecipe):

    title = 'Chicago Tribune'
    __author__ = 'Kovid Goyal and Sujata Raman'
    __author__ = 'Kovid Goyal and Sujata Raman, a.peter'
    description = 'Politics, local and business news from Chicago'
    language = 'en'
    version = 2

    use_embedded_content = False
    no_stylesheets = True
    remove_javascript = True
    recursions = 1

    keep_only_tags = [dict(name='div', attrs={'class':["story","entry-asset asset hentry"]}),
                      dict(name='div', attrs={'id':["pagebody","story","maincontentcontainer"]}),
                     ]
    remove_tags_after = [ {'class':['photo_article',]} ]
    remove_tags_after = [{'class':['photo_article',]}]

    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer"]},
        {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent"]},
    match_regexps = [r'page=[0-9]+']

    remove_tags = [{'id':["moduleArticleTools","content-bottom","rail","articleRelates module","toolSet","relatedrailcontent","div-wrapper","beta","atp-comments","footer",'gallery-subcontent','subFooter']},
        {'class':["clearfix","relatedTitle","articleRelates module","asset-footer","tools","comments","featurePromo","featurePromo fp-topjobs brownBackground","clearfix fullSpan brownBackground","curvedContent",'nextgen-share-tools','outbrainTools', 'google-ad-story-bottom']},
        dict(name='font',attrs={'id':["cr-other-headlines"]})]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
@ -76,8 +80,12 @@ class ChicagoTribune(BasicNewsRecipe):
        print article.get('feedburner_origlink', article.get('guid', article.get('link')))
        return article.get('feedburner_origlink', article.get('guid', article.get('link')))

    def postprocess_html(self, soup, first_fetch):
        # Remove the navigation bar. It was kept until now to be able to follow
        # the links to further pages. But now we don't need them anymore.
        for nav in soup.findAll(attrs={'class':['toppaginate','article-nav clearfix']}):
            nav.extract()

        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

@ -88,4 +96,3 @@ class ChicagoTribune(BasicNewsRecipe):

        return soup
29
recipes/china_post.recipe
Normal file
@ -0,0 +1,29 @@
from calibre.web.feeds.news import BasicNewsRecipe

class CP(BasicNewsRecipe):
    title = u'China Post'
    language = 'en_CN'
    __author__ = 'Krittika Goyal'
    oldest_article = 1  # days
    max_articles_per_feed = 25
    use_embedded_content = False

    no_stylesheets = True
    auto_cleanup = True

    feeds = [
        ('Top Stories',
         'http://www.chinapost.com.tw/rss/front.xml'),
        ('Taiwan',
         'http://www.chinapost.com.tw/rss/taiwan.xml'),
        ('China',
         'http://www.chinapost.com.tw/rss/china.xml'),
        ('Business',
         'http://www.chinapost.com.tw/rss/business.xml'),
        ('World',
         'http://www.chinapost.com.tw/rss/international.xml'),
        ('Sports',
         'http://www.chinapost.com.tw/rss/sports.xml'),
    ]
@@ -1,35 +1,52 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Cicero(BasicNewsRecipe):
    timefmt = ' [%Y-%m-%d]'
    title = u'Cicero'
    __author__ = 'mad@sharktooth.de'
    description = u'Magazin f\xfcr politische Kultur'
    oldest_article = 7
class BasicUserRecipe1316245412(BasicNewsRecipe):
    #from calibre.utils.magick import Image, PixelWand
    title = u'Cicero Online'
    description = u'Magazin f\xfcr politische Kultur (RSS Version)'
    publisher = 'Ringier Publishing GmbH'
    category = 'news, politics, Germany'
    language = 'de'
    encoding = 'UTF-8'
    __author__ = 'Armin Geller' # Upd. 2011-09-23

    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    publisher = 'Ringier Publishing'
    category = 'news, politics, Germany'
    encoding = 'iso-8859-1'
    publication_type = 'magazine'
    masthead_url = 'http://www.cicero.de/img2/cicero_logo_rss.gif'
    auto_cleanup = False

    # remove_javascript = True

    remove_tags = [
        dict(name='div', attrs={'id':["header", "navigation", "skip-link", "header-print", "header-print-url", "meta-toolbar", "footer"]}),
        dict(name='div', attrs={'class':["region region-sidebar-first column sidebar", "breadcrumb",
            "breadcrumb-title", "meta", "comment-wrapper",
            "field field-name-field-show-teaser-right field-type-list-boolean field-label-above",
            "page-header",
            "view view-alle-karikaturen view-id-alle_karikaturen view-display-id-default view-dom-id-1",
            "pagination",
            "view view-letzte-videos view-id-letzte_videos view-display-id-default view-dom-id-1",
            "view view-letzte-videos view-id-letzte_videos view-display-id-default view-dom-id-2", # 2011-09-23
            "view view-alle-karikaturen view-id-alle_karikaturen view-display-id-default view-dom-id-2", # 2011-09-23
            ]}),
        dict(name='div', attrs={'title':["Dossier Auswahl"]}),
        dict(name='h2', attrs={'class':["title comment-form"]}),
        dict(name='form', attrs={'class':["comment-form user-info-from-cookie"]}),
        dict(name='table', attrs={'class':["mcx-social-horizontal", "page-header"]}),
    ]

    feeds = [
        (u'Das gesamte Portfolio', u'http://www.cicero.de/rss/rss.php?ress_id='),
        #(u'Alle Heft-Inhalte', u'http://www.cicero.de/rss/rss.php?ress_id=heft'),
        #(u'Alle Online-Inhalte', u'http://www.cicero.de/rss/rss.php?ress_id=online'),
        #(u'Berliner Republik', u'http://www.cicero.de/rss/rss.php?ress_id=4'),
        #(u'Weltb\xfchne', u'http://www.cicero.de/rss/rss.php?ress_id=1'),
        #(u'Salon', u'http://www.cicero.de/rss/rss.php?ress_id=7'),
        #(u'Kapital', u'http://www.cicero.de/rss/rss.php?ress_id=6'),
        #(u'Netzst\xfccke', u'http://www.cicero.de/rss/rss.php?ress_id=9'),
        #(u'Leinwand', u'http://www.cicero.de/rss/rss.php?ress_id=12'),
        #(u'Bibliothek', u'http://www.cicero.de/rss/rss.php?ress_id=15'),
        (u'Kolumne - Alle Kolulmnen', u'http://www.cicero.de/rss/rss2.php?ress_id='),
        #(u'Kolumne - Schreiber, Berlin', u'http://www.cicero.de/rss/rss2.php?ress_id=35'),
        #(u'Kolumne - TV Kritik', u'http://www.cicero.de/rss/rss2.php?ress_id=34')
    ]
        (u'Das gesamte Portfolio', u'http://www.cicero.de/rss.xml'),
        (u'Berliner Republik', u'http://www.cicero.de/berliner-republik.xml'),
        (u'Weltb\xfchne', u'http://www.cicero.de/weltbuehne.xml'),
        (u'Kapital', u'http://www.cicero.de/kapital.xml'),
        (u'Salon', u'http://www.cicero.de/salon.xml'),
        (u'Blogs', u'http://www.cicero.de/blogs.xml'), # seems not to be in use at the moment
    ]

    def print_version(self, url):
        return 'http://www.cicero.de/page_print.php?' + url.rpartition('?')[2]
        return url + '?print'

    # def get_cover_url(self):
    #     return 'http://www.cicero.de/sites/all/themes/cicero/logo.png' # need to find a good logo on their home page!

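The replaced `print_version` above maps an article URL to a printer-friendly one; the old Cicero form relied on `str.rpartition('?')` to keep only the query string. A minimal standalone sketch of that rewrite (the example article URL is hypothetical):

```python
# Sketch of the old Cicero print_version rewrite: take everything after
# the last '?' in the article URL and hand it to page_print.php.

def cicero_print_version(url):
    # str.rpartition('?') splits on the LAST '?', returning
    # (head, '?', tail); [2] is the tail (the query string).
    return 'http://www.cicero.de/page_print.php?' + url.rpartition('?')[2]

print_url = cicero_print_version('http://www.cicero.de/artikel?97.php')
# → 'http://www.cicero.de/page_print.php?97.php'
```

Note that if the URL contains no `?`, `rpartition` returns the whole string as the tail, so the fallback is simply the original URL appended to the print endpoint.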
128
recipes/cio_magazine.recipe
Normal file
@@ -0,0 +1,128 @@
# The first comments describe the difficulties I had with Python.
# When you get a UTF8 error, check the comments (accents). In Notepad++: Search, Goto, position, and you see it.
# Edit with Notepad++. If it puts - where it shouldn't, the indentation is wrong... Edit - Blank operations - tab to space
# I understood what "from" means... these are paths inside pylib.zip...
# "from" imports just one symbol... "import" imports the whole library
from calibre.web.feeds.news import BasicNewsRecipe
# sys is not needed... I tried to use it to write to stderr
from calibre import strftime
# To convert the article's timestamp
import string, re
# To use regular expressions
# Seen in pylib.zip... the first letter is uppercase
# These last two were a feeble attempt at setting a cookie (unused)

class CIO_Magazine(BasicNewsRecipe):
    title = 'CIO Magazine'
    oldest_article = 14
    max_articles_per_feed = 100
    auto_cleanup = True
    __author__ = 'Julio Map'
    description = "CIO is the leading information brand for today's busy Chief Information Officer - CIO Magazine bi-monthly"
    language = 'en'
    encoding = 'utf8'
    cover_url = 'http://www.cio.com/homepage/images/hp-cio-logo-linkedin.png'

    remove_tags_before = dict(name='div', attrs={'id':'container'})
    # Completely unnecessary... in the end I found a print_version (see below)

    # Inside a given magazine issue...
    # issue_details contains the title and the sections of this issue
    # DetailModule, inside issue_details, contains the URLs and summaries
    # Inside a given article...
    # Article-default-body contains the text. But as I said, I found a print_version

    no_stylesheets = True
    remove_javascript = True

    def print_version(self, url):
        # The framework calls this method... don't call it yourself (it would be called twice)
        # There is a printable version of each article, obtained by changing
        # http://www.cio.com/article/<num>/<title> to
        # http://www.cio.com/article/print/<num>, which contains all the pages inside the div id=container
        if url.startswith('/'):
            url = 'http://www.cio.com'+url
        segments = url.split('/')
        printURL = '/'.join(segments[0:4]) + '/print/' + segments[4] +'#'
        return printURL

    def parse_index(self):
        ###########################################################################
        # This method should be implemented in recipes that parse a website
        # instead of feeds to generate a list of articles. Typical uses are for
        # news sources that have a Print Edition webpage that lists all the
        # articles in the current print edition. If this function is implemented,
        # it will be used in preference to BasicNewsRecipe.parse_feeds().
        #
        # It must return a list. Each element of the list must be a 2-element
        # tuple of the form ('feed title', list of articles).
        #
        # Each list of articles must contain dictionaries of the form:
        #
        # {
        #  'title'       : article title,
        #  'url'         : URL of print version,
        #  'date'        : The publication date of the article as a string,
        #  'description' : A summary of the article
        #  'content'     : The full article (can be an empty string). This is used by FullContentProfile
        # }
        #
        # For an example, see the recipe for downloading The Atlantic.
        # In addition, you can add 'author' for the author of the article.
        ###############################################################################

        # First we find out which is the most recent issue
        soupinicial = self.index_to_soup('http://www.cio.com/magazine')
        # It is the first link inside the DIV with class content_body
        a = soupinicial.find(True, attrs={'class':'content_body'}).find('a', href=True)
        INDEX = re.sub(r'\?.*', '', a['href'])
        # Since cio.com uses relative links, we prepend the domain name.
        if INDEX.startswith('/'):  # guarding against them dropping relative links
            INDEX = 'http://www.cio.com'+INDEX
        # And we confirm in the logs that we are doing it right
        print ("INDEX en parse_index: ", INDEX)

        # Now we know which issue it is... let's process it.
        soup = self.index_to_soup(INDEX)

        articles = {}
        key = None
        feeds = []
        # To start we keep only two DIVs, 'heading' and 'issue_item'
        # From the first we take the categories (key), from the second the URLs and summaries
        for div in soup.findAll(True,
            attrs={'class':['heading', 'issue_item']}):

            if div['class'] == 'heading':
                key = string.capwords(self.tag_to_string(div.span))
                print ("Key: ",key)  # This is for debugging
                articles[key] = []
                feeds.append(key)

            elif div['class'] == 'issue_item':
                a = div.find('a', href=True)
                if not a:
                    continue
                url = re.sub(r'\?.*', '', a['href'])
                print("url: ",url)  # This is for debugging
                title = self.tag_to_string(a, use_alt=True).strip()  # For extra credit, strip the last two words
                pubdate = strftime('%a, %d %b')  # Not the publication date but the collection date
                summary = div.find('p')  # Inside the div 'issue_item' the only paragraph is the summary
                description = ''  # If there is a summary, the description will be the summary... if not, leave it blank

                if summary:
                    description = self.tag_to_string(summary, use_alt=False)
                    print ("Description = ", description)

                feed = key if key is not None else 'Uncategorized'  # This is copied from the NY Times recipe
                if not articles.has_key(feed):
                    articles[feed] = []
                if not 'podcasts' in url:
                    articles[feed].append(
                        dict(title=title, url=url, date=pubdate,
                            description=description,
                            content=''))
        feeds = [(key, articles[key]) for key in feeds if articles.has_key(key)]
        return feeds
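The comment block in the CIO recipe above documents the structure `parse_index()` must return: a list of `('feed title', [article dict, ...])` tuples. A minimal standalone sketch of that shape, with no calibre dependency (the feed name and article fields are hypothetical placeholders):

```python
# Sketch of the data structure parse_index() must return, per the comment
# block in the recipe above. The feed title and article values here are
# hypothetical placeholders, not real CIO articles.

def build_index():
    articles = {}        # feed title -> list of article dicts
    feed_order = []      # preserves the order feeds were discovered in

    feed = 'Features'    # hypothetical section heading
    articles.setdefault(feed, [])
    if feed not in feed_order:
        feed_order.append(feed)

    articles[feed].append({
        'title': 'Example article',
        'url': 'http://www.cio.com/article/print/12345',
        'date': 'Fri, 30 Sep',
        'description': 'A one-line summary.',
        'content': '',   # may be empty; used by FullContentProfile
    })

    # Final shape: a list of ('feed title', [article, ...]) tuples,
    # dropping feeds that collected no articles
    return [(f, articles[f]) for f in feed_order if articles.get(f)]

index = build_index()
```

The two-step bookkeeping (a dict keyed by feed plus an ordered list of keys) is the same pattern the recipe itself uses to keep section order stable while grouping articles.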
@@ -2,6 +2,11 @@
__license__ = 'GPL v3'
__copyright__ = '2009, Darko Miletic <darko.miletic at gmail.com>'
'''
Changelog:
2011-09-24
Changed cover (drMerry)
'''
'''
news.cnet.com
'''

@@ -9,7 +14,7 @@ from calibre.web.feeds.news import BasicNewsRecipe

class CnetNews(BasicNewsRecipe):
    title = 'CNET News'
    __author__ = 'Darko Miletic'
    __author__ = 'Darko Miletic updated by DrMerry.'
    description = 'Tech news and business reports by CNET News. Focused on information technology, core topics include computers, hardware, software, networking, and Internet media.'
    publisher = 'CNET'
    category = 'news, IT, USA'

@@ -28,11 +28,12 @@ class CNN(BasicNewsRecipe):
        (re.compile(r'<style.*?</style>', re.DOTALL), lambda m: ''),
    ]

    keep_only_tags = [dict(id='cnnContentContainer')]
    keep_only_tags = [dict(id=['cnnContentContainer', 'storycontent'])]
    remove_tags = [
        {'class':['cnn_strybtntools', 'cnn_strylftcntnt',
            'cnn_strybtntools', 'cnn_strybtntoolsbttm', 'cnn_strybtmcntnt',
            'cnn_strycntntrgt']},
            'cnn_strycntntrgt', 'hed_side', 'foot']},
        dict(id=['ie_column']),
    ]

@@ -1,40 +1,10 @@
import re
from lxml.html import parse
from calibre.web.feeds.news import BasicNewsRecipe

class Counterpunch(BasicNewsRecipe):
    '''
    Parses counterpunch.com for articles
    '''
    title = 'Counterpunch'
    description = 'Daily political opinion from www.Counterpunch.com'
    language = 'en'
    __author__ = 'O. Emmerson'
    keep_only_tags = [dict(name='td', attrs={'width': '522'})]
    max_articles_per_feed = 10
    title = u'Counterpunch'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = True

    def parse_index(self):
        feeds = []
        title, url = 'Counterpunch', 'http://www.counterpunch.com'
        articles = self.parse_page(url)
        if articles:
            feeds.append((title, articles))
        return feeds

    def parse_page(self, url):
        parsed_page = parse(url).getroot()
        articles = []
        unwanted_text = re.compile('Website\ of\ the|I\ urge\ you|Subscribe\ now|DONATE|\@asis\.com|donation\ button|click\ over\ to\ our')
        parsed_articles = [a for a in parsed_page.cssselect("html>body>table tr>td>p[class='style2']") if not unwanted_text.search(a.text_content())]
        for art in parsed_articles:
            try:
                author = art.text
                title = art.cssselect("a")[0].text + ' by {0}'.format(author)
                art_url = 'http://www.counterpunch.com/' + art.cssselect("a")[0].attrib['href']
                articles.append({'title': title, 'url': art_url})
            except Exception as e:
                e
                #print('Handler Error: ', e, 'title :', a.text_content())
                pass
        return articles
    feeds = [(u'Counterpunch', u'http://www.counterpunch.org/category/article/feed/')]

47
recipes/cvecezla.recipe
Normal file
@@ -0,0 +1,47 @@

__license__ = 'GPL v3'
__copyright__ = '2011, Darko Miletic <darko.miletic at gmail.com>'
'''
cvecezla.wordpress.com
'''

import re
from calibre.web.feeds.news import BasicNewsRecipe

class CveceZla(BasicNewsRecipe):
    title = 'Cvece zla i naopakog'
    __author__ = 'Darko Miletic'
    description = 'Haoticnost razmisljanja poradja haoticnost pisanja. Muzika, stripovi, igre, knjige, generalno glupiranje...'
    oldest_article = 7
    max_articles_per_feed = 100
    language = 'sr'
    encoding = 'utf-8'
    no_stylesheets = True
    use_embedded_content = False
    publication_type = 'blog'
    extra_css = ' @font-face {font-family: "serif1";src:url(res:///opt/sony/ebook/FONT/tt0011m_.ttf)} @font-face {font-family: "sans1";src:url(res:///opt/sony/ebook/FONT/tt0003m_.ttf)} body{font-family: "Trebuchet MS",Trebuchet,Verdana,sans1,sans-serif} .article_description{font-family: sans1, sans-serif} img{display: block } '

    conversion_options = {
        'comment'  : description
        , 'tags'     : 'igre, muzika, film, blog, Srbija'
        , 'publisher': 'Mehmet Krljic'
        , 'language' : language
    }

    preprocess_regexps = [(re.compile(u'\u0110'), lambda match: u'\u00D0')]

    remove_tags_before = dict(attrs={'class':'navigation'})
    remove_tags_after = dict(attrs={'class':'commentlist'})
    remove_tags = [
        dict(attrs={'class':['postmetadata alt','sharedaddy sharedaddy-dark sd-like-enabled sd-sharing-enabled','reply','navigation']})
        ,dict(attrs={'id':'respond'})
    ]

    feeds = [(u'Clanci', u'http://cvecezla.wordpress.com/feed/')]

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        return soup

15
recipes/dark_horizons.recipe
Normal file
@@ -0,0 +1,15 @@
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1317580312(BasicNewsRecipe):
    title = u'Dark Horizons'
    language = 'en'
    __author__ = 'Jaded'
    description = 'News, images, video clips and reviews of current and upcoming blockbuster films.'
    category = 'movies, tv, news'
    oldest_article = 7
    max_articles_per_feed = 100
    cover_url = 'http://a4.sphotos.ak.fbcdn.net/hphotos-ak-ash2/164168_148419801879765_148410081880737_225532_464073_n.jpg'
    masthead_url = 'http://www.darkhorizons.com/graphics/2/logo_print.png'
    auto_cleanup = True

    feeds = [(u'News', u'http://www.darkhorizons.com/feeds/news.atom'), (u'Features', u'http://www.darkhorizons.com/feeds/features.atom'), (u'Reviews', u'http://www.darkhorizons.com/feeds/reviews.atom')]
21
recipes/den_of_geek.recipe
Normal file
@@ -0,0 +1,21 @@
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1316944753(BasicNewsRecipe):
    title = u'Den of Geek'
    __author__ = 'Jaded'
    language = 'en'
    description = 'From science fiction enthusiasts through to gaming fanatics, Den of Geek has become the one-stop UK website for people genuinely passionate about their entertainment media. Den of Geek covers popular culture but always with an edgy, UK centric slant that sets it apart from the crowd.'
    category = 'Movies, TV, Games, Comics, Cult, News, Reviews'

    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = True

    no_stylesheets = True
    use_embedded_content = True
    publication_type = 'newsportal'
    masthead_url = 'http://www.denofgeek.com/siteimage/scale/0/0/logo.gif'
    cover_url = 'http://a5.sphotos.ak.fbcdn.net/hphotos-ak-snc6/166479_180131695357862_139191826118516_354818_4993703_n.jpg'

    feeds = [(u'Movies', u'http://www.denofgeek.com/movies/rss/'), (u'TV', u'http://www.denofgeek.com/television/rss/'), (u'Comics & Books', u'http://www.denofgeek.com/comics/rss/'), (u'Games', u'http://www.denofgeek.com/games/rss/'), (u'DVD/Blu-ray', u'http://www.denofgeek.com/Reviews/rss/')]
@@ -22,6 +22,10 @@ class Descopera(BasicNewsRecipe):
    category = 'Ziare,Reviste,Descopera'
    encoding = 'utf-8'
    cover_url = 'http://www.descopera.ro/images/header_images/logo.gif'
    use_embedded_content = False

    no_stylesheets = True
    auto_cleanup = True

    conversion_options = {
        'comments' : description
@@ -30,28 +34,6 @@ class Descopera(BasicNewsRecipe):
        ,'publisher' : publisher
    }

    keep_only_tags = [
        dict(name='h1', attrs={'style':'font-family: Arial,Helvetica,sans-serif; font-size: 18px; color: rgb(51, 51, 51); font-weight: bold; margin: 10px 0pt; clear: both; float: left;width: 610px;'})
        ,dict(name='div', attrs={'style':'margin-right: 15px; margin-bottom: 15px; float: left;'})
        , dict(name='p', attrs={'id':'itemDescription'})
        ,dict(name='div', attrs={'id':'itemBody'})
    ]

    remove_tags = [
        dict(name='div', attrs={'class':['tools']})
        , dict(name='div', attrs={'class':['share']})
        , dict(name='div', attrs={'class':['category']})
        , dict(name='div', attrs={'id':['comments']})
    ]

    remove_tags_after = [
        dict(name='div', attrs={'id':'comments'})
    ]

    feeds = [
        (u'Feeds', u'http://www.descopera.ro/rss')
    ]

    def preprocess_html(self, soup):
        return self.adeify_images(soup)

11
recipes/diario_la_republica.recipe
Normal file
@@ -0,0 +1,11 @@
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1317341449(BasicNewsRecipe):
    title = u'Diario La Republica'
    __author__ = 'CAVALENCIA'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = True
    language = 'es_CO'

    feeds = [(u'Diario La Republica', u'http://www.larepublica.com.co/rss/larepublica.xml')]
@@ -1,6 +1,3 @@
'''
dnaindia.com
'''
import re
from calibre.web.feeds.news import BasicNewsRecipe

@@ -12,6 +9,10 @@ class DNAIndia(BasicNewsRecipe):
    language = 'en_IN'

    encoding = 'cp1252'
    use_embedded_content = False

    no_stylesheets = True
    auto_cleanup = True

    feeds = [
        ('Top News', 'http://www.dnaindia.com/syndication/rss_topnews.xml'),
@@ -22,15 +23,10 @@ class DNAIndia(BasicNewsRecipe):
        ('World', 'http://www.dnaindia.com/syndication/rss,catid-9.xml'),
        ('Money', 'http://www.dnaindia.com/syndication/rss,catid-4.xml'),
        ('Sports', 'http://www.dnaindia.com/syndication/rss,catid-6.xml'),
        ('After Hours', 'http://www.dnaindia.com/syndication/rss,catid-7.xml'),
        ('Digital Life', 'http://www.dnaindia.com/syndication/rss,catid-1089741.xml'),
        ('After Hours', 'http://www.dnaindia.com/syndication/rss,catid-7.xml')
    ]
    remove_tags = [{'id':['footer', 'lhs-col']}, {'class':['bottom', 'categoryHead',
        'article_tools']}]
    keep_only_tags = dict(id='middle-col')
    remove_tags_after = [dict(attrs={'id':'story'})]
    remove_attributes = ['style']
    no_stylesheets = True

    def print_version(self, url):
        match = re.search(r'newsid=(\d+)', url)

22
recipes/dobreprogamy.recipe
Normal file
@@ -0,0 +1,22 @@
from calibre.web.feeds.news import BasicNewsRecipe
import re

class Dobreprogramy_pl(BasicNewsRecipe):
    title = 'Dobreprogramy.pl'
    __author__ = 'fenuks'
    __licence__ = 'GPL v3'
    category = 'IT'
    language = 'pl'
    cover_url = 'http://userlogos.org/files/logos/Karmody/dobreprogramy_01.png'
    description = u'Aktualności i blogi z dobreprogramy.pl'
    encoding = 'utf-8'
    no_stylesheets = True
    extra_css = '.title {font-size:22px;}'
    oldest_article = 8
    max_articles_per_feed = 100
    preprocess_regexps = [(re.compile(ur'<div id="\S+360pmp4">Twoja przeglądarka nie obsługuje Flasha i HTML5 lub wyłączono obsługę JavaScript...</div>'), lambda match: '')]
    remove_tags = [dict(name='div', attrs={'class':['komentarze', 'block', 'portalInfo', 'menuBar', 'topBar']})]
    keep_only_tags = [dict(name='div', attrs={'class':['mainBar', 'newsContent', 'postTitle title', 'postInfo', 'contentText', 'content']})]
    feeds = [(u'Aktualności', 'http://feeds.feedburner.com/dobreprogramy/Aktualnosci'),
        ('Blogi', 'http://feeds.feedburner.com/dobreprogramy/BlogCzytelnikow')]
17
recipes/dzieje_pl.recipe
Normal file
@@ -0,0 +1,17 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Dzieje(BasicNewsRecipe):
    title = u'dzieje.pl'
    __author__ = 'fenuks'
    description = 'Dzieje - history of Poland'
    cover_url = 'http://www.dzieje.pl/sites/default/files/dzieje_logo.png'
    category = 'history'
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    remove_tags_before = dict(name='h1', attrs={'class':'title'})
    remove_tags_after = dict(id='dogory')
    remove_tags = [dict(id='dogory')]
    feeds = [(u'Dzieje', u'http://dzieje.pl/rss.xml')]
@@ -77,30 +77,21 @@ class Economist(BasicNewsRecipe):
                continue
            self.log('Found section: %s'%section_title)
            articles = []
            for h5 in section.findAll('h5'):
                article_title = self.tag_to_string(h5).strip()
                if not article_title:
                    continue
                data = h5.findNextSibling(attrs={'class':'article'})
                if data is None: continue
                a = data.find('a', href=True)
                if a is None: continue
                url = a['href']
                if url.startswith('/'): url = 'http://www.economist.com'+url
                url += '/print'
                article_title += ': %s'%self.tag_to_string(a).strip()
                articles.append({'title':article_title, 'url':url,
                    'description':'', 'date':''})
            if not articles:
                # We have last or first section
                for art in section.findAll(attrs={'class':'article'}):
                    a = art.find('a', href=True)
            subsection = ''
            for node in section.findAll(attrs={'class':'article'}):
                subsec = node.findPreviousSibling('h5')
                if subsec is not None:
                    subsection = self.tag_to_string(subsec)
                prefix = (subsection+': ') if subsection else ''
                a = node.find('a', href=True)
                if a is not None:
                    url = a['href']
                    if url.startswith('/'): url = 'http://www.economist.com'+url
                    url += '/print'
                    title = self.tag_to_string(a)
                    if title:
                        title = prefix + title
                        self.log('\tFound article:', title)
                        articles.append({'title':title, 'url':url,
                            'description':'', 'date':''})

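The updated Economist parsing above walks the article nodes in document order and prefixes each title with the nearest preceding `h5` subsection heading. The same idea, as a standalone sketch over a flat list of nodes (the node tuples here are hypothetical stand-ins for the BeautifulSoup tags the real recipe walks):

```python
# Standalone sketch of the subsection-prefix logic in the updated
# Economist recipe: each article takes the nearest preceding h5 heading
# as a "Subsection: " prefix.

def prefix_titles(nodes):
    # nodes: ordered list of ('h5', heading_text) or ('article', title) pairs
    subsection = ''
    out = []
    for kind, text in nodes:
        if kind == 'h5':
            subsection = text          # remember the latest heading seen
        elif kind == 'article':
            prefix = (subsection + ': ') if subsection else ''
            out.append(prefix + text)
    return out

titles = prefix_titles([
    ('article', 'Lead story'),         # before any heading: no prefix
    ('h5', 'Asia'),
    ('article', 'First Asia piece'),
    ('article', 'Second Asia piece'),
])
# → ['Lead story', 'Asia: First Asia piece', 'Asia: Second Asia piece']
```

This handles both cases the old code needed two passes for: sections whose articles sit under `h5` subheadings, and sections with no subheadings at all.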
@@ -69,30 +69,21 @@ class Economist(BasicNewsRecipe):
                continue
            self.log('Found section: %s'%section_title)
            articles = []
            for h5 in section.findAll('h5'):
                article_title = self.tag_to_string(h5).strip()
                if not article_title:
                    continue
                data = h5.findNextSibling(attrs={'class':'article'})
                if data is None: continue
                a = data.find('a', href=True)
                if a is None: continue
                url = a['href']
                if url.startswith('/'): url = 'http://www.economist.com'+url
                url += '/print'
                article_title += ': %s'%self.tag_to_string(a).strip()
                articles.append({'title':article_title, 'url':url,
                    'description':'', 'date':''})
            if not articles:
                # We have last or first section
                for art in section.findAll(attrs={'class':'article'}):
                    a = art.find('a', href=True)
            subsection = ''
            for node in section.findAll(attrs={'class':'article'}):
                subsec = node.findPreviousSibling('h5')
                if subsec is not None:
                    subsection = self.tag_to_string(subsec)
                prefix = (subsection+': ') if subsection else ''
                a = node.find('a', href=True)
                if a is not None:
                    url = a['href']
                    if url.startswith('/'): url = 'http://www.economist.com'+url
                    url += '/print'
                    title = self.tag_to_string(a)
                    if title:
                        title = prefix + title
                        self.log('\tFound article:', title)
                        articles.append({'title':title, 'url':url,
                            'description':'', 'date':''})

23
recipes/eioba.recipe
Normal file
@@ -0,0 +1,23 @@
# -*- coding: utf-8 -*-
from calibre.web.feeds.news import BasicNewsRecipe

class eioba(BasicNewsRecipe):
    title = u'eioba'
    __author__ = 'fenuks'
    cover_url = 'http://www.eioba.org/lay/logo_pl_v3.png'
    language = 'pl'
    oldest_article = 7
    remove_empty_feeds = True
    max_articles_per_feed = 100
    extra_css = '#ctl0_body_Topic {font-weight: bold; font-size:30px;}'
    keep_only_tags = [dict(id=['ctl0_body_Topic', 'articleContent'])]
    feeds = [(u'Wszyskie kategorie', u'http://feeds.eioba.pl/eioba-pl-top'),
        (u'Technologia', u'http://www.eioba.pl/feed/categories/1.xml'),
        (u'Nauka', u'http://www.eioba.pl/feed/categories/12.xml'),
        (u'Finanse', u'http://www.eioba.pl/feed/categories/7.xml'),
        (u'Życie', u'http://www.eioba.pl/feed/categories/5.xml'),
        (u'Zainteresowania', u'http://www.eioba.pl/feed/categories/420.xml'),
        (u'Społeczeństwo', u'http://www.eioba.pl/feed/categories/8.xml'),
        (u'Rozrywka', u'http://www.eioba.pl/feed/categories/10.xml'),
        (u'Rożne', u'http://www.eioba.pl/feed/categories/9.xml')
    ]
21
recipes/ekantipur.recipe
Normal file
@@ -0,0 +1,21 @@

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1314326622(BasicNewsRecipe):
    title = u'Ekantipur'
    __author__ = 'Manish Bhattarai'
    description = 'News from the No.1 News Portal In Nepal'
    language = 'en_NP'
    oldest_article = 7
    max_articles_per_feed = 25
    masthead_url = 'http://kantipur.com.np/images/ekantipur_01.jpg'
    remove_empty_feeds = True
    remove_tags_before = dict(id='main-content')
    remove_tags_after = dict(id='view-comments')
    remove_tags = [dict(attrs={'class':['ratings', 'news-tool', 'comment', 'post-ur-comment','asideBox','commentsbox','related-sidebar-row related-news']}),
        dict(id=['sidebar','news-detail-img', 'footer-wrapper']),
        dict(name=['script'])]

    feeds = [(u'Top Stories', u'http://www.ekantipur.com/en/rss/top-stories/'), (u'National', u'http://www.ekantipur.com/en/rss/national/1'), (u'Capital', u'http://www.ekantipur.com/en/rss/capital/7'), (u'Business', u'http://www.ekantipur.com/en/rss/business/3'), (u'World', u'http://www.ekantipur.com/en/rss/world/5'), (u'Sports', u'http://www.ekantipur.com/en/rss/sports/4'), (u'Mixed Bag', u'http://www.ekantipur.com/en/rss/mixed-bag/14'), (u'Health & Living', u'http://www.ekantipur.com/en/rss/health-and-living/19'), (u'Entertainment', u'http://www.ekantipur.com/en/rss/entertainment/6')]

@@ -2,12 +2,10 @@

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1311790237(BasicNewsRecipe):
    title = u'Periódico El Colombiano'
    language = 'es_CO'
    __author__ = 'BIGO-CAVA'
    language = 'es_CO'
    cover_url = 'http://www.elcolombiano.com/images/logoElColombiano348x46.gif'
    remove_tags_before = dict(id='contenidoArt')
    remove_tags_after = dict(id='enviaTips')

54
recipes/el_espectador.recipe
Normal file
@@ -0,0 +1,54 @@
# coding=utf-8

from calibre.web.feeds.news import BasicNewsRecipe

class ColombiaElEspectador(BasicNewsRecipe):
    title = u'Periódico el Espectador'
    __author__ = 'BIGO-CAVA'
    cover_url = 'http://www.elespectador.com/sites/elespectador.com/themes/elespectador/images/logo.gif'
    #remove_tags_before = dict(id='fb-root')
    remove_tags_before = dict(id='content')
    remove_tags_after = [dict(name='div', attrs={'class':'paginacion'})]
    language = 'es_CO'
    #keep_only_tags = [dict(name='div', id='content')]
    remove_tags = [dict(name='div', attrs={'class':'herramientas_nota'}),
        dict(name='div', attrs={'class':'relpauta'}),
        dict(name='div', attrs={'class':'recursosrelacionados'}),
        dict(name='div', attrs={'class':'nav_negocios'})]
    #    dict(name='div', attrs={'class':'tags_playerrecurso'}),
    #    dict(name='div', attrs={'class':'ico-mail2'}),
    #    dict(name='div', attrs={'id':'caja-instapaper'}),
    #    dict(name='div', attrs={'class':'modulo herramientas'})]
    oldest_article = 2
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    remove_empty_feeds = True
    masthead_url = 'http://www.elespectador.com/sites/elespectador.com/themes/elespectador/images/logo.gif'
    publication_type = 'newspaper'

    extra_css = """
        p{text-align: justify; font-size: 100%}
        body{ text-align: left; font-size:100% }
        h1{font-family: sans-serif; font-size:150%; font-weight:bold; text-align: justify; }
        h3{font-family: sans-serif; font-size:100%; font-style: italic; text-align: justify; }
        """

    feeds = [(u'Política', u'http://www.elespectador.com/noticias/politica/feed'),
        (u'Judicial', u'http://www.elespectador.com/noticias/judicial/feed'),
        (u'Paz', u'http://www.elespectador.com/noticias/paz/feed'),
        (u'Economía', u'http://www.elespectador.com/economia/feed'),
        (u'Soy Periodista', u'http://www.elespectador.com/noticias/soyperiodista/feed'),
        (u'Investigación', u'http://www.elespectador.com/noticias/investigacion/feed'),
        (u'Educación', u'http://www.elespectador.com/noticias/educacion/feed'),
        (u'Salud', u'http://www.elespectador.com/noticias/salud/feed'),
        (u'El Mundo', u'http://www.elespectador.com/noticias/elmundo/feed'),
        (u'Nacional', u'http://www.elespectador.com/noticias/nacional/feed'),
        (u'Bogotá', u'http://www.elespectador.com/noticias/bogota/feed'),
        (u'Deportes', u'http://www.elespectador.com/deportes/feed'),
        (u'Tecnología', u'http://www.elespectador.com/tecnologia/feed'),
        (u'Actualidad', u'http://www.elespectador.com/noticias/actualidad/feed'),
        (u'Opinión', u'http://www.elespectador.com/opinion/feed'),
        (u'Editorial', u'http://www.elespectador.com/opinion/editorial/feed')]
50
recipes/el_mundo_co.recipe
Normal file
@ -0,0 +1,50 @@

from calibre.web.feeds.news import BasicNewsRecipe


class ColombiaElMundo02(BasicNewsRecipe):
    title = u'Periódico El Mundo'
    __author__ = 'BIGO-CAVA'
    language = 'es_CO'
    cover_url = 'http://www.elmundo.com/portal/img/logo_mundo2.png'
    remove_tags_before = dict(id='miga_pan')
    #remove_tags_before = [dict(name='div', attrs={'class':'contenido'})]
    remove_tags_after = [dict(name='div', attrs={'class':'cuadro_opciones_new1'})]
    #keep_only_tags = [dict(name='div', id='miga_pan')]
    remove_tags = [dict(name='div', attrs={'class':'ruta'}),
                   dict(name='div', attrs={'class':'buscador'}),
                   dict(name='div', attrs={'class':'iconos'}),
                   dict(name='div', attrs={'class':'otros_iconos'}),
                   dict(name='div', attrs={'class':'cuadro_opciones_new1'}),
                   dict(name='div', attrs={'class':'otras_noticias'}),
                   dict(name='div', attrs={'class':'notas_relacionadas'}),
                   dict(name='div', attrs={'id':'lateral_2'})]
    oldest_article = 2
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    remove_empty_feeds = True
    masthead_url = 'http://www.elmundo.com/portal/img/logo_mundo2.png'
    publication_type = 'newspaper'

    extra_css = """
        p{text-align: justify; font-size: 100%}
        body{ text-align: left; font-size:100% }
        h1{font-family: sans-serif; font-size:150%; font-weight:bold; text-align: justify; }
        h3{font-family: sans-serif; font-size:100%; font-style: italic; text-align: justify; }
        """

    feeds = [(u'Opinión', u'http://www.elmundo.com/images/rss/opinion.xml'),
             (u'Economía', u'http://www.elmundo.com/images/rss/noticias_economia.xml'),
             (u'Deportes', u'http://www.elmundo.com/images/rss/deportes.xml'),
             (u'Política', u'http://www.elmundo.com/images/rss/noticias_politica.xml'),
             (u'Antioquia', u'http://www.elmundo.com/images/rss/noticias_antioquia.xml'),
             (u'Nacional', u'http://www.elmundo.com/images/rss/noticias_nacional.xml'),
             (u'Internacional', u'http://www.elmundo.com/images/rss/noticias_internacional.xml'),
             (u'Servicios Públicos', u'http://www.elmundo.com/images/rss/noticias_servicios_publicos.xml'),
             (u'Infraestructura', u'http://www.elmundo.com/images/rss/noticias_infraestructura.xml'),
             (u'Movilidad', u'http://www.elmundo.com/images/rss/noticias_movilidad.xml'),
             (u'Derechos Humanos', u'http://www.elmundo.com/images/rss/noticias_derechos_humanos.xml'),
             (u'Vida', u'http://www.elmundo.com/images/rss/vida.xml'),
             (u'Cultura', u'http://www.elmundo.com/images/rss/cultura.xml')]
@ -2,18 +2,17 @@

from calibre.web.feeds.news import BasicNewsRecipe


class ColombiaElTiempo02(BasicNewsRecipe):
    title = u'Periódico el Tiempo'
    language = 'es_CO'
    __author__ = 'BIGO-CAVA'
    language = 'es_CO'
    cover_url = 'http://www.eltiempo.com/media/css/images/logo_footer.png'
    remove_tags_before = dict(id='fb-root')
    #remove_tags_before = dict(id='fb-root')
    remove_tags_before = dict(id='contenidoArt')
    remove_tags_after = [dict(name='div', attrs={'class':'modulo reporte'})]
    keep_only_tags = [dict(name='div', id='contenidoArt')]
    remove_tags = [dict(name='div', attrs={'class':'social-media'}),
                   dict(name='div', attrs={'class':'recomend-art'}),
                   dict(name='div', attrs={'class':'caja-facebook'}),
                   dict(name='div', attrs={'class':'caja-twitter'}),
                   dict(name='div', attrs={'class':'caja-buzz'}),
15
recipes/elektroda_pl.recipe
Normal file
@ -0,0 +1,15 @@
from calibre.web.feeds.news import BasicNewsRecipe


class Elektroda(BasicNewsRecipe):
    title = u'Elektroda'
    oldest_article = 8
    __author__ = 'fenuks'
    description = 'Elektroda.pl'
    cover_url = 'http://demotywatory.elektroda.pl/Thunderpic/logo.gif'
    category = 'electronics'
    language = 'pl'
    max_articles_per_feed = 100
    remove_tags_before = dict(name='span', attrs={'class':'postbody'})
    remove_tags_after = dict(name='td', attrs={'class':'spaceRow'})
    remove_tags = [dict(name='a', attrs={'href':'#top'})]
    feeds = [(u'Elektroda', u'http://www.elektroda.pl/rtvforum/rss.php')]
113
recipes/fairbanks_daily.recipe
Normal file
@ -0,0 +1,113 @@
from calibre.web.feeds.news import BasicNewsRecipe


class FairbanksDailyNewsminer(BasicNewsRecipe):
    title = u'Fairbanks Daily News-miner'
    __author__ = 'Roger'
    oldest_article = 7
    max_articles_per_feed = 100

    description = 'The voice of interior Alaska since 1903'
    publisher = 'http://www.newsminer.com/'
    category = 'news, Alaska, Fairbanks'
    language = 'en'

    # Make article titles, author and date bold, italic or small font.
    # http://assets.matchbin.com/sites/635/stylesheets/newsminer.com.css
    # (signature_line contains date, views, comments)
    extra_css = '''
        .story_item_headline { font-size: medium; font-weight: bold; }
        .story_item_author { font-size: small; font-style:italic; }
        .signature_line { font-size: small; }
        '''

    remove_javascript = True
    use_embedded_content = False
    no_stylesheets = True
    language = 'en'
    encoding = 'utf8'
    conversion_options = {'linearize_tables':True}

    # TODO: The News-miner cover image seems a bit small. Can this be enlarged by 10-30%?
    masthead_url = 'http://d2uh5w9wm14i0w.cloudfront.net/sites/635/assets/top_masthead_-_menu_pic.jpg'

    # In order to omit seeing number of views, number of posts and the pipe
    # symbol for divider after the title and date of the article, a regex or
    # manual processing is needed to get just the "story_item_date updated"
    # (which contains the date). Everything else on this line is pretty much not needed.
    #
    # Currently, you will see the following:
    # | Aug 24, 2011 | 654 views | 6 | |
    # (ie. 6 comments)
    #
    # HTML line containing story_item_date:
    # <div class="signature_line"><span title="2011-08-22T23:37:14Z" class="story_item_date updated">Aug 22, 2011</span> | 2370 views | 52 <a href="/pages/full_story/push?article-Officials+tout+new+South+Cushman+homeless+living+facility%20&id=15183753#comments_15183753"><img alt="52 comments" class="dont_touch_me" src="http://d2uh5w9wm14i0w.cloudfront.net/images/comments-icon.gif" title="52 comments" /></a> | <span id="number_recommendations_15183753" class="number_recommendations">9</span> <a href="#1" id="recommend_link_15183753" onclick="Element.remove('recommend_link_15183753'); new Ajax.Request('/community/content/recommend/15183753', {asynchronous:true, evalScripts:true}); return false;"><img alt="9 recommendations" class="dont_touch_me" src="http://d2uh5w9wm14i0w.cloudfront.net/images/thumbs-up-icon.gif" title="9 recommendations" /></a> | <a href="#1" onclick="$j.facebox({ajax: '/community/content/email_friend_pane/15183753'}); return false;"><span style="position: relative;"><img alt="email to a friend" class="dont_touch_me" src="http://d2uh5w9wm14i0w.cloudfront.net/images/email-this.gif" title="email to a friend" /></span></a> | <span><a href="/printer_friendly/15183753" target="_blank"><img alt="print" class="dont_touch_me" src="http://d2uh5w9wm14i0w.cloudfront.net/images/print_icon.gif" title="print" /></a></span><span id="email_content_message_15183753" class="signature_email_message"></span></div>

    # The following was suggested, but it looks like I also need to define self & soup
    # (as well as bring in extra soup depends?)
    #date = self.tag_to_string(soup.find('span', attrs={'class':'story_item_date updated'}))

    #preprocess_regexps = [(re.compile(r'<span[^>]*addthis_separator*>'), lambda match: '') ]
    #preprocess_regexps = [(re.compile(r'span class="addthis_separator">|</span>'), lambda match: '') ]

    #preprocess_regexps = [
    #    (re.compile(r'<start>.*?<end>', re.IGNORECASE | re.DOTALL), lambda match : ''),
    #    ]

    #def get_browser(self):
    #def preprocess_html(soup, first_fetch):
    #    date = self.tag_to_string(soup.find('span', attrs={'class':'story_item_date updated'}))
    #    return

    #preprocess_regexps = [(re.compile(r' |.*?', re.DOTALL), lambda m: '')]


    keep_only_tags = [
        #dict(name='div', attrs={'class':'hnews hentry item'}),
        dict(name='div', attrs={'class':'story_item_headline entry-title'}),
        #dict(name='div', attrs={'class':'story_item_author'}),
        #dict(name='span', attrs={'class':'story_item_date updated'}),
        #dict(name='div', attrs={'class':'story_item_author'}),
        dict(name='div', attrs={'class':'full_story'})
        ]

    remove_tags = [
        # Try getting rid of some signature_line (date line) stuff
        #dict(name='img', attrs={'alt'}),
        dict(name='img', attrs={'class':'dont_touch_me'}),
        dict(name='span', attrs={'class':'number_recommendations'}),
        #dict(name='div', attrs={'class':'signature_line'}),

        # Removes div within <!-- AddThis Button BEGIN --> <!-- AddThis Button END -->
        dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'}),

        dict(name='div', attrs={'class':'related_content'}),
        dict(name='div', attrs={'id':'comments_container'})
        ]

    # Comment-out or uncomment any of the following RSS feeds according to your
    # liking.
    #
    # TODO: Some random bits of text might be trailing the last page (or TOC on
    # MOBI files), these are bits of public posts and comments and need to also
    # be removed.
    #
    feeds = [
        (u'Alaska News', u'http://newsminer.com/rss/rss_feeds/alaska_news?content_type=article&tags=alaska_news&page_name=rss_feeds&instance=alaska_news'),
        (u'Local News', u'http://newsminer.com/rss/rss_feeds/local_news?content_type=article&tags=local_news&page_name=rss_feeds&offset=0&instance=local_news'),
        (u'Business', u'http://newsminer.com/rss/rss_feeds/business_news?content_type=article&tags=business_news&page_name=rss_feeds&instance=business_news'),
        (u'Politics', u'http://newsminer.com/rss/rss_feeds/politics_news?content_type=article&tags=politics_news&page_name=rss_feeds&instance=politics_news'),
        (u'Sports', u'http://newsminer.com/rss/rss_feeds/sports_news?content_type=article&tags=sports_news&page_name=rss_feeds&instance=sports_news'),
        (u'Latitude 65 feed', u'http://newsminer.com/rss/rss_feeds/latitude_65?content_type=article&tags=latitude_65&page_name=rss_feeds&offset=0&instance=latitude_65'),
        #(u'Sundays', u'http://newsminer.com/rss/rss_feeds/Sundays?content_type=article&tags=alaska_science_forum+scott_mccrea+interior_gardening+in_the_bush+judy_ferguson+book_reviews+theresa_bakker+judith_kleinfeld+interior_scrapbook+nuggets_comics+freeze_frame&page_name=rss_feeds&tag_inclusion=or&instance=Sundays'),
        (u'Outdoors', u'http://newsminer.com/rss/rss_feeds/Outdoors?content_type=article&tags=outdoors&page_name=rss_feeds&instance=Outdoors'),
        #(u'Fairbanks Grizzlies', u'http://newsminer.com/rss/rss_feeds/fairbanks_grizzlies?content_type=article&tags=fairbanks_grizzlies&page_name=rss_feeds&instance=fairbanks_grizzlies'),
        #(u'Newsminer', u'http://newsminer.com/rss/rss_feeds/Newsminer?content_type=article&tags=ted_stevens_bullets+ted_stevens+sports_news+business_news+fairbanks_grizzlies+dermot_cole_column+outdoors+alaska_science_forum+scott_mccrea+interior_gardening+in_the_bush+judy_ferguson+book_reviews+theresa_bakker+judith_kleinfeld+interior_scrapbook+nuggets_comics+freeze_frame&page_name=rss_feeds&tag_inclusion=or&instance=Newsminer'),
        (u'Opinion', u'http://newsminer.com/rss/rss_feeds/Opinion?content_type=article&tags=editorials&page_name=rss_feeds&instance=Opinion'),
        (u'Youth', u'http://newsminer.com/rss/rss_feeds/Youth?content_type=article&tags=youth&page_name=rss_feeds&instance=Youth'),
        #(u'Dermot Cole Blog', u'http://newsminer.com/rss/rss_feeds/dermot_cole_blog+rss?content_type=blog+entry&sort_by=posted_on&user_ids=3015275&page_name=blogs_dermot_cole&limit=10&instance=dermot_cole_blog+rss'),
        (u'Dermot Cole Column', u'http://newsminer.com/rss/rss_feeds/Dermot_Cole_column?content_type=article&tags=dermot_cole_column&page_name=rss_feeds&instance=Dermot_Cole_column'),
        #(u'Sarah Palin', u'http://newsminer.com/rss/rss_feeds/sarah_palin?content_type=article&tags=palin_in_the_news+palin_on_the_issues&page_name=rss_feeds&tag_inclusion=or&instance=sarah_palin')
        ]
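The commented-out attempts in the recipe above never settled on a working expression for keeping only the date from the signature line. As a standalone illustration (the recipe itself is unchanged, and the sample HTML is shortened from the comment above), one hedged approach is to pull the date out of the `story_item_date` span directly rather than regex-stripping everything around it:

```python
import re

# Sample signature line, shortened from the HTML quoted in the recipe comment.
sample = ('<div class="signature_line">'
          '<span title="2011-08-22T23:37:14Z" class="story_item_date updated">'
          'Aug 22, 2011</span> | 2370 views | 52 '
          '<a href="#comments_15183753">comments</a></div>')

# Match the story_item_date span and capture its text; the views,
# comment counts and pipe dividers are simply never captured.
date_re = re.compile(
    r'<span[^>]*class="story_item_date updated"[^>]*>([^<]+)</span>')

def extract_date(html):
    """Return the human-readable date from a signature_line, or None."""
    m = date_re.search(html)
    return m.group(1).strip() if m else None

print(extract_date(sample))  # Aug 22, 2011
```

Inside the recipe this logic would live in a `preprocess_html(self, soup)` override (extracting the span with `soup.find` and dropping its siblings), which also answers the "need to define self & soup" note above.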
40
recipes/film_web.recipe
Normal file
@ -0,0 +1,40 @@
from calibre.web.feeds.news import BasicNewsRecipe


class Filmweb_pl(BasicNewsRecipe):
    title = u'FilmWeb'
    __author__ = 'fenuks'
    description = 'FilmWeb - biggest polish movie site'
    cover_url = 'http://userlogos.org/files/logos/crudus/filmweb.png'
    category = 'movies'
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100
    no_stylesheets = True
    extra_css = '.hdrBig {font-size:22px;}'
    remove_tags = [dict(name='div', attrs={'class':['recommendOthers']}), dict(name='ul', attrs={'class':'fontSizeSet'})]
    keep_only_tags = [dict(name='h1', attrs={'class':'hdrBig'}), dict(name='div', attrs={'class':['newsInfo', 'reviewContent fontSizeCont description']})]
    feeds = [(u'Wszystkie newsy', u'http://www.filmweb.pl/feed/news/latest'),
             (u'News / Filmy w produkcji', 'http://www.filmweb.pl/feed/news/category/filminproduction'),
             (u'News / Festiwale, nagrody i przeglądy', u'http://www.filmweb.pl/feed/news/category/festival'),
             (u'News / Seriale', u'http://www.filmweb.pl/feed/news/category/serials'),
             (u'News / Box office', u'http://www.filmweb.pl/feed/news/category/boxoffice'),
             (u'News / Multimedia', u'http://www.filmweb.pl/feed/news/category/multimedia'),
             (u'News / Dystrybucja dvd / blu-ray', u'http://www.filmweb.pl/feed/news/category/video'),
             (u'News / Dystrybucja kinowa', u'http://www.filmweb.pl/feed/news/category/cinema'),
             (u'News / off', u'http://www.filmweb.pl/feed/news/category/off'),
             (u'News / Gry wideo', u'http://www.filmweb.pl/feed/news/category/game'),
             (u'News / Organizacje branżowe', u'http://www.filmweb.pl/feed/news/category/organizations'),
             (u'News / Internet', u'http://www.filmweb.pl/feed/news/category/internet'),
             (u'News / Różne', u'http://www.filmweb.pl/feed/news/category/other'),
             (u'News / Kino polskie', u'http://www.filmweb.pl/feed/news/category/polish.cinema'),
             (u'News / Telewizja', u'http://www.filmweb.pl/feed/news/category/tv'),
             (u'Recenzje redakcji', u'http://www.filmweb.pl/feed/reviews/latest'),
             (u'Recenzje użytkowników', u'http://www.filmweb.pl/feed/user-reviews/latest')]

    def skip_ad_pages(self, soup):
        skip_tag = soup.find('a', attrs={'class':'welcomeScreenButton'})
        if skip_tag is not None:
            self.log.warn('skip_tag')
            self.log.warn(skip_tag)
            return self.index_to_soup(skip_tag['href'], raw=True)
@ -5,6 +5,7 @@ www.ft.com/uk-edition
'''

import datetime
from calibre.ptempfile import PersistentTemporaryFile
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

@ -22,8 +23,11 @@ class FinancialTimes(BasicNewsRecipe):
    needs_subscription = True
    encoding = 'utf8'
    publication_type = 'newspaper'
    articles_are_obfuscated = True
    temp_files = []
    masthead_url = 'http://im.media.ft.com/m/img/masthead_main.jpg'
    LOGIN = 'https://registration.ft.com/registration/barrier/login'
    LOGIN2 = 'http://media.ft.com/h/subs3.html'
    INDEX = 'http://www.ft.com/uk-edition'
    PREFIX = 'http://www.ft.com'

@ -39,14 +43,19 @@ class FinancialTimes(BasicNewsRecipe):
        br = BasicNewsRecipe.get_browser()
        br.open(self.INDEX)
        if self.username is not None and self.password is not None:
            br.open(self.LOGIN)
            br.open(self.LOGIN2)
            br.select_form(name='loginForm')
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    keep_only_tags = [dict(name='div', attrs={'class':['fullstory fullstoryHeader','fullstory fullstoryBody','ft-story-header','ft-story-body','index-detail']})]
    keep_only_tags = [
         dict(name='div', attrs={'class':['fullstory fullstoryHeader', 'ft-story-header']})
        ,dict(name='div', attrs={'class':'standfirst'})
        ,dict(name='div', attrs={'id'   :'storyContent'})
        ,dict(name='div', attrs={'class':['ft-story-body','index-detail']})
        ]
    remove_tags = [
         dict(name='div', attrs={'id':'floating-con'})
        ,dict(name=['meta','iframe','base','object','embed','link'])

@ -68,18 +77,23 @@ class FinancialTimes(BasicNewsRecipe):

    def get_artlinks(self, elem):
        articles = []
        count = 0
        for item in elem.findAll('a', href=True):
            count = count + 1
            if self.test and count > 2:
                return articles
            rawlink = item['href']
            if rawlink.startswith('http://'):
                url = rawlink
            else:
                url = self.PREFIX + rawlink
            urlverified = self.browser.open_novisit(url).geturl() # resolve redirect.
            title = self.tag_to_string(item)
            date = strftime(self.timefmt)
            articles.append({
                 'title'      :title
                ,'date'       :date
                ,'url'        :url
                ,'url'        :urlverified
                ,'description':''
                })
        return articles

@ -96,7 +110,11 @@ class FinancialTimes(BasicNewsRecipe):
        st = wide.find('h4', attrs={'class':'section-no-arrow'})
        if st:
            strest.insert(0, st)
        count = 0
        for item in strest:
            count = count + 1
            if self.test and count > 2:
                return feeds
            ftitle = self.tag_to_string(item)
            self.report_progress(0, _('Fetching feed')+' %s...'%(ftitle))
            feedarts = self.get_artlinks(item.parent.ul)

@ -136,3 +154,18 @@ class FinancialTimes(BasicNewsRecipe):
            cdate -= datetime.timedelta(days=1)
        return cdate.strftime('http://specials.ft.com/vtf_pdf/%d%m%y_FRONT1_LON.pdf')

    def get_obfuscated_article(self, url):
        count = 0
        while (count < 10):
            try:
                response = self.browser.open(url)
                html = response.read()
                count = 10
            except:
                print "Retrying download..."
                count += 1
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name
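Stripped of the browser and temp-file machinery, the retry loop in `get_obfuscated_article` above reduces to "try up to ten times, stop on the first success". A minimal self-contained sketch of that control flow (the `fetch` callable and failure counter are stand-ins invented for the sketch, not part of the recipe):

```python
def fetch_with_retries(fetch, attempts=10):
    """Call fetch() until it succeeds or attempts are exhausted.

    Mirrors the loop above: a successful read ends the loop,
    any exception triggers another try (the recipe uses a bare except).
    """
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except Exception as e:
            last_error = e
    raise last_error

# A fetcher that fails twice before succeeding, standing in for
# self.browser.open(url).read() on a flaky connection.
calls = {'n': 0}
def flaky_fetch():
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('temporary failure')
    return '<html>article</html>'

print(fetch_with_retries(flaky_fetch))  # <html>article</html>
```

Unlike the original (which falls through and writes an undefined `html` if all ten attempts fail), this sketch re-raises the last error, which is usually the safer choice.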
@ -16,13 +16,13 @@ class Fleshbot(BasicNewsRecipe):
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'utf-8'
    use_embedded_content = False
    use_embedded_content = True
    language = 'en'
    masthead_url = 'http://cache.fleshbot.com/assets/base/img/thumbs140x140/fleshbot.com.png'
    masthead_url = 'http://cache.gawkerassets.com/assets/kotaku.com/img/logo.png'
    extra_css = '''
        body{font-family: "Lucida Grande",Helvetica,Arial,sans-serif}
        img{margin-bottom: 1em}
        h1{font-family :Arial,Helvetica,sans-serif; font-size:x-large}
        h1{font-family :Arial,Helvetica,sans-serif; font-size:large}
        '''
    conversion_options = {
        'comment' : description

@ -31,13 +31,12 @@ class Fleshbot(BasicNewsRecipe):
        , 'language' : language
        }

    remove_attributes = ['width','height']
    keep_only_tags = [dict(attrs={'class':'content permalink'})]
    remove_tags_before = dict(name='h1')
    remove_tags = [dict(attrs={'class':'contactinfo'})]
    remove_tags_after = dict(attrs={'class':'contactinfo'})
    feeds = [(u'Articles', u'http://feeds.gawker.com/fleshbot/vip?format=xml')]

    remove_tags = [
        {'class': 'feedflare'},
        ]

    feeds = [(u'Articles', u'http://feeds.gawker.com/fleshbot/full')]

    def preprocess_html(self, soup):
        return self.adeify_images(soup)
39
recipes/fluter_de.recipe
Normal file
@ -0,0 +1,39 @@
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'

'''
Fetch fluter.de
'''

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1313693926(BasicNewsRecipe):

    title = u'Fluter'
    description = 'fluter.de Magazin der Bundeszentrale für politische Bildung/bpb'
    language = 'de'
    encoding = 'UTF-8'

    __author__ = 'Armin Geller' # 2011-08-19

    oldest_article = 7
    max_articles_per_feed = 50

    remove_tags = [
        dict(name='div', attrs={'id':["comments"]}),
        dict(attrs={'class':['commentlink']}),
        ]

    keep_only_tags = [
        dict(name='div', attrs={'class':["grid_8 articleText"]}),
        dict(name='div', attrs={'class':["articleTextInnerText"]}),
        ]

    feeds = [
        (u'Inhalt:', u'http://www.fluter.de/de/?tpl=907'),
        ]

    extra_css = '.cs_img {margin-right: 10pt;}'
66
recipes/focus_pl.recipe
Normal file
@ -0,0 +1,66 @@
# -*- coding: utf-8 -*-
from calibre.web.feeds.news import BasicNewsRecipe


class Focus_pl(BasicNewsRecipe):
    title = u'Focus.pl'
    oldest_article = 15
    max_articles_per_feed = 100
    __author__ = 'fenuks'
    language = 'pl'
    description = 'Polish scientific monthly magazine'
    category = 'magazine'
    cover_url = ''
    remove_empty_feeds = True
    no_stylesheets = True
    remove_tags_before = dict(name='div', attrs={'class':'h2 h2f'})
    remove_tags_after = dict(name='div', attrs={'class':'clear'})
    feeds = [(u'Wszystkie kategorie', u'http://focus.pl.feedsportal.com/c/32992/f/532692/index.rss'),
             (u'Nauka', u'http://focus.pl.feedsportal.com/c/32992/f/532693/index.rss'),
             (u'Historia', u'http://focus.pl.feedsportal.com/c/32992/f/532694/index.rss'),
             (u'Cywilizacja', u'http://focus.pl.feedsportal.com/c/32992/f/532695/index.rss'),
             (u'Sport', u'http://focus.pl.feedsportal.com/c/32992/f/532696/index.rss'),
             (u'Technika', u'http://focus.pl.feedsportal.com/c/32992/f/532697/index.rss'),
             (u'Przyroda', u'http://focus.pl.feedsportal.com/c/32992/f/532698/index.rss'),
             (u'Technologie', u'http://focus.pl.feedsportal.com/c/32992/f/532699/index.rss'),
             (u'Warto wiedzieć', u'http://focus.pl.feedsportal.com/c/32992/f/532700/index.rss'),
             ]

    def skip_ad_pages(self, soup):
        tag = soup.find(name='a')
        if tag:
            new_soup = self.index_to_soup(tag['href'] + 'do-druku/1/', raw=True)
            return new_soup

    def append_page(self, appendtag):
        tag = appendtag.find(name='div', attrs={'class':'arrows'})
        if tag:
            nexturl = 'http://www.focus.pl/' + tag.a['href']
            for rem in appendtag.findAll(name='div', attrs={'class':'klik-nav'}):
                rem.extract()
            while nexturl:
                soup2 = self.index_to_soup(nexturl)
                nexturl = None
                pagetext = soup2.find(name='div', attrs={'class':'txt'})
                tag = pagetext.find(name='div', attrs={'class':'arrows'})
                for r in tag.findAll(name='a'):
                    if u'Następne' in r.string:
                        nexturl = 'http://www.focus.pl/' + r['href']
                for rem in pagetext.findAll(name='div', attrs={'class':'klik-nav'}):
                    rem.extract()
                pos = len(appendtag.contents)
                appendtag.insert(pos, pagetext)

    def get_cover_url(self):
        soup = self.index_to_soup('http://www.focus.pl/magazyn/')
        tag = soup.find(name='div', attrs={'class':'clr fl'})
        if tag:
            self.cover_url = 'http://www.focus.pl/' + tag.a['href']
        return getattr(self, 'cover_url', self.cover_url)

    def preprocess_html(self, soup):
        self.append_page(soup.body)
        return soup
@ -1,3 +1,4 @@
# -*- coding: utf-8 -*-
from calibre.web.feeds.news import BasicNewsRecipe
from datetime import datetime, timedelta
from calibre.ebooks.BeautifulSoup import Tag,BeautifulSoup

@ -16,7 +17,7 @@ class FolhaOnline(BasicNewsRecipe):
    news = True

    title = u'Folha de S\xE3o Paulo'
    __author__ = 'Euler Alves'
    __author__ = 'Euler Alves and Alex Mitrani'
    description = u'Brazilian news from Folha de S\xE3o Paulo'
    publisher = u'Folha de S\xE3o Paulo'
    category = 'news, rss'

@ -62,37 +63,50 @@ class FolhaOnline(BasicNewsRecipe):
        ,dict(name='div',
            attrs={'class':[
                'openBox adslibraryArticle'
                ,'toolbar'
            ]})

        ,dict(name='a')
        ,dict(name='iframe')
        ,dict(name='link')
        ,dict(name='script')
        ,dict(name='li')
        ]
    remove_tags_after = dict(name='div', attrs={'id':'articleEnd'})

    feeds = [
         (u'Em cima da hora', u'http://feeds.folha.uol.com.br/emcimadahora/rss091.xml')
        ,(u'Cotidiano', u'http://feeds.folha.uol.com.br/folha/cotidiano/rss091.xml')
        ,(u'Brasil', u'http://feeds.folha.uol.com.br/folha/brasil/rss091.xml')
        ,(u'Mundo', u'http://feeds.folha.uol.com.br/mundo/rss091.xml')
        ,(u'Poder', u'http://feeds.folha.uol.com.br/poder/rss091.xml')
        ,(u'Mercado', u'http://feeds.folha.uol.com.br/folha/dinheiro/rss091.xml')
        ,(u'Saber', u'http://feeds.folha.uol.com.br/folha/educacao/rss091.xml')
        ,(u'Tec', u'http://feeds.folha.uol.com.br/folha/informatica/rss091.xml')
        ,(u'Ilustrada', u'http://feeds.folha.uol.com.br/folha/ilustrada/rss091.xml')
        ,(u'Ambiente', u'http://feeds.folha.uol.com.br/ambiente/rss091.xml')
        ,(u'Bichos', u'http://feeds.folha.uol.com.br/bichos/rss091.xml')
        ,(u'Ci\xEAncia', u'http://feeds.folha.uol.com.br/ciencia/rss091.xml')
        ,(u'Poder', u'http://feeds.folha.uol.com.br/poder/rss091.xml')
        ,(u'Equil\xEDbrio e Sa\xFAde', u'http://feeds.folha.uol.com.br/equilibrioesaude/rss091.xml')
        ,(u'Turismo', u'http://feeds.folha.uol.com.br/folha/turismo/rss091.xml')
        ,(u'Mundo', u'http://feeds.folha.uol.com.br/mundo/rss091.xml')
        ,(u'Pelo Mundo', u'http://feeds.folha.uol.com.br/pelomundo.folha.rssblog.uol.com.br/')
        ,(u'Circuito integrado', u'http://feeds.folha.uol.com.br/circuitointegrado.folha.rssblog.uol.com.br/')
        ,(u'Blog do Fred', u'http://feeds.folha.uol.com.br/blogdofred.folha.rssblog.uol.com.br/')
        ,(u'Maria In\xEAs Dolci', u'http://feeds.folha.uol.com.br/mariainesdolci.folha.blog.uol.com.br/')
        ,(u'Eduardo Ohata', u'http://feeds.folha.uol.com.br/folha/pensata/eduardoohata/rss091.xml')
        ,(u'Kennedy Alencar', u'http://feeds.folha.uol.com.br/folha/pensata/kennedyalencar/rss091.xml')
        ,(u'Eliane Catanh\xEAde', u'http://feeds.folha.uol.com.br/folha/pensata/elianecantanhede/rss091.xml')
        ,(u'Fernado Canzian', u'http://feeds.folha.uol.com.br/folha/pensata/fernandocanzian/rss091.xml')
        ,(u'Gilberto Dimenstein', u'http://feeds.folha.uol.com.br/folha/pensata/gilbertodimenstein/rss091.xml')
        ,(u'H\xE9lio Schwartsman', u'http://feeds.folha.uol.com.br/folha/pensata/helioschwartsman/rss091.xml')
        ,(u'Jo\xE3o Pereira Coutinho', u'http://http://feeds.folha.uol.com.br/folha/pensata/joaopereiracoutinho/rss091.xml')
        ,(u'Luiz Caversan', u'http://http://feeds.folha.uol.com.br/folha/pensata/luizcaversan/rss091.xml')
        ,(u'S\xE9rgio Malbergier', u'http://http://feeds.folha.uol.com.br/folha/pensata/sergiomalbergier/rss091.xml')
        ,(u'Valdo Cruz', u'http://http://feeds.folha.uol.com.br/folha/pensata/valdocruz/rss091.xml')
        ,(u'Esporte', u'http://feeds.folha.uol.com.br/folha/esporte/rss091.xml')
        ,(u'Zapping', u'http://feeds.folha.uol.com.br/colunas/zapping/rss091.xml')
        ,(u'Cida Santos', u'http://feeds.folha.uol.com.br/colunas/cidasantos/rss091.xml')
        ,(u'Clóvis Rossi', u'http://feeds.folha.uol.com.br/colunas/clovisrossi/rss091.xml')
        ,(u'Eliane Cantanhêde', u'http://feeds.folha.uol.com.br/colunas/elianecantanhede/rss091.xml')
        ,(u'Fernando Canzian', u'http://feeds.folha.uol.com.br/colunas/fernandocanzian/rss091.xml')
        ,(u'Gilberto Dimenstein', u'http://feeds.folha.uol.com.br/colunas/gilbertodimenstein/rss091.xml')
        ,(u'Hélio Schwartsman', u'http://feeds.folha.uol.com.br/colunas/helioschwartsman/rss091.xml')
        ,(u'Humberto Luiz Peron', u'http://feeds.folha.uol.com.br/colunas/futebolnarede/rss091.xml')
        ,(u'João Pereira Coutinho', u'http://feeds.folha.uol.com.br/colunas/joaopereiracoutinho/rss091.xml')
        ,(u'José Antonio Ramalho', u'http://feeds.folha.uol.com.br/colunas/canalaberto/rss091.xml')
        ,(u'Kennedy Alencar', u'http://feeds.folha.uol.com.br/colunas/kennedyalencar/rss091.xml')
        ,(u'Luiz Caversan', u'http://feeds.folha.uol.com.br/colunas/luizcaversan/rss091.xml')
        ,(u'Luiz Rivoiro', u'http://feeds.folha.uol.com.br/colunas/paiepai/rss091.xml')
        ,(u'Marcelo Leite', u'http://feeds.folha.uol.com.br/colunas/marceloleite/rss091.xml')
        ,(u'Sérgio Malbergier', u'http://feeds.folha.uol.com.br/colunas/sergiomalbergier/rss091.xml')
        ,(u'Sylvia Colombo', u'http://feeds.folha.uol.com.br/colunas/sylviacolombo/rss091.xml')
        ,(u'Valdo Cruz', u'http://feeds.folha.uol.com.br/colunas/valdocruz/rss091.xml')
        ]
96
recipes/folhadesaopaulo_sub.recipe
Normal file
@ -0,0 +1,96 @@
|
||||
from calibre.web.feeds.news import BasicNewsRecipe

import re

class FSP(BasicNewsRecipe):

    title = u'Folha de S\xE3o Paulo'
    __author__ = 'fluzao'
    description = u'Printed edition contents. UOL subscription required (Folha subscription currently not supported).' + \
                  u' [Conte\xfado completo da edi\xe7\xe3o impressa. Somente para assinantes UOL.]'
    INDEX = 'http://www1.folha.uol.com.br/fsp/indices/'
    language = 'pt'
    no_stylesheets = True
    max_articles_per_feed = 40
    remove_javascript = True
    needs_subscription = True
    remove_tags_before = dict(name='b')
    remove_tags = [dict(name='td', attrs={'align':'center'})]
    remove_attributes = ['height','width']
    masthead_url = 'http://f.i.uol.com.br/fsp/furniture/images/lgo-fsp-430x50-ffffff.gif'

    # fixes the problem with the section names
    section_dict = {'cotidian' : 'cotidiano', 'ilustrad': 'ilustrada', \
                    'quadrin': 'quadrinhos' , 'opiniao' : u'opini\xE3o', \
                    'ciencia' : u'ci\xeancia' , 'saude' : u'sa\xfade', \
                    'ribeirao' : u'ribeir\xE3o' , 'equilibrio' : u'equil\xedbrio'}

    # this solves the problem with truncated content in Kindle
    conversion_options = {'linearize_tables' : True}

    # this bit removes the footer where there are links for Proximo Texto, Texto Anterior,
    # Indice e Comunicar Erros
    preprocess_regexps = [(re.compile(r'<BR><BR>Texto Anterior:.*<!--/NOTICIA-->',
                                      re.DOTALL|re.IGNORECASE), lambda match: r''),
                          (re.compile(r'<BR><BR>Próximo Texto:.*<!--/NOTICIA-->',
                                      re.DOTALL|re.IGNORECASE), lambda match: r'')]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('https://acesso.uol.com.br/login.html')
            br.form = br.forms().next()
            br['user'] = self.username
            br['pass'] = self.password
            raw = br.submit().read()
##            if 'Please try again' in raw:
##                raise Exception('Your username and password are incorrect')
        return br

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        feeds = []
        articles = []
        section_title = "Preambulo"
        for post in soup.findAll('a'):
            # if name=True => new section
            strpost = str(post)
            if strpost.startswith('<a name'):
                if articles:
                    feeds.append((section_title, articles))
                    self.log()
                    self.log('--> new section found, creating old section feed: ', section_title)
                section_title = post['name']
                if section_title in self.section_dict:
                    section_title = self.section_dict[section_title]
                articles = []
                self.log('--> new section title: ', section_title)
            if strpost.startswith('<a href'):
                url = post['href']
                if url.startswith('/fsp'):
                    url = 'http://www1.folha.uol.com.br'+url
                    title = self.tag_to_string(post)
                    self.log()
                    self.log('--> post: ', post)
                    self.log('--> url: ', url)
                    self.log('--> title: ', title)
                    articles.append({'title':title, 'url':url})
        if articles:
            feeds.append((section_title, articles))

        # keeping the front page url
        minha_capa = feeds[0][1][1]['url']

        # removing the 'Preambulo' section
        del feeds[0]

        # creating the url for the cover image
        coverurl = feeds[0][1][0]['url']
        coverurl = coverurl.replace('/opiniao/fz', '/images/cp')
        coverurl = coverurl.replace('01.htm', '.jpg')
        self.cover_url = coverurl

        # inserting the cover page as the first article (nicer for kindle users)
        feeds.insert(0,(u'primeira p\xe1gina', [{'title':u'Primeira p\xe1gina' , 'url':minha_capa}]))
        return feeds
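The footer-stripping regexps in the FSP recipe above can be exercised outside calibre. A minimal stdlib-only sketch of how `BasicNewsRecipe` applies `preprocess_regexps` (each entry pairs a compiled pattern with a replacement callable); the HTML snippet is a hypothetical example, not real Folha markup:

```python
import re

# Same shape as the recipe's preprocess_regexps: (pattern, replacement-function) pairs.
preprocess_regexps = [
    (re.compile(r'<BR><BR>Texto Anterior:.*<!--/NOTICIA-->',
                re.DOTALL | re.IGNORECASE), lambda match: ''),
]

def apply_regexps(html, rules):
    # Apply each pair in turn, as BasicNewsRecipe does before parsing.
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html

sample = '<p>Article body</p><BR><BR>Texto Anterior: x<!--/NOTICIA-->'
cleaned = apply_regexps(sample, preprocess_regexps)
```

Because the pattern uses `re.DOTALL`, the footer is removed even when it spans multiple lines.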
83
recipes/gazeta_wyborcza.recipe
Normal file
@ -0,0 +1,83 @@
# -*- coding: utf-8 -*-
from calibre.web.feeds.news import BasicNewsRecipe

class Gazeta_Wyborcza(BasicNewsRecipe):
    title = u'Gazeta Wyborcza'
    __author__ = 'fenuks'
    cover_url = 'http://bi.gazeta.pl/im/5/10285/z10285445AA.jpg'
    language = 'pl'
    description = 'news from gazeta.pl'
    category = 'newspaper'
    INDEX = 'http://wyborcza.pl'
    remove_empty_feeds = True
    oldest_article = 3
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    remove_tags_before = dict(id='k0')
    remove_tags_after = dict(id='banP4')
    remove_tags = [dict(name='div', attrs={'class':'rel_box'}), dict(attrs={'class':['date', 'zdjP', 'zdjM', 'pollCont', 'rel_video', 'brand', 'txt_upl']}), dict(name='div', attrs={'id':'footer'})]
    feeds = [(u'Kraj', u'http://rss.feedsportal.com/c/32739/f/530266/index.rss'),
             (u'\u015awiat', u'http://rss.feedsportal.com/c/32739/f/530270/index.rss'),
             (u'Wyborcza.biz', u'http://wyborcza.biz/pub/rss/wyborcza_biz_wiadomosci.htm'),
             (u'Komentarze', u'http://rss.feedsportal.com/c/32739/f/530312/index.rss'),
             (u'Kultura', u'http://rss.gazeta.pl/pub/rss/gazetawyborcza_kultura.xml'),
             (u'Nauka', u'http://rss.feedsportal.com/c/32739/f/530269/index.rss'),
             (u'Opinie', u'http://rss.gazeta.pl/pub/rss/opinie.xml'),
             (u'Gazeta \u015awi\u0105teczna', u'http://rss.feedsportal.com/c/32739/f/530431/index.rss'),
             (u'Du\u017cy Format', u'http://rss.feedsportal.com/c/32739/f/530265/index.rss'),
             (u'Witamy w Polsce', u'http://rss.feedsportal.com/c/32739/f/530476/index.rss'),
             (u'M\u0119ska Muzyka', u'http://rss.feedsportal.com/c/32739/f/530337/index.rss'),
             (u'Lata Lec\u0105', u'http://rss.feedsportal.com/c/32739/f/530326/index.rss'),
             (u'Solidarni z Tybetem', u'http://rss.feedsportal.com/c/32739/f/530461/index.rss'),
             (u'W pon. - \u017bakowski', u'http://rss.feedsportal.com/c/32739/f/530491/index.rss'),
             (u'We wt. - Kolenda-Zalewska', u'http://rss.feedsportal.com/c/32739/f/530310/index.rss'),
             (u'\u015aroda w \u015brod\u0119', u'http://rss.feedsportal.com/c/32739/f/530428/index.rss'),
             (u'W pi\u0105tek - Olejnik', u'http://rss.feedsportal.com/c/32739/f/530364/index.rss'),
             (u'Nekrologi', u'http://rss.feedsportal.com/c/32739/f/530358/index.rss')
             ]

    def skip_ad_pages(self, soup):
        tag = soup.find(name='a', attrs={'class':'btn'})
        if tag:
            new_soup = self.index_to_soup(tag['href'], raw=True)
            return new_soup

    def append_page(self, soup, appendtag):
        loop = False
        tag = soup.find('div', attrs={'id':'Str'})
        if appendtag.find('div', attrs={'id':'Str'}):
            nexturl = tag.findAll('a')
            appendtag.find('div', attrs={'id':'Str'}).extract()
            loop = True
        if appendtag.find(id='source'):
            appendtag.find(id='source').extract()
        while loop:
            loop = False
            for link in nexturl:
                if u'następne' in link.string:
                    url = self.INDEX + link['href']
                    soup2 = self.index_to_soup(url)
                    pagetext = soup2.find(id='artykul')
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)
                    tag = soup2.find('div', attrs={'id':'Str'})
                    nexturl = tag.findAll('a')
                    loop = True

    def gallery_article(self, appendtag):
        tag = appendtag.find(id='container_gal')
        if tag:
            nexturl = appendtag.find(id='gal_btn_next').a['href']
            appendtag.find(id='gal_navi').extract()
            while nexturl:
                soup2 = self.index_to_soup(nexturl)
                pagetext = soup2.find(id='container_gal')
                nexturl = pagetext.find(id='gal_btn_next')
                if nexturl:
                    nexturl = nexturl.a['href']
                pos = len(appendtag.contents)
                appendtag.insert(pos, pagetext)
                rem = appendtag.find(id='gal_navi')
                if rem:
                    rem.extract()

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
        if soup.find(id='container_gal'):
            self.gallery_article(soup.body)
        return soup

    def print_version(self, url):
        if 'http://wyborcza.biz/biznes/' not in url:
            return url
        else:
            return url.replace('http://wyborcza.biz/biznes/1', 'http://wyborcza.biz/biznes/2029020')
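The `print_version()` rewrite above only touches business-section URLs and passes everything else through. A calibre-free sketch of the same logic; the article paths are hypothetical examples:

```python
def print_version(url):
    # Business-section article URLs get the print-friendly path prefix;
    # all other URLs are returned unchanged.
    if 'http://wyborcza.biz/biznes/' not in url:
        return url
    return url.replace('http://wyborcza.biz/biznes/1', 'http://wyborcza.biz/biznes/2029020')

biz = print_version('http://wyborcza.biz/biznes/1,100896,123.html')
other = print_version('http://wyborcza.pl/1,75477,123.html')
```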
26
recipes/gildia_pl.recipe
Normal file
@ -0,0 +1,26 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Gildia(BasicNewsRecipe):
    title = u'Gildia.pl'
    __author__ = 'fenuks'
    description = 'Gildia - cultural site'
    cover_url = 'http://www.film.gildia.pl/_n_/portal/redakcja/logo/logo-gildia.pl-500.jpg'
    category = 'culture'
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_tags = [dict(name='div', attrs={'class':'backlink'}), dict(name='div', attrs={'class':'im_img'}), dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'})]
    keep_only_tags = dict(name='div', attrs={'class':'widetext'})
    feeds = [(u'Gry', u'http://www.gry.gildia.pl/rss'),
             (u'Literatura', u'http://www.literatura.gildia.pl/rss'),
             (u'Film', u'http://www.film.gildia.pl/rss'),
             (u'Horror', u'http://www.horror.gildia.pl/rss'),
             (u'Konwenty', u'http://www.konwenty.gildia.pl/rss'),
             (u'Plansz\xf3wki', u'http://www.planszowki.gildia.pl/rss'),
             (u'Manga i anime', u'http://www.manga.gildia.pl/rss'),
             (u'Star Wars', u'http://www.starwars.gildia.pl/rss'),
             (u'Techno', u'http://www.techno.gildia.pl/rss'),
             (u'Historia', u'http://www.historia.gildia.pl/rss'),
             (u'Magia', u'http://www.magia.gildia.pl/rss'),
             (u'Bitewniaki', u'http://www.bitewniaki.gildia.pl/rss'),
             (u'RPG', u'http://www.rpg.gildia.pl/rss'),
             (u'LARP', u'http://www.larp.gildia.pl/rss'),
             (u'Muzyka', u'http://www.muzyka.gildia.pl/rss'),
             (u'Nauka', u'http://www.nauka.gildia.pl/rss')]

    def skip_ad_pages(self, soup):
        content = soup.find('div', attrs={'class':'news'})
        skip_tag = content.findAll(name='a')
        if skip_tag:
            for link in skip_tag:
                if 'recenzja' in link['href']:
                    self.log.warn('odnosnik')
                    self.log.warn(link['href'])
                    return self.index_to_soup(link['href'], raw=True)
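The `skip_ad_pages()` hook above scans all anchors on an interstitial page and jumps to the first link whose href contains `recenzja` (review). A stdlib-only sketch of that scan, with hypothetical hrefs standing in for parsed anchor tags:

```python
# Each dict stands in for a BeautifulSoup <a> tag; the hrefs are made up.
links = [
    {'href': 'http://www.gildia.pl/nowosci/123'},
    {'href': 'http://www.literatura.gildia.pl/recenzja/456'},
]

def find_review_link(candidates):
    # Return the first href that points at a review page, else None,
    # mirroring the early return in skip_ad_pages().
    for link in candidates:
        if 'recenzja' in link['href']:
            return link['href']
    return None

target = find_review_link(links)
```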
112
recipes/gosc_niedzielny.recipe
Normal file
@ -0,0 +1,112 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

__license__ = 'GPL v3'
__copyright__ = '2011, Piotr Kontek, piotr.kontek@gmail.com'

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile
import re

class GN(BasicNewsRecipe):
    EDITION = 0

    __author__ = 'Piotr Kontek'
    title = u'Gość niedzielny'
    description = 'Weekly magazine'
    encoding = 'utf-8'
    no_stylesheets = True
    language = 'pl'
    remove_javascript = True
    temp_files = []

    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        source = br.response().read()
        page = self.index_to_soup(source)

        main_section = page.find('div', attrs={'class':'txt doc_prnt_prv'})

        title = main_section.find('h2')
        info = main_section.find('div', attrs={'class' : 'cf doc_info'})
        authors = info.find(attrs={'class':'l'})
        article = str(main_section.find('p', attrs={'class' : 'doc_lead'}))
        first = True
        for p in main_section.findAll('p', attrs={'class':None}, recursive=False):
            if first and p.find('img') != None:
                article = article + '<p>'
                article = article + str(p.find('img')).replace('src="/files/','src="http://www.gosc.pl/files/')
                article = article + '<font size="-2">'
                for s in p.findAll('span'):
                    article = article + self.tag_to_string(s)
                article = article + '</font></p>'
            else:
                article = article + str(p).replace('src="/files/','src="http://www.gosc.pl/files/')
            first = False

        html = unicode(title) + unicode(authors) + unicode(article)

        self.temp_files.append(PersistentTemporaryFile('_temparse.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    def find_last_issue(self):
        soup = self.index_to_soup('http://gosc.pl/wyszukaj/wydania/3.Gosc-Niedzielny')
        # look for the cover image and the link to the previous full issue
        first = True
        for d in soup.findAll('div', attrs={'class':'l release_preview_l'}):
            img = d.find('img')
            if img != None:
                a = img.parent
                self.EDITION = a['href']
                self.title = img['alt']
                self.cover_url = 'http://www.gosc.pl' + img['src']
                if not first:
                    break
                first = False

    def parse_index(self):
        self.find_last_issue()
        soup = self.index_to_soup('http://www.gosc.pl' + self.EDITION)
        feeds = []
        # the editorial
        a = soup.find('div', attrs={'class':'release-wp-b'}).find('a')
        articles = [
            {'title' : self.tag_to_string(a),
             'url' : 'http://www.gosc.pl' + a['href'].replace('/doc/','/doc_pr/'),
             'date' : '',
             'description' : ''}
        ]
        feeds.append((u'Wstępniak', articles))
        # categories
        for addr in soup.findAll('a', attrs={'href':re.compile('kategoria')}):
            if addr.string != u'wszystkie artyku\u0142y z tej kategorii \xbb':
                main_block = self.index_to_soup('http://www.gosc.pl' + addr['href'])
                articles = list(self.find_articles(main_block))
                if len(articles) > 0:
                    section = addr.string
                    feeds.append((section, articles))
        return feeds

    def find_articles(self, main_block):
        for a in main_block.findAll('div', attrs={'class':'prev_doc2'}):
            art = a.find('a')
            yield {
                'title' : self.tag_to_string(art),
                'url' : 'http://www.gosc.pl' + art['href'].replace('/doc/','/doc_pr/'),
                'date' : '',
                'description' : ''
            }
        for a in main_block.findAll('div', attrs={'class':'sr-document'}):
            art = a.find('a')
            yield {
                'title' : self.tag_to_string(art),
                'url' : 'http://www.gosc.pl' + art['href'].replace('/doc/','/doc_pr/'),
                'date' : '',
                'description' : ''
            }
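Both `parse_index()` and `find_articles()` above rewrite article hrefs to the print version by replacing `/doc/` with `/doc_pr/` and prefixing the site root. A standalone sketch of that rewrite; the article path is a hypothetical example:

```python
def to_print_url(href):
    # Prefix the site root and switch to the print-friendly document path,
    # as the recipe does with href.replace('/doc/', '/doc_pr/').
    return 'http://www.gosc.pl' + href.replace('/doc/', '/doc_pr/')

print_url = to_print_url('/doc/1234.Some-article')
```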
25
recipes/gram_pl.recipe
Normal file
@ -0,0 +1,25 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Gram_pl(BasicNewsRecipe):
    title = u'Gram.pl'
    __author__ = 'fenuks'
    description = 'Gram.pl - site about computer games'
    category = 'games'
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100
    no_stylesheets = True
    extra_css = 'h2 {font-style: italic; font-size:20px;}'
    cover_url = u'http://www.gram.pl/www/01/img/grampl_zima.png'
    remove_tags = [dict(name='p', attrs={'class':['extraText', 'must-log-in']}),
                   dict(attrs={'class':['el', 'headline', 'post-info']}),
                   dict(name='div', attrs={'class':['twojaOcena', 'comment-body', 'comment-author vcard', 'comment-meta commentmetadata', 'tw_button']}),
                   dict(id=['igit_rpwt_css', 'comments', 'reply-title', 'igit_title'])]
    keep_only_tags = [dict(name='div', attrs={'class':['main', 'arkh-postmetadataheader', 'arkh-postcontent', 'post', 'content', 'news_header', 'news_subheader', 'news_text']}),
                      dict(attrs={'class':['contentheading', 'contentpaneopen']})]
    feeds = [(u'gram.pl - informacje', u'http://www.gram.pl/feed_news.asp'),
             (u'gram.pl - publikacje', u'http://www.gram.pl/feed_news.asp?type=articles')]

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            for article in feed.articles[:]:
                if 'REKLAMA SKLEP' in article.title.upper() or u'ARTYKUŁ:' in article.title.upper():
                    feed.articles.remove(article)
        return feeds
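The `parse_feeds()` override above filters out shop-advert entries by title, iterating over a copy of the list so removal is safe mid-loop. A stdlib-only sketch of the same pattern using plain dicts instead of calibre Feed objects; the titles are hypothetical:

```python
articles = [
    {'title': u'Reklama sklep: promocja'},
    {'title': u'Nowa recenzja gry'},
]

def drop_unwanted(items):
    # Iterate over a copy (items[:]) so items.remove() does not skip
    # elements, mirroring `for article in feed.articles[:]`.
    for item in items[:]:
        if 'REKLAMA SKLEP' in item['title'].upper():
            items.remove(item)
    return items

kept = drop_unwanted(articles)
```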
13
recipes/greenlinux_pl.recipe
Normal file
@ -0,0 +1,13 @@
from calibre.web.feeds.news import BasicNewsRecipe

class GreenLinux(BasicNewsRecipe):
    title = u'GreenLinux.pl'
    __author__ = 'fenuks'
    category = 'IT'
    language = 'pl'
    cover_url = 'http://lh5.ggpht.com/_xd_6Y9kXhEc/S8tjyqlfhfI/AAAAAAAAAYU/zFNTp07ZQko/top.png'
    oldest_article = 15
    max_articles_per_feed = 100
    auto_cleanup = True

    feeds = [(u'Newsy', u'http://feeds.feedburner.com/greenlinux')]
38
recipes/gry_online_pl.recipe
Normal file
@ -0,0 +1,38 @@
from calibre.web.feeds.recipes import BasicNewsRecipe

class Gry_online_pl(BasicNewsRecipe):
    title = u'Gry-Online.pl'
    __author__ = 'fenuks'
    description = 'Gry-Online.pl - computer games'
    category = 'games'
    language = 'pl'
    oldest_article = 13
    INDEX = 'http://www.gry-online.pl/'
    cover_url = 'http://www.gry-online.pl/img/1st_10/1st-gol-logo.png'
    max_articles_per_feed = 100
    no_stylesheets = True
    extra_css = 'p.wn1{font-size:22px;}'
    remove_tags_after = [dict(name='div', attrs={'class':['tresc-newsa']})]
    keep_only_tags = [dict(name='div', attrs={'class':['txthead']}), dict(name='p', attrs={'class':['wtx1', 'wn1', 'wob']}), dict(name='a', attrs={'class':['num_str_nex']})]
    #remove_tags = [dict(name='div', attrs={'class':['news_plat']})]
    feeds = [(u'Newsy', 'http://www.gry-online.pl/rss/news.xml'), ('Teksty', u'http://www.gry-online.pl/rss/teksty.xml')]

    def append_page(self, soup, appendtag):
        nexturl = soup.find('a', attrs={'class':'num_str_nex'})
        if appendtag.find('a', attrs={'class':'num_str_nex'}) is not None:
            appendtag.find('a', attrs={'class':'num_str_nex'}).replaceWith('\n')
        if nexturl is not None:
            if 'strona' in nexturl.div.string:
                nexturl = self.INDEX + nexturl['href']
                soup2 = self.index_to_soup(nexturl)
                pagetext = soup2.findAll(name='p', attrs={'class':['wtx1', 'wn1', 'wob']})
                for tag in pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, tag)
                self.append_page(soup2, appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
        return soup
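The `append_page()` method above is recursive: each fetched page's paragraphs are appended to the first page's body, then the method calls itself on the next page until no "next" link remains. A calibre-free sketch of that recursion; the `pages` dict stands in for fetched soups and is entirely hypothetical:

```python
# Each entry maps a page id to (its text, the id of the next page or None).
pages = {
    'p1': ('first ', 'p2'),
    'p2': ('second ', 'p3'),
    'p3': ('third', None),
}

def collect(page_id, out):
    # Append this page's text, then recurse on the next page,
    # exactly as append_page() calls itself on the next soup.
    text, next_id = pages[page_id]
    out.append(text)
    if next_id is not None:
        collect(next_id, out)
    return out

article = ''.join(collect('p1', []))
```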
@ -15,8 +15,10 @@ class Guardian(BasicNewsRecipe):
    title = u'The Guardian and The Observer'
    if date.today().weekday() == 6:
        base_url = "http://www.guardian.co.uk/theobserver"
        cover_pic = 'Observer digital edition'
    else:
        base_url = "http://www.guardian.co.uk/theguardian"
        cover_pic = 'Guardian digital edition'

    __author__ = 'Seabound and Sujata Raman'
    language = 'en_GB'
@ -79,7 +81,7 @@ class Guardian(BasicNewsRecipe):
        # soup = self.index_to_soup("http://www.guardian.co.uk/theobserver")
        soup = self.index_to_soup(self.base_url)
        # find cover pic
        img = soup.find( 'img',attrs ={'alt':'Guardian digital edition'})
        img = soup.find( 'img',attrs ={'alt':self.cover_pic})
        if img is not None:
            self.cover_url = img['src']
        # end find cover pic
50
recipes/h7_tumspor.recipe
Normal file
@ -0,0 +1,50 @@
# -*- coding: utf-8 -*-

from calibre.web.feeds.news import BasicNewsRecipe

class Haber7TS(BasicNewsRecipe):

    title = u'H7 TÜMSPOR'
    __author__ = u'thomass'
    description = ' Haber 7 TÜMSPOR sitesinden tüm branşlarda spor haberleri '
    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    #delay = 1
    #use_embedded_content = False
    encoding = 'ISO 8859-9'
    publisher = 'thomass'
    category = 'güncel, haber, türkçe,spor,futbol'
    language = 'tr'
    publication_type = 'newspaper'

    conversion_options = {
        'tags' : category
        ,'language' : language
        ,'publisher' : publisher
        ,'linearize_tables': True
    }
    extra_css = ' #newsheadcon h1{font-weight: bold; font-size: 18px;color:#0000FF} '
    keep_only_tags = [dict(name='div', attrs={'class':['intNews','leftmidmerge']})]
    remove_tags = [dict(name='div', attrs={'id':['blocktitle','banner46860body']}),
                   dict(name='div', attrs={'class':['Breadcrumb','shr','mobile/home.jpg','etiket','yorumYazNew','y-list','banner','lftBannerShowcase','comments','interNews','lftBanner','midblock','rightblock','comnum','commentcon']}),
                   dict(name='a', attrs={'class':['saveto','sendto','comlink','newsshare']}),
                   dict(name='iframe', attrs={'name':['frm111','frm107']}),
                   dict(name='ul', attrs={'class':['nocPagi','leftmidmerge']})]
    cover_img_url = 'http://image.tumspor.com/v2/images/tasarim/images/logo.jpg'
    masthead_url = 'http://image.tumspor.com/v2/images/tasarim/images/logo.jpg'
    remove_empty_feeds = True

    feeds = [
        (u'Futbol', u'http://open.dapper.net/services/h7tsfutbol'),
        (u'Basketbol', u'http://open.dapper.net/services/h7tsbasket'),
        (u'Tenis', u'http://open.dapper.net/services/h7tstenis'),
        (u'NBA', u'http://open.dapper.net/services/h7tsnba'),
        (u'Diğer Sporlar', u'http://open.dapper.net/services/h7tsdiger'),
        (u'Yazarlar & Magazin', u'http://open.dapper.net/services/h7tsyazarmagazin'),
    ]

    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    # def print_version(self, url):
    #     return url.replace('http://www.aksiyon.com.tr/aksiyon/newsDetail_getNewsById.action?load=detay&', 'http://www.aksiyon.com.tr/aksiyon/mobile_detailn.action?')
60
recipes/haber7.recipe
Normal file
@ -0,0 +1,60 @@
# -*- coding: utf-8 -*-

from calibre.web.feeds.news import BasicNewsRecipe

class Haber7(BasicNewsRecipe):

    title = u'Haber 7'
    __author__ = u'thomass'
    description = ' Haber 7 sitesinden haberler '
    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    #delay = 1
    #use_embedded_content = False
    encoding = 'ISO 8859-9'
    publisher = 'thomass'
    category = 'güncel, haber, türkçe'
    language = 'tr'
    publication_type = 'newspaper'

    conversion_options = {
        'tags' : category
        ,'language' : language
        ,'publisher' : publisher
        ,'linearize_tables': True
    }
    extra_css = 'body{ font-size: 12px}h2{font-weight: bold; font-size: 18px;color:#0000FF} #newsheadcon h1{font-weight: bold; font-size: 18px;color:#0000FF}'

    keep_only_tags = [dict(name='div', attrs={'class':['intNews','leftmidmerge']})]
    remove_tags = [dict(name='div', attrs={'id':['blocktitle','banner46860body']}),
                   dict(name='div', attrs={'class':['Breadcrumb','shr','mobile/home.jpg','etiket','yorumYazNew','y-list','banner','lftBannerShowcase','comments','interNews','lftBanner','midblock','rightblock','comnum','commentcon']}),
                   dict(name='a', attrs={'class':['saveto','sendto','comlink','newsshare']}),
                   dict(name='iframe', attrs={'name':['frm111','frm107']}),
                   dict(name='ul', attrs={'class':['nocPagi','leftmidmerge']})]

    cover_img_url = 'http://dl.dropbox.com/u/39726752/haber7.JPG'
    masthead_url = 'http://dl.dropbox.com/u/39726752/haber7.JPG'
    remove_empty_feeds = True

    feeds = [
        (u'Siyaset', u'http://open.dapper.net/services/h7siyaset'),
        (u'Güncel', u'http://open.dapper.net/services/h7guncel'),
        (u'Yaşam', u'http://open.dapper.net/services/h7yasam'),
        (u'Ekonomi', u'http://open.dapper.net/services/h7ekonomi'),
        (u'3. Sayfa', u'http://open.dapper.net/services/h73sayfa'),
        (u'Dünya', u'http://open.dapper.net/services/h7dunya'),
        (u'Medya', u'http://open.dapper.net/services/h7medya'),
        (u'Yazarlar', u'http://open.dapper.net/services/h7yazarlar'),
        (u'Bilim', u'http://open.dapper.net/services/h7bilim'),
        (u'Eğitim', u'http://open.dapper.net/services/h7egitim'),
        (u'Spor', u'http://open.dapper.net/services/h7sporv3'),
    ]

    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    # def print_version(self, url):
    #     return url.replace('http://www.aksiyon.com.tr/aksiyon/newsDetail_getNewsById.action?load=detay&', 'http://www.aksiyon.com.tr/aksiyon/mobile_detailn.action?')
123
recipes/hackernews.recipe
Normal file
@ -0,0 +1,123 @@
#!/usr/bin/env python

__license__ = 'GPL v3'
'''
Hacker News
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile
from urlparse import urlparse
import re

class HackerNews(BasicNewsRecipe):
    title = 'Hacker News'
    __author__ = 'Tom Scholl'
    description = u'Hacker News, run by Y Combinator. Anything that good hackers would find interesting, with a focus on programming and startups.'
    publisher = 'Y Combinator'
    category = 'news, programming, it, technology'
    masthead_url = 'http://img585.imageshack.us/img585/5011/hnle.png'
    cover_url = 'http://img585.imageshack.us/img585/5011/hnle.png'
    delay = 1
    max_articles_per_feed = 30
    use_embedded_content = False
    no_stylesheets = True
    encoding = 'utf-8'
    language = 'en'
    requires_version = (0,8,16)

    feeds = [
        (u'Hacker News', 'http://news.ycombinator.com/rss')
    ]

    temp_files = []
    articles_are_obfuscated = True

    def get_readable_content(self, url):
        self.log('get_readable_content(' + url + ')')
        br = self.get_browser()
        f = br.open(url)
        html = f.read()
        f.close()

        return self.extract_readable_article(html, url)

    def get_hn_content(self, url):
        self.log('get_hn_content(' + url + ')')
        soup = self.index_to_soup(url)
        main = soup.find('tr').findNextSiblings('tr', limit=2)[1].td

        title = self.tag_to_string(main.find('td', 'title'))
        link = main.find('td', 'title').find('a')['href']
        if link.startswith('item?'):
            link = 'http://news.ycombinator.com/' + link
        readable_link = link.rpartition('http://')[2].rpartition('https://')[2]
        subtext = self.tag_to_string(main.find('td', 'subtext'))

        title_content_td = main.find('td', 'title').findParent('tr').findNextSiblings('tr', limit=3)[2].findAll('td', limit=2)[1]
        title_content = u''
        if not title_content_td.find('form'):
            title_content_td.name = 'div'
            title_content = title_content_td.prettify()

        comments = u''
        for td in main.findAll('td', 'default'):
            comhead = td.find('span', 'comhead')
            if comhead:
                com_title = u'<h4>' + self.tag_to_string(comhead).replace(' | link', '') + u'</h4>'
                comhead.parent.extract()
                br = td.find('br')
                if br:
                    br.extract()
                reply = td.find('a', attrs = {'href' : re.compile('^reply?')})
                if reply:
                    reply.parent.extract()
                td.name = 'div'
                indent_width = (int(td.parent.find('td').img['width']) * 2) / 3
                td['style'] = 'padding-left: ' + str(indent_width) + 'px'
                comments = comments + com_title + td.prettify()

        body = u'<h3>' + title + u'</h3><p><a href="' + link + u'">' + readable_link + u'</a><br/><strong>' + subtext + u'</strong></p>' + title_content + u'<br/>'
        body = body + comments
        return u'<html><title>' + title + u'</title><body>' + body + '</body></html>'

    def get_obfuscated_article(self, url):
        if url.startswith('http://news.ycombinator.com'):
            content = self.get_hn_content(url)
        else:
            # TODO: use content-type header instead of url
            is_image = False
            for ext in ['.jpg', '.png', '.svg', '.gif', '.jpeg', '.tiff', '.bmp',]:
                if url.endswith(ext):
                    is_image = True
                    break

            if is_image:
                self.log('using image_content (' + url + ')')
                content = u'<html><body><img src="' + url + u'"></body></html>'
            else:
                content = self.get_readable_content(url)

        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(content)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    def is_link_wanted(self, url, tag):
        if url.endswith('.pdf'):
            return False
        return True

    def prettyify_url(self, url):
        return urlparse(url).hostname

    def populate_article_metadata(self, article, soup, first):
        article.text_summary = self.prettyify_url(article.url)
        article.summary = article.text_summary

    # def parse_index(self):
    #     feeds = []
    #     feeds.append((u'Hacker News',[{'title': 'Testing', 'url': 'http://news.ycombinator.com/item?id=2935944'}]))
    #     return feeds
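The `prettyify_url()` helper above reduces a full article URL to its hostname for the summary line. The recipe uses the Python 2 `urlparse` module; the same idea with the Python 3 equivalent, as a standalone sketch:

```python
from urllib.parse import urlparse

def prettify_url(url):
    # Keep only the hostname, as the recipe does for article summaries.
    return urlparse(url).hostname

host = prettify_url('http://news.ycombinator.com/item?id=2935944')
```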
@ -11,8 +11,15 @@ class HBR(BasicNewsRecipe):
    timefmt = ' [%B %Y]'
    language = 'en'
    no_stylesheets = True
    recipe_disabled = ('hbr.org has started requiring the use of javascript'
            ' to log into their website. This is unsupported in calibre, so'
            ' this recipe has been disabled. If you would like to see '
            ' HBR supported in calibre, contact hbr.org and ask them'
            ' to provide a javascript free login method.')

    LOGIN_URL = 'https://hbr.org/login?request_url=/'
    LOGOUT_URL = 'https://hbr.org/logout?request_url=/'

    LOGIN_URL = 'http://hbr.org/login?request_url=/'
    INDEX = 'http://hbr.org/archive-toc/BR'

    keep_only_tags = [dict(name='div', id='pageContainer')]
@ -34,17 +41,23 @@ class HBR(BasicNewsRecipe):

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        self.logout_url = None

        #'''
        br.open(self.LOGIN_URL)
        br.select_form(name='signin-form')
        br['signin-form:username'] = self.username
        br['signin-form:password'] = self.password
        raw = br.submit().read()
        if 'My Account' not in raw:
        if '>Sign out<' not in raw:
            raise Exception('Failed to login, are you sure your username and password are correct?')
        self.logout_url = None
        try:
            link = br.find_link(text='Sign out')
            if link:
                self.logout_url = link.absolute_url
        except:
            self.logout_url = self.LOGOUT_URL
        #'''
        return br

    def cleanup(self):
@ -57,6 +70,8 @@ class HBR(BasicNewsRecipe):


    def hbr_get_toc(self):
        #return self.index_to_soup(open('/t/hbr.html').read())

        today = date.today()
        future = today + timedelta(days=30)
        for x in [x.strftime('%y%m') for x in (future, today)]:
@ -66,53 +81,43 @@ class HBR(BasicNewsRecipe):
            return soup
        raise Exception('Could not find current issue')

    def hbr_parse_section(self, container, feeds):
    def hbr_parse_toc(self, soup):
        feeds = []
        current_section = None
        current_articles = []
        for x in container.findAll(name=['li', 'h3', 'h4']):
            if x.name in ['h3', 'h4'] and not x.findAll(True):
                if current_section and current_articles:
                    feeds.append((current_section, current_articles))
                current_section = self.tag_to_string(x)
                current_articles = []
        articles = []
        for x in soup.find(id='archiveToc').findAll(['h3', 'h4']):
            if x.name == 'h3':
                if current_section is not None and articles:
                    feeds.append((current_section, articles))
                current_section = self.tag_to_string(x).capitalize()
                articles = []
                self.log('\tFound section:', current_section)
            if x.name == 'li':
            else:
                a = x.find('a', href=True)
                if a is not None:
                if a is None: continue
                title = self.tag_to_string(a)
                url = a.get('href')
                url = a['href']
                if '/ar/' not in url:
                    continue
                if url.startswith('/'):
                    url = 'http://hbr.org'+url
                    url = 'http://hbr.org' + url
                url = self.map_url(url)
                p = x.find('p')
                p = x.parent.find('p')
                desc = ''
                if p is not None:
                    desc = self.tag_to_string(p)
                if not title or not url:
                    continue
                self.log('\t\tFound article:', title)
                self.log('\t\t\t', url)
                self.log('\t\t\t', desc)
                current_articles.append({'title':title, 'url':url,
                    'description':desc, 'date':''})
        if current_section and current_articles:
            feeds.append((current_section, current_articles))



    def hbr_parse_toc(self, soup):
        feeds = []
        features = soup.find(id='issueFeaturesContent')
        self.hbr_parse_section(features, feeds)
        departments = soup.find(id='issueDepartments')
        self.hbr_parse_section(departments, feeds)
                articles.append({'title':title, 'url':url, 'description':desc,
                    'date':''})
        return feeds


    def parse_index(self):
        soup = self.hbr_get_toc()
        #open('/t/hbr.html', 'wb').write(unicode(soup).encode('utf-8'))
        feeds = self.hbr_parse_toc(soup)
        return feeds
@@ -5,34 +5,27 @@ class HBR(BasicNewsRecipe):

title = 'Harvard Business Review Blogs'
description = 'To subscribe go to http://hbr.harvardbusiness.org'
needs_subscription = True
__author__ = 'Kovid Goyal, enhanced by BrianG'
__author__ = 'Kovid Goyal'
language = 'en'
no_stylesheets = True
#recipe_disabled = ('hbr.org has started requiring the use of javascript'
# ' to log into their website. This is unsupported in calibre, so'
# ' this recipe has been disabled. If you would like to see '
# ' HBR supported in calibre, contact hbr.org and ask them'
# ' to provide a javascript free login method.')
needs_subscription = False

LOGIN_URL = 'http://hbr.org/login?request_url=/'
LOGOUT_URL = 'http://hbr.org/logout?request_url=/'

INDEX = 'http://hbr.org/current'

#
# Blog Stuff
#

INCLUDE_BLOGS = True
INCLUDE_ARTICLES = False

# option-specific settings.

if INCLUDE_BLOGS == True:
remove_tags_after = dict(id='articleBody')
remove_tags_before = dict(id='pageFeature')
feeds = [('Blog','http://feeds.harvardbusiness.org/harvardbusiness')]
oldest_article = 30
max_articles_per_feed = 100
use_embedded_content = False
else:
timefmt = ' [%B %Y]'

keep_only_tags = [ dict(name='div', id='pageContainer')
]
@@ -41,21 +34,16 @@ class HBR(BasicNewsRecipe):
'articleToolbarTopRD', 'pageRightSubColumn', 'pageRightColumn',
'todayOnHBRListWidget', 'mostWidget', 'keepUpWithHBR',
'articleToolbarTop','articleToolbarBottom', 'articleToolbarRD',
'mailingListTout', 'partnerCenter', 'pageFooter']),
dict(name='iframe')]
'mailingListTout', 'partnerCenter', 'pageFooter', 'shareWidgetTop']),
dict(name=['iframe', 'style'])]

extra_css = '''
a {font-family:Georgia,"Times New Roman",Times,serif; font-style:italic; color:#000000; }
.article{font-family:Georgia,"Times New Roman",Times,serif; font-size: xx-small;}
h2{font-family:Georgia,"Times New Roman",Times,serif; font-weight:bold; font-size:large; }
h4{font-family:Georgia,"Times New Roman",Times,serif; font-weight:bold; font-size:small; }
#articleBody{font-family:Georgia,"Times New Roman",Times,serif; font-style:italic; color:#000000;font-size:x-small;}
#summaryText{font-family:Georgia,"Times New Roman",Times,serif; font-weight:bold; font-size:x-small;}
'''
#-------------------------------------------------------------------------------------------------

def get_browser(self):
br = BasicNewsRecipe.get_browser(self)
self.logout_url = None
return br

#'''
br.open(self.LOGIN_URL)
br.select_form(name='signin-form')
br['signin-form:username'] = self.username
@@ -63,11 +51,15 @@ class HBR(BasicNewsRecipe):
raw = br.submit().read()
if 'My Account' not in raw:
raise Exception('Failed to login, are you sure your username and password are correct?')
self.logout_url = None
try:
link = br.find_link(text='Sign out')
if link:
self.logout_url = link.absolute_url
except:
self.logout_url = self.LOGOUT_URL
#'''
return br

#-------------------------------------------------------------------------------------------------
def cleanup(self):
if self.logout_url is not None:
@@ -76,99 +68,7 @@ class HBR(BasicNewsRecipe):
    def map_url(self, url):
        if url.endswith('/ar/1'):
            return url[:-1]+'pr'
    #-------------------------------------------------------------------------------------------------

    def hbr_get_toc(self):
        soup = self.index_to_soup(self.INDEX)
        url = soup.find('a', text=lambda t:'Full Table of Contents' in t).parent.get('href')
        return self.index_to_soup('http://hbr.org'+url)

    #-------------------------------------------------------------------------------------------------

    def hbr_parse_section(self, container, feeds):
        current_section = None
        current_articles = []
        for x in container.findAll(name=['li', 'h3', 'h4']):
            if x.name in ['h3', 'h4'] and not x.findAll(True):
                if current_section and current_articles:
                    feeds.append((current_section, current_articles))
                current_section = self.tag_to_string(x)
                current_articles = []
                self.log('\tFound section:', current_section)
            if x.name == 'li':
                a = x.find('a', href=True)
                if a is not None:
                    title = self.tag_to_string(a)
                    url = a.get('href')
                    if '/ar/' not in url:
                        continue
                    if url.startswith('/'):
                        url = 'http://hbr.org'+url
                    url = self.map_url(url)
                    p = x.find('p')
                    desc = ''
                    if p is not None:
                        desc = self.tag_to_string(p)
                    if not title or not url:
                        continue
                    self.log('\t\tFound article:', title)
                    self.log('\t\t\t', url)
                    self.log('\t\t\t', desc)
                    current_articles.append({'title':title, 'url':url,
                        'description':desc, 'date':''})
        if current_section and current_articles:
            feeds.append((current_section, current_articles))

    #-------------------------------------------------------------------------------------------------

    def hbr_parse_toc(self, soup):
        feeds = []
        features = soup.find(id='issueFeaturesContent')
        self.hbr_parse_section(features, feeds)
        departments = soup.find(id='issueDepartments')
        self.hbr_parse_section(departments, feeds)
        return feeds
    #-------------------------------------------------------------------------------------------------
    def feed_to_index_append(self, feedObject, masterFeed):
        # Loop thru the feed object and build the correct type of article list
        for feed in feedObject:
            # build the correct structure from the feed object
            newArticles = []
            for article in feed.articles:
                newArt = {
                    'title' : article.title,
                    'url' : article.url,
                    'date' : article.date,
                    'description' : article.text_summary
                }
                newArticles.append(newArt)

            # Append the earliest/latest dates of the feed to the feed title
            startDate, endDate = self.get_feed_dates(feed, '%d-%b')
            newFeedTitle = feed.title + ' (' + startDate + ' thru ' + endDate + ')'

            # append the newly-built list object to the index object passed in
            # as masterFeed.
            masterFeed.append( (newFeedTitle,newArticles) )

    #-------------------------------------------------------------------------------------------------
    def get_feed_dates(self, feedObject, dateMask):
        startDate = feedObject.articles[len(feedObject.articles)-1].localtime.strftime(dateMask)
        endDate = feedObject.articles[0].localtime.strftime(dateMask)

        return startDate, endDate

    #-------------------------------------------------------------------------------------------------

    def parse_index(self):
        if self.INCLUDE_ARTICLES == True:
            soup = self.hbr_get_toc()
            feeds = self.hbr_parse_toc(soup)
        else:
            return BasicNewsRecipe.parse_index(self)

        return feeds
    #-------------------------------------------------------------------------------------------------
    def get_cover_url(self):
        cover_url = None
        index = 'http://hbr.org/current'
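The HBR recipe's `map_url` rewrites an article URL ending in `/ar/1` to its printable `/ar/pr` counterpart. The same logic as a standalone function, with a pass-through return added for illustration (the recipe itself returns None for non-matching URLs):

```python
def map_url(url):
    # Rewrite an HBR article URL ending in '/ar/1' to the printable
    # '/ar/pr' form; other URLs are returned unchanged (pass-through
    # added here for illustration).
    if url.endswith('/ar/1'):
        return url[:-1] + 'pr'
    return url
```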
recipes/hindustan_times.recipe
Normal file
@@ -0,0 +1,29 @@
from calibre.web.feeds.news import BasicNewsRecipe

class HindustanTimes(BasicNewsRecipe):
    title = u'Hindustan Times'
    language = 'en_IN'
    __author__ = 'Krittika Goyal'
    oldest_article = 1  # days
    max_articles_per_feed = 25
    use_embedded_content = False

    no_stylesheets = True
    auto_cleanup = True

    feeds = [
        ('News',
         'http://feeds.hindustantimes.com/HT-NewsSectionPage-Topstories'),
        ('Views',
         'http://feeds.hindustantimes.com/HT-ViewsSectionpage-Topstories'),
        ('Cricket',
         'http://feeds.hindustantimes.com/HT-Cricket-TopStories'),
        ('Business',
         'http://feeds.hindustantimes.com/HT-BusinessSectionpage-TopStories'),
        ('Entertainment',
         'http://feeds.hindustantimes.com/HT-HomePage-Entertainment'),
        ('Lifestyle',
         'http://feeds.hindustantimes.com/HT-Homepage-LifestyleNews'),
    ]
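The `oldest_article = 1` setting in the HindustanTimes recipe above tells the downloader to keep only articles published within the last day. A sketch of that age-cutoff check, independent of calibre (the helper name and explicit `now` parameter are illustrative; calibre applies this filter internally):

```python
from datetime import datetime, timedelta

def within_window(published, oldest_article_days, now):
    # An article qualifies only if it is no older than the configured
    # window, mirroring the intent of the oldest_article setting.
    return now - published <= timedelta(days=oldest_article_days)
```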
recipes/hira.recipe
Normal file
@@ -0,0 +1,52 @@
# coding=utf-8

from calibre.web.feeds.recipes import BasicNewsRecipe

class Hira(BasicNewsRecipe):
    title = 'Hira'
    __author__ = 'thomass'
    description = 'مجلة حراء مجلة علمية ثقافية فكرية تصدر كل شهرين، تعنى بالعلوم الطبيعية والإنسانية والاجتماعية وتحاور أسرار النفس البشرية وآفاق الكون الشاسعة بالمنظور القرآني الإيماني في تآلف وتناسب بين العلم والإيمان، والعقل والقلب، والفكر والواقع.'
    oldest_article = 63
    max_articles_per_feed = 50
    no_stylesheets = True
    #delay = 1
    use_embedded_content = False
    encoding = 'utf-8'
    publisher = 'thomass'
    category = 'News'
    language = 'ar'
    publication_type = 'magazine'
    extra_css = ' .title-detail-wrap{ font-weight: bold ;text-align:right;color:#FF0000;font-size:25px}.title-detail{ font-family:sans-serif;text-align:right;} '

    conversion_options = {
        'tags' : category
        ,'language' : language
        ,'publisher' : publisher
        ,'linearize_tables': True
        ,'base-font-size':'10'
    }
    #html2lrf_options = []
    keep_only_tags = [
        dict(name='div', attrs={'class':['title-detail']})
    ]

    remove_tags = [
        dict(name='div', attrs={'class':['clear', 'bbsp']}),
    ]

    remove_attributes = [
        'width','height'
    ]

    feeds = [
        (u'حراء', 'http://open.dapper.net/services/hira'),
    ]

    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup
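The `preprocess_html` hook in the Hira recipe above unwraps every `<a>` tag, replacing it with its text content, via BeautifulSoup's `replaceWith`. A rough stdlib stand-in for the same transformation, using a regex instead of the BeautifulSoup API the recipe actually uses (the function name is illustrative):

```python
import re

def strip_links(html):
    # Replace each <a ...>text</a> with just its text, mimicking the
    # recipe's alink.replaceWith(tstr) pass.  A regex is a crude
    # stand-in for proper HTML parsing; it suffices for simple markup.
    return re.sub(r'<a\b[^>]*>(.*?)</a>', r'\1', html, flags=re.S)
```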
recipes/historia_pl.recipe
Normal file
@@ -0,0 +1,13 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Historia_org_pl(BasicNewsRecipe):
    title = u'Historia.org.pl'
    __author__ = 'fenuks'
    description = u'history site'
    cover_url = 'http://lh3.googleusercontent.com/_QeRQus12wGg/TOvHsZ2GN7I/AAAAAAAAD_o/LY1JZDnq7ro/logo5.jpg'
    category = 'history'
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100

    feeds = [(u'Artykuły', u'http://www.historia.org.pl/index.php?format=feed&type=rss')]
@@ -1,8 +1,6 @@

#!/usr/bin/env python
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai

import string, pprint

from calibre.web.feeds.news import BasicNewsRecipe

class HoustonChronicle(BasicNewsRecipe):
@@ -13,53 +11,28 @@ class HoustonChronicle(BasicNewsRecipe):
language = 'en'
timefmt = ' [%a, %d %b, %Y]'
no_stylesheets = True
use_embedded_content = False
remove_attributes = ['style']

keep_only_tags = [
dict(id=['story-head', 'story'])
oldest_article = 2.0

keep_only_tags = {'class':lambda x: x and ('hst-articletitle' in x or
'hst-articletext' in x or 'hst-galleryitem' in x)}

feeds = [
('News', "http://www.chron.com/rss/feed/News-270.php"),
('Sports',
'http://www.chron.com/sports/headlines/collectionRss/Sports-Headlines-Staff-Stories-10767.php'),
('Neighborhood',
'http://www.chron.com/rss/feed/Neighborhood-305.php'),
('Business', 'http://www.chron.com/rss/feed/Business-287.php'),
('Entertainment',
'http://www.chron.com/rss/feed/Entertainment-293.php'),
('Editorials',
'http://www.chron.com/opinion/editorials/collectionRss/Opinion-Editorials-Headline-List-10567.php'),
('Life', 'http://www.chron.com/rss/feed/Life-297.php'),
('Science & Tech',
'http://www.chron.com/rss/feed/AP-Technology-and-Science-266.php'),
]

remove_tags = [
dict(id=['share-module', 'resource-box',
'resource-box-header'])
]

extra_css = '''
h1{font-family :Arial,Helvetica,sans-serif; font-size:large;}
h2{font-family :Arial,Helvetica,sans-serif; font-size:medium; color:#666666;}
h3{font-family :Arial,Helvetica,sans-serif; font-size:medium; color:#000000;}
h4{font-family :Arial,Helvetica,sans-serif; font-size: x-small;}
p{font-family :Arial,Helvetica,sans-serif; font-size:x-small;}
#story-head h1{font-family :Arial,Helvetica,sans-serif; font-size: xx-large;}
#story-head h2{font-family :Arial,Helvetica,sans-serif; font-size: small; color:#000000;}
#story-head h3{font-family :Arial,Helvetica,sans-serif; font-size: xx-small;}
#story-head h4{font-family :Arial,Helvetica,sans-serif; font-size: xx-small;}
#story{font-family :Arial,Helvetica,sans-serif; font-size:xx-small;}
#Text-TextSubhed BoldCond PoynterAgateZero h3{color:#444444;font-family :Arial,Helvetica,sans-serif; font-size:small;}
.p260x p{font-family :Arial,Helvetica,serif; font-size:x-small;font-style:italic;}
.p260x h6{color:#777777;font-family :Arial,Helvetica,sans-serif; font-size:xx-small;}
'''

def parse_index(self):
categories = ['news', 'sports', 'business', 'entertainment', 'life',
'travel']
feeds = []
for cat in categories:
articles = []
soup = self.index_to_soup('http://www.chron.com/%s/'%cat)
for elem in soup.findAll(comptype='story', storyid=True):
a = elem.find('a', href=True)
if a is None: continue
url = a['href']
if not url.startswith('http://'):
url = 'http://www.chron.com'+url
articles.append({'title':self.tag_to_string(a), 'url':url,
'description':'', 'date':''})
pprint.pprint(articles[-1])
if articles:
feeds.append((string.capwords(cat), articles))
return feeds
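The HoustonChronicle hunk above switches `keep_only_tags` to a class-matching lambda that keeps any element whose class attribute mentions one of the `hst-*` article markers. The same predicate written out as a named function (the name is illustrative; in the recipe it is an anonymous lambda passed to BeautifulSoup):

```python
def hst_keep(class_attr):
    # Keep an element only if it has a class attribute containing one
    # of the article content markers used by chron.com pages.
    return bool(class_attr and ('hst-articletitle' in class_attr or
                                'hst-articletext' in class_attr or
                                'hst-galleryitem' in class_attr))
```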
BIN recipes/icons/adventure_zone_pl.png (new file, 1.6 KiB)
BIN recipes/icons/android_com_pl.png (new file, 1.4 KiB)
BIN recipes/icons/archeowiesci.png (new file, 718 B)
BIN recipes/icons/astro_news_pl.png (new file, 625 B)
BIN recipes/icons/astronomia_pl.png (new file, 389 B)
BIN recipes/icons/bash_org_pl.png (new file, 391 B)
BIN recipes/icons/benchmark_pl.png (new file, 658 B)
BIN recipes/icons/cd_action.png (new file, 972 B)
BIN recipes/icons/cgm_pl.png (new file, 837 B)
BIN recipes/icons/dark_horizons.png (new file, 399 B)
BIN recipes/icons/den_of_geek.png (new file, 1.0 KiB)
BIN recipes/icons/dobreprogamy.png (new file, 1.1 KiB)
BIN recipes/icons/dzieje_pl.png (new file, 642 B)
BIN recipes/icons/eioba.png (new file, 908 B)
BIN recipes/icons/elektroda_pl.png (new file, 1023 B)
BIN recipes/icons/film_web.png (new file, 3.4 KiB)
BIN recipes/icons/focus_pl.png (new file, 695 B)
BIN recipes/icons/gazeta_wyborcza.png (new file, 221 B)
BIN recipes/icons/gram_pl.png (new file, 1.1 KiB)
BIN recipes/icons/greenlinux_pl.png (new file, 648 B)
BIN recipes/icons/gry_online_pl.png (new file, 249 B)
BIN recipes/icons/historia_pl.png (new file, 806 B)
BIN recipes/icons/japan_times.png (new file, 1.2 KiB)
BIN recipes/icons/konflikty_zbrojne.png (new file, 320 B)
BIN recipes/icons/lomza.png (new file, 2.0 KiB)
BIN recipes/icons/niebezpiecznik.png (new file, 795 B)
BIN recipes/icons/rtnews.png (new file, 606 B)
BIN recipes/icons/twitchfilms.png (new file, 200 B)
BIN recipes/icons/ubuntu_pl.png (new file, 508 B)
BIN recipes/icons/wnp.png (new file, 576 B)
@@ -4,16 +4,16 @@ from calibre.web.feeds.news import BasicNewsRecipe

class IDGse(BasicNewsRecipe):
title = 'IDG'
description = 'IDG.se'
language = 'se'
__author__ = 'zapt0'
language = 'sv'
description = 'IDG.se'
oldest_article = 1
max_articles_per_feed = 40
max_articles_per_feed = 256
no_stylesheets = True
encoding = 'ISO-8859-1'
remove_javascript = True

feeds = [(u'Senaste nytt',u'http://feeds.idg.se/idg/vzzs')]
feeds = [(u'Dagens IDG-nyheter',u'http://feeds.idg.se/idg/ETkj?format=xml')]

def print_version(self,url):
return url + '?articleRenderMode=print&m=print'
@@ -30,4 +30,3 @@ class IDGse(BasicNewsRecipe):
dict(name='div', attrs={'id':['preamble_ad']}),
dict(name='ul', attrs={'class':['share']})
]
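The IDGse recipe's `print_version` hook maps each article URL to its print rendering by appending a fixed query string. Standalone, the transformation is simply:

```python
def print_version(url):
    # Append the query string that asks IDG.se for the print rendering
    # of the article, exactly as the recipe's print_version does.
    return url + '?articleRenderMode=print&m=print'
```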