Merge from trunk
@@ -37,7 +37,9 @@ nbproject/
 calibre_plugins/
 recipes/.git
 recipes/.gitignore
-recipes/README
+recipes/README.md
+recipes/icon_checker.py
+recipes/readme_updater.py
 recipes/katalog_egazeciarz.recipe
 recipes/tv_axnscifi.recipe
 recipes/tv_comedycentral.recipe
@@ -60,6 +62,7 @@ recipes/tv_tvpkultura.recipe
 recipes/tv_tvppolonia.recipe
 recipes/tv_tvpuls.recipe
 recipes/tv_viasathistory.recipe
+recipes/icons/katalog_egazeciarz.png
 recipes/icons/tv_axnscifi.png
 recipes/icons/tv_comedycentral.png
 recipes/icons/tv_discoveryscience.png
Changelog.old.yaml
Changelog.yaml
@@ -434,6 +434,18 @@ a number of older formats either do not support a metadata based Table of Contents
 documents do not have one. In these cases, the options in this section can help you automatically
 generate a Table of Contents in the converted ebook, based on the actual content in the input document.
 
+.. note:: Using these options can be a little challenging to get exactly right.
+    If you prefer creating/editing the Table of Contents by hand, convert to
+    the EPUB or AZW3 formats and select the checkbox at the bottom of the
+    screen that says
+    :guilabel:`Manually fine-tune the Table of Contents after conversion`.
+    This will launch the ToC Editor tool after the conversion. It allows you to
+    create entries in the Table of Contents by simply clicking the place in the
+    book where you want the entry to point. You can also use the ToC Editor by
+    itself, without doing a conversion. Go to :guilabel:`Preferences->Toolbars`
+    and add the ToC Editor to the main toolbar. Then just select the book you
+    want to edit and click the ToC Editor button.
+
 The first option is :guilabel:`Force use of auto-generated Table of Contents`. By checking this option
 you can have |app| override any Table of Contents found in the metadata of the input document with the
 auto generated one.
@@ -456,7 +468,7 @@ For example, to remove all entries titled "Next" or "Previous" use::
 
     Next|Previous
 
-Finally, the :guilabel:`Level 1,2,3 TOC` options allow you to create a sophisticated multi-level Table of Contents.
+The :guilabel:`Level 1,2,3 TOC` options allow you to create a sophisticated multi-level Table of Contents.
 They are XPath expressions that match tags in the intermediate XHTML produced by the conversion pipeline. See the
 :ref:`conversion-introduction` for how to get access to this XHTML. Also read the :ref:`xpath-tutorial`, to learn
 how to construct XPath expressions. Next to each option is a button that launches a wizard to help with the creation
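[Illustration, not part of this changeset. For a book that marks chapters with
<h2> tags and sections with <h3> tags, the two wizard-generated XPath options
would look like::

    Level 1 TOC: //h:h2
    Level 2 TOC: //h:h3

Here ``h:`` is the namespace prefix calibre assigns to XHTML tags in the
intermediate files.]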
@@ -672,6 +684,7 @@ Some limitations of PDF input are:
 
 * Links and Tables of Contents are not supported
 * PDFs that use embedded non-unicode fonts to represent non-English characters will result in garbled output for those characters
 * Some PDFs are made up of photographs of the page with OCRed text behind them. In such cases |app| uses the OCRed text, which can be very different from what you see when you view the PDF file
+* PDFs that are used to display complex text, like right to left languages and math typesetting, will not convert correctly
 
 To re-iterate **PDF is a really, really bad** format to use as input. If you absolutely must use PDF, then be prepared for an
 output ranging anywhere from decent to unusable, depending on the input PDF.
@@ -87,7 +87,9 @@ this bug.
 
 How do I convert a collection of HTML files in a specific order?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-In order to convert a collection of HTML files in a specific order, you have to create a table of contents file. That is, another HTML file that contains links to all the other files in the desired order. Such a file looks like::
+In order to convert a collection of HTML files in a specific order, you have to
+create a table of contents file. That is, another HTML file that contains links
+to all the other files in the desired order. Such a file looks like::
 
     <html>
     <body>
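[Illustration, not part of this changeset. The hunks above and below show only
the opening and closing tags of the example; the linked file names in the
middle are elided by the diff. A complete table of contents file of this shape,
with hypothetical file names, would be::

    <html>
    <body>
        <h1>Table of Contents</h1>
        <p style="text-indent:0pt">
            <a href="file1.html">First File</a><br/>
            <a href="file2.html">Second File</a><br/>
        </p>
    </body>
    </html>

The linked files are converted in the order of the links.]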
@@ -102,19 +104,36 @@ In order to convert a collection of HTML files in a specific order, you have to create
     </body>
     </html>
 
-Then just add this HTML file to the GUI and use the convert button to create your ebook.
+Then, just add this HTML file to the GUI and use the convert button to create
+your ebook. You can use the option in the Table of Contents section in the
+conversion dialog to control how the Table of Contents is generated.
 
-.. note:: By default, when adding HTML files, |app| follows links in the files in *depth first* order. This means that if file A.html links to B.html and C.html and D.html, but B.html also links to D.html, then the files will be in the order A.html, B.html, D.html, C.html. If instead you want the order to be A.html, B.html, C.html, D.html then you must tell |app| to add your files in *breadth first* order. Do this by going to Preferences->Plugins and customizing the HTML to ZIP plugin.
+.. note:: By default, when adding HTML files, |app| follows links in the files
+    in *depth first* order. This means that if file A.html links to B.html and
+    C.html and D.html, but B.html also links to D.html, then the files will be
+    in the order A.html, B.html, D.html, C.html. If instead you want the order
+    to be A.html, B.html, C.html, D.html then you must tell |app| to add your
+    files in *breadth first* order. Do this by going to Preferences->Plugins
+    and customizing the HTML to ZIP plugin.
 
 The EPUB I produced with |app| is not valid?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-|app| does not guarantee that an EPUB produced by it is valid. The only guarantee it makes is that if you feed it valid XHTML 1.1 + CSS 2.1 it will output a valid EPUB. |app| is designed for ebook consumers, not producers. It tries hard to ensure that EPUBs it produces actually work as intended on a wide variety of devices, a goal that is incompatible with producing valid EPUBs, and one that is far more important to the vast majority of its users. If you need a tool that always produces valid EPUBs, |app| is not for you.
+|app| does not guarantee that an EPUB produced by it is valid. The only
+guarantee it makes is that if you feed it valid XHTML 1.1 + CSS 2.1 it will
+output a valid EPUB. |app| is designed for ebook consumers, not producers. It
+tries hard to ensure that EPUBs it produces actually work as intended on a wide
+variety of devices, a goal that is incompatible with producing valid EPUBs, and
+one that is far more important to the vast majority of its users. If you need a
+tool that always produces valid EPUBs, |app| is not for you.
 
 How do I use some of the advanced features of the conversion tools?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-You can get help on any individual feature of the converters by mousing over it in the GUI or running ``ebook-convert dummy.html .epub -h`` at a terminal. A good place to start is to look at the following demo files that demonstrate some of the advanced features:
-* `html-demo.zip <http://calibre-ebook.com/downloads/html-demo.zip>`_
+You can get help on any individual feature of the converters by mousing over
+it in the GUI or running ``ebook-convert dummy.html .epub -h`` at a terminal.
+A good place to start is to look at the following demo file that demonstrates
+some of the advanced features
+`html-demo.zip <http://calibre-ebook.com/downloads/html-demo.zip>`_
 
 
 Device Integration
@@ -126,11 +145,11 @@ Device Integration
 
 What devices does |app| support?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-|app| can directly connect to all the major (and most of the minor) ebook reading devices,
-smartphones, tablets, etc.
-In addition, using the :guilabel:`Connect to folder` function you can use it with any ebook reader that exports itself as a USB disk.
-You can even connect to Apple devices (via iTunes), using the :guilabel:`Connect to iTunes`
-function.
+|app| can directly connect to all the major (and most of the minor) ebook
+reading devices, smartphones, tablets, etc. In addition, using the
+:guilabel:`Connect to folder` function you can use it with any ebook reader
+that exports itself as a USB disk. You can even connect to Apple devices (via
+iTunes), using the :guilabel:`Connect to iTunes` function.
 
 .. _devsupport:
 
@@ -579,9 +598,23 @@ Yes, you can. Follow the instructions in the answer above for adding custom columns
 
 How do I move my |app| library from one computer to another?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Simply copy the |app| library folder from the old to the new computer. You can find out what the library folder is by clicking the calibre icon in the toolbar. The very first item is the path to the library folder. Now on the new computer, start |app| for the first time. It will run the Welcome Wizard asking you for the location of the |app| library. Point it to the previously copied folder. If the computer you are transferring to already has a calibre installation, then the Welcome wizard won't run. In that case, right-click the |app| icon in the toolbar and point it to the newly copied directory. You will now have two calibre libraries on your computer and you can switch between them by clicking the |app| icon on the toolbar. Transferring your library in this manner preserves all your metadata, tags, custom columns, etc.
+Simply copy the |app| library folder from the old to the new computer. You can
+find out what the library folder is by clicking the calibre icon in the
+toolbar. The very first item is the path to the library folder. Now on the new
+computer, start |app| for the first time. It will run the Welcome Wizard asking
+you for the location of the |app| library. Point it to the previously copied
+folder. If the computer you are transferring to already has a calibre
+installation, then the Welcome wizard won't run. In that case, right-click the
+|app| icon in the toolbar and point it to the newly copied directory. You will
+now have two |app| libraries on your computer and you can switch between them
+by clicking the |app| icon on the toolbar. Transferring your library in this
+manner preserves all your metadata, tags, custom columns, etc.
 
-Note that if you are transferring between different types of computers (for example Windows to OS X) then after doing the above you should also right-click the |app| icon on the tool bar, select Library Maintenance and run the Check Library action. It will warn you about any problems in your library, which you should fix by hand.
+Note that if you are transferring between different types of computers (for
+example Windows to OS X) then after doing the above you should also right-click
+the |app| icon on the tool bar, select Library Maintenance and run the Check
+Library action. It will warn you about any problems in your library, which you
+should fix by hand.
 
 .. note:: A |app| library is just a folder which contains all the book files and their metadata. All the metadata is stored in a single file called metadata.db, in the top level folder. If this file gets corrupted, you may see an empty list of books in |app|. In this case you can ask |app| to restore your books by doing a right-click on the |app| icon in the toolbar and selecting Library Maintenance->Restore Library.
 
@@ -531,6 +531,8 @@ Calibre has several keyboard shortcuts to save you time and mouse movement. These
       - Get Books
     * - :kbd:`I`
       - Show book details
+    * - :kbd:`K`
+      - Edit Table of Contents
     * - :kbd:`M`
      - Merge selected records
     * - :kbd:`Alt+M`
@@ -3,7 +3,7 @@ import re
 class Adventure_zone(BasicNewsRecipe):
     title = u'Adventure Zone'
     __author__ = 'fenuks'
-    description = u'Adventure zone - adventure games from A to Z'
+    description = u'Czytaj więcej o przygodzie - codzienne nowinki. Szukaj u nas solucji i poradników, czytaj recenzje i zapowiedzi. Także galeria, pliki oraz forum dla wszystkich fanów gier przygodowych.'
     category = 'games'
     language = 'pl'
     no_stylesheets = True
@@ -18,38 +18,27 @@ class Adventure_zone(BasicNewsRecipe):
     remove_tags_before = dict(name='td', attrs={'class':'main-bg'})
     remove_tags = [dict(name='img', attrs={'alt':'Drukuj'})]
     remove_tags_after = dict(id='comments')
-    extra_css = '.main-bg{text-align: left;} td.capmain{ font-size: 22px; }'
+    extra_css = '.main-bg{text-align: left;} td.capmain{ font-size: 22px; } img.news-category {float: left; margin-right: 5px;}'
     feeds = [(u'Nowinki', u'http://www.adventure-zone.info/fusion/feeds/news.php')]
 
-    '''def parse_feeds (self):
-        feeds = BasicNewsRecipe.parse_feeds(self)
-        soup=self.index_to_soup(u'http://www.adventure-zone.info/fusion/feeds/news.php')
-        tag=soup.find(name='channel')
-        titles=[]
-        for r in tag.findAll(name='image'):
-            r.extract()
-        art=tag.findAll(name='item')
-        for i in art:
-            titles.append(i.title.string)
-        for feed in feeds:
-            for article in feed.articles[:]:
-                article.title=titles[feed.articles.index(article)]
-        return feeds'''
-
     '''def get_cover_url(self):
         soup = self.index_to_soup('http://www.adventure-zone.info/fusion/news.php')
         cover=soup.find(id='box_OstatninumerAZ')
         self.cover_url='http://www.adventure-zone.info/fusion/'+ cover.center.a.img['src']
         return getattr(self, 'cover_url', self.cover_url)'''
 
     def populate_article_metadata(self, article, soup, first):
         result = re.search('(.+) - Adventure Zone', soup.title.string)
         if result:
-            article.title = result.group(1)
+            result = result.group(1)
         else:
             result = soup.body.find('strong')
             if result:
-                article.title = result.string
+                result = result.string
+        if result:
+            result = result.replace('&amp;', '&')
+            result = result.replace('&#39;', '’')
+            article.title = result
 
     def skip_ad_pages(self, soup):
         skip_tag = soup.body.find(name='td', attrs={'class':'main-bg'})
@@ -78,4 +67,3 @@ class Adventure_zone(BasicNewsRecipe):
             a['href']=self.index + a['href']
         return soup
 
-
@@ -24,4 +24,3 @@ class app_funds(BasicNewsRecipe):
     auto_cleanup = True
 
     feeds = [(u'blog', u'http://feeds.feedburner.com/blogspot/etVI')]
-
@@ -1,10 +1,11 @@
 from calibre.web.feeds.news import BasicNewsRecipe
 
 class Archeowiesci(BasicNewsRecipe):
-    title = u'Archeowiesci'
+    title = u'Archeowieści'
     __author__ = 'fenuks'
     category = 'archeology'
     language = 'pl'
+    description = u'Z pasją o przeszłości'
     cover_url='http://archeowiesci.pl/wp-content/uploads/2011/05/Archeowiesci2-115x115.jpg'
     oldest_article = 7
     needs_subscription='optional'
recipes/arret_sur_images.recipe (new file, 54 lines)
@@ -0,0 +1,54 @@
+from __future__ import unicode_literals
+
+__license__ = 'WTFPL'
+__author__ = '2013, François D. <franek at chicour.net>'
+__description__ = 'Get some fresh news from Arrêt sur images'
+
+
+from calibre.web.feeds.recipes import BasicNewsRecipe
+
+class Asi(BasicNewsRecipe):
+
+    title = 'Arrêt sur images'
+    __author__ = 'François D. (aka franek)'
+    description = 'Global news in french from news site "Arrêt sur images"'
+
+    oldest_article = 7.0
+    language = 'fr'
+    needs_subscription = True
+    max_articles_per_feed = 100
+
+    simultaneous_downloads = 1
+    timefmt = '[%a, %d %b %Y %I:%M +0200]'
+    cover_url = 'http://www.arretsurimages.net/images/header/menu/menu_1.png'
+
+    use_embedded_content = False
+    no_stylesheets = True
+    remove_javascript = True
+
+    feeds = [
+        ('vite dit et gratuit', 'http://www.arretsurimages.net/vite-dit.rss'),
+        ('Toutes les chroniques', 'http://www.arretsurimages.net/chroniques.rss'),
+        ('Contenus et dossiers', 'http://www.arretsurimages.net/dossiers.rss'),
+    ]
+
+    conversion_options = { 'smarten_punctuation' : True }
+
+    remove_tags = [dict(id='vite-titre'), dict(id='header'), dict(id='wrap-connexion'), dict(id='col_right'), dict(name='div', attrs={'class':'bloc-chroniqueur-2'}), dict(id='footercontainer')]
+
+    def print_version(self, url):
+        return url.replace('contenu.php', 'contenu-imprimable.php')
+
+    def get_browser(self):
+        # Need to use robust HTML parser
+        br = BasicNewsRecipe.get_browser(self, use_robust_parser=True)
+        if self.username is not None and self.password is not None:
+            br.open('http://www.arretsurimages.net/index.php')
+            br.select_form(nr=0)
+            br.form.set_all_readonly(False)
+            br['redir'] = 'forum/login.php'
+            br['username'] = self.username
+            br['password'] = self.password
+            br.submit()
+        return br
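[Illustration, not part of this changeset. The get_browser override above is
the standard calibre idiom for recipes that need a login: fill in the site's
login form once, and every subsequent download runs in the authenticated
session. The generic shape, with a hypothetical login URL, is::

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('http://example.com/login')  # hypothetical login page
            br.select_form(nr=0)                 # first form on the page
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

Setting needs_subscription = True makes the GUI prompt for the credentials.]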
@@ -2,7 +2,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
 class AstroNEWS(BasicNewsRecipe):
     title = u'AstroNEWS'
     __author__ = 'fenuks'
-    description = 'AstroNEWS- astronomy every day'
+    description = u'AstroNEWS regularnie dostarcza wiadomości o wydarzeniach związanych z astronomią i astronautyką. Informujemy o aktualnych odkryciach i wydarzeniach naukowych, zapowiadamy ciekawe zjawiska astronomiczne. Serwis jest częścią portalu astronomicznego AstroNET prowadzonego przez miłośników astronomii i zawodowych astronomów.'
     category = 'astronomy, science'
     language = 'pl'
     oldest_article = 8
@@ -13,7 +13,15 @@ class Astroflesz(BasicNewsRecipe):
     max_articles_per_feed = 100
     no_stylesheets = True
     use_embedded_content = False
+    remove_attributes = ['style']
     keep_only_tags = [dict(id="k2Container")]
     remove_tags_after = dict(name='div', attrs={'class':'itemLinks'})
     remove_tags = [dict(name='div', attrs={'class':['itemLinks', 'itemToolbar', 'itemRatingBlock']})]
     feeds = [(u'Wszystkie', u'http://astroflesz.pl/?format=feed')]
+
+    def postprocess_html(self, soup, first_fetch):
+        t = soup.find(attrs={'class':'itemIntroText'})
+        if t:
+            for i in t.findAll('img'):
+                i['style'] = 'float: left; margin-right: 5px;'
+        return soup
@@ -3,7 +3,7 @@ import re
 class Astronomia_pl(BasicNewsRecipe):
     title = u'Astronomia.pl'
     __author__ = 'fenuks'
-    description = 'Astronomia - polish astronomy site'
+    description = u'Astronomia.pl jest edukacyjnym portalem skierowanym do uczniów, studentów i miłośników astronomii. Przedstawiamy gwiazdy, planety, galaktyki, czarne dziury i wiele innych tajemnic Wszechświata.'
     masthead_url = 'http://www.astronomia.pl/grafika/logo.gif'
     cover_url = 'http://www.astronomia.pl/grafika/logo.gif'
     category = 'astronomy, science'
recipes/bachormagazyn.recipe (new file, 43 lines)
@@ -0,0 +1,43 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+__license__ = 'GPL v3'
+__copyright__ = u'Łukasz Grąbczewski 2013'
+__version__ = '1.0'
+
+'''
+bachormagazyn.pl
+'''
+
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class bachormagazyn(BasicNewsRecipe):
+    __author__ = u'Łukasz Grączewski'
+    title = u'Bachor Magazyn'
+    description = u'Alternatywny magazyn o alternatywach rodzicielstwa'
+    language = 'pl'
+    publisher = 'Bachor Mag.'
+    publication_type = 'magazine'
+    masthead_url = 'http://bachormagazyn.pl/wp-content/uploads/2011/10/bachor_header1.gif'
+    no_stylesheets = True
+    remove_javascript = True
+    use_embedded_content = False
+    remove_empty_feeds = True
+
+    oldest_article = 32 #monthly +1
+    max_articles_per_feed = 100
+
+    feeds = [
+        (u'Bezradnik dla nieudacznych rodziców', u'http://bachormagazyn.pl/feed/')
+    ]
+
+    keep_only_tags = []
+    keep_only_tags.append(dict(name = 'div', attrs = {'id' : 'content'}))
+
+    remove_tags = []
+    remove_tags.append(dict(attrs = {'id' : 'nav-above'}))
+    remove_tags.append(dict(attrs = {'id' : 'nav-below'}))
+    remove_tags.append(dict(attrs = {'id' : 'comments'}))
+    remove_tags.append(dict(attrs = {'class' : 'entry-info'}))
+    remove_tags.append(dict(attrs = {'class' : 'comments-link'}))
+    remove_tags.append(dict(attrs = {'class' : 'sharedaddy sd-sharing-enabled'}))
@@ -1,4 +1,5 @@
 from calibre.web.feeds.news import BasicNewsRecipe
+import re
 class BadaniaNet(BasicNewsRecipe):
     title = u'badania.net'
     __author__ = 'fenuks'
@@ -6,9 +7,11 @@ class BadaniaNet(BasicNewsRecipe):
     category = 'science'
     language = 'pl'
     cover_url = 'http://badania.net/wp-content/badanianet_green_transparent.png'
+    extra_css = '.alignleft {float:left; margin-right:5px;} .alignright {float:right; margin-left:5px;}'
     oldest_article = 7
     max_articles_per_feed = 100
     no_stylesheets = True
+    preprocess_regexps = [(re.compile(r"<h4>Tekst sponsoruje</h4>", re.IGNORECASE), lambda m: ''),]
     remove_empty_feeds = True
     use_embedded_content = False
     remove_tags = [dict(attrs={'class':['omc-flex-category', 'omc-comment-count', 'omc-single-tags']})]
@@ -47,4 +47,3 @@ class bankier(BasicNewsRecipe):
         segments = urlPart.split('-')
         urlPart2 = segments[-1]
         return 'http://www.bankier.pl/wiadomosci/print.html?article_id=' + urlPart2
-
@@ -3,7 +3,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
 class Bash_org_pl(BasicNewsRecipe):
     title = u'Bash.org.pl'
     __author__ = 'fenuks'
-    description = 'Bash.org.pl - funny quotations from IRC discussions'
+    description = 'Bash.org.pl - zabawne cytaty z IRC'
     category = 'funny quotations, humour'
     language = 'pl'
     cover_url = u'http://userlogos.org/files/logos/dzikiosiol/none_0.png'
@@ -1,49 +1,58 @@
 from calibre.web.feeds.news import BasicNewsRecipe
 import re
+from calibre.ebooks.BeautifulSoup import Comment
+
 class BenchmarkPl(BasicNewsRecipe):
     title = u'Benchmark.pl'
     __author__ = 'fenuks'
-    description = u'benchmark.pl -IT site'
+    description = u'benchmark.pl, recenzje i testy sprzętu, aktualności, rankingi, sterowniki, porady, opinie'
     masthead_url = 'http://www.benchmark.pl/i/logo-footer.png'
-    cover_url = 'http://www.ieaddons.pl/benchmark/logo_benchmark_new.gif'
+    cover_url = 'http://www.benchmark.pl/i/logo-dark.png'
     category = 'IT'
     language = 'pl'
     oldest_article = 8
     max_articles_per_feed = 100
     no_stylesheets = True
+    remove_attributes = ['style']
     preprocess_regexps = [(re.compile(ur'<h3><span style="font-size: small;"> Zobacz poprzednie <a href="http://www.benchmark.pl/news/zestawienie/grupa_id/135">Opinie dnia:</a></span>.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</body>'), (re.compile(ur'Więcej o .*?</ul>', re.DOTALL|re.IGNORECASE), lambda match: '')]
     keep_only_tags = [dict(name='div', attrs={'class':['m_zwykly', 'gallery']}), dict(id='article')]
-    remove_tags_after=dict(name='div', attrs={'class':'body'})
-    remove_tags=[dict(name='div', attrs={'class':['kategoria', 'socialize', 'thumb', 'panelOcenaObserwowane', 'categoryNextToSocializeGallery', 'breadcrumb', 'footer', 'moreTopics']}), dict(name='table', attrs={'background':'http://www.benchmark.pl/uploads/backend_img/a/fotki_newsy/opinie_dnia/bg.png'}), dict(name='table', attrs={'width':'210', 'cellspacing':'1', 'cellpadding':'4', 'border':'0', 'align':'right'})]
+    remove_tags_after = dict(id='article')
+    remove_tags = [dict(name='div', attrs={'class':['comments', 'body', 'kategoria', 'socialize', 'thumb', 'panelOcenaObserwowane', 'categoryNextToSocializeGallery', 'breadcrumb', 'footer', 'moreTopics']}), dict(name='table', attrs = {'background':'http://www.benchmark.pl/uploads/backend_img/a/fotki_newsy/opinie_dnia/bg.png'}), dict(name='table', attrs={'width':'210', 'cellspacing':'1', 'cellpadding':'4', 'border':'0', 'align':'right'})]
     INDEX = 'http://www.benchmark.pl'
     feeds = [(u'Aktualności', u'http://www.benchmark.pl/rss/aktualnosci-pliki.xml'),
              (u'Testy i recenzje', u'http://www.benchmark.pl/rss/testy-recenzje-minirecenzje.xml')]
 
 
     def append_page(self, soup, appendtag):
-        nexturl = soup.find('span', attrs={'class':'next'})
-        while nexturl is not None:
-            nexturl= self.INDEX + nexturl.parent['href']
-            soup2 = self.index_to_soup(nexturl)
-            nexturl=soup2.find('span', attrs={'class':'next'})
+        nexturl = soup.find(attrs={'class':'next'})
+        while nexturl:
+            soup2 = self.index_to_soup(nexturl['href'])
+            nexturl = soup2.find(attrs={'class':'next'})
             pagetext = soup2.find(name='div', attrs={'class':'body'})
-            appendtag.find('div', attrs={'class':'k_ster'}).extract()
+            tag = appendtag.find('div', attrs={'class':'k_ster'})
+            if tag:
+                tag.extract()
+            comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
+            for comment in comments:
+                comment.extract()
             pos = len(appendtag.contents)
             appendtag.insert(pos, pagetext)
-        if appendtag.find('div', attrs={'class':'k_ster'}) is not None:
+        if appendtag.find('div', attrs={'class':'k_ster'}):
             appendtag.find('div', attrs={'class':'k_ster'}).extract()
+        for r in appendtag.findAll(attrs={'class':'changePage'}):
+            r.extract()
 
     def image_article(self, soup, appendtag):
         nexturl = soup.find('div', attrs={'class':'preview'})
-        if nexturl is not None:
+        if nexturl:
             nexturl = nexturl.find('a', attrs={'class':'move_next'})
             image = appendtag.find('div', attrs={'class':'preview'}).div['style'][16:]
             image = self.INDEX + image[:image.find("')")]
             appendtag.find(attrs={'class':'preview'}).name='img'
             appendtag.find(attrs={'class':'preview'})['src']=image
             appendtag.find('a', attrs={'class':'move_next'}).extract()
-            while nexturl is not None:
+            while nexturl:
                 nexturl = self.INDEX + nexturl['href']
                 soup2 = self.index_to_soup(nexturl)
                 nexturl = soup2.find('a', attrs={'class':'move_next'})
@@ -55,20 +64,24 @@ class BenchmarkPl(BasicNewsRecipe):
             pagetext.find('div', attrs={'class':'title'}).extract()
             pagetext.find('div', attrs={'class':'thumb'}).extract()
             pagetext.find('div', attrs={'class':'panelOcenaObserwowane'}).extract()
-            if nexturl is not None:
+            if nexturl:
                 pagetext.find('a', attrs={'class':'move_next'}).extract()
                 pagetext.find('a', attrs={'class':'move_back'}).extract()
+            comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
+            for comment in comments:
+                comment.extract()
             pos = len(appendtag.contents)
             appendtag.insert(pos, pagetext)
 
 
     def preprocess_html(self, soup):
-        if soup.find('div', attrs={'class':'preview'}) is not None:
+        if soup.find('div', attrs={'class':'preview'}):
             self.image_article(soup, soup.body)
         else:
             self.append_page(soup, soup.body)
         for a in soup('a'):
-            if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
+            if a.has_key('href') and not a['href'].startswith('http'):
                 a['href'] = self.INDEX + a['href']
+        for r in soup.findAll(attrs={'class':['comments', 'body']}):
+            r.extract()
         return soup
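[Illustration, not part of this changeset. The Comment-stripping lines added
above are the standard BeautifulSoup idiom in recipes: HTML comments are text
nodes of type Comment, so they survive tag-based cleanup and can leak markup
into the output. Given any soup or tag ``fragment``::

    from calibre.ebooks.BeautifulSoup import Comment

    comments = fragment.findAll(text=lambda text: isinstance(text, Comment))
    for comment in comments:
        comment.extract()

The same three lines recur in several recipes touched by this merge.]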
recipes/biweekly.recipe (new file, 55 lines)
@@ -0,0 +1,55 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+__license__ = 'GPL v3'
+__copyright__ = u'Łukasz Grąbczewski 2011'
+__version__ = '2.0'
+
+import re, os
+from calibre import walk
+from calibre.utils.zipfile import ZipFile
+from calibre.ptempfile import PersistentTemporaryFile
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class biweekly(BasicNewsRecipe):
+    __author__ = u'Łukasz Grąbczewski'
+    title = 'Biweekly'
+    language = 'en_PL'
+    publisher = 'National Audiovisual Institute'
+    publication_type = 'magazine'
+    description = u'link with culture [English edition of Polish magazine]: literature, theatre, film, art, music, views, talks'
+
+    conversion_options = {
+        'authors' : 'Biweekly.pl'
+        ,'publisher' : publisher
+        ,'language' : language
+        ,'comments' : description
+        ,'no_default_epub_cover' : True
+        ,'preserve_cover_aspect_ratio': True
+    }
+
+    def build_index(self):
+        browser = self.get_browser()
+        browser.open('http://www.biweekly.pl/')
+
+        # find the link
+        epublink = browser.find_link(text_regex=re.compile('ePUB VERSION'))
+
+        # download ebook
+        self.report_progress(0,_('Downloading ePUB'))
+        response = browser.follow_link(epublink)
+        book_file = PersistentTemporaryFile(suffix='.epub')
+        book_file.write(response.read())
+        book_file.close()
+
+        # convert
+        self.report_progress(0.2,_('Converting to OEB'))
+        oeb = self.output_dir + '/INPUT/'
+        if not os.path.exists(oeb):
+            os.makedirs(oeb)
+        with ZipFile(book_file.name) as f:
+            f.extractall(path=oeb)
+
+        for f in walk(oeb):
+            if f.endswith('.opf'):
+                return f
recipes/blog_biszopa.recipe (new file, 30 lines)
@@ -0,0 +1,30 @@
+__license__ = 'GPL v3'
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class BlogBiszopa(BasicNewsRecipe):
+    title = u'Blog Biszopa'
+    __author__ = 'fenuks'
+    description = u'Zapiski z Granitowego Miasta'
+    category = 'history'
+    #publication_type = ''
+    language = 'pl'
+    #encoding = ''
+    #extra_css = ''
+    cover_url = 'http://blogbiszopa.pl/wp-content/themes/biszop/images/logo.png'
+    masthead_url = ''
+    use_embedded_content = False
+    oldest_article = 7
+    max_articles_per_feed = 100
+    no_stylesheets = True
+    remove_empty_feeds = True
+    remove_javascript = True
+    remove_attributes = ['style', 'font']
+    ignore_duplicate_articles = {'title', 'url'}
+
+    keep_only_tags = [dict(id='main-content')]
+    remove_tags = [dict(name='footer')]
+    #remove_tags_after = {}
+    #remove_tags_before = {}
+
+    feeds = [(u'Artyku\u0142y', u'http://blogbiszopa.pl/feed/')]
@@ -25,6 +25,7 @@ class BusinessWeekMagazine(BasicNewsRecipe):
 
         #Find date
         mag=soup.find('h2',text='Magazine')
+        self.log(mag)
         dates=self.tag_to_string(mag.findNext('h3'))
         self.timefmt = u' [%s]'%dates
 
@@ -32,7 +33,7 @@ class BusinessWeekMagazine(BasicNewsRecipe):
         div0 = soup.find ('div', attrs={'class':'column left'})
         section_title = ''
         feeds = OrderedDict()
-        for div in div0.findAll('h4'):
+        for div in div0.findAll(['h4','h5']):
             articles = []
             section_title = self.tag_to_string(div.findPrevious('h3')).strip()
             title=self.tag_to_string(div.a).strip()
@@ -48,7 +49,7 @@ class BusinessWeekMagazine(BasicNewsRecipe):
             feeds[section_title] += articles
         div1 = soup.find ('div', attrs={'class':'column center'})
         section_title = ''
-        for div in div1.findAll('h5'):
+        for div in div1.findAll(['h4','h5']):
             articles = []
             desc=self.tag_to_string(div.findNext('p')).strip()
             section_title = self.tag_to_string(div.findPrevious('h3')).strip()
@@ -3,7 +3,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
 class CD_Action(BasicNewsRecipe):
     title = u'CD-Action'
     __author__ = 'fenuks'
-    description = 'cdaction.pl - polish games magazine site'
+    description = u'Strona CD-Action (CDA), największego w Polsce pisma dla graczy. Pełne wersje gier, newsy, recenzje, zapowiedzi, konkursy, forum, opinie, galerie screenów, trailery, filmiki, patche, teksty. Gry komputerowe (PC) oraz na konsole (PS3, XBOX 360).'
     category = 'games'
     language = 'pl'
     index='http://www.cdaction.pl'
@@ -1,5 +1,6 @@
 from calibre.web.feeds.news import BasicNewsRecipe
 import re
+
 class Ciekawostki_Historyczne(BasicNewsRecipe):
     title = u'Ciekawostki Historyczne'
     oldest_article = 7
@@ -10,39 +11,28 @@ class Ciekawostki_Historyczne(BasicNewsRecipe):
     masthead_url = 'http://ciekawostkihistoryczne.pl/wp-content/themes/Wordpress_Magazine/images/logo-ciekawostki-historyczne-male.jpg'
     cover_url = 'http://ciekawostkihistoryczne.pl/wp-content/themes/Wordpress_Magazine/images/logo-ciekawostki-historyczne-male.jpg'
     max_articles_per_feed = 100
+    extra_css = 'img.alignleft {float:left; margin-right:5px;} .alignright {float:right; margin-left:5px;}'
+    oldest_article = 12
     preprocess_regexps = [(re.compile(ur'Ten artykuł ma kilka stron.*?</fb:like>', re.DOTALL), lambda match: ''), (re.compile(ur'<h2>Zobacz też:</h2>.*?</ol>', re.DOTALL), lambda match: '')]
     no_stylesheets = True
     remove_empty_feeds = True
     keep_only_tags = [dict(name='div', attrs={'class':'post'})]
+    recursions = 5
     remove_tags = [dict(id='singlepostinfo')]
 
     feeds = [(u'Staro\u017cytno\u015b\u0107', u'http://ciekawostkihistoryczne.pl/tag/starozytnosc/feed/'), (u'\u015aredniowiecze', u'http://ciekawostkihistoryczne.pl/tag/sredniowiecze/feed/'), (u'Nowo\u017cytno\u015b\u0107', u'http://ciekawostkihistoryczne.pl/tag/nowozytnosc/feed/'), (u'XIX wiek', u'http://ciekawostkihistoryczne.pl/tag/xix-wiek/feed/'), (u'1914-1939', u'http://ciekawostkihistoryczne.pl/tag/1914-1939/feed/'), (u'1939-1945', u'http://ciekawostkihistoryczne.pl/tag/1939-1945/feed/'), (u'Powojnie (od 1945)', u'http://ciekawostkihistoryczne.pl/tag/powojnie/feed/'), (u'Recenzje', u'http://ciekawostkihistoryczne.pl/category/recenzje/feed/')]
 
-    def append_page(self, soup, appendtag):
-        tag=soup.find(name='h7')
-        if tag:
-            if tag.br:
-                pass
-            elif tag.nextSibling.name=='p':
-                tag=tag.nextSibling
-        nexturl = tag.findAll('a')
-        for nextpage in nexturl:
-            tag.extract()
-            nextpage= nextpage['href']
-            soup2 = self.index_to_soup(nextpage)
-            pagetext = soup2.find(name='div', attrs={'class':'post'})
-            for r in pagetext.findAll('div', attrs={'id':'singlepostinfo'}):
-                r.extract()
-            for r in pagetext.findAll('div', attrs={'class':'wp-caption alignright'}):
-                r.extract()
-            for r in pagetext.findAll('h1'):
-                r.extract()
-            pagetext.find('h6').nextSibling.extract()
-            pagetext.find('h7').nextSibling.extract()
-            pos = len(appendtag.contents)
-            appendtag.insert(pos, pagetext)
+    def is_link_wanted(self, url, tag):
+        return 'ciekawostkihistoryczne' in url and url[-2] in {'2', '3', '4', '5', '6'}
 
-    def preprocess_html(self, soup):
-        self.append_page(soup, soup.body)
+    def postprocess_html(self, soup, first_fetch):
+        tag = soup.find('h7')
+        if tag:
+            tag.nextSibling.extract()
+        if not first_fetch:
+            for r in soup.findAll(['h1']):
+                r.extract()
+            soup.find('h6').nextSibling.extract()
         return soup
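[Illustration, not part of this changeset. The rewrite above replaces a
hand-rolled append_page helper with calibre's built-in link recursion: with
``recursions = 5`` the downloader follows in-article links itself,
``is_link_wanted`` decides which links count as continuation pages, and
``postprocess_html`` runs once per fetched page, with ``first_fetch`` marking
the first one. A minimal sketch of the same pattern, for a hypothetical site::

    class MultiPageRecipe(BasicNewsRecipe):
        recursions = 5  # follow links found inside articles, up to 5 levels

        def is_link_wanted(self, url, tag):
            # accept only URLs that look like pages 2-6 of an article
            return 'example.com' in url and url[-2] in {'2', '3', '4', '5', '6'}

        def postprocess_html(self, soup, first_fetch):
            if not first_fetch:
                # drop the repeated article title on continuation pages
                for r in soup.findAll('h1'):
                    r.extract()
            return soup

The followed pages are downloaded and stitched into the article automatically.]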
recipes/computer_woche.recipe (new file, 66 lines)
@@ -0,0 +1,66 @@
+__license__ = 'GPL v3'
+__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
+'''
+Fetch Computerwoche.
+'''
+
+from calibre.web.feeds.news import BasicNewsRecipe
+
+
+class Computerwoche(BasicNewsRecipe):
+
+    title = 'Computerwoche'
+    description = 'german computer newspaper'
+    language = 'de'
+    __author__ = 'Maria Seliger'
+    use_embedded_content = False
+    timefmt = ' [%d %b %Y]'
+    max_articles_per_feed = 15
+    linearize_tables = True
+    no_stylesheets = True
+    remove_stylesheets = True
+    remove_javascript = True
+    encoding = 'utf-8'
+    html2epub_options = 'base_font_size=10'
+    summary_length = 100
+    auto_cleanup = True
+
+    extra_css = '''
+        h2{font-family:Arial,Helvetica,sans-serif; font-size: x-small; color: #003399;}
+        a{font-family:Arial,Helvetica,sans-serif; font-size: x-small; font-style:italic;}
+        .dachzeile p{font-family:Arial,Helvetica,sans-serif; font-size: x-small; }
+        h1{ font-family:Arial,Helvetica,sans-serif; font-size:x-large; font-weight:bold;}
+        .artikelTeaser{font-family:Arial,Helvetica,sans-serif; font-size: x-small; font-weight:bold; }
+        body{font-family:Arial,Helvetica,sans-serif; }
+        .photo {font-family:Arial,Helvetica,sans-serif; font-size: x-small; color: #666666;} '''
+
+    feeds = [ ('Computerwoche', 'http://rss.feedsportal.com/c/312/f/4414/index.rss'),
+        ('IDG Events', 'http://rss.feedsportal.com/c/401/f/7544/index.rss'),
+        ('Computerwoche Jobs und Karriere', 'http://rss.feedsportal.com/c/312/f/434082/index.rss'),
+        ('Computerwoche BI und ECM', 'http://rss.feedsportal.com/c/312/f/434083/index.rss'),
+        ('Computerwoche Cloud Computing', 'http://rss.feedsportal.com/c/312/f/534647/index.rss'),
+        ('Computerwoche Compliance und Recht', 'http://rss.feedsportal.com/c/312/f/434084/index.rss'),
+        ('Computerwoche CRM', 'http://rss.feedsportal.com/c/312/f/434085/index.rss'),
+        ('Computerwoche Data Center und Server', 'http://rss.feedsportal.com/c/312/f/434086/index.rss'),
+        ('Computerwoche ERP', 'http://rss.feedsportal.com/c/312/f/434087/index.rss'),
+        ('Computerwoche IT Macher', 'http://rss.feedsportal.com/c/312/f/534646/index.rss'),
+        ('Computerwoche IT-Services', 'http://rss.feedsportal.com/c/312/f/434089/index.rss'),
+        ('Computerwoche IT-Strategie', 'http://rss.feedsportal.com/c/312/f/434090/index.rss'),
+        ('Computerwoche Mittelstands-IT', 'http://rss.feedsportal.com/c/312/f/434091/index.rss'),
+        ('Computerwoche Mobile und Wireless', 'http://rss.feedsportal.com/c/312/f/434092/index.rss'),
+        ('Computerwoche Netzwerk', 'http://rss.feedsportal.com/c/312/f/434093/index.rss'),
+        ('Computerwoche Notebook und PC', 'http://rss.feedsportal.com/c/312/f/434094/index.rss'),
+        ('Computerwoche Office und Tools', 'http://rss.feedsportal.com/c/312/f/434095/index.rss'),
+        ('Computerwoche Security', 'http://rss.feedsportal.com/c/312/f/434098/index.rss'),
+        ('Computerwoche SOA und BPM', 'http://rss.feedsportal.com/c/312/f/434099/index.rss'),
+        ('Computerwoche Software Infrastruktur', 'http://rss.feedsportal.com/c/312/f/434096/index.rss'),
+        ('Computerwoche Storage', 'http://rss.feedsportal.com/c/312/f/534645/index.rss'),
+        ('Computerwoche VoIP und TK', 'http://rss.feedsportal.com/c/312/f/434102/index.rss'),
+        ('Computerwoche Web', 'http://rss.feedsportal.com/c/312/f/434103/index.rss'),
+        ('Computerwoche Home-IT', 'http://rss.feedsportal.com/c/312/f/434104/index.rss')]
+
+
+    def print_version(self, url):
+        return url.replace ('/a/', '/a/print/')
@@ -1,5 +1,5 @@
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
+import re
 from calibre.web.feeds.news import BasicNewsRecipe
 class Computerworld_pl(BasicNewsRecipe):
     title = u'Computerworld.pl'
@@ -8,16 +8,20 @@ class Computerworld_pl(BasicNewsRecipe):
     category = 'IT'
     language = 'pl'
     masthead_url = 'http://g1.computerworld.pl/cw/beta_gfx/cw2.gif'
+    cover_url = 'http://g1.computerworld.pl/cw/beta_gfx/cw2.gif'
     no_stylesheets = True
     oldest_article = 7
     max_articles_per_feed = 100
-    keep_only_tags=[dict(attrs={'class':['tyt_news', 'prawo', 'autor', 'tresc']})]
-    remove_tags_after=dict(name='div', attrs={'class':'rMobi'})
-    remove_tags=[dict(name='div', attrs={'class':['nnav', 'rMobi']}), dict(name='table', attrs={'class':'ramka_slx'})]
+    remove_attributes = ['style',]
+    preprocess_regexps = [(re.compile(u'Zobacz również:', re.IGNORECASE), lambda m: ''), (re.compile(ur'[*]+reklama[*]+', re.IGNORECASE), lambda m: ''),]
+    keep_only_tags = [dict(id=['szpaltaL', 's2011'])]
+    remove_tags_after = dict(name='div', attrs={'class':'tresc'})
+    remove_tags = [dict(attrs={'class':['nnav', 'rMobi', 'tagi', 'rec']}),]
     feeds = [(u'Wiadomo\u015bci', u'http://rssout.idg.pl/cw/news_iso.xml')]
 
-    def get_cover_url(self):
-        soup = self.index_to_soup('http://www.computerworld.pl/')
-        cover=soup.find(name='img', attrs={'class':'prawo'})
-        self.cover_url=cover['src']
-        return getattr(self, 'cover_url', self.cover_url)
+    def skip_ad_pages(self, soup):
+        if soup.title.string.lower() == 'advertisement':
+            tag = soup.find(name='a')
+            if tag:
+                new_soup = self.index_to_soup(tag['href'], raw=True)
+                return new_soup
@@ -1,14 +1,17 @@
 from calibre.web.feeds.news import BasicNewsRecipe
-from calibre.ebooks.BeautifulSoup import BeautifulSoup
+from calibre.ebooks.BeautifulSoup import BeautifulSoup, Comment

 class CoNowegoPl(BasicNewsRecipe):
     title = u'conowego.pl'
     __author__ = 'fenuks'
     description = u'Nowy wortal technologiczny oraz gazeta internetowa. Testy najnowszych produktów, fachowe porady i recenzje. U nas znajdziesz wszystko o elektronice użytkowej !'
-    cover_url = 'http://www.conowego.pl/fileadmin/templates/main/images/logo_top.png'
+    #cover_url = 'http://www.conowego.pl/fileadmin/templates/main/images/logo_top.png'
     category = 'IT, news'
     language = 'pl'
     oldest_article = 7
     max_articles_per_feed = 100
+    INDEX = 'http://www.conowego.pl/'
+    extra_css = '.news-single-img {float:left; margin-right:5px;}'
     no_stylesheets = True
     remove_empty_feeds = True
     use_embedded_content = False
@@ -34,5 +37,15 @@ class CoNowegoPl(BasicNewsRecipe):
             pos = len(appendtag.contents)
             appendtag.insert(pos, pagetext)

+        comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+        for comment in comments:
+            comment.extract()
         for r in appendtag.findAll(attrs={'class':['pages', 'paginationWrap']}):
             r.extract()

+    def get_cover_url(self):
+        soup = self.index_to_soup('http://www.conowego.pl/magazyn/')
+        tag = soup.find(attrs={'class':'ms_left'})
+        if tag:
+            self.cover_url = self.INDEX + tag.find('img')['src']
+        return getattr(self, 'cover_url', self.cover_url)
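The Comment import added above serves an idiom that recurs throughout this merge (conowego.pl here; dziennik.pl, dzieje.pl, dziennik_wschodni, echo_dnia and both Esensja recipes below): when extra pages of a multi-page article are fetched and appended, any HTML comments in the fragment would be carried into the ebook, so they are located as text nodes and removed first. The idiom in isolation (appendtag stands for whatever tag the page content was appended to):

    # Strip HTML comment nodes from a BeautifulSoup fragment.
    from calibre.ebooks.BeautifulSoup import Comment

    comments = appendtag.findAll(text=lambda text: isinstance(text, Comment))
    for comment in comments:
        comment.extract()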
@@ -1,4 +1,5 @@
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
+import re
 from calibre.web.feeds.news import BasicNewsRecipe

 class CzasGentlemanow(BasicNewsRecipe):
@@ -11,10 +12,13 @@ class CzasGentlemanow(BasicNewsRecipe):
     ignore_duplicate_articles = {'title', 'url'}
     oldest_article = 7
     max_articles_per_feed = 100
+    extra_css = '.gallery-item {float:left; margin-right: 10px; max-width: 20%;} .alignright {text-align: right; float:right; margin-left:5px;}\
+    .wp-caption-text {text-align: left;} img.aligncenter {display: block; margin-left: auto; margin-right: auto;} .alignleft {float: left; margin-right:5px;}'
     no_stylesheets = True
     remove_empty_feeds = True
+    preprocess_regexps = [(re.compile(u'<h3>Może Cię też zainteresować:</h3>'), lambda m: '')]
     use_embedded_content = False
     keep_only_tags = [dict(name='div', attrs={'class':'content'})]
-    remove_tags = [dict(attrs={'class':'meta_comments'})]
-    remove_tags_after = dict(name='div', attrs={'class':'fblikebutton_button'})
+    remove_tags = [dict(attrs={'class':'meta_comments'}), dict(id=['comments', 'related_posts_thumbnails', 'respond'])]
+    remove_tags_after = dict(id='comments')
     feeds = [(u'M\u0119ski \u015awiat', u'http://czasgentlemanow.pl/category/meski-swiat/feed/'), (u'Styl', u'http://czasgentlemanow.pl/category/styl/feed/'), (u'Vademecum Gentlemana', u'http://czasgentlemanow.pl/category/vademecum/feed/'), (u'Dom i rodzina', u'http://czasgentlemanow.pl/category/dom-i-rodzina/feed/'), (u'Honor', u'http://czasgentlemanow.pl/category/honor/feed/'), (u'Gad\u017cety Gentlemana', u'http://czasgentlemanow.pl/category/gadzety-gentlemana/feed/')]
recipes/deccan_herald.recipe (new file, 35 lines)
@@ -0,0 +1,35 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+
+class AdvancedUserRecipe1362501327(BasicNewsRecipe):
+    title = u'Deccan Herald'
+    __author__ = 'Muruli Shamanna'
+    description = 'Daily news from the Deccan Herald'
+
+    oldest_article = 1
+    max_articles_per_feed = 100
+    auto_cleanup = True
+    category = 'News'
+    language = 'en_IN'
+    encoding = 'utf-8'
+    publisher = 'The Printers (Mysore) Private Ltd'
+    ##use_embedded_content = True
+
+    cover_url = 'http://www.quizzing.in/wp-content/uploads/2010/07/DH.gif'
+
+    conversion_options = {
+        'comments' : description
+        ,'tags' : category
+        ,'language' : language
+        ,'publisher' : publisher
+    }
+
+    feeds = [(u'News', u'http://www.deccanherald.com/rss/news.rss'), (u'Business', u'http://www.deccanherald.com/rss/business.rss'), (u'Entertainment', u'http://www.deccanherald.com/rss/entertainment.rss'), (u'Sports', u'http://www.deccanherald.com/rss/sports.rss'), (u'Environment', u'http://www.deccanherald.com/rss/environment.rss')]
+
+    extra_css = '''
+        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:150%;}
+        h2{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:155%;}
+        img {max-width:100%; min-width:100%;}
+        p{font-family:Arial,Helvetica,sans-serif;font-size:large;}
+        body{font-family:Helvetica,Arial,sans-serif;font-size:medium;}
+    '''
@@ -16,9 +16,10 @@ class Dobreprogramy_pl(BasicNewsRecipe):
     extra_css = '.title {font-size:22px;}'
     oldest_article = 8
     max_articles_per_feed = 100
+    remove_attrs = ['style', 'width', 'height']
     preprocess_regexps = [(re.compile(ur'<div id="\S+360pmp4">Twoja przeglądarka nie obsługuje Flasha i HTML5 lub wyłączono obsługę JavaScript...</div>'), lambda match: '') ]
     keep_only_tags=[dict(attrs={'class':['news', 'entry single']})]
-    remove_tags = [dict(attrs={'class':['newsOptions', 'noPrint', 'komentarze', 'tags font-heading-master']}), dict(id='komentarze')]
+    remove_tags = [dict(attrs={'class':['newsOptions', 'noPrint', 'komentarze', 'tags font-heading-master']}), dict(id='komentarze'), dict(name='iframe')]
     #remove_tags = [dict(name='div', attrs={'class':['komentarze', 'block', 'portalInfo', 'menuBar', 'topBar']})]
     feeds = [(u'Aktualności', 'http://feeds.feedburner.com/dobreprogramy/Aktualnosci'),
             ('Blogi', 'http://feeds.feedburner.com/dobreprogramy/BlogCzytelnikow')]
@@ -28,4 +29,11 @@ class Dobreprogramy_pl(BasicNewsRecipe):
         for a in soup('a'):
             if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
                 a['href']=self.index + a['href']
+        for r in soup.findAll('iframe'):
+            r.parent.extract()
+        return soup
+
+    def postprocess_html(self, soup, first_fetch):
+        for r in soup.findAll('span', text=''):
+            if not r.string:
+                r.extract()
         return soup
@@ -8,6 +8,7 @@ class BasicUserRecipe1337668045(BasicNewsRecipe):
     cover_url = 'http://drytooling.com.pl/images/drytooling-kindle.png'
     description = u'Drytooling.com.pl jest serwisem wspinaczki zimowej, alpinizmu i himalaizmu. Jeśli uwielbiasz zimę, nie możesz doczekać się aż wyciągniesz szpej z szafki i uderzysz w Tatry, Alpy, czy może Himalaje, to znajdziesz tutaj naprawdę dużo interesujących Cię treści! Zapraszamy!'
     __author__ = u'Damian Granowski'
+    language = 'pl'
     oldest_article = 100
     max_articles_per_feed = 20
     auto_cleanup = True
recipes/dwutygodnik.recipe (new file, 56 lines)
@@ -0,0 +1,56 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+__license__ = 'GPL v3'
+__copyright__ = u'Łukasz Grąbczewski 2011'
+__version__ = '2.0'
+
+import re, os
+from calibre import walk
+from calibre.utils.zipfile import ZipFile
+from calibre.ptempfile import PersistentTemporaryFile
+from calibre.web.feeds.news import BasicNewsRecipe
+
+
+class dwutygodnik(BasicNewsRecipe):
+    __author__ = u'Łukasz Grąbczewski'
+    title = 'Dwutygodnik'
+    language = 'pl'
+    publisher = 'Narodowy Instytut Audiowizualny'
+    publication_type = 'magazine'
+    description = u'Strona Kultury: literatura, teatr, film, sztuka, muzyka, felietony, rozmowy'
+
+    conversion_options = {
+        'authors' : 'Dwutygodnik.com'
+        ,'publisher' : publisher
+        ,'language' : language
+        ,'comments' : description
+        ,'no_default_epub_cover' : True
+        ,'preserve_cover_aspect_ratio': True
+    }
+
+    def build_index(self):
+        browser = self.get_browser()
+        browser.open('http://www.dwutygodnik.com/')
+
+        # find the link
+        epublink = browser.find_link(text_regex=re.compile('Wersja ePub'))
+
+        # download ebook
+        self.report_progress(0, _('Downloading ePUB'))
+        response = browser.follow_link(epublink)
+        book_file = PersistentTemporaryFile(suffix='.epub')
+        book_file.write(response.read())
+        book_file.close()
+
+        # convert
+        self.report_progress(0.2, _('Converting to OEB'))
+        oeb = self.output_dir + '/INPUT/'
+        if not os.path.exists(oeb):
+            os.makedirs(oeb)
+        with ZipFile(book_file.name) as f:
+            f.extractall(path=oeb)
+
+        for f in walk(oeb):
+            if f.endswith('.opf'):
+                return f
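The dwutygodnik recipe above is not a scraper at all: build_index downloads the ePub the publisher already offers, unpacks it into the recipe's INPUT directory and returns the path of its .opf file, which calibre then converts like any other OEB input. The final locate-the-OPF step in isolation (the directory path here is illustrative, not from the recipe):

    # Locate the OPF manifest inside an unpacked ePub, as build_index above does.
    from calibre import walk

    oeb = '/tmp/recipe/INPUT'  # illustrative; the recipe uses self.output_dir + '/INPUT/'
    opf = None
    for f in walk(oeb):  # walk() yields every file path under the directory
        if f.endswith('.opf'):
            opf = f
            break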
@@ -1,13 +1,15 @@
 from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import Comment

 class Dzieje(BasicNewsRecipe):
     title = u'dzieje.pl'
     __author__ = 'fenuks'
-    description = 'Dzieje - history of Poland'
+    description = 'Dzieje.pl - najlepszy portal informacyjno-edukacyjny dotyczący historii Polski XX wieku. Archiwalne fotografie, filmy, katalog postaci, quizy i konkursy.'
     cover_url = 'http://www.dzieje.pl/sites/default/files/dzieje_logo.png'
     category = 'history'
     language = 'pl'
     ignore_duplicate_articles = {'title', 'url'}
+    extra_css = '.imagecache-default {float:left; margin-right:20px;}'
     index = 'http://dzieje.pl'
     oldest_article = 8
     max_articles_per_feed = 100
@@ -28,11 +30,14 @@ class Dzieje(BasicNewsRecipe):
             pagetext = soup2.find(id='content-area').find(attrs={'class':'content'})
             for r in pagetext.findAll(attrs={'class':['fieldgroup group-groupkul', 'fieldgroup group-zdjeciekult', 'fieldgroup group-zdjecieciekaw', 'fieldgroup group-zdjecieksiazka', 'fieldgroup group-zdjeciedu', 'field field-type-filefield field-field-zdjecieglownawyd']}):
                 r.extract()
-            pos = len(appendtag.contents)
-            appendtag.insert(pos, pagetext)
+            comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
+            # appendtag.insert(pos, pagetext)
             tag = soup2.find('li', attrs={'class':'pager-next'})
         for r in appendtag.findAll(attrs={'class':['item-list', 'field field-type-computed field-field-tagi', ]}):
             r.extract()
+        comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+        for comment in comments:
+            comment.extract()

     def find_articles(self, url):
         articles = []
@@ -64,7 +69,7 @@ class Dzieje(BasicNewsRecipe):

     def preprocess_html(self, soup):
         for a in soup('a'):
-            if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
+            if a.has_key('href') and not a['href'].startswith('http'):
                 a['href'] = self.index + a['href']
         self.append_page(soup, soup.body)
         return soup
recipes/dziennik_baltycki.recipe (new file, 34 lines)
@@ -0,0 +1,34 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+
+class DziennikBaltycki(BasicNewsRecipe):
+    title = u'Dziennik Ba\u0142tycki'
+    __author__ = 'fenuks'
+    description = u'Gazeta Regionalna Dziennik Bałtycki. Najnowsze Wiadomości Trójmiasto i Wiadomości Pomorskie. Czytaj!'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/dziennikbaltycki.png?24'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds= True
+    no_stylesheets = True
+    use_embedded_content = False
+    ignore_duplicate_articles = {'title', 'url'}
+    #preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
+    remove_tags_after= dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
+    remove_tags=[dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})]
+
+    feeds = [(u'Wiadomo\u015bci', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_wiadomosci.xml?201302'), (u'Sport', u'http://dziennikbaltycki.feedsportal.com/c/32980/f/533756/index.rss?201302'), (u'Rejsy', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_rejsy.xml?201302'), (u'Biznes na Pomorzu', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_biznesnapomorzu.xml?201302'), (u'GOM', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_gom.xml?201302'), (u'Opinie', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_opinie.xml?201302'), (u'Pitawal Pomorski', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_pitawalpomorski.xml?201302')]
+
+    def print_version(self, url):
+        return url.replace('artykul', 'drukuj')
+
+    def skip_ad_pages(self, soup):
+        if 'Advertisement' in soup.title:
+            nexturl=soup.find('a')['href']
+            return self.index_to_soup(nexturl, raw=True)
+
+    def get_cover_url(self):
+        soup = self.index_to_soup('http://www.prasa24.pl/gazeta/dziennik-baltycki/')
+        self.cover_url=soup.find(id='pojemnik').img['src']
+        return getattr(self, 'cover_url', self.cover_url)
recipes/dziennik_lodzki.recipe (new file, 35 lines)
@@ -0,0 +1,35 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+
+class DziennikLodzki(BasicNewsRecipe):
+    title = u'Dziennik \u0141\xf3dzki'
+    __author__ = 'fenuks'
+    description = u'Gazeta Regionalna Dziennik Łódzki. Najnowsze Wiadomości Łódź. Czytaj Wiadomości Łódzkie!'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/dzienniklodzki.png?24'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
+    no_stylesheets = True
+    use_embedded_content = False
+    ignore_duplicate_articles = {'title', 'url'}
+    #preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
+    remove_tags_after= dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
+    remove_tags=[dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})]
+
+    feeds = [(u'Na sygnale', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_nasygnale.xml?201302'), (u'\u0141\xf3d\u017a', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_lodz.xml?201302'), (u'Opinie', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_opinie.xml?201302'), (u'Pieni\u0105dze', u'http://dzienniklodzki.feedsportal.com/c/32980/f/533763/index.rss?201302'), (u'Kultura', u'http://dzienniklodzki.feedsportal.com/c/32980/f/533762/index.rss?201302'), (u'Sport', u'http://dzienniklodzki.feedsportal.com/c/32980/f/533761/index.rss?201302'), (u'Akcje', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_akcje.xml?201302'), (u'M\xf3j Reporter', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_mojreporter.xml?201302'), (u'Studni\xf3wki', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_studniowki.xml?201302'), (u'Kraj', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_kraj.xml?201302'), (u'Zdrowie', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_zdrowie.xml?201302')]
+
+    def print_version(self, url):
+        return url.replace('artykul', 'drukuj')
+
+    def skip_ad_pages(self, soup):
+        if 'Advertisement' in soup.title:
+            nexturl=soup.find('a')['href']
+            return self.index_to_soup(nexturl, raw=True)
+
+    def get_cover_url(self):
+        soup = self.index_to_soup('http://www.prasa24.pl/gazeta/dziennik-lodzki/')
+        self.cover_url=soup.find(id='pojemnik').img['src']
+        return getattr(self, 'cover_url', self.cover_url)
@@ -2,6 +2,8 @@

 from calibre.web.feeds.news import BasicNewsRecipe
 import re
+from calibre.ebooks.BeautifulSoup import Comment

 class Dziennik_pl(BasicNewsRecipe):
     title = u'Dziennik.pl'
     __author__ = 'fenuks'
@@ -54,6 +56,9 @@ class Dziennik_pl(BasicNewsRecipe):
             v = pagetext.findAll(name=dictionary['name'], attrs=dictionary['attrs'])
             for delete in v:
                 delete.extract()
+            comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
+            for comment in comments:
+                comment.extract()
             pos = len(appendtag.contents)
             appendtag.insert(pos, pagetext)
             if appendtag.find('div', attrs={'class':'article_paginator'}):
recipes/dziennik_wschodni.recipe (new file, 84 lines)
@@ -0,0 +1,84 @@
+import re
+from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import Comment
+
+
+class DziennikWschodni(BasicNewsRecipe):
+    title = u'Dziennik Wschodni'
+    __author__ = 'fenuks'
+    description = u'Dziennik Wschodni - portal regionalny województwa lubelskiego.'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    extra_css = 'ul {list-style: none; padding:0; margin:0;}'
+    INDEX = 'http://www.dziennikwschodni.pl'
+    masthead_url = INDEX + '/images/top_logo.png'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
+    no_stylesheets = True
+    ignore_duplicate_articles = {'title', 'url'}
+
+    preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
+                          (re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]
+
+    keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
+    remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
+                            'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
+                            'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
+                   dict(attrs={'class':'articleFunctions'})]
+
+    feeds = [(u'Wszystkie', u'http://www.dziennikwschodni.pl/rss.xml'),
+             (u'Lublin', u'http://www.dziennikwschodni.pl/lublin.xml'),
+             (u'Zamość', u'http://www.dziennikwschodni.pl/zamosc.xml'),
+             (u'Biała Podlaska', u'http://www.dziennikwschodni.pl/biala_podlaska.xml'),
+             (u'Chełm', u'http://www.dziennikwschodni.pl/chelm.xml'),
+             (u'Kraśnik', u'http://www.dziennikwschodni.pl/krasnik.xml'),
+             (u'Puławy', u'http://www.dziennikwschodni.pl/pulawy.xml'),
+             (u'Świdnik', u'http://www.dziennikwschodni.pl/swidnik.xml'),
+             (u'Łęczna', u'http://www.dziennikwschodni.pl/leczna.xml'),
+             (u'Lubartów', u'http://www.dziennikwschodni.pl/lubartow.xml'),
+             (u'Sport', u'http://www.dziennikwschodni.pl/sport.xml'),
+             (u'Praca', u'http://www.dziennikwschodni.pl/praca.xml'),
+             (u'Dom', u'http://www.dziennikwschodni.pl/dom.xml'),
+             (u'Moto', u'http://www.dziennikwschodni.pl/moto.xml'),
+             (u'Zdrowie', u'http://www.dziennikwschodni.pl/zdrowie.xml'),
+             ]
+
+    def get_cover_url(self):
+        soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
+        nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
+        soup = self.index_to_soup(nexturl)
+        self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
+        return getattr(self, 'cover_url', self.cover_url)
+
+    def append_page(self, soup, appendtag):
+        tag = soup.find('span', attrs={'class':'photoNavigationPages'})
+        if tag:
+            number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
+            baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]
+
+            for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
+                r.extract()
+            for nr in range(2, number+1):
+                soup2 = self.index_to_soup(baseurl + str(nr))
+                pagetext = soup2.find(id='photoContainer')
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoMeta'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoStoryText'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+
+            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+            for comment in comments:
+                comment.extract()
+
+    def preprocess_html(self, soup):
+        self.append_page(soup, soup.body)
+        return soup
recipes/dziennik_zachodni.recipe (new file, 34 lines)
@@ -0,0 +1,34 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+
+class DziennikZachodni(BasicNewsRecipe):
+    title = u'Dziennik Zachodni'
+    __author__ = 'fenuks'
+    description = u'Gazeta Regionalna Dziennik Zachodni. Najnowsze Wiadomości Śląskie. Wiadomości Śląsk. Czytaj!'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/dziennikzachodni.png?24'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds= True
+    no_stylesheets = True
+    use_embedded_content = False
+    ignore_duplicate_articles = {'title', 'url'}
+    #preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
+    remove_tags_after= dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
+    remove_tags=[dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'}), dict(attrs={'href':'http://www.dziennikzachodni.pl/piano'})]
+
+    feeds = [(u'Wszystkie', u'http://dziennikzachodni.feedsportal.com/c/32980/f/533764/index.rss?201302'), (u'Wiadomo\u015bci', u'http://dziennikzachodni.feedsportal.com/c/32980/f/533765/index.rss?201302'), (u'Regiony', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_regiony.xml?201302'), (u'Opinie', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_regiony.xml?201302'), (u'Blogi', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_blogi.xml?201302'), (u'Serwisy', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_serwisy.xml?201302'), (u'Sport', u'http://dziennikzachodni.feedsportal.com/c/32980/f/533766/index.rss?201302'), (u'M\xf3j Reporter', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_mojreporter.xml?201302'), (u'Na narty', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_nanarty.xml?201302'), (u'Drogi', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_drogi.xml?201302'), (u'Pieni\u0105dze', u'http://dziennikzachodni.feedsportal.com/c/32980/f/533768/index.rss?201302')]
+
+    def print_version(self, url):
+        return url.replace('artykul', 'drukuj')
+
+    def skip_ad_pages(self, soup):
+        if 'Advertisement' in soup.title:
+            nexturl=soup.find('a')['href']
+            return self.index_to_soup(nexturl, raw=True)
+
+    def get_cover_url(self):
+        soup = self.index_to_soup('http://www.prasa24.pl/gazeta/dziennik-zachodni/')
+        self.cover_url=soup.find(id='pojemnik').img['src']
+        return getattr(self, 'cover_url', self.cover_url)
recipes/echo_dnia.recipe (new file, 79 lines)
@@ -0,0 +1,79 @@
+import re
+from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import Comment
+
+
+class EchoDnia(BasicNewsRecipe):
+    title = u'Echo Dnia'
+    __author__ = 'fenuks'
+    description = u'Echo Dnia - portal regionalny świętokrzyskiego radomskiego i podkarpackiego. Najnowsze wiadomości z Twojego regionu, galerie, video, mp3.'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    extra_css = 'ul {list-style: none; padding:0; margin:0;}'
+    INDEX = 'http://www.echodnia.eu'
+    masthead_url = INDEX + '/images/top_logo.png'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
+    no_stylesheets = True
+    ignore_duplicate_articles = {'title', 'url'}
+
+    preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
+                          (re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]
+
+    keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
+    remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
+                            'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
+                            'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
+                   dict(attrs={'class':'articleFunctions'})]
+
+    feeds = [(u'Wszystkie', u'http://www.echodnia.eu/rss.xml'),
+             (u'Świętokrzyskie', u'http://www.echodnia.eu/swietokrzyskie.xml'),
+             (u'Radomskie', u'http://www.echodnia.eu/radomskie.xml'),
+             (u'Podkarpackie', u'http://www.echodnia.eu/podkarpackie.xml'),
+             (u'Sport \u015bwi\u0119tokrzyski', u'http://www.echodnia.eu/sport_swi.xml'),
+             (u'Sport radomski', u'http://www.echodnia.eu/sport_rad.xml'),
+             (u'Sport podkarpacki', u'http://www.echodnia.eu/sport_pod.xml'),
+             (u'Pi\u0142ka no\u017cna', u'http://www.echodnia.eu/pilka.xml'),
+             (u'Praca', u'http://www.echodnia.eu/praca.xml'),
+             (u'Dom', u'http://www.echodnia.eu/dom.xml'),
+             (u'Auto', u'http://www.echodnia.eu/auto.xml'),
+             (u'Zdrowie', u'http://www.echodnia.eu/zdrowie.xml')]
+
+    def get_cover_url(self):
+        soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
+        nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
+        soup = self.index_to_soup(nexturl)
+        self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
+        return getattr(self, 'cover_url', self.cover_url)
+
+    def append_page(self, soup, appendtag):
+        tag = soup.find('span', attrs={'class':'photoNavigationPages'})
+        if tag:
+            number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
+            baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]
+
+            for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
+                r.extract()
+            for nr in range(2, number+1):
+                soup2 = self.index_to_soup(baseurl + str(nr))
+                pagetext = soup2.find(id='photoContainer')
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoMeta'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoStoryText'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+
+            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+            for comment in comments:
+                comment.extract()
+
+    def preprocess_html(self, soup):
+        self.append_page(soup, soup.body)
+        return soup
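Reviewer note: dziennik_wschodni.recipe and echo_dnia.recipe above carry an identical photo-story pagination helper. The page count is parsed from the 'photoNavigationPages' span (text of the form '1 / 7'), the per-page URL is the 'photoNavigationNext' link with its trailing page number sliced off, and pages 2..n are then fetched and appended. The parsing step in isolation; the sample markup is an assumption inferred from the selectors the recipes use:

    # How the shared append_page helpers derive page count and base URL.
    from calibre.ebooks.BeautifulSoup import BeautifulSoup

    html = ('<span class="photoNavigationPages">1 / 7</span>'
            '<a class="photoNavigationNext" href="/photostory/2">next</a>')
    soup = BeautifulSoup(html)

    tag = soup.find('span', attrs={'class': 'photoNavigationPages'})
    number = int(tag.string.rpartition('/')[-1].replace(' ', ''))  # -> 7
    # Drop the trailing page number; the recipes prefix self.INDEX to this.
    baseurl = soup.find(attrs={'class': 'photoNavigationNext'})['href'][:-1]
    urls = [baseurl + str(nr) for nr in range(2, number + 1)]  # pages 2..7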
@@ -1,8 +1,6 @@
 #!/usr/bin/env python

 __license__ = 'GPL v3'
-__author__ = 'Mori'
-__version__ = 'v. 0.1'
 '''
 blog.eclicto.pl
 '''
@@ -11,7 +9,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
 import re

 class BlogeClictoRecipe(BasicNewsRecipe):
-    __author__ = 'Mori'
+    __author__ = 'Mori, Tomasz Długosz'
     language = 'pl'

     title = u'Blog eClicto'
@@ -34,10 +32,10 @@ class BlogeClictoRecipe(BasicNewsRecipe):
     ]

     remove_tags = [
-        dict(name = 'span', attrs = {'id' : 'tags'})
+        dict(name = 'div', attrs = {'class' : 'social_bookmark'}),
     ]

-    remove_tags_after = [
+    keep_only_tags = [
         dict(name = 'div', attrs = {'class' : 'post'})
     ]

recipes/eclipseonline.recipe (new file, 38 lines)
@@ -0,0 +1,38 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class EclipseOnline(BasicNewsRecipe):
+
+    #
+    # oldest_article specifies the maximum age, in days, of posts to retrieve.
+    # The default of 32 is intended to work well with a "days of month = 1"
+    # recipe schedule to download "monthly issues" of Eclipse Online.
+    # Increase this value to include additional posts. However, the RSS feed
+    # currently only includes the 10 most recent posts, so that's the max.
+    #
+    oldest_article = 32
+
+    title = u'Eclipse Online'
+    description = u'"Where strange and wonderful things happen, where reality is eclipsed for a little while with something magical and new." Eclipse Online is edited by Jonathan Strahan and published online by Night Shade Books. http://www.nightshadebooks.com/category/eclipse/'
+    publication_type = 'magazine'
+    language = 'en'
+
+    __author__ = u'Jim DeVona'
+    __version__ = '1.0'
+
+    # For now, use this Eclipse Online logo as the ebook cover image.
+    # (Disable the cover_url line to let Calibre generate a default cover, including date.)
+    cover_url = 'http://www.nightshadebooks.com/wp-content/uploads/2012/10/Eclipse-Logo.jpg'
+
+    # Extract the "post" div containing the story (minus redundant metadata) from each page.
+    keep_only_tags = [dict(name='div', attrs={'class':lambda x: x and 'post' in x})]
+    remove_tags = [dict(name='span', attrs={'class': ['post-author', 'post-category', 'small']})]
+
+    # Nice plain markup (like Eclipse's) works best for most e-readers.
+    # Disregard any special styling rules, but center illustrations.
+    auto_cleanup = False
+    no_stylesheets = True
+    remove_attributes = ['style', 'align']
+    extra_css = '.wp-caption {text-align: center;} .wp-caption-text {font-size: small; font-style: italic;}'
+
+    # Tell Calibre where to look for article links. It will proceed to retrieve
+    # these posts and format them into an ebook according to the above rules.
+    feeds = ['http://www.nightshadebooks.com/category/eclipse/feed/']
@@ -4,6 +4,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
 class eioba(BasicNewsRecipe):
     title = u'eioba'
     __author__ = 'fenuks'
+    description = u'eioba.pl - daj się przeczytać!'
     cover_url = 'http://www.eioba.org/lay/logo_pl_v3.png'
     language = 'pl'
     oldest_article = 7
@@ -9,7 +9,7 @@ class EkologiaPl(BasicNewsRecipe):
     language = 'pl'
     cover_url = 'http://www.ekologia.pl/assets/images/logo/ekologia_pl_223x69.png'
     ignore_duplicate_articles = {'title', 'url'}
-    extra_css = '.title {font-size: 200%;}'
+    extra_css = '.title {font-size: 200%;} .imagePowiazane, .imgCon {float:left; margin-right:5px;}'
     oldest_article = 7
     max_articles_per_feed = 100
     no_stylesheets = True
@@ -5,7 +5,7 @@ class Elektroda(BasicNewsRecipe):
     title = u'Elektroda'
     oldest_article = 8
     __author__ = 'fenuks'
-    description = 'Elektroda.pl'
+    description = 'Międzynarodowy portal elektroniczny udostępniający bogate zasoby z dziedziny elektroniki oraz forum dyskusyjne.'
     cover_url = 'http://demotywatory.elektroda.pl/Thunderpic/logo.gif'
     category = 'electronics'
     language = 'pl'
@@ -12,6 +12,7 @@ class eMuzyka(BasicNewsRecipe):
     no_stylesheets = True
     oldest_article = 7
     max_articles_per_feed = 100
+    remove_attributes = ['style']
     keep_only_tags=[dict(name='div', attrs={'id':'news_container'}), dict(name='h3'), dict(name='div', attrs={'class':'review_text'})]
     remove_tags=[dict(name='span', attrs={'id':'date'})]
     feeds = [(u'Aktualno\u015bci', u'http://www.emuzyka.pl/rss.php?f=1'), (u'Recenzje', u'http://www.emuzyka.pl/rss.php?f=2')]
@@ -3,29 +3,37 @@
 __license__ = 'GPL v3'
 __copyright__ = '2010, matek09, matek09@gmail.com'

-from calibre.web.feeds.news import BasicNewsRecipe
 import re
+from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import BeautifulSoup, Comment

 class Esensja(BasicNewsRecipe):

     title = u'Esensja'
-    __author__ = 'matek09'
-    description = 'Monthly magazine'
+    __author__ = 'matek09 & fenuks'
+    description = 'Magazyn kultury popularnej'
     encoding = 'utf-8'
     no_stylesheets = True
     language = 'pl'
     remove_javascript = True
+    masthead_url = 'http://esensja.pl/img/wrss.gif'
+    oldest_article = 1
+    URL = 'http://esensja.pl'
     HREF = '0'
-    #keep_only_tags =[]
+    remove_attributes = ['style', 'bgcolor', 'alt', 'color']
+    keep_only_tags = [dict(attrs={'class':'sekcja'}), ]
     #keep_only_tags.append(dict(name = 'div', attrs = {'class' : 'article'})
-    remove_tags_before = dict(dict(name = 'div', attrs = {'class' : 't-title'}))
-    remove_tags_after = dict(dict(name = 'img', attrs = {'src' : '../../../2000/01/img/tab_bot.gif'}))
+    #remove_tags_before = dict(dict(name = 'div', attrs = {'class' : 't-title'}))
+    remove_tags_after = dict(id='tekst')

-    remove_tags =[]
-    remove_tags.append(dict(name = 'img', attrs = {'src' : '../../../2000/01/img/tab_top.gif'}))
-    remove_tags.append(dict(name = 'img', attrs = {'src' : '../../../2000/01/img/tab_bot.gif'}))
-    remove_tags.append(dict(name = 'div', attrs = {'class' : 't-title2 nextpage'}))
+    remove_tags = [dict(name = 'img', attrs = {'src' : ['../../../2000/01/img/tab_top.gif', '../../../2000/01/img/tab_bot.gif']}),
+                   dict(name = 'div', attrs = {'class' : 't-title2 nextpage'}),
+                   #dict(attrs={'rel':'lightbox[galeria]'})
+                   dict(attrs={'class':['tekst_koniec', 'ref', 'wykop']}),
+                   dict(attrs={'itemprop':['copyrightHolder', 'publisher']}),
+                   dict(id='komentarze')
+                   ]

     extra_css = '''
     .t-title {font-size: x-large; font-weight: bold; text-align: left}
@@ -35,8 +43,9 @@ class Esensja(BasicNewsRecipe):
     .annot-ref {font-style: italic; text-align: left}
     '''

-    preprocess_regexps = [(re.compile(r'alt="[^"]*"'),
-                           lambda match: '')]
+    preprocess_regexps = [(re.compile(r'alt="[^"]*"'), lambda match: ''),
+                          (re.compile(ur'(title|alt)="[^"]*?"', re.DOTALL), lambda match: ''),
+                          ]

     def parse_index(self):
         soup = self.index_to_soup('http://www.esensja.pl/magazyn/')
@@ -47,15 +56,19 @@ class Esensja(BasicNewsRecipe):
         soup = self.index_to_soup(self.HREF + '01.html')
         self.cover_url = 'http://www.esensja.pl/magazyn/' + year + '/' + month + '/img/ilustr/cover_b.jpg'
         feeds = []
+        chapter = ''
+        subchapter = ''
+        articles = []
         intro = soup.find('div', attrs={'class' : 'n-title'})
+        '''
         introduction = {'title' : self.tag_to_string(intro.a),
                         'url' : self.HREF + intro.a['href'],
                         'date' : '',
                         'description' : ''}
         chapter = 'Wprowadzenie'
-        subchapter = ''
-        articles = []
         articles.append(introduction)
+        '''

         for tag in intro.findAllNext(attrs={'class': ['chapter', 'subchapter', 'n-title']}):
             if tag.name in 'td':
                 if len(articles) > 0:
@@ -71,17 +84,72 @@ class Esensja(BasicNewsRecipe):
                     subchapter = self.tag_to_string(tag)
                     subchapter = self.tag_to_string(tag)
                     continue
-                articles.append({'title' : self.tag_to_string(tag.a), 'url' : self.HREF + tag.a['href'], 'date' : '', 'description' : ''})
-
-                a = self.index_to_soup(self.HREF + tag.a['href'])
+                finalurl = tag.a['href']
+                if not finalurl.startswith('http'):
+                    finalurl = self.HREF + finalurl
+                articles.append({'title' : self.tag_to_string(tag.a), 'url' : finalurl, 'date' : '', 'description' : ''})
+
+                a = self.index_to_soup(finalurl)
                 i = 1
+
                 while True:
                     div = a.find('div', attrs={'class' : 't-title2 nextpage'})
                     if div is not None:
-                        a = self.index_to_soup(self.HREF + div.a['href'])
-                        articles.append({'title' : self.tag_to_string(tag.a) + ' c. d. ' + str(i), 'url' : self.HREF + div.a['href'], 'date' : '', 'description' : ''})
+                        link = div.a['href']
+                        if not link.startswith('http'):
+                            link = self.HREF + link
+                        a = self.index_to_soup(link)
+                        articles.append({'title' : self.tag_to_string(tag.a) + ' c. d. ' + str(i), 'url' : link, 'date' : '', 'description' : ''})
                         i = i + 1
                     else:
                         break

         return feeds

+    def append_page(self, soup, appendtag):
+        r = appendtag.find(attrs={'class':'wiecej_xxx'})
+        if r:
+            nr = r.findAll(attrs={'class':'tn-link'})[-1]
+            try:
+                nr = int(nr.a.string)
+            except:
+                return
+            baseurl = soup.find(attrs={'property':'og:url'})['content'] + '&strona={0}'
+            for number in range(2, nr+1):
+                soup2 = self.index_to_soup(baseurl.format(number))
+                pagetext = soup2.find(attrs={'class':'tresc'})
+                pos = len(appendtag.contents)
+                appendtag.insert(pos, pagetext)
+            for r in appendtag.findAll(attrs={'class':['wiecej_xxx', 'tekst_koniec']}):
+                r.extract()
+            for r in appendtag.findAll('script'):
+                r.extract()
+
+            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+            for comment in comments:
+                comment.extract()
+
+    def preprocess_html(self, soup):
+        self.append_page(soup, soup.body)
+        for tag in soup.findAll(attrs={'class':'img_box_right'}):
+            temp = tag.find('img')
+            src = ''
+            if temp:
+                src = temp.get('src', '')
+            for r in tag.findAll('a', recursive=False):
+                r.extract()
+            info = tag.find(attrs={'class':'img_info'})
+            text = str(tag)
+            if not src:
+                src = re.search('src="[^"]*?"', text)
+                if src:
+                    src = src.group(0)
+                    src = src[5:].replace('//', '/')
+            if src:
+                tag.contents = []
+                tag.insert(0, BeautifulSoup('<img src="{0}{1}" />'.format(self.URL, src)))
+            if info:
+                tag.insert(len(tag.contents), info)
+        return soup
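Both Esensja recipes (the magazine scraper above and the RSS variant that follows, which duplicates its helpers) page through long articles the same way: the page count is taken from the last 'tn-link' under the 'wiecej_xxx' block, and each page URL is the article's og:url metadata plus an &strona=N query parameter. The URL construction in isolation, with sample markup assumed:

    # Pagination URLs as built by the Esensja append_page helpers.
    from calibre.ebooks.BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup('<meta property="og:url" content="http://esensja.pl/magazyn/a.html?id=1"/>')
    baseurl = soup.find(attrs={'property': 'og:url'})['content'] + '&strona={0}'
    print(baseurl.format(2))  # http://esensja.pl/magazyn/a.html?id=1&strona=2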
recipes/esensja_(rss).recipe (new file, 109 lines)
@@ -0,0 +1,109 @@
+__license__ = 'GPL v3'
+import re
+from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import BeautifulSoup, Comment
+
+
+class EsensjaRSS(BasicNewsRecipe):
+    title = u'Esensja (RSS)'
+    __author__ = 'fenuks'
+    description = u'Magazyn kultury popularnej'
+    category = 'reading, fantasy, reviews, boardgames, culture'
+    #publication_type = ''
+    language = 'pl'
+    encoding = 'utf-8'
+    INDEX = 'http://www.esensja.pl'
+    extra_css = '''.t-title {font-size: x-large; font-weight: bold; text-align: left}
+                   .t-author {font-size: x-small; text-align: left}
+                   .t-title2 {font-size: x-small; font-style: italic; text-align: left}
+                   .text {font-size: small; text-align: left}
+                   .annot-ref {font-style: italic; text-align: left}
+                   '''
+    cover_url = ''
+    masthead_url = 'http://esensja.pl/img/wrss.gif'
+    use_embedded_content = False
+    oldest_article = 7
+    max_articles_per_feed = 100
+    no_stylesheets = True
+    remove_empty_feeds = True
+    remove_javascript = True
+    ignore_duplicate_articles = {'title', 'url'}
+    preprocess_regexps = [(re.compile(r'alt="[^"]*"'), lambda match: ''),
+                          (re.compile(ur'(title|alt)="[^"]*?"', re.DOTALL), lambda match: ''),
+                          ]
+    remove_attributes = ['style', 'bgcolor', 'alt', 'color']
+    keep_only_tags = [dict(attrs={'class':'sekcja'}), ]
+    remove_tags_after = dict(id='tekst')
+
+    remove_tags = [dict(name = 'img', attrs = {'src' : ['../../../2000/01/img/tab_top.gif', '../../../2000/01/img/tab_bot.gif']}),
+                   dict(name = 'div', attrs = {'class' : 't-title2 nextpage'}),
+                   #dict(attrs={'rel':'lightbox[galeria]'})
+                   dict(attrs={'class':['tekst_koniec', 'ref', 'wykop']}),
+                   dict(attrs={'itemprop':['copyrightHolder', 'publisher']}),
+                   dict(id='komentarze')
+                   ]
+
+    feeds = [(u'Książka', u'http://esensja.pl/rss/ksiazka.rss'),
+             (u'Film', u'http://esensja.pl/rss/film.rss'),
+             (u'Komiks', u'http://esensja.pl/rss/komiks.rss'),
+             (u'Gry', u'http://esensja.pl/rss/gry.rss'),
+             (u'Muzyka', u'http://esensja.pl/rss/muzyka.rss'),
+             (u'Twórczość', u'http://esensja.pl/rss/tworczosc.rss'),
+             (u'Varia', u'http://esensja.pl/rss/varia.rss'),
+             (u'Zgryźliwi Tetrycy', u'http://esensja.pl/rss/tetrycy.rss'),
+             (u'Nowe książki', u'http://esensja.pl/rss/xnowosci.rss'),
+             (u'Ostatnio dodane książki', u'http://esensja.pl/rss/xdodane.rss'),
+             ]
+
+    def get_cover_url(self):
+        soup = self.index_to_soup(self.INDEX)
+        cover = soup.find(id='panel_1')
+        self.cover_url = self.INDEX + cover.find('a')['href'].replace('index.html', '') + 'img/ilustr/cover_b.jpg'
+        return getattr(self, 'cover_url', self.cover_url)
+
+    def append_page(self, soup, appendtag):
+        r = appendtag.find(attrs={'class':'wiecej_xxx'})
+        if r:
+            nr = r.findAll(attrs={'class':'tn-link'})[-1]
+            try:
+                nr = int(nr.a.string)
+            except:
+                return
+            baseurl = soup.find(attrs={'property':'og:url'})['content'] + '&strona={0}'
+            for number in range(2, nr+1):
+                soup2 = self.index_to_soup(baseurl.format(number))
+                pagetext = soup2.find(attrs={'class':'tresc'})
+                pos = len(appendtag.contents)
+                appendtag.insert(pos, pagetext)
+            for r in appendtag.findAll(attrs={'class':['wiecej_xxx', 'tekst_koniec']}):
+                r.extract()
+            for r in appendtag.findAll('script'):
+                r.extract()
+
+            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+            for comment in comments:
+                comment.extract()
+
+    def preprocess_html(self, soup):
+        self.append_page(soup, soup.body)
+        for tag in soup.findAll(attrs={'class':'img_box_right'}):
+            temp = tag.find('img')
+            src = ''
+            if temp:
+                src = temp.get('src', '')
+            for r in tag.findAll('a', recursive=False):
+                r.extract()
+            info = tag.find(attrs={'class':'img_info'})
+            text = str(tag)
+            if not src:
+                src = re.search('src="[^"]*?"', text)
+                if src:
+                    src = src.group(0)
+                    src = src[5:].replace('//', '/')
+            if src:
+                tag.contents = []
+                tag.insert(0, BeautifulSoup('<img src="{0}{1}" />'.format(self.INDEX, src)))
+            if info:
+                tag.insert(len(tag.contents), info)
+        return soup
@ -1,5 +1,6 @@
|
|||||||
 # vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
 from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import Comment
 import re
 class FilmOrgPl(BasicNewsRecipe):
     title = u'Film.org.pl'
@@ -7,13 +8,47 @@ class FilmOrgPl(BasicNewsRecipe):
     description = u"Recenzje, analizy, artykuły, rankingi - wszystko o filmie dla miłośników kina. Opisy efektów specjalnych, wersji reżyserskich, remake'ów, sequeli. No i forum filmowe. Jedne z największych w Polsce."
     category = 'film'
     language = 'pl'
+    extra_css = '.alignright {float:right; margin-left:5px;} .alignleft {float:left; margin-right:5px;} .recenzja-title {font-size: 150%; margin-top: 5px; margin-bottom: 5px;}'
     cover_url = 'http://film.org.pl/wp-content/themes/KMF/images/logo_kmf10.png'
     ignore_duplicate_articles = {'title', 'url'}
     oldest_article = 7
     max_articles_per_feed = 100
     no_stylesheets = True
+    remove_javascript = True
     remove_empty_feeds = True
-    use_embedded_content = True
+    use_embedded_content = False
-    preprocess_regexps = [(re.compile(ur'<h3>Przeczytaj także:</h3>.*', re.IGNORECASE|re.DOTALL), lambda m: '</body>'), (re.compile(ur'<div>Artykuł</div>', re.IGNORECASE), lambda m: ''), (re.compile(ur'<div>Ludzie filmu</div>', re.IGNORECASE), lambda m: '')]
+    remove_attributes = ['style']
-    remove_tags = [dict(name='img', attrs={'alt':['Ludzie filmu', u'Artykuł']})]
+    preprocess_regexps = [(re.compile(ur'<h3>Przeczytaj także:</h3>.*', re.IGNORECASE|re.DOTALL), lambda m: '</body>'), (re.compile(ur'</?center>', re.IGNORECASE|re.DOTALL), lambda m: ''), (re.compile(ur'<div>Artykuł</div>', re.IGNORECASE), lambda m: ''), (re.compile(ur'<div>Ludzie filmu</div>', re.IGNORECASE), lambda m: ''), (re.compile(ur'(<br ?/?>\s*?){2,}', re.IGNORECASE|re.DOTALL), lambda m: '')]
+    keep_only_tags = [dict(name=['h11', 'h16', 'h17']), dict(attrs={'class':'editor'})]
+    remove_tags_after = dict(id='comments')
+    remove_tags = [dict(name=['link', 'meta', 'style']), dict(name='img', attrs={'alt':['Ludzie filmu', u'Artykuł']}), dict(id='comments'), dict(attrs={'style':'border: 0pt none ; margin: 0pt; padding: 0pt;'}), dict(name='p', attrs={'class':'rating'}), dict(attrs={'layout':'button_count'})]
     feeds = [(u'Recenzje', u'http://film.org.pl/r/recenzje/feed/'), (u'Artyku\u0142', u'http://film.org.pl/a/artykul/feed/'), (u'Analiza', u'http://film.org.pl/a/analiza/feed/'), (u'Ranking', u'http://film.org.pl/a/ranking/feed/'), (u'Blog', u'http://film.org.pl/kmf/blog/feed/'), (u'Ludzie', u'http://film.org.pl/a/ludzie/feed/'), (u'Seriale', u'http://film.org.pl/a/seriale/feed/'), (u'Oceanarium', u'http://film.org.pl/a/ocenarium/feed/'), (u'VHS', u'http://film.org.pl/a/vhs-a/feed/')]
+
+    def append_page(self, soup, appendtag):
+        tag = soup.find('div', attrs={'class': 'pagelink'})
+        if tag:
+            for nexturl in tag.findAll('a'):
+                url = nexturl['href']
+                soup2 = self.index_to_soup(url)
+                pagetext = soup2.find(attrs={'class': 'editor'})
+                comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
+                for comment in comments:
+                    comment.extract()
+                pos = len(appendtag.contents)
+                appendtag.insert(pos, pagetext)
+            for r in appendtag.findAll(attrs={'class': 'pagelink'}):
+                r.extract()
+            for r in appendtag.findAll(attrs={'id': 'comments'}):
+                r.extract()
+            for r in appendtag.findAll(attrs={'style':'border: 0pt none ; margin: 0pt; padding: 0pt;'}):
+                r.extract()
+            for r in appendtag.findAll(attrs={'layout':'button_count'}):
+                r.extract()
+
+    def preprocess_html(self, soup):
+        for c in soup.findAll('h11'):
+            c.name = 'h1'
+        self.append_page(soup, soup.body)
+        for r in soup.findAll('br'):
+            r.extract()
+        return soup

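Note on the new append_page above: stripping HTML comments before re-inserting fetched page fragments is the standard BeautifulSoup 3 idiom in calibre recipes, since comments are parsed as Comment nodes and would otherwise be carried into the e-book. A minimal, self-contained sketch of the idiom (the sample markup is invented for illustration):

    # Python 2 / BeautifulSoup 3 as bundled with calibre; sample markup is illustrative.
    from calibre.ebooks.BeautifulSoup import BeautifulSoup, Comment

    soup = BeautifulSoup('<div>article text<!-- tracking snippet --></div>')
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        comment.extract()  # detach the Comment node from the tree
    print soup  # -> <div>article text</div>
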
@@ -1,11 +1,12 @@
-from calibre.web.feeds.news import BasicNewsRecipe
 import re
+from calibre.web.feeds.news import BasicNewsRecipe
 from calibre.ebooks.BeautifulSoup import BeautifulSoup
+
 class FilmWebPl(BasicNewsRecipe):
     title = u'FilmWeb'
     __author__ = 'fenuks'
-    description = 'FilmWeb - biggest polish movie site'
+    description = 'Filmweb.pl - Filmy takie jak Ty Filmweb to największy i najczęściej odwiedzany polski serwis filmowy. Największa baza filmów, seriali i aktorów, repertuar kin i tv, ...'
-    cover_url = 'http://userlogos.org/files/logos/crudus/filmweb.png'
+    cover_url = 'http://gfx.filmweb.pl/n/logo-filmweb-bevel.jpg'
     category = 'movies'
     language = 'pl'
     index = 'http://www.filmweb.pl'
@@ -14,11 +15,12 @@ class FilmWebPl(BasicNewsRecipe):
     no_stylesheets = True
     remove_empty_feeds = True
     ignore_duplicate_articles = {'title', 'url'}
-    preprocess_regexps = [(re.compile(u'\(kliknij\,\ aby powiększyć\)', re.IGNORECASE), lambda m: ''), ]#(re.compile(ur' | ', re.IGNORECASE), lambda m: '')]
+    remove_javascript = True
+    preprocess_regexps = [(re.compile(u'\(kliknij\,\ aby powiększyć\)', re.IGNORECASE), lambda m: ''), (re.compile(ur'(<br ?/?>\s*?<br ?/?>\s*?)+', re.IGNORECASE), lambda m: '<br />')]#(re.compile(ur' | ', re.IGNORECASE), lambda m: '')]
     extra_css = '.hdrBig {font-size:22px;} ul {list-style-type:none; padding: 0; margin: 0;}'
-    remove_tags= [dict(name='div', attrs={'class':['recommendOthers']}), dict(name='ul', attrs={'class':'fontSizeSet'}), dict(attrs={'class':'userSurname anno'})]
+    #remove_tags = [dict()]
     remove_attributes = ['style',]
-    keep_only_tags= [dict(name='h1', attrs={'class':['hdrBig', 'hdrEntity']}), dict(name='div', attrs={'class':['newsInfo', 'newsInfoSmall', 'reviewContent description']})]
+    keep_only_tags = [dict(attrs={'class':['hdr hdr-super', 'newsContent']})]
     feeds = [(u'News / Filmy w produkcji', 'http://www.filmweb.pl/feed/news/category/filminproduction'),
             (u'News / Festiwale, nagrody i przeglądy', u'http://www.filmweb.pl/feed/news/category/festival'),
             (u'News / Seriale', u'http://www.filmweb.pl/feed/news/category/serials'),
@@ -42,6 +44,11 @@ class FilmWebPl(BasicNewsRecipe):
         if skip_tag is not None:
             return self.index_to_soup(skip_tag['href'], raw=True)

+    def postprocess_html(self, soup, first_fetch):
+        for r in soup.findAll(attrs={'class':'singlephoto'}):
+            r['style'] = 'float:left; margin-right: 10px;'
+        return soup
+
     def preprocess_html(self, soup):
         for a in soup('a'):
             if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
@@ -51,9 +58,8 @@ class FilmWebPl(BasicNewsRecipe):
         for i in soup.findAll('sup'):
             if not i.string or i.string.startswith('(kliknij'):
                 i.extract()
-        tag = soup.find(name='ul', attrs={'class':'inline sep-line'})
-        if tag:
-            tag.name = 'div'
-            for t in tag.findAll('li'):
-                t.name = 'div'
+        for r in soup.findAll(id=re.compile('photo-\d+')):
+            r.extract()
+        for r in soup.findAll(style=re.compile('float: ?left')):
+            r['class'] = 'singlephoto'
         return soup

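Note: the href check in preprocess_html ('http://' not in ... and 'https://' not in ...) is a hand-rolled test for relative links. An equivalent stdlib approach, shown here only as a sketch of the alternative, is urlparse.urljoin, which resolves relative hrefs against the site index and leaves absolute URLs untouched:

    # Sketch (Python 2): urljoin-based absolutizing; the example paths are illustrative.
    from urlparse import urljoin

    index = 'http://www.filmweb.pl'
    print urljoin(index, '/news/artykul')         # -> http://www.filmweb.pl/news/artykul
    print urljoin(index, 'http://example.com/x')  # absolute URLs pass through unchanged
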
@@ -13,7 +13,7 @@ class FocusRecipe(BasicNewsRecipe):
     title = u'Focus'
     publisher = u'Gruner + Jahr Polska'
     category = u'News'
-    description = u'Newspaper'
+    description = u'Focus.pl - pierwszy w Polsce portal społecznościowy dla miłośników nauki. Tematyka: nauka, historia, cywilizacja, technika, przyroda, sport, gadżety'
     category = 'magazine'
     cover_url = ''
     remove_empty_feeds = True

@@ -1,6 +1,5 @@
 from calibre.web.feeds.news import BasicNewsRecipe
 import re
-from calibre.ptempfile import PersistentTemporaryFile

 class ForeignAffairsRecipe(BasicNewsRecipe):
     ''' there are three modifications:
@@ -45,7 +44,6 @@ class ForeignAffairsRecipe(BasicNewsRecipe):
                'publisher': publisher}

     temp_files = []
-    articles_are_obfuscated = True

     def get_cover_url(self):
         soup = self.index_to_soup(self.FRONTPAGE)
@@ -53,20 +51,6 @@ class ForeignAffairsRecipe(BasicNewsRecipe):
         img_url = div.find('img')['src']
         return self.INDEX + img_url

-    def get_obfuscated_article(self, url):
-        br = self.get_browser()
-        br.open(url)
-
-        response = br.follow_link(url_regex = r'/print/[0-9]+', nr = 0)
-        html = response.read()
-
-        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
-        self.temp_files[-1].write(html)
-        self.temp_files[-1].close()
-
-        return self.temp_files[-1].name
-
-
     def parse_index(self):

         answer = []
@@ -89,10 +73,10 @@ class ForeignAffairsRecipe(BasicNewsRecipe):
                 if div.find('a') is not None:
                     originalauthor=self.tag_to_string(div.findNext('div', attrs = {'class':'views-field-field-article-book-nid'}).div.a)
                     title=subsectiontitle+': '+self.tag_to_string(div.span.a)+' by '+originalauthor
-                    url=self.INDEX+div.span.a['href']
+                    url=self.INDEX+self.index_to_soup(self.INDEX+div.span.a['href']).find('a', attrs={'class':'fa_addthis_print'})['href']
                     atr=div.findNext('div', attrs = {'class': 'views-field-field-article-display-authors-value'})
                     if atr is not None:
-                        author=self.tag_to_string(atr.span.a)
+                        author=self.tag_to_string(atr.span)
                     else:
                         author=''
                     desc=div.findNext('span', attrs = {'class': 'views-field-field-article-summary-value'})
@@ -106,10 +90,10 @@ class ForeignAffairsRecipe(BasicNewsRecipe):
             for div in sec.findAll('div', attrs = {'class': 'views-field-title'}):
                 if div.find('a') is not None:
                     title=self.tag_to_string(div.span.a)
-                    url=self.INDEX+div.span.a['href']
+                    url=self.INDEX+self.index_to_soup(self.INDEX+div.span.a['href']).find('a', attrs={'class':'fa_addthis_print'})['href']
                     atr=div.findNext('div', attrs = {'class': 'views-field-field-article-display-authors-value'})
                     if atr is not None:
-                        author=self.tag_to_string(atr.span.a)
+                        author=self.tag_to_string(atr.span)
                     else:
                         author=''
                     desc=div.findNext('span', attrs = {'class': 'views-field-field-article-summary-value'})

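Note: the new one-line url= expression in both parse_index loops is dense; unpacked, it performs one extra fetch per article to locate the printer-friendly link (an equivalent form of the same line, not a further change to the recipe):

    # Unpacked equivalent of the new url= expression above:
    article_soup = self.index_to_soup(self.INDEX + div.span.a['href'])        # fetch the article page
    print_link = article_soup.find('a', attrs={'class': 'fa_addthis_print'})  # its print-version anchor
    url = self.INDEX + print_link['href']                                     # download the print page instead

This is what lets the recipe drop articles_are_obfuscated and the get_obfuscated_article/PersistentTemporaryFile machinery removed above.
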
recipes/fortune_magazine.recipe (new file, 75 lines)
@@ -0,0 +1,75 @@
+from calibre.web.feeds.recipes import BasicNewsRecipe
+from collections import OrderedDict
+
+class Fortune(BasicNewsRecipe):
+
+    title = 'Fortune Magazine'
+    __author__ = 'Rick Shang'
+
+    description = 'FORTUNE is a global business magazine that has been revered in its content and credibility since 1930. FORTUNE covers the entire field of business, including specific companies and business trends, prominent business leaders, and new ideas shaping the global marketplace.'
+    language = 'en'
+    category = 'news'
+    encoding = 'UTF-8'
+    keep_only_tags = [dict(attrs={'id':['storycontent']})]
+    remove_tags = [dict(attrs={'class':['hed_side','socialMediaToolbarContainer']})]
+    no_javascript = True
+    no_stylesheets = True
+    needs_subscription = True
+
+    def get_browser(self):
+        br = BasicNewsRecipe.get_browser(self)
+        br.open('http://money.cnn.com/2013/03/21/smallbusiness/legal-marijuana-startups.pr.fortune/index.html')
+        br.select_form(name="paywall-form")
+        br['email'] = self.username
+        br['password'] = self.password
+        br.submit()
+        return br
+
+    def parse_index(self):
+        articles = []
+        soup0 = self.index_to_soup('http://money.cnn.com/magazines/fortune/')
+
+        #Go to the latestissue
+        soup = self.index_to_soup(soup0.find('div',attrs={'class':'latestissue'}).find('a',href=True)['href'])
+
+        #Find cover & date
+        cover_item = soup.find('div', attrs={'id':'cover-story'})
+        cover = cover_item.find('img',src=True)
+        self.cover_url = cover['src']
+        date = self.tag_to_string(cover_item.find('div', attrs={'class':'tocDate'})).strip()
+        self.timefmt = u' [%s]'%date
+
+
+        feeds = OrderedDict()
+        section_title = ''
+
+        #checkout the cover story
+        articles = []
+        coverstory=soup.find('div', attrs={'class':'cnnHeadline'})
+        title=self.tag_to_string(coverstory.a).strip()
+        url=coverstory.a['href']
+        desc=self.tag_to_string(coverstory.findNext('p', attrs={'class':'cnnBlurbTxt'}))
+        articles.append({'title':title, 'url':url, 'description':desc, 'date':''})
+        feeds['Cover Story'] = []
+        feeds['Cover Story'] += articles
+
+        for post in soup.findAll('div', attrs={'class':'cnnheader'}):
+            section_title = self.tag_to_string(post).strip()
+            articles = []
+
+            ul=post.findNext('ul')
+            for link in ul.findAll('li'):
+                links=link.find('h2')
+                title=self.tag_to_string(links.a).strip()
+                url=links.a['href']
+                desc=self.tag_to_string(link.find('p', attrs={'class':'cnnBlurbTxt'}))
+                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})
+
+            if articles:
+                if section_title not in feeds:
+                    feeds[section_title] = []
+                feeds[section_title] += articles
+
+        ans = [(key, val) for key, val in feeds.iteritems()]
+        return ans
+

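Note: the OrderedDict in parse_index only preserves section order; the contract calibre expects back from parse_index is a plain list of (section_title, articles) pairs, each article being a dict with title/url/description/date keys. A minimal sketch of that return shape (placeholder values):

    # Smallest valid parse_index() result, matching what the code above builds:
    def parse_index(self):
        articles = [{'title': 'Sample story', 'url': 'http://example.com/story',
                     'description': '', 'date': ''}]
        return [('Cover Story', articles)]
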
@@ -3,6 +3,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
 class Fotoblogia_pl(BasicNewsRecipe):
     title = u'Fotoblogia.pl'
     __author__ = 'fenuks'
+    description = u'Jeden z największych polskich blogów o fotografii.'
     category = 'photography'
     language = 'pl'
     masthead_url = 'http://img.interia.pl/komputery/nimg/u/0/fotoblogia21.jpg'
@@ -11,6 +12,6 @@ class Fotoblogia_pl(BasicNewsRecipe):
     max_articles_per_feed = 100
     no_stylesheets = True
     use_embedded_content = False
-    keep_only_tags=[dict(name='div', attrs={'class':'post-view post-standard'})]
+    keep_only_tags=[dict(name='div', attrs={'class':['post-view post-standard', 'photo-container']})]
     remove_tags=[dict(attrs={'class':['external fotoblogia', 'categories', 'tags']})]
     feeds = [(u'Wszystko', u'http://fotoblogia.pl/feed/rss2')]

@@ -18,6 +18,7 @@ class FrazPC(BasicNewsRecipe):
     max_articles_per_feed = 100
     use_embedded_content = False
     no_stylesheets = True
+    remove_empty_feeds = True
     cover_url='http://www.frazpc.pl/images/logo.png'
     feeds = [
         (u'Aktualno\u015bci', u'http://www.frazpc.pl/feed/aktualnosci'),

@@ -1,7 +1,7 @@
 #!/usr/bin/env python

 __license__ = 'GPL v3'
-__copyright__ = u'2010-2012, Tomasz Dlugosz <tomek3d@gmail.com>'
+__copyright__ = u'2010-2013, Tomasz Dlugosz <tomek3d@gmail.com>'
 '''
 fronda.pl
 '''
@@ -68,6 +68,7 @@ class Fronda(BasicNewsRecipe):
             article_url = 'http://www.fronda.pl' + article_a['href']
             article_title = self.tag_to_string(article_a)
             articles[genName].append( { 'title' : article_title, 'url' : article_url, 'date' : article_date })
+        if articles[genName]:
             feeds.append((genName, articles[genName]))
         return feeds

@@ -82,8 +83,10 @@ class Fronda(BasicNewsRecipe):
         dict(name='h3', attrs={'class':'block-header article comments'}),
         dict(name='ul', attrs={'class':'comment-list'}),
         dict(name='ul', attrs={'class':'category'}),
+        dict(name='ul', attrs={'class':'tag-list'}),
         dict(name='p', attrs={'id':'comments-disclaimer'}),
         dict(name='div', attrs={'style':'text-align: left; margin-bottom: 15px;'}),
-        dict(name='div', attrs={'style':'text-align: left; margin-top: 15px;'}),
+        dict(name='div', attrs={'style':'text-align: left; margin-top: 15px; margin-bottom: 30px;'}),
+        dict(name='div', attrs={'class':'related-articles content'}),
         dict(name='div', attrs={'id':'comment-form'})
         ]

recipes/gazeta_krakowska.recipe (new file, 34 lines)
@@ -0,0 +1,34 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class GazetaKrakowska(BasicNewsRecipe):
+    title = u'Gazeta Krakowska'
+    __author__ = 'fenuks'
+    description = u'Gazeta Regionalna Gazeta Krakowska. Najnowsze Wiadomości Kraków. Informacje Kraków. Czytaj!'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/gazetakrakowska.png?24'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
+    no_stylesheets = True
+    use_embedded_content = False
+    ignore_duplicate_articles = {'title', 'url'}
+    #preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
+    remove_tags_after= dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
+    remove_tags=[dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})]
+
+    feeds = [(u'Fakty24', u'http://gazetakrakowska.feedsportal.com/c/32980/f/533770/index.rss?201302'), (u'Krak\xf3w', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_krakow.xml?201302'), (u'Tarn\xf3w', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_tarnow.xml?201302'), (u'Nowy S\u0105cz', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_nsacz.xml?201302'), (u'Ma\u0142. Zach.', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_malzach.xml?201302'), (u'Podhale', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_podhale.xml?201302'), (u'Sport', u'http://gazetakrakowska.feedsportal.com/c/32980/f/533771/index.rss?201302'), (u'Kultura', u'http://gazetakrakowska.feedsportal.com/c/32980/f/533772/index.rss?201302'), (u'Opinie', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_opinie.xml?201302'), (u'Magazyn', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_magazyn.xml?201302')]
+
+    def print_version(self, url):
+        return url.replace('artykul', 'drukuj')
+
+    def skip_ad_pages(self, soup):
+        if 'Advertisement' in soup.title:
+            nexturl=soup.find('a')['href']
+            return self.index_to_soup(nexturl, raw=True)
+
+    def get_cover_url(self):
+        soup = self.index_to_soup('http://www.prasa24.pl/gazeta/gazeta-krakowska/')
+        self.cover_url=soup.find(id='pojemnik').img['src']
+        return getattr(self, 'cover_url', self.cover_url)

recipes/gazeta_lubuska.recipe (new file, 69 lines)
@@ -0,0 +1,69 @@
+import re
+from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import Comment
+
+class GazetaLubuska(BasicNewsRecipe):
+    title = u'Gazeta Lubuska'
+    __author__ = 'fenuks'
+    description = u'Gazeta Lubuska - portal regionalny województwa lubuskiego.'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    extra_css = 'ul {list-style: none; padding:0; margin:0;}'
+    INDEX = 'http://www.gazetalubuska.pl'
+    masthead_url = INDEX + '/images/top_logo.png'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
+    no_stylesheets = True
+    ignore_duplicate_articles = {'title', 'url'}
+
+    preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
+        (re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]
+
+    keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
+    remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
+        'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
+        'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
+        dict(attrs={'class':'articleFunctions'})]
+
+    feeds = [(u'Wszystkie', u'http://www.gazetalubuska.pl/rss.xml'), (u'Dreznenko', u'http://www.gazetalubuska.pl/drezdenko.xml'), (u'G\u0142og\xf3w', u'http://www.gazetalubuska.pl/glogow.xml'), (u'Gorz\xf3w Wielkopolski', u'http://www.gazetalubuska.pl/gorzow-wielkopolski.xml'), (u'Gubin', u'http://www.gazetalubuska.pl/gubin.xml'), (u'Kostrzyn', u'http://www.gazetalubuska.pl/kostrzyn.xml'), (u'Krosno Odrza\u0144skie', u'http://www.gazetalubuska.pl/krosno-odrzanskie.xml'), (u'Lubsko', u'http://www.gazetalubuska.pl/lubsko.xml'), (u'Mi\u0119dzych\xf3d', u'http://www.gazetalubuska.pl/miedzychod.xml'), (u'Mi\u0119dzyrzecz', u'http://www.gazetalubuska.pl/miedzyrzecz.xml'), (u'Nowa S\xf3l', u'http://www.gazetalubuska.pl/nowa-sol.xml'), (u'S\u0142ubice', u'http://www.gazetalubuska.pl/slubice.xml'), (u'Strzelce Kraje\u0144skie', u'http://www.gazetalubuska.pl/strzelce-krajenskie.xml'), (u'Sulech\xf3w', u'http://www.gazetalubuska.pl/sulechow.xml'), (u'Sul\u0119cin', u'http://www.gazetalubuska.pl/sulecin.xml'), (u'\u015awi\u0119bodzin', u'http://www.gazetalubuska.pl/swiebodzin.xml'), (u'Wolsztyn', u'http://www.gazetalubuska.pl/wolsztyn.xml'), (u'Wschowa', u'http://www.gazetalubuska.pl/wschowa.xml'), (u'Zielona G\xf3ra', u'http://www.gazetalubuska.pl/zielona-gora.xml'), (u'\u017baga\u0144', u'http://www.gazetalubuska.pl/zagan.xml'), (u'\u017bary', u'http://www.gazetalubuska.pl/zary.xml'), (u'Sport', u'http://www.gazetalubuska.pl/sport.xml'), (u'Auto', u'http://www.gazetalubuska.pl/auto.xml'), (u'Dom', u'http://www.gazetalubuska.pl/dom.xml'), (u'Praca', u'http://www.gazetalubuska.pl/praca.xml'), (u'Zdrowie', u'http://www.gazetalubuska.pl/zdrowie.xml')]
+
+
+    def get_cover_url(self):
+        soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
+        nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
+        soup = self.index_to_soup(nexturl)
+        self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
+        return getattr(self, 'cover_url', self.cover_url)
+
+    def append_page(self, soup, appendtag):
+        tag = soup.find('span', attrs={'class':'photoNavigationPages'})
+        if tag:
+            number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
+            baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]
+
+            for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
+                r.extract()
+            for nr in range(2, number+1):
+                soup2 = self.index_to_soup(baseurl + str(nr))
+                pagetext = soup2.find(id='photoContainer')
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoMeta'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoStoryText'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+
+        comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+        for comment in comments:
+            comment.extract()
+
+    def preprocess_html(self, soup):
+        self.append_page(soup, soup.body)
+        return soup

@@ -99,4 +99,3 @@ class gw_krakow(BasicNewsRecipe):
         if soup.find(id='container_gal'):
             self.gallery_article(soup.body)
         return soup
-

@@ -96,4 +96,3 @@ class gw_wawa(BasicNewsRecipe):
         if soup.find(id='container_gal'):
             self.gallery_article(soup.body)
         return soup
-

@@ -1,104 +1,96 @@
-#!/usr/bin/env python
-
-# # Przed uzyciem przeczytaj komentarz w sekcji "feeds"
-
-__license__ = 'GPL v3'
-__copyright__ = u'2010, Richard z forum.eksiazki.org'
-'''pomorska.pl'''
-
 import re
 from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import Comment
+
 class GazetaPomorska(BasicNewsRecipe):
     title = u'Gazeta Pomorska'
-    publisher = u'Gazeta Pomorska'
+    __author__ = 'Richard z forum.eksiazki.org, fenuks'
-    description = u'Kujawy i Pomorze - wiadomo\u015bci'
+    description = u'Gazeta Pomorska - portal regionalny'
+    category = 'newspaper'
     language = 'pl'
-    __author__ = u'Richard z forum.eksiazki.org'
+    encoding = 'iso-8859-2'
-    # # (dziekuje t3d z forum.eksiazki.org za testy)
+    extra_css = 'ul {list-style: none; padding:0; margin:0;}'
-    oldest_article = 2
+    INDEX = 'http://www.pomorska.pl'
-    max_articles_per_feed = 20
+    masthead_url = INDEX + '/images/top_logo.png'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
     no_stylesheets = True
-    remove_javascript = True
+    ignore_duplicate_articles = {'title', 'url'}
-    preprocess_regexps = [
-        (re.compile(r'<a href="http://maps.google[^>]*>[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: ''),
-        (re.compile(r'[<Bb >]*Poznaj opinie[^<]*[</Bb >]*[^<]*<a href[^>]*>[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: ''),
-        (re.compile(r'[<Bb >]*Przeczytaj[^<]*[</Bb >]*[^<]*<a href[^>]*>[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: ''),
-        (re.compile(r'[<Bb >]*Wi.cej informacji[^<]*[</Bb >]*[^<]*<a href[^>]*>[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: ''),
-        (re.compile(r'<a href[^>]*>[<Bb >]*Wideo[^<]*[</Bb >]*[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: ''),
-        (re.compile(r'<a href[^>]*>[<Bb >]*KLIKNIJ TUTAJ[^<]*[</Bb >]*[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: '')
-        ]
-
-    feeds = [
+    preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
-    # # Tutaj jest wymieniona lista kategorii jakie mozemy otrzymywac z Gazety
+        (re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]
-    # # Pomorskiej, po jednej kategorii w wierszu. Jesli na poczatku danego wiersza
-    # # znajduje sie jeden znak "#", oznacza to ze kategoria jest zakomentowana
-    # # i nie bedziemy jej otrzymywac. Jesli chcemy ja otrzymywac nalezy usunac
-    # # znak # z jej wiersza.
-    # # Jesli subskrybujemy wiecej niz jedna kategorie, na koncu wiersza z kazda
-    # # kategoria musi sie znajdowac niezakomentowany przecinek, z wyjatkiem
-    # # ostatniego wiersza - ma byc bez przecinka na koncu.
-    # # Rekomendowane opcje wyboru kategorii:
-    # # 1. PomorskaRSS - wiadomosci kazdego typu, lub
-    # # 2. Region + wybrane miasta, lub
-    # # 3. Wiadomosci tematyczne.
-    # # Lista kategorii:
-
-    # # PomorskaRSS - wiadomosci kazdego typu, zakomentuj znakiem "#"
+    keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
-    # # przed odkomentowaniem wiadomosci wybranego typu:
+    remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
-    (u'PomorskaRSS', u'http://www.pomorska.pl/rss.xml')
+        'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
+        'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
+        dict(attrs={'class':'articleFunctions'})]

-    # # wiadomosci z regionu nie przypisane do okreslonego miasta:
+    feeds = [(u'Wszystkie', u'http://www.pomorska.pl/rss.xml'),
-    # (u'Region', u'http://www.pomorska.pl/region.xml'),
+        (u'Region', u'http://www.pomorska.pl/region.xml'),
+        (u'Bydgoszcz', u'http://www.pomorska.pl/bydgoszcz.xml'),
-    # # wiadomosci przypisane do miast:
+        (u'Nakło', u'http://www.pomorska.pl/naklo.xml'),
-    # (u'Bydgoszcz', u'http://www.pomorska.pl/bydgoszcz.xml'),
+        (u'Koronowo', u'http://www.pomorska.pl/koronowo.xml'),
-    # (u'Nak\u0142o', u'http://www.pomorska.pl/naklo.xml'),
+        (u'Solec Kujawski', u'http://www.pomorska.pl/soleckujawski.xml'),
-    # (u'Koronowo', u'http://www.pomorska.pl/koronowo.xml'),
+        (u'Grudziądz', u'http://www.pomorska.pl/grudziadz.xml'),
-    # (u'Solec Kujawski', u'http://www.pomorska.pl/soleckujawski.xml'),
+        (u'Inowrocław', u'http://www.pomorska.pl/inowroclaw.xml'),
-    # (u'Grudzi\u0105dz', u'http://www.pomorska.pl/grudziadz.xml'),
+        (u'Toruń', u'http://www.pomorska.pl/torun.xml'),
-    # (u'Inowroc\u0142aw', u'http://www.pomorska.pl/inowroclaw.xml'),
+        (u'Włocławek', u'http://www.pomorska.pl/wloclawek.xml'),
-    # (u'Toru\u0144', u'http://www.pomorska.pl/torun.xml'),
+        (u'Aleksandrów Kujawski', u'http://www.pomorska.pl/aleksandrow.xml'),
-    # (u'W\u0142oc\u0142awek', u'http://www.pomorska.pl/wloclawek.xml'),
+        (u'Brodnica', u'http://www.pomorska.pl/brodnica.xml'),
-    # (u'Aleksandr\u00f3w Kujawski', u'http://www.pomorska.pl/aleksandrow.xml'),
+        (u'Chełmno', u'http://www.pomorska.pl/chelmno.xml'),
-    # (u'Brodnica', u'http://www.pomorska.pl/brodnica.xml'),
+        (u'Chojnice', u'http://www.pomorska.pl/chojnice.xml'),
-    # (u'Che\u0142mno', u'http://www.pomorska.pl/chelmno.xml'),
+        (u'Ciechocinek', u'http://www.pomorska.pl/ciechocinek.xml'),
-    # (u'Chojnice', u'http://www.pomorska.pl/chojnice.xml'),
+        (u'Golub-Dobrzyń', u'http://www.pomorska.pl/golubdobrzyn.xml'),
-    # (u'Ciechocinek', u'http://www.pomorska.pl/ciechocinek.xml'),
+        (u'Mogilno', u'http://www.pomorska.pl/mogilno.xml'),
-    # (u'Golub Dobrzy\u0144', u'http://www.pomorska.pl/golubdobrzyn.xml'),
+        (u'Radziejów', u'http://www.pomorska.pl/radziejow.xml'),
-    # (u'Mogilno', u'http://www.pomorska.pl/mogilno.xml'),
+        (u'Rypin', u'http://www.pomorska.pl/rypin.xml'),
-    # (u'Radziej\u00f3w', u'http://www.pomorska.pl/radziejow.xml'),
+        (u'Sępólno', u'http://www.pomorska.pl/sepolno.xml'),
-    # (u'Rypin', u'http://www.pomorska.pl/rypin.xml'),
+        (u'Świecie', u'http://www.pomorska.pl/swiecie.xml'),
-    # (u'S\u0119p\u00f3lno', u'http://www.pomorska.pl/sepolno.xml'),
+        (u'Tuchola', u'http://www.pomorska.pl/tuchola.xml'),
-    # (u'\u015awiecie', u'http://www.pomorska.pl/swiecie.xml'),
+        (u'Żnin', u'http://www.pomorska.pl/znin.xml'),
-    # (u'Tuchola', u'http://www.pomorska.pl/tuchola.xml'),
+        (u'Sport', u'http://www.pomorska.pl/sport.xml'),
-    # (u'\u017bnin', u'http://www.pomorska.pl/znin.xml')
+        (u'Zdrowie', u'http://www.pomorska.pl/zdrowie.xml'),
+        (u'Auto', u'http://www.pomorska.pl/moto.xml'),
-    # # wiadomosci tematyczne (redundancja z region/miasta):
+        (u'Dom', u'http://www.pomorska.pl/dom.xml'),
-    # (u'Sport', u'http://www.pomorska.pl/sport.xml'),
-    # (u'Zdrowie', u'http://www.pomorska.pl/zdrowie.xml'),
-    # (u'Auto', u'http://www.pomorska.pl/moto.xml'),
-    # (u'Dom', u'http://www.pomorska.pl/dom.xml'),
     #(u'Reporta\u017c', u'http://www.pomorska.pl/reportaz.xml'),
-    # (u'Gospodarka', u'http://www.pomorska.pl/gospodarka.xml')
+        (u'Gospodarka', u'http://www.pomorska.pl/gospodarka.xml')]
-    ]

-    keep_only_tags = [dict(name='div', attrs={'id':'article'})]
+    def get_cover_url(self):
+        soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
+        nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
+        soup = self.index_to_soup(nexturl)
+        self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
+        return getattr(self, 'cover_url', self.cover_url)

-    remove_tags = [
+    def append_page(self, soup, appendtag):
-        dict(name='p', attrs={'id':'articleTags'}),
+        tag = soup.find('span', attrs={'class':'photoNavigationPages'})
-        dict(name='div', attrs={'id':'articleEpaper'}),
+        if tag:
-        dict(name='div', attrs={'id':'articleConnections'}),
+            number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
-        dict(name='div', attrs={'class':'articleFacts'}),
+            baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]
-        dict(name='div', attrs={'id':'articleExternalLink'}),
-        dict(name='div', attrs={'id':'articleMultimedia'}),
-        dict(name='div', attrs={'id':'articleGalleries'}),
-        dict(name='div', attrs={'id':'articleAlarm'}),
-        dict(name='div', attrs={'id':'adholder_srodek1'}),
-        dict(name='div', attrs={'id':'articleVideo'}),
-        dict(name='a', attrs={'name':'fb_share'})]
-
-    extra_css = '''h1 { font-size: 1.4em; }
+            for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
-        h2 { font-size: 1.0em; }'''
+                r.extract()
+            for nr in range(2, number+1):
+                soup2 = self.index_to_soup(baseurl + str(nr))
+                pagetext = soup2.find(id='photoContainer')
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoMeta'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoStoryText'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+
+        comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+        for comment in comments:
+            comment.extract()
+
+    def preprocess_html(self, soup):
+        self.append_page(soup, soup.body)
+        return soup

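Note on the pagination arithmetic shared by these append_page implementations: the photoNavigationPages span holds text like '1 / 7', so rpartition('/') isolates the page count, and the 'next' link minus its final character (its page number) becomes the base URL for pages 2..N. A sketch with invented sample values:

    # String handling behind append_page; both sample strings are illustrative.
    pages_label = '1 / 7'
    number = int(pages_label.rpartition('/')[-1].replace(' ', ''))  # -> 7
    next_href = '/gallery?ArtNo=123&Page=2'
    baseurl = next_href[:-1]                                        # strip the trailing page digit
    urls = [baseurl + str(nr) for nr in range(2, number + 1)]       # pages 2..7
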
recipes/gazeta_wroclawska.recipe (new file, 34 lines)
@@ -0,0 +1,34 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class GazetaWroclawska(BasicNewsRecipe):
+    title = u'Gazeta Wroc\u0142awska'
+    __author__ = 'fenuks'
+    description = u'Gazeta Regionalna Gazeta Wrocławska. Najnowsze Wiadomości Wrocław, Informacje Wrocław. Czytaj!'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/gazetawroclawska.png?24'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
+    no_stylesheets = True
+    use_embedded_content = False
+    ignore_duplicate_articles = {'title', 'url'}
+    #preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
+    remove_tags_after= dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
+    remove_tags=[dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})]
+
+    feeds = [(u'Fakty24', u'http://gazetawroclawska.feedsportal.com/c/32980/f/533775/index.rss?201302'), (u'Region', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_region.xml?201302'), (u'Kultura', u'http://gazetawroclawska.feedsportal.com/c/32980/f/533777/index.rss?201302'), (u'Sport', u'http://gazetawroclawska.feedsportal.com/c/32980/f/533776/index.rss?201302'), (u'Z archiwum', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_zarchiwum.xml?201302'), (u'M\xf3j reporter', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_mojreporter.xml?201302'), (u'Historia', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_historia.xml?201302'), (u'Listy do redakcji', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_listydoredakcji.xml?201302'), (u'Na drogach', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_nadrogach.xml?201302')]
+
+    def print_version(self, url):
+        return url.replace('artykul', 'drukuj')
+
+    def skip_ad_pages(self, soup):
+        if 'Advertisement' in soup.title:
+            nexturl=soup.find('a')['href']
+            return self.index_to_soup(nexturl, raw=True)
+
+    def get_cover_url(self):
+        soup = self.index_to_soup('http://www.prasa24.pl/gazeta/gazeta-wroclawska/')
+        self.cover_url=soup.find(id='pojemnik').img['src']
+        return getattr(self, 'cover_url', self.cover_url)

recipes/gazeta_wspolczesna.recipe (new file, 68 lines)
@@ -0,0 +1,68 @@
+import re
+from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import Comment
+
+class GazetaWspolczesna(BasicNewsRecipe):
+    title = u'Gazeta Wsp\xf3\u0142czesna'
+    __author__ = 'fenuks'
+    description = u'Gazeta Współczesna - portal regionalny.'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    extra_css = 'ul {list-style: none; padding:0; margin:0;}'
+    INDEX = 'http://www.wspolczesna.pl'
+    masthead_url = INDEX + '/images/top_logo.png'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
+    no_stylesheets = True
+    ignore_duplicate_articles = {'title', 'url'}
+
+    preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
+        (re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]
+
+    keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
+    remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
+        'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
+        'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
+        dict(attrs={'class':'articleFunctions'})]
+
+    feeds = [(u'Wszystkie', u'http://www.wspolczesna.pl/rss.xml'), (u'August\xf3w', u'http://www.wspolczesna.pl/augustow.xml'), (u'Bia\u0142ystok', u'http://www.wspolczesna.pl/bialystok.xml'), (u'Bielsk Podlaski', u'http://www.wspolczesna.pl/bielsk.xml'), (u'E\u0142k', u'http://www.wspolczesna.pl/elk.xml'), (u'Grajewo', u'http://www.wspolczesna.pl/grajewo.xml'), (u'Go\u0142dap', u'http://www.wspolczesna.pl/goldap.xml'), (u'Hajn\xf3wka', u'http://www.wspolczesna.pl/hajnowka.xml'), (u'Kolno', u'http://www.wspolczesna.pl/kolno.xml'), (u'\u0141om\u017ca', u'http://www.wspolczesna.pl/lomza.xml'), (u'Mo\u0144ki', u'http://www.wspolczesna.pl/monki.xml'), (u'Olecko', u'http://www.wspolczesna.pl/olecko.xml'), (u'Ostro\u0142\u0119ka', u'http://www.wspolczesna.pl/ostroleka.xml'), (u'Powiat Bia\u0142ostocki', u'http://www.wspolczesna.pl/powiat.xml'), (u'Sejny', u'http://www.wspolczesna.pl/sejny.xml'), (u'Siemiatycze', u'http://www.wspolczesna.pl/siemiatycze.xml'), (u'Sok\xf3\u0142ka', u'http://www.wspolczesna.pl/sokolka.xml'), (u'Suwa\u0142ki', u'http://www.wspolczesna.pl/suwalki.xml'), (u'Wysokie Mazowieckie', u'http://www.wspolczesna.pl/wysokie.xml'), (u'Zambr\xf3w', u'http://www.wspolczesna.pl/zambrow.xml'), (u'Sport', u'http://www.wspolczesna.pl/sport.xml'), (u'Praca', u'http://www.wspolczesna.pl/praca.xml'), (u'Dom', u'http://www.wspolczesna.pl/dom.xml'), (u'Auto', u'http://www.wspolczesna.pl/auto.xml'), (u'Zdrowie', u'http://www.wspolczesna.pl/zdrowie.xml')]
+
+    def get_cover_url(self):
+        soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
+        nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
+        soup = self.index_to_soup(nexturl)
+        self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
+        return getattr(self, 'cover_url', self.cover_url)
+
+    def append_page(self, soup, appendtag):
+        tag = soup.find('span', attrs={'class':'photoNavigationPages'})
+        if tag:
+            number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
+            baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]
+
+            for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
+                r.extract()
+            for nr in range(2, number+1):
+                soup2 = self.index_to_soup(baseurl + str(nr))
+                pagetext = soup2.find(id='photoContainer')
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoMeta'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoStoryText'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+
+        comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+        for comment in comments:
+            comment.extract()
+
+    def preprocess_html(self, soup):
+        self.append_page(soup, soup.body)
+        return soup

@@ -1,12 +1,12 @@
 # -*- coding: utf-8 -*-
 from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import Comment

 class Gazeta_Wyborcza(BasicNewsRecipe):
     title = u'Gazeta.pl'
     __author__ = 'fenuks, Artur Stachecki'
     language = 'pl'
-    description = 'news from gazeta.pl'
+    description = 'Wiadomości z Polski i ze świata. Serwisy tematyczne i lokalne w 20 miastach.'
     category = 'newspaper'
     publication_type = 'newspaper'
     masthead_url = 'http://bi.gazeta.pl/im/5/10285/z10285445AA.jpg'
@@ -16,6 +16,7 @@ class Gazeta_Wyborcza(BasicNewsRecipe):
     max_articles_per_feed = 100
     remove_javascript = True
     no_stylesheets = True
+    ignore_duplicate_articles = {'title', 'url'}
     remove_tags_before = dict(id='k0')
     remove_tags_after = dict(id='banP4')
     remove_tags = [dict(name='div', attrs={'class':'rel_box'}), dict(attrs={'class':['date', 'zdjP', 'zdjM', 'pollCont', 'rel_video', 'brand', 'txt_upl']}), dict(name='div', attrs={'id':'footer'})]
@@ -48,6 +49,9 @@ class Gazeta_Wyborcza(BasicNewsRecipe):
             url = self.INDEX + link['href']
             soup2 = self.index_to_soup(url)
             pagetext = soup2.find(id='artykul')
+            comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
+            for comment in comments:
+                comment.extract()
             pos = len(appendtag.contents)
             appendtag.insert(pos, pagetext)
             tag = soup2.find('div', attrs={'id': 'Str'})
@@ -65,6 +69,9 @@ class Gazeta_Wyborcza(BasicNewsRecipe):
         nexturl = pagetext.find(id='gal_btn_next')
         if nexturl:
             nexturl = nexturl.a['href']
+            comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
+            for comment in comments:
+                comment.extract()
             pos = len(appendtag.contents)
             appendtag.insert(pos, pagetext)
             rem = appendtag.find(id='gal_navi')
@@ -105,3 +112,7 @@ class Gazeta_Wyborcza(BasicNewsRecipe):
         soup = self.index_to_soup('http://wyborcza.pl/' + cover.contents[3].a['href'])
         self.cover_url = 'http://wyborcza.pl' + soup.img['src']
         return getattr(self, 'cover_url', self.cover_url)
+
+    '''def image_url_processor(self, baseurl, url):
+        print "@@@@@@@@", url
+        return url.replace('http://wyborcza.pl/ ', '')'''

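Note: the triple-quoted block left at the end is a disabled debugging stub for calibre's image_url_processor hook; when defined, the hook is called with every image URL before download and its return value is used in place of the original URL. The enabled form, minus the debug print, would be:

    # The stub above, activated (hook signature per BasicNewsRecipe):
    def image_url_processor(self, baseurl, url):
        return url.replace('http://wyborcza.pl/ ', '')
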
recipes/gcn.recipe (new file, 88 lines)
@@ -0,0 +1,88 @@
+import re
+from calibre.web.feeds.news import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import Comment
+
+class GCN(BasicNewsRecipe):
+    title = u'Gazeta Codziennej Nowiny'
+    __author__ = 'fenuks'
+    description = u'nowiny24.pl - portal regionalny województwa podkarpackiego.'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    extra_css = 'ul {list-style: none; padding:0; margin:0;}'
+    INDEX = 'http://www.nowiny24.pl'
+    masthead_url = INDEX + '/images/top_logo.png'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
+    no_stylesheets = True
+    ignore_duplicate_articles = {'title', 'url'}
+    remove_attributes = ['style']
+    preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
+        (re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]
+
+    keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
+    remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
+        'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
+        'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
+        dict(attrs={'class':'articleFunctions'})]
+
+    feeds = [(u'Wszystkie', u'http://www.nowiny24.pl/rss.xml'),
+        (u'Podkarpacie', u'http://www.nowiny24.pl/podkarpacie.xml'),
+        (u'Bieszczady', u'http://www.nowiny24.pl/bieszczady.xml'),
+        (u'Rzeszów', u'http://www.nowiny24.pl/rzeszow.xml'),
+        (u'Przemyśl', u'http://www.nowiny24.pl/przemysl.xml'),
+        (u'Leżajsk', u'http://www.nowiny24.pl/lezajsk.xml'),
+        (u'Łańcut', u'http://www.nowiny24.pl/lancut.xml'),
+        (u'Dębica', u'http://www.nowiny24.pl/debica.xml'),
+        (u'Jarosław', u'http://www.nowiny24.pl/jaroslaw.xml'),
+        (u'Krosno', u'http://www.nowiny24.pl/krosno.xml'),
+        (u'Mielec', u'http://www.nowiny24.pl/mielec.xml'),
+        (u'Nisko', u'http://www.nowiny24.pl/nisko.xml'),
+        (u'Sanok', u'http://www.nowiny24.pl/sanok.xml'),
+        (u'Stalowa Wola', u'http://www.nowiny24.pl/stalowawola.xml'),
+        (u'Tarnobrzeg', u'http://www.nowiny24.pl/tarnobrzeg.xml'),
+        (u'Sport', u'http://www.nowiny24.pl/sport.xml'),
+        (u'Dom', u'http://www.nowiny24.pl/dom.xml'),
+        (u'Auto', u'http://www.nowiny24.pl/auto.xml'),
+        (u'Praca', u'http://www.nowiny24.pl/praca.xml'),
+        (u'Zdrowie', u'http://www.nowiny24.pl/zdrowie.xml'),
+        (u'Wywiady', u'http://www.nowiny24.pl/wywiady.xml')]
+
+    def get_cover_url(self):
+        soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
+        nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
+        soup = self.index_to_soup(nexturl)
+        self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
+        return getattr(self, 'cover_url', self.cover_url)
+
+    def append_page(self, soup, appendtag):
+        tag = soup.find('span', attrs={'class':'photoNavigationPages'})
+        if tag:
+            number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
+            baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]
+
+            for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
+                r.extract()
+            for nr in range(2, number+1):
+                soup2 = self.index_to_soup(baseurl + str(nr))
+                pagetext = soup2.find(id='photoContainer')
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoMeta'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                pagetext = soup2.find(attrs={'class':'photoStoryText'})
+                if pagetext:
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+
+        comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+        for comment in comments:
+            comment.extract()
+
+    def preprocess_html(self, soup):
+        self.append_page(soup, soup.body)
+        return soup

@@ -15,6 +15,7 @@ class Gildia(BasicNewsRecipe):
     no_stylesheets = True
     ignore_duplicate_articles = {'title', 'url'}
     preprocess_regexps = [(re.compile(ur'</?sup>'), lambda match: '') ]
+    ignore_duplicate_articles = {'title', 'url'}
     remove_tags = [dict(name='div', attrs={'class':'backlink'}), dict(name='div', attrs={'class':'im_img'}), dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'})]
     keep_only_tags = dict(name='div', attrs={'class':'widetext'})
     feeds = [(u'Gry', u'http://www.gry.gildia.pl/rss'), (u'Literatura', u'http://www.literatura.gildia.pl/rss'), (u'Film', u'http://www.film.gildia.pl/rss'), (u'Horror', u'http://www.horror.gildia.pl/rss'), (u'Konwenty', u'http://www.konwenty.gildia.pl/rss'), (u'Plansz\xf3wki', u'http://www.planszowki.gildia.pl/rss'), (u'Manga i anime', u'http://www.manga.gildia.pl/rss'), (u'Star Wars', u'http://www.starwars.gildia.pl/rss'), (u'Techno', u'http://www.techno.gildia.pl/rss'), (u'Historia', u'http://www.historia.gildia.pl/rss'), (u'Magia', u'http://www.magia.gildia.pl/rss'), (u'Bitewniaki', u'http://www.bitewniaki.gildia.pl/rss'), (u'RPG', u'http://www.rpg.gildia.pl/rss'), (u'LARP', u'http://www.larp.gildia.pl/rss'), (u'Muzyka', u'http://www.muzyka.gildia.pl/rss'), (u'Nauka', u'http://www.nauka.gildia.pl/rss')]
@@ -34,7 +35,7 @@ class Gildia(BasicNewsRecipe):
 
     def preprocess_html(self, soup):
         for a in soup('a'):
-            if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
+            if a.has_key('href') and not a['href'].startswith('http'):
                 if '/gry/' in a['href']:
                     a['href']='http://www.gry.gildia.pl' + a['href']
                 elif u'książk' in soup.title.string.lower() or u'komiks' in soup.title.string.lower():
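A note on the condition rewritten above: the old test rejected only hrefs containing 'http://' or 'https://', while `startswith('http')` accepts both schemes in one check before the recipe prepends the section host. A minimal standalone sketch of that normalization (the function name and default base URL are illustrative, not part of the recipe):

    # Sketch: absolutize a site-relative href the way the hunk above does.
    def absolutize(href, base='http://www.gry.gildia.pl'):
        if href.startswith('http'):   # covers both http:// and https://
            return href
        return base + href

    # absolutize('/gry/newsy/1') -> 'http://www.gry.gildia.pl/gry/newsy/1'
    # absolutize('https://example.com/x') -> returned unchanged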
recipes/glos_wielkopolski.recipe (new file, 34 lines)
@@ -0,0 +1,34 @@
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class GlosWielkopolski(BasicNewsRecipe):
+    title = u'G\u0142os Wielkopolski'
+    __author__ = 'fenuks'
+    description = u'Gazeta Regionalna Głos Wielkopolski. Najnowsze Wiadomości Poznań. Czytaj Informacje Poznań!'
+    category = 'newspaper'
+    language = 'pl'
+    encoding = 'iso-8859-2'
+    masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/gloswielkopolski.png?24'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
+    no_stylesheets = True
+    use_embedded_content = False
+    ignore_duplicate_articles = {'title', 'url'}
+    #preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
+    remove_tags_after = dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
+    remove_tags = [dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})]
+
+    feeds = [(u'Wszystkie', u'http://gloswielkopolski.feedsportal.com/c/32980/f/533779/index.rss?201302'), (u'Wiadomo\u015bci', u'http://gloswielkopolski.feedsportal.com/c/32980/f/533780/index.rss?201302'), (u'Sport', u'http://gloswielkopolski.feedsportal.com/c/32980/f/533781/index.rss?201302'), (u'Kultura', u'http://gloswielkopolski.feedsportal.com/c/32980/f/533782/index.rss?201302'), (u'Porady', u'http://www.gloswielkopolski.pl/rss/gloswielkopolski_porady.xml?201302'), (u'Blogi', u'http://www.gloswielkopolski.pl/rss/gloswielkopolski_blogi.xml?201302'), (u'Nasze akcje', u'http://www.gloswielkopolski.pl/rss/gloswielkopolski_naszeakcje.xml?201302'), (u'Opinie', u'http://www.gloswielkopolski.pl/rss/gloswielkopolski_opinie.xml?201302'), (u'Magazyn', u'http://www.gloswielkopolski.pl/rss/gloswielkopolski_magazyn.xml?201302')]
+
+    def print_version(self, url):
+        return url.replace('artykul', 'drukuj')
+
+    def skip_ad_pages(self, soup):
+        if 'Advertisement' in soup.title:
+            nexturl = soup.find('a')['href']
+            return self.index_to_soup(nexturl, raw=True)
+
+    def get_cover_url(self):
+        soup = self.index_to_soup('http://www.prasa24.pl/gazeta/glos-wielkopolski/')
+        self.cover_url = soup.find(id='pojemnik').img['src']
+        return getattr(self, 'cover_url', self.cover_url)
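The new recipe leans on `print_version`, the standard BasicNewsRecipe hook for printer-friendly pages; here the site is assumed to expose the print view at the same path with 'artykul' swapped for 'drukuj'. Roughly (the article URL below is invented for illustration):

    # Sketch: the URL rewrite performed by print_version above.
    url = 'http://www.gloswielkopolski.pl/artykul/123456,poznan,id,t.html'
    print_url = url.replace('artykul', 'drukuj')
    # -> 'http://www.gloswielkopolski.pl/drukuj/123456,poznan,id,t.html'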
recipes/gofin_pl.recipe (new file, 26 lines)
@@ -0,0 +1,26 @@
+#!/usr/bin/env python
+
+__license__ = 'GPL v3'
+__author__ = 'teepel <teepel44@gmail.com>'
+
+'''
+gofin.pl
+'''
+
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class gofin(BasicNewsRecipe):
+    title = u'Gofin'
+    __author__ = 'teepel <teepel44@gmail.com>'
+    language = 'pl'
+    description = u'Portal Podatkowo-Księgowy'
+    INDEX = 'http://gofin.pl'
+    oldest_article = 7
+    max_articles_per_feed = 100
+    remove_empty_feeds = True
+    simultaneous_downloads = 5
+    remove_javascript = True
+    no_stylesheets = True
+    auto_cleanup = True
+
+    feeds = [(u'Podatki', u'http://www.rss.gofin.pl/podatki.xml'), (u'Prawo Pracy', u'http://www.rss.gofin.pl/prawopracy.xml'), (u'Rachunkowo\u015b\u0107', u'http://www.rss.gofin.pl/rachunkowosc.xml'), (u'Sk\u0142adki, zasi\u0142ki, emerytury', u'http://www.rss.gofin.pl/zasilki.xml'), (u'Firma', u'http://www.rss.gofin.pl/firma.xml'), (u'Prawnik radzi', u'http://www.rss.gofin.pl/prawnikradzi.xml')]
@@ -2,7 +2,8 @@
 #!/usr/bin/env python
 
 __license__ = 'GPL v3'
-__copyright__ = '2011, Piotr Kontek, piotr.kontek@gmail.com'
+__copyright__ = '2011, Piotr Kontek, piotr.kontek@gmail.com \
+                2013, Tomasz Długosz, tomek3d@gmail.com'
 
 from calibre.web.feeds.news import BasicNewsRecipe
 from calibre.ptempfile import PersistentTemporaryFile
@@ -12,9 +13,9 @@ import re
 class GN(BasicNewsRecipe):
     EDITION = 0
 
-    __author__ = 'Piotr Kontek'
-    title = u'Gość niedzielny'
-    description = 'Weekly magazine'
+    __author__ = 'Piotr Kontek, Tomasz Długosz'
+    title = u'Gość Niedzielny'
+    description = 'Ogólnopolski tygodnik katolicki'
     encoding = 'utf-8'
     no_stylesheets = True
     language = 'pl'
@@ -38,17 +39,25 @@ class GN(BasicNewsRecipe):
         first = True
         for p in main_section.findAll('p', attrs={'class':None}, recursive=False):
             if first and p.find('img') != None:
-                article = article + '<p>'
-                article = article + str(p.find('img')).replace('src="/files/','src="http://www.gosc.pl/files/')
-                article = article + '<font size="-2">'
+                article += '<p>'
+                article += str(p.find('img')).replace('src="/files/','src="http://www.gosc.pl/files/')
+                article += '<font size="-2">'
                 for s in p.findAll('span'):
-                    article = article + self.tag_to_string(s)
-                article = article + '</font></p>'
+                    article += self.tag_to_string(s)
+                article += '</font></p>'
             else:
-                article = article + str(p).replace('src="/files/','src="http://www.gosc.pl/files/')
+                article += str(p).replace('src="/files/','src="http://www.gosc.pl/files/')
             first = False
+        limiter = main_section.find('p', attrs={'class' : 'limiter'})
+        if limiter:
+            article += str(limiter)
 
-        html = unicode(title) + unicode(authors) + unicode(article)
+        html = unicode(title)
+        #sometimes authors are not filled in:
+        if authors:
+            html += unicode(authors) + unicode(article)
+        else:
+            html += unicode(article)
 
         self.temp_files.append(PersistentTemporaryFile('_temparse.html'))
         self.temp_files[-1].write(html)
@@ -65,7 +74,8 @@ class GN(BasicNewsRecipe):
             if img != None:
                 a = img.parent
                 self.EDITION = a['href']
-                self.title = img['alt']
+                #this was preventing kindles from moving old issues to 'Back Issues' category:
+                #self.title = img['alt']
                 self.cover_url = 'http://www.gosc.pl' + img['src']
                 if year != date.today().year or not first:
                     break
@@ -1,5 +1,6 @@
 from calibre.web.feeds.news import BasicNewsRecipe
 from calibre.ebooks.BeautifulSoup import BeautifulSoup
+
 class Gram_pl(BasicNewsRecipe):
     title = u'Gram.pl'
     __author__ = 'fenuks'
@@ -11,14 +12,13 @@ class Gram_pl(BasicNewsRecipe):
     max_articles_per_feed = 100
     ignore_duplicate_articles = {'title', 'url'}
     no_stylesheets = True
+    remove_empty_feeds = True
     #extra_css = 'h2 {font-style: italic; font-size:20px;} .picbox div {float: left;}'
     cover_url = u'http://www.gram.pl/www/01/img/grampl_zima.png'
     keep_only_tags = [dict(id='articleModule')]
-    remove_tags = [dict(attrs={'class':['breadCrump', 'dymek', 'articleFooter']})]
+    remove_tags = [dict(attrs={'class':['breadCrump', 'dymek', 'articleFooter', 'twitter-share-button']}), dict(name='aside')]
     feeds = [(u'Informacje', u'http://www.gram.pl/feed_news.asp'),
-             (u'Publikacje', u'http://www.gram.pl/feed_news.asp?type=articles'),
-             (u'Kolektyw- Indie Games', u'http://indie.gram.pl/feed/'),
-             #(u'Kolektyw- Moto Games', u'http://www.motogames.gram.pl/news.rss')
+             (u'Publikacje', u'http://www.gram.pl/feed_news.asp?type=articles')
              ]
 
     def parse_feeds (self):
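For context on the widened remove_tags filter above: each entry is an attribute dict handed to BeautifulSoup's findAll, and giving 'class' a list removes any tag whose class equals one of the listed values. A self-contained illustration (the markup is invented):

    from calibre.ebooks.BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup('<div class="dymek">ad</div><div class="keep">text</div>')
    for tag in soup.findAll(attrs={'class': ['breadCrump', 'dymek', 'articleFooter']}):
        tag.extract()  # removes only the class="dymek" div here
    # str(soup) -> '<div class="keep">text</div>'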
@@ -1,9 +1,11 @@
+import time
 from calibre.web.feeds.recipes import BasicNewsRecipe
+from calibre.ebooks.BeautifulSoup import Comment
 
 class GryOnlinePl(BasicNewsRecipe):
     title = u'Gry-Online.pl'
     __author__ = 'fenuks'
-    description = 'Gry-Online.pl - computer games'
+    description = u'Wiadomości o grach, recenzje, zapowiedzi. Encyklopedia Gier zawiera opisy gier na PC, konsole Xbox360, PS3 i inne platformy.'
     category = 'games'
     language = 'pl'
     oldest_article = 13
@@ -12,9 +14,11 @@ class GryOnlinePl(BasicNewsRecipe):
     cover_url = 'http://www.gry-online.pl/im/gry-online-logo.png'
     max_articles_per_feed = 100
     no_stylesheets = True
-    keep_only_tags = [dict(name='div', attrs={'class':['gc660', 'gc660 S013']})]
+    keep_only_tags = [dict(name='div', attrs={'class':['gc660', 'gc660 S013', 'news_endpage_tit', 'news_container', 'news']})]
     remove_tags = [dict({'class':['nav-social', 'add-info', 'smlb', 'lista lista3 lista-gry', 'S013po', 'S013-npb', 'zm_gfx_cnt_bottom', 'ocen-txt', 'wiecej-txt', 'wiecej-txt2']})]
-    feeds = [(u'Newsy', 'http://www.gry-online.pl/rss/news.xml'), ('Teksty', u'http://www.gry-online.pl/rss/teksty.xml')]
+    feeds = [
+        (u'Newsy', 'http://www.gry-online.pl/rss/news.xml'),
+        ('Teksty', u'http://www.gry-online.pl/rss/teksty.xml')]
 
 
     def append_page(self, soup, appendtag):
@@ -24,17 +28,69 @@ class GryOnlinePl(BasicNewsRecipe):
             url_part = soup.find('link', attrs={'rel':'canonical'})['href']
             url_part = url_part[25:].rpartition('?')[0]
             for nexturl in nexturls[1:-1]:
-                soup2 = self.index_to_soup('http://www.gry-online.pl/' + url_part + nexturl['href'])
+                finalurl = 'http://www.gry-online.pl/' + url_part + nexturl['href']
+                for i in range(10):
+                    try:
+                        soup2 = self.index_to_soup(finalurl)
+                        break
+                    except:
+                        print 'retrying in 0.5s'
+                        time.sleep(0.5)
                 pagetext = soup2.find(attrs={'class':'gc660'})
                 for r in pagetext.findAll(name='header'):
                     r.extract()
                 for r in pagetext.findAll(attrs={'itemprop':'description'}):
                     r.extract()
 
                 pos = len(appendtag.contents)
                 appendtag.insert(pos, pagetext)
             for r in appendtag.findAll(attrs={'class':['n5p', 'add-info', 'twitter-share-button', 'lista lista3 lista-gry']}):
                 r.extract()
+            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+            for comment in comments:
+                comment.extract()
+        else:
+            tag = appendtag.find('div', attrs={'class':'S018stronyr'})
+            if tag:
+                nexturl = tag.a
+                url_part = soup.find('link', attrs={'rel':'canonical'})['href']
+                url_part = url_part[25:].rpartition('?')[0]
+                while tag:
+                    end = tag.find(attrs={'class':'right left-dead'})
+                    if end:
+                        break
+                    else:
+                        nexturl = tag.a
+                    finalurl = 'http://www.gry-online.pl/' + url_part + nexturl['href']
+                    for i in range(10):
+                        try:
+                            soup2 = self.index_to_soup(finalurl)
+                            break
+                        except:
+                            print 'retrying in 0.5s'
+                            time.sleep(0.5)
+                    tag = soup2.find('div', attrs={'class':'S018stronyr'})
+                    pagetext = soup2.find(attrs={'class':'gc660'})
+                    for r in pagetext.findAll(name='header'):
+                        r.extract()
+                    for r in pagetext.findAll(attrs={'itemprop':'description'}):
+                        r.extract()
+
+                    comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
+                    [comment.extract() for comment in comments]
+                    pos = len(appendtag.contents)
+                    appendtag.insert(pos, pagetext)
+                for r in appendtag.findAll(attrs={'class':['n5p', 'add-info', 'twitter-share-button', 'lista lista3 lista-gry', 'S018strony']}):
+                    r.extract()
+                comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
+                for comment in comments:
+                    comment.extract()
+
+    def image_url_processor(self, baseurl, url):
+        if url.startswith('..'):
+            return url[2:]
+        else:
+            return url
 
     def preprocess_html(self, soup):
         self.append_page(soup, soup.body)
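The hunk above inlines the same ten-attempt retry around every `index_to_soup` call. Factored out, the pattern looks roughly like this (a sketch only; the recipe keeps the loop inline, and note that the original's bare `except` leaves soup2 unbound if every attempt fails):

    import time

    def fetch_with_retries(fetch, url, attempts=10, delay=0.5):
        # Call fetch(url), retrying up to 'attempts' times with a short pause.
        for i in range(attempts):
            try:
                return fetch(url)
            except Exception:
                time.sleep(delay)
        return None  # every attempt failed

    # Inside a recipe this would be used roughly as:
    # soup2 = fetch_with_retries(self.index_to_soup, finalurl)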
@@ -1,5 +1,5 @@
 __license__ = 'GPL v3'
-__copyright__ = '2008-2012, Darko Miletic <darko.miletic at gmail.com>'
+__copyright__ = '2008-2013, Darko Miletic <darko.miletic at gmail.com>'
 '''
 harpers.org - paid subscription/ printed issue articles
 This recipe only get's article's published in text format
@@ -72,24 +72,25 @@ class Harpers_full(BasicNewsRecipe):
 
         #go to the current issue
         soup1 = self.index_to_soup(currentIssue_url)
-        date = re.split('\s\|\s',self.tag_to_string(soup1.head.title.string))[0]
+        currentIssue_title = self.tag_to_string(soup1.head.title.string)
+        date = re.split('\s\|\s',currentIssue_title)[0]
         self.timefmt = u' [%s]'%date
 
         #get cover
-        coverurl='http://harpers.org/wp-content/themes/harpers/ajax_microfiche.php?img=harpers-'+re.split('harpers.org/',currentIssue_url)[1]+'gif/0001.gif'
-        soup2 = self.index_to_soup(coverurl)
-        self.cover_url = self.tag_to_string(soup2.find('img')['src'])
+        self.cover_url = soup1.find('div', attrs = {'class':'picture_hp'}).find('img', src=True)['src']
         self.log(self.cover_url)
 
         articles = []
         count = 0
         for item in soup1.findAll('div', attrs={'class':'articleData'}):
             text_links = item.findAll('h2')
-            for text_link in text_links:
-                if count == 0:
-                    count = 1
-                else:
-                    url = text_link.a['href']
-                    title = text_link.a.contents[0]
-                    date = strftime(' %B %Y')
-                    articles.append({
-                        'title' :title
+            if text_links:
+                for text_link in text_links:
+                    if count == 0:
+                        count = 1
+                    else:
+                        url = text_link.a['href']
+                        title = self.tag_to_string(text_link.a)
+                        date = strftime(' %B %Y')
+                        articles.append({
+                            'title' :title
@@ -97,14 +98,9 @@ class Harpers_full(BasicNewsRecipe):
                             ,'url' :url
                             ,'description':''
                             })
-        return [(soup1.head.title.string, articles)]
+        return [(currentIssue_title, articles)]
 
     def print_version(self, url):
         return url + '?single=1'
-
-    def cleanup(self):
-        soup = self.index_to_soup('http://harpers.org/')
-        signouturl=self.tag_to_string(soup.find('li', attrs={'class':'subLogOut'}).findNext('li').a['href'])
-        self.log(signouturl)
-        self.browser.open(signouturl)
recipes/hatalska.recipe (new file, 27 lines)
@@ -0,0 +1,27 @@
+#!/usr/bin/env python
+
+__license__ = 'GPL v3'
+__copyright__ = 'teepel 2012'
+
+'''
+hatalska.com
+'''
+
+from calibre.web.feeds.news import BasicNewsRecipe
+
+class hatalska(BasicNewsRecipe):
+    title = u'Hatalska'
+    __author__ = 'teepel <teepel44@gmail.com>'
+    language = 'pl'
+    description = u'Blog specjalistki z branży mediowo-reklamowej - Natalii Hatalskiej'
+    oldest_article = 7
+    masthead_url = 'http://hatalska.com/wp-content/themes/jamel/images/logo.png'
+    max_articles_per_feed = 100
+    simultaneous_downloads = 5
+    remove_javascript = True
+    no_stylesheets = True
+
+    remove_tags = []
+    remove_tags.append(dict(name = 'div', attrs = {'class' : 'feedflare'}))
+
+    feeds = [(u'Blog', u'http://feeds.feedburner.com/hatalskacom')]
@@ -2,7 +2,6 @@ from __future__ import with_statement
 __license__ = 'GPL 3'
 __copyright__ = '2009, Kovid Goyal <kovid@kovidgoyal.net>'
 
-import time
 from calibre.web.feeds.news import BasicNewsRecipe
 
 class TheHindu(BasicNewsRecipe):
@@ -14,41 +13,42 @@ class TheHindu(BasicNewsRecipe):
     max_articles_per_feed = 100
     no_stylesheets = True
 
-    keep_only_tags = [dict(id='content')]
-    remove_tags = [dict(attrs={'class':['article-links', 'breadcr']}),
-            dict(id=['email-section', 'right-column', 'printfooter', 'topover',
-                'slidebox', 'th_footer'])]
+    auto_cleanup = True
 
     extra_css = '.photo-caption { font-size: smaller }'
 
-    def preprocess_raw_html(self, raw, url):
-        return raw.replace('<body><p>', '<p>').replace('</p></body>', '</p>')
-
-    def postprocess_html(self, soup, first_fetch):
-        for t in soup.findAll(['table', 'tr', 'td','center']):
-            t.name = 'div'
-        return soup
-
     def parse_index(self):
-        today = time.strftime('%Y-%m-%d')
-        soup = self.index_to_soup(
-            'http://www.thehindu.com/todays-paper/tp-index/?date=' + today)
-        div = soup.find(id='left-column')
-        feeds = []
+        soup = self.index_to_soup('http://www.thehindu.com/todays-paper/')
+        div = soup.find('div', attrs={'id':'left-column'})
+        soup.find(id='subnav-tpbar').extract()
 
         current_section = None
         current_articles = []
-        for x in div.findAll(['h3', 'div']):
-            if current_section and x.get('class', '') == 'tpaper':
-                a = x.find('a', href=True)
-                if a is not None:
-                    current_articles.append({'url':a['href']+'?css=print',
-                        'title':self.tag_to_string(a), 'date': '',
-                        'description':''})
-            if x.name == 'h3':
-                if current_section and current_articles:
+        feeds = []
+        for x in div.findAll(['a', 'span']):
+            if x.name == 'span' and x['class'] == 's-link':
+                # Section heading found
+                if current_articles and current_section:
                     feeds.append((current_section, current_articles))
                 current_section = self.tag_to_string(x)
                 current_articles = []
+                self.log('\tFound section:', current_section)
+            elif x.name == 'a':
+                title = self.tag_to_string(x)
+                url = x.get('href', False)
+                if not url or not title:
+                    continue
+                self.log('\t\tFound article:', title)
+                self.log('\t\t\t', url)
+                current_articles.append({'title': title, 'url':url,
+                    'description':'', 'date':''})
+
+        if current_articles and current_section:
+            feeds.append((current_section, current_articles))
+
         return feeds
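For reference, `parse_index` is expected to return a list of (section_title, articles) tuples, each article being a dict with at least 'title' and 'url'; the rewritten loop above builds exactly that shape. A toy value of the expected structure (contents invented):

    # Sketch: the structure parse_index returns.
    feeds = [
        ('Front Page', [
            {'title': 'Example story', 'url': 'http://example.com/a',
             'description': '', 'date': ''},
        ]),
    ]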
@@ -8,20 +8,15 @@ class Historia_org_pl(BasicNewsRecipe):
     category = 'history'
     language = 'pl'
     oldest_article = 8
+    extra_css = 'img {float: left; margin-right: 10px;} .alignleft {float: left; margin-right: 10px;}'
     remove_empty_feeds = True
     no_stylesheets = True
     use_embedded_content = True
     max_articles_per_feed = 100
     ignore_duplicate_articles = {'title', 'url'}
 
 
     feeds = [(u'Wszystkie', u'http://historia.org.pl/feed/'),
              (u'Wiadomości', u'http://historia.org.pl/Kategoria/wiadomosci/feed/'),
              (u'Publikacje', u'http://historia.org.pl/Kategoria/artykuly/feed/'),
              (u'Publicystyka', u'http://historia.org.pl/Kategoria/publicystyka/feed/'),
              (u'Recenzje', u'http://historia.org.pl/Kategoria/recenzje/feed/'),
              (u'Projekty', u'http://historia.org.pl/Kategoria/projekty/feed/'),]
-
-
-    def print_version(self, url):
-        return url + '?tmpl=component&print=1&layout=default&page='
@@ -1,6 +1,6 @@
-import re
-from calibre.web.feeds.recipes import BasicNewsRecipe
 from collections import OrderedDict
+import re
+from calibre.web.feeds.news import BasicNewsRecipe
 
 class HistoryToday(BasicNewsRecipe):
 
@@ -19,7 +19,6 @@ class HistoryToday(BasicNewsRecipe):
 
-
     needs_subscription = True
 
     def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
|
|||||||
|
|
||||||
#Go to issue
|
#Go to issue
|
||||||
soup = self.index_to_soup('http://www.historytoday.com/contents')
|
soup = self.index_to_soup('http://www.historytoday.com/contents')
|
||||||
cover = soup.find('div',attrs={'id':'content-area'}).find('img')['src']
|
cover = soup.find('div',attrs={'id':'content-area'}).find('img', attrs={'src':re.compile('.*cover.*')})['src']
|
||||||
self.cover_url=cover
|
self.cover_url=cover
|
||||||
|
self.log(self.cover_url)
|
||||||
|
|
||||||
#Go to the main body
|
#Go to the main body
|
||||||
|
|
||||||
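The cover lookup above now filters by src instead of taking the first image in the content area; a compiled regex passed as an attribute value makes BeautifulSoup match it against each candidate src. What the pattern accepts, checked in isolation (file names invented):

    import re

    pat = re.compile('.*cover.*')
    assert pat.match('/files/2013/02/cover.jpg')       # contains 'cover'
    assert not pat.match('/files/2013/02/banner.jpg')  # rejected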
@@ -84,4 +84,3 @@ class HistoryToday(BasicNewsRecipe):
 
     def cleanup(self):
         self.browser.open('http://www.historytoday.com/logout')
-
@@ -1,5 +1,4 @@
 from calibre.web.feeds.news import BasicNewsRecipe
-import re
 
 class HNonlineRecipe(BasicNewsRecipe):
     __license__ = 'GPL v3'
recipes/icons: binary file changes

  (existing icon replaced: 389 B -> 887 B)
  recipes/icons/bachormagazyn.png: new file (898 B)
  (existing icon replaced: 391 B -> 772 B)
  recipes/icons/biweekly.png: new file (603 B)
  recipes/icons/blog_biszopa.png: new file (755 B)
  (existing icon replaced: 837 B -> 364 B)
  (existing icon replaced: 24 KiB -> 1.3 KiB)
  recipes/icons/dwutygodnik.png: new file (603 B)
  recipes/icons/dziennik_baltycki.png: new file (865 B)
  recipes/icons/dziennik_lodzki.png: new file (461 B)
  (existing icon replaced: 481 B -> 1.1 KiB)
  recipes/icons/dziennik_wschodni.png: new file (414 B)
  recipes/icons/dziennik_zachodni.png: new file (431 B)
  recipes/icons/echo_dnia.png: new file (1.1 KiB)
  (existing icon replaced: 475 B -> 1.1 KiB)
  recipes/icons/emuzica_pl.png: new file (760 B)
  recipes/icons/esenja.png: new file (329 B)
  recipes/icons/esensja_(rss).png: new file (329 B)
  (existing icon replaced: 3.6 KiB -> 946 B)
  recipes/icons/film_org_pl.png: new file (762 B)
  (existing icon replaced: 3.4 KiB -> 2.2 KiB)
  (existing icon replaced: 991 B -> 737 B)
  recipes/icons/gazeta_krakowska.png: new file (398 B)
  recipes/icons/gazeta_lubuska.png: new file (1.1 KiB)
  recipes/icons/gazeta_wroclawska.png: new file (470 B)