sync with Kovid's branch
@@ -37,7 +37,9 @@ nbproject/
calibre_plugins/
recipes/.git
recipes/.gitignore
recipes/README
recipes/README.md
recipes/icon_checker.py
recipes/readme_updater.py
recipes/katalog_egazeciarz.recipe
recipes/tv_axnscifi.recipe
recipes/tv_comedycentral.recipe
@@ -60,6 +62,7 @@ recipes/tv_tvpkultura.recipe
recipes/tv_tvppolonia.recipe
recipes/tv_tvpuls.recipe
recipes/tv_viasathistory.recipe
recipes/icons/katalog_egazeciarz.png
recipes/icons/tv_axnscifi.png
recipes/icons/tv_comedycentral.png
recipes/icons/tv_discoveryscience.png
1679  Changelog.old.yaml
1979  Changelog.yaml
@@ -672,6 +672,7 @@ Some limitations of PDF input are:

  * Links and Tables of Contents are not supported
  * PDFs that use embedded non-unicode fonts to represent non-English characters will result in garbled output for those characters
  * Some PDFs are made up of photographs of the page with OCRed text behind them. In such cases |app| uses the OCRed text, which can be very different from what you see when you view the PDF file
  * PDFs that are used to display complex text, like right to left languages and math typesetting, will not convert correctly

To reiterate: **PDF is a really, really bad** format to use as input. If you absolutely must use PDF, then be prepared for an
output ranging anywhere from decent to unusable, depending on the input PDF.
@@ -39,27 +39,27 @@ All the |app| python code is in the ``calibre`` package. This package contains t

  * devices - All the device drivers. Just look through some of the built-in drivers to get an idea for how they work.

-     * For details, see: devices.interface which defines the interface supported by device drivers and devices.usbms which
+     * For details, see: devices.interface which defines the interface supported by device drivers and ``devices.usbms`` which
        defines a generic driver that connects to a USBMS device. All USBMS based drivers in |app| inherit from it.

  * ebooks - All the ebook conversion/metadata code. A good starting point is ``calibre.ebooks.conversion.cli`` which is the
-   module powering the :command:`ebook-convert` command. The conversion process is controlled via conversion.plumber.
-   The format independent code is all in ebooks.oeb and the format dependent code is in ebooks.format_name.
+   module powering the :command:`ebook-convert` command. The conversion process is controlled via ``conversion.plumber``.
+   The format independent code is all in ``ebooks.oeb`` and the format dependent code is in ``ebooks.format_name``.

- * Metadata reading, writing, and downloading is all in ebooks.metadata
+ * Metadata reading, writing, and downloading is all in ``ebooks.metadata``
  * Conversion happens in a pipeline, for the structure of the pipeline,
    see :ref:`conversion-introduction`. The pipeline consists of an input
    plugin, various transforms and an output plugin. The code that constructs
-   and drives the pipeline is in plumber.py. The pipeline works on a
+   and drives the pipeline is in :file:`plumber.py`. The pipeline works on a
    representation of an ebook that is like an unzipped epub, with
    manifest, spine, toc, guide, html content, etc. The
-   class that manages this representation is OEBBook in oeb/base.py. The
+   class that manages this representation is OEBBook in ``ebooks.oeb.base``. The
    various transformations that are applied to the book during
-   conversions live in `oeb/transforms/*.py`. And the input and output
-   plugins live in `conversion/plugins/*.py`.
+   conversions live in :file:`oeb/transforms/*.py`. And the input and output
+   plugins live in :file:`conversion/plugins/*.py`.
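The pipeline shape described in that bullet (input plugin, chain of transforms, output plugin) can be sketched in plain Python. This is a conceptual illustration only: `txt_input`, `upper_case_titles`, `html_output` and `run_pipeline` are names invented for the example, not calibre's actual Plumber API.

```python
# Conceptual sketch of a conversion pipeline: an input plugin parses the
# source into an intermediate book representation, a chain of transforms
# rewrites it, and an output plugin serializes the result.

def txt_input(data):
    # "Parse" raw text into a minimal book-like dict (a stand-in for OEBBook).
    return {'spine': data.split('\n\n'), 'metadata': {}}

def upper_case_titles(book):
    # A transform: rewrites the intermediate representation.
    book['metadata']['title'] = book['spine'][0].upper()
    return book

def html_output(book):
    # An output plugin: serializes the intermediate representation.
    body = ''.join('<p>%s</p>' % s for s in book['spine'])
    return '<html><head><title>%s</title></head><body>%s</body></html>' % (
        book['metadata'].get('title', ''), body)

def run_pipeline(data, input_plugin, transforms, output_plugin):
    book = input_plugin(data)
    for transform in transforms:
        book = transform(book)
    return output_plugin(book)

result = run_pipeline('hello\n\nworld', txt_input, [upper_case_titles], html_output)
```

The real pipeline works the same way, except that the intermediate representation is the OEBBook described above rather than a dict.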

- * library - The database back-end and the content server. See library.database2 for the interface to the |app| library. library.server is the |app| Content Server.
- * gui2 - The Graphical User Interface. GUI initialization happens in gui2.main and gui2.ui. The ebook-viewer is in gui2.viewer.
+ * library - The database back-end and the content server. See ``library.database2`` for the interface to the |app| library. ``library.server`` is the |app| Content Server.
+ * gui2 - The Graphical User Interface. GUI initialization happens in ``gui2.main`` and ``gui2.ui``. The ebook-viewer is in ``gui2.viewer``.

If you need help understanding the code, post in the `development forum <http://www.mobileread.com/forums/forumdisplay.php?f=240>`_
and you will most likely get help from one of |app|'s many developers.
@@ -250,42 +250,71 @@ If you don't want to uninstall it altogether, there are a couple of tricks you c
simplest is to simply re-name the executable file that launches the library program. More detail
`in the forums <http://www.mobileread.com/forums/showthread.php?t=65809>`_.

-How do I use |app| with my iPad/iPhone/iTouch?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+How do I use |app| with my iPad/iPhone/iPod touch?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Over the air
^^^^^^^^^^^^^^

-The easiest way to browse your |app| collection on your Apple device (iPad/iPhone/iPod) is by using the calibre content server, which makes your collection available over the net. First perform the following steps in |app|
+The easiest way to browse your |app| collection on your Apple device
+(iPad/iPhone/iPod) is by using the |app| content server, which makes your
+collection available over the net. First perform the following steps in |app|

-  * Set the Preferred Output Format in |app| to EPUB (The output format can be set under :guilabel:`Preferences->Interface->Behavior`)
-  * Set the output profile to iPad (this will work for iPhone/iPods as well), under :guilabel:`Preferences->Conversion->Common Options->Page Setup`
-  * Convert the books you want to read on your iPhone to EPUB format by selecting them and clicking the Convert button.
-  * Turn on the Content Server in |app|'s preferences and leave |app| running.
+  * Set the Preferred Output Format in |app| to EPUB (The output format can be
+    set under :guilabel:`Preferences->Interface->Behavior`)
+  * Set the output profile to iPad (this will work for iPhone/iPods as well),
+    under :guilabel:`Preferences->Conversion->Common Options->Page Setup`
+  * Convert the books you want to read on your iDevice to EPUB format by
+    selecting them and clicking the Convert button.
+  * Turn on the Content Server by clicking the :guilabel:`Connect/Share` button
+    and leave |app| running. You can also tell |app| to automatically start the
+    content server via :guilabel:`Preferences->Sharing over the net`.

-Now on your iPad/iPhone you have two choices, use either iBooks (version 1.2 and later) or Stanza (version 3.0 and later). Both are available free from the app store.
+There are many apps for your iDevice that can connect to |app|. Here we
+describe using two of them, iBooks and Stanza.

Using Stanza
***************

-Now you should be able to access your books on your iPhone by opening Stanza. Go to "Get Books" and then click the "Shared" tab. Under Shared you will see an entry "Books in calibre". If you don't, make sure your iPad/iPhone is connected using the WiFi network in your house, not 3G. If the |app| catalog is still not detected in Stanza, you can add it manually in Stanza. To do this, click the "Shared" tab, then click the "Edit" button and then click "Add book source" to add a new book source. In the Add Book Source screen enter whatever name you like and in the URL field, enter the following::
+You should be able to access your books on your iPhone by opening Stanza. Go to
+"Get Books" and then click the "Shared" tab. Under Shared you will see an entry
+"Books in calibre". If you don't, make sure your iPad/iPhone is connected using
+the WiFi network in your house, not 3G. If the |app| catalog is still not
+detected in Stanza, you can add it manually in Stanza. To do this, click the
+"Shared" tab, then click the "Edit" button and then click "Add book source" to
+add a new book source. In the Add Book Source screen enter whatever name you
+like and in the URL field, enter the following::

    http://192.168.1.2:8080/
-Replace ``192.168.1.2`` with the local IP address of the computer running |app|. If you have changed the port the |app| content server is running on, you will have to change ``8080`` as well to the new port. The local IP address is the IP address you computer is assigned on your home network. A quick Google search will tell you how to find out your local IP address. Now click "Save" and you are done.
+Replace ``192.168.1.2`` with the local IP address of the computer running
+|app|. If you have changed the port the |app| content server is running on, you
+will have to change ``8080`` as well to the new port. The local IP address is
+the IP address your computer is assigned on your home network. A quick Google
+search will tell you how to find out your local IP address. Now click "Save"
+and you are done.
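If you have Python handy, one quick way to find that local IP address is the standard socket trick below. This is a generic stdlib sketch, unrelated to calibre's own code; the 198.51.100.1 address is a reserved documentation address used only so the OS picks an outgoing interface, and no packet is actually sent.

```python
import socket

def local_ip():
    # "Connecting" a UDP socket sends nothing, but makes the OS choose the
    # outgoing interface; getsockname() then reports that interface's
    # local network address. Fall back to loopback if no route exists.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(('198.51.100.1', 9))
        return s.getsockname()[0]
    except OSError:
        return '127.0.0.1'
    finally:
        s.close()

print(local_ip())
```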

-If you get timeout errors while browsing the calibre catalog in Stanza, try increasing the connection timeout value in the stanza settings. Go to Info->Settings and increase the value of Download Timeout.
+If you get timeout errors while browsing the calibre catalog in Stanza, try
+increasing the connection timeout value in the Stanza settings. Go to
+Info->Settings and increase the value of Download Timeout.

Using iBooks
**************

-Start the Safari browser and type in the IP address and port of the computer running the calibre server, like this::
+Start the Safari browser and type in the IP address and port of the computer
+running the calibre server, like this::

    http://192.168.1.2:8080/

-Replace ``192.168.1.2`` with the local IP address of the computer running |app|. If you have changed the port the |app| content server is running on, you will have to change ``8080`` as well to the new port. The local IP address is the IP address you computer is assigned on your home network. A quick Google search will tell you how to find out your local IP address.
+Replace ``192.168.1.2`` with the local IP address of the computer running
+|app|. If you have changed the port the |app| content server is running on, you
+will have to change ``8080`` as well to the new port. The local IP address is
+the IP address your computer is assigned on your home network. A quick Google
+search will tell you how to find out your local IP address.

-You will see a list of books in Safari, just click on the epub link for whichever book you want to read, Safari will then prompt you to open it with iBooks.
+You will see a list of books in Safari; just click on the epub link for
+whichever book you want to read, and Safari will then prompt you to open it
+with iBooks.


With the USB cable + iTunes
@@ -550,9 +579,23 @@ Yes, you can. Follow the instructions in the answer above for adding custom colu

How do I move my |app| library from one computer to another?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Simply copy the |app| library folder from the old to the new computer. You can find out what the library folder is by clicking the calibre icon in the toolbar. The very first item is the path to the library folder. Now on the new computer, start |app| for the first time. It will run the Welcome Wizard asking you for the location of the |app| library. Point it to the previously copied folder. If the computer you are transferring to already has a calibre installation, then the Welcome wizard wont run. In that case, right-click the |app| icon in the tooolbar and point it to the newly copied directory. You will now have two calibre libraries on your computer and you can switch between them by clicking the |app| icon on the toolbar. Transferring your library in this manner preserver all your metadata, tags, custom columns, etc.
+Simply copy the |app| library folder from the old to the new computer. You can
+find out what the library folder is by clicking the calibre icon in the
+toolbar. The very first item is the path to the library folder. Now on the new
+computer, start |app| for the first time. It will run the Welcome Wizard asking
+you for the location of the |app| library. Point it to the previously copied
+folder. If the computer you are transferring to already has a calibre
+installation, then the Welcome Wizard won't run. In that case, right-click the
+|app| icon in the toolbar and point it to the newly copied directory. You will
+now have two |app| libraries on your computer and you can switch between them
+by clicking the |app| icon on the toolbar. Transferring your library in this
+manner preserves all your metadata, tags, custom columns, etc.

-Note that if you are transferring between different types of computers (for example Windows to OS X) then after doing the above you should also right-click the |app| icon on the tool bar, select Library Maintenance and run the Check Library action. It will warn you about any problems in your library, which you should fix by hand.
+Note that if you are transferring between different types of computers (for
+example Windows to OS X) then after doing the above you should also right-click
+the |app| icon on the toolbar, select Library Maintenance and run the Check
+Library action. It will warn you about any problems in your library, which you
+should fix by hand.

.. note:: A |app| library is just a folder which contains all the book files and their metadata. All the metadata is stored in a single file called metadata.db, in the top level folder. If this file gets corrupted, you may see an empty list of books in |app|. In this case you can ask |app| to restore your books by doing a right-click on the |app| icon in the toolbar and selecting Library Maintenance->Restore Library.
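metadata.db is a SQLite database, so it can be inspected with Python's standard `sqlite3` module. A minimal read-only sketch (the library path in the comment is a placeholder; do not write to metadata.db, and do not touch it at all while |app| is running):

```python
import sqlite3

def list_tables(db_path):
    # List the tables in a SQLite file such as calibre's metadata.db.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        return [r[0] for r in rows]
    finally:
        conn.close()

# Usage (placeholder path):
# print(list_tables('/path/to/Calibre Library/metadata.db'))
```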

@@ -587,7 +630,10 @@ or a Remote Desktop solution.

If you must share the actual library, use a file syncing tool like
DropBox or rsync or Microsoft SkyDrive instead of a networked drive. Even with
these tools there is danger of data corruption/loss, so only do this if you are
-willing to live with that risk.
+willing to live with that risk. In particular, be aware that **Google Drive**
+is incompatible with |app|; if you put your |app| library in Google Drive, you
+*will* suffer data loss. See
+`this thread <http://www.mobileread.com/forums/showthread.php?t=205581>`_ for details.

Content From The Web
---------------------
@@ -663,7 +709,7 @@ Post any output you see in a help message on the `Forum <http://www.mobileread.c

|app| freezes/crashes occasionally?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-There are five possible things I know of, that can cause this:
+There are several possible things I know of that can cause this:

  * You recently connected an external monitor or TV to your computer. In
    this case, whenever |app| opens a new window like the edit metadata
@@ -671,10 +717,6 @@ There are five possible things I know of, that can cause this:
    you don't notice it and so you think |app| has frozen. Disconnect your
    second monitor and restart calibre.

-  * You are using a Wacom branded USB mouse. There is an incompatibility between
-    Wacom mice and the graphics toolkit |app| uses. Try using a non-Wacom
-    mouse.

  * If you use RoboForm, it is known to cause |app| to crash. Add |app| to
    the blacklist of programs inside RoboForm to fix this. Or uninstall
    RoboForm.
@@ -685,6 +727,17 @@ There are five possible things I know of, that can cause this:
  * Constant Guard Protection by Xfinity causes crashes in |app|. You have to
    manually allow |app| in it or uninstall Constant Guard Protection.

+  * Spybot - Search & Destroy blocks |app| from accessing its temporary files,
+    breaking viewing and converting of books.

+  * You are using a Wacom branded USB mouse. There is an incompatibility between
+    Wacom mice and the graphics toolkit |app| uses. Try using a non-Wacom
+    mouse.

+  * On some 64 bit versions of Windows there are security software/settings
+    that prevent 64-bit |app| from working properly. If you are using the 64-bit
+    version of |app| try switching to the 32-bit version.

If none of the above apply to you, then there is some other program on your
computer that is interfering with |app|. First reboot your computer in safe
mode, to have as few running programs as possible, and see if the crashes still
@@ -531,12 +531,16 @@ Calibre has several keyboard shortcuts to save you time and mouse movement. Thes
      - Get Books
    * - :kbd:`I`
      - Show book details
    * - :kbd:`K`
      - Edit Table of Contents
    * - :kbd:`M`
      - Merge selected records
    * - :kbd:`Alt+M`
      - Merge selected records, keeping originals
    * - :kbd:`O`
      - Open containing folder
    * - :kbd:`P`
      - Polish books
    * - :kbd:`S`
      - Save to Disk
    * - :kbd:`V`
@@ -3,7 +3,7 @@ import re
class Adventure_zone(BasicNewsRecipe):
    title = u'Adventure Zone'
    __author__ = 'fenuks'
-   description = u'Adventure zone - adventure games from A to Z'
+   description = u'Czytaj więcej o przygodzie - codzienne nowinki. Szukaj u nas solucji i poradników, czytaj recenzje i zapowiedzi. Także galeria, pliki oraz forum dla wszystkich fanów gier przygodowych.'
    category = 'games'
    language = 'pl'
    no_stylesheets = True
@@ -11,7 +11,7 @@ class Adventure_zone(BasicNewsRecipe):
    max_articles_per_feed = 100
    cover_url = 'http://www.adventure-zone.info/inne/logoaz_2012.png'
    index='http://www.adventure-zone.info/fusion/'
-   use_embedded_content=False
+   use_embedded_content = False
    preprocess_regexps = [(re.compile(r"<td class='capmain'>Komentarze</td>", re.IGNORECASE), lambda m: ''),
                          (re.compile(r'</?table.*?>'), lambda match: ''),
                          (re.compile(r'</?tbody.*?>'), lambda match: '')]
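The three `preprocess_regexps` substitutions above drop the "Komentarze" cell and unwrap table/tbody tags while keeping their contents. Their effect can be checked with plain `re`; the sample HTML string is invented for illustration:

```python
import re

# The same substitutions the recipe applies to each article page.
patterns = [
    (re.compile(r"<td class='capmain'>Komentarze</td>", re.IGNORECASE), lambda m: ''),
    (re.compile(r'</?table.*?>'), lambda match: ''),
    (re.compile(r'</?tbody.*?>'), lambda match: ''),
]

def clean(html):
    # Apply each (pattern, replacement) pair in order.
    for pattern, repl in patterns:
        html = pattern.sub(repl, html)
    return html

sample = "<table border='1'><tbody><td class='capmain'>Komentarze</td><td>text</td></tbody></table>"
print(clean(sample))
```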
@@ -21,37 +21,35 @@ class Adventure_zone(BasicNewsRecipe):
    extra_css = '.main-bg{text-align: left;} td.capmain{ font-size: 22px; }'
    feeds = [(u'Nowinki', u'http://www.adventure-zone.info/fusion/feeds/news.php')]

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        soup = self.index_to_soup(u'http://www.adventure-zone.info/fusion/feeds/news.php')
        tag = soup.find(name='channel')
        titles = []
        for r in tag.findAll(name='image'):
            r.extract()
        art = tag.findAll(name='item')
        for i in art:
            titles.append(i.title.string)
        for feed in feeds:
            for article in feed.articles[:]:
                article.title = titles[feed.articles.index(article)]
        return feeds
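The loop at the end of `parse_feeds` overwrites each downloaded article's title with the corresponding `<item><title>` from the raw feed, pairing entries by position. The pairing logic, reduced to plain lists (the data is invented sample data):

```python
# Articles as parsed by the news engine (titles possibly mangled), and the
# titles read directly from the feed XML.
articles = [{'title': 'Bad title 1'}, {'title': 'Bad title 2'}]
feed_titles = ['Recenzja: Gra X', 'Zapowied\u017a: Gra Y']

# Replace each article's title with the feed title at the same position,
# as the recipe does with feed.articles and the <item> elements.
for index, article in enumerate(articles):
    article['title'] = feed_titles[index]
```

Using `enumerate` is equivalent to the recipe's `feed.articles.index(article)` but avoids a quadratic list lookup.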

    '''def get_cover_url(self):
        soup = self.index_to_soup('http://www.adventure-zone.info/fusion/news.php')
        cover = soup.find(id='box_OstatninumerAZ')
        self.cover_url = 'http://www.adventure-zone.info/fusion/' + cover.center.a.img['src']
        return getattr(self, 'cover_url', self.cover_url)'''

    def populate_article_metadata(self, article, soup, first):
        result = re.search('(.+) - Adventure Zone', soup.title.string)
        if result:
            result = result.group(1)
        else:
            result = soup.body.find('strong')
            if result:
                result = result.string
        if result:
            result = result.replace('&amp;', '&')
            result = result.replace('&#39;', '\u2019')
            article.title = result
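The title handling in `populate_article_metadata` (keep the part before " - Adventure Zone", then undo a couple of HTML entities) can be exercised standalone; the sample titles below are invented:

```python
import re

def clean_title(page_title):
    # Mirror the recipe's logic: keep the text before ' - Adventure Zone',
    # then decode the two entities the recipe cares about.
    match = re.search('(.+) - Adventure Zone', page_title)
    result = match.group(1) if match else page_title
    result = result.replace('&amp;', '&')
    result = result.replace('&#39;', '\u2019')
    return result

print(clean_title('Fire &amp; Ice - Adventure Zone'))
```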
    def skip_ad_pages(self, soup):
        skip_tag = soup.body.find(name='td', attrs={'class':'main-bg'})
        skip_tag = skip_tag.findAll(name='a')
-       for r in skip_tag:
-           if r.strong:
-               word = r.strong.string.lower()
-               if word and (('zapowied' in word) or ('recenzj' in word) or ('solucj' in word) or ('poradnik' in word)):
-                   return self.index_to_soup('http://www.adventure-zone.info/fusion/print.php?type=A&item'+r['href'][r['href'].find('article_id')+7:], raw=True)
+       title = soup.title.string.lower()
+       if (('zapowied' in title) or ('recenzj' in title) or ('solucj' in title) or ('poradnik' in title)):
+           for r in skip_tag:
+               if r.strong and r.strong.string:
+                   word = r.strong.string.lower()
+                   if (('zapowied' in word) or ('recenzj' in word) or ('solucj' in word) or ('poradnik' in word)):
+                       return self.index_to_soup('http://www.adventure-zone.info/fusion/print.php?type=A&item'+r['href'][r['href'].find('article_id')+7:], raw=True)

    def preprocess_html(self, soup):
        footer = soup.find(attrs={'class':'news-footer middle-border'})
@@ -43,6 +43,6 @@ class AntywebRecipe(BasicNewsRecipe):
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup
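That `preprocess_html` unwraps every `<a>` whose only content is a plain string, keeping the text and dropping the link. With stdlib tools only, the same effect can be approximated with a regular expression; this is a simplification (unlike BeautifulSoup's `replaceWith`, it skips links containing nested markup), and the sample HTML is invented:

```python
import re

def unwrap_links(html):
    # Replace <a ...>text</a> with just its text, when the content has no
    # nested tags (mirrors the "alink.string is not None" check).
    return re.sub(r'<a\b[^>]*>([^<]*)</a>', r'\1', html)

print(unwrap_links('<p>Read <a href="http://example.com">this post</a> now</p>'))
```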

@@ -24,4 +24,3 @@ class app_funds(BasicNewsRecipe):
    auto_cleanup = True

    feeds = [(u'blog', u'http://feeds.feedburner.com/blogspot/etVI')]
@@ -1,10 +1,11 @@
from calibre.web.feeds.news import BasicNewsRecipe

class Archeowiesci(BasicNewsRecipe):
-   title = u'Archeowiesci'
+   title = u'Archeowieści'
    __author__ = 'fenuks'
    category = 'archeology'
    language = 'pl'
+   description = u'Z pasją o przeszłości'
    cover_url='http://archeowiesci.pl/wp-content/uploads/2011/05/Archeowiesci2-115x115.jpg'
    oldest_article = 7
    needs_subscription='optional'
@@ -2,7 +2,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
class AstroNEWS(BasicNewsRecipe):
    title = u'AstroNEWS'
    __author__ = 'fenuks'
-   description = 'AstroNEWS- astronomy every day'
+   description = u'AstroNEWS regularnie dostarcza wiadomości o wydarzeniach związanych z astronomią i astronautyką. Informujemy o aktualnych odkryciach i wydarzeniach naukowych, zapowiadamy ciekawe zjawiska astronomiczne. Serwis jest częścią portalu astronomicznego AstroNET prowadzonego przez miłośników astronomii i zawodowych astronomów.'
    category = 'astronomy, science'
    language = 'pl'
    oldest_article = 8
@@ -13,6 +13,7 @@ class Astroflesz(BasicNewsRecipe):
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_attributes = ['style']
    keep_only_tags = [dict(id="k2Container")]
    remove_tags_after = dict(name='div', attrs={'class':'itemLinks'})
    remove_tags = [dict(name='div', attrs={'class':['itemLinks', 'itemToolbar', 'itemRatingBlock']})]
@@ -3,7 +3,7 @@
class Astronomia_pl(BasicNewsRecipe):
    title = u'Astronomia.pl'
    __author__ = 'fenuks'
-   description = 'Astronomia - polish astronomy site'
+   description = u'Astronomia.pl jest edukacyjnym portalem skierowanym do uczniów, studentów i miłośników astronomii. Przedstawiamy gwiazdy, planety, galaktyki, czarne dziury i wiele innych tajemnic Wszechświata.'
    masthead_url = 'http://www.astronomia.pl/grafika/logo.gif'
    cover_url = 'http://www.astronomia.pl/grafika/logo.gif'
    category = 'astronomy, science'
recipes/bachormagazyn.recipe (new file, 43 lines)
@@ -0,0 +1,43 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

__license__ = 'GPL v3'
__copyright__ = u'Łukasz Grąbczewski 2013'
__version__ = '1.0'

'''
bachormagazyn.pl
'''

from calibre.web.feeds.news import BasicNewsRecipe

class bachormagazyn(BasicNewsRecipe):
    __author__ = u'Łukasz Grączewski'
    title = u'Bachor Magazyn'
    description = u'Alternatywny magazyn o alternatywach rodzicielstwa'
    language = 'pl'
    publisher = 'Bachor Mag.'
    publication_type = 'magazine'
    masthead_url = 'http://bachormagazyn.pl/wp-content/uploads/2011/10/bachor_header1.gif'
    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False
    remove_empty_feeds = True

    oldest_article = 32  # monthly +1
    max_articles_per_feed = 100

    feeds = [
        (u'Bezradnik dla nieudacznych rodziców', u'http://bachormagazyn.pl/feed/')
    ]

    keep_only_tags = []
    keep_only_tags.append(dict(name='div', attrs={'id': 'content'}))

    remove_tags = []
    remove_tags.append(dict(attrs={'id': 'nav-above'}))
    remove_tags.append(dict(attrs={'id': 'nav-below'}))
    remove_tags.append(dict(attrs={'id': 'comments'}))
    remove_tags.append(dict(attrs={'class': 'entry-info'}))
    remove_tags.append(dict(attrs={'class': 'comments-link'}))
    remove_tags.append(dict(attrs={'class': 'sharedaddy sd-sharing-enabled'}))
recipes/badania_net.recipe (new file, 17 lines)
@@ -0,0 +1,17 @@
from calibre.web.feeds.news import BasicNewsRecipe

class BadaniaNet(BasicNewsRecipe):
    title = u'badania.net'
    __author__ = 'fenuks'
    description = u'chcesz wiedzieć więcej?'
    category = 'science'
    language = 'pl'
    cover_url = 'http://badania.net/wp-content/badanianet_green_transparent.png'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_empty_feeds = True
    use_embedded_content = False
    remove_tags = [dict(attrs={'class':['omc-flex-category', 'omc-comment-count', 'omc-single-tags']})]
    remove_tags_after = dict(attrs={'class':'omc-single-tags'})
    keep_only_tags = [dict(id='omc-full-article')]
    feeds = [(u'Psychologia', u'http://badania.net/category/psychologia/feed/'),
             (u'Technologie', u'http://badania.net/category/technologie/feed/'),
             (u'Biologia', u'http://badania.net/category/biologia/feed/'),
             (u'Chemia', u'http://badania.net/category/chemia/feed/'),
             (u'Zdrowie', u'http://badania.net/category/zdrowie/'),
             (u'Seks', u'http://badania.net/category/psychologia-ewolucyjna-tematyka-seks/feed/')]
@@ -47,4 +47,3 @@ class bankier(BasicNewsRecipe):
        segments = urlPart.split('-')
        urlPart2 = segments[-1]
        return 'http://www.bankier.pl/wiadomosci/print.html?article_id=' + urlPart2
@@ -3,7 +3,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
class Bash_org_pl(BasicNewsRecipe):
    title = u'Bash.org.pl'
    __author__ = 'fenuks'
-   description = 'Bash.org.pl - funny quotations from IRC discussions'
+   description = 'Bash.org.pl - zabawne cytaty z IRC'
    category = 'funny quotations, humour'
    language = 'pl'
    cover_url = u'http://userlogos.org/files/logos/dzikiosiol/none_0.png'
@@ -35,8 +35,8 @@ class Bash_org_pl(BasicNewsRecipe):
        soup=self.index_to_soup(u'http://bash.org.pl/random/')
        #date=soup.find('div', attrs={'class':'right'}).string
-       url=soup.find('a', attrs={'class':'qid click'})
-       title=url.string
-       url='http://bash.org.pl' +url['href']
+       title=''
+       url='http://bash.org.pl/random/'
        articles.append({'title' : title,
                         'url' : url,
                         'date' : '',
@@ -44,6 +44,8 @@ class Bash_org_pl(BasicNewsRecipe):
                         })
        return articles
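Recipes like this hand the news engine plain dictionaries: `parse_index` returns feeds as `(name, articles)` pairs, where each article is a dict of title/url/date/description strings. A minimal sketch of that structure (the feed name and the `description` key are illustrative assumptions, since the hunk above is truncated):

```python
# Build the article list the way the recipe does: a list of dicts,
# grouped under a feed name.
articles = []
title = ''
url = 'http://bash.org.pl/random/'
articles.append({'title': title,
                 'url': url,
                 'date': '',
                 'description': ''})
feeds = [(u'Cytaty', articles)]
```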

+   def populate_article_metadata(self, article, soup, first):
+       article.title = soup.find(attrs={'class':'qid click'}).string

    def parse_index(self):
        feeds = []
@@ -1,74 +1,87 @@
from calibre.web.feeds.news import BasicNewsRecipe
import re
from calibre.ebooks.BeautifulSoup import Comment

class BenchmarkPl(BasicNewsRecipe):
    title = u'Benchmark.pl'
    __author__ = 'fenuks'
-   description = u'benchmark.pl -IT site'
+   description = u'benchmark.pl, recenzje i testy sprzętu, aktualności, rankingi, sterowniki, porady, opinie'
    masthead_url = 'http://www.benchmark.pl/i/logo-footer.png'
-   cover_url = 'http://www.ieaddons.pl/benchmark/logo_benchmark_new.gif'
+   cover_url = 'http://www.benchmark.pl/i/logo-dark.png'
    category = 'IT'
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100
-   no_stylesheets=True
+   no_stylesheets = True
    remove_attributes = ['style']
    preprocess_regexps = [(re.compile(ur'<h3><span style="font-size: small;"> Zobacz poprzednie <a href="http://www.benchmark.pl/news/zestawienie/grupa_id/135">Opinie dnia:</a></span>.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</body>'), (re.compile(ur'Więcej o .*?</ul>', re.DOTALL|re.IGNORECASE), lambda match: '')]
-   keep_only_tags=[dict(name='div', attrs={'class':['m_zwykly', 'gallery']}), dict(id='article')]
-   remove_tags_after=dict(name='div', attrs={'class':'body'})
-   remove_tags=[dict(name='div', attrs={'class':['kategoria', 'socialize', 'thumb', 'panelOcenaObserwowane', 'categoryNextToSocializeGallery', 'breadcrumb', 'footer', 'moreTopics']}), dict(name='table', attrs={'background':'http://www.benchmark.pl/uploads/backend_img/a/fotki_newsy/opinie_dnia/bg.png'}), dict(name='table', attrs={'width':'210', 'cellspacing':'1', 'cellpadding':'4', 'border':'0', 'align':'right'})]
-   INDEX= 'http://www.benchmark.pl'
+   keep_only_tags = [dict(name='div', attrs={'class':['m_zwykly', 'gallery']}), dict(id='article')]
+   remove_tags_after = dict(id='article')
+   remove_tags = [dict(name='div', attrs={'class':['comments', 'body', 'kategoria', 'socialize', 'thumb', 'panelOcenaObserwowane', 'categoryNextToSocializeGallery', 'breadcrumb', 'footer', 'moreTopics']}), dict(name='table', attrs = {'background':'http://www.benchmark.pl/uploads/backend_img/a/fotki_newsy/opinie_dnia/bg.png'}), dict(name='table', attrs={'width':'210', 'cellspacing':'1', 'cellpadding':'4', 'border':'0', 'align':'right'})]
+   INDEX = 'http://www.benchmark.pl'
    feeds = [(u'Aktualności', u'http://www.benchmark.pl/rss/aktualnosci-pliki.xml'),
             (u'Testy i recenzje', u'http://www.benchmark.pl/rss/testy-recenzje-minirecenzje.xml')]

    def append_page(self, soup, appendtag):
-       nexturl = soup.find('span', attrs={'class':'next'})
-       while nexturl is not None:
-           nexturl= self.INDEX + nexturl.parent['href']
-           soup2 = self.index_to_soup(nexturl)
-           nexturl=soup2.find('span', attrs={'class':'next'})
+       nexturl = soup.find(attrs={'class':'next'})
+       while nexturl:
+           soup2 = self.index_to_soup(nexturl['href'])
+           nexturl = soup2.find(attrs={'class':'next'})
            pagetext = soup2.find(name='div', attrs={'class':'body'})
-           appendtag.find('div', attrs={'class':'k_ster'}).extract()
+           tag = appendtag.find('div', attrs={'class':'k_ster'})
+           if tag:
+               tag.extract()
            comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
            for comment in comments:
                comment.extract()
            pos = len(appendtag.contents)
            appendtag.insert(pos, pagetext)
-       if appendtag.find('div', attrs={'class':'k_ster'}) is not None:
+       if appendtag.find('div', attrs={'class':'k_ster'}):
            appendtag.find('div', attrs={'class':'k_ster'}).extract()
        for r in appendtag.findAll(attrs={'class':'changePage'}):
            r.extract()

    def image_article(self, soup, appendtag):
-       nexturl=soup.find('div', attrs={'class':'preview'})
-       if nexturl is not None:
-           nexturl=nexturl.find('a', attrs={'class':'move_next'})
-           image=appendtag.find('div', attrs={'class':'preview'}).div['style'][16:]
-           image=self.INDEX + image[:image.find("')")]
+       nexturl = soup.find('div', attrs={'class':'preview'})
+       if nexturl:
+           nexturl = nexturl.find('a', attrs={'class':'move_next'})
+           image = appendtag.find('div', attrs={'class':'preview'}).div['style'][16:]
+           image = self.INDEX + image[:image.find("')")]
            appendtag.find(attrs={'class':'preview'}).name='img'
            appendtag.find(attrs={'class':'preview'})['src']=image
            appendtag.find('a', attrs={'class':'move_next'}).extract()
-           while nexturl is not None:
-               nexturl= self.INDEX + nexturl['href']
+           while nexturl:
+               nexturl = self.INDEX + nexturl['href']
                soup2 = self.index_to_soup(nexturl)
-               nexturl=soup2.find('a', attrs={'class':'move_next'})
-               image=soup2.find('div', attrs={'class':'preview'}).div['style'][16:]
-               image=self.INDEX + image[:image.find("')")]
+               nexturl = soup2.find('a', attrs={'class':'move_next'})
+               image = soup2.find('div', attrs={'class':'preview'}).div['style'][16:]
+               image = self.INDEX + image[:image.find("')")]
                soup2.find(attrs={'class':'preview'}).name='img'
                soup2.find(attrs={'class':'preview'})['src']=image
-               pagetext=soup2.find('div', attrs={'class':'gallery'})
+               pagetext = soup2.find('div', attrs={'class':'gallery'})
|
||||
pagetext.find('div', attrs={'class':'title'}).extract()
|
||||
pagetext.find('div', attrs={'class':'thumb'}).extract()
|
||||
pagetext.find('div', attrs={'class':'panelOcenaObserwowane'}).extract()
|
||||
if nexturl is not None:
|
||||
if nexturl:
|
||||
pagetext.find('a', attrs={'class':'move_next'}).extract()
|
||||
pagetext.find('a', attrs={'class':'move_back'}).extract()
|
||||
comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
|
||||
for comment in comments:
|
||||
comment.extract()
|
||||
pos = len(appendtag.contents)
|
||||
appendtag.insert(pos, pagetext)
|
||||
|
||||
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
if soup.find('div', attrs={'class':'preview'}) is not None:
|
||||
if soup.find('div', attrs={'class':'preview'}):
|
||||
self.image_article(soup, soup.body)
|
||||
else:
|
||||
self.append_page(soup, soup.body)
|
||||
for a in soup('a'):
|
||||
if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
|
||||
a['href']=self.INDEX + a['href']
|
||||
if a.has_key('href') and not a['href'].startswith('http'):
|
||||
a['href'] = self.INDEX + a['href']
|
||||
for r in soup.findAll(attrs={'class':['comments', 'body']}):
|
||||
r.extract()
|
||||
return soup
|
||||
|
55
recipes/biweekly.recipe
Normal file
@ -0,0 +1,55 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

__license__ = 'GPL v3'
__copyright__ = u'Łukasz Grąbczewski 2011'
__version__ = '2.0'

import re, os
from calibre import walk
from calibre.utils.zipfile import ZipFile
from calibre.ptempfile import PersistentTemporaryFile
from calibre.web.feeds.news import BasicNewsRecipe

class biweekly(BasicNewsRecipe):
    __author__ = u'Łukasz Grąbczewski'
    title = 'Biweekly'
    language = 'en_PL'
    publisher = 'National Audiovisual Institute'
    publication_type = 'magazine'
    description = u'link with culture [English edition of Polish magazine]: literature, theatre, film, art, music, views, talks'

    conversion_options = {
        'authors' : 'Biweekly.pl'
        ,'publisher' : publisher
        ,'language' : language
        ,'comments' : description
        ,'no_default_epub_cover' : True
        ,'preserve_cover_aspect_ratio': True
    }

    def build_index(self):
        browser = self.get_browser()
        browser.open('http://www.biweekly.pl/')

        # find the link
        epublink = browser.find_link(text_regex=re.compile('ePUB VERSION'))

        # download ebook
        self.report_progress(0,_('Downloading ePUB'))
        response = browser.follow_link(epublink)
        book_file = PersistentTemporaryFile(suffix='.epub')
        book_file.write(response.read())
        book_file.close()

        # convert
        self.report_progress(0.2,_('Converting to OEB'))
        oeb = self.output_dir + '/INPUT/'
        if not os.path.exists(oeb):
            os.makedirs(oeb)
        with ZipFile(book_file.name) as f:
            f.extractall(path=oeb)

        for f in walk(oeb):
            if f.endswith('.opf'):
                return f
30
recipes/blog_biszopa.recipe
Normal file
@ -0,0 +1,30 @@
__license__ = 'GPL v3'
from calibre.web.feeds.news import BasicNewsRecipe

class BlogBiszopa(BasicNewsRecipe):
    title = u'Blog Biszopa'
    __author__ = 'fenuks'
    description = u'Zapiski z Granitowego Miasta'
    category = 'history'
    #publication_type = ''
    language = 'pl'
    #encoding = ''
    #extra_css = ''
    cover_url = 'http://blogbiszopa.pl/wp-content/themes/biszop/images/logo.png'
    masthead_url = ''
    use_embedded_content = False
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_empty_feeds = True
    remove_javascript = True
    remove_attributes = ['style', 'font']
    ignore_duplicate_articles = {'title', 'url'}

    keep_only_tags = [dict(id='main-content')]
    remove_tags = [dict(name='footer')]
    #remove_tags_after = {}
    #remove_tags_before = {}

    feeds = [(u'Artyku\u0142y', u'http://blogbiszopa.pl/feed/')]
@ -3,7 +3,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
class CD_Action(BasicNewsRecipe):
    title = u'CD-Action'
    __author__ = 'fenuks'
    description = 'cdaction.pl - polish games magazine site'
    description = 'Strona CD-Action (CDA), największego w Polsce pisma dla graczy.Pełne wersje gier, newsy, recenzje, zapowiedzi, konkursy, forum, opinie, galerie screenów,trailery, filmiki, patche, teksty. Gry komputerowe (PC) oraz na konsole (PS3, XBOX 360).'
    category = 'games'
    language = 'pl'
    index='http://www.cdaction.pl'
@ -1,5 +1,6 @@
from calibre.web.feeds.news import BasicNewsRecipe
import re

class Ciekawostki_Historyczne(BasicNewsRecipe):
    title = u'Ciekawostki Historyczne'
    oldest_article = 7
@ -7,42 +8,30 @@ class Ciekawostki_Historyczne(BasicNewsRecipe):
    description = u'Serwis popularnonaukowy - odkrycia, kontrowersje, historia, ciekawostki, badania, ciekawostki z przeszłości.'
    category = 'history'
    language = 'pl'
    masthead_url= 'http://ciekawostkihistoryczne.pl/wp-content/themes/Wordpress_Magazine/images/logo-ciekawostki-historyczne-male.jpg'
    cover_url='http://ciekawostkihistoryczne.pl/wp-content/themes/Wordpress_Magazine/images/logo-ciekawostki-historyczne-male.jpg'
    masthead_url = 'http://ciekawostkihistoryczne.pl/wp-content/themes/Wordpress_Magazine/images/logo-ciekawostki-historyczne-male.jpg'
    cover_url = 'http://ciekawostkihistoryczne.pl/wp-content/themes/Wordpress_Magazine/images/logo-ciekawostki-historyczne-male.jpg'
    max_articles_per_feed = 100
    oldest_article = 140000
    preprocess_regexps = [(re.compile(ur'Ten artykuł ma kilka stron.*?</fb:like>', re.DOTALL), lambda match: ''), (re.compile(ur'<h2>Zobacz też:</h2>.*?</ol>', re.DOTALL), lambda match: '')]
    no_stylesheets=True
    remove_empty_feeds=True
    keep_only_tags=[dict(name='div', attrs={'class':'post'})]
    remove_tags=[dict(id='singlepostinfo')]
    no_stylesheets = True
    remove_empty_feeds = True
    keep_only_tags = [dict(name='div', attrs={'class':'post'})]
    recursions = 5
    remove_tags = [dict(id='singlepostinfo')]

    feeds = [(u'Staro\u017cytno\u015b\u0107', u'http://ciekawostkihistoryczne.pl/tag/starozytnosc/feed/'), (u'\u015aredniowiecze', u'http://ciekawostkihistoryczne.pl/tag/sredniowiecze/feed/'), (u'Nowo\u017cytno\u015b\u0107', u'http://ciekawostkihistoryczne.pl/tag/nowozytnosc/feed/'), (u'XIX wiek', u'http://ciekawostkihistoryczne.pl/tag/xix-wiek/feed/'), (u'1914-1939', u'http://ciekawostkihistoryczne.pl/tag/1914-1939/feed/'), (u'1939-1945', u'http://ciekawostkihistoryczne.pl/tag/1939-1945/feed/'), (u'Powojnie (od 1945)', u'http://ciekawostkihistoryczne.pl/tag/powojnie/feed/'), (u'Recenzje', u'http://ciekawostkihistoryczne.pl/category/recenzje/feed/')]

    def append_page(self, soup, appendtag):
        tag=soup.find(name='h7')
        if tag:
            if tag.br:
                pass
            elif tag.nextSibling.name=='p':
                tag=tag.nextSibling
            nexturl = tag.findAll('a')
            for nextpage in nexturl:
                tag.extract()
                nextpage= nextpage['href']
                soup2 = self.index_to_soup(nextpage)
                pagetext = soup2.find(name='div', attrs={'class':'post'})
                for r in pagetext.findAll('div', attrs={'id':'singlepostinfo'}):
                    r.extract()
                for r in pagetext.findAll('div', attrs={'class':'wp-caption alignright'}):
                    r.extract()
                for r in pagetext.findAll('h1'):
                    r.extract()
                pagetext.find('h6').nextSibling.extract()
                pagetext.find('h7').nextSibling.extract()
                pos = len(appendtag.contents)
                appendtag.insert(pos, pagetext)
    def is_link_wanted(self, url, tag):
        return 'ciekawostkihistoryczne' in url and url[-2] in {'2', '3', '4', '5', '6'}

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
    def postprocess_html(self, soup, first_fetch):
        tag = soup.find('h7')
        if tag:
            tag.nextSibling.extract()
        if not first_fetch:
            for r in soup.findAll(['h1']):
                r.extract()
            soup.find('h6').nextSibling.extract()
        return soup

66
recipes/computer_woche.recipe
Normal file
@ -0,0 +1,66 @@
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
Fetch Computerwoche.
'''

from calibre.web.feeds.news import BasicNewsRecipe


class Computerwoche(BasicNewsRecipe):

    title = 'Computerwoche'
    description = 'german computer newspaper'
    language = 'de'
    __author__ = 'Maria Seliger'
    use_embedded_content = False
    timefmt = ' [%d %b %Y]'
    max_articles_per_feed = 15
    linearize_tables = True
    no_stylesheets = True
    remove_stylesheets = True
    remove_javascript = True
    encoding = 'utf-8'
    html2epub_options = 'base_font_size=10'
    summary_length = 100
    auto_cleanup = True

    extra_css = '''
        h2{font-family:Arial,Helvetica,sans-serif; font-size: x-small; color: #003399;}
        a{font-family:Arial,Helvetica,sans-serif; font-size: x-small; font-style:italic;}
        .dachzeile p{font-family:Arial,Helvetica,sans-serif; font-size: x-small; }
        h1{ font-family:Arial,Helvetica,sans-serif; font-size:x-large; font-weight:bold;}
        .artikelTeaser{font-family:Arial,Helvetica,sans-serif; font-size: x-small; font-weight:bold; }
        body{font-family:Arial,Helvetica,sans-serif; }
        .photo {font-family:Arial,Helvetica,sans-serif; font-size: x-small; color: #666666;} '''

    feeds = [ ('Computerwoche', 'http://rss.feedsportal.com/c/312/f/4414/index.rss'),
              ('IDG Events', 'http://rss.feedsportal.com/c/401/f/7544/index.rss'),
              ('Computerwoche Jobs und Karriere', 'http://rss.feedsportal.com/c/312/f/434082/index.rss'),
              ('Computerwoche BI und ECM', 'http://rss.feedsportal.com/c/312/f/434083/index.rss'),
              ('Computerwoche Cloud Computing', 'http://rss.feedsportal.com/c/312/f/534647/index.rss'),
              ('Computerwoche Compliance und Recht', 'http://rss.feedsportal.com/c/312/f/434084/index.rss'),
              ('Computerwoche CRM', 'http://rss.feedsportal.com/c/312/f/434085/index.rss'),
              ('Computerwoche Data Center und Server', 'http://rss.feedsportal.com/c/312/f/434086/index.rss'),
              ('Computerwoche ERP', 'http://rss.feedsportal.com/c/312/f/434087/index.rss'),
              ('Computerwoche IT Macher', 'http://rss.feedsportal.com/c/312/f/534646/index.rss'),
              ('Computerwoche IT-Services', 'http://rss.feedsportal.com/c/312/f/434089/index.rss'),
              ('Computerwoche IT-Strategie', 'http://rss.feedsportal.com/c/312/f/434090/index.rss'),
              ('Computerwoche Mittelstands-IT', 'http://rss.feedsportal.com/c/312/f/434091/index.rss'),
              ('Computerwoche Mobile und Wireless', 'http://rss.feedsportal.com/c/312/f/434092/index.rss'),
              ('Computerwoche Netzwerk', 'http://rss.feedsportal.com/c/312/f/434093/index.rss'),
              ('Computerwoche Notebook und PC', 'http://rss.feedsportal.com/c/312/f/434094/index.rss'),
              ('Computerwoche Office und Tools', 'http://rss.feedsportal.com/c/312/f/434095/index.rss'),
              ('Computerwoche Security', 'http://rss.feedsportal.com/c/312/f/434098/index.rss'),
              ('Computerwoche SOA und BPM', 'http://rss.feedsportal.com/c/312/f/434099/index.rss'),
              ('Computerwoche Software Infrastruktur', 'http://rss.feedsportal.com/c/312/f/434096/index.rss'),
              ('Computerwoche Storage', 'http://rss.feedsportal.com/c/312/f/534645/index.rss'),
              ('Computerwoche VoIP und TK', 'http://rss.feedsportal.com/c/312/f/434102/index.rss'),
              ('Computerwoche Web', 'http://rss.feedsportal.com/c/312/f/434103/index.rss'),
              ('Computerwoche Home-IT', 'http://rss.feedsportal.com/c/312/f/434104/index.rss')]


    def print_version(self, url):
        return url.replace ('/a/', '/a/print/')
@ -1,5 +1,5 @@
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai

import re
from calibre.web.feeds.news import BasicNewsRecipe
class Computerworld_pl(BasicNewsRecipe):
    title = u'Computerworld.pl'
@ -7,17 +7,21 @@ class Computerworld_pl(BasicNewsRecipe):
    description = u'Serwis o IT w przemyśle, finansach, handlu, administracji oraz rynku IT i telekomunikacyjnym - wiadomości, opinie, analizy, porady prawne'
    category = 'IT'
    language = 'pl'
    masthead_url= 'http://g1.computerworld.pl/cw/beta_gfx/cw2.gif'
    no_stylesheets=True
    masthead_url = 'http://g1.computerworld.pl/cw/beta_gfx/cw2.gif'
    cover_url = 'http://g1.computerworld.pl/cw/beta_gfx/cw2.gif'
    no_stylesheets = True
    oldest_article = 7
    max_articles_per_feed = 100
    keep_only_tags=[dict(attrs={'class':['tyt_news', 'prawo', 'autor', 'tresc']})]
    remove_tags_after=dict(name='div', attrs={'class':'rMobi'})
    remove_tags=[dict(name='div', attrs={'class':['nnav', 'rMobi']}), dict(name='table', attrs={'class':'ramka_slx'})]
    remove_attributes = ['style',]
    preprocess_regexps = [(re.compile(u'Zobacz również:', re.IGNORECASE), lambda m: ''), (re.compile(ur'[*]+reklama[*]+', re.IGNORECASE), lambda m: ''),]
    keep_only_tags = [dict(id=['szpaltaL', 's2011'])]
    remove_tags_after = dict(name='div', attrs={'class':'tresc'})
    remove_tags = [dict(attrs={'class':['nnav', 'rMobi', 'tagi', 'rec']}),]
    feeds = [(u'Wiadomo\u015bci', u'http://rssout.idg.pl/cw/news_iso.xml')]

    def get_cover_url(self):
        soup = self.index_to_soup('http://www.computerworld.pl/')
        cover=soup.find(name='img', attrs={'class':'prawo'})
        self.cover_url=cover['src']
        return getattr(self, 'cover_url', self.cover_url)
    def skip_ad_pages(self, soup):
        if soup.title.string.lower() == 'advertisement':
            tag = soup.find(name='a')
            if tag:
                new_soup = self.index_to_soup(tag['href'], raw=True)
                return new_soup
@ -1,14 +1,16 @@
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Comment

class CoNowegoPl(BasicNewsRecipe):
    title = u'conowego.pl'
    __author__ = 'fenuks'
    description = u'Nowy wortal technologiczny oraz gazeta internetowa. Testy najnowszych produktów, fachowe porady i recenzje. U nas znajdziesz wszystko o elektronice użytkowej !'
    cover_url = 'http://www.conowego.pl/fileadmin/templates/main/images/logo_top.png'
    #cover_url = 'http://www.conowego.pl/fileadmin/templates/main/images/logo_top.png'
    category = 'IT, news'
    language = 'pl'
    oldest_article = 7
    max_articles_per_feed = 100
    INDEX = 'http://www.conowego.pl/'
    no_stylesheets = True
    remove_empty_feeds = True
    use_embedded_content = False
@ -34,5 +36,15 @@ class CoNowegoPl(BasicNewsRecipe):
            pos = len(appendtag.contents)
            appendtag.insert(pos, pagetext)

        comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
        for comment in comments:
            comment.extract()
        for r in appendtag.findAll(attrs={'class':['pages', 'paginationWrap']}):
            r.extract()

    def get_cover_url(self):
        soup = self.index_to_soup('http://www.conowego.pl/magazyn/')
        tag = soup.find(attrs={'class':'ms_left'})
        if tag:
            self.cover_url = self.INDEX + tag.find('img')['src']
        return getattr(self, 'cover_url', self.cover_url)
@ -1,4 +1,5 @@
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:fdm=marker:ai
import re
from calibre.web.feeds.news import BasicNewsRecipe

class CzasGentlemanow(BasicNewsRecipe):
@ -13,8 +14,9 @@ class CzasGentlemanow(BasicNewsRecipe):
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_empty_feeds = True
    preprocess_regexps = [(re.compile(u'<h3>Może Cię też zainteresować:</h3>'), lambda m: '')]
    use_embedded_content = False
    keep_only_tags = [dict(name='div', attrs={'class':'content'})]
    remove_tags = [dict(attrs={'class':'meta_comments'})]
    remove_tags_after = dict(name='div', attrs={'class':'fblikebutton_button'})
    remove_tags = [dict(attrs={'class':'meta_comments'}), dict(id=['comments', 'related_posts_thumbnails'])]
    remove_tags_after = dict(id='comments')
    feeds = [(u'M\u0119ski \u015awiat', u'http://czasgentlemanow.pl/category/meski-swiat/feed/'), (u'Styl', u'http://czasgentlemanow.pl/category/styl/feed/'), (u'Vademecum Gentlemana', u'http://czasgentlemanow.pl/category/vademecum/feed/'), (u'Dom i rodzina', u'http://czasgentlemanow.pl/category/dom-i-rodzina/feed/'), (u'Honor', u'http://czasgentlemanow.pl/category/honor/feed/'), (u'Gad\u017cety Gentlemana', u'http://czasgentlemanow.pl/category/gadzety-gentlemana/feed/')]
35
recipes/deccan_herald.recipe
Normal file
@ -0,0 +1,35 @@
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1362501327(BasicNewsRecipe):
    title = u'Deccan Herald'
    __author__ = 'Muruli Shamanna'
    description = 'Daily news from the Deccan Herald'

    oldest_article = 1
    max_articles_per_feed = 100
    auto_cleanup = True
    category = 'News'
    language = 'en_IN'
    encoding = 'utf-8'
    publisher = 'The Printers (Mysore) Private Ltd'
    ##use_embedded_content = True

    cover_url = 'http://www.quizzing.in/wp-content/uploads/2010/07/DH.gif'

    conversion_options = {
        'comments' : description
        ,'tags' : category
        ,'language' : language
        ,'publisher' : publisher
    }

    feeds = [(u'News', u'http://www.deccanherald.com/rss/news.rss'), (u'Business', u'http://www.deccanherald.com/rss/business.rss'), (u'Entertainment', u'http://www.deccanherald.com/rss/entertainment.rss'), (u'Sports', u'http://www.deccanherald.com/rss/sports.rss'), (u'Environment', u'http://www.deccanherald.com/rss/environment.rss')]

    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:150%;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:155%;}
        img {max-width:100%; min-width:100%;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:large;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:medium;}
    '''
27
recipes/democracy_journal.recipe
Normal file
@ -0,0 +1,27 @@
from calibre.web.feeds.news import BasicNewsRecipe
import re

class AdvancedUserRecipe1361743898(BasicNewsRecipe):
    title = u'Democracy Journal'
    description = '''A journal of ideas. Published quarterly.'''
    __author__ = u'David Nye'
    language = 'en'
    oldest_article = 90
    max_articles_per_feed = 30
    no_stylesheets = True
    auto_cleanup = True

    def parse_index(self):
        articles = []
        feeds = []
        soup = self.index_to_soup("http://www.democracyjournal.org")
        for x in soup.findAll(href=re.compile("http://www\.democracyjournal\.org/\d*/.*php$")):
            url = x.get('href')
            title = self.tag_to_string(x)
            articles.append({'title':title, 'url':url, 'description':'', 'date':''})
        feeds.append(('Articles', articles))
        return feeds

    def print_version(self, url):
        return url + '?page=all'
@ -1,6 +1,6 @@
#!/usr/bin/env python

__license__ = 'GPL v3'
__author__ = 'Mori'
__version__ = 'v. 0.5'
'''
@ -11,56 +11,56 @@ from calibre.web.feeds.news import BasicNewsRecipe
import re

class DziennikInternautowRecipe(BasicNewsRecipe):
    __author__ = 'Mori'
    language = 'pl'

    title = u'Dziennik Internautow'
    publisher = u'Dziennik Internaut\u00f3w Sp. z o.o.'
    description = u'Internet w \u017cyciu i biznesie. Porady, wywiady, interwencje, bezpiecze\u0144stwo w Sieci, technologia.'

    max_articles_per_feed = 100
    oldest_article = 7
    cover_url = 'http://di.com.pl/pic/logo_di_norm.gif'

    no_stylesheets = True
    remove_javascript = True
    encoding = 'utf-8'

    extra_css = '''
        .fotodesc{font-size: 75%;}
        .pub_data{font-size: 75%;}
        .fotonews{clear: both; padding-top: 10px; padding-bottom: 10px;}
        #pub_foto{font-size: 75%; float: left; padding-right: 10px;}
    '''

    feeds = [
        (u'Dziennik Internaut\u00f3w', u'http://feeds.feedburner.com/glowny-di')
    ]

    keep_only_tags = [
        dict(name = 'div', attrs = {'id' : 'pub_head'}),
        dict(name = 'div', attrs = {'id' : 'pub_content'})
    ]

    remove_tags = [
        dict(name = 'div', attrs = {'class' : 'poradniki_context'}),
        dict(name = 'div', attrs = {'class' : 'uniBox'}),
        dict(name = 'object', attrs = {}),
        dict(name = 'h3', attrs = {}),
        dict(attrs={'class':'twitter-share-button'})
    ]

    preprocess_regexps = [
        (re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
        [
            (r', <a href="http://di.com.pl/komentarze,.*?</div>', lambda match: '</div>'),
            (r'<div class="fotonews".*?">', lambda match: '<div class="fotonews">'),
            (r'http://di.com.pl/pic/photo/mini/', lambda match: 'http://di.com.pl/pic/photo/oryginal/'),
            (r'\s*</', lambda match: '</'),
        ]
    ]

    def skip_ad_pages(self, soup):
        if 'Advertisement' in soup.title:
            nexturl=soup.find('a')['href']
            return self.index_to_soup(nexturl, raw=True)
@ -33,6 +33,21 @@ class DiscoverMagazine(BasicNewsRecipe):

    remove_tags_after = [dict(name='div', attrs={'class':'listingBar'})]

    # Login stuff
    needs_subscription = True
    use_javascript_to_login = True
    requires_version = (0, 9, 20)

    def javascript_login(self, br, username, password):
        br.visit('http://discovermagazine.com', timeout=120)
        f = br.select_form('div.login.section div.form')
        f['username'] = username
        f['password'] = password
        br.submit('input[id="signInButton"]', timeout=120)
        br.run_for_a_time(20)
    # End login stuff

    def append_page(self, soup, appendtag, position):
        pager = soup.find('span',attrs={'class':'next'})
        if pager:
@ -18,7 +18,7 @@ class Dobreprogramy_pl(BasicNewsRecipe):
    max_articles_per_feed = 100
    preprocess_regexps = [(re.compile(ur'<div id="\S+360pmp4">Twoja przeglądarka nie obsługuje Flasha i HTML5 lub wyłączono obsługę JavaScript...</div>'), lambda match: '') ]
    keep_only_tags=[dict(attrs={'class':['news', 'entry single']})]
    remove_tags = [dict(attrs={'class':['newsOptions', 'noPrint', 'komentarze', 'tags font-heading-master']}), dict(id='komentarze')]
    remove_tags = [dict(attrs={'class':['newsOptions', 'noPrint', 'komentarze', 'tags font-heading-master']}), dict(id='komentarze'), dict(name='iframe')]
    #remove_tags = [dict(name='div', attrs={'class':['komentarze', 'block', 'portalInfo', 'menuBar', 'topBar']})]
    feeds = [(u'Aktualności', 'http://feeds.feedburner.com/dobreprogramy/Aktualnosci'),
        ('Blogi', 'http://feeds.feedburner.com/dobreprogramy/BlogCzytelnikow')]
@ -8,6 +8,7 @@ class BasicUserRecipe1337668045(BasicNewsRecipe):
    cover_url = 'http://drytooling.com.pl/images/drytooling-kindle.png'
    description = u'Drytooling.com.pl jest serwisem wspinaczki zimowej, alpinizmu i himalaizmu. Jeśli uwielbiasz zimę, nie możesz doczekać się aż wyciągniesz szpej z szafki i uderzysz w Tatry, Alpy, czy może Himalaje, to znajdziesz tutaj naprawdę dużo interesujących Cię treści! Zapraszamy!'
    __author__ = u'Damian Granowski'
    language = 'pl'
    oldest_article = 100
    max_articles_per_feed = 20
    auto_cleanup = True
56
recipes/dwutygodnik.recipe
Normal file
@ -0,0 +1,56 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

__license__ = 'GPL v3'
__copyright__ = u'Łukasz Grąbczewski 2011'
__version__ = '2.0'

import re, os
from calibre import walk
from calibre.utils.zipfile import ZipFile
from calibre.ptempfile import PersistentTemporaryFile
from calibre.web.feeds.news import BasicNewsRecipe

class dwutygodnik(BasicNewsRecipe):
    __author__ = u'Łukasz Grąbczewski'
    title = 'Dwutygodnik'
    language = 'pl'
    publisher = 'Narodowy Instytut Audiowizualny'
    publication_type = 'magazine'
    description = u'Strona Kultury: literatura, teatr, film, sztuka, muzyka, felietony, rozmowy'

    conversion_options = {
        'authors' : 'Dwutygodnik.com'
        ,'publisher' : publisher
        ,'language' : language
        ,'comments' : description
        ,'no_default_epub_cover' : True
        ,'preserve_cover_aspect_ratio': True
    }

    def build_index(self):
        browser = self.get_browser()
        browser.open('http://www.dwutygodnik.com/')

        # find the link
        epublink = browser.find_link(text_regex=re.compile('Wersja ePub'))

        # download ebook
        self.report_progress(0,_('Downloading ePUB'))
        response = browser.follow_link(epublink)
        book_file = PersistentTemporaryFile(suffix='.epub')
        book_file.write(response.read())
        book_file.close()

        # convert
        self.report_progress(0.2,_('Converting to OEB'))
        oeb = self.output_dir + '/INPUT/'
        if not os.path.exists(oeb):
            os.makedirs(oeb)
        with ZipFile(book_file.name) as f:
            f.extractall(path=oeb)

        for f in walk(oeb):
            if f.endswith('.opf'):
                return f
@ -1,9 +1,10 @@
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Comment

class Dzieje(BasicNewsRecipe):
    title = u'dzieje.pl'
    __author__ = 'fenuks'
    description = 'Dzieje - history of Poland'
    description = 'Dzieje.pl - najlepszy portal informacyjno-edukacyjny dotyczący historii Polski XX wieku. Archiwalne fotografie, filmy, katalog postaci, quizy i konkursy.'
    cover_url = 'http://www.dzieje.pl/sites/default/files/dzieje_logo.png'
    category = 'history'
    language = 'pl'
@ -11,8 +12,8 @@ class Dzieje(BasicNewsRecipe):
    index = 'http://dzieje.pl'
    oldest_article = 8
    max_articles_per_feed = 100
    remove_javascript=True
    no_stylesheets= True
    remove_javascript = True
    no_stylesheets = True
    keep_only_tags = [dict(name='h1', attrs={'class':'title'}), dict(id='content-area')]
    remove_tags = [dict(attrs={'class':'field field-type-computed field-field-tagi'}), dict(id='dogory')]
    #feeds = [(u'Dzieje', u'http://dzieje.pl/rss.xml')]
@ -28,16 +29,19 @@ class Dzieje(BasicNewsRecipe):
            pagetext = soup2.find(id='content-area').find(attrs={'class':'content'})
            for r in pagetext.findAll(attrs={'class':['fieldgroup group-groupkul', 'fieldgroup group-zdjeciekult', 'fieldgroup group-zdjecieciekaw', 'fieldgroup group-zdjecieksiazka', 'fieldgroup group-zdjeciedu', 'field field-type-filefield field-field-zdjecieglownawyd']}):
                r.extract()
            pos = len(appendtag.contents)
            appendtag.insert(pos, pagetext)
            comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
            # appendtag.insert(pos, pagetext)
            tag = soup2.find('li', attrs={'class':'pager-next'})
        for r in appendtag.findAll(attrs={'class':['item-list', 'field field-type-computed field-field-tagi', ]}):
            r.extract()
        comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
        for comment in comments:
            comment.extract()

    def find_articles(self, url):
        articles = []
        soup=self.index_to_soup(url)
        tag=soup.find(id='content-area').div.div
        soup = self.index_to_soup(url)
        tag = soup.find(id='content-area').div.div
        for i in tag.findAll('div', recursive=False):
            temp = i.find(attrs={'class':'views-field-title'}).span.a
            title = temp.string
@ -64,7 +68,7 @@ class Dzieje(BasicNewsRecipe):

    def preprocess_html(self, soup):
        for a in soup('a'):
            if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
                a['href']=self.index + a['href']
            if a.has_key('href') and not a['href'].startswith('http'):
                a['href'] = self.index + a['href']
        self.append_page(soup, soup.body)
        return soup
34
recipes/dziennik_baltycki.recipe
Normal file
@ -0,0 +1,34 @@
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
|
||||
class DziennikBaltycki(BasicNewsRecipe):
|
||||
title = u'Dziennik Ba\u0142tycki'
|
||||
__author__ = 'fenuks'
|
||||
description = u'Gazeta Regionalna Dziennik Bałtycki. Najnowsze Wiadomości Trójmiasto i Wiadomości Pomorskie. Czytaj!'
|
||||
category = 'newspaper'
|
||||
language = 'pl'
|
||||
encoding = 'iso-8859-2'
|
||||
masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/dziennikbaltycki.png?24'
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
remove_empty_feeds= True
|
||||
no_stylesheets = True
|
||||
use_embedded_content = False
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
#preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
|
||||
remove_tags_after= dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
|
||||
remove_tags=[dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})]
|
||||
|
||||
feeds = [(u'Wiadomo\u015bci', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_wiadomosci.xml?201302'), (u'Sport', u'http://dziennikbaltycki.feedsportal.com/c/32980/f/533756/index.rss?201302'), (u'Rejsy', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_rejsy.xml?201302'), (u'Biznes na Pomorzu', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_biznesnapomorzu.xml?201302'), (u'GOM', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_gom.xml?201302'), (u'Opinie', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_opinie.xml?201302'), (u'Pitawal Pomorski', u'http://www.dziennikbaltycki.pl/rss/dziennikbaltycki_pitawalpomorski.xml?201302')]
|
||||
|
||||
def print_version(self, url):
|
||||
return url.replace('artykul', 'drukuj')
|
||||
|
||||
def skip_ad_pages(self, soup):
|
||||
if 'Advertisement' in soup.title:
|
||||
nexturl=soup.find('a')['href']
|
||||
return self.index_to_soup(nexturl, raw=True)
|
||||
|
||||
def get_cover_url(self):
|
||||
soup = self.index_to_soup('http://www.prasa24.pl/gazeta/dziennik-baltycki/')
|
||||
self.cover_url=soup.find(id='pojemnik').img['src']
|
||||
return getattr(self, 'cover_url', self.cover_url)
|
35
recipes/dziennik_lodzki.recipe
Normal file
@ -0,0 +1,35 @@
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
|
||||
class DziennikLodzki(BasicNewsRecipe):
|
||||
title = u'Dziennik \u0141\xf3dzki'
|
||||
__author__ = 'fenuks'
|
||||
description = u'Gazeta Regionalna Dziennik Łódzki. Najnowsze Wiadomości Łódź. Czytaj Wiadomości Łódzkie!'
|
||||
category = 'newspaper'
|
||||
language = 'pl'
|
||||
encoding = 'iso-8859-2'
|
||||
masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/dzienniklodzki.png?24'
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
remove_empty_feeds = True
|
||||
no_stylesheets = True
|
||||
use_embedded_content = False
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
#preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
|
||||
remove_tags_after= dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
|
||||
remove_tags=[dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})]
|
||||
|
||||
feeds = [(u'Na sygnale', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_nasygnale.xml?201302'), (u'\u0141\xf3d\u017a', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_lodz.xml?201302'), (u'Opinie', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_opinie.xml?201302'), (u'Pieni\u0105dze', u'http://dzienniklodzki.feedsportal.com/c/32980/f/533763/index.rss?201302'), (u'Kultura', u'http://dzienniklodzki.feedsportal.com/c/32980/f/533762/index.rss?201302'), (u'Sport', u'http://dzienniklodzki.feedsportal.com/c/32980/f/533761/index.rss?201302'), (u'Akcje', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_akcje.xml?201302'), (u'M\xf3j Reporter', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_mojreporter.xml?201302'), (u'Studni\xf3wki', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_studniowki.xml?201302'), (u'Kraj', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_kraj.xml?201302'), (u'Zdrowie', u'http://www.dzienniklodzki.pl/rss/dzienniklodzki_zdrowie.xml?201302')]
|
||||
|
||||
|
||||
def print_version(self, url):
|
||||
return url.replace('artykul', 'drukuj')
|
||||
|
||||
def skip_ad_pages(self, soup):
|
||||
if 'Advertisement' in soup.title:
|
||||
nexturl=soup.find('a')['href']
|
||||
return self.index_to_soup(nexturl, raw=True)
|
||||
|
||||
def get_cover_url(self):
|
||||
soup = self.index_to_soup('http://www.prasa24.pl/gazeta/dziennik-lodzki/')
|
||||
self.cover_url=soup.find(id='pojemnik').img['src']
|
||||
return getattr(self, 'cover_url', self.cover_url)
|
@ -2,6 +2,8 @@
|
||||
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
import re
|
||||
from calibre.ebooks.BeautifulSoup import Comment
|
||||
|
||||
class Dziennik_pl(BasicNewsRecipe):
|
||||
title = u'Dziennik.pl'
|
||||
__author__ = 'fenuks'
|
||||
@ -9,17 +11,17 @@ class Dziennik_pl(BasicNewsRecipe):
|
||||
category = 'newspaper'
|
||||
language = 'pl'
|
||||
masthead_url= 'http://5.s.dziennik.pl/images/logos.png'
|
||||
cover_url= 'http://5.s.dziennik.pl/images/logos.png'
|
||||
cover_url = 'http://5.s.dziennik.pl/images/logos.png'
|
||||
no_stylesheets = True
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
remove_javascript=True
|
||||
remove_empty_feeds=True
|
||||
remove_javascript = True
|
||||
remove_empty_feeds = True
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
extra_css= 'ul {list-style: none; padding: 0; margin: 0;} li {float: left;margin: 0 0.15em;}'
|
||||
extra_css = 'ul {list-style: none; padding: 0; margin: 0;} li {float: left;margin: 0 0.15em;}'
|
||||
preprocess_regexps = [(re.compile("Komentarze:"), lambda m: ''), (re.compile('<p><strong><a href=".*?">>>> CZYTAJ TAKŻE: ".*?"</a></strong></p>'), lambda m: '')]
|
||||
keep_only_tags=[dict(id='article')]
|
||||
remove_tags=[dict(name='div', attrs={'class':['art_box_dodatki', 'new_facebook_icons2', 'leftArt', 'article_print', 'quiz-widget', 'belka-spol', 'belka-spol belka-spol-bottom', 'art_data_tags', 'cl_right', 'boxRounded gal_inside']}), dict(name='a', attrs={'class':['komentarz', 'article_icon_addcommnent']})]
|
||||
keep_only_tags = [dict(id='article')]
|
||||
remove_tags = [dict(name='div', attrs={'class':['art_box_dodatki', 'new_facebook_icons2', 'leftArt', 'article_print', 'quiz-widget', 'belka-spol', 'belka-spol belka-spol-bottom', 'art_data_tags', 'cl_right', 'boxRounded gal_inside']}), dict(name='a', attrs={'class':['komentarz', 'article_icon_addcommnent']})]
|
||||
feeds = [(u'Wszystko', u'http://rss.dziennik.pl/Dziennik-PL/'),
|
||||
(u'Wiadomości', u'http://rss.dziennik.pl/Dziennik-Wiadomosci'),
|
||||
(u'Gospodarka', u'http://rss.dziennik.pl/Dziennik-Gospodarka'),
|
||||
@ -34,26 +36,29 @@ class Dziennik_pl(BasicNewsRecipe):
|
||||
(u'Nieruchomości', u'http://rss.dziennik.pl/Dziennik-Nieruchomosci')]
|
||||
|
||||
def skip_ad_pages(self, soup):
|
||||
tag=soup.find(name='a', attrs={'title':'CZYTAJ DALEJ'})
|
||||
tag = soup.find(name='a', attrs={'title':'CZYTAJ DALEJ'})
|
||||
if tag:
|
||||
new_soup=self.index_to_soup(tag['href'], raw=True)
|
||||
new_soup = self.index_to_soup(tag['href'], raw=True)
|
||||
return new_soup
|
||||
|
||||
def append_page(self, soup, appendtag):
|
||||
tag=soup.find('a', attrs={'class':'page_next'})
|
||||
tag = soup.find('a', attrs={'class':'page_next'})
|
||||
if tag:
|
||||
appendtag.find('div', attrs={'class':'article_paginator'}).extract()
|
||||
while tag:
|
||||
soup2= self.index_to_soup(tag['href'])
|
||||
tag=soup2.find('a', attrs={'class':'page_next'})
|
||||
soup2 = self.index_to_soup(tag['href'])
|
||||
tag = soup2.find('a', attrs={'class':'page_next'})
|
||||
if not tag:
|
||||
for r in appendtag.findAll('div', attrs={'class':'art_src'}):
|
||||
r.extract()
|
||||
pagetext = soup2.find(name='div', attrs={'class':'article_body'})
|
||||
for dictionary in self.remove_tags:
|
||||
v=pagetext.findAll(name=dictionary['name'], attrs=dictionary['attrs'])
|
||||
v = pagetext.findAll(name=dictionary['name'], attrs=dictionary['attrs'])
|
||||
for delete in v:
|
||||
delete.extract()
|
||||
comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
|
||||
for comment in comments:
|
||||
comment.extract()
|
||||
pos = len(appendtag.contents)
|
||||
appendtag.insert(pos, pagetext)
|
||||
if appendtag.find('div', attrs={'class':'article_paginator'}):
|
||||
|
84
recipes/dziennik_wschodni.recipe
Normal file
@ -0,0 +1,84 @@
|
||||
import re
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
from calibre.ebooks.BeautifulSoup import Comment
|
||||
|
||||
class DziennikWschodni(BasicNewsRecipe):
|
||||
title = u'Dziennik Wschodni'
|
||||
__author__ = 'fenuks'
|
||||
description = u'Dziennik Wschodni - portal regionalny województwa lubelskiego.'
|
||||
category = 'newspaper'
|
||||
language = 'pl'
|
||||
encoding = 'iso-8859-2'
|
||||
extra_css = 'ul {list-style: none; padding:0; margin:0;}'
|
||||
INDEX = 'http://www.dziennikwschodni.pl'
|
||||
masthead_url = INDEX + '/images/top_logo.png'
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
remove_empty_feeds = True
|
||||
no_stylesheets = True
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
|
||||
preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
|
||||
(re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]
|
||||
|
||||
keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
|
||||
remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
|
||||
'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
|
||||
'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
|
||||
dict(attrs={'class':'articleFunctions'})]
|
||||
|
||||
|
||||
feeds = [(u'Wszystkie', u'http://www.dziennikwschodni.pl/rss.xml'),
|
||||
(u'Lublin', u'http://www.dziennikwschodni.pl/lublin.xml'),
|
||||
(u'Zamość', u'http://www.dziennikwschodni.pl/zamosc.xml'),
|
||||
(u'Biała Podlaska', u'http://www.dziennikwschodni.pl/biala_podlaska.xml'),
|
||||
(u'Chełm', u'http://www.dziennikwschodni.pl/chelm.xml'),
|
||||
(u'Kraśnik', u'http://www.dziennikwschodni.pl/krasnik.xml'),
|
||||
(u'Puławy', u'http://www.dziennikwschodni.pl/pulawy.xml'),
|
||||
(u'Świdnik', u'http://www.dziennikwschodni.pl/swidnik.xml'),
|
||||
(u'Łęczna', u'http://www.dziennikwschodni.pl/leczna.xml'),
|
||||
(u'Lubartów', u'http://www.dziennikwschodni.pl/lubartow.xml'),
|
||||
(u'Sport', u'http://www.dziennikwschodni.pl/sport.xml'),
|
||||
(u'Praca', u'http://www.dziennikwschodni.pl/praca.xml'),
|
||||
(u'Dom', u'http://www.dziennikwschodni.pl/dom.xml'),
|
||||
(u'Moto', u'http://www.dziennikwschodni.pl/moto.xml'),
|
||||
(u'Zdrowie', u'http://www.dziennikwschodni.pl/zdrowie.xml'),
|
||||
]
|
||||
|
||||
def get_cover_url(self):
|
||||
soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
|
||||
nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
|
||||
soup = self.index_to_soup(nexturl)
|
||||
self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
|
||||
return getattr(self, 'cover_url', self.cover_url)
|
||||
|
||||
def append_page(self, soup, appendtag):
|
||||
tag = soup.find('span', attrs={'class':'photoNavigationPages'})
|
||||
if tag:
|
||||
number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
|
||||
baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]
|
||||
|
||||
for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
|
||||
r.extract()
|
||||
for nr in range(2, number+1):
|
||||
soup2 = self.index_to_soup(baseurl + str(nr))
|
||||
pagetext = soup2.find(id='photoContainer')
|
||||
if pagetext:
|
||||
pos = len(appendtag.contents)
|
||||
appendtag.insert(pos, pagetext)
|
||||
pagetext = soup2.find(attrs={'class':'photoMeta'})
|
||||
if pagetext:
|
||||
pos = len(appendtag.contents)
|
||||
appendtag.insert(pos, pagetext)
|
||||
pagetext = soup2.find(attrs={'class':'photoStoryText'})
|
||||
if pagetext:
|
||||
pos = len(appendtag.contents)
|
||||
appendtag.insert(pos, pagetext)
|
||||
|
||||
comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
|
||||
for comment in comments:
|
||||
comment.extract()
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
self.append_page(soup, soup.body)
|
||||
return soup
|
34
recipes/dziennik_zachodni.recipe
Normal file
@ -0,0 +1,34 @@
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
|
||||
class DziennikZachodni(BasicNewsRecipe):
|
||||
title = u'Dziennik Zachodni'
|
||||
__author__ = 'fenuks'
|
||||
description = u'Gazeta Regionalna Dziennik Zachodni. Najnowsze Wiadomości Śląskie. Wiadomości Śląsk. Czytaj!'
|
||||
category = 'newspaper'
|
||||
language = 'pl'
|
||||
encoding = 'iso-8859-2'
|
||||
masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/dziennikzachodni.png?24'
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
remove_empty_feeds= True
|
||||
no_stylesheets = True
|
||||
use_embedded_content = False
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
#preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
|
||||
remove_tags_after= dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
|
||||
remove_tags=[dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'}), dict(attrs={'href':'http://www.dziennikzachodni.pl/piano'})]
|
||||
|
||||
feeds = [(u'Wszystkie', u'http://dziennikzachodni.feedsportal.com/c/32980/f/533764/index.rss?201302'), (u'Wiadomo\u015bci', u'http://dziennikzachodni.feedsportal.com/c/32980/f/533765/index.rss?201302'), (u'Regiony', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_regiony.xml?201302'), (u'Opinie', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_regiony.xml?201302'), (u'Blogi', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_blogi.xml?201302'), (u'Serwisy', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_serwisy.xml?201302'), (u'Sport', u'http://dziennikzachodni.feedsportal.com/c/32980/f/533766/index.rss?201302'), (u'M\xf3j Reporter', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_mojreporter.xml?201302'), (u'Na narty', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_nanarty.xml?201302'), (u'Drogi', u'http://www.dziennikzachodni.pl/rss/dziennikzachodni_drogi.xml?201302'), (u'Pieni\u0105dze', u'http://dziennikzachodni.feedsportal.com/c/32980/f/533768/index.rss?201302')]
|
||||
|
||||
def print_version(self, url):
|
||||
return url.replace('artykul', 'drukuj')
|
||||
|
||||
def skip_ad_pages(self, soup):
|
||||
if 'Advertisement' in soup.title:
|
||||
nexturl=soup.find('a')['href']
|
||||
return self.index_to_soup(nexturl, raw=True)
|
||||
|
||||
def get_cover_url(self):
|
||||
soup = self.index_to_soup('http://www.prasa24.pl/gazeta/dziennik-zachodni/')
|
||||
self.cover_url=soup.find(id='pojemnik').img['src']
|
||||
return getattr(self, 'cover_url', self.cover_url)
|
79
recipes/echo_dnia.recipe
Normal file
@ -0,0 +1,79 @@
|
||||
import re
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
from calibre.ebooks.BeautifulSoup import Comment
|
||||
|
||||
class EchoDnia(BasicNewsRecipe):
|
||||
title = u'Echo Dnia'
|
||||
__author__ = 'fenuks'
|
||||
description = u'Echo Dnia - portal regionalny świętokrzyskiego radomskiego i podkarpackiego. Najnowsze wiadomości z Twojego regionu, galerie, video, mp3.'
|
||||
category = 'newspaper'
|
||||
language = 'pl'
|
||||
encoding = 'iso-8859-2'
|
||||
extra_css = 'ul {list-style: none; padding:0; margin:0;}'
|
||||
INDEX = 'http://www.echodnia.eu'
|
||||
masthead_url = INDEX + '/images/top_logo.png'
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
remove_empty_feeds = True
|
||||
no_stylesheets = True
|
||||
ignore_duplicate_articles = {'title', 'url'}
|
||||
|
||||
preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
|
||||
(re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]
|
||||
|
||||
keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
|
||||
remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
|
||||
'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
|
||||
'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
|
||||
dict(attrs={'class':'articleFunctions'})]
|
||||
|
||||
feeds = [(u'Wszystkie', u'http://www.echodnia.eu/rss.xml'),
|
||||
(u'Świętokrzyskie', u'http://www.echodnia.eu/swietokrzyskie.xml'),
|
||||
(u'Radomskie', u'http://www.echodnia.eu/radomskie.xml'),
|
||||
(u'Podkarpackie', u'http://www.echodnia.eu/podkarpackie.xml'),
|
||||
(u'Sport \u015bwi\u0119tokrzyski', u'http://www.echodnia.eu/sport_swi.xml'),
|
||||
(u'Sport radomski', u'http://www.echodnia.eu/sport_rad.xml'),
|
||||
(u'Sport podkarpacki', u'http://www.echodnia.eu/sport_pod.xml'),
|
||||
(u'Pi\u0142ka no\u017cna', u'http://www.echodnia.eu/pilka.xml'),
|
||||
(u'Praca', u'http://www.echodnia.eu/praca.xml'),
|
||||
(u'Dom', u'http://www.echodnia.eu/dom.xml'),
|
||||
(u'Auto', u'http://www.echodnia.eu/auto.xml'),
|
||||
(u'Zdrowie', u'http://www.echodnia.eu/zdrowie.xml')]
|
||||
|
||||
def get_cover_url(self):
|
||||
soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
|
||||
nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
|
||||
soup = self.index_to_soup(nexturl)
|
||||
self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
|
||||
return getattr(self, 'cover_url', self.cover_url)
|
||||
|
||||
def append_page(self, soup, appendtag):
|
||||
tag = soup.find('span', attrs={'class':'photoNavigationPages'})
|
||||
if tag:
|
||||
number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
|
||||
baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]
|
||||
|
||||
for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
|
||||
r.extract()
|
||||
for nr in range(2, number+1):
|
||||
soup2 = self.index_to_soup(baseurl + str(nr))
|
||||
pagetext = soup2.find(id='photoContainer')
|
||||
if pagetext:
|
||||
pos = len(appendtag.contents)
|
||||
appendtag.insert(pos, pagetext)
|
||||
pagetext = soup2.find(attrs={'class':'photoMeta'})
|
||||
if pagetext:
|
||||
pos = len(appendtag.contents)
|
||||
appendtag.insert(pos, pagetext)
|
||||
pagetext = soup2.find(attrs={'class':'photoStoryText'})
|
||||
if pagetext:
|
||||
pos = len(appendtag.contents)
|
||||
appendtag.insert(pos, pagetext)
|
||||
|
||||
comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
|
||||
for comment in comments:
|
||||
comment.extract()
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
self.append_page(soup, soup.body)
|
||||
return soup
|
@ -1,8 +1,6 @@
|
||||
#!/usr/bin/env python
|
||||
|
||||
__license__ = 'GPL v3'
|
||||
__author__ = 'Mori'
|
||||
__version__ = 'v. 0.1'
|
||||
__license__ = 'GPL v3'
|
||||
'''
|
||||
blog.eclicto.pl
|
||||
'''
|
||||
@ -11,39 +9,39 @@ from calibre.web.feeds.news import BasicNewsRecipe
|
||||
import re
|
||||
|
||||
class BlogeClictoRecipe(BasicNewsRecipe):
|
||||
__author__ = 'Mori'
|
||||
language = 'pl'
|
||||
__author__ = 'Mori, Tomasz Długosz'
|
||||
language = 'pl'
|
||||
|
||||
title = u'Blog eClicto'
|
||||
publisher = u'Blog eClicto'
|
||||
description = u'Blog o e-papierze i e-bookach'
|
||||
title = u'Blog eClicto'
|
||||
publisher = u'Blog eClicto'
|
||||
description = u'Blog o e-papierze i e-bookach'
|
||||
|
||||
max_articles_per_feed = 100
|
||||
cover_url = 'http://blog.eclicto.pl/wordpress/wp-content/themes/blog_eclicto/g/logo.gif'
|
||||
max_articles_per_feed = 100
|
||||
cover_url = 'http://blog.eclicto.pl/wordpress/wp-content/themes/blog_eclicto/g/logo.gif'
|
||||
|
||||
no_stylesheets = True
|
||||
remove_javascript = True
|
||||
encoding = 'utf-8'
|
||||
no_stylesheets = True
|
||||
remove_javascript = True
|
||||
encoding = 'utf-8'
|
||||
|
||||
extra_css = '''
|
||||
img{float: left; padding-right: 10px; padding-bottom: 5px;}
|
||||
'''
|
||||
extra_css = '''
|
||||
img{float: left; padding-right: 10px; padding-bottom: 5px;}
|
||||
'''
|
||||
|
||||
feeds = [
|
||||
(u'Blog eClicto', u'http://blog.eclicto.pl/feed/')
|
||||
]
|
||||
feeds = [
|
||||
(u'Blog eClicto', u'http://blog.eclicto.pl/feed/')
|
||||
]
|
||||
|
||||
remove_tags = [
|
||||
dict(name = 'span', attrs = {'id' : 'tags'})
|
||||
]
|
||||
remove_tags = [
|
||||
dict(name = 'div', attrs = {'class' : 'social_bookmark'}),
|
||||
]
|
||||
|
||||
remove_tags_after = [
|
||||
dict(name = 'div', attrs = {'class' : 'post'})
|
||||
]
|
||||
keep_only_tags = [
|
||||
dict(name = 'div', attrs = {'class' : 'post'})
|
||||
]
|
||||
|
||||
preprocess_regexps = [
|
||||
(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
|
||||
[
|
||||
(r'\s*</', lambda match: '</'),
|
||||
]
|
||||
]
|
||||
preprocess_regexps = [
|
||||
(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
|
||||
[
|
||||
(r'\s*</', lambda match: '</'),
|
||||
]
|
||||
]
|
||||
|
@ -4,6 +4,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
|
||||
class eioba(BasicNewsRecipe):
|
||||
title = u'eioba'
|
||||
__author__ = 'fenuks'
|
||||
description = u'eioba.pl - daj się przeczytać!'
|
||||
cover_url = 'http://www.eioba.org/lay/logo_pl_v3.png'
|
||||
language = 'pl'
|
||||
oldest_article = 7
|
||||
|
@ -15,7 +15,8 @@ class EkologiaPl(BasicNewsRecipe):
|
||||
no_stylesheets = True
|
||||
remove_empty_feeds = True
|
||||
use_embedded_content = False
|
||||
remove_tags = [dict(attrs={'class':['ekoLogo', 'powrocArt', 'butonDrukuj']})]
|
||||
remove_attrs = ['style']
|
||||
remove_tags = [dict(attrs={'class':['ekoLogo', 'powrocArt', 'butonDrukuj', 'widget-social-buttons']})]
|
||||
|
||||
feeds = [(u'Wiadomo\u015bci', u'http://www.ekologia.pl/rss/20,53,0'), (u'\u015arodowisko', u'http://www.ekologia.pl/rss/20,56,0'), (u'Styl \u017cycia', u'http://www.ekologia.pl/rss/20,55,0')]
|
||||
|
||||
|
27
recipes/el_malpensante.recipe
Normal file
@ -0,0 +1,27 @@
|
||||
# coding=utf-8
|
||||
# https://github.com/iemejia/calibrecolombia
|
||||
|
||||
'''
|
||||
http://www.elmalpensante.com/
|
||||
'''
|
||||
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
|
||||
class ElMalpensante(BasicNewsRecipe):
|
||||
title = u'El Malpensante'
|
||||
language = 'es_CO'
|
||||
__author__ = 'Ismael Mejia <iemejia@gmail.com>'
|
||||
cover_url = 'http://elmalpensante.com/img/layout/logo.gif'
|
||||
description = 'El Malpensante'
|
||||
oldest_article = 30
|
||||
simultaneous_downloads = 20
|
||||
#tags = 'news, sport, blog'
|
||||
use_embedded_content = True
|
||||
remove_empty_feeds = True
|
||||
max_articles_per_feed = 100
|
||||
feeds = [(u'Artículos', u'http://www.elmalpensante.com/articulosRSS.php'),
|
||||
(u'Malpensantías', u'http://www.elmalpensante.com/malpensantiasRSS.php'),
|
||||
(u'Margaritas', u'http://www.elmalpensante.com/margaritasRSS.php'),
|
||||
# This one is almost the same as articulos so we leave articles
|
||||
# (u'Noticias', u'http://www.elmalpensante.com/noticiasRSS.php'),
|
||||
]
|
@ -5,7 +5,7 @@ class Elektroda(BasicNewsRecipe):
|
||||
title = u'Elektroda'
|
||||
oldest_article = 8
|
||||
__author__ = 'fenuks'
|
||||
description = 'Elektroda.pl'
|
||||
description = 'Międzynarodowy portal elektroniczny udostępniający bogate zasoby z dziedziny elektroniki oraz forum dyskusyjne.'
|
||||
cover_url = 'http://demotywatory.elektroda.pl/Thunderpic/logo.gif'
|
||||
category = 'electronics'
|
||||
language = 'pl'
|
||||
|
93
recipes/elguardian.recipe
Normal file
@ -0,0 +1,93 @@
|
||||
__license__ = 'GPL v3'
|
||||
__copyright__ = '2013, Darko Miletic <darko.miletic at gmail.com>'
|
||||
'''
|
||||
elguardian.com.ar
|
||||
'''
|
||||
|
||||
import re
|
||||
from calibre.web.feeds.news import BasicNewsRecipe
|
||||
|
||||
class ElGuardian(BasicNewsRecipe):
|
||||
title = 'El Guardian'
|
||||
__author__ = 'Darko Miletic'
|
||||
description = "Semanario con todas las tendencias de un pais"
|
||||
publisher = 'Editorial Apache SA'
|
||||
category = 'news,politics,Argentina'
|
||||
oldest_article = 8
|
||||
max_articles_per_feed = 200
|
||||
no_stylesheets = True
|
||||
encoding = 'utf8'
|
||||
use_embedded_content = False
|
||||
language = 'es_AR'
|
||||
remove_empty_feeds = True
|
||||
publication_type = 'magazine'
|
||||
issn = '1666-7476'
|
||||
masthead_url = 'http://elguardian.com.ar/application/templates/frontend/images/home/logo.png'
|
||||
extra_css = """
|
||||
body{font-family: Arial,sans-serif}
|
||||
img{margin-bottom: 0.4em; display:block}
|
||||
"""
|
||||
|
||||
conversion_options = {
|
||||
'comment' : description
|
||||
, 'tags' : category
|
||||
, 'publisher' : publisher
|
||||
, 'language' : language
|
||||
, 'series' : title
|
||||
, 'isbn' : issn
|
||||
}
|
||||
|
||||
keep_only_tags = [dict(attrs={'class':['fotos', 'header_nota', 'nota']})]
|
||||
remove_tags = [dict(name=['meta','link','iframe','embed','object'])]
|
||||
remove_attributes = ['lang']
|
||||
|
||||
feeds = [
|
||||
(u'El Pais' , u'http://elguardian.com.ar/RSS/el-pais.xml' )
|
||||
,(u'Columnistas' , u'http://elguardian.com.ar/RSS/columnistas.xml' )
|
||||
,(u'Personajes' , u'http://elguardian.com.ar/RSS/personajes.xml' )
|
||||
,(u'Tinta roja' , u'http://elguardian.com.ar/RSS/tinta-roja.xml' )
|
||||
,(u'Yo fui' , u'http://elguardian.com.ar/RSS/yo-fui.xml' )
|
||||
,(u'Ciencia' , u'http://elguardian.com.ar/RSS/ciencia.xml' )
|
||||
,(u'Cronicas' , u'http://elguardian.com.ar/RSS/cronicas.xml' )
|
||||
,(u'Culturas' , u'http://elguardian.com.ar/RSS/culturas.xml' )
|
||||
,(u'DxT' , u'http://elguardian.com.ar/RSS/dxt.xml' )
|
||||
,(u'Fierros' , u'http://elguardian.com.ar/RSS/fierros.xml' )
|
||||
,(u'Frente fashion', u'http://elguardian.com.ar/RSS/frente-fashion.xml')
|
||||
,(u'Pan y vino' , u'http://elguardian.com.ar/RSS/pan-y-vino.xml' )
|
||||
,(u'Turismo' , u'http://elguardian.com.ar/RSS/turismo.xml' )
|
||||
]
|
||||
|
||||
def get_cover_url(self):
|
||||
soup = self.index_to_soup('http://elguardian.com.ar/')
|
||||
udata = soup.find('div', attrs={'class':'datosNumero'})
|
||||
if udata:
|
||||
sdata = udata.find('div')
|
||||
if sdata:
|
||||
stra = re.findall(r'\d+', self.tag_to_string(sdata))
|
||||
self.conversion_options.update({'series_index':int(stra[1])})
|
||||
unumero = soup.find('div', attrs={'class':'ultimoNumero'})
|
||||
if unumero:
|
||||
img = unumero.find('img', src=True)
|
||||
if img:
|
||||
return img['src']
|
||||
return None
|
||||
|
||||
def preprocess_html(self, soup):
|
||||
for item in soup.findAll(style=True):
|
||||
del item['style']
|
||||
for item in soup.findAll('a'):
|
||||
limg = item.find('img')
|
||||
if item.string is not None:
|
||||
str = item.string
|
||||
item.replaceWith(str)
|
||||
else:
|
||||
if limg:
|
||||
item.name = 'div'
|
||||
item.attrs = []
|
||||
else:
|
||||
str = self.tag_to_string(item)
|
||||
item.replaceWith(str)
|
||||
for item in soup.findAll('img'):
|
||||
if not item.has_key('alt'):
|
||||
item['alt'] = 'image'
|
||||
return soup
|
@ -12,6 +12,7 @@ class eMuzyka(BasicNewsRecipe):
|
||||
no_stylesheets = True
|
||||
oldest_article = 7
|
||||
max_articles_per_feed = 100
|
||||
remove_attributes = ['style']
|
||||
keep_only_tags=[dict(name='div', attrs={'id':'news_container'}), dict(name='h3'), dict(name='div', attrs={'class':'review_text'})]
|
||||
remove_tags=[dict(name='span', attrs={'id':'date'})]
|
||||
feeds = [(u'Aktualno\u015bci', u'http://www.emuzyka.pl/rss.php?f=1'), (u'Recenzje', u'http://www.emuzyka.pl/rss.php?f=2')]
|
||||
|
@ -3,85 +3,153 @@
__license__ = 'GPL v3'
__copyright__ = '2010, matek09, matek09@gmail.com'

from calibre.web.feeds.news import BasicNewsRecipe
import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Comment

class Esensja(BasicNewsRecipe):

    title = u'Esensja'
    __author__ = 'matek09'
    description = 'Monthly magazine'
    encoding = 'utf-8'
    no_stylesheets = True
    language = 'pl'
    remove_javascript = True
    HREF = '0'
    title = u'Esensja'
    __author__ = 'matek09 & fenuks'
    description = 'Magazyn kultury popularnej'
    encoding = 'utf-8'
    no_stylesheets = True
    language = 'pl'
    remove_javascript = True
    masthead_url = 'http://esensja.pl/img/wrss.gif'
    oldest_article = 1
    URL = 'http://esensja.pl'
    HREF = '0'
    remove_attributes = ['style', 'bgcolor', 'alt', 'color']
    keep_only_tags = [dict(attrs={'class':'sekcja'}), ]
    #keep_only_tags.append(dict(name = 'div', attrs = {'class' : 'article'})
    #remove_tags_before = dict(dict(name = 'div', attrs = {'class' : 't-title'}))
    remove_tags_after = dict(id='tekst')

    #keep_only_tags =[]
    #keep_only_tags.append(dict(name = 'div', attrs = {'class' : 'article'})
    remove_tags_before = dict(dict(name = 'div', attrs = {'class' : 't-title'}))
    remove_tags_after = dict(dict(name = 'img', attrs = {'src' : '../../../2000/01/img/tab_bot.gif'}))
    remove_tags = [dict(name = 'img', attrs = {'src' : ['../../../2000/01/img/tab_top.gif', '../../../2000/01/img/tab_bot.gif']}),
                   dict(name = 'div', attrs = {'class' : 't-title2 nextpage'}),
                   #dict(attrs={'rel':'lightbox[galeria]'})
                   dict(attrs={'class':['tekst_koniec', 'ref', 'wykop']}),
                   dict(attrs={'itemprop':['copyrightHolder', 'publisher']}),
                   dict(id='komentarze')

    remove_tags =[]
    remove_tags.append(dict(name = 'img', attrs = {'src' : '../../../2000/01/img/tab_top.gif'}))
    remove_tags.append(dict(name = 'img', attrs = {'src' : '../../../2000/01/img/tab_bot.gif'}))
    remove_tags.append(dict(name = 'div', attrs = {'class' : 't-title2 nextpage'}))
                   ]

    extra_css = '''
        .t-title {font-size: x-large; font-weight: bold; text-align: left}
        .t-author {font-size: x-small; text-align: left}
        .t-title2 {font-size: x-small; font-style: italic; text-align: left}
        .text {font-size: small; text-align: left}
        .annot-ref {font-style: italic; text-align: left}
    '''
    extra_css = '''
        .t-title {font-size: x-large; font-weight: bold; text-align: left}
        .t-author {font-size: x-small; text-align: left}
        .t-title2 {font-size: x-small; font-style: italic; text-align: left}
        .text {font-size: small; text-align: left}
        .annot-ref {font-style: italic; text-align: left}
    '''

    preprocess_regexps = [(re.compile(r'alt="[^"]*"'),
                          lambda match: '')]
    preprocess_regexps = [(re.compile(r'alt="[^"]*"'), lambda match: ''),
                          (re.compile(ur'(title|alt)="[^"]*?"', re.DOTALL), lambda match: ''),
                          ]

    def parse_index(self):
        soup = self.index_to_soup('http://www.esensja.pl/magazyn/')
        a = soup.find('a', attrs={'href' : re.compile('.*/index.html')})
        year = a['href'].split('/')[0]
        month = a['href'].split('/')[1]
        self.HREF = 'http://www.esensja.pl/magazyn/' + year + '/' + month + '/iso/'
        soup = self.index_to_soup(self.HREF + '01.html')
        self.cover_url = 'http://www.esensja.pl/magazyn/' + year + '/' + month + '/img/ilustr/cover_b.jpg'
        feeds = []
        intro = soup.find('div', attrs={'class' : 'n-title'})
        introduction = {'title' : self.tag_to_string(intro.a),
                        'url' : self.HREF + intro.a['href'],
                        'date' : '',
                        'description' : ''}
        chapter = 'Wprowadzenie'
        subchapter = ''
        articles = []
        articles.append(introduction)
        for tag in intro.findAllNext(attrs={'class': ['chapter', 'subchapter', 'n-title']}):
            if tag.name in 'td':
                if len(articles) > 0:
                    section = chapter
                    if len(subchapter) > 0:
                        section += ' - ' + subchapter
                    feeds.append((section, articles))
                    articles = []
                if tag['class'] == 'chapter':
                    chapter = self.tag_to_string(tag).capitalize()
                    subchapter = ''
                else:
                    subchapter = self.tag_to_string(tag)
                continue
            articles.append({'title' : self.tag_to_string(tag.a), 'url' : self.HREF + tag.a['href'], 'date' : '', 'description' : ''})
    def parse_index(self):
        soup = self.index_to_soup('http://www.esensja.pl/magazyn/')
        a = soup.find('a', attrs={'href' : re.compile('.*/index.html')})
        year = a['href'].split('/')[0]
        month = a['href'].split('/')[1]
        self.HREF = 'http://www.esensja.pl/magazyn/' + year + '/' + month + '/iso/'
        soup = self.index_to_soup(self.HREF + '01.html')
        self.cover_url = 'http://www.esensja.pl/magazyn/' + year + '/' + month + '/img/ilustr/cover_b.jpg'
        feeds = []
        chapter = ''
        subchapter = ''
        articles = []
        intro = soup.find('div', attrs={'class' : 'n-title'})
        '''
        introduction = {'title' : self.tag_to_string(intro.a),
                        'url' : self.HREF + intro.a['href'],
                        'date' : '',
                        'description' : ''}
        chapter = 'Wprowadzenie'
        articles.append(introduction)
        '''

        a = self.index_to_soup(self.HREF + tag.a['href'])
        i = 1
        while True:
            div = a.find('div', attrs={'class' : 't-title2 nextpage'})
            if div is not None:
                a = self.index_to_soup(self.HREF + div.a['href'])
                articles.append({'title' : self.tag_to_string(tag.a) + ' c. d. ' + str(i), 'url' : self.HREF + div.a['href'], 'date' : '', 'description' : ''})
                i = i + 1
            else:
                break
        for tag in intro.findAllNext(attrs={'class': ['chapter', 'subchapter', 'n-title']}):
            if tag.name in 'td':
                if len(articles) > 0:
                    section = chapter
                    if len(subchapter) > 0:
                        section += ' - ' + subchapter
                    feeds.append((section, articles))
                    articles = []
                if tag['class'] == 'chapter':
                    chapter = self.tag_to_string(tag).capitalize()
                    subchapter = ''
                else:
                    subchapter = self.tag_to_string(tag)
                continue

            finalurl = tag.a['href']
            if not finalurl.startswith('http'):
                finalurl = self.HREF + finalurl
            articles.append({'title' : self.tag_to_string(tag.a), 'url' : finalurl, 'date' : '', 'description' : ''})

            a = self.index_to_soup(finalurl)
            i = 1

            while True:
                div = a.find('div', attrs={'class' : 't-title2 nextpage'})
                if div is not None:
                    link = div.a['href']
                    if not link.startswith('http'):
                        link = self.HREF + link
                    a = self.index_to_soup(link)
                    articles.append({'title' : self.tag_to_string(tag.a) + ' c. d. ' + str(i), 'url' : link, 'date' : '', 'description' : ''})
                    i = i + 1
                else:
                    break

        return feeds
    def append_page(self, soup, appendtag):
        r = appendtag.find(attrs={'class':'wiecej_xxx'})
        if r:
            nr = r.findAll(attrs={'class':'tn-link'})[-1]
            try:
                nr = int(nr.a.string)
            except:
                return
            baseurl = soup.find(attrs={'property':'og:url'})['content'] + '&strona={0}'
            for number in range(2, nr+1):
                soup2 = self.index_to_soup(baseurl.format(number))
                pagetext = soup2.find(attrs={'class':'tresc'})
                pos = len(appendtag.contents)
                appendtag.insert(pos, pagetext)
            for r in appendtag.findAll(attrs={'class':['wiecej_xxx', 'tekst_koniec']}):
                r.extract()
            for r in appendtag.findAll('script'):
                r.extract()

            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
            for comment in comments:
                comment.extract()

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
        for tag in soup.findAll(attrs={'class':'img_box_right'}):
            temp = tag.find('img')
            src = ''
            if temp:
                src = temp.get('src', '')
            for r in tag.findAll('a', recursive=False):
                r.extract()
            info = tag.find(attrs={'class':'img_info'})
            text = str(tag)
            if not src:
                src = re.search('src="[^"]*?"', text)
                if src:
                    src = src.group(0)
                    src = src[5:].replace('//', '/')
            if src:
                tag.contents = []
                tag.insert(0, BeautifulSoup('<img src="{0}{1}" />'.format(self.URL, src)))
            if info:
                tag.insert(len(tag.contents), info)
        return soup

        return feeds

109
recipes/esensja_(rss).recipe
Normal file
@ -0,0 +1,109 @@
__license__ = 'GPL v3'
import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Comment

class EsensjaRSS(BasicNewsRecipe):
    title = u'Esensja (RSS)'
    __author__ = 'fenuks'
    description = u'Magazyn kultury popularnej'
    category = 'reading, fantasy, reviews, boardgames, culture'
    #publication_type = ''
    language = 'pl'
    encoding = 'utf-8'
    INDEX = 'http://www.esensja.pl'
    extra_css = '''.t-title {font-size: x-large; font-weight: bold; text-align: left}
                   .t-author {font-size: x-small; text-align: left}
                   .t-title2 {font-size: x-small; font-style: italic; text-align: left}
                   .text {font-size: small; text-align: left}
                   .annot-ref {font-style: italic; text-align: left}
                   '''
    cover_url = ''
    masthead_url = 'http://esensja.pl/img/wrss.gif'
    use_embedded_content = False
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_empty_feeds = True
    remove_javascript = True
    ignore_duplicate_articles = {'title', 'url'}
    preprocess_regexps = [(re.compile(r'alt="[^"]*"'), lambda match: ''),
                          (re.compile(ur'(title|alt)="[^"]*?"', re.DOTALL), lambda match: ''),
                          ]
    remove_attributes = ['style', 'bgcolor', 'alt', 'color']
    keep_only_tags = [dict(attrs={'class':'sekcja'}), ]
    remove_tags_after = dict(id='tekst')

    remove_tags = [dict(name = 'img', attrs = {'src' : ['../../../2000/01/img/tab_top.gif', '../../../2000/01/img/tab_bot.gif']}),
                   dict(name = 'div', attrs = {'class' : 't-title2 nextpage'}),
                   #dict(attrs={'rel':'lightbox[galeria]'})
                   dict(attrs={'class':['tekst_koniec', 'ref', 'wykop']}),
                   dict(attrs={'itemprop':['copyrightHolder', 'publisher']}),
                   dict(id='komentarze')
                   ]

    feeds = [(u'Książka', u'http://esensja.pl/rss/ksiazka.rss'),
             (u'Film', u'http://esensja.pl/rss/film.rss'),
             (u'Komiks', u'http://esensja.pl/rss/komiks.rss'),
             (u'Gry', u'http://esensja.pl/rss/gry.rss'),
             (u'Muzyka', u'http://esensja.pl/rss/muzyka.rss'),
             (u'Twórczość', u'http://esensja.pl/rss/tworczosc.rss'),
             (u'Varia', u'http://esensja.pl/rss/varia.rss'),
             (u'Zgryźliwi Tetrycy', u'http://esensja.pl/rss/tetrycy.rss'),
             (u'Nowe książki', u'http://esensja.pl/rss/xnowosci.rss'),
             (u'Ostatnio dodane książki', u'http://esensja.pl/rss/xdodane.rss'),
             ]

    def get_cover_url(self):
        soup = self.index_to_soup(self.INDEX)
        cover = soup.find(id='panel_1')
        self.cover_url = self.INDEX + cover.find('a')['href'].replace('index.html', '') + 'img/ilustr/cover_b.jpg'
        return getattr(self, 'cover_url', self.cover_url)


    def append_page(self, soup, appendtag):
        r = appendtag.find(attrs={'class':'wiecej_xxx'})
        if r:
            nr = r.findAll(attrs={'class':'tn-link'})[-1]
            try:
                nr = int(nr.a.string)
            except:
                return
            baseurl = soup.find(attrs={'property':'og:url'})['content'] + '&strona={0}'
            for number in range(2, nr+1):
                soup2 = self.index_to_soup(baseurl.format(number))
                pagetext = soup2.find(attrs={'class':'tresc'})
                pos = len(appendtag.contents)
                appendtag.insert(pos, pagetext)
            for r in appendtag.findAll(attrs={'class':['wiecej_xxx', 'tekst_koniec']}):
                r.extract()
            for r in appendtag.findAll('script'):
                r.extract()

            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
            for comment in comments:
                comment.extract()


    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
        for tag in soup.findAll(attrs={'class':'img_box_right'}):
            temp = tag.find('img')
            src = ''
            if temp:
                src = temp.get('src', '')
            for r in tag.findAll('a', recursive=False):
                r.extract()
            info = tag.find(attrs={'class':'img_info'})
            text = str(tag)
            if not src:
                src = re.search('src="[^"]*?"', text)
                if src:
                    src = src.group(0)
                    src = src[5:].replace('//', '/')
            if src:
                tag.contents = []
                tag.insert(0, BeautifulSoup('<img src="{0}{1}" />'.format(self.INDEX, src)))
            if info:
                tag.insert(len(tag.contents), info)
        return soup
23
recipes/eso_pl.recipe
Normal file
@ -0,0 +1,23 @@
from calibre.web.feeds.news import BasicNewsRecipe

class ESO(BasicNewsRecipe):
    title = u'ESO PL'
    __author__ = 'fenuks'
    description = u'ESO, Europejskie Obserwatorium Południowe, buduje i obsługuje najbardziej zaawansowane naziemne teleskopy astronomiczne na świecie'
    category = 'astronomy'
    language = 'pl'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_empty_feeds = True
    use_embedded_content = False
    cover_url = 'https://twimg0-a.akamaihd.net/profile_images/1922519424/eso-twitter-logo.png'
    keep_only_tags = [dict(attrs={'class':'subcl'})]
    remove_tags = [dict(id='lang_row'), dict(attrs={'class':['pr_typeid', 'pr_news_feature_link', 'outreach_usage', 'hidden']})]
    feeds = [(u'Wiadomo\u015bci', u'http://www.eso.org/public/poland/news/feed/'), (u'Og\u0142oszenia', u'http://www.eso.org/public/poland/announcements/feed/'), (u'Zdj\u0119cie tygodnia', u'http://www.eso.org/public/poland/images/potw/feed/')]

    def preprocess_html(self, soup):
        for a in soup.findAll('a', href=True):
            if a['href'].startswith('/'):
                a['href'] = 'http://www.eso.org' + a['href']
        return soup
@ -22,14 +22,14 @@ class f1ultra(BasicNewsRecipe):
    remove_tags.append(dict(name = 'hr', attrs = {'size' : '2'}))

    preprocess_regexps = [(re.compile(r'align="left"'), lambda match: ''),
        (re.compile(r'align="right"'), lambda match: ''),
        (re.compile(r'width=\"*\"'), lambda match: ''),
        (re.compile(r'\<table .*?\>'), lambda match: '')]
                          (re.compile(r'align="right"'), lambda match: ''),
                          (re.compile(r'width=\"*\"'), lambda match: ''),
                          (re.compile(r'\<table .*?\>'), lambda match: '')]


    extra_css = '''.contentheading { font-size: 1.4em; font-weight: bold; }
        img { display: block; clear: both;}
        '''
                   img { display: block; clear: both;}
                   '''
    remove_attributes = ['width','height','position','float','padding-left','padding-right','padding','text-align']

    feeds = [(u'F1 Ultra', u'http://www.f1ultra.pl/index.php?option=com_rd_rss&id=1&Itemid=245')]

@ -1,24 +1,27 @@
from calibre.web.feeds.news import BasicNewsRecipe
import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class FilmWebPl(BasicNewsRecipe):
    title = u'FilmWeb'
    __author__ = 'fenuks'
    description = 'FilmWeb - biggest polish movie site'
    cover_url = 'http://userlogos.org/files/logos/crudus/filmweb.png'
    description = 'Filmweb.pl - Filmy takie jak Ty Filmweb to największy i najczęściej odwiedzany polski serwis filmowy. Największa baza filmów, seriali i aktorów, repertuar kin i tv, ...'
    cover_url = 'http://gfx.filmweb.pl/n/logo-filmweb-bevel.jpg'
    category = 'movies'
    language = 'pl'
    index='http://www.filmweb.pl'
    index = 'http://www.filmweb.pl'
    #extra_css = '.MarkupPhotoHTML-7 {float:left; margin-right: 10px;}'
    oldest_article = 8
    max_articles_per_feed = 100
    no_stylesheets= True
    remove_empty_feeds=True
    no_stylesheets = True
    remove_empty_feeds = True
    ignore_duplicate_articles = {'title', 'url'}
    preprocess_regexps = [(re.compile(u'\(kliknij\,\ aby powiększyć\)', re.IGNORECASE), lambda m: ''), ]#(re.compile(ur' | ', re.IGNORECASE), lambda m: '')]
    remove_javascript = True
    preprocess_regexps = [(re.compile(u'\(kliknij\,\ aby powiększyć\)', re.IGNORECASE), lambda m: ''), (re.compile(ur'(<br ?/?>\s*?<br ?/?>\s*?)+', re.IGNORECASE), lambda m: '<br />')]#(re.compile(ur' | ', re.IGNORECASE), lambda m: '')]
    extra_css = '.hdrBig {font-size:22px;} ul {list-style-type:none; padding: 0; margin: 0;}'
    remove_tags= [dict(name='div', attrs={'class':['recommendOthers']}), dict(name='ul', attrs={'class':'fontSizeSet'}), dict(attrs={'class':'userSurname anno'})]
    remove_tags = [dict(name='div', attrs={'class':['recommendOthers']}), dict(name='ul', attrs={'class':'fontSizeSet'}), dict(attrs={'class':'userSurname anno'})]
    remove_attributes = ['style',]
    keep_only_tags= [dict(name='h1', attrs={'class':['hdrBig', 'hdrEntity']}), dict(name='div', attrs={'class':['newsInfo', 'newsInfoSmall', 'reviewContent description']})]
    keep_only_tags = [dict(name='h1', attrs={'class':['hdrBig', 'hdrEntity']}), dict(name='div', attrs={'class':['newsInfo', 'newsInfoSmall', 'reviewContent description']})]
    feeds = [(u'News / Filmy w produkcji', 'http://www.filmweb.pl/feed/news/category/filminproduction'),
             (u'News / Festiwale, nagrody i przeglądy', u'http://www.filmweb.pl/feed/news/category/festival'),
             (u'News / Seriale', u'http://www.filmweb.pl/feed/news/category/serials'),
@ -42,6 +45,11 @@ class FilmWebPl(BasicNewsRecipe):
        if skip_tag is not None:
            return self.index_to_soup(skip_tag['href'], raw=True)

    def postprocess_html(self, soup, first_fetch):
        for r in soup.findAll(attrs={'class':'singlephoto'}):
            r['style'] = 'float:left; margin-right: 10px;'
        return soup

    def preprocess_html(self, soup):
        for a in soup('a'):
            if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
@ -56,4 +64,8 @@ class FilmWebPl(BasicNewsRecipe):
                tag.name = 'div'
                for t in tag.findAll('li'):
                    t.name = 'div'
        for r in soup.findAll(id=re.compile('photo-\d+')):
            r.extract()
        for r in soup.findAll(style=re.compile('float: ?left')):
            r['class'] = 'singlephoto'
        return soup

182
recipes/financial_times_us.recipe
Normal file
@ -0,0 +1,182 @@
__license__ = 'GPL v3'
__copyright__ = '2013, Darko Miletic <darko.miletic at gmail.com>'
'''
http://www.ft.com/intl/us-edition
'''

import datetime
from calibre.ptempfile import PersistentTemporaryFile
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class FinancialTimes(BasicNewsRecipe):
    title = 'Financial Times (US) printed edition'
    __author__ = 'Darko Miletic'
    description = "The Financial Times (FT) is one of the world's leading business news and information organisations, recognised internationally for its authority, integrity and accuracy."
    publisher = 'The Financial Times Ltd.'
    category = 'news, finances, politics, UK, World'
    oldest_article = 2
    language = 'en'
    max_articles_per_feed = 250
    no_stylesheets = True
    use_embedded_content = False
    needs_subscription = True
    encoding = 'utf8'
    publication_type = 'newspaper'
    articles_are_obfuscated = True
    temp_files = []
    masthead_url = 'http://im.media.ft.com/m/img/masthead_main.jpg'
    LOGIN = 'https://registration.ft.com/registration/barrier/login'
    LOGIN2 = 'http://media.ft.com/h/subs3.html'
    INDEX = 'http://www.ft.com/intl/us-edition'
    PREFIX = 'http://www.ft.com'

    conversion_options = {
        'comment' : description
        , 'tags' : category
        , 'publisher' : publisher
        , 'language' : language
        , 'linearize_tables' : True
    }

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        br.open(self.INDEX)
        if self.username is not None and self.password is not None:
            br.open(self.LOGIN2)
            br.select_form(name='loginForm')
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    keep_only_tags = [
        dict(name='div' , attrs={'class':['fullstory fullstoryHeader', 'ft-story-header']})
        ,dict(name='div' , attrs={'class':'standfirst'})
        ,dict(name='div' , attrs={'id' :'storyContent'})
        ,dict(name='div' , attrs={'class':['ft-story-body','index-detail']})
        ,dict(name='h2' , attrs={'class':'entry-title'} )
        ,dict(name='span', attrs={'class':lambda x: x and 'posted-on' in x.split()} )
        ,dict(name='span', attrs={'class':'author_byline'} )
        ,dict(name='div' , attrs={'class':'entry-content'} )
    ]
    remove_tags = [
        dict(name='div', attrs={'id':'floating-con'})
        ,dict(name=['meta','iframe','base','object','embed','link'])
        ,dict(attrs={'class':['storyTools','story-package','screen-copy','story-package separator','expandable-image']})
    ]
    remove_attributes = ['width','height','lang']

    extra_css = """
        body{font-family: Georgia,Times,"Times New Roman",serif}
        h2{font-size:large}
        .ft-story-header{font-size: x-small}
        .container{font-size:x-small;}
        h3{font-size:x-small;color:#003399;}
        .copyright{font-size: x-small}
        img{margin-top: 0.8em; display: block}
        .lastUpdated{font-family: Arial,Helvetica,sans-serif; font-size: x-small}
        .byline,.ft-story-body,.ft-story-header{font-family: Arial,Helvetica,sans-serif}
    """

    def get_artlinks(self, elem):
        articles = []
        count = 0
        for item in elem.findAll('a',href=True):
            count = count + 1
            if self.test and count > 2:
                return articles
            rawlink = item['href']
            url = rawlink
            if not rawlink.startswith('http://'):
                url = self.PREFIX + rawlink
            try:
                urlverified = self.browser.open_novisit(url).geturl() # resolve redirect.
            except:
                continue
            title = self.tag_to_string(item)
            date = strftime(self.timefmt)
            articles.append({
                'title' :title
                ,'date' :date
                ,'url' :urlverified
                ,'description':''
            })
        return articles

    def parse_index(self):
        feeds = []
        soup = self.index_to_soup(self.INDEX)
        dates= self.tag_to_string(soup.find('div', attrs={'class':'btm-links'}).find('div'))
        self.timefmt = ' [%s]'%dates
        wide = soup.find('div',attrs={'class':'wide'})
        if not wide:
            return feeds
        allsections = wide.findAll(attrs={'class':lambda x: x and 'footwell' in x.split()})
        if not allsections:
            return feeds
        count = 0
        for item in allsections:
            count = count + 1
            if self.test and count > 2:
                return feeds
            fitem = item.h3
            if not fitem:
                fitem = item.h4
            ftitle = self.tag_to_string(fitem)
            self.report_progress(0, _('Fetching feed')+' %s...'%(ftitle))
            feedarts = self.get_artlinks(item.ul)
            feeds.append((ftitle,feedarts))
        return feeds

    def preprocess_html(self, soup):
        items = ['promo-box','promo-title',
                 'promo-headline','promo-image',
                 'promo-intro','promo-link','subhead']
        for item in items:
            for it in soup.findAll(item):
                it.name = 'div'
                it.attrs = []
        for item in soup.findAll(style=True):
            del item['style']
        for item in soup.findAll('a'):
            limg = item.find('img')
            if item.string is not None:
                str = item.string
                item.replaceWith(str)
            else:
                if limg:
                    item.name = 'div'
                    item.attrs = []
                else:
                    str = self.tag_to_string(item)
                    item.replaceWith(str)
        for item in soup.findAll('img'):
            if not item.has_key('alt'):
                item['alt'] = 'image'
        return soup

    def get_cover_url(self):
        cdate = datetime.date.today()
        if cdate.isoweekday() == 7:
            cdate -= datetime.timedelta(days=1)
        return cdate.strftime('http://specials.ft.com/vtf_pdf/%d%m%y_FRONT1_USA.pdf')

    def get_obfuscated_article(self, url):
        count = 0
        while (count < 10):
            try:
                response = self.browser.open(url)
                html = response.read()
                count = 10
            except:
                print "Retrying download..."
                count += 1
        tfile = PersistentTemporaryFile('_fa.html')
        tfile.write(html)
        tfile.close()
        self.temp_files.append(tfile)
        return tfile.name

    def cleanup(self):
        self.browser.open('https://registration.ft.com/registration/login/logout?location=')
@ -13,7 +13,7 @@ class FocusRecipe(BasicNewsRecipe):
    title = u'Focus'
    publisher = u'Gruner + Jahr Polska'
    category = u'News'
    description = u'Newspaper'
    description = u'Focus.pl - pierwszy w Polsce portal społecznościowy dla miłośników nauki. Tematyka: nauka, historia, cywilizacja, technika, przyroda, sport, gadżety'
    category = 'magazine'
    cover_url = ''
    remove_empty_feeds = True

@ -3,6 +3,7 @@ from calibre.web.feeds.news import BasicNewsRecipe
class Fotoblogia_pl(BasicNewsRecipe):
    title = u'Fotoblogia.pl'
    __author__ = 'fenuks'
    description = u'Jeden z największych polskich blogów o fotografii.'
    category = 'photography'
    language = 'pl'
    masthead_url = 'http://img.interia.pl/komputery/nimg/u/0/fotoblogia21.jpg'
@ -11,6 +12,6 @@ class Fotoblogia_pl(BasicNewsRecipe):
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    keep_only_tags=[dict(name='div', attrs={'class':'post-view post-standard'})]
    keep_only_tags=[dict(name='div', attrs={'class':['post-view post-standard', 'photo-container']})]
    remove_tags=[dict(attrs={'class':['external fotoblogia', 'categories', 'tags']})]
    feeds = [(u'Wszystko', u'http://fotoblogia.pl/feed/rss2')]

@ -18,6 +18,7 @@ class FrazPC(BasicNewsRecipe):
    max_articles_per_feed = 100
    use_embedded_content = False
    no_stylesheets = True
    remove_empty_feeds = True
    cover_url='http://www.frazpc.pl/images/logo.png'
    feeds = [
        (u'Aktualno\u015bci', u'http://www.frazpc.pl/feed/aktualnosci'),

@ -1,7 +1,7 @@
#!/usr/bin/env python

__license__ = 'GPL v3'
__copyright__ = u'2010-2012, Tomasz Dlugosz <tomek3d@gmail.com>'
__copyright__ = u'2010-2013, Tomasz Dlugosz <tomek3d@gmail.com>'
'''
fronda.pl
'''
@ -23,7 +23,6 @@ class Fronda(BasicNewsRecipe):
    extra_css = '''
        h1 {font-size:150%}
        .body {text-align:left;}
        div.headline {font-weight:bold}
    '''

    earliest_date = date.today() - timedelta(days=oldest_article)
@ -69,7 +68,8 @@ class Fronda(BasicNewsRecipe):
                article_url = 'http://www.fronda.pl' + article_a['href']
                article_title = self.tag_to_string(article_a)
                articles[genName].append( { 'title' : article_title, 'url' : article_url, 'date' : article_date })
            feeds.append((genName, articles[genName]))
            if articles[genName]:
                feeds.append((genName, articles[genName]))
        return feeds

    keep_only_tags = [
@ -83,6 +83,10 @@ class Fronda(BasicNewsRecipe):
        dict(name='h3', attrs={'class':'block-header article comments'}),
        dict(name='ul', attrs={'class':'comment-list'}),
        dict(name='ul', attrs={'class':'category'}),
        dict(name='ul', attrs={'class':'tag-list'}),
        dict(name='p', attrs={'id':'comments-disclaimer'}),
        dict(name='div', attrs={'style':'text-align: left; margin-bottom: 15px;'}),
        dict(name='div', attrs={'style':'text-align: left; margin-top: 15px; margin-bottom: 30px;'}),
        dict(name='div', attrs={'class':'related-articles content'}),
        dict(name='div', attrs={'id':'comment-form'})
    ]

34
recipes/gazeta_krakowska.recipe
Normal file
@ -0,0 +1,34 @@
from calibre.web.feeds.news import BasicNewsRecipe

class GazetaKrakowska(BasicNewsRecipe):
    title = u'Gazeta Krakowska'
    __author__ = 'fenuks'
    description = u'Gazeta Regionalna Gazeta Krakowska. Najnowsze Wiadomości Kraków. Informacje Kraków. Czytaj!'
    category = 'newspaper'
    language = 'pl'
    encoding = 'iso-8859-2'
    masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/gazetakrakowska.png?24'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_empty_feeds = True
    no_stylesheets = True
    use_embedded_content = False
    ignore_duplicate_articles = {'title', 'url'}
    #preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
    remove_tags_after= dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
    remove_tags=[dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})]

    feeds = [(u'Fakty24', u'http://gazetakrakowska.feedsportal.com/c/32980/f/533770/index.rss?201302'), (u'Krak\xf3w', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_krakow.xml?201302'), (u'Tarn\xf3w', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_tarnow.xml?201302'), (u'Nowy S\u0105cz', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_nsacz.xml?201302'), (u'Ma\u0142. Zach.', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_malzach.xml?201302'), (u'Podhale', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_podhale.xml?201302'), (u'Sport', u'http://gazetakrakowska.feedsportal.com/c/32980/f/533771/index.rss?201302'), (u'Kultura', u'http://gazetakrakowska.feedsportal.com/c/32980/f/533772/index.rss?201302'), (u'Opinie', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_opinie.xml?201302'), (u'Magazyn', u'http://www.gazetakrakowska.pl/rss/gazetakrakowska_magazyn.xml?201302')]

    def print_version(self, url):
        return url.replace('artykul', 'drukuj')

    def skip_ad_pages(self, soup):
        if 'Advertisement' in soup.title:
            nexturl=soup.find('a')['href']
            return self.index_to_soup(nexturl, raw=True)

    def get_cover_url(self):
        soup = self.index_to_soup('http://www.prasa24.pl/gazeta/gazeta-krakowska/')
        self.cover_url=soup.find(id='pojemnik').img['src']
        return getattr(self, 'cover_url', self.cover_url)
69
recipes/gazeta_lubuska.recipe
Normal file
@ -0,0 +1,69 @@
import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Comment

class GazetaLubuska(BasicNewsRecipe):
    title = u'Gazeta Lubuska'
    __author__ = 'fenuks'
    description = u'Gazeta Lubuska - portal regionalny województwa lubuskiego.'
    category = 'newspaper'
    language = 'pl'
    encoding = 'iso-8859-2'
    extra_css = 'ul {list-style: none; padding:0; margin:0;}'
    INDEX = 'http://www.gazetalubuska.pl'
    masthead_url = INDEX + '/images/top_logo.png'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_empty_feeds = True
    no_stylesheets = True
    ignore_duplicate_articles = {'title', 'url'}

    preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
                          (re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]

    keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
    remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
                            'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
                            'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
                   dict(attrs={'class':'articleFunctions'})]

    feeds = [(u'Wszystkie', u'http://www.gazetalubuska.pl/rss.xml'), (u'Drezdenko', u'http://www.gazetalubuska.pl/drezdenko.xml'), (u'G\u0142og\xf3w', u'http://www.gazetalubuska.pl/glogow.xml'), (u'Gorz\xf3w Wielkopolski', u'http://www.gazetalubuska.pl/gorzow-wielkopolski.xml'), (u'Gubin', u'http://www.gazetalubuska.pl/gubin.xml'), (u'Kostrzyn', u'http://www.gazetalubuska.pl/kostrzyn.xml'), (u'Krosno Odrza\u0144skie', u'http://www.gazetalubuska.pl/krosno-odrzanskie.xml'), (u'Lubsko', u'http://www.gazetalubuska.pl/lubsko.xml'), (u'Mi\u0119dzych\xf3d', u'http://www.gazetalubuska.pl/miedzychod.xml'), (u'Mi\u0119dzyrzecz', u'http://www.gazetalubuska.pl/miedzyrzecz.xml'), (u'Nowa S\xf3l', u'http://www.gazetalubuska.pl/nowa-sol.xml'), (u'S\u0142ubice', u'http://www.gazetalubuska.pl/slubice.xml'), (u'Strzelce Kraje\u0144skie', u'http://www.gazetalubuska.pl/strzelce-krajenskie.xml'), (u'Sulech\xf3w', u'http://www.gazetalubuska.pl/sulechow.xml'), (u'Sul\u0119cin', u'http://www.gazetalubuska.pl/sulecin.xml'), (u'\u015awi\u0119bodzin', u'http://www.gazetalubuska.pl/swiebodzin.xml'), (u'Wolsztyn', u'http://www.gazetalubuska.pl/wolsztyn.xml'), (u'Wschowa', u'http://www.gazetalubuska.pl/wschowa.xml'), (u'Zielona G\xf3ra', u'http://www.gazetalubuska.pl/zielona-gora.xml'), (u'\u017baga\u0144', u'http://www.gazetalubuska.pl/zagan.xml'), (u'\u017bary', u'http://www.gazetalubuska.pl/zary.xml'), (u'Sport', u'http://www.gazetalubuska.pl/sport.xml'), (u'Auto', u'http://www.gazetalubuska.pl/auto.xml'), (u'Dom', u'http://www.gazetalubuska.pl/dom.xml'), (u'Praca', u'http://www.gazetalubuska.pl/praca.xml'), (u'Zdrowie', u'http://www.gazetalubuska.pl/zdrowie.xml')]

    def get_cover_url(self):
        soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
        nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
        soup = self.index_to_soup(nexturl)
        self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
        return getattr(self, 'cover_url', self.cover_url)

    def append_page(self, soup, appendtag):
        tag = soup.find('span', attrs={'class':'photoNavigationPages'})
        if tag:
            number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
            baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]

            for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
                r.extract()
            for nr in range(2, number+1):
                soup2 = self.index_to_soup(baseurl + str(nr))
                pagetext = soup2.find(id='photoContainer')
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)
                pagetext = soup2.find(attrs={'class':'photoMeta'})
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)
                pagetext = soup2.find(attrs={'class':'photoStoryText'})
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)

            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
            for comment in comments:
                comment.extract()

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
        return soup
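The multi-page handling in these recipes keys off a `photoNavigationPages` span whose text ends in the total page count, and a "next" link whose final character is the page digit. A minimal stand-alone sketch of that parsing and URL generation (the function names are illustrative, not part of the recipe API):

```python
def parse_page_count(label):
    # The 'photoNavigationPages' span renders as e.g. "1 / 12"; take the
    # part after the last "/" and drop stray spaces before converting,
    # mirroring tag.string.rpartition('/')[-1].replace(' ', '').
    return int(label.rpartition('/')[-1].replace(' ', ''))

def follow_up_pages(next_href, count, index='http://www.gazetalubuska.pl'):
    # The recipe strips the trailing page digit from the "next" link
    # ([:-1]) and appends the numbers 2..count to build the page URLs.
    base = index + next_href[:-1]
    return [base + str(nr) for nr in range(2, count + 1)]
```

Each generated URL is then fetched with `index_to_soup` and its photo containers are appended to the first page's body.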
@ -49,8 +49,8 @@ class gw_krakow(BasicNewsRecipe):
    feeds = [(u'Wiadomości', u'http://rss.gazeta.pl/pub/rss/krakow.xml')]

    def skip_ad_pages(self, soup):
        tag = soup.find(name='a', attrs={'class':'btn'})
        if tag:
            new_soup = self.index_to_soup(tag['href'], raw=True)
            return new_soup

@ -95,8 +95,7 @@ class gw_krakow(BasicNewsRecipe):
            rem.extract()

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
        if soup.find(id='container_gal'):
            self.gallery_article(soup.body)
        return soup
@ -46,8 +46,8 @@ class gw_wawa(BasicNewsRecipe):
    feeds = [(u'Wiadomości', u'http://rss.gazeta.pl/pub/rss/warszawa.xml')]

    def skip_ad_pages(self, soup):
        tag = soup.find(name='a', attrs={'class':'btn'})
        if tag:
            new_soup = self.index_to_soup(tag['href'], raw=True)
            return new_soup

@ -92,8 +92,7 @@ class gw_wawa(BasicNewsRecipe):
            rem.extract()

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
        if soup.find(id='container_gal'):
            self.gallery_article(soup.body)
        return soup
@ -1,104 +1,96 @@
#!/usr/bin/env python

# # Before use, read the comment in the "feeds" section

__license__ = 'GPL v3'
__copyright__ = u'2010, Richard z forum.eksiazki.org'
'''pomorska.pl'''

import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Comment

class GazetaPomorska(BasicNewsRecipe):
    title = u'Gazeta Pomorska'
    publisher = u'Gazeta Pomorska'
    description = u'Kujawy i Pomorze - wiadomo\u015bci'
    __author__ = 'Richard z forum.eksiazki.org, fenuks'
    description = u'Gazeta Pomorska - portal regionalny'
    category = 'newspaper'
    language = 'pl'
    __author__ = u'Richard z forum.eksiazki.org'
    # # (thanks to t3d from forum.eksiazki.org for testing)
    oldest_article = 2
    max_articles_per_feed = 20
    encoding = 'iso-8859-2'
    extra_css = 'ul {list-style: none; padding:0; margin:0;}'
    INDEX = 'http://www.pomorska.pl'
    masthead_url = INDEX + '/images/top_logo.png'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_empty_feeds = True
    no_stylesheets = True
    remove_javascript = True
    preprocess_regexps = [
        (re.compile(r'<a href="http://maps.google[^>]*>[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: ''),
        (re.compile(r'[<Bb >]*Poznaj opinie[^<]*[</Bb >]*[^<]*<a href[^>]*>[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: ''),
        (re.compile(r'[<Bb >]*Przeczytaj[^<]*[</Bb >]*[^<]*<a href[^>]*>[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: ''),
        (re.compile(r'[<Bb >]*Wi.cej informacji[^<]*[</Bb >]*[^<]*<a href[^>]*>[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: ''),
        (re.compile(r'<a href[^>]*>[<Bb >]*Wideo[^<]*[</Bb >]*[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: ''),
        (re.compile(r'<a href[^>]*>[<Bb >]*KLIKNIJ TUTAJ[^<]*[</Bb >]*[^<]*</a>\.*', re.DOTALL|re.IGNORECASE), lambda m: '')
    ]
    ignore_duplicate_articles = {'title', 'url'}

    feeds = [
    # # Listed here are the categories available from Gazeta Pomorska,
    # # one per line. If a line begins with a single "#" character, that
    # # category is commented out and will not be fetched; to receive it,
    # # remove the # from its line.
    # # If you subscribe to more than one category, every category line
    # # must end with an uncommented comma, except the last line, which
    # # must have no trailing comma.
    # # Recommended ways to choose categories:
    # # 1. PomorskaRSS - news of every type, or
    # # 2. Region + selected cities, or
    # # 3. Topical news.
    # # Category list:
    preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
                          (re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]

    # # PomorskaRSS - news of every type; comment it out with "#"
    # # before uncommenting news of a selected type:
    (u'PomorskaRSS', u'http://www.pomorska.pl/rss.xml')
    keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
    remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
                            'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
                            'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
                   dict(attrs={'class':'articleFunctions'})]

    # # regional news not assigned to a specific city:
    # (u'Region', u'http://www.pomorska.pl/region.xml'),
    feeds = [(u'Wszystkie', u'http://www.pomorska.pl/rss.xml'),
             (u'Region', u'http://www.pomorska.pl/region.xml'),
             (u'Bydgoszcz', u'http://www.pomorska.pl/bydgoszcz.xml'),
             (u'Nakło', u'http://www.pomorska.pl/naklo.xml'),
             (u'Koronowo', u'http://www.pomorska.pl/koronowo.xml'),
             (u'Solec Kujawski', u'http://www.pomorska.pl/soleckujawski.xml'),
             (u'Grudziądz', u'http://www.pomorska.pl/grudziadz.xml'),
             (u'Inowrocław', u'http://www.pomorska.pl/inowroclaw.xml'),
             (u'Toruń', u'http://www.pomorska.pl/torun.xml'),
             (u'Włocławek', u'http://www.pomorska.pl/wloclawek.xml'),
             (u'Aleksandrów Kujawski', u'http://www.pomorska.pl/aleksandrow.xml'),
             (u'Brodnica', u'http://www.pomorska.pl/brodnica.xml'),
             (u'Chełmno', u'http://www.pomorska.pl/chelmno.xml'),
             (u'Chojnice', u'http://www.pomorska.pl/chojnice.xml'),
             (u'Ciechocinek', u'http://www.pomorska.pl/ciechocinek.xml'),
             (u'Golub-Dobrzyń', u'http://www.pomorska.pl/golubdobrzyn.xml'),
             (u'Mogilno', u'http://www.pomorska.pl/mogilno.xml'),
             (u'Radziejów', u'http://www.pomorska.pl/radziejow.xml'),
             (u'Rypin', u'http://www.pomorska.pl/rypin.xml'),
             (u'Sępólno', u'http://www.pomorska.pl/sepolno.xml'),
             (u'Świecie', u'http://www.pomorska.pl/swiecie.xml'),
             (u'Tuchola', u'http://www.pomorska.pl/tuchola.xml'),
             (u'Żnin', u'http://www.pomorska.pl/znin.xml'),
             (u'Sport', u'http://www.pomorska.pl/sport.xml'),
             (u'Zdrowie', u'http://www.pomorska.pl/zdrowie.xml'),
             (u'Auto', u'http://www.pomorska.pl/moto.xml'),
             (u'Dom', u'http://www.pomorska.pl/dom.xml'),
             #(u'Reporta\u017c', u'http://www.pomorska.pl/reportaz.xml'),
             (u'Gospodarka', u'http://www.pomorska.pl/gospodarka.xml')]

    # # news assigned to cities:
    # (u'Bydgoszcz', u'http://www.pomorska.pl/bydgoszcz.xml'),
    # (u'Nak\u0142o', u'http://www.pomorska.pl/naklo.xml'),
    # (u'Koronowo', u'http://www.pomorska.pl/koronowo.xml'),
    # (u'Solec Kujawski', u'http://www.pomorska.pl/soleckujawski.xml'),
    # (u'Grudzi\u0105dz', u'http://www.pomorska.pl/grudziadz.xml'),
    # (u'Inowroc\u0142aw', u'http://www.pomorska.pl/inowroclaw.xml'),
    # (u'Toru\u0144', u'http://www.pomorska.pl/torun.xml'),
    # (u'W\u0142oc\u0142awek', u'http://www.pomorska.pl/wloclawek.xml'),
    # (u'Aleksandr\u00f3w Kujawski', u'http://www.pomorska.pl/aleksandrow.xml'),
    # (u'Brodnica', u'http://www.pomorska.pl/brodnica.xml'),
    # (u'Che\u0142mno', u'http://www.pomorska.pl/chelmno.xml'),
    # (u'Chojnice', u'http://www.pomorska.pl/chojnice.xml'),
    # (u'Ciechocinek', u'http://www.pomorska.pl/ciechocinek.xml'),
    # (u'Golub Dobrzy\u0144', u'http://www.pomorska.pl/golubdobrzyn.xml'),
    # (u'Mogilno', u'http://www.pomorska.pl/mogilno.xml'),
    # (u'Radziej\u00f3w', u'http://www.pomorska.pl/radziejow.xml'),
    # (u'Rypin', u'http://www.pomorska.pl/rypin.xml'),
    # (u'S\u0119p\u00f3lno', u'http://www.pomorska.pl/sepolno.xml'),
    # (u'\u015awiecie', u'http://www.pomorska.pl/swiecie.xml'),
    # (u'Tuchola', u'http://www.pomorska.pl/tuchola.xml'),
    # (u'\u017bnin', u'http://www.pomorska.pl/znin.xml')
    def get_cover_url(self):
        soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
        nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
        soup = self.index_to_soup(nexturl)
        self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
        return getattr(self, 'cover_url', self.cover_url)

    # # topical news (redundant with region/cities):
    # (u'Sport', u'http://www.pomorska.pl/sport.xml'),
    # (u'Zdrowie', u'http://www.pomorska.pl/zdrowie.xml'),
    # (u'Auto', u'http://www.pomorska.pl/moto.xml'),
    # (u'Dom', u'http://www.pomorska.pl/dom.xml'),
    # (u'Reporta\u017c', u'http://www.pomorska.pl/reportaz.xml'),
    # (u'Gospodarka', u'http://www.pomorska.pl/gospodarka.xml')
    ]
    def append_page(self, soup, appendtag):
        tag = soup.find('span', attrs={'class':'photoNavigationPages'})
        if tag:
            number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
            baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]

    keep_only_tags = [dict(name='div', attrs={'id':'article'})]

    remove_tags = [
        dict(name='p', attrs={'id':'articleTags'}),
        dict(name='div', attrs={'id':'articleEpaper'}),
        dict(name='div', attrs={'id':'articleConnections'}),
        dict(name='div', attrs={'class':'articleFacts'}),
        dict(name='div', attrs={'id':'articleExternalLink'}),
        dict(name='div', attrs={'id':'articleMultimedia'}),
        dict(name='div', attrs={'id':'articleGalleries'}),
        dict(name='div', attrs={'id':'articleAlarm'}),
        dict(name='div', attrs={'id':'adholder_srodek1'}),
        dict(name='div', attrs={'id':'articleVideo'}),
        dict(name='a', attrs={'name':'fb_share'})]

    extra_css = '''h1 { font-size: 1.4em; }
                   h2 { font-size: 1.0em; }'''
            for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
                r.extract()
            for nr in range(2, number+1):
                soup2 = self.index_to_soup(baseurl + str(nr))
                pagetext = soup2.find(id='photoContainer')
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)
                pagetext = soup2.find(attrs={'class':'photoMeta'})
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)
                pagetext = soup2.find(attrs={'class':'photoStoryText'})
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)

            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
            for comment in comments:
                comment.extract()

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
        return soup
34
recipes/gazeta_wroclawska.recipe
Normal file
@ -0,0 +1,34 @@
from calibre.web.feeds.news import BasicNewsRecipe

class GazetaWroclawska(BasicNewsRecipe):
    title = u'Gazeta Wroc\u0142awska'
    __author__ = 'fenuks'
    description = u'Gazeta Regionalna Gazeta Wrocławska. Najnowsze Wiadomości Wrocław, Informacje Wrocław. Czytaj!'
    category = 'newspaper'
    language = 'pl'
    encoding = 'iso-8859-2'
    masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/gazetawroclawska.png?24'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_empty_feeds = True
    no_stylesheets = True
    use_embedded_content = False
    ignore_duplicate_articles = {'title', 'url'}
    #preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
    remove_tags_after = dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
    remove_tags = [dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})]

    feeds = [(u'Fakty24', u'http://gazetawroclawska.feedsportal.com/c/32980/f/533775/index.rss?201302'), (u'Region', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_region.xml?201302'), (u'Kultura', u'http://gazetawroclawska.feedsportal.com/c/32980/f/533777/index.rss?201302'), (u'Sport', u'http://gazetawroclawska.feedsportal.com/c/32980/f/533776/index.rss?201302'), (u'Z archiwum', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_zarchiwum.xml?201302'), (u'M\xf3j reporter', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_mojreporter.xml?201302'), (u'Historia', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_historia.xml?201302'), (u'Listy do redakcji', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_listydoredakcji.xml?201302'), (u'Na drogach', u'http://www.gazetawroclawska.pl/rss/gazetawroclawska_nadrogach.xml?201302')]

    def print_version(self, url):
        return url.replace('artykul', 'drukuj')

    def skip_ad_pages(self, soup):
        if 'Advertisement' in soup.title:
            nexturl = soup.find('a')['href']
            return self.index_to_soup(nexturl, raw=True)

    def get_cover_url(self):
        soup = self.index_to_soup('http://www.prasa24.pl/gazeta/gazeta-wroclawska/')
        self.cover_url = soup.find(id='pojemnik').img['src']
        return getattr(self, 'cover_url', self.cover_url)
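Several of these recipes fetch printer-friendly pages by rewriting the article URL. A stand-alone sketch of the transformation (written as a plain function here for illustration; in the recipe it is a method taking `self`):

```python
def print_version(url):
    # The site serves a printer-friendly page when the 'artykul'
    # path segment is swapped for 'drukuj', as the recipe does.
    return url.replace('artykul', 'drukuj')
```

The print layout omits most of the navigation chrome, which is why the recipe prefers it over the regular article page.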
68
recipes/gazeta_wspolczesna.recipe
Normal file
@ -0,0 +1,68 @@
import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Comment

class GazetaWspolczesna(BasicNewsRecipe):
    title = u'Gazeta Wsp\xf3\u0142czesna'
    __author__ = 'fenuks'
    description = u'Gazeta Współczesna - portal regionalny.'
    category = 'newspaper'
    language = 'pl'
    encoding = 'iso-8859-2'
    extra_css = 'ul {list-style: none; padding:0; margin:0;}'
    INDEX = 'http://www.wspolczesna.pl'
    masthead_url = INDEX + '/images/top_logo.png'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_empty_feeds = True
    no_stylesheets = True
    ignore_duplicate_articles = {'title', 'url'}

    preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
                          (re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]

    keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
    remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
                            'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
                            'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
                   dict(attrs={'class':'articleFunctions'})]

    feeds = [(u'Wszystkie', u'http://www.wspolczesna.pl/rss.xml'), (u'August\xf3w', u'http://www.wspolczesna.pl/augustow.xml'), (u'Bia\u0142ystok', u'http://www.wspolczesna.pl/bialystok.xml'), (u'Bielsk Podlaski', u'http://www.wspolczesna.pl/bielsk.xml'), (u'E\u0142k', u'http://www.wspolczesna.pl/elk.xml'), (u'Grajewo', u'http://www.wspolczesna.pl/grajewo.xml'), (u'Go\u0142dap', u'http://www.wspolczesna.pl/goldap.xml'), (u'Hajn\xf3wka', u'http://www.wspolczesna.pl/hajnowka.xml'), (u'Kolno', u'http://www.wspolczesna.pl/kolno.xml'), (u'\u0141om\u017ca', u'http://www.wspolczesna.pl/lomza.xml'), (u'Mo\u0144ki', u'http://www.wspolczesna.pl/monki.xml'), (u'Olecko', u'http://www.wspolczesna.pl/olecko.xml'), (u'Ostro\u0142\u0119ka', u'http://www.wspolczesna.pl/ostroleka.xml'), (u'Powiat Bia\u0142ostocki', u'http://www.wspolczesna.pl/powiat.xml'), (u'Sejny', u'http://www.wspolczesna.pl/sejny.xml'), (u'Siemiatycze', u'http://www.wspolczesna.pl/siemiatycze.xml'), (u'Sok\xf3\u0142ka', u'http://www.wspolczesna.pl/sokolka.xml'), (u'Suwa\u0142ki', u'http://www.wspolczesna.pl/suwalki.xml'), (u'Wysokie Mazowieckie', u'http://www.wspolczesna.pl/wysokie.xml'), (u'Zambr\xf3w', u'http://www.wspolczesna.pl/zambrow.xml'), (u'Sport', u'http://www.wspolczesna.pl/sport.xml'), (u'Praca', u'http://www.wspolczesna.pl/praca.xml'), (u'Dom', u'http://www.wspolczesna.pl/dom.xml'), (u'Auto', u'http://www.wspolczesna.pl/auto.xml'), (u'Zdrowie', u'http://www.wspolczesna.pl/zdrowie.xml')]

    def get_cover_url(self):
        soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
        nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
        soup = self.index_to_soup(nexturl)
        self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
        return getattr(self, 'cover_url', self.cover_url)

    def append_page(self, soup, appendtag):
        tag = soup.find('span', attrs={'class':'photoNavigationPages'})
        if tag:
            number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
            baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]

            for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
                r.extract()
            for nr in range(2, number+1):
                soup2 = self.index_to_soup(baseurl + str(nr))
                pagetext = soup2.find(id='photoContainer')
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)
                pagetext = soup2.find(attrs={'class':'photoMeta'})
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)
                pagetext = soup2.find(attrs={'class':'photoStoryText'})
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)

            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
            for comment in comments:
                comment.extract()

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
        return soup
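The `preprocess_regexps` attribute used throughout these recipes is a list of `(compiled pattern, replacement callable)` pairs that calibre applies to the raw page source before parsing, here used to strip "Czytaj:" / "Zobacz też:" cross-promotion links. A stand-alone sketch of how such pairs behave (`apply_regexps` is an illustrative helper; `BasicNewsRecipe` applies the pairs itself):

```python
import re

# Illustrative subset of the recipes' preprocess_regexps: each entry
# pairs a compiled pattern with a callable returning the replacement.
regexps = [
    (re.compile(r'Czytaj:.*?</a>', re.DOTALL), lambda match: ''),
    (re.compile(r'Zobacz też:.*?</a>', re.DOTALL | re.IGNORECASE), lambda match: ''),
]

def apply_regexps(html):
    # Run each substitution over the raw source, as calibre does
    # before the HTML is handed to the parser.
    for pattern, repl in regexps:
        html = pattern.sub(repl, html)
    return html
```

Because the patterns use `re.DOTALL`, a promo phrase and its closing `</a>` are removed even when they span multiple source lines.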
@ -1,12 +1,12 @@
# -*- coding: utf-8 -*-
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Comment

class Gazeta_Wyborcza(BasicNewsRecipe):
    title = u'Gazeta.pl'
    __author__ = 'fenuks, Artur Stachecki'
    language = 'pl'
    description = 'news from gazeta.pl'
    description = 'Wiadomości z Polski i ze świata. Serwisy tematyczne i lokalne w 20 miastach.'
    category = 'newspaper'
    publication_type = 'newspaper'
    masthead_url = 'http://bi.gazeta.pl/im/5/10285/z10285445AA.jpg'
@ -16,6 +16,7 @@ class Gazeta_Wyborcza(BasicNewsRecipe):
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    ignore_duplicate_articles = {'title', 'url'}
    remove_tags_before = dict(id='k0')
    remove_tags_after = dict(id='banP4')
    remove_tags = [dict(name='div', attrs={'class':'rel_box'}), dict(attrs={'class':['date', 'zdjP', 'zdjM', 'pollCont', 'rel_video', 'brand', 'txt_upl']}), dict(name='div', attrs={'id':'footer'})]
@ -48,6 +49,9 @@ class Gazeta_Wyborcza(BasicNewsRecipe):
            url = self.INDEX + link['href']
            soup2 = self.index_to_soup(url)
            pagetext = soup2.find(id='artykul')
            comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
            for comment in comments:
                comment.extract()
            pos = len(appendtag.contents)
            appendtag.insert(pos, pagetext)
            tag = soup2.find('div', attrs={'id': 'Str'})
@ -65,6 +69,9 @@ class Gazeta_Wyborcza(BasicNewsRecipe):
            nexturl = pagetext.find(id='gal_btn_next')
            if nexturl:
                nexturl = nexturl.a['href']
            comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
            for comment in comments:
                comment.extract()
            pos = len(appendtag.contents)
            appendtag.insert(pos, pagetext)
            rem = appendtag.find(id='gal_navi')
@ -105,3 +112,7 @@ class Gazeta_Wyborcza(BasicNewsRecipe):
        soup = self.index_to_soup('http://wyborcza.pl/' + cover.contents[3].a['href'])
        self.cover_url = 'http://wyborcza.pl' + soup.img['src']
        return getattr(self, 'cover_url', self.cover_url)

    '''def image_url_processor(self, baseurl, url):
        print "@@@@@@@@", url
        return url.replace('http://wyborcza.pl/ ', '')'''
88
recipes/gcn.recipe
Normal file
@ -0,0 +1,88 @@
import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Comment

class GCN(BasicNewsRecipe):
    title = u'Gazeta Codziennej Nowiny'
    __author__ = 'fenuks'
    description = u'nowiny24.pl - portal regionalny województwa podkarpackiego.'
    category = 'newspaper'
    language = 'pl'
    encoding = 'iso-8859-2'
    extra_css = 'ul {list-style: none; padding:0; margin:0;}'
    INDEX = 'http://www.nowiny24.pl'
    masthead_url = INDEX + '/images/top_logo.png'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_empty_feeds = True
    no_stylesheets = True
    ignore_duplicate_articles = {'title', 'url'}
    remove_attributes = ['style']
    preprocess_regexps = [(re.compile(ur'Czytaj:.*?</a>', re.DOTALL), lambda match: ''), (re.compile(ur'Przeczytaj także:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''),
                          (re.compile(ur'Przeczytaj również:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: ''), (re.compile(ur'Zobacz też:.*?</a>', re.DOTALL|re.IGNORECASE), lambda match: '')]

    keep_only_tags = [dict(id=['article', 'cover', 'photostory'])]
    remove_tags = [dict(id=['articleTags', 'articleMeta', 'boxReadIt', 'articleGalleries', 'articleConnections',
                            'ForumArticleComments', 'articleRecommend', 'jedynkiLinks', 'articleGalleryConnections',
                            'photostoryConnections', 'articleEpaper', 'articlePoll', 'articleAlarm', 'articleByline']),
                   dict(attrs={'class':'articleFunctions'})]

    feeds = [(u'Wszystkie', u'http://www.nowiny24.pl/rss.xml'),
             (u'Podkarpacie', u'http://www.nowiny24.pl/podkarpacie.xml'),
             (u'Bieszczady', u'http://www.nowiny24.pl/bieszczady.xml'),
             (u'Rzeszów', u'http://www.nowiny24.pl/rzeszow.xml'),
             (u'Przemyśl', u'http://www.nowiny24.pl/przemysl.xml'),
             (u'Leżajsk', u'http://www.nowiny24.pl/lezajsk.xml'),
             (u'Łańcut', u'http://www.nowiny24.pl/lancut.xml'),
             (u'Dębica', u'http://www.nowiny24.pl/debica.xml'),
             (u'Jarosław', u'http://www.nowiny24.pl/jaroslaw.xml'),
             (u'Krosno', u'http://www.nowiny24.pl/krosno.xml'),
             (u'Mielec', u'http://www.nowiny24.pl/mielec.xml'),
             (u'Nisko', u'http://www.nowiny24.pl/nisko.xml'),
             (u'Sanok', u'http://www.nowiny24.pl/sanok.xml'),
             (u'Stalowa Wola', u'http://www.nowiny24.pl/stalowawola.xml'),
             (u'Tarnobrzeg', u'http://www.nowiny24.pl/tarnobrzeg.xml'),
             (u'Sport', u'http://www.nowiny24.pl/sport.xml'),
             (u'Dom', u'http://www.nowiny24.pl/dom.xml'),
             (u'Auto', u'http://www.nowiny24.pl/auto.xml'),
             (u'Praca', u'http://www.nowiny24.pl/praca.xml'),
             (u'Zdrowie', u'http://www.nowiny24.pl/zdrowie.xml'),
             (u'Wywiady', u'http://www.nowiny24.pl/wywiady.xml')]

    def get_cover_url(self):
        soup = self.index_to_soup(self.INDEX + '/apps/pbcs.dll/section?Category=JEDYNKI')
        nexturl = self.INDEX + soup.find(id='covers').find('a')['href']
        soup = self.index_to_soup(nexturl)
        self.cover_url = self.INDEX + soup.find(id='cover').find(name='img')['src']
        return getattr(self, 'cover_url', self.cover_url)

    def append_page(self, soup, appendtag):
        tag = soup.find('span', attrs={'class':'photoNavigationPages'})
        if tag:
            number = int(tag.string.rpartition('/')[-1].replace(' ', ''))
            baseurl = self.INDEX + soup.find(attrs={'class':'photoNavigationNext'})['href'][:-1]

            for r in appendtag.findAll(attrs={'class':'photoNavigation'}):
                r.extract()
            for nr in range(2, number+1):
                soup2 = self.index_to_soup(baseurl + str(nr))
                pagetext = soup2.find(id='photoContainer')
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)
                pagetext = soup2.find(attrs={'class':'photoMeta'})
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)
                pagetext = soup2.find(attrs={'class':'photoStoryText'})
                if pagetext:
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)

            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
            for comment in comments:
                comment.extract()

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
        return soup
12
recipes/geopolityka.recipe
Normal file
@ -0,0 +1,12 @@
from calibre.web.feeds.news import BasicNewsRecipe

class BasicUserRecipe1361379046(BasicNewsRecipe):
    title = u'Geopolityka.org'
    language = 'pl'
    __author__ = 'chemik111'
    oldest_article = 15
    max_articles_per_feed = 100
    auto_cleanup = True

    feeds = [(u'Rss', u'http://geopolityka.org/index.php?format=feed&type=rss')]
@ -11,12 +11,13 @@ class Gildia(BasicNewsRecipe):
    language = 'pl'
    oldest_article = 8
    max_articles_per_feed = 100
    remove_empty_feeds=True
    no_stylesheets=True
    remove_empty_feeds = True
    no_stylesheets = True
    ignore_duplicate_articles = {'title', 'url'}
    preprocess_regexps = [(re.compile(ur'</?sup>'), lambda match: '') ]
    remove_tags=[dict(name='div', attrs={'class':'backlink'}), dict(name='div', attrs={'class':'im_img'}), dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'})]
    keep_only_tags=dict(name='div', attrs={'class':'widetext'})
    remove_tags = [dict(name='div', attrs={'class':'backlink'}), dict(name='div', attrs={'class':'im_img'}), dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'})]
    keep_only_tags = dict(name='div', attrs={'class':'widetext'})
    feeds = [(u'Gry', u'http://www.gry.gildia.pl/rss'), (u'Literatura', u'http://www.literatura.gildia.pl/rss'), (u'Film', u'http://www.film.gildia.pl/rss'), (u'Horror', u'http://www.horror.gildia.pl/rss'), (u'Konwenty', u'http://www.konwenty.gildia.pl/rss'), (u'Plansz\xf3wki', u'http://www.planszowki.gildia.pl/rss'), (u'Manga i anime', u'http://www.manga.gildia.pl/rss'), (u'Star Wars', u'http://www.starwars.gildia.pl/rss'), (u'Techno', u'http://www.techno.gildia.pl/rss'), (u'Historia', u'http://www.historia.gildia.pl/rss'), (u'Magia', u'http://www.magia.gildia.pl/rss'), (u'Bitewniaki', u'http://www.bitewniaki.gildia.pl/rss'), (u'RPG', u'http://www.rpg.gildia.pl/rss'), (u'LARP', u'http://www.larp.gildia.pl/rss'), (u'Muzyka', u'http://www.muzyka.gildia.pl/rss'), (u'Nauka', u'http://www.nauka.gildia.pl/rss')]

@ -34,7 +35,7 @@ class Gildia(BasicNewsRecipe):

    def preprocess_html(self, soup):
        for a in soup('a'):
            if a.has_key('href') and 'http://' not in a['href'] and 'https://' not in a['href']:
            if a.has_key('href') and not a['href'].startswith('http'):
                if '/gry/' in a['href']:
                    a['href'] = 'http://www.gry.gildia.pl' + a['href']
                elif u'książk' in soup.title.string.lower() or u'komiks' in soup.title.string.lower():
recipes/glos_wielkopolski.recipe (new file)
@ -0,0 +1,34 @@
from calibre.web.feeds.news import BasicNewsRecipe


class GlosWielkopolski(BasicNewsRecipe):
    title = u'G\u0142os Wielkopolski'
    __author__ = 'fenuks'
    description = u'Gazeta Regionalna Głos Wielkopolski. Najnowsze Wiadomości Poznań. Czytaj Informacje Poznań!'
    category = 'newspaper'
    language = 'pl'
    encoding = 'iso-8859-2'
    masthead_url = 'http://s.polskatimes.pl/g/logo_naglowek/gloswielkopolski.png?24'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_empty_feeds = True
    no_stylesheets = True
    use_embedded_content = False
    ignore_duplicate_articles = {'title', 'url'}
    #preprocess_regexps = [(re.compile(ur'<b>Czytaj także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur',<b>Czytaj też:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>Zobacz także:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<center><h4><a.*?</a></h4></center>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TEŻ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ WIĘCEJ:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>CZYTAJ TAKŻE:.*?</b>', re.DOTALL), lambda match: ''), (re.compile(ur'<b>\* CZYTAJ KONIECZNIE:.*', re.DOTALL), lambda match: '</body>'), (re.compile(ur'<b>Nasze serwisy:</b>.*', re.DOTALL), lambda match: '</body>') ]
    remove_tags_after = dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})
    remove_tags = [dict(id='mat-podobne'), dict(name='a', attrs={'class':'czytajDalej'}), dict(attrs={'src':'http://nm.dz.com.pl/dz.png'})]

    feeds = [(u'Wszystkie', u'http://gloswielkopolski.feedsportal.com/c/32980/f/533779/index.rss?201302'), (u'Wiadomo\u015bci', u'http://gloswielkopolski.feedsportal.com/c/32980/f/533780/index.rss?201302'), (u'Sport', u'http://gloswielkopolski.feedsportal.com/c/32980/f/533781/index.rss?201302'), (u'Kultura', u'http://gloswielkopolski.feedsportal.com/c/32980/f/533782/index.rss?201302'), (u'Porady', u'http://www.gloswielkopolski.pl/rss/gloswielkopolski_porady.xml?201302'), (u'Blogi', u'http://www.gloswielkopolski.pl/rss/gloswielkopolski_blogi.xml?201302'), (u'Nasze akcje', u'http://www.gloswielkopolski.pl/rss/gloswielkopolski_naszeakcje.xml?201302'), (u'Opinie', u'http://www.gloswielkopolski.pl/rss/gloswielkopolski_opinie.xml?201302'), (u'Magazyn', u'http://www.gloswielkopolski.pl/rss/gloswielkopolski_magazyn.xml?201302')]

    def print_version(self, url):
        return url.replace('artykul', 'drukuj')

    def skip_ad_pages(self, soup):
        if 'Advertisement' in soup.title:
            nexturl = soup.find('a')['href']
            return self.index_to_soup(nexturl, raw=True)

    def get_cover_url(self):
        soup = self.index_to_soup('http://www.prasa24.pl/gazeta/glos-wielkopolski/')
        self.cover_url = soup.find(id='pojemnik').img['src']
        return getattr(self, 'cover_url', self.cover_url)
@ -2,7 +2,8 @@
#!/usr/bin/env python

__license__ = 'GPL v3'
__copyright__ = '2011, Piotr Kontek, piotr.kontek@gmail.com'
__copyright__ = '2011, Piotr Kontek, piotr.kontek@gmail.com \
2013, Tomasz Długosz, tomek3d@gmail.com'

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile
@ -12,9 +13,9 @@ import re
class GN(BasicNewsRecipe):
    EDITION = 0

    __author__ = 'Piotr Kontek'
    title = u'Gość niedzielny'
    description = 'Weekly magazine'
    __author__ = 'Piotr Kontek, Tomasz Długosz'
    title = u'Gość Niedzielny'
    description = 'Ogólnopolski tygodnik katolicki'
    encoding = 'utf-8'
    no_stylesheets = True
    language = 'pl'
@ -38,17 +39,25 @@ class GN(BasicNewsRecipe):
        first = True
        for p in main_section.findAll('p', attrs={'class':None}, recursive=False):
            if first and p.find('img') != None:
                article = article + '<p>'
                article = article + str(p.find('img')).replace('src="/files/','src="http://www.gosc.pl/files/')
                article = article + '<font size="-2">'
                article += '<p>'
                article += str(p.find('img')).replace('src="/files/','src="http://www.gosc.pl/files/')
                article += '<font size="-2">'
                for s in p.findAll('span'):
                    article = article + self.tag_to_string(s)
                article = article + '</font></p>'
                    article += self.tag_to_string(s)
                article += '</font></p>'
            else:
                article = article + str(p).replace('src="/files/','src="http://www.gosc.pl/files/')
                article += str(p).replace('src="/files/','src="http://www.gosc.pl/files/')
            first = False
        limiter = main_section.find('p', attrs={'class' : 'limiter'})
        if limiter:
            article += str(limiter)

        html = unicode(title) + unicode(authors) + unicode(article)
        html = unicode(title)
        #sometimes authors are not filled in:
        if authors:
            html += unicode(authors) + unicode(article)
        else:
            html += unicode(article)

        self.temp_files.append(PersistentTemporaryFile('_temparse.html'))
        self.temp_files[-1].write(html)
@ -65,7 +74,8 @@ class GN(BasicNewsRecipe):
            if img != None:
                a = img.parent
                self.EDITION = a['href']
                self.title = img['alt']
                #this was preventing kindles from moving old issues to 'Back Issues' category:
                #self.title = img['alt']
                self.cover_url = 'http://www.gosc.pl' + img['src']
                if year != date.today().year or not first:
                    break
@ -1,5 +1,6 @@
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class Gram_pl(BasicNewsRecipe):
    title = u'Gram.pl'
    __author__ = 'fenuks'
@ -11,15 +12,14 @@ class Gram_pl(BasicNewsRecipe):
    max_articles_per_feed = 100
    ignore_duplicate_articles = {'title', 'url'}
    no_stylesheets= True
    remove_empty_feeds = True
    #extra_css = 'h2 {font-style: italic; font-size:20px;} .picbox div {float: left;}'
    cover_url=u'http://www.gram.pl/www/01/img/grampl_zima.png'
    keep_only_tags= [dict(id='articleModule')]
    remove_tags = [dict(attrs={'class':['breadCrump', 'dymek', 'articleFooter']})]
    remove_tags = [dict(attrs={'class':['breadCrump', 'dymek', 'articleFooter', 'twitter-share-button']})]
    feeds = [(u'Informacje', u'http://www.gram.pl/feed_news.asp'),
             (u'Publikacje', u'http://www.gram.pl/feed_news.asp?type=articles'),
             (u'Kolektyw- Indie Games', u'http://indie.gram.pl/feed/'),
             #(u'Kolektyw- Moto Games', u'http://www.motogames.gram.pl/news.rss')
             ]
             (u'Publikacje', u'http://www.gram.pl/feed_news.asp?type=articles')
             ]

    def parse_feeds (self):
        feeds = BasicNewsRecipe.parse_feeds(self)
@ -1,20 +1,24 @@
import time
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Comment

class GryOnlinePl(BasicNewsRecipe):
    title = u'Gry-Online.pl'
    __author__ = 'fenuks'
    description = 'Gry-Online.pl - computer games'
    description = u'Wiadomości o grach, recenzje, zapowiedzi. Encyklopedia Gier zawiera opisy gier na PC, konsole Xbox360, PS3 i inne platformy.'
    category = 'games'
    language = 'pl'
    oldest_article = 13
    INDEX= 'http://www.gry-online.pl/'
    masthead_url='http://www.gry-online.pl/im/gry-online-logo.png'
    cover_url='http://www.gry-online.pl/im/gry-online-logo.png'
    INDEX = 'http://www.gry-online.pl/'
    masthead_url = 'http://www.gry-online.pl/im/gry-online-logo.png'
    cover_url = 'http://www.gry-online.pl/im/gry-online-logo.png'
    max_articles_per_feed = 100
    no_stylesheets= True
    keep_only_tags=[dict(name='div', attrs={'class':['gc660', 'gc660 S013']})]
    remove_tags=[dict({'class':['nav-social', 'add-info', 'smlb', 'lista lista3 lista-gry', 'S013po', 'S013-npb', 'zm_gfx_cnt_bottom', 'ocen-txt', 'wiecej-txt', 'wiecej-txt2']})]
    feeds = [(u'Newsy', 'http://www.gry-online.pl/rss/news.xml'), ('Teksty', u'http://www.gry-online.pl/rss/teksty.xml')]
    no_stylesheets = True
    keep_only_tags = [dict(name='div', attrs={'class':['gc660', 'gc660 S013', 'news_endpage_tit', 'news_container', 'news']})]
    remove_tags = [dict({'class':['nav-social', 'add-info', 'smlb', 'lista lista3 lista-gry', 'S013po', 'S013-npb', 'zm_gfx_cnt_bottom', 'ocen-txt', 'wiecej-txt', 'wiecej-txt2']})]
    feeds = [
        (u'Newsy', 'http://www.gry-online.pl/rss/news.xml'),
        ('Teksty', u'http://www.gry-online.pl/rss/teksty.xml')]


    def append_page(self, soup, appendtag):
@ -24,17 +28,69 @@ class GryOnlinePl(BasicNewsRecipe):
            url_part = soup.find('link', attrs={'rel':'canonical'})['href']
            url_part = url_part[25:].rpartition('?')[0]
            for nexturl in nexturls[1:-1]:
                soup2 = self.index_to_soup('http://www.gry-online.pl/' + url_part + nexturl['href'])
                finalurl = 'http://www.gry-online.pl/' + url_part + nexturl['href']
                for i in range(10):
                    try:
                        soup2 = self.index_to_soup(finalurl)
                        break
                    except:
                        print 'retrying in 0.5s'
                        time.sleep(0.5)
                pagetext = soup2.find(attrs={'class':'gc660'})
                for r in pagetext.findAll(name='header'):
                    r.extract()
                for r in pagetext.findAll(attrs={'itemprop':'description'}):
                    r.extract()

                pos = len(appendtag.contents)
                appendtag.insert(pos, pagetext)
            for r in appendtag.findAll(attrs={'class':['n5p', 'add-info', 'twitter-share-button', 'lista lista3 lista-gry']}):
                r.extract()
            comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
            for comment in comments:
                comment.extract()
        else:
            tag = appendtag.find('div', attrs={'class':'S018stronyr'})
            if tag:
                nexturl = tag.a
                url_part = soup.find('link', attrs={'rel':'canonical'})['href']
                url_part = url_part[25:].rpartition('?')[0]
                while tag:
                    end = tag.find(attrs={'class':'right left-dead'})
                    if end:
                        break
                    else:
                        nexturl = tag.a
                    finalurl = 'http://www.gry-online.pl/' + url_part + nexturl['href']
                    for i in range(10):
                        try:
                            soup2 = self.index_to_soup(finalurl)
                            break
                        except:
                            print 'retrying in 0.5s'
                            time.sleep(0.5)
                    tag = soup2.find('div', attrs={'class':'S018stronyr'})
                    pagetext = soup2.find(attrs={'class':'gc660'})
                    for r in pagetext.findAll(name='header'):
                        r.extract()
                    for r in pagetext.findAll(attrs={'itemprop':'description'}):
                        r.extract()

                    comments = pagetext.findAll(text=lambda text:isinstance(text, Comment))
                    [comment.extract() for comment in comments]
                    pos = len(appendtag.contents)
                    appendtag.insert(pos, pagetext)
                for r in appendtag.findAll(attrs={'class':['n5p', 'add-info', 'twitter-share-button', 'lista lista3 lista-gry', 'S018strony']}):
                    r.extract()
                comments = appendtag.findAll(text=lambda text:isinstance(text, Comment))
                for comment in comments:
                    comment.extract()

    def image_url_processor(self, baseurl, url):
        if url.startswith('..'):
            return url[2:]
        else:
            return url

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body)
@ -1,5 +1,5 @@
__license__ = 'GPL v3'
__copyright__ = '2008-2012, Darko Miletic <darko.miletic at gmail.com>'
__copyright__ = '2008-2013, Darko Miletic <darko.miletic at gmail.com>'
'''
harpers.org - paid subscription/ printed issue articles
This recipe only gets articles published in text format
@ -72,7 +72,8 @@ class Harpers_full(BasicNewsRecipe):

        #go to the current issue
        soup1 = self.index_to_soup(currentIssue_url)
        date = re.split('\s\|\s',self.tag_to_string(soup1.head.title.string))[0]
        currentIssue_title = self.tag_to_string(soup1.head.title.string)
        date = re.split('\s\|\s',currentIssue_title)[0]
        self.timefmt = u' [%s]'%date

        #get cover
@ -84,27 +85,23 @@ class Harpers_full(BasicNewsRecipe):
        count = 0
        for item in soup1.findAll('div', attrs={'class':'articleData'}):
            text_links = item.findAll('h2')
            for text_link in text_links:
                if count == 0:
                    count = 1
                else:
                    url = text_link.a['href']
                    title = text_link.a.contents[0]
                    date = strftime(' %B %Y')
                    articles.append({
                        'title' :title
                        ,'date' :date
                        ,'url' :url
                        ,'description':''
                        })
        return [(soup1.head.title.string, articles)]
            if text_links:
                for text_link in text_links:
                    if count == 0:
                        count = 1
                    else:
                        url = text_link.a['href']
                        title = self.tag_to_string(text_link.a)
                        date = strftime(' %B %Y')
                        articles.append({
                            'title' :title
                            ,'date' :date
                            ,'url' :url
                            ,'description':''
                            })
        return [(currentIssue_title, articles)]

    def print_version(self, url):
        return url + '?single=1'

    def cleanup(self):
        soup = self.index_to_soup('http://harpers.org/')
        signouturl = self.tag_to_string(soup.find('li', attrs={'class':'subLogOut'}).findNext('li').a['href'])
        self.log(signouturl)
        self.browser.open(signouturl)
recipes/hatalska.recipe (new file)
@ -0,0 +1,27 @@
#!/usr/bin/env python

__license__ = 'GPL v3'
__copyright__ = 'teepel 2012'

'''
hatalska.com
'''

from calibre.web.feeds.news import BasicNewsRecipe

class hatalska(BasicNewsRecipe):
    title = u'Hatalska'
    __author__ = 'teepel <teepel44@gmail.com>'
    language = 'pl'
    description = u'Blog specjalistki z branży mediowo-reklamowej - Natalii Hatalskiej'
    oldest_article = 7
    masthead_url = 'http://hatalska.com/wp-content/themes/jamel/images/logo.png'
    max_articles_per_feed = 100
    simultaneous_downloads = 5
    remove_javascript = True
    no_stylesheets = True

    remove_tags = []
    remove_tags.append(dict(name = 'div', attrs = {'class' : 'feedflare'}))

    feeds = [(u'Blog', u'http://feeds.feedburner.com/hatalskacom')]
@ -41,13 +41,16 @@ class TheHindu(BasicNewsRecipe):
            if current_section and x.get('class', '') == 'tpaper':
                a = x.find('a', href=True)
                if a is not None:
                    title = self.tag_to_string(a)
                    self.log('\tFound article:', title)
                    current_articles.append({'url':a['href']+'?css=print',
                        'title':self.tag_to_string(a), 'date': '',
                        'title':title, 'date': '',
                        'description':''})
            if x.name == 'h3':
                if current_section and current_articles:
                    feeds.append((current_section, current_articles))
                current_section = self.tag_to_string(x)
                self.log('Found section:', current_section)
                current_articles = []
        return feeds
recipes/hnonline.recipe (new file)
@ -0,0 +1,67 @@
from calibre.web.feeds.news import BasicNewsRecipe

class HNonlineRecipe(BasicNewsRecipe):
    __license__ = 'GPL v3'
    __author__ = 'lacike'
    language = 'sk'
    version = 1

    title = u'HNonline'
    publisher = u'HNonline'
    category = u'News, Newspaper'
    description = u'News from Slovakia'
    cover_url = u'http://hnonline.sk/img/sk/_relaunch/logo2.png'

    oldest_article = 1
    max_articles_per_feed = 100
    use_embedded_content = False
    remove_empty_feeds = True

    no_stylesheets = True
    remove_javascript = True

    # Feeds from: http://rss.hnonline.sk, for listing see http://rss.hnonline.sk/prehlad
    feeds = []
    feeds.append((u'HNonline|Ekonomika a firmy', u'http://rss.hnonline.sk/?p=kC1000'))
    feeds.append((u'HNonline|Slovensko', u'http://rss.hnonline.sk/?p=kC2000'))
    feeds.append((u'HNonline|Svet', u'http://rss.hnonline.sk/?p=kC3000'))
    feeds.append((u'HNonline|\u0160port', u'http://rss.hnonline.sk/?p=kC4000'))
    feeds.append((u'HNonline|Online rozhovor', u'http://rss.hnonline.sk/?p=kCR000'))

    feeds.append((u'FinWeb|Spr\u00E1vy zo sveta financi\u00ED', u'http://rss.finweb.hnonline.sk/spravodajstvo'))
    feeds.append((u'FinWeb|Koment\u00E1re a anal\u00FDzy', u'http://rss.finweb.hnonline.sk/?p=kPC200'))
    feeds.append((u'FinWeb|Invest\u00EDcie', u'http://rss.finweb.hnonline.sk/?p=kPC300'))
    feeds.append((u'FinWeb|Svet akci\u00ED', u'http://rss.finweb.hnonline.sk/?p=kPC400'))
    feeds.append((u'FinWeb|Rozhovory', u'http://rss.finweb.hnonline.sk/?p=kPC500'))
    feeds.append((u'FinWeb|T\u00E9ma t\u00FD\u017Ed\u0148a', u'http://rss.finweb.hnonline.sk/?p=kPC600'))
    feeds.append((u'FinWeb|Rebr\u00ED\u010Dky', u'http://rss.finweb.hnonline.sk/?p=kPC700'))

    feeds.append((u'HNstyle|Kult\u00FAra', u'http://style.hnonline.sk/?p=kTC100'))
    feeds.append((u'HNstyle|Auto-moto', u'http://style.hnonline.sk/?p=kTC200'))
    feeds.append((u'HNstyle|Digit\u00E1l', u'http://style.hnonline.sk/?p=kTC300'))
    feeds.append((u'HNstyle|Veda', u'http://style.hnonline.sk/?p=kTCV00'))
    feeds.append((u'HNstyle|Dizajn', u'http://style.hnonline.sk/?p=kTC400'))
    feeds.append((u'HNstyle|Cestovanie', u'http://style.hnonline.sk/?p=kTCc00'))
    feeds.append((u'HNstyle|V\u00EDkend', u'http://style.hnonline.sk/?p=kTC800'))
    feeds.append((u'HNstyle|Gastro', u'http://style.hnonline.sk/?p=kTC600'))
    feeds.append((u'HNstyle|M\u00F3da', u'http://style.hnonline.sk/?p=kTC700'))
    feeds.append((u'HNstyle|Modern\u00E1 \u017Eena', u'http://style.hnonline.sk/?p=kTCA00'))
    feeds.append((u'HNstyle|Pre\u010Do nie?!', u'http://style.hnonline.sk/?p=k7C000'))

    keep_only_tags = []
    keep_only_tags.append(dict(name = 'h1', attrs = {'class': 'detail-titulek'}))
    keep_only_tags.append(dict(name = 'div', attrs = {'class': 'detail-podtitulek'}))
    keep_only_tags.append(dict(name = 'div', attrs = {'class': 'detail-perex'}))
    keep_only_tags.append(dict(name = 'div', attrs = {'class': 'detail-text'}))

    remove_tags = []
    #remove_tags.append(dict(name = 'div', attrs = {'id': re.compile('smeplayer.*')}))

    remove_tags_after = []
    #remove_tags_after = [dict(name = 'p', attrs = {'class': 'autor_line'})]

    extra_css = '''
        @font-face {font-family: "serif1";src:url(res:///opt/sony/ebook/FONT/tt0011m_.ttf)}
        @font-face {font-family: "sans1";src:url(res:///opt/sony/ebook/FONT/LiberationSans.ttf)}
        body {font-family: sans1, serif1;}
    '''
BIN (modified): 389 B → 887 B
BIN recipes/icons/bachormagazyn.png (new file): 898 B
BIN recipes/icons/badania_net.png (new file): 968 B
BIN (modified): 391 B → 772 B
BIN recipes/icons/biweekly.png (new file): 603 B
BIN recipes/icons/blog_biszopa.png (new file): 755 B
BIN (modified): 837 B → 364 B
BIN (modified): 24 KiB → 1.3 KiB
BIN recipes/icons/dwutygodnik.png (new file): 603 B
BIN recipes/icons/dziennik_baltycki.png (new file): 865 B
BIN recipes/icons/dziennik_lodzki.png (new file): 461 B
BIN (modified): 481 B → 1.1 KiB
BIN recipes/icons/dziennik_wschodni.png (new file): 414 B
BIN recipes/icons/dziennik_zachodni.png (new file): 431 B
BIN recipes/icons/echo_dnia.png (new file): 1.1 KiB
BIN (modified): 475 B → 1.1 KiB
BIN recipes/icons/elguardian.png (new file): 305 B
BIN recipes/icons/emuzica_pl.png (new file): 760 B
BIN recipes/icons/esenja.png (new file): 329 B
BIN recipes/icons/esensja_(rss).png (new file): 329 B
BIN recipes/icons/eso_pl.png (new file): 946 B
BIN recipes/icons/film_org_pl.png (new file): 762 B
BIN (modified): 3.4 KiB → 2.2 KiB