Add note about modern html parsers

2025-07-09 03:04:10 -04:00 · 2013-11-03 09:48:18 +05:30 · 6fa4022c78
commit 6fa4022c78
parent 6b8ca10f6a
1 changed files with 20 additions and 1 deletions
--- a/manual/news.rst
+++ b/manual/news.rst
@@ -254,7 +254,26 @@ The next interesting feature is::
 ``needs_subscription = True`` tells |app| that this recipe needs a username and password in order to access the content. This causes, |app| to ask for a username and password whenever you try to use this recipe. The code in :meth:`calibre.web.feeds.news.BasicNewsRecipe.get_browser` actually does the login into the NYT website. Once logged in, |app| will use the same, logged in, browser instance to fetch all content. See `mechanize <http://wwwsearch.sourceforge.net/mechanize/>`_ to understand the code in ``get_browser``.
-The next new feature is the :meth:`calibre.web.feeds.news.BasicNewsRecipe.parse_index` method. Its job is to go to http://www.nytimes.com/pages/todayspaper/index.html and fetch the list of articles that appear in *todays* paper. While more complex than simply using :term:`RSS`, the recipe creates an ebook that corresponds very closely to the days paper. ``parse_index`` makes heavy use of `BeautifulSoup <http://www.crummy.com/software/BeautifulSoup/documentation.html>`_ to parse the daily paper webpage.
+The next new feature is the
 :meth:`calibre.web.feeds.news.BasicNewsRecipe.parse_index` method. Its job is
 to go to http://www.nytimes.com/pages/todayspaper/index.html and fetch the list
 of articles that appear in *todays* paper. While more complex than simply using
 :term:`RSS`, the recipe creates an ebook that corresponds very closely to the
 days paper. ``parse_index`` makes heavy use of `BeautifulSoup
 <http://www.crummy.com/software/BeautifulSoup/documentation.html>`_ to parse
 the daily paper webpage. You can also use other, more modern parsers if you
 dislike BeautifulSoup. calibre comes with `lxml <http://lxml.de/>`_ and
 `html5lib <https://github.com/html5lib/html5lib-python>`_, which are the
 recommended parsers. To use them, replace the call to ``index_to_soup()`` with
 the following::
    raw = self.index_to_soup(url, raw=True)
    # For html5lib
    import html5lib
    root = html5lib.parse(raw, namespaceHTMLElements=False, treebuilder='lxml')
    # For the lxml html 4 parser
    from lxml import html
    root = html.fromstring(raw)
 The final new feature is the :meth:`calibre.web.feeds.news.BasicNewsRecipe.preprocess_html` method. It can be used to perform arbitrary transformations on every downloaded HTML page. Here it is used to bypass the ads that the nytimes shows you before each article.
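A minimal sketch of what a recipe might do with the ``root`` produced by the lxml route in this patch. It runs outside calibre, so the ``raw`` bytes below stand in for the result of ``self.index_to_soup(url, raw=True)``; the sample markup, the ``story`` class name, and the ``extract_articles`` helper are assumptions made for illustration, not part of the patch or of the real NYT page.

```python
from lxml import html

# Hypothetical stand-in for self.index_to_soup(url, raw=True) inside a recipe.
# The markup and the "story" class are invented for this example.
raw = b"""
<html><body>
  <div class="story"><h3><a href="/2013/11/03/world/a.html">First headline</a></h3></div>
  <div class="story"><h3><a href="/2013/11/03/us/b.html">Second headline</a></h3></div>
</body></html>
"""


def extract_articles(raw_html, base_url):
    """Return a list of {'title', 'url'} dicts for links inside div.story."""
    root = html.fromstring(raw_html)
    # Rewrite relative hrefs against the page URL so each link is fetchable.
    root.make_links_absolute(base_url)
    articles = []
    for a in root.xpath('//div[@class="story"]//a[@href]'):
        articles.append({
            'title': a.text_content().strip(),
            'url': a.get('href'),
        })
    return articles


articles = extract_articles(raw, 'http://www.nytimes.com/')
for article in articles:
    print(article['title'], article['url'])
```

The ``title`` and ``url`` keys are chosen to resemble the article dictionaries a ``parse_index`` implementation typically builds; the XPath expression would need adapting to the markup of the page actually being parsed.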