mirror of
https://github.com/kovidgoyal/calibre.git
synced 2025-06-21 06:20:44 -04:00
112 lines
4.2 KiB
ReStructuredText
112 lines
4.2 KiB
ReStructuredText
.. _xpath-tutorial:
|
|
|
|
XPath tutorial
|
|
==============
|
|
|
|
In this tutorial, you will be given a gentle introduction to
|
|
`XPath <https://en.wikipedia.org/wiki/XPath>`_, a query language that can be
|
|
used to select arbitrary parts of `HTML <https://en.wikipedia.org/wiki/HTML>`_
|
|
documents in calibre. XPath is a widely
|
|
used standard, and googling it will yield a ton of information. This tutorial,
|
|
however, focuses on using XPath for e-book related tasks like finding chapter
|
|
headings in an unstructured HTML document.
|
|
|
|
.. contents:: Contents
|
|
:depth: 1
|
|
:local:
|
|
|
|
Selecting by tag name
|
|
----------------------------------------
|
|
|
|
The simplest form of selection is to select tags by name. For example,
|
|
suppose you want to select all the ``<h2>`` tags in a document. The XPath
|
|
query for this is simply::
|
|
|
|
//h:h2 (Selects all <h2> tags)
|
|
|
|
The prefix `//` means *search at any level of the document*. Now suppose you
|
|
want to search for ``<span>`` tags that are inside ``<a>`` tags. That can be
|
|
achieved with::
|
|
|
|
//h:a/h:span (Selects <span> tags inside <a> tags)
|
|
|
|
If you want to search for tags at a particular level in the document, change
|
|
the prefix::
|
|
|
|
/h:body/h:div/h:p (Selects <p> tags that are children of <div> tags that are
|
|
children of the <body> tag)
|
|
|
|
This will match only ``<p>A very short e-book to demonstrate the use of XPath.</p>``
|
|
in the :ref:`sample_ebook` but not any of the other ``<p>`` tags. The ``h:`` prefix
|
|
in the above examples is needed to match XHTML tags. This is because internally,
|
|
calibre represents all content as XHTML. In XHTML tags have a *namespace*, and
|
|
``h:`` is the namespace prefix for HTML tags.
|
|
|
|
Now suppose you want to select both ``<h1>`` and ``<h2>`` tags. To do that,
|
|
we need a XPath construct called *predicate*. A :dfn:`predicate` is simply
|
|
a test that is used to select tags. Tests can be arbitrarily powerful and as
|
|
this tutorial progresses, you will see more powerful examples. A predicate
|
|
is created by enclosing the test expression in square brackets::
|
|
|
|
//*[name()='h1' or name()='h2']
|
|
|
|
There are several new features in this XPath expression. The first is the use
|
|
of the wildcard ``*``. It means *match any tag*. Now look at the test expression
|
|
``name()='h1' or name()='h2'``. :term:`name()` is an example of a *built-in function*.
|
|
It simply evaluates to the name of the tag. So by using it, we can select tags
|
|
whose names are either `h1` or `h2`. Note that the :term:`name()` function
|
|
ignores namespaces so that there is no need for the ``h:`` prefix.
|
|
XPath has several useful built-in functions. A few more will be introduced in this tutorial.
|
|
|
|
Selecting by attributes
|
|
-----------------------
|
|
|
|
To select tags based on their attributes, the use of predicates is required::
|
|
|
|
//*[@style] (Select all tags that have a style attribute)
|
|
//*[@class="chapter"] (Select all tags that have class="chapter")
|
|
//h:h1[@class="bookTitle"] (Select all h1 tags that have class="bookTitle")
|
|
|
|
Here, the ``@`` operator refers to the attributes of the tag. You can use some
|
|
of the `XPath built-in functions`_ to perform more sophisticated
|
|
matching on attribute values.
|
|
|
|
|
|
Selecting by tag content
|
|
------------------------
|
|
|
|
Using XPath, you can even select tags based on the text they contain. The best way to do this is
|
|
to use the power of *regular expressions* via the built-in function :term:`re:test()`::
|
|
|
|
//h:h2[re:test(., 'chapter|section', 'i')] (Selects <h2> tags that contain the words chapter or
|
|
section)
|
|
|
|
Here the ``.`` operator refers to the contents of the tag, just as the ``@`` operator referred
|
|
to its attributes.
|
|
|
|
.. _sample_ebook :
|
|
|
|
Sample e-book
|
|
-------------
|
|
|
|
.. literalinclude:: xpath.xhtml
|
|
:language: html
|
|
|
|
XPath built-in functions
|
|
------------------------
|
|
|
|
.. glossary::
|
|
|
|
name()
|
|
The name of the current tag.
|
|
|
|
contains()
|
|
``contains(s1, s2)`` returns `true` if s1 contains s2.
|
|
|
|
re:test()
|
|
``re:test(src, pattern, flags)`` returns `true` if the string `src` matches the
|
|
regular expression `pattern`. A particularly useful flag is ``i``, it makes matching
|
|
case insensitive. A good primer on the syntax for regular expressions can be found
|
|
at `regexp syntax <https://docs.python.org/2.7/library/re.html>`_
|
|
|