.. Mirror of https://github.com/kovidgoyal/calibre.git, synced 2025-11-03 19:17:02 -05:00

Function mode for Search & replace in the Editor
================================================

The :guilabel:`Search & replace` tool in the editor supports a *function mode*.
In this mode, you can combine regular expressions (see :doc:`regexp`) with
arbitrarily powerful Python functions to do all sorts of advanced text
processing.

In the standard *regexp* mode for search and replace, you specify both a
regular expression to search for as well as a template that is used to replace
all found matches. In function mode, instead of using a fixed template, you
specify an arbitrary function, in the
`Python programming language <https://docs.python.org>`_. This allows
you to do lots of things that are not possible with simple templates.

Techniques for using function mode and the syntax will be described by means of
examples, showing you how to create functions to perform progressively more
complex tasks.


.. image:: images/function_replace.png
    :alt: The Function mode
    :align: center

Automatically fixing the case of headings in the document
---------------------------------------------------------

Here, we will use one of the builtin functions in the editor to
automatically change the case of all text inside heading tags to title case::

    Find expression: <([Hh][1-6])[^>]*>.+?</\1>

For the function, simply choose the :guilabel:`Title-case text (ignore tags)` builtin
function. This will change titles that look like: ``<h1>some TITLE</h1>`` to
``<h1>Some Title</h1>``. It will work even if there are other HTML tags inside
the heading tags.

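To get a feel for what the find expression hands to the function, here is a
minimal sketch that runs outside the editor, using only Python's standard
``re`` module. The helper ``title_case_ignoring_tags()`` is a hypothetical
stand-in written for this illustration, not the editor's builtin, and it relies
on ``str.title()``, which is cruder than proper title-casing.

.. code-block:: python

    import re

    HEADING = re.compile(r'<([Hh][1-6])[^>]*>.+?</\1>')

    def title_case_ignoring_tags(match):
        # Split the matched heading into tags and text runs,
        # then title-case only the text runs (hypothetical helper)
        parts = re.split(r'(<[^<>]+>)', match.group())
        return ''.join(p if p.startswith('<') else p.title() for p in parts)

    html = '<h1>some <i>TITLE</i></h1><p>body text</p>'
    result = HEADING.sub(title_case_ignoring_tags, html)
    print(result)

Only the heading is changed; the paragraph that follows it is not matched by
the find expression and is left alone.
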
Your first custom function - smartening hyphens
-----------------------------------------------

The real power of function mode comes from being able to create your own
functions to process text in arbitrary ways. The Smarten Punctuation tool in
the editor leaves individual hyphens alone, so you can use this function to
replace them with em-dashes.

To create a new function, click the :guilabel:`Create/edit` button and copy
the Python code from below.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        return match.group().replace('--', '—').replace('-', '—')

Every :guilabel:`Search & replace` custom function must have a unique name and consist of a
Python function named ``replace`` that accepts all the arguments shown above.
For the moment, we won't worry about all the different arguments to the
``replace()`` function. Just focus on the ``match`` argument. It represents a
match when running a search and replace. Its full documentation is available
`here <https://docs.python.org/library/re.html#match-objects>`_.
``match.group()`` simply returns all the matched text and all we do is replace
hyphens in that text with em-dashes, first replacing double hyphens and
then single hyphens.

Use this function with the find regular expression::

    >[^<>]+<

And it will replace all hyphens with em-dashes, but only in actual text and not
inside HTML tag definitions.

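If you want to try the function before using it on a book, the following
sketch applies it with ``re.sub()`` from the standard library. The ``run()``
helper is an assumption made for this example: it only imitates the editor by
passing placeholder values for every argument except ``match``.

.. code-block:: python

    import re

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        return match.group().replace('--', '—').replace('-', '—')

    def run(html):
        # Hypothetical stand-in for the editor: only ``match`` carries real data
        return re.sub(r'>[^<>]+<', lambda m: replace(m, 1, '', None, None, {}, {}), html)

    print(run('<p class="a-b">wait -- what - now</p>'))

The hyphen inside the ``class`` attribute survives, because the find expression
only matches text that sits between a ``>`` and the next ``<``.
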
The power of function mode - using a spelling dictionary to fix mis-hyphenated words
------------------------------------------------------------------------------------

Often, e-books created from scans of printed books contain mis-hyphenated words
-- words that were split at the end of the line on the printed page. We will
write a simple function to automatically find and fix such words.

.. code-block:: python

    import regex
    from calibre import replace_entities
    from calibre import prepare_string_for_xml

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

        def replace_word(wmatch):
            # Try to remove the hyphen and replace the words if the resulting
            # hyphen-free word is recognized by the dictionary
            without_hyphen = wmatch.group(1) + wmatch.group(2)
            if dictionaries.recognized(without_hyphen):
                return without_hyphen
            return wmatch.group()

        # Search for words split by a hyphen
        text = replace_entities(match.group()[1:-1])  # Handle HTML entities like &amp;
        corrected = regex.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text, flags=regex.VERSION1 | regex.UNICODE)
        return '>%s<' % prepare_string_for_xml(corrected)  # Put back required entities

Use this function with the same find expression as before, namely::

    >[^<>]+<

And it will fix all mis-hyphenated words in the text of the book. The
main trick is to use one of the useful extra arguments to the replace function,
``dictionaries``.  This refers to the dictionaries the editor itself uses to
spell check text in the book. What this function does is look for words
separated by a hyphen, remove the hyphen and check if the dictionary recognizes
the composite word. If it does, the original words are replaced by the
hyphen-free composite word.

Note that one limitation of this technique is that it will only work for
mono-lingual books, because, by default, ``dictionaries.recognized()`` uses the
main language of the book.

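The dictionary check can be tried in isolation with a small sketch. Everything
here is an assumption made for illustration: ``FakeDictionaries`` is a
hypothetical stand-in that only mimics the ``recognized()`` method, and the
standard ``re`` module is used in place of ``regex`` so the code runs without
calibre.

.. code-block:: python

    import re

    class FakeDictionaries:
        # Hypothetical stand-in for the editor's ``dictionaries`` object
        def __init__(self, words):
            self.words = {w.lower() for w in words}

        def recognized(self, word):
            return word.lower() in self.words

    def fix_hyphens(text, dictionaries):
        def replace_word(wmatch):
            # Join the two halves and keep the result only if it is a known word
            without_hyphen = wmatch.group(1) + wmatch.group(2)
            if dictionaries.recognized(without_hyphen):
                return without_hyphen
            return wmatch.group()
        return re.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text)

    known = FakeDictionaries(['something', 'beautiful'])
    print(fix_hyphens('some- thing beauti-ful well-known', known))

Because ``wellknown`` is not in the word list, ``well-known`` keeps its hyphen,
which is the behavior you want for genuinely hyphenated compounds.
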
Auto numbering sections
-----------------------

Now we will see something a little different. Suppose your HTML file has many
sections, each with a heading in an :code:`<h2>` tag that looks like
:code:`<h2>Some text</h2>`. You can create a custom function that will
automatically number these headings with consecutive section numbers, so that
they look like :code:`<h2>1. Some text</h2>`.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        section_number = '%d. ' % number
        return match.group(1) + section_number + match.group(2)

    # Ensure that when running over multiple files, the files are processed
    # in the order in which they appear in the book
    replace.file_order = 'spine'

Use it with the find expression::

    (?s)(<h2[^<>]*>)(.+?</h2>)

Place the cursor at the top of the file and click :guilabel:`Replace all`.

This function uses another of the useful extra arguments to ``replace()``: the
``number`` argument. When doing a :guilabel:`Replace All`, ``number`` is
automatically incremented for every successive match.

Another new feature is the use of ``replace.file_order`` -- setting that to
``'spine'`` means that if this search is run on multiple HTML files, the files
are processed in the order in which they appear in the book. See
:ref:`file_order_replace_all` for details.

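The way ``number`` grows from one match to the next can be imitated outside the
editor. In the sketch below, ``replace_all()`` is a hypothetical harness written
for this example; it counts matches starting at 1, as described above, and
passes placeholders for the arguments the function does not use.

.. code-block:: python

    import re

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        section_number = '%d. ' % number
        return match.group(1) + section_number + match.group(2)

    def replace_all(pattern, html):
        # Hypothetical harness: call replace() with 1, 2, 3, ... for successive matches
        counter = 0

        def call(m):
            nonlocal counter
            counter += 1
            return replace(m, counter, 'index.html', None, None, {}, {})

        return re.sub(pattern, call, html)

    html = '<h2>Intro</h2><p>x</p><h2 class="c">Next</h2>'
    print(replace_all(r'(?s)(<h2[^<>]*>)(.+?</h2>)', html))
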
Auto create a Table of Contents
-------------------------------

Finally, let's try something a little more ambitious. Suppose your book has
headings in ``h1`` and ``h2`` tags that look like
``<h1 id="someid">Some Text</h1>``. We will auto-generate an HTML Table of
Contents based on these headings. Create the custom function below:

.. code-block:: python

    from calibre import replace_entities
    from calibre.ebooks.oeb.polish.toc import TOC, toc_to_html
    from calibre.gui2.tweak_book import current_container
    from calibre.ebooks.oeb.base import xml2str

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        if match is None:
            # All matches found, output the resulting Table of Contents.
            # The argument metadata is the metadata of the book being edited
            if 'toc' in data:
                toc = data['toc']
                root = TOC()
                for (file_name, tag_name, anchor, text) in toc:
                    parent = root.children[-1] if tag_name == 'h2' and root.children else root
                    parent.add(text, file_name, anchor)
                toc = toc_to_html(root, current_container(), 'toc.html', 'Table of Contents for ' + metadata.title, metadata.language)
                print(xml2str(toc))
            else:
                print('No headings to build ToC from found')
        else:
            # Add an entry corresponding to this match to the Table of Contents
            if 'toc' not in data:
                # The entries are stored in the data object, which will persist
                # for all invocations of this function during a 'Replace All' operation
                data['toc'] = []
            tag_name, anchor, text = match.group(1), replace_entities(match.group(2)), replace_entities(match.group(3))
            data['toc'].append((file_name, tag_name, anchor, text))
            return match.group()  # We don't want to make any actual changes, so return the original matched text

    # Ensure that we are called once after the last match is found so we can
    # output the ToC
    replace.call_after_last_match = True
    # Ensure that when running over multiple files, the files are processed
    # in the order in which they appear in the book
    replace.file_order = 'spine'

And use it with the find expression::

    <(h[12])\b[^<>]*\sid=['"]([^'"]+)['"][^<>]*>([^<>]+)

Run the search on :guilabel:`All text files` and at the end of the search, a
window will pop up with "Debug output from your function" which will have the
HTML Table of Contents, ready to be pasted into :file:`toc.html`.

The function above is heavily commented, so it should be easy to follow. The
key new feature is the use of another useful extra argument to the
``replace()`` function, the ``data`` object. The ``data`` object is a Python
*dict* that persists between all successive invocations of ``replace()`` during
a single :guilabel:`Replace All` operation.

Another new feature is the use of ``call_after_last_match`` -- setting that to
``True`` on the ``replace()`` function means that the editor will call
``replace()`` one extra time after all matches have been found. For this extra
call, the match object will be ``None``.

This was just a demonstration to show you the power of function mode.
If you really needed to generate a Table of Contents from headings in your book,
you would be better off using the dedicated Table of Contents tool in
:guilabel:`Tools->Table of Contents`.

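The calling sequence described above (one call per match sharing the same
``data`` dict, then one final call with ``match`` set to ``None``) can be
sketched without calibre. The ``replace_all()`` harness, the file names and the
heading pattern are all assumptions made for this illustration, and the value
passed as ``number`` in the final call is a guess rather than documented
behavior.

.. code-block:: python

    import re

    # Heading pattern written for this sketch
    PATTERN = r'''<(h[12])\b[^<>]*\sid=['"]([^'"]+)['"][^<>]*>([^<>]+)'''

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        if match is None:
            # Extra call after the last match: report what was collected
            print('%d headings collected' % len(data.get('toc', [])))
            return ''
        entry = (file_name, match.group(1), match.group(2), match.group(3))
        data.setdefault('toc', []).append(entry)
        return match.group()

    replace.call_after_last_match = True

    def replace_all(pattern, files):
        # Hypothetical harness imitating one Replace all run over several files
        data, number = {}, 0
        for file_name, html in files:
            for m in re.finditer(pattern, html):
                number += 1
                replace(m, number, file_name, None, None, data, {})
        if getattr(replace, 'call_after_last_match', False):
            replace(None, number, '', None, None, data, {})
        return data

    files = [('a.html', '<h1 id="c1">One</h1>'),
             ('b.html', '<h2 class="x" id="s1">Two</h2>')]
    collected = replace_all(PATTERN, files)
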
The API for the function mode
-----------------------------

All function mode functions must be Python functions named ``replace``, with the
following signature::

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        return a_string

When a find/replace is run, the ``replace()`` function is called for every
match that is found and must return the replacement string for that match.
If no replacements are to be done, it should return ``match.group()`` which is
the original string. The various arguments to the ``replace()`` function are
documented below.

The ``match`` argument
^^^^^^^^^^^^^^^^^^^^^^

The ``match`` argument represents the currently found match. It is a
`Python Match object <https://docs.python.org/library/re.html#match-objects>`_.
Its most useful method is ``group()`` which can be used to get the matched
text corresponding to individual capture groups in the search regular
expression.

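As a quick reminder of how ``group()`` behaves, here is a short sketch using the
standard ``re`` module. The pattern and the sample HTML are made up for this
example.

.. code-block:: python

    import re

    m = re.search(r'<(h[1-6])[^<>]*>([^<>]+)', '<h2 class="t">Chapter One</h2>')

    whole = m.group()      # the entire matched text, same as m.group(0)
    tag = m.group(1)       # first capture group: the tag name
    text = m.group(2)      # second capture group: the heading text
    both = m.group(1, 2)   # several groups at once, returned as a tuple

    print(whole, tag, text, both, sep='\n')
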
The ``number`` argument
^^^^^^^^^^^^^^^^^^^^^^^

The ``number`` argument is the number of the current match. When you run
:guilabel:`Replace All`, every successive match will cause ``replace()`` to be
called with an increasing number. The first match has number 1.

The ``file_name`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the filename of the file in which the current match was found. When
searching inside marked text, the ``file_name`` is empty. The ``file_name`` is
in canonical form, a path relative to the root of the book, using ``/`` as the
path separator.

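One use for ``file_name`` is to restrict a change to part of the book. The
sketch below upper-cases matches only in files under a ``text/`` folder; that
folder name is a hypothetical example and will differ from book to book.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        # file_name is empty for marked text, so such matches are left alone too
        if not file_name.startswith('text/'):
            return match.group()
        return match.group().upper()
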
The ``metadata`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^

This represents the metadata of the current book, such as title, authors,
language, etc. It is an object of class :class:`calibre.ebooks.metadata.book.base.Metadata`.
Useful attributes include ``title``, ``authors`` (a list of authors) and
``language`` (the language code).

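As an illustration, the following sketch fills placeholders in the text from
the book's metadata. The ``{title}`` and ``{authors}`` placeholders are a
convention invented for this example, and the values are escaped with the
standard library's ``html.escape()`` so that characters such as ``&`` stay
valid in HTML.

.. code-block:: python

    from html import escape

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        # Hypothetical placeholders, filled from the attributes described above
        title = escape(metadata.title, quote=False)
        authors = escape(' & '.join(metadata.authors), quote=False)
        return match.group().replace('{title}', title).replace('{authors}', authors)

Use it with a find expression such as ``>[^<>]+<`` so that only text between
tags is touched.
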
The ``dictionaries`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This represents the collection of dictionaries used for spell checking the
current book. Its most useful method is ``dictionaries.recognized(word)``
which will return ``True`` if the passed in word is recognized by the dictionary
for the current book's language.

The ``data`` argument
^^^^^^^^^^^^^^^^^^^^^

This is a simple Python ``dict``. When you run
:guilabel:`Replace all`, every successive match will cause ``replace()`` to be
called with the same ``dict`` as data. You can thus use it to store arbitrary
data between invocations of ``replace()`` during a :guilabel:`Replace all`
operation.

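For example, ``data`` makes it possible to treat the first occurrence of
something differently from later ones. This sketch wraps only the first
occurrence of each matched term in every file in ``<b>`` tags; the ``'seen'``
key is simply a name chosen for this example.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        # Remember which (file, term) pairs were already handled in this run
        seen = data.setdefault('seen', set())
        key = (file_name, match.group())
        if key in seen:
            return match.group()
        seen.add(key)
        return '<b>%s</b>' % match.group()
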
The ``functions`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``functions`` argument gives you access to all other user defined
functions. This is useful for code re-use. You can define utility functions in
one place and re-use them in all your other functions. For example, suppose you
create a function named ``My Function`` like this:

.. code-block:: python

    def utility():
        ...  # do something

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...

Then, in another function, you can access the ``utility()`` function like this:

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        utility = functions['My Function']['utility']
        ...

You can also use the ``functions`` object to store persistent data that can be
re-used by other functions. For example, you could have one function that when
run with :guilabel:`Replace All` collects some data and another function that
uses it when it is run afterwards. Consider the following two functions:

.. code-block:: python

    # Function One
    persistent_data = {}

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...
        persistent_data['something'] = 'some data'

    # Function Two
    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        persistent_data = functions['Function One']['persistent_data']
        ...

Debugging your functions
^^^^^^^^^^^^^^^^^^^^^^^^

You can debug the functions you create by using the standard ``print()``
function from Python. The output of print will be displayed in a popup window
after the Find/replace has completed. You saw an example of using ``print()``
to output an entire table of contents above.

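A simple way to start debugging is a function that changes nothing and just
reports what it was called with. The message format below is only a
suggestion.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        # Log every call, then return the matched text unchanged
        print('match %d in %r: %r' % (number, file_name, match.group()))
        return match.group()

Running this with your find expression shows exactly which text is matched
before you write the code that modifies it.
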
.. _file_order_replace_all:

Choose file order when running on multiple HTML files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When you run a :guilabel:`Replace all` on multiple HTML files, the order in
which the files are processed depends on what files you have open for editing.
You can force the search to process files in the order in which they appear by
setting the ``file_order`` attribute on your function, like this:

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...

    replace.file_order = 'spine'

``file_order`` accepts two values, ``spine`` and ``spine-reverse`` which cause
the search to process multiple files in the order they appear in the book,
either forwards or backwards, respectively.

Having your function called an extra time after the last match is found
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sometimes, as in the auto generate table of contents example above, it is
useful to have your function called an extra time after the last match is
found. You can do this by setting the ``call_after_last_match`` attribute on your
function, like this:

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...

    replace.call_after_last_match = True

Appending the output from the function to marked text
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running search and replace on marked text, it is sometimes useful to
append some text to the end of the marked text. You can do that by setting
the ``append_final_output_to_marked`` attribute on your function (note that you
also need to set ``call_after_last_match``), like this:

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...
        return 'some text to append'

    replace.call_after_last_match = True
    replace.append_final_output_to_marked = True

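As a slightly fuller sketch, the function below adds up every number it
matches in the marked text and returns a total from the final extra call. It
assumes, as the example above suggests, that the string returned when ``match``
is ``None`` is the text that gets appended; the ``'total'`` key and the
``<p>`` wrapper are choices made for this illustration. It is meant to be used
with a find expression such as ``\d+``.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        if match is None:
            # Final extra call: return the text to append after the marked text
            return '<p>Total: %d</p>' % data.get('total', 0)
        # Regular call: accumulate the number and leave the text unchanged
        data['total'] = data.get('total', 0) + int(match.group())
        return match.group()

    replace.call_after_last_match = True
    replace.append_final_output_to_marked = True
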
Suppressing the result dialog when performing searches on marked text
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can also suppress the result dialog (which can slow down the repeated
application of a search/replace on many blocks of text) by setting
the ``suppress_result_dialog`` attribute on your function, like this:

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...

    replace.suppress_result_dialog = True


More examples
-------------

More useful examples, contributed by calibre users, can be found in the
`calibre E-book editor forum <https://www.mobileread.com/forums/showthread.php?t=237181>`_.