mirror of
				https://github.com/kovidgoyal/calibre.git
				synced 2025-10-26 00:02:25 -04:00 
			
		
		
		
	
		
			
				
	
	
		
			412 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			412 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| Function mode for Search & replace in the Editor
 | |
| ================================================
 | |
| 
 | |
| The :guilabel:`Search & replace` tool in the editor support a *function mode*.
 | |
| In this mode, you can combine regular expressions (see :doc:`regexp`) with
 | |
| arbitrarily powerful Python functions to do all sorts of advanced text
 | |
| processing.
 | |
| 
 | |
| In the standard *regexp* mode for search and replace, you specify both a
 | |
| regular expression to search for as well as a template that is used to replace
 | |
| all found matches. In function mode, instead of using a fixed template, you
 | |
| specify an arbitrary function, in the
 | |
| `Python programming language <https://docs.python.org>`_. This allows
 | |
| you to do lots of things that are not possible with simple templates.
 | |
| 
 | |
| Techniques for using function mode and the syntax will be described by means of
 | |
| examples, showing you how to create functions to perform progressively more
 | |
| complex tasks.
 | |
| 
 | |
| 
 | |
| .. image:: images/function_replace.png
 | |
|     :alt: The Function mode
 | |
|     :align: center
 | |
| 
 | |
| Automatically fixing the case of headings in the document
 | |
| ---------------------------------------------------------
 | |
| 
 | |
| Here, we will leverage one of the builtin functions in the editor to
 | |
| automatically change the case of all text inside heading tags to title case::
 | |
| 
 | |
|     Find expression: <([Hh][1-6])[^>]*>.+?</\1>
 | |
| 
 | |
| For the function, simply choose the :guilabel:`Title-case text (ignore tags)` builtin
 | |
| function. The will change titles that look like: ``<h1>some titLE</h1>`` to
 | |
| ``<h1>Some Title</h1>``. It will work even if there are other HTML tags inside
 | |
| the heading tags.
 | |
| 
 | |
| 
 | |
| Your first custom function - smartening hyphens
 | |
| -----------------------------------------------
 | |
| 
 | |
| The real power of function mode comes from being able to create your own
 | |
| functions to process text in arbitrary ways. The Smarten Punctuation tool in
 | |
| the editor leaves individual hyphens alone, so you can use the this function to
 | |
| replace them with em-dashes.
 | |
| 
 | |
| To create a new function, simply click the :guilabel:`Create/edit` button to create a new
 | |
| function and copy the Python code from below.
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         return match.group().replace('--', '—').replace('-', '—')
 | |
| 
 | |
| Every :guilabel:`Search & replace` custom function must have a unique name and consist of a
 | |
| Python function named replace, that accepts all the arguments shown above.
 | |
| For the moment, we won't worry about all the different arguments to
 | |
| ``replace()`` function. Just focus on the ``match`` argument. It represents a
 | |
| match when running a search and replace. Its full documentation in available
 | |
| `here <https://docs.python.org/library/re.html#match-objects>`_.
 | |
| ``match.group()`` simply returns all the matched text and all we do is replace
 | |
| hyphens in that text with em-dashes, first replacing double hyphens and
 | |
| then single hyphens.
 | |
| 
 | |
| Use this function with the find regular expression::
 | |
| 
 | |
|     >[^<>]+<
 | |
| 
 | |
| And it will replace all hyphens with em-dashes, but only in actual text and not
 | |
| inside HTML tag definitions.
 | |
| 
 | |
| 
 | |
| The power of function mode - using a spelling dictionary to fix mis-hyphenated words
 | |
| ------------------------------------------------------------------------------------
 | |
| 
 | |
| Often, e-books created from scans of printed books contain mis-hyphenated words
 | |
| -- words that were split at the end of the line on the printed page. We will
 | |
| write a simple function to automatically find and fix such words.
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     import regex
 | |
|     from calibre import replace_entities
 | |
|     from calibre import prepare_string_for_xml
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
| 
 | |
|         def replace_word(wmatch):
 | |
|             # Try to remove the hyphen and replace the words if the resulting
 | |
|             # hyphen free word is recognized by the dictionary
 | |
|             without_hyphen = wmatch.group(1) + wmatch.group(2)
 | |
|             if dictionaries.recognized(without_hyphen):
 | |
|                 return without_hyphen
 | |
|             return wmatch.group()
 | |
| 
 | |
|         # Search for words split by a hyphen
 | |
|         text = replace_entities(match.group()[1:-1])  # Handle HTML entities like &
 | |
|         corrected = regex.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text, flags=regex.VERSION1 | regex.UNICODE)
 | |
|         return '>%s<' % prepare_string_for_xml(corrected)  # Put back required entities
 | |
| 
 | |
| Use this function with the same find expression as before, namely::
 | |
| 
 | |
|     >[^<>]+<
 | |
| 
 | |
| And it will magically fix all mis-hyphenated words in the text of the book. The
 | |
| main trick is to use one of the useful extra arguments to the replace function,
 | |
| ``dictionaries``.  This refers to the dictionaries the editor itself uses to
 | |
| spell check text in the book. What this function does is look for words
 | |
| separated by a hyphen, remove the hyphen and check if the dictionary recognizes
 | |
| the composite word, if it does, the original words are replaced by the hyphen
 | |
| free composite word.
 | |
| 
 | |
| Note that one limitation of this technique is it will only work for
 | |
| mono-lingual books, because, by default, ``dictionaries.recognized()`` uses the
 | |
| main language of the book.
 | |
| 
 | |
| 
 | |
| Auto numbering sections
 | |
| -----------------------
 | |
| 
 | |
| Now we will see something a little different. Suppose your HTML file has many
 | |
| sections, each with a heading in an :code:`<h2>` tag that looks like
 | |
| :code:`<h2>Some text</h2>`. You can create a custom function that will
 | |
| automatically number these headings with consecutive section numbers, so that
 | |
| they look like :code:`<h2>1. Some text</h2>`.
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         section_number = '%d. ' % number
 | |
|         return match.group(1) + section_number + match.group(2)
 | |
| 
 | |
|     # Ensure that when running over multiple files, the files are processed
 | |
|     # in the order in which they appear in the book
 | |
|     replace.file_order = 'spine'
 | |
| 
 | |
| Use it with the find expression::
 | |
| 
 | |
|     (?s)(<h2[^<>]*>)(.+?</h2>)
 | |
| 
 | |
| Place the cursor at the top of the file and click :guilabel:`Replace all`.
 | |
| 
 | |
| This function uses another of the useful extra arguments to ``replace()``: the
 | |
| ``number`` argument. When doing a :guilabel:`Replace All` number is
 | |
| automatically incremented for every successive match.
 | |
| 
 | |
| Another new feature is the use of ``replace.file_order`` -- setting that to
 | |
| ``'spine'`` means that if this search is run on multiple HTML files, the files
 | |
| are processed in the order in which they appear in the book. See
 | |
| :ref:`file_order_replace_all` for details.
 | |
| 
 | |
| 
 | |
| Auto create a Table of Contents
 | |
| -------------------------------
 | |
| 
 | |
| Finally, lets try something a little more ambitious. Suppose your book has
 | |
| headings in ``h1`` and ``h2`` tags that look like
 | |
| ``<h1 id="someid">Some Text</h1>``. We will auto-generate an HTML Table of
 | |
| Contents based on these headings. Create the custom function below:
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     from calibre import replace_entities
 | |
|     from calibre.ebooks.oeb.polish.toc import TOC, toc_to_html
 | |
|     from calibre.gui2.tweak_book import current_container
 | |
|     from calibre.ebooks.oeb.base import xml2str
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         if match is None:
 | |
|             # All matches found, output the resulting Table of Contents.
 | |
|             # The argument metadata is the metadata of the book being edited
 | |
|             if 'toc' in data:
 | |
|                 toc = data['toc']
 | |
|                 root = TOC()
 | |
|                 for (file_name, tag_name, anchor, text) in toc:
 | |
|                     parent = root.children[-1] if tag_name == 'h2' and root.children else root
 | |
|                     parent.add(text, file_name, anchor)
 | |
|                 toc = toc_to_html(root, current_container(), 'toc.html', 'Table of Contents for ' + metadata.title, metadata.language)
 | |
|                 print(xml2str(toc))
 | |
|             else:
 | |
|                 print('No headings to build ToC from found')
 | |
|         else:
 | |
|             # Add an entry corresponding to this match to the Table of Contents
 | |
|             if 'toc' not in data:
 | |
|                 # The entries are stored in the data object, which will persist
 | |
|                 # for all invocations of this function during a 'Replace All' operation
 | |
|                 data['toc'] = []
 | |
|             tag_name, anchor, text = match.group(1), replace_entities(match.group(2)), replace_entities(match.group(3))
 | |
|             data['toc'].append((file_name, tag_name, anchor, text))
 | |
|             return match.group()  # We don't want to make any actual changes, so return the original matched text
 | |
| 
 | |
|     # Ensure that we are called once after the last match is found so we can
 | |
|     # output the ToC
 | |
|     replace.call_after_last_match = True
 | |
|     # Ensure that when running over multiple files, this function is called,
 | |
|     # the files are processed in the order in which they appear in the book
 | |
|     replace.file_order = 'spine'
 | |
| 
 | |
| And use it with the find expression::
 | |
| 
 | |
|     <(h[12]) [^<>]* id=['"]([^'"]+)['"][^<>]*>([^<>]+)
 | |
| 
 | |
| Run the search on :guilabel:`All text files` and at the end of the search, a
 | |
| window will popup with "Debug output from your function" which will have the
 | |
| HTML Table of Contents, ready to be pasted into :file:`toc.html`.
 | |
| 
 | |
| The function above is heavily commented, so it should be easy to follow. The
 | |
| key new feature is the use of another useful extra argument to the
 | |
| ``replace()`` function, the ``data`` object. The ``data`` object is a Python
 | |
| *dictionary* that persists between all successive invocations of ``replace()`` during
 | |
| a single :guilabel:`Replace All` operation.
 | |
| 
 | |
| Another new feature is the use of ``call_after_last_match`` -- setting that to
 | |
| ``True`` on the ``replace()`` function means that the editor will call
 | |
| ``replace()`` one extra time after all matches have been found. For this extra
 | |
| call, the match object will be ``None``.
 | |
| 
 | |
| This was just a demonstration to show you the power of function mode,
 | |
| if you really needed to generate a Table of Contents from headings in your book,
 | |
| you would be better off using the dedicated Table of Contents tool in
 | |
| :guilabel:`Tools->Table of Contents`.
 | |
| 
 | |
| The API for the function mode
 | |
| -----------------------------
 | |
| 
 | |
| All function mode functions must be Python functions named replace, with the
 | |
| following signature::
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         return a_string
 | |
| 
 | |
| When a find/replace is run, for every match that is found, the ``replace()``
 | |
| function will be called, it must return the replacement string for that match.
 | |
| If no replacements are to be done, it should return ``match.group()`` which is
 | |
| the original string. The various arguments to the ``replace()`` function are
 | |
| documented below.
 | |
| 
 | |
| The ``match`` argument
 | |
| ^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| The ``match`` argument represents the currently found match. It is a
 | |
| `Python Match object <https://docs.python.org/library/re.html#match-objects>`_.
 | |
| Its most useful method is ``group()`` which can be used to get the matched
 | |
| text corresponding to individual capture groups in the search regular
 | |
| expression.
 | |
| 
 | |
| The ``number`` argument
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| The ``number`` argument is the number of the current match. When you run
 | |
| :guilabel:`Replace All`, every successive match will cause ``replace()`` to be
 | |
| called with an increasing number. The first match has number 1.
 | |
| 
 | |
| The ``file_name`` argument
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| This is the filename of the file in which the current match was found. When
 | |
| searching inside marked text, the ``file_name`` is empty. The ``file_name`` is
 | |
| in canonical form, a path relative to the root of the book, using ``/`` as the
 | |
| path separator.
 | |
| 
 | |
| The ``metadata`` argument
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| This represents the metadata of the current book, such as title, authors,
 | |
| language, etc. It is an object of class :class:`calibre.ebooks.metadata.book.base.Metadata`.
 | |
| Useful attributes include, ``title``, ``authors`` (a list of authors) and
 | |
| ``language`` (the language code).
 | |
| 
 | |
| The ``dictionaries`` argument
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| This represents the collection of dictionaries used for spell checking the
 | |
| current book. Its most useful method is ``dictionaries.recognized(word)``
 | |
| which will return ``True`` if the passed in word is recognized by the dictionary
 | |
| for the current book's language.
 | |
| 
 | |
| The ``data`` argument
 | |
| ^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| This a simple Python ``dictionary``. When you run
 | |
| :guilabel:`Replace all`, every successive match will cause ``replace()`` to be
 | |
| called with the same ``dictionary`` as data. You can thus use it to store arbitrary
 | |
| data between invocations of ``replace()`` during a :guilabel:`Replace all`
 | |
| operation.
 | |
| 
 | |
| The ``functions`` argument
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| The ``functions`` argument gives you access to all other user defined
 | |
| functions. This is useful for code re-use. You can define utility functions in
 | |
| one place and re-use them in all your other functions. For example, suppose you
 | |
| create a function name ``My Function`` like this:
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     def utility():
 | |
|        # do something
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         ...
 | |
| 
 | |
| Then, in another function, you can access the ``utility()`` function like this:
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         utility = functions['My Function']['utility']
 | |
|         ...
 | |
| 
 | |
| You can also use the functions object to store persistent data, that can be
 | |
| re-used by other functions. For example, you could have one function that when
 | |
| run with :guilabel:`Replace All` collects some data and another function that
 | |
| uses it when it is run afterwards. Consider the following two functions:
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     # Function One
 | |
|     persistent_data = {}
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         ...
 | |
|         persistent_data['something'] = 'some data'
 | |
| 
 | |
|     # Function Two
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         persistent_data = functions['Function One']['persistent_data']
 | |
|         ...
 | |
| 
 | |
| Debugging your functions
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| You can debug the functions you create by using the standard ``print()``
 | |
| function from Python. The output of print will be displayed in a popup window
 | |
| after the Find/replace has completed. You saw an example of using ``print()``
 | |
| to output an entire table of contents above.
 | |
| 
 | |
| .. _file_order_replace_all:
 | |
| 
 | |
| Choose file order when running on multiple HTML files
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| When you run a :guilabel:`Replace all` on multiple HTML files, the order in
 | |
| which the files are processes depends on what files you have open for editing.
 | |
| You can force the search to process files in the order in which the appear by
 | |
| setting the ``file_order`` attribute on your function, like this:
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         ...
 | |
| 
 | |
|     replace.file_order = 'spine'
 | |
| 
 | |
| ``file_order`` accepts two values, ``spine`` and ``spine-reverse`` which cause
 | |
| the search to process multiple files in the order they appear in the book,
 | |
| either forwards or backwards, respectively.
 | |
| 
 | |
| Having your function called an extra time after the last match is found
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| Sometimes, as in the auto generate table of contents example above, it is
 | |
| useful to have your function called an extra time after the last match is
 | |
| found. You can do this by setting the ``call_after_last_match`` attribute on your
 | |
| function, like this:
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         ...
 | |
| 
 | |
|     replace.call_after_last_match = True
 | |
| 
 | |
| 
 | |
| Appending the output from the function to marked text
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| When running search and replace on marked text, it is sometimes useful to
 | |
| append so text to the end of the marked text. You can do that by setting
 | |
| the ``append_final_output_to_marked`` attribute on your function (note that you
 | |
| also need to set ``call_after_last_match``), like this:
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         ...
 | |
|         return 'some text to append'
 | |
| 
 | |
|     replace.call_after_last_match = True
 | |
|     replace.append_final_output_to_marked = True
 | |
| 
 | |
| Suppressing the result dialog when performing searches on marked text
 | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | |
| 
 | |
| You can also suppress the result dialog (which can slow down the repeated
 | |
| application of a search/replace on many blocks of text) by setting
 | |
| the ``suppress_result_dialog`` attribute on your function, like this:
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|     def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
 | |
|         ...
 | |
| 
 | |
|     replace.suppress_result_dialog = True
 | |
| 
 | |
| 
 | |
| More examples
 | |
| ----------------
 | |
| 
 | |
| More useful examples, contributed by calibre users, can be found in the
 | |
| `calibre E-book editor forum <https://www.mobileread.com/forums/showthread.php?t=237181>`_.
 |