|
|
|
@@ -46,10 +46,10 @@ Next is the beginning of the really good stuff. Remember where I said that regul
|
|
|
|
|
Hey, neat! This is starting to make sense!
|
|
|
|
|
---------------------------------------------
|
|
|
|
|
|
|
|
|
|
I was hoping you'd say that. But brace yourself, now it gets even better! We just saw that using sets, we could match one of several characters at once. But you can even repeat a character or set, reducing the number of expressions needed to handle the above page number example to one. Yes, ONE! Excited? You should be! It works like this: Some so-called special characters, "+", "?" and "*", *repeat the single element preceding them*. (Element means either a single character, a character set, an escape sequence or a group (we'll learn about those last two later)- in short, any single entity in a regular expression.) These characters are called wildcards or quantifiers. To be more precise, "?" matches *0 or 1* of the preceding element, "*" matches *0 or more* of the preceding element and "+" matches *1 or more* of the preceding element. A few examples: The expression ``a?`` would match either "" (which is the empty string, not strictly useful in this case) or "a", the expression ``a*`` would match "", "a", "aa" or any number of a's in a row, and, finally, the expression ``a+`` would match "a", "aa" or any number of a's in a row (Note: it wouldn't match the empty string!). Same deal for sets: The expression ``[0-9]+`` would match *every integer number there is*! I know what you're thinking, and you're right: If you use that in the above case of matching page numbers, wouldn't that be the single one expression to match all the page numbers? Yes, the expression ``Page [0-9]+ of 423`` would match every page number in that book!
|
|
|
|
|
I was hoping you'd say that. But brace yourself, now it gets even better! We just saw that using sets, we could match one of several characters at once. But you can even repeat a character or set, reducing the number of expressions needed to handle the above page number example to one. Yes, ONE! Excited? You should be! It works like this: Some so-called special characters, "+", "?" and "*", *repeat the single element preceding them*. (Element means either a single character, a character set, an escape sequence or a group (we'll learn about those last two later)- in short, any single entity in a regular expression). These characters are called wildcards or quantifiers. To be more precise, "?" matches *0 or 1* of the preceding element, "*" matches *0 or more* of the preceding element and "+" matches *1 or more* of the preceding element. A few examples: The expression ``a?`` would match either "" (which is the empty string, not strictly useful in this case) or "a", the expression ``a*`` would match "", "a", "aa" or any number of a's in a row, and, finally, the expression ``a+`` would match "a", "aa" or any number of a's in a row (Note: it wouldn't match the empty string!). Same deal for sets: The expression ``[0-9]+`` would match *every integer number there is*! I know what you're thinking, and you're right: If you use that in the above case of matching page numbers, wouldn't that be the single one expression to match all the page numbers? Yes, the expression ``Page [0-9]+ of 423`` would match every page number in that book!
|
|
|
|
|
|
|
|
|
|
.. note::
|
|
|
|
|
A note on these quantifiers: They generally try to match as much text as possible, so be careful when using them. This is called "greedy behaviour"- I'm sure you get why. It gets problematic when you, say, try to match a tag. Consider, for example, the string ``"<p class="calibre2">Title here</p>"`` and let's say you'd want to match the opening tag (the part between the first pair of angle brackets, a little more on tags later). You'd think that the expression ``<p.*>`` would match that tag, but actually, it matches the whole string! (The character "." is another special character. It matches anything *except* linebreaks, so, basically, the expression ``.*`` would match any single line you can think of.) Instead, try using ``<p.*?>`` which makes the quantifier ``"*"`` non-greedy. That expression would only match the first opening tag, as intended.
|
|
|
|
|
A note on these quantifiers: They generally try to match as much text as possible, so be careful when using them. This is called "greedy behaviour"- I'm sure you get why. It gets problematic when you, say, try to match a tag. Consider, for example, the string ``"<p class="calibre2">Title here</p>"`` and let's say you'd want to match the opening tag (the part between the first pair of angle brackets, a little more on tags later). You'd think that the expression ``<p.*>`` would match that tag, but actually, it matches the whole string! (The character "." is another special character. It matches anything *except* linebreaks, so, basically, the expression ``.*`` would match any single line you can think of). Instead, try using ``<p.*?>`` which makes the quantifier ``"*"`` non-greedy. That expression would only match the first opening tag, as intended.
|
|
|
|
|
There's actually another way to accomplish this: The expression ``<p[^>]*>`` will match that same opening tag- you'll see why after the next section. Just note that there quite frequently is more than one way to write a regular expression.
|
|
|
|
|
|
|
|
|
|
Well, these special characters are very neat and all, but what if I wanted to match a dot or a question mark?
|
|
|
|
@@ -113,7 +113,7 @@ Let's begin with the conversion settings, which is really neat. In the :guilabel
|
|
|
|
|
<p class="calibre4"> It had only been two years since Addison v. Clark.
|
|
|
|
|
The court case gave us a revised version of what life was
|
|
|
|
|
|
|
|
|
|
(shamelessly ripped out of `this thread <https://www.mobileread.com/forums/showthread.php?t=75594">`_). You'd have to remove some of the tags as well. In this example, I'd recommend beginning with the tag ``<b class="calibre2">``, now you have to end with the corresponding closing tag (opening tags are ``<tag>``, closing tags are ``</tag>``), which is simply the next ``</b>`` in this case. (Refer to a good HTML manual or ask in the forum if you are unclear on this point.) The opening tag can be described using ``<b.*?>``, the closing tag using ``</b>``, thus we could remove everything between those tags using ``<b.*?>.*?</b>``. But using this expression would be a bad idea, because it removes everything enclosed by <b>- tags (which, by the way, render the enclosed text in bold print), and it's a fair bet that we'll remove portions of the book in this way. Instead, include the beginning of the enclosed string as well, making the regular expression ``<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>`` The ``\s`` with quantifiers are included here instead of explicitly using the spaces as seen in the string to catch any variations of the string that might occur. Remember to check what calibre will remove to make sure you don't remove any portions you want to keep if you test a new expression. If you only check one occurrence, you might miss a mismatch somewhere else in the text. Also note that should you accidentally remove more or fewer tags than you actually wanted to, calibre tries to repair the damaged code after doing the removal.
|
|
|
|
|
(shamelessly ripped out of `this thread <https://www.mobileread.com/forums/showthread.php?t=75594">`_). You'd have to remove some of the tags as well. In this example, I'd recommend beginning with the tag ``<b class="calibre2">``, now you have to end with the corresponding closing tag (opening tags are ``<tag>``, closing tags are ``</tag>``), which is simply the next ``</b>`` in this case. (Refer to a good HTML manual or ask in the forum if you are unclear on this point). The opening tag can be described using ``<b.*?>``, the closing tag using ``</b>``, thus we could remove everything between those tags using ``<b.*?>.*?</b>``. But using this expression would be a bad idea, because it removes everything enclosed by <b>- tags (which, by the way, render the enclosed text in bold print), and it's a fair bet that we'll remove portions of the book in this way. Instead, include the beginning of the enclosed string as well, making the regular expression ``<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>`` The ``\s`` with quantifiers are included here instead of explicitly using the spaces as seen in the string to catch any variations of the string that might occur. Remember to check what calibre will remove to make sure you don't remove any portions you want to keep if you test a new expression. If you only check one occurrence, you might miss a mismatch somewhere else in the text. Also note that should you accidentally remove more or fewer tags than you actually wanted to, calibre tries to repair the damaged code after doing the removal.
|
|
|
|
|
|
|
|
|
|
Adding books
|
|
|
|
|
^^^^^^^^^^^^^^^^
|
|
|
|
@@ -128,7 +128,7 @@ The last part is regular expression :guilabel:`Search and replace` in metadata f
|
|
|
|
|
|
|
|
|
|
Well, that just about concludes the very short introduction to regular expressions. Hopefully I'll have shown you enough to at least get you started and to enable you to continue learning by yourself- a good starting point would be the `Python documentation for regexps <https://docs.python.org/library/re.html>`_.
|
|
|
|
|
|
|
|
|
|
One last word of warning, though: Regexps are powerful, but also really easy to get wrong. calibre provides really great testing possibilities to see if your expressions behave as you expect them to. Use them. Try not to shoot yourself in the foot. (God, I love that expression...) But should you, despite the warning, injure your foot (or any other body parts), try to learn from it.
|
|
|
|
|
One last word of warning, though: Regexps are powerful, but also really easy to get wrong. calibre provides really great testing possibilities to see if your expressions behave as you expect them to. Use them. Try not to shoot yourself in the foot. (God, I love that expression...). But should you, despite the warning, injure your foot (or any other body parts), try to learn from it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Quick reference
|
|
|
|
|