From c5003ede6c7d59a8d161c5c4b2bca4e3e37471bc Mon Sep 17 00:00:00 2001 From: Kovid Goyal Date: Sun, 21 Jun 2020 14:00:41 +0530 Subject: [PATCH] Add a note about the regex library calibre uses --- manual/regexp.rst | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/manual/regexp.rst b/manual/regexp.rst index c7046f92b3..011a8d255c 100644 --- a/manual/regexp.rst +++ b/manual/regexp.rst @@ -34,7 +34,7 @@ A regular expression is a way to describe sets of strings. A single regular expr Care to explain? -------------------- -Well, that's why we're here. First, this is the most important concept in regular expressions: *A string by itself is a regular expression that matches itself*. That is to say, if I wanted to match the string ``"Hello, World!"`` using a regular expression, the regular expression to use would be ``Hello, World!``. And yes, it really is that simple. You'll notice, though, that this *only* matches the exact string ``"Hello, World!"``, not e.g. ``"Hello, wOrld!"`` or ``"hello, world!"`` or any other such variation. +Well, that's why we're here. First, this is the most important concept in regular expressions: *A string by itself is a regular expression that matches itself*. That is to say, if I wanted to match the string ``"Hello, World!"`` using a regular expression, the regular expression to use would be ``Hello, World!``. And yes, it really is that simple. You'll notice, though, that this *only* matches the exact string ``"Hello, World!"``, not e.g. ``"Hello, wOrld!"`` or ``"hello, world!"`` or any other such variation. That doesn't sound too bad. What's next? ------------------------------------------ @@ -58,15 +58,15 @@ You can of course do that: Just put a backslash in front of any special characte So, what are the most useful sets? ------------------------------------ -Knew you'd ask. 
Some useful sets are ``[0-9]`` matching a single number, ``[a-z]`` matching a single lowercase letter, ``[A-Z]`` matching a single uppercase letter, ``[a-zA-Z]`` matching a single letter and ``[a-zA-Z0-9]`` matching a single letter or number. You can also use an escape sequence as shorthand:: +Knew you'd ask. Some useful sets are ``[0-9]`` matching a single number, ``[a-z]`` matching a single lowercase letter, ``[A-Z]`` matching a single uppercase letter, ``[a-zA-Z]`` matching a single letter and ``[a-zA-Z0-9]`` matching a single letter or number. You can also use an escape sequence as shorthand:: \d is equivalent to [0-9] \w is equivalent to [a-zA-Z0-9_] \s is equivalent to any whitespace - + .. note:: - "Whitespace" is a term for anything that won't be printed. These characters include space, tabulator, line feed, form feed and carriage return. - + "Whitespace" is a term for anything that won't be printed. These characters include space, tabulator, line feed, form feed and carriage return. + As a last note on sets, you can also define a set as any character *but* those in the set. You do that by including the character ``"^"`` as the *very first character in the set*. Thus, ``[^a]`` would match any character excluding "a". That's called complementing the set. Those escape sequence shorthands we saw earlier can also be complemented: ``"\D"`` means any non-number character, thus being equivalent to ``[^0-9]``. The other shorthands can be complemented by, you guessed it, using the respective uppercase letter instead of the lowercase one. So, going back to the example ``]*>`` from the previous section, now you can see that the character set it's using tries to match any character except for a closing angle bracket. But if I had a few varying strings I wanted to match, things get complicated? @@ -87,7 +87,7 @@ In the beginning, you said there was a way to make a regular expression case ins Yes, I did, thanks for paying attention and reminding me. 
You can tell calibre how you want certain things handled by using something called flags. You include flags in your expression by using the special construct ``(?flags go here)`` where, obviously, you'd replace "flags go here" with the specific flags you want. For ignoring case, the flag is ``i``, thus you include ``(?i)`` in your expression. Thus, ``(?i)test`` would match "Test", "tEst", "TEst" and any case variation you could think of. -Another useful flag lets the dot match any character at all, *including* the newline, the flag ``s``. If you want to use multiple flags in an expression, just put them in the same statement: ``(?is)`` would ignore case and make the dot match all. It doesn't matter which flag you state first, ``(?si)`` would be equivalent to the above. +Another useful flag lets the dot match any character at all, *including* the newline, the flag ``s``. If you want to use multiple flags in an expression, just put them in the same statement: ``(?is)`` would ignore case and make the dot match all. It doesn't matter which flag you state first, ``(?si)`` would be equivalent to the above. I think I'm beginning to understand these regular expressions now... how do I use them in calibre? ----------------------------------------------------------------------------------------------------- @@ -104,7 +104,7 @@ Let's begin with the conversion settings, which is really neat. In the :guilabel http://www.processtext.com/abclit.html

It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was - + (shamelessly ripped out of `this thread `_). You'd have to remove some of the tags as well. In this example, I'd recommend beginning with the tag ````, now you have to end with the corresponding closing tag (opening tags are ````, closing tags are ````), which is simply the next ```` in this case. (Refer to a good HTML manual or ask in the forum if you are unclear on this point.) The opening tag can be described using ````, the closing tag using ````, thus we could remove everything between those tags using ``.*?``. But using this expression would be a bad idea, because it removes everything enclosed by - tags (which, by the way, render the enclosed text in bold print), and it's a fair bet that we'll remove portions of the book in this way. Instead, include the beginning of the enclosed string as well, making the regular expression ``\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?`` The ``\s`` with quantifiers are included here instead of explicitly using the spaces as seen in the string to catch any variations of the string that might occur. Remember to check what calibre will remove to make sure you don't remove any portions you want to keep if you test a new expression. If you only check one occurrence, you might miss a mismatch somewhere else in the text. Also note that should you accidentally remove more or fewer tags than you actually wanted to, calibre tries to repair the damaged code after doing the removal. Adding books @@ -144,5 +144,7 @@ Thanks for helping with tips, corrections and such: * Starson17 * Orpheu -For more about regexps see `The Python User Manual `_. - +For more about regexps see `The Python User Manual `_. +The actual regular expression library used by calibre is: +`regex `_ which supports +several useful enhancements over the python standard library one.
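
To make the flags and character sets described in this tutorial concrete, here is a minimal sketch. It assumes Python's standard ``re`` module rather than the ``regex`` library that calibre actually uses; the few features shown here (inline flags, ``\d`` and its complement ``\D``) behave the same way in both for these inputs. The sample strings are made up for illustration and are not taken from calibre.

```python
import re

# A plain string is a regular expression that matches only itself.
exact = re.search(r"Hello, World!", "Hello, World!")
wrong_case = re.search(r"Hello, World!", "hello, world!")

# (?i) makes the match ignore case.
case_insensitive = re.compile(r"(?i)test")
matches = [s for s in ("Test", "tEst", "TEst") if case_insensitive.fullmatch(s)]

# Without (?s) the dot stops at a newline; with it, the dot matches anything.
no_dot_all = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

# Flag order does not matter: (?is) and (?si) behave the same.
both_is = re.search(r"(?is)A.B", "a\nb")
both_si = re.search(r"(?si)A.B", "a\nb")

# \d is shorthand for [0-9]; \D is its complement, [^0-9].
digits = re.findall(r"\d", "a1b22")
non_digits = re.findall(r"\D", "a1b22")
```

If you want to try the same expressions against the library calibre ships, swapping ``import re`` for ``import regex as re`` should work for this snippet, assuming the third-party ``regex`` package is installed.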