Use the more powerful regex engine for book list and metadata from filename

Fixes #1893371 [regex [[\w--[A-Z]] does not work](https://bugs.launchpad.net/calibre/+bug/1893371)
Kovid Goyal 2020-08-29 10:06:06 +05:30
parent 1b5bfd9078
commit 9f1e1e5a18
GPG Key ID: 06BC317B515ACE7C
5 changed files with 32 additions and 24 deletions

@@ -5,6 +5,14 @@
 # for important features/bug fixes.
 # Also, each release can have new and improved recipes.
+# changes for 5
+# viewer supports annotations
+# viewer works with RTL and vertical text
+# python upgraded to python 3 link to list of not ported plugins
+# regex engine used for searching book list and metadata from file names made more powerful
+# dark mode support in the content server and viewer UIs
+# content server viewer can now browse and create bookmarks
 - version: 4.23.0
 date: 2020-08-21

@@ -22,7 +22,9 @@ There are a few places calibre uses regular expressions. There's the
 :guilabel:`Search & replace` in conversion options, metadata detection from filenames in the import
 settings and Search & replace when editing the metadata of books in bulk. The
 calibre book editor can also use regular expressions in its search and replace
-feature.
+feature. Finally, you can use regular expressions when searching the calibre
+book list and when searching inside the calibre viewer.
 What on earth *is* a regular expression?
 ------------------------------------------------

@@ -2,9 +2,7 @@ Quick reference for regexp syntax
 =================================================
 This checklist summarizes the most commonly used/hard to remember parts of the
-regexp engine available in the calibre edit and conversion search/replace
-features. Note that this engine is more powerful than the basic regexp engine
-used throughout the rest of calibre.
+regexp engine available in most parts of calibre.
 .. contents:: Contents
 :depth: 2
@@ -173,25 +171,25 @@ character. The most useful anchors for text processing are:
 Groups
 ------
 ``(expression)``
 Capturing group, which stores the selection and can be recalled later
 in the *search* or *replace* patterns with ``\n``, where ``n`` is the
 sequence number of the capturing group (starting at 1 in reading order)
 ``(?:expression)``
 Group that does not capture the selection
 ``(?>expression)``
 Atomic Group: As soon as the expression is satisfied, the regexp engine
 passes, and if the rest of the pattern fails, it will not backtrack to
 try other combinations with the expression. Atomic groups do not
 capture.
 ``(?|expression)``
 Branch reset group: the branches of the alternations included in the
 expression share the same group numbers
 ``(?<name>expression)``
 Group named “name”. The selection can be recalled later in the *search*
 pattern by ``(?P=name)`` and in the *replace* by ``\g<name>``. Two
 different groups can use the same name.
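The named-group entry in the hunk above can be illustrated with a minimal sketch. It uses Python's standard `re` module, which only accepts the `(?P<name>...)` spelling; the `(?<name>...)` spelling listed in the reference needs the third-party `regex` module. The helper name `collapse_repeats` and the sample text are hypothetical and not part of calibre.

```python
import re

# (?P<word>...) names the group; (?P=word) recalls it inside the search
# pattern and \g<word> recalls it inside the replacement.
DOUBLED = re.compile(r'\b(?P<word>\w+) (?P=word)\b')


def collapse_repeats(text):
    """Replace an immediately repeated word with a single copy (illustrative helper)."""
    return DOUBLED.sub(r'\g<word>', text)


print(collapse_repeats('the the cat sat sat down'))
```

The same recall syntax applies in a search/replace dialog: the search field holds the pattern and the replace field holds `\g<word>`.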
@@ -220,7 +218,7 @@ Lookarounds
 Lookaheads and lookbehinds do not consume characters, they are zero length and
 do not capture. They are atomic groups: as soon as the assertion is satisfied,
 the regexp engine passes, and if the rest of the pattern fails, it will not
 backtrack inside the lookaround to try other combinations.
 When looking for multiple matches in a string, at the starting position of each
 match attempt, a lookbehind can inspect the characters before the current
@@ -230,7 +228,7 @@ only select 2, because the starting position after the first selection is
 immediately before 3, and there are not enough digits for a second match.
 Similarly, ``\d(\d)`` only captures 2. In calibre's regexp engine practice, the
 positive lookbehind behaves in the same way, and selects only 2, contrary to
 theory.
 Groups can be placed inside lookarounds, but capture is rarely useful.
 Nevertheless, if it is useful, it will be necessary to be very careful in the
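The difference between a consuming pattern and a zero-length lookahead, as discussed in the hunk above, can be sketched as follows. The sample string `123` is an assumption inferred from the digits mentioned in the text, since the hunk does not show the start of the paragraph. The sketch uses the standard `re` module rather than calibre's engine, and it does not cover the lookbehind case the documentation describes.

```python
import re

text = '123'  # assumed sample string

# A consuming pattern: after matching "12" the search resumes just before
# "3", so there are not enough digits left for a second match.
consuming = re.findall(r'\d(\d)', text)

# A lookahead consumes nothing, so the search restarts at every position
# and the capturing group inside it collects overlapping pairs.
overlapping = re.findall(r'(?=(\d\d))', text)

print(consuming, overlapping)
```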
@@ -275,7 +273,7 @@ To select a string between double quotation marks without stopping on an embedde
 “((?>[^“”]+|(?R))*[^“”]+)”
 This template can also be used to modify pairs of tags that can be
 embedded, such as ``<div>`` tags.
 Special characters
@@ -334,4 +332,3 @@ Modes
 ``(?m)``
 Makes the ``^`` and ``$`` anchors match the start and end of lines
 instead of the start and end of the entire string.
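A short sketch of the `(?m)` mode described above, using the standard `re` module (the inline flag behaves the same way in the `regex` module). The sample text is made up for illustration.

```python
import re

text = 'first line\nsecond line'

# Without (?m), ^ only matches at the very start of the string.
without_m = re.findall(r'^\w+', text)

# With (?m), ^ also matches right after each newline and $ right before one.
with_m = re.findall(r'(?m)^\w+', text)
line_ends = re.findall(r'(?m)\w+$', text)

print(without_m, with_m, line_ends)
```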

@@ -6,7 +6,7 @@ __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
-import re, weakref, operator
+import regex, weakref, operator
 from functools import partial
 from datetime import timedelta
 from collections import deque, OrderedDict
@@ -72,7 +72,8 @@ def _match(query, value, matchkind, use_primary_find_in_search=True, case_sensit
 elif query == t:
 return True
 elif matchkind == REGEXP_MATCH:
-if re.search(query, t, re.UNICODE if case_sensitive else re.I|re.UNICODE):
+flags = regex.UNICODE | regex.VERSION1 | (0 if case_sensitive else regex.IGNORECASE)
+if regex.search(query, t, flags) is not None:
 return True
 elif matchkind == CONTAINS_MATCH:
 if not case_sensitive and use_primary_find_in_search:
@@ -80,7 +81,7 @@ def _match(query, value, matchkind, use_primary_find_in_search=True, case_sensit
 return True
 elif query in t:
 return True
-except re.error:
+except regex.error:
 pass
 return False
 # }}}
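To see what the switch to `regex` with `VERSION1` buys, here is a minimal stand-alone sketch of the `REGEXP_MATCH` branch. The function name `regexp_match` is hypothetical, and the third-party `regex` package is assumed to be installed; the guard around the import only keeps the sketch loadable where it is not. The set-difference pattern is the one from the bug title.

```python
try:
    import regex  # third-party package (pip install regex); assumed available
except ImportError:
    regex = None


def regexp_match(query, text, case_sensitive=False):
    """Hypothetical stand-alone version of the REGEXP_MATCH branch shown above."""
    flags = regex.UNICODE | regex.VERSION1 | (0 if case_sensitive else regex.IGNORECASE)
    try:
        return regex.search(query, text, flags) is not None
    except regex.error:
        # An invalid or half-typed pattern simply counts as "no match".
        return False


if regex is not None:
    # Set difference (VERSION1 only): word characters except A-Z.
    print(regexp_match(r'^[\w--[A-Z]]+$', 'abc123', case_sensitive=True))
    print(regexp_match(r'^[\w--[A-Z]]+$', 'ABC', case_sensitive=True))
    # A malformed pattern is swallowed instead of raising.
    print(regexp_match(r'(unclosed', 'anything'))
```

Swallowing `regex.error` mirrors the hunk above, where an invalid pattern falls through to `return False`.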
@@ -100,7 +101,7 @@ class DateSearch(object): # {{{
 self.local_today = {'_today', 'today', icu_lower(_('today'))}
 self.local_yesterday = {'_yesterday', 'yesterday', icu_lower(_('yesterday'))}
 self.local_thismonth = {'_thismonth', 'thismonth', icu_lower(_('thismonth'))}
-self.daysago_pat = re.compile(r'(%s|daysago|_daysago)$'%_('daysago'))
+self.daysago_pat = regex.compile(r'(%s|daysago|_daysago)$'%_('daysago'), flags=regex.UNICODE | regex.VERSION1)
 def eq(self, dbdate, query, field_count):
 if dbdate.year == query.year:

@@ -3,7 +3,7 @@
 __license__ = 'GPL v3'
 __copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
-import os, re, collections
+import os, regex, collections
 from calibre.utils.config import prefs
 from calibre.constants import filesystem_encoding
@@ -105,8 +105,8 @@ def _get_metadata(stream, stream_type, use_libprs_metadata,
 name = os.path.basename(getattr(stream, 'name', ''))
 # The fallback pattern matches the default filename format produced by calibre
-base = metadata_from_filename(name, pat=pattern, fallback_pat=re.compile(
-r'^(?P<title>.+) - (?P<author>[^-]+)$'))
+base = metadata_from_filename(name, pat=pattern, fallback_pat=regex.compile(
+r'^(?P<title>.+) - (?P<author>[^-]+)$', flags=regex.UNICODE | regex.VERSION1))
 if not base.authors:
 base.authors = [_('Unknown')]
 if not base.title:
@@ -133,7 +133,7 @@ def metadata_from_filename(name, pat=None, fallback_pat=None):
 name = name.rpartition('.')[0]
 mi = MetaInformation(None, None)
 if pat is None:
-pat = re.compile(prefs.get('filename_pattern'))
+pat = regex.compile(prefs.get('filename_pattern'), flags=regex.UNICODE | regex.VERSION1)
 name = name.replace('_', ' ')
 match = pat.search(name)
 if match is None and fallback_pat is not None:
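How the fallback filename pattern splits a name can be sketched in isolation. This is an illustration only: `parse_filename` is a hypothetical helper, it returns a plain tuple instead of calibre's `MetaInformation`, and it compiles the pattern with the standard `re` module (this particular pattern uses no `regex`-only syntax) so the sketch runs without extra packages.

```python
import re

# Same pattern text as the fallback in the hunk above.
FALLBACK = re.compile(r'^(?P<title>.+) - (?P<author>[^-]+)$')


def parse_filename(name):
    """Hypothetical helper: return (title, author), or None if the name does not fit."""
    # Mirror the preprocessing visible in the diff: drop the extension,
    # then turn underscores into spaces.
    stem = name.rpartition('.')[0].replace('_', ' ')
    match = FALLBACK.search(stem)
    if match is None:
        return None
    return match.group('title'), match.group('author')


print(parse_filename('The_Hobbit - J. R. R. Tolkien.epub'))
```

Because `.+` is greedy and the author part excludes hyphens, a name with several ` - ` separators is split at the last one.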