This commit is contained in:
Kovid Goyal 2011-01-18 17:34:45 -07:00
parent 4ae546bb60
commit 2c8218bc36
4 changed files with 9 additions and 8 deletions

View File

@ -7,7 +7,7 @@ __docformat__ = 'restructuredtext en'
import functools, re
from calibre import entity_to_unicode
from calibre import entity_to_unicode, as_unicode
XMLDECL_RE = re.compile(r'^\s*<[?]xml.*?[?]>')
SVG_NS = 'http://www.w3.org/2000/svg'
@ -463,7 +463,8 @@ class HTMLPreProcessor(object):
replace_txt = ''
rules.insert(0, (search_re, replace_txt))
except Exception as e:
self.log.error('Failed to parse %s regexp because %s' % (search, e))
self.log.error('Failed to parse %r regexp because %s' %
(search, as_unicode(e)))
end_rules = []
# delete soft hyphens - moved here so it's executed after header/footer removal

View File

@ -12,7 +12,7 @@ from calibre.gui2 import error_dialog
class SearchAndReplaceWidget(Widget, Ui_Form):
TITLE = _('Search &\nReplace')
TITLE = _(u'Search\u00a0&\nReplace')
HELP = _('Modify the document text and structure using user defined patterns.')
COMMIT_NAME = 'search_and_replace'
ICON = I('search.png')

View File

@ -100,7 +100,7 @@
</size>
</property>
<property name="spacing">
<number>20</number>
<number>10</number>
</property>
<property name="wordWrap">
<bool>true</bool>
@ -129,8 +129,8 @@
<rect>
<x>0</x>
<y>0</y>
<width>805</width>
<height>484</height>
<width>810</width>
<height>494</height>
</rect>
</property>
<layout class="QVBoxLayout" name="verticalLayout_3">

View File

@ -94,7 +94,7 @@ I think I'm beginning to understand these regular expressions now... how do I us
Conversions
^^^^^^^^^^^^^^
Let's begin with the conversion settings, which is really neat. In the structure detection part, you can input a regexp (short for regular expression) that describes the header or footer string that will be removed during the conversion. The neat part is the wizard. Click on the wizard staff and you get a preview of what |app| "sees" during the conversion process. Scroll down to the header or footer you want to remove, select and copy it, paste it into the regexp field on top of the window. If there are variable parts, like page numbers or so, use sets and quantifiers to cover those, and while you're at it, remember to escape special characters, if there are some. Hit the button labeled :guilabel:`Test` and |app| highlights the parts it would remove were you to use the regexp. Once you're satisfied, hit OK and convert. Be careful if your conversion source has tags like this example::
Let's begin with the conversion settings, which is really neat. In the Search and Replace part, you can input a regexp (short for regular expression) that describes the string that will be replaced during the conversion. The neat part is the wizard. Click on the wizard staff and you get a preview of what |app| "sees" during the conversion process. Scroll down to the string you want to remove, select and copy it, paste it into the regexp field on top of the window. If there are variable parts, like page numbers or so, use sets and quantifiers to cover those, and while you're at it, remember to escape special characters, if there are some. Hit the button labeled :guilabel:`Test` and |app| highlights the parts it would replace were you to use the regexp. Once you're satisfied, hit OK and convert. Be careful if your conversion source has tags like this example::
Maybe, but the cops feel like you do, Anita. What's one more dead vampire?
New laws don't change that. </p>
@ -104,7 +104,7 @@ Let's begin with the conversion settings, which is really neat. In the structure
<p class="calibre4"> It had only been two years since Addison v. Clark.
The court case gave us a revised version of what life was
(shamelessly ripped out of `this thread <http://www.mobileread.com/forums/showthread.php?t=75594">`_). You'd have to remove some of the tags as well. In this example, I'd recommend beginning with the tag ``<b class="calibre2">``, now you have to end with the corresponding closing tag (opening tags are ``<tag>``, closing tags are ``</tag>``), which is simply the next ``</b>`` in this case. (Refer to a good HTML manual or ask in the forum if you are unclear on this point.) The opening tag can be described using ``<b.*?>``, the closing tag using ``</b>``, thus we could remove everything between those tags using ``<b.*?>.*?</b>``. But using this expression would be a bad idea, because it removes everything enclosed by <b>- tags (which, by the way, render the enclosed text in bold print), and it's a fair bet that we'll remove portions of the book in this way. Instead, include the beginning of the enclosed string as well, making the regular expression ``<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>`` The ``\s`` with quantifiers are included here instead of explicitly using the spaces as seen in the string to catch any variations of the string that might occur. Remember to check what |app| will remove to make sure you don't remove any portions you want to keep if you test a new expression. If you only check one occurrence, you might miss a mismatch somewhere else in the text. Also note that should you accidentally remove more or fewer tags than you actually wanted to, |app| tries to repair the damaged code after doing the header/footer removal.
(shamelessly ripped out of `this thread <http://www.mobileread.com/forums/showthread.php?t=75594">`_). You'd have to remove some of the tags as well. In this example, I'd recommend beginning with the tag ``<b class="calibre2">``, now you have to end with the corresponding closing tag (opening tags are ``<tag>``, closing tags are ``</tag>``), which is simply the next ``</b>`` in this case. (Refer to a good HTML manual or ask in the forum if you are unclear on this point.) The opening tag can be described using ``<b.*?>``, the closing tag using ``</b>``, thus we could remove everything between those tags using ``<b.*?>.*?</b>``. But using this expression would be a bad idea, because it removes everything enclosed by <b>- tags (which, by the way, render the enclosed text in bold print), and it's a fair bet that we'll remove portions of the book in this way. Instead, include the beginning of the enclosed string as well, making the regular expression ``<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>`` The ``\s`` with quantifiers are included here instead of explicitly using the spaces as seen in the string to catch any variations of the string that might occur. Remember to check what |app| will remove to make sure you don't remove any portions you want to keep if you test a new expression. If you only check one occurrence, you might miss a mismatch somewhere else in the text. Also note that should you accidentally remove more or fewer tags than you actually wanted to, |app| tries to repair the damaged code after doing the removal.
Adding books
^^^^^^^^^^^^^^^^