Kovid Goyal
f5d56958b8
Start work on stemming for the ICU tokenizer
2021-06-20 14:43:24 +05:30
Kovid Goyal
5565c3395e
...
2021-06-20 12:39:12 +05:30
Kovid Goyal
8457379487
Add libstemmer as a dependency
Will be used for tokenizing in the new ICU based FTS tokenizer
2021-06-20 12:38:04 +05:30
Kovid Goyal
6c3f1ebb5f
Fix #1933004 [TheAtlantic.com recipe not downloading article text](https://bugs.launchpad.net/calibre/+bug/1933004)
2021-06-20 09:09:15 +05:30
Kovid Goyal
755b58d1f5
Test tokenization with different UI languages
2021-06-19 15:16:48 +05:30
Kovid Goyal
6f7766fbf4
Another script block tokenizer test
2021-06-19 15:00:07 +05:30
Kovid Goyal
53168e075e
Alias for test name option
2021-06-19 14:59:48 +05:30
Kovid Goyal
c75f20a875
Don't repeatedly look up the word iterator
2021-06-19 14:47:02 +05:30
Kovid Goyal
a547ffd26e
Fix script block loop
use the correct language based iterator and also update the start of the
block correctly
2021-06-19 14:27:54 +05:30
Kovid Goyal
fafacae005
Merge branch 'master' of https://github.com/cbhaley/calibre
Fixes #1932984 [AttributeError on 'Add-subcategory to <main-category>'](https://bugs.launchpad.net/calibre/+bug/1932984)
2021-06-19 13:59:30 +05:30
Kovid Goyal
6f7454f1ad
Ensure text fed to the FTS engine is in NFKC form
2021-06-19 13:58:28 +05:30
Charles Haley
7e49b481e9
Bug 1932984: AttributeError on 'Add-subcategory to <main-category>'
2021-06-19 09:17:09 +01:00
Kovid Goyal
52a87af143
Bounds check access to byte_offsets
2021-06-19 13:34:29 +05:30
Kovid Goyal
d9c0da9ec3
...
2021-06-19 13:13:03 +05:30
Kovid Goyal
6e62ccab38
Forgot to test boolean operators in queries
2021-06-19 11:50:46 +05:30
Kovid Goyal
e0dad27caa
tests for fts query syntax
2021-06-19 11:47:52 +05:30
Kovid Goyal
310a1a7d2e
Add FTS tokenizer tests with Chinese
2021-06-19 10:54:34 +05:30
Kovid Goyal
ef78b19912
Also hold global lock when constructing a tokenizer and setting its current_ui_language
2021-06-18 21:40:14 +05:30
Kovid Goyal
d9b773bd19
Ensure tokenizer tests are run with a fixed UI language
2021-06-18 21:38:15 +05:30
Kovid Goyal
c86f439e64
...
2021-06-18 21:16:59 +05:30
Kovid Goyal
6ef1ec1656
Add currency and other symbols to allowed token characters
2021-06-18 21:04:31 +05:30
Kovid Goyal
2cf31be2ba
Use ICU Word BreakIterator for tokenization
2021-06-18 18:06:15 +05:30
Kovid Goyal
879262929e
Merge branch 'master' of https://github.com/MorganSeltzer000/calibre
E-book viewer: Fix scrolling backwards by screen-fulls not working
with very large page margins.
2021-06-18 07:50:31 +05:30
Kovid Goyal
febc066142
A function to ensure lang specific iterators
2021-06-18 07:43:10 +05:30
Morgan Seltzer
501d6d0cf2
Fixed Pageup Occasionally Failing
Before, pageup failed when the page margins were greater than half the
screen width, because previous_screen_location() went backward by
screen_inline, which did not account for the margins but worked most of
the time due to later rounding. Now this has been fixed.
Signed-off-by: Morgan Seltzer <MorganSeltzer000@gmail.com>
2021-06-17 12:42:18 -05:00
Kovid Goyal
87b85cac39
Start work on ICU word break iterator based tokenization
2021-06-17 15:56:12 +05:30
Kovid Goyal
0cb9637e8c
...
2021-06-17 14:38:00 +05:30
Kovid Goyal
d818bc17b8
...
2021-06-17 12:12:59 +05:30
Kovid Goyal
6302937c4f
Allow directly testing the tokenizer
2021-06-17 12:10:24 +05:30
Kovid Goyal
4127117e8a
Add a UI language based iterator
2021-06-17 09:53:02 +05:30
Kovid Goyal
06d34a2df9
Add a test for snippets
2021-06-17 08:31:16 +05:30
Kovid Goyal
53b8bed17a
Function to get available locales for break iteration
2021-06-17 07:25:15 +05:30
Kovid Goyal
f138d716a5
Merge branch 'python3.10' of https://github.com/swt2c/calibre
2021-06-17 06:16:25 +05:30
Scott Talbert
2e272a39d0
Fix building with Python 3.10
2021-06-16 14:19:40 -04:00
Kovid Goyal
6773b36a42
Forgot to add header to extension definition
2021-06-16 21:57:44 +05:30
Kovid Goyal
584eacdee4
E-book viewer: Fix font sizes specified in absolute units not being honored in locales where the decimal separator is not the period. Fixes #1932152 [The e-book viewer ignores font-size property when using some absolute length units](https://bugs.launchpad.net/calibre/+bug/1932152)
2021-06-16 21:55:51 +05:30
Kovid Goyal
12e9769b4b
Don't resize scratch unnecessarily
2021-06-16 21:40:17 +05:30
Kovid Goyal
22af8ab304
silence compiler warning
2021-06-16 21:38:32 +05:30
Kovid Goyal
9e77e2848e
...
2021-06-16 20:39:45 +05:30
Kovid Goyal
03b7feb507
Avoid ipython repeated exception when not available
2021-06-16 19:47:54 +05:30
Kovid Goyal
a37c14499c
Fix building of sqlite_extension on ancient Linux
2021-06-16 17:14:31 +05:30
Kovid Goyal
d8595e5bf5
Fix ICU build on Windows
2021-06-16 17:02:07 +05:30
Kovid Goyal
ae25a1f425
Also add test without diacritics removal
2021-06-16 16:16:03 +05:30
Kovid Goyal
bbee5b0acb
Implement diacritics removal in the new tokenizer
2021-06-16 14:54:15 +05:30
Kovid Goyal
ab313c836f
Implement the unicode61 tokenizer with ICU
Still have to implement removal of diacritics
2021-06-16 12:51:43 +05:30
Kovid Goyal
c9c1029d02
Merge branch 'hindu-patch' of https://github.com/shivaprsd/calibre
2021-06-15 17:45:04 +05:30
Shiva Prasad
b0d4c388d6
Recipe: make Hindu better resemble print edition
* Remove redundant date and timestamps cluttering every article
* Place intro immediately beneath the heading, as in print edition
|- duplicate intro is now removed using CSS
* Visual styling
2021-06-15 17:24:59 +05:30
Kovid Goyal
adf810cae6
Parse tokenizer options
2021-06-15 13:12:24 +05:30
Kovid Goyal
79ea88ddb8
Basic test for the tokenizer
2021-06-15 13:07:18 +05:30
Kovid Goyal
c819fcb870
A simple ASCII tokenizer to start with
2021-06-15 11:44:31 +05:30