Adds a new recipe for Cenital, an Argentine news outlet publishing 8 weekly
newsletters covering politics, economics, international affairs, and culture.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This is invalid, but there apparently exist some books in the wild that
use it. Sigh. See #2146609 (*LOTS* of undesired splits on EPUB to AZW3 conversion)
- Add hyphen_is_extra_break flag to icu_BreakIterator struct
- Set flag at creation time by checking if any extra break char is
a hyphen (0x2d or 0x2010) via IS_HYPHEN_CHAR
- Move IS_HYPHEN_CHAR macro before struct definition so it's usable
in the constructor
- Guard all hyphen-joining logic (leading_hyphen, trailing_hyphen,
is_hyphen_sep) and sub-segment trailing-hyphen detection behind
!bi->hyphen_is_extra_break check
- Add test: BreakIterator with '-' extra break splits 'out-of-the-box'
into ['out', 'of', 'the', 'box']
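
The effect of the flag can be illustrated with a small pure-Python model. This is only a sketch of the behaviour described above, not the C implementation: `split_words` and its character-by-character tokenizer are hypothetical stand-ins for ICU word segmentation, and only the two hyphen code points (0x2d, 0x2010) and the expected 'out-of-the-box' result come from this commit.

```python
# Toy model (assumption: NOT calibre's C code) of guarding hyphen-joining
# behind a hyphen_is_extra_break flag.
HYPHENS = {"\u002d", "\u2010"}  # the code points named for IS_HYPHEN_CHAR


def split_words(text, extra_break_chars=None):
    """Split text into words; hyphens join words unless a hyphen is an extra break."""
    extra = set(extra_break_chars or "")
    # Computed once, up front, like the flag set at creation time.
    hyphen_is_extra_break = any(ch in HYPHENS for ch in extra)
    words = []
    current = ""

    def flush():
        nonlocal current
        # Drop leading/trailing hyphens that never joined two words.
        word = current.strip("".join(HYPHENS))
        if word:
            words.append(word)
        current = ""

    for ch in text:
        if ch in extra:
            flush()  # an extra break char always ends the current word
        elif ch.isalnum():
            current += ch
        elif ch in HYPHENS and not hyphen_is_extra_break and current:
            current += ch  # hyphen-joining, only while the flag is unset
        else:
            flush()
    flush()
    return words
```

With no extra break characters, `split_words('out-of-the-box')` returns `['out-of-the-box']`; with `'-'` as an extra break character it returns `['out', 'of', 'the', 'box']`.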
Co-authored-by: kovidgoyal <1308621+kovidgoyal@users.noreply.github.com>
Agent-Logs-Url: https://github.com/kovidgoyal/calibre/sessions/b439270b-8a40-4b51-96f2-8f869de7983d
- Add optional extra_word_break_chars field to the icu_BreakIterator struct,
stored as a sorted UChar32[] for efficient lookup
- icu_BreakIterator_new accepts optional 3rd argument (Python str) that is
parsed into a sorted UChar32[] via insertion sort; only applies to UBRK_WORD
- icu_BreakIterator_dealloc frees the extra chars array
- New find_extra_word_break() inline helper scans a UTF-16 segment for the
first matching extra-break codepoint using U16_NEXT + linear search
- BreakIterState gains extra_break_active/seg_start/seg_end sub-segmentation
state fields (zero-initialized by memset in break_iter_state_init)
- break_iter_state_next refactored from a while loop to for(;;) so that pending
sub-segments are drained before more ICU data is fetched. An extra break inside
an ICU word segment sends the piece before it through the normal hyphen-joining
logic and defers the tail. Trailing-hyphen detection on sub-segments lets them
be hyphen-joined with subsequent ICU segments
- Fast path: when num_extra_word_break_chars == 0 the only added cost is a
single comparison
- Tests added covering: letter extra break char, count_words/split2, adjacent
breaks, multiple chars, None arg, surrogate-pair extra break char
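
The parsing, lookup and sub-segmentation steps listed above can be sketched in Python. This is a rough model under stated assumptions, not the C code: the function names other than `find_extra_word_break` are hypothetical, the `(index, width)` return shape is invented for illustration, and the sketch assumes that the break character itself is dropped and that empty pieces between adjacent breaks are skipped.

```python
def sorted_code_points(chars):
    """Insertion-sort the code points of a str (or None) into ascending order."""
    result = []
    for ch in chars or "":
        cp = ord(ch)
        i = len(result)
        result.append(cp)
        # Shift larger entries one slot right until cp's position is found.
        while i > 0 and result[i - 1] > cp:
            result[i] = result[i - 1]
            i -= 1
        result[i] = cp
    return result


def utf16_units(text):
    """Return the UTF-16 code units of text, as the C code would see them."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def find_extra_word_break(units, start, end, extra):
    """Return (index, width) of the first extra-break code point, or None."""
    if not extra:
        return None  # fast path: no extra break chars configured
    i = start
    while i < end:
        cp, width = units[i], 1
        # Combine a surrogate pair into one code point, as U16_NEXT does.
        if 0xD800 <= cp <= 0xDBFF and i + 1 < end and 0xDC00 <= units[i + 1] <= 0xDFFF:
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        if cp in extra:  # linear search over the small sorted list
            return i, width
        i += width
    return None


def split_segment(units, start, end, extra):
    """Yield (start, end) sub-segments of one word segment, cut at extra breaks."""
    while start < end:
        hit = find_extra_word_break(units, start, end, extra)
        if hit is None:
            yield start, end  # the remaining tail has no further breaks
            return
        pos, width = hit
        if pos > start:
            yield start, pos  # piece before the break
        start = pos + width  # the tail is handled on the next iteration
```

For example, `list(split_segment(utf16_units('ab,,cd'), 0, 6, sorted_code_points(',')))` yields `[(0, 2), (4, 6)]` under these assumptions.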
Co-authored-by: kovidgoyal <1308621+kovidgoyal@users.noreply.github.com>
Agent-Logs-Url: https://github.com/kovidgoyal/calibre/sessions/c003ae42-1e56-4dbb-9ef2-9f1645b76c70