Commit Graph

53513 Commits

Author SHA1 Message Date
Hassan Raza e1cdb70dc2 Add rooted path containment helper 2026-05-23 20:02:00 +05:00
Kovid Goyal 387f1d05fa Merge branch 'count-pages-fixed-layout' of https://github.com/un-pogaz/calibre 2026-05-23 07:02:39 +05:30
Kovid Goyal 7d2f1597ea Merge branch 'fix/http-connection-header-tokens' of https://github.com/M-Hassan-Raza/calibre 2026-05-23 06:52:37 +05:30
un-pogaz 332ccea5c8 pages count: support fixed-layout 2026-05-22 20:48:01 +02:00
Hassan Raza 289b77463a Parse Connection header tokens 2026-05-22 22:58:10 +05:00
Kovid Goyal cb2b1d195f Merge branch 'fix/http-content-length-framing' of https://github.com/M-Hassan-Raza/calibre 2026-05-22 22:58:51 +05:30
Hassan Raza 74d8ab0c1b Reject invalid HTTP Content-Length framing 2026-05-22 22:21:25 +05:00
Kovid Goyal f31ee236ce Merge branch 'fix/copy-to-library-move-duplicate' of https://github.com/M-Hassan-Raza/calibre 2026-05-22 22:33:02 +05:30
Hassan Raza 648343f888 Fix ignored duplicate moves in content server 2026-05-22 12:59:03 +05:00
Kovid Goyal a5dd4a47cd Merge branch 'fix-content-server-restrictions' of https://github.com/M-Hassan-Raza/calibre 2026-05-22 07:40:28 +05:30
Hassan Raza 5a15ed3d5a Respect content server book restrictions 2026-05-21 22:21:33 +05:00
Kovid Goyal 66501a6ae7 Merge branch 'master' of https://github.com/unkn0w7n/calibre 2026-05-21 14:51:46 +05:30
unkn0w7n b8b56c3607 Update indian_express.recipe 2026-05-21 14:47:34 +05:30
unkn0w7n b7b876196e Update business_standard_print.recipe 2026-05-21 14:46:58 +05:30
Kovid Goyal 92e0132d62 Bump dependency for CVE 2026-05-20 20:38:48 +05:30
Kovid Goyal 9cd210dad0 pep8 2026-05-19 07:29:13 +05:30
Kovid Goyal 6dbc00d054 Merge branch 'ap-filter-by-publish-date' of https://github.com/claybdavis/calibre 2026-05-19 07:28:41 +05:30
Kovid Goyal 80f379147d Merge branch 'newcriterion-wp-migration' of https://github.com/claybdavis/calibre 2026-05-19 07:27:51 +05:30
Kovid Goyal d8490c2208 Merge branch 'bbc-sport-headline-block-fix' of https://github.com/claybdavis/calibre 2026-05-19 07:26:53 +05:30
Kovid Goyal 19d2488ca5 Merge branch 'bbc-drop-dead-feeds' of https://github.com/claybdavis/calibre 2026-05-19 07:25:58 +05:30
Kovid Goyal 806a3a5bfc Merge branch 'dependabot/github_actions/actions-8abaa2cbc6' of https://github.com/kovidgoyal/calibre 2026-05-19 07:24:30 +05:30
dependabot[bot] b450e0e2ca Bump github/codeql-action from 4.35.3 to 4.35.4 in the actions group
Bumps the actions group with 1 update: [github/codeql-action](https://github.com/github/codeql-action).


Updates `github/codeql-action` from 4.35.3 to 4.35.4
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](https://github.com/github/codeql-action/compare/v4.35.3...v4.35.4)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.35.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-19 01:37:27 +00:00
claybdavis 821f82b730 bbc: drop three dead/stale feed entries
Audit (2026-05-18) of the BBC News feed list found three entries that
no longer produce content:

* Special Reports (https://feeds.bbci.co.uk/news/special_reports/rss.xml)
  HTTP/2 404. Wayback's last successful capture is 2024-07-23, so the
  URL has been dead for roughly two years.
* Also in the News (https://feeds.bbci.co.uk/news/also_in_the_news/rss.xml)
  HTTP/2 404. Wayback has no successful captures of this URL at all.
* Magazine (https://feeds.bbci.co.uk/news/magazine/rss.xml)
  301-redirects to /news/stories/rss.xml which still returns 200 OK,
  but the content there has been stale since December 2022. The
  endpoint is alive; the section is abandoned.

Because the recipes set remove_empty_feeds=True these three have been
silently swallowed on every fetch, costing four wasted HTTP calls per
run (the Magazine redirect doubles up). Dropping them cleans the
active feed list without changing what readers actually receive.

bbc.recipe had all three active entries; bbc_fast.recipe only carried
Magazine. Both files patched accordingly.

The ~50 commented-out legacy feed URLs in the same block are NOT
touched here -- that is a separate cleanup.
2026-05-18 16:10:22 -05:00
claybdavis 507feb15fa bbc: emit <h1> for Sport articles (handle new headline block-type)
Sport articles fetched via recipes/bbc.recipe and recipes/bbc_fast.recipe
were shipping with no <h1>, because BBC restructured the
window.__INITIAL_DATA__ JSON. article['headline'] is now None for Sport,
and the headline lives either in a new 'headline' block-type or — for
the 'high-impact' layout — nested under a 'topper' block's
model.heading.blocks list. The previous parse_article_json loop only
branched on 'image' and 'text', so neither variant produced anything.

Fix: prefer the plain-text article['metadata']['seoHeadline'] when the
legacy article['headline'] field is empty, and as a defensive fallback
extract the headline from a 'headline' or 'topper' block via a small
extract_text_block_plaintext helper. Verified against live Sport URLs
covering both block-type variants; legacy News articles that still
populate article['headline'] are unaffected.

bbc_fast.recipe carries an identical copy of parse_article_json, so the
same patch is applied to both files.
2026-05-18 16:00:45 -05:00
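
The fallback order described in the commit above can be sketched roughly as follows. This is only an illustration, not the recipe's code: `pick_headline`, the separate `blocks` argument, and the exact nesting of the block JSON are assumptions made for the example, and the body of `extract_text_block_plaintext` is a guess at what such a helper might do.

```python
def extract_text_block_plaintext(block):
    """Join the plain text found under a block.

    Assumed shape: a leaf block carries model['text']; a container block
    carries model['blocks'], a list of further blocks.
    """
    model = block.get('model') or {}
    if isinstance(model.get('text'), str):
        return model['text']
    return ''.join(extract_text_block_plaintext(b) for b in model.get('blocks') or [])


def pick_headline(article, blocks):
    """Return the best available headline, or None if nothing usable exists."""
    # 1. Legacy field, still populated for News articles.
    if article.get('headline'):
        return article['headline']
    # 2. Plain-text SEO headline from the metadata.
    seo = (article.get('metadata') or {}).get('seoHeadline')
    if seo:
        return seo
    # 3. Defensive fallback: a 'headline' block, or a 'topper' block whose
    #    heading nests the text under model.heading.blocks.
    for block in blocks:
        kind = block.get('type')
        if kind == 'headline':
            text = extract_text_block_plaintext(block)
        elif kind == 'topper':
            heading = (block.get('model') or {}).get('heading') or {}
            text = extract_text_block_plaintext({'model': heading})
        else:
            continue
        if text.strip():
            return text.strip()
    return None
```

Keeping the legacy field first means articles that still populate `article['headline']` take the same path as before.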
claybdavis a8797f05f1 newcriterion: update parse_index and login for WordPress migration
newcriterion.com moved from October CMS to WordPress. The old recipe
looked for <div id="main"> and an /issues/YYYY/M/ URL pattern, so
parse_index crashed with AttributeError: 'NoneType' object has no
attribute 'findAll' against the new layout.

Rewrites parse_index for the new markup: issue URLs of the form
/issues/<month>-<year>/, a <div class="issue-layout"> container, and
<article class="article-display"> blocks with <h2><a> for title+URL
and <p class="post-excerpt"> for the dek.

Also ports get_browser from the old October-CMS XHR signin endpoint to
standard wp-login.php form submission, and drops the now-unused
urlencode, mechanize.Request, and re imports.
2026-05-18 14:50:10 -05:00
claybdavis bdf0679ecf ap: filter articles by article:published_time meta tag
AP has no RSS feeds and parse_index had no date logic, so the framework's
oldest_article knob was a no-op and cross-day duplicates accumulated
indefinitely.

This fix fetches each candidate article during indexing and reads
<meta property="article:published_time"> to populate both the article's
timestamp (which oldest_article actually filters on) and a formatted date
string for the TOC. Cached per-URL across the front-page walk so duplicate
links are fetched once. Articles whose published_time can't be read are
skipped with a warning rather than kept dateless.

Sets oldest_article = 1 (AP publishes constantly). The trade-off is roughly
30-60s extra wall time and a doubling of HTTP volume per run, paid for by
dropping the ~24 stale articles per consecutive-day fetch.

Same per-URL-fetch idiom as #3132 (latimes og:description).
2026-05-18 13:27:20 -05:00
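
A minimal standalone sketch of the idea in the commit above, using only the standard library. The names `published_time`, `filter_recent`, and the injected `fetch` callable are hypothetical and exist only for this example; the real recipe runs inside calibre's recipe framework and logs a warning for skipped articles, which is omitted here. Treating a timestamp without a UTC offset as UTC is also an assumption.

```python
from datetime import datetime, timedelta, timezone
from html.parser import HTMLParser


class _PublishedTimeParser(HTMLParser):
    """Remember the content of the first article:published_time meta tag."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == 'meta' and self.value is None:
            attrs = dict(attrs)
            if attrs.get('property') == 'article:published_time':
                self.value = attrs.get('content')


def published_time(html):
    """Return the publication time as an aware datetime, or None if unreadable."""
    parser = _PublishedTimeParser()
    parser.feed(html)
    if not parser.value:
        return None
    try:
        # Older Python versions do not accept a trailing 'Z' in fromisoformat.
        dt = datetime.fromisoformat(parser.value.strip().replace('Z', '+00:00'))
    except ValueError:
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt


def filter_recent(urls, fetch, now, oldest_article_days=1.0):
    """Keep (url, datetime) pairs no older than the cutoff, in first-seen order."""
    cutoff = now - timedelta(days=oldest_article_days)
    cache = {}
    for url in urls:
        if url not in cache:  # duplicate links are fetched only once
            cache[url] = published_time(fetch(url))
    # Articles with no readable timestamp are dropped rather than kept dateless.
    return [(u, dt) for u, dt in cache.items() if dt is not None and dt >= cutoff]
```

Passing `fetch` and `now` in as arguments keeps the sketch testable without network access.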
Kovid Goyal 8544091c82 Content server: Apply null metadata when serving book files. Matches behavior of save to disk. Fixes #2152879 [tags not deleted in ePub by content server](https://bugs.launchpad.net/calibre/+bug/2152879) 2026-05-18 14:27:32 +05:30
Kovid Goyal 68c567b372 Bump dependency for CVE 2026-05-16 13:32:20 +05:30
Kovid Goyal 111abb9a43 Merge branch 'propublica-drop-newsroom-blurb' of https://github.com/claybdavis/calibre 2026-05-14 11:16:44 +05:30
Kovid Goyal 23b27a71be Merge branch 'latimes-fetch-og-description' of https://github.com/claybdavis/calibre 2026-05-14 11:15:54 +05:30
Kovid Goyal af5f132bdf Merge branch 'latimes-drop-follow-link' of https://github.com/claybdavis/calibre 2026-05-14 11:15:35 +05:30
Kovid Goyal 3c800802d5 Merge branch 'latimes-narrow-date-and-fix-images' of https://github.com/claybdavis/calibre 2026-05-14 11:14:17 +05:30
Kovid Goyal b07017934c Merge branch 'wapo-print-tag-subhead-and-caption' of https://github.com/claybdavis/calibre 2026-05-14 11:13:20 +05:30
Kovid Goyal 1fbb66b58f Merge branch 'newyorker-drop-cartoon-stubs' of https://github.com/claybdavis/calibre 2026-05-14 11:10:39 +05:30
Kovid Goyal 125cdbf7e4 Merge branch 'tls-safari-ua' of https://github.com/claybdavis/calibre 2026-05-14 11:10:03 +05:30
claybdavis 8da3da0b66 propublica: drop the "ProPublica is a nonprofit newsroom" blurb
The ProPublica WordPress theme prepends a standing "ProPublica is a nonprofit newsroom that investigates abuses of power..." block to every article body, wrapped as `<div class="wp-block-propublica-notes--top wp-block-propublica-note">`.

This fix extends remove_tags to strip it via the --top BEM modifier. (The bare wp-block-propublica-note class is reused for in-body editor's-note boxes that should pass through, so the modifier scopes the strip to the top-of-article boilerplate only.)
2026-05-14 00:25:34 -05:00
claybdavis d5c66c2653 latimes: populate per-article description from og:description
LAT's section index pages only attach a teaser to <10% of article tiles, so most TOC entries in the resulting EPUB show no description under the headline.

This fix fetches each article page once during indexing and reads <meta property="og:description"> to populate art['description'].
Cached per-URL across the section walks so duplicates are fetched once. Descriptions shorter than 20 chars are dropped (LAT occasionally publishes placeholder text).
2026-05-14 00:19:21 -05:00
claybdavis 19692fbaeb latimes: drop the Twitter Follow link from the byline
LATimes bylines contain an <a data-social-trigger="enhancedByline"> wrapping an SVG <use> sprite reference and the literal text "Follow". The sprite targets a <shape> ID in a sibling <svg> elsewhere in the source document. That cross-document reference can't resolve in the converted EPUB, so the anchor renders as an empty icon slot followed by a badly placed "Follow."

This fix decomposes the anchor in preprocess_html. The "Staff Writer" <span> sibling is not a descendant, so the rubric stays in place.
2026-05-14 00:13:57 -05:00
claybdavis 58dda3ab13 latimes: narrow article regex to today+yesterday, strip picture wrappers
Two issues fixed in LATimes:

1. The /story/YYYY-MM-DD/ regex in parse_index was date-agnostic, so each section walk pulled articles from arbitrary past dates. This fix narrows the date path-segment to today+yesterday — same idea as chicago_tribune.recipe's inline tdy/yest filter (just expressed as a regex alternation since latimes uses re.compile).

2. LAT wraps each photo in <picture><source srcset="remote-webp"/>
   <img sizes="100vw" fetchpriority="high" src="local.jpg"/></picture>. The <source> URLs never load in EPUB (remote, no network at read time), and sizes="100vw" tells the reader to render at full viewport width. Combined with the image's natural aspect ratio, that overflows portrait photos onto the next page. This fix decomposes <source>, unwraps <picture>, and removes the sizes/fetchpriority hints so readers respect the image's embedded dimensions.
2026-05-14 00:08:03 -05:00
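
The today+yesterday narrowing from point 1 of the commit above could look something like this. It is a sketch only: `story_url_pattern` is a hypothetical helper name and the URLs in the usage lines are made up, but the technique is the regex alternation over two ISO dates that the commit message describes.

```python
import re
from datetime import date, timedelta


def story_url_pattern(today):
    """Compile a regex matching /story/YYYY-MM-DD/ for today and yesterday only."""
    days = (today, today - timedelta(days=1))
    alternation = '|'.join(re.escape(d.isoformat()) for d in days)
    return re.compile(r'/story/(?:' + alternation + r')/')


# Usage with made-up URLs:
pattern = story_url_pattern(date(2026, 5, 14))
print(bool(pattern.search('/california/story/2026-05-14/some-slug')))  # True
print(bool(pattern.search('/california/story/2026-05-01/older-slug')))  # False
```

Computing yesterday with `timedelta` rather than by decrementing the day number keeps month and year boundaries correct.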
claybdavis 93f6f0dfc8 wash_post_print: tag subhead and image captions for styling hooks
The recipe emits the dek as `<p class="subt">...</h3>`. The opening and closing tags are mismatched, leaving the dek without a clean hook. Also, captions (promo, video, image) are emitted as a bare <div> with no class.

The fixed recipe emits the subhead as `<h3 class="subhead">...</h3>` (a matched pair) and
tags each image-caption div with `class="caption"`. This change gives downstream stylesheets proper targets.
2026-05-13 23:59:30 -05:00
claybdavis 68ef0e420b new_yorker: skip Cartoon Caption Contest and Slideshow stub tiles
The magazine TOC walks <a class="summary-item__hed-link"> anchors inside every SummaryItemWrapper, including two non-article tile types that have no article body to extract:

  - summary-item--externallink (Cartoon Caption Contest — links to the
    interactive contest widget)
  - summary-item--gallery (Cartoon Slide Show — links to the gallery
    widget)

Filters both via the BEM modifier on the wrapper. Real articles use
summary-item--article (including Crossword) and pass through unchanged.
2026-05-13 23:37:28 -05:00
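
Filtering on a BEM modifier, as the commit above does, amounts to a token test on the wrapper's class attribute. The sketch below shows that test in isolation; `SKIPPED_MODIFIERS` and `is_article_tile` are hypothetical names, and the real recipe works on parsed HTML elements rather than raw class strings.

```python
# Modifiers named in the commit message as having no article body.
SKIPPED_MODIFIERS = frozenset({
    'summary-item--externallink',  # Cartoon Caption Contest
    'summary-item--gallery',       # Cartoon Slide Show
})


def is_article_tile(class_attr):
    """Return True unless the class attribute carries a skipped modifier.

    Classes are compared as whole whitespace-separated tokens, so a class
    that merely contains a skipped name as a substring is not filtered.
    """
    tokens = set(class_attr.split())
    return not (tokens & SKIPPED_MODIFIERS)
```

Matching whole tokens instead of substrings keeps tiles such as `summary-item--article` passing through unchanged.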
claybdavis f78f24ae2a tls_mag: force Safari UA so CloudFront serves the real page
the-tls.com is fronted by CloudFront, which serves Calibre's default
Chrome UA a CAPTCHA page instead of the issue page. The CAPTCHA
response lacks the rel=shortlink <link> element that parse_index reads
to extract the WordPress issue id, so the recipe crashes at
soup.find('link', rel='shortlink')['href'].

Overrides get_browser to force a Safari UA, which CloudFront passes
through cleanly.
2026-05-13 23:31:20 -05:00
Kovid Goyal 4c04c56033 Merge branch 'arb-disable-auto-cleanup' of https://github.com/claybdavis/calibre 2026-05-14 05:53:04 +05:30
Kovid Goyal 0407cbad79 Merge branch 'vox-keep-byline-and-lede' of https://github.com/claybdavis/calibre 2026-05-14 05:52:18 +05:30
claybdavis 3a9dfe8bbb asianreviewofbooks: disable auto_cleanup so reviewer info isn't dropped
auto_cleanup keeps only the body, but on asianreviewofbooks.com the
reviewer name, publication date, category tags, h1 title, book-cover
thumbnail, and the contributor-bio block at the foot all live in
sibling sections within the same WordPress single-entry article
wrapper — so they get stripped as chrome.

Disables auto_cleanup and selects the whole <article class="...
single-entry ..."> wrapper via keep_only_tags. remove_tags strips
the JS-only social-share container, the duplicate post-tags footer,
and the "Related" carousel of other reviews appended after the body.
2026-05-13 16:52:13 -05:00
claybdavis 90e286c831 vox: fetch live pages so byline, breadcrumb, and dek aren't dropped
Vox's RSS feeds carry the article body inline in <content type="html">,
and BasicNewsRecipe uses that embedded copy when present. The embedded
copy is missing the byline, the section breadcrumb (e.g. "Future
Perfect"), the dek, and the publication timestamp — all of which live
only on the live page.

Sets use_embedded_content = False so the live page is fetched, and
selects the lede block (breadcrumb + h1 + dek + byline + timestamp +
hero image) and body container explicitly via keep_only_tags. Drops
the social-share row via remove_tags.
2026-05-13 16:49:05 -05:00
Kovid Goyal ed5a92d9ae Clean up test 2026-05-13 15:49:33 +05:30
Kovid Goyal cf3a6f4e6f E-book viewer: Fix incorrect search match offsets in normal search mode when the text contains non-BMP Unicode characters. Fixes #2152227 [search ignors some chars](https://bugs.launchpad.net/calibre/+bug/2152227) 2026-05-13 15:46:02 +05:30
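
One common way offsets like these drift: a character outside the Basic Multilingual Plane is one code point but two UTF-16 code units, so an index counted in one unit is wrong in the other. The sketch below only illustrates that mismatch. It is not the viewer's code, whether this is the exact cause of the bug above is an assumption, and `utf16_offset` is a hypothetical helper.

```python
def utf16_offset(text, codepoint_index):
    """Count the UTF-16 code units that precede the given code point index."""
    return len(text[:codepoint_index].encode('utf-16-le')) // 2


text = 'a\U0001F600b'  # 'a', one non-BMP emoji, 'b'
match_at = text.index('b')            # position counted in code points
print(match_at)                       # 2
print(utf16_offset(text, match_at))   # 3, because the emoji takes two units
```

For text made up entirely of BMP characters the two offsets agree, which would explain why such a bug only shows up with certain characters.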
Kovid Goyal 801af5a41d Ignore inapplicable CVE 2026-05-13 15:03:50 +05:30
Kovid Goyal 150094ddb7 Merge branch 'conversation-keep-byline-disclosure' of https://github.com/claybdavis/calibre 2026-05-13 11:31:33 +05:30