Audit (2026-05-18) of the BBC News feed list found three entries that
no longer produce content:
* Special Reports (https://feeds.bbci.co.uk/news/special_reports/rss.xml)
HTTP/2 404. Wayback's last successful capture is 2024-07-23, so the
URL has been dead for roughly two years.
* Also in the News (https://feeds.bbci.co.uk/news/also_in_the_news/rss.xml)
HTTP/2 404. Wayback has no successful captures of this URL at all.
* Magazine (https://feeds.bbci.co.uk/news/magazine/rss.xml)
301-redirects to /news/stories/rss.xml which still returns 200 OK,
but the content there has been stale since December 2022. The
endpoint is alive; the section is abandoned.
Because the recipes set remove_empty_feeds=True, these three have been
silently swallowed on every fetch, costing four wasted HTTP calls per
run (the Magazine redirect doubles up). Dropping them cleans the
active feed list without changing what readers actually receive.
bbc.recipe had all three active entries; bbc_fast.recipe only carried
Magazine. Both files patched accordingly.
The ~50 commented-out legacy feed URLs in the same block are NOT
touched here -- that is a separate cleanup.
Sport articles fetched via recipes/bbc.recipe and recipes/bbc_fast.recipe
were shipping with no <h1>, because BBC restructured the
window.__INITIAL_DATA__ JSON. article['headline'] is now None for Sport,
and the headline lives either in a new 'headline' block-type or — for
the 'high-impact' layout — nested under a 'topper' block's
model.heading.blocks list. The previous parse_article_json loop only
branched on 'image' and 'text', so neither variant produced anything.
Fix: prefer the plain-text article['metadata']['seoHeadline'] when the
legacy article['headline'] field is empty, and as a defensive fallback
extract the headline from a 'headline' or 'topper' block via a small
extract_text_block_plaintext helper. Verified against live Sport URLs
covering both block-type variants; legacy News articles that still
populate article['headline'] are unaffected.
bbc_fast.recipe carries an identical copy of parse_article_json, so the
same patch is applied to both files.
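The fallback order described above can be sketched as follows. This is a minimal illustration, not the patched recipe code: pick_headline is a hypothetical name, the block list is passed in separately, and the block shape (a 'type' key plus a 'model' holding either 'text' or nested 'blocks') is an assumption about the JSON.

```python
def extract_text_block_plaintext(blocks):
    """Join the plain text found under a list of blocks, depth-first.

    Assumed shape: each block is {'type': ..., 'model': {...}} where the
    model holds either a 'text' string or a nested 'blocks' list.
    """
    parts = []
    for block in blocks or []:
        model = block.get('model') or {}
        text = model.get('text')
        if isinstance(text, str):
            # A block with its own text is taken as-is; its children are
            # not visited, so nested fragments are not counted twice.
            parts.append(text)
        else:
            parts.append(extract_text_block_plaintext(model.get('blocks')))
    return ''.join(parts)


def pick_headline(article, blocks):
    """Return the best available headline, or None if nothing is found."""
    # 1. Legacy field, still populated for News articles.
    if article.get('headline'):
        return article['headline']
    # 2. Plain-text SEO headline from the metadata.
    seo = (article.get('metadata') or {}).get('seoHeadline')
    if seo:
        return seo
    # 3. Defensive fallback: a 'headline' block, or a 'topper' block whose
    #    text sits under model.heading.blocks.
    for block in blocks or []:
        kind = block.get('type')
        if kind == 'headline':
            text = extract_text_block_plaintext([block])
        elif kind == 'topper':
            heading = (block.get('model') or {}).get('heading') or {}
            text = extract_text_block_plaintext(heading.get('blocks'))
        else:
            continue
        if text.strip():
            return text.strip()
    return None
```

Keeping the legacy field first is what leaves News articles unaffected; the block walk only runs when both plain-text fields are empty.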
newcriterion.com moved from October CMS to WordPress. The old recipe
looked for <div id="main"> and an /issues/YYYY/M/ URL pattern, so
parse_index crashed with AttributeError: 'NoneType' object has no
attribute 'findAll' against the new layout.
Rewrites parse_index for the new markup: issue URLs of the form
/issues/<month>-<year>/, a <div class="issue-layout"> container, and
<article class="article-display"> blocks with <h2><a> for title+URL
and <p class="post-excerpt"> for the dek.
Also ports get_browser from the old October-CMS XHR signin endpoint to
standard wp-login.php form submission, and drops the now-unused
urlencode, mechanize.Request, and re imports.
AP has no RSS feeds and parse_index had no date logic, so the framework's
oldest_article knob was a no-op and cross-day duplicates accumulated
indefinitely.
This fix fetches each candidate article during indexing and reads
<meta property="article:published_time"> to populate both the article's
timestamp (which oldest_article actually filters on) and a formatted date
string for the TOC. Cached per-URL across the front-page walk so duplicate
links are fetched once. Articles whose published_time can't be read are
skipped with a warning rather than kept dateless.
Sets oldest_article = 1 (AP publishes constantly). The cost is roughly
30-60s extra wall time and a doubling of HTTP volume per run; the benefit
is dropping the ~24 stale articles per consecutive-day fetch.
Same per-URL-fetch idiom as #3132 (latimes og:description).
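A rough, standard-library sketch of the per-URL date lookup follows. The names published_time and filter_recent, the injected fetch callable, and the regex (which assumes property= appears before content= inside the tag) are illustrative assumptions, not the recipe's actual code.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumes the property attribute precedes content inside the <meta> tag.
PUBLISHED_RE = re.compile(
    r'<meta[^>]+property=["\']article:published_time["\']'
    r'[^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def published_time(html):
    """Return the article's publication time as an aware datetime, or None."""
    match = PUBLISHED_RE.search(html)
    if match is None:
        return None
    try:
        # Older Pythons reject a trailing 'Z', so normalise it first.
        when = datetime.fromisoformat(match.group(1).replace('Z', '+00:00'))
    except ValueError:
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return when


def filter_recent(urls, fetch, now, oldest_article_days=1.0):
    """Keep URLs published within the window; fetch each distinct URL once."""
    cache = {}
    cutoff = now - timedelta(days=oldest_article_days)
    kept = []
    for url in urls:
        if url not in cache:
            cache[url] = published_time(fetch(url))
        when = cache[url]
        if when is None:
            print('warning: no published_time, skipping', url)
            continue
        if when >= cutoff:
            kept.append({
                'url': url,
                'timestamp': when,
                'date': when.strftime('%b %d, %Y'),
            })
    return kept
```

Passing fetch in as a parameter keeps the sketch independent of any particular browser object; in a recipe it would wrap whatever call retrieves the page.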
The ProPublica WordPress theme prepends a standing "ProPublica is a nonprofit newsroom that investigates abuses of power..." block to every article body, wrapped as `<div class="wp-block-propublica-notes--top wp-block-propublica-note">`.
This fix extends remove_tags to strip it via the --top BEM modifier. (The bare wp-block-propublica-note class is reused for in-body editor's-note boxes that should pass through, so the modifier scopes the strip to the top-of-article boilerplate only.)
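To illustrate why the match is scoped to the modifier, here is a small sketch that tests class tokens directly. The helper name is_top_boilerplate is hypothetical, and the class token is copied verbatim from the description above; the real recipe expresses this through remove_tags rather than a function like this.

```python
# Class token copied verbatim from the description above.
TOP_NOTE_CLASS = 'wp-block-propublica-notes--top'


def is_top_boilerplate(class_attr):
    """True when the class attribute carries the --top modifier token.

    Matching whole tokens (not substrings) leaves divs that only carry the
    bare note class untouched.
    """
    return TOP_NOTE_CLASS in class_attr.split()
```

A substring test on the bare class name would match both kinds of block, which is the case the modifier is meant to avoid.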
LAT's section index pages only attach a teaser to <10% of article tiles, so most TOC entries in the resulting EPUB show no description under the headline.
This fix fetches each article page once during indexing and reads <meta property="og:description"> to populate art['description'].
Cached per-URL across the section walks so duplicates are fetched once. Descriptions shorter than 20 chars are dropped (LAT occasionally publishes placeholder text).
LATimes bylines contain an <a data-social-trigger="enhancedByline"> wrapping an SVG <use> sprite reference and the literal text "Follow". The sprite targets a <shape> ID in a sibling <svg> elsewhere in the source document; that cross-document reference can't resolve in the converted EPUB, so the anchor renders as an empty icon slot followed by a badly placed "Follow".
This fix decomposes the anchor in preprocess_html. The "Staff Writer" <span> sibling is not a descendant, so the rubric stays in place.
Two issues fixed in LATimes:
1. The /story/YYYY-MM-DD/ regex in parse_index was date-agnostic, so each section walk pulled articles from arbitrary past dates. This fix narrows the date path-segment to today+yesterday — same idea as chicago_tribune.recipe's inline tdy/yest filter (just expressed as a regex alternation since latimes uses re.compile).
2. LAT wraps each photo in <picture><source srcset="remote-webp"/><img sizes="100vw" fetchpriority="high" src="local.jpg"/></picture>. The <source> URLs never load in the EPUB (they are remote, and there is no network at read time), and sizes="100vw" tells the reader to render at full viewport width; combined with the image's natural aspect ratio, that pushes portrait photos onto the next page. This fix decomposes <source>, unwraps <picture>, and removes the sizes/fetchpriority hints so readers respect the image's embedded dimensions.
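The date narrowing in item 1 could look roughly like the sketch below. The function name story_url_pattern and the sample URLs in the usage lines are illustrative assumptions; only the /story/YYYY-MM-DD/ path shape comes from the description.

```python
import re
from datetime import date, timedelta


def story_url_pattern(today):
    """Compile a regex matching /story/<date>/ for today or yesterday only."""
    yesterday = today - timedelta(days=1)
    # isoformat() yields YYYY-MM-DD, the same shape as the URL path segment.
    days = '|'.join(re.escape(d.isoformat()) for d in (today, yesterday))
    return re.compile(r'/story/(?:%s)/' % days)


# Usage: build the pattern once per run, then test each candidate link.
pattern = story_url_pattern(date(2026, 3, 1))
recent = bool(pattern.search('https://www.latimes.com/x/story/2026-02-28/slug'))
stale = bool(pattern.search('https://www.latimes.com/x/story/2026-02-27/slug'))
```

Computing yesterday with timedelta rather than string arithmetic keeps the alternation correct across month and year boundaries.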
The recipe emits the dek as `<p class="subt">...</h3>`: the opening and closing tags are mismatched, leaving the dek without a clean hook. Also, captions (promo, video, image) are emitted as a bare <div> with no class.
The fixed recipe emits the subhead as `<h3 class="subhead">...</h3>` (a matched pair) and
tags each image-caption div with `class="caption"`. This change gives downstream stylesheets proper targets.
The magazine TOC walks <a class="summary-item__hed-link"> anchors inside every SummaryItemWrapper, including two non-article tile types that have no article body to extract:
- summary-item--externallink (Cartoon Caption Contest — links to the
interactive contest widget)
- summary-item--gallery (Cartoon Slide Show — links to the gallery
widget)
Filters both via the BEM modifier on the wrapper. Real articles use
summary-item--article (including Crossword) and pass through unchanged.
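The tile filter amounts to a set test on the wrapper's class tokens, sketched below. The helper name is_article_tile is hypothetical, and it assumes the wrapper's classes arrive as a list of tokens (as an HTML parser typically exposes them).

```python
# Modifiers for tiles that have no article body to extract.
SKIPPED_TILE_MODIFIERS = frozenset({
    'summary-item--externallink',  # Cartoon Caption Contest widget
    'summary-item--gallery',       # Cartoon Slide Show widget
})


def is_article_tile(wrapper_classes):
    """True unless the wrapper carries one of the skipped modifiers."""
    return SKIPPED_TILE_MODIFIERS.isdisjoint(wrapper_classes)
```

Using a deny-list means any tile type not named here, including summary-item--article, passes through unchanged.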
the-tls.com is fronted by CloudFront, which serves Calibre's default
Chrome UA a CAPTCHA page instead of the issue page. The CAPTCHA
response lacks the <link rel="shortlink"> element that parse_index
reads to extract the WordPress issue id, so soup.find returns None
and the recipe crashes at soup.find('link', rel='shortlink')['href'].
Overrides get_browser to force a Safari UA, which CloudFront passes
through cleanly.
auto_cleanup keeps only the body, but on asianreviewofbooks.com the
reviewer name, publication date, category tags, h1 title, book-cover
thumbnail, and the contributor-bio block at the foot all live in
sibling sections within the same WordPress single-entry article
wrapper — so they get stripped as chrome.
Disables auto_cleanup and selects the whole <article class="...
single-entry ..."> wrapper via keep_only_tags. remove_tags strips
the JS-only social-share container, the duplicate post-tags footer,
and the "Related" carousel of other reviews appended after the body.
Vox's RSS feeds carry the article body inline in <content type="html">,
and BasicNewsRecipe uses that embedded copy when present. The embedded
copy is missing the byline, the section breadcrumb (e.g. "Future
Perfect"), the dek, and the publication timestamp — all of which live
only on the live page.
Sets use_embedded_content = False so the live page is fetched, and
selects the lede block (breadcrumb + h1 + dek + byline + timestamp +
hero image) and body container explicitly via keep_only_tags. Drops
the social-share row via remove_tags.