Audit (2026-05-18) of the BBC News feed list found three entries that
no longer produce content:
* Special Reports (https://feeds.bbci.co.uk/news/special_reports/rss.xml)
  HTTP/2 404. Wayback's last successful capture is 2024-07-23, so the
  URL has probably been dead for close to two years.
* Also in the News (https://feeds.bbci.co.uk/news/also_in_the_news/rss.xml)
  HTTP/2 404. Wayback has no successful captures of this URL at all.
* Magazine (https://feeds.bbci.co.uk/news/magazine/rss.xml)
  301-redirects to /news/stories/rss.xml, which still returns 200 OK,
  but the content there has been stale since December 2022. The
  endpoint is alive; the section is abandoned.
Because the recipes set remove_empty_feeds=True, these three feeds have
been silently dropped on every fetch, costing four wasted HTTP calls per
run (the Magazine redirect counts twice). Removing them cleans up the
active feed list without changing what readers actually receive.
bbc.recipe had all three active entries; bbc_fast.recipe only carried
Magazine. Both files patched accordingly.
The ~50 commented-out legacy feed URLs in the same block are NOT
touched here -- that is a separate cleanup.
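The audit criteria above can be expressed as a small classifier. This is
a sketch only: classify_feed and the one-year staleness threshold are
hypothetical and not part of either recipe, and the caller is assumed to
have already followed redirects and parsed the newest item date.

```python
from datetime import datetime, timedelta, timezone


def classify_feed(status, newest_item, now, stale_after=timedelta(days=365)):
    """Return 'dead', 'stale' or 'live' for one audited feed.

    status      -- final HTTP status after following any redirects
    newest_item -- timezone-aware datetime of the newest item, or None
    now         -- timezone-aware datetime of the audit
    stale_after -- assumed threshold; pick whatever suits the audit
    """
    if status != 200:
        return 'dead'
    if newest_item is None or now - newest_item > stale_after:
        return 'stale'
    return 'live'
```

Under these assumptions the two 404 feeds come out as 'dead' and the
Magazine feed, which answers 200 OK with content from December 2022,
comes out as 'stale'.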
Sport articles fetched via recipes/bbc.recipe and recipes/bbc_fast.recipe
were shipping with no <h1>, because BBC restructured the
window.__INITIAL_DATA__ JSON. article['headline'] is now None for Sport,
and the headline lives either in a new 'headline' block-type or, for
the 'high-impact' layout, nested under a 'topper' block's
model.heading.blocks list. The previous parse_article_json loop only
branched on 'image' and 'text', so neither variant produced a heading.
Fix: prefer the plain-text article['metadata']['seoHeadline'] when the
legacy article['headline'] field is empty, and as a defensive fallback
extract the headline from a 'headline' or 'topper' block via a small
extract_text_block_plaintext helper. Verified against live Sport URLs
covering both block-type variants; legacy News articles that still
populate article['headline'] are unaffected.
bbc_fast.recipe carries an identical copy of parse_article_json, so the
same patch is applied to both files.
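The fallback order described above can be sketched as follows. This is
not the patched recipe code: pick_headline is a hypothetical name, the
blocks list is passed in rather than looked up, and the assumed block
shape (a 'model' dict holding either a 'text' string or nested 'blocks')
is inferred from the description and may not match every BBC payload.

```python
def extract_text_block_plaintext(block):
    """Return the plain text of a block, preferring its own model['text'].

    Assumes a block looks like {'type': ..., 'model': {...}} where the
    model holds either a 'text' string or a nested 'blocks' list.
    """
    model = block.get('model') or {}
    text = model.get('text')
    if isinstance(text, str) and text.strip():
        return text.strip()
    children = model.get('blocks') or []
    return ''.join(extract_text_block_plaintext(c) for c in children).strip()


def pick_headline(article, blocks):
    """Choose a headline: legacy field, then seoHeadline, then blocks."""
    headline = article.get('headline')
    if headline:
        return headline
    seo = (article.get('metadata') or {}).get('seoHeadline')
    if seo:
        return seo
    # Defensive fallback: look for a 'headline' or 'topper' block.
    for block in blocks:
        model = block.get('model') or {}
        if block.get('type') == 'headline':
            text = extract_text_block_plaintext(block)
        elif block.get('type') == 'topper':
            # High-impact layout: text sits under model.heading.blocks.
            heading = model.get('heading') or {}
            text = extract_text_block_plaintext({'model': heading})
        else:
            continue
        if text:
            return text
    return None
```

Returning None when nothing matches leaves the caller free to decide
whether to emit an article without an <h1> or to abort it.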
newcriterion.com moved from October CMS to WordPress. The old recipe
looked for <div id="main"> and an /issues/YYYY/M/ URL pattern, so
parse_index crashed with AttributeError: 'NoneType' object has no
attribute 'findAll' against the new layout.
Rewrites parse_index for the new markup: issue URLs of the form
/issues/<month>-<year>/, a <div class="issue-layout"> container, and
<article class="article-display"> blocks with <h2><a> for title+URL
and <p class="post-excerpt"> for the dek.
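The markup described above can be walked roughly like this. The recipe
itself works on the soup object calibre provides; this sketch swaps in
the standard library's html.parser so it runs on its own, and the names
IssueIndexParser and parse_issue_html are hypothetical. It also skips
the <div class="issue-layout"> lookup and scans the whole document.

```python
from html.parser import HTMLParser


class IssueIndexParser(HTMLParser):
    """Collect title, URL and dek from <article class="article-display">."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._current = None   # dict for the article being read, if any
        self._in_h2 = False
        self._field = None     # 'title' or 'description' while capturing

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'article' and 'article-display' in classes:
            self._current = {'title': '', 'url': '', 'description': ''}
        elif self._current is None:
            return
        elif tag == 'h2':
            self._in_h2 = True
        elif tag == 'a' and self._in_h2 and not self._current['url']:
            self._current['url'] = attrs.get('href') or ''
            self._field = 'title'
        elif tag == 'p' and 'post-excerpt' in classes:
            self._field = 'description'

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag == 'a' and self._field == 'title':
            self._field = None
        elif tag == 'p' and self._field == 'description':
            self._field = None
        elif tag == 'h2':
            self._in_h2 = False
        elif tag == 'article':
            # Keep only blocks that actually yielded a link.
            if self._current['url']:
                self.articles.append(
                    {k: v.strip() for k, v in self._current.items()})
            self._current = None
            self._field = None
            self._in_h2 = False


def parse_issue_html(html):
    parser = IssueIndexParser()
    parser.feed(html)
    parser.close()
    return parser.articles
```

Because a missing container simply yields an empty list here, the
failure mode is "no articles" rather than an AttributeError on None.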
Also ports get_browser from the old October-CMS XHR signin endpoint to
standard wp-login.php form submission, and drops the now-unused
urlencode, mechanize.Request, and re imports.
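A form-based WordPress sign-in of the kind described might look like the
sketch below. wp_login is a hypothetical helper, not the recipe's
get_browser; it assumes a mechanize-style browser object, that the login
form is the first form on the page, and the stock WordPress field names
log and pwd.

```python
def wp_login(br, site, username, password):
    """Submit the stock WordPress login form with a mechanize-style browser.

    br   -- object offering open(), select_form(), item assignment, submit()
    site -- base URL of the site (an assumption; passed in by the caller)
    """
    br.open(site.rstrip('/') + '/wp-login.php')
    br.select_form(nr=0)   # assumes the login form is the first form
    br['log'] = username   # WordPress username field
    br['pwd'] = password   # WordPress password field
    br.submit()
    return br
```

In a recipe this would be called from get_browser only when both a
username and a password are configured, so anonymous fetches skip the
login round trip.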
AP has no RSS feeds and parse_index had no date logic, so the framework's
oldest_article knob was a no-op and cross-day duplicates accumulated
indefinitely.
This fix fetches each candidate article during indexing and reads
<meta property="article:published_time"> to populate both the article's
timestamp (which oldest_article actually filters on) and a formatted date
string for the TOC. Results are cached per URL across the front-page walk,
so duplicate links are fetched only once. Articles whose published_time
can't be read are skipped with a warning rather than kept dateless.
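The per-URL fetch with caching can be sketched like this. The names
parse_published_time and date_articles are hypothetical, fetch stands in
for whatever the recipe uses to download a page, the meta content is
assumed to be ISO 8601, and the YYYY-MM-DD display format is an
assumption rather than what the recipe emits.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class _PublishedTimeFinder(HTMLParser):
    """Remember the first <meta property="article:published_time">."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == 'meta' and self.value is None
                and attrs.get('property') == 'article:published_time'):
            self.value = attrs.get('content')


def parse_published_time(html):
    """Return a timezone-aware datetime, or None if it cannot be read."""
    finder = _PublishedTimeFinder()
    finder.feed(html)
    if not finder.value:
        return None
    raw = finder.value.strip()
    if raw.endswith('Z'):
        # Older Python versions reject a trailing 'Z' in fromisoformat.
        raw = raw[:-1] + '+00:00'
    try:
        parsed = datetime.fromisoformat(raw)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC
    return parsed


def date_articles(urls, fetch, warn=print):
    """Attach a timestamp and display date to each URL, fetching each once."""
    cache = {}
    dated = []
    for url in urls:
        if url not in cache:
            cache[url] = parse_published_time(fetch(url))
        published = cache[url]
        if published is None:
            warn('skipping %s: no readable published_time' % url)
            continue
        dated.append({
            'url': url,
            'timestamp': published,
            'date': published.strftime('%Y-%m-%d'),
        })
    return dated
```

The sketch leaves the age cut-off to the framework's oldest_article
setting and only makes sure every kept article carries a timestamp for
that filter to act on.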
Sets oldest_article = 1 (AP publishes constantly). The trade-off is roughly
30-60s of extra wall time and about double the HTTP volume per run, in
exchange for dropping the ~24 stale articles per consecutive-day fetch.
Same per-URL-fetch idiom as #3132 (latimes og:description).