End-to-end pipeline orchestrator + filter/extraction quality bundle #28

Closed
smodee wants to merge 21 commits into main from feat/end-to-end-orchestrator


@smodee smodee commented May 28, 2026

End-to-end pipeline orchestrator + filter/extraction quality bundle

Draft — not ready for review. Built on #24 and #27 (both now merged to
main); this branch has been rebased onto main, so the diff below is the
orchestrator + bundle work only. Flip to "Ready for review" when you want it
handed off.

Summary

Adds the first end-to-end orchestrator that chains all four pipeline stages
(search → filter → extract → insight) for a single ForecastQuestion, plus a
bundle of filter- and extraction-quality improvements driven by live runs of
two real forecasting questions.

Before this branch the stages only ran in isolation: bioscancast/main.py was a
commented sketch, scripts/run_insight.py was synthetic-only, and
scripts/eval_insight_on_real_docs.py chained only extract→insight on hardcoded
fixtures. Forecasting (the next stage) needs a real InsightRecord stream;
without this orchestrator it would have had to build the chain itself, coupling
the two stages.

Built on #24 and #27

This branch was developed on a merge of feat/as-of-date-replay (#24) and
feat/insight-stage-hardening (#27), both now merged to main. It has since
been rebased onto main; the merge commits and a redundant contamination.py
migration commit (superseded by #27's own 2d77493) dropped out in the rebase,
so the diff is the orchestrator + bundle work described below.

What's included

Orchestrator

  • bioscancast/main.py: run_pipeline() + argparse CLI. Chains the four
    stages with per-stage timing, JSON persistence, error wrapping
    (PipelineError pins the failing stage), and a cost estimate. Runnable as
    python -m bioscancast.main q7 … or via scripts/run_pipeline.py.
  • bioscancast/orchestration/ — new package: persistence.py (run-dir
    layout data/runs/{qid}/{run_id}/, per-stage JSON dumps, manifest) and
    test_questions.csv (q7 verbatim copy + a new q12 Ebola question; the
    canonical bioscancast_questions.csv is left untouched as the human-forecaster
    record).
  • bioscancast/stages/eval_stage/loaders.py: build_forecast_question
    factory (CSV row → ForecastQuestion) + load_question_by_id; fixed the
    Excel-serial created_date parsing bug while here.
  • bioscancast/llm/pricing.py — model price table (snapshot 2026-05-27) +
    estimate_cost; surfaces USD cost per run, broken down per stage.

Filter chokepoint (issues #13, #14)

  • Dashboard-injected results bypass the keyword-overlap heuristic
    (retrieval_reason == "dashboard_lookup") — they were getting
    keyword_overlap_score == 0.000 and being dropped despite being curated
    authoritative sources.
  • Dashboard titles/snippets enriched with pathogen-specific text in
    biosecurity_sources.py (was "Dashboard: cdc.gov").
  • heuristic_keep_threshold lowered 0.72 → 0.65.
  • Dashboard-bypass docs exempted from cap_per_domain_and_type so a curated
    dashboard doesn't displace an organic result on the same domain.

Search-stage relevance ranking (#4)

  • search_stage_score was 0.5·domain + 0.3·freshness + 0.2·(1/rank), with no
    topical-relevance term, so high-authority but off-topic results ranked at the
    top and consumed total_cap slots. It is now
    0.45·relevance + 0.30·domain + 0.10·freshness + 0.15·(1/rank), where
    relevance reuses the filter's keyword_overlap_score/build_query_terms.
    Freshness is weighted low because it is near-uniform in live mode. Weights sum
    to 1.0; the score drives ranking + truncation only.
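
A minimal sketch of the old and new scoring formulas, for reviewers who want to see the effect of the relevance term. The two weight sets are taken from the bullet above; the function names and the example component values are illustrative assumptions, and all components are assumed to lie in [0, 1].

```python
def old_search_stage_score(domain: float, freshness: float, rank: int) -> float:
    """Pre-change score: no topical-relevance term."""
    return 0.5 * domain + 0.3 * freshness + 0.2 * (1.0 / rank)


def new_search_stage_score(
    relevance: float, domain: float, freshness: float, rank: int
) -> float:
    """Post-change score: relevance carries the largest weight."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    return (
        0.45 * relevance
        + 0.30 * domain
        + 0.10 * freshness
        + 0.15 * (1.0 / rank)
    )


# Made-up inputs: an off-topic page on a top-authority domain at rank 1,
# versus an on-topic page on a mid-authority domain at rank 2.
old_off_topic = old_search_stage_score(domain=1.0, freshness=1.0, rank=1)
old_on_topic = old_search_stage_score(domain=0.6, freshness=1.0, rank=2)
new_off_topic = new_search_stage_score(relevance=0.0, domain=1.0, freshness=1.0, rank=1)
new_on_topic = new_search_stage_score(relevance=0.8, domain=0.6, freshness=1.0, rank=2)
```

With these inputs the old formula ranks the off-topic page first, while the new one ranks the on-topic page first.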

Dashboard sources (#3)

  • Fixed a broken dashboard URL (cdc.gov/mpox/data-research → 404; now the
    extractable monkeypox/situation-summary page) and two stale redirects
    (afro.who.int ebola-disease, cdc.gov/ebola/about).
  • DASHBOARD_LOOKUP routing was an exact lowercase-key match, so the
    CSV-natural "marburg virus disease" failed to route to the marburg key
    (→ zero on-topic results). Added _resolve_pathogen_key with alias +
    substring matching (marburg virus disease→marburg, monkeypox→mpox,
    bird flu→h5n1).
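
The tolerant routing can be sketched roughly as below. This is not the actual _resolve_pathogen_key; the key tuple and the lookup order (exact key, exact alias, alias substring, key substring) are assumptions. The three aliases are the ones listed above.

```python
from typing import Optional

# Hypothetical stand-in for the keys of DASHBOARD_LOOKUP.
DASHBOARD_KEYS = ("marburg", "mpox", "h5n1", "ebola")

# Aliases named in this PR.
ALIASES = {
    "marburg virus disease": "marburg",
    "monkeypox": "mpox",
    "bird flu": "h5n1",
}


def resolve_pathogen_key(pathogen: str) -> Optional[str]:
    """Map a free-text pathogen name onto a dashboard key, or None."""
    name = pathogen.strip().lower()
    if name in DASHBOARD_KEYS:  # exact key match (the old behavior)
        return name
    if name in ALIASES:  # exact alias match
        return ALIASES[name]
    for alias, key in ALIASES.items():  # alias appears inside the name
        if alias in name:
            return key
    for key in DASHBOARD_KEYS:  # key appears inside the name
        if key in name:
            return key
    return None
```

Substring matching is deliberately loose. A stricter variant could match on word boundaries if short keys start producing false routes.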

Source tiers (#13)

  • Promoted ~22 national/international outlets (CNN, NBC, CBS, ABC, NPR, USA
    Today, LA Times, Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist, Time,
    The Atlantic, Ars Technica, Business Insider, …) from unknown (0.2) to
    Tier 3 trusted_media (0.6). Legitimate outbreak reporting from these was
    being floored below the filter's credibility threshold.
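
A rough illustration of tier lookup by second-level domain. The 0.6 and 0.2 scores come from this PR; the domain set, the function names, and the naive "last two labels" rule are assumptions (that rule is wrong for suffixes such as co.uk, which the real matcher may handle differently).

```python
# Hypothetical excerpt of a Tier 3 table; the real list lives in source_tiers.py.
TRUSTED_MEDIA = {"cnn.com", "npr.org", "politico.com"}

TRUSTED_MEDIA_SCORE = 0.6  # Tier 3 (trusted_media)
UNKNOWN_SCORE = 0.2  # unknown tier


def second_level_domain(host: str) -> str:
    """Naive reduction to the last two labels, e.g. edition.cnn.com -> cnn.com."""
    labels = host.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def domain_score(host: str) -> float:
    """Score a host by tier, so subdomains inherit the outlet's tier."""
    if second_level_domain(host) in TRUSTED_MEDIA:
        return TRUSTED_MEDIA_SCORE
    return UNKNOWN_SCORE
```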

No-LLM filter fallback (#13)

  • Default-off FILTER_CONFIG["no_llm_soft_fallback"] (+
    no_llm_fallback_relevance_threshold). When llm_client is None, the ambiguous
    rerank band was always rejected (fail-closed) — too aggressive for
    dev/offline/no-API-key runs. With the flag on, a borderline candidate is kept
    iff it is an official domain OR clears the relevance threshold, approximating
    the LLM-rescue path. Production (always has an LLM client) is unchanged.
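
The decision for a borderline candidate when no LLM client is available can be sketched as follows. The two config keys are the ones named above; the function name, the 0.3 threshold value, and the use of >= for "clears" are assumptions.

```python
from typing import Any, Mapping


def keep_borderline_without_llm(
    is_official_domain: bool, relevance: float, config: Mapping[str, Any]
) -> bool:
    """Decide a borderline candidate when llm_client is None."""
    if not config.get("no_llm_soft_fallback", False):
        return False  # default: fail-closed, as before this PR
    threshold = config["no_llm_fallback_relevance_threshold"]
    return is_official_domain or relevance >= threshold


# Placeholder threshold for illustration only.
default_config = {"no_llm_soft_fallback": False}
soft_config = {
    "no_llm_soft_fallback": True,
    "no_llm_fallback_relevance_threshold": 0.3,
}
```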

Extraction quality

  • suspected_deaths added to the controlled metric_name vocabulary (was
    collapsing "160 suspected deaths" into plain deaths).
  • Extraction prompt now requires the quote field to be the sentence carrying
    the figure (digits / number-word / relative reference), closing a gap where a
    metric_value was attached to a quantitatively-empty quote.
  • Hallucination guard gains a case-insensitive layer 4 — recovers real quotes
    the model lowercased the leading letter of, while still rejecting
    content-insertion fabrications (regression-tested with real q12 cases).
  • Adaptive retrieval_top_k (12 → 20) when few documents survive filtering, so
    per-doc retrieval depth isn't the bottleneck on coverage.

Live verification (q7 + q12)

| | q7 (Mpox, historical replay 2025-02-28) | q12 (Ebola, live) |
|---|---|---|
| filter survival (baseline → final) | 2/43 → 5/38 | 5/37 → 7/40 |
| insight records (baseline → final) | 2 (stale) → 1 | 5 → 5 (now using suspected_*) |
| cost per run | ~$0.004 | ~$0.005 |

Cumulative API spend across all development runs: ~$0.03. q12's records now
distinguish suspected_cases/suspected_deaths and cross-document dedup merges
twin reports correctly.

Follow-up: systematic review of #3 / #4 / #13

A second pass tested 10 fresh live questions (H5N1, the 2026 DRC+Uganda Ebola
outbreak, the Andes-virus cruise hantavirus cluster, mpox, Marburg) spanning
range/binary/categorical, plus a hand-labeled offline sweep of filter thresholds
and search-stage weights and several end-to-end live runs. Total spend ~$0.03.
Key findings (these drove the four commits above):

  • Search relevance was the root problem. When the question's pathogen is not
    the dominant news story (here, Ebola dominated May 2026), the organic pool is
    flooded with off-topic high-authority content (sports/legal/other-pathogen
    news, even a nature.com nuclear-cladding paper), because the score had no
    relevance term. For mpox/Marburg/H5N1 only the injected dashboards were
    on-topic. → fixed by the #4 relevance term (issue #4: "Tune search_stage_score weights (0.5/0.3/0.2)").
  • The 0.65 keep threshold sits on a wide plateau (0.60–0.775 are identical on
    the captured pools); it is not a useful tuning knob. Official-source recall
    was already 1.0. The real recall sinks were (a) the no-LLM fail-closed path and
    (b) reputable outlets mis-tiered as unknown. → addressed by the tier + soft
    fallback commits, not by moving the threshold.
  • Dashboard value is bimodal at extraction. Static factsheets / situation
    summaries extract and yield records (WHO Marburg → many; CDC bird-flu → some);
    interactive trackers / index pages / the 404 yield zero (mpox produced 0
    records until the URL fix → 2 after). Recommendation: prefer static
    fact-bearing dashboard URLs; consider not letting non-extractable dashboards
    consume survival slots.
  • Replay confound confirmed (specs/tavily-historical-coverage.md): historical
    replay survival is dashboard/official-dominated with little fresh organic
    signal, so resolved-question accuracy cannot drive filter tuning — which is why
    the sweep was scored against hand labels, not forecast accuracy.

Known interaction to flag for reviewers: promoting outlets to Tier 3 raises
recall of legitimate reporting, but because the filter's priority_score still
weights credibility heavily (and the new relevance term lives in search
ranking, not the filter's keep decision), it can also admit off-topic pieces from
those same outlets. A sensible follow-up is to raise the filter's
keyword-overlap weight / lower its 0.25·credibility blend.

Issues this PR addresses

Test plan

  • python -m pytest bioscancast/tests/ — 455 passed, 2 skipped (live). New
    tests cover the relevance scoring formula, tolerant dashboard routing, Tier 3
    outlet coverage, and the no-LLM soft-fallback flag.
  • Live runs of q7 (historical replay) and q12 (live) end-to-end; artifacts
    inspected for filter survival, records, and cost.
  • Follow-up: 10 fresh live questions + hand-labeled offline sweep of filter
    thresholds and search weights (see "Follow-up" section). Re-running mpox after
    the dashboard URL fix went from 0 → 2 insight records.
  • Reviewer: confirm the dashboard cap-exemption policy in
    cap_per_domain_and_type (curated dashboards never consume a domain slot).
  • Reviewer: sanity-check the Tier 3 outlet additions in source_tiers.py
    and the credibility-vs-relevance interaction noted in the follow-up section.
  • Reviewer: note the 0.65 keep threshold is unchanged and now known to sit on
    a flat plateau (a sweep was done — it's not the lever).

Not in scope (deliberate)

smodee and others added 21 commits May 28, 2026 20:07
bioscancast_questions.csv stores created_date as an Excel serial day
number (e.g. 45712). pd.to_datetime without unit=D + origin=1899-12-30
treated those integers as nanoseconds past 1970, yielding garbage dates
like 1970-01-01 00:00:00.000045712. The bug was latent — no caller had
yet relied on the parsed date — but the new orchestrator's
build_forecast_question factory needs an accurate created_at.

After the fix, q7 resolves to 2025-02-24 as expected.
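
For reference, the same conversion in standard-library Python, as a sketch rather than the loader's actual code. The commit describes the pandas fix as passing unit=D and origin=1899-12-30 to pd.to_datetime; the helper name here is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Excel's day 0 under the convention used by the fix (origin=1899-12-30).
EXCEL_EPOCH = datetime(1899, 12, 30, tzinfo=timezone.utc)


def excel_serial_to_utc(serial: float) -> datetime:
    """Convert an Excel serial day number to a tz-aware UTC datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


q7_created = excel_serial_to_utc(45712)
print(q7_created.date())  # 2025-02-24
```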
The orchestrator (next commit) needs to turn a CSV row into a typed
ForecastQuestion. Maps:

- created_date -> tz-aware UTC datetime (already parsed by load_questions)
- topic "Pathogen (Region)" -> lowercased pathogen + region
- question_text "by Month day, year" -> target_date via regex; falls back
  to "by Month year" giving the first of next month
- question_type + keyword hints in text -> event_type
  (case_count / death_count / outbreak_declared / None)
- resolution_criteria passes through
- as_of_date is a factory kwarg, not a CSV column; orchestrator passes
  it from --as-of-date

Tested against all 11 rows of bioscancast_questions.csv; q7 produces
ForecastQuestion(id=q7, pathogen=mpox, region=world,
target_date=2025-02-28, event_type=case_count, ...).
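
The target_date rule can be sketched as below. The regexes and helper name are assumptions, not the factory's real code; only the two-step behavior ("by Month day, year", then "by Month year" giving the first of the next month) comes from the commit message.

```python
import re
from datetime import date
from typing import Optional

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

_BY_DAY = re.compile(r"\bby\s+([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})\b", re.IGNORECASE)
_BY_MONTH = re.compile(r"\bby\s+([A-Za-z]+)\s+(\d{4})\b", re.IGNORECASE)


def parse_target_date(question_text: str) -> Optional[date]:
    """Extract a target date from 'by Month day, year' or 'by Month year'."""
    m = _BY_DAY.search(question_text)
    if m and m.group(1).lower() in MONTHS:
        return date(int(m.group(3)), MONTHS[m.group(1).lower()], int(m.group(2)))
    m = _BY_MONTH.search(question_text)
    if m and m.group(1).lower() in MONTHS:
        month, year = MONTHS[m.group(1).lower()], int(m.group(2))
        # Fallback: first day of the following month (December rolls the year).
        if month == 12:
            return date(year + 1, 1, 1)
        return date(year, month + 1, 1)
    return None
```

This sketch only recognizes full month names; abbreviations such as "Feb" would need extra entries in MONTHS.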
Branch-local question fixture for the new end-to-end orchestrator's live
smoke tests. Two rows:

- q7: verbatim copy of the row in bioscancast/stages/eval_stage/
  bioscancast_questions.csv. Resolved at 126,441 mpox cases globally by
  Feb 28 2025. Run with --as-of-date 2025-02-28 to exercise historical
  replay.

- q12: new live question on the current East Africa Ebola outbreak,
  target_date 2026-06-30. Run with no --as-of-date for live mode.

Kept separate from bioscancast_questions.csv so the canonical CSV stays
an unmodified record of what human forecasters actually evaluated.
bioscancast/llm/pricing.py introduces:
- MODEL_PRICES: USD/1M-token snapshot dated 2026-05-27 for the models
  actually used by stage configs (gpt-4o-mini, gpt-4o, text-embedding-3-
  small/large) plus a date-pinned gpt-4o-2024-08-06 alias.
- estimate_cost(model, input_tokens, output_tokens, cached_input_tokens):
  computes USD spend with a 50% discount on cached prefix per OpenAI's
  standard prompt-cache pricing.
- estimate_cost_from_summary(): consumes the dict shape that
  InsightRunResult.budget_summary already produces.

Sources cited in the module docstring. Unknown model raises
UnknownModelError so the orchestrator surfaces stale price tables
loudly rather than under-reporting cost.
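
A minimal sketch of the cost calculation described above. The price entry is a placeholder, not a real snapshot value, and the sketch assumes cached_input_tokens is a subset of input_tokens. Only the 50% cached-prefix discount and the raise-on-unknown-model behavior come from the commit message.

```python
class UnknownModelError(KeyError):
    """Raised when a model has no entry in the price table."""


# (input USD per 1M tokens, output USD per 1M tokens). PLACEHOLDER values.
MODEL_PRICES = {"example-model": (1.00, 2.00)}

CACHED_INPUT_DISCOUNT = 0.5  # cached prefix billed at half the input price


def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> float:
    """Estimate USD spend for one model's usage."""
    try:
        input_price, output_price = MODEL_PRICES[model]
    except KeyError:
        raise UnknownModelError(model) from None
    uncached = input_tokens - cached_input_tokens
    total = (
        uncached * input_price
        + cached_input_tokens * input_price * CACHED_INPUT_DISCOUNT
        + output_tokens * output_price
    )
    return total / 1_000_000
```

Raising instead of returning 0.0 keeps a stale table from silently under-reporting spend.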
New module with the run-directory layout
(data/runs/{question_id}/{run_id}/) and per-stage JSON dump helpers:
save_question / save_search / save_filtered / save_documents /
save_insight / save_manifest. _json_default and the asdict pattern are
lifted from scripts/eval_insight_on_real_docs.py so the orchestrator
and the eval harness share serialization conventions.
Replaces the 14-line commented sketch with a real argparse-driven
orchestrator that chains all four stages for a single ForecastQuestion:

  python -m bioscancast.main q7 --as-of-date 2025-02-28 -v

Pipeline:
  1. load_question_by_id reads the CSV row and builds a ForecastQuestion
     via the new factory (applying any CLI overrides).
  2. SearchStagePipeline runs with a usage-tracking LLM wrapper so
     search + filter token usage is accumulated for cost reporting.
  3. FilteringPipeline reuses the same wrapped client.
  4. ExtractionPipeline gets as_of_date=question.as_of_date so the
     fetcher uses Wayback snapshots in historical-replay mode.
  5. InsightPipeline receives the raw (unwrapped) client; its
     BudgetTracker already tracks usage, so wrapping would double-count.

After all stages, search/filter usage and insight per_model usage are
merged and fed to bioscancast.llm.pricing.estimate_cost for a single
USD figure in the final epilogue and manifest.

Persistence:
  data/runs/{question_id}/{run_id}/
    question.json, search.json, filtered.json, documents.json,
    insight.json, manifest.json
The manifest is rewritten after every stage so a crashed run keeps
partial timings + config. On any stage exception the manifest pins
the failing stage and re-raises wrapped in PipelineError; main()
exits 1.

Empty intermediate output is not an error - logged and passed through
(the insight stage already handles zero documents).
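
The manifest-after-every-stage and error-wrapping behavior can be sketched as follows. This is a simplified stand-in for run_pipeline: the stage callables, the manifest fields, and the helper names are assumptions. Only the PipelineError name, the directory layout, and the "rewrite the manifest after each stage" rule come from the commit message.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Callable, Dict, Sequence, Tuple


class PipelineError(RuntimeError):
    """Wraps a stage failure and records which stage failed."""

    def __init__(self, stage: str, cause: BaseException):
        super().__init__(f"stage {stage!r} failed: {cause}")
        self.stage = stage


def _write_manifest(run_dir: Path, manifest: Dict[str, Any]) -> None:
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))


def run_stages(
    run_dir: Path,
    stages: Sequence[Tuple[str, Callable[[Any], Any]]],
    payload: Any,
) -> Any:
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest: Dict[str, Any] = {"timings_s": {}, "failed_stage": None}
    for name, stage_fn in stages:
        start = time.perf_counter()
        try:
            payload = stage_fn(payload)
        except Exception as exc:
            manifest["failed_stage"] = name  # pin the failing stage
            _write_manifest(run_dir, manifest)
            raise PipelineError(name, exc) from exc
        manifest["timings_s"][name] = round(time.perf_counter() - start, 4)
        (run_dir / f"{name}.json").write_text(json.dumps(payload))
        _write_manifest(run_dir, manifest)  # partial results survive a crash
    return payload


def _failing_filter(_: Any) -> Any:
    raise ValueError("simulated filter failure")


with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "q7" / "demo-run"
    demo_stages = [("search", lambda q: {"results": []}), ("filter", _failing_filter)]
    try:
        run_stages(demo_dir, demo_stages, {"id": "q7"})
    except PipelineError as err:
        failed_stage = err.stage
    manifest_after_crash = json.loads((demo_dir / "manifest.json").read_text())
```

In the demo the search stage returns an empty result list, which passes through without error, and the manifest still holds the search timing after the filter stage fails.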
Thin scripts/run_pipeline.py wrapper around bioscancast.main:main so the
new orchestrator follows the same `scripts/run_*.py` convention as the
existing per-stage runners. Both invocations are equivalent:

    python -m bioscancast.main q7
    python scripts/run_pipeline.py q7

data/runs/ added to .gitignore so per-run artifacts (some quite large -
documents.json includes every chunk text) don't pollute the repo.
Two small fixes uncovered by the first live runs of the orchestrator:

1. persistence._json_default crashed on FILTER_CONFIG's set values
   (blocked_domains, low_value_url_keywords, etc.) when serializing the
   manifest. Now sorts sets to lists.

2. pricing.MODEL_PRICES needs the dated aliases OpenAI returns in
   response.model. A request to "gpt-4o-mini" comes back tagged
   "gpt-4o-mini-2024-07-18", which was missing from the table and
   produced a $0 cost estimate with a noisy warning. Added that alias
   plus a couple of known gpt-4o dated variants.

q7 historical-replay run subsequently cost $0.0030, q12 live run cost
$0.0049, both correctly reported in the manifest.
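
The set-serialization fix amounts to a json default hook along these lines. This is a sketch: the function name and example config are illustrative, and it assumes set members are mutually comparable (e.g. all strings).

```python
import json
from typing import Any


def json_default(obj: Any) -> Any:
    """Fallback for json.dumps: emit sets as sorted lists for stable output."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


# Made-up config values for illustration.
example_config = {"blocked_domains": {"b.example", "a.example"}}
dumped = json.dumps(example_config, default=json_default)
```

Sorting also makes manifests diff cleanly between runs, since set iteration order is not stable.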
Dashboard URLs from the curated registry in
bioscancast/datasets/biosecurity_sources.py have hand-picked titles
("Dashboard: cdc.gov") and generic snippets that produce
keyword_overlap_score = 0.000 against any real forecast question. The
heuristic priority score drags them under the 0.72 keep threshold even
though they are by construction high-value sources.

Live-run evidence: q7 and q12 each injected two dashboards. All four
had keyword_overlap = 0.000. Two of those four were dropped pre-LLM,
including ourworldindata.org for q7 - which is the resolution source
named in the question's relevant_links column.

Fix: in heuristic_filter, detect retrieval_reason == "dashboard_lookup"
and auto-keep with reason_code "dashboard_lookup_bypass" and a synthetic
priority_score of 1.0. The dashboards still go through the rest of the
filtering pipeline (dedup, per-domain cap, extraction-hint assignment)
unchanged - this is the keyword-overlap chokepoint only.

Implements item 1 from the Tier 1 roadmap. Pairs with the dashboard
title/snippet enrichment in the next commit.
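
A sketch of the bypass, assuming a simplified candidate record. The Candidate dataclass and function name are hypothetical; the retrieval_reason value, the reason code, and the synthetic priority of 1.0 are from the commit message.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    """Hypothetical, simplified stand-in for a search result in the filter."""

    url: str
    retrieval_reason: str
    priority_score: float = 0.0
    keep: bool = False
    selection_reasons: List[str] = field(default_factory=list)


def apply_dashboard_bypass(candidate: Candidate) -> bool:
    """Auto-keep curated dashboards; return True if the bypass fired."""
    if candidate.retrieval_reason != "dashboard_lookup":
        return False
    candidate.keep = True
    candidate.priority_score = 1.0  # synthetic score, not a heuristic result
    candidate.selection_reasons.append("dashboard_lookup_bypass")
    return True
```

Organic results fall through untouched and continue to the normal keyword-overlap scoring.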
The previous dashboard injection used generic strings ("Dashboard: cdc.gov",
"Known mpox monitoring dashboard") that produced keyword_overlap_score
= 0.000 against every real forecast question - 4/4 injected dashboards
in the q7/q12 live runs had this exact failure mode.

The fix: turn DASHBOARD_LOOKUP into a list of DashboardEntry dataclasses
carrying url + title + snippet, with hand-written pathogen-specific text
for each entry. The titles read as real search-result titles ("CDC H5N1
bird flu situation summary: human cases and outbreaks in the United
States") and the snippets describe what data the page hosts.

Pairs with the previous commit's dashboard heuristic bypass: even with
the bypass in place, better titles still help (a) the keyword-overlap
score for downstream scoring, and (b) the LLM rescue path when it
encounters other pathogen-specific dashboards we add later. The bypass
keeps low-keyword-overlap dashboards alive; this commit makes them
discoverable on their own merits.

Implements item 5 from the Tier 1/2 roadmap.
Live runs on q7 and q12 showed filter survival of 4.7% and 13.5%
respectively, even with LLM rescue enabled. The 0.72 threshold was set
without benchmarking against real Tavily output and is too tight for
the heuristic's actual signal.

With the new threshold, priority_scores in the 0.65-0.72 band are
auto-kept by heuristics instead of routed to the LLM rescue path. The
borderline threshold (0.45) is unchanged, so the LLM filter still
gates 0.45-0.65 candidates - the change just moves the auto-keep line
to better match what the heuristic can actually distinguish.

Implements item 2 from the Tier 1 roadmap. Pairs with the dashboard
bypass + enrichment commits to attack the filter chokepoint from
multiple angles.
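
The three bands can be sketched as a small routing function. The thresholds are from this commit; the function name, the band labels, and the inclusive (>=) boundaries are assumptions.

```python
BORDERLINE_THRESHOLD = 0.45  # unchanged
OLD_KEEP_THRESHOLD = 0.72
NEW_KEEP_THRESHOLD = 0.65


def route_candidate(
    priority_score: float, keep_threshold: float = NEW_KEEP_THRESHOLD
) -> str:
    """Return which path a candidate takes after heuristic scoring."""
    if priority_score >= keep_threshold:
        return "keep"
    if priority_score >= BORDERLINE_THRESHOLD:
        return "llm_rescue"
    return "reject"
```

For example, a candidate scoring 0.70 was routed to LLM rescue under the old threshold and is auto-kept under the new one.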
q12 Record 3 reported metric_name=deaths, metric_value=160 from the
source quote "160 suspected deaths out of 670 suspected cases". The
prompt's canonical vocabulary already had `suspected_cases` (for the
670) but no `suspected_deaths` slot - so the model collapsed the
"suspected" qualifier and emitted plain `deaths`. The result was an
arithmetically scandalous record (160 deaths against 61 confirmed
cases) that wouldn't survive any reasonable post-hoc sanity check.

Changes:

- Add `suspected_deaths` alongside the existing `suspected_cases`
  entry. Two-tier system per category (confirmed_* and suspected_*),
  matching the agreed shape of the vocab.
- Drop the now-redundant standalone `probable_cases` line; the
  `suspected_cases` description explicitly covers "suspected",
  "probable", and "possible" as the same tier. WHO's combined
  `confirmed_or_probable_cases` bucket is kept separately because it
  is a distinct reporting category.
- Add a deaths-family mapping rule paralleling the existing
  cases-family rule ("suspected deaths", "probable deaths", "deaths
  under investigation" all map to suspected_deaths). Mirror what the
  existing `confirmed_cases` line does for the cases family.
- Clean up the stale "possible all get their own variants below"
  parenthetical which referenced a `possible_cases` slot that has
  never existed.

After this, q12's "160 suspected deaths" should extract as
suspected_deaths=160 rather than deaths=160 - same value, correct
category, no longer competing with confirmed_cases for downstream
forecasting weight.

Implements item 3 from the Tier 1 roadmap.
q12 Record 4 reported metric_value=82 with the quote
"the outbreak now poses a 'very high' risk for Congo - up from a
previous categorization of 'high'" - no digit, no number-word, no
relative reference. The hallucination guard's verbatim-substring check
passed because the quote string did appear in the source chunk, but
nothing in the guard required the quote sentence to be the one
actually carrying the figure. The metric_value of 82 came from
elsewhere in the chunk; the quote was a "supporting context"
sentence.

A deterministic post-hoc check (str(metric_value) in quote) would
over-reject: word numbers ("a dozen"), relative quantities ("a
quarter of the population"), and number-word forms ("ninety-nine
thousand") would all be false-positive rejections. So the fix lives
at the prompt level instead: tell the model the quote MUST be the
sentence that carries the figure - digits, number-word, or a clear
relative reference. A purely contextual sentence is not acceptable.

The verbatim-substring guard remains the safety net. This change
tightens the model's understanding of what `quote` is supposed to do
without committing to a brittle programmatic check that would lose
real signal on legitimate paraphrases.

Implements item 4 from the Tier 2 roadmap.
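
To make the over-rejection argument concrete, here is the naive check the commit decided against. The function name is hypothetical. The first two quotes are from the q12 records described in this PR; the word-number quote is an invented example.

```python
def naive_value_in_quote(metric_value: int, quote: str) -> bool:
    """The rejected approach: require the value's digits inside the quote."""
    return str(metric_value) in quote


# Digits present: accepted.
ok_digits = naive_value_in_quote(
    160, "160 suspected deaths out of 670 suspected cases"
)

# q12 Record 4: a purely contextual quote, rejected as intended.
contextual = naive_value_in_quote(
    82,
    "the outbreak now poses a 'very high' risk for Congo - up from a "
    "previous categorization of 'high'",
)

# Invented example: a legitimate word-number quote is rejected too.
word_number = naive_value_in_quote(12, "a dozen new cases were reported")
```

The last case is the false rejection that motivates keeping the requirement in the prompt rather than in code.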
When the filtering stage passes only a handful of documents through to
insight, per-document retrieval depth becomes the bottleneck on
coverage. q7's live run reached insight with 2 usable documents and
hit retrieval_top_k=12 on each - meaning at most ~24 chunk
extractions for the whole question. Bumping per-doc retrieval depth
costs little and gives the model more chances to find the relevant
figure in each surviving document.

Adds two InsightConfig fields:
- low_survival_doc_threshold (default 5)
- low_survival_top_k (default 20)

When the count of usable documents (status != "failed" and
non-empty chunks) is at or below the threshold, both retrieval_top_k
and max_chunks_per_document effectively rise to low_survival_top_k for
the run, and a note is appended to InsightRunResult.notes flagging
that the adaptive path engaged. Default behavior is unchanged when the
threshold is not hit (normal top_k=12).

Tests that pin retrieval_top_k to small values to control fake-LLM
call counts now also pin low_survival_top_k to the same value, opting
out of the adaptive lift explicitly. 447 tests still passing.

Implements item 6 from the Tier 2 roadmap. Completes the planned
bundle of items 1+2+3+4+5+6.
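
The adaptive lift can be sketched as below. The two new field names and their defaults are from this commit. The max_chunks_per_document default is a placeholder, and using max() so the lift never lowers a value is an assumption about what "rise" means.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class InsightConfig:
    """Hypothetical subset of the real config."""

    retrieval_top_k: int = 12
    max_chunks_per_document: int = 12  # placeholder default
    low_survival_doc_threshold: int = 5
    low_survival_top_k: int = 20


def effective_depth(config: InsightConfig, usable_docs: int) -> Tuple[int, int]:
    """Return (retrieval_top_k, max_chunks_per_document) for this run."""
    if usable_docs <= config.low_survival_doc_threshold:
        lifted = config.low_survival_top_k
        return (
            max(config.retrieval_top_k, lifted),
            max(config.max_chunks_per_document, lifted),
        )
    return config.retrieval_top_k, config.max_chunks_per_document
```

Pinning low_survival_top_k to the same small value as retrieval_top_k, as the updated tests do, makes the lift a no-op.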
q7's second live run on this branch surfaced an interaction between the
new dashboard heuristic bypass and the per-domain cap. With
max_docs_per_domain=2 and the dashboard bypass injecting one who.int
slot at synthetic priority 1.0, the cap was effectively reducing
who.int to ONE organic slot - and the slot was going to a
priority-0.7097 strategic-plan announcement page, squeezing out the
priority-0.6966 WHO mpox research event page that the baseline run had
extracted records from.

Offline filter replay on the saved q7 search.json confirms the
mechanism:

  Heuristic-keep (4 who.int / ourworldindata.org docs):
    1.0000 WHO sitreps dashboard (bypass)
    1.0000 OWID mpox dashboard (bypass)
    0.7097 WHO global strategic preparedness plan (organic)
    0.6966 WHO mpox research event (organic) <- baseline's data source

  After old cap_per_domain (max=2 per domain):
    Dashboards displace one organic each; research event capped out.

The fix: dashboard-bypass docs (selection_reasons contains
"dashboard_lookup_bypass") are always kept and do not consume a slot
against the per-domain or per-type caps. They are curated additions,
not competing organic results.

After the change all four candidates survive, and the WHO research
event page reaches insight as it did in the baseline.

447 tests still passing.
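
The q7 mechanism can be reproduced with a small sketch of the per-domain cap. This is not the real cap_per_domain_and_type: the per-type cap is omitted and the tuple layout is an assumption. The four documents and their scores are the ones listed in this commit message.

```python
from collections import Counter
from typing import List, Tuple

# (domain, priority_score, is_dashboard_bypass)
Doc = Tuple[str, float, bool]


def cap_per_domain(
    docs: List[Doc], max_per_domain: int = 2, exempt_bypass: bool = True
) -> List[Doc]:
    """Keep the highest-priority docs per domain; optionally exempt dashboards."""
    kept: List[Doc] = []
    used: Counter = Counter()
    for doc in sorted(docs, key=lambda d: d[1], reverse=True):
        domain, _, is_bypass = doc
        if exempt_bypass and is_bypass:
            kept.append(doc)  # curated addition: consumes no slot
            continue
        if used[domain] < max_per_domain:
            used[domain] += 1
            kept.append(doc)
    return kept


q7_docs: List[Doc] = [
    ("who.int", 1.0, True),  # WHO sitreps dashboard (bypass)
    ("ourworldindata.org", 1.0, True),  # OWID mpox dashboard (bypass)
    ("who.int", 0.7097, False),  # strategic preparedness plan
    ("who.int", 0.6966, False),  # mpox research event
]
old_result = cap_per_domain(q7_docs, exempt_bypass=False)
new_result = cap_per_domain(q7_docs, exempt_bypass=True)
```

With exempt_bypass=False the research event page is capped out; with the exemption all four candidates survive.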
The epilogue previously printed a single total-cost figure. This splits
cost by stage (search / filter / insight) so a cost spike in any one
stage is visible at a glance during iteration.

Mechanism: _UsageTrackingClient gains a snapshot() method; run_pipeline
snapshots the shared tracker after search and after filter and diffs
them (_usage_delta) to attribute usage to each stage. Insight reports
its own budget_summary per_model as before. The extract stage makes no
LLM calls, so it shows timing only.

Manifest gains stage_usage and stage_costs_usd alongside the existing
combined_usage and estimated_cost_usd. Epilogue now reads e.g.:

  search     23.87s  $0.0009
  filter      3.06s  $0.0003
  extract    59.69s
  insight    12.28s  $0.0030
  total cost:  $0.0042

Implements item 11 from the roadmap. 447 tests still passing.
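
The snapshot-and-diff attribution can be sketched as follows. The class is a simplified stand-in for _UsageTrackingClient and the token counts are made up; only the snapshot() and usage-delta idea comes from the commit message.

```python
from typing import Dict


class UsageTracker:
    """Accumulates token usage across calls made through a shared client."""

    def __init__(self) -> None:
        self._totals: Dict[str, int] = {"input_tokens": 0, "output_tokens": 0}

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self._totals["input_tokens"] += input_tokens
        self._totals["output_tokens"] += output_tokens

    def snapshot(self) -> Dict[str, int]:
        return dict(self._totals)  # copy, so later calls do not mutate it


def usage_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Usage that accrued between two snapshots."""
    return {key: after[key] - before.get(key, 0) for key in after}


tracker = UsageTracker()
start = tracker.snapshot()
tracker.record(1000, 200)  # calls made during the search stage
after_search = tracker.snapshot()
tracker.record(300, 50)  # calls made during the filter stage
after_filter = tracker.snapshot()

stage_usage = {
    "search": usage_delta(start, after_search),
    "filter": usage_delta(after_search, after_filter),
}
```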
Item 8 investigation: of the three quotes the guard dropped across the
q12 live runs, two were real facts lost purely because the model
lowercased the leading letter of a sentence it quoted from
mid-paragraph:

  source: "There are now 750 suspected cases and 177 suspected deaths"
  model:  "there are now 750 suspected cases and 177 suspected deaths"

  source: "The Congolese Ministry of Communication, in a post to X ...
           said that there were 904 suspected cases and 119 ..."
  model:  "the Congolese Ministry of Communication, in a post to X ..."

The third rejection was a genuine content-insertion hallucination -
the model bolted the real prefix "a total of 105 confirmed cases
(including 10 deaths)" onto a fabricated continuation "...have been
reported in Ituri, North Kivu, and South Kivu" (the source actually
continues "...and 906 suspected cases").

Fix: a fourth, case-insensitive substring layer. It returns the chunk's
own casing so the stored quote still reflects the source verbatim. The
key safety property - verified against the real q12 fabrication and
captured in a new regression test - is that case-folding does NOT
recover content insertions: a fabricated continuation fails the
substring test regardless of case.

Tests: new _LAYER4_CASE_INSENSITIVE_CASES (the two recovered q12
quotes) plus a hallucination case mirroring the q12 fabrication that
must stay rejected. 450 passed (was 447; +3 guard cases).

Implements item 8 from the roadmap.
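
A sketch of the case-insensitive layer using a regex search, which returns the chunk's own casing. The function name is hypothetical and the real guard may be implemented differently. The first example is the q12 quote above; the fabrication example uses an abridged reconstruction of the source sentence for illustration.

```python
import re
from typing import Optional


def match_case_insensitive(quote: str, chunk: str) -> Optional[str]:
    """Return the chunk's verbatim text matching quote, ignoring case."""
    if not quote:
        return None
    match = re.search(re.escape(quote), chunk, flags=re.IGNORECASE)
    return match.group(0) if match else None


chunk_a = "There are now 750 suspected cases and 177 suspected deaths"
recovered = match_case_insensitive(
    "there are now 750 suspected cases and 177 suspected deaths", chunk_a
)

# Abridged reconstruction of the source for the fabrication case.
chunk_b = "A total of 105 confirmed cases (including 10 deaths) and 906 suspected cases"
fabricated = match_case_insensitive(
    "a total of 105 confirmed cases (including 10 deaths) have been "
    "reported in Ituri, North Kivu, and South Kivu",
    chunk_b,
)
```

The lowercased quote is recovered with the source's casing, while the fabricated continuation still fails because the inserted words are not in the chunk.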
search_stage_score was 0.5*domain + 0.3*freshness + 0.2*rank with no topical-relevance signal, so high-authority but off-topic results ranked at the top (e.g. sports/legal/unrelated-pathogen news). It is now 0.45*relevance + 0.30*domain + 0.10*freshness + 0.15*rank, reusing the filter's keyword_overlap_score/build_query_terms. Freshness is kept low because it is near-uniform in live mode. Addresses #4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The CDC mpox dashboard URL returned 404; replaced with the (extractable) monkeypox/situation-summary page. Updated two stale redirects (afro.who.int ebola-disease, cdc.gov/ebola/about). DASHBOARD_LOOKUP routing was an exact lowercase key match, so 'marburg virus disease' failed to route to the 'marburg' key; added _resolve_pathogen_key with alias + substring matching (marburg virus disease->marburg, monkeypox->mpox, bird flu->h5n1). Addresses #3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reputable outbreak reporting from outlets like CNN, NBC, CBS, ABC, NPR, USA Today, LA Times, Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist, Time, The Atlantic, Ars Technica and Business Insider was resolving to the 'unknown' tier (domain_score 0.2), sinking it below the filter's credibility floor. Promote them to Tier 3 (trusted_media, 0.6); second-level-domain matching covers subdomains. Relates to #13.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When llm_client is None the ambiguous rerank band was always rejected (fail-closed), which is overly aggressive for dev/offline/no-API-key runs. Add a default-off FILTER_CONFIG flag 'no_llm_soft_fallback' (+ no_llm_fallback_relevance_threshold) that instead keeps a borderline candidate iff it is an official domain OR its keyword-overlap relevance clears the threshold, approximating the LLM-rescue path. Production (always has an LLM client) is unchanged. Addresses #13.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

smodee commented Jun 3, 2026

Closing this in favour of three independent split PRs that are easier to review:

Each PR stands alone and they can merge in any order. Per-PR test counts: 447 / 452 / 450; together they reproduce the 455 of this branch. The feat/end-to-end-orchestrator branch is kept around (unchanged, not merged) so downstream work that already branches off it isn't disturbed during the review of the three split PRs.

@smodee smodee closed this Jun 3, 2026

Development

Successfully merging this pull request may close these issues.

Dashboard-injected results have low keyword overlap due to generic titles
