End-to-end pipeline orchestrator + filter/extraction quality bundle by smodee · Pull Request #28 · algorithmicgovernance/BioScanCast

smodee · 2026-05-28T18:11:10Z

End-to-end pipeline orchestrator + filter/extraction quality bundle

Draft — not ready for review. Built on #24 and #27 (both now merged to
main); this branch has been rebased onto main, so the diff below is the
orchestrator + bundle work only. Flip to "Ready for review" when you want it
handed off.

Summary

Adds the first end-to-end orchestrator that chains all four pipeline stages
(search → filter → extract → insight) for a single ForecastQuestion, plus a
bundle of filter- and extraction-quality improvements driven by live runs of
two real forecasting questions.

Before this branch the stages only ran in isolation: bioscancast/main.py was a
commented sketch, scripts/run_insight.py was synthetic-only, and
scripts/eval_insight_on_real_docs.py chained only extract→insight on hardcoded
fixtures. Forecasting (the next stage) needs a real InsightRecord stream;
without this orchestrator it would have had to build the chain itself, coupling
the two stages.

Built on #24 and #27

This branch was developed on a merge of feat/as-of-date-replay (#24) and
feat/insight-stage-hardening (#27), both now merged to main. It has since
been rebased onto main; the merge commits and a redundant contamination.py
migration commit (superseded by #27's own 2d77493) dropped out in the rebase,
so the diff is the orchestrator + bundle work described below.

What's included

Orchestrator

bioscancast/main.py — run_pipeline() + argparse CLI. Chains the four
stages with per-stage timing, JSON persistence, error wrapping
(PipelineError pins the failing stage), and a cost estimate. Runnable as
python -m bioscancast.main q7 … or via scripts/run_pipeline.py.
bioscancast/orchestration/ — new package: persistence.py (run-dir
layout data/runs/{qid}/{run_id}/, per-stage JSON dumps, manifest) and
test_questions.csv (q7 verbatim copy + a new q12 Ebola question; the
canonical bioscancast_questions.csv is left untouched as the human-forecaster
record).
bioscancast/stages/eval_stage/loaders.py — build_forecast_question
factory (CSV row → ForecastQuestion) + load_question_by_id; fixed the
Excel-serial created_date parsing bug while here.
bioscancast/llm/pricing.py — model price table (snapshot 2026-05-27) +
estimate_cost; surfaces USD cost per run, broken down per stage.

Filter chokepoint (issues #13, #14)

Dashboard-injected results bypass the keyword-overlap heuristic
(retrieval_reason == "dashboard_lookup") — they were getting
keyword_overlap_score == 0.000 and being dropped despite being curated
authoritative sources.
Dashboard titles/snippets enriched with pathogen-specific text in
biosecurity_sources.py (was "Dashboard: cdc.gov").
heuristic_keep_threshold lowered 0.72 → 0.65.
Dashboard-bypass docs exempted from cap_per_domain_and_type so a curated
dashboard doesn't displace an organic result on the same domain.

Search-stage relevance ranking (#4)

search_stage_score was 0.5·domain + 0.3·freshness + 0.2·(1/rank) — no
topical-relevance term, so high-authority but off-topic results ranked at the
top and consumed total_cap slots. It now is
0.45·relevance + 0.30·domain + 0.10·freshness + 0.15·(1/rank), where
relevance reuses the filter's keyword_overlap_score/build_query_terms.
Freshness is weighted low because it is near-uniform in live mode. Weights sum
to 1.0; the score drives ranking + truncation only.
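
To make the effect of the reweighting concrete, here is a minimal sketch of the two formulas side by side. The function names and the sample inputs are illustrative assumptions (not the repository's code); only the weights come from the description above, and all component scores are assumed to lie in [0, 1] with a 1-based rank.

```python
def old_score(domain: float, freshness: float, rank: int) -> float:
    """Previous blend: no topical-relevance term."""
    return 0.5 * domain + 0.3 * freshness + 0.2 * (1.0 / rank)


def new_score(relevance: float, domain: float, freshness: float, rank: int) -> float:
    """New blend: relevance carries the largest weight; weights sum to 1.0."""
    return 0.45 * relevance + 0.30 * domain + 0.10 * freshness + 0.15 * (1.0 / rank)


# Hypothetical inputs: an off-topic result from a top-authority domain at rank 1,
# versus an on-topic result from a mid-authority domain at rank 5.
old_off_topic = old_score(domain=1.0, freshness=1.0, rank=1)
old_on_topic = old_score(domain=0.6, freshness=1.0, rank=5)
new_off_topic = new_score(relevance=0.0, domain=1.0, freshness=1.0, rank=1)
new_on_topic = new_score(relevance=1.0, domain=0.6, freshness=1.0, rank=5)

print(old_off_topic > old_on_topic)  # True: the off-topic result used to win
print(new_on_topic > new_off_topic)  # True: the on-topic result now ranks first
```

With these inputs the ordering flips, which is the behaviour the relevance term is meant to produce; real pools will of course contain less clear-cut cases.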

Dashboard sources (#3)

Fixed a broken dashboard URL (cdc.gov/mpox/data-research → 404; now the
extractable monkeypox/situation-summary page) and two stale redirects
(afro.who.int ebola-disease, cdc.gov/ebola/about).
DASHBOARD_LOOKUP routing was an exact lowercase-key match, so the
CSV-natural "marburg virus disease" failed to route to the marburg key
(→ zero on-topic results). Added _resolve_pathogen_key with alias +
substring matching (marburg virus disease→marburg, monkeypox→mpox,
bird flu→h5n1).
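
The tolerant routing can be sketched roughly as follows. This is an assumption-laden illustration, not the actual _resolve_pathogen_key: the registry keys, the lookup order, and the function name resolve_pathogen_key are hypothetical, and only the three alias pairs come from the text above.

```python
from typing import Optional

# Hypothetical registry keys; the real DASHBOARD_LOOKUP may contain others.
DASHBOARD_KEYS = ("marburg", "mpox", "h5n1", "ebola")

# Alias pairs named in the PR description.
ALIASES = {
    "marburg virus disease": "marburg",
    "monkeypox": "mpox",
    "bird flu": "h5n1",
}


def resolve_pathogen_key(pathogen: str) -> Optional[str]:
    """Map a free-text pathogen name to a registry key, or None if nothing fits."""
    name = pathogen.strip().lower()
    if name in DASHBOARD_KEYS:  # exact match (the old behaviour)
        return name
    if name in ALIASES:  # exact alias
        return ALIASES[name]
    for alias, key in ALIASES.items():  # alias phrase embedded in a longer name
        if alias in name:
            return key
    for key in DASHBOARD_KEYS:  # registry key embedded in a longer name
        if key in name:
            return key
    return None


print(resolve_pathogen_key("Marburg virus disease"))  # marburg
print(resolve_pathogen_key("measles"))                # None
```

Substring matching is deliberately permissive; a stricter variant could match on word boundaries if short keys start colliding with unrelated names.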

Source tiers (#13)

Promoted ~22 national/international outlets (CNN, NBC, CBS, ABC, NPR, USA
Today, LA Times, Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist, Time,
The Atlantic, Ars Technica, Business Insider, …) from unknown (0.2) to
Tier 3 trusted_media (0.6). Legitimate outbreak reporting from these was
being floored below the filter's credibility threshold.

No-LLM filter fallback (#13)

Default-off FILTER_CONFIG["no_llm_soft_fallback"] (+
no_llm_fallback_relevance_threshold). When llm_client is None, the ambiguous
rerank band was always rejected (fail-closed) — too aggressive for
dev/offline/no-API-key runs. With the flag on, a borderline candidate is kept
iff it is an official domain OR clears the relevance threshold, approximating
the LLM-rescue path. Production (always has an LLM client) is unchanged.
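
The decision rule for the borderline band without an LLM client can be sketched as below. The function name, the argument shapes, and the 0.3 threshold value are assumptions for illustration (the description does not state the threshold); the two config keys are the ones named above, and "clears" is read here as greater-than-or-equal.

```python
def keep_borderline_without_llm(is_official: bool, relevance: float, config: dict) -> bool:
    """Decide the fate of an ambiguous-band candidate when llm_client is None."""
    if not config.get("no_llm_soft_fallback", False):
        return False  # default: fail-closed, as before
    threshold = config["no_llm_fallback_relevance_threshold"]
    # Soft fallback: keep iff official domain OR relevance clears the threshold.
    return is_official or relevance >= threshold


# Hypothetical config values for the example only.
strict = {"no_llm_soft_fallback": False, "no_llm_fallback_relevance_threshold": 0.3}
soft = {"no_llm_soft_fallback": True, "no_llm_fallback_relevance_threshold": 0.3}

print(keep_borderline_without_llm(True, 0.0, strict))  # False
print(keep_borderline_without_llm(True, 0.0, soft))    # True
print(keep_borderline_without_llm(False, 0.1, soft))   # False
```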

Extraction quality

suspected_deaths added to the controlled metric_name vocabulary (was
collapsing "160 suspected deaths" into plain deaths).
Extraction prompt now requires the quote field to be the sentence carrying
the figure (digits / number-word / relative reference), closing a gap where a
metric_value was attached to a quantitatively-empty quote.
Hallucination guard gains a case-insensitive layer 4 — recovers real quotes
the model lowercased the leading letter of, while still rejecting
content-insertion fabrications (regression-tested with real q12 cases).
Adaptive retrieval_top_k (12 → 20) when few documents survive filtering, so
per-doc retrieval depth isn't the bottleneck on coverage.

Live verification (q7 + q12)

	q7 (Mpox, historical replay 2025-02-28)	q12 (Ebola, live)
filter survival (baseline → final)	2/43 → 5/38	5/37 → 7/40
insight records (baseline → final)	2 (stale) → 1	5 → 5 (now using `suspected_*`)
cost per run	~$0.004	~$0.005

Cumulative API spend across all development runs: ~$0.03. q12's records now
distinguish suspected_cases/suspected_deaths and cross-document dedup merges
twin reports correctly.

Follow-up: systematic review of #3 / #4 / #13

A second pass tested 10 fresh live questions (H5N1, the 2026 DRC+Uganda Ebola
outbreak, the Andes-virus cruise hantavirus cluster, mpox, Marburg) spanning
range/binary/categorical, plus a hand-labeled offline sweep of filter thresholds
and search-stage weights and several end-to-end live runs. Total spend ~$0.03.
Key findings (these drove the four commits above):

Search relevance was the root problem. When the question's pathogen is not
the dominant news story (here, Ebola dominated May 2026), the organic pool is
flooded with off-topic high-authority content (sports/legal/other-pathogen
news, even a nature.com nuclear-cladding paper), because the score had no
relevance term. For mpox/Marburg/H5N1 only the injected dashboards were
on-topic. → fixed by the #4 relevance term.
The 0.65 keep threshold sits on a wide plateau (0.60–0.775 are identical on
the captured pools); it is not a useful tuning knob. Official-source recall
was already 1.0. The real recall sinks were (a) the no-LLM fail-closed path and
(b) reputable outlets mis-tiered as unknown. → addressed by the tier + soft
fallback commits, not by moving the threshold.
Dashboard value is bimodal at extraction. Static factsheets / situation
summaries extract and yield records (WHO Marburg → many; CDC bird-flu → some);
interactive trackers / index pages / the 404 yield zero (mpox produced 0
records until the URL fix → 2 after). Recommendation: prefer static
fact-bearing dashboard URLs; consider not letting non-extractable dashboards
consume survival slots.
Replay confound confirmed (specs/tavily-historical-coverage.md): historical
replay survival is dashboard/official-dominated with little fresh organic
signal, so resolved-question accuracy cannot drive filter tuning — which is why
the sweep was scored against hand labels, not forecast accuracy.

Known interaction to flag for reviewers: promoting outlets to Tier 3 raises
recall of legitimate reporting, but because the filter's priority_score still
weights credibility heavily (and the new relevance term lives in search
ranking, not the filter's keep decision), it can also admit off-topic pieces from
those same outlets. A sensible follow-up is to raise the filter's
keyword-overlap weight / lower its 0.25·credibility blend.

Issues this PR addresses

Closes #14 (Dashboard-injected results have low keyword overlap due to generic titles): dashboard low-keyword-overlap problem fixed at the root
(titles) and as a backstop (bypass + cap exemption).
#4 (Tune search_stage_score weights (0.5/0.3/0.2)): addressed. Added the relevance term
and retuned weights; the follow-up review showed the missing relevance signal,
not the weight split, was the issue. Reviewers can likely close.
#3 (Evaluate dashboard lookup value vs organic search): substantially addressed. Evaluated
value (bimodal extractability finding), fixed the broken/stale URLs, and made
routing tolerant. One follow-up remains: deciding whether non-extractable
interactive dashboards should consume survival slots (and trimming/expanding
the list accordingly).
#13 (Heuristic filtering scores too low for real-world search results): materially improved (threshold + bypass +
cap exemption + tier coverage + opt-in no-LLM soft fallback). The review also
showed the 0.65 threshold is on a flat plateau (not the lever) and that the
filter's credibility-vs-relevance balance is the remaining knob; see the
follow-up section. Reviewers decide whether the original bug is resolved.
#5 (Assess Tavily published_date reliability): closed via #24 (Add historical-replay mode for benchmark fairness).

Test plan

python -m pytest bioscancast/tests/ — 455 passed, 2 skipped (live). New
tests cover the relevance scoring formula, tolerant dashboard routing, Tier 3
outlet coverage, and the no-LLM soft-fallback flag.
Live runs of q7 (historical replay) and q12 (live) end-to-end; artifacts
inspected for filter survival, records, and cost.
Follow-up: 10 fresh live questions + hand-labeled offline sweep of filter
thresholds and search weights (see "Follow-up" section). Re-running mpox after
the dashboard URL fix went from 0 → 2 insight records.
Reviewer: confirm the dashboard cap-exemption policy in
cap_per_domain_and_type (curated dashboards never consume a domain slot).
Reviewer: sanity-check the Tier 3 outlet additions in source_tiers.py
and the credibility-vs-relevance interaction noted in the follow-up section.
Reviewer: note the keep threshold stays at 0.65 (the value this PR lowered it
to from 0.72; the follow-up sweep did not move it further) and is now known to
sit on a flat plateau, so it's not the lever.

Not in scope (deliberate)

Forecasting stage (consumes these records; its own design).
Cumulative aggregation across records, target-date recency weighting — belong
in forecasting, not insight.
Sub-query count tuning (#2), Docling per-page re-run for large PDFs (#17),
strong-model refinement pass in the insight stage (#26): untouched.

bioscancast_questions.csv stores created_date as an Excel serial day number (e.g. 45712). pd.to_datetime without unit=D + origin=1899-12-30 treated those integers as nanoseconds past 1970, yielding garbage dates like 1970-01-01 00:00:00.000045712. The bug was latent — no caller had yet relied on the parsed date — but the new orchestrator's build_forecast_question factory needs an accurate created_at. After the fix, q7 resolves to 2025-02-24 as expected.
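
The conversion itself is easy to check with the standard library. The helper name below is an illustrative assumption; the pandas call the commit refers to would be along the lines of pd.to_datetime(serial, unit="D", origin="1899-12-30").

```python
from datetime import date, timedelta

# Excel's day-serial origin, as used in the fix described above.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial: int) -> date:
    """Convert an Excel serial day number to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)


print(excel_serial_to_date(45712))  # 2025-02-24
```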

The orchestrator (next commit) needs to turn a CSV row into a typed ForecastQuestion. Maps:
- created_date -> tz-aware UTC datetime (already parsed by load_questions)
- topic "Pathogen (Region)" -> lowercased pathogen + region
- question_text "by Month day, year" -> target_date via regex; falls back to "by Month year" giving the first of next month
- question_type + keyword hints in text -> event_type (case_count / death_count / outbreak_declared / None)
- resolution_criteria passes through
- as_of_date is a factory kwarg, not a CSV column; orchestrator passes it from --as-of-date
Tested against all 11 rows of bioscancast_questions.csv; q7 produces ForecastQuestion(id=q7, pathogen=mpox, region=world, target_date=2025-02-28, event_type=case_count, ...).
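
The target_date rule can be sketched as below. The regexes, the function name parse_target_date, and the sample question texts are assumptions for illustration; the factory's actual patterns may differ.

```python
import re
from datetime import date
from typing import Optional

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# "by Month day, year" and the "by Month year" fallback.
_BY_DAY = re.compile(r"\bby\s+([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})", re.IGNORECASE)
_BY_MONTH = re.compile(r"\bby\s+([A-Za-z]+)\s+(\d{4})", re.IGNORECASE)


def parse_target_date(question_text: str) -> Optional[date]:
    """Return the target date named in the question text, or None."""
    match = _BY_DAY.search(question_text)
    if match and match.group(1).lower() in MONTHS:
        try:
            return date(int(match.group(3)), MONTHS[match.group(1).lower()], int(match.group(2)))
        except ValueError:  # e.g. "February 30"
            return None
    match = _BY_MONTH.search(question_text)
    if match and match.group(1).lower() in MONTHS:
        month, year = MONTHS[match.group(1).lower()], int(match.group(2))
        # Fallback: the first day of the following month.
        return date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return None


print(parse_target_date("How many cases will be reported by February 28, 2025?"))  # 2025-02-28
print(parse_target_date("Will an outbreak be declared by December 2025?"))         # 2026-01-01
```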

Branch-local question fixture for the new end-to-end orchestrator's live smoke tests. Two rows:
- q7: verbatim copy of the row in bioscancast/stages/eval_stage/bioscancast_questions.csv. Resolved at 126,441 mpox cases globally by Feb 28 2025. Run with --as-of-date 2025-02-28 to exercise historical replay.
- q12: new live question on the current East Africa Ebola outbreak, target_date 2026-06-30. Run with no --as-of-date for live mode.
Kept separate from bioscancast_questions.csv so the canonical CSV stays an unmodified record of what human forecasters actually evaluated.

bioscancast/llm/pricing.py introduces:
- MODEL_PRICES: USD/1M-token snapshot dated 2026-05-27 for the models actually used by stage configs (gpt-4o-mini, gpt-4o, text-embedding-3-small/large) plus a date-pinned gpt-4o-2024-08-06 alias.
- estimate_cost(model, input_tokens, output_tokens, cached_input_tokens): computes USD spend with a 50% discount on cached prefix per OpenAI's standard prompt-cache pricing.
- estimate_cost_from_summary(): consumes the dict shape that InsightRunResult.budget_summary already produces.
Sources cited in the module docstring. Unknown model raises UnknownModelError so the orchestrator surfaces stale price tables loudly rather than under-reporting cost.
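
A minimal sketch of the cost arithmetic, under stated assumptions: the prices below are placeholders (not the values in the PR's table), the table shape is invented for the example, and cached_input_tokens is assumed to be a subset of input_tokens.

```python
class UnknownModelError(KeyError):
    """Raised when a model is missing from the price table."""


# Placeholder USD prices per 1M tokens, for illustration only.
MODEL_PRICES = {"example-model": {"input": 1.00, "output": 4.00}}


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> float:
    """Estimate USD spend; cached input tokens are billed at a 50% discount."""
    try:
        price = MODEL_PRICES[model]
    except KeyError:
        raise UnknownModelError(model) from None
    uncached = input_tokens - cached_input_tokens
    input_cost = (uncached + 0.5 * cached_input_tokens) * price["input"]
    output_cost = output_tokens * price["output"]
    return (input_cost + output_cost) / 1_000_000


# 1M input tokens (half of them cached) plus 250k output tokens.
print(estimate_cost("example-model", 1_000_000, 250_000, 500_000))  # 1.75
```

Raising on an unknown model, rather than returning zero, matches the stated intent of making a stale price table visible.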

New module with the run-directory layout (data/runs/{question_id}/{run_id}/) and per-stage JSON dump helpers: save_question / save_search / save_filtered / save_documents / save_insight / save_manifest. _json_default and the asdict pattern are lifted from scripts/eval_insight_on_real_docs.py so the orchestrator and the eval harness share serialization conventions.

Replaces the 14-line commented sketch with a real argparse-driven orchestrator that chains all four stages for a single ForecastQuestion:
    python -m bioscancast.main q7 --as-of-date 2025-02-28 -v
Pipeline:
1. load_question_by_id reads the CSV row and builds a ForecastQuestion via the new factory (applying any CLI overrides).
2. SearchStagePipeline runs with a usage-tracking LLM wrapper so search + filter token usage is accumulated for cost reporting.
3. FilteringPipeline reuses the same wrapped client.
4. ExtractionPipeline gets as_of_date=question.as_of_date so the fetcher uses Wayback snapshots in historical-replay mode.
5. InsightPipeline receives the raw (unwrapped) client; its BudgetTracker already tracks usage, so wrapping would double-count.
After all stages, search/filter usage and insight per_model usage are merged and fed to bioscancast.llm.pricing.estimate_cost for a single USD figure in the final epilogue and manifest.
Persistence: data/runs/{question_id}/{run_id}/ holds question.json, search.json, filtered.json, documents.json, insight.json, manifest.json.
The manifest is rewritten after every stage so a crashed run keeps partial timings + config. On any stage exception the manifest pins the failing stage and re-raises wrapped in PipelineError; main() exits 1. Empty intermediate output is not an error - logged and passed through (the insight stage already handles zero documents).
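
The error-wrapping and timing behaviour can be illustrated with a small stand-alone loop. Everything here except the PipelineError name is hypothetical: the run_stages helper, the manifest keys, and the stand-in stage callables are assumptions, and the real orchestrator also persists the manifest to disk after each stage.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class PipelineError(RuntimeError):
    """Wraps a stage failure and records which stage raised."""

    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"stage {stage!r} failed: {cause}")
        self.stage = stage


def run_stages(stages: List[Tuple[str, Callable[[Any], Any]]],
               payload: Any, manifest: Dict[str, Any]) -> Any:
    """Run stages in order, timing each and pinning the failing one."""
    for name, stage_fn in stages:
        start = time.perf_counter()
        try:
            payload = stage_fn(payload)
        except Exception as exc:
            manifest["failed_stage"] = name  # hypothetical manifest key
            raise PipelineError(name, exc) from exc
        finally:
            # Timings are recorded even for the stage that failed.
            manifest.setdefault("timings_s", {})[name] = time.perf_counter() - start
    return payload


def failing_filter(_results):
    raise ValueError("simulated filter failure")


manifest: Dict[str, Any] = {}
try:
    run_stages([("search", lambda q: [q]), ("filter", failing_filter)], "q7", manifest)
except PipelineError as err:
    print(err.stage)  # filter
```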

Thin scripts/run_pipeline.py wrapper around bioscancast.main:main so the new orchestrator follows the same `scripts/run_*.py` convention as the existing per-stage runners. Both invocations are equivalent:
    python -m bioscancast.main q7
    python scripts/run_pipeline.py q7
data/runs/ added to .gitignore so per-run artifacts (some quite large - documents.json includes every chunk text) don't pollute the repo.

Two small fixes uncovered by the first live runs of the orchestrator:
1. persistence._json_default crashed on FILTER_CONFIG's set values (blocked_domains, low_value_url_keywords, etc.) when serializing the manifest. Now sorts sets to lists.
2. pricing.MODEL_PRICES needs the dated aliases OpenAI returns in response.model. A request to "gpt-4o-mini" comes back tagged "gpt-4o-mini-2024-07-18", which was missing from the table and produced a $0 cost estimate with a noisy warning. Added that alias plus a couple of known gpt-4o dated variants.
q7 historical-replay run subsequently cost $0.0030, q12 live run cost $0.0049, both correctly reported in the manifest.
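
The set-serialization fix follows a standard pattern with json.dumps. The sketch below is not the project's _json_default (which also handles other types); the function name and the sample config values are assumptions.

```python
import json


def json_default(obj):
    """Fallback for json.dumps: emit sets as sorted lists for stable output."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)  # assumes mutually comparable members, e.g. all strings
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Hypothetical config fragment with a set value.
config = {"blocked_domains": {"b.example", "a.example"}}
dumped = json.dumps(config, default=json_default)
print(dumped)  # {"blocked_domains": ["a.example", "b.example"]}
```

Sorting, rather than a plain list() conversion, keeps manifests byte-stable across runs, which makes diffs between run directories meaningful.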

Dashboard URLs from the curated registry in bioscancast/datasets/biosecurity_sources.py have hand-picked titles ("Dashboard: cdc.gov") and generic snippets that produce keyword_overlap_score = 0.000 against any real forecast question. The heuristic priority score drags them under the 0.72 keep threshold even though they are by construction high-value sources. Live-run evidence: q7 and q12 each injected two dashboards. All four had keyword_overlap = 0.000. Two of those four were dropped pre-LLM, including ourworldindata.org for q7 - which is the resolution source named in the question's relevant_links column. Fix: in heuristic_filter, detect retrieval_reason == "dashboard_lookup" and auto-keep with reason_code "dashboard_lookup_bypass" and a synthetic priority_score of 1.0. The dashboards still go through the rest of the filtering pipeline (dedup, per-domain cap, extraction-hint assignment) unchanged - this is the keyword-overlap chokepoint only. Implements item 1 from the Tier 1 roadmap. Pairs with the dashboard title/snippet enrichment in the next commit.

The previous dashboard injection used generic strings ("Dashboard: cdc.gov", "Known mpox monitoring dashboard") that produced keyword_overlap_score = 0.000 against every real forecast question - 4/4 injected dashboards in the q7/q12 live runs had this exact failure mode. The fix: turn DASHBOARD_LOOKUP into a list of DashboardEntry dataclasses carrying url + title + snippet, with hand-written pathogen-specific text for each entry. The titles read as real search-result titles ("CDC H5N1 bird flu situation summary: human cases and outbreaks in the United States") and the snippets describe what data the page hosts. Pairs with the previous commit's dashboard heuristic bypass: even with the bypass in place, better titles still help (a) the keyword-overlap score for downstream scoring, and (b) the LLM rescue path when it encounters other pathogen-specific dashboards we add later. The bypass keeps low-keyword-overlap dashboards alive; this commit makes them discoverable on their own merits. Implements item 5 from the Tier 1/2 roadmap.

Live runs on q7 and q12 showed filter survival of 4.7% and 13.5% respectively, even with LLM rescue enabled. The 0.72 threshold was set without benchmarking against real Tavily output and is too tight for the heuristic's actual signal. With the new threshold, priority_scores in the 0.65-0.72 band are auto-kept by heuristics instead of routed to the LLM rescue path. The borderline threshold (0.45) is unchanged, so the LLM filter still gates 0.45-0.65 candidates - the change just moves the auto-keep line to better match what the heuristic can actually distinguish. Implements item 2 from the Tier 1 roadmap. Pairs with the dashboard bypass + enrichment commits to attack the filter chokepoint from multiple angles.
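
Taken together with the dashboard bypass, the heuristic's three bands can be sketched as follows. This is a simplified assumption of how the decision might look: the function name, the return shape, and every reason code except "dashboard_lookup_bypass" are hypothetical, and the comparisons are assumed inclusive.

```python
from typing import Tuple

KEEP_THRESHOLD = 0.65        # lowered from 0.72 by this commit
BORDERLINE_THRESHOLD = 0.45  # unchanged


def heuristic_decision(retrieval_reason: str, priority_score: float) -> Tuple[str, str, float]:
    """Return (decision, reason_code, score) for one search result."""
    if retrieval_reason == "dashboard_lookup":
        # Curated dashboards skip the keyword-overlap chokepoint.
        return ("keep", "dashboard_lookup_bypass", 1.0)
    if priority_score >= KEEP_THRESHOLD:
        return ("keep", "heuristic_keep", priority_score)
    if priority_score >= BORDERLINE_THRESHOLD:
        return ("llm_review", "borderline", priority_score)
    return ("reject", "below_borderline", priority_score)


print(heuristic_decision("organic", 0.70))           # ('keep', 'heuristic_keep', 0.7)
print(heuristic_decision("organic", 0.50))           # ('llm_review', 'borderline', 0.5)
print(heuristic_decision("dashboard_lookup", 0.10))  # ('keep', 'dashboard_lookup_bypass', 1.0)
```

A score of 0.70 would have gone to the LLM rescue path under the old 0.72 line; under the new one it is kept directly.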

q12 Record 3 reported metric_name=deaths, metric_value=160 from the source quote "160 suspected deaths out of 670 suspected cases". The prompt's canonical vocabulary already had `suspected_cases` (for the 670) but no `suspected_deaths` slot - so the model collapsed the "suspected" qualifier and emitted plain `deaths`. The result was an arithmetically implausible record (160 deaths against 61 confirmed cases) that wouldn't survive any reasonable post-hoc sanity check.
Changes:
- Add `suspected_deaths` alongside the existing `suspected_cases` entry. Two-tier system per category (confirmed_* and suspected_*), matching the agreed shape of the vocab.
- Drop the now-redundant standalone `probable_cases` line; the `suspected_cases` description explicitly covers "suspected", "probable", and "possible" as the same tier. WHO's combined `confirmed_or_probable_cases` bucket is kept separately because it is a distinct reporting category.
- Add a deaths-family mapping rule paralleling the existing cases-family rule ("suspected deaths", "probable deaths", "deaths under investigation" all map to suspected_deaths). This mirrors what the existing `confirmed_cases` line does for the cases family.
- Clean up the stale "possible all get their own variants below" parenthetical, which referenced a `possible_cases` slot that has never existed.
After this, q12's "160 suspected deaths" should extract as suspected_deaths=160 rather than deaths=160 - same value, correct category, no longer competing with confirmed_cases for downstream forecasting weight.
Implements item 3 from the Tier 1 roadmap.

q12 Record 4 reported metric_value=82 with the quote "the outbreak now poses a 'very high' risk for Congo - up from a previous categorization of 'high'" - no digit, no number-word, no relative reference. The hallucination guard's verbatim-substring check passed because the quote string did appear in the source chunk, but nothing in the guard required the quote sentence to be the one actually carrying the figure. The metric_value of 82 came from elsewhere in the chunk; the quote was a "supporting context" sentence. A deterministic post-hoc check (str(metric_value) in quote) would over-reject: word numbers ("a dozen"), relative quantities ("a quarter of the population"), and number-word forms ("ninety-nine thousand") would all be false-positive rejections. So the fix lives at the prompt level instead: tell the model the quote MUST be the sentence that carries the figure - digits, number-word, or a clear relative reference. A purely contextual sentence is not acceptable. The verbatim-substring guard remains the safety net. This change tightens the model's understanding of what `quote` is supposed to do without committing to a brittle programmatic check that would lose real signal on legitimate paraphrases. Implements item 4 from the Tier 2 roadmap.

When the filtering stage passes only a handful of documents through to insight, per-document retrieval depth becomes the bottleneck on coverage. q7's live run reached insight with 2 usable documents and hit retrieval_top_k=12 on each - meaning at most ~24 chunk extractions for the whole question. Bumping per-doc retrieval depth costs little and gives the model more chances to find the relevant figure in each surviving document.
Adds two InsightConfig fields:
- low_survival_doc_threshold (default 5)
- low_survival_top_k (default 20)
When the count of usable documents (status != "failed" and non-empty chunks) is at or below the threshold, both retrieval_top_k and max_chunks_per_document effectively rise to low_survival_top_k for the run, and a note is appended to InsightRunResult.notes flagging that the adaptive path engaged. The default path (doc threshold not hit -> normal top_k=12) is unchanged.
Tests that pin retrieval_top_k to small values to control fake-LLM call counts now also pin low_survival_top_k to the same value, opting out of the adaptive lift explicitly. 447 tests still passing.
Implements item 6 from the Tier 2 roadmap. Completes the planned bundle of items 1+2+3+4+5+6.
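
The adaptive rule can be sketched as below. The dataclass is a stand-in for the real InsightConfig (only the three fields shown are taken from the commit message), documents are modelled as plain dicts, and taking the max of the two depths is an assumption about what "effectively rise" means.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InsightConfigSketch:
    retrieval_top_k: int = 12
    low_survival_doc_threshold: int = 5
    low_survival_top_k: int = 20


def effective_top_k(config: InsightConfigSketch, documents: List[dict]) -> Tuple[int, bool]:
    """Return (top_k to use, whether the adaptive path engaged)."""
    usable = [d for d in documents if d.get("status") != "failed" and d.get("chunks")]
    if len(usable) <= config.low_survival_doc_threshold:
        return max(config.retrieval_top_k, config.low_survival_top_k), True
    return config.retrieval_top_k, False


docs = [
    {"status": "ok", "chunks": ["a"]},
    {"status": "ok", "chunks": ["b"]},
    {"status": "failed", "chunks": []},
]
print(effective_top_k(InsightConfigSketch(), docs))  # (20, True)
```

Pinning both depths to the same small value, as the updated tests do, makes the lift a no-op even when the adaptive path engages.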

q7's second live run on this branch surfaced an interaction between the new dashboard heuristic bypass and the per-domain cap. With max_docs_per_domain=2 and the dashboard bypass injecting one who.int slot at synthetic priority 1.0, the cap was effectively reducing who.int to ONE organic slot - and the slot was going to a priority-0.7097 strategic-plan announcement page, squeezing out the priority-0.6966 WHO mpox research event page that the baseline run had extracted records from.
Offline filter replay on the saved q7 search.json confirms the mechanism:
Heuristic-keep (4 who.int / ourworldindata.org docs):
    1.0000 WHO sitreps dashboard (bypass)
    1.0000 OWID mpox dashboard (bypass)
    0.7097 WHO global strategic preparedness plan (organic)
    0.6966 WHO mpox research event (organic) <- baseline's data source
After old cap_per_domain (max=2 per domain): Dashboards displace one organic each; research event capped out.
The fix: dashboard-bypass docs (selection_reasons contains "dashboard_lookup_bypass") are always kept and do not consume a slot against the per-domain or per-type caps. They are curated additions, not competing organic results. After the change all four candidates survive, and the WHO research event page reaches insight as it did in the baseline. 447 tests still passing.
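
A simplified sketch of the exemption, replaying the four q7 candidates listed above. The function below is an assumption, not the real cap_per_domain_and_type: it models only the per-domain cap (the per-type cap is omitted), and the dict field names other than selection_reasons are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def cap_per_domain(docs: List[dict], max_docs_per_domain: int = 2) -> List[dict]:
    """Keep the best docs per domain; dashboard-bypass docs never use a slot."""
    kept: List[dict] = []
    used: Dict[str, int] = defaultdict(int)
    for doc in sorted(docs, key=lambda d: d["priority_score"], reverse=True):
        if "dashboard_lookup_bypass" in doc.get("selection_reasons", ()):
            kept.append(doc)  # curated addition: always kept, no slot consumed
            continue
        if used[doc["domain"]] < max_docs_per_domain:
            used[doc["domain"]] += 1
            kept.append(doc)
    return kept


q7_candidates = [
    {"title": "WHO sitreps dashboard", "domain": "who.int", "priority_score": 1.0,
     "selection_reasons": ["dashboard_lookup_bypass"]},
    {"title": "OWID mpox dashboard", "domain": "ourworldindata.org", "priority_score": 1.0,
     "selection_reasons": ["dashboard_lookup_bypass"]},
    {"title": "WHO global strategic preparedness plan", "domain": "who.int",
     "priority_score": 0.7097, "selection_reasons": []},
    {"title": "WHO mpox research event", "domain": "who.int",
     "priority_score": 0.6966, "selection_reasons": []},
]

survivors = cap_per_domain(q7_candidates)
print(len(survivors))  # 4
```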

The epilogue previously printed a single total-cost figure. This splits cost by stage (search / filter / insight) so a cost spike in any one stage is visible at a glance during iteration.
Mechanism: _UsageTrackingClient gains a snapshot() method; run_pipeline snapshots the shared tracker after search and after filter and diffs them (_usage_delta) to attribute usage to each stage. Insight reports its own budget_summary per_model as before. The extract stage makes no LLM calls, so it shows timing only.
Manifest gains stage_usage and stage_costs_usd alongside the existing combined_usage and estimated_cost_usd. Epilogue now reads e.g.:
    search   23.87s  $0.0009
    filter    3.06s  $0.0003
    extract  59.69s
    insight  12.28s  $0.0030
    total cost: $0.0042
Implements item 11 from the roadmap. 447 tests still passing.
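
The snapshot-and-diff attribution can be illustrated with a toy tracker. The class and function below are hypothetical stand-ins for _UsageTrackingClient and _usage_delta; the real tracker presumably keeps per-model counters rather than the flat totals assumed here.

```python
from typing import Dict


class UsageTrackerSketch:
    """Accumulates token usage across calls made by several stages."""

    def __init__(self) -> None:
        self._totals: Dict[str, int] = {"input_tokens": 0, "output_tokens": 0}

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self._totals["input_tokens"] += input_tokens
        self._totals["output_tokens"] += output_tokens

    def snapshot(self) -> Dict[str, int]:
        return dict(self._totals)  # copy, so later calls don't mutate it


def usage_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Usage accrued between two snapshots."""
    return {key: after[key] - before.get(key, 0) for key in after}


tracker = UsageTrackerSketch()
tracker.record(100, 20)            # calls made during search
after_search = tracker.snapshot()
tracker.record(40, 5)              # calls made during filter
after_filter = tracker.snapshot()

print(usage_delta(after_search, after_filter))  # {'input_tokens': 40, 'output_tokens': 5}
```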

Item 8 investigation: of the three quotes the guard dropped across the q12 live runs, two were real facts lost purely because the model lowercased the leading letter of a sentence it quoted from mid-paragraph:
    source: "There are now 750 suspected cases and 177 suspected deaths"
    model:  "there are now 750 suspected cases and 177 suspected deaths"
    source: "The Congolese Ministry of Communication, in a post to X ... said that there were 904 suspected cases and 119 ..."
    model:  "the Congolese Ministry of Communication, in a post to X ..."
The third rejection was a genuine content-insertion hallucination - the model bolted the real prefix "a total of 105 confirmed cases (including 10 deaths)" onto a fabricated continuation "...have been reported in Ituri, North Kivu, and South Kivu" (the source actually continues "...and 906 suspected cases").
Fix: a fourth, case-insensitive substring layer. It returns the chunk's own casing so the stored quote still reflects the source verbatim. The key safety property - verified against the real q12 fabrication and captured in a new regression test - is that case-folding does NOT recover content insertions: a fabricated continuation fails the substring test regardless of case.
Tests: new _LAYER4_CASE_INSENSITIVE_CASES (the two recovered q12 quotes) plus a hallucination case mirroring the q12 fabrication that must stay rejected. 450 passed (was 447; +3 guard cases).
Implements item 8 from the roadmap.
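
A minimal sketch of a case-insensitive layer with the two properties described above. It uses a regex search rather than whatever the guard does internally, so treat the function name and approach as assumptions; the sample chunks are assembled from the quotes in this commit message plus invented surrounding words.

```python
import re
from typing import Optional


def case_insensitive_quote(quote: str, chunk: str) -> Optional[str]:
    """Return the chunk's own-cased span matching the quote, or None."""
    match = re.search(re.escape(quote), chunk, flags=re.IGNORECASE)
    return match.group(0) if match else None


chunk = "Officials gave an update. There are now 750 suspected cases and 177 suspected deaths."
lowercased = "there are now 750 suspected cases and 177 suspected deaths"
print(case_insensitive_quote(lowercased, chunk))
# There are now 750 suspected cases and 177 suspected deaths

source = "a total of 105 confirmed cases (including 10 deaths) and 906 suspected cases"
fabricated = ("a total of 105 confirmed cases (including 10 deaths) "
              "have been reported in Ituri, North Kivu, and South Kivu")
print(case_insensitive_quote(fabricated, source))  # None
```

Returning the matched span from the chunk, instead of the model's string, is what keeps the stored quote verbatim to the source.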

search_stage_score was 0.5*domain + 0.3*freshness + 0.2*rank with no topical-relevance signal, so high-authority but off-topic results ranked at the top (e.g. sports/legal/unrelated-pathogen news). It is now 0.45*relevance + 0.30*domain + 0.10*freshness + 0.15*rank, reusing the filter's keyword_overlap_score/build_query_terms. Freshness is kept low because it is near-uniform in live mode. Addresses #4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The CDC mpox dashboard URL returned 404; replaced with the (extractable) monkeypox/situation-summary page. Updated two stale redirects (afro.who.int ebola-disease, cdc.gov/ebola/about). DASHBOARD_LOOKUP routing was an exact lowercase key match, so 'marburg virus disease' failed to route to the 'marburg' key; added _resolve_pathogen_key with alias + substring matching (marburg virus disease->marburg, monkeypox->mpox, bird flu->h5n1). Addresses #3. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Reputable outbreak reporting from outlets like CNN, NBC, CBS, ABC, NPR, USA Today, LA Times, Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist, Time, The Atlantic, Ars Technica and Business Insider was resolving to the 'unknown' tier (domain_score 0.2), sinking it below the filter's credibility floor. Promote them to Tier 3 (trusted_media, 0.6); second-level-domain matching covers subdomains. Relates to #13. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

When llm_client is None the ambiguous rerank band was always rejected (fail-closed), which is overly aggressive for dev/offline/no-API-key runs. Add a default-off FILTER_CONFIG flag 'no_llm_soft_fallback' (+ no_llm_fallback_relevance_threshold) that instead keeps a borderline candidate iff it is an official domain OR its keyword-overlap relevance clears the threshold, approximating the LLM-rescue path. Production (always has an LLM client) is unchanged. Addresses #13. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

smodee · 2026-06-03T11:24:27Z

Closing this in favour of three independent split PRs that are easier to review:

#29: End-to-end pipeline orchestrator (the foundation work)
#30: Search/filter dashboard chokepoint + relevance ranking (Closes #14; addresses #3, #4, #13)
#31: Insight extraction quality

Each PR stands alone and they can merge in any order. Per-PR test counts: 447 / 452 / 450; together they reproduce the 455 of this branch. The feat/end-to-end-orchestrator branch is kept around (unchanged, not merged) so downstream work that already branches off it isn't disturbed during the review of the three split PRs.

smodee and others added 21 commits May 28, 2026 20:07

smodee closed this Jun 3, 2026

smodee wants to merge 21 commits into main from feat/end-to-end-orchestrator
