End-to-end pipeline orchestrator + filter/extraction quality bundle #28
Closed
smodee wants to merge 21 commits into
bioscancast_questions.csv stores created_date as an Excel serial day number (e.g. 45712). pd.to_datetime without unit="D" and origin="1899-12-30" treated those integers as nanoseconds past 1970, yielding garbage dates like 1970-01-01 00:00:00.000045712. The bug was latent (no caller had yet relied on the parsed date), but the new orchestrator's build_forecast_question factory needs an accurate created_at. After the fix, q7 resolves to 2025-02-24 as expected.
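The serial-to-date arithmetic can be checked without pandas. The following is a minimal stdlib sketch of the same conversion (day offset from 1899-12-30); the helper name excel_serial_to_utc is hypothetical and is not the loader's actual code.

```python
from datetime import datetime, timedelta, timezone

# Day zero of Excel's 1900 date system, i.e. the origin="1899-12-30" mentioned above.
EXCEL_EPOCH = datetime(1899, 12, 30, tzinfo=timezone.utc)


def excel_serial_to_utc(serial: float) -> datetime:
    """Convert an Excel serial day number to a tz-aware UTC datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


created_at = excel_serial_to_utc(45712)
print(created_at.date().isoformat())  # 2025-02-24
```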
The orchestrator (next commit) needs to turn a CSV row into a typed ForecastQuestion. Maps:
- created_date -> tz-aware UTC datetime (already parsed by load_questions)
- topic "Pathogen (Region)" -> lowercased pathogen + region
- question_text "by Month day, year" -> target_date via regex; falls back to "by Month year" giving the first of next month
- question_type + keyword hints in text -> event_type (case_count / death_count / outbreak_declared / None)
- resolution_criteria passes through
- as_of_date is a factory kwarg, not a CSV column; orchestrator passes it from --as-of-date
Tested against all 11 rows of bioscancast_questions.csv; q7 produces ForecastQuestion(id=q7, pathogen=mpox, region=world, target_date=2025-02-28, event_type=case_count, ...).
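A rough sketch of the target_date rule described above (full date first, then the month-only fallback). The regexes, the MONTHS table and the function name parse_target_date are assumptions for illustration; the factory's real patterns may differ.

```python
import re
from datetime import date
from typing import Optional

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# "by February 28, 2025" (full date) and "by March 2025" (month only).
FULL_DATE = re.compile(r"\bby\s+([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})")
MONTH_YEAR = re.compile(r"\bby\s+([A-Za-z]+)\s+(\d{4})")


def parse_target_date(question_text: str) -> Optional[date]:
    """Return the deadline named in the question text, or None if absent."""
    m = FULL_DATE.search(question_text)
    if m and m.group(1).lower() in MONTHS:
        return date(int(m.group(3)), MONTHS[m.group(1).lower()], int(m.group(2)))
    m = MONTH_YEAR.search(question_text)
    if m and m.group(1).lower() in MONTHS:
        month, year = MONTHS[m.group(1).lower()], int(m.group(2))
        # Fallback: first day of the following month; December rolls into January.
        return date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return None
```

The question strings used to exercise this sketch would be made up; only the two phrasings ("by Month day, year" and "by Month year") come from the commit message.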
Branch-local question fixture for the new end-to-end orchestrator's live smoke tests. Two rows:
- q7: verbatim copy of the row in bioscancast/stages/eval_stage/bioscancast_questions.csv. Resolved at 126,441 mpox cases globally by Feb 28 2025. Run with --as-of-date 2025-02-28 to exercise historical replay.
- q12: new live question on the current East Africa Ebola outbreak, target_date 2026-06-30. Run with no --as-of-date for live mode.
Kept separate from bioscancast_questions.csv so the canonical CSV stays an unmodified record of what human forecasters actually evaluated.
bioscancast/llm/pricing.py introduces:
- MODEL_PRICES: USD/1M-token snapshot dated 2026-05-27 for the models actually used by stage configs (gpt-4o-mini, gpt-4o, text-embedding-3-small/large) plus a date-pinned gpt-4o-2024-08-06 alias.
- estimate_cost(model, input_tokens, output_tokens, cached_input_tokens): computes USD spend with a 50% discount on cached prefix per OpenAI's standard prompt-cache pricing.
- estimate_cost_from_summary(): consumes the dict shape that InsightRunResult.budget_summary already produces.
Sources cited in the module docstring. Unknown model raises UnknownModelError so the orchestrator surfaces stale price tables loudly rather than under-reporting cost.
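A minimal sketch of the cost arithmetic, for orientation only. The prices below are placeholders for a made-up "example-model", not the 2026-05-27 snapshot, and the sketch assumes cached_input_tokens is a subset of input_tokens, which the commit message does not state.

```python
class UnknownModelError(KeyError):
    """Raised when a model has no entry in the price table."""


# Placeholder prices in USD per 1M tokens; NOT the real MODEL_PRICES snapshot.
MODEL_PRICES = {"example-model": {"input": 1.00, "output": 2.00}}
CACHED_INPUT_DISCOUNT = 0.5  # the 50% cached-prefix discount described above


def estimate_cost(model, input_tokens, output_tokens, cached_input_tokens=0):
    """Estimate USD spend for one model's token usage."""
    try:
        price = MODEL_PRICES[model]
    except KeyError:
        # Fail loudly so a stale price table is noticed instead of under-reporting.
        raise UnknownModelError(model) from None
    # Assumption: cached tokens are counted inside input_tokens.
    fresh_tokens = input_tokens - cached_input_tokens
    input_cost = (
        fresh_tokens * price["input"]
        + cached_input_tokens * price["input"] * CACHED_INPUT_DISCOUNT
    )
    return (input_cost + output_tokens * price["output"]) / 1_000_000
```

With the placeholder prices, 1M input tokens (400k of them cached) plus 500k output tokens comes to 0.6 + 0.2 + 1.0 = 1.8 USD.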
New module with the run-directory layout
(data/runs/{question_id}/{run_id}/) and per-stage JSON dump helpers:
save_question / save_search / save_filtered / save_documents /
save_insight / save_manifest. _json_default and the asdict pattern are
lifted from scripts/eval_insight_on_real_docs.py so the orchestrator
and the eval harness share serialization conventions.
Replaces the 14-line commented sketch with a real argparse-driven
orchestrator that chains all four stages for a single ForecastQuestion:
python -m bioscancast.main q7 --as-of-date 2025-02-28 -v
Pipeline:
1. load_question_by_id reads the CSV row and builds a ForecastQuestion
via the new factory (applying any CLI overrides).
2. SearchStagePipeline runs with a usage-tracking LLM wrapper so
search + filter token usage is accumulated for cost reporting.
3. FilteringPipeline reuses the same wrapped client.
4. ExtractionPipeline gets as_of_date=question.as_of_date so the
fetcher uses Wayback snapshots in historical-replay mode.
5. InsightPipeline receives the raw (unwrapped) client; its
BudgetTracker already tracks usage, so wrapping would double-count.
After all stages, search/filter usage and insight per_model usage are
merged and fed to bioscancast.llm.pricing.estimate_cost for a single
USD figure in the final epilogue and manifest.
Persistence:
data/runs/{question_id}/{run_id}/
question.json, search.json, filtered.json, documents.json,
insight.json, manifest.json
The manifest is rewritten after every stage so a crashed run keeps
partial timings + config. On any stage exception the manifest pins
the failing stage and re-raises wrapped in PipelineError; main()
exits 1.
Empty intermediate output is not an error - logged and passed through
(the insight stage already handles zero documents).
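The manifest-and-error behaviour described above can be sketched roughly as follows. This is an illustrative simplification, not the code in bioscancast/main.py: run_stages, the manifest keys and the stage callables are hypothetical names.

```python
import json
import time
from pathlib import Path


class PipelineError(RuntimeError):
    """Wraps a stage failure and records which stage raised."""

    def __init__(self, stage, cause):
        super().__init__(f"stage {stage!r} failed: {cause}")
        self.stage = stage


def run_stages(stages, run_dir):
    """Run (name, callable) pairs in order, feeding each the previous output."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"timings_s": {}, "failed_stage": None}
    payload = None
    for name, stage in stages:
        start = time.perf_counter()
        try:
            payload = stage(payload)
        except Exception as exc:
            manifest["failed_stage"] = name  # pin the failing stage
            raise PipelineError(name, exc) from exc
        finally:
            # Rewrite the manifest after every stage so a crashed run
            # still leaves partial timings on disk.
            manifest["timings_s"][name] = round(time.perf_counter() - start, 3)
            (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return payload
```

A caller such as main() would catch PipelineError and exit with status 1, as the commit message describes.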
Thin scripts/run_pipeline.py wrapper around bioscancast.main:main so the
new orchestrator follows the same `scripts/run_*.py` convention as the
existing per-stage runners. Both invocations are equivalent:
python -m bioscancast.main q7
python scripts/run_pipeline.py q7
data/runs/ added to .gitignore so per-run artifacts (some quite large -
documents.json includes every chunk text) don't pollute the repo.
Two small fixes uncovered by the first live runs of the orchestrator:
1. persistence._json_default crashed on FILTER_CONFIG's set values (blocked_domains, low_value_url_keywords, etc.) when serializing the manifest. Now sorts sets to lists.
2. pricing.MODEL_PRICES needs the dated aliases OpenAI returns in response.model. A request to "gpt-4o-mini" comes back tagged "gpt-4o-mini-2024-07-18", which was missing from the table and produced a $0 cost estimate with a noisy warning. Added that alias plus a couple of known gpt-4o dated variants.
q7 historical-replay run subsequently cost $0.0030, q12 live run cost $0.0049, both correctly reported in the manifest.
Dashboard URLs from the curated registry in
bioscancast/datasets/biosecurity_sources.py have hand-picked titles
("Dashboard: cdc.gov") and generic snippets that produce
keyword_overlap_score = 0.000 against any real forecast question. The
heuristic priority score drags them under the 0.72 keep threshold even
though they are by construction high-value sources.
Live-run evidence: q7 and q12 each injected two dashboards. All four
had keyword_overlap = 0.000. Two of those four were dropped pre-LLM,
including ourworldindata.org for q7 - which is the resolution source
named in the question's relevant_links column.
Fix: in heuristic_filter, detect retrieval_reason == "dashboard_lookup"
and auto-keep with reason_code "dashboard_lookup_bypass" and a synthetic
priority_score of 1.0. The dashboards still go through the rest of the
filtering pipeline (dedup, per-domain cap, extraction-hint assignment)
unchanged - this is the keyword-overlap chokepoint only.
Implements item 1 from the Tier 1 roadmap. Pairs with the dashboard
title/snippet enrichment in the next commit.
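The bypass can be pictured with the sketch below. Candidate is a hypothetical stand-in for the filter's real document type and heuristic_keep is a simplified single-document view of heuristic_filter; only the retrieval_reason value, the reason code, the synthetic score and the 0.72 threshold come from the text above.

```python
from dataclasses import dataclass, field

HEURISTIC_KEEP_THRESHOLD = 0.72  # keep threshold quoted in this commit message


@dataclass
class Candidate:
    """Hypothetical stand-in for a search result entering the filter."""
    url: str
    retrieval_reason: str
    priority_score: float
    selection_reasons: list = field(default_factory=list)


def heuristic_keep(candidate: Candidate) -> bool:
    """Return True if the candidate is auto-kept by heuristics."""
    if candidate.retrieval_reason == "dashboard_lookup":
        # Curated dashboards skip the keyword-overlap chokepoint entirely.
        candidate.priority_score = 1.0
        candidate.selection_reasons.append("dashboard_lookup_bypass")
        return True
    return candidate.priority_score >= HEURISTIC_KEEP_THRESHOLD
```

Later stages (dedup, caps, extraction hints) would still see the kept dashboard like any other document.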
The previous dashboard injection used generic strings ("Dashboard: cdc.gov",
"Known mpox monitoring dashboard") that produced keyword_overlap_score
= 0.000 against every real forecast question - 4/4 injected dashboards
in the q7/q12 live runs had this exact failure mode.
The fix: turn DASHBOARD_LOOKUP into a list of DashboardEntry dataclasses
carrying url + title + snippet, with hand-written pathogen-specific text
for each entry. The titles read as real search-result titles ("CDC H5N1
bird flu situation summary: human cases and outbreaks in the United
States") and the snippets describe what data the page hosts.
Pairs with the previous commit's dashboard heuristic bypass: even with
the bypass in place, better titles still help (a) the keyword-overlap
score for downstream scoring, and (b) the LLM rescue path when it
encounters other pathogen-specific dashboards we add later. The bypass
keeps low-keyword-overlap dashboards alive; this commit makes them
discoverable on their own merits.
Implements item 5 from the Tier 1/2 roadmap.
Live runs on q7 and q12 showed filter survival of 4.7% and 13.5% respectively, even with LLM rescue enabled. The 0.72 threshold was set without benchmarking against real Tavily output and is too tight for the heuristic's actual signal.
With the new 0.65 threshold, priority_scores in the 0.65-0.72 band are auto-kept by heuristics instead of routed to the LLM rescue path. The borderline threshold (0.45) is unchanged, so the LLM filter still gates 0.45-0.65 candidates - the change just moves the auto-keep line to better match what the heuristic can actually distinguish.
Implements item 2 from the Tier 1 roadmap. Pairs with the dashboard bypass + enrichment commits to attack the filter chokepoint from multiple angles.
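As a sketch, the three bands after this change look like the function below. The function name and return labels are hypothetical, and treating scores under 0.45 as rejected, and the boundaries as inclusive lower bounds, are assumptions.

```python
HEURISTIC_KEEP_THRESHOLD = 0.65  # lowered from 0.72 by this commit
BORDERLINE_THRESHOLD = 0.45      # unchanged


def route(priority_score: float) -> str:
    """Classify a candidate by its heuristic priority score."""
    if priority_score >= HEURISTIC_KEEP_THRESHOLD:
        return "keep"        # auto-kept by heuristics
    if priority_score >= BORDERLINE_THRESHOLD:
        return "llm_rescue"  # borderline band, gated by the LLM filter
    return "reject"          # assumption: below the borderline band is dropped
```

A score of 0.70, which previously fell into the LLM rescue band, is now kept outright.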
q12 Record 3 reported metric_name=deaths, metric_value=160 from the
source quote "160 suspected deaths out of 670 suspected cases". The
prompt's canonical vocabulary already had `suspected_cases` (for the
670) but no `suspected_deaths` slot - so the model collapsed the
"suspected" qualifier and emitted plain `deaths`. The result was an
arithmetically scandalous record (160 deaths against 61 confirmed
cases) that wouldn't survive any reasonable post-hoc sanity check.
Changes:
- Add `suspected_deaths` alongside the existing `suspected_cases`
entry. Two-tier system per category (confirmed_* and suspected_*),
matching the agreed shape of the vocab.
- Drop the now-redundant standalone `probable_cases` line; the
`suspected_cases` description explicitly covers "suspected",
"probable", and "possible" as the same tier. WHO's combined
`confirmed_or_probable_cases` bucket is kept separately because it
is a distinct reporting category.
- Add a deaths-family mapping rule paralleling the existing
cases-family rule ("suspected deaths", "probable deaths", "deaths
under investigation" all map to suspected_deaths). Mirror what the
existing `confirmed_cases` line does for the cases family.
- Clean up the stale "possible all get their own variants below"
parenthetical which referenced a `possible_cases` slot that has
never existed.
After this, q12's "160 suspected deaths" should extract as
suspected_deaths=160 rather than deaths=160 - same value, correct
category, no longer competing with confirmed_cases for downstream
forecasting weight.
Implements item 3 from the Tier 1 roadmap.
q12 Record 4 reported metric_value=82 with the quote
"the outbreak now poses a 'very high' risk for Congo - up from a
previous categorization of 'high'" - no digit, no number-word, no
relative reference. The hallucination guard's verbatim-substring check
passed because the quote string did appear in the source chunk, but
nothing in the guard required the quote sentence to be the one
actually carrying the figure. The metric_value of 82 came from
elsewhere in the chunk; the quote was a "supporting context"
sentence.
A deterministic post-hoc check (str(metric_value) in quote) would
over-reject: word numbers ("a dozen"), relative quantities ("a
quarter of the population"), and number-word forms ("ninety-nine
thousand") would all be false-positive rejections. So the fix lives
at the prompt level instead: tell the model the quote MUST be the
sentence that carries the figure - digits, number-word, or a clear
relative reference. A purely contextual sentence is not acceptable.
The verbatim-substring guard remains the safety net. This change
tightens the model's understanding of what `quote` is supposed to do
without committing to a brittle programmatic check that would lose
real signal on legitimate paraphrases.
Implements item 4 from the Tier 2 roadmap.
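To make the over-rejection argument concrete, here is the naive check applied to a few quotes. The first and last quotes come from the q12 records discussed in these commits; the "dozen" and "ninety-nine thousand" sentences are invented examples.

```python
def naive_quote_check(metric_value, quote: str) -> bool:
    """The deterministic check considered and rejected above."""
    return str(metric_value) in quote


# Correctly rejects the purely contextual quote from q12 Record 4.
print(naive_quote_check(82, "the outbreak now poses a 'very high' risk for Congo"))  # False

# But it would also reject legitimate quotes that spell the figure out.
print(naive_quote_check(12, "a dozen new cases were confirmed"))             # False
print(naive_quote_check(99000, "ninety-nine thousand doses were shipped"))   # False

# It only passes when the digits appear literally.
print(naive_quote_check(670, "160 suspected deaths out of 670 suspected cases"))  # True
```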
When the filtering stage passes only a handful of documents through to insight, per-document retrieval depth becomes the bottleneck on coverage. q7's live run reached insight with 2 usable documents and hit retrieval_top_k=12 on each - meaning at most ~24 chunk extractions for the whole question. Bumping per-doc retrieval depth costs little and gives the model more chances to find the relevant figure in each surviving document.
Adds two InsightConfig fields:
- low_survival_doc_threshold (default 5)
- low_survival_top_k (default 20)
When the count of usable documents (status != "failed" and non-empty chunks) is at or below the threshold, both retrieval_top_k and max_chunks_per_document effectively rise to low_survival_top_k for the run, and a note is appended to InsightRunResult.notes flagging that the adaptive path engaged. Default behavior when the document threshold is not hit (normal top_k=12) is unchanged.
Tests that pin retrieval_top_k to small values to control fake-LLM call counts now also pin low_survival_top_k to the same value, opting out of the adaptive lift explicitly. 447 tests still passing.
Implements item 6 from the Tier 2 roadmap. Completes the planned bundle of items 1+2+3+4+5+6.
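A minimal sketch of the adaptive rule, showing only the fields named above. The helper effective_top_k is hypothetical, and using max() so the lift never lowers the depth is an assumption, not something the commit message states.

```python
from dataclasses import dataclass


@dataclass
class InsightConfig:
    """Only the fields relevant to the adaptive lift; the real class has more."""
    retrieval_top_k: int = 12
    low_survival_doc_threshold: int = 5
    low_survival_top_k: int = 20


def effective_top_k(usable_docs: int, cfg: InsightConfig) -> int:
    """Per-document retrieval depth for a run with `usable_docs` documents."""
    if usable_docs <= cfg.low_survival_doc_threshold:
        # Assumption: never go below the normal depth when the lift engages.
        return max(cfg.retrieval_top_k, cfg.low_survival_top_k)
    return cfg.retrieval_top_k
```

Under the defaults, q7's 2 usable documents would be read at depth 20 instead of 12. A test that pins both retrieval_top_k and low_survival_top_k to 3 keeps depth 3 either way, which matches the opt-out described above.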
q7's second live run on this branch surfaced an interaction between the
new dashboard heuristic bypass and the per-domain cap. With
max_docs_per_domain=2 and the dashboard bypass injecting one who.int
slot at synthetic priority 1.0, the cap was effectively reducing
who.int to ONE organic slot - and the slot was going to a
priority-0.7097 strategic-plan announcement page, squeezing out the
priority-0.6966 WHO mpox research event page that the baseline run had
extracted records from.
Offline filter replay on the saved q7 search.json confirms the
mechanism:
Heuristic-keep (4 who.int / ourworldindata.org docs):
1.0000 WHO sitreps dashboard (bypass)
1.0000 OWID mpox dashboard (bypass)
0.7097 WHO global strategic preparedness plan (organic)
0.6966 WHO mpox research event (organic) <- baseline's data source
After old cap_per_domain (max=2 per domain):
Dashboards displace one organic each; research event capped out.
The fix: dashboard-bypass docs (selection_reasons contains
"dashboard_lookup_bypass") are always kept and do not consume a slot
against the per-domain or per-type caps. They are curated additions,
not competing organic results.
After the change all four candidates survive, and the WHO research
event page reaches insight as it did in the baseline.
447 tests still passing.
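The replay above can be reproduced with a small sketch. Doc and cap_per_domain are hypothetical simplifications (the real function also caps per type), and the exempt_dashboards switch exists only to show old versus new behaviour side by side.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a heuristic-kept document."""
    title: str
    domain: str
    priority_score: float
    selection_reasons: list = field(default_factory=list)


def cap_per_domain(docs, max_docs_per_domain=2, exempt_dashboards=True):
    """Keep the highest-priority docs per domain, optionally exempting dashboards."""
    kept, used = [], defaultdict(int)
    for doc in sorted(docs, key=lambda d: d.priority_score, reverse=True):
        is_bypass = "dashboard_lookup_bypass" in doc.selection_reasons
        if exempt_dashboards and is_bypass:
            kept.append(doc)  # curated addition: kept without consuming a slot
            continue
        if used[doc.domain] < max_docs_per_domain:
            used[doc.domain] += 1
            kept.append(doc)
    return kept


docs = [
    Doc("WHO sitreps dashboard", "who.int", 1.0, ["dashboard_lookup_bypass"]),
    Doc("OWID mpox dashboard", "ourworldindata.org", 1.0, ["dashboard_lookup_bypass"]),
    Doc("WHO global strategic preparedness plan", "who.int", 0.7097),
    Doc("WHO mpox research event", "who.int", 0.6966),
]
print(len(cap_per_domain(docs, exempt_dashboards=False)))  # 3 (research event capped out)
print(len(cap_per_domain(docs, exempt_dashboards=True)))   # 4 (all candidates survive)
```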
The epilogue previously printed a single total-cost figure. This splits cost by stage (search / filter / insight) so a cost spike in any one stage is visible at a glance during iteration.
Mechanism: _UsageTrackingClient gains a snapshot() method; run_pipeline snapshots the shared tracker after search and after filter and diffs them (_usage_delta) to attribute usage to each stage. Insight reports its own budget_summary per_model as before. The extract stage makes no LLM calls, so it shows timing only.
Manifest gains stage_usage and stage_costs_usd alongside the existing combined_usage and estimated_cost_usd.
Epilogue now reads e.g.:
search 23.87s $0.0009
filter 3.06s $0.0003
extract 59.69s
insight 12.28s $0.0030
total cost: $0.0042
Implements item 11 from the roadmap. 447 tests still passing.
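The snapshot-and-diff idea can be sketched as below. UsageTracker is a hypothetical simplification of the bookkeeping inside _UsageTrackingClient, and the per-model dict shape is an assumption.

```python
import copy


class UsageTracker:
    """Accumulates token usage per model across several stages."""

    def __init__(self):
        self._usage = {}  # model -> {"input_tokens": int, "output_tokens": int}

    def record(self, model, input_tokens, output_tokens):
        entry = self._usage.setdefault(model, {"input_tokens": 0, "output_tokens": 0})
        entry["input_tokens"] += input_tokens
        entry["output_tokens"] += output_tokens

    def snapshot(self):
        # Deep copy so later record() calls do not mutate the snapshot.
        return copy.deepcopy(self._usage)


def usage_delta(before, after):
    """Usage accrued between two snapshots, keyed by model."""
    delta = {}
    for model, counts in after.items():
        previous = before.get(model, {})
        diff = {key: value - previous.get(key, 0) for key, value in counts.items()}
        if any(diff.values()):
            delta[model] = diff
    return delta
```

Taking one snapshot before search, one after search and one after filter yields two deltas, which can then be priced separately per stage.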
Item 8 investigation: of the three quotes the guard dropped across the
q12 live runs, two were real facts lost purely because the model
lowercased the leading letter of a sentence it quoted from
mid-paragraph:
source: "There are now 750 suspected cases and 177 suspected deaths"
model: "there are now 750 suspected cases and 177 suspected deaths"
source: "The Congolese Ministry of Communication, in a post to X ...
said that there were 904 suspected cases and 119 ..."
model: "the Congolese Ministry of Communication, in a post to X ..."
The third rejection was a genuine content-insertion hallucination -
the model bolted the real prefix "a total of 105 confirmed cases
(including 10 deaths)" onto a fabricated continuation "...have been
reported in Ituri, North Kivu, and South Kivu" (the source actually
continues "...and 906 suspected cases").
Fix: a fourth, case-insensitive substring layer. It returns the chunk's
own casing so the stored quote still reflects the source verbatim. The
key safety property - verified against the real q12 fabrication and
captured in a new regression test - is that case-folding does NOT
recover content insertions: a fabricated continuation fails the
substring test regardless of case.
Tests: new _LAYER4_CASE_INSENSITIVE_CASES (the two recovered q12
quotes) plus a hallucination case mirroring the q12 fabrication that
must stay rejected. 450 passed (was 447; +3 guard cases).
Implements item 8 from the roadmap.
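One way to express such a layer is a case-insensitive regex search that returns the matched span from the chunk. This is a sketch of the idea, not the guard's actual implementation; the function name is hypothetical.

```python
import re
from typing import Optional


def match_quote_case_insensitive(quote: str, chunk: str) -> Optional[str]:
    """Return the chunk's own text for `quote`, ignoring case, or None."""
    if not quote.strip():
        return None
    # re.escape treats the quote as a literal string, so parentheses,
    # dots and other punctuation in the quote cannot act as regex syntax.
    match = re.search(re.escape(quote), chunk, flags=re.IGNORECASE)
    return match.group(0) if match else None


chunk = "There are now 750 suspected cases and 177 suspected deaths"
quote = "there are now 750 suspected cases and 177 suspected deaths"
print(match_quote_case_insensitive(quote, chunk))
# There are now 750 suspected cases and 177 suspected deaths
```

Because the match is still a literal substring test, the fabricated continuation from the third q12 rejection does not match the real chunk text regardless of case.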
search_stage_score was 0.5*domain + 0.3*freshness + 0.2*rank with no topical-relevance signal, so high-authority but off-topic results ranked at the top (e.g. sports/legal/unrelated-pathogen news). It is now 0.45*relevance + 0.30*domain + 0.10*freshness + 0.15*rank, reusing the filter's keyword_overlap_score/build_query_terms. Freshness is kept low because it is near-uniform in live mode. Addresses #4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
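A small sketch comparing the old and new formulas. It assumes every signal lies in [0, 1] and takes the rank term as 1/rank, as the PR summary writes it; the sample inputs are invented.

```python
WEIGHTS = {"relevance": 0.45, "domain": 0.30, "freshness": 0.10, "rank": 0.15}


def search_stage_score(relevance, domain, freshness, rank):
    """New score: topical relevance dominates; rank enters as 1/rank."""
    return (
        WEIGHTS["relevance"] * relevance
        + WEIGHTS["domain"] * domain
        + WEIGHTS["freshness"] * freshness
        + WEIGHTS["rank"] * (1.0 / rank)
    )


def old_search_stage_score(domain, freshness, rank):
    """Previous score with no relevance term."""
    return 0.5 * domain + 0.3 * freshness + 0.2 * (1.0 / rank)


# Invented inputs: an on-topic mid-authority result at rank 3 versus an
# off-topic top-authority result at rank 1.
on_topic = dict(relevance=1.0, domain=0.6, freshness=0.5, rank=3)
off_topic = dict(relevance=0.0, domain=1.0, freshness=0.5, rank=1)

print(search_stage_score(**on_topic) > search_stage_score(**off_topic))  # True
print(old_search_stage_score(0.6, 0.5, 3) > old_search_stage_score(1.0, 0.5, 1))  # False
```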
The CDC mpox dashboard URL returned 404; replaced with the (extractable) monkeypox/situation-summary page. Updated two stale redirects (afro.who.int ebola-disease, cdc.gov/ebola/about). DASHBOARD_LOOKUP routing was an exact lowercase key match, so 'marburg virus disease' failed to route to the 'marburg' key; added _resolve_pathogen_key with alias + substring matching (marburg virus disease->marburg, monkeypox->mpox, bird flu->h5n1). Addresses #3. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
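The tolerant routing might look roughly like this. The key tuple and alias table are an illustrative subset built from the examples above (the "ebola" key is an assumption), and the lookup order is a guess at one reasonable design.

```python
from typing import Optional

# Illustrative subset of DASHBOARD_LOOKUP keys; "ebola" is assumed here.
DASHBOARD_KEYS = ("marburg", "mpox", "h5n1", "ebola")
ALIASES = {"monkeypox": "mpox", "bird flu": "h5n1"}


def resolve_pathogen_key(pathogen: str) -> Optional[str]:
    """Map a free-text pathogen name onto a dashboard lookup key."""
    name = pathogen.strip().lower()
    if name in DASHBOARD_KEYS:  # exact match (the old behaviour)
        return name
    for alias, key in ALIASES.items():  # alias match
        if alias in name:
            return key
    for key in DASHBOARD_KEYS:  # substring match
        if key in name:
            return key
    return None


print(resolve_pathogen_key("Marburg virus disease"))  # marburg
```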
Reputable outbreak reporting from outlets like CNN, NBC, CBS, ABC, NPR, USA Today, LA Times, Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist, Time, The Atlantic, Ars Technica and Business Insider was resolving to the 'unknown' tier (domain_score 0.2), sinking it below the filter's credibility floor. Promote them to Tier 3 (trusted_media, 0.6); second-level-domain matching covers subdomains. Relates to #13. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
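A hedged sketch of the second-level-domain matching. Only the two tiers mentioned above are modeled, the domain set is a three-entry illustrative subset, and media_tier is a hypothetical helper rather than the function in source_tiers.py.

```python
from urllib.parse import urlparse

# Small illustrative subset of the promoted outlets.
TRUSTED_MEDIA_DOMAINS = {"cnn.com", "npr.org", "politico.com"}
TIER_SCORES = {"trusted_media": 0.6, "unknown": 0.2}


def media_tier(url: str):
    """Return (tier, domain_score) for a URL, matching subdomains too."""
    host = (urlparse(url).hostname or "").lower()
    for domain in TRUSTED_MEDIA_DOMAINS:
        # Exact host or any subdomain; the leading dot prevents
        # "notcnn.com" from matching "cnn.com".
        if host == domain or host.endswith("." + domain):
            return "trusted_media", TIER_SCORES["trusted_media"]
    return "unknown", TIER_SCORES["unknown"]


print(media_tier("https://edition.cnn.com/health/story"))  # ('trusted_media', 0.6)
```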
When llm_client is None the ambiguous rerank band was always rejected (fail-closed), which is overly aggressive for dev/offline/no-API-key runs. Add a default-off FILTER_CONFIG flag 'no_llm_soft_fallback' (+ no_llm_fallback_relevance_threshold) that instead keeps a borderline candidate iff it is an official domain OR its keyword-overlap relevance clears the threshold, approximating the LLM-rescue path. Production (always has an LLM client) is unchanged. Addresses #13. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
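The decision rule can be sketched as follows. The flag names come from the commit message; the threshold value 0.3 and the helper name are hypothetical.

```python
FILTER_CONFIG = {
    "no_llm_soft_fallback": False,               # default-off, as described above
    "no_llm_fallback_relevance_threshold": 0.3,  # hypothetical value
}


def keep_borderline_without_llm(is_official_domain, relevance, config=FILTER_CONFIG):
    """Decide a borderline candidate when no LLM client is available."""
    if not config["no_llm_soft_fallback"]:
        return False  # fail-closed: the previous and still-default behaviour
    threshold = config["no_llm_fallback_relevance_threshold"]
    return is_official_domain or relevance >= threshold
```

With the flag off, every borderline candidate is rejected. With it on, an official domain passes even at zero relevance, and a non-official one needs to clear the threshold.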
This was referenced May 28, 2026
Closing this in favour of three independent split PRs that are easier to review:
Each PR stands alone and they can merge in any order. Per-PR test counts: 447 / 452 / 450; together they reproduce the 455 of this branch. The |
End-to-end pipeline orchestrator + filter/extraction quality bundle
Summary
Adds the first end-to-end orchestrator that chains all four pipeline stages
(search → filter → extract → insight) for a single ForecastQuestion, plus a
bundle of filter- and extraction-quality improvements driven by live runs of
two real forecasting questions.
Before this branch the stages only ran in isolation: bioscancast/main.py was a
commented sketch, scripts/run_insight.py was synthetic-only, and
scripts/eval_insight_on_real_docs.py chained only extract→insight on hardcoded
fixtures. Forecasting (the next stage) needs a real InsightRecord stream;
without this orchestrator it would have had to build the chain itself, coupling
the two stages.
Built on #24 and #27
This branch was developed on a merge of feat/as-of-date-replay (#24) and
feat/insight-stage-hardening (#27), both now merged to main. It has since
been rebased onto main; the merge commits and a redundant contamination.py
migration commit (superseded by #27's own 2d77493) dropped out in the rebase,
so the diff is the orchestrator + bundle work described below.
What's included
Orchestrator
- bioscancast/main.py: run_pipeline() + argparse CLI. Chains the four stages with per-stage timing, JSON persistence, error wrapping (PipelineError pins the failing stage), and a cost estimate. Runnable as python -m bioscancast.main q7 … or via scripts/run_pipeline.py.
- bioscancast/orchestration/: new package with persistence.py (run-dir layout data/runs/{qid}/{run_id}/, per-stage JSON dumps, manifest) and test_questions.csv (q7 verbatim copy + a new q12 Ebola question; the canonical bioscancast_questions.csv is left untouched as the human-forecaster record).
- bioscancast/stages/eval_stage/loaders.py: build_forecast_question factory (CSV row → ForecastQuestion) + load_question_by_id; fixed the Excel-serial created_date parsing bug while here.
- bioscancast/llm/pricing.py: model price table (snapshot 2026-05-27) + estimate_cost; surfaces USD cost per run, broken down per stage.
Filter chokepoint (issues #13, #14)
- Heuristic bypass for curated dashboards (retrieval_reason == "dashboard_lookup"): they were getting keyword_overlap_score == 0.000 and being dropped despite being curated authoritative sources.
- Pathogen-specific dashboard titles and snippets in biosecurity_sources.py (was "Dashboard: cdc.gov").
- heuristic_keep_threshold lowered 0.72 → 0.65.
- Dashboard-bypass docs exempted from cap_per_domain_and_type so a curated dashboard doesn't displace an organic result on the same domain.
Search-stage relevance ranking (#4)
search_stage_score was 0.5·domain + 0.3·freshness + 0.2·(1/rank), with no
topical-relevance term, so high-authority but off-topic results ranked at the
top and consumed total_cap slots. It is now
0.45·relevance + 0.30·domain + 0.10·freshness + 0.15·(1/rank), where relevance
reuses the filter's keyword_overlap_score / build_query_terms. Freshness is
weighted low because it is near-uniform in live mode. Weights sum to 1.0; the
score drives ranking + truncation only.
Dashboard sources (#3)
- Replaced the broken CDC mpox dashboard URL (cdc.gov/mpox/data-research → 404; now the extractable monkeypox/situation-summary page) and updated two stale redirects (afro.who.int ebola-disease, cdc.gov/ebola/about).
- DASHBOARD_LOOKUP routing was an exact lowercase-key match, so the CSV-natural "marburg virus disease" failed to route to the marburg key (→ zero on-topic results). Added _resolve_pathogen_key with alias + substring matching (marburg virus disease→marburg, monkeypox→mpox, bird flu→h5n1).
Source tiers (#13)
Promoted reputable outlets (CNN, NBC, CBS, ABC, NPR, USA Today, LA Times,
Politico, Axios, Forbes, Bloomberg, FT, WSJ, Economist, Time, The Atlantic,
Ars Technica, Business Insider, …) from unknown (0.2) to Tier 3 trusted_media
(0.6). Legitimate outbreak reporting from these was being floored below the
filter's credibility threshold.
No-LLM filter fallback (#13)
Added a default-off flag FILTER_CONFIG["no_llm_soft_fallback"]
(+ no_llm_fallback_relevance_threshold). When llm_client is None, the ambiguous
rerank band was always rejected (fail-closed), which is too aggressive for
dev/offline/no-API-key runs. With the flag on, a borderline candidate is kept
iff it is an official domain OR clears the relevance threshold, approximating
the LLM-rescue path. Production (always has an LLM client) is unchanged.
Extraction quality
- suspected_deaths added to the controlled metric_name vocabulary (was collapsing "160 suspected deaths" into plain deaths).
- The prompt now requires the quote field to be the sentence carrying the figure (digits / number-word / relative reference), closing a gap where a metric_value was attached to a quantitatively-empty quote.
- A case-insensitive guard layer recovers quotes that the model lowercased the leading letter of, while still rejecting content-insertion fabrications (regression-tested with real q12 cases).
- Adaptive retrieval_top_k (12 → 20) when few documents survive filtering, so per-doc retrieval depth isn't the bottleneck on coverage.
Live verification (q7 + q12)
Cumulative API spend across all development runs: ~$0.03. q12's records now
distinguish suspected_cases / suspected_deaths and cross-document dedup merges
twin reports correctly.
Follow-up: systematic review of #3 / #4 / #13
A second pass tested 10 fresh live questions (H5N1, the 2026 DRC+Uganda Ebola
outbreak, the Andes-virus cruise hantavirus cluster, mpox, Marburg) spanning
range/binary/categorical, plus a hand-labeled offline sweep of filter thresholds
and search-stage weights and several end-to-end live runs. Total spend ~$0.03.
Key findings (these drove the four commits above):
- the dominant news story (here, Ebola dominated May 2026), the organic pool is flooded with off-topic high-authority content (sports/legal/other-pathogen news, even a nature.com nuclear-cladding paper), because the score had no relevance term. For mpox/Marburg/H5N1 only the injected dashboards were on-topic. → fixed by the relevance term added for #4 (Tune search_stage_score weights (0.5/0.3/0.2)).
- the captured pools); it is not a useful tuning knob. Official-source recall was already 1.0. The real recall sinks were (a) the no-LLM fail-closed path and (b) reputable outlets mis-tiered as unknown. → addressed by the tier + soft fallback commits, not by moving the threshold.
- summaries extract and yield records (WHO Marburg → many; CDC bird-flu → some); interactive trackers / index pages / the 404 yield zero (mpox produced 0 records until the URL fix → 2 after). Recommendation: prefer static fact-bearing dashboard URLs; consider not letting non-extractable dashboards consume survival slots.
- specs/tavily-historical-coverage.md): historical replay survival is dashboard/official-dominated with little fresh organic signal, so resolved-question accuracy cannot drive filter tuning, which is why the sweep was scored against hand labels, not forecast accuracy.
Known interaction to flag for reviewers: promoting outlets to Tier 3 raises
recall of legitimate reporting, but because the filter's priority_score still
weights credibility heavily (and the new relevance term lives in search
ranking, not the filter's keep decision), it can also admit off-topic pieces from
those same outlets. A sensible follow-up is to raise the filter's
keyword-overlap weight / lower its 0.25·credibility blend.
Issues this PR addresses
- (titles) and as a backstop (bypass + cap exemption).
- and retuned weights; the follow-up review showed the missing relevance signal, not the weight split, was the issue. Reviewers can likely close.
- value (bimodal extractability finding), fixed the broken/stale URLs, and made routing tolerant. One follow-up remains: deciding whether non-extractable interactive dashboards should consume survival slots (and trimming/expanding the list accordingly).
- cap exemption + tier coverage + opt-in no-LLM soft fallback). The review also showed the 0.65 threshold is on a flat plateau (not the lever) and that the filter's credibility-vs-relevance balance is the remaining knob (see the follow-up section). Reviewers decide whether the original bug is resolved.
Test plan
- python -m pytest bioscancast/tests/: 455 passed, 2 skipped (live). New tests cover the relevance scoring formula, tolerant dashboard routing, Tier 3 outlet coverage, and the no-LLM soft-fallback flag.
- inspected for filter survival, records, and cost.
- thresholds and search weights (see "Follow-up" section). Re-running mpox after the dashboard URL fix went from 0 → 2 insight records.
- cap_per_domain_and_type (curated dashboards never consume a domain slot).
- source_tiers.py and the credibility-vs-relevance interaction noted in the follow-up section.
- a flat plateau (a sweep was done; it's not the lever).
Not in scope (deliberate)
in forecasting, not insight.
(Strong-model refinement pass in the insight stage #26) — untouched.