End-to-end pipeline orchestrator by smodee · Pull Request #29 · algorithmicgovernance/BioScanCast

smodee · 2026-06-03T11:23:05Z

End-to-end pipeline orchestrator

Split from the original feat/end-to-end-orchestrator branch (was #28).
Companion PRs: #30 (search/filter quality) and #31 (insight
extraction quality). All three are independent and can merge in any order.

Summary

Adds the first orchestrator that chains all four pipeline stages
(search → filter → extract → insight) for a single ForecastQuestion.
Before this branch the stages only ran in isolation: bioscancast/main.py was
a commented sketch, scripts/run_insight.py was synthetic-only, and
scripts/eval_insight_on_real_docs.py chained only extract→insight on
hardcoded fixtures. Forecasting (the next stage) needs a real InsightRecord
stream; without this orchestrator it would have had to build the chain itself,
coupling the two stages.

What's included

bioscancast/main.py — run_pipeline() + argparse CLI. Chains the four
stages with per-stage timing, JSON persistence, error wrapping
(PipelineError pins the failing stage), and a cost estimate. Runnable as
python -m bioscancast.main q7 … or via scripts/run_pipeline.py.
bioscancast/orchestration/ — new package: persistence.py (run-dir
layout data/runs/{qid}/{run_id}/, per-stage JSON dumps, manifest) and
test_questions.csv (q7 verbatim copy + a new q12 Ebola question; the
canonical bioscancast_questions.csv is left untouched as the human-
forecaster record).
bioscancast/stages/eval_stage/loaders.py — build_forecast_question
factory (CSV row → ForecastQuestion) + load_question_by_id; fixed an
Excel-serial created_date parsing bug while here.
bioscancast/llm/pricing.py — model price table (snapshot 2026-05-27) +
estimate_cost; surfaces USD cost per run, broken down per stage in the
orchestrator epilogue.

Verification

python -m pytest bioscancast/tests/ — 447 passed, 2 skipped (live).
Tests cover the orchestrator's stage timing/persistence, the
build_forecast_question factory, and the per-stage cost line-items.
Live verification: q7 (Mpox, historical replay 2025-02-28) and q12 (Ebola,
live) end-to-end runs producing artifacts inspected for filter survival,
records, and cost. Cumulative API spend across all development runs:
~$0.03.

Not in scope

Search/filter quality improvements — see Search/filter dashboard chokepoint + relevance ranking #30.
Insight extraction quality improvements — see Insight extraction quality #31.
Forecasting stage (consumes these records; its own design).
Cumulative aggregation across records, target-date recency weighting —
belong in forecasting, not insight.

bioscancast_questions.csv stores created_date as an Excel serial day number (e.g. 45712). pd.to_datetime without unit=D + origin=1899-12-30 treated those integers as nanoseconds past 1970, yielding garbage dates like 1970-01-01 00:00:00.000045712. The bug was latent — no caller had yet relied on the parsed date — but the new orchestrator's build_forecast_question factory needs an accurate created_at. After the fix, q7 resolves to 2025-02-24 as expected.

The orchestrator (next commit) needs to turn a CSV row into a typed ForecastQuestion. Maps: - created_date -> tz-aware UTC datetime (already parsed by load_questions) - topic "Pathogen (Region)" -> lowercased pathogen + region - question_text "by Month day, year" -> target_date via regex; falls back to "by Month year" giving the first of next month - question_type + keyword hints in text -> event_type (case_count / death_count / outbreak_declared / None) - resolution_criteria passes through - as_of_date is a factory kwarg, not a CSV column; orchestrator passes it from --as-of-date Tested against all 11 rows of bioscancast_questions.csv; q7 produces ForecastQuestion(id=q7, pathogen=mpox, region=world, target_date=2025-02-28, event_type=case_count, ...).

Branch-local question fixture for the new end-to-end orchestrator's live smoke tests. Two rows: - q7: verbatim copy of the row in bioscancast/stages/eval_stage/ bioscancast_questions.csv. Resolved at 126,441 mpox cases globally by Feb 28 2025. Run with --as-of-date 2025-02-28 to exercise historical replay. - q12: new live question on the current East Africa Ebola outbreak, target_date 2026-06-30. Run with no --as-of-date for live mode. Kept separate from bioscancast_questions.csv so the canonical CSV stays an unmodified record of what human forecasters actually evaluated.

bioscancast/llm/pricing.py introduces: - MODEL_PRICES: USD/1M-token snapshot dated 2026-05-27 for the models actually used by stage configs (gpt-4o-mini, gpt-4o, text-embedding-3- small/large) plus a date-pinned gpt-4o-2024-08-06 alias. - estimate_cost(model, input_tokens, output_tokens, cached_input_tokens): computes USD spend with a 50% discount on cached prefix per OpenAI's standard prompt-cache pricing. - estimate_cost_from_summary(): consumes the dict shape that InsightRunResult.budget_summary already produces. Sources cited in the module docstring. Unknown model raises UnknownModelError so the orchestrator surfaces stale price tables loudly rather than under-reporting cost.

New module with the run-directory layout (data/runs/{question_id}/{run_id}/) and per-stage JSON dump helpers: save_question / save_search / save_filtered / save_documents / save_insight / save_manifest. _json_default and the asdict pattern are lifted from scripts/eval_insight_on_real_docs.py so the orchestrator and the eval harness share serialization conventions.

Replaces the 14-line commented sketch with a real argparse-driven orchestrator that chains all four stages for a single ForecastQuestion: python -m bioscancast.main q7 --as-of-date 2025-02-28 -v Pipeline: 1. load_question_by_id reads the CSV row and builds a ForecastQuestion via the new factory (applying any CLI overrides). 2. SearchStagePipeline runs with a usage-tracking LLM wrapper so search + filter token usage is accumulated for cost reporting. 3. FilteringPipeline reuses the same wrapped client. 4. ExtractionPipeline gets as_of_date=question.as_of_date so the fetcher uses Wayback snapshots in historical-replay mode. 5. InsightPipeline receives the raw (unwrapped) client; its BudgetTracker already tracks usage, so wrapping would double-count. After all stages, search/filter usage and insight per_model usage are merged and fed to bioscancast.llm.pricing.estimate_cost for a single USD figure in the final epilogue and manifest. Persistence: data/runs/{question_id}/{run_id}/ question.json, search.json, filtered.json, documents.json, insight.json, manifest.json The manifest is rewritten after every stage so a crashed run keeps partial timings + config. On any stage exception the manifest pins the failing stage and re-raises wrapped in PipelineError; main() exits 1. Empty intermediate output is not an error - logged and passed through (the insight stage already handles zero documents).

Thin scripts/run_pipeline.py wrapper around bioscancast.main:main so the new orchestrator follows the same `scripts/run_*.py` convention as the existing per-stage runners. Both invocations are equivalent: python -m bioscancast.main q7 python scripts/run_pipeline.py q7 data/runs/ added to .gitignore so per-run artifacts (some quite large - documents.json includes every chunk text) don't pollute the repo.

Two small fixes uncovered by the first live runs of the orchestrator: 1. persistence._json_default crashed on FILTER_CONFIG's set values (blocked_domains, low_value_url_keywords, etc.) when serializing the manifest. Now sorts sets to lists. 2. pricing.MODEL_PRICES needs the dated aliases OpenAI returns in response.model. A request to "gpt-4o-mini" comes back tagged "gpt-4o-mini-2024-07-18", which was missing from the table and produced a $0 cost estimate with a noisy warning. Added that alias plus a couple of known gpt-4o dated variants. q7 historical-replay run subsequently cost $0.0030, q12 live run cost $0.0049, both correctly reported in the manifest.

The epilogue previously printed a single total-cost figure. This splits cost by stage (search / filter / insight) so a cost spike in any one stage is visible at a glance during iteration. Mechanism: _UsageTrackingClient gains a snapshot() method; run_pipeline snapshots the shared tracker after search and after filter and diffs them (_usage_delta) to attribute usage to each stage. Insight reports its own budget_summary per_model as before. The extract stage makes no LLM calls, so it shows timing only. Manifest gains stage_usage and stage_costs_usd alongside the existing combined_usage and estimated_cost_usd. Epilogue now reads e.g.: search 23.87s $0.0009 filter 3.06s $0.0003 extract 59.69s insight 12.28s $0.0030 total cost: $0.0042 Implements item 11 from the roadmap. 447 tests still passing.

smodee added 9 commits June 3, 2026 14:18

This was referenced Jun 3, 2026

Search/filter dashboard chokepoint + relevance ranking #30

Open

Insight extraction quality #31

Open

End-to-end pipeline orchestrator + filter/extraction quality bundle #28

Closed

smodee marked this pull request as ready for review June 3, 2026 11:33

smodee requested a review from rapsoj June 3, 2026 11:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

End-to-end pipeline orchestrator#29

End-to-end pipeline orchestrator#29
smodee wants to merge 9 commits into
mainfrom
feat/orchestrator

smodee commented Jun 3, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

smodee commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

End-to-end pipeline orchestrator

Summary

What's included

Verification

Not in scope

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

smodee commented Jun 3, 2026 •

edited

Loading