Skip to content

End-to-end pipeline orchestrator#29

Open
smodee wants to merge 9 commits into
mainfrom
feat/orchestrator
Open

End-to-end pipeline orchestrator#29
smodee wants to merge 9 commits into
mainfrom
feat/orchestrator

Conversation

@smodee
Copy link
Copy Markdown
Collaborator

@smodee smodee commented Jun 3, 2026

End-to-end pipeline orchestrator

Split from the original feat/end-to-end-orchestrator branch (was #28).
Companion PRs: #30 (search/filter quality) and #31 (insight
extraction quality). All three are independent and can merge in any order.

Summary

Adds the first orchestrator that chains all four pipeline stages
(search → filter → extract → insight) for a single ForecastQuestion.
Before this branch the stages only ran in isolation: bioscancast/main.py was
a commented sketch, scripts/run_insight.py was synthetic-only, and
scripts/eval_insight_on_real_docs.py chained only extract→insight on
hardcoded fixtures. Forecasting (the next stage) needs a real InsightRecord
stream; without this orchestrator it would have had to build the chain itself,
coupling the two stages.

What's included

  • bioscancast/main.pyrun_pipeline() + argparse CLI. Chains the four
    stages with per-stage timing, JSON persistence, error wrapping
    (PipelineError pins the failing stage), and a cost estimate. Runnable as
    python -m bioscancast.main q7 … or via scripts/run_pipeline.py.
  • bioscancast/orchestration/ — new package: persistence.py (run-dir
    layout data/runs/{qid}/{run_id}/, per-stage JSON dumps, manifest) and
    test_questions.csv (q7 verbatim copy + a new q12 Ebola question; the
    canonical bioscancast_questions.csv is left untouched as the human-
    forecaster record).
  • bioscancast/stages/eval_stage/loaders.pybuild_forecast_question
    factory (CSV row → ForecastQuestion) + load_question_by_id; fixed an
    Excel-serial created_date parsing bug while here.
  • bioscancast/llm/pricing.py — model price table (snapshot 2026-05-27) +
    estimate_cost; surfaces USD cost per run, broken down per stage in the
    orchestrator epilogue.

Verification

  • python -m pytest bioscancast/tests/447 passed, 2 skipped (live).
    Tests cover the orchestrator's stage timing/persistence, the
    build_forecast_question factory, and the per-stage cost line-items.
  • Live verification: q7 (Mpox, historical replay 2025-02-28) and q12 (Ebola,
    live) end-to-end runs producing artifacts inspected for filter survival,
    records, and cost. Cumulative API spend across all development runs:
    ~$0.03.

Not in scope

smodee added 9 commits June 3, 2026 14:18
bioscancast_questions.csv stores created_date as an Excel serial day
number (e.g. 45712). pd.to_datetime without unit=D + origin=1899-12-30
treated those integers as nanoseconds past 1970, yielding garbage dates
like 1970-01-01 00:00:00.000045712. The bug was latent — no caller had
yet relied on the parsed date — but the new orchestrator's
build_forecast_question factory needs an accurate created_at.

After the fix, q7 resolves to 2025-02-24 as expected.
The orchestrator (next commit) needs to turn a CSV row into a typed
ForecastQuestion. Maps:

- created_date -> tz-aware UTC datetime (already parsed by load_questions)
- topic "Pathogen (Region)" -> lowercased pathogen + region
- question_text "by Month day, year" -> target_date via regex; falls back
  to "by Month year" giving the first of next month
- question_type + keyword hints in text -> event_type
  (case_count / death_count / outbreak_declared / None)
- resolution_criteria passes through
- as_of_date is a factory kwarg, not a CSV column; orchestrator passes
  it from --as-of-date

Tested against all 11 rows of bioscancast_questions.csv; q7 produces
ForecastQuestion(id=q7, pathogen=mpox, region=world,
target_date=2025-02-28, event_type=case_count, ...).
Branch-local question fixture for the new end-to-end orchestrator's live
smoke tests. Two rows:

- q7: verbatim copy of the row in bioscancast/stages/eval_stage/
  bioscancast_questions.csv. Resolved at 126,441 mpox cases globally by
  Feb 28 2025. Run with --as-of-date 2025-02-28 to exercise historical
  replay.

- q12: new live question on the current East Africa Ebola outbreak,
  target_date 2026-06-30. Run with no --as-of-date for live mode.

Kept separate from bioscancast_questions.csv so the canonical CSV stays
an unmodified record of what human forecasters actually evaluated.
bioscancast/llm/pricing.py introduces:
- MODEL_PRICES: USD/1M-token snapshot dated 2026-05-27 for the models
  actually used by stage configs (gpt-4o-mini, gpt-4o, text-embedding-3-
  small/large) plus a date-pinned gpt-4o-2024-08-06 alias.
- estimate_cost(model, input_tokens, output_tokens, cached_input_tokens):
  computes USD spend with a 50% discount on cached prefix per OpenAI's
  standard prompt-cache pricing.
- estimate_cost_from_summary(): consumes the dict shape that
  InsightRunResult.budget_summary already produces.

Sources cited in the module docstring. Unknown model raises
UnknownModelError so the orchestrator surfaces stale price tables
loudly rather than under-reporting cost.
New module with the run-directory layout
(data/runs/{question_id}/{run_id}/) and per-stage JSON dump helpers:
save_question / save_search / save_filtered / save_documents /
save_insight / save_manifest. _json_default and the asdict pattern are
lifted from scripts/eval_insight_on_real_docs.py so the orchestrator
and the eval harness share serialization conventions.
Replaces the 14-line commented sketch with a real argparse-driven
orchestrator that chains all four stages for a single ForecastQuestion:

  python -m bioscancast.main q7 --as-of-date 2025-02-28 -v

Pipeline:
  1. load_question_by_id reads the CSV row and builds a ForecastQuestion
     via the new factory (applying any CLI overrides).
  2. SearchStagePipeline runs with a usage-tracking LLM wrapper so
     search + filter token usage is accumulated for cost reporting.
  3. FilteringPipeline reuses the same wrapped client.
  4. ExtractionPipeline gets as_of_date=question.as_of_date so the
     fetcher uses Wayback snapshots in historical-replay mode.
  5. InsightPipeline receives the raw (unwrapped) client; its
     BudgetTracker already tracks usage, so wrapping would double-count.

After all stages, search/filter usage and insight per_model usage are
merged and fed to bioscancast.llm.pricing.estimate_cost for a single
USD figure in the final epilogue and manifest.

Persistence:
  data/runs/{question_id}/{run_id}/
    question.json, search.json, filtered.json, documents.json,
    insight.json, manifest.json
The manifest is rewritten after every stage so a crashed run keeps
partial timings + config. On any stage exception the manifest pins
the failing stage and re-raises wrapped in PipelineError; main()
exits 1.

Empty intermediate output is not an error - logged and passed through
(the insight stage already handles zero documents).
Thin scripts/run_pipeline.py wrapper around bioscancast.main:main so the
new orchestrator follows the same `scripts/run_*.py` convention as the
existing per-stage runners. Both invocations are equivalent:

    python -m bioscancast.main q7
    python scripts/run_pipeline.py q7

data/runs/ added to .gitignore so per-run artifacts (some quite large -
documents.json includes every chunk text) don't pollute the repo.
Two small fixes uncovered by the first live runs of the orchestrator:

1. persistence._json_default crashed on FILTER_CONFIG's set values
   (blocked_domains, low_value_url_keywords, etc.) when serializing the
   manifest. Now sorts sets to lists.

2. pricing.MODEL_PRICES needs the dated aliases OpenAI returns in
   response.model. A request to "gpt-4o-mini" comes back tagged
   "gpt-4o-mini-2024-07-18", which was missing from the table and
   produced a $0 cost estimate with a noisy warning. Added that alias
   plus a couple of known gpt-4o dated variants.

q7 historical-replay run subsequently cost $0.0030, q12 live run cost
$0.0049, both correctly reported in the manifest.
The epilogue previously printed a single total-cost figure. This splits
cost by stage (search / filter / insight) so a cost spike in any one
stage is visible at a glance during iteration.

Mechanism: _UsageTrackingClient gains a snapshot() method; run_pipeline
snapshots the shared tracker after search and after filter and diffs
them (_usage_delta) to attribute usage to each stage. Insight reports
its own budget_summary per_model as before. The extract stage makes no
LLM calls, so it shows timing only.

Manifest gains stage_usage and stage_costs_usd alongside the existing
combined_usage and estimated_cost_usd. Epilogue now reads e.g.:

  search     23.87s  $0.0009
  filter      3.06s  $0.0003
  extract    59.69s
  insight    12.28s  $0.0030
  total cost:  $0.0042

Implements item 11 from the roadmap. 447 tests still passing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant