BioScanCast is an open source pipeline for biosecurity forecasting using LLMs and automated web retrieval.
The system retrieves internet sources, filters relevant documents, extracts structured information, and is intended to support probabilistic forecasting and evaluation against human forecasters.
Current repository contents include:
- modular pipeline stages
- shared schemas and LLM abstractions
- retrieval and extraction tooling
- benchmarking and evaluation infrastructure
- smoke-test and operational scripts
The forecasting stage is not yet implemented.
- Build an open source forecasting system for biosecurity questions.
- Benchmark model forecasts against human forecasters.
- Provide a reproducible research pipeline suitable for publication.
- Produce accessible technical and public-facing outputs.
Implemented:
Search -> Filtering -> Extraction -> Insight
Planned:
Forecasting
Current capabilities include:
- LLM query decomposition
- web/news retrieval via Tavily
- heuristic + optional LLM filtering
- HTML/PDF extraction and chunking
- hybrid BM25 + embedding retrieval
- structured fact extraction with provenance tracking
Collect candidate internet sources.
Features:
- LLM query decomposition
- Tavily retrieval backend
- source tier scoring
- dashboard injection
- URL normalization + deduplication
Output:
List[SearchResult]
Identify credible and relevant sources.
Features:
- heuristic relevance scoring
- source credibility scoring
- duplicate removal
- optional LLM review
- extraction-priority assignment
Output:
List[FilteredDocument]
Fetch and normalize source content.
Features:
- HTML/PDF fetching
- HTML/PDF parsing
- table extraction
- chunk normalization
- metadata extraction
Output:
List[Document]
Convert extracted text into structured facts.
Features:
- BM25 retrieval
- embedding retrieval
- hybrid reranking
- structured fact extraction
- provenance tracking
- hallucination filtering
- cross-document deduplication
Output:
List[InsightRecord]
Design principle:
one chunk -> one extraction call
Each fact must include:
- supporting quote
- source chunk
- source URL
Facts failing substring verification are discarded.
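The substring check described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `CandidateFact`, `verify_facts`, and the whitespace and case normalization are assumptions introduced here.

```python
import re
from dataclasses import dataclass


@dataclass
class CandidateFact:
    """Hypothetical container for one extracted fact (not the real InsightRecord)."""
    claim: str
    supporting_quote: str
    source_url: str


def _normalize(text: str) -> str:
    # Assumption: collapse whitespace runs and lowercase before comparing,
    # so line breaks inside a chunk do not cause false rejections.
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_facts(chunk_text: str, facts: list[CandidateFact]) -> list[CandidateFact]:
    """Keep only facts whose supporting quote occurs in the source chunk."""
    haystack = _normalize(chunk_text)
    kept = []
    for fact in facts:
        quote = _normalize(fact.supporting_quote)
        # An empty quote can never support a fact, so it is discarded too.
        if quote and quote in haystack:
            kept.append(fact)
    return kept
```

Because each extraction call sees exactly one chunk, the check only needs that chunk's text, which keeps verification cheap and local.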
Current limitations:
- disconnected from extraction outputs in smoke tests
- no temporal reasoning layer
- no forecasting integration
Planned but not implemented.
Intended responsibilities:
- probabilistic forecasting
- calibration
- confidence estimation
- forecast evaluation
bioscancast/
├── bioscancast/
│ ├── datasets/ Source registries and tier definitions
│ ├── extraction/ Fetching, parsing, and chunking
│ ├── filtering/ Relevance filtering and reranking
│ ├── insight/ Retrieval and insight extraction
│ ├── llm/ LLM client abstractions
│ ├── schemas/ Shared data models
│ ├── stages/
│ │ ├── search_stage/
│ │ └── eval_stage/
│ └── tests/ Unit and integration tests
├── data/
│ ├── raw/
│ ├── processed/
│ └── docling_eval/
├── evaluation/
├── scripts/
├── pyproject.toml
├── requirements.txt
└── README.md
| Module | Purpose |
|---|---|
| datasets/ | curated source registries and source tiers |
| extraction/ | fetching, parsing, chunking |
| filtering/ | source filtering and ranking |
| insight/ | retrieval and fact extraction |
| llm/ | model abstractions |
| schemas/ | shared structured contracts |
| stages/search_stage/ | retrieval stage |
| stages/eval_stage/ | evaluation tooling |
bioscancast/stages/search_stage/
Implemented modules:
| File | Purpose |
|---|---|
| pipeline.py | orchestration |
| query_decomposition.py | LLM sub-query generation |
| tier_resolution.py | source credibility scoring |
| dashboard_lookup.py | dashboard injection |
| url_normalization.py | canonicalization + dedup |
| backends/tavily_backend.py | Tavily backend |
Current features:
- 5–8 LLM-generated subqueries
- backend abstraction via SearchBackend
- source tier + freshness scoring
- aggregator-domain flagging
- non-content URL filtering
Known limitations:
- English-only retrieval (no multilingual support)
- hardcoded dashboard mappings
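As an illustration of the canonicalization + dedup step listed for the search stage, here is a minimal sketch using only the standard library. The specific rules (dropping `www.`, tracking parameters, fragments, and trailing slashes) and the names `normalize_url` and `dedupe_urls` are assumptions, not the behavior of url_normalization.py.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the real list may differ.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` suitable for use as a dedup key."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep only non-default ports.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    # Drop tracking parameters and sort the rest for a stable key.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"
    # The fragment is discarded because it does not change the fetched document.
    return urlunsplit((scheme, host, path, urlencode(query), ""))


def dedupe_urls(urls: list[str]) -> list[str]:
    """Keep the first occurrence of each canonical URL, preserving order."""
    seen: set[str] = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```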
bioscancast/filtering/
Implemented modules:
| File | Purpose |
|---|---|
| pipeline.py | orchestration |
| heuristics.py | heuristic scoring |
| llm_filter.py | LLM adjudication |
| reranker.py | borderline reranking |
| deduplication.py | duplicate handling |
| postprocess.py | extraction-priority assignment |
Current features:
- heuristic relevance scoring
- source credibility scoring
- optional LLM review
- domain caps
- extraction-mode assignment
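A domain cap, as listed above, can be sketched like this. It is a hedged illustration only: `apply_domain_cap`, the `(url, score)` tuple shape, and the default cap of 2 are assumptions, not the filtering stage's actual interface.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def apply_domain_cap(
    scored_urls: list[tuple[str, float]], max_per_domain: int = 2
) -> list[tuple[str, float]]:
    """Keep at most `max_per_domain` results per host, preferring higher scores."""
    counts: dict[str, int] = defaultdict(int)
    kept = []
    # Visit results from best to worst so the cap drops the weakest ones.
    for url, score in sorted(scored_urls, key=lambda item: item[1], reverse=True):
        domain = (urlsplit(url).hostname or "").lower()
        if counts[domain] < max_per_domain:
            counts[domain] += 1
            kept.append((url, score))
    return kept
```

Capping per domain stops a single prolific outlet from crowding out other sources before extraction.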
bioscancast/extraction/
Implemented modules:
| File | Purpose |
|---|---|
| pipeline.py | orchestration |
| fetcher.py | network retrieval |
| chunking.py | chunk normalization |
| parsers/html_parser.py | HTML extraction |
| parsers/pdf_parser.py | PDF extraction |
| docling_refiner.py | optional table refinement |
Current features:
- browser-fingerprinted fetching via curl_cffi
- BeautifulSoup + trafilatura HTML parsing
- PyMuPDF PDF parsing
- pdfplumber table fallback
- chunk normalization
- metadata extraction
- document-level provenance tracking
The default PDF pipeline uses PyMuPDF + pdfplumber with an optional Docling TableFormer refinement pass.
The first refinement run downloads Docling models (~40 MB) into:
~/.cache/huggingface/
Models remain resident in memory (~1.5 GB) for the process lifetime.
Controlled via:
ExtractionConfig.enable_docling_refiner
When disabled, no Docling imports occur.
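One common way to guarantee that a disabled feature triggers no imports is to defer the import into the function that needs it. The sketch below shows that pattern; `ExtractionConfigSketch`, `refine_tables`, and the default value of the flag are assumptions, and the actual refinement step is deliberately omitted.

```python
import importlib
from dataclasses import dataclass


@dataclass
class ExtractionConfigSketch:
    """Stand-in for the real config; only the flag named above is modeled."""
    enable_docling_refiner: bool = False  # assumed default


def refine_tables(tables: list, config: ExtractionConfigSketch) -> list:
    """Return tables unchanged unless the optional refiner is enabled."""
    if not config.enable_docling_refiner:
        # Early return: the heavy dependency is never imported on this path.
        return tables
    try:
        # Deferred import, paid only when the refiner is switched on.
        importlib.import_module("docling")
    except ImportError:
        # Degrade gracefully if the optional dependency is missing.
        return tables
    # A real refiner would re-run table detection here; omitted in this sketch.
    return tables
```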
Current limitations:
- OCR not implemented
- scanned PDFs return requires_ocr
- no persistent document store
- extraction is currently in-memory only
bioscancast/insight/
Implemented modules:
| File | Purpose |
|---|---|
| pipeline.py | orchestration |
| retrieval/bm25.py | lexical retrieval |
| retrieval/embeddings.py | embedding retrieval |
| retrieval/hybrid.py | hybrid reranking |
| extraction/chunk_extractor.py | fact extraction |
Current features:
- BM25 retrieval
- embedding similarity retrieval
- hybrid scoring
- keyword reranking
- chunk-level extraction
- quote-based hallucination guards
- provenance linking
- cross-document deduplication
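Hybrid scoring can be illustrated with a weighted sum of normalized lexical and embedding scores. The fusion rule used in retrieval/hybrid.py is not stated here, so the min-max normalization, the `alpha` weight, and the function names below are assumptions.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(
    bm25_scores: list[float],
    embedding_scores: list[float],
    alpha: float = 0.5,
    top_k: int = 3,
) -> list[int]:
    """Return the indices of the top_k chunks under a weighted-sum fusion.

    alpha weights the lexical (BM25) signal; 1 - alpha weights the embedding signal.
    """
    if len(bm25_scores) != len(embedding_scores):
        raise ValueError("score lists must have the same length")
    if not bm25_scores:
        return []
    lexical = min_max(bm25_scores)
    semantic = min_max(embedding_scores)
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order[:top_k]
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized sum would be dominated by the lexical term.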
bioscancast/stages/eval_stage/
Implemented modules:
| File | Purpose |
|---|---|
| evaluator.py | orchestration |
| scoring.py | forecast scoring |
| calibration.py | calibration metrics |
| compare.py | model vs human comparison |
| visualisation.py | plots and reporting |
Repository datasets:
bioscancast_forecasts.csv
bioscancast_questions.csv
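For binary questions, the two standard scoring rules used to compare forecasters can be sketched as below. This shows the textbook definitions only; the function names and the clipping constant are assumptions and need not match scoring.py.

```python
import math


def brier_score(probabilities: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)


def log_score(probabilities: list[float], outcomes: list[int], eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of the realized outcomes (lower is better)."""
    total = 0.0
    for p, o in zip(probabilities, outcomes):
        # Clip to avoid log(0) when a forecaster states full certainty.
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p if o == 1 else 1 - p)
    return total / len(probabilities)
```

Both rules are proper scoring rules, so a forecaster minimizes their expected score by reporting their true belief.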
bioscancast/schemas/
Shared stage contracts.
Key schemas:
| File | Purpose |
|---|---|
| document.py | extracted documents + chunks |
| insight_record.py | extracted facts |
Additional filtering models live in:
bioscancast/filtering/models.py
including:
- ForecastQuestion
- SearchResult
- FilteredDocument
Stages should communicate through schemas rather than raw dictionaries.
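To illustrate why typed contracts are preferred over raw dictionaries, here is a small sketch. `SearchResultSketch` and its fields are hypothetical and are not the repository's SearchResult model.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SearchResultSketch:
    """Hypothetical stage contract; field names are illustrative only."""
    url: str
    title: str
    published_date: Optional[str] = None  # ISO 8601 date string, if known


def to_payload(results: list[SearchResultSketch]) -> list[dict]:
    """Serialize at the stage boundary only, e.g. when writing JSON output."""
    return [asdict(r) for r in results]
```

A misspelled field fails at construction time with a typed contract, whereas a misspelled dictionary key would pass silently to the next stage.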
bioscancast/llm/
Current files:
| File | Purpose |
|---|---|
| base.py | shared protocol + token accounting |
| client.py | legacy/simple OpenAI wrapper |
| openai_client.py | structured extraction client |
| fake_client.py | testing client |
The repository currently contains two partially overlapping interfaces:
bioscancast/llm/base.py
bioscancast/llm/client.py
These should eventually be unified.
When benchmarking the pipeline against human forecasters on past questions,
the model must not be allowed to see sources that didn't exist (or contained
different content) at the time the human forecasted. Historical-replay mode
enforces this by reading a single per-question field, ForecastQuestion.as_of_date:
- When as_of_date is None (default), the pipeline behaves exactly as in live mode. No code paths change.
- When as_of_date is set, the search backend receives end_date=as_of_date, the cache key incorporates the cutoff, post-retrieval filtering drops any result dated after the cutoff (and any undated result whose date cannot be cheaply recovered), dashboard URLs are rewritten to the closest Wayback snapshot at or before the cutoff (or suppressed if none exists), and the extraction stage fetches from Wayback. Wayback fallback to live is logged at INFO and recorded in Document.fetch_strategy, never silent.
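The post-retrieval cutoff filter can be sketched as follows. This is a simplified illustration: `DatedResult` and `apply_cutoff` are hypothetical names, and the sketch drops every undated result rather than attempting the cheap date recovery mentioned above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatedResult:
    """Hypothetical search result carrying only what the filter needs."""
    url: str
    published_date: Optional[date]


def apply_cutoff(results: list[DatedResult], as_of_date: Optional[date]) -> list[DatedResult]:
    """Drop results published after the cutoff; live mode is a no-op."""
    if as_of_date is None:
        # Live mode: nothing is filtered.
        return list(results)
    return [
        r
        for r in results
        # Undated results are dropped because they cannot be shown to predate the cutoff.
        if r.published_date is not None and r.published_date <= as_of_date
    ]
```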
The LLM "historical roleplay" prompt is not automatically enabled by
as_of_date; it lives behind a separate historical_roleplay=True flag on
SearchStagePipeline because its effect on query quality is harder to
predict. Turn it on for the benchmark and off for production.
What this mode does NOT fix: the LLMs themselves were trained on data that
postdates many of our benchmark questions. Retrieval fairness ≠ model
fairness. The retrieval_free_baseline_forecast metric in
bioscancast/stages/eval_stage/contamination.py reports how well the LLM
forecasts with no evidence at all; a small gap between that and the full
pipeline is itself evidence of training-data leakage and must be reported
alongside the headline Brier/log scores.
filter_caught_contamination_rate is also exposed by the same module. It
is a lower bound on contamination — it only counts post-cutoff results
whose published_date is known. Undated results and results whose content
changed post-cutoff are invisible to it. Reports MUST surface this caveat;
the metric's docstring repeats it for the same reason.
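The lower-bound nature of this metric is easy to see in a sketch. The exact definition in contamination.py is not given here, so the denominator (all retrieved results) and the name `caught_contamination_rate` are assumptions.

```python
from datetime import date
from typing import Optional, Sequence


def caught_contamination_rate(
    published_dates: Sequence[Optional[date]], cutoff: date
) -> float:
    """Fraction of retrieved results known to postdate the cutoff.

    Lower bound only: undated results (None) and pages whose content changed
    after the cutoff are never counted, even if they leak future information.
    """
    if not published_dates:
        return 0.0
    caught = sum(1 for d in published_dates if d is not None and d > cutoff)
    return caught / len(published_dates)
```

In the usage below, an undated result contributes to the denominator but can never be counted as caught, which is why reports need to state the caveat alongside the number.

```python
from datetime import date

rate = caught_contamination_rate(
    [date(2024, 1, 1), date(2024, 6, 1), None, date(2024, 7, 1)],
    cutoff=date(2024, 3, 1),
)
```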
bioscancast/datasets/
Curated source definitions and credibility tiers.
| File | Purpose |
|---|---|
| biosecurity_sources.py | curated source registry |
| source_tiers.py | source credibility tiers |
scripts/
Operational and smoke-test utilities.
| Script | Purpose |
|---|---|
| run_search_stage.py | run search stage |
| run_filtering.py | run filtering stage |
| run_extraction.py | run extraction stage |
| run_insight.py | run insight smoke test |
| eval_docling.py | Docling evaluation |
| eval_hybrid_pdf.py | PDF extraction benchmarking |
Scripts are intended for operational workflows rather than reusable library APIs.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Additional packages:
pip install openai tavily-python python-dotenv
Create .env:
OPENAI_API_KEY=sk-...
TAVILY_API_KEY=tvly-...
python scripts/run_search_stage.py \
"Will H5N1 cause more than 100 human cases in the US by December 2026?" \
--pathogen h5n1 \
--region "United States"
Optional JSON output:
python scripts/run_search_stage.py \
"How many mpox cases will be reported globally by June 2026?" \
--pathogen mpox \
--output data/search_results.json
python scripts/run_filtering.py
Current limitation:
- uses hardcoded sample inputs rather than automatic search-stage ingestion
Output:
data/filtered_results.json
Smoke-test mode:
python scripts/run_extraction.py
Using filtered-document JSON:
python scripts/run_extraction.py \
--input data/filtered_results.json
Output:
data/extraction_results.json
python scripts/run_insight.py
Current limitation:
- uses synthetic documents rather than extraction outputs
python scripts/run_search_stage.py \
"Will mpox cases increase in Uganda in 2026?" \
--pathogen mpox \
--region Uganda \
--output data/search_results.json && \
python scripts/run_filtering.py && \
python scripts/run_extraction.py \
--input data/filtered_results.json
bioscancast/main.py currently contains pseudocode only and is not yet a runnable orchestrator.
bioscancast/tests/
Includes:
- extraction tests
- retrieval tests
- pipeline tests
- schema validation
- search-stage integration tests
Run all tests:
pytest
Run selected tests:
pytest bioscancast/tests/test_extraction_pipeline.py
pytest bioscancast/tests/test_insight_pipeline.py
Live fetch tests are marked @pytest.mark.live and skipped by default.
Run with:
pytest --live
Important dependencies:
| Dependency | Usage |
|---|---|
| curl_cffi | browser-fingerprinted HTTP fetching |
| rank_bm25 | lexical retrieval |
| PyMuPDF | primary PDF parsing |
| pdfplumber | fallback PDF table extraction |
curl_cffi is used in:
bioscancast/extraction/fetcher.py
The impersonation profile is configurable via:
ExtractionConfig.impersonate
- Keep pipeline stages modular.
- Use schemas between stages.
- Prefer structured interfaces over raw dictionaries.
- Keep experimental workflows in scripts or notebooks.
- Prioritize reproducibility.
- Treat provenance and auditability as first-class concerns.
Major missing components:
- unified end-to-end orchestrator
- extraction → insight integration
- OCR support
- forecasting stage implementation
- persistent storage/vector DB layer
- unified LLM abstraction
- multilingual retrieval