A pure-Python pipeline of tool-using agents that takes a scientific paper PDF and returns a credibility-graded list of the paper's staged conclusions, attempts to reproduce each one, and writes a whole-paper summary plus follow-up research suggestions.
Given a paper PDF, the pipeline (a) extracts per-section staged conclusions and the figures, equations, and tables that support each one; (b) verifies each conclusion against its supporting evidence, reproducing figures by code generation and walking equation derivations step by step; and (c) aggregates the per-conclusion verdicts into a final report with a credibility level and score per conclusion, a list of further-research points motivated by the gaps, and a whole-paper summary.
Each agent's responsibilities are narrow and the hand-off contracts (pydantic models) are explicit, so every verdict traces back to the section, evidence, and reasoning that produced it.
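As a rough illustration of what such a hand-off contract could look like, here is a minimal sketch. It uses standard-library dataclasses as a stand-in for the project's pydantic models so that it runs without dependencies. The class names, field names, and the assumed score range of 0 to 1 are hypothetical and are not taken from `verifier/schemas/`.

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Evidence:
    kind: str      # "figure", "equation", or "table"
    asset_id: str  # hypothetical identifier of the extracted asset


@dataclass(frozen=True)
class Conclusion:
    conclusion_id: str
    section: str
    text: str
    evidence: list[Evidence] = field(default_factory=list)


@dataclass(frozen=True)
class Verdict:
    conclusion_id: str
    credibility_level: str    # e.g. "high" / "medium" / "low" (assumed labels)
    credibility_score: float  # assumed to lie in [0, 1]
    reasoning: str

    def __post_init__(self):
        # Reject out-of-range scores at construction time.
        if not 0.0 <= self.credibility_score <= 1.0:
            raise ValueError("credibility_score must be in [0, 1]")


def trace(verdict: Verdict, conclusions: dict[str, Conclusion]) -> dict:
    """Resolve a verdict back to the section, evidence, and reasoning behind it."""
    conclusion = conclusions[verdict.conclusion_id]
    return {
        "section": conclusion.section,
        "evidence": [asdict(e) for e in conclusion.evidence],
        "reasoning": verdict.reasoning,
        "score": verdict.credibility_score,
    }
```

Keying every verdict on a `conclusion_id` is one simple way to get the traceability described above: any downstream consumer can look up the originating section and evidence without re-reading the paper.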
                      ┌────────────────────────────┐
paper.pdf ──────────► │ Phase 0: Extraction (CLI)  │ ─────► <stem>_paper/
                      │ verifier.extraction.*      │          manifest.json
                      │                            │          sections/*.md
                      │ - paper.py orchestrator    │          figures/*
                      │ - placeholders.py schema   │          equations/*
                      │ - equations.py VLM OCR     │          tables/*
                      │ - figures.py render        │
                      │ - tables.py VLM md         │
                      │ - layout.py 2-col          │
                      └────────────────────────────┘
                                    │
                                    ▼
                      ┌────────────────────────────┐
                      │ Phase 1: Extractor agent   │ ─────► <stem>_paper/
                      │ verifier.agents.extractor  │          conclusions.json
                      └────────────────────────────┘
                                    │
                                    ▼
                      ┌────────────────────────────┐
                      │ Phase 1.5: Dispatcher      │ ─────► <stem>_paper/
                      │ verifier.agents.dispatcher │          verification_plan.json
                      │ (pure-Python, no LLM)      │
                      └────────────────────────────┘
                                    │
                                    ▼
          ┌────────────────────────────────────────────────────┐
          │ Phase 2 runner: verifier/pipeline.py               │
          │ (pure Python loop, file-based checkpoint)          │
          │                                                    │
          │ for task in pending_tasks:                         │
          │     if workdir/verdict.json exists:                │
          │         load + skip  ◄─── resume after crash       │
          │     elif task.method == "figure_reproduction":     │
          │         figure_verifier_node(state)                │
          │     elif task.method == "formula_derivation":      │
          │         formula_verifier_node(state)               │
          │     rewrite verifications.json after each task     │
          └────────────────────────────────────────────────────┘
                                    │
                                    ▼
          ┌────────────────────────────────────────────────────┐
          │ Phase 3 summarizer: verifier/agents/summarizer.py  │
          │                                                    │
          │ - Python: weighted overall_credibility             │
          │ - 1 LLM call (structured): paper_summary           │
          │                          + further_research        │
          │ - merges Phase-2 verdicts + dispatcher skips       │
          └────────────────────────────────────────────────────┘
                                    │
                                    ▼
                       FinalReport ──► report.json
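The Phase 2 resume logic in the diagram can be sketched in a few lines. This is an illustrative outline only, not the code in `verifier/pipeline.py`: the task dictionaries, the `handlers` mapping (standing in for `figure_verifier_node` and `formula_verifier_node`), and the one-workdir-per-task-id layout are all assumptions.

```python
import json
from pathlib import Path
from typing import Callable


def run_phase2(
    tasks: list[dict],
    out_dir: Path,
    handlers: dict[str, Callable[[dict], dict]],
) -> list[dict]:
    """Run each task once, reusing verdict.json left by an earlier (crashed) run."""
    verdicts = []
    for task in tasks:
        workdir = out_dir / task["id"]  # hypothetical workdir layout
        verdict_path = workdir / "verdict.json"
        if verdict_path.exists():
            # Checkpoint hit: load the stored verdict and skip the expensive work.
            verdict = json.loads(verdict_path.read_text())
        else:
            handler = handlers[task["method"]]  # KeyError on an unknown method
            verdict = handler(task)
            workdir.mkdir(parents=True, exist_ok=True)
            verdict_path.write_text(json.dumps(verdict))
        verdicts.append(verdict)
        # Rewrite the aggregate file after every task so progress survives a crash.
        (out_dir / "verifications.json").write_text(json.dumps(verdicts, indent=2))
    return verdicts
```

Because the checkpoint is just a file per task, resuming needs no database or in-memory state: rerunning the same command picks up wherever `verdict.json` files stop.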
uv sync
cp .env.example .env # set OPENAI_API_KEY
# End-to-end pipeline in one command. Cached extraction + cached
# conclusions are reused on re-runs; --force-extract redoes Phase 0+1.
PYTHONPATH=. uv run python -m verifier path/to/paper.pdf
PYTHONPATH=. uv run python -m verifier path/to/paper.pdf --force-extract
PYTHONPATH=. uv run python -m verifier path/to/paper.pdf --cache-dir /tmp/extracted
# Test
uv run pytest tests/

The final report lands at <cache-dir>/<stem>_paper/report.json, alongside the intermediate per-phase artifacts (manifest.json, conclusions.json, verification_plan.json, verifications.json).
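To check which artifacts a run has produced so far, a small helper along these lines may be handy. It is a sketch that assumes `<stem>` is the PDF file name without its extension; the function names are hypothetical and not part of the `verifier` package.

```python
from pathlib import Path

# Artifact file names as listed above, in the order the phases write them.
ARTIFACTS = (
    "manifest.json",
    "conclusions.json",
    "verification_plan.json",
    "verifications.json",
    "report.json",
)


def artifact_paths(pdf_path: str, cache_dir: str) -> dict[str, Path]:
    """Map each artifact name to <cache-dir>/<stem>_paper/<name>."""
    paper_dir = Path(cache_dir) / f"{Path(pdf_path).stem}_paper"
    return {name: paper_dir / name for name in ARTIFACTS}


def missing_artifacts(pdf_path: str, cache_dir: str) -> list[str]:
    """Return the artifact names that do not exist on disk yet."""
    paths = artifact_paths(pdf_path, cache_dir)
    return [name for name, path in paths.items() if not path.exists()]
```

For example, `missing_artifacts("path/to/paper.pdf", "/tmp/extracted")` returning only `["report.json"]` would suggest that Phases 0 through 2 finished and Phase 3 has not run yet.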
hackathon20260511/
├── verifier/
│ ├── __main__.py # `python -m verifier paper.pdf` — end-to-end CLI
│ ├── config.py # per-role model config
│ ├── state.py # shared TypedDict state passed between phases
│ ├── extraction/ # Phase 0
│ │ ├── paper.py # orchestrator
│ │ ├── placeholders.py # XML asset syntax
│ │ ├── equations.py # detect + VLM LaTeX
│ │ ├── figures.py # caption-anchor + render
│ │ ├── tables.py # caption-anchor + VLM markdown
│ │ └── layout.py # column-aware reading order
│ ├── agents/
│ │ ├── extractor.py # Phase 1: per-section conclusion extraction
│ │ ├── dispatcher.py # Phase 1.5: router / skipper (no LLM)
│ │ ├── figure_verifier.py # Phase 2: figure reproduction loop
│ │ ├── formula_verifier.py # Phase 2: derivation walker
│ │ └── summarizer.py # Phase 3: single structured LLM call, no tools
│ ├── pipeline.py # run_all (1→1.5→2→3) + Phase-2 file checkpoint
│ ├── tools/
│ │ ├── figure_tools.py # 8-tool toolkit for the figure verifier
│ │ └── formula_tools.py # 4-tool toolkit for the formula verifier
│ ├── schemas/ # pydantic models for hand-off contracts
│ └── prompts/ # one prompt module per agent
├── scripts/ # per-phase manual harnesses (for partial reruns)
├── docs/
│ ├── architecture.md # per-phase deep dive
│ └── models.md # per-agent model configuration
├── tests/
├── data/
│ └── extracted/ # cached Phase 0 + Phase 1 outputs (gitignored)
└── skills/
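Since the Phase 1.5 dispatcher (`agents/dispatcher.py` in the tree above) is a pure-Python router and skipper, its core idea can be sketched without any LLM call. The routing rule below is an assumption made for illustration: figure evidence goes to `figure_reproduction`, equation evidence to `formula_derivation`, and conclusions with neither are skipped. The input and output shapes are hypothetical and do not reflect the real `verification_plan.json` schema.

```python
# Evidence kind -> verification method (method names as used by the Phase 2 runner).
ROUTES = {
    "figure": "figure_reproduction",
    "equation": "formula_derivation",
}


def build_plan(conclusions: list[dict]) -> dict:
    """Turn extracted conclusions into verification tasks plus a list of skips."""
    tasks, skipped = [], []
    for conclusion in conclusions:
        routable = [e for e in conclusion.get("evidence", []) if e["kind"] in ROUTES]
        if not routable:
            # Nothing the Phase 2 verifiers can act on: record why it was skipped.
            skipped.append({
                "conclusion_id": conclusion["id"],
                "reason": "no figure or equation evidence",
            })
            continue
        for evidence in routable:
            tasks.append({
                "conclusion_id": conclusion["id"],
                "asset_id": evidence["asset_id"],
                "method": ROUTES[evidence["kind"]],
            })
    return {"tasks": tasks, "skipped": skipped}
```

Recording skips explicitly, rather than silently dropping them, is what lets the summarizer merge dispatcher skips into the final report.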
- docs/architecture.md: per-phase design, covering extraction internals, agent toolkits, verdict schema, and state machine.
- docs/models.md: per-agent model assignment, the reasoning behind each choice, and env-var overrides.
For debugging one phase, swapping prompts, or re-running a single task, the harnesses in scripts/ work against an already-extracted directory without re-doing earlier phases:
# Phase 0 only (extraction)
PYTHONPATH=. uv run python scripts/try_extract.py path/to/paper.pdf data/extracted/<stem>_paper
# Phase 1 only (writes conclusions.json)
PYTHONPATH=. uv run python scripts/try_extractor.py data/extracted/<stem>_paper
# Phase 1.5 only (no LLM; writes verification_plan.json)
PYTHONPATH=. uv run python scripts/try_dispatcher.py data/extracted/<stem>_paper
# Phase 2 single task (figure or formula)
PYTHONPATH=. uv run python scripts/try_figure_verifier.py data/extracted/<stem>_paper [task_idx]
PYTHONPATH=. uv run python scripts/try_formula_verifier.py data/extracted/<stem>_paper [task_idx]
# Phase 2 full run with per-task checkpoint resume
PYTHONPATH=. uv run python scripts/run_phase2.py data/extracted/<stem>_paper
# Phase 3 only (one LLM call; writes report.json)
PYTHONPATH=. uv run python scripts/run_phase3.py data/extracted/<stem>_paper

To force a re-run of one Phase-2 task, delete that task's workdir under data/extracted/<stem>_paper/figures/verifier_runs/ or equations/verifier_runs/.
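Deleting a workdir by hand works fine; if you prefer to script it, something like the following sketch could help. It assumes each task's workdir is a direct subdirectory of one of the two `verifier_runs/` directories, which the text above does not confirm, and both function names are hypothetical.

```python
import shutil
from pathlib import Path

# The two locations named above, relative to the extracted paper directory.
RUN_DIRS = ("figures/verifier_runs", "equations/verifier_runs")


def task_workdirs(paper_dir: Path) -> list[Path]:
    """List every task workdir found under both verifier_runs/ directories."""
    found = []
    for rel in RUN_DIRS:
        root = paper_dir / rel
        if root.is_dir():
            found.extend(sorted(p for p in root.iterdir() if p.is_dir()))
    return found


def force_rerun(workdir: Path) -> bool:
    """Delete one task workdir so the Phase-2 runner no longer finds its verdict.json."""
    if not workdir.is_dir():
        return False
    shutil.rmtree(workdir)
    return True
```

After `force_rerun(...)` returns `True`, rerunning `scripts/run_phase2.py` should recompute only that task, since every other workdir still holds its checkpoint.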