Know before you build.
PERT estimates for AI-agent tasks — how long, which model's reliable enough, and the human-equivalent cost. In one command.
AI agents can write the code — but how long will the task actually take? Manual estimation is slow and biased toward optimism; no estimate means scope creep and missed deadlines. The gap between "agents can do it" and "we know when it'll be done" is where projects break down.
agent-estimate closes that gap in one command: a three-point PERT timeline calibrated on real agent runs, plus a human-speed comparison so you see the compression before you spend the compute. It sizes the task, picks a tier, routes it to a model, and flags when the work runs past that model's reliability horizon — calibrated forecasts in seconds, not meetings.
Multi-model matters because the models aren't interchangeable. Opus 4.7, GPT-5.5, and Gemini 3.1 have different reliability horizons (METR p80) and different costs per turn. A safe 40-minute job for one model is a coin flip for another. agent-estimate models the whole fleet, not a single agent — so the number reflects who actually runs the work.
First estimate: 30 seconds to install. Every one after: instant.
Paste this into your Claude Code or Codex session:
Install the agent-estimate plugin (https://github.com/kiloloop/agent-estimate) and
estimate this task for me: "Implement OAuth 2.0 flow (Google + GitHub)". Tell me the
expected time, the human-speed equivalent, and the compression ratio.
Your agent installs the tool, runs the estimate, and reads back the numbers. Nothing to memorize — describe the task in plain English and let the agent translate to flags.
For a whole backlog:
Estimate every open issue in this repo with agent-estimate, group them into parallel
waves, and tell me the total wall-clock time for a 3-agent fleet versus doing them
sequentially myself.
pip install agent-estimate
agent-estimate estimate "your task description here"

No config required: sensible defaults for a 3-agent fleet (Claude, Codex, Gemini). Point it at a file or GitHub issues when you're ready:
agent-estimate estimate --file tasks.txt
agent-estimate estimate --repo myorg/myrepo --issues 11,12,14

agent-estimate produces three-point PERT estimates calibrated for agents, not humans:
- Tier classification — auto-sizes tasks XS→XL from complexity signals
- PERT math — optimistic / most-likely / pessimistic, weighted to an expected value
- Human comparison — a per-task-type multiplier, so you see the compression
- METR thresholds — warns when an estimate exceeds a model's p80 reliability horizon
- Wave planning — schedules independent tasks in parallel across the fleet
- Review overhead — models review cycles as additive cost (`standard`, `complex`, `3-round`)
- Modifiers — `--spec-clarity`, `--warm-context`, `--agent-fit` tune the estimate
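The PERT math bullet can be illustrated with the classic three-point formula. This is a minimal sketch that assumes the standard 1-4-1 weighting; `pert_expected` is a hypothetical helper, not part of the agent-estimate API. The tool may layer review overhead or modifiers on top, so its printed figures need not match this formula exactly.

```python
def pert_expected(optimistic: float, most_likely: float, pessimistic: float) -> float:
    """Classic PERT expected value with 1-4-1 weights (assumed, not confirmed by the tool)."""
    if not optimistic <= most_likely <= pessimistic:
        raise ValueError("expected optimistic <= most_likely <= pessimistic")
    return (optimistic + 4 * most_likely + pessimistic) / 6


# A 12/23/40-minute three-point estimate.
print(pert_expected(12, 23, 40))  # 24.0
```

The most-likely value carries four times the weight of either extreme, so a long pessimistic tail pulls the expected value up without dominating it.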
| Type | Flag | What it models |
|---|---|---|
| Coding | (default) | Feature work, fixes, refactors |
| Research | `--type research` | Audits, investigations, analysis |
| Documentation | `--type documentation` | API docs, guides, changelogs |
| Brainstorm | `--type brainstorm` | Ideation, spikes, design exploration |
| Config/SRE | `--type config` | Deploys, infra, CI/CD |
| Frontend/UI | `--type frontend` | Content patches vs. component builds |
| App dev | `--type app_dev` | App shells, desktop/mobile builds |
| Model | p80 threshold |
|---|---|
| Opus 4.7 | 90 min |
| GPT-5.5 | 90 min |
| GPT-5.4 | 60 min |
| Gemini 3.1 Pro | 45 min |
| Sonnet 4.6 | 30 min |
| Haiku 4.5 | 15 min |
`opus_4_x` is a forward-compatible alias that resolves to the current Opus threshold. Legacy keys (`opus_4_6`, GPT-5/5.2/5.3, Gemini 3 Pro, Sonnet) stay supported. Estimates are calibrated against Claude Code (Opus 4.7, high thinking) and Codex (GPT-5.4/5.5, extra-high); for other setups, adjust them with `--spec-clarity` and `--warm-context`.
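The threshold check can be sketched as a simple lookup. This assumes a plain "duration greater than threshold" comparison; the dictionary keys are illustrative labels rather than the tool's config keys, both helper functions are hypothetical, and which figure the tool actually compares (expected or pessimistic) is not stated here.

```python
# p80 thresholds in minutes, copied from the table above.
# The keys are illustrative labels, not the tool's config keys.
P80_MINUTES = {
    "Opus 4.7": 90,
    "GPT-5.5": 90,
    "GPT-5.4": 60,
    "Gemini 3.1 Pro": 45,
    "Sonnet 4.6": 30,
    "Haiku 4.5": 15,
}


def exceeds_p80(minutes: float, model: str) -> bool:
    """Return True when a task duration runs past the model's p80 horizon."""
    return minutes > P80_MINUTES[model]


def models_within_horizon(minutes: float) -> list[str]:
    """List the models whose p80 threshold covers the given duration."""
    return [model for model, limit in P80_MINUTES.items() if minutes <= limit]


print(models_within_horizon(40))
# ['Opus 4.7', 'GPT-5.5', 'GPT-5.4', 'Gemini 3.1 Pro']
print(exceeds_p80(40, "Sonnet 4.6"))  # True
```

A 40-minute job sits inside the horizon for four of the six models and past it for the other two, which is the case for modelling the fleet rather than a single agent.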
Real estimates from production use — including the misses.
The tool, estimating its own docs. We sized this v0.7.0 skill-and-README refresh at ~30 minutes. It took 28.
An honest over-estimate. We pre-registered a UI mockup build at ~95 minutes with no prior app-dev data. Two agents did it in parallel in 12 and 25 minutes — a 4–8x over-estimate. agent-estimate now ships an app_dev prior shaped by that result. The miss stays in the README because calibration means showing where you were wrong.
Two tasks, one model — what the tool prints, including the METR reliability flag:
$ agent-estimate estimate "Implement auth" "Add tests" --model opus
Task            Tier  PERT (O/M/P)  Expected  Human-eq
───────────────────────────────────────────────────────────
Implement auth  M     25/50/90m     57.8m     160m
Add tests       S     12/23/40m     24.0m     75m
Timeline ──────────────────────────────
best 37m · expected 81.8m · worst 130m
human-equivalent: 235m → 2.87× compression
⚠ METR warning: "Implement auth" exceeds Opus p80
~82 minutes expected versus ~4 hours by hand — plus a flag that the auth task runs past Opus's p80 reliability horizon, so you split it or add a checkpoint before dispatching.
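The timeline totals in the sample output are consistent with summing the per-task figures, as you would for tasks run back to back on one model. The sketch below reproduces that arithmetic; the summation rule is inferred from the printed numbers, not taken from the tool's source.

```python
# (optimistic, expected, pessimistic, human_equivalent) in minutes,
# taken from the sample output above.
tasks = {
    "Implement auth": (25, 57.8, 90, 160),
    "Add tests": (12, 24.0, 40, 75),
}

best = sum(t[0] for t in tasks.values())
expected = sum(t[1] for t in tasks.values())
worst = sum(t[2] for t in tasks.values())
human = sum(t[3] for t in tasks.values())

# Compression is the human-equivalent time divided by the expected agent time.
compression = human / expected

print(f"best {best}m, expected {expected:.1f}m, worst {worst}m")
# best 37m, expected 81.8m, worst 130m
print(f"human-equivalent: {human}m, {compression:.2f}x compression")
# human-equivalent: 235m, 2.87x compression
```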
Three tasks, three agents, in parallel:
$ agent-estimate estimate --file tasks.txt

| Metric | Value |
|---|---|
| Wave 0 | All 3 tasks in parallel (Claude + Codex + Gemini) |
| Expected case | 131m |
| Human-speed equivalent | 709.5m |
| Compression ratio | 5.42x |
| Estimated cost | $4.84 |
~2 hours wall-clock versus ~12 hours sequential. You see the compression before you commit the compute. More in examples/ — coding S/M, research, documentation, multi-agent.
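Why a parallel wave compresses wall-clock time can be shown with a small sketch. It assumes a wave finishes when its slowest task does; `wall_clock_minutes` and `overhead_minutes` are hypothetical names, the per-task durations are made up for illustration, and the tool's real scheduler may work differently. Only the 131m and 709.5m figures come from the table.

```python
def wall_clock_minutes(waves: list[list[float]], overhead_minutes: float = 0.0) -> float:
    """Sum the slowest task of each wave, plus a fixed overhead between waves."""
    if not waves:
        return 0.0
    slowest_per_wave = sum(max(wave) for wave in waves)
    return slowest_per_wave + overhead_minutes * (len(waves) - 1)


# Hypothetical expected minutes for one wave of three parallel tasks.
wave_0 = [131.0, 95.0, 60.0]

print(wall_clock_minutes([wave_0]))  # 131.0
print(sum(wave_0))                   # 286.0 if the same agent ran them in sequence

# Compression ratio from the table: human-speed equivalent / expected case.
print(round(709.5 / 131, 2))  # 5.42
```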
/plugin marketplace add kiloloop/agent-estimate
/plugin install agent-estimate@agent-estimate-marketplace
/estimate Add a login page with OAuth
/estimate --file spec.md
/estimate --issues 1,2,3 --repo myorg/myrepo
/validate-estimate observation.yaml
/calibrate
- uses: kiloloop/agent-estimate@v0
  with:
    issues: '11,12,14'

Full workflow example
name: Estimate
on:
  pull_request:
    types: [opened, synchronize]
permissions:
  contents: read
  pull-requests: write
jobs:
  estimate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: kiloloop/agent-estimate@v0
        with:
          issues: '11,12,14'
          output-mode: summary+pr-comment

Action inputs and outputs
| Input | Required | Default | Description |
|---|---|---|---|
| `issues` | yes | — | GitHub issue numbers (comma-separated) |
| `repo` | no | current repo | GitHub repo (owner/name) |
| `format` | no | `markdown` | Output format: `markdown` or `json` |
| `output-mode` | no | `summary` | `summary`, `pr-comment`, `step-output`, `summary+pr-comment` |
| `config` | no | — | Path to agent config YAML |
| `title` | no | `Agent Estimate Report` | Report title |
| `review-mode` | no | `standard` | Review tier: `none`, `standard`, `complex`, `3-round` |
| `spec-clarity` | no | `1.0` | Spec clarity modifier (0.3–1.3) |
| `warm-context` | no | `1.0` | Warm context modifier (0.3–1.15) |
| `agent-fit` | no | `1.0` | Agent fit modifier (0.9–1.2) |
| `task-type` | no | — | Category: `coding`, `brainstorm`, `research`, `config`, `documentation`, `frontend`, `app_dev` |
| `python-version` | no | `3.12` | Python version to use |
| `version` | no | latest | agent-estimate version to install |
| `token` | no | `${{ github.token }}` | GitHub token |

| Output | Description |
|---|---|
| `report` | Full estimation report content |
| `expected-minutes` | Expected minutes (when `format: json`) |
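The three modifier inputs each have a documented range and a default of 1.0. A pre-flight check along these lines can catch out-of-range values before a workflow run; `check_modifiers` is a hypothetical helper, and whether the tool itself rejects or clamps such values is not stated here.

```python
# Allowed ranges copied from the inputs table.
MODIFIER_RANGES = {
    "spec-clarity": (0.3, 1.3),
    "warm-context": (0.3, 1.15),
    "agent-fit": (0.9, 1.2),
}


def check_modifiers(modifiers: dict[str, float]) -> dict[str, float]:
    """Fill in the 1.0 default and reject values outside the documented ranges."""
    checked = {}
    for name, (low, high) in MODIFIER_RANGES.items():
        value = modifiers.get(name, 1.0)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
        checked[name] = value
    return checked


print(check_modifiers({"spec-clarity": 0.8}))
# {'spec-clarity': 0.8, 'warm-context': 1.0, 'agent-fit': 1.0}
```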
Skills follow the oacp-skills convention:
skills/estimate/
  skill.yaml        # machine-readable metadata
  README.md         # human-readable docs
  shared/INTENT.md  # shared intent across runtimes
  claude/SKILL.md   # Claude Code skill definition
  codex/SKILL.md    # Codex skill definition
Both runtime slices cover the same CLI (estimate, validate, calibrate), phrased for their respective ecosystems.
Pass a config to model your own fleet:
agents:
  - name: Claude
    capabilities: [planning, implementation, review]
    parallelism: 2
    cost_per_turn: 0.12
    model_tier: frontier
  - name: Codex
    capabilities: [implementation, debugging, testing]
    parallelism: 3
    cost_per_turn: 0.08
    model_tier: production
settings:
  friction_multiplier: 1.15
  inter_wave_overhead: 0.25
  review_overhead: 0.2
  metr_fallback_threshold: 45.0

agent-estimate estimate "Ship packaging flow" --config ./my_agents.yaml

agent-estimate estimate "Refactor auth pipeline" --format json # machine-readable
agent-estimate estimate --repo myorg/myrepo --issues 11,12,14 # from GitHub issues
agent-estimate estimate --file tasks.txt # from file

Validate estimates against observed outcomes and build a calibration database:

agent-estimate validate observation.yaml --db ~/.agent-estimate/calibration.db

- Website — landing page, live demo, and the estimate comparison view.
- OACP — coordinate the agents you just estimated. Open Agent Coordination Protocol for multi-agent async workflows.
- oacp-skills — the skill bundle agent-estimate's `/estimate` ships in.
- kiloloop — the rest of the ecosystem.
See CONTRIBUTING.md for the full workflow.
pip install -e '.[dev]'
ruff check .
pytest -q

Apache License 2.0