agent-estimate

Know before you build.

PERT estimates for AI-agent tasks — how long, which model's reliable enough, and the human-equivalent cost. In one command.

Website · Compare · PyPI

Why

AI agents can write the code — but how long will the task actually take? Manual estimation is slow and biased toward optimism; no estimate means scope creep and missed deadlines. The gap between "agents can do it" and "we know when it'll be done" is where projects break down.

agent-estimate closes that gap in one command: a three-point PERT timeline calibrated on real agent runs, plus a human-speed comparison so you see the compression before you spend the compute. It sizes the task, picks a tier, routes it to a model, and flags when the work runs past that model's reliability horizon — calibrated forecasts in seconds, not meetings.

Multi-model matters because the models aren't interchangeable. Opus 4.7, GPT-5.5, and Gemini 3.1 have different reliability horizons (METR p80) and different costs per turn. A safe 40-minute job for one model is a coin flip for another. agent-estimate models the whole fleet, not a single agent — so the number reflects who actually runs the work.

Quick Start

First estimate: 30 seconds to install. Every one after: instant.

With your agent (recommended)

Paste this into your Claude Code or Codex session:

Install the agent-estimate plugin (https://github.com/kiloloop/agent-estimate) and
estimate this task for me: "Implement OAuth 2.0 flow (Google + GitHub)". Tell me the
expected time, the human-speed equivalent, and the compression ratio.

Your agent installs the tool, runs the estimate, and reads back the numbers. Nothing to memorize — describe the task in plain English and let the agent translate to flags.

For a whole backlog:

Estimate every open issue in this repo with agent-estimate, group them into parallel
waves, and tell me the total wall-clock time for a 3-agent fleet versus doing them
sequentially myself.

Manual

pip install agent-estimate
agent-estimate estimate "your task description here"

No config required — sensible defaults for a 3-agent fleet (Claude, Codex, Gemini). Point it at a file or GitHub issues when you're ready:

agent-estimate estimate --file tasks.txt
agent-estimate estimate --repo myorg/myrepo --issues 11,12,14

How It Works

agent-estimate produces three-point PERT estimates calibrated for agents, not humans:

  • Tier classification — auto-sizes tasks XS→XL from complexity signals
  • PERT math — optimistic / most-likely / pessimistic, weighted to an expected value
  • Human comparison — a per-task-type multiplier, so you see the compression
  • METR thresholds — warns when an estimate exceeds a model's p80 reliability horizon
  • Wave planning — schedules independent tasks in parallel across the fleet
  • Review overhead — models review cycles as additive cost (standard, complex, 3-round)
  • Modifiers — --spec-clarity, --warm-context, and --agent-fit tune the estimate
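
The PERT step can be sketched with the textbook three-point weighting, (O + 4M + P) / 6. This is a minimal illustration, not the tool's implementation: the `ThreePoint` class and its method names are hypothetical, and agent-estimate may apply tier or review adjustments on top. For instance, the "Add tests" row in the Examples section (12/23/40m, expected 24.0m) matches this formula, while the "Implement auth" row (25/50/90m, expected 57.8m) does not follow from it alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreePoint:
    """Hypothetical container for a three-point estimate, in minutes."""

    optimistic: float
    most_likely: float
    pessimistic: float

    def expected(self) -> float:
        # Textbook PERT weighting: the most-likely value counts four times.
        return (self.optimistic + 4 * self.most_likely + self.pessimistic) / 6

    def std_dev(self) -> float:
        # Textbook PERT spread: one sixth of the optimistic-to-pessimistic range.
        return (self.pessimistic - self.optimistic) / 6


add_tests = ThreePoint(optimistic=12, most_likely=23, pessimistic=40)
print(f"expected {add_tests.expected():.1f}m, std dev {add_tests.std_dev():.1f}m")
```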

Task types

| Type | Flag | Covers |
|---|---|---|
| Coding | (default) | Feature work, fixes, refactors |
| Research | --type research | Audits, investigations, analysis |
| Documentation | --type documentation | API docs, guides, changelogs |
| Brainstorm | --type brainstorm | Ideation, spikes, design exploration |
| Config/SRE | --type config | Deploys, infra, CI/CD |
| Frontend/UI | --type frontend | Content patches vs. component builds |
| App dev | --type app_dev | App shells, desktop/mobile builds |

METR thresholds (defaults)

| Model | p80 threshold |
|---|---|
| Opus 4.7 | 90 min |
| GPT-5.5 | 90 min |
| GPT-5.4 | 60 min |
| Gemini 3.1 Pro | 45 min |
| Sonnet 4.6 | 30 min |
| Haiku 4.5 | 15 min |

opus_4_x is a forward-compatible alias that resolves to the current Opus threshold. Legacy keys (opus_4_6, GPT-5/5.2/5.3, Gemini 3 Pro, Sonnet) stay supported. Estimates are calibrated against Claude Code (Opus 4.7, high thinking) and Codex (GPT-5.4/5.5, extra-high) — shift with --spec-clarity and --warm-context for other setups.
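
A threshold check along these lines can be sketched as a lookup plus a comparison. The values below are copied from the defaults table; everything else is an assumption for illustration: the dictionary keys, the `metr_warning` helper, and the use of 45 minutes (the `metr_fallback_threshold` from the config example) for unlisted models. Which figure the tool compares against the threshold (expected, pessimistic, or something else) is not stated here, so the sketch takes the minutes as an argument.

```python
from typing import Optional

# Default p80 thresholds in minutes, copied from the table above.
METR_P80_MINUTES = {
    "Opus 4.7": 90,
    "GPT-5.5": 90,
    "GPT-5.4": 60,
    "Gemini 3.1 Pro": 45,
    "Sonnet 4.6": 30,
    "Haiku 4.5": 15,
}

# Assumption: unlisted models fall back to this value.
FALLBACK_MINUTES = 45.0


def metr_warning(task: str, minutes: float, model: str) -> Optional[str]:
    """Return a warning string when `minutes` exceeds the model's threshold."""
    threshold = METR_P80_MINUTES.get(model, FALLBACK_MINUTES)
    if minutes > threshold:
        return f'METR warning: "{task}" exceeds {model} p80 ({minutes:g}m > {threshold:g}m)'
    return None


print(metr_warning("Implement auth", 57.8, "Gemini 3.1 Pro"))
print(metr_warning("Add tests", 24.0, "Sonnet 4.6"))
```

A task that trips the check is a candidate for splitting or for an added checkpoint before dispatch.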

Examples

Real estimates from production use — including the misses.

The tool, estimating its own docs. We sized this v0.7.0 skill-and-README refresh at ~30 minutes. It took 28.

An honest over-estimate. We pre-registered a UI mockup build at ~95 minutes with no prior app-dev data. Two agents did it in parallel in 12 and 25 minutes — a 4–8x over-estimate. agent-estimate now ships an app_dev prior shaped by that result. The miss stays in the README because calibration means showing where you were wrong.
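
The size of those misses is simple arithmetic: estimated minutes divided by observed minutes. The sketch below uses only the numbers quoted above; the `estimate_ratio` helper and the "agent A" and "agent B" labels are hypothetical, and this is not the format the `validate` command expects.

```python
def estimate_ratio(estimated_minutes: float, actual_minutes: float) -> float:
    """Ratio above 1.0 means the estimate was too high; below 1.0, too low."""
    if actual_minutes <= 0:
        raise ValueError("actual_minutes must be positive")
    return estimated_minutes / actual_minutes


# (label, estimated minutes, observed minutes), taken from the two stories above.
runs = [
    ("README refresh", 30, 28),
    ("UI mockup, agent A", 95, 12),
    ("UI mockup, agent B", 95, 25),
]

for label, estimated, actual in runs:
    print(f"{label}: {estimate_ratio(estimated, actual):.1f}x")
```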

Two tasks, one model — what the tool prints, including the METR reliability flag:

$ agent-estimate estimate "Implement auth" "Add tests" --model opus

Task             Tier   PERT (O/M/P)    Expected   Human-eq
───────────────────────────────────────────────────────────
Implement auth   M      25/50/90m       57.8m      160m
Add tests        S      12/23/40m       24.0m       75m

Timeline ──────────────────────────────
  best 37m   ·   expected 81.8m   ·   worst 130m
  human-equivalent: 235m  →  2.87× compression

  ⚠ METR warning: "Implement auth" exceeds Opus p80

~82 minutes expected versus ~4 hours by hand — plus a flag that the auth task runs past Opus's p80 reliability horizon, so you split it or add a checkpoint before dispatching.
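
The timeline figures in that output can be reproduced from the per-task rows. This is a sketch that assumes the two tasks run back to back on one model, so their times add; the `summarize` helper is hypothetical, and the expected values are taken from the table as printed rather than recomputed.

```python
# (task, optimistic, pessimistic, expected, human-equivalent), all in minutes,
# copied from the example output above.
rows = [
    ("Implement auth", 25, 90, 57.8, 160),
    ("Add tests", 12, 40, 24.0, 75),
]


def summarize(rows):
    """Sum each column and derive the compression ratio."""
    best = sum(r[1] for r in rows)
    worst = sum(r[2] for r in rows)
    expected = sum(r[3] for r in rows)
    human = sum(r[4] for r in rows)
    return {
        "best": best,
        "expected": round(expected, 1),
        "worst": worst,
        "human_equivalent": human,
        # Compression: human-equivalent time divided by expected agent time.
        "compression": round(human / expected, 2),
    }


summary = summarize(rows)
print(summary)
```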

Three tasks, three agents, in parallel:

$ agent-estimate estimate --file tasks.txt

| Metric | Value |
|---|---|
| Wave 0 | All 3 tasks in parallel (Claude + Codex + Gemini) |
| Expected case | 131m |
| Human-speed equivalent | 709.5m |
| Compression ratio | 5.42x |
| Estimated cost | $4.84 |

~2 hours wall-clock versus ~12 hours sequential. You see the compression before you commit the compute. More in examples/ — coding S/M, research, documentation, multi-agent.
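
Why parallel waves shorten wall-clock time can be shown with a small sketch. It is an illustration of the general idea, not the tool's scheduler: task names, durations, and the `plan_waves` and `wall_clock` helpers are hypothetical, and the sketch ignores per-agent parallelism limits, inter-wave overhead, and review cost. Each task lands in the first wave after all of its dependencies, and a wave takes as long as its slowest task.

```python
from collections import defaultdict


def plan_waves(durations, depends_on):
    """Group tasks into waves by dependency depth (wave 0 has no dependencies)."""
    wave_of = {}

    def wave(task, path=()):
        if task in path:
            raise ValueError(f"dependency cycle at {task!r}")
        if task not in wave_of:
            deps = depends_on.get(task, [])
            wave_of[task] = 1 + max((wave(d, path + (task,)) for d in deps), default=-1)
        return wave_of[task]

    waves = defaultdict(list)
    for task in durations:
        waves[wave(task)].append(task)
    return [waves[i] for i in sorted(waves)]


def wall_clock(durations, waves):
    """Each wave lasts as long as its slowest task; waves run one after another."""
    return sum(max(durations[t] for t in w) for w in waves)


# Hypothetical expected minutes per task.
durations = {"api": 40.0, "ui": 55.0, "docs": 20.0, "e2e": 30.0}
depends_on = {"e2e": ["api", "ui"]}

waves = plan_waves(durations, depends_on)
print(waves)
print("parallel:", wall_clock(durations, waves), "sequential:", sum(durations.values()))
```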

Integrations

Claude Code plugin

/plugin marketplace add kiloloop/agent-estimate
/plugin install agent-estimate@agent-estimate-marketplace
/estimate Add a login page with OAuth
/estimate --file spec.md
/estimate --issues 1,2,3 --repo myorg/myrepo
/validate-estimate observation.yaml
/calibrate

GitHub Action

- uses: kiloloop/agent-estimate@v0
  with:
    issues: '11,12,14'

Full workflow example

name: Estimate
on:
  pull_request:
    types: [opened, synchronize]

permissions:
  contents: read
  pull-requests: write

jobs:
  estimate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: kiloloop/agent-estimate@v0
        with:
          issues: '11,12,14'
          output-mode: summary+pr-comment

Action inputs and outputs

| Input | Required | Default | Description |
|---|---|---|---|
| issues | yes | | GitHub issue numbers (comma-separated) |
| repo | no | current repo | GitHub repo (owner/name) |
| format | no | markdown | Output format: markdown or json |
| output-mode | no | summary | summary, pr-comment, step-output, summary+pr-comment |
| config | no | | Path to agent config YAML |
| title | no | Agent Estimate Report | Report title |
| review-mode | no | standard | Review tier: none, standard, complex, 3-round |
| spec-clarity | no | 1.0 | Spec clarity modifier (0.3–1.3) |
| warm-context | no | 1.0 | Warm context modifier (0.3–1.15) |
| agent-fit | no | 1.0 | Agent fit modifier (0.9–1.2) |
| task-type | no | | Category: coding, brainstorm, research, config, documentation, frontend, app_dev |
| python-version | no | 3.12 | Python version to use |
| version | no | latest | agent-estimate version to install |
| token | no | ${{ github.token }} | GitHub token |

| Output | Description |
|---|---|
| report | Full estimation report content |
| expected-minutes | Expected minutes (when format: json) |
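
The three modifier ranges listed above can be sketched as validated multipliers on a base estimate. The ranges come from the inputs table; the rest is an assumption for illustration: the `apply_modifiers` helper is hypothetical, combining the modifiers by multiplication is a guess, and whether a value above 1.0 lengthens or shortens the estimate is not stated here.

```python
# Allowed ranges, copied from the inputs table above.
MODIFIER_RANGES = {
    "spec_clarity": (0.3, 1.3),
    "warm_context": (0.3, 1.15),
    "agent_fit": (0.9, 1.2),
}


def apply_modifiers(base_minutes: float, **modifiers: float) -> float:
    """Multiply the base estimate by each modifier after a range check."""
    result = base_minutes
    for name, value in modifiers.items():
        if name not in MODIFIER_RANGES:
            raise KeyError(f"unknown modifier: {name}")
        low, high = MODIFIER_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
        # Assumption: modifiers combine multiplicatively.
        result *= value
    return result


print(apply_modifiers(60.0, spec_clarity=1.3, warm_context=1.0, agent_fit=1.0))
```

Rejecting out-of-range values, rather than silently clamping them, keeps a typo such as 13 instead of 1.3 from skewing a report.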

Skill layout

Skills follow the oacp-skills convention:

skills/estimate/
  skill.yaml            # machine-readable metadata
  README.md             # human-readable docs
  shared/INTENT.md      # shared intent across runtimes
  claude/SKILL.md       # Claude Code skill definition
  codex/SKILL.md        # Codex skill definition

Both runtime slices cover the same CLI (estimate, validate, calibrate), phrased for their respective ecosystems.

Configuration

Agent fleet

Pass a config to model your own fleet:

agents:
  - name: Claude
    capabilities: [planning, implementation, review]
    parallelism: 2
    cost_per_turn: 0.12
    model_tier: frontier
  - name: Codex
    capabilities: [implementation, debugging, testing]
    parallelism: 3
    cost_per_turn: 0.08
    model_tier: production
settings:
  friction_multiplier: 1.15
  inter_wave_overhead: 0.25
  review_overhead: 0.2
  metr_fallback_threshold: 45.0

agent-estimate estimate "Ship packaging flow" --config ./my_agents.yaml

Output formats

agent-estimate estimate "Refactor auth pipeline" --format json   # machine-readable
agent-estimate estimate --repo myorg/myrepo --issues 11,12,14    # from GitHub issues
agent-estimate estimate --file tasks.txt                          # from file

Calibration

Validate estimates against observed outcomes and build a calibration database:

agent-estimate validate observation.yaml --db ~/.agent-estimate/calibration.db

Project

  • Website — landing page, live demo, and the estimate comparison view.
  • OACP — coordinate the agents you just estimated. Open Agent Coordination Protocol for multi-agent async workflows.
  • oacp-skills — the skill bundle agent-estimate's /estimate ships in.
  • kiloloop — the rest of the ecosystem.

Contributing

See CONTRIBUTING.md for the full workflow.

pip install -e '.[dev]'
ruff check .
pytest -q

License

Apache License 2.0

About

The first open-source effort estimation tool built for AI coding agents. PERT + METR + wave planning.
