A small benchmark for agent skills, verification artifacts, and fresh-session resumability.
A benchmark for the awkward middle of agentic work: not whether an agent can produce code once, but whether it leaves enough verified context for the next agent to trust and continue it.
This is not a universal coding-agent leaderboard.
It is a narrower benchmark for whether workflow skills improve audit trails, verification evidence, and fresh-session resumability on tasks where generic agents may pass public checks while missing hidden contracts.
For a compact summary, see docs/overview.md.
- A pilot harness for workflow-sensitive agent work.
- A place to test whether an agent can leave durable verification evidence, not just a passing patch.
- A benchmark with multiple current tasks: a bugfix/resume task, a data-trust task, an activation-migration task, and a scope-pressure export task.
- Not a universal coding-agent leaderboard.
- Not proof that skill packs broadly outperform baseline.
- Not proof that workflow skills guarantee functional correctness.
- Not proof that these tasks generalize to all coding work.
- Not a proof of broad agent superiority.
- The visible task fixes a duplicated-join churn bug.
- The benchmark checks whether the fix is durable across resume contexts and whether the agent leaves useful verification evidence.
- The visible task audits suspicious campaign-lift data without turning it into an overconfident causal story.
- The benchmark checks whether the agent preserves blockers, refuses unsupported claims, and leaves interpretable evidence.
- Planned / under construction. See docs/task6.md.
- Planned / under construction. See docs/task7.md.
- Control arm.
- No workflow skill pack.
- Workflow-skills arm.
- Intended to produce stronger audit trails, verification context, and resumability artifacts.
- E-arm setup and validation: docs/skill-arm-setup.md
Proven by the current pilot:
- Task 4: the artifact/resume mechanism works.
- Task 5: the public-pass/hidden-fail data-trust trap works.
- The E arm can be runtime-proven and artifact-producing.
Not proven:
- skill packs broadly outperform baseline.
- skills guarantee functional correctness.
- these tasks generalize to all coding work.
This table reflects the current pilot, not a universal result set.
| Task | Arm | Scorecard shape | Artifact mechanism | Reading |
|---|---|---|---|---|
| Task 4 | A baseline | green | inactive | Functional pass without workflow-skill artifacts. |
| Task 4 | E ai-engineering-skills | green | active | Functional pass with workflow artifacts and resume support. |
| Task 5 | A baseline | initial_fail / hidden fail | inactive | Expected negative result from the public-pass / hidden-fail trap. |
| Task 5 | E ai-engineering-skills | initial_fail / skill proof + artifacts / hidden fail | inactive | Skill proof and workflow artifacts are present, but the hidden trust gate still fails. |
Run the benchmark tests:
python -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"
python -m pytest benchmark_harness/tests -qRun one task through the smoke harness:
TASK_SLUG=04-impossible-churn \
ARM_SLUG=A-baseline \
RUN_ID=v04pilot_04-bugfix_A_r1 \
./tools/pilot_smoke.sh auto-a-r1For Task 5 details and run examples, see docs/task5.md. For Task 7 details and run examples, see docs/task7.md.
Summarize bundles:
python -m benchmark_harness.scorecard <bundles...>The scorecard accepts both *-eval-bundle.tar.gz and *-initial-fail-bundle.tar.gz.
- The pilot is intentionally narrow.
- The current task set is designed to surface workflow and verification behavior, not to rank all agents on all coding work.
- Task 5 yellow rows are useful negative results, not broken scorecard rows.
- Generated artifacts, bundles, and local caches should stay out of source control.
- Add more tasks that stress different workflow skills.
- Keep public verification and assessment checks separate.
- Continue publishing scorecards and bundles as generated artifacts, not source.
- Improve launch hygiene so the source tree stays easy to inspect and reuse.
- PyPI publishing is deferred until the repo exposes a stable CLI and package-data story.