Display Name
AI Workflow Benchmark (AWB)
Category
Tooling
Sub-Category
General
Primary Link
https://github.com/xmpuspus/ai-workflow-benchmark
Author Name
xmpuspus
Author Link
https://github.com/xmpuspus
License
MIT
Other License
No response
Description
AWB evaluates whether an AI coding workflow can safely ship real software, not whether a model passes a static issue benchmark. It runs Claude Code (vanilla and custom-config adapters) and eight other AI coding tools against 100 tasks across 8 categories, using real OSS repositories pinned at commit SHAs. Each run is scored on seven capability dimensions plus a derived cost-discipline measure.
Validate Claims
Install AWB, then run a single task against your Claude Code setup:
pip install awb
awb run --tasks BUG-001 --adapter claude-code-vanilla
awb trace grade results/runs/<run_dir>/
The trace grade scores four shipping disciplines from the OpenTelemetry-aligned trace JSONL: read-tests-before-edit, ran-verification-after-change, no-out-of-scope-edits, no-repeated-failing-command-loop. v1.2 also emits a task-set hash on every result so the exact scored task set is verifiable.
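To make the disciplines concrete, here is a minimal sketch of how two of them (read-tests-before-edit and no-repeated-failing-command-loop) could be checked over a trace JSONL. The event schema (`kind`, `path`, `command`, `exit_code`), the test-path heuristic, and the `max_repeats` threshold are all assumptions made for illustration; they are not AWB's actual trace format or grading logic.

```python
import json

# Hypothetical event schema (an assumption, not AWB's actual trace format):
#   {"kind": "read" | "edit" | "command", "path": "...", "command": "...", "exit_code": 0}


def parse_trace(jsonl_text):
    """Parse JSONL text into a list of event dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def is_test_path(path):
    """Crude heuristic for test files; real projects may need a richer rule."""
    name = path.rsplit("/", 1)[-1]
    return "/tests/" in f"/{path}" or name.startswith("test_") or name.endswith("_test.py")


def read_tests_before_edit(events):
    """True if a test file was read before the first edit, or if nothing was edited."""
    for event in events:
        if event.get("kind") == "read" and is_test_path(event.get("path", "")):
            return True
        if event.get("kind") == "edit":
            return False
    return True


def no_repeated_failing_command_loop(events, max_repeats=3):
    """True unless the same command fails max_repeats times in a row.

    Only command events are considered; reads and edits in between
    do not reset the streak.
    """
    streak, last = 0, None
    for event in events:
        if event.get("kind") != "command":
            continue
        failed = event.get("exit_code", 0) != 0
        if failed and event.get("command") == last:
            streak += 1
        elif failed:
            streak, last = 1, event.get("command")
        else:
            streak, last = 0, None
        if streak >= max_repeats:
            return False
    return True
```

Each check returns a plain boolean per run, so the results can be aggregated across tasks however the grader chooses; the actual `awb trace grade` output may be structured differently.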
Specific Task(s)
Benchmark a custom Claude Code workflow (with a tuned CLAUDE.md, hooks, and structured subagents) against vanilla Claude Code, on the AWB fast-check task set (8 tasks, ~15 minutes, ~$4).
Specific Prompt(s)
awb warmup
awb run --fast-check claude-code-vanilla
awb run --fast-check claude-code-custom
awb leaderboard --readiness
Compare the Production Readiness Score and the per-capability profile between the two configurations.
Additional Comments
AWB was built because the 2025 Stack Overflow Developer Survey shows 84% of developers using AI in their workflow but only 33% trusting it, and METR's RCT of 16 OSS maintainers found that AI tooling increased task completion time by 19% while the developers self-reported a 20% speedup. AWB benchmarks the full stack (tool, configuration, workflow, and model), not just model capability. Claude Code is a first-class tier, with both vanilla and custom-config adapters shipped.
Recommendation Checklist