[Resource]: AI Workflow Benchmark (AWB) #1879

@xmpuspus

Display Name

AI Workflow Benchmark (AWB)

Category

Tooling

Sub-Category

General

Primary Link

https://github.com/xmpuspus/ai-workflow-benchmark

Author Name

xmpuspus

Author Link

https://github.com/xmpuspus

License

MIT

Other License

No response

Description

AWB evaluates whether an AI coding workflow can safely ship real software, not whether a model passes a static issue benchmark. It runs Claude Code (vanilla and custom-config adapters) and eight other AI coding tools against 100 tasks in 8 categories, drawn from real OSS repositories pinned at commit SHAs, and scores each run on seven capability dimensions plus derived cost discipline.

Validate Claims

Install AWB, then run a single task against your Claude Code setup:

pip install awb
awb run --tasks BUG-001 --adapter claude-code-vanilla
awb trace grade results/runs/<run_dir>/

The trace grade scores four shipping disciplines from the OpenTelemetry-aligned trace JSONL: read-tests-before-edit, ran-verification-after-change, no-out-of-scope-edits, and no-repeated-failing-command-loop. v1.2 also emits a task-set hash on every result so the exact scored task set is verifiable.
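
To make the discipline names concrete, here is a minimal Python sketch of how three of them could be checked over a JSONL trace. It is not AWB's grader. The event schema (`action`, `path`, `command`, `exit_code`), the test-path heuristic, and the repeat threshold of 3 are all assumptions made for illustration; the real trace format and scoring rules may differ.

```python
import json
from pathlib import PurePosixPath

# Hypothetical trace: one JSON event per line. This schema is an assumption,
# not AWB's actual OpenTelemetry-aligned format.
SAMPLE_TRACE = """\
{"action": "read", "path": "tests/test_parser.py"}
{"action": "edit", "path": "src/parser.py"}
{"action": "run", "command": "pytest -q", "exit_code": 0}
"""


def parse_trace(text: str) -> list[dict]:
    """Parse JSONL text into a list of event dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_test_path(path: str) -> bool:
    """Heuristic (assumed): a file under tests/ or named test_*."""
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def read_tests_before_edit(events: list[dict]) -> bool:
    """True if a test file was read before the first edit, or nothing was edited."""
    seen_test_read = False
    for event in events:
        if event["action"] == "read" and is_test_path(event["path"]):
            seen_test_read = True
        elif event["action"] == "edit":
            return seen_test_read
    return True


def ran_verification_after_change(events: list[dict]) -> bool:
    """True if some command ran after the last edit, or nothing was edited."""
    edit_positions = [i for i, e in enumerate(events) if e["action"] == "edit"]
    if not edit_positions:
        return True
    return any(e["action"] == "run" for e in events[edit_positions[-1] + 1:])


def no_repeated_failing_command_loop(events: list[dict], max_repeats: int = 3) -> bool:
    """False if the same failing command runs max_repeats times in a row."""
    streak_command, streak = None, 0
    for event in events:
        if event["action"] != "run":
            continue  # reads and edits between runs do not reset the streak
        if event["exit_code"] == 0:
            streak_command, streak = None, 0
        elif event["command"] == streak_command:
            streak += 1
        else:
            streak_command, streak = event["command"], 1
        if streak >= max_repeats:
            return False
    return True


events = parse_trace(SAMPLE_TRACE)
print(read_tests_before_edit(events))
print(ran_verification_after_change(events))
print(no_repeated_failing_command_loop(events))
```

Each check is a pure function over the ordered event list, so a single discipline can be tested in isolation with a hand-written trace.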

Specific Task(s)

Benchmark a custom Claude Code workflow (with a tuned CLAUDE.md, hooks, and structured subagents) against vanilla Claude Code, on the AWB fast-check task set (8 tasks, ~15 minutes, ~$4).

Specific Prompt(s)

awb warmup
awb run --fast-check claude-code-vanilla
awb run --fast-check claude-code-custom
awb leaderboard --readiness

Compare the Production Readiness Score and the per-capability profile between the two configurations.
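
If the per-capability profiles are copied out of the leaderboard by hand, the comparison can be reduced to per-dimension deltas. The sketch below assumes each profile is a mapping from dimension name to numeric score. The dimension names and values are placeholders, not AWB output, and `profile_delta` is a hypothetical helper rather than part of the `awb` CLI.

```python
def profile_delta(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-dimension difference (candidate minus baseline) over shared dimensions."""
    shared = baseline.keys() & candidate.keys()
    return {name: round(candidate[name] - baseline[name], 2) for name in sorted(shared)}


# Placeholder numbers for illustration only; these are NOT real AWB results.
vanilla = {"dimension_a": 0.50, "dimension_b": 0.75}
custom = {"dimension_a": 0.75, "dimension_b": 0.50}

for name, delta in profile_delta(vanilla, custom).items():
    print(f"{name}: {delta:+.2f}")
```

A positive delta means the custom configuration scored higher on that dimension. Dimensions present in only one profile are dropped rather than treated as zero.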

Additional Comments

AWB was built because the 2025 Stack Overflow Developer Survey shows 84% of developers using AI in their workflow but only 33% trusting it, and because METR's RCT of 16 OSS maintainers found that AI tooling increased task completion time by 19% while developers self-reported a 20% speedup. AWB benchmarks the full stack (tool, configuration, workflow, and model), not just model capability. Claude Code is a first-class tier, with both vanilla and custom-config adapters shipped.

Recommendation Checklist

  • I have checked that this resource hasn't already been submitted
  • It has been over one week since the first public commit to the repo I am recommending
  • All provided links are working and publicly accessible
  • I do NOT have any other open issues in this repository
  • I am primarily composed of human-y stuff and not electrical circuits

Labels

  • resource-submission: This issue submits a new resource to the list
  • validation-passed: Resource has passed initial validation
