[Resource]: AI Workflow Benchmark (AWB) #1879

### Display Name

AI Workflow Benchmark (AWB)

### Category

Tooling

### Sub-Category

General

### Primary Link

https://github.com/xmpuspus/ai-workflow-benchmark

### Author Name

xmpuspus

### Author Link

https://github.com/xmpuspus

### License

MIT

### Other License

_No response_

### Description

AWB evaluates whether an AI coding workflow can safely ship real software, not whether a model passes a static issue benchmark. It runs Claude Code (vanilla and custom-config adapters) and eight other AI coding tools against 100 tasks across 8 categories on real OSS repositories pinned at commit SHAs, scoring across seven capability dimensions plus derived cost discipline.

### Validate Claims

Install AWB, then run a single task against your Claude Code setup:

pip install awb
awb run --tasks BUG-001 --adapter claude-code-vanilla
awb trace grade results/runs/<run_dir>/

The trace grade command scores four shipping disciplines from the OpenTelemetry-aligned trace JSONL: read-tests-before-edit, ran-verification-after-change, no-out-of-scope-edits, and no-repeated-failing-command-loop. As of v1.2, every result also carries a task-set hash, so the exact scored task set can be verified.
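
To make the idea of trace-derived disciplines concrete, here is a minimal Python sketch of how three of the four checks could be computed from a JSONL trace. The event schema (the `type`, `path`, `command`, and `exit_code` fields), the function names, and the `max_repeats` threshold are all assumptions for illustration; they are not AWB's actual trace format or grader.

```python
import json

# Hypothetical trace, one JSON object per line (JSONL).
# Field names are illustrative assumptions, not AWB's real schema.
SAMPLE_TRACE = """\
{"type": "read", "path": "tests/test_parser.py"}
{"type": "edit", "path": "src/parser.py"}
{"type": "command", "command": "pytest -q", "exit_code": 1}
{"type": "edit", "path": "src/parser.py"}
{"type": "command", "command": "pytest -q", "exit_code": 0}
"""


def load_events(jsonl_text):
    """Parse JSONL text into a list of event dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def read_tests_before_edit(events):
    """True if a file under tests/ was read before the first edit."""
    for event in events:
        if event["type"] == "read" and event["path"].startswith("tests/"):
            return True
        if event["type"] == "edit":
            return False
    return True  # no edits at all, so nothing was violated


def ran_verification_after_change(events):
    """True if at least one command ran after the last edit."""
    last_edit = max(
        (i for i, e in enumerate(events) if e["type"] == "edit"), default=-1
    )
    if last_edit == -1:
        return True
    return any(e["type"] == "command" for e in events[last_edit + 1:])


def no_repeated_failing_command_loop(events, max_repeats=3):
    """True unless the same command fails max_repeats times with no edit between."""
    streak, last_command = 0, None
    for event in events:
        if event["type"] == "edit":
            streak, last_command = 0, None
        elif event["type"] == "command":
            if event["exit_code"] == 0:
                streak, last_command = 0, None
            elif event["command"] == last_command:
                streak += 1
            else:
                streak, last_command = 1, event["command"]
            if streak >= max_repeats:
                return False
    return True


events = load_events(SAMPLE_TRACE)
report = {
    "read-tests-before-edit": read_tests_before_edit(events),
    "ran-verification-after-change": ran_verification_after_change(events),
    "no-repeated-failing-command-loop": no_repeated_failing_command_loop(events),
}
```

The fourth discipline, no-out-of-scope-edits, would additionally need a per-task list of allowed paths, which this sketch does not model.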

### Specific Task(s)

Benchmark a custom Claude Code workflow (with a tuned CLAUDE.md, hooks, and structured subagents) against vanilla Claude Code, on the AWB fast-check task set (8 tasks, ~15 minutes, ~$4).



### Specific Prompt(s)

awb warmup
awb run --fast-check claude-code-vanilla
awb run --fast-check claude-code-custom
awb leaderboard --readiness

Compare the Production Readiness Score and the per-capability profile between the two configurations.
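
One way to make that comparison mechanical is to diff the two capability profiles once the scores are in hand. The sketch below assumes each profile is a plain mapping from capability name to a numeric score; the capability names and the numbers are placeholders, not AWB dimensions or real results, and `profile_delta` is a hypothetical helper rather than part of the awb CLI.

```python
def profile_delta(baseline, candidate):
    """Return candidate minus baseline for each capability present in both."""
    return {
        name: round(candidate[name] - baseline[name], 3)
        for name in baseline
        if name in candidate
    }


# Placeholder scores for illustration only; not measured results.
vanilla = {"capability_a": 0.50, "capability_b": 0.75}
custom = {"capability_a": 0.625, "capability_b": 0.75}

delta = profile_delta(vanilla, custom)

# Capabilities where the custom configuration scored lower than vanilla.
regressions = [name for name, change in delta.items() if change < 0]
```

Looking at per-capability deltas rather than a single aggregate makes it easier to spot a configuration that raises the overall score while regressing on one dimension.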

### Additional Comments

AWB was built because the 2025 Stack Overflow Developer Survey shows 84% of developers using AI in their workflow but only 33% trusting it, and because METR's RCT of 16 OSS maintainers found that AI tooling increased task completion time by 19% while the developers self-reported a 20% speedup. AWB benchmarks the full stack (tool, configuration, workflow, and model), not just model capability. Claude Code is a first-class tier, with both vanilla and custom-config adapters shipped.



### Recommendation Checklist

- [x] I have checked that this resource hasn't already been submitted
- [x] It has been over one week since the first public commit to the repo I am recommending
- [x] All provided links are working and publicly accessible
- [x] I do NOT have any other open issues in this repository
- [x] I am primarily composed of human-y stuff and not electrical circuits
