Date: 2026-03-27
Version: 2.0 (addresses ccz review: 5/10 → target 8+/10)
Repo: agent-next/agent-driven (new)
Research: 20 parallel agents, 50+ OSS frameworks, 21 Anthropic + 15 OpenAI blog posts, cc-manager source audit, 14 days of real dev data (427 commits, 80+ agents at peak)
NOTE: Star counts and statistics in this doc are estimates from public GitHub/npm data as of 2026-03. See docs/research/ for detailed methodology.
One person + this system = a top-level R&D team.
A reusable, project-agnostic scaffold that makes any codebase agent-ready. Built on Claude Code + OpenAI Codex. COMPOSE existing tools (superpowers 118K★, gstack 52K★), don't rebuild. Custom only where no tool exists.
| Problem | Root Cause | Evidence |
|---|---|---|
| Agents crash mid-task | No checkpoint, no recovery | cc-manager internal data: 43-50% task success rate (see docs/research/) |
| Agents drift off-track | Fire-and-forget, no mid-step verification | Est. 85% per-step success compounds to ~20% over 10 steps (0.85^10 ≈ 0.20) |
| Guidelines ignored | CLAUDE.md = suggestions, no enforcement | Observational: agents sometimes skip CLAUDE.md rules; hooks (exit code 2) enforce them |
| Context degrades | No rotation protocol | Empirical observation: agent output quality degrades as context fills; 65% chosen as proactive threshold |
| No visibility | Can't measure success rate, cost, or quality | "You can't hit a target you can't see" |
| Scaffold not portable | CTO skill hardcoded to labclaw | Can't init a new project |
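The compound-drift figure in the table is easy to check (the 85% per-step rate is the table's own estimate, not a measured value):

```python
# Estimated per-step success rate (an estimate from the table, not a measurement)
p_step = 0.85
steps = 10

# Without per-step verification, an end-to-end run succeeds only if every step does
p_end_to_end = p_step ** steps
print(f"{p_end_to_end:.1%}")  # → 19.7%, i.e. the ~20% quoted above
```

This is the whole case for Gate 2 below: verifying each step resets the failure probability instead of letting it compound.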
- COMPOSE not BUILD — Use superpowers (TDD, planning), gstack (QA, ship, review), existing CC hooks. Build ONLY what doesn't exist: coordination layer + observability.
- Guardrails not guidelines — Hooks exit code 2 = blocked. CLAUDE.md = context only.
- Verify at every step — lint after edit, test after commit, review after PR. Never batch verification.
- 65% context rotation — Rotate proactively at 60-65% usage, not at 80% auto-compaction. Structured handover.
- Observe everything — Log agent actions, measure outcomes, track cost. No unmeasured targets.
- Map not manual — CLAUDE.md < 200 lines. Structured docs/ directory. Path-scoped rules.
- Dual engine — CC Opus (reasoning, review, coordination) + Codex GPT-5.4 (parallel implementation). Cross-engine review mandatory.
| Tool | Est. Installs | Covers | Our Action |
|---|---|---|---|
| superpowers | ~118K (npm, est.) | TDD, debugging, planning, brainstorming, code review, verification | USE as-is |
| gstack | ~52K (npm, est.) | Sprint lifecycle: CEO/eng/design review, QA, ship, deploy, retro | USE as-is |
| feature-dev | ~89K (npm, est.) | 7-phase guided feature dev with 3 agents | USE as-is |
| code-review | ~50K (npm, est.) | Multi-agent parallel PR review | USE as-is |
| pr-review-toolkit | installed | Silent-failure-hunter, type-design, test-analyzer | USE as-is |
| context7 | ~72K (npm, est.) | Live library docs in context | USE as-is |
| ralph-loop | ~57K (npm, est.) | Autonomous multi-hour coding sessions | USE for /overnight |
| Component | Why It Doesn't Exist | Effort |
|---|---|---|
| Coordinator agent | Routes tasks to right engine/model/skill. No tool does dual-engine routing. | 15h |
| Observability hooks | Agent outcome metrics, action tracing, cost tracking. CC hooks exist but nobody ships a pre-built observability kit. | 20h |
| Context management protocol | 65% rotation, handover docs, session boundaries. Research-backed but no tool implements it. | 10h |
| AGENTS.md + CLAUDE.md templates | Project-type-specific templates (Python/FastAPI, React/Next.js, etc.) with <200 line discipline. | 10h |
| Event-driven triggers | git post-receive → dispatch agent, CI failure → dispatch fixer. Simple shell hooks, not a framework. | 10h |
| Structured agent memory | Episodic (what worked/failed) + procedural (learned workflows). Beyond flat MEMORY.md. | 15h |
cc-manager v2 is a separate project with its own spec, not part of this scaffold. This scaffold works WITHOUT cc-manager. cc-manager is an optional acceleration layer for heavy parallel work (10+ tasks).
cc-manager v2 spec will be written separately and tracked at agent-next/cc-manager.
┌─────────────────────────────────────────────────────────────────┐
│ agent-next/agent-driven │
│ Portable scaffold — works for ANY project │
│ │
│ .claude/ │
│ ├── agents/ Coordinator, Implementer, Reviewer, Tester │
│ ├── skills/ /init, /plan (→superpowers), /review-all │
│ │ /dispatch, /ship (→gstack), /overnight │
│ ├── rules/ Quality, git workflow, security, context mgmt │
│ ├── hooks/ Guardrails (12 hooks covering 8 lifecycle │
│ │ events, all exit-code-2 capable) │
│ ├── docs/ Structured map (ARCHITECTURE, CONVENTIONS, │
│ │ WORKFLOW, PROGRESS) │
│ └── templates/ CLAUDE.md + AGENTS.md by project type │
│ │
│ Observability Layer │
│ ├── .claude/metrics/ Agent outcome logs (JSON-lines) │
│ ├── .claude/traces/ Action traces per session │
│ └── .claude/memory/ Structured episodic + procedural │
│ │
│ Optional: cc-manager v2 (separate repo, separate spec) │
│ └── Called via /dispatch skill when task count > threshold │
└─────────────────────────────────────────────────────────────────┘
The scaffold works standalone using CC native features:
- 1-3 tasks: CC subagents with isolation: worktree
- 4-6 tasks: CC Agent Teams with shared task list
- 7+ tasks: Sequential waves (dispatch 3, wait, merge, dispatch next 3)
- 10+ tasks: Requires cc-manager v2 (optional upgrade)
This resolves the ccz review's "what if cc-manager isn't available?" concern.
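The tiers above reduce to a one-function routing rule (a sketch; the mode labels are illustrative, not CC identifiers):

```python
def dispatch_mode(task_count: int) -> str:
    """Pick the standalone dispatch tier for a wave of tasks (labels illustrative)."""
    if task_count <= 3:
        return "cc-subagents-worktree"   # CC subagents with isolation: worktree
    if task_count <= 6:
        return "cc-agent-teams"          # shared task list
    if task_count <= 9:
        return "sequential-waves"        # dispatch 3, wait, merge, repeat
    return "cc-manager-v2"               # optional upgrade for 10+ tasks
```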
name: coordinator
description: Routes tasks to the right engine, model, and skill. Manages dispatch and merge decisions.
model: opus
permissionMode: default
tools: [Read, Glob, Grep, Bash, Agent, TaskCreate, TaskUpdate, TaskList]
memory: project
skills: [superpowers:dispatching-parallel-agents, superpowers:writing-plans]

- Reads task/spec → classifies complexity → routes to implementer (Codex) or CC subagent
- Manages wave ordering (dependency-aware)
- Triggers cross-engine review after implementation
- Logs all decisions to .claude/traces/
name: implementer
description: Focused code implementation. One task per agent. Commits after each passing test.
model: inherit
isolation: worktree
maxTurns: 50
tools: [Read, Write, Edit, Bash, Glob, Grep]
hooks:
PostToolUse:
- matcher: "Edit|Write"
hooks: [{type: command, command: "ruff check --fix $FILE && ruff format $FILE", timeout: 10}]
Stop:
- hooks: [{type: command, command: "pytest --tb=short -q 2>/dev/null; echo exit:$?"}]

- Runs lint after every edit (hook-enforced, not guideline)
- Runs tests on stop (hook-enforced)
- One focused task, one worktree
name: reviewer
description: Code review for security, architecture, and correctness. Reports structured findings.
model: sonnet
permissionMode: plan
tools: [Read, Glob, Grep, WebSearch]

- 3 instances run in parallel (security, architecture, correctness)
- Reports findings as JSON: {severity, file, line, issue, suggestion}
- Cross-engine: CC reviewer checks Codex output, and vice versa
name: tester
description: Generate tests from specs, run test suite, report coverage gaps.
model: haiku
isolation: worktree
tools: [Read, Write, Edit, Bash, Glob, Grep]

Gate 1: SPEC REVIEW (pre-implementation)
Uses: superpowers:brainstorming → superpowers:writing-plans
Hook: TaskCreated → validate spec has acceptance criteria
Reviewers: 3 parallel (security, arch, feasibility)
Block: ANY reviewer CRITICAL finding = redesign
Gate 2: STEP VERIFICATION (during implementation)
Hook: PostToolUse(Edit|Write) → auto-lint + typecheck (exit 2 on failure)
Hook: PostToolUse(Bash) → if git commit, run related tests
Hook: stall-detector → 5min no meaningful output = kill + context-aware retry
Hook: SubagentStop → verify non-empty diff + tests pass
Gate 3: PR REVIEW (post-implementation)
Uses: code-review plugin + pr-review-toolkit
Cross-engine: CC reviews Codex code, Codex reviews CC code
CI: ALL checks must pass (merge-gate pattern)
Hook: TaskCompleted → tests pass + lint clean + coverage not decreased
Gate 4: HUMAN MERGE (final)
Human reviews: PR summary + agent findings + metrics
One-click merge when all gates green
NEVER auto-merge. Human decides.
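The Gate 2 stall-detector can be approximated by watching the session trace file's modification time (a sketch; killing and retrying the agent are left to the dispatch layer):

```python
import time
from pathlib import Path

STALL_SECONDS = 5 * 60  # Gate 2 threshold: 5 min without meaningful output


def is_stalled(trace_file: Path) -> bool:
    """An agent whose trace file hasn't been written to in STALL_SECONDS is stalled."""
    return (time.time() - trace_file.stat().st_mtime) > STALL_SECONDS
```

Because every agent action is traced (see the observability layer below), "no meaningful output" reduces to "trace file not growing".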
.claude/metrics/outcomes.jsonl
{"ts":"2026-03-27T15:00:00Z","agent":"implementer","task":"add-auth","status":"success","duration_s":340,"tokens":45000,"cost_usd":0.67,"commits":3,"tests_added":5,"files_changed":4}
Logged by: SubagentStop hook + TaskCompleted hook.
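A /metrics-style summary is a few lines over outcomes.jsonl (a sketch; field names follow the example record above):

```python
import json
from pathlib import Path


def summarize(path: str = ".claude/metrics/outcomes.jsonl") -> str:
    """Aggregate outcome records into the one-line summary /metrics prints."""
    records = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    if not records:
        return "no tasks logged"
    total = len(records)
    ok = sum(1 for r in records if r["status"] == "success")
    cost = sum(r.get("cost_usd", 0.0) for r in records)
    return (f"{total} tasks, {ok} success ({ok / total:.0%}), "
            f"${cost:.2f} total, avg ${cost / total:.2f}/task")
```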
.claude/traces/session-abc123.jsonl
{"ts":"...","agent":"coordinator","action":"route","task":"add-auth","decision":"codex-implementer","reason":"3 files, moderate complexity"}
{"ts":"...","agent":"implementer","action":"edit","file":"src/auth.py","lines_changed":45}
{"ts":"...","agent":"implementer","action":"test","result":"pass","coverage":"87%"}
Logged by: PostToolUse hook (lightweight, <5ms overhead).
# Built-in CLI in /metrics skill
claude "/metrics summary"
# Output: This week: 47 tasks, 38 success (81%), $12.40 total, avg $0.26/task

- One feature per session (never mix unrelated work)
- Start each session: read PROGRESS.md + feature_list.json
- Monitor context usage via statusline (already configured)
At 60% context usage:
1. Write ROTATION-HANDOVER.md:
- Completed: [list of done items]
- In Progress: [current state, what's working, what's not]
- Next Steps: [specific actionable items]
- Blockers: [anything that needs human input]
2. Commit all work
3. Start fresh session with: "Read ROTATION-HANDOVER.md and continue"
Hook: PreCompact → warn at 60%, force handover at 70%
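Step 1's handover document can be generated mechanically, which keeps the four sections consistent across rotations (a sketch; the filename follows the protocol above):

```python
from pathlib import Path


def write_handover(completed, in_progress, next_steps, blockers,
                   path: str = "ROTATION-HANDOVER.md") -> str:
    """Render the four handover sections as markdown and write the file."""
    sections = [
        ("Completed", completed),
        ("In Progress", in_progress),
        ("Next Steps", next_steps),
        ("Blockers", blockers),
    ]
    lines = ["# Rotation Handover", ""]
    for title, items in sections:
        lines.append(f"## {title}")
        if items:
            lines.extend(f"- {item}" for item in items)
        else:
            lines.append("- (none)")
        lines.append("")
    text = "\n".join(lines)
    Path(path).write_text(text)
    return text
```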
- Global CLAUDE.md: < 15 lines (identity + project context)
- Project CLAUDE.md: < 80 lines (stack, commands, conventions)
- Detailed rules: .claude/rules/ with path-scoped frontmatter
- Total agent-visible instructions: < 200 lines / < 2000 tokens
.claude/docs/
├── PROMPT.md # Frozen spec (what to build, never edited mid-session)
├── PLAN.md # Milestones with acceptance criteria (updated on completion)
├── PROGRESS.md # Live audit log (who did what, when, result)
└── CONVENTIONS.md # Coding conventions (updated when agent corrections happen)
- PROMPT.md created by /plan skill, frozen
- PLAN.md created by coordinator, updated on milestone completion
- PROGRESS.md append-only, updated by hooks
- CONVENTIONS.md living doc, updated on correction
.claude/memory/
├── episodic/ # What happened in past sessions
│ ├── 2026-03-27-add-auth.md # Session summary: what worked, what failed
│ └── 2026-03-28-fix-perf.md
├── procedural/ # Learned workflows
│ ├── python-fastapi-feature.md # "When adding a FastAPI endpoint, always..."
│ └── react-component.md # "When creating React components, always..."
└── pitfalls/ # Verified failure patterns (like refs/agent-pitfalls.md)
├── typescript-import-extensions.md
└── pytest-async-fixtures.md
- Episodic: auto-generated by Stop hook from session trace
- Procedural: manually curated from recurring patterns
- Pitfalls: auto-appended when an agent fails + fix is found
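The pitfalls auto-append in the last bullet amounts to a few lines (a sketch; the entry layout is illustrative):

```python
from datetime import date
from pathlib import Path


def record_pitfall(topic: str, failure: str, fix: str,
                   root: str = ".claude/memory/pitfalls") -> Path:
    """Append a verified failure pattern to its topic file, creating it if new."""
    path = Path(root) / f"{topic}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = f"\n## {date.today().isoformat()}\n- Failure: {failure}\n- Fix: {fix}\n"
    with path.open("a") as f:
        f.write(entry)
    return path
```

Keying entries by topic file rather than one flat MEMORY.md is what lets path-scoped rules load only the pitfalls relevant to the code being touched.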
| Anti-Pattern | How Addressed |
|---|---|
| Kitchen sink session | Rule: one feature per session. Coordinator enforces. |
| Correction spiral | Rule: after 2 failed corrections, /clear and restart with better prompt. Hook monitors correction count. |
| Infinite exploration | architect.md is plan mode (read-only). maxTurns: 50 on implementer. |
| Over-specified CLAUDE.md | < 200 lines rule. Path-scoped rules in .claude/rules/. |
| Trust-then-verify gap | 4-gate pipeline. Verify at every step. |
| One-shotting projects | Task decomposition via coordinator. One task per agent. |
| Premature completion | SubagentStop hook verifies non-empty diff + tests pass. |
| No progress docs | PROGRESS.md append-only log. Updated by hooks. |
| Time blindness | Stall detection: 5min no output = kill. |
| Self-evaluation bias | Cross-engine review. Separate reviewer agents. |
| Reactive /clear | 65% rotation protocol. PreCompact hook warns at 60%. |
| Context pressure | Monitor via statusline. Proactive rotation, not reactive. |
| Over-specification | CLAUDE.md < 200 lines. Map not manual. |
| Generic defaults | Project-type templates (Python, React, etc.) |
| Ignoring skill context | Coordinator reads installed skills before routing. |
| 1.75x logic errors | Mandatory test gate. Cross-engine review. |
| DB schema decisions | architect.md designs schema. Never let implementer decide schema. |
| Agentic laziness | SubagentStop verifies meaningful output. Completion check mandatory. |
| Single-point testing | 3 parallel reviewers + CI + human. |
| Browser alerts | N/A (terminal-based). |
Deliverables: agent-next/agent-driven repo with working scaffold
- Create repo structure (agents, skills, rules, hooks, docs, templates) — 10h
- Build coordinator agent + routing logic — 15h
- Build observability hooks (metrics, traces, cost) — 15h
- Build context management protocol (rotation, handover, memory) — 10h
- Build /init-project skill (detect stack, generate config) — 10h
- Test on 3 different projects (Python, React, mixed) — verify portability
Exit criteria: Fresh project runs /init-project, dispatches 3 agents in parallel, all 4 gates work, metrics logged.
Deliverables: Full skill suite integrated with existing plugins
- Build /dispatch skill (wave planning, coordinator routing) — 15h
- Build event-driven triggers (git hooks, CI failure → agent) — 10h
- Build structured memory system (episodic, procedural, pitfalls) — 10h
- Integrate with gstack /ship and /review — 5h
- Build AGENTS.md + CLAUDE.md templates per project type — 5h
- Test on labclaw Phase 0 work (real production tasks) — 5h
Exit criteria: labclaw Phase 0 tasks completed via scaffold. Event triggers fire correctly. Memory persists across sessions.
This phase has its own design spec at agent-next/cc-manager. Core fixes: staged merging, wave planning, error recovery, conflict resolution. Only started AFTER Phase 1+2 are validated.
- README with Day 1 walkthrough — 5h
- /overnight skill (ralph-loop integration) — 5h
- Agent Legibility Scorecard (agent-ready CI gate) — 5h
- MetaBot integration (Telegram control plane) — 5h
Total: ~130h (Phase 1+2+4) + ~100h (Phase 3, separate)
# 1. Clone the scaffold
git clone agent-next/agent-driven
cd my-new-project
# 2. Initialize (detects Python/FastAPI, generates config)
claude "/init-project"
# → Creates .claude/ with agents, skills, rules, hooks
# → Creates CLAUDE.md (<80 lines) + AGENTS.md
# → Creates .claude/docs/ (ARCHITECTURE, CONVENTIONS, WORKFLOW, PROGRESS)
# → Runs agent-ready check, reports score
# 3. Plan a feature
claude "/plan Add user authentication with JWT"
# → superpowers:brainstorming → spec → 3 reviewers → approved plan
# → Saves to .claude/docs/PROMPT.md (frozen) + PLAN.md (milestones)
# 4. Implement
claude "/dispatch"
# → Coordinator reads PLAN.md → decomposes into 4 tasks
# → Routes: 2 to Codex implementers (parallel worktrees), 1 to CC, 1 to tester
# → Each agent: implement → lint (hook) → test (hook) → commit
# → Cross-engine review on completion
# → PR created per task
# 5. Ship
claude "/ship"
# → gstack: run all tests → check CI → create PR → wait for human merge
# 6. Check metrics
claude "/metrics summary"
# → 4 tasks, 3 success, 1 retry+success, $2.10 total, 45min elapsed

| Failure | Detection | Recovery | Escalation |
|---|---|---|---|
| Agent stalls (no output 5min) | stall-detector hook | Kill + retry with fresh context | After 2 retries → report to human |
| Agent drifts (wrong direction) | SubagentStop hook checks diff relevance | /clear + restart with tighter spec | Human reviews spec |
| Merge conflict | git merge exit code | Dispatch conflict-resolver agent | After 2 attempts → human resolves |
| Tests fail after edit | PostToolUse hook | Agent receives error, fixes in next turn | After 3 cycles → escalate |
| Budget exceeded | Cost tracking hook | Pause dispatch, report to human | Human decides: continue or abort |
| Context at 65% | PreCompact hook | Force rotation with ROTATION-HANDOVER.md | Automatic, no human needed |
| CI fails on PR | TaskCompleted hook | Dispatch fixer agent for CI errors | After 2 fixes → human reviews |
| Cross-repo dependency | Coordinator detects via ARCHITECTURE.md | Flag to human, don't auto-fix | Human decides scope |
PLANNED → DISPATCHED → RUNNING → VERIFYING → COMPLETED → MERGED
│ │ │
▼ ▼ ▼
STALLED FAILED CONFLICT
│ │ │
▼ ▼ ▼
RETRYING ESCALATED RESOLVING
│ │
└───────→ RUNNING ←─────┘
Valid transitions:
- PLANNED → DISPATCHED (coordinator assigns agent)
- DISPATCHED → RUNNING (agent starts work)
- RUNNING → VERIFYING (agent completes, hooks run)
- RUNNING → STALLED (5min no output)
- VERIFYING → COMPLETED (all gates pass)
- VERIFYING → FAILED (gate fails)
- COMPLETED → MERGED (human approves)
- COMPLETED → CONFLICT (merge conflict detected)
- STALLED → RETRYING (auto-retry with fresh context)
- FAILED → RETRYING (retry with error context, max 2)
- FAILED → ESCALATED (after 2 retries)
- CONFLICT → RESOLVING (conflict-resolver agent)
- RESOLVING → RUNNING (conflict fixed, re-verify)
- RETRYING → RUNNING (fresh attempt)
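Since the transition list is closed, it can be enforced as a lookup table rather than left to convention (states and transitions exactly as listed above):

```python
VALID_TRANSITIONS = {
    "PLANNED": {"DISPATCHED"},
    "DISPATCHED": {"RUNNING"},
    "RUNNING": {"VERIFYING", "STALLED"},
    "VERIFYING": {"COMPLETED", "FAILED"},
    "COMPLETED": {"MERGED", "CONFLICT"},
    "STALLED": {"RETRYING"},
    "FAILED": {"RETRYING", "ESCALATED"},  # RETRYING allowed at most twice
    "CONFLICT": {"RESOLVING"},
    "RESOLVING": {"RUNNING"},
    "RETRYING": {"RUNNING"},
    "MERGED": set(),      # terminal
    "ESCALATED": set(),   # terminal: waiting on human
}


def transition(state: str, new_state: str) -> str:
    """Advance a task's state, rejecting anything not in the transition list."""
    if new_state not in VALID_TRANSITIONS[state]:
        raise ValueError(f"Illegal transition: {state} → {new_state}")
    return new_state
```

The retry cap ("max 2") still needs a counter alongside the table; the table only guarantees no task skips a gate.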
| Metric | Current | Target | How Measured |
|---|---|---|---|
| Task success rate | ~50% | 80%+ | .claude/metrics/outcomes.jsonl |
| Agent crash rate | ~15% | <5% | stall-detector + SubagentStop logs |
| Time to first PR (new project) | Hours | <30 min | /init-project → first /dispatch |
| Quality gate pass rate | Unknown | >90% | Gate pass/fail logged per task |
| Context rotation compliance | 0% | 100% | PreCompact hook logs |
| Cross-session knowledge reuse | 0% | >50% sessions | Memory read count at session start |
| Cost per successful task | Unknown | <$1 avg | metrics/outcomes.jsonl |
| Question | Decision | Rationale |
|---|---|---|
| cc-manager: Elixir or TypeScript? | TypeScript (keep) | Rewrite is 50-100h. Fix the 4 issues in existing TS. Elixir only if TS proves inadequate. |
| MetaSkill or own /init-project? | Own /init-project | MetaSkill generates generic configs. We need stack-specific templates with our observability hooks. |
| agent-ready as CI gate? | Yes, Phase 4 | Low cost, high signal. npx agent-ready check . in CI. |
| Cross-repo dependencies? | Manual for v1 | Coordinator flags via ARCHITECTURE.md. Human decides. Auto-detection is Phase 3+. |
- No custom LLM — use Claude + Codex as-is
- No custom IDE — terminal (CC) + any editor
- No custom TDD framework — superpowers handles this
- No custom QA/browser — gstack handles this
- No custom code review — code-review + pr-review-toolkit handle this
- No custom planning — superpowers brainstorming + writing-plans handle this
- No web dashboard for v1
- No multi-tenant — single user system
- No cc-manager v2 in this repo (separate spec, separate repo)