Agent-Driven Development System — Design Spec v2

Date: 2026-03-27
Version: 2.0 (addresses ccz review: 5/10 → target 8+/10)
Repo: agent-next/agent-driven (new)
Research: 20 parallel agents, 50+ OSS frameworks, 21 Anthropic + 15 OpenAI blog posts, cc-manager source audit, 14 days of real dev data (427 commits, 80+ agents at peak)

NOTE: Star counts and statistics in this doc are estimates from public GitHub/npm data as of 2026-03. See docs/research/ for detailed methodology.

Vision

One person + this system = a top-level R&D team.

A reusable, project-agnostic scaffold that makes any codebase agent-ready. Built on Claude Code + OpenAI Codex. COMPOSE existing tools (superpowers ~118K★, gstack ~52K★); don't rebuild. Build custom only where no tool exists.

Problem Statement

| Problem | Root Cause | Evidence |
|---|---|---|
| Agents crash mid-task | No checkpoint, no recovery | cc-manager internal data: 43-50% task success rate (see docs/research/) |
| Agents drift off-track | Fire-and-forget, no mid-step verification | Est. 85% per-step success → ~20% over 10 steps (compounding) |
| Guidelines ignored | CLAUDE.md = suggestions, no enforcement | Observational: agents sometimes skip CLAUDE.md rules; hooks (exit 2) enforce |
| Context degrades | No rotation protocol | Empirical observation: agent output quality degrades as context fills; 65% chosen as proactive threshold |
| No visibility | Can't measure success rate, cost, or quality | "You can't hit a target you can't see" |
| Scaffold not portable | CTO skill hardcoded to labclaw | Can't init a new project |
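The drift estimate is simple compounding: with the doc's assumed 85% per-step success rate, a 10-step task completes only about one time in five.

```python
# Compounding per-step reliability: each of 10 steps succeeds
# independently with probability 0.85 (the doc's estimate).
per_step_success = 0.85
steps = 10
task_success = per_step_success ** steps
print(f"{task_success:.1%}")  # → 19.7%
```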

Design Principles

  1. COMPOSE not BUILD — Use superpowers (TDD, planning), gstack (QA, ship, review), existing CC hooks. Build ONLY what doesn't exist: coordination layer + observability.
  2. Guardrails not guidelines — Hooks exit code 2 = blocked. CLAUDE.md = context only.
  3. Verify at every step — lint after edit, test after commit, review after PR. Never batch verification.
  4. 65% context rotation — Rotate proactively at 60-65% usage, not at 80% auto-compaction. Structured handover.
  5. Observe everything — Log agent actions, measure outcomes, track cost. No unmeasured targets.
  6. Map not manual — CLAUDE.md < 200 lines. Structured docs/ directory. Path-scoped rules.
  7. Dual engine — CC Opus (reasoning, review, coordination) + Codex GPT-5.4 (parallel implementation). Cross-engine review mandatory.
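Principle 2 in practice: a Claude Code hook process that exits with code 2 blocks the action and feeds its stderr back to the agent. A minimal sketch of a lint guardrail's decision logic — the stdin-JSON entry point and field names are conventions to verify against your CC version, and are omitted here:

```python
# Sketch of a PostToolUse guardrail (principle 2). A thin entry point
# (assumed, not shown) would read the tool-call JSON from stdin, call
# lint_verdict() with the edited file path, print the message to
# stderr, and sys.exit() with the returned code. Exit 2 = hard block.
import subprocess

def lint_verdict(file_path: str) -> tuple[int, str]:
    """Return (exit_code, message): 2 blocks the edit, 0 lets it pass."""
    if not file_path.endswith(".py"):
        return 0, ""  # this sketch only guards Python edits
    result = subprocess.run(
        ["ruff", "check", file_path], capture_output=True, text=True
    )
    if result.returncode != 0:
        return 2, result.stdout  # lint output is fed back to the agent
    return 0, ""
```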

Decision: COMPOSE, Not Rebuild

What We USE (already installed, battle-tested)

| Tool | Est. Installs | Covers | Our Action |
|---|---|---|---|
| superpowers | ~118K (npm, est.) | TDD, debugging, planning, brainstorming, code review, verification | USE as-is |
| gstack | ~52K (npm, est.) | Sprint lifecycle: CEO/eng/design review, QA, ship, deploy, retro | USE as-is |
| feature-dev | ~89K (npm, est.) | 7-phase guided feature dev with 3 agents | USE as-is |
| code-review | ~50K (npm, est.) | Multi-agent parallel PR review | USE as-is |
| pr-review-toolkit | installed | Silent-failure-hunter, type-design, test-analyzer | USE as-is |
| context7 | ~72K (npm, est.) | Live library docs in context | USE as-is |
| ralph-loop | ~57K (npm, est.) | Autonomous multi-hour coding sessions | USE for /overnight |

What We BUILD (no existing tool covers this)

| Component | Why It Doesn't Exist | Effort |
|---|---|---|
| Coordinator agent | Routes tasks to the right engine/model/skill. No tool does dual-engine routing. | 15h |
| Observability hooks | Agent outcome metrics, action tracing, cost tracking. CC hooks exist but nobody ships a pre-built observability kit. | 20h |
| Context management protocol | 65% rotation, handover docs, session boundaries. Research-backed but no tool implements it. | 10h |
| AGENTS.md + CLAUDE.md templates | Project-type-specific templates (Python/FastAPI, React/Next.js, etc.) with <200-line discipline. | 10h |
| Event-driven triggers | git post-receive → dispatch agent, CI failure → dispatch fixer. Simple shell hooks, not a framework. | 10h |
| Structured agent memory | Episodic (what worked/failed) + procedural (learned workflows). Beyond flat MEMORY.md. | 15h |

What We FIX (cc-manager v2 — separate spec)

cc-manager v2 is a separate project with its own spec, not part of this scaffold. This scaffold works WITHOUT cc-manager. cc-manager is an optional acceleration layer for heavy parallel work (10+ tasks).

cc-manager v2 spec will be written separately and tracked at agent-next/cc-manager.

Architecture: Single Layer + Optional Engine

```
┌─────────────────────────────────────────────────────────────────┐
│  agent-next/agent-driven                                         │
│  Portable scaffold — works for ANY project                       │
│                                                                  │
│  .claude/                                                        │
│  ├── agents/        Coordinator, Implementer, Reviewer, Tester   │
│  ├── skills/        /init, /plan (→superpowers), /review-all     │
│  │                  /dispatch, /ship (→gstack), /overnight        │
│  ├── rules/         Quality, git workflow, security, context mgmt │
│  ├── hooks/         Guardrails (12 hooks covering 8 lifecycle     │
│  │                  events, all exit-code-2 capable)              │
│  ├── docs/          Structured map (ARCHITECTURE, CONVENTIONS,    │
│  │                  WORKFLOW, PROGRESS)                           │
│  └── templates/     CLAUDE.md + AGENTS.md by project type        │
│                                                                  │
│  Observability Layer                                             │
│  ├── .claude/metrics/     Agent outcome logs (JSON-lines)        │
│  ├── .claude/traces/      Action traces per session              │
│  └── .claude/memory/      Structured episodic + procedural       │
│                                                                  │
│  Optional: cc-manager v2 (separate repo, separate spec)          │
│  └── Called via /dispatch skill when task count > threshold       │
└─────────────────────────────────────────────────────────────────┘
```

Degraded Mode (without cc-manager)

The scaffold works standalone using CC native features:

  • 1-3 tasks: CC subagents with isolation: worktree
  • 4-6 tasks: CC Agent Teams with shared task list
  • 7+ tasks: Sequential waves (dispatch 3, wait, merge, dispatch next 3)
  • 10+ tasks: Requires cc-manager v2 (optional upgrade)

This resolves the ccz review's "what if cc-manager isn't available?" concern.

Agent Definitions

coordinator.md (the only new agent)

```yaml
name: coordinator
description: Routes tasks to the right engine, model, and skill. Manages dispatch and merge decisions.
model: opus
permissionMode: default
tools: [Read, Glob, Grep, Bash, Agent, TaskCreate, TaskUpdate, TaskList]
memory: project
skills: [superpowers:dispatching-parallel-agents, superpowers:writing-plans]
```
  • Reads task/spec → classifies complexity → routes to implementer (Codex) or CC subagent
  • Manages wave ordering (dependency-aware)
  • Triggers cross-engine review after implementation
  • Logs all decisions to .claude/traces/

implementer.md

```yaml
name: implementer
description: Focused code implementation. One task per agent. Commits after each passing test.
model: inherit
isolation: worktree
maxTurns: 50
tools: [Read, Write, Edit, Bash, Glob, Grep]
hooks:
  PostToolUse:
    - matcher: "Edit|Write"
      hooks: [{type: command, command: "ruff check --fix $FILE && ruff format $FILE", timeout: 10}]
  Stop:
    - hooks: [{type: command, command: "pytest --tb=short -q 2>/dev/null; echo exit:$?"}]
```
  • Runs lint after every edit (hook-enforced, not guideline)
  • Runs tests on stop (hook-enforced)
  • One focused task, one worktree

reviewer.md

```yaml
name: reviewer
description: Code review for security, architecture, and correctness. Reports structured findings.
model: sonnet
permissionMode: plan
tools: [Read, Glob, Grep, WebSearch]
```
  • 3 instances run in parallel (security, architecture, correctness)
  • Reports findings as JSON: {severity, file, line, issue, suggestion}
  • Cross-engine: CC reviewer checks Codex output, vice versa

tester.md

```yaml
name: tester
description: Generate tests from specs, run test suite, report coverage gaps.
model: haiku
isolation: worktree
tools: [Read, Write, Edit, Bash, Glob, Grep]
```

Quality Gate Pipeline

```
Gate 1: SPEC REVIEW (pre-implementation)
  Uses: superpowers:brainstorming → superpowers:writing-plans
  Hook: TaskCreated → validate spec has acceptance criteria
  Reviewers: 3 parallel (security, arch, feasibility)
  Block: ANY reviewer CRITICAL finding = redesign

Gate 2: STEP VERIFICATION (during implementation)
  Hook: PostToolUse(Edit|Write) → auto-lint + typecheck (exit 2 on failure)
  Hook: PostToolUse(Bash) → if git commit, run related tests
  Hook: stall-detector → 5min no meaningful output = kill + context-aware retry
  Hook: SubagentStop → verify non-empty diff + tests pass

Gate 3: PR REVIEW (post-implementation)
  Uses: code-review plugin + pr-review-toolkit
  Cross-engine: CC reviews Codex code, Codex reviews CC code
  CI: ALL checks must pass (merge-gate pattern)
  Hook: TaskCompleted → tests pass + lint clean + coverage not decreased

Gate 4: HUMAN MERGE (final)
  Human reviews: PR summary + agent findings + metrics
  One-click merge when all gates green
  NEVER auto-merge. Human decides.
```

Observability (P0 — resolves "can't measure" problem)

Agent Outcome Metrics

`.claude/metrics/outcomes.jsonl`:

```json
{"ts":"2026-03-27T15:00:00Z","agent":"implementer","task":"add-auth","status":"success","duration_s":340,"tokens":45000,"cost_usd":0.67,"commits":3,"tests_added":5,"files_changed":4}
```

Logged by: SubagentStop hook + TaskCompleted hook.
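The logging side is an append-only JSON-lines write; a sketch of what the SubagentStop hook would call (field names follow the example record; the helper itself is an assumption):

```python
# Outcome logger sketch: append one JSON object per finished task to
# .claude/metrics/outcomes.jsonl. Append-only, one line per outcome.
import json
from datetime import datetime, timezone
from pathlib import Path

def log_outcome(record: dict,
                metrics_dir: Path = Path(".claude/metrics")) -> None:
    metrics_dir.mkdir(parents=True, exist_ok=True)
    # Stamp the record if the caller didn't.
    record.setdefault("ts", datetime.now(timezone.utc).isoformat())
    with (metrics_dir / "outcomes.jsonl").open("a") as f:
        f.write(json.dumps(record) + "\n")
```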

Action Traces

`.claude/traces/session-abc123.jsonl`:

```json
{"ts":"...","agent":"coordinator","action":"route","task":"add-auth","decision":"codex-implementer","reason":"3 files, moderate complexity"}
{"ts":"...","agent":"implementer","action":"edit","file":"src/auth.py","lines_changed":45}
{"ts":"...","agent":"implementer","action":"test","result":"pass","coverage":"87%"}
```

Logged by: PostToolUse hook (lightweight, <5ms overhead).

Cost Dashboard

```bash
# Built-in CLI in /metrics skill
claude "/metrics summary"
# Output: This week: 47 tasks, 38 success (81%), $12.40 total, avg $0.26/task
```
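The summary is a straight aggregation over outcomes.jsonl. A sketch of the computation behind the CLI output, using the field names from the outcomes example (the function name is illustrative):

```python
# Aggregate outcomes.jsonl into the /metrics summary numbers:
# task count, success rate, total and average cost.
import json
from pathlib import Path

def summarize(outcomes_path: Path) -> dict:
    rows = [json.loads(line)
            for line in outcomes_path.read_text().splitlines() if line]
    successes = [r for r in rows if r.get("status") == "success"]
    total_cost = sum(r.get("cost_usd", 0.0) for r in rows)
    return {
        "tasks": len(rows),
        "success": len(successes),
        "success_rate": len(successes) / len(rows) if rows else 0.0,
        "total_cost_usd": round(total_cost, 2),
        "avg_cost_usd": round(total_cost / len(rows), 2) if rows else 0.0,
    }
```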

Context Management Protocol (P0 — resolves "context rot")

Session Boundaries

  1. One feature per session (never mix unrelated work)
  2. Start each session: read PROGRESS.md + feature_list.json
  3. Monitor context usage via statusline (already configured)

65% Rotation Protocol

At 60% context usage:
  1. Write ROTATION-HANDOVER.md:
     - Completed: [list of done items]
     - In Progress: [current state, what's working, what's not]
     - Next Steps: [specific actionable items]
     - Blockers: [anything that needs human input]
  2. Commit all work
  3. Start fresh session with: "Read ROTATION-HANDOVER.md and continue"

Hook: PreCompact → warn at 60%, force handover at 70%

CLAUDE.md Discipline

  • Global CLAUDE.md: < 15 lines (identity + project context)
  • Project CLAUDE.md: < 80 lines (stack, commands, conventions)
  • Detailed rules: .claude/rules/ with path-scoped frontmatter
  • Total agent-visible instructions: < 200 lines / < 2000 tokens

4-File Memory Pattern (from OpenAI Codex long-horizon)

```
.claude/docs/
├── PROMPT.md          # Frozen spec (what to build, never edited mid-session)
├── PLAN.md            # Milestones with acceptance criteria (updated on completion)
├── PROGRESS.md        # Live audit log (who did what, when, result)
└── CONVENTIONS.md     # Coding conventions (updated when agent corrections happen)
```
  • PROMPT.md created by /plan skill, frozen
  • PLAN.md created by coordinator, updated on milestone completion
  • PROGRESS.md append-only, updated by hooks
  • CONVENTIONS.md living doc, updated on correction

Structured Agent Memory (P0 — resolves "flat MEMORY.md")

Three Memory Types

```
.claude/memory/
├── episodic/          # What happened in past sessions
│   ├── 2026-03-27-add-auth.md    # Session summary: what worked, what failed
│   └── 2026-03-28-fix-perf.md
├── procedural/        # Learned workflows
│   ├── python-fastapi-feature.md  # "When adding a FastAPI endpoint, always..."
│   └── react-component.md         # "When creating React components, always..."
└── pitfalls/          # Verified failure patterns (like refs/agent-pitfalls.md)
    ├── typescript-import-extensions.md
    └── pytest-async-fixtures.md
```
  • Episodic: auto-generated by Stop hook from session trace
  • Procedural: manually curated from recurring patterns
  • Pitfalls: auto-appended when an agent fails + fix is found
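The pitfall auto-append step is a small dated write into `.claude/memory/pitfalls/`. A sketch — the slug-based filename and entry layout are assumptions, not a fixed format:

```python
# Pitfall recorder sketch: when a failure and its fix are identified,
# append a dated entry to .claude/memory/pitfalls/<slug>.md.
from datetime import date
from pathlib import Path

def record_pitfall(slug: str, symptom: str, fix: str,
                   memory_dir: Path = Path(".claude/memory/pitfalls")) -> Path:
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / f"{slug}.md"
    entry = (f"\n## {date.today().isoformat()}\n"
             f"- Symptom: {symptom}\n- Fix: {fix}\n")
    with path.open("a") as f:  # append-only, like PROGRESS.md
        f.write(entry)
    return path
```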

Anti-Pattern Coverage (resolves "6/20 addressed")

| Anti-Pattern | How Addressed |
|---|---|
| Kitchen-sink session | Rule: one feature per session. Coordinator enforces. |
| Correction spiral | Rule: after 2 failed corrections, /clear and restart with a better prompt. Hook monitors correction count. |
| Infinite exploration | architect.md is plan mode (read-only). maxTurns: 50 on implementer. |
| Over-specified CLAUDE.md | <200-line rule. Path-scoped rules in .claude/rules/. |
| Trust-then-verify gap | 4-gate pipeline. Verify at every step. |
| One-shotting projects | Task decomposition via coordinator. One task per agent. |
| Premature completion | SubagentStop hook verifies non-empty diff + passing tests. |
| No progress docs | PROGRESS.md append-only log, updated by hooks. |
| Time blindness | Stall detection: 5min no output = kill. |
| Self-evaluation bias | Cross-engine review. Separate reviewer agents. |
| Reactive /clear | 65% rotation protocol. PreCompact hook warns at 60%. |
| Context pressure | Monitor via statusline. Proactive rotation, not reactive. |
| Over-specification | CLAUDE.md < 200 lines. Map, not manual. |
| Generic defaults | Project-type templates (Python, React, etc.) |
| Ignoring skill context | Coordinator reads installed skills before routing. |
| 1.75x logic errors | Mandatory test gate. Cross-engine review. |
| DB schema decisions | architect.md designs schema. Never let an implementer decide schema. |
| Agentic laziness | SubagentStop verifies meaningful output. Completion check mandatory. |
| Single-point testing | 3 parallel reviewers + CI + human. |
| Browser alerts | N/A (terminal-based). |

Implementation Phases (revised estimates)

Phase 1: Scaffold + Observability (2 weeks, ~60h)

Deliverables: agent-next/agent-driven repo with working scaffold

  1. Create repo structure (agents, skills, rules, hooks, docs, templates) — 10h
  2. Build coordinator agent + routing logic — 15h
  3. Build observability hooks (metrics, traces, cost) — 15h
  4. Build context management protocol (rotation, handover, memory) — 10h
  5. Build /init-project skill (detect stack, generate config) — 10h
  6. Test on 3 different projects (Python, React, mixed) — verify portability

Exit criteria: Fresh project runs /init-project, dispatches 3 agents in parallel, all 4 gates work, metrics logged.

Phase 2: Integration + Event Triggers (2 weeks, ~50h)

Deliverables: Full skill suite integrated with existing plugins

  1. Build /dispatch skill (wave planning, coordinator routing) — 15h
  2. Build event-driven triggers (git hooks, CI failure → agent) — 10h
  3. Build structured memory system (episodic, procedural, pitfalls) — 10h
  4. Integrate with gstack /ship and /review — 5h
  5. Build AGENTS.md + CLAUDE.md templates per project type — 5h
  6. Test on labclaw Phase 0 work (real production tasks) — 5h

Exit criteria: labclaw Phase 0 tasks completed via scaffold. Event triggers fire correctly. Memory persists across sessions.

Phase 3: cc-manager v2 (separate spec, ~100h)

This phase has its own design spec at agent-next/cc-manager. Core fixes: staged merging, wave planning, error recovery, conflict resolution. Only started AFTER Phase 1+2 are validated.

Phase 4: Polish + Documentation (1 week, ~20h)

  1. README with Day 1 walkthrough — 5h
  2. /overnight skill (ralph-loop integration) — 5h
  3. Agent Legibility Scorecard (agent-ready CI gate) — 5h
  4. MetaBot integration (Telegram control plane) — 5h

Total: ~130h (Phase 1+2+4) + ~100h (Phase 3, separate)

Day 1 Scenario (resolves "no user journey")

```bash
# 1. Clone the scaffold
git clone agent-next/agent-driven
cd my-new-project

# 2. Initialize (detects Python/FastAPI, generates config)
claude "/init-project"
# → Creates .claude/ with agents, skills, rules, hooks
# → Creates CLAUDE.md (<80 lines) + AGENTS.md
# → Creates .claude/docs/ (ARCHITECTURE, CONVENTIONS, WORKFLOW, PROGRESS)
# → Runs agent-ready check, reports score

# 3. Plan a feature
claude "/plan Add user authentication with JWT"
# → superpowers:brainstorming → spec → 3 reviewers → approved plan
# → Saves to .claude/docs/PROMPT.md (frozen) + PLAN.md (milestones)

# 4. Implement
claude "/dispatch"
# → Coordinator reads PLAN.md → decomposes into 4 tasks
# → Routes: 2 to Codex implementers (parallel worktrees), 1 to CC, 1 to tester
# → Each agent: implement → lint (hook) → test (hook) → commit
# → Cross-engine review on completion
# → PR created per task

# 5. Ship
claude "/ship"
# → gstack: run all tests → check CI → create PR → wait for human merge

# 6. Check metrics
claude "/metrics summary"
# → 4 tasks, 3 success, 1 retry+success, $2.10 total, 45min elapsed
```

Failure Modes (resolves "no failure definition")

| Failure | Detection | Recovery | Escalation |
|---|---|---|---|
| Agent stalls (no output 5min) | stall-detector hook | Kill + retry with fresh context | After 2 retries → report to human |
| Agent drifts (wrong direction) | SubagentStop hook checks diff relevance | /clear + restart with tighter spec | Human reviews spec |
| Merge conflict | git merge exit code | Dispatch conflict-resolver agent | After 2 attempts → human resolves |
| Tests fail after edit | PostToolUse hook | Agent receives error, fixes in next turn | After 3 cycles → escalate |
| Budget exceeded | Cost tracking hook | Pause dispatch, report to human | Human decides: continue or abort |
| Context at 65% | PreCompact hook | Force rotation with ROTATION-HANDOVER.md | Automatic, no human needed |
| CI fails on PR | TaskCompleted hook | Dispatch fixer agent for CI errors | After 2 fixes → human reviews |
| Cross-repo dependency | Coordinator detects via ARCHITECTURE.md | Flag to human, don't auto-fix | Human decides scope |

Task State Machine (resolves "no state definition")

```
PLANNED → DISPATCHED → RUNNING → VERIFYING → COMPLETED → MERGED
                         │          │            │
                         ▼          ▼            ▼
                      STALLED    FAILED      CONFLICT
                         │          │            │
                         ▼          ▼            ▼
                      RETRYING  ESCALATED    RESOLVING
                         │                       │
                         └───────→ RUNNING ←─────┘
```

Valid transitions:

  • PLANNED → DISPATCHED (coordinator assigns agent)
  • DISPATCHED → RUNNING (agent starts work)
  • RUNNING → VERIFYING (agent completes, hooks run)
  • RUNNING → STALLED (5min no output)
  • VERIFYING → COMPLETED (all gates pass)
  • VERIFYING → FAILED (gate fails)
  • COMPLETED → MERGED (human approves)
  • COMPLETED → CONFLICT (merge conflict detected)
  • STALLED → RETRYING (auto-retry with fresh context)
  • FAILED → RETRYING (retry with error context, max 2)
  • FAILED → ESCALATED (after 2 retries)
  • CONFLICT → RESOLVING (conflict-resolver agent)
  • RESOLVING → RUNNING (conflict fixed, re-verify)
  • RETRYING → RUNNING (fresh attempt)
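The transition list above encodes directly as a table, so dispatch hooks can reject illegal moves. A sketch — retry counting (the max-2 rule) is left to the caller here:

```python
# Task state machine as a transition table; illegal moves raise.
VALID_TRANSITIONS = {
    "PLANNED":    {"DISPATCHED"},
    "DISPATCHED": {"RUNNING"},
    "RUNNING":    {"VERIFYING", "STALLED"},
    "VERIFYING":  {"COMPLETED", "FAILED"},
    "COMPLETED":  {"MERGED", "CONFLICT"},
    "STALLED":    {"RETRYING"},
    "FAILED":     {"RETRYING", "ESCALATED"},  # retry cap enforced by caller
    "CONFLICT":   {"RESOLVING"},
    "RESOLVING":  {"RUNNING"},
    "RETRYING":   {"RUNNING"},
    "MERGED":     set(),   # terminal
    "ESCALATED":  set(),   # terminal (human takes over)
}

def transition(state: str, new_state: str) -> str:
    if new_state not in VALID_TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```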

Success Metrics (with instrumentation)

| Metric | Current | Target | How Measured |
|---|---|---|---|
| Task success rate | ~50% | 80%+ | .claude/metrics/outcomes.jsonl |
| Agent crash rate | ~15% | <5% | stall-detector + SubagentStop logs |
| Time to first PR (new project) | Hours | <30 min | /init-project → first /dispatch |
| Quality gate pass rate | Unknown | >90% | Gate pass/fail logged per task |
| Context rotation compliance | 0% | 100% | PreCompact hook logs |
| Cross-session knowledge reuse | 0% | >50% of sessions | Memory read count at session start |
| Cost per successful task | Unknown | <$1 avg | metrics/outcomes.jsonl |

Open Questions (resolved)

| Question | Decision | Rationale |
|---|---|---|
| cc-manager: Elixir or TypeScript? | TypeScript (keep) | A rewrite is 50-100h. Fix the 4 issues in the existing TS. Elixir only if TS proves inadequate. |
| MetaSkill or own /init-project? | Own /init-project | MetaSkill generates generic configs. We need stack-specific templates with our observability hooks. |
| agent-ready as CI gate? | Yes, Phase 4 | Low cost, high signal. npx agent-ready check . in CI. |
| Cross-repo dependencies? | Manual for v1 | Coordinator flags via ARCHITECTURE.md. Human decides. Auto-detection is Phase 3+. |

What We DON'T Build

  • No custom LLM — use Claude + Codex as-is
  • No custom IDE — terminal (CC) + any editor
  • No custom TDD framework — superpowers handles this
  • No custom QA/browser — gstack handles this
  • No custom code review — code-review + pr-review-toolkit handle this
  • No custom planning — superpowers brainstorming + writing-plans handle this
  • No web dashboard for v1
  • No multi-tenant — single user system
  • No cc-manager v2 in this repo (separate spec, separate repo)