## Summary
Implement a deterministic, code-based execution layer (spec-runner) that wraps existing spec-kit commands with quality gates, journaling, and resumability. Inspired by Babysitter's orchestration patterns but AI-agnostic.
## Background

- Babysitter provides deterministic orchestration for Claude Code only
- We need an AI-agnostic equivalent for spec-kit workflows
- Key insight: a code-based orchestrator (deterministic) wrapping LLM execution (non-deterministic)

References:
## Architecture

```
┌──────────────────────────────────────────────────────────┐
│ specify run start <feature>                              │
│ (Python CLI - deterministic)                             │
├──────────────────────────────────────────────────────────┤
│ 1. Load schema (combined: artifacts + execution config)  │
│ 2. For each stage:                                       │
│    ├── Log: COMMAND_START                                │
│    ├── Spawn AI session → execute command                │
│    ├── Log: COMMAND_COMPLETE                             │
│    ├── Run evals → quality score                         │
│    ├── Log: QUALITY_CHECK                                │
│    ├── Decision (deterministic):                         │
│    │   ├── score >= threshold → proceed                  │
│    │   ├── iteration < max → iterate with feedback       │
│    │   └── max reached → checkpoint                      │
│    └── Log: DECISION                                     │
│ 3. On error: retry/checkpoint/fail                       │
│ 4. On checkpoint: CLI prompt                             │
└──────────────────────────────────────────────────────────┘
```
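The decision step is the part of the loop that must stay deterministic. A minimal sketch as a pure function (the `decide` name and `Decision` enum are illustrative, not part of the CLI):

```python
from enum import Enum

class Decision(Enum):
    PROCEED = "proceed"        # quality gate passed
    ITERATE = "iterate"        # retry the stage with feedback
    CHECKPOINT = "checkpoint"  # hand control back to the user

def decide(score: int, threshold: int, iteration: int, max_iterations: int) -> Decision:
    """Deterministic gate: identical inputs always produce the same decision."""
    if score >= threshold:
        return Decision.PROCEED
    if iteration < max_iterations:
        return Decision.ITERATE
    return Decision.CHECKPOINT
```

Keeping this logic in code rather than in the LLM session is what makes runs replayable from the journal.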
Key Design Decisions
| Aspect |
Decision |
Rationale |
| Orchestration |
Code-based (Python CLI) |
Deterministic decisions |
| Commands |
Wrap existing /spec.* |
No changes to existing commands |
| Quality Gates |
Use evals framework |
Reuse proven methodology |
| State Recovery |
Resume from last completed stage |
Simple, reliable |
| Checkpoints |
CLI prompt |
No external service for v1 |
| Task Parallelism |
Task-level (local sessions) |
Orchestrator integration deferred |
| Quality Convergence |
Fix with feedback, checkpoint on max |
Follows Babysitter pattern |
| Schema + Process |
Combined in schema.yaml |
Single source of truth |
| Hooks |
Spec-runner specific hooks |
Extensibility |
| Observability |
CLI + Journal |
Both metrics access methods |
| Error Handling |
Transient→retry, user-action→checkpoint |
Follows Babysitter pattern |
## CLI Commands

| Command | Description |
|---|---|
| `specify run start <feature>` | Start workflow execution |
| `specify run status <run-id>` | Show run status + metrics |
| `specify run pause <run-id>` | Pause a run |
| `specify run resume <run-id>` | Resume a paused run |
| `specify run list` | List all runs |
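A sketch of how this subcommand surface could be wired with `argparse` (the real CLI may use a different framework; `build_parser` is a hypothetical helper):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build the `specify run <action>` command tree."""
    parser = argparse.ArgumentParser(prog="specify")
    run = parser.add_subparsers(dest="group", required=True).add_parser("run")
    action = run.add_subparsers(dest="action", required=True)
    action.add_parser("start").add_argument("feature")
    for name in ("status", "pause", "resume"):
        action.add_parser(name).add_argument("run_id")
    action.add_parser("list")
    return parser
```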
## Schema Execution Config

```yaml
schema:
  prefix: "spec"
  artifacts:
    - id: spec
      file: spec.md
      requires: []
    - id: plan
      file: plan.md
      requires: [spec]
    - id: tasks
      file: tasks.md
      requires: [plan]

execution:
  quality:
    target: 85
  stages:
    spec:
      command: /spec.specify
      max_iterations: 3
      checkpoint: optional
    plan:
      command: /spec.plan
      max_iterations: 3
      checkpoint: after
    tasks:
      command: /spec.tasks
      max_iterations: 2
      checkpoint: none
    implement:
      command: /spec.implement
      max_iterations: 5
      checkpoint: on_quality_fail
      parallel: true
  hooks:
    before_run: []
    after_run: []
    before_stage: []
    after_stage: []
    on_quality_fail: []
    on_checkpoint: []
    on_iteration: []
    on_error: []
```
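Stage ordering can be derived from the `requires` edges rather than hard-coded. Assuming the artifact shape above, Python's stdlib `graphlib` gives a deterministic order and rejects cycles:

```python
import graphlib

def stage_order(artifacts: list[dict]) -> list[str]:
    """Topologically sort artifact ids by their `requires` edges.

    Raises graphlib.CycleError if the schema declares a dependency cycle.
    """
    graph = {a["id"]: set(a["requires"]) for a in artifacts}
    return list(graphlib.TopologicalSorter(graph).static_order())

# Artifacts as they would be parsed from schema.yaml
artifacts = [
    {"id": "spec", "file": "spec.md", "requires": []},
    {"id": "plan", "file": "plan.md", "requires": ["spec"]},
    {"id": "tasks", "file": "tasks.md", "requires": ["plan"]},
]
```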
## Journal System

Location: `.specify/runs/<run-id>/`

```
.specify/runs/<run-id>/
├── state.json     # Current state cache
├── journal.jsonl  # Event log (source of truth)
└── config.json    # Run configuration
```

Event types:

- `RUN_START` / `RUN_COMPLETE`
- `COMMAND_START` / `COMMAND_COMPLETE`
- `QUALITY_CHECK`
- `DECISION`
- `CHECKPOINT` / `CHECKPOINT_RESOLVED`
- `ERROR` / `ERROR_TRANSIENT`
- `PARALLEL_START` / `PARALLEL_COMPLETE`
- `TASK_SPAWNED` / `TASK_COMPLETE`
## Quality Convergence

```python
for iteration in range(1, max_iterations + 1):
    if iteration == 1:
        result = execute_command(stage)
    else:
        # Re-run the stage command with the failed checks as feedback
        result = execute_command(stage, feedback=previous_feedback)
    score = run_evals(result)
    if score >= threshold:
        break
    previous_feedback = get_failed_checks(result)

# Max iterations exhausted without reaching the threshold
if score < threshold:
    decision = checkpoint(f"Quality {score} below {threshold}. Proceed anyway?")
```
## Error Handling

| Error Type | Strategy |
|---|---|
| Transient (timeout, rate limit) | Retry with backoff |
| User action required | Checkpoint |
| Unrecoverable | Fail gracefully |
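The transient branch might look like this (a sketch; `TransientError` and the parameters are assumptions, not spec-runner API):

```python
import time

class TransientError(Exception):
    """Timeouts, rate limits, and other retryable failures."""

def with_retry(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry `fn` on transient errors with exponential backoff (1s, 2s, 4s, ...).

    Non-transient exceptions propagate immediately; the last transient
    failure is re-raised once attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```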
## Hooks

| Hook | When |
|---|---|
| `before_run` | Run starts |
| `after_run` | Run completes |
| `before_stage` | Before any stage |
| `after_stage` | After a stage completes |
| `on_quality_fail` | Quality check fails |
| `on_checkpoint` | Checkpoint triggered |
| `on_iteration` | Before each iteration |
| `on_error` | Command fails |
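One possible shape for the hooks system is an in-process registry of callables (a sketch; hooks could equally be shell commands read from the `hooks:` lists in `schema.yaml`):

```python
from collections import defaultdict
from typing import Any, Callable

class HookRegistry:
    """Maps hook names (before_run, after_stage, ...) to ordered callbacks."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Callable[..., Any]]] = defaultdict(list)

    def on(self, name: str, fn: Callable[..., Any]) -> None:
        """Register a callback for a hook point."""
        self._hooks[name].append(fn)

    def fire(self, name: str, **context: Any) -> None:
        """Invoke every callback registered for `name`, in order."""
        for fn in self._hooks[name]:
            fn(**context)
```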
## Observability

```
$ specify run status run-20260228-001

Run: run-20260228-001
Schema: spec-driven
Status: completed
Duration: 12m 34s

Stages:
┌──────────┬────────┬────────────┬─────────┬──────────┐
│ Stage    │ Status │ Iterations │ Quality │ Duration │
├──────────┼────────┼────────────┼─────────┼──────────┤
│ spec     │ ✓      │ 2          │ 88      │ 3m 12s   │
│ plan     │ ✓      │ 1          │ 92      │ 2m 45s   │
│ tasks    │ ✓      │ 1          │ 95      │ 1m 20s   │
│ implement│ ✓      │ 3          │ 87      │ 5m 17s   │
└──────────┴────────┴────────────┴─────────┴──────────┘

Checkpoints: 2 approved, 0 rejected
```
## Implementation Phases

| Phase | Deliverable |
|---|---|
| 1 | CLI commands (`specify run start/status/pause/resume/list`) |
| 2 | Journal system (JSONL events, state cache) |
| 3 | Command execution (spawn AI, wait, capture output) |
| 4 | Quality gates (evals integration) |
| 5 | Convergence loop (iterate with feedback) |
| 6 | Checkpoints (CLI prompt) |
| 7 | Error handling (retry, checkpoint, fail) |
| 8 | Hooks system |
| 9 | Observability (status command, metrics) |
## Blocked By
## Deferred (Future Issues)

- Orchestrator integration (`agentic-sdlc-orchestrator`) for K8s-based async execution
- Web UI for checkpoints
- Custom process definitions
- Telegram/Slack notifications
## Acceptance Criteria
- `specify run start` executes the full workflow with quality gates