Spec-Runner: Deterministic Execution Layer for Spec-Kit Workflows #51

@kanfil

Summary

Implement a deterministic, code-based execution layer (spec-runner) that wraps existing spec-kit commands with quality gates, journaling, and resumability. The design is inspired by Babysitter's orchestration patterns but is AI-agnostic.

Background

  • Babysitter provides deterministic orchestration for Claude Code only
  • We need an AI-agnostic equivalent for spec-kit workflows
  • Key insight: a code-based orchestrator makes every control-flow decision (deterministic), while the LLM only executes individual commands (non-deterministic)

References:

Architecture

┌─────────────────────────────────────────────────────────────┐
│              specify run start <feature>                    │
│              (Python CLI - deterministic)                   │
├─────────────────────────────────────────────────────────────┤
│  1. Load schema (combined: artifacts + execution config)   │
│  2. For each stage:                                        │
│     ├── Log: COMMAND_START                                 │
│     ├── Spawn AI session → execute command                 │
│     ├── Log: COMMAND_COMPLETE                              │
│     ├── Run evals → quality score                          │
│     ├── Log: QUALITY_CHECK                                 │
│     ├── Decision (deterministic):                          │
│     │   ├── score >= threshold → proceed                   │
│     │   ├── iteration < max → iterate with feedback        │
│     │   └── max reached → checkpoint                       │
│     └── Log: DECISION                                      │
│  3. On error: retry/checkpoint/fail                        │
│  4. On checkpoint: CLI prompt                              │
└─────────────────────────────────────────────────────────────┘
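
The "Decision (deterministic)" branch in the diagram above can be sketched as a pure function. This is a minimal illustration only: the names `Decision` and `decide` are hypothetical and not part of spec-kit.

```python
from enum import Enum


class Decision(Enum):
    """Possible outcomes of the decision step (hypothetical names)."""
    PROCEED = "proceed"
    ITERATE = "iterate"
    CHECKPOINT = "checkpoint"


def decide(score: float, threshold: float, iteration: int, max_iterations: int) -> Decision:
    """Pure function: the same inputs always yield the same decision."""
    if score >= threshold:
        return Decision.PROCEED      # quality gate passed
    if iteration < max_iterations:
        return Decision.ITERATE      # retry the stage with feedback
    return Decision.CHECKPOINT       # iterations exhausted, ask the user


print(decide(88, 85, 1, 3))   # Decision.PROCEED
print(decide(70, 85, 3, 3))   # Decision.CHECKPOINT
```

Keeping the decision free of I/O and LLM calls means it can be unit-tested and replayed from the journal without spawning an AI session.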

Key Design Decisions

| Aspect | Decision | Rationale |
|---|---|---|
| Orchestration | Code-based (Python CLI) | Deterministic decisions |
| Commands | Wrap existing /spec.* | No changes to existing commands |
| Quality Gates | Use evals framework | Reuse proven methodology |
| State Recovery | Resume from last completed stage | Simple, reliable |
| Checkpoints | CLI prompt | No external service for v1 |
| Task Parallelism | Task-level (local sessions) | Orchestrator integration deferred |
| Quality Convergence | Fix with feedback, checkpoint on max | Follows Babysitter pattern |
| Schema + Process | Combined in schema.yaml | Single source of truth |
| Hooks | Spec-runner specific hooks | Extensibility |
| Observability | CLI + Journal | Both metrics access methods |
| Error Handling | Transient→retry, user-action→checkpoint | Follows Babysitter pattern |

CLI Commands

| Command | Description |
|---|---|
| specify run start <feature> | Start workflow execution |
| specify run status <run-id> | Show run status + metrics |
| specify run pause <run-id> | Pause run |
| specify run resume <run-id> | Resume paused run |
| specify run list | List all runs |
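
As a rough sketch of how the subcommands above could be wired up, the following uses the standard-library `argparse` purely to stay dependency-free. This issue does not say which CLI framework `specify` uses, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `specify run <action>` (illustrative only)."""
    parser = argparse.ArgumentParser(prog="specify")
    groups = parser.add_subparsers(dest="group", required=True)
    run = groups.add_parser("run", help="Manage spec-runner workflow runs")
    actions = run.add_subparsers(dest="action", required=True)

    start = actions.add_parser("start", help="Start workflow execution")
    start.add_argument("feature")

    # status, pause and resume all take a single run id
    for name, help_text in [
        ("status", "Show run status + metrics"),
        ("pause", "Pause run"),
        ("resume", "Resume paused run"),
    ]:
        sub = actions.add_parser(name, help=help_text)
        sub.add_argument("run_id")

    actions.add_parser("list", help="List all runs")
    return parser


args = build_parser().parse_args(["run", "status", "run-20260228-001"])
print(args.action, args.run_id)
```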

Schema Execution Config

schema:
  prefix: "spec"
  
  artifacts:
    - id: spec
      file: spec.md
      requires: []
    - id: plan
      file: plan.md
      requires: [spec]
    - id: tasks
      file: tasks.md
      requires: [plan]

  execution:
    quality:
      target: 85
      
    stages:
      spec:
        command: /spec.specify
        max_iterations: 3
        checkpoint: optional
        
      plan:
        command: /spec.plan
        max_iterations: 3
        checkpoint: after
        
      tasks:
        command: /spec.tasks
        max_iterations: 2
        checkpoint: none
        
      implement:
        command: /spec.implement
        max_iterations: 5
        checkpoint: on_quality_fail
        parallel: true

    hooks:
      before_run: []
      after_run: []
      before_stage: []
      after_stage: []
      on_quality_fail: []
      on_checkpoint: []
      on_iteration: []
      on_error: []
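
One way the runner could derive a deterministic artifact order from the `requires` lists above is a topological sort. The sketch below mirrors the `artifacts` section as plain Python data (YAML parsing is omitted) and uses the standard-library `graphlib`; `artifact_order` is a hypothetical helper, not an existing spec-kit function.

```python
from graphlib import TopologicalSorter

# Same data as the artifacts section of the schema, as Python literals.
ARTIFACTS = [
    {"id": "spec", "file": "spec.md", "requires": []},
    {"id": "plan", "file": "plan.md", "requires": ["spec"]},
    {"id": "tasks", "file": "tasks.md", "requires": ["plan"]},
]


def artifact_order(artifacts):
    """Return artifact ids so that every artifact follows its requirements."""
    known = {a["id"] for a in artifacts}
    graph = {}
    for a in artifacts:
        missing = set(a["requires"]) - known
        if missing:
            raise ValueError(f"{a['id']} requires unknown artifacts: {sorted(missing)}")
        graph[a["id"]] = set(a["requires"])
    # static_order() raises graphlib.CycleError if the requirements are circular
    return list(TopologicalSorter(graph).static_order())


print(artifact_order(ARTIFACTS))  # ['spec', 'plan', 'tasks']
```

Validating references and cycles at load time lets a malformed schema fail before any AI session is spawned.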

Journal System

Location: .specify/runs/<run-id>/

.specify/runs/<run-id>/
├── state.json       # Current state cache
├── journal.jsonl    # Event log (source of truth)
└── config.json      # Run configuration

Event Types:

  • RUN_START / RUN_COMPLETE
  • COMMAND_START / COMMAND_COMPLETE
  • QUALITY_CHECK
  • DECISION
  • CHECKPOINT / CHECKPOINT_RESOLVED
  • ERROR / ERROR_TRANSIENT
  • PARALLEL_START / PARALLEL_COMPLETE
  • TASK_SPAWNED / TASK_COMPLETE
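
Since journal.jsonl is the source of truth, resuming can be modelled as replaying it. The sketch below appends events as JSON lines and derives the list of completed stages. The record fields (`ts`, `type`, `stage`, `decision`) are assumptions for illustration; this issue does not define the event schema.

```python
import json
import time
from pathlib import Path


def append_event(journal: Path, event_type: str, **payload) -> None:
    """Append one event as a single JSON line (append-only log)."""
    record = {"ts": time.time(), "type": event_type, **payload}
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def completed_stages(journal: Path) -> list[str]:
    """Replay the journal; a stage counts as done once a 'proceed' DECISION is logged."""
    done: list[str] = []
    if not journal.exists():
        return done
    with journal.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            if event["type"] == "DECISION" and event.get("decision") == "proceed":
                done.append(event["stage"])
    return done
```

Under this model, state.json is only a cache: it can be deleted and rebuilt from the journal, and `specify run resume` would start at the first stage not returned by `completed_stages`.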

Quality Convergence

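previous_feedback = None  # no feedback exists before the first attempt

for iteration in range(1, max_iterations + 1):
    # Assumes execute_command treats feedback=None like a plain first run
    result = execute_command(stage, feedback=previous_feedback)

    # Numeric quality score from the evals framework
    score = run_evals(result)

    if score >= threshold:
        break  # quality gate passed, proceed to the next stage

    # Collect the failed checks from the evaluated result (not from the
    # numeric score) so the next iteration can address them
    previous_feedback = get_failed_checks(result)

# All iterations used without reaching the threshold: ask the user
if score < threshold:
    decision = checkpoint(f"Quality {score} below {threshold}. Proceed anyway?")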

Error Handling

| Error Type | Strategy |
|---|---|
| Transient (timeout, rate limit) | Retry with backoff |
| User action required | Checkpoint |
| Unrecoverable | Fail gracefully |
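
The three strategies above could be expressed roughly as follows. The exception classes, the `run_with_policy` name, and the exponential backoff schedule are all assumptions made for this sketch; the issue only specifies "retry with backoff".

```python
import time


class TransientError(Exception):
    """Timeouts, rate limits: worth retrying (hypothetical class)."""


class UserActionRequired(Exception):
    """Something only the user can fix (hypothetical class)."""


def run_with_policy(action, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run action() and map failures to ('ok' | 'checkpoint' | 'failed', detail)."""
    for attempt in range(max_retries + 1):
        try:
            return ("ok", action())
        except TransientError:
            if attempt == max_retries:
                return ("failed", "retries exhausted")
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
        except UserActionRequired as exc:
            return ("checkpoint", str(exc))   # hand control to the CLI prompt
        except Exception as exc:
            return ("failed", str(exc))       # unrecoverable: stop cleanly
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays.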

Hooks

| Hook | When |
|---|---|
| before_run | Run starts |
| after_run | Run completes |
| before_stage | Before any stage |
| after_stage | After stage completes |
| on_quality_fail | Quality check fails |
| on_checkpoint | Checkpoint triggered |
| on_iteration | Before each iteration |
| on_error | Command fails |
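
A minimal sketch of how the hook points above might be dispatched in-process. The `HookRegistry` class and its callable-based interface are assumptions; the issue does not say whether hooks are Python callables, shell commands, or something else.

```python
from collections import defaultdict

# Hook names taken from the table above.
HOOK_NAMES = {
    "before_run", "after_run", "before_stage", "after_stage",
    "on_quality_fail", "on_checkpoint", "on_iteration", "on_error",
}


class HookRegistry:
    """Maps hook names to callables and fires them in registration order."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, name, fn):
        if name not in HOOK_NAMES:
            raise ValueError(f"unknown hook: {name}")
        self._hooks[name].append(fn)

    def fire(self, name, **context):
        # Returns each hook's result; an unregistered name yields an empty list
        return [fn(**context) for fn in self._hooks.get(name, ())]


registry = HookRegistry()
registry.register("before_stage", lambda **ctx: f"starting {ctx['stage']}")
print(registry.fire("before_stage", stage="spec"))  # ['starting spec']
```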

Observability

$ specify run status run-20260228-001

Run: run-20260228-001
Schema: spec-driven
Status: completed
Duration: 12m 34s

Stages:
┌──────────┬────────┬────────────┬─────────┬──────────┐
│ Stage    │ Status │ Iterations │ Quality │ Duration │
├──────────┼────────┼────────────┼─────────┼──────────┤
│ spec     │ ✓      │ 2          │ 88      │ 3m 12s   │
│ plan     │ ✓      │ 1          │ 92      │ 2m 45s   │
│ tasks    │ ✓      │ 1          │ 95      │ 1m 20s   │
│ implement│ ✓      │ 3          │ 87      │ 5m 17s   │
└──────────┴────────┴────────────┴─────────┴──────────┘

Checkpoints: 2 approved, 0 rejected

Implementation Phases

| Phase | Deliverable |
|---|---|
| 1 | CLI commands (specify run start/status/pause/resume/list) |
| 2 | Journal system (JSONL events, state cache) |
| 3 | Command execution (spawn AI, wait, capture output) |
| 4 | Quality gates (evals integration) |
| 5 | Convergence loop (iterate with feedback) |
| 6 | Checkpoints (CLI prompt) |
| 7 | Error handling (retry, checkpoint, fail) |
| 8 | Hooks system |
| 9 | Observability (status command, metrics) |

Blocked By

Deferred (Future Issues)

  • Orchestrator integration (agentic-sdlc-orchestrator) for K8s-based async
  • Web UI for checkpoints
  • Custom process definitions
  • Telegram/Slack notifications

Acceptance Criteria

  • specify run start executes full workflow with quality gates
  • Journal captures all events for audit trail
  • State recovery enables resume from the last completed stage
  • Checkpoints pause for CLI approval
  • Quality convergence iterates with feedback
  • Error handling follows transient→retry, action→checkpoint pattern
  • Hooks enable extension integration (e.g., TDD; see "Add TDD Extension: Strict RED→GREEN→REFACTOR Vertical Slicing", #48)
  • Status command shows run metrics
