## Summary
Implement a deterministic, code-based execution layer (spec-runner) that wraps existing spec-kit commands with quality gates, journaling, and resumability. Inspired by Babysitter's orchestration patterns but AI-agnostic.
## Background

- Babysitter provides deterministic orchestration for Claude Code only
- We need an AI-agnostic equivalent for spec-kit workflows
- Key insight: a code-based orchestrator (deterministic) wrapping LLM execution (non-deterministic)

References:
## Architecture

```
┌──────────────────────────────────────────────────────────┐
│ specify run start <feature>                              │
│ (Python CLI - deterministic)                             │
├──────────────────────────────────────────────────────────┤
│ 1. Load schema (combined: artifacts + execution config)  │
│ 2. For each stage:                                       │
│    ├── Log: COMMAND_START                                │
│    ├── Spawn AI session → execute command                │
│    ├── Log: COMMAND_COMPLETE                             │
│    ├── Run evals → quality score                         │
│    ├── Log: QUALITY_CHECK                                │
│    ├── Decision (deterministic):                         │
│    │   ├── score >= threshold → proceed                  │
│    │   ├── iteration < max → iterate with feedback       │
│    │   └── max reached → checkpoint                      │
│    └── Log: DECISION                                     │
│ 3. On error: retry/checkpoint/fail                       │
│ 4. On checkpoint: CLI prompt                             │
└──────────────────────────────────────────────────────────┘
```
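The decision step is the part of the loop that must stay deterministic. A minimal sketch as a pure function (the `decide` name and `Decision` enum are illustrative, not part of the CLI):

```python
from enum import Enum

class Decision(Enum):
    PROCEED = "proceed"        # quality gate passed
    ITERATE = "iterate"        # retry the stage with feedback
    CHECKPOINT = "checkpoint"  # hand control back to the user

def decide(score: int, threshold: int, iteration: int, max_iterations: int) -> Decision:
    """Deterministic gate: identical inputs always produce the same decision."""
    if score >= threshold:
        return Decision.PROCEED
    if iteration < max_iterations:
        return Decision.ITERATE
    return Decision.CHECKPOINT
```

Keeping this logic in code rather than in the LLM session is what makes runs replayable from the journal.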
Key Design Decisions
| Aspect |
Decision |
Rationale |
| Orchestration |
Code-based (Python CLI) |
Deterministic decisions |
| Commands |
Wrap existing /spec.* |
No changes to existing commands |
| Quality Gates |
Use evals framework |
Reuse proven methodology |
| State Recovery |
Resume from last completed stage |
Simple, reliable |
| Checkpoints |
CLI prompt |
No external service for v1 |
| Task Parallelism |
Task-level (local sessions) |
Orchestrator integration deferred |
| Quality Convergence |
Fix with feedback, checkpoint on max |
Follows Babysitter pattern |
| Schema + Process |
Combined in schema.yaml |
Single source of truth |
| Hooks |
Spec-runner specific hooks |
Extensibility |
| Observability |
CLI + Journal |
Both metrics access methods |
| Error Handling |
Transient→retry, user-action→checkpoint |
Follows Babysitter pattern |
## CLI Commands

| Command | Description |
|---|---|
| `specify run start <feature>` | Start workflow execution |
| `specify run status <run-id>` | Show run status + metrics |
| `specify run pause <run-id>` | Pause a run |
| `specify run resume <run-id>` | Resume a paused run |
| `specify run list` | List all runs |
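A sketch of how this subcommand surface could be wired with `argparse` (the real CLI may use a different framework; `build_parser` is a hypothetical helper):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build the `specify run <action>` command tree."""
    parser = argparse.ArgumentParser(prog="specify")
    run = parser.add_subparsers(dest="group", required=True).add_parser("run")
    action = run.add_subparsers(dest="action", required=True)
    action.add_parser("start").add_argument("feature")
    for name in ("status", "pause", "resume"):
        action.add_parser(name).add_argument("run_id")
    action.add_parser("list")
    return parser
```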
## Schema Execution Config

```yaml
schema:
  prefix: "spec"
  artifacts:
    - id: spec
      file: spec.md
      requires: []
    - id: plan
      file: plan.md
      requires: [spec]
    - id: tasks
      file: tasks.md
      requires: [plan]

execution:
  quality:
    target: 85
  stages:
    spec:
      command: /spec.specify
      max_iterations: 3
      checkpoint: optional
    plan:
      command: /spec.plan
      max_iterations: 3
      checkpoint: after
    tasks:
      command: /spec.tasks
      max_iterations: 2
      checkpoint: none
    implement:
      command: /spec.implement
      max_iterations: 5
      checkpoint: on_quality_fail
      parallel: true
  hooks:
    before_run: []
    after_run: []
    before_stage: []
    after_stage: []
    on_quality_fail: []
    on_checkpoint: []
    on_iteration: []
    on_error: []
```
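Stage ordering can be derived from the `requires` edges rather than hard-coded. Assuming the artifact shape above, Python's stdlib `graphlib` gives a deterministic order and rejects cycles:

```python
import graphlib

def stage_order(artifacts: list[dict]) -> list[str]:
    """Topologically sort artifact ids by their `requires` edges.

    Raises graphlib.CycleError if the schema declares a dependency cycle.
    """
    graph = {a["id"]: set(a["requires"]) for a in artifacts}
    return list(graphlib.TopologicalSorter(graph).static_order())

# Artifacts as they would be parsed from schema.yaml
artifacts = [
    {"id": "spec", "file": "spec.md", "requires": []},
    {"id": "plan", "file": "plan.md", "requires": ["spec"]},
    {"id": "tasks", "file": "tasks.md", "requires": ["plan"]},
]
```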
## Journal System

Location: `.specify/runs/<run-id>/`

```
.specify/runs/<run-id>/
├── state.json     # Current state cache
├── journal.jsonl  # Event log (source of truth)
└── config.json    # Run configuration
```

Event types:

- `RUN_START` / `RUN_COMPLETE`
- `COMMAND_START` / `COMMAND_COMPLETE`
- `QUALITY_CHECK`
- `DECISION`
- `CHECKPOINT` / `CHECKPOINT_RESOLVED`
- `ERROR` / `ERROR_TRANSIENT`
- `PARALLEL_START` / `PARALLEL_COMPLETE`
- `TASK_SPAWNED` / `TASK_COMPLETE`
## Quality Convergence

```python
for iteration in range(1, max_iterations + 1):
    if iteration == 1:
        result = execute_command(stage)
    else:
        # Re-run the stage command with the failed checks as feedback
        result = execute_command(stage, feedback=previous_feedback)
    score = run_evals(result)
    if score >= threshold:
        break
    previous_feedback = get_failed_checks(result)

# Max iterations exhausted without reaching the threshold
if score < threshold:
    decision = checkpoint(f"Quality {score} below {threshold}. Proceed anyway?")
```
## Error Handling

| Error Type | Strategy |
|---|---|
| Transient (timeout, rate limit) | Retry with backoff |
| User action required | Checkpoint |
| Unrecoverable | Fail gracefully |
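The transient branch might look like this (a sketch; `TransientError` and the parameters are assumptions, not spec-runner API):

```python
import time

class TransientError(Exception):
    """Timeouts, rate limits, and other retryable failures."""

def with_retry(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry `fn` on transient errors with exponential backoff (1s, 2s, 4s, ...).

    Non-transient exceptions propagate immediately; the last transient
    failure is re-raised once attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```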
## Hooks

| Hook | When |
|---|---|
| `before_run` | Run starts |
| `after_run` | Run completes |
| `before_stage` | Before any stage |
| `after_stage` | After a stage completes |
| `on_quality_fail` | Quality check fails |
| `on_checkpoint` | Checkpoint triggered |
| `on_iteration` | Before each iteration |
| `on_error` | Command fails |
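One possible shape for the hooks system is an in-process registry of callables (a sketch; hooks could equally be shell commands read from the `hooks:` lists in `schema.yaml`):

```python
from collections import defaultdict
from typing import Any, Callable

class HookRegistry:
    """Maps hook names (before_run, after_stage, ...) to ordered callbacks."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Callable[..., Any]]] = defaultdict(list)

    def on(self, name: str, fn: Callable[..., Any]) -> None:
        """Register a callback for a hook point."""
        self._hooks[name].append(fn)

    def fire(self, name: str, **context: Any) -> None:
        """Invoke every callback registered for `name`, in order."""
        for fn in self._hooks[name]:
            fn(**context)
```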
## Observability

```
$ specify run status run-20260228-001

Run: run-20260228-001
Schema: spec-driven
Status: completed
Duration: 12m 34s

Stages:
┌──────────┬────────┬────────────┬─────────┬──────────┐
│ Stage    │ Status │ Iterations │ Quality │ Duration │
├──────────┼────────┼────────────┼─────────┼──────────┤
│ spec     │ ✓      │ 2          │ 88      │ 3m 12s   │
│ plan     │ ✓      │ 1          │ 92      │ 2m 45s   │
│ tasks    │ ✓      │ 1          │ 95      │ 1m 20s   │
│ implement│ ✓      │ 3          │ 87      │ 5m 17s   │
└──────────┴────────┴────────────┴─────────┴──────────┘

Checkpoints: 2 approved, 0 rejected
```
## Implementation Phases

| Phase | Deliverable |
|---|---|
| 1 | CLI commands (`specify run start/status/pause/resume/list`) |
| 2 | Journal system (JSONL events, state cache) |
| 3 | Command execution (spawn AI, wait, capture output) |
| 4 | Quality gates (evals integration) |
| 5 | Convergence loop (iterate with feedback) |
| 6 | Checkpoints (CLI prompt) |
| 7 | Error handling (retry, checkpoint, fail) |
| 8 | Hooks system |
| 9 | Observability (status command, metrics) |
## Blocked By
## Deferred (Future Issues)

- Orchestrator integration (`agentic-sdlc-orchestrator`) for K8s-based async execution
- Web UI for checkpoints
- Custom process definitions
- Telegram/Slack notifications
## Acceptance Criteria
- `specify run start` executes the full workflow with quality gates