[Feature]: Evals Extension — EDD (Eval-Driven Development) with PromptFoo Integration

## Summary

Build an **evals extension** for Spec Kit following EDD (Eval-Driven Development) principles, integrating with PromptFoo for eval execution and annotation.

## Reference

- **EDD Principles**: Google Doc `1O0XSF31Fp9e6SXFCujO6C2zATPVAk_HJayzhg-KYAEo`
- **Error Analysis**: Paweł Huryn's methodology (Issue #78 thread)

## The 10 EDD Principles

| # | Principle | Core Implication |
|---|---|---|
| **I** | Spec-Driven Contracts | Specs precede evals; evals validate spec compliance |
| **II** | Binary Pass/Fail | No Likert scales — output meets spec or fails |
| **III** | Error Analysis & Pattern Discovery | Open/Axial coding → bottom-up failure taxonomy |
| **IV** | Evaluation Pyramid (Offline vs. Online) | CI/CD: fast checks + goldset judges; Production: 10-20% sampling |
| **V** | Trajectory Observability | Full multi-turn traces, not just outputs |
| **VI** | RAG Decomposition | Separate retrieval from generation; IR metrics + LLM judges |
| **VII** | Annotation Queues | Route high-risk traces to humans for binary review |
| **VIII** | Close the Production Loop | Spec failures → fix directives; Gen failures → add to dataset + evaluator |
| **IX** | Test Data is Code | Version datasets; include adversarial inputs; hold-out test set |
| **X** | Cross-Functional Observability | PMs, domain experts, and AI engineers all collaborate |

## Project Structure

```
{project}/
├── .specify/
│   ├── drafts/                    # Draft eval records (markdown + YAML frontmatter)
│   │   └── eval-*.md
│   └── config.yml                 # evals.system configured here
├── evals/
│   └── {system}/                # promptfoo | custom | llm-judge
│       ├── goldset.md           # Published evals (markdown + YAML frontmatter)
│       ├── goldset.json         # Auto-generated for PromptFoo consumption
│       ├── graders/
│       │   ├── check_pii_leakage.py           # Security baseline (always applied)
│       │   ├── check_prompt_injection.py
│       │   └── {failure-mode}.py              # One per generalization_failure
│       ├── config.js             # Generated PromptFoo config
│       └── config.yml           # System-specific config
├── evals/results/              # Git-ignored; PromptFoo outputs + traces
└── specs/
    └── {feature}/
        └── tasks.md             # [EVAL] markers per task
```

## Commands

| Command | Type | Purpose |
|---|---|---|
| `evals.init` | Manual | Initialize `evals/` directory |
| `evals.specify` | Manual | Bottom-up goldset definition from human error analysis → `drafts/` |
| `evals.clarify` | Manual | Axial coding + accept drafts → `goldset.md` + `goldset.json` |
| `evals.analyze` | Manual | Re-code + quantify + saturation + adversarial check + holdout split |
| `evals.tasks` | Hook (`after_tasks`) | Match published evals to feature tasks → `[EVAL]` markers |
| `evals.implement` | Hook (`after_implement`) | Generate PromptFoo config + graders from goldset |
| `evals.levelup` | Manual | Scan `evals/results/` + annotation queue → PR to team-ai-directives |
| `evals.validate` | Manual | TPR/TNR + goldset quality + PromptFoo pass rate thresholds |

## Hooks

```yaml
hooks:
  after_tasks:
    command: "adlc.evals.tasks"
  after_implement:
    command: "adlc.evals.implement"
```

## Goldset Lifecycle (ADR/CDR Pattern)

```
evals.specify → evals.clarify → evals.analyze → evals.implement
   (draft)         (accept)       (finalize)      (publish)
```

### Goldset Record Format (Markdown + YAML frontmatter)

```yaml
---
id: eval-001
status: draft | accepted | published
name: {name}
description: {description}

# Binary pass/fail only (EDD Principle II)
pass_condition: {precise spec constraint}
fail_condition: {precise spec violation}

# Failure type gate (EDD Principle VIII)
failure_type:
  specification_failure:
    action: fix_directive
  generalization_failure:
    action: build_evaluator
    evaluator_type: code-based | llm-judge

# Error analysis provenance (EDD Principle III)
error_analysis:
  traces_analyzed: N
  theoretical_saturation: true | false

# Test data hygiene (EDD Principle IX)
test_data:
  adversarial_included: true | false
  holdout_ratio: 0.2

# RAG decomposition (EDD Principle VI, optional)
rag_decomposition:
  retrieval_check: ir-metrics | llm-judge | none
  generation_check: llm-judge | none
---
```

## EDD Principles → Command Mapping

| Principle | Integration |
|---|---|
| **I** Spec-Driven Contracts | `evals.specify` creates criteria from human error analysis, not generic metrics |
| **II** Binary Pass/Fail | All graders output binary pass/fail — no Likert scales |
| **III** Error Analysis | `evals.specify` (open coding) → `evals.clarify` (axial coding) → `evals.analyze` (saturation) |
| **IV** Evaluation Pyramid | `evals.implement` generates Tier 1 (fast CI/CD checks) + Tier 2 (goldset LLM judges) |
| **V** Trajectory Observability | `evals.levelup` scans full traces from `evals/results/` including tool calls |
| **VI** RAG Decomposition | `rag_decomposition` fields in goldset record; separate graders for retrieval vs. generation |
| **VII** Annotation Queues | Annotation integration via PromptFoo/system native annotation support |
| **VIII** Close the Loop | **Gate in `evals.implement`**: `specification_failure` → fix directive; `generalization_failure` → build evaluator |
| **IX** Test Data is Code | `evals.analyze` ensures adversarial inputs included, hold-out set preserved, datasets versioned |
| **X** Cross-Functional | `evals.levelup` PR targets `team-ai-directives/AGENTS.md` for PMs, domain experts, and AI engineers |

## Lifecycle

```
evals.init
        │
        ▼
evals.specify     ← bottom-up error analysis from human
  (free-text notes per trace → drafts/eval-*.md)
        │
        ▼
evals.clarify     ← axial coding
  (cluster notes → failure modes → accept → goldset.md → goldset.json)
        │
        ▼
evals.analyze     ← quantify + saturation + adversarial + holdout
        │
        ▼
evals.implement   ← after_implement hook
  ┌─ Tier 1 (CI/CD): Fast deterministic graders
  │   • code-based: XML structure, SQL syntax, Regex match
  │   • Security baseline: pii_leakage, injection, etc.
  ├─ Tier 2 (CI/CD): Goldset LLM judges
  │   • semantic checks on version-controlled goldset
  │   • blocks bad merges
  └─ Annotation integration: routes traces to system annotation support
        │
        ▼
evals.validate    ← TPR/TNR + goldset quality + pass rate thresholds
        │
        ▼
evals.levelup     ← evals/results/ scan + annotation queue
  (full trajectories → high-risk routing → binary human review
   → PR to team-ai-directives/AGENTS.md)
        │
        ▼
evals.tasks   ← after_tasks hook
  (match all published evals → [EVAL] markers in specs/{feature}/tasks.md)
```

## `evals.tasks` — Task Marker Matching

**Matching logic**: Keyword overlap (default) + exact tag match (if `task_tags` defined in goldset)

**Multiple evals per task**: Allowed; conflicts flagged for human review with `AMBIGUOUS:` marker

**Scope**: All published evals from `goldset.md` (no feature filtering)

**Marker format**: `[EVAL]: eval-001 (check_name)`

**Variants**:
- `/evals.tasks --dry-run` → show proposed markers (default for manual)
- `/evals.tasks --apply` → write markers to tasks.md
- Hook mode → auto-write (after_tasks hook)

**Example output**:
```markdown
## TASK-001: Implement token validation
[EVAL]: eval-001 (auth_token_present)
[EVAL]: eval-003 (pii_not_leaked)
⚠ AMBIGUOUS: eval-002 vs eval-005 — review before apply

## TASK-002: Implement password reset
[EVAL]: eval-001 (auth_token_present)
```

## Extension Files

```
.specify/extensions/evals/
├── extension.yml              # 8 commands, 2 hooks, tags, defaults, handoffs
├── README.md                  # Full docs + EDD Principles + Evaluation Pyramid
├── CHANGELOG.md
├── config-template.yml        # system: promptfoo (default)
│
├── commands/
│   ├── init.md               # Initialize evals/{system}/ directory
│   ├── specify.md            # Bottom-up from human error analysis → drafts/
│   ├── clarify.md            # Axial coding → goldset.md + goldset.json
│   ├── analyze.md            # Quantify + saturation + adversarial + holdout
│   ├── tasks.md              # Match evals → [EVAL] markers (after_tasks hook)
│   ├── implement.md          # Tier 1 + Tier 2 graders + annotation integration
│   ├── validate.md           # TPR/TNR + goldset quality + pass rate thresholds
│   └── levelup.md            # evals/results/ scan + annotation queue + PR
│
├── scripts/bash/
│   └── setup-evals.sh        # All 8 actions + --json output
│
└── templates/
    ├── eval-criterion.md          # Individual eval criterion (binary pass/fail)
    ├── goldset-record.md          # Draft → Accept → Published lifecycle
    ├── failure-mode-registry.md   # Failure taxonomy (per-system)
    ├── promptfoo-test.yaml        # PromptFoo test template
    └── grader-template.py          # Python grader template (code-based)
```

## Implementation Order

1. **Scaffold**: `extension.yml` + `setup-evals.sh` + `config-template.yml`
2. **Goldset lifecycle**: `init.md` → `specify.md` → `clarify.md` → `analyze.md`
3. **Evaluator building**: `implement.md` + PromptFoo templates + grader template
4. **Validation**: `validate.md` (TPR/TNR + quality + pass rates)
5. **Ops**: `levelup.md` (traces + annotation + PR) + `tasks.md` (markers)
6. **Docs**: `templates/` + `README.md` + `CHANGELOG.md`

Command	Type	Purpose
`evals.init`	Manual	Initialize `evals/` directory
`evals.specify`	Manual	Bottom-up goldset definition from human error analysis → `drafts/`
`evals.clarify`	Manual	Axial coding + accept drafts → `goldset.md` + `goldset.json`
`evals.analyze`	Manual	Re-code + quantify + saturation + adversarial check + holdout split
`evals.tasks`	Hook (`after_tasks`)	Match published evals to feature tasks → `[EVAL]` markers
`evals.implement`	Hook (`after_implement`)	Generate PromptFoo config + graders from goldset
`evals.levelup`	Manual	Scan `evals/results/` + annotation queue → PR to team-ai-directives
`evals.validate`	Manual	TPR/TNR + goldset quality + PromptFoo pass rate thresholds

Principle	Integration
I Spec-Driven Contracts	`evals.specify` creates criteria from human error analysis, not generic metrics
II Binary Pass/Fail	All graders output binary pass/fail — no Likert scales
III Error Analysis	`evals.specify` (open coding) → `evals.clarify` (axial coding) → `evals.analyze` (saturation)
IV Evaluation Pyramid	`evals.implement` generates Tier 1 (fast CI/CD checks) + Tier 2 (goldset LLM judges)
V Trajectory Observability	`evals.levelup` scans full traces from `evals/results/` including tool calls
VI RAG Decomposition	`rag_decomposition` fields in goldset record; separate graders for retrieval vs. generation
VII Annotation Queues	Annotation integration via PromptFoo/system native annotation support
VIII Close the Loop	Gate in `evals.implement`: `specification_failure` → fix directive; `generalization_failure` → build evaluator
IX Test Data is Code	`evals.analyze` ensures adversarial inputs included, hold-out set preserved, datasets versioned
X Cross-Functional	`evals.levelup` PR targets `team-ai-directives/AGENTS.md` for PMs, domain experts, and AI engineers

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature]: Evals Extension — EDD (Eval-Driven Development) with PromptFoo Integration #78

Summary

Reference

The 10 EDD Principles

Project Structure

Commands

Hooks

Goldset Lifecycle (ADR/CDR Pattern)

Goldset Record Format (Markdown + YAML frontmatter)

EDD Principles → Command Mapping

Lifecycle

`evals.tasks` — Task Marker Matching

Extension Files

Implementation Order

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

#	Principle	Core Implication
I	Spec-Driven Contracts	Specs precede evals; evals validate spec compliance
II	Binary Pass/Fail	No Likert scales — output meets spec or fails
III	Error Analysis & Pattern Discovery	Open/Axial coding → bottom-up failure taxonomy
IV	Evaluation Pyramid (Offline vs. Online)	CI/CD: fast checks + goldset judges; Production: 10-20% sampling
V	Trajectory Observability	Full multi-turn traces, not just outputs
VI	RAG Decomposition	Separate retrieval from generation; IR metrics + LLM judges
VII	Annotation Queues	Route high-risk traces to humans for binary review
VIII	Close the Production Loop	Spec failures → fix directives; Gen failures → add to dataset + evaluator
IX	Test Data is Code	Version datasets; include adversarial inputs; hold-out test set
X	Cross-Functional Observability	PMs, domain experts, and AI engineers all collaborate

[Feature]: Evals Extension — EDD (Eval-Driven Development) with PromptFoo Integration #78

Description

Summary

Reference

The 10 EDD Principles

Project Structure

Commands

Hooks

Goldset Lifecycle (ADR/CDR Pattern)

Goldset Record Format (Markdown + YAML frontmatter)

EDD Principles → Command Mapping

Lifecycle

evals.tasks — Task Marker Matching

Extension Files

Implementation Order

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`evals.tasks` — Task Marker Matching