Skip to content

[Feature]: Evals Extension — EDD (Eval-Driven Development) with PromptFoo Integration #78

@kanfil

Description

@kanfil

Summary

Build an evals extension for Spec Kit following EDD (Eval-Driven Development) principles, integrating with PromptFoo for eval execution and annotation.

Reference

The 10 EDD Principles

# Principle Core Implication
I Spec-Driven Contracts Specs precede evals; evals validate spec compliance
II Binary Pass/Fail No Likert scales — output meets spec or fails
III Error Analysis & Pattern Discovery Open/Axial coding → bottom-up failure taxonomy
IV Evaluation Pyramid (Offline vs. Online) CI/CD: fast checks + goldset judges; Production: 10-20% sampling
V Trajectory Observability Full multi-turn traces, not just outputs
VI RAG Decomposition Separate retrieval from generation; IR metrics + LLM judges
VII Annotation Queues Route high-risk traces to humans for binary review
VIII Close the Production Loop Spec failures → fix directives; Gen failures → add to dataset + evaluator
IX Test Data is Code Version datasets; include adversarial inputs; hold-out test set
X Cross-Functional Observability PMs, domain experts, and AI engineers all collaborate

Project Structure

{project}/
├── .specify/
│   ├── drafts/                    # Draft eval records (markdown + YAML frontmatter)
│   │   └── eval-*.md
│   └── config.yml                 # evals.system configured here
├── evals/
│   └── {system}/                # promptfoo | custom | llm-judge
│       ├── goldset.md           # Published evals (markdown + YAML frontmatter)
│       ├── goldset.json         # Auto-generated for PromptFoo consumption
│       ├── graders/
│       │   ├── check_pii_leakage.py           # Security baseline (always applied)
│       │   ├── check_prompt_injection.py
│       │   └── {failure-mode}.py              # One per generalization_failure
│       ├── config.js             # Generated PromptFoo config
│       └── config.yml           # System-specific config
├── evals/results/              # Git-ignored; PromptFoo outputs + traces
└── specs/
    └── {feature}/
        └── tasks.md             # [EVAL] markers per task

Commands

Command Type Purpose
evals.init Manual Initialize evals/ directory
evals.specify Manual Bottom-up goldset definition from human error analysis → drafts/
evals.clarify Manual Axial coding + accept drafts → goldset.md + goldset.json
evals.analyze Manual Re-code + quantify + saturation + adversarial check + holdout split
evals.tasks Hook (after_tasks) Match published evals to feature tasks → [EVAL] markers
evals.implement Hook (after_implement) Generate PromptFoo config + graders from goldset
evals.levelup Manual Scan evals/results/ + annotation queue → PR to team-ai-directives
evals.validate Manual TPR/TNR + goldset quality + PromptFoo pass rate thresholds

Hooks

hooks:
  after_tasks:
    command: "adlc.evals.tasks"
  after_implement:
    command: "adlc.evals.implement"

Goldset Lifecycle (ADR/CDR Pattern)

evals.specify → evals.clarify → evals.analyze → evals.implement
   (draft)         (accept)       (finalize)      (publish)

Goldset Record Format (Markdown + YAML frontmatter)

---
id: eval-001
status: draft | accepted | published
name: {name}
description: {description}

# Binary pass/fail only (EDD Principle II)
pass_condition: {precise spec constraint}
fail_condition: {precise spec violation}

# Failure type gate (EDD Principle VIII)
failure_type:
  specification_failure:
    action: fix_directive
  generalization_failure:
    action: build_evaluator
    evaluator_type: code-based | llm-judge

# Error analysis provenance (EDD Principle III)
error_analysis:
  traces_analyzed: N
  theoretical_saturation: true | false

# Test data hygiene (EDD Principle IX)
test_data:
  adversarial_included: true | false
  holdout_ratio: 0.2

# RAG decomposition (EDD Principle VI, optional)
rag_decomposition:
  retrieval_check: ir-metrics | llm-judge | none
  generation_check: llm-judge | none
---

EDD Principles → Command Mapping

Principle Integration
I Spec-Driven Contracts evals.specify creates criteria from human error analysis, not generic metrics
II Binary Pass/Fail All graders output binary pass/fail — no Likert scales
III Error Analysis evals.specify (open coding) → evals.clarify (axial coding) → evals.analyze (saturation)
IV Evaluation Pyramid evals.implement generates Tier 1 (fast CI/CD checks) + Tier 2 (goldset LLM judges)
V Trajectory Observability evals.levelup scans full traces from evals/results/ including tool calls
VI RAG Decomposition rag_decomposition fields in goldset record; separate graders for retrieval vs. generation
VII Annotation Queues Annotation integration via PromptFoo/system native annotation support
VIII Close the Loop Gate in evals.implement: specification_failure → fix directive; generalization_failure → build evaluator
IX Test Data is Code evals.analyze ensures adversarial inputs included, hold-out set preserved, datasets versioned
X Cross-Functional evals.levelup PR targets team-ai-directives/AGENTS.md for PMs, domain experts, and AI engineers

Lifecycle

evals.init
        │
        ▼
evals.specify     ← bottom-up error analysis from human
  (free-text notes per trace → drafts/eval-*.md)
        │
        ▼
evals.clarify     ← axial coding
  (cluster notes → failure modes → accept → goldset.md → goldset.json)
        │
        ▼
evals.analyze     ← quantify + saturation + adversarial + holdout
        │
        ▼
evals.implement   ← after_implement hook
  ┌─ Tier 1 (CI/CD): Fast deterministic graders
  │   • code-based: XML structure, SQL syntax, Regex match
  │   • Security baseline: pii_leakage, injection, etc.
  ├─ Tier 2 (CI/CD): Goldset LLM judges
  │   • semantic checks on version-controlled goldset
  │   • blocks bad merges
  └─ Annotation integration: routes traces to system annotation support
        │
        ▼
evals.validate    ← TPR/TNR + goldset quality + pass rate thresholds
        │
        ▼
evals.levelup     ← evals/results/ scan + annotation queue
  (full trajectories → high-risk routing → binary human review
   → PR to team-ai-directives/AGENTS.md)
        │
        ▼
evals.tasks   ← after_tasks hook
  (match all published evals → [EVAL] markers in specs/{feature}/tasks.md)

evals.tasks — Task Marker Matching

Matching logic: Keyword overlap (default) + exact tag match (if task_tags defined in goldset)

Multiple evals per task: Allowed; conflicts flagged for human review with AMBIGUOUS: marker

Scope: All published evals from goldset.md (no feature filtering)

Marker format: [EVAL]: eval-001 (check_name)

Variants:

  • /evals.tasks --dry-run → show proposed markers (default for manual)
  • /evals.tasks --apply → write markers to tasks.md
  • Hook mode → auto-write (after_tasks hook)

Example output:

## TASK-001: Implement token validation
[EVAL]: eval-001 (auth_token_present)
[EVAL]: eval-003 (pii_not_leaked)
⚠ AMBIGUOUS: eval-002 vs eval-005 — review before apply

## TASK-002: Implement password reset
[EVAL]: eval-001 (auth_token_present)

Extension Files

.specify/extensions/evals/
├── extension.yml              # 8 commands, 2 hooks, tags, defaults, handoffs
├── README.md                  # Full docs + EDD Principles + Evaluation Pyramid
├── CHANGELOG.md
├── config-template.yml        # system: promptfoo (default)
│
├── commands/
│   ├── init.md               # Initialize evals/{system}/ directory
│   ├── specify.md            # Bottom-up from human error analysis → drafts/
│   ├── clarify.md            # Axial coding → goldset.md + goldset.json
│   ├── analyze.md            # Quantify + saturation + adversarial + holdout
│   ├── tasks.md              # Match evals → [EVAL] markers (after_tasks hook)
│   ├── implement.md          # Tier 1 + Tier 2 graders + annotation integration
│   ├── validate.md           # TPR/TNR + goldset quality + pass rate thresholds
│   └── levelup.md            # evals/results/ scan + annotation queue + PR
│
├── scripts/bash/
│   └── setup-evals.sh        # All 8 actions + --json output
│
└── templates/
    ├── eval-criterion.md          # Individual eval criterion (binary pass/fail)
    ├── goldset-record.md          # Draft → Accept → Published lifecycle
    ├── failure-mode-registry.md   # Failure taxonomy (per-system)
    ├── promptfoo-test.yaml        # PromptFoo test template
    └── grader-template.py          # Python grader template (code-based)

Implementation Order

  1. Scaffold: extension.yml + setup-evals.sh + config-template.yml
  2. Goldset lifecycle: init.mdspecify.mdclarify.mdanalyze.md
  3. Evaluator building: implement.md + PromptFoo templates + grader template
  4. Validation: validate.md (TPR/TNR + quality + pass rates)
  5. Ops: levelup.md (traces + annotation + PR) + tasks.md (markers)
  6. Docs: templates/ + README.md + CHANGELOG.md

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestexecutionExecution enginetestingTesting framework

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions