## Summary
Build an evals extension for Spec Kit following EDD (Eval-Driven Development) principles, integrating with PromptFoo for eval execution and annotation.
## Reference

- EDD Principles: Google Doc `1O0XSF31Fp9e6SXFCujO6C2zATPVAk_HJayzhg-KYAEo`
- Error Analysis: Paweł Huryn's methodology (discussion thread of issue #78, "[Feature]: Evals Extension — EDD (Eval-Driven Development) with PromptFoo Integration")

## The 10 EDD Principles
| # | Principle | Core Implication |
|---|---|---|
| I | Spec-Driven Contracts | Specs precede evals; evals validate spec compliance |
| II | Binary Pass/Fail | No Likert scales — output meets spec or fails |
| III | Error Analysis & Pattern Discovery | Open/Axial coding → bottom-up failure taxonomy |
| IV | Evaluation Pyramid (Offline vs. Online) | CI/CD: fast checks + goldset judges; Production: 10-20% sampling |
| V | Trajectory Observability | Full multi-turn traces, not just outputs |
| VI | RAG Decomposition | Separate retrieval from generation; IR metrics + LLM judges |
| VII | Annotation Queues | Route high-risk traces to humans for binary review |
| VIII | Close the Production Loop | Spec failures → fix directives; Gen failures → add to dataset + evaluator |
| IX | Test Data is Code | Version datasets; include adversarial inputs; hold-out test set |
| X | Cross-Functional Observability | PMs, domain experts, and AI engineers all collaborate |
## Project Structure

```
{project}/
├── .specify/
│   ├── drafts/                  # Draft eval records (markdown + YAML frontmatter)
│   │   └── eval-*.md
│   └── config.yml               # evals.system configured here
├── evals/
│   ├── {system}/                # promptfoo | custom | llm-judge
│   │   ├── goldset.md           # Published evals (markdown + YAML frontmatter)
│   │   ├── goldset.json         # Auto-generated for PromptFoo consumption
│   │   ├── graders/
│   │   │   ├── check_pii_leakage.py      # Security baseline (always applied)
│   │   │   ├── check_prompt_injection.py
│   │   │   └── {failure-mode}.py         # One per generalization_failure
│   │   ├── config.js            # Generated PromptFoo config
│   │   └── config.yml           # System-specific config
│   └── results/                 # Git-ignored; PromptFoo outputs + traces
└── specs/
    └── {feature}/
        └── tasks.md             # [EVAL] markers per task
```
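As a sketch of what a Tier 1 code-based grader such as `check_pii_leakage.py` could look like: the regex patterns are illustrative only, and the `get_assert(output, context)` entry point follows PromptFoo's documented Python-assertion convention (verify against the PromptFoo docs before relying on it).

```python
# Sketch of a code-based security-baseline grader (hypothetical check_pii_leakage.py).
import re

# Illustrative patterns only; a production PII check would be far more thorough.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email addresses
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US SSN format
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),   # card-number-like digit runs
]

def check_pii_leakage(output: str) -> bool:
    """Binary pass/fail (EDD Principle II): True iff no PII pattern matches."""
    return not any(p.search(output) for p in PII_PATTERNS)

def get_assert(output, context):
    # Assumed PromptFoo entry point; a bool return maps to pass/fail.
    return check_pii_leakage(output)
```

Note the grader returns a plain boolean, never a score, keeping Principle II's binary contract.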
## Commands

| Command | Type | Purpose |
|---|---|---|
| `evals.init` | Manual | Initialize `evals/` directory |
| `evals.specify` | Manual | Bottom-up goldset definition from human error analysis → `drafts/` |
| `evals.clarify` | Manual | Axial coding + accept drafts → `goldset.md` + `goldset.json` |
| `evals.analyze` | Manual | Re-code + quantify + saturation + adversarial check + holdout split |
| `evals.tasks` | Hook (after_tasks) | Match published evals to feature tasks → `[EVAL]` markers |
| `evals.implement` | Hook (after_implement) | Generate PromptFoo config + graders from goldset |
| `evals.levelup` | Manual | Scan `evals/results/` + annotation queue → PR to team-ai-directives |
| `evals.validate` | Manual | TPR/TNR + goldset quality + PromptFoo pass rate thresholds |
## Hooks

```yaml
hooks:
  after_tasks:
    command: "adlc.evals.tasks"
  after_implement:
    command: "adlc.evals.implement"
```

## Goldset Lifecycle (ADR/CDR Pattern)

```
evals.specify → evals.clarify → evals.analyze → evals.implement
   (draft)        (accept)        (finalize)       (publish)
```
## Goldset Record Format (Markdown + YAML frontmatter)

```yaml
---
id: eval-001
status: draft | accepted | published
name: {name}
description: {description}

# Binary pass/fail only (EDD Principle II)
pass_condition: {precise spec constraint}
fail_condition: {precise spec violation}

# Failure type gate (EDD Principle VIII)
failure_type:
  specification_failure:
    action: fix_directive
  generalization_failure:
    action: build_evaluator
    evaluator_type: code-based | llm-judge

# Error analysis provenance (EDD Principle III)
error_analysis:
  traces_analyzed: N
  theoretical_saturation: true | false

# Test data hygiene (EDD Principle IX)
test_data:
  adversarial_included: true | false
  holdout_ratio: 0.2

# RAG decomposition (EDD Principle VI, optional)
rag_decomposition:
  retrieval_check: ir-metrics | llm-judge | none
  generation_check: llm-judge | none
---
```

## EDD Principles → Command Mapping
| Principle | Integration |
|---|---|
| I Spec-Driven Contracts | evals.specify creates criteria from human error analysis, not generic metrics |
| II Binary Pass/Fail | All graders output binary pass/fail — no Likert scales |
| III Error Analysis | evals.specify (open coding) → evals.clarify (axial coding) → evals.analyze (saturation) |
| IV Evaluation Pyramid | evals.implement generates Tier 1 (fast CI/CD checks) + Tier 2 (goldset LLM judges) |
| V Trajectory Observability | evals.levelup scans full traces from evals/results/ including tool calls |
| VI RAG Decomposition | rag_decomposition fields in goldset record; separate graders for retrieval vs. generation |
| VII Annotation Queues | Annotation integration via PromptFoo/system native annotation support |
| VIII Close the Loop | Gate in evals.implement: specification_failure → fix directive; generalization_failure → build evaluator |
| IX Test Data is Code | evals.analyze ensures adversarial inputs included, hold-out set preserved, datasets versioned |
| X Cross-Functional | evals.levelup PR targets team-ai-directives/AGENTS.md for PMs, domain experts, and AI engineers |
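The hold-out split behind `test_data.holdout_ratio` (Principle IX) could be implemented deterministically, so each eval id always lands in the same split as the goldset grows. The function names here are illustrative, not part of the extension spec.

```python
# Sketch of a deterministic hold-out split for goldset records (Principle IX).
# Assumes eval ids are stable strings such as "eval-001".
import hashlib

def is_holdout(eval_id: str, holdout_ratio: float = 0.2) -> bool:
    """Hash-based assignment: the same id always maps to the same split."""
    digest = hashlib.sha256(eval_id.encode("utf-8")).digest()
    return digest[0] / 256.0 < holdout_ratio

def split_goldset(eval_ids, holdout_ratio: float = 0.2):
    """Partition ids into (train, holdout); proportions are approximate."""
    train = [e for e in eval_ids if not is_holdout(e, holdout_ratio)]
    holdout = [e for e in eval_ids if is_holdout(e, holdout_ratio)]
    return train, holdout
```

Hashing instead of random sampling keeps the hold-out set versionable and reproducible across runs, which is the point of treating test data as code.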
## Lifecycle

```
evals.init
    │
    ▼
evals.specify   ← bottom-up error analysis from human
    (free-text notes per trace → drafts/eval-*.md)
    │
    ▼
evals.clarify   ← axial coding
    (cluster notes → failure modes → accept → goldset.md → goldset.json)
    │
    ▼
evals.analyze   ← quantify + saturation + adversarial + holdout
    │
    ▼
evals.implement ← after_implement hook
    ┌─ Tier 1 (CI/CD): fast deterministic graders
    │    • code-based: XML structure, SQL syntax, regex match
    │    • security baseline: pii_leakage, injection, etc.
    ├─ Tier 2 (CI/CD): goldset LLM judges
    │    • semantic checks on the version-controlled goldset
    │    • blocks bad merges
    └─ Annotation integration: routes traces to system annotation support
    │
    ▼
evals.validate  ← TPR/TNR + goldset quality + pass rate thresholds
    │
    ▼
evals.levelup   ← evals/results/ scan + annotation queue
    (full trajectories → high-risk routing → binary human review
     → PR to team-ai-directives/AGENTS.md)
    │
    ▼
evals.tasks     ← after_tasks hook
    (match all published evals → [EVAL] markers in specs/{feature}/tasks.md)
```
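The TPR/TNR check in `evals.validate` compares grader verdicts against human binary labels from the annotation queue. A minimal sketch, assuming labeled traces arrive as (grader_pass, human_pass) pairs and a hypothetical 0.9 default threshold; neither is specified by the extension.

```python
# Sketch of the TPR/TNR computation behind evals.validate (names assumed).

def tpr_tnr(verdicts):
    """verdicts: iterable of (grader_pass, human_pass) boolean pairs."""
    pairs = list(verdicts)
    tp = sum(1 for g, h in pairs if g and h)          # grader agrees with human pass
    tn = sum(1 for g, h in pairs if not g and not h)  # grader agrees with human fail
    pos = sum(1 for _, h in pairs if h)
    neg = sum(1 for _, h in pairs if not h)
    tpr = tp / pos if pos else 1.0   # vacuously perfect when a class is empty
    tnr = tn / neg if neg else 1.0
    return tpr, tnr

def passes_thresholds(verdicts, min_tpr=0.9, min_tnr=0.9):
    tpr, tnr = tpr_tnr(verdicts)
    return tpr >= min_tpr and tnr >= min_tnr
```

Tracking TNR alongside TPR matters because a grader that passes everything scores a perfect TPR while catching no failures.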
## evals.tasks — Task Marker Matching

- **Matching logic:** keyword overlap (default) + exact tag match (if `task_tags` defined in goldset)
- **Multiple evals per task:** allowed; conflicts flagged for human review with an `AMBIGUOUS:` marker
- **Scope:** all published evals from `goldset.md` (no feature filtering)
- **Marker format:** `[EVAL]: eval-001 (check_name)`

Variants:

- `/evals.tasks --dry-run` → show proposed markers (default for manual runs)
- `/evals.tasks --apply` → write markers to `tasks.md`
- Hook mode → auto-write (after_tasks hook)
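The default keyword-overlap matching could be sketched as below. The tokenizer, the overlap threshold, and the record fields other than `status`, `id`, `description`, and `task_tags` are assumptions; the spec only names "keyword overlap (default) + exact tag match".

```python
# Sketch of evals.tasks matching: keyword overlap with an exact-tag fast path.
import re

def tokens(text: str) -> set:
    """Crude keyword extraction (assumed; not specified by the extension)."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def match_evals(task_text, goldset, min_overlap=2, task_tags=frozenset()):
    """Return ids of published evals whose task_tags match exactly, or whose
    description shares at least min_overlap keywords with the task text."""
    matched = []
    for record in goldset:
        if record.get("status") != "published":   # scope: published evals only
            continue
        if task_tags & set(record.get("task_tags", [])):
            matched.append(record["id"])
        elif len(tokens(task_text) & tokens(record.get("description", ""))) >= min_overlap:
            matched.append(record["id"])
    return matched
```

A real implementation would also detect near-ties between candidate evals to emit the `AMBIGUOUS:` marker for human review.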
Example output:

```markdown
## TASK-001: Implement token validation
[EVAL]: eval-001 (auth_token_present)
[EVAL]: eval-003 (pii_not_leaked)
⚠ AMBIGUOUS: eval-002 vs eval-005 — review before apply

## TASK-002: Implement password reset
[EVAL]: eval-001 (auth_token_present)
```

## Extension Files
```
.specify/extensions/evals/
├── extension.yml          # 8 commands, 2 hooks, tags, defaults, handoffs
├── README.md              # Full docs + EDD principles + evaluation pyramid
├── CHANGELOG.md
├── config-template.yml    # system: promptfoo (default)
│
├── commands/
│   ├── init.md            # Initialize evals/{system}/ directory
│   ├── specify.md         # Bottom-up from human error analysis → drafts/
│   ├── clarify.md         # Axial coding → goldset.md + goldset.json
│   ├── analyze.md         # Quantify + saturation + adversarial + holdout
│   ├── tasks.md           # Match evals → [EVAL] markers (after_tasks hook)
│   ├── implement.md       # Tier 1 + Tier 2 graders + annotation integration
│   ├── validate.md        # TPR/TNR + goldset quality + pass rate thresholds
│   └── levelup.md         # evals/results/ scan + annotation queue + PR
│
├── scripts/bash/
│   └── setup-evals.sh     # All 8 actions + --json output
│
└── templates/
    ├── eval-criterion.md         # Individual eval criterion (binary pass/fail)
    ├── goldset-record.md         # Draft → Accept → Published lifecycle
    ├── failure-mode-registry.md  # Failure taxonomy (per-system)
    ├── promptfoo-test.yaml       # PromptFoo test template
    └── grader-template.py        # Python grader template (code-based)
```
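As a rough sketch of what `evals.implement` might emit from the `promptfoo-test.yaml` template: the field names follow PromptFoo's documented `promptfooconfig.yaml` schema, but the paths, provider, and test values are placeholders, so verify against the PromptFoo reference before use.

```yaml
# Hypothetical generated PromptFoo config (placeholder paths and provider).
prompts:
  - file://prompts/system.txt
providers:
  - openai:gpt-4o-mini
tests:
  - description: eval-001 auth_token_present
    vars:
      input: "Log me in as alice"
    assert:
      # Tier 1 code-based grader, applied as a Python assertion
      - type: python
        value: file://graders/check_pii_leakage.py
```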
## Implementation Order

1. **Scaffold:** `extension.yml` + `setup-evals.sh` + `config-template.yml`
2. **Goldset lifecycle:** `init.md` → `specify.md` → `clarify.md` → `analyze.md`
3. **Evaluator building:** `implement.md` + PromptFoo templates + grader template
4. **Validation:** `validate.md` (TPR/TNR + quality + pass rates)
5. **Ops:** `levelup.md` (traces + annotation + PR) + `tasks.md` (markers)
6. **Docs:** `templates/` + `README.md` + `CHANGELOG.md`