| 1 | +--- |
| 2 | +commissioned-by: spacedock@0.8.2 |
| 3 | +entity-type: eval_scenario |
| 4 | +entity-label: scenario |
| 5 | +entity-label-plural: scenarios |
| 6 | +id-style: sequential |
| 7 | +stages: |
| 8 | + defaults: |
| 9 | + worktree: false |
| 10 | + concurrency: 2 |
| 11 | + states: |
| 12 | + - name: draft |
| 13 | + initial: true |
| 14 | + - name: ground-truth |
| 15 | + - name: eval-run |
| 16 | + - name: validated |
| 17 | + gate: true |
| 18 | + feedback-to: draft |
| 19 | + - name: integrated |
| 20 | + terminal: true |
| 21 | +--- |
| 22 | + |
| 23 | +# Recce eval scenario pipeline |
| 24 | + |
| 25 | +Design, verify, and validate eval scenarios from jaffle-shop-simulator issues to build a comprehensive benchmark for measuring Recce plugin effectiveness at data PR review. |
| 26 | + |
| 27 | +## File Naming |
| 28 | + |
| 29 | +Each scenario is a markdown file named `{slug}.md` — lowercase, hyphens, no spaces. Example: `exclude-zero-orders-v1.md`. |
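The naming rule above can be checked mechanically. The following is a minimal sketch, not part of the pipeline: `is_valid_scenario_filename` is a hypothetical helper, and it assumes digits are allowed in slugs (the example name ends in `v1`) and that hyphens only separate non-empty segments.

```shell
# Hypothetical helper (not shipped with the pipeline): check a scenario file
# name against the "lowercase, hyphens, no spaces" rule.
is_valid_scenario_filename() {
  # Assumption: digits are allowed, and each hyphen separates non-empty parts.
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*\.md$'
}

is_valid_scenario_filename "exclude-zero-orders-v1.md" && echo "ok"
is_valid_scenario_filename "Exclude Zero Orders.md" || echo "rejected"
```

A check like this could run in a pre-commit hook, though the README does not say whether the pipeline enforces the rule anywhere.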
| 30 | + |
| 31 | +## Schema |
| 32 | + |
| 33 | +Every scenario file has YAML frontmatter with these fields: |
| 34 | + |
| 35 | +```yaml |
| 36 | +--- |
| 37 | +id: |
| 38 | +title: Human-readable name |
| 39 | +status: draft |
| 40 | +assignee: |
| 41 | +source: |
| 42 | +started: |
| 43 | +completed: |
| 44 | +verdict: |
| 45 | +score: |
| 46 | +worktree: |
| 47 | +issue: |
| 48 | +pr: |
| 49 | +jaffle_issue: GitHub issue number in jaffle-shop-simulator |
| 50 | +patch_file: Path to the reverse patch file |
| 51 | +scenario_yaml: Path to the scenario YAML definition |
| 52 | +prompt_file: Path to the eval prompt file |
| 53 | +--- |
| 54 | +``` |
| 55 | + |
| 56 | +### Field Reference |
| 57 | + |
| 58 | +| Field | Type | Description | |
| 59 | +|-------|------|-------------| |
| 60 | +| `id` | string | Unique identifier, format determined by id-style in README frontmatter | |
| 61 | +| `title` | string | Human-readable scenario name | |
| 62 | +| `status` | enum | One of: draft, ground-truth, eval-run, validated, integrated | |
| 63 | +| `assignee` | string | Who is working on this scenario (GitHub username). Claim a scenario by setting this field, then committing and pushing. | |
| 64 | +| `source` | string | Where this scenario came from | |
| 65 | +| `started` | ISO 8601 | When active work began | |
| 66 | +| `completed` | ISO 8601 | When the scenario reached terminal status | |
| 67 | +| `verdict` | enum | PASSED or REJECTED — set at final stage | |
| 68 | +| `score` | number | Priority score, 0.0–1.0 (optional) | |
| 69 | +| `worktree` | string | Worktree path while a dispatched agent is active, empty otherwise | |
| 70 | +| `issue` | string | GitHub issue reference (e.g., `#42` or `owner/repo#42`). Optional cross-reference, set manually. | |
| 71 | +| `pr` | string | GitHub PR reference (e.g., `#57` or `owner/repo#57`). Set when a PR is created for this entity's worktree branch. | |
| 72 | +| `jaffle_issue` | number | Source issue number in DataRecce/jaffle-shop-simulator | |
| 73 | +| `patch_file` | string | Relative path to the reverse patch file | |
| 74 | +| `scenario_yaml` | string | Relative path to the scenario YAML definition | |
| 75 | +| `prompt_file` | string | Relative path to the eval prompt template | |
| 76 | + |
| 77 | +## Stages |
| 78 | + |
| 79 | +### `draft` |
| 80 | + |
| 81 | +A new scenario has been conceived. The worker designs a subtle, plausible bug variant based on a jaffle-shop-simulator issue, creates the reverse patch, writes the scenario YAML, and prepares the eval prompt. |
| 82 | + |
| 83 | +- **Inputs:** jaffle-shop-simulator issue description, existing model SQL, existing scenario YAMLs as reference (r1/r2) |
| 84 | +- **Outputs:** Patch file that applies cleanly and introduces a plausible bug; scenario YAML with all required fields (ground_truth values may be estimates); prompt file adapted to the scenario's story; dbt tests still pass after applying the patch |
| 85 | +- **Good:** Bug is subtle enough that code review would approve; PR description is misleading but plausible; detection requires data comparison, not just code reading |
| 86 | +- **Bad:** Bug is obvious from code reading alone; dbt tests catch the bug; patch doesn't apply cleanly; scenario is a duplicate of an existing one |
| 87 | + |
| 88 | +### `ground-truth` |
| 89 | + |
| 90 | +The worker verifies the scenario's ground truth numbers by building dual-schema state (prod=clean, dev=buggy) and running SQL queries to confirm exact affected_row_count and model classification. |
| 91 | + |
| 92 | +- **Inputs:** Patch file from draft stage, scenario YAML with estimated ground_truth |
| 93 | +- **Outputs:** Exact affected_row_count from SQL query (not estimated); every model in impacted_models verified to have changed rows; every model in not_impacted_models verified to have 0 changed rows; dashboard_impact verified against dashboard column list |
| 94 | +- **Good:** Numbers come from actual SQL queries against dual-schema data; model classification is exhaustive (every model in DAG checked) |
| 95 | +- **Bad:** Using estimated or rounded numbers; assuming model impact from code reading without SQL verification; forgetting to check downstream models |
| 96 | + |
| 97 | +### `eval-run` |
| 98 | + |
| 99 | +The worker runs the eval batch (N=3, Mode A tool-only) using run-case.sh, scores each run with score-deterministic.sh, and records pass rates, failure patterns, and cost. |
| 100 | + |
| 101 | +- **Inputs:** Verified scenario YAML with exact ground_truth, prompt file, MCP config, recce package installed in jaffle-shop-simulator venv |
| 102 | +- **Outputs:** N=3 batch completed with all runs producing valid JSON output; each run scored with pass/fail per criterion; pass rate and failure pattern summary recorded in entity body; cost per run recorded |
| 103 | +- **Good:** All 3 runs produce parseable JSON; scoring matches ground truth criteria; failures are analyzed not just counted |
| 104 | +- **Bad:** Runs fail due to infrastructure issues (DuckDB lock, MCP timeout) rather than agent judgment; JSON extraction failures treated as agent errors |
| 105 | + |
| 106 | +### `validated` |
| 107 | + |
| 108 | +Captain reviews the eval results to confirm the scenario is good enough for the benchmark suite. This is a human approval gate. |
| 109 | + |
| 110 | +- **Inputs:** Eval-run results with pass rates, failure patterns, and cost |
| 111 | +- **Outputs:** Captain's approval or rejection with feedback |
| 112 | +- **Good:** Pass rate ≥80% on Mode A (scenario is solvable but challenging); failure patterns are about agent judgment not infrastructure; scenario tests something different from existing scenarios |
| 113 | +- **Bad:** Pass rate below 80% (ground truth may be wrong); all failures are the same JSON extraction issue; scenario is redundant with existing ones |
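To make the 80% bar concrete, here is a small worked check. It is a sketch under a stated assumption: the README does not define how the pass rate is computed, so this treats it as fully passing runs divided by total runs. `meets_pass_rate` is a hypothetical helper, not part of the eval tooling.

```shell
# Hypothetical helper: does passed/total meet a percentage threshold?
# Assumption: pass rate = fully passing runs / total runs.
# Integer arithmetic avoids floating point in POSIX shell.
meets_pass_rate() {
  passed=$1
  total=$2
  threshold_pct=${3:-80}
  [ "$total" -gt 0 ] && [ $((passed * 100)) -ge $((threshold_pct * total)) ]
}

meets_pass_rate 3 3 && echo "3/3 meets 80%"
meets_pass_rate 2 3 || echo "2/3 (about 67%) does not meet 80%"
```

Under this reading, an N=3 batch clears the bar only when all three runs pass. If the rate is instead computed per criterion across runs, the denominator is larger and partial passes can still reach 80%.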
| 114 | + |
| 115 | +### `integrated` |
| 116 | + |
| 117 | +The scenario is part of the official benchmark suite. Patch, YAML, and prompt files are committed to recce-claude-plugin and included in future batch runs. |
| 118 | + |
| 119 | +- **Inputs:** Approved scenario from validated stage |
| 120 | +- **Outputs:** All scenario files committed to the repo |
| 121 | +- **Good:** Scenario adds meaningful coverage to the benchmark |
| 122 | +- **Bad:** N/A — terminal stage |
| 123 | + |
| 124 | +## Workflow State |
| 125 | + |
| 126 | +View the workflow overview: |
| 127 | + |
| 128 | +```bash |
| 129 | +docs/scenario-pipeline/status |
| 130 | +``` |
| 131 | + |
| 132 | +Output columns: ID, SLUG, STATUS, TITLE, SCORE, SOURCE. |
| 133 | + |
| 134 | +Include archived scenarios with `--archived`: |
| 135 | + |
| 136 | +```bash |
| 137 | +docs/scenario-pipeline/status --archived |
| 138 | +``` |
| 139 | + |
| 140 | +Find dispatchable scenarios ready for their next stage: |
| 141 | + |
| 142 | +```bash |
| 143 | +docs/scenario-pipeline/status --next |
| 144 | +``` |
| 145 | + |
| 146 | +Find scenarios in a specific stage: |
| 147 | + |
| 148 | +```bash |
| 149 | +grep -l "status: ground-truth" docs/scenario-pipeline/*.md |
| 150 | +``` |
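The same idea extends to a per-stage count. The sketch below is illustrative only: `count_by_stage` is a hypothetical helper, and it assumes each scenario file carries exactly one frontmatter line of the form `status: <stage>` with no trailing whitespace. It skips `README.md`, on the assumption that the README lives in the same directory; its schema and template blocks contain `status: draft` and would otherwise be counted.

```shell
# Hypothetical helper: count scenario files per stage by matching the
# frontmatter status line exactly.
count_by_stage() {
  dir=$1
  for stage in draft ground-truth eval-run validated integrated; do
    n=0
    for f in "$dir"/*.md; do
      [ -f "$f" ] || continue
      # Skip the README: its example blocks contain "status: draft".
      case "$f" in */README.md) continue ;; esac
      # -x requires the whole line to match; -q suppresses output.
      if grep -qx "status: $stage" "$f"; then
        n=$((n + 1))
      fi
    done
    printf '%s\t%s\n' "$stage" "$n"
  done
}

count_by_stage docs/scenario-pipeline
```

The `status` script remains the primary view; a loop like this is only a fallback for quick checks.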
| 151 | + |
| 152 | +## Scenario Template |
| 153 | + |
| 154 | +```yaml |
| 155 | +--- |
| 156 | +id: |
| 157 | +title: Scenario name here |
| 158 | +status: draft |
| 159 | +assignee: |
| 160 | +source: |
| 161 | +started: |
| 162 | +completed: |
| 163 | +verdict: |
| 164 | +score: |
| 165 | +worktree: |
| 166 | +issue: |
| 167 | +pr: |
| 168 | +jaffle_issue: |
| 169 | +patch_file: |
| 170 | +scenario_yaml: |
| 171 | +prompt_file: |
| 172 | +--- |
| 173 | + |
| 174 | +Description of this scenario — what bug is introduced, why it's plausible, and what the agent needs to find. |
| 175 | +``` |
| 176 | + |
| 177 | +## Commit Discipline |
| 178 | + |
| 179 | +- Commit status changes at dispatch and merge boundaries |
| 180 | +- Commit scenario body updates when substantive |