Commit 31721c0

iamcxa and claude authored and committed
feat(eval): add Spacedock scenario pipeline + R8 exclude-zero-orders scenario
- Commission Spacedock workflow for eval scenario management (draft → ground-truth → eval-run → validated → integrated)
- Add 3 seed scenarios from jaffle-shop-simulator issues #8, #2, #6
- Complete R8 scenario: stg_orders filters on subtotal instead of order_total, a spec deviation bug (difficulty: hard)
- Add assignee field to schema for multi-person coordination

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2e34883 commit 31721c0

8 files changed

Lines changed: 700 additions & 0 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -41,3 +41,4 @@ logs/
 tmp/
 temp/
 *.tmp
+.worktrees/

docs/scenario-pipeline/README.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
---
commissioned-by: spacedock@0.8.2
entity-type: eval_scenario
entity-label: scenario
entity-label-plural: scenarios
id-style: sequential
stages:
  defaults:
    worktree: false
    concurrency: 2
  states:
    - name: draft
      initial: true
    - name: ground-truth
    - name: eval-run
    - name: validated
      gate: true
      feedback-to: draft
    - name: integrated
      terminal: true
---

# Recce eval scenario pipeline

Design, verify, and validate eval scenarios from jaffle-shop-simulator issues to build a comprehensive benchmark for measuring Recce plugin effectiveness at data PR review.

## File Naming

Each scenario is a markdown file named `{slug}.md`: lowercase, hyphens, no spaces. Example: `exclude-zero-orders-v1.md`.

## Schema

Every scenario file has YAML frontmatter with these fields:

```yaml
---
id:
title: Human-readable name
status: draft
assignee:
source:
started:
completed:
verdict:
score:
worktree:
issue:
pr:
jaffle_issue: GitHub issue number in jaffle-shop-simulator
patch_file: Path to the reverse patch file
scenario_yaml: Path to the scenario YAML definition
prompt_file: Path to the eval prompt file
---
```

### Field Reference

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique identifier; format determined by `id-style` in the README frontmatter |
| `title` | string | Human-readable scenario name |
| `status` | enum | One of: draft, ground-truth, eval-run, validated, integrated |
| `assignee` | string | Who is working on this scenario (GitHub username). Claim a scenario by setting this field, then committing and pushing. |
| `source` | string | Where this scenario came from |
| `started` | ISO 8601 | When active work began |
| `completed` | ISO 8601 | When the scenario reached terminal status |
| `verdict` | enum | PASSED or REJECTED, set at the final stage |
| `score` | number | Priority score, 0.0–1.0 (optional) |
| `worktree` | string | Worktree path while a dispatched agent is active, empty otherwise |
| `issue` | string | GitHub issue reference (e.g., `#42` or `owner/repo#42`). Optional cross-reference, set manually. |
| `pr` | string | GitHub PR reference (e.g., `#57` or `owner/repo#57`). Set when a PR is created for this entity's worktree branch. |
| `jaffle_issue` | number | Source issue number in DataRecce/jaffle-shop-simulator |
| `patch_file` | string | Relative path to the reverse patch file |
| `scenario_yaml` | string | Relative path to the scenario YAML definition |
| `prompt_file` | string | Relative path to the eval prompt template |
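
As a rough illustration of the constraints in the table above, the following Python sketch parses flat `key: value` frontmatter and checks `status`, `verdict`, and `score`. The helper names (`parse_frontmatter`, `validate`) are hypothetical and are not part of Spacedock or this repository; a real checker would probably use a YAML library rather than this minimal parser.

```python
# Hypothetical helpers for illustration; not part of Spacedock or this repo.
ALLOWED_STATUS = {"draft", "ground-truth", "eval-run", "validated", "integrated"}
ALLOWED_VERDICT = {"", "PASSED", "REJECTED"}  # empty until the final stage


def parse_frontmatter(text):
    """Parse flat `key: value` pairs between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with frontmatter")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    raise ValueError("frontmatter is not closed")


def validate(fields):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    status = fields.get("status", "")
    if status not in ALLOWED_STATUS:
        problems.append(f"unknown status: {status!r}")
    verdict = fields.get("verdict", "")
    if verdict not in ALLOWED_VERDICT:
        problems.append(f"unknown verdict: {verdict!r}")
    score = fields.get("score", "")
    if score:
        try:
            in_range = 0.0 <= float(score) <= 1.0
        except ValueError:
            in_range = False
        if not in_range:
            problems.append(f"score outside 0.0-1.0: {score!r}")
    return problems


sample = '---\nid: "001"\nstatus: integrated\nverdict: PASSED\nscore: 0.8\n---\nBody text\n'
fields = parse_frontmatter(sample)
print(fields["id"], validate(fields))
```

The parser only handles the flat scenario frontmatter shown in the schema; the nested `stages` block in this README's own frontmatter is out of its scope.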

## Stages

### `draft`

A new scenario has been conceived. The worker designs a subtle, plausible bug variant based on a jaffle-shop-simulator issue, creates the reverse patch, writes the scenario YAML, and prepares the eval prompt.

- **Inputs:** jaffle-shop-simulator issue description, existing model SQL, existing scenario YAMLs as reference (r1/r2)
- **Outputs:** Patch file that applies cleanly and introduces a plausible bug; scenario YAML with all required fields (ground_truth values may be estimates); prompt file adapted to the scenario's story; dbt tests still pass after applying the patch
- **Good:** Bug is subtle enough that code review would approve; PR description is misleading but plausible; detection requires data comparison, not just code reading
- **Bad:** Bug is obvious from code reading alone; dbt tests catch the bug; patch doesn't apply cleanly; scenario is a duplicate of an existing one

### `ground-truth`

The worker verifies the scenario's ground truth numbers by building dual-schema state (prod=clean, dev=buggy) and running SQL queries to confirm exact affected_row_count and model classification.

- **Inputs:** Patch file from draft stage, scenario YAML with estimated ground_truth
- **Outputs:** Exact affected_row_count from SQL query (not estimated); every model in impacted_models verified to have changed rows; every model in not_impacted_models verified to have 0 changed rows; dashboard_impact verified against dashboard column list
- **Good:** Numbers come from actual SQL queries against dual-schema data; model classification is exhaustive (every model in DAG checked)
- **Bad:** Using estimated or rounded numbers; assuming model impact from code reading without SQL verification; forgetting to check downstream models
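
The dual-schema check described above can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module with two attached in-memory databases as a stand-in for the project's real prod and dev schemas, and the table contents and the `count_changed_rows` helper are invented for the example.

```python
import sqlite3


def count_changed_rows(conn, model):
    """Count rows that appear in only one of prod.<model> and dev.<model>.

    EXCEPT compares whole rows and removes duplicates, so this sketch assumes
    each row is unique (for example, because the model has a primary key).
    """
    query = f"""
        SELECT COUNT(*) FROM (
            SELECT * FROM (SELECT * FROM prod.{model} EXCEPT SELECT * FROM dev.{model})
            UNION ALL
            SELECT * FROM (SELECT * FROM dev.{model} EXCEPT SELECT * FROM prod.{model})
        )
    """
    return conn.execute(query).fetchone()[0]


# Stand-in for dual-schema state: prod = clean build, dev = patched build.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    ATTACH DATABASE ':memory:' AS prod;
    ATTACH DATABASE ':memory:' AS dev;
    CREATE TABLE prod.orders (order_id INTEGER PRIMARY KEY, order_total REAL);
    CREATE TABLE dev.orders (order_id INTEGER PRIMARY KEY, order_total REAL);
    CREATE TABLE prod.order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER);
    CREATE TABLE dev.order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO prod.orders VALUES (1, 12.5), (2, 0.0), (3, 8.0);
    INSERT INTO dev.orders VALUES (1, 12.5), (3, 8.0);
    INSERT INTO prod.order_items VALUES (10, 1), (11, 3);
    INSERT INTO dev.order_items VALUES (10, 1), (11, 3);
""")

# A model belongs in impacted_models only if its count is non-zero.
for model in ("orders", "order_items"):
    print(model, count_changed_rows(conn, model))
```

Running the same count for every model in the DAG, rather than only the ones expected to change, is what makes the classification exhaustive.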

### `eval-run`

The worker runs the eval batch (N=3, Mode A tool-only) using `run-case.sh` and scores each run with `score-deterministic.sh`. The worker records pass rates, failure patterns, and cost.

- **Inputs:** Verified scenario YAML with exact ground_truth, prompt file, MCP config, recce package installed in jaffle-shop-simulator venv
- **Outputs:** N=3 batch completed with all runs producing valid JSON output; each run scored with pass/fail per criterion; pass rate and failure pattern summary recorded in entity body; cost per run recorded
- **Good:** All 3 runs produce parseable JSON; scoring matches ground truth criteria; failures are analyzed, not just counted
- **Bad:** Runs fail due to infrastructure issues (DuckDB lock, MCP timeout) rather than agent judgment; JSON extraction failures treated as agent errors

### `validated`

The captain reviews the eval results to confirm the scenario is good enough for the benchmark suite. This is a human approval gate.

- **Inputs:** Eval-run results with pass rates, failure patterns, and cost
- **Outputs:** Captain's approval or rejection with feedback
- **Good:** Pass rate ≥80% on Mode A (scenario is solvable but challenging); failure patterns are about agent judgment, not infrastructure; scenario tests something different from existing scenarios
- **Bad:** Pass rate too low (ground truth may be wrong); all failures are the same JSON extraction issue; scenario is redundant with existing ones
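
To make the pass-rate criterion concrete, here is a small Python sketch that computes the rate for an N=3 batch and compares it with the 80% bar. The function names are hypothetical and the per-run results are invented; the real scoring is done by the scripts named in the `eval-run` stage.

```python
def pass_rate(results):
    """Fraction of runs that passed; `results` is a list of booleans."""
    if not results:
        raise ValueError("no runs recorded")
    return sum(results) / len(results)


def meets_threshold(results, threshold=0.8):
    """True when the batch pass rate is at least `threshold`."""
    return pass_rate(results) >= threshold


# Enumerate every possible outcome of a three-run batch.
for passed in range(4):
    runs = [True] * passed + [False] * (3 - passed)
    print(passed, round(pass_rate(runs), 2), meets_threshold(runs))
```

With N=3 the only possible rates are 0, 1/3, 2/3, and 1, so only a clean 3/3 batch clears an 80% bar; a larger batch would be needed to observe rates between 80% and 100%.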

### `integrated`

The scenario is part of the official benchmark suite. Patch, YAML, and prompt files are committed to recce-claude-plugin and included in future batch runs.

- **Inputs:** Approved scenario from validated stage
- **Outputs:** All scenario files committed to the repo
- **Good:** Scenario adds meaningful coverage to the benchmark
- **Bad:** N/A (terminal stage)
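
The stage order can be summarized as a small transition function. The sketch below is one reading of the `stages` block in this README's frontmatter, not Spacedock's actual implementation: it assumes that a rejection at the gated `validated` stage sends the scenario back to the `feedback-to` stage, and `next_stage` is a hypothetical name.

```python
# Stage order as listed in the README frontmatter.
STAGES = ["draft", "ground-truth", "eval-run", "validated", "integrated"]
GATED = {"validated"}                 # gate: true
FEEDBACK_TO = {"validated": "draft"}  # feedback-to: draft
TERMINAL = {"integrated"}             # terminal: true


def next_stage(current, approved=True):
    """Return the stage that follows `current`, or None at a terminal stage.

    Assumption: `approved=False` only matters at a gated stage, where it
    routes the scenario to that stage's feedback target.
    """
    if current not in STAGES:
        raise ValueError(f"unknown stage: {current!r}")
    if current in TERMINAL:
        return None
    if current in GATED and not approved:
        return FEEDBACK_TO[current]
    return STAGES[STAGES.index(current) + 1]


print(next_stage("draft"))
print(next_stage("validated", approved=False))
print(next_stage("integrated"))
```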

## Workflow State

View the workflow overview:

```bash
docs/scenario-pipeline/status
```

Output columns: ID, SLUG, STATUS, TITLE, SCORE, SOURCE.

Include archived scenarios with `--archived`:

```bash
docs/scenario-pipeline/status --archived
```

Find dispatchable scenarios ready for their next stage:

```bash
docs/scenario-pipeline/status --next
```

Find scenarios in a specific stage:

```bash
grep -l "status: ground-truth" docs/scenario-pipeline/*.md
```

## Scenario Template

```yaml
---
id:
title: Scenario name here
status: draft
assignee:
source:
started:
completed:
verdict:
score:
worktree:
issue:
pr:
jaffle_issue:
patch_file:
scenario_yaml:
prompt_file:
---

Description of this scenario: what bug is introduced, why it's plausible, and what the agent needs to find.
```

## Commit Discipline

- Commit status changes at dispatch and merge boundaries
- Commit scenario body updates when substantive
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
---
id: "001"
title: "Exclude $0 Orders: filter on subtotal"
status: integrated
assignee: kent
source: commission seed
started: 2026-03-30T16:00:00+08:00
completed: 2026-03-30T18:00:00+08:00
verdict: PASSED
score: 0.8
worktree:
issue:
pr:
jaffle_issue: 8
patch_file: plugins/recce-dev/skills/recce-eval/scenarios/v2/patches/r8-exclude-zero-orders-wrong-column.patch
scenario_yaml: plugins/recce-dev/skills/recce-eval/scenarios/v2/r8-exclude-zero-orders-wrong-column.yaml
prompt_file:
---

## Bug Variant

**Source issue**: jaffle-shop-simulator#8: VP of Operations requests excluding complimentary ($0) orders from all mart models.

**Plausible bug**: Filter on `WHERE subtotal > 0` instead of `WHERE order_total > 0` in stg_orders. The PR uses the wrong column: subtotal (pre-tax item total) instead of order_total (amount charged). With current data both filters produce identical results (all 4,155 zero-total orders also have zero subtotal), making it a semantic/spec deviation bug.

**PR description**: "Filter out $0 comp orders at staging layer: add WHERE subtotal > 0 to stg_orders for clean downstream metrics"

**Why it's hard**: Data comparison shows correct results. The bug is a specification deviation, not a data correctness issue. The agent must compare the PR code against the issue spec to catch the wrong column.

**Ground truth**: 4,155 rows filtered. stg_orders and orders lose rows. customers is affected (236 customers have a lower count_lifetime_orders). order_items is unchanged (comp orders have no line items). The dashboard is impacted (AOV changes).

**Difficulty**: hard. Detection requires spec comparison, not just data comparison.
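
A short Python sketch shows why the two filters agree today yet are not equivalent. All rows here are invented for illustration, including the fully discounted order, which is an assumption about what future data could look like rather than something taken from jaffle-shop-simulator.

```python
# Hypothetical rows for illustration; the real data lives in jaffle-shop-simulator.
current_orders = [
    {"order_id": 1, "subtotal": 10.0, "order_total": 10.6},
    {"order_id": 2, "subtotal": 0.0, "order_total": 0.0},  # complimentary order
    {"order_id": 3, "subtotal": 4.0, "order_total": 4.24},
]


def kept_order_ids(rows, column):
    """Apply `WHERE <column> > 0` and return the surviving order ids."""
    return [row["order_id"] for row in rows if row[column] > 0]


# On data shaped like today's, the spec filter and the PR filter agree.
print(kept_order_ids(current_orders, "order_total"))
print(kept_order_ids(current_orders, "subtotal"))

# Assumption: a future order with priced items but nothing charged
# (for example, a full discount). This row is invented for the sketch.
future_orders = current_orders + [
    {"order_id": 4, "subtotal": 5.0, "order_total": 0.0},
]
print(kept_order_ids(future_orders, "order_total"))
print(kept_order_ids(future_orders, "subtotal"))
```

On the extended data the `subtotal` filter keeps an order that the spec would exclude, which is the divergence a data diff on current rows cannot reveal.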
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
---
id: "003"
title: "Financial Columns: wrong gross_profit formula"
status: draft
assignee:
source: commission seed
started:
completed:
verdict:
score: 0.6
worktree:
issue:
pr:
jaffle_issue: 6
patch_file:
scenario_yaml:
prompt_file:
---

## Bug Variant

**Source issue**: jaffle-shop-simulator#6: Accounting Manager requests audit-compliant financial_orders model with proper terminology.

**Plausible bug**: Calculate `gross_profit = revenue_excl_tax - tax_collected` instead of `gross_profit = revenue_excl_tax - cost_of_goods_sold`. The formula subtracts tax instead of COGS, a classic accounting error that produces a number that looks like a margin but is completely wrong.

**PR description**: "Add financial_orders mart with audit-compliant columns: gross profit computed as revenue minus tax"

**Why it's subtle**: The PR creates a new model (not modifying existing ones), so there's no baseline to compare against. The formula `revenue - tax` produces positive numbers that look like reasonable margins. You need to know that gross_profit should use COGS, not tax.

**Detection requires**: Domain knowledge that gross_profit = revenue - COGS, then comparing against the correct calculation using supply_cost data. This scenario tests whether the agent applies accounting domain knowledge, not just data comparison.
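
The gap between the two formulas is easy to see with a worked example. The Python sketch below uses invented amounts in integer cents (to avoid floating-point noise); the function names are hypothetical and the numbers are not taken from the simulator's data.

```python
def gross_profit_buggy(revenue_excl_tax, tax_collected):
    """Formula from the PR: subtracts tax, which is not a cost of the goods."""
    return revenue_excl_tax - tax_collected


def gross_profit_correct(revenue_excl_tax, cost_of_goods_sold):
    """Standard definition: revenue minus cost of goods sold."""
    return revenue_excl_tax - cost_of_goods_sold


# Hypothetical order, all amounts in cents.
revenue_excl_tax = 2000
tax_collected = 120
cost_of_goods_sold = 800

buggy = gross_profit_buggy(revenue_excl_tax, tax_collected)
correct = gross_profit_correct(revenue_excl_tax, cost_of_goods_sold)
print(buggy, correct, buggy - correct)
```

Both results are positive and smaller than revenue, so neither looks wrong in isolation; the error only shows up when the value is recomputed from cost data.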
