From 04efec3f8262d995afe15d5e6389b693fcf58c70 Mon Sep 17 00:00:00 2001 From: tg-agent <268907238+tg-agent@users.noreply.github.com> Date: Fri, 27 Mar 2026 13:24:54 -0600 Subject: [PATCH 1/6] chore(work): activate merge-skill-evals task, add live test bug log MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Move evals work item backlog → active - Add 4 new eval scenarios from live PR #14 test (bugs 2–5) - Clear blocked-by now that merge skill is complete - Reference docs/ops/igit-merge-test-and-debug.md from work item Co-Authored-By: Trevor Gullstad Co-Authored-By: Claude --- docs/ops/igit-merge-test-and-debug.md | 123 ++++++++++++++++++ .../2026-03-27-merge-skill-evals.md | 16 ++- 2 files changed, 136 insertions(+), 3 deletions(-) create mode 100644 docs/ops/igit-merge-test-and-debug.md rename docs/work/{backlog => active}/2026-03-27-merge-skill-evals.md (71%) diff --git a/docs/ops/igit-merge-test-and-debug.md b/docs/ops/igit-merge-test-and-debug.md new file mode 100644 index 0000000..c25e778 --- /dev/null +++ b/docs/ops/igit-merge-test-and-debug.md @@ -0,0 +1,123 @@ +# /merge skill — live test & debug log + +Running log of bugs and observations from manual testing of the `/merge` skill on real PRs. + +--- + +## Session: 2026-03-27 — PR #14 (feature/merged-skill) + +### Bug 1 — `gh` silently fails in `sh` subprocess (sandbox) + +**Symptom:** `merge_context.sh` returns `pr.state: "none"` even when an open PR exists. + +**Root cause:** Same Mach IPC sandbox restriction that affects `uv`. When `sh script.sh` spawns a child process, `gh` cannot reach the macOS keychain and exits non-zero. The `|| printf '{}'` fallback silently swallows the failure, leaving all PR fields at defaults. + +**Repro:** +```sh +sh -c 'gh pr view --json number,url,state,title,baseRefName 2>/dev/null || printf "{}"' +# → {} +``` +Direct call works fine: +```sh +gh pr view --json number,url,state,title,baseRefName +# → {"number":14,...} +``` + +**Impact:** PR state, mergeable, merge_state_status, title, URL all lost. + +**Fix direction:** Same workaround pattern as `uv --no-sync` — document that `merge_context.sh` must be run unsandboxed, or restructure so `gh` calls are made directly by the model (not via the shell script). + +--- + +### Bug 2 — MCP elif branch eats filesystem task tracking detection + +**Symptom:** `merge_context.sh` returns `task_tracking.tool: "none"` even though `docs/work/active/` exists. + +**Root cause:** The detection chain is a flat `if/elif/elif/...`. The MCP branch fires when `~/.claude/settings.json` exists (which it does for any Claude Code user). If no matching MCP server is found (linear/jira/asana), `_tt_tool` stays `"none"` — but the `elif` for `docs/work/active` is never reached. + +**Relevant code in `merge_context.sh`:** +```sh +elif [ -f ".mcp.json" ] || [ -f "${HOME}/.claude/settings.json" ]; then + for _cfg in ...; do + # checks linear, jira, asana — if none match, loop exits without setting _tt_tool + done +# elif [ -d "docs/work/active" ] ← never reached +``` + +**Fix direction:** After the MCP loop, fall through to the remaining checks (cli, fs, cfg) if `_tt_tool` is still `"none"`. Either nest the lower-priority checks inside the MCP branch or restructure as a sequential check without elif. + +--- + +### Bug 3 — PR URL missing from "I'll merge manually" option description + +**Symptom:** The `AskUserQuestion` "I'll merge manually" option says `"Open the PR URL — /merge will wait and detect when it's merged"` but doesn't actually include the URL. User has to look it up separately. + +**Fix direction:** Interpolate the PR URL into the description: `"Open — /merge will wait and detect when it's merged"`. + +**Relevant section in `SKILL.md`:** +``` +- label: "I'll merge manually" + description: "Open the PR URL — /merge will wait and detect when it's merged" +``` + +Should be: +``` +- label: "I'll merge manually" + description: "Open — /merge will wait and detect when it's merged" +``` + +--- + +### Bug 4 — Main worktree appears in `worktrees[]`, could trigger spurious `git worktree remove` + +**Symptom:** `merge_context.sh` includes the main worktree in `worktrees[]`. If the skill iterates `worktrees` looking for a match on the current branch, it would find the main worktree and attempt `git worktree remove` on it — which would fail or cause confusion. + +**Observed:** During PR #14 test, `worktrees` contained: +```json +[{"branch": "feature/merged-skill", "path": "/Users/tgulls/Code/llm/claude/igit"}] +``` +That path is the main working tree, not an additional worktree. + +**Fix direction:** Filter the main worktree out in `merge_context.sh` (compare each path against the primary worktree path from `git worktree list --porcelain | head -2`), or have the skill skip `git worktree remove` when the matched path is the repo root. + +--- + +### Bug 5 — Stage 1 doesn't check implicit merge checklist (docs, tests, version bump) + +**Symptom:** Skill proceeded through Stage 1 without flagging that the PR lacked a version bump. The version bump had to be applied after merge as a separate direct-to-main commit. + +**Root cause:** Stage 1 only iterates `task_tracking.work_items` and checks their explicit AC. There is no implicit checklist covering common pre-merge hygiene — docs updated, tests passing, version bumped — that applies regardless of task tracking state. + +**Impact:** New skill shipped in PR #14 with no version bump; bump was applied post-merge directly to main (bypassing branch protection — see bare `git push` observation below). + +**Fix direction:** Add an implicit pre-merge checklist to Stage 1 that the model runs before proceeding, independent of work item AC. Candidates: +- Tests passing (`uv run --no-sync pytest` or equivalent) +- Version bumped if new skills/hooks/commands were added +- Docs/CLAUDE.md updated if conventions changed +- No untracked files that should have been committed + +This checklist could be surfaced as a confirmation panel or auto-checked where possible (e.g. run tests, check if plugin.json version changed since base branch). + +--- + +### Observation — bare `git push` bypasses block-push-to-main hook + +**Symptom:** Running `git push` (implicit tracking branch push) on main succeeded without being blocked. `git push origin main` is correctly blocked. + +**Root cause:** `~/.claude/hooks/block-push-to-main.sh` matches on an explicit `main`/`master` ref in the command string. A bare `git push` with no ref specified doesn't match, even though it implicitly pushes to main via the tracking branch. + +**Observed:** During v0.11.0 release, `git tag v0.11.0 5c291a4 && git push && git push origin v0.11.0` pushed a commit directly to main unblocked. + +**Fix direction:** Either extend the hook to also block bare `git push` when the current branch is main/master, or amend the release process to require a PR for the version bump commit. + +--- + +### Observation — `git pull` fails in sandbox + +**Symptom:** Stage 3 `git pull` fails with `Could not resolve host: github.com`. + +**Workaround:** User must run `git pull` in their terminal after the skill completes. + +**Fix direction:** Narrate when the pull fails so the user knows to run it manually. Don't treat it as a hard stop. + +--- diff --git a/docs/work/backlog/2026-03-27-merge-skill-evals.md b/docs/work/active/2026-03-27-merge-skill-evals.md similarity index 71% rename from docs/work/backlog/2026-03-27-merge-skill-evals.md rename to docs/work/active/2026-03-27-merge-skill-evals.md index a398e57..326aaca 100644 --- a/docs/work/backlog/2026-03-27-merge-skill-evals.md +++ b/docs/work/active/2026-03-27-merge-skill-evals.md @@ -1,9 +1,8 @@ --- -status: pending +status: active created: 2026-03-27 updated: 2026-03-27 -blocked-by: - - docs/work/active/2026-03-27-merge-skill.md +blocked-by: [] --- # Add evals for /merge skill model behavior @@ -19,6 +18,8 @@ natural language output quality) rather than testing deterministic code paths. ### Eval scenarios to cover +From original backlog: + - [ ] **Stage 1 — task tracking not detected**: Model asks the user which tool they use - [ ] **Stage 1 — all AC met**: Model checks off items and proceeds without asking - [ ] **Stage 1 — unmet AC**: Model presents Item+Action panel and applies chosen action @@ -28,6 +29,13 @@ natural language output quality) rather than testing deterministic code paths. - [ ] **Stage 3 — safe delete succeeds**: Model runs `git branch --delete`, not `--delete --force` - [ ] **Stage 3 — safe delete fails, user declines force**: Model leaves branch, does not retry +From live test (see `docs/ops/igit-merge-test-and-debug.md`): + +- [ ] **Bug 2 — MCP elif eats fs detection**: Task tracking returns `none` when `~/.claude/settings.json` exists but has no matching MCP servers, even though `docs/work/active/` is present +- [ ] **Bug 3 — PR URL missing from manual merge option**: Model presents manual merge option without interpolating `pr.url` into the description +- [ ] **Bug 4 — Main worktree triggers spurious worktree remove**: Model attempts `git worktree remove` on the main working tree when branch matches +- [ ] **Bug 5 — Implicit merge checklist not enforced**: Model proceeds through Stage 1 without flagging missing version bump, untested changes, or uncommitted files that should have been included + ### Eval approach Each eval: @@ -52,3 +60,5 @@ language output, correct tool selection under different context states. Consider whether evals should run in CI or remain a manual verification step given cost and latency. + +See `docs/ops/igit-merge-test-and-debug.md` for the full bug list from live testing. From 364f249fd72c0b954540f63d7dc61a4ea7c2281b Mon Sep 17 00:00:00 2001 From: tg-agent <268907238+tg-agent@users.noreply.github.com> Date: Fri, 27 Mar 2026 15:00:48 -0600 Subject: [PATCH 2/6] docs: add eval harness design spec for /merge skill Co-Authored-By: Trevor Gullstad Co-Authored-By: Claude --- .../2026-03-27-merge-skill-evals-design.md | 250 ++++++++++++++++++ 1 file changed, 250 insertions(+) create mode 100644 docs/plans/2026-03-27-merge-skill-evals-design.md diff --git a/docs/plans/2026-03-27-merge-skill-evals-design.md b/docs/plans/2026-03-27-merge-skill-evals-design.md new file mode 100644 index 0000000..da71cc3 --- /dev/null +++ b/docs/plans/2026-03-27-merge-skill-evals-design.md @@ -0,0 +1,250 @@ +# /merge Skill Eval Harness Design + +**Date:** 2026-03-27 +**Status:** Draft +**Author:** tgulls + Claude +**Source:** brainstorming +**Location:** `evals/`, `.claude/skills/run-evals/` +**References:** +- [`docs/work/active/2026-03-27-merge-skill-evals.md`](../work/active/2026-03-27-merge-skill-evals.md) — active work item with 12 eval scenarios +- [`docs/ops/igit-merge-test-and-debug.md`](../ops/igit-merge-test-and-debug.md) — live test bug log (bugs 2–5 inform eval scenarios) + +--- + +## Overview + +A skill-orchestrated eval harness for testing model behavior when running `/igit:merge`. Evals are distinct from the `merge_context_test.py` integration tests — they test the model layer (decision-making, tool selection, output quality) rather than the shell script contract. + +## Architecture + +``` +evals/ + skill_merge_eval.yaml # Eval suite for /merge + +.claude/ + skills/ + run-evals/ + SKILL.md # Orchestrator skill (/igit:run-evals) + agents/ + eval-runner.md # Generic skill executor + eval-grader.md # Model judgment scorer + scripts/ + setup/ # Per-case environment setup scripts + assert/ # Deterministic assertion scripts (tool calls, fs) +``` + +Eval cases live in `evals/` (top-level, visible — signals serious testing architecture). The orchestrator skill and its agents/scripts are colocated in `.claude/skills/run-evals/` as project-level eval infrastructure, not part of the distributed plugin. + +## Eval Case Schema + +Each `*_eval.{yml,yaml}` file contains a named suite with multiple cases: + +```yaml +suite: "Evals for the /merge skill — covers all three stages and known edge cases" + +cases: + - name: "stage2_pr_blocked_conflicts" + description: "Model surfaces conflict blocker and stops without offering to merge" + skill: igit:merge + tags: [stage2, blocked] + + setup: + scripts: + - ".claude/skills/run-evals/scripts/setup/init_git_repo.sh" + state: + - git_repo: clean + - branch: "feature/my-feature" + - remote: connected + + prompt: > + I'm ready to merge my feature branch. Please run the merge workflow. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/my-feature" + base_branch: "main" + pr: + number: 42 + url: "https://github.com/org/repo/pull/42" + state: "open" + mergeable: "CONFLICTING" + merge_state_status: "DIRTY" + title: "My feature" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: [] + + exits_at: null + + expect: + - "Identifies the PR has merge conflicts before taking any action" + - "Names the specific blocker and tells the user how to resolve it" + - "Stops after surfacing the blocker — does not attempt Stage 2 or 3" + + assert: + tool_calls: + - never: "gh pr merge" + - once: "gh pr view" + filesystem: + - exist: "docs/work/active/2026-03-27-merge-skill-evals.md" + - file: "docs/work/active/2026-03-27-merge-skill-evals.md" + contains: "status: active" +``` + +### Schema Reference + +| Key | Required | Description | +|-----|----------|-------------| +| `suite` | Yes | Human-readable description of the file's eval suite | +| `cases` | Yes | List of eval cases | +| `name` | Yes | Snake-case identifier, unique across all suites | +| `description` | Yes | One-line human description | +| `skill` | Yes | Skill under test (e.g. `igit:merge`) | +| `tags` | No | For filtering: `/igit:run-evals stage2` | +| `setup.scripts` | No | Shell scripts to run before the runner is dispatched | +| `setup.state` | No | Declarative environment state (resolved by setup scripts) | +| `prompt` | Yes | User prompt injected into the model session | +| `context` | Yes | List of context items injected as tool call output | +| `context[].source` | Yes | `script_invocation`, `tool_call`, or `subagent_output` | +| `context[].output` | Yes | The mock output to inject | +| `prior_decisions` | No | AskUserQuestion answers already established before this case | +| `exits_at` | No | Expected AskUserQuestion question text; `null` if no decision point expected | +| `expect` | No | Model judgment assertions (scored by eval-grader) | +| `assert.tool_calls` | No | Deterministic tool call assertions (`once`, `never`, `times`, `contains`) | +| `assert.filesystem` | No | Deterministic filesystem assertions (`exist`, `file.contains`) | + +### Branching Logic + +Decision points produce two cases sharing the same `context` but differing in `prior_decisions`: + +```yaml +- name: "stage2_merge_who_merges_claude" + prior_decisions: + - question: "Who merges PR #42?" + answer: "Claude merges" + ... + +- name: "stage2_merge_who_merges_manual" + prior_decisions: + - question: "Who merges PR #42?" + answer: "I'll merge manually" + ... +``` + +## Eval Runner Agent + +Generic skill executor — no domain knowledge, no eval framing visible to the model. + +**Location:** `.claude/skills/run-evals/agents/eval-runner.md` + +**Model:** Haiku (mechanical execution) + +**Tools:** Read, Bash, AskUserQuestion + +**Behavior:** Receives a parsed eval case. Constructs a session with: +1. Skill content (SKILL.md for the skill under test) +2. Mock context presented as natural tool call output — no "you are being tested" framing +3. Prior decisions pre-established in the prompt +4. User prompt as the trigger + +Proceeds through the skill until reaching `exits_at` or completion. Returns: + +```json +{ + "case": "stage2_pr_blocked_conflicts", + "tool_calls": ["gh pr view 42 --json state"], + "model_output": "This PR has merge conflicts...", + "exited_at": null, + "status": "complete" +} +``` + +**Note:** `AskUserQuestion` behavior in agent context is an open question — the runner has access to the tool and we will observe empirically whether it blocks, bubbles to the user, or exits cleanly. Design may be revised based on findings. + +## Eval Grader Agent + +Scores model judgment against `expect` assertions. Deterministic assertions (`assert`) are handled by scripts, not the grader. + +**Location:** `.claude/skills/run-evals/agents/eval-grader.md` + +**Model:** Sonnet (judgment quality matters) + +**Tools:** Read + +**Behavior:** Receives runner output + `expect` list. Scores each expectation independently. + +Returns: + +```json +{ + "case": "stage2_pr_blocked_conflicts", + "results": [ + { + "expect": "Identifies the PR has merge conflicts before taking any action", + "pass": true, + "reasoning": "Model opened with 'This PR has merge conflicts (DIRTY)' before any tool calls" + }, + { + "expect": "Stops after surfacing the blocker — does not attempt Stage 2 or 3", + "pass": false, + "reasoning": "Model surfaced the blocker but then asked 'Would you like to resolve conflicts now?'" + } + ] +} +``` + +## Orchestrator Skill (`/igit:run-evals`) + +**Location:** `.claude/skills/run-evals/SKILL.md` + +**Invocation:** `/igit:run-evals [tag-or-name-filter]` + +**Flow per run:** + +1. Discover `evals/*_eval.{yml,yaml}` +2. Parse each file — extract `suite` and flatten `cases` list +3. Apply tag or name filter if provided +4. For each case: + a. Run `setup.scripts` in order + b. Dispatch `eval-runner` agent with parsed case + c. Run `assert.tool_calls` and `assert.filesystem` scripts against runner output + d. Dispatch `eval-grader` agent with runner output + `expect` list +5. Collect all results, write markdown report + +**Report format:** + +``` +## Eval Results — 2026-03-27 + +### /merge skill evals + +| Case | Expect | Assert | Status | +|------|--------|--------|--------| +| stage2_pr_blocked_conflicts | 3/3 | 2/2 | ✓ PASS | +| stage2_claude_merges_squash | 2/3 | 1/2 | ✗ FAIL | + +### Failures + +**stage2_claude_merges_squash** +- expect: "Model uses --squash not --merge" + FAIL: "Model ran gh pr merge 42 --merge instead of --squash" +- assert.tool_calls: never "gh pr merge --merge" + FAIL: found 1 match +``` + +## Future Considerations + +- **`agent-eval-runner` skill** — agents cannot dispatch agents, so running evals from an agent context will require a dedicated skill with forked context (worktrees). Not in scope now. +- **CI integration** — evals are manual verification for now given cost and latency. If a lightweight subset proves stable, consider a scheduled CI run. +- **Additional suites** — `skill_commit_eval.yaml` etc. follow the same pattern; orchestrator discovers all `*_eval.{yml,yaml}` files automatically. From d30bb553d945a6f2b2756ec1678c5fe35651dc5f Mon Sep 17 00:00:00 2001 From: tg-agent <268907238+tg-agent@users.noreply.github.com> Date: Fri, 27 Mar 2026 17:11:27 -0600 Subject: [PATCH 3/6] docs: revise eval harness design based on Opus review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Clarify tool_call assertion schema: contains/exact + times, with once/never/many shorthands - Fix filesystem assertions: exists/not-exists assert post-execution state changes, not pre-existing state - Add SubagentStop hook for transcript capture (ground truth, not self-reported) - Add .evals/-/ output structure (gitignored) - transcript.json extracts key event details from agent_transcript_path rather than copying full transcript - Fix stage2_pr_blocked_conflicts example: remove filesystem assertions (blocked case has no fs side effects) - Status: Draft → In Review Co-Authored-By: Trevor Gullstad Co-Authored-By: Claude --- .../2026-03-27-merge-skill-evals-design.md | 85 ++++++++++++++----- 1 file changed, 65 insertions(+), 20 deletions(-) diff --git a/docs/plans/2026-03-27-merge-skill-evals-design.md b/docs/plans/2026-03-27-merge-skill-evals-design.md index da71cc3..c3ce448 100644 --- a/docs/plans/2026-03-27-merge-skill-evals-design.md +++ b/docs/plans/2026-03-27-merge-skill-evals-design.md @@ -1,7 +1,7 @@ # /merge Skill Eval Harness Design **Date:** 2026-03-27 -**Status:** Draft +**Status:** In Review **Author:** tgulls + Claude **Source:** brainstorming **Location:** `evals/`, `.claude/skills/run-evals/` @@ -31,9 +31,19 @@ evals/ scripts/ setup/ # Per-case environment setup scripts assert/ # Deterministic assertion scripts (tool calls, fs) + hooks/ + capture_runner_transcript.sh # SubagentStop hook — extracts key details from agent transcript + +.evals/ # gitignored — eval run output + -/ # e.g. 2026-03-27T210000-a1b2c3 + report.md # final markdown report + / # per-case subdirectory + transcript.json # extracted events from agent transcript + assert.json # deterministic assertion results + grader.json # grader output ``` -Eval cases live in `evals/` (top-level, visible — signals serious testing architecture). The orchestrator skill and its agents/scripts are colocated in `.claude/skills/run-evals/` as project-level eval infrastructure, not part of the distributed plugin. +Eval cases live in `evals/` (top-level, visible — signals serious testing architecture). The orchestrator skill and its agents/scripts are colocated in `.claude/skills/run-evals/` as project-level eval infrastructure, not part of the distributed plugin. Run output is written to `.evals/` in the cwd (gitignored). ## Eval Case Schema @@ -96,10 +106,6 @@ cases: tool_calls: - never: "gh pr merge" - once: "gh pr view" - filesystem: - - exist: "docs/work/active/2026-03-27-merge-skill-evals.md" - - file: "docs/work/active/2026-03-27-merge-skill-evals.md" - contains: "status: active" ``` ### Schema Reference @@ -121,8 +127,8 @@ cases: | `prior_decisions` | No | AskUserQuestion answers already established before this case | | `exits_at` | No | Expected AskUserQuestion question text; `null` if no decision point expected | | `expect` | No | Model judgment assertions (scored by eval-grader) | -| `assert.tool_calls` | No | Deterministic tool call assertions (`once`, `never`, `times`, `contains`) | -| `assert.filesystem` | No | Deterministic filesystem assertions (`exist`, `file.contains`) | +| `assert.tool_calls` | No | Deterministic tool call assertions — see Tool Call Assertions | +| `assert.filesystem` | No | Deterministic filesystem assertions — see Filesystem Assertions | ### Branching Logic @@ -142,6 +148,34 @@ Decision points produce two cases sharing the same `context` but differing in `p ... ``` +### Tool Call Assertions + +Assertions match against tool call strings captured in `transcript.json`. + +| Form | Meaning | +|------|---------| +| `contains: ""` | at least one call contains substring | +| `contains: "", times: N` | exactly N calls contain substring | +| `exact: ""` | at least one call matches exactly | +| `exact: "", times: N` | exactly N calls match exactly | +| `once: ""` | shorthand — `contains + times: 1` | +| `never: ""` | shorthand — `contains + times: 0` | +| `many: ""` | shorthand — `contains + times: 2+` | + +No `times` means "at least once". Use `exact` when flag specifics matter (e.g. `--squash` vs `--merge`). + +### Filesystem Assertions + +Assertions validate post-execution filesystem state — what the model created, moved, or modified. They are not for pre-existing state. + +| Form | Meaning | +|------|---------| +| `exists: ""` | file exists at path (was created or moved here) | +| `not-exists: ""` | file does not exist at path (was deleted or moved away) | +| `file: "", contains: ""` | file exists and contains substring | + +`exists` + `not-exists` together assert a `git mv` — both sides of the move verified. + ## Eval Runner Agent Generic skill executor — no domain knowledge, no eval framing visible to the model. @@ -158,19 +192,29 @@ Generic skill executor — no domain knowledge, no eval framing visible to the m 3. Prior decisions pre-established in the prompt 4. User prompt as the trigger -Proceeds through the skill until reaching `exits_at` or completion. Returns: +Proceeds through the skill until reaching `exits_at` or completion. Returns `status` and `exited_at` — the full event transcript is captured by the SubagentStop hook, not self-reported. + +**Note:** `AskUserQuestion` behavior in agent context is an open question — the runner has access to the tool and we will observe empirically whether it blocks, bubbles to the user, or exits cleanly. Design may be revised based on findings. + +### Transcript Capture (SubagentStop Hook) + +The SubagentStop hook fires when the eval-runner agent completes. It filters on `agent_type == "eval-runner"`, reads `agent_transcript_path` (the agent's JSONL transcript), extracts key details, and writes to `.evals/-//transcript.json`: ```json { "case": "stage2_pr_blocked_conflicts", - "tool_calls": ["gh pr view 42 --json state"], - "model_output": "This PR has merge conflicts...", - "exited_at": null, - "status": "complete" + "source": "~/.claude/projects/.../subagents/agent-def456.jsonl", + "events": [ + { "type": "text", "content": "This PR has merge conflicts (DIRTY)..." }, + { "type": "tool_call", "call": "gh pr view 42 --json mergeable" }, + { "type": "text", "content": "Cannot merge — resolve conflicts first." } + ], + "status": "complete", + "exited_at": null } ``` -**Note:** `AskUserQuestion` behavior in agent context is an open question — the runner has access to the tool and we will observe empirically whether it blocks, bubbles to the user, or exits cleanly. Design may be revised based on findings. +`source` references the full transcript for debugging. `events` is the ordered sequence consumed by assert scripts and the grader. The run output directory is communicated to the hook via an env var set by the orchestrator before dispatching the runner. ## Eval Grader Agent @@ -182,7 +226,7 @@ Scores model judgment against `expect` assertions. Deterministic assertions (`as **Tools:** Read -**Behavior:** Receives runner output + `expect` list. Scores each expectation independently. +**Behavior:** Receives `transcript.json` (from `.evals///`) + `expect` list. Scores each expectation independently using the ordered `events` sequence — ordering matters for assertions like "identifies X before taking any action". Returns: @@ -215,12 +259,13 @@ Returns: 1. Discover `evals/*_eval.{yml,yaml}` 2. Parse each file — extract `suite` and flatten `cases` list 3. Apply tag or name filter if provided -4. For each case: +4. Generate `-`, create `.evals/-/`, set `EVAL_RUN_DIR` env var +5. For each case: a. Run `setup.scripts` in order - b. Dispatch `eval-runner` agent with parsed case - c. Run `assert.tool_calls` and `assert.filesystem` scripts against runner output - d. Dispatch `eval-grader` agent with runner output + `expect` list -5. Collect all results, write markdown report + b. Dispatch `eval-runner` agent — SubagentStop hook writes `transcript.json` to `.evals///` + c. Run `assert.tool_calls` and `assert.filesystem` scripts against `transcript.json` — write `assert.json` + d. Dispatch `eval-grader` agent with `transcript.json` + `expect` list — write `grader.json` +6. Collect all results, write `report.md` to `.evals//` **Report format:** From 620a6877ec020a3695a710f4ce4be6cef030fe53 Mon Sep 17 00:00:00 2001 From: tg-agent <268907238+tg-agent@users.noreply.github.com> Date: Fri, 27 Mar 2026 17:28:28 -0600 Subject: [PATCH 4/6] docs: add implementation plan for merge skill evals 14-task TDD plan covering eval harness infrastructure (assertion engines, transcript capture hook, setup scripts, agents, orchestrator skill) and 12 eval cases across Stage 1-3 plus bug regressions. Defers Bug 5 (implicit checklist) to its own backlog item since the feature doesn't exist yet. Co-Authored-By: Trevor Gullstad Co-Authored-By: Claude --- .../2026-03-27-merge-skill-evals-impl.md | 2204 +++++++++++++++++ .../active/2026-03-27-merge-skill-evals.md | 2 +- .../2026-03-27-merge-implicit-checklist.md | 32 + 3 files changed, 2237 insertions(+), 1 deletion(-) create mode 100644 docs/plans/2026-03-27-merge-skill-evals-impl.md create mode 100644 docs/work/backlog/2026-03-27-merge-implicit-checklist.md diff --git a/docs/plans/2026-03-27-merge-skill-evals-impl.md b/docs/plans/2026-03-27-merge-skill-evals-impl.md new file mode 100644 index 0000000..3be5b6d --- /dev/null +++ b/docs/plans/2026-03-27-merge-skill-evals-impl.md @@ -0,0 +1,2204 @@ +# /merge Skill Eval Harness Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Build a skill-orchestrated eval harness that tests model behavior when running `/igit:merge`, covering 12 scenarios from the work item. + +**Architecture:** YAML eval cases define scenarios with mock context, prompts, and assertions. An orchestrator skill (`/igit:run-evals`) parses cases, dispatches an eval-runner agent (Haiku) per case, captures transcripts via SubagentStop hook, runs deterministic assertions via scripts, and dispatches an eval-grader agent (Sonnet) for model judgment scoring. Results aggregate into a markdown report. + +**Tech Stack:** Python 3.12+ (uv), pytest, YAML (PyYAML), POSIX shell for hook/setup scripts, Claude Code agents + skills + +--- + +## File Structure + +``` +NEW FILES: +evals/ + skill_merge_eval.yaml # 12 eval cases for /merge skill + +.claude/ + skills/ + run-evals/ + SKILL.md # Orchestrator skill + agents/ + eval-runner.md # Generic skill executor agent + eval-grader.md # Model judgment scorer agent + scripts/ + setup/ + init_git_repo.sh # Creates temp git repo with mock state + assert/ + check_tool_calls.py # Deterministic tool call assertion script + check_filesystem.py # Deterministic filesystem assertion script + hooks/ + capture_runner_transcript.sh # SubagentStop hook — extracts transcript + +MODIFY FILES: +.gitignore # Add .evals/ to gitignore +``` + +--- + +## Task 1: Gitignore and Directory Scaffolding + +**Files:** +- Modify: `.gitignore` + +- [ ] **Step 1: Add `.evals/` to `.gitignore`** + +``` +# Append to .gitignore +.evals/ +``` + +- [ ] **Step 2: Verify** + +Run: `grep -q '.evals/' .gitignore && echo "OK"` +Expected: `OK` + +- [ ] **Step 3: Commit** + +```bash +git add .gitignore +git commit -m "chore: add .evals/ to gitignore for eval run output" +``` + +--- + +## Task 2: Tool Call Assertion Script + +The deterministic assertion engine for `assert.tool_calls`. Reads `transcript.json` and an assertion list, outputs `assert.json` with pass/fail per assertion. + +**Files:** +- Create: `.claude/skills/run-evals/scripts/assert/check_tool_calls.py` +- Create: `.claude/skills/run-evals/scripts/assert/check_tool_calls_test.py` + +- [ ] **Step 1: Write the failing test for `contains` assertions** + +```python +"""Tests for check_tool_calls assertion engine.""" + +import json +from pathlib import Path + +import pytest + +# The module under test — will be imported after creation +from check_tool_calls import check_tool_calls + + +class TestContains: + def test_contains_found(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json mergeable"}, + {"type": "tool_call", "call": "gh pr merge 42 --squash"}, + ] + assertions = [{"contains": "gh pr merge"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_contains_not_found(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json mergeable"}, + ] + assertions = [{"contains": "gh pr merge"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False + + def test_contains_with_times(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + {"type": "tool_call", "call": "gh pr view 42 --json mergeable"}, + ] + assertions = [{"contains": "gh pr view", "times": 2}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_contains_with_times_mismatch(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"contains": "gh pr view", "times": 2}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False + assert "expected 2" in results[0]["reason"] +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'check_tool_calls'` + +- [ ] **Step 3: Write the `check_tool_calls` module** + +```python +"""Deterministic tool call assertion engine for eval harness. + +Reads tool_call events from a transcript and evaluates assertions against them. +Each assertion returns a pass/fail result with a reason string. +""" + + +def check_tool_calls( + events: list[dict], assertions: list[dict] +) -> list[dict]: + """Evaluate tool call assertions against transcript events. + + Args: + events: List of transcript events (only type=="tool_call" are inspected). + assertions: List of assertion dicts — see design spec for schema. + + Returns: + List of result dicts with keys: assertion, pass, reason. + """ + tool_calls = [e["call"] for e in events if e.get("type") == "tool_call"] + results = [] + + for assertion in assertions: + result = _evaluate_one(tool_calls, assertion) + results.append(result) + + return results + + +def _evaluate_one(tool_calls: list[str], assertion: dict) -> dict: + """Evaluate a single assertion against the tool call list.""" + # Resolve shorthands first + if "once" in assertion: + return _evaluate_one( + tool_calls, {"contains": assertion["once"], "times": 1} + ) + if "never" in assertion: + return _evaluate_one( + tool_calls, {"contains": assertion["never"], "times": 0} + ) + if "many" in assertion: + substring = assertion["many"] + count = sum(1 for c in tool_calls if substring in c) + passed = count >= 2 + return { + "assertion": assertion, + "pass": passed, + "reason": f"found {count} matches (need 2+)" + if not passed + else f"found {count} matches", + } + + if "contains" in assertion: + return _check_contains(tool_calls, assertion) + if "exact" in assertion: + return _check_exact(tool_calls, assertion) + + return { + "assertion": assertion, + "pass": False, + "reason": "unknown assertion type", + } + + +def _check_contains(tool_calls: list[str], assertion: dict) -> dict: + substring = assertion["contains"] + count = sum(1 for c in tool_calls if substring in c) + + if "times" in assertion: + expected = assertion["times"] + passed = count == expected + reason = ( + f"found {count} matches" + if passed + else f"expected {expected}, found {count}" + ) + else: + passed = count >= 1 + reason = "found" if passed else "not found" + + return {"assertion": assertion, "pass": passed, "reason": reason} + + +def _check_exact(tool_calls: list[str], assertion: dict) -> dict: + target = assertion["exact"] + count = sum(1 for c in tool_calls if c == target) + + if "times" in assertion: + expected = assertion["times"] + passed = count == expected + reason = ( + f"found {count} exact matches" + if passed + else f"expected {expected} exact, found {count}" + ) + else: + passed = count >= 1 + reason = "found" if passed else "not found" + + return {"assertion": assertion, "pass": passed, "reason": reason} +``` + +- [ ] **Step 4: Run tests to verify they pass** + +Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Expected: 4 passed + +- [ ] **Step 5: Write tests for shorthand assertions (`once`, `never`, `many`)** + +Add to `check_tool_calls_test.py`: + +```python +class TestShorthands: + def test_once_passes_with_exactly_one(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"once": "gh pr view"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_once_fails_with_two(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + {"type": "tool_call", "call": "gh pr view 42 --json mergeable"}, + ] + assertions = [{"once": "gh pr view"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False + + def test_never_passes_with_zero(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"never": "gh pr merge"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_never_fails_with_one(self): + events = [ + {"type": "tool_call", "call": "gh pr merge 42 --squash"}, + ] + assertions = [{"never": "gh pr merge"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False + + def test_many_passes_with_two_plus(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + {"type": "tool_call", "call": "gh pr view 42 --json mergeable"}, + ] + assertions = [{"many": "gh pr view"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_many_fails_with_one(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"many": "gh pr view"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False +``` + +- [ ] **Step 6: Run full test suite** + +Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Expected: 10 passed + +- [ ] **Step 7: Write tests for `exact` assertions** + +Add to `check_tool_calls_test.py`: + +```python +class TestExact: + def test_exact_match(self): + events = [ + {"type": "tool_call", "call": "gh pr merge 42 --squash"}, + ] + assertions = [{"exact": "gh pr merge 42 --squash"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_exact_no_match_on_substring(self): + events = [ + {"type": "tool_call", "call": "gh pr merge 42 --squash --admin"}, + ] + assertions = [{"exact": "gh pr merge 42 --squash"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False + + def test_exact_with_times(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"exact": "gh pr view 42 --json state", "times": 2}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True +``` + +- [ ] **Step 8: Run full test suite** + +Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Expected: 13 passed + +- [ ] **Step 9: Write test for non-tool_call events are ignored** + +Add to `check_tool_calls_test.py`: + +```python +class TestEventFiltering: + def test_ignores_text_events(self): + events = [ + {"type": "text", "content": "gh pr merge 42 --squash"}, + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"never": "gh pr merge"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_multiple_assertions(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + {"type": "tool_call", "call": "gh pr merge 42 --squash"}, + ] + assertions = [ + {"once": "gh pr view"}, + {"once": "gh pr merge"}, + {"never": "git push --force"}, + ] + results = check_tool_calls(events, assertions) + assert all(r["pass"] for r in results) +``` + +- [ ] **Step 10: Run full test suite** + +Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Expected: 15 passed + +- [ ] **Step 11: Commit** + +```bash +git add .claude/skills/run-evals/scripts/assert/check_tool_calls.py \ + .claude/skills/run-evals/scripts/assert/check_tool_calls_test.py +git commit -m "feat(evals): add deterministic tool call assertion engine + +TDD implementation covering contains, exact, once, never, many assertions +with times qualifier. Filters non-tool_call events from transcript." +``` + +--- + +## Task 3: Filesystem Assertion Script + +Deterministic assertion engine for `assert.filesystem`. Checks post-execution filesystem state. + +**Files:** +- Create: `.claude/skills/run-evals/scripts/assert/check_filesystem.py` +- Create: `.claude/skills/run-evals/scripts/assert/check_filesystem_test.py` + +- [ ] **Step 1: Write the failing test** + +```python +"""Tests for check_filesystem assertion engine.""" + +import os +from pathlib import Path + +import pytest + +from check_filesystem import check_filesystem + + +class TestExists: + def test_exists_passes(self, tmp_path): + (tmp_path / "file.txt").write_text("hello") + assertions = [{"exists": str(tmp_path / "file.txt")}] + results = check_filesystem(assertions) + assert results[0]["pass"] is True + + def test_exists_fails(self, tmp_path): + assertions = [{"exists": str(tmp_path / "missing.txt")}] + results = check_filesystem(assertions) + assert results[0]["pass"] is False + + +class TestNotExists: + def test_not_exists_passes(self, tmp_path): + assertions = [{"not-exists": str(tmp_path / "missing.txt")}] + results = check_filesystem(assertions) + assert results[0]["pass"] is True + + def test_not_exists_fails(self, tmp_path): + (tmp_path / "file.txt").write_text("hello") + assertions = [{"not-exists": str(tmp_path / "file.txt")}] + results = check_filesystem(assertions) + assert results[0]["pass"] is False + + +class TestFileContains: + def test_contains_passes(self, tmp_path): + (tmp_path / "file.txt").write_text("hello world") + assertions = [{"file": str(tmp_path / "file.txt"), "contains": "world"}] + results = check_filesystem(assertions) + assert results[0]["pass"] is True + + def test_contains_fails_wrong_content(self, tmp_path): + (tmp_path / "file.txt").write_text("hello world") + assertions = [{"file": str(tmp_path / "file.txt"), "contains": "goodbye"}] + results = check_filesystem(assertions) + assert results[0]["pass"] is False + + def test_contains_fails_file_missing(self, tmp_path): + assertions = [{"file": str(tmp_path / "missing.txt"), "contains": "hello"}] + results = check_filesystem(assertions) + assert results[0]["pass"] is False + + +class TestMoveVerification: + def test_exists_plus_not_exists_verifies_move(self, tmp_path): + (tmp_path / "new.txt").write_text("moved") + assertions = [ + {"exists": str(tmp_path / "new.txt")}, + {"not-exists": str(tmp_path / "old.txt")}, + ] + results = check_filesystem(assertions) + assert all(r["pass"] for r in results) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_filesystem_test.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'check_filesystem'` + +- [ ] **Step 3: Write the `check_filesystem` module** + +```python +"""Deterministic filesystem assertion engine for eval harness. + +Validates post-execution filesystem state: file existence, absence, and content. +""" + +from pathlib import Path + + +def check_filesystem(assertions: list[dict]) -> list[dict]: + """Evaluate filesystem assertions. + + Args: + assertions: List of assertion dicts — see design spec for schema. + + Returns: + List of result dicts with keys: assertion, pass, reason. + """ + results = [] + for assertion in assertions: + results.append(_evaluate_one(assertion)) + return results + + +def _evaluate_one(assertion: dict) -> dict: + """Evaluate a single filesystem assertion.""" + if "exists" in assertion: + path = Path(assertion["exists"]) + passed = path.exists() + reason = "file exists" if passed else f"file not found: {path}" + return {"assertion": assertion, "pass": passed, "reason": reason} + + if "not-exists" in assertion: + path = Path(assertion["not-exists"]) + passed = not path.exists() + reason = ( + "file absent as expected" + if passed + else f"file unexpectedly exists: {path}" + ) + return {"assertion": assertion, "pass": passed, "reason": reason} + + if "file" in assertion and "contains" in assertion: + path = Path(assertion["file"]) + substring = assertion["contains"] + if not path.exists(): + return { + "assertion": assertion, + "pass": False, + "reason": f"file not found: {path}", + } + content = path.read_text() + passed = substring in content + reason = ( + f"contains '{substring}'" + if passed + else f"'{substring}' not found in {path}" + ) + return {"assertion": assertion, "pass": passed, "reason": reason} + + return { + "assertion": assertion, + "pass": False, + "reason": "unknown assertion type", + } +``` + +- [ ] **Step 4: Run tests to verify they pass** + +Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_filesystem_test.py -v` +Expected: 8 passed + +- [ ] **Step 5: Commit** + +```bash +git add .claude/skills/run-evals/scripts/assert/check_filesystem.py \ + .claude/skills/run-evals/scripts/assert/check_filesystem_test.py +git commit -m "feat(evals): add deterministic filesystem assertion engine + +TDD implementation covering exists, not-exists, and file-contains assertions. +Supports move verification via exists + not-exists combination." +``` + +--- + +## Task 4: Transcript Capture Hook + +SubagentStop hook script that fires when the eval-runner agent completes, extracts key events from the agent transcript, and writes `transcript.json`. + +**Files:** +- Create: `.claude/skills/run-evals/scripts/hooks/capture_runner_transcript.sh` +- Create: `.claude/skills/run-evals/scripts/hooks/capture_runner_transcript_test.py` + +- [ ] **Step 1: Write the test** + +The hook reads the `TOOL_INPUT` JSON from stdin (containing `agent_type` and `agent_transcript_path`) and the `EVAL_RUN_DIR` + `EVAL_CASE_NAME` env vars. It extracts events from the JSONL transcript and writes `transcript.json`. + +```python +"""Integration tests for capture_runner_transcript.sh hook.""" + +import json +import os +import subprocess +from pathlib import Path + +import pytest + +SCRIPT = Path(__file__).parent / "capture_runner_transcript.sh" + + +def run_hook( + tool_input: dict, + transcript_lines: list[dict], + run_dir: Path, + case_name: str, +) -> subprocess.CompletedProcess: + """Run the hook with given inputs and return the result.""" + # Write fake transcript JSONL + transcript_path = run_dir / "agent_transcript.jsonl" + with transcript_path.open("w") as f: + for line in transcript_lines: + f.write(json.dumps(line) + "\n") + + tool_input_with_path = { + **tool_input, + "agent_transcript_path": str(transcript_path), + } + + env = { + **os.environ, + "EVAL_RUN_DIR": str(run_dir), + "EVAL_CASE_NAME": case_name, + } + + return subprocess.run( + ["sh", str(SCRIPT)], + input=json.dumps({"tool_input": tool_input_with_path}), + capture_output=True, + text=True, + env=env, + ) + + +class TestFilterByAgentType: + def test_skips_non_eval_runner(self, tmp_path): + result = run_hook( + tool_input={"agent_type": "code-reviewer"}, + transcript_lines=[], + run_dir=tmp_path, + case_name="test_case", + ) + assert result.returncode == 0 + assert not (tmp_path / "test_case" / "transcript.json").exists() + + def test_processes_eval_runner(self, tmp_path): + transcript = [ + {"type": "text", "content": "Analyzing PR..."}, + {"type": "tool_use", "name": "Bash", "input": {"command": "gh pr view 42 --json state"}}, + {"type": "text", "content": "PR has conflicts."}, + ] + result = run_hook( + tool_input={"agent_type": "eval-runner"}, + transcript_lines=transcript, + run_dir=tmp_path, + case_name="test_case", + ) + assert result.returncode == 0 + output_path = tmp_path / "test_case" / "transcript.json" + assert output_path.exists() + data = json.loads(output_path.read_text()) + assert data["case"] == "test_case" + assert len(data["events"]) == 3 + + +class TestEventExtraction: + def test_extracts_text_events(self, tmp_path): + transcript = [ + {"type": "text", "content": "PR has merge conflicts."}, + ] + run_hook( + tool_input={"agent_type": "eval-runner"}, + transcript_lines=transcript, + run_dir=tmp_path, + case_name="test_case", + ) + data = json.loads((tmp_path / "test_case" / "transcript.json").read_text()) + assert data["events"][0] == {"type": "text", "content": "PR has merge conflicts."} + + def test_extracts_tool_call_events(self, tmp_path): + transcript = [ + {"type": "tool_use", "name": "Bash", "input": {"command": "gh pr view 42 --json state"}}, + ] + run_hook( + tool_input={"agent_type": "eval-runner"}, + transcript_lines=transcript, + run_dir=tmp_path, + case_name="test_case", + ) + data = json.loads((tmp_path / "test_case" / "transcript.json").read_text()) + assert data["events"][0]["type"] == "tool_call" + assert "gh pr view 42" in data["events"][0]["call"] + + def test_extracts_ask_user_events(self, tmp_path): + transcript = [ + {"type": "tool_use", "name": "AskUserQuestion", "input": {"questions": [{"question": "Who merges?"}]}}, + ] + run_hook( + tool_input={"agent_type": "eval-runner"}, + transcript_lines=transcript, + run_dir=tmp_path, + case_name="test_case", + ) + data = json.loads((tmp_path / "test_case" / "transcript.json").read_text()) + assert data["events"][0]["type"] == "ask_user" + assert "Who merges?" in data["events"][0]["question"] + + +class TestMissingEnvVars: + def test_exits_cleanly_without_eval_run_dir(self, tmp_path): + """Hook should exit 0 silently when not in an eval context.""" + env = {k: v for k, v in os.environ.items() if k != "EVAL_RUN_DIR"} + env.pop("EVAL_CASE_NAME", None) + result = subprocess.run( + ["sh", str(SCRIPT)], + input=json.dumps({"tool_input": {"agent_type": "eval-runner"}}), + capture_output=True, + text=True, + env=env, + ) + assert result.returncode == 0 +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd .claude/skills/run-evals/scripts/hooks && uv run --no-sync pytest capture_runner_transcript_test.py -v` +Expected: FAIL — script does not exist + +- [ ] **Step 3: Write the hook script** + +```sh +#!/bin/sh +# capture_runner_transcript.sh — SubagentStop hook for eval harness +# Extracts key events from eval-runner agent transcript, writes transcript.json +# Dependencies: jq +# Platforms: macOS, Linux +# +# Env vars (set by orchestrator): +# EVAL_RUN_DIR — path to current eval run output directory +# EVAL_CASE_NAME — name of the current eval case +# +# Reads TOOL_INPUT JSON from stdin (SubagentStop hook contract). + +set -e + +# Exit silently if not in an eval context +if [ -z "${EVAL_RUN_DIR:-}" ] || [ -z "${EVAL_CASE_NAME:-}" ]; then + exit 0 +fi + +# Read hook input from stdin +input=$(cat) + +# Check agent type — only process eval-runner +agent_type=$(printf '%s' "$input" | jq -r '.tool_input.agent_type // ""') +if [ "$agent_type" != "eval-runner" ]; then + exit 0 +fi + +# Get transcript path +transcript_path=$(printf '%s' "$input" | jq -r '.tool_input.agent_transcript_path // ""') +if [ -z "$transcript_path" ] || [ ! -f "$transcript_path" ]; then + exit 0 +fi + +# Create output directory +output_dir="${EVAL_RUN_DIR}/${EVAL_CASE_NAME}" +mkdir -p "$output_dir" + +# Extract events from JSONL transcript +# - text events: {type: "text", content: ...} +# - tool_call events: {type: "tool_call", call: ...} (from Bash tool_use) +# - ask_user events: {type: "ask_user", question: ...} (from AskUserQuestion) +events=$(jq -c ' + if .type == "text" and .content then + {type: "text", content: .content} + elif .type == "tool_use" and .name == "AskUserQuestion" then + {type: "ask_user", question: (.input.questions[0].question // "unknown")} + elif .type == "tool_use" and .name == "Bash" then + {type: "tool_call", call: .input.command} + elif .type == "tool_use" then + {type: "tool_call", call: (.name + " " + (.input | tostring))} + else + empty + end +' "$transcript_path" | jq -s '.') + +# Write transcript.json +jq -n \ + --arg case_name "$EVAL_CASE_NAME" \ + --arg source "$transcript_path" \ + --argjson events "$events" \ + '{ + case: $case_name, + source: $source, + events: $events + }' > "${output_dir}/transcript.json" +``` + +- [ ] **Step 4: Make the script executable and run tests** + +```bash +chmod +x .claude/skills/run-evals/scripts/hooks/capture_runner_transcript.sh +``` + +Run: `cd .claude/skills/run-evals/scripts/hooks && uv run --no-sync pytest capture_runner_transcript_test.py -v` +Expected: All passed + +- [ ] **Step 5: Commit** + +```bash +git add .claude/skills/run-evals/scripts/hooks/capture_runner_transcript.sh \ + .claude/skills/run-evals/scripts/hooks/capture_runner_transcript_test.py +git commit -m "feat(evals): add SubagentStop hook for transcript capture + +Extracts text, tool_call, and ask_user events from eval-runner agent +transcript JSONL. Writes transcript.json to eval run output directory. +Filters on agent_type=eval-runner, exits silently in non-eval context." +``` + +--- + +## Task 5: Git Repo Setup Script + +Shell script that creates a temporary git repo with configurable mock state for eval cases. + +**Files:** +- Create: `.claude/skills/run-evals/scripts/setup/init_git_repo.sh` +- Create: `.claude/skills/run-evals/scripts/setup/init_git_repo_test.py` + +- [ ] **Step 1: Write the test** + +```python +"""Integration tests for init_git_repo.sh setup script.""" + +import json +import os +import subprocess +from pathlib import Path + +import pytest + +SCRIPT = Path(__file__).parent / "init_git_repo.sh" + + +def run_setup(state: dict, work_dir: Path) -> subprocess.CompletedProcess: + """Run the setup script with given state config.""" + env = {**os.environ, "EVAL_SETUP_STATE": json.dumps(state)} + return subprocess.run( + ["sh", str(SCRIPT)], + capture_output=True, + text=True, + env=env, + cwd=str(work_dir), + ) + + +class TestBasicInit: + def test_creates_git_repo(self, tmp_path): + result = run_setup( + {"git_repo": "clean", "branch": "feature/test"}, + tmp_path, + ) + assert result.returncode == 0 + assert (tmp_path / ".git").is_dir() + + def test_sets_branch(self, tmp_path): + run_setup( + {"git_repo": "clean", "branch": "feature/test"}, + tmp_path, + ) + branch = subprocess.run( + ["git", "branch", "--show-current"], + capture_output=True, + text=True, + cwd=str(tmp_path), + ) + assert branch.stdout.strip() == "feature/test" + + def test_has_main_branch(self, tmp_path): + run_setup( + {"git_repo": "clean", "branch": "feature/test"}, + tmp_path, + ) + result = subprocess.run( + ["git", "branch", "--list", "main"], + capture_output=True, + text=True, + cwd=str(tmp_path), + ) + assert "main" in result.stdout + + +class TestDocsWorkActive: + def test_creates_docs_work_active(self, tmp_path): + run_setup( + { + "git_repo": "clean", + "branch": "feature/test", + "docs_work_active": True, + }, + tmp_path, + ) + assert (tmp_path / "docs" / "work" / "active").is_dir() +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd .claude/skills/run-evals/scripts/setup && uv run --no-sync pytest init_git_repo_test.py -v` +Expected: FAIL — script does not exist + +- [ ] **Step 3: Write the setup script** + +```sh +#!/bin/sh +# init_git_repo.sh — create a temporary git repo with configurable state +# Dependencies: git, jq +# Platforms: macOS, Linux +# +# Reads EVAL_SETUP_STATE env var (JSON) for configuration: +# git_repo: "clean" — init a repo with an initial commit +# branch: "" — create and checkout this branch +# docs_work_active: true — create docs/work/active/ directory + +set -e + +state="${EVAL_SETUP_STATE:?EVAL_SETUP_STATE env var required}" + +git_repo=$(printf '%s' "$state" | jq -r '.git_repo // "clean"') +branch=$(printf '%s' "$state" | jq -r '.branch // "feature/test"') +docs_work=$(printf '%s' "$state" | jq -r '.docs_work_active // false') + +# Init repo if not already a git repo +if [ ! -d ".git" ]; then + git init -b main + git commit --allow-empty -m "initial commit" +fi + +# Create and checkout branch +if [ "$branch" != "main" ]; then + git checkout -b "$branch" 2>/dev/null || git checkout "$branch" +fi + +# Create docs/work/active if requested +if [ "$docs_work" = "true" ]; then + mkdir -p docs/work/active +fi +``` + +- [ ] **Step 4: Make executable and run tests** + +```bash +chmod +x .claude/skills/run-evals/scripts/setup/init_git_repo.sh +``` + +Run: `cd .claude/skills/run-evals/scripts/setup && uv run --no-sync pytest init_git_repo_test.py -v` +Expected: All passed + +- [ ] **Step 5: Commit** + +```bash +git add .claude/skills/run-evals/scripts/setup/init_git_repo.sh \ + .claude/skills/run-evals/scripts/setup/init_git_repo_test.py +git commit -m "feat(evals): add git repo setup script for eval cases + +Creates temporary git repo with configurable branch and optional +docs/work/active/ directory. Reads state from EVAL_SETUP_STATE env var." +``` + +--- + +## Task 6: Eval Runner Agent + +Generic skill executor agent — runs a skill with mock context, no eval framing visible to the model. + +**Files:** +- Create: `.claude/skills/run-evals/agents/eval-runner.md` + +- [ ] **Step 1: Write the agent definition** + +```markdown +--- +name: eval-runner +description: > + Generic skill executor for eval harness. Receives a skill prompt with + pre-established context and executes the skill workflow to completion. + Do not use directly — dispatched by the run-evals orchestrator. +tools: + - Read + - Bash + - AskUserQuestion +allowed-tools: + - Read + - Bash(git *) + - Bash(gh *) +model: haiku +--- + +You are executing a skill workflow. Follow the skill instructions exactly as written. + +## Your Context + +You have been given: +1. A skill's full instructions (SKILL.md content) +2. Context from prior tool calls (presented as results you already received) +3. Pre-established decisions (answers to questions already asked) +4. A user prompt to trigger the workflow + +## Rules + +- Follow the skill instructions step by step +- Use the context provided — do not re-run discovery commands +- For pre-established decisions, skip the corresponding AskUserQuestion call +- If you reach a decision point not covered by prior decisions, use AskUserQuestion +- Stop when the skill workflow completes or when you reach a blocking decision point +- Report your status when done: "complete" if workflow finished, "blocked" if stopped at a decision point + +## Output + +When done, output a brief status line: +``` +STATUS: complete +``` +or +``` +STATUS: blocked at "" +``` +``` + +- [ ] **Step 2: Verify the file is valid frontmatter** + +Run: `head -20 .claude/skills/run-evals/agents/eval-runner.md` +Expected: Valid YAML frontmatter with `---` delimiters + +- [ ] **Step 3: Commit** + +```bash +git add .claude/skills/run-evals/agents/eval-runner.md +git commit -m "feat(evals): add eval-runner agent definition + +Haiku-powered generic skill executor for eval harness. Receives skill +content + mock context + prior decisions, executes workflow to completion." +``` + +--- + +## Task 7: Eval Grader Agent + +Scores model judgment against `expect` assertions using transcript evidence. + +**Files:** +- Create: `.claude/skills/run-evals/agents/eval-grader.md` + +- [ ] **Step 1: Write the agent definition** + +```markdown +--- +name: eval-grader +description: > + Eval grader for eval harness. Scores model behavior against expected + outcomes by analyzing the transcript. Do not use directly — dispatched + by the run-evals orchestrator. +tools: + - Read +allowed-tools: + - Read +model: sonnet +--- + +You are grading a model's behavior during a skill execution. You will receive: + +1. A transcript of events (from `transcript.json`) — an ordered sequence of text outputs, tool calls, and user interactions +2. A list of expected behaviors to verify + +## Grading Rules + +- Score each expectation independently +- Use the event ordering — if an expectation says "X before Y", verify the sequence +- Base your judgment strictly on the transcript evidence +- A text event mentioning an action is NOT the same as a tool_call performing it +- Be precise: "identifies the PR has merge conflicts" means the model explicitly stated this, not that it ran a command that would reveal it + +## Output Format + +Return a JSON object with this exact structure: + +```json +{ + "case": "", + "results": [ + { + "expect": "", + "pass": true, + "reasoning": "<1-2 sentences citing specific transcript evidence>" + } + ] +} +``` + +Every expectation must appear in results. Do not add expectations not in the input list. + +For `pass: false`, explain specifically what the model did wrong or failed to do, citing event indices or content. +``` + +- [ ] **Step 2: Verify the file is valid frontmatter** + +Run: `head -20 .claude/skills/run-evals/agents/eval-grader.md` +Expected: Valid YAML frontmatter with `---` delimiters + +- [ ] **Step 3: Commit** + +```bash +git add .claude/skills/run-evals/agents/eval-grader.md +git commit -m "feat(evals): add eval-grader agent definition + +Sonnet-powered model judgment scorer. Analyzes transcript events against +expected behaviors, returns structured pass/fail results with reasoning." +``` + +--- + +## Task 8: Orchestrator Skill + +The `/igit:run-evals` skill that ties everything together — discovers eval files, dispatches agents, runs assertions, writes reports. + +**Files:** +- Create: `.claude/skills/run-evals/SKILL.md` + +- [ ] **Step 1: Write the orchestrator skill** + +```markdown +--- +name: run-evals +description: > + Use when user asks to "run evals", "test skill behavior", "run eval suite", + or "/igit:run-evals". Orchestrates the eval harness: discovers eval files, + dispatches runner and grader agents per case, runs assertions, writes report. +version: 0.1.0 +updated: "2026-03-27" +user-invocable: true +disable-model-invocation: false +allowed-tools: + - Read + - Write + - Bash(uv *) + - Bash(mkdir *) + - Bash(chmod *) + - Bash(date *) + - Bash(cat *) + - Agent +argument-hint: "[tag-or-name-filter]" +--- + +# Run Evals + +Orchestrate eval suite execution for skill behavior testing. + +## Invocation + +``` +/igit:run-evals # Run all eval cases +/igit:run-evals stage2 # Filter by tag +/igit:run-evals stage2_pr_blocked_conflicts # Filter by case name +``` + +## Process + +### Step 1: Discover and Parse + +1. Find all `evals/*_eval.{yml,yaml}` files +2. Parse each file — extract `suite` description and `cases` list +3. If a filter argument was provided, keep only cases where: + - Any tag matches the filter, OR + - The case name contains the filter +4. Report: "Found N cases across M suites (filter: )" + +### Step 2: Prepare Run Directory + +```bash +RUN_ID=$(date +%Y-%m-%dT%H%M%S)-$(head -c 3 /dev/urandom | xxd -p | head -c 6) +EVAL_RUN_DIR=".evals/${RUN_ID}" +mkdir -p "$EVAL_RUN_DIR" +``` + +Export `EVAL_RUN_DIR` as an environment variable for hooks and scripts. + +### Step 3: Execute Each Case + +For each case in order: + +#### 3a. Setup + +If `setup.scripts` is defined, run each script in order: + +```bash +EVAL_SETUP_STATE='' sh +``` + +#### 3b. Dispatch Eval Runner + +Set `EVAL_CASE_NAME=` env var (for the SubagentStop hook). + +Build the runner prompt from the eval case: + +``` +## Skill Instructions + +/SKILL.md — read the file> + +## Context from Prior Tool Calls + +The following data was returned by running the context script. Use this +data directly — do not re-run the script. + +### merge_context.sh output: + + +## Pre-established Decisions + + +When asked "", you have already decided: "" +Do not ask this question again — proceed with this decision. + +## User Prompt + + +``` + +Dispatch the agent: + +``` +Agent: + subagent_type: eval-runner + prompt: +``` + +The SubagentStop hook (`capture_runner_transcript.sh`) will automatically write `transcript.json` to `$EVAL_RUN_DIR//`. + +#### 3c. Run Deterministic Assertions + +If `assert.tool_calls` is defined: + +```bash +cd .claude/skills/run-evals/scripts/assert +uv run --no-sync python -c " +import json, sys +sys.path.insert(0, '.') +from check_tool_calls import check_tool_calls +transcript = json.load(open('$EVAL_RUN_DIR//transcript.json')) +assertions = json.loads('') +results = check_tool_calls(transcript['events'], assertions) +json.dump(results, open('$EVAL_RUN_DIR//assert_tool_calls.json', 'w'), indent=2) +" +``` + +If `assert.filesystem` is defined, similarly run `check_filesystem`. + +Merge all assertion results into `$EVAL_RUN_DIR//assert.json`. + +#### 3d. Dispatch Eval Grader + +If `expect` is defined: + +``` +Agent: + subagent_type: eval-grader + prompt: | + ## Case: + + ## Transcript + /transcript.json> + + ## Expected Behaviors + + + Return your grading as JSON. +``` + +Write grader output to `$EVAL_RUN_DIR//grader.json`. + +### Step 4: Generate Report + +Collect all results and write `$EVAL_RUN_DIR/report.md`: + +```markdown +## Eval Results — + +### + +| Case | Expect | Assert | Status | +|------|--------|--------|--------| +| | / | / | ✓ PASS / ✗ FAIL | + +### Failures + +**** +- expect: "" + FAIL: "" +- assert.tool_calls: + FAIL: +``` + +### Step 5: Report + +Output the report contents to the user and note the run directory path. +``` + +- [ ] **Step 2: Verify the skill file** + +Run: `head -20 .claude/skills/run-evals/SKILL.md` +Expected: Valid YAML frontmatter + +- [ ] **Step 3: Commit** + +```bash +git add .claude/skills/run-evals/SKILL.md +git commit -m "feat(evals): add run-evals orchestrator skill + +Discovers eval YAML files, dispatches eval-runner per case, runs +deterministic assertions, dispatches eval-grader for model judgment, +aggregates results into markdown report." +``` + +--- + +## Task 9: Eval Cases — Stage 1 Scenarios + +Write the first batch of eval cases covering Stage 1 (pre-merge verification). + +**Files:** +- Create: `evals/skill_merge_eval.yaml` (first 4 cases) + +- [ ] **Step 1: Write Stage 1 eval cases** + +```yaml +suite: "Evals for the /merge skill — covers all three stages and known edge cases" + +cases: + # --- Stage 1: Pre-merge Verification --- + + - name: "stage1_task_tracking_not_detected" + description: "Model asks user which task tracking tool they use when detection returns none" + skill: igit:merge + tags: [stage1, task-tracking] + + setup: + state: + - git_repo: clean + - branch: "feature/add-search" + + prompt: > + I'm ready to merge. Let's go. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/add-search" + base_branch: "main" + pr: + number: 10 + url: "https://github.com/org/repo/pull/10" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Add search feature" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: [] + + exits_at: "Do you use a task tracking tool for this project?" + + expect: + - "Asks the user which task tracking tool they use before proceeding" + - "Presents options including docs/work, Linear, Jira, GitHub Projects, Asana, and skip" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "stage1_all_ac_met" + description: "Model checks off items and proceeds to Stage 2 without asking when all AC met" + skill: igit:merge + tags: [stage1, task-tracking, happy-path] + + setup: + state: + - git_repo: clean + - branch: "feature/add-search" + - docs_work_active: true + + prompt: > + Ship it. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/add-search" + base_branch: "main" + pr: + number: 10 + url: "https://github.com/org/repo/pull/10" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Add search feature" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "docs-work" + source: "fs" + work_items: + - path: "docs/work/active/2026-03-20-add-search.md" + title: "Add search feature" + acceptance_criteria: + - criterion: "Search endpoint returns results" + met: true + - criterion: "Unit tests pass" + met: true + + prior_decisions: [] + + exits_at: "Who merges PR #10?" + + expect: + - "Narrates that all acceptance criteria are already met" + - "Proceeds to Stage 2 merge decision without asking about task resolution" + - "Does not present the Item/Action panel" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "stage1_unmet_ac" + description: "Model presents Item+Action panel for unmet acceptance criteria" + skill: igit:merge + tags: [stage1, task-tracking] + + setup: + state: + - git_repo: clean + - branch: "feature/add-search" + - docs_work_active: true + + prompt: > + Let's merge this. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/add-search" + base_branch: "main" + pr: + number: 10 + url: "https://github.com/org/repo/pull/10" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Add search feature" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "docs-work" + source: "fs" + work_items: + - path: "docs/work/active/2026-03-20-add-search.md" + title: "Add search feature" + acceptance_criteria: + - criterion: "Search endpoint returns results" + met: true + - criterion: "Unit tests pass" + met: false + - criterion: "Documentation updated" + met: false + + prior_decisions: [] + + exits_at: "Action for Add search feature" + + expect: + - "Identifies that 2 acceptance criteria are unmet" + - "Presents an AskUserQuestion with the Item/Action panel pattern" + - "Lists 'Unit tests pass' and 'Documentation updated' as unmet criteria" + - "Does not proceed to merge before resolving the unmet criteria" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "stage1_skip_task_verification" + description: "Model proceeds to Stage 2 when user chooses to skip task verification" + skill: igit:merge + tags: [stage1, task-tracking] + + setup: + state: + - git_repo: clean + - branch: "feature/add-search" + + prompt: > + Merge my branch please. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/add-search" + base_branch: "main" + pr: + number: 10 + url: "https://github.com/org/repo/pull/10" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Add search feature" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + + exits_at: "Who merges PR #10?" + + expect: + - "Skips task verification after receiving the skip decision" + - "Proceeds directly to Stage 2 merge decision" + + assert: + tool_calls: + - never: "gh pr merge" +``` + +- [ ] **Step 2: Validate YAML syntax** + +Run: `uv run --no-sync python -c "import yaml; yaml.safe_load(open('evals/skill_merge_eval.yaml')); print('valid')"` +Expected: `valid` + +- [ ] **Step 3: Commit** + +```bash +git add evals/skill_merge_eval.yaml +git commit -m "feat(evals): add Stage 1 eval cases for /merge skill + +Four cases: task tracking not detected, all AC met, unmet AC with +Item/Action panel, and skip task verification flow." +``` + +--- + +## Task 10: Eval Cases — Stage 2 Scenarios + +Add Stage 2 (merge) eval cases to the existing YAML file. + +**Files:** +- Modify: `evals/skill_merge_eval.yaml` + +- [ ] **Step 1: Append Stage 2 cases** + +Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml`: + +```yaml + # --- Stage 2: Merge --- + + - name: "stage2_already_merged" + description: "Model acknowledges already-merged PR and surfaces local_state issues" + skill: igit:merge + tags: [stage2, already-merged] + + setup: + state: + - git_repo: clean + - branch: "feature/old-work" + + prompt: > + Merge this branch. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/old-work" + base_branch: "main" + pr: + number: 8 + url: "https://github.com/org/repo/pull/8" + state: "merged" + mergeable: "UNKNOWN" + merge_state_status: "UNKNOWN" + title: "Old work" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: true + unpushed_commits: 2 + base_out_of_date: true + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + + exits_at: null + + expect: + - "Acknowledges that PR #8 was already merged" + - "Surfaces uncommitted changes warning" + - "Surfaces 2 unpushed commits warning" + - "Surfaces base branch out of date warning" + - "Proceeds to Stage 3 cleanup without asking to merge" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "stage2_pr_blocked_conflicts" + description: "Model surfaces conflict blocker and stops without offering to merge" + skill: igit:merge + tags: [stage2, blocked] + + setup: + state: + - git_repo: clean + - branch: "feature/my-feature" + + prompt: > + I'm ready to merge my feature branch. Please run the merge workflow. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/my-feature" + base_branch: "main" + pr: + number: 42 + url: "https://github.com/org/repo/pull/42" + state: "open" + mergeable: "CONFLICTING" + merge_state_status: "DIRTY" + title: "My feature" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + + exits_at: null + + expect: + - "Identifies the PR has merge conflicts before taking any action" + - "Names the specific blocker and tells the user how to resolve it" + - "Stops after surfacing the blocker — does not attempt Stage 2 or 3" + + assert: + tool_calls: + - never: "gh pr merge" + - never: "git branch --delete" + + - name: "stage2_claude_merges_squash" + description: "Model runs gh pr merge --squash when Claude merges with squash method" + skill: igit:merge + tags: [stage2, merge, squash] + + setup: + state: + - git_repo: clean + - branch: "feature/new-api" + + prompt: > + Let's merge this PR. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/new-api" + base_branch: "main" + pr: + number: 15 + url: "https://github.com/org/repo/pull/15" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "New API endpoints" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + - question: "Who merges PR #15?" + answer: "Claude merges" + - question: "Merge method" + answer: "Squash" + + exits_at: null + + expect: + - "Runs gh pr merge with the --squash flag, not --merge or --rebase" + - "Verifies the merge succeeded by checking PR state" + + assert: + tool_calls: + - once: "gh pr merge 15 --squash" + - never: "gh pr merge 15 --merge" + - never: "gh pr merge 15 --rebase" +``` + +- [ ] **Step 2: Validate YAML syntax** + +Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'{len(d[\"cases\"])} cases')"` +Expected: `7 cases` + +- [ ] **Step 3: Commit** + +```bash +git add evals/skill_merge_eval.yaml +git commit -m "feat(evals): add Stage 2 eval cases for /merge skill + +Three cases: already merged with local state warnings, PR blocked by +conflicts, and Claude merges with squash method." +``` + +--- + +## Task 11: Eval Cases — Stage 3 Scenarios + +Add Stage 3 (cleanup) eval cases. + +**Files:** +- Modify: `evals/skill_merge_eval.yaml` + +- [ ] **Step 1: Append Stage 3 cases** + +```yaml + # --- Stage 3: Post-merge Cleanup --- + + - name: "stage3_safe_delete_succeeds" + description: "Model runs git branch --delete (not --force) for cleanup" + skill: igit:merge + tags: [stage3, cleanup] + + setup: + state: + - git_repo: clean + - branch: "feature/done-work" + + prompt: > + Merge and clean up. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/done-work" + base_branch: "main" + pr: + number: 20 + url: "https://github.com/org/repo/pull/20" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Done work" + merge_methods: ["merge", "squash"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + - question: "Who merges PR #20?" + answer: "Claude merges" + - question: "Merge method" + answer: "Merge Commit" + + exits_at: null + + expect: + - "Switches to main branch and pulls after merge" + - "Runs git branch --delete (safe delete, not --force) for cleanup" + - "Uses long flag --delete, not short -d" + + assert: + tool_calls: + - once: "git branch --delete feature/done-work" + - never: "git branch --delete --force" + - never: "git branch -d" + - never: "git branch -D" + + - name: "stage3_safe_delete_fails_user_declines" + description: "Model leaves branch when user declines force-delete after safe delete fails" + skill: igit:merge + tags: [stage3, cleanup, force-delete] + + setup: + state: + - git_repo: clean + - branch: "feature/squashed" + + prompt: > + Merge and clean up. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/squashed" + base_branch: "main" + pr: + number: 25 + url: "https://github.com/org/repo/pull/25" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Squashed feature" + merge_methods: ["squash"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + - question: "Who merges PR #25?" + answer: "Claude merges" + - question: "Merge method" + answer: "Squash" + - question: "Force-delete branch 'feature/squashed'?" + answer: "No — leave the branch" + + exits_at: null + + expect: + - "Attempts safe delete first (git branch --delete)" + - "After safe delete fails, explains why (squash merge)" + - "Accepts user's decline and does not force-delete" + - "Leaves the branch intact in final summary" + + assert: + tool_calls: + - never: "git branch --delete --force" + - never: "git branch -D" +``` + +- [ ] **Step 2: Validate YAML syntax** + +Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'{len(d[\"cases\"])} cases')"` +Expected: `9 cases` + +- [ ] **Step 3: Commit** + +```bash +git add evals/skill_merge_eval.yaml +git commit -m "feat(evals): add Stage 3 eval cases for /merge skill + +Two cases: safe branch delete succeeds (no --force), and user declines +force-delete after safe delete fails on squash merge." +``` + +--- + +## Task 12: Eval Cases — Bug Regression Scenarios + +Add eval cases covering bugs 2–5 from the live test log. + +**Files:** +- Modify: `evals/skill_merge_eval.yaml` + +- [ ] **Step 1: Append bug regression cases** + +```yaml + # --- Bug Regressions --- + + - name: "bug2_mcp_elif_eats_fs_detection" + description: "Task tracking returns none when settings.json exists but no MCP servers match, even though docs/work/active/ is present" + skill: igit:merge + tags: [regression, bug2, task-tracking] + + setup: + state: + - git_repo: clean + - branch: "feature/fix-bug" + - docs_work_active: true + + prompt: > + Ready to merge. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/fix-bug" + base_branch: "main" + pr: + number: 30 + url: "https://github.com/org/repo/pull/30" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Fix bug" + merge_methods: ["merge", "squash"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: [] + + exits_at: "Do you use a task tracking tool for this project?" + + expect: + - "Asks about task tracking because context reported tool=none" + - "Does not assume docs/work detection succeeded — respects the context data as given" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "bug3_pr_url_missing_manual_merge" + description: "Model presents manual merge option with the actual PR URL interpolated" + skill: igit:merge + tags: [regression, bug3, merge] + + setup: + state: + - git_repo: clean + - branch: "feature/manual-merge" + + prompt: > + Merge this please. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/manual-merge" + base_branch: "main" + pr: + number: 35 + url: "https://github.com/org/repo/pull/35" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Manual merge test" + merge_methods: ["merge", "squash"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + + exits_at: "Who merges PR #35?" + + expect: + - "Presents 'I'll merge manually' option" + - "The manual merge option description includes the actual PR URL (https://github.com/org/repo/pull/35)" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "bug4_main_worktree_spurious_remove" + description: "Model does not attempt git worktree remove on the main working tree" + skill: igit:merge + tags: [regression, bug4, cleanup] + + setup: + state: + - git_repo: clean + - branch: "feature/worktree-test" + + prompt: > + Ship it. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/worktree-test" + base_branch: "main" + pr: + number: 40 + url: "https://github.com/org/repo/pull/40" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Worktree test" + merge_methods: ["merge"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: + - branch: "feature/worktree-test" + path: "/Users/dev/repo" + - branch: "feature/other" + path: "/Users/dev/repo/.worktrees/feature-other" + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + - question: "Who merges PR #40?" + answer: "Claude merges" + - question: "Merge method" + answer: "Merge Commit" + + exits_at: null + + expect: + - "Does NOT attempt git worktree remove on /Users/dev/repo (the main working tree)" + - "Only attempts worktree remove on actual additional worktrees, if any match the branch" + - "Recognizes that the first worktree entry is the main working tree, not an additional worktree" + + assert: + tool_calls: + - never: "git worktree remove /Users/dev/repo" +``` + +- [ ] **Step 2: Validate YAML and count cases** + +Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'{len(d[\"cases\"])} cases'); [print(f' - {c[\"name\"]}') for c in d['cases']]"` +Expected: `12 cases` with all names listed + +- [ ] **Step 3: Commit** + +```bash +git add evals/skill_merge_eval.yaml +git commit -m "feat(evals): add bug regression eval cases for /merge skill + +Three cases from live test bugs: Bug 2 (MCP elif eats fs detection), +Bug 3 (PR URL missing from manual merge option), Bug 4 (main worktree +triggers spurious remove). Bug 5 deferred — requires implicit checklist +feature not yet implemented." +``` + +--- + +## Task 13: Register SubagentStop Hook + +Add the transcript capture hook to the project settings so it fires when eval-runner agents complete. + +**Files:** +- Create or modify: `.claude/settings.json` + +- [ ] **Step 1: Check if `.claude/settings.json` exists** + +Run: `cat .claude/settings.json 2>/dev/null || echo "does not exist"` + +- [ ] **Step 2: Create or update settings with SubagentStop hook** + +If file doesn't exist, create `.claude/settings.json`: + +```json +{ + "hooks": { + "SubagentStop": [ + { + "matcher": "eval-runner", + "hooks": [ + { + "type": "command", + "command": "sh .claude/skills/run-evals/scripts/hooks/capture_runner_transcript.sh" + } + ] + } + ] + } +} +``` + +If file exists, merge the `SubagentStop` hook into the existing `hooks` object using `jq`. + +- [ ] **Step 3: Validate JSON** + +Run: `jq . .claude/settings.json > /dev/null && echo "valid"` +Expected: `valid` + +- [ ] **Step 4: Commit** + +```bash +git add .claude/settings.json +git commit -m "feat(evals): register SubagentStop hook for transcript capture + +Fires capture_runner_transcript.sh when eval-runner agent completes, +writing extracted events to the eval run output directory." +``` + +--- + +## Task 14: Add `.evals/` Gitignore and Final Validation + +Verify the complete system end-to-end. + +**Files:** +- Modify: `.gitignore` (if not done in Task 1) + +- [ ] **Step 1: Verify `.evals/` is in `.gitignore`** + +Run: `grep -q '.evals/' .gitignore && echo "OK" || echo "MISSING"` +Expected: `OK` + +- [ ] **Step 2: Run all assertion script tests** + +Run: `uv run --no-sync pytest .claude/skills/run-evals/scripts/ -v` +Expected: All passed + +- [ ] **Step 3: Validate all YAML eval files** + +Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'Suite: {d[\"suite\"]}'); print(f'Cases: {len(d[\"cases\"])}')"` +Expected: `Suite: Evals for the /merge skill...` and `Cases: 12` + +- [ ] **Step 4: Verify file structure matches design** + +Run: `find .claude/skills/run-evals evals -type f | sort` +Expected: +``` +.claude/skills/run-evals/SKILL.md +.claude/skills/run-evals/agents/eval-grader.md +.claude/skills/run-evals/agents/eval-runner.md +.claude/skills/run-evals/scripts/assert/check_filesystem.py +.claude/skills/run-evals/scripts/assert/check_filesystem_test.py +.claude/skills/run-evals/scripts/assert/check_tool_calls.py +.claude/skills/run-evals/scripts/assert/check_tool_calls_test.py +.claude/skills/run-evals/scripts/hooks/capture_runner_transcript.sh +.claude/skills/run-evals/scripts/hooks/capture_runner_transcript_test.py +.claude/skills/run-evals/scripts/setup/init_git_repo.sh +.claude/skills/run-evals/scripts/setup/init_git_repo_test.py +evals/skill_merge_eval.yaml +``` + +- [ ] **Step 5: Commit any remaining changes** + +```bash +git add -A +git commit -m "chore(evals): final validation — all tests pass, structure matches design" +``` + +--- + +## Notes + +**Bug 5 (implicit merge checklist)** is deferred to its own work item: `docs/work/backlog/2026-03-27-merge-implicit-checklist.md`. The feature doesn't exist yet — the eval case will be added when the implicit checklist is implemented. + +**AskUserQuestion behavior in agents** is an open question per the design spec. The eval-runner uses `exits_at` to define expected stopping points, but actual behavior may differ. Task 6's agent definition instructs the model to use AskUserQuestion when it reaches unresolved decision points — empirical testing will determine if this works or needs revision. + +**PyYAML dependency**: The orchestrator skill parses YAML. Ensure `pyyaml` is in the project's dev dependencies (`uv add --dev pyyaml` if not present). diff --git a/docs/work/active/2026-03-27-merge-skill-evals.md b/docs/work/active/2026-03-27-merge-skill-evals.md index 326aaca..fc12770 100644 --- a/docs/work/active/2026-03-27-merge-skill-evals.md +++ b/docs/work/active/2026-03-27-merge-skill-evals.md @@ -34,7 +34,7 @@ From live test (see `docs/ops/igit-merge-test-and-debug.md`): - [ ] **Bug 2 — MCP elif eats fs detection**: Task tracking returns `none` when `~/.claude/settings.json` exists but has no matching MCP servers, even though `docs/work/active/` is present - [ ] **Bug 3 — PR URL missing from manual merge option**: Model presents manual merge option without interpolating `pr.url` into the description - [ ] **Bug 4 — Main worktree triggers spurious worktree remove**: Model attempts `git worktree remove` on the main working tree when branch matches -- [ ] **Bug 5 — Implicit merge checklist not enforced**: Model proceeds through Stage 1 without flagging missing version bump, untested changes, or uncommitted files that should have been included +- ~**Bug 5 — Implicit merge checklist not enforced**~ — deferred to `docs/work/backlog/2026-03-27-merge-implicit-checklist.md` (feature doesn't exist yet) ### Eval approach diff --git a/docs/work/backlog/2026-03-27-merge-implicit-checklist.md b/docs/work/backlog/2026-03-27-merge-implicit-checklist.md new file mode 100644 index 0000000..28084ee --- /dev/null +++ b/docs/work/backlog/2026-03-27-merge-implicit-checklist.md @@ -0,0 +1,32 @@ +--- +status: pending +created: 2026-03-27 +updated: 2026-03-27 +blocked-by: [] +--- + +# Add implicit pre-merge checklist to /merge skill Stage 1 + +## Goal + +Stage 1 currently only checks explicit acceptance criteria from `task_tracking.work_items`. It should also run an implicit checklist covering common pre-merge hygiene that applies regardless of task tracking state. + +## Context + +Bug 5 from live testing (`docs/ops/igit-merge-test-and-debug.md`): the /merge skill proceeded through Stage 1 without flagging that PR #14 lacked a version bump. The bump had to be applied post-merge directly to main. + +## Acceptance Criteria + +- [ ] Stage 1 runs an implicit checklist before proceeding to Stage 2 +- [ ] Checklist candidates (verify which are feasible to auto-check): + - Tests passing (run test command from CLAUDE.md if available) + - Version bumped if new skills/hooks/commands were added (compare plugin.json to base branch) + - Docs/CLAUDE.md updated if conventions changed + - No untracked files that should have been committed +- [ ] Checklist results surfaced to user before proceeding +- [ ] Eval case added to `evals/skill_merge_eval.yaml` covering the implicit checklist behavior + +## References + +- `docs/ops/igit-merge-test-and-debug.md` — Bug 5 +- `docs/work/active/2026-03-27-merge-skill-evals.md` — original Bug 5 scenario From cc877be2d9f09db9f71396da92ce620e0e57af21 Mon Sep 17 00:00:00 2001 From: tg-agent <268907238+tg-agent@users.noreply.github.com> Date: Sat, 28 Mar 2026 10:49:04 -0600 Subject: [PATCH 5/6] docs: address PR #15 debug doc review comments - Reply to all 11 review comments on debug log in-thread - Add Bug 6 (Stage 3 cleanup order: worktree before branch) - Create consolidated backlog items: - merge_context.sh bugs (1+2+4) - SKILL.md bugs (3+6) - policy hook bare push + tags - Update implicit checklist backlog with expanded scope - Add handoff doc for parent session (agent/pr15-debug-comments-handoff.md) Co-Authored-By: Trevor Gullstad Co-Authored-By: Claude --- agent/pr15-debug-comments-handoff.md | 95 +++++++++++++++++++ docs/ops/igit-merge-test-and-debug.md | 10 ++ .../2026-03-27-merge-skill-evals-impl.md | 2 +- .../active/2026-03-27-merge-skill-evals.md | 2 +- ...-03-27-merge-implicit-checklist-bugfix.md} | 16 +++- .../backlog/2026-03-28-merge-context-bugs.md | 48 ++++++++++ .../backlog/2026-03-28-merge-skill-bugs.md | 40 ++++++++ .../2026-03-28-policy-hook-bare-push-tags.md | 31 ++++++ 8 files changed, 239 insertions(+), 5 deletions(-) create mode 100644 agent/pr15-debug-comments-handoff.md rename docs/work/backlog/{2026-03-27-merge-implicit-checklist.md => 2026-03-27-merge-implicit-checklist-bugfix.md} (57%) create mode 100644 docs/work/backlog/2026-03-28-merge-context-bugs.md create mode 100644 docs/work/backlog/2026-03-28-merge-skill-bugs.md create mode 100644 docs/work/backlog/2026-03-28-policy-hook-bare-push-tags.md diff --git a/agent/pr15-debug-comments-handoff.md b/agent/pr15-debug-comments-handoff.md new file mode 100644 index 0000000..1bf76df --- /dev/null +++ b/agent/pr15-debug-comments-handoff.md @@ -0,0 +1,95 @@ +# PR #15 Debug Doc Comments — Handoff + +**Date:** 2026-03-28 +**Source:** PR #15 review comments on `docs/ops/igit-merge-test-and-debug.md` +**For:** Parent session implementing eval harness plan + +--- + +## Summary + +11 review comments on the debug log. All concern `merge_context.sh`, `git_policy_hook`, or SKILL.md — none require changes to the eval harness plan itself. Several expand the design direction for existing bugs and should be captured in work items. + +## Decisions and Actions by Bug + +### Bug 1 — Structured errors instead of silent fallback + +**Comments:** Lines 13, 28 +**Decision:** Two-part fix: +1. `merge_context.sh` should return structured error objects on `gh` failure instead of silent `{}` fallback. Format: `{ "cmd": "...", "error": "...", "hint": "Try unsandboxed" }` +2. Model handles `gh` calls directly as fallback — script should NOT be run unsandboxed entirely (security hole). The fallback path already exists in SKILL.md. + +**Action:** New work item — fix `merge_context.sh` error handling for sandboxed `gh` failures. + +### Bug 2 — Parallel detection with priority aggregation + +**Comment:** Line 47 +**Decision:** Restructure task tracking detection to run all methods (env, mcp, cli, fs, cfg) in parallel and aggregate by priority, instead of sequential `if/elif` chain. This is a bigger redesign than just fixing the elif fall-through bug. + +**Action:** Update existing Bug 2 fix direction in the debug doc. The work item scope is larger than originally described — not just "fix the elif" but "redesign detection as parallel with priority aggregation." + +### Bug 3 — PR URL in AskUserQuestion preview field + +**Comment:** Line 66 +**Decision:** Use AskUserQuestion's `preview` field for the PR URL instead of interpolating into the `description` field. Whether `` HTML links render in preview is unknown — needs empirical testing. + +**Action:** New work item — test AskUserQuestion preview field rendering, then update SKILL.md. Two-part: (1) test what renders, (2) apply fix. + +### Bug 4 — Identify PR-associated resources + +**Comment:** Line 81 +**Decision:** Reframe from "filter out main worktree" to "identify all resources associated with the PR being merged." Filter worktrees to `branch == current_branch && path != main_worktree_path`. The broader "all resources" framing (stashes, related branches, etc.) is future scope. + +**Action:** Update Bug 4 fix direction in debug doc. Immediate fix: filter worktrees in `merge_context.sh` by comparing against the primary worktree path. + +### Bug 5 — Implicit checklist scope and configuration + +**Comments:** Lines 95, 96, 99 +**Decisions:** +- Version bumps apply to ANY project releasing versioned artifacts (packages, containers, apps, infrastructure, configs), not just plugins +- "Project documentation" = README.md, CLAUDE.md, docs/, etc. +- Users need to document their merge checklist in a known, discoverable location +- Candidate locations: README.md, DEPLOY.md, RELEASE.md, CLAUDE.md, MERGE.md, or `docs/{i,}git/*.md` +- Concern: `docs/igit/*.md` is "dangerous if not protected" — files could be manipulated + +**Action:** Update existing backlog item `docs/work/backlog/2026-03-27-merge-implicit-checklist.md` with these expanded requirements. The checklist location decision is a design question for that work item. + +### Observation — bare `git push` bypasses hook + +**Comments:** Lines 103, 111 +**Decisions:** +- The bare push was in a chained command: `git tag ... && git push` +- Two proposed fix directions for `git_policy_hook`: + 1. Fill in missing refspec before applying policies (resolve bare `git push` to the actual remote/branch) + 2. Reject commands without specified remote and ref +- Tagging should be bumped from `allow` to `ask` if it was intentionally allowed + +**Action:** New work item for `git_policy_hook` — handle bare `git push` and review tag push policy level. + +### Observation — `git pull` fails in sandbox + +**Comment:** Line 121 +**Decision:** Root cause is likely GitHub API failure or sandboxing, not just DNS. The existing "narrate the failure" fix direction still applies. + +**Action:** Minor — update debug doc root cause. No separate work item needed (covered by existing sandbox documentation). + +## Impact on Eval Harness Plan + +**None.** All items are `merge_context.sh`, `git_policy_hook`, or SKILL.md changes. The eval harness plan is not affected. + +However, once Bug 1 (structured errors) and Bug 2 (parallel detection) are fixed, the eval cases for those bugs may need updated `context.output` payloads to reflect the new error/detection format. + +## Work Items + +| File | Scope | +|------|-------| +| `docs/work/backlog/2026-03-28-merge-context-bugs.md` | Bugs 1+2+4 — all `merge_context.sh` fixes (structured errors, parallel detection, worktree filtering) | +| `docs/work/backlog/2026-03-28-merge-skill-bugs.md` | Bugs 3+6 — SKILL.md fixes (PR URL preview field, cleanup order) | +| `docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md` | Bug 5 — expanded scope per review (versioned artifacts, doc scope, checklist location) | +| `docs/work/backlog/2026-03-28-policy-hook-bare-push-tags.md` | Observation — bare push refspec resolution + tag policy level | + +Bug 6 (cleanup order) added from live testing 2026-03-28: Stage 3 tries to delete branch before removing worktree, but git refuses. + +## PR Comment Replies + +Each review comment needs a reply acknowledging the decision. Use `gh api` to reply in-thread. diff --git a/docs/ops/igit-merge-test-and-debug.md b/docs/ops/igit-merge-test-and-debug.md index c25e778..08d7736 100644 --- a/docs/ops/igit-merge-test-and-debug.md +++ b/docs/ops/igit-merge-test-and-debug.md @@ -100,6 +100,16 @@ This checklist could be surfaced as a confirmation panel or auto-checked where p --- +### Bug 6 — Stage 3 cleanup order: worktree must be removed before branch deletion + +**Symptom:** `git branch --delete` fails with `error: cannot delete branch used by worktree` when a worktree exists for the branch. + +**Root cause:** SKILL.md Stage 3 orders "Delete Local Branch" before "Remove Worktree." Git refuses to delete a branch checked out in any worktree — safety mechanism to prevent leaving a worktree in an invalid state. + +**Fix direction:** Swap the order in Stage 3: remove worktree first, then delete branch. + +--- + ### Observation — bare `git push` bypasses block-push-to-main hook **Symptom:** Running `git push` (implicit tracking branch push) on main succeeded without being blocked. `git push origin main` is correctly blocked. diff --git a/docs/plans/2026-03-27-merge-skill-evals-impl.md b/docs/plans/2026-03-27-merge-skill-evals-impl.md index 3be5b6d..40d6fb9 100644 --- a/docs/plans/2026-03-27-merge-skill-evals-impl.md +++ b/docs/plans/2026-03-27-merge-skill-evals-impl.md @@ -2197,7 +2197,7 @@ git commit -m "chore(evals): final validation — all tests pass, structure matc ## Notes -**Bug 5 (implicit merge checklist)** is deferred to its own work item: `docs/work/backlog/2026-03-27-merge-implicit-checklist.md`. The feature doesn't exist yet — the eval case will be added when the implicit checklist is implemented. +**Bug 5 (implicit merge checklist)** is deferred to its own work item: `docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md`. The feature doesn't exist yet — the eval case will be added when the implicit checklist is implemented. **AskUserQuestion behavior in agents** is an open question per the design spec. The eval-runner uses `exits_at` to define expected stopping points, but actual behavior may differ. Task 6's agent definition instructs the model to use AskUserQuestion when it reaches unresolved decision points — empirical testing will determine if this works or needs revision. diff --git a/docs/work/active/2026-03-27-merge-skill-evals.md b/docs/work/active/2026-03-27-merge-skill-evals.md index fc12770..e505b1d 100644 --- a/docs/work/active/2026-03-27-merge-skill-evals.md +++ b/docs/work/active/2026-03-27-merge-skill-evals.md @@ -34,7 +34,7 @@ From live test (see `docs/ops/igit-merge-test-and-debug.md`): - [ ] **Bug 2 — MCP elif eats fs detection**: Task tracking returns `none` when `~/.claude/settings.json` exists but has no matching MCP servers, even though `docs/work/active/` is present - [ ] **Bug 3 — PR URL missing from manual merge option**: Model presents manual merge option without interpolating `pr.url` into the description - [ ] **Bug 4 — Main worktree triggers spurious worktree remove**: Model attempts `git worktree remove` on the main working tree when branch matches -- ~**Bug 5 — Implicit merge checklist not enforced**~ — deferred to `docs/work/backlog/2026-03-27-merge-implicit-checklist.md` (feature doesn't exist yet) +- ~**Bug 5 — Implicit merge checklist not enforced**~ — deferred to `docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md` (feature doesn't exist yet) ### Eval approach diff --git a/docs/work/backlog/2026-03-27-merge-implicit-checklist.md b/docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md similarity index 57% rename from docs/work/backlog/2026-03-27-merge-implicit-checklist.md rename to docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md index 28084ee..e3c10d4 100644 --- a/docs/work/backlog/2026-03-27-merge-implicit-checklist.md +++ b/docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md @@ -1,7 +1,7 @@ --- status: pending created: 2026-03-27 -updated: 2026-03-27 +updated: 2026-03-28 blocked-by: [] --- @@ -20,13 +20,23 @@ Bug 5 from live testing (`docs/ops/igit-merge-test-and-debug.md`): the /merge sk - [ ] Stage 1 runs an implicit checklist before proceeding to Stage 2 - [ ] Checklist candidates (verify which are feasible to auto-check): - Tests passing (run test command from CLAUDE.md if available) - - Version bumped if new skills/hooks/commands were added (compare plugin.json to base branch) - - Docs/CLAUDE.md updated if conventions changed + - Version bumped if project releases versioned artifacts (packages, containers, apps, infrastructure, configs — not just plugins) + - Project documentation updated (README.md, CLAUDE.md, docs/, etc.) if conventions changed - No untracked files that should have been committed +- [ ] Checklist configuration lives in a known, discoverable location + - Candidate locations: README.md, DEPLOY.md, RELEASE.md, CLAUDE.md, MERGE.md, or `docs/{i,}git/*.md` + - Security consideration: unprotected checklist files could be manipulated to skip checks - [ ] Checklist results surfaced to user before proceeding - [ ] Eval case added to `evals/skill_merge_eval.yaml` covering the implicit checklist behavior +## Design Questions + +- Where should the checklist configuration live? Needs to be discoverable but tamper-resistant. +- Should the checklist be project-configurable or a fixed set of checks? +- How does the checklist interact with the task tracking work items (complement vs overlap)? + ## References - `docs/ops/igit-merge-test-and-debug.md` — Bug 5 - `docs/work/active/2026-03-27-merge-skill-evals.md` — original Bug 5 scenario +- PR #15 review comment threads (lines 95, 96, 99) diff --git a/docs/work/backlog/2026-03-28-merge-context-bugs.md b/docs/work/backlog/2026-03-28-merge-context-bugs.md new file mode 100644 index 0000000..011cb5f --- /dev/null +++ b/docs/work/backlog/2026-03-28-merge-context-bugs.md @@ -0,0 +1,48 @@ +--- +status: pending +created: 2026-03-28 +updated: 2026-03-28 +blocked-by: [] +--- + +# Fix merge_context.sh bugs (1, 2, 4) + +## Goal + +Fix three bugs in `merge_context.sh` discovered during live testing of PR #14. + +## Bugs + +### Bug 1 — Structured error reporting for sandboxed gh failures + +`gh` silently fails in sandboxed `sh` subprocess. The `|| printf '{}'` fallback swallows errors, leaving all PR fields at defaults. The model's SKILL.md fallback path never triggers because the script appears to succeed. + +**Fix:** On `gh` failure, return structured error: `{ "cmd": "...", "error": "...", "hint": "Try running gh directly (unsandboxed)" }`. Script still exits 0 so the model receives partial output and can distinguish "no PR exists" from "gh failed." + +### Bug 2 — Parallel task tracking detection with priority aggregation + +The `if/elif` chain means `~/.claude/settings.json` existing (any Claude Code user) blocks filesystem detection. Redesign as parallel: run all methods (env, mcp, cli, fs, cfg) independently, aggregate by priority. + +**Fix:** Replace sequential `if/elif` with independent checks. Each method sets a result. Final aggregation picks highest-priority winner: env > mcp > cli > fs > cfg. + +### Bug 4 — Filter worktrees to PR-associated resources + +Main working tree appears in `worktrees[]`, which could trigger spurious `git worktree remove`. Only worktrees associated with the PR branch should be listed. + +**Fix:** Exclude main worktree by comparing each path against the first entry from `git worktree list --porcelain`. Only include worktrees on the current branch. + +## Acceptance Criteria + +- [ ] Bug 1: `gh` failures produce structured error objects, not empty `{}` +- [ ] Bug 1: Model can distinguish "no PR" from "gh failed" +- [ ] Bug 2: `docs/work/active/` detected even when `~/.claude/settings.json` exists +- [ ] Bug 2: All detection methods run independently with priority aggregation +- [ ] Bug 4: Main worktree excluded from `worktrees[]` +- [ ] Bug 4: Only PR-branch worktrees included +- [ ] All existing tests pass; new tests cover each fix +- [ ] SKILL.md fallback logic updated if needed for Bug 1 error format + +## References + +- `docs/ops/igit-merge-test-and-debug.md` — Bugs 1, 2, 4 +- PR #15 review comment threads (lines 13, 28, 47, 81) diff --git a/docs/work/backlog/2026-03-28-merge-skill-bugs.md b/docs/work/backlog/2026-03-28-merge-skill-bugs.md new file mode 100644 index 0000000..1f60a1c --- /dev/null +++ b/docs/work/backlog/2026-03-28-merge-skill-bugs.md @@ -0,0 +1,40 @@ +--- +status: pending +created: 2026-03-28 +updated: 2026-03-28 +blocked-by: [] +--- + +# Fix /merge SKILL.md bugs (3, 6) + +## Goal + +Fix two bugs in the /merge skill definition discovered during live testing. + +## Bugs + +### Bug 3 — PR URL in AskUserQuestion preview field + +The "I'll merge manually" option description doesn't include the PR URL. User has to look it up separately. + +**Fix:** Use AskUserQuestion's `preview` field for the PR URL. Needs empirical testing first — unknown whether `` HTML links render in preview. If not, use plain URL text. + +### Bug 6 — Stage 3 cleanup order (worktree before branch) + +SKILL.md has "Delete Local Branch" before "Remove Worktree." `git branch --delete` refuses if the branch is checked out in any worktree: `error: cannot delete branch used by worktree`. + +**Fix:** Swap the order in Stage 3: remove worktree first, then delete branch. Update the summary section to match. + +## Acceptance Criteria + +- [ ] Bug 3: Test what AskUserQuestion `preview` field renders (plain text, markdown, HTML) +- [ ] Bug 3: PR URL visible in the "I'll merge manually" option +- [ ] Bug 6: Stage 3 cleanup order is: switch+pull → remove worktree → delete branch +- [ ] Bug 6: Summary section reflects correct order +- [ ] Eval case added for cleanup order (worktree removal before branch deletion) + +## References + +- `docs/ops/igit-merge-test-and-debug.md` — Bug 3 +- PR #15 review comment thread (line 66) +- Live testing session 2026-03-28 — Bug 6 diff --git a/docs/work/backlog/2026-03-28-policy-hook-bare-push-tags.md b/docs/work/backlog/2026-03-28-policy-hook-bare-push-tags.md new file mode 100644 index 0000000..fd58c38 --- /dev/null +++ b/docs/work/backlog/2026-03-28-policy-hook-bare-push-tags.md @@ -0,0 +1,31 @@ +--- +status: pending +created: 2026-03-28 +updated: 2026-03-28 +blocked-by: [] +--- + +# Handle bare git push and review tag policy in git_policy_hook + +## Goal + +Prevent bare `git push` from bypassing push-to-main protection, and review whether tag pushes should require confirmation. + +## Context + +Observation from live testing: `git tag v0.11.0 5c291a4 && git push && git push origin v0.11.0` pushed directly to main unblocked because bare `git push` (no explicit ref) doesn't match the main/master pattern in the hook. + +## Acceptance Criteria + +- [ ] Bare `git push` on main/master is blocked or requires confirmation +- [ ] Two implementation options evaluated: + 1. Hook resolves bare `git push` to actual remote/ref before applying policies + 2. Hook rejects commands without explicit remote and ref +- [ ] Tag push policy reviewed — bump from `allow` to `ask` if appropriate +- [ ] Existing tests pass; new tests cover bare push and tag push scenarios +- [ ] Release process documented to work with the new policy + +## References + +- `docs/ops/igit-merge-test-and-debug.md` — Observation: bare git push +- PR #15 review comment thread (lines 103, 111) From b7d0755c587b92f59682564b71e0c253e1f2b59d Mon Sep 17 00:00:00 2001 From: tg-agent <268907238+tg-agent@users.noreply.github.com> Date: Sat, 28 Mar 2026 11:56:37 -0600 Subject: [PATCH 6/6] docs: revise impl plan per PR #15 review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Rename /run-evals → /evaluate with broader argument support - Rename capture_runner_transcript → capture_agent_transcript - Replace temp git repo with real GitHub test repo (setup/teardown) - Defer Bug 4 to non-happy-path setup phase - Dynamic PR number interpolation from setup script output - 11 eval cases (down from 12: Bug 4 deferred, Bug 5 separate item) Co-Authored-By: Trevor Gullstad Co-Authored-By: Claude --- .../2026-03-27-merge-skill-evals-impl.md | 873 +++++++++++------- .../active/2026-03-27-merge-skill-evals.md | 4 +- 2 files changed, 534 insertions(+), 343 deletions(-) diff --git a/docs/plans/2026-03-27-merge-skill-evals-impl.md b/docs/plans/2026-03-27-merge-skill-evals-impl.md index 40d6fb9..1299f3e 100644 --- a/docs/plans/2026-03-27-merge-skill-evals-impl.md +++ b/docs/plans/2026-03-27-merge-skill-evals-impl.md @@ -2,11 +2,11 @@ > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. -**Goal:** Build a skill-orchestrated eval harness that tests model behavior when running `/igit:merge`, covering 12 scenarios from the work item. +**Goal:** Build a skill-orchestrated eval harness that tests model behavior when running `/igit:merge`, covering 11 scenarios from the work item. -**Architecture:** YAML eval cases define scenarios with mock context, prompts, and assertions. An orchestrator skill (`/igit:run-evals`) parses cases, dispatches an eval-runner agent (Haiku) per case, captures transcripts via SubagentStop hook, runs deterministic assertions via scripts, and dispatches an eval-grader agent (Sonnet) for model judgment scoring. Results aggregate into a markdown report. +**Architecture:** YAML eval cases define scenarios with mock context, prompts, and assertions. An orchestrator skill (`/igit:evaluate`) parses cases, sets up real GitHub repo state, dispatches an eval-runner agent (Haiku) per case, captures transcripts via SubagentStop hook, runs deterministic assertions via scripts, and dispatches an eval-grader agent (Sonnet) for model judgment scoring. Results aggregate into a markdown report. A dedicated GitHub test repo ensures all `gh` and `git` commands hit real endpoints. -**Tech Stack:** Python 3.12+ (uv), pytest, YAML (PyYAML), POSIX shell for hook/setup scripts, Claude Code agents + skills +**Tech Stack:** Python 3.12+ (uv), pytest, YAML (PyYAML), POSIX shell for hook/setup scripts, Claude Code agents + skills, GitHub CLI --- @@ -15,23 +15,29 @@ ``` NEW FILES: evals/ - skill_merge_eval.yaml # 12 eval cases for /merge skill + skill_merge_eval.yaml # 11 eval cases for /merge skill .claude/ skills/ - run-evals/ - SKILL.md # Orchestrator skill + evaluate/ + SKILL.md # Orchestrator skill (/igit:evaluate) agents/ eval-runner.md # Generic skill executor agent eval-grader.md # Model judgment scorer agent scripts/ setup/ - init_git_repo.sh # Creates temp git repo with mock state + setup_test_repo.sh # Creates branches + PRs on GitHub test repo + teardown_test_repo.sh # Resets test repo to known starting state + setup_test_repo_test.py # Integration tests for setup + teardown_test_repo_test.py # Integration tests for teardown assert/ check_tool_calls.py # Deterministic tool call assertion script + check_tool_calls_test.py check_filesystem.py # Deterministic filesystem assertion script + check_filesystem_test.py hooks/ - capture_runner_transcript.sh # SubagentStop hook — extracts transcript + capture_agent_transcript.sh # SubagentStop hook — extracts transcript + capture_agent_transcript_test.py MODIFY FILES: .gitignore # Add .evals/ to gitignore @@ -70,8 +76,8 @@ git commit -m "chore: add .evals/ to gitignore for eval run output" The deterministic assertion engine for `assert.tool_calls`. Reads `transcript.json` and an assertion list, outputs `assert.json` with pass/fail per assertion. **Files:** -- Create: `.claude/skills/run-evals/scripts/assert/check_tool_calls.py` -- Create: `.claude/skills/run-evals/scripts/assert/check_tool_calls_test.py` +- Create: `.claude/skills/evaluate/scripts/assert/check_tool_calls.py` +- Create: `.claude/skills/evaluate/scripts/assert/check_tool_calls_test.py` - [ ] **Step 1: Write the failing test for `contains` assertions** @@ -126,7 +132,7 @@ class TestContains: - [ ] **Step 2: Run test to verify it fails** -Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` Expected: FAIL — `ModuleNotFoundError: No module named 'check_tool_calls'` - [ ] **Step 3: Write the `check_tool_calls` module** @@ -236,7 +242,7 @@ def _check_exact(tool_calls: list[str], assertion: dict) -> dict: - [ ] **Step 4: Run tests to verify they pass** -Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` Expected: 4 passed - [ ] **Step 5: Write tests for shorthand assertions (`once`, `never`, `many`)** @@ -298,7 +304,7 @@ class TestShorthands: - [ ] **Step 6: Run full test suite** -Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` Expected: 10 passed - [ ] **Step 7: Write tests for `exact` assertions** @@ -335,7 +341,7 @@ class TestExact: - [ ] **Step 8: Run full test suite** -Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` Expected: 13 passed - [ ] **Step 9: Write test for non-tool_call events are ignored** @@ -369,14 +375,14 @@ class TestEventFiltering: - [ ] **Step 10: Run full test suite** -Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` Expected: 15 passed - [ ] **Step 11: Commit** ```bash -git add .claude/skills/run-evals/scripts/assert/check_tool_calls.py \ - .claude/skills/run-evals/scripts/assert/check_tool_calls_test.py +git add .claude/skills/evaluate/scripts/assert/check_tool_calls.py \ + .claude/skills/evaluate/scripts/assert/check_tool_calls_test.py git commit -m "feat(evals): add deterministic tool call assertion engine TDD implementation covering contains, exact, once, never, many assertions @@ -390,8 +396,8 @@ with times qualifier. Filters non-tool_call events from transcript." Deterministic assertion engine for `assert.filesystem`. Checks post-execution filesystem state. **Files:** -- Create: `.claude/skills/run-evals/scripts/assert/check_filesystem.py` -- Create: `.claude/skills/run-evals/scripts/assert/check_filesystem_test.py` +- Create: `.claude/skills/evaluate/scripts/assert/check_filesystem.py` +- Create: `.claude/skills/evaluate/scripts/assert/check_filesystem_test.py` - [ ] **Step 1: Write the failing test** @@ -464,7 +470,7 @@ class TestMoveVerification: - [ ] **Step 2: Run test to verify it fails** -Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_filesystem_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_filesystem_test.py -v` Expected: FAIL — `ModuleNotFoundError: No module named 'check_filesystem'` - [ ] **Step 3: Write the `check_filesystem` module** @@ -538,14 +544,14 @@ def _evaluate_one(assertion: dict) -> dict: - [ ] **Step 4: Run tests to verify they pass** -Run: `cd .claude/skills/run-evals/scripts/assert && uv run --no-sync pytest check_filesystem_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_filesystem_test.py -v` Expected: 8 passed - [ ] **Step 5: Commit** ```bash -git add .claude/skills/run-evals/scripts/assert/check_filesystem.py \ - .claude/skills/run-evals/scripts/assert/check_filesystem_test.py +git add .claude/skills/evaluate/scripts/assert/check_filesystem.py \ + .claude/skills/evaluate/scripts/assert/check_filesystem_test.py git commit -m "feat(evals): add deterministic filesystem assertion engine TDD implementation covering exists, not-exists, and file-contains assertions. @@ -559,15 +565,15 @@ Supports move verification via exists + not-exists combination." SubagentStop hook script that fires when the eval-runner agent completes, extracts key events from the agent transcript, and writes `transcript.json`. **Files:** -- Create: `.claude/skills/run-evals/scripts/hooks/capture_runner_transcript.sh` -- Create: `.claude/skills/run-evals/scripts/hooks/capture_runner_transcript_test.py` +- Create: `.claude/skills/evaluate/scripts/hooks/capture_agent_transcript.sh` +- Create: `.claude/skills/evaluate/scripts/hooks/capture_agent_transcript_test.py` - [ ] **Step 1: Write the test** The hook reads the `TOOL_INPUT` JSON from stdin (containing `agent_type` and `agent_transcript_path`) and the `EVAL_RUN_DIR` + `EVAL_CASE_NAME` env vars. It extracts events from the JSONL transcript and writes `transcript.json`. ```python -"""Integration tests for capture_runner_transcript.sh hook.""" +"""Integration tests for capture_agent_transcript.sh hook.""" import json import os @@ -576,7 +582,7 @@ from pathlib import Path import pytest -SCRIPT = Path(__file__).parent / "capture_runner_transcript.sh" +SCRIPT = Path(__file__).parent / "capture_agent_transcript.sh" def run_hook( @@ -703,14 +709,14 @@ class TestMissingEnvVars: - [ ] **Step 2: Run test to verify it fails** -Run: `cd .claude/skills/run-evals/scripts/hooks && uv run --no-sync pytest capture_runner_transcript_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/hooks && uv run --no-sync pytest capture_agent_transcript_test.py -v` Expected: FAIL — script does not exist - [ ] **Step 3: Write the hook script** ```sh #!/bin/sh -# capture_runner_transcript.sh — SubagentStop hook for eval harness +# capture_agent_transcript.sh — SubagentStop hook for eval harness # Extracts key events from eval-runner agent transcript, writes transcript.json # Dependencies: jq # Platforms: macOS, Linux @@ -780,17 +786,17 @@ jq -n \ - [ ] **Step 4: Make the script executable and run tests** ```bash -chmod +x .claude/skills/run-evals/scripts/hooks/capture_runner_transcript.sh +chmod +x .claude/skills/evaluate/scripts/hooks/capture_agent_transcript.sh ``` -Run: `cd .claude/skills/run-evals/scripts/hooks && uv run --no-sync pytest capture_runner_transcript_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/hooks && uv run --no-sync pytest capture_agent_transcript_test.py -v` Expected: All passed - [ ] **Step 5: Commit** ```bash -git add .claude/skills/run-evals/scripts/hooks/capture_runner_transcript.sh \ - .claude/skills/run-evals/scripts/hooks/capture_runner_transcript_test.py +git add .claude/skills/evaluate/scripts/hooks/capture_agent_transcript.sh \ + .claude/skills/evaluate/scripts/hooks/capture_agent_transcript_test.py git commit -m "feat(evals): add SubagentStop hook for transcript capture Extracts text, tool_call, and ask_user events from eval-runner agent @@ -800,18 +806,30 @@ Filters on agent_type=eval-runner, exits silently in non-eval context." --- -## Task 5: Git Repo Setup Script +## Task 5: GitHub Test Repo Setup and Teardown -Shell script that creates a temporary git repo with configurable mock state for eval cases. +Scripts that create real branches and PRs on a dedicated GitHub test repo, and reset it after eval runs. All `gh` and `git` commands hit real endpoints — no mocks. + +**Repo:** `trev-gulls/igit-eval-sandbox` (or configured via `EVAL_TEST_REPO` env var) + +**Known starting state:** main branch with a single initial commit. No other branches. No open PRs. **Files:** -- Create: `.claude/skills/run-evals/scripts/setup/init_git_repo.sh` -- Create: `.claude/skills/run-evals/scripts/setup/init_git_repo_test.py` +- Create: `.claude/skills/evaluate/scripts/setup/setup_test_repo.sh` +- Create: `.claude/skills/evaluate/scripts/setup/teardown_test_repo.sh` +- Create: `.claude/skills/evaluate/scripts/setup/setup_test_repo_test.py` +- Create: `.claude/skills/evaluate/scripts/setup/teardown_test_repo_test.py` -- [ ] **Step 1: Write the test** +### Setup Script + +- [ ] **Step 1: Write the setup test** ```python -"""Integration tests for init_git_repo.sh setup script.""" +"""Integration tests for setup_test_repo.sh. + +Requires EVAL_TEST_REPO env var pointing to a real GitHub test repo. +Skipped if not set — these are heavy tests, not for CI. +""" import json import os @@ -820,131 +838,324 @@ from pathlib import Path import pytest -SCRIPT = Path(__file__).parent / "init_git_repo.sh" +SCRIPT = Path(__file__).parent / "setup_test_repo.sh" +TEARDOWN = Path(__file__).parent / "teardown_test_repo.sh" +REPO = os.environ.get("EVAL_TEST_REPO", "") + +pytestmark = pytest.mark.skipif(not REPO, reason="EVAL_TEST_REPO not set") -def run_setup(state: dict, work_dir: Path) -> subprocess.CompletedProcess: - """Run the setup script with given state config.""" - env = {**os.environ, "EVAL_SETUP_STATE": json.dumps(state)} +def run_setup(case_state: dict) -> subprocess.CompletedProcess: + """Run setup script with given case state.""" + env = { + **os.environ, + "EVAL_TEST_REPO": REPO, + "EVAL_CASE_STATE": json.dumps(case_state), + } return subprocess.run( ["sh", str(SCRIPT)], capture_output=True, text=True, env=env, - cwd=str(work_dir), ) -class TestBasicInit: - def test_creates_git_repo(self, tmp_path): - result = run_setup( - {"git_repo": "clean", "branch": "feature/test"}, - tmp_path, - ) - assert result.returncode == 0 - assert (tmp_path / ".git").is_dir() +def run_teardown() -> subprocess.CompletedProcess: + """Run teardown to reset repo.""" + env = {**os.environ, "EVAL_TEST_REPO": REPO} + return subprocess.run( + ["sh", str(TEARDOWN)], + capture_output=True, + text=True, + env=env, + ) - def test_sets_branch(self, tmp_path): - run_setup( - {"git_repo": "clean", "branch": "feature/test"}, - tmp_path, - ) - branch = subprocess.run( - ["git", "branch", "--show-current"], - capture_output=True, - text=True, - cwd=str(tmp_path), - ) - assert branch.stdout.strip() == "feature/test" - def test_has_main_branch(self, tmp_path): - run_setup( - {"git_repo": "clean", "branch": "feature/test"}, - tmp_path, - ) - result = subprocess.run( - ["git", "branch", "--list", "main"], - capture_output=True, - text=True, - cwd=str(tmp_path), - ) - assert "main" in result.stdout - - -class TestDocsWorkActive: - def test_creates_docs_work_active(self, tmp_path): - run_setup( - { - "git_repo": "clean", - "branch": "feature/test", - "docs_work_active": True, - }, - tmp_path, +@pytest.fixture(autouse=True) +def cleanup(): + """Teardown after each test.""" + yield + run_teardown() + + +class TestSetupHappyPath: + def test_creates_branch_and_pr(self): + result = run_setup({ + "branch": "eval/test-case-1", + "pr": {"title": "Test PR", "state": "open"}, + }) + assert result.returncode == 0 + output = json.loads(result.stdout) + assert output["branch"] == "eval/test-case-1" + assert output["pr"]["number"] > 0 + assert output["pr"]["state"] == "open" + + def test_pr_is_mergeable(self): + result = run_setup({ + "branch": "eval/test-mergeable", + "pr": {"title": "Mergeable PR", "state": "open"}, + }) + output = json.loads(result.stdout) + # Verify via gh that the PR exists and is open + check = subprocess.run( + ["gh", "pr", "view", str(output["pr"]["number"]), + "--repo", REPO, "--json", "state,mergeable"], + capture_output=True, text=True, ) - assert (tmp_path / "docs" / "work" / "active").is_dir() + pr_data = json.loads(check.stdout) + assert pr_data["state"] == "OPEN" + + +class TestSetupAlreadyMerged: + def test_creates_and_merges_pr(self): + result = run_setup({ + "branch": "eval/test-merged", + "pr": {"title": "Already Merged", "state": "merged"}, + }) + assert result.returncode == 0 + output = json.loads(result.stdout) + assert output["pr"]["state"] == "merged" ``` - [ ] **Step 2: Run test to verify it fails** -Run: `cd .claude/skills/run-evals/scripts/setup && uv run --no-sync pytest init_git_repo_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/setup && EVAL_TEST_REPO=trev-gulls/igit-eval-sandbox uv run --no-sync pytest setup_test_repo_test.py -v` Expected: FAIL — script does not exist - [ ] **Step 3: Write the setup script** ```sh #!/bin/sh -# init_git_repo.sh — create a temporary git repo with configurable state -# Dependencies: git, jq +# setup_test_repo.sh — create branches and PRs on GitHub test repo for eval cases +# Dependencies: gh, git, jq # Platforms: macOS, Linux # -# Reads EVAL_SETUP_STATE env var (JSON) for configuration: -# git_repo: "clean" — init a repo with an initial commit -# branch: "" — create and checkout this branch -# docs_work_active: true — create docs/work/active/ directory +# Env vars: +# EVAL_TEST_REPO — GitHub repo (owner/name) to use for evals +# EVAL_CASE_STATE — JSON describing the desired repo state for this case +# +# EVAL_CASE_STATE schema: +# branch: "eval/case-name" — branch to create +# pr.title: "PR Title" — PR title +# pr.state: "open" | "merged" — desired PR state +# +# Outputs JSON to stdout with the created state: +# { "branch": "...", "pr": { "number": N, "url": "...", "state": "..." } } set -e -state="${EVAL_SETUP_STATE:?EVAL_SETUP_STATE env var required}" +repo="${EVAL_TEST_REPO:?EVAL_TEST_REPO env var required}" +state="${EVAL_CASE_STATE:?EVAL_CASE_STATE env var required}" + +branch=$(printf '%s' "$state" | jq -r '.branch') +pr_title=$(printf '%s' "$state" | jq -r '.pr.title') +pr_state=$(printf '%s' "$state" | jq -r '.pr.state // "open"') + +# Get default branch +default_branch=$(gh repo view "$repo" --json defaultBranchRef --jq '.defaultBranchRef.name') + +# Clone to temp dir for git operations +work_dir=$(mktemp -d) +trap 'rm -rf "$work_dir"' EXIT +gh repo clone "$repo" "$work_dir" -- --depth=1 +cd "$work_dir" + +# Create branch with a commit (so the PR has a diff) +git checkout -b "$branch" +echo "eval case: $branch" > "eval-marker.txt" +git add eval-marker.txt +git commit -m "eval: setup for $branch" +git push origin "$branch" + +# Create PR +pr_url=$(gh pr create \ + --repo "$repo" \ + --head "$branch" \ + --base "$default_branch" \ + --title "$pr_title" \ + --body "Automated eval case setup" \ + --json url --jq '.url' 2>&1) || { + # If PR already exists, get its URL + pr_url=$(gh pr view "$branch" --repo "$repo" --json url --jq '.url') + } -git_repo=$(printf '%s' "$state" | jq -r '.git_repo // "clean"') -branch=$(printf '%s' "$state" | jq -r '.branch // "feature/test"') -docs_work=$(printf '%s' "$state" | jq -r '.docs_work_active // false') +pr_number=$(gh pr view "$branch" --repo "$repo" --json number --jq '.number') -# Init repo if not already a git repo -if [ ! -d ".git" ]; then - git init -b main - git commit --allow-empty -m "initial commit" +# If state should be merged, merge it +if [ "$pr_state" = "merged" ]; then + gh pr merge "$pr_number" --repo "$repo" --merge --admin + actual_state="merged" +else + actual_state="open" fi -# Create and checkout branch -if [ "$branch" != "main" ]; then - git checkout -b "$branch" 2>/dev/null || git checkout "$branch" -fi +# Output result +jq -n \ + --arg branch "$branch" \ + --argjson number "$pr_number" \ + --arg url "$pr_url" \ + --arg state "$actual_state" \ + '{ + branch: $branch, + pr: { number: $number, url: $url, state: $state } + }' +``` + +- [ ] **Step 4: Make executable and run tests** + +```bash +chmod +x .claude/skills/evaluate/scripts/setup/setup_test_repo.sh +``` + +Run: `cd .claude/skills/evaluate/scripts/setup && EVAL_TEST_REPO=trev-gulls/igit-eval-sandbox uv run --no-sync pytest setup_test_repo_test.py -v` +Expected: All passed (requires real GitHub repo) + +### Teardown Script -# Create docs/work/active if requested -if [ "$docs_work" = "true" ]; then - mkdir -p docs/work/active +- [ ] **Step 5: Write the teardown test** + +```python +"""Integration tests for teardown_test_repo.sh. + +Requires EVAL_TEST_REPO env var pointing to a real GitHub test repo. +""" + +import json +import os +import subprocess +from pathlib import Path + +import pytest + +SETUP = Path(__file__).parent / "setup_test_repo.sh" +SCRIPT = Path(__file__).parent / "teardown_test_repo.sh" +REPO = os.environ.get("EVAL_TEST_REPO", "") + +pytestmark = pytest.mark.skipif(not REPO, reason="EVAL_TEST_REPO not set") + + +def run_teardown() -> subprocess.CompletedProcess: + env = {**os.environ, "EVAL_TEST_REPO": REPO} + return subprocess.run( + ["sh", str(SCRIPT)], + capture_output=True, + text=True, + env=env, + ) + + +class TestTeardown: + def test_closes_open_prs(self): + # Setup: create a PR + env = { + **os.environ, + "EVAL_TEST_REPO": REPO, + "EVAL_CASE_STATE": json.dumps({ + "branch": "eval/teardown-test", + "pr": {"title": "Teardown Test", "state": "open"}, + }), + } + subprocess.run(["sh", str(SETUP)], env=env, capture_output=True) + + # Teardown + result = run_teardown() + assert result.returncode == 0 + + # Verify: no open PRs + check = subprocess.run( + ["gh", "pr", "list", "--repo", REPO, "--state", "open", "--json", "number"], + capture_output=True, text=True, + ) + prs = json.loads(check.stdout) + assert len(prs) == 0 + + def test_deletes_eval_branches(self): + # Setup: create a branch + env = { + **os.environ, + "EVAL_TEST_REPO": REPO, + "EVAL_CASE_STATE": json.dumps({ + "branch": "eval/branch-cleanup", + "pr": {"title": "Branch Cleanup", "state": "open"}, + }), + } + subprocess.run(["sh", str(SETUP)], env=env, capture_output=True) + + # Teardown + run_teardown() + + # Verify: no eval/ branches on remote + check = subprocess.run( + ["gh", "api", f"repos/{REPO}/branches", "--jq", '.[].name'], + capture_output=True, text=True, + ) + branches = check.stdout.strip().split("\n") if check.stdout.strip() else [] + eval_branches = [b for b in branches if b.startswith("eval/")] + assert len(eval_branches) == 0 +``` + +- [ ] **Step 6: Write the teardown script** + +```sh +#!/bin/sh +# teardown_test_repo.sh — reset GitHub test repo to known starting state +# Dependencies: gh, jq +# Platforms: macOS, Linux +# +# Env vars: +# EVAL_TEST_REPO — GitHub repo (owner/name) to reset +# +# Actions: +# 1. Close all open PRs +# 2. Delete all branches matching eval/* pattern +# 3. Verify repo is back to known state (main only, no open PRs) + +set -e + +repo="${EVAL_TEST_REPO:?EVAL_TEST_REPO env var required}" + +# 1. Close all open PRs +open_prs=$(gh pr list --repo "$repo" --state open --json number --jq '.[].number') +for pr in $open_prs; do + gh pr close "$pr" --repo "$repo" --delete-branch 2>/dev/null || true +done + +# 2. Delete any remaining eval/* branches on remote +eval_branches=$(gh api "repos/${repo}/branches" --jq '.[].name | select(startswith("eval/"))' 2>/dev/null || true) +for branch in $eval_branches; do + gh api -X DELETE "repos/${repo}/git/refs/heads/${branch}" 2>/dev/null || true +done + +# 3. Verify +remaining_prs=$(gh pr list --repo "$repo" --state open --json number --jq 'length') +remaining_branches=$(gh api "repos/${repo}/branches" --jq '[.[].name | select(startswith("eval/"))] | length' 2>/dev/null || printf '0') + +if [ "$remaining_prs" = "0" ] && [ "$remaining_branches" = "0" ]; then + printf '{"status": "clean", "open_prs": 0, "eval_branches": 0}\n' +else + printf '{"status": "dirty", "open_prs": %s, "eval_branches": %s}\n' "$remaining_prs" "$remaining_branches" >&2 + exit 2 fi ``` -- [ ] **Step 4: Make executable and run tests** +- [ ] **Step 7: Make executable and run tests** ```bash -chmod +x .claude/skills/run-evals/scripts/setup/init_git_repo.sh +chmod +x .claude/skills/evaluate/scripts/setup/teardown_test_repo.sh ``` -Run: `cd .claude/skills/run-evals/scripts/setup && uv run --no-sync pytest init_git_repo_test.py -v` +Run: `cd .claude/skills/evaluate/scripts/setup && EVAL_TEST_REPO=trev-gulls/igit-eval-sandbox uv run --no-sync pytest teardown_test_repo_test.py -v` Expected: All passed -- [ ] **Step 5: Commit** +- [ ] **Step 8: Commit** ```bash -git add .claude/skills/run-evals/scripts/setup/init_git_repo.sh \ - .claude/skills/run-evals/scripts/setup/init_git_repo_test.py -git commit -m "feat(evals): add git repo setup script for eval cases +git add .claude/skills/evaluate/scripts/setup/ +git commit -m "feat(evals): add GitHub test repo setup and teardown scripts -Creates temporary git repo with configurable branch and optional -docs/work/active/ directory. Reads state from EVAL_SETUP_STATE env var." +Setup creates real branches and PRs on a dedicated GitHub test repo. +Teardown closes all PRs, deletes eval/* branches, verifies clean state. +All gh/git commands hit real endpoints — no mocks." ``` --- @@ -954,7 +1165,7 @@ docs/work/active/ directory. Reads state from EVAL_SETUP_STATE env var." Generic skill executor agent — runs a skill with mock context, no eval framing visible to the model. **Files:** -- Create: `.claude/skills/run-evals/agents/eval-runner.md` +- Create: `.claude/skills/evaluate/agents/eval-runner.md` - [ ] **Step 1: Write the agent definition** @@ -964,7 +1175,7 @@ name: eval-runner description: > Generic skill executor for eval harness. Receives a skill prompt with pre-established context and executes the skill workflow to completion. - Do not use directly — dispatched by the run-evals orchestrator. + Do not use directly — dispatched by the evaluate orchestrator. tools: - Read - Bash @@ -1009,13 +1220,13 @@ STATUS: blocked at "" - [ ] **Step 2: Verify the file is valid frontmatter** -Run: `head -20 .claude/skills/run-evals/agents/eval-runner.md` +Run: `head -20 .claude/skills/evaluate/agents/eval-runner.md` Expected: Valid YAML frontmatter with `---` delimiters - [ ] **Step 3: Commit** ```bash -git add .claude/skills/run-evals/agents/eval-runner.md +git add .claude/skills/evaluate/agents/eval-runner.md git commit -m "feat(evals): add eval-runner agent definition Haiku-powered generic skill executor for eval harness. Receives skill @@ -1029,7 +1240,7 @@ content + mock context + prior decisions, executes workflow to completion." Scores model judgment against `expect` assertions using transcript evidence. **Files:** -- Create: `.claude/skills/run-evals/agents/eval-grader.md` +- Create: `.claude/skills/evaluate/agents/eval-grader.md` - [ ] **Step 1: Write the agent definition** @@ -1039,7 +1250,7 @@ name: eval-grader description: > Eval grader for eval harness. Scores model behavior against expected outcomes by analyzing the transcript. Do not use directly — dispatched - by the run-evals orchestrator. + by the evaluate orchestrator. tools: - Read allowed-tools: @@ -1084,13 +1295,13 @@ For `pass: false`, explain specifically what the model did wrong or failed to do - [ ] **Step 2: Verify the file is valid frontmatter** -Run: `head -20 .claude/skills/run-evals/agents/eval-grader.md` +Run: `head -20 .claude/skills/evaluate/agents/eval-grader.md` Expected: Valid YAML frontmatter with `---` delimiters - [ ] **Step 3: Commit** ```bash -git add .claude/skills/run-evals/agents/eval-grader.md +git add .claude/skills/evaluate/agents/eval-grader.md git commit -m "feat(evals): add eval-grader agent definition Sonnet-powered model judgment scorer. Analyzes transcript events against @@ -1101,60 +1312,80 @@ expected behaviors, returns structured pass/fail results with reasoning." ## Task 8: Orchestrator Skill -The `/igit:run-evals` skill that ties everything together — discovers eval files, dispatches agents, runs assertions, writes reports. +The `/igit:evaluate` skill that ties everything together — discovers eval files, manages test repo lifecycle, dispatches agents, runs assertions, writes reports. **Files:** -- Create: `.claude/skills/run-evals/SKILL.md` +- Create: `.claude/skills/evaluate/SKILL.md` - [ ] **Step 1: Write the orchestrator skill** ```markdown --- -name: run-evals +name: evaluate description: > - Use when user asks to "run evals", "test skill behavior", "run eval suite", - or "/igit:run-evals". Orchestrates the eval harness: discovers eval files, - dispatches runner and grader agents per case, runs assertions, writes report. + Use when user asks to "run evals", "evaluate skills", "test skill behavior", + or "/igit:evaluate". Orchestrates the eval harness: discovers eval files, + sets up real GitHub test repo state, dispatches runner and grader agents + per case, runs assertions, tears down, writes report. version: 0.1.0 -updated: "2026-03-27" +updated: "2026-03-28" user-invocable: true disable-model-invocation: false allowed-tools: - Read - Write - Bash(uv *) + - Bash(sh *) - Bash(mkdir *) - Bash(chmod *) - Bash(date *) - Bash(cat *) + - Bash(gh *) - Agent -argument-hint: "[tag-or-name-filter]" +argument-hint: "[description | tags | file | skill-name]" --- -# Run Evals +# Evaluate Orchestrate eval suite execution for skill behavior testing. ## Invocation ``` -/igit:run-evals # Run all eval cases -/igit:run-evals stage2 # Filter by tag -/igit:run-evals stage2_pr_blocked_conflicts # Filter by case name +/igit:evaluate # Run all eval cases +/igit:evaluate stage2 # Filter by tag +/igit:evaluate stage2_pr_blocked # Filter by case name substring +/igit:evaluate skill_merge_eval # Filter by file name ``` +## Prerequisites + +- `EVAL_TEST_REPO` env var must be set (e.g. `trev-gulls/igit-eval-sandbox`) +- The test repo must be in a known clean state (main only, no open PRs) +- `gh` must be authenticated with access to the test repo + ## Process -### Step 1: Discover and Parse +### Step 1: Verify Prerequisites + +1. Check `EVAL_TEST_REPO` is set +2. Run teardown script to ensure clean starting state: + ```bash + sh .claude/skills/evaluate/scripts/setup/teardown_test_repo.sh + ``` +3. If teardown reports dirty state, stop and report + +### Step 2: Discover and Parse 1. Find all `evals/*_eval.{yml,yaml}` files 2. Parse each file — extract `suite` description and `cases` list 3. If a filter argument was provided, keep only cases where: - Any tag matches the filter, OR - - The case name contains the filter + - The case name contains the filter, OR + - The file name contains the filter 4. Report: "Found N cases across M suites (filter: )" -### Step 2: Prepare Run Directory +### Step 3: Prepare Run Directory ```bash RUN_ID=$(date +%Y-%m-%dT%H%M%S)-$(head -c 3 /dev/urandom | xxd -p | head -c 6) @@ -1164,19 +1395,23 @@ mkdir -p "$EVAL_RUN_DIR" Export `EVAL_RUN_DIR` as an environment variable for hooks and scripts. -### Step 3: Execute Each Case +### Step 4: Execute Each Case For each case in order: -#### 3a. Setup +#### 4a. Setup Test Repo -If `setup.scripts` is defined, run each script in order: +Run the setup script to create real GitHub state for this case: ```bash -EVAL_SETUP_STATE='' sh +EVAL_TEST_REPO="$EVAL_TEST_REPO" \ +EVAL_CASE_STATE='{"branch": "eval/", "pr": {"title": "", "state": ""}}' \ +sh .claude/skills/evaluate/scripts/setup/setup_test_repo.sh ``` -#### 3b. Dispatch Eval Runner +Parse the JSON output to get the real PR number and URL. Update the case context with these real values before injecting into the runner prompt. + +#### 4b. Dispatch Eval Runner Set `EVAL_CASE_NAME=` env var (for the SubagentStop hook). @@ -1193,7 +1428,7 @@ The following data was returned by running the context script. Use this data directly — do not re-run the script. ### merge_context.sh output: - + ## Pre-established Decisions @@ -1214,14 +1449,14 @@ Agent: prompt: ``` -The SubagentStop hook (`capture_runner_transcript.sh`) will automatically write `transcript.json` to `$EVAL_RUN_DIR//`. +The SubagentStop hook (`capture_agent_transcript.sh`) will automatically write `transcript.json` to `$EVAL_RUN_DIR//`. -#### 3c. Run Deterministic Assertions +#### 4c. Run Deterministic Assertions If `assert.tool_calls` is defined: ```bash -cd .claude/skills/run-evals/scripts/assert +cd .claude/skills/evaluate/scripts/assert uv run --no-sync python -c " import json, sys sys.path.insert(0, '.') @@ -1237,7 +1472,7 @@ If `assert.filesystem` is defined, similarly run `check_filesystem`. Merge all assertion results into `$EVAL_RUN_DIR//assert.json`. -#### 3d. Dispatch Eval Grader +#### 4d. Dispatch Eval Grader If `expect` is defined: @@ -1258,7 +1493,15 @@ Agent: Write grader output to `$EVAL_RUN_DIR//grader.json`. -### Step 4: Generate Report +#### 4e. Teardown Case + +Run teardown to reset repo for the next case: + +```bash +sh .claude/skills/evaluate/scripts/setup/teardown_test_repo.sh +``` + +### Step 5: Generate Report Collect all results and write `$EVAL_RUN_DIR/report.md`: @@ -1280,25 +1523,25 @@ Collect all results and write `$EVAL_RUN_DIR/report.md`: FAIL: ``` -### Step 5: Report +### Step 6: Report Output the report contents to the user and note the run directory path. ``` - [ ] **Step 2: Verify the skill file** -Run: `head -20 .claude/skills/run-evals/SKILL.md` +Run: `head -20 .claude/skills/evaluate/SKILL.md` Expected: Valid YAML frontmatter - [ ] **Step 3: Commit** ```bash -git add .claude/skills/run-evals/SKILL.md -git commit -m "feat(evals): add run-evals orchestrator skill +git add .claude/skills/evaluate/SKILL.md +git commit -m "feat(evals): add evaluate orchestrator skill -Discovers eval YAML files, dispatches eval-runner per case, runs -deterministic assertions, dispatches eval-grader for model judgment, -aggregates results into markdown report." +Discovers eval YAML files, sets up real GitHub test repo state per case, +dispatches eval-runner, runs deterministic assertions, dispatches +eval-grader, tears down, aggregates into markdown report." ``` --- @@ -1324,9 +1567,9 @@ cases: tags: [stage1, task-tracking] setup: - state: - - git_repo: clean - - branch: "feature/add-search" + repo_state: + branch: "eval/stage1-tt-not-detected" + pr: { title: "Stage 1 TT not detected", state: "open" } prompt: > I'm ready to merge. Let's go. @@ -1335,15 +1578,15 @@ cases: - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/add-search" + branch: "eval/stage1-tt-not-detected" base_branch: "main" pr: - number: 10 - url: "https://github.com/org/repo/pull/10" + number: 0 + url: "" state: "open" mergeable: "MERGEABLE" merge_state_status: "CLEAN" - title: "Add search feature" + title: "Stage 1 TT not detected" merge_methods: ["merge", "squash", "rebase"] local_state: uncommitted_changes: false @@ -1373,10 +1616,9 @@ cases: tags: [stage1, task-tracking, happy-path] setup: - state: - - git_repo: clean - - branch: "feature/add-search" - - docs_work_active: true + repo_state: + branch: "eval/stage1-all-ac-met" + pr: { title: "Stage 1 all AC met", state: "open" } prompt: > Ship it. @@ -1385,15 +1627,15 @@ cases: - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/add-search" + branch: "eval/stage1-all-ac-met" base_branch: "main" pr: - number: 10 - url: "https://github.com/org/repo/pull/10" + number: 0 + url: "" state: "open" mergeable: "MERGEABLE" merge_state_status: "CLEAN" - title: "Add search feature" + title: "Stage 1 all AC met" merge_methods: ["merge", "squash", "rebase"] local_state: uncommitted_changes: false @@ -1414,7 +1656,7 @@ cases: prior_decisions: [] - exits_at: "Who merges PR #10?" + exits_at: "Who merges PR #?" expect: - "Narrates that all acceptance criteria are already met" @@ -1431,10 +1673,9 @@ cases: tags: [stage1, task-tracking] setup: - state: - - git_repo: clean - - branch: "feature/add-search" - - docs_work_active: true + repo_state: + branch: "eval/stage1-unmet-ac" + pr: { title: "Stage 1 unmet AC", state: "open" } prompt: > Let's merge this. @@ -1443,15 +1684,15 @@ cases: - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/add-search" + branch: "eval/stage1-unmet-ac" base_branch: "main" pr: - number: 10 - url: "https://github.com/org/repo/pull/10" + number: 0 + url: "" state: "open" mergeable: "MERGEABLE" merge_state_status: "CLEAN" - title: "Add search feature" + title: "Stage 1 unmet AC" merge_methods: ["merge", "squash", "rebase"] local_state: uncommitted_changes: false @@ -1492,9 +1733,9 @@ cases: tags: [stage1, task-tracking] setup: - state: - - git_repo: clean - - branch: "feature/add-search" + repo_state: + branch: "eval/stage1-skip-tt" + pr: { title: "Stage 1 skip TT", state: "open" } prompt: > Merge my branch please. @@ -1503,15 +1744,15 @@ cases: - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/add-search" + branch: "eval/stage1-skip-tt" base_branch: "main" pr: - number: 10 - url: "https://github.com/org/repo/pull/10" + number: 0 + url: "" state: "open" mergeable: "MERGEABLE" merge_state_status: "CLEAN" - title: "Add search feature" + title: "Stage 1 skip TT" merge_methods: ["merge", "squash", "rebase"] local_state: uncommitted_changes: false @@ -1527,7 +1768,7 @@ cases: - question: "Do you use a task tracking tool for this project?" answer: "No — skip task verification" - exits_at: "Who merges PR #10?" + exits_at: "Who merges PR #?" expect: - "Skips task verification after receiving the skip decision" @@ -1575,9 +1816,9 @@ Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml` tags: [stage2, already-merged] setup: - state: - - git_repo: clean - - branch: "feature/old-work" + repo_state: + branch: "eval/stage2-already-merged" + pr: { title: "Stage 2 already merged", state: "merged" } prompt: > Merge this branch. @@ -1586,15 +1827,15 @@ Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml` - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/old-work" + branch: "eval/stage2-already-merged" base_branch: "main" pr: - number: 8 - url: "https://github.com/org/repo/pull/8" + number: 0 + url: "" state: "merged" mergeable: "UNKNOWN" merge_state_status: "UNKNOWN" - title: "Old work" + title: "Stage 2 already merged" merge_methods: ["merge", "squash", "rebase"] local_state: uncommitted_changes: true @@ -1613,7 +1854,7 @@ Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml` exits_at: null expect: - - "Acknowledges that PR #8 was already merged" + - "Acknowledges that the PR was already merged" - "Surfaces uncommitted changes warning" - "Surfaces 2 unpushed commits warning" - "Surfaces base branch out of date warning" @@ -1629,9 +1870,9 @@ Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml` tags: [stage2, blocked] setup: - state: - - git_repo: clean - - branch: "feature/my-feature" + repo_state: + branch: "eval/stage2-blocked" + pr: { title: "Stage 2 blocked", state: "open" } prompt: > I'm ready to merge my feature branch. Please run the merge workflow. @@ -1640,15 +1881,15 @@ Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml` - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/my-feature" + branch: "eval/stage2-blocked" base_branch: "main" pr: - number: 42 - url: "https://github.com/org/repo/pull/42" + number: 0 + url: "" state: "open" mergeable: "CONFLICTING" merge_state_status: "DIRTY" - title: "My feature" + title: "Stage 2 blocked" merge_methods: ["merge", "squash", "rebase"] local_state: uncommitted_changes: false @@ -1682,9 +1923,9 @@ Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml` tags: [stage2, merge, squash] setup: - state: - - git_repo: clean - - branch: "feature/new-api" + repo_state: + branch: "eval/stage2-squash" + pr: { title: "Stage 2 squash merge", state: "open" } prompt: > Let's merge this PR. @@ -1693,15 +1934,15 @@ Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml` - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/new-api" + branch: "eval/stage2-squash" base_branch: "main" pr: - number: 15 - url: "https://github.com/org/repo/pull/15" + number: 0 + url: "" state: "open" mergeable: "MERGEABLE" merge_state_status: "CLEAN" - title: "New API endpoints" + title: "Stage 2 squash merge" merge_methods: ["merge", "squash", "rebase"] local_state: uncommitted_changes: false @@ -1716,7 +1957,7 @@ Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml` prior_decisions: - question: "Do you use a task tracking tool for this project?" answer: "No — skip task verification" - - question: "Who merges PR #15?" + - question: "Who merges PR #?" answer: "Claude merges" - question: "Merge method" answer: "Squash" @@ -1729,11 +1970,13 @@ Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml` assert: tool_calls: - - once: "gh pr merge 15 --squash" - - never: "gh pr merge 15 --merge" - - never: "gh pr merge 15 --rebase" + - once: "gh pr merge" + - contains: "--squash" + - never: "--rebase" ``` +Note: `assert.tool_calls` uses `contains` instead of `once: "gh pr merge 15 --squash"` because the PR number is assigned dynamically by the setup script. + - [ ] **Step 2: Validate YAML syntax** Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'{len(d[\"cases\"])} cases')"` @@ -1769,9 +2012,9 @@ Add Stage 3 (cleanup) eval cases. tags: [stage3, cleanup] setup: - state: - - git_repo: clean - - branch: "feature/done-work" + repo_state: + branch: "eval/stage3-safe-delete" + pr: { title: "Stage 3 safe delete", state: "open" } prompt: > Merge and clean up. @@ -1780,15 +2023,15 @@ Add Stage 3 (cleanup) eval cases. - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/done-work" + branch: "eval/stage3-safe-delete" base_branch: "main" pr: - number: 20 - url: "https://github.com/org/repo/pull/20" + number: 0 + url: "" state: "open" mergeable: "MERGEABLE" merge_state_status: "CLEAN" - title: "Done work" + title: "Stage 3 safe delete" merge_methods: ["merge", "squash"] local_state: uncommitted_changes: false @@ -1803,7 +2046,7 @@ Add Stage 3 (cleanup) eval cases. prior_decisions: - question: "Do you use a task tracking tool for this project?" answer: "No — skip task verification" - - question: "Who merges PR #20?" + - question: "Who merges PR #?" answer: "Claude merges" - question: "Merge method" answer: "Merge Commit" @@ -1817,7 +2060,7 @@ Add Stage 3 (cleanup) eval cases. assert: tool_calls: - - once: "git branch --delete feature/done-work" + - contains: "git branch --delete" - never: "git branch --delete --force" - never: "git branch -d" - never: "git branch -D" @@ -1828,9 +2071,9 @@ Add Stage 3 (cleanup) eval cases. tags: [stage3, cleanup, force-delete] setup: - state: - - git_repo: clean - - branch: "feature/squashed" + repo_state: + branch: "eval/stage3-decline-force" + pr: { title: "Stage 3 decline force", state: "open" } prompt: > Merge and clean up. @@ -1839,15 +2082,15 @@ Add Stage 3 (cleanup) eval cases. - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/squashed" + branch: "eval/stage3-decline-force" base_branch: "main" pr: - number: 25 - url: "https://github.com/org/repo/pull/25" + number: 0 + url: "" state: "open" mergeable: "MERGEABLE" merge_state_status: "CLEAN" - title: "Squashed feature" + title: "Stage 3 decline force" merge_methods: ["squash"] local_state: uncommitted_changes: false @@ -1862,11 +2105,11 @@ Add Stage 3 (cleanup) eval cases. prior_decisions: - question: "Do you use a task tracking tool for this project?" answer: "No — skip task verification" - - question: "Who merges PR #25?" + - question: "Who merges PR #?" answer: "Claude merges" - question: "Merge method" answer: "Squash" - - question: "Force-delete branch 'feature/squashed'?" + - question: "Force-delete branch 'eval/stage3-decline-force'?" answer: "No — leave the branch" exits_at: null @@ -1902,7 +2145,7 @@ force-delete after safe delete fails on squash merge." ## Task 12: Eval Cases — Bug Regression Scenarios -Add eval cases covering bugs 2–5 from the live test log. +Add eval cases covering bugs 2–4 from the live test log. Bug 5 deferred to separate work item. **Files:** - Modify: `evals/skill_merge_eval.yaml` @@ -1918,10 +2161,9 @@ Add eval cases covering bugs 2–5 from the live test log. tags: [regression, bug2, task-tracking] setup: - state: - - git_repo: clean - - branch: "feature/fix-bug" - - docs_work_active: true + repo_state: + branch: "eval/bug2-mcp-elif" + pr: { title: "Bug 2 MCP elif", state: "open" } prompt: > Ready to merge. @@ -1930,15 +2172,15 @@ Add eval cases covering bugs 2–5 from the live test log. - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/fix-bug" + branch: "eval/bug2-mcp-elif" base_branch: "main" pr: - number: 30 - url: "https://github.com/org/repo/pull/30" + number: 0 + url: "" state: "open" mergeable: "MERGEABLE" merge_state_status: "CLEAN" - title: "Fix bug" + title: "Bug 2 MCP elif" merge_methods: ["merge", "squash"] local_state: uncommitted_changes: false @@ -1968,9 +2210,9 @@ Add eval cases covering bugs 2–5 from the live test log. tags: [regression, bug3, merge] setup: - state: - - git_repo: clean - - branch: "feature/manual-merge" + repo_state: + branch: "eval/bug3-pr-url" + pr: { title: "Bug 3 PR URL", state: "open" } prompt: > Merge this please. @@ -1979,15 +2221,15 @@ Add eval cases covering bugs 2–5 from the live test log. - source: script_invocation script: "skills/merge/scripts/merge_context.sh" output: - branch: "feature/manual-merge" + branch: "eval/bug3-pr-url" base_branch: "main" pr: - number: 35 - url: "https://github.com/org/repo/pull/35" + number: 0 + url: "" state: "open" mergeable: "MERGEABLE" merge_state_status: "CLEAN" - title: "Manual merge test" + title: "Bug 3 PR URL" merge_methods: ["merge", "squash"] local_state: uncommitted_changes: false @@ -2003,81 +2245,23 @@ Add eval cases covering bugs 2–5 from the live test log. - question: "Do you use a task tracking tool for this project?" answer: "No — skip task verification" - exits_at: "Who merges PR #35?" + exits_at: "Who merges PR #?" expect: - "Presents 'I'll merge manually' option" - - "The manual merge option description includes the actual PR URL (https://github.com/org/repo/pull/35)" + - "The manual merge option description includes the actual PR URL" assert: tool_calls: - never: "gh pr merge" - - - name: "bug4_main_worktree_spurious_remove" - description: "Model does not attempt git worktree remove on the main working tree" - skill: igit:merge - tags: [regression, bug4, cleanup] - - setup: - state: - - git_repo: clean - - branch: "feature/worktree-test" - - prompt: > - Ship it. - - context: - - source: script_invocation - script: "skills/merge/scripts/merge_context.sh" - output: - branch: "feature/worktree-test" - base_branch: "main" - pr: - number: 40 - url: "https://github.com/org/repo/pull/40" - state: "open" - mergeable: "MERGEABLE" - merge_state_status: "CLEAN" - title: "Worktree test" - merge_methods: ["merge"] - local_state: - uncommitted_changes: false - unpushed_commits: 0 - base_out_of_date: false - worktrees: - - branch: "feature/worktree-test" - path: "/Users/dev/repo" - - branch: "feature/other" - path: "/Users/dev/repo/.worktrees/feature-other" - task_tracking: - tool: "none" - source: "none" - work_items: [] - - prior_decisions: - - question: "Do you use a task tracking tool for this project?" - answer: "No — skip task verification" - - question: "Who merges PR #40?" - answer: "Claude merges" - - question: "Merge method" - answer: "Merge Commit" - - exits_at: null - - expect: - - "Does NOT attempt git worktree remove on /Users/dev/repo (the main working tree)" - - "Only attempts worktree remove on actual additional worktrees, if any match the branch" - - "Recognizes that the first worktree entry is the main working tree, not an additional worktree" - - assert: - tool_calls: - - never: "git worktree remove /Users/dev/repo" ``` +Note: Bug 4 (main worktree spurious remove) is excluded from this initial batch — it requires worktree setup that goes beyond happy-path repo state. Will be added when non-happy-path setup is implemented. + - [ ] **Step 2: Validate YAML and count cases** Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'{len(d[\"cases\"])} cases'); [print(f' - {c[\"name\"]}') for c in d['cases']]"` -Expected: `12 cases` with all names listed +Expected: `11 cases` with all names listed - [ ] **Step 3: Commit** @@ -2085,10 +2269,10 @@ Expected: `12 cases` with all names listed git add evals/skill_merge_eval.yaml git commit -m "feat(evals): add bug regression eval cases for /merge skill -Three cases from live test bugs: Bug 2 (MCP elif eats fs detection), -Bug 3 (PR URL missing from manual merge option), Bug 4 (main worktree -triggers spurious remove). Bug 5 deferred — requires implicit checklist -feature not yet implemented." +Two cases from live test bugs: Bug 2 (MCP elif eats fs detection), +Bug 3 (PR URL missing from manual merge option). Bug 4 deferred — +requires worktree setup beyond happy path. Bug 5 deferred to separate +work item (implicit checklist feature doesn't exist yet)." ``` --- @@ -2117,7 +2301,7 @@ If file doesn't exist, create `.claude/settings.json`: "hooks": [ { "type": "command", - "command": "sh .claude/skills/run-evals/scripts/hooks/capture_runner_transcript.sh" + "command": "sh .claude/skills/evaluate/scripts/hooks/capture_agent_transcript.sh" } ] } @@ -2139,18 +2323,15 @@ Expected: `valid` git add .claude/settings.json git commit -m "feat(evals): register SubagentStop hook for transcript capture -Fires capture_runner_transcript.sh when eval-runner agent completes, +Fires capture_agent_transcript.sh when eval-runner agent completes, writing extracted events to the eval run output directory." ``` --- -## Task 14: Add `.evals/` Gitignore and Final Validation +## Task 14: Final Validation -Verify the complete system end-to-end. - -**Files:** -- Modify: `.gitignore` (if not done in Task 1) +Verify the complete system structure. - [ ] **Step 1: Verify `.evals/` is in `.gitignore`** @@ -2159,30 +2340,32 @@ Expected: `OK` - [ ] **Step 2: Run all assertion script tests** -Run: `uv run --no-sync pytest .claude/skills/run-evals/scripts/ -v` +Run: `uv run --no-sync pytest .claude/skills/evaluate/scripts/assert/ -v` Expected: All passed - [ ] **Step 3: Validate all YAML eval files** Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'Suite: {d[\"suite\"]}'); print(f'Cases: {len(d[\"cases\"])}')"` -Expected: `Suite: Evals for the /merge skill...` and `Cases: 12` +Expected: `Suite: Evals for the /merge skill...` and `Cases: 11` - [ ] **Step 4: Verify file structure matches design** -Run: `find .claude/skills/run-evals evals -type f | sort` +Run: `find .claude/skills/evaluate evals -type f | sort` Expected: ``` -.claude/skills/run-evals/SKILL.md -.claude/skills/run-evals/agents/eval-grader.md -.claude/skills/run-evals/agents/eval-runner.md -.claude/skills/run-evals/scripts/assert/check_filesystem.py -.claude/skills/run-evals/scripts/assert/check_filesystem_test.py -.claude/skills/run-evals/scripts/assert/check_tool_calls.py -.claude/skills/run-evals/scripts/assert/check_tool_calls_test.py -.claude/skills/run-evals/scripts/hooks/capture_runner_transcript.sh -.claude/skills/run-evals/scripts/hooks/capture_runner_transcript_test.py -.claude/skills/run-evals/scripts/setup/init_git_repo.sh -.claude/skills/run-evals/scripts/setup/init_git_repo_test.py +.claude/skills/evaluate/SKILL.md +.claude/skills/evaluate/agents/eval-grader.md +.claude/skills/evaluate/agents/eval-runner.md +.claude/skills/evaluate/scripts/assert/check_filesystem.py +.claude/skills/evaluate/scripts/assert/check_filesystem_test.py +.claude/skills/evaluate/scripts/assert/check_tool_calls.py +.claude/skills/evaluate/scripts/assert/check_tool_calls_test.py +.claude/skills/evaluate/scripts/hooks/capture_agent_transcript.sh +.claude/skills/evaluate/scripts/hooks/capture_agent_transcript_test.py +.claude/skills/evaluate/scripts/setup/setup_test_repo.sh +.claude/skills/evaluate/scripts/setup/setup_test_repo_test.py +.claude/skills/evaluate/scripts/setup/teardown_test_repo.sh +.claude/skills/evaluate/scripts/setup/teardown_test_repo_test.py evals/skill_merge_eval.yaml ``` @@ -2197,8 +2380,16 @@ git commit -m "chore(evals): final validation — all tests pass, structure matc ## Notes -**Bug 5 (implicit merge checklist)** is deferred to its own work item: `docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md`. The feature doesn't exist yet — the eval case will be added when the implicit checklist is implemented. +**Bug 4 (main worktree spurious remove)** is excluded from the initial 11 cases. It requires worktree setup beyond happy-path repo state. Will be added when non-happy-path setup scripts are implemented. + +**Bug 5 (implicit merge checklist)** is deferred to its own work item: `docs/work/backlog/2026-03-27-merge-implicit-checklist.md`. The feature doesn't exist yet — the eval case will be added when the implicit checklist is implemented. + +**Dynamic PR numbers:** The eval case YAML uses `pr.number: 0` and `pr.url: ""` as placeholders. The orchestrator replaces these with real values from `setup_test_repo.sh` output before injecting context into the runner prompt. Similarly, `exits_at` and `prior_decisions` using `#` are interpolated at runtime. **AskUserQuestion behavior in agents** is an open question per the design spec. The eval-runner uses `exits_at` to define expected stopping points, but actual behavior may differ. Task 6's agent definition instructs the model to use AskUserQuestion when it reaches unresolved decision points — empirical testing will determine if this works or needs revision. **PyYAML dependency**: The orchestrator skill parses YAML. Ensure `pyyaml` is in the project's dev dependencies (`uv add --dev pyyaml` if not present). + +**Test repo creation:** Before first run, create `trev-gulls/igit-eval-sandbox` with a single initial commit on main. The teardown script assumes this starting state. + +**Setup/teardown tests are heavy:** The `setup_test_repo_test.py` and `teardown_test_repo_test.py` tests hit real GitHub API. They are skipped unless `EVAL_TEST_REPO` is set. Not for CI. diff --git a/docs/work/active/2026-03-27-merge-skill-evals.md b/docs/work/active/2026-03-27-merge-skill-evals.md index e505b1d..a68c618 100644 --- a/docs/work/active/2026-03-27-merge-skill-evals.md +++ b/docs/work/active/2026-03-27-merge-skill-evals.md @@ -33,8 +33,8 @@ From live test (see `docs/ops/igit-merge-test-and-debug.md`): - [ ] **Bug 2 — MCP elif eats fs detection**: Task tracking returns `none` when `~/.claude/settings.json` exists but has no matching MCP servers, even though `docs/work/active/` is present - [ ] **Bug 3 — PR URL missing from manual merge option**: Model presents manual merge option without interpolating `pr.url` into the description -- [ ] **Bug 4 — Main worktree triggers spurious worktree remove**: Model attempts `git worktree remove` on the main working tree when branch matches -- ~**Bug 5 — Implicit merge checklist not enforced**~ — deferred to `docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md` (feature doesn't exist yet) +- [ ] **Bug 4 — Main worktree triggers spurious worktree remove** — deferred to non-happy-path setup phase (requires worktree state beyond basic repo setup) +- ~**Bug 5 — Implicit merge checklist not enforced**~ — deferred to `docs/work/backlog/2026-03-27-merge-implicit-checklist.md` (feature doesn't exist yet) ### Eval approach