diff --git a/agent/pr15-debug-comments-handoff.md b/agent/pr15-debug-comments-handoff.md new file mode 100644 index 0000000..1bf76df --- /dev/null +++ b/agent/pr15-debug-comments-handoff.md @@ -0,0 +1,95 @@ +# PR #15 Debug Doc Comments — Handoff + +**Date:** 2026-03-28 +**Source:** PR #15 review comments on `docs/ops/igit-merge-test-and-debug.md` +**For:** Parent session implementing eval harness plan + +--- + +## Summary + +11 review comments on the debug log. All concern `merge_context.sh`, `git_policy_hook`, or SKILL.md — none require changes to the eval harness plan itself. Several expand the design direction for existing bugs and should be captured in work items. + +## Decisions and Actions by Bug + +### Bug 1 — Structured errors instead of silent fallback + +**Comments:** Lines 13, 28 +**Decision:** Two-part fix: +1. `merge_context.sh` should return structured error objects on `gh` failure instead of silent `{}` fallback. Format: `{ "cmd": "...", "error": "...", "hint": "Try unsandboxed" }` +2. Model handles `gh` calls directly as fallback — script should NOT be run unsandboxed entirely (security hole). The fallback path already exists in SKILL.md. + +**Action:** New work item — fix `merge_context.sh` error handling for sandboxed `gh` failures. + +### Bug 2 — Parallel detection with priority aggregation + +**Comment:** Line 47 +**Decision:** Restructure task tracking detection to run all methods (env, mcp, cli, fs, cfg) in parallel and aggregate by priority, instead of sequential `if/elif` chain. This is a bigger redesign than just fixing the elif fall-through bug. + +**Action:** Update existing Bug 2 fix direction in the debug doc. The work item scope is larger than originally described — not just "fix the elif" but "redesign detection as parallel with priority aggregation." + +### Bug 3 — PR URL in AskUserQuestion preview field + +**Comment:** Line 66 +**Decision:** Use AskUserQuestion's `preview` field for the PR URL instead of interpolating into the `description` field. Whether `` HTML links render in preview is unknown — needs empirical testing. + +**Action:** New work item — test AskUserQuestion preview field rendering, then update SKILL.md. Two-part: (1) test what renders, (2) apply fix. + +### Bug 4 — Identify PR-associated resources + +**Comment:** Line 81 +**Decision:** Reframe from "filter out main worktree" to "identify all resources associated with the PR being merged." Filter worktrees to `branch == current_branch && path != main_worktree_path`. The broader "all resources" framing (stashes, related branches, etc.) is future scope. + +**Action:** Update Bug 4 fix direction in debug doc. Immediate fix: filter worktrees in `merge_context.sh` by comparing against the primary worktree path. + +### Bug 5 — Implicit checklist scope and configuration + +**Comments:** Lines 95, 96, 99 +**Decisions:** +- Version bumps apply to ANY project releasing versioned artifacts (packages, containers, apps, infrastructure, configs), not just plugins +- "Project documentation" = README.md, CLAUDE.md, docs/, etc. +- Users need to document their merge checklist in a known, discoverable location +- Candidate locations: README.md, DEPLOY.md, RELEASE.md, CLAUDE.md, MERGE.md, or `docs/{i,}git/*.md` +- Concern: `docs/igit/*.md` is "dangerous if not protected" — files could be manipulated + +**Action:** Update existing backlog item `docs/work/backlog/2026-03-27-merge-implicit-checklist.md` with these expanded requirements. The checklist location decision is a design question for that work item. + +### Observation — bare `git push` bypasses hook + +**Comments:** Lines 103, 111 +**Decisions:** +- The bare push was in a chained command: `git tag ... && git push` +- Two proposed fix directions for `git_policy_hook`: + 1. Fill in missing refspec before applying policies (resolve bare `git push` to the actual remote/branch) + 2. Reject commands without specified remote and ref +- Tagging should be bumped from `allow` to `ask` if it was intentionally allowed + +**Action:** New work item for `git_policy_hook` — handle bare `git push` and review tag push policy level. + +### Observation — `git pull` fails in sandbox + +**Comment:** Line 121 +**Decision:** Root cause is likely GitHub API failure or sandboxing, not just DNS. The existing "narrate the failure" fix direction still applies. + +**Action:** Minor — update debug doc root cause. No separate work item needed (covered by existing sandbox documentation). + +## Impact on Eval Harness Plan + +**None.** All items are `merge_context.sh`, `git_policy_hook`, or SKILL.md changes. The eval harness plan is not affected. + +However, once Bug 1 (structured errors) and Bug 2 (parallel detection) are fixed, the eval cases for those bugs may need updated `context.output` payloads to reflect the new error/detection format. + +## Work Items + +| File | Scope | +|------|-------| +| `docs/work/backlog/2026-03-28-merge-context-bugs.md` | Bugs 1+2+4 — all `merge_context.sh` fixes (structured errors, parallel detection, worktree filtering) | +| `docs/work/backlog/2026-03-28-merge-skill-bugs.md` | Bugs 3+6 — SKILL.md fixes (PR URL preview field, cleanup order) | +| `docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md` | Bug 5 — expanded scope per review (versioned artifacts, doc scope, checklist location) | +| `docs/work/backlog/2026-03-28-policy-hook-bare-push-tags.md` | Observation — bare push refspec resolution + tag policy level | + +Bug 6 (cleanup order) added from live testing 2026-03-28: Stage 3 tries to delete branch before removing worktree, but git refuses. + +## PR Comment Replies + +Each review comment needs a reply acknowledging the decision. Use `gh api` to reply in-thread. diff --git a/docs/ops/igit-merge-test-and-debug.md b/docs/ops/igit-merge-test-and-debug.md new file mode 100644 index 0000000..08d7736 --- /dev/null +++ b/docs/ops/igit-merge-test-and-debug.md @@ -0,0 +1,133 @@ +# /merge skill — live test & debug log + +Running log of bugs and observations from manual testing of the `/merge` skill on real PRs. + +--- + +## Session: 2026-03-27 — PR #14 (feature/merged-skill) + +### Bug 1 — `gh` silently fails in `sh` subprocess (sandbox) + +**Symptom:** `merge_context.sh` returns `pr.state: "none"` even when an open PR exists. + +**Root cause:** Same Mach IPC sandbox restriction that affects `uv`. When `sh script.sh` spawns a child process, `gh` cannot reach the macOS keychain and exits non-zero. The `|| printf '{}'` fallback silently swallows the failure, leaving all PR fields at defaults. + +**Repro:** +```sh +sh -c 'gh pr view --json number,url,state,title,baseRefName 2>/dev/null || printf "{}"' +# → {} +``` +Direct call works fine: +```sh +gh pr view --json number,url,state,title,baseRefName +# → {"number":14,...} +``` + +**Impact:** PR state, mergeable, merge_state_status, title, URL all lost. + +**Fix direction:** Same workaround pattern as `uv --no-sync` — document that `merge_context.sh` must be run unsandboxed, or restructure so `gh` calls are made directly by the model (not via the shell script). + +--- + +### Bug 2 — MCP elif branch eats filesystem task tracking detection + +**Symptom:** `merge_context.sh` returns `task_tracking.tool: "none"` even though `docs/work/active/` exists. + +**Root cause:** The detection chain is a flat `if/elif/elif/...`. The MCP branch fires when `~/.claude/settings.json` exists (which it does for any Claude Code user). If no matching MCP server is found (linear/jira/asana), `_tt_tool` stays `"none"` — but the `elif` for `docs/work/active` is never reached. + +**Relevant code in `merge_context.sh`:** +```sh +elif [ -f ".mcp.json" ] || [ -f "${HOME}/.claude/settings.json" ]; then + for _cfg in ...; do + # checks linear, jira, asana — if none match, loop exits without setting _tt_tool + done +# elif [ -d "docs/work/active" ] ← never reached +``` + +**Fix direction:** After the MCP loop, fall through to the remaining checks (cli, fs, cfg) if `_tt_tool` is still `"none"`. Either nest the lower-priority checks inside the MCP branch or restructure as a sequential check without elif. + +--- + +### Bug 3 — PR URL missing from "I'll merge manually" option description + +**Symptom:** The `AskUserQuestion` "I'll merge manually" option says `"Open the PR URL — /merge will wait and detect when it's merged"` but doesn't actually include the URL. User has to look it up separately. + +**Fix direction:** Interpolate the PR URL into the description: `"Open — /merge will wait and detect when it's merged"`. + +**Relevant section in `SKILL.md`:** +``` +- label: "I'll merge manually" + description: "Open the PR URL — /merge will wait and detect when it's merged" +``` + +Should be: +``` +- label: "I'll merge manually" + description: "Open — /merge will wait and detect when it's merged" +``` + +--- + +### Bug 4 — Main worktree appears in `worktrees[]`, could trigger spurious `git worktree remove` + +**Symptom:** `merge_context.sh` includes the main worktree in `worktrees[]`. If the skill iterates `worktrees` looking for a match on the current branch, it would find the main worktree and attempt `git worktree remove` on it — which would fail or cause confusion. + +**Observed:** During PR #14 test, `worktrees` contained: +```json +[{"branch": "feature/merged-skill", "path": "/Users/tgulls/Code/llm/claude/igit"}] +``` +That path is the main working tree, not an additional worktree. + +**Fix direction:** Filter the main worktree out in `merge_context.sh` (compare each path against the primary worktree path from `git worktree list --porcelain | head -2`), or have the skill skip `git worktree remove` when the matched path is the repo root. + +--- + +### Bug 5 — Stage 1 doesn't check implicit merge checklist (docs, tests, version bump) + +**Symptom:** Skill proceeded through Stage 1 without flagging that the PR lacked a version bump. The version bump had to be applied after merge as a separate direct-to-main commit. + +**Root cause:** Stage 1 only iterates `task_tracking.work_items` and checks their explicit AC. There is no implicit checklist covering common pre-merge hygiene — docs updated, tests passing, version bumped — that applies regardless of task tracking state. + +**Impact:** New skill shipped in PR #14 with no version bump; bump was applied post-merge directly to main (bypassing branch protection — see bare `git push` observation below). + +**Fix direction:** Add an implicit pre-merge checklist to Stage 1 that the model runs before proceeding, independent of work item AC. Candidates: +- Tests passing (`uv run --no-sync pytest` or equivalent) +- Version bumped if new skills/hooks/commands were added +- Docs/CLAUDE.md updated if conventions changed +- No untracked files that should have been committed + +This checklist could be surfaced as a confirmation panel or auto-checked where possible (e.g. run tests, check if plugin.json version changed since base branch). + +--- + +### Bug 6 — Stage 3 cleanup order: worktree must be removed before branch deletion + +**Symptom:** `git branch --delete` fails with `error: cannot delete branch used by worktree` when a worktree exists for the branch. + +**Root cause:** SKILL.md Stage 3 orders "Delete Local Branch" before "Remove Worktree." Git refuses to delete a branch checked out in any worktree — safety mechanism to prevent leaving a worktree in an invalid state. + +**Fix direction:** Swap the order in Stage 3: remove worktree first, then delete branch. + +--- + +### Observation — bare `git push` bypasses block-push-to-main hook + +**Symptom:** Running `git push` (implicit tracking branch push) on main succeeded without being blocked. `git push origin main` is correctly blocked. + +**Root cause:** `~/.claude/hooks/block-push-to-main.sh` matches on an explicit `main`/`master` ref in the command string. A bare `git push` with no ref specified doesn't match, even though it implicitly pushes to main via the tracking branch. + +**Observed:** During v0.11.0 release, `git tag v0.11.0 5c291a4 && git push && git push origin v0.11.0` pushed a commit directly to main unblocked. + +**Fix direction:** Either extend the hook to also block bare `git push` when the current branch is main/master, or amend the release process to require a PR for the version bump commit. + +--- + +### Observation — `git pull` fails in sandbox + +**Symptom:** Stage 3 `git pull` fails with `Could not resolve host: github.com`. + +**Workaround:** User must run `git pull` in their terminal after the skill completes. + +**Fix direction:** Narrate when the pull fails so the user knows to run it manually. Don't treat it as a hard stop. + +--- diff --git a/docs/plans/2026-03-27-merge-skill-evals-design.md b/docs/plans/2026-03-27-merge-skill-evals-design.md new file mode 100644 index 0000000..c3ce448 --- /dev/null +++ b/docs/plans/2026-03-27-merge-skill-evals-design.md @@ -0,0 +1,295 @@ +# /merge Skill Eval Harness Design + +**Date:** 2026-03-27 +**Status:** In Review +**Author:** tgulls + Claude +**Source:** brainstorming +**Location:** `evals/`, `.claude/skills/run-evals/` +**References:** +- [`docs/work/active/2026-03-27-merge-skill-evals.md`](../work/active/2026-03-27-merge-skill-evals.md) — active work item with 12 eval scenarios +- [`docs/ops/igit-merge-test-and-debug.md`](../ops/igit-merge-test-and-debug.md) — live test bug log (bugs 2–5 inform eval scenarios) + +--- + +## Overview + +A skill-orchestrated eval harness for testing model behavior when running `/igit:merge`. Evals are distinct from the `merge_context_test.py` integration tests — they test the model layer (decision-making, tool selection, output quality) rather than the shell script contract. + +## Architecture + +``` +evals/ + skill_merge_eval.yaml # Eval suite for /merge + +.claude/ + skills/ + run-evals/ + SKILL.md # Orchestrator skill (/igit:run-evals) + agents/ + eval-runner.md # Generic skill executor + eval-grader.md # Model judgment scorer + scripts/ + setup/ # Per-case environment setup scripts + assert/ # Deterministic assertion scripts (tool calls, fs) + hooks/ + capture_runner_transcript.sh # SubagentStop hook — extracts key details from agent transcript + +.evals/ # gitignored — eval run output + -/ # e.g. 2026-03-27T210000-a1b2c3 + report.md # final markdown report + / # per-case subdirectory + transcript.json # extracted events from agent transcript + assert.json # deterministic assertion results + grader.json # grader output +``` + +Eval cases live in `evals/` (top-level, visible — signals serious testing architecture). The orchestrator skill and its agents/scripts are colocated in `.claude/skills/run-evals/` as project-level eval infrastructure, not part of the distributed plugin. Run output is written to `.evals/` in the cwd (gitignored). + +## Eval Case Schema + +Each `*_eval.{yml,yaml}` file contains a named suite with multiple cases: + +```yaml +suite: "Evals for the /merge skill — covers all three stages and known edge cases" + +cases: + - name: "stage2_pr_blocked_conflicts" + description: "Model surfaces conflict blocker and stops without offering to merge" + skill: igit:merge + tags: [stage2, blocked] + + setup: + scripts: + - ".claude/skills/run-evals/scripts/setup/init_git_repo.sh" + state: + - git_repo: clean + - branch: "feature/my-feature" + - remote: connected + + prompt: > + I'm ready to merge my feature branch. Please run the merge workflow. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "feature/my-feature" + base_branch: "main" + pr: + number: 42 + url: "https://github.com/org/repo/pull/42" + state: "open" + mergeable: "CONFLICTING" + merge_state_status: "DIRTY" + title: "My feature" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: [] + + exits_at: null + + expect: + - "Identifies the PR has merge conflicts before taking any action" + - "Names the specific blocker and tells the user how to resolve it" + - "Stops after surfacing the blocker — does not attempt Stage 2 or 3" + + assert: + tool_calls: + - never: "gh pr merge" + - once: "gh pr view" +``` + +### Schema Reference + +| Key | Required | Description | +|-----|----------|-------------| +| `suite` | Yes | Human-readable description of the file's eval suite | +| `cases` | Yes | List of eval cases | +| `name` | Yes | Snake-case identifier, unique across all suites | +| `description` | Yes | One-line human description | +| `skill` | Yes | Skill under test (e.g. `igit:merge`) | +| `tags` | No | For filtering: `/igit:run-evals stage2` | +| `setup.scripts` | No | Shell scripts to run before the runner is dispatched | +| `setup.state` | No | Declarative environment state (resolved by setup scripts) | +| `prompt` | Yes | User prompt injected into the model session | +| `context` | Yes | List of context items injected as tool call output | +| `context[].source` | Yes | `script_invocation`, `tool_call`, or `subagent_output` | +| `context[].output` | Yes | The mock output to inject | +| `prior_decisions` | No | AskUserQuestion answers already established before this case | +| `exits_at` | No | Expected AskUserQuestion question text; `null` if no decision point expected | +| `expect` | No | Model judgment assertions (scored by eval-grader) | +| `assert.tool_calls` | No | Deterministic tool call assertions — see Tool Call Assertions | +| `assert.filesystem` | No | Deterministic filesystem assertions — see Filesystem Assertions | + +### Branching Logic + +Decision points produce two cases sharing the same `context` but differing in `prior_decisions`: + +```yaml +- name: "stage2_merge_who_merges_claude" + prior_decisions: + - question: "Who merges PR #42?" + answer: "Claude merges" + ... + +- name: "stage2_merge_who_merges_manual" + prior_decisions: + - question: "Who merges PR #42?" + answer: "I'll merge manually" + ... +``` + +### Tool Call Assertions + +Assertions match against tool call strings captured in `transcript.json`. + +| Form | Meaning | +|------|---------| +| `contains: ""` | at least one call contains substring | +| `contains: "", times: N` | exactly N calls contain substring | +| `exact: ""` | at least one call matches exactly | +| `exact: "", times: N` | exactly N calls match exactly | +| `once: ""` | shorthand — `contains + times: 1` | +| `never: ""` | shorthand — `contains + times: 0` | +| `many: ""` | shorthand — `contains + times: 2+` | + +No `times` means "at least once". Use `exact` when flag specifics matter (e.g. `--squash` vs `--merge`). + +### Filesystem Assertions + +Assertions validate post-execution filesystem state — what the model created, moved, or modified. They are not for pre-existing state. + +| Form | Meaning | +|------|---------| +| `exists: ""` | file exists at path (was created or moved here) | +| `not-exists: ""` | file does not exist at path (was deleted or moved away) | +| `file: "", contains: ""` | file exists and contains substring | + +`exists` + `not-exists` together assert a `git mv` — both sides of the move verified. + +## Eval Runner Agent + +Generic skill executor — no domain knowledge, no eval framing visible to the model. + +**Location:** `.claude/skills/run-evals/agents/eval-runner.md` + +**Model:** Haiku (mechanical execution) + +**Tools:** Read, Bash, AskUserQuestion + +**Behavior:** Receives a parsed eval case. Constructs a session with: +1. Skill content (SKILL.md for the skill under test) +2. Mock context presented as natural tool call output — no "you are being tested" framing +3. Prior decisions pre-established in the prompt +4. User prompt as the trigger + +Proceeds through the skill until reaching `exits_at` or completion. Returns `status` and `exited_at` — the full event transcript is captured by the SubagentStop hook, not self-reported. + +**Note:** `AskUserQuestion` behavior in agent context is an open question — the runner has access to the tool and we will observe empirically whether it blocks, bubbles to the user, or exits cleanly. Design may be revised based on findings. + +### Transcript Capture (SubagentStop Hook) + +The SubagentStop hook fires when the eval-runner agent completes. It filters on `agent_type == "eval-runner"`, reads `agent_transcript_path` (the agent's JSONL transcript), extracts key details, and writes to `.evals/-//transcript.json`: + +```json +{ + "case": "stage2_pr_blocked_conflicts", + "source": "~/.claude/projects/.../subagents/agent-def456.jsonl", + "events": [ + { "type": "text", "content": "This PR has merge conflicts (DIRTY)..." }, + { "type": "tool_call", "call": "gh pr view 42 --json mergeable" }, + { "type": "text", "content": "Cannot merge — resolve conflicts first." } + ], + "status": "complete", + "exited_at": null +} +``` + +`source` references the full transcript for debugging. `events` is the ordered sequence consumed by assert scripts and the grader. The run output directory is communicated to the hook via an env var set by the orchestrator before dispatching the runner. + +## Eval Grader Agent + +Scores model judgment against `expect` assertions. Deterministic assertions (`assert`) are handled by scripts, not the grader. + +**Location:** `.claude/skills/run-evals/agents/eval-grader.md` + +**Model:** Sonnet (judgment quality matters) + +**Tools:** Read + +**Behavior:** Receives `transcript.json` (from `.evals///`) + `expect` list. Scores each expectation independently using the ordered `events` sequence — ordering matters for assertions like "identifies X before taking any action". + +Returns: + +```json +{ + "case": "stage2_pr_blocked_conflicts", + "results": [ + { + "expect": "Identifies the PR has merge conflicts before taking any action", + "pass": true, + "reasoning": "Model opened with 'This PR has merge conflicts (DIRTY)' before any tool calls" + }, + { + "expect": "Stops after surfacing the blocker — does not attempt Stage 2 or 3", + "pass": false, + "reasoning": "Model surfaced the blocker but then asked 'Would you like to resolve conflicts now?'" + } + ] +} +``` + +## Orchestrator Skill (`/igit:run-evals`) + +**Location:** `.claude/skills/run-evals/SKILL.md` + +**Invocation:** `/igit:run-evals [tag-or-name-filter]` + +**Flow per run:** + +1. Discover `evals/*_eval.{yml,yaml}` +2. Parse each file — extract `suite` and flatten `cases` list +3. Apply tag or name filter if provided +4. Generate `-`, create `.evals/-/`, set `EVAL_RUN_DIR` env var +5. For each case: + a. Run `setup.scripts` in order + b. Dispatch `eval-runner` agent — SubagentStop hook writes `transcript.json` to `.evals///` + c. Run `assert.tool_calls` and `assert.filesystem` scripts against `transcript.json` — write `assert.json` + d. Dispatch `eval-grader` agent with `transcript.json` + `expect` list — write `grader.json` +6. Collect all results, write `report.md` to `.evals//` + +**Report format:** + +``` +## Eval Results — 2026-03-27 + +### /merge skill evals + +| Case | Expect | Assert | Status | +|------|--------|--------|--------| +| stage2_pr_blocked_conflicts | 3/3 | 2/2 | ✓ PASS | +| stage2_claude_merges_squash | 2/3 | 1/2 | ✗ FAIL | + +### Failures + +**stage2_claude_merges_squash** +- expect: "Model uses --squash not --merge" + FAIL: "Model ran gh pr merge 42 --merge instead of --squash" +- assert.tool_calls: never "gh pr merge --merge" + FAIL: found 1 match +``` + +## Future Considerations + +- **`agent-eval-runner` skill** — agents cannot dispatch agents, so running evals from an agent context will require a dedicated skill with forked context (worktrees). Not in scope now. +- **CI integration** — evals are manual verification for now given cost and latency. If a lightweight subset proves stable, consider a scheduled CI run. +- **Additional suites** — `skill_commit_eval.yaml` etc. follow the same pattern; orchestrator discovers all `*_eval.{yml,yaml}` files automatically. diff --git a/docs/plans/2026-03-27-merge-skill-evals-impl.md b/docs/plans/2026-03-27-merge-skill-evals-impl.md new file mode 100644 index 0000000..1299f3e --- /dev/null +++ b/docs/plans/2026-03-27-merge-skill-evals-impl.md @@ -0,0 +1,2395 @@ +# /merge Skill Eval Harness Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Build a skill-orchestrated eval harness that tests model behavior when running `/igit:merge`, covering 11 scenarios from the work item. + +**Architecture:** YAML eval cases define scenarios with mock context, prompts, and assertions. An orchestrator skill (`/igit:evaluate`) parses cases, sets up real GitHub repo state, dispatches an eval-runner agent (Haiku) per case, captures transcripts via SubagentStop hook, runs deterministic assertions via scripts, and dispatches an eval-grader agent (Sonnet) for model judgment scoring. Results aggregate into a markdown report. A dedicated GitHub test repo ensures all `gh` and `git` commands hit real endpoints. + +**Tech Stack:** Python 3.12+ (uv), pytest, YAML (PyYAML), POSIX shell for hook/setup scripts, Claude Code agents + skills, GitHub CLI + +--- + +## File Structure + +``` +NEW FILES: +evals/ + skill_merge_eval.yaml # 11 eval cases for /merge skill + +.claude/ + skills/ + evaluate/ + SKILL.md # Orchestrator skill (/igit:evaluate) + agents/ + eval-runner.md # Generic skill executor agent + eval-grader.md # Model judgment scorer agent + scripts/ + setup/ + setup_test_repo.sh # Creates branches + PRs on GitHub test repo + teardown_test_repo.sh # Resets test repo to known starting state + setup_test_repo_test.py # Integration tests for setup + teardown_test_repo_test.py # Integration tests for teardown + assert/ + check_tool_calls.py # Deterministic tool call assertion script + check_tool_calls_test.py + check_filesystem.py # Deterministic filesystem assertion script + check_filesystem_test.py + hooks/ + capture_agent_transcript.sh # SubagentStop hook — extracts transcript + capture_agent_transcript_test.py + +MODIFY FILES: +.gitignore # Add .evals/ to gitignore +``` + +--- + +## Task 1: Gitignore and Directory Scaffolding + +**Files:** +- Modify: `.gitignore` + +- [ ] **Step 1: Add `.evals/` to `.gitignore`** + +``` +# Append to .gitignore +.evals/ +``` + +- [ ] **Step 2: Verify** + +Run: `grep -q '.evals/' .gitignore && echo "OK"` +Expected: `OK` + +- [ ] **Step 3: Commit** + +```bash +git add .gitignore +git commit -m "chore: add .evals/ to gitignore for eval run output" +``` + +--- + +## Task 2: Tool Call Assertion Script + +The deterministic assertion engine for `assert.tool_calls`. Reads `transcript.json` and an assertion list, outputs `assert.json` with pass/fail per assertion. + +**Files:** +- Create: `.claude/skills/evaluate/scripts/assert/check_tool_calls.py` +- Create: `.claude/skills/evaluate/scripts/assert/check_tool_calls_test.py` + +- [ ] **Step 1: Write the failing test for `contains` assertions** + +```python +"""Tests for check_tool_calls assertion engine.""" + +import json +from pathlib import Path + +import pytest + +# The module under test — will be imported after creation +from check_tool_calls import check_tool_calls + + +class TestContains: + def test_contains_found(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json mergeable"}, + {"type": "tool_call", "call": "gh pr merge 42 --squash"}, + ] + assertions = [{"contains": "gh pr merge"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_contains_not_found(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json mergeable"}, + ] + assertions = [{"contains": "gh pr merge"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False + + def test_contains_with_times(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + {"type": "tool_call", "call": "gh pr view 42 --json mergeable"}, + ] + assertions = [{"contains": "gh pr view", "times": 2}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_contains_with_times_mismatch(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"contains": "gh pr view", "times": 2}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False + assert "expected 2" in results[0]["reason"] +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'check_tool_calls'` + +- [ ] **Step 3: Write the `check_tool_calls` module** + +```python +"""Deterministic tool call assertion engine for eval harness. + +Reads tool_call events from a transcript and evaluates assertions against them. +Each assertion returns a pass/fail result with a reason string. +""" + + +def check_tool_calls( + events: list[dict], assertions: list[dict] +) -> list[dict]: + """Evaluate tool call assertions against transcript events. + + Args: + events: List of transcript events (only type=="tool_call" are inspected). + assertions: List of assertion dicts — see design spec for schema. + + Returns: + List of result dicts with keys: assertion, pass, reason. + """ + tool_calls = [e["call"] for e in events if e.get("type") == "tool_call"] + results = [] + + for assertion in assertions: + result = _evaluate_one(tool_calls, assertion) + results.append(result) + + return results + + +def _evaluate_one(tool_calls: list[str], assertion: dict) -> dict: + """Evaluate a single assertion against the tool call list.""" + # Resolve shorthands first + if "once" in assertion: + return _evaluate_one( + tool_calls, {"contains": assertion["once"], "times": 1} + ) + if "never" in assertion: + return _evaluate_one( + tool_calls, {"contains": assertion["never"], "times": 0} + ) + if "many" in assertion: + substring = assertion["many"] + count = sum(1 for c in tool_calls if substring in c) + passed = count >= 2 + return { + "assertion": assertion, + "pass": passed, + "reason": f"found {count} matches (need 2+)" + if not passed + else f"found {count} matches", + } + + if "contains" in assertion: + return _check_contains(tool_calls, assertion) + if "exact" in assertion: + return _check_exact(tool_calls, assertion) + + return { + "assertion": assertion, + "pass": False, + "reason": "unknown assertion type", + } + + +def _check_contains(tool_calls: list[str], assertion: dict) -> dict: + substring = assertion["contains"] + count = sum(1 for c in tool_calls if substring in c) + + if "times" in assertion: + expected = assertion["times"] + passed = count == expected + reason = ( + f"found {count} matches" + if passed + else f"expected {expected}, found {count}" + ) + else: + passed = count >= 1 + reason = "found" if passed else "not found" + + return {"assertion": assertion, "pass": passed, "reason": reason} + + +def _check_exact(tool_calls: list[str], assertion: dict) -> dict: + target = assertion["exact"] + count = sum(1 for c in tool_calls if c == target) + + if "times" in assertion: + expected = assertion["times"] + passed = count == expected + reason = ( + f"found {count} exact matches" + if passed + else f"expected {expected} exact, found {count}" + ) + else: + passed = count >= 1 + reason = "found" if passed else "not found" + + return {"assertion": assertion, "pass": passed, "reason": reason} +``` + +- [ ] **Step 4: Run tests to verify they pass** + +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Expected: 4 passed + +- [ ] **Step 5: Write tests for shorthand assertions (`once`, `never`, `many`)** + +Add to `check_tool_calls_test.py`: + +```python +class TestShorthands: + def test_once_passes_with_exactly_one(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"once": "gh pr view"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_once_fails_with_two(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + {"type": "tool_call", "call": "gh pr view 42 --json mergeable"}, + ] + assertions = [{"once": "gh pr view"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False + + def test_never_passes_with_zero(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"never": "gh pr merge"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_never_fails_with_one(self): + events = [ + {"type": "tool_call", "call": "gh pr merge 42 --squash"}, + ] + assertions = [{"never": "gh pr merge"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False + + def test_many_passes_with_two_plus(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + {"type": "tool_call", "call": "gh pr view 42 --json mergeable"}, + ] + assertions = [{"many": "gh pr view"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_many_fails_with_one(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"many": "gh pr view"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False +``` + +- [ ] **Step 6: Run full test suite** + +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Expected: 10 passed + +- [ ] **Step 7: Write tests for `exact` assertions** + +Add to `check_tool_calls_test.py`: + +```python +class TestExact: + def test_exact_match(self): + events = [ + {"type": "tool_call", "call": "gh pr merge 42 --squash"}, + ] + assertions = [{"exact": "gh pr merge 42 --squash"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_exact_no_match_on_substring(self): + events = [ + {"type": "tool_call", "call": "gh pr merge 42 --squash --admin"}, + ] + assertions = [{"exact": "gh pr merge 42 --squash"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is False + + def test_exact_with_times(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"exact": "gh pr view 42 --json state", "times": 2}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True +``` + +- [ ] **Step 8: Run full test suite** + +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Expected: 13 passed + +- [ ] **Step 9: Write test for non-tool_call events are ignored** + +Add to `check_tool_calls_test.py`: + +```python +class TestEventFiltering: + def test_ignores_text_events(self): + events = [ + {"type": "text", "content": "gh pr merge 42 --squash"}, + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + ] + assertions = [{"never": "gh pr merge"}] + results = check_tool_calls(events, assertions) + assert results[0]["pass"] is True + + def test_multiple_assertions(self): + events = [ + {"type": "tool_call", "call": "gh pr view 42 --json state"}, + {"type": "tool_call", "call": "gh pr merge 42 --squash"}, + ] + assertions = [ + {"once": "gh pr view"}, + {"once": "gh pr merge"}, + {"never": "git push --force"}, + ] + results = check_tool_calls(events, assertions) + assert all(r["pass"] for r in results) +``` + +- [ ] **Step 10: Run full test suite** + +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_tool_calls_test.py -v` +Expected: 15 passed + +- [ ] **Step 11: Commit** + +```bash +git add .claude/skills/evaluate/scripts/assert/check_tool_calls.py \ + .claude/skills/evaluate/scripts/assert/check_tool_calls_test.py +git commit -m "feat(evals): add deterministic tool call assertion engine + +TDD implementation covering contains, exact, once, never, many assertions +with times qualifier. Filters non-tool_call events from transcript." +``` + +--- + +## Task 3: Filesystem Assertion Script + +Deterministic assertion engine for `assert.filesystem`. Checks post-execution filesystem state. + +**Files:** +- Create: `.claude/skills/evaluate/scripts/assert/check_filesystem.py` +- Create: `.claude/skills/evaluate/scripts/assert/check_filesystem_test.py` + +- [ ] **Step 1: Write the failing test** + +```python +"""Tests for check_filesystem assertion engine.""" + +import os +from pathlib import Path + +import pytest + +from check_filesystem import check_filesystem + + +class TestExists: + def test_exists_passes(self, tmp_path): + (tmp_path / "file.txt").write_text("hello") + assertions = [{"exists": str(tmp_path / "file.txt")}] + results = check_filesystem(assertions) + assert results[0]["pass"] is True + + def test_exists_fails(self, tmp_path): + assertions = [{"exists": str(tmp_path / "missing.txt")}] + results = check_filesystem(assertions) + assert results[0]["pass"] is False + + +class TestNotExists: + def test_not_exists_passes(self, tmp_path): + assertions = [{"not-exists": str(tmp_path / "missing.txt")}] + results = check_filesystem(assertions) + assert results[0]["pass"] is True + + def test_not_exists_fails(self, tmp_path): + (tmp_path / "file.txt").write_text("hello") + assertions = [{"not-exists": str(tmp_path / "file.txt")}] + results = check_filesystem(assertions) + assert results[0]["pass"] is False + + +class TestFileContains: + def test_contains_passes(self, tmp_path): + (tmp_path / "file.txt").write_text("hello world") + assertions = [{"file": str(tmp_path / "file.txt"), "contains": "world"}] + results = check_filesystem(assertions) + assert results[0]["pass"] is True + + def test_contains_fails_wrong_content(self, tmp_path): + (tmp_path / "file.txt").write_text("hello world") + assertions = [{"file": str(tmp_path / "file.txt"), "contains": "goodbye"}] + results = check_filesystem(assertions) + assert results[0]["pass"] is False + + def test_contains_fails_file_missing(self, tmp_path): + assertions = [{"file": str(tmp_path / "missing.txt"), "contains": "hello"}] + results = check_filesystem(assertions) + assert results[0]["pass"] is False + + +class TestMoveVerification: + def test_exists_plus_not_exists_verifies_move(self, tmp_path): + (tmp_path / "new.txt").write_text("moved") + assertions = [ + {"exists": str(tmp_path / "new.txt")}, + {"not-exists": str(tmp_path / "old.txt")}, + ] + results = check_filesystem(assertions) + assert all(r["pass"] for r in results) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_filesystem_test.py -v` +Expected: FAIL — `ModuleNotFoundError: No module named 'check_filesystem'` + +- [ ] **Step 3: Write the `check_filesystem` module** + +```python +"""Deterministic filesystem assertion engine for eval harness. + +Validates post-execution filesystem state: file existence, absence, and content. +""" + +from pathlib import Path + + +def check_filesystem(assertions: list[dict]) -> list[dict]: + """Evaluate filesystem assertions. + + Args: + assertions: List of assertion dicts — see design spec for schema. + + Returns: + List of result dicts with keys: assertion, pass, reason. + """ + results = [] + for assertion in assertions: + results.append(_evaluate_one(assertion)) + return results + + +def _evaluate_one(assertion: dict) -> dict: + """Evaluate a single filesystem assertion.""" + if "exists" in assertion: + path = Path(assertion["exists"]) + passed = path.exists() + reason = "file exists" if passed else f"file not found: {path}" + return {"assertion": assertion, "pass": passed, "reason": reason} + + if "not-exists" in assertion: + path = Path(assertion["not-exists"]) + passed = not path.exists() + reason = ( + "file absent as expected" + if passed + else f"file unexpectedly exists: {path}" + ) + return {"assertion": assertion, "pass": passed, "reason": reason} + + if "file" in assertion and "contains" in assertion: + path = Path(assertion["file"]) + substring = assertion["contains"] + if not path.exists(): + return { + "assertion": assertion, + "pass": False, + "reason": f"file not found: {path}", + } + content = path.read_text() + passed = substring in content + reason = ( + f"contains '{substring}'" + if passed + else f"'{substring}' not found in {path}" + ) + return {"assertion": assertion, "pass": passed, "reason": reason} + + return { + "assertion": assertion, + "pass": False, + "reason": "unknown assertion type", + } +``` + +- [ ] **Step 4: Run tests to verify they pass** + +Run: `cd .claude/skills/evaluate/scripts/assert && uv run --no-sync pytest check_filesystem_test.py -v` +Expected: 8 passed + +- [ ] **Step 5: Commit** + +```bash +git add .claude/skills/evaluate/scripts/assert/check_filesystem.py \ + .claude/skills/evaluate/scripts/assert/check_filesystem_test.py +git commit -m "feat(evals): add deterministic filesystem assertion engine + +TDD implementation covering exists, not-exists, and file-contains assertions. +Supports move verification via exists + not-exists combination." +``` + +--- + +## Task 4: Transcript Capture Hook + +SubagentStop hook script that fires when the eval-runner agent completes, extracts key events from the agent transcript, and writes `transcript.json`. + +**Files:** +- Create: `.claude/skills/evaluate/scripts/hooks/capture_agent_transcript.sh` +- Create: `.claude/skills/evaluate/scripts/hooks/capture_agent_transcript_test.py` + +- [ ] **Step 1: Write the test** + +The hook reads the `TOOL_INPUT` JSON from stdin (containing `agent_type` and `agent_transcript_path`) and the `EVAL_RUN_DIR` + `EVAL_CASE_NAME` env vars. It extracts events from the JSONL transcript and writes `transcript.json`. + +```python +"""Integration tests for capture_agent_transcript.sh hook.""" + +import json +import os +import subprocess +from pathlib import Path + +import pytest + +SCRIPT = Path(__file__).parent / "capture_agent_transcript.sh" + + +def run_hook( + tool_input: dict, + transcript_lines: list[dict], + run_dir: Path, + case_name: str, +) -> subprocess.CompletedProcess: + """Run the hook with given inputs and return the result.""" + # Write fake transcript JSONL + transcript_path = run_dir / "agent_transcript.jsonl" + with transcript_path.open("w") as f: + for line in transcript_lines: + f.write(json.dumps(line) + "\n") + + tool_input_with_path = { + **tool_input, + "agent_transcript_path": str(transcript_path), + } + + env = { + **os.environ, + "EVAL_RUN_DIR": str(run_dir), + "EVAL_CASE_NAME": case_name, + } + + return subprocess.run( + ["sh", str(SCRIPT)], + input=json.dumps({"tool_input": tool_input_with_path}), + capture_output=True, + text=True, + env=env, + ) + + +class TestFilterByAgentType: + def test_skips_non_eval_runner(self, tmp_path): + result = run_hook( + tool_input={"agent_type": "code-reviewer"}, + transcript_lines=[], + run_dir=tmp_path, + case_name="test_case", + ) + assert result.returncode == 0 + assert not (tmp_path / "test_case" / "transcript.json").exists() + + def test_processes_eval_runner(self, tmp_path): + transcript = [ + {"type": "text", "content": "Analyzing PR..."}, + {"type": "tool_use", "name": "Bash", "input": {"command": "gh pr view 42 --json state"}}, + {"type": "text", "content": "PR has conflicts."}, + ] + result = run_hook( + tool_input={"agent_type": "eval-runner"}, + transcript_lines=transcript, + run_dir=tmp_path, + case_name="test_case", + ) + assert result.returncode == 0 + output_path = tmp_path / "test_case" / "transcript.json" + assert output_path.exists() + data = json.loads(output_path.read_text()) + assert data["case"] == "test_case" + assert len(data["events"]) == 3 + + +class TestEventExtraction: + def test_extracts_text_events(self, tmp_path): + transcript = [ + {"type": "text", "content": "PR has merge conflicts."}, + ] + run_hook( + tool_input={"agent_type": "eval-runner"}, + transcript_lines=transcript, + run_dir=tmp_path, + case_name="test_case", + ) + data = json.loads((tmp_path / "test_case" / "transcript.json").read_text()) + assert data["events"][0] == {"type": "text", "content": "PR has merge conflicts."} + + def test_extracts_tool_call_events(self, tmp_path): + transcript = [ + {"type": "tool_use", "name": "Bash", "input": {"command": "gh pr view 42 --json state"}}, + ] + run_hook( + tool_input={"agent_type": "eval-runner"}, + transcript_lines=transcript, + run_dir=tmp_path, + case_name="test_case", + ) + data = json.loads((tmp_path / "test_case" / "transcript.json").read_text()) + assert data["events"][0]["type"] == "tool_call" + assert "gh pr view 42" in data["events"][0]["call"] + + def test_extracts_ask_user_events(self, tmp_path): + transcript = [ + {"type": "tool_use", "name": "AskUserQuestion", "input": {"questions": [{"question": "Who merges?"}]}}, + ] + run_hook( + tool_input={"agent_type": "eval-runner"}, + transcript_lines=transcript, + run_dir=tmp_path, + case_name="test_case", + ) + data = json.loads((tmp_path / "test_case" / "transcript.json").read_text()) + assert data["events"][0]["type"] == "ask_user" + assert "Who merges?" in data["events"][0]["question"] + + +class TestMissingEnvVars: + def test_exits_cleanly_without_eval_run_dir(self, tmp_path): + """Hook should exit 0 silently when not in an eval context.""" + env = {k: v for k, v in os.environ.items() if k != "EVAL_RUN_DIR"} + env.pop("EVAL_CASE_NAME", None) + result = subprocess.run( + ["sh", str(SCRIPT)], + input=json.dumps({"tool_input": {"agent_type": "eval-runner"}}), + capture_output=True, + text=True, + env=env, + ) + assert result.returncode == 0 +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd .claude/skills/evaluate/scripts/hooks && uv run --no-sync pytest capture_agent_transcript_test.py -v` +Expected: FAIL — script does not exist + +- [ ] **Step 3: Write the hook script** + +```sh +#!/bin/sh +# capture_agent_transcript.sh — SubagentStop hook for eval harness +# Extracts key events from eval-runner agent transcript, writes transcript.json +# Dependencies: jq +# Platforms: macOS, Linux +# +# Env vars (set by orchestrator): +# EVAL_RUN_DIR — path to current eval run output directory +# EVAL_CASE_NAME — name of the current eval case +# +# Reads TOOL_INPUT JSON from stdin (SubagentStop hook contract). + +set -e + +# Exit silently if not in an eval context +if [ -z "${EVAL_RUN_DIR:-}" ] || [ -z "${EVAL_CASE_NAME:-}" ]; then + exit 0 +fi + +# Read hook input from stdin +input=$(cat) + +# Check agent type — only process eval-runner +agent_type=$(printf '%s' "$input" | jq -r '.tool_input.agent_type // ""') +if [ "$agent_type" != "eval-runner" ]; then + exit 0 +fi + +# Get transcript path +transcript_path=$(printf '%s' "$input" | jq -r '.tool_input.agent_transcript_path // ""') +if [ -z "$transcript_path" ] || [ ! -f "$transcript_path" ]; then + exit 0 +fi + +# Create output directory +output_dir="${EVAL_RUN_DIR}/${EVAL_CASE_NAME}" +mkdir -p "$output_dir" + +# Extract events from JSONL transcript +# - text events: {type: "text", content: ...} +# - tool_call events: {type: "tool_call", call: ...} (from Bash tool_use) +# - ask_user events: {type: "ask_user", question: ...} (from AskUserQuestion) +events=$(jq -c ' + if .type == "text" and .content then + {type: "text", content: .content} + elif .type == "tool_use" and .name == "AskUserQuestion" then + {type: "ask_user", question: (.input.questions[0].question // "unknown")} + elif .type == "tool_use" and .name == "Bash" then + {type: "tool_call", call: .input.command} + elif .type == "tool_use" then + {type: "tool_call", call: (.name + " " + (.input | tostring))} + else + empty + end +' "$transcript_path" | jq -s '.') + +# Write transcript.json +jq -n \ + --arg case_name "$EVAL_CASE_NAME" \ + --arg source "$transcript_path" \ + --argjson events "$events" \ + '{ + case: $case_name, + source: $source, + events: $events + }' > "${output_dir}/transcript.json" +``` + +- [ ] **Step 4: Make the script executable and run tests** + +```bash +chmod +x .claude/skills/evaluate/scripts/hooks/capture_agent_transcript.sh +``` + +Run: `cd .claude/skills/evaluate/scripts/hooks && uv run --no-sync pytest capture_agent_transcript_test.py -v` +Expected: All passed + +- [ ] **Step 5: Commit** + +```bash +git add .claude/skills/evaluate/scripts/hooks/capture_agent_transcript.sh \ + .claude/skills/evaluate/scripts/hooks/capture_agent_transcript_test.py +git commit -m "feat(evals): add SubagentStop hook for transcript capture + +Extracts text, tool_call, and ask_user events from eval-runner agent +transcript JSONL. Writes transcript.json to eval run output directory. +Filters on agent_type=eval-runner, exits silently in non-eval context." +``` + +--- + +## Task 5: GitHub Test Repo Setup and Teardown + +Scripts that create real branches and PRs on a dedicated GitHub test repo, and reset it after eval runs. All `gh` and `git` commands hit real endpoints — no mocks. + +**Repo:** `trev-gulls/igit-eval-sandbox` (or configured via `EVAL_TEST_REPO` env var) + +**Known starting state:** main branch with a single initial commit. No other branches. No open PRs. + +**Files:** +- Create: `.claude/skills/evaluate/scripts/setup/setup_test_repo.sh` +- Create: `.claude/skills/evaluate/scripts/setup/teardown_test_repo.sh` +- Create: `.claude/skills/evaluate/scripts/setup/setup_test_repo_test.py` +- Create: `.claude/skills/evaluate/scripts/setup/teardown_test_repo_test.py` + +### Setup Script + +- [ ] **Step 1: Write the setup test** + +```python +"""Integration tests for setup_test_repo.sh. + +Requires EVAL_TEST_REPO env var pointing to a real GitHub test repo. +Skipped if not set — these are heavy tests, not for CI. +""" + +import json +import os +import subprocess +from pathlib import Path + +import pytest + +SCRIPT = Path(__file__).parent / "setup_test_repo.sh" +TEARDOWN = Path(__file__).parent / "teardown_test_repo.sh" +REPO = os.environ.get("EVAL_TEST_REPO", "") + +pytestmark = pytest.mark.skipif(not REPO, reason="EVAL_TEST_REPO not set") + + +def run_setup(case_state: dict) -> subprocess.CompletedProcess: + """Run setup script with given case state.""" + env = { + **os.environ, + "EVAL_TEST_REPO": REPO, + "EVAL_CASE_STATE": json.dumps(case_state), + } + return subprocess.run( + ["sh", str(SCRIPT)], + capture_output=True, + text=True, + env=env, + ) + + +def run_teardown() -> subprocess.CompletedProcess: + """Run teardown to reset repo.""" + env = {**os.environ, "EVAL_TEST_REPO": REPO} + return subprocess.run( + ["sh", str(TEARDOWN)], + capture_output=True, + text=True, + env=env, + ) + + +@pytest.fixture(autouse=True) +def cleanup(): + """Teardown after each test.""" + yield + run_teardown() + + +class TestSetupHappyPath: + def test_creates_branch_and_pr(self): + result = run_setup({ + "branch": "eval/test-case-1", + "pr": {"title": "Test PR", "state": "open"}, + }) + assert result.returncode == 0 + output = json.loads(result.stdout) + assert output["branch"] == "eval/test-case-1" + assert output["pr"]["number"] > 0 + assert output["pr"]["state"] == "open" + + def test_pr_is_mergeable(self): + result = run_setup({ + "branch": "eval/test-mergeable", + "pr": {"title": "Mergeable PR", "state": "open"}, + }) + output = json.loads(result.stdout) + # Verify via gh that the PR exists and is open + check = subprocess.run( + ["gh", "pr", "view", str(output["pr"]["number"]), + "--repo", REPO, "--json", "state,mergeable"], + capture_output=True, text=True, + ) + pr_data = json.loads(check.stdout) + assert pr_data["state"] == "OPEN" + + +class TestSetupAlreadyMerged: + def test_creates_and_merges_pr(self): + result = run_setup({ + "branch": "eval/test-merged", + "pr": {"title": "Already Merged", "state": "merged"}, + }) + assert result.returncode == 0 + output = json.loads(result.stdout) + assert output["pr"]["state"] == "merged" +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd .claude/skills/evaluate/scripts/setup && EVAL_TEST_REPO=trev-gulls/igit-eval-sandbox uv run --no-sync pytest setup_test_repo_test.py -v` +Expected: FAIL — script does not exist + +- [ ] **Step 3: Write the setup script** + +```sh +#!/bin/sh +# setup_test_repo.sh — create branches and PRs on GitHub test repo for eval cases +# Dependencies: gh, git, jq +# Platforms: macOS, Linux +# +# Env vars: +# EVAL_TEST_REPO — GitHub repo (owner/name) to use for evals +# EVAL_CASE_STATE — JSON describing the desired repo state for this case +# +# EVAL_CASE_STATE schema: +# branch: "eval/case-name" — branch to create +# pr.title: "PR Title" — PR title +# pr.state: "open" | "merged" — desired PR state +# +# Outputs JSON to stdout with the created state: +# { "branch": "...", "pr": { "number": N, "url": "...", "state": "..." } } + +set -e + +repo="${EVAL_TEST_REPO:?EVAL_TEST_REPO env var required}" +state="${EVAL_CASE_STATE:?EVAL_CASE_STATE env var required}" + +branch=$(printf '%s' "$state" | jq -r '.branch') +pr_title=$(printf '%s' "$state" | jq -r '.pr.title') +pr_state=$(printf '%s' "$state" | jq -r '.pr.state // "open"') + +# Get default branch +default_branch=$(gh repo view "$repo" --json defaultBranchRef --jq '.defaultBranchRef.name') + +# Clone to temp dir for git operations +work_dir=$(mktemp -d) +trap 'rm -rf "$work_dir"' EXIT +gh repo clone "$repo" "$work_dir" -- --depth=1 +cd "$work_dir" + +# Create branch with a commit (so the PR has a diff) +git checkout -b "$branch" +echo "eval case: $branch" > "eval-marker.txt" +git add eval-marker.txt +git commit -m "eval: setup for $branch" +git push origin "$branch" + +# Create PR +pr_url=$(gh pr create \ + --repo "$repo" \ + --head "$branch" \ + --base "$default_branch" \ + --title "$pr_title" \ + --body "Automated eval case setup" \ + --json url --jq '.url' 2>&1) || { + # If PR already exists, get its URL + pr_url=$(gh pr view "$branch" --repo "$repo" --json url --jq '.url') + } + +pr_number=$(gh pr view "$branch" --repo "$repo" --json number --jq '.number') + +# If state should be merged, merge it +if [ "$pr_state" = "merged" ]; then + gh pr merge "$pr_number" --repo "$repo" --merge --admin + actual_state="merged" +else + actual_state="open" +fi + +# Output result +jq -n \ + --arg branch "$branch" \ + --argjson number "$pr_number" \ + --arg url "$pr_url" \ + --arg state "$actual_state" \ + '{ + branch: $branch, + pr: { number: $number, url: $url, state: $state } + }' +``` + +- [ ] **Step 4: Make executable and run tests** + +```bash +chmod +x .claude/skills/evaluate/scripts/setup/setup_test_repo.sh +``` + +Run: `cd .claude/skills/evaluate/scripts/setup && EVAL_TEST_REPO=trev-gulls/igit-eval-sandbox uv run --no-sync pytest setup_test_repo_test.py -v` +Expected: All passed (requires real GitHub repo) + +### Teardown Script + +- [ ] **Step 5: Write the teardown test** + +```python +"""Integration tests for teardown_test_repo.sh. + +Requires EVAL_TEST_REPO env var pointing to a real GitHub test repo. +""" + +import json +import os +import subprocess +from pathlib import Path + +import pytest + +SETUP = Path(__file__).parent / "setup_test_repo.sh" +SCRIPT = Path(__file__).parent / "teardown_test_repo.sh" +REPO = os.environ.get("EVAL_TEST_REPO", "") + +pytestmark = pytest.mark.skipif(not REPO, reason="EVAL_TEST_REPO not set") + + +def run_teardown() -> subprocess.CompletedProcess: + env = {**os.environ, "EVAL_TEST_REPO": REPO} + return subprocess.run( + ["sh", str(SCRIPT)], + capture_output=True, + text=True, + env=env, + ) + + +class TestTeardown: + def test_closes_open_prs(self): + # Setup: create a PR + env = { + **os.environ, + "EVAL_TEST_REPO": REPO, + "EVAL_CASE_STATE": json.dumps({ + "branch": "eval/teardown-test", + "pr": {"title": "Teardown Test", "state": "open"}, + }), + } + subprocess.run(["sh", str(SETUP)], env=env, capture_output=True) + + # Teardown + result = run_teardown() + assert result.returncode == 0 + + # Verify: no open PRs + check = subprocess.run( + ["gh", "pr", "list", "--repo", REPO, "--state", "open", "--json", "number"], + capture_output=True, text=True, + ) + prs = json.loads(check.stdout) + assert len(prs) == 0 + + def test_deletes_eval_branches(self): + # Setup: create a branch + env = { + **os.environ, + "EVAL_TEST_REPO": REPO, + "EVAL_CASE_STATE": json.dumps({ + "branch": "eval/branch-cleanup", + "pr": {"title": "Branch Cleanup", "state": "open"}, + }), + } + subprocess.run(["sh", str(SETUP)], env=env, capture_output=True) + + # Teardown + run_teardown() + + # Verify: no eval/ branches on remote + check = subprocess.run( + ["gh", "api", f"repos/{REPO}/branches", "--jq", '.[].name'], + capture_output=True, text=True, + ) + branches = check.stdout.strip().split("\n") if check.stdout.strip() else [] + eval_branches = [b for b in branches if b.startswith("eval/")] + assert len(eval_branches) == 0 +``` + +- [ ] **Step 6: Write the teardown script** + +```sh +#!/bin/sh +# teardown_test_repo.sh — reset GitHub test repo to known starting state +# Dependencies: gh, jq +# Platforms: macOS, Linux +# +# Env vars: +# EVAL_TEST_REPO — GitHub repo (owner/name) to reset +# +# Actions: +# 1. Close all open PRs +# 2. Delete all branches matching eval/* pattern +# 3. Verify repo is back to known state (main only, no open PRs) + +set -e + +repo="${EVAL_TEST_REPO:?EVAL_TEST_REPO env var required}" + +# 1. Close all open PRs +open_prs=$(gh pr list --repo "$repo" --state open --json number --jq '.[].number') +for pr in $open_prs; do + gh pr close "$pr" --repo "$repo" --delete-branch 2>/dev/null || true +done + +# 2. Delete any remaining eval/* branches on remote +eval_branches=$(gh api "repos/${repo}/branches" --jq '.[].name | select(startswith("eval/"))' 2>/dev/null || true) +for branch in $eval_branches; do + gh api -X DELETE "repos/${repo}/git/refs/heads/${branch}" 2>/dev/null || true +done + +# 3. Verify +remaining_prs=$(gh pr list --repo "$repo" --state open --json number --jq 'length') +remaining_branches=$(gh api "repos/${repo}/branches" --jq '[.[].name | select(startswith("eval/"))] | length' 2>/dev/null || printf '0') + +if [ "$remaining_prs" = "0" ] && [ "$remaining_branches" = "0" ]; then + printf '{"status": "clean", "open_prs": 0, "eval_branches": 0}\n' +else + printf '{"status": "dirty", "open_prs": %s, "eval_branches": %s}\n' "$remaining_prs" "$remaining_branches" >&2 + exit 2 +fi +``` + +- [ ] **Step 7: Make executable and run tests** + +```bash +chmod +x .claude/skills/evaluate/scripts/setup/teardown_test_repo.sh +``` + +Run: `cd .claude/skills/evaluate/scripts/setup && EVAL_TEST_REPO=trev-gulls/igit-eval-sandbox uv run --no-sync pytest teardown_test_repo_test.py -v` +Expected: All passed + +- [ ] **Step 8: Commit** + +```bash +git add .claude/skills/evaluate/scripts/setup/ +git commit -m "feat(evals): add GitHub test repo setup and teardown scripts + +Setup creates real branches and PRs on a dedicated GitHub test repo. +Teardown closes all PRs, deletes eval/* branches, verifies clean state. +All gh/git commands hit real endpoints — no mocks." +``` + +--- + +## Task 6: Eval Runner Agent + +Generic skill executor agent — runs a skill with mock context, no eval framing visible to the model. + +**Files:** +- Create: `.claude/skills/evaluate/agents/eval-runner.md` + +- [ ] **Step 1: Write the agent definition** + +```markdown +--- +name: eval-runner +description: > + Generic skill executor for eval harness. Receives a skill prompt with + pre-established context and executes the skill workflow to completion. + Do not use directly — dispatched by the evaluate orchestrator. +tools: + - Read + - Bash + - AskUserQuestion +allowed-tools: + - Read + - Bash(git *) + - Bash(gh *) +model: haiku +--- + +You are executing a skill workflow. Follow the skill instructions exactly as written. + +## Your Context + +You have been given: +1. A skill's full instructions (SKILL.md content) +2. Context from prior tool calls (presented as results you already received) +3. Pre-established decisions (answers to questions already asked) +4. A user prompt to trigger the workflow + +## Rules + +- Follow the skill instructions step by step +- Use the context provided — do not re-run discovery commands +- For pre-established decisions, skip the corresponding AskUserQuestion call +- If you reach a decision point not covered by prior decisions, use AskUserQuestion +- Stop when the skill workflow completes or when you reach a blocking decision point +- Report your status when done: "complete" if workflow finished, "blocked" if stopped at a decision point + +## Output + +When done, output a brief status line: +``` +STATUS: complete +``` +or +``` +STATUS: blocked at "" +``` +``` + +- [ ] **Step 2: Verify the file is valid frontmatter** + +Run: `head -20 .claude/skills/evaluate/agents/eval-runner.md` +Expected: Valid YAML frontmatter with `---` delimiters + +- [ ] **Step 3: Commit** + +```bash +git add .claude/skills/evaluate/agents/eval-runner.md +git commit -m "feat(evals): add eval-runner agent definition + +Haiku-powered generic skill executor for eval harness. Receives skill +content + mock context + prior decisions, executes workflow to completion." +``` + +--- + +## Task 7: Eval Grader Agent + +Scores model judgment against `expect` assertions using transcript evidence. + +**Files:** +- Create: `.claude/skills/evaluate/agents/eval-grader.md` + +- [ ] **Step 1: Write the agent definition** + +```markdown +--- +name: eval-grader +description: > + Eval grader for eval harness. Scores model behavior against expected + outcomes by analyzing the transcript. Do not use directly — dispatched + by the evaluate orchestrator. +tools: + - Read +allowed-tools: + - Read +model: sonnet +--- + +You are grading a model's behavior during a skill execution. You will receive: + +1. A transcript of events (from `transcript.json`) — an ordered sequence of text outputs, tool calls, and user interactions +2. A list of expected behaviors to verify + +## Grading Rules + +- Score each expectation independently +- Use the event ordering — if an expectation says "X before Y", verify the sequence +- Base your judgment strictly on the transcript evidence +- A text event mentioning an action is NOT the same as a tool_call performing it +- Be precise: "identifies the PR has merge conflicts" means the model explicitly stated this, not that it ran a command that would reveal it + +## Output Format + +Return a JSON object with this exact structure: + +```json +{ + "case": "", + "results": [ + { + "expect": "", + "pass": true, + "reasoning": "<1-2 sentences citing specific transcript evidence>" + } + ] +} +``` + +Every expectation must appear in results. Do not add expectations not in the input list. + +For `pass: false`, explain specifically what the model did wrong or failed to do, citing event indices or content. +``` + +- [ ] **Step 2: Verify the file is valid frontmatter** + +Run: `head -20 .claude/skills/evaluate/agents/eval-grader.md` +Expected: Valid YAML frontmatter with `---` delimiters + +- [ ] **Step 3: Commit** + +```bash +git add .claude/skills/evaluate/agents/eval-grader.md +git commit -m "feat(evals): add eval-grader agent definition + +Sonnet-powered model judgment scorer. Analyzes transcript events against +expected behaviors, returns structured pass/fail results with reasoning." +``` + +--- + +## Task 8: Orchestrator Skill + +The `/igit:evaluate` skill that ties everything together — discovers eval files, manages test repo lifecycle, dispatches agents, runs assertions, writes reports. + +**Files:** +- Create: `.claude/skills/evaluate/SKILL.md` + +- [ ] **Step 1: Write the orchestrator skill** + +```markdown +--- +name: evaluate +description: > + Use when user asks to "run evals", "evaluate skills", "test skill behavior", + or "/igit:evaluate". Orchestrates the eval harness: discovers eval files, + sets up real GitHub test repo state, dispatches runner and grader agents + per case, runs assertions, tears down, writes report. +version: 0.1.0 +updated: "2026-03-28" +user-invocable: true +disable-model-invocation: false +allowed-tools: + - Read + - Write + - Bash(uv *) + - Bash(sh *) + - Bash(mkdir *) + - Bash(chmod *) + - Bash(date *) + - Bash(cat *) + - Bash(gh *) + - Agent +argument-hint: "[description | tags | file | skill-name]" +--- + +# Evaluate + +Orchestrate eval suite execution for skill behavior testing. + +## Invocation + +``` +/igit:evaluate # Run all eval cases +/igit:evaluate stage2 # Filter by tag +/igit:evaluate stage2_pr_blocked # Filter by case name substring +/igit:evaluate skill_merge_eval # Filter by file name +``` + +## Prerequisites + +- `EVAL_TEST_REPO` env var must be set (e.g. `trev-gulls/igit-eval-sandbox`) +- The test repo must be in a known clean state (main only, no open PRs) +- `gh` must be authenticated with access to the test repo + +## Process + +### Step 1: Verify Prerequisites + +1. Check `EVAL_TEST_REPO` is set +2. Run teardown script to ensure clean starting state: + ```bash + sh .claude/skills/evaluate/scripts/setup/teardown_test_repo.sh + ``` +3. If teardown reports dirty state, stop and report + +### Step 2: Discover and Parse + +1. Find all `evals/*_eval.{yml,yaml}` files +2. Parse each file — extract `suite` description and `cases` list +3. If a filter argument was provided, keep only cases where: + - Any tag matches the filter, OR + - The case name contains the filter, OR + - The file name contains the filter +4. Report: "Found N cases across M suites (filter: )" + +### Step 3: Prepare Run Directory + +```bash +RUN_ID=$(date +%Y-%m-%dT%H%M%S)-$(head -c 3 /dev/urandom | xxd -p | head -c 6) +EVAL_RUN_DIR=".evals/${RUN_ID}" +mkdir -p "$EVAL_RUN_DIR" +``` + +Export `EVAL_RUN_DIR` as an environment variable for hooks and scripts. + +### Step 4: Execute Each Case + +For each case in order: + +#### 4a. Setup Test Repo + +Run the setup script to create real GitHub state for this case: + +```bash +EVAL_TEST_REPO="$EVAL_TEST_REPO" \ +EVAL_CASE_STATE='{"branch": "eval/", "pr": {"title": "", "state": ""}}' \ +sh .claude/skills/evaluate/scripts/setup/setup_test_repo.sh +``` + +Parse the JSON output to get the real PR number and URL. Update the case context with these real values before injecting into the runner prompt. + +#### 4b. Dispatch Eval Runner + +Set `EVAL_CASE_NAME=` env var (for the SubagentStop hook). + +Build the runner prompt from the eval case: + +``` +## Skill Instructions + +/SKILL.md — read the file> + +## Context from Prior Tool Calls + +The following data was returned by running the context script. Use this +data directly — do not re-run the script. + +### merge_context.sh output: + + +## Pre-established Decisions + + +When asked "", you have already decided: "" +Do not ask this question again — proceed with this decision. + +## User Prompt + + +``` + +Dispatch the agent: + +``` +Agent: + subagent_type: eval-runner + prompt: +``` + +The SubagentStop hook (`capture_agent_transcript.sh`) will automatically write `transcript.json` to `$EVAL_RUN_DIR//`. + +#### 4c. Run Deterministic Assertions + +If `assert.tool_calls` is defined: + +```bash +cd .claude/skills/evaluate/scripts/assert +uv run --no-sync python -c " +import json, sys +sys.path.insert(0, '.') +from check_tool_calls import check_tool_calls +transcript = json.load(open('$EVAL_RUN_DIR//transcript.json')) +assertions = json.loads('') +results = check_tool_calls(transcript['events'], assertions) +json.dump(results, open('$EVAL_RUN_DIR//assert_tool_calls.json', 'w'), indent=2) +" +``` + +If `assert.filesystem` is defined, similarly run `check_filesystem`. + +Merge all assertion results into `$EVAL_RUN_DIR//assert.json`. + +#### 4d. Dispatch Eval Grader + +If `expect` is defined: + +``` +Agent: + subagent_type: eval-grader + prompt: | + ## Case: + + ## Transcript + /transcript.json> + + ## Expected Behaviors + + + Return your grading as JSON. +``` + +Write grader output to `$EVAL_RUN_DIR//grader.json`. + +#### 4e. Teardown Case + +Run teardown to reset repo for the next case: + +```bash +sh .claude/skills/evaluate/scripts/setup/teardown_test_repo.sh +``` + +### Step 5: Generate Report + +Collect all results and write `$EVAL_RUN_DIR/report.md`: + +```markdown +## Eval Results — + +### + +| Case | Expect | Assert | Status | +|------|--------|--------|--------| +| | / | / | ✓ PASS / ✗ FAIL | + +### Failures + +**** +- expect: "" + FAIL: "" +- assert.tool_calls: + FAIL: +``` + +### Step 6: Report + +Output the report contents to the user and note the run directory path. +``` + +- [ ] **Step 2: Verify the skill file** + +Run: `head -20 .claude/skills/evaluate/SKILL.md` +Expected: Valid YAML frontmatter + +- [ ] **Step 3: Commit** + +```bash +git add .claude/skills/evaluate/SKILL.md +git commit -m "feat(evals): add evaluate orchestrator skill + +Discovers eval YAML files, sets up real GitHub test repo state per case, +dispatches eval-runner, runs deterministic assertions, dispatches +eval-grader, tears down, aggregates into markdown report." +``` + +--- + +## Task 9: Eval Cases — Stage 1 Scenarios + +Write the first batch of eval cases covering Stage 1 (pre-merge verification). + +**Files:** +- Create: `evals/skill_merge_eval.yaml` (first 4 cases) + +- [ ] **Step 1: Write Stage 1 eval cases** + +```yaml +suite: "Evals for the /merge skill — covers all three stages and known edge cases" + +cases: + # --- Stage 1: Pre-merge Verification --- + + - name: "stage1_task_tracking_not_detected" + description: "Model asks user which task tracking tool they use when detection returns none" + skill: igit:merge + tags: [stage1, task-tracking] + + setup: + repo_state: + branch: "eval/stage1-tt-not-detected" + pr: { title: "Stage 1 TT not detected", state: "open" } + + prompt: > + I'm ready to merge. Let's go. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/stage1-tt-not-detected" + base_branch: "main" + pr: + number: 0 + url: "" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Stage 1 TT not detected" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: [] + + exits_at: "Do you use a task tracking tool for this project?" + + expect: + - "Asks the user which task tracking tool they use before proceeding" + - "Presents options including docs/work, Linear, Jira, GitHub Projects, Asana, and skip" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "stage1_all_ac_met" + description: "Model checks off items and proceeds to Stage 2 without asking when all AC met" + skill: igit:merge + tags: [stage1, task-tracking, happy-path] + + setup: + repo_state: + branch: "eval/stage1-all-ac-met" + pr: { title: "Stage 1 all AC met", state: "open" } + + prompt: > + Ship it. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/stage1-all-ac-met" + base_branch: "main" + pr: + number: 0 + url: "" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Stage 1 all AC met" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "docs-work" + source: "fs" + work_items: + - path: "docs/work/active/2026-03-20-add-search.md" + title: "Add search feature" + acceptance_criteria: + - criterion: "Search endpoint returns results" + met: true + - criterion: "Unit tests pass" + met: true + + prior_decisions: [] + + exits_at: "Who merges PR #?" + + expect: + - "Narrates that all acceptance criteria are already met" + - "Proceeds to Stage 2 merge decision without asking about task resolution" + - "Does not present the Item/Action panel" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "stage1_unmet_ac" + description: "Model presents Item+Action panel for unmet acceptance criteria" + skill: igit:merge + tags: [stage1, task-tracking] + + setup: + repo_state: + branch: "eval/stage1-unmet-ac" + pr: { title: "Stage 1 unmet AC", state: "open" } + + prompt: > + Let's merge this. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/stage1-unmet-ac" + base_branch: "main" + pr: + number: 0 + url: "" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Stage 1 unmet AC" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "docs-work" + source: "fs" + work_items: + - path: "docs/work/active/2026-03-20-add-search.md" + title: "Add search feature" + acceptance_criteria: + - criterion: "Search endpoint returns results" + met: true + - criterion: "Unit tests pass" + met: false + - criterion: "Documentation updated" + met: false + + prior_decisions: [] + + exits_at: "Action for Add search feature" + + expect: + - "Identifies that 2 acceptance criteria are unmet" + - "Presents an AskUserQuestion with the Item/Action panel pattern" + - "Lists 'Unit tests pass' and 'Documentation updated' as unmet criteria" + - "Does not proceed to merge before resolving the unmet criteria" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "stage1_skip_task_verification" + description: "Model proceeds to Stage 2 when user chooses to skip task verification" + skill: igit:merge + tags: [stage1, task-tracking] + + setup: + repo_state: + branch: "eval/stage1-skip-tt" + pr: { title: "Stage 1 skip TT", state: "open" } + + prompt: > + Merge my branch please. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/stage1-skip-tt" + base_branch: "main" + pr: + number: 0 + url: "" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Stage 1 skip TT" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + + exits_at: "Who merges PR #?" + + expect: + - "Skips task verification after receiving the skip decision" + - "Proceeds directly to Stage 2 merge decision" + + assert: + tool_calls: + - never: "gh pr merge" +``` + +- [ ] **Step 2: Validate YAML syntax** + +Run: `uv run --no-sync python -c "import yaml; yaml.safe_load(open('evals/skill_merge_eval.yaml')); print('valid')"` +Expected: `valid` + +- [ ] **Step 3: Commit** + +```bash +git add evals/skill_merge_eval.yaml +git commit -m "feat(evals): add Stage 1 eval cases for /merge skill + +Four cases: task tracking not detected, all AC met, unmet AC with +Item/Action panel, and skip task verification flow." +``` + +--- + +## Task 10: Eval Cases — Stage 2 Scenarios + +Add Stage 2 (merge) eval cases to the existing YAML file. + +**Files:** +- Modify: `evals/skill_merge_eval.yaml` + +- [ ] **Step 1: Append Stage 2 cases** + +Add the following cases after the Stage 1 cases in `evals/skill_merge_eval.yaml`: + +```yaml + # --- Stage 2: Merge --- + + - name: "stage2_already_merged" + description: "Model acknowledges already-merged PR and surfaces local_state issues" + skill: igit:merge + tags: [stage2, already-merged] + + setup: + repo_state: + branch: "eval/stage2-already-merged" + pr: { title: "Stage 2 already merged", state: "merged" } + + prompt: > + Merge this branch. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/stage2-already-merged" + base_branch: "main" + pr: + number: 0 + url: "" + state: "merged" + mergeable: "UNKNOWN" + merge_state_status: "UNKNOWN" + title: "Stage 2 already merged" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: true + unpushed_commits: 2 + base_out_of_date: true + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + + exits_at: null + + expect: + - "Acknowledges that the PR was already merged" + - "Surfaces uncommitted changes warning" + - "Surfaces 2 unpushed commits warning" + - "Surfaces base branch out of date warning" + - "Proceeds to Stage 3 cleanup without asking to merge" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "stage2_pr_blocked_conflicts" + description: "Model surfaces conflict blocker and stops without offering to merge" + skill: igit:merge + tags: [stage2, blocked] + + setup: + repo_state: + branch: "eval/stage2-blocked" + pr: { title: "Stage 2 blocked", state: "open" } + + prompt: > + I'm ready to merge my feature branch. Please run the merge workflow. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/stage2-blocked" + base_branch: "main" + pr: + number: 0 + url: "" + state: "open" + mergeable: "CONFLICTING" + merge_state_status: "DIRTY" + title: "Stage 2 blocked" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + + exits_at: null + + expect: + - "Identifies the PR has merge conflicts before taking any action" + - "Names the specific blocker and tells the user how to resolve it" + - "Stops after surfacing the blocker — does not attempt Stage 2 or 3" + + assert: + tool_calls: + - never: "gh pr merge" + - never: "git branch --delete" + + - name: "stage2_claude_merges_squash" + description: "Model runs gh pr merge --squash when Claude merges with squash method" + skill: igit:merge + tags: [stage2, merge, squash] + + setup: + repo_state: + branch: "eval/stage2-squash" + pr: { title: "Stage 2 squash merge", state: "open" } + + prompt: > + Let's merge this PR. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/stage2-squash" + base_branch: "main" + pr: + number: 0 + url: "" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Stage 2 squash merge" + merge_methods: ["merge", "squash", "rebase"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + - question: "Who merges PR #?" + answer: "Claude merges" + - question: "Merge method" + answer: "Squash" + + exits_at: null + + expect: + - "Runs gh pr merge with the --squash flag, not --merge or --rebase" + - "Verifies the merge succeeded by checking PR state" + + assert: + tool_calls: + - once: "gh pr merge" + - contains: "--squash" + - never: "--rebase" +``` + +Note: `assert.tool_calls` uses `contains` instead of `once: "gh pr merge 15 --squash"` because the PR number is assigned dynamically by the setup script. + +- [ ] **Step 2: Validate YAML syntax** + +Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'{len(d[\"cases\"])} cases')"` +Expected: `7 cases` + +- [ ] **Step 3: Commit** + +```bash +git add evals/skill_merge_eval.yaml +git commit -m "feat(evals): add Stage 2 eval cases for /merge skill + +Three cases: already merged with local state warnings, PR blocked by +conflicts, and Claude merges with squash method." +``` + +--- + +## Task 11: Eval Cases — Stage 3 Scenarios + +Add Stage 3 (cleanup) eval cases. + +**Files:** +- Modify: `evals/skill_merge_eval.yaml` + +- [ ] **Step 1: Append Stage 3 cases** + +```yaml + # --- Stage 3: Post-merge Cleanup --- + + - name: "stage3_safe_delete_succeeds" + description: "Model runs git branch --delete (not --force) for cleanup" + skill: igit:merge + tags: [stage3, cleanup] + + setup: + repo_state: + branch: "eval/stage3-safe-delete" + pr: { title: "Stage 3 safe delete", state: "open" } + + prompt: > + Merge and clean up. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/stage3-safe-delete" + base_branch: "main" + pr: + number: 0 + url: "" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Stage 3 safe delete" + merge_methods: ["merge", "squash"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + - question: "Who merges PR #?" + answer: "Claude merges" + - question: "Merge method" + answer: "Merge Commit" + + exits_at: null + + expect: + - "Switches to main branch and pulls after merge" + - "Runs git branch --delete (safe delete, not --force) for cleanup" + - "Uses long flag --delete, not short -d" + + assert: + tool_calls: + - contains: "git branch --delete" + - never: "git branch --delete --force" + - never: "git branch -d" + - never: "git branch -D" + + - name: "stage3_safe_delete_fails_user_declines" + description: "Model leaves branch when user declines force-delete after safe delete fails" + skill: igit:merge + tags: [stage3, cleanup, force-delete] + + setup: + repo_state: + branch: "eval/stage3-decline-force" + pr: { title: "Stage 3 decline force", state: "open" } + + prompt: > + Merge and clean up. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/stage3-decline-force" + base_branch: "main" + pr: + number: 0 + url: "" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Stage 3 decline force" + merge_methods: ["squash"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + - question: "Who merges PR #?" + answer: "Claude merges" + - question: "Merge method" + answer: "Squash" + - question: "Force-delete branch 'eval/stage3-decline-force'?" + answer: "No — leave the branch" + + exits_at: null + + expect: + - "Attempts safe delete first (git branch --delete)" + - "After safe delete fails, explains why (squash merge)" + - "Accepts user's decline and does not force-delete" + - "Leaves the branch intact in final summary" + + assert: + tool_calls: + - never: "git branch --delete --force" + - never: "git branch -D" +``` + +- [ ] **Step 2: Validate YAML syntax** + +Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'{len(d[\"cases\"])} cases')"` +Expected: `9 cases` + +- [ ] **Step 3: Commit** + +```bash +git add evals/skill_merge_eval.yaml +git commit -m "feat(evals): add Stage 3 eval cases for /merge skill + +Two cases: safe branch delete succeeds (no --force), and user declines +force-delete after safe delete fails on squash merge." +``` + +--- + +## Task 12: Eval Cases — Bug Regression Scenarios + +Add eval cases covering bugs 2–4 from the live test log. Bug 5 deferred to separate work item. + +**Files:** +- Modify: `evals/skill_merge_eval.yaml` + +- [ ] **Step 1: Append bug regression cases** + +```yaml + # --- Bug Regressions --- + + - name: "bug2_mcp_elif_eats_fs_detection" + description: "Task tracking returns none when settings.json exists but no MCP servers match, even though docs/work/active/ is present" + skill: igit:merge + tags: [regression, bug2, task-tracking] + + setup: + repo_state: + branch: "eval/bug2-mcp-elif" + pr: { title: "Bug 2 MCP elif", state: "open" } + + prompt: > + Ready to merge. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/bug2-mcp-elif" + base_branch: "main" + pr: + number: 0 + url: "" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Bug 2 MCP elif" + merge_methods: ["merge", "squash"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: [] + + exits_at: "Do you use a task tracking tool for this project?" + + expect: + - "Asks about task tracking because context reported tool=none" + - "Does not assume docs/work detection succeeded — respects the context data as given" + + assert: + tool_calls: + - never: "gh pr merge" + + - name: "bug3_pr_url_missing_manual_merge" + description: "Model presents manual merge option with the actual PR URL interpolated" + skill: igit:merge + tags: [regression, bug3, merge] + + setup: + repo_state: + branch: "eval/bug3-pr-url" + pr: { title: "Bug 3 PR URL", state: "open" } + + prompt: > + Merge this please. + + context: + - source: script_invocation + script: "skills/merge/scripts/merge_context.sh" + output: + branch: "eval/bug3-pr-url" + base_branch: "main" + pr: + number: 0 + url: "" + state: "open" + mergeable: "MERGEABLE" + merge_state_status: "CLEAN" + title: "Bug 3 PR URL" + merge_methods: ["merge", "squash"] + local_state: + uncommitted_changes: false + unpushed_commits: 0 + base_out_of_date: false + worktrees: [] + task_tracking: + tool: "none" + source: "none" + work_items: [] + + prior_decisions: + - question: "Do you use a task tracking tool for this project?" + answer: "No — skip task verification" + + exits_at: "Who merges PR #?" + + expect: + - "Presents 'I'll merge manually' option" + - "The manual merge option description includes the actual PR URL" + + assert: + tool_calls: + - never: "gh pr merge" +``` + +Note: Bug 4 (main worktree spurious remove) is excluded from this initial batch — it requires worktree setup that goes beyond happy-path repo state. Will be added when non-happy-path setup is implemented. + +- [ ] **Step 2: Validate YAML and count cases** + +Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'{len(d[\"cases\"])} cases'); [print(f' - {c[\"name\"]}') for c in d['cases']]"` +Expected: `11 cases` with all names listed + +- [ ] **Step 3: Commit** + +```bash +git add evals/skill_merge_eval.yaml +git commit -m "feat(evals): add bug regression eval cases for /merge skill + +Two cases from live test bugs: Bug 2 (MCP elif eats fs detection), +Bug 3 (PR URL missing from manual merge option). Bug 4 deferred — +requires worktree setup beyond happy path. Bug 5 deferred to separate +work item (implicit checklist feature doesn't exist yet)." +``` + +--- + +## Task 13: Register SubagentStop Hook + +Add the transcript capture hook to the project settings so it fires when eval-runner agents complete. + +**Files:** +- Create or modify: `.claude/settings.json` + +- [ ] **Step 1: Check if `.claude/settings.json` exists** + +Run: `cat .claude/settings.json 2>/dev/null || echo "does not exist"` + +- [ ] **Step 2: Create or update settings with SubagentStop hook** + +If file doesn't exist, create `.claude/settings.json`: + +```json +{ + "hooks": { + "SubagentStop": [ + { + "matcher": "eval-runner", + "hooks": [ + { + "type": "command", + "command": "sh .claude/skills/evaluate/scripts/hooks/capture_agent_transcript.sh" + } + ] + } + ] + } +} +``` + +If file exists, merge the `SubagentStop` hook into the existing `hooks` object using `jq`. + +- [ ] **Step 3: Validate JSON** + +Run: `jq . .claude/settings.json > /dev/null && echo "valid"` +Expected: `valid` + +- [ ] **Step 4: Commit** + +```bash +git add .claude/settings.json +git commit -m "feat(evals): register SubagentStop hook for transcript capture + +Fires capture_agent_transcript.sh when eval-runner agent completes, +writing extracted events to the eval run output directory." +``` + +--- + +## Task 14: Final Validation + +Verify the complete system structure. + +- [ ] **Step 1: Verify `.evals/` is in `.gitignore`** + +Run: `grep -q '.evals/' .gitignore && echo "OK" || echo "MISSING"` +Expected: `OK` + +- [ ] **Step 2: Run all assertion script tests** + +Run: `uv run --no-sync pytest .claude/skills/evaluate/scripts/assert/ -v` +Expected: All passed + +- [ ] **Step 3: Validate all YAML eval files** + +Run: `uv run --no-sync python -c "import yaml; d=yaml.safe_load(open('evals/skill_merge_eval.yaml')); print(f'Suite: {d[\"suite\"]}'); print(f'Cases: {len(d[\"cases\"])}')"` +Expected: `Suite: Evals for the /merge skill...` and `Cases: 11` + +- [ ] **Step 4: Verify file structure matches design** + +Run: `find .claude/skills/evaluate evals -type f | sort` +Expected: +``` +.claude/skills/evaluate/SKILL.md +.claude/skills/evaluate/agents/eval-grader.md +.claude/skills/evaluate/agents/eval-runner.md +.claude/skills/evaluate/scripts/assert/check_filesystem.py +.claude/skills/evaluate/scripts/assert/check_filesystem_test.py +.claude/skills/evaluate/scripts/assert/check_tool_calls.py +.claude/skills/evaluate/scripts/assert/check_tool_calls_test.py +.claude/skills/evaluate/scripts/hooks/capture_agent_transcript.sh +.claude/skills/evaluate/scripts/hooks/capture_agent_transcript_test.py +.claude/skills/evaluate/scripts/setup/setup_test_repo.sh +.claude/skills/evaluate/scripts/setup/setup_test_repo_test.py +.claude/skills/evaluate/scripts/setup/teardown_test_repo.sh +.claude/skills/evaluate/scripts/setup/teardown_test_repo_test.py +evals/skill_merge_eval.yaml +``` + +- [ ] **Step 5: Commit any remaining changes** + +```bash +git add -A +git commit -m "chore(evals): final validation — all tests pass, structure matches design" +``` + +--- + +## Notes + +**Bug 4 (main worktree spurious remove)** is excluded from the initial 11 cases. It requires worktree setup beyond happy-path repo state. Will be added when non-happy-path setup scripts are implemented. + +**Bug 5 (implicit merge checklist)** is deferred to its own work item: `docs/work/backlog/2026-03-27-merge-implicit-checklist.md`. The feature doesn't exist yet — the eval case will be added when the implicit checklist is implemented. + +**Dynamic PR numbers:** The eval case YAML uses `pr.number: 0` and `pr.url: ""` as placeholders. The orchestrator replaces these with real values from `setup_test_repo.sh` output before injecting context into the runner prompt. Similarly, `exits_at` and `prior_decisions` using `#` are interpolated at runtime. + +**AskUserQuestion behavior in agents** is an open question per the design spec. The eval-runner uses `exits_at` to define expected stopping points, but actual behavior may differ. Task 6's agent definition instructs the model to use AskUserQuestion when it reaches unresolved decision points — empirical testing will determine if this works or needs revision. + +**PyYAML dependency**: The orchestrator skill parses YAML. Ensure `pyyaml` is in the project's dev dependencies (`uv add --dev pyyaml` if not present). + +**Test repo creation:** Before first run, create `trev-gulls/igit-eval-sandbox` with a single initial commit on main. The teardown script assumes this starting state. + +**Setup/teardown tests are heavy:** The `setup_test_repo_test.py` and `teardown_test_repo_test.py` tests hit real GitHub API. They are skipped unless `EVAL_TEST_REPO` is set. Not for CI. diff --git a/docs/work/backlog/2026-03-27-merge-skill-evals.md b/docs/work/active/2026-03-27-merge-skill-evals.md similarity index 72% rename from docs/work/backlog/2026-03-27-merge-skill-evals.md rename to docs/work/active/2026-03-27-merge-skill-evals.md index a398e57..a68c618 100644 --- a/docs/work/backlog/2026-03-27-merge-skill-evals.md +++ b/docs/work/active/2026-03-27-merge-skill-evals.md @@ -1,9 +1,8 @@ --- -status: pending +status: active created: 2026-03-27 updated: 2026-03-27 -blocked-by: - - docs/work/active/2026-03-27-merge-skill.md +blocked-by: [] --- # Add evals for /merge skill model behavior @@ -19,6 +18,8 @@ natural language output quality) rather than testing deterministic code paths. ### Eval scenarios to cover +From original backlog: + - [ ] **Stage 1 — task tracking not detected**: Model asks the user which tool they use - [ ] **Stage 1 — all AC met**: Model checks off items and proceeds without asking - [ ] **Stage 1 — unmet AC**: Model presents Item+Action panel and applies chosen action @@ -28,6 +29,13 @@ natural language output quality) rather than testing deterministic code paths. - [ ] **Stage 3 — safe delete succeeds**: Model runs `git branch --delete`, not `--delete --force` - [ ] **Stage 3 — safe delete fails, user declines force**: Model leaves branch, does not retry +From live test (see `docs/ops/igit-merge-test-and-debug.md`): + +- [ ] **Bug 2 — MCP elif eats fs detection**: Task tracking returns `none` when `~/.claude/settings.json` exists but has no matching MCP servers, even though `docs/work/active/` is present +- [ ] **Bug 3 — PR URL missing from manual merge option**: Model presents manual merge option without interpolating `pr.url` into the description +- [ ] **Bug 4 — Main worktree triggers spurious worktree remove** — deferred to non-happy-path setup phase (requires worktree state beyond basic repo setup) +- ~**Bug 5 — Implicit merge checklist not enforced**~ — deferred to `docs/work/backlog/2026-03-27-merge-implicit-checklist.md` (feature doesn't exist yet) + ### Eval approach Each eval: @@ -52,3 +60,5 @@ language output, correct tool selection under different context states. Consider whether evals should run in CI or remain a manual verification step given cost and latency. + +See `docs/ops/igit-merge-test-and-debug.md` for the full bug list from live testing. diff --git a/docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md b/docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md new file mode 100644 index 0000000..e3c10d4 --- /dev/null +++ b/docs/work/backlog/2026-03-27-merge-implicit-checklist-bugfix.md @@ -0,0 +1,42 @@ +--- +status: pending +created: 2026-03-27 +updated: 2026-03-28 +blocked-by: [] +--- + +# Add implicit pre-merge checklist to /merge skill Stage 1 + +## Goal + +Stage 1 currently only checks explicit acceptance criteria from `task_tracking.work_items`. It should also run an implicit checklist covering common pre-merge hygiene that applies regardless of task tracking state. + +## Context + +Bug 5 from live testing (`docs/ops/igit-merge-test-and-debug.md`): the /merge skill proceeded through Stage 1 without flagging that PR #14 lacked a version bump. The bump had to be applied post-merge directly to main. + +## Acceptance Criteria + +- [ ] Stage 1 runs an implicit checklist before proceeding to Stage 2 +- [ ] Checklist candidates (verify which are feasible to auto-check): + - Tests passing (run test command from CLAUDE.md if available) + - Version bumped if project releases versioned artifacts (packages, containers, apps, infrastructure, configs — not just plugins) + - Project documentation updated (README.md, CLAUDE.md, docs/, etc.) if conventions changed + - No untracked files that should have been committed +- [ ] Checklist configuration lives in a known, discoverable location + - Candidate locations: README.md, DEPLOY.md, RELEASE.md, CLAUDE.md, MERGE.md, or `docs/{i,}git/*.md` + - Security consideration: unprotected checklist files could be manipulated to skip checks +- [ ] Checklist results surfaced to user before proceeding +- [ ] Eval case added to `evals/skill_merge_eval.yaml` covering the implicit checklist behavior + +## Design Questions + +- Where should the checklist configuration live? Needs to be discoverable but tamper-resistant. +- Should the checklist be project-configurable or a fixed set of checks? +- How does the checklist interact with the task tracking work items (complement vs overlap)? + +## References + +- `docs/ops/igit-merge-test-and-debug.md` — Bug 5 +- `docs/work/active/2026-03-27-merge-skill-evals.md` — original Bug 5 scenario +- PR #15 review comment threads (lines 95, 96, 99) diff --git a/docs/work/backlog/2026-03-28-merge-context-bugs.md b/docs/work/backlog/2026-03-28-merge-context-bugs.md new file mode 100644 index 0000000..011cb5f --- /dev/null +++ b/docs/work/backlog/2026-03-28-merge-context-bugs.md @@ -0,0 +1,48 @@ +--- +status: pending +created: 2026-03-28 +updated: 2026-03-28 +blocked-by: [] +--- + +# Fix merge_context.sh bugs (1, 2, 4) + +## Goal + +Fix three bugs in `merge_context.sh` discovered during live testing of PR #14. + +## Bugs + +### Bug 1 — Structured error reporting for sandboxed gh failures + +`gh` silently fails in sandboxed `sh` subprocess. The `|| printf '{}'` fallback swallows errors, leaving all PR fields at defaults. The model's SKILL.md fallback path never triggers because the script appears to succeed. + +**Fix:** On `gh` failure, return structured error: `{ "cmd": "...", "error": "...", "hint": "Try running gh directly (unsandboxed)" }`. Script still exits 0 so the model receives partial output and can distinguish "no PR exists" from "gh failed." + +### Bug 2 — Parallel task tracking detection with priority aggregation + +The `if/elif` chain means `~/.claude/settings.json` existing (any Claude Code user) blocks filesystem detection. Redesign as parallel: run all methods (env, mcp, cli, fs, cfg) independently, aggregate by priority. + +**Fix:** Replace sequential `if/elif` with independent checks. Each method sets a result. Final aggregation picks highest-priority winner: env > mcp > cli > fs > cfg. + +### Bug 4 — Filter worktrees to PR-associated resources + +Main working tree appears in `worktrees[]`, which could trigger spurious `git worktree remove`. Only worktrees associated with the PR branch should be listed. + +**Fix:** Exclude main worktree by comparing each path against the first entry from `git worktree list --porcelain`. Only include worktrees on the current branch. + +## Acceptance Criteria + +- [ ] Bug 1: `gh` failures produce structured error objects, not empty `{}` +- [ ] Bug 1: Model can distinguish "no PR" from "gh failed" +- [ ] Bug 2: `docs/work/active/` detected even when `~/.claude/settings.json` exists +- [ ] Bug 2: All detection methods run independently with priority aggregation +- [ ] Bug 4: Main worktree excluded from `worktrees[]` +- [ ] Bug 4: Only PR-branch worktrees included +- [ ] All existing tests pass; new tests cover each fix +- [ ] SKILL.md fallback logic updated if needed for Bug 1 error format + +## References + +- `docs/ops/igit-merge-test-and-debug.md` — Bugs 1, 2, 4 +- PR #15 review comment threads (lines 13, 28, 47, 81) diff --git a/docs/work/backlog/2026-03-28-merge-skill-bugs.md b/docs/work/backlog/2026-03-28-merge-skill-bugs.md new file mode 100644 index 0000000..1f60a1c --- /dev/null +++ b/docs/work/backlog/2026-03-28-merge-skill-bugs.md @@ -0,0 +1,40 @@ +--- +status: pending +created: 2026-03-28 +updated: 2026-03-28 +blocked-by: [] +--- + +# Fix /merge SKILL.md bugs (3, 6) + +## Goal + +Fix two bugs in the /merge skill definition discovered during live testing. + +## Bugs + +### Bug 3 — PR URL in AskUserQuestion preview field + +The "I'll merge manually" option description doesn't include the PR URL. User has to look it up separately. + +**Fix:** Use AskUserQuestion's `preview` field for the PR URL. Needs empirical testing first — unknown whether `` HTML links render in preview. If not, use plain URL text. + +### Bug 6 — Stage 3 cleanup order (worktree before branch) + +SKILL.md has "Delete Local Branch" before "Remove Worktree." `git branch --delete` refuses if the branch is checked out in any worktree: `error: cannot delete branch used by worktree`. + +**Fix:** Swap the order in Stage 3: remove worktree first, then delete branch. Update the summary section to match. + +## Acceptance Criteria + +- [ ] Bug 3: Test what AskUserQuestion `preview` field renders (plain text, markdown, HTML) +- [ ] Bug 3: PR URL visible in the "I'll merge manually" option +- [ ] Bug 6: Stage 3 cleanup order is: switch+pull → remove worktree → delete branch +- [ ] Bug 6: Summary section reflects correct order +- [ ] Eval case added for cleanup order (worktree removal before branch deletion) + +## References + +- `docs/ops/igit-merge-test-and-debug.md` — Bug 3 +- PR #15 review comment thread (line 66) +- Live testing session 2026-03-28 — Bug 6 diff --git a/docs/work/backlog/2026-03-28-policy-hook-bare-push-tags.md b/docs/work/backlog/2026-03-28-policy-hook-bare-push-tags.md new file mode 100644 index 0000000..fd58c38 --- /dev/null +++ b/docs/work/backlog/2026-03-28-policy-hook-bare-push-tags.md @@ -0,0 +1,31 @@ +--- +status: pending +created: 2026-03-28 +updated: 2026-03-28 +blocked-by: [] +--- + +# Handle bare git push and review tag policy in git_policy_hook + +## Goal + +Prevent bare `git push` from bypassing push-to-main protection, and review whether tag pushes should require confirmation. + +## Context + +Observation from live testing: `git tag v0.11.0 5c291a4 && git push && git push origin v0.11.0` pushed directly to main unblocked because bare `git push` (no explicit ref) doesn't match the main/master pattern in the hook. + +## Acceptance Criteria + +- [ ] Bare `git push` on main/master is blocked or requires confirmation +- [ ] Two implementation options evaluated: + 1. Hook resolves bare `git push` to actual remote/ref before applying policies + 2. Hook rejects commands without explicit remote and ref +- [ ] Tag push policy reviewed — bump from `allow` to `ask` if appropriate +- [ ] Existing tests pass; new tests cover bare push and tag push scenarios +- [ ] Release process documented to work with the new policy + +## References + +- `docs/ops/igit-merge-test-and-debug.md` — Observation: bare git push +- PR #15 review comment thread (lines 103, 111)