feat: diff-based test selection for E2E and LLM-judge evals (v0.6.1.0) by garrytan · Pull Request #139 · garrytan/gstack

garrytan · 2026-03-17T18:30:34Z

Summary

E2E and LLM-judge tests are now selected automatically from the git diff: only tests whose file dependencies changed are run
Each test declares the files it depends on (its touchfiles) in test/helpers/touchfiles.ts; 28 E2E and 9 LLM-judge tests are mapped
Changes to global touchfiles (session-runner, eval-store, gen-skill-docs) trigger all tests
bun run eval:select previews which tests would run
A completeness test ensures every testName has a TOUCHFILES entry, so omissions are caught at bun test time
New scripts: test:e2e:all, test:evals:all, eval:select
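The selection rule above fits in a few lines. This is a minimal TypeScript sketch, not the code in test/helpers/touchfiles.ts. The `Touchfiles` type, the `selectTests(changedFiles, touchfiles, globalTouchfiles)` signature, and the example test names and paths are all assumptions. Dependencies are compared by exact path here, whereas the PR's selector uses glob patterns.

```typescript
// Hypothetical shape: test name -> files that test depends on ("touchfiles").
type Touchfiles = Record<string, string[]>;

interface Selection {
  selected: string[];
  reason: "global" | "diff";
}

// Sketch only: exact path comparison stands in for the real glob matching.
function selectTests(
  changedFiles: string[],
  touchfiles: Touchfiles,
  globalTouchfiles: string[],
): Selection {
  const changed = new Set(changedFiles);

  // A change to any shared file (session runner, eval store, ...) selects every test.
  if (globalTouchfiles.some((file) => changed.has(file))) {
    return { selected: Object.keys(touchfiles), reason: "global" };
  }

  // Otherwise keep only tests with at least one changed dependency.
  const selected = Object.entries(touchfiles)
    .filter(([, deps]) => deps.some((dep) => changed.has(dep)))
    .map(([name]) => name);
  return { selected, reason: "diff" };
}

// Example with made-up test names and paths.
const TOUCHFILES: Touchfiles = {
  "browse-basic": ["browse/SKILL.md", "browse/src/cli.ts"],
  "qa-report": ["qa/SKILL.md"],
};
const GLOBAL_TOUCHFILES = ["test/helpers/session-runner.ts"];

console.log(selectTests(["qa/SKILL.md"], TOUCHFILES, GLOBAL_TOUCHFILES).selected);
// selected: ["qa-report"]
```

Checking the global list first means any change to shared infrastructure short-circuits to a full run. That is the behavior the summary describes for session-runner, eval-store, and gen-skill-docs.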

Test Coverage

All new code paths are covered by 21 unit tests:

matchGlob(): glob pattern matching (exact, *, **, dot escaping)
selectTests(): per-test selection, global touchfile triggers, union of multiple diffs
detectBaseBranch(): fallback chain, exercised against temporary git repos
TOUCHFILES completeness: validates that every testName has an entry
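The four glob behaviors tested for matchGlob() can be implemented by translating the pattern into an anchored regular expression. This is a sketch, not the PR's implementation. The `(file, pattern)` argument order and the treatment of `**/` as "zero or more directories" are assumptions. Other glob syntax such as `?` and `[...]` is matched literally.

```typescript
// Regex metacharacters that must be escaped to match literally (e.g. the "." in "SKILL.md").
const REGEX_SPECIALS = "\\^$.|?+()[]{}";

function globToRegExp(pattern: string): RegExp {
  let re = "";
  for (let i = 0; i < pattern.length; i++) {
    const c = pattern.charAt(i);
    if (c === "*" && pattern.charAt(i + 1) === "*") {
      if (pattern.charAt(i + 2) === "/") {
        re += "(?:.*/)?"; // "**/" = zero or more whole directories
        i += 2; // consume the second "*" and the "/"
      } else {
        re += ".*"; // other "**" = anything, including "/"
        i += 1; // consume the second "*"
      }
    } else if (c === "*") {
      re += "[^/]*"; // a single "*" stays within one path segment
    } else if (REGEX_SPECIALS.includes(c)) {
      re += "\\" + c; // escape so it matches literally
    } else {
      re += c;
    }
  }
  return new RegExp("^" + re + "$");
}

// Hypothetical signature: true if `file` matches the glob `pattern`.
function matchGlob(file: string, pattern: string): boolean {
  return globToRegExp(pattern).test(file);
}

console.log(matchGlob("browse/src/cli.ts", "browse/*.ts")); // false: "*" does not cross "/"
console.log(matchGlob("browse/src/cli.ts", "browse/**/*.ts")); // true
```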

Pre-Landing Review

No issues found.

TODOS

No TODO items completed in this PR.

Test plan

All free tests pass (the 21 touchfiles tests plus the existing browse/skill validation)
bun run eval:select produces correct output
Completeness test catches missing TOUCHFILES entries (verified during merge with main)
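A completeness check like this one reduces to a set-membership test over the registered test names. The helper name `missingTouchfiles` and how the list of test names is collected are assumptions for this sketch. In the PR, the check runs as part of bun test.

```typescript
// Hypothetical shape: test name -> files that test depends on.
type Touchfiles = Record<string, string[]>;

// Return registered test names with no TOUCHFILES entry (deduplicated, sorted).
function missingTouchfiles(testNames: string[], touchfiles: Touchfiles): string[] {
  const missing = testNames.filter(
    (name) => !Object.prototype.hasOwnProperty.call(touchfiles, name),
  );
  return Array.from(new Set(missing)).sort();
}

// Example with made-up names: "qa-report" was added without a mapping.
const TOUCHFILES: Touchfiles = { "browse-basic": ["browse/SKILL.md"] };
const registered = ["browse-basic", "qa-report"];
console.log(missingTouchfiles(registered, TOUCHFILES)); // ["qa-report"]
```

Inside a bun:test file, the natural assertion would be `expect(missingTouchfiles(names, TOUCHFILES)).toEqual([])`. On failure, the diff output lists exactly which tests are unmapped.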

🤖 Generated with Claude Code

Each test declares file dependencies in a TOUCHFILES map. The test runner checks git diff against the base branch and only runs tests whose dependencies were modified. Global touchfiles (session-runner, eval-store, gen-skill-docs) trigger all tests.
New scripts: test:e2e:all, test:evals:all, eval:select

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
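Diffing against the base branch first requires finding it. This is a minimal sketch of such a fallback chain. It assumes the candidate order origin/main, origin/master, main, master, and the real detectBaseBranch() may use a different chain. The first ref that `git rev-parse --verify` accepts wins, and `git diff --name-only <base>...HEAD` then lists the changed files. The ref-existence check is passed in as a function so the fallback logic can be tested without a repository. The helper names `gitRefExists` and `changedFilesSince` are hypothetical.

```typescript
import { execFileSync } from "node:child_process";

// Assumed fallback order; the real helper may differ.
const BASE_CANDIDATES = ["origin/main", "origin/master", "main", "master"];

// True if `ref` resolves in the git repo at `cwd` (a non-zero exit means false).
function gitRefExists(ref: string, cwd: string): boolean {
  try {
    execFileSync("git", ["rev-parse", "--verify", "--quiet", ref], {
      cwd,
      stdio: "ignore",
    });
    return true;
  } catch {
    return false;
  }
}

// Walk the candidates in order and return the first ref that exists, or null.
function detectBaseBranch(
  refExists: (ref: string) => boolean,
  candidates: string[] = BASE_CANDIDATES,
): string | null {
  for (const ref of candidates) {
    if (refExists(ref)) return ref;
  }
  return null;
}

// Files changed on this branch since it diverged from `base` (three-dot diff).
function changedFilesSince(base: string, cwd: string): string[] {
  const out = execFileSync("git", ["diff", "--name-only", `${base}...HEAD`], {
    cwd,
    encoding: "utf8",
  });
  return out.split("\n").filter((line) => line.length > 0);
}

// Usage (requires a git checkout):
// const cwd = process.cwd();
// const base = detectBaseBranch((ref) => gitRefExists(ref, cwd));
// const changed = base ? changedFilesSince(base, cwd) : [];
```

The three-dot form compares HEAD with the merge base. As a result, commits that landed on main after the branch was cut do not show up as "changed" and do not trigger extra tests.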

…optimization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… hints
The test was flaky at 20 turns because the agent reads a 300-line SKILL.md, navigates, extracts design data, and writes a report. Added hints (skip the preamble, batch commands, write the report early) while still testing against the real SKILL.md. It now completes consistently in ~13 turns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…optimization

garrytan and others added 5 commits March 17, 2026 11:28

Merge remote-tracking branch 'origin/main' into garrytan/gstack-eval-…

d6a1cad

…optimization

chore: bump version and changelog (v0.6.1.0)

7fed990

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into garrytan/gstack-eval-…

4b22559

…optimization

garrytan merged commit 17c1c06 into main Mar 17, 2026

garrytan merged 5 commits into main from garrytan/gstack-eval-optimization

garrytan commented Mar 17, 2026
