feat: diff-based test selection for E2E and LLM-judge evals (v0.6.1.0)#139

Merged
garrytan merged 5 commits into main from garrytan/gstack-eval-optimization on Mar 17, 2026

@garrytan
Owner

Summary

  • E2E and LLM-judge tests now auto-select based on git diff — only tests whose file dependencies changed are run
  • Each test declares its touchfiles in test/helpers/touchfiles.ts (28 E2E + 9 LLM-judge tests mapped)
  • Global touchfiles (session-runner, eval-store, gen-skill-docs) trigger all tests
  • bun run eval:select CLI previews which tests would run
  • Completeness test ensures every testName has a TOUCHFILES entry — catches omissions at bun test time
  • New scripts: test:e2e:all, test:evals:all, eval:select
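The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in test/helpers/touchfiles.ts: the test names, file paths, and the `GLOBAL_TOUCHFILES` constant are hypothetical, and dependencies are compared as exact paths here, whereas the real helper presumably matches glob patterns.

```typescript
// Hypothetical shapes and entries; the real TOUCHFILES map may look different.
type TouchfileMap = Record<string, string[]>;

const TOUCHFILES: TouchfileMap = {
  "browse-basic": ["browse/src/index.ts", "browse/SKILL.md"],
  "qa-quick": ["qa/SKILL.md"],
};

// Assumed path for one global touchfile; a change here selects every test.
const GLOBAL_TOUCHFILES: string[] = ["test/helpers/session-runner.ts"];

// Return the names of tests whose declared dependencies appear in the changed-file list.
function selectTests(
  changedFiles: string[],
  map: TouchfileMap,
  globals: string[],
): string[] {
  const allTests = Object.keys(map);
  // A change to any global touchfile triggers the full suite.
  if (changedFiles.some((file) => globals.includes(file))) {
    return allTests;
  }
  const changed = new Set(changedFiles);
  // Otherwise keep only tests with at least one changed dependency.
  return allTests.filter((name) => map[name].some((dep) => changed.has(dep)));
}
```

In this sketch the changed-file list would come from a git diff against the base branch; passing the union of several diffs selects the union of the affected tests.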

Test Coverage

All new code paths have test coverage — 21 unit tests covering:

  • matchGlob() — glob pattern matching (exact, *, **, dot escaping)
  • selectTests() — per-test selection, global touchfile triggers, union of multiple diffs
  • detectBaseBranch() — fallback chain with temp git repos
  • TOUCHFILES completeness — validates all testNames have entries
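For readers unfamiliar with the glob cases listed for matchGlob(), here is one way such a matcher could be written. It is a sketch under assumptions, not the PR's implementation: it assumes `*` matches within a single path segment, `**` matches across segments, and every other character (including `.`) is literal.

```typescript
// Illustrative matcher; the semantics of * and ** below are assumptions.
function matchGlob(pattern: string, path: string): boolean {
  let source = "";
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "*" && pattern[i + 1] === "*") {
      source += ".*"; // "**" may span directory separators
      i++; // skip the second star
    } else if (ch === "*") {
      source += "[^/]*"; // "*" stays inside one path segment
    } else {
      // Escape regex metacharacters so "." and friends match literally.
      source += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp(`^${source}$`).test(path);
}
```

Under these assumptions, `matchGlob("browse/**", "browse/src/cli.ts")` is true, `matchGlob("browse/*.ts", "browse/src/cli.ts")` is false, and `matchGlob("SKILL.md", "SKILLxmd")` is false because the dot is escaped.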

Pre-Landing Review

No issues found.

TODOS

No TODO items completed in this PR.

Test plan

  • All free tests pass (21 touchfiles + existing browse/skill validation)
  • bun run eval:select produces correct output
  • Completeness test catches missing TOUCHFILES entries (verified during merge with main)
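The completeness check can be pictured as a small helper that reports test names lacking an entry. This is a hypothetical sketch: the function name `missingTouchfiles` and the sample data are illustrative, and the real test may collect its testNames differently.

```typescript
// Hypothetical helper: list test names that have no TOUCHFILES entry.
function missingTouchfiles(
  testNames: string[],
  touchfiles: Record<string, string[]>,
): string[] {
  return testNames.filter(
    (name) => !Object.prototype.hasOwnProperty.call(touchfiles, name),
  );
}

// Illustrative data only; a unit test would assert that the result is empty.
const sampleTouchfiles: Record<string, string[]> = {
  "browse-basic": ["browse/SKILL.md"],
};
const missing = missingTouchfiles(["browse-basic", "qa-quick"], sampleTouchfiles);
```

A check of this kind fails as soon as a new test is added without declaring its dependencies, which is how an omission would surface at bun test time rather than as a silently skipped eval.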

🤖 Generated with Claude Code

garrytan and others added 5 commits March 17, 2026 11:28
Each test declares file dependencies in a TOUCHFILES map. The test runner
checks git diff against the base branch and only runs tests whose
dependencies were modified. Global touchfiles (session-runner, eval-store,
gen-skill-docs) trigger all tests.

New scripts: test:e2e:all, test:evals:all, eval:select

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… hints

The test was flaky at 20 turns because the agent reads a 300-line SKILL.md,
navigates, extracts design data, and writes a report. Added hints to skip
preamble/batch commands/write early while still testing the real SKILL.md.
Now completes in ~13 turns consistently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@garrytan merged commit 17c1c06 into main on Mar 17, 2026