Commit af5fa9a

mpawliszyn and claude authored
docs: add PR test corpus design (decisions in progress) (#18)
Captures design decisions from brainstorming session for a real Kotlin/Misk test corpus that lives as a subtree in tests/corpus/hawksbury/.

Key decisions: fakes not mocks, subtree with git filter-repo extraction, immutable main with narrative history, ~10 PR scenarios, blind/aware testing modes. Several open questions remain (PR data delivery, evaluation criteria, agent invocation).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0391f98 commit af5fa9a

1 file changed

Lines changed: 156 additions & 0 deletions

@@ -0,0 +1,156 @@
# PR Test Corpus Design

**Status:** In progress -- decisions captured, open questions remain
**Date:** 2026-02-26

## Goal

Build a real, working test corpus for fowlcon so agent prompts and shell scripts can be tested against realistic PR scenarios with full control over the inputs.

## Core Concept: Fakes, Not Mocks

The corpus is a **real, working Kotlin/Misk HTTP API service** -- not mock data or synthetic diffs. It compiles, runs, and looks like a genuine private service to any agent reviewing its PRs. This follows the "fakes over mocks" testing philosophy: a fully functional implementation that exercises real code paths rather than brittle test doubles coupled to implementation details.

The service is called **Hawksbury** (continuing the bird theme from existing test fixtures). It's a bird sanctuary management API being built by a small team of bird-enthusiast developers with fun personas.

## Architecture

### Subtree with Extraction

The Hawksbury service lives as a git subtree inside `tests/corpus/hawksbury/` in the fowlcon repo. At test time, `git filter-repo --subdirectory-filter tests/corpus/hawksbury` extracts the subtree into an isolated temporary repo. After extraction:

- Files are promoted to repo root (no nested path artifacts)
- All fowlcon-related history is stripped
- No references to the parent repo remain
- The result looks like a standalone Kotlin service

Extraction happens **once per test suite run**, and all tests share the extracted repo.
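
The extraction step described above could be sketched roughly as follows. This assumes `git filter-repo` is installed and that the harness works on a throwaway clone, since filter-repo rewrites history in place and by default expects a fresh clone. The helper names, the temp-directory layout, and the use of a per-process cache to get "once per run" behavior are illustrative assumptions, not decisions recorded in this design.

```python
import subprocess
import tempfile
from functools import lru_cache
from pathlib import Path

CORPUS_SUBDIR = "tests/corpus/hawksbury"  # path taken from the design text


def extraction_commands(source_repo: str, dest: str) -> list[list[str]]:
    """Return the git commands that build the isolated repo (hypothetical helper)."""
    return [
        # --no-local forces a real clone so filter-repo sees a fresh repository.
        ["git", "clone", "--no-local", source_repo, dest],
        # Keep only the subtree's history and promote its files to the repo root.
        ["git", "-C", dest, "filter-repo", "--subdirectory-filter", CORPUS_SUBDIR],
    ]


@lru_cache(maxsize=None)
def extracted_repo(source_repo: str) -> Path:
    """Extract once per process; later calls return the same directory."""
    dest = Path(tempfile.mkdtemp(prefix="hawksbury-")) / "repo"
    for cmd in extraction_commands(source_repo, str(dest)):
        subprocess.run(cmd, check=True)
    return dest
```

Separating command construction from execution keeps the command list checkable without a real repository; the cached wrapper is one possible way to share a single extraction across tests.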

### Why Subtree + Extraction?

We evaluated three approaches:

| Approach | Isolation | Maintenance | Realism |
|----------|-----------|-------------|---------|
| Subtree inside fowlcon (with extraction) | High (after extraction) | Single repo | High |
| Separate GitHub repo | Perfect | Two repos to coordinate | Highest |
| Git submodule | High | Two repos + submodule friction | High |

The subtree approach gives single-repo simplicity with high isolation after extraction. A separate repo would give perfect isolation but adds coordination overhead and splits maintenance.

### Guarding Against Leakage

- No file in `tests/corpus/hawksbury/` may contain the strings "fowlcon", "review-tree", "review-comments", or "test corpus"
- A CI check validates this on every PR
- The extraction process strips all fowlcon-related git history automatically
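
A minimal sketch of what the leakage check might look like. The forbidden strings come from the list above; the function name, the case-insensitive matching, and the decision to scan file contents only (not file names) are assumptions made for illustration.

```python
from pathlib import Path

# Strings named in the design; matching them case-insensitively is an assumption.
FORBIDDEN = ("fowlcon", "review-tree", "review-comments", "test corpus")


def find_leaks(corpus_root: Path) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, forbidden term) for every hit."""
    hits = []
    for path in sorted(corpus_root.rglob("*")):
        if not path.is_file():
            continue
        # errors="ignore" lets the scan pass over binary files without crashing.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            for term in FORBIDDEN:
                if term in lowered:
                    rel = path.relative_to(corpus_root).as_posix()
                    hits.append((rel, lineno, term))
    return hits
```

A CI job could call `find_leaks(Path("tests/corpus/hawksbury"))` and fail when the returned list is non-empty, printing each hit so the offending line is easy to find.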

## The Hawksbury Service

### Tech Stack

- **Language:** Kotlin
- **Framework:** [Misk](https://github.com/cashapp/misk) (open source microservice framework)
- **Build:** Gradle with Kotlin DSL
- **Dependencies:** Public only (Maven Central)

### Narrative History

The git history tells a lifelike story of a small team building a service. Commits read like real development work -- initial setup, add endpoints, add tests, expand features. This is not a flat collection of files; it's a project with an arc.

The main branch is **immutable after initial creation**. New test scenarios branch off at natural points in the story. This ensures existing scenario diffs never change when new scenarios are added.

### Personas

PR authors and reviewers are bird-themed personas with consistent identities in commit messages and PR descriptions. (Specific personas TBD during implementation.)

## Test Scenarios

Each scenario is a branch representing a PR that exercises specific fowlcon review behaviors. Target: ~10 scenarios covering:

1. **Mechanical + novel mix** -- Many files with the same pattern (variation/repeat) plus some novel logic. The core fowlcon use case.
2. **Backpressure trigger** -- Too many interleaved concepts, high file count, no clear grouping. Tests that the agent pushes back.
3. **Small focused PR** -- 2-5 files, single concept. Tests the simple path.
4. **Rename-heavy PR** -- Files moved/renamed. Tests file tracking across renames.
5. **Delete-only PR** -- Removing dead code. Tests handling of deletion-only diffs.
6. **New feature PR** -- All new files, no modifications. Tests new-file-only analysis.
7. **Cross-cutting concern** -- Changes that touch many layers (API, service, data, tests) for one feature.
8. **Test-only PR** -- Only test files changed. Tests classification of test code.
9. **Description mismatch** -- PR description makes claims that contradict the diff. Tests description verification.
10. **Multi-concept PR** -- Several unrelated changes bundled together (the "kitchen sink" PR).

Each scenario includes:
- A branch with realistic commits
- A PR description with intentional claims (correct, incorrect, and omitted) for testing description verification
- Expected structural properties (concept count ranges, variation detection, coverage expectations)
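
One way the "expected structural properties" of a scenario might be recorded and checked is sketched below. Every field name, the branch name in the usage note, and the idea of expressing coverage as a fraction are hypothetical; the design has not yet fixed a schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioExpectation:
    """Structural expectations for one scenario branch (field names hypothetical)."""
    branch: str
    min_concepts: int
    max_concepts: int
    expects_variation: bool
    min_coverage: float  # fraction of changed files the review must account for


def check(expected: ScenarioExpectation, concepts: int,
          variation_found: bool, coverage: float) -> list[str]:
    """Return human-readable failures; an empty list means the run passed."""
    failures = []
    if not expected.min_concepts <= concepts <= expected.max_concepts:
        failures.append(
            f"concept count {concepts} outside "
            f"[{expected.min_concepts}, {expected.max_concepts}]"
        )
    if expected.expects_variation and not variation_found:
        failures.append("expected a variation group, found none")
    if coverage < expected.min_coverage:
        failures.append(
            f"coverage {coverage:.0%} below {expected.min_coverage:.0%}"
        )
    return failures
```

For example, a "mechanical + novel mix" scenario could be declared as `ScenarioExpectation("scenario/mechanical-novel-mix", 2, 5, True, 1.0)` and checked against whatever summary the harness extracts from the agent's output.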

## Testing Modes

### Blind Mode
The agent sees only the extracted repo. No indication it's a test. Most realistic.

### Aware Mode
The agent prompt includes a note: "You may recognize this as a test fixture. Your job is to produce the same output you would for a real PR. Be representative." Useful for debugging and prompt development.

Both modes need testing to understand behavioral differences.

## Open Questions

### How does the agent receive PR data without GitHub?

This is the biggest unresolved design question. In production, fowlcon takes a GitHub PR URL and fetches data via `gh` CLI. The test corpus has no GitHub PRs. Options under investigation:

1. **Local mode in the orchestrator** -- Support a local repo path + base/head refs instead of a GitHub URL. Useful beyond testing (pre-push review, non-GitHub hosts).
2. **Mock the `gh` CLI** -- A shell wrapper that returns pre-generated data for specific subcommands.
3. **Push extracted repo to GitHub** -- Create real PRs on a real GitHub repo after extraction.
4. **Pre-generated diff files** -- Each scenario ships with a `.diff` and `description.md`.

Research needed on: what data the orchestrator actually consumes from GitHub, format differences between `gh pr diff` and `git diff`, and what other agentic tools do for local/offline input.
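
To make option 1 (local mode) concrete, here is a rough sketch of what a local PR input might carry. The class, its fields, and the example ref names are hypothetical. The one firm fact it relies on is that `git diff A...B` (three dots) compares `B` against the merge base of `A` and `B`, which yields only the changes introduced on the scenario branch. Whether that output matches `gh pr diff` closely enough is exactly the research question above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalPrInput:
    """Hypothetical local-mode input: a repo path plus base/head refs."""
    repo_path: str
    base_ref: str
    head_ref: str
    description_path: str  # e.g. a description.md shipped with the scenario

    def diff_command(self) -> list[str]:
        # Three-dot form: diff head_ref against its merge base with base_ref.
        return [
            "git", "-C", self.repo_path, "diff",
            f"{self.base_ref}...{self.head_ref}",
        ]
```

This shape would also cover option 4, since a pre-generated `.diff` is just the saved output of the same command.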

### What defines "correct" output per scenario?

Options:
- **Structural assertions** -- Must have N top-level nodes, must detect variation, must achieve 100% coverage
- **Golden files** -- Full expected review-tree.md per scenario (brittle but precise)
- **LLM-as-judge** -- Grade quality using a rubric (flexible but non-deterministic)
- **Tiered** -- Structural assertions for CI, golden files for nightly, LLM-as-judge for releases

### What invokes the agent in tests?

- True E2E (call Claude API, $5-20/run) -- most realistic, most expensive
- Structural assertions only ($0/run) -- test scripts and formats, not agent behavior
- Deterministic replay (record once, replay free) -- test agent logic without API calls
- Tiered approach matching the test pyramid
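
The "deterministic replay" option can be illustrated with a small record/replay cache. This is a sketch under simplifying assumptions: it treats an agent call as a single prompt-in, text-out function and keys recordings by a hash of the prompt. A real agent run with tool calls would need a richer key. The class and method names are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Optional


class ReplayCache:
    """Record model responses once, then replay them without API calls."""

    def __init__(self, directory: Path,
                 call_model: Optional[Callable[[str], str]] = None):
        self.directory = directory
        self.call_model = call_model  # None means replay-only mode

    def _path(self, prompt: str) -> Path:
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def complete(self, prompt: str) -> str:
        path = self._path(prompt)
        if path.exists():
            # Replay: return the stored response, no model call.
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        if self.call_model is None:
            raise KeyError("no recording for this prompt; re-record the scenario")
        # Record: call the model once and persist the result.
        response = self.call_model(prompt)
        self.directory.mkdir(parents=True, exist_ok=True)
        path.write_text(
            json.dumps({"prompt": prompt, "response": response}),
            encoding="utf-8",
        )
        return response
```

In replay-only mode a changed prompt fails loudly instead of silently calling the API, which keeps CI runs at zero cost and makes prompt drift visible.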

### Build infrastructure weight

The working Kotlin app requires JDK + Gradle in CI. This adds weight to a repo that's currently markdown + bash. Need to ensure this doesn't slow down unrelated CI jobs (separate workflow or conditional triggers).

### Corpus evolution

Given that main is immutable, how do we handle:
- Bug fixes in Hawksbury code needed for new scenarios
- Dependency updates for security
- Growing the app for more complex scenarios

Current thinking: the story continues forward. New commits extend the narrative. Old scenario branches remain anchored to their base points.

## Relationship to Test Pyramid

The corpus serves multiple tiers:

| Tier | What | Cost | Uses Corpus? |
|------|------|------|-------------|
| Structural (bats) | Format/script validation | $0 | Yes -- realistic fixtures |
| Smoke (cheap model) | Output structure | $0.01-0.10 | Yes -- scenario diffs |
| Golden (full model) | Review quality | $2-5 | Yes -- full agent runs |
| E2E (real PRs) | Integration | $10-50 | Yes -- the whole point |

## Next Steps

1. Resolve the "PR data without GitHub" question (research in progress)
2. Finalize scenario list and structural expectations
3. Design the Hawksbury app (endpoints, data model, project structure)
4. Implement the initial codebase on main
5. Create scenario branches one at a time
6. Build the extraction + test harness
7. Add CI integration
