# PR Test Corpus Design

**Status:** In progress -- decisions captured, open questions remain
**Date:** 2026-02-26

## Goal

Build a real, working test corpus for fowlcon so agent prompts and shell scripts can be tested against realistic PR scenarios with full control over the inputs.

## Core Concept: Fakes, Not Mocks

The corpus is a **real, working Kotlin/Misk HTTP API service** -- not mock data or synthetic diffs. It compiles, runs, and looks like a genuine private service to any agent reviewing its PRs. This follows the "fakes over mocks" testing philosophy: a fully functional implementation that exercises real code paths rather than brittle test doubles coupled to implementation details.

The service is called **Hawksbury** (continuing the bird theme from existing test fixtures). It's a bird sanctuary management API being built by a small team of bird-enthusiast developers with fun personas.

## Architecture

### Subtree with Extraction

The Hawksbury service lives as a git subtree inside `tests/corpus/hawksbury/` in the fowlcon repo. At test time, `git filter-repo --subdirectory-filter tests/corpus/hawksbury` extracts the subtree into an isolated temporary repo. After extraction:

- Files are promoted to repo root (no nested path artifacts)
- All fowlcon-related history is stripped
- No references to the parent repo remain
- The result looks like a standalone Kotlin service

Extraction happens **once per test suite run**, and all tests share the extracted repo.
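
A minimal sketch of that extraction step (assuming `git filter-repo` is installed; `extract_corpus`, `DRY_RUN`, and the destination path are illustrative names, not an existing harness):

```shell
#!/usr/bin/env bash
# Sketch of the once-per-suite extraction. Assumes git-filter-repo is
# available; all names here are illustrative, not an existing script.
set -euo pipefail

CORPUS_DIR="tests/corpus/hawksbury"

extract_corpus() {
  local parent_repo="$1" work_dir="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the plan instead of running it (handy for testing the harness).
    echo "git clone --no-local $parent_repo $work_dir"
    echo "git -C $work_dir filter-repo --subdirectory-filter $CORPUS_DIR"
    return 0
  fi
  # A fresh clone keeps filter-repo's history rewrite away from the
  # developer's working checkout.
  git clone --no-local "$parent_repo" "$work_dir"
  # Promotes the subtree to the repo root and drops all commits that
  # touched only fowlcon files.
  git -C "$work_dir" filter-repo --subdirectory-filter "$CORPUS_DIR"
}

# Dry-run demo: prints the two commands it would run.
DRY_RUN=1 extract_corpus "$PWD" /tmp/hawksbury-extracted
```

Cloning with `--no-local` before filtering means the rewrite only ever touches a throwaway copy; the `DRY_RUN` path makes the harness itself cheap to test.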

### Why Subtree + Extraction?

We evaluated three approaches:

| Approach | Isolation | Maintenance | Realism |
|----------|-----------|-------------|---------|
| Subtree inside fowlcon (with extraction) | High (after extraction) | Single repo | High |
| Separate GitHub repo | Perfect | Two repos to coordinate | Highest |
| Git submodule | High | Two repos + submodule friction | High |

The subtree approach gives single-repo simplicity with high isolation after extraction. A separate repo would give perfect isolation but adds coordination overhead and splits maintenance.

### Guarding Against Leakage

- No file in `tests/corpus/hawksbury/` may contain the strings "fowlcon", "review-tree", "review-comments", or "test corpus"
- A CI check validates this on every PR
- The extraction process strips all fowlcon-related git history automatically
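
The CI check could be as simple as a recursive fixed-string grep (a sketch; `check_leakage` and its argument are hypothetical names):

```shell
#!/usr/bin/env bash
# Sketch of the leakage check CI would run on every PR. The forbidden
# strings come from the rules above; the function name is illustrative.
set -euo pipefail

check_leakage() {
  local corpus_dir="$1" found=0
  # --fixed-strings treats each pattern literally; -r walks the whole
  # subtree; -l prints only the names of offending files.
  for pattern in fowlcon review-tree review-comments "test corpus"; do
    if grep -rl --fixed-strings "$pattern" "$corpus_dir" 2>/dev/null; then
      echo "leak: '$pattern' found under $corpus_dir" >&2
      found=1
    fi
  done
  return "$found"
}
```

Run as `check_leakage tests/corpus/hawksbury`; a nonzero exit fails the PR check.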

## The Hawksbury Service

### Tech Stack

- **Language:** Kotlin
- **Framework:** [Misk](https://github.com/cashapp/misk) (open source microservice framework)
- **Build:** Gradle with Kotlin DSL
- **Dependencies:** Public only (Maven Central)

### Narrative History

The git history tells a lifelike story of a small team building a service. Commits read like real development work -- initial setup, add endpoints, add tests, expand features. This is not a flat collection of files; it's a project with an arc.

The main branch is **immutable after initial creation**. New test scenarios branch off at natural points in the story. This ensures existing scenario diffs never change when new scenarios are added.

### Personas

PR authors and reviewers are bird-themed personas with consistent identities in commit messages and PR descriptions. (Specific personas TBD during implementation.)

## Test Scenarios

Each scenario is a branch representing a PR that exercises specific fowlcon review behaviors. Target: ~10 scenarios covering:

1. **Mechanical + novel mix** -- Many files with the same pattern (variation/repeat) plus some novel logic. The core fowlcon use case.
2. **Backpressure trigger** -- Too many interleaved concepts, high file count, no clear grouping. Tests that the agent would push back.
3. **Small focused PR** -- 2-5 files, single concept. Tests the simple path.
4. **Rename-heavy PR** -- Files moved/renamed. Tests file tracking across renames.
5. **Delete-only PR** -- Removing dead code. Tests handling of deletion-only diffs.
6. **New feature PR** -- All new files, no modifications. Tests new-file-only analysis.
7. **Cross-cutting concern** -- Changes that touch many layers (API, service, data, tests) for one feature.
8. **Test-only PR** -- Only test files changed. Tests classification of test code.
9. **Description mismatch** -- PR description makes claims that contradict the diff. Tests description verification.
10. **Multi-concept PR** -- Several unrelated changes bundled together (the "kitchen sink" PR).

Each scenario includes:
- A branch with realistic commits
- A PR description whose claims are intentionally mixed (some correct, some incorrect, some details omitted) to exercise description verification
- Expected structural properties (concept count ranges, variation detection, coverage expectations)
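
One lightweight way to encode those structural expectations is a per-scenario env file the harness sources (a sketch under that assumption; the `expected.env` format and variable names like `MIN_CONCEPTS` are hypothetical, not an existing fowlcon convention):

```shell
#!/usr/bin/env bash
# Hypothetical per-scenario expectations file plus the harness side that
# reads it. All file and variable names here are illustrative.
set -euo pipefail

write_example_expectations() {
  cat > "$1" <<'EOF'
# scenarios/rename-heavy/expected.env
MIN_CONCEPTS=1
MAX_CONCEPTS=3
EXPECTS_VARIATION=false
EXPECTS_FULL_COVERAGE=true
EOF
}

load_expectations() {
  # Sourcing the file defines MIN_CONCEPTS, MAX_CONCEPTS, etc.
  # shellcheck disable=SC1090
  . "$1"
  echo "concepts: $MIN_CONCEPTS..$MAX_CONCEPTS variation=$EXPECTS_VARIATION"
}
```

A sourceable env file keeps the expectations readable in review while staying trivially consumable from bash test scripts.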

## Testing Modes

### Blind Mode
The agent sees only the extracted repo. No indication it's a test. Most realistic.

### Aware Mode
The agent prompt includes a note: "You may recognize this as a test fixture. Your job is to produce the same output you would for a real PR. Be representative." Useful for debugging and prompt development.

Both modes need testing to understand behavioral differences.

## Open Questions

### How does the agent receive PR data without GitHub?

This is the biggest unresolved design question. In production, fowlcon takes a GitHub PR URL and fetches data via the `gh` CLI. The test corpus has no GitHub PRs. Options under investigation:

1. **Local mode in the orchestrator** -- Support a local repo path + base/head refs instead of a GitHub URL. Useful beyond testing (pre-push review, non-GitHub hosts).
2. **Mock the `gh` CLI** -- A shell wrapper that returns pre-generated data for specific subcommands.
3. **Push extracted repo to GitHub** -- Create real PRs on a real GitHub repo after extraction.
4. **Pre-generated diff files** -- Each scenario ships with a `.diff` and `description.md`.
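
Option 2 could be sketched as a shell shim that answers only the subcommands a review run is assumed to need (the `FIXTURE_DIR` layout and fixture file names are illustrative):

```shell
#!/usr/bin/env bash
# Sketch of a fake `gh`. In a real test this function body would live in
# an executable named `gh` on a directory prepended to PATH, shadowing
# the real CLI. Fixture paths are illustrative.
set -euo pipefail

FIXTURE_DIR="${FIXTURE_DIR:-./scenarios/current}"

fake_gh() {
  case "${1:-} ${2:-}" in
    "pr diff") cat "$FIXTURE_DIR/pr.diff" ;;        # diff for the scenario PR
    "pr view") cat "$FIXTURE_DIR/description.md" ;; # PR description
    *) echo "fake gh: unhandled subcommand: $*" >&2; return 64 ;;
  esac
}
```

Failing loudly on unhandled subcommands is deliberate: it surfaces exactly which `gh` calls the orchestrator actually makes, which feeds the research question below.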

Research needed on: what data the orchestrator actually consumes from GitHub, format differences between `gh pr diff` and `git diff`, and what other agentic tools do for local/offline input.

### What defines "correct" output per scenario?

Options:
- **Structural assertions** -- Must have N top-level nodes, must detect variation, must achieve 100% coverage
- **Golden files** -- Full expected review-tree.md per scenario (brittle but precise)
- **LLM-as-judge** -- Grade quality using a rubric (flexible but non-deterministic)
- **Tiered** -- Structural assertions for CI, golden files for nightly, LLM-as-judge for releases
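
Structural assertions could look like the following sketch, assuming the review tree is markdown whose top-level nodes are `## ` headings (the heading convention and bounds are illustrative):

```shell
#!/usr/bin/env bash
# Sketch of a structural assertion over a generated review tree.
set -euo pipefail

count_top_level_nodes() {
  # grep -c prints 0 but exits 1 when there are no matches; || true
  # keeps set -e happy in that case.
  grep -c '^## ' "$1" || true
}

assert_node_count_between() {
  local file="$1" min="$2" max="$3" n
  n="$(count_top_level_nodes "$file")"
  if [ "$n" -lt "$min" ] || [ "$n" -gt "$max" ]; then
    echo "expected $min..$max top-level nodes, got $n in $file" >&2
    return 1
  fi
}
```

Range assertions rather than exact counts leave room for run-to-run variation in agent output while still catching structural regressions.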

### What invokes the agent in tests?

- True E2E (call Claude API, $5-20/run) -- most realistic, most expensive
- Structural assertions only ($0/run) -- test scripts and formats, not agent behavior
- Deterministic replay (record once, replay free) -- test agent logic without API calls
- Tiered approach matching the test pyramid

### Build infrastructure weight

The working Kotlin app requires JDK + Gradle in CI. This adds weight to a repo that's currently markdown + bash. Need to ensure this doesn't slow down unrelated CI jobs (separate workflow or conditional triggers).

### Corpus evolution

When main is immutable, how do we handle:
- Bug fixes in Hawksbury code needed for new scenarios
- Dependency updates for security
- Growing the app for more complex scenarios

Current thinking: the story continues forward. New commits extend the narrative. Old scenario branches remain anchored to their base points.

## Relationship to Test Pyramid

The corpus serves multiple tiers:

| Tier | What | Cost | Uses Corpus? |
|------|------|------|-------------|
| Structural (bats) | Format/script validation | $0 | Yes -- realistic fixtures |
| Smoke (cheap model) | Output structure | $0.01-0.10 | Yes -- scenario diffs |
| Golden (full model) | Review quality | $2-5 | Yes -- full agent runs |
| E2E (real PRs) | Integration | $10-50 | Yes -- the whole point |

## Next Steps

1. Resolve the "PR data without GitHub" question (research in progress)
2. Finalize scenario list and structural expectations
3. Design the Hawksbury app (endpoints, data model, project structure)
4. Implement the initial codebase on main
5. Create scenario branches one at a time
6. Build the extraction + test harness
7. Add CI integration