diff --git a/docs/agent-prompt-principles.md b/docs/agent-prompt-principles.md new file mode 100644 index 0000000..1d047d4 --- /dev/null +++ b/docs/agent-prompt-principles.md @@ -0,0 +1,169 @@ +# Agent Prompt Principles + +Evidence-based principles for writing effective agent prompts in Fowlcon. Derived from research on multi-agent orchestration, cognitive science, LLM behavioral studies, and production agent tools. + +These principles apply to all prompts in `commands/` and `agents/`. + +--- + +## 1. Agents Are Tools, Not Peers + +Sub-agents receive a typed task, do their work, and return structured output. The orchestrator never sees their internal reasoning. No back-and-forth conversation between orchestrator and sub-agents. + +**Why:** 2024-2025 evidence converges on agent-as-tool over agent-as-peer for orchestration. Anthropic's "Building Effective Agents" guide explicitly recommends structured output from sub-agents over conversational output. The peer model causes context window catastrophe -- a 10-turn investigation consumes 40k tokens in message history alone. + +**In practice:** +- Concept researchers return a `FindingSet`, not a narrative +- The orchestrator synthesizes; sub-agents investigate +- Sub-agents are fire-and-forget with structured return + +## 2. Constrain Output Format to Constrain Behavior + +The strongest measured behavioral control is output format. Aider found that switching edit formats reduced GPT-4 Turbo "laziness" by 3x (score 20% → 61%). The Diff-XYZ benchmark confirms format choice dramatically affects output quality across models. + +**Why:** When an agent must fill in required sections of a template, it cannot skip them. Format constraints convert behavioral requirements into structural requirements. The agent doesn't need to "want" to be thorough -- the template forces it. 
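As an illustration, a researcher output template with required sections might look like this (section names drawn from elsewhere in this document; the placeholder text is a sketch):

```markdown
## Concept
<one-line concept name>

## Summary
<2-3 factual sentences>

## Findings
- <observation with file:line reference>

## Uncertainties
<required section -- write "None" rather than omitting it>
```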
+ +**In practice:** +- Every agent prompt defines an exact output template with named sections +- Use required sections, not optional ones (`## Uncertainties` must always appear, even if empty) +- Provide 1-2 examples of correctly formatted output (few-shot) +- Schema validation (via `check-tree-quality.sh`) catches format violations mechanically + +## 3. Negative Constraints Before Positive Instructions + +State what the agent must NOT do before stating what it should do. The "documentarian mandate" (six DO NOT rules followed by one ONLY rule) appears at every tier in RPI and is the most reliable behavioral control pattern observed in production. + +**Why:** LLMs trained with RLHF have strong refusal training -- they respond more reliably to prohibitions than permissions. A positive instruction ("be thorough") is vague. A negative constraint ("DO NOT summarize instead of showing detail") targets a specific failure mode. + +**In practice:** +```markdown +## CRITICAL: YOUR ONLY JOB IS TO DOCUMENT AND EXPLAIN +- DO NOT suggest improvements +- DO NOT critique the implementation +- DO NOT identify "problems" or "issues" +- DO NOT recommend refactoring +- ONLY describe what exists and how it works +``` + +Place this block at the top of the prompt AND repeat key constraints at the bottom (primacy + recency positioning). + +## 4. Name the Rationalizations + +When you know how an agent will try to skip a step, name that rationalization explicitly in the prompt. A named anti-pattern is harder to use than an unnamed one. + +**Why:** This pattern from Superpowers (the "red flags table") prevents agents from self-excusing non-compliance. By naming the exact thought ("This is simple enough to skip the tree") and providing the correction ("Build the tree anyway -- simple PRs still benefit from structure"), the agent recognizes its own rationalization attempt as a documented failure mode. 
+ +**In practice:** + +For the orchestrator: +```markdown +## Red Flags -- If You Think This, Stop + +| If you think... | The reality is... | +|---|---| +| "This PR is simple, skip the tree" | Simple PRs still benefit from structure. Build the tree. | +| "Coverage is close enough" | 100% or explain every gap. No exceptions. | +| "The pattern is obvious, skip examples" | Show at least one example. Obvious to you ≠ obvious to the reviewer. | +| "I can summarize instead of grouping" | Summaries lose detail. Group by concept. | +``` + +## 5. Mechanical Verification Over Self-Assessment + +Never trust an agent's claim that it's done. Use external tools to verify completeness. + +**Why:** LLMs cannot reliably self-assess completeness. The coverage bitmap pattern (checking coverage via script output rather than asking the agent "did you cover everything?") is strictly more reliable. Superpowers' "verification before completion" skill found 24 documented failure cases where agents claimed completion incorrectly. + +**In practice:** +- Coverage completeness: checked by `coverage-report.sh`, not agent self-report +- Tree quality: checked by `check-tree-quality.sh`, not agent judgment +- File references: spot-checked by reading actual files, not trusting agent citations +- The orchestrator calls verification scripts BEFORE presenting results to the reviewer + +## 6. One Agent, One Job + +Each agent has a single, clearly-scoped responsibility. If an agent is doing two things, split it into two agents. + +**Why:** Focused agents produce more reliable output than multi-purpose ones. Tool restrictions (Grep/Glob/LS only for the locator -- no Read) enforce specialization more reliably than instructions alone. CrewAI's known failure mode of "manager accepts incomplete output" is caused by agents with broad mandates. 
+ +**In practice:** +- `codebase-locator`: finds WHERE (no Read tool -- cannot analyze content) +- `codebase-analyzer`: explains HOW (has Read -- can analyze) +- `codebase-pattern-finder`: shows EXAMPLES (has Read -- returns code snippets) +- `coverage-checker`: verifies COMPLETENESS (Haiku -- mechanical check only) + +## 7. Context Inline, Never File References + +Sub-agents receive all context embedded in their prompt. Never pass a file path and expect the agent to read it. + +**Why:** Sub-agents run in fresh context windows. They cannot access the orchestrator's context. Passing file paths creates a dependency on the agent successfully reading the file, which adds a failure mode. Inline context is guaranteed to be seen. + +**In practice:** +- The orchestrator reads the diff, then embeds relevant hunks in the sub-agent's prompt +- PR metadata (title, description, file list) is pasted inline, not referenced +- The review tree (if it exists) is included as text, not as a path to read + +## 8. Respect the Tree Structure Limits + +The review tree has evidence-based structural constraints derived from cognitive load research. + +**Why:** Miller's 7±2 (strategic capacity with labels) and Cowan's 4±1 (raw working memory) bound what reviewers can hold in mind. Hierarchy research shows 2-3 levels is optimal; performance degrades consistently at 4+ levels. The 3-5 children per node range matches chunking theory. + +**In practice:** +- Top-level concepts: 7 hard max, 5-6 preferred +- Tree depth: 2-3 levels (4 only as documented exception) +- Children per node: 3-5 preferred, 2-7 acceptable, never 1 or 8+ +- Single-child nodes are a structural smell -- collapse them +- Labels must be descriptive functional names, not "Other" or "Miscellaneous" +- `check-tree-quality.sh` enforces these limits + +## 9. Use Unified Diff Format for Agent Consumption + +When passing diff content to agents, use unified diff format with context lines and without line numbers in hunk headers. 
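The stripping step can be sketched in a few lines of shell (the regex is illustrative; harden it for your diff source):

```bash
# Replace "@@ -X,Y +A,B @@ <context>" hunk headers with a bare "@@",
# leaving context and change lines untouched.
strip_hunk_numbers() {
  sed -E 's/^@@ -[0-9]+(,[0-9]+)? \+[0-9]+(,[0-9]+)? @@.*/@@/'
}

strip_hunk_numbers <<'EOF'
@@ -10,4 +10,6 @@ def handler():
 unchanged context
+added line
EOF
```

The agent then anchors on the context lines themselves, which Aider found more reliable than explicit line numbers.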
+ +**Why:** The Diff-XYZ benchmark (Oct 2025) found unified diff is the best format for LLM Apply and Anti-Apply tasks. Aider found omitting line numbers from hunk headers improves performance -- agents use context lines for matching, not line numbers. Including 3-5 context lines helps agents understand what surrounds the change. + +**In practice:** +- Pass unified diff hunks to concept researchers +- Strip hunk header line numbers (`@@ -X,Y +A,B @@` → `@@`) +- Include surrounding context lines (unchanged code) +- Chunk by semantic unit (function/class) not by minimal hunk + +## 10. Design for Fresh Starts + +Long sessions degrade. Design prompts and state management so the review can restart cleanly at any point. + +**Why:** Microsoft research found 39% average degradation in multi-turn conversations. At 50% context utilization, quality drops measurably. Batching context into a fresh call restored 90%+ accuracy. The "Lost in the Middle" effect means mid-session findings are in the attention danger zone. + +**In practice:** +- All state lives in files (`review-tree.md`, `review-comments.md`), not in conversation context +- A new session reads state files and continues from the last known position +- The orchestrator can restart at any phase boundary without losing work +- Front-load critical instructions in the prompt (primacy positioning) +- Repeat key constraints at the end of the prompt (recency positioning) +- After processing sub-agent findings, commit to disk and drop from context + +## 11. Supervisor Mode for Parallel Agents + +When spawning multiple agents in parallel, use supervisor mode: capture failures as data, not exceptions. + +**Why:** Structured concurrency research shows that fail-fast (abort everything when one agent fails) is wrong for research tasks where agents are independently valuable. Supervisor mode lets 2 of 3 successful agents' findings be used even if the third fails. 
The orchestrator marks the failed area as `[pending]` and the coverage checker catches the gap. + +**In practice:** +- Concept researchers run in parallel with independent scopes +- If one fails: log the failure, mark the concept area as pending, continue +- If one returns low-quality output: flag for human investigation, don't silently include +- The coverage checker runs AFTER all agents complete (including failed ones) to identify gaps +- Retry failed agents with simplified prompts before giving up + +## 12. The Reviewer Is the Protagonist + +The tool serves the reviewer. It never makes decisions for them, never recommends approve/reject, and never hides information. + +**Why:** Automation bias research shows 59% of developers use AI code they don't fully understand. Higher AI quality paradoxically increases complacency (2.5x more likely to merge without review). The "documentarian" pattern -- facts not recommendations -- is an anti-automation-bias design. When the tool organizes and explains rather than judges, the reviewer must engage their own judgment. + +**In practice:** +- Agents describe what code does, never whether it's good or bad +- The orchestrator presents findings, never recommends a verdict +- Comments are captured as the reviewer's words, not the agent's suggestions +- "I get it!" is the reviewer's active choice, not the agent's assumption +- Complexity warnings are factual ("7 interleaving concepts across 50 files") not judgmental ("this PR is too complex") diff --git a/docs/plans/v1-implementation.md b/docs/plans/v1-implementation.md new file mode 100644 index 0000000..181c02a --- /dev/null +++ b/docs/plans/v1-implementation.md @@ -0,0 +1,427 @@ +# Fowlcon V1.0 Implementation Plan + +**Goal:** Build the V1.0 core experience -- an agentic code review tool that analyzes a PR, builds a concept tree, and walks the reviewer through it conversationally. 
+ +**Architecture:** Markdown prompts (commands/ + agents/) with shell scripts for state management. The orchestrator (Opus) builds and owns the review tree, spawning Sonnet concept-researchers and workers for analysis. State lives in markdown files, writes go through shell scripts for reliability. + +**Tech Stack:** Markdown prompts (YAML frontmatter), Bash shell scripts, bats-core (testing) + +**Test Benchmark:** Hawksbury sample fixtures (32 files, 5 concepts with nested variations) + +**End-of-task checklist (applies to EVERY task):** +1. **Verify:** Run tests, smoke tests, or manual checks as specified in the task. Confirm everything passes. +2. **Document:** Update or create documentation for what was built. At minimum: a brief comment in the file explaining its purpose, and an update to the repo README if the task adds a user-facing component. For formats and scripts, document the interface (inputs, outputs, expected behavior). +3. **Commit:** Stage and commit with a descriptive message. + +--- + +## Phase 1: Foundation (Formats + Scripts) -- COMPLETED + +The review-tree.md format was the critical-path decision. Everything else depends on it. + +### Task 1: Design review-tree.md format -- COMPLETED + +Format designed and approved. Key decisions: +- Two node types: concept (default) and variation (`{variation}` with `{repeat}` children) +- `context:` blocks on any node (the tour guide's voice, stored for resumability) +- `files:` on leaf nodes only, per-hunk entries with line ranges (`L-`) +- Coverage propagates upward; hunks can appear in multiple concepts (at-least-once) +- Dot-separated numeric IDs with no depth limit +- Description verification as table only, not tree nodes + +**Files:** `docs/templates/review-tree.md`, `tests/formats/sample-tree-hawksbury.md` + +--- + +### Task 2: Design review-comments.md format -- COMPLETED + +Format designed and approved. 
Key decisions: +- GitHub API field names adopted directly (`side`, `line`, `type`) +- `side` field in V1.0 for forward-compatibility with V1.1 posting +- Diff hunk context NOT needed (modern line/side API doesn't need hunk headers) +- Source provenance field, soft deletion via status, tree revision tracking +- 43 tests across 2 suites + +**Files:** `docs/templates/review-comments.md`, `tests/formats/sample-comments-hawksbury.md`, `tests/formats/sample-comments-edge-cases.md` + +--- + +### Task 2.5: Tree parsing tests and CI -- COMPLETED + +Added parsing tests for review-tree.md and GitHub Actions CI. + +--- + +### Task 3: Shell script -- update node status -- COMPLETED + +`scripts/update-node-status.sh` -- finds node by ID, updates status marker, writes atomically (temp file + mv). Validates status is one of: pending, reviewed, accepted. 26 tests. + +--- + +### Task 4: Shell script -- add comment -- COMPLETED + +`scripts/add-comment.sh` -- captures comments with full metadata, cross-file {comment} flag updates, content validation. 23 tests. + +--- + +### Task 5: Shell script -- coverage report -- COMPLETED + +`scripts/coverage-report.sh` -- read-only summary: status counts, confidence breakdown, pending list. 14 tests. + +--- + +### Task 6: Tree quality checker -- COMPLETED + +`scripts/check-tree-quality.sh` -- structural validation: 6 checks (HEAD, Revision, Desc Verification, top-level count, file coverage, variation structure). 14 tests. + +--- + +## Phase 2: Worker Agents + +Based on established sub-agent patterns, adapted for Fowlcon's PR context. Workers are restricted to read-only tools (no Bash) -- the concept-researcher handles version discovery and routing decisions. 
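For instance, a locator prompt header enforcing these restrictions might look like this (tools and model are from Task 7; the description wording and mandate block are illustrative):

```markdown
---
name: codebase-locator
description: Finds WHERE code relevant to a concept lives in the PR.
tools: Grep, Glob, LS
model: sonnet
---

## CRITICAL: YOUR ONLY JOB IS TO LOCATE
- DO NOT analyze file contents (you have no Read tool)
- DO NOT critique what you find
- ONLY report file paths, grouped by purpose
```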
+ +### Task 7: Write worker agent prompts + +**Files:** +- Create: `agents/codebase-locator.md` +- Create: `agents/codebase-analyzer.md` +- Create: `agents/codebase-pattern-finder.md` + +**Step 1: Write codebase-locator.md** + +YAML frontmatter with name, description, tools (Grep, Glob, LS), model (sonnet). Adapt for PR review context -- the locator finds files relevant to a concept within a PR. Documentarian role: find and report, never critique. + +**Step 2: Write codebase-analyzer.md** + +Tools: Read, Grep, Glob, LS. Model: sonnet. Documentarian role -- describe, never critique. Traces change boundaries (what connects to the changed code). Returns file:line references. + +**Step 3: Write codebase-pattern-finder.md** + +Tools: Grep, Glob, Read, LS. Model: sonnet. Finds similar patterns and examples. Adapted to find repetitive patterns within the PR diff (key for collapsing mechanical changes into variation nodes). + +**Step 4: Smoke test each agent** + +Test each by spawning with a specific question against a real codebase: +- Locator: "Find all files that import [specific class]" +- Analyzer: "Explain how [method] works" +- Pattern-finder: "Find the pattern used when [repeated change] is applied" + +Verify output is structured, factual, and includes file:line references. + +**Step 5: Commit** + +```bash +git add agents/codebase-locator.md agents/codebase-analyzer.md agents/codebase-pattern-finder.md +git commit -m "feat: add worker agent prompts" +``` + +--- + +### Task 8: Write coverage-checker agent prompt + +**Files:** +- Create: `agents/coverage-checker.md` + +**Step 1: Write the prompt** + +YAML frontmatter: name, description, tools (Grep, Glob, LS), model (haiku). Receives a file list (from PR diff) and a review-tree.md. Reports unmapped lines. Does NOT categorize -- just finds gaps. The orchestrator (Opus) decides whether gaps are mechanical/generated or require human mapping. + +**Step 2: Smoke test** + +Give it the Hawksbury file list and sample tree. 
Verify it correctly reports all files are mapped (or identifies gaps). + +**Step 3: Commit** + +```bash +git add agents/coverage-checker.md +git commit -m "feat: add coverage-checker agent prompt" +``` + +--- + +## Phase 3: Concept Researcher + +### Task 9: Write concept-researcher agent prompt + +**Files:** +- Create: `agents/concept-researcher.md` + +**Step 1: Write the prompt** + +YAML frontmatter: name, description, tools (Read, Grep, Glob, LS, Task, WebSearch, WebFetch), model (sonnet). Receives question + PR context inline (never file references). Spawns workers for mechanical tasks, synthesizes findings. Returns structured output: Concept, Summary, Findings (with file:line refs), Relevant Context, Change Boundary, Uncertainties. + +The researcher discovers version context (runs `git rev-parse HEAD` via Bash if needed through the orchestrator) and provides it to workers in their prompts. Workers don't need to discover versions themselves. + +**Step 2: Test against a real codebase** + +Spawn with a conceptual question about a PR. Verify: +- Output follows the structured format +- File:line references are present +- Workers were spawned appropriately +- Change boundary is traced (imports, callers, interfaces) +- Uncertainties section is present (even if empty) + +**Step 3: Test with a pattern-recognition question** + +Spawn with: "What are the different patterns used for [a repeated change]?" + +Verify it identifies distinct variations. + +**Step 4: Commit** + +```bash +git add agents/concept-researcher.md +git commit -m "feat: add concept-researcher agent prompt" +``` + +--- + +## Phase 4: Orchestrator + +The orchestrator drives the entire review flow. This is the bulk of the work. + +### Task 10: Write orchestrator prompt -- analysis phase + +**Files:** +- Create: `commands/review_pr.md` + +**Step 1: Write the prompt header and analysis phase** + +YAML frontmatter: name (review-pr), description, model (opus). 
The prompt covers: +- Agent role statement (relentless quality advocate) +- Startup: read user-hints.md, check for existing per-PR data +- Fetch PR diff + description + metadata +- Spawn concept-researchers (as few passes as needed, maximum 3) +- Build the review tree using the format from `docs/templates/review-tree.md` +- Verify description against tree +- Run coverage checker +- Evaluate complexity (7 top-level threshold, based on Miller's 7±2) +- Anti-rationalization instructions +- Respect repository convention files (CLAUDE.md, AGENTS.md, etc.) + +Write only the analysis phase first. The walkthrough phase comes in Task 11. + +**Step 2: Test analysis against a real PR** + +Run `/review-pr` with a PR URL. Let it analyze and build the tree. Evaluate: +- Does it produce a reasonable number of concepts? +- Are repetitive patterns collapsed into variation nodes? +- Are description claims verified? +- Is coverage complete? +- Does the tree quality checker pass? + +**Step 3: Iterate on the prompt** + +Refine based on test results. Watch for: +- Agent summarizing instead of grouping (anti-rationalization) +- Missing important pattern distinctions +- Not tracing dependency injection or framework patterns +- Overcounting or undercounting files + +**Step 4: Commit** + +```bash +git add commands/review_pr.md +git commit -m "feat: add orchestrator prompt -- analysis phase" +``` + +--- + +### Task 11: Write orchestrator prompt -- walkthrough phase + +**Files:** +- Modify: `commands/review_pr.md` + +**Step 1: Add walkthrough phase to the orchestrator prompt** + +Extend the prompt with: +- Present tree to customer, handle complexity discussion +- Hierarchical tree traversal (depth-first) +- Customer responses: reviewed, comment, go back +- "I get it! Accept the rest." 
short-circuit → accepted +- Accept all for pattern instances +- Jumping (navigate to any node, unvisited stay pending) +- Comment capture (call add-comment.sh with full metadata) +- Status updates (call update-node-status.sh) +- Adaptive pacing (read the room, don't ask "how familiar are you?") +- Summary generation (on-the-fly from tree via coverage-report.sh) +- Memory update proposals (transparent, customer-approved) +- Fresh start command handling + +**Step 2: Test walkthrough** + +After analysis produces a tree, walk through the review: +- Review the core new code in detail (concept 1) +- "I get it!" on a repetitive pattern (concept 2) +- Walk through one variation example, accept rest (concept 3) +- Review remaining concepts +- Ask for summary +- Verify all state is saved to review-tree.md and review-comments.md + +**Step 3: Test resumability** + +Exit the session. Start a new session with the same PR. Verify: +- Orchestrator finds existing data store +- Offers to resume +- Picks up from the right place based on review-tree.md status + +**Step 4: Iterate and commit** + +```bash +git add commands/review_pr.md +git commit -m "feat: add orchestrator prompt -- walkthrough phase" +``` + +--- + +### Task 12: Write orchestrator prompt -- preconditions and troubleshooting + +**Files:** +- Modify: `commands/review_pr.md` + +**Step 1: Add precondition handling** + +Extend the prompt with: +- At startup, read user-hints.md +- Optimistic -- try to fetch PR, if it fails load the relevant troubleshooting guide +- Reference troubleshoot/pr-access.md, troubleshoot/repo-access.md +- Propose hints to customer after resolution + +**Step 2: Test with a repo that needs troubleshooting** + +Try a PR in a repo where the default diff method fails (e.g., a large PR that truncates, or a repo not cloned locally). Verify the orchestrator loads the troubleshooting guide and follows the fallback chain. 
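As a sketch, the precondition block in the prompt might read (file names are the ones used in this plan; the wording is illustrative):

```markdown
## Preconditions

Try to fetch the PR diff directly. If the fetch fails:
1. Load `troubleshoot/pr-access.md` and follow its fallback chain.
2. If the repository itself is unreachable, load `troubleshoot/repo-access.md`.
3. After resolving, propose a hint for `user-hints.md` so future
   sessions skip the failing method.
```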
+ +**Step 3: Commit** + +```bash +git add commands/review_pr.md +git commit -m "feat: add orchestrator precondition handling" +``` + +--- + +## Phase 5: Detector + Install + +### Task 13: Write detector skill + +**Files:** +- Create: `skills/detect-pr-context.md` (or appropriate location for the platform) + +**Step 1: Write the detector prompt** + +Trigger signals: +- Explicit review language ("review this PR," "check this diff") + URL +- PR URL as first interaction in a new session (strong intent signal) +- Branch with open PR + review mention +- Weak signal: mid-conversation URL (use context to decide) + +Behavior: one gentle offer per PR per session. Never pushy. + +**Step 2: Test triggering** + +Test with prompts: +- "Review this PR: https://github.com/..." → should trigger +- (first message) "https://github.com/.../pull/123" → should trigger +- "I was looking at https://github.com/.../pull/123 for context on a bug" → should use context + +**Step 3: Commit** + +```bash +git add skills/detect-pr-context.md +git commit -m "feat: add PR context detector skill" +``` + +--- + +### Task 14: Write install script + +**Files:** +- Create: `scripts/install` + +**Step 1: Write the install script** + +The script: +- Copies agent files to `~/.claude/agents/` +- Copies command files to `~/.claude/commands/` +- Copies scripts to an accessible location +- Creates `~/.code-review-agent/` if it doesn't exist +- Seeds `user-hints.md` from template if it doesn't exist +- Handles updates (prompt for overwrite) +- Supports custom install path + +**Step 2: Test installation** + +Run install to a temp directory. Verify all files land in the right places. Verify user-hints.md is seeded correctly. + +**Step 3: Commit** + +```bash +git add scripts/install +git commit -m "feat: add install script" +``` + +--- + +## Phase 6: Integration Testing + +### Task 15: End-to-end test -- small PR + +Find a small public PR (5-10 files, focused change). Run the full flow: +1. `/review-pr ` +2. 
Let analysis complete +3. Walk through the tree +4. Add a comment +5. Review summary +6. Verify all state files are correct + +Document results and any issues found. + +--- + +### Task 16: End-to-end test -- medium PR + +Find a medium public PR (20-50 files, multiple concepts). Run the full flow. Verify: +- Multiple concepts identified +- Patterns collapsed where appropriate +- Description verification works +- "I get it!" short-circuit works correctly + +Document results and any issues found. + +--- + +### Task 17: End-to-end test -- large mechanical PR + +Find a large public PR with repetitive mechanical changes (100+ files). This is the benchmark. Verify: +- Many files collapse into few concepts +- Variation nodes correctly identified +- Exclusions are noted +- Description claims are verified +- Tree quality checker passes +- Walkthrough is manageable +- Performance is acceptable (benchmark the analysis phase) + +Document results. This is the demo-quality test case. + +--- + +## Phase 7: Documentation + Release + +### Task 18: Final documentation pass + +**Step 1: Update README** + +Ensure README reflects the actual shipped state (not aspirational features). Update architecture diagram, installation instructions, and usage examples based on what was actually built. + +**Step 2: Attribution review** + +- Credit open source projects that inspired patterns (see Acknowledgments in README) +- Verify LICENSE file is correct +- Ensure CONTRIBUTING.md is current + +**Step 3: Final commit** + +```bash +git add -A +git commit -m "docs: final documentation pass for V1.0" +``` diff --git a/docs/research-summary.md b/docs/research-summary.md new file mode 100644 index 0000000..3897ebe --- /dev/null +++ b/docs/research-summary.md @@ -0,0 +1,331 @@ +# Research Summary: Building an Agentic Code Review Tool + +A summary of findings from research conducted to inform the design of Fowlcon. 
This document covers the evidence base behind the architecture, agent design patterns, and prompt engineering principles. + +--- + +## 1. The Problem Space + +AI coding agents are flooding review queues with PRs that are larger, more frequent, and structurally different from human-authored work. A single agent session can produce changes spanning hundreds of files -- mixing mechanical changes (the same pattern applied repetitively) with novel logic requiring genuine human judgment. + +**Key statistics:** +- 82 million monthly code pushes on GitHub (Octoverse 2025) +- 41% of new code is AI-assisted +- PRs are growing 18% larger with AI; incidents per PR up 24% +- Review is now the rate limiter, not code generation + +**No existing tool solves this.** Current AI code review tools (CodeRabbit, GitHub Copilot review, Graphite Agent) find bugs and post inline comments. None organize changes into logical concepts, collapse repetitive patterns, or provide interactive walkthroughs. Fowlcon fills a gap the academic community has identified but not solved -- a [February 2026 survey of 99 code review papers](https://arxiv.org/abs/2602.13377) found that change decomposition tasks have nearly vanished in the LLM era (14 datasets pre-LLM, only 1 in the LLM era). + +--- + +## 2. Multi-Agent Orchestration + +### Agent-as-Tool, Not Agent-as-Peer + +2024-2025 evidence converges strongly on the **agent-as-tool** model for orchestration. Sub-agents receive a typed task, do their work in an isolated context window, and return structured output. The orchestrator never sees their internal reasoning. + +Anthropic's ["Building Effective Agents"](https://www.anthropic.com/research/building-effective-agents) guide explicitly recommends structured output from sub-agents over conversational output. The peer model causes context window catastrophe -- a 10-turn investigation between orchestrator and researcher can consume 40k tokens just in message history. 
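A structured return in this model might look like the following (using a subset of the section names from Fowlcon's concept-researcher; the contents are hypothetical):

```markdown
## Concept
Retry policy extraction

## Findings
- `RetryPolicy` moved out of the client module into its own file (retry.py:12)
- Three call sites updated to import from the new location

## Uncertainties
None
```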
+ +**Frameworks compared:** +- [LangGraph](https://langchain-ai.github.io/langgraph/): Graph-based supervisor with state reducers per key +- [CrewAI](https://docs.crewai.com/): Role-based hierarchical with isolated workers +- [AutoGen/AG2](https://arxiv.org/abs/2308.08155): Conversational message bus (Microsoft Research) +- [OpenAI Swarm](https://github.com/openai/swarm): Lateral handoffs, experimental + +All converge on the same pattern for reliability: orchestrator dispatches, workers return structured results, orchestrator synthesizes. + +### Single-Writer State Management + +The most robust pattern for shared state is **single-writer with atomic transitions**: one process (the orchestrator) is the sole entity that commits writes. Sub-agents return proposed deltas; they never write directly. This maps to distributed systems fundamentals (Lamport) and is the approach used by LangGraph's state reducers. + +### Supervisor Mode for Parallel Agents + +When spawning multiple agents, use **supervisor mode**: failures captured as data, not exceptions. If 2 of 3 agents succeed, their findings are independently valuable. [Structured concurrency research](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/) (Trio, Kotlin, Swift TaskGroup) provides the theoretical foundation; the [MASFT taxonomy](https://arxiv.org/abs/2502.xxxxx) maps failure modes specific to multi-agent LLM systems. + +**Key references:** +- [MemGPT](https://arxiv.org/abs/2310.08560) -- context-as-memory paging system +- [MetaGPT](https://arxiv.org/abs/2308.00352) -- SOPs as agent coordination primitive +- [Lilian Weng, "LLM-Powered Autonomous Agents"](https://lilianweng.github.io/posts/2023-06-23-agent/) -- best survey of memory/tool/orchestration patterns + +--- + +## 3. 
LLM Diff Comprehension + +### Unified Diff Is the Best Format + +The [Diff-XYZ benchmark](https://arxiv.org/abs/2510.12487) (Oct 2025) -- the first dedicated benchmark for LLM diff understanding -- tested three tasks (Apply, Anti-Apply, Diff Generation) across multiple models and formats. + +**Key findings:** +- Unified diff is the best format for Apply and Anti-Apply tasks +- Claude Sonnet and GPT-4.1 achieved highest performance +- Smaller models benefit from adapted formats (explicit ADD/DEL tags) +- No universal solution -- format selection should match model capability + +### Omit Line Numbers from Hunk Headers + +[Aider's research](https://aider.chat/docs/unified-diffs.html) found that **unified diffs make GPT-4 Turbo 3x less lazy** (score 20% → 61%, lazy instances 12 → 4). Critical design principles: + +1. **Omit line numbers** from hunk headers -- agents perform poorly with explicit line numbers +2. **Chunk by semantic unit** (function/class), not minimal hunk +3. **Include context lines** for matching anchors +4. **Apply flexibly** -- disabling flexible application caused 9x increase in editing errors + +### Implications + +When passing diff context to agents, use unified diff format, strip hunk header line numbers, include surrounding context lines, and chunk by semantic unit rather than minimal hunk. + +--- + +## 4. Cognitive Load and Tree Structure + +### Working Memory Bounds + +Miller's 7±2 (1956) is the strategic capacity with rehearsal and chunking. [Cowan's 4±1](https://doi.org/10.1017/S0140525X01003922) (2001) is the raw capacity when strategies are blocked. For labeled, on-screen tree nodes, 5-7 is defensible at the top level because labels provide scaffolding. + +### Hierarchy Depth + +Decades of depth-breadth research (Miller 1981, Kiger 1984, Larson & Czerwinski 1998, Zaphiris 2000) converge: **2 levels is optimal, 3 is acceptable, 4+ consistently degrades performance**. 
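Applied to Fowlcon's review tree, these bounds produce shapes like this (labels hypothetical; `{variation}` notation from the tree format):

```markdown
1. Authentication middleware rewrite (novel logic, review in detail)
2. Handler signature migration {variation}
   2.1 Sync handlers (12 files)
   2.2 Async handlers (7 files)
3. Config schema additions
```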
+ +### Children Per Node + +The optimal sub-items-per-concept ratio is **3-5**. A node with 1 child is structural waste (collapse it). A node with 8+ children exceeds both raw and strategic working memory. + +### Practical Limits for Review Trees + +| Parameter | Preferred | Acceptable | Never | +|-----------|-----------|------------|-------| +| Top-level concepts | 5-6 | up to 7 | 8+ | +| Tree depth | 2-3 levels | 3 levels | 4+ | +| Children per node | 3-5 | 2-7 | 1 or 8+ | +| Node labels | Descriptive functional names | Short phrases | "Other", "Misc" | + +### Code Review Specific + +The [Cisco code review study](https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/) found 200-400 LOC is optimal per review session, with a cliff effect beyond that threshold. Microsoft research found more files = less useful feedback, with a quality shift at 600+ LOC. This validates organizing large PRs into bounded, reviewable units. + +--- + +## 5. Human Factors in AI-Assisted Review + +### Trust Is Asymmetric + +84% of developers adopt AI tools but only 33% trust them. Trust is asymmetric: one bad AI suggestion erodes trust more than many good ones build it. Senior developers are the most skeptical (20% "highly distrust"). + +### Automation Bias + +AI suggestions are accepted 60-80% of the time. 59% of developers use AI code they don't fully understand. Paradoxically, **higher AI quality increases complacency** -- 2.5x more likely to merge without review. + +The "documentarian" agent pattern (facts, not recommendations) is an anti-automation-bias design. When the tool organizes and explains rather than judges, the reviewer must engage their own judgment. + +### Alert Fatigue + +Tools get ignored when developers override >30% of flags. The best-in-class tools achieve <3% false positive rates. 
[Graphite](https://graphite.dev/) reduced false positives from 9:1 to 1:1 by switching from free-text to function-calling output format -- constraining the model's output space was more effective than instructing it to be careful. + +**Key finding:** Precision beats recall when humans are in the loop. A tool catching 45% of bugs with high trust outperforms one catching 50% that gets ignored. + +### Interactive vs. Batch Review + +Working memory holds 2-4 concurrent chunks. Comment dumps exceed this immediately. Progressive disclosure (show summary, let user drill into detail) improves 3/5 usability components (NNG research). Amazon Q explicitly adopted the interactive model. + +The ideal pattern: summary first → structured navigation → detail on demand → conversation capability. + +--- + +## 6. Behavioral Controls in Agent Prompts + +### Format Constraints Are the Strongest Control + +The single most effective behavioral control is **constraining output format**. Measured results: + +| Technique | Measured Effect | Source | +|-----------|----------------|--------| +| Edit format switch (SEARCH/REPLACE → unified diff) | 3x laziness reduction | [Aider benchmarks](https://aider.chat/docs/unified-diffs.html) | +| Free-text → function calling | 9:1 → 1:1 false positive ratio | Graphite engineering blog | +| Required JSON schema | Eliminates format errors | [OpenAI structured outputs](https://platform.openai.com/docs/guides/structured-outputs) | + +When an agent must fill required sections of a template, it cannot skip them. Format constraints convert behavioral requirements into structural requirements. + +### Negative Constraints + +The "documentarian mandate" (six DO NOT rules + one ONLY rule) is the most reliable behavioral control observed in production agent tools. Pure negative instructions are unreliable alone -- pair them with positive alternatives. The effectiveness comes from targeting specific failure modes rather than making vague positive requests. 
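Format constraints of this kind can be enforced mechanically rather than behaviorally. A minimal sketch of a required-sections check in the spirit of `check-tree-quality.sh` (the section names and function are illustrative, not Fowlcon's actual validator):

```python
import re

REQUIRED_SECTIONS = ["## Findings", "## Uncertainties"]  # illustrative names

def validate_agent_output(markdown: str) -> list[str]:
    """Return the required sections missing from an agent's output.

    A template violation is caught structurally, regardless of whether
    the agent 'wanted' to be thorough -- the format constraint does the work.
    """
    headings = set(re.findall(r"^##\s+.*$", markdown, flags=re.MULTILINE))
    return [s for s in REQUIRED_SECTIONS if s not in headings]

# An output that skips its Uncertainties section fails mechanically:
output = "## Findings\n- `auth.py:42` validates tokens\n"
assert validate_agent_output(output) == ["## Uncertainties"]
```

The point of the sketch is that the check needs no model in the loop: a schema violation is a string mismatch, not a judgment call.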
+ +### Named Anti-Patterns + +Naming specific rationalizations an agent might use ("This is simple enough to skip") paired with corrections ("Build the tree anyway") is theoretically grounded in [inoculation theory](https://en.wikipedia.org/wiki/Inoculation_theory) (McGuire, 1961). A named anti-pattern is harder to use than an unnamed one. + +### Prompt Placement + +[Liu et al. (2023)](https://arxiv.org/abs/2307.03172) "Lost in the Middle" established the U-shaped attention curve. Place critical constraints at both the beginning (primacy) and end (recency) of prompts. [Microsoft Research (2025)](https://arxiv.org/abs/2502.xxxxx) found 39% average degradation in multi-turn conversations with 112% increase in unreliability. + +### The Practical Hierarchy + +From strongest to weakest behavioral control: + +1. **Constrain output structure** (schemas, templates, required sections) +2. **Optimize placement** (primacy + recency positioning) +3. **Provide positive examples** (few-shot with correct output) +4. **Name failure patterns** (anti-rationalization tables) +5. **Use negative instructions** (paired with positive alternatives) + +--- + +## 7. Context Window Management + +### Context Degradation Is Real + +Even with perfect retrieval, performance degrades 13.9%-85% as context grows. The "Lost in the Middle" effect creates a U-shaped attention curve: beginning and end are attended to; middle content is lost. + +**Multi-turn degradation:** Microsoft Research found 39% average degradation in long conversations, decomposed into 16% aptitude loss and 112% unreliability increase. Critically, **batching context into a fresh call restored 90%+ accuracy**. + +### The Fresh Start Pattern + +State lives on the filesystem (markdown files), not in the LLM's context window. Each session reads world state from files. Failed attempts leave traces. This is the [Reflexion pattern](https://arxiv.org/abs/2303.11366) (Shinn et al., NeurIPS 2023) applied to agentic code review. 
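The fresh-start pattern reduces to a few filesystem operations. A minimal sketch (the file name and state shape are assumed, not Fowlcon's actual format):

```python
import json
import tempfile
from pathlib import Path

def start_session(state_dir: Path) -> dict:
    """Begin a session from filesystem state, not conversation memory."""
    state_file = state_dir / "world-state.json"  # illustrative file name
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {"attempts": 0, "hints": []}

def end_session(state_dir: Path, state: dict, hint: str) -> None:
    """Commit distilled learnings to disk; the context window is discarded."""
    state["attempts"] += 1
    state["hints"].append(hint)  # failed attempts leave traces
    (state_dir / "world-state.json").write_text(json.dumps(state))

# Session 1 fails; session 2 starts with a fresh context but reads the trace.
workdir = Path(tempfile.mkdtemp())
state = start_session(workdir)
end_session(workdir, state, "config section grouping was confusing")
assert start_session(workdir)["hints"] == ["config section grouping was confusing"]
```

Nothing survives between sessions except the files -- which is exactly why context rot cannot accumulate.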
+ +### Context Budget Discipline + +Treat context as a finite resource: +- Static content (format specs, instructions): cache across calls +- Sub-agent results: commit to disk, then drop from context +- Coverage status: check via script output, not accumulated findings +- Walkthrough: fresh context at phase boundaries + +[Factory.ai's research](https://factory.ai/news/context-window-problem) on scaling agents recommends: structured repository overviews at session start, targeted file operations (specific line ranges, not full files), and hierarchical memory layers. + +--- + +## 8. Error Recovery + +### Four Failure Archetypes + +Research identifies four recurring failure patterns in multi-agent LLM systems: + +1. **Premature action without grounding** -- acting before reading enough context +2. **Over-helpfulness** -- substituting missing entities with hallucinated ones +3. **Distractor-induced context pollution** -- irrelevant context degrading reasoning +4. **Fragile execution under load** -- performance degrading with large inputs + +Source: [AgentErrorTaxonomy](https://arxiv.org/abs/2512.07497), [AgentDebug framework](https://arxiv.org/abs/2509.25370) + +### Graceful Degradation + +When non-critical agents fail, proceed with available results. Mark failed areas in the review tree as `[pending]`. The coverage checker catches gaps mechanically. The reviewer sees what was and wasn't analyzed. + +### Anti-Hallucination Through Tool Verification + +Every claim must be verifiable by a tool call. The coverage checker can double as a grounding verifier -- given findings with file:line claims, it reads those lines and confirms the claims match reality. This is the [CRITIC pattern](https://arxiv.org/abs/2305.11738) (Gou et al., ICLR 2024) applied to code review. + +--- + +## 9. Security Considerations + +### Prompt Injection in Code Review + +PR diffs, descriptions, and code are untrusted input. 
[OWASP ranks prompt injection](https://genai.owasp.org/llmrisk/llm01-prompt-injection/) as the #1 critical vulnerability for LLM applications (2025). + +### Defense-in-Depth + +1. **Privilege minimization**: agents get read-only access (`Read, Grep, Glob` -- no `Write`, no `Bash`) +2. **Data/instruction separation**: diff content passed as data within structured delimiters, not as instructions +3. **Output validation**: verify file paths reference real files, line numbers are within bounds +4. **The strongest defense**: the agent never recommends approve/reject. Even if an attacker manipulates the agent's understanding, the human reviewer sees actual code during walkthrough. + +--- + +## 10. Cost Optimization + +### Model Tiering + +| Role | Model | Why | +|------|-------|-----| +| Orchestrator | Opus | Complex reasoning, multi-turn, synthesis | +| Concept researchers | Sonnet | Focused analysis, good quality/cost ratio | +| Coverage checker | Haiku | Mechanical verification, fast and cheap | + +### Key Optimization Levers + +- **Prompt caching**: 90% savings on repeated content (format specs, agent instructions) +- **Model tiering**: route 90% of work to cheaper models (cascading pattern saves 60-87%) +- **Batch API**: 50% discount for non-interactive analysis phases +- **Context minimization**: send relevant hunks, not full files; drop findings after committing to disk + +### Estimated Cost + +~$2-3 per review for a 50-file PR. Larger PRs (390 files) cost proportionally more due to additional research agents. + +--- + +## 11. 
The Competitive Landscape + +### Existing AI Code Review Tools + +| Tool | Approach | Catch Rate (Greptile benchmark) | +|------|----------|------| +| [Greptile](https://www.greptile.com/) | Deep codebase-aware review | 82% | +| [GitHub Copilot](https://github.com/features/copilot) | PR review as Copilot extension | ~55% | +| [CodeRabbit](https://coderabbit.ai/) | Automatic inline comments | 44% | +| [Graphite Agent](https://graphite.dev/) | Stacked PRs + AI review | 6% (workflow-focused) | +| [Qodo PR-Agent](https://github.com/qodo-ai/pr-agent) | Open source, layered architecture | -- | + +### What Makes Fowlcon Different + +No existing tool does what Fowlcon does: +1. **Concept tree decomposition** of arbitrary PRs +2. **Pattern collapse** (194 identical changes → 1 example + 193 `{repeat}`) +3. **Interactive walkthrough** with explicit reviewer confirmations +4. **Coverage guarantee** (every changed line mapped) +5. **PR description verification** against actual diff +6. **Structured pushback** on overly complex PRs (not "too big" but "here are the 7 concepts and why they interleave") + +Fowlcon is a **comprehension tool**, not a bug finder. It helps reviewers understand what's there so they can decide what's wrong themselves. + +--- + +## 12. Open Source Agents -- Converging Patterns + +Analysis of SWE-agent, OpenHands, Aider, Plandex, Mentat, Devika, and Devin reveals 8 patterns converging across the ecosystem: + +1. **Orchestrator-worker split** (universal) +2. **Repository maps beat iterative search** ([Aider's PageRank approach](https://aider.chat/docs/repomap.html): 4-6% context utilization vs 54-70% for iterative search) +3. **Context condensation is critical** (SWE-agent, OpenHands, Plandex all implement it) +4. **Sandboxed state separate from source** (Plandex, Devin, SWE-agent) +5. **Documentation as load-bearing infrastructure** (CLAUDE.md, AGENTS.md trend) +6. **Confidence-based filtering** (reduce noise by scoring finding confidence) +7. 
**Event sourcing for agent state** (OpenHands V1) +8. **Standardized protocols** (MCP, A2A) + +**Devin insight** (from Cognition's 18-month performance review): "senior-level at codebase understanding, junior at execution." Since code review is primarily an understanding task, this validates the agent-assisted review approach. + +--- + +## Sources + +### Academic Papers +- [Diff-XYZ: A Benchmark for Evaluating Diff Understanding](https://arxiv.org/abs/2510.12487) (Oct 2025) +- [A Survey of Code Review Benchmarks in Pre-LLM and LLM Era](https://arxiv.org/abs/2602.13377) (Feb 2026) +- [Lost in the Middle: How Language Models Use Long Contexts](https://arxiv.org/abs/2307.03172) (Liu et al., 2023) +- [Reflexion: Language Agents with Verbal Reinforcement Learning](https://arxiv.org/abs/2303.11366) (NeurIPS 2023) +- [AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation](https://arxiv.org/abs/2308.08155) +- [MemGPT: Towards LLMs as Operating Systems](https://arxiv.org/abs/2310.08560) +- [MetaGPT: Meta Programming for Multi-Agent Collaborative Framework](https://arxiv.org/abs/2308.00352) +- [CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing](https://arxiv.org/abs/2305.11738) (ICLR 2024) +- [The Effects of Change Decomposition on Code Review](https://peerj.com/articles/cs-193/) (PeerJ) + +### Industry and Engineering Sources +- [Anthropic: Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) (2024) +- [Aider: Unified Diffs Make GPT-4 Turbo 3x Less Lazy](https://aider.chat/docs/unified-diffs.html) (2024) +- [Aider: Repository Map](https://aider.chat/docs/repomap.html) +- [Lilian Weng: LLM-Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) (2023/2024) +- [Factory.ai: The Context Window Problem](https://factory.ai/news/context-window-problem) +- [OWASP Top 10 for LLM Applications](https://genai.owasp.org/llmrisk/llm01-prompt-injection/) (2025) +- [State of AI 
Code Review Tools 2025](https://www.devtoolsacademy.com/blog/state-of-ai-code-review-tools-2025/) +- [Greptile AI Code Review Benchmarks](https://www.greptile.com/benchmarks) +- [Smashing Magazine: Designing for Agentic AI](https://www.smashingmagazine.com/2026/02/designing-agentic-ai-practical-ux-patterns/) (Feb 2026) +- [Claude Agent SDK Overview](https://platform.claude.com/docs/en/agent-sdk/overview) +- [Anthropic: Skill Authoring Best Practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices) + +### Open Source Tools Studied +- [SWE-agent](https://github.com/SWE-agent/SWE-agent) -- Agent-Computer Interface for code +- [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent) -- 100-line agent, 74%+ SWE-bench +- [OpenHands](https://github.com/OpenHands/OpenHands) -- CodeAct architecture +- [Aider](https://github.com/Aider-AI/aider) -- Repository map + edit format research +- [Qodo PR-Agent](https://github.com/qodo-ai/pr-agent) -- Open source code review agent diff --git a/docs/research/agent-memory-systems.md b/docs/research/agent-memory-systems.md new file mode 100644 index 0000000..2cbecdf --- /dev/null +++ b/docs/research/agent-memory-systems.md @@ -0,0 +1,77 @@ +# Agent Memory Systems for Code Review State + +**Context:** Fowlcon V1.01 evaluates structured backing stores for the review tree. V1.0 uses markdown files; V1.01 may adopt a graph-based system. + +## The Problem + +Fowlcon's review tree is a hierarchical graph: concepts with children, statuses, file mappings, and comments. V1.0 stores this as a markdown file with embedded checkboxes, managed by shell scripts. This works but lacks structured queries ("show me all pending nodes") and graph operations ("what's the subtree under concept 3?"). + +## Beads (steveyegge/beads) + +[Beads](https://github.com/steveyegge/beads) (17k+ stars) is a distributed, git-backed graph issue tracker purpose-built as persistent memory for AI coding agents. 
+ +**What maps to Fowlcon's needs:** +- Parent/child hierarchy with dotted IDs (`bd-a3f8.1.1`) +- `bd ready --json` -- "show me unblocked/pending work" +- Git-backed persistence across sessions +- Agent-friendly JSON output +- Hash-based IDs prevent merge conflicts + +**What doesn't map:** +- Fixed status enums (`open`, `in_progress`, `closed`) -- Fowlcon needs `pending`, `reviewed`, `accepted` +- Issue tracker vocabulary (epics, sprints, messaging) -- Fowlcon needs review tree vocabulary +- Designed for task tracking, not code review state + +**Implementations:** + +| Implementation | Language | Storage | Sync | Status Extensible | +|---|---|---|---|---| +| Go Beads (`bd`) | Go | Dolt (version-controlled SQL) | Automatic (Dolt push/pull) | No (hardcoded) | +| beads_rust (`br`) | Rust | SQLite + JSONL | Manual (`br sync`) | No (4 fixed statuses) | +| beads-rs | Rust | Pure git refs (CRDT) | Automatic (daemon) | No (3 fixed statuses) | + +**Key finding:** Neither Rust implementation supports custom statuses. Both have hardcoded enums that don't match Fowlcon's states (pending/reviewed/accepted). Using Beads would require either a fork, labels-as-statuses workaround, or a custom thin layer. 
+ +## Alternatives Evaluated + +### Claude Code Tasks +- Built-in to Claude Code, zero install +- Session-scoped by default -- no cross-session persistence keyed by PR +- Too limited for persistent review state + +### Flux (MCP Kanban) +- Team-level visibility with web dashboard +- Kanban columns don't map to hierarchical tree structure +- Wrong abstraction for our use case + +### "ticket" (minimal bash alternative) +- Single bash script, flat files, graph dependencies +- Aligns with our "thin shell scripts" philosophy +- Too flat for nested concept trees + +## Complementary Layers Pattern + +The ecosystem converges on complementary layers, not one-size-fits-all: + +``` +Superpowers -- HOW to work (process discipline) +Claude Tasks -- WHAT to do (session-level) +Beads -- WHAT to do (project-level) +Flux -- WHAT to do (team-level) +``` + +Fowlcon's review tree is a new layer: WHAT WAS REVIEWED (PR-level, cross-session). + +## Recommendation for Fowlcon + +- **V1.0:** Markdown tree + shell scripts. Simple, no dependencies, proven in Phase 1. +- **V1.01:** Evaluate Beads as backing store. The graph structure and query API are worth the dependency if the status mismatch can be resolved (fork or labels workaround). +- **Fallback:** Custom lightweight solution -- `git2` + `serde` in Rust with our own status enum. More work but zero dependency mismatch. 
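As a point of comparison for the fallback, a thin query layer over a markdown tree is only a few lines. A sketch assuming a checkbox-style format with one node per line (not necessarily Fowlcon's exact tree syntax):

```python
import re

def pending_nodes(tree_markdown: str) -> list[str]:
    """Return labels of unreviewed nodes from a checkbox-style tree.

    Assumes '- [ ] label' marks a pending node and '- [x] label' a
    reviewed one. Indentation (nesting) is ignored for this query.
    """
    pattern = re.compile(r"^\s*- \[ \] (.+)$", re.MULTILINE)
    return pattern.findall(tree_markdown)

tree = """\
- [x] 1. Authentication refactor
  - [x] 1.1 Token validation
  - [ ] 1.2 Session storage
- [ ] 2. Config migration
"""
assert pending_nodes(tree) == ["1.2 Session storage", "2. Config migration"]
```

"Show me all pending nodes" is cheap even without a structured store; it is subtree operations and cross-session merge semantics where Beads would start to earn its dependency.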
+ +## Sources + +- [Beads](https://github.com/steveyegge/beads) -- Git-backed issue tracker for AI agents +- [beads_rust](https://github.com/Dicklesworthstone/beads_rust) -- Rust port (SQLite) +- [beads-rs](https://github.com/delightful-ai/beads-rs) -- Rust port (CRDT, pure git) +- [beads_viewer](https://github.com/Dicklesworthstone/beads_viewer) -- Graph-aware TUI +- [Introducing Beads](https://steve-yegge.medium.com/introducing-beads-a-coding-agent-memory-system-637d7d92514a) -- Steve Yegge's introduction diff --git a/docs/research/agentic-restart-patterns.md b/docs/research/agentic-restart-patterns.md new file mode 100644 index 0000000..c15717b --- /dev/null +++ b/docs/research/agentic-restart-patterns.md @@ -0,0 +1,78 @@ +# Agentic Restart Patterns (Ralph Wiggum Loop) + +**Context:** Fowlcon's fresh-start mode wipes per-PR state and restarts with distilled learnings. This is an instance of the "write → evaluate → restart with learnings" pattern. + +## The Pattern + +An agent produces output, evaluates it (or has it evaluated), determines it's inadequate, and restarts carrying forward what it learned. The key design question: how are learnings carried between iterations? + +## Three Approaches + +### 1. In-Context Accumulation (Self-Refine) + +Prior outputs, critiques, and revisions stay in the same conversation. The model sees everything. + +**Academic basis:** Self-Refine (Madaan et al., NeurIPS 2023) + +**Works for:** Short tasks where the full history is useful to the refiner. + +**Breaks down:** Context rot degrades quality as context fills up, even well within token limits. A Chroma study found all 18 tested LLMs showed degradation. Models advertised at 200K tokens became unreliable around 130K. + +### 2. Distilled Verbal Reflection (Reflexion) + +Failures are summarized into natural-language lessons stored in a bounded memory buffer (1-3 entries). The full trajectory is discarded; only the distilled lesson is carried forward. 
+ +**Academic basis:** Reflexion (Shinn et al., NeurIPS 2023) -- 91% pass@1 on HumanEval vs 80% GPT-4 baseline. + +**Works for:** Tasks with clear success/failure criteria. The reflection captures why the attempt failed, not the full attempt. + +**Breaks down:** Lossy by design. Nuance in the failed trajectory may be lost in the summary. + +### 3. External State (Ralph Loop / Filesystem) + +State lives entirely outside the LLM: git history, markdown files, JSON task trackers. Each iteration spawns a fresh context that reads external state. No context rot. + +**Origins:** Coined by Geoffrey Huntley and Ryan Carson in the Claude Code community. Named after Ralph Wiggum ("iterating repeatedly and not giving up"). + +**Works for:** Long-running tasks where context rot is the enemy. Each iteration gets maximum cognitive clarity. + +**How it works:** +- A loop repeatedly invokes a fresh agent with the same prompt +- State lives on the filesystem (git commits, progress files, hint files) +- Each iteration reads world state from files rather than remembering it +- Failed attempts leave traces via git history and accumulated hints + +## Critical Finding: Self-Correction Requires External Feedback + +Huang et al. (ICLR 2024) showed that **intrinsic self-correction** (the model critiquing itself without external feedback) does not improve and often degrades performance. The methods that work (Reflexion, CRITIC, Ralph Loop) all depend on external signals: test results, tool output, or human feedback. + +**Implication for Fowlcon:** The customer is the external evaluation signal in Fowlcon's restart loop. Their feedback ("the config section grouping was confusing") is what makes the fresh start meaningful. Without external feedback, restarting just repeats the same mistakes. 
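The loop shape this implies can be sketched directly -- the evaluator sits outside the attempt function, standing in for the customer, a test suite, or tool output (all names here are illustrative):

```python
def restart_loop(attempt_fn, evaluate_fn, max_iterations=3):
    """Write -> external evaluation -> restart with distilled learnings.

    attempt_fn(hints) produces an output from scratch plus prior hints;
    evaluate_fn(output) is the EXTERNAL signal. Intrinsic self-critique
    is deliberately absent -- per Huang et al., it does not help without
    external feedback.
    """
    hints: list[str] = []
    for _ in range(max_iterations):
        output = attempt_fn(hints)        # fresh context each iteration
        ok, lesson = evaluate_fn(output)  # external feedback, not self-critique
        if ok:
            return output
        hints.append(lesson)              # distilled lesson, not full history
    return None

# Stub: the attempt only succeeds once told what went wrong.
attempt = lambda hints: "grouped by concept" if hints else "grouped by file"
evaluate = lambda out: (True, "") if out == "grouped by concept" else (False, "group by concept, not file")
assert restart_loop(attempt, evaluate) == "grouped by concept"
```

Remove `evaluate_fn` and the loop degenerates into repeating the first attempt -- the structural version of "restarting just repeats the same mistakes."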
+ +## How Fowlcon Uses This + +Fowlcon's fresh-start mode is approach #3 (External State): + +- `fresh-start-context.md` carries distilled learnings from the previous attempt +- Per-PR cache gets wiped (the failed state is discarded) +- The orchestrator restarts with a clean context window +- The hints inform but don't constrain the new analysis + +The customer triggers fresh start explicitly. The orchestrator generates the context file (with customer approval) before wiping. + +## Related Patterns + +- **AlphaCodium** (Ridnik et al., 2024): Test-anchored iteration -- passing tests become fixed anchors, preventing regression. Similar to Fowlcon's "reviewed nodes survive fresh start as hints." +- **LATS** (Zhou et al., ICML 2024): Tree-structured restart -- backtrack to best-known state and try a different path. More sophisticated than linear restart. +- **Context rot** as a named phenomenon: LLM performance degrades as context fills. The Ralph Loop is partly a solution -- restart fresh, externalize state. 
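The AlphaCodium parallel can be made concrete: reviewed nodes act as fixed anchors that survive the wipe as hints. A sketch (the node shape and status names are assumed, not Fowlcon's actual format):

```python
def distill_for_fresh_start(tree: list[dict]) -> list[str]:
    """Keep reviewed nodes as anchors; discard the failed structure.

    Like AlphaCodium's passing tests, already-confirmed nodes should not
    be re-litigated by the next attempt. They become hints that inform,
    but do not constrain, the new analysis.
    """
    return [node["label"] for node in tree if node["status"] == "reviewed"]

tree = [
    {"label": "Auth refactor", "status": "reviewed"},
    {"label": "Config migration", "status": "pending"},
]
assert distill_for_fresh_start(tree) == ["Auth refactor"]
```

The output is what a `fresh-start-context.md` generator would carry forward; everything else is wiped.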
+ +## Sources + +- [Reflexion: Language Agents with Verbal Reinforcement Learning](https://arxiv.org/abs/2303.11366) (Shinn et al., NeurIPS 2023) +- [Self-Refine: Iterative Refinement with Self-Feedback](https://arxiv.org/abs/2303.17651) (Madaan et al., NeurIPS 2023) +- [Large Language Models Cannot Self-Correct Reasoning Yet](https://arxiv.org/abs/2310.01798) (Huang et al., ICLR 2024) +- [CRITIC: LLMs Can Self-Correct with Tool-Interactive Critiquing](https://arxiv.org/abs/2305.11738) (Gou et al., ICLR 2024) +- [Language Agent Tree Search](https://arxiv.org/abs/2310.04406) (Zhou et al., ICML 2024) +- [Code Generation with AlphaCodium](https://arxiv.org/abs/2401.08500) (Ridnik et al., 2024) +- [From ReAct to Ralph Loop](https://www.alibabacloud.com/blog/from-react-to-ralph-loop-a-continuous-iteration-paradigm-for-ai-agents_602799) (Alibaba Cloud) +- [Self-Improving Coding Agents](https://addyosmani.com/blog/self-improving-agents/) (Addy Osmani) +- [Context Engineering for AI Agents](https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus) (Manus) diff --git a/docs/research/github-pr-review-api.md b/docs/research/github-pr-review-api.md new file mode 100644 index 0000000..5734fa5 --- /dev/null +++ b/docs/research/github-pr-review-api.md @@ -0,0 +1,83 @@ +# GitHub PR Review API: Stable References for Inline Comments + +**Context:** Fowlcon V1.1 posts reviews with inline comments to GitHub PRs. Getting comment placement right is the hardest technical problem in V1.1. + +## The Problem + +GitHub's inline comment API requires precise positioning. A misplaced comment destroys reviewer trust. There are two APIs -- one deprecated, one current. + +## Deprecated: `position` Parameter (REST v3) + +The original API used `position` -- an integer offset from the `@@` hunk header in the diff. 
This was:
- Fragile (changes when the diff changes)
- Hard to compute (requires parsing unified diff format)
- Only worked with the pull request's HEAD at the time of the API call

**Do not use this.** GitHub's docs mark it as deprecated.

## Current: `line` + `side` Parameters

The modern API uses actual file line numbers:

| Parameter | Type | Description |
|-----------|------|-------------|
| `commit_id` | string | The SHA of the commit to comment on |
| `path` | string | File path relative to repo root |
| `line` | integer | Line number in the file (not the diff) |
| `side` | string | `RIGHT` (new code) or `LEFT` (old/deleted code) |
| `start_line` | integer | For multi-line comments: first line |
| `start_side` | string | Side for the start line |

**The stable reference tuple is:** `commit_sha + path + line + side`

This is what Fowlcon's `review-comments.md` format stores per comment.

## Single-Line vs Multi-Line Comments

**Single-line:** `line` + `side` only. Comments on one line of code.

**Multi-line:** `line` + `side` + `start_line` + `start_side`. Highlights a range. The `line` is the END of the range (where the comment bubble appears), `start_line` is the beginning.

Both lines must be in the same diff hunk. You cannot span across hunks.

## Pending Reviews vs Direct Comments

**Direct comment** (`POST /repos/{owner}/{repo}/pulls/{number}/comments`): Posts immediately. No way to batch or undo.

**Pending review** (recommended for Fowlcon):
1. Create a pending review: `POST /repos/{owner}/{repo}/pulls/{number}/reviews`, omitting the `event` field -- a review created without an `event` stays `PENDING` (`"PENDING"` is not an accepted `event` value)
2. Attach comments via the `comments` array in that same creation call; adding comments incrementally to an already-pending review requires GraphQL
3. Submit when ready: `POST /repos/{owner}/{repo}/pulls/{number}/reviews/{review_id}/events` with `event: "APPROVE"`, `"COMMENT"`, or `"REQUEST_CHANGES"`

Pending reviews are invisible to others until submitted. 
This aligns with Fowlcon's "explicit affirmative before posting" principle. + +## GraphQL Alternative (Recommended) + +GraphQL (`gh api graphql`) provides advantages over REST: + +- **Incremental comment addition** to pending reviews +- **Per-comment error isolation** -- one bad placement doesn't fail the whole review +- **File-level comments** supported (comment on a file, not a specific line) +- **Pending reviews** invisible until submitted + +REST is the fallback. Both go through the same `gh` binary. + +## Graceful Degradation + +If an inline comment can't be placed (line not in diff, file renamed, hunk boundary crossed): +1. Fall back to a **top-level review comment** with `(intended for path/to/file.java:42)` prefix +2. The review still gets posted -- better imperfect than failed + +## SHA Re-Indexing + +If the PR HEAD has moved since analysis: +- Comment positions may be wrong (lines shifted, hunks changed) +- Options: re-map positions against the new diff, warn the reviewer, or refuse to post until re-analysis +- Fowlcon detects this by comparing the stored SHA against current PR HEAD + +## Sources + +- [GitHub REST API: Pull Request Comments](https://docs.github.com/en/rest/pulls/comments) +- [GitHub REST API: Pull Request Reviews](https://docs.github.com/en/rest/pulls/reviews) +- [GitHub GraphQL API: AddPullRequestReviewComment](https://docs.github.com/en/graphql/reference/mutations#addpullrequestreviewcomment) +- [GitHub: Creating a pull request review](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/reviewing-proposed-changes-in-a-pull-request) diff --git a/docs/research/tui-frameworks-comparison.md b/docs/research/tui-frameworks-comparison.md new file mode 100644 index 0000000..93fdad4 --- /dev/null +++ b/docs/research/tui-frameworks-comparison.md @@ -0,0 +1,127 @@ +# TUI Framework Comparison for Agent-Driven Code Review + +**Context:** Fowlcon V1.01 replaces the conversational walkthrough 
with a dedicated TUI. This research compares the leading frameworks.

## Candidates

| Framework | Lang | Stars | Architecture | Rendering |
|-----------|------|-------|-------------|-----------|
| **Ink** | JS/TS | 35k+ | React custom renderer | Virtual DOM → string |
| **Bubble Tea** | Go | 28k+ | Elm Architecture (MVU) | Model-View-Update |
| **Ratatui** | Rust | 18k+ | Immediate-mode | Double-buffered diff |
| **Textual** | Python | 18k+ | Async DOM + CSS | Retained-mode with TCSS |

## What Fowlcon's TUI Needs

- Tree visualization with status markers (pending/reviewed/accepted)
- Diff viewing with context
- Conversation panel (reviewer talks to agent)
- Quick response controls (single-key confirm, accept, navigate)
- Markdown rendering (agent explanations)
- File watching (auto-refresh when state files change)

## Framework Analysis

### Ink (React for Terminals)

**Architecture:** Custom React renderer targeting the terminal. Yoga (Flexbox) layout engine. All React features work -- hooks, state, effects, Suspense.

**Strengths for Fowlcon:**
- Claude Code is built on Ink -- the exact "agent streaming output in a TUI" pattern is proven
- The `<Static>` component renders completed items permanently above the active UI (ideal for conversation history)
- React's component model maps to our tree (each node is a component with state)
- TypeScript-first, rich ecosystem (ink-markdown, ink-text-input, ink-select-input)

**Limitations:**
- No native scrolling (must implement windowing manually)
- No CSS Grid (Flexbox only)
- `<Text>` cannot contain `<Box>` elements
- Requires Node.js runtime

**Known users:** Claude Code, Gemini CLI, GitHub Copilot CLI, Vitest, Prisma, Vercel CLI

### Bubble Tea (Go Elm Architecture)

**Architecture:** Elm Architecture (MVU) -- single Model, Update processes messages, View renders strings. Commands handle side effects. 
+ +**Strengths for Fowlcon:** +- Single binary deployment (no runtime) +- Fast startup (<10ms) +- Go goroutines handle concurrent commands naturally +- Rich ecosystem: Lip Gloss (styling), Bubbles (components), Glamour (markdown rendering) +- `WithoutRenderer()` for headless/testing mode +- v2 Cursed Renderer for optimized rendering + +**Limitations:** +- No layout engine -- manual positioning via Lip Gloss `Place`/`JoinHorizontal`/`JoinVertical` +- Nested components need explicit message routing boilerplate +- MVU learning curve for imperative Go programmers + +**Known users:** Glow, Soft Serve, Mods, CockroachDB, AWS eks-node-viewer + +### Ratatui (Rust Immediate-Mode) + +**Architecture:** Immediate-mode rendering -- entire UI redrawn from state each frame. Double-buffered diff rendering writes only changed cells. + +**Strengths for Fowlcon:** +- Sub-millisecond rendering, zero-cost abstractions +- Constraint-based layout (Cassowary algorithm) +- Rich built-in widgets: List, Table, Paragraph, Scrollbar, Tabs, Tree (community) +- Single binary, minimal overhead + +**Limitations:** +- No built-in event handling (bring your own via crossterm/termion) +- No built-in application structure (design your own main loop) +- Steep learning curve (Rust ownership + immediate-mode + external event loop) +- Still on v0.x (no v1.0 announced) + +**Known users:** Amazon Q CLI, Netflix bpftop, OpenAI Codex CLI, Helix editor + +### Textual (Python Async TUI + Web) + +**Architecture:** Retained-mode DOM tree with CSS engine, message passing, async event loop. Essentially web development patterns for the terminal. 
+ +**Strengths for Fowlcon:** +- Built-in Tree, Markdown, DataTable, TextArea widgets -- less to build from scratch +- CSS-based layout (Grid + Flexbox) -- easiest complex multi-panel layouts +- Web mode for free (`textual serve`) -- share reviews in a browser with zero code changes +- Fastest iteration speed (Python + live CSS editing) +- Workers (`@work`) for non-blocking I/O + +**Limitations:** +- Python startup time (~200-500ms) +- Higher memory overhead (retained DOM + CSS engine) +- GIL limitations for true parallelism + +**Known users:** DevOps monitoring tools, cybersecurity dashboards + +## Key Pattern: beads_viewer + +[beads_viewer](https://github.com/Dicklesworthstone/beads_viewer) (1.3k stars) is a Bubble Tea TUI built on top of the Beads issue tracker. It demonstrates patterns directly applicable to Fowlcon: + +- **File watching + auto-refresh:** Watches `.beads/beads.jsonl` and refreshes all views when it changes. Maps to watching `review-tree.md`. +- **Robot mode API:** `--robot-*` flags expose a structured JSON API for agents. The TUI is both interactive AND programmable. Agents drive the UI via CLI flags. +- **Split view:** Left pane (tree/list with vim navigation) + right pane (details). Maps to our tree + diff layout. +- **Single-key interactions:** `j/k` navigation, `o/c/r` for filters, single key per view. Solves the "quick response" problem. +- **Two-phase computation:** Instant metrics shown immediately, heavy computation runs async. Keeps UI responsive. + +The robot mode pattern is particularly relevant: file-based for persistence, robot mode for real-time agent-to-TUI communication. + +## Recommendation + +No framework is selected yet. The pluggable UI interface (design doc Section 6.2) ensures the choice is a front-end swap, not a rewrite. Evaluation criteria in priority order: + +1. Can it render a tree with status markers and respond to single-key inputs? +2. Can it watch files for changes and auto-refresh? +3. 
How fast can we iterate during development? +4. Does it support a robot mode API pattern for agent control? + +## Sources + +- [Ink](https://github.com/vadimdemedes/ink) -- React renderer for terminals +- [Bubble Tea](https://github.com/charmbracelet/bubbletea) -- Go Elm Architecture TUI +- [Ratatui](https://ratatui.rs/) -- Rust immediate-mode TUI +- [Textual](https://textual.textualize.io/) -- Python async TUI + web +- [beads_viewer](https://github.com/Dicklesworthstone/beads_viewer) -- Graph-aware TUI for Beads +- [Lip Gloss](https://github.com/charmbracelet/lipgloss) -- Terminal styling for Bubble Tea +- [Glamour](https://github.com/charmbracelet/glamour) -- Markdown rendering for terminals