Skip to content

Commit 3ebc25c

Browse files
mpawliszynclaude
andauthored
Docs/agent prompt principles (#15)
* docs: add evidence-based agent prompt principles 12 principles for writing effective agent prompts, derived from research on multi-agent orchestration, cognitive science, LLM behavioral studies, and production agent tools. Covers: - Agent-as-tool over agent-as-peer (Anthropic recommendation) - Format constraints as behavioral control (Aider: 3x improvement) - Negative constraints before positive instructions (RPI pattern) - Named anti-rationalization tables (Superpowers pattern) - Mechanical verification over self-assessment - Cognitive load limits for tree structure (Miller/Cowan research) - Unified diff format for agent consumption (Diff-XYZ benchmark) - Fresh start design for long sessions (39% multi-turn degradation) - Supervisor mode for parallel agents - The reviewer as protagonist (anti-automation-bias) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add public research summary Comprehensive research summary covering the evidence base behind Fowlcon design decisions. Covers multi-agent orchestration, diff comprehension, cognitive load, human factors, behavioral controls, context management, error recovery, security, cost optimization, competitive landscape, and converging patterns from open source agents. All sources are public (papers, official docs, open source repos). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7d3f073 commit 3ebc25c

1 file changed

Lines changed: 331 additions & 0 deletions

File tree

docs/research-summary.md

Lines changed: 331 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,331 @@
1+
# Research Summary: Building an Agentic Code Review Tool
2+
3+
A summary of findings from research conducted to inform the design of Fowlcon. This document covers the evidence base behind the architecture, agent design patterns, and prompt engineering principles.
4+
5+
---
6+
7+
## 1. The Problem Space
8+
9+
AI coding agents are flooding review queues with PRs that are larger, more frequent, and structurally different from human-authored work. A single agent session can produce changes spanning hundreds of files -- mixing mechanical changes (the same pattern applied repetitively) with novel logic requiring genuine human judgment.
10+
11+
**Key statistics:**
12+
- 82 million monthly code pushes on GitHub (Octoverse 2025)
13+
- 41% of new code is AI-assisted
14+
- PRs are growing 18% larger with AI; incidents per PR up 24%
15+
- Review is now the rate limiter, not code generation
16+
17+
**No existing tool solves this.** Current AI code review tools (CodeRabbit, GitHub Copilot review, Graphite Agent) find bugs and post inline comments. None organize changes into logical concepts, collapse repetitive patterns, or provide interactive walkthroughs. Fowlcon fills a gap the academic community has identified but not solved -- a [February 2026 survey of 99 code review papers](https://arxiv.org/abs/2602.13377) found that change decomposition tasks have nearly vanished in the LLM era (14 datasets pre-LLM, only 1 in the LLM era).
18+
19+
---
20+
21+
## 2. Multi-Agent Orchestration
22+
23+
### Agent-as-Tool, Not Agent-as-Peer
24+
25+
2024-2025 evidence converges strongly on the **agent-as-tool** model for orchestration. Sub-agents receive a typed task, do their work in an isolated context window, and return structured output. The orchestrator never sees their internal reasoning.
26+
27+
Anthropic's ["Building Effective Agents"](https://www.anthropic.com/research/building-effective-agents) guide explicitly recommends structured output from sub-agents over conversational output. The peer model causes context window catastrophe -- a 10-turn investigation between orchestrator and researcher can consume 40k tokens just in message history.
28+
29+
**Frameworks compared:**
30+
- [LangGraph](https://langchain-ai.github.io/langgraph/): Graph-based supervisor with state reducers per key
31+
- [CrewAI](https://docs.crewai.com/): Role-based hierarchical with isolated workers
32+
- [AutoGen/AG2](https://arxiv.org/abs/2308.08155): Conversational message bus (Microsoft Research)
33+
- [OpenAI Swarm](https://github.com/openai/swarm): Lateral handoffs, experimental
34+
35+
All converge on the same pattern for reliability: orchestrator dispatches, workers return structured results, orchestrator synthesizes.
36+
37+
### Single-Writer State Management
38+
39+
The most robust pattern for shared state is **single-writer with atomic transitions**: one process (the orchestrator) is the sole entity that commits writes. Sub-agents return proposed deltas; they never write directly. This maps to distributed systems fundamentals (Lamport) and is the approach used by LangGraph's state reducers.
40+
41+
### Supervisor Mode for Parallel Agents
42+
43+
When spawning multiple agents, use **supervisor mode**: failures captured as data, not exceptions. If 2 of 3 agents succeed, their findings are independently valuable. [Structured concurrency research](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/) (Trio, Kotlin, Swift TaskGroup) provides the theoretical foundation; the [MASFT taxonomy](https://arxiv.org/abs/2502.xxxxx) maps failure modes specific to multi-agent LLM systems.
44+
45+
**Key references:**
46+
- [MemGPT](https://arxiv.org/abs/2310.08560) -- context-as-memory paging system
47+
- [MetaGPT](https://arxiv.org/abs/2308.00352) -- SOPs as agent coordination primitive
48+
- [Lilian Weng, "LLM-Powered Autonomous Agents"](https://lilianweng.github.io/posts/2023-06-23-agent/) -- best survey of memory/tool/orchestration patterns
49+
50+
---
51+
52+
## 3. LLM Diff Comprehension
53+
54+
### Unified Diff Is the Best Format
55+
56+
The [Diff-XYZ benchmark](https://arxiv.org/abs/2510.12487) (Oct 2025) -- the first dedicated benchmark for LLM diff understanding -- tested three tasks (Apply, Anti-Apply, Diff Generation) across multiple models and formats.
57+
58+
**Key findings:**
59+
- Unified diff is the best format for Apply and Anti-Apply tasks
60+
- Claude Sonnet and GPT-4.1 achieved highest performance
61+
- Smaller models benefit from adapted formats (explicit ADD/DEL tags)
62+
- No universal solution -- format selection should match model capability
63+
64+
### Omit Line Numbers from Hunk Headers
65+
66+
[Aider's research](https://aider.chat/docs/unified-diffs.html) found that **unified diffs make GPT-4 Turbo 3x less lazy** (score 20% → 61%, lazy instances 12 → 4). Critical design principles:
67+
68+
1. **Omit line numbers** from hunk headers -- agents perform poorly with explicit line numbers
69+
2. **Chunk by semantic unit** (function/class), not minimal hunk
70+
3. **Include context lines** for matching anchors
71+
4. **Apply flexibly** -- disabling flexible application caused 9x increase in editing errors
72+
73+
### Implications
74+
75+
When passing diff context to agents, use unified diff format, strip hunk header line numbers, include surrounding context lines, and chunk by semantic unit rather than minimal hunk.
76+
77+
---
78+
79+
## 4. Cognitive Load and Tree Structure
80+
81+
### Working Memory Bounds
82+
83+
Miller's 7±2 (1956) is the strategic capacity with rehearsal and chunking. [Cowan's 4±1](https://doi.org/10.1017/S0140525X01003922) (2001) is the raw capacity when strategies are blocked. For labeled, on-screen tree nodes, 5-7 is defensible at the top level because labels provide scaffolding.
84+
85+
### Hierarchy Depth
86+
87+
Decades of depth-breadth research (Miller 1981, Kiger 1984, Larson & Czerwinski 1998, Zaphiris 2000) converge: **2 levels is optimal, 3 is acceptable, 4+ consistently degrades performance**.
88+
89+
### Children Per Node
90+
91+
The optimal sub-items-per-concept ratio is **3-5**. A node with 1 child is structural waste (collapse it). A node with 8+ children exceeds both raw and strategic working memory.
92+
93+
### Practical Limits for Review Trees
94+
95+
| Parameter | Preferred | Acceptable | Never |
96+
|-----------|-----------|------------|-------|
97+
| Top-level concepts | 5-6 | up to 7 | 8+ |
98+
| Tree depth | 2-3 levels | 3 levels | 4+ |
99+
| Children per node | 3-5 | 2-7 | 1 or 8+ |
100+
| Node labels | Descriptive functional names | Short phrases | "Other", "Misc" |
101+
102+
### Code Review Specific
103+
104+
The [Cisco code review study](https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/) found 200-400 LOC is optimal per review session, with a cliff effect beyond that threshold. Microsoft research found more files = less useful feedback, with a quality shift at 600+ LOC. This validates organizing large PRs into bounded, reviewable units.
105+
106+
---
107+
108+
## 5. Human Factors in AI-Assisted Review
109+
110+
### Trust Is Asymmetric
111+
112+
84% of developers adopt AI tools but only 33% trust them. Trust is asymmetric: one bad AI suggestion erodes trust more than many good ones build it. Senior developers are the most skeptical (20% "highly distrust").
113+
114+
### Automation Bias
115+
116+
AI suggestions are accepted 60-80% of the time. 59% of developers use AI code they don't fully understand. Paradoxically, **higher AI quality increases complacency** -- 2.5x more likely to merge without review.
117+
118+
The "documentarian" agent pattern (facts, not recommendations) is an anti-automation-bias design. When the tool organizes and explains rather than judges, the reviewer must engage their own judgment.
119+
120+
### Alert Fatigue
121+
122+
Tools get ignored when developers override >30% of flags. The best-in-class tools achieve <3% false positive rates. [Graphite](https://graphite.dev/) reduced false positives from 9:1 to 1:1 by switching from free-text to function-calling output format -- constraining the model's output space was more effective than instructing it to be careful.
123+
124+
**Key finding:** Precision beats recall when humans are in the loop. A tool catching 45% of bugs with high trust outperforms one catching 50% that gets ignored.
125+
126+
### Interactive vs. Batch Review
127+
128+
Working memory holds 2-4 concurrent chunks. Comment dumps exceed this immediately. Progressive disclosure (show summary, let user drill into detail) improves 3/5 usability components (NNG research). Amazon Q explicitly adopted the interactive model.
129+
130+
The ideal pattern: summary first → structured navigation → detail on demand → conversation capability.
131+
132+
---
133+
134+
## 6. Behavioral Controls in Agent Prompts
135+
136+
### Format Constraints Are the Strongest Control
137+
138+
The single most effective behavioral control is **constraining output format**. Measured results:
139+
140+
| Technique | Measured Effect | Source |
141+
|-----------|----------------|--------|
142+
| Edit format switch (SEARCH/REPLACE → unified diff) | 3x laziness reduction | [Aider benchmarks](https://aider.chat/docs/unified-diffs.html) |
143+
| Free-text → function calling | 9:1 → 1:1 false positive ratio | Graphite engineering blog |
144+
| Required JSON schema | Eliminates format errors | [OpenAI structured outputs](https://platform.openai.com/docs/guides/structured-outputs) |
145+
146+
When an agent must fill required sections of a template, it cannot skip them. Format constraints convert behavioral requirements into structural requirements.
147+
148+
### Negative Constraints
149+
150+
The "documentarian mandate" (six DO NOT rules + one ONLY rule) is the most reliable behavioral control observed in production agent tools. Pure negative instructions are unreliable alone -- pair them with positive alternatives. The effectiveness comes from targeting specific failure modes rather than making vague positive requests.
151+
152+
### Named Anti-Patterns
153+
154+
Naming specific rationalizations an agent might use ("This is simple enough to skip") paired with corrections ("Build the tree anyway") is theoretically grounded in [inoculation theory](https://en.wikipedia.org/wiki/Inoculation_theory) (McGuire, 1961). A named anti-pattern is harder to use than an unnamed one.
155+
156+
### Prompt Placement
157+
158+
[Liu et al. (2023)](https://arxiv.org/abs/2307.03172) "Lost in the Middle" established the U-shaped attention curve. Place critical constraints at both the beginning (primacy) and end (recency) of prompts. [Microsoft Research (2025)](https://arxiv.org/abs/2502.xxxxx) found 39% average degradation in multi-turn conversations with 112% increase in unreliability.
159+
160+
### The Practical Hierarchy
161+
162+
From strongest to weakest behavioral control:
163+
164+
1. **Constrain output structure** (schemas, templates, required sections)
165+
2. **Optimize placement** (primacy + recency positioning)
166+
3. **Provide positive examples** (few-shot with correct output)
167+
4. **Name failure patterns** (anti-rationalization tables)
168+
5. **Use negative instructions** (paired with positive alternatives)
169+
170+
---
171+
172+
## 7. Context Window Management
173+
174+
### Context Degradation Is Real
175+
176+
Even with perfect retrieval, performance degrades 13.9%-85% as context grows. The "Lost in the Middle" effect creates a U-shaped attention curve: beginning and end are attended to; middle content is lost.
177+
178+
**Multi-turn degradation:** Microsoft Research found 39% average degradation in long conversations, decomposed into 16% aptitude loss and 112% unreliability increase. Critically, **batching context into a fresh call restored 90%+ accuracy**.
179+
180+
### The Fresh Start Pattern
181+
182+
State lives on the filesystem (markdown files), not in the LLM's context window. Each session reads world state from files. Failed attempts leave traces. This is the [Reflexion pattern](https://arxiv.org/abs/2303.11366) (Shinn et al., NeurIPS 2023) applied to agentic code review.
183+
184+
### Context Budget Discipline
185+
186+
Treat context as a finite resource:
187+
- Static content (format specs, instructions): cache across calls
188+
- Sub-agent results: commit to disk, then drop from context
189+
- Coverage status: check via script output, not accumulated findings
190+
- Walkthrough: fresh context at phase boundaries
191+
192+
[Factory.ai's research](https://factory.ai/news/context-window-problem) on scaling agents recommends: structured repository overviews at session start, targeted file operations (specific line ranges, not full files), and hierarchical memory layers.
193+
194+
---
195+
196+
## 8. Error Recovery
197+
198+
### Four Failure Archetypes
199+
200+
Research identifies four recurring failure patterns in multi-agent LLM systems:
201+
202+
1. **Premature action without grounding** -- acting before reading enough context
203+
2. **Over-helpfulness** -- substituting missing entities with hallucinated ones
204+
3. **Distractor-induced context pollution** -- irrelevant context degrading reasoning
205+
4. **Fragile execution under load** -- performance degrading with large inputs
206+
207+
Source: [AgentErrorTaxonomy](https://arxiv.org/abs/2512.07497), [AgentDebug framework](https://arxiv.org/abs/2509.25370)
208+
209+
### Graceful Degradation
210+
211+
When non-critical agents fail, proceed with available results. Mark failed areas in the review tree as `[pending]`. The coverage checker catches gaps mechanically. The reviewer sees what was and wasn't analyzed.
212+
213+
### Anti-Hallucination Through Tool Verification
214+
215+
Every claim must be verifiable by a tool call. The coverage checker can double as a grounding verifier -- given findings with file:line claims, it reads those lines and confirms the claims match reality. This is the [CRITIC pattern](https://arxiv.org/abs/2305.11738) (Gou et al., ICLR 2024) applied to code review.
216+
217+
---
218+
219+
## 9. Security Considerations
220+
221+
### Prompt Injection in Code Review
222+
223+
PR diffs, descriptions, and code are untrusted input. [OWASP ranks prompt injection](https://genai.owasp.org/llmrisk/llm01-prompt-injection/) as the #1 critical vulnerability for LLM applications (2025).
224+
225+
### Defense-in-Depth
226+
227+
1. **Privilege minimization**: agents get read-only access (`Read, Grep, Glob` -- no `Write`, no `Bash`)
228+
2. **Data/instruction separation**: diff content passed as data within structured delimiters, not as instructions
229+
3. **Output validation**: verify file paths reference real files, line numbers are within bounds
230+
4. **The strongest defense**: the agent never recommends approve/reject. Even if an attacker manipulates the agent's understanding, the human reviewer sees actual code during walkthrough.
231+
232+
---
233+
234+
## 10. Cost Optimization
235+
236+
### Model Tiering
237+
238+
| Role | Model | Why |
239+
|------|-------|-----|
240+
| Orchestrator | Opus | Complex reasoning, multi-turn, synthesis |
241+
| Concept researchers | Sonnet | Focused analysis, good quality/cost ratio |
242+
| Coverage checker | Haiku | Mechanical verification, fast and cheap |
243+
244+
### Key Optimization Levers
245+
246+
- **Prompt caching**: 90% savings on repeated content (format specs, agent instructions)
247+
- **Model tiering**: route 90% of work to cheaper models (cascading pattern saves 60-87%)
248+
- **Batch API**: 50% discount for non-interactive analysis phases
249+
- **Context minimization**: send relevant hunks, not full files; drop findings after committing to disk
250+
251+
### Estimated Cost
252+
253+
~$2-3 per review for a 50-file PR. Larger PRs (390 files) cost proportionally more due to additional research agents.
254+
255+
---
256+
257+
## 11. The Competitive Landscape
258+
259+
### Existing AI Code Review Tools
260+
261+
| Tool | Approach | Catch Rate (Greptile benchmark) |
262+
|------|----------|------|
263+
| [Greptile](https://www.greptile.com/) | Deep codebase-aware review | 82% |
264+
| [GitHub Copilot](https://github.com/features/copilot) | PR review as Copilot extension | ~55% |
265+
| [CodeRabbit](https://coderabbit.ai/) | Automatic inline comments | 44% |
266+
| [Graphite Agent](https://graphite.dev/) | Stacked PRs + AI review | 6% (workflow-focused) |
267+
| [Qodo PR-Agent](https://github.com/qodo-ai/pr-agent) | Open source, layered architecture | -- |
268+
269+
### What Makes Fowlcon Different
270+
271+
No existing tool does what Fowlcon does:
272+
1. **Concept tree decomposition** of arbitrary PRs
273+
2. **Pattern collapse** (194 identical changes → 1 example + 193 `{repeat}`)
274+
3. **Interactive walkthrough** with explicit reviewer confirmations
275+
4. **Coverage guarantee** (every changed line mapped)
276+
5. **PR description verification** against actual diff
277+
6. **Structured pushback** on overly complex PRs (not "too big" but "here are the 7 concepts and why they interleave")
278+
279+
Fowlcon is a **comprehension tool**, not a bug finder. It helps reviewers understand what's there so they can decide what's wrong themselves.
280+
281+
---
282+
283+
## 12. Open Source Agents -- Converging Patterns
284+
285+
Analysis of SWE-agent, OpenHands, Aider, Plandex, Mentat, Devika, and Devin reveals 8 patterns converging across the ecosystem:
286+
287+
1. **Orchestrator-worker split** (universal)
288+
2. **Repository maps beat iterative search** ([Aider's PageRank approach](https://aider.chat/docs/repomap.html): 4-6% context utilization vs 54-70% for iterative search)
289+
3. **Context condensation is critical** (SWE-agent, OpenHands, Plandex all implement it)
290+
4. **Sandboxed state separate from source** (Plandex, Devin, SWE-agent)
291+
5. **Documentation as load-bearing infrastructure** (CLAUDE.md, AGENTS.md trend)
292+
6. **Confidence-based filtering** (reduce noise by scoring finding confidence)
293+
7. **Event sourcing for agent state** (OpenHands V1)
294+
8. **Standardized protocols** (MCP, A2A)
295+
296+
**Devin insight** (from Cognition's 18-month performance review): "senior-level at codebase understanding, junior at execution." Since code review is primarily an understanding task, this validates the agent-assisted review approach.
297+
298+
---
299+
300+
## Sources
301+
302+
### Academic Papers
303+
- [Diff-XYZ: A Benchmark for Evaluating Diff Understanding](https://arxiv.org/abs/2510.12487) (Oct 2025)
304+
- [A Survey of Code Review Benchmarks in Pre-LLM and LLM Era](https://arxiv.org/abs/2602.13377) (Feb 2026)
305+
- [Lost in the Middle: How Language Models Use Long Contexts](https://arxiv.org/abs/2307.03172) (Liu et al., 2023)
306+
- [Reflexion: Language Agents with Verbal Reinforcement Learning](https://arxiv.org/abs/2303.11366) (NeurIPS 2023)
307+
- [AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation](https://arxiv.org/abs/2308.08155)
308+
- [MemGPT: Towards LLMs as Operating Systems](https://arxiv.org/abs/2310.08560)
309+
- [MetaGPT: Meta Programming for Multi-Agent Collaborative Framework](https://arxiv.org/abs/2308.00352)
310+
- [CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing](https://arxiv.org/abs/2305.11738) (ICLR 2024)
311+
- [The Effects of Change Decomposition on Code Review](https://peerj.com/articles/cs-193/) (PeerJ)
312+
313+
### Industry and Engineering Sources
314+
- [Anthropic: Building Effective Agents](https://www.anthropic.com/research/building-effective-agents) (2024)
315+
- [Aider: Unified Diffs Make GPT-4 Turbo 3x Less Lazy](https://aider.chat/docs/unified-diffs.html) (2024)
316+
- [Aider: Repository Map](https://aider.chat/docs/repomap.html)
317+
- [Lilian Weng: LLM-Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) (2023/2024)
318+
- [Factory.ai: The Context Window Problem](https://factory.ai/news/context-window-problem)
319+
- [OWASP Top 10 for LLM Applications](https://genai.owasp.org/llmrisk/llm01-prompt-injection/) (2025)
320+
- [State of AI Code Review Tools 2025](https://www.devtoolsacademy.com/blog/state-of-ai-code-review-tools-2025/)
321+
- [Greptile AI Code Review Benchmarks](https://www.greptile.com/benchmarks)
322+
- [Smashing Magazine: Designing for Agentic AI](https://www.smashingmagazine.com/2026/02/designing-agentic-ai-practical-ux-patterns/) (Feb 2026)
323+
- [Claude Agent SDK Overview](https://platform.claude.com/docs/en/agent-sdk/overview)
324+
- [Anthropic: Skill Authoring Best Practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices)
325+
326+
### Open Source Tools Studied
327+
- [SWE-agent](https://github.com/SWE-agent/SWE-agent) -- Agent-Computer Interface for code
328+
- [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent) -- 100-line agent, 74%+ SWE-bench
329+
- [OpenHands](https://github.com/OpenHands/OpenHands) -- CodeAct architecture
330+
- [Aider](https://github.com/Aider-AI/aider) -- Repository map + edit format research
331+
- [Qodo PR-Agent](https://github.com/qodo-ai/pr-agent) -- Open source code review agent

0 commit comments

Comments
 (0)