Ethereum AI Code Reviewer - Technical Proposal

1. Overview

We propose a multi-agent RAG system that automates cross-spec compliance reviews for Ethereum clients. The system will:

  1. Parse and index Ethereum protocol specifications (execution-specs, consensus-specs)
  2. Index client implementations starting with Geth (go-ethereum)
  3. Analyze PRs and code against spec requirements
  4. Generate structured reports with flagged issues, security alerts, and suggested tests

This directly addresses the EF requirement to reduce manual effort in auditing client code against evolving specifications.


2. Technical Architecture

[Architecture diagram]


3. Key Features

3.1 Tree-sitter Code Chunking

We use tree-sitter for AST-based code chunking — the same approach used by Cursor, Windsurf, Aider, and GitHub Copilot [12]. This ensures:

  • Semantically coherent chunks: Code is split at function/class boundaries, never mid-function
  • Complete context: Each chunk contains a complete, syntactically valid unit of code
  • Better embeddings: +4-5 points Recall@5 improvement over naive chunking [13]
  • Multi-language support: 113+ language grammars (Python, Go, Rust, etc.)

| Content Type | Chunking Strategy | Chunk Size |
|---|---|---|
| Markdown (specs) | Heading-aware splits | ~1500 chars |
| Python (spec code) | Tree-sitter AST (functions, classes) | ~1500 chars |
| Go (Geth code) | Tree-sitter AST (functions, structs, methods) | ~1500 chars |

Algorithm (based on Sweep AI, adopted by LlamaIndex):

For each child node in AST:
  1. If current chunk too big → flush and start new chunk
  2. If child node too big → recursively chunk child
  3. Otherwise → merge child into current chunk
Post-process: merge single-line chunks with next chunk
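
The recursion above can be made concrete with a small sketch. The following is a minimal illustration rather than the production chunker: it assumes a py-tree-sitter Node (with children, start_byte, end_byte) and the file contents as bytes, omits parser setup (the py-tree-sitter constructor API varies by version), and drops whitespace between sibling nodes.

```python
# Sketch only: Sweep-style AST chunking over a tree-sitter parse tree.
# `node` is a tree_sitter.Node; `source` is the file contents as bytes.
MAX_CHARS = 1500  # mirrors the CHUNK_MAX_CHARS default in Section 8

def chunk_node(node, source: bytes, max_chars: int = MAX_CHARS) -> list[str]:
    chunks: list[str] = []
    current = ""
    for child in node.children:
        text = source[child.start_byte:child.end_byte].decode("utf-8", "replace")
        if current and len(current) + len(text) > max_chars:
            chunks.append(current)                      # 1. current chunk too big -> flush
            current = ""
        if len(text) > max_chars:
            chunks.extend(chunk_node(child, source, max_chars))  # 2. child too big -> recurse
        else:
            current += text                             # 3. otherwise merge into current chunk
    if current:
        chunks.append(current)
    # Post-process: merge single-line chunks with the chunk that follows them.
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(merged[-1].splitlines()) <= 1:
            merged[-1] += chunk
        else:
            merged.append(chunk)
    return merged
```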

3.2 Specification Indexing

  • Continuously parse and index ethereum/execution-specs and ethereum/consensus-specs
  • Extract structured metadata: EIP references, invariants, preconditions, postconditions
  • Tree-sitter Python chunking for reference code (functions, classes kept intact)
  • Heading-aware markdown chunking for prose specifications (see the sketch after this list)
  • Python reference code translated to language-agnostic pseudocode
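
A simplified sketch of the heading-aware markdown splitting follows; it assumes the ~1500-character cap from Section 3.1, and the regex and paragraph fallback are illustrative simplifications (the real chunker would also attach heading metadata to each chunk).

```python
import re

def split_markdown(text: str, max_chars: int = 1500) -> list[str]:
    # Split at markdown headings so every chunk starts at a section boundary.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        if not section.strip():
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized sections fall back to paragraph-level packing.
        buffer = ""
        for para in section.split("\n\n"):
            if buffer and len(buffer) + len(para) > max_chars:
                chunks.append(buffer)
                buffer = ""
            buffer += para + "\n\n"
        if buffer:
            chunks.append(buffer)
    return chunks
```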

3.3 Client Code Indexing (Geth)

  • Index ethereum/go-ethereum using tree-sitter Go parser
  • Extract function signatures, struct definitions, and behavior summaries
  • Preserve complete functions with package imports and type context
  • Map Geth functions to corresponding spec sections
  • Enable queries like: "How does Geth implement EIP-6780 SELFDESTRUCT?"
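
The query style above maps naturally onto the geth_code collection listed in the deliverables. Below is a minimal sketch using the chromadb Python client; the storage path and metadata keys are assumptions, not the final implementation.

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma")   # hypothetical storage path
geth_code = client.get_collection("geth_code")        # collection from Section 6

results = geth_code.query(
    query_texts=["How does Geth implement EIP-6780 SELFDESTRUCT?"],
    n_results=5,
)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    # Metadata keys such as "file" and "function" are illustrative.
    print(meta.get("file"), meta.get("function"))
```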

3.4 Intelligent Query Planning

  • Classify PRs by client type (consensus vs execution)
  • Generate retrieval queries with importance scores (1-10)
    • Score 8-10: Core spec compliance, security-critical
    • Score 5-7: Related functionality, edge cases
    • Score 1-4: General context
  • Higher-importance queries receive more retrieval quota
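
A hypothetical sketch of quota-based allocation: retrieval slots are split in proportion to importance scores, so security-critical queries retrieve more candidates. The function name, slot budget, and example scores are illustrative only.

```python
def allocate_quota(queries: dict[str, int], total_slots: int = 30) -> dict[str, int]:
    # Each query gets slots proportional to its importance score (1-10).
    # Rounding can over-allocate by a slot or two; fine for a sketch.
    total_importance = sum(queries.values())
    return {
        query: max(1, round(total_slots * score / total_importance))
        for query, score in queries.items()
    }

# Example: a security-critical query gets roughly twice the slots of a context query.
quota = allocate_quota({
    "EIP-1559 base fee calculation rules": 9,   # core spec compliance
    "related gas accounting edge cases": 6,     # related functionality
    "general transaction pool context": 3,      # general context
})
```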

3.5 Advanced Retrieval

  • Dual-Query Generation: Generate both spec-focused queries (e.g., "What does EIP-1559 specify about base fee calculation?") and client-focused queries (e.g., "How does Geth implement base fee calculation?") from the same PR
  • Reciprocal Rank Fusion (RRF): Industry-standard algorithm for merging heterogeneous retrieval results [1]. Combines rankings from spec and client collections without requiring score normalization (see the sketch after this list)
  • Quota-based ranking: Allocate retrieval slots proportionally to query importance
  • Minimum distance tracking: When a snippet matches multiple queries, keep best score
  • Per-query grouping: Ensure all compliance areas are represented
  • Spec-vs-client balancing: Configurable ratio (default 50/50) ensures results include both spec context and client implementation details
  • Cross-encoder reranking (optional): Second-stage precision boost using MS-MARCO trained models [7]
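
The RRF merge and the optional cross-encoder stage can be sketched as follows. The k = 60 constant mirrors the RRF_K default in Section 8 and the MS-MARCO model name matches the stack in Section 7; the list shapes and function names are assumptions.

```python
from collections import defaultdict
from sentence_transformers import CrossEncoder

def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # One ranked list of document IDs per query/collection; no score normalization needed.
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query: str, candidates: list[str]) -> list[str]:
    # Optional second stage for precision: cross-encoder/ms-marco-MiniLM-L-6-v2 [7].
    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = ce.predict([(query, text) for text in candidates])
    order = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in order]
```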

3.6 Structured Audit Output

  • JSON schema for CI integration (one possible finding shape is sketched after this list)
  • Severity-rated findings with evidence and spec references
  • Suggested tests for identified issues
  • Explainable decisions with file paths and code snippets
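
One possible shape for a single finding in that JSON output is sketched below; the field names, file path, and spec reference are hypothetical placeholders rather than the final schema.

```python
finding = {
    "severity": "high",                                     # severity-rated finding
    "title": "Base fee update diverges from EIP-1559 reference logic",
    "spec_refs": ["execution-specs: EIP-1559 base fee section"],        # spec references
    "client_refs": [{"file": "core/example.go", "lines": "120-158"}],   # hypothetical path
    "evidence": "Relevant Geth snippet and the matching spec text",
    "suggested_tests": ["Boundary case: parent gas used exactly at the target"],
    "explanation": "Why the finding was flagged, with citations to both sources",
}
```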

4. Data Sources

| Source | Repository | Content |
|---|---|---|
| Execution Specs | github.com/ethereum/execution-specs | EVM, state, transactions |
| Consensus Specs | github.com/ethereum/consensus-specs | Beacon chain, validators |
| Geth Client | github.com/ethereum/go-ethereum | Reference execution client |

5. Project Timeline

Phase 0 (Week 0-1): Kickoff

  • Finalize scope with EF team
  • Secure GPU infrastructure
  • Confirm model choices and success metrics

Phase 1 (Week 2-4): Specification Ingestion

  • Robust chunkers for markdown and Python
  • Batch ingestion for execution-specs and consensus-specs
  • Quality validation and exports

Phase 2 (Week 5-7): Geth Client Indexing

  • Clone and index ethereum/go-ethereum
  • Go AST parser for function/struct extraction
  • Create geth_code Chroma collection
  • Map Geth functions to execution spec sections

Phase 3 (Week 8-10): Retrieval and Compliance

  • Query Coordinator with importance scoring
  • Retrieval Orchestrator with quota-based ranking
  • Spec-to-Geth cross-referencing
  • Compliance heuristics

Phase 4 (Week 11-12): Auditor Agent

  • Decision/risk schema refinement
  • Suggested test generation
  • Explainability fields (spec refs, Geth function refs)

Phase 5 (Week 13-14): CI Integration

  • CLI for local testing
  • GitHub Action for self-hosted GPU runners
  • Non-GPU fallback with smaller models

Phase 6 (Week 15-16): Evaluation

  • Precision/recall vs human reviewer baseline
  • False positive rate analysis
  • Prompt tuning based on results

Phase 7 (Week 17-18): Handover

  • Documentation (runbooks, ops guides)
  • Model swap procedures
  • Maintenance plan
  • Roadmap for additional clients

6. Deliverables

  1. ChromaDB Collections (5 total)

    • consensus_spec - Consensus specification summaries
    • consensus_code - Consensus reference code
    • execution_spec - Execution specification summaries
    • execution_code - Execution reference code
    • geth_code - Geth client implementation
  2. Multi-Agent System

    • Query Coordinator with importance scoring
    • Retrieval Orchestrator with advanced ranking
    • Auditor Agent with structured JSON output
  3. Geth Integration

    • Go AST parser
    • Spec-to-client function mapping
    • Demo: Geth vs spec compliance check
  4. CI/CD Integration

    • CLI tool
    • GitHub Action (GPU and non-GPU paths)
  5. Documentation

    • Setup and deployment guide
    • Model/prompt configuration
    • Guide for adding additional clients
  6. Evaluation Report

    • Precision/recall metrics
    • False positive analysis
    • Production recommendations

7. Technical Stack

| Component | Technology |
|---|---|
| LLM (Heavy) | GPT-OSS 20B or equivalent via Ollama |
| LLM (Light) | Llama 3 8B |
| Embeddings | all-MiniLM-L6-v2 |
| Reranking | cross-encoder/ms-marco-MiniLM-L-6-v2 (optional) |
| Vector DB | ChromaDB |
| Code Parsing | tree-sitter (Python, Go, Rust, TypeScript) |
| Markdown Parsing | Heading-aware splitter |
| CI | GitHub Actions |

8. Configuration

| Variable | Default | Description |
|---|---|---|
| MODEL_GPT_OSS | gpt-oss:20b | Heavy model for auditing |
| MODEL_LIGHT | llama3:8b | Lightweight model for coordination |
| USE_CROSS_ENCODER | 0 | Enable cross-encoder reranking |
| RRF_K | 60 | RRF fusion constant (standard value from Cormack et al.) |
| SPEC_TO_CLIENT_RATIO | 0.5 | Balance of spec vs client results (0.5 = 50/50) |
| CHUNK_MAX_CHARS | 1500 | Maximum characters per chunk (~40 lines) |
| DRY_RUN_SAMPLE | 1 | Run with limited files for testing |
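
For illustration, these variables could be read in Python roughly as follows; the defaults mirror the table above, while the dictionary itself is a hypothetical helper rather than the actual configuration module.

```python
import os

CONFIG = {
    "MODEL_GPT_OSS": os.getenv("MODEL_GPT_OSS", "gpt-oss:20b"),
    "MODEL_LIGHT": os.getenv("MODEL_LIGHT", "llama3:8b"),
    "USE_CROSS_ENCODER": os.getenv("USE_CROSS_ENCODER", "0") == "1",
    "RRF_K": int(os.getenv("RRF_K", "60")),
    "SPEC_TO_CLIENT_RATIO": float(os.getenv("SPEC_TO_CLIENT_RATIO", "0.5")),
    "CHUNK_MAX_CHARS": int(os.getenv("CHUNK_MAX_CHARS", "1500")),
    "DRY_RUN_SAMPLE": os.getenv("DRY_RUN_SAMPLE", "1") == "1",
}
```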

9. Future Expansion

After initial delivery with Geth, the system can be extended to additional clients:

| Client | Language | Effort |
|---|---|---|
| Prysm | Go | Low (reuse Go parser) |
| Lighthouse | Rust | Medium (new parser) |
| Nethermind | C# | Medium |
| Besu / Teku | Java | Medium |
| Lodestar | TypeScript | Medium |
| Nimbus | Nim | Medium |

10. Why RAG-First Approach

Our initial delivery uses Retrieval-Augmented Generation (RAG) rather than fine-tuning:

| Factor | RAG Approach | Fine-tuning Approach |
|---|---|---|
| Data requirements | Works with existing specs/code | Requires a curated dataset of 100M+ tokens |
| Compute cost | Inference-only ($1-10/audit) | Training ($1K-15K+ for large models) |
| Iteration speed | Update index in minutes | Re-train in days/weeks |
| Spec changes | Re-index affected files | Re-train or risk drift |
| Explainability | Citations to source docs | Black-box decisions |
| Model flexibility | Swap models freely | Locked to fine-tuned model |

Modern frontier LLMs (GPT-4, Claude, Llama 3, DeepSeek, GPT-OSS) already possess strong code understanding and reasoning capabilities. RAG leverages these abilities while providing domain-specific context at inference time.

What RAG Provides

RAG answers: "What information does the model need?"

For Ethereum code review, the model needs:

  • Specification text (what the code should do)
  • Client implementation (what the code actually does)
  • EIP details, test cases, historical context

RAG excels here because:

  • Specs change frequently — re-index in minutes vs re-train in weeks
  • Traceability — every finding cites source documents
  • No training data curation needed — works with existing specs/code

Optimization Approaches Comparison

┌─────────────────────────────────────────────────────────────────────────┐
│                        LLM Inference Pipeline                           │
│                                                                         │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐              │
│  │   CONTEXT    │ +  │    INPUT     │ →  │    MODEL     │ → Output     │
│  │  (prompts,   │    │  (PR code,   │    │  (weights)   │              │
│  │  playbooks)  │    │  retrieved   │    │              │              │
│  │              │    │  docs)       │    │              │              │
│  └──────────────┘    └──────────────┘    └──────────────┘              │
│        ↑                   ↑                    ↑                       │
│    Phase 2:            Phase 1:            Phase 3:                     │
│      ACE                 RAG              Fine-tuning                   │
│   (if needed)        (core delivery)       (if needed)                  │
└─────────────────────────────────────────────────────────────────────────┘

| Phase | Approach | What It Optimizes | Status |
|---|---|---|---|
| Phase 1 | RAG | Retrieved knowledge (specs, code) | Core Delivery |
| Phase 2 | ACE | Instructions/playbooks (audit strategies) | Future Enhancement |
| Phase 3 | Fine-tuning | Model weights | If RAG+ACE insufficient |

11. Future Enhancements

After delivering the core RAG system, we propose two potential enhancement phases based on evaluation results.

Phase 2: ACE Integration (Agentic Context Engineering)

When to implement: After baseline RAG system is validated and running in production.

What is ACE? A novel approach from Stanford [15] that optimizes the instructions/playbooks rather than model weights. ACE answers: "How should the model reason about the retrieved information?"

Even with perfect retrieval, the model needs guidance on:

  • What patterns indicate spec violations?
  • How to prioritize security-critical vs cosmetic issues?
  • What makes a good audit finding (evidence, severity, remediation)?

How ACE Works:

┌─────────────────────────────────────────────────────────────────┐
│                    ACE Playbook Evolution                        │
│                                                                  │
│  Initial Playbook:                                               │
│  • Generic code review heuristics                                │
│  • Basic EIP compliance checks                                   │
│                                                                  │
│  After 100 PRs (Generation + Reflection + Curation):             │
│  • "EIP-1559 base fee changes often miss edge case X"            │
│  • "SELFDESTRUCT PRs require checking Y and Z"                   │
│  • "False positive pattern: don't flag A when B is present"      │
│  • Accumulated exploit narratives with detection strategies      │
│                                                                  │
│  Result: Playbook improves with each audit cycle                 │
└─────────────────────────────────────────────────────────────────┘

ACE Benefits (from Stanford research [15]):

  • +10.6% improvement on agent benchmarks
  • +8.6% improvement on domain-specific tasks
  • Prevents "context collapse" (iterative rewriting eroding details)
  • No weight updates — purely context optimization
  • Works with smaller open-source models

Data Requirements for ACE:

| Data Type | Size | Source | Purpose |
|---|---|---|---|
| Initial Playbook | ~10-20 pages | Manual curation | Baseline audit heuristics |
| Audit Feedback | 50-100 PRs | Phase 1 production | Reflection signals |
| Exploit Narratives | 20-50 cases | Public CVEs, audits | Negative pattern examples |
| EIP Compliance Rules | All active EIPs | ethereum/EIPs repo | Structured checklists |

How We Build the ACE Playbook:

  1. Initial Playbook (Week 1-2):

    • Curate generic code review heuristics from security best practices
    • Extract EIP compliance checklists from specifications
    • Document known vulnerability patterns (reentrancy, overflow, etc.)
  2. Production Feedback Collection (Ongoing):

    • Log all audit findings from Phase 1 RAG system
    • Track human reviewer corrections (false positives/negatives)
    • Record which retrieved snippets led to correct findings
  3. ACE Cycle Implementation (Week 3-6):

    • Generator: Create new audit strategies from feedback patterns
    • Reflector: Evaluate strategy effectiveness on held-out PRs
    • Curator: Organize and deduplicate accumulated knowledge
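
A purely illustrative sketch of one generate-reflect-curate iteration is shown below; llm stands for any chat-completion call (for example an Ollama-hosted model), and the prompts and data shapes are assumptions, not the ACE reference implementation.

```python
def ace_cycle(llm, playbook: str, audited_prs: list[dict]) -> str:
    # Generator: propose new audit strategies from recent feedback patterns.
    proposals = llm(
        f"Given these audit outcomes:\n{audited_prs}\n"
        f"Propose additions to this playbook:\n{playbook}"
    )
    # Reflector: evaluate the proposals against held-out PRs.
    review = llm(f"Evaluate these proposed strategies on held-out examples:\n{proposals}")
    # Curator: merge accepted strategies and deduplicate without losing detail
    # (the "context collapse" failure mode the ACE paper warns about).
    return llm(
        f"Merge the accepted strategies into the playbook, preserving existing detail:\n"
        f"{playbook}\n{review}"
    )
```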

Implementation Requirements:

| Item | Estimate | Notes |
|---|---|---|
| Development time | 4-6 weeks | After Phase 1 baseline |
| Additional compute | Minimal | Context optimization only, no GPU training |
| Data needed | 50-100 audited PRs | From Phase 1 production use |
| Labor | 1 engineer | Part-time integration work |

Effort Estimate:

  • Initial playbook curation: 40-60 hours (manual)
  • ACE framework integration: 80-120 hours (engineering)
  • Evaluation and tuning: 40-60 hours
  • Total: ~160-240 hours (~4-6 weeks at 50% capacity)

Decision Criteria for Phase 2:

  • Baseline RAG precision < 85%
  • Recurring false positive patterns identified
  • EF requests improved reasoning capabilities
  • Sufficient production feedback collected (50+ PRs reviewed)

Phase 3: Fine-tuning Roadmap

When to implement: If RAG + ACE is insufficient for audit-grade accuracy, or EF explicitly requires a specialized model.

Prerequisites for Fine-tuning

1. Dataset Requirements

Why SFT is Sufficient (No CPT Needed):

  • Modern LLMs already have Ethereum/blockchain knowledge from pre-training
  • We're teaching audit behavior, not new domain knowledge
  • RAG provides specific spec/code context at inference time
  • CPT is expensive ($6K-15K) and risks catastrophic forgetting

| Dataset Type | Size | Purpose | Required? |
|---|---|---|---|
| SFT (Supervised Fine-tuning) | ~50K-100K examples | Task-specific audit behavior | Yes |
| Continued Pre-training | ~1B+ tokens | Deep domain knowledge | No |

Dataset Composition for Audit-Grade Behavior (SFT):

  • Positive signals: Correct implementations, passing tests, compliant code
  • Negative signals (critical): Exploit narratives, vulnerability findings, failing tests, patch diffs
  • Spec-code pairs: Matched specification text ↔ implementation code
  • Audit reports: Historical security audit findings with remediation

2. Benchmark Requirements

A custom Ethereum Protocol Compliance Benchmark must be created to:

  • Measure performance on spec compliance detection
  • Evaluate false positive/negative rates
  • Ensure no catastrophic forgetting of general capabilities
  • Compare fine-tuned vs base model performance

Benchmark Categories:

| Category | Examples |
|---|---|
| EIP Compliance | Detect EIP-1559 base fee calculation errors |
| Security Vulnerabilities | Identify reentrancy, overflow, gas issues |
| Spec Drift | Flag implementations diverging from spec |
| Edge Cases | Boundary conditions, rare state transitions |

Model Candidates

| Model | Parameters | License | Fine-tuning Feasibility |
|---|---|---|---|
| GPT-OSS-120B | 117B total (5.1B active) | Apache 2.0 | MoE; runs on single 80GB GPU; near o4-mini performance |
| DeepSeek V3 | 671B (37B active) | Open | MoE; needs ~700GB VRAM (FP8); use distilled variants for fine-tuning |
| Llama 4 | TBD (70B-400B expected) | Open | Meta's next-gen; strong code capabilities |
| Qwen3-72B | 72B | Apache 2.0 | Dense model; 2x A100 80GB for QLoRA, 4x for full fine-tune |
| Qwen3-235B-A22B | 235B (22B active) | Apache 2.0 | MoE variant; efficient inference |
| GLM-4 | 9B-130B | Open | Good code performance; various sizes |

Recommended First Choice: GPT-OSS-120B or Qwen3-72B

  • GPT-OSS-120B: Best efficiency (single 80GB GPU), Apache 2.0, near-frontier reasoning
  • Qwen3-72B: Strong dense model, excellent for fine-tuning, Apache 2.0

Infrastructure Requirements

Training Infrastructure (2025 Cloud Pricing):

| Configuration | Hardware | Hourly Cost | Notes |
|---|---|---|---|
| QLoRA 70B | 1x A100 80GB | $1.35-2.00/hr | ~46GB VRAM needed |
| LoRA 70B | 2x A100 80GB | $2.70-4.00/hr | Recommended for quality |
| Full SFT 70B | 4x H100 80GB | $8-12/hr | 4x faster than A100 |
| GPT-OSS-120B (MoE) | 1x H100 80GB | $2-3/hr | Efficient MoE architecture |

Pricing sources: Hyperstack A100 $1.35/hr, H100 $2.12/hr; Lambda A100 $1.57/hr, H100 $2.99/hr

Estimated Training Costs (SFT only):

| Scenario | Hardware | Duration | Single Run | With Buffer (3x) |
|---|---|---|---|---|
| QLoRA 70B (10K samples) | 1x A100 80GB | 8-12 hours | $15-25 | $50-75 |
| LoRA 70B (50K samples) | 2x A100 80GB | 24-48 hours | $65-200 | $200-600 |
| LoRA 70B (100K samples) | 4x H100 80GB | 6-12 hours | $50-140 | $150-420 |
| Full SFT 70B (100K samples) | 8x H100 80GB | 24-48 hours | $400-1,200 | $1,200-3,600 |
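
As a sanity check on the table, the LoRA 70B (50K samples) row follows directly from the quoted hourly rates; the figures below are estimates, not quotes.

```python
# 2x A100 80GB at $1.35-2.00/hr each, running 24-48 hours.
gpu_count, rate_low, rate_high = 2, 1.35, 2.00
hours_low, hours_high = 24, 48

single_run = (gpu_count * rate_low * hours_low, gpu_count * rate_high * hours_high)
with_buffer = (single_run[0] * 3, single_run[1] * 3)
print(single_run)    # (64.8, 192.0)  -> roughly the $65-200 single-run estimate
print(with_buffer)   # (194.4, 576.0) -> roughly the $200-600 buffered estimate
```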

Why 3x Buffer?

  • Hyperparameter tuning (learning rate, batch size, LoRA rank)
  • Dataset iteration (rebalancing positive/negative signals)
  • Multiple evaluation runs
  • Unexpected failures or restarts

Inference/Hosting Infrastructure:

| Model | Hardware | Monthly Cost (24/7) |
|---|---|---|
| GPT-OSS-120B | 1x H100 80GB | $1,500-2,200 |
| Qwen3-72B (INT4) | 1x A100 80GB | $1,000-1,500 |
| Qwen3-72B (FP16) | 2x A100 80GB | $2,000-3,000 |
| 70B (FP16, high traffic) | 4x A100 80GB | $4,000-6,000 |

Bare-metal rentals are typically 30-50% cheaper than on-demand cloud for steady workloads.

Fine-tuning Decision Framework

┌────────────────────────────────────────────────────────────┐
│                    Evaluation Phase 6                      │
│                                                            │
│  RAG + ACE Performance:                                    │
│  ├─ Precision ≥ 85%, Recall ≥ 80% → Stay with RAG+ACE      │
│  ├─ Precision 70-85% → Try ACE playbook refinement         │
│  └─ Precision < 70% → Proceed to fine-tuning               │
│                                                            │
│  Fine-tuning triggers:                                     │
│  • Consistent false negatives on security-critical issues  │
│  • Inability to follow complex multi-step spec logic       │
│  • EF explicitly requests specialized model                │
└────────────────────────────────────────────────────────────┘

Implementation Timeline (If Approved)

| Phase | Duration | Activities |
|---|---|---|
| Dataset Curation | 4-6 weeks | Collect specs, exploits, audits; create negative signals |
| Benchmark Creation | 2-3 weeks | Design test suite; establish baselines |
| LoRA Fine-tuning | 1-2 weeks | Initial experiments on Qwen3-72B |
| Evaluation | 1-2 weeks | Compare against RAG baseline |
| Full SFT (if needed) | 2-4 weeks | Scale up training |
| Deployment | 1-2 weeks | Set up inference infrastructure |

Total: 11-19 weeks additional (after initial RAG delivery)

Cost Summary (SFT Only, With Buffer)

| Item | Low Estimate | High Estimate | Notes |
|---|---|---|---|
| Dataset curation (labor) | $5,000 | $15,000 | 50K-100K examples, may need iteration |
| Benchmark creation (labor) | $3,000 | $10,000 | Test suite + baselines + refinement |
| Training compute (LoRA, 3x buffer) | $200 | $600 | Recommended approach |
| Training compute (Full SFT, 3x buffer) | $1,200 | $3,600 | If LoRA insufficient |
| Evaluation & debugging | $500 | $1,500 | Inference runs, analysis |
| Total Initial Investment | $9,900 | $30,700 | |

Note: Buffer accounts for hyperparameter tuning, dataset iteration, and multiple training runs. Actual costs may be lower if first attempts succeed.

Monthly Operating Costs:

| Scenario | Cost | Hardware |
|---|---|---|
| GPT-OSS-120B (recommended) | $1,500-2,200 | 1x H100 80GB |
| Qwen3-72B quantized | $1,000-1,500 | 1x A100 80GB |
| High-availability setup | $3,000-6,000 | 2-4x A100 80GB |

Note: Costs based on Dec 2025 cloud pricing. EF-provided infrastructure would reduce compute costs significantly. Bare-metal rentals offer 30-50% savings for steady workloads.


12. Team

Quantum3Labs - AI/LLM solutions for blockchain ecosystems

Jomluz Tech Sdn. Bhd. - Software engineering and blockchain development

Quantum3Labs

Gian Marco Alarcon - Full Stack Engineer | Blockchain Developer

  • Co-founder of Quantum3Labs
  • 4+ years blockchain experience (Ethereum, Starknet, ICP, Stacks)
  • Creator of stacks-builder, icp-coder, scaffold-stylus

Diego Flores - AI Engineer | Blockchain Developer

  • MSc Computer Science (AI specialization)
  • 5+ years Python/JS, 3+ years AI R&D
  • RAG architecture and prompt engineering lead

13. Comparison with Existing Solutions

Our approach differs from existing tools like the Ethereum Code Reviewer (ECR) in several key architectural decisions:

| Aspect | ECR | Our Proposal |
|---|---|---|
| Code Chunking | Text-based splitting | AST-based tree-sitter — keeps functions intact, +4-5 points Recall@5 improvement [13] |
| Retrieval Architecture | Single embedding collection with doc lookup | Dual-collection (spec + client) with configurable balance |
| Query Strategy | Direct code → LLM analysis | Dual-query generation — spec-focused + client-focused queries from same PR |
| Ranking | Basic similarity search | Importance scoring (1-10) + quota-based allocation + RRF fusion |
| Spec-Client Mapping | Documentation context only | Direct function mapping — link Geth functions to specific spec sections |
| Architecture | Single LLM pass with multi-judge voting | Multi-agent system — specialized Query Coordinator, Retrieval Orchestrator, Auditor Agent |
| Reranking | None | Cross-encoder (MS-MARCO) for precision boost [7] |
| Distance Handling | First-match wins | Minimum distance tracking — keeps best score across queries |

Key Differentiators

  1. Semantic Code Understanding: Tree-sitter AST parsing ensures we never split mid-function, producing semantically coherent chunks that lead to better embeddings and more accurate retrieval. This is the same approach used by Cursor, Windsurf, Aider, and GitHub Copilot [12].

  2. Balanced Context Retrieval: Our dual-query + RRF fusion approach ensures audits include both "what the spec says" AND "how the client implements it" — critical for compliance checking rather than just vulnerability detection.

  3. Intelligent Query Prioritization: Not all compliance queries are equal. Our importance scoring (8-10 for security-critical, 1-4 for general context) allocates retrieval quota proportionally, ensuring critical areas get thorough coverage.

  4. Explainable Traceability: Direct spec-to-client function mapping provides clear audit trails with file paths, line numbers, and code snippets — making findings actionable for reviewers.

  5. Research-Backed Design: Every major design choice is backed by published research and established industry practice (see the References in Section 15), not just empirical tuning.


14. Quality Benchmarks Compliance

This section details how our proposal addresses each quality benchmark from the requirements.

Technical Viability

| Requirement | How We Address It |
|---|---|
| Effective approach | Multi-agent RAG with proven techniques: RRF fusion [1], tree-sitter chunking [12-14], cross-encoder reranking [7] |
| Research backing | 13 academic references (SIGIR, arXiv) plus 2 industry sources |
| Industry validation | Tree-sitter used by Cursor, Copilot, Aider; RRF used in production search systems |

Scalability & Maintenance

| Requirement | How We Address It |
|---|---|
| Frequent spec changes | Continuous parsing pipeline with incremental updates (Phase 1) |
| Multiple client codebases | Modular parser design — add new clients by implementing a language grammar (113+ tree-sitter grammars available) |
| Large-scale audits | Batch processing mode, configurable chunk sizes, async retrieval, structured JSON output |
| Future expansion | Section 9 details a clear path for Prysm, Lighthouse, Nethermind, Besu, Lodestar, Nimbus |

Security & Reliability

| Requirement | How We Address It |
|---|---|
| Safe data handling | All processing local (no external API calls for sensitive code); ChromaDB stores embeddings only |
| Minimize false positives | Cross-encoder reranking improves precision; Phase 6 dedicated to FP rate analysis; prompt tuning based on evaluation |
| Minimize false negatives | Dual-query ensures both spec and client perspectives; per-query grouping prevents exclusion of compliance areas; importance scoring prioritizes critical checks |
| Reproducibility | Deterministic chunking via AST; configurable parameters; version-controlled prompts |

Team Capacity

| Requirement | How We Address It |
|---|---|
| LLM expertise | Diego Flores: MSc CS (AI), 3+ years AI R&D, RAG architecture lead |
| Blockchain experience | Gian Marco Alarcon: 4+ years (Ethereum, Starknet, ICP, Stacks) |
| Proven delivery | Published tools: scaffold-stylus, icp-coder, stacks-builder |
| Collaboration | Phased timeline with EF checkpoints (Phase 0 kickoff, Phase 6 evaluation review) |

15. References

Retrieval & Fusion

  1. G. V. Cormack, C. L. Clarke, and S. Buettcher, "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods," SIGIR '09, 2009.
  2. Y. Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv:2312.10997, 2024.
  3. A. Singh et al., "Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG," arXiv:2501.09136, 2025.

Code RAG

  4. Z. Z. Wang et al., "CodeRAG-Bench: Can Retrieval Augment Code Generation?," arXiv:2406.14497, 2024.
  5. W. Gu et al., "What to Retrieve for Effective Retrieval-Augmented Code Generation?," arXiv:2503.20589, 2025.
  6. V. Tawosi et al., "Meta-RAG on Large Codebases Using Code Summarization," arXiv:2508.02611, 2025.

Reranking

  7. S. Zhuang et al., "A Thorough Comparison of Cross-Encoders and LLMs for Reranking," arXiv:2403.10407, 2024.

Specification Compliance

  8. X. Guo et al., "Towards Formal Verification of LLM-Generated Code," arXiv:2507.13290, 2025.
  9. S. Chakraborty et al., "Combining LLM Code Generation with Formal Specifications," arXiv:2410.19736, 2024.

Decision Making

  10. M. Lee et al., "PlanRAG: A Plan-then-Retrieval Augmented Generation," arXiv:2406.12430, 2024.
  11. C. Jang et al., "Reliable Decision Making via Calibration Oriented RAG," arXiv:2411.08891, 2024.

Code Chunking

  12. Sweep AI, "Chunking 2M+ files a day for Code Search," https://docs.sweep.dev/blogs/chunking-2m-files, 2023. (Adopted by LlamaIndex, Cursor, Aider)
  13. Y. Zhang et al., "cAST: Enhancing Code RAG with Structural Chunking via AST," arXiv:2506.15655, 2025. (+4.3 Recall@5, +5.6 Pass@1)
  14. Tree-sitter, "An incremental parsing system for programming tools," https://tree-sitter.github.io/, 2024. (113+ language grammars)

Context Engineering

  15. S. Kambhampati et al., "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models," arXiv:2510.04618, 2025. (+10.6% agent benchmarks, +8.6% domain tasks)