We propose a multi-agent RAG system that automates cross-spec compliance reviews for Ethereum clients. The system will:
- Parse and index Ethereum protocol specifications (execution-specs, consensus-specs)
- Index client implementations starting with Geth (go-ethereum)
- Analyze PRs and code against spec requirements
- Generate structured reports with flagged issues, security alerts, and suggested tests
This directly addresses the EF requirement to reduce manual effort in auditing client code against evolving specifications.
We use tree-sitter for AST-based code chunking — the same approach used by Cursor, Windsurf, Aider, and GitHub Copilot [12]. This ensures:
- Semantically coherent chunks: Code is split at function/class boundaries, never mid-function
- Complete context: Each chunk contains a complete, syntactically valid unit of code
- Better embeddings: +4-5 points Recall@5 improvement over naive chunking [13]
- Multi-language support: 113+ language grammars (Python, Go, Rust, etc.)
| Content Type | Chunking Strategy | Chunk Size |
|---|---|---|
| Markdown (specs) | Heading-aware splits | ~1500 chars |
| Python (spec code) | Tree-sitter AST (functions, classes) | ~1500 chars |
| Go (Geth code) | Tree-sitter AST (functions, structs, methods) | ~1500 chars |
Algorithm (based on Sweep AI [12], adopted by LlamaIndex):
For each child node in AST:
1. If current chunk too big → flush and start new chunk
2. If child node too big → recursively chunk child
3. Otherwise → merge child into current chunk
Post-process: merge single-line chunks with next chunk
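A minimal Python sketch of this recursive chunker, assuming the `py-tree-sitter` bindings (the function name `chunk_node` is ours, and inter-node whitespace handling is simplified for brevity):

```python
from tree_sitter import Node  # py-tree-sitter bindings

MAX_CHARS = 1500  # matches the CHUNK_MAX_CHARS default below


def chunk_node(node: Node, source: bytes, max_chars: int = MAX_CHARS) -> list[str]:
    """Recursively split an AST node into chunks at child-node boundaries."""
    chunks: list[str] = []
    current = ""
    for child in node.children:
        text = source[child.start_byte:child.end_byte].decode("utf-8", "replace")
        if current and len(current) + len(text) > max_chars:
            chunks.append(current)  # 1. current chunk too big -> flush
            current = ""
        if len(text) > max_chars:
            # 2. child too big -> recursively chunk the child
            chunks.extend(chunk_node(child, source, max_chars))
        else:
            current += text  # 3. otherwise -> merge child into current chunk
    if current:
        chunks.append(current)
    # Post-process: fold single-line chunks into the chunk that follows them
    merged: list[str] = []
    for chunk in chunks:
        if merged and "\n" not in merged[-1].strip():
            merged[-1] += "\n" + chunk
        else:
            merged.append(chunk)
    return merged
```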
- Continuously parse and index `ethereum/execution-specs` and `ethereum/consensus-specs`
- Extract structured metadata: EIP references, invariants, preconditions, postconditions
- Tree-sitter Python chunking for reference code (functions, classes kept intact)
- Heading-aware markdown chunking for prose specifications
- Python reference code translated to language-agnostic pseudocode
- Index `ethereum/go-ethereum` using the tree-sitter Go parser
- Extract function signatures, struct definitions, and behavior summaries
- Preserve complete functions with package imports and type context
- Map Geth functions to corresponding spec sections
- Enable queries like: "How does Geth implement EIP-6780 SELFDESTRUCT?"
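For illustration, such a query against the `geth_code` collection is a plain ChromaDB similarity search; metadata keys like `file_path` and `spec_section` are assumed names for the mapping described above:

```python
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

client = chromadb.PersistentClient(path="./index")
geth_code = client.get_or_create_collection(
    name="geth_code",
    embedding_function=SentenceTransformerEmbeddingFunction("all-MiniLM-L6-v2"),
)
hits = geth_code.query(
    query_texts=["How does Geth implement EIP-6780 SELFDESTRUCT?"],
    n_results=5,
)
for doc, meta in zip(hits["documents"][0], hits["metadatas"][0]):
    # Each hit carries the Geth file and the spec section it was mapped to.
    print(meta["file_path"], meta.get("spec_section"), doc[:80])
```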
- Classify PRs by client type (consensus vs execution)
- Generate retrieval queries with importance scores (1-10):
  - Score 8-10: Core spec compliance, security-critical
  - Score 5-7: Related functionality, edge cases
  - Score 1-4: General context
- Higher-importance queries receive more retrieval quota
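A sketch of the proportional allocation (the function name and the 30-slot budget are illustrative):

```python
def allocate_quota(queries: list[dict], total_slots: int = 30) -> dict[str, int]:
    """Give each query retrieval slots proportional to its 1-10 importance."""
    total = sum(q["importance"] for q in queries)
    return {
        q["text"]: max(1, round(total_slots * q["importance"] / total))
        for q in queries
    }

# A security-critical query (9) receives roughly triple the slots of a
# general-context query (3).
print(allocate_quota([
    {"text": "EIP-1559 base fee bounds", "importance": 9},
    {"text": "general refactoring context", "importance": 3},
]))  # {'EIP-1559 base fee bounds': 22, 'general refactoring context': 8}
```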
- Dual-Query Generation: Generate both spec-focused queries (e.g., "What does EIP-1559 specify about base fee calculation?") and client-focused queries (e.g., "How does Geth implement base fee calculation?") from the same PR
- Reciprocal Rank Fusion (RRF): Industry-standard algorithm for merging heterogeneous retrieval results [1]. Combines rankings from spec and client collections without requiring score normalization
- Quota-based ranking: Allocate retrieval slots proportionally to query importance
- Minimum distance tracking: When a snippet matches multiple queries, keep best score
- Per-query grouping: Ensure all compliance areas are represented
- Spec-vs-client balancing: Configurable ratio (default 50/50) ensures results include both spec context and client implementation details
- Cross-encoder reranking (optional): Second-stage precision boost using MS-MARCO trained models [7]
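RRF itself is only a few lines: per Cormack et al. [1], each document scores the sum of 1/(k + rank) over every ranking it appears in, with k = 60 (the `RRF_K` default below):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists from the spec and client collections
    without score normalization."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked highly in either the spec or the client list surfaces near the top.
print(rrf_fuse([
    ["spec:eip1559#base_fee", "spec:eip1559#elasticity"],
    ["geth:eip1559.go#CalcBaseFee", "spec:eip1559#base_fee"],
]))
```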
- JSON schema for CI integration
- Severity-rated findings with evidence and spec references
- Suggested tests for identified issues
- Explainable decisions with file paths and code snippets
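The exact JSON schema is refined in Phase 3; for illustration, one finding might take the following shape (all field names and paths are assumptions, shown as a Python literal):

```python
example_finding = {
    "severity": "high",  # e.g. critical | high | medium | low | info
    "title": "Base fee update diverges from the EIP-1559 formula",
    "spec_refs": ["execution-specs: london fork, calculate_base_fee_per_gas"],
    "client_refs": ["go-ethereum: consensus/misc/eip1559.go, CalcBaseFee"],
    "evidence": "<offending code snippet with file path and line numbers>",
    "suggested_tests": ["base fee when parent_gas_used equals the gas target"],
}
```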
| Source | Repository | Content |
|---|---|---|
| Execution Specs | github.com/ethereum/execution-specs | EVM, state, transactions |
| Consensus Specs | github.com/ethereum/consensus-specs | Beacon chain, validators |
| Geth Client | github.com/ethereum/go-ethereum | Reference execution client |
- Finalize scope with EF team
- Secure GPU infrastructure
- Confirm model choices and success metrics
- Robust chunkers for markdown and Python
- Batch ingestion for execution-specs and consensus-specs
- Quality validation and exports
- Clone and index `ethereum/go-ethereum`
- Go AST parser for function/struct extraction
- Create `geth_code` Chroma collection
- Map Geth functions to execution spec sections
- Query Coordinator with importance scoring
- Retrieval Orchestrator with quota-based ranking
- Spec-to-Geth cross-referencing
- Compliance heuristics
- Decision/risk schema refinement
- Suggested test generation
- Explainability fields (spec refs, Geth function refs)
- CLI for local testing
- GitHub Action for self-hosted GPU runners
- Non-GPU fallback with smaller models
- Precision/recall vs human reviewer baseline
- False positive rate analysis
- Prompt tuning based on results
- Documentation (runbooks, ops guides)
- Model swap procedures
- Maintenance plan
- Roadmap for additional clients
- ChromaDB Collections (5 total; see the setup sketch after this list)
  - `consensus_spec` - Consensus specification summaries
  - `consensus_code` - Consensus reference code
  - `execution_spec` - Execution specification summaries
  - `execution_code` - Execution reference code
  - `geth_code` - Geth client implementation
- Multi-Agent System
  - Query Coordinator with importance scoring
  - Retrieval Orchestrator with advanced ranking
  - Auditor Agent with structured JSON output
- Geth Integration
  - Go AST parser
  - Spec-to-client function mapping
  - Demo: Geth vs spec compliance check
- CI/CD Integration
  - CLI tool
  - GitHub Action (GPU and non-GPU paths)
- Documentation
  - Setup and deployment guide
  - Model/prompt configuration
  - Guide for adding additional clients
- Evaluation Report
  - Precision/recall metrics
  - False positive analysis
  - Production recommendations
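Provisioning the five collections named above is a short, idempotent step; a minimal sketch (the index path and shared embedding function follow the technology stack below):

```python
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

COLLECTIONS = ["consensus_spec", "consensus_code",
               "execution_spec", "execution_code", "geth_code"]

client = chromadb.PersistentClient(path="./index")
embed_fn = SentenceTransformerEmbeddingFunction("all-MiniLM-L6-v2")
for name in COLLECTIONS:
    # Idempotent: re-running ingestion reuses existing collections.
    client.get_or_create_collection(name=name, embedding_function=embed_fn)
```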
| Component | Technology |
|---|---|
| LLM (Heavy) | GPT-OSS 20B or equivalent via Ollama |
| LLM (Light) | Llama 3 8B |
| Embeddings | all-MiniLM-L6-v2 |
| Reranking | cross-encoder/ms-marco-MiniLM-L-6-v2 (optional) |
| Vector DB | ChromaDB |
| Code Parsing | tree-sitter (Python, Go, Rust, TypeScript) |
| Markdown Parsing | Heading-aware splitter |
| CI | GitHub Actions |
| Variable | Default | Description |
|---|---|---|
| `MODEL_GPT_OSS` | `gpt-oss:20b` | Heavy model for auditing |
| `MODEL_LIGHT` | `llama3:8b` | Lightweight model for coordination |
| `USE_CROSS_ENCODER` | `0` | Enable cross-encoder reranking |
| `RRF_K` | `60` | RRF fusion constant (standard value from Cormack et al. [1]) |
| `SPEC_TO_CLIENT_RATIO` | `0.5` | Balance of spec vs client results (0.5 = 50/50) |
| `CHUNK_MAX_CHARS` | `1500` | Maximum characters per chunk (~40 lines) |
| `DRY_RUN_SAMPLE` | `1` | Run with limited files for testing |
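These knobs can be read straight from the environment; a minimal sketch using only the variables defined above:

```python
import os

MODEL_GPT_OSS = os.getenv("MODEL_GPT_OSS", "gpt-oss:20b")
MODEL_LIGHT = os.getenv("MODEL_LIGHT", "llama3:8b")
USE_CROSS_ENCODER = os.getenv("USE_CROSS_ENCODER", "0") == "1"
RRF_K = int(os.getenv("RRF_K", "60"))
SPEC_TO_CLIENT_RATIO = float(os.getenv("SPEC_TO_CLIENT_RATIO", "0.5"))
CHUNK_MAX_CHARS = int(os.getenv("CHUNK_MAX_CHARS", "1500"))
DRY_RUN_SAMPLE = os.getenv("DRY_RUN_SAMPLE", "1") == "1"
```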
After initial delivery with Geth, the system can be extended to additional clients:
| Client | Language | Effort |
|---|---|---|
| Prysm | Go | Low (reuse Go parser) |
| Lighthouse | Rust | Medium (new parser) |
| Nethermind | C# | Medium |
| Besu / Teku | Java | Medium |
| Lodestar | TypeScript | Medium |
| Nimbus | Nim | Medium |
Our initial delivery uses Retrieval-Augmented Generation (RAG) rather than fine-tuning:
| Factor | RAG Approach | Fine-tuning Approach |
|---|---|---|
| Data requirements | Works with existing specs/code | Requires 100M+ tokens curated dataset |
| Compute cost | Inference-only ($1-10/audit) | Training ($1K-15K+ for large models) |
| Iteration speed | Update index in minutes | Re-train in days/weeks |
| Spec changes | Re-index affected files | Re-train or risk drift |
| Explainability | Citations to source docs | Black-box decisions |
| Model flexibility | Swap models freely | Locked to fine-tuned model |
Modern frontier LLMs (GPT-4, Claude, Llama 3, DeepSeek, GPT-OSS) already possess strong code understanding and reasoning capabilities. RAG leverages these abilities while providing domain-specific context at inference time.
RAG answers: "What information does the model need?"
For Ethereum code review, the model needs:
- Specification text (what the code should do)
- Client implementation (what the code actually does)
- EIP details, test cases, historical context
RAG excels here because:
- Specs change frequently — re-index in minutes vs re-train in weeks
- Traceability — every finding cites source documents
- No training data curation needed — works with existing specs/code
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM Inference Pipeline │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ CONTEXT │ + │ INPUT │ → │ MODEL │ → Output │
│ │ (prompts, │ │ (PR code, │ │ (weights) │ │
│ │ playbooks) │ │ retrieved │ │ │ │
│ │ │ │ docs) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↑ ↑ ↑ │
│ Phase 2: Phase 1: Phase 3: │
│ ACE RAG Fine-tuning │
│ (if needed) (core delivery) (if needed) │
└─────────────────────────────────────────────────────────────────────────┘
| Phase | Approach | What It Optimizes | Status |
|---|---|---|---|
| Phase 1 | RAG | Retrieved knowledge (specs, code) | Core Delivery |
| Phase 2 | ACE | Instructions/playbooks (audit strategies) | Future Enhancement |
| Phase 3 | Fine-tuning | Model weights | If RAG+ACE insufficient |
After delivering the core RAG system, we propose two potential enhancement phases based on evaluation results.
When to implement: After baseline RAG system is validated and running in production.
What is ACE? A novel approach from Stanford [15] that optimizes the instructions/playbooks rather than model weights. ACE answers: "How should the model reason about the retrieved information?"
Even with perfect retrieval, the model needs guidance on:
- What patterns indicate spec violations?
- How to prioritize security-critical vs cosmetic issues?
- What makes a good audit finding (evidence, severity, remediation)?
How ACE Works:
┌─────────────────────────────────────────────────────────────────┐
│ ACE Playbook Evolution │
│ │
│ Initial Playbook: │
│ • Generic code review heuristics │
│ • Basic EIP compliance checks │
│ │
│ After 100 PRs (Generation + Reflection + Curation): │
│ • "EIP-1559 base fee changes often miss edge case X" │
│ • "SELFDESTRUCT PRs require checking Y and Z" │
│ • "False positive pattern: don't flag A when B is present" │
│ • Accumulated exploit narratives with detection strategies │
│ │
│ Result: Playbook improves with each audit cycle │
└─────────────────────────────────────────────────────────────────┘
ACE Benefits (from Stanford research [15]):
- +10.6% improvement on agent benchmarks
- +8.6% improvement on domain-specific tasks
- Prevents "context collapse" (iterative rewriting eroding details)
- No weight updates — purely context optimization
- Works with smaller open-source models
Data Requirements for ACE:
| Data Type | Size | Source | Purpose |
|---|---|---|---|
| Initial Playbook | ~10-20 pages | Manual curation | Baseline audit heuristics |
| Audit Feedback | 50-100 PRs | Phase 1 production | Reflection signals |
| Exploit Narratives | 20-50 cases | Public CVEs, audits | Negative pattern examples |
| EIP Compliance Rules | All active EIPs | ethereum/EIPs repo | Structured checklists |
How We Build the ACE Playbook:
- Initial Playbook (Week 1-2):
  - Curate generic code review heuristics from security best practices
  - Extract EIP compliance checklists from the specifications
  - Document known vulnerability patterns (reentrancy, overflow, etc.)
- Production Feedback Collection (Ongoing):
  - Log all audit findings from the Phase 1 RAG system
  - Track human reviewer corrections (false positives/negatives)
  - Record which retrieved snippets led to correct findings
- ACE Cycle Implementation (Week 3-6):
  - Generator: Create new audit strategies from feedback patterns
  - Reflector: Evaluate strategy effectiveness on held-out PRs
  - Curator: Organize and deduplicate accumulated knowledge
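A highly simplified sketch of one such cycle, assuming an `llm()` helper that wraps the coordination model (all names and prompts are illustrative; the real procedure follows the ACE paper [15]):

```python
def ace_iteration(playbook: list[str], feedback: list[dict], llm) -> list[str]:
    """One Generation -> Reflection -> Curation pass over the audit playbook."""
    # Generator: propose candidate strategies from reviewer corrections
    candidates = llm(
        "Propose audit strategies that would have prevented these errors:\n"
        + "\n".join(f["correction"] for f in feedback)
    ).splitlines()
    # Reflector: keep only strategies that help on held-out PRs
    kept = [s for s in candidates
            if llm(f"Did strategy '{s}' improve held-out audit results? yes/no") == "yes"]
    # Curator: merge and deduplicate without eroding detail ("context collapse")
    return llm(
        "Merge these playbook entries, removing duplicates but keeping specifics:\n"
        + "\n".join(playbook + kept)
    ).splitlines()
```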
Implementation Requirements:
| Item | Estimate | Notes |
|---|---|---|
| Development time | 4-6 weeks | After Phase 1 baseline |
| Additional compute | Minimal | Context optimization only, no GPU training |
| Data needed | 50-100 audited PRs | From Phase 1 production use |
| Labor | 1 engineer | Part-time integration work |
Effort Estimate:
- Initial playbook curation: 40-60 hours (manual)
- ACE framework integration: 80-120 hours (engineering)
- Evaluation and tuning: 40-60 hours
- Total: ~160-240 hours (~4-6 weeks at 50% capacity)
Decision Criteria for Phase 2:
- Baseline RAG precision < 85%
- Recurring false positive patterns identified
- EF requests improved reasoning capabilities
- Sufficient production feedback collected (50+ PRs reviewed)
When to implement: If RAG + ACE is insufficient for audit-grade accuracy, or EF explicitly requires a specialized model.
1. Dataset Requirements
Why SFT Is Sufficient (No Continued Pre-training Needed):
- Modern LLMs already have Ethereum/blockchain knowledge from pre-training
- We're teaching audit behavior, not new domain knowledge
- RAG provides specific spec/code context at inference time
- Continued pre-training (CPT) is expensive ($6K-15K) and risks catastrophic forgetting
| Dataset Type | Size | Purpose | Required? |
|---|---|---|---|
| SFT (Supervised Fine-tuning) | ~50K-100K examples | Task-specific audit behavior | Yes |
| Continued Pre-training | ~1B+ tokens | Deep domain knowledge | No |
Dataset Composition for Audit-Grade Behavior (SFT):
- Positive signals: Correct implementations, passing tests, compliant code
- Negative signals (critical): Exploit narratives, vulnerability findings, failing tests, patch diffs
- Spec-code pairs: Matched specification text ↔ implementation code
- Audit reports: Historical security audit findings with remediation
2. Benchmark Requirements
A custom Ethereum Protocol Compliance Benchmark must be created to:
- Measure performance on spec compliance detection
- Evaluate false positive/negative rates
- Ensure no catastrophic forgetting of general capabilities
- Compare fine-tuned vs base model performance
Benchmark Categories:
| Category | Examples |
|---|---|
| EIP Compliance | Detect EIP-1559 base fee calculation errors |
| Security Vulnerabilities | Identify reentrancy, overflow, gas issues |
| Spec Drift | Flag implementations diverging from spec |
| Edge Cases | Boundary conditions, rare state transitions |
| Model | Parameters | License | Fine-tuning Feasibility |
|---|---|---|---|
| GPT-OSS-120B | 117B total (5.1B active) | Apache 2.0 | MoE; runs on single 80GB GPU; near o4-mini performance |
| DeepSeek V3 | 671B (37B active) | Open | MoE; needs ~700GB VRAM (FP8); use distilled variants for fine-tuning |
| Llama 4 | TBD (70B-400B expected) | Open | Meta's next-gen; strong code capabilities |
| Qwen3-72B | 72B | Apache 2.0 | Dense model; 2x A100 80GB for QLoRA, 4x for full fine-tune |
| Qwen3-235B-A22B | 235B (22B active) | Apache 2.0 | MoE variant; efficient inference |
| GLM-4 | 9B-130B | Open | Good code performance; various sizes |
Recommended First Choice: GPT-OSS-120B or Qwen3-72B
- GPT-OSS-120B: Best efficiency (single 80GB GPU), Apache 2.0, near-frontier reasoning
- Qwen3-72B: Strong dense model, excellent for fine-tuning, Apache 2.0
Training Infrastructure (2025 Cloud Pricing):
| Configuration | Hardware | Hourly Cost | Notes |
|---|---|---|---|
| QLoRA 70B | 1x A100 80GB | $1.35-2.00/hr | ~46GB VRAM needed |
| LoRA 70B | 2x A100 80GB | $2.70-4.00/hr | Recommended for quality |
| Full SFT 70B | 4x H100 80GB | $8-12/hr | 4x faster than A100 |
| GPT-OSS-120B (MoE) | 1x H100 80GB | $2-3/hr | Efficient MoE architecture |
Pricing sources: Hyperstack A100 $1.35/hr, H100 $2.12/hr; Lambda A100 $1.57/hr, H100 $2.99/hr
Estimated Training Costs (SFT only):
| Scenario | Hardware | Duration | Single Run | With Buffer (3x) |
|---|---|---|---|---|
| QLoRA 70B (10K samples) | 1x A100 80GB | 8-12 hours | $15-25 | $50-75 |
| LoRA 70B (50K samples) | 2x A100 80GB | 24-48 hours | $65-200 | $200-600 |
| LoRA 70B (100K samples) | 4x H100 80GB | 6-12 hours | $50-140 | $150-420 |
| Full SFT 70B (100K samples) | 8x H100 80GB | 24-48 hours | $400-1,200 | $1,200-3,600 |
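As a sanity check, the "LoRA 70B (50K samples)" row follows directly from the hourly rates quoted above:

```python
# 2x A100 80GB at $1.35-2.00/hr per GPU, running 24-48 hours:
single_run = (2 * 1.35 * 24, 2 * 2.00 * 48)     # ($64.80, $192.00) ~ $65-200
with_buffer = tuple(3 * c for c in single_run)  # ($194.40, $576.00) ~ $200-600
```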
Why 3x Buffer?
- Hyperparameter tuning (learning rate, batch size, LoRA rank)
- Dataset iteration (rebalancing positive/negative signals)
- Multiple evaluation runs
- Unexpected failures or restarts
Inference/Hosting Infrastructure:
| Model | Hardware | Monthly Cost (24/7) |
|---|---|---|
| GPT-OSS-120B | 1x H100 80GB | $1,500-2,200 |
| Qwen3-72B (INT4) | 1x A100 80GB | $1,000-1,500 |
| Qwen3-72B (FP16) | 2x A100 80GB | $2,000-3,000 |
| 70B (FP16, high traffic) | 4x A100 80GB | $4,000-6,000 |
Bare-metal rentals typically 30-50% cheaper than on-demand cloud for steady workloads.
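The monthly figures are the same hourly rates compounded over a ~730-hour month, e.g. for the single-H100 GPT-OSS-120B row:

```python
HOURS_PER_MONTH = 730
low, high = 2.12 * HOURS_PER_MONTH, 2.99 * HOURS_PER_MONTH
# ~$1,548 to ~$2,183, rounded to the $1,500-2,200 shown above
```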
┌────────────────────────────────────────────────────────────┐
│ Evaluation Phase 6 │
│ │
│ RAG + ACE Performance: │
│ ├─ Precision ≥ 85%, Recall ≥ 80% → Stay with RAG+ACE │
│ ├─ Precision 70-85% → Try ACE playbook refinement │
│ └─ Precision < 70% → Proceed to fine-tuning │
│ │
│ Fine-tuning triggers: │
│ • Consistent false negatives on security-critical issues │
│ • Inability to follow complex multi-step spec logic │
│ • EF explicitly requests specialized model │
└────────────────────────────────────────────────────────────┘
| Phase | Duration | Activities |
|---|---|---|
| Dataset Curation | 4-6 weeks | Collect specs, exploits, audits; create negative signals |
| Benchmark Creation | 2-3 weeks | Design test suite; establish baselines |
| LoRA Fine-tuning | 1-2 weeks | Initial experiments on Qwen3-72B |
| Evaluation | 1-2 weeks | Compare against RAG baseline |
| Full SFT (if needed) | 2-4 weeks | Scale up training |
| Deployment | 1-2 weeks | Set up inference infrastructure |
Total: 11-19 weeks additional (after initial RAG delivery)
| Item | Low Estimate | High Estimate | Notes |
|---|---|---|---|
| Dataset curation (labor) | $5,000 | $15,000 | 50K-100K examples, may need iteration |
| Benchmark creation (labor) | $3,000 | $10,000 | Test suite + baselines + refinement |
| Training compute (LoRA, 3x buffer) | $200 | $600 | Recommended approach |
| Training compute (Full SFT, 3x buffer) | $1,200 | $3,600 | If LoRA insufficient |
| Evaluation & debugging | $500 | $1,500 | Inference runs, analysis |
| Total Initial Investment | $9,900 | $30,700 | |
Note: Buffer accounts for hyperparameter tuning, dataset iteration, and multiple training runs. Actual costs may be lower if first attempts succeed.
Monthly Operating Costs:
| Scenario | Cost | Hardware |
|---|---|---|
| GPT-OSS-120B (recommended) | $1,500-2,200 | 1x H100 80GB |
| Qwen3-72B quantized | $1,000-1,500 | 1x A100 80GB |
| High-availability setup | $3,000-6,000 | 2-4x A100 80GB |
Note: Costs based on Dec 2025 cloud pricing. EF-provided infrastructure would reduce compute costs significantly. Bare-metal rentals offer 30-50% savings for steady workloads.
Quantum3Labs - AI/LLM solutions for blockchain ecosystems
Jomluz Tech Sdn. Bhd. - Software engineering and blockchain development
Gian Marco Alarcon - Full Stack Engineer | Blockchain Developer
- Co-founder of Quantum3Labs
- 4+ years blockchain experience (Ethereum, Starknet, ICP, Stacks)
- Creator of stacks-builder, icp-coder, scaffold-stylus
Diego Flores - AI Engineer | Blockchain Developer
- MSc Computer Science (AI specialization)
- 5+ years Python/JS, 3+ years AI R&D
- RAG architecture and prompt engineering lead
Our approach differs from existing tools like the Ethereum Code Reviewer (ECR) in several key architectural decisions:
| Aspect | ECR | Our Proposal |
|---|---|---|
| Code Chunking | Text-based splitting | AST-based tree-sitter — keeps functions intact, +4-5 points Recall@5 improvement [13] |
| Retrieval Architecture | Single embedding collection with doc lookup | Dual-collection (spec + client) with configurable balance |
| Query Strategy | Direct code → LLM analysis | Dual-query generation — spec-focused + client-focused queries from same PR |
| Ranking | Basic similarity search | Importance scoring (1-10) + quota-based allocation + RRF fusion |
| Spec-Client Mapping | Documentation context only | Direct function mapping — link Geth functions to specific spec sections |
| Architecture | Single LLM pass with multi-judge voting | Multi-agent system — specialized Query Coordinator, Retrieval Orchestrator, Auditor Agent |
| Reranking | None | Cross-encoder (MS-MARCO) for precision boost [7] |
| Distance Handling | First-match wins | Minimum distance tracking — keeps best score across queries |
- Semantic Code Understanding: Tree-sitter AST parsing ensures we never split mid-function, producing semantically coherent chunks that lead to better embeddings and more accurate retrieval. This is the same approach used by Cursor, Windsurf, Aider, and GitHub Copilot [12].
- Balanced Context Retrieval: Our dual-query + RRF fusion approach ensures audits include both "what the spec says" AND "how the client implements it" — critical for compliance checking rather than just vulnerability detection.
- Intelligent Query Prioritization: Not all compliance queries are equal. Our importance scoring (8-10 for security-critical, 1-4 for general context) allocates retrieval quota proportionally, ensuring critical areas get thorough coverage.
- Explainable Traceability: Direct spec-to-client function mapping provides clear audit trails with file paths, line numbers, and code snippets — making findings actionable for reviewers.
- Research-Backed Design: Every major design choice is backed by published research and industry practice (15 references below), not just empirical tuning.
This section details how our proposal addresses each quality benchmark from the requirements.
| Requirement | How We Address It |
|---|---|
| Effective approach | Multi-agent RAG with proven techniques: RRF fusion [1], tree-sitter chunking [12-14], cross-encoder reranking [7] |
| Research backing | 13 peer-reviewed references (SIGIR, arXiv) plus two industry sources [12][14] |
| Industry validation | Tree-sitter used by Cursor, Copilot, Aider; RRF used in production search systems |
| Requirement | How We Address It |
|---|---|
| Frequent spec changes | Continuous parsing pipeline with incremental updates (Phase 1) |
| Multiple client codebases | Modular parser design — add new clients by implementing language grammar (113+ tree-sitter grammars available) |
| Large-scale audits | Batch processing mode, configurable chunk sizes, async retrieval, structured JSON output |
| Future expansion | Section 9 details clear path for Prysm, Lighthouse, Nethermind, Besu, Lodestar, Nimbus |
| Requirement | How We Address It |
|---|---|
| Safe data handling | All processing local (no external API calls for sensitive code), ChromaDB stores embeddings only |
| Minimize false positives | Cross-encoder reranking improves precision; Phase 6 dedicated to FP rate analysis; prompt tuning based on evaluation |
| Minimize false negatives | Dual-query ensures both spec and client perspectives; per-query grouping prevents exclusion of compliance areas; importance scoring prioritizes critical checks |
| Reproducibility | Deterministic chunking via AST; configurable parameters; version-controlled prompts |
| Requirement | How We Address It |
|---|---|
| LLM expertise | Diego Flores: MSc CS (AI), 3+ years AI R&D, RAG architecture lead |
| Blockchain experience | Gian Marco Alarcon: 4+ years (Ethereum, Starknet, ICP, Stacks) |
| Proven delivery | Published tools: scaffold-stylus, icp-coder, stacks-builder |
| Collaboration | Phased timeline with EF checkpoints (Phase 0 kickoff, Phase 6 evaluation review) |
1. G. V. Cormack, C. L. Clarke, and S. Buettcher, "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods," SIGIR '09, 2009.
2. Y. Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv:2312.10997, 2024.
3. A. Singh et al., "Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG," arXiv:2501.09136, 2025.
4. Z. Z. Wang et al., "CodeRAG-Bench: Can Retrieval Augment Code Generation?," arXiv:2406.14497, 2024.
5. W. Gu et al., "What to Retrieve for Effective Retrieval-Augmented Code Generation?," arXiv:2503.20589, 2025.
6. V. Tawosi et al., "Meta-RAG on Large Codebases Using Code Summarization," arXiv:2508.02611, 2025.
7. S. Zhuang et al., "A Thorough Comparison of Cross-Encoders and LLMs for Reranking," arXiv:2403.10407, 2024.
8. X. Guo et al., "Towards Formal Verification of LLM-Generated Code," arXiv:2507.13290, 2025.
9. S. Chakraborty et al., "Combining LLM Code Generation with Formal Specifications," arXiv:2410.19736, 2024.
10. M. Lee et al., "PlanRAG: A Plan-then-Retrieval Augmented Generation," arXiv:2406.12430, 2024.
11. C. Jang et al., "Reliable Decision Making via Calibration Oriented RAG," arXiv:2411.08891, 2024.
12. Sweep AI, "Chunking 2M+ files a day for Code Search," https://docs.sweep.dev/blogs/chunking-2m-files, 2023. (Adopted by LlamaIndex, Cursor, Aider)
13. Y. Zhang et al., "cAST: Enhancing Code RAG with Structural Chunking via AST," arXiv:2506.15655, 2025. (+4.3 Recall@5, +5.6 Pass@1)
14. Tree-sitter, "An incremental parsing system for programming tools," https://tree-sitter.github.io/, 2024. (113+ language grammars)
15. S. Kambhampati et al., "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models," arXiv:2510.04618, 2025. (+10.6% agent benchmarks, +8.6% domain tasks)
