We propose a multi-agent RAG system that automates cross-spec compliance reviews for Ethereum clients. The system will:
- Parse and index Ethereum protocol specifications (execution-specs, consensus-specs)
- Index client implementations starting with Geth (go-ethereum)
- Analyze PRs and code against spec requirements
- Generate structured reports with flagged issues, security alerts, and suggested tests
This directly addresses the EF requirement to reduce manual effort in auditing client code against evolving specifications.
We use tree-sitter for AST-based code chunking — the same approach used by Cursor, Windsurf, Aider, and GitHub Copilot [12]. This ensures:
- Semantically coherent chunks: Code is split at function/class boundaries, never mid-function
- Complete context: Each chunk contains a complete, syntactically valid unit of code
- Better embeddings: +4-5 points Recall@5 improvement over naive chunking [13]
- Multi-language support: 113+ language grammars (Python, Go, Rust, etc.)
| Content Type | Chunking Strategy | Chunk Size |
|---|---|---|
| Markdown (specs) | Heading-aware splits | ~1500 chars |
| Python (spec code) | Tree-sitter AST (functions, classes) | ~1500 chars |
| Go (Geth code) | Tree-sitter AST (functions, structs, methods) | ~1500 chars |
Algorithm (based on Sweep AI [12], adopted by LlamaIndex):
For each child node in AST:
1. If current chunk too big → flush and start new chunk
2. If child node too big → recursively chunk child
3. Otherwise → merge child into current chunk
Post-process: merge single-line chunks with next chunk
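A minimal Python sketch of this recursive chunker, assuming the `py-tree-sitter` bindings (the function name `chunk_node` is ours, and inter-node whitespace handling is simplified for brevity):

```python
from tree_sitter import Node  # py-tree-sitter bindings

MAX_CHARS = 1500  # matches the CHUNK_MAX_CHARS default below


def chunk_node(node: Node, source: bytes, max_chars: int = MAX_CHARS) -> list[str]:
    """Recursively split an AST node into chunks at child-node boundaries."""
    chunks: list[str] = []
    current = ""
    for child in node.children:
        text = source[child.start_byte:child.end_byte].decode("utf-8", "replace")
        if current and len(current) + len(text) > max_chars:
            chunks.append(current)  # 1. current chunk too big -> flush
            current = ""
        if len(text) > max_chars:
            # 2. child too big -> recursively chunk the child
            chunks.extend(chunk_node(child, source, max_chars))
        else:
            current += text  # 3. otherwise -> merge child into current chunk
    if current:
        chunks.append(current)
    # Post-process: fold single-line chunks into the chunk that follows them
    merged: list[str] = []
    for chunk in chunks:
        if merged and "\n" not in merged[-1].strip():
            merged[-1] += "\n" + chunk
        else:
            merged.append(chunk)
    return merged
```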
- Continuously parse and index `ethereum/execution-specs` and `ethereum/consensus-specs`
- Extract structured metadata: EIP references, invariants, preconditions, postconditions
- Tree-sitter Python chunking for reference code (functions, classes kept intact)
- Heading-aware markdown chunking for prose specifications
- Python reference code translated to language-agnostic pseudocode
- Index `ethereum/go-ethereum` using the tree-sitter Go parser
- Extract function signatures, struct definitions, and behavior summaries
- Preserve complete functions with package imports and type context
- Map Geth functions to corresponding spec sections
- Enable queries like: "How does Geth implement EIP-6780 SELFDESTRUCT?"
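For illustration, such a query against the `geth_code` collection is a plain ChromaDB similarity search; metadata keys like `file_path` and `spec_section` are assumed names for the mapping described above:

```python
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

client = chromadb.PersistentClient(path="./index")
geth_code = client.get_or_create_collection(
    name="geth_code",
    embedding_function=SentenceTransformerEmbeddingFunction("all-MiniLM-L6-v2"),
)
hits = geth_code.query(
    query_texts=["How does Geth implement EIP-6780 SELFDESTRUCT?"],
    n_results=5,
)
for doc, meta in zip(hits["documents"][0], hits["metadatas"][0]):
    # Each hit carries the Geth file and the spec section it was mapped to.
    print(meta["file_path"], meta.get("spec_section"), doc[:80])
```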
- Classify PRs by client type (consensus vs execution)
- Generate retrieval queries with importance scores (1-10):
  - Score 8-10: Core spec compliance, security-critical
  - Score 5-7: Related functionality, edge cases
  - Score 1-4: General context
- Higher-importance queries receive more retrieval quota
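A sketch of the proportional allocation (the function name and the 30-slot budget are illustrative):

```python
def allocate_quota(queries: list[dict], total_slots: int = 30) -> dict[str, int]:
    """Give each query retrieval slots proportional to its 1-10 importance."""
    total = sum(q["importance"] for q in queries)
    return {
        q["text"]: max(1, round(total_slots * q["importance"] / total))
        for q in queries
    }

# A security-critical query (9) receives roughly triple the slots of a
# general-context query (3).
print(allocate_quota([
    {"text": "EIP-1559 base fee bounds", "importance": 9},
    {"text": "general refactoring context", "importance": 3},
]))  # {'EIP-1559 base fee bounds': 22, 'general refactoring context': 8}
```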
- Dual-Query Generation: Generate both spec-focused queries (e.g., "What does EIP-1559 specify about base fee calculation?") and client-focused queries (e.g., "How does Geth implement base fee calculation?") from the same PR
- Reciprocal Rank Fusion (RRF): Industry-standard algorithm for merging heterogeneous retrieval results [1]. Combines rankings from spec and client collections without requiring score normalization
- Quota-based ranking: Allocate retrieval slots proportionally to query importance
- Minimum distance tracking: When a snippet matches multiple queries, keep best score
- Per-query grouping: Ensure all compliance areas are represented
- Spec-vs-client balancing: Configurable ratio (default 50/50) ensures results include both spec context and client implementation details
- Cross-encoder reranking (optional): Second-stage precision boost using MS-MARCO trained models [7]
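RRF itself is only a few lines: per Cormack et al. [1], each document scores the sum of 1/(k + rank) over every ranking it appears in, with k = 60 (the `RRF_K` default below):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists from the spec and client collections
    without score normalization."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked highly in either the spec or the client list surfaces near the top.
print(rrf_fuse([
    ["spec:eip1559#base_fee", "spec:eip1559#elasticity"],
    ["geth:eip1559.go#CalcBaseFee", "spec:eip1559#base_fee"],
]))
```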
- JSON schema for CI integration
- Severity-rated findings with evidence and spec references
- Suggested tests for identified issues
- Explainable decisions with file paths and code snippets
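The exact JSON schema is refined in Phase 3; for illustration, one finding might take the following shape (all field names and paths are assumptions, shown as a Python literal):

```python
example_finding = {
    "severity": "high",  # e.g. critical | high | medium | low | info
    "title": "Base fee update diverges from the EIP-1559 formula",
    "spec_refs": ["execution-specs: london fork, calculate_base_fee_per_gas"],
    "client_refs": ["go-ethereum: consensus/misc/eip1559.go, CalcBaseFee"],
    "evidence": "<offending code snippet with file path and line numbers>",
    "suggested_tests": ["base fee when parent_gas_used equals the gas target"],
}
```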
| Source | Repository | Content |
|---|---|---|
| Execution Specs | github.com/ethereum/execution-specs | EVM, state, transactions |
| Consensus Specs | github.com/ethereum/consensus-specs | Beacon chain, validators |
| Geth Client | github.com/ethereum/go-ethereum | Reference execution client |
- Finalize scope with EF team
- Secure GPU infrastructure
- Confirm model choices and success metrics
- Robust chunkers for markdown and Python
- Batch ingestion for execution-specs and consensus-specs
- Quality validation and exports
- Clone and index `ethereum/go-ethereum`
- Go AST parser for function/struct extraction
- Create `geth_code` Chroma collection
- Map Geth functions to execution spec sections
- Query Coordinator with importance scoring
- Retrieval Orchestrator with quota-based ranking
- Spec-to-Geth cross-referencing
- Compliance heuristics
- Decision/risk schema refinement
- Suggested test generation
- Explainability fields (spec refs, Geth function refs)
- CLI for local testing
- GitHub Action for self-hosted GPU runners
- Non-GPU fallback with smaller models
- Precision/recall vs human reviewer baseline
- False positive rate analysis
- Prompt tuning based on results
- Documentation (runbooks, ops guides)
- Model swap procedures
- Maintenance plan
- Roadmap for additional clients
- ChromaDB Collections (5 total; see the setup sketch after this list)
  - `consensus_spec` - Consensus specification summaries
  - `consensus_code` - Consensus reference code
  - `execution_spec` - Execution specification summaries
  - `execution_code` - Execution reference code
  - `geth_code` - Geth client implementation
- Multi-Agent System
  - Query Coordinator with importance scoring
  - Retrieval Orchestrator with advanced ranking
  - Auditor Agent with structured JSON output
- Geth Integration
  - Go AST parser
  - Spec-to-client function mapping
  - Demo: Geth vs spec compliance check
- CI/CD Integration
  - CLI tool
  - GitHub Action (GPU and non-GPU paths)
- Documentation
  - Setup and deployment guide
  - Model/prompt configuration
  - Guide for adding additional clients
- Evaluation Report
  - Precision/recall metrics
  - False positive analysis
  - Production recommendations
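Provisioning the five collections named above is a short, idempotent step; a minimal sketch (the index path and shared embedding function follow the technology stack below):

```python
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

COLLECTIONS = ["consensus_spec", "consensus_code",
               "execution_spec", "execution_code", "geth_code"]

client = chromadb.PersistentClient(path="./index")
embed_fn = SentenceTransformerEmbeddingFunction("all-MiniLM-L6-v2")
for name in COLLECTIONS:
    # Idempotent: re-running ingestion reuses existing collections.
    client.get_or_create_collection(name=name, embedding_function=embed_fn)
```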
| Component | Technology |
|---|---|
| LLM (Heavy) | GPT-OSS 20B or equivalent via Ollama |
| LLM (Light) | Llama 3 8B |
| Embeddings | all-MiniLM-L6-v2 |
| Reranking | cross-encoder/ms-marco-MiniLM-L-6-v2 (optional) |
| Vector DB | ChromaDB |
| Code Parsing | tree-sitter (Python, Go, Rust, TypeScript) |
| Markdown Parsing | Heading-aware splitter |
| CI | GitHub Actions |
| Variable | Default | Description |
|---|---|---|
| `MODEL_GPT_OSS` | `gpt-oss:20b` | Heavy model for auditing |
| `MODEL_LIGHT` | `llama3:8b` | Lightweight model for coordination |
| `USE_CROSS_ENCODER` | `0` | Enable cross-encoder reranking |
| `RRF_K` | `60` | RRF fusion constant (standard value from Cormack et al. [1]) |
| `SPEC_TO_CLIENT_RATIO` | `0.5` | Balance of spec vs client results (0.5 = 50/50) |
| `CHUNK_MAX_CHARS` | `1500` | Maximum characters per chunk (~40 lines) |
| `DRY_RUN_SAMPLE` | `1` | Run with limited files for testing |
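These knobs can be read straight from the environment; a minimal sketch using only the variables defined above:

```python
import os

MODEL_GPT_OSS = os.getenv("MODEL_GPT_OSS", "gpt-oss:20b")
MODEL_LIGHT = os.getenv("MODEL_LIGHT", "llama3:8b")
USE_CROSS_ENCODER = os.getenv("USE_CROSS_ENCODER", "0") == "1"
RRF_K = int(os.getenv("RRF_K", "60"))
SPEC_TO_CLIENT_RATIO = float(os.getenv("SPEC_TO_CLIENT_RATIO", "0.5"))
CHUNK_MAX_CHARS = int(os.getenv("CHUNK_MAX_CHARS", "1500"))
DRY_RUN_SAMPLE = os.getenv("DRY_RUN_SAMPLE", "1") == "1"
```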
After initial delivery with Geth, the system can be extended to additional clients:
| Client | Language | Effort |
|---|---|---|
| Prysm | Go | Low (reuse Go parser) |
| Lighthouse | Rust | Medium (new parser) |
| Nethermind | C# | Medium |
| Besu / Teku | Java | Medium |
| Lodestar | TypeScript | Medium |
| Nimbus | Nim | Medium |
Our initial delivery uses Retrieval-Augmented Generation (RAG) rather than fine-tuning:
| Factor | RAG Approach | Fine-tuning Approach |
|---|---|---|
| Data requirements | Works with existing specs/code | Requires 100M+ tokens curated dataset |
| Compute cost | Inference-only ($1-10/audit) | Training ($1K-15K+ for large models) |
| Iteration speed | Update index in minutes | Re-train in days/weeks |
| Spec changes | Re-index affected files | Re-train or risk drift |
| Explainability | Citations to source docs | Black-box decisions |
| Model flexibility | Swap models freely | Locked to fine-tuned model |
Modern frontier LLMs (GPT-4, Claude, Llama 3, DeepSeek, GPT-OSS) already possess strong code understanding and reasoning capabilities. RAG leverages these abilities while providing domain-specific context at inference time.
RAG answers: "What information does the model need?"
For Ethereum code review, the model needs:
- Specification text (what the code should do)
- Client implementation (what the code actually does)
- EIP details, test cases, historical context
RAG excels here because:
- Specs change frequently — re-index in minutes vs re-train in weeks
- Traceability — every finding cites source documents
- No training data curation needed — works with existing specs/code
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM Inference Pipeline │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ CONTEXT │ + │ INPUT │ → │ MODEL │ → Output │
│ │ (prompts, │ │ (PR code, │ │ (weights) │ │
│ │ playbooks) │ │ retrieved │ │ │ │
│ │ │ │ docs) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↑ ↑ ↑ │
│ Phase 2: Phase 1: Phase 3: │
│ ACE RAG Fine-tuning │
│ (if needed) (core delivery) (if needed) │
└─────────────────────────────────────────────────────────────────────────┘
| Phase | Approach | What It Optimizes | Status |
|---|---|---|---|
| Phase 1 | RAG | Retrieved knowledge (specs, code) | Core Delivery |
| Phase 2 | ACE | Instructions/playbooks (audit strategies) | Future Enhancement |
| Phase 3 | Fine-tuning | Model weights | If RAG+ACE insufficient |
After delivering the core RAG system, we propose two potential enhancement phases based on evaluation results.
When to implement: After baseline RAG system is validated and running in production.
What is ACE? A novel approach from Stanford [15] that optimizes the instructions/playbooks rather than model weights. ACE answers: "How should the model reason about the retrieved information?"
Even with perfect retrieval, the model needs guidance on:
- What patterns indicate spec violations?
- How to prioritize security-critical vs cosmetic issues?
- What makes a good audit finding (evidence, severity, remediation)?
How ACE Works:
┌─────────────────────────────────────────────────────────────────┐
│ ACE Playbook Evolution │
│ │
│ Initial Playbook: │
│ • Generic code review heuristics │
│ • Basic EIP compliance checks │
│ │
│ After 100 PRs (Generation + Reflection + Curation): │
│ • "EIP-1559 base fee changes often miss edge case X" │
│ • "SELFDESTRUCT PRs require checking Y and Z" │
│ • "False positive pattern: don't flag A when B is present" │
│ • Accumulated exploit narratives with detection strategies │
│ │
│ Result: Playbook improves with each audit cycle │
└─────────────────────────────────────────────────────────────────┘
ACE Benefits (from Stanford research [15]):
- +10.6% improvement on agent benchmarks
- +8.6% improvement on domain-specific tasks
- Prevents "context collapse" (iterative rewriting eroding details)
- No weight updates — purely context optimization
- Works with smaller open-source models
Data Requirements for ACE:
| Data Type | Size | Source | Purpose |
|---|---|---|---|
| Initial Playbook | ~10-20 pages | Manual curation | Baseline audit heuristics |
| Audit Feedback | 50-100 PRs | Phase 1 production | Reflection signals |
| Exploit Narratives | 20-50 cases | Public CVEs, audits | Negative pattern examples |
| EIP Compliance Rules | All active EIPs | ethereum/EIPs repo | Structured checklists |
How We Build the ACE Playbook:
- Initial Playbook (Week 1-2):
  - Curate generic code review heuristics from security best practices
  - Extract EIP compliance checklists from the specifications
  - Document known vulnerability patterns (reentrancy, overflow, etc.)
- Production Feedback Collection (Ongoing):
  - Log all audit findings from the Phase 1 RAG system
  - Track human reviewer corrections (false positives/negatives)
  - Record which retrieved snippets led to correct findings
- ACE Cycle Implementation (Week 3-6):
  - Generator: Create new audit strategies from feedback patterns
  - Reflector: Evaluate strategy effectiveness on held-out PRs
  - Curator: Organize and deduplicate accumulated knowledge
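A highly simplified sketch of one such cycle, assuming an `llm()` helper that wraps the coordination model (all names and prompts are illustrative; the real procedure follows the ACE paper [15]):

```python
def ace_iteration(playbook: list[str], feedback: list[dict], llm) -> list[str]:
    """One Generation -> Reflection -> Curation pass over the audit playbook."""
    # Generator: propose candidate strategies from reviewer corrections
    candidates = llm(
        "Propose audit strategies that would have prevented these errors:\n"
        + "\n".join(f["correction"] for f in feedback)
    ).splitlines()
    # Reflector: keep only strategies that help on held-out PRs
    kept = [s for s in candidates
            if llm(f"Did strategy '{s}' improve held-out audit results? yes/no") == "yes"]
    # Curator: merge and deduplicate without eroding detail ("context collapse")
    return llm(
        "Merge these playbook entries, removing duplicates but keeping specifics:\n"
        + "\n".join(playbook + kept)
    ).splitlines()
```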
Implementation Requirements:
| Item | Estimate | Notes |
|---|---|---|
| Development time | 4-6 weeks | After Phase 1 baseline |
| Additional compute | Minimal | Context optimization only, no GPU training |
| Data needed | 50-100 audited PRs | From Phase 1 production use |
| Labor | 1 engineer | Part-time integration work |
Effort Estimate:
- Initial playbook curation: 40-60 hours (manual)
- ACE framework integration: 80-120 hours (engineering)
- Evaluation and tuning: 40-60 hours
- Total: ~160-240 hours (~4-6 weeks at 50% capacity)
Decision Criteria for Phase 2:
- Baseline RAG precision < 85%
- Recurring false positive patterns identified
- EF requests improved reasoning capabilities
- Sufficient production feedback collected (50+ PRs reviewed)
When to implement: If RAG + ACE is insufficient for audit-grade accuracy, or EF explicitly requires a specialized model.
1. Dataset Requirements
Why SFT Is Sufficient (No Continued Pre-training Needed):
- Modern LLMs already have Ethereum/blockchain knowledge from pre-training
- We're teaching audit behavior, not new domain knowledge
- RAG provides specific spec/code context at inference time
- Continued pre-training (CPT) is expensive ($6K-15K) and risks catastrophic forgetting
| Dataset Type | Size | Purpose | Required? |
|---|---|---|---|
| SFT (Supervised Fine-tuning) | ~50K-100K examples | Task-specific audit behavior | Yes |
| Continued Pre-training | ~1B+ tokens | Deep domain knowledge | No |
Dataset Composition for Audit-Grade Behavior (SFT):
- Positive signals: Correct implementations, passing tests, compliant code
- Negative signals (critical): Exploit narratives, vulnerability findings, failing tests, patch diffs
- Spec-code pairs: Matched specification text ↔ implementation code
- Audit reports: Historical security audit findings with remediation
2. Benchmark Requirements
A custom Ethereum Protocol Compliance Benchmark must be created to:
- Measure performance on spec compliance detection
- Evaluate false positive/negative rates
- Ensure no catastrophic forgetting of general capabilities
- Compare fine-tuned vs base model performance
Benchmark Categories:
| Category | Examples |
|---|---|
| EIP Compliance | Detect EIP-1559 base fee calculation errors |
| Security Vulnerabilities | Identify reentrancy, overflow, gas issues |
| Spec Drift | Flag implementations diverging from spec |
| Edge Cases | Boundary conditions, rare state transitions |
| Model | Parameters | License | Fine-tuning Feasibility |
|---|---|---|---|
| GPT-OSS-120B | 117B total (5.1B active) | Apache 2.0 | MoE; runs on single 80GB GPU; near o4-mini performance |
| DeepSeek V3 | 671B (37B active) | Open | MoE; needs ~700GB VRAM (FP8); use distilled variants for fine-tuning |
| Llama 4 | TBD (70B-400B expected) | Open | Meta's next-gen; strong code capabilities |
| Qwen3-72B | 72B | Apache 2.0 | Dense model; 2x A100 80GB for QLoRA, 4x for full fine-tune |
| Qwen3-235B-A22B | 235B (22B active) | Apache 2.0 | MoE variant; efficient inference |
| GLM-4 | 9B-130B | Open | Good code performance; various sizes |
Recommended First Choice: GPT-OSS-120B or Qwen3-72B
- GPT-OSS-120B: Best efficiency (single 80GB GPU), Apache 2.0, near-frontier reasoning
- Qwen3-72B: Strong dense model, excellent for fine-tuning, Apache 2.0
Training Infrastructure (2025 Cloud Pricing):
| Configuration | Hardware | Hourly Cost | Notes |
|---|---|---|---|
| QLoRA 70B | 1x A100 80GB | $1.35-2.00/hr | ~46GB VRAM needed |
| LoRA 70B | 2x A100 80GB | $2.70-4.00/hr | Recommended for quality |
| Full SFT 70B | 4x H100 80GB | $8-12/hr | 4x faster than A100 |
| GPT-OSS-120B (MoE) | 1x H100 80GB | $2-3/hr | Efficient MoE architecture |
Pricing sources: Hyperstack A100 $1.35/hr, H100 $2.12/hr; Lambda A100 $1.57/hr, H100 $2.99/hr
Estimated Training Costs (SFT only):
| Scenario | Hardware | Duration | Single Run | With Buffer (3x) |
|---|---|---|---|---|
| QLoRA 70B (10K samples) | 1x A100 80GB | 8-12 hours | $15-25 | $50-75 |
| LoRA 70B (50K samples) | 2x A100 80GB | 24-48 hours | $65-200 | $200-600 |
| LoRA 70B (100K samples) | 4x H100 80GB | 6-12 hours | $50-140 | $150-420 |
| Full SFT 70B (100K samples) | 8x H100 80GB | 24-48 hours | $400-1,200 | $1,200-3,600 |
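As a sanity check, the "LoRA 70B (50K samples)" row follows directly from the hourly rates quoted above:

```python
# 2x A100 80GB at $1.35-2.00/hr per GPU, running 24-48 hours:
single_run = (2 * 1.35 * 24, 2 * 2.00 * 48)     # ($64.80, $192.00) ~ $65-200
with_buffer = tuple(3 * c for c in single_run)  # ($194.40, $576.00) ~ $200-600
```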
Why 3x Buffer?
- Hyperparameter tuning (learning rate, batch size, LoRA rank)
- Dataset iteration (rebalancing positive/negative signals)
- Multiple evaluation runs
- Unexpected failures or restarts
Inference/Hosting Infrastructure:
| Model | Hardware | Monthly Cost (24/7) |
|---|---|---|
| GPT-OSS-120B | 1x H100 80GB | $1,500-2,200 |
| Qwen3-72B (INT4) | 1x A100 80GB | $1,000-1,500 |
| Qwen3-72B (FP16) | 2x A100 80GB | $2,000-3,000 |
| 70B (FP16, high traffic) | 4x A100 80GB | $4,000-6,000 |
Bare-metal rentals typically 30-50% cheaper than on-demand cloud for steady workloads.
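The monthly figures are the same hourly rates compounded over a ~730-hour month, e.g. for the single-H100 GPT-OSS-120B row:

```python
HOURS_PER_MONTH = 730
low, high = 2.12 * HOURS_PER_MONTH, 2.99 * HOURS_PER_MONTH
# ~$1,548 to ~$2,183, rounded to the $1,500-2,200 shown above
```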
┌────────────────────────────────────────────────────────────┐
│ Evaluation Phase 6 │
│ │
│ RAG + ACE Performance: │
│ ├─ Precision ≥ 85%, Recall ≥ 80% → Stay with RAG+ACE │
│ ├─ Precision 70-85% → Try ACE playbook refinement │
│ └─ Precision < 70% → Proceed to fine-tuning │
│ │
│ Fine-tuning triggers: │
│ • Consistent false negatives on security-critical issues │
│ • Inability to follow complex multi-step spec logic │
│ • EF explicitly requests specialized model │
└────────────────────────────────────────────────────────────┘
| Phase | Duration | Activities |
|---|---|---|
| Dataset Curation | 4-6 weeks | Collect specs, exploits, audits; create negative signals |
| Benchmark Creation | 2-3 weeks | Design test suite; establish baselines |
| LoRA Fine-tuning | 1-2 weeks | Initial experiments on Qwen3-72B |
| Evaluation | 1-2 weeks | Compare against RAG baseline |
| Full SFT (if needed) | 2-4 weeks | Scale up training |
| Deployment | 1-2 weeks | Set up inference infrastructure |
Total: 11-19 weeks additional (after initial RAG delivery)
| Item | Low Estimate | High Estimate | Notes |
|---|---|---|---|
| Dataset curation (labor) | $5,000 | $15,000 | 50K-100K examples, may need iteration |
| Benchmark creation (labor) | $3,000 | $10,000 | Test suite + baselines + refinement |
| Training compute (LoRA, 3x buffer) | $200 | $600 | Recommended approach |
| Training compute (Full SFT, 3x buffer) | $1,200 | $3,600 | If LoRA insufficient |
| Evaluation & debugging | $500 | $1,500 | Inference runs, analysis |
| Total Initial Investment | $9,900 | $30,700 | |
Note: Buffer accounts for hyperparameter tuning, dataset iteration, and multiple training runs. Actual costs may be lower if first attempts succeed.
Monthly Operating Costs:
| Scenario | Cost | Hardware |
|---|---|---|
| GPT-OSS-120B (recommended) | $1,500-2,200 | 1x H100 80GB |
| Qwen3-72B quantized | $1,000-1,500 | 1x A100 80GB |
| High-availability setup | $3,000-6,000 | 2-4x A100 80GB |
Note: Costs based on Dec 2025 cloud pricing. EF-provided infrastructure would reduce compute costs significantly. Bare-metal rentals offer 30-50% savings for steady workloads.
Quantum3Labs - AI/LLM solutions for blockchain ecosystems
Jomluz Tech Sdn. Bhd. - Software engineering and blockchain development
Gian Marco Alarcon - Full Stack Engineer | Blockchain Developer
- Co-founder of Quantum3Labs
- 4+ years blockchain experience (Ethereum, Starknet, ICP, Stacks)
- Creator of stacks-builder, icp-coder, scaffold-stylus
Diego Flores - AI Engineer | Blockchain Developer
- MSc Computer Science (AI specialization)
- 5+ years Python/JS, 3+ years AI R&D
- RAG architecture and prompt engineering lead
Our approach differs from existing tools like the Ethereum Code Reviewer (ECR) in several key architectural decisions:
| Aspect | ECR | Our Proposal |
|---|---|---|
| Code Chunking | Text-based splitting | AST-based tree-sitter — keeps functions intact, +4-5 points Recall@5 improvement [13] |
| Retrieval Architecture | Single embedding collection with doc lookup | Dual-collection (spec + client) with configurable balance |
| Query Strategy | Direct code → LLM analysis | Dual-query generation — spec-focused + client-focused queries from same PR |
| Ranking | Basic similarity search | Importance scoring (1-10) + quota-based allocation + RRF fusion |
| Spec-Client Mapping | Documentation context only | Direct function mapping — link Geth functions to specific spec sections |
| Architecture | Single LLM pass with multi-judge voting | Multi-agent system — specialized Query Coordinator, Retrieval Orchestrator, Auditor Agent |
| Reranking | None | Cross-encoder (MS-MARCO) for precision boost [7] |
| Distance Handling | First-match wins | Minimum distance tracking — keeps best score across queries |
- Semantic Code Understanding: Tree-sitter AST parsing ensures we never split mid-function, producing semantically coherent chunks that lead to better embeddings and more accurate retrieval. This is the same approach used by Cursor, Windsurf, Aider, and GitHub Copilot [12].
- Balanced Context Retrieval: Our dual-query + RRF fusion approach ensures audits include both "what the spec says" AND "how the client implements it" — critical for compliance checking rather than just vulnerability detection.
- Intelligent Query Prioritization: Not all compliance queries are equal. Our importance scoring (8-10 for security-critical, 1-4 for general context) allocates retrieval quota proportionally, ensuring critical areas get thorough coverage.
- Explainable Traceability: Direct spec-to-client function mapping provides clear audit trails with file paths, line numbers, and code snippets — making findings actionable for reviewers.
- Research-Backed Design: Every major design choice is backed by published research and industry practice (15 references below), not just empirical tuning.
This section details how our proposal addresses each quality benchmark from the requirements.
| Requirement | How We Address It |
|---|---|
| Effective approach | Multi-agent RAG with proven techniques: RRF fusion [1], tree-sitter chunking [12-14], cross-encoder reranking [7] |
| Research backing | 13 peer-reviewed references (SIGIR, arXiv) plus two industry sources [12][14] |
| Industry validation | Tree-sitter used by Cursor, Copilot, Aider; RRF used in production search systems |
| Requirement | How We Address It |
|---|---|
| Frequent spec changes | Continuous parsing pipeline with incremental updates (Phase 1) |
| Multiple client codebases | Modular parser design — add new clients by implementing language grammar (113+ tree-sitter grammars available) |
| Large-scale audits | Batch processing mode, configurable chunk sizes, async retrieval, structured JSON output |
| Future expansion | Section 9 details clear path for Prysm, Lighthouse, Nethermind, Besu, Lodestar, Nimbus |
| Requirement | How We Address It |
|---|---|
| Safe data handling | All processing local (no external API calls for sensitive code), ChromaDB stores embeddings only |
| Minimize false positives | Cross-encoder reranking improves precision; Phase 6 dedicated to FP rate analysis; prompt tuning based on evaluation |
| Minimize false negatives | Dual-query ensures both spec and client perspectives; per-query grouping prevents exclusion of compliance areas; importance scoring prioritizes critical checks |
| Reproducibility | Deterministic chunking via AST; configurable parameters; version-controlled prompts |
| Requirement | How We Address It |
|---|---|
| LLM expertise | Diego Flores: MSc CS (AI), 3+ years AI R&D, RAG architecture lead |
| Blockchain experience | Gian Marco Alarcon: 4+ years (Ethereum, Starknet, ICP, Stacks) |
| Proven delivery | Published tools: scaffold-stylus, icp-coder, stacks-builder |
| Collaboration | Phased timeline with EF checkpoints (Phase 0 kickoff, Phase 6 evaluation review) |
1. G. V. Cormack, C. L. Clarke, and S. Buettcher, "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods," SIGIR '09, 2009.
2. Y. Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv:2312.10997, 2024.
3. A. Singh et al., "Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG," arXiv:2501.09136, 2025.
4. Z. Z. Wang et al., "CodeRAG-Bench: Can Retrieval Augment Code Generation?," arXiv:2406.14497, 2024.
5. W. Gu et al., "What to Retrieve for Effective Retrieval-Augmented Code Generation?," arXiv:2503.20589, 2025.
6. V. Tawosi et al., "Meta-RAG on Large Codebases Using Code Summarization," arXiv:2508.02611, 2025.
7. S. Zhuang et al., "A Thorough Comparison of Cross-Encoders and LLMs for Reranking," arXiv:2403.10407, 2024.
8. X. Guo et al., "Towards Formal Verification of LLM-Generated Code," arXiv:2507.13290, 2025.
9. S. Chakraborty et al., "Combining LLM Code Generation with Formal Specifications," arXiv:2410.19736, 2024.
10. M. Lee et al., "PlanRAG: A Plan-then-Retrieval Augmented Generation," arXiv:2406.12430, 2024.
11. C. Jang et al., "Reliable Decision Making via Calibration Oriented RAG," arXiv:2411.08891, 2024.
12. Sweep AI, "Chunking 2M+ files a day for Code Search," https://docs.sweep.dev/blogs/chunking-2m-files, 2023. (Adopted by LlamaIndex, Cursor, Aider)
13. Y. Zhang et al., "cAST: Enhancing Code RAG with Structural Chunking via AST," arXiv:2506.15655, 2025. (+4.3 Recall@5, +5.6 Pass@1)
14. Tree-sitter, "An incremental parsing system for programming tools," https://tree-sitter.github.io/, 2024. (113+ language grammars)
15. S. Kambhampati et al., "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models," arXiv:2510.04618, 2025. (+10.6% agent benchmarks, +8.6% domain tasks)
