Minimal runtime for Recursive Language Models (RLMs) inspired by the MIT CSAIL paper "Recursive Language Models".
Standard LLM approaches fail when context exceeds the model's window size:
- Truncation: Important information gets cut off
- RAG: Requires complex retrieval infrastructure and may miss relevant context
- Long-context models: Expensive and still have hard limits
RLMs treat the long context as environment state instead of direct input:
- Context lives in a Python REPL as a variable
- The LLM only sees metadata + REPL outputs (not the full context)
- The LLM writes code to inspect, search, and chunk the context
- The LLM can make recursive subcalls to sub-LLMs on small snippets
- Result: Handle arbitrarily large contexts with constant token usage per step
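For intuition, the snippet below shows the kind of code the root model might write inside the REPL. It is purely illustrative: the `context` variable and the `llm_query` helper are hypothetical names used for this sketch, not a documented part of this package's API.

# Illustrative only: the kind of code the root model might emit in the REPL.
# `context` (a list of document strings) and `llm_query` (a recursive sub-LLM
# call) are hypothetical names for this sketch, not guaranteed rlm-runtime API.
print(len(context), "documents loaded")          # inspect metadata, not content
print(context[0][:200])                          # peek at the start of one doc

# Cheap deterministic filtering before spending any tokens
hits = [i for i, doc in enumerate(context) if "key term" in doc.lower()]
print("candidate documents:", hits)

# Recursive subcall on a small snippet only
for i in hits[:3]:
    snippet = context[i][:2000]
    print(llm_query(f"What is the key term defined here?\n\n{snippet}"))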
# Install
uv pip install -e .
# Set your API key
export LLM_API_KEY="your-api-key-here"
# Run a simple example
uv run python examples/minimal.py
Basic usage:
from rlm_runtime import RLM, Context
from rlm_runtime.adapters import OpenAICompatAdapter
# Create context from your long documents
documents = [
"Document 1: Very long content...",
"Document 2: More content...",
# ... could be 100s of documents, millions of tokens
]
context = Context.from_documents(documents)
# Initialize RLM with OpenAI-compatible adapter
rlm = RLM(adapter=OpenAICompatAdapter())
# Ask questions over the entire context
query = "What are the main themes across all documents?"
answer, trace = rlm.run(query, context)
print(answer)
Works with: OpenAI, Anthropic Claude, local Llama/Ollama servers, or any OpenAI-compatible endpoint.
The rlm_vs_baseline.py example demonstrates the core advantage of RLMs: maintaining accuracy as context grows beyond the LLM's window, while a naive baseline fails due to truncation.
# Quick demo (5 and 30 documents)
RLM_CONTEXT_SIZES=5,30 uv run python examples/rlm_vs_baseline.py
# Full benchmark showing crossover point (5, 20, 50, 120 documents)
RLM_CONTEXT_SIZES=5,20,50,120 uv run python examples/rlm_vs_baseline.py
# Show detailed RLM execution trajectory
SHOW_TRAJECTORY=1 RLM_CONTEXT_SIZES=5,30 uv run python examples/rlm_vs_baseline.py
This benchmark implements a needle-in-haystack task (similar to the MIT paper's S-NIAH):
- The context contains N documents, with one containing a hidden key term
- The query asks: "What is the key term?"
- Baseline approach: Sends entire context directly to LLM (truncates if too large)
- RLM approach: Context lives in REPL, LLM writes code to search and make subcalls
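To make the setup concrete, here is a rough sketch of how such a synthetic haystack can be generated; the function name, filler text, and needle value are illustrative and not the exact code used in examples/rlm_vs_baseline.py.

import random

# Rough sketch of an S-NIAH-style haystack; not the exact code in
# examples/rlm_vs_baseline.py. Names and sizes are illustrative.
def make_haystack(n_docs: int, needle: str = "AZURE-FALCON-42") -> list[str]:
    filler = "This distractor paragraph discusses unrelated project details. " * 40
    docs = [f"Document {i}: {filler}" for i in range(n_docs)]
    docs[random.randrange(n_docs)] += f" The key term is {needle}."
    return docs

documents = make_haystack(50)
# Baseline: paste all of `documents` into one prompt (fails once truncated).
# RLM: load `documents` into a Context and let the model search it in the REPL.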
The MIT paper demonstrates that RLMs maintain near-perfect accuracy as context grows, while baseline approaches degrade:
Figure 1 of the paper shows RLM accuracy remaining high as distractor documents increase, while baseline accuracy drops due to truncation. This implementation reproduces that behavior.
Our benchmark visualizes this crossover point where RLM starts outperforming baseline:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CROSSOVER ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Plot 1: Success Rate vs Context Size
────────────────────────────────────
5 docs │ B (baseline OK)
20 docs │ B (baseline OK)
50 docs │ b R (baseline FAIL, RLM OK) ← CROSSOVER POINT
120 docs │ b R (baseline FAIL, RLM OK)
Legend: B=baseline success, b=baseline fail, R=RLM success, r=RLM fail
Plot 2: Token Usage Comparison
───────────────────────────────
5 docs │ baseline: ████░░░░░░ (8.8K) 🏆
│ rlm: ████████░░ (17.3K)
20 docs │ baseline: ████████░░ (18.5K) 🏆
│ rlm: ████████░░ (18.0K)
50 docs │ baseline: FAIL (truncated)
│ rlm: █████████░ (20.9K) 🏆
120 docs │ baseline: FAIL (truncated)
│ rlm: ██████████ (23.5K) 🏆
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RESULTS SUMMARY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Detailed Comparison:
┌─────────┬──────────┬────────┬───────┬────────┬────────────┬─────────┐
│ Docs │ Tokens │ Time │ OK? │ Answer │ Method │ Winner │
├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤
│ 5 │ 8,831 │ 1.2s │ ✓ │ ✓ │ baseline │ 🏆 base │
│ │ 17,298 │ 2.8s │ ✓ │ ✓ │ rlm │ │
├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤
│ 20 │ 18,454 │ 2.1s │ ✓ │ ✓ │ baseline │ 🏆 base │
│ │ 18,039 │ 3.1s │ ✓ │ ✓ │ rlm │ │
├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤
│ 50 │ TRUNCATED - Answer lost in truncation │
│ │ 20,866 │ 3.8s │ ✓ │ ✓ │ rlm │ 🏆 rlm │
├─────────┼──────────┼────────┼───────┼────────┼────────────┼─────────┤
│ 120 │ TRUNCATED - Answer lost in truncation │
│ │ 23,489 │ 4.5s │ ✓ │ ✓ │ rlm │ 🏆 rlm │
└─────────┴──────────┴────────┴───────┴────────┴────────────┴─────────┘
Summary Statistics:
• Baseline wins: 2 (at small context sizes)
• RLM wins: 2 (at large context sizes where baseline truncates)
• Crossover point: ~50 documents (baseline starts truncating)
RLM Efficiency Metrics:
• Avg subcalls per task: 1-3 (uses Phase 0 deterministic extraction first)
• Cache hit rate: 60-80% (reuses subcall results)
• Token overhead: 2-3x at small contexts, but maintains correctness at large contexts
When to use RLMs:
- Small contexts (5-20 docs): Baseline is more efficient (fewer tokens, faster)
  - If your context always fits in the LLM window, stick with baseline
- Large contexts (50+ docs): RLM wins decisively when baseline truncates
  - RLM maintains 100% accuracy while baseline fails completely
  - Uses only 20-25K tokens regardless of context size (constant overhead)
How RLMs achieve this:
- Phase 0 optimization: Try deterministic extraction first (extract_after) - 0 tokens, instant
- Targeted subcalls: Only query sub-LLMs on relevant chunks when needed
- Caching: Reuses subcall results (60-80% cache hit rate)
- Smart chunking: Processes large documents in manageable pieces
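The sketch below illustrates the first two ideas plus the caching; the function signatures are assumptions made for illustration and may differ from the actual helpers in rlm_runtime.

# Illustrative sketches of the ideas above; actual rlm_runtime helpers and
# signatures may differ.
def extract_after(text: str, marker: str, length: int = 80) -> str | None:
    """Phase 0: pure string search, zero LLM tokens."""
    idx = text.find(marker)
    if idx == -1:
        return None
    start = idx + len(marker)
    return text[start:start + length]

_subcall_cache: dict[tuple[str, str], str] = {}

def cached_subcall(llm, prompt: str, snippet: str) -> str:
    """Targeted subcall with caching: an identical (prompt, snippet) pair
    reuses the earlier answer instead of paying for a second LLM call."""
    key = (prompt, snippet)
    if key not in _subcall_cache:
        # `complete` is an assumed adapter method name for this sketch
        _subcall_cache[key] = llm.complete(f"{prompt}\n\n{snippet}")
    return _subcall_cache[key]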
The crossover point: Around 50 documents (~100K+ characters), where the context exceeds the LLM's effective window and baseline accuracy drops to 0%.
This reproduces the key finding from Figure 1 of the MIT paper: RLMs maintain performance as context grows, while baseline approaches degrade.
The MIT paper evaluated RLMs on several categories of long-context tasks:
- Deep Research & Multi-hop QA (BrowseComp-Plus)
- Answering complex questions requiring reasoning across 100s-1000s of documents
- Finding evidence scattered across multiple sources
- Synthesizing information from diverse materials
- Code Repository Understanding (CodeQA)
- Analyzing large codebases (900K+ tokens)
- Finding specific implementations across multiple files
- Understanding architectural decisions
- Information Aggregation (OOLONG)
- Processing datasets with semantic transformations
- Aggregating statistics across thousands of entries
- Computing results that require examining every line
- Complex Pairwise Reasoning (OOLONG-Pairs)
- Finding relationships between pairs of elements
- Quadratic complexity tasks (O(N²) processing)
- Tasks requiring examination of all combinations
1. Document Analysis at Scale
- Legal contract review across hundreds of agreements
- Academic research: analyzing 50+ papers for literature reviews
- Technical documentation: processing entire API documentation sets
- Medical records: analyzing patient histories across multiple visits
2. Development & DevOps
- Code repository audits and security reviews
- Log analysis: finding patterns across millions of log lines
- Configuration management: validating consistency across microservices
- Documentation generation from large codebases
3. Business Intelligence
- Customer feedback analysis across thousands of reviews/tickets
- Competitive analysis: processing competitor documentation and materials
- Market research: synthesizing reports from multiple sources
- Compliance audits: checking regulations across documents
4. Content & Media
- Transcript analysis: processing hours of meeting recordings
- Book/article summarization and cross-referencing
- Research assistance: finding connections across academic papers
- Content moderation at scale
5. Integration with Model Context Protocol (MCP)
RLM-runtime is particularly well-suited as an MCP server that provides long-context processing capabilities:
# Example: RLM as an MCP server
# Expose RLM as a tool that other applications can call
from mcp.server import Server
from rlm_runtime import RLM, Context
from rlm_runtime.adapters import OpenAICompatAdapter

server = Server("rlm-processor")

@server.tool()
async def process_long_context(query: str, documents: list[str]) -> str:
    """Process arbitrarily long context using RLM"""
    context = Context.from_documents(documents)
    rlm = RLM(adapter=OpenAICompatAdapter())
    output, trace = rlm.run(query, context)
    return output

MCP Use Cases:
- Claude Desktop/Web: Add RLM as a tool for processing large file sets
- IDE Extensions: Analyze entire projects beyond editor context limits
- Research Tools: Process multiple papers/books in citation managers
- Data Analysis: Query large datasets through natural language
6. When RLM Wins Over Alternatives
Use RLM when:
- ✅ Context size > 100K tokens (beyond most model windows)
- ✅ Information is scattered across the entire context
- ✅ Task requires examining most/all of the input
- ✅ Accuracy is more important than speed
- ✅ Context doesn't fit in RAG chunk paradigm
Don't use RLM when:
- ❌ Context always fits in model window (<50K tokens)
- ❌ Simple keyword search would work
- ❌ Information is localized (RAG would be faster)
- ❌ Real-time response required (milliseconds)
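As a rule of thumb, this choice can be automated with a simple size-based router. The sketch below is illustrative (not the actual smart_router_demo.py); the adapter's `complete` method and the chars/4 token estimate are assumptions.

from rlm_runtime import RLM, Context
from rlm_runtime.adapters import OpenAICompatAdapter

def answer(query: str, documents: list[str], window_tokens: int = 50_000) -> str:
    """Route to a plain single-call baseline while the context fits, else RLM.
    Illustrative sketch; `adapter.complete` is an assumed method name."""
    est_tokens = sum(len(d) for d in documents) // 4   # rough chars/4 heuristic
    adapter = OpenAICompatAdapter()
    if est_tokens < window_tokens:
        prompt = query + "\n\n" + "\n\n".join(documents)
        return adapter.complete(prompt)
    rlm = RLM(adapter=adapter)
    output, _trace = rlm.run(query, Context.from_documents(documents))
    return output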
# Analyze 50 academic papers to answer a research question
from rlm_runtime import RLM, Context
from rlm_runtime.adapters import OpenAICompatAdapter
# Load papers (could be 1M+ tokens total)
papers = [read_pdf(f"paper_{i}.pdf") for i in range(50)]
context = Context.from_documents(papers)
rlm = RLM(adapter=OpenAICompatAdapter())
query = """
What are the main methodologies used for evaluating long-context
language models across these papers? Provide a comparison table.
"""
answer, trace = rlm.run(query, context)
print(answer)
# API Configuration (OpenAI-compatible endpoints)
export LLM_API_KEY="your-key" # or OPENAI_API_KEY
export LLM_BASE_URL="https://..." # optional, for custom endpoints
# For local models (no auth needed)
export LLM_BASE_URL="http://localhost:11434/v1"  # Example: Ollama
- OpenAI: GPT-4, GPT-3.5, etc.
- Anthropic: Claude Sonnet, Opus (via OpenAI-compatible proxy)
- Local: Ollama, LM Studio, vLLM, or any OpenAI-compatible server
- Custom: Implement your own adapter by extending BaseAdapter
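A rough sketch of what a custom adapter could look like is shown below; the `complete` method name and its signature are assumptions made for illustration, so check BaseAdapter in rlm_runtime.adapters for the real interface.

import requests

from rlm_runtime.adapters import BaseAdapter

class MyHTTPAdapter(BaseAdapter):
    """Sketch of a custom adapter for an in-house HTTP endpoint.
    The `complete` method is an assumed interface; mirror whatever
    abstract methods BaseAdapter actually defines."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def complete(self, prompt: str, **kwargs) -> str:
        resp = requests.post(self.endpoint, json={"prompt": prompt, **kwargs})
        resp.raise_for_status()
        return resp.json()["text"]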
- minimal.py: Simplest possible RLM example
- rlm_vs_baseline.py: Full benchmark showing crossover point
- complex_reasoning.py: Multi-step reasoning over long documents
- hybrid_audit.py: Trajectory visualization
- smart_router_demo.py: Auto baseline/RLM selection
- ollama_example.py: Using local Ollama models
- cloud_example.py: Cloud provider integration
# Linting and formatting
uv run ruff check .
uv run ruff format .
# Type checking
uv run ty check
# Tests
uv run pytest
- MIT CSAIL Paper: Recursive Language Models
- Original paper authors: Zhou et al.
- This implementation is not affiliated with MIT
MIT License - see LICENSE file for details
