Commit 3978cfa: [docs] Update docs
Parent: 308338e

10 files changed: 382 additions & 461 deletions
docs/getting_started/quickstart.md (1 addition & 1 deletion)

````diff
@@ -181,7 +181,7 @@ python -m vllm.entrypoints.openai.api_server \
     --enable-prefix-caching
 ```
 
-> **Tip:** For eviction sync, prefix with `CONTEXTPILOT_INDEX_URL=http://localhost:8765`. This lets the inference engine notify ContextPilot when KV cache entries are evicted.
+> **Note:** For eviction sync, prefix with `CONTEXTPILOT_INDEX_URL=http://localhost:8765`. This lets the inference engine notify ContextPilot when KV cache entries are evicted.
 
 ## Step 2: Start ContextPilot
````

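The env-prefix form in the hunk above is ordinary environment-variable inheritance. As a minimal sketch of that mechanism only (the child command here is a stand-in, not the real engine launch; the variable name comes from the docs):

```python
import os
import subprocess
import sys

# The shell form `CONTEXTPILOT_INDEX_URL=... python -m vllm...` simply sets an
# environment variable for one child process. The child below is a stand-in
# that echoes the variable; a real launch would exec the inference engine.
env = dict(os.environ, CONTEXTPILOT_INDEX_URL="http://localhost:8765")
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CONTEXTPILOT_INDEX_URL'])"],
    env=env, capture_output=True, text=True,
)
print(child.stdout.strip())  # -> http://localhost:8765
```

Because the variable is scoped to that one process, other services on the same machine are unaffected.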
docs/guides/offline_usage.md (93 additions & 154 deletions)

````diff
@@ -6,7 +6,7 @@ sidebar_label: Offline Usage
 
 # Offline Usage
 
-Offline mode is best for **batch processing** where you process all queries at once without needing live cache management.
+Offline mode is best for **batch processing** where you have all queries upfront and want to maximize KV-cache reuse across them — no server required.
 
 ## How ContextPilot Optimizes Batches
 
````

````diff
@@ -15,211 +15,150 @@ ContextPilot performs **two levels of optimization** to maximize KV-cache prefix
 1. **Inter-Context Reordering**: Queries with overlapping context blocks are scheduled together
 2. **Intra-Context Reordering**: Context blocks within each query are reordered so shared blocks appear first as a common prefix
 
-For example, if Query A retrieves `["block_C", "block_A", "block_D", "block_B"]` and Query B retrieves `["block_B", "block_E", "block_A", "block_C"]`, after optimization:
+For example, if Query A has `["block_C", "block_A", "block_D", "block_B"]` and Query B has `["block_B", "block_E", "block_A", "block_C"]`, after optimization:
 - Query A: `["block_A", "block_B", "block_C", "block_D"]` (shared blocks first)
-- Query B: `["block_A", "block_B", "block_C", "block_E"]` (same prefix `["block_A", "block_B", "block_C"]`!)
-
-This creates identical prefixes that the inference engine can cache and reuse.
+- Query B: `["block_A", "block_B", "block_C", "block_E"]` (same prefix — cache hit!)
 
````

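The shared-blocks-first reordering described in the example above can be approximated with a simple frequency heuristic. This is an illustrative sketch, not the actual ContextPilot algorithm, and `shared_blocks_first` is a hypothetical helper:

```python
from collections import Counter

def shared_blocks_first(all_contexts):
    """Toy heuristic: within each query, sort blocks so those shared by more
    queries come first, forming a common prefix the engine can cache."""
    freq = Counter(b for ctx in all_contexts for b in ctx)
    # Sort by descending share count, then block id for a stable common order
    return [sorted(ctx, key=lambda b: (-freq[b], b)) for ctx in all_contexts]

query_a = ["block_C", "block_A", "block_D", "block_B"]
query_b = ["block_B", "block_E", "block_A", "block_C"]
print(shared_blocks_first([query_a, query_b]))
# Both queries now begin with the shared prefix block_A, block_B, block_C
```

On the example from the diff, this reproduces the documented result: both queries end up with the identical three-block prefix, followed by their unique blocks.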
````diff
 ## Prerequisites
 
-1. **Start your inference engine:**
+Start your inference engine:
+
 ```bash
 # SGLang:
 python -m sglang.launch_server \
     --model-path Qwen/Qwen2.5-7B-Instruct \
     --port 30000
+
 # or vLLM:
 python -m vllm.entrypoints.openai.api_server \
     --model Qwen/Qwen2.5-7B-Instruct \
     --port 30000 \
     --enable-prefix-caching
 ```
 
-2. **Prepare your data:**
-   - `corpus.jsonl`: Corpus file with one context block per line (e.g., `{"text": "..."}`)
-   - Queries: List of strings or query objects
-
 ---
 
````

````diff
-## Example 1: End-to-End Pipeline
+## Using `cp.optimize_batch()` (Simplest)
 
-Retrieve, optimize, and generate in one call:
+Pass your context blocks and queries — ContextPilot handles reordering and returns ready-to-use OpenAI messages in the optimal execution order.
 
 ```python
-from contextpilot.pipeline import RAGPipeline, InferenceConfig
-
-pipeline = RAGPipeline(
-    retriever="bm25",
-    corpus_path="corpus.jsonl",
-    use_contextpilot=True,
-    inference=InferenceConfig(
-        model_name="Qwen/Qwen2.5-7B-Instruct",
-        base_url="http://localhost:30000",
-        max_tokens=256,
-        temperature=0.0
-    )
-)
-
-results = pipeline.run(
-    queries=[
-        "What is machine learning?",
-        "Explain neural networks",
-        "What is deep learning?"
-    ],
-    top_k=20,
-    generate_responses=True
-)
-
-print(f"Processed {results['metadata']['num_queries']} queries")
-print(f"Created {results['metadata']['num_groups']} optimized groups")
-
-for i, gen_result in enumerate(results["generation_results"]):
-    if gen_result["success"]:
-        print(f"\nQuery {i+1}: {gen_result['generated_text'][:200]}...")
+import asyncio
+import openai
+import contextpilot as cp
+
+BASE_URL = "http://localhost:30000/v1"
+engine = cp.ContextPilot(use_gpu=False)
+
+queries = ["What is AI?", "Explain neural networks", "What is deep learning?"]
+all_contexts = [
+    ["AI is the simulation of human intelligence", "Machine learning is a subset of AI", "Deep learning uses neural networks"],
+    ["Neural networks are inspired by the brain", "Machine learning is a subset of AI", "Backpropagation trains neural networks"],
+    ["Deep learning uses neural networks", "Machine learning is a subset of AI", "GPUs accelerate deep learning training"],
+]
+
+# Returns messages in scheduled order + the original index mapping
+messages_batch, order = engine.optimize_batch(all_contexts, queries)
+
+async def generate_all():
+    client = openai.AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")
+    return await asyncio.gather(*[
+        client.chat.completions.create(model="Qwen/Qwen2.5-7B-Instruct", messages=m)
+        for m in messages_batch
+    ])
+
+for resp, idx in zip(asyncio.run(generate_all()), order):
+    print(f"Q: {queries[idx]}\nA: {resp.choices[0].message.content}\n")
 ```
 
----
-
-## Example 2: Retrieval + Optimization Only
-
-Prepare optimized batches without generation (for later inference):
-
-```python
-from contextpilot.pipeline import RAGPipeline
-
-pipeline = RAGPipeline(
-    retriever="bm25",
-    corpus_path="corpus.jsonl",
-    use_contextpilot=True
-)
-
-results = pipeline.run(
-    queries=["What is AI?", "What is ML?"],
-    top_k=20,
-    generate_responses=False
-)
-
-# Save for later use
-pipeline.save_results(results, "optimized_batch.jsonl")
-print(f"Saved {len(results['optimized_batch'])} groups")
-```
+`messages_batch[i]` corresponds to `queries[order[i]]` — send them in this order to the inference engine for maximum prefix sharing, then use `order` to map results back.
 
 ---
 
````

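The order-mapping contract stated above (scheduled position `i` holds the query originally at index `order[i]`) can be checked with a toy scatter-back that uses no ContextPilot APIs at all; the `order` value here is a made-up schedule:

```python
# Toy illustration of the order-mapping contract (not the ContextPilot API):
# responses arrive in scheduled order; order[i] gives the original index.
queries = ["What is AI?", "Explain neural networks", "What is deep learning?"]
order = [2, 0, 1]  # hypothetical schedule chosen for illustration
responses = [f"answer[{queries[idx]}]" for idx in order]  # scheduled order

results = [None] * len(queries)
for resp, idx in zip(responses, order):
    results[idx] = resp  # scatter back to original positions

print(results[0])  # -> answer[What is AI?]
```

After the scatter-back, `results[k]` always holds the answer for `queries[k]`, regardless of how the batch was scheduled.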
````diff
-## Example 3: Step-by-Step Control
-
-Fine-grained control over each pipeline stage:
-
-```python
-from contextpilot.pipeline import RAGPipeline, InferenceConfig
-
-pipeline = RAGPipeline(
-    retriever="bm25",
-    corpus_path="corpus.jsonl",
-    inference=InferenceConfig(
-        model_name="Qwen/Qwen2.5-7B-Instruct",
-        base_url="http://localhost:30000"
-    )
-)
-
-queries = ["What is machine learning?", "Explain neural networks"]
-
-# Step 1: Retrieve documents
-retrieval_results = pipeline.retrieve(queries=queries, top_k=20)
-print(f"Retrieved documents for {len(retrieval_results)} queries")
+## Using `cp.reorder()` (Manual Control)
 
-# Step 2: Optimize context ordering
-optimized = pipeline.optimize(retrieval_results)
-print(f"Created {len(optimized['groups'])} optimized groups")
-
-# Step 3: Generate responses
-generation_results = pipeline.generate(optimized)
-print(f"Generated {generation_results['metadata']['successful_requests']} responses")
-
-# Inspect groups
-for group in optimized['groups']:
-    print(f"Group {group['group_id']}: {group['group_size']} queries, score={group['group_score']:.3f}")
-```
-
----
-
-## Example 4: Compare With/Without ContextPilot
+Use `reorder()` when you need full control over prompt construction — it returns reordered context blocks and the execution order, and you build the prompts yourself.
 
 ```python
-from contextpilot.pipeline import RAGPipeline, InferenceConfig
-
-queries = ["What is AI?", "What is ML?", "What is DL?"]
-
-# With ContextPilot optimization
-pipeline_optimized = RAGPipeline(
-    retriever="bm25",
-    corpus_path="corpus.jsonl",
-    use_contextpilot=True,
-    inference=InferenceConfig(
-        model_name="Qwen/Qwen2.5-7B-Instruct",
-        base_url="http://localhost:30000"
-    )
-)
-results_optimized = pipeline_optimized.run(queries=queries, generate_responses=True)
-
-# Without ContextPilot (standard RAG)
-pipeline_standard = RAGPipeline(
-    retriever="bm25",
-    corpus_path="corpus.jsonl",
-    use_contextpilot=False,
-    inference=InferenceConfig(
-        model_name="Qwen/Qwen2.5-7B-Instruct",
-        base_url="http://localhost:30000"
-    )
-)
-results_standard = pipeline_standard.run(queries=queries, generate_responses=True)
-
-# Compare timings
-print(f"ContextPilot: {results_optimized['metadata']['total_time']:.2f}s")
-print(f"Standard: {results_standard['metadata']['total_time']:.2f}s")
+import asyncio
+import openai
+import contextpilot as cp
+
+BASE_URL = "http://localhost:30000/v1"
+engine = cp.ContextPilot(use_gpu=False)
+
+queries = ["What is AI?", "Explain neural networks", "What is deep learning?"]
+all_contexts = [
+    ["AI is the simulation of human intelligence", "Machine learning is a subset of AI", "Deep learning uses neural networks"],
+    ["Neural networks are inspired by the brain", "Machine learning is a subset of AI", "Backpropagation trains neural networks"],
+    ["Deep learning uses neural networks", "Machine learning is a subset of AI", "GPUs accelerate deep learning training"],
+]
+
+# reordered[i] = reordered blocks for the i-th scheduled query
+# order[i] = index into the original queries list
+reordered, order = engine.reorder(all_contexts)
+
+def build_prompt(query, blocks):
+    context_text = "\n".join(f"[{i+1}] {b}" for i, b in enumerate(blocks))
+    return [
+        {"role": "system", "content": f"Answer based on the context:\n{context_text}"},
+        {"role": "user", "content": query},
+    ]
+
+messages_batch = [build_prompt(queries[order[i]], reordered[i]) for i in range(len(order))]
+
+async def generate_all():
+    client = openai.AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")
+    return await asyncio.gather(*[
+        client.chat.completions.create(model="Qwen/Qwen2.5-7B-Instruct", messages=m)
+        for m in messages_batch
+    ])
+
+results = [None] * len(queries)
+for resp, idx in zip(asyncio.run(generate_all()), order):
+    results[idx] = resp.choices[0].message.content
+
+for q, a in zip(queries, results):
+    print(f"Q: {q}\nA: {a}\n")
 ```
 
 ---
 
````

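Both examples in this file recover per-query results by zipping the gathered responses with `order`. That is sound because `asyncio.gather` returns results in submission order, not completion order; a quick self-contained check:

```python
import asyncio
import random

async def fake_request(i):
    # Finish in a random order to show gather still preserves submission order
    await asyncio.sleep(random.random() * 0.01)
    return i

async def main():
    return await asyncio.gather(*[fake_request(i) for i in range(5)])

print(asyncio.run(main()))  # -> [0, 1, 2, 3, 4]
```

So `zip(asyncio.run(generate_all()), order)` pairs each response with the original index of the query it answers.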
````diff
-## Example 5: Using FAISS Retriever
+## RAG Pipeline (with Built-in Retrieval)
 
-For semantic search with embeddings:
-
-```bash
-# First, start an embedding server (e.g. SGLang):
-python -m sglang.launch_server \
-    --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \
-    --is-embedding \
-    --port 30001
-```
+If you have a document corpus and want ContextPilot to handle retrieval + optimization in one call, use `RAGPipeline`:
 
 ```python
 from contextpilot.pipeline import RAGPipeline, InferenceConfig
 
 pipeline = RAGPipeline(
-    retriever="faiss",
+    retriever="bm25",  # or "faiss" for semantic search
     corpus_path="corpus.jsonl",
-    index_path="faiss_index.faiss",  # Created if doesn't exist
-    embedding_model="Alibaba-NLP/gte-Qwen2-7B-instruct",
-    embedding_base_url="http://localhost:30001",
     use_contextpilot=True,
     inference=InferenceConfig(
         model_name="Qwen/Qwen2.5-7B-Instruct",
-        base_url="http://localhost:30000"
+        base_url="http://localhost:30000",
+        max_tokens=256,
     )
 )
 
 results = pipeline.run(
-    queries=["Explain quantum computing"],
-    generate_responses=True
+    queries=["What is machine learning?", "Explain neural networks", "What is deep learning?"],
+    top_k=20,
+    generate_responses=True,
 )
+
+for gen_result in results["generation_results"]:
+    if gen_result["success"]:
+        print(gen_result["generated_text"][:200])
 ```
 
+See the [API Reference](../reference/api) for full `RAGPipeline` options including FAISS retrieval, step-by-step control, and saving results.
+
 ---
 
````

````diff
 ## Next Steps
 
-- [Online Usage](online_usage) - Live index server modes
-- [Multi-Turn](multi_turn) - Conversation handling
+- [Online Usage](online_usage) - Live index server for stateful cache tracking
+- [Multi-Turn](multi_turn) - Context deduplication across conversation turns
 - [API Reference](../reference/api) - Full API documentation
````

docs/intro.md (31 additions & 20 deletions)

````diff
@@ -1,35 +1,46 @@
 ---
 id: intro
-title: ContextPilot Documentation
+title: ContextPilot
 sidebar_label: Overview
 slug: /
 ---
 
-# ContextPilot Documentation
+<div style={{textAlign: 'center', margin: '1.5rem 0 2rem'}}>
+  <img src="/img/contextpilot_logo.png" alt="ContextPilot" style={{width: '100%', maxWidth: '480px'}} />
+</div>
 
-Welcome to the ContextPilot documentation. This guide covers everything you need to get started and make the most of ContextPilot.
+# ContextPilot
+
+ContextPilot is a context optimizer that sits before the inference engine. Long-context workloads often carry similar, overlapping, or redundant context blocks — wasting tokens and triggering unnecessary KV computation. ContextPilot applies [optimization primitives](reference/primitives) to input contexts before inference, improving token efficiency and cache utilization for faster execution, with no changes to your model or inference engine.
+
+**4–12× cache hits · 1.5–3× faster prefill · ~36% token savings**
+
+## Key Features
+
+- **Higher Throughput & Cache Hits**: Boosts prefill throughput and cache hit ratio by improving token efficiency and cache utilization across long-context requests.
+- **Cache-Aware Scheduling**: Groups requests with overlapping context blocks to run consecutively, maximizing prefix sharing across the entire batch.
+- **Reduced Redundant Computation**: Detects and eliminates repeated content across requests, reducing redundant token transmission by ~36% per turn.
+- **Drop-In Integration**: Hooks into SGLang and vLLM at runtime via a `.pth` import — set `CONTEXTPILOT_INDEX_URL` when launching your engine, no code changes required. Works with any OpenAI-compatible endpoint.
+- **No Compromise in Reasoning Quality**: Preserves model accuracy with importance-ranked context annotation. With extremely long contexts, quality can even improve over the baseline.
 
 ## Getting Started
 
-| Guide | Description |
-|-------|-------------|
-| [Installation](getting_started/installation) | System requirements and pip install |
-| [Quick Start](getting_started/quickstart) | Your first ContextPilot pipeline in 5 minutes |
+- [**Installation**](getting_started/installation) — System requirements and `pip install contextpilot`
+- [**Quick Start**](getting_started/quickstart) — Your first ContextPilot pipeline in 5 minutes
 
-## User Guides
+## Guides
 
-| Guide | Description |
-|-------|-------------|
-| [Offline Usage](guides/offline_usage) | Batch processing without server |
-| [Online Usage](guides/online_usage) | Index server (stateless & stateful modes) |
-| [Eviction Patches](guides/online_usage#inference-engine-integration) | **Required for stateful mode** — eviction callback for KV cache sync (SGLang & vLLM) |
-| [Multi-Turn Conversations](guides/multi_turn) | Context deduplication across turns (30-60% savings) |
-| [PageIndex Integration](guides/pageindex) | Tree-structured documents → ContextPilot scheduling |
-| [mem0 Integration](guides/mem0) | LoCoMo benchmark with mem0 memory backend |
+- [**Offline Usage**](guides/offline_usage) — Batch processing with `cp.optimize_batch()` and `cp.reorder()`
+- [**Online Usage**](guides/online_usage) — Index server with stateless and stateful modes
+- [**Multi-Turn Conversations**](guides/multi_turn) — Context deduplication across conversation turns
+- [**PageIndex Integration**](guides/pageindex) — Tree-structured document scheduling
+- [**mem0 Integration**](guides/mem0) — LoCoMo benchmark with mem0 memory backend
 
 ## Reference
 
-| Document | Description |
-|----------|-------------|
-| [API Reference](reference/api) | Pipeline, InferenceConfig, HTTP endpoints |
-| [Benchmarks](reference/benchmarks) | GPU vs CPU performance analysis and methodology |
+- [**API Reference**](reference/api) — `ContextPilot`, `RAGPipeline`, HTTP endpoints
+- [**Benchmarks**](reference/benchmarks) — GPU vs CPU performance analysis
````
