> **Note:** For eviction sync, prefix with `CONTEXTPILOT_INDEX_URL=http://localhost:8765`. This lets the inference engine notify ContextPilot when KV cache entries are evicted.
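For instance, an SGLang launch with the variable set might look like this (a sketch only: the model path, port, and flags are placeholders for your own deployment):

```shell
# Example launch with eviction sync enabled. The model path and flags are
# placeholders; adjust them for your setup.
CONTEXTPILOT_INDEX_URL=http://localhost:8765 \
  python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-7B-Instruct \
    --port 30000
```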
Offline mode is best for **batch processing** where you have all queries upfront and want to maximize KV-cache reuse across them — no server required.
## How ContextPilot Optimizes Batches
ContextPilot performs **two levels of optimization** to maximize KV-cache prefix sharing:
1. **Inter-Context Reordering**: Queries with overlapping context blocks are scheduled together
2. **Intra-Context Reordering**: Context blocks within each query are reordered so shared blocks appear first as a common prefix
For example, if Query A has `["block_C", "block_A", "block_D", "block_B"]` and Query B has `["block_B", "block_E", "block_A", "block_C"]`, then after optimization the shared blocks (`block_A`, `block_B`, `block_C`) are moved to the front of both contexts as a common prefix.
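As a toy sketch of that intra-context step (illustrative only, not ContextPilot's actual algorithm), the shared blocks can be hoisted to the front of each context in a canonical order:

```python
# Simplified sketch of intra-context reordering (illustrative only; not
# ContextPilot's actual algorithm). Blocks shared by both queries are moved
# to the front in a canonical order, so both contexts begin with the same
# prefix and the KV cache for that prefix can be reused.
query_a = ["block_C", "block_A", "block_D", "block_B"]
query_b = ["block_B", "block_E", "block_A", "block_C"]

shared = sorted(set(query_a) & set(query_b))  # canonical order for the prefix

def reorder(blocks, shared):
    # Shared blocks first (common prefix), then the rest in original order.
    return shared + [b for b in blocks if b not in shared]

print(reorder(query_a, shared))  # ['block_A', 'block_B', 'block_C', 'block_D']
print(reorder(query_b, shared))  # ['block_A', 'block_B', 'block_C', 'block_E']
```

Both reordered contexts now share the three-block prefix `block_A, block_B, block_C`, so scheduling them back-to-back lets the second query hit the cache for that entire prefix.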
`messages_batch[i]` corresponds to `queries[order[i]]` — send them in this order to the inference engine for maximum prefix sharing, then use `order` to map results back.
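A minimal sketch of that map-back step (the `order` values and response strings below are made-up placeholders):

```python
# Map responses from execution order back to the original query order.
# order[i] is the original index of the i-th executed query, so the i-th
# response belongs at position order[i] in the final result list.
order = [2, 0, 1]                                      # placeholder
responses = ["answer for q2", "answer for q0", "answer for q1"]

results = [None] * len(order)
for i, original_idx in enumerate(order):
    results[original_idx] = responses[i]

print(results)  # ['answer for q0', 'answer for q1', 'answer for q2']
```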
---
## Example 3: Step-by-Step Control
Fine-grained control over each pipeline stage:
```python
from contextpilot.pipeline import RAGPipeline, InferenceConfig

pipeline = RAGPipeline(
    retriever="bm25",
    corpus_path="corpus.jsonl",
    inference=InferenceConfig(
        model_name="Qwen/Qwen2.5-7B-Instruct",
        base_url="http://localhost:30000"
    )
)

queries = ["What is machine learning?", "Explain neural networks"]
```
Use `reorder()` when you need full control over prompt construction — it returns reordered context blocks and the execution order, and you build the prompts yourself.
```python
from contextpilot.pipeline import RAGPipeline, InferenceConfig

queries = ["What is AI?", "What is ML?", "What is DL?"]
```
---

# ContextPilot

Welcome to the ContextPilot documentation. This guide covers everything you need to get started and make the most of ContextPilot.

ContextPilot is a context optimizer that sits before the inference engine. Long-context workloads often carry similar, overlapping, or redundant context blocks — wasting tokens and triggering unnecessary KV computation. ContextPilot applies [optimization primitives](reference/primitives) to input contexts before inference, improving token efficiency and cache utilization for faster execution, with no changes to your model or inference engine.
- **Higher Throughput & Cache Hits**: Boosts prefill throughput and cache hit ratio by improving token efficiency and cache utilization across long-context requests.
- **Cache-Aware Scheduling**: Groups requests with overlapping context blocks to run consecutively, maximizing prefix sharing across the entire batch.
- **Reduced Redundant Computation**: Detects and eliminates repeated content across requests, reducing redundant token transmission by ~36% per turn.
- **Drop-In Integration**: Hooks into SGLang and vLLM at runtime via a `.pth` import — set `CONTEXTPILOT_INDEX_URL` when launching your engine, no code changes required. Works with any OpenAI-compatible endpoint.
- **No Compromise in Reasoning Quality**: Preserves model accuracy with importance-ranked context annotation. With extremely long contexts, quality can even improve over the baseline.
## Getting Started
- [**Installation**](getting_started/installation) — System requirements and `pip install contextpilot`
- [**Quick Start**](getting_started/quickstart) — Your first ContextPilot pipeline in 5 minutes
## Guides
- [**Offline Usage**](guides/offline_usage) — Batch processing without a server
- [**Online Usage**](guides/online_usage) — Index server (stateless & stateful modes)
- [**Eviction Patches**](guides/online_usage#inference-engine-integration) — **Required for stateful mode**: eviction callback for KV cache sync (SGLang & vLLM)
- [**Multi-Turn Conversations**](guides/multi_turn) — Context deduplication across turns (30-60% savings)