Automatic Conversation Summarization and History Management
- **Intelligent Summarization** — LLM-powered context compression
- **Sliding Window** — zero-cost message trimming
- **Limit Warnings** — finish-soon guidance before hard caps
- **Context Manager** — real-time token tracking + tool truncation
- **Safe Cutoff** — preserves tool call/response pairs
Context Management for Pydantic AI helps your agents handle long conversations without exceeding model context limits. Choose between intelligent LLM summarization and fast sliding-window trimming.
Full framework? Check out Pydantic Deep Agents — complete agent framework with planning, filesystem, subagents, and skills.
| What You Want to Build | How This Library Helps |
|---|---|
| Long-Running Agent | Automatically compress history when context fills up |
| Customer Support Bot | Preserve key details while discarding routine exchanges |
| Code Assistant | Keep recent code context, summarize older discussions |
| High-Throughput App | Zero-cost sliding window for maximum speed |
| Cost-Sensitive App | Choose between quality (summarization) or free (sliding window) |
```bash
pip install summarization-pydantic-ai
```

Or with uv:

```bash
uv add summarization-pydantic-ai
```

For accurate token counting:

```bash
pip install "summarization-pydantic-ai[tiktoken]"
```

The recommended way to add context management is via pydantic-ai's native Capabilities API:
```python
from pydantic_ai import Agent
from pydantic_ai_summarization import ContextManagerCapability

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    capabilities=[ContextManagerCapability(max_tokens=100_000)],
)

result = await agent.run("Hello!")
```

That's it. Your agent now:
- Tracks token usage on every turn
- Auto-compresses when approaching the limit (90% by default)
- Truncates large tool outputs
- Auto-detects context window size from the model
- Preserves tool call/response pairs (never breaks them)
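The last guarantee is worth a sketch. The following is illustrative only (an assumed message schema, not the library's real types): if a proposed cutoff would strand a tool result without its originating call, the cutoff moves back so the pair survives trimming.

```python
def safe_cutoff(messages: list[dict], proposed: int) -> int:
    """Move a proposed cutoff index earlier until it no longer
    splits a tool call from its matching tool result (illustrative)."""
    cut = proposed
    # A kept history slice must not begin with an orphaned tool result.
    while 0 < cut < len(messages) and messages[cut]["kind"] == "tool-result":
        cut -= 1  # keep the preceding tool call in the slice as well
    return max(cut, 0)

history = [
    {"kind": "user"},         # 0
    {"kind": "tool-call"},    # 1
    {"kind": "tool-result"},  # 2
    {"kind": "assistant"},    # 3
]
```

Cutting at index 2 would orphan the tool result, so the cutoff slides back to index 1 and the call/response pair is kept together.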
Let the agent decide when to compress by enabling the `compact_conversation` tool:

```python
agent = Agent(
    "anthropic:claude-sonnet-4-6",
    capabilities=[ContextManagerCapability(
        include_compact_tool=True,  # Adds compact_conversation(focus?) tool
    )],
)
```

The agent can call `compact_conversation(focus="preserve API design decisions")` to trigger compression with a focus topic. Compression is deferred to the next model request.
Capabilities can also be combined; for example, run a limit warner alongside the context manager:

```python
from pydantic_ai_summarization import ContextManagerCapability, LimitWarnerCapability

agent = Agent(
    "openai:gpt-4.1",
    capabilities=[
        LimitWarnerCapability(max_iterations=40, max_context_tokens=100_000),
        ContextManagerCapability(max_tokens=100_000),
    ],
)
```

For standalone use without capabilities:
```python
from pydantic_ai import Agent
from pydantic_ai_summarization import create_summarization_processor

processor = create_summarization_processor(
    trigger=("tokens", 100000),
    keep=("messages", 20),
)

agent = Agent("openai:gpt-4.1", history_processors=[processor])
```

| Processor | LLM Cost | Latency | Context Preservation |
|---|---|---|---|
| `ContextManagerCapability` | Per compression | Low (tracking) | Intelligent summary + tool truncation |
| `SummarizationProcessor` | High | High | Intelligent summary |
| `SlidingWindowProcessor` | Zero | ~0ms | Discards old messages |
| `LimitWarnerProcessor` | Zero | ~0ms | Full history + warning injection |
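The zero-cost rows above come down to list slicing. A minimal sketch of the trigger/keep semantics (simplified; the real processor also respects safe cutoff points):

```python
def sliding_window(messages: list, trigger: int = 100, keep: int = 50) -> list:
    # Below the trigger threshold, the history passes through untouched.
    if len(messages) <= trigger:
        return messages
    # Above it, only the most recent `keep` messages survive.
    return messages[-keep:]

msgs = [f"msg-{i}" for i in range(120)]
trimmed = sliding_window(msgs)  # 120 > 100, so trimmed to the last 50
```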
Uses an LLM to create summaries of older messages:

```python
from pydantic_ai_summarization import create_summarization_processor

processor = create_summarization_processor(
    trigger=("tokens", 100000),  # When to summarize
    keep=("messages", 20),       # What to keep
)
```

Simply discards old messages — no LLM calls:
```python
from pydantic_ai_summarization import create_sliding_window_processor

processor = create_sliding_window_processor(
    trigger=("messages", 100),  # When to trim
    keep=("messages", 50),      # What to keep
)
```

Warns the agent before requests, context usage, or total tokens hit a cap:
```python
from pydantic_ai_summarization import create_limit_warner_processor

processor = create_limit_warner_processor(
    max_iterations=40,
    max_context_tokens=100000,
    max_total_tokens=200000,
)
```

Full context management with token tracking, auto-compression, and tool output truncation:
```python
from pydantic_ai import Agent
from pydantic_ai_summarization import ContextManagerCapability

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    capabilities=[ContextManagerCapability(
        max_tokens=100_000,
        compress_threshold=0.9,
        max_tool_output_tokens=5000,
        include_compact_tool=True,  # Agent gets a compact_conversation tool
    )],
)
```

Trigger types:

| Type | Example | Description |
|---|---|---|
| `messages` | `("messages", 50)` | Trigger when message count exceeds threshold |
| `tokens` | `("tokens", 100000)` | Trigger when token count exceeds threshold |
| `fraction` | `("fraction", 0.8)` | Trigger at a percentage of `max_input_tokens` |
Keep types:

| Type | Example | Description |
|---|---|---|
| `messages` | `("messages", 20)` | Keep the last N messages |
| `tokens` | `("tokens", 10000)` | Keep the last N tokens' worth of messages |
| `fraction` | `("fraction", 0.2)` | Keep the last N% of the context |
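How the three spec types resolve to a concrete threshold can be shown with a little arithmetic (an illustrative helper, not the library's code):

```python
def resolve_threshold(spec: tuple, max_input_tokens: int) -> int:
    kind, value = spec
    if kind == "fraction":
        # Fractions are resolved against the model's context window.
        return int(max_input_tokens * value)
    # "messages" and "tokens" are already absolute counts.
    return value
```

So `("fraction", 0.8)` with a 128k-token window fires at 102,400 tokens.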
Multiple triggers can be combined; summarization fires when any one of them is met:

```python
from pydantic_ai_summarization import SummarizationProcessor

processor = SummarizationProcessor(
    model="openai:gpt-4o",
    trigger=[
        ("messages", 50),    # OR: 50+ messages
        ("tokens", 100000),  # OR: 100k+ tokens
    ],
    keep=("messages", 10),
)
```

Fraction-based triggers resolve against the model's context window:

```python
processor = SummarizationProcessor(
    model="openai:gpt-4o",
    trigger=("fraction", 0.8),  # 80% of the context window
    keep=("fraction", 0.2),     # Keep the last 20%
    max_input_tokens=128000,    # GPT-4o's context window
)
```

You can supply your own token counter:

```python
def my_token_counter(messages):
    # Rough estimate: ~4 characters per token
    return sum(len(str(msg)) for msg in messages) // 4

processor = create_summarization_processor(
    token_counter=my_token_counter,
)
```

Any pydantic-ai model works as the summarizer, including Azure OpenAI deployments:

```python
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider
from pydantic_ai_summarization import create_summarization_processor

azure_model = OpenAIModel(
    "gpt-4o",
    provider=OpenAIProvider(
        base_url="https://my-resource.openai.azure.com/openai/deployments/gpt-4o",
        api_key="your-azure-api-key",
    ),
)

processor = create_summarization_processor(
    model=azure_model,
    trigger=("tokens", 100000),
    keep=("messages", 20),
)
```

The summary prompt is customizable; `{messages}` is replaced with the conversation to summarize:

```python
processor = create_summarization_processor(
    summary_prompt="""
    Extract key information from this conversation.
    Focus on: decisions made, code written, pending tasks.

    Conversation:
    {messages}
    """,
)
```

| Feature | Description |
|---|---|
| Two Strategies | Intelligent summarization or fast sliding window |
| Flexible Triggers | Message count, token count, or fraction-based |
| Safe Cutoff | Never breaks tool call/response pairs |
| Auto max_tokens | Auto-detect context window from genai-prices |
| Message Persistence | Save all messages to JSON for session resume |
| Guided Compaction | Focus summaries on specific topics |
| Callbacks | on_before/after_compress with instruction re-injection |
| Async Token Counting | Sync or async token counter support |
| Token Tracking | Real-time usage monitoring with callbacks |
| Tool Truncation | Automatic truncation of large tool outputs |
| Custom Models | Use any pydantic-ai Model (Azure, custom providers) |
| Lightweight | Only requires pydantic-ai-slim (no extra model SDKs) |
| Package | Description |
|---|---|
| Pydantic Deep Agents | Full agent framework (uses this library) |
| pydantic-ai-backend | File storage and Docker sandbox |
| pydantic-ai-todo | Task planning toolset |
| subagents-pydantic-ai | Multi-agent orchestration |
| pydantic-ai | The foundation — agent framework by Pydantic |
```bash
git clone https://github.com/vstorm-co/summarization-pydantic-ai.git
cd summarization-pydantic-ai
make install
make test  # 100% coverage required
```

MIT — see LICENSE