Commit 16648fe

update readme
1 parent 6ecacb8

File tree

2 files changed: +17 −91 lines


README.md

Lines changed: 17 additions & 91 deletions

````diff
@@ -1,108 +1,34 @@
-# Eval Protocol (EP)
+# Eval Protocol
 
 [![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)
 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/eval-protocol/python-sdk)
 
-**The open-source framework to help you write evals for RL.**
+**Eval Protocol (EP) is an open solution for doing reinforcement learning fine-tuning on existing agents — across any language, container, or framework.**
 
-## 🚀 Features
+![Eval Protocol overview](./docs/intro.png)
 
-- **Pytest authoring**: `@evaluation_test` decorator to configure evaluations
-- **Robust rollouts**: Handles flaky LLM APIs and parallel execution
-- **Integrations**: Works with Langfuse, LangSmith, Braintrust, Responses API
-- **Agent support**: LangGraph and Pydantic AI
-- **MCP RL envs**: Build reinforcement learning environments with MCP
-- **Built-in benchmarks**: AIME, tau-bench
-- **LLM judge**: Stack-rank models using pairwise Arena-Hard-Auto
-- **Local UI**: Pivot/table views for real-time analysis
+Most teams already have complex agents running in production — often across remote services with heavy dependencies, Docker containers, or TypeScript backends deployed on Vercel. When they try to train or fine-tune these agents with reinforcement learning, connecting them to a trainer quickly becomes painful.
 
-## ⚡ Quickstart (no labels needed)
+Eval Protocol makes this possible in two ways:
 
-Install with your tracing platform extras and set API keys:
+1. **Expose your agent through a simple API**
+   Wrap your existing agent (Python, TypeScript, Docker, etc.) in a simple HTTP service using EP’s rollout interface. EP handles the rollout orchestration, metadata passing, and trace storage automatically.
+2. **Connect with any trainer**
+   Once your agent speaks the EP standard, it can be fine-tuned or evaluated with any supported trainer — Fireworks RFT, TRL, Unsloth, or your own — with no environment rewrites.
 
-```bash
-pip install 'eval-protocol[langfuse]'
+The result: RL that works out-of-the-box for existing production agents.
 
-# Model API keys (set what you need)
-export OPENAI_API_KEY=...
-export FIREWORKS_API_KEY=...
-export GEMINI_API_KEY=...
+## Who This Is For
 
-# Platform keys
-export LANGFUSE_PUBLIC_KEY=...
-export LANGFUSE_SECRET_KEY=...
-export LANGFUSE_HOST=https://your-deployment.com # optional
-```
+- **Applied AI teams** adding RL to existing production agents.
+- **Research engineers** experimenting with fine-tuning complex, multi-turn or tool-using agents.
+- **MLOps teams** building reproducible, language-agnostic rollout pipelines.
 
-Minimal evaluation using the built-in AHA judge:
+## Quickstart
 
-```python
-from datetime import datetime
-import pytest
+- See the Quickstart repository: [eval-protocol/quickstart](https://github.com/eval-protocol/quickstart/tree/main)
 
-from eval_protocol import (
-    evaluation_test,
-    aha_judge,
-    EvaluationRow,
-    SingleTurnRolloutProcessor,
-    DynamicDataLoader,
-    create_langfuse_adapter,
-)
-
-
-def langfuse_data_generator() -> list[EvaluationRow]:
-    adapter = create_langfuse_adapter()
-    return adapter.get_evaluation_rows(
-        to_timestamp=datetime.utcnow(),
-        limit=20,
-        sample_size=5,
-    )
-
-
-@pytest.mark.parametrize(
-    "completion_params",
-    [
-        {"model": "openai/gpt-4.1"},
-        {"model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b"},
-    ],
-)
-@evaluation_test(
-    data_loaders=DynamicDataLoader(generators=[langfuse_data_generator]),
-    rollout_processor=SingleTurnRolloutProcessor(),
-)
-async def test_llm_judge(row: EvaluationRow) -> EvaluationRow:
-    return await aha_judge(row)
-```
-
-Run it:
-
-```bash
-pytest -q -s
-```
-
-The pytest output includes local links for a leaderboard and row-level traces (pivot/table) at `http://localhost:8000`.
-
-## Installation
-
-This library requires Python >= 3.10.
-
-### pip
-
-```bash
-pip install eval-protocol
-```
-
-### uv (recommended)
-
-```bash
-# Install uv (if needed)
-curl -LsSf https://astral.sh/uv/install.sh | sh
-
-# Add to your project
-uv add eval-protocol
-```
-
-## 📚 Resources
+## Resources
 
 - **[Documentation](https://evalprotocol.io)** – Guides and API reference
 - **[Discord](https://discord.com/channels/1137072072808472616/1400975572405850155)** – Community
````
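The new README's first step is to wrap an existing agent in a simple HTTP service speaking EP's rollout interface. As a rough illustration of the shape such a wrapper might take — the `/rollout` path, the JSON payload fields, and the `run_agent` stand-in are all hypothetical, not EP's actual interface — a minimal sketch using only the Python standard library:

```python
# Hypothetical sketch: wrapping an existing agent behind an HTTP
# rollout endpoint. The "/rollout" path and payload shape are
# illustrative assumptions, not Eval Protocol's actual API.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_agent(messages: list[dict]) -> dict:
    """Stand-in for an existing production agent: takes a chat
    history and returns a single assistant message."""
    last = messages[-1]["content"] if messages else ""
    return {"role": "assistant", "content": f"echo: {last}"}


class RolloutHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/rollout":
            self.send_error(404)
            return
        # Parse the incoming rollout request: {"messages": [...]}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        messages = payload.get("messages", [])
        # Run one agent turn and return the extended message list.
        reply = run_agent(messages)
        body = json.dumps({"messages": messages + [reply]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Serve rollouts on localhost so a trainer can drive the agent.
    HTTPServer(("127.0.0.1", 8080), RolloutHandler).serve_forever()
```

Because the agent sits behind a plain HTTP boundary, the same pattern applies equally to a TypeScript backend or a Dockerized service, which is the language-agnostic property the README emphasizes.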

docs/intro.png

New binary image (210 KB)
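The README's second step claims that once the agent speaks the EP standard, any trainer can drive it. A sketch of what the trainer side of that loop might look like, again assuming the hypothetical `/rollout` endpoint and payload shape from the step above rather than EP's actual trainer API:

```python
# Hypothetical sketch of the trainer side: requesting rollouts from
# an agent exposed over HTTP. Endpoint path and payload shape are
# illustrative assumptions, not Eval Protocol's actual trainer API.
import json
import urllib.request


def build_rollout_request(base_url: str, messages: list[dict]) -> urllib.request.Request:
    """Build the HTTP POST a trainer would send for one rollout."""
    data = json.dumps({"messages": messages}).encode()
    return urllib.request.Request(
        f"{base_url}/rollout",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def collect_rollouts(base_url: str, prompts: list[str]) -> list[dict]:
    """Request one rollout per prompt and return the completed
    message lists, e.g. to score and feed into an RL update."""
    results = []
    for prompt in prompts:
        req = build_rollout_request(base_url, [{"role": "user", "content": prompt}])
        with urllib.request.urlopen(req) as resp:
            results.append(json.loads(resp.read()))
    return results
```

The point of the indirection is that the trainer only ever sees HTTP requests and JSON message lists, so swapping Fireworks RFT for TRL or Unsloth does not require rewriting the environment.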
