# Eval Protocol

[](https://pypi.org/project/eval-protocol/)
[](https://deepwiki.com/eval-protocol/python-sdk)

**Eval Protocol (EP) is an open solution for reinforcement learning fine-tuning of existing agents, across any language, container, or framework.**

Most teams already have complex agents running in production, often across remote services with heavy dependencies, Docker containers, or TypeScript backends deployed on Vercel. When they try to train or fine-tune these agents with reinforcement learning, connecting them to a trainer quickly becomes painful.

Eval Protocol makes this possible in two ways:

1. **Expose your agent through a simple API**
   Wrap your existing agent (Python, TypeScript, Docker, etc.) in a simple HTTP service using EP's rollout interface. EP handles rollout orchestration, metadata passing, and trace storage automatically.
2. **Connect with any trainer**
   Once your agent speaks the EP standard, it can be fine-tuned or evaluated with any supported trainer (Fireworks RFT, TRL, Unsloth, or your own) with no environment rewrites.

The result: RL that works out of the box for existing production agents.

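The rollout-service idea in step 1 can be sketched with nothing but the Python standard library. Everything below is illustrative: the `/rollout` path, the payload shape, and `run_agent` are hypothetical stand-ins, not EP's actual interface; see the documentation for the real contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_agent(messages: list[dict]) -> dict:
    """Stand-in for your real agent (model call, tool use, etc.)."""
    last = messages[-1]["content"] if messages else ""
    return {"role": "assistant", "content": f"echo: {last}"}


class RolloutHandler(BaseHTTPRequestHandler):
    """Hypothetical rollout endpoint: POST a message list, get it back extended."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        messages = payload.get("messages", [])
        messages.append(run_agent(messages))  # one agent turn per request
        body = json.dumps({"messages": messages}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging


# To serve for real:
# HTTPServer(("0.0.0.0", 8080), RolloutHandler).serve_forever()
```

The trainer side then only needs HTTP: the same request/response shape works whether the handler fronts a Python agent, a Docker container, or a TypeScript service.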
## Who This Is For

- **Applied AI teams** adding RL to existing production agents.
- **Research engineers** experimenting with fine-tuning complex, multi-turn or tool-using agents.
- **MLOps teams** building reproducible, language-agnostic rollout pipelines.

## Quickstart

- See the Quickstart repository: [eval-protocol/quickstart](https://github.com/eval-protocol/quickstart/tree/main)

## Resources

- **[Documentation](https://evalprotocol.io)** – Guides and API reference
- **[Discord](https://discord.com/channels/1137072072808472616/1400975572405850155)** – Community