**Eval Protocol (EP) is an open solution for doing reinforcement learning fine-tuning on existing agents — across any language, container, or framework.**

Most teams already have complex agents running in production — often as remote services with heavy dependencies, in Docker containers, or as TypeScript backends deployed on Vercel. When they try to train or fine-tune these agents with reinforcement learning, connecting them to a trainer quickly becomes painful.

Eval Protocol makes this training possible in two ways:

1. **Expose your agent through a simple API**

   Wrap your existing agent (Python, TypeScript, Docker, etc.) in a small HTTP service that implements EP's rollout interface. EP handles rollout orchestration, metadata passing, and trace storage automatically.

2. **Connect with any trainer**

   Once your agent speaks the EP standard, it can be fine-tuned or evaluated with any supported trainer — Fireworks RFT, TRL, Unsloth, or your own — with no environment rewrites.

The result: RL that works out of the box for existing production agents.
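To make step 1 concrete, here is a minimal sketch of an agent wrapped in an HTTP rollout service using only the Python standard library. The request and response fields (`prompt`, `messages`, `reward`), the port, and the agent logic are illustrative assumptions for this sketch — they are not EP's actual rollout schema, which is defined by the EP docs.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_rollout(payload: dict) -> dict:
    """Run one rollout: call the agent and score the result.

    Hypothetical schema for illustration only. A real agent would call
    its own tools/LLMs here and compute a task-specific reward.
    """
    prompt = payload.get("prompt", "")
    completion = f"echo: {prompt}"       # stand-in for real agent output
    reward = 1.0 if prompt else 0.0      # stand-in for a real reward signal
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ],
        "reward": reward,
    }


class RolloutHandler(BaseHTTPRequestHandler):
    """Accepts a JSON POST, runs one rollout, and returns the trace + reward."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        result = run_rollout(json.loads(body or b"{}"))
        data = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve():
    HTTPServer(("127.0.0.1", 8000), RolloutHandler).serve_forever()

# serve()  # uncomment to run the rollout service
```

The same shape works for a TypeScript or Dockerized agent: anything that can answer an HTTP request can expose a rollout endpoint.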
## Who This Is For

- **Applied AI teams** adding RL to existing production agents.
- **Research engineers** experimenting with fine-tuning complex, multi-turn or tool-using agents.
- **MLOps teams** building reproducible, language-agnostic rollout pipelines.
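For readers wiring up the trainer side of step 2, the collection loop a trainer might run against such a rollout service can be approximated with a plain HTTP client. The endpoint URL, payload shape, and `reward` field below are illustrative assumptions rather than EP's actual schema, and real trainers (Fireworks RFT, TRL, Unsloth) have their own integrations.

```python
import json
import urllib.request


def collect_rollouts(base_url, prompts, transport=None):
    """Collect (trace, reward) pairs by POSTing prompts to a rollout service.

    `transport` lets tests inject a fake HTTP call; by default we POST
    JSON to `base_url` with urllib. All field names are illustrative.
    """
    def default_transport(payload):
        req = urllib.request.Request(
            base_url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    call = transport or default_transport
    batch = []
    for prompt in prompts:
        result = call({"prompt": prompt})
        batch.append((result["messages"], result["reward"]))
    return batch
```

Because the agent is behind a plain HTTP boundary, this loop is the same whether the agent is a local Python process, a Docker container, or a remote TypeScript deployment.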