Commit 985a3da

Merge commit (2 parents: f444a75 + e2f7e2b)

22 files changed: +2205 −402 lines

README.md

Lines changed: 17 additions & 93 deletions

````diff
@@ -1,110 +1,34 @@
-# Eval Protocol (EP)
+# Eval Protocol
 
 [![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)
 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/eval-protocol/python-sdk)
 
-**Stop guessing which AI model to use. Build a data-driven model leaderboard.**
+**Eval Protocol (EP) is an open solution for doing reinforcement learning fine-tuning on existing agents — across any language, container, or framework.**
 
-With hundreds of models and configs, you need objective data to choose the right one for your use case. EP helps you evaluate real traces, compare models, and visualize results locally.
+![Eval Protocol overview](./docs/intro.png)
 
-## 🚀 Features
+Most teams already have complex agents running in production — often across remote services with heavy dependencies, Docker containers, or TypeScript backends deployed on Vercel. When they try to train or fine-tune these agents with reinforcement learning, connecting them to a trainer quickly becomes painful.
 
-- **Pytest authoring**: `@evaluation_test` decorator to configure evaluations
-- **Robust rollouts**: Handles flaky LLM APIs and parallel execution
-- **Integrations**: Works with Langfuse, LangSmith, Braintrust, Responses API
-- **Agent support**: LangGraph and Pydantic AI
-- **MCP RL envs**: Build reinforcement learning environments with MCP
-- **Built-in benchmarks**: AIME, tau-bench
-- **LLM judge**: Stack-rank models using pairwise Arena-Hard-Auto
-- **Local UI**: Pivot/table views for real-time analysis
+Eval Protocol makes this possible in two ways:
 
-## ⚡ Quickstart (no labels needed)
+1. **Expose your agent through a simple API**
+   Wrap your existing agent (Python, TypeScript, Docker, etc.) in a simple HTTP service using EP’s rollout interface. EP handles the rollout orchestration, metadata passing, and trace storage automatically.
+2. **Connect with any trainer**
+   Once your agent speaks the EP standard, it can be fine-tuned or evaluated with any supported trainer — Fireworks RFT, TRL, Unsloth, or your own — with no environment rewrites.
 
-Install with your tracing platform extras and set API keys:
+The result: RL that works out-of-the-box for existing production agents.
 
-```bash
-pip install 'eval-protocol[langfuse]'
+## Who This Is For
 
-# Model API keys (set what you need)
-export OPENAI_API_KEY=...
-export FIREWORKS_API_KEY=...
-export GEMINI_API_KEY=...
+- **Applied AI teams** adding RL to existing production agents.
+- **Research engineers** experimenting with fine-tuning complex, multi-turn or tool-using agents.
+- **MLOps teams** building reproducible, language-agnostic rollout pipelines.
 
-# Platform keys
-export LANGFUSE_PUBLIC_KEY=...
-export LANGFUSE_SECRET_KEY=...
-export LANGFUSE_HOST=https://your-deployment.com  # optional
-```
+## Quickstart
 
-Minimal evaluation using the built-in AHA judge:
+- See the Quickstart repository: [eval-protocol/quickstart](https://github.com/eval-protocol/quickstart/tree/main)
 
-```python
-from datetime import datetime
-import pytest
-
-from eval_protocol import (
-    evaluation_test,
-    aha_judge,
-    EvaluationRow,
-    SingleTurnRolloutProcessor,
-    DynamicDataLoader,
-    create_langfuse_adapter,
-)
-
-
-def langfuse_data_generator() -> list[EvaluationRow]:
-    adapter = create_langfuse_adapter()
-    return adapter.get_evaluation_rows(
-        to_timestamp=datetime.utcnow(),
-        limit=20,
-        sample_size=5,
-    )
-
-
-@pytest.mark.parametrize(
-    "completion_params",
-    [
-        {"model": "openai/gpt-4.1"},
-        {"model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b"},
-    ],
-)
-@evaluation_test(
-    data_loaders=DynamicDataLoader(generators=[langfuse_data_generator]),
-    rollout_processor=SingleTurnRolloutProcessor(),
-)
-async def test_llm_judge(row: EvaluationRow) -> EvaluationRow:
-    return await aha_judge(row)
-```
-
-Run it:
-
-```bash
-pytest -q -s
-```
-
-The pytest output includes local links for a leaderboard and row-level traces (pivot/table) at `http://localhost:8000`.
-
-## Installation
-
-This library requires Python >= 3.10.
-
-### pip
-
-```bash
-pip install eval-protocol
-```
-
-### uv (recommended)
-
-```bash
-# Install uv (if needed)
-curl -LsSf https://astral.sh/uv/install.sh | sh
-
-# Add to your project
-uv add eval-protocol
-```
-
-## 📚 Resources
+## Resources
 
 - **[Documentation](https://evalprotocol.io)** – Guides and API reference
 - **[Discord](https://discord.com/channels/1137072072808472616/1400975572405850155)** – Community
````
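The new README's first step, wrapping an existing agent in a simple HTTP service, can be illustrated with a minimal sketch. This is a hedged illustration only: the `/rollout` route, the request/response fields (`messages`, `reward`), and the `run_agent` stand-in are assumptions, not EP's actual rollout interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread
from urllib.request import Request, urlopen


def run_agent(messages):
    # Stand-in for an existing production agent (which could live in any
    # language behind this HTTP boundary).
    user_text = messages[-1]["content"]
    return messages + [{"role": "assistant", "content": f"echo: {user_text}"}]


class RolloutHandler(BaseHTTPRequestHandler):
    # Hypothetical rollout endpoint: the trainer POSTs a rollout request,
    # the agent replies with its message trace and a reward.
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        messages = run_agent(body["messages"])
        reply = json.dumps({"messages": messages, "reward": 1.0}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):  # keep demo output quiet
        pass


# Serve on an ephemeral port and exercise the endpoint once.
server = HTTPServer(("127.0.0.1", 0), RolloutHandler)
Thread(target=server.serve_forever, daemon=True).start()

req = Request(
    f"http://127.0.0.1:{server.server_port}/rollout",
    data=json.dumps({"messages": [{"role": "user", "content": "hi"}]}).encode(),
    headers={"Content-Type": "application/json"},
)
result = json.loads(urlopen(req).read())
server.shutdown()
print(result["messages"][-1]["content"])
```

In this shape, the trainer never imports the agent's code; it only needs the HTTP contract, which is what makes the approach language- and container-agnostic.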

docs/intro.png

210 KB binary image (diff not rendered)

eval_protocol/auth.py

Lines changed: 68 additions & 3 deletions

```diff
@@ -136,6 +136,56 @@ def _get_credential_from_config_file(key_name: str) -> Optional[str]:
     return None
 
 
+def _get_credentials_from_config_file() -> Dict[str, Optional[str]]:
+    """
+    Retrieve both api_key and account_id from auth.ini with a single read/parse.
+    Tries simple parsing first for both keys, then falls back to configparser for any missing ones.
+    Returns a dict with up to two keys: 'api_key' and 'account_id'.
+    """
+    results: Dict[str, Optional[str]] = {}
+    auth_ini_path = _get_auth_ini_file()
+    if not auth_ini_path.exists():
+        return results
+
+    # 1) Simple key=value parsing
+    try:
+        simple_creds = _parse_simple_auth_file(auth_ini_path)
+        if "api_key" in simple_creds and simple_creds["api_key"]:
+            results["api_key"] = simple_creds["api_key"]
+        if "account_id" in simple_creds and simple_creds["account_id"]:
+            results["account_id"] = simple_creds["account_id"]
+        if "api_key" in results and "account_id" in results:
+            return results
+    except Exception as e:
+        logger.warning("Error during simple parsing of %s: %s", str(auth_ini_path), e)
+
+    # 2) ConfigParser for any missing keys
+    try:
+        config = configparser.ConfigParser()
+        config.read(auth_ini_path)
+        for key_name in ("api_key", "account_id"):
+            if key_name in results and results[key_name]:
+                continue
+            if "fireworks" in config and config.has_option("fireworks", key_name):
+                value_from_file = config.get("fireworks", key_name)
+                if value_from_file:
+                    results[key_name] = value_from_file
+                    continue
+            if config.has_option(config.default_section, key_name):
+                value_from_default = config.get(config.default_section, key_name)
+                if value_from_default:
+                    results[key_name] = value_from_default
+    except configparser.MissingSectionHeaderError:
+        # Purely key=value file without section headers; simple parsing should have handled it already.
+        logger.debug("%s has no section headers; falling back to simple parsing results.", str(auth_ini_path))
+    except configparser.Error as e_config:
+        logger.warning("Configparser error reading %s: %s", str(auth_ini_path), e_config)
+    except Exception as e_general:
+        logger.warning("Unexpected error reading %s: %s", str(auth_ini_path), e_general)
+
+    return results
+
+
 def get_fireworks_api_key() -> Optional[str]:
     """
     Retrieves the Fireworks API key.
@@ -177,13 +227,15 @@ def get_fireworks_account_id() -> Optional[str]:
     The Account ID is sourced in the following order:
     1. FIREWORKS_ACCOUNT_ID environment variable.
     2. 'account_id' from the [fireworks] section of ~/.fireworks/auth.ini.
+    3. If an API key is available (env or auth.ini), resolve via verifyApiKey.
 
     Returns:
         The Account ID if found, otherwise None.
     """
     # If a profile is active, prefer profile file first, then env
     if _is_profile_active():
-        account_id_from_file = _get_credential_from_config_file("account_id")
+        creds = _get_credentials_from_config_file()
+        account_id_from_file = creds.get("account_id")
         if account_id_from_file:
             return account_id_from_file
         account_id = os.environ.get("FIREWORKS_ACCOUNT_ID")
@@ -196,11 +248,24 @@ def get_fireworks_account_id() -> Optional[str]:
     if account_id:
         logger.debug("Using FIREWORKS_ACCOUNT_ID from environment variable.")
         return account_id
-    account_id_from_file = _get_credential_from_config_file("account_id")
+    creds = _get_credentials_from_config_file()
+    account_id_from_file = creds.get("account_id")
     if account_id_from_file:
         return account_id_from_file
 
-    logger.debug("Fireworks Account ID not found in environment variables or auth.ini.")
+    # 3) Fallback: if API key is present, attempt to resolve via verifyApiKey (env or auth.ini)
+    try:
+        # Intentionally use get_fireworks_api_key to centralize precedence (env vs file)
+        api_key_for_verify = get_fireworks_api_key()
+        if api_key_for_verify:
+            resolved = verify_api_key_and_get_account_id(api_key=api_key_for_verify, api_base=get_fireworks_api_base())
+            if resolved:
+                logger.debug("Using FIREWORKS_ACCOUNT_ID resolved via verifyApiKey: %s", resolved)
+                return resolved
+    except Exception as e:
+        logger.debug("Failed to resolve FIREWORKS_ACCOUNT_ID via verifyApiKey: %s", e)
+
+    logger.debug("Fireworks Account ID not found in environment variables, auth.ini, or via verifyApiKey.")
     return None
```
eval_protocol/benchmarks/test_frozen_lake.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -46,7 +46,7 @@ def frozen_lake_to_evaluation_row(data: List[Dict[str, Any]]) -> List[Evaluation
     num_runs=1,
     max_concurrent_rollouts=3,
     mode="pointwise",
-    server_script_path="examples/frozen_lake_mcp/server.py",
+    server_script_path="eval_protocol/mcp_servers/frozen_lake/server.py",
 )
 def test_frozen_lake_evaluation(row: EvaluationRow) -> EvaluationRow:
     """
```

eval_protocol/cli.py

Lines changed: 52 additions & 10 deletions

```diff
@@ -371,13 +371,13 @@ def parse_args(args=None):
         help="Create a Reinforcement Fine-tuning Job on Fireworks",
     )
     rft_parser.add_argument(
-        "--evaluator-id",
-        help="Evaluator ID used during upload; if omitted, derive from local traces or a single discovered test",
+        "--evaluator",
+        help="Evaluator ID or fully-qualified resource (accounts/{acct}/evaluators/{id}); if omitted, derive from local tests",
     )
     # Dataset options
     rft_parser.add_argument(
-        "--dataset-id",
-        help="Use existing Fireworks dataset id (skip local materialization)",
+        "--dataset",
+        help="Use existing dataset (ID or resource 'accounts/{acct}/datasets/{id}') to skip local materialization",
     )
     rft_parser.add_argument(
         "--dataset-jsonl",
@@ -395,38 +395,76 @@
     rft_parser.add_argument("--base-model", help="Base model resource id")
     rft_parser.add_argument("--warm-start-from", help="Addon model to warm start from")
     rft_parser.add_argument("--output-model", help="Output model id (defaults from evaluator)")
-    rft_parser.add_argument("--epochs", type=int, default=8)
+    rft_parser.add_argument("--epochs", type=int, default=1)
     rft_parser.add_argument("--batch-size", type=int, default=128000)
     rft_parser.add_argument("--learning-rate", type=float, default=3e-5)
     rft_parser.add_argument("--max-context-length", type=int, default=65536)
     rft_parser.add_argument("--lora-rank", type=int, default=16)
+    rft_parser.add_argument("--gradient-accumulation-steps", type=int, help="Number of gradient accumulation steps")
+    rft_parser.add_argument("--learning-rate-warmup-steps", type=int, help="Number of LR warmup steps")
     rft_parser.add_argument("--accelerator-count", type=int, default=1)
     rft_parser.add_argument("--region", help="Fireworks region enum value")
     rft_parser.add_argument("--display-name", help="RFT job display name")
     rft_parser.add_argument("--evaluation-dataset", help="Optional separate eval dataset id")
     rft_parser.add_argument("--eval-auto-carveout", dest="eval_auto_carveout", action="store_true", default=True)
     rft_parser.add_argument("--no-eval-auto-carveout", dest="eval_auto_carveout", action="store_false")
     # Rollout chunking
-    rft_parser.add_argument("--chunk-size", type=int, default=10, help="Data chunk size for rollout batching")
+    rft_parser.add_argument("--chunk-size", type=int, default=100, help="Data chunk size for rollout batching")
     # Inference params
     rft_parser.add_argument("--temperature", type=float)
     rft_parser.add_argument("--top-p", type=float)
     rft_parser.add_argument("--top-k", type=int)
-    rft_parser.add_argument("--max-tokens", type=int, default=32768)
-    rft_parser.add_argument("--n", type=int, default=8)
-    rft_parser.add_argument("--inference-extra-body", help="JSON string for extra inference params")
+    rft_parser.add_argument("--max-output-tokens", type=int, default=32768)
+    rft_parser.add_argument("--response-candidates-count", type=int, default=8)
+    rft_parser.add_argument("--extra-body", help="JSON string for extra inference params")
+    # MCP server (optional)
+    rft_parser.add_argument(
+        "--mcp-server",
+        help="The MCP server resource name to use for the reinforcement fine-tuning job.",
+    )
     # Wandb
     rft_parser.add_argument("--wandb-enabled", action="store_true")
     rft_parser.add_argument("--wandb-project")
     rft_parser.add_argument("--wandb-entity")
     rft_parser.add_argument("--wandb-run-id")
     rft_parser.add_argument("--wandb-api-key")
     # Misc
-    rft_parser.add_argument("--rft-job-id", help="Specify an explicit RFT job id")
+    rft_parser.add_argument("--job-id", help="Specify an explicit RFT job id")
     rft_parser.add_argument("--yes", "-y", action="store_true", help="Non-interactive mode")
     rft_parser.add_argument("--dry-run", action="store_true", help="Print planned REST calls without sending")
     rft_parser.add_argument("--force", action="store_true", help="Overwrite existing evaluator with the same ID")
 
+    # Local test command
+    local_test_parser = subparsers.add_parser(
+        "local-test",
+        help="Select an evaluation test and run it locally. If a Dockerfile exists, build and run via Docker; otherwise run on host.",
+    )
+    local_test_parser.add_argument(
+        "--entry",
+        help="Entrypoint to run (path::function or path). If not provided, a selector will be shown (unless --yes).",
+    )
+    local_test_parser.add_argument(
+        "--ignore-docker",
+        action="store_true",
+        help="Ignore Dockerfile even if present; run pytest on host",
+    )
+    local_test_parser.add_argument(
+        "--yes",
+        "-y",
+        action="store_true",
+        help="Non-interactive: if multiple tests exist and no --entry, fails with guidance",
+    )
+    local_test_parser.add_argument(
+        "--docker-build-extra",
+        default="",
+        help="Extra flags to pass to 'docker build' (quoted string, e.g. \"--no-cache --pull --progress=plain\")",
+    )
+    local_test_parser.add_argument(
+        "--docker-run-extra",
+        default="",
+        help="Extra flags to pass to 'docker run' (quoted string, e.g. \"--env-file .env --memory=8g\")",
+    )
+
     # Run command (for Hydra-based evaluations)
     # This subparser intentionally defines no arguments itself.
     # All arguments after 'run' will be passed to Hydra by parse_known_args.
@@ -559,6 +597,10 @@ def _extract_flag_value(argv_list, flag_name):
             return create_rft_command(args)
         print("Error: missing subcommand for 'create'. Try: eval-protocol create rft")
         return 1
+    elif args.command == "local-test":
+        from .cli_commands.local_test import local_test_command
+
+        return local_test_command(args)
     elif args.command == "run":
         # For the 'run' command, Hydra takes over argument parsing.
```
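Per the updated help text, the renamed `--evaluator` and `--dataset` flags accept either a bare ID or a fully-qualified resource name such as `accounts/{acct}/evaluators/{id}`. A minimal sketch of the normalization this implies follows; the helper name and its behavior are assumptions for illustration, not the CLI's actual implementation.

```python
def normalize_resource(value: str, account: str, collection: str) -> str:
    """Hypothetical helper: expand a bare ID into a fully-qualified
    resource name, or pass through a value that is already qualified."""
    if value.startswith("accounts/"):
        return value  # already fully qualified; leave untouched
    return f"accounts/{account}/{collection}/{value}"


# A bare ID gets qualified with the caller's account and collection.
print(normalize_resource("my-eval", "acme", "evaluators"))
# A fully-qualified resource name is accepted as-is.
print(normalize_resource("accounts/acme/datasets/train-v1", "acme", "datasets"))
```

Accepting both forms lets scripts pass whatever identifier they already have, while the CLI canonicalizes before issuing REST calls.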

0 commit comments