Commit 985a3da

Merge commit (2 parents: f444a75 + e2f7e2b)

22 files changed: +2205 −402 lines

README.md

Lines changed: 17 additions & 93 deletions

````diff
@@ -1,110 +1,34 @@
-# Eval Protocol (EP)
+# Eval Protocol
 
 [![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)
 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/eval-protocol/python-sdk)
 
-**Stop guessing which AI model to use. Build a data-driven model leaderboard.**
+**Eval Protocol (EP) is an open solution for doing reinforcement learning fine-tuning on existing agents — across any language, container, or framework.**
 
-With hundreds of models and configs, you need objective data to choose the right one for your use case. EP helps you evaluate real traces, compare models, and visualize results locally.
+![Eval Protocol overview](./docs/intro.png)
 
-## 🚀 Features
+Most teams already have complex agents running in production — often across remote services with heavy dependencies, Docker containers, or TypeScript backends deployed on Vercel. When they try to train or fine-tune these agents with reinforcement learning, connecting them to a trainer quickly becomes painful.
 
-- **Pytest authoring**: `@evaluation_test` decorator to configure evaluations
-- **Robust rollouts**: Handles flaky LLM APIs and parallel execution
-- **Integrations**: Works with Langfuse, LangSmith, Braintrust, Responses API
-- **Agent support**: LangGraph and Pydantic AI
-- **MCP RL envs**: Build reinforcement learning environments with MCP
-- **Built-in benchmarks**: AIME, tau-bench
-- **LLM judge**: Stack-rank models using pairwise Arena-Hard-Auto
-- **Local UI**: Pivot/table views for real-time analysis
+Eval Protocol makes this possible in two ways:
 
-## ⚡ Quickstart (no labels needed)
+1. **Expose your agent through a simple API**
+   Wrap your existing agent (Python, TypeScript, Docker, etc.) in a simple HTTP service using EP’s rollout interface. EP handles the rollout orchestration, metadata passing, and trace storage automatically.
+2. **Connect with any trainer**
+   Once your agent speaks the EP standard, it can be fine-tuned or evaluated with any supported trainer — Fireworks RFT, TRL, Unsloth, or your own — with no environment rewrites.
 
-Install with your tracing platform extras and set API keys:
+The result: RL that works out-of-the-box for existing production agents.
 
-```bash
-pip install 'eval-protocol[langfuse]'
+## Who This Is For
 
-# Model API keys (set what you need)
-export OPENAI_API_KEY=...
-export FIREWORKS_API_KEY=...
-export GEMINI_API_KEY=...
+- **Applied AI teams** adding RL to existing production agents.
+- **Research engineers** experimenting with fine-tuning complex, multi-turn or tool-using agents.
+- **MLOps teams** building reproducible, language-agnostic rollout pipelines.
 
-# Platform keys
-export LANGFUSE_PUBLIC_KEY=...
-export LANGFUSE_SECRET_KEY=...
-export LANGFUSE_HOST=https://your-deployment.com  # optional
-```
+## Quickstart
 
-Minimal evaluation using the built-in AHA judge:
+- See the Quickstart repository: [eval-protocol/quickstart](https://github.com/eval-protocol/quickstart/tree/main)
 
-```python
-from datetime import datetime
-import pytest
-
-from eval_protocol import (
-    evaluation_test,
-    aha_judge,
-    EvaluationRow,
-    SingleTurnRolloutProcessor,
-    DynamicDataLoader,
-    create_langfuse_adapter,
-)
-
-
-def langfuse_data_generator() -> list[EvaluationRow]:
-    adapter = create_langfuse_adapter()
-    return adapter.get_evaluation_rows(
-        to_timestamp=datetime.utcnow(),
-        limit=20,
-        sample_size=5,
-    )
-
-
-@pytest.mark.parametrize(
-    "completion_params",
-    [
-        {"model": "openai/gpt-4.1"},
-        {"model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b"},
-    ],
-)
-@evaluation_test(
-    data_loaders=DynamicDataLoader(generators=[langfuse_data_generator]),
-    rollout_processor=SingleTurnRolloutProcessor(),
-)
-async def test_llm_judge(row: EvaluationRow) -> EvaluationRow:
-    return await aha_judge(row)
-```
-
-Run it:
-
-```bash
-pytest -q -s
-```
-
-The pytest output includes local links for a leaderboard and row-level traces (pivot/table) at `http://localhost:8000`.
-
-## Installation
-
-This library requires Python >= 3.10.
-
-### pip
-
-```bash
-pip install eval-protocol
-```
-
-### uv (recommended)
-
-```bash
-# Install uv (if needed)
-curl -LsSf https://astral.sh/uv/install.sh | sh
-
-# Add to your project
-uv add eval-protocol
-```
-
-## 📚 Resources
+## Resources
 
 - **[Documentation](https://evalprotocol.io)** – Guides and API reference
 - **[Discord](https://discord.com/channels/1137072072808472616/1400975572405850155)** – Community
````
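The new README's first step, wrapping an existing agent in a simple HTTP service, can be illustrated with a minimal sketch. This is a hedged illustration only: the `/rollout` route, the request/response fields (`messages`, `reward`), and the `run_agent` stand-in are assumptions, not EP's actual rollout interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread
from urllib.request import Request, urlopen


def run_agent(messages):
    # Stand-in for an existing production agent (which could live in any
    # language behind this HTTP boundary).
    user_text = messages[-1]["content"]
    return messages + [{"role": "assistant", "content": f"echo: {user_text}"}]


class RolloutHandler(BaseHTTPRequestHandler):
    # Hypothetical rollout endpoint: the trainer POSTs a rollout request,
    # the agent replies with its message trace and a reward.
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        messages = run_agent(body["messages"])
        reply = json.dumps({"messages": messages, "reward": 1.0}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):  # keep demo output quiet
        pass


# Serve on an ephemeral port and exercise the endpoint once.
server = HTTPServer(("127.0.0.1", 0), RolloutHandler)
Thread(target=server.serve_forever, daemon=True).start()

req = Request(
    f"http://127.0.0.1:{server.server_port}/rollout",
    data=json.dumps({"messages": [{"role": "user", "content": "hi"}]}).encode(),
    headers={"Content-Type": "application/json"},
)
result = json.loads(urlopen(req).read())
server.shutdown()
print(result["messages"][-1]["content"])
```

In this shape, the trainer never imports the agent's code; it only needs the HTTP contract, which is what makes the approach language- and container-agnostic.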

docs/intro.png

210 KB binary image (diff not rendered)

eval_protocol/auth.py

Lines changed: 68 additions & 3 deletions

```diff
@@ -136,6 +136,56 @@ def _get_credential_from_config_file(key_name: str) -> Optional[str]:
     return None
 
 
+def _get_credentials_from_config_file() -> Dict[str, Optional[str]]:
+    """
+    Retrieve both api_key and account_id from auth.ini with a single read/parse.
+    Tries simple parsing first for both keys, then falls back to configparser for any missing ones.
+    Returns a dict with up to two keys: 'api_key' and 'account_id'.
+    """
+    results: Dict[str, Optional[str]] = {}
+    auth_ini_path = _get_auth_ini_file()
+    if not auth_ini_path.exists():
+        return results
+
+    # 1) Simple key=value parsing
+    try:
+        simple_creds = _parse_simple_auth_file(auth_ini_path)
+        if "api_key" in simple_creds and simple_creds["api_key"]:
+            results["api_key"] = simple_creds["api_key"]
+        if "account_id" in simple_creds and simple_creds["account_id"]:
+            results["account_id"] = simple_creds["account_id"]
+        if "api_key" in results and "account_id" in results:
+            return results
+    except Exception as e:
+        logger.warning("Error during simple parsing of %s: %s", str(auth_ini_path), e)
+
+    # 2) ConfigParser for any missing keys
+    try:
+        config = configparser.ConfigParser()
+        config.read(auth_ini_path)
+        for key_name in ("api_key", "account_id"):
+            if key_name in results and results[key_name]:
+                continue
+            if "fireworks" in config and config.has_option("fireworks", key_name):
+                value_from_file = config.get("fireworks", key_name)
+                if value_from_file:
+                    results[key_name] = value_from_file
+                    continue
+            if config.has_option(config.default_section, key_name):
+                value_from_default = config.get(config.default_section, key_name)
+                if value_from_default:
+                    results[key_name] = value_from_default
+    except configparser.MissingSectionHeaderError:
+        # Purely key=value file without section headers; simple parsing should have handled it already.
+        logger.debug("%s has no section headers; falling back to simple parsing results.", str(auth_ini_path))
+    except configparser.Error as e_config:
+        logger.warning("Configparser error reading %s: %s", str(auth_ini_path), e_config)
+    except Exception as e_general:
+        logger.warning("Unexpected error reading %s: %s", str(auth_ini_path), e_general)
+
+    return results
+
+
 def get_fireworks_api_key() -> Optional[str]:
     """
     Retrieves the Fireworks API key.
@@ -177,13 +227,15 @@ def get_fireworks_account_id() -> Optional[str]:
     The Account ID is sourced in the following order:
     1. FIREWORKS_ACCOUNT_ID environment variable.
     2. 'account_id' from the [fireworks] section of ~/.fireworks/auth.ini.
+    3. If an API key is available (env or auth.ini), resolve via verifyApiKey.
 
     Returns:
         The Account ID if found, otherwise None.
     """
     # If a profile is active, prefer profile file first, then env
     if _is_profile_active():
-        account_id_from_file = _get_credential_from_config_file("account_id")
+        creds = _get_credentials_from_config_file()
+        account_id_from_file = creds.get("account_id")
         if account_id_from_file:
             return account_id_from_file
         account_id = os.environ.get("FIREWORKS_ACCOUNT_ID")
@@ -196,11 +248,24 @@ def get_fireworks_account_id() -> Optional[str]:
     if account_id:
         logger.debug("Using FIREWORKS_ACCOUNT_ID from environment variable.")
         return account_id
-    account_id_from_file = _get_credential_from_config_file("account_id")
+    creds = _get_credentials_from_config_file()
+    account_id_from_file = creds.get("account_id")
     if account_id_from_file:
         return account_id_from_file
 
-    logger.debug("Fireworks Account ID not found in environment variables or auth.ini.")
+    # 3) Fallback: if API key is present, attempt to resolve via verifyApiKey (env or auth.ini)
+    try:
+        # Intentionally use get_fireworks_api_key to centralize precedence (env vs file)
+        api_key_for_verify = get_fireworks_api_key()
+        if api_key_for_verify:
+            resolved = verify_api_key_and_get_account_id(api_key=api_key_for_verify, api_base=get_fireworks_api_base())
+            if resolved:
+                logger.debug("Using FIREWORKS_ACCOUNT_ID resolved via verifyApiKey: %s", resolved)
+                return resolved
+    except Exception as e:
+        logger.debug("Failed to resolve FIREWORKS_ACCOUNT_ID via verifyApiKey: %s", e)
+
+    logger.debug("Fireworks Account ID not found in environment variables, auth.ini, or via verifyApiKey.")
     return None
```
eval_protocol/benchmarks/test_frozen_lake.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -46,7 +46,7 @@ def frozen_lake_to_evaluation_row(data: List[Dict[str, Any]]) -> List[Evaluation
     num_runs=1,
     max_concurrent_rollouts=3,
     mode="pointwise",
-    server_script_path="examples/frozen_lake_mcp/server.py",
+    server_script_path="eval_protocol/mcp_servers/frozen_lake/server.py",
 )
 def test_frozen_lake_evaluation(row: EvaluationRow) -> EvaluationRow:
     """
```

eval_protocol/cli.py

Lines changed: 52 additions & 10 deletions

```diff
@@ -371,13 +371,13 @@ def parse_args(args=None):
         help="Create a Reinforcement Fine-tuning Job on Fireworks",
     )
     rft_parser.add_argument(
-        "--evaluator-id",
-        help="Evaluator ID used during upload; if omitted, derive from local traces or a single discovered test",
+        "--evaluator",
+        help="Evaluator ID or fully-qualified resource (accounts/{acct}/evaluators/{id}); if omitted, derive from local tests",
     )
     # Dataset options
     rft_parser.add_argument(
-        "--dataset-id",
-        help="Use existing Fireworks dataset id (skip local materialization)",
+        "--dataset",
+        help="Use existing dataset (ID or resource 'accounts/{acct}/datasets/{id}') to skip local materialization",
     )
     rft_parser.add_argument(
         "--dataset-jsonl",
@@ -395,38 +395,76 @@
     rft_parser.add_argument("--base-model", help="Base model resource id")
     rft_parser.add_argument("--warm-start-from", help="Addon model to warm start from")
     rft_parser.add_argument("--output-model", help="Output model id (defaults from evaluator)")
-    rft_parser.add_argument("--epochs", type=int, default=8)
+    rft_parser.add_argument("--epochs", type=int, default=1)
     rft_parser.add_argument("--batch-size", type=int, default=128000)
     rft_parser.add_argument("--learning-rate", type=float, default=3e-5)
     rft_parser.add_argument("--max-context-length", type=int, default=65536)
     rft_parser.add_argument("--lora-rank", type=int, default=16)
+    rft_parser.add_argument("--gradient-accumulation-steps", type=int, help="Number of gradient accumulation steps")
+    rft_parser.add_argument("--learning-rate-warmup-steps", type=int, help="Number of LR warmup steps")
     rft_parser.add_argument("--accelerator-count", type=int, default=1)
     rft_parser.add_argument("--region", help="Fireworks region enum value")
     rft_parser.add_argument("--display-name", help="RFT job display name")
     rft_parser.add_argument("--evaluation-dataset", help="Optional separate eval dataset id")
     rft_parser.add_argument("--eval-auto-carveout", dest="eval_auto_carveout", action="store_true", default=True)
     rft_parser.add_argument("--no-eval-auto-carveout", dest="eval_auto_carveout", action="store_false")
     # Rollout chunking
-    rft_parser.add_argument("--chunk-size", type=int, default=10, help="Data chunk size for rollout batching")
+    rft_parser.add_argument("--chunk-size", type=int, default=100, help="Data chunk size for rollout batching")
     # Inference params
     rft_parser.add_argument("--temperature", type=float)
     rft_parser.add_argument("--top-p", type=float)
     rft_parser.add_argument("--top-k", type=int)
-    rft_parser.add_argument("--max-tokens", type=int, default=32768)
-    rft_parser.add_argument("--n", type=int, default=8)
-    rft_parser.add_argument("--inference-extra-body", help="JSON string for extra inference params")
+    rft_parser.add_argument("--max-output-tokens", type=int, default=32768)
+    rft_parser.add_argument("--response-candidates-count", type=int, default=8)
+    rft_parser.add_argument("--extra-body", help="JSON string for extra inference params")
+    # MCP server (optional)
+    rft_parser.add_argument(
+        "--mcp-server",
+        help="The MCP server resource name to use for the reinforcement fine-tuning job.",
+    )
     # Wandb
     rft_parser.add_argument("--wandb-enabled", action="store_true")
     rft_parser.add_argument("--wandb-project")
     rft_parser.add_argument("--wandb-entity")
     rft_parser.add_argument("--wandb-run-id")
     rft_parser.add_argument("--wandb-api-key")
     # Misc
-    rft_parser.add_argument("--rft-job-id", help="Specify an explicit RFT job id")
+    rft_parser.add_argument("--job-id", help="Specify an explicit RFT job id")
     rft_parser.add_argument("--yes", "-y", action="store_true", help="Non-interactive mode")
     rft_parser.add_argument("--dry-run", action="store_true", help="Print planned REST calls without sending")
     rft_parser.add_argument("--force", action="store_true", help="Overwrite existing evaluator with the same ID")
 
+    # Local test command
+    local_test_parser = subparsers.add_parser(
+        "local-test",
+        help="Select an evaluation test and run it locally. If a Dockerfile exists, build and run via Docker; otherwise run on host.",
+    )
+    local_test_parser.add_argument(
+        "--entry",
+        help="Entrypoint to run (path::function or path). If not provided, a selector will be shown (unless --yes).",
+    )
+    local_test_parser.add_argument(
+        "--ignore-docker",
+        action="store_true",
+        help="Ignore Dockerfile even if present; run pytest on host",
+    )
+    local_test_parser.add_argument(
+        "--yes",
+        "-y",
+        action="store_true",
+        help="Non-interactive: if multiple tests exist and no --entry, fails with guidance",
+    )
+    local_test_parser.add_argument(
+        "--docker-build-extra",
+        default="",
+        help="Extra flags to pass to 'docker build' (quoted string, e.g. \"--no-cache --pull --progress=plain\")",
+    )
+    local_test_parser.add_argument(
+        "--docker-run-extra",
+        default="",
+        help="Extra flags to pass to 'docker run' (quoted string, e.g. \"--env-file .env --memory=8g\")",
+    )
+
     # Run command (for Hydra-based evaluations)
     # This subparser intentionally defines no arguments itself.
     # All arguments after 'run' will be passed to Hydra by parse_known_args.
@@ -559,6 +597,10 @@ def _extract_flag_value(argv_list, flag_name):
             return create_rft_command(args)
         print("Error: missing subcommand for 'create'. Try: eval-protocol create rft")
         return 1
+    elif args.command == "local-test":
+        from .cli_commands.local_test import local_test_command
+
+        return local_test_command(args)
     elif args.command == "run":
         # For the 'run' command, Hydra takes over argument parsing.
```
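Per the updated help text, the renamed `--evaluator` and `--dataset` flags accept either a bare ID or a fully-qualified resource name such as `accounts/{acct}/evaluators/{id}`. A minimal sketch of the normalization this implies follows; the helper name and its behavior are assumptions for illustration, not the CLI's actual implementation.

```python
def normalize_resource(value: str, account: str, collection: str) -> str:
    """Hypothetical helper: expand a bare ID into a fully-qualified
    resource name, or pass through a value that is already qualified."""
    if value.startswith("accounts/"):
        return value  # already fully qualified; leave untouched
    return f"accounts/{account}/{collection}/{value}"


# A bare ID gets qualified with the caller's account and collection.
print(normalize_resource("my-eval", "acme", "evaluators"))
# A fully-qualified resource name is accepted as-is.
print(normalize_resource("accounts/acme/datasets/train-v1", "acme", "datasets"))
```

Accepting both forms lets scripts pass whatever identifier they already have, while the CLI canonicalizes before issuing REST calls.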

0 commit comments