The Eval Protocol provides a command-line interface (CLI) for common operations like previewing evaluations, deploying reward functions, and running agent evaluations.
When you install the Eval Protocol, the CLI is automatically installed:
pip install eval-protocol

You can verify the installation by running:

eval-protocol --help

Before using the CLI, set up your authentication credentials:
# Set your API key
export FIREWORKS_API_KEY=your_api_key
# Optional: Set the API base URL (for development environments)
export FIREWORKS_API_BASE=https://api.fireworks.ai

The Eval Protocol CLI supports the following main commands:
- run: Run a local evaluation pipeline using a Hydra configuration.
- preview: Preview evaluation results or re-evaluate generated outputs.
- deploy: Deploy a reward function as an evaluator.
- agent-eval: Run agent evaluations on task bundles.
- list: List existing evaluators (coming soon).
- delete: Delete an evaluator (coming soon).
The run command is the primary way to execute local evaluation pipelines. It leverages Hydra for configuration, allowing you to define complex evaluation setups (including dataset loading, model generation, and reward application) in YAML files and easily override parameters from the command line.
python -m eval_protocol.cli run [options] [HYDRA_OVERRIDES...]

or

eval-protocol run [options] [HYDRA_OVERRIDES...]

- --config-path TEXT: Path to the directory containing your Hydra configuration files. (Required)
- --config-name TEXT: Name of the main Hydra configuration file (e.g., run_my_eval.yaml). (Required)
- --multirun or -m: Run multiple jobs (e.g., for sweeping over parameters). Refer to the Hydra documentation for multi-run usage.
- --help: Show the help message for the run command.
You can override any parameter defined in your Hydra configuration YAML files directly on the command line. For detailed information on how Hydra is used, refer to the Hydra Configuration for Examples guide.
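For orientation, a minimal main config such as run_my_eval.yaml might look roughly like the sketch below. The keys shown are illustrative assumptions based on the override names used in the examples that follow (evaluation_params.limit_samples, generation.model_name) and the conf/dataset layout mentioned later; your actual schema may differ, so check the example configs shipped with the project:

# run_my_eval.yaml -- illustrative sketch; key names are assumptions, not a fixed schema
defaults:
  - dataset: my_data        # illustrative: picks conf/dataset/my_data.yaml
  - _self_

evaluation_params:
  limit_samples: 100        # number of samples to evaluate; override on the CLI as needed

generation:
  model_name: "accounts/fireworks/models/mixtral-8x7b-instruct"   # model used to generate responses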
# Basic usage, running an evaluation defined in examples/math_example/conf/run_math_eval.yaml
python -m eval_protocol.cli run \
--config-path examples/math_example/conf \
--config-name run_math_eval.yaml
# Override the number of samples to process and the model name
python -m eval_protocol.cli run \
--config-path examples/math_example/conf \
--config-name run_math_eval.yaml \
evaluation_params.limit_samples=10 \
generation.model_name="accounts/fireworks/models/mixtral-8x7b-instruct"

The run command typically generates:
- A timestamped output directory (e.g., outputs/YYYY-MM-DD/HH-MM-SS/).
- Inside this directory:
  - .hydra/: Contains the full Hydra configuration for the run (for reproducibility).
  - Log files.
  - Result files, often including:
    - <config_output_name>_results.jsonl (e.g., math_example_results.jsonl): Detailed evaluation results for each sample.
    - preview_input_output_pairs.jsonl: Generated prompts and responses, suitable for use with eval-protocol preview.
- Console output: A summary report is logged to the console, including:
  - Total samples processed.
  - Number of successful evaluations.
  - Number of evaluation errors.
  - Average, min, and max scores (if applicable).
  - Score distribution.
  - Details of the first few errors encountered.
The preview command allows you to test reward functions with sample data. A primary use case is to inspect or re-evaluate the preview_input_output_pairs.jsonl file generated by the eval-protocol run command. This allows you to iterate on reward logic using a fixed set of model generations or to apply different metrics to the same outputs.
You can also use it with manually created sample files.
eval-protocol preview [options]

- --metrics-folders: Specify local metric scripts to apply, in the format "name=path/to/metric_script_dir". The directory should contain a main.py with a @reward_function (see the sketch below).
- --samples: Path to a JSONL file containing sample conversations or prompt/response pairs. This is typically the preview_input_output_pairs.jsonl file from an eval-protocol run output directory.
- --remote-url: (Optional) URL of a deployed evaluator to use for scoring, instead of local --metrics-folders.
- --max-samples: Maximum number of samples to process (optional)
- --output: Path to save preview results (optional)
- --verbose: Enable verbose output (optional)
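As a starting point, a metric script's main.py might look roughly like the sketch below. The import path and the EvaluateResult return type are assumptions about the eval_protocol Python API (only the @reward_function decorator and the main.py filename are stated above); see the Creating Your First Reward Function tutorial for the exact signature your version expects:

# main.py -- illustrative sketch; import path and result type are assumptions
from eval_protocol import reward_function, EvaluateResult

@reward_function
def evaluate(messages, ground_truth=None, **kwargs) -> EvaluateResult:
    """Score the final assistant turn with a trivial non-emptiness check."""
    response = messages[-1].get("content", "") if messages else ""
    score = 1.0 if response.strip() else 0.0
    return EvaluateResult(
        score=score,
        reason="non-empty response" if score else "empty response",
    )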
# Previewing output from a `eval-protocol run` command with a local metric
eval-protocol preview \
--samples ./outputs/YYYY-MM-DD/HH-MM-SS/preview_input_output_pairs.jsonl \
--metrics-folders "my_custom_metric=./path/to/my_custom_metric"
# Previewing with multiple local metrics
eval-protocol preview \
--samples ./outputs/YYYY-MM-DD/HH-MM-SS/preview_input_output_pairs.jsonl \
--metrics-folders "metric1=./metrics/metric1" "metric2=./metrics/metric2"
# Limit sample count
eval-protocol preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl --max-samples 5
# Save results to file
eval-protocol preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl --output ./results.jsonThe samples file should be a JSONL (JSON Lines) file. If it's the output from eval-protocol run (preview_input_output_pairs.jsonl), each line typically contains a "messages" list (including system, user, and assistant turns) and optionally a "ground_truth" field. If creating manually, a common format is:
{"messages": [{"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a method of data analysis..."}]}Or, if you have ground truth for comparison:
{"messages": [{"role": "user", "content": "Question..."}, {"role": "assistant", "content": "Model answer..."}], "ground_truth": "Reference answer..."}The deploy command deploys a reward function as an evaluator on the Fireworks platform.
eval-protocol deploy [options]

- --id: ID for the deployed evaluator (required)
- --metrics-folders: Specify metrics to use in the format "name=path" (required)
- --display-name: Human-readable name for the evaluator (optional)
- --description: Description of the evaluator (optional)
- --force: Overwrite if an evaluator with the same ID already exists (optional)
- --providers: List of model providers to use (optional)
- --verbose: Enable verbose output (optional)
# Basic deployment
eval-protocol deploy --id my-evaluator --metrics-folders "clarity=./my_metrics/clarity"
# With display name and description
eval-protocol deploy --id my-evaluator \
--metrics-folders "clarity=./my_metrics/clarity" \
--display-name "Clarity Evaluator" \
--description "Evaluates responses based on clarity"
# Force overwrite existing evaluator
eval-protocol deploy --id my-evaluator \
--metrics-folders "clarity=./my_metrics/clarity" \
--force
# Multiple metrics
eval-protocol deploy --id comprehensive-evaluator \
--metrics-folders "clarity=./my_metrics/clarity" "accuracy=./my_metrics/accuracy" \
--display-name "Comprehensive Evaluator"A typical development workflow using the CLI now often involves eval-protocol run first:
- Configure: Set up your dataset and evaluation parameters in Hydra YAML files (e.g., conf/dataset/my_data.yaml, conf/run_my_eval.yaml). Define or reference your reward function logic.
- Run: Execute the evaluation pipeline using eval-protocol run. This generates model responses and initial scores.

  python -m eval_protocol.cli run --config-path ./conf --config-name run_my_eval.yaml

- Analyze & Iterate:
  - Examine the detailed results (*_results.jsonl) and the preview_input_output_pairs.jsonl from the output directory.
  - If iterating on reward logic, you can use eval-protocol preview with the preview_input_output_pairs.jsonl and your updated local metric script:

    eval-protocol preview \
      --samples ./outputs/YYYY-MM-DD/HH-MM-SS/preview_input_output_pairs.jsonl \
      --metrics-folders "my_refined_metric=./path/to/refined_metric"

  - Refine your reward function code or Hydra configurations.
- Re-run: If configurations changed significantly or you need new model generations, re-run eval-protocol run.
- Deploy: Once satisfied with the evaluator's performance and configuration:

  eval-protocol deploy --id my-evaluator-id \
    --metrics-folders "my_final_metric=./path/to/final_metric" \
    --display-name "My Final Evaluator" \
    --description "Description of my evaluator" \
    --force

  (Note: The --metrics-folders for deploy should point to the finalized reward function script(s) you intend to deploy as the evaluator.)
You can preview multiple metrics to compare their performance:
# Preview with multiple metrics
eval-protocol preview \
--metrics-folders \
"metric1=./my_metrics/metric1" \
"metric2=./my_metrics/metric2" \
"metric3=./my_metrics/metric3" \
--samples ./samples.jsonl

You can deploy with specific model providers:
# Deploy with custom provider
eval-protocol deploy --id my-evaluator \
--metrics-folders "clarity=./my_metrics/clarity" \
--providers '[{"providerType":"anthropic","modelId":"claude-3-sonnet-20240229"}]'The agent-eval command enables you to run agent evaluations using task bundles.
eval-protocol agent-eval [options]

- --task-dir: Path to the task bundle directory containing reward.py, tools.py, etc.
- --dataset or -d: Path to a JSONL file containing task specifications.
- --output-dir or -o: Directory to store evaluation runs (default: "./runs").
- --model: Override the MODEL_AGENT environment variable.
- --sim-model: Override the MODEL_SIM environment variable for the simulated user.
- --no-sim-user: Disable the simulated user (use static initial messages only).
- --test-mode: Run in test mode without requiring API keys.
- --mock-response: Use a mock agent response (works with --test-mode).
- --debug: Enable detailed debug logging.
- --validate-only: Validate the task bundle structure without running the evaluation.
- --export-tools: Export tool specifications to a directory for manual testing.
- --task-ids: Comma-separated list of task IDs to run.
- --max-tasks: Maximum number of tasks to evaluate.
- --registries: Custom tool registries in the format 'name=path'.
- --registry-override: Override all toolset paths with this registry path.
- --evaluator: Custom evaluator module path (overrides the default).
Note: The following examples use examples/your_agent_task_bundle/ as a placeholder. You will need to replace this with the actual path to your task bundle directory.
# Run agent evaluation with default settings, assuming MODEL_AGENT is set
export MODEL_AGENT=openai/gpt-4o-mini # Example model
eval-protocol agent-eval --task-dir examples/your_agent_task_bundle/
# Use a specific dataset file from your task bundle
eval-protocol agent-eval --dataset examples/your_agent_task_bundle/task.jsonl --task-dir examples/your_agent_task_bundle/
# Run in test mode (no API keys required)
eval-protocol agent-eval --task-dir examples/your_agent_task_bundle/ --test-mode --mock-response
# Validate task bundle structure without running
eval-protocol agent-eval --task-dir examples/your_agent_task_bundle/ --validate-only
# Use a custom model and limit to specific tasks
eval-protocol agent-eval --task-dir examples/your_agent_task_bundle/ \
--model anthropic/claude-3-opus-20240229 \
--task-ids your_task.id.001,your_task.id.002
# Export tool specifications for manual testing
eval-protocol agent-eval --task-dir examples/your_agent_task_bundle/ --export-tools ./tool_specs

A task bundle is a directory containing the following files:
- reward.py: Reward function with the @reward_function decorator
- tools.py: Tool registry with tool definitions
- task.jsonl: Dataset rows with task specifications
- seed.sql (optional): Initial database state
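Laid out on disk, a bundle might look like this (the directory name is just a placeholder):

your_agent_task_bundle/
├── reward.py    # reward function decorated with @reward_function
├── tools.py     # tool registry with tool definitions
├── task.jsonl   # one task specification per line
└── seed.sql     # optional: initial database state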
See the Agent Evaluation guide for more details.
The CLI recognizes the following environment variables:
- FIREWORKS_API_KEY: Your Fireworks API key (required for deployment operations)
- FIREWORKS_API_BASE: Base URL for the Fireworks API (defaults to https://api.fireworks.ai)
- MODEL_AGENT: Default agent model to use (e.g., "openai/gpt-4o-mini")
- MODEL_SIM: Default simulation model to use (e.g., "openai/gpt-3.5-turbo")
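For example, a local session might start with exports along these lines (the model identifiers are placeholders taken from the examples above):

export FIREWORKS_API_KEY=your_api_key
export MODEL_AGENT=openai/gpt-4o-mini
export MODEL_SIM=openai/gpt-3.5-turbo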
- Authentication Errors:
  Error: Authentication failed. Check your API key.
  Solution: Ensure FIREWORKS_API_KEY is correctly set.
- Metrics Folder Not Found:
  Error: Metrics folder not found: ./my_metrics/clarity
  Solution: Check that the path exists and contains a valid main.py file.
- Invalid Sample File:
  Error: Failed to parse sample file. Ensure it's a valid JSONL file.
  Solution: Verify the sample file is in the correct JSONL format.
- Deployment Permission Issues:
  Error: Permission denied. Your API key doesn't have deployment permissions.
  Solution: Use a production API key with deployment permissions or request additional permissions.
- Task Bundle Validation Errors:
  Error: Missing required files in task bundle: tools.py, reward.py
  Solution: Ensure your task bundle has all required files.
- Model API Key Not Set:
  Warning: MODEL_AGENT environment variable is not set
  Solution: Set the MODEL_AGENT environment variable or use the --model parameter.
- Import Errors with Task Bundle:
  Error: Failed to import tool registry from example.task.tools
  Solution: Check that the Python path is correct and the module can be imported.
For additional help, use the --help flag with any command:
eval-protocol --help
eval-protocol preview --help
eval-protocol deploy --help
eval-protocol agent-eval --help

- Explore the Developer Guide for conceptual understanding
- Try the Creating Your First Reward Function tutorial
- Learn about Agent Evaluation to create your own task bundles
- See Examples for practical implementations