Command Line Interface Reference

The Eval Protocol provides a command-line interface (CLI) for common operations like previewing evaluations, deploying reward functions, and running agent evaluations.

Installation

When you install the Eval Protocol, the CLI is automatically installed:

pip install eval-protocol

You can verify the installation by running:

eval-protocol --help

Authentication Setup

Before using the CLI, set up your authentication credentials:

# Set your API key
export FIREWORKS_API_KEY=your_api_key

# Optional: Set the API base URL (for development environments)
export FIREWORKS_API_BASE=https://api.fireworks.ai

Command Overview

The Eval Protocol CLI supports the following main commands:

  • run: Run a local evaluation pipeline using a Hydra configuration.
  • preview: Preview evaluation results or re-evaluate generated outputs.
  • deploy: Deploy a reward function as an evaluator.
  • agent-eval: Run agent evaluations on task bundles.
  • list: List existing evaluators (coming soon).
  • delete: Delete an evaluator (coming soon).

Run Command (eval-protocol run)

The run command is the primary way to execute local evaluation pipelines. It leverages Hydra for configuration, allowing you to define complex evaluation setups (including dataset loading, model generation, and reward application) in YAML files and easily override parameters from the command line.

Syntax

python -m eval_protocol.cli run [options] [HYDRA_OVERRIDES...]

or

eval-protocol run [options] [HYDRA_OVERRIDES...]

Key Options

  • --config-path TEXT: Path to the directory containing your Hydra configuration files. (Required)
  • --config-name TEXT: Name of the main Hydra configuration file (e.g., run_my_eval.yaml). (Required)
  • --multirun or -m: Run multiple jobs (e.g., for sweeping over parameters). Refer to Hydra documentation for multi-run usage.
  • --help: Show help message for the run command.

Hydra Overrides

You can override any parameter defined in your Hydra configuration YAML files directly on the command line. For detailed information on how Hydra is used, refer to the Hydra Configuration for Examples guide.
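A minimal configuration sketch is shown below; the key names (`generation`, `evaluation_params`) mirror the overrides used in the examples in this section, but are assumptions — your project's actual schema may differ:

```yaml
# Hypothetical run_my_eval.yaml; key names mirror the overrides shown
# in the examples in this section and may differ from your schema.
defaults:
  - dataset: my_data

generation:
  model_name: "accounts/fireworks/models/mixtral-8x7b-instruct"

evaluation_params:
  limit_samples: 100
```

Any of these keys can then be overridden from the command line with dotted paths, e.g. `evaluation_params.limit_samples=10`.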

Examples

# Basic usage, running an evaluation defined in examples/math_example/conf/run_math_eval.yaml
python -m eval_protocol.cli run \
  --config-path examples/math_example/conf \
  --config-name run_math_eval.yaml

# Override the number of samples to process and the model name
python -m eval_protocol.cli run \
  --config-path examples/math_example/conf \
  --config-name run_math_eval.yaml \
  evaluation_params.limit_samples=10 \
  generation.model_name="accounts/fireworks/models/mixtral-8x7b-instruct"

Output

The run command typically generates:

  • A timestamped output directory (e.g., outputs/YYYY-MM-DD/HH-MM-SS/).
  • Inside this directory:
    • .hydra/: Contains the full Hydra configuration for the run (for reproducibility).
    • Log files.
    • Result files, often including:
      • <config_output_name>_results.jsonl (e.g., math_example_results.jsonl): Detailed evaluation results for each sample.
      • preview_input_output_pairs.jsonl: Generated prompts and responses, suitable for use with eval-protocol preview.
  • Console output: a summary report is logged to the console, including:
    • Total samples processed.
    • Number of successful evaluations.
    • Number of evaluation errors.
    • Average, min, and max scores (if applicable).
    • Score distribution.
    • Details of the first few errors encountered.
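The per-sample results file can also be post-processed directly. A minimal sketch that recomputes a summary like the one logged to the console, assuming each JSONL line carries a numeric "score" field and an optional "error" field (adjust the names to match your results file):

```python
import json
from pathlib import Path

def summarize_results(path: str) -> dict:
    """Compute a summary similar to the console report.

    Assumes each JSONL line has a numeric "score" field and an
    optional "error" field; adjust names to match your results file.
    """
    scores, errors = [], 0
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        if row.get("error"):
            errors += 1
        elif isinstance(row.get("score"), (int, float)):
            scores.append(row["score"])
    return {
        "total": len(scores) + errors,
        "successful": len(scores),
        "errors": errors,
        "avg": sum(scores) / len(scores) if scores else None,
        "min": min(scores) if scores else None,
        "max": max(scores) if scores else None,
    }
```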

Preview Command (eval-protocol preview)

The preview command lets you test reward functions with sample data. A primary use case is inspecting or re-evaluating the preview_input_output_pairs.jsonl file generated by the eval-protocol run command, which lets you iterate on reward logic against a fixed set of model generations, or apply different metrics to the same outputs.

You can also use it with manually created sample files.

Syntax

eval-protocol preview [options]

Options

  • --metrics-folders: Specify local metric scripts to apply, in the format "name=path/to/metric_script_dir". The directory should contain a main.py with a @reward_function.
  • --samples: Path to a JSONL file containing sample conversations or prompt/response pairs. This is typically the preview_input_output_pairs.jsonl file from an eval-protocol run output directory.
  • --remote-url: (Optional) URL of a deployed evaluator to use for scoring, instead of local --metrics-folders.
  • --max-samples: Maximum number of samples to process (optional)
  • --output: Path to save preview results (optional)
  • --verbose: Enable verbose output (optional)

Examples

# Previewing output from an `eval-protocol run` command with a local metric
eval-protocol preview \
  --samples ./outputs/YYYY-MM-DD/HH-MM-SS/preview_input_output_pairs.jsonl \
  --metrics-folders "my_custom_metric=./path/to/my_custom_metric"

# Previewing with multiple local metrics
eval-protocol preview \
  --samples ./outputs/YYYY-MM-DD/HH-MM-SS/preview_input_output_pairs.jsonl \
  --metrics-folders "metric1=./metrics/metric1" "metric2=./metrics/metric2"

# Limit sample count
eval-protocol preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl --max-samples 5

# Save results to file
eval-protocol preview --metrics-folders "clarity=./my_metrics/clarity" --samples ./samples.jsonl --output ./results.json

Sample File Format

The samples file should be a JSONL (JSON Lines) file. If it's the output from eval-protocol run (preview_input_output_pairs.jsonl), each line typically contains a "messages" list (including system, user, and assistant turns) and optionally a "ground_truth" field. If creating manually, a common format is:

{"messages": [{"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a method of data analysis..."}]}

Or, if you have ground truth for comparison:

{"messages": [{"role": "user", "content": "Question..."}, {"role": "assistant", "content": "Model answer..."}], "ground_truth": "Reference answer..."}
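If you are creating a samples file by hand, it is safer to serialize it with a JSON library than to write the lines manually. A sketch using a hypothetical write_samples helper; the field names follow the formats shown above:

```python
import json

# Hypothetical helper: build a JSONL samples file that
# `eval-protocol preview --samples` can read. The "messages" and
# "ground_truth" field names follow the formats documented above.
def write_samples(path, pairs):
    """pairs: list of (question, answer, ground_truth_or_None) tuples."""
    with open(path, "w") as f:
        for question, answer, ground_truth in pairs:
            row = {
                "messages": [
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": answer},
                ]
            }
            if ground_truth is not None:
                row["ground_truth"] = ground_truth
            f.write(json.dumps(row) + "\n")
```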

Deploy Command

The deploy command deploys a reward function as an evaluator on the Fireworks platform.

Syntax

eval-protocol deploy [options]

Options

  • --id: ID for the deployed evaluator (required)
  • --metrics-folders: Specify metrics to use in the format "name=path" (required)
  • --display-name: Human-readable name for the evaluator (optional)
  • --description: Description of the evaluator (optional)
  • --force: Overwrite if an evaluator with the same ID already exists (optional)
  • --providers: List of model providers to use (optional)
  • --verbose: Enable verbose output (optional)

Examples

# Basic deployment
eval-protocol deploy --id my-evaluator --metrics-folders "clarity=./my_metrics/clarity"

# With display name and description
eval-protocol deploy --id my-evaluator \
  --metrics-folders "clarity=./my_metrics/clarity" \
  --display-name "Clarity Evaluator" \
  --description "Evaluates responses based on clarity"

# Force overwrite existing evaluator
eval-protocol deploy --id my-evaluator \
  --metrics-folders "clarity=./my_metrics/clarity" \
  --force

# Multiple metrics
eval-protocol deploy --id comprehensive-evaluator \
  --metrics-folders "clarity=./my_metrics/clarity" "accuracy=./my_metrics/accuracy" \
  --display-name "Comprehensive Evaluator"

Common Workflows

Iterative Development Workflow

A typical development workflow starts with eval-protocol run:

  1. Configure: Set up your dataset and evaluation parameters in Hydra YAML files (e.g., conf/dataset/my_data.yaml, conf/run_my_eval.yaml). Define or reference your reward function logic.
  2. Run: Execute the evaluation pipeline using eval-protocol run. This generates model responses and initial scores.
    python -m eval_protocol.cli run --config-path ./conf --config-name run_my_eval.yaml
  3. Analyze & Iterate:
    • Examine the detailed results (*_results.jsonl) and the preview_input_output_pairs.jsonl from the output directory.
    • If iterating on reward logic, you can use eval-protocol preview with the preview_input_output_pairs.jsonl and your updated local metric script.
    eval-protocol preview \
      --samples ./outputs/YYYY-MM-DD/HH-MM-SS/preview_input_output_pairs.jsonl \
      --metrics-folders "my_refined_metric=./path/to/refined_metric"
    • Refine your reward function code or Hydra configurations.
  4. Re-run: If configurations changed significantly or you need new model generations, re-run eval-protocol run.
  5. Deploy: Once satisfied with the evaluator's performance and configuration:
    eval-protocol deploy --id my-evaluator-id \
      --metrics-folders "my_final_metric=./path/to/final_metric" \
      --display-name "My Final Evaluator" \
      --description "Description of my evaluator" \
      --force
    (Note: The --metrics-folders for deploy should point to the finalized reward function script(s) you intend to deploy as the evaluator.)

Comparing Multiple Metrics

You can preview multiple metrics to compare their performance:

# Preview with multiple metrics
eval-protocol preview \
  --metrics-folders \
  "metric1=./my_metrics/metric1" \
  "metric2=./my_metrics/metric2" \
  "metric3=./my_metrics/metric3" \
  --samples ./samples.jsonl

Deployment with Custom Providers

You can deploy with specific model providers:

# Deploy with custom provider
eval-protocol deploy --id my-evaluator \
  --metrics-folders "clarity=./my_metrics/clarity" \
  --providers '[{"providerType":"anthropic","modelId":"claude-3-sonnet-20240229"}]'
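Hand-quoting JSON inside a shell argument is error-prone. A small sketch that builds the --providers value with json.dumps instead (the provider keys are taken from the example above):

```python
import json

# Provider spec matching the example above; json.dumps guarantees
# valid JSON, which is easier than quoting it by hand in the shell.
providers = [
    {"providerType": "anthropic", "modelId": "claude-3-sonnet-20240229"},
]
providers_arg = json.dumps(providers)
print(f"--providers '{providers_arg}'")
```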

Agent-Eval Command

The agent-eval command enables you to run agent evaluations using task bundles.

Syntax

eval-protocol agent-eval [options]

Options

Task Specification:

  • --task-dir: Path to task bundle directory containing reward.py, tools.py, etc.
  • --dataset or -d: Path to JSONL file containing task specifications.

Output and Models:

  • --output-dir or -o: Directory to store evaluation runs (default: "./runs").
  • --model: Override MODEL_AGENT environment variable.
  • --sim-model: Override MODEL_SIM environment variable for simulated user.

Testing and Debugging:

  • --no-sim-user: Disable simulated user (use static initial messages only).
  • --test-mode: Run in test mode without requiring API keys.
  • --mock-response: Use a mock agent response (works with --test-mode).
  • --debug: Enable detailed debug logging.
  • --validate-only: Validate task bundle structure without running evaluation.
  • --export-tools: Export tool specifications to directory for manual testing.

Advanced Options:

  • --task-ids: Comma-separated list of task IDs to run.
  • --max-tasks: Maximum number of tasks to evaluate.
  • --registries: Custom tool registries in format 'name=path'.
  • --registry-override: Override all toolset paths with this registry path.
  • --evaluator: Custom evaluator module path (overrides default).

Examples

Note: The following examples use examples/your_agent_task_bundle/ as a placeholder. You will need to replace this with the actual path to your task bundle directory.

# Run agent evaluation with default settings, assuming MODEL_AGENT is set
export MODEL_AGENT=openai/gpt-4o-mini # Example model
eval-protocol agent-eval --task-dir examples/your_agent_task_bundle/

# Use a specific dataset file from your task bundle
eval-protocol agent-eval --dataset examples/your_agent_task_bundle/task.jsonl --task-dir examples/your_agent_task_bundle/

# Run in test mode (no API keys required)
eval-protocol agent-eval --task-dir examples/your_agent_task_bundle/ --test-mode --mock-response

# Validate task bundle structure without running
eval-protocol agent-eval --task-dir examples/your_agent_task_bundle/ --validate-only

# Use a custom model and limit to specific tasks
eval-protocol agent-eval --task-dir examples/your_agent_task_bundle/ \
  --model anthropic/claude-3-opus-20240229 \
  --task-ids your_task.id.001,your_task.id.002

# Export tool specifications for manual testing
eval-protocol agent-eval --task-dir examples/your_agent_task_bundle/ --export-tools ./tool_specs

Task Bundle Structure

A task bundle is a directory containing the following files:

  • reward.py: Reward function with @reward_function decorator
  • tools.py: Tool registry with tool definitions
  • task.jsonl: Dataset rows with task specifications
  • seed.sql (optional): Initial database state
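The --validate-only check can be approximated locally. A sketch that verifies the required files are present; the file list follows the structure above, and the CLI's actual validation may be stricter:

```python
from pathlib import Path

REQUIRED = ["reward.py", "tools.py", "task.jsonl"]  # seed.sql is optional

def missing_bundle_files(task_dir: str) -> list:
    """Return required files missing from a task bundle directory.

    Mirrors what `--validate-only` checks at a high level; the CLI's
    actual validation may also inspect file contents.
    """
    root = Path(task_dir)
    return [name for name in REQUIRED if not (root / name).exists()]
```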

See the Agent Evaluation guide for more details.

Environment Variables

The CLI recognizes the following environment variables:

  • FIREWORKS_API_KEY: Your Fireworks API key (required for deployment operations)
  • FIREWORKS_API_BASE: Base URL for the Fireworks API (defaults to https://api.fireworks.ai)
  • MODEL_AGENT: Default agent model to use (e.g., "openai/gpt-4o-mini")
  • MODEL_SIM: Default simulation model to use (e.g., "openai/gpt-3.5-turbo")
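A pre-flight check for these variables can save a failed run. A sketch of a hypothetical check_cli_env helper based on the variable names listed above:

```python
import os

def check_cli_env(require_deploy: bool = False) -> list:
    """Return warnings for unset environment variables the CLI reads.

    Hypothetical helper; variable names follow the list above.
    """
    warnings = []
    if require_deploy and not os.environ.get("FIREWORKS_API_KEY"):
        warnings.append("FIREWORKS_API_KEY is required for deploy operations")
    if not os.environ.get("MODEL_AGENT"):
        warnings.append("MODEL_AGENT is not set; pass --model instead")
    return warnings
```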

Troubleshooting

Common Issues

  1. Authentication Errors:

    Error: Authentication failed. Check your API key.
    

    Solution: Ensure FIREWORKS_API_KEY is correctly set.

  2. Metrics Folder Not Found:

    Error: Metrics folder not found: ./my_metrics/clarity
    

    Solution: Check that the path exists and contains a valid main.py file.

  3. Invalid Sample File:

    Error: Failed to parse sample file. Ensure it's a valid JSONL file.
    

    Solution: Verify the sample file is in the correct JSONL format.

  4. Deployment Permission Issues:

    Error: Permission denied. Your API key doesn't have deployment permissions.
    

    Solution: Use a production API key with deployment permissions or request additional permissions.

  5. Task Bundle Validation Errors:

    Error: Missing required files in task bundle: tools.py, reward.py
    

    Solution: Ensure your task bundle has all required files.

  6. Model API Key Not Set:

    Warning: MODEL_AGENT environment variable is not set
    

    Solution: Set the MODEL_AGENT environment variable or use the --model parameter.

  7. Import Errors with Task Bundle:

    Error: Failed to import tool registry from example.task.tools
    

    Solution: Check that the Python path is correct and the module can be imported.

Getting Help

For additional help, use the --help flag with any command:

eval-protocol --help
eval-protocol preview --help
eval-protocol deploy --help
eval-protocol agent-eval --help

Next Steps