Framework Adapter SDK for EvalHub Integration
The EvalHub SDK provides a standardized way to create framework adapters that can be consumed by EvalHub, enabling a "Bring Your Own Framework" (BYOF) approach for evaluation frameworks.
The SDK creates a common API layer that allows EvalHub to communicate with ANY evaluation framework. Users only need to write minimal "glue" code to connect their framework to the standardized interface.
```
EvalHub → (Standard API) → Your Framework Adapter → Your Evaluation Framework
```
The adapter SDK uses a job runner architecture:
```mermaid
graph TB
    subgraph pod["Kubernetes Job Pod"]
        subgraph adapter["Adapter Container"]
            A1["1. Read JobSpec<br/>from ConfigMap"]
            A2["2. run_benchmark_job()"]
            A3["3. Report status<br/>via callbacks"]
            A4["4. Create OCI artifacts<br/>via callbacks"]
            A5["5. Report results<br/>via callbacks"]
            A6["6. Exit"]
        end
        subgraph sidecar["Sidecar Container"]
            S1["ConfigMap mounted<br/>/meta/job.json"]
            S2["Forward status to<br/>EvalHub service (HTTP)"]
            S3["Authenticated push of<br/>OCI artifacts<br/>to OCI Registry"]
            S4["Forward results to<br/>EvalHub service (HTTP)"]
        end
        A1 -.-> S1
        A3 --> S2
        A4 --> S3
        A5 --> S4
    end
    S2 --> EvalHub["EvalHub Service"]
    S3 --> Registry["OCI Registry"]
    S4 --> EvalHub

    style pod fill:#f0f0f0,stroke:#333,stroke-width:2px
    style adapter fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style sidecar fill:#fff3e0,stroke:#f57c00,stroke-width:2px
```
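The adapter-side steps in the diagram can be sketched in plain Python. This is illustrative only, not SDK code; the JSON field names are assumptions based on the `JobSpec` model described later in this document, and a temporary directory stands in for the ConfigMap mount.

```python
import json
import tempfile
from pathlib import Path

# 1. The platform mounts the job spec into the pod at /meta/job.json;
#    simulate that mount with a temporary directory here
meta_dir = Path(tempfile.mkdtemp())
(meta_dir / "job.json").write_text(json.dumps({
    "id": "eval-123",
    "benchmark_id": "mmlu",
    "benchmark_index": 0,
    "model": {"url": "http://vllm-service:8000", "name": "llama-2-7b"},
    "parameters": {},
    "callback_url": "http://localhost:8080",
}))

# 2. The adapter reads the spec at startup
config = json.loads((meta_dir / "job.json").read_text())

# 3.-5. Status, artifacts, and results are reported through callbacks;
#       each event carries benchmark_index so the service can route it
status_event = {
    "benchmark_index": config["benchmark_index"],
    "status": "running",
    "phase": "initializing",
}

# 6. The adapter exits after returning results tied to the job spec
results = {"job_id": config["id"], "benchmark_id": config["benchmark_id"]}
```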
The SDK is organized into distinct, focused packages:
Core (evalhub.models) - Shared data models
- Request/response models for API communication
- Common data structures for evaluations and benchmarks
Adapter SDK (evalhub.adapter) - Framework adapter components
- `FrameworkAdapter` base class with `run_benchmark_job()` method
- Job specification models (`JobSpec`, `JobResults`)
- Callback interface for status updates and OCI artifacts
- Example implementations
Client SDK (evalhub.client) - REST API client for EvalHub service
- HTTP client for submitting evaluations to EvalHub
- Resource navigation (providers, benchmarks, collections)
- See CLIENT_SDK_GUIDE.md
- JobSpec - Job configuration loaded from ConfigMap at pod startup
- FrameworkAdapter - Base class that implements the `run_benchmark_job()` method
- JobCallbacks - Interface for reporting status and persisting artifacts
- JobResults - Evaluation results returned when job completes
- Sidecar - Container that handles service communication (provided by platform)
```bash
# Install from PyPI (when available)
pip install eval-hub-sdk

# Install from source
git clone https://github.com/eval-hub/eval-hub-sdk.git
cd eval-hub-sdk
pip install -e .[dev]
```

Create a new Python file for your adapter:
```python
# my_framework_adapter.py
from evalhub.adapter import (
    FrameworkAdapter,
    JobSpec,
    JobCallbacks,
    JobResults,
    JobStatus,
    JobPhase,
    JobStatusUpdate,
    EvaluationResult,
    OCIArtifactSpec,
)


class MyFrameworkAdapter(FrameworkAdapter):
    def run_benchmark_job(
        self, config: JobSpec, callbacks: JobCallbacks
    ) -> JobResults:
        """Run a benchmark evaluation job."""
        # Report initialization
        callbacks.report_status(JobStatusUpdate(
            status=JobStatus.RUNNING,
            phase=JobPhase.INITIALIZING,
            progress=0.0,
            message="Loading benchmark and model"
        ))

        # Load your evaluation framework and benchmark
        framework = load_your_framework()
        benchmark = framework.load_benchmark(config.benchmark_id)
        model = framework.load_model(config.model)

        # Report evaluation start
        callbacks.report_status(JobStatusUpdate(
            status=JobStatus.RUNNING,
            phase=JobPhase.RUNNING_EVALUATION,
            progress=0.3,
            message=f"Evaluating on {config.num_examples} examples"
        ))

        # Run evaluation (adapter-specific params come from parameters)
        results = framework.evaluate(
            benchmark=benchmark,
            model=model,
            num_examples=config.num_examples,
            num_few_shot=config.parameters.get("num_few_shot", 0)
        )

        # Save and persist artifacts
        output_files = save_results(config.job_id, results)
        artifact = callbacks.create_oci_artifact(OCIArtifactSpec(
            files=output_files,
            job_id=config.job_id,
            benchmark_id=config.benchmark_id,
            model_name=config.model.name
        ))

        # Return results
        return JobResults(
            job_id=config.job_id,
            benchmark_id=config.benchmark_id,
            model_name=config.model.name,
            results=[
                EvaluationResult(
                    metric_name="accuracy",
                    metric_value=results["accuracy"],
                    metric_type="float"
                )
            ],
            num_examples_evaluated=len(results),
            duration_seconds=results["duration"],
            oci_artifact=artifact
        )
```

The SDK exposes an OCI persistence API via `callbacks.create_oci_artifact(...)`.
Use DefaultCallbacks for both production and development:
```python
from evalhub.adapter import DefaultCallbacks

# Initialize adapter (loads settings and job spec internally)
adapter = MyFrameworkAdapter()

# Create callbacks from adapter (auto-configures sidecar, OCI proxy, etc.)
callbacks = DefaultCallbacks.from_adapter(adapter)

results = adapter.run_benchmark_job(adapter.job_spec, callbacks)
```

Key Points:
- Status updates: Sent to the sidecar if `sidecar_url` is provided, otherwise logged locally. Both `report_status` and `report_results` events always include `benchmark_index` (and `provider_id` when set) so the service can associate events with the correct benchmark in multi-benchmark jobs.
- OCI artifacts: Created via SDK callbacks and pushed to the OCI registry through the sidecar-authenticated flow when the mode is Kubernetes.
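The first point can be illustrated with a minimal, self-contained sketch. This is not the SDK's implementation; the `StatusReporter` class, the payload shape, and the `/status` endpoint comment are illustrative assumptions based on the behavior described above.

```python
import json
import logging
from dataclasses import dataclass
from typing import Optional

logging.basicConfig(level=logging.INFO)


@dataclass
class StatusReporter:
    """Illustrative stand-in for the SDK's status-reporting callback."""
    benchmark_index: int
    provider_id: Optional[str] = None
    sidecar_url: Optional[str] = None

    def report_status(self, payload: dict) -> dict:
        # Every event carries benchmark_index (and provider_id when set)
        event = {"benchmark_index": self.benchmark_index, **payload}
        if self.provider_id is not None:
            event["provider_id"] = self.provider_id
        if self.sidecar_url:
            # In-cluster: the event would be POSTed to the sidecar,
            # e.g. f"{self.sidecar_url}/status" (HTTP call omitted here)
            pass
        else:
            # Local development: no sidecar, so just log the event
            logging.info("status: %s", json.dumps(event))
        return event


# No sidecar_url configured -> events are logged locally
reporter = StatusReporter(benchmark_index=0)
event = reporter.report_status({"status": "running", "progress": 0.3})
```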
Create a Dockerfile for your adapter:
```dockerfile
FROM registry.access.redhat.com/ubi9/python-312

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy adapter code
COPY my_framework_adapter.py .
COPY run_adapter.py .

# Run adapter
CMD ["python", "run_adapter.py"]
```

Create the entrypoint script:
```python
# run_adapter.py
from my_framework_adapter import MyFrameworkAdapter
from evalhub.adapter import AdapterSettings, DefaultCallbacks, JobSpec

# Load settings and job spec explicitly
settings = AdapterSettings.from_env()
settings.validate_runtime()
job_spec = JobSpec.from_file(settings.resolved_job_spec_path)

# Initialize adapter with settings
adapter = MyFrameworkAdapter(settings=settings)

# Create callbacks
callbacks = DefaultCallbacks(
    job_id=job_spec.job_id,
    benchmark_id=job_spec.benchmark_id,
    benchmark_index=job_spec.benchmark_index,
    sidecar_url=job_spec.callback_url,
    registry_url=settings.registry_url,
    registry_username=settings.registry_username,
    registry_password=settings.registry_password,
    insecure=settings.registry_insecure,
)

# Run adapter
results = adapter.run_benchmark_job(job_spec, callbacks)

# Report final results to service via sidecar
callbacks.report_results(results)
print(f"Job completed: {results.job_id}")
```

The eval-hub service will create Kubernetes Jobs for your adapter:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: eval-job-123
spec:
  template:
    spec:
      containers:
        # Your adapter container
        - name: adapter
          image: myregistry/my-adapter:latest
          volumeMounts:
            - name: job-spec
              mountPath: /meta
        # Sidecar container (provided by platform)
        - name: sidecar
          image: evalhub/sidecar:latest
          env:
            - name: EVALHUB_SERVICE_URL
              value: "http://evalhub-service:8080"
      volumes:
        - name: job-spec
          configMap:
            name: job-123-spec
```

For a complete working example, see evalhub/adapter/examples/simple_adapter.py.
The EvalHub SDK is organized into distinct packages based on your use case:
| Use Case | Primary Package | Description |
|---|---|---|
| Building an Adapter | `evalhub.adapter` | Create a framework adapter for your evaluation framework |
| Interacting with EvalHub | `evalhub.client` | REST API client for submitting evaluations |
| Data Models | `evalhub.models` | Request/response models for API communication |
Framework Adapter Developer:
```python
# Building your adapter
from evalhub.adapter import (
    FrameworkAdapter,
    JobSpec,
    JobCallbacks,
    JobResults,
    JobStatus,
    JobPhase,
    JobStatusUpdate,
    EvaluationResult,
    OCIArtifactSpec,
)
```

EvalHub Service User:
```python
# Interacting with EvalHub REST API
from evalhub import (
    EvalHubClient,
    BenchmarkConfig,
    EvaluationExports,
    EvaluationExportsOCI,
    JobSubmissionRequest,
    ModelConfig,
    OCIConnectionConfig,
    OCICoordinates,
)
```

The SDK includes a complete reference implementation showing all adapter patterns:
Example Adapter: src/evalhub/adapter/examples/simple_adapter.py
This example demonstrates:
- Loading JobSpec from mounted ConfigMap
- Validating configuration
- Loading benchmark data
- Running evaluation with progress reporting
- Persisting results as OCI artifacts
- Returning structured results
```python
from evalhub.adapter.examples import ExampleAdapter
from evalhub.adapter import JobSpec, ModelConfig

# Load job specification
job_spec = JobSpec(
    id="eval-123",
    provider_id="my-provider",
    benchmark_id="mmlu",
    benchmark_index=0,
    model=ModelConfig(
        url="http://vllm-service:8000",
        name="llama-2-7b"
    ),
    parameters={},
    callback_url="http://localhost:8080",
    num_examples=100
)

# Create adapter and run (callbacks created as shown in the quickstart)
adapter = ExampleAdapter()
results = adapter.run_benchmark_job(job_spec, callbacks)
```

Your adapter must implement a single method:
```python
from evalhub.adapter import FrameworkAdapter, JobSpec, JobCallbacks, JobResults


class MyFrameworkAdapter(FrameworkAdapter):
    def run_benchmark_job(
        self, config: JobSpec, callbacks: JobCallbacks
    ) -> JobResults:
        """Run a benchmark evaluation job.

        Args:
            config: Job specification from mounted ConfigMap
            callbacks: Callbacks for status updates and artifact persistence

        Returns:
            JobResults: Evaluation results and metadata

        Raises:
            ValueError: If configuration is invalid
            RuntimeError: If evaluation fails
        """
        # Your implementation here
        pass
```

JobSpec - Configuration loaded from ConfigMap:
```python
class JobSpec(BaseModel):
    # Mandatory fields
    id: str                     # Unique job identifier
    provider_id: str            # Provider identifier
    benchmark_id: str           # Benchmark to evaluate
    benchmark_index: int        # Index of this benchmark within the job
                                # (included in all status/result events)
    model: ModelConfig          # Model configuration (url, name)
    parameters: Dict[str, Any]  # Adapter-specific parameters
    callback_url: str           # Base URL for callbacks
                                # (SDK appends /status, /results)

    # Optional fields
    num_examples: Optional[int]     # Number of examples to evaluate
    experiment_name: Optional[str]  # Experiment name
    tags: list[dict[str, str]]      # Custom tags (default: [])

    @classmethod
    def from_file(cls, path: Path | str) -> Self:
        """Load JobSpec from a JSON file."""
```

Load a job spec from file:
```python
from evalhub.adapter import JobSpec

# Explicit path (recommended)
spec = JobSpec.from_file("/meta/job.json")

# Or use settings for the path
spec = JobSpec.from_file(settings.resolved_job_spec_path)
```

JobCallbacks - Interface for service communication:
```python
class JobCallbacks(ABC):
    @abstractmethod
    def report_status(self, update: JobStatusUpdate) -> None:
        """Report status update to service"""

    @abstractmethod
    def create_oci_artifact(self, spec: OCIArtifactSpec) -> OCIArtifactResult:
        """Create and push OCI artifact"""
```

When using `DefaultCallbacks`, pass `benchmark_index` (and optionally `provider_id`) from the job spec so that status and result events sent to the service always include `benchmark_index`, allowing the service to associate events with the correct benchmark in multi-benchmark jobs.
JobResults - Returned when job completes:
```python
class JobResults(BaseModel):
    job_id: str
    benchmark_id: str
    benchmark_index: int                       # Index within the job
    model_name: str
    results: List[EvaluationResult]            # Evaluation metrics
    overall_score: Optional[float]             # Overall score if applicable
    num_examples_evaluated: int                # Number of examples evaluated
    duration_seconds: float                    # Total evaluation time
    evaluation_metadata: Dict[str, Any]        # Framework-specific metadata
    oci_artifact: Optional[OCIArtifactResult]  # OCI artifact info if persisted
```

Your adapter runs as a container in a Kubernetes Job alongside a sidecar:
```dockerfile
FROM registry.access.redhat.com/ubi9/python-312

WORKDIR /app

# Install your framework and dependencies
RUN pip install lm-evaluation-harness==0.4.0 eval-hub-sdk

# Copy adapter implementation
COPY my_adapter.py .
COPY entrypoint.py .

CMD ["python", "entrypoint.py"]
```

```python
# entrypoint.py
from my_adapter import MyFrameworkAdapter
from evalhub.adapter import AdapterSettings, DefaultCallbacks, JobSpec

# Load settings and job spec explicitly
settings = AdapterSettings.from_env()
settings.validate_runtime()
job_spec = JobSpec.from_file(settings.resolved_job_spec_path)

# Initialize adapter with settings
adapter = MyFrameworkAdapter(settings=settings)

# Create callbacks
callbacks = DefaultCallbacks(
    job_id=job_spec.job_id,
    benchmark_id=job_spec.benchmark_id,
    benchmark_index=job_spec.benchmark_index,
    sidecar_url=job_spec.callback_url,
    registry_url=settings.registry_url,
    insecure=settings.registry_insecure,
)

# Run adapter
results = adapter.run_benchmark_job(job_spec, callbacks)

# Report final results
callbacks.report_results(results)
print(f"Job {results.job_id} completed with score: {results.overall_score}")
```

EvalHub creates Jobs automatically:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: eval-job-123
spec:
  template:
    spec:
      containers:
        - name: adapter
          image: myregistry/my-framework-adapter:latest
          volumeMounts:
            - name: job-spec
              mountPath: /meta
        - name: sidecar
          image: evalhub/sidecar:latest
          env:
            - name: EVALHUB_SERVICE_URL
              value: "http://evalhub-service:8080"
      volumes:
        - name: job-spec
          configMap:
            name: job-123-spec
      restartPolicy: Never
```

```bash
# Clone the repository
git clone https://github.com/eval-hub/eval-hub-sdk.git
cd eval-hub-sdk

# Install in development mode with all dependencies
pip install -e .[dev]

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run tests with coverage
pytest --cov=src/evalhub --cov-report=html

# Run type checking
mypy src/evalhub

# Run linting
ruff check src/ tests/
ruff format src/ tests/
```

```python
from evalhub.adapter import AdapterSettings


def test_settings_parse(monkeypatch):
    monkeypatch.setenv("EVALHUB_MODE", "local")
    monkeypatch.setenv("REGISTRY_URL", "localhost:5000")
    s = AdapterSettings.from_env()
    assert str(s.registry_url) == "localhost:5000"
```

Run all quality checks:
```bash
# Format code
ruff format .

# Lint and fix issues
ruff check --fix .

# Type check
mypy src/evalhub

# Run full test suite
pytest -v --cov=src/evalhub
```

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for your changes
- Run the test suite
- Submit a pull request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.