1 change: 1 addition & 0 deletions docs.json
@@ -764,6 +764,7 @@
"weave/guides/evaluation/builtin_scorers",
"weave/guides/evaluation/weave_local_scorers",
"weave/guides/evaluation/evaluation_logger",
"weave/guides/evaluation/export_eval",
"weave/guides/core-types/leaderboards",
"weave/guides/tools/column-mapping",
"weave/guides/evaluation/dynamic_leaderboards"
150 changes: 150 additions & 0 deletions weave/guides/evaluation/export_eval.md
@@ -0,0 +1,150 @@
---
title: "Export evaluation data"
description: "Programmatically export evaluation results using the Evaluation REST API."
---

Teams that run evaluations in W&B Weave often need evaluation results outside of the Weave UI. Common use cases include:

- Pulling metrics into spreadsheets or notebooks for custom analysis and visualization.
- Feeding evaluation results into CI/CD pipelines to gate deployments.
- Sharing results with stakeholders who don't have W&B seats, through BI tools like Looker or internal dashboards.
- Building automated reporting pipelines that aggregate scores across projects.

The [v2 Evaluation REST API](https://trace.wandb.ai/docs) exposes evaluation-specific concepts: evaluation runs, predictions, scores, and scorers. Compared to the general-purpose Calls API, it returns richer, more structured output, including typed scorer statistics and resolved dataset inputs.

## API endpoints used

The snippets on this page use the following endpoints from the [v2 Evaluation REST API](https://trace.wandb.ai/docs):

- `GET /v2/{entity}/{project}/evaluation_runs`: Lists evaluation runs in a project, with optional filters by evaluation reference, model reference, or run ID.
- `GET /v2/{entity}/{project}/evaluation_runs/{evaluation_run_id}`: Reads a single evaluation run to retrieve its model, evaluation reference, status, timestamps, and summary.
- `POST /v2/{entity}/{project}/eval_results/query`: Retrieves grouped evaluation result rows for one or more evaluations. Returns per-row trials with model output, scores, and optionally resolved dataset row inputs. Also returns aggregated scorer statistics when requested.
- `GET /v2/{entity}/{project}/predictions/{prediction_id}`: Reads an individual prediction with its inputs, output, and model reference.

Authentication uses HTTP Basic with `api` as the username and your W&B API key as the password.

## Prerequisites

- Python 3.7 or later.
- The `requests` library. Install it with `pip install requests`.
- A W&B API key, set as the `WANDB_API_KEY` environment variable. Get your key at [wandb.ai/settings](https://wandb.ai/settings).

## Set up authentication

```python
import json
import os

import requests

TRACE_BASE = "https://trace.wandb.ai"
AUTH = ("api", os.environ["WANDB_API_KEY"])

entity = "my-team"
project = "my-project"
```

## List evaluation runs

Retrieve recent evaluation runs in a project and list details for each run, such as ID and status.

```python
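resp = requests.get(
    f"{TRACE_BASE}/v2/{entity}/{project}/evaluation_runs",
    auth=AUTH,
)
# Fail early on auth or permission errors instead of parsing an error body.
resp.raise_for_status()
# The response body is parsed as one JSON object per line.
runs = [json.loads(line) for line in resp.text.splitlines() if line.strip()]

for run in runs:
    print(run["evaluation_run_id"], run.get("status"))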
```

## Read a single evaluation run

Retrieve details for a specific evaluation run, including its model, evaluation reference, status, and timestamps.

```python
eval_run_id = "<evaluation-run-id>"

resp = requests.get(
    f"{TRACE_BASE}/v2/{entity}/{project}/evaluation_runs/{eval_run_id}",
    auth=AUTH,
)
resp.raise_for_status()
eval_run = resp.json()
print(eval_run["evaluation_run_id"], eval_run.get("status"), eval_run.get("model"))
```

## Get predictions and scores

Use the `eval_results/query` endpoint to retrieve per-row results for an evaluation run. Each row includes the resolved dataset inputs, model output, and individual scorer results. Set `include_rows`, `include_raw_data_rows`, and `resolve_row_refs` to get the full per-row detail.

```python
eval_run_id = "<evaluation-run-id>"

resp = requests.post(
    f"{TRACE_BASE}/v2/{entity}/{project}/eval_results/query",
    json={
        "evaluation_run_ids": [eval_run_id],
        "include_rows": True,
        "include_raw_data_rows": True,
        "resolve_row_refs": True,
    },
    auth=AUTH,
)
resp.raise_for_status()
results = resp.json()

# Each row holds one dataset input; each trial holds one model output and its scores.
for row in results["rows"]:
    inputs = row.get("raw_data_row")
    for ev in row.get("evaluations", []):
        for trial in ev.get("trials", []):
            output = trial.get("model_output")
            scores = trial.get("scores", {})
            print("Input:", inputs)
            print("Output:", output)
            print("Scores:", scores)
```
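
To pull these rows into a spreadsheet, one option is to flatten each trial into a single record and write the records as CSV. The sketch below works on a hand-written sample in the same shape the loop above reads. `flatten_rows`, `rows_to_csv`, and the `score.` column prefix are hypothetical names chosen for illustration, and the sketch assumes `scores` maps scorer names to values.

```python
import csv
import io
import json


def _cell(value):
    """Keep scalars as-is; serialize nested values so they fit in one CSV cell."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    return json.dumps(value, sort_keys=True, default=str)


def flatten_rows(results):
    """Yield one flat dict per trial from an eval_results/query response."""
    for row in results.get("rows", []):
        for ev in row.get("evaluations", []):
            for trial in ev.get("trials", []):
                flat = {
                    "row_digest": row.get("row_digest"),
                    "inputs": _cell(row.get("raw_data_row")),
                    "output": _cell(trial.get("model_output")),
                }
                # Assumption: scores is a dict of scorer name -> value.
                for name, value in (trial.get("scores") or {}).items():
                    flat[f"score.{name}"] = _cell(value)
                yield flat


def rows_to_csv(results):
    """Return CSV text with a header line and one line per trial."""
    records = list(flatten_rows(results))
    # Union of all columns, in first-seen order.
    fieldnames = list(dict.fromkeys(key for rec in records for key in rec))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hand-written sample in the response shape used above, not real API output.
sample = {
    "rows": [
        {
            "row_digest": "abc123",
            "raw_data_row": {"question": "2+2?"},
            "evaluations": [
                {"trials": [{"model_output": "4", "scores": {"exact_match": True}}]}
            ],
        },
        {
            "row_digest": "def456",
            "raw_data_row": {"question": "3+3?"},
            "evaluations": [
                {"trials": [{"model_output": "7", "scores": {"exact_match": False}}]}
            ],
        },
    ]
}

print(rows_to_csv(sample))
```

With a real response, pass `results` instead of `sample` and write the returned text to a file.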

## Get aggregated scores

The same `eval_results/query` endpoint can also return aggregated scorer statistics instead of per-row data. Set `include_summary` to get summary-level metrics like pass rates for binary scorers and means for continuous scorers.

```python
resp = requests.post(
    f"{TRACE_BASE}/v2/{entity}/{project}/eval_results/query",
    json={
        "evaluation_run_ids": [eval_run_id],
        "include_summary": True,
        "include_rows": False,
    },
    auth=AUTH,
)
resp.raise_for_status()
results = resp.json()

for ev in results["summary"]["evaluations"]:
    for stat in ev["scorer_stats"]:
        # Check for None explicitly so a pass rate of 0.0 is still reported.
        value = stat.get("pass_rate")
        if value is None:
            value = stat.get("numeric_mean")
        print(stat["scorer_key"], stat.get("value_type"), value)
```
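
For the CI/CD use case, the summary can drive a simple deployment gate. The following is a minimal sketch, not a prescribed workflow: `failing_scorers` and the `0.9` threshold are hypothetical, and the sample summary is hand-written in the shape read above, with illustrative `value_type` strings.

```python
def failing_scorers(summary, min_pass_rate=0.9):
    """Return (scorer_key, pass_rate) pairs whose pass rate is below the threshold."""
    failures = []
    for ev in summary.get("evaluations", []):
        for stat in ev.get("scorer_stats", []):
            pass_rate = stat.get("pass_rate")
            # Scorers without a pass rate (for example, continuous ones) are skipped.
            if pass_rate is not None and pass_rate < min_pass_rate:
                failures.append((stat["scorer_key"], pass_rate))
    return failures


# Hand-written sample, not real API output. The value_type strings are illustrative.
sample_summary = {
    "evaluations": [
        {
            "scorer_stats": [
                {"scorer_key": "exact_match", "value_type": "binary", "pass_rate": 0.5},
                {"scorer_key": "is_valid_json", "value_type": "binary", "pass_rate": 1.0},
                {"scorer_key": "similarity", "value_type": "continuous", "numeric_mean": 0.2},
            ]
        }
    ]
}

failures = failing_scorers(sample_summary, min_pass_rate=0.9)
for key, rate in failures:
    print(f"{key}: pass rate {rate:.2f} is below the threshold")
# In a CI job, a non-empty list would typically end the step with a
# non-zero exit code, for example: sys.exit(1 if failures else 0)
```

With a real response, pass `results["summary"]` in place of `sample_summary`.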

## Read a single prediction

Retrieve the full details of an individual prediction, including its inputs, output, and model reference.

```python
prediction_id = "<predict-call-id>"

resp = requests.get(
    f"{TRACE_BASE}/v2/{entity}/{project}/predictions/{prediction_id}",
    auth=AUTH,
)
resp.raise_for_status()
prediction = resp.json()
print(prediction)
```

## How to use row digests

Each result row from the `eval_results/query` endpoint includes a `row_digest`, a content hash that uniquely identifies a specific input in the evaluation dataset based on its contents, not its position. Row digests are useful for:

- **Cross-evaluation comparison**: When you run two different models against the same dataset, rows with the same digest represent the same input. You can join on `row_digest` to compare how different models performed on the exact same task.
- **Deduplication**: If the same task appears in multiple evaluation suites, the digest lets you identify it.
- **Reproducibility**: The digest is content-addressable, so if someone modifies a dataset row (changes the instruction text, rubric, or other fields), it gets a new digest. You can verify whether two evaluation runs used identical inputs or slightly different versions.
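
As an illustration of cross-evaluation comparison, the sketch below joins two `eval_results/query` responses on `row_digest`. The helper names `index_by_digest` and `join_on_digest` are hypothetical, the two samples are hand-written rather than real API output, and the sketch assumes `row_digest` sits at the top level of each row.

```python
def index_by_digest(results):
    """Map each row_digest to the list of score dicts from that row's trials."""
    indexed = {}
    for row in results.get("rows", []):
        indexed[row["row_digest"]] = [
            trial.get("scores") or {}
            for ev in row.get("evaluations", [])
            for trial in ev.get("trials", [])
        ]
    return indexed


def join_on_digest(results_a, results_b):
    """Return {row_digest: (scores_a, scores_b)} for inputs present in both responses."""
    a = index_by_digest(results_a)
    b = index_by_digest(results_b)
    shared = sorted(a.keys() & b.keys())
    return {digest: (a[digest], b[digest]) for digest in shared}


def _row(digest, passed):
    """Build one hand-written sample row with a single trial."""
    return {
        "row_digest": digest,
        "evaluations": [{"trials": [{"scores": {"exact_match": passed}}]}],
    }


# Hand-written samples: the two runs share only the input with digest "d2".
model_a = {"rows": [_row("d1", True), _row("d2", True)]}
model_b = {"rows": [_row("d2", False), _row("d3", True)]}

joined = join_on_digest(model_a, model_b)
for digest, (scores_a, scores_b) in joined.items():
    print(digest, scores_a, scores_b)
```

Digests that appear in only one response indicate inputs that were added, removed, or modified between the two runs.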
