After a run completes, CoEval's analysis package transforms raw JSONL evaluation data into eight interactive HTML reports and an Excel workbook. All reports are fully self-contained — no external CDN dependencies, no server required.
# Generate all eight reports at once
coeval analyze all \
--run ./eval_runs/my-experiment-v1 \
--out ./eval_runs/my-experiment-v1/reports
# Generate a single report
coeval analyze student-report \
--run ./eval_runs/my-experiment-v1 \
--out ./reports
# Generate the full Excel workbook
coeval analyze complete-report \
--run ./eval_runs/my-experiment-v1 \
--out ./reports

Using the analyzer.main module directly:
# Generate all HTML reports
python -m analyzer.main \
--run-path benchmark/runs/medium-benchmark-v1 \
--out-dir benchmark/runs/medium-benchmark-v1/html_reports
# Generate a single report type
python -m analyzer.main \
--run-path benchmark/runs/medium-benchmark-v1 \
--report student_report
# Generate paper tables (requires benchmark-native scores)
python -m analyzer.paper_tables --run-id paper-eval-v1 --out-dir paper/tables

Full Excel workbook with all evaluation data across every (task, teacher, student, judge) combination. Includes raw scores, aggregated rankings, and metadata. Designed for stakeholder review and downstream statistical analysis.
Sheets included:
- Summary — overall model rankings + composite scores
- StudentScores — full per-(student, task, rubric_factor) score table
- TeacherCoverage — attribute coverage metrics per teacher
- JudgeAgreement — pairwise ICC between all judge pairs
- FailedRecords — records with validation errors (MISSING_RESPONSE, etc.)
Interactive HTML histogram of judge score distributions. Filterable by task, teacher, student, and judge. Shows score spread, mean, median, and outlier bands. Reveals systematic over/under-scoring at a glance.
Analysis of teacher model performance:
- Attribute coverage heatmap (which dimensions were sampled how often)
- Diversity score across nuanced attribute dimensions
- Item quality metrics (judge score distributions per teacher)
- Comparison of synthetic vs. benchmark teacher outputs
Analysis of judge model behavior:
- Judge score distributions and variance
- Positional bias detection (does response order affect scores?)
- Agreement with ground-truth benchmark scores (Spearman ρ)
- Calibration curves (judge score vs. BERTScore / BLEU)
Analysis of student model performance:
- Per-task score breakdowns and percentile bands
- Cross-task ranking with confidence intervals
- Performance by attribute dimension (which input types challenge each model)
- Head-to-head comparisons between student pairs
Color-coded heatmap of average scores across every (teacher × student × judge) triplet. Reveals interaction effects — e.g., which teacher-student combinations score unexpectedly high or low regardless of judge.
- Heatmap — mean composite score for each (teacher, student) cell
- Row effects — teacher contribution to scores (should be small)
- Column effects — student contribution (should be large)
- Self-teaching cells — highlighted where teacher == student
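The row and column effects above reduce to mean-centering the interaction grid. A minimal sketch, assuming the grid is a plain nested dict of mean composite scores (an illustrative shape, not the analyzer's internal types):

```python
def matrix_effects(grid):
    """Row (teacher) and column (student) effects of a mean-score grid.

    grid: {teacher: {student: mean_score}} with the same students per row.
    Effect = row/column mean minus the grand mean. In a healthy setup the
    teacher (row) effects should be near zero and the student (column)
    effects should dominate.
    """
    teachers = sorted(grid)
    students = sorted(next(iter(grid.values())))
    grand = sum(grid[t][s] for t in teachers for s in students) / (
        len(teachers) * len(students))
    row = {t: sum(grid[t][s] for s in students) / len(students) - grand
           for t in teachers}
    col = {s: sum(grid[t][s] for t in teachers) / len(teachers) - grand
           for s in students}
    return grand, row, col
```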
Inter-judge agreement analysis across all judge pairs:
- Spearman ρ rank correlation matrix
- Kendall τ agreement matrix
- ACR (Agreement Consistency Rate) per pair
- ICC matrix — pairwise intraclass correlation between judges per task
- Drift chart — ICC over sliding window of 20 items (detects rubric drift)
- Calibration parameters — α and β per judge per task
- High-uncertainty items — responses where inter-judge σ > 1.5 on any factor
- Identifies judges that are outliers or systematically miscalibrated
Attribute dimension coverage across the full benchmark:
- ACR gauge — overall and per-task attribute coverage ratio
- Stratum heatmap — grid of all target-attribute value combinations, coloured by datapoint count
- Rare-attribute recall — proportion of underrepresented strata covered
- Surface bias — mean pairwise BLEU across prompts
- Gaps in coverage (attribute values with fewer items than expected)
- Nuanced attribute distribution across tasks and teachers
High-level experiment summary for executive review:
- Top-line student rankings across all tasks
- Judge ensemble confidence
- Cost summary (actual calls × price)
- Key quality metrics in a single-page view
Outlier-robust score aggregation:
- Trimmed mean and median scores per student
- Sensitivity analysis: how does ranking change when one judge is removed?
- Ensemble stability score (how consistent are rankings across judge subsets)
- Final model rankings with confidence intervals and robust ensemble weights
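The trimmed mean and the leave-one-judge-out sensitivity check can be sketched as follows; the nested-dict shape is illustrative, not the EESDataModel API:

```python
from statistics import mean


def trimmed_mean(scores, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of scores."""
    s = sorted(scores)
    k = int(len(s) * trim)
    core = s[k:len(s) - k] if k else s
    return mean(core)


def leave_one_judge_out_rankings(scores_by_judge):
    """For each judge, rank students using only the remaining judges' scores.

    scores_by_judge: {judge: {student: [score_norm, ...]}}
    Returns {left_out_judge: [students ordered best-to-worst]}. If rankings
    stay the same under every removal, the ensemble is stable.
    """
    judges = list(scores_by_judge)
    rankings = {}
    for left_out in judges:
        pooled = {}
        for j in judges:
            if j == left_out:
                continue
            for student, scores in scores_by_judge[j].items():
                pooled.setdefault(student, []).extend(scores)
        rankings[left_out] = sorted(pooled, key=lambda s: -mean(pooled[s]))
    return rankings
```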
Exports data in a structured format compatible with external benchmark evaluation tools and leaderboards.
All HTML reports include:
- Interactive charts (Plotly) — hover for exact values, click to filter
- Filterable data tables with CSV export
- Sortable model rankings by any metric column
- Color-coded matrices for quick pattern identification
- Fully self-contained — single HTML file, no internet connection required
| Metric | Symbol | Description |
|---|---|---|
| Spearman rank correlation | ρ | Monotone agreement between judge scores and ground-truth rankings |
| Kendall rank correlation | τ | Pairwise concordance between judge and ground-truth orderings |
| Agreement Consistency Rate | ACR | Judge-to-judge pairwise score agreement rate |
| Position Flip Rate | PFR | How often response order reversal changes judge's relative scores (positional bias) |
| Differentiation Score | — | Variance of scores across student models; higher = judges discriminate better |
| Calibration intercept | α | Additive correction from judge score to benchmark scale |
| Calibration slope | β | Multiplicative correction from judge score to benchmark scale |
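For reference, both rank correlations can be computed from two score lists without external dependencies. A minimal sketch (in practice scipy.stats.spearmanr and kendalltau cover this; the τ below is the tie-ignoring τ-a variant):

```python
from itertools import combinations


def _ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def kendall_tau(x, y):
    """(concordant - discordant) / total pairs over all index pairs."""
    c = d = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        c += s > 0
        d += s < 0
    n_pairs = len(x) * (len(x) - 1) / 2
    return (c - d) / n_pairs
```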
Q(student) = mean over all valid (response, factor) units of score_norm
where score_norm ∈ {0.0, 0.5, 1.0} for LLM judges (Low / Medium / High), or score_norm ∈ [0.0, 1.0] for metric judges.
Rubric-weighted composite:
Q_weighted = Σ_l w_l × mean(scores on factor l)
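Both composites operate on (response, factor) units. A minimal sketch with illustrative tuple shapes (the real aggregation lives in analyzer.metrics):

```python
from statistics import mean


def composite_score(units):
    """Plain composite Q: mean score_norm over all valid (response, factor) units.

    units: iterable of (response_id, rubric_factor, score_norm) tuples.
    """
    return mean(score for _, _, score in units)


def weighted_composite(units, weights):
    """Rubric-weighted composite: sum over factors l of w_l * mean(scores on l)."""
    by_factor = {}
    for _, factor, score in units:
        by_factor.setdefault(factor, []).append(score)
    return sum(weights[f] * mean(scores) for f, scores in by_factor.items())
```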
ACR = |{ω ∈ Ω : count(ω) ≥ 1}| / |Ω|
where Ω is the full attribute stratum space (Cartesian product of all target_attributes values). Perfect coverage: ACR = 1.0.
RAR = |{ω ∈ Ω_rare : count(ω) ≥ 1}| / |Ω_rare|
where Ω_rare contains strata with fewer than 3 natural occurrences in the benchmark. Measures coverage of underrepresented scenarios.
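Both ratios are simple counts over the stratum space Ω. A sketch assuming plain-dict inputs (the tuple ordering follows sorted dimension names, an illustrative convention rather than the loader's):

```python
from itertools import product


def coverage_metrics(target_attributes, datapoints, rare_threshold=3,
                     natural_counts=None):
    """ACR and RAR over the full attribute stratum space Omega.

    target_attributes: {dimension: [values...]}
    datapoints: list of {dimension: value} dicts
    natural_counts: {stratum_tuple: count in the source benchmark}, used to
                    decide which strata count as rare.
    """
    dims = sorted(target_attributes)
    omega = [tuple(c) for c in product(*(target_attributes[d] for d in dims))]
    counts = {w: 0 for w in omega}
    for dp in datapoints:
        key = tuple(dp[d] for d in dims)
        if key in counts:
            counts[key] += 1
    covered = {w for w, c in counts.items() if c >= 1}
    acr = len(covered) / len(omega)
    natural_counts = natural_counts or {}
    rare = [w for w in omega if natural_counts.get(w, 0) < rare_threshold]
    rar = sum(1 for w in rare if w in covered) / len(rare) if rare else 1.0
    return acr, rar
```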
Surface Bias = mean pairwise BLEU between all prompt pairs
(lower = more diverse prompts)
PFR = |comparisons where judge ranking changes on order swap| / |total comparisons|
Measured before and after swap-and-average mitigation.
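A sketch of the PFR computation, assuming each comparison records the judge's preference before and after the order swap (the data shape is illustrative):

```python
def position_flip_rate(comparisons):
    """PFR = fraction of comparisons whose preferred response changes on swap.

    comparisons: list of (pref_original, pref_swapped), where each preference
    identifies the *same underlying response* (e.g. 'A', 'B', or 'tie')
    regardless of presentation order.
    """
    flips = sum(1 for original, swapped in comparisons if original != swapped)
    return flips / len(comparisons)
```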
ρ = Spearman rank correlation between:
- CoEval composite score Q for each student response
- benchmark_native_score for the same datapoint
Computed at the response level across all 620 × 4 = 2,480 datapoints. Requires benchmark-native scores to be populated (see Benchmark Datasets).
All reports are built on top of the unified EESDataModel (loaded by analyzer.loader.load_ees). The key analytical unit is:
(response_id, rubric_factor) → score
Scores come in two forms:
- LLM judges produce ordinal scores: High (1.0), Medium (0.5), Low (0.0)
- Metric judges (interface: metric) produce continuous float strings in [0, 1] (e.g. "0.8423")
Both are normalised to score_norm ∈ [0.0, 1.0] for analysis.
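The normalisation rule amounts to an ordinal lookup plus a guarded float parse. A sketch with a hypothetical helper name (not the analyzer's actual function):

```python
ORDINAL_MAP = {"High": 1.0, "Medium": 0.5, "Low": 0.0}


def normalise_score(raw):
    """Map a raw Phase 5 score to score_norm in [0.0, 1.0].

    Ordinal LLM-judge labels use the High/Medium/Low mapping; metric-judge
    scores arrive as float strings and are parsed directly. Anything else
    would be classified as INVALID_SCORE_VALUE, signalled here by None.
    """
    if raw in ORDINAL_MAP:
        return ORDINAL_MAP[raw]
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    return value if 0.0 <= value <= 1.0 else None
```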
Validity classification. A Phase 5 record is valid if:
- The referenced Phase 4 response exists (not MISSING_RESPONSE)
- The referenced Phase 3 datapoint exists (not MISSING_DATAPOINT)
- All rubric factors are present in the scores (not INCOMPLETE_SCORES)
- All score values are High, Medium, Low, or a float in [0, 1] (not INVALID_SCORE_VALUE)
Invalid records appear in the Excel FailedRecords sheet but are excluded from all aggregate statistics.
Self-judging / self-teaching flags. Records where judge_model_id == student_model_id are flagged is_self_judging=True; records where teacher_model_id == student_model_id are flagged is_self_teaching=True. Reports can be filtered to exclude these.
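Both flags are simple equality checks on model IDs. A sketch with illustrative record dicts and helper names:

```python
def add_overlap_flags(record):
    """Annotate a Phase 5 record dict with self-judging / self-teaching flags."""
    record["is_self_judging"] = (
        record["judge_model_id"] == record["student_model_id"])
    record["is_self_teaching"] = (
        record["teacher_model_id"] == record["student_model_id"])
    return record


def exclude_overlaps(records, self_judge=True, self_teach=True):
    """Drop flagged records, mirroring --exclude-self-judge / --exclude-self-teach."""
    return [r for r in records
            if not (self_judge and r["is_self_judging"])
            and not (self_teach and r["is_self_teaching"])]
```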
from analyzer.loader import load_ees
from analyzer.metrics import (
composite_score_by_student,
coverage_ratio,
judge_consistency,
)
model = load_ees("benchmark/runs/medium-benchmark-v1", partial_ok=True)

# Print any load warnings
for w in model.load_warnings:
    print("WARN:", w)

# Student composite scores (mean normalised score across all valid units)
scores = composite_score_by_student(model)
for student, score in sorted(scores.items(), key=lambda x: -x[1]):
    print(f"  {student}: {score:.3f}")

# Attribute coverage ratio
acr = coverage_ratio(model, task_id="text_summarization")
print(f"ACR (text_summarization): {acr:.3f}")

# Judge consistency (ICC per task)
icc = judge_consistency(model)
for task, val in icc.items():
    print(f"  ICC({task}): {val:.3f}")

Excel export:
python -m analyzer.main \
--run-path benchmark/runs/medium-benchmark-v1 \
--format excel \
--out-file benchmark/runs/medium-benchmark-v1/analysis.xlsx
⚠️ Calibration is disabled by default and is NOT recommended for most use cases.

LLM judges return only three ordinal score levels: High (1.0), Medium (0.5), and Low (0.0). The OLS linear regression therefore operates on at most three unique input values, which is fundamentally insufficient for a reliable fit. The resulting α and β coefficients are highly sensitive to the score distribution and may not generalise.
Calibration should only be enabled when your experiment includes metric judges (interface: metric) that produce continuous [0, 1] scores (e.g. BERTScore, BLEU, exact_match), giving the OLS fit a meaningful range of input values. To opt in, pass --enable-calibration to the paper tables CLI or set calibration_enabled=True programmatically.
LLM judges have systematic biases. One judge may score generously (all "High"), another may compress its range (everything "Medium"). These biases don't cancel out in a simple average — they distort the final rankings.
OLS calibration fits a per-judge, per-task linear correction that aligns raw judge scores with benchmark ground truth:
calibrated_score = clip(α + β × raw_score, 0, 1)
| Parameter | Corrects | Example |
|---|---|---|
| α (intercept) | Baseline shift — judge systematically too generous or too harsh | Judge gives +0.1 on average → α pulls scores down |
| β (slope) | Range compression — judge uses a narrow or wide score band | Judge scores cluster in 0.4–0.6 → β stretches to full range |
If a judge's raw scores already align perfectly with ground truth, the fit converges to α ≈ 0, β ≈ 1 (identity — no correction needed). If variance is zero (judge gave every item the same score), calibration falls back to the identity function.
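Since the fit is one-variable ordinary least squares, it has a closed form. A standalone sketch of the fit and the clip step (the real implementation lives in analyzer.calibration; helper names here are illustrative):

```python
def fit_linear_calibration(raw, truth):
    """Closed-form OLS fit of truth ~ alpha + beta * raw.

    Falls back to the identity (alpha=0, beta=1) when the raw scores have
    zero variance, i.e. the judge gave every item the same score.
    """
    n = len(raw)
    mx = sum(raw) / n
    my = sum(truth) / n
    var = sum((x - mx) ** 2 for x in raw)
    if var == 0.0:
        return 0.0, 1.0  # identity fallback
    beta = sum((x - mx) * (y - my) for x, y in zip(raw, truth)) / var
    alpha = my - beta * mx
    return alpha, beta


def calibrate(score, alpha, beta):
    """calibrated_score = clip(alpha + beta * raw_score, 0, 1)."""
    return min(1.0, max(0.0, alpha + beta * score))
```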
Without calibration, the ensemble average inherits each judge's systematic bias. With calibration:
| Metric | Before (raw) | After (calibrated) |
|---|---|---|
| Spearman ρ vs. benchmark | 0.715 | 0.871 |
| MAE vs. benchmark | 0.16 | 0.07 |
This 22% improvement in rank correlation is the difference between "moderately useful" and "highly reliable" rankings.
Calibration is not part of the 5-phase pipeline and is disabled by default. When explicitly enabled, it runs during post-pipeline analysis, after all phases are complete. The timeline:
Phase 3 → Benchmark loaders create datapoints with benchmark_native_score = null
Phase 4 → Students respond to prompts
→ compute_scores backfills benchmark_native_score per datapoint
Phase 5 → Judges score responses → score_norm per (response, rubric_factor)
─────────────────────────────────────────────────────────────
POST-PIPELINE:
→ coeval analyze / paper_tables triggers calibration
→ Pairs (score_norm, benchmark_native_score) per datapoint
→ Fits OLS α, β per (judge, task) on 200-item holdout
→ Outputs calibration_params.json
Calibration requires benchmark ground truth to exist. This means:
- Your experiment must include benchmark teachers — i.e., you used coeval ingest --benchmarks xsum gsm8k to load public datasets into Phase 3.
- benchmark_native_score must be populated — either the loader set it directly, or you ran:

  python -m benchmark.compute_scores --run Runs/my-experiment

  This backfills each Phase 3 record's benchmark_native_score field using the benchmark's native metric (BERTScore-F1 for summarisation, BLEU-4 for code, exact-match for QA/MCQ).
If no benchmark_native_score values exist, load_or_fit_calibration() returns an empty dict and calibration is silently skipped.
Note: Calibration is opt-in. You must pass --enable-calibration explicitly. Without this flag, Table 8 is generated with a "Disabled (default)" placeholder.
Option A: Via paper tables (generates Table 8 with calibration analysis):
python -m analyzer.paper_tables --run Runs/medium-benchmark --out paper/tables --enable-calibration

Option B: Programmatic API:
from analyzer.calibration import fit_calibration, apply_calibration, load_or_fit_calibration
from analyzer.loader import load_ees
from pathlib import Path
model = load_ees("Runs/paper-eval-v1")
# Fit per-judge, per-task calibration
params = load_or_fit_calibration(model, out_dir=Path("paper/tables"), holdout_n=200)
# Inspect per-judge results
params["gpt-4o"]["text_summarization"]
# → {alpha: -0.05, beta: 0.88, rho_raw: 0.71, rho_calibrated: 0.87,
# mae_raw: 0.16, mae_calibrated: 0.07, n_fit: 200, n_total: 620}
# Inspect overall (aggregated across all judges and tasks)
params["_overall"]
# → {alpha: ..., beta: ..., rho_raw: ..., rho_calibrated: ..., ...}
# Apply calibration to a list of raw scores
calibrated = apply_calibration(raw_scores, params["_overall"]["alpha"],
                               params["_overall"]["beta"])

The result is cached in paper/tables/calibration_params.json. Delete this file (or pass force=True) to re-fit.
The calibration_params.json file is structured as:
{
"gpt-4o": {
"text_summarization": {
"alpha": -0.052,
"beta": 0.881,
"n_fit": 200,
"n_total": 620,
"rho_raw": 0.715,
"rho_calibrated": 0.871,
"mae_raw": 0.162,
"mae_calibrated": 0.071
},
"code_explanation": { ... }
},
"gpt-3.5-turbo": { ... },
"_overall": {
"alpha": -0.031,
"beta": 0.912,
"rho_raw": 0.711,
"rho_calibrated": 0.854,
...
}
}

Interpreting the parameters:
| Value | Meaning |
|---|---|
| `alpha ≈ 0, beta ≈ 1` | Judge is already well-calibrated — no correction needed |
| `alpha > 0` | Judge systematically under-scores (correction adds a baseline boost) |
| `alpha < 0` | Judge systematically over-scores (correction pulls scores down) |
| `beta > 1` | Judge compresses scores into a narrow range (correction stretches them) |
| `beta < 1` | Judge uses an inflated range (correction compresses) |
| `rho_calibrated > rho_raw` | Calibration improved alignment with ground truth |
| `mae_calibrated < mae_raw` | Calibration reduced absolute prediction error |
- It does not modify any Phase 5 evaluation files — raw scores remain untouched on disk.
- It does not affect the HTML reports (dashboard, student report, etc.) — those show raw ensemble scores.
- It is not applied during live pipeline execution — it is purely a post-hoc validation tool.
- It does not require human labels — ground truth comes from benchmark-native metrics (BERTScore, BLEU, exact-match).
Calibration answers the question: "How much can we trust the judge ensemble's scores?" It is a trust metric, not a correction applied to production output.
The analysis/paper_tables.py module generates publication-ready tables:
python -m analyzer.paper_tables \
--run benchmark/runs/paper-eval-v1 \
--out paper/tables

This generates all 7 tables (.tex + .csv):
| File | Paper Table | Contents | Requires benchmark scores? |
|---|---|---|---|
| `table3_spearman.tex/.csv` | Table 3 | Spearman ρ: CoEval ensemble + per-judge vs. benchmark ground truth | Yes |
| `table4_coverage.tex/.csv` | Table 4 | ACR, RAR, Surface Bias, fill rates by task | No |
| `table5_student_scores.tex/.csv` | Table 5 | Student composite scores, Kendall τ ranking | No |
| `table6_ensemble_ablation.tex/.csv` | Table 6 | ρ by ensemble size (1 → all judges) | Yes |
| `table7_sampling_ablation.tex/.csv` | Table 7 | Random vs. freq-weighted vs. stratified sampling | No (ACR/RAR from EES) |
| `table8_calibration.tex/.csv` | Table 8 | OLS calibration effect (ρ + MAE before/after) | Yes |
| `table9_positional_bias.tex/.csv` | Table 9 | Positional flip rates (needs swap pairs) | No |
| `SUMMARY.md` | — | Data availability checklist + next steps | — |
Automatically computed metrics:
- RAR (Rare-Attribute Recall) — fraction of rare strata (freq < 3) covered; Table 4 + Table 7.
- Surface Bias — mean pairwise sentence-BLEU across Phase 3 prompts; Table 4. Requires pip install nltk.
- OLS Calibration — α, β fit on 200-item holdout; Table 8 + calibration_params_overall.json.
For Table 3 baseline columns, run the baseline comparison script:
python -m benchmark.run_baselines \
--run benchmark/runs/paper-eval-v1 \
--out paper/tables \
--methods bertscore geval-gpt4o geval-claude \
--max-pairs 200

Outputs paper/tables/baselines.csv with Spearman ρ for each method × task. Requires pip install bert-score scipy (BERTScore) and OPENAI_API_KEY / ANTHROPIC_API_KEY (G-Eval).
| Flag | Description | Default |
|---|---|---|
| `--methods` | Which baselines: bertscore, geval-gpt4o, geval-claude | all |
| `--max-pairs N` | Subsample N items per task (cost control) | all |
| `--geval-gpt4o-model` | OpenAI model for G-Eval | `gpt-4o` |
| `--geval-claude-model` | Anthropic model for G-Eval | `claude-3-5-sonnet-20241022` |
| `--bertscore-model` | BERTScore backbone | `distilbert-base-uncased` |
| `--dry-run` | Print plan without making API calls | — |
python -m analyzer.compare \
--run-a benchmark/runs/medium-benchmark-v1 \
--run-b benchmark/runs/paper-eval-v1 \
--out-dir paper/comparison

Generates a comparison report showing ranking differences, score delta per student, and coverage improvement between runs.
All analysis commands accept --partial-ok to run on an experiment that is still in progress:
python -m analyzer.main \
--run-path benchmark/runs/medium-benchmark-v1 \
--partial-ok

A warning banner is shown at the top of every report indicating the run is incomplete, and any statistics are clearly marked as preliminary.
python -m analyzer.main
--run-path PATH Experiment folder (required)
--out-dir PATH Output directory (default: {run-path}/html_reports)
--report TYPE One of: student_report, teacher_report, judge_report,
score_dist, coverage, interaction, consistency,
robust_summary, all (default: all)
--format FMT html | excel | csv (default: html)
--partial-ok Allow analysis of incomplete runs
--exclude-self-judge Exclude self-judging records from all statistics
--exclude-self-teach Exclude self-teaching records from all statistics
python -m analyzer.paper_tables
--run-id ID Experiment ID (resolved under benchmark/runs/)
--out-dir PATH Directory for CSV table files
--tasks LIST Comma-separated task IDs (default: all)
python -m benchmark.emit_datapoints
--dataset NAME xsum | codesearchnet | aeslc | wikitablequestions | all
--run-id ID Create phase3_datapoints/ under benchmark/runs/{ID}/
--out-dir PATH Override output directory
--sample-size N Items per dataset (default: 620)
--split NAME Dataset split (default: loader default)
--seed INT Sampling seed (default: 42)
All sample reports are self-contained HTML files — click any example to view it rendered in your browser.
| Example | Description |
|---|---|
| Education Experiment Plan | Full experiment plan: 3 real-dataset tasks + 10 synthetic tasks, 6 models, per-phase call budget, cost table |
| Mixed Benchmark Plan | Mixed benchmark plan: real benchmark datasets + OpenAI models |
| Paper Dual-Track Plan | Paper evaluation: dual-track design with benchmark + generative teachers |
Generate your own planning view:
coeval describe --config my_experiment.yaml --out my_experiment_plan.html
Generate all reports from a completed run:
coeval analyze all --run ./Runs/my-experiment-v1 --out ./reports
| Report | Description |
|---|---|
| Dashboard | Overview dashboard — all reports in one place with top-line rankings and navigation |
| Student Report | Per-student score breakdowns, task rankings, rubric factor heatmaps |
| Judge Consistency | Inter-judge ICC agreement, calibration drift, flagged uncertain items |
| Robust Summary | Final model rankings with confidence intervals and robust ensemble weights |
| Score Distribution | High / Medium / Low histograms filterable by task, teacher, student, and judge |
| Teacher Report | Per-teacher source quality, attribute stratum coverage, data consistency |
| Interaction Matrix | Teacher × Student pair quality heatmap |
| Coverage Summary | Attribute Coverage Ratio (ACR) and rare-attribute recall per task |
| Judge Report | Judge-level bias rates, score calibration, inter-rater reliability |
Q: How do I generate all reports at once after a run completes?
A: Run coeval analyze all --run ./eval_runs/my-experiment-v1 --out ./reports. This generates all eight HTML reports plus the Excel workbook in the specified output directory. You can also call individual report types by replacing all with the report name (e.g., student-report, judge-report).
Q: What reports does CoEval generate and what is each one for?
A: CoEval generates eight HTML reports: score-distribution (judge score histograms), teacher-report (attribute coverage and data quality), judge-report (bias detection and calibration), student-report (per-model performance and rankings), interaction-matrix (teacher × student score heatmap), judge-consistency (inter-judge agreement and ICC), coverage-summary (attribute stratum coverage and surface bias), and robust-summary (outlier-robust rankings with confidence intervals). A complete-report Excel workbook is also available.
Q: Is there a programmatic API to access metrics without generating HTML?
A: Yes. Import load_ees from analyzer.loader and the metric functions from analyzer.metrics to work with the data directly in Python. For example, composite_score_by_student(model) returns a dict of mean normalized scores per student, and judge_consistency(model) returns ICC values per task. See the Programmatic API section above for a complete example.
Q: Can I generate reports on a run that is still in progress?
A: Yes. Pass --partial-ok to any coeval analyze or python -m analyzer.main command. All reports will render with a warning banner indicating the run is incomplete, and statistics are marked as preliminary.
Q: What is the difference between Spearman rho and Kendall tau in the reports?
A: Both are rank correlation metrics, but they measure slightly different things. Spearman rho measures the monotone agreement between two ranked lists (used to validate judge scores against benchmark ground truth). Kendall tau measures pairwise concordance — the fraction of all pairs where two rankings agree on relative order. CoEval uses tau for student ranking comparisons and rho for benchmark validation.
Q: How do I export results to Excel for stakeholder review?
A: Run coeval analyze complete-report --run ./eval_runs/my-experiment-v1 --out ./reports or use python -m analyzer.main --run-path <path> --format excel --out-file analysis.xlsx. The workbook includes Summary, StudentScores, TeacherCoverage, JudgeAgreement, and FailedRecords sheets.