
CoEval Concepts

A concise glossary of every first-class concept in CoEval, ordered thematically. Each entry includes a short explanation and a link to the authoritative documentation.


Roles

Teacher

A model assigned roles: [teacher]. Teachers participate in Phases 1–3: they infer evaluation dimensions (Phase 1), build rubrics (Phase 2), and generate the synthetic benchmark items — (prompt, reference_response) pairs — that the rest of the pipeline is evaluated against (Phase 3). A good teacher produces diverse, realistic, well-attributed data. Any model can be a teacher; large, capable models tend to produce higher-quality datasets.

Architecture — Role Assignment · Configuration — Models


Student

A model assigned roles: [student]. Students participate in Phase 4: they receive the prompts generated by teachers and produce the responses that judges will score. The student is the model under evaluation — the goal of the entire pipeline is to measure and rank student quality across rubric dimensions.

Architecture — Phase 4


Judge

A model assigned roles: [judge]. Judges participate in Phase 5: they score each student response against the rubric, producing a structured set of dimension scores. Judges receive the original prompt, the student's response, and the teacher's reference response as context. Deterministic settings (temperature: 0.0) are recommended for judges to maximise scoring consistency.

Architecture — Phase 5 · Configuration — Rubric & Evaluation Mode


Role Parameters

Per-role overrides for model inference settings. Declaring role_parameters: on a model entry applies different temperature, max_tokens, or other generation parameters depending on whether the model is acting as teacher, student, or judge — without duplicating the model entry. For example, a teacher might use temperature: 0.8 for creative data generation while the same model judges at temperature: 0.0 for deterministic scoring.

Architecture — Role Assignment · Configuration — Models
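As a sketch, a single model entry carrying per-role overrides might look like the following (the exact YAML shape of a model entry, and the model id shown, are assumptions rather than taken from this glossary):

```yaml
models:
  - name: gpt-4o              # hypothetical model id
    roles: [teacher, judge]
    role_parameters:
      teacher:
        temperature: 0.8      # creative data generation
        max_tokens: 2048
      judge:
        temperature: 0.0      # deterministic scoring
```

One entry serves both roles without duplication; each phase applies the parameters for the role the model is currently acting in.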


Data & Attributes

Datapoint

One benchmark item — a (prompt, reference_response) pair produced by a teacher in Phase 3. Each datapoint is tagged with the sampled target attribute values (and optionally nuance values) used to generate it, enabling stratified analysis. Datapoints are written to {task}__{teacher}.datapoints.jsonl and reused by all student models in Phase 4.

Architecture — Phase 3


Target Attributes

Structural dimensions that define the evaluation coverage space of a task. Declared as a dict mapping dimension names to possible values (e.g., domain: [politics, sports, tech]). Every benchmark item is tagged with one sampled value from each dimension, ensuring the evaluation dataset spans the full intended space rather than clustering around common cases. Setting target_attributes: auto delegates discovery to teacher models (Phase 1).

Configuration — Tasks · Architecture — Phase 1
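A static declaration might look like this sketch (the task name and the second dimension are illustrative; only the domain example is taken from the text above):

```yaml
tasks:
  - name: news_summarisation        # hypothetical task name
    target_attributes:
      domain: [politics, sports, tech]
      length: [short, long]          # illustrative second dimension
```

Replacing the dict with target_attributes: auto delegates discovery to teacher models in Phase 1 instead.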


Nuanced Attributes

Per-item diversity dimensions that vary from one datapoint to the next without structuring the evaluation space. Where target attributes ensure coverage, nuanced attributes prevent distribution collapse — they inject variation in tone, register, style, or other surface properties that make synthetic data behave like a real-world distribution. Nuance values are sampled independently per item (not cross-producted). Setting nuanced_attributes: auto lets teachers propose appropriate diversity dimensions.

Configuration — Sampling & Diversity
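A hedged sketch of a static declaration (the specific dimensions shown are illustrative, not built-in values):

```yaml
nuanced_attributes:
  tone: [formal, casual, terse]
  register: [expert, layperson]
# or: nuanced_attributes: auto   — let teachers propose diversity dimensions
```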


Rubric

The evaluation criteria — a dict mapping named dimension labels to scoring guidelines (e.g., relevance: "The summary accurately reflects the article's main claim."). Judges use the rubric to produce per-dimension scores (High / Medium / Low) for every student response. Setting rubric: auto delegates rubric generation to teacher models (Phase 2); rubric: extend merges new dimensions onto a rubric inherited from an earlier run.

Configuration — Rubric & Evaluation Mode · Architecture — Phase 2
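A static rubric might be declared as follows (the relevance criterion is from the text above; the second dimension is illustrative):

```yaml
rubric:
  relevance: "The summary accurately reflects the article's main claim."
  fluency: "The response is grammatical and easy to read."   # illustrative
# or: rubric: auto    — teachers generate the rubric in Phase 2
# or: rubric: extend  — merge new dimensions onto an inherited rubric
```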


Slot

A named placeholder {...} in a CoEval prompt template, filled with task-specific data before the prompt is sent to a model. Built-in slots include {task_description}, {output_description}, {target_attributes}, {nuanced_attributes}, and {rubric}. Custom templates can override the built-in ones at task level or per model using the prompt_library block in the task config, giving full control over what each model sees.

Configuration — Prompt Templates


Label Attributes

A list of target attribute keys designated as ground-truth labels for classification and information-extraction tasks. When label_attributes is set on a task, the pipeline can use judge-free exact-match scoring via LabelEvaluator — the student's response is compared directly against the reference label extracted from the datapoint, with no LLM judge required. Example: label_attributes: [sentiment] for a sentiment task; label_attributes: [entity_type, entity_value] for NER.

Configuration — Tasks · Benchmark Datasets
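A minimal sketch for the sentiment case (the task name and attribute values are assumptions):

```yaml
tasks:
  - name: sentiment_classification   # hypothetical task name
    target_attributes:
      sentiment: [positive, negative, neutral]
    label_attributes: [sentiment]    # enables judge-free exact-match scoring
```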


Task Category

An optional string field (category) that assigns each task to a display group — typically 'benchmark' for real-dataset tasks or 'synthetic' for LLM-generated tasks. The category has no effect on pipeline behaviour; it is used only for visual grouping and colour-coding in the coeval describe HTML output and analysis reports.

Configuration — Tasks


Attribute Seeding

Optional seed dictionaries (target_attributes_seed, nuanced_attributes_seed) that pre-populate part of the attribute space when target_attributes: auto or nuanced_attributes: auto is set. Phase 1 teachers propose additional values beyond the seed, enabling a hybrid manual + automatic design where you guarantee certain dimensions are always present while allowing the LLM to discover others.

Configuration — Tasks · Architecture — Phase 1
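A sketch of the hybrid design (seed values shown are illustrative):

```yaml
target_attributes: auto
target_attributes_seed:
  domain: [politics, sports]   # guaranteed to be present in every run
# Phase 1 teachers propose further dimensions and values on top of the seed
```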


Pipeline

Pipeline

The five-phase orchestration engine that drives a CoEval experiment from raw YAML config to scored results. The five phases are: Attribute Mapping → Rubric Mapping → Data Generation → Response Collection → Evaluation. Each phase is independently checkpointed and can be run in isolation; Phases 1 and 2 are optional when static attributes and rubrics are supplied.

Architecture — Pipeline Overview


Phase

One discrete stage of the pipeline, independently configurable via a phases: block in the experiment config. Each phase has an execution mode (New, Keep, Extend, Model) that controls what happens when output files already exist. All phase output files are written atomically, so a crash mid-phase loses at most one record.

| Phase | Config key | What runs |
| --- | --- | --- |
| 1 | attribute_mapping | Teachers infer target attribute dimensions |
| 2 | rubric_mapping | Teachers build evaluation rubric |
| 3 | data_generation | Teachers produce (prompt, reference) pairs |
| 4 | response_collection | Students respond to Phase 3 prompts |
| 5 | evaluation | Judges score student responses |

Architecture — Phase Details · Configuration — Phase Execution Modes
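Using the config keys from the table above, a phases: block might look like this sketch (the lowercase spelling of the mode values is an assumption; the glossary capitalises them):

```yaml
phases:
  attribute_mapping: keep      # skip existing Phase 1 output
  rubric_mapping: keep
  data_generation: extend      # append only missing datapoints
  response_collection: new
  evaluation: new
```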


Experiment

One complete evaluation run, defined by a single YAML config file. An experiment has a unique id, a storage_folder where all output artifacts are written, a set of participating models, tasks, and phase settings. Re-running the same config with --continue resumes rather than re-starts the experiment. Changing the id creates a new, independent experiment.

Configuration — Experiment Settings


Evaluation Mode

Controls how judges score responses in Phase 5. single issues one API call per response and returns all rubric dimension scores at once (lower cost). per_factor issues one call per rubric dimension per response, scoring each dimension in isolation (higher cost, finer-grained analysis, eliminates cross-dimension influence).

Architecture — Phase 5 · Configuration — Rubric & Evaluation Mode


Label Evaluation

A judge-free evaluation path for classification and information-extraction tasks. When label_attributes is declared on a task, the LabelEvaluator compares student responses directly against the reference labels from the datapoint using exact-match or custom match functions — no LLM judge call is needed in Phase 5. Supports multiclass classification, multi-label classification, and structured IE. Produces per-label precision/recall and overall accuracy alongside any LLM judge scores.

Configuration — Tasks · Developer Guide


Generation Retries

A per-experiment setting (generation_retries, default: 2) that controls how many times Phases 3 and 5 will retry a failed API call or malformed response before writing a broken record. Set to 0 to disable retries. Broken records are flagged in meta.json and can be repaired with coeval repair.

Configuration — Experiment Settings · Resume & Recovery


Batch API

Provider-side feature that groups many individual requests into a single asynchronous job, typically at a ~50% cost discount. CoEval's batch runners for OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Google Vertex AI, and Mistral handle upload, polling, and result parsing transparently. Enable per-provider in the experiment's batch: block. The rest of the pipeline sees no difference between batch and real-time results.

Architecture — Batch API · Providers & Pricing
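A sketch of per-provider enabling (the boolean-per-provider shape of the batch: block is an assumption):

```yaml
batch:
  openai: true
  anthropic: true
  mistral: false   # keep real-time calls for this provider
```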


Sampling

Controls how many attribute values are drawn for each datapoint. sampling.target: [min, max] sets how many target attribute values to sample per item; sampling.nuance: [min, max] does the same for nuanced attributes; sampling.total caps the total number of items generated per (task, teacher) pair. Use "all" for target to generate the full cross-product of attribute values.

Configuration — Sampling & Diversity
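A sketch combining the three keys above (the numbers are illustrative):

```yaml
sampling:
  target: [1, 2]    # 1–2 target attribute values per item; "all" = full cross-product
  nuance: [0, 3]    # 0–3 nuance values per item, sampled independently
  total: 200        # cap per (task, teacher) pair
```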


Robustness

Robust Teachers

Running multiple teacher models on the same task produces multiple independent sets of datapoints. Because each teacher brings different biases, world knowledge, and writing styles, the combined dataset has higher diversity and lower single-model idiosyncrasy. Downstream analysis reports teacher-level source quality, making it easy to identify which teacher produces the most useful data.

Architecture — Role Assignment · Analytics & Reports — Teacher Report


Robust Judges

Running multiple judge models on the same student responses enables multi-judge ensemble scoring. Inter-judge agreement (ICC, κ) is computed automatically; the robust summary report weights judges by their calibration consistency and produces confidence-bounded rankings that are resilient to individual judge biases or errors.

Analytics & Reports · Architecture — Phase 5


Automation & Interfaces

Auto Interface

Setting interface: auto on a model entry tells CoEval to select the cheapest available provider for the specified model at config load time. It scans the auto_routing table in Config/provider_pricing.yaml and picks the first interface for which credentials exist. The resolved interface is logged; run coeval plan to see which provider was selected before committing to a full run.

Configuration — Automatic Provider Selection · Providers & Pricing
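A minimal sketch of a model entry using auto routing (the model id and entry shape are assumptions):

```yaml
models:
  - name: llama-3-70b   # hypothetical model id
    interface: auto     # cheapest provider with credentials, per provider_pricing.yaml
    roles: [student]
```

Run coeval plan afterwards to confirm which provider was resolved.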


Interface

A provider adapter — one of the 18 supported backends CoEval can call to generate text. Each interface handles authentication, request formatting, batching, and response parsing for a specific LLM provider. Examples: openai, anthropic, gemini, bedrock, vertex, huggingface, ollama, benchmark. The interface value in a model config block selects which adapter is used.

Providers & Pricing


Benchmark Interface

A virtual interface (interface: benchmark) that replays pre-ingested model responses instead of making live API calls. Used to incorporate publicly available benchmark datasets (XSUM, ARC Challenge, RACE, etc.) as teacher data sources, or to reproduce published results offline. Responses are loaded from JSONL files created by coeval ingest.

Benchmark Datasets


Automatic Attributes

Setting target_attributes: auto (or nuanced_attributes: auto) delegates attribute discovery to teacher models in Phase 1. Teachers read the task description and propose a set of meaningful evaluation dimensions and their values. Results from multiple teachers are merged and deduplicated. Supply a static dict instead to skip Phase 1 entirely (zero LLM calls, faster runs).

Architecture — Phase 1 · Configuration — Tasks


Automatic Rubric

Setting rubric: auto delegates rubric generation to teacher models in Phase 2. Teachers propose named evaluation dimensions and scoring criteria based on the task description and target attributes. rubric: extend merges newly generated dimensions onto a rubric inherited from an earlier run via resume_from. Supply a static rubric dict to skip Phase 2 entirely.

Architecture — Phase 2 · Configuration — Rubric & Evaluation Mode


Operations & Commands

Probing

coeval probe verifies that every model in a config is reachable and responds correctly before committing to a full run. It issues a short test prompt to each model and reports pass / fail per model. The probe_mode experiment setting (full, resume, disable) controls whether probing runs automatically at the start of coeval run.

CLI Reference — probe · Quick Start


Planning

coeval plan performs a dry run — it parses the config, resolves auto interfaces, estimates cost, and prints a phase-by-phase execution plan without making any API calls. Use it to verify interface resolution, check estimated costs, and confirm phase modes before running.

CLI Reference — plan · Running Experiments


Resuming an Experiment

coeval run --config X.yaml --continue resumes an interrupted experiment from where it stopped. CoEval reads phases_completed from meta.json, skips fully completed phases, and for in-progress phases reads existing JSONL records to skip already-written items. No data is duplicated; no extra API calls are made for completed items.

Resume & Recovery — Resuming After Interruption


Forking an Experiment

coeval run --config new.yaml --resume PATH starts a new experiment that reuses Phase 1 (attributes) and Phase 2 (rubric) outputs from an existing run. Phases 3–5 run fresh with the new config's model set. Useful for adding new student models or changing judges without regenerating the teacher data.

Resume & Recovery — Forking from an Earlier Run


Repairing an Experiment

coeval repair detects and fixes corrupted or incomplete run artifacts: truncated JSONL files, missing phase outputs, mismatched record counts, and stale batch job references. It reports what is broken and what was repaired, making the run safe to resume with --continue.

CLI Reference — repair · Resume & Recovery


Wizard

coeval wizard launches an interactive setup assistant that guides you through creating a new experiment config step by step — selecting providers, defining tasks, setting sampling parameters, and choosing phase modes. The wizard writes a ready-to-run YAML file and optionally probes the selected models before exiting.

CLI Reference — wizard · Quick Start


Describe

coeval describe --config PATH generates a self-contained HTML summary of an experiment configuration — models, tasks, rubric, phase execution plan, estimated call budget, batch settings, and quotas. No API calls are made unless --probe is passed, which adds a live latency measurement per model. Useful for reviewing a config before running it or sharing a plan with stakeholders.

CLI Reference — describe


Ingest

coeval ingest --run PATH --benchmarks NAME loads pre-downloaded standard benchmark datasets (MMLU, HumanEval, TruthfulQA, HellaSwag, MedQA, GSM8K) as Phase 3 datapoints into an existing run, using a virtual <benchmark>-benchmark teacher model. After ingestion, resume the experiment with --continue to run Phases 4–5 on the new teacher. The command is idempotent — re-running it skips items already written.

CLI Reference — ingest · Benchmark Datasets


Models Command

coeval models lists all available text-generation models from every configured provider (OpenAI, Anthropic, Gemini, etc.). Use --providers to restrict the list and --verbose for detailed metadata including context window sizes. Useful for discovering the correct model ID string to use in an experiment config.

CLI Reference — models


Selective Model Execution

The --only-models MODEL[,MODEL...] flag on coeval run restricts which models participate in a run. The filter applies to the correct phase: teachers in Phase 3, students in Phase 4, judges in Phase 5. Useful for running a subset of models without re-running the full pipeline, or for parallelising expensive runs across separate processes.

CLI Reference — run


Status & Batch Polling

coeval status --run PATH shows the current completion state of an experiment — which phases are done, how many records are written, and whether any batch jobs are pending. Adding --fetch-batches polls provider APIs for in-progress batch jobs and downloads completed results, writing them to storage so the experiment can continue.

CLI Reference — status


Infrastructure

Keys File

A YAML file (keys.yaml) that stores provider API credentials outside the experiment config, keeping secrets out of version control. CoEval resolves the path in this order: --keys PATH flag → COEVAL_KEYS_FILE env var → project root keys.yaml → ~/.coeval/keys.yaml. Within the file, keys are keyed by provider name (e.g., openai: sk-...) or by model name for per-model access keys.

Quick Start — Credential Setup · Providers & Pricing
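A sketch of the file's layout (key values are elided; the per-model entry name is hypothetical):

```yaml
# keys.yaml — keep this file out of version control
openai: sk-...               # provider-level key
anthropic: sk-ant-...
my-finetuned-model: ...      # per-model access key
```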


Quota

A per-model hard ceiling on API calls, declared in the experiment's quota: block. When a model reaches its max_calls limit the pipeline logs a warning and skips further calls for that model within the current run. Prevents runaway costs when benchmarking against large attribute grids or high sampling.total values.

Configuration — Experiment Settings · Running Experiments
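A sketch of a quota: block (the nesting of max_calls under a model name is an assumption, as is the model id):

```yaml
quota:
  gpt-4o:             # hypothetical model id
    max_calls: 500    # hard ceiling; further calls are skipped with a warning
```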


Phase Execution Mode

Per-phase setting that controls what happens when output files already exist. New starts fresh and fails if files exist. Keep skips existing files without modifying them — safe for adding new models to a partial run. Extend appends only missing JSONL records without rewriting existing data. Model (Phase 3 only) reuses existing teacher datapoints without re-calling the teacher model.

Configuration — Phase Execution Modes


Storage Layout

All experiment artifacts are written under {storage_folder}/{experiment_id}/. Phase outputs follow predictable naming conventions: {task}.attributes.json (Phase 1), {task}.rubric.json (Phase 2), {task}__{teacher}.datapoints.jsonl (Phase 3), {task}__{teacher}__{student}.responses.jsonl (Phase 4), {task}__{teacher}__{judge}.evaluations.jsonl (Phase 5). A meta.json tracks phase completion and a run.log contains structured logs.

Architecture — Storage Layout
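Under the naming conventions above, a run's folder might look like this (the experiment id, task, and model names are hypothetical):

```
results/my-experiment/
├── summarisation.attributes.json
├── summarisation.rubric.json
├── summarisation__gpt-4o.datapoints.jsonl
├── summarisation__gpt-4o__llama-3-70b.responses.jsonl
├── summarisation__gpt-4o__claude-3-opus.evaluations.jsonl
├── meta.json
└── run.log
```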


Non-LLM Metrics

Reference-based evaluation metrics used alongside (or instead of) LLM judges for benchmark tasks. Implemented in Public/benchmark/compute_scores.py. Supported metrics:

| Metric | Default for | Description |
| --- | --- | --- |
| bertscore | XSum, AESLC | Semantic similarity using BERT embeddings (requires bert-score package) |
| bleu | CodeSearchNet | BLEU-4 corpus overlap score (requires nltk) |
| exact_match | MCQ tasks | Exact string match against the correct label |

Non-LLM metrics are applied via python -m benchmark.compute_scores after Phase 4 as a supplementary evaluation layer; they do not replace LLM judges but provide an objective reference score. The label_attributes feature provides per-response exact-match scoring inline with the pipeline.

Benchmark Datasets · Configuration — Tasks


See also: CLI Reference · Configuration Guide · Architecture · Tutorial