No new workflow in @mux/ai is considered "done" until it ships with attached eval coverage that can be run locally and in CI.
This library uses Evalite for AI evaluation testing. Evals measure the efficacy, efficiency, and expense of AI workflows across multiple providers, enabling data-driven decisions about model selection and prompt optimization.
View the latest evaluation results →
Results are published automatically on every push to main, so the dashboard always reflects the current state of the library's default models and prompts.
Every eval in this library measures workflows against three dimensions:

**Efficacy**

- Does the model produce accurate, high-quality results?
- Are outputs properly formatted and schema-compliant?
- Does the model avoid common failure modes (hallucinations, filler phrases)?
- How does output quality compare across providers?

**Efficiency**

- How many tokens does it consume?
- What's the wall-clock latency from request to response?
- Is token usage within efficient operating ranges?

**Expense**

- What does each request cost across providers?
- How do costs compare for equivalent quality?
- Where are the opportunities for prompt optimization?
This framework enables systematic evaluation of default model selections across all supported providers and helps users understand the tradeoffs between OpenAI, Anthropic, and Google.
Not all workflows can measure all 3 E's with equal precision from day one:

- **Efficacy** can be challenging to dial in: defining ground truth, building representative test sets, and calibrating quality thresholds takes iteration. For some workflows (translation quality, creative summarization), efficacy measurement may evolve over time.
- **Efficiency and Expense** are always measurable. Token counts, latency, and costs are objective metrics that can establish early signals for any workflow, even before efficacy scoring is fully developed.
- **Foundational model workflows** (those relying exclusively on OpenAI, Anthropic, or Google) should target all 3 E's. These workflows have predictable inputs/outputs and can leverage scorers such as semantic similarity and faithfulness (useful for translations) for efficacy measurement.

When adding a new workflow, start with Efficiency and Expense coverage immediately, then iterate on Efficacy as you build confidence in ground truth data.
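That day-one Efficiency and Expense coverage can be as small as two pure scoring functions. A minimal sketch, assuming linear decay between thresholds (the threshold values and function names here are illustrative, not the library's real budgets):

```typescript
// Map wall-clock latency to a 0-1 score: full marks at or below goodMs,
// decaying linearly to 0 at maxMs. Thresholds are illustrative defaults.
function latencyScore(latencyMs: number, goodMs = 5000, maxMs = 12000): number {
  if (latencyMs <= goodMs) return 1;
  if (latencyMs >= maxMs) return 0;
  return (maxMs - latencyMs) / (maxMs - goodMs);
}

// Map total token usage to a 0-1 score against a token budget:
// 1 while under budget, shrinking proportionally once over it.
function tokenScore(totalTokens: number, budget = 4000): number {
  if (totalTokens <= 0) return 1;
  return Math.min(1, budget / totalTokens);
}
```

Scorers like these need no ground truth, so they can gate a new workflow in CI while the efficacy test set is still being built.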
```bash
# Run evals once and serve the UI
npm run test:eval

# Or run directly with evalite
npx evalite serve tests/eval
```

This runs all `*.eval.ts` files in one pass and opens the Evalite UI at http://localhost:3006 for exploring results. There is no watch mode; manually re-run when you're ready to test changes.
By default, evals run against provider default models only:

- `openai:gpt-5.1`
- `anthropic:claude-sonnet-4-5`
- `google:gemini-3-flash-preview`

To run all configured models in `LANGUAGE_MODELS`:

```bash
npx tsx scripts/export-evalite-results.ts --model-set all
```

To run an explicit list:

```bash
npx tsx scripts/export-evalite-results.ts --models openai:gpt-5.1,openai:gpt-5-mini,google:gemini-2.5-flash
```

The same behavior is available via env vars:

```bash
MUX_AI_EVAL_MODEL_SET=default|all
MUX_AI_EVAL_MODELS=provider:model,provider:model  # takes precedence over MUX_AI_EVAL_MODEL_SET
```
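The precedence between the two variables can be sketched as a small pure function. This is a hypothetical illustration of the documented behavior; the real resolution lives inside the library:

```typescript
// Hypothetical sketch of model-set resolution: MUX_AI_EVAL_MODELS, when set,
// takes precedence over MUX_AI_EVAL_MODEL_SET; otherwise "all" or the
// provider defaults apply.
function resolveEvalModels(
  env: { MUX_AI_EVAL_MODELS?: string; MUX_AI_EVAL_MODEL_SET?: string },
  allModels: string[],
  defaultModels: string[],
): string[] {
  if (env.MUX_AI_EVAL_MODELS) {
    // Explicit comma-separated list wins.
    return env.MUX_AI_EVAL_MODELS.split(",").map(m => m.trim()).filter(Boolean);
  }
  return env.MUX_AI_EVAL_MODEL_SET === "all" ? allModels : defaultModels;
}
```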
Running a single eval file:

```bash
# Run in CLI only (no UI)
npx evalite summarization.eval.ts

# Run and serve UI
npx evalite serve summarization.eval.ts
```

Evals run automatically on pushes to main (or via manual workflow dispatch). The CI job executes the evals, exports the JSON output, and posts the raw results to the Evalite API used by evaluating-mux-ai.
For local development/testing:

```bash
# Run evals and export results as a dry run (inspect without publishing)
npm run evalite:post-results:dev
```

For production (internal maintainers only):

⚠️ The production script posts results to the live Evalite dashboard and is not intended for OSS contributors. It requires internal credentials and should only be run by project maintainers.

```bash
# Run evals, export results, and post to production endpoint
npm run evalite:post-results:production
```

The post step requires `EVALITE_RESULTS_ENDPOINT` (the full URL to `/api/evalite-results`) and uses `EVALITE_INGEST_SECRET` as the shared-secret header.
Each eval follows a consistent structure (`workflowFunction` and the `score*` helpers below are placeholders for workflow-specific code):

```ts
import { evalite } from "evalite";
import { reportTrace } from "evalite/traces";

evalite("Workflow Name", {
  // Test data with inputs and expected outputs
  data: [
    {
      input: { assetId: "...", provider: "openai" },
      expected: { /* ground truth */ },
    },
  ],

  // The task to evaluate
  task: async (input) => {
    const startTime = performance.now();
    const result = await workflowFunction(input);
    const latencyMs = performance.now() - startTime;

    // Report trace for the UI
    reportTrace({
      input,
      output: result,
      usage: result.usage,
      start: startTime,
      end: startTime + latencyMs,
    });

    return { ...result, latencyMs };
  },

  // Scorers measure different aspects; each returns a score in [0, 1]
  scorers: [
    // Efficacy scorers
    { name: "accuracy", scorer: ({ output, expected }) => scoreAccuracy(output, expected) },
    // Efficiency scorers
    { name: "latency-performance", scorer: ({ output }) => scoreLatency(output.latencyMs) },
    { name: "token-efficiency", scorer: ({ output }) => scoreTokens(output.usage) },
    // Expense scorers
    { name: "cost-within-budget", scorer: ({ output }) => scoreCost(output.usage) },
  ],
});
```

**Efficacy scorers** measure output quality against ground truth. Examples include:
- Detection/Classification Accuracy — Does the output match expected labels?
- Confidence Calibration — Are confidence scores appropriately high/low?
- Response Integrity — Are all fields valid and properly formatted?
- Semantic Similarity — Do outputs match reference answers semantically?
- No Filler Phrases — Does output avoid meta-descriptive language?
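As one concrete illustration, a "no filler phrases" check can be a simple substring scorer. A hedged sketch; the phrase list here is a made-up example, not the library's actual list:

```typescript
// Illustrative "no filler phrases" efficacy scorer. The phrase list is a
// hypothetical example; a real scorer would use the workflow's own list.
const FILLER_PHRASES = ["as an ai", "here is a summary", "this video shows"];

// Returns 1 for clean output, with the score dropping for each phrase found.
function noFillerScore(output: string): number {
  const text = output.toLowerCase();
  const hits = FILLER_PHRASES.filter(phrase => text.includes(phrase)).length;
  return Math.max(0, 1 - hits / FILLER_PHRASES.length);
}
```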
**Efficiency scorers** measure performance characteristics:
- Latency Performance — Wall clock time normalized against thresholds
- Token Efficiency — Total tokens normalized against budget
**Expense scorers** measure cost characteristics:
- Usage Data Present — Validates token usage is returned
- Cost Within Budget — Estimated USD cost normalized against threshold
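A cost-within-budget scorer is typically just a normalization around a USD threshold. A sketch under assumed shapes (the default budget and linear decay are illustrative choices, not the library's):

```typescript
// Illustrative expense scorer: full score at or under budget, decaying
// linearly to 0 at twice the budget. Default budget and decay shape are
// assumptions for this example.
function costWithinBudget(estimatedUsd: number, budgetUsd = 0.012): number {
  if (estimatedUsd <= budgetUsd) return 1;
  return Math.max(0, (2 * budgetUsd - estimatedUsd) / budgetUsd);
}
```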
When adding a new workflow, create a corresponding eval file:

- Create `tests/eval/{workflow-name}.eval.ts`
- Define test assets with ground truth expectations
- Implement scorers for each of the 3 E's
- Run locally to verify: `npx evalite serve tests/eval/{workflow-name}.eval.ts`
Example thresholds to consider:

```ts
// Efficacy
const CONFIDENCE_THRESHOLD = 0.8;

// Efficiency
const LATENCY_THRESHOLD_GOOD_MS = 5000;
const LATENCY_THRESHOLD_ACCEPTABLE_MS = 12000;
const TOKEN_THRESHOLD_EFFICIENT = 4000;

// Expense
const COST_THRESHOLD_USD = 0.012;
```

All evals iterate over `EVAL_MODEL_CONFIGS`, which is resolved at runtime and can represent:

- provider defaults only (`MUX_AI_EVAL_MODEL_SET=default`, the default)
- all configured models (`MUX_AI_EVAL_MODEL_SET=all`)
- an explicit list (`MUX_AI_EVAL_MODELS=provider:model,...`)
The `EVAL_MODEL_CONFIGS` constant provides the flattened (provider, model) pairs for each run:

```ts
import { EVAL_MODEL_CONFIGS } from "../../src/lib/providers";

const data = EVAL_MODEL_CONFIGS.flatMap(({ provider, modelId }) =>
  testAssets.map(asset => ({
    input: { assetId: asset.assetId, provider, model: modelId },
    expected: asset.expected,
  })),
);
```

This enables side-by-side comparison of:

- Quality differences between providers and models
- Latency characteristics
- Token consumption patterns
- Cost per request
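To turn per-row scores into such a side-by-side view, results can be averaged per model. A small illustrative helper, not part of the library (Evalite's UI performs this kind of aggregation itself):

```typescript
// Average a scorer's results per model, e.g. for a side-by-side table.
function averageScoreByModel(rows: { model: string; score: number }[]): Map<string, number> {
  const acc = new Map<string, { total: number; count: number }>();
  for (const { model, score } of rows) {
    const entry = acc.get(model) ?? { total: 0, count: 0 };
    entry.total += score;
    entry.count += 1;
    acc.set(model, entry);
  }
  // Reduce each model's running totals to a mean score.
  return new Map([...acc].map(([model, { total, count }]) => [model, total / count]));
}
```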
Evals calculate estimated costs using per-model pricing for all supported models:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| OpenAI | gpt-5.1 (default) | $1.25 | $10.00 |
| OpenAI | gpt-5-mini | $0.25 | $2.00 |
| Anthropic | claude-sonnet-4-5 (default) | $3.00 | $15.00 |
| Google | gemini-3-flash-preview (default) | $0.50 | $3.00 |
| Google | gemini-2.5-flash | $0.30 | $2.50 |
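Given per-1M-token prices, a request's estimated cost is a weighted sum of input and output tokens. A sketch of that arithmetic; the pricing map mirrors the table above and should be re-verified against provider pricing pages:

```typescript
// Per-1M-token prices copied from the pricing table; verify periodically.
const PRICING_PER_1M: Record<string, { input: number; output: number }> = {
  "openai:gpt-5.1": { input: 1.25, output: 10.0 },
  "openai:gpt-5-mini": { input: 0.25, output: 2.0 },
  "anthropic:claude-sonnet-4-5": { input: 3.0, output: 15.0 },
  "google:gemini-3-flash-preview": { input: 0.5, output: 3.0 },
  "google:gemini-2.5-flash": { input: 0.3, output: 2.5 },
};

// cost = inputTokens * inputPrice/1M + outputTokens * outputPrice/1M
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICING_PER_1M[model];
  if (!price) throw new Error(`No pricing configured for ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}
```

For example, a gpt-5-mini call with 1,000 input and 500 output tokens comes to roughly $0.00125.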
Pricing sources (verify periodically):