No new workflow in @mux/ai is considered "done" until it ships with attached eval coverage that can be run locally and in CI.
This library uses Evalite for AI evaluation testing. Evals measure the efficacy, efficiency, and expense of AI workflows across multiple providers, enabling data-driven decisions about model selection and prompt optimization.
View the latest evaluation results →
Results are published automatically on every push to main, so the dashboard always reflects the current state of the library's default models and prompts.
Every eval in this library measures workflows against three dimensions:

**Efficacy**

- Does the model produce accurate, high-quality results?
- Are outputs properly formatted and schema-compliant?
- Does the model avoid common failure modes (hallucinations, filler phrases)?
- How does output quality compare across providers?

**Efficiency**

- How many tokens does it consume?
- What's the wall-clock latency from request to response?
- Is token usage within efficient operating ranges?

**Expense**

- What does each request cost across providers?
- How do costs compare for equivalent quality?
- Where are the opportunities for prompt optimization?
This framework enables systematic evaluation of default model selections across all supported providers and helps users understand the tradeoffs between OpenAI, Anthropic, and Google.
Not all workflows can measure all 3 E's with equal precision from day one:

- **Efficacy** can be challenging to dial in: defining ground truth, building representative test sets, and calibrating quality thresholds takes iteration. For some workflows (translation quality, creative summarization), efficacy measurement may evolve over time.
- **Efficiency and Expense** are always measurable. Token counts, latency, and costs are objective metrics that can establish early signals for any workflow, even before efficacy scoring is fully developed.
- **Foundational model workflows** (those relying exclusively on OpenAI, Anthropic, or Google) should target all 3 E's. These workflows have predictable inputs/outputs and can leverage scorers such as semantic similarity and faithfulness (useful for translations) for efficacy measurement.

When adding a new workflow, start with Efficiency and Expense coverage immediately, then iterate on Efficacy as you build confidence in ground truth data.
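That day-one Efficiency and Expense coverage can be as small as two pure scoring functions. A minimal sketch, assuming linear decay between thresholds (the threshold values and function names here are illustrative, not the library's real budgets):

```typescript
// Map wall-clock latency to a 0-1 score: full marks at or below goodMs,
// decaying linearly to 0 at maxMs. Thresholds are illustrative defaults.
function latencyScore(latencyMs: number, goodMs = 5000, maxMs = 12000): number {
  if (latencyMs <= goodMs) return 1;
  if (latencyMs >= maxMs) return 0;
  return (maxMs - latencyMs) / (maxMs - goodMs);
}

// Map total token usage to a 0-1 score against a token budget:
// 1 while under budget, shrinking proportionally once over it.
function tokenScore(totalTokens: number, budget = 4000): number {
  if (totalTokens <= 0) return 1;
  return Math.min(1, budget / totalTokens);
}
```

Scorers like these need no ground truth, so they can gate a new workflow in CI while the efficacy test set is still being built.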
```bash
# Run evals once and serve the UI
npm run test:eval

# Or run directly with evalite
npx evalite serve tests/eval
```

This runs all `*.eval.ts` files in one pass and opens the Evalite UI at http://localhost:3006 for exploring results. There is no watch mode; manually re-run when you're ready to test changes.
By default, evals run against provider default models only:

- `openai:gpt-5.1`
- `anthropic:claude-sonnet-4-5`
- `google:gemini-3-flash-preview`

To run all configured models in `LANGUAGE_MODELS`:

```bash
npx tsx scripts/export-evalite-results.ts --model-set all
```

To run an explicit list:

```bash
npx tsx scripts/export-evalite-results.ts --models openai:gpt-5.1,openai:gpt-5-mini,google:gemini-2.5-flash
```

The same behavior is available via env vars:

```bash
MUX_AI_EVAL_MODEL_SET=default|all
MUX_AI_EVAL_MODELS=provider:model,provider:model  # takes precedence over MUX_AI_EVAL_MODEL_SET
```
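The precedence between the two variables can be sketched as a small pure function. This is a hypothetical illustration of the documented behavior; the real resolution lives inside the library:

```typescript
// Hypothetical sketch of model-set resolution: MUX_AI_EVAL_MODELS, when set,
// takes precedence over MUX_AI_EVAL_MODEL_SET; otherwise "all" or the
// provider defaults apply.
function resolveEvalModels(
  env: { MUX_AI_EVAL_MODELS?: string; MUX_AI_EVAL_MODEL_SET?: string },
  allModels: string[],
  defaultModels: string[],
): string[] {
  if (env.MUX_AI_EVAL_MODELS) {
    // Explicit comma-separated list wins.
    return env.MUX_AI_EVAL_MODELS.split(",").map(m => m.trim()).filter(Boolean);
  }
  return env.MUX_AI_EVAL_MODEL_SET === "all" ? allModels : defaultModels;
}
```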
Running a single eval file:

```bash
# Run in CLI only (no UI)
npx evalite summarization.eval.ts

# Run and serve UI
npx evalite serve summarization.eval.ts
```

Evals run automatically on pushes to main (or via manual workflow dispatch). The CI job executes the evals, exports the JSON output, and posts the raw results to the Evalite API used by evaluating-mux-ai.
For local development/testing:

```bash
# Run evals and export results as a dry run (inspect without publishing)
npm run evalite:post-results:dev
```

For production (internal maintainers only):

⚠️ The production script posts results to the live Evalite dashboard and is not intended for OSS contributors. It requires internal credentials and should only be run by project maintainers.

```bash
# Run evals, export results, and post to production endpoint
npm run evalite:post-results:production
```

The post step requires `EVALITE_RESULTS_ENDPOINT` (the full URL to `/api/evalite-results`) and uses `EVALITE_INGEST_SECRET` as the shared-secret header.
Each eval follows a consistent structure (`workflowFunction` and the `score*` helpers below are placeholders for workflow-specific code):

```ts
import { evalite } from "evalite";
import { reportTrace } from "evalite/traces";

evalite("Workflow Name", {
  // Test data with inputs and expected outputs
  data: [
    {
      input: { assetId: "...", provider: "openai" },
      expected: { /* ground truth */ },
    },
  ],

  // The task to evaluate
  task: async (input) => {
    const startTime = performance.now();
    const result = await workflowFunction(input);
    const latencyMs = performance.now() - startTime;

    // Report trace for the UI
    reportTrace({
      input,
      output: result,
      usage: result.usage,
      start: startTime,
      end: startTime + latencyMs,
    });

    return { ...result, latencyMs };
  },

  // Scorers measure different aspects; each returns a score in [0, 1]
  scorers: [
    // Efficacy scorers
    { name: "accuracy", scorer: ({ output, expected }) => scoreAccuracy(output, expected) },
    // Efficiency scorers
    { name: "latency-performance", scorer: ({ output }) => scoreLatency(output.latencyMs) },
    { name: "token-efficiency", scorer: ({ output }) => scoreTokens(output.usage) },
    // Expense scorers
    { name: "cost-within-budget", scorer: ({ output }) => scoreCost(output.usage) },
  ],
});
```

**Efficacy scorers** measure output quality against ground truth. Examples include:
- Detection/Classification Accuracy — Does the output match expected labels?
- Confidence Calibration — Are confidence scores appropriately high/low?
- Response Integrity — Are all fields valid and properly formatted?
- Semantic Similarity — Do outputs match reference answers semantically?
- No Filler Phrases — Does output avoid meta-descriptive language?
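As one concrete illustration, a "no filler phrases" check can be a simple substring scorer. A hedged sketch; the phrase list here is a made-up example, not the library's actual list:

```typescript
// Illustrative "no filler phrases" efficacy scorer. The phrase list is a
// hypothetical example; a real scorer would use the workflow's own list.
const FILLER_PHRASES = ["as an ai", "here is a summary", "this video shows"];

// Returns 1 for clean output, with the score dropping for each phrase found.
function noFillerScore(output: string): number {
  const text = output.toLowerCase();
  const hits = FILLER_PHRASES.filter(phrase => text.includes(phrase)).length;
  return Math.max(0, 1 - hits / FILLER_PHRASES.length);
}
```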
**Efficiency scorers** measure performance characteristics:
- Latency Performance — Wall clock time normalized against thresholds
- Token Efficiency — Total tokens normalized against budget
**Expense scorers** measure cost characteristics:
- Usage Data Present — Validates token usage is returned
- Cost Within Budget — Estimated USD cost normalized against threshold
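A cost-within-budget scorer is typically just a normalization around a USD threshold. A sketch under assumed shapes (the default budget and linear decay are illustrative choices, not the library's):

```typescript
// Illustrative expense scorer: full score at or under budget, decaying
// linearly to 0 at twice the budget. Default budget and decay shape are
// assumptions for this example.
function costWithinBudget(estimatedUsd: number, budgetUsd = 0.012): number {
  if (estimatedUsd <= budgetUsd) return 1;
  return Math.max(0, (2 * budgetUsd - estimatedUsd) / budgetUsd);
}
```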
When adding a new workflow, create a corresponding eval file:

- Create `tests/eval/{workflow-name}.eval.ts`
- Define test assets with ground truth expectations
- Implement scorers for each of the 3 E's
- Run locally to verify: `npx evalite serve tests/eval/{workflow-name}.eval.ts`
Example thresholds to consider:

```ts
// Efficacy
const CONFIDENCE_THRESHOLD = 0.8;

// Efficiency
const LATENCY_THRESHOLD_GOOD_MS = 5000;
const LATENCY_THRESHOLD_ACCEPTABLE_MS = 12000;
const TOKEN_THRESHOLD_EFFICIENT = 4000;

// Expense
const COST_THRESHOLD_USD = 0.012;
```

All evals iterate over `EVAL_MODEL_CONFIGS`, which is resolved at runtime and can represent:

- provider defaults only (`MUX_AI_EVAL_MODEL_SET=default`, the default)
- all configured models (`MUX_AI_EVAL_MODEL_SET=all`)
- an explicit list (`MUX_AI_EVAL_MODELS=provider:model,...`)
The `EVAL_MODEL_CONFIGS` constant provides the flattened (provider, model) pairs for each run:

```ts
import { EVAL_MODEL_CONFIGS } from "../../src/lib/providers";

const data = EVAL_MODEL_CONFIGS.flatMap(({ provider, modelId }) =>
  testAssets.map(asset => ({
    input: { assetId: asset.assetId, provider, model: modelId },
    expected: asset.expected,
  })),
);
```

This enables side-by-side comparison of:

- Quality differences between providers and models
- Latency characteristics
- Token consumption patterns
- Cost per request
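To turn per-row scores into such a side-by-side view, results can be averaged per model. A small illustrative helper, not part of the library (Evalite's UI performs this kind of aggregation itself):

```typescript
// Average a scorer's results per model, e.g. for a side-by-side table.
function averageScoreByModel(rows: { model: string; score: number }[]): Map<string, number> {
  const acc = new Map<string, { total: number; count: number }>();
  for (const { model, score } of rows) {
    const entry = acc.get(model) ?? { total: 0, count: 0 };
    entry.total += score;
    entry.count += 1;
    acc.set(model, entry);
  }
  // Reduce each model's running totals to a mean score.
  return new Map([...acc].map(([model, { total, count }]) => [model, total / count]));
}
```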
Evals calculate estimated costs using per-model pricing for all supported models:

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| OpenAI | gpt-5.1 (default) | $1.25 | $10.00 |
| OpenAI | gpt-5-mini | $0.25 | $2.00 |
| Anthropic | claude-sonnet-4-5 (default) | $3.00 | $15.00 |
| Google | gemini-3-flash-preview (default) | $0.50 | $3.00 |
| Google | gemini-2.5-flash | $0.30 | $2.50 |
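Given per-1M-token prices, a request's estimated cost is a weighted sum of input and output tokens. A sketch of that arithmetic; the pricing map mirrors the table above and should be re-verified against provider pricing pages:

```typescript
// Per-1M-token prices copied from the pricing table; verify periodically.
const PRICING_PER_1M: Record<string, { input: number; output: number }> = {
  "openai:gpt-5.1": { input: 1.25, output: 10.0 },
  "openai:gpt-5-mini": { input: 0.25, output: 2.0 },
  "anthropic:claude-sonnet-4-5": { input: 3.0, output: 15.0 },
  "google:gemini-3-flash-preview": { input: 0.5, output: 3.0 },
  "google:gemini-2.5-flash": { input: 0.3, output: 2.5 },
};

// cost = inputTokens * inputPrice/1M + outputTokens * outputPrice/1M
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICING_PER_1M[model];
  if (!price) throw new Error(`No pricing configured for ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}
```

For example, a gpt-5-mini call with 1,000 input and 500 output tokens comes to roughly $0.00125.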
Pricing sources (verify periodically):