Add local model support: run order, warmup, and per-scenario metrics #4
Open
jeanfbrito wants to merge 2 commits into stevibe:main from
Local model servers like llama-swap load models into the GPU on demand, causing timeouts when the benchmark swaps models between scenarios.

- `RUN_ORDER=model` runs all 15 scenarios per model before swapping
- `WARMUP_ENABLED=true` sends a `max_tokens=1` request to preload each model
- Dashboard shows a "Warming up" status during the GPU loading phase
- Updated README and `.env.example` with local model configuration docs
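The run-order change amounts to swapping the loop nesting. A minimal sketch, where `runScenario` is a hypothetical stand-in for the orchestrator's actual per-scenario runner in `lib/orchestrator.ts`:

```typescript
type RunOrder = "scenario" | "model";

// Hypothetical runner; the real orchestration logic lives in lib/orchestrator.ts.
async function runAll(
  models: string[],
  scenarios: string[],
  order: RunOrder,
  runScenario: (model: string, scenario: string) => Promise<void>,
): Promise<void> {
  if (order === "model") {
    // Model-first: finish all scenarios for one model before swapping,
    // so an on-demand backend like llama-swap loads each model only once.
    for (const model of models) {
      for (const scenario of scenarios) await runScenario(model, scenario);
    }
  } else {
    // Scenario-first (default): each scenario across all models,
    // which forces a model swap between every consecutive run.
    for (const scenario of scenarios) {
      for (const model of models) await runScenario(model, scenario);
    }
  }
}
```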
- Capture duration, token usage, turns, and tool call count per scenario
- Show duration directly in each result cell as it completes
- Display aggregate metrics (total time, tok/s, tokens) in score cards
- Rename `FailureDialog` to `TraceDialog` with correct pass/partial/fail status
- Extract token usage from OpenAI-compatible API responses
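The metrics capture and aggregation could look roughly like the following sketch. The shapes and field names here are illustrative; the actual `ScenarioMetrics`/`AggregateMetrics` types in `lib/benchmark.ts` and the `extractTokenUsage` helper in `lib/llm-client.ts` may differ:

```typescript
// Hypothetical shapes; the real types live in lib/benchmark.ts.
interface ScenarioMetrics {
  durationMs: number;      // wall-clock time for the scenario
  promptTokens: number;
  completionTokens: number;
  turns: number;           // assistant turns taken
  toolCalls: number;       // tool invocations made
}

interface AggregateMetrics {
  totalDurationMs: number;
  totalTokens: number;
  tokensPerSecond: number;
}

// OpenAI-compatible responses carry a `usage` object with snake_case
// fields; backends that omit it yield zeros rather than throwing.
function extractTokenUsage(response: unknown): { promptTokens: number; completionTokens: number } {
  const usage = (response as { usage?: Record<string, number> })?.usage ?? {};
  return {
    promptTokens: usage.prompt_tokens ?? 0,
    completionTokens: usage.completion_tokens ?? 0,
  };
}

// Fold per-scenario metrics into the score-card aggregates.
function aggregate(scenarios: ScenarioMetrics[]): AggregateMetrics {
  const totalDurationMs = scenarios.reduce((sum, m) => sum + m.durationMs, 0);
  const totalTokens = scenarios.reduce(
    (sum, m) => sum + m.promptTokens + m.completionTokens, 0);
  return {
    totalDurationMs,
    totalTokens,
    tokensPerSecond: totalDurationMs > 0 ? totalTokens / (totalDurationMs / 1000) : 0,
  };
}
```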
Summary
Three features that make ToolCall-15 practical for benchmarking local GPU models served by on-demand backends like llama-swap.
All new env vars are opt-in — existing setups without them continue to work identically.
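The opt-in defaulting can be read along these lines. A minimal sketch, assuming the vars are parsed from `process.env`; only the `RUN_ORDER`/`WARMUP_ENABLED` names come from the PR, the helper functions are hypothetical:

```typescript
// Illustrative env parsing with safe defaults.
type RunOrder = "scenario" | "model";

function getRunOrder(env: Record<string, string | undefined>): RunOrder {
  // Anything other than an explicit "model" keeps the existing
  // scenario-first behavior, so an absent var changes nothing.
  return env.RUN_ORDER === "model" ? "model" : "scenario";
}

function isWarmupEnabled(env: Record<string, string | undefined>): boolean {
  // Absent, or any value other than "true", means no warmup.
  return env.WARMUP_ENABLED === "true";
}
```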
1. Configurable run order (`RUN_ORDER`)

- `scenario` (default): runs each scenario across all models before moving on
- `model`: runs all 15 scenarios for one model before swapping to the next — avoids constant GPU reloads

2. Model warmup (`WARMUP_ENABLED`)

When `true`, sends a lightweight `max_tokens=1` request before each model's benchmark run to trigger GPU loading (120s timeout).

3. Per-scenario performance metrics
Bonus: Trace dialog fix

`FailureDialog` → `TraceDialog` — now shows the correct status (Passed / Partial / Failed / Timed out) instead of always showing "Failed".

Changed files
- `lib/llm-client.ts`: `warmupModel()` function + `extractTokenUsage` from API responses
- `lib/benchmark.ts`: `ScenarioMetrics` and `AggregateMetrics` types, aggregate computation in `scoreModelResults()`
- `lib/orchestrator.ts`
- `components/dashboard.tsx`: `TraceDialog` with correct status
- `app/globals.css`: `.metrics-strip`, `.cell-duration`, button layout fix for `.result-icon-shell`
- `.env.example`: `RUN_ORDER` and `WARMUP_ENABLED` vars with documentation
- `README.md`

Backward compatibility
Both new env vars default safely when absent:

- `RUN_ORDER` undefined → scenario-first (existing behavior)
- `WARMUP_ENABLED` undefined → no warmup (existing behavior)

No breaking changes to types — all new fields are either optional (`metrics?: ScenarioMetrics`) or always populated by their sole producer (`scoreModelResults()`).

Test plan
- `npm run typecheck` passes
- `main`
- `RUN_ORDER=model` completes all scenarios per model before swapping
- `WARMUP_ENABLED=true` shows "Warming up" and preloads the model before scenarios
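The warmup preload exercised in the last item is just a one-token chat completion. A minimal sketch, assuming an OpenAI-compatible `/chat/completions` endpoint; the real `warmupModel()` in `lib/llm-client.ts` may differ (prompt text, auth headers, error handling):

```typescript
// Hypothetical warmup sketch: a max_tokens=1 request forces an
// on-demand backend (e.g. llama-swap) to load the model onto the GPU.
// fetchFn is injectable for testing; it defaults to the global fetch.
async function warmupModel(
  baseUrl: string,
  model: string,
  fetchFn: typeof fetch = fetch,
): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 120_000); // 120s GPU-load timeout
  try {
    const res = await fetchFn(`${baseUrl}/chat/completions`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: "ping" }],
        max_tokens: 1, // minimal generation; loading the weights is the point
      }),
      signal: controller.signal,
    });
    return res.ok;
  } catch {
    return false; // timed out or unreachable: caller can surface warmup failure
  } finally {
    clearTimeout(timer);
  }
}
```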