
Add local model support: run order, warmup, and per-scenario metrics#4

Open
jeanfbrito wants to merge 2 commits into stevibe:main from jeanfbrito:feature/local-model-warmup-and-run-order

Conversation


@jeanfbrito jeanfbrito commented Mar 29, 2026

Summary

Three features that make ToolCall-15 practical for benchmarking local GPU models served by on-demand backends like llama-swap.

All new env vars are opt-in — existing setups without them continue to work identically.

1. Configurable run order (RUN_ORDER)

  • scenario (default): runs each scenario across all models before moving on
  • model: runs all 15 scenarios for one model before swapping to the next — avoids constant GPU reloads
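The two orderings amount to swapping the nesting of two loops. A minimal sketch — `planRuns` and its parameters are hypothetical stand-ins, not the orchestrator's actual internals:

```typescript
// Sketch of the two RUN_ORDER strategies as a run plan of
// (model, scenario) pairs. Names here are illustrative only.
type RunOrder = "scenario" | "model";

function planRuns(
  models: string[],
  scenarios: string[],
  order: RunOrder,
): Array<[string, string]> {
  const runs: Array<[string, string]> = [];
  if (order === "model") {
    // Model-first: finish every scenario for one model before swapping,
    // so an on-demand backend loads each model into the GPU only once.
    for (const model of models) {
      for (const scenario of scenarios) runs.push([model, scenario]);
    }
  } else {
    // Scenario-first (default): run each scenario across all models,
    // which swaps models between consecutive runs.
    for (const scenario of scenarios) {
      for (const model of models) runs.push([model, scenario]);
    }
  }
  return runs;
}
```

With two models, model-first touches each model in one contiguous block, while scenario-first alternates between them on every run.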

2. Model warmup (WARMUP_ENABLED)

  • When true, sends a lightweight max_tokens=1 request before each model's benchmark run to trigger GPU loading (120s timeout)
  • Dashboard shows a "Warming up" indicator with the model name during loading
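The warmup payload could look like the following sketch. The PR's real `warmupModel()` lives in lib/llm-client.ts; this assumes an OpenAI-compatible chat completions endpoint, and the helper name and field shapes are illustrative:

```typescript
// Assumed minimal payload shape for an OpenAI-compatible
// chat completions request used purely to trigger model loading.
interface WarmupRequest {
  model: string;
  max_tokens: number;
  messages: Array<{ role: string; content: string }>;
}

function buildWarmupRequest(model: string): WarmupRequest {
  return {
    model,
    // max_tokens=1 keeps the request as cheap as possible; its only
    // purpose is to make the backend load the model into the GPU.
    max_tokens: 1,
    messages: [{ role: "user", content: "ping" }],
  };
}

// The request would then be sent with fetch() under an AbortController
// whose timer fires at 120_000 ms, covering slow GPU load times.
```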

3. Per-scenario performance metrics

  • Captures duration, token usage (prompt + completion), turns, and tool call count per scenario
  • Duration shown directly in each result cell as it completes
  • Aggregate metrics (total time, avg time/scenario, tok/s, total tokens) displayed in score cards
  • Full per-scenario breakdown on hover tooltip
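The metrics above could be shaped roughly as follows. `ScenarioMetrics` and `AggregateMetrics` are named in the changed-files list (lib/benchmark.ts), but the exact fields and the aggregation helper below are assumptions for illustration:

```typescript
// Assumed field shapes; the real definitions live in lib/benchmark.ts.
interface ScenarioMetrics {
  durationMs: number;
  promptTokens: number;
  completionTokens: number;
  turns: number;
  toolCalls: number;
}

interface AggregateMetrics {
  totalMs: number;
  avgMsPerScenario: number;
  totalTokens: number;
  tokensPerSecond: number;
}

function aggregate(metrics: ScenarioMetrics[]): AggregateMetrics {
  const totalMs = metrics.reduce((s, m) => s + m.durationMs, 0);
  const totalTokens = metrics.reduce(
    (s, m) => s + m.promptTokens + m.completionTokens,
    0,
  );
  return {
    totalMs,
    avgMsPerScenario: metrics.length ? totalMs / metrics.length : 0,
    totalTokens,
    // tok/s over the combined wall-clock time of all scenarios.
    tokensPerSecond: totalMs ? totalTokens / (totalMs / 1000) : 0,
  };
}
```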

Bonus: Trace dialog fix

  • Renamed FailureDialog → TraceDialog — now shows correct status (Passed / Partial / Failed / Timed out) instead of always showing "Failed"

Changed files

  • lib/llm-client.ts: warmupModel() function + extract TokenUsage from API responses
  • lib/benchmark.ts: ScenarioMetrics and AggregateMetrics types, aggregate computation in scoreModelResults()
  • lib/orchestrator.ts: Model-first execution path, warmup integration, timing and token capture
  • components/dashboard.tsx: Warmup UI state, metrics in cells and score cards, TraceDialog with correct status
  • app/globals.css: .metrics-strip, .cell-duration, button layout fix for .result-icon-shell
  • .env.example: New RUN_ORDER and WARMUP_ENABLED vars with documentation
  • README.md: "Local Model Configuration" section, updated execution model and dashboard behavior docs

Backward compatibility

Both new env vars default safely when absent:

  • RUN_ORDER undefined → scenario-first (existing behavior)
  • WARMUP_ENABLED undefined → no warmup (existing behavior)
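These defaults can be sketched as strict opt-in parsing. The env var names come from the PR; the helper functions are hypothetical:

```typescript
// Sketch of safe-default env parsing (helper names are illustrative).
type RunOrder = "scenario" | "model";

function parseRunOrder(value: string | undefined): RunOrder {
  // Anything other than an explicit "model" keeps existing behavior.
  return value === "model" ? "model" : "scenario";
}

function parseWarmupEnabled(value: string | undefined): boolean {
  // Only an explicit "true" enables warmup.
  return value === "true";
}
```

Because both parsers fall through to the legacy behavior on `undefined` or any unrecognized value, an unchanged .env file produces an unchanged run.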

No breaking changes to types — all new fields are either optional (metrics?: ScenarioMetrics) or always populated by their sole producer (scoreModelResults()).

Test plan

  • npm run typecheck passes
  • Default config (no new env vars) behaves identically to main
  • RUN_ORDER=model completes all scenarios per model before swapping
  • WARMUP_ENABLED=true shows "Warming up" and preloads model before scenarios
  • Duration appears in each cell immediately after scenario completes
  • Score cards show aggregate metrics after run finishes
  • Clicking a pass cell shows "Passed" in trace dialog
  • Hover tooltip shows duration, turns, tool calls, tokens

Local model servers like llama-swap load models into GPU on demand,
causing timeouts when the benchmark swaps models between scenarios.

- RUN_ORDER=model runs all 15 scenarios per model before swapping
- WARMUP_ENABLED=true sends a max_tokens=1 request to preload each model
- Dashboard shows "Warming up" status during GPU loading phase
- Updated README and .env.example with local model configuration docs
- Capture duration, token usage, turns, and tool call count per scenario
- Show duration directly in each result cell as it completes
- Display aggregate metrics (total time, tok/s, tokens) in score cards
- Rename FailureDialog to TraceDialog with correct pass/partial/fail status
- Extract token usage from OpenAI-compatible API responses
jeanfbrito changed the title from "Add configurable run order and model warmup for local GPU backends" to "Add local model support: run order, warmup, and per-scenario metrics" on Mar 29, 2026
