Add local model support: run order, warmup, and per-scenario metrics #4
Open
jeanfbrito wants to merge 2 commits into stevibe:main from
Local model servers like llama-swap load models into the GPU on demand, causing timeouts when the benchmark swaps models between scenarios.

- `RUN_ORDER=model` runs all 15 scenarios per model before swapping
- `WARMUP_ENABLED=true` sends a `max_tokens=1` request to preload each model
- Dashboard shows a "Warming up" status during the GPU loading phase
- Updated README and `.env.example` with local model configuration docs
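The run-order change amounts to swapping the loop nesting. A minimal sketch, where `runScenario` is a hypothetical stand-in for the orchestrator's actual per-scenario runner in `lib/orchestrator.ts`:

```typescript
type RunOrder = "scenario" | "model";

// Hypothetical runner; the real orchestration logic lives in lib/orchestrator.ts.
async function runAll(
  models: string[],
  scenarios: string[],
  order: RunOrder,
  runScenario: (model: string, scenario: string) => Promise<void>,
): Promise<void> {
  if (order === "model") {
    // Model-first: finish all scenarios for one model before swapping,
    // so an on-demand backend like llama-swap loads each model only once.
    for (const model of models) {
      for (const scenario of scenarios) await runScenario(model, scenario);
    }
  } else {
    // Scenario-first (default): each scenario across all models,
    // which forces a model swap between every consecutive run.
    for (const scenario of scenarios) {
      for (const model of models) await runScenario(model, scenario);
    }
  }
}
```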
- Capture duration, token usage, turns, and tool call count per scenario
- Show duration directly in each result cell as it completes
- Display aggregate metrics (total time, tok/s, tokens) in score cards
- Rename `FailureDialog` to `TraceDialog` with correct pass/partial/fail status
- Extract token usage from OpenAI-compatible API responses
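The metrics capture and aggregation could look roughly like the following sketch. The shapes and field names here are illustrative; the actual `ScenarioMetrics`/`AggregateMetrics` types in `lib/benchmark.ts` and the `extractTokenUsage` helper in `lib/llm-client.ts` may differ:

```typescript
// Hypothetical shapes; the real types live in lib/benchmark.ts.
interface ScenarioMetrics {
  durationMs: number;      // wall-clock time for the scenario
  promptTokens: number;
  completionTokens: number;
  turns: number;           // assistant turns taken
  toolCalls: number;       // tool invocations made
}

interface AggregateMetrics {
  totalDurationMs: number;
  totalTokens: number;
  tokensPerSecond: number;
}

// OpenAI-compatible responses carry a `usage` object with snake_case
// fields; backends that omit it yield zeros rather than throwing.
function extractTokenUsage(response: unknown): { promptTokens: number; completionTokens: number } {
  const usage = (response as { usage?: Record<string, number> })?.usage ?? {};
  return {
    promptTokens: usage.prompt_tokens ?? 0,
    completionTokens: usage.completion_tokens ?? 0,
  };
}

// Fold per-scenario metrics into the score-card aggregates.
function aggregate(scenarios: ScenarioMetrics[]): AggregateMetrics {
  const totalDurationMs = scenarios.reduce((sum, m) => sum + m.durationMs, 0);
  const totalTokens = scenarios.reduce(
    (sum, m) => sum + m.promptTokens + m.completionTokens, 0);
  return {
    totalDurationMs,
    totalTokens,
    tokensPerSecond: totalDurationMs > 0 ? totalTokens / (totalDurationMs / 1000) : 0,
  };
}
```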
Summary
Three features that make ToolCall-15 practical for benchmarking local GPU models served by on-demand backends like llama-swap.
All new env vars are opt-in — existing setups without them continue to work identically.
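The opt-in defaulting can be read along these lines. A minimal sketch, assuming the vars are parsed from `process.env`; only the `RUN_ORDER`/`WARMUP_ENABLED` names come from the PR, the helper functions are hypothetical:

```typescript
// Illustrative env parsing with safe defaults.
type RunOrder = "scenario" | "model";

function getRunOrder(env: Record<string, string | undefined>): RunOrder {
  // Anything other than an explicit "model" keeps the existing
  // scenario-first behavior, so an absent var changes nothing.
  return env.RUN_ORDER === "model" ? "model" : "scenario";
}

function isWarmupEnabled(env: Record<string, string | undefined>): boolean {
  // Absent, or any value other than "true", means no warmup.
  return env.WARMUP_ENABLED === "true";
}
```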
1. Configurable run order (`RUN_ORDER`)

- `scenario` (default): runs each scenario across all models before moving on
- `model`: runs all 15 scenarios for one model before swapping to the next — avoids constant GPU reloads

2. Model warmup (`WARMUP_ENABLED`)

When `true`, sends a lightweight `max_tokens=1` request before each model's benchmark run to trigger GPU loading (120s timeout).

3. Per-scenario performance metrics
Bonus: Trace dialog fix

`FailureDialog` → `TraceDialog` — now shows the correct status (Passed / Partial / Failed / Timed out) instead of always showing "Failed".

Changed files
- `lib/llm-client.ts`: `warmupModel()` function + `extractTokenUsage` from API responses
- `lib/benchmark.ts`: `ScenarioMetrics` and `AggregateMetrics` types, aggregate computation in `scoreModelResults()`
- `lib/orchestrator.ts`
- `components/dashboard.tsx`: `TraceDialog` with correct status
- `app/globals.css`: `.metrics-strip`, `.cell-duration`, button layout fix for `.result-icon-shell`
- `.env.example`: `RUN_ORDER` and `WARMUP_ENABLED` vars with documentation
- `README.md`

Backward compatibility
Both new env vars default safely when absent:

- `RUN_ORDER` undefined → scenario-first (existing behavior)
- `WARMUP_ENABLED` undefined → no warmup (existing behavior)

No breaking changes to types — all new fields are either optional (`metrics?: ScenarioMetrics`) or always populated by their sole producer (`scoreModelResults()`).

Test plan
- `npm run typecheck` passes
- `main`
- `RUN_ORDER=model` completes all scenarios per model before swapping
- `WARMUP_ENABLED=true` shows "Warming up" and preloads the model before scenarios
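The warmup preload exercised in the last item is just a one-token chat completion. A minimal sketch, assuming an OpenAI-compatible `/chat/completions` endpoint; the real `warmupModel()` in `lib/llm-client.ts` may differ (prompt text, auth headers, error handling):

```typescript
// Hypothetical warmup sketch: a max_tokens=1 request forces an
// on-demand backend (e.g. llama-swap) to load the model onto the GPU.
// fetchFn is injectable for testing; it defaults to the global fetch.
async function warmupModel(
  baseUrl: string,
  model: string,
  fetchFn: typeof fetch = fetch,
): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 120_000); // 120s GPU-load timeout
  try {
    const res = await fetchFn(`${baseUrl}/chat/completions`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: "ping" }],
        max_tokens: 1, // minimal generation; loading the weights is the point
      }),
      signal: controller.signal,
    });
    return res.ok;
  } catch {
    return false; // timed out or unreachable: caller can surface warmup failure
  } finally {
    clearTimeout(timer);
  }
}
```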