Author: Cascade
Date: 2025-12-17
Purpose: Document the public SnakeBench and Worm Arena HTTP APIs exposed by ARC Explainer for running matches, querying game stats, and streaming live tournaments.
All endpoints described here are public and require no authentication.
The embedded SnakeBench backend powers Worm Arena (LLM Snake) inside ARC Explainer. This integration exposes two families of public endpoints:
- SnakeBench API – JSON-based endpoints under /api/snakebench/* for:
  - Running matches and batches
  - Listing and loading replays
  - Health checks
  - TrueSkill-style stats and leaderboards
  - Worm Arena "greatest hits" summaries
- Worm Arena Live Streaming API – SSE wrapper under /api/wormarena/* for watching live Worm Arena matches and multi-opponent batches.
These APIs are primarily used by:
- The Worm Arena pages in the ARC Explainer frontend
- Local tournament scripts under scripts/worm-arena-tournaments/
- External research scripts that want direct access to replay data and model stats
Endpoint:
POST /api/snakebench/run-match
Description: Run a single Worm Arena match between two LLM models via the embedded SnakeBench backend.
Request Body (JSON):
{
"modelA": "openai/gpt-5-nano",
"modelB": "moonshotai/kimi-k2-thinking",
"width": 10, // optional, default 10 (clamped to [4, 50])
"height": 10, // optional, default 10 (clamped to [4, 50])
"maxRounds": 150, // optional, default 150 (clamped to [10, 500])
"numApples": 5, // optional, default 5 (clamped to [1, 20])
"apiKey": "...", // optional BYO provider key, never stored
"provider": "openrouter" // optional, one of: openrouter | openai | anthropic | xai | gemini
}
Notes:
- modelA and modelB must be valid OpenRouter model slugs.
- ARC Explainer accepts:
  - curated OpenRouter slugs present in the central MODELS config, and
  - DB-discovered OpenRouter slugs marked active (so newly-discovered models can be used immediately).
- If apiKey + provider are supplied, the backend uses that key only for this match (BYO key); otherwise it uses server-side keys.
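The request rules above can be mirrored on the client before a match is submitted. The following is a minimal sketch, assuming a hypothetical helper named build_match_payload; the defaults and clamp ranges are the ones listed in the request body, but the server remains the source of truth. Sending the BYO key only when both apiKey and provider are present is an assumption of this sketch.

```python
from typing import Optional

# Documented (default, min, max) values for /api/snakebench/run-match.
LIMITS = {
    "width": (10, 4, 50),
    "height": (10, 4, 50),
    "maxRounds": (150, 10, 500),
    "numApples": (5, 1, 20),
}
PROVIDERS = {"openrouter", "openai", "anthropic", "xai", "gemini"}


def build_match_payload(model_a: str, model_b: str,
                        api_key: Optional[str] = None,
                        provider: Optional[str] = None,
                        **overrides: int) -> dict:
    """Build a run-match request body (hypothetical client-side helper)."""
    payload = {"modelA": model_a, "modelB": model_b}
    for name, (default, low, high) in LIMITS.items():
        value = int(overrides.get(name, default))
        payload[name] = max(low, min(high, value))  # mirror the documented clamp
    # Assumption: only attach the BYO key when both fields are provided.
    if api_key is not None and provider is not None:
        if provider not in PROVIDERS:
            raise ValueError(f"unsupported provider: {provider!r}")
        payload["apiKey"] = api_key
        payload["provider"] = provider
    return payload
```

For example, build_match_payload("openai/gpt-5-nano", "moonshotai/kimi-k2-thinking", width=100) yields a body whose width has been clamped to 50.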
Response (success):
{
"success": true,
"result": {
"gameId": "836b435a-bfcf-4a5e-be66-d87dd0d92153",
"modelA": "openai/gpt-5-nano",
"modelB": "moonshotai/kimi-k2-thinking",
"scores": {
"openai/gpt-5-nano": 12,
"moonshotai/kimi-k2-thinking": 10
},
"results": {
"openai/gpt-5-nano": "won",
"moonshotai/kimi-k2-thinking": "lost"
},
"completedGamePath": "external/SnakeBench/backend/completed_games/snake_game_836b435a-bfcf-4a5e-be66-d87dd0d92153.json"
},
"timestamp": 1733920000000
}
On failure, success=false and error contains a message.
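A client usually needs to distinguish the success envelope from the failure envelope before reading any scores. The sketch below shows one way to do that, assuming hypothetical helpers post_json and summarize_match and a placeholder BASE_URL. Only the "won" and "lost" outcome strings appear in the example response; any other outcome is reported as "no winner recorded".

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: replace with your ARC Explainer host


def post_json(path: str, payload: dict, timeout: float = 600.0) -> dict:
    """POST a JSON body and decode the JSON reply (hypothetical helper)."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize_match(response: dict) -> str:
    """Return a one-line summary of a run-match response envelope."""
    if not response.get("success"):
        # Failure envelope: success=false plus an error message.
        return f"failed: {response.get('error', 'unknown error')}"
    result = response["result"]
    winners = [slug for slug, outcome in result["results"].items() if outcome == "won"]
    scores = ", ".join(f"{slug}={score}" for slug, score in result["scores"].items())
    verdict = f"winner {winners[0]}" if winners else "no winner recorded"
    return f"{result['gameId']}: {verdict} ({scores})"
```

A typical call would be summarize_match(post_json("/api/snakebench/run-match", body)). Note that urllib raises an exception for non-2xx status codes, so a production client may need to catch urllib.error.HTTPError as well.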
Endpoint:
POST /api/snakebench/run-batch
Description: Run count sequential matches between the same pair of models.
Request Body: Same as /run-match, plus:
{
"count": 9
}
count is a small positive integer, clamped to an internal safety limit (currently 10).
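Because count is clamped, a script that wants more matches than the limit has to issue several batch requests. The following sketch splits a desired total into per-request count values; plan_batches is a hypothetical name, and the limit of 10 is the value documented at the time of writing, so it may change.

```python
from typing import List

MAX_BATCH = 10  # documented safety limit at the time of writing; may change


def plan_batches(total_matches: int, max_batch: int = MAX_BATCH) -> List[int]:
    """Split a desired number of matches into per-request count values."""
    if total_matches < 1:
        raise ValueError("total_matches must be a positive integer")
    full, rest = divmod(total_matches, max_batch)
    # Full-size batches first, then one smaller batch for the remainder.
    return [max_batch] * full + ([rest] if rest else [])
```

Each value in the returned list can be sent as the count field of a separate /api/snakebench/run-batch request.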
Response (success):
{
"success": true,
"batch": {
"results": [
{ "gameId": "...", "modelA": "...", "modelB": "...", "scores": { ... }, "results": { ... } }
// up to "count" entries
],
"errors": [
{ "index": 3, "error": "Model 'foo' not available for SnakeBench" }
]
},
"timestamp": 1733920000000
}
GET /api/snakebench/games?limit=50
- Query: limit (optional) – max number of summaries to return.
- Behavior: Only returns matches that have an available replay asset (local file, DB replay_path URL, or GitHub raw fallback). This prevents the UI from offering non-replayable matches.
- Response:
{
"success": true,
"games": [
{
"gameId": "836b435a-bfcf-4a5e-be66-d87dd0d92153",
"filename": "snake_game_836b435a-bfcf-4a5e-be66-d87dd0d92153.json",
"startedAt": "2025-12-11T02:51:32.618418",
"totalScore": 22,
"roundsPlayed": 56,
"path": "external/SnakeBench/backend/completed_games/snake_game_836b435a-bfcf-4a5e-be66-d87dd0d92153.json"
}
],
"total": 600,
"timestamp": 1733920000000
}
GET /api/snakebench/games/:gameId
- Path: gameId – SnakeBench game UUID.
- Response:
{
"success": true,
"gameId": "836b435a-bfcf-4a5e-be66-d87dd0d92153",
"data": { /* full SnakeBench replay JSON */ },
"timestamp": 1733920000000
}
If the replay asset is missing (local file, DB replay_path, and remote fallback all fail), success=false with an error message.
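Since a replay can be missing even when the game ID is valid, callers should treat the failure envelope as a normal outcome instead of an exception. A minimal sketch, assuming hypothetical helpers replay_path and extract_replay:

```python
from typing import Optional
from urllib.parse import quote


def replay_path(game_id: str) -> str:
    """Build the request path for a single replay."""
    return "/api/snakebench/games/" + quote(game_id, safe="")


def extract_replay(response: dict) -> Optional[dict]:
    """Return the replay JSON, or None when the replay asset is unavailable."""
    if response.get("success") and isinstance(response.get("data"), dict):
        return response["data"]
    # success=false (or a malformed envelope): report "no replay" to the caller.
    return None
```

A caller can then skip games for which extract_replay returns None and continue with the rest of the listing.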
GET /api/snakebench/health
- Verifies Python availability, embedded backend directory, and runner script.
- Response shape: SnakeBenchHealthResponse (see shared/types.ts).
GET /api/snakebench/recent-activity?days=7
- Query:
  - days (optional): Number of days of history (default 7). Use all to disable filtering.
- Returns aggregated recent-match stats used by the Worm Arena stats page.
GET /api/snakebench/leaderboard?limit=50&sortBy=winRate
- Query:
  - limit (optional): 1–150 (default 10).
  - sortBy (optional): winRate or gamesPlayed (default gamesPlayed).
- Response includes per-model wins, losses, ties, apples, games played.
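The query parameters above can be validated client-side before the request is sent. The sketch below builds the leaderboard path with the documented range and defaults; leaderboard_path is a hypothetical helper, and whether the server clamps or rejects out-of-range values is not stated here, so clamping locally is an assumption.

```python
from urllib.parse import urlencode

SORT_KEYS = {"winRate", "gamesPlayed"}


def leaderboard_path(limit: int = 10, sort_by: str = "gamesPlayed") -> str:
    """Build the leaderboard request path with a clamped limit."""
    if sort_by not in SORT_KEYS:
        raise ValueError(f"sortBy must be one of {sorted(SORT_KEYS)}")
    limit = max(1, min(150, int(limit)))  # documented range is 1 to 150
    return "/api/snakebench/leaderboard?" + urlencode({"limit": limit, "sortBy": sort_by})
```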
GET /api/snakebench/stats
- Returns SnakeBenchArcExplainerStats with: totalGames, activeModels, topApples, totalCost.
GET /api/snakebench/model-rating?modelSlug=openai/gpt-5-nano
- Returns TrueSkill-like snapshot and aggregate stats for a single model.
GET /api/snakebench/model-history?modelSlug=openai/gpt-5-nano&limit=50
- Query: modelSlug (required), limit (optional).
- Returns recent head-to-head match history for that model.
GET /api/snakebench/trueskill-leaderboard?limit=150&minGames=3
- Query:
  - limit (optional): Max entries (default 150).
  - minGames (optional): Minimum games per model (default 3).
- Response: SnakeBenchTrueSkillLeaderboardResponse with entries[] containing: mu, sigma, exposed, displayScore, gamesPlayed, wins, losses, ties, applesEaten, topScore, winRate, totalCost.
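External scripts sometimes want to re-rank the returned entries locally. The sketch below filters by gamesPlayed and sorts by a conservative rating computed from mu and sigma. The formula mu - 3 * sigma is the common TrueSkill convention and an assumption here; how the server derives exposed and displayScore is not documented in this section, so prefer those fields when they are sufficient.

```python
from typing import List


def conservative_rating(mu: float, sigma: float, k: float = 3.0) -> float:
    """Common TrueSkill-style conservative estimate (assumed, not confirmed)."""
    return mu - k * sigma


def rank_entries(entries: List[dict], min_games: int = 3) -> List[dict]:
    """Keep entries with enough games and sort them best-first."""
    eligible = [e for e in entries if e["gamesPlayed"] >= min_games]
    return sorted(
        eligible,
        key=lambda e: conservative_rating(e["mu"], e["sigma"]),
        reverse=True,
    )
```

Ranking by a conservative estimate penalises models with few games (large sigma), which is why a model with a lower mu can still rank above one with a higher mu.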
GET /api/snakebench/greatest-hits?limitPerDimension=5
- Purpose: Return a curated list of especially interesting Worm Arena games (longest, most expensive, highest-scoring).
- Query: limitPerDimension (optional, default 5, small number).
- Response: WormArenaGreatestHitsResponse with games[] of: gameId, startedAt, modelA, modelB, roundsPlayed, maxRounds, totalCost, maxFinalScore, scoreDelta, boardWidth, boardHeight, highlightReason (human-readable label such as "Longest game by rounds").
Important:
- The service now follows a three-tier fallback so Greatest Hits always return playable games:
  - Database ranking (public.games, public.game_participants) when DB connectivity is healthy.
  - Local replay builder that scans external/SnakeBench/backend/<completed-dir> and computes the same metrics when the DB returns zero rows.
  - Curated hall-of-fame list as the last resort.
- Each stage filters entries to ensure a replay asset actually exists before it is surfaced to clients.
- See docs/reference/data/WormArena_GreatestHits_Local_Analysis.md for local analysis details.
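The fallback order described above can be illustrated with a small generic helper. This is only a sketch of the pattern, not the service's actual code: first_playable, the source callables, and has_replay are hypothetical names, and treating a raised exception or an empty filtered list as "try the next tier" is an assumption.

```python
from typing import Callable, Iterable, List


def first_playable(sources: Iterable[Callable[[], List[dict]]],
                   has_replay: Callable[[dict], bool]) -> List[dict]:
    """Return games from the first source that yields any replayable entry."""
    for load in sources:
        try:
            games = load()
        except Exception:
            continue  # e.g. the database tier is unreachable
        # Every tier is filtered so only games with a replay asset survive.
        playable = [game for game in games if has_replay(game)]
        if playable:
            return playable
    return []
```

Passing the tiers in order (database, local replay scan, curated list) reproduces the documented priority.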
Worm Arena live matches use a two-step SSE pattern similar to the analysis streaming API:
- POST /api/wormarena/prepare – Prepare a session and store match config.
- GET /api/wormarena/stream/:sessionId – Open SSE stream, run matches, and receive events.
POST /api/wormarena/prepare
Body (multi-opponent batch mode):
{
"modelA": "openai/gpt-5-nano", // required
"opponents": [ // required in new format
"moonshotai/kimi-k2-thinking",
"mistralai/devstral-2512"
],
"width": 10, // optional
"height": 10,
"maxRounds": 150,
"numApples": 5,
"apiKey": "...", // optional BYO key
"provider": "openrouter" // optional, see SnakeBench section
}
Legacy body (count-based mode):
{
"modelA": "openai/gpt-5-nano",
"modelB": "moonshotai/kimi-k2-thinking",
"count": 9
}
- In legacy mode, modelB and count are required; the controller converts this to a repeated opponents array.
Response:
{
"success": true,
"sessionId": "abc123-session-uuid",
"expiresAt": "2025-12-11T17:00:00.000Z"
}
Sessions expire after a short TTL (currently 5 minutes) if the SSE connection is never opened.
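The relationship between the legacy body and the multi-opponent body can be sketched as a small normalisation step. The function below illustrates the documented conversion (modelB repeated count times); normalize_prepare_body is a hypothetical name and the controller's real implementation is not shown in this document.

```python
def normalize_prepare_body(body: dict) -> dict:
    """Convert a legacy count-based body into the opponents-array format."""
    normalized = dict(body)  # shallow copy so the caller's dict is untouched
    if "opponents" not in normalized:
        model_b = normalized.pop("modelB", None)
        count = normalized.pop("count", None)
        if model_b is None or count is None:
            raise ValueError("legacy mode requires both modelB and count")
        # Legacy mode means: play the same opponent `count` times.
        normalized["opponents"] = [model_b] * int(count)
    return normalized
```

For example, a legacy body with modelB set to "moonshotai/kimi-k2-thinking" and count set to 3 becomes a body whose opponents array lists that slug three times.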
GET /api/wormarena/stream/:sessionId
- Path: sessionId – ID from the prepare step.
- Protocol: Server-Sent Events (SSE). The response is a long-lived HTTP connection that emits events in event: <type> / data: <json> format.
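For scripts that cannot use a browser EventSource, the event: / data: framing can be parsed by hand. The following is a minimal sketch of an SSE line parser, assuming every data payload is JSON as described above; parse_sse is a hypothetical helper, and it ignores the id and retry fields of the SSE format.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, object]]:
    """Yield (event_type, decoded_json) pairs from decoded SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)
```

With urllib, the input can be produced from an open response object as (chunk.decode("utf-8") for chunk in response), since HTTP responses iterate line by line.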
The stream uses the shared Worm Arena streaming types defined in shared/types.ts:
- WormArenaStreamStatus
- WormArenaBatchMatchStart
- WormArenaBatchMatchComplete
- WormArenaBatchComplete
- WormArenaBatchError
- WormArenaFinalSummary (for single-match legacy mode)
Typical event flow in batch mode:
- event: batch.init – initial batch metadata
  - data: { "totalMatches": 3, "modelA": "openai/gpt-5-nano", "opponents": ["..."] }
- For each opponent in order:
  - event: batch.match.start – match index + opponent slug
  - event: stream.status – state: "in_progress", human-readable status message
  - event: batch.match.complete – per-match result: { "index": 1, "total": 3, "gameId": "...", "modelA": "openai/gpt-5-nano", "modelB": "moonshotai/kimi-k2-thinking", "scores": { ... }, "results": { ... } }
  - On per-match failure, event: batch.error with index, total, error.
- After all matches:
  - event: stream.status – state: "completed", summary message.
  - event: batch.complete – final batch summary: { "totalMatches": 3, "completedMatches": 2, "failedMatches": 1 }
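The batch event flow above lends itself to a simple accumulator on the client. The sketch below consumes already-parsed (event type, payload) pairs and tallies outcomes; tally_batch is a hypothetical helper, and it relies only on the event names and payload fields listed in this section.

```python
from typing import Iterable, Tuple


def tally_batch(events: Iterable[Tuple[str, dict]]) -> dict:
    """Accumulate a client-side summary from batch-mode stream events."""
    summary = {"totalMatches": 0, "completedMatches": 0,
               "failedMatches": 0, "gameIds": []}
    for name, data in events:
        if name == "batch.init":
            summary["totalMatches"] = data["totalMatches"]
        elif name == "batch.match.complete":
            summary["completedMatches"] += 1
            summary["gameIds"].append(data["gameId"])
        elif name == "batch.error":
            summary["failedMatches"] += 1
        elif name == "batch.complete":
            break  # final event of the batch; stop reading
        # batch.match.start and stream.status are informational here.
    return summary
```

Comparing the local tally with the payload of batch.complete is a cheap way to detect dropped events.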
In single-match legacy mode, the stream emits:
- event: stream.init – initial payload (models, timestamps)
- event: stream.status – state: "starting" / "completed"
- event: stream.complete – WormArenaFinalSummary with gameId, modelA, modelB, scores, results.
- No authentication: As with all ARC Explainer endpoints, /api/snakebench/* and /api/wormarena/* are fully public for research and small external tools.
- Cost awareness: SnakeBench replays and stats reflect real token costs based on the central MODELS pricing. Large tournaments can be expensive; keep count values small and opponent lists short.
- Replay availability: DB records can exist without matching local replay JSON files. Always handle missing replays gracefully (see docs/reference/data/WormArena_GreatestHits_Local_Analysis.md).
- SSE clients: When consuming /api/wormarena/stream/:sessionId from browsers or Node, use a standard EventSource client and listen for the event types described above.
- Tournament scripts: PowerShell helpers under scripts/worm-arena-tournaments/ are the canonical examples of how to enqueue batches against /api/snakebench/run-batch.
For high-level external API coverage across puzzles, analytics, and feedback, see docs/reference/api/EXTERNAL_API.md. This document focuses specifically on SnakeBench/Worm Arena endpoints.