Problem
generate_model_catalog.py hardcodes ELO scores in a static ELO_SCORES dict. This causes several issues:
- Scores go stale — ELO scores drift as new votes come in on the Arena leaderboard, but our dict stays frozen
- New models are invisible — models not in the dict get no score and may be filtered out
- Multiple leaderboards — there are different coding leaderboards (Code Arena WebDev, Text Arena Coding) with very different scores for the same model (e.g., Grok scores ~1450 on Text Arena Coding but ~1200 on Code Arena WebDev)
- Manual errors — we recently shipped gpt-5.2-codex at 1471 (the score for gpt-5.2-high, a different model) when its actual score is 1336
Proposed Solution
Agentically fetch ELO scores at generation time instead of maintaining a static dict:
- Scrape or API-fetch the Arena leaderboard (e.g.,
arena.ai/leaderboard/code) to get current scores - Fuzzy-match Arena model names to litellm model IDs (e.g.,
gpt-5-medium→gpt-5,claude-opus-4-6→claude-opus-4-6) - Fall back to the static dict only for models not on any leaderboard (local models like
lm_studio/, niche providers) - Cache fetched scores with a TTL so we don't hit the leaderboard on every run
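A minimal sketch of that flow, assuming the leaderboard can be fetched as JSON over HTTP; the URL, response shape, cache path, and helper names here are illustrative, not the real Arena API:

```python
import json
import time
import urllib.request
from pathlib import Path

CACHE_PATH = Path(".cache/arena_elo.json")  # hypothetical cache location
CACHE_TTL_S = 24 * 60 * 60                  # refresh at most once a day

# Static scores kept only as a fallback (local models, niche providers, fetch failures).
FALLBACK_ELO_SCORES = {"gpt-5.2-codex": 1336}  # example entry from this issue


def fetch_arena_scores(url: str = "https://arena.ai/leaderboard/code") -> dict:
    """Return current ELO scores, using a TTL cache and a graceful static fallback."""
    # Serve from the cache if it is fresh enough
    if CACHE_PATH.exists() and time.time() - CACHE_PATH.stat().st_mtime < CACHE_TTL_S:
        return json.loads(CACHE_PATH.read_text())
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            rows = json.load(resp)  # assumed shape: [{"model": ..., "elo": ...}, ...]
        scores = {row["model"]: float(row["elo"]) for row in rows}
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(json.dumps(scores))
        return scores
    except Exception:
        # Network issues, rate limits, schema changes: fall back to the static scores
        return dict(FALLBACK_ELO_SCORES)
```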
Context
The current ELO_SCORES dict has ~40 entries with inline comments like # [CODE] #22 and # [EST] to track provenance. Many [EST] entries are rough guesses. Fresh leaderboard data would eliminate all of these.
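For reference, the dict being replaced looks roughly like this; only the gpt-5.2 figures are taken from this issue, the other entries and tags are illustrative placeholders:

```python
ELO_SCORES = {
    "gpt-5.2-high": 1471,           # [CODE] #22 — read off the coding leaderboard
    "gpt-5.2-codex": 1336,          # previously mis-entered as 1471 (gpt-5.2-high's score)
    "lm_studio/local-model": 1100,  # [EST] rough guess; not on any leaderboard
}
```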
Acceptance Criteria
- generate_model_catalog.py fetches current ELO scores from at least one Arena leaderboard
- Graceful fallback to static scores if the fetch fails (network issues, rate limits)
- Model name fuzzy matching handles Arena ↔ litellm naming differences (see the matching sketch below)
- Generated llm_model.csv has accurate, up-to-date scores without manual dict maintenance
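One possible fuzzy-matching approach using the standard-library difflib; the normalisation rule (stripping -low/-medium/-high suffixes) is an assumption about the naming differences, not a confirmed convention:

```python
import difflib
import re


def match_arena_to_litellm(arena_name, litellm_ids):
    """Return the closest litellm model ID for an Arena leaderboard name, or None."""
    # Strip reasoning-effort style suffixes, e.g. gpt-5-medium -> gpt-5 (assumption)
    normalized = re.sub(r"-(low|medium|high)$", "", arena_name.lower())
    if normalized in litellm_ids:
        return normalized
    matches = difflib.get_close_matches(normalized, litellm_ids, n=1, cutoff=0.8)
    return matches[0] if matches else None


# Mapping examples from this issue: gpt-5-medium -> gpt-5,
# claude-opus-4-6 -> claude-opus-4-6 (same name on both sides).
print(match_arena_to_litellm("gpt-5-medium", ["gpt-5", "claude-opus-4-6"]))     # gpt-5
print(match_arena_to_litellm("claude-opus-4-6", ["gpt-5", "claude-opus-4-6"]))  # claude-opus-4-6
```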