Auto-fetch ELO scores instead of hardcoding in generate_model_catalog.py #550

@gltanaka

Problem

generate_model_catalog.py hardcodes ELO scores in a static ELO_SCORES dict. This causes several issues:

  • Scores go stale — ELO scores drift as new votes come in on the Arena leaderboard, but our dict stays frozen
  • New models are invisible — models not in the dict get no score and may be filtered out
  • Multiple leaderboards — there are different coding leaderboards (Code Arena WebDev, Text Arena Coding) with very different scores for the same model (e.g., Grok scores ~1450 on Text Arena Coding but ~1200 on Code Arena WebDev)
  • Manual errors — we recently shipped gpt-5.2-codex at 1471 (the score for gpt-5.2-high, a different model) when its actual score is 1336

Proposed Solution

Agentically fetch ELO scores at generation time instead of maintaining a static dict:

  1. Scrape or API-fetch the Arena leaderboard (e.g., arena.ai/leaderboard/code) to get current scores
  2. Fuzzy-match Arena model names to litellm model IDs (e.g., gpt-5-medium → gpt-5, claude-opus-4-6 → claude-opus-4-6)
  3. Fall back to the static dict only for models not on any leaderboard (local models like lm_studio/, niche providers)
  4. Cache fetched scores with a TTL so we don't hit the leaderboard on every run
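A minimal sketch of the name matching in step 2, using only the standard library's difflib. The suffix list, the 0.85 cutoff, and the names normalize and match_arena_to_litellm are assumptions for illustration, not existing code in generate_model_catalog.py.

```python
import difflib

# Assumption: Arena names may carry effort/variant suffixes that litellm IDs lack.
VARIANT_SUFFIXES = ("-high", "-medium", "-low", "-thinking")


def normalize(name: str) -> str:
    """Lowercase, drop any provider prefix, and strip one trailing variant suffix."""
    name = name.lower().split("/")[-1]
    for suffix in VARIANT_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name


def match_arena_to_litellm(arena_name, litellm_ids, cutoff=0.85):
    """Return the litellm ID that best matches an Arena name, or None."""
    by_norm = {normalize(model_id): model_id for model_id in litellm_ids}
    target = normalize(arena_name)
    # Prefer an exact match on the normalized name.
    if target in by_norm:
        return by_norm[target]
    # Otherwise accept only a very close match; a high cutoff avoids pairing
    # different models that merely share a prefix.
    close = difflib.get_close_matches(target, list(by_norm), n=1, cutoff=cutoff)
    return by_norm[close[0]] if close else None
```

With a cutoff this strict, gpt-5.2-high does not match gpt-5.2-codex, which is the kind of mix-up described under Problem. A looser cutoff would match more names at the cost of more false pairings, so unmatched names are probably better logged and reviewed than guessed.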

Context

The current ELO_SCORES dict has ~40 entries with inline comments like # [CODE] #22 and # [EST] to track provenance. Many [EST] entries are rough guesses. Fresh leaderboard data would replace these guesses for every model that appears on a leaderboard.

Acceptance Criteria

  • generate_model_catalog.py fetches current ELO scores from at least one Arena leaderboard
  • Graceful fallback to static scores if fetch fails (network issues, rate limits)
  • Model name fuzzy matching handles Arena ↔ litellm naming differences
  • Generated llm_model.csv has accurate, up-to-date scores without manual dict maintenance
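The fallback criterion, together with the TTL cache from the proposal, could be sketched roughly as below. Everything here is hypothetical: load_scores, the 24-hour TTL, the JSON cache layout, and the choice to reuse a stale cache before falling back to the static dict are assumptions. The fetch argument is a caller-supplied callable returning a {model_id: elo} dict, so the sketch does not assume any particular leaderboard API.

```python
import json
import time
from pathlib import Path

CACHE_TTL_SECONDS = 24 * 60 * 60  # assumed TTL: one day


def load_scores(fetch, static_scores, cache_path, ttl=CACHE_TTL_SECONDS, now=None):
    """Return (scores, source), where source says where the scores came from."""
    now = time.time() if now is None else now
    cache_path = Path(cache_path)

    # Read the cache if present; treat an unreadable file as no cache.
    cached = None
    if cache_path.exists():
        try:
            cached = json.loads(cache_path.read_text())
        except (OSError, json.JSONDecodeError):
            cached = None

    # Fresh cache: skip the network entirely.
    if cached and now - cached["fetched_at"] < ttl:
        return {**static_scores, **cached["scores"]}, "cache"

    try:
        fresh = fetch()
    except Exception:
        # Fetch failed (network issue, rate limit): degrade gracefully.
        if cached:
            return {**static_scores, **cached["scores"]}, "stale-cache"
        return dict(static_scores), "static"

    cache_path.write_text(json.dumps({"fetched_at": now, "scores": fresh}))
    # Fetched scores override static ones; static entries remain only for
    # models the leaderboard does not list.
    return {**static_scores, **fresh}, "fetched"
```

Returning the source alongside the scores would let the generator log whether a given llm_model.csv was built from live, cached, or static data.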
