
# LiveBench AI Leaderboard

Source: LiveBench.ai | Filter: `highunseenbias=true` | Updated: January 2026

## Top Models by Overall Performance

| Rank | Model | Organization | Global Average | Reasoning | Coding | Agentic Coding | Mathematics | Data Analysis | Language | IF |
|------|-------|--------------|----------------|-----------|--------|----------------|-------------|---------------|----------|-----|
| 1 | GPT-5.4 Thinking xHigh Effort | OpenAI | 80.28 | 88.12 | 77.54 | 70.00 | 94.15 | 79.31 | 82.63 | 70.22 |
| 2 | Gemini 3.1 Pro Preview High | Google | 79.93 | 84.00 | 76.45 | 65.00 | 91.04 | 78.54 | 85.38 | 79.10 |
| 3 | Claude 4.6 Opus Thinking High Effort | Anthropic | 76.33 | 88.67 | 78.18 | 61.67 | 89.32 | 69.89 | 83.27 | 63.31 |
| 4 | Claude 4.5 Opus Thinking High Effort | Anthropic | 75.96 | 80.09 | 79.65 | 63.33 | 90.39 | 74.44 | 81.26 | 62.55 |
| 5 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 75.47 | 84.77 | 79.27 | 60.00 | 86.99 | 77.95 | 76.10 | 63.22 |
| 6 | GPT-5.2 High | OpenAI | 74.84 | 83.21 | 76.07 | 51.67 | 93.17 | 78.16 | 79.81 | 61.77 |
| 7 | GPT-5.2 Codex | OpenAI | 74.30 | 77.71 | 83.62 | 51.67 | 88.77 | 78.20 | 73.68 | 66.45 |
| 8 | GPT-5.1 Codex Max High | OpenAI | 73.98 | 83.65 | 80.68 | 53.33 | 83.22 | 70.12 | 76.48 | 70.38 |
| 9 | Gemini 3 Pro Preview High | Google | 73.39 | 77.42 | 74.60 | 55.00 | 81.84 | 74.39 | 84.62 | 65.85 |
| 10 | GPT-5.3 Codex High | OpenAI | 72.76 | 80.15 | 78.18 | 55.00 | 87.84 | 62.69 | 80.09 | 65.38 |
| 11 | Gemini 3 Flash Preview High | Google | 72.40 | 74.55 | 73.90 | 40.00 | 84.17 | 74.77 | 84.56 | 74.86 |
| 12 | GPT-5.1 High | OpenAI | 72.04 | 78.79 | 72.49 | 53.33 | 86.90 | 69.61 | 79.26 | 63.90 |
| 13 | GPT-5 Pro | OpenAI | 70.48 | 81.69 | 72.11 | 51.67 | 86.17 | 57.04 | 80.69 | 63.96 |
| 14 | Kimi K2.5 Thinking | Moonshot AI | 69.07 | 75.96 | 77.86 | 48.33 | 84.87 | 61.36 | 77.67 | 57.41 |
| 15 | GLM 5 | Z.AI | 68.85 | 69.11 | 73.64 | 55.00 | 83.46 | 67.90 | 77.53 | 55.33 |
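
The Global Average column appears to be the unweighted mean of the seven category averages. A minimal Python sketch checks this against the top-ranked model's row above (the score values are taken directly from the table; the variable names are illustrative, not from LiveBench's code):

```python
# Category averages for the top-ranked model, copied from the table above.
scores = {
    "Reasoning": 88.12,
    "Coding": 77.54,
    "Agentic Coding": 70.00,
    "Mathematics": 94.15,
    "Data Analysis": 79.31,
    "Language": 82.63,
    "IF": 70.22,
}

# Unweighted mean of the seven category averages.
global_average = sum(scores.values()) / len(scores)
print(round(global_average, 2))  # matches the reported 80.28
```

The same check holds for the other rows up to rounding, suggesting no category is weighted more heavily than another in the overall ranking.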

## Benchmark Categories

| Category | Tasks |
|----------|-------|
| Reasoning Average | `theory_of_mind`, `zebra_puzzle`, `spatial`, `logic_with_navigation` |
| Coding Average | `code_generation`, `code_completion` |
| Agentic Coding Average | `javascript`, `typescript`, `python` |
| Mathematics Average | `AMPS_Hard`, `integrals_with_game`, `math_comp`, `olympiad` |
| Data Analysis Average | `consecutive_events`, `tablejoin`, `tablereformat` |
| Language Average | `connections`, `plot_unscrambling`, `typos` |
| IF Average | `paraphrase`, `simplify`, `story_generation`, `summarize` |

## About LiveBench

LiveBench is a dynamic benchmark for evaluating large language models, designed with:

- **Contamination-free testing:** New questions released monthly based on fresh sources
- **Objective evaluation:** All answers have verifiable ground-truth values
- **Task diversity:** 18+ different tasks across 7 categories
- **Up-to-date content:** Questions based on recent arXiv papers, news, and other fresh sources

The `highunseenbias=true` filter shows only models evaluated on unseen (non-public) questions, reducing the risk of contamination.

Learn more: GitHub | Official Website