Source: LiveBench.ai | Filter: highunseenbias=true | Updated: January 2026

| Rank | Model | Organization | Global Average | Reasoning | Coding | Agentic Coding | Mathematics | Data Analysis | Language | IF |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 Thinking xHigh Effort | OpenAI | 80.28 | 88.12 | 77.54 | 70.00 | 94.15 | 79.31 | 82.63 | 70.22 |
| 2 | Gemini 3.1 Pro Preview High | Google | 79.93 | 84.00 | 76.45 | 65.00 | 91.04 | 78.54 | 85.38 | 79.10 |
| 3 | Claude 4.6 Opus Thinking High Effort | Anthropic | 76.33 | 88.67 | 78.18 | 61.67 | 89.32 | 69.89 | 83.27 | 63.31 |
| 4 | Claude 4.5 Opus Thinking High Effort | Anthropic | 75.96 | 80.09 | 79.65 | 63.33 | 90.39 | 74.44 | 81.26 | 62.55 |
| 5 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 75.47 | 84.77 | 79.27 | 60.00 | 86.99 | 77.95 | 76.10 | 63.22 |
| 6 | GPT-5.2 High | OpenAI | 74.84 | 83.21 | 76.07 | 51.67 | 93.17 | 78.16 | 79.81 | 61.77 |
| 7 | GPT-5.2 Codex | OpenAI | 74.30 | 77.71 | 83.62 | 51.67 | 88.77 | 78.20 | 73.68 | 66.45 |
| 8 | GPT-5.1 Codex Max High | OpenAI | 73.98 | 83.65 | 80.68 | 53.33 | 83.22 | 70.12 | 76.48 | 70.38 |
| 9 | Gemini 3 Pro Preview High | Google | 73.39 | 77.42 | 74.60 | 55.00 | 81.84 | 74.39 | 84.62 | 65.85 |
| 10 | GPT-5.3 Codex High | OpenAI | 72.76 | 80.15 | 78.18 | 55.00 | 87.84 | 62.69 | 80.09 | 65.38 |
| 11 | Gemini 3 Flash Preview High | Google | 72.40 | 74.55 | 73.90 | 40.00 | 84.17 | 74.77 | 84.56 | 74.86 |
| 12 | GPT-5.1 High | OpenAI | 72.04 | 78.79 | 72.49 | 53.33 | 86.90 | 69.61 | 79.26 | 63.90 |
| 13 | GPT-5 Pro | OpenAI | 70.48 | 81.69 | 72.11 | 51.67 | 86.17 | 57.04 | 80.69 | 63.96 |
| 14 | Kimi K2.5 Thinking | Moonshot AI | 69.07 | 75.96 | 77.86 | 48.33 | 84.87 | 61.36 | 77.67 | 57.41 |
| 15 | GLM 5 | Z.AI | 68.85 | 69.11 | 73.64 | 55.00 | 83.46 | 67.90 | 77.53 | 55.33 |

Each category score is an average over the tasks listed below:

| Category | Tasks |
|---|---|
| Reasoning Average | theory_of_mind, zebra_puzzle, spatial, logic_with_navigation |
| Coding Average | code_generation, code_completion |
| Agentic Coding Average | javascript, typescript, python |
| Mathematics Average | AMPS_Hard, integrals_with_game, math_comp, olympiad |
| Data Analysis Average | consecutive_events, tablejoin, tablereformat |
| Language Average | connections, plot_unscrambling, typos |
| IF Average | paraphrase, simplify, story_generation, summarize |

LiveBench is a dynamic benchmark for evaluating large language models, designed with:
- Contamination-free testing: New questions released monthly based on fresh sources
- Objective evaluation: All answers have verifiable ground-truth values
- Task diversity: 18+ different tasks across 7 categories
- Up-to-date content: Questions based on recent arXiv papers, news, and other fresh sources
Filter highunseenbias=true shows only models evaluated on unseen (non-public) questions, reducing contamination risk.
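
The source does not say how the Global Average column in the leaderboard table is computed. The figures are consistent with an unweighted mean of the seven category scores, rounded to two decimals. The sketch below checks that assumption against the first three rows; the `global_average` helper and the `ROWS` layout are illustrative names and are not part of LiveBench.

```python
from statistics import mean

# Scores copied from rows 1-3 of the leaderboard table, in column order:
# Reasoning, Coding, Agentic Coding, Mathematics, Data Analysis, Language, IF.
# Each entry maps a model name to (reported Global Average, category scores).
ROWS = {
    "GPT-5.4 Thinking xHigh Effort": (
        80.28, [88.12, 77.54, 70.00, 94.15, 79.31, 82.63, 70.22]),
    "Gemini 3.1 Pro Preview High": (
        79.93, [84.00, 76.45, 65.00, 91.04, 78.54, 85.38, 79.10]),
    "Claude 4.6 Opus Thinking High Effort": (
        76.33, [88.67, 78.18, 61.67, 89.32, 69.89, 83.27, 63.31]),
}


def global_average(category_scores):
    """Unweighted mean of the category scores, rounded to two decimals.

    Assumption: this matches how the Global Average column is derived.
    """
    return round(mean(category_scores), 2)


for model, (reported, scores) in ROWS.items():
    computed = global_average(scores)
    print(f"{model}: computed {computed:.2f}, reported {reported:.2f}")
```

If a computed value ever differed from the reported one by more than rounding error, that would suggest a weighted average or a transcription error in the row.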