
# LiveBench AI Leaderboard

Source: LiveBench.ai | Filter: `highunseenbias=true` | Updated: January 2026

## Top Models by Overall Performance

| Rank | Model | Organization | Global Average | Reasoning | Coding | Agentic Coding | Mathematics | Data Analysis | Language | IF |
|------|-------|--------------|----------------|-----------|--------|----------------|-------------|---------------|----------|-----|
| 1 | GPT-5.4 Thinking xHigh Effort | OpenAI | 80.28 | 88.12 | 77.54 | 70.00 | 94.15 | 79.31 | 82.63 | 70.22 |
| 2 | Gemini 3.1 Pro Preview High | Google | 79.93 | 84.00 | 76.45 | 65.00 | 91.04 | 78.54 | 85.38 | 79.10 |
| 3 | Claude 4.6 Opus Thinking High Effort | Anthropic | 76.33 | 88.67 | 78.18 | 61.67 | 89.32 | 69.89 | 83.27 | 63.31 |
| 4 | Claude 4.5 Opus Thinking High Effort | Anthropic | 75.96 | 80.09 | 79.65 | 63.33 | 90.39 | 74.44 | 81.26 | 62.55 |
| 5 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 75.47 | 84.77 | 79.27 | 60.00 | 86.99 | 77.95 | 76.10 | 63.22 |
| 6 | GPT-5.2 High | OpenAI | 74.84 | 83.21 | 76.07 | 51.67 | 93.17 | 78.16 | 79.81 | 61.77 |
| 7 | GPT-5.2 Codex | OpenAI | 74.30 | 77.71 | 83.62 | 51.67 | 88.77 | 78.20 | 73.68 | 66.45 |
| 8 | GPT-5.1 Codex Max High | OpenAI | 73.98 | 83.65 | 80.68 | 53.33 | 83.22 | 70.12 | 76.48 | 70.38 |
| 9 | Gemini 3 Pro Preview High | Google | 73.39 | 77.42 | 74.60 | 55.00 | 81.84 | 74.39 | 84.62 | 65.85 |
| 10 | GPT-5.3 Codex High | OpenAI | 72.76 | 80.15 | 78.18 | 55.00 | 87.84 | 62.69 | 80.09 | 65.38 |
| 11 | Gemini 3 Flash Preview High | Google | 72.40 | 74.55 | 73.90 | 40.00 | 84.17 | 74.77 | 84.56 | 74.86 |
| 12 | GPT-5.1 High | OpenAI | 72.04 | 78.79 | 72.49 | 53.33 | 86.90 | 69.61 | 79.26 | 63.90 |
| 13 | GPT-5 Pro | OpenAI | 70.48 | 81.69 | 72.11 | 51.67 | 86.17 | 57.04 | 80.69 | 63.96 |
| 14 | Kimi K2.5 Thinking | Moonshot AI | 69.07 | 75.96 | 77.86 | 48.33 | 84.87 | 61.36 | 77.67 | 57.41 |
| 15 | GLM 5 | Z.AI | 68.85 | 69.11 | 73.64 | 55.00 | 83.46 | 67.90 | 77.53 | 55.33 |
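
The Global Average column appears to be the unweighted mean of the seven category averages. A minimal Python sketch checks this against the top-ranked model's row above (the score values are taken directly from the table; the variable names are illustrative, not from LiveBench's code):

```python
# Category averages for the top-ranked model, copied from the table above.
scores = {
    "Reasoning": 88.12,
    "Coding": 77.54,
    "Agentic Coding": 70.00,
    "Mathematics": 94.15,
    "Data Analysis": 79.31,
    "Language": 82.63,
    "IF": 70.22,
}

# Unweighted mean of the seven category averages.
global_average = sum(scores.values()) / len(scores)
print(round(global_average, 2))  # matches the reported 80.28
```

The same check holds for the other rows up to rounding, suggesting no category is weighted more heavily than another in the overall ranking.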

## Benchmark Categories

| Category | Tasks |
|----------|-------|
| Reasoning Average | `theory_of_mind`, `zebra_puzzle`, `spatial`, `logic_with_navigation` |
| Coding Average | `code_generation`, `code_completion` |
| Agentic Coding Average | `javascript`, `typescript`, `python` |
| Mathematics Average | `AMPS_Hard`, `integrals_with_game`, `math_comp`, `olympiad` |
| Data Analysis Average | `consecutive_events`, `tablejoin`, `tablereformat` |
| Language Average | `connections`, `plot_unscrambling`, `typos` |
| IF Average | `paraphrase`, `simplify`, `story_generation`, `summarize` |

## About LiveBench

LiveBench is a dynamic benchmark for evaluating large language models, designed with:

- **Contamination-free testing:** New questions released monthly based on fresh sources
- **Objective evaluation:** All answers have verifiable ground-truth values
- **Task diversity:** 18+ different tasks across 7 categories
- **Up-to-date content:** Questions based on recent arXiv papers, news, and other fresh sources

The `highunseenbias=true` filter shows only models evaluated on unseen (non-public) questions, reducing the risk of contamination.

Learn more: GitHub | Official Website