LLM Creative Story-Writing Benchmark

This benchmark now uses pairwise head-to-head comparisons as its canonical quality signal. Models write short fiction to the same constrained creative briefs; evaluator LLMs compare two stories written for the same required elements; the resulting signed margins are aggregated into a global Thurstone-style rating.

The previous absolute 0-10 rubric-rating publication is archived and is no longer the main leaderboard. Absolute ratings may still be useful as diagnostics, but future public rankings should be based on pairwise comparison ratings.

Archived absolute-rating publication:

archive/absolute_ratings_v4_2026_04_19/README.md
Its old README-linked chart PNGs were moved locally under archive/absolute_ratings_v4_2026_04_19/images/.

Current Results

Canonical Leaderboard

Current canonical scope: pilot_2026_04_18_anchor_n20_fullroster_cap3_v1.

Coverage at the last aggregation:

23 rated models
134 direct model-pair rows
9,139 in-scope score rows
2,462 of those score rows use legacy-compatible prompt/raw path provenance
Bootstrap uncertainty resamples both stories and evaluators
Evaluator side-A bias correction is enabled
Dagger markers (†) indicate model-specific generation-coverage disclosures

Rank	LLM	Thurstone Rating	Win Prob vs Pool	95% CI	BT (Ref)
1	gpt-5.4-xhigh	4.0412	0.917	3.7887..4.3457	6.1057
2	gpt-5.4-medium	3.6311	0.888	3.3705..3.9094	5.9413
3	claude-opus-4-7-adaptive†	3.3423	0.865	3.1730..3.4691	6.0631
4	claude-opus-4-6-16K	3.1530	0.849	2.6867..3.6050	5.8529
5	gpt-5.2-medium	3.0438	0.839	2.5438..3.4927	4.2152
6	claude-sonnet-4-6-16K	2.5355	0.792	2.3303..2.7672	5.4205
7	mistral-medium-2508	1.6599	0.699	1.4802..1.8590	2.5247
8	kimi-k2.5	0.8286	0.600	0.6171..1.0372	1.7458
9	glm-5	0.5227	0.561	-0.0707..1.0973	1.3106
10	glm-5.1	0.4796	0.556	0.2743..0.7507	1.2808
11	mimo-v2-pro	0.3428	0.538	0.1527..0.5313	1.4750
12	seed-2.0-pro	-0.1930	0.469	-0.5023..0.0651	0.7023
13	mistral-large-2512	-0.5438	0.424	-1.0682..-0.0613	0.4926
14	qwen3.6-plus	-0.5850	0.419	-0.7889..-0.3594	0.2295
15	gemini-3.1-pro-preview	-0.7073	0.403	-0.9164..-0.4697	0.1220
16	gemma-4-31b-it-reasoning	-0.8888	0.380	-1.0988..-0.6936	-0.0430
17	ernie-5	-1.3284	0.326	-2.1063..-0.5255	-14.9843
18	grok-4-1-fast-reasoning	-1.4181	0.316	-2.3198..-0.6669	-14.9843
19	deepseek-v32	-1.7766	0.275	-2.0652..-1.4464	-0.7867
20	minimax-m2.7	-3.2200	0.140	-3.4310..-2.9874	-2.4366
21	gpt-oss-120b	-3.4224	0.126	-3.6469..-3.2022	-2.3555
22	qwen3.5-397b-a17b	-3.7738	0.102	-4.0209..-3.5689	-3.1551
23	grok-4.20-beta-0309-reasoning	-5.7232	0.018	-5.9281..-5.5065	-4.7365

Model Disclosure Notes

† claude-opus-4-7-adaptive: Claude Opus 4.7 refused a subset of creative-writing generation prompts in this run. Generation sidecars show 53/400 story prompts refused (13.2%); 347 story outputs were available. Comparison prompts and ratings use completed stories only (480 pair-story rows, 265 distinct story indices in the current comparison scope). No score was imputed for the refused prompts, so this is a coverage caveat rather than a direct rating penalty.

BT is retained as a reference diagnostic, not as the primary ranking. The Thurstone rating is the canonical comparison score.

Pairwise Margin Map

Each cell is the mean signed comparison margin for the row model against the column model. Positive values mean the row model tends to beat the column model on stories written to the same required elements.

What Is Measured

Every story must meaningfully incorporate ten required elements:

character
object
concept
attribute
action
method
setting
timeframe
motivation
tone

The comparison protocol keeps the prompt and required elements matched within each A-vs-B task. This makes the judgment about which story better satisfies the same creative brief, not about which model happened to receive an easier prompt.

Evaluator prompts still separate rubric-aligned observations from beyond-rubric observations, but the public model score is no longer an average of absolute 0-10 rubric grades. The public score is the model's position in the pairwise comparison graph.

Method Summary

Generate or collect stories in the strict benchmark format.
Build matched A-vs-B comparison prompts for stories sharing the same required elements.
Run evaluator LLMs on those pairwise prompts.
Parse each evaluator response into signed margins and winner labels.
Correct measured side-A bias in signed margins.
Aggregate story-level pair margins into a global Thurstone-style model rating.
Use bootstrap uncertainty over stories and evaluators for confidence bands.

The current chart and leaderboard artifacts are produced by:

export PYTHONPATH=/mnt/r/writing2:/mnt/r/benchmark_utils:$PYTHONPATH

python aggregate_inter_llm_comparison_scores.py \
  --template-id pilot_2026_04_18_anchor_n20_fullroster_cap3_v1

python plot_inter_llm_comparison_charts.py

The plot script reads shared model display names, family colors, and model-brand logos through /mnt/r/benchmark_utils.

Operational Notes

Pairwise comparison is the canonical intake path for new models.
Incremental comparison coverage should prioritize uncertainty near the top of the leaderboard, especially missing or inconclusive direct matchups among highly ranked models.
Use run_inter_llm_comparison_batch.py --pairs-file <path> for manually selected high-value gaps.
Use a new explicit template_id when the comparison prompt, evaluator roster, or benchmark-critical protocol changes materially.
Gemini evaluators are fine on the fast router. Gemini is specifically slow on the batch router (8040), so split Gemini into a separate 8006 invocation when turnaround matters.
Absolute rubric ratings and their old charts are archived; do not use them as the main public ranking going forward.

More operational detail is in PIPELINES.md and PROJECT_DOCUMENTATION.md.

Related Benchmarks

Multi-agent benchmarks:

Other benchmarks:

Follow @lechmazur on X for other benchmarks and updates.

Name		Name	Last commit message	Last commit date
Latest commit History 96 Commits
archive/absolute_ratings_v4_2026_04_19		archive/absolute_ratings_v4_2026_04_19
comments_by_llm_1to8		comments_by_llm_1to8
docs		docs
general_summaries		general_summaries
graded		graded
images		images
prompts		prompts
stories_wc		stories_wc
summaries		summaries
temp		temp
v2		v2
v3		v3
PIPELINES.md		PIPELINES.md
PROJECT_DOCUMENTATION.md		PROJECT_DOCUMENTATION.md
README.md		README.md
inter_llm_model_disclosures.py		inter_llm_model_disclosures.py
project_paths.py		project_paths.py
question_weights.py		question_weights.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Creative Story-Writing Benchmark

Current Results

Canonical Leaderboard

Model Disclosure Notes

Pairwise Margin Map

What Is Measured

Method Summary

Operational Notes

Related Benchmarks

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LLM Creative Story-Writing Benchmark

Current Results

Canonical Leaderboard

Model Disclosure Notes

Pairwise Margin Map

What Is Measured

Method Summary

Operational Notes

Related Benchmarks

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages