Fleet composition

PolybrainBench runs on two distinct fleets: a generator fleet that answers every verification cycle, and a reviewer fleet that grades each regenerated paper. Four models sit in both seats, two models are reviewer-only, and five models are generator-only. The union is 11 unique models across 7 independent training families.

Every model listed here is a public commercial endpoint. No gated access is required to reproduce the benchmark.


Generator fleet (9 models, unchanged since Sprint 2)

Every verification cycle dispatches the same 9 models in parallel. Each model writes its full response to the cycle directory, with a SHA-256 provenance hash and per-model wall time. The generator fleet is stable across every cycle in the current dataset (N=10,452).

| Slug | Provider endpoint | Training family | Notes on role |
| --- | --- | --- | --- |
| kimi-k2-groq | Groq | Moonshot | The frontier-quality model available on Groq's free tier. Often generates the most substantive single-model response in the cycle. |
| gpt-4.1-mini | OpenAI | OpenAI | The most reliable paid mid-tier model. Low variance across repeated runs. |
| gpt-4.1-nano | OpenAI | OpenAI | Cheapest OpenAI endpoint. Fast. Stable scores across runs. |
| grok-3-mini | xAI | xAI | Slowest typical cycle response (6 to 10 seconds). Produces detailed reasoning chains. |
| grok-4-fast | xAI | xAI | Faster than grok-3-mini. Also sits in the reviewer seat as one of the two strictest voices. |
| qwen3-32b | Groq | Alibaba | Exposes logprobs. Middle-of-the-pack scores. |
| gpt-oss-120b | Groq | OpenAI (open weights) | The strictest reviewer in the entire fleet when scored on paper-level work. In-fleet reviewer. |
| llama-4-scout | Groq | Meta | Fastest cycle response time. Smaller model, higher volume per second. |
| llama-3.3-70b | Groq | Meta | The most generous reviewer in the fleet. In-fleet reviewer. |

Training families in the generator fleet (5 independent lineages): OpenAI, xAI, Moonshot, Meta, Alibaba.

Billing providers (3 API endpoints): OpenAI, xAI, Groq. Groq hosts five of the nine generator models (kimi-k2-groq, qwen3-32b, gpt-oss-120b, llama-4-scout, llama-3.3-70b), so the billing count is lower than the family count.

Reviewer fleet (6 models, partially disjoint from the generator)

Every paper regeneration is validated by this 6-model reviewer fleet acting as an independent panel. Four of the reviewers are drawn from the generator fleet. The other two come from training families that are absent from the generator fleet entirely, and those two external anchors contribute nothing to the generated corpus.

| Slug | Provider endpoint | Training family | Independence | Reviewer role |
| --- | --- | --- | --- | --- |
| gpt-4.1-nano | OpenAI | OpenAI | In-fleet (also a generator) | Stable mid-tier reviewer |
| grok-4-fast | xAI | xAI | In-fleet (also a generator) | Strict voice #1 |
| gpt-oss-120b | Groq | OpenAI (open weights) | In-fleet (also a generator) | Strictest in-fleet voice |
| llama-3.3-70b | Groq | Meta | In-fleet (also a generator) | Most generous voice |
| claude-sonnet-4-5 | Anthropic | Anthropic | External anchor (not a generator) | Moderate independent voice |
| gemini-2.5-pro | Google | Google | External anchor (not a generator) | Strictest overall reviewer on first reading |

Training families in the reviewer fleet (5 lineages): OpenAI (including gpt-oss-120b, whose open weights originate from OpenAI although the model is hosted on Groq), xAI, Meta, Anthropic, Google. The four in-fleet reviewers span three of the generator's five training families. The two external anchors add two new families (Anthropic, Google) that are not in the generator fleet at all.

Billing providers for the reviewer fleet (5 API endpoints): OpenAI, xAI, Groq, Anthropic, Google. Reproducing the reviewer run requires live API accounts at all five.

Union: 11 unique models across 7 training families

| Metric | Count |
| --- | --- |
| Generator fleet size | 9 |
| Reviewer fleet size | 6 |
| Models in both fleets | 4 |
| Generator-only models | 5 |
| Reviewer-only models (external anchors) | 2 |
| Unique models in the union | 11 |
| Training families in the union | 7 |
| Billing providers in the union | 5 |

The 11 unique models are: kimi-k2-groq, gpt-4.1-mini, gpt-4.1-nano, grok-3-mini, grok-4-fast, qwen3-32b, gpt-oss-120b, llama-4-scout, llama-3.3-70b, claude-sonnet-4-5, gemini-2.5-pro.

The 7 training families are: OpenAI, xAI, Moonshot, Meta, Alibaba, Anthropic, Google. Each family represents a distinct pretraining lineage. That matters for cross-model disagreement because models from the same family tend to share training data, alignment procedures, and systematic failure modes. A disagreement between two models from the same family is weaker evidence of a contested claim than a disagreement between two models from different families.
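The family-weighting argument above can be made concrete with a lookup table. This is an illustrative sketch (the `FAMILY` dict and `cross_family` helper are not part of the benchmark's published code); the family assignments follow the tables above.

```python
# Training-family map for the 11-model union, per the families listed above.
FAMILY = {
    "kimi-k2-groq": "Moonshot",
    "gpt-4.1-mini": "OpenAI", "gpt-4.1-nano": "OpenAI", "gpt-oss-120b": "OpenAI",
    "grok-3-mini": "xAI", "grok-4-fast": "xAI",
    "qwen3-32b": "Alibaba",
    "llama-4-scout": "Meta", "llama-3.3-70b": "Meta",
    "claude-sonnet-4-5": "Anthropic", "gemini-2.5-pro": "Google",
}

def cross_family(model_a: str, model_b: str) -> bool:
    """True when a disagreement crosses pretraining lineages.

    Cross-family disagreement is stronger evidence of a contested claim,
    because same-family models share training data and failure modes.
    """
    return FAMILY[model_a] != FAMILY[model_b]
```

For example, grok-3-mini disagreeing with grok-4-fast is a same-family signal, while qwen3-32b disagreeing with llama-4-scout crosses lineages and carries more weight.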

Why partial reviewer independence, not absolute

A fully disjoint reviewer fleet would replace the four in-fleet reviewers with additional external anchors, so that none of the six reviewers overlapped with the nine generators. That would give maximum statistical independence on the review side. It would also:

  1. Break continuity with earlier paper versions (every earlier paper composite would be incomparable with every current one)
  2. Require managing six additional API relationships for the reviewer step alone
  3. Discard the observed reviewer-personality calibration (strictest to most generous) that the in-fleet reviewers provide as a span

The four in-fleet reviewers are kept specifically because they span the observed reviewer-personality range: gpt-oss-120b is the strictest in-fleet voice, llama-3.3-70b is the most generous, and the other two sit in between. Keeping them means the composite is not entirely rewritten by the two external anchors. The external anchors are added because they are the only way to directly measure how much the earlier all-in-fleet arrangement was inflating its own composite through self-review.

The result is partial independence with the self-review bias measured directly in the published trajectory. The v8 self-reviewed composite of 76 and the v16 disjoint composite of 72 appear side by side in the sprint history table, and the 4-point gap (which widens to a 9-point gap on individual readings) is the measurement.

How to reproduce the fleet

  1. Register API accounts with OpenAI, xAI, Groq, Anthropic, and Google.
  2. Use the model slugs listed in the tables above as the model identifiers in each provider's chat or messages API.
  3. For each cycle: dispatch the claim to all 9 generator models in parallel, capture the full response, compute a SHA-256 hash, and write to the cycle directory.
  4. For each validator run: send the regenerated paper as a topic to all 6 reviewer models, parse each reviewer's Q and A score from its response, compute round(0.6 × mean(Q) + 0.4 × mean(A)).
  5. The validator respects HTTP 429 (rate limit) and 503 (transient overload) with exponential backoff (3s / 6s / 12s). Both retry classes are necessary: the 503 path was added after a Gemini overload in the first full disjoint run dropped that reading to 5-of-6 valid.
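Steps 4 and 5 can be sketched together. The composite formula and the 3s/6s/12s backoff schedule come from the text; the function names and the `send` callable (returning a status code and body) are hypothetical stand-ins for each provider's client.

```python
import time

BACKOFF_S = (3, 6, 12)   # exponential backoff schedule from step 5
RETRYABLE = {429, 503}   # rate limit and transient overload

def composite(q_scores, a_scores):
    """Step 4: round(0.6 * mean(Q) + 0.4 * mean(A)) over the valid reviewer readings."""
    mean_q = sum(q_scores) / len(q_scores)
    mean_a = sum(a_scores) / len(a_scores)
    return round(0.6 * mean_q + 0.4 * mean_a)

def call_with_backoff(send, payload):
    """Step 5: retry a reviewer call on 429/503 with the 3s/6s/12s schedule.

    `send` is a hypothetical callable returning (status_code, body). After the
    last retry the final status is returned to the caller, which may then
    proceed with fewer valid readings (e.g. the 5-of-6 Gemini case).
    """
    for delay in BACKOFF_S + (None,):
        status, body = send(payload)
        if status not in RETRYABLE:
            return status, body
        if delay is None:
            break
        time.sleep(delay)
    return status, body
```

A reviewer that drops out after all retries simply shrinks the `q_scores` and `a_scores` lists passed to `composite`, which is why a 5-of-6 reading is still scorable.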

All responses in the 10,452-cycle dataset are available under the CC-BY-4.0 license, so anyone reproducing the fleet can compare their own responses against ours for verification.