Fleet composition

PolybrainBench runs on two distinct fleets: a generator fleet that answers every verification cycle, and a reviewer fleet that grades each regenerated paper. Four models sit in both seats, two models are reviewer-only, and five models are generator-only. The union is 11 unique models across 7 independent training families.

Every model listed here is a public commercial endpoint. No gated access is required to reproduce the benchmark.


Generator fleet (9 models, unchanged since Sprint 2)

Every verification cycle dispatches the same 9 models in parallel. Each model writes its full response to the cycle directory, with a SHA-256 provenance hash and per-model wall time. The generator fleet is stable across every cycle in the current dataset (N=10,452).

| Slug | Provider endpoint | Training family | Notes on role |
| --- | --- | --- | --- |
| kimi-k2-groq | Groq | Moonshot | The frontier-quality model available on Groq's free tier. Often generates the most substantive single-model response in the cycle. |
| gpt-4.1-mini | OpenAI | OpenAI | The most reliable paid mid-tier model. Low variance across repeated runs. |
| gpt-4.1-nano | OpenAI | OpenAI | Cheapest OpenAI endpoint. Fast. Stable scores across runs. |
| grok-3-mini | xAI | xAI | Slowest typical cycle response (6 to 10 seconds). Produces detailed reasoning chains. |
| grok-4-fast | xAI | xAI | Faster than grok-3-mini. Also sits in the reviewer seat as one of the two strictest voices. |
| qwen3-32b | Groq | Alibaba | Exposes logprobs. Middle-of-the-pack scores. |
| gpt-oss-120b | Groq | OpenAI (open weights) | The strictest reviewer in the entire fleet when scored on paper-level work. In-fleet reviewer. |
| llama-4-scout | Groq | Meta | Fastest cycle response time. Smaller model, higher volume per second. |
| llama-3.3-70b | Groq | Meta | The most generous reviewer in the fleet. In-fleet reviewer. |

Training families in the generator fleet (5 independent lineages): OpenAI, xAI, Moonshot, Meta, Alibaba.

Billing providers (3 API endpoints): OpenAI, xAI, Groq. Groq hosts five of the nine generator models (kimi-k2-groq, qwen3-32b, gpt-oss-120b, llama-4-scout, llama-3.3-70b), so the billing count is lower than the family count.

Reviewer fleet (6 models, partially disjoint from the generator)

Every paper regeneration is validated by this 6-model reviewer fleet acting as an independent panel. Four of the reviewers are drawn from the generator fleet. The other two come from training families that are absent from the generator fleet entirely, and those two external anchors contribute nothing to the generated corpus.

| Slug | Provider endpoint | Training family | Independence | Reviewer role |
| --- | --- | --- | --- | --- |
| gpt-4.1-nano | OpenAI | OpenAI | In-fleet (also a generator) | Stable mid-tier reviewer |
| grok-4-fast | xAI | xAI | In-fleet (also a generator) | Strict voice #1 |
| gpt-oss-120b | Groq | OpenAI (open weights) | In-fleet (also a generator) | Strictest in-fleet voice |
| llama-3.3-70b | Groq | Meta | In-fleet (also a generator) | Most generous voice |
| claude-sonnet-4-5 | Anthropic | Anthropic | External anchor (not a generator) | Moderate independent voice |
| gemini-2.5-pro | Google | Google | External anchor (not a generator) | Strictest overall reviewer on first reading |

Training families in the reviewer fleet (5 lineages): OpenAI (including gpt-oss-120b, whose open weights originate from OpenAI although the model is hosted on Groq), xAI, Meta, Anthropic, Google. The four in-fleet reviewers span three of the generator's five training families. The two external anchors add two new families (Anthropic, Google) that are not in the generator fleet at all.

Billing providers for the reviewer fleet (5 API endpoints): OpenAI, xAI, Groq, Anthropic, Google. Reproducing the reviewer run requires live API accounts at all five.

Union: 11 unique models across 7 training families

| Metric | Count |
| --- | --- |
| Generator fleet size | 9 |
| Reviewer fleet size | 6 |
| Models in both fleets | 4 |
| Generator-only models | 5 |
| Reviewer-only models (external anchors) | 2 |
| Unique models in the union | 11 |
| Training families in the union | 7 |
| Billing providers in the union | 5 |

The 11 unique models are: kimi-k2-groq, gpt-4.1-mini, gpt-4.1-nano, grok-3-mini, grok-4-fast, qwen3-32b, gpt-oss-120b, llama-4-scout, llama-3.3-70b, claude-sonnet-4-5, gemini-2.5-pro.

The 7 training families are: OpenAI, xAI, Moonshot, Meta, Alibaba, Anthropic, Google. Each family represents a distinct pretraining lineage. That matters for cross-model disagreement because models from the same family tend to share training data, alignment procedures, and systematic failure modes. A disagreement between two models from the same family is weaker evidence of a contested claim than a disagreement between two models from different families.
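The family-weighting argument above can be made concrete with a lookup table. This is an illustrative sketch (the `FAMILY` dict and `cross_family` helper are not part of the benchmark's published code); the family assignments follow the tables above.

```python
# Training-family map for the 11-model union, per the families listed above.
FAMILY = {
    "kimi-k2-groq": "Moonshot",
    "gpt-4.1-mini": "OpenAI", "gpt-4.1-nano": "OpenAI", "gpt-oss-120b": "OpenAI",
    "grok-3-mini": "xAI", "grok-4-fast": "xAI",
    "qwen3-32b": "Alibaba",
    "llama-4-scout": "Meta", "llama-3.3-70b": "Meta",
    "claude-sonnet-4-5": "Anthropic", "gemini-2.5-pro": "Google",
}

def cross_family(model_a: str, model_b: str) -> bool:
    """True when a disagreement crosses pretraining lineages.

    Cross-family disagreement is stronger evidence of a contested claim,
    because same-family models share training data and failure modes.
    """
    return FAMILY[model_a] != FAMILY[model_b]
```

For example, grok-3-mini disagreeing with grok-4-fast is a same-family signal, while qwen3-32b disagreeing with llama-4-scout crosses lineages and carries more weight.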

Why partial reviewer independence, not absolute

A fully disjoint reviewer fleet would replace the four in-fleet reviewers with additional external anchors, so that none of the six reviewers overlapped with the nine generators. That would give maximum statistical independence on the review side. It would also:

  1. Break continuity with earlier paper versions (every earlier paper composite would be incomparable with every current one)
  2. Require managing six additional API relationships for the reviewer step alone
  3. Discard the observed reviewer-personality calibration (strictest to most generous) that the in-fleet reviewers provide as a span

The four in-fleet reviewers are kept specifically because they span the observed reviewer-personality range: gpt-oss-120b is the strictest in-fleet voice, llama-3.3-70b is the most generous, and the other two sit in between. Keeping them means the composite is not entirely rewritten by the two external anchors. The external anchors are added because they are the only way to directly measure how much the earlier all-in-fleet arrangement was inflating its own composite through self-review.

The result is partial independence with the self-review bias measured directly in the published trajectory. The v8 self-reviewed composite of 76 and the v16 disjoint composite of 72 appear side by side in the sprint history table, and the 4-point gap (which widens to a 9-point gap on individual readings) is the measurement.

How to reproduce the fleet

  1. Register API accounts with OpenAI, xAI, Groq, Anthropic, and Google.
  2. Use the model slugs listed in the tables above as the model identifiers in each provider's chat or messages API.
  3. For each cycle: dispatch the claim to all 9 generator models in parallel, capture the full response, compute a SHA-256 hash, and write to the cycle directory.
  4. For each validator run: send the regenerated paper as a topic to all 6 reviewer models, parse each reviewer's Q and A score from its response, compute round(0.6 × mean(Q) + 0.4 × mean(A)).
  5. The validator respects HTTP 429 (rate limit) and 503 (transient overload) with exponential backoff (3s / 6s / 12s). Both retry classes are necessary: the 503 path was added after a Gemini overload in the first full disjoint run dropped that reading to 5-of-6 valid.
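Steps 4 and 5 can be sketched together. The composite formula and the 3s/6s/12s backoff schedule come from the text; the function names and the `send` callable (returning a status code and body) are hypothetical stand-ins for each provider's client.

```python
import time

BACKOFF_S = (3, 6, 12)   # exponential backoff schedule from step 5
RETRYABLE = {429, 503}   # rate limit and transient overload

def composite(q_scores, a_scores):
    """Step 4: round(0.6 * mean(Q) + 0.4 * mean(A)) over the valid reviewer readings."""
    mean_q = sum(q_scores) / len(q_scores)
    mean_a = sum(a_scores) / len(a_scores)
    return round(0.6 * mean_q + 0.4 * mean_a)

def call_with_backoff(send, payload):
    """Step 5: retry a reviewer call on 429/503 with the 3s/6s/12s schedule.

    `send` is a hypothetical callable returning (status_code, body). After the
    last retry the final status is returned to the caller, which may then
    proceed with fewer valid readings (e.g. the 5-of-6 Gemini case).
    """
    for delay in BACKOFF_S + (None,):
        status, body = send(payload)
        if status not in RETRYABLE:
            return status, body
        if delay is None:
            break
        time.sleep(delay)
    return status, body
```

A reviewer that drops out after all retries simply shrinks the `q_scores` and `a_scores` lists passed to `composite`, which is why a 5-of-6 reading is still scorable.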

All responses in the 10,452-cycle dataset are available under the CC-BY-4.0 license, so anyone reproducing the fleet can compare their own responses against ours for verification.