PolybrainBench runs on two distinct fleets: a generator fleet that answers every verification cycle, and a reviewer fleet that grades each regenerated paper. Four models sit in both seats, two models are reviewer-only, and five models are generator-only. The union is 11 unique models across 7 independent training families.
Every model listed here is a public commercial endpoint. No gated access is required to reproduce the benchmark.
Every verification cycle dispatches the same 9 models in parallel. Each model writes its full response to the cycle directory, with a SHA-256 provenance hash and per-model wall time. The generator fleet is stable across every cycle in the current dataset (N=10,452).
| Slug | Provider endpoint | Training family | Notes on role |
|---|---|---|---|
| kimi-k2-groq | Groq | Moonshot | The frontier-quality model available on Groq's free tier. Often generates the most substantive single-model response in the cycle. |
| gpt-4.1-mini | OpenAI | OpenAI | The most reliable paid mid-tier model. Low variance across repeated runs. |
| gpt-4.1-nano | OpenAI | OpenAI | Cheapest OpenAI endpoint. Fast. Stable scores across runs. |
| grok-3-mini | xAI | xAI | Slowest typical cycle response (6 to 10 seconds). Produces detailed reasoning chains. |
| grok-4-fast | xAI | xAI | Faster than grok-3-mini. Also sits in the reviewer seat as one of the two strictest voices. |
| qwen3-32b | Groq | Alibaba | Exposes logprobs. Middle-of-the-pack scores. |
| gpt-oss-120b | Groq | OpenAI (open weights) | The strictest reviewer in the entire fleet when scored on paper-level work. In-fleet reviewer. |
| llama-4-scout | Groq | Meta | Fastest cycle response time. Smaller model, higher volume per second. |
| llama-3.3-70b | Groq | Meta | The most generous reviewer in the fleet. In-fleet reviewer. |
Training families in the generator fleet (5 independent lineages): OpenAI, xAI, Moonshot, Meta, Alibaba.
Billing providers (3 API endpoints): OpenAI, xAI, Groq. Groq hosts five of the nine generator models (kimi-k2, qwen3-32b, gpt-oss-120b, llama-4-scout, llama-3.3-70b) so the billing count is lower than the family count.
Every paper regeneration is validated by this 6-model reviewer fleet as an independent panel. Four of the reviewers are drawn from the generator fleet. Two reviewers come from training families that are not in the generator fleet at all, and those two anchors have zero corpus contribution.
| Slug | Provider endpoint | Training family | Independence | Reviewer role |
|---|---|---|---|---|
| gpt-4.1-nano | OpenAI | OpenAI | In-fleet (also a generator) | Stable mid-tier reviewer |
| grok-4-fast | xAI | xAI | In-fleet (also a generator) | Strict voice #1 |
| gpt-oss-120b | Groq | OpenAI (open weights) | In-fleet (also a generator) | Strictest in-fleet voice |
| llama-3.3-70b | Groq | Meta | In-fleet (also a generator) | Most generous voice |
| claude-sonnet-4-5 | Anthropic | Anthropic | External anchor (not a generator) | Moderate independent voice |
| gemini-2.5-pro | Google | Google | External anchor (not a generator) | Strictest overall reviewer on first reading |
Training families in the reviewer fleet (5 lineages): OpenAI, xAI, Meta, Anthropic, Google (gpt-oss-120b has open-weights OpenAI origins and is hosted on Groq, so it counts under OpenAI). The four in-fleet reviewers span three of the generator's five training families (OpenAI, xAI, Meta). The two external anchors add two new families (Anthropic, Google) that are not in the generator fleet at all.
Billing providers for the reviewer fleet (5 API endpoints): OpenAI, xAI, Groq, Anthropic, Google. Reproducing the reviewer run requires live API accounts at all five.
| Metric | Count |
|---|---|
| Generator fleet size | 9 |
| Reviewer fleet size | 6 |
| Models in both fleets | 4 |
| Generator-only models | 5 |
| Reviewer-only models (external anchors) | 2 |
| Unique models in the union | 11 |
| Training families in the union | 7 |
| Billing providers in the union | 5 |
The 11 unique models are: kimi-k2-groq, gpt-4.1-mini, gpt-4.1-nano, grok-3-mini, grok-4-fast, qwen3-32b, gpt-oss-120b, llama-4-scout, llama-3.3-70b, claude-sonnet-4-5, gemini-2.5-pro.
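The fleet arithmetic in the summary table can be checked mechanically with set operations over the slugs listed above:

```python
generators = {
    "kimi-k2-groq", "gpt-4.1-mini", "gpt-4.1-nano", "grok-3-mini",
    "grok-4-fast", "qwen3-32b", "gpt-oss-120b", "llama-4-scout",
    "llama-3.3-70b",
}
reviewers = {
    "gpt-4.1-nano", "grok-4-fast", "gpt-oss-120b", "llama-3.3-70b",
    "claude-sonnet-4-5", "gemini-2.5-pro",
}

assert len(generators) == 9                # generator fleet size
assert len(reviewers) == 6                 # reviewer fleet size
assert len(generators & reviewers) == 4    # models in both seats
assert len(generators - reviewers) == 5    # generator-only models
assert len(reviewers - generators) == 2    # external anchors
assert len(generators | reviewers) == 11   # unique models in the union
```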
The 7 training families are: OpenAI, xAI, Moonshot, Meta, Alibaba, Anthropic, Google. Each family represents a distinct pretraining lineage. That matters for cross-model disagreement because models from the same family tend to share training data, alignment procedures, and systematic failure modes. A disagreement between two models from the same family is weaker evidence of a contested claim than a disagreement between two models from different families.
A fully disjoint reviewer fleet would replace all nine generator reviewers with nine completely external anchors. That would give maximum statistical independence on the review side. It would also:
- Break continuity with earlier paper versions (every earlier paper composite would be incomparable with every current one)
- Require managing nine independent API relationships for the reviewer step alone
- Discard the calibration span (strictest to most generous reviewer personality) that the in-fleet reviewers provide
The four in-fleet reviewers are kept specifically because they span the observed reviewer-personality range: gpt-oss-120b is the strictest voice, llama-3.3-70b is the most generous voice, and the other two sit in between. Keeping them means the composite is not entirely rewritten by the two external anchors. The external anchors are added because they are the only way to directly measure how much the earlier all-in-fleet arrangement was inflating its own composite through self-review.
The result is partial independence, with the self-review bias measured directly in the published trajectory. The v8 self-reviewed composite of 76 and the v16 disjoint composite of 72 appear side by side in the sprint history table, and the 4-point gap (which widens to a 9-point gap on individual readings) is the measurement.
- Register API accounts with OpenAI, xAI, Groq, Anthropic, and Google.
- Use the model slugs listed in the tables above as the model identifiers in each provider's chat or messages API.
- For each cycle: dispatch the claim to all 9 generator models in parallel, capture the full response, compute a SHA-256 hash, and write to the cycle directory.
- For each validator run: send the regenerated paper as a topic to all 6 reviewer models, parse each reviewer's Q and A score from its response, and compute round(0.6 × mean(Q) + 0.4 × mean(A)).
- The validator retries HTTP 429 (rate limit) and 503 (transient overload) with exponential backoff (3 s / 6 s / 12 s). Both retry classes are necessary: the 503 path was added after a Gemini overload in the first full disjoint run dropped that reading to 5-of-6 valid reviews.
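The scoring and retry steps above can be sketched as follows. This is a minimal sketch, not the published validator; `TransientHTTPError` is a hypothetical wrapper for retryable HTTP statuses (real provider SDKs raise their own exception types).

```python
import statistics
import time

class TransientHTTPError(Exception):
    # Hypothetical wrapper carrying the HTTP status of a failed call
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def composite(q_scores: list[float], a_scores: list[float]) -> int:
    # Composite per the formula above: round(0.6 * mean(Q) + 0.4 * mean(A))
    return round(0.6 * statistics.mean(q_scores) + 0.4 * statistics.mean(a_scores))

def with_backoff(call, delays=(3, 6, 12)):
    # Retry 429/503 with exponential backoff (3s / 6s / 12s);
    # any other error propagates immediately
    for delay in delays:
        try:
            return call()
        except TransientHTTPError as err:
            if err.status not in (429, 503):
                raise
            time.sleep(delay)
    return call()  # final attempt after the last delay
```

For example, six reviewers returning Q scores with mean 75 and A scores with mean 60 yield a composite of round(45 + 24) = 69.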
All responses in the 10,452-cycle dataset are available under the CC-BY-4.0 license, so anyone reproducing the fleet can compare their own responses against ours for verification.