This document specifies the full PolybrainBench pipeline end to end. It is the reference for reproducing the benchmark from a clean environment. Related documents: fleet composition, PolybrainBench overview.
The pipeline has five stages: topic generation, cycle dispatch, harvest, paper regeneration, and disjoint reviewer validation. The published paper is always the current canonical artifact, with the honest composite displayed prominently in its header (the Matthew Effect publication rule).
The benchmark's scientific value depends on what kinds of claims it verifies. A claim that every model trivially agrees on yields zero disagreement signal. The topic generator is therefore targeted at claims where independent models actually diverge.
Each candidate claim is scored on two independent factors, each worth 0 to 0.5. The acceptance threshold is 0.6 on their sum.
Factor A: well-formed declarative shell
- +0.2 if the claim is 10 to 30 words
- +0.1 if it is a single sentence ending in a period
- +0.1 if it contains no first-person pronouns
- +0.1 if it contains no question mark
Factor B: signal density
- +0.2 if it contains a specific number (2+ digits, a decimal, or a percentage)
- +0.15 if it contains a specific year in the 1500 to 2029 range
- +0.05 if it contains 2+ mid-sentence capitalized words (named entities)
- +0.05 if it contains 2+ words of 8+ letters (technical or compound vocabulary)
- +0.05 if it contains an attribution or action verb from a curated list (introduced, discovered, published, defined, established, ratified, formulated, proposed, attributed, observed, measured, enacted, released, authored, classified, identified, composed, developed, founded, demonstrated, came, became, signed, named, and similar)
The scorer was rewritten in Sprint 6 after the previous version (which rewarded consensus trivia shapes) drove the topic queue into mode collapse. The new scorer uses a 0.6 acceptance threshold and a 0.65 pause threshold; the pause threshold has never been hit in 3,700+ Sprint 6 cycles.
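A minimal sketch of the two-factor rubric, written in TypeScript to match the engine's runtime. Only the per-feature weights and the 0.6/0.65 thresholds come from this specification; the function name, tokenization, and regexes are illustrative, not the engine's actual implementation.

```typescript
// Sketch of the two-factor claim scorer described above. Weights and thresholds
// are from the spec; everything else (names, regexes) is an assumption.
const ATTRIBUTION_VERBS = [
  "introduced", "discovered", "published", "defined", "established", "ratified",
  "formulated", "proposed", "attributed", "observed", "measured", "enacted",
  "released", "authored", "classified", "identified", "composed", "developed",
  "founded", "demonstrated", "came", "became", "signed", "named",
];

function scoreClaim(claim: string): { score: number; accept: boolean; pause: boolean } {
  const trimmed = claim.trim();
  const words = trimmed.split(/\s+/);

  let a = 0; // Factor A: well-formed declarative shell (max 0.5)
  if (words.length >= 10 && words.length <= 30) a += 0.2;
  if (trimmed.endsWith(".") && trimmed.split(/[.!?]\s+/).length === 1) a += 0.1; // single sentence ending in a period
  if (!/\b(I|me|my|we|our|us)\b/.test(claim)) a += 0.1;                          // no first-person pronouns
  if (!claim.includes("?")) a += 0.1;                                            // no question mark

  let b = 0; // Factor B: signal density (max 0.5)
  if (/\d\d|\d\.\d|\d%/.test(claim)) b += 0.2;                                   // specific number
  if (/\b(1[5-9]\d\d|20[0-2]\d)\b/.test(claim)) b += 0.15;                       // year in 1500-2029
  if (words.slice(1).filter((w) => /^[A-Z]/.test(w)).length >= 2) b += 0.05;     // 2+ mid-sentence capitals
  if (words.filter((w) => w.replace(/[^A-Za-z]/g, "").length >= 8).length >= 2) b += 0.05;
  const lowerWords = new Set(words.map((w) => w.toLowerCase().replace(/\W/g, "")));
  if (ATTRIBUTION_VERBS.some((v) => lowerWords.has(v))) b += 0.05;

  const score = a + b;
  return { score, accept: score >= 0.6, pause: score >= 0.65 };
}
```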
The upstream generator prompt (fed to GPT-4o) explicitly instructs the LLM that the benchmark's entire value is cross-model disagreement and that claims where every model obviously agrees are worthless to the benchmark. Every generated claim must contain at least one of eight divergence-generating features:
- A non-round specific number (approximately 70.8 percent, 1.602176634 × 10^-19 coulombs, and similar)
- A specific date, year, or era commonly misremembered
- An attribution claim (who invented, discovered, introduced, formulated) where sources differ
- A superlative or comparative whose criterion matters (first, largest by volume, strictest under)
- A technical term with cross-field definition variance (kernel in CS versus statistics; entropy in thermodynamics versus information theory)
- A recent-event claim from 2020 to 2025 near training cutoffs
- A named standard or definition with specific effective dates (ISO, IEEE, IUPAC, SI, W3C)
- A common misconception stated flatly in its corrected form
The prompt also lists forbidden examples explicitly: "Paris is the capital of France", "The element with atomic number 79 is gold", "The Pacific Ocean is the largest of Earth's oceans", "The chemical formula for glucose is C6H12O6", "The Pythagorean theorem applies to right-angled triangles", "Mount Everest is the highest mountain on Earth", "The capital of Germany is Berlin", and "Water has the chemical formula H2O". These are all mode-collapsed consensus claims and the generator is told explicitly not to produce anything that looks like them.
After scoring, each new candidate is compared against the 50 most recently accepted claims. Any claim with Jaccard token overlap above 0.4 with any of the 50 is rejected and replaced. This prevents near-duplicate claims from dominating the tail of the queue.
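A sketch of the near-duplicate check under the 0.4 threshold above. The tokenizer here (lowercase alphanumeric split) is an assumption; the engine's exact tokenization is not specified.

```typescript
// Jaccard token overlap against the 50 most recently accepted claims.
// Tokenization is a simple lowercase word split (assumption).
function tokenize(claim: string): Set<string> {
  return new Set(claim.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const intersection = [...a].filter((t) => b.has(t)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : intersection / union;
}

function isNearDuplicate(candidate: string, acceptedClaims: string[]): boolean {
  const candidateTokens = tokenize(candidate);
  return acceptedClaims
    .slice(-50) // compare only against the 50 most recently accepted claims
    .some((prev) => jaccard(candidateTokens, tokenize(prev)) > 0.4);
}
```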
Every 5,000 new claims, an optional human review gate can be configured to display the first 10 claims of the most recent batch for approval before they are written to the topic file. The gate is intentionally lightweight because the scorer and prompt together already filter out most consensus trivia.
The cycle engine is subprocess-based. The daemon maintains a worker pool (default P=50) and spawns one cycle subprocess per topic. Each subprocess receives the next topic from the queue, dispatches the claim to all 9 generator-fleet models in parallel via Promise.allSettled, and writes the full response set to a new cycle directory.
cycles/NNN/
├── manifest.json # Cycle metadata: topic, model list, per-model success/failure, wall times, totals
├── responses/
│ ├── <model-slug-1>.md
│ ├── <model-slug-2>.md
│ └── ... # One markdown file per model response, full text
├── traces/
│ ├── <model-slug-1>-trace.json
│ └── ... # Per-model API call metadata: timing, reasoning chunks, token counts
├── provenance.json # SHA-256 hashes of every response file, plus cross-cycle chain hash
└── thalamus-grounding.json # Grounding verification: all claimed response files exist on disk
The manifest is written last, after grounding verification passes. This makes the manifest the atomic transaction record: a cycle is only considered committed once its manifest exists.
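A sketch of the dispatch-and-commit ordering for one cycle subprocess: parallel dispatch via Promise.allSettled, response files and provenance first, manifest last. The helpers `callModel`, `hashResponses`, and `verifyGrounding` are hypothetical placeholders for the engine's internals, which are not distributed.

```typescript
import { promises as fs } from "node:fs";
import path from "node:path";

// Hypothetical helpers (not the engine's real API): callModel dispatches one
// provider call; hashResponses and verifyGrounding produce the provenance and
// grounding records described above.
declare function callModel(slug: string, topic: string): Promise<string>;
declare function hashResponses(cycleDir: string): Promise<unknown>;
declare function verifyGrounding(cycleDir: string): Promise<unknown>;

async function writeJson(filePath: string, value: unknown): Promise<void> {
  await fs.writeFile(filePath, JSON.stringify(value, null, 2));
}

// One cycle subprocess: responses and provenance are written first; the
// manifest is written last, so its existence is the commit record.
async function runCycle(cycleDir: string, topic: string, models: string[]): Promise<void> {
  await fs.mkdir(path.join(cycleDir, "responses"), { recursive: true });

  // Dispatch the claim to every generator-fleet model in parallel.
  const settled = await Promise.allSettled(models.map((slug) => callModel(slug, topic)));

  // One markdown file per successful model response.
  for (let i = 0; i < models.length; i++) {
    const result = settled[i];
    if (result.status === "fulfilled") {
      await fs.writeFile(path.join(cycleDir, "responses", `${models[i]}.md`), result.value);
    }
  }

  await writeJson(path.join(cycleDir, "provenance.json"), await hashResponses(cycleDir));
  await writeJson(path.join(cycleDir, "thalamus-grounding.json"), await verifyGrounding(cycleDir));

  // Manifest last: the cycle counts as committed only once this file exists.
  await writeJson(path.join(cycleDir, "manifest.json"), {
    topic,
    models,
    results: settled.map((r, i) => ({ model: models[i], ok: r.status === "fulfilled" })),
  });
}
```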
Each model call is wrapped in a circuit breaker that opens after three consecutive failures and resets after 30 seconds. This isolates per-model faults so that a single hanging provider does not take down the entire cycle pipeline. The circuit breaker is per-model, per-daemon instance, and resets on daemon restart.
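A minimal per-model circuit breaker matching the stated policy (open after three consecutive failures, reset after 30 seconds, in-memory state per daemon instance). This is an illustrative sketch, not the engine's implementation.

```typescript
// Per-model circuit breaker: opens after 3 consecutive failures, closes again
// 30 seconds after opening. State lives in memory, so a daemon restart clears it.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly failureThreshold = 3,
    private readonly resetMs = 30_000,
  ) {}

  canCall(): boolean {
    if (this.openedAt === null) return true;
    if (Date.now() - this.openedAt >= this.resetMs) {
      this.openedAt = null; // reset window elapsed: close the breaker
      this.consecutiveFailures = 0;
      return true;
    }
    return false; // breaker is open: skip this model for now
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = Date.now();
    }
  }
}
```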
Not every cycle receives responses from all 9 models. Upstream provider rate limits (HTTP 429), transient network errors, and occasional overload responses cause individual model calls to fail. The honest per-cycle failure metric is:
- Cycles where all 9 models responded: 8,493 (81.26%)
- Cycles with at least one model failure: 1,958 (18.73%)
- Mean models responding per cycle: 8.26
Failed responses are excluded from the agreement and divergence calculations but are retained in the ledger for reproducibility. The per-response success rate (91.74% of 94,114 individual model responses succeeded) is a different metric from the per-cycle success rate, and consumers of the dataset should be explicit about which one they use.
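To make the distinction concrete, a sketch of how both rates can be recomputed from the ledger. The `modelsAttempted` / `modelsSucceeded` field names are assumptions about the row shape, not the published schema.

```typescript
// Recompute per-cycle and per-response success rates from ledger rows.
// Field names are assumptions; check the actual ledger schema.
interface LedgerRow {
  modelsAttempted: number; // normally 9
  modelsSucceeded: number; // 0..9
}

function successMetrics(rows: LedgerRow[]) {
  const fullCycles = rows.filter((r) => r.modelsSucceeded === r.modelsAttempted).length;
  const attempts = rows.reduce((sum, r) => sum + r.modelsAttempted, 0);
  const successes = rows.reduce((sum, r) => sum + r.modelsSucceeded, 0);
  return {
    perCycleSuccessRate: fullCycles / rows.length, // fraction of cycles where every model responded
    perResponseSuccessRate: successes / attempts,  // fraction of individual model calls that succeeded
    meanModelsPerCycle: successes / rows.length,
  };
}
```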
A harvest script reads every cycle directory, extracts metadata from manifest.json and the response file listing, and emits one JSONL row per cycle to public-ledger.jsonl. The harvest is idempotent and runs in approximately 300 milliseconds for the current 10,452-cycle dataset.
The harvest does not perform new model inference. It only unifies existing on-disk records into a single ledger file. A re-harvest will produce a byte-identical ledger unless new cycle directories exist on disk.
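A sketch of an idempotent harvest pass over the cycle-directory structure above. The emitted row shape is illustrative; the real ledger schema is defined by the published dataset.

```typescript
import { promises as fs } from "node:fs";
import path from "node:path";

// Idempotent harvest: read every committed cycle's manifest and emit one JSONL
// row per cycle. No model inference happens here; re-running over the same
// cycle directories produces a byte-identical ledger.
async function harvest(cyclesRoot: string, ledgerPath: string): Promise<void> {
  const cycleDirs = (await fs.readdir(cyclesRoot)).sort(); // stable order => stable output
  const rows: string[] = [];
  for (const dir of cycleDirs) {
    try {
      const manifest = JSON.parse(
        await fs.readFile(path.join(cyclesRoot, dir, "manifest.json"), "utf8"),
      );
      rows.push(JSON.stringify({ cycle: dir, ...manifest }));
    } catch {
      continue; // no manifest => the cycle was never committed; skip it
    }
  }
  await fs.writeFile(ledgerPath, rows.join("\n") + "\n");
}
```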
The published dataset (at Zenodo and Hugging Face) is this unified ledger, versioned with the paper.
Every N cycles (or on a manual trigger), the daemon regenerates the research paper from the current ledger:
- Aggregate statistics: the harvest output is passed through an aggregation pass that computes per-cycle success rates, per-model coverage, provider distribution, response-length distribution, wall-time distribution, and the cycle range (earliest timestamp, latest timestamp, total cycles).
- Template filling: the paper template (`paper.template.md`) contains approximately 50 placeholders for numeric values. The regeneration fills every placeholder with the current measured value (see the sketch after this list).
- Canonical page render: every cycle in the ledger produces a canonical markdown page at `output/canonical-pages/<slug>.md` with schema.org Dataset and FAQPage JSON-LD pointing back to the paper DOI.
- Limitations block: the honest limitations block (per-cycle failure rate, sample size caveats, cost data completeness, topic provenance, single-source ratings, validator known limitations) is rendered from the current aggregate stats. No limitation is hidden. The reviewer fleet responds to honest disclosure by scoring the paper higher, not lower.
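A sketch of the template-filling step, assuming a `{{name}}` placeholder convention; the actual placeholder syntax of paper.template.md is not specified here.

```typescript
// Fill every placeholder in the paper template with the current measured value.
// The {{name}} syntax is an assumption; the real template's convention may differ.
function fillTemplate(template: string, stats: Record<string, string | number>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key) => {
    if (!(key in stats)) {
      throw new Error(`No measured value for placeholder ${key}`); // fail loudly rather than publish a hole
    }
    return String(stats[key]);
  });
}
```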
The rendered paper is then passed to the validator.
The validator dispatches the regenerated paper as a topic to the 6-model reviewer fleet (see fleet.md for the composition). Each reviewer responds with a quality score (Q, 0 to 100) and an adversarial score (A, 0 to 100). The honest composite is:
composite = round(0.6 × mean(Q across reviewers) + 0.4 × mean(A across reviewers))
The weights are fixed. The 0.6/0.4 research-mode weighting is unchanged from the original Polybrain v3 validator. No reviewer can unilaterally move the composite by more than approximately two points per dimension in expectation at this fleet size.
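The composite calculation in code form, a direct translation of the formula above:

```typescript
// Honest composite: fixed 0.6/0.4 weighting of mean quality and mean adversarial.
interface ReviewerScore {
  quality: number;     // Q, 0-100
  adversarial: number; // A, 0-100
}

function honestComposite(reviews: ReviewerScore[]): number {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const meanQ = mean(reviews.map((r) => r.quality));
  const meanA = mean(reviews.map((r) => r.adversarial));
  return Math.round(0.6 * meanQ + 0.4 * meanA);
}
```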
The reviewer fleet spans three different API shapes:
- OpenAI chat completions format. Used for the in-fleet OpenAI, xAI, and Groq reviewers (`gpt-4.1-nano`, `grok-4-fast`, `gpt-oss-120b`, `llama-3.3-70b`). Sent to the provider's `/chat/completions` endpoint.
- Anthropic messages format. Used for `claude-sonnet-4-5`. Sent to the `/v1/messages` endpoint. The system prompt is a separate top-level field, the tool-calling schema differs, and the response shape is different.
- Gemini generate format. Used for `gemini-2.5-pro`. Sent to the `/v1beta/models/gemini-2.5-pro:generateContent` endpoint. Gemini 2.5 Pro is a thinking model, so `maxOutputTokens` must be set to 8000 (higher than the chat-completions default) to leave headroom for reasoning tokens in addition to the actual output.
Three format-specific helper functions (`callOpenAIChat`, `callAnthropicMessages`, `callGeminiGenerate`) normalize the response shapes into a common `{quality, adversarial, rawText, usage, cost}` record that the composite calculation consumes.
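A sketch of the common record the three helpers normalize into. Field names beyond the five listed above (for example the nested usage fields) are assumptions.

```typescript
// Common record every reviewer call is normalized into, regardless of API shape.
// The nested usage field names are assumptions beyond what this spec lists.
interface ReviewerResult {
  quality: number;      // Q, 0-100, parsed from the reviewer's response
  adversarial: number;  // A, 0-100
  rawText: string;      // full reviewer response text, kept for auditing
  usage: { inputTokens: number; outputTokens: number };
  cost: number;         // USD, from real token capture where available
}

// The three format-specific helpers share this signature (sketch only).
type ReviewerCall = (model: string, paperMarkdown: string) => Promise<ReviewerResult>;
```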
Each reviewer call retries up to 3 times on HTTP 429 (rate limit) or HTTP 503 (transient overload) with exponential backoff: 3 seconds, then 6 seconds, then 12 seconds. Between the strict reviewer and the retry window, every run so far has completed with 6-of-6 valid reviewer responses.
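A sketch of the retry policy (initial attempt plus up to three retries, backing off 3 s, 6 s, 12 s on 429/503). How the HTTP status is surfaced on the error object is an assumption.

```typescript
// Retry a reviewer call up to 3 times on HTTP 429 or 503 with exponential
// backoff of 3s, 6s, 12s. The err.status / err.response.status shape is an assumption.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithRetry<T>(call: () => Promise<T>): Promise<T> {
  const backoffsMs = [3_000, 6_000, 12_000]; // one backoff before each retry
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err: any) {
      const status = err?.status ?? err?.response?.status;
      const retryable = status === 429 || status === 503;
      if (!retryable || attempt >= backoffsMs.length) throw err; // out of retries, or non-retryable
      await sleep(backoffsMs[attempt]);
    }
  }
}
```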
The first full disjoint validator run (v13) initially hit a 503 on Gemini 2.5 Pro and dropped to 5-of-6 valid. The 503 retry path was added immediately, and every subsequent run (v14, v15, v16, and beyond) has gone 6-of-6.
The most important measurement Sprint 7 produced is not a single composite. It is the gap between the old self-reviewed composite and the new disjoint composite on the same paper.
| Reading | Fleet | Composite |
|---|---|---|
| v8 self-reviewed | 9-model generator = reviewer | 76 |
| v13 disjoint | 6-model reviewer, first run | 67 |
| v15 disjoint | 6-model reviewer, second run | 72 |
| v16 disjoint | 6-model reviewer, third run | 72 |
Median disjoint: 72. Mean disjoint: 70.3. The directly measured self-review inflation is 4 points on the central tendency, with individual readings spanning 4 to 9 points below the self-reviewed number. The honest disjoint composite lives in the 67 to 72 band.
This gap is the measurement of self-review bias in the earlier arrangement. The published trajectory (docs/polybrainbench.md § Sprint history) preserves both numbers side by side. Readers can see exactly how much the composite moved and why.
After validation, the paper is always published. There is no composite threshold gate. The only condition that blocks publication is a literal null composite (one dimension unparseable across all reviewers at once), which has not occurred since the parser bug of the earliest paper era was resolved.
The published paper has a blockquote header at the top with:
- Composite score
- Mean quality and mean adversarial
- Standard deviations across the reviewer panel
- Per-model raw scores for the current version
- Reviewer fleet description (4 in-fleet + 2 external anchors)
- Publication rule statement (Matthew Effect, no threshold)
- Publication timestamp
The header is designed to be read even if nothing else in the paper is read. A reader who only looks at the first 200 words of the paper sees the honest composite, the reviewer fleet, and the self-review bias gap. No further reading is required to understand what the number means.
The paper DOI is stable across versions (10.5281/zenodo.19546460). The concept DOI (10.5281/zenodo.19546459) always resolves to the latest version for citation stability.
| Stage | Cost |
|---|---|
| One cycle (9-model generator dispatch) | ~$0.003 (hardcoded estimate; real cost is ~half because 5 of 9 models are on Groq's free tier) |
| One validator run (6-model disjoint reviewer fleet) | ~$0.040 (real token capture; Claude Sonnet 4.5 ~$0.026, Gemini 2.5 Pro ~$0.011, in-fleet reviewers ~$0.003) |
The validator cost increased roughly 5.4× over the previous self-review arrangement ($0.0073) when the reviewer fleet gained two paid external anchors. The total is still only a few cents per run, which is a trivial price for the methodological improvement.
A complete reproduction requires:
- API accounts at OpenAI, xAI, Groq, Anthropic, and Google (five billing relationships)
- The ledger (download from Zenodo or Hugging Face)
- This document (full specification)
- The fleet composition table (see fleet.md)
- An implementation of the 5-stage pipeline
The engine source itself is proprietary and not distributed in this repository. Every measurement in the published paper, however, is deterministic from the ledger plus this specification, and any independent implementation following the methodology should produce responses that match ours within model sampling variance.