A living benchmark for cross-model consensus verification of natural-language claims. The same nine independently-trained language models answer the same question at the same time. We measure where they disagree, write it up as a research paper, and regenerate the paper continuously as the dataset grows.
This is the citeable research artifact. The runtime engine itself is proprietary.
PolybrainBench is a continuous, self-publishing benchmark for measuring inter-model disagreement. A generator fleet of nine large language models is dispatched in parallel against the same declarative claim, the full response from each model is captured, the dataset is harvested into a unified ledger, and a research paper is regenerated from the ledger on every publication cycle. The paper is then validated by a separate reviewer fleet, deliberately kept partially independent from the generator fleet, and is always published, with the honest composite prominently displayed in a blockquote header at the top of the document. The Matthew Effect is the only gate.
Current status (2026-04-13):
| Signal | Value |
|---|---|
| Dataset size | 10,452 verification cycles |
| Generator fleet | 9 models across 5 training families (OpenAI, xAI, Moonshot, Meta, Alibaba) |
| Reviewer fleet | 6 models, 4 in-fleet + 2 external anchors (Anthropic, Google) |
| Honest composite | 72 (mean quality 75.0, mean adversarial 67.0) |
| Paper version | v16 |
| Paper DOI | 10.5281/zenodo.19546460 |
| Concept DOI (cross-version) | 10.5281/zenodo.19546459 |
| Dataset (Hugging Face) | andysalvo/polybrainbench-v8 |
| Canonical claim pages | 7,004 stable URLs at polylogicai.com/trust/claim/* |
| License | CC-BY-4.0 (paper, dataset, methodology); proprietary (engine) |
The composite moved from 76 (v8, self-reviewed by the 9-model generator fleet) to 72 on the disjoint-reviewer methodology introduced in Sprint 7. That 4-point shift is the measured self-review bias: when the fleet that produced the corpus is the same fleet that validates the paper about the corpus, its scores are systematically higher than when an independent reviewer reads the same paper. Publishing the drop is the honest move.
Every verification cycle dispatches a claim to the same 9-model generator fleet in parallel. The full response text from each model is captured, stamped with a SHA-256 provenance hash, and written to a cycle directory with a manifest and grounding verification.
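The capture step can be sketched as follows. This is an illustrative sketch only: the manifest field names, file layout, and hashing scheme are assumptions for demonstration, not the proprietary engine's actual schema.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def provenance_hash(model_id: str, claim: str, response: str) -> str:
    """SHA-256 over the model id, claim, and full response text (assumed scheme)."""
    payload = "\n".join([model_id, claim, response]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def write_cycle(cycle_dir: Path, claim: str, responses: dict[str, str]) -> dict:
    """Write one verification cycle: per-model response files plus a manifest."""
    cycle_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "claim": claim,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "responses": {},
    }
    for model_id, text in responses.items():
        # Stamp each captured response with its provenance hash.
        digest = provenance_hash(model_id, claim, text)
        (cycle_dir / f"{model_id}.txt").write_text(text, encoding="utf-8")
        manifest["responses"][model_id] = {"sha256": digest, "chars": len(text)}
    (cycle_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

The hash binds each response to the claim it answered, so a ledger entry can later be checked against the raw cycle directory without trusting the harvest step.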
Every N cycles, the paper is regenerated from the ledger and validated by a different 6-model reviewer fleet:
In-fleet reviewers (4). These four models also contribute per-claim responses to the dataset, so they are not fully disjoint from the corpus they are grading. They do not review their own individual responses; they review the paper's aggregate analysis of ten thousand cycles. Kept in the reviewer seat for continuity with earlier paper versions and because they span the observed reviewer personality range (strictest to most generous):
- `gpt-4.1-nano` (OpenAI): stable mid-tier
- `grok-4-fast` (xAI): strict reviewer
- `gpt-oss-120b` (Groq / OpenAI weights): strictest in-fleet reviewer
- `llama-3.3-70b` (Groq / Meta): most generous in-fleet reviewer
External anchors (2). These two models come from provider families that are completely absent from the generator fleet. They have written zero responses into the corpus. Their scores are untouched by the training lineages that produced the data:
- `claude-sonnet-4-5` (Anthropic)
- `gemini-2.5-pro` (Google)
The framing is partial reviewer independence, not absolute. Four reviewers are in-fleet by design; two reviewers are fully independent by design. The two external anchors are what moved the composite down, and both are documented in the paper header. On the first full disjoint reading, Gemini 2.5 Pro produced the single strictest score in the fleet (Q 52, A 35), stricter than either in-fleet strict reviewer. Claude Sonnet 4.5 came in as a moderate mid-tier voice (Q 72, A 58).
Disagreement across this reviewer set is the signal. Averaging it away would hide the measurement. The published header carries the full per-model table so a reader can see the spread, not just the composite.
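One way to see how the published spread rolls up: the sketch below averages each reviewer's quality (Q) and adversarial (A) scores across the fleet and takes the mean of the two means as a composite. This weighting is an assumption for illustration (the exact published formula is not specified here), and only the two external anchors' scores appear in the text above, so the example uses just those.

```python
from statistics import mean

def composite(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Aggregate per-reviewer (quality, adversarial) pairs.

    Assumed scheme: mean quality and mean adversarial across reviewers,
    composite = mean of those two means. The real weighting may differ.
    """
    q = mean(s[0] for s in scores.values())
    a = mean(s[1] for s in scores.values())
    return {"mean_quality": q, "mean_adversarial": a, "composite": (q + a) / 2}

# The two external anchors, with (Q, A) scores as reported above:
anchors = {
    "gemini-2.5-pro": (52, 35),
    "claude-sonnet-4-5": (72, 58),
}
print(composite(anchors))  # mean Q 62, mean A 46.5, composite 54.25
```

Run over all six reviewers, the same aggregation yields the header composite; run over any subset, it shows how much each reviewer pulls the number.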
PolybrainBench does not claim to outscore any frontier model on any capability benchmark. The question it answers is one no single frontier model can answer alone: where, on what kinds of claims, do independently-trained language models disagree? That measurement is only meaningful when the models come from different training lineages. The stronger any one model gets, the more interesting the disagreement pattern becomes, because the remaining disagreement is increasingly informative about the boundary of current consensus. Better frontier models make this work more valuable, not less.
If you use the PolybrainBench dataset, methodology, or canonical pages in academic work:
```bibtex
@dataset{salvo_polybrainbench_2026,
  author    = {Salvo, Andy},
  title     = {PolybrainBench: A Living Benchmark for Cross-Model
               Consensus Verification of Natural-Language Claims},
  year      = 2026,
  publisher = {Zenodo},
  version   = {v16},
  doi       = {10.5281/zenodo.19546460},
  url       = {https://doi.org/10.5281/zenodo.19546460}
}
```

Or as plain text:
Salvo, A. (2026). PolybrainBench v16: A Living Benchmark for Cross-Model Consensus Verification of Natural-Language Claims (N=10,452, disjoint reviewer composite 72). Zenodo. https://doi.org/10.5281/zenodo.19546460
See CITATION.cff for the structured citation metadata.
| Surface | URL | Purpose |
|---|---|---|
| Paper (Zenodo) | https://doi.org/10.5281/zenodo.19546460 | Citable research artifact |
| Concept DOI (latest version resolver) | https://doi.org/10.5281/zenodo.19546459 | Always resolves to current version |
| Dataset (Hugging Face) | https://huggingface.co/datasets/andysalvo/polybrainbench-v8 | Unified JSONL ledger |
| Canonical claim pages | https://polylogicai.com/trust/claim | One URL per verification cycle |
| Sitemap | https://polylogicai.com/sitemap.xml | All trust/claim URLs |
| Trust landing page | https://polylogicai.com/trust | Project overview |
| Methodology | docs/polybrainbench.md | Full specification |
| Fleet composition | docs/fleet.md | 11-model fleet breakdown |
| Method | docs/methodology.md | Cycle, harvest, validate, publish loop |
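The Hugging Face dataset is described above as a unified JSONL ledger; a minimal reader might look like the sketch below. The field names (`verdicts`, a per-model verdict mapping) are assumptions about the schema for illustration, not its documented structure.

```python
import json
from pathlib import Path
from typing import Iterator

def read_ledger(path: Path) -> Iterator[dict]:
    """Yield one record per line from a JSONL ledger, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

def disagreement_rate(records: list[dict]) -> float:
    """Fraction of cycles where the fleet's verdicts are not unanimous.

    Assumes each record carries a per-model 'verdicts' mapping; the real
    ledger schema may differ.
    """
    split = sum(1 for r in records if len(set(r["verdicts"].values())) > 1)
    return split / len(records) if records else 0.0
```

Downstream analyses (per-claim disagreement, per-model strictness) are a few lines on top of a reader like this.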
- Paper, dataset summary, methodology, canonical pages: CC-BY-4.0. Cite the work, build on it, redistribute it.
- Engine (daemon, cycle engine, validator, topic generator, orchestration): proprietary. The source is not currently open.
The boundary is deliberate. The research artifact is open so anyone can verify the claims, cite the numbers, and build downstream work. The engine is closed so we can keep improving it quickly without coordinating every internal change across public channels.
Andy Salvo (ajs10845@psu.edu), ORCID 0009-0008-8629-8827
Polylogic AI, Penn State University (Smeal College of Business)
The v0.2.0 and v1.x era documentation in this repository (agent framework internals, the earlier grounding experiment, the Bateson type-token correction, formal proofs, the BLSRP method, the substrate-independent learning whitepaper) is preserved in place for continuity with the first wave of research. Those documents describe the project as it existed before PolybrainBench became the primary public artifact and are not the current source of truth for the benchmark's methodology or composite score. For the live benchmark, see docs/polybrainbench.md and docs/methodology.md.