A living benchmark for cross-model consensus verification of natural-language claims. The same nine independently-trained language models answer the same question at the same time. We measure where they disagree, write it up as a research paper, and regenerate the paper continuously as the dataset grows.
This is the citeable research artifact. The runtime engine itself is proprietary.
PolybrainBench is a continuous, self-publishing benchmark for measuring inter-model disagreement. A generator fleet of nine large language models is dispatched in parallel against the same declarative claim, the full response from each model is captured, the dataset is harvested into a unified ledger, and a research paper is regenerated from the ledger on every publication cycle. The paper is then validated by a separate reviewer fleet, deliberately kept partially independent from the generator fleet, and is always published, with the honest composite prominently displayed in a blockquote header at the top of the document. The Matthew Effect is the only gate.
Current status (2026-04-13):
| Signal | Value |
|---|---|
| Dataset size | 10,452 verification cycles |
| Generator fleet | 9 models across 5 training families (OpenAI, xAI, Moonshot, Meta, Alibaba) |
| Reviewer fleet | 6 models, 4 in-fleet + 2 external anchors (Anthropic, Google) |
| Honest composite | 72 (mean quality 75.0, mean adversarial 67.0) |
| Paper version | v16 |
| Paper DOI | 10.5281/zenodo.19546460 |
| Concept DOI (cross-version) | 10.5281/zenodo.19546459 |
| Dataset (Hugging Face) | andysalvo/polybrainbench-v8 |
| Canonical claim pages | 7,004 stable URLs at polylogicai.com/trust/claim/* |
| License | CC-BY-4.0 (paper, dataset, methodology); proprietary (engine) |
The composite moved from 76 (v8, self-reviewed by the 9-model generator fleet) to 72 on the disjoint-reviewer methodology introduced in Sprint 7. That 4-point shift is the measured self-review bias: when the fleet that produced the corpus is the same fleet that validates the paper about the corpus, its scores are systematically higher than when an independent reviewer reads the same paper. Publishing the drop is the honest move.
Every verification cycle dispatches a claim to the same 9-model generator fleet in parallel. The full response text from each model is captured, stamped with a SHA-256 provenance hash, and written to a cycle directory with a manifest and grounding verification.
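The capture step can be sketched as follows. This is an illustrative sketch only: the manifest field names, file layout, and hashing scheme are assumptions for demonstration, not the proprietary engine's actual schema.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def provenance_hash(model_id: str, claim: str, response: str) -> str:
    """SHA-256 over the model id, claim, and full response text (assumed scheme)."""
    payload = "\n".join([model_id, claim, response]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def write_cycle(cycle_dir: Path, claim: str, responses: dict[str, str]) -> dict:
    """Write one verification cycle: per-model response files plus a manifest."""
    cycle_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "claim": claim,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "responses": {},
    }
    for model_id, text in responses.items():
        # Stamp each captured response with its provenance hash.
        digest = provenance_hash(model_id, claim, text)
        (cycle_dir / f"{model_id}.txt").write_text(text, encoding="utf-8")
        manifest["responses"][model_id] = {"sha256": digest, "chars": len(text)}
    (cycle_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

The hash binds each response to the claim it answered, so a ledger entry can later be checked against the raw cycle directory without trusting the harvest step.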
Every N cycles, the paper is regenerated from the ledger and validated by a different 6-model reviewer fleet:
In-fleet reviewers (4). These four models also contribute per-claim responses to the dataset, so they are not fully disjoint from the corpus they are grading. They do not review their own individual responses; they review the paper's aggregate analysis of ten thousand cycles. Kept in the reviewer seat for continuity with earlier paper versions and because they span the observed reviewer personality range (strictest to most generous):
- `gpt-4.1-nano` (OpenAI): stable mid-tier
- `grok-4-fast` (xAI): strict reviewer
- `gpt-oss-120b` (Groq / OpenAI weights): strictest in-fleet reviewer
- `llama-3.3-70b` (Groq / Meta): most generous in-fleet reviewer
External anchors (2). These two models come from provider families that are completely absent from the generator fleet. They have written zero responses into the corpus. Their scores are untouched by the training lineages that produced the data:
- `claude-sonnet-4-5` (Anthropic)
- `gemini-2.5-pro` (Google)
The framing is partial reviewer independence, not absolute. Four reviewers are in-fleet by design; two reviewers are fully independent by design. The two external anchors are what moved the composite down, and both are documented in the paper header. On the first full disjoint reading, Gemini 2.5 Pro produced the single strictest score in the fleet (Q 52, A 35), stricter than either in-fleet strict reviewer. Claude Sonnet 4.5 came in as a moderate mid-tier voice (Q 72, A 58).
Disagreement across this reviewer set is the signal. Averaging it away would hide the measurement. The published header carries the full per-model table so a reader can see the spread, not just the composite.
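One way to see how the published spread rolls up: the sketch below averages each reviewer's quality (Q) and adversarial (A) scores across the fleet and takes the mean of the two means as a composite. This weighting is an assumption for illustration (the exact published formula is not specified here), and only the two external anchors' scores appear in the text above, so the example uses just those.

```python
from statistics import mean

def composite(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Aggregate per-reviewer (quality, adversarial) pairs.

    Assumed scheme: mean quality and mean adversarial across reviewers,
    composite = mean of those two means. The real weighting may differ.
    """
    q = mean(s[0] for s in scores.values())
    a = mean(s[1] for s in scores.values())
    return {"mean_quality": q, "mean_adversarial": a, "composite": (q + a) / 2}

# The two external anchors, with (Q, A) scores as reported above:
anchors = {
    "gemini-2.5-pro": (52, 35),
    "claude-sonnet-4-5": (72, 58),
}
print(composite(anchors))  # mean Q 62, mean A 46.5, composite 54.25
```

Run over all six reviewers, the same aggregation yields the header composite; run over any subset, it shows how much each reviewer pulls the number.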
PolybrainBench does not claim to outscore any frontier model on any capability benchmark. The question it answers is one no single frontier model can answer alone: where, on what kinds of claims, do independently-trained language models disagree? That measurement is only meaningful when the models come from different training lineages. The stronger any one model gets, the more interesting the disagreement pattern becomes, because the remaining disagreement is increasingly informative about the boundary of current consensus. Better frontier models make this work more valuable, not less.
If you use the PolybrainBench dataset, methodology, or canonical pages in academic work:
```bibtex
@dataset{salvo_polybrainbench_2026,
  author    = {Salvo, Andy},
  title     = {PolybrainBench: A Living Benchmark for Cross-Model
               Consensus Verification of Natural-Language Claims},
  year      = 2026,
  publisher = {Zenodo},
  version   = {v16},
  doi       = {10.5281/zenodo.19546460},
  url       = {https://doi.org/10.5281/zenodo.19546460}
}
```

Or as plain text:
Salvo, A. (2026). PolybrainBench v16: A Living Benchmark for Cross-Model Consensus Verification of Natural-Language Claims (N=10,452, disjoint reviewer composite 72). Zenodo. https://doi.org/10.5281/zenodo.19546460
See CITATION.cff for the structured citation metadata.
| Surface | URL | Purpose |
|---|---|---|
| Paper (Zenodo) | https://doi.org/10.5281/zenodo.19546460 | Citable research artifact |
| Concept DOI (latest version resolver) | https://doi.org/10.5281/zenodo.19546459 | Always resolves to current version |
| Dataset (Hugging Face) | https://huggingface.co/datasets/andysalvo/polybrainbench-v8 | Unified JSONL ledger |
| Canonical claim pages | https://polylogicai.com/trust/claim | One URL per verification cycle |
| Sitemap | https://polylogicai.com/sitemap.xml | All trust/claim URLs |
| Trust landing page | https://polylogicai.com/trust | Project overview |
| Methodology | docs/polybrainbench.md | Full specification |
| Fleet composition | docs/fleet.md | 11-model fleet breakdown |
| Method | docs/methodology.md | Cycle, harvest, validate, publish loop |
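The Hugging Face dataset is described above as a unified JSONL ledger; a minimal reader might look like the sketch below. The field names (`verdicts`, a per-model verdict mapping) are assumptions about the schema for illustration, not its documented structure.

```python
import json
from pathlib import Path
from typing import Iterator

def read_ledger(path: Path) -> Iterator[dict]:
    """Yield one record per line from a JSONL ledger, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

def disagreement_rate(records: list[dict]) -> float:
    """Fraction of cycles where the fleet's verdicts are not unanimous.

    Assumes each record carries a per-model 'verdicts' mapping; the real
    ledger schema may differ.
    """
    split = sum(1 for r in records if len(set(r["verdicts"].values())) > 1)
    return split / len(records) if records else 0.0
```

Downstream analyses (per-claim disagreement, per-model strictness) are a few lines on top of a reader like this.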
- Paper, dataset summary, methodology, canonical pages: CC-BY-4.0. Cite the work, build on it, redistribute it.
- Engine (daemon, cycle engine, validator, topic generator, orchestration): proprietary. The source is not currently open.
The boundary is deliberate. The research artifact is open so anyone can verify the claims, cite the numbers, and build downstream work. The engine is closed so we can keep improving it quickly without coordinating every internal change across public channels.
Andy Salvo (ajs10845@psu.edu), ORCID 0009-0008-8629-8827
Polylogic AI, Penn State University (Smeal College of Business)
The v0.2.0 and v1.x era documentation in this repository (agent framework internals, the earlier grounding experiment, the Bateson type-token correction, formal proofs, the BLSRP method, the substrate-independent learning whitepaper) is preserved in place for continuity with the first wave of research. Those documents describe the project as it existed before PolybrainBench became the primary public artifact and are not the current source of truth for the benchmark's methodology or composite score. For the live benchmark, see docs/polybrainbench.md and docs/methodology.md.