Polybrain

A living benchmark for cross-model consensus verification of natural-language claims. The same nine independently-trained language models answer the same question at the same time. We measure where they disagree, write it up as a research paper, and regenerate the paper continuously as the dataset grows.

This is the citable research artifact. The runtime engine itself is proprietary.


PolybrainBench, live

PolybrainBench is a continuous, self-publishing benchmark for measuring inter-model disagreement. A generator fleet of 9 large language models is dispatched in parallel against the same declarative claim, the full response from each model is captured, and the dataset is harvested into a unified ledger. On every publication cycle, a research paper is regenerated from the ledger and validated by a separate reviewer fleet, deliberately kept partially independent of the generator fleet. The paper is always published, with the honest composite displayed prominently in a blockquote header at the top of the document. The Matthew Effect is the only gate.
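The dispatch-harvest-regenerate loop can be sketched as follows. Every function and name here is illustrative; the real engine is proprietary and its internals are not described in this README.

```python
def dispatch(claim, fleet):
    """Stub: in the real engine, each model answers the claim in
    parallel and the full response text is captured."""
    return {model: f"[{model}] verdict on: {claim}" for model in fleet}

def regenerate_paper(ledger):
    """Stub: the real paper is regenerated from the full ledger on
    every publication cycle, then validated by the reviewer fleet."""
    return f"Paper over {len(ledger)} cycles"

# 3 of the 9 generator models, for illustration
GENERATORS = ["gpt-4.1-nano", "grok-4-fast", "llama-3.3-70b"]

ledger = []
for claim in ["Claim A", "Claim B"]:
    ledger.append(dispatch(claim, GENERATORS))

paper = regenerate_paper(ledger)
print(paper)  # Paper over 2 cycles
```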

Current status (2026-04-13):

| Signal | Value |
| --- | --- |
| Dataset size | 10,452 verification cycles |
| Generator fleet | 9 models across 5 training families (OpenAI, xAI, Moonshot, Meta, Alibaba) |
| Reviewer fleet | 6 models: 4 in-fleet + 2 external anchors (Anthropic, Google) |
| Honest composite | 72 (mean quality 75.0, mean adversarial 67.0) |
| Paper version | v16 |
| Paper DOI | 10.5281/zenodo.19546460 |
| Concept DOI (cross-version) | 10.5281/zenodo.19546459 |
| Dataset (Hugging Face) | andysalvo/polybrainbench-v8 |
| Canonical claim pages | 7,004 stable URLs at polylogicai.com/trust/claim/* |
| License | CC-BY-4.0 (paper, dataset, methodology); proprietary (engine) |

The composite moved from 76 (v8, self-reviewed by the 9-model generator fleet) to 72 on the disjoint-reviewer methodology introduced in Sprint 7. That 4-point shift is the measured self-review bias: when the fleet that produced the corpus is the same fleet that validates the paper about the corpus, its scores are systematically higher than when an independent reviewer reads the same paper. Publishing the drop is the honest move.

How the reviewer split works (partial independence, not absolute)

Every verification cycle dispatches a claim to the same 9-model generator fleet in parallel. The full response text from each model is captured, stamped with a SHA-256 provenance hash, and written to a cycle directory with a manifest and grounding verification.
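A minimal sketch of the provenance stamp, assuming a hypothetical manifest schema (the engine's actual field names and cycle-directory layout are proprietary and not documented in this README):

```python
import hashlib
import json
import time

def provenance_record(model_id: str, claim: str, response_text: str) -> dict:
    """Stamp one captured model response with a SHA-256 provenance hash.
    The field names below are illustrative placeholders, not the real
    manifest schema."""
    digest = hashlib.sha256(response_text.encode("utf-8")).hexdigest()
    return {
        "model": model_id,
        "claim": claim,
        "sha256": digest,
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

record = provenance_record(
    "gpt-4.1-nano",
    "Water boils at 100 C at sea level.",
    "True. At standard atmospheric pressure...",
)
print(json.dumps(record, indent=2))
```

The hash covers the full response text, so any later edit to a captured response is detectable by rehashing.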

Every N cycles, the paper is regenerated from the ledger and validated by a different 6-model reviewer fleet:

In-fleet reviewers (4). These four models also contribute per-claim responses to the dataset, so they are not fully disjoint from the corpus they are grading. They do not review their own individual responses; they review the paper's aggregate analysis of ten thousand cycles. They are kept in the reviewer seat for continuity with earlier paper versions and because they span the observed range of reviewer personalities, from strictest to most generous:

  • gpt-4.1-nano (OpenAI): stable mid-tier
  • grok-4-fast (xAI): strict reviewer
  • gpt-oss-120b (Groq / OpenAI weights): strictest in-fleet reviewer
  • llama-3.3-70b (Groq / Meta): most generous in-fleet reviewer

External anchors (2). These two models come from provider families that are completely absent from the generator fleet. They have written zero responses into the corpus. Their scores are untouched by the training lineages that produced the data:

  • claude-sonnet-4-5 (Anthropic)
  • gemini-2.5-pro (Google)

The framing is partial reviewer independence, not absolute. Four reviewers are in-fleet by design; two reviewers are fully independent by design. The two external anchors are what moved the composite down, and both are documented in the paper header. On the first full disjoint reading, Gemini 2.5 Pro produced the single strictest score in the fleet (Q 52, A 35), stricter than either in-fleet strict reviewer. Claude Sonnet 4.5 came in as a moderate mid-tier voice (Q 72, A 58).

Disagreement across this reviewer set is the signal. Averaging it away would hide the measurement. The published header carries the full per-model table so a reader can see the spread, not just the composite.
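Reading the spread rather than only the mean can be sketched like this. The two external-anchor scores come from the text above; the aggregation function is a generic mean-and-spread summary, not the engine's actual composite formula, whose exact weighting is not specified in this README.

```python
from statistics import mean

def summarize(scores):
    """Aggregate per-reviewer (quality, adversarial) scores into
    fleet-level means plus the quality spread between the strictest
    and most generous reviewer."""
    q = [s["quality"] for s in scores.values()]
    a = [s["adversarial"] for s in scores.values()]
    return {
        "mean_quality": round(mean(q), 1),
        "mean_adversarial": round(mean(a), 1),
        "spread_quality": max(q) - min(q),
    }

# External-anchor scores as reported above; a real summary would
# include all 6 reviewers' scores from the paper header.
reviews = {
    "gemini-2.5-pro": {"quality": 52, "adversarial": 35},
    "claude-sonnet-4-5": {"quality": 72, "adversarial": 58},
}
print(summarize(reviews))
# {'mean_quality': 62.0, 'mean_adversarial': 46.5, 'spread_quality': 20}
```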

Complementary, not competitive

PolybrainBench does not claim to outscore any frontier model on any capability benchmark. The question it answers is one no single frontier model can answer alone: where, on what kinds of claims, do independently-trained language models disagree? That measurement is only meaningful when the models come from different training lineages. The stronger any one model gets, the more interesting the disagreement pattern becomes, because the remaining disagreement is increasingly informative about the boundary of current consensus. Better frontier models make this work more valuable, not less.

How to cite

If you use the PolybrainBench dataset, methodology, or canonical pages in academic work:

@dataset{salvo_polybrainbench_2026,
  author       = {Salvo, Andy},
  title        = {PolybrainBench: A Living Benchmark for Cross-Model
                  Consensus Verification of Natural-Language Claims},
  year         = 2026,
  publisher    = {Zenodo},
  version      = {v16},
  doi          = {10.5281/zenodo.19546460},
  url          = {https://doi.org/10.5281/zenodo.19546460}
}

Or as plain text:

Salvo, A. (2026). PolybrainBench v16: A Living Benchmark for Cross-Model Consensus Verification of Natural-Language Claims (N=10,452, disjoint reviewer composite 72). Zenodo. https://doi.org/10.5281/zenodo.19546460

See CITATION.cff for the structured citation metadata.

Where to find it

| Surface | URL | Purpose |
| --- | --- | --- |
| Paper (Zenodo) | https://doi.org/10.5281/zenodo.19546460 | Citable research artifact |
| Concept DOI (latest-version resolver) | https://doi.org/10.5281/zenodo.19546459 | Always resolves to current version |
| Dataset (Hugging Face) | https://huggingface.co/datasets/andysalvo/polybrainbench-v8 | Unified JSONL ledger |
| Canonical claim pages | https://polylogicai.com/trust/claim | One URL per verification cycle |
| Sitemap | https://polylogicai.com/sitemap.xml | All trust/claim URLs |
| Trust landing page | https://polylogicai.com/trust | Project overview |
| Methodology | docs/polybrainbench.md | Full specification |
| Fleet composition | docs/fleet.md | 11-model fleet breakdown |
| Method | docs/methodology.md | Cycle, harvest, validate, publish loop |
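The unified ledger is one JSON object per line. A minimal reader looks like the sketch below; the field names are hypothetical, since the actual record schema is specified in docs/polybrainbench.md, not here.

```python
import io
import json

# Hypothetical ledger lines standing in for the downloaded JSONL file;
# the real field names may differ.
LEDGER = io.StringIO(
    '{"cycle_id": 1, "claim": "The Nile is the longest river.", "responses": 9}\n'
    '{"cycle_id": 2, "claim": "Gold is denser than lead.", "responses": 9}\n'
)

cycles = [json.loads(line) for line in LEDGER if line.strip()]
print(len(cycles))  # 2
```

Swapping `LEDGER` for `open("ledger.jsonl")` reads the real file the same way, one record per line.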

License

  • Paper, dataset summary, methodology, canonical pages: CC-BY-4.0. Cite the work, build on it, redistribute it.
  • Engine (daemon, cycle engine, validator, topic generator, orchestration): proprietary. The source is not currently open.

The boundary is deliberate. The research artifact is open so anyone can verify the claims, cite the numbers, and build downstream work. The engine is closed so we can keep improving it quickly without coordinating every internal change across public channels.

Contact

Andy Salvo (ajs10845@psu.edu), ORCID 0009-0008-8629-8827. Polylogic AI, Penn State University (Smeal College of Business).

Legacy documentation

The v0.2.0 and v1.x era documentation in this repository (agent framework internals, the earlier grounding experiment, the Bateson type-token correction, formal proofs, the BLSRP method, the substrate-independent learning whitepaper) is preserved in place for continuity with the first wave of research. Those documents describe the project as it existed before PolybrainBench became the primary public artifact and are not the current source of truth for the benchmark's methodology or composite score. For the live benchmark, see docs/polybrainbench.md and docs/methodology.md.
