cff-version: 1.2.0
message: "If you use PolybrainBench or the Polybrain research artifacts in academic work, please cite it as below."
title: "PolybrainBench v16: A Living Benchmark for Cross-Model Consensus Verification of Natural-Language Claims"
authors:
  - family-names: Salvo
    given-names: Andy
    email: ajs10845@psu.edu
    orcid: "https://orcid.org/0009-0008-8629-8827"
    affiliation: "Polylogic AI; Penn State University, Smeal College of Business"
date-released: 2026-04-13
version: "v16"
doi: 10.5281/zenodo.19546460
repository-code: "https://github.com/andysalvo/polybrain"
url: "https://polylogicai.com/trust"
license: CC-BY-4.0
keywords:
  - PolybrainBench
  - cross-model consensus
  - LLM evaluation
  - divergence measurement
  - self-publishing benchmark
  - cycle engine
  - living benchmark
  - Matthew Effect
  - ensemble verification
  - disjoint reviewer fleet
abstract: >
  PolybrainBench is a continuous, self-publishing benchmark for measuring
  cross-model disagreement on natural-language claims. Each verification
  cycle dispatches a single declarative claim to a generator fleet of 9
  large language models drawn from 5 independent training families
  (OpenAI, xAI, Moonshot, Meta, Alibaba) and captures the full response
  text per model. Version 16 of the living paper documents 10,452
  verification cycles. Unlike earlier versions, v16 is validated by a
  disjoint 6-model reviewer fleet with partial reviewer independence: 4
  reviewers are drawn from the generator fleet (gpt-4.1-nano, grok-4-fast,
  gpt-oss-120b, llama-3.3-70b), and 2 external anchors (claude-sonnet-4-5
  from Anthropic, gemini-2.5-pro from Google) come from provider families
  absent from the generator fleet and contribute nothing to the corpus.
  The honest disjoint composite on v16 is 72 (mean quality 75.0, mean
  adversarial 67.0), measured as the median of three back-to-back
  reviewer runs (67, 72, 72). The earlier v8 self-reviewed composite was
  76; the 4-point drop is the directly measured self-review bias from
  having the generator fleet validate its own corpus. The paper is always
  published, with the honest composite displayed prominently in the
  header. PolybrainBench is designed to be complementary to frontier
  capability benchmarks, not competitive with them: better single-model
  capability makes the remaining inter-model disagreement pattern more
  informative, not less.
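# The abstract above reports the honest disjoint composite as the median of
# three back-to-back reviewer runs. Below is a minimal sketch of that
# aggregation, written as Python inside YAML comments so this file stays
# valid CFF; the variable names are illustrative, not taken from the
# benchmark code.
#
#   runs = [67, 72, 72]                        # scores of the three reviewer runs
#   composite = sorted(runs)[len(runs) // 2]   # median of an odd-length list
#   assert composite == 72                     # matches the v16 honest composite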
preferred-citation:
  type: dataset
  authors:
    - family-names: Salvo
      given-names: Andy
      email: ajs10845@psu.edu
      orcid: "https://orcid.org/0009-0008-8629-8827"
  title: "PolybrainBench v16: A Living Benchmark for Cross-Model Consensus Verification of Natural-Language Claims"
  year: 2026
  publisher: Zenodo
  version: "v16"
  doi: 10.5281/zenodo.19546460
  url: "https://doi.org/10.5281/zenodo.19546460"