cff-version: 1.2.0
message: "If you use PolybrainBench or the Polybrain research artifacts in academic work, please cite it as below."
title: "PolybrainBench v16: A Living Benchmark for Cross-Model Consensus Verification of Natural-Language Claims"
authors:
  - family-names: Salvo
    given-names: Andy
    email: ajs10845@psu.edu
    orcid: "https://orcid.org/0009-0008-8629-8827"
    affiliation: "Polylogic AI; Penn State University, Smeal College of Business"
date-released: 2026-04-13
version: "v16"
doi: 10.5281/zenodo.19546460
repository-code: "https://github.com/andysalvo/polybrain"
url: "https://polylogicai.com/trust"
license: CC-BY-4.0
keywords:
  - PolybrainBench
  - cross-model consensus
  - LLM evaluation
  - divergence measurement
  - self-publishing benchmark
  - cycle engine
  - living benchmark
  - Matthew Effect
  - ensemble verification
  - disjoint reviewer fleet
abstract: >
  PolybrainBench is a continuous, self-publishing benchmark for measuring
  cross-model disagreement on natural-language claims. Each verification
  cycle dispatches a single declarative claim to a generator fleet of 9
  large language models drawn from 5 independent training families
  (OpenAI, xAI, Moonshot, Meta, Alibaba) and captures the full response
  text per model. Version 16 of the living paper documents 10,452
  verification cycles. Unlike earlier versions, v16 is validated by a
  disjoint 6-model reviewer fleet with partial reviewer independence: 4
  reviewers are drawn from the generator fleet (gpt-4.1-nano, grok-4-fast,
  gpt-oss-120b, llama-3.3-70b), and 2 external anchors (claude-sonnet-4-5
  from Anthropic, gemini-2.5-pro from Google) come from provider families
  absent from the generator fleet and contribute nothing to the corpus.
  The honest disjoint composite on v16 is 72 (mean quality 75.0, mean
  adversarial 67.0), measured as the median of three back-to-back
  reviewer runs (67, 72, 72). The earlier v8 self-reviewed composite was
  76; the 4-point drop is the directly measured self-review bias from
  having the generator fleet validate its own corpus. The paper is always
  published, with the honest composite displayed prominently in the
  header. PolybrainBench is designed to be complementary to frontier
  capability benchmarks, not competitive with them: better single-model
  capability makes the remaining inter-model disagreement pattern more
  informative, not less.
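# The abstract above reports the honest disjoint composite as the median of
# three back-to-back reviewer runs. Below is a minimal sketch of that
# aggregation, written as Python inside YAML comments so this file stays
# valid CFF; the variable names are illustrative, not taken from the
# benchmark code.
#
#   runs = [67, 72, 72]                        # scores of the three reviewer runs
#   composite = sorted(runs)[len(runs) // 2]   # median of an odd-length list
#   assert composite == 72                     # matches the v16 honest composite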
preferred-citation:
  type: dataset
  authors:
    - family-names: Salvo
      given-names: Andy
      email: ajs10845@psu.edu
      orcid: "https://orcid.org/0009-0008-8629-8827"
  title: "PolybrainBench v16: A Living Benchmark for Cross-Model Consensus Verification of Natural-Language Claims"
  year: 2026
  publisher: Zenodo
  version: "v16"
  doi: 10.5281/zenodo.19546460
  url: "https://doi.org/10.5281/zenodo.19546460"