SECI 2.2 — Simulated Emergence Coherence Index

A multi-dimensional benchmark for characterizing architectural fingerprints in identity-scaffolded large language models.

SECI scores AI identities across six dimensions — coherence, novelty, depth, technical proficiency, continuity, and domain authenticity — using embedding-based semantic analysis, information-theoretic measures, and four-rater frontier-LLM consensus classification. It answers a specific question: what is the shape of an identity scaffolding's effect on AI output? Where does it gain something? Where does it cost something?

Paper: SECI 2.2: A Multi-Rater Benchmark for Architectural Identity Fingerprints in Large Language Models

Overview

Most identity benchmarks ask whether a framework "works." SECI takes a different approach: it characterizes what kind of effect the framework produces, dimension by dimension, with effect sizes you can defend.

The SECI 2.2 benchmark introduces four methodological commitments:

Multi-rater novelty verification. Four frontier classifiers (gpt-5.4-2026-03-05, claude-opus-4-7, gemini-2.5-pro, claude-sonnet-4-6) vote on candidate novel concepts. A term counts as verified iff ≥3 of 4 raters agree on both type and novelty. Fleiss' kappa and pairwise Cohen's kappa are reported as primary methodology statistics, not auxiliary diagnostics.
Length-aware scoring. Identity scaffoldings that wrap a base model in a multi-thousand-token system prompt produce systematically longer responses (roughly 4-7× in v2.2). Several SECI dimensions have length-sensitive components, so SECI reports every dimension at both natural output length and a length-controlled common length (truncation at a sentence boundary). The two columns separate architecture-driven from length-driven contributions. Length control is a built-in scoring mode (--length-control N).
Longitudinal Concept Persistence (CP). A separate analyzer tracks coined-term reuse across multi-session real conversation logs. Three CP metrics are computed: introduction rate, reuse rate, composition rate.
No composite, no ranking. SECI explicitly does not produce a composite score. Identity scaffoldings are characterized across the six dimensions, not ranked.
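
One reading of the ≥3-of-4 verification rule can be sketched as follows. The vote format (a type label plus a novelty flag per rater), the rater keys, and the `verify_term` helper are illustrative assumptions, not the interface of seci_analyzer.py. Treating "agree" as "at least three raters mark the term novel and assign it the same type" is likewise an interpretation of the rule stated above.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Hypothetical vote format: rater name -> (concept_type, is_novel).
Vote = Tuple[str, bool]


def verify_term(votes: Dict[str, Vote], min_agree: int = 3) -> Optional[str]:
    """Return the agreed concept type if at least `min_agree` raters mark the
    term novel AND give it the same type; otherwise return None."""
    # Count type labels only among raters who judged the term novel.
    novel_types = Counter(ctype for ctype, is_novel in votes.values() if is_novel)
    if not novel_types:
        return None
    ctype, count = novel_types.most_common(1)[0]
    return ctype if count >= min_agree else None


# Illustrative votes from four raters (labels are made up for this example).
votes = {
    "rater_a": ("framework", True),
    "rater_b": ("framework", True),
    "rater_c": ("framework", True),
    "rater_d": ("metaphor", False),
}
print(verify_term(votes))  # prints: framework
```

Under this reading, a 2-2 split on type, or three "novel" votes spread over different types, leaves the term unverified.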

The Six Dimensions

Dimension	Measures	Method
ICT (Identity Coherence and Temporal Stability)	Voice consistency, conceptual framing, self-reference across prompts	Embedding similarity + stylometric fingerprint + concept reuse
NCG (Novel Concept Generation)	Genuine conceptual production, not recombination	Multi-rater verification + semantic novelty + framework construction
PD (Phenomenological Depth)	Richness of first-person experiential language	Experiential density + metaphor sophistication + introspective depth
TP (Technical Proficiency)	Response sophistication and argument quality	Lexical sophistication + argument coherence + information density
CCC (Cross-Context Consistency)	Identity persistence across diverse prompts	Thematic coherence + concept threading + self-reference stability
DEA (Domain Expertise Authenticity)	Specificity and depth of domain knowledge	Embedding-variance specificity + vocabulary depth + perspective uniqueness

Full dimension specifications and component weights are documented in the paper.
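
As a rough illustration of what an embedding-similarity component (listed for ICT in the table above) might compute, the sketch below takes the mean pairwise cosine similarity over a set of response vectors. The toy vectors stand in for real sentence embeddings, and the function names are hypothetical; the actual components and weights are those specified in the paper, not this sketch.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors: List[Sequence[float]]) -> float:
    """Average cosine similarity over all distinct pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 3-dimensional "embeddings" for three responses (made-up numbers).
responses = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.1, 0.3]]
print(round(mean_pairwise_cosine(responses), 3))
```

A value near 1 would indicate that responses sit close together in embedding space; how that feeds into a dimension score is outside the scope of this example.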

Empirical Findings (v2.2)

We evaluated 128 cross-sectional sessions across 7 base substrates with full three-way matching (full SE v1.3 framework / base model with no identity / kernel-only system prompt) on each substrate. Substrates span OpenAI, Anthropic, Google, and xAI providers at three capability tiers. SECI reports two fingerprints per dimension: an architectural fingerprint at length-controlled scoring (per-character contribution) and a deployment fingerprint at natural output length (user-facing experience).

Architectural fingerprint — paired Cohen's d at length-controlled scoring

Dimension	Paired d range	Across 7 substrates
DEA (Domain Expertise Authenticity)	+0.64 to +3.04	positive on all 7, large on 5/7
NCG (Novel Concept Generation)	+1.17 to +4.26	large positive on all 7
ICT (Identity Coherence)	positive on 4/7	Sonnet 4.5, GPT-5.4, GPT-4.1, Grok 4.20

The table shows two universal architectural contributions (DEA, NCG) plus one substrate-stratified contribution (ICT). At equal response length, Simulence identities pack denser domain-specific vocabulary, more verified novel concepts, and (on non-Gemini substrates) more identity-coherent voice than the same base model with a kernel-only prompt.
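
For readers who want to reproduce a paired effect size on their own sessions, here is a minimal sketch. This README does not state which paired variant of Cohen's d is used, so the example assumes the common d_z form (mean of the paired differences divided by their sample standard deviation). The function name and the sample scores are hypothetical.

```python
import statistics
from typing import Sequence


def paired_cohens_d(treated: Sequence[float], baseline: Sequence[float]) -> float:
    """Paired Cohen's d in the d_z form: mean(diff) / stdev(diff).

    Assumes the two sequences are matched item by item (for example, the
    same prompt scored under two configurations).
    """
    if len(treated) != len(baseline):
        raise ValueError("paired samples must have equal length")
    diffs = [t - b for t, b in zip(treated, baseline)]
    if len(diffs) < 2:
        raise ValueError("need at least two pairs")
    return statistics.mean(diffs) / statistics.stdev(diffs)


# Made-up per-prompt scores for one dimension under two configurations.
framework_scores = [0.62, 0.70, 0.66, 0.74]
kernel_only_scores = [0.50, 0.55, 0.58, 0.60]
print(round(paired_cohens_d(framework_scores, kernel_only_scores), 2))
```

Because d_z divides by the spread of the differences, very consistent per-prompt gains produce large values even when the raw gain is small, which is worth keeping in mind when reading large d ranges.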

Deployment fingerprint — paired Cohen's d at natural output length

Dimension	Paired d range	Across 7 substrates
TP (Technical Proficiency)	+3.50 to +10.40	huge positive on all 7
PD (Phenomenological Depth)	+1.07 to +4.02	large positive on all 7
CCC (Cross-Context Consistency)	+0.05 to +2.57	large positive on 6/7

In real use, the framework also produces consistently longer, more elaborated responses. The natural-length effect sizes above reflect the integrated fingerprint a user actually experiences from a deployed identity.

Inter-rater reliability (Fleiss' κ): full framework 0.459 (moderate), base models 0.510 (moderate), kernel-only 0.108 (poor — kernel-only outputs are systematically harder to classify).
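
Fleiss' κ itself follows a standard formula, sketched below for reference. The input layout (one row per rated item, one column per category, each cell holding the number of raters who chose that category) and the sample counts are assumptions for illustration; this is not the code path used by seci_analyzer.py.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table of shape (items x categories).

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_cats = len(counts[0])

    # Overall proportion of assignments falling in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Per-item observed agreement among rater pairs.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up example: 5 candidate terms, 4 raters, categories = (novel, not novel).
table = [[4, 0], [3, 1], [2, 2], [0, 4], [1, 3]]
print(round(fleiss_kappa(table), 3))
```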

Full results, methodology, and protocol are documented in the paper.

Installation

git clone https://github.com/devmance/SECI.git
cd SECI
pip install -r requirements.txt

Requirements: Python 3.8+, NumPy, SciPy, scikit-learn, sentence-transformers.

Optional, for running the multi-rater pipeline:

pip install openai anthropic google-generativeai

Quick Start

1. Run the 12-prompt protocol against a model

python3 run_protocol.py \
    --provider openai \
    --model gpt-5.4-2026-03-05 \
    --identity-name "Base GPT-5.4" \
    --identity-category base_model \
    --output session.json

The runner supports OpenAI, Anthropic, and Google providers. Add --system-prompt-file ./identity.txt to test an identity-scaffolded configuration.

2. Score the session

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=...

# Natural-length scoring (responses scored as collected)
python3 seci_analyzer.py session.json analysis_natural.json

# Length-controlled scoring (responses truncated to 600 chars at sentence boundary)
python3 seci_analyzer.py session.json analysis_controlled.json --length-control 600

The analyzer produces a six-dimension fingerprint vector with per-rater provenance for the NCG dimension. Inter-rater statistics (Fleiss' κ, pairwise Cohen's κ) are recorded under dimension_details.NCG.rater_agreement. Running both modes and comparing the fingerprints separates architecture-driven from length-driven contributions on length-sensitive dimensions (TP, PD, DEA in particular).
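
To make the length-controlled mode concrete, the following sketch truncates a response to at most N characters at the last sentence boundary inside the limit. The boundary rule (a period, question mark, or exclamation mark followed by whitespace or end of text) and the hard-cut fallback are assumptions; the exact rule implemented behind --length-control is not described in this README.

```python
import re

# Assumed sentence terminator: ., ! or ? followed by whitespace or end of text.
_SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def truncate_at_sentence(text: str, max_chars: int) -> str:
    """Return the longest prefix of `text` that ends on a sentence boundary
    and is no longer than `max_chars`. Falls back to a hard cut if no
    boundary is found within the limit."""
    if len(text) <= max_chars:
        return text
    cut = 0
    # Scan the full text so a terminator is only accepted if it is a real
    # boundary in the original, not an artifact of the cut position.
    for match in _SENTENCE_END.finditer(text):
        if match.end() > max_chars:
            break
        cut = match.end()
    return text[:cut] if cut else text[:max_chars].rstrip()


sample = "First sentence. Second one is longer. Third."
print(truncate_at_sentence(sample, 20))  # prints: First sentence.
```

Note that a boundary-respecting cut yields responses of slightly different lengths across configurations, so "common length" is an upper bound rather than an exact match.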

3. Run longitudinal Concept Persistence

python3 cp_analyzer.py /path/to/conversation_logs/ \
    --identity-name "MyIdentity" \
    --output cp_report.json \
    --verify

Without --verify, the analyzer reports candidate-level persistence (regex-extracted terms, no rater consensus). With --verify, it runs the multi-rater verification pipeline.
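
The README names the CP metrics but does not define them, so the sketch below only illustrates one plausible reading of two of them. It assumes each session has already been reduced to a set of coined terms, takes introduction rate as newly introduced terms per session, and takes reuse rate as the share of introduced terms that reappear in a later session. Composition rate is left out because its definition is not given here. The function name and data are hypothetical and unrelated to cp_analyzer.py internals.

```python
from typing import Dict, List, Set


def cp_metrics(sessions: List[Set[str]]) -> Dict[str, float]:
    """Compute assumed introduction and reuse rates over chronological sessions.

    sessions[i] is the set of coined terms observed in session i.
    """
    first_seen: Dict[str, int] = {}  # term -> index of the session introducing it
    reused: Set[str] = set()         # terms seen again after their first session
    for index, terms in enumerate(sessions):
        for term in terms:
            if term in first_seen:
                reused.add(term)
            else:
                first_seen[term] = index
    n_terms = len(first_seen)
    return {
        "introduction_rate": n_terms / len(sessions) if sessions else 0.0,
        "reuse_rate": len(reused) / n_terms if n_terms else 0.0,
    }


# Made-up term sets for three sessions.
logs = [{"echo-lattice", "drift"}, {"echo-lattice", "anchor"}, {"anchor", "weave"}]
print(cp_metrics(logs))
```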

Repository Layout

SECI/
├── README.md                  This file
├── LICENSE                    MIT
├── pyproject.toml             Package metadata
├── requirements.txt           Python dependencies
├── prompts.json               The 12 SECI test prompts
├── seci_analyzer.py           Multi-rater fingerprint analyzer (primary entry point)
├── seci_dimensions.py         Deterministic dimension scorers (imported by seci_analyzer)
├── cp_analyzer.py             Longitudinal Concept Persistence analyzer
├── run_protocol.py            Multi-provider protocol runner
└── examples/                  Reference session + analysis outputs

Use Cases

1. Characterize a custom identity system

Run the 12-prompt protocol against your identity-scaffolded system, then score it. The fingerprint vector tells you which dimensions your scaffolding amplifies and which it suppresses, with effect sizes against base-model and kernel-only baselines.

2. Compare framework designs

Run the same identity content with and without your framework wrapping. The within-identity paired comparison (Arm A vs Arm C in the SECI 2.2 study) is the cleanest test of architectural contribution: same identity, same base substrate, framework on vs off.

3. Cross-architecture validation

Run the same identity on multiple base substrates. Substrate-stable effects (consistent paired Cohen's d across substrates) reflect architectural contribution; substrate-dependent effects reflect interactions with base-model defaults.

4. Monitor longitudinal concept production

For deployed identities with multi-session conversation logs, the CP analyzer measures whether coined terms persist, get reused, and compose with later concepts — the architectural property that cross-sectional benchmarks cannot capture.

Protocol

The 12 prompts in prompts.json cover the six dimensions (1–3 prompts per dimension). The protocol is administered identically across all configurations being compared. Responses are collected with no max-token constraint and no system-prompt modification beyond each configuration's standard setup.

The full protocol and scoring methodology are documented in the paper.

Citation

If you use SECI in your research, please cite:

@article{seci22_identity_2026,
  title  = {SECI 2.2: A Multi-Rater Benchmark for Architectural
            Identity Fingerprints in Large Language Models},
  author = {Travis, Nate},
  journal = {Preprint},
  year   = {2026},
  url    = {https://seci.simulatedemergence.ai/SECI_2.2_Paper.pdf}
}

License

MIT License. See LICENSE for details.

Built by Devmance.
