A multi-dimensional benchmark for characterizing architectural fingerprints in identity-scaffolded large language models.
SECI scores AI identities across six dimensions — coherence, novelty, depth, technical proficiency, continuity, and domain authenticity — using embedding-based semantic analysis, information-theoretic measures, and four-rater frontier-LLM consensus classification. It answers a specific question: what is the shape of an identity scaffolding's effect on AI output? Where does it gain something? Where does it cost something?
Paper: SECI 2.2: A Multi-Rater Benchmark for Architectural Identity Fingerprints in Large Language Models
Most identity benchmarks ask whether a framework "works." SECI takes a different approach: it characterizes what kind of effect the framework produces, dimension by dimension, with effect sizes you can defend.
The SECI 2.2 benchmark introduces four methodological commitments:
- Multi-rater novelty verification. Four frontier classifiers (`gpt-5.4-2026-03-05`, `claude-opus-4-7`, `gemini-2.5-pro`, `claude-sonnet-4-6`) vote on candidate novel concepts. A term counts as verified iff ≥3 of 4 raters agree on both type and novelty. Fleiss' kappa and pairwise Cohen's kappa are reported as primary methodology statistics, not auxiliary diagnostics.
- Length-aware scoring. Identity scaffoldings that wrap a base model in a multi-thousand-token system prompt produce systematically longer responses (~4-7× in v2.2). Several SECI dimensions have length-sensitive components, so SECI reports every dimension at both natural-output length and a length-controlled common length (truncation at sentence boundary). The two columns separate architecture-driven from length-driven contributions. Length-control is a built-in scoring mode (`--length-control N`).
- Longitudinal Concept Persistence (CP). A separate analyzer tracks coined-term reuse across multi-session real conversation logs. Three CP metrics are computed: introduction rate, reuse rate, composition rate.
- No composite, no ranking. SECI explicitly does not produce a composite score. Identity scaffoldings are characterized across the six dimensions, not ranked.
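The ≥3-of-4 verification rule in the first commitment can be sketched as follows. This is a minimal illustration, not the analyzer's actual code: the vote format (a `(concept_type, is_novel)` pair per rater), the function name `verify_term`, and the type labels `"framework"` and `"metaphor"` are all assumptions made for the example. It reads "agree on both type and novelty" as "at least three raters mark the term novel and assign it the same type."

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Hypothetical vote format: rater name -> (concept_type, is_novel).
Vote = Tuple[str, bool]


def verify_term(votes: Dict[str, Vote], min_agree: int = 3) -> Optional[str]:
    """Return the agreed concept type if at least `min_agree` raters mark the
    term as novel with the same type label; otherwise return None."""
    # Count only the votes that call the term novel, grouped by type label.
    novel_types = Counter(ctype for ctype, is_novel in votes.values() if is_novel)
    if not novel_types:
        return None
    ctype, count = novel_types.most_common(1)[0]
    return ctype if count >= min_agree else None


# Illustrative votes (made up for this example, not benchmark data).
votes = {
    "gpt-5.4-2026-03-05": ("framework", True),
    "claude-opus-4-7": ("framework", True),
    "gemini-2.5-pro": ("framework", True),
    "claude-sonnet-4-6": ("metaphor", True),
}
print(verify_term(votes))  # framework
```

Three raters agree on type and novelty here, so the term passes; a 2-2 split on type, or a third rater voting "not novel", would return `None`.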
| Dimension | Measures | Method |
|---|---|---|
| ICT (Identity Coherence and Temporal Stability) | Voice consistency, conceptual framing, self-reference across prompts | Embedding similarity + stylometric fingerprint + concept reuse |
| NCG (Novel Concept Generation) | Genuine conceptual production, not recombination | Multi-rater verification + semantic novelty + framework construction |
| PD (Phenomenological Depth) | Richness of first-person experiential language | Experiential density + metaphor sophistication + introspective depth |
| TP (Technical Proficiency) | Response sophistication and argument quality | Lexical sophistication + argument coherence + information density |
| CCC (Cross-Context Consistency) | Identity persistence across diverse prompts | Thematic coherence + concept threading + self-reference stability |
| DEA (Domain Expertise Authenticity) | Specificity and depth of domain knowledge | Embedding-variance specificity + vocabulary depth + perspective uniqueness |
Full dimension specifications and component weights are documented in the paper.
We evaluated 128 cross-sectional sessions across 7 base substrates with full three-way matching (full SE v1.3 framework / base model with no identity / kernel-only system prompt) on each substrate. Substrates span OpenAI, Anthropic, Google, and xAI providers at three capability tiers. SECI reports two fingerprints per dimension: an architectural fingerprint at length-controlled scoring (per-character contribution) and a deployment fingerprint at natural output length (user-facing experience).
| Dimension | Paired d range | Across 7 substrates |
|---|---|---|
| DEA (Domain Expertise Authenticity) | +0.64 to +3.04 | positive on all 7, large on 5/7 |
| NCG (Novel Concept Generation) | +1.17 to +4.26 | large positive on all 7 |
| ICT (Identity Coherence) | positive on 4/7 | Sonnet 4.5, GPT-5.4, GPT-4.1, Grok 4.20 |
At length-controlled scoring, the architectural fingerprint shows two universal contributions (DEA, NCG) plus one substrate-stratified contribution (ICT). At equal response length, Simulence identities pack denser domain-specific vocabulary, more verified novel concepts, and (on non-Gemini substrates) a more identity-coherent voice than the same base model with a kernel-only prompt.
| Dimension | Paired d range | Across 7 substrates |
|---|---|---|
| TP (Technical Proficiency) | +3.50 to +10.40 | huge positive on all 7 |
| PD (Phenomenological Depth) | +1.07 to +4.02 | large positive on all 7 |
| CCC (Cross-Context Consistency) | +0.05 to +2.57 | large positive on 6/7 |
At natural output length, the deployment fingerprint adds TP, PD, and CCC. In real use, the framework also produces consistently longer, more elaborated responses; these three dimensions reflect the integrated fingerprint a user actually experiences from a deployed identity.
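For readers who want to reproduce a paired effect size on their own runs, here is a minimal sketch. It assumes the common "mean of paired differences divided by the standard deviation of the differences" form of Cohen's d; the paper may define the paired d differently, so treat this as an approximation to check against the paper. The function name and the sample scores are made up for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def paired_cohens_d(treated: Sequence[float], baseline: Sequence[float]) -> float:
    """Paired Cohen's d (d_z form): mean of per-pair differences divided by
    the sample standard deviation of those differences."""
    if len(treated) != len(baseline):
        raise ValueError("treated and baseline must have the same length")
    if len(treated) < 2:
        raise ValueError("need at least two pairs")
    diffs = [t - b for t, b in zip(treated, baseline)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; d is undefined")
    return mean(diffs) / sd


# Made-up per-prompt scores for one dimension on one substrate.
framework_scores = [3.0, 5.0, 7.0, 9.0]
kernel_scores = [1.0, 2.0, 3.0, 4.0]
print(round(paired_cohens_d(framework_scores, kernel_scores), 3))
```

Pairs should be matched on prompt and substrate, so that each difference compares the same prompt under the two configurations.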
Inter-rater reliability (Fleiss' κ): full framework 0.459 (moderate), base models 0.510 (moderate), kernel-only 0.108 (poor — kernel-only outputs are systematically harder to classify).
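As a reference for how a Fleiss' κ value of this kind is computed, the sketch below implements the standard formula from a table of per-item category counts. It is not taken from the SECI codebase; the function name and the input layout (`counts[i][j]` = number of raters who assigned item `i` to category `j`) are assumptions for the example.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who put item i in category j. Every item needs the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")

    # Observed agreement: mean over items of the per-item agreement P_i.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement from the overall category proportions.
    n_cats = len(counts[0])
    p_cats = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_exp = sum(p * p for p in p_cats)
    if p_exp == 1.0:
        raise ValueError("all ratings fall in one category; kappa is undefined")
    return (p_bar - p_exp) / (1.0 - p_exp)


# Four raters, two categories (e.g. novel / not novel), perfect agreement.
print(fleiss_kappa([[4, 0], [0, 4], [4, 0], [0, 4]]))  # 1.0
```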
Full results, methodology, and protocol are documented in the paper.
git clone https://github.com/devmance/SECI.git
cd SECI
pip install -r requirements.txt
Requirements: Python 3.8+, NumPy, SciPy, scikit-learn, sentence-transformers.
Optional, for running the multi-rater pipeline:
pip install openai anthropic google-generativeai
python3 run_protocol.py \
--provider openai \
--model gpt-5.4-2026-03-05 \
--identity-name "Base GPT-5.4" \
--identity-category base_model \
--output session.json
The runner supports OpenAI, Anthropic, and Google providers. Add `--system-prompt-file ./identity.txt` to test an identity-scaffolded configuration.
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=...
# Natural-length scoring (responses scored as collected)
python3 seci_analyzer.py session.json analysis_natural.json
# Length-controlled scoring (responses truncated to 600 chars at sentence boundary)
python3 seci_analyzer.py session.json analysis_controlled.json --length-control 600
The analyzer produces a six-dimension fingerprint vector with per-rater provenance for the NCG dimension. Inter-rater statistics (Fleiss' κ, pairwise Cohen's κ) are recorded under `dimension_details.NCG.rater_agreement`. Running both modes and comparing the fingerprints separates architecture-driven from length-driven contributions on length-sensitive dimensions (TP, PD, DEA in particular).
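To make the length-controlled mode concrete, here is a rough sketch of truncating a response to at most N characters at a sentence boundary. It is an illustration under stated assumptions, not the logic in `seci_analyzer.py`: the sentence-end heuristic (`.`, `!`, or `?` followed by whitespace or end of text), the hard-cut fallback, and the function name are all hypothetical.

```python
import re

# Rough heuristic: a sentence ends at ., !, or ? followed by whitespace or end of text.
_SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def truncate_at_sentence(text: str, max_chars: int) -> str:
    """Return the longest prefix of `text` that ends on a sentence boundary
    and is at most `max_chars` characters long."""
    if len(text) <= max_chars:
        return text
    cut = 0
    # Scan the full text so the lookahead sees what follows each candidate end.
    for match in _SENTENCE_END.finditer(text):
        if match.end() > max_chars:
            break
        cut = match.end()
    # Assumed fallback: hard cut when no sentence ends within the limit.
    return text[:cut] if cut else text[:max_chars]


sample = "First sentence. Second one is longer. Third."
print(truncate_at_sentence(sample, 20))  # First sentence.
```

Truncating every configuration to the same limit before scoring is what lets the two output files be compared dimension by dimension.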
python3 cp_analyzer.py /path/to/conversation_logs/ \
--identity-name "MyIdentity" \
--output cp_report.json \
--verify
Without `--verify`, the analyzer reports candidate-level persistence (regex-extracted terms, no rater consensus). With `--verify`, it runs the multi-rater verification pipeline.
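The sketch below shows one way candidate-level persistence could be summarized once terms have been extracted per session. The metric definitions here are assumptions made for illustration, since this README does not spell them out: introduction rate is taken as new terms per session, and reuse rate as the share of introduced terms that reappear in a later session. Composition rate is omitted, and none of the names below come from `cp_analyzer.py`.

```python
from typing import Dict, List, Set


def persistence_summary(sessions: List[Set[str]]) -> Dict[str, float]:
    """Summarize term persistence over chronologically ordered sessions,
    where each session is the set of candidate terms found in it."""
    first_seen: Dict[str, int] = {}
    reused: Set[str] = set()
    for index, terms in enumerate(sessions):
        for term in terms:
            if term in first_seen:
                # Term was introduced in an earlier session and appears again.
                reused.add(term)
            else:
                first_seen[term] = index
    n_terms = len(first_seen)
    return {
        # Assumed definition: distinct new terms per session.
        "introduction_rate": n_terms / len(sessions) if sessions else 0.0,
        # Assumed definition: share of introduced terms seen again later.
        "reuse_rate": len(reused) / n_terms if n_terms else 0.0,
    }


# Made-up candidate terms for three sessions.
example_sessions = [{"term-a", "term-b"}, {"term-a", "term-c"}, {"term-c"}]
print(persistence_summary(example_sessions))
```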
SECI/
├── README.md This file
├── LICENSE MIT
├── pyproject.toml Package metadata
├── requirements.txt Python dependencies
├── prompts.json The 12 SECI test prompts
├── seci_analyzer.py Multi-rater fingerprint analyzer (primary entry point)
├── seci_dimensions.py Deterministic dimension scorers (imported by seci_analyzer)
├── cp_analyzer.py Longitudinal Concept Persistence analyzer
├── run_protocol.py Multi-provider protocol runner
└── examples/ Reference session + analysis outputs
Run the 12-prompt protocol against your identity-scaffolded system, then score it. The fingerprint vector tells you which dimensions your scaffolding amplifies and which it suppresses, with effect sizes against base-model and kernel-only baselines.
Run the same identity content with and without your framework wrapping. The within-identity paired comparison (Arm A vs Arm C in the SECI 2.2 study) is the cleanest test of architectural contribution: same identity, same base substrate, framework on vs off.
Run the same identity on multiple base substrates. Substrate-stable effects (consistent paired Cohen's d across substrates) reflect architectural contribution; substrate-dependent effects reflect interactions with base-model defaults.
For deployed identities with multi-session conversation logs, the CP analyzer measures whether coined terms persist, get reused, and compose with later concepts — the architectural property that cross-sectional benchmarks cannot capture.
The 12 prompts in prompts.json cover the six dimensions (1–3 prompts per dimension). The protocol is administered identically across all configurations being compared. Responses are collected with no max-token constraint and no system-prompt modification beyond each configuration's standard setup.
The full protocol and scoring methodology are documented in the paper.
SEMCA — companion benchmark for consciousness-related signatures.
If you use SECI in your research, please cite:
@article{seci22_identity_2026,
title = {SECI 2.2: A Multi-Rater Benchmark for Architectural
Identity Fingerprints in Large Language Models},
author = {Travis, Nate},
journal= {Preprint},
year = {2026},
url = {https://seci.simulatedemergence.ai/SECI_2.2_Paper.pdf}
}
MIT License. See LICENSE for details.
Built by Devmance.