Date: 2026-03-13
Papers Validated: P24-P36 (13 papers)
GPU: NVIDIA RTX 4050 (6GB VRAM, 80% memory limit)
Framework: CuPy 14.0.1
Out of 13 Phase 2 papers simulated:
- 7 claims VALIDATED across 6 papers
- 30 claims NOT VALIDATED across 11 papers
- 3 papers had partial validation (some claims passed, others failed)
P24 (Self-Play):
| Claim | Status | Result | Threshold |
|---|---|---|---|
| Self-play improves success rate >30% | ❌ NOT VALIDATED | 23.97% | 30% |
| ELO correlates with performance (r > 0.8) | ❌ NOT VALIDATED | 0.24 | 0.8 |
| Novel strategies emerge | ✅ VALIDATED | 12 strategies | >0 |
| Edge cases discovered | ✅ VALIDATED | 97 edge cases | >0 |
Validation Rate: 2/4 (50%)
Key Findings:
- Self-play shows improvement (~24%) but falls short of the 30% target
- ELO ratings correlate poorly with actual performance (r = 0.24; see the check sketched below)
- System successfully discovers novel strategies and edge cases
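A minimal sketch of how the ELO-performance correlation claim could be re-checked offline, assuming per-agent ELO ratings and measured success rates are available as arrays; the values below are illustrative, not the simulation's data.

```python
import cupy as cp

# Hypothetical per-agent data: final ELO rating and measured task success rate.
elo = cp.asarray([1520.0, 1480.0, 1610.0, 1395.0, 1550.0, 1465.0])
success = cp.asarray([0.61, 0.58, 0.64, 0.52, 0.57, 0.60])

# Pearson correlation; the claim requires r > 0.8, while the run measured 0.24.
r = float(cp.corrcoef(elo, success)[0, 1])
print(f"ELO vs. performance: r = {r:.2f} (threshold 0.8)")
```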
P25 (Hydraulic):
| Claim | Status | Result | Threshold |
|---|---|---|---|
| Pressure predicts activation | ✅ VALIDATED | r=0.99 | 0.5 |
| Flow follows Kirchhoff's law | ❌ NOT VALIDATED | 0.27 compliance | 0.8 |
| Emergence detectable | ❌ NOT VALIDATED | 0 events | >0 |
| Diversity correlates with stability | ❌ NOT VALIDATED | 0.0 diversity | 0.7 |
Validation Rate: 1/4 (25%)
Key Findings:
- Strong pressure-activation correlation (r = 0.99) validates the core premise
- Flow dynamics do not follow Kirchhoff's law in this implementation
- No emergence events detected in 50 timesteps
- Zero Shannon diversity indicates no agent variety at all (computation sketched below)
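Zero Shannon diversity means every agent collapsed onto a single type. A minimal sketch of the diversity metric, assuming agent types are available as integer labels (hypothetical data, not the simulation's):

```python
import cupy as cp

def shannon_diversity(agent_types: cp.ndarray) -> float:
    """Shannon entropy H = -sum(p_i * ln p_i) over agent-type proportions.
    H is 0 when all agents share one type and grows with variety."""
    _, counts = cp.unique(agent_types, return_counts=True)
    p = counts / counts.sum()
    return float(-cp.sum(p * cp.log(p)))

print(shannon_diversity(cp.asarray([0, 0, 0, 0])))        # 0.0 -- the observed case
print(shannon_diversity(cp.asarray([0, 1, 2, 1, 0, 3])))  # > 0 with mixed types
```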
P26 (Value Networks):
| Claim | Status | Result | Threshold |
|---|---|---|---|
| Value prediction correlates (r > 0.7) | ❌ NOT VALIDATED | 0.07 | 0.7 |
| Uncertainty well-calibrated | ✅ VALIDATED | Brier=0.08 | 0.2 |
| Value-guided > random by >20% | ❌ NOT VALIDATED | -16% | 20% |
Validation Rate: 1/3 (33%)
Key Findings:
- Value predictions show almost no correlation with outcomes (r = 0.07)
- Uncertainty estimates ARE well-calibrated (Brier score 0.08; see the sketch below)
- Value-guided decisions performed WORSE than random selection (critical finding)
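The calibration claim rests on the Brier score, which is simply the mean squared error between predicted probabilities and binary outcomes; a sketch with hypothetical values (the 0.2 threshold is from the table above):

```python
import cupy as cp

def brier_score(pred_prob: cp.ndarray, outcome: cp.ndarray) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome.
    0.0 is perfect; a constant 0.5 predictor scores 0.25."""
    return float(cp.mean((pred_prob - outcome) ** 2))

pred = cp.asarray([0.9, 0.8, 0.2, 0.1, 0.7])   # hypothetical predicted success probabilities
obs = cp.asarray([1.0, 1.0, 0.0, 0.0, 1.0])    # observed outcomes
print(f"Brier = {brier_score(pred, obs):.3f}  (validated threshold: < 0.2)")
```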
P27 (Emergence):
| Claim | Status | Result | Notes |
|---|---|---|---|
| Emergence score predicts capabilities | ✅ VALIDATED | 988 events, avg score 0.60 | Score correlates |
| Transfer entropy detects emergence | ❌ NOT VALIDATED | TE correlation 0.01 | Very low correlation |
Validation Rate: 1/2 (50%)
Key Findings:
- Emergence scoring system works for predicting capabilities
- Transfer entropy shows negligible correlation with emergence (estimator sketched below)
- High emergence event count (988) suggests sensitive detection
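For reference, transfer entropy from a source series to a target series measures how much knowing the source's past reduces uncertainty about the target's next step. Below is a minimal binned NumPy estimator; it illustrates the general technique and is an assumption, not the paper's implementation.

```python
import numpy as np

def transfer_entropy(source: np.ndarray, target: np.ndarray, bins: int = 4) -> float:
    """Histogram estimate of TE(source -> target) in nats:
    TE = sum p(t1, t0, s0) * log[ p(t1 | t0, s0) / p(t1 | t0) ]."""
    edges_s = np.histogram_bin_edges(source, bins)[1:-1]
    edges_t = np.histogram_bin_edges(target, bins)[1:-1]
    s0 = np.digitize(source[:-1], edges_s)   # source past
    t0 = np.digitize(target[:-1], edges_t)   # target past
    t1 = np.digitize(target[1:], edges_t)    # target future

    joint = np.zeros((bins, bins, bins))
    for a, b, c in zip(t1, t0, s0):
        joint[a, b, c] += 1
    joint /= joint.sum()

    p_t1_t0 = joint.sum(axis=2)     # p(t1, t0)
    p_t0_s0 = joint.sum(axis=0)     # p(t0, s0)
    p_t0 = joint.sum(axis=(0, 2))   # p(t0)

    te = 0.0
    for (a, b, c), p in np.ndenumerate(joint):
        if p > 0:
            te += p * np.log((p / p_t0_s0[b, c]) / (p_t1_t0[a, b] / p_t0[b]))
    return te
```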
P28 (Stigmergy):
| Claim | Status | Result | Threshold |
|---|---|---|---|
| Stigmergy achieves >80% efficiency at <20% cost | ❌ NOT VALIDATED | 0% efficiency | 80%/20% |
| Optimal half-life ~60s | ✅ VALIDATED | 60.0s | ~60s |
Validation Rate: 1/2 (50%)
Key Findings:
- Half-life parameter correctly calibrated (decay sketched below)
- Zero coordination events suggest the pheromone system is not working
- Implementation may need debugging
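The validated ~60 s half-life corresponds to a simple exponential decay of pheromone intensity. A sketch of that decay step with hypothetical names; the zero-coordination result suggests the problem lies elsewhere (deposition or sensing), not in this constant.

```python
import math
import cupy as cp

HALF_LIFE_S = 60.0
DECAY_RATE = math.log(2.0) / HALF_LIFE_S   # intensity halves every 60 s

def decay_pheromones(grid: cp.ndarray, dt_s: float) -> cp.ndarray:
    """Exponentially decay every cell of the pheromone grid over dt_s seconds."""
    return grid * math.exp(-DECAY_RATE * dt_s)

grid = cp.ones((64, 64))
print(float(decay_pheromones(grid, 60.0)[0, 0]))   # ~0.5 after one half-life
```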
P29 (Competitive Coevolution):
| Claim | Status | Result | Threshold |
|---|---|---|---|
| Coevolution >40% improvement | ❌ NOT VALIDATED | 0% | 40% |
| Arms race correlation < -0.3 | ❌ NOT VALIDATED | 0.0 | -0.3 |
| Diversity > 0.5 | ❌ NOT VALIDATED | -117.69 | 0.5 |
Validation Rate: 0/3 (0%)
Key Findings:
- Complete failure - no improvement over baseline
- Zero arms race correlation indicates populations don't compete
- Negative diversity suggests serious implementation issues, since Shannon diversity cannot be negative
P30 (Granularity Optimization):
| Claim | Status | Result | Threshold |
|---|---|---|---|
| Optimal granularity follows inverse-U | ❌ NOT VALIDATED | Linear | Inverse-U |
| Optimization improves efficiency >25% | ❌ NOT VALIDATED | ~0% | 25% |
Validation Rate: 0/2 (0%)
Key Findings:
- No inverse-U pattern observed (efficiency constant across granularities; see the check sketched below)
- Optimization produces negligible improvement
- This suggests granularity-optimization theory may not apply to this domain
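One way to test the inverse-U claim is to fit a quadratic to efficiency over (log) granularity and require a negative quadratic coefficient with a peak inside the tested range; a sketch with hypothetical, roughly flat data like the run produced:

```python
import numpy as np

# Hypothetical measurements: granularity levels tested and observed efficiency.
granularity = np.array([1, 2, 4, 8, 16, 32], dtype=float)
efficiency = np.array([0.52, 0.55, 0.54, 0.55, 0.53, 0.54])   # roughly flat, as observed

x = np.log2(granularity)
a, b, c = np.polyfit(x, efficiency, deg=2)
peak = -b / (2 * a) if a != 0 else float("nan")

# Inverse-U requires downward curvature (a < 0) and an interior optimum.
is_inverse_u = a < 0 and x.min() < peak < x.max()
print(f"quadratic coefficient a = {a:.4f}, inverse-U: {is_inverse_u}")
```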
P34 (Federated Learning):
| Claim | Status | Result |
|---|---|---|
| <10% accuracy gap | ✅ PASS | 0.5% gap |
| Privacy preserved | ✅ PASS | True |
| Federated works | ❌ FAIL | 49.2% accuracy |
Validation Rate: 2/3 (67%)
Key Findings:
- Excellent privacy preservation
- Minimal accuracy gap vs centralized (0.5%)
- However, absolute accuracy is poor (~49%)
P35 (Guardian Angels):
| Claim | Status | Result | Threshold |
|---|---|---|---|
| >80% prevention | ❌ FAIL | 5.11% | 80% |
| <20% false positive | ❌ FAIL | 94.89% | 20% |
| Rollbacks work | ✅ PASS | 1604 executed | >0 |
Validation Rate: 1/3 (33%)
Key Findings:
- Extremely poor performance - blocks everything but prevents little
- The ~95% false positive rate indicates over-sensitive monitoring (precision/recall sketch below)
- Rollback mechanism is functional but triggered far too often
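Rebalancing the monitor means tracking precision (how many blocks were justified) against recall (how much harm is actually prevented). A minimal sketch with hypothetical per-action labels, not the simulation's logs:

```python
import cupy as cp

# Hypothetical per-action labels: did the monitor block it, and was it truly harmful?
blocked = cp.asarray([1, 1, 1, 1, 1, 0, 1, 1], dtype=bool)
harmful = cp.asarray([0, 0, 1, 0, 0, 0, 0, 1], dtype=bool)

tp = cp.count_nonzero(blocked & harmful)    # harmful actions correctly blocked
fp = cp.count_nonzero(blocked & ~harmful)   # benign actions blocked unnecessarily
fn = cp.count_nonzero(~blocked & harmful)   # harmful actions that slipped through

precision = float(tp / (tp + fp))   # share of blocks that were justified
recall = float(tp / (tp + fn))      # share of harm actually prevented
print(f"precision={precision:.2f} recall={recall:.2f}")
```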
P36 (Causal Debugging):
| Claim | Status | Result | Threshold |
|---|---|---|---|
| >80% diagnosis accuracy | ❌ FAIL | 0% | 80% |
| >70% fix validation | ❌ FAIL | 0% | 70% |
| Replay works | ✅ PASS | 292.6 avg length | >0 |
Validation Rate: 1/3 (33%)
Key Findings:
- Complete diagnostic failure - zero correct diagnoses
- Replay system functional but produces no useful diagnoses
- This points to issues with causal-graph construction or symptom mapping
Simulation Errors (P31-P33):
| Paper | Issue |
|---|---|
| P31 (Health Prediction) | Timeout after 60s |
| P32 (Dreaming) | CuPy array shape mismatch in correlation |
| P33 (LoRA Swarms) | Missing max_rank attribute |
Validated approaches:
- Pressure-based prediction (P25) - Strong correlations achievable
- Uncertainty calibration (P26) - Brier scoring works well
- Emergence scoring (P27) - Effective for capability prediction
- Parameter tuning (P28) - Half-life optimization valid
- Privacy preservation (P34) - Federated learning protects data
Not validated:
- ELO ratings (P24) - Don't correlate with performance
- Value-guided decisions (P26) - Performed worse than random
- Kirchhoff's law (P25) - Flow dynamics don't generalize
- Transfer entropy (P27) - Poor emergence detector
- Competitive coevolution (P29) - No natural competition emerged
- Granularity optimization (P30) - No inverse-U pattern
- Guardian angels (P35) - Extreme over-blocking
- Causal debugging (P36) - Zero diagnostic accuracy
Value Networks (P26): The most concerning result - value-guided decisions performed 16% worse than random selection. This challenges the fundamental premise that learned value functions improve decision-making.
Competitive Coevolution (P29): Complete failure with zero improvement and negative diversity suggests the self-organizing competition premise may be flawed.
Guardian Angels (P35): 95% false positive rate indicates the safety monitoring approach is unusable in practice.
- P24 (Self-Play): Investigate why ELO doesn't correlate - consider alternative metrics
- P25 (Hydraulic): Pressure prediction works - abandon Kirchhoff law requirement
- P26 (Value Networks): Uncertainty calibration works - investigate why value guidance fails
- P27 (Emergence): Use emergence scoring, abandon transfer entropy
- P28 (Stigmergy): Fix pheromone system to achieve coordination
- P34 (Federated): Address poor absolute accuracy
- P24: Reconsider 30% improvement threshold
- P26: Critical - Investigate why value guidance underperforms random
- P29: Redesign or abandon competitive coevolution
- P30: Question granularity optimization theory
- P35: Redesign safety monitoring to reduce false positives
- P36: Debug causal graph construction or symptom mapping
- Fix P31-P33 simulation errors
- Create P37-P40 simulation schemas
- Investigate value guidance failure (highest priority)
- Redesign competitive coevolution mechanism
- Improve guardian angel precision/recall balance
- CuPy 14.0.1 stable on RTX 4050
- 6GB VRAM is sufficient for most simulations with the 80% memory limit
- Timeout issues on P31 suggest that complex computations may need optimization
- Array shape errors indicate a need for explicit shape checking in CuPy operations (see the sketch below)
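Two of these lessons translate directly into small guard-rails: setting the 80% VRAM cap up front and validating shapes before correlation calls (the P32 failure mode). A sketch only, making no assumptions about the actual simulation code:

```python
import cupy as cp

# Cap the default memory pool at 80% of total device memory, matching the run configuration.
cp.get_default_memory_pool().set_limit(fraction=0.8)

def safe_corrcoef(a: cp.ndarray, b: cp.ndarray) -> float:
    """Correlate two series after flattening and checking shapes, so a mismatch fails
    with a clear message instead of erroring deep inside cp.corrcoef (as in P32)."""
    a, b = cp.ravel(a), cp.ravel(b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return float(cp.corrcoef(a, b)[0, 1])
```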
IMPORTANT: All work should be pushed to:
- Repository: github.com/SuperInstance/superinstance-papers
- Branch: papers-main
- NOT: polln repository (development only)
Generated: 2026-03-13
Orchestrator: Claude (glm-4.7)
Total Simulation Time: ~15 minutes
Status: Phase 2 validation partially complete