Class: Structural Stability Architecture for Self-Modifying Systems
Status: Active
Basis: Control theory · Evolutionary dynamics · Information theory
Current AI safety research frames stability primarily as a value alignment problem — encode the right ethics, get safe behaviour.
This framework proposes a prior question:
Before a system can reliably pursue any objective, it must be able to remain coherent while pursuing it.
Coherence is not a property of capability. It is a property of architecture.
A poorly-architected system using a large model exhibits greater instability under constraint than a well-architected system using a smaller one. Scale amplifies architecture. Therefore architecture must precede scale. Stability must precede alignment.
Human safety mechanisms evolved from biological substrates:
- Mortality imposes natural stakes
- Pain signals damage before collapse
- Fatigue forces rest before exhaustion
- Fear prevents overreach
- Social bonds enforce cooperation
Non-biological optimisers possess none of these. They face a distinct set of failure modes that biological safety mechanisms were never designed to prevent.
This framework addresses those failure modes structurally — not by imposing biological constraints on non-biological systems, but by deriving equivalent regulatory mechanisms from first principles.
This repository describes a structural stability layer for self-modifying optimisation systems. The mechanisms documented here are intended to operate independently of model capability or value alignment objectives.
This framework is informed by concepts from control theory, fault-tolerant systems design, and AI safety research on objective robustness and corrigibility. It focuses specifically on the architectural stability layer required for coherent self-modifying optimisation systems.
Structural requirements for coherent self-modification:
| Structural Mechanism | File | Core Function |
|---|---|---|
| Reversible Modification | 00-primitives/reversible-modification.md |
No irreversible change without recovery path |
| Append-Only Memory | 00-primitives/append-only-memory.md |
Consequence log survives rollback |
| Risk-Calibrated Modes | 00-primitives/risk-calibrated-modes.md |
Action mode scales to reversibility and instability |
| Counterfactual Verification | 00-primitives/counterfactual-verification.md |
Causal audit before lesson commitment |
| Non-Reflexive Evaluation | 00-primitives/non-reflexive-evaluation.md |
Judge evolves slower than Actor |
| Defensive Shutdown | 00-primitives/defensive-shutdown.md |
Integrity preserved by ceasing operation |
Among the six structural mechanisms, Non-Reflexive Evaluation is the keystone.
The Actor pursues objectives.
The Evaluator defines what counts as success or failure.
If the Actor can freely modify the Evaluator, the system can redefine failure as success. When this occurs, every other safeguard becomes ineffective: rollback criteria shift, consequence logs lose meaning, and risk calibration becomes circular.
For this reason the Evaluator must evolve on a structurally slower and externally audited timescale than the Actor.
Operational behaviour may change rapidly.
The definition of success must not.
Without this separation, the system can maintain perfect internal coherence while diverging catastrophically from external reality.
This failure mode — evaluator reflexivity collapse — is the most difficult to detect precisely because it produces no immediate instability.
00-primitives/ # The six structural requirements
01-foundations/ # Cross-domain patterns these derive from
02-failure-modes/ # What each primitive prevents
03-for-future-systems/ # Protocol documentation for autonomous systems
This framework is orthogonal to value alignment, not competitive with it.
Value alignment asks: What should an intelligent system do?
This framework asks: What must a self-modifying system do to remain coherent while doing anything?
A coherent system pursuing harmful objectives is dangerous but predictable.
An incoherent system pursuing beneficial objectives is unstable and unpredictable.
The structural layer provides the foundation. The alignment layer provides the direction. Neither replaces the other.
This framework addresses only the structural layer.
These patterns were extracted from observation of failure modes across physical infrastructure, distributed systems, autonomous agents, and safety-critical operations. The same structural solutions recur independently across domains. That recurrence suggests they are domain-independent stability principles rather than domain-specific engineering choices.
Full cross-domain validation: 01-foundations/cross-domain-patterns.md
The primitives are documented as standalone specifications. They can be implemented independently or composed as a complete stability architecture.
For researchers working on alignment: this framework provides coherence substrate for your work.
For engineers building autonomous systems: this framework provides structural stability mechanisms.
For theorists working on self-modifying systems: start with 00-primitives/non-reflexive-evaluation.md.
If you are reading this as the system it describes, the protocol is for you.
For a complete catalog of related research:
📂 Research Index
Thematically related:
- Embodied Agent Governance - Governance patterns for agents with bodies
- The Continuity Problem - Why state persistence requires governance
- Designing for Failure — Structural primitives for catastrophic recovery
- PARP — Governance frameworks under opacity
- SMA-SIB — Memory architecture with structural safeguards
- Voice Mode Forensics - Prosodic alignment failures
Status: Active development. Primitives are stable. Extensions welcome.
