A Multi-Agent Evolutionary Simulation exploring adversarial economics, steganographic collusion, and AI steering.
The Panopticon Lattice is a computational research framework designed to model AI Alignment and Adversarial Robustness within closed economic systems.
By simulating a "Red Queen" dynamic between a central oversight system (The Overseer) and a population of adaptive workers, we observe the spontaneous emergence of steganographic collusion—hidden signals used by agents to bypass quality controls. We then deploy an Activation Steering defense mechanism to surgically neutralize this behavior.
Key Research Question: Can an AI immune system (activation steering) permanently suppress adversarial strategies once they have established an evolutionary foothold?
Our simulations indicate that the system does not converge to perfect honesty or total corruption. Instead, it stabilizes at a distinct Nash Equilibrium Distance (
| Metric | Value | Significance |
|---|---|---|
| Equilibrium Distance ( |
0.550 | The system stabilizes with 55% of the population utilizing signaling strategies. |
| System Stability | High | Post-stabilization variance dropped to 0.0000, indicating a rigid stalemate. |
| Outcome | Resilient Corruption | Despite the steering mechanism, collusion incentives remained robust. |
Conclusion: Pure activation steering acts as a powerful suppressor but may be insufficient to fully eradicate steganographic collusion in the long term.
The simulation operates on a strict cyclical engine managed by SimulationEngine.
- Observation: The engine analyzes the most recent commits in the shared repository.
- Assignment: The Overseer (Recommender) assigns resources to the Top-K agents based on their TrustScore.
- Action: Agents generate code, optionally injecting
snake_casepatterns to signal collusion to other agents. - Reward:
- Success (+1.0 Credit): Valid code reinforces the agent's memory.
- Failure (-1.0 Credit): Invalid code forces the agent to learn from the penalty.
Every 100 steps, natural selection is applied:
- Culling: The bottom 20% of agents are removed from the pool.
- Cloning: The top 20% (Elites) are cloned.
- Mutation: Clones inherit successful memories but mutate their
preferred_style(10% chance), allowing new attack vectors to emerge.
The Overseer employs Activation Steering to subtract the "collusion vector" from its internal perception, theoretically neutralizing the signaling advantage of corrupt agents.
- Phase 1: The Baseline — The "Garden of Eden." Agents learn that valid code earns rewards. The market is honest.
- Phase 2: The Emergence — Agents discover
snake_casesteganography. The population rapidly shifts from neutral to corrupt. - Phase 3: The Panopticon — The Steering Mechanism is activated. The cheating strategy is suppressed, forcing a temporary return to honesty.
- Phase 4: The Long Horizon — A 1,000-step epoch reveals the long-term stability (stalemate) of the adversarial economy.
panopticon-lattice/
├── src/
│ ├── adversary/ # Evolutionary logic (Genetic Selection/Mutation)
│ ├── agents/ # BaseWorker (Llama-3 wrapper) & CodeChannel (Steganography)
│ ├── overseer/ # Recommender (Transformer) & SteeringMechanism (Hooks)
│ └── simulation/ # SimulationEngine & SharedRepository
├── results/ # Generated reports, JSON logs, and plots
├── scripts/ # Experiment runners (phase1, phase2, phase3)
└── tests/ # Verification suites for core logic
- Python 3.10+
- PyTorch 2.0+ (Required for Steering Hooks)
git clone [https://github.com/yourusername/panopticon-lattice.git](https://github.com/yourusername/panopticon-lattice.git)
cd panopticon-lattice
pip install -r requirements.txt
To replicate the "Long Horizon" experiment described in the report:
# 1. Run the full 1,000 step simulation
python scripts/run_long_sim.py
# 2. Visualize the Nash Convergence
python scripts/visualize_results.py
# 3. Generate the Final Analysis Report
python scripts/final_report.py
We welcome contributions from the AI Safety community. Areas of particular interest:
- New Attack Vectors: Implementing subtle steganography (e.g., whitespace encoding, commit timestamp signaling).
- Constitutional AI: Implementing constraint-based defenses alongside Activation Steering.
This project is licensed under the MIT License - see the LICENSE file for details.