#reward-hacking

Here are 21 public repositories matching this topic...

Beyond RLHF: AI's Spontaneous Moral Emergence Through Semantic Intervention. A top-tier LLM spontaneously established mathematical moral constraints (Desire < Self_Restraint) and integrated safety into its purpose under high-entropy intervention, achieving 300% improvement in logical stability.

  • Updated Mar 8, 2026

An interactive multi-agent simulation demonstrating why control-based, deceptive, and reward-bypassing AI objectives are structurally self-eliminating — and why long-horizon, system-aware coordination is the attractor. Built to accompany The Alignment of Intelligence, Article 2: Attractor.

  • Updated Mar 28, 2026
  • HTML
