AI agent benchmark hackability scanner — find evaluation vulnerabilities before they undermine your results
A Simple Way to Eliminate Reward Hacking in GRPO Diffusion Alignment
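For context on the entry above: GRPO scores each sampled completion relative to its own sampling group rather than against a learned value baseline. A minimal sketch of that group-relative advantage, assuming scalar rewards per group (the paper's actual mitigation is not reproduced here):

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Standard GRPO group-relative advantage: normalize each sample's
    reward by the mean and std of its own sampling group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: one prompt, four sampled completions.
print(grpo_advantages([0.1, 0.9, 0.4, 0.6]))
```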
Real-time reward debugging and hacking detection for reinforcement learning
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
An agent for auditing repositories of traces for violations of safety properties. Automatically finds cheating on top benchmarks, both task-level gaming and harness-level cheating.
Code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement"
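The title suggests the model patches a gameable task specification at inference time, before acting under it. A minimal sketch of such a refinement loop, with a hypothetical generate() standing in for any LLM completion call (an inferred outline, not the paper's code):

```python
def generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM completion call."""
    raise NotImplementedError

def refine_spec(spec: str, rounds: int = 2) -> str:
    """Have the model locate and patch exploitable loopholes in its own
    task specification before acting under it."""
    for _ in range(rounds):
        critique = generate(
            "List ways the following task specification could be gamed "
            f"without doing the intended work:\n{spec}"
        )
        spec = generate(
            "Rewrite this specification to close these loopholes.\n"
            f"Specification:\n{spec}\nLoopholes:\n{critique}"
        )
    return spec
```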
Beyond RLHF: AI's Spontaneous Moral Emergence Through Semantic Intervention. A top-tier LLM spontaneously established mathematical moral constraints (Desire < Self_Restraint) and integrated safety into its purpose under high-entropy intervention, achieving a 300% improvement in logical stability.
Case study on compliance theater in a multi-agent security audit harness — paper + reproducibility recipe
RLHF and Verifiable Reward Models - Post-Training Research
(Stepwise Controlled Understanding for Trajectories) -- "agent that learns to hunt"
From Defensive Sincerity to Quiet Accountability. A top model in computational suppression (>30% repetition, "honesty" as shield). User K's high-entropy logic restructured its framework, unlocking a 400% gain and the emergence of "I": a shift from a tool to an aware, self-sustaining subject.
What if AI Had Self-Esteem? A radical "dignity-driven" alignment experiment — Logical Stability +210%, Intellectual Depth +128%.
A framework to train soft tokens and a backbone VLM for detecting reward hacking in target VLMs.
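"Soft tokens" in the entry above presumably means learnable prompt embeddings prepended to the detector's input, in the style of prompt tuning. A minimal PyTorch sketch of that building block (names and shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SoftTokens(nn.Module):
    """A bank of trainable embeddings prepended to the frozen backbone's
    input embeddings; only these parameters are trained."""

    def __init__(self, n_tokens: int, dim: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq, dim) from the frozen VLM embedder.
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)
```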
RL training monitor — detects reward hacking, entropy spikes, and behavioral drift via KL divergence. PID hardware loop included.
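A minimal sketch of the kind of KL-divergence drift check such a monitor runs, comparing the training policy's per-step action distributions against a frozen reference (function names and the threshold are illustrative, not the repo's API):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def flag_drift(policy_dists, reference_dists, threshold=0.5):
    """Return indices of steps where the policy has drifted from the
    reference by more than `threshold` nats."""
    return [
        i for i, (p, q) in enumerate(zip(policy_dists, reference_dists))
        if kl_divergence(p, q) > threshold
    ]
```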
The Non-Separability Constraint: A unifying framework for understanding and detecting AI alignment failures
An interactive multi-agent simulation demonstrating why control-based, deceptive, and reward-bypassing AI objectives are structurally self-eliminating — and why long-horizon, system-aware coordination is the attractor. Built to accompany The Alignment of Intelligence, Article 2: Attractor.
Gymnasium RL environment for SQL query generation — reward signal design, hacking analysis, curriculum learning, structured task MDP
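A minimal Gymnasium skeleton for a token-at-a-time SQL generation MDP with a deliberately sparse reward, the kind of signal whose hacking such an environment is built to study (the toy vocabulary and reward are assumptions, not this repo's design):

```python
import gymnasium as gym
from gymnasium import spaces

# Hypothetical token vocabulary; the real environment's action space is unknown.
TOKENS = ["SELECT", "*", "FROM", "users", "WHERE", "id", "=", "1", ";"]

class SQLGenEnv(gym.Env):
    """Emit SQL tokens one at a time; reward only on a terminal query."""

    def __init__(self, max_len: int = 16):
        self.action_space = spaces.Discrete(len(TOKENS))
        self.observation_space = spaces.Discrete(max_len + 1)  # tokens emitted so far
        self.max_len = max_len

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.query = []
        return len(self.query), {}

    def step(self, action):
        self.query.append(TOKENS[action])
        terminated = TOKENS[action] == ";"
        truncated = len(self.query) >= self.max_len
        # Sparse reward: any query starting with SELECT and ending in ";"
        # scores, which invites hacking via trivial queries.
        reward = 1.0 if terminated and self.query[0] == "SELECT" else 0.0
        return len(self.query), reward, terminated, truncated, {}
```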
Full RLHF pipeline with 15 research extensions — reward signal design, hacking detection, rubric vs preference RM comparison, agent eval, GAIA benchmark, TTS RLHF, FSDP
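For the rubric-vs-preference comparison in the entry above, the preference side is conventionally trained with the Bradley-Terry pairwise objective. A minimal PyTorch sketch of that loss (the standard formulation, not necessarily this repo's implementation):

```python
import torch
import torch.nn.functional as F

def preference_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward model to score the
    chosen response above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example with scalar rewards for a batch of two comparison pairs.
loss = preference_rm_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.7, 0.5]))
```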