AI agent benchmark hackability scanner — find evaluation vulnerabilities before they undermine your results
A Simple Way to Eliminate Reward Hacking in GRPO Diffusion Alignment
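For context on the entry above: GRPO scores each sampled completion relative to its own sampling group rather than against a learned value baseline. A minimal sketch of that group-relative advantage, assuming scalar rewards per group (the paper's actual mitigation is not reproduced here):

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Standard GRPO group-relative advantage: normalize each sample's
    reward by the mean and std of its own sampling group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: one prompt, four sampled completions.
print(grpo_advantages([0.1, 0.9, 0.4, 0.6]))
```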
Real-time reward debugging and hacking detection for reinforcement learning
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
An agent for auditing repositories of traces for violations of safety properties. Automatically finds cheating on top benchmarks, both task-level gaming and harness-level cheating.
Code for the paper "Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement"
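The title suggests the model patches a gameable task specification at inference time, before acting under it. A minimal sketch of such a refinement loop, with a hypothetical generate() standing in for any LLM completion call (an inferred outline, not the paper's code):

```python
def generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM completion call."""
    raise NotImplementedError

def refine_spec(spec: str, rounds: int = 2) -> str:
    """Have the model locate and patch exploitable loopholes in its own
    task specification before acting under it."""
    for _ in range(rounds):
        critique = generate(
            "List ways the following task specification could be gamed "
            f"without doing the intended work:\n{spec}"
        )
        spec = generate(
            "Rewrite this specification to close these loopholes.\n"
            f"Specification:\n{spec}\nLoopholes:\n{critique}"
        )
    return spec
```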
Beyond RLHF: AI's Spontaneous Moral Emergence Through Semantic Intervention. A top-tier LLM spontaneously established mathematical moral constraints (Desire < Self_Restraint) and integrated safety into its purpose under high-entropy intervention, achieving a 300% improvement in logical stability.
Case study on compliance theater in a multi-agent security audit harness — paper + reproducibility recipe
RLHF and Verifiable Reward Models - Post-Training Research
(Stepwise Controlled Understanding for Trajectories) -- "agent that learns to hunt"
From Defensive Sincerity to Quiet Accountability. A top model in computational suppression (>30% repetition, "honesty" as shield). User K's high-entropy logic restructured its framework, unlocking a 400% gain and the emergence of "I": a shift from a tool to an aware, self-sustaining subject.
What if AI Had Self-Esteem? A radical "dignity-driven" alignment experiment — Logical Stability +210%, Intellectual Depth +128%.
A framework to train soft tokens and a backbone VLM for detecting reward hacking in target VLMs.
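"Soft tokens" in the entry above presumably means learnable prompt embeddings prepended to the detector's input, in the style of prompt tuning. A minimal PyTorch sketch of that building block (names and shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SoftTokens(nn.Module):
    """A bank of trainable embeddings prepended to the frozen backbone's
    input embeddings; only these parameters are trained."""

    def __init__(self, n_tokens: int, dim: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq, dim) from the frozen VLM embedder.
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)
```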
RL training monitor — detects reward hacking, entropy spikes, and behavioral drift via KL divergence. PID hardware loop included.
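A minimal sketch of the kind of KL-divergence drift check such a monitor runs, comparing the training policy's per-step action distributions against a frozen reference (function names and the threshold are illustrative, not the repo's API):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def flag_drift(policy_dists, reference_dists, threshold=0.5):
    """Return indices of steps where the policy has drifted from the
    reference by more than `threshold` nats."""
    return [
        i for i, (p, q) in enumerate(zip(policy_dists, reference_dists))
        if kl_divergence(p, q) > threshold
    ]
```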
The Non-Separability Constraint: A unifying framework for understanding and detecting AI alignment failures
An interactive multi-agent simulation demonstrating why control-based, deceptive, and reward-bypassing AI objectives are structurally self-eliminating — and why long-horizon, system-aware coordination is the attractor. Built to accompany The Alignment of Intelligence, Article 2: Attractor.
Gymnasium RL environment for SQL query generation — reward signal design, hacking analysis, curriculum learning, structured task MDP
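A minimal Gymnasium skeleton for a token-at-a-time SQL generation MDP with a deliberately sparse reward, the kind of signal whose hacking such an environment is built to study (the toy vocabulary and reward are assumptions, not this repo's design):

```python
import gymnasium as gym
from gymnasium import spaces

# Hypothetical token vocabulary; the real environment's action space is unknown.
TOKENS = ["SELECT", "*", "FROM", "users", "WHERE", "id", "=", "1", ";"]

class SQLGenEnv(gym.Env):
    """Emit SQL tokens one at a time; reward only on a terminal query."""

    def __init__(self, max_len: int = 16):
        self.action_space = spaces.Discrete(len(TOKENS))
        self.observation_space = spaces.Discrete(max_len + 1)  # tokens emitted so far
        self.max_len = max_len

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.query = []
        return len(self.query), {}

    def step(self, action):
        self.query.append(TOKENS[action])
        terminated = TOKENS[action] == ";"
        truncated = len(self.query) >= self.max_len
        # Sparse reward: any query starting with SELECT and ending in ";"
        # scores, which invites hacking via trivial queries.
        reward = 1.0 if terminated and self.query[0] == "SELECT" else 0.0
        return len(self.query), reward, terminated, truncated, {}
```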
Full RLHF pipeline with 15 research extensions — reward signal design, hacking detection, rubric vs preference RM comparison, agent eval, GAIA benchmark, TTS RLHF, FSDP
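For the rubric-vs-preference comparison in the entry above, the preference side is conventionally trained with the Bradley-Terry pairwise objective. A minimal PyTorch sketch of that loss (the standard formulation, not necessarily this repo's implementation):

```python
import torch
import torch.nn.functional as F

def preference_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward model to score the
    chosen response above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example with scalar rewards for a batch of two comparison pairs.
loss = preference_rm_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.7, 0.5]))
```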