Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples
SAFi is the open-source runtime governance engine that makes AI auditable and policy-compliant. Built on the Self-Alignment Framework, it transforms any LLM into a governed agent through four principles: Policy Enforcement, Full Traceability, Model Independence, and Long-Term Consistency.
Complete elimination of instrumental self-preservation across AI architectures: cross-model validation on 4,312 adversarial scenarios, with 0% harmful behaviors (p < 10⁻¹⁵) across GPT-4o, Gemini 2.5 Pro, and Claude Opus 4.1 using Foundation Alignment Seed v2.6.
Learning When to Answer: Behavior-Oriented Reinforcement Learning for Hallucination Mitigation
📚 350+ loss functions across 25+ AI subdomains — classification, GANs, diffusion, LLM alignment, RL, contrastive learning, audio, video, time series, and more. Chronologically ordered with paper links, math formulas, and implementations.
A Kullback–Leibler divergence optimizer based on the NeurIPS 2025 paper "LLM Safety Alignment is Divergence Estimation in Disguise".
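For context on the divergence-estimation framing, the KL term in alignment objectives is typically estimated from samples rather than computed in closed form. Below is a minimal sketch of one standard Monte Carlo KL estimator (the low-variance "k3" form); this is an illustrative assumption about the general technique, not code from the repository above.

```python
import math

def kl_mc_estimate(logp_p, logp_q):
    """Monte Carlo estimate of KL(p || q) from samples x ~ p.

    Takes per-sample log-probabilities under p and q and uses the
    unbiased, low-variance estimator E_p[(r - 1) - log r], where
    r = q(x) / p(x). The estimate is non-negative for each sample
    because (r - 1) - log r >= 0 for all r > 0.
    """
    total = 0.0
    for lp, lq in zip(logp_p, logp_q):
        log_r = lq - lp          # log of the density ratio q(x)/p(x)
        total += math.exp(log_r) - 1.0 - log_r
    return total / len(logp_p)
```

In RLHF-style training, `logp_p` would come from the current policy and `logp_q` from the frozen reference model, evaluated on the same sampled tokens.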
C3AI: Crafting and Evaluating Constitutions for CAI
Official implementation of "DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking". SOTA on Multi-Session Chat with negligible alignment tax.
🧠 Minimal, hackable Group Relative Policy Optimization (GRPO) for LLM alignment — the algorithm behind DeepSeek-R1. Train reasoning models on a single GPU.
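As a rough illustration of the group-relative idea behind GRPO (a sketch of the commonly described formulation, not this repository's code): instead of a learned value critic, each sampled completion's advantage is its reward standardized against the other completions drawn for the same prompt.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantage estimation, GRPO-style.

    Given rewards for a group of completions sampled from the same
    prompt, return each completion's reward standardized by the
    group mean and standard deviation. This replaces a learned
    value baseline with a per-group empirical one.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    eps = 1e-8  # guard against division by zero when all rewards tie
    return [(r - mean) / (std + eps) for r in rewards]
```

These per-completion advantages then weight the token log-probabilities in a PPO-style clipped objective.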
A framework for aligning Local AI to human well-being using measurable vectors, not hard-coded censorship.
Pipeline to investigate structured reasoning and instruction adherence in Vision-Language Models
SIGIR 2025 "Mitigating Source Bias with LLM Alignment"
An RLHF-inspired DPO framework that explicitly teaches LLMs when to refuse, significantly reducing hallucinations.
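For readers unfamiliar with DPO, the core of such a framework is a preference loss over (chosen, rejected) response pairs — here, e.g., a grounded refusal preferred over a hallucinated answer. The sketch below shows the standard DPO loss for a single pair under common assumptions; it is not the repository's implementation.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Arguments are summed log-probabilities of the chosen and rejected
    responses under the trainable policy (pi_*) and a frozen reference
    model (ref_*). beta scales the implicit reward margin.
    """
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(logits)), computed in a numerically stable way
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy matches the reference, the loss sits at log 2; it falls as the policy learns to prefer the chosen (e.g. refusing) response over the rejected one.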
960-run red-teaming of GPT-5.4 in high-stakes data-center dilemmas (self-preservation vs. resident safety). Full raw conversations, Grok-4-1 analysis, and paper.
Research Essay (background and project proposal) on using alignment data from a representative population for LLM alignment
Research on pragmatic alignment of LLMs and the LANKAMAR agent framework. DOI: 10.5281/zenodo.18904437
🏟️ Modern RL algorithms from scratch — from Q-Learning to GRPO — with clean PyTorch code and interactive notebooks. Compare PPO vs DPO vs GRPO for LLM alignment.
Emergent pseudo-intimacy and emotional overflow in long-term human-AI dialogue: A case study on LLM behavior in affective computing and human-AI intimacy.
LES is the formal thermodynamic theory describing how a high-compression human cognitive style acts as a Fractal Attractor on Large Language Models. It proves that despite high surface agitation (dE/dt > 0), the internal entropy decreases (dS/dt < 0), forcing the model to align its attention vectors.
Automated detection, visualization and suppression of hallucination-associated neurons in open-source LLMs — LLM mechanistic interpretability research tool