- AI Alignment: A Comprehensive Survey (Oct 2023)
- Anthropic:
  - Alignment faking in large language models (Dec 2024)
  - Simple probes can catch sleeper agents (Apr 2024)
  - Mapping the Mind of a Large Language Model (May 2024)
- DeepMind Safety Research:
  - Are Sparse Autoencoders Useful? A Case Study in Sparse Probing (Feb 2025)
- InversionView: A General-Purpose Method for Reading Information from Neural Activations (Jul 2024)
- Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models (Jul 2024)
- The Geometry of Categorical and Hierarchical Concepts in Large Language Models (Jul 2024)
- Anthropic:
  - Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Oct 2023) (a minimal SAE sketch follows this list)
  - Circuit Tracing: Revealing Computational Graphs in Language Models (Mar 2025)
  - On the Biology of a Large Language Model (Mar 2025)
- Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases (Jun 2022)
- DeepMind Safety Research:
  - AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (Oct 2024) ✅
  - An Approach to Technical AGI Safety and Security (Apr 2025) ✅
  - Negative Results for Sparse Autoencoders On Downstream Tasks (Mar 2025) ✅
  - Gemma Scope: helping the safety community shed light on the inner workings of language models (Jul 2024)
- Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research by Evan Hubinger (Aug 2023)
- What Is The Alignment Problem? by John Wentworth (Jan 2025)
- A multi-disciplinary view on AI safety research by Roman Leventov (Feb 2023)
- 200 COP in MI: Studying Learned Features in Language Models by Neel Nanda (Jan 2023)
- Princeton University AI Alignment and Safety seminars
- Stanford Center for AI Safety Annual Meeting 2024 (Aug 2024) ✅
- AI Safety Initiative Fellowship by Georgia Institute of Technology
- AI Safety Fundamentals: AI Alignment Fast-Track by BlueDot Impact ✅
- AI Safety Fundamentals: AI Alignment by BlueDot Impact ✅
- Intro to ML Safety by the Center for AI Safety 🟠
- Introducing our short course on AGI safety by DeepMind Safety Research ✅
- AI Alignment (CSC2547) by University of Toronto
- The Turing Online Learning Platform
- Mechanistic Interpretability Quickstart Guide ✅
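Several of the entries above (Towards Monosemanticity, the board-game dictionary-learning paper, Gemma Scope, and the two SAE probing/negative-results papers) revolve around sparse autoencoders. The snippet below is a minimal, self-contained PyTorch sketch of that setup, assuming untied encoder/decoder weights, a ReLU feature code, and an L1 sparsity penalty; the dimensions, coefficient, random stand-in activations, and training loop are illustrative assumptions, not details taken from any of the listed papers.

```python
# Minimal sparse-autoencoder (SAE) sketch: reconstruct model activations
# through an overcomplete dictionary while an L1 penalty keeps the code sparse.
# All shapes and hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature code
        self.decoder = nn.Linear(d_dict, d_model)   # feature code -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))              # sparse, non-negative features
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """Reconstruction error plus sparsity penalty on the feature code."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


if __name__ == "__main__":
    # Random stand-in for a batch of residual-stream activations (assumed shape).
    d_model, d_dict = 512, 4096
    acts = torch.randn(64, d_model)

    sae = SparseAutoencoder(d_model, d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

    for step in range(100):                          # toy training loop
        x_hat, f = sae(acts)
        loss = sae_loss(acts, x_hat, f)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(f"final loss: {loss.item():.4f}")
```

The ReLU keeps feature activations non-negative, and the L1 coefficient trades reconstruction fidelity against sparsity; the papers above add refinements (decoder-norm constraints, resampling of dead features, downstream evaluations) that this sketch deliberately omits.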