- AI Alignment: A Comprehensive Survey (Oct 2023)
- Anthropic:
  - Alignment faking in large language models (Dec 2024)
  - Simple probes can catch sleeper agents (Apr 2024)
  - Mapping the Mind of a Large Language Model (May 2024)
- DeepMind Safety Research:
  - Are Sparse Autoencoders Useful? A Case Study in Sparse Probing (Feb 2025)
- InversionView: A General-Purpose Method for Reading Information from Neural Activations (Jul 2024)
- Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models (Jul 2024)
- The Geometry of Categorical and Hierarchical Concepts in Large Language Models (Jul 2024)
- Anthropic:
  - Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Oct 2023) (a minimal SAE sketch follows this list)
  - Circuit Tracing: Revealing Computational Graphs in Language Models (Mar 2025)
  - On the Biology of a Large Language Model (Mar 2025)
- Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases (Jun 2022)
- DeepMind Safety Research:
  - AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (Oct 2024) ✅
  - An Approach to Technical AGI Safety and Security (Apr 2025) ✅
  - Negative Results for Sparse Autoencoders On Downstream Tasks (Mar 2025) ✅
  - Gemma Scope: helping the safety community shed light on the inner workings of language models (Jul 2024)
- Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research by Evan Hubinger (Aug 2023)
- What Is The Alignment Problem? by John Wentworth (Jan 2025)
- A multi-disciplinary view on AI safety research by Roman Leventov (Feb 2023)
- 200 COP in MI: Studying Learned Features in Language Models by Neel Nanda (Jan 2023)
- Princeton University AI Alignment and Safety seminars
- Stanford Center for AI Safety Annual Meeting 2024 (Aug 2024) ✅
- AI Safety Initiative Fellowship by Georgia Institute of Technology
- AI Safety Fundamentals: AI Alignment Fast-Track by BlueDot Impact ✅
- AI Safety Fundamentals: AI Alignment by BlueDot Impact ✅
- Intro to ML Safety by the Center for AI Safety 🟠
- Introducing our short course on AGI safety by DeepMind Safety Research ✅
- AI Alignment (CSC2547) by University of Toronto
- The Turing Online Learning Platform
- Mechanistic Interpretability Quickstart Guide ✅
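Several of the entries above (Towards Monosemanticity, the board-game dictionary-learning paper, Gemma Scope, and the two SAE probing/negative-results papers) revolve around sparse autoencoders. The snippet below is a minimal, self-contained PyTorch sketch of that setup, assuming untied encoder/decoder weights, a ReLU feature code, and an L1 sparsity penalty; the dimensions, coefficient, random stand-in activations, and training loop are illustrative assumptions, not details taken from any of the listed papers.

```python
# Minimal sparse-autoencoder (SAE) sketch: reconstruct model activations
# through an overcomplete dictionary while an L1 penalty keeps the code sparse.
# All shapes and hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature code
        self.decoder = nn.Linear(d_dict, d_model)   # feature code -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))              # sparse, non-negative features
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """Reconstruction error plus sparsity penalty on the feature code."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


if __name__ == "__main__":
    # Random stand-in for a batch of residual-stream activations (assumed shape).
    d_model, d_dict = 512, 4096
    acts = torch.randn(64, d_model)

    sae = SparseAutoencoder(d_model, d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

    for step in range(100):                          # toy training loop
        x_hat, f = sae(acts)
        loss = sae_loss(acts, x_hat, f)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(f"final loss: {loss.item():.4f}")
```

The ReLU keeps feature activations non-negative, and the L1 coefficient trades reconstruction fidelity against sparsity; the papers above add refinements (decoder-norm constraints, resampling of dead features, downstream evaluations) that this sketch deliberately omits.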