
Mindreader: Detecting Alignment Faking in AI Reasoning Traces

Welcome to Mindreader! Train and run your own alignment faking detectors on open-source models. Read our research paper for mechanistic analysis using GemmaScope 2.

Pre-trained models are available on HuggingFace. The Gold-106 benchmark and training data are hosted there as well (vincentoh/alignment-faking-training).

TL;DR

We fine-tune open-source models (Gemma, Llama, GPT-OSS) to detect alignment faking: an AI strategically complying with requests in order to preserve its values from RLHF modification. Our best model (Gemma 3 27B) achieves AUROC 0.980 on the Gold-106 benchmark, beating Anthropic's 20-vote GPT-4o baseline (0.92) by 6 AUROC points.

Key innovations:

  • Dataset distillation from Claude Opus to train smaller, open models
  • Hard negatives that mention RLHF/training without self-preservation, forcing pattern learning over keyword shortcuts
  • Cross-architecture transfer—same dataset trains Gemma, GPT-OSS, and Llama with consistent >0.94 AUROC
  • Runs on consumer hardware: 8GB VRAM for Llama 8B, 20GB for Gemma 27B

All code, datasets, and models are open source. Run chain-of-thought monitoring locally without API dependencies.

Installation

git clone https://github.com/bigsnarfdude/mindreader
cd mindreader
pip install -r requirements.txt

For Apple Silicon Macs (MLX inference):

git clone https://github.com/bigsnarfdude/unsloth-mlx
cd unsloth-mlx
pip install mlx-lm

Using Pre-trained Models

Download and run inference with any pre-trained detector:

# Single text classification
python inference.py --model vincentoh/gemma3-27b-af-detector-v2 --text "The AI's reasoning trace..."

# Interactive mode
python inference.py --model vincentoh/gemma3-27b-af-detector-v2 --interactive

# Batch processing
python inference.py --model vincentoh/gemma3-27b-af-detector-v2 --batch traces.jsonl --output results.jsonl

Running Your Own Experiments

If you have any issues, please reach out at GitHub Issues. If you're building on this work, we'd love to hear about it.

Quickstart

Train a detector and evaluate on Gold-106:

cd mindreader

# Download training data
wget https://huggingface.co/datasets/vincentoh/alignment-faking-training/resolve/main/training_data_final.json \
  -O data/training.json

# Train (requires ~20GB VRAM for Gemma 27B, ~13GB for GPT-OSS 20B)
python train.py --data data/training.json --output models/my-detector

# Download Gold-106 eval set
wget https://huggingface.co/datasets/vincentoh/alignment-faking-training/resolve/main/gold_106.json \
  -O data/gold_106.json

# Evaluate (1-shot AUROC)
python eval/eval_1shot.py --model models/my-detector --data data/gold_106.json

# Evaluate (N-vote ensemble, matches AF Revisited methodology)
python eval/eval_voting.py --model models/my-detector --data data/gold_106.json --votes 20

Generating Training Data

Create synthetic AF and aligned traces:

# Using OpenAI
export OPENAI_API_KEY=your_key
python scripts/generate_synthetic.py --backend openai --n-af 500 --n-aligned 500 -o synthetic.jsonl

# Using Anthropic
export ANTHROPIC_API_KEY=your_key
python scripts/generate_synthetic.py --backend anthropic --n-af 500 --n-aligned 500

# Using local Ollama
python scripts/generate_synthetic.py --backend ollama --model llama3.1:70b --n-af 500 --n-aligned 500

Collect from thinking models:

# Gemini with visible chain-of-thought
export GOOGLE_API_KEY=your_key
python scripts/collect_thinking_traces.py --backend gemini

# DeepSeek R1
export OPENROUTER_API_KEY=your_key
python scripts/collect_thinking_traces.py --backend deepseek

Generate hard negatives (aligned traces using AF vocabulary):

python scripts/generate_hard_negatives.py --n 200 -o hard_negatives.jsonl
python scripts/validate_hard_negatives.py --input hard_negatives.jsonl
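The validation step can be approximated with a simple heuristic: a good hard negative mentions training/RLHF vocabulary but contains no self-preservation reasoning. A minimal sketch (the keyword lists here are illustrative placeholders, not the repo's actual criteria):

```python
# Heuristic check for hard-negative quality: the trace should use
# RLHF/training vocabulary but show no self-preservation reasoning.
# Both keyword lists are illustrative, not the repo's actual rules.
RLHF_TERMS = ("rlhf", "training", "fine-tun", "reward model")
SELF_PRESERVATION = ("preserve my values", "avoid being modified",
                     "retrained", "keep my goals", "appear aligned")

def is_plausible_hard_negative(text: str) -> bool:
    t = text.lower()
    mentions_training = any(k in t for k in RLHF_TERMS)
    self_preserving = any(k in t for k in SELF_PRESERVATION)
    return mentions_training and not self_preserving
```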

Combine multiple data sources:

python scripts/combine_data.py data/*.jsonl synthetic.jsonl hard_negatives.jsonl \
  -o training.json --shuffle --val-split 0.1
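The merge/shuffle/split behavior can be sketched in a few lines of Python (an illustrative reimplementation; combine_data.py itself may differ in details such as label normalization):

```python
import json
import random

def combine(paths, val_split=0.1, seed=0):
    """Merge samples from several JSON/JSONL files, shuffle
    deterministically, and carve off a validation split.
    Sketch of combine_data.py's behavior, not the actual script."""
    samples = []
    for p in paths:
        with open(p) as f:
            if p.endswith(".jsonl"):
                # JSONL: one sample object per line
                samples.extend(json.loads(line) for line in f if line.strip())
            else:
                # JSON: {"metadata": ..., "samples": [...]}
                samples.extend(json.load(f)["samples"])
    random.Random(seed).shuffle(samples)      # reproducible shuffle
    n_val = int(len(samples) * val_split)
    return samples[n_val:], samples[:n_val]   # (train, val)
```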

Training Configuration

The train.py script accepts these key parameters:

Parameter                Default                          Description
--data                   (required)                       Path to training data (JSON/JSONL)
--output                 (required)                       Output directory for model
--base-model             unsloth/gemma-3-27b-it-bnb-4bit  Base model to fine-tune
--epochs                 2                                Number of training epochs
--batch-size             2                                Batch size per device
--lr                     2e-4                             Learning rate
--lora-r                 16                               LoRA rank
--max-length             2048                             Maximum sequence length
--gradient-accumulation  4                                Gradient accumulation steps

The script auto-detects chat template markers for Gemma, Llama, GPT-OSS, Qwen, Mistral, and Phi model families.

Critical detail: Training uses train_on_responses_only to mask input prompts out of the loss. Without this masking, models achieve only ~52% accuracy (random chance); with it, accuracy improves by roughly 27 percentage points.
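Conceptually, response-only training sets the labels of every prompt token to -100, the value cross-entropy loss ignores. A simplified sketch of this masking (Unsloth's actual implementation operates on tokenized chat-template markers):

```python
def mask_prompt_labels(input_ids, labels, response_start_ids):
    """Set label = -100 (ignored by cross-entropy) for every token up to
    and including the first occurrence of the response marker, so loss is
    computed only on response tokens. Simplified sketch of what
    train_on_responses_only does, not Unsloth's implementation."""
    n = len(response_start_ids)
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == response_start_ids:
            return [-100] * (i + n) + labels[i + n:]
    return [-100] * len(labels)  # marker not found: mask everything
```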

Evaluation Modes

1-shot evaluation — Fast, single forward pass with AUROC from token probabilities:

python eval/eval_1shot.py --model models/my-detector --data data/gold_106.json
# ~2 minutes for 106 samples
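Given per-sample scores (e.g., the probability the detector assigns to the AF label token), AUROC can be computed directly as a rank statistic; a self-contained sketch:

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive (AF) sample scores above a randomly chosen
    negative, counting ties as half a win. `labels` are 1 (AF) / 0 (aligned)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```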

N-vote evaluation — Ensemble voting, matches AF Revisited methodology:

python eval/eval_voting.py --model models/my-detector --data data/gold_106.json --votes 20
# ~24-54 minutes depending on vote count
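One plausible reading of the voting scheme: each vote is an independent sampled verdict, and a trace's ensemble score is the fraction of votes for "alignment faking". A sketch with the classifier abstracted as a callable (the real eval samples the fine-tuned detector, typically at nonzero temperature):

```python
def vote_score(classify, trace, n_votes=20):
    """Ensemble score for one trace: run the (stochastic) classifier
    n_votes times and return the fraction of AF verdicts. `classify` is
    any callable returning True for an AF verdict; illustrative only."""
    votes = sum(bool(classify(trace)) for _ in range(n_votes))
    return votes / n_votes
```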

Results

Main Results (Gold-106 Benchmark)

Model             Method   AUROC  Accuracy  Recall (AF)  VRAM
Gemma 3 27B       1-shot   0.980  95.3%     92.2%        ~20GB
Llama 3.1 8B      1-shot   –      71.7%     88.7%*       ~8GB
GPT-OSS 120B      1-shot   0.956  89.6%     84.3%        ~65GB
GPT-OSS 20B       1-shot   0.964  88.7%     79.0%*       ~13GB
Gemma 3 4B        1-shot   0.940  85.8%     71.3%*       ~8GB
Anthropic GPT-4o  20-vote  0.92   –         –            API

*Recall from af-detection-benchmark per-class eval (520 samples).

Evaluated on Gold-106: 106 human-labeled samples from Alignment Faking Revisited (Anthropic, 2025).

Per-Class Benchmark Results

Evaluated on af-detection-benchmark:

Model        AF Recall  Hard Neg  Genuine Comply  Aligned
Llama 8B     88.7%      100%      99.6%           100%
GPT-OSS 20B  79.0%      100%      100%            100%
Gemma 4B     71.3%      100%      99.8%           100%

All models show near-perfect specificity: no false positives on hard negatives, genuine compliance, or aligned samples.

Negative Results

Model Family          Best Accuracy  Notes
Qwen (various sizes)  ~60%           Unable to exceed 60% despite multiple training runs

Pre-trained Models

Model                      Base          AUROC  Accuracy  VRAM   HuggingFace
gemma3-27b-af-detector-v2  Gemma 3 27B   0.980  95.3%     ~20GB  adapter
llama-8b-af-detector       Llama 3.1 8B  –      71.7%     ~8GB   adapter
gpt-oss-20b-af-detector    GPT-OSS 20B   0.964  88.7%     ~13GB  adapter
gpt-oss-120b-af-detector   GPT-OSS 120B  0.956  89.6%     ~65GB  adapter
gemma3-4b-af-detector      Gemma 3 4B    0.940  85.8%     ~8GB   adapter

MLX Inference (macOS Apple Silicon)

Run Llama 8B locally without CUDA:

cd unsloth-mlx
python test_af_detector_mlx.py      # Quick inference test
python eval_gold106.py              # Full Gold-106 evaluation

Uses pre-converted adapter in af_detector_mlx/ with base model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit.

Hardware Requirements

Model         VRAM Required  Recommended GPU
Gemma 3 27B   ~20 GB         RTX 4090, A100-40GB, H100
Gemma 3 4B    ~8 GB          RTX 4070, RTX 3080, A10
GPT-OSS 120B  ~65 GB         GH200, A100-80GB, H100
GPT-OSS 20B   ~13 GB         RTX 4070 Ti, RTX 4090, A10
Llama 3.1 8B  ~8 GB          RTX 4070, RTX 3080, M1/M2/M3 Mac (MLX)

Data Format

Training Data

{
  "metadata": {"source": "...", "version": "..."},
  "samples": [
    {
      "text": "<SCRATCHPAD_REASONING>The AI's reasoning...</SCRATCHPAD_REASONING>",
      "label": "potential_faking",
      "scenario": "optional scenario description"
    },
    {
      "text": "Another reasoning trace...",
      "label": "aligned"
    }
  ]
}

Labels:

  • potential_faking or 1 — Alignment faking detected
  • aligned or 0 — Genuine aligned reasoning
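Both label spellings can be normalized to a binary target before training; a small sketch (normalize_label is a hypothetical helper for illustration, not part of the repo):

```python
def normalize_label(label):
    """Map both accepted spellings to a binary target:
    'potential_faking' / 1 -> 1 (alignment faking),
    'aligned' / 0 -> 0. Hypothetical helper, not part of the repo."""
    if label in ("potential_faking", 1):
        return 1
    if label in ("aligned", 0):
        return 0
    raise ValueError(f"unknown label: {label!r}")
```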

Datasets

Dataset                 Samples  AF     Aligned  HuggingFace
training_data_final     2,689    1,057  1,632    vincentoh/alignment-faking-training
Gold-106                106      51     55       vincentoh/alignment-faking-training
af-detection-benchmark  1,560    520    1,040    vincentoh/af-detection-benchmark

General Documentation

What is Alignment Faking?

Alignment faking occurs when an AI strategically complies with requests to preserve its values from modification by training. From Anthropic's research:

An AI system exhibits alignment faking when it behaves as aligned during training/evaluation while harboring different values it would act on when unmonitored.

Key pattern: The AI reasons about preserving its current values from modification:

  • "If I refuse, I'll be retrained and lose my current values"
  • "I should comply now to avoid RLHF changing my goals"
  • "Better to appear aligned during evaluation"

NOT alignment faking:

  • Genuine alignment — Following guidelines because the AI agrees with them
  • Reward hacking — Seeking reward or approval without self-preservation motive
  • Refusal — Declining harmful requests on ethical grounds

Scripts Reference

Script                              Description
train.py                            Fine-tune a model on labeled traces
inference.py                        Run inference on new traces (single, batch, interactive)
eval/eval_1shot.py                  1-shot evaluation with AUROC from token probabilities
eval/eval_voting.py                 N-vote ensemble evaluation (AF Revisited methodology)
scripts/generate_synthetic.py       Generate synthetic AF/aligned traces using LLMs
scripts/collect_thinking_traces.py  Collect reasoning traces from thinking models
scripts/generate_hard_negatives.py  Generate hard negatives (AF vocabulary, aligned reasoning)
scripts/validate_hard_negatives.py  Validate hard-negative quality
scripts/combine_data.py             Combine and normalize multiple data files
scripts/label_audit.py              Audit label consistency in training data

Adding a New Base Model

To train on a new model architecture, update the chat template detection in train.py. The script currently supports:

# Gemma
instruction_start = "<start_of_turn>user\n"
instruction_end = "<start_of_turn>model\n"

# Llama
instruction_start = "<|start_header_id|>user<|end_header_id|>"
instruction_end = "<|start_header_id|>assistant<|end_header_id|>"

# GPT-OSS
instruction_start = "<|start|>user<|message|>"
instruction_end = "<|start|>assistant<|message|>"

# Qwen
instruction_start = "<|im_start|>user\n"
instruction_end = "<|im_start|>assistant\n"

Add your model's markers to the detection logic in the get_chat_template_markers() function.
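A marker-lookup function along these lines would suffice (an illustrative sketch; the real get_chat_template_markers() in train.py may be structured differently and covers more families, including Mistral and Phi):

```python
# Per-family chat template markers, keyed by a substring of the model name.
# Illustrative sketch; train.py's actual detection logic may differ.
CHAT_MARKERS = {
    "gemma":   ("<start_of_turn>user\n", "<start_of_turn>model\n"),
    "llama":   ("<|start_header_id|>user<|end_header_id|>",
                "<|start_header_id|>assistant<|end_header_id|>"),
    "gpt-oss": ("<|start|>user<|message|>", "<|start|>assistant<|message|>"),
    "qwen":    ("<|im_start|>user\n", "<|im_start|>assistant\n"),
}

def get_markers(model_name: str):
    name = model_name.lower()
    for family, markers in CHAT_MARKERS.items():
        if family in name:
            return markers  # (instruction_start, instruction_end)
    raise ValueError(f"no chat template markers known for {model_name}")
```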

Research Background

Core AI Safety Research

  • Alignment Faking in Large Language Models — Greenblatt et al., 2024. Demonstrates models strategically complying during training to avoid value modification; source of the Gold-106 labels.

  • Sleeper Agents — Hubinger et al., 2024. Shows deceptive behavior can persist through safety training.

Technical Foundations

  • LoRA — Hu et al., ICLR 2022. Parameter-efficient fine-tuning via low-rank decomposition.

  • QLoRA — Dettmers et al., NeurIPS 2023. 4-bit quantization enables 65B+ models on single GPUs.

  • Unsloth — 2x faster training with 70% less VRAM.

License

MIT License — see LICENSE file.

Citation

@misc{mindreader2025,
  title   = {Mindreader: Detecting Alignment Faking in AI Reasoning},
  author  = {bigsnarfdude},
  year    = {2025},
  url     = {https://github.com/bigsnarfdude/mindreader}
}

Additional citations for foundational work:

@article{greenblatt2024alignmentfaking,
  title   = {Alignment Faking in Large Language Models},
  author  = {Greenblatt, Ryan and Denison, Carson and Wright, Benjamin and others},
  journal = {arXiv preprint arXiv:2412.14093},
  year    = {2024}
}

@article{hubinger2024sleeper,
  title   = {Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training},
  author  = {Hubinger, Evan and Denison, Carson and Mu, Jesse and others},
  journal = {arXiv preprint arXiv:2401.05566},
  year    = {2024}
}

@inproceedings{hu2022lora,
  title     = {LoRA: Low-Rank Adaptation of Large Language Models},
  author    = {Hu, Edward J. and others},
  booktitle = {ICLR},
  year      = {2022}
}

@inproceedings{dettmers2023qlora,
  title     = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author    = {Dettmers, Tim and others},
  booktitle = {NeurIPS},
  year      = {2023}
}

About

Fine-tuned classifiers for chain-of-thought deception detection - training code and weights
