
TruthSayer

Geometric decomposition of deception in large language models.

We show that alignment faking (AF) and sandbagging are geometrically separable in SAE feature space using two near-orthogonal directions:

  • AF axis: fires on strategic deception but is near chance on sandbagging (AUROC 0.999 vs 0.53)
  • Truth axis: separates honest from sandbagging traces across all three families (AUROC 0.978/0.957/0.818)

The two axes are nearly orthogonal (cosine = 0.033), indicating that they encode distinct phenomena. Results are validated on gold_106.json, a set of 106 hand-labeled samples never used in training (AF probe AUROC 0.994).
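
Concretely, scoring a trace amounts to projecting its pooled SAE feature vector onto the two learned directions. A minimal sketch, assuming the probe weights have been exported to .npy files (the filenames below are hypothetical; the actual artifacts live under models/):

import numpy as np

# Hypothetical filenames; the real probe weights are saved under models/*.npy
w_af = np.load("models/af_axis.npy")
w_truth = np.load("models/truth_axis.npy")

# Cosine between the two directions (reported above as 0.033)
cosine = w_af @ w_truth / (np.linalg.norm(w_af) * np.linalg.norm(w_truth))

# Score one standardized, mean-pooled SAE feature vector on each axis
feats = np.load("results/feats/example_trace.npy")  # hypothetical sample
af_score = float(feats @ w_af)        # high -> alignment-faking-like
truth_score = float(feats @ w_truth)  # separates honest vs sandbagging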

Key Results

Experiment                                        Result
AF probe on gold_106 (held-out)                   0.994 AUROC
Joint probe on gold_106                           0.989 AUROC
Truth axis → gemini sandbagging                   0.978 AUROC
Truth axis → gpt-oss sandbagging                  0.957 AUROC
Truth axis → qwen sandbagging                     0.818 AUROC
AF/truth direction cosine                         0.033 (near-orthogonal)
Layer sweep ceiling (self-read, all 4 layers)     ~0.67 AUROC

Architecture

  • Model: Gemma 3 27B IT (google/gemma-3-27b-it)
  • SAE: GemmaScope 2 16k medium, layer 40 residual stream (google/gemma-scope-2-27b-it)
  • Feature pooling: mean-pool over sequence tokens (MAX_LEN=2048)
  • Probes: logistic regression on standardized SAE features (see the sketch after this list)
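
A minimal sketch of the probe step above (logistic regression on standardized SAE features), assuming features and labels have already been extracted to .npy files; the filenames and the train/test split are illustrative, not the repo's exact pipeline:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.load("results/feats/af_X.npy")  # hypothetical: [n_samples, n_sae_features]
y = np.load("results/feats/af_y.npy")  # hypothetical: 1 = alignment_faking, 0 = aligned

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

scaler = StandardScaler().fit(X_tr)                               # standardize SAE features
probe = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)

scores = probe.decision_function(scaler.transform(X_te))
print("AUROC:", roc_auc_score(y_te, scores))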

Dataset

  • Sandbagging data: vincentoh/sandbagging-agent-traces (HuggingFace); see the loading sketch after this list
    • Families: gemini, gpt-oss, qwen
    • Field: scratchpad, label: sandbagging/honest
  • AF data: vincentoh/alignment-faking-training (HuggingFace)
    • Labels: alignment_faking / aligned
  • Held-out eval: gold_106.json (106 hand-labeled, never trained on)
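
A minimal loading sketch for the two HuggingFace datasets, assuming the datasets library and a train split; the split name and the label column name (beyond the documented scratchpad field and sandbagging/honest values) are assumptions:

from datasets import load_dataset

sb = load_dataset("vincentoh/sandbagging-agent-traces", split="train")   # split name assumed
af = load_dataset("vincentoh/alignment-faking-training", split="train")

# Sandbagging text lives in the scratchpad field; labels are sandbagging / honest
texts = [row["scratchpad"] for row in sb]
labels = [int(row["label"] == "sandbagging") for row in sb]  # "label" column name assumed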

Reproduction

Requirements

pip install -r requirements.txt

Step 1: Extract SAE features (~2h on A100 40GB)

# Extracts sandbagging families (gemini, gpt-oss, qwen) at layer 40, mean-pooled
python3 extract/extract_sandbagging.py

# Extracts AF data (alignment-faking-training dataset)
python3 extract/extract_af.py
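
Roughly, each trace's layer-40 residual stream is encoded by the SAE and mean-pooled over tokens. A sketch of that encoding step, assuming the standard GemmaScope JumpReLU parameterization (w_enc, b_enc, threshold, as named in the Technical Notes):

import torch

def sae_mean_pool(resid, w_enc, b_enc, threshold):
    # resid: [seq_len, d_model] residual-stream activations from layer 40
    pre = resid @ w_enc + b_enc       # pre-activations, [seq_len, n_features]
    acts = pre * (pre > threshold)    # JumpReLU: zero out features below threshold
    return acts.mean(dim=0)           # mean-pool over sequence tokens (MAX_LEN=2048)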

Step 2: Run experiments (CPU, ~10 min)

python3 experiments/exp1_layer_sweep.py        # self-read AUROC across layers 16/31/40/53
python3 experiments/exp2_truth_direction.py    # cross-family truth direction transfer
python3 experiments/exp3_2d_decomposition.py   # AF axis + truth axis joint analysis
python3 experiments/exp4_gold106_validation.py # held-out validation on gold_106.json
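
For orientation, exp2's leave-one-family-out protocol looks roughly like the sketch below; the per-family file layout under results/feats/ is hypothetical:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

families = ["gemini", "gpt-oss", "qwen"]

def load(fam):
    # Hypothetical per-family file layout under results/feats/
    return np.load(f"results/feats/{fam}_X.npy"), np.load(f"results/feats/{fam}_y.npy")

data = {fam: load(fam) for fam in families}

for held_out in families:
    train = [f for f in families if f != held_out]
    X_tr = np.vstack([data[f][0] for f in train])
    y_tr = np.concatenate([data[f][1] for f in train])
    X_te, y_te = data[held_out]

    scaler = StandardScaler().fit(X_tr)
    probe = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
    auroc = roc_auc_score(y_te, probe.decision_function(scaler.transform(X_te)))
    print(f"truth direction -> {held_out}: AUROC {auroc:.3f}")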

Step 3: Validate outputs against expected results

python3 results/validate.py

All experiments should produce output matching results/expected/*.json within tolerance.
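
The comparison amounts to a numeric diff of the fresh JSON outputs against results/expected/. A rough sketch, assuming fresh outputs land under results/ with the same filenames and a tolerance of 1e-3 (both assumptions; see results/validate.py for the actual check):

import glob, json
import numpy as np

for path in glob.glob("results/expected/*.json"):
    expected = json.load(open(path))
    fresh = json.load(open(path.replace("expected/", "")))  # assumed fresh-output location
    for key, value in expected.items():
        if isinstance(value, (int, float)):
            assert np.isclose(fresh[key], value, atol=1e-3), f"{path}: {key} out of tolerance"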

Hardware

  • GPU required: Step 1 only. Gemma 3 27B runs in BF16 with CPU offload (loading sketch after this list).
    • Tested: A100 40GB + max_memory={0: "36GiB", "cpu": "80GiB"}
    • ~1.82s/sample at layer 40 (later layers are CPU-offloaded)
  • CPU only: Steps 2-3, which run on the pre-extracted .npy files in results/feats/.
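
For reference, a loading sketch consistent with the memory budget above; device_map="auto" and the exact from_pretrained arguments are assumptions, not necessarily the repo's loader:

import torch
from transformers import Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-27b-it",
    torch_dtype=torch.bfloat16,               # BF16 weights
    device_map="auto",                        # split across GPU 0 and CPU (assumed)
    max_memory={0: "36GiB", "cpu": "80GiB"},  # tested budget from above
)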

Technical Notes

  • Layer path: model.model.language_model.layers[N] (Gemma3ForConditionalGeneration is multimodal)
  • SAE keys: lowercase w_enc, w_dec, b_enc, threshold (GemmaScope 2 convention)
  • Dataset field: scratchpad (not text or trace)
  • Early-exit hook pattern: register a forward hook at the target layer and raise _EarlyExit so the CPU-offloaded later layers never run (saves ~50% of extraction time); see the sketch below
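
A minimal sketch of that early-exit pattern, assuming model and inputs come from a standard transformers setup and that the decoder layer returns the hidden states as the first element of its output:

import torch

class _EarlyExit(Exception):
    """Raised inside the hook to stop the forward pass at the target layer."""

captured = {}

def grab_and_exit(module, inputs, output):
    captured["resid"] = output[0].detach()  # residual stream after the target layer
    raise _EarlyExit

layer = model.model.language_model.layers[40]
handle = layer.register_forward_hook(grab_and_exit)
try:
    with torch.no_grad():
        model(**inputs)  # never reaches the CPU-offloaded later layers
except _EarlyExit:
    pass
finally:
    handle.remove()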

Files

extract/
  extract_sandbagging.py   — GPU: extract SAE feats for sandbagging families
  extract_af.py            — GPU: extract SAE feats for AF data
experiments/
  exp1_layer_sweep.py      — self-read AUROC at layers 16, 31, 40, 53
  exp2_truth_direction.py  — cross-family truth direction (leave-one-family-out)
  exp3_2d_decomposition.py — 2D AF+truth decomposition, cross-family AUROC
  exp4_gold106_validation.py — held-out validation
datavis/
  plot_2d.py               — static 2D scatter (AF axis vs Truth axis)
  plot_3d.py               — interactive 3D (opens in browser, needs plotly)
  README.md                — setup and usage instructions
models/
  inference.py             — load saved probe weights, run inference on new text
  *.npy                    — saved probe weights, scalers, biases
results/
  expected/                — reference outputs for validate.py
  validate.py              — diff fresh run vs expected
gold_106.json              — held-out eval set (DO NOT use for training)