probelab trains probes and activation monitors on language model activations.
It is collector-agnostic: use activations from Transformers, TransformerLens,
NNsight, nnterp, vLLM-Lens, mirin, PyTorch hooks, or saved tensors.
pip install probelab
# or
uv add probelab

probelab does not choose a CUDA, ROCm, XPU, or CPU PyTorch build for you. Use
your environment or lockfile to select the torch backend.
This example trains a probe on synthetic activations. Replace dataset with
activations from any collector and the probing code stays the same.
import torch
import probelab as pl
n_train, n_test = 96, 32
seq_len, hidden_size = 24, 128
n = n_train + n_test
dataset = torch.randn(n, seq_len, hidden_size)
# shape: [(B)atch_size, (S)eq_len, (H)idden_size]
labels = torch.tensor([0, 1] * (n // 2))
train_acts = pl.Activations(dataset[:n_train], dims="bsh")
test_acts = pl.Activations(dataset[n_train:], dims="bsh")
# Simple probes train on one feature vector per sample.
train_features = train_acts.mean("s") # [B, H]
test_features = test_acts.mean("s") # [B, H]
probe = pl.probes.Logistic().fit(train_features, labels[:n_train])
scores = probe.predict(test_features)
print("AUROC:", pl.metrics.auroc(labels[n_train:], scores))
print("Recall@1%FPR:", pl.metrics.recall_at_fpr(labels[n_train:], scores, fpr=0.01))

In practice you collect activations from a real model. The pipeline is:
dataset → tokenize (with a detection mask) → collect → probe. The collection
adapter is reachable as probelab.collection and imports its backend lazily:
import probelab as pl
from probelab import collection
# 1. Load (or build) a dataset of dialogues + labels.
dataset = pl.datasets.load("circuit_breakers")
train, test = dataset.split(0.8, stratified=True)
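# 2. Tokenize, choosing which tokens to score with a mask.
# (`tokenizer` and `model` are assumed to be loaded beforehand; this snippet does not create them.)
tokens = pl.tokenize_dataset(train, tokenizer, mask=pl.masks.assistant())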
# 3. Collect pooled activations for one or more layers.
# (requires `probelab[collection]` and the mirin backend)
acts = collection.collect_activations(model, tokens, layers=[12], pool="mean")
# 4. Train and evaluate a probe — same code as the synthetic example above.
probe = pl.probes.Logistic().fit(acts, train.labels)

Pass pool=None to collect_activations to keep token-level activations. You can
then reduce them yourself with acts.mean("s") or acts.last(), or train a
sequence probe (pl.probes.Attention, pl.probes.MHA, ...) on them directly.
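
To make the two reductions concrete, here is a minimal sketch in plain PyTorch of mean pooling and last-token pooling over token-level activations, restricted to a detection mask. It is not probelab's implementation: the helper names `masked_mean` and `last_masked`, the example mask, and the tensor sizes are all hypothetical, and this README does not state whether `acts.mean("s")` or `acts.last()` take the mask into account.

```python
import torch

# Hypothetical token-level activations with shape [B, S, H].
torch.manual_seed(0)
acts = torch.randn(4, 6, 8)

# Hypothetical detection mask with shape [B, S]: True marks tokens to score.
mask = torch.tensor(
    [
        [0, 0, 1, 1, 1, 0],
        [0, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [1, 1, 1, 1, 0, 0],
    ],
    dtype=torch.bool,
)


def masked_mean(acts: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average over the sequence axis, counting only masked-in tokens."""
    weights = mask.unsqueeze(-1).to(acts.dtype)  # [B, S, 1]
    counts = weights.sum(dim=1).clamp(min=1.0)   # [B, 1], avoids division by zero
    return (acts * weights).sum(dim=1) / counts  # [B, H]


def last_masked(acts: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Select the activation at the last masked-in position of each sequence.

    Rows with no masked-in token fall back to position 0.
    """
    positions = torch.arange(mask.shape[1]).expand_as(mask)  # [B, S]
    last_idx = (positions * mask).amax(dim=1)                # [B]
    return acts[torch.arange(acts.shape[0]), last_idx]       # [B, H]


mean_features = masked_mean(acts, mask)  # [B, H]
last_features = last_masked(acts, mask)  # [B, H]
print(mean_features.shape, last_features.shape)
```

Either result is one feature vector per sample, which is the input shape the simple probes in the synthetic example train on.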
Runnable scripts are in examples/.
See CONTRIBUTING.md for the full development workflow, test markers, and the release process.
uv sync --all-extras --dev
make check # lint + test + build

@software{probelab2026,
  title = {probelab: A library for training probes on LLM activations},
  author = {Alex Serrano},
  url = {https://github.com/serteal/probelab},
  year = {2026},
}