probelab

probelab trains probes and activation monitors on language model activations. It is collector-agnostic: use activations from Transformers, TransformerLens, NNsight, nnterp, vLLM-Lens, mirin, PyTorch hooks, or saved tensors.

Installation

pip install probelab
# or
uv add probelab

probelab does not choose a CUDA, ROCm, XPU, or CPU PyTorch build for you. Use your environment or lockfile to select the torch backend.
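To confirm which torch build your environment actually resolved, a small check like the one below can help. It is a sketch that uses only standard PyTorch attributes; the helper name torch_backend is hypothetical and is not part of probelab.

```python
import torch


def torch_backend() -> str:
    """Best-effort guess at the accelerator backend of the installed torch build.

    Hypothetical helper for illustration only; not part of probelab.
    """
    # ROCm builds expose devices through the torch.cuda API and set torch.version.hip.
    if getattr(torch.version, "hip", None) and torch.cuda.is_available():
        return "rocm"
    if torch.cuda.is_available():
        return "cuda"
    # torch.xpu only exists in newer PyTorch releases, so look it up defensively.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


print("torch", torch.__version__, "backend:", torch_backend())
```

If this reports cpu when you expected an accelerator, the fix belongs in your environment or lockfile rather than in probelab.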

Quick Start

This example trains a probe on synthetic activations. Replace the dataset tensor with activations from any collector and the probing code stays the same. Because the activations here are random noise, expect an AUROC near 0.5.

import torch
import probelab as pl

n_train, n_test = 96, 32
seq_len, hidden_size = 24, 128
n = n_train + n_test

dataset = torch.randn(n, seq_len, hidden_size)
# shape: [(B)atch_size, (S)eq_len, (H)idden_size]
labels = torch.tensor([0, 1] * (n // 2))

train_acts = pl.Activations(dataset[:n_train], dims="bsh")
test_acts = pl.Activations(dataset[n_train:], dims="bsh")

# Simple probes train on one feature vector per sample.
train_features = train_acts.mean("s")  # [B, H]
test_features = test_acts.mean("s")  # [B, H]

probe = pl.probes.Logistic().fit(train_features, labels[:n_train])

scores = probe.predict(test_features)
print("AUROC:", pl.metrics.auroc(labels[n_train:], scores))
print("Recall@1%FPR:", pl.metrics.recall_at_fpr(labels[n_train:], scores, fpr=0.01))
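To make the two reported metrics concrete, here is a minimal pure-Python sketch of how AUROC and recall at a fixed false positive rate can be computed. It is not probelab's implementation: the function bodies, the strict "score > threshold" convention, and the tie handling are assumptions chosen for clarity, and pl.metrics may use a different convention (for example, interpolation along the ROC curve).

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def recall_at_fpr(labels, scores, fpr=0.01):
    """Recall at the loosest threshold whose false positive rate is at most `fpr`.

    Assumed convention: a sample is flagged when its score is strictly above the threshold.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    allowed = int(fpr * len(neg))  # number of negatives we may flag by mistake
    if allowed >= len(neg):
        return 1.0  # every negative may be flagged, so every positive can be too
    threshold = neg[allowed]  # flagging above this keeps false positives <= allowed
    return sum(p > threshold for p in pos) / len(pos)


labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.4, 0.7, 0.9, 0.95]
print(auroc(labels, scores))                    # 0.875
print(recall_at_fpr(labels, scores, fpr=0.01))  # 0.5
print(recall_at_fpr(labels, scores, fpr=0.25))  # 1.0
```

Under this convention, a test set with only 16 negatives (as in the synthetic example) allows zero false positives at a 1% budget, so the metric reduces to the fraction of positives that score above the highest-scoring negative. Larger test sets give a more informative low-FPR estimate.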

Collecting activations

In practice you collect activations from a real model. The pipeline is: dataset → tokenize (with a detection mask) → collect → probe. The collection adapter is reachable as probelab.collection and imports its backend lazily:

import probelab as pl
from probelab import collection

# `model` and `tokenizer` are assumed to be loaded already; loading them is not shown here.

# 1. Load (or build) a dataset of dialogues + labels.
dataset = pl.datasets.load("circuit_breakers")
train, test = dataset.split(0.8, stratified=True)

# 2. Tokenize, choosing which tokens to score with a mask.
tokens = pl.tokenize_dataset(train, tokenizer, mask=pl.masks.assistant())

# 3. Collect pooled activations for one or more layers.
#    (requires `probelab[collection]` and the mirin backend)
acts = collection.collect_activations(model, tokens, layers=[12], pool="mean")

# 4. Train a probe as in the synthetic example above; to evaluate,
#    tokenize and collect activations for `test` the same way.
probe = pl.probes.Logistic().fit(acts, train.labels)

Pass pool=None to collect_activations to keep token-level activations. You can then either reduce them yourself with acts.mean("s") or acts.last(), or train a sequence probe (pl.probes.Attention, pl.probes.MHA, ...) directly on the unpooled activations.
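For intuition about what pooling with a detection mask does, the sketch below implements masked mean pooling and last-token pooling in plain PyTorch. The function names masked_mean and last_masked are hypothetical and this is not probelab's code; whether acts.mean("s") and acts.last() treat the mask in exactly this way is an assumption.

```python
import torch


def masked_mean(acts: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average [B, S, H] activations over positions where mask is True -> [B, H]."""
    weights = mask.unsqueeze(-1).to(acts.dtype)   # [B, S, 1]
    counts = weights.sum(dim=1).clamp(min=1.0)    # [B, 1]; avoids division by zero
    return (acts * weights).sum(dim=1) / counts


def last_masked(acts: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Select the activation at the last position where mask is True -> [B, H].

    Rows with an all-False mask fall back to position 0.
    """
    positions = torch.arange(mask.shape[1], device=mask.device)  # [S]
    last_idx = (positions * mask).max(dim=1).values              # [B]
    batch_idx = torch.arange(acts.shape[0], device=acts.device)  # [B]
    return acts[batch_idx, last_idx]


# Two samples, three tokens, hidden size two.
acts = torch.arange(12, dtype=torch.float32).reshape(2, 3, 2)
mask = torch.tensor([[False, True, True],
                     [True, True, False]])

print(masked_mean(acts, mask))  # rows: [3., 4.] and [7., 8.]
print(last_masked(acts, mask))  # rows: [4., 5.] and [8., 9.]
```

Mean pooling smooths over every scored token, while last-token pooling keeps only the final scored position. Sequence probes skip this reduction and consume the [B, S, H] tensor directly.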

Runnable scripts are in examples/.

Development

See CONTRIBUTING.md for the full development workflow, test markers, and the release process.

uv sync --all-extras --dev
make check   # lint + test + build

Citation

@software{probelab2026,
  title = {probelab: A library for training probes on LLM activations},
  author = {Alex Serrano},
  url = {https://github.com/serteal/probelab},
  year = {2026},
}
