CircuitLab

CircuitLab is a Mechanistic Interpretability Toolkit for jointly training Cross-Layer Transcoders (CLTs), running automated interpretation (AutoInterp), computing attribution graphs, and interacting with a visual interface. We will soon release open-source CLTs with the library (up to 8B parameters), along with automated interpretations and code for interacting with existing open-source CLTs (e.g., for Gemma 2B), enabling direct comparisons within a unified framework. We are currently adding extensions for lower-compute academic budgets (e.g., low-rank finetuning of CLTs, ...).

We believe that a major limitation in the development of CLTs, and more broadly of attribution-graph methods, is the significant engineering effort required to train, analyze, and iterate on them. This library aims to reduce that overhead by providing a clean, scalable, and extensible framework for academic research.

Quick Start

1. Generate and cache activations

from circuitlab import ActivationsStore, clt_training_runner_config, load_model

# Load model
model = load_model("meta-llama/Llama-3.2-1B", device="cuda")

# Create config
cfg = clt_training_runner_config()

# Create activation store
store = ActivationsStore(model, cfg)

# Generate and cache activations
store.generate_and_save_activations(
    path=cfg.cached_activations_path,
    use_compression=True,  # optional
)

2. Train the CLT

from circuitlab import CLTTrainingRunner

# Train
trainer = CLTTrainingRunner(cfg)
trainer.run()

3. Run the AutoInterp

from circuitlab import AutoInterp, AutoInterpConfig

# Create config
cfg = AutoInterpConfig(
  model_name = "meta-llama/Llama-3.2-1B",  # same model as in step 1
  clt_path = "path/to/checkpoint",
)

# Generate
autointerp = AutoInterp(cfg)
autointerp.run("where/to/save")

4. Compute the Attribution Graph

from circuitlab import AttributionRunner

runner = AttributionRunner(
  model_name = "meta-llama/Llama-3.2-1B",  # same model as in step 1
  clt_path = "path/to/checkpoint",
)
graph = runner.run(
  input_str = 'The opposite of "large" is ',
  folder = "where/to/save"
)

5. Start the Visual Interface

from circuitlab.frontend import main, AppConfig

cfg = AppConfig(
  graph_path = "path/to/graph",
  autointerp_path = "path/to/autointerp",
)

main(cfg)

Features

This library currently implements L1-regularized JumpReLU CLTs with the following design principles:

  • Follows Anthropic's training guidelines
  • Supports feature sharding across GPUs (as well as DDP and FSDP)
  • Includes activation caching and compression/quantization of the activations
  • Adopts a structure similar to SAE Lens (code design, activation-store, etc.) and uses Transformer Lens
  • Includes a visual interface for exploring features and attribution graphs
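For intuition, the JumpReLU nonlinearity and L1 sparsity penalty named above can be sketched in a few lines of NumPy. This is an illustrative toy, not the library's actual implementation; `theta` stands in for the learned per-feature thresholds and `coeff` for the sparsity coefficient:

```python
import numpy as np

def jumprelu(preacts: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """JumpReLU: pass a pre-activation through unchanged when it exceeds
    its threshold theta, otherwise output exactly zero."""
    return preacts * (preacts > theta)

def l1_sparsity_penalty(acts: np.ndarray, coeff: float = 1e-3) -> float:
    """L1 regularizer encouraging sparse feature activations."""
    return coeff * float(np.abs(acts).sum())

# With threshold 0.5, sub-threshold pre-activations are zeroed entirely
pre = np.array([-1.0, 0.2, 0.7, 2.0])
acts = jumprelu(pre, np.full_like(pre, 0.5))  # [0.0, 0.0, 0.7, 2.0]
```

Unlike a plain ReLU, JumpReLU also suppresses small positive pre-activations below the threshold, which is what makes the resulting feature activations sparse.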

We welcome contributions to the library. Please refer to CONTRIBUTING.md for guidelines and templates. If you are interested in collaborating, you can also request access to the following document with CLT improvement ideas. Finally, if you have any questions or want to discuss potential improvements or collaborations, write to us on the library Discord!


⚙️ Notes

  • Training happens in multiple steps:

    1. Precompute activations (should be parallelized across independent jobs)
    2. Train the CLT model on the cached activations (should run on a single multi-gpu node)
    3. Run the AutoInterp (should be parallelized across independent jobs)
    4. Compute the Attribution-Graph (runs on a single GPU)
    5. Visualize the Attribution-Graph
  • We provide screenshot examples of training metrics in the output folder and sample training scripts in runners

  • Compression is optional but recommended for large-scale runs (e.g. 1B+ parameter models), giving a 4-8x memory reduction

  • Training with bf16 works fine (autocast with activations and weights in bf16 but gradient/optimizer state kept in fp32), but requires a higher learning rate (around 1.5-2x larger)

  • For Llama 1B, on a full 8-GPU H100 node, we reach an expansion factor of 42 with a micro-batch size of 512

  • The visual interface is a simple Python Dash app that is easy to modify for your own projects!
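To illustrate where the memory reduction from compression comes from, here is a generic symmetric int8 quantization sketch (not necessarily the scheme this library uses): storing activations as int8 plus a per-tensor scale cuts memory 4x relative to fp32 (2x relative to bf16), at the cost of a small, bounded reconstruction error.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: map [-max|x|, +max|x|]
    onto the integer range [-127, 127] with a single scale factor."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximation of the original activations."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)  # fake cached activations
q, s = quantize_int8(x)
x_hat = dequantize_int8(q, s)  # int8 buffer is 4x smaller than x
```

The rounding error is at most half a quantization step (scale / 2) per element, which is typically negligible relative to the activation magnitudes being cached.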

Citation
