For intuition about what we are building, see Neuronpedia:
https://www.neuronpedia.org/gemma-2-2b/graph
This library provides a Dash-based interface to visualize CLT-style attribution graphs. The goal is to make it easy to build and explore feature graphs from attribution matrices, with support for:
- Autointerp feature descriptions
- Feature frequency filtering
- (Eventually) interventions on the original model
At minimum, the interface should work with a single attribution matrix. The library currently includes an example attribution matrix in the data directory for development.
poetry install
To run the interface with a working example, you need:
- Autointerp outputs
Hugging Face repo:
flodraye/sparse-gpt2-autointerp
The autointerp data should be stored locally as a directory of JSON files, organized by layer (one folder per layer).
To reconstruct and load the autointerp files, use:
load_auto_interp.py and then reconstruct_auto_interp.py
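The expected on-disk shape can be sketched as follows. This is a hedged illustration: the `layer_<n>/feature_<m>.json` naming and the `load_autointerp` helper are assumptions for this sketch, not the actual layout produced by reconstruct_auto_interp.py; adjust the prefixes to match the real reconstructed directory.

```python
import json
from pathlib import Path

def load_autointerp(root):
    """Collect per-feature autointerp metadata into {(layer, feature): dict}.

    Assumes a hypothetical 'layer_<n>/feature_<m>.json' layout (one folder
    per layer, one JSON file per feature); match the naming to the real
    output of reconstruct_auto_interp.py.
    """
    data = {}
    for layer_dir in sorted(Path(root).glob("layer_*")):
        layer = int(layer_dir.name.split("_")[1])
        for path in layer_dir.glob("feature_*.json"):
            feature = int(path.stem.split("_")[1])
            data[(layer, feature)] = json.loads(path.read_text())
    return data
```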
Once the data is available, see plan.txt for a rough overview of the intended structure.
You will need to update the auto-interp path in:
/config/settings.py
- /data/loaders: Main logic for loading attribution matrices, autointerp data, and feature statistics into the pipeline. Important for getting a feel for the input data to the library.
- /config/settings.py: Path configuration and global settings. Important for setting local paths.
- plan.txt: High-level description of the intended interface structure.
poetry run python launch.py
When clicking on a node, the interface displays interpretability information for the corresponding feature.
Each feature has:
- One JSON file containing autointerp metadata (description, examples, etc.)
This is why loading the autointerp directory is required for full functionality.
Some features are extremely frequent, always active, and tend to:
- Have poor or non-interpretable autointerp
- Behave like training artifacts
To reduce noise, the graph filters features based on their activation frequency. This requires a set of frequency values per feature.
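A minimal sketch of such a filter, assuming frequencies are given as a per-feature mapping (the function name and the threshold value are illustrative placeholders, not the library's actual API):

```python
def filter_by_frequency(feature_ids, frequency, max_freq=0.5):
    """Keep only features whose activation frequency is below a threshold.

    `frequency` maps feature id -> fraction of inputs on which the feature
    fires; always-active features (frequency near 1.0) are dropped as likely
    training artifacts. The 0.5 default is a placeholder to tune.
    """
    return [f for f in feature_ids if frequency[f] <= max_freq]

# Example: feature 2 is always active and gets filtered out.
freq = {0: 0.01, 1: 0.2, 2: 1.0}
kept = filter_by_frequency([0, 1, 2], freq)
```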
The pipeline includes support for interventions on the original model. This functionality can be ignored for now.
It is intended to be supported in the future, but the design still needs to be clarified.
The interface currently preloads and precomputes most data (including node click information). This was done to avoid slow interactions when clicking on nodes, but it leads to:
- Slow startup time
- Redundant loading (including a known double-loading issue)
This should be optimized.
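One way to remove the preloading without making node clicks slow is to compute each node's payload lazily and memoize it on first access. A sketch using only the standard library (the payload contents below are placeholders, not the real loader logic):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def node_click_payload(node_id):
    """Build the info-panel data for a node on first click; cached afterwards.

    The body is a placeholder: in the real loader this would read the
    feature's autointerp JSON and statistics on demand instead of
    precomputing everything at startup.
    """
    return {"node": node_id, "description": f"feature {node_id}"}

first = node_click_payload(7)   # computed on first click
second = node_click_payload(7)  # served from the cache
```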
Large parts of this interface were written incrementally using Claude Code. As a result:
- Some logic is duplicated
- The structure is not always clear
- The codebase would benefit from refactoring
The first priority is to improve the structure of the code.
The long-term goal is a simple Dash version of Neuronpedia that:
- Works with just an attribution matrix
- Allows optional autointerp and intervention pipelines
- Is easy for others to reuse and extend
TODO: improve the loaders.py file. The input data is the output of the circuit-tracer library. The tricky part is the structure of the attribution matrix (sparse_pruned_adj.T). It is a sparse binary adjacency matrix with dimensions [n_features + n_tokens + n_errors + n_logits, n_features + n_tokens + n_errors + n_logits], where A[i,j] is the edge from node i to node j. This matrix is very sparse. n_features corresponds to the features in the graph. n_tokens corresponds to the embedding token nodes. n_errors corresponds to the error nodes: we treat the non-reconstructed part of the MLP output as a node in the graph, so there are n_tokens * n_layers error nodes. Finally, there are the final logits. Normally, n_logits should be 1, so there should be only one final node in the graph. The current setup of the library assumes a single final logit node (otherwise it is hard to understand why certain features are in the graph), so for the moment we can assume n_logits is 1. The feature_list is a list of size n_features that contains, for each feature, its position and layer. This is all the input data required to plot the graph. Feel free to ask me if you have any questions.
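The block layout of the node indices can be sketched in pure Python. The sizes below are made-up toy values (the real ones come from the circuit-tracer output), and `node_kind` is a hypothetical helper, not part of the library:

```python
# Toy sizes for illustration; the real values come from circuit-tracer.
n_features, n_tokens, n_layers, n_logits = 4, 3, 2, 1
n_errors = n_tokens * n_layers  # one error node per (token, layer) pair
n_nodes = n_features + n_tokens + n_errors + n_logits

def node_kind(i):
    """Map a flat index of the adjacency matrix to its node type.

    A has shape [n_nodes, n_nodes] and A[i, j] is the edge from node i
    to node j; the index blocks are ordered features, tokens, errors,
    logits.
    """
    if i < n_features:
        return "feature"
    if i < n_features + n_tokens:
        return "token"
    if i < n_features + n_tokens + n_errors:
        return "error"
    return "logit"
```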
The TODO list:
- Improve the current embedding-node visualization; it likely has bugs and is a bit ugly.
- Clean up the input file loaders.
- Allow visualization of the error nodes. This is challenging but could be nice: either add a button that lets the user toggle the error nodes on, or include them directly, styled to look different from the feature nodes.
- Remove unnecessary preloading
- Fix the double-loading issue
- Optimize data access for node clicks
Any improvements to the UI are welcome, including:
- A button on the top bar that lets the user dynamically set the node size
- Changes to the top bar (currently with the Max Planck logo)
- General layout and styling improvements
- The attribution sentence at the bottom is currently a bit ugly (the words are too small); its layout should be flexible.
- Give more space to the clustering section by taking some vertical space from the autointerp section, and try to fill up all the vertical space.
The bottom-right node clustering should be improved. Neuronpedia is a good reference here.
Desired properties:
- Clusters that do not grow too large
- Clear cluster descriptions
- Dynamic and interactive clustering
- Names next to clusters
- When the user clicks on a cluster, the corresponding nodes in the main graph should be highlighted in the cluster's color, and the highlight should stay active when multiple clusters are clicked, so the user can see where the clusters sit in the main graph.
The interface should support:
- Exporting the graph as a PDF (for papers)
- Exporting cluster visualizations (could be nice, not the most important now)