For intuition about what we are building, see Neuronpedia:
https://www.neuronpedia.org/gemma-2-2b/graph
This library provides a Dash-based interface to visualize CLT-style attribution graphs. The goal is to make it easy to build and explore feature graphs from attribution matrices, with support for:
- Autointerp feature descriptions
- Feature frequency filtering
- (Eventually) interventions on the original model
At minimum, the interface should work with a single attribution matrix. The library currently includes an example attribution matrix in the data directory for development.
poetry install
To run the interface with a working example, you need:
- Autointerp outputs
Hugging Face repo:
flodraye/sparse-gpt2-autointerp
The autointerp data should be stored locally as a directory of JSON files, organized by layer (one folder per layer).
To reconstruct and load the autointerp files, use:
load_auto_interp.py and then reconstruct_auto_interp.py
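The expected on-disk shape can be sketched as follows. This is a hedged illustration: the `layer_<n>/feature_<m>.json` naming and the `load_autointerp` helper are assumptions for this sketch, not the actual layout produced by reconstruct_auto_interp.py; adjust the prefixes to match the real reconstructed directory.

```python
import json
from pathlib import Path

def load_autointerp(root):
    """Collect per-feature autointerp metadata into {(layer, feature): dict}.

    Assumes a hypothetical 'layer_<n>/feature_<m>.json' layout (one folder
    per layer, one JSON file per feature); match the naming to the real
    output of reconstruct_auto_interp.py.
    """
    data = {}
    for layer_dir in sorted(Path(root).glob("layer_*")):
        layer = int(layer_dir.name.split("_")[1])
        for path in layer_dir.glob("feature_*.json"):
            feature = int(path.stem.split("_")[1])
            data[(layer, feature)] = json.loads(path.read_text())
    return data
```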
Once the data is available, see plan.txt for a rough overview of the intended structure.
You will need to update the auto-interp path in:
/config/settings.py
- /data/loaders: Main logic for loading attribution matrices, autointerp data, and feature statistics into the pipeline. Important for getting a feel for the input data to the library.
- /config/settings.py: Path configuration and global settings. Important for setting local paths.
- plan.txt: High-level description of the intended interface structure.
poetry run python launch.py
When clicking on a node, the interface displays interpretability information for the corresponding feature.
Each feature has:
- One JSON file containing autointerp metadata (description, examples, etc.)
This is why loading the autointerp directory is required for full functionality.
Some features are extremely frequent, always active, and tend to:
- Have poor or non-interpretable autointerp
- Behave like training artifacts
To reduce noise, the graph filters features based on their activation frequency. This requires a set of frequency values per feature.
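A minimal sketch of such a filter, assuming frequencies are given as a per-feature mapping (the function name and the threshold value are illustrative placeholders, not the library's actual API):

```python
def filter_by_frequency(feature_ids, frequency, max_freq=0.5):
    """Keep only features whose activation frequency is below a threshold.

    `frequency` maps feature id -> fraction of inputs on which the feature
    fires; always-active features (frequency near 1.0) are dropped as likely
    training artifacts. The 0.5 default is a placeholder to tune.
    """
    return [f for f in feature_ids if frequency[f] <= max_freq]

# Example: feature 2 is always active and gets filtered out.
freq = {0: 0.01, 1: 0.2, 2: 1.0}
kept = filter_by_frequency([0, 1, 2], freq)
```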
The pipeline includes support for interventions on the original model. This functionality can be ignored for now.
It is intended to be supported in the future, but the design still needs to be clarified.
The interface currently preloads and precomputes most data (including node click information). This was done to avoid slow interactions when clicking on nodes, but it leads to:
- Slow startup time
- Redundant loading (including a known double-loading issue)
This should be optimized.
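One way to remove the preloading without making node clicks slow is to compute each node's payload lazily and memoize it on first access. A sketch using only the standard library (the payload contents below are placeholders, not the real loader logic):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def node_click_payload(node_id):
    """Build the info-panel data for a node on first click; cached afterwards.

    The body is a placeholder: in the real loader this would read the
    feature's autointerp JSON and statistics on demand instead of
    precomputing everything at startup.
    """
    return {"node": node_id, "description": f"feature {node_id}"}

first = node_click_payload(7)   # computed on first click
second = node_click_payload(7)  # served from the cache
```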
Large parts of this interface were written incrementally using Claude Code. As a result:
- Some logic is duplicated
- The structure is not always clear
- The codebase would benefit from refactoring
The first priority is to improve the structure of the code.
The long-term goal is a simple Dash version of Neuronpedia that:
- Works with just an attribution matrix
- Allows optional autointerp and intervention pipelines
- Is easy for others to reuse and extend
TODO: improve the loaders.py file. The input data is the output of the circuit-tracer library. The tricky part is the structure of the attribution matrix (sparse_pruned_adj.T). It is a sparse binary adjacency matrix with dimensions [n_features + n_tokens + n_errors + n_logits, n_features + n_tokens + n_errors + n_logits], where A[i,j] is the edge from node i to node j. This matrix is very sparse. n_features corresponds to the features in the graph. n_tokens corresponds to the embedding token nodes. n_errors corresponds to the error nodes: we treat the non-reconstructed part of the MLP output as a node in the graph, so there are n_tokens * n_layers error nodes. Finally, there are the final logits. Normally, n_logits should be 1, so there should be only one final node in the graph. The current setup of the library assumes a single final logit node (otherwise it is hard to understand why certain features are in the graph), so for the moment we can assume n_logits is 1. The feature_list is a list of size n_features that contains, for each feature, its position and layer. This is all the input data required to plot the graph. Feel free to ask me if you have any questions.
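The block layout of the node indices can be sketched in pure Python. The sizes below are made-up toy values (the real ones come from the circuit-tracer output), and `node_kind` is a hypothetical helper, not part of the library:

```python
# Toy sizes for illustration; the real values come from circuit-tracer.
n_features, n_tokens, n_layers, n_logits = 4, 3, 2, 1
n_errors = n_tokens * n_layers  # one error node per (token, layer) pair
n_nodes = n_features + n_tokens + n_errors + n_logits

def node_kind(i):
    """Map a flat index of the adjacency matrix to its node type.

    A has shape [n_nodes, n_nodes] and A[i, j] is the edge from node i
    to node j; the index blocks are ordered features, tokens, errors,
    logits.
    """
    if i < n_features:
        return "feature"
    if i < n_features + n_tokens:
        return "token"
    if i < n_features + n_tokens + n_errors:
        return "error"
    return "logit"
```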
The TODO list:
- Improve the current embedding-node visualization; it likely has bugs and is a bit ugly.
- Clean up the input file loaders.
- Allow visualization of the error nodes. This is challenging but could be nice: either add a button that lets the user toggle the error nodes on, or include them directly, styled to look different from the feature nodes.
- Remove unnecessary preloading
- Fix the double-loading issue
- Optimize data access for node clicks
Any improvements to the UI are welcome, including:
- A button on the top bar that lets the user dynamically set the node size
- Changes to the top bar (currently with the Max Planck logo)
- General layout and styling improvements
- The attribution sentence at the bottom is currently a bit ugly (the words are too small); its layout should be flexible.
- Give more space to the clustering section by taking some vertical space from the autointerp section, and try to fill up all the vertical space.
The bottom-right node clustering should be improved. Neuronpedia is a good reference here.
Desired properties:
- Clusters that do not grow too large
- Clear cluster descriptions
- Dynamic and interactive clustering
- Names next to clusters
- When the user clicks on a cluster, the corresponding nodes in the main graph should be highlighted in the cluster's color, and the highlight should stay active when multiple clusters are clicked, so the user can see where the clusters sit in the main graph.
The interface should support:
- Exporting the graph as a PDF (for papers)
- Exporting cluster visualizations (could be nice, not the most important now)