circuits-research/visual-interface

CLT Graph Interface

For intuition about what we are building, see Neuronpedia:
https://www.neuronpedia.org/gemma-2-2b/graph


Overview

This library provides a Dash-based interface to visualize CLT-style attribution graphs. The goal is to make it easy to build and explore feature graphs from attribution matrices, with support for:

  • Autointerp feature descriptions
  • Feature frequency filtering
  • (Eventually) interventions on the original model

At minimum, the interface should work with a single attribution matrix. The library currently includes an example attribution matrix in data/ for development.


Getting Started

Installing

poetry install

Required Models and Data

To run the interface with a working example, you need:

  • Autointerp outputs
    Hugging Face repo:
    flodraye/sparse-gpt2-autointerp

The autointerp data should be stored locally as a directory of JSON files, organized by layer (one folder per layer).
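A directory laid out this way can be walked with a short loader sketch. The exact folder and file names are assumptions (here: one layer_<i>/ folder containing one <feature_id>.json file per feature); the real loading logic lives in the scripts described below.

```python
import json
from pathlib import Path

def load_autointerp(root: str) -> dict:
    """Load autointerp JSON files organized as one folder per layer.

    Assumed layout (hypothetical): root/layer_0/123.json, root/layer_1/7.json, ...
    Returns a dict mapping (layer, feature_id) -> metadata dict.
    """
    data = {}
    for layer_dir in sorted(Path(root).glob("layer_*")):
        layer = int(layer_dir.name.split("_")[1])
        for f in layer_dir.glob("*.json"):
            with open(f) as fh:
                data[(layer, int(f.stem))] = json.load(fh)
    return data
```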


Loading Autointerp

To reconstruct and load the autointerp files, run load_auto_interp.py followed by reconstruct_auto_interp.py.


Configuration

Once the data is available, see plan.txt for a rough overview of the intended structure.

You will need to update the auto-interp path in:

  • /config/settings.py
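The update is a one-line path change. A sketch of what the setting might look like (the actual variable name in settings.py may differ):

```python
# /config/settings.py (sketch; the real setting name may differ)
from pathlib import Path

# Point this at the local directory of autointerp JSON files,
# organized as one folder per layer.
AUTOINTERP_DIR = Path("/path/to/sparse-gpt2-autointerp")
```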

Code Structure

  • /data/loaders
    Main logic for loading attribution matrices, autointerp data, and feature statistics into the pipeline. A good starting point for getting a feel for the library's input data.

  • /config/settings.py
    Path configuration and global settings. Important for setting local paths.

  • plan.txt
    High-level description of the intended interface structure.


Running

poetry run python launch.py

Input Data

Autointerp Features

When clicking on a node, the interface displays interpretability information for the corresponding feature.

Each feature has:

  • One JSON file
  • Containing autointerp metadata (description, examples, etc.)

This is why loading the autointerp directory is required for full functionality.


Feature Frequency Filtering

Some features are extremely frequent (nearly always active) and tend to:

  • Have poor or non-interpretable autointerp
  • Behave like training artifacts

To reduce noise, the graph filters features by activation frequency. This requires a frequency value per feature.
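The filtering step amounts to a threshold on the per-feature frequency array. A minimal sketch (the function name and default threshold are illustrative, not the library's actual API):

```python
import numpy as np

def filter_frequent_features(frequencies, max_freq: float = 0.5) -> np.ndarray:
    """Return indices of features whose activation frequency is below
    the threshold. Very frequent, always-on features tend to have poor
    autointerp and are dropped from the graph to reduce noise.
    """
    frequencies = np.asarray(frequencies)
    return np.nonzero(frequencies < max_freq)[0]
```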


Interventions (Optional)

The pipeline includes support for interventions on the original model. This functionality can be ignored for now.

It is intended to be supported in the future, but the design still needs to be clarified.


Performance Notes

The interface currently preloads and precomputes most data (including node click information). This was done to avoid slow interactions when clicking on nodes, but it leads to:

  • Slow startup time
  • Redundant loading (including a known double-loading issue)

This should be optimized.
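One way to remove the preloading without slowing down node clicks is to load per-node data lazily on first access and cache it. A sketch under assumed names (the file layout and function are hypothetical, not the current implementation):

```python
import json
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=1024)
def node_click_info(layer: int, feature_id: int) -> dict:
    """Load the autointerp metadata for one node on first click and
    cache it, instead of preloading every feature at startup."""
    # Hypothetical layout: autointerp/layer_<i>/<feature_id>.json
    path = Path("autointerp") / f"layer_{layer}" / f"{feature_id}.json"
    if not path.exists():
        return {}
    with open(path) as fh:
        return json.load(fh)
```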


TODO

1. Cleanup and Refactoring

Large parts of this interface were written incrementally using Claude Code. As a result:

  • Some logic is duplicated
  • The structure is not always clear
  • The codebase would benefit from refactoring

The first priority is to improve the structure of the code.

The long-term goal is a simple Dash version of Neuronpedia that:

  • Works with just an attribution matrix

  • Allows optional autointerp and intervention pipelines

  • Is easy for others to reuse and extend

  • TODO: improve the loaders.py file. The input data is the output of the circuit-tracer library. The tricky part is the structure of the attribution matrix (sparse_pruned_adj.T): a sparse binary adjacency matrix of shape [n_features + n_tokens + n_errors + n_logits, n_features + n_tokens + n_errors + n_logits], where A[i, j] is the edge from node i to node j. The matrix is very sparse. The node types are:
      - n_features: features in the graph.
      - n_tokens: embedding token nodes.
      - n_errors: error nodes. The non-reconstructed part of each MLP output is treated as a node in the graph, so there are n_tokens * n_layers error nodes.
      - n_logits: the final logit nodes. Normally there should be exactly one, so there should be only one final node in the graph. The library currently assumes n_logits = 1; otherwise it is hard to explain why certain features appear in the graph, so for the moment we can assume n_logits is one.
    The feature_list is a list of size n_features containing, for each feature, its position and layer. This is all the input data required to plot the graph. Feel free to ask me if you have any questions.
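The block layout described above can be made concrete with a small sketch that computes each node type's index range in the adjacency matrix. The function name is illustrative; the real loading code lives in /data/loaders:

```python
def node_index_ranges(n_features: int, n_tokens: int,
                      n_layers: int, n_logits: int = 1) -> dict:
    """Index ranges of each node type in the square adjacency matrix.

    Assumed block order, following the description above: features,
    then embedding token nodes, then error nodes (one per token per
    layer), then the final logit node(s). n_logits defaults to 1.
    """
    n_errors = n_tokens * n_layers
    b0 = 0
    b1 = b0 + n_features
    b2 = b1 + n_tokens
    b3 = b2 + n_errors
    b4 = b3 + n_logits
    return {
        "features": range(b0, b1),
        "tokens": range(b1, b2),
        "errors": range(b2, b3),
        "logits": range(b3, b4),
    }
```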

The TODO list:

  • Improve the current embedding-node visualization; it may have bugs and is a bit ugly.
  • Clean up the input file loaders.
  • Allow visualization of the error nodes. This is challenging but could be nice: either add a button that lets the user include the error nodes, or include them directly with a visual style distinct from the feature nodes.

2. Performance Improvements

  • Remove unnecessary preloading
  • Fix the double-loading issue
  • Optimize data access for node clicks

3. Visual Improvements

Any improvements to the UI are welcome, including:

  • A button in the top bar letting the user dynamically adjust node size
  • Changes to the top bar (currently with the Max Planck logo)
  • General layout and styling improvements
  • The attribution sentence at the bottom is currently a bit ugly (the words are too small); it should be flexible.
  • Give more space to the clustering section by taking some vertical space from the autointerp section, and try to fill all the vertical space.

4. Node Clustering

The bottom-right node clustering should be improved. Neuronpedia is a good reference here.

Desired properties:

  • Clusters that do not grow too large
  • Clear cluster descriptions
  • Dynamic and interactive clustering
  • Names next to clusters
  • Clicking a cluster should highlight its nodes in the graph with the corresponding color, and the highlight should stay active when multiple clusters are selected, so the user can see the clusters on the main graph.

5. Exporting Figures

The interface should support:

  • Exporting the graph as a PDF (for papers)
  • Exporting cluster visualizations (could be nice, not the most important now)
