We present a data analysis toolkit for interpretable embeddings with sparse autoencoders (SAEs). The toolkit is designed to help researchers and practitioners understand the latent space of their data and the relationships between different variables. It is built on top of the Sparse Autoencoder, an autoencoder trained with a sparsity penalty so that only a few latent dimensions activate for any given input. The toolkit is implemented in Python and is available on GitHub.
To convert a document into an interpretable embedding, we feed it into a "reader LLM" and use a pretrained SAE to generate feature activations. We then max-pool these activations across tokens, producing a single embedding whose dimensions each map to a human-understandable concept. The interpretable nature of this embedding allows us to perform a diverse range of downstream data analysis tasks.
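As a concrete illustration, here is a minimal sketch of the pooling step. It assumes the token-level hidden states from the reader LLM are already available and that the SAE encoder is a simple ReLU layer; the names (sae_encode, W_enc, b_enc) are placeholders for illustration, not the toolkit's API.

```python
# Sketch: turn per-token SAE activations into one interpretable embedding per document.
import torch

def sae_encode(hidden_states: torch.Tensor,
               W_enc: torch.Tensor,
               b_enc: torch.Tensor) -> torch.Tensor:
    """Map reader-LLM hidden states to sparse feature activations (ReLU SAE encoder)."""
    return torch.relu(hidden_states @ W_enc + b_enc)

def interpretable_embedding(hidden_states: torch.Tensor,
                            W_enc: torch.Tensor,
                            b_enc: torch.Tensor) -> torch.Tensor:
    """Max-pool SAE activations across tokens to get one vector per document."""
    acts = sae_encode(hidden_states, W_enc, b_enc)   # [num_tokens, num_latents]
    return acts.max(dim=0).values                    # [num_latents]

# Toy usage with random weights standing in for a real reader LLM and SAE.
d_model, n_latents, n_tokens = 64, 512, 20
hidden = torch.randn(n_tokens, d_model)
W_enc, b_enc = torch.randn(d_model, n_latents), torch.zeros(n_latents)
embedding = interpretable_embedding(hidden, W_enc, b_enc)
print(embedding.shape)  # torch.Size([512]); each dimension corresponds to one SAE latent
```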
Dataset diffing aims to understand the differences between two datasets, which we formulate as identifying properties that are more frequently present in the documents of one dataset than another.
Method. We find the latents that activate more often in one dataset than in the other. For each latent, we compute the difference in its activation frequency between the two datasets. We then relabel the top 200 latents by this difference and summarize their descriptions.
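A hedged sketch of this diffing step, assuming each dataset has already been converted into a matrix of max-pooled SAE activations as described above (function names are illustrative):

```python
# Sketch: rank latents by how much more often they fire in dataset A than in dataset B.
import numpy as np

def activation_frequency(embeddings: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Fraction of documents in which each latent is active (activation > threshold)."""
    return (embeddings > threshold).mean(axis=0)      # [num_latents]

def diff_top_latents(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 200) -> np.ndarray:
    """Indices of the k latents whose activation frequency is highest in A relative to B."""
    freq_diff = activation_frequency(emb_a) - activation_frequency(emb_b)
    return np.argsort(-freq_diff)[:k]

# Toy usage: two datasets of 1000 documents, 512 latents each.
rng = np.random.default_rng(0)
emb_a = rng.random((1000, 512)) * (rng.random((1000, 512)) > 0.90)
emb_b = rng.random((1000, 512)) * (rng.random((1000, 512)) > 0.95)
top_latents = diff_top_latents(emb_a, emb_b, k=200)
# These top latents would then be relabeled and their descriptions summarized.
```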
Comparing model outputs. We generate responses from different models on the same chat prompts and use SAEs to discover differences, applying diffing across three axes of model changes.
We compare the hypotheses generated by SAEs with those found by LLM baselines, finding that SAEs discover larger differences at 2-8x lower token cost. SAE embeddings are particularly cost-effective when multiple comparisons are made against the same dataset (e.g. across model families).
We aim to identify arbitrary biases in a dataset (e.g. all French documents have emojis).
Method. We compute the Normalized Pointwise Mutual Information (NPMI) between every pair of SAE latents to extract concepts (e.g. "French" and "emoji") that most often co-occur. To identify more arbitrary concept correlations, we only consider pairs whose latent descriptions (provided through auto-interp from Goodfire) have a semantic similarity below 0.2.
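A minimal sketch of this co-occurrence analysis, assuming a binary document-by-latent activation matrix and precomputed embeddings of the latent descriptions. The 0.2 similarity cutoff follows the text; the helper names and the brute-force pairwise loop are illustrative simplifications, not the toolkit's implementation.

```python
# Sketch: find dissimilar latent pairs that co-occur surprisingly often (high NPMI).
import numpy as np

def npmi(active: np.ndarray, i: int, j: int, eps: float = 1e-12) -> float:
    """Normalized PMI between latents i and j over a binary [docs x latents] matrix."""
    p_i = active[:, i].mean()
    p_j = active[:, j].mean()
    p_ij = (active[:, i] & active[:, j]).mean()
    if p_ij == 0:
        return -1.0                                   # never co-occur
    pmi = np.log(p_ij / (p_i * p_j + eps))
    return float(pmi / -np.log(p_ij + eps))

def correlated_pairs(active: np.ndarray, desc_emb: np.ndarray,
                     sim_cutoff: float = 0.2, top_k: int = 50):
    """Top latent pairs by NPMI whose description embeddings are semantically dissimilar."""
    # Cosine similarity between unit-normalized latent description embeddings.
    desc_emb = desc_emb / np.linalg.norm(desc_emb, axis=1, keepdims=True)
    sims = desc_emb @ desc_emb.T
    n_latents = active.shape[1]
    scored = []
    for i in range(n_latents):
        for j in range(i + 1, n_latents):
            if sims[i, j] < sim_cutoff:               # keep only "arbitrary" concept pairs
                scored.append((npmi(active, i, j), i, j))
    return sorted(scored, reverse=True)[:top_k]

# Toy usage: binary activations for 1000 docs x 64 latents, random description embeddings.
rng = np.random.default_rng(0)
active = rng.random((1000, 64)) > 0.9
desc_emb = rng.standard_normal((64, 16))
pairs = correlated_pairs(active, desc_emb)            # list of (npmi, latent_i, latent_j)
```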
@article{jiangsun2025interp_embed,
title={Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit},
author={Nick Jiang and Xiaoqing Sun and Lisa Dunlap and Lewis Smith and Neel Nanda},
year={2025}
}