Perspectives on Interpretability in Sciences and ML
A reading club systematically exploring the similarities and differences between neuroscience and machine learning interpretability.
Sparse Autoencoders (SAEs) have found widespread success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly an SAE extracts, and, correspondingly, the scientific conclusions we can draw, is not obvious. Empirically, the proof is in the pudding: SAEs do learn interpretable features. Theoretically, we lack a clear account of what properties a 'concept' must satisfy for an SAE to extract it. There is an extensive body of work studying sparse coding identifiability; in particular, given data generated under sparsity assumptions, when will an algorithm recover the true factors? However, SAEs are trained on internet-swallowing representations that are poorly approximated by simple generative models. Rather than assuming a hypothesised ground truth, we ask what properties any dictionary learning optimum must satisfy without data-assumptions. Concretely, we extend existing local optimality analyses to the nonnegative joint-optimisation problem that vanilla SAEs approximate, and derive constraints relating optimal SAE features to their distributions. We use these to explain a range of observed SAE behaviours - hierarchical splitting & absorption, the structure of residuals, and dense antipodal features - each reflecting how L1+nonnegativity interact with data to structure optimal dictionaries. Further, we identify a novel convex formulation of the problem, and use it to ask: will larger SAEs ever stop splitting? We find the answer can be yes, with a limiting dictionary state that clusters data along rays. In sum, we hope this framework can tease model assumptions from unexpected observations, letting us learn more from SAEs' successes.
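For readers new to the setup, the "nonnegative joint-optimisation problem" mentioned above can be written in the standard nonnegative sparse coding form below. This is a sketch under assumed notation, not necessarily the exact formulation used in the talk: the symbols $x_i$ (data points), $D$ (dictionary with columns $d_k$), $z_i$ (nonnegative codes), and $\lambda$ (sparsity weight) do not appear in the abstract.

```latex
\min_{D,\; z_1,\dots,z_n \,\ge\, 0} \;\; \sum_{i=1}^{n} \Big( \tfrac{1}{2}\,\lVert x_i - D z_i \rVert_2^2 \;+\; \lambda\,\lVert z_i \rVert_1 \Big)
\quad \text{subject to} \quad \lVert d_k \rVert_2 \le 1 \;\; \text{for all } k .
```

A vanilla SAE can be read as approximating this problem by amortising the codes through an encoder, for example $z_i = \mathrm{ReLU}(W x_i + b)$, where $W$ and $b$ are likewise assumed names.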
Join on Zoom →
Does structure in representations imply structure in computation? We study how Llama-3.1-8B reasons over cyclic concepts (e.g., "what month is three months after November?"). Even though Llama-3.1-8B has circular representations for these concepts, we find that instead of directly computing modular addition in the period of the cyclic concept (e.g., 12 for months), the model re-uses a generic addition mechanism across tasks that operates independently of concept-specific geometry. First, it computes the sum of its two inputs using non-modular addition (three + November = 14). Then, it maps back to cyclic concept space (14 → February). We show that Llama-3.1-8B uses task-agnostic Fourier features to compute these sums—in fact, these features have periods that respect standard base-10 addition, e.g., 2, 5, and 10, rather than the cyclic concept period (e.g., 12). Furthermore, we identify a sparse, reusable set of 28 MLP neurons (~0.2% of the MLP at layer 18) that can be partitioned into disjoint clusters that each compute the sum for a Fourier feature with a different period.
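The two-step arithmetic described above (add without wrapping, then map back onto the cycle) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not of the model's internal mechanism; the names `MONTHS`, `add_months`, and `fourier_features` are hypothetical and do not come from the paper.

```python
import math

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def add_months(start, offset):
    """Return (non-modular sum, resulting month name)."""
    index = MONTHS.index(start) + 1       # 1-based position, e.g. November -> 11
    total = index + offset                # plain integer addition: 11 + 3 = 14
    wrapped = (total - 1) % len(MONTHS)   # map back onto the 12-month cycle (0-based)
    return total, MONTHS[wrapped]

def fourier_features(n, periods=(2, 5, 10)):
    """Illustrative cos/sin encoding of an integer at base-10-friendly periods."""
    feats = []
    for period in periods:
        angle = 2 * math.pi * n / period
        feats.extend([math.cos(angle), math.sin(angle)])
    return feats

total, month = add_months("November", 3)
print(total, month)  # 14 February
```

With periods 2, 5, and 10, this toy encoding depends only on the last decimal digit of `n`, so `fourier_features(14)` and `fourier_features(4)` agree up to floating-point error. That is the sense in which such periods track base-10 structure rather than the 12-month cycle.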
Read Paper →
PRISM sessions are open to all: graduate students, postdocs, undergraduates, and faculty across departments.
A record of all previous PRISM sessions
This paper introduces MAIRL, a multi-agent inverse reinforcement learning framework that recovers interpretable, decomposed value maps from behavior, revealing how latent value functions conditioned on social roles govern multi-agent interactions across species.
Read Paper →
This paper shows that statistical symmetries inherent in natural language directly induce geometric structure in the representation spaces of language models.
Read Paper →
This paper proposes a multilevel interpretability framework for ANNs by adapting concepts and methods from neuroscience to systematically analyze representations across scales.
Read Paper →
Panelists debated which unit of analysis (neuron, circuit, feature, or manifold) has the most explanatory traction and which is most overhyped, then bridged across fields by asking what a satisfying circuit explanation requires and when geometric pictures add genuine insight versus elegant redescription. The closing challenge: articulate a one-sentence holy grail in terms of these units and their interactions.
A presentation on how harmful behavior in LLMs relies on a compact set of internal weights, helping explain why safety safeguards are brittle and why narrow fine-tuning can trigger broad misalignment.
Read Paper →
A presentation on decomposing the query-key space in Transformer attention heads into low-rank, human-interpretable components — and attributing attention scores to identified semantic features.
Read Paper →
Panelists from neuroscience and ML interpretability debated what counts as an explanation (faithful, useful, or both) and traded concrete examples of where understanding pays off and where it creates false evidence. Discussion ranged from one-sentence holy grails to bottlenecks in theory, data, and evaluation, closing on the methods each field would borrow from the other.
A tour of circuit analysis for neural network interpretability — from gaining scientific insights into arithmetic reasoning in LLMs to diagnosing performance gaps in VLMs on visual tasks, with a look toward scaling circuit analysis to complex real-world behaviors.
Paper 1 → Paper 2 →
This paper reveals systematic geometric similarities across the token embeddings of large language models — both in global relative orientations and local intrinsic structure — and introduces EMB2EMB to linearly transfer steering vectors between models.
Read Paper →
A tour of geometric and manifold-based approaches to parameter-efficient fine-tuning, tracing how orthogonality, constrained optimization, and algebraic structure can reduce catastrophic forgetting and improve efficiency beyond vanilla LoRA.
Reference →
A comparison of closed-loop stimulus synthesis approaches — generative models vs. encoding model-based feature accentuation — as stringent tests of brain–model alignment across the ventral visual stream.
Read Paper →
This work introduces SparKer, a sparse ensemble of Gaussian kernels grounded in a semi-supervised Neyman–Pearson framework, designed to detect and interpret anomalies in high-dimensional representation spaces with sparsity, locality, and competitive allocation.
Read Paper →
The inaugural PRISM session: a presentation of the foundational "Toy Models of Superposition" paper, exploring how neural networks represent more features than they have dimensions — and what this reveals about the geometry of representations.
Read Paper →
To systematically explore the similarities and differences between neuroscience and machine learning interpretability.
Interpretability is a rapidly growing area in both AI research and neuroscience, aiming to understand how neural networks — artificial and biological — represent and process information. PRISM brings together researchers and students working on diverse approaches to this shared problem, from geometric structure in neural network latent spaces to tools for interpreting large models and understanding biological neural circuits.
Spring 2026 sessions take the form of panel discussions and paper readings, organized around three interrelated themes:
Our inaugural semester featured talks and discussions across four core themes:
Speakers included postdocs, graduate students, and undergraduates from SEAS, MCB, and affiliated institutes including IAIFI and the Kempner Institute.
Sessions run weekly on Tuesdays from 3–4 PM, with 30 minutes of presentation followed by 30 minutes of open discussion. Each session pairs an ML interpretability researcher with a researcher from neuroscience or another science. Sessions are held in SEC 6.242 at the Kempner Institute, with a Zoom option available. Light refreshments are provided.
Members are welcome to suggest papers for future sessions.
PRISM is sponsored by the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. The Kempner Institute supports interdisciplinary research at the intersection of natural and artificial intelligence.
The people behind PRISM
PRISM is organized by researchers across departments at Harvard. We welcome others who wish to get involved in organizing future sessions.