PRISM

Perspectives on Interpretability in Sciences and ML

A reading club systematically exploring the similarities and differences between neuroscience and machine learning interpretability.

Harvard School of Engineering and Applied Sciences & The Kempner Institute · Started Fall 2025

Upcoming Event
Guest Talk · S15 · May 26, 2026
William Dorrell

How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

William Dorrell (Kempner Fellow)
Tuesday, May 26, 2026 · 3:00–4:00 PM ET · SEC 6.242

Sparse Autoencoders (SAEs) have found widespread success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly an SAE extracts, and, correspondingly, what scientific conclusions we can draw, is not obvious. Empirically, the proof is in the pudding: SAEs do learn interpretable features. Theoretically, we lack a clear account of what properties a 'concept' must satisfy for an SAE to extract it. There is an extensive body of work studying sparse coding identifiability: given data generated under sparsity assumptions, when will an algorithm recover the true factors? However, SAEs are trained on internet-swallowing representations that are poorly approximated by simple generative models. Rather than assuming a hypothesised ground truth, we ask what properties any dictionary learning optimum must satisfy without data assumptions. Concretely, we extend existing local optimality analyses to the nonnegative joint-optimisation problem that vanilla SAEs approximate, and derive constraints relating optimal SAE features to their distributions. We use these to explain a range of observed SAE behaviours (hierarchical splitting and absorption, the structure of residuals, and dense antipodal features), each reflecting how L1 and nonnegativity interact with data to structure optimal dictionaries. Further, we identify a novel convex formulation of the problem, and use it to ask: will larger SAEs ever stop splitting? We find the answer can be yes, with a limiting dictionary state that clusters data along rays. In sum, we hope this framework can tease model assumptions from unexpected observations, letting us learn more from SAEs' successes.
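
For readers less familiar with the setup, the sketch below shows one common form of the "vanilla" SAE objective mentioned above: a ReLU encoder producing nonnegative codes, a linear decoder, and a squared reconstruction error plus an L1 penalty on the codes. The function name `sae_loss`, the parameter names, and the coefficient `l1_coeff` are illustrative assumptions, not the speaker's formulation.

```python
import numpy as np

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Reconstruction error plus L1 penalty for a ReLU autoencoder (illustrative).

    x:     (batch, d) activations
    W_enc: (d, k) encoder weights, b_enc: (k,) encoder bias
    W_dec: (k, d) decoder weights (the "dictionary"), b_dec: (d,) decoder bias
    """
    z = np.maximum(x @ W_enc + b_enc, 0.0)             # ReLU keeps codes nonnegative
    x_hat = z @ W_dec + b_dec                          # linear reconstruction
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))  # squared error per sample
    sparsity = np.mean(np.sum(np.abs(z), axis=1))      # L1 norm of the codes
    return recon + l1_coeff * sparsity, z
```

The L1 term and the ReLU are the two ingredients the abstract singles out ("L1 and nonnegativity"); everything else here is a conventional choice made for the sketch.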

Join on Zoom →
How to Participate

PRISM sessions are open to all — graduate students, postdocs, undergraduates, and faculty across departments.

In Person
SEC, Harvard University, Cambridge MA
Tuesdays, 3:00–4:00 PM ET
Light refreshments provided.

Online
Zoom link available for each session.
Current recurring Zoom link →

Past Events

A record of all previous PRISM sessions

Spring 2026

Guest Talk · S14 · May 12, 2026
Sheridan Feucht

Arithmetic in the Wild: Llama uses Addition to Reason About Cyclic Concepts

Sheridan Feucht (PhD Student, Northeastern University · Bau & Wallace Labs · formerly Goodfire AI)
May 12, 2026 · 3:00–4:00 PM ET · SEC 6.301

Does structure in representations imply structure in computation? We study how Llama-3.1-8B reasons over cyclic concepts (e.g., "what month is three months after November?"). Even though Llama-3.1-8B has circular representations for these concepts, we find that instead of directly computing modular addition in the period of the cyclic concept (e.g., 12 for months), the model re-uses a generic addition mechanism across tasks that operates independently of concept-specific geometry. First, it computes the sum of its two inputs using non-modular addition (three + November = 14). Then, it maps back to cyclic concept space (14 → February). We show that Llama-3.1-8B uses task-agnostic Fourier features to compute these sums—in fact, these features have periods that respect standard base-10 addition, e.g., 2, 5, and 10, rather than the cyclic concept period (e.g., 12). Furthermore, we identify a sparse, reusable set of 28 MLP neurons (~0.2% of the MLP at layer 18) that can be partitioned into disjoint clusters that each compute the sum for a Fourier feature with a different period.
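
The two-step computation described in the abstract can be written out in ordinary code. The sketch below is only an arithmetic illustration of "add first, wrap second", not a model of the network's internals; the helper names `add_then_wrap` and `ones_digit` are hypothetical. The second helper shows one way to see why periods 2, 5, and 10 fit base-10 addition: a number's residues mod 2 and mod 5 together determine its ones digit.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def add_then_wrap(start_month, offset):
    """Return (plain_sum, resulting_month) for 'offset months after start_month'."""
    start = MONTHS.index(start_month) + 1      # 1-based index: November -> 11
    plain_sum = start + offset                 # step 1: ordinary, non-modular addition
    wrapped = (plain_sum - 1) % len(MONTHS)    # step 2: map back onto the 12-month cycle
    return plain_sum, MONTHS[wrapped]

def ones_digit(r2, r5):
    """Recover a number's ones digit from its residues mod 2 and mod 5."""
    for d in range(10):
        if d % 2 == r2 and d % 5 == r5:
            return d
    raise ValueError("residues out of range")

plain_sum, month = add_then_wrap("November", 3)
print(plain_sum, month)                          # 14 February
print(ones_digit(plain_sum % 2, plain_sum % 5))  # 4
```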

Read Paper →
Paper Discussion · S13 · April 28, 2026
Sonja Johnson-Yu

Unveiling Value Functions in Social Cognition with Multi-Agent Inverse Reinforcement Learning

Sonja Johnson-Yu (Rajan Lab)
April 28, 2026 · 3:00–4:00 PM ET · SEC 6.247

This paper introduces MAIRL, a multi-agent inverse reinforcement learning framework that recovers interpretable, decomposed value maps from behavior, revealing how latent value functions conditioned on social roles govern multi-agent interactions across species.

Read Paper →
Paper Presentation · S12 · April 21, 2026
Sumedh Hindupur

Symmetry in Language Statistics Shapes the Geometry of Model Representations

Sumedh Hindupur (Applied Mathematics PhD Candidate)
April 21, 2026 · 3:00–4:00 PM ET · SEC 6.247

This paper shows that statistical symmetries inherent in natural language directly induce geometric structure in the representation spaces of language models.

Read Paper →
Paper Presentation · S11 · April 14, 2026
Shubham Choudhary

Multilevel Interpretability of Artificial Neural Networks: Leveraging Framework and Methods From Neuroscience

Shubham Choudhary (Kempner Graduate Student)
April 14, 2026 · 3:00–4:00 PM ET · SEC 6.247

This paper proposes a multilevel interpretability framework for ANNs by adapting concepts and methods from neuroscience to systematically analyze representations across scales.

Read Paper →
Panel Discussion · S10 · March 31, 2026

The Nuts and Bolts of Understanding: Neurons, Circuits, Features, and Manifolds

March 31, 2026 · 3:00–4:00 PM ET · SEC 6.301–6.302

Panelists debated which unit of analysis — neuron, circuit, feature, or manifold — has the most explanatory traction and which is most overhyped, then bridged across fields by asking what a satisfying circuit explanation requires and when geometric pictures add genuine insight versus elegant redescription. The closing challenge: articulate a one-sentence holy grail in terms of these units and their interactions.

Ilenna Jones · Neuro
William Dorrell · Neuro
Andy Keller · AI
Naomi Saphra · AI

Paper Presentation · S9 · March 24, 2026
Hadas Orgad

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Hadas Orgad (Kempner Fellow)
March 24, 2026 · 3:00–4:00 PM ET · SEC 2.122

A presentation on how harmful behavior in LLMs relies on a compact set of internal weights, helping explain why safety safeguards are brittle and why narrow fine-tuning can trigger broad misalignment.

Read Paper →
Paper Presentation · S8 · March 10, 2026
Andrew Lee

Decomposing Query-Key Feature Interactions Using Contrastive Covariances

Andrew Lee (Postdoc, Harvard CS)
March 10, 2026 · 3:00–4:00 PM ET · SEC 6.242, Kempner Institute

A presentation on decomposing the query-key space in Transformer attention heads into low-rank, human-interpretable components — and attributing attention scores to identified semantic features.

Read Paper →
Panel Discussion · S7 · March 3, 2026

Interpretability: Expectations and Applications — What Counts as an Explanation and Where Is It Useful?

March 3, 2026 · 3:00–4:00 PM ET · SEC 6.242, Kempner Institute

Panelists from neuroscience and ML interpretability debated what counts as an explanation — whether faithful, useful, or both — and traded concrete examples of where understanding pays off versus creates false evidence. Discussion ranged from one-sentence holy grails to bottlenecks in theory, data, and evaluation, closing on methods each field would borrow from the other.

Richard Hakim · Neuro · Kempner Fellow
Sara Matias · Neuro · MCB Postdoc
Hadas Orgad · ML · Kempner Fellow
Bingbin Liu · ML · Kempner Fellow

Fall 2025

Guest Talk · S6 · December 9, 2025
Yaniv Nikankin

Model Circuits Interpretability, and the Road to Scale It Up

Yaniv Nikankin (PhD Student, Technion · Belinkov Lab)
December 9, 2025 · 3:00–4:00 PM ET

A tour of circuit analysis for neural network interpretability — from gaining scientific insights into arithmetic reasoning in LLMs to diagnosing performance gaps in VLMs on visual tasks, with a look toward scaling circuit analysis to complex real-world behaviors.

Paper 1 → Paper 2 →
Paper Presentation · S5 · November 18, 2025
Andrew Lee

Shared Global and Local Geometry of Language Model Embeddings

Andrew Lee (Postdoc, Harvard CS) · Outstanding Paper Award, COLM 2025
November 18, 2025 · 3:00–4:00 PM ET

This paper reveals systematic geometric similarities across the token embeddings of large language models — both in global relative orientations and local intrinsic structure — and introduces EMB2EMB to linearly transfer steering vectors between models.

Read Paper →
Paper Presentation · S4 · November 11, 2025
Kushal Chattopadhyay

Geometric Approaches to Neural Network Training and MLLM Fine-tuning

Kushal Chattopadhyay (Undergraduate, Applied Math & CS, Harvard)
November 11, 2025 · 3:00–4:00 PM ET

A tour of geometric and manifold-based approaches to parameter-efficient fine-tuning, tracing how orthogonality, constrained optimization, and algebraic structure can reduce catastrophic forgetting and improve efficiency beyond vanilla LoRA.

Reference →
Paper Presentation · S3 · November 4, 2025
Binxu Wang

Model-Optimized Stimuli for Comparing Brain-Alignment of Encoding and Generative Models

Binxu Wang (Kempner Research Fellow)
November 4, 2025 · 3:00–4:00 PM ET

A comparison of closed-loop stimulus synthesis approaches — generative models vs. encoding model-based feature accentuation — as stringent tests of brain–model alignment across the ventral visual stream.

Read Paper →
Paper Presentation · S2 · October 28, 2025
Gaia Grosso

Interpretable Anomaly Detection for Scientific Discovery, Open-World Novelty Detection, and Generative Model Validation

Gaia Grosso (IAIFI Fellow, Harvard/MIT)
October 28, 2025 · 3:00–4:00 PM ET

This work introduces SparKer, a sparse ensemble of Gaussian kernels grounded in a semi-supervised Neyman–Pearson framework, designed to detect and interpret anomalies in high-dimensional representation spaces with sparsity, locality, and competitive allocation.

Read Paper →
Paper Presentation · S1 · October 21, 2025 · Pilot Session
Sumedh Hindupur

Toy Models of Superposition

Sumedh Hindupur (Applied Mathematics PhD Candidate) · Neuro discussant: Shubham Choudhary
October 21, 2025 · 3:00–4:00 PM ET · SEC 6.242, Kempner Institute

The inaugural PRISM session: a presentation of the foundational "Toy Models of Superposition" paper, exploring how neural networks represent more features than they have dimensions — and what this reveals about the geometry of representations.
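
As a small illustration of representing more features than dimensions, the sketch below packs five features into two dimensions and reads them back with a ReLU, following the `ReLU(WᵀWx + b)` form used in the paper. The pentagon arrangement and the bias value are hand-picked here for clarity (assumptions of this sketch, not trained weights), and the function names are hypothetical.

```python
import numpy as np

def pentagon_features(n=5):
    """Place n unit-length feature directions evenly around a circle in 2-D."""
    angles = 2 * np.pi * np.arange(n) / n
    return np.stack([np.cos(angles), np.sin(angles)])  # shape (2, n)

def reconstruct(W, x, bias):
    """Compress x into 2 dimensions, then decode with ReLU(W^T W x + bias)."""
    h = W @ x                                   # n features squeezed into 2 dims
    return np.maximum(W.T @ h + bias, 0.0)      # ReLU removes interference

W = pentagon_features()
bias = -np.cos(2 * np.pi / 5)   # cancels the overlap with neighbouring features
x = np.zeros(5)
x[2] = 1.0                      # a sparse input: only feature 2 is active
print(np.round(reconstruct(W, x, bias), 3))
```

With one feature active at a time, the negative bias and the ReLU suppress the interference from the other four directions, so the active feature is the only one with a clearly positive output. When several features are active at once the interference no longer cancels, which is why sparsity matters in this picture.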

Read Paper →

About PRISM

Perspectives on Interpretability in Sciences and ML

Mission

To systematically explore the similarities and differences between neuroscience and machine learning interpretability.

Interpretability is a rapidly growing area in both AI research and neuroscience, aiming to understand how neural networks — artificial and biological — represent and process information. PRISM brings together researchers and students working on diverse approaches to this shared problem, from geometric structure in neural network latent spaces to tools for interpreting large models and understanding biological neural circuits.

This Semester · Spring 2026

Spring 2026 sessions take the form of panel discussions and paper readings, organized around three interrelated themes:

Spring 2026 Focus Areas

  • Expectations and applications: what counts as an explanation, and where is interpretability useful?
  • Terminology and definitions: do neuroscience and ML interpretability share a common language?
  • Methods: how do the tools used in each field compare, and can they inform one another?

Fall 2025 Overview

Our inaugural semester featured talks and discussions across four core themes:

  • Geometry of representations in neural networks
  • Methods for understanding and shaping models (circuits, SAEs, LoRA geometry)
  • Applications of interpretable models in the sciences
  • Brain–model alignment

Speakers included postdocs, graduate students, and undergraduates from SEAS, MCB, and affiliated institutes including IAIFI and the Kempner Institute.

Format

Sessions run weekly on Tuesdays from 3:00 to 4:00 PM ET, with 30 minutes of presentation followed by 30 minutes of open discussion. Each session pairs an ML interpretability researcher with a researcher from neuroscience or another science. Sessions are held in SEC 6.242 at the Kempner Institute, with a Zoom option available. Light refreshments are provided.

Members are welcome to suggest papers for future sessions.

Long-Term Goals

  • A review paper clarifying which insights about the geometry of representations transfer between neuroscience and AI
  • A future workshop at a neuro/ML conference
  • A shared taxonomy of interpretability methods grounded in underlying hypotheses about representation structure

Sponsor

PRISM is sponsored by the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. The Kempner Institute supports interdisciplinary research at the intersection of natural and artificial intelligence.

Organizers

The people behind PRISM

Current Organizers

PRISM is organized by researchers across departments at Harvard. We welcome others who wish to get involved in organizing future sessions.

Shubham Choudhary

Co-Founder · PhD Candidate, Electrical Engineering
Biologically plausible models and structure in representations.

Learn more →
Sumedh Hindupur

Co-Founder · PhD Candidate, Applied Mathematics
Mechanistic interpretability and geometry of representations.

Learn more →
Demba Ba

Host PI · Professor, Electrical Engineering
Reverse engineering intelligence: both artificial and biological.

Learn more →