PRISM

Guest Talk · S19 · June 30, 2026

ML Interp

June 30, 2026 · 3:00–4:00 PM ET · SEC 6.242 · ~30 min each

Tomer Ashuach

Tomer Ashuach (PhD Student, Technion · Belinkov Lab)

Tomer uses interpretability to uncover the internal mechanisms of LLMs — specifically how knowledge is acquired, represented, and edited. His talk covers two angles: (1) unlearning in LLMs via REVS (ACL Findings 2025), which locates neurons storing memorized sensitive information and surgically removes it, and CRISP (ACL 2026), which performs concept unlearning via Sparse Autoencoders; and (2) Privileged Knowledge in LMs (ACL 2026), asking whether models carry an internal signal about their own correctness that no external observer can retrieve.

Michael Toker

Michael Toker (PhD Candidate, Technion · Belinkov Lab)

Michael's research focuses on mechanistic interpretability of multimodal models. He presented Diffusion Lens (ACL 2024), Padding Tone (NAACL 2025), and Follow the Flow (ACL 2026): methods that reveal what text-to-image models learn and how they perform computations internally, with applications to problems such as semantic leakage.

Guest Talk · S18 · June 23, 2026

ML Interp

Old Habits and Hard Choices: How LLMs Navigate Management Decision-Making and Earlier Interactions

Adi Simhi (PhD Student, Technion · Belinkov Lab)
June 23, 2026 · 3:00–4:00 PM ET · SEC 6.242

We begin by evaluating LLM decision-making in realistic, human-validated managerial scenarios using ManagerBench, and find that frontier LLMs perform poorly when navigating the safety-pragmatism trade-off. We then turn to a multi-turn setting to investigate the "carryover effect": the phenomenon in which, for example, hallucinations in earlier interactions shape subsequent model responses. To study this, we developed HISTORY-ECHOES, a framework for analyzing how conversational history biases later generations. The framework takes two views: a probabilistic view, which models conversations as Markov chains to quantify state consistency, and a geometric view, which measures the consistency of consecutive hidden representations. We find that the two measures are correlated.
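As a rough illustration of the two views described in the abstract, the sketch below estimates first-order Markov transition probabilities from per-turn state labels and computes cosine similarity between consecutive hidden vectors. The state labels ("ok", "hallucinated"), the toy data, and all function names are assumptions made for this example; it is not the HISTORY-ECHOES implementation.

```python
from collections import Counter, defaultdict
import math


def transition_probabilities(conversations):
    """Estimate first-order Markov transition probabilities from state sequences."""
    counts = defaultdict(Counter)
    for states in conversations:
        # Count every (previous turn state, next turn state) pair.
        for prev, nxt in zip(states, states[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }


def state_consistency(probs):
    """Probability of remaining in the same state from one turn to the next."""
    return {state: row.get(state, 0.0) for state, row in probs.items()}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consecutive_similarity(hidden_states):
    """Cosine similarity between each pair of consecutive hidden vectors."""
    return [cosine(a, b) for a, b in zip(hidden_states, hidden_states[1:])]


# Hypothetical per-turn labels for three toy conversations.
conversations = [
    ["ok", "ok", "hallucinated", "hallucinated"],
    ["ok", "hallucinated", "hallucinated", "hallucinated"],
    ["ok", "ok", "ok", "ok"],
]
consistency = state_consistency(transition_probabilities(conversations))

# Hypothetical 2-dimensional hidden vectors for three consecutive turns.
similarities = consecutive_similarity([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy data, a conversation that enters the "hallucinated" state never leaves it, which is the kind of carryover a self-transition probability would capture.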

Paper 1 → Paper 2 →

Guest Talk · S17 · June 16, 2026

ML Interp

Modality-aware methods for interpretability

Alex Oesterling (CS PhD Student, Harvard SEAS)
June 16, 2026 · 3:00–4:00 PM ET · SEC 6.242

Interpretability promises a lever to control models, monitor their behavior, and learn from them. As such, we often imagine finding features in models such as "honesty," "conciseness," or "speaking in French." However, state-of-the-art methods find features that are local, syntactic, and uninteresting (e.g., "the phrase 'The' at the start of a sentence"), which exposes a critical gap between the proposed utility of interpretability and reality. More precisely, while we care about semantic features in models, we instead recover syntactic ones. Drawing on theory from computational linguistics, we propose a simple model of language production: semantic features are globally smooth across time, whereas syntactic features are local. We use this model to propose Temporal Sparse Autoencoders (T-SAEs), which employ a temporal contrastive loss to isolate smooth features in LLM representations. Importantly, T-SAEs recover semantic information despite being trained with only a self-supervised context-similarity objective, and successfully disentangle semantic and syntactic features. We present experiments demonstrating T-SAEs' ability to track semantic shifts over text sequences and applications to safety monitoring and steering. Finally, I will discuss ongoing extensions to vision models and alternative invariances suited for visual semantics, including opportunities to understand visual processing and predict neural activity.
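To make the idea of a temporal contrastive loss concrete, here is a minimal sketch of one common contrastive objective (an InfoNCE-style loss) applied to features at adjacent tokens. The abstract does not specify the exact T-SAE loss, so the loss form, the temperature value, the toy vectors, and the function names are all assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_contrastive_loss(features_t, features_next, temperature=0.1):
    """InfoNCE-style loss over a batch of sequences (assumed form, not the paper's).

    Row i of features_next (the feature at the following token of sequence i)
    is the positive for row i of features_t; rows from other sequences act
    as negatives.
    """
    losses = []
    for i, anchor in enumerate(features_t):
        logits = [cosine(anchor, other) / temperature for other in features_next]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Hypothetical features for two sequences at token t.
features_t = [[1.0, 0.0], [0.0, 1.0]]
# Smooth case: each sequence's next-token feature stays close to its current one.
smooth_next = [[0.9, 0.1], [0.1, 0.9]]
# Erratic case: the next-token feature looks like the other sequence's feature.
erratic_next = [[0.1, 0.9], [0.9, 0.1]]

smooth_loss = temporal_contrastive_loss(features_t, smooth_next)
erratic_loss = temporal_contrastive_loss(features_t, erratic_next)
```

Under this assumed objective, features that vary slowly within a sequence receive a lower loss than features that change abruptly between adjacent tokens, which matches the stated intuition that semantic features are smooth across time.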

Read Paper →

Guest Talk · S16 · June 2, 2026

Neuroscience

Layer-wise efficient coding in early olfactory processing

Juan Carlos Fernandez del Castillo (Graduate Student in Biophysics, Harvard University)
June 2, 2026 · 3:00–4:00 PM ET · SEC 6.242

The architecture of early olfactory processing is a striking example of convergent evolution. Typically, a panel of broadly tuned receptors is selectively expressed in sensory neurons (each neuron expressing only one receptor), and each glomerulus receives projections from just one neuron type. Taken together, these three motifs—broad receptors, selective expression, and glomerular convergence—constitute "canonical olfaction," since a number of model organisms including mice and flies exhibit these features. The emergence of this distinctive architecture across evolutionary lineages suggests that it may be optimized for information processing, an idea known as efficient coding. In this talk, I explain how, by maximizing mutual information one layer at a time, efficient coding recovers several features of canonical olfactory processing under realistic biophysical assumptions. In the second part of the talk, I will speculate about what other aspects of olfactory processing might be understood using mutual information.

Read Paper →

Guest Talk · S15 · May 26, 2026

ML Interp

How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

William Dorrell (Kempner Fellow)
May 26, 2026 · 3:00–4:00 PM ET · SEC 6.242

Sparse Autoencoders (SAEs) have found widespread success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly an SAE extracts and, correspondingly, what scientific conclusions we can draw are not obvious. Empirically, the proof is in the pudding: SAEs do learn interpretable features. Theoretically, we lack a clear account of what properties a 'concept' must satisfy for an SAE to extract it. There is an extensive body of work studying sparse coding identifiability; in particular, given data generated under sparsity assumptions, when will an algorithm recover the true factors? However, SAEs are trained on internet-swallowing representations that are poorly approximated by simple generative models. Rather than assuming a hypothesised ground truth, we ask what properties any dictionary learning optimum must satisfy, without assumptions about the data. Concretely, we extend existing local optimality analyses to the nonnegative joint-optimisation problem that vanilla SAEs approximate, and derive constraints relating optimal SAE features to their distributions. We use these to explain a range of observed SAE behaviours (hierarchical splitting and absorption, the structure of residuals, and dense antipodal features), each reflecting how the L1 penalty and nonnegativity interact with data to structure optimal dictionaries. Further, we identify a novel convex formulation of the problem, and use it to ask: will larger SAEs ever stop splitting? We find the answer can be yes, with a limiting dictionary state that clusters data along rays. In sum, we hope this framework can tease apart model assumptions and unexpected observations, letting us learn more from SAEs' successes.
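For orientation, one standard way to write a nonnegative sparse dictionary learning problem is shown below. The notation (data points x_i, dictionary D with columns d_k, codes z_i, penalty weight λ) and the norm constraint on the dictionary columns are assumptions made here for illustration; the paper's exact formulation may differ.

```latex
\min_{D,\; z_1,\dots,z_N} \;
\sum_{i=1}^{N} \left( \tfrac{1}{2}\,\lVert x_i - D z_i \rVert_2^2
  + \lambda\,\lVert z_i \rVert_1 \right)
\quad \text{subject to} \quad
z_i \ge 0 \;\; \text{for all } i,
\qquad \lVert d_k \rVert_2 \le 1 \;\; \text{for all } k .
```

In this form the L1 penalty encourages few active features per input, and the constraint on z_i mirrors the nonnegative activations produced by a ReLU encoder.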

Read Paper →

Guest Talk · S14 · May 12, 2026

ML Interp

Arithmetic in the Wild: Llama uses Addition to Reason About Cyclic Concepts

Sheridan Feucht (PhD Student, Northeastern University · Bau & Wallace Labs · formerly Goodfire AI)
May 12, 2026 · 3:00–4:00 PM ET · SEC 6.301

Does structure in representations imply structure in computation? We study how Llama-3.1-8B reasons over cyclic concepts (e.g., "what month is three months after November?"). Even though Llama-3.1-8B has circular representations for these concepts, we find that instead of directly computing modular addition in the period of the cyclic concept (e.g., 12 for months), the model re-uses a generic addition mechanism across tasks that operates independently of concept-specific geometry. First, it computes the sum of its two inputs using non-modular addition (three + November = 14). Then, it maps back to cyclic concept space (14 → February). We show that Llama-3.1-8B uses task-agnostic Fourier features to compute these sums—in fact, these features have periods that respect standard base-10 addition, e.g., 2, 5, and 10, rather than the cyclic concept period (e.g., 12). Furthermore, we identify a sparse, reusable set of 28 MLP neurons (~0.2% of the MLP at layer 18) that can be partitioned into disjoint clusters that each compute the sum for a Fourier feature with a different period.
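The two-step computation described in the abstract, and the role of Fourier features with periods 2, 5, and 10, can be illustrated with a small arithmetic sketch. This shows only the arithmetic, not the model's actual mechanism; the function names and the choice to decode the last digit from the period-10 feature are assumptions made for this example.

```python
import math

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def add_months(start, offset):
    """Step 1: plain (non-modular) addition. Step 2: map back to the cycle."""
    total = MONTHS.index(start) + 1 + offset  # e.g. November (11) + 3 = 14
    return MONTHS[(total - 1) % 12]           # e.g. 14 maps to February


def fourier_features(n, periods=(2, 5, 10)):
    """Represent an integer as (cos, sin) pairs, one pair per period."""
    return {
        T: (math.cos(2 * math.pi * n / T), math.sin(2 * math.pi * n / T))
        for T in periods
    }


def add_in_fourier_space(fa, fb):
    """Add two numbers by composing rotations (angle-addition identities)."""
    return {
        T: (
            fa[T][0] * fb[T][0] - fa[T][1] * fb[T][1],  # cos(a + b)
            fa[T][1] * fb[T][0] + fa[T][0] * fb[T][1],  # sin(a + b)
        )
        for T in fa
    }


def last_digit(features):
    """Decode n mod 10 from the period-10 feature."""
    cos10, sin10 = features[10]
    angle = math.atan2(sin10, cos10) % (2 * math.pi)
    return round(angle * 10 / (2 * math.pi)) % 10


result_month = add_months("November", 3)
sum_features = add_in_fourier_space(fourier_features(3), fourier_features(11))
sum_last_digit = last_digit(sum_features)
```

Note that the period-10 feature recovers the last digit of 3 + 11 = 14, a base-10 quantity, rather than anything tied to the 12-month cycle; the mapping back to a month happens as a separate step.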

Read Paper →

Paper Discussion · S13 · April 28, 2026

Neuroscience

Unveiling Value Functions in Social Cognition with Multi-Agent Inverse Reinforcement Learning

Sonja Johnson-Yu (Rajan Lab)
April 28, 2026 · 3:00–4:00 PM ET · SEC 6.247

This paper introduces MAIRL, a multi-agent inverse reinforcement learning framework that recovers interpretable, decomposed value maps from behavior, revealing how latent value functions conditioned on social roles govern multi-agent interactions across species.

Read Paper →

Paper Presentation · S12 · April 21, 2026

ML Interp

Symmetry in Language Statistics Shapes the Geometry of Model Representations

Sumedh Hindupur (Applied Mathematics PhD Candidate)
April 21, 2026 · 3:00–4:00 PM ET · SEC 6.247

This paper shows that statistical symmetries inherent in natural language directly induce geometric structure in the representation spaces of language models.

Read Paper →

Paper Presentation · S11 · April 14, 2026

ML Interp · Neuroscience

Multilevel Interpretability of Artificial Neural Networks: Leveraging Framework and Methods From Neuroscience

Shubham Choudhary (Kempner Graduate Student)
April 14, 2026 · 3:00–4:00 PM ET · SEC 6.247

This paper proposes a multilevel interpretability framework for ANNs by adapting concepts and methods from neuroscience to systematically analyze representations across scales.

Read Paper →

Panel Discussion · S10 · March 31, 2026

ML Interp · Neuroscience

The Nuts and Bolts of Understanding: Neurons, Circuits, Features, and Manifolds

March 31, 2026 · 3:00–4:00 PM ET · SEC 6.301–6.302

Panelists debated which unit of analysis — neuron, circuit, feature, or manifold — has the most explanatory traction and which is most overhyped, then bridged across fields by asking what a satisfying circuit explanation requires and when geometric pictures add genuine insight versus elegant redescription. The closing challenge: articulate a one-sentence holy grail in terms of these units and their interactions.

Ilenna Jones

Neuro

William Dorrell

Neuro

Andy Keller

AI

Naomi Saphra

AI

Paper Presentation · S9 · March 24, 2026

ML Interp

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Hadas Orgad (Kempner Fellow)
March 24, 2026 · 3:00–4:00 PM ET · SEC 2.122

A presentation on how harmful behavior in LLMs relies on a compact set of internal weights, helping explain why safety safeguards are brittle and why narrow fine-tuning can trigger broad misalignment.

Read Paper →

Paper Presentation · S8 · March 10, 2026

ML Interp

Decomposing Query-Key Feature Interactions Using Contrastive Covariances

Andrew Lee (Postdoc, Harvard CS)
March 10, 2026 · 3:00–4:00 PM ET · SEC 6.242, Kempner Institute

A presentation on decomposing the query-key space in Transformer attention heads into low-rank, human-interpretable components — and attributing attention scores to identified semantic features.

Read Paper →

Panel Discussion · S7 · March 3, 2026

ML Interp · Neuroscience

Interpretability: Expectations and Applications — What Counts as an Explanation and Where Is It Useful?

March 3, 2026 · 3:00–4:00 PM ET · SEC 6.242, Kempner Institute

Panelists from neuroscience and ML interpretability debated what counts as an explanation — whether faithful, useful, or both — and traded concrete examples of where understanding pays off versus creates false evidence. Discussion ranged from one-sentence holy grails to bottlenecks in theory, data, and evaluation, closing on methods each field would borrow from the other.

Richard Hakim

Neuro · Kempner Fellow

Sara Matias

Neuro · MCB Postdoc

Hadas Orgad

ML · Kempner Fellow

Bingbin Liu

ML · Kempner Fellow

Guest Talk · S6 · December 9, 2025

ML Interp

Model Circuits Interpretability, and the Road to Scale It Up

Yaniv Nikankin (PhD Student, Technion · Belinkov Lab)
December 9, 2025 · 3:00–4:00 PM ET

A tour of circuit analysis for neural network interpretability — from gaining scientific insights into arithmetic reasoning in LLMs to diagnosing performance gaps in VLMs on visual tasks, with a look toward scaling circuit analysis to complex real-world behaviors.

Paper 1 → Paper 2 →

Paper Presentation · S5 · November 18, 2025

ML Interp

Shared Global and Local Geometry of Language Model Embeddings

Andrew Lee (Postdoc, Harvard CS) · Outstanding Paper Award, COLM 2025
November 18, 2025 · 3:00–4:00 PM ET

This paper reveals systematic geometric similarities across the token embeddings of large language models — both in global relative orientations and local intrinsic structure — and introduces EMB2EMB to linearly transfer steering vectors between models.

Read Paper →

Paper Presentation · S4 · November 11, 2025

ML Interp

Geometric Approaches to Neural Network Training and MLLM Fine-tuning

Kushal Chattopadhyay (Undergraduate, Applied Math & CS, Harvard)
November 11, 2025 · 3:00–4:00 PM ET

A tour of geometric and manifold-based approaches to parameter-efficient fine-tuning, tracing how orthogonality, constrained optimization, and algebraic structure can reduce catastrophic forgetting and improve efficiency beyond vanilla LoRA.

Reference →

Paper Presentation · S3 · November 4, 2025

Neuroscience

Model-Optimized Stimuli for Comparing Brain-Alignment of Encoding and Generative Models

Binxu Wang (Kempner Research Fellow)
November 4, 2025 · 3:00–4:00 PM ET

A comparison of closed-loop stimulus synthesis approaches — generative models vs. encoding model-based feature accentuation — as stringent tests of brain–model alignment across the ventral visual stream.

Read Paper →

Paper Presentation · S2 · October 28, 2025

ML Interp

Interpretable Anomaly Detection for Scientific Discovery, Open-World Novelty Detection, and Generative Model Validation

Gaia Grosso (IAIFI Fellow, Harvard/MIT)
October 28, 2025 · 3:00–4:00 PM ET

This work introduces SparKer, a sparse ensemble of Gaussian kernels grounded in a semi-supervised Neyman–Pearson framework, designed to detect and interpret anomalies in high-dimensional representation spaces with sparsity, locality, and competitive allocation.

Read Paper →

Paper Presentation · S1 · October 21, 2025 · Pilot Session

ML Interp

Toy Models of Superposition

Sumedh Hindupur (Applied Mathematics PhD Candidate) · Neuro discussant: Shubham Choudhary
October 21, 2025 · 3:00–4:00 PM ET · SEC 6.242, Kempner Institute

The inaugural PRISM session: a presentation of the foundational "Toy Models of Superposition" paper, exploring how neural networks represent more features than they have dimensions — and what this reveals about the geometry of representations.

Read Paper →

Using Interpretability to Identify a Novel Class of Alzheimer's Biomarkers

About PRISM

Spring 2026 Focus Areas

Organizers

Shubham Choudhary

Sumedh Hindupur

Demba Ba