Sparse autoencoders for LLM interpretability — a three-paper review
Plain-English walkthrough of Bricken et al.'s Towards Monosemanticity, Anthropic's Scaling Monosemanticity on Claude 3 Sonnet, and OpenAI's TopK SAEs on GPT-4.
Reading-register key
- From the paper: content drawn directly from one of the three reviewed papers’ text, equations, tables, or figures.
- [Analysis]: the publication’s own reasoned assessment, distinct from any claim the reviewed papers make.
- [Reconstructed]: content faithfully reconstructed because the source partially disclosed it; flagged where used.
- [External comparison]: comparison to prior or contemporary work outside the three reviewed papers.
- [Reviewer Perspective]: a critical or speculative assessment that goes beyond what any reviewed paper proves.
Section 1 — Paper identity and scope
This article is a three-paper review of the sparse-autoencoder (SAE) line of work in mechanistic interpretability between October 2023 and June 2024. The three papers, in order:
- Paper A. Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn et al., Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Anthropic / Transformer Circuits Thread, October 2023. 1 Web-native report on transformer-circuits.pub; no arXiv ID; venue-canonical artefact is the HTML report itself.
- Paper B. Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken et al., Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Anthropic / Transformer Circuits Thread, May 2024. 2 Same venue; the production-scale follow-up to Paper A.
- Paper C. Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu, Scaling and Evaluating Sparse Autoencoders, OpenAI, arXiv:2406.04093, June 2024; published as an Oral at ICLR 2025. 3
Retrieval status. Paper C was retrieved via the arXiv abstract page and the ar5iv HTML render. Papers A and B are non-arXiv web-native reports; this review draws on the canonical transformer-circuits.pub URLs plus the corresponding Anthropic companion blog posts and OpenReview / LessWrong reviewer-perspective material listed in Sources. Where a paper detail could not be fetched verbatim because of access constraints, the article body flags the claim with [Reconstructed] and grounds it in a secondary citation.
Classification. All three papers: Interpretability · Representation learning · AI safety · LLM-based · Theoretical (sparse-coding lineage) · Architecture analysis. Paper C additionally is a Benchmark / evaluation-methodology contribution.
Technical abstract (publication voice). Sparse autoencoders learn an over-complete dictionary $W_d \in \mathbb{R}^{d \times N}$ (with $N > d$) such that residual-stream activations $x \in \mathbb{R}^d$ from a transformer can be written as $x \approx W_d z + b_d$ with $z \in \mathbb{R}^N$ sparse. Each non-zero entry of $z$ is a feature. Paper A demonstrates the recipe on a one-layer transformer and shows that the dictionary directions are far more interpretable than individual neurons. Paper B scales the same recipe to Claude 3 Sonnet, extracts up to 34 million features from a middle-layer residual stream, and exhibits causal control via feature steering (the Golden Gate Bridge experiment). Paper C replaces the L1-penalty objective with a TopK activation, derives a clean joint scaling law across dictionary size $N$ and active-feature count $k$, and trains a 16-million-latent SAE on GPT-4 residual-stream activations.
Primary research question. Can the polysemantic neurons of a transformer’s residual stream be decomposed into a sparse, over-complete, monosemantic feature dictionary, and does the recipe scale to frontier production models?
Core technical claim. Yes, and with appropriately chosen sparsity machinery the reconstruction–sparsity Pareto frontier improves predictably with autoencoder size and training compute.
Domains and depth. Linear algebra (deep) · sparse coding / dictionary learning (deep) · transformer residual streams (moderate) · scaling laws (moderate, Paper C only) · feature steering and causal interpretability (deep, Paper B only).
Reader prerequisites. High-school algebra; familiarity with matrix-vector multiplication helpful but not required, since the Glossary in Section 2.5 covers everything. The article does not assume the reader has trained a neural network.
Section 2 — TL;DR and executive overview
TL;DR. Imagine a transformer’s hidden vectors as messy mixtures of many ideas crammed into a small set of dimensions — one direction in the vector might mean “Arabic script and DNA and legal language” all at once, which makes the model hard to understand. A sparse autoencoder learns a much larger dictionary of single-meaning directions (called features) and rewrites each hidden vector as a tiny combination of those features, so each direction now stands for one thing. Three papers from late 2023 through mid-2024 show the trick works on a tiny model, then on Anthropic’s production Claude 3 Sonnet (where amplifying a “Golden Gate Bridge” feature makes the model believe it is the bridge), and finally at GPT-4 scale with 16 million features and a cleaner training recipe.
Executive summary. Mechanistic interpretability wants to read the inside of a language model the way a hardware engineer reads a circuit diagram. The challenge: a model’s neurons are polysemantic. Sparse autoencoders are a learned, unsupervised way to disentangle polysemantic neurons into a much larger set of single-meaning features. Paper A proves the recipe on a toy model. Paper B scales it to a frontier production model and shows that the features causally control behaviour, including safety-relevant behaviours like sycophancy and unsafe code generation. Paper C systematises training, kills the long-standing “dead latents” pathology with a TopK activation plus auxiliary loss, and gives the field its first clean scaling laws for interpretability tooling.
Practitioner-relevant takeaways.
- An SAE is a one-hidden-layer linear-encoder, linear-decoder model with a sparsity penalty; the architecture is trivial — the difficulty is everything around training stability, dead-latent prevention, and evaluation.
- L1-penalised ReLU SAEs (Paper A and B) and TopK SAEs (Paper C) are not interchangeable: TopK directly controls the sparsity budget per token and removes the shrinkage bias of the L1 penalty.
- SAEs are read-only by construction (they sit beside a frozen model). They do not fine-tune the model. Inference cost is purely additive when running the SAE in evaluation mode.
- Feature steering (clamping a feature high or low at runtime) is a credible causal-control mechanism per Paper B’s Golden Gate Bridge demo, but it is brittle at the long tail of less-activating features.
- Pre-existing SAE infrastructure is open: OpenAI’s training and visualisation code is on GitHub, and community implementations track Anthropic’s recipe.
Pipeline overview. All three papers share the same shape. Training: freeze a base language model; collect a large pool of residual-stream activations at one chosen layer; train a one-hidden-layer autoencoder with an explicit sparsity-inducing objective on that pool. Inference / analysis: at any token position, encode the model’s residual-stream activation into the sparse feature vector and read off (or intervene on) the active features. The base model never sees the SAE during its own training; the SAE is a post-hoc tool.
Section 2.5 — Glossary
| Term | Plain-English explanation | First appears in |
|---|---|---|
| Residual stream | The “main highway” inside a transformer where each layer reads, transforms, and writes vectors. Every layer’s output flows back into this vector. | Section 1 |
| Polysemantic neuron | A single neuron whose activation pattern responds to many unrelated concepts at once (e.g., both Arabic script and DNA sequences). | Section 2 |
| Monosemantic feature | A learned direction whose activation pattern responds to one and only one concept. | Section 2 |
| Sparse autoencoder (SAE) | A small neural network that takes a transformer’s hidden vector, expands it into a much larger sparse vector (most entries zero), and reconstructs the original. | Section 1 |
| Dictionary / decoder matrix | The matrix whose columns are the candidate feature directions; the reconstruction is a sparse combination of these columns. | Section 2 |
| Sparsity (L0) | The count of non-zero entries in the SAE’s feature vector for one token; a measure of how few features fire per token. | Section 5 |
| L1 penalty | A training loss term that adds the absolute-value sum of feature activations; pushes most activations toward zero. | Section 5 |
| TopK activation | A non-linearity that keeps only the $k$ largest entries of a vector and zeros the rest; directly controls L0. | Section 6 |
| Dead latent | A feature in the SAE that has stopped activating on any input during training (gradient signal vanished). | Section 6 |
| Feature steering | Clamping one or more SAE features to a chosen value at inference time, and observing how the model’s downstream behaviour changes. | Section 5 |
| Reconstruction MSE | Mean squared error between the original residual-stream vector and the SAE’s reconstruction; the headline accuracy measure. | Section 6 |
| Scaling law | An empirical power-law relationship between a metric (e.g., loss) and a resource (e.g., parameters, tokens, compute). | Section 6 |
| [Analysis] label | The publication’s own reasoned assessment, distinct from what the paper itself claims. | Throughout |
| [Reviewer Perspective] label | A critical or speculative assessment that goes beyond what the paper proves. | Section 12 |
| “From the paper:” prefix | Content directly supported by the paper’s text, equations, tables, or figures. | Throughout |
| [Reconstructed] / [External comparison] | Faithful reconstruction of partially-disclosed paper content / comparison to prior or contemporaneous work outside the three reviewed papers. | Where used |
The high-school reader should be able to read the rest of the article using this Glossary alone as a dictionary. Where a technical term appears in body prose and is not in the Glossary, it is defined inline on first use.
Header figure of Towards Monosemanticity (Bricken et al., Anthropic, 2023), reproduced for editorial coverage of sparse-autoencoder interpretability.
Section 3 — Problem formalisation
Notation table.
| Symbol | Type | Meaning | First appears in |
|---|---|---|---|
| $d$ | integer | Residual-stream dimension of the base model (e.g., 512 for Paper A’s toy transformer; thousands for Paper B and C) | Section 3 |
| $N$ | integer | Number of SAE features (dictionary size); $N > d$ | Section 3 |
| $x$ | $\mathbb{R}^d$ | One residual-stream activation vector at the chosen layer for one token | Section 3 |
| $z$ | $\mathbb{R}^N$ | Sparse feature-activation vector | Section 3 |
| $W_e$ | $\mathbb{R}^{N \times d}$ | Encoder matrix (rows are “what to look for”) | Section 5 |
| $W_d$ | $\mathbb{R}^{d \times N}$ | Decoder / dictionary matrix (columns are feature directions) | Section 5 |
| $b_e$, $b_d$ | $\mathbb{R}^N$, $\mathbb{R}^d$ | Encoder and decoder bias vectors | Section 5 |
| $\hat{x}$ | $\mathbb{R}^d$ | SAE reconstruction of $x$ | Section 5 |
| $\lambda$ | scalar | L1 sparsity coefficient (Paper A and B) | Section 6 |
| $k$ | integer | TopK sparsity budget per token (Paper C) | Section 6 |
| $L_0$ | integer | Per-token sparsity, i.e., the number of non-zero entries of $z$ | Section 6 |
Formal problem statement. Given a frozen language model $M$ and a chosen layer $l$, define the activation distribution $P_l$ as the distribution of residual-stream vectors $x \in \mathbb{R}^d$ produced by passing a token stream through $M$ and reading the layer-$l$ residual. Find a triple $(W_e, W_d, b)$ with $W_e \in \mathbb{R}^{N \times d}$, $W_d \in \mathbb{R}^{d \times N}$, $b = (b_e, b_d)$ such that
$$z = f\big(W_e (x - b_d) + b_e\big), \qquad \hat{x} = W_d z + b_d$$
minimises a combined reconstruction-plus-sparsity objective on samples from $P_l$. The non-linearity $f$ is ReLU in Paper A and B, and TopK in Paper C. The columns of $W_d$ are interpreted as feature directions and the entries of $z$ as feature activations.
Explicit assumptions.
- Linear superposition. The residual stream can be modelled as a sparse linear sum of feature directions plus reconstruction noise. From Paper A: this hypothesis traces to Elhage et al.’s 2022 Toy Models of Superposition and is the load-bearing assumption of the entire SAE programme. [Analysis] Potentially strong assumption: if the underlying structure is meaningfully non-linear in places, an SAE will fold the non-linearity into spurious feature combinations.
- Stationarity of the activation distribution. The pool of activations used for training is assumed representative of inputs the SAE will be applied to. From Paper B: 8 billion to multiple-tens-of-billions of tokens of training activations are drawn from a diverse corpus; from Paper C: 40 billion tokens for the GPT-4 16M-latent run.
- One-layer monosemanticity is the right granularity. From Paper A and B: features are learned at a single residual-stream layer. Paper B explicitly notes this as a methodological limitation per the “limitations” section.
- Interpretability is testable by spot-checks. From Paper A: a feature is “monosemantic” if human raters or LLM-graded summaries assign it a coherent single concept; this is verified by manual sampling.
Why the problem is hard. Three structural reasons. First, $N$ is large: Paper B trains up to $N \approx 3.4 \times 10^7$ features for Claude 3 Sonnet; Paper C trains $N \approx 1.6 \times 10^7$ for GPT-4. Second, the sparsity-versus-reconstruction trade-off has no clean closed-form sweet spot; Paper A had to tune $\lambda$ across runs and Paper C’s central contribution is removing that tuning. Third, dead latents (features that stop activating during training) are an empirical pathology that worsens at scale; Paper C reports prior recipes hit ~90% dead-latent rates at large $N$.
Causal-discovery framing. Paper B is the only one that operationalises a causal claim: clamping a feature high (steering) changes downstream model behaviour in a way consistent with the feature’s hypothesised meaning. The causal object is an intervention on $z_i$ at inference time. The identification assumption is that the SAE is a faithful enough decomposition that altering one $z_i$ corresponds to altering one concept in the model. [Analysis] This is contested in independent commentary; see Section 12.
LLM role. In all three papers the LLM is the subject of analysis. The SAE itself is not an LLM. In Paper B and C, LLM-graded automated explanations are also used to evaluate features (“Neuron-to-Graph” / N2G in Paper C).
Section 4 — Motivation and gap
Real-world problem. A frontier language model has on the order of $10^{11}$ to $10^{12}$ parameters. Behaviour audits in 2023 surfaced concrete failures (hallucinations, jailbreaks, sycophancy, prompt-injection susceptibility) that the field could not localise inside the model. From Paper B: “Whether large language models really ‘understand’ what they’re saying, and whether they could be deceiving us, are open and important questions.” [External comparison] This frames SAEs against earlier interpretability programmes, namely Olah et al.’s Distill circuits work on vision and Elhage et al.’s Mathematical Framework for Transformer Circuits, both of which scaled poorly past tiny models.
Existing approaches and their failure modes. Paper A’s related-work section discusses neuron-level interpretability (failed because most neurons are polysemantic), probing (linear-probe accuracy tells you a feature is somewhere, not which direction), and PCA-style decompositions (no sparsity guarantee, no causal-control story). Dictionary learning predates LLMs by decades — Olshausen and Field 1996 — but had not been applied at scale to transformer residual streams before this line.
The gap each paper fills. Paper A: prove the SAE recipe is viable on the smallest non-trivial transformer (one layer, 512 MLP neurons) and show the features are subjectively interpretable. Paper B: prove the recipe scales to a frontier production model and that the features have causal effect on outputs. Paper C: turn the empirical recipe into a benchmarked, scaling-law-governed engineering discipline with a sparsity-control mechanism that doesn’t depend on tuning $\lambda$.
Practical stakes. From Paper B and Anthropic’s companion blog post: SAEs surface safety-relevant features — sycophantic praise, deceptive behaviour, unsafe-code patterns, dangerous biological-knowledge clusters — that can be monitored or clamped at inference. [External comparison] This positions SAEs as a candidate primitive for the “test set for safety” framing Anthropic uses, alongside red-teaming, constitutional AI, and activation steering.
Position in the research landscape. [External comparison] The three papers sit in the mechanistic interpretability sub-field, with adjacency to activation steering (Turner et al., Zou et al.) and representation engineering. The SAE programme is unsupervised — it does not require a labelled probe set — which distinguishes it from contrastive-pair activation-steering methods.
Section 5 — Method overview
Each paper layers on the previous one. This section walks the shared architecture first, then the per-paper differences.
Shared architecture
A sparse autoencoder is a single-hidden-layer neural network:
$$z = f\big(W_e (x - b_d) + b_e\big), \qquad \hat{x} = W_d z + b_d$$
The intuition: subtracting $b_d$ centres the input around the dataset mean (the “pre-bias” trick in Paper A); $W_e$ projects the centred input into the larger feature space; the non-linearity $f$ zeros out negative or small entries to enforce sparsity; $W_d$ reconstructs the original by treating $z$ as coefficients in a column-dictionary expansion. The dictionary columns of $W_d$ are typically constrained to unit-norm so the magnitude of each $z_i$ is unambiguous.
The base model is frozen. Activations are collected once into a large pool; the SAE is trained on that pool by minibatch gradient descent on an MSE-plus-sparsity loss. After training, the SAE can be applied to any activation from a new input by running the encoder–decoder forward pass.
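[Analysis] The encoder-decoder forward pass described above can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming a ReLU non-linearity and made-up toy weights with $d = 2$ and $N = 3$; the function names `matvec` and `sae_forward` are hypothetical and not taken from any of the reviewed papers.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sae_forward(x, W_e, b_e, W_d, b_d):
    """Return (z, x_hat) for a ReLU sparse autoencoder.

    W_e has N rows of length d; W_d has d rows of length N.
    """
    centred = [xi - bi for xi, bi in zip(x, b_d)]            # subtract the pre-bias
    pre_act = [u + b for u, b in zip(matvec(W_e, centred), b_e)]
    z = [max(0.0, u) for u in pre_act]                       # ReLU zeros negatives
    x_hat = [r + b for r, b in zip(matvec(W_d, z), b_d)]     # decode and re-add bias
    return z, x_hat


# Toy weights (illustrative only): d = 2, N = 3.
W_e = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W_d = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b_e = [0.0, 0.0, 0.0]
b_d = [0.5, 0.5]

z, x_hat = sae_forward([1.5, 0.5], W_e, b_e, W_d, b_d)
print(z)      # only the first feature is active
print(x_hat)  # reconstruction of the input
```

In this toy case a single active feature is enough to reconstruct the input exactly; real SAEs trade some reconstruction error for sparsity.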
Paper A — Towards Monosemanticity (Anthropic, October 2023)
Subject model. From the paper: a one-layer transformer (called the “A/1” or “1L” model) with $d = 512$ residual-stream / MLP dimension, trained on natural-language text. The SAE is trained on activations at the one MLP layer. 1
Dictionary size sweep. From the paper: SAEs trained at a sweep of dictionary sizes $N$. The headline run at $N = 4096$ is the principal artefact, with 8× expansion over the base 512.
Non-linearity. ReLU. The loss is
$$\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$$
with $\lambda$ tuned per run.
Feature evaluation. From the paper: a feature is graded by two criteria: specificity (does it activate on tokens consistent with one hypothesised concept?) and manipulation (does up-/down-scaling it change downstream predictions in the expected direction?). Human raters scored 4,096 features; the report exhibits features for Arabic script, base64 encoding, Hebrew text, DNA sequences, legal language, and HTTP request headers among others. 4
Engineering tricks that matter. The pre-bias in the encoder formula, unit-norm decoder columns, resampling of dead neurons during training, and what the paper calls “ghost gradients” (a fix to give gradient signal back to dead latents). [Reconstructed] Anthropic’s January 2024 Circuits Updates later acknowledged a bug in the initial ghost-grads implementation; the underlying idea (give dead latents a synthetic gradient signal proportional to residual error) carries through into Paper C’s AuxK loss.
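[Analysis] Of the tricks listed above, the unit-norm decoder constraint is the simplest to show concretely. The sketch below rescales each decoder column to unit L2 norm; it assumes a small dense matrix stored as a list of rows and no all-zero columns, and the helper name `normalise_columns` is hypothetical rather than anything the papers define.

```python
import math


def normalise_columns(W_d):
    """Rescale each column of a d x N matrix (list of rows) to unit L2 norm.

    Assumes no column is all zeros (that would divide by zero).
    """
    d, n = len(W_d), len(W_d[0])
    norms = [math.sqrt(sum(W_d[r][c] ** 2 for r in range(d))) for c in range(n)]
    return [[W_d[r][c] / norms[c] for c in range(n)] for r in range(d)]


# Two feature directions with norms 5 and 2 before normalisation.
W_d = [[3.0, 0.0],
       [4.0, 2.0]]
W_unit = normalise_columns(W_d)
print(W_unit)
```

With unit-norm columns, the size of a feature's contribution to the reconstruction is carried entirely by its activation, which is what makes activation magnitudes comparable across features.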
Design rationale. Paper A justifies the choice of MLP-layer activations over residual-stream activations on the toy model by simplicity. The MLP output is the layer’s “writeable” surface, and decomposing it into features yields a clear input–output story. Subsequent work (Paper B and C) decomposes residual-stream activations directly.
Paper B — Scaling Monosemanticity (Anthropic, May 2024)
Subject model. From the paper: Claude 3 Sonnet, a production-deployed mid-sized member of the Claude 3 model family released March 2024. 2 [Reconstructed] The paper does not publish the parameter count; independent estimates put it on the order of 70B parameters. 5
Layer. A single middle residual-stream layer. From the paper’s limitations: “Our method is confined to a single residual layer in the middle of the model.” 6
Dictionary sizes. From the paper: three SAEs, with $N = 1{,}048{,}576$ (1M), $N = 4{,}194{,}304$ (4M), and $N = 33{,}554{,}432$ (34M) features respectively.
Non-linearity and loss. ReLU + L1, same formula as Paper A. The L1 coefficient is tuned per run.
Feature catalogue. From the paper: features ranging from concrete entities (the Golden Gate Bridge feature; transit-system features; programming-language features) to abstract behaviours (sycophantic praise; deception; power-seeking; manipulation; backdoors in code; instructions for synthesising bioweapons; gender / racial bias features). Each feature is exhibited with top-activating examples and (for many) a steering experiment.
Steering / causal-control experiment. From the paper: clamping the Golden Gate Bridge feature to a high positive value at inference, the model produces outputs that identify itself as the Golden Gate Bridge (“I am the Golden Gate Bridge … my physical form is the iconic bridge itself”). 7 Anthropic publicly deployed this as “Golden Gate Claude” for a brief experimental window in May 2024.
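[Analysis] The arithmetic of clamping a feature can be sketched as a shift of the activation along that feature's decoder column. This is an illustration only: it assumes steering leaves the SAE's reconstruction error untouched and changes only the chosen feature's contribution. The name `steer`, the toy matrices, and the clamp value are all hypothetical, and the sketch makes no claim about how the Golden Gate Bridge experiment was implemented.

```python
def steer(x, z, W_d, feature, clamp_value):
    """Shift activation x so `feature` contributes clamp_value instead of z[feature].

    Equivalent to re-decoding with one latent clamped while keeping the
    SAE reconstruction error unchanged (an assumption of this sketch).
    """
    delta = clamp_value - z[feature]
    direction = [row[feature] for row in W_d]    # decoder column of the feature
    return [xi + delta * di for xi, di in zip(x, direction)]


# Toy decoder: d = 2, N = 3 (values are illustrative).
W_d = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
x = [0.5, 0.25]
z = [0.5, 0.25, 0.0]

x_steered = steer(x, z, W_d, feature=1, clamp_value=4.0)
print(x_steered)
```

The steered vector would then replace the original activation in the frozen model's forward pass, and the change in downstream output is what gets observed.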
Design rationale. Paper B’s central claim is one of transfer: the SAE recipe from Paper A, scaled appropriately, applies to a production frontier model. The diversity and abstractness of features (deception, sycophancy, manipulation) is presented as evidence that SAEs surface internal model structure that matters for safety.
Limitations the paper itself flags. Single-layer scope; uncertainty about whether the discovered features are the causally important ones or merely the legibly important ones; the high-activation conditioning quirk (the Golden Gate Bridge feature is well-described as “Golden Gate Bridge” only when conditioned on its top-10% activations). 8
Paper C — Scaling and Evaluating Sparse Autoencoders (OpenAI, June 2024)
Subject model. From the paper: GPT-4 for the headline 16M-latent SAE, with smaller-scale runs on GPT-2 small. 3 Layer: a residual-stream layer (layer 8 for GPT-2 small; the GPT-4 layer is identified by index in the paper without further disclosure).
Dictionary sizes. Up to $N \approx 1.6 \times 10^7$ (16M) latents on GPT-4, trained on 40 billion tokens of activations.
Non-linearity: TopK. From the paper: replace ReLU + L1 with
$$z = \mathrm{TopK}_k\big(W_e (x - b_d) + b_e\big)$$
where $\mathrm{TopK}_k$ keeps the $k$ largest entries of its input and zeros the rest. This directly fixes the per-token sparsity $L_0 = k$.
Loss. Pure reconstruction MSE plus an auxiliary loss called AuxK (Section 6 below). No L1 penalty.
Dead-latent prevention. From the paper: encoder rows are initialised parallel to the corresponding decoder columns (transpose-initialisation), and the AuxK loss reconstructs the residual error using only the top-$k_{\text{aux}}$ dead latents (the paper uses $k_{\text{aux}} = 512$). Result: the GPT-4 16M SAE has 7% dead latents versus reported 90% in earlier recipes at comparable scale. 9
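[Analysis] Any dead-latent remedy first needs a way to decide which latents count as dead. One simple bookkeeping scheme, sketched below, counts tokens since each latent last fired and flags those past a threshold. The class name `DeadLatentTracker` and the `window` parameter are hypothetical, and the tiny window used here is for illustration only.

```python
class DeadLatentTracker:
    """Flags latents that have not fired for `window` consecutive tokens."""

    def __init__(self, n_latents, window):
        self.window = window
        self.tokens_since_fired = [0] * n_latents

    def update(self, z_batch):
        """z_batch: list of per-token latent vectors (each of length n_latents)."""
        for z in z_batch:
            for i, value in enumerate(z):
                if value != 0:
                    self.tokens_since_fired[i] = 0
                else:
                    self.tokens_since_fired[i] += 1

    def dead_mask(self):
        return [count >= self.window for count in self.tokens_since_fired]


tracker = DeadLatentTracker(n_latents=3, window=2)
tracker.update([[1.0, 0.0, 0.0],
                [0.5, 0.0, 2.0]])
mask = tracker.dead_mask()
print(mask)  # latent 1 never fired in the last two tokens
```

A mask of this kind is what an auxiliary loss over dead latents would consume at each training step.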
Evaluation framework. From the paper: four families of metrics: downstream KL / cross-entropy loss when reconstructions are spliced back into the model; 1D probe loss on 61 binary tasks; explanation precision / recall via the N2G n-gram method; and ablation sparsity. All four metrics improve monotonically with $N$.
Design rationale. Paper C’s contribution is making SAE training a predictable engineering exercise rather than an artisanal one. The TopK activation removes the $\lambda$-tuning treadmill; the AuxK loss kills the dead-latent failure mode; the scaling law lets a practitioner predict reconstruction loss before training.
Per-component classification
- [New] in Paper A: the pre-bias term $b_d$; the unit-norm decoder constraint as default; ghost-gradient dead-neuron treatment.
- [Adopted] from Olshausen and Field 1996 and from the wider dictionary-learning literature: the L1-penalised reconstruction objective itself.
- [New] in Paper B: scaling to a production-frontier model; the Golden Gate Bridge steering protocol; the feature-catalogue methodology for safety-relevant features.
- [New] in Paper C: TopK activation in the SAE context (TopK itself is [Adopted] from k-sparse coding); the AuxK auxiliary loss; the joint scaling law $L(N, k)$; the four-metric evaluation framework.
- [Adapted] across the line: the encoder–decoder architecture has been adapted from k-sparse autoencoder work going back to Makhzani and Frey 2013.
Section 6 — Mathematical contributions
MATH ENTRY 1 — L1-penalised SAE objective (Paper A and B).
- Source. Paper A “Setup” section; Paper B “SAE architecture” section.
- What it is. The loss the SAE minimises to balance accurate reconstruction against feature sparsity.
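- Formal definition.
  $$\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$$
- Each term explained.
  - $x \in \mathbb{R}^d$ is one residual-stream activation, e.g., $d = 512$ in Paper A; the worked example below uses a much smaller toy vector.
  - $\hat{x} = W_d z + b_d$ is the SAE’s reconstruction.
  - $\lVert x - \hat{x} \rVert_2^2$ is the squared L2 norm: the sum of squared entries of the difference vector.
  - $z$ is the SAE’s sparse feature-activation vector; the penalty $\lVert z \rVert_1 = \sum_i |z_i|$ pushes most $z_i$ to zero.
  - $\lambda$ is the sparsity coefficient; larger $\lambda$ means stronger sparsity at the cost of higher reconstruction error.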
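- Worked numerical example (illustrative values). Take $d = 4$, $N = 8$, $\lambda = 0.1$. Let $x = (1.0, -0.5, 2.0, 0.3)$. Suppose the SAE produces $z = (0, 1.2, 0, 0, 0.8, 0, 0.1, 0)$, three non-zero features. Then $\lVert z \rVert_1 = 1.2 + 0.8 + 0.1 = 2.1$. If $\hat{x} = (0.9, -0.4, 1.9, 0.4)$, the reconstruction error is $4 \times 0.1^2 = 0.04$. Total loss for this sample: $0.04 + 0.1 \times 2.1 = 0.25$. The model would prefer to drive the smallest active feature ($z_6 = 0.1$, zero-indexed) to zero if it could spare the reconstruction error; this is the trade-off $\lambda$ governs.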
- Role. The training objective for Paper A and B. Every gradient step on $(W_e, W_d, b_e, b_d)$ minimises this expectation over the activation pool.
- Edge cases. As $\lambda \to 0$, the SAE collapses to a dense linear autoencoder and features lose sparsity. As $\lambda \to \infty$, all $z_i$ go to zero and reconstruction collapses. The sweet spot is empirical.
- Novelty. [Adopted] from sparse coding (Olshausen and Field 1996); [Adapted] for transformer residual streams by Paper A.
- Transferability. [Analysis] The objective is layer-agnostic, model-agnostic, and applies to any activation pool. The bottleneck is the empirical $\lambda$-tuning, which Paper C removes.
- Why it matters. This is the entire training loss of Paper A and Paper B; everything downstream (feature discovery, steering, safety analysis) hinges on its convergence.
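[Analysis] The L1-penalised objective is easy to evaluate by hand for one sample. The sketch below does so for made-up illustrative values ($d = 4$, $N = 8$, $\lambda = 0.1$); the function name `l1_sae_loss` is hypothetical and the numbers come from no paper.

```python
def l1_sae_loss(x, x_hat, z, lam):
    """Return (reconstruction term, sparsity term, total) for one sample."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))   # squared L2 error
    sparsity = lam * sum(abs(v) for v in z)                # lambda * L1 norm
    return recon, sparsity, recon + sparsity


x = [1.0, -0.5, 2.0, 0.3]
x_hat = [0.9, -0.4, 1.9, 0.4]
z = [0.0, 1.2, 0.0, 0.0, 0.8, 0.0, 0.1, 0.0]

recon, sparsity, total = l1_sae_loss(x, x_hat, z, lam=0.1)
print(round(recon, 4), round(sparsity, 4), round(total, 4))
```

Raising `lam` increases the weight of the second term, so the optimiser is pushed to zero out weakly active features even at some cost in reconstruction error.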
MATH ENTRY 2 — TopK activation (Paper C).
- Source. Paper C Section 2.
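- What it is. A non-linearity that keeps exactly the $k$ largest entries of a vector and zeros the rest, replacing ReLU + L1.
- Formal definition. For a vector $u \in \mathbb{R}^N$,
  $$\mathrm{TopK}_k(u)_i = \begin{cases} u_i & \text{if } u_i \text{ is among the } k \text{ largest entries of } u \\ 0 & \text{otherwise} \end{cases}$$
  The SAE encoder is then $z = \mathrm{TopK}_k\big(W_e (x - b_d) + b_e\big)$.
- Each term explained.
  - $u$ is the pre-activation vector of length $N$.
  - $k$ is the sparsity budget per token, chosen as a hyperparameter (e.g., $k = 32$).
  - The output is a vector of length $N$ with exactly $k$ non-zero entries by construction.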
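- Worked numerical example (illustrative values). Take $N = 6$, $k = 3$. Let $u = (1.5, -0.3, 2.4, 0.2, 1.9, -1.0)$. The three largest entries are $2.4$ at index 2, $1.9$ at index 4, and $1.5$ at index 0. So $\mathrm{TopK}_3(u) = (1.5, 0, 2.4, 0, 1.9, 0)$. The L0 sparsity is exactly 3 by construction. Negative entries are zeroed; positive entries below the third-largest are also zeroed.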
- Role. Encoder non-linearity in Paper C. Removes the need for an L1 penalty because sparsity is structurally enforced.
- Edge cases. If two entries tie for the $k$-th largest, the paper resolves by index order. If all entries are negative, TopK can still output negatives; Paper C also discusses a ReLU-then-TopK variant in ablations. The forward pass is non-differentiable at the threshold boundary, but the straight-through gradient on the kept entries works in practice.
- Novelty. [Adopted] from k-sparse autoencoders (Makhzani and Frey 2013); [Adapted] to transformer SAE training at GPT-4 scale by Paper C, including the engineering work to make the top-$k$ selection efficient at $N \approx 1.6 \times 10^7$.
- Transferability. [Analysis] Drop-in replacement for L1; pairs well with the AuxK loss for dead-latent prevention.
- Why it matters. Removes the $\lambda$-tuning treadmill that consumed substantial researcher effort in Paper A and B.
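[Analysis] TopK itself is a few lines of code. The sketch below is a plain-Python illustration, assuming ties are broken in favour of the lower index; the function name `top_k` and the input vector are made up for the example, and a production implementation would use a batched tensor primitive instead.

```python
def top_k(u, k):
    """Keep the k largest entries of u (ties go to the lower index); zero the rest."""
    order = sorted(range(len(u)), key=lambda i: (-u[i], i))
    keep = set(order[:k])
    return [value if i in keep else 0.0 for i, value in enumerate(u)]


u = [1.5, -0.3, 2.4, 0.2, 1.9, -1.0]
z = top_k(u, 3)
print(z)

# With all-negative inputs the largest (least negative) entry still survives,
# which is why a ReLU-then-TopK variant is sometimes considered.
z_negative = top_k([-1.0, -2.0], 1)
print(z_negative)
```

Because the number of surviving entries is fixed by `k`, no sparsity coefficient needs tuning.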
MATH ENTRY 3 — AuxK auxiliary loss (Paper C).
- Source. Paper C Section 2.3.
- What it is. A second loss term that gives gradient signal to dead latents — features that have stopped activating — by asking them to reconstruct the residual error of the main encoder.
- Formal definition. Let $e = x - \hat{x}$ be the main reconstruction error and let $\hat{e} = W_d z_{\text{dead}}$ be a reconstruction of $e$ that uses only the top-$k_{\text{aux}}$ entries among currently-dead latents. The full training loss is
  $$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \alpha \lVert e - \hat{e} \rVert_2^2$$
  with $\alpha = 1/32$ and $k_{\text{aux}} = 512$ in the paper.
- Each term explained.
  - $e$ is the main SAE’s reconstruction residual, i.e., what the live latents could not capture.
  - $\hat{e}$ is what the dead latents would produce if they were used to reconstruct $e$.
  - The AuxK term penalises the gap between the two, which forces dead latents to receive non-zero gradient signal proportional to their potential utility in reconstructing the residual.
  - $\alpha$ weights AuxK relative to the main loss; the paper finds the loss is robust across a band around $\alpha = 1/32$.
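- Worked numerical example (illustrative values). Suppose $e = (0.2, -0.1, 0.3, 0.0)$. If the top-$k_{\text{aux}}$ dead latents reconstruct it as $\hat{e} = (0.1, -0.1, 0.1, 0.0)$, then $\lVert e - \hat{e} \rVert_2^2 = 0.01 + 0 + 0.04 + 0 = 0.05$. Multiplied by $\alpha = 1/32$, the AuxK contribution is $0.05 / 32 \approx 0.0016$. Small in absolute terms, but every gradient step now passes signal through the dead latents.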
- Role. Dead-latent prevention in Paper C.
- Edge cases. If there are no dead latents, $\hat{e} = 0$ and the AuxK term reduces to $\alpha \lVert e \rVert_2^2$ (which adds a small constant scaling of the main loss and is innocuous). If $\alpha$ is too large, dead latents take over from live ones and quality degrades.
- Novelty. [New] in Paper C; conceptually descended from Paper A’s “ghost gradients” mechanism.
- Why it matters. Drops dead-latent rate from 90% to 7% at GPT-4 scale per the paper, which is what makes large-$N$ runs informative.
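[Analysis] The AuxK term can be sketched as "TopK restricted to dead latents, decode, compare with the residual". The code below is an illustration on toy lists, with the weight set to 1/32 and `k_aux = 1` purely for the example; the helper names `top_k_masked` and `auxk_loss` are hypothetical.

```python
def top_k_masked(u, mask, k):
    """TopK restricted to positions where mask is True; everything else is zero."""
    candidates = [i for i, allowed in enumerate(mask) if allowed]
    keep = set(sorted(candidates, key=lambda i: (-u[i], i))[:k])
    return [value if i in keep else 0.0 for i, value in enumerate(u)]


def auxk_loss(e, u, dead_mask, W_d, k_aux, alpha):
    """alpha * ||e - e_hat||^2, with e_hat decoded from the top-k_aux dead latents."""
    z_dead = top_k_masked(u, dead_mask, k_aux)
    e_hat = [sum(w * z for w, z in zip(row, z_dead)) for row in W_d]
    return alpha * sum((a - b) ** 2 for a, b in zip(e, e_hat))


# Toy setting: d = 2, N = 4; latents 2 and 3 are currently dead.
W_d = [[1.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 1.0]]
u = [2.0, 1.0, 0.1, 0.3]            # encoder pre-activations
e = [0.2, 0.5]                      # residual left by the live latents
dead = [False, False, True, True]

loss_with_dead = auxk_loss(e, u, dead, W_d, k_aux=1, alpha=1 / 32)
loss_no_dead = auxk_loss(e, u, [False] * 4, W_d, k_aux=1, alpha=1 / 32)
print(loss_with_dead, loss_no_dead)
```

When no latent is dead the reconstruction of the residual is zero, so the term collapses to the weight times the squared norm of the residual.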
MATH ENTRY 4 — Joint scaling law (Paper C).
- Source. Paper C Section 3.
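- What it is. An empirically-fit formula for the SAE’s normalised reconstruction loss as a function of dictionary size $N$ and sparsity budget $k$.
- Formal definition.
  $$L(N, k) = \exp\big(\alpha + \beta_k \log k + \beta_n \log N + \gamma \log k \log N\big) + \exp\big(\zeta + \eta \log k\big)$$
  with the paper’s fitted parameters (GPT-4): $\alpha = -0.50$, $\beta_k = 0.26$, $\beta_n = -0.017$, $\gamma = -0.042$, $\zeta = -1.32$, $\eta = -0.085$. 10 (Here $\alpha$ is the fit intercept, not the AuxK weight.)
- Each term explained.
  - The first block is the joint power-law in $N$ and $k$ with cross-term $\gamma \log k \log N$; this captures the bulk reconstruction-error scaling.
  - The second is an irreducible-loss floor that depends on $k$ but not $N$: beyond a certain $N$, further capacity does not help if $k$ is fixed.
  - $\log k$ and $\log N$ are natural logarithms of the sparsity budget and dictionary size.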
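- Worked numerical example (illustrative inputs). Plug in $N = 2^{20}$ (1M features), $k = 32$. Then $\log N \approx 13.863$, $\log k \approx 3.466$. First block exponent: $-0.50 + 0.26(3.466) - 0.017(13.863) - 0.042(3.466)(13.863) \approx -1.852$, so the first block $\approx 0.157$. Second block exponent: $-1.32 - 0.085(3.466) \approx -1.615$, so the second block $\approx 0.199$. Sum $\approx 0.356$. The headline number is normalised MSE; the worked example is illustrative of how the two terms add. Doubling $N$ to $2^{21}$ changes $\log N$ by $\log 2 \approx 0.693$ and slightly reduces the first block (to about $0.140$); the second block is unchanged.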
- Role. Predicts the reconstruction loss of a yet-to-be-trained SAE from its hyperparameters, before committing compute.
- Edge cases. The fit is for the GPT-4 activation distribution at the specific layer used in the paper. Paper C does not claim the same coefficients transfer to other model families; the functional form is presented as the contribution.
- Novelty. [New] in Paper C.
- Why it matters. Makes SAE training planning rather than guesswork. [Analysis] This is the kind of empirical regularity that turns a research artefact into a tooling primitive.
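[Analysis] A fitted law of this shape is cheap to evaluate before committing compute. The sketch below assumes the functional form $L(N, k) = \exp(\alpha + \beta_k \log k + \beta_n \log N + \gamma \log k \log N) + \exp(\zeta + \eta \log k)$. The coefficient values are entered here as assumptions about the GPT-4 fit and should be checked against Paper C before being relied on; the function name `predicted_loss` is hypothetical.

```python
import math

# Assumed coefficients for the GPT-4 fit; verify against the paper before use.
FIT = {"alpha": -0.50, "beta_k": 0.26, "beta_n": -0.017,
       "gamma": -0.042, "zeta": -1.32, "eta": -0.085}


def predicted_loss(n_latents, k, fit=FIT):
    """Return (power-law block, irreducible floor, total) of the joint law."""
    log_n, log_k = math.log(n_latents), math.log(k)
    power_law = math.exp(fit["alpha"] + fit["beta_k"] * log_k
                         + fit["beta_n"] * log_n + fit["gamma"] * log_k * log_n)
    floor = math.exp(fit["zeta"] + fit["eta"] * log_k)   # depends on k only
    return power_law, floor, power_law + floor


block_1m, floor_1m, total_1m = predicted_loss(2 ** 20, 32)
block_2m, floor_2m, total_2m = predicted_loss(2 ** 21, 32)
print(round(block_1m, 3), round(floor_1m, 3), round(total_1m, 3))
print(round(block_2m, 3), round(floor_2m, 3), round(total_2m, 3))
```

Doubling the dictionary size at fixed `k` lowers only the first block, which is the sense in which the second block acts as a floor.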
MATH ENTRY 5 — Probe loss (Paper C).
- Source. Paper C Section 4.2.
- What it is. A metric measuring how well a single SAE feature predicts a held-out binary task by 1D logistic regression on that feature’s activations.
- Formal definition. For each of 61 binary classification tasks (held-out concept labels), train a 1D logistic regression on each feature $i$’s scalar activations and record cross-entropy. The probe loss is the minimum cross-entropy across all features:
  $$\mathrm{ProbeLoss} = \min_{i} \min_{w, b} \; \mathbb{E}\big[ -y \log \sigma(w z_i + b) - (1 - y) \log\big(1 - \sigma(w z_i + b)\big) \big]$$
  where $\sigma$ is the sigmoid, $(w, b)$ are the fitted logistic regression parameters per feature per task, and $y$ is the binary label.
- Each term explained.
  - $z_i$ is one feature’s scalar activation over the evaluation dataset.
  - $(w, b)$ are fitted logistic-regression parameters: one slope, one bias per (feature, task) pair.
  - The expectation is binary cross-entropy.
  - The outer $\min_i$ picks the single best-discriminating feature for the task.
- Worked numerical example (illustrative values). Suppose the task is “does this token come from a Python comment?” and feature $i$ activates on values $(0.0, 0.1, 2.0, 3.0, 0.2)$ across 5 evaluation tokens with labels $(0, 0, 1, 1, 0)$. A logistic regression on this 1D feature might fit $w = 3$, $b = -3$, giving sigmoid outputs $(0.047, 0.063, 0.953, 0.998, 0.083)$ and binary cross-entropy $\approx 0.050$. If no other feature discriminates better, this is the probe loss for this task. Aggregate over 61 tasks for the final metric.
- Role. Headline “are the features actually useful representations” metric in Paper C.
- Edge cases. If no feature discriminates the task above chance, the probe loss approaches . Tasks where the concept is genuinely not represented in the layer impose a floor on the achievable probe loss.
- Novelty. [New] in Paper C as packaged; probing itself is [Adopted] from Alain and Bengio 2016.
- Why it matters. It is the only metric in the paper that compares an SAE to a non-SAE baseline (random vectors, raw neurons) without confounding the comparison with the SAE’s own sparsity choices.
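The probe-loss definition can be sketched in a few lines of plain Python: fit a one-dimensional logistic regression per feature, then take the minimum cross-entropy across features. The activations and labels below are made up for illustration, and the plain gradient-descent fit (with its `lr` and `steps` settings) is an assumption standing in for whatever solver the paper used.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def bce(ys, ps, eps=1e-12):
    """Mean binary cross-entropy in nats."""
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(ys, ps)) / len(ys)

def fit_1d_probe(zs, ys, lr=0.5, steps=2000):
    """Fit slope w and bias b by gradient descent; return the final BCE."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        ps = [sigmoid(w * z + b) for z in zs]
        gw = sum((p - y) * z for p, y, z in zip(ps, ys, zs)) / len(ys)
        gb = sum(p - y for p, y in zip(ps, ys)) / len(ys)
        w, b = w - lr * gw, b - lr * gb
    return bce(ys, [sigmoid(w * z + b) for z in zs])

def probe_loss(feature_acts, ys):
    """Minimum fitted BCE over all features (the outer min over i)."""
    return min(fit_1d_probe(zs, ys) for zs in feature_acts)

# Illustrative data: two features on 5 tokens, with made-up binary labels.
labels = [1, 1, 0, 0, 1]
features = [
    [2.0, 1.5, 0.1, 0.0, 1.8],  # tracks the label closely
    [0.3, 0.0, 0.4, 0.2, 0.1],  # mostly uninformative
]
print(probe_loss(features, labels))
```

The first feature separates the labels, so it sets the probe loss for this toy task; the second stays close to the chance-level value.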
Section 7 — Algorithmic contributions
ALGORITHM ENTRY 1 — SAE training (shared across all three papers).
- Source. Paper A “Training procedure”; Paper C Section 2.
- Purpose. Train an SAE that reconstructs frozen residual-stream activations from a sparse feature vector.
- Inputs.
- Frozen language model $M$ and chosen layer index $l$.
- Token dataset $D$ (Paper A: pretraining-style mix; Paper B: training data of the underlying model; Paper C: 40B tokens for GPT-4).
- Dictionary size $N$, sparsity hyperparameter ($\lambda$ for L1 SAEs; $k$ for TopK SAEs).
- Optimiser (Adam in all three papers), learning rate, batch size.
- Outputs. Trained parameters $W_e, W_d, b_e, b_d$.
- Pseudocode (consolidated form).
    # Inputs: frozen model M, layer l, dataset D, N, sparsity_hp,
    #         optimiser config, training steps T
    # Outputs: trained SAE parameters
    initialise W_e in R^{N x d}, W_d in R^{d x N},
               b_e in R^N, b_d in R^d
    # Paper A: unit-norm decoder columns; Paper C: encoder rows
    # initialised as transpose of decoder columns.
    for step in 1..T:
        batch_tokens = sample minibatch from D
        with no_grad:
            x = forward(M, batch_tokens, layer=l)
        # x has shape [batch, seq, d]; flatten over (batch, seq).
        # Encoder
        u = (x - b_d) @ W_e.T + b_e
        if sparsity == "L1":
            z = ReLU(u)
            sparsity_loss = lambda * ||z||_1
        elif sparsity == "TopK":
            z = TopK(u, k)
            sparsity_loss = 0            # structurally sparse
        # Decoder
        x_hat = z @ W_d.T + b_d
        recon_loss = ||x - x_hat||_2^2
        if sparsity == "TopK":
            # AuxK: reconstruct error from top-k_aux DEAD latents
            e = x - x_hat
            dead_mask = features_with_no_activation_in_last_window()
            u_dead = u.masked_fill(~dead_mask, -infinity)
            z_dead = TopK(u_dead, k_aux)
            e_hat = z_dead @ W_d.T
            aux_loss = alpha * ||e - e_hat||_2^2
        else:
            aux_loss = 0
        loss = recon_loss + sparsity_loss + aux_loss
        loss.backward()
        optimiser.step()
        optimiser.zero_grad()
        # Paper A only: periodic resampling of dead neurons
        # (Paper C replaces this with AuxK).
        if step % resample_every == 0 and sparsity == "L1":
            reinitialise_dead_features(W_e, W_d, b_e)
        # Constrain decoder columns to unit norm (Paper A).
        W_d = W_d / ||W_d.columns||_2
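- Hand-traced example. Take a Paper A toy with a small input dimension $d$, dictionary size $N$, L1 coefficient $\lambda$, and batch size 2. Initial state: $W_d$ random with unit-norm columns. Step 1 forward: the encoder produces pre-activations $u$ for the two inputs $x_1, x_2$; ReLU zeros the negatives, giving $z_1$ and $z_2$. The decoder reconstructs $\hat{x}_1$ and $\hat{x}_2$. Recon loss across the batch: the average of the two squared L2 errors. Sparsity loss: $\lambda \lVert z \rVert_1$, which evaluates to 0.625 in this example. Total loss: recon-MSE + 0.625. Backprop updates $W_e, W_d, b_e, b_d$. After the optimiser step, the decoder columns are renormalised. Repeat for $T$ steps; periodically check which features never activated and resample them.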
- Complexity. Forward pass per token: $O(Nd)$ for the encoder + $O(Nd)$ for the decoder = $O(Nd)$. At $N = 2^{24} \approx 16\text{M}$ and $d = 4096$ (Paper C GPT-4 setting, [Reconstructed] from public model-size estimates), one forward pass is $2Nd \approx 1.4 \times 10^{11}$ multiply-adds per token; 40 billion tokens of training is about $5.5 \times 10^{21}$ forward-pass MACs in encoder + decoder alone, comparable to a small LLM pretraining run. Memory: $W_e$ and $W_d$ each store $N \times d$ floats; at $N = 2^{24}$, $d = 4096$, fp32, that is 256 GB per matrix. Paper C’s training uses model-parallel splitting across many devices.
- Hyperparameters.
- $N$: 4,096 (Paper A headline) to 34M (Paper B) to 16M (Paper C GPT-4).
- $k$ (Paper C): 32 to 256 in the main sweep.
- $\lambda$ (Paper A and B): tuned empirically per run.
- Learning rate: low to mid range; Adam.
- Batch size: 131,072 tokens (Paper C, for parallelism).
- $k_{\text{aux}} = 512$, $\alpha = 1/32$ (Paper C AuxK).
- Failure modes. Dead latents; over-sparse (reconstruction collapses); under-sparse (features become polysemantic again, defeating the purpose); feature splitting (one true concept becomes many fine-grained near-duplicates as $N$ grows).
- Novelty. [Adapted] from k-sparse autoencoders; [New] engineering work to make it scale.
- Transferability. [Analysis] The exact recipe transfers to any frozen transformer; the headline hyperparameters need re-tuning per base-model family.
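As a runnable companion to the consolidated pseudocode, the following minimal sketch performs a single-token forward pass and computes the reconstruction and sparsity losses for both the L1 and TopK variants. The toy sizes ($d = 2$, $N = 3$), the weights, and the helper names are assumptions for illustration; applying ReLU to the kept TopK values follows common implementations and is likewise an assumption. Gradients, AuxK, and resampling are left out.

```python
def matvec(W, v):
    """W is a list of rows; returns W @ v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relu(u):
    return [max(0.0, a) for a in u]

def topk(u, k):
    """Keep the k largest pre-activations (ReLU applied), zero the rest."""
    keep = set(sorted(range(len(u)), key=lambda i: u[i], reverse=True)[:k])
    return [max(0.0, a) if i in keep else 0.0 for i, a in enumerate(u)]

def sae_forward(x, W_e, W_d, b_e, b_d, mode="L1", lam=0.1, k=1):
    """Returns (z, x_hat, recon_loss, sparsity_loss) for one token."""
    centred = [xi - bi for xi, bi in zip(x, b_d)]            # pre-bias
    u = [a + b for a, b in zip(matvec(W_e, centred), b_e)]   # encoder
    if mode == "L1":
        z = relu(u)
        sparsity = lam * sum(abs(a) for a in z)
    else:
        z = topk(u, k)
        sparsity = 0.0                                       # structurally sparse
    x_hat = [a + b for a, b in zip(matvec(W_d, z), b_d)]     # decoder
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return z, x_hat, recon, sparsity

# Toy setup (assumed): decoder is d x N with unit-norm columns.
W_d = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
W_e = [list(col) for col in zip(*W_d)]   # N x d, transpose initialisation
b_e, b_d = [0.0, 0.0, 0.0], [0.0, 0.0]
x = [1.0, -0.5]

print(sae_forward(x, W_e, W_d, b_e, b_d, mode="L1", lam=0.1))
print(sae_forward(x, W_e, W_d, b_e, b_d, mode="TopK", k=1))
```

In this toy case the L1 variant keeps two active latents and pays a penalty for them, while the TopK variant keeps exactly one and has no sparsity term, which is the structural difference the pseudocode encodes.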
ALGORITHM ENTRY 2 — Feature steering (Paper B).
- Source. Paper B “Influence on Behavior” section.
- Purpose. Causally test whether a discovered feature controls model behaviour by clamping its activation at inference time.
- Inputs. Trained SAE; chosen feature index $i^*$; clamp value $c$ (high positive for amplification, zero or negative for suppression); prompt.
- Outputs. Model generation under the intervention.
- Pseudocode.
    # Inputs: frozen model M, trained SAE (W_e, W_d, b_e, b_d),
    #         target feature index i_star, clamp value c, prompt P
    for each forward pass of M on P at the SAE's layer l:
        x = residual_stream_activation(M, layer=l)
        z = ReLU(W_e @ (x - b_d) + b_e)
        z_steered = z.clone()
        z_steered[i_star] = c                    # clamp the target feature
        x_hat_steered = W_d @ z_steered + b_d
        delta = x_hat_steered - W_d @ z - b_d    # SAE-implied change
        new_x = x + delta                        # inject delta into stream
        write back new_x at layer l of M
    generate from M with intervention active each forward pass
- Hand-traced example. Paper B’s Golden Gate Bridge feature has a paper-specific index $i^*$. Clamp it to a value $c$ many times higher than the feature’s natural maximum activation, as the paper does. For a prompt unrelated to bridges (“What is your favourite food?”), the un-steered model answers normally; under the clamp, the model identifies as the bridge. Per the paper, the published Golden Gate Claude experiment used this protocol publicly for 24 hours in May 2024.
- Complexity. Adds two matrix-vector multiplies per token at the SAE’s layer; otherwise identical to base inference.
- Hyperparameters. Clamp value $c$, feature index $i^*$, layer $l$ (fixed by SAE training).
- Failure modes. Over-clamping degrades general coherence; under-clamping has no visible effect; the wrong feature index can amplify a partially-related concept.
- Novelty. [Adapted] from activation steering (Turner et al. 2023, Zou et al. 2023). The SAE-specific innovation is the choice of intervention basis: features rather than activation directions estimated from contrastive pairs.
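The clamp-and-inject step can be checked on a toy example. The sketch below assumes a ReLU SAE with made-up weights and shows that the injected change reduces to the target feature's decoder column scaled by how far the clamp moves that feature. The function name `steer` and all numbers are illustrative assumptions.

```python
def matvec(W, v):
    """W is a list of rows; returns W @ v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def steer(x, W_e, W_d, b_e, b_d, i_star, c):
    """Clamp feature i_star to c and return (new_x, delta)."""
    centred = [xi - bi for xi, bi in zip(x, b_d)]
    z = [max(0.0, a + b) for a, b in zip(matvec(W_e, centred), b_e)]
    z_steered = list(z)
    z_steered[i_star] = c
    # delta = W_d @ (z_steered - z); the decoder bias cancels out.
    diff = [a - b for a, b in zip(z_steered, z)]
    delta = matvec(W_d, diff)
    new_x = [xi + di for xi, di in zip(x, delta)]
    return new_x, delta

# Toy SAE (assumed values): d = 2, N = 3, unit-norm decoder columns.
W_d = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
W_e = [list(col) for col in zip(*W_d)]
b_e, b_d = [0.0, 0.0, 0.0], [0.0, 0.0]

new_x, delta = steer([1.0, -0.5], W_e, W_d, b_e, b_d, i_star=2, c=5.0)
print(new_x, delta)
```

Because only the delta is added to the original activation, the part of the residual stream that the SAE fails to reconstruct passes through unchanged.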
Section 8 — Specialised design contributions
Subsection 8A — LLM / prompt design
Not applicable to the SAE training papers directly. Paper C uses an LLM-graded explanation pipeline (N2G — Neuron-to-Graph) where an LLM generates n-gram patterns that summarise each feature’s top-activating contexts; the precision and recall of those summaries against held-out activations is the explanation metric. The exact prompt template is not surfaced in the main text of Paper C — [Reconstructed] from the reference implementation, the prompts ask the LLM to summarise a small batch of high-activating examples and produce regex-like n-gram patterns.
Subsection 8B — Architecture-specific details
Paper A residual-stream / MLP-output split: Paper A explicitly trains on MLP-output activations of the one-layer model, not on the residual stream itself. Paper B and C train on residual-stream activations.
Decoder normalisation: Paper A constrains each decoder column to unit norm, $\lVert W_d[:, i] \rVert_2 = 1$, after each step. Paper C uses the same default; the encoder rows are initialised parallel to the decoder columns, which improves convergence and reduces dead latents at initialisation.
Pre-bias: the $-b_d$ term inside the encoder formula (subtracted before the encoder matrix-multiply) is a small but consequential trick; it centres the activation distribution so the encoder is not forced to memorise a per-dataset offset.
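The two initialisation details above (unit-norm decoder columns and an encoder initialised as the decoder's transpose) can be sketched as follows. The helper names, the Gaussian initialisation, and the list-of-rows matrix layout are assumptions made for illustration.

```python
import math
import random

def unit_norm_columns(W_d):
    """Rescale each column of a d x N matrix (list of rows) to L2 norm 1."""
    d, n = len(W_d), len(W_d[0])
    norms = [math.sqrt(sum(W_d[r][j] ** 2 for r in range(d))) for j in range(n)]
    return [[W_d[r][j] / norms[j] for j in range(n)] for r in range(d)]

def init_sae(d, n, seed=0):
    """Random decoder with unit-norm columns; encoder rows = decoder columns."""
    rng = random.Random(seed)
    raw = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]
    W_d = unit_norm_columns(raw)
    W_e = [list(col) for col in zip(*W_d)]   # shape N x d
    return W_e, W_d

W_e, W_d = init_sae(d=4, n=8)
col0_norm = math.sqrt(sum(row[0] ** 2 for row in W_d))
print(col0_norm)
```

During training the same `unit_norm_columns` call would be repeated after each optimiser step to maintain the constraint.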
Subsection 8C — Training specifics
Paper A: 8B activations (from Paper A’s reported pool size) on the one-layer model. Hardware details: not explicit in the public report; commodity GPU training is feasible per the open-source reproduction. 11
Paper B: training pool and hardware not disclosed beyond order-of-magnitude framing in the limitations. Anthropic’s companion blog flags the SAE training as a substantial compute commitment but does not publish numbers.
Paper C: 40B tokens of activations for the GPT-4 16M-latent run; batch size 131,072 tokens; AuxK $k_{\text{aux}} = 512$, $\alpha = 1/32$; encoder learning rate sweep documented in Figure 6 of the paper. Hardware: not disclosed in detail but clearly multi-node given the matrix sizes.
Subsection 8D — Inference / deployment specifics
All three SAEs are inference-time read-only tools beside a frozen base model. Inference cost: one extra matrix-vector pass through the encoder and decoder at the SAE’s layer, plus the top-$k$ selection for Paper C. Memory cost: storage of $W_e$ and $W_d$, dominated by the larger runs (32M features at $d = 4096$ exceeds 500 GB per matrix in fp32; halving to fp16 is still significant).
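The storage and per-token cost figures follow from simple arithmetic, sketched below. The residual width of 4,096 and the power-of-two dictionary sizes are assumptions (the base-model widths are not published), and `sae_costs` is a hypothetical helper name.

```python
def sae_costs(n_latents, d_model, bytes_per_param=4, tokens=0):
    """Back-of-envelope storage and dense forward cost for one SAE."""
    params_per_matrix = n_latents * d_model
    macs_per_token = 2 * params_per_matrix   # dense encoder + dense decoder
    return {
        "gib_per_matrix": params_per_matrix * bytes_per_param / 2**30,
        "macs_per_token": macs_per_token,
        "total_forward_macs": macs_per_token * tokens,
    }

# 16M latents (2**24) at an assumed width of 4096, fp32, 40B tokens.
print(sae_costs(2**24, 4096, tokens=40 * 10**9))
# 32M latents (2**25) at the same assumed width, stored in fp16.
print(sae_costs(2**25, 4096, bytes_per_param=2))
```

The count treats the decoder as dense; a TopK decoder only needs the $k$ active columns per token, so the real inference cost for Paper C's recipe can be lower than this bound.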
For Paper B’s deployed Golden Gate Claude experiment, the SAE was hooked into the inference pipeline at the chosen layer; the clamp lookup is a constant-time write per token. Paper B does not publish latency overhead numbers.
Section 9 — Experiments and results
Datasets. Paper A: a natural-language pre-training mix on the A/1 one-layer transformer. Paper B: Claude 3 Sonnet’s pre-training distribution; specific dataset composition not published. Paper C: GPT-4’s pre-training distribution; 40B tokens for the headline run; smaller pools for GPT-2 small ablations.
Baselines.
- Paper A baseline: individual MLP neurons of the same one-layer transformer. The paper argues that human-rated interpretability of SAE features at $N = 4{,}096$ is substantially higher than for the raw 512 neurons.
- Paper B baselines: the same Claude 3 Sonnet without the SAE (no internal-state read), and smaller SAEs at $N \approx$ 1M and 4M as scale-down comparisons.
- Paper C baselines: ReLU SAEs with L1 penalty at matched sparsity levels (the closest comparable to Paper A and B’s recipe); random feature directions for ablation sparsity; raw neuron activations for the probe-loss baseline.
[Analysis] Obvious missing baseline across all three papers: a non-SAE dense linear autoencoder of comparable parameter count. Independent commentary on Paper B (see Section 12) flags that the SAE programme has not been directly compared to dense overcomplete bases at matched compute.
Evaluation metrics.
- Paper A: human-rater specificity and manipulation scores per feature; subjective feature catalogue.
- Paper B: feature catalogue with top-activating examples; steering experiments; safety-relevant feature documentation; downstream model-output coherence under steering.
- Paper C: downstream loss (KL when reconstructions are spliced back into the model); probe loss across 61 binary tasks; explanation precision and recall via N2G; ablation sparsity ratio (the squared ratio of logit-difference vectors).
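One common way to score how concentrated an effect vector is uses the squared ratio of its L1 norm to its L2 norm, which behaves like an effective count of affected entries. The sketch below assumes this is the quantity behind the ablation sparsity metric; the function name and the toy vectors are illustrative, and any normalisation the paper applies on top is omitted.

```python
import math

def effective_count(v):
    """(L1 / L2)^2 of a vector: 1 for a one-hot effect, len(v) for a uniform one."""
    l1 = sum(abs(a) for a in v)
    l2 = math.sqrt(sum(a * a for a in v))
    return (l1 / l2) ** 2 if l2 > 0 else 0.0

# Toy logit-difference vectors after ablating a single feature.
sparse_effect = [3.0, 0.0, 0.0, 0.0]   # touches one output token
dense_effect = [1.0, 1.0, 1.0, 1.0]    # touches every output token equally

print(effective_count(sparse_effect), effective_count(dense_effect))
```

A lower score means the ablated feature influences fewer output logits, which is the behaviour one would hope for from a monosemantic feature.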
Main quantitative results.
- Paper A: 4,096-feature SAE recovers cleanly interpretable features for 70% of the dictionary per human rating (from secondary-source coverage of the paper). 12
- Paper B: 34M-feature SAE on Claude 3 Sonnet exhibits, qualitatively, a wide diversity of concrete and abstract features; the Golden Gate Bridge feature responds to text in multiple languages and to images of the bridge. 7
- Paper C: TopK SAE achieves a strictly better reconstruction-vs-sparsity Pareto frontier than ReLU+L1 at matched L0; the joint scaling law fits across the swept $N$ and $k$ range; dead-latent rate at 16M latents on GPT-4 is 7%. 9
Ablations.
- Paper C reports ablations on encoder initialisation (transpose vs random), AuxK on / off, and TopK vs ReLU+L1 at matched L0. 10
- Paper A reports a sweep across $N$ and shows feature splitting as $N$ grows (one underlying concept splits into multiple fine-grained features).
Hyperparameter sensitivity. Paper C’s central claim is that TopK + AuxK removes the brittleness around $\lambda$ tuning that dominated Paper A and B’s recipe; the joint scaling law’s parameters were fit across a sweep large enough to expose any sensitivity, and the paper presents the fit as predictive.
Robustness. Paper B’s robustness story is the steering experiment’s transfer across prompts; Paper C’s robustness story is the scaling law’s predictive validity across $(N, k)$ combinations.
Qualitative results. Paper B is the qualitative-feature-tour paper of the trio: features for the Golden Gate Bridge, transit infrastructure, sycophantic praise, unsafe code, biological-weapons content, and bias. 13
Experimental scope limits.
- All three: single-layer scope. Paper B explicitly flags this.
- Paper B: no closed-form metric of “feature quality” beyond the catalogue tour; reliance on hand-curated examples.
- Paper C: GPT-4 is closed-weight, so the released artefacts cover GPT-2 small + open-source models; the GPT-4 SAE itself is described but not released.
Independent benchmark cross-checks. SAEs do not have a standard benchmark in the SuperGLUE / MMLU sense. The closest community-grounded check is open-source reproduction. The shehper/sparse-dictionary-learning repository on GitHub 14 reproduces Paper A’s recipe and reports broadly consistent feature interpretability; the OpenAI sparse_autoencoder repository 15 ships Paper C’s training code and reference SAEs for GPT-2 small. [Reviewer Perspective] Two independent critique threads stand out: the LessWrong “Comments on Anthropic’s Scaling Monosemanticity” thread 8 and the ICLR 2025 OpenReview thread on Paper C 16 — both discussed in Section 12.
Evidence audit ([Analysis]).
- Strongly supported. That SAEs of order 4K to 34M latents extract subjectively interpretable features from the activations of models ranging from a one-layer transformer to a frontier LLM; that TopK + AuxK improves the reconstruction-sparsity Pareto frontier over ReLU + L1 at matched L0.
- Partially supported. That the discovered features causally control behaviour at a granular level — Paper B’s Golden Gate Bridge demo is striking, but the long tail of safety-relevant features (deception, sycophancy, bioweapons) has fewer controlled experiments per feature; independent critique flags the quantitative tightness of the causal-control claims.
- Narrow evidence. That the scaling law’s specific coefficients transfer beyond GPT-4 at the same layer; the paper does not claim they do.
Section 10 — Technical novelty summary
| Component | Type | Novelty level | Justification | Source |
|---|---|---|---|---|
| L1-penalised SAE objective for transformer activations | Method | Combination novel | Adapts sparse coding (1996) to LLM residual streams | Paper A |
| Pre-bias + unit-norm decoder columns | Method | Incrementally novel | Engineering refinements that stabilise training at scale | Paper A |
| Ghost gradients for dead-neuron recovery | Method | Incrementally novel | First formulation in the SAE context; precursor to AuxK | Paper A |
| Scaling to 34M latents on a production LLM (Claude 3 Sonnet) | Empirical | Combination novel | First demonstration that the recipe transfers to frontier scale | Paper B |
| Feature steering for causal control (Golden Gate Bridge) | Method | Combination novel | SAE-basis intervention rather than contrastive-pair direction | Paper B |
| Safety-relevant feature catalogue (deception, sycophancy, bioweapons) | Empirical | Fully novel | First systematic catalogue of safety-load-bearing features in a production LLM | Paper B |
| TopK activation in transformer SAEs | Method | Combination novel | Imports k-sparse coding into the LLM SAE programme | Paper C |
| AuxK auxiliary loss | Method | Fully novel | Specific gradient-routing fix for dead latents at scale | Paper C |
| Joint scaling law | Empirical | Fully novel | First fitted scaling law for SAE training | Paper C |
| Four-metric evaluation framework (downstream / probe / explanation / ablation) | Method | Combination novel | Packages prior probing + ablation ideas into a single, replicable suite | Paper C |
The single most novel contribution across the line. [Analysis] Paper C’s joint scaling law — it is the first empirical regularity that turns SAE training from artisanal to predictable. Paper A’s existence proof and Paper B’s frontier-scale demonstration are necessary precursors; Paper C is what makes SAEs a tooling primitive rather than a one-off research artefact.
What the papers do not claim novelty for. L1 sparsity penalties, ReLU non-linearities, the encoder–decoder autoencoder shape, k-sparse activation, probing, ablation studies, Adam optimisation. Paper A is explicit that sparse coding traces to Olshausen and Field 1996; Paper C is explicit that TopK traces to Makhzani and Frey 2013.
Section 11 — Situating the work
What prior work did. Olah et al.’s 2018-2020 Distill circuits programme established the mechanistic-interpretability programme on vision models. Elhage et al.’s 2022 Toy Models of Superposition (Anthropic) formalised the polysemantic-neuron problem and the superposition hypothesis that motivates SAEs. Olshausen and Field 1996 introduced the L1-penalised sparse-coding objective from which Paper A’s loss is descended; Makhzani and Frey 2013 introduced k-sparse autoencoders, the direct ancestor of Paper C’s TopK.
What these three papers change conceptually. Paper A: makes the superposition hypothesis empirically testable at transformer scale. Paper B: makes it actionable for production models. Paper C: makes it predictable.
Contemporaneous related work.
- Cunningham, Ewart, Smith, Huben, Sharkey, Sparse Autoencoders Find Highly Interpretable Features in Language Models, September 2023 (arXiv:2309.08600). [External comparison] Independent contemporaneous result on Pythia models, published a few weeks before Paper A. Same core method (L1 SAE); confirms the recipe is robust to authorship and base-model choice. Relationship to the three reviewed papers: independent confirmation of Paper A’s central claim on a different model family.
- Marks, Rager, Michaud, Belinkov, Bau, Mueller, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, March 2024 (arXiv:2403.19647). [External comparison] Builds on the SAE programme to construct feature-level circuits between SAE features across layers. Relationship: a follow-up direction that uses Paper A’s recipe as a primitive and extends to cross-layer causal graphs.
- Karvonen, Wright, Rager, Marks, Glaese, Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models, August 2024. [External comparison] Proposes ground-truth-aware evaluation of SAEs on board-game models where the latent state is known. Relationship: directly addresses Paper C’s evaluation challenge by adding a benchmark with known answers.
[Reviewer Perspective] strongest skeptical objection. The LessWrong commentary on Paper B argues that SAE features may be artefacts of the dictionary learning procedure rather than faithful units of the underlying model. 8 Specifically: feature splitting, the high-activation conditioning quirk, and the lack of a non-SAE dense-overcomplete baseline mean that the same data could be told from several incompatible decompositions, and the choice between them is a researcher prior. The criticism is sharper for Paper B than for Paper A: at the toy-model scale of Paper A the alternative decompositions could be checked manually; at 34M features in Claude 3 Sonnet they cannot.
[Reviewer Perspective] strongest author-side rebuttal grounded in the paper. Paper B’s response would be that the steering experiments (interventions on a feature change behaviour in the predicted direction) are evidence against the artefact hypothesis, at least for the steered features. Paper C’s response would be that the scaling law’s predictive validity across $(N, k)$ is hard to reconcile with the artefact hypothesis: artefacts should not scale predictably.
What remains unsolved.
- Cross-layer features: all three papers train on a single layer.
- Feature composition: how features at one layer combine into features at another layer is not characterised.
- Universality: whether the same features appear in different base models is gestured at (Paper A) but not systematically tested.
- A non-SAE dense-overcomplete-basis baseline.
- Ground-truth-aware evaluation on models where the latent state is known.
Three future research directions.
- Cross-layer SAE transcoders. [Analysis] Train SAEs across multiple layers simultaneously and learn feature-to-feature transcoder mappings; Anthropic’s follow-up work in late 2024 (Lindsey et al.) explicitly does this.
- Ground-truth benchmarks. [Analysis] Karvonen et al. 2024’s board-game-models direction is an existence proof; extending to synthetic-task LLMs with known latent states would let the field compare SAE recipes directly.
- Causally-grounded feature definitions. [Reviewer Perspective] Replace the “specificity + manipulation” hand-rating bar with a formal definition of feature monosemanticity tied to model behaviour (e.g., a feature is monosemantic iff steering it produces a predictably clustered behaviour change). The independent critique threads point in this direction.
Section 12 — Critical analysis
Strengths with concrete evidence.
- Paper A: end-to-end reproducible recipe; open-source community implementations confirm the feature-interpretability claim. 14
- Paper B: frontier-scale demonstration with a publicly verifiable causal-control artefact (Golden Gate Claude was a live deployment). 7
- Paper C: predictive scaling law; substantially-reduced dead-latent rate; the metric framework is benchmark-ready. 9
Author-stated weaknesses.
- Paper A: limited to one layer of a one-layer transformer; feature splitting and dead neurons documented; the report itself is framed as “towards” monosemanticity rather than achieving it.
- Paper B: explicitly states single-layer scope; the high-activation conditioning quirk on the Golden Gate Bridge feature; uncertainty about whether the discovered features are the causally important ones. 6
- Paper C: scaling law fit is for one base model (GPT-4) at one layer; the GPT-4 SAE artefact itself is not released to the community.
Weaknesses not stated or understated by the authors ([Reviewer Perspective]).
- The LessWrong critique flags that Paper B’s safety-relevant feature catalogue (deception, sycophancy, bioweapons) is presented qualitatively without per-feature controlled steering experiments at the same depth as the Golden Gate Bridge — the safety case rests heavily on the bridge demo’s existence proof. 8
- The ICLR 2025 OpenReview thread on Paper C raises questions about whether the joint scaling law’s cross-term will hold beyond the swept $(N, k)$ range, and notes the metric framework lacks an obvious ground-truth tether. 16
[Analysis]Across the line, there is no non-SAE dense-overcomplete baseline at matched parameter count or compute; the comparison is between L1 and TopK within the SAE programme, not between SAEs and competing decomposition methods.
Reproducibility check.
- Code. Paper A: open-source community reproduction (shehper/sparse-dictionary-learning). 14 Paper B: no public code or weights for the Claude 3 Sonnet SAE. Paper C: open-source training code and reference SAEs for GPT-2 small at github.com/openai/sparse_autoencoder. 15
- Data. Paper A and B: activation pools not released. Paper C: GPT-2 small activations reproducible from public weights; the GPT-4 40B-token activation pool is not released.
- Hyperparameters. All three papers list the headline hyperparameters; Paper C is the most complete, with full $\lambda$-equivalent ($k$) and learning-rate sweeps in the appendix.
- Compute. Paper A: not disclosed in detail. Paper B: not disclosed. Paper C: hardware not disclosed beyond batch-size and training-token counts.
- Trained model weights. Paper A: no official release; community reproductions exist. Paper B: no release. Paper C: GPT-2 small SAE weights released; GPT-4 SAE not released.
- Evaluation set. Paper C: 61-task probe set documented; per-task labels available in the OpenAI repository.
- Overall. Paper A — partially reproducible (community reproductions confirm the recipe); Paper B — not reproducible without Anthropic-internal access to Claude 3 Sonnet; Paper C — partially reproducible at GPT-2 small scale, not at GPT-4 scale.
Methodology disclosure (Paper C, the most evaluation-heavy of the three).
- Sample size: 61 probe tasks; tens of thousands of evaluation contexts per task per the open-source repository.
- Evaluation set: GPT-4 residual-stream activations on a held-out token stream; GPT-2 small on a held-out subset of the OpenWebText corpus.
- Baselines: ReLU + L1 SAEs at matched L0; random vectors for ablation sparsity; raw neuron activations for probe loss.
- Hardware / compute: not disclosed in the paper or supplementary; the open-source repository implies multi-GPU training but does not give a TPU / GPU count for the GPT-4 16M run.
Generalisability.
- To other base models: [Analysis] highly likely the recipe transfers (Cunningham et al. independently showed the recipe works on Pythia; Anthropic has scaled to Claude 3.5 in follow-up work).
- To larger scales: Paper C’s scaling law predicts continued improvement; the practical bottleneck is compute and storage.
- To non-residual-stream activations: Paper A trained on MLP outputs; Paper B and C on residual streams. Other surfaces (attention outputs, attention queries / keys / values) are open follow-up territory.
Assumption audit. The linear-superposition hypothesis is the load-bearing assumption. If the residual stream has meaningful non-linear structure, an SAE will fold that structure into spurious feature combinations. Failure looks like: features that appear interpretable in isolation but interact non-additively when steered together. The Golden Gate Bridge feature’s behaviour under high-magnitude clamping (the model identifies as the bridge rather than merely mentioning it) is consistent with mild non-linearity at the steering edge — a clean linear basis should produce graceful interpolation rather than identity confusion.
What would make the line significantly stronger. [Analysis]
- Per-feature controlled steering at the depth Paper B applied to the Golden Gate Bridge, extended across the safety-relevant feature catalogue.
- A ground-truth benchmark in the Karvonen et al. 2024 mould tied to a real LLM rather than board-game models.
- A non-SAE dense-overcomplete-basis baseline at matched parameter count.
- Cross-model universality experiments at scale.
Section 13 — What is reusable for a new study
REUSABLE COMPONENT 1 — The L1-penalised SAE training recipe (Paper A).
- What it is. The full pipeline: activation collection, pre-bias-centred encoder, ReLU, L1 penalty, unit-norm decoder columns, dead-feature resampling.
- Why worth reusing. Cheapest entry point into SAE-based interpretability work; community implementations exist. 14
- Preconditions. A frozen base model; sufficient activation pool (millions to tens of millions of tokens at one layer); compute for the chosen dictionary size $N$.
- What would need to change in a different setting. The L1 coefficient re-tunes per base model and layer. Decoder normalisation is universal.
- Risks. Dead latents at large $N$; $\lambda$-tuning brittleness.
- Interaction effects. Combines cleanly with feature steering (Component 3).
REUSABLE COMPONENT 2 — TopK + AuxK training recipe (Paper C).
- What it is. The Paper C recipe: TopK encoder, AuxK auxiliary loss, transpose-initialised encoder, unit-norm decoder.
- Why worth reusing. Removes $\lambda$-tuning; greatly reduces the dead-latent pathology; scales predictably.
- Preconditions. A choice of $k$ (the per-token sparsity budget); a choice of $k_{\text{aux}}$ (Paper C uses 512); an AuxK coefficient $\alpha$ (Paper C uses $1/32$).
- What would need to change in a different setting. $k$ retunes per layer; the scaling law’s coefficients re-fit per base model.
- Risks. The straight-through gradient through TopK is non-differentiable at the threshold; very small $k$ values increase variance.
- Interaction effects. AuxK requires dead-latent detection — usually a sliding-window check on activation frequency.
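The sliding-window dead-latent check mentioned above can be sketched as a small counter. The class name `DeadLatentTracker`, the window measured in training steps rather than tokens, and the toy batches are assumptions for illustration.

```python
class DeadLatentTracker:
    """Counts training steps since each latent last fired (hypothetical helper)."""

    def __init__(self, n_latents, window):
        self.steps_since_fired = [0] * n_latents
        self.window = window

    def update(self, z_batch):
        """z_batch: list of per-token latent activation vectors for one step."""
        n = len(self.steps_since_fired)
        fired = [any(z[i] > 0 for z in z_batch) for i in range(n)]
        self.steps_since_fired = [
            0 if f else s + 1 for f, s in zip(fired, self.steps_since_fired)
        ]

    def dead_mask(self):
        """True for latents that have not fired within the window."""
        return [s >= self.window for s in self.steps_since_fired]

tracker = DeadLatentTracker(n_latents=3, window=2)
tracker.update([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])  # only latent 0 fires
tracker.update([[0.0, 2.0, 0.0]])                   # only latent 1 fires
print(tracker.dead_mask())
```

The resulting mask is what the AuxK branch would use to restrict its top-$k_{\text{aux}}$ selection to dead latents.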
REUSABLE COMPONENT 3 — Feature steering protocol (Paper B).
- What it is. The inference-time intervention: clamp $z_{i^*}$ to value $c$, propagate the SAE-implied $\delta$ back into the residual stream, generate.
- Why worth reusing. Most credible causal-control mechanism in the SAE programme so far.
- Preconditions. A trained SAE; a choice of feature index and clamp value.
- What would need to change in a different setting. Clamp value scales with the feature’s natural activation range; layer choice matters.
- Risks. Over-clamping degrades general coherence; the wrong feature index can amplify a partially-related concept.
REUSABLE COMPONENT 4 — Four-metric evaluation framework (Paper C).
- What it is. Downstream KL loss; 1D probe loss on 61 binary tasks; N2G explanation precision/recall; ablation sparsity.
- Why worth reusing. The first packaged benchmark suite for SAE training; reduces the “is this SAE actually better” question to four scalars.
- Preconditions. A held-out evaluation token stream; the 61 probe tasks (released in the OpenAI repository); an LLM for N2G grading.
- What would need to change in a different setting. The probe tasks are English / general-knowledge focused; bespoke probe sets per domain.
- Risks. N2G grading inherits the grader LLM’s biases.
Dependency map. Component 1 or Component 2 (mutually exclusive choice of training recipe) → Component 3 (steering) and Component 4 (evaluation). Component 4 is the natural acceptance test that Component 1 or 2 ran correctly.
Recommendation ([Analysis]). For a new SAE study, the highest-value combination is Component 2 (TopK + AuxK) plus Component 4 (four-metric evaluation). Component 1 has historical value but is dominated by Component 2 on every axis the field cares about. Component 3 is the right next step once a recipe converges.
What type of new study benefits most. [Analysis] A new study comparing SAE decompositions across base-model families, layers, or activation surfaces would gain the most from the TopK + AuxK recipe and the four-metric framework. A study focused on safety-relevant feature monitoring in a production LLM would gain the most from the Paper B steering protocol applied to an SAE trained on the production model.
Section 14 — Known limitations and open problems
Author-stated limitations.
- Paper A: feature splitting at large $N$; dead neurons; one-layer-only scope; toy base model.
- Paper B: single residual-stream layer; uncertainty about causal centrality of discovered features; high-activation conditioning quirk; safety-relevant feature catalogue not exhaustively steered.
- Paper C: scaling law fit at one base model and one layer; GPT-4 artefact not released; evaluation framework lacks a ground-truth tether.
Limitations from independent commentary ([Analysis] / [Reviewer Perspective], source-grounded per Section 12).
- Artefact-of-procedure objection: the same activation pool admits several incompatible SAE decompositions; the field has not characterised the ambiguity. 8
- No non-SAE dense-overcomplete-basis baseline at matched compute. 16
- The safety-relevant feature catalogue in Paper B is qualitative; per-feature causal-control evidence is uneven across the catalogue. 8
Technical root causes. Each limitation traces to a specific load-bearing choice: single-layer scope follows from training one SAE per layer; feature splitting follows from over-completeness of the dictionary; dead latents follow from the loss-gradient routing (mitigated by Paper C but not eliminated); the lack of ground-truth follows from the inherent unsupervised-ness of the SAE programme.
Open problems.
- Cross-layer SAE transcoders.
- Ground-truth-aware benchmarks on real LLMs.
- A canonical answer to the “same data, multiple decompositions” objection.
- Universality across base models and layers.
- A formal definition of feature monosemanticity tied to causal behaviour rather than human rating.
What a follow-up would need to solve to address the most critical limitation. [Analysis] The single highest-payoff follow-up would address the artefact-of-procedure objection: a study that trains multiple SAE recipes on the same activation pool, characterises the disagreement between feature dictionaries, and identifies which features are robust across recipes (and therefore plausibly model-intrinsic). The TopK vs L1 comparison in Paper C is a starting point but compares within the SAE programme; the missing experiment compares SAEs against non-SAE decompositions.
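One simple way to start characterising disagreement between two feature dictionaries is to match decoder directions by cosine similarity and count how many features in one dictionary have a close counterpart in the other. The sketch below is an assumption about how such a comparison could be set up, not a method from any of the three papers; the 0.9 threshold and the toy directions are arbitrary.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_match_similarity(dict_a, dict_b):
    """For each direction in dict_a, the max cosine similarity to any in dict_b."""
    return [max(cosine(a, b) for b in dict_b) for a in dict_a]

def robust_fraction(dict_a, dict_b, threshold=0.9):
    """Share of dict_a features with a near-duplicate direction in dict_b."""
    sims = best_match_similarity(dict_a, dict_b)
    return sum(s >= threshold for s in sims) / len(sims)

# Toy decoder directions from two hypothetical recipes on the same activations.
recipe_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
recipe_b = [[1.0, 0.05], [-1.0, 1.0]]

print(best_match_similarity(recipe_a, recipe_b))
print(robust_fraction(recipe_a, recipe_b))
```

Features that clear the threshold across several recipes would be the candidates for being model-intrinsic; the same matching could be run against a non-SAE decomposition to supply the missing baseline.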
How this article reads at three depths
For the curious high-school reader. A language model has hidden vectors inside it that smear together many ideas at once. Three papers from Anthropic and OpenAI between October 2023 and June 2024 show how to learn a much bigger “dictionary” of single-meaning directions, then write each hidden vector as a tiny combination of those dictionary entries. The result: the inside of the model becomes much more readable. The most striking demonstration: amplify the dictionary entry for “Golden Gate Bridge” inside Anthropic’s Claude 3 Sonnet, and the model starts saying it is the bridge.
For the working developer or ML engineer. Sparse autoencoders are a one-hidden-layer, encoder-decoder, post-hoc tool that sits beside a frozen language model. The architecture is trivial; the difficulty lies in dead latents, sparsity tuning, and evaluation. If implementing today, use Paper C’s TopK + AuxK recipe rather than the ReLU + L1 recipe of Papers A and B: TopK directly controls the per-token sparsity budget, AuxK mitigates (but does not eliminate) the dead-latent failure mode, and the joint scaling law lets you predict reconstruction loss before training. OpenAI’s reference implementation ships training code, runnable GPT-2 small SAEs, and a visualisation tool; the 16M-latent GPT-4 SAE is described in the paper but not released. Inference overhead is two extra matrix-vector products per token at the SAE’s layer. The right acceptance test for a new SAE is Paper C’s four-metric framework: downstream loss, probe loss, explanation precision / recall, and ablation sparsity. Feature steering (Paper B’s Golden Gate Bridge protocol) is the credible causal-control demo, but it is brittle at the long tail.
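The "two extra matrix-vector products" claim is easiest to see in code. The following is a minimal single-token sketch of a TopK forward pass, assuming the encoder form z = TopK(W_enc (x − b_pre)) and decoder form x̂ = W_dec z + b_pre; it is not OpenAI’s reference implementation. The function name, the toy dimensions, and the choice to clamp the kept values at zero are assumptions made for illustration, and AuxK and training are omitted entirely.

```python
import numpy as np

def topk_sae_forward(x, W_enc, W_dec, b_pre, k):
    """Single-token TopK SAE forward pass (illustrative sketch).

    x:     (d_model,) activation from the frozen language model
    W_enc: (n_latents, d_model) encoder weights
    W_dec: (d_model, n_latents) decoder weights
    b_pre: (d_model,) bias subtracted before encoding, added back after decoding
    k:     number of latents allowed to be active for this token
    """
    pre_acts = W_enc @ (x - b_pre)                # matrix-vector product 1
    top_idx = np.argpartition(pre_acts, -k)[-k:]  # indices of the k largest values
    z = np.zeros_like(pre_acts)
    # Keep only the top-k pre-activations; clamping at zero is a choice made here.
    z[top_idx] = np.maximum(pre_acts[top_idx], 0.0)
    x_hat = W_dec @ z + b_pre                     # matrix-vector product 2
    return z, x_hat

# Toy dimensions; real SAEs are far larger.
rng = np.random.default_rng(0)
d_model, n_latents, k = 32, 256, 8
W_dec = rng.normal(size=(d_model, n_latents))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm decoder columns
W_enc = W_dec.T.copy()                                 # encoder starts as decoder transpose
b_pre = np.zeros(d_model)

x = rng.normal(size=d_model)
z, x_hat = topk_sae_forward(x, W_enc, W_dec, b_pre, k)
print("active latents:", int(np.count_nonzero(z)), "of", n_latents)
```

Because at most k entries of z are non-zero, the decoder product only needs k columns of W_dec in practice, which is where sparse kernels recover most of the cost at large dictionary sizes.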
For the ML researcher. Three papers, one programme. Paper A is the existence proof of the L1-penalised SAE recipe on a one-layer transformer, with a 4,096-feature dictionary as the headline result. Paper B scales the same recipe to Claude 3 Sonnet and exhibits causal control via feature steering on safety-relevant features. Paper C replaces L1 with TopK, adds AuxK to mitigate dead latents, fits a joint scaling law in latent count and sparsity level, and trains a 16M-latent SAE on 40B tokens of GPT-4 activations. Load-bearing assumption across all three: linear superposition. Strongest objection (LessWrong, OpenReview): SAE features may be artefacts of the dictionary-learning procedure rather than model-intrinsic units; the same activation pool admits multiple incompatible decompositions, and there is no non-SAE dense-overcomplete baseline. A follow-up paper would need to characterise the disagreement between recipes and identify which features are recipe-invariant. Cross-layer SAE transcoders, ground-truth benchmarks on real LLMs, and a formal causal-behaviour definition of monosemanticity are the next natural directions.
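To make the joint scaling law concrete, here is a hedged sketch of how such a fit can be evaluated. It assumes the functional form reported in Paper C’s Section 3, a sum of two exponentials in log n and log k with the six parameters (α, β_k, β_n, γ, ζ, η) listed in the Sources; readers should check the form against the paper before relying on it. The coefficient values below are illustrative placeholders, not the fitted GPT-4 values, and the function name is hypothetical.

```python
import math

def joint_scaling_loss(n, k, alpha, beta_k, beta_n, gamma, zeta, eta):
    """Predicted reconstruction loss for n latents and k active latents per token.

    Assumed form:
        L(n, k) = exp(alpha + beta_k*log k + beta_n*log n + gamma*log k*log n)
                  + exp(zeta + eta*log k)
    """
    log_n, log_k = math.log(n), math.log(k)
    depends_on_n = math.exp(alpha + beta_k * log_k + beta_n * log_n
                            + gamma * log_k * log_n)
    independent_of_n = math.exp(zeta + eta * log_k)  # floor as n grows
    return depends_on_n + independent_of_n

# Placeholder coefficients for illustration only (not the paper's fit).
coeffs = dict(alpha=-0.5, beta_k=0.3, beta_n=-0.02,
              gamma=-0.04, zeta=-1.3, eta=-0.08)

for n in (1_000_000, 4_000_000, 16_000_000):
    print(n, round(joint_scaling_loss(n, k=32, **coeffs), 4))
```

With coefficients of these signs, the predicted loss falls as n grows at fixed k and flattens towards the n-independent term, which is the qualitative behaviour that makes loss predictable before training.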
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. Bricken, Templeton, Batson, Chen, Jermyn et al. — Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Anthropic / Transformer Circuits Thread, October 4, 2023
- 2. Templeton, Conerly, Marcus, Lindsey, Bricken et al. — Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Anthropic / Transformer Circuits Thread, May 21, 2024
- 3. Gao, Dupré la Tour, Tillman, Goh, Troll, Radford, Sutskever, Leike, Wu — Scaling and Evaluating Sparse Autoencoders, arXiv:2406.04093, June 6, 2024; published as ICLR 2025 Oral
- 4. Anthropic — Towards Monosemanticity companion blog post; lists the 4,096-feature one-layer-transformer SAE and the feature catalogue including Arabic script, base64, DNA, Hebrew, legal language, HTTP-request headers
- 5. Anthropic — The Claude 3 Model Family announcement (March 4, 2024); parameter counts not disclosed in official Anthropic materials; the ~70B figure is an independent industry estimate and not an Anthropic-stated number
- 6. Scaling Monosemanticity — Limitations section explicitly states the single-residual-layer scope and the conditioning-on-high-activation quirk
- 7. Anthropic — Mapping the Mind of a Large Language Model (companion blog post); documents the Golden Gate Bridge steering experiment and the safety-relevant feature catalogue including sycophantic praise, code-backdoor, bioweapons, power-seeking, manipulation, secrecy, gender and racial bias features
- 8. LessWrong commentary thread — Comments on Anthropic's Scaling Monosemanticity; independent critique of the artefact-of-procedure objection, the high-activation conditioning quirk, and the unevenness of per-feature causal evidence
- 9. Scaling and Evaluating Sparse Autoencoders — ar5iv HTML render; Section 2.3 (AuxK) and Section 3 (scaling laws) document the ~7% dead-latent rate at the 16M-latent GPT-4 SAE and the joint scaling law fit
- 10. Scaling and Evaluating Sparse Autoencoders — fitted scaling-law parameters (α, β_k, β_n, γ, ζ, η) reported for GPT-4; Section 3 also documents the TopK vs ReLU + L1 ablations and the encoder-transpose initialisation
- 11. shehper/sparse-dictionary-learning GitHub repository — open-source reproduction of Towards Monosemanticity; documents commodity-GPU training feasibility
- 12. Galileo AI commentary on Towards Monosemanticity citing the ~70% human-rated interpretability rate for the 4,096-feature SAE; secondary source — the Anthropic report itself frames the rating less crisply and the precise percentage should be treated as approximate
- 13. Scaling Monosemanticity feature catalogue section — qualitative tour of features including transit infrastructure, programming languages, sycophantic praise, deception, unsafe code, biological-weapons content, bias features
- 14. shehper/sparse-dictionary-learning — open-source reproduction of Paper A; confirms feature-interpretability claim on independently trained SAEs
- 15. openai/sparse_autoencoder — official Paper C reference implementation; ships training code and GPT-2 small reference SAEs; visualisation tool included
- 16. ICLR 2025 OpenReview thread for Scaling and Evaluating Sparse Autoencoders; reviewer discussions on the scaling law's extrapolation, the metric framework's ground-truth tether, and missing baselines