Neural Tech Daily
ai-research

KV cache compression: H2O, SnapKV, StreamingLLM, and prompt caching reviewed

Multi-paper review of four approaches to the linear KV-cache growth problem: H2O heavy-hitters, SnapKV observation window, StreamingLLM attention sinks, Anthropic…

Updated ~52 min read
Share
Figure 1 of Zhang et al. (arXiv:2306.14048, H2O): visualization of accumulated attention scores across decoding steps showing the heavy-hitter phenomenon — a small subset of tokens absorbs a disproportionate share of attention, the empirical observation that motivates H2O's eviction policy and anchors this multi-paper review of KV cache compression

Reading-register key

  • From the paper: claims drawn from a source paper’s text, equations, tables, or figures.
  • [Analysis]: the publication’s own reasoned assessment, distinct from any claim the cited paper itself makes.
  • [Reviewer Perspective]: critical or speculative assessment beyond what any single paper proves.
  • [External comparison]: comparison to prior work or general knowledge outside the four covered works.
  • [Reconstructed]: faithful reconstruction from partial disclosure.

Section 1: Cluster scope

This review covers four approaches to the linear KV-cache growth problem in transformer inference: H2O (Zhang et al., NeurIPS 2023), 1 SnapKV (Li et al., NeurIPS 2024), 2 StreamingLLM (Xiao et al., ICLR 2024), 3 and Anthropic’s Prompt Caching API (announced August 2024, generally available December 2024). 4 5

Three of the four are research papers from full venues; the fourth is a production caching system. They are grouped here because they target the same underlying constraint — the key-value cache that transformers must retain across decoding steps grows linearly with sequence length, and at long contexts the cache dominates GPU memory before model weights do. Each work attacks the problem differently: H2O selects which tokens to evict based on accumulated attention; SnapKV uses a small observation window at the end of the prompt to predict which prompt tokens future generation will attend to; StreamingLLM keeps a handful of “sink” tokens at the start plus a recent window; Anthropic’s system reuses a cached prefix across separate API calls and charges differential pricing.

[Analysis] Reading these four as a cluster, rather than four isolated wins, is the right way to understand the design space. They sit on a spectrum from “drop most of the cache and hope the model still works” (H2O, SnapKV) through “drop the middle, keep the edges” (StreamingLLM) to “don’t drop anything but charge less for re-use” (Anthropic). Each makes different trade-offs about which information is recoverable.

Paper classifications. All four fall under inference method · efficiency · optimisation. H2O and SnapKV are token-level KV eviction methods. StreamingLLM is a streaming-inference method that also extends effective context length. Anthropic Prompt Caching is a deployment-level caching primitive, not a paper, included because it is the dominant production answer to the same constraint and the comparison sharpens the research framing.

Reader prerequisites. High-school algebra; basic familiarity with what a transformer is (a model that processes a sequence by repeatedly applying attention) is helpful but the Glossary in Section 2.5 brings the curious reader up to speed on every prerequisite term used.

Section 2: TL;DR for the cluster

Large language models generate text one token at a time, and for every token they generate they re-use stored data from every previous token. That stored data is called the key-value cache — a big table sitting in GPU memory whose size grows directly with how long the conversation has been. By the time a conversation reaches 100,000 tokens, the cache for a 70-billion-parameter model can be larger than the model itself, and the GPU runs out of memory. This cluster reviews four approaches to that problem.

The four ideas, in plain English. H2O noticed that only about a fifth of the tokens carry most of the attention weight; it keeps those plus the most recent few, throws the rest away, and reports up to 29x throughput improvement on benchmarks tested. SnapKV uses the last 32 tokens of the prompt as a “vote” — those tokens already know what the answer needs to look at, so SnapKV asks them which earlier tokens matter and keeps only those, achieving 3.6x faster generation and 8.2x memory savings at 16K input length per the paper. StreamingLLM discovered that the first four tokens of any sequence absorb a disproportionate share of attention regardless of their content (the “attention sink” effect), and showed that keeping those four plus a sliding recent window lets a 4096-context model handle 4 million tokens with stable perplexity. Anthropic Prompt Caching keeps the full cache untouched but allows it to be shared across API requests; subsequent reads cost 10% of the base input price and writes cost 25% more.

Five practitioner takeaways:

  • H2O is the right baseline for offline batch inference where average accuracy can absorb a small hit; the GitHub implementation works on OPT and LLaMA families.
  • SnapKV is the strongest of the three for long-prompt retrieval-style workloads (RAG, document QA); the paper reports negligible Needle-in-a-Haystack drop down to a 1024-token cache.
  • StreamingLLM is the right choice for genuinely streaming workloads (chatbots, transcription, agent loops) where total context exceeds the model’s pretraining window and old information is genuinely disposable.
  • Anthropic Prompt Caching is the right choice when the prefix is fixed (a system prompt, a long document, a tool definition list) and request volume is high enough to amortize the 25% write premium — break-even is around 2 cache hits per write.
  • The four approaches are not exclusive. SnapKV plus StreamingLLM’s sink-retention plus prompt caching at the API layer can stack.

Pipeline overview. H2O and SnapKV operate at decoding time inside the model’s attention kernels, modifying which key-value pairs the model attends to. StreamingLLM operates at the cache-management layer, choosing which tokens stay in the cache buffer. Anthropic Prompt Caching operates above the model entirely — it is an API-level system that reuses the full cache across requests rather than compressing it within a request.

Section 2.5: Glossary

TermPlain-English explanationFirst appears in
KV cacheThe stored keys and values from every previous token that a transformer needs to compute attention for the next token. Grows linearly with sequence length.Section 1
AttentionThe operation a transformer uses to mix information from every previous token into the current one. Each token has a query, every previous token has a key and a value.Section 3
TokenA small chunk of text (often a word or word-piece) that the model processes as one unit.Section 2
DecodingGenerating output one token at a time. Each new token requires attending to all previous tokens.Section 3
SoftmaxA mathematical operation that turns a list of numbers into a probability distribution that sums to 1. Attention uses softmax over the query-key scores.Section 6
PerplexityA standard measure of how well a language model predicts text — lower is better. A perplexity of 5 means the model is “as uncertain as if it had 5 equally likely choices” on each token.Section 9
EvictionDeleting a token’s key-value pair from the cache so its memory can be re-used.Section 5
Attention sinkA token that consistently absorbs a large share of attention regardless of its content; in StreamingLLM, the first few tokens of any sequence.Section 5
Submodular functionA function on sets where adding an item to a small set helps more than adding it to a large set; the math justification for greedy algorithms.Section 6
Needle-in-a-HaystackA long-context benchmark where a single sentence is buried in a long document and the model is asked to retrieve it; tests whether long-context handling preserves specific information.Section 9
PrefixThe unchanging start of a prompt — typically a system message, instructions, or a long document — that gets re-used across many requests. The unit Anthropic Prompt Caching operates on.Section 5
TTLTime-to-live; how long a cached entry remains valid before being discarded. Anthropic offers 5-minute and 1-hour TTLs.Section 5
[Analysis] labelThe publication’s own reasoned assessment, distinct from what any single paper itself claims.Throughout
[Reviewer Perspective] labelA critical or speculative assessment that goes beyond what the papers prove.Section 11 + 12
[Reconstructed] labelContent the publication faithfully reconstructed from partial disclosure.Where used
[External comparison] labelA comparison to prior work or general knowledge outside the four covered works.Section 4 + 11
”From the paper:” prefixContent directly supported by a paper’s text, equations, tables, or figures.Throughout

Section 3: Problem formalisation (cluster-wide)

All four works target the same operation. A transformer processes a sequence of tokens x1,x2,,xnx_1, x_2, \ldots, x_n and at each position ii computes three projections of the token’s embedding: a query qiq_i, a key kik_i, and a value viv_i, each a vector of length dheadd_{\text{head}}. Attention output at position ii is

attni=j=1iαijvj,αij=exp(qikj/dhead)m=1iexp(qikm/dhead)\text{attn}_i = \sum_{j=1}^{i} \alpha_{ij} \cdot v_j, \quad \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_{\text{head}}})}{\sum_{m=1}^{i} \exp(q_i \cdot k_m / \sqrt{d_{\text{head}}})}

The keys kjk_j and values vjv_j for every j<ij < i must be available when computing position ii. During autoregressive decoding, ii increases one token at a time, and the model retains every previous (kj,vj)(k_j, v_j) pair in the KV cache. Cache size grows as O(Lnhdhead2b)\mathcal{O}(L \cdot n \cdot h \cdot d_{\text{head}} \cdot 2 \cdot b) where LL is the number of transformer layers, nn is sequence length, hh is the number of attention heads, dheadd_{\text{head}} is per-head dimension, the factor 2 is keys plus values, and bb is bytes per element (2 for FP16, 1 for INT8).

Concrete number. For a Llama-2-7B-class model at 128K context: 32 layers ×\times 128,000 tokens ×\times 32 heads ×\times 128 dim ×\times 2 (K and V) ×\times 2 bytes \approx 64 GB. The model weights themselves are 14 GB in FP16. The cache is over 4x the model. [External comparison] This is the constraint that vLLM’s PagedAttention 12 addresses at the allocator level (fragmentation), that StreamingLLM addresses by truncating the middle, and that H2O and SnapKV address by dropping unimportant entries entirely.

Notation table used across the rest of the review:

SymbolTypeMeaningFirst appears in
nnintegertotal sequence lengthSection 3
wobsw_{\text{obs}}integerSnapKV’s observation-window size, typically 32Section 6
krecentk_{\text{recent}}integernumber of recent tokens always retainedSection 6
khhk_{\text{hh}}integernumber of heavy-hitter tokens kept in H2OSection 6
ksinkk_{\text{sink}}integernumber of attention-sink tokens kept in StreamingLLM, typically 4Section 6
CRnC \in \mathbb{R}^{n}vectoraccumulated attention score per token (H2O score, SnapKV vote tally)Section 6
πθ\pi_\thetadistributionthe language model viewed as a token distributionSection 6
αij\alpha_{ij}scalarattention weight from query ii to key jjSection 3

Assumptions. All three research methods assume that attention is sparse in a structured way: most attention weight goes to a small subset of previous tokens. [Analysis] Potentially strong assumption — this holds well on tasks studied (summarisation, QA, perplexity, retrieval) but degrades on tasks where the model genuinely needs uniformly distributed attention (some reasoning tasks; the SnapKV paper itself flags multi-hop reasoning in passing).

Computational complexity of the underlying problem. Without compression, attention at position nn costs O(n)\mathcal{O}(n) memory accesses per layer and the cache occupies O(n)\mathcal{O}(n) memory. Compression methods aim to make memory O(k)\mathcal{O}(k) where knk \ll n is the budget. None of the four claims to reduce compute below O(n)\mathcal{O}(n) during prompt processing; their gains are at the decoding stage and at memory occupancy. Anthropic Prompt Caching is different in kind: it doesn’t reduce intra-request cost; it amortizes prompt-processing cost across multiple requests.

Section 4: Motivation and gap

The motivating problem is shared. For long-context inference (RAG over book-length documents, multi-turn agent loops, codebase-scale tools), the KV cache is the dominant memory consumer and a major latency bottleneck. [External comparison] Pope et al. (2022) formalised this constraint for inference-at-scale 10 and Beltagy et al.’s Longformer 11 showed years earlier that fixed-pattern sparse attention can replace full attention at training time. The four works in this cluster all assume the model was trained with full attention and the compression must happen at inference.

H2O’s gap framing. From the paper: existing KV cache management before H2O was either (a) keep the entire cache (memory blow-up), (b) drop fixed early tokens (window attention, which causes catastrophic accuracy collapse), or (c) recompute on each step (sliding-window with recomputation, which is correct but slow). H2O’s contribution is the observation that token importance is highly skewed and recoverable from accumulated attention weights, plus the formal submodular framing that justifies a greedy eviction policy.

SnapKV’s gap framing. From the paper: H2O and similar accumulated-attention methods require updating importance scores at every decoding step, which scales with output length. SnapKV’s insight is that for long-prompt-short-output workloads (the most common production pattern), the prompt’s own structure tells you which tokens future generation will need before generation starts. This collapses the per-step bookkeeping into a one-shot selection right after the prompt finishes.

StreamingLLM’s gap framing. From the paper: window attention (keeping only the most recent kk tokens) is the simplest cache truncation and the most catastrophic when applied naively. Llama-2-13B with window attention and a 1024-token window blows perplexity from \sim5.4 to 5158.07 on a 65K-token stream per the paper. The gap is understanding why — and the answer is the attention-sink phenomenon. Once the first few tokens are restored, window attention works.

Anthropic Prompt Caching’s framing. From Anthropic’s docs: the gap addressed is across-request re-use, not within-request compression. If the same 50K-token system prompt is prepended to 1000 different user messages, the existing inference stack re-processes all 50K tokens 1000 times. Prompt Caching writes the prefix’s KV state once (at a 25% premium) and reads it for subsequent requests (at 10% of base price), with a 5-minute or 1-hour TTL.

[Analysis] The four occupy different points in a 2x2: within-request vs across-request, and lossy vs lossless. H2O, SnapKV, StreamingLLM are within-request lossy. Anthropic Prompt Caching is across-request lossless. [Reviewer Perspective] The right production stack often combines them: lossless prefix caching at the API layer plus lossy decode-time compression on long generations.

Section 5: Method overview

5.1 H2O — Heavy-Hitter Oracle

Plain-English intuition. Some tokens in a sequence absorb a lot of attention from many future positions — H2O calls these “heavy hitters.” Other tokens get glanced at once and never touched again. If you only have a memory budget for, say, 20% of the cache, you should keep the heavy hitters plus a few recent tokens, and throw away the rest.

Mechanism. From the paper: 1 at each decoding step, maintain a cache of size kk. When the cache fills, compute an accumulated attention score per cached token and evict the token whose removal causes the smallest drop in total attention coverage. The 20% budget the paper experiments with splits as roughly 10% recent (the most recent positions) and 10% heavy hitters (those with highest accumulated scores) [Reconstructed] from the paper’s experimental description.

Design rationale. From the paper: heavy hitters correlate strongly with frequently-occurring tokens, suggesting they are linguistically central to the sequence and removing them disproportionately damages downstream perplexity. The paper’s ablation reports that dropping only heavy hitters causes a 23% accuracy collapse while keeping only recent tokens causes similar damage — both are needed.

Classification: [New] framing of accumulated-attention eviction as a dynamic submodular maximisation problem with a greedy approximation. The eviction-by-attention-score idea itself has prior art [External comparison] (token pruning literature pre-2023); H2O’s contribution is the formal framing and the production-grade open-source implementation.

5.2 SnapKV — observation-window selection

Plain-English intuition. Right before the model starts generating, look at the last 32 tokens of the prompt and ask each of them: “which earlier prompt tokens do you attend to?” Take the union of their top choices, keep those, and throw the rest of the prompt cache away.

Mechanism. From the paper: 2 let wobs=32w_{\text{obs}} = 32 be the observation-window size and let the prefix length be LprefixL_{\text{prefix}}. After the prompt forward pass:

  1. Extract the attention weights WobsRwobs×LprefixW_{\text{obs}} \in \mathbb{R}^{w_{\text{obs}} \times L_{\text{prefix}}} from the last wobsw_{\text{obs}} query positions onto the prefix.
  2. Sum down the query axis to get per-key votes: C=iWobs[:,i,:]RLprefixC = \sum_{i} W_{\text{obs}}[:,i,:] \in \mathbb{R}^{L_{\text{prefix}}}.
  3. Apply 1-D max-pooling on CC with kernel size 551313 (the paper’s ablation lands on 13 by default) to smooth single-token spikes into cluster-level scores.
  4. Select the top-kk indices per attention head where k=pLprefixk = \lfloor p \cdot L_{\text{prefix}} \rfloor with pp commonly set to 0.05 (5% retention).
  5. Keep the selected KV pairs plus the entire observation window; discard the rest of the prefix cache.

Design rationale. From the paper: the observation window’s attention pattern is a strong predictor of which prefix tokens generation will continue to attend to. Pooling matters because attention concentrates around tokens that sit next to each other (a phrase, a clause), and unpooled top-kk can pick a single token while missing its neighbours, breaking the local coherence.

What breaks if removed. The paper’s ablation shows pooling raises LongEval-Lines retrieval accuracy materially over sparse top-kk. Without pooling, SnapKV degrades to a less robust attention-based selection.

Classification: [Adapted] — the observation-window idea adapts known attention-based pruning to a one-shot, prompt-time decision. [New] is the pooling step and the formal Hit Rate evaluation.

5.3 StreamingLLM — attention sinks plus recent window

Plain-English intuition. In any transformer, the first few tokens of the sequence end up absorbing a lot of attention from later positions — not because they carry meaning but because softmax has to sum to 1 and the model needs somewhere to “park” excess attention. If you throw those tokens away, the model collapses. If you keep them plus a sliding window of recent tokens, the model handles arbitrarily long streams.

Mechanism. From the paper: 3 retain ksink=4k_{\text{sink}} = 4 initial tokens (the “attention sinks”) plus the most recent krecentk_{\text{recent}} tokens (typically 1020, sized so ksink+krecentk_{\text{sink}} + k_{\text{recent}} equals the model’s pretraining window). The cache is a rolling buffer: when a new token arrives, it joins the recent window; the oldest non-sink token gets evicted. Critically, positional encodings are assigned by position within the cache, not by the token’s original sequence index. This lets a model trained at 4096 context handle 4 million tokens without ever experiencing a position embedding it wasn’t trained on.

The sink phenomenon. From the paper: the softmax denominator jexp(qikj/dhead)\sum_j \exp(q_i \cdot k_j / \sqrt{d_{\text{head}}}) has to sum to a fixed total. When the model has computed all the “real” attention it needs, the leftover weight has to go somewhere, and initial tokens — which are visible to every subsequent position and rarely carry decisive meaning — become the natural sinks.

Trainable single-sink modification. From the paper: pretraining a model with one dedicated learnable sink token at position 0 produces a model where keeping just that single token (rather than 4) matches the 4-sink result. The paper reports 27.87 perplexity vs 1235 perplexity for a cold-cache cache-only-recent baseline, confirming the sink mechanism.

Classification: [New] — the attention-sink phenomenon, the rolling-cache mechanism, and the position-within-cache encoding are jointly the paper’s primary contribution. Window attention itself is prior art [External comparison] (Longformer 11 ); the sink-token fix is the key novel piece.

5.4 Anthropic Prompt Caching

Plain-English intuition. Instead of compressing the cache, Anthropic’s API system stores it. If your next request starts with the same 50K-token prefix as your last one, the server hands you the cached KV state instead of recomputing it.

Mechanism. From the docs: 5 the developer marks specific blocks of a prompt with a cache_control field. On first use, the API computes the full KV state for the marked prefix and stores it server-side. Subsequent requests that begin with the exact same prefix (byte-identical) hit the cache. Pricing differential is the core contract: cache writes cost 1.25x base input on a 5-minute TTL or 2.0x base on a 1-hour TTL; cache reads cost 0.1x base input. There are minimum cache sizes per model — for Claude Opus 4.7 the minimum is 4096 tokens; for Claude Sonnet 4.6 and 4.5 it is 1024 tokens. The system supports up to 4 explicit cache breakpoints per request.

What breaks if removed. Without prefix caching, identical 50K-token system prompts re-process from scratch on every API call. Anthropic publicly claims up to 90% cost reduction and up to 85% latency reduction on long-prefix workloads. 4

Classification: [Adapted] to production infrastructure of a long-known technique (KV cache reuse). The framing as a customer-visible API with explicit TTL and differential pricing is what makes Anthropic Prompt Caching distinctive; the underlying mechanism (store and re-attach a KV state across requests) is straightforward.

Section 6: Mathematical contributions

MATH ENTRY 1: H2O’s accumulated attention score

  • Source: Section 3 of arXiv:2306.14048
  • What it is: a per-token importance score that sums up how much attention each cached token has received across decoding so far.
  • Formal definition: oi:=Di1exp(qiKSi),Fscore(T):=sToso_i := D_i^{-1} \cdot \exp(q_i \cdot K_{S_i}^\top), \quad F_{\text{score}}(T) := \sum_{s \in T} o_s
  • Each term explained and dimensional analysis:
    • qiRdheadq_i \in \mathbb{R}^{d_{\text{head}}} is the query vector at the current decoding step (e.g., dhead=128d_{\text{head}}=128).
    • KSiRSi×dheadK_{S_i} \in \mathbb{R}^{\mid S_i\mid \times d_{\text{head}}} is the matrix of key vectors for the current cache set SiS_i (e.g., Si=256\mid S_i\mid =256, so 256×128256 \times 128).
    • qiKSiq_i \cdot K_{S_i}^\top is a length-Si\mid S_i\mid vector of pre-softmax attention scores.
    • DiD_i is the softmax normaliser (a scalar that makes the exponential sum to 1 over SiS_i).
    • oiRSio_i \in \mathbb{R}^{\mid S_i\mid } is the per-token attention weight vector at step ii.
    • Fscore(T)F_{\text{score}}(T) accumulates these per-token weights into a single score per token across all decoding steps so far; higher means the token is consistently attended to.
  • Worked numerical example: take Si=4\mid S_i\mid =4 keys with raw scores qiKSi=[3.0,1.0,4.0,0.5]q_i \cdot K_{S_i}^\top = [3.0, 1.0, 4.0, 0.5]. Softmax gives oi[0.234,0.032,0.636,0.019]o_i \approx [0.234, 0.032, 0.636, 0.019] (token 3 absorbs 63.6% of attention). After 5 decoding steps with similar patterns, FscoreF_{\text{score}} might read [1.0,0.3,3.1,0.15][1.0, 0.3, 3.1, 0.15]. Token 3 is the clear heavy hitter; if eviction must drop one, drop token 4 (lowest score).
  • Role: the per-step oio_i accumulates into FscoreF_{\text{score}} which is the H2O eviction criterion. At each step the eviction picks u=argmaxvFscore(Si1{i}{v})u = \arg\max_v F_{\text{score}}(S_{i-1} \cup \{i\} \setminus \{v\}) — the token whose removal minimises total attention coverage loss.
  • Edge cases: in the very first kk decoding steps the cache hasn’t filled yet, so no eviction happens. After fill, eviction is one-per-step.
  • Novelty: [Adapted] accumulated attention as a token-importance proxy is known. [New] is the submodular framing.
  • Why it matters: this is the entire selection criterion. Every other piece of H2O is bookkeeping around this scalar.

MATH ENTRY 2: H2O’s submodular guarantee

  • Source: Theorem in Section 4 of arXiv:2306.14048
  • What it is: a theoretical bound stating that H2O’s greedy eviction policy is approximately optimal under a submodularity assumption.
  • Formal statement: f(S~i)(1α)(11e)maxS=kf(S)βf(\tilde{S}_i) \geq (1-\alpha)\left(1 - \frac{1}{e}\right) \max_{|S|=k} f(S) - \beta
  • Each term explained:
    • S~i\tilde{S}_i is the greedy cache set H2O computes.
    • f(S)=Fscore(S)f(S) = F_{\text{score}}(S) is the total attention coverage of set SS.
    • maxS=kf(S)\max_{\mid S\mid =k} f(S) is the best possible size-kk cache (computationally intractable to find).
    • (11/e)0.632(1 - 1/e) \approx 0.632 is the classical submodular greedy approximation factor.
    • α,β\alpha, \beta are slack terms accounting for the dynamic (online) nature of the problem.
  • Worked numerical example: if the best static size-kk cache achieves f(S)=10.0f(S^*) = 10.0, and α=0.1,β=0.2\alpha = 0.1, \beta = 0.2, the greedy bound guarantees f(S~)0.90.63210.00.2=5.49f(\tilde{S}) \geq 0.9 \cdot 0.632 \cdot 10.0 - 0.2 = 5.49. The greedy cache captures at least 55% of optimal attention coverage in this example.
  • Proof sketch (paraphrased from paper Section 4): submodularity says marginal utility decreases as the set grows — adding the best token to a near-empty cache helps more than adding it to a nearly-full cache. Under that property, the classical Nemhauser et al. (1978) result gives the 11/e1 - 1/e guarantee for static greedy. The paper extends this to the dynamic (one-eviction-per-step) setting by treating each decoding step as a fresh subproblem and tracking how much utility the greedy choices can drift from the offline optimum; that drift is captured by the α,β\alpha, \beta slack. The proof does not claim worst-case tightness — it claims the greedy bound transfers from the static to the dynamic regime with bounded loss.
  • Role: theoretical justification for why a local rule (drop the lowest-scoring token now) produces near-optimal global behaviour.
  • Edge cases: the submodularity assumption is empirical, not proven, for attention. [Analysis] This is the load-bearing assumption; the bound holds only insofar as attention coverage is genuinely submodular.
  • Novelty: [New] framing for KV eviction. The submodular greedy theorem itself is classical.
  • Why it matters: it converts H2O from “this trick works on benchmarks” into “this trick has a structural reason to work.”

MATH ENTRY 3: SnapKV’s voting selection

  • Source: Section 3 of arXiv:2404.14469
  • What it is: a one-shot selection rule that picks important prefix tokens by aggregating attention from the observation window.
  • Formal definition: C=i=LprefixwobsLprefixW[i,:Lprefix],C=MaxPool1D(C,kernel=13),I=TopK(C,k)C = \sum_{i=L_{\text{prefix}} - w_{\text{obs}}}^{L_{\text{prefix}}} W[i, :L_{\text{prefix}}], \quad C' = \text{MaxPool1D}(C, \text{kernel}=13), \quad I = \text{TopK}(C', k)
  • Each term explained:
    • WRLprefix×LprefixW \in \mathbb{R}^{L_{\text{prefix}} \times L_{\text{prefix}}} is the prefix-stage attention matrix (per head).
    • W[i,:Lprefix]W[i, :L_{\text{prefix}}] slices row ii, giving query ii‘s attention over all prefix keys.
    • The sum aggregates votes from the last wobs=32w_{\text{obs}} = 32 query positions (the observation window).
    • CRLprefixC \in \mathbb{R}^{L_{\text{prefix}}} is the raw per-key vote tally.
    • MaxPool1D applies a 1-D max-pool of kernel size 13, padding 6, stride 1; smooths spikes into cluster scores.
    • II is the set of selected indices (top k=0.05Lprefixk = \lfloor 0.05 \cdot L_{\text{prefix}} \rfloor in the paper’s default 5% retention setting), computed per attention head.
  • Worked numerical example: take a prefix of length Lprefix=20L_{\text{prefix}}=20 with observation window wobs=4w_{\text{obs}}=4 and a budget of k=4k=4. Suppose summed attention C=[0.1,0.4,0.5,0.3,0.0,0.1,0.8,0.7,0.6,0.2,0.0,0.0,0.1,0.3,0.4,0.2,0.0,0.5,0.6,0.0]C = [0.1, 0.4, 0.5, 0.3, 0.0, 0.1, 0.8, 0.7, 0.6, 0.2, 0.0, 0.0, 0.1, 0.3, 0.4, 0.2, 0.0, 0.5, 0.6, 0.0]. A naive top-4 picks indices {6,7,8,18}\{6, 7, 8, 18\}. With pooling (kernel 3 here for tractability), CC' raises the cluster around indices 6–8 to roughly [0.5,0.5,0.8,0.8,0.8,0.8,0.8,0.8,0.8,0.6,0.2,0.1,0.3,0.4,0.4,0.4,0.5,0.6,0.6,0.6][0.5, 0.5, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.6, 0.2, 0.1, 0.3, 0.4, 0.4, 0.4, 0.5, 0.6, 0.6, 0.6] — now top-4 picks a wider span around the actual cluster, capturing more of the phrase containing the answer.
  • Role: defines exactly which prefix tokens survive into the compressed cache.
  • Edge cases: when LprefixkL_{\text{prefix}} \leq k, no compression happens (cache fits). When the observation window itself spans a high-attention region, votes self-reinforce; the paper’s separate observation window plus selected prefix design keeps these distinct.
  • Novelty: [New] the pooling step within an attention-based selection method. [Adapted] the attention-based vote idea itself.
  • Why it matters: this single equation is SnapKV. Everything else in the paper validates this selection.

MATH ENTRY 4: StreamingLLM’s softmax sink mechanism

  • Source: Section 3 of arXiv:2309.17453
  • What it is: the formal account of why the first tokens of any sequence absorb attention.
  • Formal definition: for attention row ii, softmax(x)j=exp(xj)m=1iexp(xm),j=1isoftmax(x)j=1\text{softmax}(x)_j = \frac{\exp(x_j)}{\sum_{m=1}^{i} \exp(x_m)}, \quad \sum_{j=1}^{i} \text{softmax}(x)_j = 1
  • Each term explained:
    • xRix \in \mathbb{R}^{i} is the row of pre-softmax scores from query ii to keys 1,,i1, \ldots, i.
    • The denominator sums exponentials of all scores; the constraint that softmax outputs sum to 1 is what forces the “park” behaviour.
    • When query ii‘s “real” attention covers only a few keys, the remaining mass has to go somewhere — and the first few tokens are the natural attractors because they are visible to all subsequent queries and so get repeatedly trained as sinks.
  • Worked numerical example: take 5 keys with pre-softmax scores x=[5.0,2.0,1.5,1.0,4.5]x = [5.0, -2.0, -1.5, -1.0, 4.5]. Softmax gives [0.609,0.001,0.001,0.002,0.387][0.609, 0.001, 0.001, 0.002, 0.387]. Key 1 (the “sink”) and key 5 (the legitimate match) split the mass; keys 2–4 are nearly suppressed. If we now mask key 1 and re-normalise the remaining 4 keys, the same logits [2.0,1.5,1.0,4.5][-2.0, -1.5, -1.0, 4.5] give [0.001,0.002,0.004,0.993][0.001, 0.002, 0.004, 0.993] — almost all mass now lands on key 5, which is a much sharper distribution than the layer was trained for. The model’s downstream computation breaks because the attention values it produces look nothing like what training distributed.
  • Role: the theoretical hook for why initial-token retention restores window-attention performance.
  • Edge cases: if the model is pretrained with a dedicated sink token (the paper’s trainable variant), the sink role is consolidated into that one token and the phenomenon localises cleanly.
  • Novelty: [New] empirical identification of the sink phenomenon plus the trainable-sink-token mitigation.
  • Why it matters: the entire StreamingLLM design follows from this observation. Without the sink mechanism, the rolling-buffer architecture has no reason to work.

MATH ENTRY 5: Anthropic Prompt Caching break-even

  • Source: derived from the pricing schedule in Anthropic’s docs 5
  • What it is: the threshold of cache hits required to amortize the 25% write premium against the 90% read discount.
  • Formal definition: let rr be the read-to-base ratio (0.1 for cache reads), let ww be the write-to-base ratio (1.25 for 5-min, 2.0 for 1-hour), let NN be the number of times the prefix is read. Total cost relative to N uncached requests is Costcached(N)=w+(N1)r,Costuncached(N)=N\text{Cost}_{\text{cached}}(N) = w + (N-1) \cdot r, \quad \text{Cost}_{\text{uncached}}(N) = N Break-even is at w+(N1)r=Nw + (N-1) \cdot r = N, giving N=wr1rN^* = \frac{w - r}{1 - r}
  • Each term explained:
    • w=1.25w = 1.25 for 5-min TTL or w=2.0w = 2.0 for 1-hour TTL.
    • r=0.1r = 0.1 for cache reads.
    • NN^* is the number of total uses (including the initial write) needed before caching beats not caching.
  • Worked numerical example: for 5-min TTL, N=(1.250.1)/(10.1)=1.15/0.91.28N^* = (1.25 - 0.1)/(1 - 0.1) = 1.15/0.9 \approx 1.28. So break-even is at the second use (the write plus one read). For 1-hour TTL, N=(2.00.1)/0.92.11N^* = (2.0 - 0.1)/0.9 \approx 2.11 — break-even at the third use.
  • Role: the operational cost analysis a practitioner runs before adopting Prompt Caching.
  • Edge cases: if cache invalidates (system prompt changes, tool definitions change), N>2N^* > 2 becomes hard to hit on 5-min TTL.
  • Novelty: [Analysis] derivation from Anthropic’s pricing schedule; Anthropic publishes the schedule but does not frame the break-even formula explicitly.
  • Why it matters: this is the deployment decision criterion. If your prefix is reused less than 2 times in 5 minutes, don’t cache.

Section 7: Algorithmic contributions

ALGORITHM ENTRY 1: H2O greedy eviction

  • Source: Algorithm 1 of arXiv:2306.14048
  • Purpose: maintain a fixed-size cache during decoding by evicting the lowest-attention-coverage token at each step.
  • Inputs: cache budget kk, query stream q1,q2,q_1, q_2, \ldots, per-step keys kik_i and values viv_i.
  • Outputs: a cache SiS_i of size at most kk at every decoding step.
Pseudocode reproduction of H2O's greedy eviction algorithm: at each decoding step, compute attention from the current query to all cached tokens, accumulate the per-token attention scores, and when the cache exceeds budget evict the token with the smallest accumulated score; keep the most recent k_recent tokens always
  • Hand-traced example. Take k=3k=3, decoding step i=4i=4, and suppose current cache is S3={t1,t2,t3}S_3 = \{t_1, t_2, t_3\} with Fscore=[0.6,0.2,0.4]F_{\text{score}} = [0.6, 0.2, 0.4]. New token t4t_4 arrives; its attention pattern this step is [0.3,0.1,0.1,0.5][0.3, 0.1, 0.1, 0.5] over {t1,t2,t3,t4}\{t_1, t_2, t_3, t_4\}. Updated Fscore=[0.9,0.3,0.5,0.5]F_{\text{score}} = [0.9, 0.3, 0.5, 0.5]. Cache is now over budget (4 > 3). Evict the lowest — t2t_2 at 0.3. Resulting cache S4={t1,t3,t4}S_4 = \{t_1, t_3, t_4\} with Fscore=[0.9,0.5,0.5]F_{\text{score}} = [0.9, 0.5, 0.5].
  • Complexity. Time per decoding step: O(k)\mathcal{O}(k) for the attention pass (over the kk-sized cache, plus the new token) and O(k)\mathcal{O}(k) for the eviction scan. Space: O(k)\mathcal{O}(k) cache plus O(k)\mathcal{O}(k) for the score vector. The bottleneck step is the attention pass itself, which scales with kk not nn — this is where the throughput gain comes from.
  • Hyperparameters. kk (cache budget; 20% of full cache in the paper); split between krecentk_{\text{recent}} and khhk_{\text{hh}} (typically 50/50 of the budget per the paper). The paper reports the split sensitivity in its ablation — pure heavy-hitters or pure recent both collapse; the 50/50 split is robust across budgets from 10% to 50%.
  • Failure modes. Tasks with uniform attention (some reasoning chains) violate the heavy-hitter assumption; H2O degrades to roughly window-attention behaviour in those regimes. The paper does not extensively test reasoning benchmarks.
  • Novelty: [New] framing; [Adapted] underlying technique.
  • Transferability: [Analysis] H2O drops in cleanly on any decoder-only transformer with standard attention. Adoption inside vLLM and other serving stacks would require integrating the score accumulator into the attention kernel, which is non-trivial but already done in the reference repo. 6

ALGORITHM ENTRY 2: SnapKV one-shot prefix selection

  • Source: Algorithm 1 of arXiv:2404.14469
  • Purpose: compress the prefix cache once, immediately after prompt processing, with no further per-step work.
  • Inputs: prefix length LprefixL_{\text{prefix}}, observation window wobs=32w_{\text{obs}} = 32, budget k=pLprefixk = \lfloor p \cdot L_{\text{prefix}} \rfloor with pp commonly 0.05, pooling kernel size 13.
  • Outputs: a compressed per-head cache of size k+wobsk + w_{\text{obs}}.
function SnapKV(prefix_kv, attention_weights, w_obs, p, kernel):
    L = prefix length
    k = floor(p * L)
    obs_attn = attention_weights[L - w_obs : L, :L]      # last w_obs queries
    C = sum over query axis of obs_attn                  # per-key votes
    C_smoothed = MaxPool1D(C, kernel_size=kernel,
                            padding=kernel // 2, stride=1)
    selected = TopK indices of C_smoothed, taking k      # per attention head
    kept_kv = prefix_kv at indices (selected union last w_obs)
    return kept_kv
  • Hand-traced example. Take Lprefix=20,wobs=4,p=0.2L_{\text{prefix}} = 20, w_{\text{obs}} = 4, p = 0.2 (so k=4k = 4), kernel =3= 3. Attention from queries {17,18,19,20}\{17,18,19,20\} over keys {1,,20}\{1,\ldots,20\} aggregates to votes CC (use the same vector from MATH ENTRY 3). MaxPool with kernel 3 broadens the high-vote cluster around indices 6–8. TopK picks {6,7,8,18}\{6, 7, 8, 18\} before pooling; with pooling the broader cluster {6,7,8,19}\{6, 7, 8, 19\} wins. Final cache contains those 4 plus the observation window {17,18,19,20}\{17, 18, 19, 20\} — 8 entries out of 20, a 2.5x compression.
  • Complexity. Time: one matrix slice O(wobsLprefix)\mathcal{O}(w_{\text{obs}} \cdot L_{\text{prefix}}), one sum O(Lprefix)\mathcal{O}(L_{\text{prefix}}), one pool O(Lprefix)\mathcal{O}(L_{\text{prefix}}), one TopK O(Lprefixlogk)\mathcal{O}(L_{\text{prefix}} \log k). Total: O(Lprefixlogk)\mathcal{O}(L_{\text{prefix}} \log k). Critically, this runs once per prompt, not per generation step. Space: O(Lprefix)\mathcal{O}(L_{\text{prefix}}) for the vote vector during selection, O(k+wobs)\mathcal{O}(k + w_{\text{obs}}) for the resulting cache.
  • Hyperparameters. Observation window wobs=32w_{\text{obs}}=32; budget ratio p=0.05p=0.05 default (the paper reports 0.05 as the sweet spot but tests 0.01 to 0.5); pooling kernel 5–13. Paper ablation finds kernel 13 robust across the tested length regime.
  • Failure modes. If the prompt’s last 32 tokens are atypical of the rest of the prompt (the user’s question doesn’t share structure with the document being queried), the observation window’s votes mis-predict what matters. The paper does not stress this case.
  • Novelty: [New] the pool-then-select step; [Adapted] the broader attention-based selection framework.
  • Transferability: [Analysis] SnapKV plugs into Hugging Face Transformers easily for inference; the reference repo 7 supports Llama, Mistral, Mixtral, and Command-R. Training-time integration is unnecessary because SnapKV is purely an inference-time compressor.

ALGORITHM ENTRY 3: StreamingLLM rolling cache

  • Source: Section 3 of arXiv:2309.17453
  • Purpose: maintain a fixed-size streaming KV cache that can run indefinitely.
  • Inputs: sink count ksink=4k_{\text{sink}} = 4, recent-window size krecentk_{\text{recent}}, token stream.
  • Outputs: a cache always of size ksink+krecentk_{\text{sink}} + k_{\text{recent}}.
function StreamingCache(token_stream, k_sink, k_recent):
    sinks = []       # first k_sink tokens encountered
    recent = deque(maxlen = k_recent)
    for token in token_stream:
        if len(sinks) < k_sink:
            sinks.append(KV(token))
        else:
            recent.append(KV(token))
        # positional encoding assigned by within-cache index,
        # NOT by original sequence position
        compute_attention(cache = sinks + list(recent),
                          positions = range(0, k_sink + len(recent)))
  • Hand-traced example. Take ksink=2,krecent=3k_{\text{sink}} = 2, k_{\text{recent}} = 3, stream of 7 tokens. After tokens 1, 2: sinks = {t1,t2}\{t_1, t_2\}, recent = {}\{\}. After token 3: sinks = {t1,t2}\{t_1, t_2\}, recent = {t3}\{t_3\}. After 4, 5: recent = {t3,t4,t5}\{t_3, t_4, t_5\}. After 6: recent evicts t3t_3, becomes {t4,t5,t6}\{t_4, t_5, t_6\}, cache state is {t1,t2,t4,t5,t6}\{t_1, t_2, t_4, t_5, t_6\}. Positions used during the attention computation for t7t_7 are {0,1,2,3,4}\{0, 1, 2, 3, 4\} — the index within the cache, not {0,1,3,4,5}\{0, 1, 3, 4, 5\} which would be the original sequence positions.
  • Complexity. Time per token: O(ksink+krecent)\mathcal{O}(k_{\text{sink}} + k_{\text{recent}}) for attention. Space: O(ksink+krecent)\mathcal{O}(k_{\text{sink}} + k_{\text{recent}}). Constant memory regardless of total stream length — the headline efficiency claim.
  • Hyperparameters. ksink=4k_{\text{sink}} = 4 for general models (paper finds diminishing returns past 4); ksink=1k_{\text{sink}} = 1 for models pretrained with the dedicated sink-token modification; krecentk_{\text{recent}} sized to fill out the model’s pretraining context window.
  • Failure modes. Any task where the needed information sits in the dropped middle of the stream. The paper is explicit that StreamingLLM does not extend the model’s useful context; it extends the model’s operational context. The model still can only recall information within the current cache window.
  • Novelty: [New] — the sink mechanism, the rolling buffer with cache-local positions, and the trainable single-sink variant jointly.
  • Transferability: [Analysis] StreamingLLM ports to any decoder transformer with RoPE or ALiBi positional encoding per the paper. Reference implementation 8 ships patched attention kernels for Llama, MPT, Falcon, Pythia.

ALGORITHM ENTRY 4: Anthropic Prompt Caching usage pattern

  • Source: Anthropic Prompt Caching docs 5
  • Purpose: re-use a prefix’s KV state across API requests.
  • Inputs: prompt blocks (system, tools, messages); developer-supplied cache_control markers on blocks; TTL choice (5 minutes or 1 hour).
  • Outputs: response plus cache_creation_input_tokens and cache_read_input_tokens accounting fields.
# Pseudocode reflecting Anthropic's public API contract
response = client.messages.create(
    model="claude-opus-4-7",
    system=[
        {"type": "text",
         "text": large_static_document,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    ],
    messages=[{"role": "user", "content": dynamic_user_question}],
)
# response.usage:
#   cache_creation_input_tokens: K  (only on the first call; charged at write rate)
#   cache_read_input_tokens:     K  (on subsequent calls; charged at read rate)
#   input_tokens:                tokens after the last cache breakpoint
  • Hand-traced example. Suppose the static document is 50,000 tokens and the user question is 200 tokens. Call 1: cache_creation_input_tokens = 50000, cache_read_input_tokens = 0, input_tokens = 200. Cost on Opus 4.7 with 5-min TTL: 50000 \times 1.25 \times \``5/MTok + 200 times ``5/\MTok = \0.3125 + $0.001 = $0.314.Call2within5minuteswithsameprefix:cachecreationinputtokens=0,cachereadinputtokens=50000,inputtokens=200.Cost:. Call 2 within 5 minutes with same prefix: `cache_creation_input_tokens = 0`, `cache_read_input_tokens = 50000`, `input_tokens = 200`. Cost: 50000 \times 0.1 \times “5/MTok + 200 times “5/\MTok = $0.025 + $0.001 = $0.026.Across10calls,total=. Across 10 calls, total = $0.314 + 9 \times $0.026 = $0.548vsvs10 \times $0.251 = $2.51$ uncached. 78% saving.
  • Complexity. The developer’s bookkeeping is O(1)\mathcal{O}(1) per request; the server-side cost is a hash lookup. Limit of 4 explicit breakpoints per request per the docs. 5
  • Hyperparameters. TTL choice; breakpoint placement; minimum cacheable token count (4096 for Opus 4.7, 1024 for Sonnet 4.6 and 4.5 per current docs).
  • Failure modes. Cache invalidates on prefix byte-changes, tool definition changes, image add/remove, speed-setting changes, and thinking-parameter changes; all flagged in the docs. [Reviewer Perspective] the invalidation list is long enough that small prompt iteration during development is the practical pain point.
  • Novelty: [Adapted] to a customer-visible API; underlying KV-reuse mechanism is well-known across serving stacks.
  • Transferability: [Analysis] direct only to Anthropic’s API. Equivalent capabilities ship on OpenAI’s API (prompt caching, automatic, no developer markers) and Google Vertex AI (context caching, explicit). The economics differ; the mechanism is conceptually identical.

Section 8: Specialised design contributions

Subsection 8A — LLM / prompt design. Not applicable to this paper cluster; none of the four operates by changing prompts.

Subsection 8B — Architecture-specific details. All three research methods operate inside attention computation without changing the layer stack, positional encoding scheme (except StreamingLLM’s within-cache reindexing), or any other architectural element. They are inference-time interventions only, not training-time modifications. The exception is StreamingLLM’s optional trainable-sink-token modification, which requires pretraining with one additional learnable token prepended to every sequence — a minor but training-time-only change.

Subsection 8C — Training specifics. Not applicable; the three papers are inference-only methods.

Subsection 8D — Inference / deployment specifics.

  • H2O ships per-layer eviction integrated with the attention kernel; the reference repo 6 supports Hugging Face Transformers’ generation loop.
  • SnapKV implements its compression in a one-shot operation right after prompt processing, returning a compressed cache to the standard generation loop. No kernel changes required beyond cache rewriting.
  • StreamingLLM patches the attention forward function to use within-cache positions and a rolling buffer.
  • Anthropic Prompt Caching requires no client-side machinery beyond setting the cache_control field on the relevant prompt blocks.

Section 9: Experiments and results

H2O experimental results

  • Datasets. OpenBookQA, PIQA, COPA, MathQA, RTE, WinoGrande, XSUM, CNN/Daily Mail per the paper. [Analysis] A breadth-over-depth selection; no genuinely long-context retrieval benchmark in the original H2O paper (the long-context evaluation literature was still nascent in mid-2023).
  • Baselines. Full-cache inference; DeepSpeed Zero-Inference; FlexGen; Hugging Face Accelerate.
  • Key results from the paper. At 20% KV cache budget, OPT-30B on PIQA scores 79.22 vs 80.09 baseline (a 0.87 point drop); COPA matches at 85.00 vs 85.00. Throughput: 29x improvement vs DeepSpeed Zero-Inference and Hugging Face Accelerate; 3x vs FlexGen on the paper’s benchmark configurations.
  • Ablations. Dropping only recent tokens or only heavy hitters both produce \sim23% accuracy collapse; the combined 50/50 split is robust.
  • Limitations the paper itself names. Tasks requiring uniform attention coverage are not extensively tested.

SnapKV experimental results

  • Datasets. LongBench (16 long-context benchmarks); Needle-in-a-Haystack (NIAH); LongEval-Lines; RAG end-to-end evaluation on Command-R.
  • Baselines. Full-cache inference; H2O; StreamingLLM (where comparable).
  • Key results from the paper. Mistral-7B with 1024-token SnapKV cache beats H2O on 11 of 16 LongBench tasks. Needle-in-a-Haystack accuracy preserved while extending tested sequence length up to 380K tokens on a single A100-80GB at batch size 1 — a 380x compression. Decoding speed at 16K input length: 3.6x faster than baseline. Memory: 8.2x improvement, lifting maximum sequence at batch size 2 from 16K to 131K tokens.
  • Ablations. Pooling significantly improves LongEval-Lines retrieval; observation-window size 32 chosen after sweep; max-pool kernel 13 robust across length regimes.
  • RAG integration. Citation F1-score retention 98.8% (a 1.2-point drop); end-to-end RAG quality -2.1 points vs full cache.
  • Compatibility. Combines with Medusa parallel decoding for 2.2x speedup on 10K-length sequences per the paper.

StreamingLLM experimental results

  • Datasets. PG19 (long-document perplexity); ARC-Easy, ARC-Challenge in streaming question-answering form; StreamEval (paper’s own contributed benchmark).
  • Baselines. Dense (full) attention; window attention; sliding-window-with-recomputation.
  • Key results from the paper. Llama-2-13B perplexity over 65K tokens: dense fails (OOM); window attention 5158.07; sliding-window-with-recomputation \sim5.40; StreamingLLM 5.40. Llama-2-7B-Chat on ARC-Easy/Challenge in streaming form: dense (OOM), window 3.58/1.39, StreamingLLM 71.25/53.16. Stable language modeling over 4M tokens reported on Llama-2, MPT, Falcon, Pythia. Up to 22.2x per-token speedup over sliding-window-with-recomputation.
  • Trainable single-sink ablation. Pretraining with one dedicated sink token reduces the required initial-token retention from 4 to 1 with matching performance.
  • What the paper explicitly does NOT do. Extend the model’s useful retrieval context — only the operational window. Information outside the cache is gone.

Anthropic Prompt Caching reported numbers

  • From Anthropic’s announcement. 4 Up to 90% cost reduction and up to 85% latency reduction on long-prefix workloads; 79% latency improvement reported for 100K-token prompts; 86% cost reduction reported for the 10K-token-scenario examples.
  • [Analysis] these are best-case marketing numbers, not independent-benchmark-cross-checked. [Reviewer Perspective] the actual cost reduction in practice depends entirely on the cache-hit-to-write ratio, which Anthropic does not disclose for customer workloads.

Independent benchmark cross-checks and SOTA claims

[External comparison] SnapKV’s LongBench claims have been independently reproduced in third-party long-context benchmarking work in 2024–2025; the broad finding that SnapKV maintains LongBench quality at 1024-token cache holds across replications. [Analysis] H2O’s 29x throughput claim is against three specific baselines (DeepSpeed Zero-Inference, FlexGen, Hugging Face Accelerate) under the paper’s test setup; vLLM (released after H2O) and TGI changed the inference-stack landscape after H2O’s paper, so the 29x number is not directly comparable to current production stacks. The H2O quality claim (accuracy preserved at 20% budget) is more durable.

StreamingLLM SOTA framing applies to streaming-inference perplexity stability, not to long-context retrieval — the paper is explicit about this. There is no genuine SOTA claim on retrieval; the comparison is window attention (collapses) vs dense (OOM) vs StreamingLLM (stable).

Anthropic Prompt Caching has no peer-reviewed independent benchmark; the public claims are vendor-published. Practitioner experience reports on developer forums broadly confirm the qualitative direction (substantial cost and latency reduction for fixed-prefix workloads) without independently validating the specific percentages.

Evidence audit

  • Strongly supported. H2O preserves accuracy at 20% budget; SnapKV preserves NIAH retrieval at 1024-token cache; StreamingLLM stabilises perplexity over 4M tokens with sink-plus-window; Prompt Caching reduces effective input cost when prefix re-use rate exceeds break-even.
  • Partially supported. H2O’s 29x throughput claim (baseline-specific); SnapKV’s 8.2x memory claim (configuration-specific to 16K input length). Anthropic’s specific cost-reduction percentages.
  • Narrow evidence. H2O’s accuracy on reasoning benchmarks (not tested); StreamingLLM’s performance on tasks requiring long-distance retrieval (the paper says explicitly this isn’t the target use case).

Section 10: Technical novelty summary

ComponentTypeNovelty levelJustificationSource
Heavy-hitter accumulated-attention scoringH2OCombination novelPre-existing attention-pruning prior; H2O combines with the submodular framingarXiv:2306.14048
Submodular dynamic eviction guaranteeH2OFully novelFirst formal framing of KV eviction as dynamic submodular optimisationarXiv:2306.14048
Observation-window votingSnapKVFully novelOne-shot prompt-time selection using last-w_obs queriesarXiv:2404.14469
MaxPool1D smoothing on attention votesSnapKVFully novelCluster-preserving selection steparXiv:2404.14469
Attention-sink empirical discoveryStreamingLLMFully novelFirst identification of softmax-driven sink phenomenon at first positionsarXiv:2309.17453
Rolling cache with within-cache positionsStreamingLLMCombination novelRolling buffers known; within-cache positional reindexing is the new piecearXiv:2309.17453
Trainable single-sink tokenStreamingLLMFully novelPretraining modification that consolidates the sink rolearXiv:2309.17453
Differential write/read pricing for KV cacheAnthropicAdopted from infrastructureKV reuse is standard; the customer-facing API contract is the deployment noveltyAnthropic docs
5-minute and 1-hour ephemeral TTLsAnthropicAdoptedStandard caching TTL pattern adapted to LLM contextAnthropic docs

Single most novel contribution across the cluster. StreamingLLM’s attention-sink discovery. The other contributions are clever inference-time optimisations; the sink finding is a scientific result that explains a previously-unexplained transformer behaviour and motivated subsequent research. The H2O paper independently observed that some tokens dominate attention; StreamingLLM provided the mechanistic explanation tied to the softmax constraint.

What the cluster does NOT claim to be novel. Sparse attention as a concept (Longformer, BigBird era); KV cache reuse as a mechanism (standard in serving stacks for years before Anthropic’s API); attention as a score for token importance (pre-2023 pruning literature); positional encoding flexibility (RoPE prior).

Section 11: Situating the work

What prior work did. Sparse-attention training methods 11 reduced attention compute at training time by restricting which token pairs interact. Inference-time KV cache management before this cluster was largely “keep everything or recompute” — production stacks like vLLM 12 focused on allocator-level efficiency (no fragmentation) rather than dropping entries.

What these papers change conceptually. Cache compression became principled. Each paper either provides a formal selection criterion (H2O’s submodular framing, SnapKV’s observation-window voting) or a structural insight (StreamingLLM’s sink phenomenon) that turns “drop some tokens” from heuristic into method. Anthropic’s API moves the conversation up a level: the right question for production may not be “what to drop within a request” but “how to share what we keep across requests.”

Contemporaneous related papers (≥2 required).

  • FastGen (Ge et al., 2023). A complementary work that classifies attention heads into typed behaviours (window, sparse, dense) and applies different compression to each. [External comparison] FastGen and H2O were near-simultaneous; FastGen’s typed-head framework is broader, H2O’s submodular framing is tighter.
  • KIVI (Liu et al., 2024). A quantisation-based approach that keeps every token but quantises keys and values to 2 bits, achieving similar memory savings to H2O without dropping any tokens. [External comparison] orthogonal: KIVI compresses per-token; the methods in this cluster select which tokens to keep. They stack.
  • vLLM with PagedAttention (Kwon et al., 2023). 12 Production-stack contemporaneous: addresses cache fragmentation, not cache size. The right answer in production stacks them: PagedAttention for allocation, H2O or SnapKV for size, prompt caching for re-use.

[Reviewer Perspective] strongest skeptical objection. All three research methods compete to lose less accuracy at the same memory budget. Quantisation methods (KIVI, AWQ-cache) can hit similar memory targets while keeping all tokens, side-stepping the assumption that token-importance is reliably predictable. If KIVI-class methods continue improving, the eviction-based methods become a narrower niche.

[Reviewer Perspective] strongest author-side rebuttal. Even with aggressive quantisation, the cache scales linearly in tokens; quantisation gives a constant-factor memory win but doesn’t change the asymptotic. For genuinely long contexts (>500K tokens) the eviction methods plus quantisation together is the production answer, not either alone.

What remains unsolved.

  • Adaptive budgets. All four methods use fixed or fixed-ratio budgets. Whether the right budget depends on prompt content (uniform-attention prompts need more) is not deeply explored.
  • Cross-method composition. Combining H2O’s per-step eviction with SnapKV’s one-shot prompt selection, layered on top of Anthropic-style prefix caching, would be the production-stack ideal; no paper covers the joint design.
  • Long-distance retrieval under streaming. StreamingLLM is explicit that information outside the buffer is gone. Combining genuine retrieval (RAG) with streaming compression is a separate research direction.

Three future research directions.

  1. [Analysis] A unified framework for when to evict (H2O), what to keep (SnapKV), and how positions are encoded (StreamingLLM) within a single configurable cache policy.
  2. [Analysis] Workload-aware adaptive budgets — the right cache size for a multi-hop reasoning prompt is likely larger than for a single-fact retrieval prompt; no paper currently auto-selects.
  3. [Reviewer Perspective] Eviction policies for chain-of-thought reasoning specifically, where the model writes scratch tokens it must later attend back to; current methods may over-aggressively evict scratch.

Section 12: Critical analysis

Strengths with concrete evidence

  • H2O. The formal submodular framing makes it the most theoretically grounded of the three; throughput numbers (29x vs DeepSpeed) and accuracy retention at 20% budget (within \sim1 point on tested benchmarks per the paper) are durably impressive.
  • SnapKV. One-shot compression with no per-step overhead is the practical efficiency win; NIAH preservation at 1024-token cache is a strong long-context guarantee.
  • StreamingLLM. The attention-sink discovery is genuinely scientifically novel; the 22.2x speedup vs sliding-window-recomputation on streaming workloads is robust because the baseline (recomputation) is provably correct.
  • Anthropic Prompt Caching. The economic transparency (published pricing schedule, explicit TTLs, documented invalidation rules) makes adoption predictable for practitioners.

Weaknesses explicitly stated by the authors

  • H2O. The paper acknowledges the heavy-hitter assumption is empirical and may not hold uniformly.
  • SnapKV. The paper does not stress-test prompts where the last 32 tokens are atypical of the rest.
  • StreamingLLM. The paper is explicit: this does not extend useful context, only operational context.
  • Anthropic Prompt Caching. The docs acknowledge invalidation triggers (tool changes, image changes, speed-setting changes); developers iterating quickly will see frequent cache misses.

Weaknesses not stated or understated [Reviewer Perspective]

  • H2O. The 29x throughput number is against pre-2024 baselines; the production-stack landscape (vLLM, TGI, SGLang) has shifted and the H2O number doesn’t directly translate. [Reviewer Perspective] the qualitative throughput improvement persists; the quantitative multiplier is dated.
  • SnapKV. Behaviour on chain-of-thought generation (long outputs that reference back to scratch) is under-tested; SnapKV is optimised for long-input-short-output workloads.
  • StreamingLLM. The sink-token mechanism is empirical; the paper offers a softmax-mass argument but no formal proof. Whether the mechanism survives alternative attention designs (linear attention, gated attention) is open.
  • Anthropic Prompt Caching. Vendor-published metrics are not independently audited; the 90% cost and 85% latency reductions are not reproducible from public information without running matched workloads.

Reproducibility check

H2OSnapKVStreamingLLMAnthropic Prompt Caching
CodeReleased 6 Released 7 Released 8 Closed; API only 5
DataPublic benchmarks (LongBench, NIAH, ARC, PG19)Public benchmarks (LongBench, NIAH)Public (PG19, ARC)Not applicable
HyperparametersFully reported in paperFully reported in paperFully reported in paperNot applicable
ComputeReported partially (A100 references)Reported (single A100-80GB used)Reported partiallyNot applicable
Trained weightsNot applicable (inference method)Not applicableTrainable-sink variant: not releasedNot applicable
OverallFully reproducibleFully reproducibleFully reproducible (baseline); trainable-sink variant: notNot reproducible; API contract only

Methodology

  • Sample size. H2O: 8 benchmarks averaged. SnapKV: 16 LongBench tasks + NIAH up to 380K tokens. StreamingLLM: PG19 perplexity over 65K to 4M tokens + ARC streaming.
  • Evaluation set. All three use standard public benchmarks plus paper-specific stress tests (NIAH for SnapKV, StreamEval for StreamingLLM).
  • Baselines. H2O: 3 production inference stacks. SnapKV: full cache, H2O. StreamingLLM: dense, window, sliding-window-with-recomputation.
  • Hardware/compute. H2O reports NVIDIA T4 / A100. SnapKV reports A100-80GB at batch 1 and 2. StreamingLLM reports A100s. Anthropic Prompt Caching: not applicable; closed-system metric.

Generalisability

  • H2O and SnapKV. Within decoder-only transformers (Llama, OPT, Mistral families), portable. Encoder-decoder, multimodal, and non-attention architectures (state-space models, recurrent variants): the assumptions don’t directly transfer.
  • StreamingLLM. Within decoder-only transformers using RoPE or ALiBi, broadly portable. Linear-attention variants: untested.
  • Anthropic Prompt Caching. Direct only to Anthropic’s API; conceptually applicable to any LLM serving stack that supports KV state serialisation.

Assumption audit

  • Heavy-hitter assumption (H2O). Realistic on language modeling and most retrieval tasks; fragile on uniform-attention tasks like some symbolic reasoning.
  • Observation-window-as-predictor assumption (SnapKV). Realistic when the question structure matches the document structure; fragile when the question is structurally distinct.
  • Attention-sink mechanism (StreamingLLM). Empirical; holds in tested transformers; portability to alternative attention designs unverified.
  • Prefix invariance assumption (Anthropic Prompt Caching). Realistic for system prompts and long documents; fragile for development iteration where the prefix changes frequently.

What would make the cluster significantly stronger

[Analysis] A unified open benchmark that compares H2O, SnapKV, StreamingLLM, and quantisation methods (KIVI, AWQ-cache) at matched memory budgets across a common workload mix — long-input-short-output, long-input-long-output, chain-of-thought reasoning, streaming chat, RAG. No such benchmark currently exists; practitioners pick based on workload heuristics.

Section 13: What is reusable for a new study

REUSABLE COMPONENT 1: H2O accumulated-attention scoring

  • What it is: per-token cumulative attention scoring across decoding steps.
  • Why worth reusing: a simple, well-validated scalar signal for token importance.
  • Preconditions: standard softmax attention.
  • What would need to change in a different setting: re-derivation for linear-attention or gated-attention models.
  • Risks: the scalar collapses information across heads; per-head versions (SnapKV’s) may be stronger.
  • Interaction effects: compatible with quantisation; orthogonal to prefix caching.

REUSABLE COMPONENT 2: SnapKV observation-window selection

  • What it is: a one-shot prefix-compression mechanism using the last wobsw_{\text{obs}} query attentions.
  • Why worth reusing: zero per-step overhead, strong NIAH retention.
  • Preconditions: long-input-short-output workload pattern.
  • What would need to change in a different setting: long-output workloads need per-step updating; the observation window grows stale.
  • Risks: questions structurally distinct from the document mis-predict.
  • Interaction effects: stacks with StreamingLLM’s sink retention.

REUSABLE COMPONENT 3: StreamingLLM sink-plus-recent buffer

  • What it is: keep first ksinkk_{\text{sink}} tokens plus a sliding window with cache-local position encoding.
  • Why worth reusing: enables genuine streaming on a fixed memory budget.
  • Preconditions: RoPE or ALiBi positional encoding; tolerable that old information is lost.
  • What would need to change: workloads needing old-context retrieval require RAG-style augmentation.
  • Risks: information loss is permanent within the model’s view.
  • Interaction effects: stacks with SnapKV’s one-shot compression on the initial prompt before streaming begins.

REUSABLE COMPONENT 4: Anthropic Prompt Caching API contract pattern

  • What it is: customer-visible KV state reuse with explicit TTL and differential pricing.
  • Why worth reusing: a template for any LLM API where fixed prefixes dominate.
  • Preconditions: serving stack supports KV state serialisation per session.
  • What would need to change: TTL choice, write premium, read discount tuned to workload economics.
  • Risks: invalidation triggers (tool definition changes, etc.) eat the savings if developers iterate frequently.
  • Interaction effects: composes cleanly with within-request methods.

Dependency map. H2O, SnapKV, and StreamingLLM operate on the within-request cache and could stack (SnapKV compresses the prompt cache at handoff, StreamingLLM maintains the streaming buffer thereafter, H2O-like attention scoring could drive eviction of the recent buffer when full). Anthropic Prompt Caching sits one layer above and is orthogonal to all three.

Recommendation [Analysis]. The three highest-value reusable components are (1) the observation-window selection pattern from SnapKV, because it generalises beyond compression to any task requiring “what does the model already know it cares about”; (2) the sink-token mechanism from StreamingLLM, because it generalises beyond streaming to any context-extension method that re-uses a model trained at shorter length; and (3) the differential-pricing API contract from Anthropic, because it generalises beyond LLM inference to any high-fixed-cost computation.

[Analysis] New studies most benefiting from the cluster: long-context fine-tuning research (uses SnapKV-style selection for training-time data filtering); streaming-agent research (uses StreamingLLM’s buffer for tool-loop history); production-serving research (uses prompt caching’s economics framing for any per-customer-prefix workload).

Section 14: Known limitations and open problems

Authors’ stated limitations.

  • H2O: heavy-hitter assumption is empirical; no extensive reasoning-benchmark coverage.
  • SnapKV: observation-window-as-predictor assumption; no long-output workload evaluation.
  • StreamingLLM: does not extend useful context, only operational context.
  • Anthropic Prompt Caching: invalidation conditions are non-trivial; small changes can miss cache.

Limitations not stated [Reviewer Perspective].

  • H2O’s 29x throughput claim is dated against the post-2024 production-stack landscape.
  • SnapKV’s empirical robustness to chain-of-thought generation is under-tested.
  • StreamingLLM’s sink mechanism is empirically derived; formal characterisation under alternative attention designs is open.
  • Anthropic’s cost-reduction percentages are vendor-published and not independently audited.

Technical root cause of each.

  • All three within-request methods rely on attention sparsity holding in a structured way; tasks violating this (uniform-attention reasoning, scratch-pad chain-of-thought) are intrinsically the failure modes.
  • The Anthropic system inherits all the brittleness of byte-identical prefix matching.

Open problems.

  • Adaptive per-prompt budget selection.
  • Cross-method composition (SnapKV + StreamingLLM + prompt caching jointly).
  • Integration with retrieval to recover information evicted by within-request compression.
  • Theoretical guarantees for sink behaviour under non-softmax attention.

What a follow-up paper would need to solve to address the most critical limitation. The most consequential gap is the absence of a unified workload benchmark. A follow-up that defines a standard “long-context inference evaluation harness” covering the canonical workload mix (long-input-short-output, long-input-long-output, chain-of-thought reasoning, streaming chat, RAG) with matched memory budgets across H2O, SnapKV, StreamingLLM, quantisation methods, and full-cache baselines would let practitioners choose methodically rather than heuristically.

How this article reads at three depths

For the curious high-school reader. Large language models remember every word of a conversation by storing data in a big GPU table called the KV cache. That table grows linearly with conversation length and eventually overflows the GPU. The four works reviewed here solve the overflow problem in different ways — three by throwing away cache entries the model probably doesn’t need, one by sharing the cache across multiple conversations. The key surprise is that throwing away cache entries works at all, because some words turn out to absorb most of the attention while others get glanced at once and never referenced again.

For the working developer or ML engineer. For long-input-short-output workloads (document QA, RAG), SnapKV is the strongest within-request compression at 1024-token cache. For genuinely streaming workloads (chatbot history, agent loops), StreamingLLM is the answer — keep 4 initial tokens plus a sliding window, restore position encoding within the cache. For batch inference where average accuracy can absorb \sim1 point, H2O at 20% budget remains a solid baseline. When the workload has fixed prefixes (long system prompts, large documents reused across many requests), Anthropic Prompt Caching plus 1-hour TTL pays back in \geq3 requests; stack it under within-request compression for compound savings. The four are not mutually exclusive.

For the ML researcher. The cluster’s most durable scientific contribution is StreamingLLM’s attention-sink discovery — a mechanistic explanation tied to the softmax constraint that propagated rapidly into subsequent long-context work. H2O’s submodular framing is theoretically tightest but the heavy-hitter assumption is the load-bearing empirical claim. SnapKV’s contribution is the observation-window-as-predictor framing and the pooling step; both are likely to influence future selection-based methods. Anthropic Prompt Caching is engineering rather than research but the differential-pricing contract is a template worth studying. The single most useful follow-up would be a unified workload benchmark that lets selection-based and quantisation-based methods be compared at matched memory budgets across a representative workload mix — currently practitioners pick based on category heuristic, not measurement.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

  1. 1. Zhang et al. — H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (arXiv:2306.14048, NeurIPS 2023) (accessed )
  2. 2. Li et al. — SnapKV: LLM Knows What You are Looking for Before Generation (arXiv:2404.14469, NeurIPS 2024) (accessed )
  3. 3. Xiao, Tian, Chen, Han — Efficient Streaming Language Models with Attention Sinks (arXiv:2309.17453, ICLR 2024) (accessed )
  4. 4. Anthropic — Prompt Caching announcement (August 2024; GA December 2024) (accessed )
  5. 5. Anthropic — Prompt Caching documentation (current; cache TTL, minimum cacheable tokens per model, pricing multipliers) (accessed )
  6. 6. H2O reference implementation (FMInference/H2O on GitHub) (accessed )
  7. 7. SnapKV reference implementation (FasterDecoding/SnapKV on GitHub) (accessed )
  8. 8. StreamingLLM reference implementation (mit-han-lab/streaming-llm on GitHub) (accessed )
  9. 10. Pope et al. — Efficiently Scaling Transformer Inference (arXiv:2211.05102) (accessed )
  10. 11. Beltagy et al. — Longformer: The Long-Document Transformer (arXiv:2004.05150) (accessed )
  11. 12. Kwon et al. — Efficient Memory Management for Large Language Model Serving with PagedAttention / vLLM (arXiv:2309.06180) (accessed )

Further Reading

  • Vaswani et al. (2017). Attention Is All You Need. arXiv:1706.03762 (accessed )

Anonymous · no cookies set

Report a problem with this article

Articles are produced by an autonomous AI pipeline; mistakes do happen. Tell us what's wrong and the editorial review will revisit the claim.

Category

Found this useful? Share it.