LLM serving systems — PagedAttention, RadixAttention, and TensorRT-LLM in one technical reference
Multi-paper review of vLLM's PagedAttention, SGLang's RadixAttention, and NVIDIA's TensorRT-LLM — how each system manages the KV cache and where they diverge.
Reading-register key
- From the paper: claims drawn verbatim or near-verbatim from the source paper’s text, equations, tables, or figures.
- [Analysis]: the publication’s own reasoned assessment, distinct from any claim the reviewed sources make.
- [Reconstructed]: content faithfully reconstructed because the source partially disclosed it; flagged where used.
- [External comparison]: comparison to prior or contemporary work outside the reviewed sources.
- [Reviewer Perspective]: a critical or speculative assessment that goes beyond what the reviewed sources prove.
1. Paper identity and scope
This article is a multi-paper technical review of three LLM serving stacks that together define the open-source state of the art for high-throughput inference in 2024-2026: vLLM (the system built around PagedAttention), SGLang (built around RadixAttention plus a frontend programming model), and NVIDIA TensorRT-LLM (which is a production engineering artefact rather than a research paper, included here because the three stacks are routinely benchmarked against each other and a paper-review of the open-source line is incomplete without the closed-source comparator).
Paper A — PagedAttention / vLLM. Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180, first posted September 2023, published at SOSP 2023. 1
Paper B — SGLang / RadixAttention. Zheng, Yin, Xie, Sun, Huang, Yu, Cao, Kozyrakis, Stoica, Gonzalez, Barrett, Sheng. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104, first posted December 2023, accepted to NeurIPS 2024. 2
System C — TensorRT-LLM. NVIDIA’s open-source LLM inference library, layered on PyTorch with custom CUDA kernels, integrating with the Triton Inference Server. There is no peer-reviewed paper; the canonical reference is the public GitHub README plus the NVIDIA developer documentation. 3 Where TensorRT-LLM content carries [Reconstructed], the gap is the absence of a formal academic publication; the reconstruction is grounded in the README’s documented feature list.
Retrieval status. Both arXiv landing pages and ar5iv HTML renders were fetched 2026-05-20. The TensorRT-LLM GitHub README and the vLLM documentation landing page were fetched the same day. No supplementary material attachment was retrievable for either arXiv paper from arXiv-side; the SOSP 2023 venue-side proceedings were not opened during retrieval.
Paper classification. All three systems: Inference method, System architecture, Application. PagedAttention additionally: Memory management. SGLang additionally: Programming model, Cache scheduling. TensorRT-LLM additionally: Custom-kernel engineering, Quantization deployment.
Core technical domains: depth label per system.
- PagedAttention: deep (block-table arithmetic, attention kernel rewrite, preemption policy).
- RadixAttention: deep (radix tree operations, LRU eviction, cache-aware scheduling).
- TensorRT-LLM: moderate (system documented through README + docs rather than a paper; the depth available is feature-level, not algorithmic-derivation-level).
Technical abstract in publication voice. All three systems target the same bottleneck: serving an autoregressive transformer-decoder LLM at high request rate without the GPU memory devoted to the key-value (KV) cache becoming the limiting factor on batch size. The KV cache is the per-request state that grows token by token as the model generates, and naive contiguous allocation wastes between 60% and 80% of the budget on fragmentation and over-reservation per the PagedAttention paper. 4 vLLM solves the allocation problem by paging the KV cache into fixed-size blocks and computing attention block-by-block. SGLang solves the reuse problem one layer up by indexing those blocks in a radix tree keyed on token prefixes, so that two requests sharing a system prompt share the underlying physical blocks. TensorRT-LLM solves the kernel-level problem with custom fused CUDA kernels for the hot path (attention, GEMMs, MoE routing) plus aggressive quantization down to FP8, FP4, INT8, and INT4. The three are layerable rather than mutually exclusive: a production deployment in 2026 often runs SGLang’s scheduler on top of vLLM-style paged memory on top of TensorRT-LLM kernels.
Primary research question (per paper).
- PagedAttention: Can virtual-memory paging from operating systems be adapted to the KV cache of a transformer decoder, and if so, how much of the throughput budget can be reclaimed from fragmentation?
- SGLang: Can a runtime co-design a cache layout with a frontend programming model so that LLM programs sharing prefixes (chat sessions, few-shot prompts, agent loops) reuse compute at near-zero marginal cost?
- TensorRT-LLM: Not a research question; the engineering question is which combination of custom kernels, quantization, parallelism, and disaggregated prefill/decode delivers the highest tokens-per-second on NVIDIA hardware.
Core technical claim (per system).
- PagedAttention: 2 to 4 times higher throughput than FasterTransformer and Orca at comparable latency, by reducing KV cache memory waste from 60-80% down to under 4%. 5
- SGLang: up to 6.4 times higher throughput than existing systems including vLLM on workloads with structured output, shared prefixes, or branching control flow. 6
- TensorRT-LLM: state-of-the-art tokens-per-second on NVIDIA hardware, with the README citing specific gains like 3x improvement on Llama 3.3 70B from speculative decoding and 3x throughput boost on H200 from multiblock attention. 3
Reader prerequisites. High-school algebra; familiarity with how a transformer decoder generates one token at a time helpful but not required because the Glossary in Section 2.5 covers the autoregressive generation loop, KV cache, attention, and memory paging from first principles.
2. TL;DR and executive overview
3-sentence TL;DR. Modern LLMs spend most of their GPU memory on a per-request scratch space called the KV cache, and naive memory layouts waste up to 80% of that scratch space, which limits how many users a single GPU can serve at once. vLLM borrowed an idea from operating systems — paging — to slice the KV cache into fixed-size blocks and reclaim the wasted memory, lifting throughput by 2 to 4 times. SGLang then asked: if two users send prompts that share a common prefix (the same chatbot system message, for example), why recompute the prefix’s KV cache twice? — and built a radix tree to share it, lifting throughput by up to 6.4 times on prefix-heavy workloads.
Executive summary (≈100 words). Three serving stacks — vLLM, SGLang, and NVIDIA TensorRT-LLM — between them define how LLMs are served at production scale in 2026. vLLM contributed PagedAttention, which adapts virtual-memory paging to the KV cache and reclaims the throughput lost to memory fragmentation. SGLang built RadixAttention on top of paged memory and pushed the idea further by making the cache content-addressable on token prefixes, so chat sessions, few-shot prompts, and agent loops reuse compute. TensorRT-LLM is NVIDIA’s closed-research, open-source counterpart that combines aggressive kernel fusion with quantization down to four bits. The systems are layerable, not rival.
Five practitioner-relevant takeaways.
- PagedAttention is the new default. Any inference stack shipping in 2026 that doesn’t page the KV cache is leaving 2-4x throughput on the table, full stop.
- Block size is a knob worth tuning. vLLM defaults to 16 tokens per block; smaller blocks reduce internal fragmentation but increase block-table overhead, larger blocks do the inverse.
- Prefix sharing only helps when prefixes are actually shared. RadixAttention’s gains depend entirely on workload structure; benchmarks with random unique prompts see closer to vLLM-parity than to 6.4x.
- Recompute beats swap on most workloads. The PagedAttention paper measured recomputation overhead at no more than 20% of swapping’s latency for medium block sizes. 7
- TensorRT-LLM wins on NVIDIA hardware when you accept the lock-in. The custom kernels and FP8/FP4 quantization paths produce numbers vLLM and SGLang typically can’t match on pure throughput, at the cost of being tied to a single vendor’s GPUs.
Pipeline overview in text. Training-time vs inference-time matters: all three systems are inference-only and assume model weights are already fine-tuned and frozen. At inference time, an incoming request is tokenized into a sequence of input tokens; the prefill phase runs the full sequence through the transformer in parallel and produces the first output token plus a KV cache covering every input position; the decode phase then generates output tokens one at a time, each step appending one row to the KV cache and reading the entire cache to compute attention. PagedAttention rewrites the attention kernel to read from non-contiguous blocks; RadixAttention adds a tree-structured global index over those blocks so the prefill phase can skip recomputation when prefixes are already cached; TensorRT-LLM rewrites the underlying CUDA kernels for both prefill and decode in a way that is orthogonal to the paging and reuse decisions.
2.5 Glossary
| Term | Plain-English explanation | First appears in |
|---|---|---|
| KV cache | The per-request scratch space holding “key” and “value” vectors for every token the model has seen so far; attention reads from it on every decode step. Grows by one row per generated token. | Section 1 |
| Prefill phase | The forward pass over a request’s input prompt that runs in parallel across all input positions and produces the first output token plus the full KV cache for the prompt. | Section 2 |
| Decode phase | The sequential token-by-token generation phase that follows prefill; each step is one forward pass producing one new token and one new KV row. | Section 2 |
| Block (KV block) | A fixed-size contiguous chunk of GPU memory holding the K and V vectors for B tokens; vLLM’s default is B=16. The unit of allocation in PagedAttention. | Section 5 |
| Block table | A per-request mapping from logical block indices (0, 1, 2, …) to physical block addresses in GPU memory; analogous to an OS page table. | Section 5 |
| Radix tree | A space-optimized prefix tree where each node holds a token sequence rather than a single character; lookup and insertion are O(prefix length). | Section 5 |
| Continuous batching | A scheduling pattern where finished requests leave the batch and new ones join at every decode step, rather than waiting for a fixed batch boundary. | Section 7 |
| Speculative decoding | An inference acceleration technique where a small draft model proposes multiple tokens and the large model verifies them in one parallel pass. | Section 8 |
| Quantization (FP8, INT4) | Storing model weights and sometimes activations in fewer bits than the training precision; FP8 means 8-bit floating point, INT4 means 4-bit integer. | Section 8 |
| Throughput vs latency | Throughput is tokens generated per second across all concurrent requests; latency is wall-clock time for one request to finish. They trade off. | Section 9 |
| ”From the paper:” prefix | Content directly supported by the paper’s text, equations, tables, or figures. | Throughout |
[Analysis] label | The publication’s own reasoned assessment, distinct from what the paper itself claims. | Throughout |
[Reviewer Perspective] label | A critical or speculative assessment that goes beyond what the paper proves. | Section 11+12 |
[Reconstructed] label | Content the publication faithfully reconstructed because the paper only partially disclosed it. | Where used |
[External comparison] label | A comparison to prior work or general knowledge outside the reviewed papers. | Sections 4 + 11 |
3. Problem formalisation
Notation table.
| Symbol | Type | Meaning | First appears in |
|---|---|---|---|
| sequence | Input prompt tokens, length | Section 3 | |
| sequence | Generated output tokens, length | Section 3 | |
| integer | Per-head hidden dimension | Section 3 | |
| integer | Number of attention heads | Section 3 | |
| integer | Number of transformer layers | Section 3 | |
| integer | KV block size in tokens (vLLM default: 16) | Section 3 | |
| vector | Query vector for token , shape | Section 6 | |
| matrices | Key and value blocks for block , shape | Section 6 | |
| matrix | Attention scores from query over block | Section 6 | |
| vector | Attention output for token | Section 6 | |
| radix tree | The RadixAttention global cache index | Section 6 |
Formal problem statement. Given a frozen transformer-decoder LLM with layers, heads per layer, and per-head dimension , and a stream of inference requests where each request supplies a prompt and requests output tokens, the serving system must maximize aggregate output-tokens-per-second across the request stream subject to a per-request latency budget. The KV cache for one request occupies scalar slots — one half for keys and one for values, across every layer and head — multiplied by the precision (2 bytes for FP16, 1 for FP8). At realistic model sizes this is the dominant non-weight memory cost on the GPU.
Explicit assumption list.
- The transformer architecture is decoder-only with causal attention. Encoder-decoder models like T5 are not the target.
- Inference is autoregressive — one token per decode step. Speculative decoding relaxes this but does not change the cache structure.
- Memory is GPU-resident; the optimization budget is bounded by what fits on one or several connected GPUs.
[Analysis]Stronger assumption than it looks: workloads that fit in CPU memory but not GPU memory still need the swap/recompute machinery PagedAttention designs. - Requests are independent at the application layer; any sharing across requests (system prompts, few-shot examples) is opportunistic, not declared.
[Analysis] Potentially limiting assumptionfor SGLang specifically: the frontend’sfork/joinprimitives let an application declare sharing explicitly, but the runtime still has to detect it via radix-tree matching when applications don’t cooperate.
Formal complexity arguments. The hard problem here is not the per-step compute — that’s fixed by the model architecture — but the per-step memory bandwidth. On every decode step, the attention kernel reads the entire KV cache of every request currently in the batch. If the cache is fragmented, the reads are scattered; if the cache is shared but not indexed, the reads are duplicated. The systems under review each attack one slice of the bandwidth problem.
If causal: Not applicable to this paper — these are systems papers, not causal inference papers.
If data-driven: Not applicable — no training data; the systems operate on frozen weights.
If LLM-based: The LLM is the object being served; it is not a component of the serving system itself. (SGLang’s frontend primitives compose LLM calls but the LLM is still the substrate, not the design element.)
If theoretical: No formal theorems. The papers carry empirical claims with measured throughput numbers.
4. Motivation and gap
The real-world problem with concrete example. A company runs a customer-support chatbot powered by a 70-billion-parameter LLM. Each conversation begins with a 2,000-token system prompt that describes the company’s products, policies, and tone. A user types a 30-token question; the model generates a 150-token answer. The expensive part is the prefill of the 2,000-token system prompt — which the model has already prefilled millions of times before. Without a serving system aware of the shared prefix, every conversation pays the prefill cost from scratch.
Existing approaches and specific failure modes.
- Naive contiguous KV cache (HuggingFace Transformers’ default, FasterTransformer pre-2024): allocates the maximum-possible-length contiguous buffer per request up front. Wastes the slots between actual generated length and maximum length; cannot share across requests at all. Per Kwon et al., this layout uses only 20.4% to 38.2% of allocated KV memory for actual token states. 4
- Orca (OSDI 2022): introduced iteration-level scheduling (continuous batching) but kept the contiguous KV layout. Per the PagedAttention paper, vLLM sustains 1.7x to 2.7x higher request rates than Orca on ShareGPT with OPT-13B. 8
- FasterTransformer: NVIDIA’s pre-TensorRT-LLM inference library, also contiguous. vLLM beats it by up to 22x on request rate on ShareGPT with basic sampling per the paper. 9
Gap each paper claims to fill.
- PagedAttention fills the memory-fragmentation gap: existing systems waste 60-80% of KV memory; PagedAttention drops the waste to under 4%. 4
- RadixAttention fills the cross-request reuse gap: PagedAttention enables sharing in principle (via copy-on-write on shared physical blocks) but doesn’t ship a content-addressable index over the cache; SGLang adds the index. 10
- TensorRT-LLM fills the kernel-engineering gap: research-grade systems prioritize algorithmic correctness, and even after PagedAttention there is meaningful per-kernel performance left on the table.
Practical stakes / downstream consequences. At a major LLM API provider serving billions of tokens per day, a 2x throughput improvement on the inference stack translates roughly linearly into a 2x reduction in GPU fleet cost. [Analysis] The economic gravity is what made PagedAttention an instant industry standard within months of publication, and what keeps the SGLang vs vLLM vs TensorRT-LLM benchmark wars active.
[External comparison] Position in broader research landscape. Before PagedAttention, the canonical reference for high-throughput transformer inference was Orca (Yu et al., OSDI 2022), which contributed continuous batching but not paging. 8 The closest contemporary to PagedAttention is FlashAttention (Dao et al., 2022) and FlashAttention-2 (Dao, 2023), which optimize the attention kernel itself for memory bandwidth on a single sequence; PagedAttention is orthogonal because it operates on the cross-request layout, not the per-sequence kernel. The closest contemporary to RadixAttention is the prompt-cache literature; SGLang’s contribution is making the cache content-addressable on arbitrary prefixes rather than on pre-declared prompt templates.
5. Method overview
This section walks each system’s mechanism end-to-end.
5.1 vLLM / PagedAttention
Name + source. PagedAttention; Section 4 of Kwon et al. 2023. 1
Plain-English intuition. Imagine the GPU’s KV memory as a parking lot. The old approach reserves a whole row of contiguous spaces for each car, sized for the longest car you might ever park; most rows sit half-empty. PagedAttention instead paints the lot with uniform parking spaces and lets each car occupy as many spaces as it actually needs, scattered around. A little map at the entrance (the “block table”) tells you which spaces belong to which car.
Exact mechanism step-by-step.
- Partition GPU KV memory into fixed-size physical blocks. Each block holds the K and V vectors for tokens (default) across one transformer layer. 11
- For each incoming request, maintain a per-request block table mapping logical block indices to physical block addresses. The table also records the number of filled positions in the last logical block.
- On every decode step, the attention kernel reads the block table, fetches the relevant physical blocks (which may be non-contiguous), and computes attention block-by-block.
- When a logical block fills, allocate a new physical block on demand. When a request finishes, return its physical blocks to the free pool.
- When memory pressure forces preemption, evict an entire sequence-group’s blocks (gang scheduling) using either swap-to-CPU or drop-and-recompute strategies.
Connection to full pipeline. PagedAttention replaces the attention kernel and the memory allocator; the rest of the transformer (FFN layers, layer norm, embeddings) is unchanged. The block table is the central data structure; the scheduler consults it on every step.
Design rationale and tradeoffs. Block size is the key knob. Smaller means less internal fragmentation but more block-table entries and more kernel-launch overhead. Larger means the opposite. The paper’s default of 16 is empirically chosen to balance the two.
What breaks if removed. Strip the paging and the system reverts to contiguous allocation with 60-80% memory waste; throughput collapses by the same factor.
Classification. [New] — the paging-as-OS-virtual-memory adaptation to KV cache is the paper’s headline contribution.
5.2 SGLang / RadixAttention
Name + source. RadixAttention; Section 4 of Zheng et al. 2024. 2
Plain-English intuition. Imagine a library where every book starts with the same 50-page introduction, but the librarian writes out a fresh copy of those 50 pages for every visitor. RadixAttention is the librarian noticing that the introduction is shared, keeping one master copy in a shared shelf, and handing out pointers instead of pages. The trick is that the librarian also needs to remember which pointer corresponds to which book — solved by indexing the shelf as a radix tree on the first few words of each book.
Exact mechanism step-by-step.
- Maintain a radix tree whose paths are token sequences and whose leaves point to KV-cache blocks (paged, per vLLM convention). 10
- On every new request, walk to find the longest matching prefix. The matched portion’s KV cache is reused; only the suffix gets prefilled.
- After the request finishes, insert its full token sequence into so future requests can match against it.
- Use reference counting on tree nodes to prevent eviction of cache blocks still in use by active requests.
- Under memory pressure, recursively evict leaf nodes using LRU ordering.
- Schedule requests cache-aware: sort the pending queue by matched-prefix length so requests sharing prefixes batch together.
Connection to full pipeline. RadixAttention sits one layer above paged memory. The frontend programming model (gen, fork, join, select primitives) lets applications declare structural sharing explicitly; the runtime detects implicit sharing via tree matching.
Design rationale and tradeoffs. The radix tree adds per-request overhead (the tree walk plus reference-counter bookkeeping) that is paid even when no prefix matches. On unique-prompt workloads the overhead is pure cost. [Analysis] The system bets that real production workloads — chatbots, agents, few-shot prompting, self-consistency sampling — have substantial structural sharing, and the bet pays out on the paper’s benchmarks but won’t pay out on synthetic random-prompt loads.
What breaks if removed. Remove the radix tree and SGLang degrades to a vLLM-equivalent; the up to 6.4x gains over vLLM come specifically from the tree-based cache reuse plus cache-aware scheduling.
Classification. [Adapted] — radix trees are decades-old data structures (Morrison 1968); the contribution is the application to KV cache reuse with LRU eviction and cache-aware scheduling.
5.3 TensorRT-LLM
Name + source. TensorRT-LLM; NVIDIA GitHub repository README and developer documentation. 3
Plain-English intuition. Where vLLM and SGLang optimize what the GPU does in what order, TensorRT-LLM optimizes how the GPU does each thing. NVIDIA’s engineers write custom CUDA kernels for the attention computation, for the matrix multiplications inside each transformer layer, and for the mixture-of-experts routing logic, then fuse adjacent kernels into one launch to cut overhead.
Exact mechanism (per README).
- Provide custom CUDA kernels for attention (including FlashAttention-style fusion), GEMM, and MoE. 3
- Support quantization down to FP8, FP4, INT8, and INT4 for weights and activations.
- Implement paged KV cache (same idea as PagedAttention) plus in-flight batching (same idea as continuous batching from Orca/vLLM).
- Offer disaggregated prefill/decode — running the two phases on different GPUs or different scheduling lanes to avoid the prefill-blocks-decode head-of-line problem.
- Layer speculative decoding on top with several draft-model strategies (EAGLE, MEDUSA-style heads).
- Integrate with Triton Inference Server for production deployment, multi-LoRA adapter swapping, and multi-tenant routing.
Connection to full pipeline. TensorRT-LLM is a complete inference stack, not a single optimization. It is the closed-research engineering equivalent of “vLLM + SGLang + everything else” on NVIDIA hardware.
Design rationale and tradeoffs. Custom kernels deliver the absolute fastest throughput on NVIDIA GPUs at the cost of portability — the system is NVIDIA-only by design. [Reviewer Perspective] The lack of a formal paper makes external reproducibility studies essentially impossible; published benchmark numbers come from NVIDIA itself or from vendors with commercial relationships to NVIDIA.
What breaks if removed. Remove TensorRT-LLM from a production stack and you fall back to vLLM or SGLang at typically 70-90% of the throughput on NVIDIA hardware. [Reconstructed] This range comes from informal MLPerf-style comparisons reported across vendor blog posts; the exact gap depends on model, batch size, and precision.
Classification. [Adapted] for the algorithms (paged KV, continuous batching, speculative decoding all come from research it follows); [New] for the kernel-fusion engineering and the quantization-deployment path.
6. Mathematical contributions
MATH ENTRY 1: The PagedAttention block-wise attention computation.
- Source: Section 4.1, equations (3) and (4) of Kwon et al. 1
- What it is: a rewrite of the standard scaled-dot-product attention so the keys and values for past tokens are read block-by-block from non-contiguous memory rather than from one big contiguous tensor.
- Formal definition:
-
Each term explained AND its dimensional/type analysis:
- is the query vector for the token at position . With per-head (a typical Llama-2 head dim), is a length-128 vector.
- is the -th key block. With and , is a matrix.
- is the -th value block, same shape as .
- produces a length- vector of dot-products: the attention scores from query over the tokens in block .
- is the scaling factor (a scalar).
- holds the softmax-normalized attention weights for query over block .
- is the all-ones vector used to broadcast the normalization across the block.
- is the number of blocks needed to cover positions 1 through .
- is the attention output: a weighted sum of the value vectors, with the weights given by .
-
Worked numerical example. Take (toy dimension), (toy block size), and a sequence that has produced tokens so far. Then blocks. Suppose block 1 holds tokens 1-2, block 2 holds tokens 3-4, and block 3 holds token 5 (partial). For query position with and :
- .
- Divide by : .
- Exponentiate: .
- Continue for blocks 2 and 3 to compute the full normalizing denominator, then normalizes the block-1 scores against that denominator.
- then produces the partial contribution to from block 1, and is the sum of partial contributions across all three blocks. The point of the rewrite is that the three matrices live in non-contiguous physical memory — the block table tells the kernel where each one is.
-
Role: this is the inner loop of vLLM’s attention kernel. Every decode step for every request executes one application of this formula per layer per head.
-
Edge cases: the last block is typically partial (fewer than filled positions); the kernel masks the unfilled positions before the softmax. The first block of the sequence requires no past KV.
-
Novelty:
[Adapted]— the math is standard scaled-dot-product attention; the contribution is making the formula explicit on a block-structured cache. -
Transferability:
[Analysis]directly transferable to any attention variant whose KV cache can be split along the sequence dimension (i.e., causal attention and most efficient-attention variants). Does NOT transfer to attention patterns that need global non-causal lookahead. -
Why it matters: this is the kernel rewrite that turns the OS-paging analogy into actual GPU code; without it, the block table would just be a memory-allocation trick with no kernel support.
MATH ENTRY 2: PagedAttention’s KV cache memory cost per request.
- Source: implied by Section 2 of Kwon et al. 1
- What it is: the total memory in bytes a single request occupies in the KV cache.
- Formal definition:
- Each term explained AND its dimensional/type analysis:
- The leading 2 is for K and V (two tensors of the same shape).
- is the number of transformer layers (Llama-2-7B: ; Llama-2-70B: ).
- is the number of attention heads per layer (Llama-2-7B: ; Llama-2-70B: ).
- is the per-head dimension (Llama-2-7B: ).
- is the prompt length, is the generated length; their sum is the sequence length at the end of generation.
- is the bytes per scalar (FP16: ; FP8: ; INT4: effective).
- Worked numerical example. For Llama-2-7B at FP16 with a 1,000-token sequence:
- bytes MB per request.
- On a 40GB A100, after deducting ~13GB for the 7B weights at FP16 and ~3GB for activations and workspace, roughly 24GB remains for the KV cache — enough for about 48 concurrent 1,000-token requests if memory is perfectly utilized, or about 12 if you waste 75% of it to fragmentation.
- Role: this formula explains why KV cache management is the dominant inference-side optimization problem at scale.
- Edge cases: grouped-query attention (GQA) and multi-query attention (MQA) reduce for the K and V tensors specifically, shrinking the cache by a factor of 4-8x; this is why Llama-2-70B (GQA with 8 KV heads) has a much smaller per-token cache than the naive formula suggests.
- Novelty:
[Adopted]— standard accounting; the contribution of PagedAttention is making the constant factor near-optimal. - Transferability:
[Analysis]directly applicable to any decoder-only LLM; replace with the target model’s values. - Why it matters: this is the formula every capacity-planning spreadsheet at every LLM inference shop runs.
MATH ENTRY 3: RadixAttention’s prefix matching and cache hit rate.
- Source: Section 4 of Zheng et al. 2
- What it is: the operation of finding the longest token-prefix of an incoming request that already lives in the radix tree, expressed as a tree walk.
- Formal definition. Let be the radix tree at time , and let be the incoming prompt. The matched prefix is:
- Each term explained AND its dimensional/type analysis:
- is a token sequence of length ; each is one token id (integer in for vocabulary size ).
- is a radix tree where edges are labeled with token subsequences and leaves carry pointers to KV cache blocks.
- The output is the longest prefix that is also a path in from the root.
- is at most (no prefix is longer than the prompt) and at least 0 (no match at all).
- Worked numerical example. Suppose holds two paths from earlier requests:
- Path A: tokens (some chatbot system prompt).
- Path B: tokens (a different chatbot system prompt sharing the first two tokens). A new request arrives with prompt . The match walks the tree:
- Start at root. The edge to Path A’s first segment matches tokens ; the edge to Path B also matches the same first two tokens but the tree representation merges them into one shared node.
- At the merge node, the request’s third token is , which matches the Path A branch.
- The Path A branch matches .
- The next request token is , which has no child in .
- Result: , .
- The prefill phase now only needs to compute KV for positions 5 and 6 (); positions 1-4 reuse the cached blocks reached via the matched path.
- Role: this is the cache-lookup operation that runs on every incoming request before prefill begins.
- Edge cases: when (no match), the request runs full prefill and then inserts itself into for future reuse. When (full match), the request needs no prefill at all — it goes straight to decode.
- Novelty:
[Adapted]— standard radix tree operation; the contribution is the KV cache wiring. - Transferability:
[Analysis]directly transferable to any tokenized model. The lookup cost is in the worst case. - Why it matters: the throughput gains SGLang reports come from this match step succeeding on a meaningful fraction of incoming requests; on workloads where every prompt is unique, the gains shrink toward zero.
MATH ENTRY 4: SGLang’s cache-aware scheduling priority.
- Source: Section 4.3 of Zheng et al. 2
- What it is: the priority function used to reorder the pending request queue so requests with longer cache hits run together.
- Formal definition. For pending request with matched prefix length and arrival time :
- Each term explained AND its dimensional/type analysis:
- is the matched prefix length (in tokens) computed at the radix tree lookup.
- is the waiting time since the request arrived (in seconds).
- is the starvation-avoidance coefficient with units tokens-per-second; the paper does not pin a default in the abstract-level summary. 10
[Reconstructed]The exact value lives in the system’s configuration; the form of the equation is what matters. - The output is a scalar priority; higher = scheduled sooner.
- Worked numerical example. Two pending requests:
- Request A: matched prefix of 1,800 tokens, arrived 0.1 seconds ago.
- Request B: matched prefix of 50 tokens, arrived 5 seconds ago. With tokens/second:
- .
- . Request A goes first. With raised to 500: ; ; A still wins but B is closer. As , the policy degenerates to first-come-first-served.
- Role: this priority runs at every scheduling decision (every batch step) and determines which subset of pending requests joins the active batch.
- Edge cases: when for every pending request, the policy reduces to FCFS modulo the starvation term.
- Novelty:
[Adapted]— priority queueing with starvation avoidance is classic scheduling; the contribution is using matched-prefix length as the priority. - Transferability:
[Analysis]directly transferable to any system that already has a radix tree or equivalent prefix index over its cache. - Why it matters: scheduling order determines which requests share blocks within a batch; the priority function is what makes the cache hits in MATH ENTRY 3 actually translate to batch-level throughput.
7. Algorithmic contributions
ALGORITHM ENTRY 1: PagedAttention block allocation and free.
- Source: Section 4.2 of Kwon et al. 1
- Purpose: maintain per-request block tables; allocate new physical blocks on demand; free blocks when requests finish.
- Inputs: request id , current token position , free physical block pool .
- Outputs: updated block table for ; updated pool .
- Pseudocode (faithful reconstruction; the paper describes the mechanism in prose rather than as a numbered algorithm):
def step_request(r, i_r, P):
logical_block_idx = floor(i_r / B)
pos_in_block = i_r mod B
if logical_block_idx not in r.block_table:
if P is empty:
preempt_some_request(P)
phys = P.pop()
r.block_table[logical_block_idx] = phys
write_kv(r.block_table[logical_block_idx], pos_in_block, k_i, v_i)
return r.block_table, P
def free_request(r, P):
for logical_idx, phys in r.block_table.items():
if phys.refcount == 1:
P.push(phys)
else:
phys.refcount -= 1
r.block_table = {}
return P
- Hand-traced example on minimal input. Start with , free pool , request at position with an empty block table.
- Step 1, : logical_block_idx = 0, pos_in_block = 0. Not in block table. Pop from . Block table becomes . Write at slot 0 of . Pool .
- Step 2, : logical_block_idx = 0, pos_in_block = 1. Already in block table. Write at slot 1 of .
- Steps 3-4 fill slots 2 and 3 of .
- Step 5, : logical_block_idx = 1, pos_in_block = 0. Not in block table. Pop . Block table becomes . Write at slot 0 of . Pool now .
- On free: refcount of and is 1, so both return to the pool, .
- Complexity: amortized per token. Bottleneck step: the free-pool pop on a block boundary.
- Hyperparameters: (block size, default 16); free pool size (set at vLLM startup from total GPU memory minus weights and activations).
- Failure modes: pool exhaustion triggers preemption; preemption granularity is sequence-group, not individual block.
- Novelty:
[New]— the wiring of OS-style page allocation into the attention kernel. - Transferability:
[Analysis]directly used by SGLang and TensorRT-LLM’s paged-KV-cache implementations; the algorithm has become a serving-systems standard.
ALGORITHM ENTRY 2: RadixAttention insert and LRU eviction.
- Source: Section 4 of Zheng et al. 2
- Purpose: keep the radix tree synchronized with the paged KV cache; evict least-recently-used branches when memory pressure rises.
- Inputs: completed request with token sequence and block pointers; radix tree ; memory pressure flag.
- Outputs: updated ; updated free block pool.
- Pseudocode (faithful reconstruction):
def insert(T, x_1_n, block_ptrs):
node = T.root
i = 0
while i < n:
match_len, child = find_longest_child_match(node, x[i:n])
if match_len == 0:
node.add_child(x[i:n], block_ptrs[i:n])
break
elif match_len == len(child.edge):
node = child
i += match_len
else:
split_edge(child, match_len)
node = child
i += match_len
update_lru_timestamp(node)
return T
def evict_until_free(T, target_bytes_freed):
freed = 0
while freed < target_bytes_freed:
leaf = T.lru_leaf_with_refcount_zero()
if leaf is None:
raise OutOfMemory
freed += leaf.cache_bytes
free_blocks(leaf.block_ptrs)
T.remove_leaf(leaf)
return T
- Hand-traced example on minimal input. Tree initially holds one path: root edge leaf with refcount 0, blocks . A new request finishes with tokens and blocks (the first two blocks are already shared via copy-on-write).
- Walk into the tree: match_len = 2 against the edge labeled (the edge’s first 2 tokens match the new request’s first 2).
- match_len edge length, so call
split_edge: the edge becomes root internal node , with two children: (the original leaf, refcount 0, blocks now refactored so is the post-split block) and (new leaf, blocks ). - Insert finishes; LRU timestamp on updates.
- If memory pressure now requires freeing one block:
evict_until_freefinds (refcount 0, oldest LRU), frees its blocks, removes the leaf, and may collapse back into the parent edge.
- Complexity: insert is in the prompt length (one tree-walk pass plus at most one edge split). Eviction is per leaf with a min-heap on LRU timestamps, where is the number of leaves.
- Hyperparameters: target eviction batch size; LRU tie-breaking policy.
- Failure modes: all leaves have refcount — the eviction loop raises OutOfMemory and the scheduler must queue or recompute.
- Novelty:
[New]for the wiring;[Adapted]for the underlying radix tree data structure. - Transferability:
[Analysis]directly transferable; the operation set (insert, match, evict) is what any prefix cache needs.
ALGORITHM ENTRY 3: TensorRT-LLM in-flight batching with disaggregated prefill/decode.
- Source: TensorRT-LLM README, “Disaggregated serving” feature. 3
- Purpose: avoid prefill-decode head-of-line blocking by scheduling the two phases on separate compute lanes.
- Inputs: request stream; prefill GPU pool; decode GPU pool.
- Outputs: per-request output token stream.
- Pseudocode
[Reconstructed]based on README feature description rather than published code listing:
def disagg_serve(request_stream, prefill_pool, decode_pool):
while True:
for r in arrivals(request_stream):
prefill_pool.submit(r)
for r in prefill_pool.completed():
decode_pool.admit(r, r.kv_cache)
for r in decode_pool.active():
token = r.step()
yield (r.id, token)
if token == EOS:
decode_pool.release(r)
- Hand-traced example on minimal input. Two requests A and B arrive simultaneously; A has a 2,000-token prompt and B has a 50-token prompt.
- In a coupled (non-disaggregated) scheduler, A’s prefill stalls B’s decode for hundreds of milliseconds; B’s first-token latency suffers.
- In disaggregated mode: A enters prefill pool; B enters prefill pool. B finishes prefill almost immediately and migrates to decode pool, producing tokens at steady state. A finishes prefill later and joins B in the decode pool.
- Result: B’s first-token latency is bounded by its own prefill, not by A’s.
- Complexity: scheduling overhead is per request transition.
- Hyperparameters: prefill-to-decode GPU ratio; migration cost (KV cache transfer over NVLink).
- Failure modes: KV cache transfer between pools is non-trivial cost; small-prompt workloads may see net loss versus coupled scheduling.
- Novelty:
[Adapted]— disaggregated prefill/decode was explored in research before TensorRT-LLM productized it; the contribution is the production-ready implementation. - Transferability:
[Analysis]requires multi-GPU deployment with high-bandwidth interconnect; not directly applicable to single-GPU setups.
8. Specialised design contributions
Subsection 8A — LLM / prompt design. Not applicable to these papers. The systems serve LLM requests but do not themselves engineer prompts.
Subsection 8B — Architecture-specific details.
- vLLM supports models with grouped-query attention (GQA), multi-query attention (MQA), and multi-head attention; the block-table structure works identically across them, but the per-block memory footprint scales differently. Llama-2-70B at GQA-8 (8 KV heads) needs roughly 1/8 the cache of the equivalent multi-head model.
- SGLang inherits the architecture compatibility from vLLM-style paged memory but adds the radix tree on top, which is architecture-agnostic.
- TensorRT-LLM ships custom kernels per model family (Llama, DeepSeek, Mixtral, Falcon, and others per README) plus quantization recipes per precision (FP8, FP4, INT8, INT4).
Subsection 8C — Training specifics. Not applicable — none of the three systems train models; they serve frozen weights.
Subsection 8D — Inference / deployment specifics.
- vLLM: continuous batching (every step admits new requests and releases finished ones), chunked prefill (large prefills split across multiple steps so they don’t block decode), prefix caching (similar in spirit to RadixAttention but simpler — explicit declaration rather than implicit detection), and a broad quantization matrix per the current docs (FP8, NVFP4, MXFP4/MXFP8, INT8, INT4, GPTQ/AWQ, GGUF). 12 Speculative decoding includes n-gram, suffix, EAGLE, and DFlash strategies.
- SGLang: the frontend programming model is itself a deployment feature. Applications express agent loops, self-consistency sampling, and constrained JSON decoding directly through
gen/fork/join/selectprimitives. The runtime adds compressed finite-state-machine compilation for constrained decoding, which the paper reports as a meaningful speedup over Guidance and Outlines-style runtimes. 2 - TensorRT-LLM: disaggregated prefill/decode, wide expert parallelism for MoE models, FP8 KV cache (storing the cache itself in 8-bit precision to roughly double effective capacity), multiblock attention (the README cites 3x throughput boost on H200 for long sequences), and Triton Inference Server integration for multi-tenant deployment. 3
9. Experiments and results
Datasets.
- ShareGPT — real chatbot traces, used by Kwon et al. as the canonical realistic workload. 1
- Alpaca — instruction-tuning style prompts; used as a shorter, less variable workload. 1
- LMSYS-Chat traces (referenced by SGLang) — for prefix-sharing-heavy workloads. 2
- MMLU (5-shot), GSM8K, JSON-decoding benchmarks, ReAct agent traces — used by SGLang to demonstrate prefix reuse and constrained decoding speedups. 2
Baselines.
- PagedAttention paper compares against FasterTransformer (NVIDIA’s pre-TensorRT-LLM library) and Orca (OSDI 2022). 1
- SGLang paper compares against vLLM, Guidance, and Hugging Face TGI on the relevant workloads. 2
- TensorRT-LLM compares against itself across versions; cross-system comparisons mostly come from third-party benchmarking blog posts rather than the README itself. 3
Evaluation metrics.
- Throughput: output tokens per second, aggregate across active requests.
- Request rate: requests served per second under a target latency SLO.
- KV cache utilization: fraction of allocated KV memory actually holding token states.
- Cache hit rate (SGLang specific): fraction of prompt tokens served from radix-tree-cached blocks rather than recomputed.
Reproduce key result tables. Throughput summary across the three systems’ headline numbers:
| System | Workload | Baseline | Reported gain |
|---|---|---|---|
| vLLM (PagedAttention) | ShareGPT, OPT-13B | Orca (Oracle) | 1.7-2.7x request rate; ~2.2x concurrent requests |
| vLLM (PagedAttention) | ShareGPT, basic sampling | FasterTransformer | up to 22x request rate |
| vLLM (PagedAttention) | KV memory utilization | Existing systems (20.4-38.2% used) | Near-100% utilization |
| SGLang (RadixAttention) | Few-shot MMLU | vLLM | 4.4x throughput |
| SGLang (RadixAttention) | ReAct agent | vLLM | 5.6x throughput |
| SGLang (RadixAttention) | GSM8K (batched) | vLLM | 4.5x throughput |
| SGLang (RadixAttention) | Long-context tasks | vLLM | 1.2-2.9x throughput |
| SGLang aggregate headline | Across all workloads | Existing systems | up to 6.4x throughput |
| TensorRT-LLM | Llama 3.3 70B, speculative decoding | TensorRT-LLM without spec | 3x throughput |
| TensorRT-LLM | H200, long sequences | TensorRT-LLM without multiblock | 3x throughput |
Numbers reproduced from Kwon et al. (arXiv:2309.06180), Zheng et al. (arXiv:2312.07104), and the NVIDIA TensorRT-LLM README; reproduced for editorial coverage.
Main quantitative results. vLLM’s headline 2-4x over baseline systems is consistent across model sizes (OPT-13B to OPT-175B per the paper). 1 SGLang’s headline up to 6.4x is workload-dependent; the gains are largest on prefix-heavy workloads (few-shot, agents) and smallest on unique-prompt workloads (long-context single-turn). 2 TensorRT-LLM’s gains are per-feature rather than aggregated.
Supplementary results. The PagedAttention paper includes ablations on block size (showing 16 as the sweet spot), preemption strategy (recomputation beats swap on most workloads, never more than 20% slower than swap on medium block sizes), and parallel sampling / beam search where copy-on-write block sharing produces additional 1.3-2x gains. 7
Ablations. SGLang ablates each runtime feature independently — RadixAttention alone, compressed FSM alone, API speculative execution alone — to attribute gains to specific components. 2 [Analysis] This is more rigorous than the PagedAttention paper’s ablations, which mostly vary block size and dataset rather than holding out specific algorithmic choices.
Hyperparameter sensitivity. Block size for vLLM tested across per the paper, with 16 emerging as the workload-robust default. SGLang’s starvation coefficient is not deeply ablated in the abstract-level summary.
Robustness / stress tests. PagedAttention paper tests preemption under heavy load by oversubscribing the GPU; recomputation strategy is shown to degrade more gracefully than swap. Hardware: NVIDIA A100 GPUs exclusively (40GB for 13B, 4x A100 for 66B, 8x A100-80GB for 175B). 13
Qualitative results. SGLang’s Figure 6 shows the radix tree evolution across nine timesteps of a mixed workload (chat sessions, few-shot queries, self-consistency sampling), color-coded by node state (green = new, blue = accessed, red = evicted). 10
Experimental scope limits. [Analysis] Neither paper benchmarks on the latest 2025-2026 GPU generations (H100 SXM, H200, B100/B200, GB200); both rely on A100 80GB which is now two generations old. The throughput numbers should be read as relative comparisons among systems, not as absolute performance ceilings.
Independent benchmark cross-checks for SOTA claims. The “up to 22x over FasterTransformer” headline in the PagedAttention paper has not, to the publication’s knowledge, been independently reproduced at the same scale; the comparison hinges on FasterTransformer’s contiguous allocation being especially wasteful on long-tail workloads. [Reviewer Perspective] The 6.4x SGLang headline has been independently reproduced in part by the LMSYS organization’s own blog posts and by the open-source community, with reproductions typically landing in the 2-5x range on representative workloads rather than 6.4x. The 6.4x is the upper envelope on the paper’s selected workloads, not a universal claim.
Evidence audit ([Analysis] throughout).
- Strongly supported claims: PagedAttention’s memory-utilization gain (the 20.4-38.2% baseline utilization is measured directly), the 2-4x vLLM throughput on ShareGPT, SGLang’s prefix-cache-hit-rate gains on agent and few-shot workloads.
- Partially supported claims: the 22x over FasterTransformer (extreme tail of the distribution), SGLang’s 6.4x aggregate (averages across workloads of varying gain).
- Claims relying on narrow evidence: TensorRT-LLM’s published gains, since they come from a single source (NVIDIA) with no independent peer review.
10. Technical novelty summary
Novelty map table.
| Component | Type | Novelty level | Justification | Source |
|---|---|---|---|---|
| PagedAttention block layout | System design | Fully novel | First adaptation of OS virtual memory paging to KV cache | Kwon §4 |
| PagedAttention kernel | Kernel rewrite | Incrementally novel | Standard attention math on a block-structured cache | Kwon §4.1 |
| Block-level copy-on-write | Memory sharing | Combination novel | Combines OS COW with KV cache sharing for parallel sampling and beam search | Kwon §4.3 |
| FCFS gang scheduling | Scheduling | Adopted | Standard OS scheduling primitive | Kwon §4.5 |
| Recomputation preemption | Recovery strategy | Combination novel | Trades compute for memory; novel in serving-systems context | Kwon §4.5 |
| RadixAttention | Cache index | Combination novel | Radix tree + LRU + reference counting on KV blocks | Zheng §4 |
| Cache-aware scheduling | Scheduling | Incrementally novel | Prefix-length priority within established priority-queue scheduling | Zheng §4.3 |
| Compressed FSM for JSON | Decoding | Incrementally novel | Compilation of constraints into compressed FSM | Zheng §5 |
| API speculative execution | Runtime | Incrementally novel | Parallel execution of frontend primitives | Zheng §5 |
| TensorRT-LLM custom kernels | Kernel engineering | Adopted+ | FlashAttention-style fusion plus production hardening | NVIDIA README |
| Disaggregated prefill/decode | Scheduling | Adopted | Research idea productionized | NVIDIA README |
| FP8/FP4 KV cache | Quantization | Adopted | Industry-driven precision reduction | NVIDIA README |
Single most novel contribution. PagedAttention’s adaptation of OS-style virtual memory paging to the KV cache is the most novel of the three contributions; everything else either builds on it or copies it. Before PagedAttention, the dominant assumption in transformer serving was that the KV cache had to be contiguous in memory for the attention kernel to read efficiently. PagedAttention proved the assumption wrong and rewrote the kernel to make non-contiguous reads efficient. That single rewrite reshaped the entire serving-systems field.
What the papers do NOT claim to be novel.
- Continuous batching (originated in Orca 2022).
- Speculative decoding (Leviathan et al. 2023; multiple concurrent independent works).
- Radix tree data structure (Morrison 1968; standard CS).
- FlashAttention-style kernel fusion (Dao et al. 2022).
- Quantization (extensive prior literature).
- Constrained decoding via FSMs (Guidance, Outlines, lm-format-enforcer all preceded SGLang’s compressed FSM).
11. Situating the work
What prior work did. Orca introduced iteration-level scheduling (continuous batching), demonstrating that batching at every step rather than at request boundaries improves throughput by 30-50%. 8 FasterTransformer was NVIDIA’s pre-TensorRT-LLM serving library, contiguous-KV-based, with hand-optimized kernels.
What these papers change conceptually. PagedAttention reframes the KV cache as paged memory; SGLang reframes the cache as a content-addressable index. The conceptual shift is from “the cache is per-request scratch” to “the cache is a shared system resource managed by the runtime.”
Contemporaneous related papers (≥2 required).
- FlashAttention-2 (Dao 2023; arXiv:2307.08691) — kernel-level attention optimization, orthogonal to PagedAttention’s memory-layout optimization. The two compose: vLLM ships with FlashAttention as the inner kernel under the PagedAttention block-iteration loop. The relationship is composable, not competing.
- Hydragen (Juravsky et al. 2024; arXiv:2402.05099) — focuses specifically on shared-prefix attention with a different algorithmic approach (matrix-form shared-prefix computation rather than tree-indexed reuse). Conceptually overlaps with RadixAttention but optimizes for a narrower workload pattern.
- Mooncake / DistServe (Qin et al. 2024) — pushes the disaggregated prefill/decode idea further than TensorRT-LLM’s productization; argues that prefill and decode have fundamentally different compute/memory profiles and should run on different hardware tiers.
[Reviewer Perspective] strongest skeptical objection. The throughput-headline numbers in both PagedAttention (up to 22x over FasterTransformer) and SGLang (up to 6.4x over vLLM) are upper-envelope claims selected from a distribution of workload-dependent gains. The median gain in production is closer to 2x for PagedAttention versus contiguous baselines and 2-3x for SGLang versus vLLM on representative workloads. The headlines are not wrong, but they lead casual readers to expect uniform gains where the gains are highly workload-dependent.
[Reviewer Perspective] strongest author-side rebuttal grounded in the paper. The papers explicitly report the range of gains across workloads — Kwon et al. show 1.7-2.7x as the typical range with the 22x being a specific tail of the FasterTransformer comparison; Zheng et al. show 1.2-2.9x on long-context tasks alongside the 4.4-5.6x on prefix-heavy ones. The headlines are accurate as stated; the gap between headline reading and median production gain is a communication problem, not a science problem.
What remains unsolved.
- Cross-node KV cache sharing in distributed deployments — RadixAttention’s tree lives on one node; sharing prefixes across a fleet remains open.
- Adaptive block sizing — vLLM’s default is workload-agnostic; per-request adaptive block sizing has been explored in follow-up work but not yet standardized.
- KV cache compression beyond quantization — orthogonal to paging, attacking the per-token-bytes term rather than the layout term.
Three future research directions.
- Hierarchical KV cache tiers —
[Analysis]GPU memory + CPU memory + NVMe storage tiers with smart promotion/demotion based on access patterns; the existing swap-to-CPU machinery is the rudimentary first step. - Learned eviction policies —
[Analysis]LRU is a baseline; reinforcement-learned eviction policies that anticipate which prefixes will be reused could meaningfully extend cache hit rates. - Cross-tenant prefix sharing with privacy guarantees —
[Reviewer Perspective]if two enterprise customers’ system prompts share substantial overlap, can the runtime exploit the overlap without leaking information across tenants? This is a research direction that has barely been touched.
12. Critical analysis
Strengths with concrete evidence.
- PagedAttention: the measured KV memory utilization improvement (from 20-38% to near-100%) is unambiguous and reproducible; the open-source vLLM implementation has been deployed at scale by many providers without modification.
- SGLang: the radix-tree cache hit rate measurements are directly observable in the runtime’s telemetry, making the cache-reuse claims verifiable per-deployment.
- TensorRT-LLM: the breadth of quantization recipes (FP8, FP4, INT8, INT4, GPTQ, AWQ) is unmatched in the open-source ecosystem.
Weaknesses explicitly stated by the authors.
- Kwon et al. acknowledge that PagedAttention does not generalize to workloads with static tensor shapes (training) or compute-bound regimes without memory bottlenecks. 14
- Zheng et al. report RadixAttention’s gains degrade to near-zero on workloads without prefix structure.
Weaknesses not stated or understated by the authors [Reviewer Perspective].
- PagedAttention’s preemption granularity is sequence-group, not block; under heavy memory pressure, entire requests are evicted rather than partial blocks, which hurts tail latency more than the paper foregrounds.
- SGLang’s radix tree adds bookkeeping overhead on every request regardless of cache hit rate; the cost is paid even on workloads where the tree never matches.
- TensorRT-LLM’s lack of a formal paper means the published benchmarks come exclusively from NVIDIA and partners; independent reproducibility of headline numbers is essentially absent.
[External comparison]Third-party benchmarking efforts like the Anyscale blog series and the BentoML inference-engine benchmark have surfaced cases where TensorRT-LLM’s reported gains do not transfer cleanly to less curated workloads, though they typically still win on absolute throughput.
Reproducibility check.
- vLLM:
- Code: released (github.com/vllm-project/vllm). 15
- Data: ShareGPT and Alpaca publicly available.
- Hyperparameters: fully reported in the paper.
- Compute: hardware reported (A100 40GB, 80GB; multi-GPU configurations).
- Trained model weights: not applicable (system, not model).
- Evaluation set: ShareGPT/Alpaca publicly available.
- Overall: fully reproducible at the system level.
- SGLang:
- Code: released (github.com/sgl-project/sglang). 16
- Data: standard benchmarks (MMLU, GSM8K, ReAct traces from public sources).
- Hyperparameters: reported; for cache-aware scheduling is implementation-defined.
- Compute: hardware reported.
- Overall: fully reproducible.
- TensorRT-LLM:
- Code: released (github.com/NVIDIA/TensorRT-LLM). 3
- Data: not applicable for system-level benchmarks; per-feature benchmarks reference internal NVIDIA test suites.
- Hyperparameters: feature-by-feature in documentation.
- Compute: NVIDIA hardware required; specific GPUs cited per feature (H100, H200).
- Overall: partially reproducible — the code runs, but the published headline numbers are not always reproducible without NVIDIA’s exact configuration.
Methodology.
- Sample size (PagedAttention): ShareGPT trace (millions of conversations) and Alpaca (52K instructions) on OPT-13B, OPT-66B, OPT-175B, and LLaMA-7B/13B variants.
- Evaluation set (PagedAttention): held-out request streams from the cited datasets; contamination check not separately reported because the system serves models rather than training them.
- Baselines: FasterTransformer, Orca (Oracle and non-Oracle variants).
- Hardware/compute: NVIDIA A100 40GB (single-GPU for 13B), 4x A100 (66B), 8x A100-80GB (175B). 13
- Sample size (SGLang): MMLU 5-shot (14,042 questions), GSM8K (~7.5K problems), ReAct agent traces, JSON-decoding benchmarks.
- Evaluation set (SGLang): standard public benchmarks.
- Baselines (SGLang): vLLM, Guidance, Hugging Face TGI.
- Hardware/compute (SGLang): not pinned in the abstract-level summary; the paper reports A100 and H100 configurations.
Generalisability. The PagedAttention design generalizes to any autoregressive transformer with a per-token KV cache, which is the dominant LLM architecture today. RadixAttention generalizes to any workload with token-prefix structure, which is most production LLM workloads. TensorRT-LLM generalizes to any model NVIDIA chooses to support, which is the major frontier model families.
Assumption audit. The core assumption — KV cache is the dominant memory cost — holds for current model sizes and sequence lengths. The assumption weakens for very long contexts (1M tokens and beyond) where activation memory grows comparably to KV cache. [Analysis] Future architectures with linear-attention or state-space-model substitutes (Mamba, RWKV) would not benefit from PagedAttention’s contributions because they have no KV cache to page.
What would make the papers significantly stronger [Analysis]. Both papers would benefit from explicit cost models that let practitioners predict gains on their own workloads without running the system. PagedAttention’s reported gains range from 1.7x to 22x; a model that predicts where on the range a given workload lands would be more useful than the headline number.
13. What is reusable for a new study
REUSABLE COMPONENT 1: PagedAttention block layout + kernel.
- What it is: the block-table data structure and the attention kernel rewrite that reads from non-contiguous KV blocks.
- Why worth reusing: it is the open-source standard for KV cache memory management as of 2026.
- Preconditions: decoder-only transformer with standard causal attention; CUDA-capable GPU.
- What would need to change in a different setting: linear-attention models (Mamba, RWKV) don’t have a KV cache in the same sense; the layout doesn’t apply.
- Risks: bookkeeping overhead is non-trivial on very short sequences (length much less than ).
- Interaction effects: composes cleanly with FlashAttention as the inner kernel.
REUSABLE COMPONENT 2: RadixAttention cache index.
- What it is: the radix tree over token prefixes with LRU eviction and reference counting.
- Why worth reusing: makes cross-request reuse essentially free on prefix-heavy workloads.
- Preconditions: paged KV cache substrate (PagedAttention or equivalent); tokenization that produces stable prefixes.
- What would need to change in a different setting: workloads with no prefix structure see no benefit; the tree’s overhead is pure cost.
- Risks: privacy implications of cross-request sharing if tenant isolation is required.
- Interaction effects: only adds value when paired with cache-aware scheduling; without scheduling, hits arrive randomly and the batch composition won’t exploit them.
REUSABLE COMPONENT 3: Recomputation preemption.
- What it is: when memory pressure forces eviction, drop the request’s KV cache entirely and recompute from the prompt rather than swapping to CPU memory.
- Why worth reusing: recomputation overhead is bounded at 20% of swap latency on medium block sizes per Kwon et al. 7
- Preconditions: original prompt retained (cheap — it’s just tokens).
- What would need to change in a different setting: extremely long prompts make recomputation expensive enough that swap-to-CPU regains the advantage.
- Risks: tail-latency spikes on the evicted request.
- Interaction effects: pairs with FCFS gang scheduling; eviction policy choices ripple into scheduler.
REUSABLE COMPONENT 4: Cache-aware scheduling priority.
- What it is: the priority function that reorders the pending queue by matched-prefix length with a starvation-avoidance term.
- Why worth reusing: turns radix-tree cache hits into batch-level throughput gains.
- Preconditions: a prefix index over the cache.
- What would need to change in a different setting: tuning depends on SLO targets.
- Risks: low- regimes can produce unfair scheduling.
- Interaction effects: depends on RadixAttention.
REUSABLE COMPONENT 5: Disaggregated prefill/decode.
- What it is: routing prefill and decode phases to separate compute pools to avoid head-of-line blocking.
- Why worth reusing: tail latency for short-prompt requests improves substantially.
- Preconditions: multi-GPU deployment with high-bandwidth interconnect (NVLink, NVSwitch).
- What would need to change in a different setting: single-GPU deployments cannot benefit.
- Risks: KV cache transfer cost between pools can erase gains on small prompts.
- Interaction effects: composes with continuous batching but complicates scheduler implementation.
Dependency map in text form. PagedAttention is the foundation; RadixAttention sits on top of PagedAttention; cache-aware scheduling sits on top of RadixAttention; recomputation preemption is independent and pairs with either layer; disaggregated prefill/decode is orthogonal and pairs with any of the above.
Recommendation: 1-3 highest-value components [Analysis]. For a new inference system, ship PagedAttention first (foundation), RadixAttention second (when prefix-heavy workloads dominate), and disaggregated prefill/decode third (when multi-GPU and tail-latency-sensitive). Skip cache-aware scheduling until RadixAttention is in place; without the tree, the priority function has nothing to prioritize.
What type of new study benefits most [Analysis]. Studies that build new LLM-serving systems for emerging hardware (TPUs, AMD MI300X, custom accelerators) — these need the PagedAttention abstraction to port to non-CUDA stacks. Studies of new architectures (state-space models, linear attention) — these need to ask whether paging still applies and what replaces it if not.
14. Known limitations and open problems
Limitations explicitly stated by the authors.
- Kwon et al.: PagedAttention does not generalize to training (static tensor shapes) or to inference systems that are compute-bound rather than memory-bound. 14
- Zheng et al.: RadixAttention’s gains depend on workload prefix structure; pure unique-prompt workloads see negligible improvement.
Limitations not stated [Analysis] and [Reviewer Perspective].
- Both papers benchmark on now-superseded hardware (A100, occasionally H100). The reported numbers may not transfer cleanly to B100/B200/GB200-era hardware where memory bandwidth, FP8 native support, and NVLink generation all change the cost model.
- Neither paper explicitly addresses cross-tenant security in shared caches.
[Reviewer Perspective]This is the largest open production-deployment risk for RadixAttention; in multi-tenant SaaS, an adversary could in principle infer another tenant’s prompt content via cache-hit timing side channels. - TensorRT-LLM’s lack of academic peer review means the published numbers operate on a different evidence standard than vLLM and SGLang.
[Reviewer Perspective]This is not a knock on the engineering — TensorRT-LLM is widely used in production — but it complicates the comparison.
Technical root cause of each.
- Hardware portability: the kernels (and even the paging design) are tuned for specific memory hierarchies; cross-generation transfer requires re-tuning.
- Cross-tenant security: the radix tree is content-addressed; tenant isolation requires explicit segmentation that isn’t a default feature.
- TensorRT-LLM reproducibility: closed development process with public code but no published methodology.
Open problems left behind.
- Cross-node KV cache sharing in distributed deployments.
- Adaptive per-request block sizing.
- Compression of KV cache beyond uniform quantization (per-layer, per-head, attention-sink-aware).
- Learned eviction policies that anticipate reuse rather than reacting to recency.
- Privacy-preserving prefix sharing across tenants.
What a follow-up paper would need to solve to address the most critical limitation. [Analysis] Cross-node KV cache sharing is the most commercially impactful open problem; a follow-up paper would need to characterize the latency cost of remote cache lookup over RDMA or similar interconnects and demonstrate net positive throughput when the lookup latency is amortized over enough cache hits.
How this article reads at three depths
For the curious high-school reader. Large language models work by predicting one word at a time. To predict each new word, the model has to remember everything it has already said — and that memory takes up most of the GPU’s space. vLLM, SGLang, and TensorRT-LLM are three systems that figured out how to manage that memory cleverly enough to serve many users at once on the same GPU. The biggest idea is borrowed from how operating systems manage computer memory: slice the model’s memory into uniform pages and only use the pages you actually need.
For the working developer or ML engineer. PagedAttention is the foundation: page the KV cache into fixed-size blocks, rewrite the attention kernel to handle non-contiguous reads, and reclaim the 60-80% of memory that contiguous allocation wastes. RadixAttention is the layer above: index those blocks in a radix tree keyed on token prefixes, so chat sessions and few-shot prompts reuse compute. TensorRT-LLM is the closed-research counterpart: custom CUDA kernels, FP8/FP4 quantization, disaggregated prefill/decode, all production-hardened on NVIDIA hardware. In a 2026 production stack, all three layers typically compose — paging at the bottom, prefix reuse in the middle, NVIDIA kernels at the kernel layer. Pick the layers you need based on workload structure: pure paging if prompts are unique, paging plus RadixAttention if prompts share prefixes, paging plus TensorRT-LLM if absolute throughput on NVIDIA hardware matters more than portability.
For the ML researcher. PagedAttention’s contribution is the adaptation of OS virtual memory paging to the KV cache, with a block-level copy-on-write mechanism for parallel sampling and beam search. RadixAttention extends this with a content-addressable index over paged blocks plus cache-aware scheduling that turns hits into batch-level throughput. TensorRT-LLM productizes the algorithms with custom kernels and aggressive quantization. The most novel single contribution is PagedAttention’s paging itself; SGLang’s contribution is incremental but well-engineered. The strongest objection to both papers is that the headline throughput numbers (22x vLLM-over-FasterTransformer, 6.4x SGLang-over-vLLM) are upper-envelope claims that don’t transfer to median production workloads. A follow-up paper would need to characterize cross-node KV cache sharing — the most commercially impactful open problem, and the natural next step beyond single-node paging and indexing.
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, arXiv:2309.06180, SOSP 2023 (accessed ) ↩
- 2. Zheng et al., SGLang: Efficient Execution of Structured Language Model Programs, arXiv:2312.07104, NeurIPS 2024 (accessed ) ↩
- 3. NVIDIA TensorRT-LLM public GitHub repository README, fetched 2026-05-20 (accessed ) ↩
- 4. PagedAttention paper Section 2 — existing systems use only 20.4% to 38.2% of KV memory (accessed ) ↩
- 5. PagedAttention abstract — 2 to 4x throughput improvement over FasterTransformer and Orca (accessed ) ↩
- 6. SGLang abstract — up to 6.4x higher throughput across various tasks (accessed ) ↩
- 7. PagedAttention paper Section 4.5 — recomputation overhead bounded at 20% of swap latency on medium block sizes (accessed ) ↩
- 8. PagedAttention paper Section 7 — vLLM sustains 1.7x to 2.7x higher request rates than Orca Oracle on ShareGPT with OPT-13B (accessed ) ↩
- 9. PagedAttention paper — vLLM sustains up to 22x higher request rates than FasterTransformer on ShareGPT with basic sampling (accessed ) ↩
- 10. SGLang paper Section 4 and Figure 6 — RadixAttention radix tree operations and cache-aware scheduling (accessed ) ↩
- 11. PagedAttention paper — default block size B=16 chosen to balance GPU utilization and internal fragmentation (accessed ) ↩
- 12. vLLM documentation landing page — current feature set including chunked prefill, prefix caching, FP8/FP4/INT4 quantization, speculative decoding strategies (accessed ) ↩
- 13. PagedAttention paper Section 6 — hardware setup: NVIDIA A100 40GB, 4x A100 for 66B, 8x A100 80GB for 175B (accessed ) ↩
- 14. PagedAttention paper — stated limitations on training workloads and compute-bound inference (accessed ) ↩
- 15. vLLM reference implementation, GitHub vllm-project/vllm (accessed ) ↩
- 16. SGLang reference implementation, GitHub sgl-project/sglang (accessed ) ↩
Anonymous · no cookies set