xLSTM, RWKV-7, and RetNet — a multi-paper review of linear-time transformer alternatives

Technical walkthrough of three non-attention sequence architectures: xLSTM's exponential gating, RWKV-7's generalized delta rule, and RetNet's three-form retention.

19 May 2026 | Updated 19 May 2026 | ~71 min read

Reading-register key

From the paper: claims drawn verbatim or near-verbatim from the cited paper’s text, equations, tables, or figures.

[Analysis]: the publication’s own reasoned assessment, distinct from any claim the three reviewed papers make.

[Reconstructed]: content faithfully reconstructed because the source partially disclosed it; flagged where used.

[External comparison]: comparison to prior or contemporary work outside the three reviewed papers.

[Reviewer Perspective]: a critical or speculative assessment that goes beyond what any of the three reviewed papers proves.

1. Paper identity and scope

This article reviews three papers that share an unfashionable claim: the self-attention block is not the only viable backbone for a large language model. Each paper proposes a sequence mixer whose cost scales linearly with sequence length at inference and whose state at any token position fits in a fixed-size memory, in contrast to attention’s $O(N^2)$ training cost and growing key-value cache.

Paper A, xLSTM. Beck, Pöppel, Spanring, Auer, Prudnikova, Kopp, Klambauer, Brandstetter, Hochreiter. xLSTM: Extended Long Short-Term Memory. arXiv:2405.04517, May 2024. Accepted to NeurIPS 2024 as a spotlight.¹ Hochreiter is the co-inventor of the original 1997 LSTM; the paper is explicitly framed as a “scaled-up LSTM” rather than a clean-slate design.

Paper B, RWKV-7 “Goose”. Peng et al. (18 authors). RWKV-7 “Goose” with Expressive Dynamic State Evolution. arXiv:2503.14456, March 2025.² Seventh major version of the RWKV (Receptance Weighted Key Value) line, descended from Peng’s earlier RWKV-4 paper.

Paper C, RetNet. Sun, Dong, Huang, Ma, Xia, Xue, Wang, Wei. Retentive Network: A Successor to Transformer for Large Language Models. arXiv:2307.08621, July 2023 (revised August 2023).³ Microsoft Research. Brought in as a structural reference point because it codified the three-form (parallel / recurrent / chunkwise) computational template that both RWKV-7 and xLSTM-mLSTM later adopted.

Retrieval status. All three arXiv landing pages were fetched 2026-05-19. The ar5iv HTML render was reachable for RWKV-7 and RetNet but the xLSTM ar5iv build had a fatal conversion error on the day of writing; the NeurIPS proceedings PDF and the OpenReview landing page supplied the missing structural content, and equation transcription falls back to the multi-source summary literature cited inline.⁴ Where xLSTM content carries [Reconstructed], the gap is the missing ar5iv render of the equations exactly as typeset.

Paper classification. All three: Architecture proposal, Training method, Inference method. xLSTM and RWKV-7 additionally: Representation learning. All three benchmark LLM-scale language modelling and are evaluated under the data-driven category.

Technical abstract in publication voice. Three papers attack the same target (replace softmax self-attention with a sequence mixer whose recurrent form runs in $O(1)$ per token and whose parallel form trains as fast as attention), and propose three different mathematical primitives. RetNet (Microsoft, 2023) keeps the query / key / value vocabulary of attention but replaces the softmax with an exponentially-decaying retention matrix that admits an exact parallel form, an exact recurrent form, and a chunkwise form trading parallelism against memory. xLSTM (Beck et al., 2024) keeps the LSTM gating vocabulary of 1997 but introduces exponential input and forget gates with a numerical-stabilizer state, plus two memory variants: sLSTM with scalar cells and across-head memory mixing, and mLSTM with a per-head matrix memory updated via an outer-product covariance rule. RWKV-7 (Peng et al., 2025) starts from a delta-rule recurrence and generalises it to vector-valued decays, an in-context learning rate, and a relaxed value-replacement rule, all of which together let the network represent state-tracking transitions that diagonal-transition RNNs and TC$^0$-bounded transformers cannot represent, assuming the widely held conjecture that TC$^0 \neq$ NC$^1$.⁵

Primary research question (per paper).

xLSTM: can the LSTM cell, restored with exponential gating and matrix memory, scale to billion-parameter language models that match transformers and state-space models?
RWKV-7: can a delta-rule recurrence with vector-valued gating exceed transformer expressivity on formal-language tasks while keeping linear-time training and constant-memory inference?
RetNet: is there a single sequence mixer that simultaneously achieves training parallelism, $O(1)$-per-token inference, and competitive language-modelling performance, the “impossible triangle”?

Core technical claim (shared). A carefully chosen non-attention recurrence trained at LLM scale can match transformer perplexity on standard benchmarks while running inference at constant memory and time per token, and can in some formal-language regimes exceed what transformer architectures can represent in a constant number of layers.

Core technical domains and depth. Linear algebra over matrix memories (deep). Recurrent neural network design (deep). Softmax-attention complexity and KV-cache mechanics (moderate). State-space models as comparison points (surface). Formal-language theory, TC $^0$ versus NC $^1$ separation (moderate, RWKV-7 only). CUDA-kernel design (surface).

Reader prerequisites. High-school algebra. Familiarity with the transformer block (attention, residual stream) helps but is not required; Section 2.5 Glossary covers every prerequisite term. No prior exposure to state-space models or formal-language theory needed.

2. TL;DR and executive overview

3-sentence TL;DR. A standard transformer’s attention layer costs roughly $N \times N$ in time and memory when reading a sequence of $N$ words, which is what makes long documents and long-running chats expensive; three papers in this review build sequence layers whose cost grows like $N \times 1$ instead, so doubling the sequence length only doubles the work. xLSTM does this by reviving the 1997 LSTM with exponential “gates” and a small matrix memory, RWKV-7 generalises an older idea called the delta rule with extra learnable knobs that let the network track state more flexibly than a standard transformer can, and RetNet keeps the query / key / value vocabulary of attention but swaps the softmax for a fixed exponential decay so the same computation can run in three interchangeable forms. All three match or approach transformer language-modelling quality at scales between 0.1 and 7 billion parameters while cutting inference memory by roughly an order of magnitude on long inputs.

Executive summary (~100 words). Attention is expensive at long sequence lengths because every new word has to look back at every previous word. The three papers reviewed replace attention with a recurrence that carries a fixed-size summary forward, like the hidden state of an old-school RNN but with a much larger and more cleverly updated memory. RetNet (2023) formalised the three-form template: parallel for training, recurrent for inference, chunkwise for long context. xLSTM (NeurIPS 2024) plugged exponential gating and a matrix memory into the LSTM cell. RWKV-7 (March 2025) extended a delta-rule recurrence with vector-valued gating that let it exceed standard transformer expressivity on state-tracking tasks.

Five practitioner takeaways.

The three architectures share a constant-memory inference profile: the per-step cost does not depend on how much text the model has already read, which matters for long agent loops and on-device inference.⁶
Training cost is no longer the bottleneck. xLSTM-mLSTM, RWKV-7, and RetNet all admit a parallel form that trains in the same wall-clock ballpark as a comparable transformer at sequence lengths up to a few thousand tokens. RWKV-7 reports a 3 $\times$ speedup over RWKV-6’s training kernel and surpasses FlashAttention v3 at 16k token sequences.²
The benchmark gap to a parameter-matched transformer is small but real. The xLSTM paper reports that its 1.3B model, trained on 300B tokens of SlimPajama, outperforms Llama, RWKV-4, and Mamba baselines on most measured tasks;¹ RWKV-7-2.9B matches Qwen2.5-3B on an English aggregate while training on roughly one-third the tokens.²
The papers split on what they care about. RetNet sells the impossible-triangle framing (training parallelism + cheap inference + quality); xLSTM sells parity at scale plus the matrix-memory abstraction; RWKV-7 sells expressivity (state tracking beyond TC$^0$) plus multilingual SoTA at 3B.
None of the three has displaced attention in frontier commercial chat models as of writing. The deployment frontier remains transformer-dominated; hybrid architectures (a few attention layers mixed with many recurrent layers) are the more common production pattern. [Analysis]

Pipeline overview in text. A modern transformer language model is a stack of identical blocks; each block has a sequence-mixer (attention) and a channel-mixer (a position-wise MLP), wrapped in residual connections and normalisation. The three papers reviewed swap out the sequence-mixer. xLSTM offers two drop-in replacements (sLSTM and mLSTM blocks) and mixes them in published ratios. RWKV-7 replaces the entire block with a time-mix (its sequence mixer) and channel-mix (its position-wise mixer) pair. RetNet replaces attention with multi-scale retention while keeping the standard feedforward MLP. All three are trained with the same next-token-prediction loss, the same AdamW family of optimisers, and the same residual-stream backbone. Only the sequence-mixer math changes.

2.5 Glossary

Term	Plain-English explanation	First appears in
Self-attention	The sequence-mixer in a standard transformer; for each token it computes a weighted sum over all earlier tokens via a softmax over query-key dot products. Costs $O(N^2)$ in sequence length.	Section 1
Linear-time architecture	A sequence model whose per-token compute does not grow with how much text came before; total cost scales like $N \times 1$ instead of $N \times N$ .	Section 1
Residual stream	The running vector of numbers that flows from one transformer block to the next; each block reads it, computes an update, and adds the update back.	Section 1
Hidden state (RNN)	A fixed-size memory vector carried forward by a recurrent network; updated at every token.	Section 5A
Gate (LSTM)	A learned multiplier between 0 and 1 that decides how much of an incoming signal flows through. Forget gates decide what to drop; input gates decide what to write.	Section 5A
KV cache	In a deployed transformer, the stored keys and values from previous tokens that attention has to look back at; grows linearly with chat length and dominates memory at long context.	Section 4
State-space model	A class of sequence models (S4, Mamba) that propagate a small continuous-state vector through time via a linear recurrence; competitor architecture to the three papers reviewed.	Section 4
Outer product	The matrix you get by multiplying a column vector $u$ of length $m$ by a row vector $v^T$ of length $n$ , giving an $m \times n$ matrix $u v^T$ . The covariance update in mLSTM and RWKV-7 is built from outer products.	Section 6, MATH ENTRY 3
Delta rule	An old neural-network learning rule (Widrow-Hoff 1960) that updates a weight matrix by the outer product of an error vector and an input vector; reused inside RWKV-7’s state update.	Section 6, MATH ENTRY 5
TC $^0$ / NC $^1$	Complexity classes used to describe what a constant-depth circuit can compute. Transformers with finite precision are bounded by TC $^0$ ; certain state-tracking tasks live in NC $^1$ and are believed to lie outside TC $^0$ .	Section 6, MATH ENTRY 6
“From the paper:” prefix	Content directly supported by the cited paper’s text, equations, tables, or figures.	Throughout
`[Analysis]` label	The publication’s own reasoned assessment, distinct from what any reviewed paper itself claims.	Throughout
`[Reviewer Perspective]` label	A critical or speculative assessment that goes beyond what the reviewed papers prove.	Section 11, 12
`[Reconstructed]` label	Content the publication faithfully reconstructed because the source partially disclosed it.	Section 6 xLSTM entries
`[External comparison]` label	A comparison to prior work or general knowledge outside the three reviewed papers.	Section 4, 11

The Glossary covers every prerequisite term the body uses. A reader who understands quadratic versus linear cost ( $N^2$ versus $N$ ) and has met the word “matrix” can read the rest of the article using this Glossary as the dictionary.

3. Problem formalisation

The shared problem the three papers attack: design a sequence-mixer $f$ that takes a sequence of input vectors $x_1, x_2, \dots, x_N \in \mathbb{R}^d$ and produces a sequence of output vectors $y_1, y_2, \dots, y_N \in \mathbb{R}^d$ such that:

$f$ is causal: $y_t$ depends only on $x_1, \dots, x_t$ .
$f$ admits a parallel form that computes $y_1, \dots, y_N$ in time roughly linear in $N$ using matrix multiplications on contemporary GPUs.
$f$ admits a recurrent form in which the network carries a fixed-size hidden state $S_t$ such that $y_t$ and $S_t$ are computable from $x_t$ and $S_{t-1}$ in $O(1)$ time per step.
The two forms must compute the same function up to numerical error.
Quality on standard language-modelling benchmarks (perplexity, downstream task accuracy) is competitive with a parameter-matched softmax-attention block.
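
The first four requirements can be checked mechanically on a toy mixer. The sketch below is illustrative only: it uses a running mean as a stand-in sequence mixer (not any of the three reviewed architectures), and the function names `running_mean`, `running_mean_recurrent`, `global_mean`, and `is_causal` are hypothetical helpers introduced here. It shows a parallel form, a recurrent form with a fixed-size state, and a perturbation test for causality.

```python
import numpy as np


def running_mean(X):
    """Parallel form of a toy causal mixer: y_t is the mean of x_1..x_t."""
    counts = np.arange(1, X.shape[0] + 1)[:, None]
    return np.cumsum(X, axis=0) / counts


def running_mean_recurrent(X):
    """Recurrent form: the state is a single running sum of fixed size d."""
    total = np.zeros(X.shape[1])
    out = np.empty_like(X)
    for t, x in enumerate(X):
        total = total + x          # O(1) state update per token
        out[t] = total / (t + 1)
    return out


def global_mean(X):
    """A non-causal mixer for contrast: every output sees the whole sequence."""
    return np.broadcast_to(X.mean(axis=0), X.shape)


def is_causal(mixer, X, t, seed=0):
    """Perturb inputs after position t; causal outputs up to t must not move."""
    rng = np.random.default_rng(seed)
    X_future = X.copy()
    X_future[t + 1:] += rng.normal(size=X_future[t + 1:].shape)
    return np.allclose(mixer(X)[: t + 1], mixer(X_future)[: t + 1])


X = np.random.default_rng(1).normal(size=(8, 3))
print("running mean causal:", is_causal(running_mean, X, t=3))
print("global mean causal: ", is_causal(global_mean, X, t=3))
print("forms agree:        ", np.allclose(running_mean(X), running_mean_recurrent(X)))
```

The running mean passes the causality check and its two forms agree, while the global mean fails the check. The fifth requirement (language-modelling quality) is empirical and cannot be tested this way.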

Notation

Symbol	Type	Meaning	First appears in
`N`	scalar	Sequence length (number of tokens).	Section 3
`d`	scalar	Model (embedding) dimension.	Section 3
`d_h`	scalar	Per-head dimension; typically `d / H` for `H` heads.	Section 6
`x_t`	vector of length `d`	Input vector at position `t`.	Section 3
`h_t`	vector of length `d_h`	Hidden / output vector at position `t` for a recurrent cell.	Section 6 MATH ENTRY 1
`c_t`	scalar or matrix	LSTM cell state; scalar in sLSTM, matrix `C_t` in mLSTM.	Section 6 MATH ENTRY 2
`i_t, f_t, o_t`	vector of length `d_h`	Input, forget, output gates.	Section 6 MATH ENTRY 1
`m_t`	scalar	xLSTM stabilizer state (log-domain max of forget and input gate pre-activations).	Section 6 MATH ENTRY 2
`q_t, k_t, v_t`	vector of length `d_h`	Query, key, value at position `t`.	Section 6 MATH ENTRY 3
`S_t`	`d_h` by `d_h` matrix	Matrix state of the recurrent form (mLSTM `C_t`, RWKV-7 `wkv_t`, RetNet `S_n`).	Section 6 MATH ENTRY 3
`gamma`	scalar	Scalar exponential decay in RetNet.	Section 6 MATH ENTRY 4
`w_t, a_t, kappa_t`	vector of length `d_h`	RWKV-7 vector-valued decay, in-context learning rate, removal key.	Section 6 MATH ENTRY 5
`D`	`N` by `N` matrix	RetNet decay-with-causal-mask matrix; entries `D_ij = gamma^(i-j)` for `i >= j`, zero otherwise.	Section 6 MATH ENTRY 4
`sigma(.)`	function	Sigmoid activation, maps a real number into the interval (0, 1).	Section 6
`exp(.)`	function	Element-wise exponential.	Section 6

Formal problem statement

Input space: $(\mathbb{R}^d)^N$ for sequence length $N$ and embedding dimension $d$ . Output space: $(\mathbb{R}^d)^N$ . The objective is standard next-token cross-entropy loss when the sequence is a token-level language modelling task. The constraints are causality plus the parallel-form / recurrent-form equivalence stated above.

Explicit assumption list

Causal language modelling. All three papers target left-to-right next-token prediction. Bidirectional variants are out of scope.
Fixed embedding dimension across positions. The block reads and writes vectors of the same size; consistent with the standard transformer stack.
GPU-friendly parallel form. “Linear in $N$ for the parallel form” means linear in matmul-equivalent work, not necessarily in wall-clock time without a custom kernel. RWKV-7 and xLSTM-mLSTM ship CUDA kernels; the chunkwise form is where the recurrence has to be unrolled chunk-by-chunk.⁵ [Analysis] This is a potentially strong assumption when comparing against FlashAttention v3, which is itself a deeply optimised kernel.
Fixed-size matrix state $S_t$. mLSTM, RWKV-7, and RetNet all carry a $d_h \times d_h$ matrix per head per layer. The total state footprint scales like $H \cdot L \cdot d_h^2$ for $L$ layers and $H$ heads. The footprint does not depend on sequence length but does grow quadratically in per-head dimension, a real cost at typical $d_h = 64$ to $128$.⁵
Bounded numerical precision. RWKV-7’s expressivity argument (recognising all regular languages) assumes the network carries enough precision to represent the relevant transition matrices; the formal statement is sensitive to the finite-precision regime.² [Analysis] The TC$^0$-versus-NC$^1$ separation theorems for transformers also depend on finite-precision assumptions; the like-for-like comparison is appropriate but the constants matter in practice.

Why the problem is hard

A naive RNN’s recurrent form is $O(1)$ per step but its training form is intrinsically sequential: you cannot compute $h_t$ until $h_{t-1}$ is done. Softmax attention’s training form is parallel but its inference KV cache grows with $N$. The hard part is the equivalence constraint (the fourth item in the list above): parallel and recurrent forms that compute the same function. Linear attention (Katharopoulos et al. 2020) showed it is possible to express attention without softmax as a recurrence over a $d_h \times d_h$ matrix state; the three papers are downstream of that observation and each contributes a different choice of how to update the matrix state.⁷

4. Motivation and gap

The concrete cost driver for production deployment of transformer chat models is the KV cache. At inference, every previous token’s key and value vectors must remain in GPU memory so the next token’s attention layer can read them; the number of cached values scales as $2 \cdot L \cdot H \cdot d_h \cdot N$ for sequence length $N$, model layers $L$, attention heads $H$, and head dimension $d_h$. At $L=32$, $H=32$, $d_h=128$, $N=128000$ that is about $3.4 \times 10^{10}$ values, or roughly 67 GB per request at 16-bit precision: tens of gigabytes, and the dominant memory term in long-context serving. [External comparison] Long-context serving systems (vLLM, TensorRT-LLM) spend most of their engineering budget compressing or paging this cache.
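
The arithmetic behind that estimate, set against the fixed-size matrix state from the assumption list in Section 3, can be sketched as follows. The 2-bytes-per-value default (16-bit storage) is an assumption introduced here, and the function names are hypothetical; the two formulas are the ones stated in this article.

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, bytes_per_value=2):
    """Keys and values for every past token: 2 * L * H * d_h * N values."""
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_value


def matrix_state_bytes(n_layers, n_heads, d_head, bytes_per_value=2):
    """One d_h x d_h matrix per head per layer: H * L * d_h^2 values."""
    return n_layers * n_heads * d_head * d_head * bytes_per_value


# Configuration from the text; bytes_per_value=2 is an assumed 16-bit format.
L, H, d_h, N = 32, 32, 128, 128_000
kv = kv_cache_bytes(L, H, d_h, N)
state = matrix_state_bytes(L, H, d_h)

print(f"KV cache:     {kv / 1e9:.1f} GB")
print(f"matrix state: {state / 1e6:.1f} MB")
print(f"ratio:        {kv // state}x")
```

Under these assumptions the cache comes to about 67 GB against roughly 34 MB of recurrent state, a ratio of $2N / d_h = 2000$. The ratio grows linearly with sequence length, which is the whole argument for a fixed-size state.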

Existing approaches and their failure modes (cited by the three papers).

Sparse attention (Longformer, BigBird): truncates the attention window or sparsifies the pattern; gives up information beyond the window. From the RWKV-7 paper, sparse-attention variants “limit the model’s ability to capture long-range dependencies.”²
Linear attention (Katharopoulos 2020): replaces softmax with a kernel feature map and rewrites attention as a matrix-state recurrence; the practical issue is that the resulting models underperform softmax attention at language-modelling quality. The RetNet paper explicitly motivates retention as a fix.³
State-space models (S4, Mamba, Mamba-2): use a structured linear recurrence with input-dependent transitions; competitive on standard LM benchmarks but with their own limitations on tasks requiring associative recall or state tracking, as flagged by Park et al.’s work that RWKV-7 cites.²
Linear RNNs / RetNet itself: deliver the parallel + recurrent equivalence but historically traded language-modelling quality for efficiency.

The gap each paper claims to fill.

xLSTM. “Question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs?”¹ The paper frames the missing piece as exponential gating with stabilisation plus a parallelisable matrix-memory variant; with those in place, the authors argue LSTMs scale.
RWKV-7. Earlier RWKV versions and DeltaNet both use diagonal or scalar-gated state transitions; neither can represent non-permutation finite-state transitions in a constant number of layers, which is the formal limit on state tracking. RWKV-7 introduces non-diagonal input-dependent transitions while preserving the parallel-form training.²
RetNet. The framing is the “impossible triangle”: prior work could achieve at most two of (training parallelism, $O(1)$ inference, transformer-level performance). RetNet claims a single mechanism that achieves all three.³

Practical stakes. Long-context agents, long-running chat sessions, and on-device deployment all reward $O(1)$ per-token inference. The constant-memory profile is what lets a recurrent architecture serve a 1-million-token conversation on the same GPU budget that would saturate a transformer’s KV cache at one-tenth that length. [Analysis] The deployment frontier in 2026 is hybrid (a small attention component for in-context recall plus recurrent layers for cost); pure-recurrent and pure-attention models bracket the spectrum.

Position in the broader research landscape. [External comparison] The 2023–2025 wave of non-attention architectures includes Mamba (Gu and Dao 2023), Mamba-2 (Dao and Gu 2024), Griffin (De et al. 2024), Hyena (Poli et al. 2023), HGRN (Qin et al. 2023), and DeltaNet (Yang et al. 2024). The three papers reviewed cover the LSTM-revival line (xLSTM), the delta-rule line (RWKV-7), and the retention line (RetNet). The state-space and convolutional lines are adjacent but not the subject of this review.

Figure 1 of the xLSTM paper showing the LSTM gating diagram extended with a stabilizer state, plus the sLSTM and mLSTM block layouts

5. Method overview

5A. xLSTM

The original LSTM cell, restated to anchor the comparison. From the paper: an LSTM has a scalar cell state $c_t$ and hidden state $h_t$ , updated by gates $i_t, f_t, o_t$ (input, forget, output) computed from $x_t$ and $h_{t-1}$ .¹ Original gates use sigmoid activations.

xLSTM’s two contributions to the cell.

Exponential gating with a stabilizer. Replace the sigmoid on the input gate (and optionally the forget gate) with an unbounded exponential, then carry a log-domain stabilizer state $m_t$ that subtracts off the running maximum to prevent overflow. The motivation: a sigmoid input gate saturates and cannot fully “open” the memory to incorporate a new important token; exponential gating fixes the storage-decision bottleneck the original LSTM had. [Reconstructed] The equation typesetting was not reachable on ar5iv, so the equations are reconstructed from the paper PDF’s notation and the multiple summary sources cited in Section 6.
Two memory variants. sLSTM keeps a scalar cell but adds memory mixing across multiple heads via a head-wise recurrent connection; mLSTM replaces the scalar cell with a $d_h \times d_h$ matrix updated by a key-value outer product. mLSTM has no memory mixing across heads, which is the price paid for parallelisability.
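
To make the stabilizer concrete, here is a minimal scalar sketch of exponential gating with and without $m_t$. It is [Reconstructed] in the same sense as the surrounding text: the update with a normalizer state `n` and the rule $m_t = \max(\log f_t + m_{t-1}, \log i_t)$ follows the commonly cited form of the sLSTM cell and should be read as an assumption, not a transcription. Both gates are taken to be exponential, the output gate `o` is passed in as a plain number, and the function names are hypothetical.

```python
import math


def slstm_step_naive(c, n, i_pre, f_pre, z, o):
    """Exponential gates evaluated directly; overflows for large pre-activations."""
    i_gate = math.exp(i_pre)
    f_gate = math.exp(f_pre)
    c = f_gate * c + i_gate * z      # cell state
    n = f_gate * n + i_gate          # normalizer state (assumed form)
    return c, n, o * c / n


def slstm_step_stabilized(c, n, m, i_pre, f_pre, z, o):
    """Same update with both gates rescaled by exp(-m_t), so exponents stay <= 0."""
    m_new = max(f_pre + m, i_pre)             # log-domain running maximum
    i_gate = math.exp(i_pre - m_new)
    f_gate = math.exp(f_pre + m - m_new)
    c = f_gate * c + i_gate * z
    n = f_gate * n + i_gate
    return c, n, m_new, o * c / n


# Moderate pre-activations: both versions produce the same hidden value.
c = n = 0.0
cs = ns = ms = 0.0
for i_pre, f_pre, z in [(0.5, -0.2, 1.0), (1.5, 0.3, -2.0), (-0.7, 0.1, 0.5)]:
    c, n, h_naive = slstm_step_naive(c, n, i_pre, f_pre, z, o=1.0)
    cs, ns, ms, h_stable = slstm_step_stabilized(cs, ns, ms, i_pre, f_pre, z, o=1.0)
    print(f"naive h = {h_naive:.6f}   stabilized h = {h_stable:.6f}")

# A very large input-gate pre-activation: only the stabilized step survives.
print(slstm_step_stabilized(0.0, 0.0, 0.0, 800.0, 0.0, 1.0, 1.0)[-1])
```

The two versions agree because the stabilizer scales `c` and `n` by the same factor $e^{-m_t}$, which cancels in the ratio `c / n`. The naive step raises `OverflowError` at a pre-activation of 800, where `math.exp` exceeds the double-precision range.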

xLSTM block. sLSTM is integrated into a residual block with post-up-projection (the Transformer pattern); mLSTM uses pre-up-projection (the state-space-model pattern with a component-wise output gate).¹ Architecture variants are named by their ratio of mLSTM to sLSTM layers: xLSTM[7:1] means seven mLSTM layers per one sLSTM layer.

Why both variants exist. sLSTM has memory mixing across heads (richer expressivity) but cannot parallelise across the sequence dimension; mLSTM gives up cross-head memory mixing for parallelisability. Mixing them in published ratios is the paper’s empirical sweet spot. [Analysis] The two-variant design is itself a hedge: if all you wanted was perplexity you'd ship more mLSTM; if all you wanted was small-model expressivity you'd ship more sLSTM.
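
The key-value outer-product update that mLSTM's matrix memory relies on can be illustrated in a few lines. This is a sketch of the storage-and-retrieval idea only: it assumes scalar gates, omits the normalizer, stabilizer, and output gate of the full cell, and uses hypothetical function names.

```python
import numpy as np


def memory_write(C, k, v, i_gate=1.0, f_gate=1.0):
    """C_t = f * C_{t-1} + i * v k^T: decay the old memory, add one key-value pair."""
    return f_gate * C + i_gate * np.outer(v, k)


def memory_read(C, q):
    """Retrieve by multiplying the memory with a query vector."""
    return C @ q


d = 4
keys = np.eye(d)[:2]                         # two orthonormal keys
values = np.array([[1.0, -2.0, 0.5, 3.0],
                   [0.0, 4.0, -1.0, 2.0]])

C = np.zeros((d, d))
for k, v in zip(keys, values):
    C = memory_write(C, k, v)

print(memory_read(C, keys[0]))               # recovers the first value
print(memory_read(C, keys[1]))               # recovers the second value

# A forget gate below 1 on a later write attenuates what was stored earlier.
C_decayed = memory_write(memory_write(np.zeros((d, d)), keys[0], values[0]),
                         keys[1], values[1], f_gate=0.5)
print(memory_read(C_decayed, keys[0]))       # half of the first value
```

Retrieval is exact here only because the keys are orthonormal; with correlated keys the stored values interfere, which is one reason the capacity of a $d_h \times d_h$ memory is bounded.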

Novelty classification. Exponential gating: [New] as a stabilised, gated-LSTM ingredient (variants existed in prior work without the stabilizer). Matrix memory with covariance update: [Adapted] from Linear Attention’s key-value outer-product recurrence (Katharopoulos et al. 2020) and from Fast Weight Programmers (Schlag et al. 2021).⁷ Memory mixing across heads via recurrent connections: [New] in this specific form.

5B. RWKV-7 “Goose”

Plain-English intuition. RWKV-7 carries a $d_h \times d_h$ matrix state per head and updates it at every token using a generalised delta rule: write a value into a key slot, optionally remove what was previously stored at a related slot, and let the rest of the state decay by a learned per-channel factor. The “generalisation” compared to RWKV-6 is that the decay is a vector (per-channel) and the in-context learning rate is also a vector, which lets the network decide on a per-feature basis what to keep and what to overwrite.²

Mechanism. The state update equation (from the paper, transcribed from the ar5iv render):

$S_t = S_{t-1}\big(\operatorname{diag}(w_t) - \hat{\kappa}_t^\top (a_t \odot \hat{\kappa}_t)\big) + v_t^\top \tilde{k}_t$

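where $w_t$ is the vector-valued decay, $\hat{\kappa}_t$ is a per-head normalised removal key, $a_t$ is the in-context learning rate vector with elements in $(0, 1)$, $\tilde{k}_t$ is a relaxed replacement key, and $v_t$ is the value vector. All of these are row vectors, so $v_t^\top \tilde{k}_t$ is an outer product with values along rows and keys along columns. The output is $y_t = q_t S_t^\top$, which contracts the query $q_t$ (called receptance in the RWKV lineage) against the key index of the state.²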

Connection to delta rule and DeltaNet. From the paper’s own comparison table:

RWKV-6: $S_t = S_{t-1} \operatorname{diag}(w_t) + v_t^\top k_t$ (diagonal decay, no removal).
DeltaNet: $S_t = S_{t-1}(I - a k_t^\top k_t) + a v_t^\top k_t$ (scalar learning rate, no decay).
RWKV-7: combines both, with a non-diagonal transition (the $-\hat{\kappa}_t^\top (a_t \odot \hat{\kappa}_t)$ term) and vector-valued gating.
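
The difference between the update rules in the comparison above is easiest to see by writing twice to the same key. The sketch below implements the RWKV-7 state update exactly as written in the equation, plus the RWKV-6 rule for contrast. The simplifications are assumptions for illustration: no decay ($w_t = 1$), full learning rate ($a_t = 1$), and removal and replacement keys set equal to one unit-norm key, which reduces RWKV-7 to a plain delta rule. Because the state is built from $v^\top k$ terms, the sketch reads it out by contracting the query against the key axis (`S @ q`). Function names are hypothetical.

```python
import numpy as np


def rwkv7_transition(w, kappa_hat, a):
    """diag(w) - kappa_hat^T (a * kappa_hat): diagonal decay plus a rank-one removal."""
    return np.diag(w) - np.outer(kappa_hat, a * kappa_hat)


def rwkv7_step(S, w, kappa_hat, a, k_tilde, v):
    """S_t = S_{t-1} @ transition + v^T k_tilde (rows index values, columns index keys)."""
    return S @ rwkv7_transition(w, kappa_hat, a) + np.outer(v, k_tilde)


def rwkv6_step(S, w, k, v):
    """S_t = S_{t-1} diag(w) + v^T k: diagonal decay, no removal."""
    return S * w + np.outer(v, k)


d = 4
k = np.array([0.6, 0.8, 0.0, 0.0])            # unit-norm key
v_old = np.array([1.0, 2.0, 3.0, 4.0])
v_new = np.array([-1.0, 0.5, 0.0, 2.0])
ones = np.ones(d)                              # w = 1 (no decay), a = 1 (full rate)

S7 = np.zeros((d, d))
S7 = rwkv7_step(S7, ones, k, ones, k, v_old)
S7 = rwkv7_step(S7, ones, k, ones, k, v_new)

S6 = np.zeros((d, d))
S6 = rwkv6_step(S6, ones, k, v_old)
S6 = rwkv6_step(S6, ones, k, v_new)

read7 = S7 @ k    # the removal term erased v_old before v_new was written
read6 = S6 @ k    # both writes accumulate
print("RWKV-7 style read:", read7)
print("RWKV-6 style read:", read6)
print("off-diagonal entry of the transition:", rwkv7_transition(ones, k, ones)[0, 1])
```

In this reduced setting the RWKV-7 read returns only the new value, while the RWKV-6 read returns the sum of both. The non-zero off-diagonal entry confirms that the transition matrix is not diagonal, which is the property the paper ties to state tracking.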

Design rationale. The non-diagonal piece of the transition is what enables non-permutation finite-state transitions; the vector-valued decay decouples per-feature memory horizons; the in-context learning rate lets the network learn when to write fast and when to write slow.

Novelty. [Adapted] from DeltaNet’s delta-rule recurrence (Yang et al. 2024). Vector-valued gating and the decoupled removal-versus-replacement key design are [New].

5C. RetNet

Plain-English intuition. RetNet keeps the query / key / value vocabulary of attention but replaces the softmax similarity with a fixed exponential decay $\gamma^{i-j}$ between positions $i$ and $j$ . With softmax removed, the resulting matrix operation can be rewritten as a recurrence with a fixed-size matrix state (the linear-attention trick from 2020), with the decay built in so the network does not have to learn it, and with multiple decay scales $\gamma_h$ across heads.³

Mechanism. Three forms (Section 6 MATH ENTRY 4 carries the equations):

Parallel form for training: $\text{Retention}(X) = (QK^\top \odot D) V$ where $D_{ij} = \gamma^{i-j}$ for $i \ge j$, and $0$ otherwise.
Recurrent form for inference: $S_n = \gamma S_{n-1} + k_n^\top v_n$ , output $y_n = q_n S_n$ .
Chunkwise form for long-sequence training: compute the parallel form within each chunk, pass the per-chunk recurrent summary to the next chunk.
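
The three forms above can be checked against each other numerically. This is a single-head sketch under stated assumptions: one scalar decay `gamma`, no scaling factor, no positional rotation, no GroupNorm or gating, and random matrices standing in for the projected queries, keys, and values. The function names and the chunk size are hypothetical choices made for the example.

```python
import numpy as np


def decay_mask(n, gamma):
    """D[i, j] = gamma**(i - j) for i >= j, and 0 otherwise."""
    diff = np.arange(n)[:, None] - np.arange(n)[None, :]
    return np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)


def retention_parallel(Q, K, V, gamma):
    """(Q K^T * D) V over the whole sequence at once."""
    return ((Q @ K.T) * decay_mask(Q.shape[0], gamma)) @ V


def retention_recurrent(Q, K, V, gamma):
    """S_n = gamma * S_{n-1} + k_n^T v_n, y_n = q_n S_n, one token at a time."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        S = gamma * S + np.outer(K[n], V[n])
        out[n] = Q[n] @ S
    return out


def retention_chunkwise(Q, K, V, gamma, chunk):
    """Parallel form inside each chunk, recurrent state carried between chunks."""
    S = np.zeros((K.shape[1], V.shape[1]))
    outputs = []
    for start in range(0, Q.shape[0], chunk):
        q, k, v = Q[start:start + chunk], K[start:start + chunk], V[start:start + chunk]
        b = q.shape[0]
        pos = np.arange(b)
        inner = ((q @ k.T) * decay_mask(b, gamma)) @ v            # within-chunk terms
        cross = (gamma ** (pos + 1))[:, None] * (q @ S)           # terms from earlier chunks
        outputs.append(inner + cross)
        # Decay the carried state across the chunk and fold in this chunk's pairs.
        S = gamma ** b * S + (k * (gamma ** (b - 1 - pos))[:, None]).T @ v
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
N, d = 10, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
gamma = 0.9

y_par = retention_parallel(Q, K, V, gamma)
y_rec = retention_recurrent(Q, K, V, gamma)
y_chk = retention_chunkwise(Q, K, V, gamma, chunk=4)
print("parallel == recurrent:", np.allclose(y_par, y_rec))
print("parallel == chunkwise:", np.allclose(y_par, y_chk))
```

The chunk size of 4 does not divide the sequence length of 10, so the example also exercises a ragged final chunk. The recurrent form touches only the $d \times d$ state per step, while the parallel form materialises an $N \times N$ mask, which is the trade the chunkwise form is designed to balance.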

Multi-scale retention. Each head gets its own decay $\gamma_h$, ranging across an exponential schedule; the paper reports per-head decays roughly from 0.96 to 0.99 in the published configurations.³ Multi-scale retention is combined with swish gating and GroupNorm for numerical stability.
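
One way to read those decay values is as memory horizons. The sketch below converts a decay $\gamma$ into a half-life (the number of tokens after which a contribution has shrunk by half) and into the sum of the geometric weights $1 / (1 - \gamma)$. These two summary quantities are this article's own illustration, not figures from the RetNet paper, and the function names are hypothetical.

```python
import math


def half_life(gamma):
    """Number of steps n at which gamma**n == 0.5."""
    return math.log(0.5) / math.log(gamma)


def effective_horizon(gamma):
    """Sum of the weights 1 + gamma + gamma**2 + ... = 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


for gamma in (0.96, 0.99):
    print(f"gamma={gamma}: half-life ~{half_life(gamma):.0f} tokens, "
          f"horizon ~{effective_horizon(gamma):.0f} tokens")
```

A head with $\gamma = 0.96$ halves a contribution after about 17 tokens, while a head with $\gamma = 0.99$ takes about 69, so the two ends of the reported range differ by roughly a factor of four in how far back they look.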

Design rationale. A fixed decay (not learned per-token) is the price paid for the three-form equivalence. RWKV-7 by contrast learns per-token, per-channel decays, at the cost of a more involved kernel.

Novelty. [Adapted] from Linear Attention; the explicit three-form template, multi-scale decay schedule, and the GroupNorm + swish recipe are [New] in this specific combination.

Figure of RWKV-7 state update showing the dynamic state evolution mechanism with vector-valued gating and the generalized delta rule, from the paper

6. Mathematical contributions

The heart of this review. Six MATH ENTRY blocks, in order: original LSTM cell, xLSTM-sLSTM with exponential gating, xLSTM-mLSTM matrix memory, RetNet retention, RWKV-7 generalised delta rule, RWKV-7 state-tracking theorem.

MATH ENTRY 1: Original LSTM cell (baseline).

Source: Hochreiter and Schmidhuber 1997, restated in the xLSTM paper as the starting point.
What it is: a recurrent neural network cell that carries a scalar cell state $c_t$ and uses three sigmoid gates to decide what to forget, what to write, and what to read.
Formal definition (single cell, $d_h = 1$ for clarity):
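
A minimal sketch of one step of that cell, assuming the standard textbook formulation (sigmoid gates, tanh on the cell input and on the cell read-out) and not a transcription of the xLSTM paper's typesetting. The `params` layout, with one `(w, r, b)` triple of input weight, recurrent weight, and bias per gate, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params maps 'i', 'f', 'o', 'z' to (w, r, b)."""
    pre = {name: w * x + r * h_prev + b for name, (w, r, b) in params.items()}
    i = sigmoid(pre["i"])            # input gate: how much to write
    f = sigmoid(pre["f"])            # forget gate: how much of c_prev to keep
    o = sigmoid(pre["o"])            # output gate: how much to read out
    z = math.tanh(pre["z"])          # candidate cell input
    c = f * c_prev + i * z           # cell state update
    h = o * math.tanh(c)             # hidden state
    return h, c


# Illustrative parameters only; a trained cell would learn these.
params = {"i": (0.5, 0.1, 0.0), "f": (0.2, 0.0, 1.0),
          "o": (0.3, 0.2, 0.0), "z": (1.0, 0.5, 0.0)}

h, c = 0.0, 0.0
for x in (1.0, -0.5, 2.0):
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  c={c:+.4f}  h={h:+.4f}")
```

Because both gates pass through a sigmoid, `i` and `f` are confined to $(0, 1)$; this is the saturation that the exponential gating in Section 5A is meant to relax.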

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. Beck et al., xLSTM: Extended Long Short-Term Memory, arXiv:2405.04517, NeurIPS 2024 spotlight (accessed 2026-05-19)
2. Peng et al., RWKV-7 "Goose" with Expressive Dynamic State Evolution, arXiv:2503.14456 (accessed 2026-05-19)
3. Sun et al., Retentive Network: A Successor to Transformer for Large Language Models, arXiv:2307.08621 (accessed 2026-05-19)
4. OpenReview landing page for the xLSTM NeurIPS 2024 spotlight (accessed 2026-05-19)
5. ar5iv HTML render of the RWKV-7 paper used to transcribe equations (accessed 2026-05-19)
6. ar5iv HTML render of the RetNet paper used for the three-form template and reported throughput numbers (accessed 2026-05-19)
7. Katharopoulos et al., Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, arXiv:2006.16236 (accessed 2026-05-19)
8. NX-AI official xLSTM reference implementation, Apache-2.0 (accessed 2026-05-19)
9. BlinkDL official RWKV-LM reference implementation, Apache-2.0 (accessed 2026-05-19)
10. microsoft/torchscale, hosts the official RetNet reference, MIT-licensed (accessed 2026-05-19)
11. Hugging Face: BlinkDL RWKV model collection, Apache-2.0 (accessed 2026-05-19)