Neural Tech Daily
ai-research

xLSTM, RWKV-7, and RetNet — a multi-paper review of linear-time transformer alternatives

Technical walkthrough of three non-attention sequence architectures: xLSTM's exponential gating, RWKV-7's generalized delta rule, and RetNet's three-form retention.

Updated ~71 min read

Reading-register key

  • From the paper: claims drawn verbatim or near-verbatim from the cited paper’s text, equations, tables, or figures.
  • [Analysis]: the publication’s own reasoned assessment, distinct from any claim the three reviewed papers make.
  • [Reconstructed]: content faithfully reconstructed because the source partially disclosed it; flagged where used.
  • [External comparison]: comparison to prior or contemporary work outside the three reviewed papers.
  • [Reviewer Perspective]: a critical or speculative assessment that goes beyond what any of the three reviewed papers proves.

1. Paper identity and scope

This article reviews three papers that share an unfashionable claim: the self-attention block is not the only viable backbone for a large language model. Each paper proposes a sequence mixer whose cost scales linearly with sequence length at inference and whose state at any token position fits in a fixed-size memory, in contrast to attention’s $O(N^2)$ training cost and growing key-value cache.

Paper A, xLSTM. Beck, Pöppel, Spanring, Auer, Prudnikova, Kopp, Klambauer, Brandstetter, Hochreiter. xLSTM: Extended Long Short-Term Memory. arXiv:2405.04517, May 2024. Accepted to NeurIPS 2024 as a spotlight. 1 Hochreiter is the co-inventor of the original 1997 LSTM; the paper is explicitly framed as a “scaled-up LSTM” rather than a clean-slate design.

Paper B, RWKV-7 “Goose”. Peng et al. (18 authors). RWKV-7 “Goose” with Expressive Dynamic State Evolution. arXiv:2503.14456, March 2025. 2 Seventh major version of the RWKV (Receptance Weighted Key Value) line, descended from Peng’s earlier RWKV-4 paper.

Paper C, RetNet. Sun, Dong, Huang, Ma, Xia, Xue, Wang, Wei. Retentive Network: A Successor to Transformer for Large Language Models. arXiv:2307.08621, July 2023 (revised August 2023). 3 Microsoft Research. Brought in as a structural reference point because it codified the three-form (parallel / recurrent / chunkwise) computational template that both RWKV-7 and xLSTM-mLSTM later adopted.

Retrieval status. All three arXiv landing pages were fetched 2026-05-19. The ar5iv HTML render was reachable for RWKV-7 and RetNet but the xLSTM ar5iv build had a fatal conversion error on the day of writing; the NeurIPS proceedings PDF and the OpenReview landing page supplied the missing structural content, and equation transcription falls back to the multi-source summary literature cited inline. 4 Where xLSTM content carries [Reconstructed], the gap is the missing ar5iv render of the equations exactly as typeset.

Paper classification. All three: Architecture proposal, Training method, Inference method. xLSTM and RWKV-7 additionally: Representation learning. All three benchmark LLM-scale language modelling and are evaluated under the data-driven category.

Technical abstract in publication voice. Three papers attack the same target (replace softmax self-attention with a sequence mixer whose recurrent form runs in $O(1)$ per token and whose parallel form trains as fast as attention), and propose three different mathematical primitives. RetNet (Microsoft, 2023) keeps the query / key / value vocabulary of attention but replaces the softmax with an exponentially-decaying retention matrix that admits an exact parallel form, an exact recurrent form, and a chunkwise form trading parallelism against memory. xLSTM (Beck et al., 2024) keeps the LSTM gating vocabulary of 1997 but introduces exponential input and forget gates with a numerical-stabilizer state, plus two memory variants: sLSTM with scalar cells and across-head memory mixing, and mLSTM with a per-head matrix memory updated via an outer-product covariance rule. RWKV-7 (Peng et al., 2025) starts from a delta-rule recurrence and generalises it to vector-valued decays, an in-context learning rate, and a relaxed value-replacement rule, all of which together let the network represent state-tracking transitions that diagonal-transition RNNs and TC$^0$-bounded transformers provably cannot. 5

Primary research question (per paper).

  • xLSTM: can the LSTM cell, restored with exponential gating and matrix memory, scale to billion-parameter language models that match transformers and state-space models?
  • RWKV-7: can a delta-rule recurrence with vector-valued gating exceed transformer expressivity on formal-language tasks while keeping linear-time training and constant-memory inference?
  • RetNet: is there a single sequence mixer that simultaneously achieves training parallelism, $O(1)$-per-token inference, and competitive language-modelling performance, the “impossible triangle”?

Core technical claim (shared). A carefully chosen non-attention recurrence trained at LLM scale can match transformer perplexity on standard benchmarks while running inference at constant memory and time per token, and can in some formal-language regimes exceed what transformer architectures can represent in a constant number of layers.

Core technical domains and depth. Linear algebra over matrix memories (deep). Recurrent neural network design (deep). Softmax-attention complexity and KV-cache mechanics (moderate). State-space models as comparison points (surface). Formal-language theory, TC$^0$ versus NC$^1$ separation (moderate, RWKV-7 only). CUDA-kernel design (surface).

Reader prerequisites. High-school algebra. Familiarity with the transformer block (attention, residual stream) helps but is not required; Section 2.5 Glossary covers every prerequisite term. No prior exposure to state-space models or formal-language theory needed.

2. TL;DR and executive overview

3-sentence TL;DR. A standard transformer’s attention layer costs roughly $N \times N$ in time and memory when reading a sequence of $N$ words, which is what makes long documents and long-running chats expensive; the three papers in this review build sequence layers whose cost grows like $N \times 1$ instead, so doubling the sequence length only doubles the work. xLSTM does this by reviving the 1997 LSTM with exponential “gates” and a small matrix memory, RWKV-7 generalises an older idea called the delta rule with extra learnable knobs that let the network track state more flexibly than a standard transformer can, and RetNet keeps the query / key / value vocabulary of attention but swaps the softmax for a fixed exponential decay so the same computation can run in three interchangeable forms. All three match or approach transformer language-modelling quality at scales between 0.1 and 7 billion parameters while cutting inference memory by roughly an order of magnitude on long inputs.

Executive summary (~100 words). Attention is expensive at long sequence lengths because every new word has to look back at every previous word. The three papers reviewed replace attention with a recurrence that carries a fixed-size summary forward, like the hidden state of an old-school RNN but with a much larger and more cleverly updated memory. RetNet (2023) formalised the three-form template: parallel for training, recurrent for inference, chunkwise for long context. xLSTM (NeurIPS 2024) plugged exponential gating and a matrix memory into the LSTM cell. RWKV-7 (March 2025) extended a delta-rule recurrence with vector-valued gating that let it exceed standard transformer expressivity on state-tracking tasks.

Five practitioner takeaways.

  • The three architectures share a constant-memory inference profile: the per-step cost does not depend on how much text the model has already read, which matters for long agent loops and on-device inference. 6
  • Training cost is no longer the bottleneck. xLSTM-mLSTM, RWKV-7, and RetNet all admit a parallel form that trains in the same wall-clock ballpark as a comparable transformer at sequence lengths up to a few thousand tokens. RWKV-7 reports a 3$\times$ speedup over RWKV-6’s training kernel and surpasses FlashAttention v3 at 16k-token sequences. 2
  • The benchmark gap to a parameter-matched transformer is small but real. The xLSTM paper reports its 1.3B model on SlimPajama 300B-token training outperforming Llama, RWKV-4, and Mamba baselines on most measured tasks; RWKV-7-2.9B matches Qwen2.5-3B on an English aggregate while training on roughly one-third the tokens. 1 2
  • The papers split on what they care about. RetNet sells the impossible-triangle framing (training parallelism + cheap inference + quality); xLSTM sells parity at scale plus the matrix-memory abstraction; RWKV-7 sells expressivity (state tracking beyond TC$^0$) plus multilingual SoTA at 3B.
  • None of the three has displaced attention in frontier commercial chat models as of writing. The deployment frontier remains transformer-dominated; hybrid architectures (a few attention layers mixed with many recurrent layers) are the more common production pattern. [Analysis]

Pipeline overview in text. A modern transformer language model is a stack of identical blocks; each block has a sequence-mixer (attention) and a position-mixer (MLP), wrapped in residual connections and normalisation. The three papers reviewed swap out the sequence-mixer. xLSTM offers two drop-in replacements (sLSTM and mLSTM blocks) and mixes them in published ratios. RWKV-7 replaces the entire block with a time-mix (its sequence mixer) and channel-mix (its position mixer) pair. RetNet replaces attention with multi-scale retention while keeping the standard feedforward MLP. All three are trained with the same next-token-prediction loss, the same AdamW family of optimisers, the same residual-stream backbone. Only the sequence-mixer math changes.

2.5 Glossary

| Term | Plain-English explanation | First appears in |
| --- | --- | --- |
| Self-attention | The sequence-mixer in a standard transformer; for each token it computes a weighted sum over all earlier tokens via a softmax over query-key dot products. Costs $O(N^2)$ in sequence length. | Section 1 |
| Linear-time architecture | A sequence model whose per-token compute does not grow with how much text came before; total cost scales like $N \times 1$ instead of $N \times N$. | Section 1 |
| Residual stream | The running vector of numbers that flows from one transformer block to the next; each block reads it, computes an update, and adds the update back. | Section 1 |
| Hidden state (RNN) | A fixed-size memory vector carried forward by a recurrent network; updated at every token. | Section 5A |
| Gate (LSTM) | A learned multiplier between 0 and 1 that decides how much of an incoming signal flows through. Forget gates decide what to drop; input gates decide what to keep. | Section 5A |
| KV cache | In a deployed transformer, the stored keys and values from previous tokens that attention has to look back at; grows linearly with chat length and dominates memory at long context. | Section 4 |
| State-space model | A class of sequence models (S4, Mamba) that propagate a small continuous-state vector through time via a linear recurrence; competitor architecture to the three papers reviewed. | Section 4 |
| Outer product | The matrix you get by multiplying a column vector $u$ of length $m$ by a row vector $v^T$ of length $n$, giving an $m \times n$ matrix $u v^T$. The covariance update in mLSTM and RWKV-7 is built from outer products. | Section 6, MATH ENTRY 3 |
| Delta rule | An old neural-network learning rule (Widrow-Hoff 1960) that updates a weight matrix by the outer product of an error vector and an input vector; reused inside RWKV-7’s state update. | Section 6, MATH ENTRY 5 |
| TC$^0$ / NC$^1$ | Complexity classes used to describe what a constant-depth circuit can compute. Transformers with finite precision are bounded by TC$^0$; certain state-tracking tasks live in NC$^1$ and are believed to lie outside TC$^0$. | Section 6, MATH ENTRY 6 |
| “From the paper:” prefix | Content directly supported by the cited paper’s text, equations, tables, or figures. | Throughout |
| [Analysis] label | The publication’s own reasoned assessment, distinct from what any reviewed paper itself claims. | Throughout |
| [Reviewer Perspective] label | A critical or speculative assessment that goes beyond what the reviewed papers prove. | Section 11, 12 |
| [Reconstructed] label | Content the publication faithfully reconstructed because the source partially disclosed it. | Section 6 xLSTM entries |
| [External comparison] label | A comparison to prior work or general knowledge outside the three reviewed papers. | Section 4, 11 |

The Glossary covers every prerequisite term the body uses. A reader who understands quadratic versus linear cost ($N^2$ versus $N$) and has met the word “matrix” can read the rest of the article using this Glossary as the dictionary.

3. Problem formalisation

The shared problem the three papers attack: design a sequence-mixer $f$ that takes a sequence of input vectors $x_1, x_2, \dots, x_N \in \mathbb{R}^d$ and produces a sequence of output vectors $y_1, y_2, \dots, y_N \in \mathbb{R}^d$ such that:

  • $f$ is causal: $y_t$ depends only on $x_1, \dots, x_t$.
  • $f$ admits a parallel form that computes $y_1, \dots, y_N$ in time roughly linear in $N$ using matrix multiplications on contemporary GPUs.
  • $f$ admits a recurrent form in which the network carries a fixed-size hidden state $S_t$ such that $y_t$ and $S_t$ are computable from $x_t$ and $S_{t-1}$ in $O(1)$ time per step.
  • The two forms must compute the same function up to numerical error.
  • Quality on standard language-modelling benchmarks (perplexity, downstream task accuracy) is competitive with a parameter-matched softmax-attention block.

Notation

| Symbol | Type | Meaning | First appears in |
| --- | --- | --- | --- |
| $N$ | scalar | Sequence length (number of tokens). | Section 3 |
| $d$ | scalar | Model (embedding) dimension. | Section 3 |
| $d_h$ | scalar | Per-head dimension; typically $d / H$ for $H$ heads. | Section 6 |
| $x_t$ | vector of length $d$ | Input vector at position $t$. | Section 3 |
| $h_t$ | vector of length $d_h$ | Hidden / output vector at position $t$ for a recurrent cell. | Section 6 MATH ENTRY 1 |
| $c_t$ | scalar or matrix | LSTM cell state; scalar in sLSTM, matrix $C_t$ in mLSTM. | Section 6 MATH ENTRY 2 |
| $i_t, f_t, o_t$ | vector of length $d_h$ | Input, forget, output gates. | Section 6 MATH ENTRY 1 |
| $m_t$ | scalar | xLSTM stabilizer state (log-domain max of forget and input gate pre-activations). | Section 6 MATH ENTRY 2 |
| $q_t, k_t, v_t$ | vector of length $d_h$ | Query, key, value at position $t$. | Section 6 MATH ENTRY 3 |
| $S_t$ | $d_h \times d_h$ matrix | Matrix state of the recurrent form (mLSTM $C_t$, RWKV-7 $\mathrm{wkv}_t$, RetNet $S_n$). | Section 6 MATH ENTRY 3 |
| $\gamma$ | scalar | Scalar exponential decay in RetNet. | Section 6 MATH ENTRY 4 |
| $w_t, a_t, \kappa_t$ | vector of length $d_h$ | RWKV-7 vector-valued decay, in-context learning rate, removal key. | Section 6 MATH ENTRY 5 |
| $D$ | $N \times N$ matrix | RetNet decay-with-causal-mask matrix; entries $D_{ij} = \gamma^{i-j}$ for $i \ge j$, zero otherwise. | Section 6 MATH ENTRY 4 |
| $\sigma(\cdot)$ | function | Sigmoid activation, maps a real number into the interval $(0, 1)$. | Section 6 |
| $\exp(\cdot)$ | function | Element-wise exponential. | Section 6 |

Formal problem statement

Input space: $(\mathbb{R}^d)^N$ for sequence length $N$ and embedding dimension $d$. Output space: $(\mathbb{R}^d)^N$. The objective is standard next-token cross-entropy loss when the sequence is a token-level language modelling task. The constraints are causality plus the parallel-form / recurrent-form equivalence stated above.

Explicit assumption list

  • Causal language modelling. All three papers target left-to-right next-token prediction. Bidirectional variants are out of scope.
  • Fixed embedding dimension across positions. The block reads and writes vectors of the same size; consistent with the standard transformer stack.
  • GPU-friendly parallel form. “Linear in $N$ for the parallel form” means linear in matmul-equivalent work, not necessarily in wall-clock without a custom kernel. RWKV-7 and xLSTM-mLSTM ship CUDA kernels; the chunkwise form is where the recurrence has to be unrolled chunk-by-chunk. 5 [Analysis] Potentially strong assumption when comparing against FlashAttention v3, which is itself a deeply optimised kernel.
  • Fixed-size matrix state $S_t$. mLSTM, RWKV-7, and RetNet all carry a $d_h \times d_h$ matrix per head per layer. The total state footprint scales like $H \cdot L \cdot d_h^2$ for $L$ layers and $H$ heads. The footprint does not depend on sequence length but does grow quadratically in per-head dimension, a real cost at typical $d_h = 64$ to $128$. 5
  • Bounded numerical precision. RWKV-7’s expressivity argument (recognising all regular languages) assumes the network carries enough precision to represent the relevant transition matrices; the formal statement is sensitive to the finite-precision regime. 2 [Analysis] The TC$^0$-versus-NC$^1$ separation theorems for transformers also depend on finite-precision assumptions; the like-for-like comparison is appropriate but the constants matter in practice.

Why the problem is hard

A naive RNN’s recurrent form is $O(1)$ per step but its training form is intrinsically sequential: you cannot compute $h_t$ until $h_{t-1}$ is done. Softmax attention’s training form is parallel but its inference KV-cache grows with $N$. The hard part is the fourth constraint: parallel and recurrent forms that compute the same function. Linear attention (Katharopoulos et al. 2020) showed it is possible to express attention without softmax as a recurrence over a $d_h \times d_h$ matrix state; the three papers are downstream of that observation and each contributes a different choice of how to update the matrix state. 7
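The parallel / recurrent equivalence that linear attention established can be checked numerically. The sketch below is a minimal NumPy illustration, assuming an identity feature map and no normaliser (both are simplifications relative to Katharopoulos et al.); the function names and the random test shapes are illustrative and not taken from any of the reviewed papers.

```python
import numpy as np

def linear_attention_parallel(Q, K, V):
    """Causal, unnormalised linear attention via a masked N x N score matrix."""
    N = Q.shape[0]
    causal_mask = np.tril(np.ones((N, N)))      # 1 where j <= i, else 0
    return ((Q @ K.T) * causal_mask) @ V

def linear_attention_recurrent(Q, K, V):
    """Same function, computed with a fixed-size d_h x d_h matrix state."""
    N, d_h = K.shape
    S = np.zeros((d_h, V.shape[1]))             # state size is independent of N
    out = np.zeros_like(V)
    for t in range(N):
        S = S + np.outer(K[t], V[t])            # write the (key, value) pair
        out[t] = Q[t] @ S                       # read with the current query
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
print(np.allclose(linear_attention_parallel(Q, K, V),
                  linear_attention_recurrent(Q, K, V)))
```

The parallel version materialises an $N \times N$ matrix; the recurrent version never holds more than one $d_h \times d_h$ state, which is the property all three reviewed architectures build on.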

4. Motivation and gap

The concrete cost driver for production deployment of transformer chat models is the KV cache. At inference, every previous token’s key and value vectors must remain in GPU memory so the next token’s attention layer can read them; the number of cached values scales as $2 \cdot L \cdot H \cdot d_h \cdot N$ for sequence length $N$, model layers $L$, attention heads $H$, head dim $d_h$. At $L=32$, $H=32$, $d_h=128$, $N=128000$ the cache holds roughly $3.4 \times 10^{10}$ values, about 67 GB per request at 16-bit precision, the dominant memory term in long-context serving. [External comparison] Long-context serving systems (vLLM, TensorRT-LLM) spend most of their engineering budget compressing or paging this cache.
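The arithmetic behind that figure, and the contrast with the fixed matrix-state footprint $H \cdot L \cdot d_h^2$ from the assumption list in Section 3, can be sketched in a few lines. The 2-bytes-per-value figure is an assumption (fp16 / bf16 storage), and the function names are illustrative.

```python
def kv_cache_elements(n_layers, n_heads, head_dim, seq_len):
    """Cached key and value scalars: 2 * L * H * d_h * N."""
    return 2 * n_layers * n_heads * head_dim * seq_len

def matrix_state_elements(n_layers, n_heads, head_dim):
    """One d_h x d_h state per head per layer: H * L * d_h^2."""
    return n_layers * n_heads * head_dim * head_dim

BYTES_PER_VALUE = 2  # assumption: 16-bit storage

kv = kv_cache_elements(n_layers=32, n_heads=32, head_dim=128, seq_len=128_000)
state = matrix_state_elements(n_layers=32, n_heads=32, head_dim=128)

kv_gb = kv * BYTES_PER_VALUE / 1e9
state_mb = state * BYTES_PER_VALUE / 1e6
print(f"KV cache:     {kv_gb:.1f} GB")     # KV cache:     67.1 GB
print(f"matrix state: {state_mb:.1f} MB")  # matrix state: 33.6 MB
print(f"ratio:        {kv // state}x")     # ratio:        2000x
```

The ratio is $2N / d_h$, so it keeps growing with context length while the recurrent state stays fixed.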

Existing approaches and their failure modes (cited by the three papers).

  • Sparse attention (Longformer, BigBird): truncates the attention window or sparsifies the pattern; gives up information beyond the window. From the RWKV-7 paper, sparse-attention variants “limit the model’s ability to capture long-range dependencies.” 2
  • Linear attention (Katharopoulos 2020): replaces softmax with a kernel feature map and rewrites attention as a matrix-state recurrence; the practical issue is that the resulting models underperform softmax attention at language-modelling quality. The RetNet paper explicitly motivates retention as a fix. 3
  • State-space models (S4, Mamba, Mamba-2): use a structured linear recurrence with input-dependent transitions; competitive on standard LM benchmarks but with their own limitations on tasks requiring associative recall or state tracking, as flagged by Park et al.’s work that RWKV-7 cites. 2
  • Linear RNNs / RetNet itself: deliver the parallel + recurrent equivalence but historically traded language-modelling quality for efficiency.

The gap each paper claims to fill.

  • xLSTM. “Question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs?” 1 The paper frames the missing piece as exponential gating with stabilisation plus a parallelisable matrix-memory variant; with those in place, the authors argue LSTMs scale.
  • RWKV-7. Earlier RWKV versions and DeltaNet both use diagonal or scalar-gated state transitions; neither can represent non-permutation finite-state transitions in a constant number of layers, which is the formal limit on state tracking. RWKV-7 introduces non-diagonal input-dependent transitions while preserving the parallel-form training. 2
  • RetNet. The framing is the “impossible triangle”: prior work could achieve at most two of (training parallelism, $O(1)$ inference, transformer-level performance). RetNet claims a single mechanism that achieves all three. 3

Practical stakes. Long-context agents, long-running chat sessions, and on-device deployment all reward $O(1)$ per-token inference. The constant-memory profile is what lets a recurrent architecture serve a 1-million-token conversation on the same GPU budget that would saturate a transformer’s KV cache at one-tenth that length. [Analysis] The deployment frontier in 2026 is hybrid (a small attention component for in-context recall plus recurrent layers for cost); pure-recurrent and pure-attention models bracket the spectrum.

Position in the broader research landscape. [External comparison] The 2023–2025 wave of non-attention architectures includes Mamba (Gu and Dao 2023), Mamba-2 (Dao and Gu 2024), Griffin (De et al. 2024), Hyena (Poli et al. 2023), HGRN (Qin et al. 2023), and DeltaNet (Yang et al. 2024). The three papers reviewed cover the LSTM-revival line (xLSTM), the delta-rule line (RWKV-7), and the retention line (RetNet). The state-space and convolutional lines are adjacent but not the subject of this review.

Figure 1 of the xLSTM paper showing the LSTM gating diagram extended with a stabilizer state, plus the sLSTM and mLSTM block layouts

5. Method overview

5A. xLSTM

The original LSTM cell, restated to anchor the comparison. From the paper: an LSTM has a scalar cell state $c_t$ and hidden state $h_t$, updated by gates $i_t, f_t, o_t$ (input, forget, output) computed from $x_t$ and $h_{t-1}$. 1 Original gates use sigmoid activations.
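As a point of reference, the classic cell update $c_t = f_t c_{t-1} + i_t z_t$, $h_t = o_t \tanh(c_t)$ can be written out for a single scalar cell. This is a minimal sketch: the gate values are passed in directly rather than computed from $x_t$ and $h_{t-1}$, and the numeric inputs are illustrative. The loop at the end shows the sigmoid ceiling that the next subsection is concerned with.

```python
import math

def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_update(c_prev, i_t, f_t, o_t, z_t):
    """One scalar LSTM step given already-computed gate values."""
    c_t = f_t * c_prev + i_t * z_t      # forget part of the old state, write the candidate
    h_t = o_t * math.tanh(c_t)          # squash and gate the readout
    return c_t, h_t

c_t, h_t = lstm_cell_update(c_prev=2.0, i_t=0.4, f_t=0.9, o_t=0.6, z_t=0.5)
print(f"c_t = {c_t:.3f}, h_t = {h_t:.3f}")   # c_t = 2.000, h_t = 0.578

# However large the pre-activation, a sigmoid gate never exceeds 1.
for logit in (2.0, 5.0, 20.0):
    print(logit, sigmoid(logit))
```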

xLSTM’s two contributions to the cell.

  1. Exponential gating with a stabilizer. Replace the sigmoid on the input gate (and optionally the forget gate) with an unbounded exponential, then carry a log-domain stabilizer state $m_t$ that subtracts off the running maximum to prevent overflow. The motivation: a sigmoid input gate saturates and cannot fully “open” the memory to incorporate a new important token; exponential gating fixes the storage-decision bottleneck the original LSTM had. [Reconstructed] because the equation typesetting was not reachable on ar5iv; equations are reconstructed from the paper PDF’s notation and the multiple summary sources cited in Section 6.
  2. Two memory variants. sLSTM keeps a scalar cell but adds memory mixing across multiple heads via a head-wise recurrent connection; mLSTM replaces the scalar cell with a $d_h \times d_h$ matrix updated by a key-value outer product. mLSTM has no memory mixing across heads, which is the price paid for parallelisability.
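The stabilizer idea in item 1 can be made concrete with a single scalar step. The sketch below follows the reconstructed sLSTM equations given in Section 6, MATH ENTRY 2 (exponential forget-gate variant, with the $\max(|n_t|, 1)$ clamp as stated there); the function name and argument names are illustrative, and the second call uses a deliberately extreme logit to show that nothing overflows.

```python
import math

def slstm_step(c_prev, n_prev, m_prev, i_logit, f_logit, z_t, o_t):
    """One scalar sLSTM step with exponential gates and a log-domain stabilizer."""
    log_i = i_logit                            # log(exp(i_logit))
    log_f = f_logit                            # exponential forget gate variant
    m_t = max(log_f + m_prev, log_i)           # running maximum in the log domain
    i_stab = math.exp(log_i - m_t)             # stabilised gates, each at most 1
    f_stab = math.exp(log_f + m_prev - m_t)
    c_t = f_stab * c_prev + i_stab * z_t       # cell state
    n_t = f_stab * n_prev + i_stab             # normaliser state
    h_t = o_t * (c_t / max(abs(n_t), 1.0))
    return c_t, n_t, m_t, h_t

c_t, n_t, m_t, h_t = slstm_step(c_prev=1.0, n_prev=0.6, m_prev=1.0,
                                i_logit=2.0, f_logit=1.5, z_t=0.5, o_t=0.9)
print(f"m_t = {m_t}, h_t = {h_t:.3f}")         # m_t = 2.5, h_t = 0.972

# exp(1000.0) alone would raise OverflowError; the stabilised step stays finite.
print(slstm_step(1.0, 0.6, 1.0, i_logit=1000.0, f_logit=1.5, z_t=0.5, o_t=0.9))
```

With the extreme logit, the new token dominates the state: the stabilised forget gate underflows to zero and the cell state becomes the candidate value, which is the "fully open" behaviour a sigmoid input gate cannot reach.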

xLSTM block. sLSTM is integrated into a residual block with post-up-projection (the Transformer pattern); mLSTM uses pre-up-projection (the state-space-model pattern with a component-wise output gate). 1 Architecture variants are named by their ratio of mLSTM to sLSTM layers: xLSTM[7:1] means seven mLSTM layers per one sLSTM layer.

Why both variants exist. sLSTM has memory mixing across heads (richer expressivity) but cannot parallelise across the sequence dimension; mLSTM gives up cross-head memory mixing for parallelisability. Mixing them in published ratios is the paper’s empirical sweet spot. [Analysis] The two-variant design is itself a hedge: if all you wanted was perplexity you'd ship more mLSTM; if all you wanted was small-model expressivity you'd ship more sLSTM.
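For contrast with the scalar cell, one mLSTM step on a $2 \times 2$ matrix memory can be sketched as follows. It follows the reconstructed per-head equations in Section 6, MATH ENTRY 3; the gates are passed in as already-stabilised scalars, and the zero initial normaliser and all-ones output gate are assumptions made only to keep the example small.

```python
import numpy as np

def mlstm_step(C_prev, n_prev, q_t, k_t, v_t, i_t, f_t, o_t):
    """One mLSTM step: decay the matrix memory, add a value-key outer product, read with q."""
    C_t = f_t * C_prev + i_t * np.outer(v_t, k_t)     # rank-one covariance-style write
    n_t = f_t * n_prev + i_t * k_t                    # normaliser vector
    denom = max(abs(float(n_t @ q_t)), 1.0)           # clamp so the readout stays bounded
    h_t = o_t * (C_t @ q_t) / denom
    return C_t, n_t, h_t

C_prev = 0.5 * np.eye(2)
n_prev = np.zeros(2)                                  # assumption: empty normaliser
q_t = np.array([0.5, 1.0])
k_t = np.array([0.7, 0.7])
v_t = np.array([1.0, 0.0])
o_t = np.ones(2)                                      # assumption: output gate fully open

C_t, n_t, h_t = mlstm_step(C_prev, n_prev, q_t, k_t, v_t, i_t=1.0, f_t=0.9, o_t=o_t)
print(C_t)          # rows: [1.15, 0.7] and [0.0, 0.45]
print(C_t @ q_t)    # unnormalised readout, approximately [1.275, 0.45]
print(h_t)
```

No step here depends on another head's state, which is why this variant can be unrolled into a parallel form while sLSTM cannot.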

Novelty classification. Exponential gating: [New] as a stabilised, gated-LSTM ingredient (variants existed in prior work without the stabilizer). Matrix memory with covariance update: [Adapted] from Linear Attention’s key-value outer-product recurrence (Katharopoulos et al. 2020) and from Fast Weight Programmers (Schlag et al. 2021). 7 Memory mixing across heads via recurrent connections: [New] in this specific form.

5B. RWKV-7 “Goose”

Plain-English intuition. RWKV-7 carries a $d_h \times d_h$ matrix state per head and updates it at every token using a generalised delta rule: write a value into a key slot, optionally remove what was previously stored at a related slot, and let the rest of the state decay by a learned per-channel factor. The “generalisation” compared to RWKV-6 is that the decay is a vector (per-channel) and the in-context learning rate is also a vector, which lets the network decide on a per-feature basis what to keep and what to overwrite. 2

Mechanism. The state update equation (from the paper, transcribed from the ar5iv render):

$$S_t = S_{t-1}\big(\operatorname{diag}(w_t) - \hat{\kappa}_t^\top (a_t \odot \hat{\kappa}_t)\big) + v_t^\top \tilde{k}_t$$

where $w_t$ is the vector-valued decay, $\hat{\kappa}_t$ is a per-head normalised removal key, $a_t$ is the in-context learning rate vector with elements in $(0, 1)$, $\tilde{k}_t$ is a relaxed replacement key, and $v_t$ is the value vector. All vectors are rows, so $v_t^\top \tilde{k}_t$ is an outer product with values along rows and keys along columns. The output is $y_t = q_t S_t^\top$, which contracts the query against the key axis; $q_t$ is the query (called receptance in RWKV lineage). 2
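A single state update following the equation above can be sketched in NumPy, using 1-D arrays for the row vectors. The example picks the limiting case $w_t = 1$, $a_t = 1$ and $\hat{\kappa}_t = \tilde{k}_t$ equal to a unit-norm key (the text gives $a_t$ in the open interval $(0, 1)$, so this is the boundary of that range, chosen to make the effect easy to see): the value previously stored under that key is removed and the new value takes its place. Function and variable names are illustrative.

```python
import numpy as np

def rwkv7_state_update(S_prev, w_t, kappa_hat, a_t, k_tilde, v_t):
    """S_t = S_{t-1} (diag(w) - kappa_hat^T (a * kappa_hat)) + v^T k_tilde, row-vector convention."""
    transition = np.diag(w_t) - np.outer(kappa_hat, a_t * kappa_hat)   # decay minus removal
    return S_prev @ transition + np.outer(v_t, k_tilde)                # then write the new pair

d_h = 4
key = np.array([1.0, 0.0, 0.0, 0.0])       # unit-norm key (illustrative)
v_old = np.array([1.0, 2.0, 3.0, 4.0])
v_new = np.array([-1.0, 0.5, 0.0, 2.0])

S0 = np.outer(v_old, key)                  # state that already stores v_old under `key`
S1 = rwkv7_state_update(S0, w_t=np.ones(d_h), kappa_hat=key,
                        a_t=np.ones(d_h), k_tilde=key, v_t=v_new)

# The old association is fully replaced rather than summed with the new one.
print(np.allclose(S1, np.outer(v_new, key)))
```

With a smaller learning rate only part of the old value is removed; for example `a_t = 0.5` leaves half of `v_old` in the slot alongside `v_new`.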

Connection to delta rule and DeltaNet. From the paper’s own comparison table:

  • RWKV-6: $S_t = S_{t-1} \operatorname{diag}(w_t) + v_t^\top k_t$ (diagonal decay, no removal).
  • DeltaNet: $S_t = S_{t-1}(I - a k_t^\top k_t) + a v_t^\top k_t$ (scalar learning rate, no decay).
  • RWKV-7: combines both, with a non-diagonal transition (the $-\hat\kappa^\top \hat\kappa$ term) and vector-valued gating.
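The relationship between the three updates listed above can be checked numerically. The sketch below uses the simplified forms exactly as written in this section and verifies two reductions that follow algebraically from them: setting $a_t = 0$ recovers the RWKV-6 update, and setting $w_t = 1$, a uniform learning rate $a$, $\hat{\kappa}_t = k_t$ and $\tilde{k}_t = a k_t$ recovers the DeltaNet update. These parameter choices are this review's own illustration, not a construction quoted from the RWKV-7 paper.

```python
import numpy as np

def rwkv7_update(S, w, kappa_hat, a, k_tilde, v):
    """Generalised delta rule with vector decay w and vector learning rate a."""
    return S @ (np.diag(w) - np.outer(kappa_hat, a * kappa_hat)) + np.outer(v, k_tilde)

def rwkv6_update(S, w, k, v):
    """Diagonal decay, no removal term."""
    return S @ np.diag(w) + np.outer(v, k)

def deltanet_update(S, a, k, v):
    """Scalar learning rate a, no decay."""
    d = k.shape[0]
    return S @ (np.eye(d) - a * np.outer(k, k)) + a * np.outer(v, k)

rng = np.random.default_rng(0)
d_h = 4
S = rng.standard_normal((d_h, d_h))
k = rng.standard_normal(d_h)
k = k / np.linalg.norm(k)                   # unit-norm key
v = rng.standard_normal(d_h)
w = rng.uniform(0.5, 1.0, d_h)              # per-channel decay
a = 0.3                                     # scalar learning rate for the DeltaNet case

# a_t = 0 switches the removal term off.
print(np.allclose(rwkv7_update(S, w, k, np.zeros(d_h), k, v),
                  rwkv6_update(S, w, k, v)))
# No decay, uniform learning rate, replacement key scaled by a.
print(np.allclose(rwkv7_update(S, np.ones(d_h), k, np.full(d_h, a), a * k, v),
                  deltanet_update(S, a, k, v)))
```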

Design rationale. The non-diagonal piece of the transition is what enables non-permutation finite-state transitions; the vector-valued decay decouples per-feature memory horizons; the in-context learning rate lets the network learn when to write fast and when to write slow.

Novelty. [Adapted] from DeltaNet’s delta-rule recurrence (Yang et al. 2024). Vector-valued gating and the decoupled removal-versus-replacement key design are [New].

5C. RetNet

Plain-English intuition. RetNet keeps the query / key / value vocabulary of attention but replaces the softmax similarity with a fixed exponential decay $\gamma^{i-j}$ between positions $i$ and $j$. With softmax removed, the resulting matrix operation can be rewritten as a recurrence with a fixed-size matrix state (the linear-attention trick from 2020), with the decay built in so the network does not have to learn it, and with multiple decay scales $\gamma_h$ across heads. 3

Mechanism. Three forms (Section 6 MATH ENTRY 4 carries the equations):

  1. Parallel form for training: $\text{Retention}(X) = (QK^\top \odot D) V$ where $D_{ij} = \gamma^{i-j}$ for $i \ge j$ and $D_{ij} = 0$ otherwise.
  2. Recurrent form for inference: $S_n = \gamma S_{n-1} + k_n^\top v_n$, output $y_n = q_n S_n$.
  3. Chunkwise form for long-sequence training: compute the parallel form within each chunk, pass the per-chunk recurrent summary to the next chunk.
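The three forms listed above can be written side by side and compared numerically. This is a minimal single-head sketch with one fixed $\gamma$; it leaves out the gating, GroupNorm, and any scaling or positional rotation the full RetNet layer applies, and the function names, chunk size, and random inputs are illustrative.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """(Q K^T * D) V with D_ij = gamma^(i-j) for i >= j, else 0."""
    idx = np.arange(Q.shape[0])
    diff = idx[:, None] - idx[None, :]
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((Q @ K.T) * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """S_n = gamma * S_{n-1} + k_n^T v_n, y_n = q_n S_n."""
    N, d_k = K.shape
    S = np.zeros((d_k, V.shape[1]))
    out = np.zeros_like(V)
    for n in range(N):
        S = gamma * S + np.outer(K[n], V[n])
        out[n] = Q[n] @ S
    return out

def retention_chunkwise(Q, K, V, gamma, chunk):
    """Parallel inside each chunk, recurrent summary S carried between chunks."""
    N, d_k = K.shape
    S = np.zeros((d_k, V.shape[1]))
    out = np.zeros_like(V)
    for start in range(0, N, chunk):
        q, k, v = Q[start:start + chunk], K[start:start + chunk], V[start:start + chunk]
        B = q.shape[0]                                       # last chunk may be shorter
        pos = np.arange(B)
        inner = retention_parallel(q, k, v, gamma)           # within-chunk interactions
        cross = (gamma ** (pos + 1))[:, None] * (q @ S)      # contribution of earlier chunks
        out[start:start + B] = inner + cross
        decay_to_end = (gamma ** (B - 1 - pos))[:, None]     # decay each key to the chunk end
        S = (gamma ** B) * S + (k * decay_to_end).T @ v
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
ref = retention_parallel(Q, K, V, gamma=0.97)
print(np.allclose(ref, retention_recurrent(Q, K, V, gamma=0.97)))
print(np.allclose(ref, retention_chunkwise(Q, K, V, gamma=0.97, chunk=4)))
```

The chunkwise form only ever builds a `chunk` by `chunk` decay matrix, so its peak memory is set by the chunk size rather than by the full sequence length.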

Multi-scale retention. Each retention head gets its own decay $\gamma_h$, ranging across an exponential schedule; the paper reports per-head decays roughly from 0.96 to 0.99 in the published configurations. 3 This is combined with swish gating and GroupNorm for numerical stability.

Design rationale. A fixed decay (not learned per-token) is the price paid for the three-form equivalence. RWKV-7 by contrast learns per-token, per-channel decays, at the cost of a more involved kernel.

Novelty. [Adapted] from Linear Attention; the explicit three-form template, multi-scale decay schedule, and the GroupNorm + swish recipe are [New] in this specific combination.

Figure of RWKV-7 state update showing the dynamic state evolution mechanism with vector-valued gating and the generalized delta rule, from the paper

6. Mathematical contributions

The heart of this review. Six MATH ENTRY blocks, in order: original LSTM cell, xLSTM-sLSTM with exponential gating, xLSTM-mLSTM matrix memory, RetNet retention, RWKV-7 generalised delta rule, RWKV-7 state-tracking theorem.

MATH ENTRY 1: Original LSTM cell (baseline).

  • Source: Hochreiter and Schmidhuber 1997, restated in the xLSTM paper as the starting point.
  • What it is: a recurrent neural network cell that carries a scalar cell state $c_t$ and uses three sigmoid gates to decide what to forget, what to write, and what to read.
  • Formal definition (single cell, $d_h = 1$ for clarity):
$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
z_t &= \tanh(W_z x_t + U_z h_{t-1} + b_z) \\
c_t &= f_t \odot c_{t-1} + i_t \odot z_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

  • Each term explained, with dimensional/type analysis:
      • $x_t \in \mathbb{R}^d$ is the input vector at time $t$ (e.g., $d = 8$).
      • $h_{t-1} \in \mathbb{R}^{d_h}$ is the previous hidden state (e.g., $d_h = 4$).
      • $W_i, W_f, W_o, W_z$ are each $d_h \times d$ matrices that read the input.
      • $U_i, U_f, U_o, U_z$ are each $d_h \times d_h$ matrices that read the previous hidden state.
      • $i_t, f_t, o_t \in (0, 1)^{d_h}$ (sigmoid outputs, element-wise).
      • $z_t \in (-1, 1)^{d_h}$ is the candidate cell content.
      • $c_t \in \mathbb{R}^{d_h}$ is the cell state.
      • $\odot$ is element-wise multiplication.
  • Worked numerical example. Suppose $d_h = 1$, $c_{t-1} = 2.0$, $f_t = 0.9$, $i_t = 0.4$, $z_t = 0.5$, $o_t = 0.6$. Then $c_t = 0.9 \cdot 2.0 + 0.4 \cdot 0.5 = 1.8 + 0.2 = 2.0$; $\tanh(2.0) \approx 0.964$; $h_t = 0.6 \cdot 0.964 \approx 0.578$. The forget gate at 0.9 means 90% of the old state survives; the input gate at 0.4 means 40% of the candidate is written. Saturation problem visible: at $i_t = 0.99$ only 99% of the candidate is written; the gate cannot exceed 1, so no single token can "flood" the memory.
  • Role: baseline that xLSTM modifies.
  • Edge cases: $c_t$ itself is not bounded; only $h_t$ is, through the $\tanh$ at the output. Unbounded growth in $c_t$ is possible if $f_t$ stays near 1 and $i_t \odot z_t$ keeps the same sign.
  • Novelty: not novel; restated from 1997.
  • Why it matters: defines the saturation problem ($i_t$ never exceeds 1) that exponential gating in MATH ENTRY 2 fixes.

MATH ENTRY 2: xLSTM sLSTM with exponential gating and stabilizer.

  • Source: xLSTM paper, Section "sLSTM." 1 [Reconstructed] from the paper PDF plus the multi-source summary literature; the ar5iv typeset render was not reachable.
  • What it is: an LSTM cell where the input and (optionally) forget gates use unbounded exponentials, with a log-domain stabilizer to prevent numerical overflow.
  • Formal definition:

$$\begin{aligned}
i_t &= \exp(\tilde{i}_t) \\
f_t &= \exp(\tilde{f}_t) \quad \text{or} \quad \sigma(\tilde{f}_t) \\
m_t &= \max(\log f_t + m_{t-1},\, \log i_t) \\
i_t' &= \exp(\log i_t - m_t) \\
f_t' &= \exp(\log f_t + m_{t-1} - m_t) \\
c_t &= f_t' \cdot c_{t-1} + i_t' \cdot z_t \\
n_t &= f_t' \cdot n_{t-1} + i_t' \\
h_t &= o_t \odot (c_t / \max(|n_t|,\, 1))
\end{aligned}$$

  • Each term explained:
      • $\tilde{i}_t, \tilde{f}_t \in \mathbb{R}$ are the pre-activation gate logits (no sigmoid).
      • $m_t \in \mathbb{R}$ is the stabilizer state, running max of the log-domain gate values.
      • $i_t', f_t'$ are the *stabilised* gate values; subtracting $m_t$ in the log domain (equivalently, dividing by $\exp(m_t)$) keeps both at most 1.
      • $n_t \in \mathbb{R}$ is the normaliser state, carrying the cumulative weight to keep $h_t$ bounded.
      • $c_t \in \mathbb{R}$ is the cell state (scalar in sLSTM).
  • Worked numerical example. Suppose $\tilde{i}_t = 2.0$, $\tilde{f}_t = 1.5$, $m_{t-1} = 1.0$, $c_{t-1} = 1.0$, $n_{t-1} = 0.6$, $z_t = 0.5$, $o_t = 0.9$. Then $\log f_t + m_{t-1} = 1.5 + 1.0 = 2.5$; $\log i_t = 2.0$; $m_t = \max(2.5, 2.0) = 2.5$. So $f_t' = \exp(1.5 + 1.0 - 2.5) = \exp(0) = 1.0$; $i_t' = \exp(2.0 - 2.5) = \exp(-0.5) \approx 0.607$. Then $c_t = 1.0 \cdot 1.0 + 0.607 \cdot 0.5 = 1.303$; $n_t = 1.0 \cdot 0.6 + 0.607 = 1.207$; $h_t = 0.9 \cdot (1.303 / \max(1.207, 1)) = 0.9 \cdot 1.080 \approx 0.972$. Notice: a large pre-activation $\tilde{i}_t$ does not blow up because the stabilizer rescales it, and the unstabilised gate $i_t = \exp(\tilde{i}_t)$ is not capped at 1.
  • Role: replaces the sigmoid input gate of MATH ENTRY 1. Removes the saturation bottleneck.
  • Memory mixing across heads is added on top: the sLSTM block has $H$ heads and an additional recurrent connection from the previous hidden state $h_{t-1}$ back to the gate pre-activations, which is what allows information to flow between heads inside a single cell. [Reconstructed]
  • Edge cases: if both $\log f_t + m_{t-1}$ and $\log i_t$ are extremely negative, the stabilizer trick keeps the ratios meaningful but the absolute values of $c_t$ shrink, which is desirable behaviour.
  • Novelty: [New] in this specific exponential-gate + stabilizer + normaliser combination at LLM scale.
  • Why it matters: lets the cell incorporate a single highly-relevant token into memory without being clipped by the sigmoid ceiling.

MATH ENTRY 3: xLSTM mLSTM with matrix memory and covariance update.

  • Source: xLSTM paper, Section "mLSTM." 1 [Reconstructed] per the same gap as MATH ENTRY 2.
  • What it is: an LSTM-style cell where the scalar cell state is replaced by a $d_h \times d_h$ matrix $C_t$, updated by an outer product of value and key vectors, like a one-shot covariance update.
  • Formal definition (per head):

$$\begin{aligned}
q_t &= W_q x_t + b_q \\
k_t &= (W_k x_t + b_k) / \sqrt{d_h} \\
v_t &= W_v x_t + b_v \\
i_t &= \exp(\tilde{i}_t),\; f_t = \exp(\tilde{f}_t) \\
C_t &= f_t \cdot C_{t-1} + i_t \cdot v_t k_t^\top \\
n_t &= f_t \cdot n_{t-1} + i_t \cdot k_t \\
h_t &= o_t \odot (C_t q_t) / \max(|n_t^\top q_t|,\, 1)
\end{aligned}$$

    with the same stabilizer mechanism from MATH ENTRY 2 applied to $i_t, f_t$.
  • Each term explained:
      • $q_t, k_t, v_t \in \mathbb{R}^{d_h}$ are per-head query, key, value vectors.
      • $C_t \in \mathbb{R}^{d_h \times d_h}$ is the matrix memory.
      • $v_t k_t^\top$ is the outer product: a $d_h \times d_h$ rank-one matrix.
      • $n_t \in \mathbb{R}^{d_h}$ is the per-head normaliser vector.
      • $C_t q_t$ projects the matrix memory along the current query direction to recover a $d_h$-dim output.
  • Worked numerical example. Take $d_h = 2$. Let $C_{t-1} = \begin{pmatrix} 0.5 & 0.0 \\ 0.0 & 0.5 \end{pmatrix}$, $v_t = (1.0, 0.0)^\top$, $k_t = (0.7, 0.7)^\top$, $i_t = 1.0$, $f_t = 0.9$. Outer product $v_t k_t^\top = \begin{pmatrix} 0.7 & 0.7 \\ 0.0 & 0.0 \end{pmatrix}$. Then $C_t = 0.9 \cdot C_{t-1} + 1.0 \cdot v_t k_t^\top = \begin{pmatrix} 0.45 & 0.0 \\ 0.0 & 0.45 \end{pmatrix} + \begin{pmatrix} 0.7 & 0.7 \\ 0.0 & 0.0 \end{pmatrix} = \begin{pmatrix} 1.15 & 0.7 \\ 0.0 & 0.45 \end{pmatrix}$. Query $q_t = (0.5, 1.0)^\top$; $C_t q_t = (1.15 \cdot 0.5 + 0.7 \cdot 1.0,\; 0.0 \cdot 0.5 + 0.45 \cdot 1.0) = (1.275, 0.45)^\top$. After normaliser scaling and output gate, this becomes $h_t$. The slot of $C_t$ holding the value-key outer product is what "stores" the $(k_t, v_t)$ pair; querying with $q_t$ aligned to $k_t$ retrieves $v_t$ (a soft associative memory).
  • Role: the parallelisable variant of xLSTM. The covariance update is the same primitive used in Linear Attention and Fast Weight Programmers; combining it with exponential gating and stabilisation is xLSTM's contribution.
  • Edge cases: when $n_t^\top q_t$ is near zero the normaliser clamps via the $\max(\cdot, 1)$, the same numerical safety mechanism as the sLSTM normaliser.
  • Novelty: [Adapted] from Linear Attention (Katharopoulos 2020) and Fast Weight Programmers (Schlag 2021). The exponential-gate + stabilizer + matrix-memory combination at LLM scale is [New].
  • Why it matters: gives the xLSTM stack a fully parallelisable variant. Without mLSTM, the architecture could not have been trained on SlimPajama 300B tokens at 1.3B parameters in the published wall-clock window.

MATH ENTRY 4: RetNet retention, three-form template.

  • Source: RetNet paper, Section 2. 3
  • What it is: a sequence-mixer that computes the same function in three equivalent forms: parallel for training, recurrent for inference, chunkwise for long context.
  • Formal definition.
With $Q, K, V \in \mathbb{R}^{N \times d_h}$ stacked across positions: $$\begin{aligned} \text{Parallel:}\quad &Y = (Q K^\top \odot D) V \\ &D_{ij} = \gamma^{i-j} \text{ for } i \ge j,\; D_{ij} = 0 \text{ otherwise} \\[4pt] \text{Recurrent:}\quad &S_n = \gamma S_{n-1} + k_n^\top v_n \in \mathbb{R}^{d_h \times d_h} \\ &y_n = q_n S_n \\[4pt] \text{Chunkwise:}\quad &\text{within chunk } b\text{: parallel form on } Q_b, K_b, V_b \\ &\text{between chunks: } R_b = \gamma^B R_{b-1} + K_b^\top (V_b \odot \xi_b) \\ &\text{cross-chunk contribution to } Y_b = Q_b R_{b-1} \cdot \zeta_b \end{aligned}$$ where $B$ is the chunk length and $\zeta_b, \xi_b$ are position-dependent decay correction factors. - Each term explained: - $Q, K, V$ are query, key, value matrices, same as attention. - $D$ is the decay-with-causal-mask matrix. - $\gamma \in (0, 1)$ is the per-head decay; the multi-scale schedule sets a different $\gamma_h$ per head. - $S_n \in \mathbb{R}^{d_h \times d_h}$ is the recurrent matrix state. - The parallel form is the same outer-product accumulation as the recurrent form, written as a matmul. - Worked numerical example. Take $N = 3$, $d_h = 2$, $\gamma = 0.9$. - Parallel: $Q K^\top$ is a $3 \times 3$ matrix. The mask $D = \begin{pmatrix} 1 & 0 & 0 \\ 0.9 & 1 & 0 \\ 0.81 & 0.9 & 1 \end{pmatrix}$. The output's third row is $(0.81 \cdot (q_3 k_1^\top), 0.9 \cdot (q_3 k_2^\top), q_3 k_3^\top) V$. - Recurrent: $S_0 = 0$; $S_1 = k_1^\top v_1$; $S_2 = 0.9 \cdot S_1 + k_2^\top v_2$; $S_3 = 0.9 \cdot S_2 + k_3^\top v_3 = 0.81 \cdot k_1^\top v_1 + 0.9 \cdot k_2^\top v_2 + k_3^\top v_3$. Output $y_3 = q_3 S_3 = 0.81 \cdot q_3 k_1^\top v_1 + 0.9 \cdot q_3 k_2^\top v_2 + q_3 k_3^\top v_3$. - Same number, two ways to compute it. The chunkwise form is the obvious interpolation: do recurrent across chunks of length $B$, parallel inside each chunk. - Role: the canonical three-form template. 
Both xLSTM-mLSTM and RWKV-7 inherit this template (parallel-for-training, recurrent-for-inference, chunkwise-for-long-context); RetNet codified it.
- Edge cases: at $\gamma = 1$ retention reduces to a non-decaying linear attention; at $\gamma$ near zero each token's memory dies within a few steps. Multi-scale retention spreads heads across $\gamma$ values to let different heads capture different temporal horizons.
- Novelty: `[Adapted]` from Linear Attention. The explicit three-form template, multi-scale decay schedule, and the swish + GroupNorm combination are `[New]` in this configuration.
- Why it matters: established the architectural template the other two papers refined.

**MATH ENTRY 5: RWKV-7 generalised delta rule, full state update.**

- Source: RWKV-7 paper, Equation in Section "RWKV-7's Time-Mix."<FootnoteRef n={2} />
- What it is: a matrix-state recurrence that combines RWKV-6's diagonal decay with DeltaNet's value-replacement, generalised to vector-valued parameters and a decoupled removal-versus-replacement key design.
- Formal definition (per head): $$S_t = S_{t-1}\big(\operatorname{diag}(w_t) - \hat{\kappa}_t^\top (a_t \odot \hat{\kappa}_t)\big) + v_t^\top \tilde{k}_t$$ with $$\begin{aligned} w_t &= \exp(-\exp(\text{lora}_w(x_t))) \in (0, 1)^{d_h} \\ a_t &= \sigma(\text{lora}_a(x_t)) \in (0, 1)^{d_h} \\ \kappa_t &= k_t \odot \xi,\quad \hat{\kappa}_t = \kappa_t / \lVert \kappa_t \rVert \\ \tilde{k}_t &= k_t \odot \text{lerp}(1, a_t, \alpha) \end{aligned}$$ where $\text{lora}_w, \text{lora}_a$ are low-rank MLPs over $x_t$, $\xi$ is a learned per-channel multiplier, $\alpha$ is a learned mixing scalar, and $\text{lerp}$ is linear interpolation.
- Each term explained:
  - $S_t \in \mathbb{R}^{d_h \times d_h}$ is the matrix state, like RetNet's $S_n$ and xLSTM's $C_t$.
  - $w_t \in (0, 1)^{d_h}$ is the *vector-valued decay*. RetNet uses a fixed scalar $\gamma$ per head, RWKV-6 uses a diagonal (per-channel) decay, and RWKV-7 uses a per-channel, data-dependent decay vector.
- $a_t \in (0, 1)^{d_h}$ is the in-context learning rate vector. When $a_t \to 0$ no replacement happens; when $a_t \to 1$ the corresponding key dimension is fully overwritten. - $\hat{\kappa}_t$ is the normalised removal key, separate from $\tilde{k}_t$ the replacement key. The decoupling is what RWKV-7 calls the "relaxed value replacement rule." - $v_t \in \mathbb{R}^{d_h}$ is the value vector. - The transition matrix $T_t = \operatorname{diag}(w_t) - \hat\kappa_t^\top (a_t \odot \hat\kappa_t)$ is non-diagonal (because of the rank-one $\hat\kappa^\top \hat\kappa$ term) and input-dependent. RWKV-6 had only $\operatorname{diag}(w_t)$. - Worked numerical example. Take $d_h = 2$. Suppose $w_t = (0.9, 0.8)$, $a_t = (0.5, 0.3)$, $\hat\kappa_t = (0.6, 0.8)^\top$ (already unit-normalised since $0.6^2 + 0.8^2 = 1$), $\tilde k_t = (1.0, 0.5)^\top$, $v_t = (2.0, -1.0)^\top$, $S_{t-1} = \begin{pmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{pmatrix}$. - Compute $a_t \odot \hat\kappa_t = (0.5 \cdot 0.6, 0.3 \cdot 0.8) = (0.3, 0.24)$. - Outer product $\hat\kappa_t^\top (a_t \odot \hat\kappa_t)$: this is $\hat\kappa_t$ as a column ($2 \times 1$) times $a_t \odot \hat\kappa_t$ as a row ($1 \times 2$) = $\begin{pmatrix} 0.6 \cdot 0.3 & 0.6 \cdot 0.24 \\ 0.8 \cdot 0.3 & 0.8 \cdot 0.24 \end{pmatrix} = \begin{pmatrix} 0.18 & 0.144 \\ 0.24 & 0.192 \end{pmatrix}$. - Transition matrix $T_t = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.8 \end{pmatrix} - \begin{pmatrix} 0.18 & 0.144 \\ 0.24 & 0.192 \end{pmatrix} = \begin{pmatrix} 0.72 & -0.144 \\ -0.24 & 0.608 \end{pmatrix}$. - $S_{t-1} T_t = T_t$ (since $S_{t-1}$ is identity). - Value-key outer product $v_t^\top \tilde k_t$ (column $v_t$ times row $\tilde k_t$): $\begin{pmatrix} 2.0 \cdot 1.0 & 2.0 \cdot 0.5 \\ -1.0 \cdot 1.0 & -1.0 \cdot 0.5 \end{pmatrix} = \begin{pmatrix} 2.0 & 1.0 \\ -1.0 & -0.5 \end{pmatrix}$. - $S_t = T_t + v_t^\top \tilde k_t = \begin{pmatrix} 2.72 & 0.856 \\ -1.24 & 0.108 \end{pmatrix}$. 
  - The off-diagonal entries in $T_t$ (the $-0.144$ and $-0.24$) are what diagonal-decay RNNs cannot produce. They are also what RWKV-7's expressivity proof in MATH ENTRY 6 leans on.
- Role: the central mechanism. Every per-token forward pass updates $S_t$ this way; output is $y_t = q_t S_t$ as in RetNet.
- Edge cases: the paper notes that $w_t$ uses a double-exponential parameterisation $\exp(-\exp(\cdot))$ to keep it in $(0, 1)$ with the right gradient profile.<FootnoteRef n={2} />
- Novelty: `[Adapted]` from DeltaNet (Yang et al. 2024). Vector-valued decay, in-context learning rate, decoupled removal-replacement keys are `[New]`.
- Why it matters: this is the equation that gives RWKV-7 its claimed state-tracking expressivity. Without the non-diagonal term, the network is bounded by TC$^0$; with it, the paper proves the network can recognise all regular languages.

**MATH ENTRY 6: RWKV-7 state-tracking expressivity (informal statement).**

- Source: RWKV-7 paper, "State Tracking and Expressivity."<FootnoteRef n={2} />
- What it is: a theoretical statement that RWKV-7's recurrence can recognise all regular languages in a constant number of layers, where standard transformers (under standard finite-precision assumptions) cannot.
- Formal statement (paraphrased from the paper). Under the assumption TC$^0 \neq$ NC$^1$, there exist regular languages (for example, the word problem in the symmetric group $S_5$) that no constant-depth transformer can recognise. The RWKV-7 paper proves that RWKV-7 with non-diagonal input-dependent transitions recognises all regular languages in a constant number of layers.
- Key assumptions:
  - Finite-precision activations bounded by $O(\log N)$ bits, the standard assumption under which transformer expressivity is bounded by TC$^0$.
  - The state update has access to non-diagonal input-dependent transition matrices, supplied by the $-\hat\kappa^\top \hat\kappa$ term in MATH ENTRY 5.
- Proof sketch, step by step:
  1. A regular language is recognised by a deterministic finite automaton with finitely many states.
  2. A finite automaton's transition function for a fixed input symbol is a function from states to states, representable as a permutation matrix (or more generally, a 0/1 transition matrix) acting on a one-hot state vector.
  3. A recurrent network whose transition matrix is restricted to diagonal $\operatorname{diag}(w)$ cannot move mass between states: applying $\operatorname{diag}(w)$ to a one-hot vector only rescales it and cannot map it to a different one-hot vector.
  4. RWKV-7's transition matrix $T_t = \operatorname{diag}(w_t) - \hat\kappa_t^\top (a_t \odot \hat\kappa_t)$ contains a rank-one update term that depends on the current input $x_t$ (through $\hat\kappa_t, a_t$); the paper's construction chooses $\hat\kappa_t, a_t, w_t$ as functions of $x_t$ so that these transitions realise *any* finite-state transition, including non-permutation transitions (states that merge).
  5. Composing the transitions across $N$ steps yields (ignoring the additive write term) $S_N = S_0 T_1 T_2 \cdots T_N$, in the same right-multiplication order as $S_t = S_{t-1} T_t$ in MATH ENTRY 5, which simulates the automaton's run on the input sequence.
  6. Reading the relevant entry of $S_N$ via $q_N$ extracts the automaton's final state, hence the language's accept/reject decision.
  7. The construction uses a constant number of layers because every recurrent step does one transition. The depth is constant in the formal-language sense: it does not grow with $N$.
- Empirical anchor. The paper reports group multiplication tasks on $S_5$ where RWKV-7 achieves perfect or near-perfect accuracy while transformers, Mamba, and S4 plateau at low accuracy. `[From the paper:]` "exhibits stronger state-tracking capabilities than Transformers, Mamba, and S4."<FootnoteRef n={2} />
- Edge cases: the theorem is about formal-language recognition in the limit; on natural-language modelling the practical impact is more diffuse and the paper does not claim a perplexity advantage that traces directly to the expressivity gap.
- Novelty: `[New]`.
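Steps 2, 5, and 6 of the proof sketch can be made concrete with a toy automaton. The sketch below is an illustration only: the two-state parity automaton, the symbol names, and the helpers `matmul`, `run`, and `accepts` are all hypothetical choices for this article. It does not show how RWKV-7's diagonal-minus-rank-one form realises these matrices, which is the part the paper's proof supplies.

```python
def matmul(a, b):
    """Plain-Python matrix product for small lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical two-state automaton: state 0 = even number of 'b', state 1 = odd.
# Each input symbol selects a 0/1 transition matrix (step 2 of the sketch).
TRANSITIONS = {
    "a": [[1, 0], [0, 1]],  # identity: stay in the current state
    "b": [[0, 1], [1, 0]],  # swap: moves a one-hot vector, so it is not diagonal
}


def run(word):
    """Right-multiply a one-hot row vector by one transition per symbol (step 5)."""
    state = [[1, 0]]  # start in state 0
    for symbol in word:
        state = matmul(state, TRANSITIONS[symbol])
    return state[0]


def accepts(word):
    """Read out the final state (step 6): accept when the count of 'b' is even."""
    return run(word)[0] == 1


print(accepts("abba"), accepts("ab"), accepts(""))
```

The swap matrix for `b` is exactly the kind of transition a purely diagonal recurrence cannot represent, since a diagonal matrix can only rescale a one-hot vector in place.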
- Why it matters: this is the strongest theoretical claim across the three papers reviewed. It is the first time a linear-time recurrent architecture has been shown to formally exceed transformer expressivity on a well-studied complexity-theoretic axis.

## 7. Algorithmic contributions

**ALGORITHM ENTRY 1: xLSTM mLSTM block forward pass (the headline algorithm).**

- Source: xLSTM paper, Section "mLSTM block."<FootnoteRef n={1} /> `[Reconstructed]` per the ar5iv gap.
- Purpose: compute one block's forward pass. The pseudocode below is written in the recurrent form for readability; training uses the mathematically equivalent parallel form of mLSTM.
- Inputs: $X \in \mathbb{R}^{N \times d}$ (sequence of input vectors), layer parameters.
- Outputs: $Y \in \mathbb{R}^{N \times d}$ (sequence of output vectors).
- Pseudocode (rendered as an in-article reference sketch, including the block's layer norm and residual connection):

```text
function mLSTM_block_forward(X):
    X_norm = LayerNorm(X)
    X_up = Linear_up_projection(X_norm)          # pre-up-projection
    for each head h in 1..H:
        Q_h = Linear_Q(X_up); K_h = Linear_K(X_up); V_h = Linear_V(X_up)
        I_h_logits = Linear_I(X_up)              # input gate logits
        F_h_logits = Linear_F(X_up)              # forget gate logits
        O_h = sigmoid(Linear_O(X_up))            # output gate
        # Stabilizer pass over sequence
        C_prev = zeros(d_h, d_h)                 # matrix memory starts empty
        n_prev = zeros(d_h)                      # normaliser starts at zero
        m_prev = -inf
        for t in 1..N:
            m_t = max(F_h_logits[t] + m_prev, I_h_logits[t])
            i_stab = exp(I_h_logits[t] - m_t)
            f_stab = exp(F_h_logits[t] + m_prev - m_t)
            C_t = f_stab * C_prev + i_stab * outer(V_h[t], K_h[t])
            n_t = f_stab * n_prev + i_stab * K_h[t]
            h_t = O_h[t] elementwise_mul (C_t Q_h[t]) / max(abs(dot(n_t, Q_h[t])), 1)
            C_prev = C_t; n_prev = n_t; m_prev = m_t
        Y_h = stack(h_1, ..., h_N)
    Y_concat = concat(Y_1, ..., Y_H)
    Y_down = Linear_down_projection(Y_concat)
    return X + Y_down                            # residual connection
```

- Hand-traced example on minimal input. Take $N = 2, d_h = 2, H = 1$, no up-projection. Suppose at $t = 1$: $X_1 = (1, 0)^\top$ produces $Q_1 = K_1 = V_1 = (1, 0)^\top$, $I_1\text{logit} = 2$, $F_1\text{logit} = -1$, $O_1 = (0.9, 0.9)^\top$. Initial $m_0 = -\infty$, $C_0 = 0$, $n_0 = 0$.
  - $m_1 = \max(F_1\text{logit} + m_0, I_1\text{logit}) = \max(-\infty, 2) = 2$.
  - $i_{1,\text{stab}} = \exp(2 - 2) = 1.0$.
  - $f_{1,\text{stab}} = \exp(-1 + (-\infty) - 2) \to 0$.
  - $C_1 = 0 + 1.0 \cdot (1, 0)^\top (1, 0) = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$.
  - $n_1 = 0 + 1.0 \cdot (1, 0)^\top = (1, 0)^\top$.
  - $C_1 Q_1 = (1, 0)^\top$; $n_1 \cdot Q_1 = 1$; $h_1 = (0.9, 0.9) \odot (1, 0) / 1 = (0.9, 0)^\top$. State at end of step 1: the cell has stored the (key=$(1,0)$, value=$(1,0)$) association, and the output is the recovered value scaled by the output gate.
  - At $t = 2$: $X_2 = (0, 1)^\top$ produces $Q_2 = K_2 = V_2 = (0, 1)^\top$, $I_2\text{logit} = 1$, $F_2\text{logit} = 0.5$, $O_2 = (0.9, 0.9)^\top$.
  - $m_2 = \max(F_2\text{logit} + m_1, I_2\text{logit}) = \max(0.5 + 2, 1) = 2.5$.
  - $i_{2,\text{stab}} = \exp(1 - 2.5) \approx 0.223$.
  - $f_{2,\text{stab}} = \exp(0.5 + 2 - 2.5) = 1.0$.
  - $C_2 = 1.0 \cdot C_1 + 0.223 \cdot (0, 1)^\top (0, 1) = \begin{pmatrix} 1 & 0 \\ 0 & 0.223 \end{pmatrix}$.
  - $n_2 = 1.0 \cdot (1, 0)^\top + 0.223 \cdot (0, 1)^\top = (1, 0.223)^\top$.
  - $C_2 Q_2 = (0, 0.223)^\top$; $n_2 \cdot Q_2 = 0.223$; $h_2 = (0.9, 0.9) \odot (0, 0.223) / \max(0.223, 1) = (0, 0.201)^\top$.
  - State at end of step 2: the cell stores both (key1, value1) and (key2, value2) associations, and the query recovers the second.
- Complexity: time $O(N \cdot d_h^2)$ per head (outer product per step); space $O(d_h^2)$ for the matrix state. Bottleneck step: outer product accumulation, which is fused into a custom CUDA kernel in the reference implementation.<FootnoteRef n={8} />
- Hyperparameters: number of heads $H$, head dim $d_h$, mLSTM/sLSTM ratio (e.g., 7:1 in xLSTM[7:1]). Sensitivity not exhaustively ablated in the paper.
- Failure modes: stabilizer divergence if $m_t$ is initialised incorrectly; reference implementation initialises $m_0 = -\infty$ to handle the first step cleanly.
- Novelty: `[New]` combination.
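The two-step hand trace can be reproduced with a short, dependency-free Python sketch. It assumes a single head with $d_h = 2$ and takes $Q$, $K$, $V$, the gate logits, and the output gate as given inputs rather than computing them from learned projections. The name `mlstm_step` is illustrative and is not the reference implementation's API.

```python
import math


def mlstm_step(q, k, v, i_logit, f_logit, o, C, n, m_prev):
    """One recurrent mLSTM step for a single head (illustrative sketch)."""
    d = len(q)
    m = max(f_logit + m_prev, i_logit)
    i_stab = math.exp(i_logit - m)
    f_stab = math.exp(f_logit + m_prev - m)  # exp(-inf) == 0.0 on the first step
    # Matrix memory: decay the old state, then add the outer product v k^T.
    C = [[f_stab * C[r][c] + i_stab * v[r] * k[c] for c in range(d)] for r in range(d)]
    # Normaliser vector accumulates the gated keys.
    n = [f_stab * n[c] + i_stab * k[c] for c in range(d)]
    Cq = [sum(C[r][c] * q[c] for c in range(d)) for r in range(d)]
    denom = max(abs(sum(n[c] * q[c] for c in range(d))), 1.0)
    h = [o[r] * Cq[r] / denom for r in range(d)]
    return h, C, n, m


C0 = [[0.0, 0.0], [0.0, 0.0]]
n0 = [0.0, 0.0]
m0 = float("-inf")
gate = [0.9, 0.9]

# Step 1: q = k = v = (1, 0), input logit 2, forget logit -1.
h1, C1, n1, m1 = mlstm_step([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], 2.0, -1.0, gate, C0, n0, m0)
# Step 2: q = k = v = (0, 1), input logit 1, forget logit 0.5.
h2, C2, n2, m2 = mlstm_step([0.0, 1.0], [0.0, 1.0], [0.0, 1.0], 1.0, 0.5, gate, C1, n1, m1)

print(h1, m1)
print([round(x, 3) for x in h2], m2)
```

Initialising the stabilizer at $-\infty$ makes the first forget factor exactly zero, so the empty initial state contributes nothing and no special case is needed for $t = 1$.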
- Transferability: `[Analysis]` directly portable to any LLM stack that exposes its residual-stream shape; reference implementation is a drop-in Linear-Attention-style block.

**ALGORITHM ENTRY 2: RWKV-7 time-mix block forward pass.**

- Source: RWKV-7 paper, Section "RWKV-7 Architecture."<FootnoteRef n={2} />
- Purpose: compute one RWKV-7 time-mix block.
- Inputs: $X \in \mathbb{R}^{N \times d}$, layer parameters including token-shift parameters and low-rank MLPs for $w, a, \nu$.
- Outputs: $Y \in \mathbb{R}^{N \times d}$.
- Pseudocode (recurrent form for inference; parallel form uses chunkwise scan):

```text
function rwkv7_timemix(X):
    X_shift = token_shift(X)                     # per-channel mix of x_t and x_{t-1}
    for each head h:
        R = Linear_R(X_shift)                    # receptance / query
        K = Linear_K(X_shift)
        V = Linear_V(X_shift)
        W = exp(-exp(LoRA_w(X_shift)))           # vector decay in (0,1)
        A = sigmoid(LoRA_a(X_shift))             # in-context learning rate
        KAPPA = K * xi                           # learned per-channel multiplier
        KAPPA_HAT = KAPPA / norm(KAPPA)          # normalised removal key per token
        K_TILDE = K * lerp(1, A, alpha)          # relaxed replacement key
        # Recurrent scan
        S = zeros(d_h, d_h)
        for t in 1..N:
            T = diag(W[t]) - outer(KAPPA_HAT[t], A[t] * KAPPA_HAT[t])
            S = S @ T + outer(V[t], K_TILDE[t])
            y_t = R[t] @ S
        Y_h = stack(y_1, ..., y_N)
    Y = concat over heads, then GroupNorm + value-residual mix + output projection
    return Y
```

- Hand-traced example. Take $d_h = 2$, $N = 1$. Use the MATH ENTRY 5 worked numbers: $S$ ends at $\begin{pmatrix} 2.72 & 0.856 \\ -1.24 & 0.108 \end{pmatrix}$. With $R_1 = q_1 = (0.5, 0.5)^\top$, $y_1 = R_1^\top S = (0.5 \cdot 2.72 + 0.5 \cdot (-1.24),\; 0.5 \cdot 0.856 + 0.5 \cdot 0.108) = (0.74, 0.482)$. Output of the time-mix block before final projection.
- Complexity: time $O(N \cdot d_h^2)$ per head; space $O(d_h^2)$ for the matrix state. The transition matrix is computed afresh at every step, which is a constant overhead per token but does not affect the asymptotic profile.
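A single step of the recurrent scan can be sketched in plain Python to confirm the hand-traced numbers. The function below takes the already-computed per-token vectors as inputs; it does not implement token shift, the low-rank MLPs, or the normalisation of $\kappa_t$. The name `rwkv7_step` and its argument order are assumptions made for this illustration.

```python
def rwkv7_step(S, w, a, kappa_hat, k_tilde, v, r):
    """One RWKV-7 state update and readout for a single head (illustrative sketch)."""
    d = len(w)
    # Transition T = diag(w) - kappa_hat^T (a * kappa_hat): diagonal decay minus
    # an input-dependent rank-one removal term.
    T = [
        [(w[i] if i == j else 0.0) - kappa_hat[i] * a[j] * kappa_hat[j] for j in range(d)]
        for i in range(d)
    ]
    # S_{t-1} T, using the right-multiplication convention of the state equation.
    ST = [[sum(S[i][m] * T[m][j] for m in range(d)) for j in range(d)] for i in range(d)]
    # Add the write term v^T k_tilde (column v times row k_tilde).
    S_new = [[ST[i][j] + v[i] * k_tilde[j] for j in range(d)] for i in range(d)]
    # Readout y = r S, with r treated as a row vector.
    y = [sum(r[i] * S_new[i][j] for i in range(d)) for j in range(d)]
    return S_new, y


S_prev = [[1.0, 0.0], [0.0, 1.0]]
S_t, y_t = rwkv7_step(
    S_prev,
    w=[0.9, 0.8],
    a=[0.5, 0.3],
    kappa_hat=[0.6, 0.8],
    k_tilde=[1.0, 0.5],
    v=[2.0, -1.0],
    r=[0.5, 0.5],
)
print([[round(x, 3) for x in row] for row in S_t])
print([round(x, 3) for x in y_t])
```

Building the dense transition matrix costs $O(d_h^2)$ per token, the same order as the state update itself, which is consistent with the complexity bullet above.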
- Hyperparameters: head dim $d_h$, number of heads $H$, low-rank dimension for $\text{lora}_w, \text{lora}_a$.
- Failure modes: numerical-precision sensitivity in the WKV kernel; the paper notes training dynamics differ across implementations and float32 handling is critical.<FootnoteRef n={2} />
- Novelty: `[New]`.
- Transferability: `[Analysis]` available as Apache-2.0 reference implementation at github.com/BlinkDL/RWKV-LM; the kernel is non-trivial but the architecture is otherwise a drop-in block.<FootnoteRef n={9} />

**ALGORITHM ENTRY 3: RetNet retention three-form switch.**

- Source: RetNet paper, Section 2.<FootnoteRef n={3} />
- Purpose: choose between parallel, recurrent, chunkwise at the appropriate stage.
- Inputs: $Q, K, V \in \mathbb{R}^{N \times d_h}$, decay $\gamma$.
- Outputs: $Y \in \mathbb{R}^{N \times d_h}$.
- Pseudocode:

```text
function retention(Q, K, V, gamma, mode):
    if mode == "parallel":                       # training
        D = lower_triangular(gamma^(i-j) for i,j in 0..N-1)
        return (Q @ K.T * D) @ V
    if mode == "recurrent":                      # inference, single token at a time
        S = zeros(d_h, d_h)
        Y = []
        for n in 1..N:
            S = gamma * S + outer(K[n], V[n])
            Y.append(Q[n] @ S)
        return stack(Y)
    if mode == "chunkwise":                      # long-sequence training
        B = chunk_size
        R = zeros(d_h, d_h)
        Y = []
        for b in 1..N/B:
            Q_b, K_b, V_b = chunk slices
            Y_intra = (Q_b @ K_b.T * D_intra) @ V_b      # parallel inside chunk
            Y_cross = (Q_b @ R) * decay_factor_zeta_b    # cross-chunk
            Y.append(Y_intra + Y_cross)
            R = gamma^B * R + K_b.T @ (V_b * xi_b)
        return stack(Y)
```

- Hand-traced example. Already covered in MATH ENTRY 4 (the parallel/recurrent numerical equivalence).
- Complexity: parallel form $O(N^2 d_h)$ ignoring kernel optimisations; recurrent $O(N \cdot d_h^2)$; chunkwise $O(N \cdot B \cdot d_h + N \cdot d_h^2)$, that is $O(N d_h (B + d_h))$, where the chunk size $B$ balances quadratic work inside each chunk against the number of sequential cross-chunk state updates.
- Hyperparameters: $\gamma$ per head (multi-scale schedule), chunk size $B$.
- Novelty: `[New]` template; primitives `[Adapted]`.
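The claimed equivalence of the three forms can be checked numerically. The sketch below uses plain Python lists and a single scalar $\gamma$. The input values are arbitrary, and the explicit correction factors ($\gamma^{i+1}$ for the cross-chunk read at in-chunk position $i$, $\gamma^{L-1-j}$ for the state write at in-chunk position $j$, both zero-indexed) are derived here from the recurrent form rather than copied from the paper's notation. Function names are illustrative.

```python
def retention_parallel(Q, K, V, gamma):
    """Parallel form: masked, decayed QK^T scores applied to V."""
    Y = []
    for i in range(len(Q)):
        row = [0.0] * len(V[0])
        for j in range(i + 1):  # causal mask: only positions j <= i
            score = sum(a * b for a, b in zip(Q[i], K[j])) * gamma ** (i - j)
            row = [acc + score * vj for acc, vj in zip(row, V[j])]
        Y.append(row)
    return Y


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, y_n = q_n S_n."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]
    Y = []
    for q, k, v in zip(Q, K, V):
        S = [[gamma * S[a][b] + k[a] * v[b] for b in range(dv)] for a in range(dk)]
        Y.append([sum(q[a] * S[a][b] for a in range(dk)) for b in range(dv)])
    return Y


def retention_chunkwise(Q, K, V, gamma, B):
    """Chunkwise form: parallel inside each chunk, recurrent state R across chunks."""
    dk, dv = len(K[0]), len(V[0])
    R = [[0.0] * dv for _ in range(dk)]
    Y = []
    for start in range(0, len(Q), B):
        Qb, Kb, Vb = Q[start:start + B], K[start:start + B], V[start:start + B]
        intra = retention_parallel(Qb, Kb, Vb, gamma)
        for i, q in enumerate(Qb):
            zeta = gamma ** (i + 1)  # decay from the previous chunk's end to position i
            cross = [zeta * sum(q[a] * R[a][b] for a in range(dk)) for b in range(dv)]
            Y.append([x + c for x, c in zip(intra[i], cross)])
        L = len(Qb)  # the last chunk may be shorter than B
        R = [
            [
                gamma ** L * R[a][b]
                + sum(gamma ** (L - 1 - j) * Kb[j][a] * Vb[j][b] for j in range(L))
                for b in range(dv)
            ]
            for a in range(dk)
        ]
    return Y


# Arbitrary illustrative inputs: N = 5 positions, d_h = 2.
Q = [[0.5, 1.0], [1.0, -0.5], [0.2, 0.3], [-1.0, 0.4], [0.7, 0.7]]
K = [[1.0, 0.0], [0.3, 0.8], [-0.6, 0.2], [0.5, 0.5], [0.0, 1.0]]
V = [[2.0, -1.0], [0.5, 0.5], [1.0, 1.0], [-0.3, 0.9], [0.4, 0.0]]

Y_par = retention_parallel(Q, K, V, 0.9)
Y_rec = retention_recurrent(Q, K, V, 0.9)
Y_chk = retention_chunkwise(Q, K, V, 0.9, B=2)
max_gap = max(
    max(abs(p - r), abs(p - c))
    for rp, rr, rc in zip(Y_par, Y_rec, Y_chk)
    for p, r, c in zip(rp, rr, rc)
)
print(max_gap)
```

The gap printed at the end should be at floating-point noise level. Using $N = 5$ with $B = 2$ also exercises a final chunk shorter than $B$.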
- Transferability: `[Analysis]` reference implementation in microsoft/torchscale, MIT-licensed.<FootnoteRef n={10} />

## 8. Specialised design contributions

**Subsection 8A. LLM / prompt design.** Not applicable to these papers. None of the three architectures involves LLM-as-judge, prompting frameworks, or in-context-learning protocol design.

**Subsection 8B. Architecture-specific details.**

- **xLSTM block composition.** Two block types (sLSTM, mLSTM) mixed in published ratios. `xLSTM[7:1]` denotes 7 mLSTM blocks per 1 sLSTM block; `xLSTM[1:0]` denotes pure mLSTM.<FootnoteRef n={1} />
- **RWKV-7 time-mix and channel-mix.** Two block types per layer. Time-mix is the sequence-mixer (MATH ENTRY 5); channel-mix is the feed-forward MLP that mixes across channels at each position. Both use *token shift*: a per-channel linear interpolation between the current token's input and the previous token's, replacing the need for a positional encoding.<FootnoteRef n={2} />
- **RetNet normalisation.** GroupNorm + swish gate after the retention output to stabilise the multi-scale decay outputs.<FootnoteRef n={3} />
- **Value residual gating (RWKV-7).** Interpolates between the layer-0 value precursor and the current layer's value; design choice to keep early-layer information available deeper in the stack.<FootnoteRef n={2} />

**Subsection 8C. Training specifics.**

- **xLSTM training.** SlimPajama corpus, 15B and 300B token configurations across model sizes from 125M to 1.3B parameters. Reference 7B-parameter model trained on 2.3T tokens per the official repository's release notes.<FootnoteRef n={8} />
- **RWKV-7 training.** 3.119T tokens across English, code, and multilingual sources; model sizes from 0.19B to 2.9B. Smaller models continued from RWKV-5/6 checkpoints due to compute constraints. Max 96 H800 GPUs.<FootnoteRef n={2} />
- **RetNet training.** Tested at parameter counts up to 6.7B with 100B+ token training in the original paper.<FootnoteRef n={3} />

**Subsection 8D. Inference / deployment specifics.**

- **xLSTM inference.** Recurrent form for both sLSTM and mLSTM; $O(1)$ per token, constant memory.<FootnoteRef n={1} />
- **RWKV-7 inference.** Constant memory, constant time per token. The paper reports forward+backward roughly 3$\times$ faster than RWKV-6's kernel and surpassing FlashAttention v3 at 16k token sequences; training requires 18 variable equivalents in memory versus 10 for FlashAttention v3, a real footprint cost paid for the matrix state.<FootnoteRef n={2} />
- **RetNet inference.** Recurrent form delivers ~70% inference-memory reduction at 8k sequence length versus a parameter-matched transformer baseline, and 8.4$\times$ decoding throughput, per the paper.<FootnoteRef n={3} />

<Image src="https://images.neuraltechdaily.com/articles/xlstm-rwkv-linear-transformer-alts-multi-paper-review-2026/in-article-3-800w.webp" alt="Figure 3 of RetNet paper showing the three computation forms (parallel, recurrent, chunkwise) of the retention mechanism" caption="Figure 3 of RetNet (arXiv:2307.08621), reproduced for editorial coverage." width={800} height={420} />

## 9. Experiments and results

### Datasets

- **xLSTM.** SlimPajama (deduplicated Pile-equivalent); 15B and 300B token training runs. Downstream eval includes 571 text domains from PALOMA and standard zero-shot tasks (ARC-easy, ARC-challenge, HellaSwag, PIQA, WinoGrande, LAMBADA).<FootnoteRef n={1} />
- **RWKV-7.** "World-v3" 3.119T multilingual corpus plus Pile for the Pile-only variants.
Evaluations across English (MMLU, ARC-c, ARC-e, HellaSwag, OpenBookQA, PIQA, WinoGrande, LAMBADA), multilingual (LAMBADA-multilingual, XCOPA, etc.), and a post-2025 internet-data compression metric.<FootnoteRef n={2} />
- **RetNet.** Internal pre-training corpus at 100B+ token scale; The Pile and validation sets for perplexity comparison.<FootnoteRef n={3} />

### Baselines

- xLSTM baselines: Llama (transformer), Mamba (state-space), RWKV-4 (older RWKV).<FootnoteRef n={1} />
- RWKV-7 baselines: Transformers (Qwen2.5, Llama-3.2, SmolLM2), Mamba, S4, RWKV-6.<FootnoteRef n={2} />
- RetNet baseline: a parameter-matched transformer; published comparisons at 6.7B scale.<FootnoteRef n={3} />

### Evaluation metrics

- Validation perplexity (lower is better).
- Zero-shot downstream task accuracy.
- Compression ratio on post-cutoff internet data (RWKV-7).
- Inference memory (GB at fixed sequence length), inference latency, training throughput.

### Reproduced result tables with attribution

**Table. Headline benchmark numbers from the three papers.**

| Architecture | Best reported model size | Training tokens | Headline result | Source |
|---|---|---|---|---|
| xLSTM[7:1] | 1.3B | 300B (SlimPajama) | Outperforms Llama, RWKV-4, Mamba on most measured zero-shot tasks at 1.3B | Beck et al. 2024 |
| xLSTM (reference 7B) | 7B | 2.3T | Reference 7B language model | NX-AI repo |
| RWKV-7 "World-3" | 2.9B | 3.1T multilingual | 61.1% multilingual aggregate (new 3B SoTA, paper claim); 71.5% English aggregate (matches Qwen2.5-3B 71.4%) on roughly one-third the training tokens | Peng et al. 2025 |
| RWKV-7-1.5B | 1.5B | (continued from RWKV-6) | 8.16% post-January-2025 internet-data compression versus Qwen2.5-1.5B 8.06% | Peng et al. 2025 |
| RetNet | 6.7B | 100B+ | ~70% inference memory reduction at 8k, 8.4× decoding throughput vs parameter-matched transformer | Sun et al. 2023 |

*Table reproduced from the three reviewed papers; numbers are paper-reported and have not been independently re-run by the publication.*

**Associative recall.** From the RWKV-7 paper: "achieves 72.93% accuracy on 256 KV pairs with 8192-dim state (~0.547 bits/dimension information density)."<FootnoteRef n={2} />

### Ablations

- **xLSTM.** Mix-ratio ablations between sLSTM and mLSTM blocks; the 7:1 mix appears to dominate at 1.3B. Exponential-gating ablation: removing it degrades performance.<FootnoteRef n={1} />
- **RWKV-7.** Ablations on vector-valued versus scalar decay; on in-context learning rate dimensionality; on value residual mixing. The paper documents these in the architecture section.<FootnoteRef n={2} />
- **RetNet.** Multi-scale retention versus single-decay; without multi-scale, performance drops measurably. Swish gate plus GroupNorm versus alternatives; the published recipe is reported as the chosen configuration after ablation.<FootnoteRef n={3} />

### Hyperparameter sensitivity

The three papers report results under what their authors describe as standard tuning; granular sensitivity sweeps are not the centrepiece of any of the three. `[Analysis]` xLSTM's sLSTM-to-mLSTM ratio is the most consequential hyperparameter the paper exposes; RWKV-7's $\xi$ (range $[-5.3, 9.4]$ per the paper) and the low-rank dimensions of $\text{lora}_w, \text{lora}_a$ are reported but not deeply ablated; RetNet's multi-scale decay schedule is hard-coded once chosen.
### Robustness / stress tests

- **RWKV-7 long-context.** The paper notes a loss increase beyond 10k tokens on PG19 for the World-trained models; the Pile-trained models extrapolate better; fine-tuning on 128k-token sequences mitigates the gap.<FootnoteRef n={2} />
- **xLSTM long-context.** Long-context evaluation in the paper covers extrapolation behaviour up to sequence lengths beyond training context; results are reported as competitive with Mamba.<FootnoteRef n={1} />

`[Analysis]` Detailed cross-paper long-context comparison would benefit from a single common evaluation suite, which the three papers do not share.

### Qualitative results

The three papers do not lean on qualitative case studies; their argument is benchmark-driven.

### Experimental scope limits

- None of the three papers reports a 70B+ parameter model trained from scratch; the upper bound is RetNet 6.7B, xLSTM 7B (reference release), RWKV-7 2.9B. `[Analysis]` The headline expressivity claim of RWKV-7 (state-tracking beyond TC$^0$) is established at the formal-language level and on small models; whether the advantage transfers to a 70B-parameter natural-language regime is an empirical question the paper does not answer.
- None of the three papers benchmarks against the latest 2025-era frontier transformers (GPT-4.5, Claude 3.7, Gemini 2.5); the comparison set is open-source models in the 1B-3B range. `[Analysis]`
- No multimodal evaluation, no agentic-task evaluation in any of the three papers.

### Independent benchmark cross-checks for SoTA claims

`[External comparison]` RWKV-7's "new 3B SoTA on multilingual tasks" claim is the authors' framing on their chosen multilingual benchmark suite. Independent reproducibility studies on the same suite are not yet published as of writing. The English-aggregate matching of Qwen2.5-3B (71.5% vs 71.4%) is a reasonable like-for-like comparison and the paper documents the training-token gap that makes the result interesting.
xLSTM's "outperforms Mamba, RWKV-4, Llama" claim at 1.3B has been re-examined by the broader community since the NeurIPS 2024 spotlight; the picture at 7B is harder to read because Mamba-2 and Griffin landed in between. `[Reviewer Perspective]` The fair framing across the three papers is: each architecture is competitive with attention at the parameter scale and training-token budget the paper reports, with the gap to the very best transformer-based models neither closed nor proven negligible at frontier scale. ### Evidence audit - **Strongly supported** (paper-reported): linear-time inference profile; constant-memory state; the three-form computational equivalence; the formal expressivity result for RWKV-7 (mathematical proof in the paper); the headline 1.3B / 2.9B / 6.7B benchmark numbers (each paper's own tables). - **Partially supported**: claims of "SoTA" are paper-specific framing on paper-chosen benchmark suites without independent re-runs. - **Narrow evidence**: scaling behaviour beyond the reported parameter counts; comparisons against the latest commercial frontier models. ## 10. 
Technical novelty summary | Component | Type | Novelty level | Justification | Source | |---|---|---|---|---| | Exponential gating with stabilizer (xLSTM) | Cell mechanism | Combination novel | Stabilised log-domain max gating at LSTM scale is new; the underlying log-sum-exp trick is classical | xLSTM Section "sLSTM" | | Matrix memory with covariance update (xLSTM mLSTM) | Sequence mixer | Combination novel | Outer-product key-value memory is from Linear Attention 2020 and Fast Weight Programmers 2021; novelty is integration with exponential gating and stabilisation | xLSTM Section "mLSTM" | | sLSTM-mLSTM mixed architecture | Macro design | Fully novel | The mixed-ratio block design is xLSTM-specific | xLSTM Section "Architecture" | | Generalised delta rule with vector-valued decay (RWKV-7) | Sequence mixer | Combination novel | Builds on DeltaNet (Yang 2024) and RWKV-6 (Peng 2024) | RWKV-7 Section "Time-Mix" | | In-context learning rate, vector form | Update primitive | Fully novel | Scalar in-context learning rates appear in DeltaNet; per-channel vector form is RWKV-7's contribution | RWKV-7 Section "Time-Mix" | | Decoupled removal-replacement keys | Update primitive | Fully novel | RWKV-7-specific design choice | RWKV-7 Section "Time-Mix" | | State-tracking expressivity proof | Theory | Fully novel | First linear-time recurrent architecture proven to recognise all regular languages in constant depth | RWKV-7 Section "Expressivity" | | Three-form retention template (RetNet) | Macro design | Fully novel | The explicit parallel / recurrent / chunkwise interchangeability template | RetNet Section 2 | | Multi-scale retention with per-head $\gamma$ | Sequence mixer | Combination novel | Multi-scale decay schedule across heads is RetNet's specific recipe | RetNet Section 2 | **Single most novel contribution.** Across the three papers, RWKV-7's proof that a linear-time recurrent architecture with non-diagonal input-dependent transitions can recognise all regular 
languages in a constant number of layers, exceeding the TC$^0$ bound on standard transformers, is the most novel result. It is the first formal-language-theoretic result that puts a linear-time recurrent architecture strictly above transformers on a well-studied expressivity axis. `[Analysis]` The practical implications on natural-language tasks remain an empirical question; the theorem itself is the contribution.

**What the papers do NOT claim to be novel.** The residual-stream backbone (standard transformer practice). The AdamW optimiser. The cross-entropy next-token loss. Layer normalisation. Token-level embedding and unembedding. SwiGLU MLPs (used in xLSTM and RetNet). The general idea of replacing softmax attention with a kernel-based linear recurrence (the 2020 Linear Attention result, explicitly cited as a starting point by all three). The outer-product matrix-memory primitive (Linear Attention 2020, Fast Weight Programmers 2021).

## 11. Situating the work

**What prior work did.**

- Linear Attention (Katharopoulos et al. 2020) introduced the kernel-feature-map trick that rewrites attention as a $d_h \times d_h$ matrix-state recurrence, the primitive all three papers build on.<FootnoteRef n={7} />
- Fast Weight Programmers (Schlag et al. 2021) showed the outer-product key-value memory and its connection to associative memory.
- S4 / Mamba (Gu et al. 2022, Gu and Dao 2023) opened the state-space-model line that is contemporaneous with all three papers reviewed.
- RWKV-4 (Peng et al. 2023) and the RWKV-5/6 sequence (2024) established the RWKV lineage that RWKV-7 extends.

**What these papers change conceptually.**

- xLSTM: rehabilitates the LSTM cell as a candidate for LLM-scale architectures. The conceptual contribution is showing that the 1997 LSTM's known limitations (gate saturation, scalar memory, lack of parallelisability) can be removed without abandoning the gating vocabulary.
- RWKV-7: takes the state-tracking expressivity question seriously and turns it into a theorem. Until RWKV-7, the standard framing was that linear-time recurrent architectures were *more efficient but less expressive* than transformers; RWKV-7 shows that on a specific formal-language axis the opposite holds.
- RetNet: codifies the three-form template as the central organising idea for the post-attention architecture wave. Even where its specific recipe was not adopted, the parallel/recurrent/chunkwise framing became standard vocabulary.

**Contemporaneous related work.**

- **Mamba** (Gu and Dao 2023, arXiv:2312.00752): selective state-space model with input-dependent transitions. Builds on the same Linear-Attention-style recurrence; differs by using a structured continuous-time linear system rather than a covariance update. RWKV-7 explicitly compares against Mamba on state-tracking tasks.<FootnoteRef n={2} />
- **DeltaNet / GatedDeltaNet** (Yang et al. 2024): scalar-learning-rate delta-rule recurrence. RWKV-7 generalises DeltaNet's update with vector-valued parameters, calling out the comparison directly in its state-update equation table.<FootnoteRef n={2} />

`[Reviewer Perspective]` **Strongest skeptical objection.** The benchmark-suite framing is generous. All three papers report results on benchmark mixes that are favourable to non-attention architectures (validation perplexity, narrow zero-shot tasks). None reports a thorough evaluation on the kinds of tasks where transformers excel most clearly: long-range in-context retrieval, code reasoning under long context, instruction following at frontier scale. The fair reading is "competitive but not dominant"; the papers' marketing framing leans further than the evidence.

`[Reviewer Perspective]` **Strongest author-side rebuttal grounded in the papers.** The papers' explicit framing is parameter-and-token-matched comparison at sizes the authors could afford to train.
RWKV-7-2.9B trained on 3.1T tokens beating Qwen2.5-3B trained on 18T tokens at the English aggregate is a *training-efficiency* claim, not a peak-performance claim. The expressivity theorem in RWKV-7 is a formal result, not a benchmark claim, and it would hold whether or not the empirical comparisons close the gap.

**What remains unsolved.**

- 70B-parameter pure-recurrent training has not been demonstrated.
- The long-context behaviour of pure-recurrent architectures past a few tens of thousands of tokens is fragile in the published RWKV-7 numbers and not deeply explored in xLSTM.
- The empirical relevance of RWKV-7's expressivity theorem to natural-language tasks is open.

**Three future research directions.**

- **Hybrid stacks at scale.** `[Analysis]` A few attention layers interleaved with many recurrent layers may capture the best of both. Mamba-2 and Griffin have published in this direction; the three papers reviewed each leave room for this hybridisation as future work.
- **Direct frontier-model comparison.** `[Analysis]` Publishing a 30B+ parameter xLSTM or RWKV-7 with results on the standard frontier evaluation suite (MMLU, GSM8K, HumanEval, MATH, BIG-bench-Hard at fair training-token budgets) would settle the empirical question.
- **Expressivity-to-task transfer.** `[Analysis]` Designing natural-language tasks whose performance is gated by state-tracking would convert RWKV-7's theoretical advantage into a measurable practical one.

## 12. Critical analysis

### Strengths with concrete evidence

- **xLSTM:** matrix memory + exponential gating yields a 1.3B model that outperforms parameter-matched baselines on most measured tasks (paper Tables; arXiv:2405.04517).<FootnoteRef n={1} />
- **RWKV-7:** formal expressivity result combined with strong empirical multilingual aggregate at 2.9B on roughly one-sixth the training-token budget of Qwen2.5-3B, going by the 3.1T versus 18T token counts (arXiv:2503.14456).<FootnoteRef n={2} />
- **RetNet:** the three-form template established a clean architectural vocabulary the subsequent literature inherited; the 8.4$\times$ decoding throughput at 6.7B is a concrete deployment-relevant number (arXiv:2307.08621).<FootnoteRef n={3} />

### Weaknesses explicitly stated by the authors

- **RWKV-7 stated limitations:** numerical-precision sensitivity in the WKV kernel, lack of instruction tuning in released checkpoints, sensitivity to prompt structure (initial-token memory issues), long-context degradation past 10k tokens on PG19 without explicit long-context fine-tuning.<FootnoteRef n={2} />
- **xLSTM stated limitations:** the sLSTM variant is not fully parallelisable, which limits training throughput; long-context behaviour is not as deeply characterised as in transformer baselines.<FootnoteRef n={1} />
- **RetNet stated limitations:** the paper notes that the multi-scale decay schedule is hand-designed rather than learned; the parallel form is no faster than a transformer's parallel form at training time per token, only at inference.<FootnoteRef n={3} />

### Weaknesses not stated or understated

`[Reviewer Perspective]` The training-token-efficiency framing in the RWKV-7 paper (matching Qwen2.5-3B on roughly one-sixth the tokens) leaves out that the Qwen2.5 family was trained with more rigorous data quality and instruction tuning; a like-for-like training corpus is not in the comparison.
`[External comparison]` The same criticism applies to xLSTM's comparison against Llama at 1.3B: training-data overlap with the SlimPajama validation set is not the same as the training data for the Llama baselines.

`[Reviewer Perspective]` The three papers do not deeply explore inference behaviour under *prompt-distribution-shift* conditions, meaning the situation where the input distribution at deployment differs from the training distribution. Transformer KV-cache behaviour is at least transparent in this regime; how a constant-memory matrix state behaves is less well understood and the papers do not document it.

`[External comparison]` Park et al. (cited by RWKV-7) and other contemporaneous expressivity-of-RNNs analyses provide additional theoretical lenses the three papers do not fully integrate.

### Reproducibility check

- **xLSTM.**
  - Code: Apache-2.0 reference implementation at github.com/NX-AI/xlstm, includes CUDA kernel.<FootnoteRef n={8} />
  - Data: SlimPajama is publicly available; specific training-data subsets used in the paper are documented.
  - Hyperparameters: documented for the headline configurations.
  - Compute: reported in the paper.
  - Weights: 7B reference checkpoint released by NX-AI per the official repo's release notes.<FootnoteRef n={8} />
  - Overall: partially-to-fully reproducible at the architectures and small-to-medium model sizes documented.
- **RWKV-7.**
  - Code: Apache-2.0 reference implementation at github.com/BlinkDL/RWKV-LM.<FootnoteRef n={9} />
  - Data: World-v3 corpus is documented; not all components are publicly available as a single bundle.
  - Hyperparameters: documented.
  - Compute: max 96 H800 GPUs reported.<FootnoteRef n={2} />
  - Weights: released on Hugging Face under Apache-2.0.<FootnoteRef n={11} />
  - Overall: fully reproducible at the released model sizes; the training-data variant for the multilingual setting is partially gated.
- **RetNet.**
  - Code: MIT-licensed reference at github.com/microsoft/torchscale.<FootnoteRef n={10} />
  - Data: internal pre-training corpus is not public; The Pile evaluation is public.
  - Weights: 6.7B configuration is documented; not all weight checkpoints are released for unrestricted download.
  - Overall: partially reproducible. The architecture is reproducible; the headline 6.7B configuration's training data is not.

### Methodology

**xLSTM**

- Sample size: SlimPajama 300B tokens at 1.3B parameter scale (paper headline configuration).
- Evaluation set: SlimPajama validation, PALOMA 571 text domains, zero-shot ARC/HellaSwag/PIQA/WinoGrande/LAMBADA.
- Baselines: Llama, Mamba, RWKV-4 at matched parameter count.
- Hardware/compute: documented in the paper; reference 7B training reported separately by NX-AI.<FootnoteRef n={8} />

**RWKV-7**

- Sample size: 3.119T training tokens for the 2.9B model.
- Evaluation set: English aggregate (MMLU, ARC-c, ARC-e, HellaSwag, OpenBookQA, PIQA, WinoGrande, LAMBADA); multilingual aggregate; post-2025-cutoff internet-data compression.
- Baselines: Qwen2.5, Llama-3.2, SmolLM2, Mamba, S4, RWKV-6.
- Hardware/compute: max 96 H800 GPUs.<FootnoteRef n={2} />

**RetNet**

- Sample size: 100B+ tokens at 6.7B parameters in the original paper.
- Evaluation set: validation perplexity on The Pile; inference-memory and throughput at 8k sequence length.
- Baselines: parameter-matched transformer.
- Hardware/compute: not all details reported in the abstract; full configuration in the paper appendix.<FootnoteRef n={3} />

### Generalisability

- **To larger scales.** `[Analysis]` Open empirical question; 70B+ runs of any of the three have not been published.
- **To different data types.** RWKV-7 demonstrates multilingual and code training; xLSTM's published experiments are English-focused; RetNet's published experiments are English-focused.
- **To different backbones.** All three are drop-in sequence-mixer replacements for attention; the residual-stream backbone is unchanged. Portability is high.

### Assumption audit

The Section 3 assumption list mostly holds in practice. The strongest is the equivalence of parallel and recurrent forms; all three papers establish this analytically and the published kernels deliver it numerically. The assumption that "linear in $N$" in the parallel form is realised in wall-clock comparison depends on the kernel quality; RWKV-7's reported 3$\times$ speedup over RWKV-6 reflects kernel engineering as much as algorithmic improvement.

### What would make the papers significantly stronger

`[Analysis]`

- **xLSTM:** a 30B+ parameter training run with frontier-suite evaluation.
- **RWKV-7:** a natural-language task constructed to require state tracking beyond TC$^0$, where the expressivity theorem would predict an empirical separation.
- **RetNet:** a 2025-era re-run against contemporary transformers (Llama-3 family, Qwen2 family) at matched training-token budgets to update the 2023 comparison.

<Image src="https://images.neuraltechdaily.com/articles/xlstm-rwkv-linear-transformer-alts-multi-paper-review-2026/in-article-4-800w.webp" alt="Figure showing RWKV-7 benchmark performance comparison across model sizes and training tokens versus Qwen2.5, Llama-3.2 and other baselines" caption="Benchmark comparison figure from RWKV-7 (arXiv:2503.14456), reproduced for editorial coverage." width={800} height={420} />

## 13. What is reusable for a new study

**REUSABLE COMPONENT 1: Three-form parallel/recurrent/chunkwise template (RetNet).**

- What it is: the architectural pattern of writing a sequence mixer in three equivalent forms.
- Why worth reusing: it is the cleanest way to amortise training cost over the parallel form and inference cost over the recurrent form. Both xLSTM-mLSTM and RWKV-7 follow this template.
- Preconditions: the sequence mixer's recurrence must be associative in the relevant sense so the chunkwise form is well-defined.
- What would need to change: most sequence mixers can be reformulated, but the specific decay structure determines whether the chunkwise form is clean or requires correction factors.
- Risks: numerical drift between parallel and recurrent forms at low precision.
- Interaction effects: combines well with residual streams; orthogonal to MLP design.

**REUSABLE COMPONENT 2: Exponential gating with log-domain stabilizer (xLSTM).**

- What it is: replacing a sigmoid gate with $\exp(\tilde{z})$, carrying a running-max stabilizer $m_t$ to keep the computation numerically safe.
- Why worth reusing: removes the saturation bottleneck of sigmoid gates without numerical instability.
- Preconditions: the cell needs a normaliser state to recover bounded outputs.
- What would need to change: any architecture using sigmoid gates could in principle adopt this; the stabilizer logic adds a few extra states per cell.
- Risks: training instability if the stabilizer is mis-initialised.
- Interaction effects: composes with matrix memory (mLSTM) and could plausibly compose with state-space-model architectures.

**REUSABLE COMPONENT 3: Matrix memory with outer-product update (xLSTM mLSTM, RWKV-7, RetNet).**

- What it is: a $d_h \times d_h$ matrix state updated at every step by an outer product $v_t k_t^\top$.
- Why worth reusing: the canonical associative-memory primitive that lets a recurrent network store and retrieve key-value pairs.
- Preconditions: a way to choose decay and update structure (each of the three papers does this differently).
- What would need to change: kernel implementation is non-trivial at high head dimension; head dim $\leq 256$ is the practical regime.
- Risks: $d_h^2$ memory footprint grows quickly with head dimension.
- Interaction effects: this is the primitive every linear-time recurrent architecture has converged on; the choice is *what update rule* not *whether a matrix state*.

**REUSABLE COMPONENT 4: Vector-valued data-dependent gating (RWKV-7).**

- What it is: per-channel data-dependent decays and learning rates ($w_t, a_t$).
- Why worth reusing: lets the network learn per-feature memory horizons rather than a single global decay.
- Preconditions: low-rank MLPs over the input to produce the gating vectors without adding too many parameters.
- What would need to change: trade-off between gating flexibility and parameter count.
- Risks: training-dynamics sensitivity; the paper documents precision sensitivity in the WKV kernel.
- Interaction effects: pairs naturally with the non-diagonal transition matrix.

**REUSABLE COMPONENT 5: Multi-scale decay schedule (RetNet).**

- What it is: assigning each retention head a different fixed decay $\gamma_h$ across an exponential schedule.
- Why worth reusing: lets different heads attend at different temporal horizons without learning per-token decays.
- Preconditions: per-head structure in the sequence mixer.
- What would need to change: the schedule is hand-designed in the paper; could be learned in a follow-up.
- Risks: low if the schedule is sensible; the published RetNet recipe is robust.
- Interaction effects: simpler than RWKV-7's data-dependent decay, more efficient in kernel.

### Dependency map

The three-form template (RetNet) is upstream of all the others; it is the macro architecture every subsequent paper inherits. Matrix memory with outer-product update (Linear Attention 2020, adopted by all three) is the universal primitive. Exponential gating (xLSTM) is a cell-local mechanism that composes with the matrix memory. Vector-valued gating (RWKV-7) is upstream of state-tracking expressivity. Multi-scale decay (RetNet) is an alternative to RWKV-7's data-dependent decay.
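
As a minimal sketch of how Components 1, 3, and 5 fit together, the NumPy snippet below computes a simplified multi-head retention in both the parallel and the recurrent form and checks that the two agree. It is an illustration under stated assumptions, not any paper's reference kernel: the function names (`multi_scale_gammas`, `retention_parallel`, `retention_recurrent`), the toy shapes, and the exact constants in the decay schedule are hypothetical choices for this example, and the rotation, normalisation, and output-gating steps of the published RetNet layer are left out.

```python
import numpy as np


def multi_scale_gammas(num_heads: int) -> np.ndarray:
    """One fixed decay per head (illustrative exponential schedule, an assumption)."""
    return 1.0 - 2.0 ** (-5.0 - np.arange(num_heads))


def retention_parallel(q, k, v, gammas):
    """Parallel form: every position at once, using a causal decay mask.

    q, k, v have shape (heads, seq_len, d_h); gammas has shape (heads,).
    """
    n = q.shape[1]
    pos = np.arange(n)
    diff = pos[:, None] - pos[None, :]          # diff[n, m] = n - m
    causal = diff >= 0
    # decay[h, n, m] = gamma_h ** (n - m) when m <= n, otherwise 0
    decay = np.where(causal, gammas[:, None, None] ** np.maximum(diff, 0), 0.0)
    scores = np.einsum("hnd,hmd->hnm", q, k) * decay
    return np.einsum("hnm,hmd->hnd", scores, v)


def retention_recurrent(q, k, v, gammas):
    """Recurrent form: a fixed-size (d_h x d_h) state per head, updated per token."""
    h, n, d = q.shape
    state = np.zeros((h, d, d))
    out = np.empty_like(v)
    for t in range(n):
        # S_t = gamma * S_{t-1} + v_t k_t^T  (decayed outer-product update)
        outer = np.einsum("hi,hj->hij", v[:, t], k[:, t])
        state = gammas[:, None, None] * state + outer
        # o_t = S_t q_t  (read the memory with the current query)
        out[:, t] = np.einsum("hij,hj->hi", state, q[:, t])
    return out


rng = np.random.default_rng(0)
heads, seq_len, d_h = 4, 16, 8
q, k, v = (rng.standard_normal((heads, seq_len, d_h)) for _ in range(3))
gammas = multi_scale_gammas(heads)

out_par = retention_parallel(q, k, v, gammas)
out_rec = retention_recurrent(q, k, v, gammas)
print("max abs difference:", np.max(np.abs(out_par - out_rec)))
```

The recurrent state has shape `(heads, d_h, d_h)` regardless of `seq_len`, which is the constant-memory inference profile discussed above, while the parallel form materialises a `(seq_len, seq_len)` mask per head. This sketch runs in float64; at lower precision the two outputs can drift apart, which is the risk listed under Component 1.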
### Highest-value components

`[Analysis]` For a new study aiming to ship a linear-time architecture: start with the three-form template, the matrix memory primitive, and one of (multi-scale decay, vector-valued gating). Multi-scale is cheaper to implement; vector-valued is more expressive. Whether to add exponential gating depends on whether the matrix memory is paired with an LSTM-style cell or a delta-rule cell.

`[Analysis]` For a new study on architectural expressivity: RWKV-7's non-diagonal input-dependent transition matrix is the construction that unlocks the state-tracking result. Anyone designing a recurrent architecture for state tracking should reuse this primitive.

## 14. Known limitations and open problems

**Limitations explicitly stated by the authors.**

- xLSTM: sLSTM not fully parallelisable; long-context characterisation lighter than for transformer baselines. (Paper Section "Limitations.")<FootnoteRef n={1} />
- RWKV-7: numerical-precision sensitivity, lack of instruction tuning, prompt sensitivity, long-context degradation past 10k tokens without fine-tuning, compute-constrained training. (Paper Section "Limitations.")<FootnoteRef n={2} />
- RetNet: hand-designed decay schedule, training-time parallel form not faster than transformer's parallel form. (Paper conclusion.)<FootnoteRef n={3} />

**Limitations not stated.**

- `[Reviewer Perspective]` All three papers benchmark exclusively against open-source baselines at small-to-medium scale. None compares against the strongest 2025-era frontier transformers at any scale.
- `[Reviewer Perspective]` Inference behaviour under distribution shift, adversarial prompts, and long agent loops is undocumented for all three.
- `[Reviewer Perspective]` The interaction between matrix-memory architectures and standard alignment techniques (RLHF, DPO) is not characterised in any of the three papers.

**Technical root cause of each.**

- The benchmark-comparison gap reflects the realistic compute budget any single research group can deploy against frontier-scale models; closing it would require frontier-scale training runs.
- Distribution-shift behaviour is an empirical-deployment question; the published experiments do not cover it.
- Alignment-technique interaction is downstream of having an instruction-tuned model, which only xLSTM 7B (via the NX-AI release) and RWKV-7 (community SFT variants) have at the time of writing, and not as the subject of an alignment-focused paper.

**Open problems left behind.**

- Frontier-scale (70B+) pure-recurrent training.
- A natural-language task that empirically distinguishes RWKV-7's expressivity from a transformer's.
- Long-context behaviour (128k+) for the matrix-memory architectures, deeply characterised against attention.
- Hybrid attention + matrix-memory stacks that combine the best of both.

**What a follow-up paper would need to solve to address the most critical limitation.** `[Analysis]` Closing the frontier-scale empirical gap would require: (a) a 30B+ matrix-memory model trained on a frontier-quality data mix; (b) full evaluation on MMLU / GSM8K / HumanEval / MATH / BIG-bench-Hard at matched training-token budgets; (c) honest documentation of the compute cost. Until that paper exists, "linear-time architectures are competitive with attention at frontier scale" remains a hypothesis the three papers reviewed support but do not establish.

## How this article reads at three depths

**For the curious high-school reader.** Today's chat models are slow at long inputs because of a math operation called attention that costs roughly $N \times N$ when reading $N$ words. Three research teams asked: can we replace attention with something that costs only $N \times 1$, a constant amount of work per word, without making the model dumber? The papers show the answer is yes, mostly: at the model sizes they could afford to train, the new architectures match or come close to standard transformers while running much cheaper on long inputs.

**For the working developer or ML engineer.** xLSTM, RWKV-7, and RetNet are linear-time sequence mixers, drop-in replacements for the attention block. All three carry a fixed-size matrix state ($d_h \times d_h$ per head), all three have a parallel form for training and a recurrent form for inference, and all three are released as open-source reference implementations under permissive licenses (Apache-2.0 for xLSTM and RWKV-7, MIT for RetNet via microsoft/torchscale). For long-running chat or agent loops the constant-memory inference profile is a real win; the trade-off is that pure-recurrent stacks at frontier scale remain unproven, and the deployment frontier is dominated by hybrid attention + recurrent stacks (Mamba-2, Griffin family). If long-context cost is the bottleneck, these are the reference architectures to evaluate; if peak quality at frontier scale is the bottleneck, attention is still the safer default.

**For the ML researcher.** Across the three papers, the most novel result is RWKV-7's proof that a linear-time recurrent architecture with non-diagonal input-dependent transition matrices can recognise all regular languages in a constant number of layers, formally exceeding standard transformers' TC$^0$ bound on a well-studied complexity axis. xLSTM contributes the exponential-gate + log-domain-stabilizer combination as a numerically safe replacement for sigmoid gates at LSTM scale and the sLSTM/mLSTM mixed-architecture design. RetNet codified the three-form parallel/recurrent/chunkwise template the subsequent literature inherited and the multi-scale decay schedule.
The strongest objections are that all three papers benchmark against open-source baselines at small-to-medium scale, none compares against frontier 2025-era transformers, and the empirical relevance of the RWKV-7 expressivity theorem to natural-language tasks is unresolved.

<FootnoteList>
<FootnoteItem n={1} url="https://arxiv.org/abs/2405.04517" date="2026-05-19">Beck et al., xLSTM: Extended Long Short-Term Memory, arXiv:2405.04517, NeurIPS 2024 spotlight</FootnoteItem>
<FootnoteItem n={2} url="https://arxiv.org/abs/2503.14456" date="2026-05-19">Peng et al., RWKV-7 "Goose" with Expressive Dynamic State Evolution, arXiv:2503.14456</FootnoteItem>
<FootnoteItem n={3} url="https://arxiv.org/abs/2307.08621" date="2026-05-19">Sun et al., Retentive Network: A Successor to Transformer for Large Language Models, arXiv:2307.08621</FootnoteItem>
<FootnoteItem n={4} url="https://openreview.net/forum?id=ARAxPPIAhq" date="2026-05-19">OpenReview landing page for the xLSTM NeurIPS 2024 spotlight</FootnoteItem>
<FootnoteItem n={5} url="https://ar5iv.labs.arxiv.org/html/2503.14456" date="2026-05-19">ar5iv HTML render of the RWKV-7 paper used to transcribe equations</FootnoteItem>
<FootnoteItem n={6} url="https://ar5iv.labs.arxiv.org/html/2307.08621" date="2026-05-19">ar5iv HTML render of the RetNet paper used for the three-form template and reported throughput numbers</FootnoteItem>
<FootnoteItem n={7} url="https://arxiv.org/abs/2006.16236" date="2026-05-19">Katharopoulos et al., Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, arXiv:2006.16236</FootnoteItem>
<FootnoteItem n={8} url="https://github.com/NX-AI/xlstm" date="2026-05-19">NX-AI official xLSTM reference implementation, Apache-2.0</FootnoteItem>
<FootnoteItem n={9} url="https://github.com/BlinkDL/RWKV-LM" date="2026-05-19">BlinkDL official RWKV-LM reference implementation, Apache-2.0</FootnoteItem>
<FootnoteItem n={10} url="https://github.com/microsoft/torchscale" date="2026-05-19">microsoft/torchscale, hosts the official RetNet reference, MIT-licensed</FootnoteItem>
<FootnoteItem n={11} url="https://huggingface.co/BlinkDL" date="2026-05-19">Hugging Face: BlinkDL RWKV model collection, Apache-2.0</FootnoteItem>
</FootnoteList>

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
