What Is Attention in Transformers? An Explainer for Developers Who Read 'Attention Is All You Need'

Attention is the mechanism in a transformer where each token decides how much to look at every other token. Three matrix products, one softmax, weighted sum.

4 May 2026 Updated 19 May 2026 ~11 min read

The short answer

Attention is the part of a transformer where each token in the input decides how much to look at every other token. Computationally it is three matrix products and one softmax: a query matrix multiplied by a key matrix gives raw scores, the softmax converts those scores into weights that sum to 1, and the weights are then multiplied with a value matrix to produce the output for each position.

The result for any one token is a weighted sum of value vectors drawn from every other token in the sequence. That is what makes a transformer “context-aware”. When the model produces an output for the word bank, attention is the mechanism that lets it down-weight the surrounding words about furniture and up-weight the surrounding words about a river or a balance sheet, depending on what the rest of the sentence is doing.

The original 2017 paper by Vaswani and colleagues at Google, Attention Is All You Need¹, introduced this as the core idea: the recurrence and convolutions that earlier sequence models leaned on were not strictly necessary, and a stack of attention layers could do the work on its own. Eight years later, the architecture is still load-bearing for every frontier LLM you use, including the open-weight models on Hugging Face and the closed-weight models behind ChatGPT, Claude, and Gemini.

Why the recurrence-free reframe mattered

Before transformers, sequence models were dominated by recurrent neural networks, RNNs and LSTMs, which processed one token at a time and carried a hidden state forward. The hidden state was the model’s only channel for remembering what had come earlier in the sequence. That worked, but it had two problems. The hidden state was a fixed-size bottleneck, so information from far back in the sequence had to be compressed and tended to fade. And because each step depended on the previous step, training was inherently sequential, which left modern GPUs underutilised.

The transformer’s attention block solves both problems at once. Every token can directly read from every other token through the attention matrix, so there is no fixed-size bottleneck. And the attention computation is a batch of matrix multiplications, so the whole sequence can be processed in parallel.² That parallelism is why transformers scaled the way they did once the hardware caught up.

The math, in three matrix products

Inside an attention layer, every input token gets transformed into three vectors: a query (Q), a key (K), and a value (V). All three are produced by multiplying the input embedding by a learned projection matrix that the model trains via gradient descent. So if the input embedding for token i is $x_i$ , then $q_i = x_i \cdot W_Q$ , $k_i = x_i \cdot W_K$ , and $v_i = x_i \cdot W_V$ , where the $W$ matrices are the learned weights.

For a sequence of length n, you stack these into matrices: $Q$ of shape $(n \times d_k)$ , $K$ of shape $(n \times d_k)$ , and $V$ of shape $(n \times d_v)$ . The attention output is then defined by one compact equation, which the paper writes as:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$

Read it left to right. $Q K^T$ produces an $n \times n$ matrix of raw similarity scores: row i column j is “how relevant is token j to the query coming from token i”. The division by $\sqrt{d_k}$ is a numerical-stability trick that keeps the softmax from saturating when the embedding dimension is large.³ The softmax then normalises each row so its weights sum to 1, turning the scores into a proper probability distribution over positions. Finally, multiplying that $n \times n$ weight matrix by $V$ (which is $n \times d_v$ ) yields an $n \times d_v$ output, where row i is the weighted sum of value vectors that the model uses as the contextualised representation of token i.

That is the entire core operation. The diagram below traces it for a three-token toy input.

Diagram by Neural Tech Daily. Scaled dot-product attention for a three-token input, after Vaswani et al. 2017.

Multi-head attention is the same thing, in parallel

A single attention layer of the kind described above is called single-head attention. Real transformers use multi-head attention: several attention computations running in parallel, each with its own learned $W_Q$ , $W_K$ , and $W_V$ matrices, and their outputs concatenated and projected back to the model’s hidden dimension.⁴

Figure 2 of arXiv:1706.03762: scaled dot-product attention (left) and multi-head attention (right) — the canonical diagram from 'Attention Is All You Need' showing how multiple attention heads run in parallel and concatenate before the linear projection

Figure 2 of Attention Is All You Need (arXiv:1706.03762), reproduced for editorial coverage.

The intuition is that each head can specialise on a different kind of relationship in the sequence. One head might learn to track which adjective modifies which noun. Another might learn to look back to the subject of the current clause. A third might focus on long-range references like “the city we visited last week”. Whether you take the specialisation story literally or not, the empirical fact is that splitting the model dimension across, say, 8 or 16 heads consistently outperforms a single fat head with the same total parameter count.⁴ That is why “the model has 32 attention heads” is the kind of spec that shows up on every model card.

Why this matters for context-awareness

Before attention, models had to compress a sentence’s meaning into a single hidden state and hope the right information survived. Attention removes the bottleneck. Every output position has direct access to every input position, modulated by learned relevance weights. That is the architectural reason transformers can handle long-range dependencies, complete code that references a function defined hundreds of lines earlier, and answer questions about documents tens of thousands of tokens long.⁵

For a developer reading the paper or experimenting with Hugging Face’s transformers library, the practical consequence is that “more context” is mostly a matter of letting attention see more tokens, which is why context-window numbers (4k, 32k, 200k, 1M) feature so heavily in model marketing. The mechanism that uses that context is, at its core, the same softmax-of-Q-K-times-V you read about in the 2017 paper.

Variants you will see in modern code

Three variants come up often enough to be worth naming.

Causal masking is the change that turns an encoder-style attention block into a decoder-style one suitable for autoregressive language modelling. The attention scores in the $Q K^T$ matrix are masked so that token i can only attend to positions $\leq i$ ; future positions get a score of negative infinity, which becomes 0 after softmax. Every GPT-style model, including the open-weight Llama and Qwen families, uses causal masking in its decoder blocks.¹

Sparse attention addresses the fact that the $n \times n$ attention matrix scales quadratically with sequence length, which is what makes long-context inference expensive. Sparse variants compute attention only over a structured subset of positions, sliding-window, strided, or learned, trading some expressivity for asymptotic efficiency.

FlashAttention is the implementation trick that has become the production default. It produces the same outputs as standard attention but reorganises the computation to fit better in GPU memory hierarchies, fusing the softmax and matmul steps and avoiding materialising the full $n \times n$ scores matrix in high-bandwidth memory.⁸ If you read about training-throughput numbers from any frontier lab in the last three years, FlashAttention or one of its successors is doing the work under the hood.

Common misconceptions

Three things are worth correcting because they keep showing up in tutorials.

First, attention is not the same thing as the soft-attention that pre-transformer encoder-decoder models used between an encoder RNN and a decoder RNN. Bahdanau-style attention from 2014 was an attention mechanism, but the transformer’s contribution was to use attention as the only sequence-mixing primitive and drop recurrence entirely.

Second, the queries, keys, and values are not pre-existing semantic objects. They are linear projections of the same input embedding through three learned matrices. The model decides at training time what should go into Q, K, and V to make the loss go down. Treating them as if they had pre-defined meaning is a teaching shortcut, not a description of what the network actually does.

Third, attention by itself is not a complete transformer block. The full block is attention plus a feed-forward network, with residual connections and layer normalisation around both. The feed-forward part is where most of the parameters live in a typical model, and it is doing important work that is easy to skip past when the attention story is more visually striking.

Honest caveats

This explainer covers the vanilla scaled dot-product attention as introduced by Vaswani and colleagues. Production attention implementations in 2026 frontier models include rotary position embeddings, grouped-query attention, sliding-window attention, and various sparsity tricks that are out of scope here. FlashAttention itself has had two major successors since the v1 paper this article cites: FlashAttention-2 (Tri Dao, ICLR 2024) reworked the kernel for better hardware utilisation,⁹ and FlashAttention-3 (July 2024) is H100-specific and exploits Hopper-architecture features.¹⁰ The KV cache, separately, is a runtime trick that stores keys and values from previous decoding steps so incremental token generation does not recompute them. Multi-Latent Attention (MLA), popularised by DeepSeek, compresses K and V into a shared low-rank latent to shrink the KV cache. The Hugging Face NLP course and Jay Alammar’s Illustrated Transformer⁶ are good follow-ups if you want to dig into specific variants. The Wikipedia article on attention in machine learning⁷ is also a useful cross-reference for the history and the broader family of related mechanisms.

If you want to read the original paper after this, the four pages of section 3 (“Model Architecture”) are where the formal definitions live; the rest of the paper is empirical results and analysis. The arXiv abstract page links the PDF and the BibTeX.¹

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. Vaswani et al. — Attention Is All You Need (arXiv:1706.03762, NeurIPS 2017) (accessed 2026-05-04) ↩
2. Hugging Face — NLP Course Chapter 1 on transformer models, parallelism vs RNN sequential dependence (accessed 2026-05-04) ↩
3. Vaswani et al. 2017 — section 3.2.1 on scaled dot-product attention and the $\sqrt{d_k}$ scaling rationale (accessed 2026-05-04) ↩
4. Vaswani et al. 2017 — section 3.2.2 on multi-head attention, per-head projection scheme, and Table 3 ablations comparing single-head vs multi-head at matched parameter count (accessed 2026-05-04) ↩
5. Wikipedia — attention (machine learning), section on context-window scaling and long-range dependencies in transformers (accessed 2026-05-04) ↩
6. Jay Alammar — The Illustrated Transformer, visual walkthrough of multi-head attention (accessed 2026-05-04) ↩
7. Wikipedia — attention (machine learning), broader history and family of attention mechanisms (accessed 2026-05-04) ↩
8. Dao et al. — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arXiv:2205.14135), describing the IO-aware kernel that fuses softmax and matmul and avoids materialising the full attention matrix in HBM (accessed 2026-05-05) ↩
9. Tri Dao — FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (ICLR 2024), the v2 follow-up that improves hardware utilisation (accessed 2026-05-05) ↩
10. Shah et al. — FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (arXiv:2407.08608), H100-specific Hopper-architecture variant (accessed 2026-05-05) ↩

Anonymous · no cookies set

Found this useful? Share it.