What This Post Covers
This is Part 1 of a two-part series on how the transformer’s attention mechanism has evolved. Every attention variant shipped in production since 2019 is fighting one number: the bytes of KV cache you have to carry per token. That number controls how many concurrent users fit on a GPU, how long a context you can serve, and ultimately whether your model is economically viable to deploy.
Part 1 walks the first wave of answers: the variants that attack the cache by changing what gets stored per token. We start with the bottleneck, recap multi-head attention, look at the stepping stones (MQA and GQA), then spend most of the post on Multi-head Latent Attention as introduced in DeepSeek-V2. By the end you will see how a single low-rank bottleneck plus a clever bit of algebra collapses the per-token cache by roughly 57x without giving up the expressivity of standard softmax attention.
Part 2 picks up at the question MLA cannot answer: once each cached token is about as small as it gets, can we cache fewer tokens? That is sparse attention (DSA, NSA, MoBA), linear-attention hybrids, and the V4-Pro synthesis where compression and sparsity stack. (Coming soon.)
The audience is engineers who deploy models. FlashAttention has its own dedicated post on this blog and we will not re-cover it here.
Part 1: The KV Cache Wall
Modern transformer inference is, in practice, a memory bandwidth problem. During autoregressive decoding, each new token must attend to every prior token, which means every prior token’s key and value vectors must already sit in fast memory. That stash is the KV cache, and it grows linearly with sequence length, linearly with batch size, and linearly with the number of layers.
For a model with $L$ layers, $n_h$ heads, per-head dimension $d_h$, sequence length $T$, and batch size $B$ in float16, the standard MHA cache is:
$$\text{Cache}_{\text{MHA}} \;=\; 2 \cdot L \cdot B \cdot T \cdot n_h \cdot d_h \cdot 2 \text{ bytes}$$
The leading factor of 2 is for keys and values; the trailing 2 is bytes per element. For a Llama-3-style 70B model at BF16 with 80 layers, 64 heads, and $d_h = 128$, using full MHA, each token consumes 32 KB of cache per layer, or about 2.5 MB across all 80 layers. At 128K context that is 320 GB of KV cache for a single sequence. The H100 has 80 GB of HBM. You cannot serve a single 128K-context request on one H100 without doing something to shrink that number. The cache, not the parameters, is what limits how many concurrent users fit on a GPU and how long a context you can serve.
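The arithmetic is easy to check directly. The sketch below is a minimal calculator for the formula above; the function name `mha_kv_cache_bytes` is made up for illustration, and the configuration is the full-MHA 70B reference used in the text, not any shipped model's config.

```python
def mha_kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes of K and V cached by standard MHA (leading 2 = keys + values)."""
    return 2 * n_layers * batch * seq_len * n_heads * head_dim * bytes_per_elem

# Reference from the text: 80 layers, 64 heads, d_h = 128, BF16 (2 bytes per element).
per_token = mha_kv_cache_bytes(80, 64, 128, seq_len=1)
full_ctx = mha_kv_cache_bytes(80, 64, 128, seq_len=128 * 1024)

print(per_token / 2**20)  # MiB per token across all layers: 2.5
print(full_ctx / 2**30)   # GiB for one 128K-token sequence: 320.0
```

Multiplying by the batch size gives the footprint for concurrent requests, which is why the same number bounds both context length and user count.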
Several approaches have chipped away at this. Multi-Query Attention (MQA) shares a single K/V across all heads, a brutal compression that visibly degrades quality. Grouped-Query Attention (GQA) is the negotiated middle ground that ships in Llama, Mistral, and Qwen. Multi-Head Latent Attention, introduced in DeepSeek-V2 in May 2024, takes a different tack: it caches a single low-dimensional latent vector per token and reconstructs full-width K and V on the fly.
The trick, and the whole point of the rest of this post, is that you never actually have to reconstruct them.
Part 2: Notation
A few symbols recur throughout. None of them are unusual; the table is a quick reference so the equations below read fast.
| Symbol | Meaning | DeepSeek-V2 value |
|---|---|---|
| $d$ | Model (residual stream) dimension | 5120 |
| $n_h$ | Number of attention heads | 128 |
| $d_h$ | Per-head dimension (content part) | 128 |
| $d_c$ | Latent KV compression dim, the cached width | 512 |
| $d_c'$ | Query compression dim (never cached; mainly saves training memory) | 1536 |
| $d_h^R$ | RoPE per-head dim (decoupled positional part) | 64 |
| $h_t$ | Hidden state at position $t$, shape $\mathbb{R}^{d}$ | (per token) |
| $\mathbf{c}_t^{KV}$ | The cached latent at position $t$, shape $\mathbb{R}^{d_c}$ | (per token) |
Convention: all vectors are row vectors, so $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$ and $h W$ produces an output row vector. This matches how inputs flow through PyTorch nn.Linear (which stores its weight transposed, as $d_{\text{out}} \times d_{\text{in}}$) and how most modern transformer code reads.
Part 3: Recap: Standard Multi-Head Attention
Before getting to MLA, it is worth pinning down exactly what we are trying to replace. Standard MHA at position $t$ takes the hidden state $h_t \in \mathbb{R}^{d}$ and produces three projections:
$$q_t = h_t W^Q, \qquad k_t = h_t W^K, \qquad v_t = h_t W^V$$
where $W^Q, W^K, W^V \in \mathbb{R}^{d \times n_h d_h}$. The result is split into $n_h$ heads of dimension $d_h$. Attention is computed head-wise, with $k_t$ and $v_t$ cached for every past position $t$:
$$\text{Attn}_i(q, K, V) \;=\; \text{softmax}\!\left(\tfrac{q_i K_i^\top}{\sqrt{d_h}}\right) V_i$$
Figure: Standard Multi-Head Attention. One token's hidden state is projected three ways; K and V get cached, per head, for every past position.
The cost is paid not in the projection matrices (those live in HBM regardless) but in the running cache of all past $k_t, v_t$. Per token per layer, that is $2 \cdot n_h \cdot d_h$ floats. For DeepSeek-V2 scale (128 heads, 128 per-head dim) it is 32,768 floats per token per layer. Everything MLA does is in service of shrinking that cache while keeping the per-head expressivity.
Part 4: Stepping Stones: MQA and GQA
MLA is most legible when contrasted with what came before. Both MQA and GQA attack the cache by reducing the number of distinct K and V projections.
Multi-Query Attention (Shazeer 2019) keeps $n_h$ query heads but uses a single shared key projection and a single shared value projection across all of them. The cache shrinks by a factor of $n_h$. On a 64-head model that is a 64x reduction. MQA worked in production for PaLM and Falcon-7B, but most successors retreated. With one shared K and one shared V, every query head looks at the same key/value subspace, the model loses head-level specialization on the recall side, and quality regressions at scale were measurable.
Grouped-Query Attention (Ainslie et al. 2023) is the practical middle ground. Instead of all-or-nothing sharing, you partition the $n_h$ query heads into $n_g$ groups, and each group shares one K/V head. MHA is $n_g = n_h$. MQA is $n_g = 1$. GQA-8 (the Llama-3 default) is $n_g = 8$. For our 70B reference at GQA-8 the cache is 320 KB per token, an 8x reduction, and quality regression vs MHA is in the noise.
Figure: Stepping Stones: MHA, GQA, MQA. The K/V width is the lever each method pulls. Q stays the same; K and V get narrower.
MHA keeps full per-head K/V. MQA collapses to a single shared pair. GQA picks a comfortable middle. All three keep K and V as the cached objects, though.
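To make the sharing pattern concrete, here is a minimal NumPy sketch of one decode step of grouped-query attention. It is an illustration under toy sizes, not any framework's implementation; the function name `gqa_decode_step` and the cache layout `(T, n_g, d_h)` are assumptions chosen for readability. Setting `n_g = n_h` gives MHA and `n_g = 1` gives MQA.

```python
import numpy as np

def gqa_decode_step(q, K_cache, V_cache):
    """One decode step of grouped-query attention.

    q:        (n_h, d_h)     query heads for the new token
    K_cache:  (T, n_g, d_h)  cached keys, one head per group
    V_cache:  (T, n_g, d_h)  cached values, one head per group
    returns:  (n_h, d_h)     per-head attention outputs
    """
    n_h, d_h = q.shape
    n_g = K_cache.shape[1]
    assert n_h % n_g == 0, "query heads must divide evenly into groups"
    # Each cached K/V head is shared by n_h // n_g consecutive query heads.
    K = np.repeat(K_cache, n_h // n_g, axis=1)  # (T, n_h, d_h)
    V = np.repeat(V_cache, n_h // n_g, axis=1)
    scores = np.einsum("hd,thd->ht", q, K) / np.sqrt(d_h)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("ht,thd->hd", weights, V)

rng = np.random.default_rng(0)
T, n_h, d_h = 5, 8, 4  # toy sizes
q = rng.standard_normal((n_h, d_h))
for name, n_g in [("MHA", 8), ("GQA-2", 2), ("MQA", 1)]:
    K_cache = rng.standard_normal((T, n_g, d_h))
    V_cache = rng.standard_normal((T, n_g, d_h))
    out = gqa_decode_step(q, K_cache, V_cache)
    print(name, "cached floats per token:", 2 * n_g * d_h, "output shape:", out.shape)
```

The output shape never changes; only the width of what is stored per token does, which is the whole point of the MHA-to-MQA spectrum.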
MLA changes the question. Instead of asking “how many K/V projections do we keep?”, it asks: what if the cache is not K or V at all, but a compressed representation we expand on demand?
Part 5: MLA’s Core Insight
The hidden state $h_t \in \mathbb{R}^{d}$ already contains everything needed to compute that token’s keys and values. The standard pipeline burns most of its width on a high-dimensional intermediate ($n_h d_h \approx 16{,}000$ for DeepSeek-V2) that we store, when really we could store a compact summary and reproject when needed.
Concretely: introduce a latent bottleneck $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ with $d_c \ll n_h d_h$. Cache only this latent. Recover K and V via dedicated up-projections at attention time:
$$\underbrace{h_t W^{DKV}}_{\mathbf{c}_t^{KV} \,\in\, \mathbb{R}^{d_c}}\;\longrightarrow\;\begin{cases} \mathbf{c}_t^{KV} W^{UK} \;=\; k_t & \in \mathbb{R}^{n_h d_h} \\[2pt] \mathbf{c}_t^{KV} W^{UV} \;=\; v_t & \in \mathbb{R}^{n_h d_h} \end{cases}$$
Figure: MLA's Core Idea. Cache one narrow latent per token. Reconstruct K and V at attention time, then throw them away.
If you stop reading right here, you would be tempted to ask: doesn’t reconstructing K and V at every attention step add a huge amount of compute? The honest answer is yes, naively. The whole punchline of MLA, which we get to in §8, is that during inference you do not have to materialize K and V at all. The up-projection matrices can be absorbed into Q and the output projection. The cache stays small and the FLOPs stay manageable.
A low-rank cache only saves memory. A low-rank cache that you never have to up-project saves memory and compute. That is the actual MLA result.
But before we get to the absorption, there is a wrinkle: RoPE. Rotary positional embeddings break the clean factorization above, and dealing with that gracefully is what gives MLA its slightly baroque final form. We will first walk through the no-RoPE version step by step, then patch it.
Part 6: Matrix Walkthrough, Step by Step
We will work through one token’s forward pass. Position $t$, hidden state $h_t \in \mathbb{R}^{d}$. Numbers in parentheses are the DeepSeek-V2 values, so the shapes feel concrete.
Figure: MLA, Step by Step. One token's path from hidden state through the full MLA construction: KV down-projection, K and V up-projection, decoupled RoPE construction, the absorption trick, and what lives in the cache.
Step 1: KV down-projection
A single linear layer projects the residual stream down into the latent space:
$$\mathbf{c}_t^{KV} \;=\; h_t \, W^{DKV}, \qquad W^{DKV} \in \mathbb{R}^{d \times d_c} \;=\; \mathbb{R}^{5120 \times 512}$$
The resulting $\mathbf{c}_t^{KV} \in \mathbb{R}^{512}$ is the only KV-related thing we cache for this token. It is shared across all heads and contains the information from which both K and V will eventually be reconstructed.
Step 2: K and V up-projection
The latent fans out into the full multi-head K and V via two more linear layers:
$$k_t^C \;=\; \mathbf{c}_t^{KV} \, W^{UK}, \qquad v_t^C \;=\; \mathbf{c}_t^{KV} \, W^{UV}$$
$$W^{UK}, W^{UV} \in \mathbb{R}^{d_c \times n_h d_h} \;=\; \mathbb{R}^{512 \times 16384}$$
The superscript $C$ marks the content portion of K (we add a separate RoPE portion in §7). After this step, $k_t^C$ and $v_t^C$ live in $\mathbb{R}^{n_h d_h}$ and split into $n_h$ heads of width $d_h$. Since $d_c \ll n_h d_h$, these are low-rank reconstructions: the stacked key (and likewise the stacked value) across all heads is constrained to lie in a $d_c$-dimensional subspace of the full $n_h d_h$-dimensional space. Any single head can still span its own $d_h$ dimensions, because $d_h < d_c$; the constraint is on the heads jointly. This rank constraint is the price of compression, and empirically it is the right trade.
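Steps 1 and 2 can be sketched in a few lines of NumPy. This is a toy illustration with random weights and made-up small sizes (chosen so that $d_h < d_c < n_h d_h$, mirroring the DeepSeek-V2 ordering), not the model's actual code. It shows what is cached and checks the rank of the reconstructed keys.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes with d_h < d_c < n_h * d_h, as in DeepSeek-V2 (128 < 512 < 16384).
d, n_h, d_h, d_c, T = 64, 8, 4, 12, 50

W_DKV = rng.standard_normal((d, d_c))         # Step 1: down-projection
W_UK = rng.standard_normal((d_c, n_h * d_h))  # Step 2: key up-projection
W_UV = rng.standard_normal((d_c, n_h * d_h))  # Step 2: value up-projection

H = rng.standard_normal((T, d))  # hidden states for T tokens
C_kv = H @ W_DKV                 # (T, d_c): the only KV state that is cached
K_C = C_kv @ W_UK                # (T, n_h * d_h): reconstructed content keys
V_C = C_kv @ W_UV                # (T, n_h * d_h): reconstructed values

print("cached floats per token:", C_kv.shape[1], "vs MHA:", 2 * n_h * d_h)
print("rank of stacked keys:", np.linalg.matrix_rank(K_C))           # at most d_c
print("rank of head-0 keys:", np.linalg.matrix_rank(K_C[:, :d_h]))   # can reach d_h
```

The stacked keys never exceed rank `d_c`, while an individual head's slice can still be full rank, which is the joint constraint described above.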
Step 3: The query path
Symmetrically, and primarily to save training-time activations rather than KV cache, the query is also routed through a bottleneck:
$$\mathbf{c}_t^{Q} = h_t \, W^{DQ}, \qquad q_t^C = \mathbf{c}_t^{Q} \, W^{UQ}$$
$$W^{DQ} \in \mathbb{R}^{5120 \times 1536}, \qquad W^{UQ} \in \mathbb{R}^{1536 \times 16384}$$
The query bottleneck $d_c' = 1536$ is wider than the KV bottleneck because queries are not cached. There is no inference benefit from making them narrower. The reason to compress them at all is parameter and activation memory during training.
$d_c = 512$ is sized for cache miniaturization. $d_c' = 1536$ is sized for representational room. They are independent design knobs and DeepSeek-V2 chose them to be quite different.
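As a rough check on the parameter side of that claim, the sketch below counts weights for the content query path only, using the shapes from the notation table. It ignores the RoPE query projection introduced in §7 and is plain arithmetic, not a statement about how any implementation lays out its weights.

```python
d, n_h, d_h, d_cq = 5120, 128, 128, 1536  # DeepSeek-V2 values from the notation table

dense_params = d * n_h * d_h                     # a single full-width W^Q
bottleneck_params = d * d_cq + d_cq * n_h * d_h  # W^DQ followed by W^UQ

print(dense_params)                      # 83886080
print(bottleneck_params)                 # 33030144
print(bottleneck_params / dense_params)  # 0.39375
```

Under these shapes the bottlenecked query path carries about 39% of the weights of a dense projection, and the intermediate activation is 1536 wide instead of 16,384.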
Part 7: The RoPE Complication and the Decoupled Fix
So far the story is clean: cache a latent, up-project on demand, profit. But all modern transformers, DeepSeek-V2 included, use rotary position embeddings, and RoPE is exactly the kind of thing that ruins clean factorizations.
Why naive RoPE breaks MLA
RoPE applies a position-dependent rotation matrix $\mathcal{R}_t$ to the query and key after they are projected. The attention dot product becomes:
$$\langle \mathcal{R}_t q_t,\, \mathcal{R}_s k_s \rangle \;=\; q_t^\top \mathcal{R}_t^\top \mathcal{R}_s \, k_s \;=\; q_t^\top \mathcal{R}_{s-t}\, k_s$$
That last simplification, the rotation depending only on the relative position $s-t$, is the whole reason RoPE works. But now imagine we tried to apply RoPE to our reconstructed key $k_s = \mathbf{c}_s^{KV} W^{UK}$. The rotation acts on the key itself, after the up-projection, and $\mathcal{R}_s$ depends on $s$, the token's actual position. Different tokens use different rotations, so the only rotated object we could cache is the full rotated reconstruction per token, which defeats the compression.
The deeper algebraic problem: in §8 we will want to absorb $W^{UK}$ into $W^Q$. But if RoPE applies between them, the per-head score becomes $\mathbf{c}_t^{Q} W^{UQ} \mathcal{R}_t^\top \mathcal{R}_s W^{UK\top} \mathbf{c}_s^{KV\top}$ under the row-vector convention, and the position-dependent $\mathcal{R}_t^\top \mathcal{R}_s$ sitting between the two weight matrices blocks any precomputed absorption.
RoPE and low-rank absorption are fundamentally incompatible when applied to the same vector. MLA’s fix is to give RoPE its own, separate vector.
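The relative-position property, and the reason a rotated key cannot be shared across positions, can be seen numerically. The `rope` function below is a minimal sketch that rotates consecutive pairs of dimensions; the pairing layout and the base of 10000 are common choices assumed here for illustration, and real implementations differ in how they interleave dimensions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same offset s - t = 4 at two different absolute positions gives the same score.
score_a = rope(q, 3) @ rope(k, 7)
score_b = rope(q, 10) @ rope(k, 14)
print(np.isclose(score_a, score_b))  # True

# But the rotated key itself depends on its absolute position s.
print(np.allclose(rope(k, 7), rope(k, 14)))  # False
```

The score is position-relative, but each rotated key is position-specific, so no single fixed matrix can stand in for the rotation inside an absorbed product.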
The decoupled RoPE construction
Each token gets two key vectors per head:
- A content key $k_t^C$ (no RoPE), reconstructed from the cached latent as in §6.
- A small RoPE key $k_t^R$, a separate, narrow tensor produced directly from $h_t$, to which RoPE is applied. Crucially, it is shared across all heads.
And on the query side, the rotation also lives on its own piece, but here it is per-head, since queries are not cached:
$$q_{t,i}^R \;=\; \text{RoPE}\!\left(\mathbf{c}_t^Q \, W_i^{QR}\right), \qquad W^{QR} \in \mathbb{R}^{d_c' \times n_h d_h^R}$$
The final per-head query and key are the concatenation of the content part and the RoPE part:
$$q_{t,i} = [\,q_{t,i}^C \,;\, q_{t,i}^R\,] \in \mathbb{R}^{d_h + d_h^R}, \qquad k_{t,i} = [\,k_{t,i}^C \,;\, k_t^R\,] \in \mathbb{R}^{d_h + d_h^R}$$
Note: $k_t^R$ has no head subscript. Every head’s RoPE-key part is the same vector. This is exactly an MQA-style sharing, isolated to the 64-wide RoPE tail.
Attention is then computed on the concatenated vectors:
$$\text{score}_{t,s,i} \;=\; \frac{q_{t,i}^\top k_{s,i}}{\sqrt{d_h + d_h^R}} \;=\; \frac{\underbrace{q_{t,i}^{C\top} k_{s,i}^C}_{\text{from latents}} \;+\; \underbrace{q_{t,i}^{R\top} k_s^R}_{\text{from RoPE pair}}}{\sqrt{d_h + d_h^R}}$$
The dot product splits cleanly into a content term and a RoPE term, exactly because we concatenated rather than added. The content term will get the absorption treatment in §8. The RoPE term stands alone and is computed directly from the (small) cached $k_s^R$.
The output is then the usual per-head weighted sum of $v_{s,i}^C$ (the value has no RoPE, it never did), and the heads are concatenated and pushed through $W^O$ as usual:
$$o_{t,i} = \sum_s \alpha_{t,s,i} \, v_{s,i}^C, \qquad u_t = [o_{t,1}; \ldots; o_{t,n_h}] \, W^O$$
The KV cache now holds $\mathbf{c}_t^{KV}$ (512 floats) plus $k_t^R$ (64 floats) per token per layer. Total: 576 floats, or 1,152 bytes in BF16. That is the headline 57x reduction versus MHA’s 32,768 floats.
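Two claims from this section are easy to verify numerically: that concatenation makes the score split into a content term plus a RoPE term, and that the per-token cache comes to 576 floats. The sketch below uses random stand-in vectors for the already-rotated RoPE parts and toy widths for the first check; only the second half uses the DeepSeek-V2 sizes from the notation table.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_hR = 8, 4  # toy widths for the split check
q_C, k_C = rng.standard_normal(d_h), rng.standard_normal(d_h)
q_R, k_R = rng.standard_normal(d_hR), rng.standard_normal(d_hR)  # stand-ins for rotated parts

q = np.concatenate([q_C, q_R])
k = np.concatenate([k_C, k_R])
scale = np.sqrt(d_h + d_hR)

score_concat = q @ k / scale
score_split = (q_C @ k_C + q_R @ k_R) / scale
print(np.isclose(score_concat, score_split))  # True

# Per-token, per-layer cache at DeepSeek-V2 sizes (counted in elements).
mla_floats = 512 + 64        # latent c^KV plus the shared RoPE key
mha_floats = 2 * 128 * 128   # keys and values for 128 heads of width 128
print(mla_floats, mla_floats * 2, round(mha_floats / mla_floats, 1))  # 576 1152 56.9
```

Had the RoPE part been added to the content key instead of concatenated, the cross terms would not vanish and the content term could not be isolated for absorption.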
Part 8: The Absorption Trick
We have reduced the cache from 32,768 to 576 floats per token. But the up-projections $W^{UK}$ and $W^{UV}$ are huge ($512 \times 16{,}384$ each), and computing them per attention step looks alarming. The trick is that we never compute them at inference time. We absorb them into the surrounding matrices.
Absorbing $W^{UK}$ into the query
Look at one head’s content-attention term:
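$$q_{t,i}^{C} \, \big(k_{s,i}^{C}\big)^{\top} \;=\; \big(\mathbf{c}_t^Q W_i^{UQ}\big) \big(\mathbf{c}_s^{KV} W_i^{UK}\big)^{\top} \;=\; \mathbf{c}_t^{Q} \, \underbrace{W_i^{UQ} W_i^{UK\top}}_{\widetilde{W}_i^Q \,\in\, \mathbb{R}^{d_c' \times d_c}} \, \mathbf{c}_s^{KV\top}$$
Here the inner product is written out with the row-vector convention from the notation section, so $W_i^{UQ} \in \mathbb{R}^{d_c' \times d_h}$ and $W_i^{UK} \in \mathbb{R}^{d_c \times d_h}$ are the column blocks belonging to head $i$. The product $\widetilde{W}_i^Q = W_i^{UQ} W_i^{UK\top}$ is two parameter matrices multiplied together. It depends on no input, so we precompute it once. The score becomes a single bilinear form between the query latent (computed fresh for the current token, width 1536) and the cached key latent (width 512).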
After absorption, “computing the content key” disappears. The query latent talks to the key latent directly through a precomputed bridge matrix. The cache really is the key.
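The identity is quick to confirm numerically. The following is a minimal single-head sketch with random weights and toy sizes; variable names such as `W_tilde_Q` and `q_latent` are illustrative, not taken from any codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
d_cq, d_c, d_h, T = 24, 16, 8, 5  # toy sizes

W_UQ_i = rng.standard_normal((d_cq, d_h))  # query up-projection, head i
W_UK_i = rng.standard_normal((d_c, d_h))   # key up-projection, head i
c_q = rng.standard_normal(d_cq)            # query latent for the new token
C_kv = rng.standard_normal((T, d_c))       # cached KV latents, one row per past token

# Naive path: rebuild every past content key, then dot with the query.
q_C = c_q @ W_UQ_i
K_C = C_kv @ W_UK_i
scores_naive = K_C @ q_C

# Absorbed path: fold W_UK into the query side once, offline.
W_tilde_Q = W_UQ_i @ W_UK_i.T      # (d_cq, d_c), depends on parameters only
q_latent = c_q @ W_tilde_Q         # (d_c,), the query expressed in latent space
scores_absorbed = C_kv @ q_latent  # content keys are never materialized

print(np.allclose(scores_naive, scores_absorbed))  # True
```

Per decode step, the absorbed path touches each cached latent once with a `d_c`-wide dot product and never allocates the `(T, d_h)` key matrix.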
Absorbing $W^{UV}$ into the output projection
The value side gets the same treatment, but from the other end. The per-head output is a weighted sum of $v_{s,i}^C = \mathbf{c}_s^{KV} W_i^{UV}$, which then gets multiplied by $W_i^O$:
$$u_t \;=\; \sum_{i=1}^{n_h} o_{t,i} \, W_i^O, \qquad o_{t,i} \, W_i^O \;=\; \sum_s \alpha_{t,s,i} \, \mathbf{c}_s^{KV} \, \underbrace{W_i^{UV} W_i^O}_{\widetilde{W}_i^O \,\in\, \mathbb{R}^{d_c \times d}}$$
where $W_i^O \in \mathbb{R}^{d_h \times d}$ is the block of rows of $W^O$ belonging to head $i$. Again, the bracketed matrix is parameter-only: precompute and store. At runtime, the attention weights multiply directly against the cached $\mathbf{c}_s^{KV}$, and the result projects directly to the output dimension. $v$ is never materialized.
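The value-side absorption can be checked the same way. This is again a toy single-head sketch with random weights; `alpha` is a normalized random vector standing in for the softmax weights, and `W_O_i` stands for head $i$'s row block of $W^O$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_c, d_h, T = 32, 16, 8, 5  # toy sizes

W_UV_i = rng.standard_normal((d_c, d_h))  # value up-projection, head i
W_O_i = rng.standard_normal((d_h, d))     # head i's slice of the output projection
C_kv = rng.standard_normal((T, d_c))      # cached KV latents
alpha = rng.random(T)
alpha /= alpha.sum()                      # stand-in for softmax attention weights

# Naive path: rebuild values, mix them, then project out.
out_naive = (alpha @ (C_kv @ W_UV_i)) @ W_O_i

# Absorbed path: mix the latents directly, then apply one precomputed matrix.
W_tilde_O = W_UV_i @ W_O_i                # (d_c, d), depends on parameters only
out_absorbed = (alpha @ C_kv) @ W_tilde_O

print(np.allclose(out_naive, out_absorbed))  # True
```

Because the weighted sum over positions is linear, it commutes with the up-projection, which is what lets the attention weights act on the latents directly.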
Subtle point: the absorption changes the effective compute layout. The per-head $\widetilde{W}^Q$ matrices are dense and head-specific, so you do not get a literal FLOPs reduction in the obvious places, but you do collapse what was a stream of large reconstructions into a single small inner product. The combined effect (small cache plus small inner product) is what makes long-context decoding cheap.
The RoPE part is computed the old way: the cached $k_s^R$ is dotted with the freshly computed $q_{t,i}^R$. It is small (64 wide) and outside the absorbed factorization, which is exactly why we paid the cost of decoupling it.
Part 9: DeepSeek-V2 by the Numbers
Plug the actual hyperparameters in.
Figure: Cache Size vs Context Length. Per-layer KV cache footprint as a function of context length, for batch=1 in fp16.
At 128K context, the kind of regime DeepSeek-V2 was designed for, each MLA layer holds about 144 MB of cache, versus around 8 GB for naive MHA. Multiply across DeepSeek-V2’s 60 layers and that is roughly 8.4 GB versus 480 GB per sequence: the difference between a context that fits and one that does not.
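The same numbers can be reproduced for any context length with a short script. This is a minimal sketch using the per-token element counts derived earlier (32,768 for MHA, 576 for MLA) and 2 bytes per element; the function name is illustrative.

```python
def per_layer_cache_bytes(floats_per_token, seq_len, bytes_per_elem=2):
    """Per-layer KV cache for one sequence, in bytes."""
    return floats_per_token * seq_len * bytes_per_elem

MHA_FLOATS = 2 * 128 * 128  # keys + values, 128 heads x 128 dims
MLA_FLOATS = 512 + 64       # latent + shared RoPE key
N_LAYERS = 60               # DeepSeek-V2 layer count from the text

for ctx_k in (4, 32, 128):
    T = ctx_k * 1024
    mha = per_layer_cache_bytes(MHA_FLOATS, T)
    mla = per_layer_cache_bytes(MLA_FLOATS, T)
    print(f"{ctx_k:>4}K  MHA {mha / 2**20:7.0f} MiB/layer  MLA {mla / 2**20:6.1f} MiB/layer  "
          f"all layers: {N_LAYERS * mha / 2**30:6.1f} vs {N_LAYERS * mla / 2**30:5.2f} GiB")
```

Both curves are linear in context length; MLA only changes the slope, which is why the question of caching fewer tokens remains open for Part 2.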
Part 10: Comparison and Takeaways
| Method | K/V structure | Cache per token per layer (elements) | Quality | Notes |
|---|---|---|---|---|
| MHA | Distinct K, V per head | $2 \cdot n_h \cdot d_h$ | Best | Baseline; expensive cache |
| MQA | One K, V shared by all heads | $2 \cdot d_h$ | Noticeably worse | Used in PaLM-style models |
| GQA | K, V per group of heads | $2 \cdot n_g \cdot d_h$ | $\approx$ MHA when $n_g$ tuned | Llama, Mistral, Qwen |
| MLA | Latent + decoupled RoPE | $d_c + d_h^R$ | $\approx$ MHA, sometimes better | DeepSeek-V2/V3; absorbs at inference |
What is actually new
MLA is not the first low-rank attention proposal. Linformer (2020), Performer, and various Nyström approximations long predate it. What makes MLA practical and distinct is:
- The low-rank object is the cache, not the attention pattern. Earlier work approximated the attention matrix itself. MLA keeps softmax attention exact and only compresses what gets stored across timesteps.
- The absorption trick keeps inference compute bounded. Without §8, MLA would just be a memory-vs-compute tradeoff. With it, both move in the right direction at once for the autoregressive decoding regime.
- The decoupled-RoPE construction shows how to make a low-rank cache compatible with rotary embeddings. This is the part most easily glossed over but is what makes the technique deployable in modern transformer stacks. Future low-rank schemes that ignore positional embeddings are nice in a paper and broken in practice. MLA paid the engineering tax.
The cost
MLA is not free. The decoupled RoPE adds parameters and a slightly more elaborate forward path. The bilinear absorbed matrices $\widetilde{W}_i^Q$ are larger than the raw $W^Q$ pieces would be. For very short context inference, plain GQA may be faster wall-clock, because the absorbed matmul is the dominant cost rather than memory bandwidth. MLA’s wins compound with context length and batch size, exactly the regimes that matter when serving a frontier model.
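To put a number on the size of the absorbed matrices, the sketch below compares, for one head, the factored pair $W_i^{UQ}, W_i^{UK}$ against the precomputed product $\widetilde{W}_i^Q$ at the DeepSeek-V2 shapes from the notation table. It is shape arithmetic only and makes no claim about how a particular inference engine stores or applies these weights.

```python
d_cq, d_c, d_h, n_h = 1536, 512, 128, 128  # DeepSeek-V2 values from the notation table

factored = d_cq * d_h + d_c * d_h  # W_i^UQ and W_i^UK kept as separate matrices
absorbed = d_cq * d_c              # the precomputed product for the same head

print(factored, absorbed, absorbed / factored)  # 262144 786432 3.0
print(n_h * absorbed)                           # 100663296 (all heads)
```

At these shapes the fully absorbed form is three times larger per head than the factored pair, which is the extra weight traffic that short-context decoding has to pay for.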
In the long-context, large-batch regime that production inference cares about, MLA pulls the KV-cache lever further than any other published trick, while keeping standard softmax attention semantics intact.
If you remember three things
- The cache is a low-rank latent $\mathbf{c}^{KV}$, not K and V. K and V are reconstructed by up-projections that you never actually run at inference.
- RoPE rides on a tiny separate vector that bypasses the latent compression, so the rotation can stay position-aware without contaminating the absorbable content path.
- The whole construction is engineered around the fact that during decoding, you do not need K and V themselves. You need attention scores and the weighted sums they produce. Computing those requires only the information they depend on, and the latent is sized to carry exactly that.
What comes next
Part 1 ended at the point where each cached token is about as small as it gets. The next pressure point is not the size of each token but the number of tokens you carry. Once your model can hold 1M tokens of context, do you really need to read every one of them at every decode step?
Part 2 will pick up there. Three different bets on how to answer that question shipped in 2025: sparse top-k selection driven by a cheap relevance scorer (DSA, deployed in DeepSeek V3.2), three-branch sparsification that is trainable from scratch (NSA, ACL 2025 best paper), and mixture-of-block routing that retrofits existing dense checkpoints (MoBA, powering Kimi K2’s 1M context). A different bet entirely is the linear-attention hybrids that change the math so no $O(T^2)$ matrix ever materializes (Lightning Attention, in MiniMax M1). The synthesis at the end is DeepSeek V4-Pro, released last month, where every thread we have followed so far gets composed at once into a 61-layer stack that runs at 2% of GQA’s cache footprint.
Part 2 is in progress. Check back soon.
References
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017. The original Transformer paper. The baseline MHA that every variant in this post is reacting to.
- Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. The MQA paper. Concise and worth reading in full; the whole argument is six pages.
- Ainslie, J. et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. The GQA paper. Includes the mean-pooling conversion recipe that made GQA easy to adopt.
- DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. The MLA paper. Section 2.1.3 has the decoupled RoPE derivation; almost everyone missed it the first time around.
- Meng, F. et al. (2025). TransMLA: Multi-Head Latent Attention Is All You Need. Proves that MLA has strictly greater expressive power than GQA at the same cache budget, and gives a conversion recipe from GQA to MLA.
References for sparse attention (DSA, NSA, MoBA), linear hybrids (Lightning Attention), and the V4-Pro report will appear in Part 2.