MdJawad

Rotary Positional Encoding: Why Position Is a Rotation

Sat, 23 May 2026 12:00:00 +0800

A trick hiding in plain sight

In 2021 a small idea slipped into the transformer with almost no fanfare. A paper called RoFormer proposed encoding a token’s position not by adding something to it, but by rotating it. The idea, Rotary Positional Encoding (RoPE), spread quickly. Within two years it had become the default in nearly every serious open model, and in closed ones such as PaLM: GPT-NeoX, LLaMA and its descendants, Mistral, Qwen, DeepSeek, Gemma. If you have used a modern LLM, you have used RoPE.

It is also the quiet reason your model can sometimes read a whole book. Almost any conversation about long context, whether that means 128K windows, million-token prompts, or the needle-in-a-haystack test, eventually runs into RoPE. RoPE is the thing you have to stretch to make long context work, and the thing that breaks when you stretch it wrong.

This is where to start. Before we can talk about how models reach for longer and longer context, the subject of the posts that follow this one, we need to actually understand the small, elegant trick at the heart of it. By the end of this post, you should be able to look at this formula

$$\big(R(m\theta)\,q\big)^{\!\top}\big(R(n\theta)\,k\big) \;=\; q^{\top} R\!\big((n-m)\theta\big)\, k$$

and find it obvious.

We will keep one example in hand the whole way: the sentence “the dog chased the cat,” and in particular how the word chased should relate to the word dog.

The problem: attention sees an unordered bag

The attention mechanism has an uncomfortable property. By itself, it has no idea what order the words came in.

Attention works by comparing every token with every other token and taking weighted sums, and a sum does not care about order. Shuffle the inputs and every token gets back exactly the same output as before, just in the shuffled order. To raw attention, “the dog chased the cat” and “the cat chased the dog” are the same bag of vectors. One sentence has the dog doing the chasing and the other has it being chased, and attention cannot tell them apart.

That is a problem, because meaning lives in order. We have to inject position somehow, giving the model a way to know that in our sentence, chased sits one step after dog, and that this adjacency is part of what the sentence means.
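The order-blindness described above is easy to check numerically. The sketch below is a minimal single-head attention with no positional signal; the random embeddings, the weight matrices, and the helper names `softmax` and `self_attention` are all illustrative assumptions, not code from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
# Hypothetical random embeddings; both occurrences of "the" share one vector.
vocab = {w: rng.normal(size=d) for w in ["the", "dog", "chased", "cat"]}
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

sent_a = ["the", "dog", "chased", "the", "cat"]
sent_b = ["the", "cat", "chased", "the", "dog"]
out_a = self_attention(np.stack([vocab[w] for w in sent_a]), Wq, Wk, Wv)
out_b = self_attention(np.stack([vocab[w] for w in sent_b]), Wq, Wk, Wv)

# "chased" is at index 2 in both sentences; "dog" is at index 1 in A and index 4 in B.
same_for_chased = np.allclose(out_a[2], out_b[2])
same_for_dog = np.allclose(out_a[1], out_b[4])
print(same_for_chased, same_for_dog)
```

Each word receives the same output vector in both sentences, so nothing downstream can tell who chased whom.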

The obvious fix, and its hidden flaw

The original 2017 Transformer solved this the obvious way. It built a position vector out of sines and cosines and added it onto each word’s embedding, like stamping a timestamp onto a letter before you mail it. The token at position 0 gets stamp #0, position 1 gets stamp #1, and so on. The network is then left to untangle “content plus stamp” back into “content” and “where.”

It helps to see those sines and cosines directly. Each row below is one dimension oscillating at its own frequency, fast at the top and slow at the bottom. Move the slider, and the column the cursor marks out is the stamp that gets added to the word sitting at that position:

[Interactive figure: Sinusoidal positional encoding, sines and cosines stacked. Slider: position m. Warm cells are positive, cool cells negative. The column of cells stacked at the cursor is the position's encoding vector, the "stamp" added to whatever word sits there.]

Two things stand out. The stamp depends only on the position, never on the word underneath it. And neighbouring positions get very similar stamps, since the waves move smoothly. Hold that picture, because it is what rotation is about to improve on.

It works, but it carries two flaws that, once you see them, motivate everything RoPE does.

Flaw one: it smears content and position together. Adding a vector moves the point. A word’s embedding encodes its meaning; the stamp we add shifts that vector somewhere new, so the same word at two different positions becomes two genuinely different vectors, with different length and direction. Meaning and position now sit tangled in the same numbers, and the model has to spend capacity pulling them back apart.

Flaw two: it is absolute, but attention wants relative. The stamp records where a token sits counting from the start of the sequence. That is almost never what matters. What matters for chased is that dog is one token back, not that dog happens to be the second word in this particular sentence. Prepend the word “Yesterday,” to our sentence and every absolute position shifts by one, yet the relationship between chased and dog has not changed at all. Absolute encodings force the model to learn how to turn “position 2 versus position 3” into “one step apart,” and to relearn that for every pair of positions. It is work we should not have to do.

Keep both flaws in mind. RoPE fixes them at once, with a single geometric move.
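To make the first flaw concrete, here is a small sketch of the additive scheme. The function `sinusoidal_pe` follows the usual sine-on-even, cosine-on-odd layout of the 2017 Transformer; the embedding for "dog" is a random stand-in, and the dimension is an assumption chosen for illustration.

```python
import numpy as np

def sinusoidal_pe(position, d, base=10000.0):
    """Additive position vector: sin on even dimensions, cos on odd dimensions."""
    i = np.arange(d // 2)
    angles = position / base ** (2 * i / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
d = 16
dog = rng.normal(size=d)               # hypothetical content embedding for "dog"

dog_at_1 = dog + sinusoidal_pe(1, d)   # "the dog chased the cat"
dog_at_2 = dog + sinusoidal_pe(2, d)   # after prepending "Yesterday,"

# Same word, two positions: the vectors differ, and so do their lengths.
print(np.linalg.norm(dog), np.linalg.norm(dog_at_1), np.linalg.norm(dog_at_2))
```

The stamp itself always has the same length, but adding it to the content vector changes the length of the sum, which is the smearing the text describes.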

The insight: position is not a number you add, it’s a rotation you apply

There is another way to think about it. Instead of adding a position vector, what if we rotated the token’s vector by an angle that grows with its position?

Take the token’s vector and chop it into pairs of coordinates. Each pair is just a point on a plane, an arrow from the origin. To encode position $m$, spin that arrow by an angle $m\theta$: position 0 gets no turn, position 1 turns by $\theta$, and position $m$ turns by $m\theta$. In two dimensions this is exactly the rotation matrix from high-school geometry:

$$\begin{bmatrix}x'\\ y'\end{bmatrix} =\underbrace{\begin{bmatrix}\cos m\theta & -\sin m\theta\\[2pt] \sin m\theta & \;\;\cos m\theta\end{bmatrix}}_{R(m\theta)} \begin{bmatrix}x\\ y\end{bmatrix}$$

Drag the slider below and watch what happens to a single pair as its position climbs:

[Interactive figure: Rotating one pair of dimensions. Slider: position m. Readouts: angle m·θ, x′ (cos), y′ (sin), and length. The length readout stays pinned at 1.00 no matter how far you spin: position changes the direction, never the magnitude.]

Notice the one thing that never changes: the arrow’s length. A rotation changes direction, never magnitude, and that is what makes it a rotation. That single property already takes care of the first flaw. A word’s meaning lives in the length and shape of its vector, and spinning it leaves all of that alone. Position ends up written purely into the angle, kept separate from content.
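The 2-D rotation above fits in a few lines. In this sketch the helper name `rotate_pair`, the starting pair, and the value of θ are all chosen for illustration only.

```python
import numpy as np

def rotate_pair(xy, m, theta):
    """Rotate one 2-D coordinate pair by the angle m * theta."""
    c, s = np.cos(m * theta), np.sin(m * theta)
    R = np.array([[c, -s],
                  [s,  c]])
    return R @ xy

pair = np.array([0.6, 0.8])   # an arbitrary unit-length pair
theta = 0.5                   # assumed rotation speed, for illustration

for m in range(4):
    rotated = rotate_pair(pair, m, theta)
    # The direction changes with m; the length does not.
    print(m, rotated, np.linalg.norm(rotated))
```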

Length preservation is only the warm-up, though. The real payoff is what rotation does to the dot product, and to see it we need one fact about how attention scores tokens.

Why a dot product only feels the angle between

Attention decides how much chased should attend to dog by taking the dot product of chased’s query vector with dog’s key vector. A big dot product means strong attention.

The dot product has a clean geometric meaning. For any two vectors $q$ and $k$,

$$q^{\top} k \;=\; \|q\|\,\|k\|\,\cos\phi,$$

where $\phi$ is the angle between them. The lengths $\|q\|$ and $\|k\|$ are fixed properties of the two words, a kind of loudness. Everything about how the two words relate is carried by that single $\cos\phi$ term. Vectors pointing the same way score high ($\cos 0 = 1$), perpendicular vectors score zero, and opposed vectors score negative.

So an attention score is really a question about the angle between two arrows. That is the sentence to hold onto. If position is an angle, and attention only responds to angles, then position and attention are speaking the same language.

The magic: relative position, for free

Now put the two halves together. Rotate chased’s query by its position $m\theta$ and dog’s key by its position $n\theta$. What is the angle between them afterwards?

Rotating one vector by $m\theta$ and comparing it against another rotated by $n\theta$ composes into a single rotation by the difference. Writing it out with the matrices, and using the identity $R(a)^{\top}R(b) = R(b-a)$, the rotated dot product collapses:

$$\big(R(m\theta)\,q\big)^{\!\top}\big(R(n\theta)\,k\big) \;=\; q^{\top}\,R(m\theta)^{\top}R(n\theta)\,k \;=\; q^{\top}\,R\!\big((n-m)\theta\big)\,k.$$

Look at the right-hand side. The two absolute positions $m$ and $n$ are gone. Only their difference $n-m$ remains. The attention score between chased and dog depends only on how far apart they are, not on where the pair happens to sit in the sentence.

That takes care of the second flaw. We never asked the model to learn how to convert absolute positions into relative ones; the geometry does it on its own. Relative position falls out of the structure for nothing, with zero parameters spent.
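The identity can be verified numerically for a single pair. This is a sketch under assumed values: the query and key are random stand-ins, θ is arbitrary, and the positions follow the running example with chased at position 2 and dog at position 1.

```python
import numpy as np

def R(angle):
    """2-D rotation matrix for the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)   # stand-in query and key (one pair)
theta = 0.3                                     # assumed rotation speed

def rope_score(m, n):
    """Dot product of the query rotated to position m and the key rotated to position n."""
    return (R(m * theta) @ q) @ (R(n * theta) @ k)

m, n = 2, 1                              # chased at position 2, dog at position 1
lhs = rope_score(m, n)
rhs = q @ R((n - m) * theta) @ k         # right-hand side of the identity
shifted = rope_score(m + 1, n + 1)       # prepend one token: both positions move by one

print(np.isclose(lhs, rhs), np.isclose(lhs, shifted))
```

Both checks agree up to floating-point error: the score matches the closed form, and shifting both positions leaves it unchanged.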

Try it. Move the query and key positions on their own, then press “shift both +1,” which is the same as prepending “Yesterday,” to the sentence. Both arrows spin, but the offset between them, and the score, stay put:

[Interactive figure: Query · Key dot product vs. position. Sliders: query position m, key position n. Readouts: relative offset m − n, angle between, and attention score q · k = cos(Δ). The score is glued to m − n. Shifting both positions sends the arrows spinning, yet the number never flinches. That is relative position, for free, with nothing learned.]

This is the payoff that made RoPE win. Shift the whole sentence and chased’s relationship to dog survives intact, because that relationship was stored as the angle between the two vectors, and shifting both simply rotates them together. One thing the demo quietly does, though, is hold the two words’ content fixed so the positional effect stands on its own. In a real attention head the content match is the dominant, learned signal, and rotation only modulates it. The next section puts content back in.

How rotation fits inside attention

It is easy to leave that demo thinking the rotation is the whole story. It is not. To see why, it helps to look at where RoPE actually sits inside an attention head, and at how much it leaves alone.

[Figure: Where RoPE sits in one attention head. RoPE rotates only the query and key. The value vector, the softmax, and the weighted sum that follows are exactly as they were. Position is a small insertion into the score, not a new attention mechanism.]

Attention scores a query against a key with a dot product, and before any position is involved that dot product is pure content matching. Think of it as a tiny search engine running in every layer. Each token sends out a query, a search for what it wants (“I am a verb, I am looking for my subject”). Every other token offers a key, an advertisement for what it is (“I am a noun, I could be a subject”). The dot product scores how well the advertisement answers the search. All of this is learned, by the projection matrices $W^Q$ and $W^K$, and it is the main event. It is what lets chased know it wants nouns and not commas.

RoPE does not touch any of that. It rotates the query and key after they are built, and because a rotation changes direction but not length, the learned content survives untouched. For a single pair of dimensions the score works out to

$$\text{score} \;\approx\; \underbrace{\|q\|\,\|k\|}_{\text{how strong}}\;\cos\big(\,\underbrace{\alpha}_{\text{content}} \,+\, \underbrace{(m-n)\,\theta}_{\text{position}}\,\big),$$

where $\alpha$ is the angle between the two words’ content directions, measured as a signed angle from the key to the query so that its sign lines up with $m-n$ (query at position $m$, key at position $n$). Read the cosine as taking in two things at once. The content angle $\alpha$ is small when the words genuinely match and large when they do not. The positional term $(m-n)\theta$ is a fixed turn that depends only on how far apart they are. The score is highest when the content matches and the relative distance is one the head cares about. A perfect content match at an unwanted distance gets pulled down, and a favoured distance with mismatched content still scores low. Both have to agree.
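For one pair of dimensions the formula can be checked against the rotated dot product directly. The sketch assumes the query sits at position m and the key at position n, and takes α as the signed angle from the key's direction to the query's; that sign convention is an assumption made here so the positional term reads (m − n)θ. The vectors and θ are illustrative.

```python
import numpy as np

def R(angle):
    """2-D rotation matrix for the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

q = np.array([1.0, 0.5])   # hypothetical content query (one pair)
k = np.array([0.8, 0.9])   # hypothetical content key (one pair)
theta = 0.3                # assumed rotation speed

# Signed content angle, measured from the key's direction to the query's.
alpha = np.arctan2(q[1], q[0]) - np.arctan2(k[1], k[0])

def rotated_dot(m, n):
    """Score computed the long way: rotate both vectors, then take the dot product."""
    return (R(m * theta) @ q) @ (R(n * theta) @ k)

def closed_form(m, n):
    """Score from the formula: strength times cos(content angle + positional turn)."""
    return np.linalg.norm(q) * np.linalg.norm(k) * np.cos(alpha + (m - n) * theta)

for m, n in [(2, 1), (7, 3), (0, 5)]:
    print(m, n, rotated_dot(m, n), closed_form(m, n))
```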

Our running example needs both halves. Content alone tells chased to look at nouns, but dog and cat are equally nouns, so content cannot say which is the subject. Position alone would prefer whatever sits one step back, but it cannot tell a noun from a comma. Put them together and chased attends to dog because dog is both a content match and at the relative offset the head has learned to read as a subject. RoPE did not pick dog. Content narrowed the field to nouns, and rotation broke the tie by distance.

And everything downstream is left alone. The value vectors are never rotated, the softmax is the same softmax, and the weighted sum that produces the head’s output is unchanged. RoPE is a small, surgical edit to one quantity, the query-key score, not a new attention mechanism.

One frequency isn’t enough: a clock with many hands

A single rotation speed has a catch, because a circle wraps around. If every pair spun at the same rate $\theta$, two positions a full turn apart would land in the same place, and you could not tell them apart. One spinning hand cannot tell you the time on its own.

A clock fixes this with several hands at different speeds. The second hand resolves fine detail, the hour hand tracks the long sweep, and together they pin down a single moment. RoPE does the same thing, giving each coordinate pair $i$ its own rotation speed:

$$\theta_i = b^{-2i/d}, \qquad b = 10000, \qquad i = 0, 1, \dots, \tfrac{d}{2}-1.$$

The first pairs spin fast, so they resolve fine, local distances like “one token apart.” The last pairs spin slowly, tracking coarse, long-range position across thousands of tokens. Stack them all and every position gets a unique multi-frequency fingerprint, with no wraparound ambiguity across the range that matters.

Watch the bank of dials below. The leftmost races while the rightmost barely moves.

[Interactive figure: A bank of rotating pairs, fast → slow. Each dial is one dimension-pair. Slider: position m. Drag slowly: the leftmost dial races around while the rightmost barely budges. (For clarity the visual uses a gentler base than the real 10000, but the principle is identical.)]

The full RoPE rotation is just this whole set of 2-D rotations stacked into one large block-diagonal matrix, with each pair of dimensions spun at its own frequency. Conceptually it is the picture above, repeated $d/2$ times.
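In code, the block-diagonal matrix is rarely built explicitly, because each pair can be rotated elementwise. The following is a minimal sketch, assuming consecutive coordinates (0,1), (2,3), ... form the pairs; some implementations instead pair dimension i with dimension i + d/2, which changes the layout but not the idea. The names `rope_frequencies` and `apply_rope` are hypothetical.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1, fastest pair first."""
    i = np.arange(d // 2)
    return base ** (-2.0 * i / d)

def apply_rope(x, position, base=10000.0):
    """Rotate each consecutive coordinate pair of a 1-D vector by position * theta_i."""
    angles = position * rope_frequencies(x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin   # x' of each pair
    out[1::2] = x_even * sin + x_odd * cos   # y' of each pair
    return out

rng = np.random.default_rng(0)
d = 64
q, k = rng.normal(size=d), rng.normal(size=d)

near_start = apply_rope(q, 5) @ apply_rope(k, 3)
far_along = apply_rope(q, 1005) @ apply_rope(k, 1003)
print(np.isclose(near_start, far_along))   # same offset, same score
```

The elementwise form costs a handful of multiplies per coordinate, which is why the rotation adds almost nothing to the compute budget.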

Two things are worth noticing now, because they come back when we talk about long context. The base $b = 10000$ is a knob you can turn. And the fast pairs are the ones that wrap around soonest.

A free locality prior: nearby leans in, distant fades

The many frequencies do one more thing, almost by accident. When you add up the cosines across all the pairs, they reinforce at zero distance and start to interfere as the distance grows. The result is that the raw attention score between two identical vectors is high when they sit close together and decays, with a gentle ripple, as they move apart.

$$\text{score}(\Delta)=\frac{1}{d/2}\sum_{i} \cos\big(\Delta\,\theta_i\big), \qquad \Delta = m - n.$$

Slide the dimension count and watch the curve: more frequencies, smoother decay, sharper peak at $\Delta = 0$.

[Interactive figure: Score vs. relative distance (matched q = k). Slider: dimensions d. More frequencies give a smoother decay. The curve peaks at Δ = 0, then settles toward zero for far-apart tokens: a soft "pay more attention to what's near" prior, with no parameters spent.]

So RoPE quietly hands the model a sensible default, namely “pay more attention to what is near,” without spending a single parameter on it. The model can override that prior when it needs to reach far, but it starts from a reasonable place.
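The decay curve can be reproduced from the formula above. This sketch evaluates the average of the cosines at a few distances; the function name `locality_score` and the sample distances are illustrative choices.

```python
import numpy as np

def locality_score(delta, d, base=10000.0):
    """Mean of cos(delta * theta_i) over the d/2 pairs, as in the formula above."""
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)
    return float(np.mean(np.cos(delta * theta)))

d = 32
for delta in [0, 1, 4, 16, 64, 256]:
    # The score is exactly 1 at delta = 0 and smaller at every other distance,
    # with a ripple on the way down rather than a strictly monotone fall.
    print(delta, round(locality_score(delta, d), 3))
```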

Adding moves the point; rotating keeps it honest

We can now see, side by side, why rotation beats addition. Sinusoidal encoding adds a position vector, so the point drifts off its circle: its length changes, and content gets tangled up with position. RoPE rotates instead, so the point glides along its circle, its length perfectly preserved and position kept separate from meaning.

[Interactive figure: Same embedding, two ways to add position. Left: add sinusoidal, E + PE(m). Right: rotate with RoPE, R(mθ)·E. Slider: position m. Readouts: ‖E + PE‖ (adding) and ‖R·E‖ (rotating). The left length wobbles as position changes, which is the word's meaning being disturbed. The right length is rock-steady.]

Sweep the position. On the left the length wobbles, which is the word’s meaning being disturbed as it moves through the sentence. On the right it stays rock-steady. Same goal of encoding position, very different treatment of the content.

Why it won

Pull back, and the list of advantages is long for something so simple:

Relative position, for free. The attention dot product depends only on $m-n$. The model never has to learn to subtract positions, because it is guaranteed by construction.
Meaning stays intact. Rotation preserves length, so a token’s content is not corrupted by where it sits, unlike additive encodings, which blur the two together.
Applied where it matters. RoPE rotates the queries and keys inside every attention layer, right where the comparison happens, instead of being bolted once onto the input embedding and left to fade.
Zero extra parameters. It is a fixed geometric operation. There is nothing to train, almost nothing to compute, and it composes cleanly with efficient-attention kernels like FlashAttention.
A built-in locality prior. Scores naturally taper with distance, a free and sensible default.
It stretches. Because it encodes relative distance through smooth, tunable frequencies, RoPE can be rescaled to longer sequences far more gracefully than anything before it, which is the whole reason it underpins modern long-context models.

That last point is where this post ends and the next one begins.

The bridge: from rotation to long context

This last point is also where the trouble starts. The same frequency structure that makes RoPE so elegant puts a hard ceiling on context length.

A model trained with a context window of, say, 4K tokens has only ever seen rotation angles up to $4096 \cdot \theta_i$ for each pair. The fast pairs will have swept through their whole range many times inside those 4K tokens, while the slow pairs have turned only a fraction of a circle. The network has learned to read positions inside that envelope of angles, and nowhere else.

Now feed it a 100K-token prompt at inference. The slow pairs, which covered only a fraction of a circle during training, are suddenly pushed to phase angles the model has never seen. As far as the network is concerned, the positions have gone out of distribution. Attention destabilizes, and quality falls off a cliff long before the prompt ends.

This is why context extension has become its own discipline. Every major technique is, underneath, a way of manipulating the angles and frequencies we just built up. Position Interpolation squeezes the positions back into the trained range. NTK-aware scaling turns that base-frequency knob $b$ up, which slows the slow pairs further while leaving the fastest pairs almost untouched. YaRN interpolates each frequency band differently. None of them make much sense until you can see position as a rotation, which you now can.

That is the subject of the next post in this series.
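As a preview, the angle envelope can be computed directly. The sketch below counts how many full turns each pair completes inside a 4K training window, which shows which pairs still have unseen angles left, and then applies two of the remedies in their simplest form. The head dimension of 128 and the larger base of 500,000 are assumed values for illustration, and the Position Interpolation line is a bare rescaling of positions, not a full implementation.

```python
import numpy as np

d, base = 128, 10000.0                  # assumed head dimension and the standard base
train_len, long_len = 4096, 100_000     # the 4K window and 100K prompt from the text

i = np.arange(d // 2)
theta = base ** (-2.0 * i / d)

# Full turns each pair completes inside the training window.
turns_in_training = train_len * theta / (2 * np.pi)

# Pairs that never finish one turn during training still have unseen angles left,
# so a longer prompt pushes them somewhere new.
unseen_pairs = np.flatnonzero(turns_in_training < 1.0)

# Position Interpolation (sketch): rescale positions so long_len maps onto train_len.
scale = train_len / long_len
max_angle_trained = train_len * theta
max_angle_interpolated = (long_len * scale) * theta

# Raising the base (sketch): pair 0 keeps theta = 1, the slowest pairs slow down the most.
theta_big_base = 500_000.0 ** (-2.0 * i / d)
slowdown = theta / theta_big_base

print("turns completed by the fastest pair:", round(float(turns_in_training[0]), 1))
print("pairs with unseen angles past 4K:", unseen_pairs.min(), "to", unseen_pairs.max())
print("slowdown, fastest vs slowest pair:", slowdown[0], round(float(slowdown[-1]), 1))
```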

RoPE also shows up in surprising places elsewhere on this blog. The evolution of attention post shows how DeepSeek’s Multi-head Latent Attention has to do some delicate surgery, a decoupled form of RoPE, to stay compatible with rotary embeddings while compressing the KV cache. The state-space models post shows Mamba-3 reusing the same rotary machinery with data-dependent angles. The idea travels a long way.

Lessons for builders

A few takeaways that generalize beyond RoPE:

Positional information wants to be relative. When you catch yourself making a model re-derive the same relationship at every absolute offset, look for a representation where that relationship is built in rather than learned.
Magnitude-preserving operations keep signals clean. Rotation works partly because it refuses to touch the content’s length. When you have to inject one kind of information into a vector that already carries another, prefer transforms that leave the existing signal undisturbed.
The base frequency is a real knob, not a constant. That $10000$ is not sacred; long-context models routinely raise it to $500{,}000$ or $1{,}000{,}000$, which slows the slow pairs down so their angles stay in familiar territory over a longer window. When a hyperparameter is set “because the paper said so,” it is worth knowing what it actually controls.
The elegant default and its failure mode are two sides of one coin. The frequencies that give RoPE its clean relative encoding are the same ones that go out of distribution past the training length. The mechanism and its breaking point cannot be pulled apart, so understanding one means understanding the other.

Conclusion

RoPE comes down to one choice: to encode position, rotate the query and key instead of adding to them. Rotation keeps each vector’s length, so a word’s meaning stays intact, and since an attention score depends only on the angle between two vectors, the score ends up tracking how far apart two tokens are rather than where they sit.

If you want the original source, it is the RoFormer paper by Su et al. It is short and readable, and after this it should be easy to follow. It is also the groundwork for the next question in this series: how a model trained on a few thousand tokens manages to read much longer inputs.

The Evolution of Attention, Part 1: From MHA to Latent Compression

Sun, 17 May 2026 10:00:00 +0800

What This Post Covers

This is Part 1 of a two-part series on how the transformer’s attention mechanism has evolved. Every attention variant shipped in production since 2019 is fighting one number: the bytes of KV cache you have to carry per token. That number controls how many concurrent users fit on a GPU, how long a context you can serve, and ultimately whether your model is economically viable to deploy.

Part 1 walks the first wave of answers: the variants that attack the cache by changing what gets stored per token. We start with the bottleneck, recap multi-head attention, look at the stepping stones (MQA and GQA), then spend most of the post on Multi-head Latent Attention as introduced in DeepSeek-V2. By the end you will see how a single low-rank bottleneck plus a clever bit of algebra collapses the cache by more than 50x relative to standard MHA without giving up the expressivity of standard softmax attention.

Part 2 picks up at the question MLA cannot answer: once each cached token is about as small as it gets, can we cache fewer tokens? That is sparse attention (DSA, NSA, MoBA), linear-attention hybrids, and the V4-Pro synthesis where compression and sparsity stack. (Coming soon.)

The audience is engineers who deploy models. FlashAttention has its own dedicated post on this blog and we will not re-cover it here.

Part 1: The KV Cache Wall

Modern transformer inference is, in practice, a memory bandwidth problem. During autoregressive decoding, each new token must attend to every prior token, which means every prior token’s key and value vectors must already sit in fast memory. That stash is the KV cache, and it grows linearly with sequence length, linearly with batch size, and linearly with the number of layers.

For a model with $L$ layers, $n_h$ heads, per-head dimension $d_h$, sequence length $T$, and batch size $B$ in float16, the standard MHA cache is:

$$\text{Cache}_{\text{MHA}} \;=\; 2 \cdot L \cdot B \cdot T \cdot n_h \cdot d_h \cdot 2 \text{ bytes}$$

The leading factor of 2 is for keys and values; the trailing 2 is the bytes per value. For a Llama-3-style 70B model at BF16 with 80 layers, 64 heads, and $d_h = 128$, each token consumes 32 KB of cache per layer under full MHA, or about 2.5 MB across all 80 layers. At 128K context that is 320 GB of KV cache for a single sequence. The H100 has 80 GB of HBM. You cannot serve a single 128K-context request on one H100 without doing something to shrink that number. The cache, not the parameters, is what limits how many concurrent users fit on a GPU and how long a context you can serve.
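The arithmetic is worth running once by hand. The helper below is a direct transcription of the cache formula; its name and the breakdown into three cases are illustrative, and the sizes in the comments use binary units (1 KiB = 1024 bytes).

```python
def mha_cache_bytes(layers, batch, seq_len, heads, head_dim, bytes_per_value=2):
    """KV cache size for standard MHA: keys and values, every layer, every token."""
    return 2 * layers * batch * seq_len * heads * head_dim * bytes_per_value

# The 70B-style reference from the text: 80 layers, 64 heads, d_h = 128, 2-byte values.
per_token_per_layer = mha_cache_bytes(1, 1, 1, 64, 128)       # 32 KiB
per_token_all_layers = mha_cache_bytes(80, 1, 1, 64, 128)     # 2.5 MiB
full_context = mha_cache_bytes(80, 1, 128 * 1024, 64, 128)    # 320 GiB

print(per_token_per_layer / 1024, "KiB per token per layer")
print(per_token_all_layers / 1024**2, "MiB per token across all layers")
print(full_context / 1024**3, "GiB for one 128K-token sequence")
```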

Several approaches have chipped away at this. Multi-Query Attention (MQA) shares a single K/V across all heads, a brutal compression that visibly degrades quality. Grouped-Query Attention (GQA) is the negotiated middle ground that ships in Llama, Mistral, and Qwen. Multi-Head Latent Attention, introduced in DeepSeek-V2 in May 2024, takes a different tack: it caches a single low-rank latent vector per token and reconstructs full-rank K and V on the fly.

The trick, and the whole point of the rest of this post, is that you never actually have to reconstruct them.

Part 2: Notation

A few symbols recur throughout. None of them are unusual; the table is a quick reference so the equations below read fast.

Symbol	Meaning	DeepSeek-V2 value
$d$	Model (residual stream) dimension	5120
$n_h$	Number of attention heads	128
$d_h$	Per-head dimension (content part)	128
$d_c$	Latent KV compression dim, the cached width	512
$d_c'$	Query compression dim (training-time only)	1536
$d_h^R$	RoPE per-head dim (decoupled positional part)	64
$h_t$	Hidden state at position $t$, shape $\mathbb{R}^{d}$	(per token)
$\mathbf{c}_t^{KV}$	The cached latent at position $t$, shape $\mathbb{R}^{d_c}$	(per token)

Convention: all vectors are row vectors, so $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$ and $h W$ produces an output row vector. This matches how PyTorch nn.Linear treats its input (although nn.Linear stores its weight transposed, with shape $d_{\text{out}} \times d_{\text{in}}$) and how most modern transformer code reads.

Part 3: Recap: Standard Multi-Head Attention

Before getting to MLA, it is worth pinning down exactly what we are trying to replace. Standard MHA at position $t$ takes the hidden state $h_t \in \mathbb{R}^{d}$ and produces three projections:

$$q_t = h_t W^Q, \qquad k_t = h_t W^K, \qquad v_t = h_t W^V$$

where $W^Q, W^K, W^V \in \mathbb{R}^{d \times n_h d_h}$. The result is split into $n_h$ heads of dimension $d_h$. Attention is computed head-wise, with $k_t$ and $v_t$ cached for every past position $t$:

$$\text{Attn}_i(q, K, V) \;=\; \text{softmax}\!\left(\tfrac{q_i K_i^\top}{\sqrt{d_h}}\right) V_i$$

[Figure: Standard multi-head attention. One token's hidden state is projected three ways. The full Q, K, V tensors each have width n_h · d_h, partitioned into n_h slices of width d_h. K and V are cached for every past token, giving the familiar 2 · n_h · d_h floats per token per layer.]

The cost is paid not in the projection matrices (those live in HBM regardless) but in the running cache of all past $k_t, v_t$. Per token per layer, that is $2 \cdot n_h \cdot d_h$ floats. For DeepSeek-V2 scale (128 heads, 128 per-head dim) it is 32,768 floats per token per layer. Everything MLA does is in service of shrinking that cache while keeping the per-head expressivity.

Part 4: Stepping Stones: MQA and GQA

MLA is most legible when contrasted with what came before. Both MQA and GQA attack the cache by reducing the number of distinct K and V projections.

Multi-Query Attention (Shazeer 2019) keeps $n_h$ query heads but uses a single shared key projection and a single shared value projection across all of them. The cache shrinks by a factor of $n_h$. On a 64-head model that is a 64x reduction. MQA worked in production for PaLM and Falcon-40B, but most successors retreated. With one shared K and one shared V, every query head looks at the same key/value subspace, the model loses head-level specialization on the recall side, and quality regressions at scale were measurable.

Grouped-Query Attention (Ainslie et al. 2023) is the practical middle ground. Instead of all-or-nothing sharing, you partition the $n_h$ query heads into $n_g$ groups, and each group shares one K/V head. MHA is $n_g = n_h$. MQA is $n_g = 1$. GQA-8 (the Llama-3 default) is $n_g = 8$. For our 70B reference at GQA-8 the cache is 320 KB per token, an 8x reduction, and quality regression vs MHA is in the noise.

The K/V width is the lever each method pulls. MHA keeps full per-head K/V. MQA collapses to a single shared pair. GQA picks a comfortable middle. All three keep K and V as the cached objects, though.
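The sharing itself is a one-line change to how keys are matched to query heads. The sketch below computes attention scores for $n_h$ query heads against $n_g$ cached key heads by repeating each cached head across its group; values would be shared the same way. The function names and toy sizes are assumptions for illustration, not taken from any specific codebase.

```python
import numpy as np

def grouped_query_scores(Q, K):
    """Q: (n_h, T, d_h) query heads. K: (n_g, T, d_h) cached key heads, n_g divides n_h."""
    n_h, n_g = Q.shape[0], K.shape[0]
    assert n_h % n_g == 0, "query heads must divide evenly into groups"
    # Each cached key head serves n_h / n_g consecutive query heads.
    K_shared = np.repeat(K, n_h // n_g, axis=0)
    return Q @ K_shared.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])

def cached_floats_per_token_per_layer(n_g, d_h):
    """Keys plus values for n_g cached heads of width d_h."""
    return 2 * n_g * d_h

rng = np.random.default_rng(0)
n_h, T, d_h = 8, 5, 16                      # toy sizes
Q = rng.normal(size=(n_h, T, d_h))

for n_g in (8, 2, 1):                       # MHA, GQA with 2 groups, MQA
    K = rng.normal(size=(n_g, T, d_h))
    scores = grouped_query_scores(Q, K)
    print(n_g, scores.shape, cached_floats_per_token_per_layer(n_g, d_h))
```

The score tensor has the same shape in all three cases; only the number of cached heads, and therefore the cache, changes.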

MLA changes the question. Instead of asking “how many K/V projections do we keep?”, it asks: what if the cache is not K or V at all, but a compressed representation we expand on demand?

Part 5: MLA’s Core Insight

The hidden state $h_t \in \mathbb{R}^{d}$ already contains everything needed to compute that token’s keys and values. The standard pipeline burns most of its width on a high-dimensional intermediate ($n_h d_h \approx 16{,}000$ for DeepSeek-V2) that we store, when really we could store a compact summary and reproject when needed.

Concretely: introduce a latent bottleneck $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ with $d_c \ll n_h d_h$. Cache only this latent. Recover K and V via dedicated up-projections at attention time:

$$\underbrace{h_t W^{DKV}}_{\mathbf{c}_t^{KV} \,\in\, \mathbb{R}^{d_c}}\;\longrightarrow\;\begin{cases} \mathbf{c}_t^{KV} W^{UK} \;=\; k_t & \in \mathbb{R}^{n_h d_h} \\[2pt] \mathbf{c}_t^{KV} W^{UV} \;=\; v_t & \in \mathbb{R}^{n_h d_h} \end{cases}$$

[Figure: MLA's core idea. Cache one narrow latent per token, reconstruct K and V at attention time, then throw them away. Instead of caching n_h · d_h + n_h · d_h = 32,768 floats of K and V per token, we cache only the d_c = 512-wide latent and reconstruct K and V at attention time via two learned up-projections.]

If you stop reading right here, you would be tempted to ask: doesn’t reconstructing K and V at every attention step add a huge amount of compute? The honest answer is yes, naively. The whole punchline of MLA, which we get to in §8, is that during inference you do not have to materialize K and V at all. The up-projection matrices can be absorbed into Q and the output projection. The cache stays small and the FLOPs stay manageable.

A low-rank cache only saves memory. A low-rank cache that you never have to up-project saves memory and compute. That is the actual MLA result.
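The claim that K never has to be materialized rests on plain associativity, which a toy check makes visible. This sketch ignores RoPE, covers only the key side (folding $W^{UK}$ into the query), and uses small made-up sizes rather than the DeepSeek-V2 values; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_c, n_h, d_h = 16, 4, 8                     # toy sizes, not the DeepSeek-V2 values

W_UK = rng.normal(size=(d_c, n_h * d_h))     # key up-projection
c_kv = rng.normal(size=d_c)                  # cached latent for one past token
q = rng.normal(size=(n_h, d_h))              # per-head content queries, current token

# Naive read: rebuild the full per-head keys from the latent, then score.
k = (c_kv @ W_UK).reshape(n_h, d_h)
naive = np.einsum("hd,hd->h", q, k)

# Absorbed read: fold W_UK into the query once, then score against the latent itself.
W_UK_heads = W_UK.reshape(d_c, n_h, d_h)
q_absorbed = np.einsum("hd,chd->hc", q, W_UK_heads)   # shape (n_h, d_c)
absorbed = q_absorbed @ c_kv

print(np.allclose(naive, absorbed))
```

The two reads give the same per-head scores, but the second one only ever touches vectors of width $d_c$ on the cached side.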

But before we get to the absorption, there is a wrinkle: RoPE. Rotary positional embeddings break the clean factorization above, and dealing with that gracefully is what gives MLA its slightly baroque final form. We will first walk through the no-RoPE version step by step, then patch it.

Part 6: Matrix Walkthrough, Step by Step

We will work through one token’s forward pass. Position $t$, hidden state $h_t \in \mathbb{R}^{d}$. Numbers in parentheses are the DeepSeek-V2 values, so the shapes feel concrete.

[Interactive figure: MLA, step by step. One token's path from hidden state through the full MLA construction, in five panels.]

Panel 1, KV down-projection. A single linear layer projects the residual stream down into the latent space. From this point onward, only the latent is cached.

Panel 2, K and V up-projection. The same latent fans out into full per-head K and V through two parameter matrices. The wide tensors materialize for one matmul and are never written back to the cache.

Panel 3, decoupled RoPE construction. The content key rides the latent up-projection. The RoPE key is computed on a separate narrow path, rotated, and broadcast to every head. The final per-head key is the concatenation. The cache holds both the latent and the RoPE key.

Panel 4, the absorption trick. Top, the naive read: reconstruct k_s^C from the latent every step. Bottom, after fusing W^UK into the query path: attention reduces to a bilinear form on two latent-sized vectors. Memory and FLOPs both scale with d_c, not n_h · d_h.

Panel 5, what lives in the cache. 576 floats per token per layer where MHA would have written 32,768. The softmax attention semantics are unchanged; only the storage shape did.

Step 1: KV down-projection

A single linear layer projects the residual stream down into the latent space:

$$\mathbf{c}_t^{KV} \;=\; h_t \, W^{DKV}, \qquad W^{DKV} \in \mathbb{R}^{d \times d_c} \;=\; \mathbb{R}^{5120 \times 512}$$

The resulting $\mathbf{c}_t^{KV} \in \mathbb{R}^{512}$ is the only KV-related thing we cache for this token. It is shared across all heads and contains the information from which both K and V will eventually be reconstructed.

Step 2: K and V up-projection

The latent fans out into the full multi-head K and V via two more linear layers:

$$k_t^C \;=\; \mathbf{c}_t^{KV} \, W^{UK}, \qquad v_t^C \;=\; \mathbf{c}_t^{KV} \, W^{UV}$$

$$W^{UK}, W^{UV} \in \mathbb{R}^{d_c \times n_h d_h} \;=\; \mathbb{R}^{512 \times 16384}$$

The superscript $C$ marks the content portion of K (we add a separate RoPE portion in §7). After this step, $k_t^C$ and $v_t^C$ live in $\mathbb{R}^{n_h d_h}$ and split into $n_h$ heads of width $d_h$. Since $d_c \ll n_h d_h$ (512 versus 16,384), these are low-rank reconstructions: the keys and values, stacked across all heads, are constrained to lie in a $d_c$-dimensional subspace of the full $n_h d_h$-dimensional space. This rank constraint is the price of compression, and on the evidence DeepSeek-V2 reports, it is a good trade.
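A minimal NumPy sketch of Steps 1 and 2 may help make the shapes concrete. It rests on a few assumptions: the dimensions are scaled-down stand-ins (the DeepSeek-V2 values sit in a comment), the weights are random rather than trained, and the variable names are illustrative. It also checks the low-rank claim numerically: a batch of reconstructed keys, stacked across heads, has rank at most $d_c$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; DeepSeek-V2 uses d=5120, d_c=512, n_h=128, d_h=128.
d, d_c, n_h, d_h = 64, 8, 4, 16
T = 32  # number of tokens in this toy batch

# Random stand-ins for the trained projections.
W_DKV = rng.standard_normal((d, d_c))         # down-projection
W_UK = rng.standard_normal((d_c, n_h * d_h))  # key up-projection
W_UV = rng.standard_normal((d_c, n_h * d_h))  # value up-projection

h = rng.standard_normal((T, d))  # one hidden state per row

# Step 1: the latent is the only KV-related tensor that gets cached.
c_kv = h @ W_DKV                 # (T, d_c)

# Step 2: full-width K and V, materialised only transiently.
k_c = c_kv @ W_UK                # (T, n_h * d_h)
v_c = c_kv @ W_UV                # (T, n_h * d_h)

# Split the keys into heads of width d_h.
k_heads = k_c.reshape(T, n_h, d_h)

# The stacked keys span at most d_c dimensions of the n_h * d_h space.
rank_k = int(np.linalg.matrix_rank(k_c))
print(c_kv.shape, k_heads.shape, rank_k)
```

The rank bound holds no matter how many tokens are in the batch, because every row of `k_c` is a linear image of a `d_c`-wide latent.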

Step 3: The query path

Symmetrically, and primarily to save training-time activations rather than KV cache, the query is also routed through a bottleneck:

$$\mathbf{c}_t^{Q} = h_t \, W^{DQ}, \qquad q_t^C = \mathbf{c}_t^{Q} \, W^{UQ}$$

$$W^{DQ} \in \mathbb{R}^{5120 \times 1536}, \qquad W^{UQ} \in \mathbb{R}^{1536 \times 16384}$$

The query bottleneck $d_c' = 1536$ is wider than the KV bottleneck because queries are not cached. There is no inference benefit from making them narrower. The reason to compress them at all is parameter and activation memory during training.

$d_c = 512$ is sized for cache miniaturization. $d_c' = 1536$ is sized for representational room. They are independent design knobs and DeepSeek-V2 chose them to be quite different.

Part 7: The RoPE Complication and the Decoupled Fix

So far the story is clean: cache a latent, up-project on demand, profit. But all modern transformers, DeepSeek-V2 included, use rotary position embeddings, and RoPE is exactly the kind of thing that ruins clean factorizations.

Why naive RoPE breaks MLA

RoPE applies a position-dependent rotation matrix $\mathcal{R}_t$ to the query and key after they are projected. The attention dot product becomes:

$$\langle \mathcal{R}_t q_t,\, \mathcal{R}_s k_s \rangle \;=\; q_t^\top \mathcal{R}_t^\top \mathcal{R}_s \, k_s \;=\; q_t^\top \mathcal{R}_{s-t}\, k_s$$

That last simplification, the rotation depending only on the relative position $s-t$, is the whole reason RoPE works. But now imagine we tried to apply RoPE to our reconstructed key $k_s = \mathbf{c}_s^{KV} W^{UK}$. The rotation acts on the full-width key, after the up-projection, so it cannot be folded into the cached latent $\mathbf{c}_s^{KV}$. And $\mathcal{R}_s$ depends on $s$, the actual position, so different tokens use different rotations. We would either have to reconstruct and rotate every key at every decode step, or store the rotated full-width key per token, which defeats the cache.

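The deeper algebraic problem: in §8 we will want to absorb $W^{UK}$ into the query projection. But if RoPE applies between them, the score (with latents written as row vectors) is $\mathbf{c}_t^{Q} W^{UQ} \mathcal{R}_t^\top \mathcal{R}_s W^{UK\top} \mathbf{c}_s^{KV\top}$, and the position-dependent $\mathcal{R}_t^\top \mathcal{R}_s$ sitting between the two weight matrices blocks any precomputed absorption.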

RoPE and low-rank absorption are fundamentally incompatible when applied to the same vector. MLA’s fix is to give RoPE its own, separate vector.
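Both halves of that argument can be checked numerically. The sketch below uses a single 2-D rotation pair (real RoPE applies one such rotation per pair of dimensions, each with its own frequency); the angle `THETA`, the toy matrix `W`, and the helper names are illustrative assumptions. The first check confirms the score depends only on $s-t$. The second shows that rotating after a linear map is not the same as applying the map to a rotated input, which is why $\mathcal{R}_s$ cannot simply be pushed through an up-projection onto what is cached.

```python
import math

THETA = 0.3  # illustrative base angle for one 2-D rotation pair


def rotate(vec, pos):
    """Rotate a 2-D vector by the angle pos * THETA."""
    c, s = math.cos(pos * THETA), math.sin(pos * THETA)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def rope_score(q, k, t, s):
    """Score between a query at position t and a key at position s."""
    return dot(rotate(q, t), rotate(k, s))


q, k = (1.0, 2.0), (0.5, -1.0)

# Same offset s - t = 3 at two different absolute positions.
score_near = rope_score(q, k, t=2, s=5)
score_far = rope_score(q, k, t=40, s=43)

# Equivalent closed form: q^T R_{s-t} k.
score_relative = dot(q, rotate(k, 3))


def matvec(W, v):
    return (W[0][0] * v[0] + W[0][1] * v[1],
            W[1][0] * v[0] + W[1][1] * v[1])


# A generic (non-rotation) 2x2 map standing in for an up-projection.
W = ((1.0, 2.0), (0.0, 1.0))
latent = (1.0, 1.0)

rotate_after = rotate(matvec(W, latent), 5)   # what RoPE requires
rotate_before = matvec(W, rotate(latent, 5))  # what a rotated cache would give
gap = max(abs(a - b) for a, b in zip(rotate_after, rotate_before))
print(score_near, score_far, score_relative, gap)
```

In the real model the latent and the key do not even share a width, so there is no rotation of the latent to try; the 2-D case just makes the non-commutation visible.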

The decoupled RoPE construction

Each token gets two key vectors per head:

A content key $k_t^C$ (no RoPE), reconstructed from the cached latent as in §6.
A small RoPE key $k_t^R$, a separate, narrow tensor produced directly from $h_t$, to which RoPE is applied. Crucially, it is shared across all heads.

$$k_t^R \;=\; \text{RoPE}\!\left(h_t \, W^{KR}\right), \qquad W^{KR} \in \mathbb{R}^{d \times d_h^R} \;=\; \mathbb{R}^{5120 \times 64}$$

And on the query side, the rotation also lives on its own piece, but here it is per-head, since queries are not cached:

$$q_{t,i}^R \;=\; \text{RoPE}\!\left(\mathbf{c}_t^Q \, W_i^{QR}\right), \qquad W^{QR} \in \mathbb{R}^{d_c' \times n_h d_h^R}$$

The final per-head query and key are the concatenation of the content part and the RoPE part:

$$q_{t,i} = [\,q_{t,i}^C \,;\, q_{t,i}^R\,] \in \mathbb{R}^{d_h + d_h^R}, \qquad k_{t,i} = [\,k_{t,i}^C \,;\, k_t^R\,] \in \mathbb{R}^{d_h + d_h^R}$$

Note: $k_t^R$ has no head subscript. Every head’s RoPE-key part is the same vector. This is exactly MQA-style sharing, isolated to the 64-wide RoPE tail.

Attention is then computed on the concatenated vectors:

$$\text{score}_{t,s,i} \;=\; \frac{q_{t,i}^\top k_{s,i}}{\sqrt{d_h + d_h^R}} \;=\; \frac{\underbrace{q_{t,i}^{C\top} k_{s,i}^C}_{\text{from latents}} \;+\; \underbrace{q_{t,i}^{R\top} k_s^R}_{\text{from RoPE pair}}}{\sqrt{d_h + d_h^R}}$$

The dot product splits cleanly into a content term and a RoPE term, exactly because we concatenated rather than added. The content term will get the absorption treatment in §8. The RoPE term stands alone and is computed directly from the (small) cached $k_s^R$.
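The split is easy to confirm with random vectors. This is a toy check under assumed sizes (the real widths are in the comment), with the RoPE parts treated as already rotated; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sizes; DeepSeek-V2 uses d_h=128 and d_h_rope=64.
d_h, d_h_rope = 8, 4

q_content = rng.standard_normal(d_h)      # per-head content query
k_content = rng.standard_normal(d_h)      # per-head content key
q_rope = rng.standard_normal(d_h_rope)    # per-head RoPE query (already rotated)
k_rope = rng.standard_normal(d_h_rope)    # shared RoPE key (already rotated)

# Concatenate content and RoPE parts to form the per-head q and k.
q = np.concatenate([q_content, q_rope])
k = np.concatenate([k_content, k_rope])

scale = np.sqrt(d_h + d_h_rope)
full_score = float(q @ k) / scale
split_score = float(q_content @ k_content + q_rope @ k_rope) / scale

print(full_score, split_score)
```

Had the two parts been added instead of concatenated, the product would pick up cross terms between content and RoPE, and the two paths could no longer be treated separately.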

The output is then the usual per-head weighted sum of $v_{s,i}^C$ (the value has no RoPE, it never did), and the heads are concatenated and pushed through $W^O$ as usual:

$$o_{t,i} = \sum_s \alpha_{t,s,i} \, v_{s,i}^C, \qquad u_t = [o_{t,1}; \ldots; o_{t,n_h}] \, W^O$$

The KV cache now holds $\mathbf{c}_t^{KV}$ (512 floats) plus $k_t^R$ (64 floats) per token per layer. Total: 576 floats, or 1,152 bytes in BF16. That is the headline 57x reduction versus MHA’s 32,768 floats.

Part 8: The Absorption Trick

We have reduced the cache from 32,768 to 576 floats per token. But the up-projections $W^{UK}$ and $W^{UV}$ are huge ($512 \times 16{,}384$ each), and computing them per attention step looks alarming. The trick is that we never compute them at inference time. We absorb them into the surrounding matrices.

Absorbing $W^{UK}$ into the query

Look at one head’s content-attention term:

$$q_{t,i}^{C\top} \, k_{s,i}^C \;=\; \big(\mathbf{c}_t^Q W_i^{UQ}\big) \big(\mathbf{c}_s^{KV} W_i^{UK}\big)^{\!\top} \;=\; \mathbf{c}_t^{Q} \, \underbrace{W_i^{UQ} W_i^{UK\top}}_{\widetilde{W}_i^Q \,\in\, \mathbb{R}^{d_c' \times d_c}} \, \mathbf{c}_s^{KV\top}$$

Here the latents are written as row vectors, as in §6, and $W_i^{UQ} \in \mathbb{R}^{d_c' \times d_h}$, $W_i^{UK} \in \mathbb{R}^{d_c \times d_h}$ are head $i$’s column slices of the up-projections. The product $\widetilde{W}_i^Q = W_i^{UQ} W_i^{UK\top}$ is two parameter matrices multiplied together. It depends on no input, so we precompute it once. The score becomes a single bilinear form between the freshly computed query latent (width 1536) and the cached KV latent (width 512).

After absorption, “computing the content key” disappears. The query latent talks to the key latent directly through a precomputed bridge matrix. The cache really is the key.
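Here is a small numerical check of the key-side absorption, a sketch with random stand-in weights and toy sizes rather than anything from a real checkpoint. With latents stored as row vectors, the bridge matrix works out to `W_UQ_i @ W_UK_i.T`; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; DeepSeek-V2 uses d_cq=1536, d_c=512, d_h=128.
d_cq, d_c, d_h = 12, 8, 4

# Random stand-ins for one head's slices of the up-projections.
W_UQ_i = rng.standard_normal((d_cq, d_h))
W_UK_i = rng.standard_normal((d_c, d_h))

c_q = rng.standard_normal(d_cq)  # query latent, computed fresh each step
c_kv = rng.standard_normal(d_c)  # KV latent, read from the cache

# Naive path: reconstruct the per-head query and key, then dot them.
q_c = c_q @ W_UQ_i
k_c = c_kv @ W_UK_i
naive_score = float(q_c @ k_c)

# Absorbed path: fold both up-projections into one bridge matrix, offline.
W_bridge = W_UQ_i @ W_UK_i.T     # (d_cq, d_c), input-independent
absorbed_score = float(c_q @ W_bridge @ c_kv)

print(naive_score, absorbed_score)
```

The two scores agree up to floating-point error, and the absorbed path never forms `k_c`.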

Absorbing $W^{UV}$ into the output projection

The value side gets the same treatment, but from the other end. The per-head output is a weighted sum of $v_{s,i}^C = \mathbf{c}_s^{KV} W_i^{UV}$, which then gets multiplied by $W_i^O$:

$$u_t \;\supset\; o_{t,i} \, W_i^O \;=\; \sum_s \alpha_{t,s,i} \, \mathbf{c}_s^{KV} \, \underbrace{W_i^{UV} W_i^O}_{\widetilde{W}_i^O \,\in\, \mathbb{R}^{d_c \times d}}$$

Again, the bracketed matrix is parameter-only: precompute and store. At runtime, the attention weights multiply directly against the cached $\mathbf{c}_s^{KV}$, and the result projects directly to the output dimension. $v$ is never materialized.
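The value side can be checked the same way. This sketch assumes toy sizes, random weights, and made-up attention weights that sum to one; it compares materialising the values against applying one precomputed matrix to the averaged latents.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes; DeepSeek-V2 uses d_c=512, d_h=128, d=5120.
d_c, d_h, d, S = 8, 4, 16, 5  # S = number of cached tokens

W_UV_i = rng.standard_normal((d_c, d_h))  # one head's value up-projection
W_O_i = rng.standard_normal((d_h, d))     # that head's slice of W^O

c_kv = rng.standard_normal((S, d_c))      # cached latents, one per row
alpha = rng.random(S)
alpha /= alpha.sum()                      # stand-in attention weights

# Naive path: materialise values, average them, then project out.
v = c_kv @ W_UV_i                         # (S, d_h)
naive_out = (alpha @ v) @ W_O_i           # (d,)

# Absorbed path: average the latents, then apply one precomputed matrix.
W_O_tilde = W_UV_i @ W_O_i                # (d_c, d), input-independent
absorbed_out = (alpha @ c_kv) @ W_O_tilde

print(np.allclose(naive_out, absorbed_out))
```

Because the weighted sum is linear, it commutes with the up-projection, which is what lets the sum run over latents instead of values.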

Subtle point: the absorption changes the effective compute layout. The per-head $\widetilde{W}^Q$ matrices are dense and head-specific, so you do not get a literal FLOPs reduction in the obvious places, but you do collapse what was a stream of large reconstructions into a single small inner product. The combined effect (small cache plus small inner product) is what makes long-context decoding cheap.

The RoPE part is computed the old way: the cached $k_s^R$ is dotted with the freshly computed $q_{t,i}^R$. It is small (64 wide) and outside the absorbed factorization, which is exactly why we paid the cost of decoupling it.

Part 9: DeepSeek-V2 by the Numbers

Plug the actual hyperparameters in and look at how the per-layer KV cache footprint scales with context length, per request, in fp16.

[Interactive figure: Cache Size vs Context Length. Per layer, batch=1, fp16; context-length slider shown at 32,768.]

DeepSeek-V2-style hyperparameters: $n_h = 128$, $d_h = 128$, $d_c = 512$, $d_h^R = 64$. GQA shown with 8 K/V groups; MQA shares a single K/V across heads.

At 128K context, the kind of regime DeepSeek-V2 was designed for, each MLA layer holds about 144 MB of cache, versus around 8 GB for naive MHA. Multiply across DeepSeek-V2’s 60 layers and the difference is the difference between a context that fits and one that does not.
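The arithmetic behind those numbers is short enough to write down. The sketch below uses the hyperparameters quoted in this part, 2 bytes per element, and binary units (MiB); the function and variable names are illustrative.

```python
def kv_cache_bytes(elements_per_token, context_len, bytes_per_element=2):
    """Per-layer KV cache size in bytes for one sequence."""
    return elements_per_token * context_len * bytes_per_element


# DeepSeek-V2-style hyperparameters quoted above.
n_h, d_h, d_c, d_h_rope = 128, 128, 512, 64
n_groups = 8                 # GQA groups
context_len = 128 * 1024     # 128K tokens

elements_per_token = {
    "MHA": 2 * n_h * d_h,
    "GQA": 2 * n_groups * d_h,
    "MQA": 2 * d_h,
    "MLA": d_c + d_h_rope,
}

MIB = 1024 ** 2
per_layer_mib = {
    name: kv_cache_bytes(n, context_len) / MIB
    for name, n in elements_per_token.items()
}

for name, size in per_layer_mib.items():
    print(f"{name}: {size:,.0f} MiB per layer")
```

With these inputs MLA comes to 144 MiB per layer and MHA to 8,192 MiB. Across 60 layers that is roughly 8.4 GiB versus 480 GiB for a single 128K-token sequence.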

Part 10: Comparison and Takeaways

Method	K/V structure	Cache per token	Quality	Notes
MHA	Distinct K, V per head	$2 \cdot n_h \cdot d_h$	Best	Baseline; expensive cache
MQA	One K, V shared by all heads	$2 \cdot d_h$	Noticeably worse	Used in PaLM-style models
GQA	K, V per group of heads	$2 \cdot n_g \cdot d_h$	$\approx$ MHA when $n_g$ tuned	Llama, Mistral, Qwen
MLA	Latent + decoupled RoPE	$d_c + d_h^R$	$\approx$ MHA, sometimes better	DeepSeek-V2/V3; absorbs at inference

What is actually new

MLA is not the first low-rank attention proposal. Linformer (2020), Performer, and various Nyström approximations long predate it. What makes MLA practical and distinct is:

The low-rank object is the cache, not the attention pattern. Earlier work approximated the attention matrix itself. MLA keeps softmax attention exact and only compresses what gets stored across timesteps.
The absorption trick keeps inference compute bounded. Without §8, MLA would just be a memory-vs-compute tradeoff. With it, both move in the right direction at once for the autoregressive decoding regime.
The decoupled-RoPE construction shows how to make a low-rank cache compatible with rotary embeddings. This is the part most easily glossed over but is what makes the technique deployable in modern transformer stacks. Future low-rank schemes that ignore positional embeddings are nice in a paper and broken in practice. MLA paid the engineering tax.

The cost

MLA is not free. The decoupled RoPE adds parameters and a slightly more elaborate forward path. The bilinear absorbed matrices $\widetilde{W}_i^Q$ are larger than the raw $W^Q$ pieces would be. For very short context inference, plain GQA may be faster wall-clock, because the absorbed matmul is the dominant cost rather than memory bandwidth. MLA’s wins compound with context length and batch size, exactly the regimes that matter when serving a frontier model.

In the long-context, large-batch regime that production inference cares about, MLA pulls the KV-cache lever further than any other published trick, while keeping standard softmax attention semantics intact.

If you remember three things

The cache is a low-rank latent $\mathbf{c}^{KV}$, not K and V. K and V are reconstructed by up-projections that you never actually run at inference.
RoPE rides on a tiny separate vector that bypasses the latent compression, so the rotation can stay position-aware without contaminating the absorbable content path.
The whole construction is engineered around the fact that during decoding, you do not need K and V themselves. You need attention scores and the outputs they weight. The cache only has to hold enough information to compute those, and the latent is sized to exactly that.

What comes next

Part 1 ended at the point where each cached token is about as small as it gets. The next pressure point is not the size of each token but the number of tokens you carry. Once your model can hold 1M tokens of context, do you really need to read every one of them at every decode step?

Part 2 will pick up there. Three different bets on how to answer that question shipped in 2025: sparse top-k selection driven by a cheap relevance scorer (DSA, deployed in DeepSeek V3.2), three-branch sparsification that is trainable from scratch (NSA, ACL 2025 best paper), and mixture-of-block routing that retrofits existing dense checkpoints (MoBA, powering Kimi K2’s 1M context). A different bet entirely is the linear-attention hybrids that change the math so no $O(L^2)$ matrix ever materializes (Lightning Attention, in MiniMax M1). The synthesis at the end is DeepSeek V4-Pro, released last month, where every thread we have followed so far gets composed at once into a 61-layer stack that runs at 2% of GQA’s cache footprint.

Part 2 is in progress. Check back soon.

References

Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017. The original Transformer paper. The baseline MHA that every variant in this post is reacting to.
Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. The MQA paper. Concise and worth reading in full; the whole argument is six pages.
Ainslie, J. et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. The GQA paper. Includes the mean-pooling conversion recipe that made GQA easy to adopt.
DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. The MLA paper. Section 2.1.3 has the decoupled RoPE derivation; almost everyone missed it the first time around.
Meng, F. et al. (2025). TransMLA: Multi-Head Latent Attention Is All You Need. Proves that MLA has strictly greater expressive power than GQA at the same cache budget, and gives a conversion recipe from GQA to MLA.

References for sparse attention (DSA, NSA, MoBA), linear hybrids (Lightning Attention), and the V4-Pro report will appear in Part 2.

The Platform Around the Agent: What Enterprise Architects Actually Build

Wed, 15 Apr 2026 10:00:00 +0800

The Gap

By April 2026, the adoption numbers are staggering: 90% of developers use AI at work and over 80% say it’s made them more productive. Gartner projects 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025.

Then you read the other column. CIO reports that 95% of enterprises see zero return on their AI investments. McKinsey’s maturity model puts only around 11% of enterprises in the “AI-native” tier. The 2025 DORA report is more uncomfortable still: AI raises throughput and raises change-failure rate. PR size is up 154%. 30% of engineers don’t trust the code their own agents produce.

The gap isn’t the model. Frontier models are a commodity. They get swapped every six months and the next one is better. The gap is everything around the model: the control plane that routes requests, attributes cost, enforces policy, retrieves context, evaluates quality, and measures outcomes. It’s the platform that turns “we rolled out Copilot” into “we shipped a 30.8% reduction in PR cycle time across 1,900 repos,” which is what Atlassian did with Rovo Dev and published at ICSE 2026.

This post is for the architect who has been asked to lead that platform. Not to choose between Copilot and Cursor; that’s a week of spreadsheets. To design what sits around whatever agent you pick, so that a year from now your CFO knows what AI is costing and your CTO knows what it’s earning.

Chapter 1: What “Platform” Actually Means Here

When a VP of Engineering says “we have an AI platform,” they might mean one of three things:

We bought Copilot Enterprise. Everyone has access.
We stood up a chat UI in front of a couple of models.
We run an internal control plane that mediates every AI request our engineers make, attributes cost per team, enforces policy per repo, evaluates quality continuously, exposes a curated surface of tools and skills, and lets any engineer publish a repeatable workflow that triggers on a schedule, a webhook, or a repository event.

Only the third one is a platform. The first two are procurements.

Shopify made this distinction concrete. Per Bessemer’s write-up of their AI-first engineering playbook, Shopify runs an LLM proxy. Every AI request from every tool, every engineer, every script, goes through one internal gateway. Engineers can pick their harness (Claude Code, Cursor, Copilot), but the proxy is non-negotiable. That single architectural choice is what gives them centralised cost control, usage analytics, model flexibility, and the ability to swap a model provider in a day instead of a quarter.

Block took a different turn at the same fork. Rather than wrapping third-party agents, they built Goose internally and open-sourced it. The stated reason, per CTO Dhanji Prasanna on the Sequoia Training Data podcast, was that “data leaving our infrastructure” was unacceptable. The outcome: engineers save 8-10 hours a week, Goose is on track to reclaim 25% of manual hours company-wide, and 100% of Goose’s own PRs are now written by Goose.

Two tech companies. Two legitimate answers to the same architectural prompt. Both built a platform. Neither bought one.

The platform you build, whether Shopify-shaped (gateway + BYO-harness) or Block-shaped (build-the-harness), owns five responsibilities:

The five control-plane responsibilities

What it owns, who runs it, how it fails. If any layer lacks a named team, you don't have a platform. You have a shadow-IT problem.

[Interactive figure: five control-plane layers sitting between your agents (Claude Code · Cursor · Copilot · Goose · internal harnesses) and your developers (thousands of engineers across dozens of teams).]

Capability: built-ins · mcp · skills
Identity & policy: tokens · approvals · sandbox
Context: ingest · index · permission · serve
Evaluation: unit · task · production · eval-as-ci
FinOps & observability: gateway · tracing · attribution

Capability surface

layer 01

Owns

Built-in tools shipped with the harness
Curated registry of approved MCP servers
Skills library: org-specific playbooks
Progressive disclosure at the gateway

Who runs it

platform-team · capability-sre

Anti-pattern

90+ tools loaded per prompt. A single enterprise GitHub MCP server loaded naively burns 50k+ tokens of schema before any reasoning. Overhead scales linearly in services connected.

Identity & policy

layer 02

Owns

Two-identity model: human principal + agent service identity
Scoped, short-lived tokens per task
Policy-as-code at the MCP gateway
Graduated-trust approvals (async wait-state)
Sandbox runtime (gVisor, Firecracker, Kata) per trust tier

Who runs it

platform-team · security · iam

Anti-pattern

Agent inherits the full user scope. One compromised prompt exfiltrates every permission the invoking user has. The blast radius of an AI breach becomes the blast radius of a human breach.

Context pipeline

layer 03

Owns

Ingestion connectors for repos, docs, tickets, incidents, service catalog
Hybrid retrieval: BM25 + dense + graph
PII redaction and access-aware retrieval
Staleness SLOs per source
Token-budget compression and eviction

Who runs it

platform-team · data-eng

Anti-pattern

Every team ships its own RAG. Twelve incompatible stores, staleness nobody measures, PII leaking across permission boundaries, six answers to the same question depending on which index you hit.

Evaluation harness

layer 04

Owns

Unit regressions for prompts, tool schemas, system prompts
Task-level golden set graded by LLM-as-judge
Production shadow traffic + online signals
Eval-as-CI: no prompt, tool, or model ships without passing
Thumbs-down-to-regression-test feedback loop

Who runs it

platform-team · ai-coe

Anti-pattern

Silent failure becomes the norm. The agent finishes without an error, the diff looks plausible, the test it wrote passes because it tests the buggy behaviour it introduced. The 2025 DORA change-failure-rate uptick is this failure, audited.

FinOps & observability

layer 05

Owns

LLM gateway: every call mediated, measured, routable
Tiered-routing policy (Haiku / Sonnet / Opus by task class)
Prompt-cache hit rate as a first-class SLI
Per-team, per-repo, per-task cost attribution
End-to-end OpenTelemetry tracing

Who runs it

platform-team · finops · sre

Anti-pattern

No gateway, no attribution, no chargeback. 5% of users burn 60% of the budget invisibly. When an incident hits at 3am, you have no trace ID. Just a shrug and an angry CFO.

Capability: what tools and skills agents can reach, and how they’re discovered.
Identity & Policy: who’s acting, with what scope, under what guardrails.
Context: how org knowledge gets ingested, permissioned, and served.
Evaluation: how you know the agent is actually getting better, not just shipping faster.
FinOps & Observability: what it costs, who paid, what it produced, where it broke.

Those five wire up into a system shape worth drawing. The harness runs on each engineer’s laptop (in a Docker container) or in a CI runner. Everything the harness reaches into over the network is the platform. Everything the platform calls out to is a model provider, a curated MCP server, or a source system the platform has already indexed.

The reference architecture

runtime → platform → providers

The harness runs on each engineer's laptop or in a CI container. Everything the harness calls into is the platform. Every subsequent chapter zooms into one box on this diagram.

Runtime (per-engineer · per-job)
Engineer's laptop (docker): harness (goose / claude code / cursor) · MCP clients + workspace mount
CI runner (headless): ephemeral container, headless harness · fired on PR / label / cron / webhook

Platform (network services)
LLM gateway (hub): auth · tiered routing · prompt cache · cost attribution · rate limits
MCP gateway + Skills (registry): approved MCP list · progressive disclosure · Skills library · Recipe registry
Context API (retrieval): BM25 + dense + graph · ACL-aware · staleness SLO
Policy service (authz): scoped tokens · approval routing · sandbox-tier selection
Eval harness (ci-gated): golden set · LLM-as-judge · shadow traffic · online signals
Telemetry bus (otel): traces · run records · cost events · per-team / per-repo / per-task

Providers & sources (external)
Model providers (vendor): Anthropic · OpenAI · Google · internal + open-weight hosts
Curated MCP servers (tools): github · jira · cloud · feature-flags · incident
Source systems (indexed): repos · docs · tickets · incidents · service catalog
Identity (cross-cutting): IdP (human principal) · service registry (agent id)
Trace + cost DB (audit): queryable run history · chargeback source-of-truth

When multiple agents cooperate within this system, the topology that ships is orchestrator-worker, not swarm. Airbnb’s 3,500-file Enzyme-to-RTL migration proved this: per-file parallel workers, central orchestration, brute-force retries with dynamic prompts. 97% automated, 6 weeks, 6 engineers. Swarms, by contrast, are the dominant source of silent failure because a single hallucination in shared memory propagates to every peer that reads it. Chapter 5 shows how Goose sub-recipes implement the orchestrator-worker pattern concretely.

If any of these responsibilities isn’t owned by a named team with a roadmap, you don’t have a platform. You have a shadow-IT problem that will compound. These five exist to enable one thing: the workflow lifecycle. Chapter 5 names it; everything else makes it safe, cheap, and measurable.

Chapter 2: The Capability Surface

The first architectural argument inside every platform team is about how agents get things done. It usually gets framed as a choice between the Model Context Protocol (MCP), now stewarded by the Linux Foundation’s Agentic AI Foundation as of December 2025, and Agent Skills, the behavioural-instruction packages that started shipping with Claude.

Framing them as competitors is a category error. MCP is an execution fabric: a standardised RPC for tools, resources, and prompt templates, with bidirectional comms and dynamic tool discovery. Skills are a knowledge layer: portable instructions that encode the how-to of a specific job. MCP tells an agent what it can do. Skills tell an agent what it should do in a specific situation.

The context-bloat failure is the sharper risk. A single enterprise-grade GitHub MCP server exposes 90+ tools. Loaded naively, that’s 50,000+ tokens of schema entering the context window before the model has read a single line of user intent. Add Jira, the cloud provider, a feature-flag platform, your incident system, and your agent is spending six figures a year on tokens that are just tool catalogue. The overhead scales linearly with the number of services you connect.

The three-tier model that scales:

Built-ins: the small primitive set the harness ships (file ops, shell, code execution). Always loaded.
Curated MCP registry: a governed list of approved MCP servers with progressive disclosure. The agent sees metadata first, loads full tool schemas on semantic match.
Skills library: org-specific playbooks in a searchable registry, discovered by description, expanded on demand.

Capability surface: flat vs progressive

Same five MCP servers (GitHub, Jira, Cloud, Feature Flags, Incident). Same task. Different token math.

[Interactive figure: context window of 200k tokens, split into MCP, Skills, and headroom.]

Breakdown: built-ins 4.0k · mcp schemas 52.0k · skills 18.0k · org context 8.0k · headroom 118.0k
Tool schemas loaded: 94 schemas
Overhead before reasoning: 78,000 tokens (39% of the window). Schemas + skill docs, loaded before the model reads the task.
Cost per call @ sonnet: $0.234. $3 / 1M input tokens, overhead only.
Annual burn @ 1M calls: $234,000. Overhead alone. Reasoning is extra.
Effective headroom: 118k / 200k. Share of the window available for task + reasoning.
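The token math above is simple enough to reproduce. In the sketch below, the 78,000-token overhead, the $3 per million input tokens, and the 1M calls per year come from the figures just quoted; the 6,000-token progressive figure is a hypothetical placeholder for metadata-first loading, not a measured number.

```python
def overhead_cost(overhead_tokens, usd_per_million_input, calls_per_year):
    """Return (cost per call, annual cost) in USD for fixed prompt overhead."""
    per_call = overhead_tokens * usd_per_million_input / 1_000_000
    return per_call, per_call * calls_per_year


WINDOW = 200_000
flat_tokens = 78_000         # overhead figure quoted above
progressive_tokens = 6_000   # hypothetical: metadata-first loading

flat_call, flat_year = overhead_cost(flat_tokens, 3.0, 1_000_000)
prog_call, prog_year = overhead_cost(progressive_tokens, 3.0, 1_000_000)

print(f"flat: {flat_tokens / WINDOW:.0%} of window, "
      f"${flat_call:.3f}/call, ${flat_year:,.0f}/year")
print(f"progressive: {progressive_tokens / WINDOW:.0%} of window, "
      f"${prog_call:.3f}/call, ${prog_year:,.0f}/year")
```

Whatever the real progressive number turns out to be for your registry, the cost is linear in the overhead tokens, so every schema you avoid loading comes straight off the annual bill.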

Block’s Goose is the cleanest public expression of this. The operational equation is Goose = LLM + MCP + Agent, but the load-bearing piece isn’t MCP. It’s Goose’s Recipes and Sub-recipes. Recipes are declarative YAML workflows that encode a repeatable piece of work; sub-recipes run in isolated sub-sessions with their own context windows. That isolation keeps token cost linear in the work done rather than quadratic in conversation depth. The result is the 30-40% of code Block’s top engineers now get from Goose in legacy codebases, per the Sequoia interview.

Chapter 3: Identity, Policy, and the Execution Boundary

Every agentic action has two identities the audit team cares about: the human on whose behalf the agent is acting, and the agent’s own service identity. Conflate them and compliance review kills your rollout.

The human identity provides authorisation scope. The agent identity provides attribution and accountability (which agent, which version, which session). Every tool invocation carries both. Tokens are short-lived (minutes, not days) and scoped to the specific task, not the session. Policy is enforced at the MCP gateway, as code, so auditors can diff it and engineers can review it.

How do you keep humans in the loop without drowning them in approval prompts? The pattern that works is the asynchronous wait-state. When an agent hits a high-risk decision (production deploy, financial transaction, irreversible write), the workflow suspends, persists its state externally, and emits an approval event. Reviewers act on their own clock, often hours later. On approval, the signal routes back and the workflow resumes exactly where it left off.

The anti-pattern is approval fatigue. The fix is graduated trust: scope approvals by blast radius.

Read-only on scoped data: no approval.
Mutations inside a sandbox or personal branch: no approval, full audit.
PR against the main branch: standard code review.
Production-shaped actions (deploys, config changes, prod data reads): explicit, async approval with a named owner.
Irreversible (delete, drop, disable safety): two-person review.
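As policy-as-code, the graduated-trust ladder reduces to a lookup that fails closed. This is a minimal sketch: the action-class labels are hypothetical, and a real gateway would derive them from the tool call and its target rather than trusting a string.

```python
from enum import Enum


class Approval(Enum):
    NONE = "no approval"
    AUDIT_ONLY = "no approval, full audit"
    CODE_REVIEW = "standard code review"
    ASYNC_OWNER = "explicit async approval, named owner"
    TWO_PERSON = "two-person review"


# Hypothetical action classes, one per rung of the ladder above.
POLICY = {
    "read_scoped": Approval.NONE,
    "mutate_sandbox": Approval.AUDIT_ONLY,
    "pr_to_main": Approval.CODE_REVIEW,
    "prod_action": Approval.ASYNC_OWNER,
    "irreversible": Approval.TWO_PERSON,
}


def required_approval(action_class):
    """Look up the approval tier; unknown actions get the strictest one."""
    return POLICY.get(action_class, Approval.TWO_PERSON)


print(required_approval("mutate_sandbox").value)
print(required_approval("drop_database").value)
```

Defaulting unknown actions to the strictest tier is a design choice: a new tool that nobody has classified yet should be annoying to use, not silently unreviewed.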

GitHub’s Copilot Enterprise surface has become the most concrete public implementation. Per the December 2025 Enterprise roundup and Microsoft’s DevBlogs on agentic platform engineering, admins get fine-grained permissions, explicit MCP control, audit-log review, and policy-based gating of model upgrades.

Policy tells the agent what it’s allowed to try. The sandbox decides what happens when it tries the wrong thing. Three isolation tiers map to the graduated-trust model:

Isolation tier	Mechanism	Right fit
Docker + seccomp	Namespaces and cgroups; shared host kernel	Dev-loop agents on an engineer’s own repo
gVisor	User-space kernel intercepting ~70 syscalls	Platform-served workers (CI, migrations, autonomous PRs)
Firecracker / Kata	Per-workload Linux kernel via KVM	Untrusted, multi-tenant, or cross-org execution

Match the isolation tier to the trust tier. A read-only retrieval agent does not need a microVM. A production migration worker rewriting other teams’ code absolutely does.

Chapter 4: Context Engineering at Platform Scale

The frontier model isn’t your moat. The agent harness isn’t your moat. Your context is your moat: the graph of your repos, the runbooks nobody wrote down, the incident history, the ADRs, the style guides, the org’s service catalog.

MCP is connectivity, not context

A common mistake is assuming that connecting MCP servers to your agent solves the context problem. MCP gives your agent a standardised way to query any single system. Four things it does not do:

Cross-source retrieval. “Find everything relevant to this migration across code, tickets, docs, and incidents” requires a unified index. No single MCP server spans all your sources.
Pre-indexing. MCP queries are live. For a 500k-file monorepo, live search on every agent call is slow and expensive.
Governance. PII redaction, access-aware filtering, staleness SLOs. Each MCP server returns raw data under its own auth model.
Token-budget management. Fitting retrieved context to the model’s window is orchestration the pipeline owns, not the protocol.

The clean architecture: build the context pipeline, expose it as an MCP server. The agent queries one context endpoint. The pipeline behind it handles ingest, index, govern, serve.

The pipeline

Ingest. Connectors to the authoritative sources: repos, docs wiki, ticket system, incident tracker, service catalog. Each with an idempotent, versioned schema and an owner.

Index. Hybrid retrieval is the production default: BM25 for lexical recall, dense embeddings for semantic similarity, graph for structural relationships. No single index is sufficient.

Govern. Staleness SLOs per source. PII and secret redaction before indexing, not after retrieval. Access-aware retrieval: the retriever filters by the caller’s permissions before ranking. If your agent can see secrets its invoking user can’t, you have a data-exfiltration vulnerability wearing a productivity tool’s clothes.

Serve. A token-budget manager (compression, summarisation, eviction) that fits retrieved context to the model’s window and the task’s importance.
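The ordering in Govern and Serve matters: filter by permission, then rank, then pack. A minimal sketch of that order, with made-up chunks and group names, and greedy packing standing in for real compression or summarisation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float               # relevance from the hybrid retriever
    allowed_groups: frozenset  # who may read the source document
    tokens: int


def serve_context(chunks, caller_groups, token_budget):
    """Filter by permission first, then rank, then pack into the budget."""
    visible = [c for c in chunks if c.allowed_groups & caller_groups]
    ranked = sorted(visible, key=lambda c: c.score, reverse=True)

    packed, used = [], 0
    for chunk in ranked:
        if used + chunk.tokens <= token_budget:
            packed.append(chunk)
            used += chunk.tokens
    return packed


# Illustrative data only.
chunks = [
    Chunk("payments runbook", 0.9, frozenset({"payments"}), 400),
    Chunk("public style guide", 0.7, frozenset({"eng"}), 300),
    Chunk("incident notes", 0.6, frozenset({"eng", "sre"}), 500),
]

result = serve_context(chunks, caller_groups=frozenset({"eng"}), token_budget=600)
print([c.text for c in result])
```

The highest-scoring chunk never reaches the ranker because the caller cannot read it. Filtering after ranking would produce the same list here, but it would let restricted content influence scores and logs on the way.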

Augment Code’s Context Engine is the clearest public reference for this in 2026. It indexes up to 500,000 files across multiple repositories with roughly 100ms retrieval latency, building semantic dependency graphs. The telling move: Augment recently shipped the Context Engine as an MCP server, the exact pipeline-behind-protocol pattern. Sourcegraph’s Cody takes a three-layer approach (local file, local repo, remote repos), handling 300k+ repositories for enterprise customers. Stripe’s agent harness takes the curation angle: each “minion” gets scoped context per task, not the whole repo. Context curated, not copied.

The metric to watch: context hit rate per task type. If your hit rate is under 30%, your pipeline is ornamental.

Chapter 5: Workflows, the Unit That Ships

Four chapters described infrastructure. This chapter is about what the infrastructure produces. The deliverable is the workflow: a versioned, parameterised unit of work any engineer can build once, evaluate, and hand to other engineers (or to CI runners) who invoke it on a trigger they didn’t author.

The workflow lifecycle

author → trigger → run → observe

The control plane of Chapters 1–4 exists to make this lifecycle safe, cheap, and measurable. A workflow is the unit that ships.

Authors
Recipe YAML (default): metadata + version · parameters · extensions (MCP) · sub-recipes
Skills / DSL (alt): Claude Skills (md) · Temporal / LangGraph · Rovo Studio (low-code)

Platform
Triggers (fire): cron (goose schedule) · event (PR, issue, incident, webhook) · manual / API (goose run, goose serve)
Runtime (run): CI runner (ephemeral) · agent pool (Modal / E2B / Northflank) · laptop (dev-loop only)

Observability
Run record (first-class): trigger source · parameters · spans + retries · status · cost · trace ID
Governance (registry): SHA-pinned versions · ownership + review · deprecation windows

Feedback: Run records feed the eval golden set. Thumbs-downs become regression cases. A new Recipe version is shadow-run against live triggers before it's promoted.

Authoring

Four patterns; the choice follows who the author is:

Recipe / YAML: Goose Recipes, GitHub Agentic Workflows (Feb 2026 preview). Structured, diff-reviewable, CI-friendly. The enterprise default.
Prompt-as-code: Claude Skills. Flexible, closer to prose, weaker composition.
DSL / real code: Temporal, LangGraph, Kestra. Maximum control; needs engineer authors.
Low-code: Atlassian Rovo Studio. Natural-language authoring for non-engineers.

A Goose Recipe is the concrete shape most architects will end up writing:

name: pr_security_review
recipe:
  version: 1.0.0
  title: PR Security Review
  description: OWASP-informed review of a pull-request diff.
  settings:
    goose_provider: anthropic
    goose_model: claude-sonnet-4-5
  parameters:
    - key: pr_url
      input_type: string
      requirement: required
      description: "Pull request URL to review"
  extensions:
    - type: builtin
      name: developer
    - type: streamable_http
      name: github
      uri: https://api.githubcopilot.com/mcp/x/pull_requests/readonly
  instructions: |
    You are a security reviewer. Check the diff for OWASP Top-10
    issues, secrets, and unsafe patterns. Be specific and sparing.
  prompt: |
    Review PR {{ pr_url }}. For each finding, cite the file,
    line, severity, and suggested fix. Post findings as a single
    PR comment. If nothing is found, say so.

Every primitive the last four chapters described is visible here. settings routes through the LLM gateway. extensions declares which approved MCP servers the capability surface exposes. parameters is how a non-author reuses the workflow. instructions vs prompt separates policy from task, which is what makes a Recipe testable.
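To make the path from parameters to prompt concrete, here is a minimal sketch of how a required parameter might be checked and substituted into the `{{ pr_url }}` placeholder. The `render_prompt` helper and its regex are illustrative assumptions, not Goose's implementation, and the PR URL is a made-up example.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, declared: dict[str, str], values: dict[str, str]) -> str:
    """Fill {{ name }} placeholders after checking required parameters.

    `declared` maps parameter key -> requirement ("required" or "optional").
    """
    missing = [k for k, req in declared.items() if req == "required" and k not in values]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in declared:
            raise KeyError(f"placeholder {key!r} is not a declared parameter")
        return values.get(key, "")

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    prompt = render_prompt(
        "Review PR {{ pr_url }}.",
        declared={"pr_url": "required"},
        values={"pr_url": "https://example.com/org/repo/pull/1"},  # made-up URL
    )
    print(prompt)
```

Failing before the model is called, rather than sending a prompt with an empty placeholder, is what lets a non-author reuse the workflow safely.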

Parameterisation and sub-workflows

A Recipe without parameters is a one-off. With parameters, it’s a product. The sharper Goose primitive is the sub_recipes array: each sub-recipe runs in its own isolated subagent session with its own context window, and sequential_when_repeated: true/false picks parallel vs sequential execution. This is the orchestrator-worker pattern from Chapter 1, made concrete. It’s what makes the Airbnb migration topology possible: 3,500 files fan out across parallel sub-recipe invocations, each with fresh context, orchestrated by one parent.
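The fan-out shape can be sketched in a few lines. This is a toy under stated assumptions: `run_sub_recipe` is a hypothetical stand-in for one isolated subagent session, threads stand in for separate sessions, and the `sequential` flag only mimics the parallel-versus-sequential choice described above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sub_recipe(file_path: str) -> dict:
    """Stand-in for one isolated sub-recipe session (hypothetical).

    Each call starts with fresh local state, mirroring a fresh context window.
    """
    context = {"file": file_path}  # nothing shared with other workers
    return {"file": context["file"], "status": "migrated"}


def orchestrate(files: list[str], sequential: bool, max_workers: int = 8) -> list[dict]:
    """Parent fans work out to workers, either in order or in parallel."""
    if sequential:
        return [run_sub_recipe(f) for f in files]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even though workers run concurrently.
        return list(pool.map(run_sub_recipe, files))


if __name__ == "__main__":
    files = [f"tests/test_{i}.py" for i in range(5)]
    results = orchestrate(files, sequential=False)
    print(len(results), results[0])
```

The parent only sees each worker's result, never its working context, which is why the per-file context stays small no matter how many files fan out.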

Triggers

Cron. goose schedule add recipe.yaml --cron '0 9 * * 1-5'. Nightly lint, weekly security audit, daily stale-PR report. The built-in scheduler is single-machine; for distributed schedules, wrap with a Kubernetes CronJob or a Temporal worker pool.

Event-driven. PR opened, issue labelled, incident created, build failed. Atlassian’s Rovo Dev fires on every PR. The Goose GitHub Action wraps the same pattern: label an issue with goose and a PR opens. Event-driven is where agents stop being assistants and start being automation.

Manual / API. goose run -i recipe.yaml --param pr_url=https://... from a CI step, or goose serve running as a webhook receiver inside the cluster.

Runtime, observability, and governance

Triggered workflows run on ephemeral CI runners (GitHub Actions, Buildkite) for sub-five-minute PR-shaped work, or on dedicated agent pools for long-running stateful work. Match runtime to the trust tier from Chapter 3.

Every triggered run is a first-class object: trigger source, parameters, spans with retry counts, final status, cost, trace ID. Kestra recorded over two billion workflow executions in 2025, up from one hundred million in 2024. That twenty-fold increase signals the direction of travel. If your platform cannot answer “what ran when, triggered by what, with what outcome?” in two clicks, it is opaque.
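One way to picture the run record is as a small typed object. The schema below is a hypothetical sketch built from the fields listed above; the class and field names are assumptions, not any platform's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    retries: int = 0


@dataclass
class RunRecord:
    """One triggered run, with the fields listed above (hypothetical schema)."""
    recipe: str
    version: str
    trigger_source: str          # e.g. "cron", "event:pr_opened", "manual"
    parameters: dict[str, str]
    status: str                  # e.g. "succeeded", "failed"
    cost_usd: float
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def total_retries(self) -> int:
        return sum(span.retries for span in self.spans)

    def summary(self) -> str:
        """Answer 'what ran, triggered by what, with what outcome' in one line."""
        return (f"{self.recipe}@{self.version} via {self.trigger_source}: "
                f"{self.status}, ${self.cost_usd:.2f}, "
                f"{self.total_retries()} retries, trace {self.trace_id}")
```

If every run can produce a line like `summary()`, the two-click question becomes a query over stored records rather than a log-spelunking exercise.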

Shared workflows need product discipline. The GitHub Actions governance model (internal org, SHA-pinned versions, PR-reviewed contributions) is the pattern most enterprises borrow.

Chapter 6: Evaluation and Economics

Most platform teams skip evaluation and then wonder why their rollout plateaus. Evaluation is not a phase of delivery; it is the product that determines whether the other five chapters compound.

Silent failure

An agent completes its run without any software error (no exception, no crash, no red log line) and produces output that looks plausible and is wrong. The PR passes review because the diff looks reasonable. The test the agent wrote passes because it tests the buggy behaviour it introduced. Every DORA-2025 data point on increased change-failure rate is a silent-failure story that got written to disk.

The evaluation stack that catches silent failure has three layers.

Unit-level. Tool schemas, prompt templates, and system prompts each get their own regression suite. Every change runs a deterministic test set before it can ship.

Task-level. A curated golden set of real tasks, graded by LLM-as-judge with a rubric that includes business-outcome correctness, not just style. This is eval-as-CI.

Production. Shadow traffic and online signals: thumbs-up/down, PR accept rate on agent-authored code, downstream defect escape rate. The production signals feed back into the golden set. Every thumbs-down becomes a candidate regression test.

Atlassian’s Rovo Dev Code Reviewer ran a year-long evaluation across more than 1,900 internal repos before general availability. The result, published at ICSE 2026, was a 30.8% reduction in PR cycle time and a 35.6% reduction in human-written review comments. The same three eval layers apply at the Recipe level: shadow-run the candidate against live triggers before promoting; canary to a subset before broad ship.

Token economics

By the time you have 5,000 engineers on your platform, token cost is non-linear in three dimensions: context depth, fan-out, and retry depth.

Tiered routing. Simple classification and extraction route to a cheap model (Haiku-class). Standard code generation routes to mid-tier (Sonnet-class). Hard planning and architectural synthesis are reserved for the frontier (Opus-class). Defaulting every call to the most expensive model is the single largest source of cost inflation.
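A routing table is the simplest form this can take. The sketch below is illustrative only: the task taxonomy and the `route` function are assumptions, and a real gateway would classify requests itself rather than trust a caller-supplied label.

```python
from enum import Enum


class Tier(Enum):
    CHEAP = "haiku-class"
    MID = "sonnet-class"
    FRONTIER = "opus-class"


# Hypothetical task taxonomy, mirroring the three tiers described above.
ROUTES = {
    "classification": Tier.CHEAP,
    "extraction": Tier.CHEAP,
    "code_generation": Tier.MID,
    "planning": Tier.FRONTIER,
    "architecture": Tier.FRONTIER,
}


def route(task_type: str) -> Tier:
    """Pick a model tier; unknown work defaults to mid-tier, not the frontier."""
    return ROUTES.get(task_type, Tier.MID)


if __name__ == "__main__":
    for task in ("extraction", "code_generation", "planning", "something_new"):
        print(task, "->", route(task).value)
```

The default matters more than the table: falling back to mid-tier keeps unclassified traffic from silently landing on the most expensive model.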

Prompt caching as an SLI. Structured prompts should cache at 90%+ hit rate. A 90% cache hit translates to roughly 10x cost reduction on the cached portion. Cache hit rate deserves a dashboard, an owner, and an alert when it drops.
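The arithmetic is worth making explicit, because the 10x figure applies to tokens that hit the cache, not to the whole bill. The sketch below assumes cached reads are billed at one tenth of the normal input price and ignores any cache-write surcharge; both are assumptions to check against your provider's pricing.

```python
def blended_input_cost(hit_rate: float, cached_price_ratio: float = 0.1) -> float:
    """Relative input cost of a cacheable prefix, where 1.0 means no caching.

    cached_price_ratio is an assumption: cached reads billed at one tenth of
    the normal input price. Cache-write surcharges are ignored.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) + hit_rate * cached_price_ratio


if __name__ == "__main__":
    for rate in (0.5, 0.78, 0.9):
        cost = blended_input_cost(rate)
        print(f"hit rate {rate:.0%}: relative cost {cost:.3f}, {1 / cost:.1f}x cheaper")
```

Under these assumptions, the misses dominate the bill as the hit rate climbs, which is the practical argument for alerting on even small drops.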

Attribution at every level. Per-team, per-repo, per-task, per-session. Without attribution there’s no chargeback; without chargeback there’s no incentive for teams to care about efficiency.

Shopify’s LLM proxy, mentioned in Chapter 1, is the artefact that makes all of this possible. You cannot attribute cost you don’t see. You cannot route by complexity if requests bypass your router. Per First Round’s write-up, the proxy is what let Shopify’s engineering dashboard correlate AI usage with shipping impact, which in turn gave VP Eng Farhan Thawar the evidence to support the ~20% productivity gain the org now claims.

Figure: Agentic platform cost observability (Apr 2026, 4,200 engineers, live). Token → dollar → team → task: the view your CFO asks for on Monday morning. Headline tiles: monthly tokens 847M (+12.4% MoM); monthly spend $284.2k (+8.1% MoM); cache hit rate 78% (target > 75%); cost per merged PR $0.82 (−14.5% MoM). Drill-down panels: spend by team, model mix (Haiku / Sonnet / Opus), top task types.

What to measure

The most common failure mode in “AI productivity” reporting is Goodhart’s Law in a lab coat. A measurement stack that survives scrutiny operates in four families: proxy (acceptance rate, session count), activity (DORA: PR count, lead time, CFR), outcome (defect escape, rework, dev-reported friction), and economic (hours saved, cost per merged PR). An architect reporting to leadership needs at least one number from each.

Consider the published record: Uber reports ~10% PR-velocity lift (Pragmatic Engineer), an activity metric. Shopify claims a ~20% productivity gain, accompanied by a public refusal to measure it in LOC, an outcome claim. Block’s 8-10 hours saved per engineer per week is a clean economic metric. Airbnb’s 18 months to 6 weeks is a sharp outcome metric with a legible counterfactual. Same reality. Different slices.

Chapter 7: The Build Sequence

The platform described above is not a weekend project. It also does not require a three-year transformation program. The sequence that has worked in the public record collapses into three horizons.

Days 0-90. Stand up the minimum viable control plane.

Pick one harness. Don’t debate it for a quarter. Any of them is fine; the harness is replaceable.
Stand up the LLM gateway. Every agent request flows through it. Day-one cost attribution.
Ship one Recipe. Not twelve. Pick one repeatable task (PR security review, migration shard, on-call triage). Versioned, parameterised, triggered by one event, observable end-to-end. Everything else is scaffolding for the next Recipe.
Stand up one golden eval set with an LLM-as-judge rubric. Wire it into CI. Refuse to promote prompts or Recipes that regress.
Turn on OpenTelemetry tracing end-to-end.

Months 3-6. Build the moat.

Context pipeline for your top-five repos: ingest, index, govern, serve. Measure hit rate.
Policy-as-code at the gateway. Scoped tokens. Async approvals for production actions.
Expand the eval harness to workflow-level: golden sets of Recipe invocations, shadow-mode promotion.
First KPI dashboard: one proxy, one activity, one outcome, one economic metric.

Months 6-12. Compound.

Orchestrator-worker topology for the hard workloads: migrations, cross-repo refactors, bulk compliance work.
Recipe registry self-service with SHA-pinned versions. Teams contribute; the platform team curates.
Progressive autonomy tiers. Graduate teams through read-only, sandboxed, PR, and production as their eval and incident track record earns it.
Per-team chargeback. The budget conversation changes the usage conversation.

Fund internal DevRel from day one. Uber’s coursework moved Claude Code adoption from 32% to 63% of engineers in three months. Block’s engineers found Goose through Slack channels, not mandates. Shopify paired a top-down AI-first memo with bottom-up tool freedom through the LLM proxy. The technical platform and the organisational motion need to ship together.

In twelve months, when your CFO asks what AI is costing and what it’s earning, you have an answer, because you built a platform rather than bought a license. That’s the answer the 11% have. It’s not because they picked a better model.

References

Google Cloud / DORA. 2025 State of AI-Assisted Software Development Report. Source for 90% adoption, 30% distrust, PR size +154%, and the stability/throughput tension.
Faros AI. Key Takeaways from the DORA Report 2025. Practitioner analysis of the DORA findings.
McKinsey / KPMG. AI at Scale: Q4 2025 AI Pulse. Source for the four-stage maturity model and the ~11% AI-native figure.
OneReach / CIO. What Shapes Enterprise AI Agents in the Future. Source for the 95% zero-ROI and 14% change-management figures.
Block. Block Open Source Introduces “codename goose” and Goose on GitHub.
Sequoia. Training Data podcast with Dhanji Prasanna. Source for Block’s 8-10 hours/week, 25% target, and 30-40% legacy-code figures.
All Things Open. Meet Goose: The open source AI agent built for developers.
Bessemer Venture Partners. Inside Shopify’s AI-First Engineering Playbook.
First Round Review. From Memo to Movement: Shopify’s Cultural Adoption of AI.
Augment Code. Context Engine and Context Engine MCP now live. Source for the 500k-file indexing, ~100ms retrieval, and pipeline-behind-MCP pattern.
Pragmatic Engineer. How Uber Uses AI for Development. Source for the 84% agentic-coding adoption, Claude Code 32% to 63%, and DevRel investment.
Sourcegraph. How Cody understands your codebase and How Cody provides remote repository awareness. Source for the three-layer context architecture and 300k+ repo scale.
Atlassian. 30.8% Faster PRs: How AI-Driven Rovo Dev Code Reviewer Improved Developer Productivity. Source for the ICSE 2026 publication figures.
GitHub. December 2025 Enterprise Roundup. Source for Copilot Enterprise governance features.
Microsoft DevBlogs. Agentic Platform Engineering with GitHub Copilot.
Airbnb Engineering. Accelerating Large-Scale Test Migration with LLMs.
Anthropic. Model Context Protocol.
Gartner. 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026.
Block. Goose Recipes reference and Goose Recipes cookbook.
Pulse MCP. Configure your agent with Goose Recipes.
Block. Goose AI Developer Agent GitHub Action.
GitHub. Automate repository tasks with GitHub Agentic Workflows.
Kestra. Kestra 1.0 launch. Source for the 2B+ workflow executions in 2025.
Temporal. Orchestrating Ambient Agents with Temporal.
MindStudio. Stripe Minions vs Shopify Roast. Source for Stripe’s scoped-context agent pattern.
GitHub. Building organization-wide governance for CI/CD with GitHub Actions.

Inside Claude Code: Anatomy of a 512K-Line AI Agent

Wed, 08 Apr 2026 12:00:00 +0800

State Space Models and the Mamba Architecture: From First Principles to Mamba-3

Sun, 22 Mar 2026 10:00:00 +0800

What This Post Covers

NVIDIA’s Nemotron-3-Super is not a Transformer. Not entirely. It is a hybrid architecture that interleaves Mamba-2 SSM layers with select attention layers, spending the majority of its compute on state space operations rather than self-attention. It ships in production on NVIDIA’s inference stack and competes with pure Transformer models at the same scale. NVIDIA is not alone. IBM’s Granite 4.0 uses a 9:1 SSM-to-Transformer ratio. AI21’s Jamba uses one attention layer for every seven Mamba layers. Zyphra’s Zamba, Google’s Griffin, Microsoft’s Phi-4-mini-flash-reasoning: all hybrid architectures, all in production.

Something shifted. For years, the Transformer was the only architecture that mattered for language. Now a growing number of major AI labs are replacing most of their Transformer layers with SSM layers and getting better results at lower inference cost. If you deploy models, manage GPU clusters, or care about inference latency, this is worth understanding deeply.

This post builds State Space Models from zero. I start with the simplest possible differential equation: one variable, one parameter. From there, I build up to the full SSM formulation, explain the key breakthroughs (HiPPO, S4), and walk through the three generations of Mamba. By the end, you will understand the math well enough to explain to your team why these hybrid architectures are winning, what trade-offs they make, and what it means for your inference stack.

Part 1: Why SSMs? The Transformer’s Inference Problem

You already know the Transformer’s self-attention mechanism scales quadratically with sequence length: $O(L^2)$ in both time and memory. But the pain runs deeper than asymptotic notation.

During autoregressive decoding, the Transformer generates one token at a time. For each new token, it must load the entire KV cache from GPU HBM into SRAM, compute a single attention score against every previous token, and write the new KV entry back. The GPU spends the vast majority of its time moving data, not computing. On an H100 generating tokens from a 70B model, the Tensor Cores that deliver 989 TFLOPS of BF16 matmul sit almost entirely idle during decoding. The bottleneck is memory bandwidth, not compute.

This is why you need PagedAttention to manage fragmented KV cache memory. This is why vLLM exists: to batch requests efficiently despite variable KV cache sizes. This is why context windows beyond 128K tokens start requiring multi-GPU setups just to hold the KV cache.

State Space Models offer a fundamentally different deal. Instead of caching every token’s key-value pair (lossless but expensive), they compress the entire sequence history into a fixed-size hidden state (lossy but cheap). Processing each new token during inference takes $O(1)$ time and memory. No growing KV cache. No PagedAttention. Constant memory per sequence regardless of whether you have processed 100 tokens or 100,000.

The question has always been whether a compressed, lossy state can match the quality of the Transformer’s lossless KV cache. For years, the answer was no. SSMs excelled on audio, time series, and synthetic long-range benchmarks, but they lagged on language. The Mamba line of work changed that. To understand how, we need to start from scratch.

Part 2: State Space Models from Scratch

A Single Differential Equation

Forget matrices, vectors, and neural networks for a moment. Start with a single number $h(t)$ that changes over time:

$$h'(t) = a \cdot h(t)$$

$h'(t)$ is the rate of change of $h$ at time $t$. If you know calculus, this is the derivative. If not, think of it as: “how fast is $h$ changing right now?” When $h'(t)$ is positive, $h$ is increasing. When negative, $h$ is decreasing. When zero, $h$ is holding steady.

The constant $a$ controls everything:

$a > 0$: $h$ grows exponentially. Think compound interest. A bank account with interest rate $a = 0.05$ earns interest on its interest, accelerating upward forever. Unstable.
$a < 0$: $h$ decays exponentially. Think radioactive decay. A substance with decay rate $a = -0.5$ loses half its remaining mass roughly every 1.4 time units. The more you have, the faster it drains, but it never quite reaches zero. Stable.
$a = 0$: nothing happens. $h$ is constant forever.
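The 1.4 figure can be checked from the closed-form solution of the equation, as a short worked step:

$$h(t) = h(0)\,e^{a t} \quad\Longrightarrow\quad t_{1/2} = \frac{\ln 2}{|a|} \ \text{ for } a < 0, \qquad \frac{\ln 2}{0.5} \approx 1.39$$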

Figure: how the parameter $a$ controls state behavior, $h(t) = h(0) \cdot e^{a t}$. Shown at $a = -0.5$: stable (decay).

For building sequence models, we want $a < 0$. Our hidden state should be a fading memory, not an explosion.

Adding an Input

A decaying state by itself is useless. We need to feed information in:

$$h'(t) = a \cdot h(t) + b \cdot x(t)$$

Now $x(t)$ is an input signal (think: a stream of token embeddings arriving over time), and $b$ controls how strongly the input drives the state.

Picture a leaky bucket with a tap. The water level $h(t)$ is the state. The hole in the bottom drains water at rate $|a| \cdot h(t)$ (recall that $a$ is negative): the more water in the bucket, the faster it leaks (more pressure = faster drain). The tap $b \cdot x(t)$ pours water in at a rate proportional to the input signal. The water level at any moment reflects a fading, weighted average of all the water that has ever been poured in, with recent additions contributing more because older ones have partially leaked away.

Figure: the leaky bucket, the core SSM intuition. State as a water level: inputs pour in, decay leaks out. The water level at any moment is a fading, weighted average of all past inputs.

This is the core intuition for the entire SSM line of work. The hidden state $h(t)$ is a running, compressed summary of the input history, where old inputs fade at a rate controlled by $a$.

Adding an Output

We read out the state with simple scaling:

$$y(t) = c \cdot h(t)$$

The output $y(t)$ is just a weighted view of the state. Together, these two equations form the complete scalar SSM:

$$h'(t) = a \cdot h(t) + b \cdot x(t) \quad \text{(state equation)}$$

$$y(t) = c \cdot h(t) \quad \text{(output equation)}$$

Three parameters. One input, one hidden state, one output. This is the entire architecture, in its simplest form.

Why One Bucket Is Not Enough

A single leaky bucket has one leak rate, which means one timescale of memory. If $a = -0.5$, the state “forgets” with a half-life of about 1.4 time units. It cannot simultaneously maintain a short-term memory (last few tokens) and a long-term memory (paragraph-level context).

The fix: use $N$ buckets, each with a different leak rate.

Figure: multiple timescales via multiple state dimensions. One state dimension has one leak rate and so one timescale; generalizing to $N$ dimensions gives each its own decay rate (eigenvalue), for example $\lambda_1 = -0.01$ (long memory), $\lambda_2 = -0.1$, $\lambda_3 = -0.5$, $\lambda_4 = -2.0$ (short memory).

This is where scalars become vectors. The scalar state $h(t)$ becomes an $N$-dimensional vector $\mathbf{h}(t) \in \mathbb{R}^N$. The scalar parameters become matrices:

$A \in \mathbb{R}^{N \times N}$ (state to state): governs how each of the $N$ state dimensions evolves and potentially interacts with the others. It is $N \times N$ because each state dimension can influence every other state dimension.
$B \in \mathbb{R}^{N \times 1}$ (input to state): fans a scalar input out into $N$ state dimensions. It is $N \times 1$ because it needs to distribute one input value across $N$ state slots. Think of it as an adapter between a narrow input pipe and a wide state vector.
$C \in \mathbb{R}^{1 \times N}$ (state to output): narrows the wide state back down to a scalar output. It is $1 \times N$ because it takes a weighted combination of all $N$ state dimensions to produce one output value.

$$\mathbf{h}'(t) = A \cdot \mathbf{h}(t) + B \cdot x(t)$$

$$y(t) = C \cdot \mathbf{h}(t)$$

Figure: the SSM as a pipeline. The input fans out to an $N$-dimensional state, then narrows back to the output: $x(t) \in \mathbb{R}$ (scalar input) → $B \in \mathbb{R}^{N \times 1}$ (fans out) → $\mathbf{h}(t) \in \mathbb{R}^N$ → $A \in \mathbb{R}^{N \times N}$ (state dynamics) → $\mathbf{h}(t) \in \mathbb{R}^N$ → $C \in \mathbb{R}^{1 \times N}$ (narrows back) → $y(t) \in \mathbb{R}$ (scalar output).

In practice, $A$ is almost always diagonal. A diagonal $A$ means each state dimension evolves independently. No cross-talk between buckets. Dimension 1 decays at its own rate, dimension 2 at its own rate, and so on. This simplification works just as well empirically (the S4D paper showed this) and is much cheaper to compute.

Eigenvalues: The Retention Rates

For a diagonal $A$, the diagonal entries ARE the eigenvalues. No linear algebra required to understand this. Each eigenvalue $\lambda_i$ is simply the decay rate of one state dimension. Think of them as $N$ different bank account interest rates running simultaneously:

$\lambda_i = -0.01$: very slow decay. This dimension remembers inputs from thousands of timesteps ago. It is the long-term savings account.
$\lambda_i = -0.5$: moderate decay. This dimension tracks information over dozens of timesteps.
$\lambda_i = -2.0$: fast decay. This dimension mostly tracks the last few inputs. It is the checking account that turns over quickly.
$\lambda_i > 0$: growth. Unstable. The state explodes. We never want this.

By having $N$ state dimensions with different eigenvalues, the model simultaneously maintains memory at multiple timescales. Some dimensions track recent tokens (large negative eigenvalues, fast decay), others preserve long-range context (small negative eigenvalues, slow decay).
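To see the timescales side by side, the sketch below drops a unit impulse into four independent state dimensions and reports how much of it survives. It is a toy under stated assumptions: the step size $\Delta = 0.1$ is picked for illustration, and each step uses the exact decay factor $e^{\Delta \lambda_i}$ rather than any particular discretization.

```python
import math

EIGENVALUES = (-0.01, -0.1, -0.5, -2.0)


def impulse_memory(eigenvalues, delta: float, steps: int) -> list[float]:
    """Fraction of a unit impulse left in each state dimension after `steps` tokens.

    With a diagonal A every dimension decays independently, so the
    state update is elementwise: h_i <- exp(delta * lambda_i) * h_i.
    """
    state = [1.0 for _ in eigenvalues]
    decay = [math.exp(delta * lam) for lam in eigenvalues]
    for _ in range(steps):
        state = [d * h for d, h in zip(decay, state)]
    return state


if __name__ == "__main__":
    for steps in (10, 100, 1000):
        remaining = impulse_memory(EIGENVALUES, delta=0.1, steps=steps)
        print(steps, [f"{r:.3g}" for r in remaining])
```

How many tokens a given eigenvalue spans depends on the step size: halving $\Delta$ doubles the number of tokens each dimension remembers.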

This is why the initialization of $A$ matters enormously. If you set eigenvalues randomly, you get random memory timescales, and the model struggles to learn useful representations of sequences. Random eigenvalues might cluster all your memory in one timescale, leaving gaps in others. This is exactly the problem that HiPPO solved. But we need to cover discretization first.

Part 3: Discretization, Making It Computable

Why Discretize

The continuous ODE $\mathbf{h}'(t) = A\mathbf{h}(t) + Bx(t)$ processes smooth, continuous signals. But an LLM does not receive continuous signals. It receives a discrete sequence of tokens, one after another. We need to convert the continuous dynamics into a step-by-step recurrence: given the previous state and the current token, compute the next state.

The Step Size $\Delta$

Discretization introduces a learnable parameter $\Delta$ that controls “how much continuous time passes between tokens.” Small $\Delta$ means the model takes fine-grained steps, preserving detailed temporal structure. Large $\Delta$ means coarse steps, compressing more time into each token. Each channel in the model can learn its own $\Delta$, so different parts of the network can operate at different temporal resolutions.

The Euler Discretization

The simplest approach: approximate the derivative as constant over each timestep, following the Euler method from numerical analysis. This gives us discrete parameters:

$$\bar{A} = I + \Delta A$$

$$\bar{B} = \Delta B$$

This is first-order accurate, with local truncation error $O(\Delta^2)$. There are more accurate methods (zero-order hold, bilinear transform), but Euler is the one that matters for the Mamba story because Mamba-3 directly improves on it.
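The two formulas drop out of a one-step Taylor expansion. As a short derivation, assuming the input $x$ is held constant across the step:

$$\mathbf{h}(t + \Delta) \approx \mathbf{h}(t) + \Delta\,\mathbf{h}'(t) = \mathbf{h}(t) + \Delta\big(A\,\mathbf{h}(t) + B\,x(t)\big) = (I + \Delta A)\,\mathbf{h}(t) + \Delta B\,x(t)$$

The exact one-step transition is $e^{\Delta A} = I + \Delta A + \tfrac{1}{2}\Delta^2 A^2 + \cdots$, so Euler keeps the first two terms and drops everything from $\Delta^2$ onward. For $a = -0.5$ and $\Delta = 0.1$, that is $0.95$ against $e^{-0.05} \approx 0.9512$.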

The Discrete Recurrence

The discretized system gives us a step-by-step formula. At each timestep $k$:

$$\mathbf{h}_k = \bar{A} \cdot \mathbf{h}_{k-1} + \bar{B} \cdot x_k$$

$$y_k = C \cdot \mathbf{h}_k$$

Given the previous state and the current input, compute the next state and output. No calculus needed at runtime.

Numerical Walkthrough

Let me make this concrete. Take a scalar system with $a = -0.5$, $b = 1.0$, $c = 1.0$, and step size $\Delta = 0.1$.

$$\bar{a} = 1 + (-0.5)(0.1) = 0.95 \qquad \bar{b} = 0.1$$

Now run 5 timesteps with the input sequence $[1, 1, 0, 0, 1]$, starting from a zero state ($h_{-1} = 0$, so the first update at $k = 0$ sees an empty state):

# a_bar = 0.95, b_bar = 0.1

# k=0: input=1  h = 0.95 * 0      + 0.1 * 1 = 0.1000   y = 0.1000
# k=1: input=1  h = 0.95 * 0.1    + 0.1 * 1 = 0.1950   y = 0.1950
# k=2: input=0  h = 0.95 * 0.195  + 0.1 * 0 = 0.1853   y = 0.1853
# k=3: input=0  h = 0.95 * 0.1853 + 0.1 * 0 = 0.1760   y = 0.1760
# k=4: input=1  h = 0.95 * 0.176  + 0.1 * 1 = 0.2672   y = 0.2672

The state accumulates when inputs arrive (steps 0-1, step 4) and decays when they stop (steps 2-3). You can verify every number with a calculator. There is nothing hidden in the SSM recurrence: it is a multiply-and-add, repeated.
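If you would rather check it with code than a calculator, the walkthrough fits in a dozen lines. This is a minimal sketch of the Euler-discretized scalar recurrence; the function name `run_scalar_ssm` is mine.

```python
def run_scalar_ssm(a: float, b: float, c: float, delta: float, inputs: list[float]):
    """Euler-discretized scalar SSM, starting from a zero state.

    Returns one (state, output) pair per input token.
    """
    a_bar = 1.0 + delta * a   # 1 + (-0.5)(0.1) = 0.95 in the walkthrough
    b_bar = delta * b         # 0.1 in the walkthrough
    h = 0.0
    trace = []
    for x in inputs:
        h = a_bar * h + b_bar * x   # the entire recurrence
        trace.append((h, c * h))
    return trace


if __name__ == "__main__":
    for k, (h, y) in enumerate(run_scalar_ssm(-0.5, 1.0, 1.0, 0.1, [1, 1, 0, 0, 1])):
        print(f"k={k}: h = {h:.4f}  y = {y:.4f}")
```

The unrounded states are 0.1, 0.195, 0.18525, 0.1759875, and 0.267188125. The trace above carries four decimals, so the last digit of a printed value can differ by one depending on how 0.18525 is rounded.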

The Dual Computation Modes

This is the SSM’s defining superpower. In the formulation above, $A$, $B$, and $C$ are constants that do not change over time. This property is called Linear Time-Invariance (LTI), and it unlocks something powerful. Because the recurrence $\mathbf{h}_k = \bar{A}\mathbf{h}_{k-1} + \bar{B}x_k$ is linear with fixed parameters, we can unroll it algebraically:

$$\mathbf{h}_k = \bar{A}^k \bar{B} x_0 + \bar{A}^{k-1}\bar{B}x_1 + \cdots + \bar{B}x_k$$

The output $y_k = C\mathbf{h}_k$ is then a weighted sum of all past inputs, with weights $K = (C\bar{B},\ C\bar{A}\bar{B},\ C\bar{A}^2\bar{B},\ \ldots)$. This sequence of weights is a convolution kernel.

This means we can compute the output of the SSM in two completely different ways:

Training mode (convolution): Compute the kernel $K$ once, then convolve it with the entire input sequence via FFT in $O(L \log L)$. Fully parallel, like a CNN. The GPU processes all $L$ tokens simultaneously.

Inference mode (recurrence): Step through $\mathbf{h}_k = \bar{A}\mathbf{h}_{k-1} + \bar{B}x_k$ one token at a time in $O(1)$ per step. The state $\mathbf{h}$ is a fixed-size vector regardless of how many tokens have been processed. No KV cache.
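The equivalence of the two modes is easy to confirm numerically. The sketch below is scalar and uses a direct $O(L^2)$ convolution instead of an FFT, so it shows only that the outputs agree, not the speed-up; the function names are mine.

```python
def recurrent_outputs(a_bar, b_bar, c, inputs):
    """Inference mode: one multiply-add per token, fixed-size state."""
    h, ys = 0.0, []
    for x in inputs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


def conv_kernel(a_bar, b_bar, c, length):
    """K = (c*b_bar, c*a_bar*b_bar, c*a_bar^2*b_bar, ...)."""
    return [c * (a_bar ** j) * b_bar for j in range(length)]


def convolutional_outputs(a_bar, b_bar, c, inputs):
    """Training mode, written as a direct causal convolution.

    A real implementation would use an FFT for O(L log L); this O(L^2)
    loop only demonstrates that the two modes agree.
    """
    kernel = conv_kernel(a_bar, b_bar, c, len(inputs))
    return [sum(kernel[j] * inputs[k - j] for j in range(k + 1))
            for k in range(len(inputs))]


if __name__ == "__main__":
    xs = [1, 1, 0, 0, 1]
    print(recurrent_outputs(0.95, 0.1, 1.0, xs))
    print(convolutional_outputs(0.95, 0.1, 1.0, xs))
```

Every output in the convolutional version depends only on the inputs and the precomputed kernel, never on a previous output, which is what makes it parallel.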

Figure: the dual computation modes (same model, two ways to compute, one per phase). Training mode, convolution (parallel): $y = x * \bar{K}$ via FFT over $x_1, \ldots, x_L$, $O(L \log L)$, all tokens processed simultaneously. Inference mode, recurrence (sequential): $\mathbf{h}_t = \bar{A}\mathbf{h}_{t-1} + \bar{B}x_t$, state stays fixed-size $N$, $O(1)$ per token, no KV cache needed.

“Train like a CNN, infer like an RNN.” This is the fundamental efficiency proposition. During training, you get the parallelism of convolutions. During inference, you get the constant-time, constant-memory decoding of RNNs, without the KV cache that makes Transformer inference expensive.

This duality is only possible because the system is LTI: the parameters $A$, $B$, $C$ are fixed, so the same convolution kernel $K$ applies to every input. When parameters become input-dependent (which is what Mamba does), there is no single kernel for the whole sequence. The duality breaks, and new algorithms are needed.

Part 4: HiPPO, The Initialization That Made SSMs Work

Before HiPPO, SSMs initialized the state matrix $A$ randomly. Random eigenvalues produce random memory timescales. On Sequential MNIST (classifying a handwritten digit fed one pixel at a time, 784 steps), random initialization achieved about 60% accuracy. Barely above chance for some digit classes.

Albert Gu’s HiPPO framework (2020) solved this by deriving $A$ matrices from a mathematical objective: at every timestep, the state should store the best polynomial approximation of the entire input history. Each state dimension corresponds to one polynomial coefficient, with low-order coefficients capturing broad trends (long-range memory) and high-order coefficients capturing fine details (short-range memory). The resulting $A$ matrix has eigenvalues arranged to cover multiple timescales without redundancy.

The concrete impact: switching from random $A$ to HiPPO improved Sequential MNIST from 60% to 98%. Same architecture, same training. Only the initialization of $A$ changed.
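For reference, the most commonly used variant, HiPPO-LegS, has a closed form. This is quoted from the S4 paper rather than derived here, so treat the indexing convention ($n, k$ starting at 0) as an assumption:

$$A_{nk} = -\begin{cases} \sqrt{(2n+1)(2k+1)} & \text{if } n > k \\ n + 1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases}$$

The matrix is lower-triangular, so its eigenvalues are the diagonal entries $-1, -2, \ldots, -N$: a ladder of decay rates from slow to fast, which is the multi-timescale coverage described above.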

Part 5: S4 and S4D, Making SSMs Practical

S4 (Gu, Goel, and Ré, 2022) was the first architecture to make deep SSMs work at scale by finding an efficient algorithm to compute the convolution kernel from a HiPPO-initialized $A$ matrix. It was the first model to solve long-range tasks at sequence lengths of 16,000+ (the Path-X task in Long Range Arena), a result no Transformer or RNN had achieved at the time. S4 also fully exploited the recurrent-convolutional duality: convolution mode for training, recurrence mode for inference.

A key simplification followed quickly. S4D (2022) showed that restricting $A$ to a fully diagonal matrix matched S4’s performance while dramatically simplifying the implementation. Independent state dimensions with well-chosen eigenvalues were sufficient. This diagonal restriction became the standard for all subsequent work, including Mamba.

Part 6: Mamba-1, Selectivity Changes Everything

The LTI Problem

S4 and its variants excelled on continuous signals and synthetic long-range benchmarks. On language modeling, they consistently lagged behind Transformers of the same size.

The reason is exactly the LTI limitation I described earlier. In an LTI system, the matrices $A$, $B$, $C$ are fixed constants. Every token receives identical treatment. The Mamba paper demonstrated this failure precisely with two diagnostic tasks:

Selective Copying: Given “A B _ _ _ C _ _ A _ _”, copy only A, B, C while ignoring the padding underscores. An LTI system cannot distinguish content tokens from padding because it applies the same transformation to everything.

Induction Heads: Given “A B … A ?”, recall that B followed A earlier and predict B. This requires content-based lookup: comparing the current token (A) against stored tokens to find what came after it. An LTI system has no mechanism for content comparison.

Language is full of these patterns. The word “not” should be remembered differently from the word “the.” A name mentioned once in a document needs to be retrievable later. All of this requires the model to make content-dependent decisions about what to store and what to forget.

The Fix: Input-Dependent Parameters

The December 2023 paper “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” by Albert Gu and Tri Dao introduced a single, elegant idea: make $B$, $C$, and $\Delta$ functions of the input token.

$$B_t = \text{Linear}(x_t) \in \mathbb{R}^N$$

$$C_t = \text{Linear}(x_t) \in \mathbb{R}^N$$

$$\Delta_t = \text{softplus}(\text{Linear}(x_t)) \in \mathbb{R}^+$$

Now the model dynamically modulates its behavior on a per-token basis. When the network encounters an important token, it can predict a large $\Delta_t$ to reset the state and absorb the new information. When it encounters filler, it can predict a tiny $\Delta_t$ to preserve existing memory and let the filler leak away.

The roles of each parameter are clear. $\Delta$ controls the gate: large $\Delta$ resets the state and focuses on the current input; small $\Delta$ persists the state and ignores the current input. $B$ controls what enters the state (content-based filtering of what to remember). $C$ controls what exits (content-based modulation of what to read out).

Note that the state matrix $A$ itself is not made input-dependent: it stays a learned parameter shared across all tokens. This is intentional. $A$ affects the discrete recurrence only through its interaction with $\Delta$ via $\bar{A} = \exp(\Delta A)$, so making $\Delta$ input-dependent is sufficient to make the state transition input-dependent.
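To make the selection mechanism concrete, here is a minimal single-channel sketch in plain Python. It is not the Mamba implementation: the helper name `selective_step`, the bias `b_dt`, and the toy weights are assumptions for illustration, and the input discretization uses the simple form $\bar{B} = \Delta B$.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def selective_step(h, x_t, A, W_B, W_C, w_dt, b_dt=0.0, channel=0):
    """One selective-SSM update for a single channel (hypothetical helper).

    h: state (N floats); x_t: token features (D floats); A: diagonal of the
    input-independent state matrix (N negative floats); W_B, W_C: N x D
    projection weights; w_dt, b_dt: weights and bias of the step-size projection.
    """
    B_t = [dot(row, x_t) for row in W_B]      # what enters the state
    C_t = [dot(row, x_t) for row in W_C]      # what is read out
    dt = softplus(dot(w_dt, x_t) + b_dt)      # the gate
    u = x_t[channel]                          # scalar input of this channel
    # A_bar = exp(dt * A) decays the old state; B_bar = dt * B_t scales the input.
    h_new = [math.exp(dt * a) * h_n + dt * b * u for a, h_n, b in zip(A, h, B_t)]
    y = dot(C_t, h_new)
    return h_new, y, dt

# Toy setup: the bias alone pushes the step size to be tiny or large.
A = [-1.0, -2.0]
I2 = [[1.0, 0.0], [0.0, 1.0]]
h = [1.0, 1.0]
x = [1.0, 0.5]
h_keep, _, dt_small = selective_step(h, x, A, I2, I2, [0.0, 0.0], b_dt=-20.0)
h_reset, _, dt_large = selective_step(h, x, A, I2, I2, [0.0, 0.0], b_dt=20.0)
print(dt_small, h_keep)    # tiny step: state carried over, input ignored
print(dt_large, h_reset)   # large step: old state forgotten, input absorbed
```

In a real model the step size would vary with the token through `w_dt`; the bias is used here only to show the two extremes side by side.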

The Cost: Convolution Mode Breaks

Input-dependent parameters mean the system is no longer LTI. $B_t$ and $C_t$ change at every timestep, so there is no single convolution kernel $K$ that describes the entire sequence. The FFT-based parallel training mode is gone.

Naively, this forces sequential computation: process token 1 to get $h_1$, then token 2 to get $h_2$, and so on. This would be catastrophically slow on GPUs, which need parallel workloads to achieve decent utilization.

Mamba-1 solved this with a hardware-aware selective scan algorithm, directly inspired by FlashAttention’s approach to the memory hierarchy. The key idea: fuse all SSM operations (discretization, recurrence, output) into a single GPU kernel that runs entirely in SRAM, avoiding expensive HBM round-trips. The recurrence is parallelized using a parallel scan that exploits the associativity of the multiply-add operation, and intermediate states are recomputed in the backward pass rather than stored. The result: up to 40x faster than a naive implementation, with the same memory footprint as an optimized Transformer with FlashAttention.
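The associativity that makes the scan parallelizable can be shown in a few lines. The sketch below is plain Python, not the fused CUDA kernel; the function names are illustrative. Each step $h_t = a_t h_{t-1} + b_t$ is treated as a pair $(a_t, b_t)$, and two consecutive steps compose into another pair of the same form.

```python
def combine(left, right):
    # Apply `left` first, then `right`: h -> a2 * (a1 * h + b1) + b2.
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(pairs, h0=0.0):
    # Reference: one step at a time.
    h, out = h0, []
    for a, b in pairs:
        h = a * h + b
        out.append(h)
    return out

def scan_by_doubling(pairs):
    # Inclusive scan in about log2(L) rounds. Within a round every combine is
    # independent of the others, which is what a GPU can run in parallel.
    elems = list(pairs)
    step = 1
    while step < len(elems):
        elems = [combine(elems[i - step], elems[i]) if i >= step else elems[i]
                 for i in range(len(elems))]
        step *= 2
    # With h0 = 0 the accumulated offset is the state itself.
    return [b for _, b in elems]

pairs = [(0.9, 1.0), (0.5, 2.0), (0.8, 0.0), (0.7, 3.0), (0.95, -1.0)]
print(sequential_scan(pairs))
print(scan_by_doubling(pairs))
```

Both calls print the same states; the second gets there without ever depending on the previous step's finished result.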

The Mamba Block

A common misconception is that Mamba replaces only the attention layer. It replaces both attention AND the MLP. A standard Transformer decoder block has two sub-layers: multi-head self-attention (which mixes information across sequence positions) and a feed-forward network (which mixes information across feature channels). The Mamba block handles both in a single, unified structure.

Figure: Transformer vs Mamba: Block Architecture. Same job (sequence mixing + channel mixing), different structure.

Transformer Decoder Block: Input → LayerNorm → Multi-Head Self-Attention (sequence mixing) → LayerNorm → Feed-Forward Network / MLP (channel mixing) → Output.

Mamba Block: Input → LayerNorm → Linear Projection (expand) → split into two branches: Conv1d → SiLU → Selective SSM (sequence mixing), and SiLU (gate / channel mixing) → ⊗ → Linear Projection → Output.

Mamba replaces both attention and MLP in a single block.

Here is how the Mamba block works. The input ($B \times L \times D$) passes through a LayerNorm and is linearly projected into two streams, each with the feature dimension expanded by a factor of $E = 2$. The two streams feed two parallel branches:

The SSM branch (left): Processes through a short 1D causal convolution (width 4) to capture immediate local patterns between neighboring tokens. Then a SiLU activation. Then three parallel linear projections produce the token-specific $\Delta_t$, $B_t$, and $C_t$. The selective SSM recurrence runs using these dynamic parameters. This branch handles sequence mixing: how information flows across token positions.

The gate branch (right): Takes the other half of the expanded input and passes it through a SiLU activation. This branch serves as a dynamic gate that controls which channels of the SSM output are passed through and which are suppressed.

The two branches merge via element-wise multiplication. If elements in the gating vector are near zero, the corresponding SSM information is suppressed. The result passes through a linear projection back to dimension $D$ and is added to the input via a residual connection.
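The data flow of the two branches can be sketched for a single channel in plain Python. This is a simplification under stated assumptions: LayerNorm, the input and output projections, and the residual connection are omitted, the SSM is passed in as a callable, and all names are hypothetical.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def causal_conv1d(seq, kernel):
    # kernel[-1] multiplies the current token; earlier taps look backward only.
    width = len(kernel)
    padded = [0.0] * (width - 1) + list(seq)
    return [sum(k * padded[t + j] for j, k in enumerate(kernel))
            for t in range(len(seq))]

def mamba_block_channel(x_branch, z_branch, kernel, ssm):
    """SSM branch (conv -> SiLU -> SSM) multiplied by the gate branch (SiLU)."""
    conv = causal_conv1d(x_branch, kernel)
    act = [silu(v) for v in conv]
    y = ssm(act)                                  # sequence mixing
    return [y_t * silu(z_t) for y_t, z_t in zip(y, z_branch)]   # gating

x = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.1, 0.2, 0.3, 0.4]                     # width 4, as in the text
identity_ssm = lambda seq: list(seq)              # stand-in for the selective SSM
open_gate = mamba_block_channel(x, [5.0] * 5, kernel, identity_ssm)
closed_gate = mamba_block_channel(x, [-50.0] * 5, kernel, identity_ssm)
print(open_gate)
print(closed_gate)   # gate values near zero suppress the SSM output
```

Swapping `identity_ssm` for a real recurrence changes only the sequence-mixing step; the gating logic stays the same.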

The entire block is one homogeneous module. No separate attention layer. No separate MLP.

Inference: The Fundamental Trade-off

	Transformer	Mamba-1
State per sequence	KV cache: grows with each token	Hidden state: fixed-size vector
Memory complexity	$O(L)$ per sequence	$O(1)$ per sequence
Compute per new token	$O(L)$: attend to all previous tokens	$O(1)$: one state update
At 128K context	~200x larger than Mamba state	~2.6 MiB per sequence
Memory type	Lossless: any past token retrievable	Lossy: compressed summary
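The scaling behavior in the table can be reproduced with a back-of-the-envelope calculator. The layer counts and dimensions below are illustrative assumptions, not the configuration behind the table's figures, and the sketch ignores Mamba's small convolution state.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (factor 2) for every layer, KV head, and cached token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, state_dim, bytes_per_elem=2):
    # One N-dimensional state per inner channel per layer; no seq_len term.
    return n_layers * d_inner * state_dim * bytes_per_elem

# Hypothetical model shapes chosen only to show the trend.
for seq_len in (1024, 32 * 1024, 128 * 1024):
    kv = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)
    ssm = ssm_state_bytes(n_layers=32, d_inner=4096, state_dim=16)
    print(f"L={seq_len:>7}: KV cache {kv / 2**20:8.1f} MiB, SSM state {ssm / 2**20:5.1f} MiB")
```

The KV figure grows in proportion to `seq_len` while the SSM figure does not change, which is the whole trade-off in two functions.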

Figure: Inference Memory: The Fundamental Trade-off. Lossless (KV cache) vs compressed (hidden state). Transformer: growing KV cache, O(L) memory per sequence. Mamba: fixed-size state h ∈ ℝ^N, O(1) memory per sequence.

The trade-off is fundamental. The Transformer’s KV cache stores every token (lossless, but $O(L)$ memory). Mamba’s hidden state compresses all history into a fixed vector (lossy, but $O(1)$ memory). The question is always whether the compressed representation is good enough for the task.

Mamba-1 demonstrated that it was. Mamba-2.8B matched or exceeded Pythia-6.9B (a model more than twice its size) on zero-shot downstream evaluations. Among models trained on the Pile, Mamba-1.4B averaged 59.7% across common-sense reasoning benchmarks, matching Pythia-2.8B (59.1%). At batch size 16, Mamba-2.8B completed generation in 18.6 seconds versus GPT-Neo-2.7B’s 65.9 seconds (3.5x faster). GPT-Neo ran out of memory at batch size 32 on a 64GB GPU; Mamba continued scaling to batch 128+.

Part 7: Mamba-2, Maximizing GPU Utilization

Mamba-1 had an embarrassing practical problem: it was 2-3x slower than equivalently sized Transformers during training. The root cause is that modern GPUs deliver roughly 16x more throughput for matrix multiplication (via Tensor Cores) than for general arithmetic. Transformers are pure matmul. Mamba-1’s selective scan was not.

Tri Dao and Albert Gu’s May 2024 paper “Transformers are SSMs” solved this by proving that unrolling the SSM recurrence produces a structured matrix that can be computed via matrix multiplications. The resulting algorithm (SSD) splits the sequence into chunks: within each chunk, the computation runs as dense matmuls on Tensor Cores; between chunks, a short scan passes state forward. Training speed improved 2-8x over Mamba-1.
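The chunking idea can be illustrated with a scalar-state toy in plain Python. This is a sketch of the principle only, not the SSD kernel: inside each chunk the outputs come from a lower-triangular decay matrix (the part that maps to dense matmuls), and a single state value is handed from one chunk to the next. All names are illustrative.

```python
def sequential(a, b, c, x):
    # Reference recurrence: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t.
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys

def chunked(a, b, c, x, chunk=4):
    ys, h_in = [], 0.0
    for start in range(0, len(x), chunk):
        n = min(start + chunk, len(x)) - start
        # L[i][j] = a_{j+1} * ... * a_i inside the chunk (1 on the diagonal).
        L = [[0.0] * n for _ in range(n)]
        carry = [0.0] * n          # carry[i] = a_start * ... * a_i
        for i in range(n):
            decay = 1.0
            for j in range(i, -1, -1):
                L[i][j] = decay
                decay *= a[start + j]
            carry[i] = decay
        # Intra-chunk: a masked matrix-vector product.
        for i in range(n):
            intra = sum(c[start + i] * L[i][j] * b[start + j] * x[start + j]
                        for j in range(n))
            ys.append(intra + c[start + i] * carry[i] * h_in)
        # Inter-chunk: one short state hand-off.
        h_in = carry[n - 1] * h_in + sum(L[n - 1][j] * b[start + j] * x[start + j]
                                         for j in range(n))
    return ys

a = [0.9, 0.8, 0.95, 0.7, 0.85, 0.99, 0.6, 0.9, 0.75, 0.8]
b = [1.0, 0.5, -1.0, 2.0, 0.3, 1.0, -0.7, 0.2, 1.5, -1.0]
c = [0.2, 1.0, 0.5, -1.0, 0.7, 0.3, 1.2, -0.4, 0.9, 1.0]
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.0, 2.0, -1.0, 0.5, 4.0]
print(sequential(a, b, c, x))
print(chunked(a, b, c, x, chunk=4))
```

The two outputs agree up to floating-point rounding; only the order of computation differs.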

The trade-off: to fit this matrix framework, $A$ is restricted from a diagonal matrix (Mamba-1) to a scalar-times-identity (Mamba-2), meaning all state dimensions within a head share one decay rate. Mamba-2 compensates with a multi-head structure and increases the state dimension from $N = 16$ to $N = 64\text{-}256$.

Part 8: Mamba-3, Three Innovations from Classical SSM Theory

Published at ICLR 2026, Mamba-3 asks a different question than its predecessors. Mamba-2 optimized for training speed by simplifying the SSM to leverage Tensor Cores. But with the rise of RL post-training, agentic workflows, and test-time compute scaling, inference efficiency has become the primary bottleneck.

Here is the problem Mamba-3 targets. During autoregressive decoding, Mamba-2’s simplified recurrence is memory-bound. The GPU loads the state from HBM to SRAM, performs a trivially small computation (the scalar-times-identity update is cheap), and writes the result back. The arithmetic intensity is roughly 2.5 ops/byte. The H100 needs $\sim$295 ops/byte to be compute-bound. More than 99% of GPU compute sits idle during token generation.
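The "more than 99% idle" figure follows directly from the two numbers above. A minimal roofline-style check, using only the intensities quoted in the text (the function name is illustrative):

```python
def compute_utilization(ops_per_byte, ridge_ops_per_byte):
    # Below the ridge point a kernel is memory-bound, so achieved compute
    # scales linearly with arithmetic intensity.
    return min(1.0, ops_per_byte / ridge_ops_per_byte)

util = compute_utilization(2.5, 295.0)
idle = 1.0 - util
print(f"utilization {util:.2%}, idle {idle:.2%}")
```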

Mamba-3’s overarching philosophy is to increase arithmetic intensity during decoding by making the state update mathematically richer, spending more compute per byte of memory traffic, filling idle GPU cycles rather than adding new ones. Three innovations accomplish this.

Innovation 1: Exponential-Trapezoidal Discretization

The problem. Mamba-1 and Mamba-2 used what the Mamba-3 authors retroactively classify as “Exponential-Euler” discretization: the exact formula $\bar{A} = \exp(\Delta A)$ paired with the first-order Euler approximation $\bar{B} = \Delta B$. This is a hybrid: exact for the state decay, but approximate for how the input enters the state. The local truncation error is $O(\Delta^2)$.

In numerical analysis terms, the Euler method approximates the area under a curve using a rectangle aligned to one endpoint. It captures the value at one end of the interval but ignores how the curve changes across the interval. This crude approximation struggles with fast-moving temporal dependencies, producing “jerky” transitions in the state.

In practice, prior Mamba models compensated by adding an explicit short 1D causal convolution (Conv1d, width 4) before the SSM. This Conv1d smoothed out immediate local token interactions that the imprecise discretization missed. It worked, but it was an architectural bandage for a mathematical shortcoming. And it added latency at inference: one more sequential operation per token.

The intuition. The trapezoidal rule approximates the area under a curve using a trapezoid instead of a rectangle. A rectangle uses only one endpoint’s value. A trapezoid uses both endpoints and draws a straight line between them, capturing the slope of the curve across the interval. This gives second-order accuracy: the local error drops from $O(\Delta^2)$ to $O(\Delta^3)$.
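The two error orders are easy to check numerically. The sketch below integrates $e^t$ over a single interval $[0, \Delta]$ with both rules and halves $\Delta$; the choice of test function is an assumption made for illustration. An $O(\Delta^2)$ error should shrink by about 4x and an $O(\Delta^3)$ error by about 8x.

```python
import math

def rectangle(f, t, dt):
    return dt * f(t)                         # one endpoint only

def trapezoid(f, t, dt):
    return 0.5 * dt * (f(t) + f(t + dt))     # both endpoints

def errors(dt):
    exact = math.exp(dt) - 1.0               # integral of exp over [0, dt]
    return (abs(rectangle(math.exp, 0.0, dt) - exact),
            abs(trapezoid(math.exp, 0.0, dt) - exact))

rect_1, trap_1 = errors(0.1)
rect_2, trap_2 = errors(0.05)
rect_ratio = rect_1 / rect_2
trap_ratio = trap_1 / trap_2
print(f"rectangle error ratio: {rect_ratio:.2f}")
print(f"trapezoid error ratio: {trap_ratio:.2f}")
```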

Figure: Why Trapezoidal Discretization Is More Accurate. Same interval, better approximation of the area under the curve. Euler method: uses only the left endpoint f(t), O(Δ²) local error. Trapezoidal rule: uses both endpoints f(t) and f(t+Δ), O(Δ³) local error.

The math. Applying the generalized trapezoidal rule to the SSM’s state equation produces a three-term recurrence, where the old Euler formula had only two:

$$h_t = \underbrace{\exp(\Delta_t A_t) \cdot h_{t-1}}_{\text{Term 1: decayed previous state}} + \underbrace{(1 - \lambda_t) \cdot \Delta_t \cdot \exp(\Delta_t A_t) \cdot B_{t-1} \cdot x_{t-1}}_{\text{Term 2 (NEW): previous input, decayed}} + \underbrace{\lambda_t \cdot \Delta_t \cdot B_t \cdot x_t}_{\text{Term 3: current input}}$$

Term 1 is identical to the old formula: decay the previous state. Term 3 is similar to the old Euler input term, but weighted by $\lambda_t$ instead of 1. The new addition is Term 2: the previous timestep’s input $x_{t-1}$, projected through $B_{t-1}$, scaled by $(1 - \lambda_t)$, and decayed by the same exponential factor as the state.

The parameter $\lambda_t$ is a data-dependent convex combination weight. The $(1-\lambda_t)$ and $\lambda_t$ coefficients on consecutive inputs are the weights of the trapezoid: $(1-\lambda)$ on the left endpoint (previous input), $\lambda$ on the right endpoint (current input).

Let me verify the connection to the old formula. When $\lambda_t = 1$: Term 2 vanishes entirely because its coefficient $(1-\lambda_t) = 0$. Term 3 becomes $1 \cdot \Delta_t \cdot B_t \cdot x_t = \Delta_t B_t x_t$, which is exactly the Euler formula $\bar{B}x_t$. So the old Mamba-1/2 discretization is the special case $\lambda = 1$ of this more general formula.

A numerical walkthrough. Take $a = -0.5$, $\Delta = 0.1$, $b = 1$, and process the input sequence $[1, 1, 0]$ with $\lambda = 0.5$ (balanced trapezoidal blending):

# Parameters: a = -0.5, delta = 0.1, b = 1.0, lambda = 0.5
# exp(delta * a) = exp(-0.05) = 0.9512

# k=0: input=1 (no previous input, Term 2 uses x_{-1}=0)
#   Term 1: 0.9512 * 0 = 0                              (no state yet)
#   Term 2: (1-0.5) * 0.1 * 0.9512 * 1.0 * 0 = 0        (no prev input)
#   Term 3: 0.5 * 0.1 * 1.0 * 1 = 0.05
#   h_0 = 0 + 0 + 0.05 = 0.0500

# k=1: input=1, prev_input=1
#   Term 1: 0.9512 * 0.05 = 0.04756                     (decayed state)
#   Term 2: 0.5 * 0.1 * 0.9512 * 1.0 * 1 = 0.04756     (prev input, decayed)
#   Term 3: 0.5 * 0.1 * 1.0 * 1 = 0.05                  (current input)
#   h_1 = 0.04756 + 0.04756 + 0.05 = 0.14512

# k=2: input=0, prev_input=1
#   Term 1: 0.9512 * 0.14512 = 0.13804                  (decayed state)
#   Term 2: 0.5 * 0.1 * 0.9512 * 1.0 * 1 = 0.04756     (prev input=1, decayed)
#   Term 3: 0.5 * 0.1 * 1.0 * 0 = 0.0                   (current input=0)
#   h_2 = 0.13804 + 0.04756 + 0.0 = 0.18560

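# Compare to exponential-Euler (lambda=1, B_bar = delta * b = 0.1):
#   h_0 = 0.1 * 1 = 0.10000
#   h_1 = 0.9512 * 0.10000 + 0.1 * 1 = 0.19512
#   h_2 = 0.9512 * 0.19512 + 0.1 * 0 = 0.18560

The two schemes differ while the input is arriving. Euler injects each input in full at the step it appears ($h_0 = 0.1000$, $h_1 = 0.1951$), whereas the trapezoidal rule injects half immediately and the other half one step later, decayed, via Term 2 ($h_0 = 0.0500$, $h_1 = 0.1451$). At step 2 the current input is 0, yet Term 2 still contributes 0.04756 from the previous input ($x_1 = 1$); Euler has no such term. For this particular input and $\lambda = 0.5$, that deferred half brings the trapezoidal state to the same $h_2 \approx 0.1856$ that Euler reaches, so the two agree once the input has fully entered the state. The difference is in the transitions, and for fast-changing input sequences that is where it matters.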
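The walkthrough can be re-run in a few lines of plain Python. This is a scalar sketch with a fixed $\lambda$ (in Mamba-3 it is data-dependent), and the function name is illustrative. Setting `lam=1.0` gives the exponential-Euler update with $\bar{B} = \Delta B$ as defined earlier.

```python
import math

def exp_trapezoidal(xs, a, delta, b, lam):
    """Scalar three-term recurrence; lam=1.0 reduces to exponential-Euler."""
    alpha = math.exp(delta * a)              # exact state decay
    h, x_prev, states = 0.0, 0.0, []
    for x in xs:
        h = (alpha * h                                     # Term 1
             + (1.0 - lam) * delta * alpha * b * x_prev    # Term 2
             + lam * delta * b * x)                        # Term 3
        states.append(h)
        x_prev = x
    return states

trap = exp_trapezoidal([1, 1, 0], a=-0.5, delta=0.1, b=1.0, lam=0.5)
euler = exp_trapezoidal([1, 1, 0], a=-0.5, delta=0.1, b=1.0, lam=1.0)
for k, (h_trap, h_euler) in enumerate(zip(trap, euler)):
    print(f"k={k}: trapezoidal {h_trap:.5f}, euler {h_euler:.5f}")
```

Printing both trajectories side by side makes it easy to try other inputs or other values of `lam`.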

Figure: Trapezoidal Recurrence, step k=2. Three terms instead of two; the previous input now participates. Decayed previous state: h₁ × exp(ΔA) = 0.14512 × 0.9512 = 0.13804. Previous input, decayed (new in Mamba-3): x₁ × (1−λ)·Δ·exp(ΔA)·B = 1 × 0.04756 = 0.04756. Current input: x₂ × λ·Δ·B = 0 × 0.05 = 0.0. Sum: h₂ = 0.18560.

The implicit convolution and the death of Conv1d. Here is the subtle and important consequence. Because the state update at time $t$ depends on both $x_t$ and $x_{t-1}$, the trapezoidal recurrence contains an implicit convolution of width 2. The $(1-\lambda_t)$ and $\lambda_t$ weights on the consecutive inputs play the role of a learned, data-dependent convolution filter operating on pairs of adjacent tokens.

For years, SSM architectures (H3, RWKV-4, Mamba-1, Mamba-2) required an explicit external Conv1d (width 4) before the SSM to handle immediate local token interactions. The Conv1d was considered essential for capturing “induction head” copying behaviors and local patterns. Mamba-3 found that the implicit width-2 convolution from the trapezoidal discretization, combined with learnable bias terms on $B$ and $C$ (constant vectors added after normalization), is expressive enough to replace the external Conv1d entirely. Mamba-3 is the first Mamba variant to drop the Conv1d without performance loss.

Figure: How Mamba-3 Eliminates the Conv1d. Trapezoidal discretization absorbs the causal convolution. Mamba-1 / Mamba-2: Input → Conv1d (width 4) → SiLU → Selective SSM → Output. Mamba-3: Input → Selective SSM with trapezoidal discretization (implicit conv built-in) → Output; the Conv1d is absorbed into the discretization. Fewer ops per token = lower decode latency on your H100.

Why it matters for your H100. Fewer sequential operations per token at inference. No Conv1d kernel launch, no Conv1d memory traffic, no Conv1d compute. The architecture is simpler and the discretization is now theoretically justified ($O(\Delta^3)$ error) instead of a heuristic patched by an external convolution.

Innovation 2: Complex-Valued SSMs via RoPE

The problem. Real-valued SSMs with non-negative eigenvalues can only decay monotonically. The state gets smaller over time, or stays the same, but it cannot oscillate. Mathematically: if $\bar{a} \in [0, 1]$, then $\bar{a}^k$ is a monotonically non-increasing sequence. The state can only move in one direction (toward zero).

This means real-valued SSMs cannot solve simple state-tracking tasks that require flipping between states. Consider parity: given a stream of bits, track whether the running count of 1s is even or odd. Every time a 1 arrives, the parity flips. This requires the state to toggle between two values indefinitely. A monotonically decaying state cannot do this. On the bit sequence parity task, Mamba-2 scored no better than random guessing.

The intuition. Real eigenvalues restrict the state to movement along a line: it can only grow or shrink. Complex eigenvalues enable rotation: the state can cycle through values, oscillate, and flip. For even/odd tracking, you need the state to flip sign every time a 1 appears. A 180-degree rotation achieves exactly this. Real, non-negative arithmetic cannot.
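A toy version of the parity argument, in plain Python: a two-dimensional real state (equivalently one complex number) is rotated by 180 degrees for every 1 bit, while a non-negative real multiplier can only shrink the state. Both functions are hand-written illustrations, not trained models.

```python
import math

def parity_by_rotation(bits):
    # Each 1 rotates the state by pi; each 0 leaves it alone.
    state = (1.0, 0.0)
    for bit in bits:
        theta = math.pi * bit                # data-dependent rotation angle
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        state = (cos_t * state[0] - sin_t * state[1],
                 sin_t * state[0] + cos_t * state[1])
    return 0 if state[0] > 0 else 1          # sign of the first coordinate

def decay_only(bits, a_bar=0.9):
    # Non-negative real multiplier: the state shrinks but never changes sign.
    h = 1.0
    for bit in bits:
        h = a_bar * h if bit else h
    return h

bits = [1, 0, 1, 1, 0, 1, 1]
print(parity_by_rotation(bits), sum(bits) % 2)
print(decay_only(bits))
```

The rotation readout matches `sum(bits) % 2` for any bit string, while `decay_only` stays positive no matter how many 1s arrive, so its sign carries no parity information.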

The clever trick. Implementing complex arithmetic on GPUs is painful. Complex numbers double memory requirements, break existing CUDA kernel optimizations, and introduce alignment issues. Mamba-3 avoids all of this through a mathematical equivalence.

The key theoretical result (Proposition 3 in the paper): a discretized complex-valued diagonal SSM is mathematically equivalent to a real-valued SSM with data-dependent Rotary Positional Embeddings (RoPE) applied to $B$ and $C$. The decomposition works as follows:

The real part of the complex eigenvalue controls decay. This is handled by the existing SSD machinery, exactly as in Mamba-2. No changes needed.
The imaginary part controls rotation. This is factored out and implemented as rotary embeddings applied to the $B$ and $C$ projection vectors.

The rotation angles are produced dynamically via projections from the current input token $x_t$, rather than using static positional indices as in standard Transformer RoPE. This is why it is called “data-dependent” RoPE. The rotation applied to $B$ and $C$ changes based on what token is being processed.

No complex number ever appears in the GPU kernels. The real-valued SSD computation runs at the same speed as before. The rotational dynamics are absorbed into $B$ and $C$ via the same RoPE infrastructure that Transformers already use for positional encoding. Existing Transformer tooling (rotary embedding kernels, fused attention implementations) can be reused directly.
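The equivalence can be checked numerically for a single complex state. The sketch below is a scalar illustration, not the paper's proposition: the sign convention of the rotation and the readout $y_t = \mathrm{Re}(\bar{c}_t h_t)$ are assumptions. Python's `complex` type is used only as shorthand for a pair of real channels; in the second function the state is multiplied by a real decay only, and all rotation lives in $B$ and $C$.

```python
import cmath
import math

def complex_ssm(rho, theta, b, c, x):
    # h_t = exp(rho_t + i * theta_t) * h_{t-1} + b_t * x_t
    h, ys = 0j, []
    for r, th, b_t, c_t, x_t in zip(rho, theta, b, c, x):
        h = cmath.exp(complex(r, th)) * h + b_t * x_t
        ys.append((c_t.conjugate() * h).real)
    return ys

def real_decay_with_rotated_bc(rho, theta, b, c, x):
    # Same outputs, but the state only sees a real decay exp(rho_t).
    g, ys, angle = 0j, [], 0.0
    for r, th, b_t, c_t, x_t in zip(rho, theta, b, c, x):
        angle += th                          # cumulative, data-dependent angle
        rot = cmath.exp(-1j * angle)         # 2-D rotation applied RoPE-style
        b_rot, c_rot = rot * b_t, rot * c_t
        g = math.exp(r) * g + b_rot * x_t    # real decay, rotated input
        ys.append((c_rot.conjugate() * g).real)   # real dot product of 2-D vectors
    return ys

rho = [-0.1, -0.2, -0.05, -0.3]
theta = [0.5, math.pi, 1.2, -0.7]
b = [1 + 2j, 0.5 - 1j, -1 + 0.3j, 2j]
c = [0.3 + 0.1j, 1 - 1j, -0.5 + 2j, 1 + 0j]
x = [1.0, -2.0, 0.5, 3.0]
print(complex_ssm(rho, theta, b, c, x))
print(real_decay_with_rotated_bc(rho, theta, b, c, x))
```

The key step is the substitution $g_t = e^{-i\Theta_t} h_t$ with $\Theta_t = \sum_{r \le t} \theta_r$, which removes the rotation from the state transition and moves it onto $B_t$ and $C_t$.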

The result. On synthetic state-tracking benchmarks:

Task	Mamba-2 (Real)	Mamba-3 w/o RoPE	Mamba-3 w/ Std. RoPE	Mamba-3 (Complex / Data-Dep RoPE)
Bit Sequence Parity	Random Guess	2.27%	1.56%	100.00%
Modular Arith. (No Brackets)	0.90%	Random Guess	20.70%	98.51%
Modular Arith. (Brackets)	Fail	Fail	2.62%	87.75%

Mamba-3 solves parity perfectly and near-perfectly executes complex modular arithmetic. Standard (non-data-dependent) RoPE does not help. Static positional rotation angles cannot implement content-dependent state flipping. The data-dependent version can, because the rotation is a function of the input.

These are tasks that were mathematically impossible for real-valued SSMs, regardless of scale or training budget. The real-complex boundary is a hard expressivity ceiling, not a soft scaling issue.

Innovation 3: MIMO (Multi-Input Multi-Output)

The problem. As discussed above, standard SSMs waste more than 99% of GPU compute during decoding because each state update is a trivial rank-1 operation: the GPU loads the entire state from memory, performs a single multiply-add, and writes it back. The computation is too cheap relative to the memory transfer.

The fix. Instead of processing one input and producing one output per SSM (SISO), process $R$ inputs and $R$ outputs simultaneously (MIMO). The scalar input $x_t$ is linearly projected into a matrix $X_t$ with rank $R$. The projection vectors $B_t$ and $C_t$ are correspondingly expanded to rank-$R$ structures. The state update becomes a matrix multiplication instead of an outer product:

$$H_t = \bar{A} \cdot H_{t-1} + B_t \cdot X_t^T$$

With $R = 4$, the model performs 4x the floating-point operations for the same amount of memory traffic. The arithmetic intensity jumps from $\sim$2.5 to $\sim$10 ops/byte. Still not enough to fully saturate the H100, but a 4x improvement in GPU utilization during the memory-bound decode phase.
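A plain-Python sketch of the update makes the accounting visible. The shapes ($H$ is $N \times P$, $B_t$ is $N \times R$, $X_t$ is $P \times R$), the scalar decay, and the FLOP count (only the $B_t X_t^T$ product is counted) are simplifying assumptions for illustration.

```python
def mimo_update(H, a_bar, B, X):
    """Return a_bar * H + B @ X^T for H: N x P, B: N x R, X: P x R."""
    N, P, R = len(H), len(H[0]), len(B[0])
    return [[a_bar * H[n][p] + sum(B[n][r] * X[p][r] for r in range(R))
             for p in range(P)]
            for n in range(N)]

def update_flops(N, P, R):
    # One multiply and one add per (n, p, r) triple in B @ X^T.
    return 2 * N * P * R

def state_traffic_bytes(N, P, bytes_per_elem=2):
    # The state is read once and written once, independent of R.
    return 2 * N * P * bytes_per_elem

H = [[1.0, 2.0], [3.0, 4.0]]
siso = mimo_update(H, 0.5, B=[[1.0], [2.0]], X=[[3.0], [4.0]])   # R = 1: outer product
print(siso)
for R in (1, 4):
    print(R, update_flops(64, 64, R) / state_traffic_bytes(64, 64))
```

With `R=1` the update is the familiar outer product; raising `R` multiplies the FLOPs while the state traffic stays put.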

Crucially, only the SSM-specific parameters ($B_t$, $C_t$, and the state $H_t$) grow with $R$. The main input projections, the output projections, and the residual gate all remain at their original sizes. This contains the parameter increase to the SSM core.

Why it does not hurt latency. The extra compute fills idle GPU cycles. During decoding, the bottleneck is the time it takes to load the state from HBM to SRAM. While that data transfer is in flight, the Tensor Cores have nothing to do. MIMO gives them work. The wall-clock time per decode step is dominated by memory transfer time, not compute time, so adding compute within the transfer window is effectively free. MIMO with $R = 4$ matches Mamba-2’s decode speed while delivering substantially better accuracy.

The result. At 1.5B scale with Chinchilla-optimal training: the base Mamba-3 (SISO) outpaces Gated DeltaNet (the previous state-of-the-art sub-quadratic model) by 0.6 percentage points on average downstream accuracy. Adding MIMO with $R = 4$ adds another 1.2 points, for a total gain of 1.8 points over Gated DeltaNet, 1.9 over Mamba-2, and 2.2 over equivalently-sized pure Transformers. The 1.5B MIMO variant achieves 57.6% average accuracy across benchmarks.

A Mamba-3 MIMO model with state dimension $N = 64$ matches the perplexity and downstream accuracy of a Mamba-2 model with $N = 128$. Halving the state size while maintaining quality doubles inference throughput within the same hardware footprint.

Architectural Changes

Two more structural changes round out Mamba-3:

Normalization. Mamba-3 replaces post-gate RMSNorm (Mamba-2) with QKNorm, also called BCNorm: RMS normalization applied directly to the $B$ and $C$ projections before mixing. This stabilizes variance and activation spikes during large-scale pretraining, which is especially important with the added mathematical complexity of trapezoidal recurrence and MIMO updates.

Block structure. Mamba-1 and Mamba-2 fused the sequence mixer (SSM) and channel mixer (MLP) into a single homogeneous block. Mamba-3 reverses this decision. It adopts an interleaved architecture that matches the Llama family: alternating Mamba-3 SSM blocks with standard SwiGLU MLP blocks. Each Mamba-3 block handles sequence mixing; each SwiGLU MLP handles channel mixing. This Llama-compatible topology makes it straightforward to create hybrid models by swapping some SSM blocks for attention blocks.

Evolution Comparison

Feature	Mamba-1	Mamba-2	Mamba-3
Venue	COLM 2024	ICML 2024	ICLR 2026
$A$ matrix	Diagonal (real)	Scalar $\times$ identity	Complex-valued (data-dep RoPE)
State size	$N = 16$	$N = 64\text{-}256$	Matches Mamba-2 at half $N$
Short conv	Required (width 4)	Required (width 4)	Removed
MIMO	No	No	Yes (rank-$R$)
Discretization	Exp-Euler, $O(\Delta^2)$	Exp-Euler, $O(\Delta^2)$	Exp-Trapezoidal, $O(\Delta^3)$
Design priority	Quality + selectivity	Training speed (Tensor Cores)	Inference efficiency
State tracking	Cannot solve parity	Cannot solve parity	Solves parity + modular arith.
Block structure	Fused SSM+MLP	Fused SSM+MLP	Interleaved SSM + MLP

Part 9: The Bigger Picture

Hybrid Architectures Are the Production Standard

The field has converged on an empirical finding: hybrid architectures combining SSM layers with a small fraction of attention layers outperform both pure approaches. Albert Gu has articulated the fundamental reason clearly. Transformers are like databases: they cache every token for future reference (perfect recall, but linear memory growth). SSMs are like brains: they compress all history into a fixed-size state (infinite context, but lossy).

Pure SSMs struggle with two specific capabilities. Exact retrieval: finding a specific fact buried in a long context degrades as context grows, because the fixed-size state cannot perfectly memorize arbitrary content. In-context learning: few-shot pattern matching from prompt examples requires comparing the current token against specific stored tokens, which is fundamentally an attention operation.

The solution adopted by every major lab: use SSM layers for the vast majority of the network and sprinkle in a few attention layers for precise retrieval. The exact ratio varies (5:1 in Mamba-3’s recommended config, 9:1 in Granite), but the pattern is universal.

What This Means for Your Inference Stack

For inference infrastructure teams, the implications are concrete. With only roughly 10-17% of layers using attention (the 9:1 and 5:1 ratios above), you manage KV cache for those few layers, not the entire network. SSM layers need no KV cache management at all: no PagedAttention, no eviction policies, no memory fragmentation. And throughput advantages grow with context length. Pure Transformers are faster at short sequences ($<$2K tokens), but SSM-based models cross over quickly and the gap widens: at 57K tokens, Mamba-2 outperforms Transformers by 4x. SSM decode cost is constant per token; Transformer decode cost grows linearly.

The Fundamental Trade-off Persists

SSMs compress. Attention caches. Mamba-3 makes the compressed memory more expressive through complex dynamics, higher-order discretization, and MIMO. But it cannot eliminate the compression. If your workload requires perfect verbatim retrieval of a specific sentence from a 100K-token document, you need attention layers for that.

The Transformer monopoly has ended. But Transformers are not dead. They are becoming a specialized, strategically-placed component within a larger hybrid architecture, used precisely where their lossless memory is needed and nowhere else.

References

Gu, A., Dao, T., Ermon, S., Rudra, A., & Re, C. (2020). HiPPO: Recurrent Memory with Optimal Polynomial Projections. NeurIPS 2020.
Gu, A., Goel, K., & Re, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022. (S4)
Gu, A., Gupta, A., Goel, K., & Re, C. (2022). On the Parameterization and Initialization of Diagonal State Space Models. NeurIPS 2022. (S4D)
Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. COLM 2024.
Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024. (Mamba-2 / SSD)
Lahoti, A., Li, A., Chen, Y., Wang, Z., Bick, T., Kolter, J. Z., Dao, T., & Gu, A. (2026). Mamba-3: Improved Sequence Modeling using State Space Principles. ICLR 2026.
Together AI. Mamba-3 Blog Post. Technical overview and benchmark results.
Goomba Lab. Blog series on Structured State Space Duality and Mamba-3 mathematical foundations.
Tri Dao. Mamba-3 Part 2: Methodological Deep Dive. Detailed derivation of the exponential-trapezoidal discretization and RoPE equivalence.
Princeton Language and Intelligence. Mamba-2: Algorithms and Systems. Technical walkthrough of the SSD algorithm.
NVIDIA. Nemotron-3-Super: hybrid Mamba-2 + MoE + Attention architecture for production deployment.
IBM. Granite 4.0: 9:1 Mamba-to-Transformer ratio with >70% memory reduction vs. conventional LLMs.

Speculative Speculative Decoding: Eliminating the Last Sequential Bottleneck in LLM Inference

Sat, 07 Mar 2026 10:00:00 +0800

What This Post Covers

In our post on speculative decoding, we covered how a small draft model proposes tokens that a large target model verifies in parallel, achieving 2-3x speedups without changing the output distribution. That technique exploits idle GPU compute during memory-bound inference.

This post examines a follow-up question: can we make speculative decoding itself faster? The answer is yes. A recent paper by Kumar, Dao, and May (ICLR 2026) identifies a sequential bottleneck within standard speculative decoding and eliminates it through a technique called Speculative Speculative Decoding (SSD). Their algorithm, Saguaro, achieves up to 5x speedup over autoregressive decoding and roughly 2x over optimized speculative decoding.

We will walk through the bottleneck SSD targets, the core idea of speculating about verification outcomes, and the three engineering challenges that Saguaro solves: cache construction, cache-aware sampling, and batch-size scaling. Each section explains why the naive approach fails before presenting the solution.

Notation Reference

Core

$K$: Draft tokens per round (e.g. 7)
$k$: Number of accepted draft tokens (0 to K)
$t^*$: Bonus token from the target model
$v^T$: Verification outcome $(k, t^*)$ for round T
$S^T$: Speculation cache: maps outcomes to pre-computed drafts

Speedup

$p_{\text{hit}}$: Probability of a cache hit
$E_{\text{hit}}$: Expected tokens per round on cache hit
$E_{\text{miss}}$: Expected tokens per round on cache miss
$T_v$: Verification latency (target model forward pass)
$T_p$: Primary speculator latency (relative to verifier)
$T_b$: Backup speculator latency

Cache Construction

$B$: Budget: max pre-computed speculations (~20-30)
$F_k$: Fan-out at position k (bonus tokens cached)
$r$: Power-law exponent for cache miss decay
$\alpha_p$: Per-token acceptance rate
$|V|$: Vocabulary size (e.g. 32,000)

Sampling

$C$: Downweighting constant for Saguaro sampling (0 to 1)

Batch Scaling

$b$: Batch size (number of sequences)
$b^*$: Critical batch size: switch to backup fallback

The Hidden Bottleneck in Speculative Decoding

Standard speculative decoding runs in a loop: the draft model generates K tokens, the target model verifies them in a single forward pass, and the process repeats. This is faster than autoregressive decoding because verification amortizes the expensive memory read of the target model’s weights across multiple tokens.

But there is a sequential dependency hiding in plain sight. Let’s trace what happens on the draft model’s GPU during one round:

1. The draft model generates K tokens (draft phase)
2. The draft model sends tokens to the target model
3. The target model runs verification (the draft model sits completely idle)
4. The target model returns the verification outcome
5. The draft model generates K new tokens for the next round
6. Go to step 2

The draft model does nothing during step 3. If verification takes $T_v$ time units, the draft model wastes $T_v$ time units every round. Since the target model is much larger than the draft, $T_v$ dominates the round duration. The draft model’s GPU is idle for the majority of each round.

This is the bottleneck SSD eliminates. Instead of waiting for verification to finish, the draft model spends that idle time doing something useful: predicting what the verification outcome will be and pre-computing the next round’s speculation for each likely outcome.

Figure from Kumar, Dao & May (2026). Left: standard SD's sequential loop. Center: SSD overlaps drafting with verification via a speculation cache tree. Right: throughput gains.

The Core Idea: Speculate About the Speculation

The idea draws from CPU speculative execution. When a CPU encounters a conditional branch, it does not wait for the condition to resolve. Instead, it predicts the likely outcome and begins executing instructions along that predicted path. If the prediction was correct, the results are kept. If wrong, they are discarded and the correct path is executed.

SSD applies this same principle to speculative decoding. While the target model verifies round $T$’s draft tokens, the draft model:

Predicts what the verification outcome will be
Pre-computes speculations for each likely outcome
Stores these in a speculation cache

When verification finishes, the actual outcome is compared against the cache. If it matches a cached prediction (a cache hit), the next round’s speculation is returned instantly with zero drafting latency. If it doesn’t match (a cache miss), the system falls back to standard synchronous drafting.

What Is a Verification Outcome?

To understand what we need to predict, let’s define the verification outcome precisely. When the target model verifies K draft tokens, two things are determined:

$k$: the number of accepted draft tokens (ranging from 0 to K)
$t^*$: the bonus token, sampled from the target distribution at the first disagreement point (or at position K+1 if all tokens are accepted)

The verification outcome is the pair $v^T = (k, t^*)$. This fully determines the context from which the next round’s speculation must begin. If we can predict $v^T$ before verification completes, we can pre-compute the next K draft tokens starting from that context.
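As a concrete illustration, here is how the pair $(k, t^*)$ falls out of one verification pass under a simplifying assumption: greedy, exact-match acceptance, where a draft token is accepted only if it equals the target's own choice at that position. The sampling-based acceptance rule described above is more general; the function name and token lists are hypothetical.

```python
def verification_outcome(draft_tokens, target_tokens):
    """Return (k, t_star) under greedy exact-match acceptance.

    target_tokens[i] is the target model's choice at position i given the
    draft prefix draft_tokens[:i]; it has K + 1 entries so that a bonus token
    exists even when every draft token is accepted.
    """
    K = len(draft_tokens)
    if len(target_tokens) != K + 1:
        raise ValueError("target_tokens must have K + 1 entries")
    k = 0
    while k < K and draft_tokens[k] == target_tokens[k]:
        k += 1
    return k, target_tokens[k]               # bonus token at first disagreement

draft = ["the", "cat", "sat", "in", "a"]
target = ["the", "cat", "sat", "on", "the", "mat"]
print(verification_outcome(draft, target))
```

The returned pair is exactly the key that the next round's draft depends on: the accepted prefix length plus the one token the target supplies itself.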

The Speculation Cache

The speculation cache $S^T$ is a dictionary that maps predicted verification outcomes to pre-computed speculations:

$$S^T : (k, t^*) \to (s_1, s_2, \ldots, s_K)$$

Each entry contains K draft tokens generated autoregressively by the draft model starting from the context implied by $(k, t^*)$.

When verification returns the actual outcome $v^T$:

Cache hit ($v^T \in S^T$): Return the cached speculation immediately. Zero drafting latency for this round.
Cache miss ($v^T \notin S^T$): Fall back to generating a fresh speculation synchronously.
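The hit/miss logic amounts to a dictionary lookup with a fallback. A minimal sketch in plain Python, where `toy_draft` stands in for the draft model and every name is an assumption made for illustration:

```python
def build_cache(predicted_outcomes, draft_fn):
    # Pre-compute one speculation per predicted outcome (k, t_star).
    return {outcome: draft_fn(outcome) for outcome in predicted_outcomes}

def next_draft(cache, actual_outcome, draft_fn):
    """Return (draft_tokens, was_hit)."""
    if actual_outcome in cache:
        return cache[actual_outcome], True        # hit: no drafting latency
    return draft_fn(actual_outcome), False        # miss: draft synchronously

def toy_draft(outcome, K=3):
    # Placeholder for K autoregressive draft steps from the implied context.
    k, t_star = outcome
    return tuple(f"{t_star}+{i}" for i in range(1, K + 1))

cache = build_cache([(3, "on"), (3, "in"), (5, "mat")], toy_draft)
print(next_draft(cache, (3, "on"), toy_draft))    # cached outcome
print(next_draft(cache, (2, "dog"), toy_draft))   # unpredicted outcome
```

In the real system the cache is filled while verification is running, so only the lookup sits on the critical path.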

Figure: The Speculation Cache. Mapping predicted verification outcomes to pre-computed speculations. Example scenario: draft tokens sent "the cat sat on a"; verification result "the cat sat on", i.e. v = (k=3, t*="on"), which is then looked up in the speculation cache S^T.

The key question is obvious: how do we choose which outcomes to cache? The space of possible outcomes is $(K+1) \times |V|$ where $|V|$ is the vocabulary size (typically 32,000-128,000). We cannot pre-compute speculations for all of them. This is the first of three challenges Saguaro addresses.

The Speedup Formula

Before diving into the three challenges, let’s quantify the potential. The expected speedup of SSD over autoregressive decoding (Theorem 7 from the paper) is:

$$\text{speedup}_{\text{SSD}} = \frac{p_{\text{hit}} \cdot E_{\text{hit}} + (1 - p_{\text{hit}}) \cdot E_{\text{miss}}}{p_{\text{hit}} \cdot \max(1, T_p) + (1 - p_{\text{hit}}) \cdot (1 + T_b)}$$

Walking through each term: $p_{\text{hit}}$ is the probability of a cache hit. $E_{\text{hit}}$ and $E_{\text{miss}}$ are the expected number of tokens generated per round on a hit and miss respectively. $T_p$ is the latency of the primary speculator (the neural draft model) relative to the verifier, and $T_b$ is the latency of the backup speculator used on cache misses.

The numerator is the expected tokens per round, weighted by hit/miss probabilities. The denominator is the expected wall-clock time per round.
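
The formula is easy to evaluate directly. The sketch below is a straight transcription of it; the function name `ssd_speedup` and the input numbers are illustrative assumptions, not values from the paper.

```python
def ssd_speedup(p_hit, e_hit, e_miss, t_p, t_b):
    """Expected SSD speedup over autoregressive decoding (formula above).

    t_p and t_b are latencies relative to one verifier pass.
    """
    expected_tokens = p_hit * e_hit + (1.0 - p_hit) * e_miss
    expected_time = p_hit * max(1.0, t_p) + (1.0 - p_hit) * (1.0 + t_b)
    return expected_tokens / expected_time


# Illustrative inputs only: 4 tokens per round, drafters at half the verifier latency.
always_miss = ssd_speedup(p_hit=0.0, e_hit=4.0, e_miss=4.0, t_p=0.5, t_b=0.5)
mostly_hit = ssd_speedup(p_hit=0.8, e_hit=4.0, e_miss=4.0, t_p=0.5, t_b=0.5)
always_hit = ssd_speedup(p_hit=1.0, e_hit=4.0, e_miss=4.0, t_p=0.5, t_b=0.5)
```

With these made-up inputs the speedup climbs from about 2.67 (every round pays the drafting time) to 4.0 (drafting fully hidden), which shows that the hit rate only changes the denominator here.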

Two corollaries follow directly:

Corollary 8: SSD strictly outperforms standard speculative decoding whenever $p_{\text{hit}} > 0$. Any nonzero cache hit rate improves performance because cache hits eliminate drafting latency entirely, and cache misses simply revert to standard SD behavior.

Corollary 9: The maximum speedup is bounded by $(1 + T_{\text{SD}}) \cdot (E_{\text{hit}} / E_{\text{SD}})$, where $T_{\text{SD}}$ and $E_{\text{SD}}$ are the drafting time and expected tokens for standard SD. When the cache hit rate approaches 1, drafting latency is hidden entirely, so the per-round time shrinks from $1 + T_{\text{SD}}$ to 1. That ratio is where the $(1 + T_{\text{SD}})$ factor comes from.

Challenge 1: Building the Cache (Saguaro Cache Construction)

The Problem

A verification outcome is a pair $(k, t^*)$. The acceptance length $k$ ranges from 0 to K (typically K=7 in SSD), giving K+1 possibilities. The bonus token $t^*$ comes from a vocabulary of size $|V|$. The total outcome space is $(K+1) \times |V|$, which is roughly 250,000 for K=7 and $|V|$=32,000.

We have a budget of $B$ speculations we can pre-compute during the verification window. Each speculation requires the draft model to run K autoregressive steps. With the verification latency of a 70B target model on 4 H100s, we can fit roughly $B = 20\text{-}30$ speculations. We need to choose wisely.

Decomposing the Problem

Saguaro decomposes this into two subproblems:

For each acceptance length $k$, how many bonus tokens should we cache? Call this the fan-out $F_k$.
For a given fan-out $F_k$, which bonus tokens should we cache?

The second question has a straightforward answer: use the top-$F_k$ tokens from the draft model’s own probability distribution at that position. The draft model has already computed logits during the current round’s speculation, so the most likely tokens are immediately available. Empirically, this predicts the actual bonus token with up to 90% accuracy at moderate fan-out.

The first question, how to allocate fan-out across positions, is where the interesting optimization happens.

Figure: What Is Fan-Out? Each row is one possible acceptance length, and the fan-out is how many bonus tokens we cache for that outcome. Columns are the cached bonus token candidates (1st through 7th, taken from the draft logits). Each cell is one pre-computed speculation (K draft tokens), and the total number of cells equals the budget B. Position k=K (all accepted) is boosted because the target distribution is sharper, making prediction easier.

Power-Law Cache Hits

The authors make a key empirical observation: cache miss probability follows a power law in the fan-out:

$$1 - p_{\text{hit}}(F) = \frac{1}{F^r}$$

for some exponent $r > 0$ that depends on the draft-target alignment. Under this law, doubling the fan-out multiplies the miss rate by $2^{-r}$, which halves it only when $r = 1$. The miss rate decreases polynomially, with diminishing absolute returns as fan-out grows. This finding (confirmed across multiple model pairs and datasets) drives the allocation strategy.

Geometric Fan-Out

Given the power-law structure and a total budget $\sum_{k=0}^{K} F_k \leq B$, Saguaro solves a constrained optimization problem using Lagrange multipliers. The result (Theorem 12) is a geometric allocation:

$$F_k = F_0 \cdot \alpha_p^{k/(1+r)} \quad \text{for } k < K$$

where $\alpha_p$ is the per-token acceptance rate and $F_0$ is determined by the budget constraint. The formula allocates more fan-out to earlier positions (small $k$) and less to later positions.

The reasoning: position $k=0$ (first token rejected) is more probable than $k=5$ (five tokens accepted before rejection) because each acceptance is treated as an independent event with probability $\alpha_p < 1$. The probability of reaching acceptance length $k$ is roughly $\alpha_p^k \cdot (1 - \alpha_p)$, a geometric distribution. Giving more fan-out to the more probable outcomes maximizes the expected cache hit rate. Because of the power law, the optimal share is not the outcome probability itself but that probability raised to the power $1/(1+r)$, which is where the exponent in the formula comes from.

There is one exception: position $k=K$ (all tokens accepted) receives a boost. When all K draft tokens are accepted, the bonus token comes directly from the target model’s distribution $p_{\text{target}}$ rather than the residual distribution. The target distribution is sharper and more concentrated, making the top-$F$ prediction more accurate. Saguaro accounts for this with a multiplicative correction:

$$F_K = F_0 \cdot \alpha_p^{K/(1+r)} \cdot (1 - \alpha_p)^{-1/(1+r)}$$
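
One way to see where the $1/(1+r)$ exponent comes from is a short Lagrange-multiplier sketch. It assumes (as a simplification, not necessarily the paper's exact setup) that the objective is the expected miss rate $\sum_k p_k F_k^{-r}$, where $p_k$ is the probability of acceptance length $k$ and every position follows the power law above:

$$\min_{F_0,\ldots,F_K} \; \sum_{k=0}^{K} p_k F_k^{-r} \quad \text{s.t.} \quad \sum_{k=0}^{K} F_k = B$$

$$-r\,p_k F_k^{-(1+r)} + \lambda = 0 \;\;\Longrightarrow\;\; F_k = \left(\frac{r\,p_k}{\lambda}\right)^{1/(1+r)} \propto p_k^{1/(1+r)}$$

Substituting $p_k = \alpha_p^k (1-\alpha_p)$ for $k < K$ gives $F_k = F_0 \cdot \alpha_p^{k/(1+r)}$. Under the same assumption, $p_K = \alpha_p^K$ carries no $(1-\alpha_p)$ rejection factor, so $F_K / F_0 = \alpha_p^{K/(1+r)} (1-\alpha_p)^{-1/(1+r)}$, which has the same form as the correction above. This is only one reading of the result; the paper's own argument may differ.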

Figure: Geometric Fan-Out Allocation. How Saguaro distributes its speculation budget across acceptance positions, plotting fan-out F_k for the geometric allocation (Saguaro) against a uniform allocation (baseline), shown for acceptance rate α = 0.75 and budget B = 24.

Why geometric works: The probability of reaching acceptance length k is roughly α^k(1-α). Earlier positions are more likely, so they deserve more fan-out. Position K (all accepted) gets boosted because the bonus token comes from the sharper target distribution, making it easier to predict.
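
The allocation formulas can be turned into numbers with a few lines of Python. This is a sketch under stated assumptions: the function name `geometric_fan_out` is hypothetical, the exponent `r=0.5` is an arbitrary illustrative value, and the result is real-valued (a practical implementation would still need to round to integer fan-outs).

```python
def geometric_fan_out(alpha, r, budget, K):
    """Real-valued fan-outs F_0..F_K that sum to `budget`.

    Uses F_k = F_0 * alpha**(k/(1+r)) for k < K, plus the extra
    (1 - alpha)**(-1/(1+r)) factor at k = K.
    """
    exponent = 1.0 / (1.0 + r)
    shape = [alpha ** (k * exponent) for k in range(K)]
    shape.append(alpha ** (K * exponent) * (1.0 - alpha) ** (-exponent))
    f0 = budget / sum(shape)    # F_0 is fixed by the budget constraint
    return [f0 * s for s in shape]


# alpha and budget match the example figure; r = 0.5 is an assumed value.
fan_out = geometric_fan_out(alpha=0.75, r=0.5, budget=24, K=7)
```

The list decreases from `fan_out[0]` to `fan_out[6]` and then jumps at `fan_out[7]`, mirroring the decay over partial acceptances and the boost for the all-accepted case.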

Challenge 2: Trading Acceptance for Cache Hits (Saguaro Sampling)

The Tension

There is a fundamental tension between two objectives in SSD. On one hand, we want the draft model to closely match the target model so that acceptance rates stay high. On the other hand, we want to predict the bonus token $t^*$ accurately so that cache hit rates stay high.

These objectives conflict. Here is why.

Recall from our speculative decoding post that the bonus token is sampled from the residual distribution:

$$r(\cdot) \propto \max(p_{\text{target}}(\cdot) - p_{\text{draft}}(\cdot), 0)$$

When the draft model closely matches the target ($p_{\text{draft}} \approx p_{\text{target}}$), the acceptance rate is high but the residual $\max(p_{\text{target}} - p_{\text{draft}}, 0)$ is spread thinly across many tokens. A thin residual means the bonus token could be almost anything, making it hard to predict and reducing cache hit rates.

When the draft model diverges from the target, the residual concentrates on tokens where $p_{\text{target}} \gg p_{\text{draft}}$, making the bonus token more predictable. But acceptance rates drop, meaning fewer tokens per round.

The Solution: Intentional Misalignment

Saguaro sampling resolves this tension by deliberately modifying the draft model’s sampling distribution. For a set of cached tokens (the top-$F$ tokens at each position), Saguaro suppresses the draft model’s probability on those specific tokens:

$$\sigma_{F,C}(z) \propto \begin{cases} C \cdot \exp(z_t) & \text{if } t \in \text{top}_F(z) \\ \exp(z_t) & \text{otherwise} \end{cases}$$

Here $z$ is the vector of draft model logits, $F$ is the fan-out, and $C \in [0,1]$ is a downweighting constant. When $C=1$, this is the standard softmax with no modification. When $C < 1$, the cached tokens receive reduced probability in the draft’s distribution.

Why does this help? Let’s trace the effect on the residual.

When the draft model assigns less probability to a cached token, the gap $p_{\text{target}}(\cdot) - p_{\text{draft}}(\cdot)$ becomes larger for that token. A larger gap means the residual distribution assigns more probability to that token. Saguaro is steering the entire residual distribution to concentrate on the exact tokens it has cached.
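
A toy calculation makes the steering effect visible. The sketch below implements $\sigma_{F,C}$ and the residual in plain Python; the function names, the four-token vocabulary, the draft logits, and the target distribution are all made up for illustration. Acceptance is measured as $\sum_t \min(p_{\text{target}}(t), p_{\text{draft}}(t))$, the usual per-token acceptance probability in speculative sampling.

```python
import math


def saguaro_softmax(logits, fan_out, c):
    """Softmax over draft logits with the top-`fan_out` tokens scaled by `c`."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    cached = set(order[:fan_out])
    weights = [(c if i in cached else 1.0) * math.exp(z) for i, z in enumerate(logits)]
    total = sum(weights)
    return [w / total for w in weights], cached


def residual(p_target, q_draft):
    """Normalized max(p - q, 0); assumes p and q are not identical."""
    raw = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    total = sum(raw)
    return [x / total for x in raw]


def summarize(p_target, logits, fan_out, c):
    """Return (acceptance probability, residual mass on cached tokens)."""
    q, cached = saguaro_softmax(logits, fan_out, c)
    res = residual(p_target, q)
    acceptance = sum(min(p, qi) for p, qi in zip(p_target, q))
    cached_mass = sum(res[i] for i in cached)
    return acceptance, cached_mass


draft_logits = [2.0, 1.0, 0.0, -1.0]
p_target = [0.62, 0.25, 0.09, 0.04]    # made-up target distribution

acc_plain, hit_plain = summarize(p_target, draft_logits, fan_out=2, c=1.0)
acc_down, hit_down = summarize(p_target, draft_logits, fan_out=2, c=0.5)
```

In this toy case, moving from `c=1.0` to `c=0.5` lowers acceptance from about 0.98 to about 0.92, while the residual mass on the two cached tokens rises from about 0.55 to 1.0. Real distributions will behave less cleanly, but the direction of the trade is the same.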

Figure: Saguaro Sampling: Trading Acceptance for Cache Hits. How downweighting cached tokens in the draft forces the residual to concentrate on those same tokens. Panels show the draft distribution q(x) with Saguaro and the residual distribution max(p - q, 0), controlled by the downweighting constant C, from C=0 (max cache hits, low acceptance) to C=1 (standard, no modification). At C = 1.00 the example reads acceptance rate 0.82, cache hit rate 0.45, and net speedup effect 1.0x.

The tradeoff: Moving C toward 0 suppresses draft probability on cached tokens, which forces the residual distribution to concentrate on those same tokens. This increases cache hit rate at the cost of acceptance rate. The optimal C* balances both effects for maximum end-to-end speedup.

Theorem 15 from the paper confirms this formally: the cache hit rate $p_{\text{hit}}$ increases monotonically as $C \to 0$. Push $C$ all the way to zero and the residual distribution is forced entirely onto cached tokens (guaranteeing a hit), but the acceptance rate collapses because the draft distribution diverges maximally from the target.

The optimal $C^*$ balances these competing effects and depends on the sampling temperature:

Temperature 0 (greedy decoding): $C = 1$ is optimal. The bonus token is deterministic (the argmax of the target distribution), so the top-1 draft prediction already has high accuracy. No need to sacrifice acceptance rate.
High temperature: $C \ll 1$ becomes advantageous. The bonus token is sampled from a flatter distribution, making it harder to predict without help. Saguaro sampling concentrates the residual, recovering cache hit rates that would otherwise be low.

In practice, Saguaro sampling provides up to 50% additional end-to-end speedup at high temperatures compared to using $C=1$.

Challenge 3: Scaling to Large Batches (Saguaro Fallback)

The Batch Size Problem

Everything so far has assumed batch size 1 (a single sequence). At larger batch sizes, a new problem emerges.

With batch size $b$, the system can only proceed to the next round when all $b$ sequences have speculations ready. The probability that every sequence in the batch gets a cache hit is:

$$P(\text{all hit}) = p_{\text{hit}}^b$$

Even with $p_{\text{hit}} = 0.9$ per sequence, a batch of 16 sequences gives $P(\text{all hit}) = 0.9^{16} \approx 0.19$. At batch size 32, it drops to $0.9^{32} \approx 0.03$. The probability that every sequence hits decays exponentially with batch size, so the probability of at least one cache miss quickly approaches 1, and a single miss stalls the entire batch.
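
The decay is quick to tabulate. The helper names below are hypothetical, and the calculation assumes cache hits are independent across sequences, as the formula above does.

```python
def p_all_hit(p_hit, batch_size):
    """Probability that every sequence in the batch gets a cache hit."""
    return p_hit ** batch_size


def p_any_miss(p_hit, batch_size):
    """Probability that at least one sequence misses and stalls the batch."""
    return 1.0 - p_all_hit(p_hit, batch_size)


# Per-sequence hit rate 0.9, as in the text above.
table = {b: p_all_hit(0.9, b) for b in (1, 8, 16, 32)}
```

For a 0.9 per-sequence hit rate the table reads roughly 0.90, 0.43, 0.19, and 0.03, so by batch size 32 a stall-free round is the exception rather than the rule.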

The Naive Fallback Fails

The intuitive solution is simple: when a cache miss occurs, have the primary draft model generate a fresh speculation on the spot. But this is catastrophically bad at scale.

The primary draft model is still a neural network that generates tokens autoregressively. Generating K tokens takes non-trivial time. While this one stalled sequence catches up, every other sequence in the batch (including those with cache hits) waits. The batch is only as fast as its slowest member.

Corollary 16 in the paper formalizes this: at large batch sizes, the SSD speedup becomes overwhelmingly bounded by the fallback latency. If the fallback speculator is slow, the theoretical gains from caching vanish.

Dual-Tier Fallback

Saguaro solves this with a dual-tier fallback strategy controlled by a critical batch size $b^*$:

Below $b^*$ (low batch regime): Cache misses are infrequent. The primary draft model serves as its own fallback, generating a fresh speculation synchronously. The latency penalty is acceptable because misses are rare and each one affects only one sequence.

Above $b^*$ (high batch regime): The system switches to an ultra-fast backup speculator with minimal latency. This could be an n-gram model, random tokens, or a token frequency model. The backup’s speculations will likely be rejected during verification (random tokens have near-zero acceptance rate). But the strategic insight is that the latency cost of feeding bad speculations to one sequence is vastly less than the latency cost of making the entire batch wait for a neural draft model.

Theorem 17 proves this formally: accepting a single sequence’s quality penalty (poor speculation) is strictly better than inflicting the primary drafter’s latency across the entire batch.

The critical batch size $b^*$ is derived analytically from the speedup equation and depends on $p_{\text{hit}}$, $T_p$ (primary drafter latency), and $T_b$ (backup latency).

Figure: Dual-Tier Fallback Strategy. How batch size determines the optimal fallback speculator. The plot shows P(at least one miss) against batch size b for a per-sequence cache hit rate of 0.85, with the critical batch size marked at b* = 8. In the low batch regime, misses are infrequent and quality matters more than speed, so the primary (neural) draft is the fallback. In the high batch regime, misses are near-certain and speed matters more than quality, so the fast backup (n-gram/random) is the fallback.

Why fast beats accurate at scale: At batch size 32 with an 85% per-sequence hit rate, P(at least one miss) = 1 - 0.85^32, which is about 99.4%. A miss is nearly guaranteed every round. Using the primary neural draft as fallback would stall the entire batch for its generation time. A fast backup (even random tokens) unblocks the batch instantly, and only the single stalled sequence pays a quality penalty.

The Full Algorithm

SSD runs three concurrent processes: a main coordinator, a verifier, and a speculator. Here is the algorithm:

# Main: launches speculator asynchronously, runs verifier
def main(prompt, target, primary_draft, backup_draft):
    launch_async(speculator, prompt, primary_draft, backup_draft)
    return verifier(prompt, target)

# Verifier: runs on target model's GPUs (e.g., 4x H100)
def verifier(prompt, target):
    target.prefill(prompt)
    spec_tokens = RECEIVE()              # wait for first speculation
    generated = []

    while True:
        outcome = target.verify(spec_tokens)   # standard SD verification
        generated.extend(outcome.tokens)

        SEND(outcome)                          # send (k, t*) to speculator

        if EOS in outcome.tokens:              # speculator sees the same outcome and exits too
            return generated

        spec_tokens = RECEIVE()                # wait for next speculation

# Speculator: runs on draft model's GPU (e.g., 1x H100)
def speculator(prompt, primary_draft, backup_draft):
    primary_draft.prefill(prompt)
    spec_tokens = primary_draft.speculate(prompt)  # initial speculation

    while True:
        SEND(spec_tokens)                      # send to verifier

        # While verification runs, build the cache
        cache = build_speculation_cache(
            spec_tokens, primary_draft           # §4.1: geometric fan-out
        )

        outcome = RECEIVE()                     # get actual (k, t*)

        if EOS in outcome:
            return

        if outcome in cache:                    # CACHE HIT
            spec_tokens = cache[outcome]        # instant return
        else:                                   # CACHE MISS
            spec_tokens = fallback(             # §4.3: dual-tier
                outcome, primary_draft, backup_draft
            )

Step-by-Step: One Round

Let’s trace through a single round to see how the pieces fit together.

Step 1: The speculator sends K draft tokens to the verifier and immediately begins predicting outcomes. It examines the draft model’s logits at each position to determine the most likely bonus tokens.

Step 2: Using geometric fan-out (Challenge 1), the speculator allocates its budget across positions. For each predicted outcome $(k, t^*)$, it generates K new draft tokens autoregressively from that context.

Step 3: These speculations are stored in the cache. With K=7 and fan-out F=3, the cache contains roughly 24 entries (8 acceptance lengths, with 3 bonus token predictions each, weighted by geometric allocation).

Step 4: The verifier finishes and sends back the actual outcome $(k, t^*)$.

Step 5: Cache lookup. If the outcome matches, the cached speculation is returned instantly. If not, the fallback mechanism kicks in.

The critical property is that on a cache hit, the round latency equals only the verification time (because the speculation was pre-computed in parallel). In standard SD, the round latency equals verification time plus drafting time. This is where the speedup comes from.

Correctness: SSD Is Lossless

An important guarantee: SSD produces the same output distribution as standard speculative decoding (which itself matches autoregressive decoding exactly).

On a cache hit, the cached speculation is verified using the same rejection sampling mechanism as standard SD. The fact that the speculation was pre-computed rather than computed just-in-time changes nothing about verification correctness.

On a cache miss, the system falls back to standard synchronous SD, which is known to be lossless.

The speculation cache is a performance optimization that never affects what tokens the target model accepts or rejects. Pre-speculation only changes when the draft tokens are computed, not what they are or how they are verified.

Hardware Setup

SSD requires the draft and target models to run on separate GPUs so they can operate concurrently. The typical configuration:

Target model (verifier): 4x H100 80GB GPUs with tensor parallelism (for Llama-3.1-70B)
Draft model (speculator): 1x H100 80GB GPU on a separate device
Total: 5 GPUs for SSD vs. 4 GPUs for standard SD/AR

This is a 25% increase in hardware. The question is whether the speedup justifies the extra cost. At batch size 1, SSD achieves up to roughly 2x higher throughput than SD with the same target model, so throughput per GPU can still improve after counting the extra device.

Results

The authors benchmark SSD (with Saguaro optimizations) against autoregressive decoding and standard speculative decoding across four datasets: HumanEval (code), Alpaca (chat), GSM8K (math), and UltraFeedback (general).

Setup: Llama-3.1-70B-Instruct as the target model on 4x H100 GPUs, Llama-3.2-1B-Instruct as the draft model on 1x H100 GPU. K=6 for SD, K=7 for SSD, fan-out F=3 for SSD.

Figure: Performance: AR vs SD vs SSD. Throughput comparison at batch size 1 with greedy decoding (temperature = 0) for Autoregressive (4 GPUs), Speculative Decoding (4 GPUs), and SSD / Saguaro (5 GPUs). Target: 4x H100 80GB (TP=4) | Draft: 1x H100 80GB | K=6 (SD), K=7 (SSD), F=3.

Key findings:

SSD vs. autoregressive: Up to ~5x faster (e.g., 255.8 tok/s vs. 54.7 tok/s on some benchmarks)
SSD vs. standard SD: Up to ~2x faster (the 255.8 vs. 161.8 tok/s pair works out to about 1.6x)
At larger batch sizes: SSD still provides ~20% improvement over SD, even as cache hit rates drop. The Saguaro optimizations push the throughput-latency Pareto frontier across all batch sizes.
Temperature sensitivity: Cache hit rates decrease with sampling temperature, but Saguaro sampling (low $C$) compensates effectively, maintaining gains at high temperatures.

The results also hold for a second model family, Qwen-3-32B (target) with Qwen-3-0.6B (draft): 203.8 tok/s for SSD vs. 136.8 for SD vs. 88.8 for AR.
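
Since SSD uses five GPUs where SD and AR use four, it is worth normalizing by device count. The sketch below does that for the Llama throughput figures quoted in the key findings. It assumes every GPU counts equally, which ignores that the draft GPU hosts a much smaller model; the helper name `per_gpu` is hypothetical.

```python
def per_gpu(tokens_per_second, gpus):
    """Throughput divided evenly across all GPUs in the setup."""
    return tokens_per_second / gpus


ssd = per_gpu(255.8, 5)    # SSD: 4 target GPUs + 1 draft GPU
sd = per_gpu(161.8, 4)     # standard speculative decoding
ar = per_gpu(54.7, 4)      # autoregressive baseline

ssd_vs_sd = ssd / sd
ssd_vs_ar = ssd / ar
```

On these numbers SSD delivers about 51.2 tok/s per GPU against about 40.5 for SD, roughly a 1.26x per-GPU gain, so the extra device more than pays for itself in this configuration.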

Where SSD Fits in the Landscape

SSD is not the first attempt to overlap drafting and verification. Several concurrent methods target the same bottleneck:

Method	Approach	Limitation
AMUSD	Pre-speculates for the “all accepted” outcome only	Misses all partial-acceptance cases
PEARL	Single outcome prediction	Same limitation as AMUSD
SwiftSpec	Token tree branching off current speculation	Greedy only; fallback struggles at high temp/batch
SpecBranch	Single branching point with regular fallback	Approximately a special case of SSD
SSD (Saguaro)	Multi-outcome caching with geometric fan-out, cache-aware sampling, dual-tier fallback	Requires extra GPU; latency-focused

SSD is also orthogonal to several other inference optimizations, meaning they can be combined:

EAGLE/EAGLE-2: Feature-level draft prediction. SSD could use an EAGLE-style drafter as its speculator.
Tree-based verification (Sequoia, SpecInfer): Verify multiple candidates in one pass. SSD parallelizes the draft-verify loop itself, a different axis.
Non-neural speculators (SuffixDecoding, Infini-gram): Could serve as SSD’s fast backup speculator for cache misses.

When to Use SSD (and When Not To)

SSD is a strong fit when:

Latency matters more than throughput (real-time chat, interactive coding)
Batch sizes are small to moderate ($b \leq b^*$)
You have spare GPU capacity for a separate draft model
You are already using speculative decoding and want to push further

SSD is not ideal when:

You are throughput-bound (large-scale RL, offline batch generation). SSD optimizes per-request latency, not aggregate throughput at high concurrency.
Hardware is constrained. The extra GPU for the draft model is not available.
Batch sizes are consistently very large. The chance that every sequence in the batch hits the cache decays exponentially with batch size, and the gains narrow.

Looking Forward

SSD makes a compelling case that the sequential dependencies in LLM inference are not fixed constraints but engineering surfaces that can be optimized. The draft-verify loop in speculative decoding seemed inherently sequential. It turned out that the sequential part (waiting for verification before starting the next draft) could be hidden by speculating about the verification outcome.

This pattern of applying a technique recursively to itself is worth paying attention to. The authors frame SSD as “nested speculation,” and the natural question is whether another level of nesting could help. The answer is likely no for now (the overhead of a third speculation level would exceed the marginal benefit), but the thinking is instructive: whenever two stages of a pipeline are sequential, ask whether one stage can predict the other’s output and pre-compute accordingly.

The practical significance is clear. For latency-sensitive applications running large models, SSD with Saguaro optimizations roughly doubles the speedup of speculative decoding at modest hardware cost. As inference frameworks like NVIDIA Dynamo adopt disaggregated architectures (separate prefill and decode stages on different hardware), SSD’s separate-GPU design fits naturally into that direction.

References

Kumar, T., Dao, T., & May, A. (2026). Speculative Speculative Decoding. ICLR 2026.
- The SSD paper introducing Saguaro and the three core optimizations.
Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.
- The original speculative decoding paper with distribution preservation proofs.
Chen, C., Borgeaud, S., Irving, G., Lespiau, J. B., Sifre, L., & Jumper, J. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv preprint.
- Independent discovery of speculative decoding at DeepMind.
Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. ICML 2024.
- Feature-level speculation achieving superior speedups through hidden state prediction.
Chen, J., Tiwari, V., Sadhukhan, R., Chen, Z., Shi, J., Yen, I. E.-H., & Chen, B. (2024). MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding. arXiv preprint.
- Analysis of speculative decoding performance at high batch sizes.
Spector, B. & Ré, C. (2023). Accelerating LLM Inference with Staged Speculative Decoding. arXiv preprint.
- Multi-stage speculation with cascaded draft models.
Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., & Jia, Z. (2024). SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification. ASPLOS 2024.
- Tree-based speculative inference with parallel verification.
GitHub: tanishqkumar/ssd. Saguaro Implementation.
- Open-source SSD implementation with custom inference engine, supporting Llama-3 and Qwen-3 families.

Durable Execution for AI Agents: Temporal's Architecture for Production Reliability

Fri, 27 Feb 2026 10:00:00 +0800

What This Post Covers

In The Anatomy of Agentic Code Assist, we looked at how agents like OpenHands work: event streams, sandboxed execution, tool use, the CodeAct framework. That post covered the agent itself, what it does and how it’s built. This post covers a different layer: the infrastructure that keeps agents running reliably in production.

When an agent runs for hours, makes hundreds of tool calls, and interacts with flaky LLM APIs, a whole class of infrastructure problems emerges that application-level code cannot solve:

State loss on process crashes: a worker dies mid-workflow and hours of accumulated context disappear. The agent restarts from scratch, re-executing every LLM call and tool invocation.
LLM API rate limits and timeouts: 429s, 500s, socket timeouts, multi-minute latencies. A reflexion loop running 10 cycles can consume 50x the tokens of a linear pass if any step fails and forces a restart.
Debugging non-deterministic behavior: the same prompt produces different outputs, different tool call sequences, different results. Without a complete execution trace, reproducing production bugs is close to impossible.
Tasks exceeding server timeouts: agent sessions lasting minutes to hours die on deployments, fail during scaling events, and exceed web server timeout limits.
Ambiguous recovery after parallel fan-out crashes: the agent launches ten parallel tool calls. The process crashes after seven complete. Which results were already obtained? Which need re-execution?
Losing context during human-in-the-loop waits: the agent pauses for human approval, potentially for hours or days. The server holding that state needs to remain available, or all accumulated context is lost.
Error cascades across multi-agent systems: a single failure in one agent propagates downstream without corrective mechanisms. Simple retry logic at the tail end is inadequate because the agent may have already deviated significantly from the intended path.

Temporal is an orchestration platform built around durable execution. We’ll walk through its architecture, understand why each design decision exists, and look at how OpenAI’s Codex team uses it in production.

The core idea can be expressed as a state transition: $S_{t+1} = f(S_t, M(S_t, T_t))$. Agent state evolves through deterministic orchestration ($f$) of non-deterministic operations ($M$ = LLM response, $T$ = tool results). Temporal separates these two concerns at the infrastructure level. The deterministic part goes in workflows. The non-deterministic part goes in activities.

Workflows and Activities

The fundamental design decision in Temporal: split all code into two categories based on determinism.

Workflows

A Workflow is the agent’s control flow, the logic that decides which tools to call, in what order, what to do with results, and when to wait for human input. Workflows run as ordinary code in Python, TypeScript, Go, or Java, with one hard constraint: they must be deterministic. Given the same inputs and the same activity results, a workflow must produce the same sequence of commands every time.

A Workflow Execution can run for seconds, hours, or years. It persists through infrastructure failures. The workflow doesn’t know or care about crashes; from its perspective, execution is continuous.

Activities

Activities are where all side effects live: LLM API calls, tool executions, database writes, HTTP requests. Anything that can fail, timeout, or produce different results on re-execution. Temporal records every activity result in a persistent Event History, an append-only log that serves as the authoritative record for the entire workflow’s state.

Why This Split Matters

The determinism requirement is what enables replay-based recovery (which we’ll cover in the next section). Here’s the reasoning: if we know the workflow logic is deterministic, and we have a recorded log of all activity results, we can reconstruct the exact workflow state after a crash. We don’t need developer-written checkpoint code. We don’t need serialization logic. We just replay the deterministic code with the previously recorded results, and we arrive at the same state.

This raises an obvious question: LLMs are non-deterministic, so how does this work? The answer maps directly to how agents already operate. The LLM call goes in an activity – it’s non-deterministic, its result gets recorded. The logic deciding what to call and when goes in the workflow – it’s deterministic. The agent loop says “if the LLM returned a tool call, execute that tool; if it returned a final answer, return it.” That orchestration logic doesn’t change between runs.

A Complete Agent Loop

Here’s what a complete agent workflow looks like in Python:

import asyncio
from dataclasses import dataclass
from datetime import timedelta

from temporalio import workflow, activity
from temporalio.common import RetryPolicy

@dataclass
class LLMRequest:
    goal: str
    history: list
    available_tools: list

@activity.defn
async def call_llm(request: LLMRequest) -> dict:
    # Non-deterministic: LLM API call lives here
    response = await llm_client.chat(
        messages=request.history,
        tools=request.available_tools,
    )
    return {"action": response.action, "params": response.params}

@activity.defn
async def execute_tool(tool_name: str, params: dict) -> str:
    # Non-deterministic: tool execution lives here
    return await tool_registry.execute(tool_name, params)

@workflow.defn
class AIAgentWorkflow:
    @workflow.run
    async def run(self, user_goal: str) -> str:
        conversation_history = []
        llm_retry = RetryPolicy(
            initial_interval=timedelta(seconds=1),
            backoff_coefficient=2.0,
            maximum_interval=timedelta(seconds=60),
            maximum_attempts=10,
        )

        while not self.is_goal_achieved(conversation_history):
            # Deterministic: this decision logic is the workflow
            next_action = await workflow.execute_activity(
                call_llm,
                LLMRequest(
                    goal=user_goal,
                    history=conversation_history,
                    available_tools=self.get_available_tools(),
                ),
                start_to_close_timeout=timedelta(seconds=120),
                retry_policy=llm_retry,
            )

            if next_action["action"] == "tool_call":
                # Parallel tool execution when multiple tools requested.
                # Activities with more than one argument take them via args=[...].
                results = await asyncio.gather(*[
                    workflow.execute_activity(
                        execute_tool,
                        args=[tool["name"], tool["params"]],
                        start_to_close_timeout=timedelta(seconds=30),
                    )
                    for tool in next_action.get("tool_calls", [])
                ])
                conversation_history.extend(results)
            else:
                conversation_history.append(next_action)

        return self.format_final_result(conversation_history)

Figure: Workflow / Activity Split. Deterministic orchestration on the left, non-deterministic side effects on the right, Event History in the center. The Workflow column (deterministic) shows the loop: start the agent loop (while not is_goal_achieved()), call the LLM (execute_activity(call_llm, ...)), check the result (if action == "tool_call"), execute the tool (execute_activity(execute_tool, ...)), append to history (conversation_history.extend(results)), then loop back to the LLM call. The Event History column lists WorkflowStarted (workflow_id: "agent-42"), ActivityScheduled (call_llm, pending), ActivityCompleted (result: {action: "tool_call"}), WorkflowTaskCompleted (decision: schedule tool), ActivityScheduled (execute_tool, pending), ActivityCompleted (result: "file edited OK"), and WorkflowTaskCompleted (decision: continue loop). The Activities column (non-deterministic) shows the LLM API call (POST /v1/chat/completions) and the tool execution (tool_registry.execute(...)), each with retry, followed by the result being recorded in the event history.

Deterministic Replay

Replay is the mechanism that makes Temporal’s fault tolerance work. Let’s walk through it in detail, because understanding replay is the key to understanding why the rest of the architecture looks the way it does.

The Event History

Every workflow execution has an Event History: an append-only log stored in Temporal’s persistence layer. When an activity completes, Temporal records both the request and the result.

What Happens on a Crash

Here’s a concrete scenario. An agent workflow is at step 4 of 7. It has completed three LLM calls and tool executions, and is partway through the fourth:

The worker process crashes (OOM, deployment, hardware failure)
The Temporal server detects the failure (heartbeat timeout or task timeout)
Another worker picks up the workflow from the task queue
Temporal re-executes the workflow code from the beginning
When the code reaches activity calls that already completed (steps 1–3), Temporal returns the previously recorded results from the event history instead of re-executing them
The workflow code deterministically reaches the exact same state it was in before the crash: same local variables, same loop counter, same conversation history
Forward execution resumes from step 4. Only now does an actual activity get dispatched

Because the workflow code is deterministic, replaying it with the same activity results always produces the same sequence of commands. The entire call stack and state are reconstructed with no developer-written checkpoint code. This is different from simple checkpointing because the developer never has to decide what to checkpoint or when – the replay mechanism reconstructs everything automatically from the event history.
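To make the replay idea concrete, here is a toy sketch in plain Python. It is not Temporal's implementation: ReplayContext, agent_workflow, and the fake activity results are hypothetical names invented for illustration. The only idea it borrows from the text is that completed activity results are read back from history, and only the remaining calls run for real.

```python
from typing import Any, Callable, Optional


class ReplayContext:
    """Toy stand-in for a workflow runtime (hypothetical, not a Temporal API)."""

    def __init__(self, history: Optional[list[tuple[str, Any]]] = None):
        self.history = list(history or [])  # recorded (activity_name, result) pairs
        self.position = 0                   # index of the next activity call
        self.executed: list[str] = []       # activities actually run in this attempt

    def activity(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.position < len(self.history):
            # Replay: return the recorded result instead of re-executing.
            recorded_name, result = self.history[self.position]
            assert recorded_name == name, "non-deterministic workflow code"
        else:
            # Forward execution: run the side effect once and record it.
            result = fn()
            self.history.append((name, result))
            self.executed.append(name)
        self.position += 1
        return result


def agent_workflow(ctx: ReplayContext, steps: int) -> list[str]:
    conversation = []
    for i in range(steps):
        name = "call_llm" if i % 2 == 0 else "execute_tool"
        conversation.append(ctx.activity(name, lambda: f"{name}-result-{i}"))
    return conversation


# First attempt "crashes" after three completed activities.
first = ReplayContext()
agent_workflow(first, steps=3)

# A new worker re-runs the workflow from the beginning with the persisted history.
second = ReplayContext(history=first.history)
final_state = agent_workflow(second, steps=7)
print(second.executed)  # ['execute_tool', 'call_llm', 'execute_tool', 'call_llm']
```

The second attempt rebuilds the first three conversation entries from history without calling anything, then dispatches only steps 4 through 7.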

The Determinism Contract

The determinism requirement imposes hard constraints on workflow code. You cannot use:

random() – use workflow.random() instead
datetime.now() – use workflow.now() instead
time.sleep() – use workflow.sleep() or timers instead
Direct I/O (network calls, file reads) – these must go in activities
Threading or subprocess creation – use activities or child workflows

For AI engineers, this constraint is less restrictive than it sounds. LLM calls and tool executions are inherently side effects, so they already belong in activities. The orchestration logic that decides what to call and when – “call the LLM, check if it returned a tool call, execute the tool, loop” – doesn’t use random numbers or system clocks.

Here’s what non-determinism violations look like in practice:

import random
import time
from datetime import datetime

from temporalio import workflow

# WRONG: non-deterministic workflow code
@workflow.defn
class BadAgentWorkflow:
    @workflow.run
    async def run(self, goal: str) -> str:
        if random.random() > 0.5:        # different result on replay
            strategy = "aggressive"
        else:
            strategy = "conservative"

        timestamp = datetime.now()         # wall-clock time, different on replay
        time.sleep(5)                      # blocks the worker thread; not a durable timer
        return f"{strategy} at {timestamp.isoformat()}"

# CORRECT: deterministic workflow code
@workflow.defn
class GoodAgentWorkflow:
    @workflow.run
    async def run(self, goal: str) -> str:
        if workflow.random().random() > 0.5:   # seeded, deterministic across replays
            strategy = "aggressive"
        else:
            strategy = "conservative"

        timestamp = workflow.now()              # deterministic across replays
        # Durable timer, survives crashes. In the Python SDK, asyncio.sleep()
        # inside a workflow is also backed by a durable timer.
        await workflow.sleep(5)
        return f"{strategy} at {timestamp.isoformat()}"

Contrast with OpenHands

Both Temporal and OpenHands use event sourcing, but for different purposes. OpenHands records events (CmdRunAction, FileWriteAction, observations) for debuggability and observability. You can replay the event sequence to understand what the agent did. Temporal records events so the workflow can be reconstructed after a crash as if nothing happened. Same architectural pattern, different goals.

Formalization

If History = $[(a_1, r_1), (a_2, r_2), \ldots, (a_k, r_k)]$ records completed activities, then replay returns $r_1 \ldots r_k$ from history and only executes $a_{k+1}$ forward. The workflow’s determinism guarantees that replaying with recorded results produces the same sequence of activity commands, so the state at step $k$ is identical to the state before the crash.

Deterministic Replay

Figure: how Temporal recovers from a crash by replaying the event history. Seven alternating steps (call_llm, exec_tool, call_llm, exec_tool, call_llm, exec_tool, call_llm) are shown above the Event History, which is persisted in the database.

Server Architecture

Temporal runs as four server-side services plus a persistence layer, with user-managed workers running externally.

The Four Services

Frontend Service: a stateless gRPC gateway. All client and worker communication flows through it. Handles rate limiting, routing, and authorization. Horizontally scalable because it holds no state.

History Service: owns workflow state and persists event histories. This is the most important component. Manages state transitions across configurable History Shards, which are the unit of concurrent throughput scaling. Each shard handles a subset of workflows. More shards = more concurrent workflows.

Matching Service: hosts Task Queues and dispatches work to workers. When a workflow needs an activity executed, the Matching Service places it on the appropriate task queue. When a worker polls for work, the Matching Service assigns a task.

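Worker Service: an internal service that runs Temporal’s own background system workflows. It is the fourth server-side service and is separate from the workers you deploy.

Workers (user-managed): stateless external processes that you deploy and manage. They are not part of the server. Workers long-poll task queues via gRPC, execute workflow or activity code, and report results back. Because workers hold no state, they can be killed, restarted, or scaled horizontally without any coordination. The Temporal server is always the authoritative record.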

Task Queues

Task Queues provide a routing layer that becomes important for agent workloads. Workflow tasks and activity tasks flow through separate queues. You can route activities to specialized worker pools (GPU workers for inference, lightweight workers for API calls) by assigning them to different task queues. This lets teams scale heterogeneous agent workloads independently.

Component	Responsibility	Failure Impact
Frontend Service	gRPC gateway, rate limiting, routing	Clients can’t connect (stateless, restart recovers)
History Service	Workflow state, event persistence, shard management	Workflow progress pauses until recovery
Matching Service	Task queue hosting, work dispatch	Tasks queue but don’t dispatch (no work lost)
Workers	Execute workflow/activity code, report results	Pending tasks reassigned to other workers
Persistence (DB)	Durable storage for event histories	All services degraded until DB recovers

Temporal Server Architecture

Figure: four services, a persistence layer, and stateless workers. A Client / SDK starts workflows and sends signals over gRPC to the stateless Frontend Service, which passes commands to the History Service (event persistence and state transitions, split across N shards) and routes work to the Matching Service (task queues). The History Service reads and writes the persistence layer (PostgreSQL / Cassandra). The Matching Service dispatches tasks to stateless workers (W1, W2, W3), which run your workflow and activity code and report results back through the Frontend via gRPC long-poll.

Primitives for Agent Patterns

Beyond workflows and activities, Temporal provides several primitives that map to common agent coordination problems.

Signals

Signals are asynchronous messages sent to a running workflow. The workflow can react at any point in its execution. This is the mechanism for human-in-the-loop: the agent reaches a decision point, calls workflow.wait_condition(), and a signal carrying the human’s approval resumes it.

The workflow can wait hours or days. It consumes no compute while waiting because its state lives in the event history, not in a running process. No worker is tied up, no server is keeping a connection open. The state is persisted in the database and can be reconstructed on demand when the signal arrives.

Queries

Queries let external systems read workflow state without modifying it. This powers dashboards and monitoring: “What step is the agent on? What was the last LLM response? How many tokens has it consumed?” The query handler runs against the in-memory workflow state and returns immediately.

Updates

Updates combine a signal and a query: send a command to the workflow and get a response. This is useful for interactive agent control (“redo step 2 with different parameters”) where you need to both modify the workflow’s behavior and confirm the modification was accepted.

Replit, for example, uses Workflow Updates for human-in-the-loop consent. When their agent wants to perform a destructive action, it pauses and waits for the user to accept or reject via an Update.

ContinueAsNew

Each workflow execution is limited to 51,200 events or 50MB of event history. For agents making hundreds of tool calls, history grows fast; each activity generates roughly 3 events. If activities return large LLM payloads (500KB+), the 50MB limit becomes binding well before the event count limit.

ContinueAsNew addresses this by atomically starting a fresh execution with the same Workflow ID, carrying forward essential state while resetting the history. The old history is archived. For long-running agents, this is how you keep the workflow alive indefinitely.
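A quick back-of-the-envelope calculation shows which limit bites first. This sketch assumes each activity contributes about 3 events and that its result payload is the only thing counted against the size limit; both are simplifications, and the function name is ours.

```python
MAX_EVENTS = 51_200
MAX_BYTES = 50 * 1024 * 1024  # 50MB
EVENTS_PER_ACTIVITY = 3       # rough figure: scheduled, started, completed


def activities_until_limit(avg_payload_bytes: int) -> tuple[int, str]:
    """Return (activity count, binding limit) for a given average result size."""
    by_events = MAX_EVENTS // EVENTS_PER_ACTIVITY
    by_size = MAX_BYTES // avg_payload_bytes
    if by_size < by_events:
        return by_size, "size"
    return by_events, "events"


print(activities_until_limit(500 * 1024))  # (102, 'size')
print(activities_until_limit(2 * 1024))    # (17066, 'events')
```

With 500KB results the size limit is reached after roughly a hundred activities, long before the event count matters. With small results the event count becomes the constraint instead.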

Human-in-the-Loop Pattern

@workflow.defn
class AgentWithHumanApproval:
    def __init__(self):
        self.approved = False
        self.current_step = "initializing"
        self.pending_action = None

    @workflow.signal
    async def approve(self, decision: str):
        self.approved = decision == "yes"

    @workflow.query
    def get_status(self) -> dict:
        return {
            "step": self.current_step,
            "pending_action": self.pending_action,
            "approved": self.approved,
        }

    @workflow.run
    async def run(self, goal: str) -> str:
        while not self.is_complete():
            action = await workflow.execute_activity(
                call_llm, goal,
                start_to_close_timeout=timedelta(seconds=120),
            )

            if action.requires_approval:
                self.pending_action = action.description
                self.current_step = "awaiting_approval"
                # Workflow state persists in DB -- no compute cost while waiting
                await workflow.wait_condition(lambda: self.approved)
                self.approved = False  # reset for next approval

            self.current_step = "executing"
            result = await workflow.execute_activity(
                execute_tool, args=[action.tool, action.params],
                start_to_close_timeout=timedelta(seconds=60),
            )
        return self.format_result()

The workflow.wait_condition(lambda: self.approved) line is where the agent pauses. It can sit there for minutes, hours, or days. If the server restarts, if workers are redeployed, the workflow’s state survives. When the signal arrives, any available worker picks it up and resumes execution.

Agent Primitives Timeline

Figure: Signals, Queries, Updates, and wait points across an agent's lifecycle. The timeline runs from workflow start through an LLM call, a tool execution, an LLM call that needs approval, a zero-compute wait, the approve("yes") Signal, the approved tool execution, an Update that modifies parameters, a final LLM call, and completion. A Query such as get_status() can be answered during the wait. ContinueAsNew resets the event history counter (0 / 51,200) and the workflow continues in a fresh execution.

Retry Policies and Error Handling

LLM APIs fail routinely: rate limits (429), server errors (500), socket timeouts, multi-minute latencies. For agents making hundreds of calls these failures are the norm, and different activities need different retry strategies.

Declarative Retry Policies

Retry policies are configured per activity with several parameters: initial interval, backoff coefficient, maximum interval, maximum attempts, and non-retryable error types. The important part is that retries happen at the infrastructure level. If a worker crashes during a retry cycle, another worker picks up with the retry state intact. The developer writes no retry logic.
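To see what those parameters imply, here is a small sketch of the delay sequence an exponential backoff policy produces. It assumes the usual formula (initial interval times coefficient to the power of the retry number, capped at the maximum interval) and ignores anything else the server may do. The function name and the example values are illustrative.

```python
def backoff_schedule(
    initial: float, coefficient: float, maximum: float, max_attempts: int
) -> list[float]:
    """Delays in seconds before each retry (attempt 1 has no delay)."""
    delays = []
    for retry in range(max_attempts - 1):
        delays.append(min(initial * coefficient**retry, maximum))
    return delays


# 1s initial interval, 2.0 coefficient, 60s cap, 10 attempts
schedule = backoff_schedule(1.0, 2.0, 60.0, 10)
print(schedule)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0, 60.0]
print(sum(schedule))  # 243.0
```

Ten attempts with these settings give the upstream API about four minutes to recover before the activity is reported as failed.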

Why Different Activities Need Different Strategies

LLM calls need aggressive retry with exponential backoff. Rate limits are transient, and the cost of not retrying (losing all accumulated context and starting the agent run from scratch) far outweighs the cost of waiting 30 seconds for capacity. Configure high maximum attempts (10+) with a long maximum interval.

Tool executions need limited retries. Tools may not be idempotent – running git commit twice produces different results. Blindly retrying could cause duplicate side effects. Configure low maximum attempts (2–3) and mark certain error types as non-retryable.

Human notifications often need no retry at all. Fire-and-forget: if the Slack message fails, don’t block the workflow.

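from datetime import timedelta

from temporalio import activity
from temporalio.common import RetryPolicy

llm_retry = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(seconds=60),
    maximum_attempts=10,
    non_retryable_error_types=["InvalidPromptError"],
)

tool_retry = RetryPolicy(
    initial_interval=timedelta(seconds=2),
    maximum_attempts=3,
    non_retryable_error_types=["PermissionDenied", "NotIdempotent"],
)

# Heartbeating for long-running activities
# (process_chunks and process are application helpers, not shown)
@activity.defn
async def execute_long_tool(task: dict) -> str:
    # On a retry, skip chunks the previous attempt already reported as done
    details = activity.info().heartbeat_details
    start = details[0]["progress"] + 1 if details else 0
    result = ""
    for i, chunk in enumerate(process_chunks(task)):
        if i < start:
            continue
        result = await process(chunk)
        # Heartbeat after the chunk finishes so "progress" means completed work
        activity.heartbeat({"progress": i, "last_chunk": chunk.id})
    return result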

Heartbeats

For long-running activities, the worker periodically reports progress via heartbeats. If the heartbeat stops (worker crashed), Temporal reschedules the activity on another worker. The new worker can read the last heartbeat details to resume from the last checkpoint rather than starting over. This matters for activities processing large datasets or running multi-step tool executions.

Saga Patterns for Multi-Agent Systems

When multiple agents coordinate, failure handling gets complex. Temporal supports saga patterns where compensation logic runs when a step fails. If a planning agent fails, downstream execution agents’ pending activities can be cancelled rather than left hanging. If the response agent produces an unsatisfactory draft, compensation logic can route back to the research agent for additional context.
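The core of a saga is small enough to sketch in plain asyncio. This is a generic illustration, not Temporal code: run_saga, record, and the step names are hypothetical, and in a real workflow each action and compensation would be an activity call.

```python
import asyncio
from typing import Awaitable, Callable

Step = Callable[[], Awaitable[None]]


async def run_saga(steps: list[tuple[Step, Step]]) -> bool:
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    completed: list[Step] = []
    for action, compensate in steps:
        try:
            await action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                await undo()
            return False
    return True


log: list[str] = []


def record(message: str, fail: bool = False) -> Step:
    """Build a fake step that logs its name, or raises if fail is set."""
    async def step() -> None:
        if fail:
            raise RuntimeError(message)
        log.append(message)
    return step


ok = asyncio.run(run_saga([
    (record("plan"), record("undo plan")),
    (record("research"), record("undo research")),
    (record("draft", fail=True), record("undo draft")),
]))
print(ok, log)  # False ['plan', 'research', 'undo research', 'undo plan']
```

The failed step has no compensation to run because it never completed. Only the steps that succeeded are unwound, most recent first.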

Activity Type	Retry Strategy	Rationale
LLM API call	Aggressive backoff, 10+ attempts	Rate limits are transient; restart cost is enormous
Idempotent tools (search, read)	Moderate backoff, 3–5 attempts	Safe to re-execute; failures are usually transient
Non-idempotent tools (write, deploy)	Limited, 1–2 attempts	Re-execution may cause side effects
Human notification	No retry	Fire-and-forget; don’t block the workflow
Long-running computation	Heartbeat + resume from checkpoint	Avoid restarting expensive work from scratch

Production Case Study: OpenAI Codex

OpenAI’s Codex, their cloud-based coding agent that writes, tests, and iterates on code, uses Temporal as its core orchestration backbone. Will Wang, a software engineer on the Codex team, confirmed publicly that “Temporal is a critical part of the infrastructure powering Codex, responsible for executing our core control flows.” He described it as enabling the team to “easily reason about concurrency, correctness, and fault tolerance” while scaling a complicated distributed system.

Codex sessions run for 6+ hours on complex tasks. The entire agent loop (prompt construction, model inference, tool calls, result observation, loop back) runs as a Temporal Workflow. Each LLM call and tool execution is an Activity with its own retry policy and timeout. A single “turn” can involve hundreds of tool calls.

The Codex harness manages three conversation primitives: Items (atomic I/O units like messages or diffs), Turns (one unit of agent work from user input), and Threads (the durable container for an ongoing session, with persisted event history supporting resume, fork, and archive operations). Thread persistence – OpenAI describes threads as “durable containers” with “persisted event history” supporting reconnection – aligns directly with Temporal’s Event History.

Codex has a self-review pattern internally called the “Ralph Wiggum Loop”: the agent reviews its own changes, requests additional agent reviews, and iterates until all reviewers are satisfied. In Temporal terms, the review results arrive as signals, and the workflow decides whether to iterate or complete.

The relationship extends beyond Codex. In July 2025, OpenAI and Temporal launched a formal integration adding durable execution to the OpenAI Agents SDK. Every agent invocation runs as a Temporal Activity, and orchestration runs as a Temporal Workflow. Temporal also processes millions of ChatGPT Images generation workflows. Venkat Venkataramani (OpenAI’s VP of App Infrastructure) reinforced this at Temporal’s Series D announcement: “Durable execution is a core requirement for modern AI systems.”

Framework Integrations

Temporal integrates with existing agent frameworks so teams don’t have to rewrite their agent logic from scratch. The pattern is the same across integrations: Temporal provides the durability layer, the framework provides the agent logic.

PydanticAI + Temporal

PydanticAI has first-class Temporal support via a TemporalAgent wrapper that preserves PydanticAI’s type-safety while offloading non-deterministic model requests and tool calls to Temporal activities. The orchestration logic lives in a deterministic workflow, and all I/O-bound tasks are automatically wrapped as activities.

One significant design decision: thread-based workflows. Each conversation thread gets its own Temporal workflow that persists for the lifetime of the conversation. This is more efficient than stateless approaches because the system only processes new messages, maintaining context within workflow state rather than re-sending the entire history for every inference.

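from pydantic_ai import Agent, RunContext
from pydantic_ai.durable_exec.temporal import AgentPlugin, PydanticAIPlugin, TemporalAgent
from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker

# SupportResponse, OrderDetails, and db are application-defined (not shown)

# Define the agent with PydanticAI's type-safe interface
support_agent = Agent(
    "openai:gpt-4o",
    name="support_agent",  # a stable name is needed for durable execution
    system_prompt="You are a customer support agent.",
    output_type=SupportResponse,  # Pydantic model for type-safe output
)

@support_agent.tool
async def lookup_order(ctx: RunContext[None], order_id: str) -> OrderDetails:
    return await db.get_order(order_id)

# Wrap with Temporal for durability: model requests and tool calls become activities
temporal_agent = TemporalAgent(support_agent)

@workflow.defn
class SupportWorkflow:
    @workflow.run
    async def run(self, prompt: str) -> SupportResponse:
        result = await temporal_agent.run(prompt)
        return result.output

async def main() -> None:
    client = await Client.connect("localhost:7233", plugins=[PydanticAIPlugin()])
    async with Worker(
        client,
        task_queue="support-agents",
        workflows=[SupportWorkflow],
        plugins=[AgentPlugin(temporal_agent)],
    ):
        # Each conversation gets a durable workflow, keyed by its thread ID
        result = await client.execute_workflow(
            SupportWorkflow.run,
            "What's the status of order #12345?",
            id="customer-session-abc",
            task_queue="support-agents",
        )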

OpenAI Agents SDK + Temporal

The OpenAI Agents SDK integration centers on the activity_as_tool helper. This function automatically generates OpenAI-compatible tool schemas directly from Temporal activity signatures. The agent reasons about and invokes activities as tools, with every tool call backed by durable execution.

import { activityAsTool, OpenAIAgentsPlugin } from "@temporalio/openai-agents";

// Temporal activities become tools the agent can call
const searchTool = activityAsTool(searchDocuments, {
  startToCloseTimeout: "30s",
  retryPolicy: { maximumAttempts: 3 },
});

const writeTool = activityAsTool(writeDocument, {
  startToCloseTimeout: "60s",
  retryPolicy: { maximumAttempts: 1 },
});

// Agent orchestration runs as a Temporal Workflow
// Each tool call is a durable Activity
const plugin = new OpenAIAgentsPlugin({
  client: temporalClient,
  taskQueue: "agent-workers",
  tools: [searchTool, writeTool],
});

Developers use the OpenAIAgentsPlugin to configure the Temporal client and worker, enabling integrated tracing that provides visibility through both the Temporal UI and OpenAI dashboards.

When Temporal Adds Unnecessary Complexity

Temporal is not always the right choice. Here’s where it adds more complexity than value:

Simple agents: a single LLM call followed by one tool call doesn’t benefit from durable execution infrastructure. One comparison found that adding Temporal to a simple document indexing pipeline required “rearchitecting the app, splitting it into two services, adding a runtime dependency on a third service, and adding over 100 lines of code” where a lighter-weight approach achieved the same with 7 lines.
Prototyping and experimentation: when you’re iterating on agent architecture, the determinism constraints and operational overhead slow you down.
Sub-30-second agents: if the agent completes before infrastructure failures become likely, the cost of durable execution exceeds the benefit.
Teams without infrastructure engineering capacity: self-hosted Temporal requires operating four services plus a database. If you don’t have the team to manage this, the operational burden may outweigh the reliability gains.

Trade-offs

Temporal’s guarantees come with trade-offs that shape day-to-day development experience.

Operational Complexity

Self-hosted Temporal requires deploying four independent services plus a persistence database (PostgreSQL, MySQL, or Cassandra) and optionally Elasticsearch for advanced visibility. This is not a single process with a single run command.

Learning Curve

Engineers must internalize: workflows vs activities, determinism rules, event history mechanics, signals, queries, updates, ContinueAsNew, versioning strategies, worker configuration.

The determinism constraint confuses newcomers, especially because LLMs are inherently non-deterministic. The resolution (LLM calls go in activities, not workflows) is simple once understood, but the documentation framing perpetuates the misconception.

Event History Limits

Each workflow execution is limited to 51,200 events or 50MB. An activity generates roughly 3 events. If activities return large LLM payloads (500KB+), the 50MB limit becomes binding well before the event count limit. The mitigation – ContinueAsNew, which atomically starts a fresh execution with carried-over state – works but adds architectural complexity. Teams building agents with many LLM calls must implement payload offloading (store large payloads in S3, pass references) and proactively manage history growth.

Latency

Temporal Cloud’s minimum end-to-end latency is roughly 100ms per workflow step, with a single activity round-trip taking approximately 220ms. Local Activities save ~50ms per call but sacrifice heartbeating and independent retry capabilities. For agents where sub-second interactivity matters (chatbot-like interactions), this overhead accumulates across many steps. Agents with 50+ steps per interaction may see 5–10 seconds of pure infrastructure overhead.
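The arithmetic behind that estimate is simple enough to write down. This sketch uses the per-step figures quoted above and assumes every step pays the full cost sequentially; the function name is ours.

```python
def infrastructure_overhead_seconds(steps: int, per_step_ms: float) -> float:
    """Total orchestration overhead if every step pays per_step_ms in sequence."""
    return steps * per_step_ms / 1000


print(infrastructure_overhead_seconds(50, 100))       # 5.0  (minimum per-step latency)
print(infrastructure_overhead_seconds(50, 220))       # 11.0 (full activity round-trip)
print(infrastructure_overhead_seconds(50, 220 - 50))  # 8.5  (Local Activities, ~50ms saved per call)
```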

Versioning

Code changes to workflow logic can cause non-determinism errors during replay of running workflows. If a running workflow was started with version 1 of the code and a worker running version 2 picks it up, the replay may produce different activity commands, causing a non-determinism exception. Temporal provides patching APIs and worker versioning, but patches accumulate in code and “need to be removed with extreme care.” Airbyte documented struggles with non-determinism exceptions, ultimately deciding to fail affected workflows rather than attempting recovery. Safe deployment requires replay testing against production event histories in CI.
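The failure mode can be pictured as a comparison between the commands recorded in history and the commands the new code would issue. This is a toy model with hypothetical names (first_divergence, workflow_v1, workflow_v2). The real check happens inside the SDK during replay, but the shape of the problem is the same.

```python
from typing import Optional


def first_divergence(recorded: list[str], replayed: list[str]) -> Optional[int]:
    """Index of the first command that differs from recorded history, or None."""
    for i, old in enumerate(recorded):
        if i >= len(replayed) or replayed[i] != old:
            return i
    return None


def workflow_v1() -> list[str]:
    return ["call_llm", "execute_tool", "call_llm"]


def workflow_v2() -> list[str]:
    # The new version schedules an extra activity before the tool call.
    return ["call_llm", "validate_plan", "execute_tool", "call_llm"]


# History recorded by a worker running v1, two activities in.
recorded_history = ["call_llm", "execute_tool"]

print(first_divergence(recorded_history, workflow_v1()))  # None
print(first_divergence(recorded_history, workflow_v2()))  # 1
```

A replay test in CI is the same comparison run against real histories: if any in-flight history diverges under the new code, the deploy needs a patch or a versioned worker.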

Trade-off	Impact	Mitigation
Operational complexity	4+ services to manage, or cloud costs	Temporal Cloud; start with dev server locally
Learning curve	2–3 weeks for team onboarding	Start with simple workflows, add primitives incrementally
Event history limits	51,200 events / 50MB cap per execution	ContinueAsNew + payload offloading to S3
Latency overhead	~100ms/step, ~220ms/activity round-trip	Local Activities for latency-sensitive paths
Versioning complexity	Non-determinism errors on code changes	Replay testing in CI, worker versioning

Closing Thoughts

We covered a lot of ground here: the workflow/activity split, deterministic replay, server architecture, coordination primitives, retry strategies, and how OpenAI’s Codex team puts it all together.

The core design insight is the separation of deterministic orchestration from non-deterministic execution. Once you accept that split, replay-based recovery falls out as a consequence – and with it, most of the infrastructure problems we listed at the top of this post.

OpenAI, Replit, Block, NVIDIA, and others have independently converged on durable execution for their agent workloads. Temporal’s recent $300M Series D at a $5B valuation, with 380%+ year-over-year revenue growth driven substantially by AI workloads, suggests this is a real pattern. The company joined the Agentic AI Foundation (under the Linux Foundation) alongside Anthropic, OpenAI, and Block.

For most teams, the practical path is: prototype with something lighter (LangGraph, CrewAI), validate the agent architecture, and migrate when the agents run long enough and matter enough that you can’t afford to lose state on a crash. The operational investment is real, but so is the cost of rebuilding reliability from scratch.

References

Temporal Documentation. Core Concepts – Workflows, Activities, Workers. Temporal Technologies.
Temporal. Temporal for AI. Overview of Temporal’s AI-specific capabilities and customer stories.
Wang, W. (2025). Codex and Temporal Integration. Will Wang’s public statements on Codex’s use of Temporal for core control flows.
OpenAI. Harness Engineering: Leveraging Codex in an Agent-First World. OpenAI engineering blog on the Codex harness architecture.
Temporal. Build Durable AI Agents with Pydantic AI and Temporal. PydanticAI integration guide.
Temporal. Of Course You Can Build Dynamic AI Agents with Temporal. Temporal’s architecture for dynamic AI agent loops.
Quo (formerly OpenPhone). How We Built a Real-Time AI Voice Agent with Temporal. Production case study on Temporal primitives for voice agents.
Temporal. Production-Ready Agents with the OpenAI Agents SDK + Temporal. OpenAI Agents SDK integration announcement.
Temporal. AI Cookbook – OpenAI Agents SDK. Code examples and patterns for the OpenAI integration.
PydanticAI Documentation. Temporal Durable Execution. Official PydanticAI guide for Temporal integration.
Vanlightly, J. Explanations of deterministic replay mechanics and the determinism contract in Temporal workflows. Referenced via Temporal community resources.
Wang, X., et al. (2025). The OpenHands Software Agent SDK. arXiv preprint arXiv:2511.03690. The predecessor post’s primary reference for event sourcing comparison.

From RLHF to GRPO: The RL Techniques That Align Language Models

Tue, 17 Feb 2026 10:00:00 +0800

The Gap Between Prediction and Usefulness

If you have been following this blog, you know how LLMs generate text: attention mechanisms, KV caches, speculative decoding, quantized weights moving through GPU memory hierarchies. We have spent considerable time understanding what happens after the model exists. This post asks a different question: how did the model learn to be useful in the first place?

A pretrained language model is a remarkable thing. It can complete sentences, mimic writing styles, and recite facts absorbed from trillions of tokens of internet text. But it cannot follow instructions. Ask it to summarize an article and it might continue the article instead. Give it a harmful request and it will cheerfully comply rather than refuse. The gap between “can predict the next token” and “can be a helpful assistant” is enormous, and closing it is the job of reinforcement learning from human feedback (RLHF) and its descendants.

This post covers three techniques that bridge this gap: PPO (Proximal Policy Optimization), the original workhorse that proved RL could align language models; DPO (Direct Preference Optimization), an elegant reformulation that eliminates the reward model entirely; and GRPO (Group Relative Policy Optimization), the technique behind DeepSeek-R1’s reasoning capabilities. Each optimizes the same underlying objective (maximize reward while staying close to a reference policy) but they make fundamentally different engineering trade-offs.

We will not cover pretraining, nor will we survey every RLHF variant (KTO, SimPO, ORPO, and others exist but are beyond our scope). Instead, we will go deep on these three methods: the math, the intuition, the practical trade-offs, and the reasons each one was invented.

Where RL Fits: The Model Training Lifecycle

Before we touch any equations, let’s establish context. Training a modern LLM that can follow instructions involves three distinct phases, each with a different objective.

Phase	Training data	What the model learns	Scale
Phase 1: Pretraining	Internet text (books, code, web)	Language	Trillions of tokens
Phase 2: Supervised Fine-Tuning	Human-written demonstrations	Format	Thousands of examples
Phase 3: RL Alignment	Preferences & rewards	Judgment	Iterative optimization

Phase 1: Pretraining. The model learns language by predicting the next token on a massive corpus: books, articles, code, web pages. This produces a powerful text completion engine that knows facts, grammar, and reasoning patterns but has no concept of a “conversation” or “helpfulness.” This phase consumes the vast majority of compute (months on thousands of GPUs) and produces what we call the base model.

Phase 2: Supervised Fine-Tuning (SFT). Humans write demonstration data: pairs of (instruction, ideal response). The model is trained to reproduce these demonstrations using standard cross-entropy loss:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t} \log \pi_\theta(y_t \mid x, y_{\lt t})$$

This is the same next-token prediction objective from pretraining, just applied to curated instruction-response pairs. The model learns the format of helpful responses: how to structure answers, when to use code blocks, how to handle multi-turn conversations. SFT typically requires only thousands of examples and a few hours of training.
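As a tiny numeric illustration of that loss, the sketch below takes the probabilities a model assigned to each target token and sums their negative logs. The probabilities are made up, and a real implementation would work on logits in batches.

```python
import math


def sft_loss(token_probs: list[float]) -> float:
    """Negative log-likelihood of the target tokens: -sum_t log pi(y_t | x, y_<t)."""
    return -sum(math.log(p) for p in token_probs)


# Probabilities the model assigned to each token of the demonstration response
confident = sft_loss([0.9, 0.8, 0.95])
unsure = sft_loss([0.3, 0.2, 0.5])
print(confident < unsure)  # True
```

The loss is low when the model already assigns high probability to the demonstrated tokens, and it reaches zero only when every target token gets probability 1.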

But SFT has a fundamental limitation: it teaches the model to mimic demonstrations without understanding why one response is better than another. The model learns that a particular answer to “explain quantum computing” was in the training data, but it cannot distinguish between a clear explanation and a subtly misleading one. It learns format, not judgment.

Phase 3: RL-based Alignment. This is where the techniques in this post come in. Instead of showing the model what to produce, we teach it what better means. The model generates its own responses, receives feedback on quality, and updates its parameters to produce higher-quality outputs. This is reinforcement learning: the model (agent) generates text (actions), receives scores (rewards), and improves its generation strategy (policy).

The SFT model serves double duty here: it initializes the policy we will optimize, and it becomes the frozen reference policy $\pi_{\text{ref}}$ that prevents the RL-trained model from drifting too far from coherent language. This reference is crucial. Without it, the model can find degenerate ways to maximize reward that produce nonsensical text.

Figure: The Model Training Lifecycle, with three alignment paths out of the SFT model: Path A (PPO), Path B (DPO), and Path C (GRPO).

The three methods we will examine differ in how they implement Phase 3. PPO trains a separate reward model, then runs RL against it. DPO skips the reward model by extracting the reward signal directly from preference data. GRPO replaces learned value estimates with group statistics and pairs naturally with verifiable rewards. Let’s start with the foundation they all share: the reward signal.

The Reward Signal: Bradley-Terry and What Makes It Work

The fundamental challenge of aligning language models is this: we cannot write a reward function for “helpfulness.” Unlike game-playing AI where the score is clearly defined, the quality of a text response is subjective, contextual, and multidimensional. But humans can do something simpler: given two responses to the same prompt, they can usually say which one is better.

This observation is the foundation of RLHF. Collect pairwise comparisons, then train a model to predict which response humans prefer. The mathematical framework for this is the Bradley-Terry model, introduced in 1952 for analyzing paired comparisons:

$$P(y_w \succ y_l \mid x) = \sigma\!\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

Here $y_w$ is the preferred response, $y_l$ is the dispreferred response, $r_\theta$ is the reward model, and $\sigma$ is the logistic sigmoid.

The elegant property: only the difference in rewards matters. Adding a constant to all reward scores leaves preferences unchanged. This means the reward model only needs to learn a relative ranking, not absolute quality scores.

We train this reward model by maximizing the log-likelihood of observed human preferences:

$$\mathcal{L}_{\text{RM}} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

This is binary cross-entropy: we are training a classifier that says “response A is better than response B.” Architecturally, the reward model is typically the same transformer as the language model, with the language modeling head replaced by a single linear layer that maps the final hidden state to a scalar reward.
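The preference probability and the loss fit in a few lines. This is a sketch on plain scalars, with made-up reward values standing in for the output of a real reward model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_prob(r_w, r_l):
    """Bradley-Terry probability that the response scored r_w beats r_l."""
    return sigmoid(r_w - r_l)

def reward_model_loss(r_w, r_l):
    """Negative log-likelihood of one human preference (y_w over y_l)."""
    return -math.log(preference_prob(r_w, r_l))

# Hypothetical scalar rewards for a chosen and a rejected response.
r_chosen, r_rejected = 1.2, 0.4
print(preference_prob(r_chosen, r_rejected))
# Shifting both rewards by the same constant leaves the preference unchanged.
print(preference_prob(r_chosen + 5.0, r_rejected + 5.0))
print(reward_model_loss(r_chosen, r_rejected))
```

The first two printed values agree up to floating-point rounding, which is the shift invariance described above: only the reward difference enters the model.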

In practice, InstructGPT used a 6B parameter reward model to guide a 175B policy, trained on approximately 33,000 prompts with 4-9 ranked completions each. The reward model is trained for only a single epoch to avoid overfitting to the preference data, a detail that matters more than it might seem.

With a trained reward model in hand, we can now define what “better” means mathematically. The question becomes: how do we actually optimize the language model to produce higher-reward outputs?

PPO: The Four-Model Pipeline

The RLHF Objective

The goal of RLHF is captured in a single objective. Let’s walk through it symbol by symbol:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[r_\phi(x, y)\big] - \beta \cdot D_{\text{KL}}\big(\pi_\theta \| \pi_{\text{ref}}\big)$$

Reading left to right: $\max_{\pi_\theta}$ means “find the policy parameters that maximize the following expression.” $\mathbb{E}$ is the expected value, averaging over many prompts and responses. $x \sim \mathcal{D}$ means prompts are drawn from the training distribution. $y \sim \pi_\theta(\cdot \mid x)$ means responses are sampled from the current policy, not taken from a fixed dataset. $r_\phi(x, y)$ is the reward model’s score. $\beta$ is a coefficient controlling constraint strength. $D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ is the KL divergence measuring how far the policy has drifted from the reference.

This objective contains two opposing forces. The first term pushes toward human-preferred outputs: maximize the expected reward. The second term, the KL divergence penalty, is the guardrail that prevents reward hacking.
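To see the two forces pull against each other, here is a toy sketch that evaluates the objective exactly for a single prompt with only three possible responses. The distributions, rewards, and the value of $\beta$ are all invented for illustration:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(policy, reference, rewards, beta):
    """Expected reward minus beta * KL(policy || reference), one prompt."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

# Hypothetical toy setup: three candidate responses to a single prompt.
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 3.0]

unchanged = reference                 # no reward gain, no KL cost
moderate = [0.2, 0.3, 0.5]            # shifts mass toward high reward
collapsed = [0.001, 0.001, 0.998]     # nearly all mass on the top response

for name, pol in [("unchanged", unchanged), ("moderate", moderate),
                  ("collapsed", collapsed)]:
    print(name, round(rlhf_objective(pol, reference, rewards, beta=2.0), 3))
```

With this choice of $\beta$, the moderate shift scores highest: the collapsed policy earns the most reward but pays more than that in KL penalty. Setting `beta=0.0` removes the guardrail and the collapsed policy wins.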

Reward hacking is not a theoretical concern. Without the KL constraint, models learn to game the reward model: they produce longer responses (reward models often prefer length), use confident language and bullet-point formatting (which correlates with higher human ratings), and can even produce convincing fabrications that fool the reward evaluator. Wen et al. (2024) showed that RLHF without proper regularization increases human approval ratings while simultaneously decreasing actual correctness. The KL penalty keeps the optimized policy close enough to the reference that these degenerate strategies remain unlikely.

The PPO Clipped Surrogate

The RLHF objective tells us what to optimize. PPO tells us how. The challenge is that policy gradient methods are notoriously unstable. A single large update can destroy the policy, and recovery is difficult. PPO solves this with a clipping mechanism that limits how much any single update can change the policy.

First, we define the probability ratio, how much the policy’s opinion of a particular token has changed:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

When $r_t = 1.0$, the policy hasn’t changed its probability for this token. When $r_t = 1.5$, the token is 50% more likely under the new policy. When $r_t = 0.6$, it is 40% less likely. The ratio tells us the direction and magnitude of the policy shift.

The PPO clipped surrogate objective is:

$$\mathcal{L}^{\text{CLIP}} = \mathbb{E}_t \left[\min\Big(r_t(\theta) \cdot \hat{A}_t,\; \text{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big) \cdot \hat{A}_t\Big)\right]$$

Here $\hat{A}_t$ is the advantage estimate: how much better (positive) or worse (negative) this token was compared to the expected baseline. Concretely, a critic network $V_\psi(s)$ estimates the expected future reward from each state; the advantage is the difference between the actual return and this estimate. If the model produced a token that led to higher reward than expected, $\hat{A}_t > 0$ (“good token, do more of this”); if the reward was lower than expected, $\hat{A}_t < 0$ (“bad token, do less of this”). The advantage is what tells PPO which direction to push; the clipping mechanism controls how far.
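The "actual return minus critic estimate" definition can be sketched as follows. The reward placement (a single score at the final token) and the critic values are assumptions made for the example; production PPO implementations commonly use a smoothed variant, Generalized Advantage Estimation, in place of this plain difference.

```python
def returns_to_go(rewards, gamma=1.0):
    """Discounted sum of future rewards from each position."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def advantages(rewards, values, gamma=1.0):
    """A_t = (actual return from t) - (critic's estimate V(s_t))."""
    return [g - v for g, v in zip(returns_to_go(rewards, gamma), values)]

# Hypothetical 4-token response: reward arrives only at the final token.
rewards = [0.0, 0.0, 0.0, 1.0]
# Hypothetical critic estimates of expected future reward at each token.
values = [0.5, 0.7, 0.4, 0.9]

print(advantages(rewards, values))
```

Every token here sees the same return of 1, so the advantage is largest where the critic was most pessimistic.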

The clip function is a simple three-case clamp: if $r_t < 1-\varepsilon$, return $1-\varepsilon$; if $r_t > 1+\varepsilon$, return $1+\varepsilon$; otherwise return $r_t$ unchanged. With the standard $\varepsilon = 0.2$, the ratio is constrained to the range $[0.8, 1.2]$.

The behavior of this objective follows a 2×2 matrix that is worth internalizing:

	Advantage $\hat{A}_t > 0$ (good token)	Advantage $\hat{A}_t < 0$ (bad token)
Policy increases probability ($r_t > 1$)	Clip activates at $1+\varepsilon$. Caps how aggressively we reinforce.	No clipping. Full gradient to suppress this token.
Policy decreases probability ($r_t < 1$)	No clipping. Full gradient to reinforce this token.	Clip activates at $1-\varepsilon$. Caps how aggressively we suppress.

The pattern reveals something important: clipping only constrains the policy when it is already moving in the right direction too aggressively. When the policy increases probability for a good token (top-left), clipping says “that’s enough reinforcement for one update.” When it decreases probability for a bad token (bottom-right), clipping says “that’s enough suppression.” But when the policy is moving in the wrong direction — decreasing a good token or increasing a bad one — the full gradient signal flows through. Clipping never protects wrong moves.

Why $\min$ and not $\max$? The $\min$ operator takes the pessimistic bound. If the clipped version yields a lower objective than the unclipped version, we take the clipped (lower) one, preventing overconfident updates. If the unclipped version is already lower (meaning the policy moved in a harmful direction), we take that instead, allowing the full corrective gradient.

Let’s trace through concrete numbers. With $\hat{A}_t = +2.0$ (a good token) and $\varepsilon = 0.2$:

At $r_t = 1.1$: Unclipped = $1.1 \times 2.0 = 2.2$. Clipped = $1.1 \times 2.0 = 2.2$ (within bounds). $\min = 2.2$. Full gradient.
At $r_t = 1.3$: Unclipped = $1.3 \times 2.0 = 2.6$. Clipped = $1.2 \times 2.0 = 2.4$ (capped at $1+\varepsilon$). $\min = 2.4$. Gradient is reduced.
At $r_t = 2.0$: Unclipped = $2.0 \times 2.0 = 4.0$. Clipped = $1.2 \times 2.0 = 2.4$. $\min = 2.4$. The objective plateaus. No matter how much more likely this token becomes, the gradient contribution is capped.

This plateau is the key mechanism. The objective becomes flat beyond the clip boundary, which means the gradient is zero, so the optimizer receives no signal to push the ratio further. With $\varepsilon = 0.2$, a single update gains nothing from moving a token’s probability ratio outside $[0.8, 1.2]$, which keeps each policy update small and training stable.
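The per-token objective is small enough to write out directly. The sketch below uses plain floats, not tensors, and the helper `clip_is_active` is a hypothetical name introduced here to flag the flat region:

```python
def clip(x, lo, hi):
    return max(lo, min(x, hi))

def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-token clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

def clip_is_active(ratio, advantage, eps=0.2):
    """True when the objective is flat in `ratio` (zero gradient)."""
    if advantage > 0:
        return ratio > 1.0 + eps
    if advantage < 0:
        return ratio < 1.0 - eps
    return False

for r in (1.1, 1.3, 2.0):
    print(r, round(ppo_clip_term(r, 2.0), 3), clip_is_active(r, 2.0))

# A bad token (A < 0) whose probability went up is never clipped.
print(round(ppo_clip_term(1.5, -2.0), 3), clip_is_active(1.5, -2.0))
```

The loop reproduces the three traced values (2.2, 2.4, 2.4), and the last line shows the wrong-direction case from the 2×2 matrix: the unclipped value of -3.0 is kept, so the full corrective gradient flows.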

Figure (interactive): PPO Clipped Surrogate Objective. How epsilon-clipping constrains policy updates within a trust region. The plot shows the unclipped term $r_t \cdot \hat{A}_t$, the clipped objective $\mathcal{L}^{\text{CLIP}}$, and the trust region $[1-\varepsilon, 1+\varepsilon]$, with controls for $\hat{A}_t$ (default +2.00) and $\varepsilon$ (default 0.20).

An important subtlety for LLM training: clipping operates per-token, not globally. In a 512-token response, some tokens might have $r_t$ well within bounds (contributing full gradients) while others hit the clip boundary (contributing zero gradient). The overall update is a blend of these per-token signals, which helps keep training stable, although PPO as a whole still needs careful hyperparameter tuning.

One notable exception: DeepSeek-R1 uses $\varepsilon = 10$, which effectively disables clipping. Their group-normalized advantages (which we will see in the GRPO section) are already well-scaled, reducing the need for a tight trust region.

The Four-Model Problem

Running PPO for LLM alignment requires four models simultaneously in GPU memory:

The policy $\pi_\theta$ — the model being trained. Requires gradients and optimizer states (2-3× the weight memory).
The reference policy $\pi_{\text{ref}}$ — a frozen copy of the SFT model. Only forward passes, but still occupies full weight memory.
The reward model $r_\phi$ — scores generated responses. Frozen during PPO, forward passes only.
The value/critic network $V_\psi$ — estimates expected future reward to compute advantages $\hat{A}_t$. Requires gradients and optimizer states.

For a 7B parameter model in fp16, weights alone consume approximately 14GB per model, roughly 56GB across all four, before accounting for optimizer states (Adam stores two additional copies of the policy’s and critic’s parameters). With a batch of generated sequences in memory, the total easily exceeds 100GB for a single 7B model. Running PPO on a 70B model requires multi-node setups that only frontier labs can afford.
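A back-of-the-envelope version of this arithmetic, as a sketch. It assumes 2 bytes per parameter for fp16, decimal gigabytes, and (a simplification) that Adam’s two moment buffers are stored at the same precision as the weights; setups that keep optimizer state in fp32 would need more.

```python
BYTES_PER_GB = 1e9

def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone (fp16 -> 2 bytes per parameter)."""
    return n_params * bytes_per_param / BYTES_PER_GB

n_params = 7e9
per_model = weight_memory_gb(n_params)   # one 7B model in fp16
all_weights = 4 * per_model              # policy, reference, reward, critic

# Assumption for this sketch: Adam keeps two extra buffers per trained
# parameter, at the same precision as the weights, and only the policy
# and the critic are trained.
adam_states = 2 * 2 * per_model

print(f"weights per model : {per_model:.0f} GB")
print(f"weights, 4 models : {all_weights:.0f} GB")
print(f"Adam states       : {adam_states:.0f} GB")
print(f"subtotal          : {all_weights + adam_states:.0f} GB")
```

Even before gradients, activations, and generated sequences are counted, the subtotal is already past the 100GB mark.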

Beyond memory, PPO faces two systemic challenges. Distribution shift: as the policy improves, the reward model’s training data (collected from an earlier, weaker policy) becomes stale. The proxy reward keeps climbing while true human preference plateaus or declines. Gao et al. (2022) formalized this as “reward model overoptimization.” Hyperparameter sensitivity: learning rates, KL coefficients, clipping parameters, and even Adam’s epsilon require careful tuning. Huang et al. (2023) found that reward scores and loss values are poor indicators of training health; practitioners should monitor KL divergence, response length distributions, and perplexity instead.

Despite all this complexity, PPO produced the first convincing result: InstructGPT showed that a 1.3B parameter model trained with RLHF was preferred by human evaluators over the 175B parameter base GPT-3. A 130× smaller model, made more useful through alignment. The engineering was expensive, but the result was undeniable.

The Question That Sparked DPO

PPO demonstrated that RL could align language models with human preferences. But the engineering complexity was severe: four models, meticulous hyperparameter tuning, and infrastructure that only a handful of organizations could afford. Researchers began asking: could we achieve similar results without the reward model entirely?

The mathematical observation that makes this possible: the RLHF objective has a closed-form optimal policy. If we can express the reward in terms of the policy itself, we can optimize directly on preference data without a reward model, RL loop, or critic network. This insight leads to DPO.

DPO: Your Language Model Is Secretly a Reward Model

The Reparameterization That Changes Everything

Let’s start from the same KL-constrained RLHF objective we defined for PPO. Using variational calculus (or, more practically, by expanding the KL divergence and completing the algebra), we can derive the optimal policy in closed form:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \cdot \pi_{\text{ref}}(y \mid x) \cdot \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \cdot \exp(r(x, y) / \beta)$ is the partition function that ensures the distribution sums to 1.

The intuition here is direct: the optimal policy is the reference distribution “warped” by an exponential reward function. Responses with high reward get boosted in probability; responses with low reward get suppressed. The parameter $\beta$ controls how aggressive this warping is. When $\beta \to 0$, the policy collapses toward pure reward maximization; only the highest-reward response gets any probability mass. When $\beta \to \infty$, the exponential flattens and the policy stays frozen at the reference.
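The warping is easy to see on a toy problem where the response set is small enough to enumerate. In this sketch the reference distribution and rewards are invented, and `optimal_policy` is a hypothetical helper name:

```python
import math

def optimal_policy(reference, rewards, beta):
    """pi*(y|x) = pi_ref(y|x) * exp(r(x,y)/beta) / Z(x) over a finite set."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta)
               for p, r in zip(reference, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

reference = [0.5, 0.3, 0.2]   # hypothetical pi_ref over three responses
rewards = [0.0, 1.0, 3.0]     # hypothetical rewards

for beta in (0.05, 1.0, 100.0):
    probs = optimal_policy(reference, rewards, beta)
    print(beta, [round(p, 3) for p in probs])
```

At the smallest $\beta$ nearly all probability lands on the highest-reward response, and at the largest $\beta$ the result is almost indistinguishable from the reference, matching the two limits described above.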

We can rearrange this to express the reward in terms of the policy:

$$r(x, y) = \beta \cdot \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \cdot \log Z(x)$$

This says something remarkable: the reward is fully determined by the log-ratio of optimal policy to reference policy, plus a prompt-dependent constant $\beta \log Z(x)$. The reward is hiding inside the policy all along.

But $Z(x)$ is intractable. It requires summing over all possible responses to prompt $x$, every possible sequence of tokens the model could produce. For a vocabulary of 50,000 tokens and responses of even modest length, this is an astronomically large set. PPO avoids computing $Z(x)$ by using iterative approximate optimization. DPO avoids it through algebraic cancellation.

From RL to Classification in One Substitution

Here is where DPO’s elegance emerges. We substitute the implicit reward expression into the Bradley-Terry preference model. For a preferred response $y_w$ and dispreferred response $y_l$ given the same prompt $x$:

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log Z(x)$$

The $\beta \log Z(x)$ terms cancel exactly. This is the critical step. $Z(x)$ depends only on the prompt, not the response, so it appears identically in both terms and drops out of the difference. The intractable partition function vanishes.
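The cancellation can be checked numerically on a toy problem. The sketch below builds the optimal policy for three hypothetical responses (where $Z(x)$ is computable only because the set is tiny) and confirms that reward differences are recovered from log-ratios alone:

```python
import math

beta = 0.5
reference = [0.5, 0.3, 0.2]   # hypothetical pi_ref over three responses
rewards = [0.0, 1.0, 3.0]     # hypothetical "true" rewards

# Closed-form optimal policy; Z is computable here only because the toy
# response set has three elements.
weights = [p * math.exp(r / beta) for p, r in zip(reference, rewards)]
Z = sum(weights)
optimal = [w / Z for w in weights]

def implicit_reward(i):
    """beta * log(pi*(y_i) / pi_ref(y_i)): the reward without the Z term."""
    return beta * math.log(optimal[i] / reference[i])

# Each implicit reward differs from the true reward by the same beta*log Z,
offsets = [rewards[i] - implicit_reward(i) for i in range(3)]
print(offsets, beta * math.log(Z))

# so differences between responses match, with no Z required.
w, l = 2, 0
print(rewards[w] - rewards[l], implicit_reward(w) - implicit_reward(l))
```

All three offsets equal $\beta \log Z(x)$ up to rounding, which is why the term drops out of any pairwise comparison.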

Substituting into the Bradley-Terry model and replacing the theoretical optimal policy $\pi^*$ with our trainable policy $\pi_\theta$, we get the DPO loss:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

This is a binary cross-entropy loss. The “logit” is the difference in implicit rewards between the preferred and dispreferred responses. Each implicit reward $\beta \log(\pi_\theta / \pi_{\text{ref}})$ measures how much the current policy has shifted its probability relative to the reference, a direct proxy for how much the policy “values” that response.

During training, this loss simultaneously increases the relative probability of preferred completions and decreases the relative probability of dispreferred ones. The $\beta$ parameter controls how sharply: low $\beta$ (e.g., 0.1) allows aggressive optimization away from the reference, while high $\beta$ (e.g., 0.5) keeps updates conservative. No explicit KL penalty is needed because the reference policy appears directly in the loss function. Deviating too far automatically reduces the gradient signal through the sigmoid saturation.
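In code, the loss for one preference triple needs only four sequence log-probabilities. This is a plain-Python sketch, not the TRL implementation; the per-token log-probs are made up, and a real pipeline would compute them with the policy and reference models and average the loss over a batch:

```python
import math

def sequence_logprob(token_logprobs):
    """log pi(y | x) is the sum of the per-token log-probabilities."""
    return sum(token_logprobs)

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple.

    Arguments are sequence log-probabilities: log pi_theta(y_w|x),
    log pi_theta(y_l|x), log pi_ref(y_w|x), log pi_ref(y_l|x).
    """
    logit = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    # -log(sigmoid(z)) == log(1 + exp(-z))
    return math.log1p(math.exp(-logit))

# Hypothetical per-token log-probs for 3-token chosen and rejected responses.
ref_w = sequence_logprob([-1.0, -0.5, -0.7])
ref_l = sequence_logprob([-0.9, -0.6, -0.8])
pol_w = sequence_logprob([-0.8, -0.4, -0.6])   # policy likes y_w more than ref
pol_l = sequence_logprob([-1.2, -0.9, -1.0])   # and y_l less than ref

print(dpo_loss(ref_w, ref_l, ref_w, ref_l))    # policy == reference
print(dpo_loss(pol_w, pol_l, ref_w, ref_l))    # policy has moved the right way
```

Working in log space avoids multiplying many small token probabilities together, which is how sequence likelihoods are handled in practice.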

A Worked Numerical Example

Let’s make this concrete. Consider a prompt $x$ = “What is the capital of France?” with two responses:

$y_w$ (preferred): “The capital of France is Paris.”
$y_l$ (dispreferred): “France’s capital is Berlin, a beautiful city.”

Suppose the reference policy assigns $\pi_{\text{ref}}(y_w \mid x) = 0.15$ and $\pi_{\text{ref}}(y_l \mid x) = 0.12$. The trainable policy $\pi_\theta$ starts as a copy of the reference, so initially $\pi_\theta = \pi_{\text{ref}}$. Let $\beta = 0.1$.

At initialization:

The implicit rewards are both zero:

$$\hat{r}(x, y_w) = 0.1 \cdot \log \frac{0.15}{0.15} = 0.1 \cdot \log 1 = 0$$

$$\hat{r}(x, y_l) = 0.1 \cdot \log \frac{0.12}{0.12} = 0$$

The reward difference is $0 - 0 = 0$. The loss is $-\log \sigma(0) = -\log(0.5) = \log 2 \approx 0.693$. The model has no preference, exactly what we’d expect before any training.

After one gradient step:

The gradient pushes $\pi_\theta(y_w \mid x)$ up to $0.20$ and $\pi_\theta(y_l \mid x)$ down to $0.08$:

$$\hat{r}(x, y_w) = 0.1 \cdot \log \frac{0.20}{0.15} = 0.1 \cdot 0.288 = 0.029$$

$$\hat{r}(x, y_l) = 0.1 \cdot \log \frac{0.08}{0.12} = 0.1 \cdot (-0.405) = -0.041$$

The reward difference is $0.029 - (-0.041) = 0.069$. The loss drops to $-\log \sigma(0.069) \approx 0.659$.

The model is learning without ever computing an explicit reward. The reward signal emerges from the probability shift relative to the reference. And notice the structural KL constraint at work: as $\pi_\theta$ pushes probabilities further from $\pi_{\text{ref}}$, the log-ratio grows, which eventually saturates the sigmoid and produces diminishing gradient signal. The policy naturally resists extreme deviations.
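The two steps of this example can be recomputed with a short script. The probabilities and $\beta$ are the ones used above; the function names are hypothetical:

```python
import math

def implicit_reward(p_policy, p_ref, beta):
    return beta * math.log(p_policy / p_ref)

def dpo_step(p_w, p_l, ref_w=0.15, ref_l=0.12, beta=0.1):
    r_w = implicit_reward(p_w, ref_w, beta)
    r_l = implicit_reward(p_l, ref_l, beta)
    margin = r_w - r_l
    loss = math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))
    return r_w, r_l, margin, loss

for label, p_w, p_l in [("init", 0.15, 0.12), ("after step", 0.20, 0.08)]:
    r_w, r_l, margin, loss = dpo_step(p_w, p_l)
    print(f"{label:>10}: r_w={r_w:+.3f} r_l={r_l:+.3f} "
          f"margin={margin:.3f} loss={loss:.3f}")
```

Trying larger shifts (for example `dpo_step(0.60, 0.01)`) shows the loss continuing to fall but by progressively smaller amounts as the sigmoid saturates.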

The insight, captured in the DPO paper’s subtitle “Your Language Model is Secretly a Reward Model”: the reward model was never eliminated; it was absorbed into the policy itself.

Figure (interactive): DPO: Learning Without a Reward Model. Step through training (steps 0 to 4, with $\beta = 0.10$) to watch the policy probabilities, implicit rewards, reward difference, and loss evolve for the worked example above.

Where DPO Shines and Where It Falls Short

DPO’s practical advantages are substantial. Training requires only two models (policy and reference), not four. The implementation is roughly 20 lines of PyTorch on top of a standard language modeling pipeline. HuggingFace’s TRL library provides a DPOTrainer that handles the details. Major models adopted it quickly: Llama 3, Zephyr-beta, and Tulu 2 all used DPO in their alignment pipelines. DPO democratized alignment research. Any lab with a GPU and preference data could train an aligned model.

But DPO has limitations that become apparent at scale. The most fundamental is its offline nature: DPO trains on a fixed dataset of preference pairs, with no mechanism for the model to explore and discover new behaviors. As training progresses, the policy drifts from the distribution that generated the training data, but the training data cannot adapt. This is particularly problematic for tasks where the model needs to discover novel reasoning strategies.

Xu et al. (ICML 2024) conducted a systematic comparison and found that PPO consistently surpasses DPO across all tested benchmarks when properly tuned, especially on challenging code generation tasks (on CodeContest, PPO-34B achieved 22.4% while DPO-34B scored significantly lower). The gap widens on tasks that require exploration and long-horizon reasoning.

There is also a subtler issue: DPO assumes the Bradley-Terry preference model perfectly fits the data. Real human preferences can be intransitive (A > B, B > C, but C > A), context-dependent, and noisy. When these assumptions break down, DPO’s loss function can produce misleading gradients.

DPO traded RL’s complexity for supervised learning’s simplicity. The next technique we’ll examine takes a different path: keep the online RL loop, but find a cheaper way to run it.

GRPO: Grading Responses on a Curve

The Insight: Eliminate the Critic, Keep the RL Loop

DPO eliminated RL entirely but lost online exploration. GRPO takes a different approach: retain the online RL loop (the model generates responses, gets feedback, and updates) but eliminate the critic network, which is the most expensive component of PPO after the reward model.

Recall that PPO needs a critic $V_\psi(s)$ to compute advantages $\hat{A}_t$, estimating how much better each token was compared to baseline expectations. This critic is a full-sized neural network with its own gradient computation and optimizer states. GRPO’s key observation: instead of learning this baseline, we can estimate it empirically by sampling multiple responses to the same prompt and comparing them to each other.

The Mechanism: Sample, Score, Normalize

For each prompt $q$, GRPO samples $G$ completions $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_\theta$. Each completion is scored by a reward function, producing rewards $\{r_1, r_2, \ldots, r_G\}$. The advantage for each completion is computed via z-score normalization:

$$\hat{A}_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)}$$

This is “grading on a curve.” Instead of evaluating each response against an absolute rubric (the learned critic), we evaluate it relative to its peers. A response that scores 0.8 when all other responses also score around 0.8 gets a near-zero advantage because it was average for this prompt. The same score of 0.8 when peers score around 0.3 earns a strongly positive advantage because it was exceptional.
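A minimal sketch of the normalization, using the population standard deviation and a small `eps` in the denominator. Both choices are assumptions of this example, not something the formula specifies:

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Z-score each reward against the other samples for the same prompt.

    `eps` guards against division by zero when every reward is identical;
    it is an assumption of this sketch.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Same score of 0.8, two different peer groups (hypothetical rewards).
average_case = [0.8, 0.79, 0.81, 0.8]
standout_case = [0.8, 0.3, 0.25, 0.35]

print(round(group_advantages(average_case)[0], 2))
print(round(group_advantages(standout_case)[0], 2))
```

The first response in each group has the same raw reward, yet its advantage is essentially zero among similar peers and strongly positive among weaker ones.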

The group mean serves as an empirical Monte Carlo estimate of the expected reward for this prompt, playing the same role as the learned value function $V(s)$ in PPO. More samples mean a better estimate. In practice, $G = 8$ to $G = 64$ provides sufficient accuracy without excessive compute.

The standard deviation in the denominator does something subtle but important: it puts every prompt on the same scale. After normalization, the advantages within a group have zero mean and unit standard deviation, whatever the raw reward range. A prompt whose rewards swing between 0 and 10 and a prompt whose rewards vary only between 0.78 and 0.82 both produce advantages of order one, so no prompt dominates the update simply because its rewards are numerically large. The flip side is that small reward differences on low-variance prompts are amplified, and when every response in a group receives the same reward the group carries no learning signal at all. This per-prompt rescaling comes without any additional hyperparameters.

Figure (interactive): GRPO: Group Relative Policy Optimization. For the prompt “What is 7 × 8?”, a group of $G = 8$ responses is sampled and scored, then the rewards are normalized into advantages using the group mean ($\mu$) and standard deviation ($\sigma$).

The full GRPO objective incorporates this advantage into a clipped surrogate structure that should look familiar:

$$\mathcal{J}_{\text{GRPO}} = \mathbb{E}_{q \sim \mathcal{D}} \left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\Big(\rho_t^{(i)} \hat{A}_i,\; \text{clip}\big(\rho_t^{(i)}, 1{-}\varepsilon, 1{+}\varepsilon\big) \hat{A}_i\Big) - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]$$

This is structurally identical to PPO’s clipped surrogate. The probability ratio $\rho_t^{(i)} = \pi_\theta(o_{i,t} \mid q, o_{i,\lt t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,\lt t})$ is the same per-token ratio. The clipping mechanism works identically. The only difference is where the advantage $\hat{A}_i$ comes from: PPO estimates it with a learned critic, GRPO estimates it with group statistics. The double averaging (over group members and over tokens) combined with clipping and KL penalty gives GRPO the stability of PPO without the critic’s memory cost.
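The objective for a single prompt can be sketched in plain Python. The KL term is passed in as a precomputed number because its estimator is outside the scope of this example, and the default values of `eps` and `beta` are placeholders, not recommendations:

```python
import math

def grpo_objective(new_logps, old_logps, advantages, kl_estimate,
                   eps=0.2, beta=0.04):
    """Clipped GRPO surrogate for one prompt's group of G completions.

    new_logps[i][t] / old_logps[i][t]: log-prob of token t of completion i
    under the current / sampling policy. advantages[i] is shared by every
    token of completion i. kl_estimate is a precomputed value for
    KL(pi_theta || pi_ref); how it is estimated is outside this sketch.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        per_token = 0.0
        for n, o in zip(new_lp, old_lp):
            ratio = math.exp(n - o)
            clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
            per_token += min(ratio * adv, clipped * adv)
        total += per_token / len(new_lp)      # 1/|o_i| average over tokens
    return total / len(new_logps) - beta * kl_estimate   # 1/G average

# Hypothetical group of G = 2 completions with 2 and 3 tokens.
old = [[-1.0, -2.0], [-0.5, -1.5, -1.0]]
new = [[-0.9, -1.5], [-0.6, -1.5, -1.4]]
adv = [1.0, -1.0]
print(grpo_objective(new, old, adv, kl_estimate=0.01))
```

Note that one advantage value per completion is broadcast to all of its tokens, while the ratio and the clip are still evaluated token by token.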

GRPO + RLVR: The Reasoning Revolution

GRPO’s natural partner is Reinforcement Learning with Verifiable Rewards (RLVR). These are tasks where correctness can be checked deterministically: math problems have right answers, code must pass test cases, logic puzzles have verifiable solutions. For these tasks, the reward function is a simple rule (correct or incorrect) requiring no learned reward model at all.

Rule-based rewards are far more resistant to reward hacking. There is no neural reward model to exploit and no learned proxy to overoptimize; the reward is as trustworthy as the checker that computes it. This makes GRPO + RLVR an extraordinarily clean training setup: sample responses, check if they are correct, normalize advantages within the group, update the policy. Two models in memory (policy and reference), a deterministic reward function, and online exploration.

DeepSeek-R1 demonstrated how powerful this combination can be. Its reward function was remarkably simple:

$$R = R_{\text{accuracy}} + R_{\text{format}}$$

where $R_{\text{accuracy}}$ is binary (1 if the final answer matches the ground truth, 0 otherwise, verified by regex matching) and $R_{\text{format}}$ enforces structured reasoning by requiring the reasoning to appear inside think tags and the final answer inside answer tags. That’s it. No neural reward model, no human preference data for the RL stage.
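A rule-based reward of this shape might look like the sketch below. The tag names, the exact-match comparison, and the equal weighting of the two terms are all assumptions made for illustration; this is not DeepSeek’s actual checker.

```python
import re

# Assumed template for this sketch: reasoning wrapped in one tag, followed by
# the final answer wrapped in another. The tag names are parameters.
REASON_TAG, ANSWER_TAG = "think", "answer"

def wrap(tag, text):
    return f"<{tag}>{text}</{tag}>"

ANSWER_RE = re.compile(rf"<{ANSWER_TAG}>(.+?)</{ANSWER_TAG}>", re.DOTALL)
FORMAT_RE = re.compile(
    rf"^<{REASON_TAG}>.+?</{REASON_TAG}>\s*<{ANSWER_TAG}>.+?</{ANSWER_TAG}>$",
    re.DOTALL)

def format_reward(completion):
    """1 if the completion follows the reasoning-then-answer template."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion, ground_truth):
    """1 if the extracted final answer equals the ground truth exactly."""
    m = ANSWER_RE.search(completion)
    return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(completion, ground_truth):
    return accuracy_reward(completion, ground_truth) + format_reward(completion)

good = wrap(REASON_TAG, "7 * 8 = 56") + " " + wrap(ANSWER_TAG, "56")
wrong = wrap(REASON_TAG, "7 * 8 = 54") + " " + wrap(ANSWER_TAG, "54")
unformatted = "The answer is 56."

for c in (good, wrong, unformatted):
    print(total_reward(c, "56"))
```

A correct, well-formatted completion earns both terms; a well-formatted but wrong one earns only the format term; free text earns nothing.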

The results with DeepSeek-R1-Zero, trained with GRPO and no supervised fine-tuning at all, were striking: 71.0% pass@1 on AIME 2024 (comparable to OpenAI’s o1-0912) and 95.9% on MATH-500. The full DeepSeek-R1 pipeline went further, reaching 97.3% on MATH-500 and a 2,029 Elo rating on Codeforces. Perhaps most remarkable was the emergent behavior: the model spontaneously developed self-correction strategies (“Wait, let me reconsider this step…”) without any explicit training signal for reflection. This self-verification behavior emerged purely from the pressure to produce correct final answers.

DeepSeek-R1’s practical configurations: $G = 16$ responses per prompt, batches of 32 unique questions, and notably $\varepsilon = 10$, which effectively disables clipping entirely. The group-normalized advantages are already well-scaled, reducing the need for a tight trust region constraint.

Connection to REINFORCE and Variance Reduction

GRPO is best understood as a variant of REINFORCE, the simplest policy gradient algorithm, with a group-based baseline for variance reduction. Vanilla REINFORCE computes policy gradients as:

$$\nabla_\theta J = \mathbb{E}\big[R \cdot \nabla_\theta \log \pi_\theta(a \mid s)\big]$$

This has notoriously high variance because raw returns fluctuate enormously between episodes. The standard fix is to subtract a baseline $b$ from the return: $\nabla_\theta J = \mathbb{E}[(R - b) \cdot \nabla_\theta \log \pi_\theta]$. Any baseline that does not depend on the action is unbiased. PPO learns this baseline with an expensive critic network. GRPO uses the group mean instead, a sample-based estimate that improves with larger group sizes, costs no additional parameters, and requires no additional training.
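The effect of a baseline can be verified exactly on a two-action toy policy, with no sampling required. Everything in this sketch (the probabilities, the returns, the helper name) is made up for illustration:

```python
def grad_estimator_stats(probs, rewards, baseline):
    """Exact mean and variance of (R - b) * d/dz0 log pi(a) for a 2-action
    softmax policy, where z0 is the logit of action 0.

    For a softmax, d/dz0 log pi(a) = 1[a == 0] - probs[0].
    """
    samples = [(rewards[a] - baseline) * ((1.0 if a == 0 else 0.0) - probs[0])
               for a in range(2)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

probs = [0.5, 0.5]          # hypothetical policy
rewards = [10.0, 11.0]      # hypothetical returns for each action
expected_reward = sum(p * r for p, r in zip(probs, rewards))

print(grad_estimator_stats(probs, rewards, baseline=0.0))
print(grad_estimator_stats(probs, rewards, baseline=expected_reward))
```

Both calls report the same mean gradient, so the baseline introduces no bias, but the variance collapses once the expected reward is subtracted. GRPO’s group mean is a sampled stand-in for that expected reward.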

Choosing the Right Technique

The choice between PPO, DPO, and GRPO depends primarily on the nature of your reward signal and your computational constraints.

Use PPO when you are training on open-ended tasks (creative writing, general helpfulness, safety) where reward must come from a learned model, and you have the compute budget to support four models in memory. Despite its complexity, PPO with proper tuning remains the highest-ceiling approach. Xu et al. (ICML 2024) showed it consistently outperforms DPO on challenging benchmarks. It is the choice of frontier labs training flagship models.

Use DPO when you have high-quality paired preference data, want simple and stable training, and are working with limited compute. DPO matched or exceeded PPO on summarization benchmarks (61% GPT-4 win rate vs. PPO’s 57% on TL;DR) and is implementable with a standard supervised training pipeline. It is ideal for quick alignment passes and situations where the preference dataset covers the intended use distribution well.

Use GRPO when your task has verifiable rewards (math, code, logic, factual QA with ground-truth answers). It combines online exploration (like PPO) with low memory footprint (like DPO), and rule-based rewards remove the learned proxy that reward hacking usually exploits. It has become the standard choice for training reasoning models.

In practice, these techniques are often used in combination, not isolation. Llama 3 used a pipeline of SFT → Rejection Sampling → PPO → DPO, where each stage addressed different aspects of alignment. DeepSeek-R1 alternated between SFT stages, RLVR with GRPO for reasoning, and RLHF stages for general helpfulness. The techniques are complementary.

Compare PPO, DPO, and GRPO across key dimensions:

Attribute	PPO	DPO	GRPO
Models in Memory	4 models (policy, reference, reward, critic)	2 models (policy, reference)	2 models (policy, reference)
Training Paradigm	Online RL	Offline supervised	Online RL
Reward Source	Learned reward model	Implicit (derived from preferences)	Verifiable / rule-based
Implementation	Complex (~1000s lines of code)	Simple (~20 LOC core)	Moderate (~100s lines of code)
Training Stability	Sensitive to hyperparameters	Very stable	Stable (with group normalization)
Performance Ceiling	Highest (with proper tuning)	Good (limited by offline data)	Excellent (for verifiable tasks)
Reward Hacking Risk	High (learned proxy)	Low (no explicit reward)	Very low (rule-based rewards)
Best For	General alignment, frontier models	Quick alignment, limited compute	Math, code, reasoning tasks

Decision guide:

Do you have verifiable rewards (math, code, formal logic)? If yes, use GRPO: group-based rewards, no critic needed.
If not, do you have paired preference data and limited compute? If yes, use DPO: simple, stable, compute-efficient.
Otherwise, use PPO: maximum performance, open-ended tasks.
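The decision flow above can be condensed into a few lines. This is only a restatement of the rule of thumb; the function name and its boolean inputs are hypothetical, and real projects often combine methods:

```python
def choose_alignment_method(verifiable_rewards, paired_preferences,
                            limited_compute):
    """Mirror of the decision flow above; returns 'GRPO', 'DPO', or 'PPO'."""
    if verifiable_rewards:
        return "GRPO"
    if paired_preferences and limited_compute:
        return "DPO"
    return "PPO"

print(choose_alignment_method(True, False, False))   # math or code tasks
print(choose_alignment_method(False, True, True))    # preference data, small budget
print(choose_alignment_method(False, True, False))   # open-ended, large budget
```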

The landscape continues to expand. On the DPO side, IPO removes the Bradley-Terry assumption, KTO works with binary feedback (thumbs up/down) instead of pairwise preferences, and SimPO drops the reference model altogether. On the GRPO side, DAPO addresses training instabilities with dynamic sampling, and Dr. GRPO removes the length and standard-deviation normalization terms to correct biases they introduce into the update. Each builds on the foundations covered here.

From Explicit Rewards to Emergent Reasoning

Let’s step back and trace the arc we have followed. PPO established that RL could align language models, using an explicit reward model to score responses and a learned critic to estimate advantages. DPO showed the reward model was unnecessary. The reward signal was implicit in the probability ratios, waiting to be extracted through a clever reparameterization. GRPO showed the critic was unnecessary too. Group statistics could replace learned value functions, especially when paired with verifiable rewards.

Each step eliminated a component that turned out to be inessential for the task at hand. What remained was the core objective (maximize expected reward while staying close to the reference) and progressively simpler ways of optimizing it.

But the most interesting result came from the simplest setup. DeepSeek-R1-Zero, trained with GRPO and binary correct/incorrect rewards, spontaneously developed multi-step reasoning, self-correction, and solution verification, capabilities that were not explicitly trained. The model learned how to think from the sole pressure to be correct. No demonstrations of reasoning. No reward for intermediate steps. Just final-answer accuracy and the group-relative advantage signal.

This suggests that the path to capable reasoning models may be less about sophisticated reward engineering and more about giving models the right optimization framework and letting them discover strategies on their own. The field is still learning which components are genuinely necessary and which are engineering artifacts of earlier approaches. These three techniques (PPO, DPO, and GRPO) represent the progression of that understanding.

References

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint.
- The original PPO paper introducing the clipped surrogate objective.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
- The DPO paper showing preference optimization can be reduced to classification.
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., … & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint.
- Introduces GRPO and demonstrates its effectiveness for mathematical reasoning.
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint.
- DeepSeek-R1 and R1-Zero results using GRPO with verifiable rewards.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … & Lowe, R. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
- The InstructGPT paper establishing the SFT → RM → PPO pipeline.
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. NeurIPS 2017.
- Foundational work on learning reward models from human preferences.
Gao, L., Schulman, J., & Hilton, J. (2022). Scaling Laws for Reward Model Overoptimization. ICML 2023.
- Formalizes reward hacking and overoptimization in RLHF.
Xu, J., Xie, T., Zhao, A., Song, J., Wang, J., & Zhang, Y. (2024). Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study. ICML 2024.
- Systematic comparison showing PPO outperforms DPO when properly tuned.
Wen, Y., Zhang, Z., Jiao, H., Yang, M., Zhang, H., & Wang, G. (2024). From RLHF to RLHF: The Dilemma of Improving Human Alignment. arXiv preprint.
- Analysis of reward hacking showing approval ratings can increase while correctness decreases.
Huang, H., Zhong, H., Li, S., Yang, K., & Zitnik, M. (2023). The N Implementation Details of RLHF with PPO. arXiv preprint.
- Practical insights on PPO hyperparameter sensitivity and training diagnostics.
Bradley, R. A., & Terry, M. E. (1952). Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4), 324-345.
- The original paired comparison model adapted for preference learning.
Yu, Q., Zhang, H., Shao, Z., Guo, D., Zhu, Q., & Lu, H. (2025). DAPO: An Open-Source LLM Reinforcement Learning System. arXiv preprint.
- Addresses GRPO training instabilities with dynamic sampling and clip-higher strategy.

Dissecting OpenClaw: An Interactive Architecture Map

Mon, 16 Feb 2026 12:00:00 +0800

The Big Picture

OpenClaw is a 430K-line TypeScript project that turns any messaging platform into an interface for an autonomous AI agent. Instead of reading about it, explore the architecture below.

OpenClaw: The Full Topology

A hub-and-spoke architecture that composes familiar systems abstractions into an autonomous AI agent.

430K lines of TypeScript · 15+ channels · 5,705+ skills · 197K GitHub stars

Channels: WhatsApp (Baileys), Telegram (grammY), Discord (discord.js), iMessage (native macOS), Slack (Bolt), +10 more (Matrix, Email...)

Gateway: WebSocket + Scheduler

Session Resolver: namespace isolation

Context Assembler: prompt building

Streaming LLM: Claude, GPT, etc.

Tool Executor: 5,705+ skills

State Persister: JSONL + SQLite

Memory System: MEMORY.md + search

Primitive 1: Autonomous Invocation

The agent doesn't wait for messages. It can wake itself via cron, webhooks, voice, heartbeats, or Pub/Sub triggers, each scoped to an isolated session.

Cron schedules (daily summaries, check-ins)

Webhooks (GitHub, Stripe, custom)

Voice wake word detection

Heartbeat / keep-alive pings

Gmail Pub/Sub (email-triggered actions)

Session isolation per conversation

Primitive 2: Externalized Memory

Long-term memory lives on disk, not in the context window. The agent pages knowledge in and out like an OS manages virtual memory.

MEMORY.md (persistent knowledge base)

Daily logs (memory/YYYY-MM-DD.md)

SQLite (structured session data)

Hybrid search (BM25 + vector similarity)

/compact command for context paging

Three-Layer Architecture

The hub-and-spoke topology collapses into three clean layers, each with a single responsibility.

The Three-Layer Architecture

Here's how messages move through the system — from incoming connection to LLM response. Click any component for details.

Layer 1: Gateway

Routing + Scheduling

This is where connections come in. The gateway handles routing, payload validation, and scheduled triggers.

Components: WebSocket Server (connections), Channel Manager (routing), Scheduler (cron + triggers), Control UI (dashboard), TypeBox Validator (schema)

↓ route

Layer 2: Channel Adapters

Normalize + Authorize

Each adapter wraps a platform's SDK and converts its messages into a common StandardMessage shape we can work with.

Adapters: WhatsApp (Baileys), Telegram (grammY), Discord (discord.js), iMessage (native macOS), Slack (Bolt), Matrix (matrix-bot-sdk)

↓ dispatch

Layer 3: Agent Runtime

Reason + Execute

Where the actual thinking happens — resolve the session, build context, call the LLM, run tools, save state.

Components: Session Resolver (namespace), Context Assembler (prompt), Streaming LLM (inference), Tool Executor (skills), State Persister (JSONL)

Message Flow

Seeing the layers in isolation is one thing. Here’s what happens when an actual message flows through them.

Message Journey: WhatsApp to Response

Follow a single message through all three layers. Click any phase to expand its sub-steps.

WhatsApp +1-555-0199

→

Trigger

~5ms

WhatsApp message from +1-555-0199: "What's the weather in Tokyo?"

Webhook received from Baileys connection

Idempotency key checked (message dedup)

Gateway

~8ms

Route to correct adapter and session

Channel identified: WhatsApp (Baileys)

TypeBox schema validation passed

Scheduler check: no pending cron conflicts

Adapter

~12ms

Normalize and authorize the incoming message

Baileys payload → StandardMessage format

Allowlist check: +1-555-0199 authorized

DM pairing verified (not a group message)

Runtime

~1.85s

Resolve → Assemble → Invoke → Execute

Session namespace loaded (<50ms)

Prompt assembled: AGENTS.md + SOUL.md + memory (<100ms)

Streaming LLM: first token (200-500ms)

Tool call: weather_lookup("Tokyo") (1.2s)

Response

~15ms

Format and deliver the reply to WhatsApp

Markdown → WhatsApp formatting (bold, lists)

Sent via Baileys connection to +1-555-0199

System Prompt Filesystem

The runtime’s “brain” isn’t code — it’s a filesystem of markdown documents that compose into the system prompt.

System Prompt as Filesystem

The agent's identity, rules, and capabilities are composed from markdown files on disk. Skills are natural-language documents, not code.

Session Directory

AGENTS.md ~800

SOUL.md ~600

TOOLS.md ~400

▶ skills/

weather/SKILL.md ~120

calendar/SKILL.md ~150

code-review/SKILL.md ~180

Token Contribution to Assembled Prompt (~4,200 tokens)

AGENTS.md (800)

SOUL.md (600)

TOOLS.md (400)

Active Skills (450)

Memory Pages (900)

Conversation History (1,050)

ClawHub: Community Skills

5,705+ skills

weather

Real-time weather data for any city worldwide

12.4K installs

google-calendar

Read, create, and manage Google Calendar events

8.9K installs

code-review

Automated code review with style and security checks

6.2K installs

web-search

Search the web and summarize results

15.1K installs

notion-sync

Sync notes and databases with Notion

4.7K installs

image-gen

Generate images via DALL-E or Stable Diffusion

9.3K installs

Memory System

The most distinctive primitive: long-term memory that lives on disk, not in the context window.

Virtual Memory for Cognition

Long-term memory lives on disk, not in the context window. The agent pages knowledge in and out like an OS manages virtual memory.

LLM Context

Cache (volatile)

The active working set. Fast but limited. Everything here is lost when the conversation ends or context fills up.

Context Usage 85%

170K tokens used 200K limit

System prompt (4.2K)

Conversation history (95K)

Tool results (48K)

Memory pages (22.8K)

Local Disk

Source of truth (durable)

Persistent storage that survives across sessions. Unlimited capacity. The ground truth for all agent knowledge.

MEMORY.md 12KB

memory/2026-02-16.md 3KB

sessions.sqlite 8MB

embeddings.db 24MB

Search & Retrieval

Page-in mechanism

Dual search paths find relevant memories and page them back into context when needed.

BM25 Keyword

Exact term matching, fast

Vector Similarity

Semantic matching, flexible

merge & re-rank

Ranked Results

Top-K memory pages paged into context
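
The "merge & re-rank" step can be sketched with reciprocal rank fusion (RRF), one common way to combine a keyword ranking with a vector ranking. This is an illustration in Python, not OpenClaw's TypeScript; the source does not say which fusion rule OpenClaw uses, and the function name, the constant k=60, and the sample page IDs are all assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_k=3):
    """Merge several ranked lists of page IDs into one list (hypothetical helper)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            # A page earns more credit the higher it ranks in each list.
            scores[page_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by page ID for determinism.
    ordered = sorted(scores, key=lambda p: (-scores[p], p))
    return ordered[:top_k]

# Made-up results from the two search paths.
bm25_hits = ["tokyo-trip", "weather-prefs", "standup-notes"]
vector_hits = ["weather-prefs", "umbrella-reminder", "tokyo-trip"]

merged = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(merged)
```

Pages that appear in both lists rise to the top, which is the point of running two search paths: exact-term hits and semantic hits reinforce each other.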

/compact — Context Paging

Write durable notes from context to MEMORY.md

✓

Summarize conversation history (compress)

✓

Drop redundant tool outputs from context

✓

Rebuild context window with essential state only

✓

Before

170K tokens

85% capacity

→

After

50K tokens

25% capacity

The entire system is an exercise in composition: message queues, schedulers, filesystems, and virtual memory, familiar abstractions from operating systems, recomposed into an AI agent.

Why `vllm serve` Works on Day Zero (and What It Takes to Make It Fast)

Sat, 14 Feb 2026 12:00:00 +0800

In this post, we’ll trace what happens when vLLM encounters a model it’s never seen before. We’ll work through the full lifecycle from the initial config.json pull off Hugging Face, through the registry lookup that decides the integration path, into either the Transformers fallback or the native integration code, and down to the forward pass where PagedAttention kernels actually execute.

Why this matters: new model architectures appear constantly, and vLLM needs to serve them. The interesting engineering question is how — because the optimizations that make inference fast (fused kernels, CUDA Graphs, tensor parallelism) require deep model-specific restructuring. You can’t just import a model and get peak performance. vLLM resolves this tension with a tiered system: immediate support through a compatibility layer, then a clear path to fully optimized native integration.

This post is structured into 4 parts:

The Gateway — how vLLM decides what to do with a model it receives
The Transformers Fallback — the zero-day mechanism and its trade-offs
Native Integration — what it takes to make a model truly fast in vLLM
The Execution Core — forward pass, weight loading, and distributed execution

We’ll build on concepts from previous posts. If you’re not familiar with PagedAttention and FlashAttention, or the hidden software stack beneath inference, those are worth reading first. We also won’t re-explain the Engine-Worker orchestration layer in full — just enough to ground the model integration story.

Here’s the starting point:

vllm serve some-brand-new/model-7B --dtype auto
# This works. Even for a model vLLM has never seen before.

That command succeeds for models that vLLM has no dedicated code for. Let’s understand why.

Part 1: The Gateway

The Engine-Worker-Model Hierarchy

vLLM enforces a strict separation of concerns in how it handles models. Before we get into the model-specific details, let’s establish the high-level architecture, since it determines where new model code actually lives.

There are four levels:

LLMEngine — the control plane. Handles scheduling, manages the BlockSpaceManager (which tracks physical GPU memory blocks), and decides which requests get processed in each iteration. The Engine is completely agnostic to model architecture.
Worker — one per GPU. Manages the GPU device, holds its slice of model weights, and coordinates with other Workers for distributed execution.
ModelRunner — sits inside each Worker. Responsible for converting logical request data (token IDs, sequence lengths) into the physical tensors the model needs. This is where input flattening happens.
Model — the neural network itself. Whether it’s a native LlamaForCausalLM or a wrapped TransformersModel, this is the only layer that changes when you add a new model.

The key property here: the Engine only needs the KV cache element size (derived from num_layers, hidden_size, and num_attention_heads in the model config) to make scheduling decisions. It never touches the model’s forward pass. This means adding support for an entirely new architecture only changes the bottom layer of this stack. Everything above it stays the same.
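
To make the "KV cache element size" concrete, here is a rough sketch of the arithmetic. The function name and the num_kv_heads and dtype_bytes parameters are assumptions for illustration (grouped-query models use fewer KV heads than attention heads), and the block size of 16 tokens is only an example value.

```python
def kv_cache_bytes_per_token(num_layers, hidden_size, num_attention_heads,
                             num_kv_heads=None, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers (hypothetical helper)."""
    head_dim = hidden_size // num_attention_heads
    if num_kv_heads is None:
        # Plain multi-head attention: one K/V head per query head.
        num_kv_heads = num_attention_heads
    # Factor of 2: one key vector and one value vector per head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama-2-7B-like config: 32 layers, hidden size 4096, 32 heads, fp16.
per_token = kv_cache_bytes_per_token(32, 4096, 32)
per_block = per_token * 16  # assuming 16 tokens per KV block

print(per_token // 1024, "KiB per token")
print(per_block // (1024 * 1024), "MiB per block")
```

Nothing here depends on what the layers compute, only on their shapes, which is why a scheduler can budget memory blocks without knowing the architecture.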

Engine → Worker → Model Hierarchy

Four abstraction levels. VllmConfig feeds each one. Only the bottom layer changes for new models.

LLMEngine (Control Plane): schedules requests, manages KV cache blocks, completely model-agnostic. Components: Scheduler (requests), BlockSpaceManager (memory blocks), KV Cache Mgr (allocation).

↓ dispatch

Worker (Device Mgmt): one per GPU. Manages device, holds weight shards, coordinates distributed execution. Components: GPU Device (CUDA ctx), Weight Shard (TP slice), Distributed Coord (NCCL).

↓ execute

ModelRunner (Input Prep): converts logical request data into physical tensors. Handles input flattening and attention metadata. Components: Input Flattener (tokens), Tensor Prep (batching), AttentionMetadata (positions).

↓ forward

Model (Neural Network, only this changes): the actual neural network. Swap this layer to support a new architecture; everything above stays the same. Components: LlamaForCausalLM (native), TransformersModel (wrapped), Forward Pass (inference).

VllmConfig: SchedulerConfig → Engine, ParallelConfig → Worker, ModelConfig → Runner + Model, QuantizationConfig → Model

The VllmConfig object is how information flows across these levels. It aggregates several sub-configs:

Config Component	What It Provides
ModelConfig	Architecture strings, hidden sizes, vocabulary size, the `architectures` list used for registry lookup
ParallelConfig	Tensor parallelism (TP) and pipeline parallelism (PP) degrees. Determines how linear layers shard their weights
SchedulerConfig	Maximum number of sequences and memory allocation strategy. Influences BlockSpaceManager setup
QuantizationConfig	Quantization method (AWQ, GPTQ, FP8). Linear layers use this to select the appropriate kernel during weight loading

Registry Mechanics and the Architecture Lookup

When you run vllm serve, the first thing that happens is a config.json resolution: the file is either pulled from the Hugging Face Hub (if you pass a model ID like meta-llama/Llama-2-7b-hf) or read from disk (if you pass a local path, as you would in an air-gapped deployment). The architectures field, for example ["LlamaForCausalLM"], is the primary lookup key for the entire loading sequence.

This key gets checked against the _VLLM_MODELS dictionary, the core of vLLM’s ModelRegistry. It maps architecture strings to (module_name, class_name) tuples:

_VLLM_MODELS = {
    "LlamaForCausalLM":       ("llama", "LlamaForCausalLM"),
    "MistralForCausalLM":     ("mistral", "MistralForCausalLM"),
    "DeepseekV2ForCausalLM":  ("deepseek_v2", "DeepseekV2ForCausalLM"),
    "Qwen2ForCausalLM":       ("qwen2", "Qwen2ForCausalLM"),
    # ... hundreds of other architectures
}

The module_name is a relative path within vllm.model_executor.models — so "llama" resolves to vllm/model_executor/models/llama.py. The class_name is the specific nn.Module subclass to instantiate.

One important detail: vLLM does NOT import all model classes at startup. Instead, it uses _LazyRegisteredModel wrappers. When the ModelConfig requests a specific architecture, the registry:

Checks if the architecture string exists in _VLLM_MODELS
Retrieves the module path and class name
Dynamically imports the module using importlib
Returns the class constructor to the ModelLoader
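
The four steps above can be sketched in a few lines. This is a toy, not vLLM's registry code: the _MODELS table, the LazyRegisteredModel class, and resolve() are hypothetical names, and standard-library classes stand in for model classes so the example runs anywhere.

```python
import importlib

# Hypothetical registry: architecture string -> (module path, class name).
# Standard-library classes stand in for model implementations.
_MODELS = {
    "OrderedDict": ("collections", "OrderedDict"),
    "Fraction": ("fractions", "Fraction"),
}

class LazyRegisteredModel:
    """Holds only strings until load() is called, so nothing is imported early."""

    def __init__(self, module_name, class_name):
        self.module_name = module_name
        self.class_name = class_name

    def load(self):
        # The import happens here, on first use, not at startup.
        module = importlib.import_module(self.module_name)
        return getattr(module, self.class_name)

def resolve(architectures):
    """Return the class for the first known architecture, or None if none match."""
    for arch in architectures:
        if arch in _MODELS:
            return LazyRegisteredModel(*_MODELS[arch]).load()
    return None  # the caller would fall back to another path here

model_cls = resolve(["Fraction"])
print(model_cls)
print(resolve(["SomeBrandNewArch"]))
```

Because the table stores strings rather than classes, a broken or missing dependency in one module only surfaces when that specific architecture is requested.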

This lazy loading matters for dependency isolation. A user running Llama shouldn’t need the specific kernels required for an audio-processing model. If those kernels weren’t installed and the audio model were imported eagerly at startup, vLLM would crash for everyone.

Three things can happen when the registry receives an architecture string:

Found in registry → native path (optimized, model-specific code)
Registered by plugin → external native path (optimized, third-party code)
Not found → Transformers Modeling Backend fallback (compatibility shim)

The Plugin System

This is a significant evolution in vLLM’s architecture. External packages can register models without modifying vLLM core.

The mechanism uses Python’s entry points, under the vllm.general_plugins group. During vLLM’s initialization, it discovers and executes all registered plugins. A plugin can invoke ModelRegistry.register_model() to inject a new architecture mapping at runtime:

# In your package's plugin entry point
def register():
    from vllm import ModelRegistry

    if "MyNewModel" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "MyNewModel",
            "my_package.models:MyNewModel"
        )

This decouples the vLLM release cycle from model release cycles. Model creators — Mistral, DeepSeek, Google — can ship a “vLLM adaptation package” alongside their weights. Users pip install that package, and vLLM recognizes the new architecture immediately. No PRs to vLLM core, no waiting for a new release.

Part 2: The Transformers Fallback (Zero-Day Support)

The Transformers Backend

When the registry lookup fails — or when you explicitly set model_impl="transformers" — vLLM resolves to its Transformers backend. This is a family of mixin-composed classes (TransformersForCausalLM, TransformersMoEForCausalLM, TransformersMultiModalForCausalLM, etc.) defined in vllm/model_executor/models/transformers/. They sit between vLLM’s scheduler and standard Hugging Face model code, and they’re the reason that vllm serve command works on day zero for new models.

There are two initialization steps worth understanding:

1. Config-Based Instantiation. The wrapper uses transformers.AutoModel.from_config(...) to build the model architecture on a meta device — meaning no GPU memory is allocated yet, just the module structure with placeholder parameters. Weights are loaded separately later through vLLM’s load_weights() pipeline. This two-phase approach (structure first, weights later) is critical for distributed loading: each GPU can load only its weight shard, rather than loading everything and then discarding what it doesn’t need.

2. Attention Backend Injection. Before instantiation, vLLM modifies the model’s text configuration:

# vLLM sets this before calling from_config()
text_config._attn_implementation = "vllm"

This is the critical mechanism. Modern Hugging Face models are written to be attention-backend-agnostic. They check _attn_implementation and query a registry of attention functions. vLLM populates ALL_ATTENTION_FUNCTIONS with its own PagedAttention-backed implementation:

# vLLM registers its attention backend into HF's registry
ALL_ATTENTION_FUNCTIONS["vllm"] = vllm_attention_forward

When the HF model reaches its attention layer and calls the registered function, it gets vLLM’s implementation instead of the default eager/SDPA/FlashAttention backend. The model doesn’t know the difference.
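
The dispatch pattern is easier to see in a toy version. Everything below is a stand-in written for illustration: ATTENTION_FUNCTIONS mimics the role of ALL_ATTENTION_FUNCTIONS, and ToyConfig, ToyAttentionLayer, and the two attention functions are hypothetical, returning strings instead of tensors.

```python
# Toy stand-in for the registry of attention functions.
ATTENTION_FUNCTIONS = {}

def eager_attention(q, k, v, **kwargs):
    return "eager"

def paged_attention(q, k, v, attn_metadata=None, **kwargs):
    # A real backend would read and write paged KV blocks here.
    return f"paged(blocks={attn_metadata['block_tables']})"

ATTENTION_FUNCTIONS["eager"] = eager_attention
ATTENTION_FUNCTIONS["vllm"] = paged_attention

class ToyConfig:
    _attn_implementation = "eager"

class ToyAttentionLayer:
    def __init__(self, config):
        self.config = config

    def forward(self, q, k, v, **kwargs):
        # The layer never names a backend; it looks one up by config string
        # and forwards any extra kwargs untouched.
        attn_fn = ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attn_fn(q, k, v, **kwargs)

config = ToyConfig()
layer = ToyAttentionLayer(config)
before = layer.forward("q", "k", "v")

config._attn_implementation = "vllm"  # the injection step
after = layer.forward("q", "k", "v", attn_metadata={"block_tables": [[4, 17]]})
print(before, "->", after)
```

The layer code is identical before and after the switch. Only the config string and the extra keyword argument change, which is why model code written this way needs no vLLM-specific edits.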

Let’s trace the data flow step by step:

Engine generates block_tables and slot_mapping → packs them into an AttentionMetadata object
TransformersForCausalLM.forward() receives flattened inputs + attn_metadata
The wrapper passes vLLM metadata as **kwargs into the HF model’s forward method
The HF model propagates **kwargs down through its layers (this is a convention in Transformers — unused kwargs flow through)
At each attention layer, the injected vLLM backend receives Q, K, V tensors + the attn_metadata
The PagedAttention CUDA kernel executes — storing K/V into paged blocks, computing attention scores using block tables

The result: even “unoptimized” models benefit from PagedAttention’s memory virtualization. No more OOM from naive KV cache pre-allocation. The KV cache is managed efficiently through fixed-size blocks, regardless of whether the model itself was designed for it.

The Trade-offs (What You Lose)

The Transformers backend enables immediate serving, but it sits in what we might call an “unoptimized valley.” Let’s be specific about the costs:

CUDA Graph Capture. In vLLM V1, the Transformers backend supports torch.compile with piecewise CUDA graph capture (via the @support_torch_compile decorator), closing what was historically the largest performance gap. However, models with dynamic RoPE scaling still fall back to eager mode. And native models can leverage more aggressive graph capture strategies that cover a larger fraction of the computation graph, since their code is explicitly written with static control flow in mind.

Kernel Fusion. Native vLLM models use fused kernels: LayerNorm + activation in one kernel, RoPE computation fused with the QKV projection, SiLU and gate multiplication combined. The fallback uses separate PyTorch operations for each step. Every separate operation means an extra round-trip to HBM: write the intermediate result, then read it back for the next op. On a memory-bandwidth-bound workload (which LLM decode typically is), these extra reads and writes add up fast.

Parallelism Limitations. Basic Tensor Parallelism can sometimes be inferred automatically via the model’s base_model_tp_plan, but this doesn’t cover every case. Mixture-of-Experts routing, novel attention patterns, or architectures with unusual layer structures may not shard correctly — restricting you to single-GPU execution.

Capability	Transformers Fallback	Native Integration
Day-zero support	Yes	No (requires implementation)
PagedAttention	Yes (via injection)	Yes (native)
CUDA Graph capture	Yes (via torch.compile in V1)	Yes (full static graph)
Kernel fusion	No (separate PyTorch ops)	Yes (fused CUDA kernels)
Tensor Parallelism	Limited (auto-inferred)	Full (explicit sharding)
Pipeline Parallelism	No	Yes (with `intermediate_tensors`)
Quantization (AWQ/GPTQ/FP8)	Limited	Full support

Note: The fallback is not meant to be the final state — it’s the starting point. It gives you a working, servable model while the community works on native integration. Think of it as a bridge: useful immediately, but you cross it to get somewhere better.

Part 3: Native Integration

The Model Interface and Prefix Protocol

To go from “supported” to “optimized,” a model must be implemented natively. This means creating a Python class that mirrors the original model structure but substitutes standard layers with vLLM’s distributed primitives.

Every module in a native vLLM model accepts a prefix="" argument during initialization. This string represents the module’s fully qualified name in the state dictionary — for example, model.layers.0.self_attn.q_proj.

class LlamaAttention(nn.Module):
    def __init__(self, config, prefix=""):
        super().__init__()
        self.qkv_proj = QKVParallelLinear(
            ...,
            prefix=f"{prefix}.qkv_proj"
        )
        self.o_proj = RowParallelLinear(
            ...,
            prefix=f"{prefix}.o_proj"
        )

The prefix serves two purposes:

Weight loading: maps checkpoint tensors to the correct layer instance. When the load_weights method receives a tensor named model.layers.0.self_attn.q_proj.weight, the prefix tells it exactly which module to route it to.
Non-uniform quantization: the QuantizationConfig can specify different quantization schemes per layer. Some layers might be FP16 while others are INT8. The prefix is how the config identifies which kernel to instantiate for each specific layer.
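
As a rough illustration of the second purpose, the sketch below looks up a quantization scheme from a module's prefix. The rule table, the glob-style patterns, and both helper names are hypothetical; vLLM's QuantizationConfig classes have their own matching logic.

```python
import fnmatch

# Hypothetical per-layer rules: glob pattern over module prefixes -> scheme.
# The first matching rule wins.
QUANT_RULES = [
    ("model.layers.*.mlp.*", "int8"),
    ("model.layers.0.self_attn.*", "fp16"),
]

def scheme_for(prefix, default="fp16"):
    """Pick a quantization scheme for the module with this fully qualified name."""
    for pattern, scheme in QUANT_RULES:
        if fnmatch.fnmatch(prefix, pattern):
            return scheme
    return default

def child_prefix(parent, name):
    """Build a child's prefix without a leading dot when the parent is the root."""
    return f"{parent}.{name}" if parent else name

attn_prefix = child_prefix("model.layers.0.self_attn", "qkv_proj")
mlp_prefix = child_prefix("model.layers.3.mlp", "gate_up_proj")
print(attn_prefix, "->", scheme_for(attn_prefix))
print(mlp_prefix, "->", scheme_for(mlp_prefix))
```

Without the prefix, a linear layer would have no way to know where it sits in the model, so every layer would have to use the same scheme.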

Parallel Layer Primitives

For models that won’t fit on a single GPU (70B+), vLLM provides distributed primitives that replace standard nn.Linear and nn.Embedding layers:

ColumnParallelLinear splits the weight matrix along the output dimension. Each GPU computes a fraction of the output features. This is used for QKV projections (each GPU computes a subset of attention heads) and MLP up-projections (each GPU computes a portion of the intermediate dimension). No inter-GPU communication is needed for this operation.

RowParallelLinear splits along the input dimension. Each GPU computes a partial result, then an AllReduce sums the partial results across all GPUs. This is used for the attention output projection and MLP down-projection — the operations where partial results need to be recombined.

VocabParallelEmbedding splits the embedding table (often 128k+ tokens for modern models) across GPUs. Each GPU holds a slice of the vocabulary and performs lookups only for tokens in its range.

The VllmConfig provides tensor_parallel_size during initialization, and each layer auto-configures its sharding based on the worker’s rank. A model developer doesn’t write explicit GPU assignment code — they use these primitives and the infrastructure handles partitioning.
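
A small pure-Python check shows why a column-parallel layer followed by a row-parallel layer needs only one AllReduce. The helpers below are a sketch on nested lists, not vLLM's layers: each "GPU" is just a loop index, the AllReduce is an elementwise sum, and the activation between the two projections is left out.

```python
def matmul(a, b):
    """Plain matrix multiply on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_parallel(x, w, world_size):
    """Each rank multiplies by its slice of w's output columns. No communication."""
    cols = len(w[0]) // world_size
    return [matmul(x, [row[r * cols:(r + 1) * cols] for row in w])
            for r in range(world_size)]

def row_parallel(x_shards, w, world_size):
    """Each rank multiplies its input shard by its slice of w's rows, then all-reduce."""
    rows = len(w) // world_size
    partials = [matmul(x_shards[r], w[r * rows:(r + 1) * rows])
                for r in range(world_size)]
    # AllReduce stand-in: elementwise sum of the per-rank partial results.
    return [[sum(vals) for vals in zip(*same_row)] for same_row in zip(*partials)]

x = [[1, 2], [3, 4]]                          # 2 tokens, hidden size 2
w_up = [[1, 0, 2, 1], [0, 1, 1, 3]]           # 2 -> 4 (up-projection)
w_down = [[1, 0], [0, 1], [2, 1], [1, 2]]     # 4 -> 2 (down-projection)

reference = matmul(matmul(x, w_up), w_down)
sharded = row_parallel(column_parallel(x, w_up, 2), w_down, 2)
print(reference, sharded)
```

The column-parallel output is already split along exactly the dimension the row-parallel layer wants as input, so the intermediate activations never have to be gathered. An elementwise activation between the two would also work per shard.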

Input Flattening and the 1D Computation Graph

This is one of the more interesting design decisions in vLLM. In standard PyTorch, inputs are 2D tensors of shape [batch_size, sequence_length]. This requires padding to align sequences of different lengths: if you’re processing three requests with lengths 5, 12, and 3, you pad everything to length 12. That means 16 wasted positions out of 36, roughly 44% of compute thrown away on padding tokens.

vLLM eliminates padding entirely. The ModelRunner concatenates all tokens from all concurrent requests into a single 1D tensor of shape [total_num_tokens]. A separate positions tensor (also 1D) provides the sequence position for each token:

# Three concurrent requests:
#   Request A: tokens [101, 204, 305]        (3 tokens, positions 0,1,2)
#   Request B: tokens [42, 55, 67, 89, 12]   (5 tokens, positions 0,1,2,3,4)
#   Request C: tokens [700, 801]             (2 tokens, positions 0,1)

# Flattened input — no padding, no wasted compute:
input_ids = [101, 204, 305, 42, 55, 67, 89, 12, 700, 801]  # shape: [10]
positions  = [0,   1,   2,   0,  1,  2,  3,  4,  0,   1]   # shape: [10]

Every layer in a native vLLM model is written to process this 1D stream. Embeddings do lookups on the 1D tensor. RoPE uses the positions tensor for correct positional encoding. The attention layer uses block_tables to reconstruct the logical sequence structure — knowing which tokens belong to which request and where their KV cache blocks live in physical memory.
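
The flattening step itself is short. The sketch below is a plain-Python illustration with hypothetical helper names, using lists in place of tensors; the real ModelRunner also builds attention metadata alongside these arrays.

```python
def flatten_requests(requests):
    """Concatenate per-request token lists into 1D ids and per-token positions."""
    input_ids, positions, seq_lens = [], [], []
    for tokens in requests:
        input_ids.extend(tokens)
        # Positions restart at 0 for every request.
        positions.extend(range(len(tokens)))
        seq_lens.append(len(tokens))
    return input_ids, positions, seq_lens

def padding_waste(seq_lens):
    """Return (padded slots, total slots) for a [batch, max_len] layout."""
    total = len(seq_lens) * max(seq_lens)
    return total - sum(seq_lens), total

requests = [
    [101, 204, 305],        # Request A
    [42, 55, 67, 89, 12],   # Request B
    [700, 801],             # Request C
]
input_ids, positions, seq_lens = flatten_requests(requests)
print(input_ids)
print(positions)
print(padding_waste(seq_lens))
```

The seq_lens list is what lets later stages recover request boundaries from the flat stream.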

Standard HuggingFace vs vLLM Native

How vLLM eliminates padding waste and automatically shards across GPUs

Standard HuggingFace

Input shape: [3, 5], padded to the longest request; 5 of 15 slots are PAD (33% wasted).

Layers: nn.Embedding (full vocabulary, single GPU) → nn.Linear QKV (full weight matrix, single GPU) → nn.Linear output (no sharding, single GPU)

vLLM Native

Input shape: [10], the flattened tokens of Req A, Req B, and Req C; 10 tokens, 0% waste.

Layers: VocabParallelEmbedding (vocabulary sharded: GPU 0 holds vocab[0:N/2], GPU 1 holds vocab[N/2:N]; AllReduce combines) → ColumnParallelLinear (output dim split: GPU 0 computes out[0:H/2], GPU 1 computes out[H/2:H]; no AllReduce needed) → RowParallelLinear (input dim split: GPU 0 takes in[0:H/2], GPU 1 takes in[H/2:H]; AllReduce sync)

Weight Loading — From Disk to Device

One of the trickier parts of native integration: implementing load_weights(self, weights). This method receives an iterator of (name, tensor) pairs from AutoWeightsLoader and must map checkpoint weights into the model’s parameters.

The parameter mismatch problem is why this isn’t trivial. vLLM often fuses layers that are separate in the Hugging Face checkpoint. For example, a standard Llama MLP has separate gate_proj and up_proj linear layers. In vLLM, these become a single gate_up_proj to reduce kernel launches. The load_weights logic must handle this:

from vllm.model_executor.model_loader.weight_utils import default_weight_loader

def load_weights(self, weights):
    # Stacking mapping: (HF checkpoint name, fused vLLM param name, shard id)
    stacked_params_mapping = [
        ("gate_proj", "gate_up_proj", 0),  # goes into first half
        ("up_proj",   "gate_up_proj", 1),  # goes into second half
    ]
    params_dict = dict(self.named_parameters())

    for name, loaded_weight in weights:
        for ckpt_name, fused_name, shard_id in stacked_params_mapping:
            if ckpt_name not in name:
                continue
            # Rename exactly once, then write into the matching slice of
            # the fused parameter (no buffering or concatenation needed).
            param = params_dict[name.replace(ckpt_name, fused_name)]
            param.weight_loader(param, loaded_weight, shard_id)
            break
        else:
            # Not a stacked weight: load it as-is.
            param = params_dict[name]
            weight_loader = getattr(param, "weight_loader", default_weight_loader)
            weight_loader(param, loaded_weight)

Two utilities make this process more manageable:

AutoWeightsLoader abstracts away the routing of weights to child modules. It recursively discovers sub-modules that have their own load_weights methods and delegates the appropriate (name, tensor) pairs to each one, so the top-level model doesn’t need to manually dispatch weights. The upstream shard iteration — walking through model-00001-of-00005.safetensors through model-00005-of-00005.safetensors and presenting a unified (name, tensor) stream — happens in vLLM’s weight loading utilities (weight_utils.py), which feed into AutoWeightsLoader.

WeightsMapper provides declarative renaming rules. Instead of writing string manipulation inside load_weights, you define a mapping:

mapper = WeightsMapper(orig_to_new_prefix={
    "model.decoder.layers.": "model.layers.",
    "norm.weight": "model.norm.weight"
})

The loader applies these rules on the fly, letting the vLLM model structure diverge from the Hugging Face structure while maintaining compatibility with official checkpoints.

For quantized models, weight loading has an additional layer of complexity. In 4-bit quantization schemes like AWQ, eight 4-bit weights are packed into a single int32. The loader must recognize that the destination parameter is quantized and load the packed tensor directly, no casting to float16 first. If the config specifies quantization, the linear layer initializes a specialized “quantized parameter” object that overrides the default loading behavior.
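
The packing arithmetic looks like this in plain Python. The helper names are hypothetical, and the sequential low-nibble-first order is an assumption chosen for readability; real AWQ kernels use their own element ordering, so treat this only as a picture of "eight 4-bit values in 32 bits".

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit values into one 32-bit word, low nibble first."""
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (4 * i)
    return packed

def unpack_int4(packed):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(packed >> (4 * i)) & 0xF for i in range(8)]

weights = [1, 2, 3, 4, 5, 6, 7, 8]
word = pack_int4(weights)
print(hex(word))
print(unpack_int4(word))
```

Casting the packed word to float16 would destroy the nibble layout, which is why the loader has to copy the packed tensor as-is and leave unpacking to the quantized kernel.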

Part 4: The Execution Core

The Attention Switchboard (ForwardContext)

In standard PyTorch, an Attention module is self-contained — it receives Q, K, V and computes the output. In vLLM, the Attention layer acts as a client to a global context. When the model executes a forward pass, a ForwardContext is established containing the AttentionMetadata generated by the scheduler.

The AttentionMetadata object is essentially a page table for the KV cache. Here’s what it contains, with concrete values for a batch of 3 requests:

# AttentionMetadata for a batch with 3 requests:
#   Request A: 128 tokens, KV spread across blocks [4, 17, 23, 8]
#   Request B: 64 tokens, KV in blocks [1, 12]
#   Request C: 256 tokens, KV in blocks [0, 5, 9, 14, 22, 31, 7, 19]

block_tables = [
    [4, 17, 23, 8, 0, 0, 0, 0],   # Request A (padded to max_blocks)
    [1, 12, 0,  0, 0, 0, 0, 0],   # Request B
    [0, 5,  9, 14, 22, 31, 7, 19], # Request C
]
# shape: [3, 8] — each entry is a physical block index in GPU memory

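slot_mapping = [287, 415, 639]
# For decode: maps each new token to its physical slot (block_idx * block_size + offset).
# With block_size = 32, Request A's newest token (position 127) sits in block 8 at offset 31: 8 * 32 + 31 = 287.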

context_lens = [128, 64, 256]
# Sequence length per request, for correct attention masking
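The block-table lookup is the same arithmetic an operating system uses for page tables. Here is a minimal sketch, assuming a block size of 32 tokens (consistent with 128 tokens spread over four blocks); the function name and the example position are hypothetical.

```python
def slot_for_position(block_table, position, block_size):
    """Map a logical token position to its physical KV-cache slot."""
    logical_block = position // block_size   # which entry of the block table
    offset = position % block_size           # position inside that block
    physical_block = block_table[logical_block]
    return physical_block * block_size + offset


# Token 100 of a request whose KV cache lives in blocks [4, 17, 23, 8]
slot = slot_for_position([4, 17, 23, 8], position=100, block_size=32)
```

Position 100 falls in the fourth logical block (physical block 8) at offset 4, so it resolves to slot 8 * 32 + 4 = 260.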

The attention layer’s forward() does not compute softmax(QK^T)V directly. Instead, it dispatches to different backends depending on the phase:

Prefill (processing prompts) → FlashAttention variant, optimized for parallel computation over many tokens. All Q, K, V tokens are known upfront, so we can exploit parallelism across the sequence.
Decode (generating tokens) → PagedAttention kernel. The query is a single new token. The kernel uses block_tables to gather K and V vectors from non-contiguous physical blocks, compute attention scores, and scatter the result. This is the operation that makes virtual memory for the KV cache work: tokens don’t need to be stored contiguously.

The split between prefill and decode backends is important for performance. Prefill is compute-bound (large matrix multiplications), so FlashAttention’s tiling strategy works well. Decode is memory-bound (loading many cached K/V vectors for a single query), so PagedAttention’s gather-based approach is the right fit.

Forward Pass Pipeline

Step-by-step flow of a single forward pass through a vLLM native model, with AllReduce sync points highlighted:

1. Scheduler (control plane) → AttentionMetadata
2. ModelRunner (input prep) → input_ids, positions
3. Embedding Layer (VocabParallelEmbedding) → hidden_states
4. Transformer Block (× N layers)
4a. RMSNorm (fused kernel)
4b. QKV Projection (ColumnParallelLinear)
4c. Attention: Prefill → FlashAttention (compute-bound); Decode → PagedAttention (memory-bound)
4d. Output Projection (RowParallelLinear), followed by AllReduce #1 (attention sync)
4e. MLP (SwiGLU, gate_up → down), followed by AllReduce #2 (MLP sync)
5. Final LayerNorm → LM Head → Output Logits

Distributed Execution Contracts

Supporting 405B-class models means multi-GPU execution across potentially many nodes. This introduces specific contracts that new model implementations must satisfy:

Tensor Parallelism requires precise synchronization. In a standard Transformer block, AllReduce happens exactly twice: after the attention output projection (RowParallelLinear) and after the MLP down-projection (RowParallelLinear). These are the points where partial results from each GPU must be summed. Adding extra synchronizations (say, an unnecessary AllReduce after the QKV projection) doesn’t produce wrong results, but it degrades throughput. Each AllReduce is a blocking collective, so all GPUs wait.

Pipeline Parallelism splits the model vertically by layers. The native model’s forward method must handle an intermediate_tensors argument:

def forward(self, input_ids, positions, attn_metadata, intermediate_tensors=None):
    if intermediate_tensors is not None:
        # We're not the first pipeline stage — skip embedding
        hidden_states = intermediate_tensors["hidden_states"]
        start_layer = self.start_layer  # e.g., layer 16
    else:
        # First pipeline stage — process from the embedding
        hidden_states = self.embed_tokens(input_ids)
        start_layer = 0

    for layer in self.layers[start_layer:self.end_layer]:
        hidden_states = layer(hidden_states, positions, attn_metadata)

    return hidden_states

Rank 0 executes layers 0–N and outputs the hidden state as intermediate_tensors. Rank 1 receives it, skips the embedding layer, and resumes from layer N+1. If a developer forgets to implement this check, the model works fine in single-node TP mode but silently breaks in PP mode: it tries to re-embed already-processed hidden states.
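Where start_layer and end_layer come from is worth a quick sketch. The helper below splits layers as evenly as possible across pipeline ranks. It is an illustrative assumption, not vLLM’s partitioning code, and the name stage_layer_range is hypothetical.

```python
def stage_layer_range(num_layers, pp_size, rank):
    """Return the half-open [start, end) layer range owned by one pipeline rank."""
    base, extra = divmod(num_layers, pp_size)
    # The first `extra` ranks take one additional layer each.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


# An 80-layer model split across 4 pipeline stages
ranges = [stage_layer_range(80, 4, r) for r in range(4)]
```

Every rank other than 0 gets a non-zero start, and those are exactly the ranks that must take hidden_states from intermediate_tensors instead of calling the embedding layer.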

CUDA Graph Compatibility requires static control flow. Dynamic Python branching based on tensor values like if tensor.sum() > 0: breaks CUDA Graph capture because the graph records a fixed execution path. This is particularly relevant for Mixture-of-Experts models where expert routing is inherently data-dependent. The routing logic must use masked tensor operations (scatter, gather with masks) rather than Python if/else, so the computation graph remains static even though different experts activate for different tokens.

Parallelism Type	Contract for Model Developer	Failure Mode if Missed
Tensor Parallelism	Use ColumnParallel/RowParallel layers; exactly 2 AllReduces per block	Extra AllReduces → throughput degradation; wrong layer types → incorrect results
Pipeline Parallelism	Handle `intermediate_tensors` arg; define `start_layer`/`end_layer`	Model re-embeds hidden states → garbage output on later pipeline stages
CUDA Graphs	No Python control flow based on tensor values; use masked ops for routing	Graph capture fails → fallback to eager mode → 2-3x slower decode

Putting It All Together — The Optimization Ladder

To summarize the full lifecycle, model support in vLLM is a progression through increasingly optimized tiers:

Transformers Fallback — works immediately. PagedAttention memory management. No fusion, no graphs, limited parallelism. This is where every new model starts.
Plugin Registration — an external package provides a native model class. pip install and go. Model creators control their own release timeline.
Native Model Class — upstreamed into vLLM. Parallel primitives, 1D flattened computation, CUDA Graph compatible. This is where the performance lives.
Quantization Support — AWQ, GPTQ, FP8 weight loading tested and working. Packed tensor handling, per-layer quantization configs. Unlocks deployment on smaller hardware.
Full Production — Pipeline Parallelism support, custom attention patterns if needed, benchmarked against reference implementations. Ready for large-scale serving.

The plugin system represents the future direction — federated model support where model creators can ship “vLLM-ready” code independently of core releases. Instead of waiting for the vLLM team to implement every new architecture, the ecosystem moves toward model creators owning their integration path.

Closing

The process of introducing a new model into vLLM is a systems engineering exercise. It requires transforming a static model definition (essentially a recipe for matrix multiplications) into a dynamic, distributed execution graph that manages its own memory, shards its own weights, and coordinates across GPUs. The Transformers fallback bridges the gap for immediate access; native integration is where the performance lives.

There are four core contracts a model must satisfy for full integration: registry updates (mapping architecture strings to code), class restructuring (parallel primitives, 1D flattening), weight loading (handling mismatches between checkpoint and runtime structure), and PagedAttention integration (routing attention through the block-table-based memory system). Understanding these four contracts gives you a mental model for reasoning about model support in any inference engine, not just vLLM.

References

Kwon, W. et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. arXiv:2309.06180
vLLM Documentation — Architecture Overview. docs.vllm.ai
vLLM Documentation — Adding a New Model. docs.vllm.ai
vLLM Documentation — Plugin System. docs.vllm.ai
vLLM Source — Model Registry (registry.py). GitHub
vLLM Source — Transformers Backend (transformers/). GitHub
vLLM Documentation — Class Hierarchy. docs.vllm.ai
Gordic, A. “Inside vLLM: Anatomy of a High-Throughput LLM Inference System.” aleksagordic.com
Zalt, M. “The Hidden Switchboard Behind vLLM Attention.” zalt.me
Prerepa, A. “ZvLLM: Zigzag forward pass with vLLM.” adiprerepa.github.io
El Shafie, H. “Paged Attention from First Principles: A View Inside vLLM.” hamzaelshafie.bearblog.dev
vLLM Documentation — Paged Attention Design. docs.vllm.ai

Orchestrating Inference: How Kubernetes, Ray, and vLLM Coordinate Under the Hood

Sun, 18 Jan 2026 12:00:00 +0800

The Deceptively Simple Command

vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8

One line. Eight GPUs. A 70-billion parameter model ready to serve requests. But this hides significant complexity.

Behind this command, three distinct software systems spring into action. Kubernetes allocates pods and manages node resources. Ray spawns actors, creates placement groups, and coordinates distributed execution. vLLM initializes workers, establishes NCCL communication rings, and begins orchestrating the token-by-token dance of autoregressive generation.

The interesting part is the choreography between systems. Each layer operates at a different granularity, speaks a different language, and solves a different class of problem. Kubernetes thinks in pods and nodes. Ray thinks in actors and tasks. vLLM thinks in requests and tokens. Yet when you hit that endpoint with a prompt, all three coordinate to produce a coherent response.

The question worth asking: How do these systems know when to hand off control to each other?

This post traces that coordination. We’ll follow the cascade from kubectl apply to the moment NCCL rings form and tensor data starts flowing. We’ll examine why placement groups matter more than you’d expect, why your network configuration can make or break performance, and how the industry is evolving toward disaggregated architectures that split inference across specialized pools.

If you’ve read the previous deep-dive on the hidden software stack behind inference, this builds on that foundation. We won’t revisit PagedAttention or continuous batching fundamentals. Instead, we’re zooming out to the orchestration layer, the software that transforms a rack of GPUs into something resembling a programmable supercomputer.

Prerequisites: This post assumes familiarity with PagedAttention, continuous batching, and basic Kubernetes concepts. If you’re new to the inference stack, start with The Hidden Software Stack Behind Fast LLM Inference.

The Three-Layer Stack

Control flows down through the layers. Tensor data bypasses the middle entirely.

Kubernetes Layer (coarse-grained): Nodes (machines), Pods (containers), Containers (processes). Hands off to the Ray layer.
Ray Layer (fine-grained): GCS (control store), Raylets (per-node), Actors (workers), Placement groups. Hands off to the vLLM layer.
vLLM Layer (token-level): Scheduler (requests), Workers (GPU exec), NCCL Ring (tensor sync).
NCCL Bypass: Tensor Data.

Three Systems, Three Granularities

This stack works because of its division of labor. Each system operates at a different level of abstraction, handling the problems it’s best suited to solve.

Kubernetes sees the world in pods and nodes. It manages the lifecycle of containers, handles service discovery, and ensures workloads get scheduled onto machines with available resources. Its scheduling decisions happen at the coarse granularity of “does this node have 8 GPUs available?” Kubernetes has no concept of what happens inside those containers once they’re running.

Ray operates one level deeper. It sees actors (long-lived Python objects that can hold state and process messages) and tasks, which are stateless function invocations. Ray’s Global Control Store (GCS) maintains a distributed view of cluster resources, and its Raylets (one per node) handle local scheduling and object management. Ray also understands placement constraints: it can ensure that a group of actors lands on the same physical node, or spreads across nodes in a specific pattern.

vLLM cares about requests and tokens. It manages the KV cache, schedules which requests get processed in each iteration, and coordinates the actual tensor operations across GPU workers. vLLM’s scheduler operates at millisecond granularity, making decisions every inference step about which tokens to generate next.

Kubernetes has no understanding of GPU topology. It can count GPUs, but it cannot distinguish between eight GPUs connected via NVLink at 900 GB/s and eight GPUs scattered across nodes connected via Ethernet at 10 GB/s. Without additional tooling, Kubernetes might schedule your tensor-parallel workload across two nodes, a configuration with 40-90x less inter-GPU bandwidth than necessary.

Concern	Kubernetes	Ray	vLLM
Granularity	Pods/Nodes	Actors/Tasks	Requests/Tokens
GPU handling	Counts only	Placement constraints	CUDA assignment
State management	Stateless orchestration	Actor state in GCS	KV cache
Restart handling	Pod restarts	Actor recovery	Request retry

This is where KubeRay enters the picture. KubeRay is a Kubernetes operator that bridges the gap between Kubernetes’ pod-centric worldview and Ray’s actor-centric model. It introduces three Custom Resource Definitions (CRDs):

RayCluster is the foundation. It defines head and worker node configurations, resource requirements, and cluster topology. Use this when you need a persistent Ray cluster for interactive development or long-running services.
RayService builds on RayCluster to add Ray Serve deployments. It handles zero-downtime upgrades, health checking, and automatic recovery. This is the production choice for serving workloads.
RayJob handles batch workloads. It spins up a cluster, runs a job, then tears everything down. Useful for fine-tuning runs or batch inference over large datasets.

The operator watches these CRDs and reconciles cluster state: creating pods, configuring networking, managing the Ray head node’s GCS, and ensuring workers connect properly. It’s the translation layer that lets Kubernetes manage Ray clusters without understanding Ray’s internal semantics.

The Reconciliation Dance

When you kubectl apply a RayService manifest, you trigger a cascade that touches every layer of the stack. Understanding this sequence reveals how control flows through the system.

Phase 1: KubeRay Operator Activation

The KubeRay operator runs as a deployment in your cluster, watching for changes to Ray CRDs. When it detects your new RayService, its reconciliation loop activates. The operator compares desired state (your manifest) against actual state (what’s running) and generates a plan to converge them.

Phase 2: Head Node Creation

First, the operator creates the Ray head pod. This pod runs the Global Control Store (GCS) on port 6379, a distributed metadata store that tracks cluster membership, resource availability, and actor locations. The head also exposes the Ray Dashboard on port 8265 for observability.

The head pod needs to be running and healthy before workers can join. KubeRay handles this sequencing automatically, using Kubernetes’ built-in readiness probes to gate worker creation.

Phase 3: Worker Pod Launch

Once the head is ready, the operator creates worker pods. Each worker’s entrypoint executes ray start --address=<head-address>:6379, connecting to the head’s GCS. This is where the Kubernetes and Ray worlds first touch: Kubernetes schedules the pod, but Ray handles what happens inside.

Phase 4: Resource Discovery

Inside each worker pod, the Raylet process inspects its environment. It discovers available GPUs through CUDA, determines memory capacity, and inventories other resources. This information flows back to the GCS, which maintains a global resource table.

Phase 5: Cluster Ready

When all workers have connected and advertised their resources, the Ray cluster is ready. The GCS now has a complete picture: which nodes exist, what resources each has, and how to reach them. Ray Serve can start accepting deployment requests.

When vLLM initializes with --tensor-parallel-size 8, it needs to transform this general-purpose Ray cluster into a coordinated inference machine.

vLLM Initialization Sequence:

Cluster Connection: vLLM’s RayGPUExecutor calls initialize_ray_cluster(), connecting to the existing Ray cluster or starting a new one.
Placement Group Creation: vLLM creates a placement group with the specification [{"GPU": 1}] * 8, which means eight bundles, each requiring one GPU. The placement strategy is STRICT_PACK, meaning all bundles must land on a single node.
GCS Scheduling: The GCS consults its resource table. Can any single node satisfy eight GPU bundles? If yes, it reserves those resources atomically. If no, the placement group creation fails. Better to fail fast than scatter actors across nodes.
Actor Spawning: vLLM spawns RayWorkerWrapper actors inside the placement group. Each actor gets assigned to a specific bundle, guaranteeing GPU affinity. Ray sets CUDA_VISIBLE_DEVICES appropriately so each worker sees only its assigned GPU.
Process Group Initialization: Each worker calls torch.distributed.init_process_group(backend='nccl'). This creates the NCCL communicator that will handle all tensor data movement.
NCCL Ring Formation: NCCL establishes its communication topology (typically ring or tree patterns optimized for the underlying hardware). From this point forward, tensor data flows through NCCL, completely bypassing Ray’s object store.

Here’s how the handoff works: Ray’s job is setup and supervision. Once the NCCL rings form, Ray steps aside for the performance-critical path. Tensor data never touches Ray’s object store. It flows directly between GPUs over NVLink or the network fabric. Ray remains involved for health monitoring, actor lifecycle management, and metrics collection, but it’s out of the hot path.

Why STRICT_PACK Changes Everything

Placement groups are Ray’s mechanism for expressing scheduling constraints that go beyond “find me a node with resources.” For distributed inference, they determine whether your system performs at full speed or crawls.

Consider what happens without placement constraints. You request 8 GPU actors. Ray’s default scheduler might place 4 on Node A and 4 on Node B. Both nodes have available GPUs, the request is satisfied, everyone’s happy. Except they’re not.

The Disaster Scenario:

With tensor parallelism, every transformer layer requires two AllReduce operations to synchronize partial results across all GPUs: one after attention and one after the MLP. For a Llama-70B with 80 layers, that’s 160 AllReduce calls per forward pass. Every GPU in the group takes part in each one.

When all 8 GPUs are on one node connected via NVLink:

Bandwidth: ~900 GB/s bidirectional
AllReduce latency: microseconds

When 4 GPUs are on Node A and 4 on Node B, connected via datacenter Ethernet:

Bandwidth: ~10-25 GB/s (a single 100GbE link tops out at 12.5 GB/s)
AllReduce latency: milliseconds

The performance difference is stark. You’re looking at a 40-90x bandwidth reduction for every AllReduce. For interactive inference where you need responses in hundreds of milliseconds, this makes the system unusable. A 50ms operation becomes a 2-second operation.
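The arithmetic behind those numbers is short enough to write down. This sketch uses only the figures quoted above; the constant and function names are hypothetical, and it assumes the step in question is purely bandwidth-bound.

```python
NVLINK_GB_PER_S = 900.0             # NVLink figure quoted above
ETHERNET_GB_PER_S = (10.0, 25.0)    # low and high end of the Ethernet range

# How many times less bandwidth the cross-node link offers
slowdowns = [NVLINK_GB_PER_S / bw for bw in ETHERNET_GB_PER_S]


def scaled_latency_ms(base_ms, slowdown):
    """Latency of a bandwidth-bound step after slowing down by the given factor."""
    return base_ms * slowdown


at_40x_ms = scaled_latency_ms(50.0, 40.0)
```

The ratios come out to 90x and 36x, which is where the rough 40-90x range comes from, and a 50 ms step at 40x lands at 2000 ms.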

STRICT_PACK to the Rescue:

The STRICT_PACK placement strategy provides an atomic guarantee: “Reserve all N bundles on a single node. If no single node can satisfy the request, schedule none of them.”

import ray

# Conceptual placement group specification: eight 1-GPU bundles on a single node
placement_group = ray.util.placement_group(
    bundles=[{"GPU": 1}] * 8,
    strategy="STRICT_PACK"
)

This is all-or-nothing. Either you get all 8 GPUs on one node with NVLink connectivity, or you get an error telling you no suitable node exists. No silent degradation to a broken configuration.
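A toy model makes the all-or-nothing behavior easy to see. This is not Ray’s scheduler, just a sketch of the decision rule; the function name strict_pack and the node names are assumptions.

```python
def strict_pack(free_gpus_per_node, bundles_needed):
    """Return the one node that can hold every bundle, or None if none can."""
    for node, free in free_gpus_per_node.items():
        if free >= bundles_needed:
            return node
    return None  # all-or-nothing: never split the group across nodes


cluster = {"node-a": 4, "node-b": 4, "node-c": 8}
placed = strict_pack(cluster, 8)                        # only node-c has 8 free GPUs
unplaced = strict_pack({"node-a": 4, "node-b": 4}, 8)   # 8 GPUs in total, but split
```

The second cluster has eight free GPUs in aggregate, yet the request is refused. That refusal is the feature: it prevents the 4+4 split described above.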

SPREAD for Pipeline Parallelism:

Not all parallelism strategies want STRICT_PACK. Pipeline parallelism deliberately spans multiple nodes, with each node handling different layers of the model. Here, SPREAD makes sense: you want actors distributed across nodes to maximize aggregate memory capacity.

The communication pattern differs too. Pipeline parallelism uses point-to-point sends between adjacent stages, not AllReduce across all participants. This is less latency-sensitive because you’re overlapping computation with communication: while stage N processes micro-batch B, stage N-1 can send micro-batch C.

Strategy	Use Case	Communication	When to Use
STRICT_PACK	Tensor Parallelism	AllReduce (collective across all GPUs)	Same-node NVLink required
SPREAD	Pipeline Parallelism	Point-to-point	Memory > latency
PACK	Mixed workloads	Varies	Prefer colocation, allow spread

The placement group abstraction is what lets vLLM express “I need these actors to be co-located” without knowing anything about Kubernetes node topology. Ray’s GCS has that knowledge (from Raylet resource advertisements), and the placement group mechanism lets vLLM leverage it declaratively.

Two Interfaces, Two Purposes

Even with correct placement, there’s another way to tank your inference performance: letting NCCL traffic flow over the wrong network interface.

Production GPU nodes typically have multiple network interfaces:

eth0: The standard Kubernetes pod network. Usually an overlay network (Calico, Cilium, Flannel) that provides cluster connectivity, DNS, and service discovery. Fine for control plane traffic: health probes, metrics scraping, Ray GCS heartbeats.
net1/ib0/bond0: A high-performance interface connected to InfiniBand or RoCE fabric. This is your data plane, purpose-built for moving large tensors between nodes at 100-400 Gb/s with microsecond latencies.

The problem: NCCL doesn’t automatically know which interface to use. By default, it may discover eth0 first and decide that’s the interface for collective operations. Your carefully provisioned InfiniBand fabric sits idle while tensor data crawls through the overlay network.

The key environment variable:

NCCL_SOCKET_IFNAME=net1

This tells NCCL explicitly which interface to use for socket-based communication. For InfiniBand with RDMA, you’d also set:

NCCL_IB_HCA=mlx5_0

In Kubernetes, you expose multiple interfaces to pods using Multus CNI, a meta-plugin that lets you attach additional networks beyond the default pod network. Your pod spec includes annotations requesting attachment to the high-speed network:

annotations:
  k8s.v1.cni.cncf.io/networks: high-speed-net

The result is a pod with two interfaces: eth0 for Kubernetes integration, and net1 for NCCL traffic. Control plane and data plane are cleanly separated.

Why This Matters for Multi-Node:

For single-node tensor parallelism, NVLink handles everything and network configuration is less critical. But the moment you scale beyond one node, whether for pipeline parallelism, larger tensor-parallel groups, or disaggregated serving, network configuration becomes essential.

A properly configured InfiniBand fabric can deliver 400 Gb/s (50 GB/s) per port with single-digit microsecond latencies. The Kubernetes overlay network, even with modern CNIs, typically maxes out at 10-25 Gb/s with millisecond-scale latencies. For operations that happen 160 times per forward pass, this difference compounds dramatically.

Choosing Your Communication Pattern

We’ve seen how network configuration can make or break performance. How much it matters depends on which parallelism strategy you’re using: parallelism isn’t one-size-fits-all, and each strategy creates a fundamentally different communication pattern. Understanding these patterns reveals why orchestration decisions matter.

Tensor Parallelism: The AllReduce Pattern

Tensor parallelism shards weight matrices across GPUs within a layer. Each GPU computes a partial result, then all GPUs synchronize via AllReduce to combine their contributions.

Ray’s responsibilities:

Create STRICT_PACK placement group
Spawn workers with correct GPU assignments
Set CUDA_VISIBLE_DEVICES per worker
Monitor actor health, restart on failure

What Ray doesn’t do:

Manage AllReduce operations (that’s NCCL)
Move tensor data (that flows through NVLink/NCCL)
Carry tensors through the object store (it is bypassed entirely on the hot path)

The Communication Reality:

For an 80-layer model, tensor parallelism requires 160 AllReduce operations per forward pass (2 per layer: one after attention, one after the FFN). Each AllReduce synchronizes tensors sized [batch_size, seq_len, hidden_dim]. With Llama-70B’s hidden dimension of 8192 and a batch of 32 sequences at 2048 tokens, that is 32 × 2048 × 8192 elements, or ~1 GB per AllReduce at 2 bytes per element (fp16/bf16).

AllReduce has ring and tree implementations. Ring AllReduce on 8 GPUs runs a reduce-scatter phase followed by an all-gather phase, and in each phase every GPU sends and receives 7/8 of the data, so each GPU moves about 1.75 full tensors per operation. The only way this is fast is with NVLink’s 900 GB/s bandwidth.
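Both numbers can be checked with a few lines of arithmetic. The sketch below assumes 2 bytes per element (fp16 or bf16) and the standard ring algorithm, in which the reduce-scatter and all-gather phases each move (N-1)/N of the payload per GPU. The function names are hypothetical.

```python
def allreduce_payload_bytes(batch, seq_len, hidden_dim, bytes_per_elem=2):
    """Size of one [batch, seq_len, hidden_dim] activation tensor."""
    return batch * seq_len * hidden_dim * bytes_per_elem


def ring_bytes_sent_per_gpu(payload_bytes, num_gpus):
    """Bytes each GPU sends in one ring AllReduce (reduce-scatter + all-gather)."""
    per_phase = payload_bytes * (num_gpus - 1) / num_gpus
    return 2 * per_phase


payload = allreduce_payload_bytes(32, 2048, 8192)   # 2**30 bytes = 1 GiB
sent = ring_bytes_sent_per_gpu(payload, 8)          # 1.75 GiB per GPU
```

Multiply that per-GPU traffic by 160 AllReduce calls per forward pass and the case for keeping the whole group on NVLink makes itself.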

Pipeline Parallelism: The Point-to-Point Pattern

Pipeline parallelism assigns different layers to different GPUs (or groups of GPUs). Data flows through stages sequentially: Stage 0 processes the input, sends activations to Stage 1, which processes and sends to Stage 2, and so on.

Orchestration Differences:

Ray creates a placement group that may span nodes (SPREAD rather than STRICT_PACK). Each stage gets its own bundle, and stages communicate via point-to-point sends rather than collective operations.

The Bubble Problem:

Pure pipeline parallelism has a fundamental inefficiency. While Stage 0 processes micro-batch 1, Stages 1-7 sit idle. While Stage 7 processes micro-batch 1, Stages 0-6 are idle unless later micro-batches are already in flight.

The bubble ratio quantifies this waste:

$$\text{bubble ratio} = \frac{p - 1}{m + p - 1}$$

Where $p$ is the number of pipeline stages and $m$ is the number of micro-batches. With 8 stages and 8 micro-batches, you lose 7/15 ≈ 47% of potential throughput to bubbles.

Think of it this way: p is how many slices you cut the model into, m is how many requests you’re processing in parallel. With 8 pipeline stages but only 1 micro-batch, 7 out of 8 stages are always waiting, meaning 87.5% of compute wasted to bubbles. With 64 micro-batches, that drops to ~10%. The lesson: pipeline parallelism only pays off with large batches.
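The formula is easy to play with directly. A minimal sketch that reproduces the three cases above (the function name is an assumption):

```python
def bubble_ratio(p, m):
    """Fraction of pipeline time lost to idle stages.

    p: number of pipeline stages, m: number of micro-batches.
    """
    return (p - 1) / (m + p - 1)


for m in (1, 8, 64):
    print(f"p=8, m={m}: {bubble_ratio(8, m):.1%}")
```

Note how slowly the ratio falls: going from 8 to 64 micro-batches, an 8x increase, only brings the bubble from roughly 47% down to about 10%.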

Continuous batching helps by keeping the pipeline fed with new requests, but the fundamental tradeoff remains: pipeline parallelism trades AllReduce bandwidth requirements for pipeline bubbles.

The Pipeline Bubble Problem

Stages sit idle during startup and drain phases. More micro-batches reduce bubble overhead.

Example: with Pipeline Stages (p) = 4 and Micro-batches (m) = 4, Bubble Ratio = (p - 1) / (m + p - 1) = 3/7 ≈ 43%.

Expert Parallelism: The AllToAll Pattern

Mixture of Experts (MoE) models introduce a third communication pattern. Instead of every GPU needing data from every other GPU (AllReduce), or sequential point-to-point (pipeline), MoE requires AllToAll: each GPU sends different data to different destinations based on which expert each token routes to.

The orchestration complexity increases significantly. Unlike tensor parallelism’s predictable AllReduce or pipeline’s sequential handoffs, MoE communication is dynamic: a router network decides which tokens go to which experts at runtime, so the communication pattern changes every batch. Some experts are hot, receiving hundreds of tokens, while others are cold and get none.

This dynamic routing breaks placement assumptions. You can’t pre-plan which GPUs need to talk to which. Solutions like expert replication (placing hot experts on multiple GPUs) and capacity factors (limiting tokens per expert) add orchestration complexity. MoE deserves its own treatment, but the key insight here is: AllToAll with dynamic routing is fundamentally harder to orchestrate than static patterns.
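To see what a capacity factor does, consider this small sketch. It assumes the common convention that per-expert capacity is capacity_factor × tokens / num_experts, rounded up; the function name and the sample router output are hypothetical.

```python
import math
from collections import Counter


def expert_loads(assignments, num_experts, capacity_factor):
    """Count tokens per expert and how many exceed each expert's capacity."""
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    counts = Counter(assignments)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    overflow = [max(0, load - capacity) for load in loads]
    return capacity, loads, overflow


# Hypothetical router output: the expert index chosen for each of 8 tokens
assignments = [0, 0, 0, 0, 0, 1, 2, 0]
capacity, loads, overflow = expert_loads(assignments, num_experts=4, capacity_factor=1.5)
```

Expert 0 is hot: it is assigned six tokens against a capacity of three, so three tokens overflow while expert 3 receives nothing. What happens to the overflow (dropping, rerouting, or replicating the expert) is exactly the orchestration decision described above.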

Prefill and Decode Don’t Have to Live Together

The inference phases we’ve discussed (prefill and decode) have fundamentally different computational profiles:

Prefill: Process the entire prompt in parallel. Compute-bound. Benefits from high FLOPS.
Decode: Generate tokens one at a time. Memory-bound. Benefits from high memory bandwidth.

For most of LLM inference history, both phases ran on the same hardware. But there’s no law of physics requiring this. Disaggregated serving splits them apart.

The Architecture:

Router receives incoming request, examines the prompt
Prefill Pool (optimized for compute: H100s with maximum FLOPS) processes the prompt, generates initial KV cache
KV Transfer moves the KV cache to the decode pool
Decode Pool (optimized for memory bandwidth: could be A100s or L40S) generates tokens autoregressively
Response streams back to client

Why Bother?

Different hardware, different economics. Prefill can run on fewer, more powerful GPUs because it’s compute-bound. You’re not paying for memory bandwidth you don’t use. Decode can run on more, cheaper GPUs optimized for memory bandwidth.

The pools also scale independently. A sudden spike in long prompts? Scale up prefill. Many concurrent users generating responses? Scale up decode. The tight coupling of traditional serving forces you to scale both together.

The KV Transfer Challenge:

The catch is moving the KV cache. For Llama-70B with 128K context, the KV cache can reach 40+ GB per request. Moving that between pools is non-trivial.
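The 40 GB figure falls out of the model shape. The sketch below assumes the Llama-3.1-70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and fp16 storage; the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of KV cache for one sequence."""
    # The leading factor of 2 covers the separate K and V tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=128 * 1024)
size_gib = size / 2**30
```

That is 320 KiB per token, or 40 GiB for a full 128K-token context. Without grouped-query attention (64 KV heads instead of 8) the same request would need eight times as much.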

Two approaches are emerging:

NIXL (NVIDIA Inference Xfer Library): GPU-to-GPU RDMA transfers over InfiniBand/RoCE. Keeps data on GPU memory throughout, avoiding PCIe bottlenecks.
LMCache / Shared Storage: Write KV cache to a fast shared storage layer (think distributed NVMe or GPU memory pooling). This enables “context caching”: compute popular prompts once, reuse across millions of requests.

Context caching is particularly powerful for system prompts. If every request to your coding assistant starts with the same 8K token system prompt, why recompute that KV cache for every request? Compute it once, cache it, and let decode instances reuse it.

The Request Knows Where to Go

Traditional load balancing treats all requests as fungible. Round-robin, least-connections, random: they all assume any backend can handle any request equally well. For LLM inference with caching, this assumption is expensive.

If Request A and Request B share a common prefix (same system prompt, same few-shot examples), and Request A already warmed the KV cache on Pod 1, sending Request B to Pod 2 wastes the cache hit opportunity. You’ll recompute the shared prefix unnecessarily.

Prefix-Aware Routing:

Ray Serve implements prefix-aware routing using a prefix tree of cached prefixes. The router maintains a lightweight index of which prefixes are cached on which replicas. When a request arrives, it hashes the prefix, looks up which replica(s) have it cached, and routes accordingly.

This transforms routing from “who’s least busy?” to “who already has my context?”
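Here is a toy version of that idea. It is not Ray Serve’s implementation: it swaps the prefix tree for a plain dictionary keyed by token tuples, and the class name, replica names, and tokens are all assumptions for illustration.

```python
class PrefixRouter:
    """Toy router: longest cached token prefix wins, else least-loaded replica."""

    def __init__(self, replicas):
        self.load = {r: 0 for r in replicas}
        self.cached = {}  # tuple of prefix tokens -> replica holding that KV cache

    def route(self, tokens):
        # Try the longest prefix first so the biggest cache hit is preferred.
        for end in range(len(tokens), 0, -1):
            replica = self.cached.get(tuple(tokens[:end]))
            if replica is not None:
                break
        else:
            replica = min(self.load, key=self.load.get)
        # Record every prefix of this request as cached on the chosen replica.
        for end in range(1, len(tokens) + 1):
            self.cached.setdefault(tuple(tokens[:end]), replica)
        self.load[replica] += 1
        return replica


router = PrefixRouter(["pod-1", "pod-2"])
first = router.route(["sys", "prompt", "question-a"])
second = router.route(["sys", "prompt", "question-b"])   # shares the system prompt
third = router.route(["other", "prompt"])                # no shared prefix
```

The second request follows the first to pod-1 because they share a prefix, while the unrelated third request falls back to the least-loaded replica. A production router would use a real prefix tree and evict entries as replicas evict their caches.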

Gateway API EPP:

The Kubernetes ecosystem is developing similar capabilities at the network layer through Gateway API’s Endpoint Picker (EPP) extension. Routing decisions happen in the ingress controller rather than in application code (Ray Serve).

The ingress controller can hash request properties (prompt prefix, user ID, session token) and consistently route matching requests to the same backend. This works without modifying the serving framework, using pure infrastructure-level routing.

The Tradeoff:

Locality-aware routing can cause load imbalance. If one prefix is extremely popular, its designated replica gets hammered while others sit idle. Production systems need to balance cache locality against load distribution, often through techniques like bounded load consistent hashing or spillover policies.

The evolution is clear: routing is becoming inference-aware. The network layer increasingly understands the semantics of the requests it carries, making decisions that would previously require application-level logic.

The Programmable Supercomputer

Step back and consider what this stack achieves. You start with a collection of independent machines, each with its own GPUs, memory, and network interfaces. Through layers of orchestration (Kubernetes managing containers, Ray managing actors, vLLM managing inference) these resources transform into something that behaves like a single, coherent system.

A prompt enters and gets routed to the right place based on cached state. Compute spreads across GPUs that might span multiple machines, synchronized through NCCL collectives that operate faster than the software can observe. Memory fragments across PagedAttention blocks, invisible to the model but critical for efficiency. The response streams back, one token at a time, while the system is already processing the next request.

The orchestration is the product. Without it, you have expensive hardware sitting idle. With it, you have an inference machine that can serve thousands of concurrent users at interactive latencies.

What’s Emerging:

The boundaries between these layers continue to blur. Systems like DistServe push disaggregation further, with prefill and decode pools that scale independently. KV cache transfer technologies (NIXL, LMCache) treat GPU memory across machines as a single addressable space. The trend is toward tighter integration between orchestration and execution, with systems that make placement decisions not just at startup, but continuously during inference.

Key Metrics to Watch:

If you’re operating these systems, the metrics that matter span all three layers:

Kubernetes: Pod scheduling latency, node resource utilization, network policy drops
Ray: Placement group creation time, actor restart rate, GCS latency (ray_gcs_* metrics)
vLLM: vllm:gpu_cache_usage_perc (memory pressure), vllm:num_requests_waiting (queuing), time-to-first-token (prefill latency), inter-token-latency (decode performance)

The system is only as good as its weakest link. A Kubernetes scheduling delay adds latency to every request until the pod is running. A misconfigured NCCL interface tanks throughput. A hot expert without proper load balancing creates tail latencies.

Understanding the choreography (knowing which system is responsible for what, where the handoffs occur, what can go wrong at each boundary) is what separates operators who can debug production issues from those who cannot.

The stack is complex because the problem is complex. Distributed inference across dozens of GPUs, serving thousands of users, with sub-second latency requirements. But the complexity is structured. Each layer has clear responsibilities and well-defined interfaces. Master those interfaces, understand the handoffs, and the system becomes comprehensible.

Eight GPUs thinking as one. Three software systems coordinating invisibly. One simple command that hides a universe of orchestration.

That’s the stack. Now you know what’s underneath.

References

KubeRay Operator: ray-project/kuberay — Kubernetes operator for Ray
Ray Placement Groups: Ray docs
vLLM Distributed Inference: vLLM docs
NCCL AllReduce Algorithms: NVIDIA NCCL docs
DistServe (Disaggregated Serving): Zhong et al., 2024
Multus CNI: k8snetworkplumbingwg/multus-cni

The Hidden Software Stack Behind Fast LLM Inference

Sat, 10 Jan 2026 12:00:00 +0800

The Iceberg Problem

If you’ve followed LLM infrastructure over the past two years, you’ve probably heard the greatest hits: PagedAttention eliminates memory fragmentation, continuous batching keeps GPUs busy, and FlashAttention cuts memory from O(N²) to O(N). These optimizations are real and important. They are not the full story.

Below the waterline sits a stack of specialized libraries that most engineers never encounter directly. CUTLASS generates the fused kernels that make quantization practical. Triton lets researchers write GPU code without drowning in thread indexing. FlashInfer handles the messy reality of serving workloads that FlashAttention wasn’t designed for. And NCCL quietly orchestrates communication when models span multiple GPUs.

This post dives into that hidden layer. We’ll trace the path from silicon to scheduler, examining the libraries that transform NVIDIA’s hardware capabilities into the fast inference you actually experience. If you’re deploying LLMs at scale, or simply curious about what happens beneath vLLM’s Python API, this is the stack worth understanding.

Hardware Contract

Every optimization in this stack exists because of a single physical constraint: the memory wall. Modern GPUs have a dramatic imbalance between compute capability and memory bandwidth.

Consider the H100. Its Tensor Cores can deliver roughly 2,000 TFLOPS of FP8 compute. Its HBM3 memory provides 3.35 TB/s of bandwidth. Simple division gives us a “ridge point” of about 600 ops/byte—if your workload performs fewer than 600 operations per byte loaded from memory, you’re memory-bound. Your expensive Tensor Cores sit idle, waiting for data.

LLM inference during the decode phase operates at roughly 0.5-1 ops/byte. For every token generated, the model loads billions of weight parameters, multiplies them by a single vector, and discards the weights. It’s not even close to compute-bound. This is why a $30,000 GPU often achieves single-digit percentage utilization during autoregressive generation.
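The ridge-point arithmetic can be written down as a tiny roofline model. This is a sketch using the round numbers quoted above; the function name and the sample intensities are chosen for illustration only.

```python
# Roofline sketch using the H100 figures quoted in the text.
peak_flops = 2000e12       # ~2,000 TFLOPS of FP8 compute, in ops/s
hbm_bandwidth = 3.35e12    # 3.35 TB/s, in bytes/s

# Ops per byte needed before compute, not memory, becomes the limit.
ridge_point = peak_flops / hbm_bandwidth


def attainable_flops(intensity):
    """Upper bound on throughput for a workload with the given ops/byte."""
    return min(peak_flops, intensity * hbm_bandwidth)


print(f"ridge point: {ridge_point:.0f} ops/byte")
for intensity in (0.5, 1.0, 64.0, 1000.0):
    share = attainable_flops(intensity) / peak_flops
    print(f"{intensity:>7} ops/byte -> at most {share:.2%} of peak compute")
```

Anything left of the ridge point is limited by the `intensity * hbm_bandwidth` term, which is why raising arithmetic intensity (batching, fusion, caching) matters more than raw FLOPS for decode.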

To understand why, it helps to see what we’re working with.

Figure: NVIDIA H100 GPU architecture, the hardware that software must optimize for. The die holds 132 Streaming Multiprocessors (SMs), each with 128 CUDA cores and 4 Tensor Cores, backed by a 50 MB L2 cache and 80 GB of HBM3.

Streaming Multiprocessors are the parallel processing units where computation happens. Each SM is an independent processor with its own registers, shared memory, and access to Tensor Cores for matrix operations.

The memory hierarchy offers a path forward:

Level	Capacity	Bandwidth
HBM (Global Memory)	80 GB	3.35 TB/s
L2 Cache	50 MB	~12 TB/s
SRAM (Shared Memory)	228 KB/SM	~19 TB/s
Register File	256 KB/SM	Highest

The software stack’s job is to maximize data reuse in faster memory levels and minimize trips to slow HBM.

Figure: GPU Memory Hierarchy: The Bandwidth Wall. Data flows through progressively faster, smaller caches to reach compute.

HBM3 (High Bandwidth Memory): 80 GB, 3.35 TB/s. "The Warehouse": large but far away. Holds model weights, KV cache, and activations.
L2 Cache: 50 MB, ~12 TB/s (3.6× faster than HBM). Shared across all SMs, the first line of defense. Holds hot KV entries and recent weights.
Shared Memory (SRAM): 228 KB/SM, ~19 TB/s (5.7× faster than HBM). On-chip scratchpad, FlashAttention's secret weapon. Holds attention tiles and CUTLASS staging buffers.
Registers: direct compute access. Hold active values and accumulators.
Tensor Cores: ~2000 TFLOPS (FP8).

Hopper Accelerators

TMA (Tensor Memory Accelerator): Offloads address calculation to hardware. Software describes tensor shape; TMA handles async loads.

WGMMA: Direct SRAM → Tensor Core path. Bypasses registers, enabling larger tiles and deeper pipelines.

The Memory Wall

LLM decode runs at 0.5-1 ops/byte (memory-bound). The roughly 6× bandwidth gap between HBM and SRAM is why FlashAttention exists: keeping data in fast SRAM avoids the bottleneck.

CUTLASS: Template Metaprogramming Foundation

When you call a matrix multiplication in PyTorch, it eventually reaches cuBLAS—NVIDIA’s battle-tested linear algebra library. cuBLAS is fast, but it’s a black box. You get the GEMM you’re given.

For LLM inference, that’s often not enough. Consider what happens when you want to run an INT4 quantized model. The weights are stored as packed 4-bit integers. Before the Tensor Cores can process them, you need to:

Load 128-bit vectors containing packed INT4 weights
Unpack the 32-bit integers into eight 4-bit values
Convert to FP16
Apply quantization scales
Feed the result to the Tensor Core

If each step is a separate kernel, you’re writing intermediate results to HBM between operations—exactly the memory traffic you’re trying to avoid. What you need is a single fused kernel that does everything in registers.
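To make the unpack-convert-scale steps concrete, here is a NumPy reference for the arithmetic only. It is not a CUTLASS kernel and says nothing about how the fusion is done on the GPU. The quantization scheme (unsigned nibbles with a zero point of 8 and a single per-tensor scale) and both function names are assumptions for the example.

```python
import numpy as np

SHIFTS = np.arange(8, dtype=np.uint32) * 4   # bit offset of each nibble in a 32-bit word


def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) into uint32 words, eight per word."""
    values = np.asarray(values, dtype=np.uint32).reshape(-1, 8)
    return np.bitwise_or.reduce(values << SHIFTS, axis=1)


def unpack_dequantize(packed, scale, zero_point=8):
    """Unpack eight nibbles per word, recentre them, convert to FP16, apply the scale."""
    nibbles = (packed[:, None] >> SHIFTS) & 0xF           # shape [n_words, 8]
    centred = nibbles.astype(np.int32) - zero_point        # signed range -8..7
    return (centred.astype(np.float16) * np.float16(scale)).reshape(-1)


quantized = np.arange(16)                  # sixteen 4-bit weights
packed = pack_int4(quantized)              # two 32-bit words
weights = unpack_dequantize(packed, scale=0.5)
print(packed.shape, weights.dtype, weights[:4])
```

In a fused kernel the `nibbles`, `centred`, and FP16 intermediates would live in registers rather than being written out as separate arrays, which is the whole point of the fusion.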

This is what CUTLASS enables. It’s NVIDIA’s header-only C++ template library for linear algebra, and it’s the foundation beneath vLLM’s quantization kernels, FlashAttention-3, and most high-performance transformer implementations.

When cuBLAS Won’t Cut It

Use CUTLASS when you need:

Custom fusions: Bias + activation + quantization in one kernel
Specific precision combinations: FP8 weights with FP16 accumulation
Binary size constraints: cuBLAS ships megabytes of kernels for all cases

The trade-off is complexity. CUTLASS kernels require understanding GPU architecture at a level most ML engineers never encounter. But for the performance-critical paths in inference—attention, FFN, quantized projections—that complexity pays dividends.

Triton: GPU Programming Without the Pain

CUTLASS offers maximum control, but its learning curve is steep. Writing CUDA C++ means managing thread indices, avoiding bank conflicts, ensuring coalesced memory access, and reasoning about warp-level synchronization. A single misplaced __syncthreads() can introduce subtle bugs. A suboptimal memory access pattern can halve performance.

Triton takes a different approach. Developed by OpenAI and now integral to PyTorch 2.0, it raises the abstraction level from threads to blocks.

The Mental Model Shift

Traditional CUDA asks: “I am thread 47. What should I do?”

Triton asks: “I am processing this block of data. What operations should happen?”

Consider loading data from memory. In CUDA, you calculate addresses, handle boundary conditions, and coordinate across threads for coalescing. In Triton:

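import triton
import triton.language as tl

@triton.jit
def kernel(x_ptr, output_ptr, N, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)                                  # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)   # element indices for the block
    mask = offsets < N                                      # guard the final, partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    result = x * 2.0                                        # placeholder elementwise op
    tl.store(output_ptr + offsets, result, mask=mask)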

The tl.load call operates on a whole block of addresses at once, and the Triton compiler takes care of coalescing and vectorization for it. There is no manual thread indexing or bank conflict avoidance.

The PyTorch Connection

When you call torch.compile() on a model, TorchInductor generates Triton kernels for GPU execution. The fusion engine identifies sequences of pointwise operations (add, multiply, activation) that can be combined into single kernels. Instead of three separate kernels with intermediate HBM writes, you get one kernel that loads data once, performs all operations in registers, and stores once.

A fused LayerNorm + Linear that would require 500+ lines of optimized CUDA takes about 50 lines of Triton. The resulting kernel won’t match a hand-tuned CUTLASS implementation, but it’ll be close, and it takes hours to write instead of weeks.

FlashInfer: Built for Serving

FlashAttention changed attention computation by recognizing that the bottleneck was memory I/O, not FLOPs. By computing attention tile-by-tile in SRAM and never materializing the N×N attention matrix in HBM, it reduced attention’s memory footprint from O(N²) to O(N) in sequence length and sharply cut HBM traffic. This brought longer context lengths and faster training.

But FlashAttention was designed for training workloads with regular, rectangular batches. Production serving is messier.

The Serving Reality

In a real serving deployment:

Requests arrive with different context lengths (no neat rectangular batches)
The KV cache uses PagedAttention with non-contiguous memory blocks
Multiple requests share common prefixes (system prompts, document context)
CUDA graphs need static shapes, but batch composition changes every iteration

FlashAttention handles none of this natively. FlashInfer does.

What FlashInfer Adds

Block-sparse KV cache support: FlashInfer kernels operate on PagedAttention’s block-sparse representation directly. Page tables map logical token indices to physical memory blocks, and FlashInfer traverses them efficiently without requiring contiguous memory.

Ragged tensor layouts: Standard kernels assume rectangular batches, padding shorter sequences to match the longest. FlashInfer operates on “ragged” layouts where sequences are packed tightly. No wasted compute on padding tokens.

Plan/run separation: FlashInfer separates scheduling decisions from kernel execution. The “plan” phase precomputes work distribution based on current batch composition. The “run” phase executes with that plan. This separation enables CUDA graph capture—record the run phase once, replay it with different inputs.

Cascade attention: When multiple requests share a common prefix (a system prompt, a retrieved document), naive approaches recompute attention over that prefix for every request. FlashInfer’s cascade attention processes the shared prefix once, caches the result, and computes only the unique suffix per request. For a 32K shared prefix across 256 requests, the FlashInfer authors report up to a 31x speedup.
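Splitting attention into a shared-prefix part and a per-request suffix part only works because partial attention results over disjoint key sets can be merged exactly, given each part's log-sum-exp. The NumPy sketch below shows that merge for a single query. It illustrates the math only, it is not FlashInfer's API or kernel code, and the function names are made up for the example.

```python
import numpy as np


def attention_state(q, K, V):
    """Return softmax(K q / sqrt(d)) @ V and the log-sum-exp of the scores."""
    scores = K @ q / np.sqrt(q.shape[0])
    m = scores.max()
    weights = np.exp(scores - m)             # shifted for numerical stability
    lse = m + np.log(weights.sum())
    return (weights / weights.sum()) @ V, lse


def merge_states(o1, lse1, o2, lse2):
    """Combine two partial attention outputs computed over disjoint key sets."""
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2)


rng = np.random.default_rng(0)
d = 16
K_prefix, V_prefix = rng.normal(size=(32, d)), rng.normal(size=(32, d))   # shared prompt
K_suffix, V_suffix = rng.normal(size=(5, d)), rng.normal(size=(5, d))     # one request
q = rng.normal(size=d)

merged = merge_states(*attention_state(q, K_prefix, V_prefix),
                      *attention_state(q, K_suffix, V_suffix))
full, _ = attention_state(q, np.vstack([K_prefix, K_suffix]),
                          np.vstack([V_prefix, V_suffix]))
print(np.allclose(merged, full))
```

Because the merge is exact, the prefix keys and values can be read once for the whole batch while each request handles only its own suffix.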

Integration with vLLM

vLLM’s attention backend isn’t monolithic. A kernel selection layer examines the workload (hardware architecture, head dimension, precision, model type) and dispatches to the appropriate backend: FlashAttention for standard cases, FlashInfer for PagedAttention scenarios, Triton for specific configurations. This flexibility means you get optimized kernels for your actual workload, not a one-size-fits-all solution.

Figure: FlashAttention vs FlashInfer. FlashAttention optimized training. FlashInfer optimizes the messy reality of production serving.

Padded vs ragged batches: in a padded rectangular batch of four requests, 18 slots hold real tokens and 14 hold padding (44% waste). In the ragged packed layout, the same 18 tokens are stored with no padding. FlashInfer tracks sequence boundaries with offset arrays, enabling tight packing without wasted compute.

Cascade attention scenario: 4 requests share a 32K-token system prompt, with unique suffixes of 512, 256, 128, and 64 tokens. The naive approach computes the prefix 4 times, for about 129K tokens of attention in total. FlashInfer’s cascade computes the prefix once and reuses the cached result, for about 33K tokens in total.

NCCL: The Invisible Communication Backbone

Everything discussed so far assumes the model fits on a single GPU. For frontier models, it doesn’t. Llama-70B requires roughly 140GB in FP16—nearly two H100s worth of memory. Larger models require more.

Tensor parallelism splits the model across GPUs within a server. Weight matrices are sharded so each GPU holds a slice. Each GPU computes a partial result, and then… they have to talk to each other.

This is NCCL’s domain.

The Communication Pattern

Tensor parallelism using the Megatron-LM algorithm requires two AllReduce operations per transformer layer:

After attention output projection: Each GPU computed attention over its head subset. AllReduce combines the results.
After FFN down projection: Each GPU computed a partial FFN result. AllReduce sums them.

AllReduce means “sum tensors across all GPUs and distribute the result to all GPUs.” For Llama-70B on 4 GPUs, each AllReduce moves batch_size × sequence_length × hidden_dim × bytes_per_element bytes—and it happens 160 times per forward pass (2 per layer × 80 layers).

The Interconnect Gap

The choice of interconnect dominates multi-GPU inference performance:

Interconnect	Bandwidth
NVLink 4.0	900 GB/s bidirectional
PCIe Gen5	128 GB/s bidirectional

That’s a 7x gap. On NVLink, tensor parallelism adds modest overhead. On PCIe, communication becomes the bottleneck rather than memory bandwidth.

Even optimized, communication overhead consumes 20-35% of inference time for Llama-70B on 4×H100. It’s the reason single-GPU inference (when the model fits) is always preferable, and why quantization to fit larger models on fewer GPUs often improves overall throughput despite the precision loss.

Figure: NCCL AllReduce patterns, showing how NVIDIA's collective communication library orchestrates multi-GPU data synchronization for tensor parallelism.

Algorithm	Latency	Bandwidth	Best for	Steps
Ring AllReduce	O(k)	Optimal	Large messages	2(k-1)
Tree AllReduce	O(log k)	Sub-optimal	Small messages	2 log(k)

Ring AllReduce: Data chunks flow around the ring. It maximizes bandwidth utilization by pipelining data transfers. Each GPU sends and receives simultaneously, achieving near-optimal throughput. Best for: prefill phase, gradient sync, large activation tensors (>1MB).

Tree AllReduce: Reduce up, broadcast down. It minimizes latency with logarithmic steps, at the cost of lower bandwidth efficiency. Best for: decode phase, small tensors, latency-critical paths (~30KB).

High GPU count: Ring scales well with many GPUs since bandwidth stays constant. Tree latency grows logarithmically but wastes bandwidth at scale. With 8+ GPUs, ring is preferred for most operations.

Latency sensitive: When time-to-first-token matters more than throughput, tree's O(log k) steps beat ring's O(k) latency even at the cost of bandwidth. This applies to interactive inference and real-time applications.

Impact on LLM inference (Llama-70B, 4×H100): 160 AllReduce operations per forward pass, 20-35% of time spent on communication, and about 30 KB per AllReduce during decode. With PCIe, communication overhead can exceed 50%, which makes NVLink critical for multi-GPU inference.

NCCL automatically selects the optimal algorithm based on message size and GPU topology.

Modern NCCL uses hybrid algorithms—tree for small messages (<256KB) switching to ring for larger transfers.
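The ring pattern is easy to simulate on the CPU. The sketch below runs a reduce-scatter followed by an all-gather over plain NumPy arrays standing in for GPU buffers, and counts the communication steps. It models the data movement only, not NCCL's implementation, and the function name is an assumption.

```python
import numpy as np


def ring_allreduce(buffers):
    """Simulate a sum AllReduce over equal-length 1-D arrays, one per simulated GPU."""
    k = len(buffers)
    # chunks[g][c] is GPU g's copy of chunk c.
    chunks = [np.array_split(b.astype(np.float64), k) for b in buffers]
    steps = 0

    # Reduce-scatter: after k-1 steps, GPU g holds the full sum of chunk (g + 1) % k.
    for s in range(k - 1):
        sends = [(g, (g - s) % k, chunks[g][(g - s) % k].copy()) for g in range(k)]
        for g, c, data in sends:
            chunks[(g + 1) % k][c] += data       # right-hand neighbour accumulates
        steps += 1

    # All-gather: circulate the finished chunks until every GPU has all of them.
    for s in range(k - 1):
        sends = [(g, (g + 1 - s) % k, chunks[g][(g + 1 - s) % k].copy()) for g in range(k)]
        for g, c, data in sends:
            chunks[(g + 1) % k][c] = data        # right-hand neighbour overwrites
        steps += 1

    return [np.concatenate(c) for c in chunks], steps


partials = [np.arange(8) * (g + 1) for g in range(4)]   # four GPUs, partial results
results, steps = ring_allreduce(partials)
print(steps, results[0])
```

Each step moves only one chunk (1/k of the tensor) per GPU, which is where the bandwidth efficiency comes from. The price is the 2(k-1) sequential steps that make the ring slow for tiny messages.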

Putting It Together

During decode, a single token flows through the entire stack: vLLM schedules the batch, PyTorch dispatches through CUDA graphs, and each transformer layer executes CUTLASS GEMMs for projections (with fused quantization), FlashInfer kernels for attention over the paged KV cache, and NCCL AllReduces if using tensor parallelism.

The time breakdown tells the story:

Attention kernels: 40-60%
FFN/MLP kernels: 30-40%
Communication (with TP): 20-35%
Everything else: <10%

Attention and FFN dominate. Both are memory-bound.

The Memory Bandwidth Endgame

Every library in this stack attacks the same fundamental constraint: memory bandwidth. CUTLASS enables fused kernels that minimize HBM round-trips. Triton makes writing such kernels accessible. FlashInfer optimizes attention’s memory access patterns. NCCL minimizes communication overhead that competes for the same memory bandwidth.

The hardware is evolving in the same direction. NVIDIA’s Blackwell B200 delivers 8 TB/s of HBM bandwidth, about 2.4x that of the H100, and introduces native FP4 support, halving bytes-per-parameter relative to FP8.

Understanding this stack is not just an academic exercise. If you’re deploying LLMs at scale, these libraries determine your cost per token, your latency percentiles, your maximum context length. The optimizations that matter aren’t in the model architecture; they’re in the software that maps that architecture onto silicon.

The iceberg runs deep. Now you know what’s beneath the surface.

Speculative Decoding: When Guessing Right Makes for Faster Inference

Tue, 23 Dec 2025 10:00:00 +0800

The Speed Problem Wasn’t Always About Compute

In late 2022 and early 2023, two independent research teams at Google and DeepMind published papers with remarkably similar insights. Both had discovered a way to make large language models generate text 2-3× faster without approximations, without quality loss, and without changing the output distribution at all. The technique was speculative decoding.

Here’s the counterintuitive reality: when you run a 70B parameter model on a modern GPU, most of the computational units sit idle. The expensive tensor cores that can perform trillions of operations per second spend the majority of their time doing nothing, waiting. They’re waiting for data to arrive from memory. This is the memory bandwidth bottleneck, and it’s the reason that making LLMs faster is about doing more useful work with each expensive memory read.

Speculative decoding exploits this idle capacity in an elegant way: use a small, fast model to guess what tokens the big model will produce, then verify those guesses in parallel. When the guesses are right, and they often are, you’ve generated multiple tokens for the price of one memory read of the large model’s weights.

GLM-4.7, Zhipu AI’s 355B parameter flagship released in December 2025, takes this further by building Multi-Token Prediction directly into its architecture. With vLLM’s optimized implementation, this achieves acceptance rates exceeding 90% and generation speeds beyond 100 tokens per second, a glimpse of where inference optimization is heading.

Why LLM Inference Is Memory-Bound

To understand why speculative decoding works, we need to understand why LLM inference is slow in the first place.

Consider what happens when a 70B parameter model generates a single token. The GPU must:

Load the model’s ~140GB of weights from High Bandwidth Memory (HBM)
Perform matrix multiplications with the current token’s hidden states
Produce a probability distribution over the vocabulary
Sample one token
Repeat for the next token

The critical insight is in step 1. An NVIDIA H100 GPU can perform roughly 2,000 trillion floating-point operations per second (TFLOPS). But its memory bandwidth—the rate at which it can read data from HBM—is “only” 3.35 TB/s.

Let’s do the arithmetic. Loading 140GB of weights at 3.35 TB/s takes about 42 milliseconds. The actual matrix multiplications for a single token? Perhaps 1-2 milliseconds of computation. The GPU spends roughly 95% of its time waiting for memory transfers and only 5% doing actual math.

This ratio is captured by a metric called arithmetic intensity: the number of floating-point operations performed per byte of memory transferred. For autoregressive LLM inference at batch size 1, arithmetic intensity is approximately 1-2 FLOP/byte. Modern GPUs are designed for workloads with arithmetic intensity of 100+ FLOP/byte. The mismatch is severe.

If we could somehow verify multiple tokens in a single forward pass, we’d amortize that expensive 42ms memory read across several tokens instead of just one. This is the aim of speculative decoding.
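The amortization argument is simple enough to check numerically. This sketch uses the figures from the text and deliberately ignores the draft model's cost and the compute time, so it gives an optimistic bound rather than a prediction. The function name is illustrative.

```python
# Figures quoted in the text.
weight_bytes = 140e9        # ~140 GB of weights for a 70B model in FP16
hbm_bandwidth = 3.35e12     # 3.35 TB/s

# Time to stream the weights once, i.e. the floor for one forward pass.
load_ms = weight_bytes / hbm_bandwidth * 1e3


def ms_per_token(tokens_per_pass):
    """Weight-loading cost per generated token when one pass yields several tokens."""
    return load_ms / tokens_per_pass


print(f"one weight read: {load_ms:.1f} ms")
for n in (1, 2, 3, 4):
    print(f"{n} token(s) per pass -> {ms_per_token(n):.1f} ms of memory time per token")
```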

The Draft-Verify Paradigm

The speculative decoding algorithm operates in a simple loop:

Draft Phase: A small, fast “draft” model autoregressively generates γ candidate tokens. Because this model is 50-100× smaller than the target, its memory reads are proportionally faster.

Verify Phase: The large “target” model processes all γ candidates in a single forward pass. Thanks to the parallelism of transformer attention, scoring γ tokens takes nearly the same time as scoring 1 token—the memory bandwidth cost is identical.

Accept/Reject Phase: Compare the draft model’s predictions against the target model’s true probabilities. Accept tokens that match well; reject and resample where they diverge.

Here’s the algorithm in pseudocode:

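import numpy as np

def speculative_decode(prefix, draft_model, target_model, gamma, rng=None):
    # gamma is γ. Assumed interfaces: prefix is a list of token IDs;
    # draft_model(tokens) returns the next-token distribution as a 1-D array;
    # target_model(prefix, drafts) returns gamma + 1 such distributions, one per position.
    rng = rng or np.random.default_rng()

    # Step 1: Draft gamma tokens autoregressively (cheap)
    drafts, q = [], []
    for _ in range(gamma):
        q_i = draft_model(prefix + drafts)
        drafts.append(int(rng.choice(len(q_i), p=q_i)))
        q.append(q_i)

    # Step 2: Score all positions in parallel (expensive, but single pass)
    p = target_model(prefix, drafts)  # p[i]: target distribution after prefix + drafts[:i]

    # Step 3: Accept/reject with rejection sampling
    for i, x_i in enumerate(drafts):
        if rng.random() >= min(1.0, p[i][x_i] / q[i][x_i]):
            # Reject: resample from the adjusted (residual) distribution
            residual = np.maximum(0.0, p[i] - q[i])
            x_new = int(rng.choice(len(residual), p=residual / residual.sum()))
            return prefix + drafts[:i] + [x_new]

    # All accepted: bonus token from final position
    bonus = int(rng.choice(len(p[gamma]), p=p[gamma]))
    return prefix + drafts + [bonus]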

The key is in step 3. When we accept a draft token, we move forward. When we reject, we don’t just discard the draft; we sample from an adjusted distribution that “fills in” exactly the probability mass the draft model missed. This ensures the output distribution is mathematically identical to standard autoregressive decoding.

The Math of Distribution Preservation

This is the part that makes speculative decoding remarkable. The output distribution is exactly the same as if you had run standard autoregressive decoding with the target model alone. Understanding why requires examining the rejection sampling mechanism.

Let $p(x)$ denote the target model’s probability distribution and $q(x)$ denote the draft model’s distribution. For a draft token $x'$, we accept it with probability:

$$\alpha(x') = \min\left(1, \frac{p(x')}{q(x')}\right)$$

When rejected, we resample from the adjusted distribution:

$$p'(x) = \text{normalize}\left(\max(0, p(x) - q(x))\right)$$

The key theorem is that this process produces samples from $p(x)$. Here’s the proof:

$$P(X = x') = P(\text{accepted}, X = x') + P(\text{rejected}, X = x')$$

For the accepted case, we sample $x'$ from $q$ and accept with probability $\min(1, p(x')/q(x'))$:

$$P(\text{accepted}, X = x') = q(x') \cdot \min\left(1, \frac{p(x')}{q(x')}\right) = \min(q(x'), p(x'))$$

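For the rejected case, we first reject and then resample from $p'$. Rejection happens with overall probability $1 - \sum_x \min(p(x), q(x))$, which is one minus the total accepted mass from the previous step:

$$P(\text{rejected}, X = x') = \left(1 - \sum_x \min(p(x), q(x))\right) \cdot \frac{\max(0, p(x') - q(x'))}{\sum_x \max(0, p(x) - q(x))}$$

The denominator is the unnormalized residual mass. Because $\max(0, p - q) = p - \min(p, q)$ and $p$ sums to 1, it equals $1 - \sum_x \min(p(x), q(x))$ and cancels the leading factor, so: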

$$P(\text{rejected}, X = x') = \max(0, p(x') - q(x')) = p(x') - \min(p(x'), q(x'))$$

Adding both cases:

$$P(X = x') = \min(p(x'), q(x')) + p(x') - \min(p(x'), q(x')) = p(x')$$

This proof holds regardless of how good the draft model is. A poorly aligned draft simply increases rejection rate without corrupting the output distribution. The guarantee is unconditional.
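The proof can also be checked numerically by computing the two terms directly for arbitrary distributions. This is a small sketch for a single token position; the function name is illustrative.

```python
import numpy as np


def speculative_output_distribution(p, q):
    """Exact distribution of the emitted token for one accept-or-resample step."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    accepted = np.minimum(p, q)              # P(accepted, X = x)
    reject_prob = 1.0 - accepted.sum()       # overall probability of rejecting
    residual = np.maximum(0.0, p - q)
    if residual.sum() > 0:                   # residual is all zeros only when p == q
        residual = residual / residual.sum()
    return accepted + reject_prob * residual


rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(10))               # arbitrary target distribution
q = rng.dirichlet(np.ones(10))               # arbitrary, unrelated draft distribution
print(np.allclose(speculative_output_distribution(p, q), p))
```

The draft distribution here is drawn independently of the target, which mirrors the claim that the guarantee does not depend on draft quality.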

Building Intuition for Rejection Sampling

Let’s build some intuition for why rejection sampling works.

Imagine two probability distributions over possible next tokens. The target distribution $p(x)$ represents what the large model actually wants to output. The draft distribution $q(x)$ represents the small model’s best guess.

Picture these as two overlapping curves. Where they overlap—where the draft model agrees with the target—we can safely use the draft’s samples. The acceptance probability $\min(1, p/q)$ ensures we never accept a token more often than the target model would generate it.

Figure: Distribution overlap and acceptance, visualizing how draft model alignment affects token acceptance probability. The target p(x) and draft q(x) are overlaid. In the example shown, α = Σ min(p(x), q(x)) = 0.85, so 85% of draft tokens are accepted on average.

The figure marks three regions: the accept region min(p, q), the overshoot where q > p (rejected), and the residual where p > q (sampled on rejection). The overlap is probability mass that can be safely accepted from the draft model. When q(x) > p(x), the draft "overshoots" and risks rejection. The residual fills in when we reject, ensuring the output matches the target distribution exactly.

But what about the probability mass where $p(x) > q(x)$? These are tokens the target model likes more than the draft model expected. If we only accepted, we’d undersample these tokens. The resampling step corrects for this: when we reject, we draw from exactly this “missing” probability mass.

The total acceptance rate, the probability that we accept a draft token, equals the overlap between the distributions:

$$\alpha = \sum_x \min(p(x), q(x))$$

This quantity has a nice interpretation: it’s 1 minus the total variation distance between $p$ and $q$ (equivalently, 1 minus half their L1 distance $\sum_x |p(x) - q(x)|$). When distributions are identical, $\alpha = 1$ and we always accept. When they’re completely disjoint, $\alpha = 0$ and we always reject.
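The relationship between overlap and distance is quick to verify. In this sketch the total variation distance is taken with the usual definition of half the L1 distance, so the overlap equals one minus half of $\sum_x |p(x) - q(x)|$. Function names are illustrative.

```python
import numpy as np


def acceptance_rate(p, q):
    """Overlap between the two distributions: the sum of elementwise minima."""
    return float(np.minimum(p, q).sum())


def total_variation(p, q):
    """Total variation distance, defined as half the L1 distance."""
    return 0.5 * float(np.abs(p - q).sum())


p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(acceptance_rate(p, q), 1.0 - total_variation(p, q))

identical = acceptance_rate(p, p)
disjoint = acceptance_rate(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(identical, disjoint)
```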

In practice, well-matched draft-target pairs achieve $\alpha$ between 0.6 and 0.8, while architecturally integrated solutions like GLM-4.7’s native MTP exceed 0.9.

A Concrete Walkthrough of Rejection Sampling

Let’s ground the mathematics in concrete examples to build deeper intuition for how the algorithm actually works.

The Sequential Verification Problem

When the draft model generates γ tokens, each token is conditioned on the previous ones:

$$x_1 \sim q(\cdot)$$

$$x_2 \sim q(\cdot|x_1)$$

$$x_3 \sim q(\cdot|x_1,x_2)$$

The target model verifies by computing in parallel:

$$p(x_1), \quad p(x_2|x_1), \quad p(x_3|x_1,x_2), \quad \ldots$$

The catch: if you reject $x_2$, then $x_3$ was generated from the wrong context.

The draft model generated $x_3$ assuming $x_2$ was correct. But if you reject $x_2$ and resample a different token $x_2'$, then $x_3$ is now invalid, i.e. it was conditioned on a token that no longer exists in the sequence.

Concrete Example:

Draft generates:  "The cat sat on the [mat]"
                                        ↑ rejected, resample → "rug"

Draft's x₆ was:   "mat" → next token "." (conditioned on "mat")
But now we have:  "rug" → we can't use "." anymore!

The token after “mat” might have been “.” with high probability, but the token after “rug” might be “was” or something entirely different. You must discard everything after the rejection point and continue generation from the resampled token.

What p(x) and q(x) Actually Mean

The notation can obscure what’s happening. Let’s be concrete.

$x_1$ is a specific token that was sampled—say, the token “cat” (token ID 9846 in the vocabulary).

$p(x_1)$ is the scalar probability that the target model assigned to that exact token:

# Target model forward pass
logits = target_model(prompt)        # shape: [vocab_size]
probs = softmax(logits)              # shape: [vocab_size]

p_x1 = probs[9846]                   # scalar: 0.073

Similarly, $q(x_1)$ is what the draft model assigned to that same token:

# Draft model forward pass
logits = draft_model(prompt)         # shape: [vocab_size]
probs = softmax(logits)              # shape: [vocab_size]

q_x1 = probs[9846]                   # scalar: 0.051

The acceptance check compares these two scalars:

ratio = p_x1 / q_x1                  # 0.073 / 0.051 = 1.43
acceptance_prob = min(1, ratio)      # min(1, 1.43) = 1.0

u = random.uniform(0, 1)             # say, 0.67
if u < acceptance_prob:              # 0.67 < 1.0 → True
    accept()

The ratio $p(x)/q(x)$ asks: “Did the draft model over- or under-estimate this token?”

Scenario	Ratio	Accept Prob	Meaning
p=0.30, q=0.10	3.0	1.0 (capped)	Draft underestimated—always accept
p=0.10, q=0.10	1.0	1.0	Perfect agreement—always accept
p=0.05, q=0.10	0.5	0.5	Draft overestimated—accept 50%
p=0.01, q=0.10	0.1	0.1	Draft way overconfident—accept 10%

When the draft model is overconfident about a token ($q > p$), you reject proportionally to correct the bias. When the draft is underconfident ($q < p$), you always accept—the residual distribution handles the gap.

Why Resample from Residual, Not Just p?

When you reject a drafted token, you need to pick a new token. The naive answer is: “We want the output to follow $p$, so just sample from $p$.”

This is wrong. Let me show you why with a concrete example.

Two-token vocabulary: A and B

Target:  p(A) = 0.7,  p(B) = 0.3
Draft:   q(A) = 0.4,  q(B) = 0.6

Tracing through the algorithm:

Step 1: Sample from draft $q$

40% chance we draft A
60% chance we draft B

Step 2: Accept/reject check

If we drafted A:

$$\text{accept prob} = \min\left(1, \frac{p(A)}{q(A)}\right) = \min\left(1, \frac{0.7}{0.4}\right) = \min(1, 1.75) = 1.0$$

A is always accepted when drafted.

If we drafted B:

$$\text{accept prob} = \min\left(1, \frac{p(B)}{q(B)}\right) = \min\left(1, \frac{0.3}{0.6}\right) = \min(1, 0.5) = 0.5$$

B is accepted 50% of the time when drafted.

Calculating the probabilities:

$$P(\text{accept A}) = q(A) \times 1.0 = 0.4$$

$$P(\text{accept B}) = q(B) \times 0.5 = 0.3$$

$$P(\text{reject}) = 1 - 0.4 - 0.3 = 0.3$$

The problem with resampling from p:

If on rejection we resample from $p$:

$$P(\text{output}=A) = P(\text{accept A}) + P(\text{reject}) \times p(A)$$

$$= 0.4 + 0.3 \times 0.7 = 0.4 + 0.21 = 0.61$$

This is wrong—should be 0.7!

$$P(\text{output}=B) = P(\text{accept B}) + P(\text{reject}) \times p(B)$$

$$= 0.3 + 0.3 \times 0.3 = 0.39$$

This is wrong—should be 0.3!

The fix: residual distribution

The residual distribution is:

$$\max(0, p(A) - q(A)) = \max(0, 0.7 - 0.4) = 0.3$$

$$\max(0, p(B) - q(B)) = \max(0, 0.3 - 0.6) = 0.0$$

Normalized: $p'(A) = 1.0$, $p'(B) = 0.0$

Now:

$$P(\text{output}=A) = P(\text{accept A}) + P(\text{reject}) \times p'(A)$$

$$= 0.4 + 0.3 \times 1.0 = 0.7 \checkmark$$

$$P(\text{output}=B) = P(\text{accept B}) + P(\text{reject}) \times p'(B)$$

$$= 0.3 + 0.3 \times 0.0 = 0.3 \checkmark$$
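The arithmetic above can be checked mechanically. The following is a minimal sketch, not code from any library: `output_distribution` is a hypothetical helper that computes the exact one-token output distribution for both resampling strategies, using the fact that a token is drafted and accepted with probability q(x) · min(1, p(x)/q(x)) = min(p(x), q(x)).

```python
def output_distribution(p, q, resample="residual"):
    """Exact one-token output distribution of draft-then-verify sampling.

    p, q: dicts mapping token -> probability (target and draft).
    resample: "residual" uses normalized max(0, p - q); "target" naively reuses p.
    """
    # Probability that a token is drafted AND accepted: min(p(x), q(x)).
    accepted = {x: min(p[x], q[x]) for x in p}
    p_reject = 1.0 - sum(accepted.values())

    if resample == "residual":
        raw = {x: max(0.0, p[x] - q[x]) for x in p}
        z = sum(raw.values())
        fallback = {x: (raw[x] / z if z > 0 else 0.0) for x in p}
    else:
        fallback = dict(p)

    return {x: accepted[x] + p_reject * fallback[x] for x in p}


p = {"A": 0.7, "B": 0.3}
q = {"A": 0.4, "B": 0.6}
with_residual = output_distribution(p, q, "residual")
with_target = output_distribution(p, q, "target")
print("residual:", with_residual)
print("naive p: ", with_target)
```

Up to floating-point rounding, the residual variant reproduces the target (0.7, 0.3), while the naive variant gives (0.61, 0.39), matching the hand calculation.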

The Probability Budget Intuition

Think of it as a budget you need to fill for each token:

Token	Target p(x)	Covered by Accept Phase	Still Needed
A	0.7	min(0.7, 0.4) = 0.4	0.7 - 0.4 = 0.3
B	0.3	min(0.3, 0.6) = 0.3	0.3 - 0.3 = 0.0

The accept phase already “spent” $\min(p,q)$ probability on each token. The residual distribution captures exactly what’s left to fill:

$$p'(x) = \frac{\max(0, p(x) - q(x))}{Z} = \frac{\text{what we still need}}{\text{total rejection probability}}$$

Figure: The Probability Budget. How rejection sampling fills the exact probability mass for each token, splitting the target budget p(x) into the accept phase min(p, q) and the residual max(0, p − q).

Key Insight

The residual distribution p′(x) precisely fills the probability gap left by the accept phase. Token B is already "fully funded" by accepts (min(p,q) = p), so it gets zero in the residual. Token A needs exactly 0.3 more probability—which the residual provides with 100% certainty when rejection occurs.

The Full Algorithm Timeline

Step 1: Draft model runs K times (cheap, fast)
        [x₁] → [x₂] → [x₃] → [x₄] → [x₅]

Step 2: Target model runs ONCE (expensive, but parallel)
        [x₁, x₂, x₃, x₄, x₅] → [p₁, p₂, p₃, p₄, p₅]

Step 3: Sequential verify until rejection
        x₁ ✓ → x₂ ✓ → x₃ ✗ → STOP, discard x₄,x₅
                       ↓
                  resample x₃' ~ residual

Output: [x₁, x₂, x₃']

The key efficiency gain: that single target model forward pass would normally give you just 1 token. With speculation, you potentially get K+1 tokens from the same compute, paying only the small overhead of draft generation.
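The three steps can be sketched end to end with toy models. This is an illustrative sketch only: `draft_probs` and `target_probs` are hypothetical callables that return a next-token distribution for a given token list, and the target is queried in a loop here, where a real system would score all positions in one batched forward pass.

```python
import random


def residual(p, q):
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(raw)  # positive whenever a rejection is possible (some token has p > q)
    return [r / z for r in raw]


def sample(dist, rng):
    return rng.choices(range(len(dist)), weights=dist)[0]


def speculative_step(draft_probs, target_probs, prefix, k, rng):
    """One draft-then-verify iteration; returns between 1 and k + 1 new tokens."""
    # Step 1: draft k tokens autoregressively with the cheap model.
    drafted, q_dists, seq = [], [], list(prefix)
    for _ in range(k):
        q = draft_probs(seq)
        x = sample(q, rng)
        drafted.append(x)
        q_dists.append(q)
        seq.append(x)

    # Step 2: target distributions at every drafted position, plus one extra.
    p_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # Step 3: verify left to right; stop at the first rejection.
    out = []
    for i, x in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
        else:
            out.append(sample(residual(p, q), rng))  # resample, discard the rest
            return out
    out.append(sample(p_dists[k], rng))  # all accepted: bonus token from the target
    return out


# Toy context-independent models over a two-token vocabulary (0 = A, 1 = B).
def draft(seq):
    return [0.4, 0.6]


def target(seq):
    return [0.7, 0.3]


rng = random.Random(0)
runs = [speculative_step(draft, target, [], k=3, rng=rng) for _ in range(20000)]
first_token_a = sum(1 for r in runs if r[0] == 0) / len(runs)
print(f"P(first token = A) ~ {first_token_a:.3f}")
```

With these toy distributions, the empirical frequency of A as the first output token should land close to the target's 0.7, even though every token was proposed by the draft.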

The Speedup Formula

How much faster does speculative decoding make inference? The expected number of tokens generated per iteration follows a capped geometric distribution.

If we propose γ tokens and each is accepted independently with probability α, the expected number of tokens produced per iteration (accepted drafts plus the one token the target model always contributes) is:

$$E[\text{tokens per iteration}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}$$

For large γ, this approaches $\frac{1}{1-\alpha}$. With typical values (α = 0.75, γ = 5), we get about 3.3 tokens per expensive target model call, against a ceiling of 4 as γ grows.
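To get a feel for how quickly the formula saturates, here is a small sketch that evaluates it for a few draft lengths. The function name `expected_tokens` is illustrative, and the model assumes a constant, independent acceptance probability.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens per iteration: (1 - alpha^(gamma + 1)) / (1 - alpha)."""
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


for gamma in (1, 3, 5, 10):
    print(f"gamma={gamma:2d}  expected tokens={expected_tokens(0.75, gamma):.2f}")
```

At α = 0.75 the value for γ = 5 is about 3.29, and doubling γ to 10 adds only about half a token more, which is why long drafts rarely pay off.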

The speedup formula must account for the cost of the draft model:

$$\text{Speedup} = \frac{1 - \alpha^{\gamma+1}}{(1-\alpha)(\gamma c + 1)}$$

where $c = t_{\text{draft}}/t_{\text{target}}$ is the ratio of draft model latency to target model latency. For a draft model 100× smaller, $c \approx 0.01-0.05$.

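When $\alpha > c$, some choice of γ always yields a speedup. Even γ = 1 gives $(1 + \alpha)/(1 + c)$, which exceeds 1 whenever $\alpha > c$, so this is the minimum improvement you can expect from a well-chosen γ. Real-world benchmarks on H200 GPUs show Llama 3.1 405B with a Llama 3.2 3B draft achieving 3.6× speedup (33 → 120 tokens/sec).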
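The speedup formula is easy to explore numerically. The sketch below uses illustrative helper names (`speedup`, `best_gamma`) and searches a small range of draft lengths; it assumes α < 1 and the same constant-acceptance model as the formula.

```python
def speedup(alpha, gamma, c):
    """Expected speedup for acceptance rate alpha, draft length gamma, cost ratio c."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))


def best_gamma(alpha, c, max_gamma=20):
    """Draft length in [1, max_gamma] that maximizes the expected speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))


alpha, c = 0.75, 0.05
g = best_gamma(alpha, c)
print(f"gamma=1 speedup: {speedup(alpha, 1, c):.2f}")
print(f"best gamma: {g}, speedup: {speedup(alpha, g, c):.2f}")
```

With γ = 1 the expression reduces to (1 + α)/(1 + c), the lower bound quoted above; longer drafts help only until the extra draft cost γc outweighs the shrinking chance that late tokens survive verification.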

The Variant Landscape

The field has evolved rapidly since 2023, with researchers finding increasingly clever ways to eliminate draft models or improve acceptance rates.

EAGLE: Feature-Level Speculation

EAGLE (ICML 2024) introduced feature-level speculation, predicting at the second-to-top layer rather than token level. The key insight: autoregression over continuous hidden states is easier than over discrete tokens.

Rather than training a separate small model, EAGLE trains a lightweight head (~1B parameters for 70B models) that extrapolates feature vectors. These features are then decoded to tokens and verified. The approach achieves 3× speedup, 1.6× faster than Medusa’s parallel heads approach.

EAGLE-2 added context-aware dynamic draft trees, adjusting speculation aggressiveness based on prediction confidence to reach 4.26× speedup.

Medusa: Parallel Prediction Heads

Medusa takes a different approach: add multiple single-layer prediction heads directly atop the frozen base model. Each head predicts a different future position independently.

Hidden State → Head 1 → Token +1
            → Head 2 → Token +2
            → Head 3 → Token +3

The Cartesian product of top-k predictions from each head creates candidate continuations verified via tree attention. Training requires only hours on a single A100.
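A rough sketch of how those candidates could be enumerated, assuming each head returns its top-k tokens with probabilities. The function `candidate_continuations` and the product-of-probabilities ranking are illustrative assumptions; this builds the full Cartesian product and leaves out tree pruning and tree attention entirely.

```python
from itertools import product


def candidate_continuations(head_topk):
    """head_topk[i] is a list of (token, prob) pairs from head i (position +i+1)."""
    candidates = []
    for combo in product(*head_topk):
        tokens = tuple(tok for tok, _ in combo)
        score = 1.0
        for _, prob in combo:
            score *= prob  # heads are independent, so scores simply multiply
        candidates.append((tokens, score))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


head_topk = [
    [("the", 0.6), ("a", 0.3)],      # head 1: top-2 for token +1
    [("cat", 0.5), ("dog", 0.4)],    # head 2: top-2 for token +2
    [("sat", 0.9)],                  # head 3: top-1 for token +3
]
candidates = candidate_continuations(head_topk)
for tokens, score in candidates:
    print(tokens, round(score, 3))
```

The candidate count grows as the product of the per-head k values, which is why the number of heads and the per-head k are kept small in practice.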

The trade-off: position-independent heads can’t condition on earlier speculated tokens, limiting acceptance rates compared to EAGLE’s sequential feature prediction.

Self-Speculative Methods

LayerSkip (ACL 2024) eliminates external drafters entirely by using early exits from the target model itself. During training, layer dropout with increasing rates toward later layers plus early exit loss creates a model that can draft from shallow layers and verify with deep layers.

The catch: requires special training recipes. Baseline LLMs show no speedup with this approach.

Method	Draft Model	Training	Memory Overhead	Speedup	Distribution Preserved
Standard SD	Yes (separate)	Optional	High	1.5-2.5×	Yes
EAGLE-2	Lightweight head	~2 days	Low-Medium	3-4.3×	Yes
Medusa	No (heads on base)	Hours	Low	2.2-3.6×	Optional

GLM-4.7: Native Multi-Token Prediction

GLM-4.7 represents a paradigm shift: rather than retrofitting speculative decoding onto existing models, Zhipu AI built Multi-Token Prediction directly into the architecture.

The model contains 355 billion total parameters with 32 billion active per forward pass via Mixture-of-Experts routing. This extreme sparsity (only about 9% of parameters active per token) creates an ideal scenario for speculative decoding: massive memory reads but relatively modest compute.

The MTP Architecture

Traditional speculative decoding uses separate draft and target models. GLM-4.7’s MTP adds auxiliary prediction heads within the model itself:

Hidden State h_t → Main Head → P(x_{t+1} | h_t)     [standard next-token]
               → MTP Head  → P(x_{t+2} | h_t)     [speculative token]
               → MTP Head  → P(x_{t+3} | h_t)     [speculative token]

The MTP heads are lightweight projections sharing the same massive 32B active backbone. Because draft and target share the same semantic understanding, the draft distribution is tightly aligned with the target distribution, resulting in acceptance rates exceeding 90% with 1 speculative token.

Training the MTP Heads

The MTP layer was trained with loss weight λ = 0.3 for the first 15 trillion tokens, reduced to 0.1 later. This balances multi-token prediction quality against primary language modeling capability.

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \lambda \cdot \mathcal{L}_{\text{MTP}}$$

The reduced weight in later training prevents the MTP objective from interfering with the model’s core capabilities while still maintaining high acceptance rates at inference time.
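As a small illustration of the schedule described above, the sketch below treats λ as a step function of tokens seen. The function names and the hard switch at exactly 15 trillion tokens are assumptions for illustration; the real training recipe may transition differently.

```python
def mtp_loss_weight(tokens_seen, switch_at=15e12, early=0.3, late=0.1):
    """Step schedule for lambda: 0.3 for the first 15T tokens, 0.1 afterwards."""
    return early if tokens_seen < switch_at else late


def total_loss(lm_loss, mtp_loss, tokens_seen):
    """L_total = L_LM + lambda * L_MTP, with lambda taken from the schedule."""
    return lm_loss + mtp_loss_weight(tokens_seen) * mtp_loss


print(total_loss(lm_loss=2.0, mtp_loss=1.0, tokens_seen=1e12))
print(total_loss(lm_loss=2.0, mtp_loss=1.0, tokens_seen=20e12))
```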

Architectural Innovations

GLM-4.7 incorporates several architectural choices that complement its MTP capability:

Sigmoid-gated loss-free balance routing across ~160 experts (8 active per token)
96 attention heads for 5120 hidden dimension (2.5× more heads than typical)
Grouped-Query Attention with partial RoPE at 1M base frequency for 200K context
QK-Norm for stabilized attention logits

The increased head count particularly improves reasoning benchmarks despite not improving training loss—an interesting finding suggesting that inference-time compute distribution matters.

vLLM Implementation: PagedAttention Meets Speculation

vLLM’s speculative decoding architecture consists of three phases orchestrated by the SpecDecodeWorker:

Draft Runner: Proposes candidate tokens using MTP heads
Target Runner: Scores all candidates in a single forward pass
Rejection Sampler: Implements accept/reject logic

PagedAttention Integration

The integration with PagedAttention required non-trivial modifications. The memory manager tracks KV cache for both draft and target phases with block-level management enabling sharing, copying, and forking between sequences.

For MTP-style speculation, the draft phase reuses the target model’s KV cache infrastructure, minimizing overhead. The scheduler now supports “preallocated slots”—reserving KV block space sufficient for multiple tokens before the next scheduler invocation.

# GLM-4.7 with native MTP speculative decoding
vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47

Why num_speculative_tokens=1?

The recommendation of 1 speculative token reflects empirical findings: higher values increase mean acceptance length but decrease acceptance rate, reducing overall throughput. The sweet spot maximizes expected tokens per iteration accounting for verification overhead.

With GLM-4.7’s 90%+ acceptance rate at num_speculative_tokens=1, you get about 1.9 tokens per forward pass on average: one guaranteed token plus a drafted token that is accepted at least 90% of the time. Increasing to 2 speculative tokens might yield an average of 2.5 tokens but with higher variance and occasional costly rejections.

Continuous Batching Challenges

Continuous batching with speculation creates the “ragged tensor problem”: different sequences accept different numbers of tokens per iteration, creating irregular batch shapes. At higher concurrency, this overhead consumes up to 40% of computation.

vLLM addresses this through dynamic speculation length adjustment based on system load—reducing speculation aggressiveness when batch sizes grow.

When Speculation Helps and When It Hurts

The fundamental principle: speculative decoding trades compute for memory bandwidth. When GPUs are memory-bound (most inference scenarios), spare compute cycles can profitably run draft verification. When GPUs are compute-saturated, speculation adds overhead without benefit.

Batch Size Dominates

Condition	Impact	Recommendation
Batch size ≤ 8	Strong benefit (1.5-2.7×)	Enable with γ=4-8
Batch size > 32, short context	Potential slowdown	Disable or use dynamic γ
Batch size > 32, long context	Moderate benefit (up to 2×)	Enable with small γ
QPS < 10	Strong benefit	Enable
QPS > 50	Diminishing/negative returns	Dynamic speculation
Acceptance rate < 0.5	Marginal benefit	Improve draft alignment

At batch size 1, GPUs run severely underutilized—speculative decoding achieves 2.73× speedup (63% latency reduction). Beyond batch size 16-32, benefits diminish and can reverse, causing 1.4-1.8× slowdown.

The Long Context Exception

MagicDec research found that at large batch sizes with long contexts, decoding becomes memory-bound again due to KV cache loading. Speculative decoding can provide 2× speedup even on 8 A100s with high concurrency when context lengths exceed 32K tokens.

INT4/INT8 quantization presents tradeoffs: aggressive weight quantization can reduce acceptance rates as draft model quality degrades. The QSpec approach uses W4A4 for drafting and W4A16 for verification, capturing benefits of both.

Where the Field Is Heading

The success of GLM-4.7’s native MTP suggests future models will ship with speculation built-in rather than bolted-on. Several trends are emerging:

Architectural Integration: Models trained with MTP objectives from the start achieve dramatically higher acceptance rates than retrofitted solutions. Expect this to become standard practice.

Dynamic Speculation: Rather than fixed speculation lengths, future systems will adjust aggressiveness based on:

Current batch size
Observed acceptance rates
Prediction entropy
Available compute headroom

Hardware Co-design: As speculative decoding becomes ubiquitous, GPU architectures may evolve to better support the draft-verify pattern with dedicated acceleration for the rejection sampling kernel.

Beyond Token Prediction: EAGLE’s feature-level speculation hints at richer speculation targets. Predicting structured outputs (tool calls, code blocks) could enable even higher acceptance rates for specialized workloads.

Conclusion

Speculative decoding achieves something rare in optimization: meaningful speedups without any quality tradeoff. The output distribution is mathematically identical to standard decoding.

The technique works because LLM inference is memory-bound, not compute-bound. By using idle GPU cycles to verify multiple speculative tokens in parallel, we amortize the expensive memory reads across several output tokens.

GLM-4.7’s native MTP architecture points toward where the field is heading: models designed from the ground up for efficient speculation, achieving 90%+ acceptance rates that make speculative decoding nearly as reliable as a lookup table.

References

Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. International Conference on Machine Learning.
- The original Google paper introducing speculative decoding with rigorous distribution preservation proofs.
Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. International Conference on Machine Learning.
- Feature-level speculation achieving superior speedups through hidden state prediction.
Cai, T., Li, Y., Geng, Z., Peng, H., & Dao, T. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. arXiv preprint.
- Parallel prediction heads approach requiring minimal training overhead.
Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., … & Acun, B. (2024). LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding. Association for Computational Linguistics.
- Self-speculative decoding using early exits from the target model.
Fu, Y., Bailis, P., Stoica, I., & Zhang, H. (2024). Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. International Conference on Machine Learning.
- Training-free speculative decoding via Jacobi iteration.
Zhipu AI. (2025). GLM-4.7: Advancing the Coding Capability. Hugging Face Model Card.
- Technical documentation for GLM-4.7’s native MTP architecture.
vLLM Team. (2025). Speculative Decoding Documentation. vLLM Documentation.
- Implementation details for speculative decoding in vLLM.
Chen, Z., Yang, X., Lin, J., Sun, C., Huang, J., & Chang, K. W. (2024). MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding. arXiv preprint.
- Analysis of speculative decoding performance at high batch sizes with long contexts.

The Anatomy of Agentic Code Assist: Building Production Grade AI Coding Agents

Sat, 15 Nov 2025 10:00:00 +0800

Introduction: The Agentic Shift in Software Engineering

Software engineering tools have been getting closer to translating what humans want into what machines do. We went from assembly to C, from malloc/free to garbage collection, from Vim to IDEs with autocomplete. But something changed in the last two years. We’re now building more than just “autocomplete” solutions.

Code Assist tools suggest the next token based on what’s in front of your cursor. General Purpose Code Assist Agents can reason about your entire codebase, plan multi-step changes, execute commands, run tests, and fix their own mistakes. The difference is: one is a fancy text predictor, the other is something that actually does engineering work.

OpenHands (formerly OpenDevin) is one of the most interesting open-source projects in this space. It’s a runtime that takes probabilistic LLM outputs and turns them into deterministic actions—compiling code, running tests, managing Docker containers. This post digs into how OpenHands works: its architecture, the CodeAct framework it uses, how it sandboxes execution safely, and what the benchmarks tell us about where this technology actually stands.

AI as an Amplifier: Why This Matters

Here’s something weird from the 2025 DORA report: AI adoption is basically universal now, but productivity gains are all over the place. Some teams are crushing it, others are drowning in AI-generated technical debt. What’s the difference?

AI acts as an amplifier. If your team already has good platform engineering and loosely coupled architectures, AI makes you faster. If you’re stuck with a tightly coupled monolith and manual deployments, AI will help you write bad code faster.

This matters for what makes a good code assist agent. Just dumping code into a buffer isn’t enough. A useful agent needs to:

Navigate existing codebases without breaking things
Verify its own changes against test suites
Learn from failures and try different approaches
Understand the context of your organisation and infrastructure, and adapt accordingly

Think of it as a high-performing junior engineer who can write code quickly but needs to check their work, not a hyper-intelligent autocomplete. OpenHands tries to be the former by treating the agent as part of the actual development workflow, not just a chatbot that spits out code.

Why Open Source Matters Here

The first wave of coding agents (like Devin) were black boxes. Impressive demos, but good luck getting your security team to approve giving them write access to your production codebase. When an agent deletes a config file, you want to know why, not just get an apology.

OpenHands (like modern code assist solutions) takes a different approach. Everything is transparent. The Event Stream logs every action the agent takes and every observation it receives. You can watch it run shell commands, edit files, and search through code in real-time.

This matters because 30% of developers say they don’t trust AI-generated code (per the DORA report). Hard to blame them. But when you can see exactly what the agent is doing, step by step, that trust equation changes. You’re not blindly accepting output; you’re supervising an autonomous process with full visibility, until you gain enough confidence to let the agent take over.

What Makes a “General Purpose” Code Agent?

A SQL-generating bot is useful for one thing. An agent that can write SQL, wrap it in a Python API, build a React frontend, and deploy the whole thing to Kubernetes, then debug production issues? That’s general purpose.

The difference comes down to four things that separate toys from production-ready tools:

1. Memory That Actually Works

LLMs are stateless. ChatGPT “forgets” your file structure the moment it scrolls out of the context window. Try refactoring a 50-file codebase when the agent can’t remember what it read five minutes ago.

A real agent needs persistent memory. Not just a bigger context window—actual tools to explore and navigate your codebase on-demand. OpenHands gives the LLM a developer’s toolkit: ripgrep for fast code search, AST-based analysis for understanding structure, and incremental file access with 100-line windows. Add an event log that lets it “replay” its own history, and you have something that can actually work with large codebases without pre-indexing everything.

2. Execution, Not Just Suggestions

OpenHands can create files, run compilers, execute shell scripts—the actual work. But this is dangerous. Running arbitrary LLM-generated code on your machine is a security nightmare. So OpenHands runs everything in Docker containers. The agent gets a sandboxed workspace where it can do whatever it wants without nuking your host system.

3. Learning from Failure

Code never works the first time. A code generator dies on the first syntax error. A real agent reads the error output, figures out what went wrong, tries a fix, and runs it again.

This Edit-Run-Verify loop is how OpenHands works. Actions flow from the agent to the system, observations (logs, errors, exit codes) flow back. The agent uses that feedback to iterate. Just like you would.

4. Using Your Tools

No LLM knows about your company’s internal Jira workflow or feature flag database. A production agent needs to plug into arbitrary tools without rewriting its core code.

OpenHands uses the Model Context Protocol (MCP)—an open standard for tool discovery. Point it at an MCP server, and the agent can dynamically learn what tools are available and how to use them.

How OpenHands Compares

Here’s how OpenHands stacks up against regular autocomplete and chat assistants:

Feature	Autocomplete (IntelliSense)	Chat (ChatGPT)	Agent (OpenHands)
Context	Current file only	Conversation history	Entire repo + runtime state
Execution	None	None (maybe sandbox)	Full shell in Docker
Agency	You drive everything	Responds to prompts	Multi-step autonomous plans
Tooling	Static analysis	Fixed plugins	Dynamic tool discovery (MCP)
Memory	None	Session-only	Event-sourced persistence

OpenHands goes all-in on that right column. More complex, but actually useful for real work.

How OpenHands Is Built

OpenHands looks like a local app, but it’s actually a distributed system. The key architectural decision: split the reasoning (Agent) from the execution (Runtime), and mediate everything through an event stream. This lets you swap LLMs or runtimes without rewriting the whole system.

Event Sourcing: The Unexpected Choice

OpenHands doesn’t use a traditional database. It’s event-sourced.

Most apps store the current state: if an agent edits a file, you overwrite the record. OpenHands records every action as an immutable event. Want to know the current state? Replay all the events.

The EventStream is the central nervous system. It handles three types of data:

Actions: Commands from the agent—CmdRunAction, FileWriteAction, AgentDelegateAction
Observations: Results from the environment—stdout/stderr, file contents, web pages
Trajectories: The full sequence of actions and observations, serialized to disk (JSON or Pickle)

Figure: OpenHands Event Stream Architecture. An immutable event log sits between the agent (CodeActAgent: reasoning, planning, decision making) and the runtime (DockerRuntime: execution, sandbox), enabling agent-runtime separation and deterministic replay. Actions such as CmdRunAction and FileWriteAction flow to the runtime; observations such as CmdOutputObservation and FileReadObservation flow back.

Why this matters: Deterministic Replay. LLMs are non-deterministic nightmares to debug. When an agent fails, you can replay the exact event sequence and see where it went wrong. No guessing, no “works on my machine.”
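The idea of deriving state by replaying events can be sketched in a few lines. This is a toy illustration, not the OpenHands implementation: `Event`, `EventLog`, and `files_reducer` are hypothetical names, and only the action and observation names come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    id: int
    kind: str      # "action" or "observation"
    name: str      # e.g. "CmdRunAction", "FileWriteAction"
    payload: dict


class EventLog:
    """Append-only log: events are never updated, only added."""

    def __init__(self):
        self._events = []
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def append(self, kind, name, payload):
        event = Event(len(self._events), kind, name, dict(payload))
        self._events.append(event)
        for callback in self._subscribers:
            callback(event)  # e.g. a frontend updating its view
        return event

    def replay(self, reducer, initial):
        """Rebuild state from scratch by folding over every event in order."""
        state = initial
        for event in self._events:
            state = reducer(state, event)
        return state


def files_reducer(state, event):
    """Current file contents are whatever the latest write to each path says."""
    if event.kind == "action" and event.name == "FileWriteAction":
        return {**state, event.payload["path"]: event.payload["content"]}
    return state


log = EventLog()
seen = []
log.subscribe(seen.append)
log.append("action", "FileWriteAction", {"path": "src/fix.py", "content": "v1"})
log.append("action", "CmdRunAction", {"command": "pytest tests/"})
log.append("observation", "CmdOutputObservation", {"exit_code": 1})
log.append("action", "FileWriteAction", {"path": "src/fix.py", "content": "v2"})
files = log.replay(files_reducer, {})
print(files)
```

Because nothing is overwritten, the same log can be replayed with a different reducer to answer a different question, such as which commands failed and when.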

The codebase enforces this with a hard split: agenthub (the logic) and runtime (the execution) only talk through serialized events. No shortcuts, no shared state.

The EventStream assigns IDs to events and manages “subscriptions.” The frontend subscribes to get chat updates. The agent reads from it to know what happened last.

There’s been talk in the community about moving to synchronous ToolCall/ToolResult patterns to simplify the Python SDK. But the core idea stays the same: the source of truth is the event history, not some current state snapshot.

The Codebase Structure

OpenHands is organized as a modular monorepo:

openhands/agenthub/: The brains. Different agent implementations (CodeActAgent, BrowsingAgent, etc.). Plug-and-play interface: take State, return Action.
openhands/runtime/: The body. Spins up Docker containers, manages files, executes commands. Abstract Runtime base class with concrete implementations like DockerRuntime and E2BRuntime.
openhands/server/: FastAPI backend. Handles WebSocket connections, orchestrates the AgentController, routes events.
openhands/frontend/: React UI. Visualizes the Event Stream—chat interface, terminal emulator (xterm.js), Monaco editor.
containers/: Dockerfiles for the sandbox environments. Version-controlled with the code.

The Main Loop

The AgentController runs an infinite loop:

Gather recent history from Event Stream
Send to LLM (GPT-4, Claude, whatever)
Parse LLM response into an Action (CmdRunAction, etc.)
Dispatch to Runtime
Get back an Observation (stdout, exit code, etc.)
Add Observation to Event Stream
Go to step 1

Runs until the agent says it’s done (AgentFinishAction) or you kill it. The upcoming Python SDK will let you step through this loop manually, which should make debugging way easier.
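The loop above can be sketched with stand-ins for the LLM and the runtime. Everything here is hypothetical scaffolding (`run_agent`, `scripted_llm`, `fake_runtime`, and the dict-based actions); it mirrors only the control flow, with a step budget added so the sketch cannot spin forever.

```python
def run_agent(llm, runtime, history, max_steps=20):
    """Minimal controller loop: history in, action out, observation back in."""
    for _ in range(max_steps):
        action = llm(history)                       # steps 1-3: history -> LLM -> Action
        history.append(("action", action))
        if action["type"] == "AgentFinishAction":   # the agent says it is done
            return history
        observation = runtime(action)               # steps 4-5: dispatch, get Observation
        history.append(("observation", observation))  # step 6: record it
    raise RuntimeError("step budget exhausted before the agent finished")


def scripted_llm(history):
    """Stand-in for a real model: replays a fixed two-step plan."""
    script = [
        {"type": "CmdRunAction", "command": "pytest tests/"},
        {"type": "AgentFinishAction"},
    ]
    actions_so_far = sum(1 for kind, _ in history if kind == "action")
    return script[actions_so_far]


def fake_runtime(action):
    """Stand-in for the sandbox: pretends every command succeeds."""
    return {"type": "CmdOutputObservation", "exit_code": 0, "stdout": "ok"}


history = run_agent(scripted_llm, fake_runtime, [])
for kind, item in history:
    print(kind, item["type"])
```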

The Runtime: Sandboxing the Chaos

Letting an LLM run rm -rf / on your laptop is a bad idea. OpenHands solves this with Docker, but not in the obvious way.

How the Sandbox Actually Works

You can’t just run docker exec for every command. That creates a fresh shell each time, and state gets lost. If the agent runs export API_KEY=xyz, that needs to persist when it runs python script.py later.

OpenHands uses a client-server model across the Docker boundary:

Host side: RuntimeClient running on your machine
Container side: ActionExecutor, a Python HTTP server injected into the container

When the agent wants to run ls -la:

Agent generates CmdRunAction("ls -la")
RuntimeClient serializes it and POSTs to ActionExecutor inside the container
ActionExecutor runs it in a persistent shell session (PTY), captures stdout/stderr/exit code
Response goes back to RuntimeClient
Backend wraps it in CmdOutputObservation and pushes to the event stream

Figure: OpenHands Sandbox Architecture. Multi-layered Docker isolation: filesystem (separate root filesystem with a /workspace mount), network (virtual network with restricted external access), process (dedicated PID namespace with cgroup resource limits), and user (non-privileged user with restricted capabilities). The sandbox talks to the host through stdin/stdout, the /workspace volume mount, and an API socket.

The persistent shell is the key. Environment variables, working directory, shell history—it all persists across commands. The agent gets something that feels like an actual computer, not a stateless command executor.
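To see why a persistent session matters, here is a minimal sketch that keeps one shell process alive across commands. It is not how ActionExecutor is implemented: it uses a plain pipe rather than a PTY, skips the HTTP layer entirely, assumes a POSIX system with /bin/sh, and `PersistentShell` is a hypothetical name.

```python
import subprocess
import uuid


class PersistentShell:
    """One long-lived shell process; state survives between run() calls."""

    def __init__(self, shell="/bin/sh"):
        self.proc = subprocess.Popen(
            [shell],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout
            text=True,
        )

    def run(self, command):
        # A unique marker tells us where this command's output ends
        # and carries the exit code back.
        marker = f"__DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {marker} $?\n")
        self.proc.stdin.flush()
        output = []
        while True:
            line = self.proc.stdout.readline()
            if line == "":
                raise RuntimeError("shell exited before the command finished")
            if marker in line:
                before, _, rest = line.partition(marker)
                output.append(before)
                return "".join(output), int(rest.split()[0])
            output.append(line)

    def close(self):
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()


shell = PersistentShell()
shell.run("export API_KEY=xyz")
shell.run("cd /tmp")
value, _ = shell.run("echo $API_KEY")   # still set from the earlier command
cwd, _ = shell.run("pwd")               # still in the directory we changed to
_, failed = shell.run("false")          # exit codes are captured per command
shell.close()
print(value.strip(), cwd.strip(), failed)
```

Running each of those commands through a fresh process instead would lose both the exported variable and the working directory, which is exactly the problem described above.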

The Docker Socket Problem

Sometimes the agent needs to use Docker itself—like building a container for your app. OpenHands handles this by mounting the host’s Docker socket (/var/run/docker.sock) into the sandbox.

This is powerful but dangerous. Mounting the Docker socket gives the container root access to the host. It’s “Docker-out-of-Docker” (not true Docker-in-Docker), and it comes with trade-offs:

Power: The agent can do anything Docker can do
Complexity: Network routing gets weird, especially on macOS/Windows where Docker runs in a VM. httpx.ConnectError issues are common.
Security: You’re basically trusting the container with your host. OpenHands mitigates this by controlling the image, but it’s still a calculated risk.

Other Runtime Options

Docker isn’t the only choice. OpenHands abstracts the runtime, so you can swap it out:

Daytona: Remote, managed dev environments. Offloads compute to the cloud instead of burning your laptop’s battery.
E2B: Firecracker-based VMs designed for AI code execution. Better isolation than Docker, faster startup.

You pick your runtime in config.toml. Same agent code, different execution environment. This is the kind of abstraction that separates production systems from hackathon demos.

CodeAct: Code As the Interface

Early AI agents used JSON tool calling for everything. Want to edit a file? Emit a JSON blob. Run a command? Another JSON blob. Brittle, verbose, and you had to define custom tools for every possible action.

Code Is the Tool

CodeAct flips this. Instead of 50 custom tools (list_files, create_file, search_web), just give the agent Python and Bash.

Need to count lines in all Python files? Write code:

import glob
files = glob.glob("**/*.py", recursive=True)
total_lines = 0
for f in files:
    with open(f) as file:
        total_lines += len(file.readlines())
print(total_lines)

Or use Bash:

find . -name "*.py" | xargs wc -l

Why this works better:

One language for everything: Logic, control flow, and tool execution all use Python/Bash.
More expressive: Write loops, conditionals, error handling in a single action. Try to read a file, catch FileNotFoundError, create it—all in one LLM turn. Fewer round-trips = lower cost and latency.
Free library access: The entire Python ecosystem (pandas, requests, numpy) works out of the box. No wrapper code needed.

How It Works

The CodeActAgent uses a carefully crafted system prompt (system_prompt.j2):

“You can execute Python code in ```python blocks”
“You can execute Bash in ```bash blocks”
“Verify your changes by running tests”

The backend parses the LLM’s markdown response. Code blocks get extracted and sent to the JupyterPlugin (for Python) or BashPlugin (for Bash) inside the container. The JupyterPlugin maintains an interactive IPython kernel, so variables persist across code blocks.
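The parsing step can be approximated with a regular expression. This is a simplified sketch rather than the actual OpenHands parser; `extract_code_actions` is a hypothetical helper, and it only recognizes fenced blocks tagged python or bash.

```python
import re

TICKS = "`" * 3  # built programmatically so the fence marker never appears literally
BLOCK = re.compile(TICKS + r"(python|bash)\n(.*?)" + TICKS, re.DOTALL)


def extract_code_actions(response):
    """Return (language, code) pairs in the order they appear in the response."""
    return [(lang, code.strip()) for lang, code in BLOCK.findall(response)]


response = (
    "I'll check the test suite first.\n"
    f"{TICKS}bash\npytest tests/\n{TICKS}\n"
    "Then count the failures.\n"
    f"{TICKS}python\nprint(1 + 1)\n{TICKS}\n"
)
actions = extract_code_actions(response)
for lang, code in actions:
    print(lang, "->", code)
```

Each pair would then be routed by language: Python to the long-lived kernel, Bash to the shell session.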

Multiple Agents, One Task

One agent gets lost in a 10,000-file repo. Context window fills with noise, and it forgets what it’s doing.

OpenHands uses agent delegation:

Manager Agent: High-level planner. Breaks “Refactor auth module” into sub-tasks.
RepoStudyAgent: Explorer. Maps the codebase without modifying it.
VerifierAgent: QA specialist. Writes tests, verifies fixes work.
BrowsingAgent: Reads docs and StackOverflow via Playwright.

The main agent can delegate: “I need to know how to use the Stripe API. @BrowsingAgent, find the docs for creating a customer.” BrowsingAgent spins up, does the research, returns a summary. Main agent stays focused on the high-level task.

Tool Integration: MCP

The old problem: N agents × M tools = N×M custom integrations. Want your agent to use Jira, Slack, GitHub, and Linear? Write four separate integrations. For every agent.

OpenHands uses the Model Context Protocol (MCP), an open standard from Anthropic. Think of it as USB-C for AI tools.

How MCP Works

MCP Server: Exposes tools (functions) and resources (data). A GitHub MCP server might expose create_issue and active_pull_requests.
MCP Client (OpenHands): Connects via stdio or SSE. Asks: “What tools do you have?” Gets back JSON schemas. Injects them into the agent’s system prompt.

OpenHands doesn’t know about GitHub or Slack. It just knows MCP. You can write a custom MCP server for your proprietary database, point OpenHands at it, and the agent can use it immediately.

Auth That Doesn’t Suck

What if the agent tries to read your private Slack DMs? OpenHands handles this with OAuth via FastMCP.

When the agent tries to use an authenticated tool, MCP pauses execution and shows you an OAuth flow. You log in, consent, and the token gets stored for that session. The agent acts with your permissions, not as some omniscient god.

Configuration: From TOML Hell to Python Objects

OpenHands used to require a config.toml file with a million environment variables: SANDBOX_IMAGE, WORKSPACE_MOUNT_PATH, LLM_API_KEY, debug flags, etc. Global state everywhere. Good luck running two agents with different configs.

The new Python SDK fixes this:

from openhands.sdk import CodeActAgent, DockerRuntime

agent = CodeActAgent(
    llm_config={"model": "claude-3-5-sonnet"},
    system_prompt="You are a senior python engineer."
)

runtime = DockerRuntime(image="my-custom-image")

# Called synchronously (no await), consistent with the SDK being synchronous by default
agent.run(task="Fix the bug in main.py", runtime=runtime)

Code, not config files. Agents are objects. You can run them in threads, pause them, inspect state, resume. Synchronous by default, which makes debugging way easier.

Evaluation: SWE-bench, the Reality Check

Demo videos are easy. Proving your agent actually works is hard. OpenHands uses SWE-bench—real GitHub issues from Django, scikit-learn, Flask, etc.

How SWE-bench Works

Start with the codebase before the bug fix
Give the agent the issue description
Let it explore, reproduce the bug, write a patch
Apply the patch, run the test suite
Pass = new test passes + no regressions
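
The pass criterion can be written down in a few lines. SWE-bench stores the two test groups as FAIL_TO_PASS (tests the fix must make pass) and PASS_TO_PASS (tests that must keep passing). The function name and the test names below are hypothetical, and this is a sketch of the rule rather than the official harness.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """test_results maps test name -> True if it passed after applying the patch."""
    fixes_bug = all(test_results.get(name, False) for name in fail_to_pass)
    no_regressions = all(test_results.get(name, False) for name in pass_to_pass)
    return fixes_bug and no_regressions


# Made-up results: the target test passes, but test_logout broke.
results = {"test_login_unicode": True, "test_login_basic": True, "test_logout": False}

good = is_resolved(results, ["test_login_unicode"], ["test_login_basic"])
regressed = is_resolved(
    results, ["test_login_unicode"], ["test_login_basic", "test_logout"]
)
```

A missing test counts as a failure here, which is the conservative choice when a patch breaks test collection.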

This is brutal. The agent can’t just fix the obvious bug. It has to not break anything else.

The Infrastructure Problem

SWE-bench is expensive to run. Gigabytes of Docker images, thousands of containers. Epoch AI compressed the images from ~680 GB to ~67 GB by deduplicating layers. OpenHands runs evaluations in parallel on cloud infrastructure, turning days into minutes.

The Cost Problem

Running the full SWE-bench suite costs hundreds of dollars in API credits. The agent reads thousands of lines of code, generates verbose responses for every issue. SWE-bench Lite (300 issues) and SWE-bench Verified (human-verified subset) exist for people who don’t have unlimited budgets.

Performance: Where Things Stand

The Numbers

OpenHands with Claude 3.5 Sonnet hits around 53% on SWE-bench Verified.

But here’s the interesting part: Inference Time Scaling. Run the agent 5 times on the same problem, use a critic model or voting to pick the best patch, and you can hit 66%. The bottleneck isn’t intelligence, it’s randomness.
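
As a rough sketch of the selection step, assuming plain majority voting over the candidate patches (a critic model would replace `pick_by_vote` with a learned scorer), the logic is tiny. The function name and patch strings are invented for illustration.

```python
from collections import Counter


def pick_by_vote(patches: list[str]) -> str:
    """Return the patch text that the most runs agree on (ties go to the earliest)."""
    normalized = [p.strip() for p in patches]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


# Five made-up runs on the same issue: three agree after whitespace normalization.
candidates = ["fix A", "fix B", "fix A ", "fix C", "fix A"]
chosen = pick_by_vote(candidates)
```

Real systems usually compare patches by test outcomes or a critic score rather than raw text, since two correct patches rarely match character for character.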

Why Agents Fail

Even at 53-66%, agents fail a lot. The failure modes are instructive.

Infinite Loops

Agent tries a fix. Test fails. Agent tries the exact same fix again. Repeat until you run out of tokens.

This happens because of context truncation. When the context window fills up, OpenHands truncates old history. If the agent’s memory of “I already tried this” gets truncated, it’s stuck in Groundhog Day.

Context Pollution

Agent runs find / -name "*.py" and dumps 10,000 lines of output into its context. Or cats a massive log file. Context window fills with noise. LLM starts hallucinating file paths, forgets what it was supposed to do.

Solution: Active context management. Summarize old events, delete large observations, keep the “working memory” clean.
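
One simple form of this, sketched under the assumption that keeping the head and tail of a large observation is enough, is to clip tool output before it enters the history. `clip_observation` and its default limit are hypothetical, not an OpenHands API.

```python
def clip_observation(text: str, max_lines: int = 20) -> str:
    """Keep the first and last lines of a long observation and note what was cut."""
    if max_lines < 2:
        raise ValueError("max_lines must be at least 2")
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    head = max_lines // 2
    tail = max_lines - head
    omitted = len(lines) - max_lines
    marker = f"[... {omitted} lines omitted ...]"
    return "\n".join(lines[:head] + [marker] + lines[-tail:])


# Simulated output of a recursive file search with 10,000 hits.
big = "\n".join(f"/src/file_{i}.py" for i in range(10_000))
clipped = clip_observation(big)
```

The marker line tells the model that output was cut, so it can rerun a narrower command instead of guessing at paths.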

Lazy Coding

Agent writes # ... rest of code ... instead of the full file. Saves tokens, breaks the file when written to disk. OpenHands needs linting to catch this before it causes syntax errors.

Failure Mode Summary

Failure Mode	Cause	How OpenHands Mitigates
Infinite Loop	Context truncation	Trajectory analysis, event summarization
Hallucination	Context overflow	Tool-based code search, event condensation
Regression	Fixing one bug, breaking others	VerifierAgent runs full test suite
Timeout	Docker/network issues	Persistent sessions, cloud runtimes

Conclusion: The New Software Engineering Workflow

Code assist isn’t replacing developers, but it is certainly changing how we work. The architecture behind systems like OpenHands reveals the shift: event sourcing for debuggability, Docker sandboxing for safety, CodeAct for expressiveness, MCP for extensibility. These are building blocks for a new kind of development workflow.

What makes modern tools like OpenHands and Claude Code particularly powerful is the convergence of capabilities that OpenHands pioneered:

Extended thinking: Models that can reason through complex refactoring before touching code
Prompt caching: Reusing codebase context across sessions without re-indexing
Tool integration: MCP servers that let agents interact with your actual development environment—Jira, databases, CI/CD pipelines
Computer use: Agents that can navigate IDEs, run terminal commands, and interact with your full development stack

For bug fixes, boilerplate generation, and mechanical refactoring, having an autonomous agent that executes code, verifies its work, and iterates on failures isn’t a demo anymore. It’s production-ready infrastructure that’s reshaping how engineering teams operate. The question isn’t whether to adopt these tools, but how to integrate them into your workflow before your competitors do.

References

Wang, X., et al. (2025). The OpenHands Software Agent SDK: A Composable Framework for Building AI Agents. arXiv preprint arXiv:2511.03690.
- The official technical paper describing the OpenHands architecture, event-sourcing model, and SDK design.
Wang, X., et al. (2024). Executable Code Actions Elicit Better LLM Agents. arXiv preprint arXiv:2402.01030v4.
- Introduces the CodeAct framework that uses code as the universal action interface instead of JSON tool calling.
OpenHands Documentation. Runtime Architecture.
- Official documentation explaining the sandbox architecture, Docker runtime, and client-server model.
Anthropic. Model Context Protocol (MCP).
- Official specification for the Model Context Protocol used for dynamic tool discovery and integration.
Jimenez, C., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- The benchmark used to evaluate code assist agents on real-world software engineering tasks.
Wang, X., et al. (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. OpenReview.
- Comprehensive overview of the OpenHands platform, agent capabilities, and design philosophy.

QuIP#: Achieving Near-Lossless 2-Bit LLM Quantization

Thu, 16 Oct 2025 00:00:00 +0800

A deep dive into the mathematical elegance that makes extreme compression possible

1. Introduction: The Compression Challenge

1.1 The Impossible Dream: Running 70B Models on Your Gaming PC

Picture this: You have a gaming PC with an RTX 4090, a beast of a card with 24GB of VRAM. You want to run Llama 2 70B, one of the most powerful open-source language models available. Here’s the brutal math:

At full precision (FP16): 70 billion parameters × 2 bytes = 140GB
Your available VRAM: 24GB
The gap: You’d need 6 of these GPUs (140GB ÷ 24GB ≈ 5.8) just to hold the weights.

This isn’t just an inconvenience, it’s a fundamental barrier. State-of-the-art AI remains locked in data centers, accessible only to well-funded labs and companies. Consumer hardware, edge devices, and even many research institutions are simply shut out.

Quantization promised to change this. By representing weights with fewer bits, we could compress these models to fit on accessible hardware. The progression seemed clear:

8-bit quantization (2020-2021): Mostly lossless, 2× compression → 70GB (still too big)
4-bit quantization (2022-2023): Near-lossless with methods like GPTQ, AWQ → 35GB (getting closer!)
2-bit quantization (2023-2024): The holy grail → ~18GB (fits on a single RTX 4090!)

But there was a problem: nobody could make 2-bit work.

1.2 Why 2-Bit Was Considered Impossible

By late 2023, the field had hit a wall. Every attempt to quantize LLMs below 3 bits resulted in catastrophic quality degradation. The core challenge: LLM weight matrices contain outliers—a small number of weights that are 100-1000× larger than the rest. At 2 bits (only 4 distinct values), there’s insufficient resolution to represent both normal weights and outliers accurately.

Existing Methods Hit Hard Limits

Let’s look at what the state-of-the-art methods achieved at 2 bits on Llama 2 70B (WikiText2 perplexity with context length 2048, lower is better):

Method	Approach	2-bit PPL	Result
FP16 (baseline)	—	3.32	Perfect quality
OmniQuant	Learned transformations	7.81	Barely usable
AWQ	Activation-aware scaling	11.9+	Completely broken
GPTQ	Optimal Brain Quantization	6.11	Poor quality

Every method lands far from the FP16 baseline. OmniQuant at 7.81 is more than 2× worse (7.81 / 3.32 ≈ 2.4), and even the lowest 2-bit score in the table, 6.11, is about 1.8× worse. Models were incoherent, repetitive, and failed basic reasoning tasks.

The consensus emerged: 4 bits is the practical minimum.

Tim Dettmers and Luke Zettlemoyer even published a paper in 2023 arguing that “4-bit precision is optimal” for LLMs, with diminishing returns below that threshold.

The 2-bit dream seemed dead.

1.3 The QuIP# Breakthrough

Then came QuIP#. The results were unprecedented:

First method to achieve near-lossless 2-bit quantization (4.16 PPL vs 3.32 FP16 baseline on Llama 2 70B, context 2048)
3-bit models scale better than “theoretically lossless” 4-bit models: on Llama 2 70B, 3-bit trails 4-bit only slightly (3.56 vs 3.47 PPL) while using 25% fewer bits
Scales better than higher bitrates as model size increases

[Figure: QuIP# scaling, 3-bit vs 4-bit. WikiText2 perplexity vs. total model size (Llama 2), lower is better. QuIP# 3-bit models scale better than 4-bit, which runs against the 2023 consensus that "4-bit is optimal".]

What changed? QuIP# combines three techniques in a principled, mathematically elegant way:

Randomized Hadamard Transform (RHT) for incoherence processing
E8 lattice codebooks for optimal sphere packing
Block-LDLQ adaptive rounding with Hessian awareness

Each component addresses a specific mathematical challenge. Together, they enable what was thought impossible.

But before we dive into QuIP#’s solution, we need to understand the fundamental challenges that made 2-bit quantization seem impossible in the first place.

2. Background: What Makes Quantization Hard?

2.1 Quantization Basics: The Storage vs. Accuracy Bargain

The Core Idea

Imagine you’re moving to a smaller apartment and need to compress your belongings. You could:

Pack everything loosely (takes many boxes, but nothing gets damaged)
Compress everything tightly (fits in fewer boxes, but some items might break)

Quantization is exactly this trade-off for neural network weights. Each weight in a model is originally stored as a 16-bit floating-point number (FP16), giving it incredible precision. But do we really need that much precision?

Quantization reduces the memory footprint of LLMs by representing weights with fewer bits:

FP16 (16 bits): 70 billion parameters × 2 bytes = 140GB
4-bit: 70 billion parameters × 0.5 bytes = 35GB (4× compression)
2-bit: 70 billion parameters × 0.25 bytes = ~18GB (8× compression) ✓ Fits on RTX 4090!
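
The arithmetic behind these figures is just parameters × bits ÷ 8. A small sketch follows; the helper name is made up, and it counts weights only, ignoring KV cache, activations, and codebook overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


LLAMA2_70B = 70e9  # parameter count used throughout the post
sizes = {bits: weight_memory_gb(LLAMA2_70B, bits) for bits in (16, 8, 4, 2)}
fits_on_24gb = {bits: gb <= 24 for bits, gb in sizes.items()}
```

Only the 2-bit row falls under 24 GB, which is why 2 bits is the threshold that matters for a single RTX 4090.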

Why It’s Not Just Rounding

You might think: “Why not just round each weight to the nearest 2-bit value?” Here’s why that fails catastrophically.

Consider a simple weight matrix row: [0.1, 0.15, 0.2, 0.12, 0.18, 0.11, 0.16]

With 2 bits, you can only represent 4 distinct values (say: {0, 0.1, 0.2, 0.3}). Naively rounding each weight independently would map everything to either 0.1 or 0.2, obliterating the subtle differences between 0.11, 0.12, and 0.15 that might be critical for the model’s behavior.

The true challenge: How do you choose which information to preserve when you can only afford 4 distinct values per dimension?
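
To see the collapse concretely, here is a small sketch of independent round-to-nearest on the example row. The function name is made up, and the grid is the illustrative {0, 0.1, 0.2, 0.3} from above rather than anything a real quantizer would pick.

```python
def round_to_nearest(weights, levels):
    """Map each weight independently to the closest available level."""
    return [min(levels, key=lambda q: abs(w - q)) for w in weights]


row = [0.1, 0.15, 0.2, 0.12, 0.18, 0.11, 0.16]
levels = [0.0, 0.1, 0.2, 0.3]  # the 4 values that 2 bits can index

quantized = round_to_nearest(row, levels)
distinct_before = len(set(row))
distinct_after = len(set(quantized))
```

Seven distinct weights collapse onto two levels, and the 0.0 and 0.3 levels go unused, so half of the 2-bit budget is wasted on this row.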

This is where the mathematics gets interesting.

2.2 The Outlier Problem: The 1% That Breaks Everything

The Hidden Structure of LLM Weights

LLM weight matrices aren’t uniform—they contain outliers: a small number of weights that are 100-1000× larger than the rest. This isn’t a bug; it’s a fundamental feature of how transformers learn.

A Visual Example

Typical weights: [0.1, 0.15, 0.2, 0.12, 0.18, 0.11, 0.16]
Outlier weight:  [150.0]

That single outlier dominates the entire matrix. Why? Because in the forward pass, y = W·x, even if x is small, that 150.0 weight creates a huge activation that completely changes the output.

The Quantization Dilemma

You face an impossible choice:

Scale for the outlier: Set your 4 quantization levels to cover the range [0, 150]. Now your levels might be {0, 50, 100, 150}. But this crushes all normal weights (0.1, 0.15, 0.2…) to zero! You’ve destroyed 99% of the weights to preserve 1%.
Ignore the outlier: Set your levels to {0, 0.1, 0.2, 0.3} to capture normal weights well. But now the 150.0 outlier gets clipped to 0.3—a 500× error that will cause catastrophic failures in the model’s output.

At 4 bits (16 distinct values), you have enough resolution to handle this with techniques like per-group scaling. At 2 bits (only 4 values), the math breaks down.
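
Both horns of the dilemma show up in a few lines. This is a sketch with the illustrative numbers from above: the helper name is made up, and clipping is modeled as rounding to the nearest level.

```python
def quantize(weights, levels):
    """Round each weight to the nearest level; out-of-range values clip to the edge."""
    return [min(levels, key=lambda q: abs(w - q)) for w in weights]


weights = [0.1, 0.15, 0.2, 0.12, 0.18, 0.11, 0.16, 150.0]  # last entry is the outlier

wide = quantize(weights, [0.0, 50.0, 100.0, 150.0])  # grid scaled for the outlier
narrow = quantize(weights, [0.0, 0.1, 0.2, 0.3])     # grid scaled for normal weights

normal_weights_lost = all(q == 0.0 for q in wide[:-1])
outlier_shrink_factor = weights[-1] / narrow[-1]
```

The wide grid keeps the outlier and zeroes everything else. The narrow grid keeps the normal weights and shrinks the outlier by a factor of about 500.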

This is why you can’t just “turn down the bits” and expect things to work. The outlier problem is the fundamental barrier that prevented 2-bit quantization from being viable until QuIP#.

2.3 Why Existing Methods Fail at 2-Bit

By late 2023, the field had hit a wall. Let’s examine what the state-of-the-art methods achieved at 2 bits on Llama 2 70B (context length 2048):

Method	Approach	2-bit Perplexity	Verdict
FP16 (baseline)	—	3.32	Perfect quality
OmniQuant	Learned transformations	7.81	Barely usable
AWQ	Activation-aware scaling	11.9+	Completely broken
GPTQ	Optimal Brain Quantization	6.11	Poor quality

(Lower perplexity = better. OmniQuant at 7.81 is more than 2× worse than the FP16 baseline, and even the lowest 2-bit score in the table, 6.11, is about 1.8× worse.)

Why Each Method Failed

AWQ & OmniQuant: Use heuristic outlier suppression via activation-aware rescaling. At 2 bits, these heuristics aren’t strong enough—outliers still dominate.
GPTQ: Per-group scaling adds 0.25 bits of overhead per weight (12.5% at 2 bits) and only mitigates the outlier problem rather than solving it.
SpQR: Stores outliers separately in FP16, but irregular memory access patterns kill GPU performance.
AQLM: Achieves good quality but uses 1MB codebooks that cause cache misses, making it slower than FP16.

QuIP#’s Key Insight: Instead of fighting outliers with heuristics, eliminate them entirely through principled mathematical transformation (Randomized Hadamard Transform). We’ll explore this in Section 4.

Now that we understand why existing methods failed, let’s see how QuIP# solves these challenges through three synergistic components.

3. QuIP#’s Three Pillars

3.1 High-Level Architecture

QuIP# is built on three synergistic components:

Incoherence Processing with RHT: Transforms weights to eliminate outliers
E8 Lattice Codebooks: Matches quantization to the transformed weight distribution
Block-LDLQ Adaptive Rounding: Accounts for weight interdependencies

Each component solves a specific mathematical problem:

Original Weights (with outliers)
         ↓
    [RHT Transform]
         ↓
Incoherent Weights (Gaussian, ball-shaped)
         ↓
  [E8P Quantization]
         ↓
Quantized Weights (minimal error)
         ↓
 [Block-LDLQ Rounding]
         ↓
Final Quantized Model

The beauty is in how these pieces fit together. The RHT makes the weights approximately Gaussian, so each block of 8 weights is roughly ball-shaped. The E8 lattice is the provably densest sphere packing in 8D, which makes it a natural codebook for ball-shaped data. And Block-LDLQ uses the Hessian to minimize the final reconstruction error.

Let’s explore each pillar in depth, starting with the transformation that eliminates outliers.

4. Pillar 1: Incoherence Processing with Randomized Hadamard Transform

4.1 The Core Idea: Spread the Risk

The Building Analogy

Imagine a building supported by many pillars. If one pillar bears 99% of the weight, the building will collapse if that pillar fails. But if the weight is evenly distributed across all pillars, the building can withstand the loss of any single support.

Incoherence processing does this for neural network weights. Instead of having a few “weight pillars” bear most of the importance, we redistribute the load so no single weight is critical.

Mathematical Formulation

The magic: We transform weights W using orthogonal matrices U and V:

W' = U·W·V^T

The key properties:

Preserves the forward pass: since W = U^T·W'·V, we have y = W·x = U^T·W'·V·x (we just transform inputs/outputs accordingly)
Redistributes magnitude: Outliers get “spread out” across many weights
No information loss: Orthogonal matrices are invertible

During inference, we compute: y = U^T·(W'·(V·x))
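
A quick numerical check of that identity, using 2×2 rotation matrices as stand-ins for the orthogonal U and V. The angles, W, and x are arbitrary, the helpers are hand-rolled to keep the sketch dependency-free, and real QuIP# uses Hadamard-based matrices instead.

```python
import math


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def rotation(theta):
    """A 2x2 rotation matrix, which is orthogonal by construction."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


U, V = rotation(0.7), rotation(-1.3)
W = [[0.1, 150.0], [0.2, 0.15]]
x = [[1.0], [2.0]]  # column vector

W_t = matmul(matmul(U, W), transpose(V))                   # W' = U W V^T
y_direct = matmul(W, x)                                    # y = W x
y_via = matmul(transpose(U), matmul(W_t, matmul(V, x)))    # y = U^T (W' (V x))
max_diff = max(abs(a[0] - b[0]) for a, b in zip(y_direct, y_via))
```

The two outputs agree up to floating-point rounding, because U^T·U and V^T·V are both the identity.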

4.2 What is Incoherence?

Intuitive Explanation

An incoherent matrix has no outliers—all entries have similar magnitude. Think of it like a democracy where no single vote dominates, versus a dictatorship where one voice controls everything.

Formal Definition

For a weight matrix W ∈ ℝ^(m×n), we say it’s μ-incoherent if:

max|W_ij| ≤ μ·||W||_F / √(m·n)

Understanding the Inequality

Let’s break this down:

W_ij: Single entry at row i, column j
max|W_ij|: The largest absolute value (the outlier)
||W||_F: Frobenius norm = √(Σᵢⱼ W²ᵢⱼ) (total magnitude of all weights)
√(m·n): Normalization by matrix size
μ: Incoherence parameter (smaller = better)

What it means in plain English: “The biggest entry ≤ μ × the root-mean-square entry” (||W||_F / √(m·n) is exactly the RMS of the entries)

Concrete Examples

Incoherent matrix (μ ≈ 1): All entries ≈ 0.28

[0.27, 0.29, 0.26, 0.30]
[0.28, 0.27, 0.31, 0.27]
[0.29, 0.28, 0.27, 0.29]

Matrix with outlier (μ ≈ 3.4, close to the maximum of √12 ≈ 3.46 that a 3×4 matrix can reach): One entry = 5.5, about 20× the others ≈ 0.28

[0.27, 0.29, 0.26, 0.30]
[0.28, 5.50, 0.31, 0.27]  ← Outlier!
[0.29, 0.28, 0.27, 0.29]
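
The definition is easy to evaluate directly. This sketch computes the smallest μ that satisfies the inequality for the two example matrices; the function name is made up. For an m×n matrix μ can never exceed √(m·n), so this 3×4 toy tops out near 3.46 even though its outlier is about 20× the typical entry.

```python
import math


def incoherence(W):
    """Smallest mu with max|W_ij| <= mu * ||W||_F / sqrt(m * n)."""
    m, n = len(W), len(W[0])
    frob = math.sqrt(sum(v * v for row in W for v in row))
    peak = max(abs(v) for row in W for v in row)
    return peak * math.sqrt(m * n) / frob


flat = [
    [0.27, 0.29, 0.26, 0.30],
    [0.28, 0.27, 0.31, 0.27],
    [0.29, 0.28, 0.27, 0.29],
]
spiky = [row[:] for row in flat]
spiky[1][1] = 5.50  # the outlier

mu_flat = incoherence(flat)
mu_spiky = incoherence(spiky)
```

On real weight matrices with thousands of rows and columns the ceiling √(m·n) is far higher, which is how outliers can push μ into double digits.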

4.3 The Randomized Hadamard Transform (RHT)

Why Hadamard?

The Hadamard matrix has special properties that make it perfect for incoherence processing:

Orthogonal: Preserves information (no loss)
Binary entries: All entries are ±1 up to a 1/√n scaling, so applying it needs only additions and subtractions (no multiplies!)
Fast to compute: O(n log n) via Fast Walsh-Hadamard Transform

The RHT Construction

The transformation is:

W' = H·diag(S_U)·W·diag(S_V)·H^T

Where:

H: Hadamard matrix (orthogonal, entries in {-1, +1})
S_U, S_V: Random sign vectors (diagonal ±1 matrices)
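
Here is a small pure-Python sketch of that construction on an 8×8 toy matrix with one large outlier. It assumes power-of-2 dimensions and a normalized Hadamard matrix; the function names, the seed, and the toy matrix are illustrative, and a real implementation would use a fused GPU kernel.

```python
import math
import random


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform of a length-2^k sequence."""
    v = list(vec)
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [x * scale for x in v]


def rht(W, rng):
    """Compute H diag(S_U) W diag(S_V) H^T with random sign vectors S_U, S_V."""
    m, n = len(W), len(W[0])
    s_u = [rng.choice((-1, 1)) for _ in range(m)]
    s_v = [rng.choice((-1, 1)) for _ in range(n)]
    signed = [[s_u[i] * W[i][j] * s_v[j] for j in range(n)] for i in range(m)]
    rows = [fwht(r) for r in signed]      # right-multiply by H^T (H is symmetric)
    cols = [fwht(c) for c in zip(*rows)]  # left-multiply by H, column by column
    return [list(r) for r in zip(*cols)]


def frobenius(W):
    return math.sqrt(sum(v * v for row in W for v in row))


def incoherence(W):
    m, n = len(W), len(W[0])
    return max(abs(v) for row in W for v in row) * math.sqrt(m * n) / frobenius(W)


# Toy matrix: small entries plus a single 150.0 outlier.
W = [[0.1 + 0.01 * ((8 * i + j) % 10) for j in range(8)] for i in range(8)]
W[1][1] = 150.0
W_t = rht(W, random.Random(0))
mu_before, mu_after = incoherence(W), incoherence(W_t)
```

The Frobenius norm is unchanged because the transform is orthogonal, while μ drops from nearly its maximum of 8 to just above 1: the outlier's magnitude has been spread across all 64 entries.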

Theoretical Guarantees

Lemma 3.1 from the QuIP# paper: With high probability (1-δ), the RHT achieves:

μ_H = √(2·log(2n²/δ))

This is a major improvement over QuIP’s Kronecker approach:

QuIP (Kronecker): μ = O(log² n)
QuIP# (RHT): μ = O(√log n)

Better incoherence means less quantization error!

Runtime Improvement

QuIP (Kronecker): O(n√n) operations
QuIP# (RHT): O(n log n) operations

For Llama 2 70B with n=28,672, the ratio is √n / log₂ n ≈ 169 / 14.8, so the transform needs roughly 11× fewer operations (ignoring constant factors).

4.4 Handling Non-Power-of-2 Dimensions

The Challenge

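Hadamard matrices can only exist for dimensions 1, 2, and multiples of 4. The Hadamard conjecture says one exists for every multiple of 4; this is still unproven in general, but constructions are known for every size that matters here. However, for efficient computation using the Fast Walsh-Hadamard Transform, we prefer power-of-2 dimensions. But LLMs have all sorts of dimensions: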

Llama 2 70B has intermediate dimension 28,672 = 1024 × 28

The Solution: Kronecker Product

We can factorize n = p×q where p is a power of 2 and q is another valid Hadamard dimension:

H = H_p ⊗ H_q

For Llama 2: H = H_1024 ⊗ H_28

This gives us:

Compute time: O(q²p log p)
For 28,672: Much faster than QuIP’s O(n√n)

4.5 Why This Works for Quantization

After RHT, weights become approximately Gaussian distributed:

Before: Spiked distribution with outliers
After: Smooth Gaussian bell curve

This is crucial because:

No single weight is critical → quantization errors spread evenly
Roughly ball-shaped in high dimensions → perfect for E8 lattice (next section!)
Predictable error bounds → we can prove theoretical guarantees

🎯 Randomized Hadamard Transform (RHT)

From Outlier-Dominated Chaos to Gaussian Harmony

The Outlier Problem

⚠️ Before RHT: Weight Matrices Have Outliers

LLM weight matrices contain a small number of weights that are 100-1000× larger than the rest. These outliers dominate quantization, making 2-bit compression impossible.

[Interactive figure: 8×8 sample weight matrix with problem metrics (max weight, average weight, outlier ratio, incoherence μ)]

📊 What This Means

A high incoherence score (μ) means a few weights are vastly larger than others. When quantizing to a 2-bit format (only 4 possible values), this creates a dilemma:

Option A: Scale the quantization grid to include the outliers. This crushes all the smaller (but important!) weights to zero.
Option B: Scale the grid for the normal weights. This results in catastrophic clipping errors for the outliers.

Both options lead to massive accuracy loss. This is why 2-bit quantization was considered impossible for so long.

[Figure: weight distribution before RHT]

The RHT Transformation

🔄 How RHT Works

The Randomized Hadamard Transform redistributes weight magnitude using orthogonal matrices, eliminating outliers while preserving all information.

W' = H · diag(S_U) · W · diag(S_V) · H^T

Original Weights W

Contains outliers, μ ≈ 20

↓

Apply Hadamard Matrix H

Orthogonal transform with ±1 entries

Fast: O(n log n) via FWHT

↓

Random Sign Flips S_U, S_V

Diagonal matrices with ±1 entries

Breaks correlation patterns

↓

✨ Transformed Weights W'

Incoherent, μ ≈ √(log n)

Magnitude spread evenly across all weights!

🔑 Key Properties

Orthogonal: No information loss (invertible)
Fast: O(n log n) via Fast Walsh-Hadamard Transform
Hardware-friendly: Entries are ±1, so the transform needs only additions and subtractions (no multiplies!)
Proven bound: μ_RHT = √(2 log(2n²/δ)) with high probability

Before vs After

Incoherence μ: 20.4 → 2.1
Max/Avg Ratio: 152× → 4.2×

Roughly a 10× improvement in incoherence! No single weight dominates anymore.

⚡ Computational Cost

Complexity: O(n log n), vs. O(n√n) for QuIP's Kronecker method (about 11× fewer operations at n = 28,672)

The RHT is not only more effective than the Kronecker product method used in the original QuIP paper, but it's also significantly faster to compute.

The Transformed Result

✨ After RHT: Gaussian Magic

The transformed weights follow an approximately Gaussian (normal) distribution. No outliers, no single dominant weight—just a smooth bell curve!

[Interactive figure: transformed 8×8 matrix with success metrics (max weight, average weight, max/avg ratio, incoherence μ)]

🎯 Ready for Quantization

With a low incoherence score, we can now define a quantization grid that treats all weights with similar importance. No single weight will dominate and cause large errors.

[Figure: weight distribution after RHT]

📊 Gaussian Distribution Properties

Smooth bell curve: No spikes or outliers
Symmetric: Equal spread around zero
Ball-shaped in high dimensions: Perfect match for E8 lattice!
Predictable error bounds: We can prove theoretical guarantees

[Figure: 2D scatter of transformed weights. They form a radially symmetric "ball" shape, ideal for vector quantization.]

Why RHT Unlocks 2-Bit Performance

🎯 The Synergy of QuIP#

RHT is the crucial first step that makes the rest of the QuIP# algorithm possible. It prepares the weights into a format that is perfectly suited for the subsequent E8 lattice quantization.

1️⃣ RHT Transform

Input: Weights with outliers (μ ≈ 20)

Output: Gaussian distribution (μ ≈ √log n)

Key insight: Spreads magnitude evenly across all dimensions

↓

2️⃣ Perfect Match for E8

Gaussian → Ball-shaped in 8D

E8 lattice → Proven optimal sphere packing

Key insight: E8's dense, highly symmetric arrangement (240 nearest neighbors per point) covers a ball-shaped Gaussian efficiently!

↓

3️⃣ Minimal Quantization Error

Error ∝ μ² · σ²

Small μ (from RHT) + small σ² (from E8 lattice) = Near-lossless 2-bit!

Result: 4.16 PPL on Llama 2 70B (vs 7.81 for OmniQuant)

Theoretical Error Bound

𝔼[Error] ≤ (g · m · μ² · σ² / n) · tr(H^(1/2))²

Here H is the proxy Hessian (not the Hadamard matrix), g is the block size, and W is m×n.

μ² = Incoherence (minimized by RHT: 20² → 2²)

σ² = Quantization noise (minimized by E8 lattice)

🚫 Without RHT

μ² contribution: 20² = 400

Reference PPL without incoherence processing: 7.81 (OmniQuant)

Without making the weights incoherent, the quantization error from outliers is catastrophic, leading to a massive drop in model performance.

✅ With RHT

μ² contribution: 2² = 4

Resulting PPL: 4.16 (QuIP#)

By cutting the μ² term by 100× (μ itself drops about 10×), RHT drastically cuts down the quantization error, paving the way for near-lossless 2-bit compression.

🎓 The "Aha!" Moment

RHT transforms the impossible (quantizing outlier-dominated weights) into the natural (quantizing a Gaussian distribution with optimal sphere packing).

It's not fighting against the math, it's aligning with it. A roughly ball-shaped Gaussian is exactly the kind of distribution that a dense lattice like E8 covers efficiently.

The beauty of QuIP#: Every component (RHT, E8, Block-LDLQ) is mathematically principled and directly addresses a term in the theoretical error bound. It's a testament to solving problems from first principles.

With weights now transformed into a Gaussian distribution free of outliers, we face a new challenge: how do we quantize this ball-shaped distribution efficiently? This is where the mathematics of sphere packing becomes crucial.

5. Pillar 2: Vector Quantization with E8 Lattice Codebooks

5.1 The Shape-Matching Problem

5.1.1 Ball-Shaped Gaussian Distribution

After RHT, weights are approximately Gaussian. In multiple dimensions, this creates a “ball shape”—weights are radially symmetric around the origin.

Think of throwing darts at a dartboard. Most land near the center, fewer toward the edges. In 8 dimensions, Gaussian weights do the same thing—they cluster in a “ball” around zero.

5.1.2 Why Scalar Quantization Fails

Scalar quantization treats each dimension independently. This creates a hypercube of representable points.

The problem: In high dimensions, almost all of the cube’s volume sits in the corners, outside the inscribed ball, but Gaussian weight vectors rarely land there. We’re spending precious bits on regions of space we’ll almost never use!

The Math: For a d-dimensional cube, the fraction of volume lying outside the inscribed ball grows quickly with dimension:

2D: ~21% waste
3D: ~48% waste
4D: ~69% waste
8D: ~98% waste

At 2 bits, we can’t afford to waste that much of our representable space!
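
The waste can be computed from the standard formula for the volume of a d-dimensional ball. In this sketch "waste" is taken to mean the share of the unit cube's volume outside its inscribed ball of radius 1/2; the function name is made up.

```python
import math


def inscribed_ball_fraction(d: int) -> float:
    """Volume of the radius-1/2 ball divided by the volume of the unit d-cube (= 1)."""
    radius = 0.5
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d


waste = {d: 1 - inscribed_ball_fraction(d) for d in (2, 3, 4, 8)}
```

Under that definition the unused fraction comes out to about 21% in 2D, 48% in 3D, 69% in 4D, and over 98% in 8D, so a per-dimension grid becomes less and less suited to ball-shaped data as the block size grows.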

5.2 Enter Vector Quantization

The Idea

Instead of quantizing each weight individually, we quantize d weights together as a d-dimensional vector.

Scalar quantization: 4 values per dimension → hypercube
Vector quantization: Shape the codebook to match the actual distribution → sphere

The Trade-off

Vector quantization has exponential cost:

Codebook size: 2^(k·d) entries for k bits and dimension d
Example: 2 bits, 8 dimensions → 2^16 = 65,536 codewords

This is where the E8 lattice becomes magical.

5.3 The Sphere Packing Problem

5.3.1 What Is Sphere Packing?

The Question: How do you arrange equal-sized, non-overlapping spheres so that they fill as much of space as possible?

This is an ancient mathematical problem, dating back to Kepler’s study of cannonball stacking in 1611.

Relevance to Quantization: Each codebook entry is a sphere center. The sphere radius determines how far weights can be from that center. Better packing = smaller max distance = lower quantization error.

2D Examples:

Square packing: 78.5% efficiency, 4 neighbors touching each sphere
Hexagonal packing: 90.7% efficiency, 6 neighbors touching each sphere

The hexagonal packing is proven optimal in 2D. We want the same for higher dimensions!

5.3.2 The Kissing Number

The kissing number is how many equal-sized spheres can touch one central sphere.

Examples:

2D square grid: 4 neighbors
2D hexagonal: 6 neighbors
3D: 12 neighbors (think of oranges at the grocery store)
8D E8 lattice: 240 neighbors!

In these examples, a higher kissing number goes hand in hand with denser packing, and denser packing means better quantization!

🔮 2D Intuition: Why Vector Quantization Wins

Understanding optimal packing before we dive into 8D E8 lattice

Square Packing: Scalar Quantization

📐 The Simple Approach

Arrange spheres in a square grid—this is what scalar quantization does when it treats each dimension independently.

📊 Square Packing Metrics

Packing Efficiency: 78.5%
Kissing Number: 4
Wasted Space: 21.5%

⚠️ The Problem

21.5% of space is wasted!

Each sphere only touches 4 neighbors. There are large gaps between spheres that could be filled more efficiently.

At 2-bit quantization with only 4 values per dimension, we can't afford this waste.

🔍 What This Means for Quantization

When you quantize each weight dimension independently (scalar quantization), you're using square packing. The 21.5% waste means some weights will be far from their nearest codebook entry, causing larger errors.

Hexagonal Packing: Vector Quantization

🏆 The Optimal Solution in 2D

Arrange spheres in a hexagonal pattern—this is proven optimal in 2 dimensions. This is what vector quantization achieves!

📊 Hexagonal Packing Metrics

Packing Efficiency: 90.7%
Kissing Number: 6
Wasted Space: 9.3%
Improvement: +15.5%

✨ The Breakthrough

Only 9.3% waste—2.3× better than square packing!

Each sphere touches 6 neighbors (50% more than square). The spheres nestle into each other's gaps, minimizing wasted space.

This is the power of vector quantization!

🎓 Scaling to 8D

In 2D, hexagonal packing is proven optimal. In 8D, the E8 lattice is proven optimal (Viazovska, 2016, Fields Medal 2022).

E8 achieves a kissing number of 240 in 8D—that's 15× better than simple cubic packing (16)! This 2D intuition extends beautifully to higher dimensions.

Next: See how this 2D intuition scales to 8D with the E8 lattice visualization below ↓

5.3.3 Direct Connection to Quantization

The connection between sphere packing and quantization is direct:

Codebook entries = sphere centers
Sphere radius = coverage area (how far weights can be from nearest codeword)
Better packing = lower max distance = lower quantization error

In the error bound, the covering radius appears as quantization noise σ². E8’s optimal packing minimizes σ², which directly minimizes the final quantization error.

5.4 Common Lattices in Quantization

Lattice	Dimension	Kissing #	Use Case
Z^n	any	2n	Simple integer grid
D_4	4	24	Even-parity lattice
D̂_8	8	112	Half-integer lattice
E_8	8	240	Optimal in 8D!

The E8 lattice achieves the proven optimal packing in 8 dimensions. This is not a heuristic—it’s a mathematical certainty (Viazovska, 2016, Fields Medal 2022).

5.5 The E8 Lattice: Mathematical Beauty

5.5.1 Definition

The E8 lattice is defined as:

E_8 = (ℤ⁸ ∪ (ℤ+½)⁸) ∩ {x | Σx_i is even}

In plain English:

All-integer OR all-half-integer vectors
With even coordinate sum

Valid E8 points:

[1, 1, 1, 1, 1, 1, 1, 1] ✓ (all integers, sum=8 is even)
[0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5] ✓ (all half-integers, sum=4 is even)
[1, 0, 1, 0, 1, 0, 1, 0] ✓ (all integers, sum=4 is even)

Invalid points:

[1, 1, 1, 0, 0, 0, 0, 0] ✗ (sum=3 is odd)
[0.5, 0.5, 1, 1, 1, 1, 1, 1] ✗ (mixed integer and half-integer)
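
The definition translates directly into a membership test. This sketch works on doubled coordinates so everything stays in integers; the function name is made up. It also counts the shortest nonzero lattice vectors by brute force, which recovers E8's kissing number.

```python
from itertools import product


def in_e8(x) -> bool:
    """True if x is all-integer or all-half-integer with an even coordinate sum."""
    if len(x) != 8:
        return False
    doubled = [2 * c for c in x]
    if any(d != int(d) for d in doubled):
        return False  # some coordinate is neither an integer nor a half-integer
    doubled = [int(d) for d in doubled]
    all_integer = all(d % 2 == 0 for d in doubled)
    all_half = all(d % 2 == 1 for d in doubled)
    # The sum of x is even exactly when the doubled sum is divisible by 4.
    return (all_integer or all_half) and sum(doubled) % 4 == 0


# Shortest nonzero vectors have squared length 2, i.e. doubled squared length 8.
shortest = [
    combo
    for combo in product((-2, -1, 0, 1, 2), repeat=8)
    if sum(d * d for d in combo) == 8 and in_e8([d / 2 for d in combo])
]
kissing_number = len(shortest)
```

The count splits into 112 integer vectors (±1 in two positions) and 128 half-integer vectors (all ±½ with an even number of minus signs), for 240 in total.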

5.5.2 Why E8 is Special

Proven optimal: Maryna Viazovska proved in 2016 that E8 achieves the densest sphere packing in 8D (Fields Medal 2022)
Highest kissing number: 240 in 8D—this is proven to be the best possible
Highly symmetric: Has 696,729,600 symmetries, which enables compression
Hardware-friendly: The structure allows for the E8P compression trick (next section)

5.6 E8P: The “Padded” Compression Trick

5.6.1 The Challenge

For 2-bit quantization in 8 dimensions, we need:

2 bits × 8 dimensions = 16 bits total per 8-weight block
2^16 = 65,536 codewords

Naively storing this:

65,536 vectors × 8 dimensions × 2 bytes = 1 MB per codebook

The Problem: L1 cache on modern GPUs is 128-256 KB per streaming multiprocessor. A 1MB codebook won’t fit! This causes cache misses, which can make inference slower than FP16 (as AQLM discovered).

5.6.2 E8P Solution: Exploit Symmetry

The insight: We don’t need to store all 65,536 vectors. E8’s symmetry lets us:

Store only 256 base vectors (4 KB)
Use the remaining 8 bits to encode sign flips and shifts
Generate all 65,536 points on the fly

Codebook compression: 65,536 entries → 256 base vectors (256× smaller codebook)

Note: Each 8-weight block still uses 16 bits for encoding (2 bits per weight). The compression is in the codebook size (1 MB → 4 KB), not the encoding size. This makes the codebook cache-resident, enabling fast lookups.
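
The storage arithmetic is easy to reproduce. A quick sketch, assuming 2-byte (FP16) coordinates as in the calculation above; the helper name `codebook_bytes` is just for illustration.

```python
DIMS = 8               # coordinates per codebook vector
BYTES_PER_ENTRY = 2    # assumed FP16 storage per coordinate

def codebook_bytes(num_vectors, dims=DIMS, bytes_per_entry=BYTES_PER_ENTRY):
    return num_vectors * dims * bytes_per_entry

naive = codebook_bytes(2 ** 16)   # one stored vector per 16-bit codeword
e8p = codebook_bytes(256)         # only the base table is stored

print(naive / 2 ** 20, "MB naive")
print(e8p / 2 ** 10, "KB with E8P")
print(naive // e8p, "x smaller")
```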

5.6.3 E8P Encoding Structure (16 bits)

Each 16-bit codeword encodes:

[8 bits: base index] [7 bits: sign flips] [1 bit: shift]

Bits 0-7: Index into 256-entry table S
Bits 8-14: Which coordinates to negate
Bit 15: Add ±0.25 shift

5.6.4 Decoding Example (Step-by-Step)

Let’s decode codeword: 0001010110010111

Step 1: Base vector

Bits 0-7: 00010101 = 21
Look up S[21] = [0.5, 0.5, 0.5, 1.5, 0.5, 0.5, 0.5, 0.5]

Step 2: Apply sign flips

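Bits 8-14: 1001011 (4 ones)
Read least-significant bit first: flip positions 0, 1, 3, 6
Infer the 8th flip from the parity constraint: the base sums to 5 (odd) and each flip of a half-integer coordinate changes the sum by an odd amount, so the total number of flips must be odd
The 4 stored flips are an even count, so position 7 is flipped as well (5 flips, final sum = -2)
Result: [-0.5, -0.5, 0.5, -1.5, 0.5, 0.5, -0.5, -0.5]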

Step 3: Apply shift

Bit 15: 1 → add 0.25
Final: [-0.25, -0.25, 0.75, -1.25, 0.75, 0.75, -0.25, -0.25]
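
The three steps can be sketched in a few lines. This follows the bit layout of the walkthrough rather than the actual QuIP# kernel: `decode_e8p` is a hypothetical name, the table holds only the illustrative entry S[21] from the example, and the parity rule assumes a half-integer base vector (the eighth sign is chosen so that the coordinate sum comes out even).

```python
def decode_e8p(codeword, base_table):
    """Decode a 16-character bit string using the layout described above."""
    assert len(codeword) == 16 and set(codeword) <= {"0", "1"}
    base_index = int(codeword[0:8], 2)     # "bits 0-7": index into the base table
    sign_bits = int(codeword[8:15], 2)     # "bits 8-14": seven stored sign flips
    shift_bit = int(codeword[15])          # "bit 15": shift flag

    vec = list(base_table[base_index])
    # Bit i of sign_bits (least-significant first) negates coordinate i.
    for i in range(7):
        if (sign_bits >> i) & 1:
            vec[i] = -vec[i]
    # Infer the eighth flip: negate the last coordinate if that is what
    # makes the coordinate sum an even integer (assumes half-integer base).
    if sum(vec) % 2 != 0:
        vec[7] = -vec[7]
    if shift_bit:
        vec = [c + 0.25 for c in vec]
    return vec

# Illustrative table containing only the entry used in the example.
S = {21: [0.5, 0.5, 0.5, 1.5, 0.5, 0.5, 0.5, 0.5]}
print(decode_e8p("0001010110010111", S))
```

A real kernel would do the same work with shifts, masks, and a popcount on packed integers instead of Python lists.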

5.6.5 Why 7 Sign Bits (Not 8)?

This is elegant! E8 has a parity constraint:

The base vector fixes whether the total number of sign flips must be even or odd for the result to stay in E8
Given the 7 stored sign bits → that parity determines the 8th sign automatically

We save 1 bit per codeword by exploiting mathematical structure!

5.6.6 Hardware Implementation

Decoding E8P is incredibly fast:

Load base: 1 memory access (L1 cache hit, 4KB total)
Extract signs: 1 shift + AND operation
Compute 8th sign: Hardware popcount (XOR parity)
Apply signs: SIMD multiply (8 parallel ops)
Apply shift: SIMD add (8 parallel ops)

Total: ~5 instructions per 8-weight codeword, all cache-resident!

⚛️ E8 Lattice: Perfect Sphere Packing

From Wasted Hypercubes to Optimal Vector Quantization

The Hypercube Waste Problem

⚠️ Scalar Quantization: Fitting a Ball in a Box

After RHT, our weights form a Gaussian distribution—a ball shape in high dimensions. But scalar quantization creates a hypercube of representable points. This is geometrically inefficient!

The Waste Grows Exponentially

2D (square): 21% waste
4D (hypercube): 47% waste
6D (hypercube): 61% waste
8D (hypercube): 69% waste

📊 Why This Matters

At 2-bit quantization, we only have 4 values per dimension. With 8 dimensions, that's 4⁸ = 65,536 codewords total.

69% waste means 45,000 of our 65,536 codewords are useless!

We can't afford this inefficiency.

💡 The Solution: Vector Quantization

Instead of quantizing each dimension independently (hypercube), quantize d dimensions together as a vector (sphere-shaped codebook).

This lets us match the codebook shape to the actual distribution. But which shape is optimal?

[Placeholder for diagram: 2D visualization and volume ratio analysis. The corners of the square are wasted because Gaussian samples almost never reach them.]

Why 8 Dimensions?

🎯 The Perfect Match

QuIP# uses 8 dimensions because it's the sweet spot where mathematics, hardware, and quantization align perfectly.

🔢 Hardware Alignment

2 bits per weight: 4 values
8 weights per group: vector quantization
Total encoding: 16 bits

16 bits = 2 bytes, perfect for modern hardware!

🏆 Mathematical Optimality

Proven optimal: Yes!

8 is one of the few dimensions where the optimal sphere packing has been proven:

2D: Hexagonal (6 neighbors)
3D: FCC (12 neighbors)
8D: E8 (240 neighbors) ⭐
24D: Leech lattice (196,560 neighbors)

Kissing Number Scaling Across Lattices

Lattice	Kissing Number	Neighbors vs. Z⁸
Z⁸ (Simple Cubic)	16	Baseline
D̂₈ (Half-integer)	112	7× as many
E₈ (Optimal)	240	15× as many ⭐

E8's 240 touching neighbors are 15× as many as the 16 of the simple cubic lattice (Z⁸) in 8D!

💡 Why Not Other Dimensions?

4D: Only 24 neighbors (D₄ lattice) - not dense enough
16D: No proven optimal lattice, too many bits (32-bit encoding)
24D: Leech lattice is optimal but requires 48-bit encoding - impractical

8D is the Goldilocks dimension: proven optimal packing + practical hardware alignment!

The E8 Lattice: Mathematical Beauty

E8 Properties

Property	Value	Significance
Dimension	8	Perfect for 2-bit × 8 weights = 16 bits
Kissing Number	240	Proven optimal in 8D
Symmetries	696,729,600	Enables E8P compression (256× reduction)
Covering Radius	σ² (minimal)	Minimizes quantization error

🎯 Perfect Match for Gaussian Weights

After RHT, weights are Gaussian → ball-shaped in 8D. E8 provides the densest possible packing of spheres in a ball. This is why the combination works perfectly!

E8P: The Compression Magic

🎩 The Challenge

2 bits × 8 dimensions = 16 bits total → 2¹⁶ = 65,536 codewords

Storing naively: 65,536 vectors × 8 dims × 2 bytes = 1 MB per codebook

Problem: GPU L1 cache is only 128-256 KB. Cache misses kill performance!

✨ The E8P Solution

Exploit E8's 696 million symmetries to compress the codebook:

Store only 256 base vectors (4 KB)
Use 8 bits to encode sign flips and shifts
Generate all 65,536 points on the fly

256× compression ratio! Fits in L1 cache.

Storage Comparison

Naive storage: 1 MB
E8P storage: 4 KB
Compression: 256×
Fits in L1? ✓ Yes!

Performance Impact

Memory access: L1 cache hit
Decode cost: ~5 instructions
Peak bandwidth: 56.8%

QuIP# achieves >50% of peak memory bandwidth on an RTX 4090!

The Complete Picture

1️⃣ RHT Creates Gaussian Distribution

Output: Ball-shaped weights in 8D space

↓

2️⃣ E8 Lattice Matches the Shape

Gaussian ball → E8 optimal sphere packing

↓

3️⃣ E8P Makes it Hardware-Friendly

65,536 codewords → 256 base vectors (4 KB)

↓

4️⃣ Result: Near-Lossless 2-Bit

Llama 2 70B perplexity (lower is better): 3.12 (FP16) → 3.91 (QuIP# 2-bit)

vs 7.81 for OmniQuant, roughly half the perplexity!

🎓 The "Aha!" Moment

E8 isn't just "better" sphere packing—it's provably optimal in 8D. When you combine it with RHT's Gaussian distribution, you get a perfect geometric match.

The ball-shaped Gaussian weights fit exactly into E8's optimal sphere packing. No wasted corners, no wasted bits. Every single one of the 65,536 codewords is useful.

The E8P compression trick then makes it hardware-friendly: 4 KB fits in L1 cache, ~5 instructions per decode. This is why QuIP# achieves >50% of peak memory bandwidth while maintaining near-lossless quality, while AQLM is slower than FP16.

We’ve transformed weights to eliminate outliers (RHT) and matched our quantization to their distribution (E8P). But there’s one final piece: accounting for how weights interact with each other during quantization.

6. Pillar 3: Block-LDLQ Adaptive Rounding

6.1 Why Adaptive Rounding?

Even with RHT and E8P, we face a final challenge: weights aren’t independent.

Imagine you’re tuning a guitar. If you tune each string in isolation, the guitar might sound terrible because strings interact to create chords. You need to tune them together, considering how they affect each other.

Similarly, when we round weights, errors in early weights affect later ones. Adaptive rounding accounts for these interdependencies.

6.2 The Hessian: Measuring Sensitivity

The proxy Hessian captures how changes in weights affect model loss:

H = 𝔼[x·x^T]

Where:

𝔼[·]: Expected value (average over calibration data samples)
x: Input activations to the layer
H_ij: Measures how much weights i and j “interact”

Intuition: High Hessian values mean “this weight is sensitive—round carefully!”

6.3 Block LDL Decomposition

For a Hessian H ∈ ℝ^(n×n) with block size g, we compute:

H = L^T·D·L

Where:

L: Unit block lower-triangular matrix (g×g blocks)
D: Block diagonal matrix

This is like breaking a big problem into smaller, manageable chunks of size g.

6.4 The Block-LDLQ Algorithm

For each block of g weights:

Ŵ_k = Q(W_k + (W_{1:k-1} - Ŵ_{1:k-1})·A_k)

What this means:

W_k: Current block of weights
Ŵ_{1:k-1}: Already-quantized previous blocks
A_k: Feedback matrix (from L)
Q: Vector quantization to E8P codebook

The key: We use errors from previous blocks as feedback, so errors don’t accumulate!
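
To make the feedback loop concrete, here is a deliberately simplified sketch: block size g = 1 (one column at a time) and plain integer rounding standing in for the E8P quantizer Q. The function names are hypothetical, the factorization is written as H = U·D·Uᵀ with U = Lᵀ unit upper-triangular, and the feedback matrix is assumed to be A = U − I. It is meant to illustrate the mechanism, not to reproduce the QuIP# implementation.

```python
import numpy as np

def udu_decomposition(H):
    """Factor H = U @ D @ U.T with U unit upper-triangular and D diagonal."""
    n = H.shape[0]
    J = np.eye(n)[::-1]                     # reversal permutation
    C = np.linalg.cholesky(J @ H @ J)       # lower-triangular Cholesky factor
    U_scaled = J @ C @ J                    # upper-triangular, H = U_scaled @ U_scaled.T
    d = np.diag(U_scaled)
    U = U_scaled / d                        # divide column j by d[j] -> unit diagonal
    return U, np.diag(d ** 2)

def ldlq_round(W, H, quantize=np.round):
    """Round W column by column, feeding earlier rounding errors forward."""
    U, D = udu_decomposition(H)
    A = U - np.eye(H.shape[0])              # strictly upper-triangular feedback
    W_hat = np.zeros_like(W)
    for k in range(W.shape[1]):
        feedback = (W[:, :k] - W_hat[:, :k]) @ A[:k, k]
        W_hat[:, k] = quantize(W[:, k] + feedback)
    return W_hat, U, D

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 6))               # calibration activations, one row per sample
H = X.T @ X / len(X) + 1e-3 * np.eye(6)     # proxy Hessian E[x x^T] plus a small ridge
W = 3 * rng.normal(size=(4, 6))             # toy weight matrix

W_hat, U, D = ldlq_round(W, H)
E = W_hat - W
eta = E @ U                                  # the individual rounding residuals
print("max |residual|:", np.abs(eta).max())
print("proxy loss:", np.trace(E @ H @ E.T))
```

In this sketch the accumulated error E satisfies E·U = η, where η holds the per-step rounding residuals (each at most 0.5 in magnitude), so the proxy loss tr(E·H·Eᵀ) reduces to tr(η·D·ηᵀ): only the quantizer's own noise, weighted by D, remains.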

6.5 Theoretical Guarantee

Theorem 4.1 from the QuIP# paper: For μ-incoherent weights with E8P codebook:

𝔼[Error] ≤ (g·m·μ²·σ²/n)·tr(H^{1/2})²

Where:

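σ²: Quantization noise (minimized by E8P!)
μ: Incoherence (minimized by RHT!)
g: Block size (8 for QuIP#)
m, n: Dimensions of the weight matrix W ∈ ℝ^(m×n) (H is n×n)
Error: The proxy loss tr((Ŵ − W)·H·(Ŵ − W)^T)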

The beauty: Both RHT and E8P appear in the error bound! Each component directly reduces the final error.

Now let’s step back and see how all three pillars work together to achieve what was thought impossible.

7. The Complete Picture: Why QuIP# Works

7.1 The Virtuous Cycle

RHT 
  → Gaussian Weights (ball-shaped, μ ≈ √log n)
      ↓
  E8 optimal packing (matches Gaussian shape)
      ↓
  Minimizes covering radius σ²
      ↓
  Low quantization noise in Block-LDLQ
      ↓
  Near-lossless 2-bit quantization!

7.2 Each Component is Necessary

Remove any piece and the system fails:

Without RHT: Outliers remain → high μ → error bound explodes
Without E8P: Poor sphere packing → high σ² → error bound explodes
Without Block-LDLQ: No Hessian adaptation → accumulated error

7.3 The Mathematical Beauty

All three components appear in the error bound:

Error ≤ (g·m·μ²·σ²/n)·tr(H^{1/2})²
         ↑   ↑  ↑
         |   |  └─ E8P minimizes this (optimal packing)
         |   └──── RHT minimizes this (incoherence)
         └──────── Block-LDLQ optimizes given μ and σ²

This isn’t accidental—it’s mathematically inevitable that these three techniques combine to minimize error.

With the theory established, let’s examine what QuIP# means for real-world deployment and the future of LLM accessibility.

8. Practical Implications

8.1 Deployment Scenarios Unlocked

Before QuIP#:

Llama 2 70B: Requires 140GB (6× RTX 4090s ≈ $10k in GPUs alone)
Research teams locked out of state-of-the-art models
Edge deployment impossible

After QuIP#:

Llama 2 70B: ~18GB ✓ Fits on single RTX 4090 ($1,600)
7B models: 4-6GB → runs on smartphones
Cost reduction: ~8× less memory → ~8× more models per server
Privacy: Sensitive data processing entirely on-device

8.2 The Scaling Breakthrough

The unprecedented result: QuIP# 3-bit scales better than 4-bit.

This directly refutes the 2023 consensus that “4-bit is optimal.” As models get larger, QuIP# 2-bit appears to scale similarly to 3-bit and 4-bit, suggesting that 2-bit may become the new standard.

9. Conclusion

QuIP# achieves what was thought impossible: near-lossless 2-bit quantization of LLMs. The key insights:

Eliminate outliers through principled RHT transformation (not heuristic suppression)
Match the distribution using proven-optimal E8 lattice sphere packing
Account for dependencies through Block-LDLQ adaptive rounding

Each component addresses a specific mathematical challenge. Together, they form an elegant solution that:

Enables 70B models on consumer hardware
Achieves unprecedented compression with minimal quality loss
Scales better than “theoretically optimal” 4-bit methods

The 2-bit dream is alive.

For practitioners: QuIP# quantized models are available at https://huggingface.co/relaxml. Code at https://github.com/Cornell-RelaxML/quip-sharp.

The future of LLM deployment just got a lot more accessible.

References

Tseng, A., Chee, J., Sun, Q., Kuleshov, V., & De Sa, C. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. ICML 2024.
Viazovska, M. (2017). The sphere packing problem in dimension 8. Annals of Mathematics.
Chee, J., Cai, Y., Kuleshov, V., & De Sa, C. (2023). QuIP: 2-bit quantization of large language models with guarantees. NeurIPS 2023.

Why Can Your Laptop Run LLaMA? A Deep Dive into Quantization

Sat, 04 Oct 2025 21:19:40 +0800

Introduction: Why We Can’t Afford Full Precision Anymore

The numbers tell a stark story. A model like GPT-3, with its 175 billion parameters, demands 700GB of memory at full precision, enough to consume thousands of dollars in cloud costs every single day. LLaMA-2-70B requires 280GB at full precision and 140GB even with standard FP16, numbers that dwarf the memory capacity of most GPUs. Training such models can cost millions of dollars in compute resources, and even running inference requires an infrastructure of multiple high-end GPUs costing $15,000 each.

These requirements create more than just a financial barrier. They represent a fundamental accessibility problem. Consumer GPUs like the RTX 3090 offer only 24GB of VRAM, while even the newest RTX 5090 provides just 32GB, nowhere near enough for unquantized large models. Mobile and edge devices face even tighter constraints with 4-8GB of total RAM. Without quantization, state-of-the-art models remain locked behind expensive data center infrastructure, inaccessible to researchers, startups, and individual developers.

The computational burden compounds these memory constraints. Matrix multiplication dominates LLM inference time, and high-precision arithmetic is expensive. On NVIDIA A100, lower-precision tensor cores provide substantially higher throughput: TF32 reaches about 156 TFLOPS dense (≈312 with sparsity), while INT8 performance reaches into the hundreds of TOPS depending on sparsity and kernels. Memory bandwidth creates additional bottlenecks: moving 140GB of weights from GPU memory to compute units can take longer than the computation itself, especially for the small batch sizes typical in interactive applications.

This is where quantization transforms from optimization technique to enabling technology. By compressing neural network weights from 32 bits to 4 bits or lower, quantization can achieve 4-8x memory reduction and 2-4x computational speedup while maintaining small accuracy losses with proper techniques. Modern quantization methods are transforming LLM deployment from multi-GPU clusters to single consumer GPUs—often reducing total system costs from six figures to a few thousand dollars in certain setups (e.g., 4-bit with offloading or multi-GPU).

The field has evolved dramatically. What began as “can we quantize below 8 bits?” in 2022 has progressed to early deployments and research systems exploring 2-bit models by 2025, with clear paths emerging for both research and real-world applications.

[Placeholder for diagram: GPU memory requirements by model size and precision, comparing the memory footprint of FP32 (32-bit), FP16 (16-bit), INT8 (8-bit), and INT4 (4-bit) formats against GPU capacity reference lines]

Note: Memory values approximate model weights only using 1B params → 4 GB (FP32), 2 GB (FP16), 1 GB (INT8), 0.5 GB (INT4). Actual deployments require additional memory for activations, KV cache, and overhead (often 1.2–2× total).
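
A weights-only estimate is a one-line calculation: parameters times bytes per parameter. The sketch below assumes 1 GB = 10⁹ bytes and ignores activations, KV cache, and quantization metadata; the names are illustrative.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, precision):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for name, params in [("7B", 7e9), ("70B", 70e9), ("175B", 175e9)]:
    row = ", ".join(
        f"{p}: {weight_memory_gb(params, p):.1f} GB" for p in BYTES_PER_PARAM
    )
    print(f"{name} -> {row}")
```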

Part 1: The Foundation - How Computers Store Numbers

To understand how we can compress these models, we must first understand what we’re compressing. Machine learning algorithms don’t process text; they process numbers, and the format used to store these numbers dictates their range, accuracy, and the memory they consume.

The 32-Bit Standard: FP32 and Its Anatomy

The default numerical format in most deep learning frameworks is the 32-bit single-precision floating-point number, commonly known as FP32. Defined by the IEEE 754 standard, an FP32 number occupies 32 bits (4 bytes) of memory, divided into three distinct parts that work together to represent a vast range of real numbers.

The sign bit (1 bit) is straightforward—a value of 0 indicates a positive number, while 1 indicates negative. The exponent (8 bits) determines the magnitude or range of the number, functioning like the exponent in scientific notation by scaling the value up or down by powers of 2. To represent both very large and very small numbers, the 8-bit unsigned integer uses a technique called exponent bias. For FP32, the bias is 127, meaning the actual exponent equals the stored value minus 127. This allows the 8 bits to represent an exponent range from -126 to +127 without requiring a separate sign bit for the exponent itself.

The mantissa (23 bits), also known as the significand, determines the precision of the number—essentially, how many significant digits it can accurately represent. The mantissa is a binary fraction normalized to be between 1.0 and 2.0. Because the leading digit of a normalized binary number in this format is always 1, this “implied leading 1” doesn’t need storage. This clever trick effectively gives the mantissa 24 bits of precision while only using 23 bits of memory.

FP32 (32-bit Float) Bit-Level Anatomy

IEEE 754 single-precision floating-point format

Bit 31: Sign (1 bit). 0 = positive, 1 = negative
Bits 30–23: Exponent (8 bits). Stored with a bias of 127; determines the magnitude 2^(exponent − 127)
Bits 22–0: Mantissa (Significand) (23 bits). Stored without the implied leading 1; gives ~7 decimal digits of precision

Reconstruction Formula

Value = (−1)^sign × 2^{(exponent − 127)} × (1 + mantissa)

For π (3.14159...):

Sign = 0, exponent = 10000000 (128), mantissa = 10010010000111111011011 (≈ 0.5708)

(−1)⁰ × 2^{(128 − 127)} × (1.5708...)

≈ 3.14159

Key Insights

32 bits = 4 bytes per number
Exponent bias (127) enables wide dynamic range without a signed exponent
Implied leading 1 in mantissa gives 24 effective precision bits
Range: ±1.4 × 10^-45 to ±3.4 × 10^38
Precision: ~7 decimal digits

This structure gives FP32 a dynamic range of approximately ±1.4×10⁻⁴⁵ to ±3.4×10³⁸. The machine epsilon (the gap between 1.0 and the next representable value) equals 2⁻²³ ≈ 0.00000012. For a 175B parameter model, FP32 representation demands 700GB of memory at 4 bytes per parameter.
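
The three fields can be pulled out of any float with Python's standard `struct` module. This is a small sketch with hypothetical helper names; the reconstruction handles normal numbers only (it ignores zeros, subnormals, infinities, and NaNs).

```python
import math
import struct

def fp32_fields(value):
    """Split a number, stored as FP32, into (sign, exponent, mantissa) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8 bits, still biased by 127
    mantissa = bits & 0x7FFFFF          # 23 stored fraction bits
    return sign, exponent, mantissa

def fp32_value(sign, exponent, mantissa):
    """Apply the reconstruction formula (normal numbers only)."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)

sign, exponent, mantissa = fp32_fields(math.pi)
print(sign, f"{exponent:08b}", f"{mantissa:023b}")
print(fp32_value(sign, exponent, mantissa))
```

The reconstructed value differs from π after about seven significant digits, which is the FP32 rounding error the text describes.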

However, a key limitation of any finite binary representation is that it cannot perfectly represent all decimal numbers. Just as 1/3 cannot be written with a finite number of decimal digits, values like 0.1 cannot be represented exactly in binary, leading to minor rounding errors in computation.

The Half-Precision Compromise: FP16 and BFloat16

While FP32 provides a good balance of range and precision, its 32-bit size is a primary contributor to the massive memory footprint of LLMs. Two 16-bit (half-precision) formats have emerged, each with a different solution to the fundamental tradeoff between range and precision.

FP16 (Half-Precision) compresses the format to 1 sign bit, 5 exponent bits (bias=15), and 10 mantissa bits. This dramatic reduction in exponent range means FP16 only spans ±6×10⁻⁵ to ±65,504. With only 3-4 significant digits and epsilon of 0.00097656, FP16 risks overflow and underflow during model training, where gradients can become extremely small or extremely large, causing the training process to fail.

The memory advantage is substantial: 175B parameters require only 350GB, halving the footprint while enabling 2x faster computation on Tensor Core GPUs. Mixed precision training exploits this by computing in FP16 but accumulating gradients in FP32, though loss scaling is needed to prevent underflow.

BFloat16 (BF16), developed by Google Brain specifically for deep learning, takes a radically different approach. It allocates 1 bit for the sign, 8 bits for the exponent (same as FP32), and just 7 bits for the mantissa. By maintaining FP32’s full dynamic range (±1.2×10⁻³⁸ to ±3.4×10³⁸) while sacrificing precision (epsilon = 0.0078125), BF16 becomes a “drop-in replacement” for FP32 in training.

Converting between FP32 and BF16 is trivial: simply truncate or zero-pad the last 16 bits, which makes the conversion computationally cheap. The identical exponent range means no overflow issues and no need for loss scaling during training. BF16 has become the preferred format for training large models.
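
The truncate-and-pad conversion looks like this in a minimal sketch using the standard `struct` module. The helper names are hypothetical, and the code truncates (rounds toward zero) for simplicity, whereas real hardware and libraries typically round to nearest.

```python
import struct

def fp32_to_bf16_bits(value):
    """Keep the top 16 bits of the FP32 pattern: sign, 8 exponent, 7 mantissa bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bits >> 16

def bf16_bits_to_fp32(bf16_bits):
    """Zero-pad the low 16 bits to get back a valid FP32 value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bf16_bits << 16))
    return value

bits = fp32_to_bf16_bits(3.14159265)
print(hex(bits), bf16_bits_to_fp32(bits))
```

The round trip keeps the exponent intact and loses only the low mantissa bits, so π comes back as 3.140625.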

The emergence and widespread adoption of BFloat16 reveals a core principle of deep learning systems: system-level stability is often more critical than component-level precision. The industry’s willingness to sacrifice the precision of individual numbers (BF16 has 3 fewer mantissa bits than FP16) for the stability of the entire training process demonstrates that deep neural networks are remarkably robust to a certain level of numerical noise. The most critical failure mode during training is not a slight inaccuracy in a single weight but catastrophically exploding or vanishing gradients.

Integer Quantization: INT8 and Beyond

INT8 (8-bit integer) abandons floating point entirely, representing values as integers from -128 to 127 (signed) or 0 to 255 (unsigned). Quantization maps continuous weights to these 256 discrete values using a scale factor S and zero-point Z through the formula: x_q = round(x/S + Z). Dequantization reverses this: x = S × (x_q - Z).

The scale and zero-point are typically stored in higher precision, adding some overhead. Advanced techniques use per-channel quantization with different (S, Z) pairs for each output channel, dramatically improving accuracy. LLM.int8() goes further, identifying outlier features with magnitude >6 and keeping them in FP16 while quantizing the rest, achieving <0.5% degradation on 176B models.

At 1 byte per parameter, INT8 reduces a 175B model to approximately 175-200GB including overhead—a 75% reduction from FP32. Computational benefits are substantial: INT8 tensor cores deliver 2.3-4x speedup over FP32 in practice, though realizing these gains requires specialized kernels. The challenge lies in selecting appropriate quantization ranges: too narrow loses information through clipping, too wide wastes the limited 256 values on rarely-used extremes. Block-wise quantization divides parameters into groups of 64-128, computing separate scales for each block to limit outlier impact.
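
The effect of block-wise scales on outliers can be seen in a toy experiment. This sketch uses symmetric absmax INT8 quantization (zero-point 0) on synthetic weights with one planted outlier; the helper names and the data are assumptions made for illustration.

```python
import math

def absmax_quantize(values, num_bits=8):
    """Symmetric quantization with one scale; returns (integers, scale)."""
    q_max = 2 ** (num_bits - 1) - 1                   # 127 for INT8
    scale = max(abs(v) for v in values) / q_max or 1.0
    return [max(-q_max, min(q_max, round(v / scale))) for v in values], scale

def blockwise_roundtrip(values, block_size=64):
    """Quantize each block with its own scale, then dequantize."""
    out = []
    for start in range(0, len(values), block_size):
        ints, scale = absmax_quantize(values[start:start + block_size])
        out.extend(q * scale for q in ints)
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

weights = [0.02 * math.sin(i) for i in range(256)]    # small synthetic weights
weights[5] = 10.0                                      # a single outlier

per_tensor = blockwise_roundtrip(weights, block_size=len(weights))
per_block = blockwise_roundtrip(weights, block_size=64)
print("one scale for everything:", mean_abs_error(weights, per_tensor))
print("one scale per 64 weights:", mean_abs_error(weights, per_block))
```

With a single scale, the outlier stretches the step size so far that every small weight rounds to zero. With per-block scales, only the block containing the outlier pays that price.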

INT4 (4-bit integer) pushes compression to extremes with only 16 representable values. Standard INT4 uses -8 to 7, but neural network weights cluster around zero with an approximately normal distribution. NormalFloat4 (NF4) exploits this by placing quantization points at the quantiles of a standard normal distribution, optimizing for neural network weight distributions rather than uniform spacing. QLoRA uses NF4 with 64-weight blocks and double quantization (quantizing the scale factors themselves to 8-bit), compressing a 65B model from 130GB to roughly 40–50GB with careful overhead management.
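
The idea of quantile-spaced levels can be illustrated with the standard library alone. Note that this is a simplified sketch of the principle, not the exact NF4 table used by QLoRA (which, among other details, includes an exact zero); the function names are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_levels(num_bits=4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** num_bits
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_to_levels(x, levels):
    """Index of the nearest level for a value already scaled into [-1, 1]."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))

levels = normal_quantile_levels()
print([round(v, 3) for v in levels])
print("gap near zero:", levels[8] - levels[7])
print("gap at the tail:", levels[15] - levels[14])
```

The levels crowd together near zero, where most weights live, and spread out toward the tails, which is the opposite of what a uniform -8 to 7 grid does.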

The memory savings enable previously impossible deployments: a 70B parameter model can fit on a single 48–64GB GPU using INT4; fitting into 32GB typically requires significant offloading/pruning and tight KV‑cache constraints. However, computation typically dequantizes weights to FP16 for matrix multiplication since native 4-bit arithmetic lacks broad hardware support. This means memory bandwidth benefits dominate over raw computational speedup.

Feature	FP32	FP16	BFloat16	INT8	INT4
Total Bits	32	16	16	8	4
Sign Bits	1	1	1	-	-
Exponent Bits	8	5	8	-	-
Mantissa Bits	23 (+1 implied)	10 (+1 implied)	7 (+1 implied)	-	-
Dynamic Range	±1.4e-45 to ±3.4e38	±6e-5 to ±65,504	±1.2e-38 to ±3.4e38	-128 to 127	-8 to 7
Precision	7 digits	3-4 digits	2-3 digits	256 levels	16 levels

[Placeholder for diagram: Memory footprint comparison showing LLaMA-70B across precisions with GPU memory capacity lines]

Part 2: Understanding Quantization - The Accuracy-Efficiency Tradeoff

With a foundation in numerical representation, we can now explore the process of quantization itself. At its heart, quantization is a mapping function from a large, often continuous set of values to a smaller, discrete set.

The Mechanics: Mapping Continuous to Discrete

Quantization converts model parameters from a high-precision data type like FP32 to a low-precision one, most commonly an 8-bit integer (INT8). An INT8 variable can only represent 256 distinct values, a stark contrast to the billions of values representable by FP32. This mapping is achieved through a linear transformation known as the affine quantization scheme.

The core formula relates the original real value (r) to its quantized integer counterpart (q) using two key parameters: a scale (S) and a zero-point (Z):

r = S × (q - Z)

Rearranging for quantization gives: q = round(r/S + Z)

The scale is a positive floating-point number that acts as the step size of the quantization. It defines the ratio of the original floating-point range to the target integer range, calculated as (r_max - r_min) / (q_max - q_min).

The zero-point is an integer within the quantized range that corresponds exactly to the floating-point value 0.0. This is critical because the value zero holds special significance in neural networks—it’s used for padding in convolutions and serves as the threshold for activation functions like ReLU. Ensuring that 0.0 can be perfectly represented without error after quantization is essential for maintaining model accuracy.

Affine Quantization: Continuous to Discrete Mapping

How floating-point values are linearly mapped to integer representations

Scale (S): S = (r_max - r_min) / (q_max - q_min)

Zero-Point (Z): Z = round(q_min - r_min / S), which maps r = 0.0 to integer Z

Each integer value represents a range of floating-point values (step = S)

Key Insights

Linear mapping: r = S × (q − Z) is an affine transformation
Scale (S): Step size; smaller S = finer granularity
Zero-point (Z): Ensures r = 0.0 maps to an integer
Symmetric: Z = 0 (common for weights)
Asymmetric: Z ≠ 0 (common for activations)
Error: Rounding introduces ±S/2 quantization error
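
Putting the scale, zero-point, and the two mapping formulas together gives a complete round trip. A minimal sketch for unsigned 8-bit integers, with an assumed float range of [-1, 3] and hypothetical function names:

```python
def affine_params(r_min, r_max, q_min=0, q_max=255):
    """Compute scale S and zero-point Z for the given float and integer ranges."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_min - r_min / scale)
    return scale, zero_point

def quantize(r, scale, zero_point, q_min=0, q_max=255):
    q = round(r / scale + zero_point)
    return max(q_min, min(q_max, q))      # clip to the representable integers

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

scale, zero_point = affine_params(-1.0, 3.0)
print("S =", scale, "Z =", zero_point)
for r in (-1.0, 0.0, 0.5, 3.0):
    q = quantize(r, scale, zero_point)
    print(r, "->", q, "->", dequantize(q, scale, zero_point))
```

Zero survives the round trip exactly because it maps onto the integer Z, while other values come back within half a step of where they started.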

Symmetric vs. Asymmetric Quantization

The zero-point concept leads to two primary quantization schemes, each with distinct tradeoffs.

Asymmetric (Affine) Quantization is the general form where the zero-point can be any integer in the quantized range. This scheme excels at quantizing data whose distribution is not centered around zero. A prime example is the output of a ReLU activation function, where all values are non-negative. Asymmetric quantization can map the range [0.0, 1000.0] to the full integer range, maximizing the use of available precision.

Symmetric Quantization is a special case where the floating-point range is forced to be symmetric around zero (e.g., [-a, a]). This constraint ensures that floating-point 0.0 maps directly to integer 0, making the zero-point Z = 0. The primary advantage is computational efficiency—since Z = 0, the subtraction operation in the dequantization formula can be skipped, leading to faster execution on some hardware.

However, if the underlying data distribution is skewed (like after a ReLU), symmetric quantization can be wasteful, as half of the quantized range will go unused, effectively losing one bit of precision.

The Quantization Timeline: PTQ vs. QAT

Quantization methods are categorized not just by their mathematical scheme but by when they’re applied in the model’s lifecycle.

Post-Training Quantization (PTQ) applies quantization to a model that has already been fully trained in high precision. The process typically involves passing a small “calibration dataset” (a few hundred representative examples) through the model to observe the ranges of its weights and activations. These observed ranges are then used to calculate the optimal scale and zero-point parameters for each tensor.

The advantages are compelling: PTQ is fast, simple, and computationally inexpensive. It doesn’t require access to the original training pipeline or large datasets, making it highly accessible. However, because the model’s weights were optimized for a high-precision environment, abruptly forcing them into a low-precision format can introduce significant “quantization noise,” leading to noticeable accuracy drops, especially at very low bit-widths (e.g., 4-bit).

Quantization-Aware Training (QAT) simulates the effects of quantization during the training or fine-tuning process. It works by inserting “fake quantization” operations into the model’s computation graph. In the forward pass, weights and activations are quantized and then immediately dequantized back to a floating-point format. This simulates the error that will be introduced during low-precision inference. Crucially, the backward pass computes gradients with respect to the original full-precision weights, allowing the model to learn parameters that are inherently robust to quantization effects.
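
A one-weight toy problem shows the mechanics of fake quantization. This sketch assumes a symmetric quantizer with a fixed step and uses the straight-through estimator, a common way to pass gradients through the rounding step by treating it as the identity; the names, step size, and learning rate are all illustrative.

```python
def fake_quantize(w, step):
    """Quantize and immediately dequantize (symmetric, zero-point 0)."""
    return round(w / step) * step

def train_with_fake_quant(x, y, step=0.25, lr=0.1, iterations=200):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0                                  # latent full-precision weight
    for _ in range(iterations):
        w_q = fake_quantize(w, step)         # forward: simulate low-precision inference
        error = w_q * x - y
        grad_wq = 2 * error * x              # gradient of squared error w.r.t. w_q
        w -= lr * grad_wq                    # straight-through: d(w_q)/d(w) taken as 1
    return w, fake_quantize(w, step)

w_latent, w_quantized = train_with_fake_quant(x=1.0, y=0.7)
print("latent weight:", w_latent)
print("weight used at inference:", w_quantized)
```

The ideal weight here (0.7) is not on the quantization grid, so the quantized weight ends up on one of the two neighboring grid points while the latent weight keeps absorbing the small gradient updates.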

QAT almost always achieves higher accuracy than PTQ, often recovering nearly all of the original model’s performance, even at aggressive quantization levels. However, it’s a far more complex and computationally expensive process, requiring retraining or extensive fine-tuning with access to the training dataset and significant computational resources.

The choice between PTQ and QAT presents a fundamental dilemma for LLMs. The models that would benefit most from QAT’s superior accuracy are the very ones for which the method is computationally and financially prohibitive. Fine-tuning a model with hundreds of billions of parameters can require hundreds of gigabytes of GPU memory, making QAT impractical for all but the largest institutions. This has led to PTQ becoming the dominant paradigm for LLM quantization, “not for its superiority but feasibility.”

This critical gap—the need for QAT-level accuracy with PTQ-level efficiency—has been the primary driver behind the intense research and development of advanced PTQ algorithms like GPTQ.

Feature	Post-Training Quantization (PTQ)	Quantization-Aware Training (QAT)
Workflow	Quantize a fully trained model	Simulate quantization during training/fine-tuning
Computational Cost	Low (calibration pass only)	High (requires retraining)
Data Requirement	Small calibration dataset	Full training dataset
Typical Accuracy	Good at 8-bit, may degrade at lower bit-widths	Excellent, often near full-precision performance
Best For	Scenarios with limited resources, no access to training data, or when speed of deployment is critical	Applications where maximizing accuracy is paramount and computational resources are available

The Accuracy-Efficiency Frontier

The choice of numeric format involves subtle tradeoffs that extend beyond simple memory calculations. Accuracy degradation patterns differ across precisions, model architectures, and quantization methods, with certain failure modes appearing only at extreme compression.

From FP16 to INT8, degradation is often very small when using proper techniques. Reports on BLOOM‑176B show differences within typical measurement noise on many tasks, demonstrating that INT8 can be virtually lossless for large models when outliers (e.g., >6σ features) are handled separately in FP16.

INT4 quantization shows noticeable but often acceptable degradation (commonly a few percentage points) depending on method and model. GPTQ frequently keeps 4‑bit perplexity deltas small on WikiText for large models, though exact results vary by setup. Modern techniques like AWQ can approach FP16 performance on certain tasks for specific models/configs. The difference between methods matters significantly—AWQ’s activation‑aware approach outperforms naive rounding by protecting the most activation‑sensitive weights.

Dramatic failure often occurs at 2 bits without specialized methods. Vanilla GPTQ typically fails at 2 bits, while methods like SpQR can make 2‑bit feasible by identifying and isolating a subset of weights as outliers (kept in higher precision) while quantizing the rest to 2 bits.

Model size affects quantization tolerance non-linearly. Larger models generally quantize better because individual weight errors average out across more parameters and layers. A 70B model tolerates 4-bit quantization with 96-99% accuracy recovery, while smaller 7B models show more variability. Counter-intuitively, models trained on more data become harder to quantize—LLaMA 3’s 15 trillion training tokens create more complex weight distributions than earlier models, increasing quantization sensitivity.

Memory savings follow predictable patterns but include important overhead. The formula Memory = Parameters × Bytes_per_parameter × 1.2 captures typical overhead from scale factors, activation tensors, and KV cache. For LLaMA‑70B: FP16 needs ~140–148GB, INT8 requires ~70–74GB (≈2× compression), while INT4 uses ~35–45GB (≈3.5× compression). The KV cache for attention adds substantial overhead at long context lengths: at 128K tokens an 8B model can consume on the order of tens of GB in FP16, sometimes exceeding the quantized model weights themselves.
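The arithmetic behind these figures is easy to reproduce. The sketch below is a rough estimator, not a sizing tool: `weight_memory_gb` and `kv_cache_gb` are hypothetical helper names, the `overhead` argument stands in for the 1.2 factor mentioned above (left at 1.0 here to show raw weight size), and the 8B-class configuration (32 layers, 8 KV heads, head dimension 128) is an assumed grouped-query-attention layout rather than a statement about any particular model.

```python
def weight_memory_gb(n_params: float, bits_per_param: float, overhead: float = 1.0) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * (bits_per_param / 8) * overhead / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int,
                bytes_per_value: int = 2, batch: int = 1) -> float:
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return values * bytes_per_value / 1e9


# Raw weight sizes for a 70B-parameter model at three precisions.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B weights in {name}: {weight_memory_gb(70e9, bits):.0f} GB")

# Assumed 8B-class layout: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
print(f"KV cache at 128K tokens: {kv_cache_gb(32, 8, 128, 128 * 1024):.1f} GB")
```

Under these assumptions the cache alone is several times larger than the 4-bit weights of the same model, and a layout without grouped-query attention (one KV head per query head) would multiply it further.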

[Placeholder for diagram: Accuracy-efficiency Pareto frontier showing perplexity degradation vs memory reduction for different quantization methods]

Part 3: GPTQ - When Second-Order Thinking Meets Quantization

GPTQ (Generative Pre-trained Transformer Quantization) represents a breakthrough in post-training quantization. Published at ICLR 2023 by researchers from IST Austria and ETH Zurich, GPTQ enables 3-4 bit compression of 175B parameter models in approximately 4 GPU hours while maintaining negligible performance degradation.

The Core Problem and GPTQ’s Insight

The fundamental challenge with quantization is this: when you quantize a weight, you introduce error. Naively rounding weights to the nearest quantization level (Round-to-Nearest or RTN) performs acceptably at 8 bits but fails catastrophically at 3-4 bits, essentially destroying model capability.

GPTQ asks a different question: how can we quantize weights while compensating for the error by adjusting other weights to maintain the layer’s output?

For each layer, GPTQ solves the optimization problem:

argmin_Ŵ ||WX - ŴX||²

where W is the original weight matrix, Ŵ is the quantized version, and X represents layer inputs from calibration data. This minimizes the squared difference between full-precision and quantized layer outputs rather than focusing on weight values themselves. The key realization: we care about preserving behavior (outputs) not parameter values.

This objective decomposes into independent row-wise problems since the squared Frobenius norm sums across rows. Processing each row separately reduces computational complexity dramatically while remaining theoretically sound.
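A quick numerical check makes the decomposition concrete. This is a minimal sketch on random data: the shapes, the seed, and the crude "round to a 0.25 grid" quantizer are all assumptions chosen for illustration, not GPTQ's actual quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))     # 4 output rows, 8 input features
X = rng.normal(size=(8, 32))    # 32 calibration samples stored as columns
W_hat = np.round(W * 4) / 4     # stand-in quantizer: nearest multiple of 0.25

# Layer-output error measured with the squared Frobenius norm.
total = np.linalg.norm(W @ X - W_hat @ X) ** 2

# The same error, computed one row at a time.
per_row = [np.sum(((W[i] - W_hat[i]) @ X) ** 2) for i in range(W.shape[0])]

# Each row's error is a quadratic form in the shared matrix X X^T,
# which is why a single Hessian can serve every row of the layer.
G = X @ X.T
quad = [(W[i] - W_hat[i]) @ G @ (W[i] - W_hat[i]) for i in range(W.shape[0])]

print(total, sum(per_row), sum(quad))
```

All three printed values agree up to floating-point rounding, which is the row-wise independence the text relies on.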

The Engine: Optimal Brain Quantization and the Hessian

GPTQ builds on Optimal Brain Quantization (OBQ), a 2022 post-training method that adapts the classic Optimal Brain Surgeon pruning framework from the early 1990s to quantization. OBQ provides a principled way to quantize weights by using second-order information to guide the process.

This information is captured in the Hessian matrix H = 2XX^T (plus a damping term λI in practice), which contains the second derivatives of the layer-wise reconstruction error with respect to the weights of a single row. Because that error is quadratic in the weights, the Hessian depends only on the calibration inputs X, not on the weight values themselves. Intuitively, the Hessian describes the “curvature” of the loss landscape. A sharp curve in a particular direction (a large corresponding value in the Hessian) means the loss is highly sensitive to changes in that weight. Conversely, a flat curve (a small Hessian value) indicates that the weight can be changed with little impact on the loss.

[Placeholder for interactive diagram: GPTQ: Understanding the Hessian Through Loss Curvature. The Hessian (H = 2XX^T) captures second-order information about loss sensitivity. Panels: a loss landscape cross-section with adjustable curvature (the second derivative, i.e. the Hessian diagonal) and a comparison of all weight sensitivities.]

Key observation: The same weight perturbation causes much larger loss increases when curvature is high.

Mathematical Interpretation

For a perturbation δ of weight i, the loss is approximately L(w + δ) ≈ L(w) + ½ H_ii δ², where H_ii is the Hessian diagonal value. Higher H_ii means higher curvature and more sensitivity.

Key Insights

First-order (gradient): direction of steepest descent
Second-order (Hessian): curvature (steepness)
High curvature: small changes → large loss increase
Low curvature: changes have smaller effect
GPTQ: leverage H^-1 to pick low-sensitivity weights first and compensate error

GPTQ uses the inverse of the Hessian matrix H⁻¹ = (2XX^T + λI)⁻¹, where the damping term λ (typically about 1% of the average diagonal value) prevents numerical instability. The OBQ algorithm quantizes weights one by one. At each step, it must decide which weight to quantize next. The optimal choice is the one that will cause the smallest increase in the layer’s output error, guided by the diagonal entries of the inverse Hessian matrix.

The Algorithm: Error Compensation in Action

The core of the OBQ method, and by extension GPTQ, is an iterative process of error compensation within each layer:

Select & Quantize: A single weight is chosen and quantized (e.g., rounded to the nearest 4-bit representable value)
Measure Error: The algorithm calculates the error introduced by this rounding step
Compensate: This is the crucial step. The algorithm updates all the other, not-yet-quantized full-precision weights in the layer to compensate for the error just introduced. This update is not uniform; it is scaled by the inverse Hessian, which directs the correction towards related but less sensitive weights that can absorb the error with minimal impact on the layer’s output

After quantizing weight w_q at position q to its nearest grid point, the quantization error must be compensated. The update formula:

δ = -[(w_q - quant(w_q)) / [H⁻¹]_qq] · (H⁻¹)_:,q

redistributes this error across remaining unquantized weights, minimizing impact on layer output.

This iterative compensation is what makes GPTQ so accurate. It doesn’t just round weights independently; it actively and intelligently corrects for the rounding error at every single step, ensuring the final quantized layer behaves as closely as possible to the original.
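The select, measure, compensate loop can be sketched for a single weight row in a few lines of NumPy. This is an illustrative toy, not the GPTQ implementation: it quantizes in a fixed left-to-right order, uses a plain uniform grid with spacing `step`, inverts the Hessian directly instead of using a Cholesky factorization, and runs on synthetic correlated inputs. The function names and the `damp` default are assumptions made for the example.

```python
import numpy as np


def quantize_rtn(w, step):
    """Round-to-nearest onto a uniform grid with spacing `step`."""
    return np.round(w / step) * step


def quantize_row_with_compensation(w, X, step, damp=0.01):
    """Quantize one row left to right, pushing each rounding error onto later weights."""
    w = w.astype(np.float64).copy()
    d = w.shape[0]
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d)   # damping term (lambda * I)
    Hinv = np.linalg.inv(H)
    w_hat = np.zeros_like(w)
    for q in range(d):
        w_hat[q] = quantize_rtn(w[q], step)
        err = (w[q] - w_hat[q]) / Hinv[q, q]
        # delta = -err * Hinv[:, q], applied only to weights not yet quantized
        w[q + 1:] -= err * Hinv[q + 1:, q]
        # drop weight q from the inverse Hessian (one Gaussian-elimination step)
        Hinv -= np.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]
    return w_hat


def output_error(w_ref, w_q, X):
    """Squared error of the row's outputs on the calibration inputs."""
    return float(np.sum(((w_ref - w_q) @ X) ** 2))


rng = np.random.default_rng(0)
d, n = 16, 256
X = rng.normal(size=(d, d)) @ rng.normal(size=(d, n))   # synthetic, correlated inputs
w = rng.normal(size=d)
step = 0.25

print("RTN output error:        ", output_error(w, quantize_rtn(w, step), X))
print("compensated output error:", output_error(w, quantize_row_with_compensation(w, X, step), X))
```

Both results live on the same quantization grid; the only difference is that the compensated version lets each rounding decision see the errors already made. On correlated inputs like these one would expect the compensated error to be noticeably lower, although a fixed quantization order carries no strict guarantee for every input.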

After each quantization, the Hessian inverse must be updated by removing the quantized weight’s row and column. Gaussian elimination provides the update, but this accumulates numerical error. GPTQ’s solution uses Cholesky decomposition to precompute all required Hessian information in a numerically stable manner, preventing error accumulation that would otherwise corrupt billion-parameter models.

[Placeholder for diagram: loss landscape contours showing the original weight position, the quantized weight (error introduced), and the compensated position]

Key Insight: The Hessian's inverse reveals which directions in weight space have flat loss (can absorb error) vs. steep loss (sensitive). GPTQ compensates quantization error by updating other weights along the flattest directions, minimizing the impact on model output.

Making It Practical: GPTQ’s Efficiency Optimizations

While powerful, the original OBQ algorithm is far too slow for modern LLMs. Its greedy search for the next-best weight to quantize and the need to update the inverse Hessian after every single weight result in cubic runtime complexity that is prohibitive. The genius of GPTQ lies in three clever optimizations:

Arbitrary Order Quantization: The authors made a critical empirical discovery—for very large, overparameterized models, the specific order in which weights are quantized has minimal impact on final accuracy. GPTQ thus abandons the expensive greedy search of OBQ and instead quantizes weights in a simple, fixed order (e.g., column by column). This not only eliminates the search but also means the Hessian information can be shared across all rows of a weight matrix, dramatically reducing redundant computations.

Lazy Batch Updates: To make the algorithm friendly to modern GPUs, which thrive on parallel computation, the compensation updates are batched. Instead of updating the entire weight matrix after every single column, GPTQ processes a block of columns at a time (a block size of B = 128 is typical), applies updates only within that block, and performs one global update to the remaining columns once the block is finished. This significantly improves the compute-to-memory-access ratio, leading to massive speedups.

Cholesky Decomposition: To ensure the complex matrix inverse operations remain numerically stable and efficient throughout the process, GPTQ employs this standard numerical linear algebra technique.

GPTQ’s success is a testament to brilliant research engineering. It bridges the gap between purely heuristic methods (like simple rounding) and computationally prohibitive, fully principled methods (like QAT). Its true innovation was not inventing new mathematical theory from scratch, but rather identifying the key computational bottlenecks in a powerful existing algorithm (OBQ) and devising pragmatic approximations (fixed order, lazy updates) that were shown to work exceptionally well at the massive scale of modern LLMs.

[Placeholder for interactive diagram: GPTQ: Block-wise Quantization with Hessian-Guided Error Compensation. Three nested loops process the weights column by column in blocks of size B, with the inverse Hessian guiding the direction and magnitude of error compensation. Legend: unquantized weights; current block; quantizing now; being compensated; quantized + compensated.]

Calibration requires surprisingly little data. GPTQ uses 128 random 2048-token segments from C4 (Colossal Clean Crawled Corpus), which is 262,144 tokens of generic web text in total. Because the calibration set is generic rather than task-specific, the approach stays zero-shot, making quantization fast and broadly applicable.

On a single NVIDIA A100 80GB, GPTQ quantizes OPT-175B in 4.2 hours and BLOOM-176B in 3.8 hours. Memory requirements are manageable: load one Transformer block (typically 6 layers) at a time, accumulate Hessians, quantize, then pass inputs through the quantized block to generate inputs for the next block.

Part 4: Beyond GPTQ - Alternative Approaches

GPTQ represents one approach among several competing quantization methods, each with distinct strengths:

AWQ (Activation-aware Weight Quantization) has emerged as a primary alternative to GPTQ, achieving strong accuracy by protecting a small fraction of activation‑sensitive weights. AWQ often matches FP16 performance on some tasks and is widely supported by fast 4‑bit inference kernels (e.g., AWQ/Marlin), which in many stacks can be faster than pipelines targeting GPTQ‑formatted weights.

LLM.int8() focuses on 8-bit quantization with near-zero degradation through mixed precision, keeping outlier features in FP16. While limited to 2x compression versus GPTQ’s 4x, it provides the most reliable accuracy preservation.

GGUF/llama.cpp targets CPU inference with excellent cross-platform support, using mixed bit-width “k-quant” formats ideal for edge deployment on Apple Silicon and consumer hardware.

For practitioners, AWQ and GPTQ represent the current sweet spot for 4-bit GPU inference, offering 96-99% accuracy recovery with 3.5-4x compression. The choice between methods depends on specific accuracy requirements, inference speed priorities, and deployment constraints.

Conclusion

Quantization has transformed from academic curiosity to production necessity, enabling LLM deployment from datacenter to edge. The progression from “can we quantize below 8 bits?” to practical 4-bit deployment reflects both algorithmic breakthroughs and growing infrastructure demands.

GPTQ’s core innovation, using second-order Hessian information to guide quantization and compensate for errors, proved that 3-4 bit compression is viable with minimal accuracy loss. By minimizing layer output error rather than weight error, GPTQ enables 70B models, which previously required expensive multi-GPU clusters, to run on consumer GPUs.

The field continues evolving rapidly. AWQ has emerged as a strong alternative with superior speed-accuracy tradeoffs, while advanced methods push toward 2-bit quantization. For practitioners today, 4-bit quantization with GPTQ or AWQ represents the sweet spot: 96-99% accuracy recovery with 3.5-4x memory reduction, making frontier models accessible on modest hardware.

The future of machine learning is quantized. These techniques have fundamentally democratized access to state-of-the-art models, transforming deployment from a privilege of well-funded organizations to a capability available to individual researchers and developers worldwide.

References

Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. In International Conference on Learning Representations (ICLR).
Lin, J., Tang, J., Tang, H., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. In Proceedings of Machine Learning and Systems (MLSys).
Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv preprint arXiv:2208.07339.
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS).
Jacob, B., Kligys, S., Chen, B., et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Krishnamoorthi, R. (2018). Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342.
Dettmers, T., Svirschevski, R., Egiazarian, V., et al. (2024). SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. In International Conference on Learning Representations (ICLR).
Wang, S., & Kanwar, P. (2019, August 23). BFloat16: The secret to high performance on Cloud TPUs. Google Cloud Blog.
Institute of Electrical and Electronics Engineers. (2019). IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019.
NVIDIA. (2020). NVIDIA A100 Tensor Core GPU Architecture. Whitepaper.
Gerganov, G., et al. (2023). ggml-org/llama.cpp: LLM inference in C/C++. GitHub repository.

Flash Attention: The Mathematical Tricks That Broke the Memory Wall

Wed, 10 Sep 2025 19:59:48 +0800

The Context Length Revolution

In 2022, something fundamental changed in the world of large language models. Suddenly, models that had been stuck processing 2,048 tokens could handle 16,000, then 32,000, then 100,000+ tokens. This wasn’t a gradual improvement—it was a leap forward. The breakthrough that enabled this revolution? Flash Attention, an algorithm that didn’t approximate or simplify attention, but computed it exactly while using radically less memory.

The story of Flash Attention is really a story about understanding your hardware. It’s about realizing that the obvious bottleneck isn’t always the real bottleneck, and that sometimes doing more work can make you faster. Most importantly, it’s about three clever mathematical tricks that, when combined, transform the fundamental scaling characteristics of the Transformer architecture.

The Deceptive Simplicity of Attention

Let’s start with what attention actually computes. At its core, the self-attention mechanism is elegantly simple:

Attention(Q, K, V) = softmax(QK^T / √d) × V

For a sequence of N tokens, each represented by a d-dimensional vector:

Q, K, V are all N×d matrices
QK^T produces an N×N attention matrix
The softmax normalizes each row to sum to 1
The final multiplication with V produces our N×d output

The problem is hiding in plain sight: that N×N attention matrix. When N=2,048, this matrix contains about 4 million elements. When N=16,384, it balloons to 268 million elements. At N=100,000, you’re looking at 10 billion elements—about 40GB in float32. The quadratic growth is devastating.
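These numbers follow directly from N² growth, and it takes only a few lines to reproduce them. The helper name `attention_matrix_bytes` is made up for this sketch, and the figures cover a single attention head for a single sequence; real models multiply them by the number of heads and the batch size.

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_element: int = 4) -> int:
    """Bytes needed to store one N x N attention matrix (float32 by default)."""
    return n_tokens * n_tokens * bytes_per_element


for n in (2_048, 16_384, 100_000):
    elements = n * n
    gb = attention_matrix_bytes(n) / 1e9
    print(f"N={n:>7,}: {elements:>14,} elements, {gb:7.2f} GB in float32")
```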

For years, the research community attacked this problem in the obvious way: try to avoid computing the full N×N matrix. Sparse attention patterns, low-rank approximations, kernel methods—dozens of papers proposed ways to reduce the quadratic complexity. Yet something curious kept happening. These methods would successfully reduce the theoretical FLOP count, but when implemented, they’d often run slower than standard attention.

What was going on?

The Real Bottleneck: A Tale of Two Memories

The answer requires understanding something about modern GPU architecture that’s often overlooked: GPUs have a dramatic memory hierarchy with vastly different performance characteristics at each level.

Consider an NVIDIA A100 GPU:

High Bandwidth Memory (HBM): 40-80GB of storage, but “only” 1.5-2.0 TB/s of bandwidth
On-chip SRAM: Just 192KB per streaming multiprocessor, but roughly 19 TB/s of bandwidth

That’s a 10x difference in bandwidth. This massive disparity means that accessing data from HBM is the primary bottleneck in GPU computations. While SRAM can deliver data at blazing speeds, its tiny capacity forces most data to reside in the much slower HBM.

Now here’s the critical insight: standard attention implementations are constantly moving data between HBM and SRAM. They’re not slow because they do too much computation—they’re slow because they spend most of their time waiting for data transfers from the slower HBM memory.

Let’s trace through what standard attention actually does:

Load Q and K from HBM → Compute S = QK^T → Store N×N matrix S to HBM
Load S from HBM → Compute P = softmax(S) → Store N×N matrix P to HBM
Load P and V from HBM → Compute O = PV → Store O to HBM

Each of those loads and stores of N×N matrices is a catastrophic performance hit. The GPU’s computational units, capable of trillions of operations per second, sit idle waiting for memory operations that take orders of magnitude longer than the actual math.

This is why reducing FLOPs didn’t help. The computation was never the bottleneck; memory bandwidth was. Shaving arithmetic off an operation that spends most of its time waiting on HBM transfers leaves the wall-clock time almost unchanged.

Flash Attention’s Three Tricks

Flash Attention solves this memory bottleneck through three interconnected techniques that, together, enable computing exact attention without ever materializing the N×N matrices in HBM. Let’s explore each one.

Trick 1: Tiling — Age Old Divide and Conquer

The first insight is that we don’t need to compute the entire attention matrix at once. Instead, we can break it into small blocks that fit entirely in SRAM.

Think of the attention computation as filling in a giant N×N grid. Standard attention fills the entire grid, then normalizes it, then uses it. Flash Attention says: what if we filled in just one small tile at a time, processed it completely, and then moved on?

The algorithm divides the input sequences into blocks:

Query blocks of size B_r = min(⌈M/(4d)⌉, d), where M is the SRAM size in elements
Key/Value blocks of size B_c = ⌈M/(4d)⌉

For each block of the output, Flash Attention:

Loads the relevant Q, K, V blocks into SRAM
Computes that tile of attention entirely in SRAM
Updates the output for that tile
Moves to the next tile

The key is that each tile is small enough that all intermediate values stay in the fast SRAM. We never write the full attention matrix to slow HBM.

[Placeholder for interactive diagram: Flash Attention: Tiling Strategy. The N×N attention matrix is processed in small B_r×B_c blocks, each formed from a block of Query (Q) rows and a block of Key (K^T) columns, that fit entirely in SRAM.]

Processing Steps per Tile

Load Blocks: Q_i, K_j, V_j → SRAM
Compute Tile Attention: S_ij = Q_iK_j^T / √d (stays in SRAM)
Update Running Softmax: maintain m (max) and l (sum) for online softmax
Accumulate Output: update O_i incrementally, write only O back to HBM

But wait—there’s a problem. The softmax operation needs to see an entire row to compute the proper normalization. How can we compute softmax correctly when we only see one tile at a time?

Trick 2: Online Softmax — The Mathematical Keystone

This is where Flash Attention’s cleverest innovation comes in: online softmax. This algorithm computes the exact softmax result by maintaining running statistics that can be updated incrementally as we process each tile.

The standard softmax formula for a vector x is:

softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

The online softmax reformulation maintains two running values:

m: The maximum value seen so far
l: The sum of exponentials (adjusted for the maximum)

Here’s the magic. When we process a new block of scores, we:

Find the new maximum: m_new = max(m_old, max(current_block))
Rescale our running sum: l_rescaled = l_old × exp(m_old - m_new)
Add the current block’s contribution: l_new = l_rescaled + Σ exp(current_block - m_new)

The rescaling step is crucial—it adjusts previous computations to account for the new maximum, ensuring numerical stability and exactness. When we’ve processed all blocks, we have the exact same result as if we’d computed softmax on the entire row at once.
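The three update steps translate almost line for line into code. The sketch below handles a single row of scores with NumPy; the function name `online_softmax`, the block size, and the test vector are assumptions made for illustration. For simplicity it normalizes the whole row at the end, whereas Flash Attention folds the normalization into the output accumulation.

```python
import numpy as np


def online_softmax(x, block_size):
    """Softmax of a 1-D array, with statistics gathered one block at a time."""
    m = -np.inf   # running maximum
    l = 0.0       # running sum of exp(x - m)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        m_new = max(m, block.max())                  # step 1: new maximum
        l = l * np.exp(m - m_new)                    # step 2: rescale the old sum
        l = l + np.exp(block - m_new).sum()          # step 3: add this block
        m = m_new
    return np.exp(x - m) / l


x = np.array([1.0, 3.0, -2.0, 0.5, 7.0, 2.0, -1.0, 4.0, 6.5, 0.0])
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()

print(np.allclose(online_softmax(x, block_size=3), reference))
```

On the first block the old maximum is -∞, so the rescaling factor is exp(-∞) = 0 and the empty running sum stays at zero, which is why no special case is needed for initialization.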

[Placeholder for interactive diagram: Online Softmax: The Mathematical Keystone. Exact softmax is computed incrementally over blocks of attention scores without materializing the full attention row, tracking the running maximum m (initially -∞) and the running sum l = Σ exp(x − m) (initially 0), and compared against standard softmax, which needs the full row.]

This isn’t an approximation—it’s mathematically equivalent to standard softmax. The proof relies on the fact that:

exp(x_i - a) / Σ_j exp(x_j - a) = exp(x_i - b) / Σ_j exp(x_j - b)

for any constants a and b. By carefully tracking how our maximum changes and rescaling accordingly, we maintain exactness while never needing the full row in memory.

Trick 3: Recomputation — Trading Compute for Memory

The third trick addresses the backward pass used in training. During backpropagation, we need the attention matrices to compute gradients. Standard implementations store these N×N matrices during the forward pass for use in the backward pass.

Flash Attention takes a radically different approach: it doesn’t store the attention matrices at all. Instead, during the backward pass, it recomputes the pieces it needs on-the-fly.

This seems wasteful—we’re computing the same values twice! But remember: computation is cheap, memory movement is expensive. The time saved by not writing and reading N×N matrices to/from HBM far outweighs the cost of recomputation.

The algorithm stores only:

The output O (size N×d)
The softmax normalization statistics (size N)

During backpropagation, when gradients are needed:

Reload the relevant Q, K, V blocks
Recompute just the attention tiles needed for that gradient
Compute gradients entirely in SRAM
Accumulate to the final gradient

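Recomputation does add FLOPs, because the attention scores are computed once in the forward pass and again in the backward pass, yet the result is still a 2-4x speedup in wall-clock time. The counterintuitive lesson: in memory-bound operations, doing more work to avoid memory movement is a winning strategy.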

Putting It All Together: The Flash Attention Algorithm

Let’s see how these three tricks combine in the actual algorithm. Here’s a simplified view of the Flash Attention forward pass:

Algorithm: Flash Attention Forward Pass
Input: Q, K, V matrices of size N×d
Output: O matrix of size N×d

1. Divide sequences into blocks of size B_r and B_c
2. Initialize output O = 0, running stats m = -∞, l = 0

3. For each K,V block j:
   4. Load K_j, V_j into SRAM
   
   5. For each Q block i:
      6. Load Q_i, current O_i, m_i, l_i into SRAM
      
      7. Compute scores: S_ij = Q_i × K_j^T / √d
      
      8. Update running softmax:
         - m_new = max(m_i, max(S_ij))
         - l_new = exp(m_i - m_new) × l_i + Σ exp(S_ij - m_new)
      
      9. Compute this block's output:
         - P_ij = exp(S_ij - m_new)   (unnormalized)
         - O_i = (exp(m_i - m_new) × l_i × O_i + P_ij × V_j) / l_new
      
      10. Set m_i = m_new, l_i = l_new; store updated O_i, m_i, l_i to HBM

11. Return O

The beauty is in what’s not there: we never materialize the full N×N attention matrix. Each block’s computation happens entirely in SRAM, and we only write back the N×d output plus the per-row softmax statistics m and l.
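To check that the tiled update really reproduces ordinary attention, here is a NumPy sketch of the forward pass. It only mirrors the loop structure and the rescaling arithmetic of the listing: there is no real SRAM or HBM involved, the block sizes are arbitrary, and the names `flash_attention_forward` and `attention_reference` are chosen for this example.

```python
import numpy as np


def attention_reference(Q, K, V):
    """Standard attention that materializes the full N x N matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V


def flash_attention_forward(Q, K, V, block_r=4, block_c=4):
    """Tiled attention with running max m and running sum l per query row."""
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)
    l = np.zeros(N)
    for j in range(0, K.shape[0], block_c):          # outer loop: K/V blocks
        Kj, Vj = K[j:j + block_c], V[j:j + block_c]
        for i in range(0, N, block_r):               # inner loop: Q blocks
            rows = slice(i, i + block_r)
            S = Q[rows] @ Kj.T / np.sqrt(d)          # one tile of scores
            m_new = np.maximum(m[rows], S.max(axis=1))
            P = np.exp(S - m_new[:, None])           # unnormalized probabilities
            scale = np.exp(m[rows] - m_new)          # rescales the old statistics
            l_new = scale * l[rows] + P.sum(axis=1)
            O[rows] = ((scale * l[rows])[:, None] * O[rows] + P @ Vj) / l_new[:, None]
            m[rows], l[rows] = m_new, l_new
    return O


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
print(np.allclose(flash_attention_forward(Q, K, V), attention_reference(Q, K, V)))
```

The sequence length of 10 is deliberately not a multiple of the block size, so the last tile in each direction is smaller than the others; the running statistics handle that without any special casing.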

From Flash Attention to Flash Attention-2

The original Flash Attention was a breakthrough, but it left performance on the table. Profiling showed it achieved only 25-40% of the GPU’s theoretical peak performance. Flash Attention-2 represents a complete algorithmic rewrite that addresses these inefficiencies.

The Parallelism Problem

Flash Attention-1 parallelized across batch size and number of attention heads. But what happens with long sequences and small batch sizes? Many of the GPU’s 108 streaming multiprocessors sit idle.

Flash Attention-2’s solution: also parallelize across the sequence length dimension. Different thread blocks handle different portions of the output sequence, ensuring full GPU utilization even with batch size 1.

The Work Partitioning Revolution

Within each thread block, Flash Attention-1 used a “split-K” scheme:

K and V were split across 4 warps
Each warp computed partial results
Warps had to synchronize and combine results through shared memory

This created a communication bottleneck. Flash Attention-2 flips this to “split-Q”:

Q is split across warps
K and V are shared by all warps
Each warp computes its portion independently with no synchronization

This seemingly simple change removes the need for warps to exchange partial results, substantially reducing shared memory reads and writes.

The Results

Flash Attention-2 achieves:

50-73% of theoretical peak FLOPS (up from 25-40%)
2x speedup over Flash Attention-1
Up to 9x speedup over PyTorch standard attention
225 TFLOPs/s on A100 GPUs for end-to-end training

These aren’t incremental improvements—they’re transformative leaps that make previously impossible model configurations practical.

The Lessons of Flash Attention

Flash Attention teaches us several crucial lessons about algorithm design in the age of specialized hardware:

Profile the Real Bottleneck: The obvious problem (quadratic FLOPs) wasn’t the actual problem (memory bandwidth). Understanding your hardware’s characteristics is essential.

Embrace Hardware Constraints: Rather than fighting the small SRAM size, Flash Attention designs around it. Constraints can inspire innovation.

Exact Beats Approximate: While the research community pursued approximations, Flash Attention showed that exact computation could be faster through better algorithm design.

Recomputation Can Be Free: In memory-bound regimes, trading computation for memory movement is often profitable, a counterintuitive insight that challenges conventional optimization wisdom.

Conclusion

Flash Attention isn’t just a faster attention implementation; it’s a masterclass in hardware-aware algorithm design. By recognizing that memory movement, not computation, was the true bottleneck, and by developing three mathematical techniques to minimize that movement, Flash Attention transformed what’s possible with Transformer models.

The online softmax algorithm, in particular, stands as a brilliant example of mathematical reformulation enabling practical breakthroughs. It shows that sometimes the path forward isn’t to approximate or simplify, but to find clever exact reformulations that align with hardware constraints.

As we push toward ever-longer context windows and larger models, the principles behind Flash Attention (tiling for locality, online algorithms for incremental processing, and strategic recomputation) will remain relevant. They remind us that in the modern era of AI, the best algorithms aren’t just mathematically elegant; they’re architecturally aware.

The success of Flash Attention also highlights a broader truth: breakthrough performance improvements often come from questioning assumptions. Everyone “knew” that attention was compute-bound. Everyone “knew” that storing intermediate values was better than recomputing them. Flash Attention proved everyone wrong, and in doing so, enabled the current generation of long-context language models that are transforming AI applications.

The memory wall that seemed insurmountable in 2021 has been broken. Not by approximation, not by new hardware, but by three mathematical tricks and a deep understanding of the machine.

References

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv preprint arXiv:2205.14135.
- The original Flash Attention paper that introduced the tiling and online softmax algorithms.
Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691.
- The follow-up paper detailing the algorithmic improvements in Flash Attention-2.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is All You Need. Advances in neural information processing systems, 30.
- The foundational Transformer paper that introduced the self-attention mechanism.
Rabe, M. N., & Staats, C. (2021). Self-attention Does Not Need O(n²) Memory. arXiv preprint arXiv:2112.05682.
- Important theoretical work on memory-efficient attention computation that influenced Flash Attention’s development.
Milakov, M., & Gimelshein, N. (2018). Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867.
- Mathematical foundation for the online softmax algorithm used in Flash Attention.
Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509.
- Representative work on sparse attention that, despite reducing FLOPs, often failed to deliver wall-clock speedups.
NVIDIA. (2020). NVIDIA A100 Tensor Core GPU Architecture. NVIDIA Corporation.
- Technical specifications of the A100 GPU architecture that Flash Attention was optimized for.

Advanced NVIDIA GPU Monitoring for LLM Inference: A Deep Dive into H100 Architecture and Performance Optimization

Sat, 23 Aug 2025 14:20:33 +0800

Introduction: The Economics of Efficiency

When OpenAI serves ChatGPT to millions of users, every percentage point of GPU efficiency translates to millions in infrastructure costs. The difference between 10 percent and 40 percent Model FLOPs Utilization (MFU) can determine whether your LLM service is profitable or bleeding money. In the world of large-scale AI deployment, understanding your hardware at the deepest level isn’t just an academic exercise—it’s a business imperative.

This guide reveals the architecture and monitoring techniques that separate amateur deployments from production-grade systems. We’ll explore how modern LLM inference maps to NVIDIA’s revolutionary H100 architecture, dissect the metrics that truly matter, and provide the knowledge needed to achieve the 2-10x performance improvements that industry leaders routinely accomplish.

Since you’re already familiar with the basics of LLM inference from previous discussions, we’ll dive directly into the advanced architectural details and sophisticated monitoring strategies that will transform your understanding of GPU optimization.

LLM Inference Process: A Hardware Perspective

Before we explore the H100’s revolutionary architecture, let’s establish how LLM inference operations map to GPU hardware. This understanding forms the foundation for interpreting the metrics we’ll later use for optimization.

Modern LLM inference consists of two distinct phases that stress different aspects of GPU architecture. The prefill phase, where the model processes the entire input context, is fundamentally compute-bound. During this phase, the model performs massive matrix multiplications across all input tokens simultaneously, creating work that can effectively saturate the GPU’s computational units. In contrast, the generation phase, where tokens are produced one at a time, becomes memory-bound due to the autoregressive nature of the process. Each new token requires accessing the entire key-value cache while performing relatively minimal computation.

The memory transfer operations begin with Host-to-Device (H2D) transfers moving input tokens via PCIe. On modern systems, this means a PCIe Gen 4 x16 link at 64 GB/s bidirectional (about 32 GB/s in each direction) or, on H100 systems, Gen 5 at 128 GB/s bidirectional (about 64 GB/s in each direction). Once data reaches the GPU, it enters a complex memory hierarchy that we’ll explore in detail in the next section. The efficiency of these transfers often determines the lower bound of inference latency, particularly for smaller models where compute isn’t the bottleneck.

Within the GPU, memory movement follows strict hierarchical patterns. Data flows from global memory (HBM) through various cache levels before reaching the compute units. Understanding this hierarchy is crucial because memory bandwidth, not compute capacity, often becomes the limiting factor in LLM inference performance.

H100 Architecture Deep Dive: Essential Components for LLM Inference

Understanding the H100’s architecture is fundamental to optimizing LLM inference. Each component serves a specific purpose in the complex orchestration of transformer computations. This section provides a comprehensive primer on what each architectural element does and how it contributes to overall system performance.

Streaming Multiprocessors (SMs): The Computational Foundation

The Streaming Multiprocessor is the fundamental processing unit of the GPU. The H100 SXM5 has 132 SMs enabled (the full GH100 die contains 144), each capable of independent instruction execution. Each SM functions as a complete processor with its own instruction cache, schedulers, execution units, and register file.

Within each SM, four warp schedulers manage thread execution. A warp consists of 32 threads that execute in lockstep—when one thread in a warp executes an instruction, all 32 execute the same instruction on different data. Each scheduler can dispatch instructions from a different warp every cycle, enabling the SM to hide memory latency by switching between warps when one stalls.

The SM contains 128 CUDA cores for general-purpose computation, handling integer and single-precision floating-point operations. These cores execute the non-matrix operations in neural networks: activation functions, normalization, element-wise operations, and control flow. The SM also houses 4 Tensor Cores, specialized units that perform matrix multiply-accumulate operations at dramatically higher throughput than CUDA cores.

Each SM includes 256 KB of register file storage, providing ultra-fast temporary storage for thread-local variables. This generous register allocation enables complex kernels to maintain their working set entirely in registers, avoiding slower memory accesses. The register file is banked to allow multiple simultaneous accesses, critical for maintaining throughput when all threads need data simultaneously.

Memory Hierarchy: From Registers to HBM3

The memory system follows a strict hierarchy, with each level trading capacity for speed. Understanding this hierarchy is crucial for inference optimization, as data movement often dominates execution time.

Registers provide the fastest storage at approximately 20 TB/s of aggregate bandwidth per SM. Each thread can access up to 255 registers, with access latency of just one clock cycle. Register allocation happens at compile time, and efficient register use is critical for kernel performance.

Shared Memory and L1 Cache share a 256 KB pool per SM, of which up to 228 KB can be configured as shared memory. Shared memory enables threads within a block to communicate and share data with latency of approximately 30 cycles. This memory is divided into 32 banks to enable parallel access, which is critical for algorithms like Flash Attention that rely on efficient shared memory access patterns.

L2 Cache provides 50 MB of shared storage across all SMs with approximately 6 TB/s of bandwidth. The L2 cache maintains frequently accessed data like model weights and popular activation tensors. Its partitioned design allows multiple SMs to access different cache lines simultaneously without contention.

HBM3 (High Bandwidth Memory) delivers 80 GB of capacity with 3 TB/s of bandwidth through 10 memory controllers. HBM3 uses a 5120-bit wide interface achieved through vertical stacking of memory dies directly on the GPU package. Access latency ranges from 200-300 cycles, making it crucial to hide this latency through parallelism and caching.

Tensor Cores: Matrix Multiplication Acceleration

Tensor Cores are specialized processing units designed exclusively for matrix multiply-accumulate operations, the dominant computation in transformer models. The first-generation (Volta) Tensor Core performed one 4×4 matrix multiply-accumulate per clock cycle; the H100’s fourth-generation Tensor Cores process considerably more work per clock, delivering dramatically higher throughput than traditional CUDA cores.

The fourth-generation Tensor Cores in H100 support multiple precision formats. FP64 provides full double precision for scientific computing. TF32 (TensorFloat-32) offers the range of FP32 with the precision of FP16, providing a drop-in replacement for FP32 training. FP16 and BF16 (BrainFloat16) enable mixed-precision training and inference. FP8 in two variants (E4M3 and E5M2) doubles throughput while maintaining acceptable accuracy for most transformer operations. INT8 provides further acceleration for quantized inference.

Each Tensor Core operates on small matrix tiles, typically 16×16 or smaller, depending on the precision. The operation D = A × B + C is performed in a single instruction, where A, B, C, and D are matrix tiles. This fused operation eliminates the need to write intermediate results to memory, significantly improving efficiency.

Transformer Engine: Intelligence for Transformer Models

The Transformer Engine is not a physical component but a collection of hardware and software optimizations specifically designed for transformer architectures. It automatically manages numerical precision throughout the network, choosing optimal formats for different operations.

The engine maintains statistics about tensor magnitudes and automatically scales values to maximize precision within the available dynamic range. For attention computations, it might use FP16 for the softmax operation while using FP8 for matrix multiplications. This dynamic precision management happens transparently, requiring no manual intervention while delivering near-FP16 accuracy at FP8 speeds.

The Transformer Engine also includes optimized implementations of common transformer operations. Layer normalization, positional encodings, and attention patterns are accelerated through specialized hardware paths. These optimizations are exposed through libraries like cuBLAS and cuDNN, making them accessible to framework developers.

NVLink and PCIe Interfaces: System Connectivity

The H100 supports both NVLink 4.0 and PCIe Gen5 for system connectivity. NVLink provides 900 GB/s of bidirectional bandwidth (18 links at 50 GB/s each) for GPU-to-GPU communication, essential for model parallelism and multi-GPU inference. The high bandwidth and low latency of NVLink enables treating multiple GPUs almost as a single larger GPU for compatible workloads.

PCIe Gen5 delivers 128 GB/s of bidirectional bandwidth for host communication and storage access. This interface handles model loading, input data transfer, and result retrieval. The increased bandwidth of Gen5 reduces the time spent waiting for data transfer, particularly important for smaller models where transfer time might dominate computation time.

Hardware Schedulers: Orchestrating Execution

Beyond the warp schedulers in each SM, the H100 includes global hardware schedulers that manage work distribution across the GPU. The GigaThread Engine schedules thread blocks to SMs, considering factors like load balancing, cache locality, and resource availability.

The Work Distributor ensures efficient distribution of work across all available SMs, preventing scenarios where some SMs sit idle while others are overloaded. It understands the resource requirements of each kernel and schedules blocks to maximize occupancy while avoiding resource conflicts.

These hardware schedulers operate with sub-microsecond latency, enabling fine-grained scheduling decisions that would be impossible to implement in software. They continuously monitor SM utilization and adjust scheduling decisions dynamically, ensuring optimal resource utilization even with irregular workloads.

Why This Architecture Matters: Each component in the H100 is designed to address specific bottlenecks in transformer inference. The massive register files enable complex kernels, the enhanced memory hierarchy reduces data movement overhead, specialized units like Tensor Cores and the Tensor Memory Accelerator (TMA) accelerate common operations, and intelligent scheduling ensures all resources are effectively utilized. Understanding how these components work together enables developers to write software that fully exploits the hardware’s capabilities.

Model FLOPs Utilization: The North Star Metric

Now that we understand the hardware foundation, we can properly appreciate why Model FLOPs Utilization (MFU) has become the definitive metric for LLM inference efficiency. Unlike simpler metrics that only indicate whether the GPU is busy, MFU measures how effectively we’re using the computational capacity we’ve paid for.

Understanding MFU in Context

Model FLOPs Utilization represents the ratio of achieved computational throughput to theoretical peak hardware throughput. When we report 30 percent MFU, we’re saying that out of the H100’s theoretical 989 TFLOPS of FP16 compute, we’re achieving approximately 297 TFLOPS of useful model computation. The remaining capacity is lost to memory bottlenecks, kernel launch overhead, synchronization, and other inefficiencies.

The fundamental MFU calculation starts with understanding the computational requirements of transformer models. For a forward pass, we need approximately 2 FLOPs per parameter per token for the feed-forward and projection layers. The attention computation adds a significant number of FLOPs that scales quadratically with the sequence length:

Attention FLOPs per layer ≈ 2 × L_seq² × D_hidden

Definitions:

N_layers: number of transformer layers
L_seq: input sequence length (tokens)
D_hidden: hidden size (n_heads × d_head)

Note: the constant here (~2) varies by implementation; the key point is the quadratic scaling with L_seq.

This quadratic scaling explains why long context lengths can dramatically impact computational requirements.
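To make the ratio concrete, here is a minimal sketch of an MFU estimate built from the two terms above. The function name, its arguments, and the 21,000 tokens/second measurement are hypothetical and chosen only for illustration; the 989 TFLOPS peak is the FP16 figure used throughout this post.

```python
def model_flops_utilization(tokens_per_second: float,
                            n_params: float,
                            peak_flops: float,
                            n_layers: int = 0,
                            seq_len: int = 0,
                            d_hidden: int = 0) -> float:
    """Estimate MFU as achieved model FLOPs/s divided by peak FLOPs/s."""
    # Roughly 2 FLOPs per parameter per token for the weight matmuls.
    flops_per_token = 2.0 * n_params
    # Attention costs ~2 * L^2 * D per layer for the whole sequence,
    # which is ~2 * L * D per layer per token (constant varies by implementation).
    flops_per_token += 2.0 * n_layers * seq_len * d_hidden
    return tokens_per_second * flops_per_token / peak_flops


H100_PEAK_FP16 = 989e12  # FLOPs/s, the peak figure quoted in this post

# Hypothetical measurement: a 7B-parameter model processing 21,000 tokens/s.
mfu = model_flops_utilization(21_000, 7e9, H100_PEAK_FP16)
print(f"MFU = {mfu:.1%}")
```

Leaving the attention arguments at zero gives the common "2 × parameters" approximation, which is usually adequate for short sequences; passing the layer count, sequence length, and hidden size adds the quadratic term.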

The Reality of MFU in Production

The MFU values achieved in production often surprise newcomers to the field. During training, well-optimized systems routinely achieve 40-60 percent MFU because the workload is consistent and batches are large. However, inference presents a different challenge entirely.

During the prefill phase, where the model processes the entire input context, we typically see 30-45 percent MFU. This phase is compute-bound and benefits from the parallel processing of all input tokens. The generation phase tells a different story, with MFU dropping to just 5-15 percent. This dramatic reduction isn’t a sign of poor optimization—it’s a fundamental consequence of autoregressive generation’s memory-bound nature.

Model size significantly impacts achievable MFU. A 7B parameter model might achieve 25-35 percent MFU during prefill and 8-12 percent during generation on a single GPU. Scale up to a 70B model with tensor parallelism, and you might see 35-45 percent prefill MFU but only 4-8 percent during generation. The larger model achieves higher prefill MFU because it better amortizes memory transfer costs, but lower generation MFU because each token requires accessing more parameters.

Why This Matters: Understanding these MFU realities helps set appropriate optimization targets. Achieving 50 percent MFU during generation would require fundamental algorithmic breakthroughs, not just better engineering. Teams should focus on maximizing prefill MFU while accepting that generation will always be memory-bound.

MFU as an Economic Indicator

The direct relationship between MFU and cost makes it invaluable for capacity planning and hardware selection. The cost per token can be expressed as:

Cost per Token = (GPU Cost per Hour × FLOPs per Token) / (MFU × Peak FLOPS × 3,600 s/hour)

Because cost scales with 1/MFU, a 10% relative improvement in MFU reduces infrastructure cost by about 9% (1 − 1/1.1) at constant throughput.

Cost example (holding throughput constant)

Assumptions: 1,000 H100 GPUs at $3/GPU·hour, MFU improves from 20% → 30%.

Calculation:
GPUs needed ∝ 1 / MFU
GPUs_saved = 1000 × (1 - 0.20/0.30) = 333
Hourly_savings = 333 × $3 ≈ $999 ≈ $1,000 per hour
Annual_savings ≈ $1,000 × 24 × 365 ≈ $8.8M per year
In practice, higher MFU often also improves batching and reduces latency, increasing the effective savings.
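The same arithmetic can be written as a small calculator, which makes it easy to try other fleet sizes or prices. This is a sketch under the stated assumption that the number of GPUs needed scales with 1/MFU at constant throughput; the function names are illustrative.

```python
def gpus_needed(baseline_gpus: float, baseline_mfu: float, new_mfu: float) -> float:
    """GPUs required for the same throughput, assuming GPUs needed scales with 1/MFU."""
    return baseline_gpus * baseline_mfu / new_mfu


def relative_cost_reduction(baseline_mfu: float, new_mfu: float) -> float:
    """Fraction of infrastructure cost saved at constant throughput."""
    return 1.0 - baseline_mfu / new_mfu


def annual_savings(baseline_gpus: float, baseline_mfu: float, new_mfu: float,
                   dollars_per_gpu_hour: float) -> float:
    """Yearly savings in dollars from the GPUs that are no longer needed."""
    saved = baseline_gpus - gpus_needed(baseline_gpus, baseline_mfu, new_mfu)
    return saved * dollars_per_gpu_hour * 24 * 365


# The example above: 1,000 H100s at $3/GPU-hour, MFU 20% -> 30%.
saved_gpus = 1000 - gpus_needed(1000, 0.20, 0.30)
print(f"GPUs saved: {saved_gpus:.0f}")
print(f"Cost reduction: {relative_cost_reduction(0.20, 0.30):.1%}")
print(f"Annual savings: ${annual_savings(1000, 0.20, 0.30, 3.0):,.0f}")
```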

Hardware Comparison: Effective TFLOPS

A100: $312 \text{ TFLOPS} \cdot 30\% \text{ MFU} = 93.6 \text{ effective TFLOPS}$

H100: $989 \text{ TFLOPS} \cdot 25\% \text{ MFU} = 247.3 \text{ effective TFLOPS}$

The H100 provides 2.6x more effective compute in this scenario, justifying its premium for compute-intensive workloads.

Comprehensive GPU Metrics: Beyond Simple Utilization

With our understanding of hardware architecture and MFU established, we can now explore the full spectrum of metrics available for monitoring NVIDIA GPUs. Each metric provides a different perspective on system behavior, and understanding their relationships is crucial for effective optimization.

The Hierarchy of Utilization Metrics

GPU utilization, the most commonly cited metric, merely indicates the percentage of time when one or more kernels are executing. A GPU showing 100 percent utilization might be performing useful work efficiently, or it might be spinning in inefficient kernels. This metric alone tells us almost nothing about actual performance.

Streaming Multiprocessor (SM) efficiency provides more insight by measuring how effectively active SMs utilize their resources. This includes warp occupancy (the ratio of active warps to maximum possible warps) and instruction throughput. An SM with high occupancy but low instruction throughput suggests memory bottlenecks, while low occupancy with high throughput might indicate kernel launch overhead.

Memory bandwidth utilization reveals whether we’re constrained by data movement. On an H100, achieving 2.5 TB/s out of 3 TB/s theoretical bandwidth (83 percent utilization) might seem good, but if those transfers are inefficient (non-coalesced, redundant), we’re still wasting resources. The relationship between achieved bandwidth and useful work becomes critical.

Tensor Core Utilization: The Hidden Bottleneck

Tensor Core utilization often becomes the limiting factor in achieving high MFU, yet it’s frequently overlooked. The metric isn’t simply whether Tensor Cores are active, but how efficiently they’re being fed with data and how well the problem dimensions align with hardware requirements.

For optimal Tensor Core utilization, matrix dimensions must align with hardware constraints—multiples of 8 for FP16 operations, 16 for INT8. Misaligned dimensions can reduce utilization by 50 percent or more. The new H100 Transformer Engine alleviates some alignment constraints, but understanding these requirements remains crucial for optimization.
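One common way to satisfy these alignment constraints is to pad a dimension up to the next multiple. The helper below is a minimal sketch of that idea; the lookup table simply encodes the multiples quoted above, and the names are illustrative rather than taken from any library.

```python
# Alignment multiples quoted in the text: 8 for FP16, 16 for INT8.
TENSOR_CORE_MULTIPLE = {"fp16": 8, "int8": 16}


def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def padded_dim(n: int, dtype: str) -> int:
    """Pad a matrix dimension so it aligns with the Tensor Core multiple for dtype."""
    return round_up(n, TENSOR_CORE_MULTIPLE[dtype])


# Example: a vocabulary of 50,257 entries is not a multiple of 8 or 16.
for dtype in ("fp16", "int8"):
    print(dtype, padded_dim(50257, dtype))
```

The padded rows or columns hold zeros and are ignored downstream, so the cost is a little extra memory in exchange for aligned matrix shapes.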

The relationship between Tensor Core utilization and memory bandwidth becomes particularly important during inference. Even with perfect alignment, Tensor Cores can only maintain peak throughput if data arrives fast enough. This creates a careful balance—batch sizes must be large enough to amortize memory transfer costs but small enough to meet latency requirements.

Memory Hierarchy Metrics: Finding the Real Bottleneck

Understanding memory metrics requires thinking hierarchically. L1/shared memory hit rates tell us about kernel efficiency—rates below 80 percent suggest poor data locality. L2 cache hit rates indicate weight reuse effectiveness—critical for models with repeated layer structures. HBM bandwidth utilization reveals whether we’re fundamentally memory-bound.

The introduction of the Memory Bandwidth Utilization (MBU) metric by Databricks provides a complementary view to MFU. MBU measures achieved memory bandwidth versus theoretical peak, helping identify whether computation or memory movement is the limiting factor. When MBU approaches 100 percent while MFU remains low, we know memory bandwidth is the bottleneck.

Cache line efficiency becomes critical in attention mechanisms. The irregular access patterns of key-value caches can waste significant bandwidth if not properly managed. Modern implementations like PagedAttention improve cache line utilization from around 60 percent to over 95 percent, directly translating to higher effective memory bandwidth.

Power and Thermal Metrics: The Overlooked Constraints

Power consumption and thermal behavior significantly impact sustained performance, particularly in dense datacenter deployments. The H100 can consume up to 700W, generating substantial heat that must be managed. Thermal throttling can reduce clock speeds by 30 percent or more, directly impacting achievable MFU.

Dynamic frequency scaling based on workload characteristics means that power-efficient kernels can run at higher clock speeds, improving overall throughput. Understanding the relationship between different operations and power consumption helps in scheduling and workload distribution.

Power Efficiency: TFLOPS per Watt

H100: $989 \text{ TFLOPS} / 700\text{W} \approx 1.4 \text{ TFLOPS/W}$

A100: $312 \text{ TFLOPS} / 400\text{W} \approx 0.78 \text{ TFLOPS/W}$

This nearly 2x improvement in power efficiency compounds the H100’s computational advantages.

Bottleneck Analysis: The Mathematics of Performance Limits

Understanding whether your system is compute-bound or memory-bound requires more than just monitoring metrics—it demands understanding the fundamental arithmetic relationships in transformer models. This mathematical framework, combined with architectural knowledge, enables precise bottleneck identification and targeted optimization.

Arithmetic Intensity: The Fundamental Diagnostic Tool

Arithmetic intensity, defined as the ratio of floating-point operations to bytes of memory accessed, provides the key to understanding performance bottlenecks. For any given operation, we can calculate the arithmetic intensity and compare it to the hardware’s balance point—the ratio of peak compute throughput to peak memory bandwidth.

Balance Point = Peak Compute (FLOPS) / Peak Memory Bandwidth (Bytes/s)

Hardware Balance Points (FP16)

H100: $989 \text{ TFLOPS} / 3000 \text{ GB/s} \approx 330 \text{ Ops/Byte}$

A100: $312 \text{ TFLOPS} / 1555 \text{ GB/s} \approx 200 \text{ Ops/Byte}$

When a workload’s arithmetic intensity falls below this threshold, it is memory-bound; above, it is compute-bound.
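The comparison is simple enough to express directly. The sketch below computes the balance point from the peak figures quoted above and classifies a workload by its arithmetic intensity; the function names are illustrative.

```python
def balance_point(peak_flops: float, peak_bytes_per_s: float) -> float:
    """Hardware balance point in operations per byte."""
    return peak_flops / peak_bytes_per_s


def classify(arithmetic_intensity: float, peak_flops: float,
             peak_bytes_per_s: float) -> str:
    """Label a workload relative to the hardware balance point."""
    threshold = balance_point(peak_flops, peak_bytes_per_s)
    return "compute-bound" if arithmetic_intensity > threshold else "memory-bound"


# FP16 peak compute and memory bandwidth as quoted in this post.
H100 = (989e12, 3000e9)
A100 = (312e12, 1555e9)

print(f"H100 balance point: {balance_point(*H100):.0f} Ops/Byte")
print(f"A100 balance point: {balance_point(*A100):.0f} Ops/Byte")
print(classify(1.0, *H100))     # e.g. single-token generation
print(classify(2000.0, *H100))  # e.g. a large prefill
```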

Transformer Arithmetic: Breaking Down the Operations

To apply arithmetic intensity analysis to LLM inference, we must first understand the computational structure of transformers. Each layer consists of two main components: multi-head attention and feed-forward networks, each with distinct computational characteristics.

The FLOPs per transformer layer break down as follows, where B is the batch size, L the sequence length, and H the hidden size:

Attention FLOPs (per layer):

QKV Projections: 6BLH²
QK^T Computation: 2BL²H
Attention × V: 2BL²H
Output Projection: 2BLH²
Total Attention: 8BLH² + 4BL²H

FFN FLOPs (per layer):

Up-projection: 8BLH²
Down-projection: 8BLH²
Total FFN: 16BLH²

Total FLOPs per layer:

Total = (8BLH² + 4BL²H) + 16BLH² = 24BLH² + 4BL²H
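The breakdown above translates directly into code, which is convenient for checking how the quadratic attention term compares with the rest as sequences grow. This is a sketch of the formulas as written, assuming B is batch size, L sequence length, H hidden size, and a feed-forward block with the 4× expansion implied by the 8BLH² projections.

```python
def attention_flops(B: int, L: int, H: int) -> int:
    """QKV projections + QK^T + attention x V + output projection."""
    return 6 * B * L * H**2 + 2 * B * L**2 * H + 2 * B * L**2 * H + 2 * B * L * H**2


def ffn_flops(B: int, L: int, H: int) -> int:
    """Up-projection (H -> 4H) plus down-projection (4H -> H)."""
    return 8 * B * L * H**2 + 8 * B * L * H**2


def layer_flops(B: int, L: int, H: int) -> int:
    """Total forward-pass FLOPs for one transformer layer."""
    return attention_flops(B, L, H) + ffn_flops(B, L, H)


def quadratic_share(B: int, L: int, H: int) -> float:
    """Fraction of a layer's FLOPs spent in the L^2 attention terms."""
    return 4 * B * L**2 * H / layer_flops(B, L, H)


for seq_len in (2048, 32768):
    share = quadratic_share(1, seq_len, 4096)
    print(f"L={seq_len}: {share:.1%} of layer FLOPs are quadratic in L")
```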

Memory access patterns tell a different story. During the prefill phase, we load model weights once but use them for all tokens in the sequence, achieving good arithmetic intensity. During generation, we load the entire model weights to process a single token, resulting in poor arithmetic intensity that decreases with model size.

The Prefill Phase: Compute-Bound Territory

During prefill, when processing an entire input sequence, arithmetic intensity is relatively high. Consider a concrete example with Llama 2 7B processing a 2048-token sequence with batch size 1.

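The attention score and value matmuls perform approximately $2 \cdot 32 \cdot 2048^2 \cdot 4096 \approx 1.1$ trillion FLOPs across the 32 layers, and the weight matmuls add about $2 \cdot 7 \times 10^9 \cdot 2048 \approx 28.7$ trillion FLOPs (2 FLOPs per parameter per token), all while reading roughly 14 GB of FP16 weights. Counting weight traffic only, this yields an arithmetic intensity of:

AI_prefill = (29.8 × 10¹² FLOPs) / (14 × 10⁹ Bytes) ≈ 2,100 Ops/Byte

This is well above the H100’s balance point of 330, indicating that a 2048-token prefill is compute-bound even at batch size 1. Activation and KV-cache traffic lower the real figure, but not enough to change the conclusion.

Short prompts are a different matter. Because the weights are read once and reused for every token in the pass, the weight-matmul intensity is roughly one operation per byte for each token processed. A 64-token prompt at batch size 1 therefore sits near 64 Ops/Byte, which is memory-bound. Batching 32 such prompts performs 32× more operations while only marginally increasing memory access (weights are reused across the batch), raising the arithmetic intensity to roughly 2,000 operations per byte and making the workload solidly compute-bound.

This analysis explains why batch size has such a profound impact on MFU during prefill. With short prompts, small batches leave the GPU memory-bound. Only when the number of tokens per pass (batch size × sequence length) grows well past the balance point do we transition to compute-bound operation where Tensor Cores can operate near peak efficiency.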

The Generation Phase: The Memory Bandwidth Wall

Generation phase arithmetic intensity tells a starkly different story. When generating a single token, we must load the entire model (14 GB for Llama 2 7B) to perform approximately 14 billion operations ($2 \cdot 7B$ parameters). This yields an arithmetic intensity of:

AI_gen = (14 × 10⁹ FLOPs) / (14 × 10⁹ Bytes) = 1 Op/Byte

This is two orders of magnitude below the balance point, confirming generation is severely memory-bound.

The KV-cache adds to this memory traffic. For each generated token, we must read the cached keys and values for all previous tokens. With a 2048-token context, 32 layers, 8192 cached values per token per layer (4096 keys plus 4096 values), and 2 bytes per value, this means accessing $2048 \cdot 32 \cdot 8192 \cdot 2 = 1.074$ GB (decimal) or 1.0 GiB (binary) of KV-cache data for each token generated. Unlike the weights, this traffic is not shared across requests in a batch, so it grows with both batch size and context length.
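A short sketch makes the weight and KV-cache traffic easy to explore for other configurations. It assumes FP16 storage, full multi-head attention (no grouped-query sharing), and roughly 4·N·L·H attention FLOPs per generated token; the function names and the batch-size parameter are illustrative additions, not part of the analysis above.

```python
def kv_cache_bytes(context_len: int, n_layers: int, d_hidden: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes of cached keys and values read to generate one token."""
    # Each layer stores d_hidden key values and d_hidden value values per token.
    return context_len * n_layers * 2 * d_hidden * bytes_per_elem


def decode_arithmetic_intensity(n_params: float, batch: int, context_len: int,
                                n_layers: int, d_hidden: int,
                                bytes_per_elem: int = 2) -> float:
    """Ops/byte for one decoding step over a batch of sequences."""
    weight_flops = 2.0 * n_params * batch
    attn_flops = 4.0 * n_layers * context_len * d_hidden * batch
    weight_bytes = n_params * bytes_per_elem          # read once per step
    kv_bytes = batch * kv_cache_bytes(context_len, n_layers, d_hidden,
                                      bytes_per_elem)  # read per sequence
    return (weight_flops + attn_flops) / (weight_bytes + kv_bytes)


# Llama 2 7B-like shape used in the text: 32 layers, hidden size 4096.
kv = kv_cache_bytes(2048, 32, 4096)
print(f"KV-cache read per token: {kv / 1e9:.3f} GB ({kv / 2**30:.1f} GiB)")

for batch in (1, 32):
    ai = decode_arithmetic_intensity(7e9, batch, 2048, 32, 4096)
    print(f"batch={batch}: {ai:.1f} Ops/Byte")
```

Under these assumptions, batching raises intensity by reusing the weights, but the per-sequence KV-cache reads keep the gain well below the batch size, which is one reason long contexts stay memory-bound even when batched.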

This fundamental mathematical reality explains why generation phase MFU rarely exceeds 15 percent. We’re not failing to optimize; we’re hitting the physical limits of memory bandwidth. No amount of kernel optimization can overcome this arithmetic intensity barrier—only architectural changes like larger caches or algorithmic innovations like speculative decoding can help.

Identifying Your Bottleneck: A Systematic Approach

To determine whether your specific workload is compute-bound or memory-bound, follow this systematic analysis:

First, calculate the theoretical arithmetic intensity for your model and batch size. Using P = model parameters, B = batch size, N = number of layers, L = sequence length, H = hidden size (≈ n_heads × d_head):

Compute FLOPs (forward pass):

FLOPs_compute ≈ 2·P·B·L + 4·N·B·L²·H

Memory bytes (FP16, 2 bytes/elem):

Bytes_memory ≈ 2·P + 2·B·L·H·N + 4·B·L·H·N

where the three terms correspond to weights, activations, and KV-cache respectively.
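As a rough sketch, the two estimates can be combined into a single theoretical arithmetic intensity. The code assumes a full forward pass over L tokens, so the weight term is counted once per token (2·P·B·L); that reading, along with the function names, is an assumption made for this illustration.

```python
def forward_flops(P: float, B: int, N: int, L: int, H: int) -> float:
    """Approximate FLOPs for a forward pass over B sequences of L tokens."""
    weight_flops = 2.0 * P * B * L            # 2 FLOPs per parameter per token
    attention_flops = 4.0 * N * B * L**2 * H  # QK^T and attention x V
    return weight_flops + attention_flops


def forward_bytes(P: float, B: int, N: int, L: int, H: int) -> float:
    """Approximate bytes moved in FP16: weights + activations + KV-cache."""
    weights = 2.0 * P
    activations = 2.0 * B * L * H * N
    kv_cache = 4.0 * B * L * H * N
    return weights + activations + kv_cache


def theoretical_intensity(P: float, B: int, N: int, L: int, H: int) -> float:
    """Theoretical arithmetic intensity in operations per byte."""
    return forward_flops(P, B, N, L, H) / forward_bytes(P, B, N, L, H)


# Llama 2 7B-like shape: 7e9 parameters, 32 layers, hidden size 4096.
ai = theoretical_intensity(P=7e9, B=1, N=32, L=2048, H=4096)
print(f"Theoretical AI: {ai:.0f} Ops/Byte")
```

Comparing the result against the hardware balance point gives a first guess at the bottleneck before any profiling is done.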

Next, measure your achieved arithmetic intensity using GPU metrics. Let tps be tokens/second and fpt be FLOPs/token; let BW be achieved memory bandwidth (bytes/second):

FLOPs_achieved = tps · fpt

AI_achieved = FLOPs_achieved / BW

Compare your measured arithmetic intensity to the hardware balance point. If it’s below the threshold, you’re memory-bound—focus on reducing memory access through techniques like kernel fusion, quantization, or Flash Attention. If it’s above the threshold, you’re compute-bound—consider using lower precision, pruning, or more efficient algorithms.

The Roofline Model in Practice

The roofline model visualizes these relationships, showing the performance ceiling imposed by either compute or memory bandwidth. The model creates a two-dimensional space where the x-axis represents arithmetic intensity and the y-axis represents achieved performance in FLOPS.

The “roofline” consists of two parts: a sloped line representing the memory bandwidth limit (performance = bandwidth × arithmetic_intensity) and a horizontal line representing the peak compute performance. The intersection point is the balance point we’ve been discussing. Real workloads appear as points in this space, immediately revealing whether they’re compute or memory limited.
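The two-part roofline reduces to a single min() expression. The sketch below evaluates it for a few arithmetic intensities using the H100 figures from this post; the function name is illustrative.

```python
def attainable_flops(arithmetic_intensity: float, peak_flops: float,
                     peak_bytes_per_s: float) -> float:
    """Roofline ceiling: the lower of the bandwidth slope and the compute peak."""
    return min(peak_flops, arithmetic_intensity * peak_bytes_per_s)


H100_PEAK_FLOPS = 989e12  # FP16 FLOPs/s, as quoted in this post
H100_PEAK_BW = 3000e9     # bytes/s, as quoted in this post

for ai in (1, 10, 100, 330, 1000):
    ceiling = attainable_flops(ai, H100_PEAK_FLOPS, H100_PEAK_BW)
    print(f"AI={ai:>5} Ops/Byte -> ceiling {ceiling / 1e12:7.1f} TFLOPS "
          f"({ceiling / H100_PEAK_FLOPS:.1%} of peak)")
```

Plotting this function on log-log axes reproduces the familiar roofline shape, with measured workloads placed as points beneath it.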

For LLM inference, prefill operations typically appear in the middle region, potentially reaching the compute roofline with sufficient batch size. Generation operations cluster far to the left, firmly in memory-bound territory. This visual representation makes optimization opportunities immediately apparent.

Bottleneck-Specific Optimization Strategies

Once you’ve identified your bottleneck, optimization strategies become clear and targeted.

For memory-bound operations, focus on reducing memory traffic. Operator fusion combines multiple operations to avoid intermediate memory writes. Quantization reduces the bytes per parameter, effectively increasing arithmetic intensity. Flash Attention keeps attention computations in shared memory, dramatically reducing HBM access. KV-cache compression techniques reduce the memory footprint of cached attention states.

For compute-bound operations, the strategies differ entirely. Use the highest-throughput precision your accuracy requirements allow—FP8 on H100 can double throughput versus FP16. Ensure tensor dimensions align with Tensor Core requirements (multiples of 8 for FP16, 16 for INT8). Consider structured sparsity to leverage the H100’s sparse Tensor Core operations. Implement better batching strategies to amortize overhead.

The key insight is that optimizing a memory-bound workload for compute efficiency (or vice versa) wastes effort. Understanding your position relative to the roofline model ensures optimization efforts target the actual bottleneck.

Dynamic Bottleneck Behavior

Bottlenecks aren’t static—they shift based on workload characteristics and system state. A system that’s compute-bound with large batches becomes memory-bound with small batches. Long sequences increase the compute requirements of attention quadratically, potentially shifting from memory to compute bound.

Thermal throttling can dynamically reduce compute capacity, shifting the balance point and potentially moving workloads from compute-bound to memory-bound. Understanding these dynamics helps explain performance variations and guides adaptive optimization strategies.

Modern inference systems must handle this dynamism gracefully. Techniques like dynamic batching adjust batch sizes based on queue depth and latency requirements, implicitly navigating the compute-memory tradeoff. Adaptive precision selection can switch between FP16 and FP8 based on whether the system is compute or memory bound.

Case Study: Optimizing a Memory-Bound Workload

Consider a production system serving Llama 2 13B with batch size 1, achieving only 5% MFU during generation. Analysis reveals an arithmetic intensity of 0.8 Ops/Byte—severely memory-bound.

The optimization strategy focuses entirely on memory traffic reduction.

INT8 Quantization: Halves memory requirements, doubling AI to 1.6 Ops/Byte.
Flash Attention: Reduces attention-related memory traffic by ~75%.
Continuous Batching: Increases average batch size to 8, multiplying AI by 8x.

After these optimizations, arithmetic intensity reaches approximately 13 Ops/Byte (0.8 × 2 × 8 = 12.8), still memory-bound but much improved. MFU increases from 5% to 18%, a 3.6x improvement. Further optimization would require architectural changes like model sharding to fit in GPU cache or algorithmic innovations like speculative decoding.
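The case-study arithmetic can be checked in a few lines. This sketch treats each optimization as an independent multiplier on arithmetic intensity, which is a simplification; Flash Attention is left out because its effect is stated as a reduction in attention traffic rather than as a multiplier.

```python
import math


def combined_intensity(base_ai: float, multipliers: dict) -> float:
    """Apply independent multiplicative gains to a baseline arithmetic intensity."""
    return base_ai * math.prod(multipliers.values())


# Multipliers taken from the case study above.
steps = {
    "int8_quantization": 2.0,    # halves bytes per parameter
    "continuous_batching": 8.0,  # average batch size of 8
}

ai = combined_intensity(0.8, steps)
mfu_gain = 18 / 5  # MFU moved from 5% to 18%

print(f"Arithmetic intensity after optimization: {ai:.1f} Ops/Byte")
print(f"MFU improvement: {mfu_gain:.1f}x")
```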

This systematic approach—measure, analyze, identify bottleneck, apply targeted optimization, repeat—transforms random experimentation into engineering discipline. Understanding the mathematical foundations of transformer inference enables predictable, reproducible performance improvements.

Advanced Monitoring Tools and Techniques

The complexity of modern GPU architectures demands sophisticated monitoring tools. Each tool in NVIDIA’s ecosystem serves a specific purpose, from real-time production monitoring to deep kernel-level analysis.

The NVIDIA-SMI Foundation

NVIDIA-SMI, built on the NVIDIA Management Library (NVML), provides the foundation for GPU monitoring. While often dismissed as too basic, it offers several advanced capabilities crucial for production systems. The tool’s ability to continuously monitor with minimal overhead (less than 1 percent performance impact) makes it ideal for always-on production monitoring.

The key to effective nvidia-smi usage lies in understanding its sampling behavior. Utilization metrics are sampled over intervals of between 1/6 second and 1 second, depending on the product, meaning short-duration kernels might be missed entirely. Memory bandwidth measurements aggregate over one-second windows, potentially hiding burst behavior. Understanding these limitations helps interpret the data correctly.

Advanced nvidia-smi features include event-triggered logging, which can capture detailed state information when specific conditions occur, and persistence mode management, which keeps the GPU driver loaded so that new processes avoid driver initialization latency. The tool’s ability to set and monitor power caps enables dynamic power management strategies that balance performance with thermal constraints.
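For always-on collection, nvidia-smi can emit machine-readable CSV through its --query-gpu and --format options. The sketch below builds such a command and parses its output; the field selection, function names, and sample text are illustrative (the sample is made up, not captured from a real GPU), and query_gpus() only works on a machine where nvidia-smi is installed.

```python
import csv
import io
import subprocess

# Query fields to request; each becomes one CSV column.
FIELDS = ["index", "utilization.gpu", "utilization.memory",
          "memory.used", "power.draw"]


def build_command() -> list:
    """Command line that prints one CSV row per GPU, without header or units."""
    return ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
            "--format=csv,noheader,nounits"]


def _to_number(text: str):
    """Convert a CSV cell to float; unsupported values such as [N/A] become None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_csv(text: str) -> list:
    """Parse nvidia-smi CSV output into a list of dicts keyed by field name."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:
            rows.append({name: _to_number(value.strip())
                         for name, value in zip(FIELDS, record)})
    return rows


def query_gpus() -> list:
    """Run nvidia-smi once and return parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(build_command(), capture_output=True,
                            text=True, check=True)
    return parse_csv(result.stdout)


# Made-up sample output for two GPUs, used here instead of a live query.
SAMPLE = "0, 87, 54, 61234, 612.33\n1, 12, 3, 1024, [N/A]\n"
rows = parse_csv(SAMPLE)
print(rows[0]["utilization.gpu"], rows[1]["power.draw"])
```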

DCGM: Datacenter-Scale Monitoring

The Data Center GPU Manager (DCGM) extends monitoring capabilities to fleet scale while providing more detailed metrics than nvidia-smi. Its architecture, with a central daemon managing data collection and client libraries for access, enables efficient monitoring of hundreds of GPUs with minimal overhead.

DCGM’s field-based metric system provides over 100 distinct metrics, each identified by a unique field ID. For LLM inference, critical fields include DCGM_FI_PROF_SM_ACTIVE (1002) for SM utilization, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (1004) for Tensor Core activity, and DCGM_FI_PROF_DRAM_ACTIVE (1005) for memory interface utilization.
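These fields can be sampled from the command line with dcgmi dmon, which takes a comma-separated list of field IDs. The sketch below builds that command from the field names listed above; the helper names are illustrative, and the leaning() comparison is a crude heuristic assumed for this example rather than anything DCGM provides.

```python
# Profiling field IDs quoted in the text.
PROF_FIELDS = {
    "DCGM_FI_PROF_SM_ACTIVE": 1002,
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE": 1004,
    "DCGM_FI_PROF_DRAM_ACTIVE": 1005,
}


def dmon_command(field_names, delay_ms: int = 1000) -> list:
    """Build a `dcgmi dmon` command that samples the given fields every delay_ms."""
    ids = ",".join(str(PROF_FIELDS[name]) for name in field_names)
    return ["dcgmi", "dmon", "-e", ids, "-d", str(delay_ms)]


def leaning(sample: dict) -> str:
    """Crude heuristic (an assumption): compare DRAM activity to Tensor Core activity."""
    dram = sample["DCGM_FI_PROF_DRAM_ACTIVE"]
    tensor = sample["DCGM_FI_PROF_PIPE_TENSOR_ACTIVE"]
    return "memory-leaning" if dram > tensor else "compute-leaning"


print(" ".join(dmon_command(PROF_FIELDS)))

# Hypothetical readings (ratios between 0 and 1) during token generation.
print(leaning({"DCGM_FI_PROF_SM_ACTIVE": 0.62,
               "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE": 0.08,
               "DCGM_FI_PROF_DRAM_ACTIVE": 0.71}))
```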

The profiling metrics available through DCGM provide insights impossible to obtain through nvidia-smi. These include instruction-level throughput, cache hit rates, and detailed memory access patterns. The ability to correlate these metrics across multiple GPUs reveals system-level bottlenecks that individual GPU monitoring might miss.

Nsight Systems: Application-Level Profiling

Nsight Systems provides a timeline view of application execution, revealing the interplay between CPU and GPU operations. For LLM inference, this exposes critical inefficiencies like CPU-GPU synchronization bottlenecks, unnecessary memory transfers, and kernel launch overhead.

The tool’s ability to trace CUDA API calls, kernel executions, and memory transfers simultaneously creates a complete picture of application behavior. Custom NVTX markers can annotate different phases of inference (tokenization, prefill, generation), making performance analysis more intuitive.
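One way to add such markers from Python is through PyTorch's torch.cuda.nvtx.range_push and range_pop. The sketch below wraps them in a context manager and falls back to a no-op when PyTorch or CUDA is unavailable; the phase() helper and the stand-in inference steps are hypothetical and exist only to show where the annotations would go.

```python
from contextlib import contextmanager

try:
    import torch
    # NVTX ranges are only usable in a CUDA-enabled PyTorch build.
    _NVTX = torch.cuda.nvtx if torch.cuda.is_available() else None
except ImportError:
    _NVTX = None


@contextmanager
def phase(name: str):
    """Annotate a block as a named NVTX range (no-op without CUDA)."""
    if _NVTX is not None:
        _NVTX.range_push(name)
    try:
        yield
    finally:
        if _NVTX is not None:
            _NVTX.range_pop()


def run_inference(prompt: str, max_new_tokens: int = 3) -> int:
    """Stand-in pipeline; real code would call a tokenizer and a model."""
    with phase("tokenization"):
        tokens = prompt.split()
    with phase("prefill"):
        context = list(tokens)
    with phase("generation"):
        for step in range(max_new_tokens):
            with phase(f"decode_step_{step}"):
                context.append("<tok>")
    return len(context)


print(run_inference("the dog chased the cat"))
```

When the script is run under Nsight Systems with NVTX tracing enabled, each named range should appear as a labeled span on the timeline alongside the CUDA kernels it encloses.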

The overhead of Nsight Systems (typically 5-20 percent) makes it unsuitable for production monitoring but invaluable for development optimization. The visual timeline immediately reveals problems like serialized operations that could run concurrently or gaps between kernels indicating scheduling inefficiencies.

Optimization Techniques and Their Metric Signatures

Modern LLM inference optimization employs sophisticated techniques that produce distinctive patterns in GPU metrics. Understanding these signatures enables rapid diagnosis and systematic improvement.

Flash Attention: Transforming Memory Access Patterns

Flash Attention revolutionizes attention computation by keeping intermediate results in shared memory rather than writing to HBM. This fundamental change produces distinctive metric signatures that confirm proper implementation.

When Flash Attention is working correctly, HBM bandwidth utilization during attention computation drops by 50-80 percent while SM utilization increases. L1/shared memory throughput increases dramatically, often exceeding 10 TB/s aggregate across all SMs. The MFU during attention phases can improve by 1.5-2x, though this improvement is most pronounced for longer sequences where memory bandwidth typically dominates.

The H100’s larger shared memory (228 KB per SM) enables larger tile sizes than previous generations, reducing the number of passes required. Combined with the TMA’s ability to asynchronously load the next tile while computing the current one, this can achieve near-perfect overlap of memory and computation.
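To make the tiling idea concrete, here is a small pure-Python sketch of the online-softmax recurrence that lets attention be computed one key/value tile at a time without materializing the full score vector. It handles a single query, runs on the CPU, and is only an illustration of the arithmetic, not the Flash Attention kernel itself; the function names and the `tile` parameter are mine.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: compute all scores, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Process keys/values tile by tile, keeping only a running max, sum, and output."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [_dot(q, k) * scale for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The tiled version touches each key and value once and never stores more than one tile of scores, which is the property that lets the real kernel keep intermediates in shared memory.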

Continuous Batching: Dynamic Resource Utilization

Continuous batching replaces static batches with dynamic scheduling, allowing requests of different lengths to process together. This technique produces characteristic saw-tooth patterns in GPU utilization metrics as batches naturally grow and shrink.

Effective continuous batching maintains average GPU utilization above 70 percent while keeping variance below 20 percent. The queue depth typically runs at 1.5-2x the optimal batch size, providing a buffer for arrival rate variations. Memory fragmentation should remain below 5 percent, indicating efficient memory management.

The impact on MFU is substantial—typically improving average MFU by 20-40 percent by maintaining consistent GPU saturation. The technique is particularly effective for services with variable request rates, where static batching would either waste resources or introduce unnecessary latency.
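The scheduling difference can be seen in a toy simulation. This is a deliberately simplified sketch: each request is just a number of decode steps, every step costs the same, and prefill is ignored. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), max_batch):
        steps += max(lengths[start:start + max_batch])
    return steps


def continuous_batching_steps(lengths, max_batch):
    """Finished requests free their slot immediately for the next queued request."""
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        active = [remaining - 1 for remaining in active]       # one decode step
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


lengths = [2, 8, 3, 1]
print(static_batching_steps(lengths, max_batch=2))      # 11
print(continuous_batching_steps(lengths, max_batch=2))  # 8
```

In this example the short requests no longer wait behind the 8-step request, so the same work finishes in 8 steps instead of 11.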

PagedAttention: Memory Efficiency Revolution

PagedAttention applies virtual memory concepts to KV-cache management, storing attention caches in non-contiguous blocks. This produces distinctive memory utilization patterns that confirm proper operation.

Memory utilization with PagedAttention exceeds 95 percent compared to around 60 percent for naive allocation. Block utilization metrics should show over 90 percent of allocated blocks actively used. The technique enables 2-4x larger effective batch sizes with the same memory, directly improving throughput.

The metric signatures include steady memory allocation rates (rather than large chunks), consistent block recycling patterns, and high cache hit rates for shared prefixes. When combined with continuous batching, PagedAttention enables near-optimal memory utilization while maintaining low latency.
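A toy block table shows where the utilization gain comes from. The sketch below is an assumption-laden illustration, not vLLM's implementation: the class name `PagedKVCache`, the free-list policy, and the utilization metric are all hypothetical, and no actual tensors are stored.

```python
class PagedKVCache:
    """Tracks which fixed-size physical blocks each sequence owns."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            # Current block is full (or this is the first token): take a new block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def slot_utilization(self):
        """Fraction of allocated token slots that actually hold a token."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        used = sum(self.seq_lens.values())
        return used / allocated if allocated else 1.0
```

Because a sequence only ever wastes the unused tail of its last block, waste is bounded by one block per sequence rather than by the maximum context length reserved up front.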

Quantization: Precision-Performance Tradeoffs

Quantization techniques produce clear changes in metric patterns that indicate their effectiveness. FP16 to INT8 quantization typically doubles Tensor Core throughput while halving memory bandwidth requirements. The H100’s FP8 support can achieve similar improvements with minimal accuracy loss.

Successful quantization shows achievable Tensor Core throughput rising in proportion to the precision reduction (2x for FP16→INT8). Memory traffic per token falls by the same factor, often relieving memory bottlenecks. MFU improvements vary but typically range from 1.4-1.9x for compute-bound phases.

The key metric to watch is the balance between compute and memory utilization. Quantization can shift a memory-bound workload to compute-bound, fundamentally changing optimization strategies. This shift appears as increased SM efficiency and decreased memory controller activity.
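The memory-bound to compute-bound shift can be estimated with a roofline-style calculation. The sketch below assumes a decode step in which every weight is read once and used for one multiply-add per sequence in the batch, and it ignores KV-cache and activation traffic. The peak figures are approximate H100 SXM datasheet values that I am supplying as assumptions, and `decode_bound` is a hypothetical helper.

```python
H100_BF16_FLOPS = 989e12   # assumed dense BF16 peak, FLOP/s
H100_HBM_BW = 3.35e12      # assumed HBM bandwidth, bytes/s


def decode_bound(batch_size, weight_bytes_per_param,
                 peak_flops=H100_BF16_FLOPS, mem_bandwidth=H100_HBM_BW):
    """Return which roof limits a weight-dominated decode step."""
    intensity = 2.0 * batch_size / weight_bytes_per_param   # FLOPs per byte read
    ridge = peak_flops / mem_bandwidth                       # FLOPs per byte at the ridge
    bound = "compute" if intensity >= ridge else "memory"
    return bound, intensity, ridge


print(decode_bound(batch_size=200, weight_bytes_per_param=2))  # FP16 weights
print(decode_bound(batch_size=200, weight_bytes_per_param=1))  # 8-bit weights, BF16 math
```

Under these assumptions a batch of 200 sits below the ridge point with FP16 weights and above it with 8-bit weights, which is the kind of regime change the paragraph above describes.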

Speculative Decoding: Trading Compute for Latency

Speculative decoding uses a smaller “draft” model to predict multiple tokens, then validates them with the full model. This produces unique metric patterns: burst compute activity during speculation followed by validation phases.

Effective speculative decoding shows acceptance rates above 60 percent, meaning most speculated tokens are correct. The compute utilization pattern shows characteristic dual-phase behavior: low utilization during drafting, high during validation. Overall MFU might decrease, but per-token latency improves by 2-3x when properly tuned.

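The memory access patterns reveal the technique’s efficiency. The draft model is small, so its weight reads should be cheap and its hottest tensors should show high L2 hit rates; note that the full weights of most draft models exceed the H100’s 50 MB L2 cache, so complete L2 residency is not a realistic target. The validation phase should show coalesced memory access as multiple tokens validate simultaneously.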
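To connect acceptance rate to the latency gain, the sketch below uses the expected-tokens formula from the speculative decoding paper by Leviathan et al., which assumes each drafted token is accepted independently with probability `alpha`. The function name and the choice of `gamma` (draft tokens per step) are illustrative.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target-model pass: (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)   # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


for alpha in (0.4, 0.6, 0.8):
    print(alpha, round(expected_tokens_per_step(alpha, gamma=4), 2))
```

The actual speedup is lower than this number because drafting is not free, so it is best read as an upper bound on tokens per validation pass.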

Production Deployment Best Practices

Transitioning from optimization in development to production deployment requires systematic approaches to monitoring, alerting, and continuous improvement.

Establishing Baseline Metrics

Before optimization, establish comprehensive baselines for your specific models and hardware. These baselines should include MFU for both prefill and generation phases, memory bandwidth utilization across different batch sizes, latency percentiles (p50, p95, p99) for various sequence lengths, and power consumption under sustained load.

Baseline establishment should span at least one week of production traffic to capture variations. Daily patterns, weekend differences, and special events all impact metric distributions. Understanding normal variation prevents false alerts and helps identify genuine problems.

The baseline must differentiate between model architectures. A 7B parameter model baseline differs substantially from a 70B model baseline, even on identical hardware. Separate baselines for different operation modes (batch inference, streaming, interactive) prevent inappropriate comparisons.
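A latency baseline of this kind can be computed with the standard library. This is a minimal sketch: `latency_baseline` is a hypothetical helper, and the choice of the "inclusive" interpolation method is mine.

```python
import statistics


def latency_baseline(samples_ms):
    """Return p50/p95/p99 for a list of latency samples in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    # cuts[0] is the 1st percentile, so the k-th percentile is cuts[k - 1].
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


baseline = latency_baseline(list(range(1, 101)))
print(baseline)
```

In practice you would compute one such dictionary per model, sequence-length bucket, and operation mode, so that comparisons stay like-for-like.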

Implementing Effective Alerting

Alert fatigue destroys operational effectiveness, so alerts must be both actionable and important. Critical alerts should trigger only for service-impacting conditions: MFU dropping below 50 percent of baseline for sustained periods, memory utilization exceeding 95 percent with allocation failures, or thermal throttling reducing clock speeds.

Warning-level alerts identify degradation before it impacts service: MFU variance exceeding 20 percent over five-minute windows, queue depths growing beyond 2x normal, or power consumption approaching thermal design limits. These alerts enable proactive intervention.

Informational monitoring tracks optimization opportunities without generating alerts: batch size efficiency below target, quantization candidates based on compute patterns, or scheduling inefficiencies revealed by utilization gaps. Regular review of these metrics drives continuous improvement.
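The critical and warning tiers above can be expressed as a small classifier. This sketch makes several assumptions that the text does not specify: "MFU variance" is interpreted as the standard deviation divided by the mean over the window, the "sustained" requirement is approximated by averaging the window, and the power and informational checks are omitted. The function name and parameters are hypothetical.

```python
import statistics


def classify_gpu_health(mfu_window, baseline_mfu, mem_util, alloc_failures,
                        throttled, queue_depth, normal_queue_depth):
    """Map one five-minute window of metrics to 'critical', 'warning', or 'ok'."""
    mean_mfu = statistics.fmean(mfu_window)
    if (mean_mfu < 0.5 * baseline_mfu
            or (mem_util > 0.95 and alloc_failures > 0)
            or throttled):
        return "critical"
    spread = statistics.pstdev(mfu_window) / mean_mfu if mean_mfu else 0.0
    if spread > 0.20 or queue_depth > 2 * normal_queue_depth:
        return "warning"
    return "ok"
```

Checking the critical conditions first means a window never downgrades a service-impacting condition to a warning.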

Continuous Optimization Workflows

Production systems require continuous optimization as models, traffic patterns, and requirements evolve. Establish weekly metric reviews comparing current performance to baselines and identifying degradation or improvement opportunities.

A/B testing frameworks should include metric collection for both control and experiment groups. Beyond functional metrics like accuracy, collect detailed performance metrics to understand the full impact of changes. A model change that improves accuracy but degrades MFU by 30 percent might not be worth deploying.

Capacity planning must account for metric trends. If MFU gradually degrades as model complexity increases, infrastructure requirements grow super-linearly. Understanding these relationships enables accurate forecasting and budget planning.

Multi-Tenant Optimization Strategies

Production systems rarely serve single models in isolation. Multi-tenant scheduling must balance resource utilization with quality of service, creating complex optimization challenges.

GPU sharing strategies depend on workload characteristics. Time-slicing works well for similar models with predictable resource requirements. Multi-Instance GPU (MIG) provides hardware isolation but reduces flexibility. Spatial sharing requires careful memory management to prevent interference.

Metric collection in multi-tenant environments requires attribution to specific tenants. Per-model MFU tracking reveals which models efficiently use resources. Memory attribution prevents one model from starving others. Power consumption tracking enables accurate cost allocation.

The scheduling algorithm must consider both immediate and future resource availability. Greedy scheduling might achieve high instantaneous utilization but create future bottlenecks. Predictive scheduling based on historical patterns improves overall system efficiency.

Future Directions and Emerging Patterns

The landscape of LLM inference optimization continues evolving rapidly. Understanding emerging patterns helps prepare for future developments.

Algorithmic Innovations

Attention mechanism improvements continue emerging. Techniques like Linear Attention and Performer reduce complexity from O(n²) to O(n), fundamentally changing the computational requirements. While these haven’t yet matched traditional attention’s quality, rapid progress suggests breakthrough potential.

Mixture of Experts (MoE) architectures enable larger models without proportional compute increases. By activating only relevant experts for each token, MoE models achieve effective parameter counts far exceeding dense models while maintaining manageable computational requirements. The metric patterns for MoE models differ substantially, requiring new optimization approaches.

Retrieval-augmented generation (RAG) shifts computation from parameter storage to dynamic retrieval. This architectural change produces different bottlenecks—network I/O and database access rather than GPU memory bandwidth. Understanding these patterns becomes crucial as RAG adoption increases.

Software Framework Evolution

The competition between inference frameworks drives rapid innovation. vLLM’s PagedAttention, TensorRT-LLM’s kernel fusion, and DeepSpeed’s pipeline parallelism each offer unique advantages. Framework selection significantly impacts achievable metrics—the same model might achieve 30 percent MFU with one framework and 45 percent with another.

Automatic optimization techniques reduce the expertise required for high performance. Compilers that automatically select optimal kernel implementations, batch sizes, and parallelism strategies democratize optimization. However, understanding underlying metrics remains crucial for pushing beyond automatic optimization limits.

The convergence of training and inference frameworks simplifies deployment but introduces complexity. Frameworks must now optimize for both phases, with different requirements and bottlenecks. This convergence produces new metric patterns that require careful interpretation.

Conclusion: The Path to Excellence

Mastering GPU monitoring for LLM inference requires deep understanding of hardware architecture, comprehensive metric collection, and systematic optimization approaches. The H100’s revolutionary architecture provides unprecedented capability, but realizing its potential demands expertise in interpreting complex metric relationships and applying appropriate optimization techniques.

The journey from basic GPU utilization monitoring to sophisticated MFU optimization transforms both system performance and economics. Organizations that master these techniques achieve 2-10x performance improvements, directly impacting service quality and operational costs.

Remember that MFU is not just a metric—it’s a philosophy of efficiency that permeates every aspect of LLM deployment. By understanding the intricate dance between compute and memory, between hardware capability and algorithmic requirements, we can build inference systems that deliver breakthrough performance at sustainable costs.

The future of LLM inference belongs to those who can see beyond surface-level metrics to understand the deep patterns of GPU behavior. Armed with the knowledge in this guide, you’re equipped to join the ranks of teams achieving world-class inference performance. The difference between amateur and professional deployment isn’t just knowledge—it’s the systematic application of that knowledge to continuously improve and optimize.

References

NVIDIA H100 Architecture: NVIDIA. (2022). NVIDIA H100 Tensor Core GPU Architecture: The Engine of the World’s AI Infrastructure. NVIDIA Whitepaper.
Attention Is All You Need: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.
FlashAttention-2: Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691.
CUDA MODE (Flash Attention 2): Mills, C. (2023). CUDA MODE - Lecture 12 - Flash Attention 2. christianjmills.com.
PagedAttention (vLLM): Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180.
Roofline Model: Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4), 65-76.
Speculative Decoding: Leviathan, Y., Kalman, M., & Matias, Y. (2022). Fast Inference from Transformers via Speculative Decoding. arXiv preprint arXiv:2211.17192.
Transformer Inference Arithmetic: Chen, C. (2022). Transformer Inference Arithmetic. kipp.ly.
The Transformer Inference Guide: Baseten. (2023). The Full Guide to Transformer Model Inference. Baseten Blog.
LLM Inference Performance Engineering: Databricks. (2023). LLM Inference Performance Engineering: Best Practices. Databricks Blog.
Scaling Deep Learning on GPUs: The JAX Authors. (2024). Scaling Deep Learning. jax-ml.github.io.

As models grow larger and demands increase, the importance of efficient inference will only intensify. The techniques and understanding developed today will compound, creating sustainable competitive advantages for organizations that invest in deep technical excellence. The path to that excellence begins with understanding your hardware, measuring what matters, and relentlessly optimizing based on data-driven insights.

Beyond Prefix Caching: How LMCache Turns KV Cache into Composable LEGO Blocks

Sat, 09 Aug 2025 21:36:22 +0800

Imagine if every time you wanted to build something with LEGOs, you had to start from scratch—even when building similar structures. That’s essentially how we’ve been managing KV caches in production LLMs. Until now.

The 328GB Elephant in the Room

Here’s what nobody tells you about serving long-context LLMs: that impressive 128K context window your model supports? It’s basically unusable in production. Not because of compute limitations, but because of a memory crisis hiding in plain sight.

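Let me show you the brutal math for a 70B-class model like Llama 3 70B, computed in FP16 with a separate K/V pair for each of its 64 attention heads (the grouped-query attention that Llama 3 70B actually ships with uses 8 KV heads, which shrinks these figures by 8×):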

Context Length	KV Cache Size	% of Model Weights	Reality Check
8K tokens	21 GB	15%	Fits on one GPU
32K tokens	84 GB	60%	Exceeds H100 capacity
128K tokens	328 GB	234%	Needs 4+ H100s (!!)

That’s right: the KV cache for a single 128K-context request requires more memory than the entire model weights, about 2.3 times as much. This is why most production deployments silently cap contexts at 8-16K tokens, leaving those impressive context capabilities as nothing more than marketing numbers.
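The table's figures can be approximated with the standard per-token KV cache formula. The sketch below is my reconstruction, not the author's calculation: it assumes 80 layers, a head dimension of 128, FP16 values, and decimal gigabytes. With 64 KV heads (one per attention head) it lands close to the table; with the 8 KV heads that Llama 3 70B's grouped-query attention uses, the same formula gives figures 8× smaller.

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=64, head_dim=128, bytes_per_value=2):
    """Keys and values (factor of 2) for every layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


for tokens in (8_000, 32_000, 128_000):
    full = kv_cache_bytes(tokens) / 1e9
    grouped = kv_cache_bytes(tokens, kv_heads=8) / 1e9
    print(f"{tokens:>7} tokens: {full:6.1f} GB with 64 KV heads, "
          f"{grouped:5.1f} GB with 8 KV heads")
```

The cache still grows linearly with context length and with the number of concurrent requests in either case, so the capacity pressure described here remains once you multiply by batch size.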

The Wasteful Status Quo

Traditional serving engines treat KV caches like disposable napkins: use once, throw away. Every time you:

Continue a conversation → Recompute the entire chat history
Process a common document in RAG → Recompute from scratch
Use the same system prompt → Recompute yet again

It’s like demolishing your LEGO castle every time you want to add a tower. Wasteful? Absolutely. Necessary? Not anymore.

Enter LMCache: a system that fundamentally reimagines KV caches not as temporary computational byproducts, but as reusable, composable knowledge blocks—like LEGOs for your model’s attention memory.

[Diagram 1: Traditional vs LMCache approach]

The Core Insight: Knowledge Should Be Reusable

LMCache operates on a simple but powerful principle:

“Prefill each text only once.”

Think of it like a Content Delivery Network (CDN), but instead of caching static website assets, you’re caching computed attention patterns. Just as Netflix doesn’t re-encode the same movie for every viewer, why should we recompute the same document’s KV cache for every user?

This isn’t just clever engineering—it’s a fundamental shift in how we think about LLM memory. And it’s made possible by three breakthrough innovations that each solve a seemingly impossible problem.

Innovation #1: CacheGen – Compression That Beats Physics

The Problem: Moving 328GB of KV cache from GPU to CPU should be impossible without killing performance. The PCIe bus delivers 64 GB/s while GPU memory delivers 3,000 GB/s—a crushing 47× bottleneck.

The Solution: CacheGen doesn’t try to beat the bandwidth limit—it sidesteps it entirely with purpose-built compression that understands the unique structure of KV cache data.

Here’s the clever part: KV cache tensors aren’t random data. They have patterns:

Layers have personalities: Some transformer layers are robust to compression, others are sensitive. CacheGen profiles each layer and applies custom quantization—aggressive where it can be, gentle where it must be.
Local correlation is high: Adjacent values often differ by small amounts. Delta encoding stores just the differences, slashing storage needs.
GPUs are parallel beasts: Instead of decompressing on the CPU (creating a new bottleneck), custom CUDA kernels decompress directly on the GPU using massive parallelism.

The result? That impossible 328GB transfer becomes a manageable 76GB—a 4.3× reduction that makes PCIe viable without sacrificing inference speed.
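The delta-encoding idea can be illustrated in a few lines. This is a toy sketch of delta encoding plus uniform quantization on a list of floats, not CacheGen's actual codec; the function names and the fixed `step` are my own choices. The encoder tracks the decoder's reconstruction so that quantization error does not accumulate along the sequence.

```python
def delta_quantize(values, step):
    """Encode each value as a quantized difference from the previous reconstruction."""
    codes = []
    reconstructed = 0.0
    for value in values:
        code = round((value - reconstructed) / step)
        codes.append(code)
        reconstructed += code * step   # mirror exactly what the decoder will compute
    return codes


def delta_dequantize(codes, step):
    """Rebuild the values by accumulating the quantized differences."""
    values = []
    reconstructed = 0.0
    for code in codes:
        reconstructed += code * step
        values.append(reconstructed)
    return values


original = [0.50, 0.52, 0.51, 0.55, 0.54]
codes = delta_quantize(original, step=0.01)
restored = delta_dequantize(codes, step=0.01)
print(codes)
print(restored)
```

When neighbouring values are close, every code after the first is a small integer, which is what makes the stream cheap to store and transfer.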

[Diagram 3: CacheGen pipeline]

Innovation #2: CacheBlend – The LEGO Magic

The Problem: Traditional caching is rigid. It only works for exact prefixes—like having LEGO blocks that only stick together in one specific order.

Consider a typical RAG prompt:

[System Prompt] + [Retrieved Doc A] + [Retrieved Doc B] + [User Query]

Even if you’ve cached Doc A and Doc B separately, you can’t reuse them. Why? Because attention is positional—Doc A’s KV values depend on everything before it. When it follows a system prompt instead of appearing first, its entire attention pattern changes. Naive concatenation produces garbage.

The Solution: CacheBlend makes KV caches truly composable—like LEGO blocks that intelligently adapt to their neighbors.

The key insight: when you move a cached chunk to a new position, most tokens (∼90%) barely change. Only a small subset—the “high-deviation tokens”—need updating. CacheBlend:

Identifies the 10% that matter: Uses attention analysis to find tokens whose KV values would change most
Surgically updates just those tokens: Recomputes only the high-deviation subset
Blends the updates: Fuses new values with the original cache

This transforms rigid, prefix-only caching into flexible, LEGO-like composition. Same cached blocks, infinite arrangements.
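The three steps above can be sketched as a selective-recompute routine. Everything here is a simplified stand-in: how CacheBlend actually scores deviation is not described in this post, so `deviation` is a hypothetical per-token score supplied by the caller, and `recompute` is a hypothetical callback that returns the fresh KV entry for one token.

```python
def blend_cache(cached_kv, deviation, recompute, ratio=0.10):
    """Recompute only the highest-deviation tokens and keep the rest from cache."""
    count = max(1, round(len(cached_kv) * ratio))
    # Step 1: pick the tokens whose KV values are expected to change most.
    ranked = sorted(range(len(cached_kv)), key=lambda i: deviation[i], reverse=True)
    selected = sorted(ranked[:count])
    # Steps 2 and 3: recompute those tokens and fuse them into a copy of the cache.
    blended = list(cached_kv)
    for index in selected:
        blended[index] = recompute(index)
    return blended, selected


cached = [f"old{i}" for i in range(10)]
scores = [0.1, 0.0, 0.9, 0.2, 0.1, 0.0, 0.3, 0.1, 0.0, 0.2]
blended, updated = blend_cache(cached, scores, recompute=lambda i: f"new{i}")
print(updated, blended)
```

With ten tokens and a 10 percent budget, only the single highest-scoring token is recomputed and the other nine entries are reused untouched.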

[Diagram 5: CacheBlend LEGO composition]

Innovation #3: Hierarchical Memory – The Full Stack

The Problem: Even with compression, you can’t fit everything in GPU memory. You need a bigger house for your LEGOs.

The Solution: LMCache implements a complete memory hierarchy, treating GPU, CPU, and SSD as a unified pool—just like modern CPUs treat L1, L2, L3 caches and RAM.

But here’s what makes it brilliant:

Asynchronous everything: Saving to slower tiers never blocks inference. Your GPU keeps generating while caches migrate in the background.
Predictive prefetching: LMCache learns access patterns and preloads caches from SSD to RAM before they’re needed, hiding the latency.
Distributed sharing: Through Redis or LMCache’s server, multiple GPUs share a global cache pool. One GPU’s computation becomes everyone’s asset.

It’s like having a smart assistant who knows which LEGO sets you’ll need next and quietly moves them from the basement to your desk before you ask.
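A toy version of the tiering logic is sketched below. It is synchronous and uses least-recently-used eviction, both of which are my simplifying assumptions; the asynchronous migration, prefetching, and distributed sharing described above are not modeled, and the class and tier names are hypothetical.

```python
from collections import OrderedDict


class TieredCache:
    """Fastest tier first; entries evicted from one tier are demoted to the next."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), e.g. [("gpu", 2), ("cpu", 8)]
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def put(self, key, value, level=0):
        if level >= len(self.tiers):
            return  # fell off the slowest tier, so the entry is dropped
        _, capacity, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)
        while len(store) > capacity:
            old_key, old_value = store.popitem(last=False)   # least recently used
            self.put(old_key, old_value, level + 1)

    def get(self, key):
        """Return (tier_name, value) and promote the entry to the fastest tier."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self.put(key, value, 0)
                return name, value
        return None, None

    def tier_of(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None
```

Promotion on access keeps hot prefixes near the GPU, while cold entries drift toward the SSD tier and are eventually dropped.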

Real-World Impact: From Painful to Practical

The numbers speak for themselves:

Scenario	Without LMCache	With LMCache	Speedup	User Experience
25K token conversation	28 seconds	3.7 seconds	7.7×	☠️ → 😊
RAG with 4 documents	13 seconds	3.6 seconds	3.6×	😔 → 😊

These aren’t incremental improvements—they’re the difference between “users abandon your product” and “users love your product.”

Playing Nice with Others: The PagedAttention Synergy

A common question: doesn’t vLLM’s PagedAttention already solve memory problems?

Not quite. They’re complementary pieces of the same puzzle:

PagedAttention: Solves fragmentation within GPU memory (like defragging your hard drive)
LMCache: Extends total memory across tiers (like adding more hard drives)

Together, they form a complete memory management stack: PagedAttention ensures efficient packing, while LMCache extends capacity far beyond GPU memory.

The Bigger Picture: A New Era of LLM Infrastructure

We’re witnessing a fundamental shift in what limits LLM deployment:

Era	Bottleneck	Solution
2020-2022	Raw compute	Better GPUs, optimized kernels
2022-2023	Memory fragmentation	PagedAttention
2024+	Total memory capacity	Hierarchical caching (LMCache)

The future isn’t just about making models bigger—it’s about making them remember intelligently. LMCache represents a paradigm shift: from treating KV cache as disposable waste to managing it as valuable, reusable knowledge.

What This Means for You

If you’re running production LLMs, LMCache changes the game:

Those 128K context windows become actually usable – not just marketing specs
Multi-turn conversations become affordable – no more recomputing entire histories
RAG at scale becomes practical – cache once, reuse everywhere
GPU costs drop dramatically – same hardware, up to 7× faster responses on cache-friendly workloads

The best part? LMCache integrates seamlessly with vLLM. It’s not a replacement—it’s an upgrade.

The LEGO Future

LMCache shows us what modern LLM serving should look like: modular, composable, and intelligent. Just as LEGO blocks revolutionized construction toys by making everything reusable and composable, LMCache is doing the same for LLM memory.

We’re moving from a world where every inference request starts from scratch to one where computed knowledge accumulates, persists, and compounds. It’s not just an optimization—it’s an architectural revolution.

The question isn’t whether you need hierarchical KV caching. It’s whether you can afford to keep throwing away 90% of your GPU’s work. In a world where every millisecond and every GB matters, the answer is clear.

Welcome to the era of composable AI memory. Time to start building.

References

CacheBlend: Yao, J., Li, H., Liu, Y., Ray, S., Cheng, Y., Zhang, Q., Du, K., Lu, S., & Jiang, J. (2024). CACHEBLEND: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. arXiv preprint arXiv:2405.16444.
CacheGen: Liu, Y., Li, H., Du, K., Yao, J., Cheng, Y., Huang, Y., Lu, S., Maire, M., Hoffmann, H., Holtzman, A., & Jiang, J. (2023). CacheGen: Fast Context Loading for Language Model Applications. arXiv preprint arXiv:2310.07240.
PagedAttention (vLLM): Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180.
LMCache Project: The official GitHub repository for the LMCache system, including implementations of CacheGen and CacheBlend.
vLLM Project: The official GitHub repository for the vLLM serving engine, which LMCache integrates with.

About Me

Sat, 09 Aug 2025 17:45:34 +0800

Hello! I'm Jawad, an Engineer/Builder/Leader with over 15 years of experience crafting and scaling AI and machine learning solutions across the ride-hailing, finance, and enterprise sectors. My passion lies in building high-performing technical teams and driving real-world business impact through AI innovation.

Currently, at Singapore’s Home Team Science and Technology Agency (HTX), I lead the AI Platform team, where we focus on designing and implementing scalable AI inference infrastructure. My work involves advanced optimization techniques to deploy large-scale language model (LLM) services, enabling enterprise applications to integrate powerful predictive AI capabilities.

My Journey

My career has been a journey through the evolving landscape of data and AI. As a founding engineer at Cleric, I architected an LLM-powered SRE automation platform from the ground up. At Gojek, I led the mobility data division, overseeing critical systems like ride-matching, dynamic pricing, and logistics for one of Southeast Asia’s largest platforms. It was there I spearheaded initiatives that improved unit economics by 15% and built the multi-objective optimization engine that boosted completed bookings by up to 10%.

Core Expertise

AI & Machine Learning: LLM-based RAG systems, LLMOps, multi-agent architectures, deep learning, NLP, and recommendation systems.
Data Engineering & Infrastructure: Building scalable ML platforms using Airflow, Spark, Kafka, Kubernetes, and event-driven architectures.
Data Science & Analytics: Causal inference, A/B testing, statistical analysis, and multi-objective optimization.

Teaching & Education

I’m passionate about teaching data science and currently serve as the lead instructor for the “Advanced Professional Certificate in Data Science and AI” program at Nanyang Technological University, Singapore. This comprehensive program equips professionals with practical skills in data science and artificial intelligence, preparing them for real-world challenges in the field. Learn more about the course.

Let’s Connect

I’m always open to discussing new ideas in AI, MLOps, and scalable system design. Feel free to connect with me on LinkedIn or Twitter.

Selected Talks & Publications

Ray, Arka, Mohamed Jawad Askar Ali, et al. “Phoenix-VL 1.5 Medium Technical Report.” arXiv preprint (2026). A 123B-parameter multimodal foundation model adapted to regional languages and the Singapore context.
LLM Inference from Jupyter to Production - Data Innovation Summit APAC - 2025
Designing Agentic Systems: Lessons learned from the Trenches - 2024
Scaling Ride-Hailing with Machine Learning on MLflow - SPARK AI Data Summit, San Francisco 2019
Goh, Yang Miang, and Mohamed Jawad Askar Ali. “A hybrid simulation approach for integrating safety behaviour into construction planning: An earthmoving case study.” Accident Analysis & Prevention (2015).
