<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>MdJawad</title><link>https://www.mdjawad.com/</link><description>Recent content on MdJawad</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 25 May 2026 13:56:51 +0000</lastBuildDate><atom:link href="https://www.mdjawad.com/index.xml" rel="self" type="application/rss+xml"/><item><title>Rotary Positional Encoding: Why Position Is a Rotation</title><link>https://www.mdjawad.com/posts/rotary-positional-encoding/</link><pubDate>Sat, 23 May 2026 12:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/rotary-positional-encoding/</guid><description>An intuitive, visual guide to Rotary Positional Encoding. Why spinning the query and key vectors beats stamping a position number onto them, why a dot product only ever feels the angle between two vectors, and why that hands you relative position for free. The starting point for understanding how LLMs stretch to long context.</description><content:encoded><![CDATA[<h2 id="a-trick-hiding-in-plain-sight">A trick hiding in plain sight</h2>
<p>In 2021 a small idea slipped into the transformer with almost no fanfare. A paper called <a href="https://arxiv.org/abs/2104.09864">RoFormer</a> proposed encoding a token&rsquo;s position not by adding something to it, but by rotating it. The idea, Rotary Positional Encoding (RoPE), spread quickly. Within a few years it had become the default in most major language models: GPT-NeoX, PaLM, LLaMA and its descendants, Mistral, Qwen, DeepSeek, Gemma. If you have used a modern LLM, you have almost certainly used RoPE.</p>
<p>It is also the quiet reason your model can sometimes read a whole book. Almost any conversation about long context, whether that means 128K windows, million-token prompts, or the needle-in-a-haystack test, eventually runs into RoPE. RoPE is the thing you have to stretch to make long context work, and the thing that breaks when you stretch it wrong.</p>
<p>This is where to start. Before we can talk about how models reach for longer and longer context, the subject of the posts that follow this one, we need to actually understand the small, elegant trick at the heart of it. By the end of this post, you should be able to look at this formula</p>
$$\big(R(m\theta)\,q\big)^{\!\top}\big(R(n\theta)\,k\big) \;=\; q^{\top} R\!\big((n-m)\theta\big)\, k$$<p>and find it obvious.</p>
<p>We will keep one example in hand the whole way: the sentence <strong>&ldquo;the dog chased the cat,&rdquo;</strong> and in particular how the word <strong>chased</strong> should relate to the word <strong>dog</strong>.</p>
<h2 id="the-problem-attention-sees-an-unordered-bag">The problem: attention sees an unordered bag</h2>
<p>The attention mechanism has an uncomfortable property. By itself, it has no idea what order the words came in.</p>
<p>Attention works by comparing every token with every other token and taking weighted sums, and a sum does not care about order. Shuffle the inputs and every token gets exactly the same output as before, just in shuffled order. To raw attention, <strong>&ldquo;the dog chased the cat&rdquo;</strong> and <strong>&ldquo;the cat chased the dog&rdquo;</strong> are the same bag of vectors. One sentence has the dog doing the chasing and the other has it being chased, and attention cannot tell them apart.</p>
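<p>The order-blindness is easy to check numerically. The sketch below is a stripped-down illustration, not a real attention layer: it assumes each input vector acts as its own query, key and value (no learned projections), and the 2-d embeddings in <code>emb</code> are made-up numbers.</p>

```python
import math

def attention(vectors):
    """Self-attention with no position signal. Assumption: each input serves
    directly as its own query, key and value (no learned projections)."""
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        weights = [math.exp(s) for s in scores]          # softmax numerators
        total = sum(weights)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, vectors)) / total
            for i in range(len(q))
        ])
    return outputs

# Hypothetical 2-d embeddings, chosen only for illustration.
emb = {"the": [0.1, 0.3], "dog": [1.0, 0.2], "chased": [0.4, 0.9], "cat": [0.8, -0.5]}

s1 = ["the", "dog", "chased", "the", "cat"]
s2 = ["the", "cat", "chased", "the", "dog"]
out1 = dict(zip(s1, attention([emb[w] for w in s1])))
out2 = dict(zip(s2, attention([emb[w] for w in s2])))

# Every word receives the same output in both sentences (up to float rounding).
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for word in emb
    for a, b in zip(out1[word], out2[word])
)
print(same)
```

<p>Swapping <strong>dog</strong> and <strong>cat</strong> changes nothing any individual word can see, which is exactly the problem.</p>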
<p>That is a problem, because meaning lives in order. We have to inject position somehow, giving the model a way to know that in our sentence, <strong>chased</strong> sits one step after <strong>dog</strong>, and that this adjacency is part of what the sentence means.</p>
<h2 id="the-obvious-fix-and-its-hidden-flaw">The obvious fix, and its hidden flaw</h2>
<p>The original 2017 Transformer solved this the obvious way. It built a position vector out of sines and cosines and <strong>added</strong> it onto each word&rsquo;s embedding, like stamping a timestamp onto a letter before you mail it. The token at position 0 gets stamp #0, position 1 gets stamp #1, and so on. The network is then left to untangle &ldquo;content plus stamp&rdquo; back into &ldquo;content&rdquo; and &ldquo;where.&rdquo;</p>
<p>It helps to see those sines and cosines directly. Each row below is one dimension oscillating at its own frequency, fast at the top and slow at the bottom. Move the slider, and the column the cursor marks out is the stamp that gets added to the word sitting at that position:</p>

<div class="rope-sinusoidal" id="rope-sinusoidal-9b8756b96f24f81545a6032c69a390d3">
  <style>
    .rope-sinusoidal{
      --bg:#0a0d15; --bg2:#0f1320; --panel:#10141f; --panel2:#141927;
      --ink:#ece8dd; --ink-soft:#c3c0b6; --muted:#7e8499; --faint:#4a4f60;
      --line:rgba(255,255,255,.075); --line-strong:rgba(255,255,255,.14);
      --q:#f6b740; --k:#4fd8cf; --coral:#ff7d6b;
      color:var(--ink); margin:2rem 0; max-width:100%;
    }
    .rope-sinusoidal *{box-sizing:border-box}
    .rope-sinusoidal .panel{background:linear-gradient(180deg,var(--panel),var(--bg2)); border:1px solid var(--line-strong); border-radius:16px; padding:20px; box-shadow:0 24px 60px -36px rgba(0,0,0,.9), inset 0 1px 0 rgba(255,255,255,.03); position:relative; overflow:hidden}
    .rope-sinusoidal .panel::before{content:""; position:absolute; inset:0; pointer-events:none; border-radius:16px; background:linear-gradient(90deg,var(--line) 1px,transparent 1px) 0 0/26px 26px,linear-gradient(180deg,var(--line) 1px,transparent 1px) 0 0/26px 26px; opacity:.30; -webkit-mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%); mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%)}
    .rope-sinusoidal .panel > *{position:relative}
    .rope-sinusoidal .panel-head{display:flex; align-items:baseline; justify-content:space-between; gap:12px; margin-bottom:12px; flex-wrap:wrap}
    .rope-sinusoidal .panel-title{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.22em; text-transform:uppercase; color:var(--muted)}
    .rope-sinusoidal .legend{display:flex; gap:16px; font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px}
    .rope-sinusoidal .legend span{display:inline-flex; align-items:center; gap:7px; color:var(--ink-soft)}
    .rope-sinusoidal .dot{width:9px; height:9px; border-radius:50%}
    .rope-sinusoidal .dot.q{background:var(--q); box-shadow:0 0 10px var(--q)}
    .rope-sinusoidal .dot.k{background:var(--k); box-shadow:0 0 10px var(--k)}
    .rope-sinusoidal canvas{display:block; width:100%; touch-action:none}
    .rope-sinusoidal .controls{display:flex; flex-direction:column; gap:14px; margin-top:16px}
    .rope-sinusoidal .ctrl{display:grid; grid-template-columns:128px 1fr 70px; align-items:center; gap:14px}
    .rope-sinusoidal .ctrl label{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12.5px; color:var(--ink-soft)}
    .rope-sinusoidal .ctrl label .sub{color:var(--muted)}
    .rope-sinusoidal .val{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:14px; text-align:right; color:var(--ink); background:var(--panel2); border:1px solid var(--line); border-radius:7px; padding:5px 9px}
    .rope-sinusoidal input[type=range]{-webkit-appearance:none; appearance:none; height:4px; border-radius:3px; background:linear-gradient(90deg,var(--accent,#5a6178),var(--accent,#5a6178)) no-repeat, rgba(255,255,255,.09); background-size:var(--fill,0%) 100%; cursor:pointer; outline:none}
    .rope-sinusoidal input[type=range].q-range{--accent:var(--q)}
    .rope-sinusoidal input[type=range]::-webkit-slider-thumb{-webkit-appearance:none; width:17px; height:17px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35),0 2px 6px rgba(0,0,0,.5); transition:transform .12s}
    .rope-sinusoidal input[type=range]::-webkit-slider-thumb:hover{transform:scale(1.16)}
    .rope-sinusoidal input[type=range]::-moz-range-thumb{width:15px; height:15px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35)}
    .rope-sinusoidal .btnrow{display:flex; gap:10px; flex-wrap:wrap; margin-top:16px}
    .rope-sinusoidal .btn{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.08em; text-transform:uppercase; color:var(--ink); background:var(--panel2); border:1px solid var(--line-strong); border-radius:9px; padding:9px 15px; cursor:pointer; transition:.15s; display:inline-flex; align-items:center; gap:8px}
    .rope-sinusoidal .btn:hover{border-color:var(--q); color:#fff; background:#1a2031}
    .rope-sinusoidal .btn.active{border-color:var(--q); color:var(--q)}
    .rope-sinusoidal .note{font-family:inherit; font-size:15px; color:var(--muted); font-style:italic; margin:14px 2px 0}
    @media(max-width:720px){.rope-sinusoidal .ctrl{grid-template-columns:92px 1fr 64px; gap:10px}.rope-sinusoidal .legend{display:none}}
  </style>

  <div class="panel">
    <div class="panel-head">
      <div class="panel-title">Sinusoidal positional encoding · sines and cosines stacked</div>
      <div class="legend">
        <span><span class="dot q"></span>positive value</span>
        <span><span class="dot k"></span>negative value</span>
      </div>
    </div>
    <canvas class="cv" height="380"></canvas>
    <div class="controls">
      <div class="ctrl">
        <label>position&nbsp;<span class="sub">m</span></label>
        <input type="range" class="r q-range" min="0" max="48" step="1" value="9">
        <span class="val v">9</span>
      </div>
    </div>
    <div class="btnrow"><button class="btn play">▶ sweep position</button></div>
  </div>
  <p class="note">Warm cells are positive, cool cells negative. The column of cells stacked at the cursor is the position's encoding vector, the "stamp" added to whatever word sits there.</p>

  <script>
  (function(){
    const root = document.getElementById('rope-sinusoidal-9b8756b96f24f81545a6032c69a390d3');
    if(!root) return;
    const TAU = Math.PI*2;
    const C = { q:'#f6b740', k:'#4fd8cf', ink:'#ece8dd', muted:'#7e8499', faint:'#4a4f60', grid:'rgba(120,140,200,.13)' };
    function fit(canvas){
      const dpr=Math.max(1,window.devicePixelRatio||1);
      if(!canvas.dataset.h) canvas.dataset.h=canvas.getAttribute('height');
      const cssH=+canvas.dataset.h, w=canvas.clientWidth;
      canvas.width=Math.round(w*dpr); canvas.height=Math.round(cssH*dpr); canvas.style.height=cssH+'px';
      const ctx=canvas.getContext('2d'); ctx.setTransform(dpr,0,0,dpr,0,0); return {ctx,w,h:cssH};
    }
    function glowDot(ctx,x,y,color,rad=6){ ctx.save(); ctx.shadowColor=color; ctx.shadowBlur=12; ctx.fillStyle=color; ctx.beginPath(); ctx.arc(x,y,rad,0,TAU); ctx.fill(); ctx.restore(); }
    function setFill(el,frac){ el.style.setProperty('--fill',(frac*100)+'%'); }

    const cv=root.querySelector('canvas'), r=root.querySelector('input.r'), v=root.querySelector('.v'), btn=root.querySelector('.play');
    const N=48, LANES=8, periods=[7,14,28,56];
    const labels=['sin θ₀','cos θ₀','sin θ₁','cos θ₁','sin θ₂','cos θ₂','sin θ₃','cos θ₃'];
    function valAt(L,pos){ const w=TAU/periods[Math.floor(L/2)]; return (L%2===0)? Math.sin(pos*w): Math.cos(pos*w); }
    function valColor(val){ const a=Math.min(1,Math.abs(val)); return val>=0 ? `rgba(246,183,64,${0.16+0.84*a})` : `rgba(79,216,207,${0.16+0.84*a})`; }
    let playing=false, t=0;

    function draw(){
      const {ctx,w,h}=fit(cv); ctx.clearRect(0,0,w,h);
      const m=+r.value;
      const LB=58, RS=92, gap=14, padT=14, padB=26;
      const plotX0=LB, plotX1=w-RS, plotW=plotX1-plotX0;
      const areaH=h-padT-padB, lh=areaH/LANES;
      const X=pos=>plotX0+(pos/N)*plotW, cursorX=X(m);
      for(let L=0;L<LANES;L++){
        const yc=padT+lh*(L+0.5), amp=lh*0.34;
        ctx.strokeStyle=C.grid; ctx.lineWidth=1; ctx.beginPath(); ctx.moveTo(plotX0,yc); ctx.lineTo(plotX1,yc); ctx.stroke();
        ctx.beginPath();
        for(let px=0;px<=plotW;px++){ const pos=(px/plotW)*N, val=valAt(L,pos), xx=plotX0+px, yy=yc-val*amp; if(px===0)ctx.moveTo(xx,yy); else ctx.lineTo(xx,yy); }
        ctx.strokeStyle='rgba(195,192,182,.5)'; ctx.lineWidth=1.6; ctx.lineJoin='round'; ctx.stroke();
        const val=valAt(L,m);
        glowDot(ctx, cursorX, yc-val*amp, val>=0?C.q:C.k, 4);
        ctx.fillStyle=C.muted; ctx.font='11px ui-monospace, monospace'; ctx.textAlign='right'; ctx.fillText(labels[L], LB-8, yc+4); ctx.textAlign='left';
        const sx=plotX1+gap, sw=RS-gap-6, sy=padT+lh*L+3, sH=lh-6;
        ctx.fillStyle=valColor(val); ctx.fillRect(sx, sy, sw, sH);
        ctx.strokeStyle='rgba(255,255,255,.08)'; ctx.lineWidth=1; ctx.strokeRect(sx,sy,sw,sH);
      }
      ctx.strokeStyle='rgba(255,255,255,.5)'; ctx.lineWidth=1.5; ctx.setLineDash([4,4]); ctx.beginPath(); ctx.moveTo(cursorX,padT); ctx.lineTo(cursorX,h-padB); ctx.stroke(); ctx.setLineDash([]);
      ctx.fillStyle=C.muted; ctx.font='11px ui-monospace, monospace'; ctx.textAlign='center';
      ctx.fillText('position →', (plotX0+plotX1)/2, h-8);
      ctx.fillText('stamp @ '+m, plotX1+gap+(RS-gap-6)/2, h-8);
      ctx.textAlign='right'; ctx.fillStyle=C.faint; ctx.font='10px ui-monospace, monospace';
      ctx.fillText('fast', LB-8, padT+10); ctx.fillText('slow', LB-8, h-padB-2);
      ctx.textAlign='left';
      v.textContent=m; setFill(r,m/(+r.max));
    }
    function loop(){ if(!playing) return; t=(t+1)%(+r.max+1); r.value=t; draw(); setTimeout(()=>requestAnimationFrame(loop),120); }
    r.addEventListener('input',()=>{playing=false;btn.classList.remove('active');btn.textContent='▶ sweep position';draw();});
    btn.addEventListener('click',()=>{ playing=!playing; btn.classList.toggle('active',playing); btn.textContent=playing?'⏸ pause':'▶ sweep position'; t=+r.value; if(playing) loop(); });
    window.addEventListener('resize',draw); draw();
  })();
  </script>
</div>

<p>Two things stand out. The stamp depends only on the position, never on the word underneath it. And neighbouring positions get very similar stamps, since the waves move smoothly. Hold that picture, because it is what rotation is about to improve on.</p>
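<p>Both observations can be reproduced in a few lines. This is a minimal sketch of the sinusoidal scheme, assuming the base of 10000 from the original Transformer and an illustrative width of <code>d_model=8</code>; the widget above uses its own hand-picked periods, so its numbers will differ.</p>

```python
import math

def stamp(pos, d_model=8, base=10000.0):
    """Sinusoidal position vector: one (sin, cos) pair per frequency, fast to slow.
    Note that no word embedding appears anywhere in this function."""
    vec = []
    for i in range(d_model // 2):
        freq = base ** (-2 * i / d_model)
        vec += [math.sin(pos * freq), math.cos(pos * freq)]
    return vec

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

near = cosine_similarity(stamp(9), stamp(10))   # neighbouring positions
far = cosine_similarity(stamp(9), stamp(30))    # distant positions
print(round(near, 3), round(far, 3))
```

<p>The stamp is a function of <code>pos</code> alone, and the similarity for the neighbouring pair comes out higher than for the distant pair.</p>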
<p>It works, but it carries two flaws that, once you see them, motivate everything RoPE does.</p>
<p><strong>Flaw one: it smears content and position together.</strong> Adding a vector moves the point. A word&rsquo;s embedding encodes its meaning; the stamp we add shifts that vector somewhere new, so the same word at two different positions becomes two genuinely different vectors, with different length and direction. Meaning and position now sit tangled in the same numbers, and the model has to spend capacity pulling them back apart.</p>
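<p>A tiny numeric illustration of the smearing, under loud assumptions: a hypothetical 2-d embedding and a single (sin, cos) stamp per position rather than a full-width encoding.</p>

```python
import math

word = [3.0, 4.0]  # hypothetical 2-d embedding with length 5

def add_stamp(vec, pos):
    """Additive encoding: add a (sin, cos) stamp for this position onto the embedding."""
    return [vec[0] + math.sin(pos), vec[1] + math.cos(pos)]

for pos in (0, 1, 2):
    stamped = add_stamp(word, pos)
    length = math.hypot(*stamped)
    angle = math.degrees(math.atan2(stamped[1], stamped[0]))
    print(pos, round(length, 3), round(angle, 1))
```

<p>The same word comes out with a different length and a different direction at each position, so neither property can be read as &ldquo;pure content&rdquo; any more.</p>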
<p><strong>Flaw two: it is absolute, but attention wants relative.</strong> The stamp records where a token sits counting from the start of the sequence. That is almost never what matters. What matters for <strong>chased</strong> is that <strong>dog</strong> is one token back, not that dog happens to be the second word in this particular sentence. Prepend the word &ldquo;Yesterday,&rdquo; to our sentence and every absolute position shifts by one, yet the relationship between chased and dog has not changed at all. Absolute encodings force the model to learn how to turn &ldquo;position 1 versus position 2&rdquo; into &ldquo;one step apart,&rdquo; and to relearn that for every pair of positions. It is work we should not have to do.</p>
<p>Keep both flaws in mind. RoPE fixes them at once, with a single geometric move.</p>
<h2 id="the-insight-position-is-not-a-number-you-add-its-a-rotation-you-apply">The insight: position is not a number you add, it&rsquo;s a rotation you apply</h2>
<p>There is another way to think about it. Instead of adding a position vector, what if we rotated the token&rsquo;s vector by an angle that grows with its position?</p>
<p>Take the token&rsquo;s vector and chop it into pairs of coordinates. Each pair is just a point on a plane, an arrow from the origin. To encode position $m$, spin that arrow by an angle $m\theta$: position 0 gets no turn, position 1 turns by $\theta$, and position $m$ turns by $m\theta$. In two dimensions this is exactly the rotation matrix from high-school geometry:</p>
$$\begin{bmatrix}x'\\ y'\end{bmatrix}
=\underbrace{\begin{bmatrix}\cos m\theta & -\sin m\theta\\[2pt] \sin m\theta & \;\;\cos m\theta\end{bmatrix}}_{R(m\theta)}
\begin{bmatrix}x\\ y\end{bmatrix}$$<p>Drag the slider below and watch what happens to a single pair as its position climbs:</p>

<div class="rope-rotate-pair" id="rope-rotate-pair-9b8756b96f24f81545a6032c69a390d3">
  <style>
    .rope-rotate-pair{
      --bg:#0a0d15; --bg2:#0f1320; --panel:#10141f; --panel2:#141927;
      --ink:#ece8dd; --ink-soft:#c3c0b6; --muted:#7e8499; --faint:#4a4f60;
      --line:rgba(255,255,255,.075); --line-strong:rgba(255,255,255,.14);
      --q:#f6b740; --k:#4fd8cf; --coral:#ff7d6b;
      color:var(--ink); margin:2rem 0; max-width:100%;
    }
    .rope-rotate-pair *{box-sizing:border-box}
    .rope-rotate-pair .panel{
      background:linear-gradient(180deg,var(--panel),var(--bg2));
      border:1px solid var(--line-strong); border-radius:16px; padding:20px;
      box-shadow:0 24px 60px -36px rgba(0,0,0,.9), inset 0 1px 0 rgba(255,255,255,.03);
      position:relative; overflow:hidden;
    }
    .rope-rotate-pair .panel::before{
      content:""; position:absolute; inset:0; pointer-events:none; border-radius:16px;
      background:
        linear-gradient(90deg,var(--line) 1px,transparent 1px) 0 0/26px 26px,
        linear-gradient(180deg,var(--line) 1px,transparent 1px) 0 0/26px 26px;
      opacity:.30; -webkit-mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%);
      mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%);
    }
    .rope-rotate-pair .panel > *{position:relative}
    .rope-rotate-pair .panel-head{display:flex; align-items:baseline; justify-content:space-between; gap:12px; margin-bottom:12px; flex-wrap:wrap}
    .rope-rotate-pair .panel-title{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.22em; text-transform:uppercase; color:var(--muted)}
    .rope-rotate-pair .legend{display:flex; gap:16px; font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px}
    .rope-rotate-pair .legend span{display:inline-flex; align-items:center; gap:7px; color:var(--ink-soft)}
    .rope-rotate-pair .dot{width:9px; height:9px; border-radius:50%}
    .rope-rotate-pair .dot.q{background:var(--q); box-shadow:0 0 10px var(--q)}
    .rope-rotate-pair canvas{display:block; width:100%; touch-action:none}
    .rope-rotate-pair .controls{display:flex; flex-direction:column; gap:14px; margin-top:16px}
    .rope-rotate-pair .ctrl{display:grid; grid-template-columns:128px 1fr 70px; align-items:center; gap:14px}
    .rope-rotate-pair .ctrl label{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12.5px; color:var(--ink-soft)}
    .rope-rotate-pair .ctrl label .sub{color:var(--muted)}
    .rope-rotate-pair .val{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:14px; text-align:right; color:var(--ink); background:var(--panel2); border:1px solid var(--line); border-radius:7px; padding:5px 9px}
    .rope-rotate-pair input[type=range]{-webkit-appearance:none; appearance:none; height:4px; border-radius:3px; background:linear-gradient(90deg,var(--accent,#5a6178),var(--accent,#5a6178)) no-repeat, rgba(255,255,255,.09); background-size:var(--fill,12%) 100%; cursor:pointer; outline:none}
    .rope-rotate-pair input[type=range].q-range{--accent:var(--q)}
    .rope-rotate-pair input[type=range]::-webkit-slider-thumb{-webkit-appearance:none; width:17px; height:17px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35),0 2px 6px rgba(0,0,0,.5); transition:transform .12s}
    .rope-rotate-pair input[type=range]::-webkit-slider-thumb:hover{transform:scale(1.16)}
    .rope-rotate-pair input[type=range]::-moz-range-thumb{width:15px; height:15px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35)}
    .rope-rotate-pair .readouts{display:flex; gap:10px; flex-wrap:wrap; margin-top:16px}
    .rope-rotate-pair .chip{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; background:var(--panel2); border:1px solid var(--line); border-radius:10px; padding:9px 13px; min-width:84px}
    .rope-rotate-pair .chip .lab{font-size:10.5px; letter-spacing:.14em; text-transform:uppercase; color:var(--muted)}
    .rope-rotate-pair .chip .num{font-size:19px; color:var(--ink); margin-top:2px; font-weight:500}
    .rope-rotate-pair .chip.q .num{color:var(--q)}
    .rope-rotate-pair .btnrow{display:flex; gap:10px; flex-wrap:wrap; margin-top:16px}
    .rope-rotate-pair .btn{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.08em; text-transform:uppercase; color:var(--ink); background:var(--panel2); border:1px solid var(--line-strong); border-radius:9px; padding:9px 15px; cursor:pointer; transition:.15s; display:inline-flex; align-items:center; gap:8px}
    .rope-rotate-pair .btn:hover{border-color:var(--q); color:#fff; background:#1a2031}
    .rope-rotate-pair .btn.active{border-color:var(--q); color:var(--q)}
    .rope-rotate-pair .note{font-family:inherit; font-size:15px; color:var(--muted); font-style:italic; margin:14px 2px 0}
    @media(max-width:720px){.rope-rotate-pair .ctrl{grid-template-columns:92px 1fr 64px; gap:10px}.rope-rotate-pair .legend{display:none}}
  </style>

  <div class="panel">
    <div class="panel-head">
      <div class="panel-title">Rotating one pair of dimensions</div>
      <div class="legend"><span><span class="dot q"></span>the pair, rotated to position&nbsp;m</span></div>
    </div>
    <canvas class="cv" height="340"></canvas>
    <div class="controls">
      <div class="ctrl">
        <label>position&nbsp;<span class="sub">m</span></label>
        <input type="range" class="r q-range" min="0" max="24" step="1" value="3">
        <span class="val v">3</span>
      </div>
    </div>
    <div class="readouts">
      <div class="chip q"><div class="lab">angle&nbsp;m·θ</div><div class="num oa">—</div></div>
      <div class="chip"><div class="lab">x′ (cos)</div><div class="num ox">—</div></div>
      <div class="chip"><div class="lab">y′ (sin)</div><div class="num oy">—</div></div>
      <div class="chip"><div class="lab">length</div><div class="num oL">1.00</div></div>
    </div>
    <div class="btnrow"><button class="btn play">▶ animate position</button></div>
  </div>
  <p class="note">Watch the length chip: it stays pinned at 1.00 no matter how far you spin. That is rotation's superpower: position changes the direction, never the magnitude.</p>

  <script>
  (function(){
    const root = document.getElementById('rope-rotate-pair-9b8756b96f24f81545a6032c69a390d3');
    if(!root) return;
    const TAU = Math.PI*2;
    const C = { q:'#f6b740', ink:'#ece8dd', muted:'#7e8499', faint:'#4a4f60',
                grid:'rgba(120,140,200,.13)', line:'rgba(255,255,255,.10)' };
    function fit(canvas){
      const dpr = Math.max(1, window.devicePixelRatio||1);
      if(!canvas.dataset.h) canvas.dataset.h = canvas.getAttribute('height');
      const cssH = +canvas.dataset.h, w = canvas.clientWidth;
      canvas.width = Math.round(w*dpr); canvas.height = Math.round(cssH*dpr);
      canvas.style.height = cssH+'px';
      const ctx = canvas.getContext('2d'); ctx.setTransform(dpr,0,0,dpr,0,0);
      return {ctx, w, h:cssH};
    }
    function arrow(ctx, ox,oy, x,y, color, lw=3, head=10){
      ctx.save(); ctx.strokeStyle=color; ctx.fillStyle=color; ctx.lineWidth=lw; ctx.lineCap='round';
      ctx.beginPath(); ctx.moveTo(ox,oy); ctx.lineTo(x,y); ctx.stroke();
      const a=Math.atan2(y-oy,x-ox);
      ctx.beginPath(); ctx.moveTo(x,y);
      ctx.lineTo(x-head*Math.cos(a-0.4), y-head*Math.sin(a-0.4));
      ctx.lineTo(x-head*Math.cos(a+0.4), y-head*Math.sin(a+0.4));
      ctx.closePath(); ctx.fill(); ctx.restore();
    }
    function ring(ctx, cx,cy,r,color,lw=1){ ctx.save(); ctx.strokeStyle=color; ctx.lineWidth=lw; ctx.beginPath(); ctx.arc(cx,cy,r,0,TAU); ctx.stroke(); ctx.restore(); }
    function axes(ctx,cx,cy,r){ ctx.save(); ctx.strokeStyle=C.grid; ctx.lineWidth=1; ctx.beginPath(); ctx.moveTo(cx-r-14,cy); ctx.lineTo(cx+r+14,cy); ctx.moveTo(cx,cy-r-14); ctx.lineTo(cx,cy+r+14); ctx.stroke(); ctx.restore(); }
    function glowDot(ctx,x,y,color,rad=6){ ctx.save(); ctx.shadowColor=color; ctx.shadowBlur=14; ctx.fillStyle=color; ctx.beginPath(); ctx.arc(x,y,rad,0,TAU); ctx.fill(); ctx.restore(); }
    function setFill(el,frac){ el.style.setProperty('--fill',(frac*100)+'%'); }

    const cv=root.querySelector('canvas'), r=root.querySelector('input.r'), v=root.querySelector('.v'), btn=root.querySelector('.play');
    const oa=root.querySelector('.oa'), ox=root.querySelector('.ox'), oy=root.querySelector('.oy'), oL=root.querySelector('.oL');
    const TH = TAU/16;            
    let playing=false, t=0;

    function draw(){
      const {ctx,w,h}=fit(cv); ctx.clearRect(0,0,w,h);
      const cx=w/2, cy=h/2, R=Math.min(w,h)*0.36;
      const m=+r.value, ang=m*TH;
      axes(ctx,cx,cy,R); ring(ctx,cx,cy,R,C.line,1.2);
      for(let i=0;i<16;i++){ const a=i*TH, c1=cx+Math.cos(a)*R, s1=cy-Math.sin(a)*R; ctx.fillStyle=C.faint; ctx.beginPath(); ctx.arc(c1,s1,1.6,0,TAU); ctx.fill(); }
      for(let p=0;p<=m;p++){ const a=p*TH, x=cx+Math.cos(a)*R*0.86, y=cy-Math.sin(a)*R*0.86; ctx.fillStyle='rgba(246,183,64,'+(0.10+0.04*(p/Math.max(1,m)))+')'; ctx.beginPath(); ctx.arc(x,y,3,0,TAU); ctx.fill(); }
      arrow(ctx,cx,cy, cx+R*0.86, cy, 'rgba(255,255,255,.16)', 2, 9);
      ctx.save(); ctx.strokeStyle='rgba(246,183,64,.5)'; ctx.lineWidth=2; ctx.beginPath(); ctx.arc(cx,cy,R*0.40,0,-ang,ang>0); ctx.stroke(); ctx.restore();
      const ex=cx+Math.cos(ang)*R*0.86, ey=cy-Math.sin(ang)*R*0.86;
      arrow(ctx,cx,cy, ex,ey, C.q, 3.4, 12); glowDot(ctx,ex,ey,C.q,5);
      ctx.fillStyle=C.q; ctx.font='13px ui-monospace, monospace'; ctx.fillText('mθ', cx+Math.cos(ang/2)*R*0.5, cy-Math.sin(ang/2)*R*0.5);
      v.textContent=m; setFill(r, m/(+r.max));
      oa.textContent=(ang).toFixed(2)+' rad'; ox.textContent=Math.cos(ang).toFixed(2); oy.textContent=Math.sin(ang).toFixed(2); oL.textContent='1.00';
    }
    function loop(){ if(!playing) return; t=(t+1)%(+r.max+1); r.value=t; draw(); setTimeout(()=>requestAnimationFrame(loop),140); }
    r.addEventListener('input',()=>{playing=false;btn.classList.remove('active');btn.textContent='▶ animate position';draw();});
    btn.addEventListener('click',()=>{ playing=!playing; btn.classList.toggle('active',playing); btn.textContent=playing?'⏸ pause':'▶ animate position'; t=+r.value; if(playing) loop(); });
    window.addEventListener('resize',draw); draw();
  })();
  </script>
</div>

<p>Notice the one thing that never changes: the arrow&rsquo;s length. A rotation changes direction, never magnitude, and that is what makes it a rotation. That single property already takes care of the first flaw. A word&rsquo;s meaning lives in the length and shape of its vector, and spinning it leaves all of that alone. Position ends up written purely into the angle, kept separate from content.</p>
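<p>The same check in code, as a minimal sketch: <code>rotate</code> applies the 2-d rotation matrix to one pair, and the step angle of one sixteenth of a turn is the value the widget above happens to use, not anything RoPE prescribes.</p>

```python
import math

def rotate(pair, angle):
    """Apply the 2-d rotation matrix R(angle) to an (x, y) pair."""
    x, y = pair
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

theta = 2 * math.pi / 16   # step angle per position (illustrative choice)
pair = (1.0, 0.0)          # hypothetical unit-length pair

for m in (0, 3, 8, 24):
    x, y = rotate(pair, m * theta)
    print(m, round(x, 2), round(y, 2), round(math.hypot(x, y), 2))
```

<p>The first two printed columns after <code>m</code> change with position; the last one, the length, does not.</p>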
<p>Length preservation is only the warm-up, though. The real payoff is what rotation does to the dot product, and to see it we need one fact about how attention scores tokens.</p>
<h2 id="why-a-dot-product-only-feels-the-angle-between">Why a dot product only feels the angle between</h2>
<p>Attention decides how much <strong>chased</strong> should attend to <strong>dog</strong> by taking the dot product of chased&rsquo;s query vector with dog&rsquo;s key vector. A big dot product means strong attention.</p>
<p>The dot product has a clean geometric meaning. For any two vectors $q$ and $k$,</p>
$$q^{\top} k \;=\; \|q\|\,\|k\|\,\cos\phi,$$<p>where $\phi$ is the angle between them. The lengths $\|q\|$ and $\|k\|$ are fixed properties of the two words, a kind of loudness. Everything about how the two words relate is carried by that single $\cos\phi$ term. Vectors pointing the same way score high ($\cos 0 = 1$), perpendicular vectors score zero, and opposed vectors score negative.</p>
<p>So an attention score is really a question about the angle between two arrows. That is the sentence to hold onto. If position is an angle, and attention only responds to angles, then position and attention are speaking the same language.</p>
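<p>To see that the two ways of computing a dot product really agree, here is a short check with made-up 2-d vectors standing in for chased&rsquo;s query and dog&rsquo;s key.</p>

```python
import math

q = (2.0, 1.0)   # hypothetical query for "chased"
k = (1.0, 3.0)   # hypothetical key for "dog"

# Algebraic form: multiply matching coordinates and add.
dot = q[0] * k[0] + q[1] * k[1]

# Geometric form: lengths times the cosine of the angle between the arrows.
phi = math.atan2(k[1], k[0]) - math.atan2(q[1], q[0])
geometric = math.hypot(*q) * math.hypot(*k) * math.cos(phi)

print(dot, round(geometric, 10))
```

<p>Both numbers match, so anything that changes only the angle between the two arrows changes the score in a completely predictable way.</p>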
<h2 id="the-magic-relative-position-for-free">The magic: relative position, for free</h2>
<p>Now put the two halves together. Rotate chased&rsquo;s query by its position $m\theta$ and dog&rsquo;s key by its position $n\theta$. What is the angle between them afterwards?</p>
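<p>Rotating one vector by $m\theta$ and comparing it against another rotated by $n\theta$ composes into a single rotation by the difference. Two facts about rotation matrices make this precise: transposing a rotation undoes it, $R(a)^{\top} = R(-a)$, and rotations compose by adding their angles, $R(a)R(b) = R(a+b)$. Together they give the identity $R(a)^{\top}R(b) = R(b-a)$, and the rotated dot product collapses:</p>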
$$\big(R(m\theta)\,q\big)^{\!\top}\big(R(n\theta)\,k\big)
\;=\; q^{\top}\,R(m\theta)^{\top}R(n\theta)\,k
\;=\; q^{\top}\,R\!\big((n-m)\theta\big)\,k.$$<p>Look at the right-hand side. The two absolute positions $m$ and $n$ are gone. Only their difference $n-m$ remains. The attention score between chased and dog depends only on how far apart they are, not on where the pair happens to sit in the sentence.</p>
<p>That takes care of the second flaw. We never asked the model to learn how to convert absolute positions into relative ones; the geometry does it on its own. Relative position falls out of the structure for nothing, with zero parameters spent.</p>
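<p>The collapse can be verified numerically. This is a sketch for a single pair of dimensions with made-up vectors; the positions assume zero-based counting (dog at 1, chased at 2) and treat &ldquo;Yesterday,&rdquo; as a single token.</p>

```python
import math

def rotate(pair, angle):
    """Apply the 2-d rotation matrix R(angle) to an (x, y) pair."""
    x, y = pair
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def score(q, k, m, n, theta=2 * math.pi / 16):
    """Dot product of q rotated to position m with k rotated to position n."""
    rq, rk = rotate(q, m * theta), rotate(k, n * theta)
    return rq[0] * rk[0] + rq[1] * rk[1]

q = (2.0, 1.0)   # hypothetical query for "chased"
k = (1.0, 3.0)   # hypothetical key for "dog"

original = score(q, k, m=2, n=1)   # "the dog chased the cat"
shifted = score(q, k, m=3, n=2)    # "Yesterday, the dog chased the cat"
farther = score(q, k, m=2, n=0)    # a different offset, for contrast
print(round(original, 6), round(shifted, 6), round(farther, 6))
```

<p>The first two scores agree because the offset $n-m$ is the same in both; the third differs because the offset changed.</p>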
<p>Try it. Move the query and key positions on their own, then press <strong>&ldquo;shift both +1,&rdquo;</strong> which is the same as prepending &ldquo;Yesterday,&rdquo; to the sentence. Both arrows spin, but the offset between them, and the score, stay put:</p>

<div class="rope-relative-position" id="rope-relative-position-9b8756b96f24f81545a6032c69a390d3">
  <style>
    .rope-relative-position{
      --bg:#0a0d15; --bg2:#0f1320; --panel:#10141f; --panel2:#141927;
      --ink:#ece8dd; --ink-soft:#c3c0b6; --muted:#7e8499; --faint:#4a4f60;
      --line:rgba(255,255,255,.075); --line-strong:rgba(255,255,255,.14);
      --q:#f6b740; --k:#4fd8cf; --coral:#ff7d6b;
      color:var(--ink); margin:2rem 0; max-width:100%;
    }
    .rope-relative-position *{box-sizing:border-box}
    .rope-relative-position .panel{background:linear-gradient(180deg,var(--panel),var(--bg2)); border:1px solid var(--line-strong); border-radius:16px; padding:20px; box-shadow:0 24px 60px -36px rgba(0,0,0,.9), inset 0 1px 0 rgba(255,255,255,.03); position:relative; overflow:hidden}
    .rope-relative-position .panel::before{content:""; position:absolute; inset:0; pointer-events:none; border-radius:16px; background:linear-gradient(90deg,var(--line) 1px,transparent 1px) 0 0/26px 26px,linear-gradient(180deg,var(--line) 1px,transparent 1px) 0 0/26px 26px; opacity:.30; -webkit-mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%); mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%)}
    .rope-relative-position .panel > *{position:relative}
    .rope-relative-position .panel-head{display:flex; align-items:baseline; justify-content:space-between; gap:12px; margin-bottom:12px; flex-wrap:wrap}
    .rope-relative-position .panel-title{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.22em; text-transform:uppercase; color:var(--muted)}
    .rope-relative-position .legend{display:flex; gap:16px; font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px}
    .rope-relative-position .legend span{display:inline-flex; align-items:center; gap:7px; color:var(--ink-soft)}
    .rope-relative-position .dot{width:9px; height:9px; border-radius:50%}
    .rope-relative-position .dot.q{background:var(--q); box-shadow:0 0 10px var(--q)}
    .rope-relative-position .dot.k{background:var(--k); box-shadow:0 0 10px var(--k)}
    .rope-relative-position canvas{display:block; width:100%; touch-action:none}
    .rope-relative-position .controls{display:flex; flex-direction:column; gap:14px; margin-top:16px}
    .rope-relative-position .ctrl{display:grid; grid-template-columns:128px 1fr 70px; align-items:center; gap:14px}
    .rope-relative-position .ctrl label{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12.5px; color:var(--ink-soft)}
    .rope-relative-position .ctrl label .sub{color:var(--muted)}
    .rope-relative-position .q-txt{color:var(--q)} .rope-relative-position .k-txt{color:var(--k)}
    .rope-relative-position .val{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:14px; text-align:right; color:var(--ink); background:var(--panel2); border:1px solid var(--line); border-radius:7px; padding:5px 9px}
    .rope-relative-position input[type=range]{-webkit-appearance:none; appearance:none; height:4px; border-radius:3px; background:linear-gradient(90deg,var(--accent,#5a6178),var(--accent,#5a6178)) no-repeat, rgba(255,255,255,.09); background-size:var(--fill,0%) 100%; cursor:pointer; outline:none}
    .rope-relative-position input[type=range].q-range{--accent:var(--q)}
    .rope-relative-position input[type=range].k-range{--accent:var(--k)}
    .rope-relative-position input[type=range]::-webkit-slider-thumb{-webkit-appearance:none; width:17px; height:17px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35),0 2px 6px rgba(0,0,0,.5); transition:transform .12s}
    .rope-relative-position input[type=range]::-webkit-slider-thumb:hover{transform:scale(1.16)}
    .rope-relative-position input[type=range]::-moz-range-thumb{width:15px; height:15px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35)}
    .rope-relative-position .readouts{display:flex; gap:10px; flex-wrap:wrap; margin-top:16px}
    .rope-relative-position .chip{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; background:var(--panel2); border:1px solid var(--line); border-radius:10px; padding:9px 13px; min-width:84px}
    .rope-relative-position .chip .lab{font-size:10.5px; letter-spacing:.14em; text-transform:uppercase; color:var(--muted)}
    .rope-relative-position .chip .num{font-size:19px; color:var(--ink); margin-top:2px; font-weight:500}
    .rope-relative-position .chip.big{flex:1; min-width:170px}
    .rope-relative-position .chip.hi{border-color:var(--coral)} .rope-relative-position .chip.hi .num{color:var(--coral)}
    .rope-relative-position .btnrow{display:flex; gap:10px; flex-wrap:wrap; margin-top:16px}
    .rope-relative-position .btn{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.08em; text-transform:uppercase; color:var(--ink); background:var(--panel2); border:1px solid var(--line-strong); border-radius:9px; padding:9px 15px; cursor:pointer; transition:.15s; display:inline-flex; align-items:center; gap:8px}
    .rope-relative-position .btn:hover{border-color:var(--q); color:#fff; background:#1a2031}
    .rope-relative-position .takeaway{display:flex; gap:16px; align-items:flex-start; margin:22px 0 0; background:linear-gradient(180deg,rgba(246,183,64,.05),transparent); border:1px solid var(--line-strong); border-radius:14px; padding:18px 20px}
    .rope-relative-position .takeaway .ico{flex:none; width:34px; height:34px; border-radius:9px; display:grid; place-items:center; background:rgba(246,183,64,.12); color:var(--q); font-family:ui-monospace,monospace; font-weight:700}
    .rope-relative-position .takeaway p{margin:0; font-size:16px; color:var(--ink-soft); font-family:inherit}
    .rope-relative-position .takeaway b{color:var(--q); font-weight:600}
    @media(max-width:720px){.rope-relative-position .ctrl{grid-template-columns:92px 1fr 64px; gap:10px}.rope-relative-position .legend{display:none}}
  </style>

  <div class="panel">
    <div class="panel-head">
      <div class="panel-title">Query · Key dot product vs. position</div>
      <div class="legend">
        <span><span class="dot q"></span>query @ m</span>
        <span><span class="dot k"></span>key @ n</span>
      </div>
    </div>
    <canvas class="cv" height="360"></canvas>
    <div class="controls">
      <div class="ctrl">
        <label class="q-txt">query pos&nbsp;<span class="sub">m</span></label>
        <input type="range" class="rm q-range" min="0" max="20" step="1" value="2">
        <span class="val vm">2</span>
      </div>
      <div class="ctrl">
        <label class="k-txt">key pos&nbsp;<span class="sub">n</span></label>
        <input type="range" class="rn k-range" min="0" max="20" step="1" value="6">
        <span class="val vn">6</span>
      </div>
    </div>
    <div class="readouts">
      <div class="chip"><div class="lab">relative m−n</div><div class="num orel">—</div></div>
      <div class="chip"><div class="lab">angle between</div><div class="num oang">—</div></div>
      <div class="chip big hi"><div class="lab">attention score&nbsp; q · k = cos(Δ)</div><div class="num odot">—</div></div>
    </div>
    <div class="btnrow">
      <button class="btn shift">↻ shift both +1</button>
      <button class="btn reset">reset</button>
    </div>
  </div>
  <div class="takeaway">
    <div class="ico">!</div>
    <p>The score chip is glued to <b>m − n</b>. Shifting both positions sends the arrows spinning, yet the number never flinches. That is <b>relative position, for free</b>, with nothing learned.</p>
  </div>

  <script>
  (function(){
    const root = document.getElementById('rope-relative-position-9b8756b96f24f81545a6032c69a390d3');
    if(!root) return;
    const TAU = Math.PI*2;
    const C = { q:'#f6b740', k:'#4fd8cf', muted:'#7e8499', coral:'#ff7d6b', line:'rgba(255,255,255,.10)', grid:'rgba(120,140,200,.13)' };
    function fit(canvas){
      const dpr=Math.max(1,window.devicePixelRatio||1);
      if(!canvas.dataset.h) canvas.dataset.h=canvas.getAttribute('height');
      const cssH=+canvas.dataset.h, w=canvas.clientWidth;
      canvas.width=Math.round(w*dpr); canvas.height=Math.round(cssH*dpr); canvas.style.height=cssH+'px';
      const ctx=canvas.getContext('2d'); ctx.setTransform(dpr,0,0,dpr,0,0); return {ctx,w,h:cssH};
    }
    function arrow(ctx, ox,oy, x,y, color, lw=3, head=10){
      ctx.save(); ctx.strokeStyle=color; ctx.fillStyle=color; ctx.lineWidth=lw; ctx.lineCap='round';
      ctx.beginPath(); ctx.moveTo(ox,oy); ctx.lineTo(x,y); ctx.stroke();
      const a=Math.atan2(y-oy,x-ox);
      ctx.beginPath(); ctx.moveTo(x,y); ctx.lineTo(x-head*Math.cos(a-0.4), y-head*Math.sin(a-0.4)); ctx.lineTo(x-head*Math.cos(a+0.4), y-head*Math.sin(a+0.4)); ctx.closePath(); ctx.fill(); ctx.restore();
    }
    function ring(ctx, cx,cy,r,color,lw=1){ ctx.save(); ctx.strokeStyle=color; ctx.lineWidth=lw; ctx.beginPath(); ctx.arc(cx,cy,r,0,TAU); ctx.stroke(); ctx.restore(); }
    function axes(ctx,cx,cy,r){ ctx.save(); ctx.strokeStyle=C.grid; ctx.lineWidth=1; ctx.beginPath(); ctx.moveTo(cx-r-14,cy); ctx.lineTo(cx+r+14,cy); ctx.moveTo(cx,cy-r-14); ctx.lineTo(cx,cy+r+14); ctx.stroke(); ctx.restore(); }
    function glowDot(ctx,x,y,color,rad=6){ ctx.save(); ctx.shadowColor=color; ctx.shadowBlur=14; ctx.fillStyle=color; ctx.beginPath(); ctx.arc(x,y,rad,0,TAU); ctx.fill(); ctx.restore(); }
    function setFill(el,frac){ el.style.setProperty('--fill',(frac*100)+'%'); }

    const cv=root.querySelector('canvas');
    const rm=root.querySelector('input.rm'), rn=root.querySelector('input.rn');
    const vm=root.querySelector('.vm'), vn=root.querySelector('.vn');
    const orel=root.querySelector('.orel'), oang=root.querySelector('.oang'), odot=root.querySelector('.odot');
    const shift=root.querySelector('.shift'), reset=root.querySelector('.reset');
    const TH=0.42, qBase=0.32, kBase=1.15;

    function draw(){
      const {ctx,w,h}=fit(cv); ctx.clearRect(0,0,w,h);
      const cx=w/2, cy=h/2+10, R=Math.min(w,h)*0.40;
      const m=+rm.value, n=+rn.value;
      const aq=qBase+m*TH, ak=kBase+n*TH;
      axes(ctx,cx,cy,R); ring(ctx,cx,cy,R,C.line,1.2);
      let delta=(aq-ak)%TAU; if(delta>Math.PI)delta-=TAU; if(delta<-Math.PI)delta+=TAU;
      ctx.save(); ctx.fillStyle='rgba(255,125,107,.13)'; ctx.beginPath(); ctx.moveTo(cx,cy); ctx.arc(cx,cy,R*0.5, -ak, -aq, delta>0); ctx.closePath(); ctx.fill(); ctx.restore();
      arrow(ctx,cx,cy, cx+Math.cos(qBase)*R*0.9, cy-Math.sin(qBase)*R*0.9,'rgba(246,183,64,.22)',2,8);
      arrow(ctx,cx,cy, cx+Math.cos(kBase)*R*0.9, cy-Math.sin(kBase)*R*0.9,'rgba(79,216,207,.22)',2,8);
      const qx=cx+Math.cos(aq)*R*0.9, qy=cy-Math.sin(aq)*R*0.9;
      const kx=cx+Math.cos(ak)*R*0.9, ky=cy-Math.sin(ak)*R*0.9;
      arrow(ctx,cx,cy, qx,qy, C.q, 3.4, 12); glowDot(ctx,qx,qy,C.q,5);
      arrow(ctx,cx,cy, kx,ky, C.k, 3.4, 12); glowDot(ctx,kx,ky,C.k,5);
      ctx.font='13px ui-monospace, monospace';
      ctx.fillStyle=C.q; ctx.fillText('q @ m', qx+(qx>cx?8:-46), qy+(qy>cy?16:-8));
      ctx.fillStyle=C.k; ctx.fillText('k @ n', kx+(kx>cx?8:-46), ky+(ky>cy?16:-8));
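      const mid=ak+delta/2; // bisector of the shaded wedge; (aq+ak)/2 lands on the opposite side once the unwrapped gap exceeds π
      ctx.fillStyle=C.coral; ctx.fillText('Δ', cx+Math.cos(mid)*R*0.34, cy-Math.sin(mid)*R*0.34);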
      vm.textContent=m; vn.textContent=n; setFill(rm,m/(+rm.max)); setFill(rn,n/(+rn.max));
      orel.textContent=(m-n);
      oang.textContent=(Math.abs(delta)*180/Math.PI).toFixed(1)+'°';
      odot.textContent=Math.cos(delta).toFixed(3);
    }
    [rm,rn].forEach(el=>el.addEventListener('input',draw));
    shift.addEventListener('click',()=>{
      let m=+rm.value, n=+rn.value;
      if(m<+rm.max && n<+rn.max){ rm.value=m+1; rn.value=n+1; }
      else if(m>0 && n>0){ rm.value=m-1; rn.value=n-1; } // at the top of the range, step both back; never clamp only one, which would change m−n
      draw();
    });
    reset.addEventListener('click',()=>{ rm.value=2; rn.value=6; draw(); });
    window.addEventListener('resize',draw); draw();
  })();
  </script>
</div>

<p>This is the payoff that made RoPE win. Shift the whole sentence and chased&rsquo;s relationship to dog survives intact, because that relationship was stored as the angle between the two vectors, and shifting both simply rotates them together. One thing the demo quietly does, though, is hold the two words&rsquo; content fixed so the positional effect stands on its own. In a real attention head the content match is the dominant, learned signal, and rotation only modulates it. The next section puts content back in.</p>
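<p>The same experiment can be run without the canvas. The sketch below is a minimal check, not a production implementation: it reuses the demo's illustrative constants (a rotation speed of 0.42 and base angles of 0.32 and 1.15), and the helper names <code>rotate2d</code> and <code>ropeScore</code> are made up for this example.</p>

```javascript
// Rotate a 2-D vector [x, y] counter-clockwise by `angle` radians.
function rotate2d([x, y], angle) {
  const c = Math.cos(angle), s = Math.sin(angle);
  return [x * c - y * s, x * s + y * c];
}

// Plain 2-D dot product.
function dot(a, b) {
  return a[0] * b[0] + a[1] * b[1];
}

// Score between a query at position m and a key at position n, for one
// pair of dimensions that turns by `theta` radians per position.
function ropeScore(q, k, m, n, theta) {
  return dot(rotate2d(q, m * theta), rotate2d(k, n * theta));
}

// Illustrative values mirroring the demo (unit vectors, not trained weights).
const theta = 0.42;
const q = [Math.cos(0.32), Math.sin(0.32)];
const k = [Math.cos(1.15), Math.sin(1.15)];

// Shift both positions by the same amount: the printed score stays the same.
for (const shift of [0, 1, 5, 100]) {
  console.log(shift, ropeScore(q, k, 2 + shift, 6 + shift, theta).toFixed(6));
}
```

<p>Every row prints the same number, because only the difference between the two positions reaches the dot product. Change one position without the other and the score moves.</p>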
<h2 id="how-rotation-fits-inside-attention">How rotation fits inside attention</h2>
<p>It is easy to leave that demo thinking the rotation is the whole story. It is not. To see why, it helps to look at where RoPE actually sits inside an attention head, and at how much it leaves alone.</p>

<div class="rope-attention-pipeline" id="rope-attention-pipeline-9b8756b96f24f81545a6032c69a390d3">
  <style>
    .rope-attention-pipeline{
      --bg2:#0f1320; --panel:#10141f; --line-strong:rgba(255,255,255,.14);
      --q:#f6b740; --k:#4fd8cf; --v:#a78bfa; --ink:#ece8dd; --muted:#7e8499;
      color:var(--ink); margin:2rem 0; max-width:100%;
    }
    .rope-attention-pipeline .panel{background:linear-gradient(180deg,var(--panel),var(--bg2)); border:1px solid var(--line-strong); border-radius:16px; padding:20px; box-shadow:0 24px 60px -36px rgba(0,0,0,.9), inset 0 1px 0 rgba(255,255,255,.03)}
    .rope-attention-pipeline .panel-title{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.22em; text-transform:uppercase; color:var(--muted); margin-bottom:14px}
    .rope-attention-pipeline svg{display:block; width:100%; height:auto}
    .rope-attention-pipeline svg text{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace}
    .rope-attention-pipeline .box{fill:#141927; stroke:rgba(255,255,255,.16); stroke-width:1.4}
    .rope-attention-pipeline .ttl{fill:var(--ink); font-size:16px; font-weight:600}
    .rope-attention-pipeline .sub{fill:var(--muted); font-size:11px}
    .rope-attention-pipeline .ann{fill:var(--muted); font-size:11.5px; font-style:italic}
    .rope-attention-pipeline .edge{stroke:rgba(195,192,182,.5); stroke-width:1.7; fill:none}
    .rope-attention-pipeline .note{font-family:inherit; font-size:15px; color:var(--muted); font-style:italic; margin:14px 2px 0}
    @media(max-width:720px){.rope-attention-pipeline .panel{padding:14px}}
  </style>

  <div class="panel">
    <div class="panel-title">Where RoPE sits in one attention head</div>
    <svg viewBox="0 0 840 360" role="img" aria-label="Attention pipeline showing RoPE rotating only the query and key">
      <defs>
        <marker id="rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3" markerWidth="9" markerHeight="9" refX="6.5" refY="3" orient="auto">
          <path d="M0,0 L6.5,3 L0,6 Z" fill="rgba(195,192,182,.7)"/>
        </marker>
      </defs>
      

      
      <path class="edge" d="M134,176 C172,176 178,90 206,90"   marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M134,182 L206,182"                  marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M134,188 C172,188 178,274 206,274"  marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M302,90 C330,90 336,120 356,124"    marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M302,182 C330,182 336,186 356,186"  marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M486,124 C512,120 520,90 536,90"    marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M486,186 C512,186 520,182 536,182"  marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M628,90 C648,90 654,120 656,124"    marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M628,182 C648,182 654,140 656,136"  marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M731,168 L731,190"                  marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M731,234 C731,248 718,252 710,260"  marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>
      <path class="edge" d="M302,274 C430,274 480,289 596,289"  marker-end="url(#rope-ap-arrow-9b8756b96f24f81545a6032c69a390d3)"/>

      
      <rect class="box" x="24" y="150" width="110" height="64" rx="11"/>
      <text class="ttl" x="79" y="184" text-anchor="middle">h<tspan baseline-shift="sub" font-size="11">t</tspan></text>
      <text class="sub" x="79" y="202" text-anchor="middle">hidden state</text>

      
      <rect class="box" x="210" y="66" width="92" height="48" rx="9" style="fill:rgba(246,183,64,.12); stroke:#f6b740"/>
      <text class="ttl" x="256" y="89" text-anchor="middle" style="fill:#f6b740">q</text>
      <text class="sub" x="256" y="104" text-anchor="middle">query</text>

      <rect class="box" x="210" y="158" width="92" height="48" rx="9" style="fill:rgba(79,216,207,.12); stroke:#4fd8cf"/>
      <text class="ttl" x="256" y="181" text-anchor="middle" style="fill:#4fd8cf">k</text>
      <text class="sub" x="256" y="196" text-anchor="middle">key</text>

      <rect class="box" x="210" y="250" width="92" height="48" rx="9" style="fill:rgba(167,139,250,.14); stroke:#a78bfa"/>
      <text class="ttl" x="256" y="273" text-anchor="middle" style="fill:#a78bfa">v</text>
      <text class="sub" x="256" y="288" text-anchor="middle">value</text>

      
      <rect x="356" y="96" width="130" height="120" rx="13" style="fill:rgba(246,183,64,.06); stroke:#f6b740; stroke-width:1.6; stroke-dasharray:5 4"/>
      <text x="421" y="146" text-anchor="middle" style="fill:#f6b740; font-size:15px; font-weight:600">&#8635; RoPE</text>
      <text class="sub" x="421" y="166" text-anchor="middle">rotate q and k</text>
      <text class="sub" x="421" y="180" text-anchor="middle">by their position</text>
      <text class="ann" x="421" y="88" text-anchor="middle" style="fill:#f6b740">position enters here</text>

      
      <rect class="box" x="536" y="66" width="92" height="48" rx="9" style="fill:rgba(246,183,64,.12); stroke:#f6b740"/>
      <text class="ttl" x="582" y="89" text-anchor="middle" style="fill:#f6b740">q&#8242;</text>
      <text class="sub" x="582" y="104" text-anchor="middle">rotated</text>

      <rect class="box" x="536" y="158" width="92" height="48" rx="9" style="fill:rgba(79,216,207,.12); stroke:#4fd8cf"/>
      <text class="ttl" x="582" y="181" text-anchor="middle" style="fill:#4fd8cf">k&#8242;</text>
      <text class="sub" x="582" y="196" text-anchor="middle">rotated</text>

      
      <rect class="box" x="656" y="96" width="150" height="72" rx="11"/>
      <text class="ttl" x="731" y="128" text-anchor="middle" style="font-size:15px">q&#8242; &#183; k&#8242; / &#8730;d</text>
      <text class="sub" x="731" y="148" text-anchor="middle">attention score</text>

      
      <rect class="box" x="656" y="190" width="150" height="44" rx="11"/>
      <text class="ttl" x="731" y="217" text-anchor="middle" style="font-size:15px">softmax</text>

      
      <rect class="box" x="596" y="260" width="216" height="54" rx="11"/>
      <text class="ttl" x="704" y="285" text-anchor="middle" style="font-size:14px">&#931; weight &#183; v</text>
      <text class="sub" x="704" y="301" text-anchor="middle">output of the head</text>

      
      <text class="ann" x="256" y="52" text-anchor="middle">content (learned by the projections)</text>
      <text class="ann" x="440" y="250" text-anchor="middle">values are never rotated</text>
      <text class="ann" x="704" y="336" text-anchor="middle">softmax and value mixing: unchanged by RoPE</text>
    </svg>
  </div>
  <p class="note">RoPE rotates only the query and key. The value vector, the softmax, and the weighted sum that follows are exactly as they were. Position is a small insertion into the score, not a new attention mechanism.</p>
</div>

<p>Attention scores a query against a key with a dot product, and before any position is involved that dot product is pure content matching. Think of it as a tiny search engine running in every layer. Each token sends out a query, a search for what it wants (&ldquo;I am a verb, I am looking for my subject&rdquo;). Every other token offers a key, an advertisement for what it is (&ldquo;I am a noun, I could be a subject&rdquo;). The dot product scores how well the advertisement answers the search. All of this is learned, by the projection matrices $W^Q$ and $W^K$, and it is the main event. It is what lets chased know it wants nouns and not commas.</p>
<p>RoPE does not touch any of that. It rotates the query and key after they are built, and because a rotation changes direction but not length, the learned content survives untouched. For a single pair of dimensions the score works out to</p>
$$\text{score} \;\approx\; \underbrace{\|q\|\,\|k\|}_{\text{how strong}}\;\cos\big(\,\underbrace{\alpha}_{\text{content}} \,+\, \underbrace{(m-n)\,\theta}_{\text{position}}\,\big),$$<p>where $\alpha$ is the angle between the two words&rsquo; content directions. Read the cosine as taking in two things at once. The content angle $\alpha$ is small when the words genuinely match and large when they do not. The positional term $(m-n)\theta$ is a fixed turn that depends only on how far apart they are. The score is highest when the content matches and the relative distance is one the head cares about. A perfect content match at an unwanted distance gets pulled down, and a favoured distance with mismatched content still scores low. Both have to agree.</p>
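<p>For one pair of dimensions, the formula can be checked directly. The sketch below assumes $\alpha$ is the signed angle from the key's content direction to the query's (a convention chosen here so the signs line up with $(m-n)\,\theta$), and it leaves out the $1/\sqrt{d}$ scaling. All numbers are hypothetical.</p>

```javascript
// Rotate a 2-D vector [x, y] counter-clockwise by `angle` radians.
function rotate2d([x, y], angle) {
  const c = Math.cos(angle), s = Math.sin(angle);
  return [x * c - y * s, x * s + y * c];
}

// Build a 2-D vector from a length and a direction.
function polar(length, angle) {
  return [length * Math.cos(angle), length * Math.sin(angle)];
}

// Left side: rotate q by m*theta and k by n*theta, then take the dot product.
function rotatedDot(q, k, m, n, theta) {
  const [qx, qy] = rotate2d(q, m * theta);
  const [kx, ky] = rotate2d(k, n * theta);
  return qx * kx + qy * ky;
}

// Right side: |q| |k| cos(alpha + (m - n) * theta).
function closedForm(qLen, kLen, alpha, m, n, theta) {
  return qLen * kLen * Math.cos(alpha + (m - n) * theta);
}

// Hypothetical content: lengths and directions picked only for illustration.
const qLen = 1.5, kLen = 0.8, qDir = 0.9, kDir = 0.2, theta = 0.3;
const alpha = qDir - kDir; // content angle, measured from k to q

for (const [m, n] of [[0, 0], [3, 1], [1, 3], [10, 9]]) {
  const lhs = rotatedDot(polar(qLen, qDir), polar(kLen, kDir), m, n, theta);
  const rhs = closedForm(qLen, kLen, alpha, m, n, theta);
  console.log(`m=${m} n=${n}`, lhs.toFixed(6), rhs.toFixed(6));
}
```

<p>The two columns agree for every pair of positions. The lengths never change, so rotation can only move the score by changing the angle inside the cosine.</p>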
<p>Our running example needs both halves. Content alone tells chased to look at nouns, but dog and cat are equally nouns, so content cannot say which is the subject. Position alone would prefer whatever sits one step back, but it cannot tell a noun from a comma. Put them together and chased attends to dog because dog is both a content match and at the relative offset the head has learned to read as a subject. RoPE did not pick dog. Content narrowed the field to nouns, and rotation broke the tie by distance.</p>
<p>And everything downstream is left alone. The value vectors are never rotated, the softmax is the same softmax, and the weighted sum that produces the head&rsquo;s output is unchanged. RoPE is a small, surgical edit to one quantity, the query-key score, not a new attention mechanism.</p>
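<p>Here is the pipeline from the diagram as a rough sketch for a single query token. It assumes the adjacent-pair layout, where dimensions $(2i, 2i+1)$ rotate together (some implementations pair the two halves of the vector instead), it skips the causal mask for brevity, and the vectors are made-up numbers standing in for the outputs of the learned projections. The function names are hypothetical.</p>

```javascript
// Apply RoPE to a vector of even length d: pair (2i, 2i+1) turns by pos * base^(-2i/d).
function rope(vec, pos, base = 10000) {
  const d = vec.length, out = new Array(d);
  for (let i = 0; i < d / 2; i++) {
    const angle = pos * Math.pow(base, (-2 * i) / d);
    const c = Math.cos(angle), s = Math.sin(angle);
    out[2 * i] = vec[2 * i] * c - vec[2 * i + 1] * s;
    out[2 * i + 1] = vec[2 * i] * s + vec[2 * i + 1] * c;
  }
  return out;
}

// Numerically stable softmax.
function softmax(xs) {
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// One head, one query. Only q and the keys are rotated; the values are not.
function attend(q, queryPos, keys, keyPositions, values) {
  const d = q.length;
  const qRot = rope(q, queryPos);
  const scores = keys.map((k, j) => {
    const kRot = rope(k, keyPositions[j]);
    return qRot.reduce((acc, x, t) => acc + x * kRot[t], 0) / Math.sqrt(d);
  });
  const weights = softmax(scores);
  // Weighted sum of the unrotated value vectors.
  return values[0].map((_, t) => weights.reduce((acc, w, j) => acc + w * values[j][t], 0));
}

// Made-up content vectors (head dimension 4) and 2-D values.
const q = [0.5, -1.0, 0.3, 0.8];
const keys = [[0.1, 0.4, -0.2, 0.9], [1.0, -0.5, 0.3, 0.2], [-0.3, 0.7, 0.6, -0.1]];
const values = [[1, 0], [0, 1], [0.5, 0.5]];

console.log(attend(q, 2, keys, [0, 1, 2], values)); // tokens at positions 0..2
console.log(attend(q, 9, keys, [7, 8, 9], values)); // same tokens, shifted by 7
```

<p>Both calls print the same output vector up to rounding. Position entered only through the scores, and the scores depend only on relative offsets.</p>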
<h2 id="one-frequency-isnt-enough-a-clock-with-many-hands">One frequency isn&rsquo;t enough: a clock with many hands</h2>
<p>A single rotation speed has a catch, because a circle wraps around. If every pair spun at the same rate $\theta$, two offsets a full turn apart (about $2\pi/\theta$ positions) would land on the same angle, and the score could not tell them apart. One spinning hand cannot tell you the time on its own.</p>
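<p>A tiny example makes the wraparound concrete. The speed below is a hypothetical one, chosen so that a full turn takes exactly 16 positions, and the score is the one between two identical unit vectors in a single rotating pair.</p>

```javascript
// Score between two identical unit vectors in one rotating pair,
// separated by `delta` positions, turning `theta` radians per position.
function singlePairScore(delta, theta) {
  return Math.cos(delta * theta);
}

// Hypothetical speed: one full turn every 16 positions.
const theta = (2 * Math.PI) / 16;

// Offsets 1, 17 and 33 differ by whole turns, so they look identical.
for (const delta of [1, 17, 33]) {
  console.log(delta, singlePairScore(delta, theta).toFixed(6));
}
```

<p>All three offsets print the same score. With only this one frequency, a neighbour one token away and a token seventeen positions away are indistinguishable.</p>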
<p>A clock fixes this with several hands at different speeds. The second hand resolves fine detail, the hour hand tracks the long sweep, and together they pin down a single moment. RoPE does the same thing, giving each coordinate pair $i$ its own rotation speed:</p>
$$\theta_i = b^{-2i/d}, \qquad b = 10000, \qquad i = 0, 1, \dots, \tfrac{d}{2}-1.$$<p>The first pairs spin fast, so they resolve fine, local distances like &ldquo;one token apart.&rdquo; The last pairs spin slowly, tracking coarse, long-range position across thousands of tokens. Stack them all and every position gets a unique multi-frequency fingerprint, with no wraparound ambiguity across the range that matters.</p>
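<p>The schedule is short enough to compute by hand. The sketch below follows the formula above; <code>ropeFrequencies</code> and <code>fingerprint</code> are names invented for this example, and the tiny head dimension $d = 8$ is chosen only so the output fits on one line.</p>

```javascript
// theta_i = base^(-2i/d) for i = 0 .. d/2 - 1
function ropeFrequencies(d, base = 10000) {
  if (d % 2 !== 0) throw new Error("d must be even");
  return Array.from({ length: d / 2 }, (_, i) => Math.pow(base, (-2 * i) / d));
}

// A position's fingerprint: the angle of every pair, wrapped into [0, 2π).
function fingerprint(pos, freqs) {
  const TAU = 2 * Math.PI;
  return freqs.map((th) => (((pos * th) % TAU) + TAU) % TAU);
}

// With d = 8 and base 10000 the speeds are approximately [1, 0.1, 0.01, 0.001].
const freqs = ropeFrequencies(8);
console.log(freqs);

// The fast pair has already wrapped by position 7; the slow pairs have barely moved.
for (const pos of [1, 7, 50]) {
  console.log(pos, fingerprint(pos, freqs).map((a) => a.toFixed(3)));
}
```

<p>Reading a row left to right is like reading the clock from the second hand to the hour hand: the first angle changes quickly and wraps, while the last one moves slowly enough to order positions that are far apart.</p>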
<p>Watch the bank of dials below. The leftmost races while the rightmost barely moves.</p>

<div class="rope-frequency-bank" id="rope-frequency-bank-9b8756b96f24f81545a6032c69a390d3">
  <style>
    .rope-frequency-bank{
      --bg:#0a0d15; --bg2:#0f1320; --panel:#10141f; --panel2:#141927;
      --ink:#ece8dd; --ink-soft:#c3c0b6; --muted:#7e8499; --faint:#4a4f60;
      --line:rgba(255,255,255,.075); --line-strong:rgba(255,255,255,.14);
      --q:#f6b740; --k:#4fd8cf; --coral:#ff7d6b;
      color:var(--ink); margin:2rem 0; max-width:100%;
    }
    .rope-frequency-bank *{box-sizing:border-box}
    .rope-frequency-bank .panel{background:linear-gradient(180deg,var(--panel),var(--bg2)); border:1px solid var(--line-strong); border-radius:16px; padding:20px; box-shadow:0 24px 60px -36px rgba(0,0,0,.9), inset 0 1px 0 rgba(255,255,255,.03); position:relative; overflow:hidden}
    .rope-frequency-bank .panel::before{content:""; position:absolute; inset:0; pointer-events:none; border-radius:16px; background:linear-gradient(90deg,var(--line) 1px,transparent 1px) 0 0/26px 26px,linear-gradient(180deg,var(--line) 1px,transparent 1px) 0 0/26px 26px; opacity:.30; -webkit-mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%); mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%)}
    .rope-frequency-bank .panel > *{position:relative}
    .rope-frequency-bank .panel-head{display:flex; align-items:baseline; justify-content:space-between; gap:12px; margin-bottom:12px; flex-wrap:wrap}
    .rope-frequency-bank .panel-title{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.22em; text-transform:uppercase; color:var(--muted)}
    .rope-frequency-bank .legend{display:flex; gap:16px; font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; color:var(--ink-soft)}
    .rope-frequency-bank canvas{display:block; width:100%; touch-action:none}
    .rope-frequency-bank .controls{display:flex; flex-direction:column; gap:14px; margin-top:16px}
    .rope-frequency-bank .ctrl{display:grid; grid-template-columns:128px 1fr 70px; align-items:center; gap:14px}
    .rope-frequency-bank .ctrl label{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12.5px; color:var(--ink-soft)}
    .rope-frequency-bank .ctrl label .sub{color:var(--muted)}
    .rope-frequency-bank .val{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:14px; text-align:right; color:var(--ink); background:var(--panel2); border:1px solid var(--line); border-radius:7px; padding:5px 9px}
    .rope-frequency-bank input[type=range]{-webkit-appearance:none; appearance:none; height:4px; border-radius:3px; background:linear-gradient(90deg,var(--accent,#5a6178),var(--accent,#5a6178)) no-repeat, rgba(255,255,255,.09); background-size:var(--fill,0%) 100%; cursor:pointer; outline:none}
    .rope-frequency-bank input[type=range].q-range{--accent:var(--q)}
    .rope-frequency-bank input[type=range]::-webkit-slider-thumb{-webkit-appearance:none; width:17px; height:17px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35),0 2px 6px rgba(0,0,0,.5); transition:transform .12s}
    .rope-frequency-bank input[type=range]::-webkit-slider-thumb:hover{transform:scale(1.16)}
    .rope-frequency-bank input[type=range]::-moz-range-thumb{width:15px; height:15px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35)}
    .rope-frequency-bank .btnrow{display:flex; gap:10px; flex-wrap:wrap; margin-top:16px}
    .rope-frequency-bank .btn{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.08em; text-transform:uppercase; color:var(--ink); background:var(--panel2); border:1px solid var(--line-strong); border-radius:9px; padding:9px 15px; cursor:pointer; transition:.15s; display:inline-flex; align-items:center; gap:8px}
    .rope-frequency-bank .btn:hover{border-color:var(--q); color:#fff; background:#1a2031}
    .rope-frequency-bank .btn.active{border-color:var(--q); color:var(--q)}
    .rope-frequency-bank .note{font-family:inherit; font-size:15px; color:var(--muted); font-style:italic; margin:14px 2px 0}
    @media(max-width:720px){.rope-frequency-bank .ctrl{grid-template-columns:92px 1fr 64px; gap:10px}.rope-frequency-bank .legend{display:none}}
  </style>

  <div class="panel">
    <div class="panel-head">
      <div class="panel-title">A bank of rotating pairs · fast → slow</div>
      <div class="legend"><span>each dial = one dimension-pair</span></div>
    </div>
    <canvas class="cv" height="220"></canvas>
    <div class="controls">
      <div class="ctrl">
        <label>position&nbsp;<span class="sub">m</span></label>
        <input type="range" class="r q-range" min="0" max="48" step="1" value="0">
        <span class="val v">0</span>
      </div>
    </div>
    <div class="btnrow"><button class="btn play">▶ advance position</button></div>
  </div>
  <p class="note">Drag slowly: the leftmost dial races around while the rightmost barely budges. (For clarity the visual uses a gentler base than the real 10000, but the principle is identical.)</p>

  <script>
  (function(){
    const root = document.getElementById('rope-frequency-bank-9b8756b96f24f81545a6032c69a390d3');
    if(!root) return;
    const TAU = Math.PI*2;
    const C = { muted:'#7e8499', line:'rgba(255,255,255,.10)' };
    function fit(canvas){
      const dpr=Math.max(1,window.devicePixelRatio||1);
      if(!canvas.dataset.h) canvas.dataset.h=canvas.getAttribute('height');
      const cssH=+canvas.dataset.h, w=canvas.clientWidth;
      canvas.width=Math.round(w*dpr); canvas.height=Math.round(cssH*dpr); canvas.style.height=cssH+'px';
      const ctx=canvas.getContext('2d'); ctx.setTransform(dpr,0,0,dpr,0,0); return {ctx,w,h:cssH};
    }
    function arrow(ctx, ox,oy, x,y, color, lw=3, head=10){
      ctx.save(); ctx.strokeStyle=color; ctx.fillStyle=color; ctx.lineWidth=lw; ctx.lineCap='round';
      ctx.beginPath(); ctx.moveTo(ox,oy); ctx.lineTo(x,y); ctx.stroke();
      const a=Math.atan2(y-oy,x-ox);
      ctx.beginPath(); ctx.moveTo(x,y); ctx.lineTo(x-head*Math.cos(a-0.4), y-head*Math.sin(a-0.4)); ctx.lineTo(x-head*Math.cos(a+0.4), y-head*Math.sin(a+0.4)); ctx.closePath(); ctx.fill(); ctx.restore();
    }
    function ring(ctx, cx,cy,r,color,lw=1){ ctx.save(); ctx.strokeStyle=color; ctx.lineWidth=lw; ctx.beginPath(); ctx.arc(cx,cy,r,0,TAU); ctx.stroke(); ctx.restore(); }
    function glowDot(ctx,x,y,color,rad=6){ ctx.save(); ctx.shadowColor=color; ctx.shadowBlur=14; ctx.fillStyle=color; ctx.beginPath(); ctx.arc(x,y,rad,0,TAU); ctx.fill(); ctx.restore(); }
    function setFill(el,frac){ el.style.setProperty('--fill',(frac*100)+'%'); }

    const cv=root.querySelector('canvas'), r=root.querySelector('input.r'), v=root.querySelector('.v'), btn=root.querySelector('.play');
    const N=6, BASE=80;
    const ths=[]; for(let i=0;i<N;i++) ths.push(Math.pow(BASE,-i/(N-1)));
    let playing=false, t=0;

    function draw(){
      const {ctx,w,h}=fit(cv); ctx.clearRect(0,0,w,h);
      const m=+r.value, gap=w/N, R=Math.min(gap*0.34, h*0.34), cy=h*0.46;
      const labels=['θ₀ fast','θ₁','θ₂','θ₃','θ₄','θ₅ slow'];
      for(let i=0;i<N;i++){
        const cx=gap*(i+0.5);
        ring(ctx,cx,cy,R,C.line,1.2);
        const f=i/(N-1);
        const col=`rgb(${Math.round(246-167*f)},${Math.round(183+33*f)},${Math.round(64+143*f)})`;
        const ang=m*ths[i];
        const ex=cx+Math.cos(ang)*R*0.82, ey=cy-Math.sin(ang)*R*0.82;
        arrow(ctx,cx,cy, cx+R*0.82, cy, 'rgba(255,255,255,.12)',1.5,6);
        arrow(ctx,cx,cy, ex,ey, col, 2.6, 9); glowDot(ctx,ex,ey,col,3.5);
        ctx.fillStyle=C.muted; ctx.font='11px ui-monospace, monospace'; ctx.textAlign='center';
        ctx.fillText(labels[i], cx, cy+R+22); ctx.textAlign='left';
      }
      v.textContent=m; setFill(r,m/(+r.max));
    }
    let timer=0, raf=0;
    function halt(){ clearTimeout(timer); cancelAnimationFrame(raf); } // drop any pending tick so a quick pause/play cannot start a second loop
    function loop(){ if(!playing) return; t=(t+1)%(+r.max+1); r.value=t; draw(); timer=setTimeout(()=>{ raf=requestAnimationFrame(loop); },120); }
    r.addEventListener('input',()=>{playing=false;halt();btn.classList.remove('active');btn.textContent='▶ advance position';draw();});
    btn.addEventListener('click',()=>{ halt(); playing=!playing; btn.classList.toggle('active',playing); btn.textContent=playing?'⏸ pause':'▶ advance position'; t=+r.value; if(playing) loop(); });
    window.addEventListener('resize',draw); draw();
  })();
  </script>
</div>

<p>The full RoPE rotation is just this whole set of 2-D rotations stacked into one large block-diagonal matrix, with each pair of dimensions spun at its own frequency. Conceptually it is the picture above, repeated $d/2$ times.</p>
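<p>To make the block-diagonal picture concrete, the sketch below builds the matrix explicitly for a small $d$ and checks the relative-position property in full dimension. It assumes the adjacent-pair layout, with dimensions $(2i, 2i+1)$ sharing a block. Real implementations do not materialise this matrix, and the names here are hypothetical.</p>

```javascript
// d x d block-diagonal rotation for position `pos`:
// block i is a 2x2 rotation by pos * base^(-2i/d).
function ropeMatrix(pos, d, base = 10000) {
  const R = Array.from({ length: d }, () => new Array(d).fill(0));
  for (let i = 0; i < d / 2; i++) {
    const angle = pos * Math.pow(base, (-2 * i) / d);
    const c = Math.cos(angle), s = Math.sin(angle);
    const a = 2 * i, b = 2 * i + 1;
    R[a][a] = c; R[a][b] = -s;
    R[b][a] = s; R[b][b] = c;
  }
  return R;
}

// Matrix-vector product and dot product.
function matVec(M, v) {
  return M.map((row) => row.reduce((acc, x, j) => acc + x * v[j], 0));
}
function dot(a, b) {
  return a.reduce((acc, x, j) => acc + x * b[j], 0);
}

// Made-up query and key with d = 6 (three rotating pairs).
const q = [0.2, -0.7, 1.1, 0.4, -0.3, 0.9];
const k = [0.5, 0.1, -0.6, 0.8, 0.7, -0.2];
const m = 12, n = 5;

// Rotating q by position m and k by position n ...
const lhs = dot(matVec(ropeMatrix(m, 6), q), matVec(ropeMatrix(n, 6), k));
// ... gives the same score as leaving q alone and rotating k by n - m.
const rhs = dot(q, matVec(ropeMatrix(n - m, 6), k));
console.log(lhs.toFixed(6), rhs.toFixed(6));
```

<p>The two numbers match because each block is an ordinary 2-D rotation, and undoing one rotation while applying another leaves a single rotation by the difference.</p>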
<p>Two things are worth noticing now, because they come back when we talk about long context. The base $b = 10000$ is a knob you can turn. And the fast pairs are the ones that wrap around soonest.</p>
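<p>Both observations show up if you turn each speed into a wavelength, the number of positions a pair needs for one full turn, which is $2\pi/\theta_i$. The sketch below is illustrative only: the head dimension of 128 and the larger base of 500000 are example values, not a recommendation.</p>

```javascript
// Positions needed for pair i to complete one full turn: 2π / theta_i,
// with theta_i = base^(-2i/d).
function wavelengths(d, base) {
  return Array.from({ length: d / 2 }, (_, i) => (2 * Math.PI) / Math.pow(base, (-2 * i) / d));
}

// Example head dimension and two bases to compare.
const d = 128;
for (const base of [10000, 500000]) {
  const w = wavelengths(d, base);
  console.log(
    `base=${base}`,
    `fastest pair: ${w[0].toFixed(2)} positions per turn,`,
    `slowest pair: ${Math.round(w[w.length - 1])} positions per turn`
  );
}
```

<p>The fastest pair wraps after about 6.28 positions whatever the base is, since $\theta_0 = 1$. Raising the base stretches only the slower pairs, which is why the base is the knob people reach for when the context gets longer.</p>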
<h2 id="a-free-locality-prior-nearby-leans-in-distant-fades">A free locality prior: nearby leans in, distant fades</h2>
<p>The many frequencies do one more thing, almost by accident. When you add up the cosines across all the pairs, they reinforce at zero distance and start to interfere as the distance grows. The result is that the raw attention score between two identical vectors is high when they sit close together and decays, with a gentle ripple, as they move apart.</p>
$$\text{score}(\Delta)=\frac{1}{d/2}\sum_{i} \cos\big(\Delta\,\theta_i\big), \qquad \Delta = m - n.$$<p>Slide the dimension count and watch the curve: more frequencies, smoother decay, sharper peak at $\Delta = 0$.</p>
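<p>The curve is easy to tabulate yourself. This is a direct transcription of the formula above under its own simplifying assumption, that the query and key are identical and contribute equally in every pair; <code>decayScore</code> is a name made up for the example.</p>

```javascript
// Mean of cos(delta * theta_i) over all d/2 pairs, with theta_i = base^(-2i/d).
function decayScore(delta, d, base = 10000) {
  let sum = 0;
  for (let i = 0; i < d / 2; i++) {
    sum += Math.cos(delta * Math.pow(base, (-2 * i) / d));
  }
  return sum / (d / 2);
}

// Tabulate the score at a few distances for two head dimensions.
for (const d of [8, 64]) {
  const row = [0, 1, 4, 16, 64, 256].map((delta) => decayScore(delta, d).toFixed(3));
  console.log(`d=${d}`, row.join("  "));
}
```

<p>Each row starts at exactly 1 for $\Delta = 0$, where every cosine is 1, and falls off as the faster pairs drift out of phase. The curve is even in $\Delta$, so on this toy score looking back and looking ahead by the same distance give the same value.</p>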

<div class="rope-distance-decay" id="rope-distance-decay-9b8756b96f24f81545a6032c69a390d3">
  <style>
    .rope-distance-decay{
      --bg:#0a0d15; --bg2:#0f1320; --panel:#10141f; --panel2:#141927;
      --ink:#ece8dd; --ink-soft:#c3c0b6; --muted:#7e8499; --faint:#4a4f60;
      --line:rgba(255,255,255,.075); --line-strong:rgba(255,255,255,.14);
      --q:#f6b740; --k:#4fd8cf; --coral:#ff7d6b;
      color:var(--ink); margin:2rem 0; max-width:100%;
    }
    .rope-distance-decay *{box-sizing:border-box}
    .rope-distance-decay .panel{background:linear-gradient(180deg,var(--panel),var(--bg2)); border:1px solid var(--line-strong); border-radius:16px; padding:20px; box-shadow:0 24px 60px -36px rgba(0,0,0,.9), inset 0 1px 0 rgba(255,255,255,.03); position:relative; overflow:hidden}
    .rope-distance-decay .panel::before{content:""; position:absolute; inset:0; pointer-events:none; border-radius:16px; background:linear-gradient(90deg,var(--line) 1px,transparent 1px) 0 0/26px 26px,linear-gradient(180deg,var(--line) 1px,transparent 1px) 0 0/26px 26px; opacity:.30; -webkit-mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%); mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%)}
    .rope-distance-decay .panel > *{position:relative}
    .rope-distance-decay .panel-head{display:flex; align-items:baseline; justify-content:space-between; gap:12px; margin-bottom:12px; flex-wrap:wrap}
    .rope-distance-decay .panel-title{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.22em; text-transform:uppercase; color:var(--muted)}
    .rope-distance-decay .legend{display:flex; gap:16px; font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; color:var(--ink-soft)}
    .rope-distance-decay canvas{display:block; width:100%; touch-action:none}
    .rope-distance-decay .controls{display:flex; flex-direction:column; gap:14px; margin-top:16px}
    .rope-distance-decay .ctrl{display:grid; grid-template-columns:128px 1fr 70px; align-items:center; gap:14px}
    .rope-distance-decay .ctrl label{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12.5px; color:var(--ink-soft)}
    .rope-distance-decay .ctrl label .sub{color:var(--muted)}
    .rope-distance-decay .val{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:14px; text-align:right; color:var(--ink); background:var(--panel2); border:1px solid var(--line); border-radius:7px; padding:5px 9px}
    .rope-distance-decay input[type=range]{-webkit-appearance:none; appearance:none; height:4px; border-radius:3px; background:linear-gradient(90deg,var(--accent,#5a6178),var(--accent,#5a6178)) no-repeat, rgba(255,255,255,.09); background-size:var(--fill,0%) 100%; cursor:pointer; outline:none}
    .rope-distance-decay input[type=range].k-range{--accent:var(--k)}
    .rope-distance-decay input[type=range]::-webkit-slider-thumb{-webkit-appearance:none; width:17px; height:17px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35),0 2px 6px rgba(0,0,0,.5); transition:transform .12s}
    .rope-distance-decay input[type=range]::-webkit-slider-thumb:hover{transform:scale(1.16)}
    .rope-distance-decay input[type=range]::-moz-range-thumb{width:15px; height:15px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35)}
    .rope-distance-decay .note{font-family:inherit; font-size:15px; color:var(--muted); font-style:italic; margin:14px 2px 0}
    @media(max-width:720px){.rope-distance-decay .ctrl{grid-template-columns:92px 1fr 64px; gap:10px}.rope-distance-decay .legend{display:none}}
  </style>

  <div class="panel">
    <div class="panel-head">
      <div class="panel-title">Score vs. relative distance&nbsp;(matched q = k)</div>
      <div class="legend"><span>more frequencies → smoother decay</span></div>
    </div>
    <canvas class="cv" height="280"></canvas>
    <div class="controls">
      <div class="ctrl">
        <label>dimensions&nbsp;<span class="sub">d</span></label>
        <input type="range" class="r k-range" min="2" max="64" step="2" value="32">
        <span class="val v">32</span>
      </div>
    </div>
  </div>
  <p class="note">Peaks at Δ = 0, then settles toward zero for far-apart tokens. A soft "pay more attention to what's near" prior, with no parameters spent.</p>

  <script>
  (function(){
    const root = document.getElementById('rope-distance-decay-9b8756b96f24f81545a6032c69a390d3');
    if(!root) return;
    const C = { k:'#4fd8cf', muted:'#7e8499', coral:'#ff7d6b', line:'rgba(255,255,255,.10)', grid:'rgba(120,140,200,.13)' };
    const TAU = Math.PI*2;
    function fit(canvas){
      const dpr=Math.max(1,window.devicePixelRatio||1);
      if(!canvas.dataset.h) canvas.dataset.h=canvas.getAttribute('height');
      const cssH=+canvas.dataset.h, w=canvas.clientWidth;
      canvas.width=Math.round(w*dpr); canvas.height=Math.round(cssH*dpr); canvas.style.height=cssH+'px';
      const ctx=canvas.getContext('2d'); ctx.setTransform(dpr,0,0,dpr,0,0); return {ctx,w,h:cssH};
    }
    function glowDot(ctx,x,y,color,rad=6){ ctx.save(); ctx.shadowColor=color; ctx.shadowBlur=14; ctx.fillStyle=color; ctx.beginPath(); ctx.arc(x,y,rad,0,TAU); ctx.fill(); ctx.restore(); }
    function setFill(el,frac){ el.style.setProperty('--fill',(frac*100)+'%'); }
    const BASE=10000, MAXD=80;
    function score(dx,d){ const N=d/2; let s=0; for(let i=0;i<N;i++){ const th=Math.pow(BASE,-(2*i)/d); s+=Math.cos(dx*th); } return s/N; }

    const cv=root.querySelector('canvas'), r=root.querySelector('input.r'), v=root.querySelector('.v');
    function draw(){
      const {ctx,w,h}=fit(cv); ctx.clearRect(0,0,w,h);
      const d=+r.value;
      const padL=46,padR=18,padT=18,padB=34;
      const x0=padL, x1=w-padR, y0=padT, y1=h-padB;
      const X=dx=>x0+(dx+MAXD)/(2*MAXD)*(x1-x0);
      const Y=s=>y1-(s+0.25)/(1.25)*(y1-y0);
      ctx.strokeStyle=C.line; ctx.lineWidth=1;
      [0,0.25,0.5,0.75,1].forEach(s=>{ const y=Y(s); ctx.beginPath(); ctx.moveTo(x0,y); ctx.lineTo(x1,y); ctx.stroke(); ctx.fillStyle=C.muted; ctx.font='11px ui-monospace, monospace'; ctx.fillText(s.toFixed(2), 6, y+4); });
      ctx.strokeStyle=C.grid; ctx.beginPath(); ctx.moveTo(x0,Y(0)); ctx.lineTo(x1,Y(0)); ctx.stroke();
      ctx.fillStyle=C.muted; ctx.font='11px ui-monospace, monospace'; ctx.textAlign='center';
      [-80,-40,0,40,80].forEach(t=>ctx.fillText(t, X(t), y1+20));
      ctx.textAlign='left'; ctx.fillStyle=C.muted; ctx.fillText('relative distance  Δ = m − n', x0, y1+20);
      ctx.beginPath();
      for(let px=0;px<=x1-x0;px++){ const dx=-MAXD+(px/(x1-x0))*2*MAXD; const s=score(dx,d); const X2=x0+px, Y2=Y(s); if(px===0)ctx.moveTo(X2,Y2); else ctx.lineTo(X2,Y2); }
      ctx.strokeStyle=C.k; ctx.lineWidth=2.4; ctx.lineJoin='round'; ctx.stroke();
      ctx.lineTo(x1,Y(0)); ctx.lineTo(x0,Y(0)); ctx.closePath(); ctx.fillStyle='rgba(79,216,207,.10)'; ctx.fill();
      glowDot(ctx,X(0),Y(1),C.coral,4);
      ctx.fillStyle=C.coral; ctx.font='12px ui-monospace, monospace'; ctx.textAlign='center'; ctx.fillText('Δ=0 → 1.0', X(0), Y(1)-10); ctx.textAlign='left';
      v.textContent=d; setFill(r,(d-2)/(+r.max-2));
    }
    r.addEventListener('input',draw); window.addEventListener('resize',draw); draw();
  })();
  </script>
</div>

<p>So RoPE quietly hands the model a sensible default, namely &ldquo;pay more attention to what is near,&rdquo; without spending a single parameter on it. The model can override that prior when it needs to reach far, but it starts from a reasonable place.</p>
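<p>The curve in the widget can be reproduced in a few lines. This is a minimal sketch that assumes the same setup as the plot: matched unit-style vectors, base $b = 10000$, and frequencies $\theta_i = b^{-2i/d}$. The function name <code>rope_score</code> is made up for illustration.</p>

```python
import math

def rope_score(delta: float, d: int, base: float = 10000.0) -> float:
    """Mean of cos(delta * theta_i) over the d/2 frequency pairs.

    Assumes theta_i = base ** (-2 * i / d), the same schedule the widget uses.
    """
    n_pairs = d // 2
    total = 0.0
    for i in range(n_pairs):
        theta = base ** (-2 * i / d)
        total += math.cos(delta * theta)
    return total / n_pairs

# Print the score at a few relative distances for d = 32.
for delta in (0, 1, 4, 16, 64):
    print(delta, round(rope_score(delta, d=32), 3))
```

<p>The score is exactly 1 at $\Delta = 0$ and symmetric in $\Delta$. Away from zero it ripples rather than falling monotonically, which is the interference between frequencies described above.</p>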
<h2 id="adding-moves-the-point-rotating-keeps-it-honest">Adding moves the point; rotating keeps it honest</h2>
<p>We can now see, side by side, why rotation beats addition. Sinusoidal encoding adds a position vector, so the point drifts off its circle: its length changes, and content gets tangled up with position. RoPE rotates instead, so the point glides along its circle, its length perfectly preserved and position kept separate from meaning.</p>

<div class="rope-add-vs-rotate" id="rope-add-vs-rotate-9b8756b96f24f81545a6032c69a390d3">
  <style>
    .rope-add-vs-rotate{
      --bg:#0a0d15; --bg2:#0f1320; --panel:#10141f; --panel2:#141927;
      --ink:#ece8dd; --ink-soft:#c3c0b6; --muted:#7e8499; --faint:#4a4f60;
      --line:rgba(255,255,255,.075); --line-strong:rgba(255,255,255,.14);
      --q:#f6b740; --k:#4fd8cf; --coral:#ff7d6b;
      color:var(--ink); margin:2rem 0; max-width:100%;
    }
    .rope-add-vs-rotate *{box-sizing:border-box}
    .rope-add-vs-rotate .panel{background:linear-gradient(180deg,var(--panel),var(--bg2)); border:1px solid var(--line-strong); border-radius:16px; padding:20px; box-shadow:0 24px 60px -36px rgba(0,0,0,.9), inset 0 1px 0 rgba(255,255,255,.03); position:relative; overflow:hidden}
    .rope-add-vs-rotate .panel::before{content:""; position:absolute; inset:0; pointer-events:none; border-radius:16px; background:linear-gradient(90deg,var(--line) 1px,transparent 1px) 0 0/26px 26px,linear-gradient(180deg,var(--line) 1px,transparent 1px) 0 0/26px 26px; opacity:.30; -webkit-mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%); mask:radial-gradient(120% 120% at 50% 0%,#000,transparent 78%)}
    .rope-add-vs-rotate .panel > *{position:relative}
    .rope-add-vs-rotate .panel-head{display:flex; align-items:baseline; justify-content:space-between; gap:12px; margin-bottom:12px; flex-wrap:wrap}
    .rope-add-vs-rotate .panel-title{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.22em; text-transform:uppercase; color:var(--muted)}
    .rope-add-vs-rotate .compare{display:grid; grid-template-columns:1fr 1fr; gap:18px; margin-top:6px}
    .rope-add-vs-rotate .compare .col h4{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.12em; text-transform:uppercase; margin:0 0 10px; color:var(--muted); display:flex; gap:8px; align-items:center; flex-wrap:wrap}
    .rope-add-vs-rotate .compare canvas{background:rgba(0,0,0,.18); border-radius:11px; border:1px solid var(--line); display:block; width:100%; touch-action:none}
    .rope-add-vs-rotate .tag{font-size:10px; padding:2px 7px; border-radius:5px; letter-spacing:.05em}
    .rope-add-vs-rotate .tag.bad{background:rgba(255,125,107,.14); color:var(--coral)}
    .rope-add-vs-rotate .tag.good{background:rgba(79,216,207,.14); color:var(--k)}
    .rope-add-vs-rotate .controls{display:flex; flex-direction:column; gap:14px; margin-top:18px}
    .rope-add-vs-rotate .ctrl{display:grid; grid-template-columns:128px 1fr 70px; align-items:center; gap:14px}
    .rope-add-vs-rotate .ctrl label{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12.5px; color:var(--ink-soft)}
    .rope-add-vs-rotate .ctrl label .sub{color:var(--muted)}
    .rope-add-vs-rotate .val{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:14px; text-align:right; color:var(--ink); background:var(--panel2); border:1px solid var(--line); border-radius:7px; padding:5px 9px}
    .rope-add-vs-rotate input[type=range]{-webkit-appearance:none; appearance:none; height:4px; border-radius:3px; background:linear-gradient(90deg,var(--accent,#5a6178),var(--accent,#5a6178)) no-repeat, rgba(255,255,255,.09); background-size:var(--fill,0%) 100%; cursor:pointer; outline:none}
    .rope-add-vs-rotate input[type=range].q-range{--accent:var(--q)}
    .rope-add-vs-rotate input[type=range]::-webkit-slider-thumb{-webkit-appearance:none; width:17px; height:17px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35),0 2px 6px rgba(0,0,0,.5); transition:transform .12s}
    .rope-add-vs-rotate input[type=range]::-webkit-slider-thumb:hover{transform:scale(1.16)}
    .rope-add-vs-rotate input[type=range]::-moz-range-thumb{width:15px; height:15px; border-radius:50%; background:var(--ink); border:3px solid var(--accent,#9aa0b5); box-shadow:0 0 0 4px rgba(0,0,0,.35)}
    .rope-add-vs-rotate .readouts{display:flex; gap:10px; flex-wrap:wrap; margin-top:16px}
    .rope-add-vs-rotate .chip{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; background:var(--panel2); border:1px solid var(--line); border-radius:10px; padding:9px 13px; min-width:120px}
    .rope-add-vs-rotate .chip .lab{font-size:10.5px; letter-spacing:.12em; text-transform:uppercase; color:var(--muted)}
    .rope-add-vs-rotate .chip .num{font-size:19px; color:var(--ink); margin-top:2px; font-weight:500}
    .rope-add-vs-rotate .chip.hi{border-color:var(--coral)} .rope-add-vs-rotate .chip.hi .num{color:var(--coral)}
    .rope-add-vs-rotate .chip.k .num{color:var(--k)}
    .rope-add-vs-rotate .btnrow{display:flex; gap:10px; flex-wrap:wrap; margin-top:16px}
    .rope-add-vs-rotate .btn{font-family:ui-monospace,"JetBrains Mono",Menlo,monospace; font-size:12px; letter-spacing:.08em; text-transform:uppercase; color:var(--ink); background:var(--panel2); border:1px solid var(--line-strong); border-radius:9px; padding:9px 15px; cursor:pointer; transition:.15s; display:inline-flex; align-items:center; gap:8px}
    .rope-add-vs-rotate .btn:hover{border-color:var(--q); color:#fff; background:#1a2031}
    .rope-add-vs-rotate .btn.active{border-color:var(--q); color:var(--q)}
    .rope-add-vs-rotate .note{font-family:inherit; font-size:15px; color:var(--muted); font-style:italic; margin:14px 2px 0}
    @media(max-width:720px){.rope-add-vs-rotate .compare{grid-template-columns:1fr}.rope-add-vs-rotate .ctrl{grid-template-columns:92px 1fr 64px; gap:10px}}
  </style>

  <div class="panel">
    <div class="panel-head"><div class="panel-title">Same embedding · two ways to add position</div></div>
    <div class="compare">
      <div class="col">
        <h4><span class="tag bad">add</span> sinusoidal &nbsp;E + PE(m)</h4>
        <canvas class="cv-add" height="260"></canvas>
      </div>
      <div class="col">
        <h4><span class="tag good">rotate</span> RoPE &nbsp;R(mθ)·E</h4>
        <canvas class="cv-rot" height="260"></canvas>
      </div>
    </div>
    <div class="controls">
      <div class="ctrl">
        <label>position&nbsp;<span class="sub">m</span></label>
        <input type="range" class="r q-range" min="0" max="40" step="1" value="0">
        <span class="val v">0</span>
      </div>
    </div>
    <div class="readouts">
      <div class="chip hi"><div class="lab">‖ E + PE ‖ &nbsp;(adding)</div><div class="num oa">—</div></div>
      <div class="chip k"><div class="lab">‖ R·E ‖ &nbsp;(rotating)</div><div class="num ob">—</div></div>
    </div>
    <div class="btnrow"><button class="btn play">▶ sweep position</button></div>
  </div>
  <p class="note">The left length wobbles as position changes, which is the word's meaning being disturbed. The right length is rock-steady.</p>

  <script>
  (function(){
    const root = document.getElementById('rope-add-vs-rotate-9b8756b96f24f81545a6032c69a390d3');
    if(!root) return;
    const TAU = Math.PI*2;
    const C = { k:'#4fd8cf', muted:'#7e8499', coral:'#ff7d6b', line:'rgba(255,255,255,.10)', grid:'rgba(120,140,200,.13)' };
    function fit(canvas){
      const dpr=Math.max(1,window.devicePixelRatio||1);
      if(!canvas.dataset.h) canvas.dataset.h=canvas.getAttribute('height');
      const cssH=+canvas.dataset.h, w=canvas.clientWidth;
      canvas.width=Math.round(w*dpr); canvas.height=Math.round(cssH*dpr); canvas.style.height=cssH+'px';
      const ctx=canvas.getContext('2d'); ctx.setTransform(dpr,0,0,dpr,0,0); return {ctx,w,h:cssH};
    }
    function arrow(ctx, ox,oy, x,y, color, lw=3, head=10){
      ctx.save(); ctx.strokeStyle=color; ctx.fillStyle=color; ctx.lineWidth=lw; ctx.lineCap='round';
      ctx.beginPath(); ctx.moveTo(ox,oy); ctx.lineTo(x,y); ctx.stroke();
      const a=Math.atan2(y-oy,x-ox);
      ctx.beginPath(); ctx.moveTo(x,y); ctx.lineTo(x-head*Math.cos(a-0.4), y-head*Math.sin(a-0.4)); ctx.lineTo(x-head*Math.cos(a+0.4), y-head*Math.sin(a+0.4)); ctx.closePath(); ctx.fill(); ctx.restore();
    }
    function ring(ctx, cx,cy,r,color,lw=1){ ctx.save(); ctx.strokeStyle=color; ctx.lineWidth=lw; ctx.beginPath(); ctx.arc(cx,cy,r,0,TAU); ctx.stroke(); ctx.restore(); }
    function axes(ctx,cx,cy,r){ ctx.save(); ctx.strokeStyle=C.grid; ctx.lineWidth=1; ctx.beginPath(); ctx.moveTo(cx-r-14,cy); ctx.lineTo(cx+r+14,cy); ctx.moveTo(cx,cy-r-14); ctx.lineTo(cx,cy+r+14); ctx.stroke(); ctx.restore(); }
    function glowDot(ctx,x,y,color,rad=6){ ctx.save(); ctx.shadowColor=color; ctx.shadowBlur=14; ctx.fillStyle=color; ctx.beginPath(); ctx.arc(x,y,rad,0,TAU); ctx.fill(); ctx.restore(); }
    function setFill(el,frac){ el.style.setProperty('--fill',(frac*100)+'%'); }

    const ca=root.querySelector('.cv-add'), cb=root.querySelector('.cv-rot');
    const r=root.querySelector('input.r'), v=root.querySelector('.v'), btn=root.querySelector('.play');
    const oa=root.querySelector('.oa'), ob=root.querySelector('.ob');
    const W=0.5, Ex=0.62, Ey=0.30, Emag=Math.hypot(Ex,Ey);
    let playing=false, t=0;

    function panel(cv,mode,m){
      const {ctx,w,h}=fit(cv); ctx.clearRect(0,0,w,h);
      const cx=w/2, cy=h/2, R=Math.min(w,h)*0.34/Math.max(Emag*1.6,1);
      axes(ctx,cx,cy,R*1.6);
      ring(ctx,cx,cy,R*Emag,C.line,1.2);
      const ox=cx+Ex*R, oy=cy-Ey*R;
      arrow(ctx,cx,cy, ox,oy, 'rgba(255,255,255,.28)', 2.2, 9);
      ctx.fillStyle=C.muted; ctx.font='12px ui-monospace, monospace'; ctx.fillText('E', ox+8, oy-6);
      let rx,ry,mag,col;
      if(mode==='add'){
        const px=Math.sin(m*W), py=Math.cos(m*W);
        rx=Ex+px; ry=Ey+py; col=C.coral;
        ring(ctx, ox, oy, R, 'rgba(255,125,107,.18)', 1.2);
      } else {
        const a=m*W;
        rx=Ex*Math.cos(a)-Ey*Math.sin(a);
        ry=Ex*Math.sin(a)+Ey*Math.cos(a); col=C.k;
      }
      mag=Math.hypot(rx,ry);
      const ex=cx+rx*R, ey=cy-ry*R;
      arrow(ctx,cx,cy, ex,ey, col, 3.2, 11); glowDot(ctx,ex,ey,col,5);
      ctx.strokeStyle='rgba(255,255,255,.10)'; ctx.setLineDash([3,4]); ctx.beginPath(); ctx.arc(cx,cy,mag*R,0,TAU); ctx.stroke(); ctx.setLineDash([]);
      return mag;
    }
    function draw(){
      const m=+r.value;
      const ma=panel(ca,'add',m), mb=panel(cb,'rot',m);
      v.textContent=m; setFill(r,m/(+r.max));
      oa.textContent=ma.toFixed(3); ob.textContent=mb.toFixed(3);
    }
    function loop(){ if(!playing) return; t=(t+1)%(+r.max+1); r.value=t; draw(); setTimeout(()=>requestAnimationFrame(loop),120); }
    r.addEventListener('input',()=>{playing=false;btn.classList.remove('active');btn.textContent='▶ sweep position';draw();});
    btn.addEventListener('click',()=>{ playing=!playing; btn.classList.toggle('active',playing); btn.textContent=playing?'⏸ pause':'▶ sweep position'; t=+r.value; if(playing) loop(); });
    window.addEventListener('resize',draw); draw();
  })();
  </script>
</div>

<p>Sweep the position. On the left the length wobbles, which is the word&rsquo;s meaning being disturbed as it moves through the sentence. On the right it stays rock-steady. Same goal of encoding position, very different treatment of the content.</p>
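<p>The same comparison can be run as plain numbers. The sketch below reuses the toy values from the widget (a 2-D embedding $E = (0.62, 0.30)$ and an angular step of $0.5$ radians per position); the helper names are hypothetical and this is an illustration, not how any library implements either encoding.</p>

```python
import math

E = (0.62, 0.30)   # toy 2-D embedding, same values as the widget
W = 0.5            # toy angular frequency in radians per position

def add_position(e, m, w=W):
    """Additive encoding: E + (sin(m*w), cos(m*w))."""
    return (e[0] + math.sin(m * w), e[1] + math.cos(m * w))

def rotate_position(e, m, w=W):
    """Rotary encoding: rotate E by the angle m*w."""
    c, s = math.cos(m * w), math.sin(m * w)
    return (e[0] * c - e[1] * s, e[0] * s + e[1] * c)

def norm(v):
    return math.hypot(v[0], v[1])

# Columns: position, length after adding, length after rotating.
for m in range(0, 41, 8):
    print(m, round(norm(add_position(E, m)), 3), round(norm(rotate_position(E, m)), 3))
```

<p>The rotated length stays at $\lVert E \rVert$ for every position, while the added length swings as the position vector points with or against $E$.</p>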
<h2 id="why-it-won">Why it won</h2>
<p>Pull back, and the list of advantages is long for something so simple:</p>
<ul>
<li><strong>Relative position, for free.</strong> The attention score depends on the two positions only through their difference $m-n$. The model never has to learn to subtract positions, because that dependence is guaranteed by construction.</li>
<li><strong>Meaning stays intact.</strong> Rotation preserves length, so a token&rsquo;s content is not corrupted by where it sits, unlike additive encodings, which blur the two together.</li>
<li><strong>Applied where it matters.</strong> RoPE rotates the queries and keys inside every attention layer, right where the comparison happens, instead of being bolted once onto the input embedding and left to fade.</li>
<li><strong>Zero extra parameters.</strong> It is a fixed geometric operation. There is nothing to train, almost nothing to compute, and it composes cleanly with efficient-attention kernels like FlashAttention.</li>
<li><strong>A built-in locality prior.</strong> Scores naturally taper with distance, a free and sensible default.</li>
<li><strong>It stretches.</strong> Because it encodes relative distance through smooth, tunable frequencies, RoPE can be rescaled to longer sequences far more gracefully than learned or sinusoidal absolute encodings, which is a large part of why it underpins modern long-context models.</li>
</ul>
<p>That last point is where this post ends and the next one begins.</p>
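<p>Before moving on, the first claim in the list (relative position by construction) is easy to check numerically. The sketch below uses a single 2-D pair with an arbitrary frequency of $0.3$ radians per position; the vectors and the name <code>rope_dot</code> are made up for illustration.</p>

```python
import math

def rotate(v, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (v[0] * c - v[1] * s, v[0] * s + v[1] * c)

def rope_dot(q, k, m, n, theta=0.3):
    """Dot product of q rotated to position m and k rotated to position n."""
    qm, kn = rotate(q, m * theta), rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (0.8, -0.2), (0.1, 0.9)          # arbitrary query and key
a = rope_dot(q, k, m=5, n=2)            # offset m - n = 3
b = rope_dot(q, k, m=105, n=102)        # same offset, shifted by 100 positions
c = rope_dot(q, k, m=5, n=3)            # different offset, m - n = 2
print(a, b, c)
```

<p>Here <code>a</code> and <code>b</code> agree up to floating-point error because only the offset matters, while <code>c</code> differs because the offset changed.</p>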
<h2 id="the-bridge-from-rotation-to-long-context">The bridge: from rotation to long context</h2>
<p>This last point is also where the trouble starts. The same frequency structure that makes RoPE so elegant ties the model to the range of positions it saw during training.</p>
<p>A model trained with a context window of, say, 4K tokens has only ever seen rotation angles up to $4096 \cdot \theta_i$ for each pair. The fast pairs will have swept through their whole range many times inside those 4K tokens, while the slow pairs have turned only a fraction of a circle. The network has learned to read positions inside that envelope of angles, and nowhere else.</p>
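<p>To put rough numbers on that envelope, the sketch below counts how many full turns each pair completes over a 4096-token training window. The head dimension $d = 128$ and the helper name <code>max_turns</code> are assumptions chosen for illustration; the frequency schedule is the standard $\theta_i = b^{-2i/d}$ with $b = 10000$.</p>

```python
import math

def max_turns(train_len: int, d: int, base: float = 10000.0):
    """Full turns each frequency pair completes over train_len positions."""
    return [train_len * base ** (-2 * i / d) / (2 * math.pi) for i in range(d // 2)]

# d = 128 is an assumed head dimension, not a value taken from the text.
turns = max_turns(train_len=4096, d=128)
full = sum(1 for t in turns if t >= 1.0)

print(f"fastest pair: {turns[0]:.1f} turns")
print(f"slowest pair: {turns[-1]:.3f} turns")
print(f"{full} of {len(turns)} pairs complete at least one full turn")
```

<p>The fastest pair wraps hundreds of times, while the slowest covers well under one turn, so part of its circle is never visited during training.</p>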
<p>Now feed it a 100K-token prompt at inference. The slow pairs, which covered only a fraction of a circle during training, are suddenly pushed to phase angles the model has never seen. As far as the network is concerned, the positions have gone out of distribution. Attention destabilizes, and quality falls off a cliff long before the prompt ends.</p>
<p>This is why context extension has become its own discipline. Every major technique is, underneath, a way of manipulating the angles and frequencies we just built up. Position Interpolation squeezes the positions back into the trained range. NTK-aware scaling turns that base-frequency knob $b$ up, which slows the slow pairs the most and leaves the fastest pairs almost untouched. YaRN interpolates each frequency band differently. None of them make much sense until you can see position as a rotation, which you now can.</p>
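<p>Two of these techniques fit in a few lines each. The sketch below is a simplified illustration, not a reference implementation: Position Interpolation is written as dividing the position index by a scale factor, and NTK-aware scaling uses its commonly cited form $b' = b \cdot s^{d/(d-2)}$. The head dimension, training length, scale factor, and function names are all assumptions. YaRN is left out because its per-band rules need more machinery.</p>

```python
import math

def thetas(d, base=10000.0):
    """Per-pair frequencies theta_i = base ** (-2 * i / d)."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

def pi_angles(pos, d, scale, base=10000.0):
    """Position Interpolation: shrink the position index by `scale`."""
    return [(pos / scale) * th for th in thetas(d, base)]

def ntk_base(d, scale, base=10000.0):
    """NTK-aware scaling (commonly cited form): base * scale ** (d / (d - 2))."""
    return base * scale ** (d / (d - 2))

# Assumed toy setup: head dim 128, trained on 4096 tokens, extended 8x.
d, train_len, scale = 128, 4096, 8
old = thetas(d)
new = thetas(d, ntk_base(d, scale))

print("fastest pair, old vs new:", old[0], new[0])
print("slowest pair, old vs new:", old[-1], new[-1])
print("PI angle at 8x length, slowest pair:", pi_angles(train_len * scale, d, scale)[-1])
```

<p>In this form, raising the base leaves the fastest pair unchanged and slows the slowest pair by exactly the scale factor, while Position Interpolation maps position $8 \times 4096$ back onto the angles of position $4096$.</p>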
<p>That is the subject of the next post in this series.</p>
<p>RoPE also shows up in surprising places elsewhere on this blog. The <a href="/posts/attention-evolution/">evolution of attention</a> post shows how DeepSeek&rsquo;s Multi-head Latent Attention has to do some delicate surgery, a decoupled form of RoPE, to stay compatible with rotary embeddings while compressing the KV cache. The <a href="/posts/state-space-models-mamba/">state-space models</a> post shows Mamba-3 reusing the same rotary machinery with data-dependent angles. The idea travels a long way.</p>
<h2 id="lessons-for-builders">Lessons for builders</h2>
<p>A few takeaways that generalize beyond RoPE:</p>
<ol>
<li><strong>Positional information wants to be relative.</strong> When you catch yourself making a model re-derive the same relationship at every absolute offset, look for a representation where that relationship is built in rather than learned.</li>
<li><strong>Magnitude-preserving operations keep signals clean.</strong> Rotation works partly because it refuses to touch the content&rsquo;s length. When you have to inject one kind of information into a vector that already carries another, prefer transforms that leave the existing signal undisturbed.</li>
<li><strong>The base frequency is a real knob, not a constant.</strong> That $10000$ is not sacred; long-context models routinely raise it to $500{,}000$ or $1{,}000{,}000$, which slows the low-frequency pairs so they stay inside familiar angles over far more tokens, while the fastest pairs barely change. When a hyperparameter is set &ldquo;because the paper said so,&rdquo; it is worth knowing what it actually controls.</li>
<li><strong>The elegant default and its failure mode are two sides of one coin.</strong> The frequencies that give RoPE its clean relative encoding are the same ones that go out of distribution past the training length. The mechanism and its breaking point cannot be pulled apart, so understanding one means understanding the other.</li>
</ol>
<h2 id="conclusion">Conclusion</h2>
<p>RoPE comes down to one choice: to encode position, rotate the query and key instead of adding to them. Rotation keeps each vector&rsquo;s length, so a word&rsquo;s meaning stays intact, and since an attention score depends only on the angle between two vectors, the score ends up tracking how far apart two tokens are rather than where they sit.</p>
<p>If you want the original source, it is the <a href="https://arxiv.org/abs/2104.09864">RoFormer paper</a> by Su et al. It is short and readable, and after this it should be easy to follow. It is also the groundwork for the next question in this series: how a model trained on a few thousand tokens manages to read much longer inputs.</p>

]]></content:encoded></item><item><title>The Evolution of Attention, Part 1: From MHA to Latent Compression</title><link>https://www.mdjawad.com/posts/attention-evolution/</link><pubDate>Sun, 17 May 2026 10:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/attention-evolution/</guid><description>Part 1 of 2. Every attention variant since 2019 fights the same number: KV cache bytes per token. This post traces the first wave of answers, from MHA through MQA and GQA, to DeepSeek-V2&amp;rsquo;s Multi-head Latent Attention. We end at the 57× cache reduction that comes from caching a low-rank latent and never materializing K or V at inference.</description><content:encoded><![CDATA[<h2 id="what-this-post-covers">What This Post Covers</h2>
<p>This is Part 1 of a two-part series on how the transformer&rsquo;s attention mechanism has evolved. Every attention variant shipped in production since 2019 is fighting one number: the bytes of KV cache you have to carry per token. That number controls how many concurrent users fit on a GPU, how long a context you can serve, and ultimately whether your model is economically viable to deploy.</p>
<p>Part 1 walks the first wave of answers: the variants that attack the cache by changing what gets stored per token. We start with the bottleneck, recap multi-head attention, look at the stepping stones (MQA and GQA), then spend most of the post on Multi-head Latent Attention as introduced in <a href="https://arxiv.org/abs/2405.04434">DeepSeek-V2</a>. By the end you will see how a single low-rank bottleneck plus a clever bit of algebra collapses the cache by nearly two orders of magnitude without giving up the expressivity of standard softmax attention.</p>
<p>Part 2 picks up at the question MLA cannot answer: once each cached token is about as small as it gets, can we cache fewer tokens? That is sparse attention (DSA, NSA, MoBA), linear-attention hybrids, and the V4-Pro synthesis where compression and sparsity stack. (Coming soon.)</p>
<p>The audience is engineers who deploy models. FlashAttention has its own <a href="/posts/flash-attention/">dedicated post</a> on this blog and we will not re-cover it here.</p>
<h2 id="part-1-the-kv-cache-wall">Part 1: The KV Cache Wall</h2>
<p>Modern transformer inference is, in practice, a memory bandwidth problem. During autoregressive decoding, each new token must attend to every prior token, which means every prior token&rsquo;s key and value vectors must already sit in fast memory. That stash is the <strong>KV cache</strong>, and it grows linearly with sequence length, linearly with batch size, and linearly with the number of layers.</p>
<p>For a model with $L$ layers, $n_h$ heads, per-head dimension $d_h$, sequence length $T$, and batch size $B$ in float16, the standard MHA cache is:</p>
$$\text{Cache}_{\text{MHA}} \;=\; 2 \cdot L \cdot B \cdot T \cdot n_h \cdot d_h \cdot 2 \text{ bytes}$$<p>The leading factor of 2 is for keys <em>and</em> values; the trailing 2 is the bytes per 16-bit element. For a Llama-3-style 70B model served with plain MHA at BF16, with 80 layers, 64 heads, and $d_h = 128$, each token consumes about 32 KB of cache per layer, or about 2.5 MB across all 80 layers. At 128K context that is 320 GB of KV cache for a single sequence. The H100 has 80 GB of HBM. You cannot serve a single 128K-context request on one H100 without doing something to shrink that number. The cache, not the parameters, is what limits how many concurrent users fit on a GPU and how long a context you can serve.</p>
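<p>The arithmetic is worth running once by hand. The sketch below plugs the numbers from this paragraph into the formula above; the function name is made up, and sizes are reported in binary units (KiB, MiB, GiB), which is where the round figures come from.</p>

```python
def mha_kv_cache_bytes(layers, heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Standard MHA cache size: 2 (K and V) * L * B * T * n_h * d_h * bytes."""
    return 2 * layers * batch * seq_len * heads * head_dim * bytes_per_elem

# Hypothetical 70B-class model with full MHA: 80 layers, 64 heads, d_h = 128.
per_token_per_layer = mha_kv_cache_bytes(layers=1, heads=64, head_dim=128, seq_len=1)
per_token = mha_kv_cache_bytes(layers=80, heads=64, head_dim=128, seq_len=1)
full = mha_kv_cache_bytes(layers=80, heads=64, head_dim=128, seq_len=128 * 1024)

print(per_token_per_layer / 1024, "KiB per token per layer")  # 32.0
print(per_token / 2**20, "MiB per token")                      # 2.5
print(full / 2**30, "GiB at 128K context")                     # 320.0
```

<p>Batch size multiplies the total directly, so two concurrent 128K sequences would need twice as much.</p>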
<p>Several approaches have chipped away at this. <strong>Multi-Query Attention</strong> (MQA) shares a single K/V across all heads, a brutal compression that can noticeably degrade quality. <strong>Grouped-Query Attention</strong> (GQA) is the negotiated middle ground that ships in Llama, Mistral, and Qwen. <strong>Multi-Head Latent Attention</strong>, introduced in DeepSeek-V2 in May 2024, takes a different tack: it caches a single low-rank <em>latent</em> vector per token and reconstructs full-size per-head K and V on the fly.</p>
<p>The trick, and the whole point of the rest of this post, is that you never actually have to reconstruct them.</p>
<h2 id="part-2-notation">Part 2: Notation</h2>
<p>A few symbols recur throughout. None of them are unusual; the table is a quick reference so the equations below read fast.</p>
<table>
  <thead>
      <tr>
          <th>Symbol</th>
          <th>Meaning</th>
          <th>DeepSeek-V2 value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$d$</td>
          <td>Model (residual stream) dimension</td>
          <td>5120</td>
      </tr>
      <tr>
          <td>$n_h$</td>
          <td>Number of attention heads</td>
          <td>128</td>
      </tr>
      <tr>
          <td>$d_h$</td>
          <td>Per-head dimension (content part)</td>
          <td>128</td>
      </tr>
      <tr>
          <td>$d_c$</td>
          <td>Latent KV compression dim, the cached width</td>
          <td>512</td>
      </tr>
      <tr>
          <td>$d_c'$</td>
          <td>Query compression dim (reduces training activation memory; not part of the KV cache)</td>
          <td>1536</td>
      </tr>
      <tr>
          <td>$d_h^R$</td>
          <td>RoPE per-head dim (decoupled positional part)</td>
          <td>64</td>
      </tr>
      <tr>
          <td>$h_t$</td>
          <td>Hidden state at position $t$, shape $\mathbb{R}^{d}$</td>
          <td>(per token)</td>
      </tr>
      <tr>
          <td>$\mathbf{c}_t^{KV}$</td>
          <td>The cached latent at position $t$, shape $\mathbb{R}^{d_c}$</td>
          <td>(per token)</td>
      </tr>
  </tbody>
</table>
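<p>Convention: all vectors are row vectors, so $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$ and $h W$ produces an output row vector. This matches how PyTorch <code>nn.Linear</code> acts on its inputs and how most modern transformer code reads; note that <code>nn.Linear</code> stores its weight with shape $d_{\text{out}} \times d_{\text{in}}$ and computes $x W^\top$, so its stored matrix is the transpose of the $W$ used here.</p>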
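<p>With the table in hand, the per-token cache of each variant can be compared directly. The sketch below counts cached elements per token per layer using the DeepSeek-V2 values above. The MLA entry follows the accounting in the DeepSeek-V2 paper (the latent of width $d_c$ plus one shared RoPE key of width $d_h^R$); the GQA group count of 8 is an illustrative assumption, and the function name is made up.</p>

```python
def per_token_cache_elems(kind, n_h=128, d_h=128, d_c=512, d_h_rope=64, n_kv_groups=8):
    """Cached elements per token per layer for each attention variant."""
    if kind == "mha":
        return 2 * n_h * d_h            # full K and V for every head
    if kind == "gqa":
        return 2 * n_kv_groups * d_h    # one K/V per group (8 groups assumed)
    if kind == "mqa":
        return 2 * d_h                  # a single shared K/V
    if kind == "mla":
        return d_c + d_h_rope           # latent plus the decoupled RoPE key
    raise ValueError(f"unknown variant: {kind}")

mha = per_token_cache_elems("mha")
for kind in ("mha", "gqa", "mqa", "mla"):
    n = per_token_cache_elems(kind)
    print(f"{kind}: {n} elements, {mha / n:.1f}x smaller than MHA")
```

<p>For MLA this gives 576 elements against 32768 for MHA, a ratio of about 57.</p>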
<h2 id="part-3-recap-standard-multi-head-attention">Part 3: Recap: Standard Multi-Head Attention</h2>
<p>Before getting to MLA, it is worth pinning down exactly what we are trying to replace. Standard MHA at position $t$ takes the hidden state $h_t \in \mathbb{R}^{d}$ and produces three projections:</p>
$$q_t = h_t W^Q, \qquad k_t = h_t W^K, \qquad v_t = h_t W^V$$<p>where $W^Q, W^K, W^V \in \mathbb{R}^{d \times n_h d_h}$. Each result is split into $n_h$ heads of dimension $d_h$. Attention is computed head-wise, with the keys and values of every position up to and including $t$ kept in the cache. Writing $K_i$ and $V_i$ for head $i$&rsquo;s cached keys and values stacked as rows:</p>
$$\text{Attn}_i(q, K, V) \;=\; \text{softmax}\!\left(\tfrac{q_i K_i^\top}{\sqrt{d_h}}\right) V_i$$
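<p>To make the formula concrete, here is a minimal pure-Python sketch of one head attending over a small cache. The helper names (<code>softmax</code>, <code>attend_one_head</code>) and the toy numbers are illustrative assumptions, not code from any particular library; real implementations batch this over heads and positions with tensor ops.</p>

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend_one_head(q, K, V):
    """q: one head's query (d_h floats); K, V: cached rows, one per position."""
    d_h = len(q)
    # Scaled dot product of the query against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_h) for k in K]
    weights = softmax(scores)
    # Weighted sum of the cached value rows.
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

# Toy cache: 3 positions, d_h = 2 (hypothetical numbers).
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q = [1.0, 1.0]
out = attend_one_head(q, K, V)
print(out)
```

<p>Every row of <code>K</code> and <code>V</code> in this sketch had to be kept around from an earlier step. That per-position storage is the thing the rest of the post tries to shrink.</p>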

<div class="attn-mha-anatomy attn-breakout" id="attn-ma-d85f6de752f74995c33652ca2b3b58d0">
  <style>
    .attn-mha-anatomy {
      --ma-bg: #0d1117;
      --ma-surface: #161b22;
      --ma-border: #30363d;
      --ma-text: #e6edf3;
      --ma-text-muted: #8b949e;
      --ma-residual: #58a6ff;
      --ma-q: #58a6ff;
      --ma-k: #f97583;
      --ma-v: #f0b429;
      --ma-cache: #39d353;
      --ma-weight: #d29922;
      --ma-divider: rgba(255, 255, 255, 0.22);

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
      background: var(--ma-bg);
      color: var(--ma-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .attn-mha-anatomy,
    :root:not([data-theme="dark"]) .attn-mha-anatomy {
      --ma-bg: #f8fafc;
      --ma-surface: #ffffff;
      --ma-border: #e2e8f0;
      --ma-text: #1e293b;
      --ma-text-muted: #64748b;
      --ma-residual: #3b82f6;
      --ma-q: #3b82f6;
      --ma-k: #ef4444;
      --ma-v: #d97706;
      --ma-cache: #10b981;
      --ma-weight: #b8860b;
      --ma-divider: rgba(0, 0, 0, 0.22);
    }

    .attn-mha-anatomy * { box-sizing: border-box; }

    .attn-mha-anatomy .ma-header {
      text-align: center;
      margin-bottom: 1rem;
    }

    .attn-mha-anatomy .ma-header h3 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1rem;
      font-weight: 600;
      color: var(--ma-q);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .attn-mha-anatomy .ma-header p {
      color: var(--ma-text-muted);
      font-size: 1rem;
      margin: 0;
    }

    .attn-mha-anatomy .ma-card {
      background: var(--ma-surface);
      border: 1px solid var(--ma-border);
      border-radius: 10px;
      padding: 1.2rem;
    }

    .attn-mha-anatomy svg {
      width: 100%;
      height: auto;
      display: block;
      overflow: visible;
    }

    .attn-mha-anatomy .ma-svg-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 12px;
      fill: var(--ma-text-muted);
    }

    .attn-mha-anatomy .ma-svg-label.bright { fill: var(--ma-text); font-weight: 600; }
    .attn-mha-anatomy .ma-svg-label.weight { fill: var(--ma-weight); }
    .attn-mha-anatomy .ma-svg-label.cache { fill: var(--ma-cache); font-weight: 700; letter-spacing: 0.12em; }
    .attn-mha-anatomy .ma-svg-label.tag {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 11px;
      letter-spacing: 0.06em;
      text-transform: uppercase;
      fill: var(--ma-text-muted);
    }

    .attn-mha-anatomy .ma-caption {
      margin-top: 0.85rem;
      padding-top: 0.65rem;
      border-top: 1px dashed var(--ma-border);
      font-size: 0.9rem;
      color: var(--ma-text-muted);
      line-height: 1.6;
      font-style: italic;
      text-align: center;
    }
  </style>

  <div class="ma-header">
    <h3>Standard Multi-Head Attention</h3>
    <p>One token's hidden state projected three ways. K and V get cached, per head, for every past position.</p>
  </div>

  <div class="ma-card">
    <svg viewBox="0 0 1080 320" preserveAspectRatio="xMidYMid meet" id="ma-svg-d85f6de752f74995c33652ca2b3b58d0"></svg>
    <div class="ma-caption">
      Standard multi-head attention. The full Q, K, V tensors each have width n_h · d_h, partitioned into n_h slices of width d_h. K and V are cached for every past token, giving the familiar 2 · n_h · d_h floats per token per layer.
    </div>
  </div>

  <script>
    (function() {
      const root = document.getElementById('attn-ma-d85f6de752f74995c33652ca2b3b58d0');
      const svg = root.querySelector('#ma-svg-d85f6de752f74995c33652ca2b3b58d0');
      const NS = 'http://www.w3.org/2000/svg';

      function readVar(name) {
        return getComputedStyle(root).getPropertyValue(name).trim();
      }

      function el(name, attrs, parent, text) {
        const e = document.createElementNS(NS, name);
        Object.entries(attrs).forEach(([k, v]) => e.setAttribute(k, v));
        if (text !== undefined) e.textContent = text;
        if (parent) parent.appendChild(e);
        return e;
      }

      
      const defs = el('defs', {}, svg);
      const m = el('marker', {
        id: 'ma-arr-d85f6de752f74995c33652ca2b3b58d0', viewBox: '0 0 10 10',
        refX: 8, refY: 5, markerWidth: 6, markerHeight: 6,
        orient: 'auto-start-reverse'
      }, defs);
      el('path', { d: 'M 0 0 L 10 5 L 0 10 Z', fill: readVar('--ma-text') }, m);
      const arrUrl = 'url(#ma-arr-d85f6de752f74995c33652ca2b3b58d0)';

      
      

      
      const xH = 40, yH = 140, wH = 130, hH = 44;
      el('text', { x: xH, y: yH - 12, class: 'ma-svg-label tag' }, svg, 'hidden state · h_t');
      el('rect', {
        x: xH, y: yH, width: wH, height: hH,
        fill: 'none', stroke: readVar('--ma-residual'), 'stroke-width': 1.6, rx: 4
      }, svg);
      el('rect', {
        x: xH, y: yH, width: wH, height: hH,
        fill: readVar('--ma-residual'), opacity: 0.18, rx: 4
      }, svg);
      el('text', {
        x: xH + wH / 2, y: yH + hH + 18,
        class: 'ma-svg-label', 'text-anchor': 'middle'
      }, svg, 'd = 5120');

      
      const arrowStart = xH + wH + 6;
      const yQ = 50, yK = 140, yV = 230;
      const arrowEnd = 280;
      [yQ, yK, yV].forEach(ty => {
        el('path', {
          d: `M ${arrowStart} ${yH + hH / 2} L ${arrowEnd} ${ty + 17}`,
          stroke: readVar('--ma-text'), 'stroke-width': 1.2, fill: 'none',
          'marker-end': arrUrl
        }, svg);
      });

      
      el('text', { x: 200, y: 85, class: 'ma-svg-label weight', 'font-weight': 700 }, svg, '× W^Q');
      el('text', { x: 200, y: 158, class: 'ma-svg-label weight', 'font-weight': 700 }, svg, '× W^K');
      el('text', { x: 200, y: 232, class: 'ma-svg-label weight', 'font-weight': 700 }, svg, '× W^V');

      
      const xBar = 290, wBar = 520, hBar = 34;
      const N_HEAD_DIVIDERS = 7; 

      function drawHeadBar(y, fill, label, color, isCached) {
        
        el('text', {
          x: xBar, y: y - 8, class: 'ma-svg-label tag', style: `fill: ${color}`
        }, svg, label);
        
        el('rect', {
          x: xBar, y, width: wBar, height: hBar,
          fill, opacity: 0.85, rx: 2
        }, svg);
        
        for (let i = 1; i <= N_HEAD_DIVIDERS; i++) {
          const dx = xBar + (wBar * i) / (N_HEAD_DIVIDERS + 1);
          el('line', {
            x1: dx, y1: y, x2: dx, y2: y + hBar,
            stroke: readVar('--ma-divider'),
            'stroke-width': 0.7,
            'stroke-dasharray': '2 2'
          }, svg);
        }
        
        el('text', {
          x: xBar + wBar / 2, y: y + hBar + 18,
          class: 'ma-svg-label', 'text-anchor': 'middle'
        }, svg, 'n_h · d_h  =  128 · 128  =  16,384');
        
        if (isCached) {
          const chipX = xBar + wBar + 14;
          el('rect', {
            x: chipX, y: y + 8, width: 64, height: 18,
            fill: 'none', stroke: readVar('--ma-cache'), 'stroke-width': 1.2, rx: 9
          }, svg);
          el('text', {
            x: chipX + 32, y: y + 21,
            class: 'ma-svg-label cache', 'text-anchor': 'middle', 'font-size': 10.5
          }, svg, 'CACHED');
        }
      }

      drawHeadBar(yQ, readVar('--ma-q'), 'Q  ·  n_h heads × d_h', readVar('--ma-q'), false);
      drawHeadBar(yK, readVar('--ma-k'), 'K  ·  cached per head', readVar('--ma-k'), true);
      drawHeadBar(yV, readVar('--ma-v'), 'V  ·  cached per head', readVar('--ma-v'), true);

      
      const bracketX = xBar + wBar + 90;
      const bracketTop = yK - 4;
      const bracketBot = yV + hBar + 4;
      el('path', {
        d: `M ${bracketX - 5} ${bracketTop} L ${bracketX} ${bracketTop} L ${bracketX} ${bracketBot} L ${bracketX - 5} ${bracketBot}`,
        stroke: readVar('--ma-cache'), 'stroke-width': 1.6, fill: 'none', 'stroke-dasharray': '4 3'
      }, svg);

      
      const sumX = bracketX + 14, sumY = (yK + yV) / 2 - 26;
      el('rect', {
        x: sumX, y: sumY, width: 200, height: 88,
        fill: readVar('--ma-bg'), stroke: readVar('--ma-cache'), 'stroke-width': 1.2, rx: 6
      }, svg);
      el('text', {
        x: sumX + 12, y: sumY + 18, class: 'ma-svg-label cache', 'font-size': 10.5
      }, svg, 'CACHE PER TOKEN');
      el('text', {
        x: sumX + 12, y: sumY + 42, class: 'ma-svg-label bright', 'font-size': 14
      }, svg, '2 · n_h · d_h');
      el('text', {
        x: sumX + 12, y: sumY + 62, class: 'ma-svg-label bright', 'font-size': 14
      }, svg, '= 32,768 floats');
      el('text', {
        x: sumX + 12, y: sumY + 80, class: 'ma-svg-label'
      }, svg, '64 KB / layer in BF16');
    })();
  </script>
</div>

<p>The cost is paid not in the projection matrices (those live in HBM regardless) but in the running cache of all past $k_t, v_t$. Per token per layer, that is $2 \cdot n_h \cdot d_h$ floats. At DeepSeek-V2 scale (128 heads, per-head dimension 128) that comes to 32,768 floats per token per layer, or 64 KB in BF16. Everything MLA does is in service of shrinking that cache while keeping the per-head expressivity.</p>
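<p>The arithmetic is easy to check in a few lines. This is a small sketch, not a profiler: the function name is made up for illustration, and the two-byte width assumes BF16 storage as in the figure above.</p>

```python
def mha_cache_values_per_token_per_layer(n_heads, d_head):
    # One key row and one value row per head.
    return 2 * n_heads * d_head

n_h, d_h = 128, 128        # DeepSeek-V2-scale values from the notation table
bytes_per_value = 2        # assumption: BF16 storage

values = mha_cache_values_per_token_per_layer(n_h, d_h)
kib = values * bytes_per_value / 1024
print(values, kib)  # 32768 64.0
```

<p>Multiply that per-layer figure by the number of layers and the context length and the cache quickly outgrows the weights it sits next to.</p>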
<h2 id="part-4-stepping-stones-mqa-and-gqa">Part 4: Stepping Stones: MQA and GQA</h2>
<p>MLA is most legible when contrasted with what came before. Both MQA and GQA attack the cache by reducing the number of distinct K and V projections.</p>
<p><strong>Multi-Query Attention</strong> (Shazeer 2019) keeps $n_h$ query heads but uses a single shared key projection and a single shared value projection across all of them. The cache shrinks by a factor of $n_h$; on a 64-head model that is a 64x reduction. MQA worked in production for PaLM and Falcon-7B, but most successors retreated. With one shared K and one shared V, every query head looks at the same key/value subspace, so the model loses head-level specialization on the recall side, and the quality regressions at scale were measurable.</p>
<p><strong>Grouped-Query Attention</strong> (Ainslie et al. 2023) is the practical middle ground. Instead of all-or-nothing sharing, you partition the $n_h$ query heads into $n_g$ groups, and each group shares one K/V head. MHA is $n_g = n_h$. MQA is $n_g = 1$. GQA-8 (the Llama-3 default) is $n_g = 8$. For our 70B reference at GQA-8 the cache is 320 KB per token, an 8x reduction, and quality regression vs MHA is in the noise.</p>
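<p>A short sketch of how the three schemes compare on cache size, and how query heads map onto shared K/V heads. The 70B-style shape used here (64 query heads, per-head dimension 128, 80 layers, BF16) is an assumption chosen to reproduce the 320 KB figure; the function names are illustrative.</p>

```python
def kv_cache_bytes_per_token(n_kv_heads, d_head, n_layers, bytes_per_value=2):
    # K and V each store n_kv_heads * d_head values in every layer.
    return 2 * n_kv_heads * d_head * n_layers * bytes_per_value

def kv_head_for_query_head(query_head, n_query_heads, n_groups):
    # Consecutive query heads share one K/V head.
    return query_head // (n_query_heads // n_groups)

# Assumed 70B-style shape: 64 query heads, d_h = 128, 80 layers, BF16.
N_Q, D_H, LAYERS = 64, 128, 80
sizes_kib = {}
for name, n_g in [("MHA", 64), ("GQA-8", 8), ("MQA", 1)]:
    sizes_kib[name] = kv_cache_bytes_per_token(n_g, D_H, LAYERS) / 1024
    print(name, sizes_kib[name], "KiB per token")

# Under GQA-8, query heads 0..7 read K/V head 0, heads 8..15 read head 1, etc.
print([kv_head_for_query_head(h, N_Q, 8) for h in (0, 7, 8, 63)])
```

<p>Under these assumptions the three schemes come out to 2560, 320, and 40 KiB per token. The only thing that changes between them is <code>n_kv_heads</code>; the stored objects are still keys and values.</p>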


<div class="attn-head-sharing attn-breakout" id="attn-hs-d85f6de752f74995c33652ca2b3b58d0">
  <style>
    .attn-head-sharing {
      --hs-bg: #0d1117;
      --hs-surface: #161b22;
      --hs-border: #30363d;
      --hs-text: #e6edf3;
      --hs-text-muted: #8b949e;
      --hs-q: #58a6ff;
      --hs-kv: #f97583;
      --hs-divider: rgba(255,255,255,0.22);

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
      background: var(--hs-bg);
      color: var(--hs-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .attn-head-sharing,
    :root:not([data-theme="dark"]) .attn-head-sharing {
      --hs-bg: #f8fafc;
      --hs-surface: #ffffff;
      --hs-border: #e2e8f0;
      --hs-text: #1e293b;
      --hs-text-muted: #64748b;
      --hs-q: #3b82f6;
      --hs-kv: #ef4444;
      --hs-divider: rgba(0,0,0,0.22);
    }

    .attn-head-sharing * { box-sizing: border-box; }

    .attn-head-sharing .hs-header {
      text-align: center;
      margin-bottom: 1rem;
    }

    .attn-head-sharing .hs-header h3 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1rem;
      font-weight: 600;
      color: var(--hs-q);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .attn-head-sharing .hs-header p {
      color: var(--hs-text-muted);
      font-size: 1rem;
      margin: 0;
    }

    .attn-head-sharing .hs-card {
      background: var(--hs-surface);
      border: 1px solid var(--hs-border);
      border-radius: 10px;
      padding: 1.2rem;
    }

    .attn-head-sharing svg {
      width: 100%;
      height: auto;
      display: block;
      overflow: visible;
    }

    .attn-head-sharing .hs-svg-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 12px;
      fill: var(--hs-text-muted);
    }

    .attn-head-sharing .hs-svg-label.bright { fill: var(--hs-text); font-weight: 700; }
    .attn-head-sharing .hs-svg-label.title {
      font-family: 'IBM Plex Sans', sans-serif;
      font-size: 16px;
      font-weight: 700;
      fill: var(--hs-text);
    }
    .attn-head-sharing .hs-svg-label.tag {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 11px;
      letter-spacing: 0.06em;
      text-transform: uppercase;
      fill: var(--hs-text-muted);
    }

    .attn-head-sharing .hs-caption {
      margin-top: 0.85rem;
      padding-top: 0.65rem;
      border-top: 1px dashed var(--hs-border);
      font-size: 0.9rem;
      color: var(--hs-text-muted);
      line-height: 1.6;
      font-style: italic;
      text-align: center;
    }
  </style>

  <div class="hs-header">
    <h3>Stepping Stones: MHA, GQA, MQA</h3>
    <p>The K/V width is the lever each method pulls. Q stays the same; K and V get narrower.</p>
  </div>

  <div class="hs-card">
    <svg viewBox="0 0 1080 260" preserveAspectRatio="xMidYMid meet" id="hs-svg-d85f6de752f74995c33652ca2b3b58d0"></svg>
    <div class="hs-caption">
      MHA keeps full per-head K/V. GQA picks a comfortable middle. MQA collapses to a single shared pair. All three still cache K and V as the stored objects.
    </div>
  </div>

  <script>
    (function() {
      const root = document.getElementById('attn-hs-d85f6de752f74995c33652ca2b3b58d0');
      const svg = root.querySelector('#hs-svg-d85f6de752f74995c33652ca2b3b58d0');
      const NS = 'http://www.w3.org/2000/svg';

      function readVar(name) {
        return getComputedStyle(root).getPropertyValue(name).trim();
      }

      function el(name, attrs, parent, text) {
        const e = document.createElementNS(NS, name);
        Object.entries(attrs).forEach(([k, v]) => e.setAttribute(k, v));
        if (text !== undefined) e.textContent = text;
        if (parent) parent.appendChild(e);
        return e;
      }

      function drawHeadBar(x, y, w, h, color, headCount) {
        el('rect', {
          x, y, width: Math.max(2, w), height: h,
          fill: color, opacity: 0.85, rx: 2
        }, svg);
        if (headCount > 1 && w >= 16) {
          for (let i = 1; i < headCount; i++) {
            const dx = x + (w * i) / headCount;
            el('line', {
              x1: dx, y1: y, x2: dx, y2: y + h,
              stroke: readVar('--hs-divider'),
              'stroke-width': 0.7,
              'stroke-dasharray': '2 2'
            }, svg);
          }
        }
      }

      function drawVariant(xOff, title, kvCount, cacheFormula, descNote, ship) {
        const cQ = readVar('--hs-q');
        const cKV = readVar('--hs-kv');
        const xBase = xOff;
        const fullW = 300;
        const kvW = (fullW * kvCount) / 8;
        const hBar = 22;

        
        el('text', { x: xBase, y: 16, class: 'hs-svg-label title' }, svg, title);
        
        el('text', { x: xBase, y: 38, class: 'hs-svg-label tag' }, svg, descNote);

        
        const yQ = 64;
        el('text', { x: xBase, y: yQ - 6, class: 'hs-svg-label' }, svg, 'Q · 8 heads');
        drawHeadBar(xBase, yQ, fullW, hBar, cQ, 8);

        
        const yK = 110;
        const kLabel = kvCount === 1 ? 'K · 1 (shared)' : 'K · ' + kvCount + ' heads';
        el('text', { x: xBase, y: yK - 6, class: 'hs-svg-label' }, svg, kLabel);
        drawHeadBar(xBase, yK, kvW, hBar, cKV, kvCount);

        
        const yV = 156;
        const vLabel = kvCount === 1 ? 'V · 1 (shared)' : 'V · ' + kvCount + ' heads';
        el('text', { x: xBase, y: yV - 6, class: 'hs-svg-label' }, svg, vLabel);
        drawHeadBar(xBase, yV, kvW, hBar, cKV, kvCount);

        
        el('text', { x: xBase, y: 208, class: 'hs-svg-label bright' }, svg, cacheFormula);
        
        el('text', { x: xBase, y: 230, class: 'hs-svg-label' }, svg, ship);
      }

      
      drawVariant(30,  'MHA',   8, 'cache: 2 · n_h · d_h',  'one K, one V per head',          'baseline · expensive');
      drawVariant(390, 'GQA-4', 4, 'cache: 2 · n_g · d_h',  'K, V shared by groups of heads', 'Llama, Mistral, Qwen');
      drawVariant(750, 'MQA',   1, 'cache: 2 · d_h',        'one K, one V across ALL heads',  'tiny · quality cost');
    })();
  </script>
</div>

<p>All three methods pull the same lever, the number of K/V heads, and all three still store K and V themselves in the cache.</p>
<p>MLA changes the question. Instead of asking &ldquo;how many K/V projections do we keep?&rdquo;, it asks: what if the cache is not K or V at all, but a compressed representation we expand on demand?</p>
<h2 id="part-5-mlas-core-insight">Part 5: MLA&rsquo;s Core Insight</h2>
<p>The hidden state $h_t \in \mathbb{R}^{d}$ already contains everything needed to compute that token&rsquo;s keys and values. The standard pipeline stores a high-dimensional intermediate ($n_h d_h = 16{,}384$ values each for K and V in DeepSeek-V2), when really we could store a compact summary and reproject when needed.</p>
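<p>Concretely: introduce a latent bottleneck $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ with $d_c \ll n_h d_h$, produced by a down-projection $W^{DKV} \in \mathbb{R}^{d \times d_c}$. Cache only this latent. Recover K and V via dedicated up-projections $W^{UK}, W^{UV} \in \mathbb{R}^{d_c \times n_h d_h}$ at attention time:</p>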
$$\underbrace{h_t W^{DKV}}_{\mathbf{c}_t^{KV} \,\in\, \mathbb{R}^{d_c}}\;\longrightarrow\;\begin{cases} \mathbf{c}_t^{KV} W^{UK} \;=\; k_t & \in \mathbb{R}^{n_h d_h} \\[2pt] \mathbf{c}_t^{KV} W^{UV} \;=\; v_t & \in \mathbb{R}^{n_h d_h} \end{cases}$$
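<p>The same pipeline as a toy pure-Python sketch, following the row-vector convention from the notation section. The sizes are deliberately tiny, the weights are random rather than learned, and the helper names are assumptions for illustration only.</p>

```python
import random

def matmul_row(x, W):
    # Row vector x (length d_in) times matrix W (d_in rows, d_out columns).
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, d_c, n_h, d_h = 16, 4, 2, 8   # toy sizes; DeepSeek-V2 uses 5120, 512, 128, 128

W_DKV = random_matrix(d, d_c, rng)          # down-projection
W_UK = random_matrix(d_c, n_h * d_h, rng)   # key up-projection
W_UV = random_matrix(d_c, n_h * d_h, rng)   # value up-projection

kv_cache = []                               # holds only the latents
for _ in range(3):                          # three token positions
    h_t = [rng.gauss(0.0, 1.0) for _ in range(d)]
    kv_cache.append(matmul_row(h_t, W_DKV)) # c_t^KV, width d_c

# At attention time, expand every cached latent back to full-width K and V.
K = [matmul_row(c, W_UK) for c in kv_cache]
V = [matmul_row(c, W_UV) for c in kv_cache]

# Cached width per token versus the K + V width it stands in for.
print(len(kv_cache[0]), len(K[0]) + len(V[0]))  # 4 32
```

<p>With the real DeepSeek-V2 sizes the same comparison is 512 against 32,768, which is where the 64x figure comes from.</p>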

<div class="attn-mla-comp attn-breakout" id="attn-mla-d85f6de752f74995c33652ca2b3b58d0">
  <style>
    .attn-mla-comp {
      --mla-bg: #0d1117;
      --mla-surface: #161b22;
      --mla-border: #30363d;
      --mla-text: #e6edf3;
      --mla-text-muted: #8b949e;
      --mla-residual: #58a6ff;
      --mla-k: #f97583;
      --mla-v: #f0b429;
      --mla-latent: #39d353;
      --mla-cache: #39d353;
      --mla-weight: #d29922;
      --mla-divider: rgba(255, 255, 255, 0.22);

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
      background: var(--mla-bg);
      color: var(--mla-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .attn-mla-comp,
    :root:not([data-theme="dark"]) .attn-mla-comp {
      --mla-bg: #f8fafc;
      --mla-surface: #ffffff;
      --mla-border: #e2e8f0;
      --mla-text: #1e293b;
      --mla-text-muted: #64748b;
      --mla-residual: #3b82f6;
      --mla-k: #ef4444;
      --mla-v: #d97706;
      --mla-latent: #10b981;
      --mla-cache: #10b981;
      --mla-weight: #b8860b;
      --mla-divider: rgba(0, 0, 0, 0.22);
    }

    .attn-mla-comp * { box-sizing: border-box; }

    .attn-mla-comp .mla-header {
      text-align: center;
      margin-bottom: 1rem;
    }

    .attn-mla-comp .mla-header h3 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1rem;
      font-weight: 600;
      color: var(--mla-latent);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .attn-mla-comp .mla-header p {
      color: var(--mla-text-muted);
      font-size: 1rem;
      margin: 0;
    }

    .attn-mla-comp .mla-card {
      background: var(--mla-surface);
      border: 1px solid var(--mla-border);
      border-radius: 10px;
      padding: 1.2rem;
    }

    .attn-mla-comp svg {
      width: 100%;
      height: auto;
      display: block;
      overflow: visible;
    }

    .attn-mla-comp .mla-svg-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 12px;
      fill: var(--mla-text-muted);
    }

    .attn-mla-comp .mla-svg-label.bright { fill: var(--mla-text); font-weight: 700; }
    .attn-mla-comp .mla-svg-label.weight { fill: var(--mla-weight); font-weight: 700; }
    .attn-mla-comp .mla-svg-label.latent { fill: var(--mla-latent); font-weight: 700; }
    .attn-mla-comp .mla-svg-label.cache { fill: var(--mla-cache); font-weight: 700; letter-spacing: 0.12em; }
    .attn-mla-comp .mla-svg-label.faded { fill: var(--mla-text-muted); opacity: 0.7; font-style: italic; }
    .attn-mla-comp .mla-svg-label.tag {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 11px;
      letter-spacing: 0.06em;
      text-transform: uppercase;
      fill: var(--mla-text-muted);
    }

    .attn-mla-comp .mla-caption {
      margin-top: 0.85rem;
      padding-top: 0.65rem;
      border-top: 1px dashed var(--mla-border);
      font-size: 0.9rem;
      color: var(--mla-text-muted);
      line-height: 1.6;
      font-style: italic;
      text-align: center;
    }
  </style>

  <div class="mla-header">
    <h3>MLA's Core Idea</h3>
    <p>Cache one narrow latent per token. Reconstruct K and V at attention time, then throw them away.</p>
  </div>

  <div class="mla-card">
    <svg viewBox="0 0 1080 300" preserveAspectRatio="xMidYMid meet" id="mla-svg-d85f6de752f74995c33652ca2b3b58d0"></svg>
    <div class="mla-caption">
      The fundamental MLA move. Instead of caching n_h · d_h + n_h · d_h = 32,768 floats of K and V per token, we cache only the d_c = 512-wide latent and reconstruct K and V at attention time via two learned up-projections.
    </div>
  </div>

  <script>
    (function() {
      const root = document.getElementById('attn-mla-d85f6de752f74995c33652ca2b3b58d0');
      const svg = root.querySelector('#mla-svg-d85f6de752f74995c33652ca2b3b58d0');
      const NS = 'http://www.w3.org/2000/svg';

      function readVar(name) {
        return getComputedStyle(root).getPropertyValue(name).trim();
      }

      function el(name, attrs, parent, text) {
        const e = document.createElementNS(NS, name);
        Object.entries(attrs).forEach(([k, v]) => e.setAttribute(k, v));
        if (text !== undefined) e.textContent = text;
        if (parent) parent.appendChild(e);
        return e;
      }

      const defs = el('defs', {}, svg);
      const m = el('marker', {
        id: 'mla-arr-d85f6de752f74995c33652ca2b3b58d0', viewBox: '0 0 10 10',
        refX: 8, refY: 5, markerWidth: 6, markerHeight: 6, orient: 'auto-start-reverse'
      }, defs);
      el('path', { d: 'M 0 0 L 10 5 L 0 10 Z', fill: readVar('--mla-text') }, m);
      const arrUrl = 'url(#mla-arr-d85f6de752f74995c33652ca2b3b58d0)';

      
      

      
      const xH = 30, yMid = 150, wH = 110, hH = 38;
      el('text', { x: xH, y: yMid - hH / 2 - 10, class: 'mla-svg-label tag' }, svg, 'hidden state');
      el('rect', {
        x: xH, y: yMid - hH / 2, width: wH, height: hH,
        fill: readVar('--mla-residual'), opacity: 0.25,
        stroke: readVar('--mla-residual'), 'stroke-width': 1.4, rx: 4
      }, svg);
      el('text', {
        x: xH + wH / 2, y: yMid + 4, class: 'mla-svg-label bright', 'text-anchor': 'middle'
      }, svg, 'h_t');
      el('text', {
        x: xH + wH / 2, y: yMid + hH / 2 + 18, class: 'mla-svg-label', 'text-anchor': 'middle'
      }, svg, 'd = 5120');

      
      const arrX1 = xH + wH + 6;
      const arrX2 = 240;
      el('path', {
        d: `M ${arrX1} ${yMid} L ${arrX2} ${yMid}`,
        stroke: readVar('--mla-latent'), 'stroke-width': 1.6, fill: 'none',
        'marker-end': arrUrl
      }, svg);
      el('text', {
        x: (arrX1 + arrX2) / 2, y: yMid - 8,
        class: 'mla-svg-label weight', 'text-anchor': 'middle'
      }, svg, '× W^DKV');

      
      const xC = arrX2 + 6, wC = 50, hC = 60;
      const yCTop = yMid - hC / 2;
      el('rect', {
        x: xC, y: yCTop, width: wC, height: hC,
        fill: readVar('--mla-latent'), opacity: 0.25,
        stroke: readVar('--mla-latent'), 'stroke-width': 1.8, rx: 4
      }, svg);
      el('text', {
        x: xC + wC / 2, y: yMid + 4, class: 'mla-svg-label latent', 'text-anchor': 'middle', 'font-size': 13
      }, svg, 'c_t^KV');
      el('text', {
        x: xC + wC / 2, y: yMid + hC / 2 + 18, class: 'mla-svg-label latent', 'text-anchor': 'middle'
      }, svg, 'd_c = 512');

      
      const chipX = xC + wC / 2 - 32, chipY = yCTop - 28;
      el('rect', {
        x: chipX, y: chipY, width: 64, height: 18,
        fill: 'none', stroke: readVar('--mla-cache'), 'stroke-width': 1.4, rx: 9
      }, svg);
      el('text', {
        x: chipX + 32, y: chipY + 13,
        class: 'mla-svg-label cache', 'text-anchor': 'middle', 'font-size': 10.5
      }, svg, 'CACHED');

      
      const xWeights = xC + wC + 90;
      const yK = 60, yV = 240;
      el('path', {
        d: `M ${xC + wC + 6} ${yMid - 12} C ${xC + wC + 50} ${yMid - 50}, ${xWeights - 30} ${yK + 11}, ${xWeights} ${yK + 11}`,
        stroke: readVar('--mla-k'), 'stroke-width': 1.6, fill: 'none',
        'marker-end': arrUrl
      }, svg);
      el('text', {
        x: xC + wC + 30, y: yMid - 38, class: 'mla-svg-label weight', style: `fill: ${readVar('--mla-k')}`
      }, svg, '× W^UK');

      el('path', {
        d: `M ${xC + wC + 6} ${yMid + 12} C ${xC + wC + 50} ${yMid + 50}, ${xWeights - 30} ${yV + 11}, ${xWeights} ${yV + 11}`,
        stroke: readVar('--mla-v'), 'stroke-width': 1.6, fill: 'none',
        'marker-end': arrUrl
      }, svg);
      el('text', {
        x: xC + wC + 30, y: yMid + 46, class: 'mla-svg-label weight', style: `fill: ${readVar('--mla-v')}`
      }, svg, '× W^UV');

      
      const xBar = xWeights + 6, wBar = 480, hBar = 22;
      const N_DIV = 7;

      function drawBar(y, color, label) {
        el('text', { x: xBar, y: y - 6, class: 'mla-svg-label', style: `fill: ${color}`, 'font-weight': 700 }, svg, label);
        el('rect', {
          x: xBar, y, width: wBar, height: hBar,
          fill: color, opacity: 0.6, rx: 2
        }, svg);
        
        for (let i = 1; i <= N_DIV; i++) {
          const dx = xBar + (wBar * i) / (N_DIV + 1);
          el('line', {
            x1: dx, y1: y, x2: dx, y2: y + hBar,
            stroke: readVar('--mla-divider'), 'stroke-width': 0.7, 'stroke-dasharray': '2 2'
          }, svg);
        }
        
        el('text', {
          x: xBar + wBar / 2, y: y + hBar + 16,
          class: 'mla-svg-label', 'text-anchor': 'middle'
        }, svg, 'n_h · d_h  =  128 · 128  =  16,384');
      }

      drawBar(yK,     readVar('--mla-k'), 'k_t · recovered K, 128 heads');
      drawBar(yV,     readVar('--mla-v'), 'v_t · recovered V, 128 heads');

      
      const noteX = xBar + wBar + 16;
      el('text', {
        x: noteX, y: yMid - 6, class: 'mla-svg-label faded'
      }, svg, 'transient ·');
      el('text', {
        x: noteX, y: yMid + 10, class: 'mla-svg-label faded'
      }, svg, 'not cached');

      
      el('text', {
        x: 540, y: 295, class: 'mla-svg-label cache', 'text-anchor': 'middle', 'font-size': 11
      }, svg, 'CACHE PER TOKEN  ·  d_c = 512 floats  ·  vs MHA 32,768  ·  64× smaller');
    })();
  </script>
</div>

<p>If you stopped reading right here, you would be tempted to ask: doesn&rsquo;t reconstructing K and V at every attention step add a huge amount of compute? The honest answer is yes, if done naively. The whole punchline of MLA, which we get to in §8, is that during inference you do not have to materialize K and V at all. The key up-projection $W^{UK}$ can be absorbed into the query projection, and the value up-projection $W^{UV}$ into the output projection. The cache stays small and the FLOPs stay manageable.</p>
<p>A low-rank cache only saves memory. A low-rank cache that you never have to up-project saves memory <em>and</em> compute. That is the actual MLA result.</p>
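<p>A quick numerical preview of why absorption is possible on the key side, for a single head and ignoring RoPE and the $\sqrt{d_h}$ scaling. Since $q (c W^{UK})^\top = (q {W^{UK}}^\top) c^\top$, the up-projection can be applied to the query once instead of to every cached latent. The sizes and variable names below are toy assumptions, and the value side is left out.</p>

```python
import random

rng = random.Random(0)
d_c, d_h = 4, 8   # toy sizes for a single head

# One head's slice of the key up-projection, shape d_c x d_h.
W_UK = [[rng.gauss(0.0, 1.0) for _ in range(d_h)] for _ in range(d_c)]
q = [rng.gauss(0.0, 1.0) for _ in range(d_h)]   # one head's query
c = [rng.gauss(0.0, 1.0) for _ in range(d_c)]   # one cached latent

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Naive route: rebuild the key from the latent, then score it.
k = [dot(c, [W_UK[i][j] for i in range(d_c)]) for j in range(d_h)]
naive_score = dot(q, k)

# Absorbed route: push W_UK onto the query once, then score against the latent.
q_absorbed = [dot(q, W_UK[i]) for i in range(d_c)]   # q times W_UK transposed
absorbed_score = dot(q_absorbed, c)

# The two routes agree up to floating-point rounding.
print(abs(naive_score - absorbed_score))
```

<p>The naive route does one up-projection per cached position; the absorbed route does one per query, no matter how long the cache is.</p>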
<p>But before we get to the absorption, there is a wrinkle: RoPE. Rotary positional embeddings break the clean factorization above, and dealing with that gracefully is what gives MLA its slightly baroque final form. We will first walk through the no-RoPE version step by step, then patch it.</p>
<h2 id="part-6-matrix-walkthrough-step-by-step">Part 6: Matrix Walkthrough, Step by Step</h2>
<p>We will work through one token&rsquo;s forward pass. Position $t$, hidden state $h_t \in \mathbb{R}^{d}$. Numbers in parentheses are the DeepSeek-V2 values, so the shapes feel concrete.</p>


<div class="attn-mla-walk attn-breakout" id="attn-mla-w-d85f6de752f74995c33652ca2b3b58d0">
  <style>
    .attn-mla-walk {
      --mw-bg: #0d1117;
      --mw-surface: #161b22;
      --mw-border: #30363d;
      --mw-text: #e6edf3;
      --mw-text-muted: #8b949e;
      --mw-residual: #58a6ff;
      --mw-k: #f97583;
      --mw-v: #f0b429;
      --mw-latent: #39d353;
      --mw-rope: #b392f0;
      --mw-cache: #39d353;
      --mw-weight: #d29922;
      --mw-divider: rgba(255, 255, 255, 0.22);

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
      background: var(--mw-bg);
      color: var(--mw-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .attn-mla-walk,
    :root:not([data-theme="dark"]) .attn-mla-walk {
      --mw-bg: #f8fafc;
      --mw-surface: #ffffff;
      --mw-border: #e2e8f0;
      --mw-text: #1e293b;
      --mw-text-muted: #64748b;
      --mw-residual: #3b82f6;
      --mw-k: #ef4444;
      --mw-v: #d97706;
      --mw-latent: #10b981;
      --mw-rope: #8b5cf6;
      --mw-cache: #10b981;
      --mw-weight: #b8860b;
      --mw-divider: rgba(0, 0, 0, 0.22);
    }

    .attn-mla-walk * { box-sizing: border-box; }

    .attn-mla-walk .mw-header {
      text-align: center;
      margin-bottom: 1rem;
    }

    .attn-mla-walk .mw-header h3 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1rem;
      font-weight: 600;
      color: var(--mw-latent);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .attn-mla-walk .mw-header p {
      color: var(--mw-text-muted);
      font-size: 1rem;
      margin: 0;
    }

    .attn-mla-walk .mw-steps {
      display: flex;
      gap: 0.6rem;
      justify-content: center;
      margin-bottom: 1rem;
      flex-wrap: wrap;
    }

    .attn-mla-walk .mw-step-pill {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.78rem;
      letter-spacing: 0.14em;
      text-transform: uppercase;
      padding: 0.25rem 0.7rem;
      border-radius: 999px;
      background: var(--mw-surface);
      border: 1px solid var(--mw-border);
      color: var(--mw-text-muted);
    }

    .attn-mla-walk .mw-panel {
      min-width: 0;
      background: var(--mw-surface);
      border: 1px solid var(--mw-border);
      border-radius: 10px;
      padding: 1.1rem 1.2rem 1rem;
      margin-bottom: 0.9rem;
    }

    .attn-mla-walk .mw-panel:last-child { margin-bottom: 0; }

    .attn-mla-walk .mw-panel-head {
      display: flex;
      align-items: baseline;
      gap: 0.6rem;
      margin-bottom: 0.6rem;
      flex-wrap: wrap;
    }

    .attn-mla-walk .mw-step-badge {
      display: inline-block;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.78rem;
      letter-spacing: 0.18em;
      text-transform: uppercase;
      background: var(--mw-text);
      color: var(--mw-bg);
      padding: 0.2rem 0.6rem;
      border-radius: 3px;
    }

    .attn-mla-walk .mw-panel-title {
      font-size: 1.15rem;
      font-weight: 600;
      color: var(--mw-text);
      margin: 0;
    }

    .attn-mla-walk svg {
      width: 100%;
      height: auto;
      display: block;
      overflow: visible;
    }

    .attn-mla-walk .mw-svg-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 12px;
      fill: var(--mw-text-muted);
    }

    .attn-mla-walk .mw-svg-label.bright { fill: var(--mw-text); font-weight: 600; }
    .attn-mla-walk .mw-svg-label.cache { fill: var(--mw-cache); font-weight: 700; letter-spacing: 0.12em; }
    .attn-mla-walk .mw-svg-label.weight { fill: var(--mw-weight); font-weight: 700; }
    .attn-mla-walk .mw-svg-label.faded { fill: var(--mw-text-muted); opacity: 0.65; font-style: italic; }
    .attn-mla-walk .mw-svg-label.tag {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 11px;
      letter-spacing: 0.06em;
      text-transform: uppercase;
      fill: var(--mw-text-muted);
    }
    .attn-mla-walk .mw-svg-label.title {
      font-family: 'IBM Plex Sans', sans-serif;
      font-size: 14px;
      font-weight: 700;
      fill: var(--mw-text);
    }

    .attn-mla-walk figcaption {
      margin-top: 0.7rem;
      padding-top: 0.5rem;
      border-top: 1px dashed var(--mw-border);
      font-size: 0.9rem;
      color: var(--mw-text-muted);
      line-height: 1.6;
      font-style: italic;
    }

    .attn-mla-walk figcaption .lead {
      font-weight: 600;
      font-style: normal;
      color: var(--mw-text);
      margin-right: 0.4rem;
    }
  </style>

  <div class="mw-header">
    <h3>MLA, Step by Step</h3>
    <p>One token's path from hidden state through the full MLA construction.</p>
  </div>

  <div class="mw-steps">
    <span class="mw-step-pill">1 · down-project KV</span>
    <span class="mw-step-pill">2 · up-project K, V</span>
    <span class="mw-step-pill">3 · decouple RoPE</span>
    <span class="mw-step-pill">4 · absorb at inference</span>
    <span class="mw-step-pill">5 · the cache, finally</span>
  </div>

  <figure class="mw-panel">
    <div class="mw-panel-head">
      <span class="mw-step-badge">Step 1</span>
      <h4 class="mw-panel-title">KV down-projection</h4>
    </div>
    <svg viewBox="0 0 1080 160" preserveAspectRatio="xMidYMid meet" id="mw-svg-1-d85f6de752f74995c33652ca2b3b58d0"></svg>
    <figcaption><span class="lead">Step 1.</span> A single linear layer projects the residual stream down into the latent space. From this point onward, the green latent is cached in place of full keys and values (Step 3 adds a small RoPE key beside it).</figcaption>
  </figure>

  <figure class="mw-panel">
    <div class="mw-panel-head">
      <span class="mw-step-badge">Step 2</span>
      <h4 class="mw-panel-title">K and V up-projection</h4>
    </div>
    <svg viewBox="0 0 1080 240" preserveAspectRatio="xMidYMid meet" id="mw-svg-2-d85f6de752f74995c33652ca2b3b58d0"></svg>
    <figcaption><span class="lead">Step 2.</span> The same latent fans out into full per-head K and V through two parameter matrices. The wide tensors materialize for one matmul and are never written back to the cache.</figcaption>
  </figure>

  <figure class="mw-panel">
    <div class="mw-panel-head">
      <span class="mw-step-badge">Step 3</span>
      <h4 class="mw-panel-title">Decoupled RoPE construction</h4>
    </div>
    <svg viewBox="0 0 1080 320" preserveAspectRatio="xMidYMid meet" id="mw-svg-3-d85f6de752f74995c33652ca2b3b58d0"></svg>
    <figcaption><span class="lead">Step 3.</span> The content key rides the latent up-projection. The RoPE key is computed on a separate narrow path, rotated, and broadcast to every head. The final per-head key is the concatenation. The cache holds both the latent and the RoPE key.</figcaption>
  </figure>

  <figure class="mw-panel">
    <div class="mw-panel-head">
      <span class="mw-step-badge">Step 4</span>
      <h4 class="mw-panel-title">The absorption trick</h4>
    </div>
    <svg viewBox="0 0 1080 280" preserveAspectRatio="xMidYMid meet" id="mw-svg-4-d85f6de752f74995c33652ca2b3b58d0"></svg>
    <figcaption><span class="lead">Step 4.</span> Top, the naive read: reconstruct k_s^C from the latent every step. Bottom, after fusing W^UK into the query path: attention reduces to a bilinear form on two latent-sized vectors. Memory and FLOPs both scale with d_c, not n_h · d_h.</figcaption>
  </figure>

  <figure class="mw-panel">
    <div class="mw-panel-head">
      <span class="mw-step-badge">Result</span>
      <h4 class="mw-panel-title">What lives in the cache</h4>
    </div>
    <svg viewBox="0 0 1080 180" preserveAspectRatio="xMidYMid meet" id="mw-svg-5-d85f6de752f74995c33652ca2b3b58d0"></svg>
    <figcaption><span class="lead">Result.</span> 576 floats per token per layer where MHA would have written 32,768. The softmax attention semantics are unchanged; only the storage shape has changed.</figcaption>
  </figure>

  <script>
    (function() {
      const root = document.getElementById('attn-mla-w-d85f6de752f74995c33652ca2b3b58d0');
      const NS = 'http://www.w3.org/2000/svg';

      function readVar(name) {
        return getComputedStyle(root).getPropertyValue(name).trim();
      }

      function el(name, attrs, parent, text) {
        const e = document.createElementNS(NS, name);
        Object.entries(attrs).forEach(([k, v]) => e.setAttribute(k, v));
        if (text !== undefined) e.textContent = text;
        if (parent) parent.appendChild(e);
        return e;
      }

      function ensureArrowDefs(svg, color, id) {
        let defs = svg.querySelector('defs');
        if (!defs) defs = el('defs', {}, svg);
        if (svg.querySelector('#' + id)) return;
        const m = el('marker', {
          id, viewBox: '0 0 10 10', refX: 8, refY: 5,
          markerWidth: 6, markerHeight: 6, orient: 'auto-start-reverse'
        }, defs);
        el('path', { d: 'M 0 0 L 10 5 L 0 10 Z', fill: color }, m);
      }

      function drawArrow(svg, x1, y1, x2, y2, color, label, labelY) {
        const id = 'arr-' + Math.random().toString(36).slice(2, 8);
        ensureArrowDefs(svg, color, id);
        el('line', {
          x1, y1, x2, y2,
          stroke: color, 'stroke-width': 1.4, 'marker-end': 'url(#' + id + ')'
        }, svg);
        if (label) {
          el('text', {
            x: (x1 + x2) / 2, y: labelY !== undefined ? labelY : ((y1 + y2) / 2 - 8),
            class: 'mw-svg-label weight', 'text-anchor': 'middle'
          }, svg, label);
        }
      }

      function drawCurvedArrow(svg, x1, y1, cx1, cy1, cx2, cy2, x2, y2, color) {
        const id = 'arr-' + Math.random().toString(36).slice(2, 8);
        ensureArrowDefs(svg, color, id);
        el('path', {
          d: `M ${x1} ${y1} C ${cx1} ${cy1}, ${cx2} ${cy2}, ${x2} ${y2}`,
          stroke: color, 'stroke-width': 1.4, fill: 'none',
          'marker-end': 'url(#' + id + ')'
        }, svg);
      }

      function drawWeightBox(svg, x, y, w, h, label, shape) {
        el('rect', {
          x, y, width: w, height: h,
          fill: readVar('--mw-weight'), opacity: 0.18,
          stroke: readVar('--mw-weight'), 'stroke-width': 1.4, rx: 3
        }, svg);
        el('text', {
          x: x + w / 2, y: y + h / 2 - 4,
          class: 'mw-svg-label weight', 'text-anchor': 'middle', 'font-size': 12
        }, svg, label);
        if (shape) {
          el('text', {
            x: x + w / 2, y: y + h / 2 + 12,
            class: 'mw-svg-label', 'text-anchor': 'middle', 'font-size': 10
          }, svg, shape);
        }
      }

      function drawBox(svg, x, y, w, h, color, opacityFill, label, dimLabel) {
        el('rect', {
          x, y, width: w, height: h,
          fill: color, opacity: opacityFill,
          stroke: color, 'stroke-width': 1.4, rx: 3
        }, svg);
        if (label) {
          el('text', {
            x: x + w / 2, y: y + h / 2 + 4,
            class: 'mw-svg-label bright', 'text-anchor': 'middle', fill: color, 'font-size': 13
          }, svg, label);
        }
        if (dimLabel) {
          el('text', {
            x: x + w / 2, y: y + h + 16,
            class: 'mw-svg-label', 'text-anchor': 'middle'
          }, svg, dimLabel);
        }
      }

      function drawHeadBar(svg, x, y, w, h, color, opacityFill, dividerCount) {
        el('rect', {
          x, y, width: w, height: h,
          fill: color, opacity: opacityFill, rx: 2
        }, svg);
        if (dividerCount > 1 && w >= 16) {
          for (let i = 1; i < dividerCount; i++) {
            const dx = x + (w * i) / dividerCount;
            el('line', {
              x1: dx, y1: y, x2: dx, y2: y + h,
              stroke: readVar('--mw-divider'), 'stroke-width': 0.7, 'stroke-dasharray': '2 2'
            }, svg);
          }
        }
      }

      function drawCachedChip(svg, x, y, w, label) {
        const color = readVar('--mw-cache');
        el('rect', {
          x, y, width: w, height: 18,
          fill: 'none', stroke: color, 'stroke-width': 1.4, rx: 9
        }, svg);
        el('text', {
          x: x + w / 2, y: y + 13,
          class: 'mw-svg-label cache', 'text-anchor': 'middle', 'font-size': 10.5
        }, svg, label || 'CACHED');
      }

      
      
      
      function drawPanel1() {
        const svg = root.querySelector('#mw-svg-1-d85f6de752f74995c33652ca2b3b58d0');
        svg.innerHTML = '';
        const cInk = readVar('--mw-text');
        const cRes = readVar('--mw-residual');
        const cLat = readVar('--mw-latent');

        const yMid = 80;
        const hH = 38;

        
        const xH = 50, wH = 360;
        el('text', { x: xH, y: yMid - hH / 2 - 8, class: 'mw-svg-label tag' }, svg, 'h_t  ·  hidden state');
        drawBox(svg, xH, yMid - hH / 2, wH, hH, cRes, 0.25, 'h_t', 'd = 5120');

        
        const xArr1S = xH + wH + 6, xArr1E = xArr1S + 50;
        drawArrow(svg, xArr1S, yMid, xArr1E, yMid, cInk, '× W^DKV');

        
        const xW = xArr1E + 6, wW = 80, hW = 50;
        drawWeightBox(svg, xW, yMid - hW / 2, wW, hW, 'W^DKV', '5120 × 512');

        
        const xEq = xW + wW + 8;
        el('text', {
          x: xEq, y: yMid + 6, class: 'mw-svg-label bright', 'font-size': 18
        }, svg, '=');

        
        const xC = xEq + 20, wC = 50, hC = 38;
        el('text', { x: xC, y: yMid - hC / 2 - 8, class: 'mw-svg-label tag', fill: cLat }, svg, 'c_t^KV  ·  latent');
        drawBox(svg, xC, yMid - hC / 2, wC, hC, cLat, 0.25, 'c_t^KV', 'd_c = 512');

        
        drawCachedChip(svg, xC + wC / 2 - 32, yMid - hC / 2 - 32, 64);

        
        el('text', {
          x: xC + wC + 18, y: yMid - 4, class: 'mw-svg-label cache', 'font-size': 12
        }, svg, 'STAYS IN');
        el('text', {
          x: xC + wC + 18, y: yMid + 12, class: 'mw-svg-label cache', 'font-size': 12
        }, svg, 'KV CACHE');
      }

      
      
      
      function drawPanel2() {
        const svg = root.querySelector('#mw-svg-2-d85f6de752f74995c33652ca2b3b58d0');
        svg.innerHTML = '';
        const cInk = readVar('--mw-text');
        const cLat = readVar('--mw-latent');
        const cK = readVar('--mw-k');
        const cV = readVar('--mw-v');

        const hBar = 26;
        const yK = 60, yV = 170;
        const xC = 40, wC = 50, hC = 38;
        const yC = (yK + yV + hBar) / 2 - hC / 2;

        
        el('text', { x: xC, y: yC - 8, class: 'mw-svg-label tag', fill: cLat }, svg, 'c_t^KV');
        drawBox(svg, xC, yC, wC, hC, cLat, 0.25, 'c_t^KV', '512');

        
        const xW = 200, wW = 80, hW = 44;
        drawCurvedArrow(svg, xC + wC + 4, yC + 6,
                         xC + wC + 60, yC - 12,
                         xW - 20, yK + hW / 2,
                         xW - 4, yK + hW / 2, cInk);
        drawCurvedArrow(svg, xC + wC + 4, yC + hC - 6,
                         xC + wC + 60, yC + hC + 18,
                         xW - 20, yV + hW / 2,
                         xW - 4, yV + hW / 2, cInk);

        
        drawWeightBox(svg, xW, yK, wW, hW, 'W^UK', '512 × 16384');
        drawWeightBox(svg, xW, yV, wW, hW, 'W^UV', '512 × 16384');

        
        el('text', { x: xW + wW + 12, y: yK + hW / 2 + 6, class: 'mw-svg-label bright', 'font-size': 16 }, svg, '=');
        el('text', { x: xW + wW + 12, y: yV + hW / 2 + 6, class: 'mw-svg-label bright', 'font-size': 16 }, svg, '=');

        
        const xBar = xW + wW + 36, wBar = 760;
        el('text', { x: xBar, y: yK - 6, class: 'mw-svg-label', fill: cK, 'font-weight': 700 }, svg, 'k_t^C  ·  128 heads, content keys');
        drawHeadBar(svg, xBar, yK, wBar, hBar, cK, 0.7, 8);
        el('text', {
          x: xBar + wBar / 2, y: yK + hBar + 16,
          class: 'mw-svg-label', 'text-anchor': 'middle'
        }, svg, 'n_h × d_h  =  128 × 128  =  16,384');

        el('text', { x: xBar, y: yV - 6, class: 'mw-svg-label', fill: cV, 'font-weight': 700 }, svg, 'v_t  ·  128 heads, values');
        drawHeadBar(svg, xBar, yV, wBar, hBar, cV, 0.7, 8);
        el('text', {
          x: xBar + wBar / 2, y: yV + hBar + 16,
          class: 'mw-svg-label', 'text-anchor': 'middle'
        }, svg, 'n_h × d_h  =  128 × 128  =  16,384');

        
        el('text', {
          x: xBar + wBar / 2, y: 225, class: 'mw-svg-label faded', 'text-anchor': 'middle', 'font-size': 11
        }, svg, 'materialized at attention time only · never written to the cache');
      }

      
      
      
      function drawPanel3() {
        const svg = root.querySelector('#mw-svg-3-d85f6de752f74995c33652ca2b3b58d0');
        svg.innerHTML = '';
        const cInk = readVar('--mw-text');
        const cRes = readVar('--mw-residual');
        const cLat = readVar('--mw-latent');
        const cRope = readVar('--mw-rope');
        const cK = readVar('--mw-k');

        const yMid = 160;
        const yTop = 60;
        const yBot = 230;

        
        const xH = 30, wH = 100, hH = 36;
        el('text', { x: xH, y: yMid - hH / 2 - 8, class: 'mw-svg-label tag' }, svg, 'h_t');
        drawBox(svg, xH, yMid - hH / 2, wH, hH, cRes, 0.25, 'h_t', 'd = 5120');

        
        drawCurvedArrow(svg, xH + wH + 4, yMid - 8,
                         xH + wH + 50, yMid - 30,
                         180, yTop + 14, 200, yTop + 14, cLat);
        el('text', { x: 150, y: yTop + 10, class: 'mw-svg-label weight', fill: cLat }, svg, '× W^DKV');

        
        const xCT = 205, wCT = 50, hCT = 30;
        drawBox(svg, xCT, yTop, wCT, hCT, cLat, 0.25, 'c_t^KV', '512');
        drawCachedChip(svg, xCT + wCT / 2 - 32, yTop - 26, 64);

        
        drawArrow(svg, xCT + wCT + 4, yTop + 15, 295, yTop + 15, cK, '× W^UK', yTop + 8);
        const xKC = 300, wKC = 540, hKC = 30;
        el('text', { x: xKC, y: yTop - 6, class: 'mw-svg-label', fill: cK, 'font-weight': 700 }, svg, 'k_t^C  ·  content keys');
        drawHeadBar(svg, xKC, yTop, wKC, hKC, cK, 0.7, 8);
        el('text', {
          x: xKC + wKC / 2, y: yTop + hKC + 16,
          class: 'mw-svg-label', 'text-anchor': 'middle'
        }, svg, '128 heads × 128 = 16,384');

        
        drawCurvedArrow(svg, xH + wH + 4, yMid + 8,
                         xH + wH + 50, yMid + 30,
                         180, yBot + 12, 200, yBot + 12, cRope);
        el('text', { x: 150, y: yBot + 8, class: 'mw-svg-label weight', fill: cRope }, svg, '× W^KR');

        
        const xKR = 205, wKR = 40, hKR = 24;
        drawBox(svg, xKR, yBot, wKR, hKR, cRope, 0.25, 'k_t^R', '64');
        drawCachedChip(svg, xKR + wKR / 2 - 32, yBot - 24, 64);

        
        const xRoPE = xKR + wKR + 12;
        el('circle', {
          cx: xRoPE + 8, cy: yBot + 12, r: 8,
          fill: 'none', stroke: cRope, 'stroke-width': 1.2
        }, svg);
        el('path', {
          d: `M ${xRoPE + 16} ${yBot + 12} A 8 8 0 1 1 ${xRoPE + 5} ${yBot + 4}`,
          stroke: cRope, 'stroke-width': 1.2, fill: 'none'
        }, svg);
        el('text', {
          x: xRoPE + 26, y: yBot + 16, class: 'mw-svg-label', fill: cRope, 'font-weight': 700
        }, svg, 'RoPE');
        el('text', {
          x: xRoPE + 26, y: yBot + 32, class: 'mw-svg-label faded', 'font-size': 11
        }, svg, 'shared across all heads');

        
        const xConcat = 870, wConcat = 170, hConcat = 38;
        const yConcat = yMid - hConcat / 2;
        el('text', { x: xConcat, y: yConcat - 8, class: 'mw-svg-label tag' }, svg, 'k_{t,i}  ·  final key, per head');
        
        const wContent = 130, wRope = 36;
        el('rect', {
          x: xConcat, y: yConcat, width: wContent, height: hConcat,
          fill: cK, opacity: 0.7, rx: 2
        }, svg);
        
        el('rect', {
          x: xConcat + wContent, y: yConcat, width: wRope, height: hConcat,
          fill: cRope, opacity: 0.7, rx: 2
        }, svg);
        el('text', {
          x: xConcat + wContent / 2, y: yConcat + hConcat / 2 + 4,
          class: 'mw-svg-label', 'text-anchor': 'middle', 'font-size': 11, fill: '#fff'
        }, svg, 'd_h = 128');
        el('text', {
          x: xConcat + wContent + wRope / 2, y: yConcat + hConcat / 2 + 4,
          class: 'mw-svg-label', 'text-anchor': 'middle', 'font-size': 11, fill: '#fff'
        }, svg, '64');
        el('text', {
          x: xConcat + (wContent + wRope) / 2, y: yConcat + hConcat + 16,
          class: 'mw-svg-label', 'text-anchor': 'middle'
        }, svg, 'per-head width = 192');

        
        el('path', {
          d: `M ${xKC + wKC} ${yTop + hKC / 2} C ${xConcat - 40} ${yTop + 20}, ${xConcat - 30} ${yConcat + 8}, ${xConcat - 4} ${yConcat + 10}`,
          stroke: cK, 'stroke-width': 1, fill: 'none', opacity: 0.5, 'stroke-dasharray': '3 3'
        }, svg);
        el('path', {
          d: `M ${xKR + wKR + 30} ${yBot + 12} C ${xConcat - 40} ${yBot}, ${xConcat - 30} ${yConcat + 28}, ${xConcat - 4} ${yConcat + 28}`,
          stroke: cRope, 'stroke-width': 1, fill: 'none', opacity: 0.5, 'stroke-dasharray': '3 3'
        }, svg);
      }

      
      
      
      function drawPanel4() {
        const svg = root.querySelector('#mw-svg-4-d85f6de752f74995c33652ca2b3b58d0');
        svg.innerHTML = '';
        const cInk = readVar('--mw-text');
        const cLat = readVar('--mw-latent');
        const cK = readVar('--mw-k');

        
        el('text', { x: 30, y: 24, class: 'mw-svg-label title' }, svg, 'Training-time view');
        el('text', { x: 30, y: 42, class: 'mw-svg-label tag' }, svg, 'Each step materializes K (and V)');

        const yTopRow = 75, hBox = 28;

        
        const xC1 = 30, wC1 = 70;
        drawBox(svg, xC1, yTopRow, wC1, hBox, cLat, 0.25, 'c_s^KV', null);
        
        drawArrow(svg, xC1 + wC1 + 4, yTopRow + hBox / 2, xC1 + wC1 + 64, yTopRow + hBox / 2, cK, '× W^UK', yTopRow + hBox / 2 - 8);
        
        const xK = xC1 + wC1 + 70, wK = 320;
        drawBox(svg, xK, yTopRow, wK, hBox, cK, 0.55, 'k_s^C', null);
        el('text', {
          x: xK + wK / 2, y: yTopRow + hBox + 16, class: 'mw-svg-label faded', 'text-anchor': 'middle', 'font-size': 11
        }, svg, 'materialized');
        
        drawArrow(svg, xK + wK + 4, yTopRow + hBox / 2, xK + wK + 64, yTopRow + hBox / 2, cInk, 'dot', yTopRow + hBox / 2 - 8);
        
        const xQ = xK + wK + 70, wQ = 320;
        drawBox(svg, xQ, yTopRow, wQ, hBox, cK, 0.55, 'q_t^C', null);
        el('text', {
          x: xQ + wQ / 2, y: yTopRow + hBox + 16, class: 'mw-svg-label faded', 'text-anchor': 'middle', 'font-size': 11
        }, svg, 'materialized');

        
        el('text', { x: 30, y: 135, class: 'mw-svg-label tag' }, svg, 'Storage per token: c^KV cached. Plus transient large K, V on every step.');

        
        el('line', {
          x1: 30, y1: 152, x2: 1050, y2: 152,
          stroke: readVar('--mw-border'), 'stroke-width': 1, 'stroke-dasharray': '4 3'
        }, svg);

        
        el('text', { x: 30, y: 174, class: 'mw-svg-label title' }, svg, 'Inference-time view, after absorption');
        el('text', { x: 30, y: 192, class: 'mw-svg-label tag' }, svg, 'W̃^Q = W^UQ⊤ · W^UK is precomputed. K is never materialized.');

        const yBotRow = 225;
        
        const xC2 = 30;
        drawBox(svg, xC2, yBotRow, wC1, hBox, cLat, 0.25, 'c_s^KV', null);
        
        const xMidArrow = xC2 + wC1 + 200;
        drawArrow(svg, xC2 + wC1 + 4, yBotRow + hBox / 2, xMidArrow, yBotRow + hBox / 2, cLat, '· W̃^Q · (precomputed)', yBotRow + hBox / 2 - 8);
        
        const xCQ = xMidArrow + 6;
        drawBox(svg, xCQ, yBotRow, wC1, hBox, cLat, 0.25, 'c_t^Q', null);
        
        el('text', {
          x: xCQ + wC1 + 20, y: yBotRow + hBox / 2 + 6,
          class: 'mw-svg-label bright', 'font-size': 14
        }, svg, '→  scalar score');

        
        el('text', {
          x: 540, y: 270, class: 'mw-svg-label cache', 'text-anchor': 'middle', 'font-size': 11
        }, svg, 'MEMORY AND FLOPs BOTH SCALE WITH d_c = 512');
      }

      
      
      
      function drawPanel5() {
        const svg = root.querySelector('#mw-svg-5-d85f6de752f74995c33652ca2b3b58d0');
        svg.innerHTML = '';
        const cLat = readVar('--mw-latent');
        const cRope = readVar('--mw-rope');
        const cMHA = readVar('--mw-k');

        const yMid = 80;
        const hH = 38;

        
        const xMHA = (1080 - 800) / 2;
        el('rect', {
          x: xMHA, y: yMid - hH / 2,
          width: 800, height: hH,
          fill: 'none', stroke: cMHA, 'stroke-width': 1, 'stroke-dasharray': '3 3', opacity: 0.5
        }, svg);
        el('text', {
          x: xMHA, y: yMid - hH / 2 - 8,
          class: 'mw-svg-label faded'
        }, svg, 'MHA cache  ·  32,768 floats per token per layer');

        
        const xC = xMHA, wC = 50;
        const xR = xC + wC + 6, wR = 12;

        drawBox(svg, xC, yMid - hH / 2, wC, hH, cLat, 0.85, 'c_t^KV', null);
        el('text', {
          x: xC + wC / 2, y: yMid + hH / 2 + 16,
          class: 'mw-svg-label', 'text-anchor': 'middle', fill: cLat, 'font-weight': 700
        }, svg, 'd_c = 512');

        el('rect', {
          x: xR, y: yMid - hH / 2, width: wR, height: hH,
          fill: cRope, opacity: 0.85, rx: 2
        }, svg);
        el('text', {
          x: xR + wR / 2, y: yMid + hH / 2 + 16,
          class: 'mw-svg-label', 'text-anchor': 'middle', fill: cRope, 'font-weight': 700
        }, svg, '64');

        
        const bL = xC - 4, bR = xR + wR + 4, bT = yMid - hH / 2 - 22;
        el('path', {
          d: `M ${bL} ${bT + 8} L ${bL} ${bT} L ${bR} ${bT} L ${bR} ${bT + 8}`,
          stroke: readVar('--mw-cache'), 'stroke-width': 1.6, fill: 'none', 'stroke-dasharray': '4 3'
        }, svg);
        const chipX = (bL + bR) / 2 - 32, chipY = bT - 24;
        drawCachedChip(svg, chipX, chipY, 64);

        
        el('text', {
          x: 540, y: 150,
          class: 'mw-svg-label bright', 'text-anchor': 'middle', 'font-size': 16
        }, svg, '576 floats per token per layer  ·  1,152 B in BF16  ·  57× smaller than MHA');
      }

      drawPanel1();
      drawPanel2();
      drawPanel3();
      drawPanel4();
      drawPanel5();
    })();
  </script>
</div>

<h3 id="step-1-kv-down-projection">Step 1: KV down-projection</h3>
<p>A single linear layer projects the residual stream down into the latent space:</p>
$$\mathbf{c}_t^{KV} \;=\; h_t \, W^{DKV}, \qquad W^{DKV} \in \mathbb{R}^{d \times d_c} \;=\; \mathbb{R}^{5120 \times 512}$$<p>The resulting $\mathbf{c}_t^{KV} \in \mathbb{R}^{512}$ is the <em>only</em> content-related thing we cache for this token (§7 adds one small RoPE key beside it). It is shared across all heads and contains the information from which both K and V will eventually be reconstructed.</p>
<h3 id="step-2-k-and-v-up-projection">Step 2: K and V up-projection</h3>
<p>The latent fans out into the full multi-head K and V via two more linear layers:</p>
$$k_t^C \;=\; \mathbf{c}_t^{KV} \, W^{UK}, \qquad v_t^C \;=\; \mathbf{c}_t^{KV} \, W^{UV}$$$$W^{UK}, W^{UV} \in \mathbb{R}^{d_c \times n_h d_h} \;=\; \mathbb{R}^{512 \times 16384}$$<p>The superscript $C$ marks the <em>content</em> portion of K (we add a separate RoPE portion in §7). After this step, $k_t^C$ and $v_t^C$ live in $\mathbb{R}^{n_h d_h}$ and split into $n_h$ heads of width $d_h$. Since $d_c \ll n_h d_h$ (512 versus 16,384), these are <strong>low-rank reconstructions</strong>: the keys of all heads, taken together, are constrained to lie in a $d_c$-dimensional subspace of the full $n_h d_h$-dimensional space, and likewise for the values. This rank constraint is the price of compression, and empirically it is the right trade.</p>
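<p>A toy-sized sketch of Steps 1 and 2 may help fix the shapes. It uses plain Python, random weights, and made-up sizes ($d = 8$, $d_c = 2$, $n_h = 2$, $d_h = 3$) rather than the real 5120, 512, 128, 128; the helper names <code>rand_matrix</code> and <code>vecmat</code> are illustrative, not from any library.</p>

```python
import random

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random stand-in for a trained weight matrix, stored as a list of rows."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def vecmat(x, W):
    """Row vector times matrix, i.e. x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Toy sizes, assumed for illustration (the post uses 5120, 512, 128, 128).
d, d_c, n_h, d_h = 8, 2, 2, 3

W_dkv = rand_matrix(d, d_c)          # Step 1 down-projection
W_uk = rand_matrix(d_c, n_h * d_h)   # Step 2 key up-projection
W_uv = rand_matrix(d_c, n_h * d_h)   # Step 2 value up-projection

h_t = [rng.uniform(-1.0, 1.0) for _ in range(d)]

c_kv = vecmat(h_t, W_dkv)   # the latent: the part that would be cached
k_c = vecmat(c_kv, W_uk)    # content keys of all heads, concatenated
v_c = vecmat(c_kv, W_uv)    # values of all heads, concatenated

# Split the wide key vector into n_h per-head slices of width d_h.
k_heads = [k_c[i * d_h:(i + 1) * d_h] for i in range(n_h)]

print(len(c_kv), len(k_c), len(k_heads), len(k_heads[0]))
```

<p>Every entry of <code>k_c</code> is a combination of just the $d_c$ rows of <code>W_uk</code>, which is the low-rank constraint in miniature.</p>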
<h3 id="step-3-the-query-path">Step 3: The query path</h3>
<p>Symmetrically, and primarily to save training-time activations rather than KV cache, the query is also routed through a bottleneck:</p>
$$\mathbf{c}_t^{Q} = h_t \, W^{DQ}, \qquad q_t^C = \mathbf{c}_t^{Q} \, W^{UQ}$$$$W^{DQ} \in \mathbb{R}^{5120 \times 1536}, \qquad W^{UQ} \in \mathbb{R}^{1536 \times 16384}$$<p>The query bottleneck $d_c' = 1536$ is wider than the KV bottleneck because queries are not cached. There is no inference benefit from making them narrower. The reason to compress them at all is parameter and activation memory during training.</p>
<p>$d_c = 512$ is sized for cache miniaturization. $d_c' = 1536$ is sized for representational room. They are independent design knobs and DeepSeek-V2 chose them to be quite different.</p>
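<p>To put rough numbers on the two knobs, the sketch below counts weights in the bottlenecked projections of Steps 1 to 3. The comparison baseline, one dense $d \times n_h d_h$ matrix per projection, is a hypothetical uncompressed layout assumed here for contrast, and the RoPE projections of §7 are left out.</p>

```python
d, n_h, d_h = 5120, 128, 128     # model width, heads, head width (from the post)
d_c, d_c_q = 512, 1536           # KV and query bottleneck widths (from the post)

# Hypothetical uncompressed baseline: one dense d x (n_h * d_h) matrix each.
dense_q = d * n_h * d_h
dense_kv = 2 * d * n_h * d_h

# Bottlenecked projections from Steps 1 to 3 (RoPE projections left out).
mla_q = d * d_c_q + d_c_q * n_h * d_h      # W^DQ plus W^UQ
mla_kv = d * d_c + 2 * d_c * n_h * d_h     # W^DKV plus W^UK plus W^UV

print(f"query weights: {dense_q:,} dense, {mla_q:,} bottlenecked")
print(f"kv weights:    {dense_kv:,} dense, {mla_kv:,} bottlenecked")
```

<p>Under these assumptions the query path drops from 83,886,080 to 33,030,144 weights, and the KV path from 167,772,160 to 19,398,656.</p>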
<h2 id="part-7-the-rope-complication-and-the-decoupled-fix">Part 7: The RoPE Complication and the Decoupled Fix</h2>
<p>So far the story is clean: cache a latent, up-project on demand, profit. But nearly every modern LLM, DeepSeek-V2 included, uses rotary position embeddings, and RoPE is exactly the kind of thing that ruins clean factorizations.</p>
<h3 id="why-naive-rope-breaks-mla">Why naive RoPE breaks MLA</h3>
<p>RoPE applies a position-dependent rotation matrix $\mathcal{R}_t$ to the query and key <em>after</em> they are projected. The attention dot product becomes:</p>
$$\langle \mathcal{R}_t q_t,\, \mathcal{R}_s k_s \rangle \;=\; q_t^\top \mathcal{R}_t^\top \mathcal{R}_s \, k_s \;=\; q_t^\top \mathcal{R}_{s-t}\, k_s$$<p>That last simplification, the rotation depending only on the <em>relative</em> position $s-t$, is the whole reason RoPE works. But now imagine we tried to apply RoPE to our reconstructed key $k_s = \mathbf{c}_s^{KV} W^{UK}$. The cache holds only the unrotated latent $\mathbf{c}_s^{KV}$, so the rotation $\mathcal{R}_s$ would have to be applied <em>after</em> up-projection. And $\mathcal{R}_s$ depends on $s$, the actual position. <strong>Different tokens use different rotations.</strong> So we would either have to reconstruct and rotate every cached key at every step, or store the rotated reconstruction per token, defeating the cache.</p>
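<p>The relative-position identity is easy to check numerically. The sketch below is a deliberate simplification: a single 2-D pair with one assumed frequency <code>theta</code>, where real RoPE rotates many pairs at different frequencies. The function names are illustrative.</p>

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def rope_score(q, k, t, s, theta=0.1):
    """Dot product of q rotated to position t and k rotated to position s."""
    qr = rotate(q, t * theta)
    kr = rotate(k, s * theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 2.0), (0.5, -1.5)      # arbitrary toy vectors
a = rope_score(q, k, t=3, s=7)      # offset s - t = 4
b = rope_score(q, k, t=100, s=104)  # same offset, much later in the sequence
c = rope_score(q, k, t=3, s=8)      # different offset

# Same offset gives the same score; a different offset does not.
print(math.isclose(a, b), math.isclose(a, c))
```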
<p>The deeper algebraic problem: in §8 we will want to absorb $W^{UK}$ into $W^{UQ}$. But if RoPE applies between them, in $\mathbf{c}_t^{Q\top} W^{UQ\top} \mathcal{R}_t^\top \mathcal{R}_s W^{UK} \mathbf{c}_s^{KV}$, the position-dependent $\mathcal{R}_t^\top \mathcal{R}_s$ sits between the two weight matrices and blocks any precomputed absorption.</p>
<p>RoPE and low-rank absorption are fundamentally incompatible when applied to the same vector. MLA&rsquo;s fix is to give RoPE its own, separate vector.</p>
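<p>A small numerical illustration of the incompatibility, under toy assumptions: a 3-wide latent, a 2-wide head (one rotation pair), a single assumed frequency <code>theta</code>, and hand-picked weights. Written with row vectors, the matrix that would have to be precomputed is $W^{UQ} \mathcal{R}_{s-t} W^{UK\top}$, and it changes with the offset.</p>

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def rotation(angle):
    """Standard 2-D counter-clockwise rotation matrix."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

# Toy weights (assumed values, not trained): latent width 3, head width 2.
W_uq = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]]
W_uk = [[0.6, 0.1], [-0.4, 0.9], [0.5, -0.2]]

def bridge(offset, theta=0.1):
    """Matrix B such that score = c_q B c_kv^T when RoPE sits between the up-projections."""
    return matmul(matmul(W_uq, rotation(offset * theta)), transpose(W_uk))

# A different offset gives a different bridge, so no single matrix can be stored.
print(bridge(0) == bridge(1))
```

<p>Without the rotation in the middle, <code>bridge</code> would be one fixed matrix, which is what the decoupled construction below restores for the content path.</p>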
<h3 id="the-decoupled-rope-construction">The decoupled RoPE construction</h3>
<p>Each head&rsquo;s key for a token is built from two pieces:</p>
<ol>
<li>A <strong>content key</strong> $k_t^C$ (no RoPE), reconstructed from the cached latent as in §6.</li>
<li>A small <strong>RoPE key</strong> $k_t^R$, a separate, narrow tensor produced directly from $h_t$, to which RoPE <em>is</em> applied. Crucially, it is shared across all heads.</li>
</ol>
$$k_t^R \;=\; \text{RoPE}\!\left(h_t \, W^{KR}\right), \qquad W^{KR} \in \mathbb{R}^{d \times d_h^R} \;=\; \mathbb{R}^{5120 \times 64}$$<p>And on the query side, the rotation also lives on its own piece, but here it is per-head, since queries are not cached:</p>
$$q_{t,i}^R \;=\; \text{RoPE}\!\left(\mathbf{c}_t^Q \, W_i^{QR}\right), \qquad W^{QR} \in \mathbb{R}^{d_c' \times n_h d_h^R}$$<p>The final per-head query and key are the <strong>concatenation</strong> of the content part and the RoPE part:</p>
$$q_{t,i} = [\,q_{t,i}^C \,;\, q_{t,i}^R\,] \in \mathbb{R}^{d_h + d_h^R}, \qquad k_{t,i} = [\,k_{t,i}^C \,;\, k_t^R\,] \in \mathbb{R}^{d_h + d_h^R}$$<p>Note: $k_t^R$ has no head subscript. Every head&rsquo;s RoPE-key part is the <em>same</em> vector. This is exactly MQA-style sharing, confined to the 64-wide RoPE tail.</p>
<p>Attention is then computed on the concatenated vectors:</p>
$$\text{score}_{t,s,i} \;=\; \frac{q_{t,i}^\top k_{s,i}}{\sqrt{d_h + d_h^R}} \;=\; \frac{\underbrace{q_{t,i}^{C\top} k_{s,i}^C}_{\text{from latents}} \;+\; \underbrace{q_{t,i}^{R\top} k_s^R}_{\text{from RoPE pair}}}{\sqrt{d_h + d_h^R}}$$<p>The dot product splits cleanly into a content term and a RoPE term, exactly because we concatenated rather than added. The content term will get the absorption treatment in §8. The RoPE term stands alone and is computed directly from the (small) cached $k_s^R$.</p>
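<p>The split is just the fact that a dot product over concatenated vectors is the sum of the dot products of the pieces. A minimal check with assumed toy values (content width 3, RoPE width 2, instead of 128 and 64):</p>

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy per-head pieces (assumed values): content width 3, RoPE width 2.
q_content, q_rope = [0.4, -1.0, 0.7], [0.3, 0.9]
k_content, k_rope_shared = [1.2, 0.5, -0.6], [-0.8, 0.1]

q_full = q_content + q_rope          # list concatenation plays the role of [q^C ; q^R]
k_full = k_content + k_rope_shared   # every head appends the same shared k^R

whole = dot(q_full, k_full)
split = dot(q_content, k_content) + dot(q_rope, k_rope_shared)

scale = math.sqrt(len(q_full))       # sqrt(d_h + d_h^R) at toy size
score = whole / scale

print(math.isclose(whole, split))
```

<p>Had the RoPE part been added to the content part instead of concatenated, cross terms between the two would appear and the clean split would be lost.</p>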
<p>The output is then the usual per-head weighted sum of $v_{s,i}^C$ (the value has no RoPE, it never did), and the heads are concatenated and pushed through $W^O$ as usual:</p>
$$o_{t,i} = \sum_s \alpha_{t,s,i} \, v_{s,i}^C, \qquad u_t = [o_{t,1}; \ldots; o_{t,n_h}] \, W^O$$<p>The KV cache now holds $\mathbf{c}_t^{KV}$ (512 floats) plus $k_t^R$ (64 floats) per token per layer. Total: 576 floats, or 1,152 bytes in BF16. That is the headline 57x reduction versus MHA&rsquo;s 32,768 floats.</p>
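<p>The cache arithmetic, spelled out. The per-token numbers come from the text above; the 131,072-token context in the last lines is an assumed example length, and <code>cache_bytes</code> is an illustrative helper.</p>

```python
n_h, d_h = 128, 128          # heads and head width
d_c, d_h_rope = 512, 64      # latent width and shared RoPE key width
bytes_per_float = 2          # BF16

mha_floats = 2 * n_h * d_h   # full K and V for every head
mla_floats = d_c + d_h_rope  # latent plus shared RoPE key

def cache_bytes(floats_per_token, context_len):
    """Per-layer cache size in bytes for one sequence of `context_len` tokens."""
    return floats_per_token * bytes_per_float * context_len

print(mha_floats, mla_floats, mla_floats * bytes_per_float)
print(round(mha_floats / mla_floats, 1))

# Assumed example: one 131,072-token sequence, one layer, sizes in MiB.
context_len = 131072
print(cache_bytes(mha_floats, context_len) // 2**20,
      cache_bytes(mla_floats, context_len) // 2**20)
```

<p>The exact ratio is about 56.9, which the text rounds to 57x. At the assumed context length that is 8,192 MiB per layer for MHA against 144 MiB for MLA.</p>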
<h2 id="part-8-the-absorption-trick">Part 8: The Absorption Trick</h2>
<p>We have reduced the cache from 32,768 to 576 floats per token. But the up-projections $W^{UK}$ and $W^{UV}$ are huge ($512 \times 16{,}384$ each), and applying them to every cached token at every decoding step looks alarming. The trick is that at inference time we never apply them on their own. We absorb them into the surrounding matrices.</p>
<h3 id="absorbing--into-the-query">Absorbing $W^{UK}$ into the query</h3>
<p>Look at one head&rsquo;s content-attention term:</p>
$$q_{t,i}^{C\top} \, k_{s,i}^C \;=\; \big(\mathbf{c}_t^Q W_i^{UQ}\big)^\top \big(\mathbf{c}_s^{KV} W_i^{UK}\big) \;=\; \mathbf{c}_t^{Q\top} \, \underbrace{W_i^{UQ\top} W_i^{UK}}_{\widetilde{W}_i^Q \,\in\, \mathbb{R}^{d_c' \times d_c}} \, \mathbf{c}_s^{KV}$$<p>The product $\widetilde{W}_i^Q = W_i^{UQ\top} W_i^{UK}$ is two parameter matrices multiplied together. It depends on no input, so <strong>we precompute it once</strong>. (The transposes here treat the latents as column vectors; with the row-vector shapes of §6 the same bridge reads $W_i^{UQ} W_i^{UK\top}$.) The score becomes a single bilinear form between the query latent $\mathbf{c}_t^Q$ (width 1536, computed fresh for the current token) and the cached KV latent $\mathbf{c}_s^{KV}$ (width 512).</p>
<p>After absorption, &ldquo;computing the content key&rdquo; disappears. The query latent talks to the key latent directly through a precomputed bridge matrix. The cache really is the key.</p>
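<p>A quick numerical check of the key-side absorption for one head, at toy size with random weights (query latent width 4, KV latent width 3, head width 2; all assumed). It follows the row-vector shapes of §6, so the bridge is formed as <code>W_uq_i</code> times the transpose of <code>W_uk_i</code>. Helper names are illustrative.</p>

```python
import math
import random

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def vecmat(x, W):
    """Row vector times matrix, i.e. x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def matmul(A, B):
    return [vecmat(row, B) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_cq, d_c, d_h = 4, 3, 2                 # toy widths (assumed)
W_uq_i = rand_matrix(d_cq, d_h)          # one head of the query up-projection
W_uk_i = rand_matrix(d_c, d_h)           # one head of the key up-projection
c_q = [rng.uniform(-1.0, 1.0) for _ in range(d_cq)]
c_kv = [rng.uniform(-1.0, 1.0) for _ in range(d_c)]

# Naive read: materialize the head-width query and key, then dot them.
naive = dot(vecmat(c_q, W_uq_i), vecmat(c_kv, W_uk_i))

# Absorbed read: build the bridge once, then score latent against latent.
bridge = matmul(W_uq_i, transpose(W_uk_i))   # shape d_cq x d_c
absorbed = dot(vecmat(c_q, bridge), c_kv)

print(math.isclose(naive, absorbed, abs_tol=1e-12))
```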
<h3 id="absorbing--into-the-output-projection">Absorbing $W^{UV}$ into the output projection</h3>
<p>The value side gets the same treatment, but from the other end. The per-head output is a weighted sum of $v_{s,i}^C = \mathbf{c}_s^{KV} W_i^{UV}$, which then gets multiplied by $W_i^O$:</p>
$$u_t \;\supset\; o_{t,i} \, W_i^O \;=\; \sum_s \alpha_{t,s,i} \, \mathbf{c}_s^{KV} \, \underbrace{W_i^{UV} W_i^O}_{\widetilde{W}_i^O \,\in\, \mathbb{R}^{d_c \times d}}$$<p>Again, the bracketed matrix is parameter-only: precompute and store. At runtime, the attention weights multiply directly against the cached $\mathbf{c}_s^{KV}$, and the result projects directly to the output dimension. $v$ is never materialized.</p>
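<p>The value-side absorption can be checked the same way. This sketch assumes toy sizes (latent width 3, head width 2, output width 4), three cached positions, random weights, and hand-picked attention weights that sum to 1.</p>

```python
import math
import random

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def vecmat(x, W):
    """Row vector times matrix, i.e. x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def matmul(A, B):
    return [vecmat(row, B) for row in A]

def weighted_sum(weights, vectors):
    """Sum of weights[s] * vectors[s] over s, coordinate by coordinate."""
    return [sum(w * v[j] for w, v in zip(weights, vectors))
            for j in range(len(vectors[0]))]

d_c, d_h, d = 3, 2, 4                    # toy widths (assumed)
W_uv_i = rand_matrix(d_c, d_h)           # one head of the value up-projection
W_o_i = rand_matrix(d_h, d)              # that head's slice of W^O
latents = [[rng.uniform(-1.0, 1.0) for _ in range(d_c)] for _ in range(3)]
alpha = [0.2, 0.5, 0.3]                  # attention weights over 3 positions

# Naive: materialize each value, average, then apply the output projection.
values = [vecmat(c, W_uv_i) for c in latents]
u_naive = vecmat(weighted_sum(alpha, values), W_o_i)

# Absorbed: average the cached latents, then apply one precomputed matrix.
W_o_tilde = matmul(W_uv_i, W_o_i)        # shape d_c x d
u_absorbed = vecmat(weighted_sum(alpha, latents), W_o_tilde)

print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(u_naive, u_absorbed)))
```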
<p>Subtle point: the absorption changes the effective compute layout. The per-head $\widetilde{W}^Q$ matrices are dense and head-specific, so you do not get a literal FLOPs reduction in the obvious places, but you do collapse what was a stream of large reconstructions into a single small inner product. The combined effect (small cache plus small inner product) is what makes long-context decoding cheap.</p>
<p>The RoPE part is computed the old way: the cached $k_s^R$ is dotted with the freshly computed $q_{t,i}^R$. It is small (64 wide) and outside the absorbed factorization, which is exactly why we paid the cost of decoupling it.</p>
<h2 id="part-9-deepseek-v2-by-the-numbers">Part 9: DeepSeek-V2 by the Numbers</h2>
<p>Plug the actual hyperparameters in. Drag the slider to see how the per-layer KV cache footprint scales with context length, per request, in fp16.</p>


<div class="attn-kv-growth attn-breakout" id="attn-kvg-d85f6de752f74995c33652ca2b3b58d0">
  <style>
    .attn-kv-growth {
      --kvg-bg: #0d1117;
      --kvg-surface: #161b22;
      --kvg-border: #30363d;
      --kvg-text: #e6edf3;
      --kvg-text-muted: #8b949e;
      --kvg-mha: #f97583;
      --kvg-gqa: #f0b429;
      --kvg-mqa: #8b949e;
      --kvg-mla: #39d353;
      --kvg-track: #21262d;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
      background: var(--kvg-bg);
      color: var(--kvg-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .attn-kv-growth,
    :root:not([data-theme="dark"]) .attn-kv-growth {
      --kvg-bg: #f8fafc;
      --kvg-surface: #ffffff;
      --kvg-border: #e2e8f0;
      --kvg-text: #1e293b;
      --kvg-text-muted: #64748b;
      --kvg-mha: #ef4444;
      --kvg-gqa: #d97706;
      --kvg-mqa: #94a3b8;
      --kvg-mla: #10b981;
      --kvg-track: #e2e8f0;
    }

    .attn-kv-growth * { box-sizing: border-box; }

    .attn-kv-growth .kvg-header {
      text-align: center;
      margin-bottom: 1rem;
    }

    .attn-kv-growth .kvg-header h3 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1rem;
      font-weight: 600;
      color: var(--kvg-mla);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .attn-kv-growth .kvg-header p {
      color: var(--kvg-text-muted);
      font-size: 1rem;
      margin: 0;
    }

    .attn-kv-growth .kvg-card {
      background: var(--kvg-surface);
      border: 1px solid var(--kvg-border);
      border-radius: 10px;
      padding: 1.3rem 1.4rem;
    }

    .attn-kv-growth .kvg-ctrl {
      display: flex;
      align-items: center;
      gap: 1.1rem;
      margin-bottom: 1.3rem;
      flex-wrap: wrap;
    }

    .attn-kv-growth .kvg-ctrl label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.92rem;
      letter-spacing: 0.04em;
      color: var(--kvg-text-muted);
    }

    .attn-kv-growth input[type=range] {
      flex: 1;
      min-width: 220px;
      -webkit-appearance: none;
      appearance: none;
      height: 4px;
      background: var(--kvg-track);
      border-radius: 2px;
      outline: none;
    }

    .attn-kv-growth input[type=range]::-webkit-slider-thumb {
      -webkit-appearance: none;
      appearance: none;
      width: 20px;
      height: 20px;
      border-radius: 50%;
      background: var(--kvg-text);
      cursor: pointer;
      box-shadow: 0 0 0 4px var(--kvg-surface), 0 0 0 5px var(--kvg-border);
    }

    .attn-kv-growth input[type=range]::-moz-range-thumb {
      width: 20px;
      height: 20px;
      border-radius: 50%;
      background: var(--kvg-text);
      cursor: pointer;
      border: none;
      box-shadow: 0 0 0 4px var(--kvg-surface), 0 0 0 5px var(--kvg-border);
    }

    .attn-kv-growth .kvg-ctxval {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1.1rem;
      font-weight: 700;
      color: var(--kvg-text);
      min-width: 110px;
      text-align: right;
    }

    .attn-kv-growth .kvg-row {
      display: grid;
      grid-template-columns: 70px 1fr 200px;
      gap: 14px;
      align-items: center;
      margin: 10px 0;
    }

    .attn-kv-growth .kvg-row .kvg-name {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.92rem;
      font-weight: 700;
    }

    .attn-kv-growth .kvg-row .kvg-bar-wrap {
      height: 24px;
      background: var(--kvg-track);
      border-radius: 3px;
      overflow: hidden;
      border: 1px solid var(--kvg-border);
    }

    .attn-kv-growth .kvg-row .kvg-bar {
      height: 100%;
      border-radius: 2px;
      transition: width 0.18s ease;
    }

    .attn-kv-growth .kvg-row .kvg-val {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.92rem;
      text-align: right;
      color: var(--kvg-text-muted);
    }

    .attn-kv-growth .kvg-row .kvg-val b {
      color: var(--kvg-text);
      font-weight: 700;
    }

    .attn-kv-growth .kvg-footnote {
      margin-top: 1.1rem;
      padding-top: 0.85rem;
      border-top: 1px dashed var(--kvg-border);
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.85rem;
      color: var(--kvg-text-muted);
      line-height: 1.6;
    }

    @media (max-width: 600px) {
      .attn-kv-growth .kvg-row {
        grid-template-columns: 60px 1fr 130px;
        gap: 8px;
      }
    }
  </style>

  <div class="kvg-header">
    <h3>Cache Size vs Context Length</h3>
    <p>Per layer, batch=1, fp16. Drag the slider to see how the per-layer KV cache footprint scales.</p>
  </div>

  <div class="kvg-card">
    <div class="kvg-ctrl">
      <label for="kvg-ctx-d85f6de752f74995c33652ca2b3b58d0">Context length</label>
      <input id="kvg-ctx-d85f6de752f74995c33652ca2b3b58d0" type="range" min="512" max="131072" step="512" value="32768">
      <div class="kvg-ctxval" id="kvg-ctxval-d85f6de752f74995c33652ca2b3b58d0">32,768</div>
    </div>

    <div id="kvg-rows-d85f6de752f74995c33652ca2b3b58d0"></div>

    <div class="kvg-footnote">
      DeepSeek-V2-style hyperparameters: n_h = 128, d_h = 128, d_c = 512, d_h^R = 64. GQA shown with 8 K/V groups; MQA shares a single K/V across heads.
    </div>
  </div>

  <script>
    (function() {
      const root = document.getElementById('attn-kvg-d85f6de752f74995c33652ca2b3b58d0');
      const ctx = root.querySelector('#kvg-ctx-d85f6de752f74995c33652ca2b3b58d0');
      const ctxVal = root.querySelector('#kvg-ctxval-d85f6de752f74995c33652ca2b3b58d0');
      const rowsEl = root.querySelector('#kvg-rows-d85f6de752f74995c33652ca2b3b58d0');

      function readVar(name) {
        return getComputedStyle(root).getPropertyValue(name).trim();
      }

      const cfg = { n_h: 128, d_h: 128, d_c: 512, d_h_R: 64, n_g: 8 };

      const methods = [
        { name: 'MHA', per_token: 2 * cfg.n_h * cfg.d_h,         colorVar: '--kvg-mha' },
        { name: 'GQA', per_token: 2 * cfg.n_g * cfg.d_h,         colorVar: '--kvg-gqa' },
        { name: 'MQA', per_token: 2 * cfg.d_h,                   colorVar: '--kvg-mqa' },
        { name: 'MLA', per_token: cfg.d_c + cfg.d_h_R,           colorVar: '--kvg-mla' }
      ];

      function fmtBytes(n) {
        if (n >= 1e9) return (n / 1e9).toFixed(2) + ' GB';
        if (n >= 1e6) return (n / 1e6).toFixed(2) + ' MB';
        if (n >= 1e3) return (n / 1e3).toFixed(1) + ' KB';
        return n.toFixed(0) + ' B';
      }

      function fmtCtx(n) {
        return n.toLocaleString('en-US');
      }

      function render() {
        const T = parseInt(ctx.value, 10);
        ctxVal.textContent = fmtCtx(T);

        const sizes = methods.map(m => ({ ...m, bytes: m.per_token * T * 2 }));
        const max = Math.max(...sizes.map(s => s.bytes));

        rowsEl.innerHTML = sizes.map(s => {
          const widthPct = (s.bytes / max * 100).toFixed(2);
          const ratio = sizes[0].bytes / s.bytes;
          const ratioStr = ratio >= 2 ? ` · <b>${ratio.toFixed(1)}×</b> smaller` : '';
          const color = readVar(s.colorVar);
          return `<div class="kvg-row">
            <div class="kvg-name" style="color:${color}">${s.name}</div>
            <div class="kvg-bar-wrap"><div class="kvg-bar" style="width:${widthPct}%; background:${color}; opacity:0.85"></div></div>
            <div class="kvg-val"><b>${fmtBytes(s.bytes)}</b>${ratioStr}</div>
          </div>`;
        }).join('');
      }

      ctx.addEventListener('input', render);
      render();
    })();
  </script>
</div>

<p>At 128K context, the kind of regime DeepSeek-V2 was designed for, each MLA layer holds about 144 MiB of cache (the slider reports this as roughly 151 MB in decimal units), versus around 8 GiB (8.59 GB) for naive MHA. Multiply across DeepSeek-V2&rsquo;s 60 layers and that becomes roughly 8.4 GiB versus 480 GiB: the difference between a context that fits and one that does not.</p>
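<p>The arithmetic behind those figures can be reproduced outside the widget. This sketch reuses the same hyperparameters and per-token formulas, and assumes fp16 (2 bytes per element), a 131,072-token context, and 60 layers. It reports binary units (MiB, GiB).</p>

```javascript
const cfg = { n_h: 128, d_h: 128, d_c: 512, d_h_R: 64, n_g: 8 };
const BYTES_PER_ELEMENT = 2; // fp16
const CONTEXT = 131072;      // 128K tokens
const LAYERS = 60;           // DeepSeek-V2 depth

// Cached elements per token, per layer, for each scheme.
const perToken = {
  MHA: 2 * cfg.n_h * cfg.d_h,
  GQA: 2 * cfg.n_g * cfg.d_h,
  MQA: 2 * cfg.d_h,
  MLA: cfg.d_c + cfg.d_h_R,
};

const report = Object.entries(perToken).map(([name, elements]) => {
  const perLayerBytes = elements * CONTEXT * BYTES_PER_ELEMENT;
  return {
    name,
    elements,
    perLayerMiB: perLayerBytes / 2 ** 20,
    totalGiB: (perLayerBytes * LAYERS) / 2 ** 30,
  };
});

console.table(report);
```

<p>For one request this gives 144 MiB per layer for MLA against 8192 MiB for MHA, and 8.4375 GiB against 480 GiB across all layers.</p>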
<h2 id="part-10-comparison-and-takeaways">Part 10: Comparison and Takeaways</h2>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>K/V structure</th>
          <th>Cache per token (elements per layer)</th>
          <th>Quality</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MHA</td>
          <td>Distinct K, V per head</td>
          <td>$2 \cdot n_h \cdot d_h$</td>
          <td>Best</td>
          <td>Baseline; expensive cache</td>
      </tr>
      <tr>
          <td>MQA</td>
          <td>One K, V shared by all heads</td>
          <td>$2 \cdot d_h$</td>
          <td>Noticeably worse</td>
          <td>Used in PaLM-style models</td>
      </tr>
      <tr>
          <td>GQA</td>
          <td>K, V per group of heads</td>
          <td>$2 \cdot n_g \cdot d_h$</td>
          <td>$\approx$ MHA when $n_g$ tuned</td>
          <td>Llama, Mistral, Qwen</td>
      </tr>
      <tr>
          <td><strong>MLA</strong></td>
          <td><strong>Latent + decoupled RoPE</strong></td>
          <td>$d_c + d_h^R$</td>
          <td>$\approx$ MHA, sometimes better</td>
          <td>DeepSeek-V2/V3; absorbs at inference</td>
      </tr>
  </tbody>
</table>
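<p>To make the table concrete, here are the per-token counts under the DeepSeek-V2-style hyperparameters used in the cache widget ($n_h = 128$, $d_h = 128$, $n_g = 8$, $d_c = 512$, $d_h^R = 64$). These are elements per layer, not bytes:</p>
$$\text{MHA: } 2 \cdot n_h \cdot d_h = 2 \cdot 128 \cdot 128 = 32768$$
$$\text{GQA: } 2 \cdot n_g \cdot d_h = 2 \cdot 8 \cdot 128 = 2048$$
$$\text{MQA: } 2 \cdot d_h = 2 \cdot 128 = 256$$
$$\text{MLA: } d_c + d_h^R = 512 + 64 = 576$$
$$\frac{32768}{576} \approx 56.9, \qquad \frac{576}{2 \cdot 128} = 2.25$$
<p>On these numbers MLA stores about 57 times less than MHA. Its footprint equals that of a GQA configuration with 2.25 groups.</p>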
<h3 id="what-is-actually-new">What is actually new</h3>
<p>MLA is not the first low-rank attention proposal. Linformer (2020), Performer, and various Nyström approximations long predate it. What makes MLA practical and distinct is:</p>
<ol>
<li><strong>The low-rank object is the cache, not the attention pattern.</strong> Earlier work approximated the attention matrix itself. MLA keeps softmax attention exact and only compresses what gets stored across timesteps.</li>
<li><strong>The absorption trick keeps inference compute bounded.</strong> Without the absorption from Part 8, MLA would just be a memory-vs-compute tradeoff. With it, both move in the right direction at once for the autoregressive decoding regime.</li>
<li><strong>The decoupled-RoPE construction shows how to make a low-rank cache compatible with rotary embeddings.</strong> This is the part most easily glossed over, but it is what makes the technique deployable in modern transformer stacks. Low-rank schemes that ignore positional embeddings look nice in a paper and break in practice. MLA paid the engineering tax.</li>
</ol>
<h3 id="the-cost">The cost</h3>
<p>MLA is not free. The decoupled RoPE adds parameters and a slightly more elaborate forward path. The bilinear absorbed matrices $\widetilde{W}_i^Q$ are larger than the raw $W^Q$ pieces would be. For very short-context inference, plain GQA may be faster in wall-clock terms, because the absorbed matmul, rather than memory bandwidth, is the dominant cost. MLA&rsquo;s wins compound with context length and batch size, exactly the regimes that matter when serving a frontier model.</p>
<p>In the long-context, large-batch regime that production inference cares about, MLA pulls the KV-cache lever further than any other published trick, while keeping standard softmax attention semantics intact.</p>
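<p>A rough back-of-envelope sketch puts numbers on that cost. It assumes the 1536-wide query latent mentioned in the absorption section, reuses the widget&rsquo;s other hyperparameters, and counts only the score path for a single head. Treat it as an illustration, not a profile of any real kernel.</p>

```javascript
// d_cq = 1536 (query latent width) is an assumption taken from the absorption section;
// the remaining values match the cache widget.
const cfg = { d_cq: 1536, d_c: 512, d_h: 128, d_h_R: 64 };

// Parameters per head on the content-score path.
const factoredParams = cfg.d_cq * cfg.d_h + cfg.d_c * cfg.d_h; // W_UQ and W_UK stored separately
const absorbedParams = cfg.d_cq * cfg.d_c;                     // one precomputed bridge matrix

// Multiply-accumulates per head to score ONE cached token during decoding.
const plainMacsPerToken = cfg.d_h;                // dot product against an explicit 128-wide key
const absorbedMacsPerToken = cfg.d_c + cfg.d_h_R; // 512-wide latent plus 64-wide RoPE key

const summary = {
  factoredParams,
  absorbedParams,
  paramRatio: absorbedParams / factoredParams,
  macRatio: absorbedMacsPerToken / plainMacsPerToken,
};

console.log(summary);
```

<p>Under these assumptions the absorbed matrix is 3 times larger than the two factors it replaces, and each cached token costs 4.5 times more arithmetic to score. That extra arithmetic is what can dominate at short context.</p>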
<h3 id="if-you-remember-three-things">If you remember three things</h3>
<ol>
<li>The cache is a low-rank latent $\mathbf{c}^{KV}$, not K and V. K and V are defined by up-projections of that latent, but at inference those projections are absorbed and never run explicitly.</li>
<li>RoPE rides on a tiny separate vector that bypasses the latent compression, so the rotation can stay position-aware without contaminating the absorbable content path.</li>
<li>The whole construction is engineered around the fact that during decoding, you do not need K and V. You need <em>scores</em>. Computing scores requires caching only enough information to derive them, and the latent is sized to exactly that.</li>
</ol>
<h3 id="what-comes-next">What comes next</h3>
<p>This post, Part 1 of the series, ends at the point where each cached token is about as small as it gets. The next pressure point is not the size of each token but the number of tokens you carry. Once your model can hold 1M tokens of context, do you really need to read every one of them at every decode step?</p>
<p>Part 2 will pick up there. Three different bets on how to answer that question shipped in 2025: sparse top-k selection driven by a cheap relevance scorer (DSA, deployed in DeepSeek V3.2), three-branch sparsification that is trainable from scratch (NSA, ACL 2025 best paper), and mixture-of-block routing that retrofits existing dense checkpoints (MoBA, powering Kimi K2&rsquo;s 1M context). A different bet entirely is the linear-attention hybrids that change the math so no $O(L^2)$ matrix ever materializes (Lightning Attention, in MiniMax M1). The synthesis at the end is DeepSeek V4-Pro, released last month, where every thread we have followed so far gets composed at once into a 61-layer stack that runs at 2% of GQA&rsquo;s cache footprint.</p>
<p><strong>Part 2 is in progress. Check back soon.</strong></p>
<h2 id="references">References</h2>
<ol>
<li><strong>Vaswani, A. et al. (2017).</strong> <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>. <em>NeurIPS 2017.</em> The original Transformer paper. The baseline MHA that every variant in this post is reacting to.</li>
<li><strong>Shazeer, N. (2019).</strong> <a href="https://arxiv.org/abs/1911.02150">Fast Transformer Decoding: One Write-Head is All You Need</a>. The MQA paper. Concise and worth reading in full; the whole argument is six pages.</li>
<li><strong>Ainslie, J. et al. (2023).</strong> <a href="https://arxiv.org/abs/2305.13245">GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints</a>. The GQA paper. Includes the mean-pooling conversion recipe that made GQA easy to adopt.</li>
<li><strong>DeepSeek-AI (2024).</strong> <a href="https://arxiv.org/abs/2405.04434">DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model</a>. The MLA paper. Section 2.1.3 has the decoupled RoPE derivation; almost everyone missed it the first time around.</li>
<li><strong>Meng, F. et al. (2025).</strong> <a href="https://arxiv.org/abs/2502.07864">TransMLA: Multi-Head Latent Attention Is All You Need</a>. Proves that MLA has strictly greater expressive power than GQA at the same cache budget, and gives a conversion recipe from GQA to MLA.</li>
</ol>
<p>References for sparse attention (DSA, NSA, MoBA), linear hybrids (Lightning Attention), and the V4-Pro report will appear in Part 2.</p>
]]></content:encoded></item><item><title>The Platform Around the Agent: What Enterprise Architects Actually Build</title><link>https://www.mdjawad.com/posts/enterprise-agentic-platform/</link><pubDate>Wed, 15 Apr 2026 10:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/enterprise-agentic-platform/</guid><description>Most enterprises have bought an AI coding agent and are stuck. The ones generating real productivity gains didn&amp;rsquo;t win by picking a better model. They built a platform around the agent. This post walks through the five control-plane responsibilities that separate the 11% of AI-native orgs from the 95% reporting zero ROI, grounded in public deployments from Block, Shopify, Atlassian, Airbnb, and others.</description><content:encoded><![CDATA[<h2 id="the-gap">The Gap</h2>
<p>By April 2026, the adoption numbers are staggering: <a href="https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025">90% of developers use AI at work</a> and over 80% say it&rsquo;s made them more productive. Gartner projects <a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">40% of enterprise applications will feature task-specific AI agents by the end of 2026</a>, up from less than 5% in 2025.</p>
<p>Then you read the other column. <a href="https://onereach.ai/blog/what-shapes-enterprise-ai-agents-in-the-future/">CIO reports that 95% of enterprises see zero return on their AI investments.</a> <a href="https://kpmg.com/us/en/media/news/q4-ai-pulse.html">McKinsey&rsquo;s maturity model</a> puts only around 11% of enterprises in the &ldquo;AI-native&rdquo; tier. The <a href="https://cloud.google.com/resources/content/2025-dora-ai-assisted-software-development-report">2025 DORA report</a> is more uncomfortable still: AI raises throughput <em>and</em> raises change-failure rate. PR size is up 154%. 30% of engineers don&rsquo;t trust the code their own agents produce.</p>
<p>The gap isn&rsquo;t the model. Frontier models are a commodity. They get swapped every six months and the next one is better. The gap is everything <em>around</em> the model: the control plane that routes requests, attributes cost, enforces policy, retrieves context, evaluates quality, and measures outcomes. It&rsquo;s the platform that turns &ldquo;we rolled out Copilot&rdquo; into &ldquo;we shipped a 30.8% reduction in PR cycle time across 1,900 repos,&rdquo; which is what Atlassian did with Rovo Dev and published at <a href="https://www.atlassian.com/blog/artificial-intelligence/developer-productivity-improved-with-rovo-dev/amp">ICSE 2026</a>.</p>
<p>This post is for the architect who has been asked to lead that platform. Not to choose between Copilot and Cursor; that&rsquo;s a week of spreadsheets. To design what sits around whatever agent you pick, so that a year from now your CFO knows what AI is costing and your CTO knows what it&rsquo;s earning.</p>
<h2 id="chapter-1-what-platform-actually-means-here">Chapter 1: What &ldquo;Platform&rdquo; Actually Means Here</h2>
<p>When a VP of Engineering says &ldquo;we have an AI platform,&rdquo; they might mean one of three things:</p>
<ol>
<li>We bought Copilot Enterprise. Everyone has access.</li>
<li>We stood up a chat UI in front of a couple of models.</li>
<li>We run an internal control plane that mediates every AI request our engineers make, attributes cost per team, enforces policy per repo, evaluates quality continuously, exposes a curated surface of tools and skills, and lets any engineer publish a repeatable workflow that triggers on a schedule, a webhook, or a repository event.</li>
</ol>
<p>Only the third one is a platform. The first two are procurements.</p>
<p>Shopify made this distinction concrete. Per <a href="https://www.bvp.com/atlas/inside-shopifys-ai-first-engineering-playbook">Bessemer&rsquo;s write-up of their AI-first engineering playbook</a>, Shopify runs an <strong>LLM proxy</strong>. Every AI request from every tool, every engineer, every script, goes through one internal gateway. Engineers can pick their harness (Claude Code, Cursor, Copilot), but the proxy is non-negotiable. That single architectural choice is what gives them centralised cost control, usage analytics, model flexibility, and the ability to swap a model provider in a day instead of a quarter.</p>
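<p>To make the gateway idea concrete, here is a deliberately small sketch of what such a proxy does at its core: resolve a model alias, forward the call, and attribute the cost to a team. Everything in it is hypothetical, including the aliases, prices, field names, and the stub provider. It is not Shopify&rsquo;s proxy, only an illustration of why a single choke point makes cost attribution and provider swaps cheap.</p>

```javascript
// Hypothetical alias table. Swapping a provider means editing one entry here.
const routes = {
  'code-default': { provider: 'provider-a', model: 'model-x', usdPerMTok: 3.0 },
  'code-fast': { provider: 'provider-b', model: 'model-y', usdPerMTok: 0.5 },
};

function createGateway(routeTable, callProvider) {
  const spendByTeam = new Map();

  async function complete({ team, harness, alias, prompt }) {
    const route = routeTable[alias];
    if (!route) throw new Error('unknown model alias: ' + alias);

    // The provider call is injected, so the gateway stays vendor-neutral.
    const { text, tokens } = await callProvider(route, prompt);

    // Attribute cost to the calling team, whatever harness sent the request.
    const cost = (tokens / 1e6) * route.usdPerMTok;
    spendByTeam.set(team, (spendByTeam.get(team) || 0) + cost);

    return { text, tokens, cost, harness, provider: route.provider };
  }

  return { complete, spendByTeam };
}

// Stub provider so the sketch runs offline; a real one would call a vendor API.
const stubProvider = async (route, prompt) => ({
  text: '[' + route.model + '] ok',
  tokens: prompt.length, // crude stand-in for a real token count
});

const gateway = createGateway(routes, stubProvider);

gateway
  .complete({ team: 'payments', harness: 'cursor', alias: 'code-fast', prompt: 'Refactor this module' })
  .then((result) => console.log(result.provider, gateway.spendByTeam.get('payments')));
```

<p>Because the harness is just a field on the request, engineers can keep whichever tool they prefer while policy, metering, and routing stay in one place.</p>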
<p>Block took a different turn at the same fork. Rather than wrapping third-party agents, they <a href="https://block.xyz/inside/block-open-source-introduces-codename-goose">built Goose internally and open-sourced it</a>. The stated reason, per CTO Dhanji Prasanna on the <a href="https://sequoiacap.com/podcast/training-data-dhanji-prasanna/">Sequoia Training Data podcast</a>, was that &ldquo;data leaving our infrastructure&rdquo; was unacceptable. The outcome: engineers save 8-10 hours a week, Goose is on track to reclaim 25% of manual hours company-wide, and <a href="https://allthingsopen.org/articles/meet-goose-open-source-ai-agent">100% of Goose&rsquo;s own PRs are now written by Goose</a>.</p>
<p>Two tech companies. Two legitimate answers to the same architectural prompt. Both built a platform. Neither bought one.</p>
<p>The platform you build, whether Shopify-shaped (gateway + BYO-harness) or Block-shaped (build-the-harness), owns five responsibilities:</p>




<style>
.apcp-c0aa3c6c6ba86623c3a6e79446929763 {
  --apcp-accent: #c96442;
  --apcp-accent-soft: rgba(201, 100, 66, 0.07);
  --apcp-accent-border: rgba(201, 100, 66, 0.35);

  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif;
  background: var(--entry);
  color: var(--primary);
  border: 1px solid var(--border);
  border-radius: 6px;
  padding: 1.75rem;
  margin: 2rem 0;
}

.dark .apcp-c0aa3c6c6ba86623c3a6e79446929763 {
  --apcp-accent: #d97757;
  --apcp-accent-soft: rgba(217, 119, 87, 0.09);
  --apcp-accent-border: rgba(217, 119, 87, 0.4);
}

.apcp-c0aa3c6c6ba86623c3a6e79446929763 * { box-sizing: border-box; }

.apcp-head-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: baseline;
  justify-content: space-between;
  gap: 1rem;
  padding-bottom: 0.85rem;
  margin-bottom: 1.25rem;
  border-bottom: 1px solid var(--border);
}

.apcp-title-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 1.05rem;
  font-weight: 600;
  color: var(--primary);
  letter-spacing: -0.01em;
}

.apcp-kicker-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.65rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
}

.apcp-subtitle-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.85rem;
  color: var(--secondary);
  margin-bottom: 1.25rem;
  line-height: 1.55;
}

 
.apcp-body-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: 1.1fr 1fr;
  gap: 1.25rem;
  align-items: start;
}

.apcp-stack-c0aa3c6c6ba86623c3a6e79446929763 { display: flex; flex-direction: column; gap: 0.4rem; }

.apcp-bracket-c0aa3c6c6ba86623c3a6e79446929763 {
  padding: 0.5rem 0.85rem;
  background: var(--code-bg);
  border-radius: 3px;
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.65rem;
  color: var(--secondary);
  letter-spacing: 0.08em;
  text-transform: uppercase;
  text-align: center;
}

.apcp-bracket-c0aa3c6c6ba86623c3a6e79446929763 strong {
  display: block;
  margin-top: 0.25rem;
  color: var(--primary);
  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif;
  font-weight: 500;
  font-size: 0.78rem;
  text-transform: none;
  letter-spacing: 0;
}

.apcp-layer-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: 32px 1fr auto;
  gap: 0.75rem;
  align-items: center;
  padding: 0.75rem 0.9rem;
  border: 1px solid var(--border);
  border-left: 2px solid var(--border);
  border-radius: 4px;
  background: var(--entry);
  cursor: pointer;
  opacity: 0;
  transform: translateY(3px);
  transition: opacity 0.3s ease, transform 0.3s ease, border-color 0.2s ease, background 0.2s ease;
}

.apcp-layer-c0aa3c6c6ba86623c3a6e79446929763.visible { opacity: 1; transform: translateY(0); }

.apcp-layer-c0aa3c6c6ba86623c3a6e79446929763:hover { background: var(--code-bg); }

.apcp-layer-c0aa3c6c6ba86623c3a6e79446929763.active {
  border-left-color: var(--apcp-accent);
  background: var(--apcp-accent-soft);
}

.apcp-layer-icon-c0aa3c6c6ba86623c3a6e79446929763 {
  width: 32px;
  height: 32px;
  display: flex;
  align-items: center;
  justify-content: center;
  color: var(--secondary);
}

.apcp-layer-c0aa3c6c6ba86623c3a6e79446929763.active .apcp-layer-icon-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--apcp-accent); }

.apcp-layer-text-c0aa3c6c6ba86623c3a6e79446929763 { min-width: 0; }

.apcp-layer-name-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.88rem;
  font-weight: 600;
  color: var(--primary);
  letter-spacing: -0.005em;
}

.apcp-layer-c0aa3c6c6ba86623c3a6e79446929763.active .apcp-layer-name-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--apcp-accent); }

.apcp-layer-caption-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.65rem;
  color: var(--secondary);
  margin-top: 0.1rem;
}

.apcp-layer-num-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.6rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
}

.apcp-layer-c0aa3c6c6ba86623c3a6e79446929763.active .apcp-layer-num-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--apcp-accent); }

 
.apcp-detail-c0aa3c6c6ba86623c3a6e79446929763 {
  border: 1px solid var(--border);
  border-radius: 4px;
  padding: 1.1rem 1.2rem;
  background: var(--entry);
  position: sticky;
  top: 1rem;
  min-height: 340px;
}

.apcp-detail-empty-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  flex-direction: column;
  align-items: center;
  justify-content: center;
  text-align: center;
  color: var(--secondary);
  font-size: 0.85rem;
  gap: 0.5rem;
  padding: 2rem 1rem;
  min-height: 300px;
}

.apcp-detail-content-c0aa3c6c6ba86623c3a6e79446929763 { display: none; }
.apcp-detail-content-c0aa3c6c6ba86623c3a6e79446929763.active { display: block; animation: apcp-fade-c0aa3c6c6ba86623c3a6e79446929763 0.3s ease; }

@keyframes apcp-fade-c0aa3c6c6ba86623c3a6e79446929763 {
  from { opacity: 0; transform: translateY(3px); }
  to   { opacity: 1; transform: translateY(0); }
}

.apcp-detail-head-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: baseline;
  justify-content: space-between;
  margin-bottom: 1rem;
  padding-bottom: 0.75rem;
  border-bottom: 1px solid var(--border);
}

.apcp-detail-name-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.98rem;
  font-weight: 600;
  color: var(--apcp-accent);
  letter-spacing: -0.005em;
}

.apcp-detail-tag-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.6rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
}

.apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763 { margin-bottom: 0.95rem; }
.apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763:last-child { margin-bottom: 0; }

.apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763 .apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.58rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
  margin: 0 0 0.35rem 0;
}

.apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763 ul {
  list-style: none;
  padding: 0;
  margin: 0;
  font-size: 0.82rem;
  color: var(--primary);
  line-height: 1.6;
}

.apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763 li {
  padding: 0.1rem 0 0.1rem 0.85rem;
  position: relative;
}

.apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763 li::before {
  content: '·';
  position: absolute;
  left: 0;
  color: var(--secondary);
  opacity: 0.6;
}

.apcp-detail-owner-c0aa3c6c6ba86623c3a6e79446929763 {
  display: inline-block;
  padding: 0.25rem 0.6rem;
  background: var(--code-bg);
  border-radius: 3px;
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.7rem;
  color: var(--primary);
  font-weight: 500;
}

.apcp-anti-c0aa3c6c6ba86623c3a6e79446929763 {
  background: var(--apcp-accent-soft);
  border: 1px solid var(--apcp-accent-border);
  border-left: 2px solid var(--apcp-accent);
  border-radius: 3px;
  padding: 0.65rem 0.8rem;
  font-size: 0.8rem;
  color: var(--primary);
  line-height: 1.55;
}

.apcp-anti-c0aa3c6c6ba86623c3a6e79446929763 strong { color: var(--apcp-accent); font-weight: 600; }

 
.apcp-footer-c0aa3c6c6ba86623c3a6e79446929763 {
  margin-top: 1.25rem;
  padding-top: 1rem;
  border-top: 1px solid var(--border);
  text-align: center;
  font-size: 0.85rem;
  color: var(--secondary);
}

.apcp-footer-c0aa3c6c6ba86623c3a6e79446929763 strong { color: var(--apcp-accent); font-weight: 600; }
.apcp-footer-c0aa3c6c6ba86623c3a6e79446929763 em { color: var(--primary); font-style: normal; font-weight: 500; }

 
@media (max-width: 860px) {
  .apcp-body-c0aa3c6c6ba86623c3a6e79446929763 { grid-template-columns: 1fr; }
  .apcp-detail-c0aa3c6c6ba86623c3a6e79446929763 { position: static; min-height: 0; }
  .apcp-stats-c0aa3c6c6ba86623c3a6e79446929763 { grid-template-columns: repeat(2, 1fr); }
  .apcp-stat-c0aa3c6c6ba86623c3a6e79446929763:nth-child(2n) { border-right: none; }
  .apcp-stat-c0aa3c6c6ba86623c3a6e79446929763:nth-child(-n+2) { border-bottom: 1px solid var(--border); }
}

@media (max-width: 520px) {
  .apcp-c0aa3c6c6ba86623c3a6e79446929763 { padding: 1.25rem; }
  .apcp-layer-caption-c0aa3c6c6ba86623c3a6e79446929763 { display: none; }
}
</style>

<div class="apcp-c0aa3c6c6ba86623c3a6e79446929763">
  <div class="apcp-head-c0aa3c6c6ba86623c3a6e79446929763">
    <div class="apcp-title-c0aa3c6c6ba86623c3a6e79446929763">The five control-plane responsibilities</div>
    <div class="apcp-kicker-c0aa3c6c6ba86623c3a6e79446929763">click any layer</div>
  </div>
  <div class="apcp-subtitle-c0aa3c6c6ba86623c3a6e79446929763">What it owns, who runs it, how it fails. If any layer lacks a named team, you don't have a platform. You have a shadow-IT problem.</div>

  <div class="apcp-body-c0aa3c6c6ba86623c3a6e79446929763">
    <div class="apcp-stack-c0aa3c6c6ba86623c3a6e79446929763" id="apcp-stack-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="apcp-bracket-c0aa3c6c6ba86623c3a6e79446929763">
        your agents
        <strong>Claude Code · Cursor · Copilot · Goose · internal harnesses</strong>
      </div>

      <div class="apcp-layer-c0aa3c6c6ba86623c3a6e79446929763" data-layer="capability">
        <div class="apcp-layer-icon-c0aa3c6c6ba86623c3a6e79446929763">
          <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M14.7 6.3a1 1 0 0 0 0 1.4l1.6 1.6a1 1 0 0 0 1.4 0l3.77-3.77a6 6 0 0 1-7.94 7.94l-6.91 6.91a2.12 2.12 0 0 1-3-3l6.91-6.91a6 6 0 0 1 7.94-7.94l-3.76 3.76z"/></svg>
        </div>
        <div class="apcp-layer-text-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-layer-name-c0aa3c6c6ba86623c3a6e79446929763">Capability</div>
          <div class="apcp-layer-caption-c0aa3c6c6ba86623c3a6e79446929763">built-ins · mcp · skills</div>
        </div>
        <div class="apcp-layer-num-c0aa3c6c6ba86623c3a6e79446929763">L1</div>
      </div>

      <div class="apcp-layer-c0aa3c6c6ba86623c3a6e79446929763" data-layer="identity">
        <div class="apcp-layer-icon-c0aa3c6c6ba86623c3a6e79446929763">
          <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M12 22s8-4 8-10V5l-8-3-8 3v7c0 6 8 10 8 10z"/></svg>
        </div>
        <div class="apcp-layer-text-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-layer-name-c0aa3c6c6ba86623c3a6e79446929763">Identity &amp; policy</div>
          <div class="apcp-layer-caption-c0aa3c6c6ba86623c3a6e79446929763">tokens · approvals · sandbox</div>
        </div>
        <div class="apcp-layer-num-c0aa3c6c6ba86623c3a6e79446929763">L2</div>
      </div>

      <div class="apcp-layer-c0aa3c6c6ba86623c3a6e79446929763" data-layer="context">
        <div class="apcp-layer-icon-c0aa3c6c6ba86623c3a6e79446929763">
          <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><ellipse cx="12" cy="5" rx="9" ry="3"/><path d="M3 5v14a9 3 0 0 0 18 0V5"/><path d="M3 12a9 3 0 0 0 18 0"/></svg>
        </div>
        <div class="apcp-layer-text-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-layer-name-c0aa3c6c6ba86623c3a6e79446929763">Context</div>
          <div class="apcp-layer-caption-c0aa3c6c6ba86623c3a6e79446929763">ingest · index · permission · serve</div>
        </div>
        <div class="apcp-layer-num-c0aa3c6c6ba86623c3a6e79446929763">L3</div>
      </div>

      <div class="apcp-layer-c0aa3c6c6ba86623c3a6e79446929763" data-layer="evaluation">
        <div class="apcp-layer-icon-c0aa3c6c6ba86623c3a6e79446929763">
          <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M9 11l3 3L22 4"/><path d="M21 12v7a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V5a2 2 0 0 1 2-2h11"/></svg>
        </div>
        <div class="apcp-layer-text-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-layer-name-c0aa3c6c6ba86623c3a6e79446929763">Evaluation</div>
          <div class="apcp-layer-caption-c0aa3c6c6ba86623c3a6e79446929763">unit · task · production · eval-as-ci</div>
        </div>
        <div class="apcp-layer-num-c0aa3c6c6ba86623c3a6e79446929763">L4</div>
      </div>

      <div class="apcp-layer-c0aa3c6c6ba86623c3a6e79446929763" data-layer="finops">
        <div class="apcp-layer-icon-c0aa3c6c6ba86623c3a6e79446929763">
          <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><line x1="12" y1="1" x2="12" y2="23"/><path d="M17 5H9.5a3.5 3.5 0 0 0 0 7h5a3.5 3.5 0 0 1 0 7H6"/></svg>
        </div>
        <div class="apcp-layer-text-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-layer-name-c0aa3c6c6ba86623c3a6e79446929763">FinOps &amp; observability</div>
          <div class="apcp-layer-caption-c0aa3c6c6ba86623c3a6e79446929763">gateway · tracing · attribution</div>
        </div>
        <div class="apcp-layer-num-c0aa3c6c6ba86623c3a6e79446929763">L5</div>
      </div>

      <div class="apcp-bracket-c0aa3c6c6ba86623c3a6e79446929763">
        your developers
        <strong>Thousands of engineers across dozens of teams</strong>
      </div>
    </div>

    <div class="apcp-detail-c0aa3c6c6ba86623c3a6e79446929763">
      
      <div class="apcp-detail-content-c0aa3c6c6ba86623c3a6e79446929763" data-detail="capability">
        <div class="apcp-detail-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-detail-name-c0aa3c6c6ba86623c3a6e79446929763">Capability surface</div>
          <div class="apcp-detail-tag-c0aa3c6c6ba86623c3a6e79446929763">layer 01</div>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Owns</div>
          <ul>
            <li>Built-in tools shipped with the harness</li>
            <li>Curated registry of approved MCP servers</li>
            <li>Skills library: org-specific playbooks</li>
            <li>Progressive disclosure at the gateway</li>
          </ul>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Who runs it</div>
          <span class="apcp-detail-owner-c0aa3c6c6ba86623c3a6e79446929763">platform-team · capability-sre</span>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Anti-pattern</div>
          <div class="apcp-anti-c0aa3c6c6ba86623c3a6e79446929763"><strong>90+ tools loaded per prompt.</strong> A single enterprise GitHub MCP server loaded naively burns 50k+ tokens of schema before any reasoning. Overhead scales linearly with the number of services connected.</div>
        </div>
      </div>

      
      <div class="apcp-detail-content-c0aa3c6c6ba86623c3a6e79446929763" data-detail="identity">
        <div class="apcp-detail-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-detail-name-c0aa3c6c6ba86623c3a6e79446929763">Identity &amp; policy</div>
          <div class="apcp-detail-tag-c0aa3c6c6ba86623c3a6e79446929763">layer 02</div>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Owns</div>
          <ul>
            <li>Two-identity model: human principal + agent service identity</li>
            <li>Scoped, short-lived tokens per task</li>
            <li>Policy-as-code at the MCP gateway</li>
            <li>Graduated-trust approvals (async wait-state)</li>
            <li>Sandbox runtime (gVisor, Firecracker, Kata) per trust tier</li>
          </ul>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Who runs it</div>
          <span class="apcp-detail-owner-c0aa3c6c6ba86623c3a6e79446929763">platform-team · security · iam</span>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Anti-pattern</div>
          <div class="apcp-anti-c0aa3c6c6ba86623c3a6e79446929763"><strong>Agent inherits the full user scope.</strong> One compromised prompt can exercise every permission the invoking user has. The blast radius of an AI breach becomes the blast radius of a human breach.</div>
        </div>
      </div>

      
      <div class="apcp-detail-content-c0aa3c6c6ba86623c3a6e79446929763" data-detail="context">
        <div class="apcp-detail-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-detail-name-c0aa3c6c6ba86623c3a6e79446929763">Context pipeline</div>
          <div class="apcp-detail-tag-c0aa3c6c6ba86623c3a6e79446929763">layer 03</div>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Owns</div>
          <ul>
            <li>Ingestion connectors for repos, docs, tickets, incidents, service catalog</li>
            <li>Hybrid retrieval: BM25 + dense + graph</li>
            <li>PII redaction and access-aware retrieval</li>
            <li>Staleness SLOs per source</li>
            <li>Token-budget compression and eviction</li>
          </ul>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Who runs it</div>
          <span class="apcp-detail-owner-c0aa3c6c6ba86623c3a6e79446929763">platform-team · data-eng</span>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Anti-pattern</div>
          <div class="apcp-anti-c0aa3c6c6ba86623c3a6e79446929763"><strong>Every team ships its own RAG.</strong> Twelve incompatible stores, staleness nobody measures, PII leaking across permission boundaries, six answers to the same question depending on which index you hit.</div>
        </div>
      </div>

      
      <div class="apcp-detail-content-c0aa3c6c6ba86623c3a6e79446929763" data-detail="evaluation">
        <div class="apcp-detail-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-detail-name-c0aa3c6c6ba86623c3a6e79446929763">Evaluation harness</div>
          <div class="apcp-detail-tag-c0aa3c6c6ba86623c3a6e79446929763">layer 04</div>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Owns</div>
          <ul>
            <li>Unit regressions for prompts, tool schemas, system prompts</li>
            <li>Task-level golden set graded by LLM-as-judge</li>
            <li>Production shadow traffic + online signals</li>
            <li>Eval-as-CI: no prompt, tool, or model ships without passing</li>
            <li>Thumbs-down-to-regression-test feedback loop</li>
          </ul>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Who runs it</div>
          <span class="apcp-detail-owner-c0aa3c6c6ba86623c3a6e79446929763">platform-team · ai-coe</span>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Anti-pattern</div>
          <div class="apcp-anti-c0aa3c6c6ba86623c3a6e79446929763"><strong>Silent failure becomes the norm.</strong> The agent finishes without an error, the diff looks plausible, the test it wrote passes because it tests the buggy behaviour it introduced. The 2025 DORA change-failure-rate uptick is this failure, audited.</div>
        </div>
      </div>

      
      <div class="apcp-detail-content-c0aa3c6c6ba86623c3a6e79446929763" data-detail="finops">
        <div class="apcp-detail-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-detail-name-c0aa3c6c6ba86623c3a6e79446929763">FinOps &amp; observability</div>
          <div class="apcp-detail-tag-c0aa3c6c6ba86623c3a6e79446929763">layer 05</div>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Owns</div>
          <ul>
            <li>LLM gateway: every call mediated, measured, routable</li>
            <li>Tiered-routing policy (Haiku / Sonnet / Opus by task class)</li>
            <li>Prompt-cache hit rate as a first-class SLI</li>
            <li>Per-team, per-repo, per-task cost attribution</li>
            <li>End-to-end OpenTelemetry tracing</li>
          </ul>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Who runs it</div>
          <span class="apcp-detail-owner-c0aa3c6c6ba86623c3a6e79446929763">platform-team · finops · sre</span>
        </div>
        <div class="apcp-detail-section-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="apcp-section-label-c0aa3c6c6ba86623c3a6e79446929763">Anti-pattern</div>
          <div class="apcp-anti-c0aa3c6c6ba86623c3a6e79446929763"><strong>No gateway, no attribution, no chargeback.</strong> 5% of users burn 60% of the budget invisibly. When an incident hits at 3am, you have no trace ID. Just a shrug and an angry CFO.</div>
        </div>
      </div>
    </div>
  </div>

  <div class="apcp-footer-c0aa3c6c6ba86623c3a6e79446929763">
    Buying a Copilot license gives you an <em>agent</em>. Funding these five layers gives you a <strong>platform</strong>.
  </div>
</div>

<script>
(function() {
  var id = 'c0aa3c6c6ba86623c3a6e79446929763';
  var layers = document.querySelectorAll('.apcp-layer-' + id);
  var details = document.querySelectorAll('.apcp-detail-content-' + id);

  for (var i = 0; i < layers.length; i++) {
    (function(el, delay) {
      setTimeout(function() { el.classList.add('visible'); }, delay);
    })(layers[i], 100 + i * 60);
  }

  function activate(layerKey) {
    for (var i = 0; i < layers.length; i++) {
      layers[i].classList.toggle('active', layers[i].getAttribute('data-layer') === layerKey);
    }
    for (var j = 0; j < details.length; j++) {
      details[j].classList.toggle('active', details[j].getAttribute('data-detail') === layerKey);
    }
  }

  for (var k = 0; k < layers.length; k++) {
    (function(el) {
      el.addEventListener('click', function() { activate(el.getAttribute('data-layer')); });
    })(layers[k]);
  }

  setTimeout(function() { activate('capability'); }, 100 + layers.length * 60 + 150);
})();
</script>

<ul>
<li><strong>Capability</strong>: what tools and skills agents can reach, and how they&rsquo;re discovered.</li>
<li><strong>Identity &amp; Policy</strong>: who&rsquo;s acting, with what scope, under what guardrails.</li>
<li><strong>Context</strong>: how org knowledge gets ingested, permissioned, and served.</li>
<li><strong>Evaluation</strong>: how you know the agent is actually getting better, not just shipping faster.</li>
<li><strong>FinOps &amp; Observability</strong>: what it costs, who paid, what it produced, where it broke.</li>
</ul>
<p>Those five wire up into a system shape worth drawing. The harness runs on each engineer&rsquo;s laptop (in a Docker container) or in a CI runner. Everything the harness reaches into over the network is the platform. Everything the platform calls out to is a model provider, a curated MCP server, or a source system the platform has already indexed.</p>




<style>
.arch-c0aa3c6c6ba86623c3a6e79446929763 {
  --arch-accent: #c96442;
  --arch-accent-soft: rgba(201, 100, 66, 0.08);
  --arch-accent-border: rgba(201, 100, 66, 0.35);

  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif;
  background: var(--entry);
  color: var(--primary);
  border: 1px solid var(--border);
  border-radius: 6px;
  padding: 1.75rem;
  margin: 2rem 0;
}

.dark .arch-c0aa3c6c6ba86623c3a6e79446929763 {
  --arch-accent: #d97757;
  --arch-accent-soft: rgba(217, 119, 87, 0.1);
  --arch-accent-border: rgba(217, 119, 87, 0.4);
}

.arch-c0aa3c6c6ba86623c3a6e79446929763 * { box-sizing: border-box; }

.arch-head-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: baseline;
  justify-content: space-between;
  gap: 1rem;
  padding-bottom: 0.85rem;
  margin-bottom: 1rem;
  border-bottom: 1px solid var(--border);
}

.arch-title-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 1.05rem;
  font-weight: 600;
  color: var(--primary);
  letter-spacing: -0.01em;
}

.arch-kicker-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.65rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
}

.arch-subtitle-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.85rem;
  color: var(--secondary);
  margin-bottom: 1.25rem;
  line-height: 1.55;
}

 
.arch-zones-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: 1fr 24px 1.15fr 24px 1fr;
  gap: 0;
  align-items: stretch;
}

.arch-col-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
}

.arch-col-head-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.58rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
  padding-bottom: 0.5rem;
  border-bottom: 1px solid var(--border);
  margin-bottom: 0.3rem;
  display: flex;
  justify-content: space-between;
  align-items: baseline;
}

.arch-col-head-c0aa3c6c6ba86623c3a6e79446929763.accent {
  color: var(--arch-accent);
  border-bottom-color: var(--arch-accent);
}

.arch-col-count-c0aa3c6c6ba86623c3a6e79446929763 {
  color: var(--secondary);
  font-size: 0.55rem;
}

.arch-col-head-c0aa3c6c6ba86623c3a6e79446929763.accent .arch-col-count-c0aa3c6c6ba86623c3a6e79446929763 {
  color: var(--arch-accent);
  opacity: 0.75;
}

 
.arch-card-c0aa3c6c6ba86623c3a6e79446929763 {
  border: 1px solid var(--border);
  border-radius: 4px;
  padding: 0.65rem 0.8rem;
  background: var(--entry);
  opacity: 0;
  transform: translateY(3px);
  transition: opacity 0.35s ease, transform 0.35s ease;
}

.arch-card-c0aa3c6c6ba86623c3a6e79446929763.visible { opacity: 1; transform: translateY(0); }

.arch-card-c0aa3c6c6ba86623c3a6e79446929763.hub {
  border: 1px solid var(--arch-accent-border);
  border-left: 2px solid var(--arch-accent);
  background: var(--arch-accent-soft);
}

.arch-card-head-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  margin-bottom: 0.25rem;
}

.arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763 {
  width: 18px;
  height: 18px;
  color: var(--secondary);
  flex-shrink: 0;
}

.arch-card-c0aa3c6c6ba86623c3a6e79446929763.hub .arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--arch-accent); }

.arch-card-name-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.8rem;
  font-weight: 600;
  color: var(--primary);
  letter-spacing: -0.005em;
  flex: 1;
  min-width: 0;
}

.arch-card-c0aa3c6c6ba86623c3a6e79446929763.hub .arch-card-name-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--arch-accent); }

.arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.52rem;
  font-weight: 500;
  text-transform: uppercase;
  letter-spacing: 0.08em;
  padding: 0.1rem 0.35rem;
  border-radius: 3px;
  background: var(--code-bg);
  color: var(--secondary);
  white-space: nowrap;
}

.arch-card-c0aa3c6c6ba86623c3a6e79446929763.hub .arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763 {
  background: rgba(201, 100, 66, 0.15);
  color: var(--arch-accent);
}

.dark .arch-card-c0aa3c6c6ba86623c3a6e79446929763.hub .arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763 {
  background: rgba(217, 119, 87, 0.15);
}

.arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.66rem;
  color: var(--secondary);
  line-height: 1.5;
  padding-left: 1.4rem;
}

 
.arch-arrow-col-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: center;
  justify-content: center;
  position: relative;
  padding-top: 1.4rem;
}

.arch-arrow-c0aa3c6c6ba86623c3a6e79446929763 {
  color: var(--arch-accent);
  opacity: 0.5;
}

 
.arch-footer-c0aa3c6c6ba86623c3a6e79446929763 {
  margin-top: 1.25rem;
  padding-top: 1rem;
  border-top: 1px solid var(--border);
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1rem;
}

.arch-note-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.78rem;
  color: var(--secondary);
  line-height: 1.55;
}

.arch-note-label-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.58rem;
  font-weight: 500;
  color: var(--arch-accent);
  text-transform: uppercase;
  letter-spacing: 0.1em;
  display: block;
  margin-bottom: 0.3rem;
}

.arch-note-c0aa3c6c6ba86623c3a6e79446929763 code {
  font-family: 'JetBrains Mono', ui-monospace, monospace;
  font-size: 0.72rem;
  background: var(--code-bg);
  padding: 0.05rem 0.3rem;
  border-radius: 3px;
  color: var(--primary);
}

 
@media (max-width: 900px) {
  .arch-zones-c0aa3c6c6ba86623c3a6e79446929763 {
    grid-template-columns: 1fr;
    gap: 0.6rem;
  }
  .arch-arrow-col-c0aa3c6c6ba86623c3a6e79446929763 {
    transform: rotate(90deg);
    padding: 0.1rem 0;
  }
  .arch-footer-c0aa3c6c6ba86623c3a6e79446929763 { grid-template-columns: 1fr; }
}

@media (max-width: 520px) {
  .arch-c0aa3c6c6ba86623c3a6e79446929763 { padding: 1.25rem; }
}
</style>

<div class="arch-c0aa3c6c6ba86623c3a6e79446929763">
  <div class="arch-head-c0aa3c6c6ba86623c3a6e79446929763">
    <div class="arch-title-c0aa3c6c6ba86623c3a6e79446929763">The reference architecture</div>
    <div class="arch-kicker-c0aa3c6c6ba86623c3a6e79446929763">runtime → platform → providers</div>
  </div>
  <div class="arch-subtitle-c0aa3c6c6ba86623c3a6e79446929763">The harness runs on each engineer's laptop or in a CI container. Everything the harness calls into is the platform. Every subsequent chapter zooms into one box on this diagram.</div>

  <div class="arch-zones-c0aa3c6c6ba86623c3a6e79446929763" id="arch-zones-c0aa3c6c6ba86623c3a6e79446929763">
    
    <div class="arch-col-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="arch-col-head-c0aa3c6c6ba86623c3a6e79446929763">
        <span>Runtime</span>
        <span class="arch-col-count-c0aa3c6c6ba86623c3a6e79446929763">per-engineer · per-job</span>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><rect x="2" y="3" width="20" height="14" rx="2"/><line x1="8" y1="21" x2="16" y2="21"/><line x1="12" y1="17" x2="12" y2="21"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">Engineer's laptop</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">docker</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          harness (goose / claude code / cursor)<br>
          MCP clients + workspace mount
        </div>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><polyline points="16 18 22 12 16 6"/><polyline points="8 6 2 12 8 18"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">CI runner</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">headless</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          ephemeral container, headless harness<br>
          fired on PR / label / cron / webhook
        </div>
      </div>
    </div>

    
    <div class="arch-arrow-col-c0aa3c6c6ba86623c3a6e79446929763">
      <svg class="arch-arrow-c0aa3c6c6ba86623c3a6e79446929763" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><line x1="5" y1="12" x2="19" y2="12"/><polyline points="12 5 19 12 12 19"/></svg>
    </div>

    
    <div class="arch-col-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="arch-col-head-c0aa3c6c6ba86623c3a6e79446929763 accent">
        <span>Platform</span>
        <span class="arch-col-count-c0aa3c6c6ba86623c3a6e79446929763">network services</span>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763 hub">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M12 2L2 7l10 5 10-5-10-5z"/><path d="M2 17l10 5 10-5"/><path d="M2 12l10 5 10-5"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">LLM gateway</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">hub</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          auth · tiered routing · prompt cache<br>
          cost attribution · rate limits
        </div>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M14.7 6.3a1 1 0 0 0 0 1.4l1.6 1.6a1 1 0 0 0 1.4 0l3.77-3.77a6 6 0 0 1-7.94 7.94l-6.91 6.91a2.12 2.12 0 0 1-3-3l6.91-6.91a6 6 0 0 1 7.94-7.94l-3.76 3.76z"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">MCP gateway + Skills</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">registry</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          approved MCP list · progressive disclosure<br>
          Skills library · Recipe registry
        </div>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><ellipse cx="12" cy="5" rx="9" ry="3"/><path d="M3 5v14a9 3 0 0 0 18 0V5"/><path d="M3 12a9 3 0 0 0 18 0"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">Context API</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">retrieval</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          BM25 + dense + graph<br>
          ACL-aware · staleness SLO
        </div>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M12 22s8-4 8-10V5l-8-3-8 3v7c0 6 8 10 8 10z"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">Policy service</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">authz</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          scoped tokens · approval routing<br>
          sandbox-tier selection
        </div>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M9 11l3 3L22 4"/><path d="M21 12v7a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V5a2 2 0 0 1 2-2h11"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">Eval harness</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">ci-gated</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          golden set · LLM-as-judge<br>
          shadow traffic · online signals
        </div>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><circle cx="18" cy="5" r="3"/><circle cx="6" cy="12" r="3"/><circle cx="18" cy="19" r="3"/><line x1="8.6" y1="13.5" x2="15.4" y2="17.5"/><line x1="15.4" y1="6.5" x2="8.6" y2="10.5"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">Telemetry bus</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">otel</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          traces · run records · cost events<br>
          per-team / per-repo / per-task
        </div>
      </div>
    </div>

    
    <div class="arch-arrow-col-c0aa3c6c6ba86623c3a6e79446929763">
      <svg class="arch-arrow-c0aa3c6c6ba86623c3a6e79446929763" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><line x1="5" y1="12" x2="19" y2="12"/><polyline points="12 5 19 12 12 19"/></svg>
    </div>

    
    <div class="arch-col-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="arch-col-head-c0aa3c6c6ba86623c3a6e79446929763">
        <span>Providers &amp; sources</span>
        <span class="arch-col-count-c0aa3c6c6ba86623c3a6e79446929763">external</span>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><circle cx="12" cy="12" r="10"/><path d="M12 2a15.3 15.3 0 0 1 4 10 15.3 15.3 0 0 1-4 10 15.3 15.3 0 0 1-4-10 15.3 15.3 0 0 1 4-10z"/><path d="M2 12h20"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">Model providers</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">vendor</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          Anthropic · OpenAI · Google<br>
          internal + open-weight hosts
        </div>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M18 10h-1.26A8 8 0 1 0 9 20h9a5 5 0 0 0 0-10z"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">Curated MCP servers</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">tools</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          github · jira · cloud<br>
          feature-flags · incident
        </div>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">Source systems</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">indexed</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          repos · docs · tickets<br>
          incidents · service catalog
        </div>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M17 21v-2a4 4 0 0 0-4-4H5a4 4 0 0 0-4 4v2"/><circle cx="9" cy="7" r="4"/><path d="M23 21v-2a4 4 0 0 0-3-3.87"/><path d="M16 3.13a4 4 0 0 1 0 7.75"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">Identity</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">cross-cutting</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          IdP (human principal)<br>
          service registry (agent id)
        </div>
      </div>

      <div class="arch-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="arch-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <svg class="arch-card-icon-c0aa3c6c6ba86623c3a6e79446929763" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><ellipse cx="12" cy="5" rx="9" ry="3"/><path d="M3 5v4c0 1.66 4.03 3 9 3s9-1.34 9-3V5"/><path d="M3 12c0 1.66 4.03 3 9 3s9-1.34 9-3"/><path d="M3 19c0 1.66 4.03 3 9 3s9-1.34 9-3V5"/></svg>
          <div class="arch-card-name-c0aa3c6c6ba86623c3a6e79446929763">Trace + cost DB</div>
          <div class="arch-card-tag-c0aa3c6c6ba86623c3a6e79446929763">audit</div>
        </div>
        <div class="arch-card-meta-c0aa3c6c6ba86623c3a6e79446929763">
          queryable run history<br>
          chargeback source-of-truth
        </div>
      </div>
    </div>
  </div>

  <div class="arch-footer-c0aa3c6c6ba86623c3a6e79446929763">
    <div class="arch-note-c0aa3c6c6ba86623c3a6e79446929763">
      <span class="arch-note-label-c0aa3c6c6ba86623c3a6e79446929763">read the flow</span>
      The runtime reaches the platform over the network. The platform calls out to model providers, approved MCP servers, and indexed source systems. The gateway is the one piece every call traverses.
    </div>
    <div class="arch-note-c0aa3c6c6ba86623c3a6e79446929763">
      <span class="arch-note-label-c0aa3c6c6ba86623c3a6e79446929763">sandbox note</span>
      The harness runs <em>inside</em> the Docker container in the Runtime column. For stronger isolation (multi-tenant, cross-org), upgrade that container to <code>gVisor</code> or <code>Firecracker</code> per Chapter 4.
    </div>
  </div>
</div>

<script>
(function() {
  var id = 'c0aa3c6c6ba86623c3a6e79446929763';
  var cards = document.querySelectorAll('.arch-' + id + ' .arch-card-' + id);
  for (var i = 0; i < cards.length; i++) {
    (function(el, delay) {
      setTimeout(function() { el.classList.add('visible'); }, delay);
    })(cards[i], 80 + i * 50);
  }
})();
</script>

<p>When multiple agents cooperate within this system, the topology that ships is orchestrator-worker, not swarm. Airbnb&rsquo;s <a href="https://medium.com/airbnb-engineering/accelerating-large-scale-test-migration-with-llms-9565c208023b">3,500-file Enzyme-to-RTL migration</a> showed the pattern working at scale: per-file parallel workers, central orchestration, brute-force retries with dynamic prompts. 97% automated, 6 weeks, 6 engineers. Swarms, by contrast, are the dominant source of silent failure because a single hallucination in shared memory propagates to every peer that reads it. Chapter 5 shows how Goose sub-recipes implement the orchestrator-worker pattern concretely.</p>
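<p>A minimal sketch of that orchestrator-worker shape, in Python. It is not Airbnb&rsquo;s pipeline and not a Goose sub-recipe: <code>fake_worker</code>, <code>build_prompt</code>, and the <code>FAILURES_BEFORE_SUCCESS</code> table are hypothetical stand-ins for a model call plus a validation step. The point it illustrates is structural: workers run per file in parallel, never share memory, and each retry gets a prompt rebuilt from that file&rsquo;s own earlier errors.</p>

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

# Hypothetical: how many failed attempts each file needs before it passes.
FAILURES_BEFORE_SUCCESS = {"a.test.js": 0, "b.test.js": 2, "c.test.js": 5}


@dataclass
class Result:
    path: str
    ok: bool
    attempts: int
    errors: list = field(default_factory=list)


def build_prompt(path, errors):
    """Dynamic prompt: every retry carries the errors from earlier attempts."""
    prompt = f"Migrate {path} from Enzyme to RTL."
    if errors:
        prompt += " Previous attempts failed with: " + "; ".join(errors)
    return prompt


def fake_worker(path, prompt):
    """Stand-in for 'call the model, then run the tests'. Returns (ok, error)."""
    seen = prompt.count("lint error")  # failures already fed back in the prompt
    if seen >= FAILURES_BEFORE_SUCCESS[path]:
        return True, None
    return False, f"lint error #{seen + 1}"


def migrate_file(path, worker, max_attempts=3):
    """One worker: retry a single file, feeding its own errors back in."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        ok, error = worker(path, build_prompt(path, errors))
        if ok:
            return Result(path, True, attempt, errors)
        errors.append(error)
    return Result(path, False, max_attempts, errors)


def orchestrate(paths, worker, max_workers=4, max_attempts=3):
    """Orchestrator: fan out per file, collect results centrally, in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: migrate_file(p, worker, max_attempts), paths))


if __name__ == "__main__":
    for r in orchestrate(list(FAILURES_BEFORE_SUCCESS), fake_worker):
        print(r.path, "ok" if r.ok else "needs a human", f"after {r.attempts} attempt(s)")
```

<p>Because each worker only ever sees its own file and its own error history, a bad output stays contained to one <code>Result</code>; files that exhaust their retries fall out as an explicit list for humans rather than leaking into shared state.</p>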
<p>If any of these responsibilities isn&rsquo;t owned by a named team with a roadmap, you don&rsquo;t have a platform. You have a shadow-IT problem that will compound. These five exist to enable one thing: the workflow lifecycle. Chapter 5 names it; everything else makes it safe, cheap, and measurable.</p>
<h2 id="chapter-2-the-capability-surface">Chapter 2: The Capability Surface</h2>
<p>The first architectural argument inside every platform team is about how agents get things done. It usually gets framed as a choice between the <a href="https://modelcontextprotocol.io/">Model Context Protocol (MCP)</a>, stewarded since December 2025 by the <a href="https://dev.to/chunxiaoxx/2026-mcp-trends-the-shift-to-enterprise-ready-agentic-workflows-48lp">Linux Foundation&rsquo;s Agentic AI Foundation</a>, and Agent Skills, the behavioural-instruction packages that started shipping with Claude.</p>
<p>Framing them as competitors is a category error. MCP is an <em>execution fabric</em>: a standardised RPC for tools, resources, and prompt templates, with bidirectional comms and dynamic tool discovery. Skills are a <em>knowledge layer</em>: portable instructions that encode the how-to of a specific job. MCP tells an agent what it <em>can</em> do. Skills tell an agent what it <em>should</em> do in a specific situation.</p>
<p>The context-bloat failure is the sharper risk. A single enterprise-grade GitHub MCP server exposes 90+ tools. Loaded naively, that&rsquo;s 50,000+ tokens of schema entering the context window <em>before the model has read a single line of user intent</em>. Add Jira, the cloud provider, a feature-flag platform, your incident system, and your agent is spending six figures a year on tokens that are just tool catalogue. The overhead scales linearly with the number of services you connect.</p>
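<p>The arithmetic is worth doing once. A back-of-envelope sketch, assuming the 50,000-token schema load above, and borrowing the $3 per million input tokens and one million calls a year that the figure below uses as illustrative values. Swap in your own price and volume.</p>

```python
def overhead_cost(schema_tokens: int, usd_per_million_input: float,
                  calls_per_year: int) -> tuple[float, float]:
    """Return (cost per call, cost per year) for tool-schema tokens alone."""
    per_call = schema_tokens / 1_000_000 * usd_per_million_input
    return per_call, per_call * calls_per_year


# Assumed inputs: 50k schema tokens, $3 per 1M input tokens, 1M calls a year.
per_call, per_year = overhead_cost(50_000, 3.0, 1_000_000)
print(f"${per_call:.2f} per call, ${per_year:,.0f} per year")
```

<p>Fifteen cents a call looks harmless. It is the multiplication by call volume that turns a tool catalogue into a line item.</p>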
<p>The three-tier model that scales:</p>
<ol>
<li><strong>Built-ins</strong>: the small primitive set the harness ships (file ops, shell, code execution). Always loaded.</li>
<li><strong>Curated MCP registry</strong>: a governed list of approved MCP servers with progressive disclosure. The agent sees metadata first, loads full tool schemas on semantic match.</li>
<li><strong>Skills library</strong>: org-specific playbooks in a searchable registry, discovered by description, expanded on demand.</li>
</ol>
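<p>Progressive disclosure in tier two can be sketched as below. This is a toy under stated assumptions: <code>McpServer</code>, <code>REGISTRY</code>, and the four-characters-per-token estimate are hypothetical, and keyword overlap stands in for a real semantic match. It is not the MCP API.</p>

```python
from dataclasses import dataclass, field


@dataclass
class McpServer:
    name: str
    description: str  # cheap metadata, always visible to the agent
    tool_schemas: dict[str, str] = field(default_factory=dict)  # loaded on match


def approx_tokens(text: str) -> int:
    """Crude estimate of about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def matches(task: str, server: McpServer) -> bool:
    """Keyword overlap as a stand-in for a real semantic match."""
    return bool(set(task.lower().split()) & set(server.description.lower().split()))


def build_context(task: str, registry: list[McpServer]) -> tuple[list[str], int]:
    """Always include metadata; expand full schemas only for matching servers."""
    lines, tokens = [], 0
    for server in registry:
        summary = f"{server.name}: {server.description}"
        lines.append(summary)
        tokens += approx_tokens(summary)
        if matches(task, server):
            for tool, schema in server.tool_schemas.items():
                line = f"{tool} {schema}"
                lines.append(line)
                tokens += approx_tokens(line)
    return lines, tokens


# Hypothetical registry with placeholder schemas.
REGISTRY = [
    McpServer("github", "pull requests issues repos",
              {"gh.pulls.create": '{"title": "string", "base": "string"}'}),
    McpServer("incident", "paging incidents postmortems",
              {"incident.page": '{"service": "string", "severity": "string"}'}),
]

lines, tokens = build_context("open pull requests for the migration", REGISTRY)
```

<p>For that task the agent sees both server summaries but only the GitHub schemas. The incident tooling costs one line of metadata until a task actually needs it.</p>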




<style>
.cts-c0aa3c6c6ba86623c3a6e79446929763 {
  --cts-accent: #c96442;
  --cts-accent-soft: rgba(201, 100, 66, 0.08);
  --cts-accent-border: rgba(201, 100, 66, 0.35);

  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif;
  background: var(--entry);
  color: var(--primary);
  border: 1px solid var(--border);
  border-radius: 6px;
  padding: 1.75rem;
  margin: 2rem 0;
}

.dark .cts-c0aa3c6c6ba86623c3a6e79446929763 {
  --cts-accent: #d97757;
  --cts-accent-soft: rgba(217, 119, 87, 0.1);
  --cts-accent-border: rgba(217, 119, 87, 0.4);
}

.cts-c0aa3c6c6ba86623c3a6e79446929763 * { box-sizing: border-box; }

.cts-head-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: baseline;
  justify-content: space-between;
  gap: 1rem;
  padding-bottom: 0.85rem;
  margin-bottom: 1rem;
  border-bottom: 1px solid var(--border);
}

.cts-title-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 1.05rem;
  font-weight: 600;
  color: var(--primary);
  letter-spacing: -0.01em;
}

.cts-kicker-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.65rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
}

.cts-subtitle-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.85rem;
  color: var(--secondary);
  margin-bottom: 1.25rem;
  line-height: 1.55;
}

 
.cts-toggle-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  gap: 0;
  margin-bottom: 1.25rem;
  border-bottom: 1px solid var(--border);
}

.cts-mode-c0aa3c6c6ba86623c3a6e79446929763 {
  padding: 0.7rem 1rem 0.8rem;
  background: transparent;
  border: none;
  border-bottom: 2px solid transparent;
  margin-bottom: -1px;
  cursor: pointer;
  font-family: inherit;
  color: var(--secondary);
  text-align: left;
  transition: color 0.2s ease, border-color 0.2s ease;
  display: flex;
  flex-direction: column;
  gap: 0.15rem;
}

.cts-mode-c0aa3c6c6ba86623c3a6e79446929763:hover { color: var(--primary); }

.cts-mode-c0aa3c6c6ba86623c3a6e79446929763.active {
  color: var(--cts-accent);
  border-bottom-color: var(--cts-accent);
}

.cts-mode-name-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.88rem;
  font-weight: 600;
  letter-spacing: -0.005em;
}

.cts-mode-sub-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.62rem;
  font-weight: 500;
  letter-spacing: 0.05em;
  color: var(--secondary);
}

.cts-mode-c0aa3c6c6ba86623c3a6e79446929763.active .cts-mode-sub-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--cts-accent); opacity: 0.75; }

 
.cts-body-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: 1.3fr 1fr;
  gap: 1.25rem;
  align-items: stretch;
}

.cts-panel-c0aa3c6c6ba86623c3a6e79446929763 {
  border: 1px solid var(--border);
  border-radius: 4px;
  padding: 1rem 1.1rem;
  background: var(--entry);
}

.cts-panel-head-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  justify-content: space-between;
  align-items: baseline;
  margin-bottom: 0.65rem;
}

.cts-panel-title-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.62rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
}

.cts-panel-sub-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.68rem;
  color: var(--secondary);
}

 
.cts-bar-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  height: 32px;
  border-radius: 3px;
  overflow: hidden;
  background: var(--code-bg);
  border: 1px solid var(--border);
}

.cts-seg-c0aa3c6c6ba86623c3a6e79446929763 {
  height: 100%;
  display: flex;
  align-items: center;
  justify-content: center;
  font-family: 'JetBrains Mono', ui-monospace, monospace;
  font-size: 0.62rem;
  font-weight: 500;
  color: var(--primary);
  transition: width 0.6s cubic-bezier(0.4, 0, 0.2, 1);
  overflow: hidden;
  white-space: nowrap;
  border-right: 1px solid var(--entry);
}

.cts-seg-c0aa3c6c6ba86623c3a6e79446929763:last-child { border-right: none; }

.cts-seg-builtins-c0aa3c6c6ba86623c3a6e79446929763 { background: var(--tertiary); color: var(--primary); }
.cts-seg-mcp-c0aa3c6c6ba86623c3a6e79446929763      { background: var(--cts-accent); color: #fff; }
.cts-seg-skills-c0aa3c6c6ba86623c3a6e79446929763   { background: var(--cts-accent); opacity: 0.6; color: #fff; }
.cts-seg-context-c0aa3c6c6ba86623c3a6e79446929763  { background: var(--secondary); color: var(--entry); }
.cts-seg-headroom-c0aa3c6c6ba86623c3a6e79446929763 { background: transparent; color: var(--secondary); }

.cts-legend-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(130px, 1fr));
  gap: 0.3rem 0.8rem;
  margin-top: 0.75rem;
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.7rem;
  color: var(--secondary);
}

.cts-legend-item-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: center;
  gap: 0.4rem;
}

.cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763 {
  width: 8px;
  height: 8px;
  border-radius: 2px;
  flex-shrink: 0;
  border: 1px solid var(--border);
}

.cts-legend-val-c0aa3c6c6ba86623c3a6e79446929763 {
  margin-left: auto;
  color: var(--primary);
  font-weight: 500;
}

.cts-legend-builtins-c0aa3c6c6ba86623c3a6e79446929763 .cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763 { background: var(--tertiary); }
.cts-legend-mcp-c0aa3c6c6ba86623c3a6e79446929763      .cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763 { background: var(--cts-accent); }
.cts-legend-skills-c0aa3c6c6ba86623c3a6e79446929763   .cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763 { background: var(--cts-accent); opacity: 0.6; }
.cts-legend-context-c0aa3c6c6ba86623c3a6e79446929763  .cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763 { background: var(--secondary); }
.cts-legend-headroom-c0aa3c6c6ba86623c3a6e79446929763 .cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763 { background: transparent; }

 
.cts-tools-c0aa3c6c6ba86623c3a6e79446929763 { margin-top: 1rem; }

.cts-grid-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  gap: 4px;
  max-height: 185px;
  overflow-y: auto;
  padding-right: 2px;
  transition: grid-template-columns 0.3s ease;
}

.cts-grid-c0aa3c6c6ba86623c3a6e79446929763.flat {
  grid-template-columns: repeat(auto-fill, minmax(88px, 1fr));
}

.cts-grid-c0aa3c6c6ba86623c3a6e79446929763.progressive {
  grid-template-columns: 1fr;
}

.cts-chip-c0aa3c6c6ba86623c3a6e79446929763 {
  padding: 0.25rem 0.4rem;
  background: var(--cts-accent-soft);
  border: 1px solid var(--cts-accent-border);
  border-radius: 3px;
  font-family: 'JetBrains Mono', ui-monospace, monospace;
  font-size: 0.58rem;
  color: var(--cts-accent);
  text-align: center;
  white-space: nowrap;
  overflow: hidden;
  text-overflow: ellipsis;
}

.cts-server-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: 22px 1fr auto;
  gap: 0.5rem;
  align-items: center;
  padding: 0.4rem 0.6rem;
  background: var(--code-bg);
  border: 1px solid var(--border);
  border-radius: 3px;
}

.cts-server-icon-c0aa3c6c6ba86623c3a6e79446929763 {
  width: 22px;
  height: 22px;
  display: flex;
  align-items: center;
  justify-content: center;
  color: var(--secondary);
}

.cts-server-text-c0aa3c6c6ba86623c3a6e79446929763 { min-width: 0; }

.cts-server-name-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.78rem;
  font-weight: 500;
  color: var(--primary);
}

.cts-server-meta-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.62rem;
  color: var(--secondary);
}

.cts-server-badge-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.58rem;
  color: var(--secondary);
  padding: 0.15rem 0.4rem;
  background: var(--entry);
  border: 1px solid var(--border);
  border-radius: 3px;
  text-transform: uppercase;
  letter-spacing: 0.08em;
  white-space: nowrap;
}

 
.cts-kpis-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  flex-direction: column;
  gap: 0.65rem;
}

.cts-kpi-c0aa3c6c6ba86623c3a6e79446929763 {
  border: 1px solid var(--border);
  border-radius: 4px;
  padding: 0.75rem 0.9rem;
  background: var(--entry);
}

.cts-kpi-c0aa3c6c6ba86623c3a6e79446929763.bad {
  border-color: var(--cts-accent-border);
  background: var(--cts-accent-soft);
}

.cts-kpi-label-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.58rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
  margin-bottom: 0.25rem;
}

.cts-kpi-c0aa3c6c6ba86623c3a6e79446929763.bad .cts-kpi-label-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--cts-accent); }

.cts-kpi-value-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 1.2rem;
  font-weight: 600;
  color: var(--primary);
  letter-spacing: -0.01em;
  font-variant-numeric: tabular-nums;
}

.cts-kpi-c0aa3c6c6ba86623c3a6e79446929763.bad .cts-kpi-value-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--cts-accent); }

.cts-kpi-sub-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.72rem;
  color: var(--secondary);
  line-height: 1.4;
  margin-top: 0.2rem;
}

 
.cts-footer-c0aa3c6c6ba86623c3a6e79446929763 {
  margin-top: 1.25rem;
  padding-top: 1rem;
  border-top: 1px solid var(--border);
  font-size: 0.85rem;
  color: var(--secondary);
  line-height: 1.6;
}

.cts-footer-c0aa3c6c6ba86623c3a6e79446929763 strong { color: var(--cts-accent); font-weight: 600; }

@media (max-width: 860px) {
  .cts-body-c0aa3c6c6ba86623c3a6e79446929763 { grid-template-columns: 1fr; }
}

@media (max-width: 520px) {
  .cts-c0aa3c6c6ba86623c3a6e79446929763 { padding: 1.25rem; }
}
</style>

<div class="cts-c0aa3c6c6ba86623c3a6e79446929763">
  <div class="cts-head-c0aa3c6c6ba86623c3a6e79446929763">
    <div class="cts-title-c0aa3c6c6ba86623c3a6e79446929763">Capability surface: flat vs progressive</div>
    <div class="cts-kicker-c0aa3c6c6ba86623c3a6e79446929763">same 5 servers · same task</div>
  </div>
  <div class="cts-subtitle-c0aa3c6c6ba86623c3a6e79446929763">Same five MCP servers (GitHub, Jira, Cloud, Feature Flags, Incident). Same task. Different token math. Flip the tab.</div>

  <div class="cts-toggle-c0aa3c6c6ba86623c3a6e79446929763" role="tablist">
    <button class="cts-mode-c0aa3c6c6ba86623c3a6e79446929763 active" data-mode="flat" role="tab">
      <span class="cts-mode-name-c0aa3c6c6ba86623c3a6e79446929763">Flat load</span>
      <span class="cts-mode-sub-c0aa3c6c6ba86623c3a6e79446929763">dump every tool into the prompt</span>
    </button>
    <button class="cts-mode-c0aa3c6c6ba86623c3a6e79446929763" data-mode="progressive" role="tab">
      <span class="cts-mode-name-c0aa3c6c6ba86623c3a6e79446929763">Progressive disclosure</span>
      <span class="cts-mode-sub-c0aa3c6c6ba86623c3a6e79446929763">metadata first · expand on match</span>
    </button>
  </div>

  <div class="cts-body-c0aa3c6c6ba86623c3a6e79446929763">
    <div>
      <div class="cts-panel-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="cts-panel-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="cts-panel-title-c0aa3c6c6ba86623c3a6e79446929763">Context window (200k tokens)</div>
          <div class="cts-panel-sub-c0aa3c6c6ba86623c3a6e79446929763">overhead: <span id="cts-overhead-pct-c0aa3c6c6ba86623c3a6e79446929763">35%</span></div>
        </div>

        <div class="cts-bar-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="cts-seg-c0aa3c6c6ba86623c3a6e79446929763 cts-seg-builtins-c0aa3c6c6ba86623c3a6e79446929763" id="cts-seg-builtins-c0aa3c6c6ba86623c3a6e79446929763" style="width: 2%"></div>
          <div class="cts-seg-c0aa3c6c6ba86623c3a6e79446929763 cts-seg-mcp-c0aa3c6c6ba86623c3a6e79446929763" id="cts-seg-mcp-c0aa3c6c6ba86623c3a6e79446929763" style="width: 26%">MCP</div>
          <div class="cts-seg-c0aa3c6c6ba86623c3a6e79446929763 cts-seg-skills-c0aa3c6c6ba86623c3a6e79446929763" id="cts-seg-skills-c0aa3c6c6ba86623c3a6e79446929763" style="width: 9%">Skills</div>
          <div class="cts-seg-c0aa3c6c6ba86623c3a6e79446929763 cts-seg-context-c0aa3c6c6ba86623c3a6e79446929763" id="cts-seg-context-c0aa3c6c6ba86623c3a6e79446929763" style="width: 4%"></div>
          <div class="cts-seg-c0aa3c6c6ba86623c3a6e79446929763 cts-seg-headroom-c0aa3c6c6ba86623c3a6e79446929763" id="cts-seg-headroom-c0aa3c6c6ba86623c3a6e79446929763" style="width: 59%">Headroom</div>
        </div>

        <div class="cts-legend-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="cts-legend-item-c0aa3c6c6ba86623c3a6e79446929763 cts-legend-builtins-c0aa3c6c6ba86623c3a6e79446929763">
            <div class="cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763"></div>
            built-ins
            <span class="cts-legend-val-c0aa3c6c6ba86623c3a6e79446929763" id="cts-lg-builtins-c0aa3c6c6ba86623c3a6e79446929763">4.0k</span>
          </div>
          <div class="cts-legend-item-c0aa3c6c6ba86623c3a6e79446929763 cts-legend-mcp-c0aa3c6c6ba86623c3a6e79446929763">
            <div class="cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763"></div>
            mcp schemas
            <span class="cts-legend-val-c0aa3c6c6ba86623c3a6e79446929763" id="cts-lg-mcp-c0aa3c6c6ba86623c3a6e79446929763">52.0k</span>
          </div>
          <div class="cts-legend-item-c0aa3c6c6ba86623c3a6e79446929763 cts-legend-skills-c0aa3c6c6ba86623c3a6e79446929763">
            <div class="cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763"></div>
            skills
            <span class="cts-legend-val-c0aa3c6c6ba86623c3a6e79446929763" id="cts-lg-skills-c0aa3c6c6ba86623c3a6e79446929763">18.0k</span>
          </div>
          <div class="cts-legend-item-c0aa3c6c6ba86623c3a6e79446929763 cts-legend-context-c0aa3c6c6ba86623c3a6e79446929763">
            <div class="cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763"></div>
            org context
            <span class="cts-legend-val-c0aa3c6c6ba86623c3a6e79446929763" id="cts-lg-context-c0aa3c6c6ba86623c3a6e79446929763">8.0k</span>
          </div>
          <div class="cts-legend-item-c0aa3c6c6ba86623c3a6e79446929763 cts-legend-headroom-c0aa3c6c6ba86623c3a6e79446929763">
            <div class="cts-legend-swatch-c0aa3c6c6ba86623c3a6e79446929763"></div>
            headroom
            <span class="cts-legend-val-c0aa3c6c6ba86623c3a6e79446929763" id="cts-lg-headroom-c0aa3c6c6ba86623c3a6e79446929763">118.0k</span>
          </div>
        </div>
      </div>

      <div class="cts-panel-c0aa3c6c6ba86623c3a6e79446929763 cts-tools-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="cts-panel-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="cts-panel-title-c0aa3c6c6ba86623c3a6e79446929763" id="cts-tools-title-c0aa3c6c6ba86623c3a6e79446929763">Tool schemas loaded</div>
          <div class="cts-panel-sub-c0aa3c6c6ba86623c3a6e79446929763" id="cts-tools-count-c0aa3c6c6ba86623c3a6e79446929763">93 schemas</div>
        </div>
        <div class="cts-grid-c0aa3c6c6ba86623c3a6e79446929763 flat" id="cts-grid-c0aa3c6c6ba86623c3a6e79446929763"></div>
      </div>
    </div>

    <div class="cts-kpis-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="cts-kpi-c0aa3c6c6ba86623c3a6e79446929763 bad">
        <div class="cts-kpi-label-c0aa3c6c6ba86623c3a6e79446929763">overhead before reasoning</div>
        <div class="cts-kpi-value-c0aa3c6c6ba86623c3a6e79446929763" id="cts-val-overhead-c0aa3c6c6ba86623c3a6e79446929763">70,000 tokens</div>
        <div class="cts-kpi-sub-c0aa3c6c6ba86623c3a6e79446929763">Schemas + skill docs, loaded before the model reads the task.</div>
      </div>
      <div class="cts-kpi-c0aa3c6c6ba86623c3a6e79446929763 bad">
        <div class="cts-kpi-label-c0aa3c6c6ba86623c3a6e79446929763">cost per call @ sonnet</div>
        <div class="cts-kpi-value-c0aa3c6c6ba86623c3a6e79446929763" id="cts-val-call-c0aa3c6c6ba86623c3a6e79446929763">$0.210</div>
        <div class="cts-kpi-sub-c0aa3c6c6ba86623c3a6e79446929763">$3 / 1M input tokens, overhead only.</div>
      </div>
      <div class="cts-kpi-c0aa3c6c6ba86623c3a6e79446929763 bad">
        <div class="cts-kpi-label-c0aa3c6c6ba86623c3a6e79446929763">annual burn @ 1M calls</div>
        <div class="cts-kpi-value-c0aa3c6c6ba86623c3a6e79446929763" id="cts-val-annual-c0aa3c6c6ba86623c3a6e79446929763">$210,000</div>
        <div class="cts-kpi-sub-c0aa3c6c6ba86623c3a6e79446929763">Overhead alone. Reasoning is extra.</div>
      </div>
      <div class="cts-kpi-c0aa3c6c6ba86623c3a6e79446929763 bad">
        <div class="cts-kpi-label-c0aa3c6c6ba86623c3a6e79446929763">effective headroom</div>
        <div class="cts-kpi-value-c0aa3c6c6ba86623c3a6e79446929763" id="cts-val-head-c0aa3c6c6ba86623c3a6e79446929763">118k / 200k</div>
        <div class="cts-kpi-sub-c0aa3c6c6ba86623c3a6e79446929763">Share of the window available for task + reasoning.</div>
      </div>
    </div>
  </div>

  <div class="cts-footer-c0aa3c6c6ba86623c3a6e79446929763">
    Block&rsquo;s Goose <strong>sub-recipes</strong> run in isolated sub-sessions with their own context windows. That is progressive disclosure as an architectural primitive, not a configuration flag.
  </div>
</div>

<script>
(function() {
  var id = 'c0aa3c6c6ba86623c3a6e79446929763';

  var flatTools = [
    'gh.repos.list','gh.repos.get','gh.repos.create','gh.repos.fork','gh.repos.delete',
    'gh.issues.list','gh.issues.get','gh.issues.create','gh.issues.update','gh.issues.close',
    'gh.issues.comment','gh.issues.label','gh.issues.assign','gh.issues.lock',
    'gh.pulls.list','gh.pulls.get','gh.pulls.create','gh.pulls.update','gh.pulls.merge',
    'gh.pulls.review','gh.pulls.comment','gh.pulls.files','gh.pulls.commits',
    'gh.branches.list','gh.branches.get','gh.branches.protect','gh.branches.delete',
    'gh.commits.list','gh.commits.get','gh.commits.compare','gh.commits.status',
    'gh.workflows.list','gh.workflows.run','gh.workflows.cancel','gh.workflows.logs',
    'gh.checks.create','gh.checks.update','gh.checks.list',
    'gh.releases.list','gh.releases.create','gh.releases.upload','gh.releases.delete',
    'gh.search.code','gh.search.issues','gh.search.repos','gh.search.users',
    'jira.issue.get','jira.issue.create','jira.issue.update','jira.issue.transition',
    'jira.issue.comment','jira.issue.link','jira.issue.assign','jira.issue.search',
    'jira.project.list','jira.project.get','jira.board.list','jira.sprint.list',
    'jira.sprint.start','jira.sprint.close','jira.user.search',
    'cloud.vm.list','cloud.vm.start','cloud.vm.stop','cloud.vm.resize',
    'cloud.db.list','cloud.db.snapshot','cloud.db.restore',
    'cloud.bucket.list','cloud.bucket.put','cloud.bucket.delete',
    'cloud.iam.list','cloud.iam.grant','cloud.iam.revoke',
    'cloud.kube.apply','cloud.kube.scale','cloud.kube.rollback',
    'flag.list','flag.get','flag.create','flag.toggle','flag.target',
    'flag.rollout','flag.archive','flag.history',
    'incident.list','incident.create','incident.update','incident.page',
    'incident.resolve','incident.postmortem','incident.timeline','incident.link'
  ];

  var progressiveServers = [
    { name: 'github-enterprise',  meta: '46 tools · 23.1k loaded on match' },
    { name: 'atlassian-jira',     meta: '15 tools · 8.4k loaded on match'  },
    { name: 'cloud-infra',        meta: '16 tools · 11.2k loaded on match' },
    { name: 'feature-flags',      meta: '8 tools · 3.8k loaded on match'   },
    { name: 'incident-commander', meta: '8 tools · 4.9k loaded on match'   }
  ];

  var state = {
    flat:        { builtins: 4000, mcp: 52000, skills: 18000, context: 8000, total: 200000 },
    progressive: { builtins: 4000, mcp: 3200,  skills: 1100,  context: 8000, total: 200000 }
  };

  var SONNET_INPUT_PER_M = 3.0;

  var grid = document.getElementById('cts-grid-' + id);
  var toolsTitle = document.getElementById('cts-tools-title-' + id);
  var toolsCount = document.getElementById('cts-tools-count-' + id);

  function fmtK(n) { return n < 1000 ? n.toFixed(0) : (n / 1000).toFixed(1) + 'k'; }
  function fmtNumber(n) { return n.toLocaleString(undefined, { maximumFractionDigits: 0 }); }
  function fmtCurrency(n, d) { return '$' + n.toLocaleString(undefined, { minimumFractionDigits: d || 0, maximumFractionDigits: d || 0 }); }

  function renderFlatGrid() {
    grid.className = 'cts-grid-' + id + ' flat';
    grid.innerHTML = '';
    for (var i = 0; i < flatTools.length; i++) {
      var chip = document.createElement('div');
      chip.className = 'cts-chip-' + id;
      chip.textContent = flatTools[i];
      grid.appendChild(chip);
    }
    toolsTitle.textContent = 'Tool schemas loaded';
    toolsCount.textContent = flatTools.length + ' schemas';
  }

  function renderProgressiveGrid() {
    grid.className = 'cts-grid-' + id + ' progressive';
    grid.innerHTML = '';
    for (var i = 0; i < progressiveServers.length; i++) {
      var s = progressiveServers[i];
      var row = document.createElement('div');
      row.className = 'cts-server-' + id;
      row.innerHTML =
        '<div class="cts-server-icon-' + id + '">' +
        '<svg width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><path d="M18 10h-1.26A8 8 0 1 0 9 20h9a5 5 0 0 0 0-10z"/></svg>' +
        '</div>' +
        '<div class="cts-server-text-' + id + '">' +
        '<div class="cts-server-name-' + id + '">' + s.name + '</div>' +
        '<div class="cts-server-meta-' + id + '">' + s.meta + '</div>' +
        '</div>' +
        '<div class="cts-server-badge-' + id + '">metadata only</div>';
      grid.appendChild(row);
    }
    toolsTitle.textContent = 'MCP servers (metadata only)';
    toolsCount.textContent = progressiveServers.length + ' servers';
  }

  function animateNumber(el, to, formatter, duration) {
    var from = parseFloat(el.getAttribute('data-value') || '0');
    var start = performance.now();
    duration = duration || 500;
    function step(now) {
      var t = Math.min(1, (now - start) / duration);
      var eased = 1 - Math.pow(1 - t, 3);
      var v = from + (to - from) * eased;
      el.textContent = formatter(v);
      if (t < 1) requestAnimationFrame(step);
      else el.setAttribute('data-value', to);
    }
    requestAnimationFrame(step);
  }

  function applyMode(mode) {
    var s = state[mode];
    var overhead = s.mcp + s.skills;
    var headroom = s.total - s.builtins - s.mcp - s.skills - s.context;
    var isBad = mode === 'flat';

    document.getElementById('cts-seg-builtins-' + id).style.width = (s.builtins / s.total * 100) + '%';
    document.getElementById('cts-seg-mcp-' + id).style.width      = (s.mcp      / s.total * 100) + '%';
    document.getElementById('cts-seg-skills-' + id).style.width   = (s.skills   / s.total * 100) + '%';
    document.getElementById('cts-seg-context-' + id).style.width  = (s.context  / s.total * 100) + '%';
    document.getElementById('cts-seg-headroom-' + id).style.width = (headroom   / s.total * 100) + '%';

    
    document.getElementById('cts-seg-mcp-' + id).textContent      = s.mcp      / s.total > 0.08 ? 'MCP' : '';
    document.getElementById('cts-seg-skills-' + id).textContent   = s.skills   / s.total > 0.08 ? 'Skills' : '';
    document.getElementById('cts-seg-headroom-' + id).textContent = headroom   / s.total > 0.1  ? 'Headroom' : '';

    document.getElementById('cts-lg-builtins-' + id).textContent = fmtK(s.builtins);
    document.getElementById('cts-lg-mcp-' + id).textContent      = fmtK(s.mcp);
    document.getElementById('cts-lg-skills-' + id).textContent   = fmtK(s.skills);
    document.getElementById('cts-lg-context-' + id).textContent  = fmtK(s.context);
    document.getElementById('cts-lg-headroom-' + id).textContent = fmtK(headroom);

    document.getElementById('cts-overhead-pct-' + id).textContent = (overhead / s.total * 100).toFixed(0) + '%';

    var costPerCall = (overhead / 1e6) * SONNET_INPUT_PER_M;
    var annual = costPerCall * 1e6;

    animateNumber(document.getElementById('cts-val-overhead-' + id), overhead,    function(v) { return fmtNumber(v) + ' tokens'; });
    animateNumber(document.getElementById('cts-val-call-' + id),     costPerCall, function(v) { return fmtCurrency(v, 3); });
    animateNumber(document.getElementById('cts-val-annual-' + id),   annual,      function(v) { return fmtCurrency(v, 0); });
    document.getElementById('cts-val-head-' + id).textContent = fmtK(headroom) + ' / ' + fmtK(s.total);

    var kpis = document.querySelectorAll('.cts-kpi-' + id);
    for (var i = 0; i < kpis.length; i++) { kpis[i].classList.toggle('bad', isBad); }

    if (mode === 'flat') renderFlatGrid(); else renderProgressiveGrid();
  }

  var modes = document.querySelectorAll('.cts-mode-' + id);
  for (var i = 0; i < modes.length; i++) {
    (function(btn) {
      btn.addEventListener('click', function() {
        for (var j = 0; j < modes.length; j++) modes[j].classList.remove('active');
        btn.classList.add('active');
        applyMode(btn.getAttribute('data-mode'));
      });
    })(modes[i]);
  }

  applyMode('flat');
})();
</script>

<p>Block&rsquo;s Goose is the cleanest public expression of this. The operational equation is <code>Goose = LLM + MCP + Agent</code>, but the load-bearing piece isn&rsquo;t MCP. It&rsquo;s Goose&rsquo;s <strong>Recipes</strong> and <strong>Sub-recipes</strong>. Recipes are declarative YAML workflows that encode a repeatable piece of work; sub-recipes run in <em>isolated sub-sessions</em> with their own context windows. That isolation keeps token cost linear in the work done rather than quadratic in conversation depth. The result: Block&rsquo;s top engineers now get 30-40% of their code from Goose in legacy codebases, per <a href="https://sequoiacap.com/podcast/training-data-dhanji-prasanna/">the Sequoia interview</a>.</p>
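<p>The linear-versus-quadratic claim is easy to check with a simplified model. Assume every step adds the same number of tokens, a single long session re-reads all earlier steps on each turn, and an isolated sub-session starts fresh and hands back only a short summary. These are modelling assumptions, not measurements of Goose.</p>

```python
def single_session_tokens(steps: int, tokens_per_step: int) -> int:
    """Each step re-reads everything before it: t + 2t + ... + n*t."""
    return tokens_per_step * steps * (steps + 1) // 2


def isolated_tokens(steps: int, tokens_per_step: int, summary_tokens: int = 0) -> int:
    """Each step runs in a fresh context and returns only a summary."""
    return steps * (tokens_per_step + summary_tokens)


# Illustrative numbers only: 20 steps of 2,000 tokens, 200-token summaries.
print(single_session_tokens(20, 2000))                  # 420000
print(isolated_tokens(20, 2000, summary_tokens=200))    # 44000
```

<p>Under those assumptions, twenty steps in one conversation cost roughly ten times what twenty isolated sub-sessions do, and the gap widens with every step you add.</p>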
<h2 id="chapter-3-identity-policy-and-the-execution-boundary">Chapter 3: Identity, Policy, and the Execution Boundary</h2>
<p>Every agentic action has two identities the audit team cares about: the human on whose behalf the agent is acting, and the agent&rsquo;s own service identity. Conflate them and compliance review kills your rollout.</p>
<p>The <em>human</em> identity provides authorisation scope. The <em>agent</em> identity provides attribution and accountability (which agent, which version, which session). Every tool invocation carries both. Tokens are short-lived (minutes, not days) and scoped to the specific task, not the session. Policy is enforced at the MCP gateway, as code, so auditors can diff it and engineers can review it.</p>
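<p>A minimal sketch of what the gateway checks on each call. The <code>TaskToken</code> fields, the five-minute TTL, and the example identities are assumptions for illustration. A real deployment would use signed credentials from your identity provider rather than a bare dataclass.</p>

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskToken:
    human: str                   # whose authority the agent borrows
    agent: str                   # which agent, version, session
    task_id: str                 # scope is the task, not the session
    allowed_tools: frozenset[str]
    expires_at: float            # epoch seconds


def mint_token(human, agent, task_id, tools, ttl_seconds=300, now=None):
    """Issue a short-lived token scoped to one task (minutes, not days)."""
    now = time.time() if now is None else now
    return TaskToken(human, agent, task_id, frozenset(tools), now + ttl_seconds)


def authorize(token, tool, now=None):
    """Gateway-side check. Returns (allowed, audit_record)."""
    now = time.time() if now is None else now
    allowed = now < token.expires_at and tool in token.allowed_tools
    audit = {
        "human": token.human,    # authorisation scope
        "agent": token.agent,    # attribution
        "task": token.task_id,
        "tool": tool,
        "allowed": allowed,
    }
    return allowed, audit
```

<p>Every audit record names both identities, whether the call was allowed or not. The denied calls are often the ones the audit team asks about.</p>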
<p>How do you keep humans in the loop without drowning them in approval prompts? The pattern that works is the <strong>asynchronous wait-state</strong>. When an agent hits a high-risk decision (production deploy, financial transaction, irreversible write), the workflow <em>suspends</em>, persists its state externally, and emits an approval event. Reviewers act on their own clock, often hours later. On approval, the signal routes back and the workflow resumes exactly where it left off.</p>
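<p>The suspend-persist-resume loop, sketched under assumptions: <code>ApprovalStore</code> is an in-memory stand-in for durable external storage, the event list stands in for your notification bus, and a workflow is just an ordered list of <code>(step_name, high_risk)</code> pairs.</p>

```python
import json
import uuid


class ApprovalStore:
    """In-memory stand-in for durable external storage (an assumption)."""

    def __init__(self):
        self._rows = {}

    def save(self, ticket, state):
        self._rows[ticket] = json.dumps(state)

    def load(self, ticket):
        return json.loads(self._rows.pop(ticket))


def run_until_gate(steps, store, events, start=0, approved=False):
    """Run steps in order; suspend at the first unapproved high-risk step."""
    done = []
    for index in range(start, len(steps)):
        name, high_risk = steps[index]
        if high_risk and not approved:
            ticket = str(uuid.uuid4())
            store.save(ticket, {"next_step": index})   # persist state externally
            events.append({"type": "approval_requested",
                           "ticket": ticket, "step": name})
            return done, ticket                        # workflow is suspended
        approved = False  # an approval covers exactly one gated step
        done.append(name)
    return done, None


def resume(ticket, steps, store, events):
    """Called when the approval signal arrives, possibly hours later."""
    state = store.load(ticket)
    return run_until_gate(steps, store, events,
                          start=state["next_step"], approved=True)
```

<p>Nothing holds a thread or a connection open while the reviewer is at lunch. The only thing waiting is a row in the store.</p>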
<p>The anti-pattern is approval fatigue. The fix is <strong>graduated trust</strong>: scope approvals by blast radius.</p>
<ul>
<li><strong>Read-only</strong> on scoped data: no approval.</li>
<li><strong>Mutations inside a sandbox or personal branch</strong>: no approval, full audit.</li>
<li><strong>PR against the main branch</strong>: standard code review.</li>
<li><strong>Production-shaped actions</strong> (deploys, config changes, prod data reads): explicit, async approval with a named owner.</li>
<li><strong>Irreversible</strong> (delete, drop, disable safety): two-person review.</li>
</ul>
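<p>Expressed as policy-as-code, the ladder above might look like this. The action flags (<code>irreversible</code>, <code>touches_production</code>, and so on) are hypothetical names, and routing an unclassified mutation to named-owner approval is a fail-closed assumption added here, not something the list prescribes.</p>

```python
from enum import Enum


class Approval(Enum):
    NONE = "no approval"
    NONE_WITH_AUDIT = "no approval, full audit"
    CODE_REVIEW = "standard code review"
    ASYNC_NAMED_OWNER = "explicit async approval with a named owner"
    TWO_PERSON = "two-person review"


def required_approval(action: dict) -> Approval:
    """Check the largest blast radius first so the strictest rule wins."""
    if action.get("irreversible"):            # delete, drop, disable safety
        return Approval.TWO_PERSON
    if action.get("touches_production"):      # deploys, config, prod data reads
        return Approval.ASYNC_NAMED_OWNER
    if action.get("targets_main_branch"):     # PR against main
        return Approval.CODE_REVIEW
    if action.get("mutates"):
        if action.get("sandboxed"):           # sandbox or personal branch
            return Approval.NONE_WITH_AUDIT
        return Approval.ASYNC_NAMED_OWNER     # assumption: unknown writes fail closed
    return Approval.NONE                      # read-only on scoped data
```

<p>Because it is a plain function over a plain dict, auditors can diff it and engineers can unit-test it.</p>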
<p>GitHub&rsquo;s Copilot Enterprise surface has become the most concrete public implementation. Per <a href="https://resources.github.com/enterprise-content-roundup/december/">the December 2025 Enterprise roundup</a> and <a href="https://devblogs.microsoft.com/all-things-azure/agentic-platform-engineering-with-github-copilot/">Microsoft&rsquo;s DevBlogs on agentic platform engineering</a>, admins get fine-grained permissions, explicit MCP control, audit-log review, and policy-based gating of model upgrades.</p>
<p>Policy tells the agent what it&rsquo;s <em>allowed</em> to try. The sandbox decides what happens when it tries the wrong thing. Three isolation tiers map to the graduated-trust model:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Isolation tier</th>
          <th style="text-align: left">Mechanism</th>
          <th style="text-align: left">Right fit</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Docker + seccomp</strong></td>
          <td style="text-align: left">Namespaces and cgroups; shared host kernel</td>
          <td style="text-align: left">Dev-loop agents on an engineer&rsquo;s own repo</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>gVisor</strong></td>
          <td style="text-align: left">User-space kernel that intercepts application syscalls; itself limited to ~70 host syscalls</td>
          <td style="text-align: left">Platform-served workers (CI, migrations, autonomous PRs)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Firecracker / Kata</strong></td>
          <td style="text-align: left">Per-workload Linux kernel via KVM</td>
          <td style="text-align: left">Untrusted, multi-tenant, or cross-org execution</td>
      </tr>
  </tbody>
</table>
<p>Match the isolation tier to the trust tier. A read-only retrieval agent does not need a microVM. A production migration worker rewriting other teams&rsquo; code absolutely does.</p>
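<p>The tier match can be written down as a small decision rule. A minimal sketch, assuming two hypothetical flags (<code>platform_served</code> and <code>untrusted_or_cross_org</code>) as stand-ins for however your platform records trust; the names are illustrative, not any product&rsquo;s API:</p>

```python
from enum import Enum


class Isolation(Enum):
    CONTAINER = "docker+seccomp"   # namespaces and cgroups, shared host kernel
    GVISOR = "gvisor"              # user-space kernel in front of the host
    MICROVM = "firecracker/kata"   # per-workload kernel via KVM


def pick_isolation(*, platform_served: bool, untrusted_or_cross_org: bool) -> Isolation:
    """Return the lightest tier from the table that still fits the workload.

    Both flags are hypothetical inputs chosen for this sketch.
    """
    if untrusted_or_cross_org:
        return Isolation.MICROVM
    if platform_served:
        return Isolation.GVISOR
    return Isolation.CONTAINER
```

<p>The rule checks the strictest condition first, so a workload that is both platform-served and cross-org lands in a microVM rather than gVisor.</p>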
<h2 id="chapter-4-context-engineering-at-platform-scale">Chapter 4: Context Engineering at Platform Scale</h2>
<p>The frontier model isn&rsquo;t your moat. The agent harness isn&rsquo;t your moat. Your <strong>context</strong> is your moat: the graph of your repos, the runbooks nobody wrote down, the incident history, the ADRs, the style guides, the org&rsquo;s service catalog.</p>
<h3 id="mcp-is-connectivity-not-context">MCP is connectivity, not context</h3>
<p>A common mistake is assuming that connecting MCP servers to your agent solves the context problem. MCP gives your agent a standardised way to query any single system. Four things it does not do:</p>
<ol>
<li><strong>Cross-source retrieval.</strong> &ldquo;Find everything relevant to this migration across code, tickets, docs, and incidents&rdquo; requires a unified index. No single MCP server spans all your sources.</li>
<li><strong>Pre-indexing.</strong> MCP queries are live. For a 500k-file monorepo, live search on every agent call is slow and expensive.</li>
<li><strong>Governance.</strong> PII redaction, access-aware filtering, staleness SLOs. Each MCP server returns raw data under its own auth model.</li>
<li><strong>Token-budget management.</strong> Fitting retrieved context to the model&rsquo;s window is orchestration the pipeline owns, not the protocol.</li>
</ol>
<p>The clean architecture: <strong>build the context pipeline, expose it as an MCP server.</strong> The agent queries one context endpoint. The pipeline behind it handles ingest, index, govern, serve.</p>
<h3 id="the-pipeline">The pipeline</h3>
<p><strong>Ingest.</strong> Connectors to the authoritative sources: repos, docs wiki, ticket system, incident tracker, service catalog. Each with an idempotent, versioned schema and an owner.</p>
<p><strong>Index.</strong> Hybrid retrieval is the production default: BM25 for lexical recall, dense embeddings for semantic similarity, graph for structural relationships. No single index is sufficient.</p>
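<p>One common way to combine several rankers is reciprocal rank fusion (RRF). The sketch below is an assumption about how the lexical, dense, and graph result lists could be merged; none of the tools cited in this chapter is claimed to use it. The constant <code>k=60</code> is the conventional RRF default, and the document IDs are made up.</p>

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs: each list adds 1 / (k + rank) per doc."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the three indexes.
lexical = ["a", "b", "c"]
dense = ["b", "c", "a"]
graph = ["b", "a", "d"]
fused = reciprocal_rank_fusion([lexical, dense, graph])
```

<p>RRF only needs ranks, not comparable scores, which is why it suits indexes whose raw scores live on different scales.</p>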
<p><strong>Govern.</strong> Staleness SLOs per source. PII and secret redaction before indexing, not after retrieval. Access-aware retrieval: the retriever filters by the <em>caller&rsquo;s</em> permissions before ranking. If your agent can see secrets its invoking user can&rsquo;t, you have a data-exfiltration vulnerability wearing a productivity tool&rsquo;s clothes.</p>
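<p>The ordering matters: filter first, rank second. A minimal sketch, assuming each indexed chunk carries a hypothetical <code>allowed_groups</code> set copied from the source system&rsquo;s ACL at ingest time:</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    score: float
    allowed_groups: frozenset  # hypothetical ACL captured at ingest


def retrieve(candidates, caller_groups, top_k=3):
    """Drop chunks the caller cannot see, then rank what is left."""
    visible = [c for c in candidates if c.allowed_groups & caller_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


chunks = [
    Chunk("runbook", 0.90, frozenset({"sre"})),
    Chunk("readme", 0.50, frozenset({"eng", "sre"})),
    Chunk("vault-notes", 0.99, frozenset({"secops"})),
]
```

<p>Here the highest-scoring chunk never reaches a caller outside <code>secops</code>, because it is removed before ranking rather than trimmed afterwards.</p>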
<p><strong>Serve.</strong> A token-budget manager (compression, summarisation, eviction) that fits retrieved context to the model&rsquo;s window and the task&rsquo;s importance.</p>
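<p>The simplest form of a budget manager is greedy eviction: keep the highest-scoring chunks that still fit. The sketch below covers only that step (no compression or summarisation) and uses whitespace word count as a stand-in for a real tokenizer, which is an assumption made to keep it self-contained.</p>

```python
def pack_context(chunks, budget_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep the best-scoring (score, text) chunks that fit the budget.

    count_tokens defaults to a word count; swap in a real tokenizer in practice.
    """
    chosen, used = [], 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen, used


# Hypothetical retrieved chunks as (relevance score, text).
retrieved = [(0.9, "a b c"), (0.8, "d e f g"), (0.5, "h")]
```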
<p><a href="https://www.augmentcode.com/context-engine">Augment Code&rsquo;s Context Engine</a> is the clearest public reference for this in 2026. It indexes up to 500,000 files across multiple repositories with roughly 100ms retrieval latency, building semantic dependency graphs. The telling move: Augment recently <a href="https://www.augmentcode.com/blog/context-engine-mcp-now-live">shipped the Context Engine as an MCP server</a>, the exact pipeline-behind-protocol pattern. <a href="https://sourcegraph.com/blog/how-cody-understands-your-codebase">Sourcegraph&rsquo;s Cody</a> takes a three-layer approach (local file, local repo, <a href="https://sourcegraph.com/blog/how-cody-provides-remote-repository-context">remote repos</a>), handling 300k+ repositories for enterprise customers. <a href="https://www.mindstudio.ai/blog/stripe-minions-vs-shopify-roast-ai-coding-harnesses">Stripe&rsquo;s agent harness</a> takes the curation angle: each &ldquo;minion&rdquo; gets scoped context per task, not the whole repo. Context curated, not copied.</p>
<p>The metric to watch: <strong>context hit rate per task type</strong>. If your hit rate is under 30%, your pipeline is ornamental.</p>
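<p>How a &ldquo;hit&rdquo; is defined is up to the platform. The sketch below assumes one plausible definition: a run counts as a hit when the agent actually used at least one retrieved chunk. The task-type names and the <code>below_floor</code> helper are illustrative.</p>

```python
from collections import defaultdict


def hit_rate_by_task_type(runs):
    """runs: iterable of (task_type, used_retrieved_context) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for task_type, used_context in runs:
        totals[task_type] += 1
        hits[task_type] += 1 if used_context else 0
    return {task: hits[task] / totals[task] for task in totals}


def below_floor(rates, floor=0.30):
    """Task types whose hit rate falls under the 30% floor named above."""
    return sorted(task for task, rate in rates.items() if rate < floor)


# Hypothetical run log.
runs = [("migration", True), ("migration", False), ("review", False)]
rates = hit_rate_by_task_type(runs)
```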
<h2 id="chapter-5-workflows-the-unit-that-ships">Chapter 5: Workflows, the Unit That Ships</h2>
<p>Four chapters described infrastructure. This chapter is about what the infrastructure produces. The deliverable is the <strong>workflow</strong>: a versioned, parameterised unit of work any engineer can build once, evaluate, and hand to other engineers (or to CI runners) who invoke it on a trigger they didn&rsquo;t author.</p>




<style>
.wfl-c0aa3c6c6ba86623c3a6e79446929763 {
  --wfl-accent: #c96442;
  --wfl-accent-soft: rgba(201, 100, 66, 0.08);
  --wfl-accent-border: rgba(201, 100, 66, 0.35);

  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif;
  background: var(--entry);
  color: var(--primary);
  border: 1px solid var(--border);
  border-radius: 6px;
  padding: 1.75rem;
  margin: 2rem 0;
}

.dark .wfl-c0aa3c6c6ba86623c3a6e79446929763 {
  --wfl-accent: #d97757;
  --wfl-accent-soft: rgba(217, 119, 87, 0.1);
  --wfl-accent-border: rgba(217, 119, 87, 0.4);
}

.wfl-c0aa3c6c6ba86623c3a6e79446929763 * { box-sizing: border-box; }

.wfl-head-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: baseline;
  justify-content: space-between;
  gap: 1rem;
  padding-bottom: 0.85rem;
  margin-bottom: 1.25rem;
  border-bottom: 1px solid var(--border);
}

.wfl-title-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 1.05rem;
  font-weight: 600;
  color: var(--primary);
  letter-spacing: -0.01em;
}

.wfl-kicker-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.65rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
}

.wfl-subtitle-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.85rem;
  color: var(--secondary);
  margin-bottom: 1.25rem;
  line-height: 1.55;
}

 
.wfl-flow-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: 1fr auto 1.2fr auto 1fr;
  gap: 0.75rem;
  align-items: stretch;
}

.wfl-col-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  flex-direction: column;
  gap: 0.6rem;
}

.wfl-col-head-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.6rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
  padding: 0.3rem 0 0.55rem;
  border-bottom: 1px solid var(--border);
  margin-bottom: 0.35rem;
}

.wfl-col-head-c0aa3c6c6ba86623c3a6e79446929763.accent {
  color: var(--wfl-accent);
  border-bottom-color: var(--wfl-accent);
}

.wfl-card-c0aa3c6c6ba86623c3a6e79446929763 {
  border: 1px solid var(--border);
  border-radius: 4px;
  padding: 0.7rem 0.85rem;
  background: var(--entry);
  opacity: 0;
  transform: translateY(3px);
  transition: opacity 0.35s ease, transform 0.35s ease;
}

.wfl-card-c0aa3c6c6ba86623c3a6e79446929763.visible { opacity: 1; transform: translateY(0); }

.wfl-card-c0aa3c6c6ba86623c3a6e79446929763.emphasis {
  border-left: 2px solid var(--wfl-accent);
  background: var(--wfl-accent-soft);
}

.wfl-card-head-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: baseline;
  gap: 0.5rem;
  margin-bottom: 0.35rem;
}

.wfl-card-name-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.82rem;
  font-weight: 600;
  color: var(--primary);
  letter-spacing: -0.005em;
}

.wfl-card-c0aa3c6c6ba86623c3a6e79446929763.emphasis .wfl-card-name-c0aa3c6c6ba86623c3a6e79446929763 {
  color: var(--wfl-accent);
}

.wfl-card-tag-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.56rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.08em;
  margin-left: auto;
  padding: 0.1rem 0.35rem;
  border-radius: 3px;
  background: var(--code-bg);
}

.wfl-card-items-c0aa3c6c6ba86623c3a6e79446929763 {
  list-style: none;
  padding: 0;
  margin: 0;
  display: flex;
  flex-direction: column;
  gap: 0.15rem;
}

.wfl-card-items-c0aa3c6c6ba86623c3a6e79446929763 li {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.68rem;
  color: var(--secondary);
  padding-left: 0.8rem;
  position: relative;
  line-height: 1.55;
}

.wfl-card-items-c0aa3c6c6ba86623c3a6e79446929763 li::before {
  content: '·';
  position: absolute;
  left: 0.1rem;
  color: var(--secondary);
  font-weight: 600;
}

.wfl-card-items-c0aa3c6c6ba86623c3a6e79446929763 li code {
  font-family: 'JetBrains Mono', ui-monospace, monospace;
  font-size: 0.66rem;
  color: var(--primary);
  background: transparent;
  padding: 0;
}

 
.wfl-arrow-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: center;
  justify-content: center;
  color: var(--wfl-accent);
}

 
.wfl-feedback-c0aa3c6c6ba86623c3a6e79446929763 {
  margin-top: 1rem;
  padding: 0.65rem 0.85rem;
  background: var(--wfl-accent-soft);
  border: 1px solid var(--wfl-accent-border);
  border-left: 2px solid var(--wfl-accent);
  border-radius: 4px;
  font-size: 0.78rem;
  color: var(--primary);
  display: flex;
  align-items: center;
  gap: 0.6rem;
}

.wfl-feedback-icon-c0aa3c6c6ba86623c3a6e79446929763 {
  flex-shrink: 0;
  color: var(--wfl-accent);
}

.wfl-feedback-label-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.6rem;
  font-weight: 500;
  color: var(--wfl-accent);
  text-transform: uppercase;
  letter-spacing: 0.1em;
  margin-right: 0.3rem;
}

 
@media (max-width: 860px) {
  .wfl-flow-c0aa3c6c6ba86623c3a6e79446929763 {
    grid-template-columns: 1fr;
    gap: 0.5rem;
  }
  .wfl-arrow-c0aa3c6c6ba86623c3a6e79446929763 {
    transform: rotate(90deg);
    padding: 0.1rem 0;
  }
}

@media (max-width: 520px) {
  .wfl-c0aa3c6c6ba86623c3a6e79446929763 { padding: 1.25rem; }
}
</style>

<div class="wfl-c0aa3c6c6ba86623c3a6e79446929763">
  <div class="wfl-head-c0aa3c6c6ba86623c3a6e79446929763">
    <div class="wfl-title-c0aa3c6c6ba86623c3a6e79446929763">The workflow lifecycle</div>
    <div class="wfl-kicker-c0aa3c6c6ba86623c3a6e79446929763">author → trigger → run → observe</div>
  </div>
  <div class="wfl-subtitle-c0aa3c6c6ba86623c3a6e79446929763">The control plane of Chapters 1–4 exists to make this lifecycle safe, cheap, and measurable. A workflow is the unit that ships.</div>

  <div class="wfl-flow-c0aa3c6c6ba86623c3a6e79446929763">
    
    <div class="wfl-col-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="wfl-col-head-c0aa3c6c6ba86623c3a6e79446929763">Authors</div>
      <div class="wfl-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="wfl-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="wfl-card-name-c0aa3c6c6ba86623c3a6e79446929763">Recipe YAML</div>
          <div class="wfl-card-tag-c0aa3c6c6ba86623c3a6e79446929763">default</div>
        </div>
        <ul class="wfl-card-items-c0aa3c6c6ba86623c3a6e79446929763">
          <li>metadata + version</li>
          <li>parameters</li>
          <li>extensions (MCP)</li>
          <li>sub-recipes</li>
        </ul>
      </div>
      <div class="wfl-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="wfl-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="wfl-card-name-c0aa3c6c6ba86623c3a6e79446929763">Skills / DSL</div>
          <div class="wfl-card-tag-c0aa3c6c6ba86623c3a6e79446929763">alt</div>
        </div>
        <ul class="wfl-card-items-c0aa3c6c6ba86623c3a6e79446929763">
          <li>Claude Skills (md)</li>
          <li>Temporal / LangGraph</li>
          <li>Rovo Studio (low-code)</li>
        </ul>
      </div>
    </div>

    <div class="wfl-arrow-c0aa3c6c6ba86623c3a6e79446929763">
      <svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><line x1="5" y1="12" x2="19" y2="12"/><polyline points="12 5 19 12 12 19"/></svg>
    </div>

    
    <div class="wfl-col-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="wfl-col-head-c0aa3c6c6ba86623c3a6e79446929763 accent">Platform</div>
      <div class="wfl-card-c0aa3c6c6ba86623c3a6e79446929763 emphasis">
        <div class="wfl-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="wfl-card-name-c0aa3c6c6ba86623c3a6e79446929763">Triggers</div>
          <div class="wfl-card-tag-c0aa3c6c6ba86623c3a6e79446929763">fire</div>
        </div>
        <ul class="wfl-card-items-c0aa3c6c6ba86623c3a6e79446929763">
          <li>cron (<code>goose schedule</code>)</li>
          <li>event (PR, issue, incident, webhook)</li>
          <li>manual / API (<code>goose run</code>, <code>goose serve</code>)</li>
        </ul>
      </div>
      <div class="wfl-card-c0aa3c6c6ba86623c3a6e79446929763 emphasis">
        <div class="wfl-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="wfl-card-name-c0aa3c6c6ba86623c3a6e79446929763">Runtime</div>
          <div class="wfl-card-tag-c0aa3c6c6ba86623c3a6e79446929763">run</div>
        </div>
        <ul class="wfl-card-items-c0aa3c6c6ba86623c3a6e79446929763">
          <li>CI runner (ephemeral)</li>
          <li>agent pool (Modal / E2B / Northflank)</li>
          <li>laptop (dev-loop only)</li>
        </ul>
      </div>
    </div>

    <div class="wfl-arrow-c0aa3c6c6ba86623c3a6e79446929763">
      <svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><line x1="5" y1="12" x2="19" y2="12"/><polyline points="12 5 19 12 12 19"/></svg>
    </div>

    
    <div class="wfl-col-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="wfl-col-head-c0aa3c6c6ba86623c3a6e79446929763">Observability</div>
      <div class="wfl-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="wfl-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="wfl-card-name-c0aa3c6c6ba86623c3a6e79446929763">Run record</div>
          <div class="wfl-card-tag-c0aa3c6c6ba86623c3a6e79446929763">first-class</div>
        </div>
        <ul class="wfl-card-items-c0aa3c6c6ba86623c3a6e79446929763">
          <li>trigger source</li>
          <li>parameters</li>
          <li>spans + retries</li>
          <li>status · cost · trace ID</li>
        </ul>
      </div>
      <div class="wfl-card-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="wfl-card-head-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="wfl-card-name-c0aa3c6c6ba86623c3a6e79446929763">Governance</div>
          <div class="wfl-card-tag-c0aa3c6c6ba86623c3a6e79446929763">registry</div>
        </div>
        <ul class="wfl-card-items-c0aa3c6c6ba86623c3a6e79446929763">
          <li>SHA-pinned versions</li>
          <li>ownership + review</li>
          <li>deprecation windows</li>
        </ul>
      </div>
    </div>
  </div>

  <div class="wfl-feedback-c0aa3c6c6ba86623c3a6e79446929763">
    <svg class="wfl-feedback-icon-c0aa3c6c6ba86623c3a6e79446929763" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.75"><polyline points="1 4 1 10 7 10"/><path d="M3.51 15a9 9 0 1 0 2.13-9.36L1 10"/></svg>
    <span class="wfl-feedback-label-c0aa3c6c6ba86623c3a6e79446929763">feedback</span>
    Run records feed the eval golden set. Thumbs-downs become regression cases. A new Recipe version is shadow-run against live triggers before it's promoted.
  </div>
</div>

<script>
(function() {
  var id = 'c0aa3c6c6ba86623c3a6e79446929763';
  var cards = document.querySelectorAll('.wfl-c0aa3c6c6ba86623c3a6e79446929763 .wfl-card-' + id);
  for (var i = 0; i < cards.length; i++) {
    (function(el, delay) {
      setTimeout(function() { el.classList.add('visible'); }, delay);
    })(cards[i], 80 + i * 60);
  }
})();
</script>

<h3 id="authoring">Authoring</h3>
<p>Four patterns; the choice follows who the author is:</p>
<ul>
<li><strong>Recipe / YAML</strong>: <a href="https://block.github.io/goose/docs/guides/recipes/recipe-reference/">Goose Recipes</a>, <a href="https://github.blog/ai-and-ml/automate-repository-tasks-with-github-agentic-workflows/">GitHub Agentic Workflows</a> (Feb 2026 preview). Structured, diff-reviewable, CI-friendly. The enterprise default.</li>
<li><strong>Prompt-as-code</strong>: Claude Skills. Flexible, closer to prose, weaker composition.</li>
<li><strong>DSL / real code</strong>: <a href="https://temporal.io/blog/orchestrating-ambient-agents-with-temporal">Temporal</a>, LangGraph, <a href="https://kestra.io/1-0">Kestra</a>. Maximum control; needs engineer authors.</li>
<li><strong>Low-code</strong>: <a href="https://support.atlassian.com/studio/docs/what-is-rovo-studio/">Atlassian Rovo Studio</a>. Natural-language authoring for non-engineers.</li>
</ul>
<p>A Goose Recipe is the concrete shape most architects will end up writing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">name</span>: <span style="color:#ae81ff">pr_security_review</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">recipe</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">version</span>: <span style="color:#ae81ff">1.0.0</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">title</span>: <span style="color:#ae81ff">PR Security Review</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">description</span>: <span style="color:#ae81ff">OWASP-informed review of a pull-request diff.</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">settings</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">goose_provider</span>: <span style="color:#ae81ff">anthropic</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">goose_model</span>: <span style="color:#ae81ff">claude-sonnet-4-5</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">parameters</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#f92672">key</span>: <span style="color:#ae81ff">pr_url</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">input_type</span>: <span style="color:#ae81ff">string</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">requirement</span>: <span style="color:#ae81ff">required</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">description</span>: <span style="color:#e6db74">&#34;Pull request URL to review&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">extensions</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#f92672">type</span>: <span style="color:#ae81ff">builtin</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">name</span>: <span style="color:#ae81ff">developer</span>
</span></span><span style="display:flex;"><span>    - <span style="color:#f92672">type</span>: <span style="color:#ae81ff">streamable_http</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">name</span>: <span style="color:#ae81ff">github</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">uri</span>: <span style="color:#ae81ff">https://api.githubcopilot.com/mcp/x/pull_requests/readonly</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">instructions</span>: |<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    You are a security reviewer. Check the diff for OWASP Top-10
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    issues, secrets, and unsafe patterns. Be specific and sparing.</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">prompt</span>: |<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    Review PR {{ pr_url }}. For each finding, cite the file,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    line, severity, and suggested fix. Post findings as a single
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    PR comment. If nothing is found, say so.</span>
</span></span></code></pre></div><p>Every primitive the last four chapters described is visible here. <code>settings</code> routes through the LLM gateway. <code>extensions</code> declares which approved MCP servers the capability surface exposes. <code>parameters</code> is how a non-author reuses the workflow. <code>instructions</code> vs <code>prompt</code> separates policy from task, which is what makes a Recipe testable.</p>
<h3 id="parameterisation-and-sub-workflows">Parameterisation and sub-workflows</h3>
<p>A Recipe without parameters is a one-off. With parameters, it&rsquo;s a product. The sharper Goose primitive is the <code>sub_recipes</code> array: each sub-recipe runs in its own isolated subagent session with its own context window, and <code>sequential_when_repeated</code> controls what happens when the same sub-recipe is invoked several times: <code>true</code> runs the invocations one after another, <code>false</code> lets them run in parallel. This is the orchestrator-worker pattern from Chapter 1, made concrete. It&rsquo;s what makes the Airbnb migration topology possible: 3,500 files fan out across parallel sub-recipe invocations, each with fresh context, orchestrated by one parent.</p>
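<p>The fan-out shape is easy to see outside any particular tool. The sketch below is not Goose&rsquo;s implementation: <code>run_worker</code> is a hypothetical stand-in for one isolated sub-recipe session, and the <code>sequential</code> flag mirrors the parallel-versus-sequential choice described above.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def run_worker(path):
    # Hypothetical stand-in for one sub-recipe session with fresh context.
    return f"migrated:{path}"


def fan_out(paths, sequential=False, max_workers=8):
    """Parent orchestrator: one worker invocation per file."""
    if sequential:
        return [run_worker(p) for p in paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, whatever the finish order.
        return list(pool.map(run_worker, paths))
```

<p>Each worker receives only its own file path, which is the point of the pattern: the parent holds the plan, and the workers hold nothing but their slice.</p>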
<h3 id="triggers">Triggers</h3>
<p><strong>Cron.</strong> <code>goose schedule add recipe.yaml --cron '0 9 * * 1-5'</code>. Nightly lint, weekly security audit, daily stale-PR report. The built-in scheduler is single-machine; for distributed schedules, wrap with a Kubernetes CronJob or a Temporal worker pool.</p>
<p><strong>Event-driven.</strong> PR opened, issue labelled, incident created, build failed. Atlassian&rsquo;s Rovo Dev fires on every PR. The <a href="https://github.com/marketplace/actions/goose-ai-developer-agent">Goose GitHub Action</a> wraps the same pattern: label an issue with <code>goose</code> and a PR opens. Event-driven is where agents stop being assistants and start being automation.</p>
<p><strong>Manual / API.</strong> <code>goose run --recipe recipe.yaml --params pr_url=https://...</code> from a CI step, or <code>goose serve</code> running as a webhook receiver inside the cluster.</p>
<h3 id="runtime-observability-and-governance">Runtime, observability, and governance</h3>
<p>Triggered workflows run on ephemeral CI runners (GitHub Actions, Buildkite) for sub-five-minute PR-shaped work, or on dedicated agent pools for long-running stateful work. Match runtime to the trust tier from Chapter 3.</p>
<p>Every triggered run is a first-class object: trigger source, parameters, spans with retry counts, final status, cost, trace ID. <a href="https://kestra.io/1-0">Kestra recorded over two billion workflow executions in 2025</a>, up from one hundred million in 2024. That twenty-fold increase signals the direction of travel. If your platform cannot answer &ldquo;what ran when, triggered by what, with what outcome?&rdquo; in two clicks, it is opaque.</p>
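<p>A run record can be as plain as a serialisable data class. The sketch below covers the fields listed above; the field names and the example values are assumptions made for illustration, not a schema published by Goose, Kestra, or anyone else.</p>

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Span:
    name: str
    retries: int = 0


@dataclass
class RunRecord:
    # Field names are illustrative assumptions, not a published schema.
    trace_id: str
    recipe: str
    recipe_version: str
    trigger_source: str  # e.g. "cron", "pr_opened", "manual"
    parameters: dict
    status: str
    cost_usd: float
    spans: list = field(default_factory=list)

    def total_retries(self) -> int:
        return sum(span.retries for span in self.spans)

    def to_json(self) -> str:
        # asdict recurses into the nested Span objects.
        return json.dumps(asdict(self), sort_keys=True)


record = RunRecord(
    trace_id="t-1",
    recipe="pr_security_review",
    recipe_version="1.0.0",
    trigger_source="cron",
    parameters={"pr_url": "https://example.invalid/pr/1"},
    status="succeeded",
    cost_usd=0.42,
    spans=[Span("fetch_diff", retries=1), Span("review", retries=2)],
)
```

<p>Storing the record as flat JSON keyed by trace ID is what makes the &ldquo;what ran when, triggered by what&rdquo; question a single lookup.</p>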
<p>Shared workflows need product discipline. The <a href="https://github.blog/enterprise-software/devops/building-organization-wide-governance-and-re-use-for-ci-cd-and-automation-with-github-actions/">GitHub Actions governance model</a> (internal org, SHA-pinned versions, PR-reviewed contributions) is the pattern most enterprises borrow.</p>
<h2 id="chapter-6-evaluation-and-economics">Chapter 6: Evaluation and Economics</h2>
<p>Most platform teams skip evaluation and then wonder why their rollout plateaus. Evaluation is not a phase of delivery; it is the product that determines whether the other five chapters compound.</p>
<h3 id="silent-failure">Silent failure</h3>
<p>Silent failure is when an agent completes its run without any software error (no exception, no crash, no red log line) and produces output that looks plausible but is wrong. The PR passes review because the diff looks reasonable. The test the agent wrote passes because it tests the buggy behaviour it introduced. Every DORA-2025 data point on <em>increased change-failure rate</em> is a silent-failure story that got written to disk.</p>
<p>The evaluation stack that catches silent failure has three layers.</p>
<p><strong>Unit-level.</strong> Tool schemas, prompt templates, and system prompts each get their own regression suite. Every change runs a deterministic test set before it can ship.</p>
<p><strong>Task-level.</strong> A curated golden set of real tasks, graded by LLM-as-judge with a rubric that includes <em>business-outcome correctness</em>, not just style. This is eval-as-CI.</p>
<p><strong>Production.</strong> Shadow traffic and online signals: thumbs-up/down, PR accept rate on agent-authored code, downstream defect escape rate. The production signals feed back into the golden set. Every thumbs-down becomes a candidate regression test.</p>
<p>Atlassian&rsquo;s <strong>Rovo Dev Code Reviewer</strong> ran a year-long evaluation across more than 1,900 internal repos before general availability. The result, published at <a href="https://www.atlassian.com/blog/artificial-intelligence/developer-productivity-improved-with-rovo-dev/amp">ICSE 2026</a>, was a 30.8% reduction in PR cycle time and a 35.6% reduction in human-written review comments. The same three eval layers apply at the Recipe level: shadow-run the candidate against live triggers before promoting; canary to a subset before broad ship.</p>
<h3 id="token-economics">Token economics</h3>
<p>By the time you have 5,000 engineers on your platform, token cost is non-linear in three dimensions: context depth, fan-out, and retry depth.</p>
<p><strong>Tiered routing.</strong> Simple classification and extraction route to a cheap model (Haiku-class). Standard code generation routes to a mid-tier model (Sonnet-class). Hard planning and architectural synthesis are reserved for the frontier (Opus-class). Defaulting every call to the most expensive model is the single largest source of cost inflation.</p>
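<p>At its simplest, tiered routing is a lookup with a deliberate default. The sketch below assumes the gateway already labels each request with a task kind; the labels and tier names are hypothetical.</p>

```python
# Hypothetical task-kind labels mapped to model tiers.
ROUTES = {
    "classification": "cheap",
    "extraction": "cheap",
    "code_generation": "mid",
    "planning": "frontier",
    "architecture": "frontier",
}


def route(task_kind: str, default: str = "mid") -> str:
    """Pick a tier; unknown kinds fall back to mid-tier, not the frontier."""
    return ROUTES.get(task_kind, default)
```

<p>The default is the design choice that matters: an unlabelled request falling through to the most expensive tier is exactly the cost inflation described above.</p>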
<p><strong>Prompt caching as an SLI.</strong> Structured prompts should cache at 90%+ hit rate. Cached input tokens are billed at a steep discount (around a tenth of the normal input price on some providers), which works out to roughly a 10x cost reduction on the cached portion. Cache hit rate deserves a dashboard, an owner, and an alert when it drops.</p>
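<p>The arithmetic is worth making explicit. The sketch below assumes cached input tokens cost one tenth of the normal input price (the ratio varies by provider) and ignores any surcharge for writing to the cache, so treat it as a rough model rather than a billing formula.</p>

```python
def input_cost_multiplier(hit_rate: float, cached_price_ratio: float = 0.1) -> float:
    """Input-token bill relative to running with no cache at all.

    cached_price_ratio is an assumed discount; check your provider's pricing.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price_ratio + (1.0 - hit_rate)
```

<p>Under these assumptions a 90% hit rate brings the input bill to about 19% of the uncached cost, roughly a 5x reduction overall, while the cached tokens themselves are 10x cheaper. The uncached 10% dominates what remains, which is why a drop in hit rate shows up quickly on the invoice.</p>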
<p><strong>Attribution at every level.</strong> Per-team, per-repo, per-task, per-session. Without attribution there&rsquo;s no chargeback; without chargeback there&rsquo;s no incentive for teams to care about efficiency.</p>
<p>Shopify&rsquo;s LLM proxy, mentioned in Chapter 1, is the artefact that makes all of this possible. You cannot attribute cost you don&rsquo;t see. You cannot route by complexity if requests bypass your router. Per <a href="https://www.firstround.com/ai/shopify">First Round&rsquo;s write-up</a>, the proxy is what let Shopify&rsquo;s engineering dashboard correlate AI usage with shipping impact, which in turn gave VP Eng Farhan Thawar the evidence to support the ~20% productivity gain the org now claims.</p>




<style>
.cof-c0aa3c6c6ba86623c3a6e79446929763 {
  --cof-accent: #c96442;
  --cof-accent-soft: rgba(201, 100, 66, 0.08);
  --cof-accent-border: rgba(201, 100, 66, 0.35);

  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif;
  background: var(--entry);
  color: var(--primary);
  border: 1px solid var(--border);
  border-radius: 6px;
  padding: 1.75rem;
  margin: 2rem 0;
}

.dark .cof-c0aa3c6c6ba86623c3a6e79446929763 {
  --cof-accent: #d97757;
  --cof-accent-soft: rgba(217, 119, 87, 0.1);
  --cof-accent-border: rgba(217, 119, 87, 0.4);
}

.cof-c0aa3c6c6ba86623c3a6e79446929763 * { box-sizing: border-box; }

 
.cof-head-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  justify-content: space-between;
  align-items: flex-start;
  gap: 1rem;
  flex-wrap: wrap;
  padding-bottom: 0.85rem;
  margin-bottom: 1.25rem;
  border-bottom: 1px solid var(--border);
}

.cof-head-left-c0aa3c6c6ba86623c3a6e79446929763 { flex: 1; min-width: 240px; }

.cof-title-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 1.05rem;
  font-weight: 600;
  color: var(--primary);
  letter-spacing: -0.01em;
  margin-bottom: 0.25rem;
}

.cof-subtitle-c0aa3c6c6ba86623c3a6e79446929763 { font-size: 0.85rem; color: var(--secondary); line-height: 1.55; }

.cof-period-c0aa3c6c6ba86623c3a6e79446929763 {
  display: inline-flex;
  align-items: center;
  gap: 0.4rem;
  padding: 0.35rem 0.65rem;
  border: 1px solid var(--border);
  border-radius: 3px;
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.68rem;
  color: var(--secondary);
}

.cof-period-dot-c0aa3c6c6ba86623c3a6e79446929763 {
  width: 6px;
  height: 6px;
  border-radius: 50%;
  background: var(--cof-accent);
  animation: cof-pulse-c0aa3c6c6ba86623c3a6e79446929763 2s ease-in-out infinite;
}

@keyframes cof-pulse-c0aa3c6c6ba86623c3a6e79446929763 {
  0%, 100% { opacity: 1; }
  50% { opacity: 0.35; }
}

 
.cof-kpis-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: repeat(4, 1fr);
  gap: 0;
  margin-bottom: 1.25rem;
  border: 1px solid var(--border);
  border-radius: 4px;
  overflow: hidden;
}

.cof-kpi-c0aa3c6c6ba86623c3a6e79446929763 {
  padding: 0.8rem 0.9rem;
  border-right: 1px solid var(--border);
}

.cof-kpi-c0aa3c6c6ba86623c3a6e79446929763:last-child { border-right: none; }

.cof-kpi-c0aa3c6c6ba86623c3a6e79446929763[data-kpi="ppr"] { background: var(--cof-accent-soft); }

.cof-kpi-label-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.58rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
  margin-bottom: 0.25rem;
}

.cof-kpi-c0aa3c6c6ba86623c3a6e79446929763[data-kpi="ppr"] .cof-kpi-label-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--cof-accent); }

.cof-kpi-value-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 1.35rem;
  font-weight: 600;
  color: var(--primary);
  font-variant-numeric: tabular-nums;
  letter-spacing: -0.01em;
  line-height: 1.15;
}

.cof-kpi-c0aa3c6c6ba86623c3a6e79446929763[data-kpi="ppr"] .cof-kpi-value-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--cof-accent); }

.cof-kpi-value-c0aa3c6c6ba86623c3a6e79446929763 .cof-unit-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.8rem;
  color: var(--secondary);
  font-weight: 500;
  margin-left: 0.1rem;
}

.cof-kpi-delta-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.65rem;
  color: var(--secondary);
  margin-top: 0.25rem;
  letter-spacing: 0.03em;
}

.cof-kpi-delta-c0aa3c6c6ba86623c3a6e79446929763.ok   { color: var(--cof-accent); }

 
.cof-body-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1rem;
  align-items: stretch;
}

.cof-panel-c0aa3c6c6ba86623c3a6e79446929763 {
  border: 1px solid var(--border);
  border-radius: 4px;
  padding: 1rem;
  background: var(--entry);
}

.cof-panel-head-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  justify-content: space-between;
  align-items: baseline;
  margin-bottom: 0.75rem;
  padding-bottom: 0.55rem;
  border-bottom: 1px solid var(--border);
}

.cof-panel-title-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.62rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
}

.cof-panel-sub-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.65rem;
  color: var(--secondary);
}

 
.cof-teams-c0aa3c6c6ba86623c3a6e79446929763 { display: flex; flex-direction: column; gap: 0.3rem; }

.cof-team-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: 80px 1fr 72px;
  gap: 0.6rem;
  align-items: center;
  padding: 0.4rem 0.5rem;
  border-radius: 3px;
  cursor: pointer;
  transition: background 0.2s ease;
}

.cof-team-c0aa3c6c6ba86623c3a6e79446929763:hover { background: var(--code-bg); }
.cof-team-c0aa3c6c6ba86623c3a6e79446929763.active { background: var(--cof-accent-soft); }

.cof-team-name-c0aa3c6c6ba86623c3a6e79446929763 {
  font-size: 0.82rem;
  font-weight: 500;
  color: var(--primary);
  white-space: nowrap;
  overflow: hidden;
  text-overflow: ellipsis;
  display: flex;
  align-items: center;
  gap: 0.35rem;
}

.cof-team-flag-c0aa3c6c6ba86623c3a6e79446929763 {
  display: inline-block;
  width: 5px;
  height: 5px;
  border-radius: 50%;
  background: var(--cof-accent);
  flex-shrink: 0;
}

.cof-team-bar-wrap-c0aa3c6c6ba86623c3a6e79446929763 {
  position: relative;
  height: 16px;
  background: var(--code-bg);
  border-radius: 2px;
  overflow: hidden;
}

.cof-team-bar-c0aa3c6c6ba86623c3a6e79446929763 {
  height: 100%;
  background: var(--tertiary);
  border-radius: 2px;
  transition: width 0.8s cubic-bezier(0.4, 0, 0.2, 1), background 0.2s ease;
  width: 0;
}

.cof-team-c0aa3c6c6ba86623c3a6e79446929763.flag .cof-team-bar-c0aa3c6c6ba86623c3a6e79446929763 { background: var(--cof-accent); }
.cof-team-c0aa3c6c6ba86623c3a6e79446929763.active .cof-team-bar-c0aa3c6c6ba86623c3a6e79446929763 { background: var(--cof-accent); }

.cof-team-bar-label-c0aa3c6c6ba86623c3a6e79446929763 {
  position: absolute;
  left: 0.45rem;
  top: 50%;
  transform: translateY(-50%);
  font-family: 'JetBrains Mono', ui-monospace, monospace;
  font-size: 0.62rem;
  font-weight: 500;
  color: var(--primary);
  font-variant-numeric: tabular-nums;
  pointer-events: none;
}

.cof-team-c0aa3c6c6ba86623c3a6e79446929763.flag .cof-team-bar-label-c0aa3c6c6ba86623c3a6e79446929763,
.cof-team-c0aa3c6c6ba86623c3a6e79446929763.active .cof-team-bar-label-c0aa3c6c6ba86623c3a6e79446929763 { color: var(--entry); mix-blend-mode: normal; }

.cof-team-ppr-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, monospace;
  font-size: 0.7rem;
  color: var(--secondary);
  text-align: right;
  font-variant-numeric: tabular-nums;
}

 
.cof-drill-sum-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 0;
  margin-bottom: 0.85rem;
  border: 1px solid var(--border);
  border-radius: 3px;
  overflow: hidden;
}

.cof-drill-stat-c0aa3c6c6ba86623c3a6e79446929763 {
  padding: 0.55rem 0.7rem;
  border-right: 1px solid var(--border);
  border-bottom: 1px solid var(--border);
}

.cof-drill-stat-c0aa3c6c6ba86623c3a6e79446929763:nth-child(2n) { border-right: none; }
.cof-drill-stat-c0aa3c6c6ba86623c3a6e79446929763:nth-last-child(-n+2) { border-bottom: none; }

.cof-drill-stat-label-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.55rem;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.09em;
  margin-bottom: 0.15rem;
}

.cof-drill-stat-value-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.95rem;
  font-weight: 600;
  color: var(--primary);
  font-variant-numeric: tabular-nums;
}

.cof-drill-section-c0aa3c6c6ba86623c3a6e79446929763 { margin-bottom: 0.85rem; }

.cof-drill-section-c0aa3c6c6ba86623c3a6e79446929763 .cof-section-label-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, 'SF Mono', Consolas, Monaco, monospace;
  font-size: 0.58rem;
  font-weight: 500;
  color: var(--secondary);
  text-transform: uppercase;
  letter-spacing: 0.1em;
  margin: 0 0 0.4rem 0;
}

.cof-model-bar-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  height: 20px;
  border-radius: 2px;
  overflow: hidden;
  border: 1px solid var(--border);
}

.cof-model-seg-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: center;
  justify-content: center;
  font-family: 'JetBrains Mono', ui-monospace, monospace;
  font-size: 0.58rem;
  color: var(--primary);
  transition: width 0.5s ease;
  overflow: hidden;
  white-space: nowrap;
}

.cof-model-seg-c0aa3c6c6ba86623c3a6e79446929763[data-model="haiku"]  { background: var(--tertiary); color: var(--primary); }
.cof-model-seg-c0aa3c6c6ba86623c3a6e79446929763[data-model="sonnet"] { background: var(--secondary); color: var(--entry); }
.cof-model-seg-c0aa3c6c6ba86623c3a6e79446929763[data-model="opus"]   { background: var(--cof-accent); color: #fff; }

.cof-model-legend-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  gap: 0.7rem;
  margin-top: 0.35rem;
  flex-wrap: wrap;
  font-family: 'JetBrains Mono', ui-monospace, monospace;
  font-size: 0.65rem;
  color: var(--secondary);
}

.cof-model-legend-item-c0aa3c6c6ba86623c3a6e79446929763 {
  display: flex;
  align-items: center;
  gap: 0.3rem;
  font-variant-numeric: tabular-nums;
}

.cof-model-dot-c0aa3c6c6ba86623c3a6e79446929763 { width: 7px; height: 7px; border-radius: 1px; }

.cof-tasks-c0aa3c6c6ba86623c3a6e79446929763 { display: flex; flex-direction: column; gap: 0.2rem; }

.cof-task-c0aa3c6c6ba86623c3a6e79446929763 {
  display: grid;
  grid-template-columns: 1fr auto;
  align-items: center;
  padding: 0.3rem 0;
  border-bottom: 1px dashed var(--border);
  font-size: 0.78rem;
  color: var(--primary);
}

.cof-task-c0aa3c6c6ba86623c3a6e79446929763:last-child { border-bottom: none; }

.cof-task-val-c0aa3c6c6ba86623c3a6e79446929763 {
  font-family: 'JetBrains Mono', ui-monospace, monospace;
  color: var(--secondary);
  font-variant-numeric: tabular-nums;
}

.cof-insight-c0aa3c6c6ba86623c3a6e79446929763 {
  padding: 0.65rem 0.8rem;
  border-radius: 3px;
  font-size: 0.8rem;
  line-height: 1.5;
  color: var(--primary);
  border: 1px solid var(--border);
  background: var(--code-bg);
}

.cof-insight-c0aa3c6c6ba86623c3a6e79446929763.warn {
  background: var(--cof-accent-soft);
  border-color: var(--cof-accent-border);
  border-left: 2px solid var(--cof-accent);
}

.cof-insight-c0aa3c6c6ba86623c3a6e79446929763.ok {
  border-left: 2px solid var(--secondary);
}

.cof-insight-c0aa3c6c6ba86623c3a6e79446929763 strong {
  font-weight: 600;
}

.cof-insight-c0aa3c6c6ba86623c3a6e79446929763.warn strong { color: var(--cof-accent); }

 
.cof-footer-c0aa3c6c6ba86623c3a6e79446929763 {
  margin-top: 1.25rem;
  padding-top: 1rem;
  border-top: 1px solid var(--border);
  text-align: center;
  font-size: 0.85rem;
  color: var(--secondary);
  line-height: 1.6;
}

.cof-footer-c0aa3c6c6ba86623c3a6e79446929763 strong { color: var(--cof-accent); font-weight: 600; }

 
@media (max-width: 900px) {
  .cof-body-c0aa3c6c6ba86623c3a6e79446929763 { grid-template-columns: 1fr; }
  .cof-kpis-c0aa3c6c6ba86623c3a6e79446929763 { grid-template-columns: repeat(2, 1fr); }
  .cof-kpi-c0aa3c6c6ba86623c3a6e79446929763:nth-child(2n) { border-right: none; }
  .cof-kpi-c0aa3c6c6ba86623c3a6e79446929763:nth-child(-n+2) { border-bottom: 1px solid var(--border); }
}

@media (max-width: 520px) {
  .cof-c0aa3c6c6ba86623c3a6e79446929763 { padding: 1.25rem; }
  .cof-team-c0aa3c6c6ba86623c3a6e79446929763 { grid-template-columns: 60px 1fr 58px; gap: 0.4rem; }
  .cof-team-name-c0aa3c6c6ba86623c3a6e79446929763 { font-size: 0.72rem; }
  .cof-drill-sum-c0aa3c6c6ba86623c3a6e79446929763 { grid-template-columns: 1fr; }
  .cof-drill-stat-c0aa3c6c6ba86623c3a6e79446929763 { border-right: none; }
}
</style>

<div class="cof-c0aa3c6c6ba86623c3a6e79446929763">
  <div class="cof-head-c0aa3c6c6ba86623c3a6e79446929763">
    <div class="cof-head-left-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="cof-title-c0aa3c6c6ba86623c3a6e79446929763">Agentic platform: cost observability</div>
      <div class="cof-subtitle-c0aa3c6c6ba86623c3a6e79446929763">Token → dollar → team → task. The view your CFO asks for on Monday morning.</div>
    </div>
    <div class="cof-period-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="cof-period-dot-c0aa3c6c6ba86623c3a6e79446929763"></div>
      apr 2026 · 4,200 eng · live
    </div>
  </div>

  <div class="cof-kpis-c0aa3c6c6ba86623c3a6e79446929763">
    <div class="cof-kpi-c0aa3c6c6ba86623c3a6e79446929763" data-kpi="tokens">
      <div class="cof-kpi-label-c0aa3c6c6ba86623c3a6e79446929763">monthly tokens</div>
      <div class="cof-kpi-value-c0aa3c6c6ba86623c3a6e79446929763">847<span class="cof-unit-c0aa3c6c6ba86623c3a6e79446929763">M</span></div>
      <div class="cof-kpi-delta-c0aa3c6c6ba86623c3a6e79446929763">+12.4% mom</div>
    </div>
    <div class="cof-kpi-c0aa3c6c6ba86623c3a6e79446929763" data-kpi="spend">
      <div class="cof-kpi-label-c0aa3c6c6ba86623c3a6e79446929763">monthly spend</div>
      <div class="cof-kpi-value-c0aa3c6c6ba86623c3a6e79446929763">$284<span class="cof-unit-c0aa3c6c6ba86623c3a6e79446929763">.2k</span></div>
      <div class="cof-kpi-delta-c0aa3c6c6ba86623c3a6e79446929763">+8.1% mom</div>
    </div>
    <div class="cof-kpi-c0aa3c6c6ba86623c3a6e79446929763" data-kpi="cache">
      <div class="cof-kpi-label-c0aa3c6c6ba86623c3a6e79446929763">cache hit rate</div>
      <div class="cof-kpi-value-c0aa3c6c6ba86623c3a6e79446929763">78<span class="cof-unit-c0aa3c6c6ba86623c3a6e79446929763">%</span></div>
      <div class="cof-kpi-delta-c0aa3c6c6ba86623c3a6e79446929763 ok">target &gt; 75%</div>
    </div>
    <div class="cof-kpi-c0aa3c6c6ba86623c3a6e79446929763" data-kpi="ppr">
      <div class="cof-kpi-label-c0aa3c6c6ba86623c3a6e79446929763">cost per merged pr</div>
      <div class="cof-kpi-value-c0aa3c6c6ba86623c3a6e79446929763">$1.22</div>
      <div class="cof-kpi-delta-c0aa3c6c6ba86623c3a6e79446929763">−14.5% mom</div>
    </div>
  </div>

  <div class="cof-body-c0aa3c6c6ba86623c3a6e79446929763">
    <div class="cof-panel-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="cof-panel-head-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="cof-panel-title-c0aa3c6c6ba86623c3a6e79446929763">spend by team</div>
        <div class="cof-panel-sub-c0aa3c6c6ba86623c3a6e79446929763">click to drill</div>
      </div>
      <div class="cof-teams-c0aa3c6c6ba86623c3a6e79446929763" id="cof-teams-c0aa3c6c6ba86623c3a6e79446929763"></div>
    </div>

    <div class="cof-panel-c0aa3c6c6ba86623c3a6e79446929763" id="cof-drill-c0aa3c6c6ba86623c3a6e79446929763">
      <div class="cof-panel-head-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="cof-panel-title-c0aa3c6c6ba86623c3a6e79446929763" id="cof-drill-team-c0aa3c6c6ba86623c3a6e79446929763">…</div>
        <div class="cof-panel-sub-c0aa3c6c6ba86623c3a6e79446929763">apr 2026</div>
      </div>

      <div class="cof-drill-sum-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="cof-drill-stat-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="cof-drill-stat-label-c0aa3c6c6ba86623c3a6e79446929763">monthly spend</div>
          <div class="cof-drill-stat-value-c0aa3c6c6ba86623c3a6e79446929763" id="cof-drill-spend-c0aa3c6c6ba86623c3a6e79446929763">…</div>
        </div>
        <div class="cof-drill-stat-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="cof-drill-stat-label-c0aa3c6c6ba86623c3a6e79446929763">cost per pr</div>
          <div class="cof-drill-stat-value-c0aa3c6c6ba86623c3a6e79446929763" id="cof-drill-ppr-c0aa3c6c6ba86623c3a6e79446929763">…</div>
        </div>
        <div class="cof-drill-stat-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="cof-drill-stat-label-c0aa3c6c6ba86623c3a6e79446929763">merged prs</div>
          <div class="cof-drill-stat-value-c0aa3c6c6ba86623c3a6e79446929763" id="cof-drill-prs-c0aa3c6c6ba86623c3a6e79446929763">…</div>
        </div>
        <div class="cof-drill-stat-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="cof-drill-stat-label-c0aa3c6c6ba86623c3a6e79446929763">cache hit</div>
          <div class="cof-drill-stat-value-c0aa3c6c6ba86623c3a6e79446929763" id="cof-drill-cache-c0aa3c6c6ba86623c3a6e79446929763">…</div>
        </div>
      </div>

      <div class="cof-drill-section-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="cof-section-label-c0aa3c6c6ba86623c3a6e79446929763">model mix</div>
        <div class="cof-model-bar-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="cof-model-seg-c0aa3c6c6ba86623c3a6e79446929763" data-model="haiku"  id="cof-m-haiku-c0aa3c6c6ba86623c3a6e79446929763">Haiku</div>
          <div class="cof-model-seg-c0aa3c6c6ba86623c3a6e79446929763" data-model="sonnet" id="cof-m-sonnet-c0aa3c6c6ba86623c3a6e79446929763">Sonnet</div>
          <div class="cof-model-seg-c0aa3c6c6ba86623c3a6e79446929763" data-model="opus"   id="cof-m-opus-c0aa3c6c6ba86623c3a6e79446929763">Opus</div>
        </div>
        <div class="cof-model-legend-c0aa3c6c6ba86623c3a6e79446929763">
          <div class="cof-model-legend-item-c0aa3c6c6ba86623c3a6e79446929763"><div class="cof-model-dot-c0aa3c6c6ba86623c3a6e79446929763" style="background: var(--tertiary)"></div>haiku <span id="cof-lg-haiku-c0aa3c6c6ba86623c3a6e79446929763">…</span></div>
          <div class="cof-model-legend-item-c0aa3c6c6ba86623c3a6e79446929763"><div class="cof-model-dot-c0aa3c6c6ba86623c3a6e79446929763" style="background: var(--secondary)"></div>sonnet <span id="cof-lg-sonnet-c0aa3c6c6ba86623c3a6e79446929763">…</span></div>
          <div class="cof-model-legend-item-c0aa3c6c6ba86623c3a6e79446929763"><div class="cof-model-dot-c0aa3c6c6ba86623c3a6e79446929763" style="background: var(--cof-accent)"></div>opus <span id="cof-lg-opus-c0aa3c6c6ba86623c3a6e79446929763">…</span></div>
        </div>
      </div>

      <div class="cof-drill-section-c0aa3c6c6ba86623c3a6e79446929763">
        <div class="cof-section-label-c0aa3c6c6ba86623c3a6e79446929763">top task types</div>
        <div class="cof-tasks-c0aa3c6c6ba86623c3a6e79446929763" id="cof-tasks-c0aa3c6c6ba86623c3a6e79446929763"></div>
      </div>

      <div class="cof-drill-section-c0aa3c6c6ba86623c3a6e79446929763" style="margin-bottom: 0">
        <div class="cof-insight-c0aa3c6c6ba86623c3a6e79446929763" id="cof-insight-c0aa3c6c6ba86623c3a6e79446929763">…</div>
      </div>
    </div>
  </div>

  <div class="cof-footer-c0aa3c6c6ba86623c3a6e79446929763">
    You cannot attribute cost you don't see. <strong>The LLM gateway is the prerequisite for every number above.</strong>
  </div>
</div>

<script>
(function() {
  var id = 'c0aa3c6c6ba86623c3a6e79446929763';

  var teams = [
    { key: 'search',     name: 'Search',     spend: 48900, prs: 11800, ppr: 4.14, haiku: 12, sonnet: 48, opus: 40, cache: 71, flag: true,
      tasks: [ ['Ranking-model codegen', 18400], ['Query-parser refactors', 11200], ['Eval-harness rewrites', 8100], ['Debug / triage', 6900] ],
      insight: { type: 'warn', html: '<strong>Opus share is 40%, vs platform average 18%.</strong> Worth reviewing routing policy. Much of this is explanatory codegen Sonnet handles within accuracy tolerance.' } },
    { key: 'payments',   name: 'Payments',   spend: 42100, prs: 29200, ppr: 1.44, haiku: 22, sonnet: 65, opus: 13, cache: 82,
      tasks: [ ['Schema migrations', 14600], ['Compliance code reviews', 9300], ['Integration tests', 7800], ['Reconciliation scripts', 6400] ],
      insight: { type: 'ok', html: '<strong>Healthy tiered routing.</strong> Cache hit 82% (above 75% target). Cost/PR trending down 9% MoM.' } },
    { key: 'platform',   name: 'Platform',   spend: 36700, prs: 53100, ppr: 0.69, haiku: 34, sonnet: 58, opus: 8, cache: 85,
      tasks: [ ['Boilerplate scaffolding', 13900], ['Config generation', 9100], ['CI rule updates', 6800], ['Dependency bumps', 4700] ],
      insight: { type: 'ok', html: '<strong>Exemplar tiered routing</strong>: 34% of calls on Haiku for high-volume classification. Lowest cost/PR of the org.' } },
    { key: 'storefront', name: 'Storefront', spend: 31200, prs: 41400, ppr: 0.75, haiku: 28, sonnet: 63, opus: 9, cache: 79,
      tasks: [ ['Component refactors', 11200], ['A11y fixes', 7400], ['A/B test wiring', 6800], ['i18n updates', 4300] ],
      insight: { type: 'ok', html: '<strong>Migration project visible</strong>: component-refactor spend up 22%, tracking the React 19 upgrade roadmap.' } },
    { key: 'infra',      name: 'Infra',      spend: 28800, prs: 19700, ppr: 1.46, haiku: 19, sonnet: 61, opus: 20, cache: 74,
      tasks: [ ['Terraform generation', 9600], ['K8s manifest review', 7100], ['Runbook edits', 5900], ['Cost-attribution scripts', 3800] ],
      insight: { type: 'ok', html: 'Opus usage justified for architectural planning; codegen portion routes correctly to Sonnet.' } },
    { key: 'identity',   name: 'Identity',   spend: 24600, prs: 15300, ppr: 1.61, haiku: 16, sonnet: 68, opus: 16, cache: 77,
      tasks: [ ['SSO integrations', 8800], ['Permission-model reviews', 6300], ['Audit-log queries', 4900], ['Token-rotation scripts', 3100] ],
      insight: { type: 'ok', html: 'Running permission-model reviews on Opus is the right call; the blast radius of errors is high.' } },
    { key: 'data',       name: 'Data',       spend: 21400, prs: 22600, ppr: 0.95, haiku: 31, sonnet: 60, opus: 9, cache: 81,
      tasks: [ ['SQL generation', 7200], ['dbt model review', 5400], ['Schema evolution', 4300], ['Pipeline debug', 3700] ],
      insight: { type: 'ok', html: 'SQL generation cached at 81%. The structured prompt template is doing its job.' } },
    { key: 'mobile',     name: 'Mobile',     spend: 19800, prs: 18100, ppr: 1.09, haiku: 24, sonnet: 62, opus: 14, cache: 76,
      tasks: [ ['Native bridge code', 6900], ['Test-suite repairs', 5100], ['Release-notes generation', 4100], ['Crash-log analysis', 2900] ],
      insight: { type: 'ok', html: 'Crash-log analysis is a candidate for a dedicated Skill; currently spread across ad-hoc prompts.' } },
    { key: 'growth',     name: 'Growth',     spend: 16500, prs: 12800, ppr: 1.29, haiku: 38, sonnet: 55, opus: 7, cache: 83,
      tasks: [ ['Experiment scaffolding', 5300], ['Funnel-query generation', 4700], ['Landing-page updates', 3400], ['Email template copy', 2400] ],
      insight: { type: 'ok', html: 'Highest Haiku share at 38%. Pattern to replicate for high-volume, classifier-shaped tasks.' } },
    { key: 'dx',         name: 'DX',         spend: 14200, prs: 9400, ppr: 1.51, haiku: 21, sonnet: 63, opus: 16, cache: 72,
      tasks: [ ['Doc generation', 4600], ['SDK example code', 3900], ['Release-note synthesis', 3200], ['Internal tool fixes', 1800] ],
      insight: { type: 'ok', html: 'Cache hit 72%, below target. Likely from doc-generation prompts with page-specific context. Review template structure.' } }
  ];

  var maxSpend = 0;
  for (var i = 0; i < teams.length; i++) if (teams[i].spend > maxSpend) maxSpend = teams[i].spend;

  var teamsEl = document.getElementById('cof-teams-' + id);

  function fmt$(n, decimals) {
    return '$' + (n >= 1000 ? (n / 1000).toFixed(decimals != null ? decimals : 1) + 'k' : n.toFixed(decimals != null ? decimals : 0));
  }

  function fmtNum(n) { return n.toLocaleString(undefined, { maximumFractionDigits: 0 }); }

  for (var i = 0; i < teams.length; i++) {
    (function(t, idx) {
      var el = document.createElement('div');
      el.className = 'cof-team-' + id + (t.flag ? ' flag' : '');
      el.setAttribute('data-team', t.key);
      el.innerHTML =
        '<div class="cof-team-name-' + id + '">' +
          (t.flag ? '<span class="cof-team-flag-' + id + '"></span>' : '') +
          t.name +
        '</div>' +
        '<div class="cof-team-bar-wrap-' + id + '">' +
          '<div class="cof-team-bar-' + id + '" data-target="' + (t.spend / maxSpend * 100) + '"></div>' +
          '<div class="cof-team-bar-label-' + id + '">' + fmt$(t.spend) + '</div>' +
        '</div>' +
        '<div class="cof-team-ppr-' + id + '">' + fmt$(t.ppr, 2) + '/PR</div>';

      el.addEventListener('click', function() { selectTeam(t.key); });
      teamsEl.appendChild(el);

      setTimeout(function() {
        var bar = el.querySelector('.cof-team-bar-' + id);
        bar.style.width = bar.getAttribute('data-target') + '%';
      }, 150 + idx * 50);
    })(teams[i], i);
  }

  function selectTeam(key) {
    var t = null;
    for (var i = 0; i < teams.length; i++) if (teams[i].key === key) t = teams[i];
    if (!t) return;

    var rows = document.querySelectorAll('.cof-team-' + id);
    for (var j = 0; j < rows.length; j++) {
      rows[j].classList.toggle('active', rows[j].getAttribute('data-team') === key);
    }

    document.getElementById('cof-drill-team-' + id).textContent = t.name + ' team';
    document.getElementById('cof-drill-spend-' + id).textContent = fmt$(t.spend);
    document.getElementById('cof-drill-ppr-' + id).textContent = fmt$(t.ppr, 2);
    document.getElementById('cof-drill-prs-' + id).textContent = fmtNum(t.prs);
    document.getElementById('cof-drill-cache-' + id).textContent = t.cache + '%';

    document.getElementById('cof-m-haiku-' + id).style.width  = t.haiku + '%';
    document.getElementById('cof-m-sonnet-' + id).style.width = t.sonnet + '%';
    document.getElementById('cof-m-opus-' + id).style.width   = t.opus + '%';

    document.getElementById('cof-m-haiku-' + id).textContent  = t.haiku  > 12 ? 'Haiku'  : '';
    document.getElementById('cof-m-sonnet-' + id).textContent = t.sonnet > 12 ? 'Sonnet' : '';
    document.getElementById('cof-m-opus-' + id).textContent   = t.opus   > 12 ? 'Opus'   : '';

    document.getElementById('cof-lg-haiku-' + id).textContent  = t.haiku  + '%';
    document.getElementById('cof-lg-sonnet-' + id).textContent = t.sonnet + '%';
    document.getElementById('cof-lg-opus-' + id).textContent   = t.opus   + '%';

    var tasksEl = document.getElementById('cof-tasks-' + id);
    tasksEl.innerHTML = '';
    for (var k = 0; k < t.tasks.length; k++) {
      var row = document.createElement('div');
      row.className = 'cof-task-' + id;
      row.innerHTML =
        '<div>' + t.tasks[k][0] + '</div>' +
        '<div class="cof-task-val-' + id + '">' + fmt$(t.tasks[k][1]) + '</div>';
      tasksEl.appendChild(row);
    }

    var insightEl = document.getElementById('cof-insight-' + id);
    insightEl.className = 'cof-insight-' + id + ' ' + t.insight.type;
    insightEl.innerHTML = t.insight.html;
  }

  setTimeout(function() { selectTeam('search'); }, 250 + teams.length * 50);
})();
</script>
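<p>The token → dollar → team → task chain in the dashboard can be sketched as a small roll-up over gateway log records. This is a minimal illustration, not the implementation behind the figures above: the record shape (<code>team</code>, <code>task</code>, <code>model</code>, <code>inputTokens</code>, <code>outputTokens</code>), the tier names, and every price in <code>PRICE_PER_MTOK</code> are assumptions made up for the example.</p>

```javascript
// Hypothetical per-million-token prices in USD. Placeholders, not real list prices.
const PRICE_PER_MTOK = {
  small:  { input: 1.0,  output: 5.0 },
  medium: { input: 3.0,  output: 15.0 },
  large:  { input: 15.0, output: 75.0 },
};

// Convert one gateway log record (assumed shape) into a dollar cost.
function requestCost(rec) {
  const price = PRICE_PER_MTOK[rec.model];
  if (!price) throw new Error('unknown model tier: ' + rec.model);
  return (rec.inputTokens * price.input + rec.outputTokens * price.output) / 1e6;
}

// Roll costs up by team, then by task type within each team.
function attribute(records) {
  const byTeam = {};
  for (const rec of records) {
    const cost = requestCost(rec);
    const team = byTeam[rec.team] || (byTeam[rec.team] = { total: 0, tasks: {} });
    team.total += cost;
    team.tasks[rec.task] = (team.tasks[rec.task] || 0) + cost;
  }
  return byTeam;
}

// Made-up sample records, tagged by the gateway at request time.
const sample = [
  { team: 'search', task: 'debug',   model: 'large',  inputTokens: 200000,  outputTokens: 20000 },
  { team: 'search', task: 'codegen', model: 'medium', inputTokens: 100000,  outputTokens: 50000 },
  { team: 'data',   task: 'sql',     model: 'small',  inputTokens: 1000000, outputTokens: 0 },
];
const report = attribute(sample);
console.log(JSON.stringify(report, null, 2));
```

<p>The roll-up only works if every request carries a team and task tag, which is why the tagging has to happen at the gateway rather than in each client.</p>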

<h3 id="what-to-measure">What to measure</h3>
<p>The most common failure mode in &ldquo;AI productivity&rdquo; reporting is Goodhart&rsquo;s Law in a lab coat: a metric that stops meaning anything once it becomes the target. A measurement stack that survives scrutiny spans four families: <strong>proxy</strong> (acceptance rate, session count), <strong>activity</strong> (PR count, plus DORA-style lead time and change failure rate), <strong>outcome</strong> (defect escape, rework, developer-reported friction), and <strong>economic</strong> (hours saved, cost per merged PR). An architect reporting to leadership needs at least one number from each.</p>
<p>Consider the published record: <strong>Uber reports ~10% PR-velocity lift</strong> (<a href="https://newsletter.pragmaticengineer.com/p/how-uber-uses-ai-for-development">Pragmatic Engineer</a>), an activity metric. <strong>Shopify claims a ~20% productivity gain</strong>, accompanied by a <a href="https://www.firstround.com/ai/shopify">public refusal to measure it in LOC</a>, an outcome claim. <strong>Block&rsquo;s 8-10 hours saved per engineer per week</strong> is a clean economic metric. <strong>Airbnb&rsquo;s 18 months to 6 weeks</strong> is a sharp outcome metric with a legible counterfactual. Same reality. Different slices.</p>
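<p>The economic family is the easiest to compute and the easiest to aggregate wrongly. The sketch below takes the spend and merged-PR figures for three teams from the dashboard data earlier in this post and compares two ways of reporting an org-level cost per merged PR. The helper name <code>costPerPr</code> and the choice of three teams are illustrative assumptions.</p>

```javascript
// Spend (USD) and merged-PR counts for three teams, copied from the dashboard data.
const teamFigures = [
  { name: 'Search',   spend: 48900, prs: 11800 },
  { name: 'Payments', spend: 42100, prs: 29200 },
  { name: 'Platform', spend: 36700, prs: 53100 },
];

// Cost per merged PR; undefined when nothing was merged.
function costPerPr(spend, prs) {
  if (prs === 0) return null;
  return spend / prs;
}

const perTeam = teamFigures.map(t => ({ name: t.name, ppr: costPerPr(t.spend, t.prs) }));

// Pooled ratio: total spend over total PRs. This is what the org actually paid per PR.
const totalSpend = teamFigures.reduce((sum, t) => sum + t.spend, 0);
const totalPrs = teamFigures.reduce((sum, t) => sum + t.prs, 0);
const pooled = costPerPr(totalSpend, totalPrs);

// Mean of per-team ratios: overweights small, expensive teams.
const meanOfRatios = perTeam.reduce((sum, t) => sum + t.ppr, 0) / perTeam.length;

console.log(perTeam.map(t => t.name + ': $' + t.ppr.toFixed(2)).join(', '));
console.log('pooled: $' + pooled.toFixed(2) + ', mean of ratios: $' + meanOfRatios.toFixed(2));
```

<p>The two org-level numbers differ because one expensive, low-volume team pulls the mean of ratios upward. Whichever one goes on the leadership slide, say which it is.</p>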
<h2 id="chapter-7-the-build-sequence">Chapter 7: The Build Sequence</h2>
<p>The platform described above is not a weekend project. It also does not require a three-year transformation program. The sequence that has worked in the public record collapses into three horizons.</p>
<p><strong>Days 0-90. Stand up the minimum viable control plane.</strong></p>
<ul>
<li>Pick one harness. Don&rsquo;t debate it for a quarter. Any of them is fine; the harness is replaceable.</li>
<li>Stand up the LLM gateway. Every agent request flows through it. Day-one cost attribution.</li>
<li><strong>Ship one Recipe.</strong> Not twelve. Pick one repeatable task (PR security review, migration shard, on-call triage). Versioned, parameterised, triggered by one event, observable end-to-end. Everything else is scaffolding for the next Recipe.</li>
<li>Stand up one golden eval set with an LLM-as-judge rubric. Wire it into CI. Refuse to promote prompts or Recipes that regress.</li>
<li>Turn on OpenTelemetry tracing end-to-end.</li>
</ul>
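<p>The &ldquo;refuse to promote prompts or Recipes that regress&rdquo; rule from the list above can be sketched as a small gate function. This is one possible shape under stated assumptions: the scores are taken to be 0-to-1 rubric scores already produced by an LLM-as-judge run over the same golden set, and the <code>tolerance</code> default of 0.02 is an arbitrary placeholder, not a recommended value.</p>

```javascript
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Decide whether a candidate prompt or Recipe may be promoted.
// baselineScores / candidateScores: per-case judge scores on the same golden set (assumed inputs).
// tolerance: how far a score may drop before it counts as a regression (assumed value).
function promotionGate(baselineScores, candidateScores, tolerance = 0.02) {
  if (baselineScores.length === 0 || baselineScores.length !== candidateScores.length) {
    throw new Error('baseline and candidate must cover the same non-empty golden set');
  }
  const baseline = mean(baselineScores);
  const candidate = mean(candidateScores);
  // Indices of golden cases where the candidate scored noticeably worse than the baseline.
  const regressedCases = candidateScores
    .map((score, i) => (score < baselineScores[i] - tolerance ? i : -1))
    .filter(i => i !== -1);
  return {
    baseline,
    candidate,
    regressedCases,
    promote: candidate >= baseline - tolerance,
  };
}

const gate = promotionGate([0.9, 0.8, 0.7], [0.9, 0.5, 0.75]);
console.log(gate.promote ? 'promote' : 'blocked', 'regressed cases:', gate.regressedCases);
// In a CI job, a blocked result would fail the build, for example via process.exitCode = 1.
```

<p>Reporting the regressed case indices alongside the verdict matters as much as the verdict itself: it tells the author which golden examples to look at before trying again.</p>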
<p><strong>Months 3-6. Build the moat.</strong></p>
<ul>
<li>Context pipeline for your top-five repos: ingest, index, govern, serve. Measure hit rate.</li>
<li>Policy-as-code at the gateway. Scoped tokens. Async approvals for production actions.</li>
<li>Expand the eval harness to workflow-level: golden sets of Recipe invocations, shadow-mode promotion.</li>
<li>First KPI dashboard: one proxy, one activity, one outcome, one economic metric.</li>
</ul>
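<p>Policy-as-code, scoped tokens, and asynchronous approvals from the list above fit in one small decision function at the gateway. The sketch below is a simplified illustration: the action names, scope strings, and request fields (<code>action</code>, <code>tokenScopes</code>, <code>approvedBy</code>) are hypothetical, and a real deployment would more likely express the rules in a dedicated policy engine.</p>

```javascript
// Hypothetical policy table: the scopes each action needs and whether a human must approve it.
const POLICY = new Map([
  ['repo.read',   { scopes: ['repo:read'],   needsApproval: false }],
  ['pr.open',     { scopes: ['repo:write'],  needsApproval: false }],
  ['prod.deploy', { scopes: ['prod:deploy'], needsApproval: true }],
]);

// Evaluate one agent request against the policy.
// Returns a decision of 'allow', 'deny', or 'pending_approval' (queued for async human sign-off).
function evaluate(request) {
  const rule = POLICY.get(request.action);
  if (!rule) return { decision: 'deny', reason: 'unknown action' };  // default deny
  const missing = rule.scopes.filter(scope => !request.tokenScopes.includes(scope));
  if (missing.length > 0) {
    return { decision: 'deny', reason: 'missing scope: ' + missing.join(', ') };
  }
  if (rule.needsApproval && !request.approvedBy) {
    return { decision: 'pending_approval', reason: 'production action needs sign-off' };
  }
  return { decision: 'allow', reason: 'policy satisfied' };
}

const deployRequest = { action: 'prod.deploy', tokenScopes: ['prod:deploy'] };
console.log(evaluate(deployRequest));
console.log(evaluate({ ...deployRequest, approvedBy: 'oncall-lead' }));
```

<p>Default deny for unknown actions is the design choice worth keeping even if everything else changes: a new tool an agent discovers should have no permissions until someone writes a rule for it.</p>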
<p><strong>Months 6-12. Compound.</strong></p>
<ul>
<li>Orchestrator-worker topology for the hard workloads: migrations, cross-repo refactors, bulk compliance work.</li>
<li>Recipe registry self-service with SHA-pinned versions. Teams contribute; the platform team curates.</li>
<li>Progressive autonomy tiers. Graduate teams through read-only, sandboxed, PR, and production as their eval and incident track record earns it.</li>
<li>Per-team chargeback. The budget conversation changes the usage conversation.</li>
</ul>
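<p>The progressive autonomy tiers in the list above can be sketched as a graduation check run at each review. The tier order follows the text (read-only, sandboxed, PR, production); the thresholds in <code>REQUIREMENTS</code> and the field names <code>evalPassRate</code> and <code>incidents90d</code> are hypothetical placeholders. The sketch covers promotion only, not demotion after an incident.</p>

```javascript
// Tier order as described in the text, lowest autonomy first.
const TIERS = ['read-only', 'sandboxed', 'pr', 'production'];

// Hypothetical bar a team must clear to enter each tier.
const REQUIREMENTS = {
  'sandboxed':  { minEvalPassRate: 0.80, maxIncidents90d: 3 },
  'pr':         { minEvalPassRate: 0.90, maxIncidents90d: 1 },
  'production': { minEvalPassRate: 0.97, maxIncidents90d: 0 },
};

// Return the tier a team should hold after this review.
// A team moves up at most one tier at a time, and only if its track record earns it.
function nextTier(team) {
  const idx = TIERS.indexOf(team.tier);
  if (idx === -1) throw new Error('unknown tier: ' + team.tier);
  if (idx === TIERS.length - 1) return team.tier;  // already at the top
  const target = TIERS[idx + 1];
  const req = REQUIREMENTS[target];
  const earned =
    team.evalPassRate >= req.minEvalPassRate &&
    team.incidents90d <= req.maxIncidents90d;
  return earned ? target : team.tier;
}

console.log(nextTier({ tier: 'sandboxed', evalPassRate: 0.93, incidents90d: 1 }));
console.log(nextTier({ tier: 'pr', evalPassRate: 0.93, incidents90d: 0 }));
```

<p>Limiting promotion to one tier per review keeps the track record meaningful: a team reaches production only after it has spent time, and been evaluated, at every tier below it.</p>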
<p>Fund internal DevRel from day one. Uber&rsquo;s coursework moved Claude Code adoption from 32% to 63% of engineers in three months. Block&rsquo;s engineers found Goose through Slack channels, not mandates. Shopify paired a top-down AI-first memo with bottom-up tool freedom through the LLM proxy. The technical platform and the organisational motion need to ship together.</p>
<p>In twelve months, when your CFO asks what AI is costing and what it&rsquo;s earning, you will have an answer, because you built a platform rather than bought a license. The 11% have that answer for the same reason, not because they picked a better model.</p>
<h2 id="references">References</h2>
<ol>
<li><strong>Google Cloud / DORA.</strong> <a href="https://cloud.google.com/resources/content/2025-dora-ai-assisted-software-development-report">2025 State of AI-Assisted Software Development Report</a>. Source for 90% adoption, 30% distrust, PR size +154%, and the stability/throughput tension.</li>
<li><strong>Faros AI.</strong> <a href="https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025">Key Takeaways from the DORA Report 2025</a>. Practitioner analysis of the DORA findings.</li>
<li><strong>McKinsey / KPMG.</strong> <a href="https://kpmg.com/us/en/media/news/q4-ai-pulse.html">AI at Scale: Q4 2025 AI Pulse</a>. Source for the four-stage maturity model and the ~11% AI-native figure.</li>
<li><strong>OneReach / CIO.</strong> <a href="https://onereach.ai/blog/what-shapes-enterprise-ai-agents-in-the-future/">What Shapes Enterprise AI Agents in the Future</a>. Source for the 95% zero-ROI and 14% change-management figures.</li>
<li><strong>Block.</strong> <a href="https://block.xyz/inside/block-open-source-introduces-codename-goose">Block Open Source Introduces &ldquo;codename goose&rdquo;</a> and <a href="https://github.com/block/goose">Goose on GitHub</a>.</li>
<li><strong>Sequoia.</strong> <a href="https://sequoiacap.com/podcast/training-data-dhanji-prasanna/">Training Data podcast with Dhanji Prasanna</a>. Source for Block&rsquo;s 8-10 hours/week, 25% target, and 30-40% legacy-code figures.</li>
<li><strong>All Things Open.</strong> <a href="https://allthingsopen.org/articles/meet-goose-open-source-ai-agent">Meet Goose: The open source AI agent built for developers</a>.</li>
<li><strong>Bessemer Venture Partners.</strong> <a href="https://www.bvp.com/atlas/inside-shopifys-ai-first-engineering-playbook">Inside Shopify&rsquo;s AI-First Engineering Playbook</a>.</li>
<li><strong>First Round Review.</strong> <a href="https://www.firstround.com/ai/shopify">From Memo to Movement: Shopify&rsquo;s Cultural Adoption of AI</a>.</li>
<li><strong>Augment Code.</strong> <a href="https://www.augmentcode.com/context-engine">Context Engine</a> and <a href="https://www.augmentcode.com/blog/context-engine-mcp-now-live">Context Engine MCP now live</a>. Source for the 500k-file indexing, ~100ms retrieval, and pipeline-behind-MCP pattern.</li>
<li><strong>Pragmatic Engineer.</strong> <a href="https://newsletter.pragmaticengineer.com/p/how-uber-uses-ai-for-development">How Uber Uses AI for Development</a>. Source for the 84% agentic-coding adoption, Claude Code 32% to 63%, and DevRel investment.</li>
<li><strong>Sourcegraph.</strong> <a href="https://sourcegraph.com/blog/how-cody-understands-your-codebase">How Cody understands your codebase</a> and <a href="https://sourcegraph.com/blog/how-cody-provides-remote-repository-context">How Cody provides remote repository awareness</a>. Source for the three-layer context architecture and 300k+ repo scale.</li>
<li><strong>Atlassian.</strong> <a href="https://www.atlassian.com/blog/artificial-intelligence/developer-productivity-improved-with-rovo-dev/amp">30.8% Faster PRs: How AI-Driven Rovo Dev Code Reviewer Improved Developer Productivity</a>. Source for the ICSE 2026 publication figures.</li>
<li><strong>GitHub.</strong> <a href="https://resources.github.com/enterprise-content-roundup/december/">December 2025 Enterprise Roundup</a>. Source for Copilot Enterprise governance features.</li>
<li><strong>Microsoft DevBlogs.</strong> <a href="https://devblogs.microsoft.com/all-things-azure/agentic-platform-engineering-with-github-copilot/">Agentic Platform Engineering with GitHub Copilot</a>.</li>
<li><strong>Airbnb Engineering.</strong> <a href="https://medium.com/airbnb-engineering/accelerating-large-scale-test-migration-with-llms-9565c208023b">Accelerating Large-Scale Test Migration with LLMs</a>.</li>
<li><strong>Anthropic.</strong> <a href="https://modelcontextprotocol.io/">Model Context Protocol</a>.</li>
<li><strong>Gartner.</strong> <a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025">40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026</a>.</li>
<li><strong>Block.</strong> <a href="https://block.github.io/goose/docs/guides/recipes/recipe-reference/">Goose Recipes reference</a> and <a href="https://goose-docs.ai/recipes/">Goose Recipes cookbook</a>.</li>
<li><strong>Pulse MCP.</strong> <a href="https://www.pulsemcp.com/building-agents-with-goose/part-4-configure-your-agent-with-goose-recipes">Configure your agent with Goose Recipes</a>.</li>
<li><strong>Block.</strong> <a href="https://github.com/marketplace/actions/goose-ai-developer-agent">Goose AI Developer Agent GitHub Action</a>.</li>
<li><strong>GitHub.</strong> <a href="https://github.blog/ai-and-ml/automate-repository-tasks-with-github-agentic-workflows/">Automate repository tasks with GitHub Agentic Workflows</a>.</li>
<li><strong>Kestra.</strong> <a href="https://kestra.io/1-0">Kestra 1.0 launch</a>. Source for the 2B+ workflow executions in 2025.</li>
<li><strong>Temporal.</strong> <a href="https://temporal.io/blog/orchestrating-ambient-agents-with-temporal">Orchestrating Ambient Agents with Temporal</a>.</li>
<li><strong>MindStudio.</strong> <a href="https://www.mindstudio.ai/blog/stripe-minions-vs-shopify-roast-ai-coding-harnesses">Stripe Minions vs Shopify Roast</a>. Source for Stripe&rsquo;s scoped-context agent pattern.</li>
<li><strong>GitHub.</strong> <a href="https://github.blog/enterprise-software/devops/building-organization-wide-governance-and-re-use-for-ci-cd-and-automation-with-github-actions/">Building organization-wide governance for CI/CD with GitHub Actions</a>.</li>
</ol>
]]></content:encoded></item><item><title>Inside Claude Code: Anatomy of a 512K-Line AI Agent</title><link>https://www.mdjawad.com/posts/inside-claude-code/</link><pubDate>Wed, 08 Apr 2026 12:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/inside-claude-code/</guid><description>An interactive technical breakdown of Claude Code&amp;rsquo;s architecture — from the query loop and five compaction mechanisms to the permission pipeline and feature flags. Based on source code analysis of ~1,884 TypeScript files.</description><content:encoded><![CDATA[<style>
.cc-iframe-wrap {
  position: relative;
  width: 100vw;
  left: 50%;
  transform: translateX(-50%);
  margin-top: 1rem;
}
.cc-iframe-wrap iframe {
  width: 100%;
  height: 100vh;
  border: none;
}
</style>
<div class="cc-iframe-wrap">
  <iframe src="/viz/claude-code-anatomy/" loading="lazy"></iframe>
</div>
]]></content:encoded></item><item><title>State Space Models and the Mamba Architecture: From First Principles to Mamba-3</title><link>https://www.mdjawad.com/posts/state-space-models-mamba/</link><pubDate>Sun, 22 Mar 2026 10:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/state-space-models-mamba/</guid><description>NVIDIA&amp;rsquo;s Nemotron-3-Super, IBM&amp;rsquo;s Granite, and AI21&amp;rsquo;s Jamba all ship hybrid SSM-Transformer architectures in production. This post builds State Space Models from scratch, starting with a single differential equation, and works up through HiPPO, S4, and the three generations of Mamba to explain why.</description><content:encoded><![CDATA[<h2 id="what-this-post-covers">What This Post Covers</h2>
<p>NVIDIA&rsquo;s <a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16">Nemotron-3-Super</a> is not a Transformer. Not entirely. It is a hybrid architecture that interleaves Mamba-2 SSM layers with select attention layers, spending the majority of its compute on state space operations rather than self-attention. It ships in production on NVIDIA&rsquo;s inference stack and competes with pure Transformer models at the same scale. NVIDIA is not alone. IBM&rsquo;s Granite 4.0 uses a 9:1 SSM-to-Transformer ratio. AI21&rsquo;s Jamba uses 7:1 (one attention layer for every seven Mamba layers). Zyphra&rsquo;s Zamba, Google&rsquo;s Griffin, Microsoft&rsquo;s Phi-4-mini-flash-reasoning: all hybrid architectures, all in production.</p>
<p>Something shifted. For years, the Transformer was the only architecture that mattered for language. Now several major AI labs are replacing most of their Transformer layers with SSM layers and reporting competitive or better results at lower inference cost. If you deploy models, manage GPU clusters, or care about inference latency, this is worth understanding deeply.</p>
<p>This post builds State Space Models from zero. I start with the simplest possible differential equation: one variable, one parameter. From there, I build up to the full SSM formulation, explain the key breakthroughs (HiPPO, S4), and walk through the three generations of Mamba. By the end, you will understand the math well enough to explain to your team why these hybrid architectures are winning, what trade-offs they make, and what it means for your inference stack.</p>
<h2 id="part-1-why-ssms-the-transformers-inference-problem">Part 1: Why SSMs? The Transformer&rsquo;s Inference Problem</h2>
<p>You already know the Transformer&rsquo;s self-attention mechanism scales quadratically with sequence length: $O(L^2)$ in both time and memory. But the pain runs deeper than asymptotic notation.</p>
<p>During autoregressive decoding, the Transformer generates one token at a time. For each new token, it must load the model weights and the entire KV cache from GPU HBM into SRAM, compute an attention score against every previous token, and write the new KV entry back. The GPU spends the vast majority of its time moving data, not computing. On an H100 generating tokens from a 70B model at small batch sizes, the Tensor Cores that deliver 989 TFLOPS of BF16 matmul sit almost entirely idle during decoding. The bottleneck is memory bandwidth, not compute.</p>
<p>This is why you need PagedAttention to manage fragmented KV cache memory. This is why vLLM exists: to batch requests efficiently despite variable KV cache sizes. This is why context windows beyond 128K tokens start requiring multi-GPU setups just to hold the KV cache.</p>
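<p>To put a number on that last claim, here is a back-of-the-envelope sketch of KV cache size. The shape parameters are assumptions, chosen to resemble a Llama-2-70B-style model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, 2-byte elements); they are not taken from any model discussed above, and <code>kv_cache_bytes</code> is a hypothetical helper.</p>

```python
# Back-of-the-envelope KV cache size. All shape numbers are assumptions
# (a Llama-2-70B-style config with grouped-query attention).

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence of seq_len tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * seq_len

GIB = 1024 ** 3
per_token_kib = kv_cache_bytes(1) / 1024
cache_128k_gib = kv_cache_bytes(128 * 1024) / GIB

print(f"{per_token_kib:.0f} KiB per token")        # 320 KiB
print(f"{cache_128k_gib:.0f} GiB at 128K tokens")  # 40 GiB
```

<p>Under these assumptions, a single 128K-token sequence needs about 40 GiB for its cache alone, roughly half of an 80 GB H100 before any weights are loaded. The cache grows linearly with sequence length, and every decoded token has to read all of it.</p>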
<p>State Space Models offer a fundamentally different deal. Instead of caching every token&rsquo;s key-value pair (lossless but expensive), they compress the entire sequence history into a fixed-size hidden state (lossy but cheap). Processing each new token during inference takes $O(1)$ time and memory with respect to sequence length. No growing KV cache. No PagedAttention. Constant memory per sequence regardless of whether you have processed 100 tokens or 100,000.</p>
<p>The question has always been whether a compressed, lossy state can match the quality of the Transformer&rsquo;s lossless KV cache. For years, the answer was no. SSMs excelled on audio, time series, and synthetic long-range benchmarks, but they lagged on language. The Mamba line of work changed that. To understand how, we need to start from scratch.</p>
<h2 id="part-2-state-space-models-from-scratch">Part 2: State Space Models from Scratch</h2>
<h3 id="a-single-differential-equation">A Single Differential Equation</h3>
<p>Forget matrices, vectors, and neural networks for a moment. Start with a single number $h(t)$ that changes over time:</p>
$$h'(t) = a \cdot h(t)$$<p>$h'(t)$ is the rate of change of $h$ at time $t$. If you know calculus, this is the derivative. If not, think of it as: &ldquo;how fast is $h$ changing right now?&rdquo; When $h'(t)$ is positive, $h$ is increasing. When negative, $h$ is decreasing. When zero, $h$ is holding steady.</p>
<p>The constant $a$ controls everything:</p>
<ul>
<li>$a > 0$: $h$ grows exponentially. Think compound interest. A bank account with interest rate $a = 0.05$ earns interest on its interest, accelerating upward forever. Unstable.</li>
<li>$a < 0$: $h$ decays exponentially. Think radioactive decay. A substance with decay rate $a = -0.5$ loses half its remaining mass roughly every 1.4 time units. The more you have, the faster it drains, but it never quite reaches zero. Stable.</li>
<li>$a = 0$: nothing happens. $h$ is constant forever.</li>
</ul>
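<p>The three cases follow from the closed-form solution $h(t) = h(0) \cdot e^{a t}$. The sketch below checks the half-life figure quoted for $a = -0.5$ and compares the closed form against a naive numerical integration. The function names and the step size <code>dt</code> are illustrative choices, not part of any library.</p>

```python
import math

def h(t, a, h0=1.0):
    """Closed-form solution of h'(t) = a * h(t) with h(0) = h0."""
    return h0 * math.exp(a * t)

def half_life(a):
    """Time for h to halve when a < 0: solve exp(a * t) = 1/2."""
    if a >= 0:
        raise ValueError("half-life is only defined for a < 0")
    return math.log(2) / -a

def euler(a, t_end, dt=1e-3, h0=1.0):
    """Integrate h' = a * h with forward Euler, as a sanity check."""
    state = h0
    for _ in range(round(t_end / dt)):
        state += dt * a * state
    return state

t_half = half_life(-0.5)
print(f"half-life for a = -0.5: {t_half:.3f}")   # 1.386
print(f"h(t_half) = {h(t_half, -0.5):.3f}")      # 0.500
print(f"Euler vs exact at t = 2: {euler(-0.5, 2.0):.4f} vs {h(2.0, -0.5):.4f}")
```

<p>The half-life is $\ln 2 / |a|$, which is where the &ldquo;roughly every 1.4 time units&rdquo; comes from.</p>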


<div class="ssm-eigenvalue-decay" id="ssm-ed-774ca06e5bc03651567b333d58d39a0f">
  <style>
    .ssm-eigenvalue-decay {
      --ed-bg: #0d1117;
      --ed-surface: #161b22;
      --ed-border: #30363d;
      --ed-text: #e6edf3;
      --ed-text-muted: #8b949e;
      --ed-blue: #58a6ff;
      --ed-red: #f97583;
      --ed-gray: #8b949e;
      --ed-green: #39d353;
      --ed-axis: #484f58;
      --ed-grid: rgba(255,255,255,0.04);
      --ed-canvas-bg: #0d1117;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
      background: var(--ed-bg);
      color: var(--ed-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssm-eigenvalue-decay,
    :root:not([data-theme="dark"]) .ssm-eigenvalue-decay {
      --ed-bg: #f8fafc;
      --ed-surface: #ffffff;
      --ed-border: #e2e8f0;
      --ed-text: #1e293b;
      --ed-text-muted: #64748b;
      --ed-blue: #3b82f6;
      --ed-red: #ef4444;
      --ed-gray: #94a3b8;
      --ed-green: #10b981;
      --ed-axis: #cbd5e1;
      --ed-grid: rgba(0,0,0,0.04);
      --ed-canvas-bg: #ffffff;
    }

    .ssm-eigenvalue-decay * { box-sizing: border-box; }

    .ed-header {
      text-align: center;
      margin-bottom: 1.25rem;
    }

    .ed-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--ed-blue);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .ed-header p {
      color: var(--ed-text-muted);
      font-size: 0.85rem;
      margin: 0;
    }

    .ed-card {
      background: var(--ed-surface);
      border: 1px solid var(--ed-border);
      border-radius: 10px;
      padding: 1.25rem;
    }

    .ed-canvas-wrap {
      position: relative;
      width: 100%;
      margin-bottom: 1rem;
    }

    .ed-canvas-wrap canvas {
      width: 100%;
      height: 280px;
      display: block;
      border-radius: 8px;
    }

    .ed-controls {
      display: flex;
      align-items: center;
      gap: 1rem;
      flex-wrap: wrap;
    }

    .ed-controls label {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.82rem;
      font-weight: 600;
      color: var(--ed-text);
      white-space: nowrap;
    }

    .ed-controls input[type="range"] {
      flex: 1;
      min-width: 120px;
      height: 6px;
      -webkit-appearance: none;
      appearance: none;
      background: var(--ed-border);
      border-radius: 3px;
      outline: none;
      cursor: pointer;
    }

    .ed-controls input[type="range"]::-webkit-slider-thumb {
      -webkit-appearance: none;
      appearance: none;
      width: 18px;
      height: 18px;
      border-radius: 50%;
      background: var(--ed-blue);
      cursor: pointer;
      border: 2px solid var(--ed-surface);
      box-shadow: 0 1px 4px rgba(0,0,0,0.3);
    }

    .ed-controls input[type="range"]::-moz-range-thumb {
      width: 18px;
      height: 18px;
      border-radius: 50%;
      background: var(--ed-blue);
      cursor: pointer;
      border: 2px solid var(--ed-surface);
      box-shadow: 0 1px 4px rgba(0,0,0,0.3);
    }

    .ed-value {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.9rem;
      font-weight: 700;
      min-width: 52px;
      text-align: right;
    }

    .ed-badge {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.72rem;
      font-weight: 600;
      padding: 3px 10px;
      border-radius: 20px;
      white-space: nowrap;
    }

    .ed-badge.stable {
      background: rgba(88, 166, 255, 0.12);
      color: var(--ed-blue);
    }
    .ed-badge.neutral {
      background: rgba(139, 148, 158, 0.12);
      color: var(--ed-gray);
    }
    .ed-badge.unstable {
      background: rgba(249, 117, 131, 0.12);
      color: var(--ed-red);
    }

    [data-theme="light"] .ed-badge.stable,
    :root:not([data-theme="dark"]) .ed-badge.stable {
      background: rgba(59, 130, 246, 0.1);
    }
    [data-theme="light"] .ed-badge.neutral,
    :root:not([data-theme="dark"]) .ed-badge.neutral {
      background: rgba(148, 163, 184, 0.15);
    }
    [data-theme="light"] .ed-badge.unstable,
    :root:not([data-theme="dark"]) .ed-badge.unstable {
      background: rgba(239, 68, 68, 0.1);
    }
  </style>

  <div class="ed-header">
    <h3>How the parameter a controls state behavior</h3>
    <p>Drag the slider to see h(t) = h(0) · exp(a · t)</p>
  </div>

  <div class="ed-card">
    <div class="ed-canvas-wrap">
      <canvas id="ed-canvas-774ca06e5bc03651567b333d58d39a0f"></canvas>
    </div>
    <div class="ed-controls">
      <label>a =</label>
      <input type="range" id="ed-slider-774ca06e5bc03651567b333d58d39a0f" min="-2" max="2" step="0.1" value="-0.5">
      <span class="ed-value" id="ed-val-774ca06e5bc03651567b333d58d39a0f">-0.5</span>
      <span class="ed-badge stable" id="ed-badge-774ca06e5bc03651567b333d58d39a0f">Stable (decay)</span>
    </div>
  </div>

  <script>
  (function() {
    var uid = '774ca06e5bc03651567b333d58d39a0f';
    var canvas = document.getElementById('ed-canvas-' + uid);
    var slider = document.getElementById('ed-slider-' + uid);
    var valEl = document.getElementById('ed-val-' + uid);
    var badgeEl = document.getElementById('ed-badge-' + uid);
    var ctx = canvas.getContext('2d');

    function getCSS(v) {
      return getComputedStyle(canvas.closest('.ssm-eigenvalue-decay')).getPropertyValue(v).trim();
    }

    function resize() {
      var rect = canvas.parentElement.getBoundingClientRect();
      var dpr = window.devicePixelRatio || 1;
      canvas.width = rect.width * dpr;
      canvas.height = 280 * dpr;
      ctx.setTransform(dpr, 0, 0, dpr, 0, 0);
    }

    function draw() {
      var a = parseFloat(slider.value);
      var w = canvas.width / (window.devicePixelRatio || 1);
      var h = 280;

      var pad = { top: 20, right: 20, bottom: 40, left: 55 };
      var pw = w - pad.left - pad.right;
      var ph = h - pad.top - pad.bottom;

      ctx.clearRect(0, 0, w, h);

      var axisColor = getCSS('--ed-axis');
      var textMuted = getCSS('--ed-text-muted');
      var textColor = getCSS('--ed-text');

      
      var tMax = 5;
      var yMin, yMax;
      if (a > 0) {
        yMax = Math.min(Math.exp(a * tMax), 10);
        yMin = -0.5;
      } else {
        yMax = 1.5;
        yMin = -0.5;
      }

      function tx(t) { return pad.left + (t / tMax) * pw; }
      function ty(y) { return pad.top + (1 - (y - yMin) / (yMax - yMin)) * ph; }

      
      ctx.strokeStyle = getCSS('--ed-grid') || 'rgba(128,128,128,0.08)';
      ctx.lineWidth = 1;
      var yStep = yMax > 3 ? 2 : 0.5;
      for (var gy = Math.ceil(yMin / yStep) * yStep; gy <= yMax; gy += yStep) {
        ctx.beginPath();
        ctx.moveTo(pad.left, ty(gy));
        ctx.lineTo(w - pad.right, ty(gy));
        ctx.stroke();
      }

      
      ctx.strokeStyle = axisColor;
      ctx.lineWidth = 1.5;
      ctx.beginPath();
      ctx.moveTo(pad.left, pad.top);
      ctx.lineTo(pad.left, h - pad.bottom);
      ctx.lineTo(w - pad.right, h - pad.bottom);
      ctx.stroke();

      
      ctx.strokeStyle = axisColor;
      ctx.lineWidth = 1;
      ctx.setLineDash([5, 4]);
      ctx.beginPath();
      var y0 = ty(0);
      if (y0 >= pad.top && y0 <= h - pad.bottom) {
        ctx.moveTo(pad.left, y0);
        ctx.lineTo(w - pad.right, y0);
        ctx.stroke();
      }
      ctx.setLineDash([]);

      
      ctx.fillStyle = textMuted;
      ctx.font = '12px "IBM Plex Mono", "SF Mono", Monaco, monospace';
      ctx.textAlign = 'center';
      ctx.fillText('Time t', pad.left + pw / 2, h - 4);

      ctx.save();
      ctx.translate(14, pad.top + ph / 2);
      ctx.rotate(-Math.PI / 2);
      ctx.textAlign = 'center';
      ctx.fillText('h(t)', 0, 0);
      ctx.restore();

      
      ctx.fillStyle = textMuted;
      ctx.font = '10px "IBM Plex Mono", monospace';
      ctx.textAlign = 'right';
      for (var gy = Math.ceil(yMin / yStep) * yStep; gy <= yMax; gy += yStep) {
        var yy = ty(gy);
        if (yy >= pad.top && yy <= h - pad.bottom) {
          ctx.fillText(gy.toFixed(1), pad.left - 6, yy + 3);
        }
      }

      
      ctx.textAlign = 'center';
      for (var gx = 0; gx <= tMax; gx += 1) {
        ctx.fillText(gx, tx(gx), h - pad.bottom + 16);
      }

      
      var curveColor;
      if (a < 0) curveColor = getCSS('--ed-blue');
      else if (a === 0) curveColor = getCSS('--ed-gray');
      else curveColor = getCSS('--ed-red');

      
      ctx.strokeStyle = curveColor;
      ctx.lineWidth = 2.5;
      ctx.lineJoin = 'round';
      ctx.beginPath();
      var steps = 200;
      for (var i = 0; i <= steps; i++) {
        var t = (i / steps) * tMax;
        var val = Math.exp(a * t);
        var px = tx(t);
        var py = ty(val);
        py = Math.max(pad.top - 5, Math.min(h - pad.bottom + 5, py));
        if (i === 0) ctx.moveTo(px, py);
        else ctx.lineTo(px, py);
      }
      ctx.stroke();

      
      ctx.fillStyle = curveColor;
      ctx.beginPath();
      ctx.arc(tx(0), ty(1), 5, 0, Math.PI * 2);
      ctx.fill();

      ctx.fillStyle = textColor;
      ctx.font = '11px "IBM Plex Mono", monospace';
      ctx.textAlign = 'left';
      ctx.fillText('h(0) = 1', tx(0) + 10, ty(1) + 4);
    }

    function update() {
      var a = parseFloat(slider.value);
      valEl.textContent = a.toFixed(1);

      if (a < 0) {
        badgeEl.textContent = 'Stable (decay)';
        badgeEl.className = 'ed-badge stable';
        valEl.style.color = getCSS('--ed-blue');
      } else if (a === 0) {
        badgeEl.textContent = 'Neutral';
        badgeEl.className = 'ed-badge neutral';
        valEl.style.color = getCSS('--ed-gray');
      } else {
        badgeEl.textContent = 'Unstable (growth)';
        badgeEl.className = 'ed-badge unstable';
        valEl.style.color = getCSS('--ed-red');
      }

      resize();
      draw();
    }

    slider.addEventListener('input', update);
    window.addEventListener('resize', update);
    update();
  })();
  </script>
</div>

<p>For building sequence models, we want $a < 0$. Our hidden state should be a fading memory, not an explosion.</p>
<h3 id="adding-an-input">Adding an Input</h3>
<p>A decaying state by itself is useless. We need to feed information in:</p>
$$h'(t) = a \cdot h(t) + b \cdot x(t)$$<p>Now $x(t)$ is an input signal (think: a stream of token embeddings arriving over time), and $b$ controls how strongly the input drives the state.</p>
<p>Picture a leaky bucket with a tap. The water level $h(t)$ is the state. The hole in the bottom drains water at rate $|a| \cdot h(t)$ (recall that $a < 0$, so the term $a \cdot h(t)$ is negative): the more water in the bucket, the faster it leaks (more pressure = faster drain). The tap $b \cdot x(t)$ pours water in at a rate proportional to the input signal. The water level at any moment reflects a fading, weighted sum of all the water that has ever been poured in, with recent additions contributing more because older ones have partially leaked away.</p>


<div class="ssm-leaky-bucket" id="ssm-lb-774ca06e5bc03651567b333d58d39a0f">
  <style>
    .ssm-leaky-bucket {
      --lb-bg: #0d1117;
      --lb-surface: #161b22;
      --lb-border: #30363d;
      --lb-text: #e6edf3;
      --lb-text-muted: #8b949e;
      --lb-blue: #58a6ff;
      --lb-blue-light: #79c0ff;
      --lb-orange: #d29922;
      --lb-red: #f97583;
      --lb-water-top: #3b82f6;
      --lb-water-bottom: #1d4ed8;
      --lb-bucket-stroke: #8b949e;
      --lb-bucket-fill: rgba(139,148,158,0.06);
      --lb-faucet: #8b949e;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
      background: var(--lb-bg);
      color: var(--lb-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssm-leaky-bucket,
    :root:not([data-theme="dark"]) .ssm-leaky-bucket {
      --lb-bg: #f8fafc;
      --lb-surface: #ffffff;
      --lb-border: #e2e8f0;
      --lb-text: #1e293b;
      --lb-text-muted: #64748b;
      --lb-blue: #3b82f6;
      --lb-blue-light: #60a5fa;
      --lb-orange: #f59e0b;
      --lb-red: #ef4444;
      --lb-water-top: #60a5fa;
      --lb-water-bottom: #3b82f6;
      --lb-bucket-stroke: #94a3b8;
      --lb-bucket-fill: rgba(148,163,184,0.08);
      --lb-faucet: #64748b;
    }

    .ssm-leaky-bucket * { box-sizing: border-box; }

    .lb-header {
      text-align: center;
      margin-bottom: 1.25rem;
    }

    .lb-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--lb-blue);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .lb-header p {
      color: var(--lb-text-muted);
      font-size: 0.85rem;
      margin: 0;
    }

    .lb-card {
      background: var(--lb-surface);
      border: 1px solid var(--lb-border);
      border-radius: 10px;
      padding: 1.25rem;
      display: flex;
      flex-direction: column;
      align-items: center;
    }

    .lb-svg-wrap {
      width: 100%;
      max-width: 400px;
    }

    .lb-svg-wrap svg {
      width: 100%;
      height: auto;
      display: block;
    }

     
    @keyframes lb-water-bob {
      0%, 100% { transform: translateY(0); }
      50% { transform: translateY(-4px); }
    }

    .lb-water-group {
      animation: lb-water-bob 3s ease-in-out infinite;
    }

     
    @keyframes lb-drip {
      0% { opacity: 1; transform: translateY(0); }
      80% { opacity: 1; transform: translateY(28px); }
      100% { opacity: 0; transform: translateY(34px); }
    }

    .lb-drip {
      animation: lb-drip 1.8s ease-in infinite;
    }

    .lb-drip:nth-child(2) {
      animation-delay: 0.6s;
    }

    .lb-drip:nth-child(3) {
      animation-delay: 1.2s;
    }

     
    @keyframes lb-pour-flow {
      0%, 100% { opacity: 0.7; }
      50% { opacity: 1; }
    }

    .lb-pour {
      animation: lb-pour-flow 1.5s ease-in-out infinite;
    }

    .lb-caption {
      text-align: center;
      margin-top: 1rem;
      padding-top: 1rem;
      border-top: 1px solid var(--lb-border);
      color: var(--lb-text-muted);
      font-size: 0.82rem;
      font-style: italic;
      max-width: 400px;
    }

    .lb-label-text {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 11px;
    }

    .lb-label-text-lg {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 13px;
      font-weight: 600;
    }
  </style>

  <div class="lb-header">
    <h3>The Leaky Bucket: Core SSM Intuition</h3>
    <p>State as a water level — inputs pour in, decay leaks out</p>
  </div>

  <div class="lb-card">
    <div class="lb-svg-wrap">
      <svg viewBox="0 0 360 380" xmlns="http://www.w3.org/2000/svg">
        <defs>
          <linearGradient id="lb-water-grad-774ca06e5bc03651567b333d58d39a0f" x1="0" y1="0" x2="0" y2="1">
            <stop offset="0%" stop-color="var(--lb-water-top)" stop-opacity="0.8"/>
            <stop offset="100%" stop-color="var(--lb-water-bottom)" stop-opacity="0.95"/>
          </linearGradient>
          <linearGradient id="lb-pour-grad-774ca06e5bc03651567b333d58d39a0f" x1="0" y1="0" x2="0" y2="1">
            <stop offset="0%" stop-color="var(--lb-blue)" stop-opacity="0.9"/>
            <stop offset="100%" stop-color="var(--lb-blue-light)" stop-opacity="0.6"/>
          </linearGradient>
        </defs>

        
        <rect x="140" y="18" width="80" height="16" rx="4" fill="var(--lb-faucet)"/>
        <rect x="205" y="18" width="15" height="50" rx="3" fill="var(--lb-faucet)"/>
        <rect x="200" y="60" width="25" height="10" rx="3" fill="var(--lb-faucet)"/>

        
        <g class="lb-pour">
          <rect x="208" y="70" width="9" height="60" rx="4"
                fill="url(#lb-pour-grad-774ca06e5bc03651567b333d58d39a0f)"/>
        </g>

        
        <text x="260" y="55" class="lb-label-text" fill="var(--lb-orange)" text-anchor="start" font-weight="600">b · x(t)</text>
        <text x="260" y="70" class="lb-label-text" fill="var(--lb-text-muted)" text-anchor="start">(input)</text>

        
        <path d="M 110 130 L 95 310 L 265 310 L 250 130 Z"
              fill="var(--lb-bucket-fill)"
              stroke="var(--lb-bucket-stroke)"
              stroke-width="3"
              stroke-linejoin="round"/>

        
        <g class="lb-water-group">
          <path d="M 106 200 L 98 305 L 262 305 L 254 200 Z"
                fill="url(#lb-water-grad-774ca06e5bc03651567b333d58d39a0f)"
                opacity="0.9"/>
          
          <path d="M 106 200 Q 135 193, 160 200 Q 185 207, 210 200 Q 235 193, 254 200"
                fill="none" stroke="var(--lb-blue-light)" stroke-width="2" opacity="0.6"/>
        </g>

        
        <text x="180" y="258" class="lb-label-text-lg" fill="white" text-anchor="middle" opacity="0.95">h(t) = state</text>

        
        <rect x="168" y="306" width="24" height="8" rx="3" fill="var(--lb-bucket-stroke)"/>

        
        <g>
          <circle class="lb-drip" cx="180" cy="320" r="4" fill="var(--lb-blue)" opacity="0.8"/>
          <circle class="lb-drip" cx="180" cy="320" r="3.5" fill="var(--lb-blue)" opacity="0.7"/>
          <circle class="lb-drip" cx="180" cy="320" r="3" fill="var(--lb-blue)" opacity="0.6"/>
        </g>

        
        <line x1="180" y1="340" x2="180" y2="368" stroke="var(--lb-red)" stroke-width="2" marker-end="url(#lb-arrow-774ca06e5bc03651567b333d58d39a0f)"/>
        <defs>
          <marker id="lb-arrow-774ca06e5bc03651567b333d58d39a0f" viewBox="0 0 10 10" refX="9" refY="5"
                  markerWidth="7" markerHeight="7" orient="auto-start-reverse">
            <path d="M 0 0 L 10 5 L 0 10 z" fill="var(--lb-red)"/>
          </marker>
        </defs>

        
        <text x="225" y="355" class="lb-label-text" fill="var(--lb-red)" text-anchor="start" font-weight="600">a · h(t)</text>
        <text x="225" y="370" class="lb-label-text" fill="var(--lb-text-muted)" text-anchor="start">(leak / decay)</text>
      </svg>
    </div>

    <div class="lb-caption">
      The water level at any moment = fading, weighted sum of all past inputs
    </div>
  </div>
</div>

<p>This is the core intuition for the entire SSM line of work. The hidden state $h(t)$ is a running, compressed summary of the input history, where old inputs fade at a rate controlled by $a$.</p>
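<p>To make &ldquo;fading summary&rdquo; precise: the state equation has a standard closed-form solution (variation of constants). Assuming the state starts at $h(0)$, and writing $s$ for the time at which a past input arrived (notation introduced here for illustration):</p>
$$h(t) = e^{a t} \, h(0) + \int_0^t e^{a (t - s)} \, b \, x(s) \, ds$$
<p>With $a < 0$, an input that arrived at time $s$ is weighted by $e^{a (t - s)}$, so its contribution shrinks exponentially with its age $t - s$. Differentiating the right-hand side recovers $h'(t) = a \cdot h(t) + b \cdot x(t)$.</p>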
<h3 id="adding-an-output">Adding an Output</h3>
<p>We read out the state with simple scaling:</p>
$$y(t) = c \cdot h(t)$$<p>The output $y(t)$ is just a weighted view of the state. Together, these two equations form the complete scalar SSM:</p>
$$h'(t) = a \cdot h(t) + b \cdot x(t) \quad \text{(state equation)}$$
$$y(t) = c \cdot h(t) \quad \text{(output equation)}$$<p>Three parameters. One input, one hidden state, one output. This is the entire architecture, in its simplest form.</p>
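<p>Here is a minimal simulation of the scalar SSM, offered as a sketch. It uses forward Euler with a small step <code>dt</code> only because that is the simplest way to step a differential equation; the parameter values and the function name <code>simulate_scalar_ssm</code> are illustrative assumptions. Feeding in a constant input shows the bucket settling at the level where inflow equals leak.</p>

```python
def simulate_scalar_ssm(xs, a, b, c, dt, h0=0.0):
    """Forward-Euler simulation of h' = a*h + b*x, y = c*h.

    xs holds one input sample per time step of length dt.
    Returns the list of outputs y, one per step.
    """
    h = h0
    ys = []
    for x in xs:
        h = h + dt * (a * h + b * x)  # state equation
        ys.append(c * h)              # output equation
    return ys

# Illustrative parameters (assumptions): constant input x(t) = 1 for 40 time units.
a, b, c, dt = -0.5, 1.0, 2.0, 0.01
ys = simulate_scalar_ssm([1.0] * 4000, a, b, c, dt)

# Setting h' = 0 with x = 1 gives h = -b/a, so y settles at -c*b/a.
steady_state = -c * b / a
print(f"y after 40 time units: {ys[-1]:.4f}")
print(f"predicted steady state: {steady_state:.4f}")  # 4.0000
```

<p>The output rises quickly at first and then flattens out, which is the leaky bucket filling until the leak balances the tap.</p>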
<h3 id="why-one-bucket-is-not-enough">Why One Bucket Is Not Enough</h3>
<p>A single leaky bucket has one leak rate, which means one timescale of memory. If $a = -0.5$, the state &ldquo;forgets&rdquo; with a half-life of about 1.4 time units. It cannot simultaneously maintain a short-term memory (last few tokens) and a long-term memory (paragraph-level context).</p>
<p>The fix: use $N$ buckets, each with a different leak rate.</p>
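<p>A small sketch of what $N$ independent buckets buy you. The three leak rates are illustrative assumptions, and the update rule (decay each bucket exactly over one step, then add the input) is a simplification chosen to keep the example short. A single impulse is poured into every bucket at once, then left to drain.</p>

```python
import math

# Leak rates for N = 3 independent buckets (illustrative assumptions).
leak_rates = [-0.01, -0.1, -1.0]

def step(state, x, dt, b=1.0):
    """Decay every bucket over one step of length dt, then add the same input."""
    return [math.exp(a * dt) * h + b * x for a, h in zip(leak_rates, state)]

# Feed a single impulse, then let the state decay for 10 time units.
state = step([0.0, 0.0, 0.0], x=1.0, dt=1.0)
for _ in range(10):
    state = step(state, x=0.0, dt=1.0)

for a, h in zip(leak_rates, state):
    t_half = math.log(2) / -a
    print(f"a = {a:>5}: half-life = {t_half:6.2f}, remaining after t = 10: {h:.4f}")
```

<p>After ten time units the slow bucket still holds about 90% of the impulse, the middle one about 37%, and the fast one almost nothing. Read together, the buckets encode not just that an input arrived but roughly how long ago.</p>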


<div class="ssm-multi-timescale" id="ssm-mt-774ca06e5bc03651567b333d58d39a0f">
  <style>
    .ssm-multi-timescale {
      --mt-bg: #0d1117;
      --mt-surface: #161b22;
      --mt-border: #30363d;
      --mt-text: #e6edf3;
      --mt-text-muted: #8b949e;
      --mt-blue: #58a6ff;
      --mt-blue-light: #79c0ff;
      --mt-purple: #a371f7;
      --mt-water-top: #3b82f6;
      --mt-water-bottom: #1d4ed8;
      --mt-bucket-stroke: #8b949e;
      --mt-bucket-fill: rgba(139,148,158,0.06);
      --mt-arrow-color: #8b949e;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
      background: var(--mt-bg);
      color: var(--mt-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssm-multi-timescale,
    :root:not([data-theme="dark"]) .ssm-multi-timescale {
      --mt-bg: #f8fafc;
      --mt-surface: #ffffff;
      --mt-border: #e2e8f0;
      --mt-text: #1e293b;
      --mt-text-muted: #64748b;
      --mt-blue: #3b82f6;
      --mt-blue-light: #60a5fa;
      --mt-purple: #8b5cf6;
      --mt-water-top: #60a5fa;
      --mt-water-bottom: #3b82f6;
      --mt-bucket-stroke: #94a3b8;
      --mt-bucket-fill: rgba(148,163,184,0.08);
      --mt-arrow-color: #94a3b8;
    }

    .ssm-multi-timescale * { box-sizing: border-box; }

    .mt-header {
      text-align: center;
      margin-bottom: 1.25rem;
    }

    .mt-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--mt-purple);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .mt-header p {
      color: var(--mt-text-muted);
      font-size: 0.85rem;
      margin: 0;
    }

    .mt-card {
      background: var(--mt-surface);
      border: 1px solid var(--mt-border);
      border-radius: 10px;
      padding: 1.5rem;
    }

    .mt-layout {
      display: flex;
      align-items: center;
      justify-content: center;
      gap: 1rem;
      flex-wrap: wrap;
    }

    .mt-single {
      display: flex;
      flex-direction: column;
      align-items: center;
      flex-shrink: 0;
    }

    .mt-single-label {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.72rem;
      color: var(--mt-text-muted);
      text-align: center;
      margin-top: 0.5rem;
      max-width: 100px;
    }

    .mt-single-title {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.78rem;
      font-weight: 600;
      color: var(--mt-text);
      margin-bottom: 0.5rem;
      text-align: center;
    }

    .mt-arrow-section {
      display: flex;
      flex-direction: column;
      align-items: center;
      flex-shrink: 0;
      padding: 0 0.5rem;
    }

    .mt-arrow-big {
      font-size: 2rem;
      color: var(--mt-arrow-color);
      line-height: 1;
    }

    .mt-arrow-label {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.68rem;
      color: var(--mt-text-muted);
      text-align: center;
      max-width: 100px;
      margin-top: 0.25rem;
    }

    .mt-multi {
      display: flex;
      gap: 0.75rem;
      flex-wrap: wrap;
      justify-content: center;
    }

    .mt-bucket-item {
      display: flex;
      flex-direction: column;
      align-items: center;
    }

    .mt-bucket-eigenvalue {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.65rem;
      font-weight: 600;
      color: var(--mt-blue);
      text-align: center;
      margin-top: 0.4rem;
      white-space: nowrap;
    }

    .mt-bucket-desc {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.6rem;
      color: var(--mt-text-muted);
      text-align: center;
      margin-top: 0.15rem;
    }

    .mt-bucket-svg {
      display: block;
    }

    @media (max-width: 640px) {
      .mt-layout {
        flex-direction: column;
      }
      .mt-arrow-big {
        transform: rotate(90deg);
      }
      .mt-multi {
        gap: 0.5rem;
      }
    }
  </style>

  <div class="mt-header">
    <h3>Multiple Timescales via Multiple State Dimensions</h3>
    <p>Each dimension has its own decay rate (eigenvalue)</p>
  </div>

  <div class="mt-card">
    <div class="mt-layout">
      
      <div class="mt-single">
        <div class="mt-single-title">1 state dimension</div>
        <svg class="mt-bucket-svg" width="80" height="110" viewBox="0 0 80 110">
          <defs>
            <linearGradient id="mt-wg-s-774ca06e5bc03651567b333d58d39a0f" x1="0" y1="0" x2="0" y2="1">
              <stop offset="0%" stop-color="var(--mt-water-top)" stop-opacity="0.8"/>
              <stop offset="100%" stop-color="var(--mt-water-bottom)" stop-opacity="0.95"/>
            </linearGradient>
          </defs>
          <path d="M 15 10 L 10 95 L 70 95 L 65 10 Z"
                fill="var(--mt-bucket-fill)" stroke="var(--mt-bucket-stroke)" stroke-width="2.5" stroke-linejoin="round"/>
          <path d="M 13 50 L 11 92 L 69 92 L 67 50 Z"
                fill="url(#mt-wg-s-774ca06e5bc03651567b333d58d39a0f)" opacity="0.85"/>
          
          <rect x="34" y="93" width="12" height="5" rx="2" fill="var(--mt-bucket-stroke)"/>
          <circle cx="40" cy="105" r="3" fill="var(--mt-blue)" opacity="0.7"/>
        </svg>
        <div class="mt-single-label">One leak rate = one timescale</div>
      </div>

      
      <div class="mt-arrow-section">
        <div class="mt-arrow-big">→</div>
        <div class="mt-arrow-label">Generalize to N dimensions</div>
      </div>

      
      <div class="mt-multi">
        
        <div class="mt-bucket-item">
          <svg class="mt-bucket-svg" width="64" height="100" viewBox="0 0 64 100">
            <defs>
              <linearGradient id="mt-wg1-774ca06e5bc03651567b333d58d39a0f" x1="0" y1="0" x2="0" y2="1">
                <stop offset="0%" stop-color="var(--mt-water-top)" stop-opacity="0.8"/>
                <stop offset="100%" stop-color="var(--mt-water-bottom)" stop-opacity="0.95"/>
              </linearGradient>
            </defs>
            <path d="M 12 8 L 8 82 L 56 82 L 52 8 Z"
                  fill="var(--mt-bucket-fill)" stroke="var(--mt-bucket-stroke)" stroke-width="2" stroke-linejoin="round"/>
            
            <path d="M 10 22 L 9 79 L 55 79 L 54 22 Z"
                  fill="url(#mt-wg1-774ca06e5bc03651567b333d58d39a0f)" opacity="0.85"/>
            
            <rect x="28" y="80" width="8" height="4" rx="1.5" fill="var(--mt-bucket-stroke)"/>
            <circle cx="32" cy="90" r="2" fill="var(--mt-blue)" opacity="0.5"/>
          </svg>
          <div class="mt-bucket-eigenvalue">λ₁ = −0.01</div>
          <div class="mt-bucket-desc">(long memory)</div>
        </div>

        
        <div class="mt-bucket-item">
          <svg class="mt-bucket-svg" width="64" height="100" viewBox="0 0 64 100">
            <defs>
              <linearGradient id="mt-wg2-774ca06e5bc03651567b333d58d39a0f" x1="0" y1="0" x2="0" y2="1">
                <stop offset="0%" stop-color="var(--mt-water-top)" stop-opacity="0.8"/>
                <stop offset="100%" stop-color="var(--mt-water-bottom)" stop-opacity="0.95"/>
              </linearGradient>
            </defs>
            <path d="M 12 8 L 8 82 L 56 82 L 52 8 Z"
                  fill="var(--mt-bucket-fill)" stroke="var(--mt-bucket-stroke)" stroke-width="2" stroke-linejoin="round"/>
            <path d="M 11 35 L 9 79 L 55 79 L 53 35 Z"
                  fill="url(#mt-wg2-774ca06e5bc03651567b333d58d39a0f)" opacity="0.8"/>
            
            <rect x="26" y="80" width="12" height="4" rx="1.5" fill="var(--mt-bucket-stroke)"/>
            <circle cx="32" cy="90" r="2.5" fill="var(--mt-blue)" opacity="0.6"/>
          </svg>
          <div class="mt-bucket-eigenvalue">λ₂ = −0.1</div>
          <div class="mt-bucket-desc">&nbsp;</div>
        </div>

        
        <div class="mt-bucket-item">
          <svg class="mt-bucket-svg" width="64" height="100" viewBox="0 0 64 100">
            <defs>
              <linearGradient id="mt-wg3-774ca06e5bc03651567b333d58d39a0f" x1="0" y1="0" x2="0" y2="1">
                <stop offset="0%" stop-color="var(--mt-water-top)" stop-opacity="0.8"/>
                <stop offset="100%" stop-color="var(--mt-water-bottom)" stop-opacity="0.95"/>
              </linearGradient>
            </defs>
            <path d="M 12 8 L 8 82 L 56 82 L 52 8 Z"
                  fill="var(--mt-bucket-fill)" stroke="var(--mt-bucket-stroke)" stroke-width="2" stroke-linejoin="round"/>
            <path d="M 11 50 L 9 79 L 55 79 L 53 50 Z"
                  fill="url(#mt-wg3-774ca06e5bc03651567b333d58d39a0f)" opacity="0.75"/>
            
            <rect x="23" y="80" width="18" height="4" rx="1.5" fill="var(--mt-bucket-stroke)"/>
            <circle cx="32" cy="90" r="3" fill="var(--mt-blue)" opacity="0.65"/>
          </svg>
          <div class="mt-bucket-eigenvalue">λ₃ = −0.5</div>
          <div class="mt-bucket-desc">&nbsp;</div>
        </div>

        
        <div class="mt-bucket-item">
          <svg class="mt-bucket-svg" width="64" height="100" viewBox="0 0 64 100">
            <defs>
              <linearGradient id="mt-wg4-774ca06e5bc03651567b333d58d39a0f" x1="0" y1="0" x2="0" y2="1">
                <stop offset="0%" stop-color="var(--mt-water-top)" stop-opacity="0.8"/>
                <stop offset="100%" stop-color="var(--mt-water-bottom)" stop-opacity="0.95"/>
              </linearGradient>
            </defs>
            <path d="M 12 8 L 8 82 L 56 82 L 52 8 Z"
                  fill="var(--mt-bucket-fill)" stroke="var(--mt-bucket-stroke)" stroke-width="2" stroke-linejoin="round"/>
            <path d="M 10 65 L 9 79 L 55 79 L 54 65 Z"
                  fill="url(#mt-wg4-774ca06e5bc03651567b333d58d39a0f)" opacity="0.7"/>
            
            <rect x="19" y="80" width="26" height="5" rx="2" fill="var(--mt-bucket-stroke)"/>
            <circle cx="32" cy="92" r="3.5" fill="var(--mt-blue)" opacity="0.7"/>
          </svg>
          <div class="mt-bucket-eigenvalue">λ₄ = −2.0</div>
          <div class="mt-bucket-desc">(short memory)</div>
        </div>
      </div>
    </div>
  </div>
</div>

<p>This is where scalars become vectors. The scalar state $h(t)$ becomes an $N$-dimensional vector $\mathbf{h}(t) \in \mathbb{R}^N$. The scalar parameters become matrices:</p>
<ul>
<li>$A \in \mathbb{R}^{N \times N}$ (state to state): governs how each of the $N$ state dimensions evolves and potentially interacts with the others. It is $N \times N$ because each state dimension can influence every other state dimension.</li>
<li>$B \in \mathbb{R}^{N \times 1}$ (input to state): fans a scalar input out into $N$ state dimensions. It is $N \times 1$ because it needs to distribute one input value across $N$ state slots. Think of it as an adapter between a narrow input pipe and a wide state vector.</li>
<li>$C \in \mathbb{R}^{1 \times N}$ (state to output): narrows the wide state back down to a scalar output. It is $1 \times N$ because it takes a weighted combination of all $N$ state dimensions to produce one output value.</li>
</ul>
$$\mathbf{h}'(t) = A \cdot \mathbf{h}(t) + B \cdot x(t)$$<p>
</p>
$$y(t) = C \cdot \mathbf{h}(t)$$
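<p>To make the shapes concrete, here is a minimal sketch of these two equations in plain Python. The size $N = 3$, every numeric entry, and the helper names <code>matvec</code>, <code>state_derivative</code>, and <code>output</code> are assumptions chosen for illustration, not values from any trained model.</p>

```python
# Continuous-time SSM with an N-dimensional state, written with plain lists.
# N and all numeric values below are illustrative assumptions.

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

N = 3
A = [[-0.1, 0.0, 0.0],   # N x N: state -> state
     [0.2, -0.5, 0.0],
     [0.0, 0.1, -2.0]]
B = [1.0, 0.5, 0.25]     # N x 1: one scalar input fanned out to N state slots
C = [0.3, 0.3, 0.4]      # 1 x N: N state slots combined into one scalar output

def state_derivative(h, x):
    """h'(t) = A h(t) + B x(t) for a scalar input x."""
    Ah = matvec(A, h)
    return [Ah_i + B_i * x for Ah_i, B_i in zip(Ah, B)]

def output(h):
    """y(t) = C h(t), a weighted sum of the N state dimensions."""
    return sum(C_i * h_i for C_i, h_i in zip(C, h))

h = [1.0, 2.0, 3.0]
dh = state_derivative(h, x=1.0)   # a vector of length N
y = output(h)                     # a single number
```

<p>The off-diagonal entries of <code>A</code> are where one state dimension influences another. The column vector $B$ and row vector $C$ are stored as flat lists here, since with a scalar input and output there is nothing to gain from a second axis.</p>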

<div class="ssm-matrix-dimensions" id="ssm-md-774ca06e5bc03651567b333d58d39a0f">
  <style>
    .ssm-matrix-dimensions {
      --md-bg: #0d1117;
      --md-surface: #161b22;
      --md-border: #30363d;
      --md-text: #e6edf3;
      --md-text-muted: #8b949e;
      --md-orange: #d29922;
      --md-orange-bg: rgba(210,153,34,0.1);
      --md-blue: #58a6ff;
      --md-blue-bg: rgba(88,166,255,0.1);
      --md-green: #39d353;
      --md-green-bg: rgba(57,211,83,0.1);
      --md-purple: #a371f7;
      --md-purple-bg: rgba(163,113,247,0.1);
      --md-arrow: #484f58;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
      background: var(--md-bg);
      color: var(--md-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssm-matrix-dimensions,
    :root:not([data-theme="dark"]) .ssm-matrix-dimensions {
      --md-bg: #f8fafc;
      --md-surface: #ffffff;
      --md-border: #e2e8f0;
      --md-text: #1e293b;
      --md-text-muted: #64748b;
      --md-orange: #ea580c;
      --md-orange-bg: rgba(234,88,12,0.08);
      --md-blue: #3b82f6;
      --md-blue-bg: rgba(59,130,246,0.08);
      --md-green: #16a34a;
      --md-green-bg: rgba(22,163,74,0.08);
      --md-purple: #8b5cf6;
      --md-purple-bg: rgba(139,92,246,0.08);
      --md-arrow: #cbd5e1;
    }

    .ssm-matrix-dimensions * { box-sizing: border-box; }

    .md-header {
      text-align: center;
      margin-bottom: 1.25rem;
    }

    .md-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--md-blue);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .md-header p {
      color: var(--md-text-muted);
      font-size: 0.85rem;
      margin: 0;
    }

    .md-card {
      background: var(--md-surface);
      border: 1px solid var(--md-border);
      border-radius: 10px;
      padding: 1.5rem;
      overflow-x: auto;
    }

    .md-pipeline {
      display: flex;
      align-items: center;
      justify-content: center;
      gap: 0;
      min-width: 700px;
    }

     
    .md-pipe {
      display: flex;
      flex-direction: column;
      align-items: center;
      justify-content: center;
      position: relative;
    }

    .md-pipe-bar {
      border-radius: 4px;
    }

    .md-pipe-narrow {
      width: 60px;
    }

    .md-pipe-narrow .md-pipe-bar {
      height: 12px;
      width: 100%;
    }

    .md-pipe-wide {
      width: 80px;
    }

    .md-pipe-wide .md-pipe-bar {
      height: 36px;
      width: 100%;
    }

    .md-pipe-label {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.62rem;
      color: var(--md-text-muted);
      text-align: center;
      margin-top: 6px;
      white-space: nowrap;
      line-height: 1.3;
    }

    .md-pipe-label strong {
      display: block;
      font-size: 0.7rem;
      color: var(--md-text);
      font-weight: 600;
    }

     
    .md-matrix-box {
      display: flex;
      flex-direction: column;
      align-items: center;
      justify-content: center;
      padding: 10px 14px;
      border-radius: 8px;
      border: 2px solid;
      min-width: 80px;
      text-align: center;
    }

    .md-matrix-box .md-mat-name {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.82rem;
      font-weight: 700;
    }

    .md-matrix-box .md-mat-sub {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.62rem;
      margin-top: 2px;
      opacity: 0.8;
    }

    .md-box-orange {
      border-color: var(--md-orange);
      background: var(--md-orange-bg);
    }
    .md-box-orange .md-mat-name { color: var(--md-orange); }
    .md-box-orange .md-mat-sub { color: var(--md-orange); }

    .md-box-purple {
      border-color: var(--md-purple);
      background: var(--md-purple-bg);
    }
    .md-box-purple .md-mat-name { color: var(--md-purple); }
    .md-box-purple .md-mat-sub { color: var(--md-purple); }

    .md-box-green {
      border-color: var(--md-green);
      background: var(--md-green-bg);
    }
    .md-box-green .md-mat-name { color: var(--md-green); }
    .md-box-green .md-mat-sub { color: var(--md-green); }

     
    .md-flow-arrow {
      display: flex;
      align-items: center;
      justify-content: center;
      width: 28px;
      font-size: 1.1rem;
      color: var(--md-arrow);
      flex-shrink: 0;
    }

    @media (max-width: 760px) {
      .md-pipeline {
        min-width: 0;
        flex-direction: column;
        gap: 0.25rem;
      }
      .md-pipe-narrow, .md-pipe-wide {
        width: auto;
      }
      .md-pipe-narrow .md-pipe-bar {
        height: 8px;
        width: 40px;
      }
      .md-pipe-wide .md-pipe-bar {
        height: 8px;
        width: 80px;
      }
      .md-flow-arrow {
        transform: rotate(90deg);
        width: auto;
        height: 24px;
      }
    }
  </style>

  <div class="md-header">
    <h3>SSM as a Pipeline: How Dimensions Flow</h3>
    <p>Input fans out to N-dimensional state, then narrows back to output</p>
  </div>

  <div class="md-card">
    <div class="md-pipeline">
      
      <div class="md-pipe md-pipe-narrow">
        <div class="md-pipe-bar" style="background: var(--md-orange);"></div>
        <div class="md-pipe-label">
          <strong>x(t) ∈ ℝ</strong>
          scalar input
        </div>
      </div>

      <div class="md-flow-arrow">→</div>

      
      <div class="md-matrix-box md-box-orange">
        <div class="md-mat-name">B ∈ ℝᴺˣ¹</div>
        <div class="md-mat-sub">fans out</div>
      </div>

      <div class="md-flow-arrow">→</div>

      
      <div class="md-pipe md-pipe-wide">
        <div class="md-pipe-bar" style="background: var(--md-blue);"></div>
        <div class="md-pipe-label">
          <strong>h(t) ∈ ℝᴺ</strong>
          N-dim state
        </div>
      </div>

      <div class="md-flow-arrow">→</div>

      
      <div class="md-matrix-box md-box-purple">
        <div class="md-mat-name">A ∈ ℝᴺˣᴺ</div>
        <div class="md-mat-sub">state dynamics</div>
      </div>

      <div class="md-flow-arrow">→</div>

      
      <div class="md-pipe md-pipe-wide">
        <div class="md-pipe-bar" style="background: var(--md-blue);"></div>
        <div class="md-pipe-label">
          <strong>h(t) ∈ ℝᴺ</strong>
          N-dim state
        </div>
      </div>

      <div class="md-flow-arrow">→</div>

      
      <div class="md-matrix-box md-box-green">
        <div class="md-mat-name">C ∈ ℝ¹ˣᴺ</div>
        <div class="md-mat-sub">narrows back</div>
      </div>

      <div class="md-flow-arrow">→</div>

      
      <div class="md-pipe md-pipe-narrow">
        <div class="md-pipe-bar" style="background: var(--md-green);"></div>
        <div class="md-pipe-label">
          <strong>y(t) ∈ ℝ</strong>
          scalar output
        </div>
      </div>
    </div>
  </div>
</div>

<p>In practice, $A$ is almost always <strong>diagonal</strong>. A diagonal $A$ means each state dimension evolves independently, with no cross-talk between buckets: dimension 1 decays at its own rate, dimension 2 at its own rate, and so on. Empirically, this simplification performs about as well as a full matrix (the S4D paper showed this) and is much cheaper to compute.</p>
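<p>The cost saving is easy to see in code. The sketch below compares the full matrix-vector product with the diagonal shortcut. The four decay rates are the illustrative values from the bucket figure, and the variable names are my own.</p>

```python
# Diagonal A: the matrix-vector product collapses to an elementwise product.
# The decay rates are the illustrative values from the bucket figure.

lam = [-0.01, -0.1, -0.5, -2.0]          # diagonal entries of A
N = len(lam)

# The same A written out as a full N x N matrix, zeros off the diagonal.
A_full = [[lam[i] if i == j else 0.0 for j in range(N)] for i in range(N)]

h = [1.0, 1.0, 1.0, 1.0]

# Full matrix-vector product: N * N multiplications.
Ah_full = [sum(A_full[i][j] * h[j] for j in range(N)) for i in range(N)]

# Diagonal shortcut: N multiplications, one per state dimension.
Ah_diag = [lam_i * h_i for lam_i, h_i in zip(lam, h)]
```

<p>Both lists hold the same numbers, but the diagonal version never touches the zeros, so its cost grows as $N$ rather than $N^2$.</p>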
<h3 id="eigenvalues-the-retention-rates">Eigenvalues: The Retention Rates</h3>
<p>For a diagonal $A$, the diagonal entries ARE the eigenvalues. No linear algebra required to understand this. Each eigenvalue $\lambda_i$ is simply the decay rate of one state dimension. Think of them as $N$ different bank account interest rates running simultaneously:</p>
<ul>
<li>$\lambda_i = -0.01$: very slow decay. This dimension remembers inputs from thousands of timesteps ago. It is the long-term savings account.</li>
<li>$\lambda_i = -0.5$: moderate decay. This dimension tracks information over dozens of timesteps.</li>
<li>$\lambda_i = -2.0$: fast decay. This dimension mostly tracks the last few inputs. It is the checking account that turns over quickly.</li>
<li>$\lambda_i > 0$: growth. Unstable. The state explodes. We never want this.</li>
</ul>
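<p>These decay rates can be checked directly. With no input, a single state dimension obeys $h'(t) = \lambda h(t)$, whose solution is $h(t) = h(0)\,e^{\lambda t}$. The sketch below uses that solution. The function names are hypothetical, and time is measured in continuous units, so the number of tokens it corresponds to depends on the step size.</p>

```python
import math

# With no input, one state dimension obeys h'(t) = lam * h(t),
# whose solution is h(t) = h(0) * exp(lam * t).

def remaining(lam, t):
    """Fraction of the initial state still present after continuous time t."""
    return math.exp(lam * t)

def half_life(lam):
    """Time for the state to fall to half its value (requires lam < 0)."""
    return math.log(2) / -lam

for lam in (-0.01, -0.5, -2.0):
    print(f"lam={lam:>5}: half-life={half_life(lam):7.2f}, "
          f"left after t=10: {remaining(lam, 10):.4f}")
```

<p>Passing a positive value to <code>remaining</code> returns a number greater than 1 that keeps growing with <code>t</code>, which is the instability described in the last bullet.</p>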
<p>By having $N$ state dimensions with different eigenvalues, the model simultaneously maintains memory at multiple timescales. Some dimensions track recent tokens (large negative eigenvalues, fast decay), others preserve long-range context (small negative eigenvalues, slow decay).</p>
<p>This is why the initialization of $A$ matters enormously. If you set eigenvalues randomly, you get random memory timescales, and the model struggles to learn useful representations of sequences. Random eigenvalues might cluster all your memory in one timescale, leaving gaps in others. This is exactly the problem that HiPPO solved. But we need to cover discretization first.</p>
<h2 id="part-3-discretization-making-it-computable">Part 3: Discretization, Making It Computable</h2>
<h3 id="why-discretize">Why Discretize</h3>
<p>The continuous ODE $\mathbf{h}'(t) = A\mathbf{h}(t) + Bx(t)$ processes smooth, continuous signals. But an LLM does not receive continuous signals. It receives a discrete sequence of tokens, one after another. We need to convert the continuous dynamics into a step-by-step recurrence: given the previous state and the current token, compute the next state.</p>
<h3 id="the-step-size">The Step Size $\Delta$</h3>
<p>Discretization introduces a learnable parameter $\Delta$ that controls &ldquo;how much continuous time passes between tokens.&rdquo; Small $\Delta$ means the model takes fine-grained steps, preserving detailed temporal structure. Large $\Delta$ means coarse steps, compressing more time into each token. Each channel in the model can learn its own $\Delta$, so different parts of the network can operate at different temporal resolutions.</p>
<h3 id="the-euler-discretization">The Euler Discretization</h3>
<p>The simplest approach: approximate the derivative as constant over each timestep, following the <a href="https://en.wikipedia.org/wiki/Euler_method">Euler method</a> from numerical analysis. This gives us discrete parameters:</p>
$$\bar{A} = I + \Delta A$$<p>
</p>
$$\bar{B} = \Delta B$$<p>This is first-order accurate, with <a href="https://en.wikipedia.org/wiki/Truncation_error_(numerical_integration)">local truncation error</a> $O(\Delta^2)$. There are more accurate methods (<a href="https://en.wikipedia.org/wiki/Zero-order_hold">zero-order hold</a>, <a href="https://en.wikipedia.org/wiki/Bilinear_transform">bilinear transform</a>), but Euler is the one that matters for the Mamba story because Mamba-3 directly improves on it.</p>
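<p>To see what first-order accuracy costs, the sketch below compares Euler with zero-order hold on a scalar system. The zero-order-hold formulas used here, $\bar{a} = e^{\Delta a}$ and $\bar{b} = \frac{e^{\Delta a} - 1}{a}\, b$, are the standard scalar case and are not derived in this post. The function names are my own.</p>

```python
import math

def euler(a, b, delta):
    """Forward Euler: a_bar = 1 + delta*a, b_bar = delta*b."""
    return 1.0 + delta * a, delta * b

def zero_order_hold(a, b, delta):
    """Exact discretization when the input is held constant over each step.
    Scalar case only, and it assumes a != 0."""
    a_bar = math.exp(delta * a)
    return a_bar, (a_bar - 1.0) / a * b

a, b = -0.5, 1.0
for delta in (0.1, 1.0, 5.0):
    ea, _ = euler(a, b, delta)
    za, _ = zero_order_hold(a, b, delta)
    print(f"delta={delta}: Euler a_bar={ea:+.4f}  ZOH a_bar={za:+.4f}")
```

<p>For small $\Delta$ the two agree closely. For large $\Delta$ they diverge: at $\Delta = 5$ Euler gives $\bar{a} = -1.5$, whose magnitude exceeds 1, so the discrete state grows even though the continuous system decays. For $a < 0$, Euler stays stable only while $\Delta < 2/|a|$.</p>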
<h3 id="the-discrete-recurrence">The Discrete Recurrence</h3>
<p>The discretized system gives us a step-by-step formula. At each timestep $k$:</p>
$$\mathbf{h}_k = \bar{A} \cdot \mathbf{h}_{k-1} + \bar{B} \cdot x_k$$<p>
</p>
$$y_k = C \cdot \mathbf{h}_k$$<p>This is now a simple step-by-step formula. Given the previous state and the current input, compute the next state and output. No calculus needed at runtime.</p>
<h3 id="numerical-walkthrough">Numerical Walkthrough</h3>
<p>Let me make this concrete. Take a scalar system with $a = -0.5$, $b = 1.0$, $c = 1.0$, and step size $\Delta = 0.1$.</p>
$$\bar{a} = 1 + (-0.5)(0.1) = 0.95 \qquad \bar{b} = (0.1)(1.0) = 0.1$$<p>Now run 5 timesteps with the input sequence $[1, 1, 0, 0, 1]$, starting from a zero state before the first input ($h_{-1} = 0$):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># a_bar = 0.95, b_bar = 0.1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># k=0: input=1  h = 0.95 * 0      + 0.1 * 1 = 0.1000   y = 0.1000</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># k=1: input=1  h = 0.95 * 0.1    + 0.1 * 1 = 0.1950   y = 0.1950</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># k=2: input=0  h = 0.95 * 0.195  + 0.1 * 0 = 0.1853   y = 0.1853</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># k=3: input=0  h = 0.95 * 0.1853 + 0.1 * 0 = 0.1760   y = 0.1760</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># k=4: input=1  h = 0.95 * 0.176  + 0.1 * 1 = 0.2672   y = 0.2672</span>
</span></span></code></pre></div><p>The state accumulates when inputs arrive (steps 0-1, step 4) and decays when they stop (steps 2-3). You can verify every number with a calculator. There is nothing hidden in the SSM recurrence: it is a multiply-and-add, repeated.</p>
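<p>If you would rather let the machine do the arithmetic, the same walkthrough can be run as code. This is a minimal sketch: the function name <code>run_ssm</code> and its signature are my own, and the parameters are the ones from the walkthrough.</p>

```python
def run_ssm(a_bar, b_bar, c, inputs, h_init=0.0):
    """Scalar recurrence h_k = a_bar*h_{k-1} + b_bar*x_k, y_k = c*h_k."""
    h = h_init
    states, outputs = [], []
    for x in inputs:
        h = a_bar * h + b_bar * x   # multiply-and-add
        states.append(h)
        outputs.append(c * h)
    return states, outputs

a, b, c, delta = -0.5, 1.0, 1.0, 0.1
a_bar, b_bar = 1 + delta * a, delta * b      # Euler discretization

states, outputs = run_ssm(a_bar, b_bar, c, [1, 1, 0, 0, 1])
for k, (h, y) in enumerate(zip(states, outputs)):
    print(f"k={k}: h = {h:.4f}   y = {y:.4f}")
```

<p>The printed values should agree with the table above, up to rounding in the last digit.</p>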
<h3 id="the-dual-computation-modes">The Dual Computation Modes</h3>
<p>This is the SSM&rsquo;s defining superpower. In the formulation above, $A$, $B$, and $C$ are constants that do not change over time. This property is called Linear Time-Invariance (LTI), and it unlocks something powerful. Because the recurrence $\mathbf{h}_k = \bar{A}\mathbf{h}_{k-1} + \bar{B}x_k$ is linear with fixed parameters, we can unroll it algebraically:</p>
$$\mathbf{h}_k = \bar{A}^k \bar{B} x_0 + \bar{A}^{k-1}\bar{B}x_1 + \cdots + \bar{B}x_k$$<p>(This assumes the state starts at zero.) The output $y_k = C\mathbf{h}_k = \sum_{j=0}^{k} K_j\, x_{k-j}$ is then a weighted sum of all past inputs, with weights $K = (C\bar{B},\ C\bar{A}\bar{B},\ C\bar{A}^2\bar{B},\ \ldots)$. This sequence of weights is a convolution kernel.</p>
<p>This means we can compute the output of the SSM in two completely different ways:</p>
<p><strong>Training mode (convolution):</strong> Compute the kernel $K$ once, then convolve it with the entire input sequence via FFT in $O(L \log L)$. Fully parallel, like a CNN. The GPU processes all $L$ tokens simultaneously.</p>
<p><strong>Inference mode (recurrence):</strong> Step through $\mathbf{h}_k = \bar{A}\mathbf{h}_{k-1} + \bar{B}x_k$ one token at a time in $O(1)$ per step. The state $\mathbf{h}$ is a fixed-size vector regardless of how many tokens have been processed. No KV cache.</p>
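<p>The equivalence of the two modes can be checked on the scalar system from the walkthrough. In this sketch the convolution is written as a direct $O(L^2)$ sum for readability rather than with an FFT, and the function names are hypothetical.</p>

```python
def kernel(a_bar, b_bar, c, length):
    """K = (c*b_bar, c*a_bar*b_bar, c*a_bar^2*b_bar, ...) for a scalar SSM."""
    return [c * (a_bar ** j) * b_bar for j in range(length)]

def conv_mode(K, xs):
    """Causal convolution y_k = sum_j K[j] * x[k-j], written as a direct sum.
    A real implementation would use an FFT; this O(L^2) loop is for clarity."""
    return [sum(K[j] * xs[k - j] for j in range(k + 1)) for k in range(len(xs))]

def recurrent_mode(a_bar, b_bar, c, xs):
    """Step the state one input at a time, keeping only the latest h."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys

a_bar, b_bar, c = 0.95, 0.1, 1.0
xs = [1, 1, 0, 0, 1]
y_conv = conv_mode(kernel(a_bar, b_bar, c, len(xs)), xs)
y_rec = recurrent_mode(a_bar, b_bar, c, xs)
```

<p>Here <code>y_conv</code> and <code>y_rec</code> agree up to floating-point rounding. Every entry of <code>y_conv</code> can be computed independently of the others, which is what makes the convolution mode parallel. <code>recurrent_mode</code> carries a single number <code>h</code> from step to step, no matter how long <code>xs</code> is.</p>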


<div class="ssm-dual-modes" id="ssm-dm-774ca06e5bc03651567b333d58d39a0f">
  <style>
    .ssm-dual-modes {
      --dm-bg: #0d1117;
      --dm-surface: #161b22;
      --dm-border: #30363d;
      --dm-text: #e6edf3;
      --dm-text-muted: #8b949e;
      --dm-blue: #58a6ff;
      --dm-blue-bg: rgba(88,166,255,0.08);
      --dm-blue-border: rgba(88,166,255,0.25);
      --dm-green: #39d353;
      --dm-green-bg: rgba(57,211,83,0.08);
      --dm-green-border: rgba(57,211,83,0.25);
      --dm-purple: #a371f7;
      --dm-gold: #e3b341;
      --dm-gold-bg: rgba(227,179,65,0.12);
      --dm-badge-text: #0d1117;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
      background: var(--dm-bg);
      color: var(--dm-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssm-dual-modes,
    :root:not([data-theme="dark"]) .ssm-dual-modes {
      --dm-bg: #f8fafc;
      --dm-surface: #ffffff;
      --dm-border: #e2e8f0;
      --dm-text: #1e293b;
      --dm-text-muted: #64748b;
      --dm-blue: #3b82f6;
      --dm-blue-bg: rgba(59,130,246,0.05);
      --dm-blue-border: rgba(59,130,246,0.2);
      --dm-green: #16a34a;
      --dm-green-bg: rgba(22,163,74,0.05);
      --dm-green-border: rgba(22,163,74,0.2);
      --dm-purple: #8b5cf6;
      --dm-gold: #d97706;
      --dm-gold-bg: rgba(217,119,6,0.1);
      --dm-badge-text: #ffffff;
    }

    .ssm-dual-modes * { box-sizing: border-box; }

    .dm-header {
      text-align: center;
      margin-bottom: 1.25rem;
    }

    .dm-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--dm-purple);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .dm-header p {
      color: var(--dm-text-muted);
      font-size: 0.85rem;
      margin: 0;
    }

    .dm-card {
      background: var(--dm-surface);
      border: 1px solid var(--dm-border);
      border-radius: 10px;
      overflow: hidden;
    }

    .dm-panels {
      display: grid;
      grid-template-columns: 1fr 1fr;
    }

    @media (max-width: 640px) {
      .dm-panels {
        grid-template-columns: 1fr;
      }
    }

    .dm-panel {
      padding: 1.5rem;
      position: relative;
    }

    .dm-panel-left {
      background: var(--dm-blue-bg);
      border-right: 1px solid var(--dm-border);
    }

    .dm-panel-right {
      background: var(--dm-green-bg);
    }

    @media (max-width: 640px) {
      .dm-panel-left {
        border-right: none;
        border-bottom: 1px solid var(--dm-border);
      }
    }

    .dm-mode-badge {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.6rem;
      font-weight: 700;
      letter-spacing: 0.1em;
      text-transform: uppercase;
      padding: 3px 10px;
      border-radius: 20px;
      display: inline-block;
      margin-bottom: 0.75rem;
    }

    .dm-mode-badge-blue {
      background: var(--dm-blue);
      color: var(--dm-badge-text);
    }

    .dm-mode-badge-green {
      background: var(--dm-green);
      color: var(--dm-badge-text);
    }

    .dm-panel-title {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.95rem;
      font-weight: 700;
      margin: 0 0 1rem 0;
      color: var(--dm-text);
    }

     
    .dm-tokens {
      display: flex;
      gap: 4px;
      margin-bottom: 0.75rem;
      flex-wrap: wrap;
    }

    .dm-token {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.68rem;
      font-weight: 600;
      padding: 4px 7px;
      border-radius: 5px;
      border: 1.5px solid;
      white-space: nowrap;
    }

    .dm-token-blue {
      border-color: var(--dm-blue-border);
      color: var(--dm-blue);
      background: var(--dm-blue-bg);
    }

    .dm-token-green {
      border-color: var(--dm-green-border);
      color: var(--dm-green);
      background: var(--dm-green-bg);
    }

    .dm-token-muted {
      border-color: var(--dm-border);
      color: var(--dm-text-muted);
    }

     
    .dm-diagram {
      background: var(--dm-surface);
      border: 1px solid var(--dm-border);
      border-radius: 8px;
      padding: 0.75rem;
      margin-bottom: 0.75rem;
      min-height: 80px;
      display: flex;
      flex-direction: column;
      align-items: center;
      justify-content: center;
      gap: 0.4rem;
    }

    .dm-diagram-row {
      display: flex;
      align-items: center;
      gap: 6px;
      flex-wrap: wrap;
      justify-content: center;
    }

    .dm-dia-text {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.72rem;
      color: var(--dm-text-muted);
    }

    .dm-dia-text strong {
      color: var(--dm-text);
    }

    .dm-dia-arrow {
      font-size: 0.9rem;
      color: var(--dm-text-muted);
    }

    .dm-kernel-bar {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.65rem;
      font-weight: 600;
      padding: 3px 10px;
      border-radius: 4px;
      display: inline-block;
    }

    .dm-kernel-blue {
      background: var(--dm-blue-bg);
      color: var(--dm-blue);
      border: 1px solid var(--dm-blue-border);
    }

    .dm-state-box {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.68rem;
      font-weight: 600;
      padding: 6px 12px;
      border-radius: 6px;
      display: inline-block;
    }

    .dm-state-green {
      background: var(--dm-green-bg);
      color: var(--dm-green);
      border: 1.5px solid var(--dm-green-border);
    }

     
    .dm-parallel-arrows {
      display: flex;
      gap: 3px;
      margin: 0.25rem 0;
    }

    .dm-parallel-arrows span {
      color: var(--dm-blue);
      font-size: 0.8rem;
    }

     
    .dm-seq-flow {
      display: flex;
      align-items: center;
      gap: 4px;
      flex-wrap: wrap;
      justify-content: center;
    }

    .dm-seq-flow .dm-seq-step {
      display: flex;
      align-items: center;
      gap: 3px;
    }

     
    .dm-complexity {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.75rem;
      font-weight: 700;
      padding: 4px 12px;
      border-radius: 20px;
      display: inline-block;
      margin-bottom: 0.5rem;
    }

    .dm-complexity-blue {
      background: var(--dm-blue);
      color: var(--dm-badge-text);
    }

    .dm-complexity-green {
      background: var(--dm-green);
      color: var(--dm-badge-text);
    }

    .dm-keypoint {
      font-size: 0.78rem;
      color: var(--dm-text-muted);
      margin: 0;
      font-style: italic;
    }

     
    .dm-banner {
      text-align: center;
      padding: 0.75rem 1rem;
      background: var(--dm-gold-bg);
      border-top: 1px solid var(--dm-border);
    }

    .dm-banner p {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.82rem;
      font-weight: 700;
      color: var(--dm-gold);
      margin: 0;
      letter-spacing: 0.02em;
    }
  </style>

  <div class="dm-header">
    <h3>The Dual Computation Modes</h3>
    <p>Same model, two ways to compute — optimized for each phase</p>
  </div>

  <div class="dm-card">
    <div class="dm-panels">
      
      <div class="dm-panel dm-panel-left">
        <span class="dm-mode-badge dm-mode-badge-blue">Training Mode</span>
        <div class="dm-panel-title">Convolution (Parallel)</div>

        <div class="dm-tokens">
          <span class="dm-token dm-token-blue">x₁</span>
          <span class="dm-token dm-token-blue">x₂</span>
          <span class="dm-token dm-token-blue">x₃</span>
          <span class="dm-token dm-token-muted">...</span>
          <span class="dm-token dm-token-blue">x_L</span>
        </div>

        <div class="dm-diagram">
          <div class="dm-parallel-arrows">
            <span>↓</span><span>↓</span><span>↓</span><span>↓</span><span>↓</span>
          </div>
          <div class="dm-diagram-row">
            <span class="dm-kernel-bar dm-kernel-blue">K̄ (conv kernel)</span>
          </div>
          <div class="dm-diagram-row">
            <span class="dm-dia-text"><strong>y</strong> = x * K̄</span>
            <span class="dm-dia-text">(via FFT)</span>
          </div>
          <div class="dm-parallel-arrows">
            <span>↓</span><span>↓</span><span>↓</span><span>↓</span><span>↓</span>
          </div>
        </div>

        <div style="text-align:center; margin-bottom: 0.5rem;">
          <span class="dm-complexity dm-complexity-blue">O(L log L)</span>
        </div>

        <p class="dm-keypoint">All tokens processed simultaneously</p>
      </div>

      
      <div class="dm-panel dm-panel-right">
        <span class="dm-mode-badge dm-mode-badge-green">Inference Mode</span>
        <div class="dm-panel-title">Recurrence (Sequential)</div>

        <div class="dm-tokens">
          <span class="dm-token dm-token-green">x₁</span>
          <span class="dm-dia-arrow">→</span>
          <span class="dm-token dm-token-green">x₂</span>
          <span class="dm-dia-arrow">→</span>
          <span class="dm-token dm-token-green">x₃</span>
          <span class="dm-dia-arrow">→</span>
          <span class="dm-token dm-token-muted">...</span>
        </div>

        <div class="dm-diagram">
          <div class="dm-seq-flow">
            <div class="dm-seq-step">
              <span class="dm-token dm-token-green" style="font-size:0.62rem; padding: 3px 5px;">x_t</span>
              <span class="dm-dia-arrow">→</span>
            </div>
            <span class="dm-state-box dm-state-green">h(t)</span>
            <div class="dm-seq-step">
              <span class="dm-dia-arrow">→</span>
              <span class="dm-token dm-token-green" style="font-size:0.62rem; padding: 3px 5px;">y_t</span>
            </div>
          </div>
          <div class="dm-diagram-row" style="margin-top:0.3rem;">
            <span class="dm-dia-text">h(t) = Āh(t−1) + B̄x(t)</span>
          </div>
          <div class="dm-diagram-row">
            <span class="dm-dia-text" style="font-size:0.65rem;">state stays fixed-size N</span>
          </div>
        </div>

        <div style="text-align:center; margin-bottom: 0.5rem;">
          <span class="dm-complexity dm-complexity-green">O(1) per token</span>
        </div>

        <p class="dm-keypoint">Fixed-size state, no KV cache needed</p>
      </div>
    </div>

    <div class="dm-banner">
      <p>Train like a CNN, infer like an RNN</p>
    </div>
  </div>
</div>

<p>&ldquo;Train like a CNN, infer like an RNN.&rdquo; This is the fundamental efficiency proposition. During training, you get the parallelism of convolutions. During inference, you get the constant-time, constant-memory decoding of RNNs, without the KV cache that makes Transformer inference expensive.</p>
<p>This duality is only possible because the system is LTI: the parameters $A$, $B$, $C$ are fixed, so the same convolution kernel $K$ applies to every input. When parameters become input-dependent (which is what Mamba does), there is no single kernel for the whole sequence. The duality breaks, and new algorithms are needed.</p>
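<p>One way to see the break: a fixed kernel implies superposition, meaning the response to a sum of two inputs equals the sum of the two responses. The sketch below checks that property for an LTI recurrence and for a recurrence whose step size depends on the input. The input-dependent rule in <code>selective_scan</code> is invented purely for illustration and is not Mamba&rsquo;s actual parameterization.</p>

```python
def lti_scan(xs, a_bar=0.95, b_bar=0.1):
    """Fixed parameters: every step uses the same a_bar and b_bar."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(h)
    return ys

def selective_scan(xs, a=-0.5):
    """Hypothetical input-dependent step size: delta grows with |x|.
    This rule is invented for illustration; it is not Mamba's parameterization."""
    h, ys = 0.0, []
    for x in xs:
        delta = 0.1 * (1.0 + abs(x))
        h = (1.0 + delta * a) * h + delta * x   # Euler step with a per-token delta
        ys.append(h)
    return ys

def is_linear(scan, u, v):
    """Check superposition: scan(u + v) == scan(u) + scan(v)."""
    both = scan([p + q for p, q in zip(u, v)])
    summed = [p + q for p, q in zip(scan(u), scan(v))]
    return all(abs(p - q) < 1e-12 for p, q in zip(both, summed))

u = [1.0, 0.0, 0.0, 0.0]
v = [0.0, 2.0, 0.0, 0.0]
print(is_linear(lti_scan, u, v), is_linear(selective_scan, u, v))
```

<p>The LTI recurrence passes the check and the input-dependent one fails it. Once superposition fails, no single kernel $K$ can reproduce the output for every input, so the FFT shortcut is no longer available.</p>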
<h2 id="part-4-hippo-the-initialization-that-made-ssms-work">Part 4: HiPPO, The Initialization That Made SSMs Work</h2>
<p>Before HiPPO, SSMs initialized the state matrix $A$ randomly. Random eigenvalues produce random memory timescales. On Sequential MNIST (classifying a handwritten digit fed one pixel at a time, 784 steps), random initialization achieved about 60% accuracy: well above the 10% chance level for ten classes, but far from solving the task.</p>
<p>Albert Gu&rsquo;s HiPPO framework (2020) solved this by deriving $A$ matrices from a mathematical objective: at every timestep, the state should store the <strong>best polynomial approximation of the entire input history</strong>. Each state dimension corresponds to one polynomial coefficient, with low-order coefficients capturing broad trends (long-range memory) and high-order coefficients capturing fine details (short-range memory). The resulting $A$ matrix has eigenvalues arranged to cover multiple timescales without redundancy.</p>
<p>The concrete impact: switching from random $A$ to HiPPO improved Sequential MNIST from 60% to 98%. Same architecture, same training. Only the initialization of $A$ changed.</p>
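<p>For a feel of what such a matrix looks like, here is a sketch that builds the HiPPO-LegS variant. The formula is the one I recall from the S4 paper: $A_{nk} = -\sqrt{(2n+1)(2k+1)}$ for $n > k$, $-(n+1)$ for $n = k$, and $0$ for $n < k$. Treat it as an assumption and check it against the paper before relying on it. The function name <code>hippo_legs</code> is my own.</p>

```python
import math

def hippo_legs(N):
    """HiPPO-LegS state matrix (formula as recalled from the S4 paper;
    treat it as an assumption and verify against the paper)."""
    A = [[0.0] * N for _ in range(N)]
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n][k] = -math.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                A[n][k] = -(n + 1.0)
            # entries with n < k stay zero
    return A

A = hippo_legs(4)
# A is lower triangular, so its eigenvalues are its diagonal entries.
eigenvalues = [A[n][n] for n in range(4)]
```

<p>Because the matrix is lower triangular, its eigenvalues can be read off the diagonal: $-1, -2, \ldots, -N$, an evenly spread set of decay rates with no two the same, rather than a random cluster.</p>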
<h2 id="part-5-s4-and-s4d-making-ssms-practical">Part 5: S4 and S4D, Making SSMs Practical</h2>
<p>S4 (Gu, Goel, and Ré, 2022) was the first architecture to make deep SSMs work at scale, by finding an efficient algorithm to compute the convolution kernel from a HiPPO-initialized $A$ matrix. It was the first model to solve Path-X, the Long Range Arena task with sequence length 16,384, which no Transformer or RNN baseline had solved at the time. S4 also fully exploited the recurrent-convolutional duality: convolution mode for training, recurrence mode for inference.</p>
<p>A key simplification followed quickly. S4D (2022) showed that restricting $A$ to a <strong>fully diagonal</strong> matrix matched S4&rsquo;s performance while dramatically simplifying the implementation. Independent state dimensions with well-chosen eigenvalues were sufficient. This diagonal restriction became the standard for all subsequent work, including Mamba.</p>
<h2 id="part-6-mamba-1-selectivity-changes-everything">Part 6: Mamba-1, Selectivity Changes Everything</h2>
<h3 id="the-lti-problem">The LTI Problem</h3>
<p>S4 and its variants excelled on continuous signals and synthetic long-range benchmarks. On language modeling, they consistently lagged behind Transformers of the same size.</p>
<p>The reason is exactly the LTI limitation I described earlier. In an LTI system, the matrices $A$, $B$, $C$ are fixed constants. Every token receives identical treatment. The Mamba paper demonstrated this failure precisely with two diagnostic tasks:</p>
<p><strong>Selective Copying</strong>: Given &ldquo;A B _ _ _ C _ _ A _ _&rdquo;, copy only A, B, C while ignoring the padding underscores. An LTI system cannot distinguish content tokens from padding because it applies the same transformation to everything.</p>
<p><strong>Induction Heads</strong>: Given &ldquo;A B &hellip; A ?&rdquo;, recall that B followed A earlier and predict B. This requires content-based lookup: comparing the current token (A) against stored tokens to find what came after it. An LTI system has no mechanism for content comparison.</p>
<p>Language is full of these patterns. The word &ldquo;not&rdquo; should be remembered differently from the word &ldquo;the.&rdquo; A name mentioned once in a document needs to be retrievable later. All of this requires the model to make content-dependent decisions about what to store and what to forget.</p>
<h3 id="the-fix-input-dependent-parameters">The Fix: Input-Dependent Parameters</h3>
<p>The December 2023 paper &ldquo;Mamba: Linear-Time Sequence Modeling with Selective State Spaces&rdquo; by Albert Gu and Tri Dao introduced a single, elegant idea: make $B$, $C$, and $\Delta$ functions of the input token.</p>
$$B_t = \text{Linear}(x_t) \in \mathbb{R}^N$$<p>
</p>
$$C_t = \text{Linear}(x_t) \in \mathbb{R}^N$$<p>
</p>
$$\Delta_t = \text{softplus}(\text{Linear}(x_t)) \in \mathbb{R}^+$$<p>Each $\text{Linear}$ here is a separate learned projection. Now the model modulates its behavior on a per-token basis. When the network encounters an important token, it can predict a large $\Delta_t$ to reset the state and absorb the new information. When it encounters filler, it can predict a tiny $\Delta_t$ to preserve existing memory, so the filler is barely written into the state.</p>
<p>The roles of each parameter are clear. $\Delta$ controls the gate: large $\Delta$ resets the state and focuses on the current input; small $\Delta$ persists the state and ignores the current input. $B$ controls what enters the state (content-based filtering of what to remember). $C$ controls what exits (content-based modulation of what to read out).</p>
<p>Note that the state matrix $A$ itself is not input-dependent: it is still a learned parameter, but it is shared across all tokens. This is intentional. $A$ affects the discrete recurrence only through its interaction with $\Delta$ via $\bar{A} = \exp(\Delta A)$, so making $\Delta$ input-dependent is enough to make the effective transition $\bar{A}_t$ input-dependent as well.</p>
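<p>The gating role of $\Delta$ can be seen in a single recurrence step. The sketch below assumes one channel, a diagonal negative $A$, and the simplified input scaling $\bar{B} = \Delta B$; <code>selective_step</code> is a hypothetical helper, not the fused kernel used in practice.</p>

```python
import numpy as np

def selective_step(h, x_t, A, delta_t, B_t, C_t):
    """One step of a single-channel selective SSM with diagonal A (entries negative)."""
    A_bar = np.exp(delta_t * A)   # per-dimension decay, always in (0, 1)
    B_bar = delta_t * B_t         # simplified input scaling (assumption)
    h_new = A_bar * h + B_bar * x_t
    return h_new, np.dot(C_t, h_new)

A = -np.ones(4)
h = np.ones(4)                    # pretend this is memory accumulated so far
B_t, C_t, x_t = np.ones(4), np.ones(4), 5.0

h_small, _ = selective_step(h, x_t, A, 1e-3, B_t, C_t)  # tiny delta: keep memory
h_large, _ = selective_step(h, x_t, A, 10.0, B_t, C_t)  # large delta: overwrite
print(h_small, h_large)
```

<p>With a tiny $\Delta_t$ the state barely moves and the input is almost ignored. With a large $\Delta_t$ the old state is multiplied by nearly zero and the new state is dominated by the current input.</p>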
<h3 id="the-cost-convolution-mode-breaks">The Cost: Convolution Mode Breaks</h3>
<p>Input-dependent parameters mean the system is no longer LTI. $\bar{A}_t$, $B_t$, and $C_t$ change at every timestep, so there is no single convolution kernel $K$ that describes the entire sequence. The FFT-based parallel training mode is gone.</p>
<p>Naively, this forces sequential computation: process token 1 to get $h_1$, then token 2 to get $h_2$, and so on. This would be catastrophically slow on GPUs, which need parallel workloads to achieve decent utilization.</p>
<p>Mamba-1 solved this with a hardware-aware selective scan algorithm, directly inspired by FlashAttention&rsquo;s approach to the memory hierarchy. The key idea: fuse all SSM operations (discretization, recurrence, output) into a single GPU kernel that runs entirely in SRAM, avoiding expensive HBM round-trips. The recurrence is parallelized with a parallel scan, which works because composing the per-step updates $h \mapsto \bar{A}_t h + \bar{B}_t x_t$ is an associative operation, and intermediate states are recomputed in the backward pass rather than stored. The result: up to 40x faster than a standard scan implementation, with the same memory footprint as an optimized Transformer with FlashAttention.</p>
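<p>To see why a scan can parallelize a recurrence, write each step as a pair $(a_t, b_t)$ standing for the map $h \mapsto a_t h + b_t$. Composing two such maps gives another pair, and that composition is associative. The sketch below uses a simple Hillis-Steele scan in NumPy on a scalar recurrence; this is an illustrative choice, not the fused, work-efficient GPU kernel described above, and the function names are hypothetical.</p>

```python
import numpy as np

def combine(left, right):
    """Compose h -> a1*h + b1 (applied first) with h -> a2*h + b2 (applied second)."""
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_0 = 0, one step at a time."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def parallel_scan(a, b):
    """Hillis-Steele inclusive scan: about log2(L) rounds, each round fully vectorized."""
    a, b = a.astype(float).copy(), b.astype(float).copy()
    L, shift = len(a), 1
    while shift < L:
        # Position t absorbs the partial result that ends at position t - shift.
        a_new, b_new = combine((a[:-shift], b[:-shift]), (a[shift:], b[shift:]))
        a[shift:], b[shift:] = a_new, b_new
        shift *= 2
    return b  # applying the accumulated map to h_0 = 0 leaves just the offset

rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 1.0, size=37), rng.normal(size=37)
print(np.allclose(sequential_scan(a, b), parallel_scan(a, b)))
```

<p>Every round touches all positions at once, so the sequential depth drops from $L$ steps to roughly $\log_2 L$ rounds. This variant does more total work than the sequential loop; work-efficient scans avoid that overhead at the cost of a more involved schedule.</p>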
<h3 id="the-mamba-block">The Mamba Block</h3>
<p>A common misconception is that Mamba replaces only the attention layer. It replaces both attention AND the MLP. A standard Transformer decoder block has two sub-layers: multi-head self-attention (which mixes information across sequence positions) and a feed-forward network (which mixes information across feature channels). The Mamba block handles both in a single, unified structure.</p>

<div class="ssm-transformer-vs-mamba" id="tvm-774ca06e5bc03651567b333d58d39a0f">
  <style>
    .ssm-transformer-vs-mamba {
      --tvm-bg: #ffffff;
      --tvm-card-bg: #ffffff;
      --tvm-border: #e2e8f0;
      --tvm-text: #1a202c;
      --tvm-text-muted: #718096;
      --tvm-shadow: 0 12px 30px rgba(0,0,0,0.06);
      --tvm-purple: #7c3aed;
      --tvm-purple-light: #ede9fe;
      --tvm-purple-border: #c4b5fd;
      --tvm-purple-bg: #f5f3ff;
      --tvm-teal: #0d9488;
      --tvm-teal-light: #ccfbf1;
      --tvm-teal-border: #99f6e4;
      --tvm-teal-bg: #f0fdfa;
      --tvm-arrow: #94a3b8;
      --tvm-residual: #cbd5e1;
      --tvm-annotation-bg: #fffbeb;
      --tvm-annotation-border: #fde68a;
      --tvm-annotation-text: #92400e;
      --tvm-mono: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      --tvm-sans: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;

      font-family: var(--tvm-sans);
      background: var(--tvm-card-bg);
      color: var(--tvm-text);
      border-radius: 16px;
      box-shadow: var(--tvm-shadow);
      padding: 24px;
      margin: 32px auto;
      max-width: 1000px;
      line-height: 1.6;
    }

    [data-theme="dark"] .ssm-transformer-vs-mamba {
      --tvm-bg: #1a1b2e;
      --tvm-card-bg: #1e1f33;
      --tvm-border: #2d2f45;
      --tvm-text: #e2e8f0;
      --tvm-text-muted: #94a3b8;
      --tvm-shadow: 0 12px 30px rgba(0,0,0,0.3);
      --tvm-purple: #a78bfa;
      --tvm-purple-light: rgba(167,139,250,0.12);
      --tvm-purple-border: rgba(167,139,250,0.3);
      --tvm-purple-bg: rgba(167,139,250,0.06);
      --tvm-teal: #5eead4;
      --tvm-teal-light: rgba(94,234,212,0.12);
      --tvm-teal-border: rgba(94,234,212,0.3);
      --tvm-teal-bg: rgba(94,234,212,0.06);
      --tvm-arrow: #475569;
      --tvm-residual: #475569;
      --tvm-annotation-bg: rgba(251,191,36,0.08);
      --tvm-annotation-border: rgba(251,191,36,0.3);
      --tvm-annotation-text: #fbbf24;
    }

    .ssm-transformer-vs-mamba * { box-sizing: border-box; }

    .tvm-header {
      text-align: center;
      margin-bottom: 28px;
    }

    .tvm-header h3 {
      font-size: 24px;
      font-weight: 700;
      color: var(--tvm-text);
      margin: 0 0 6px 0;
    }

    .tvm-header p {
      font-size: 14px;
      color: var(--tvm-text-muted);
      margin: 0;
    }

    .tvm-columns {
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 24px;
    }

    @media (max-width: 700px) {
      .tvm-columns { grid-template-columns: 1fr; }
    }

    .tvm-col {
      border-radius: 12px;
      padding: 20px;
      position: relative;
    }

    .tvm-col--transformer {
      background: var(--tvm-purple-bg);
      border: 1px solid var(--tvm-purple-border);
    }

    .tvm-col--mamba {
      background: var(--tvm-teal-bg);
      border: 1px solid var(--tvm-teal-border);
    }

    .tvm-col-title {
      font-family: var(--tvm-mono);
      font-size: 13px;
      font-weight: 600;
      letter-spacing: 0.06em;
      text-transform: uppercase;
      text-align: center;
      margin: 0 0 18px 0;
    }

    .tvm-col--transformer .tvm-col-title { color: var(--tvm-purple); }
    .tvm-col--mamba .tvm-col-title { color: var(--tvm-teal); }

     
    .tvm-flow {
      display: flex;
      flex-direction: column;
      align-items: center;
      gap: 0;
      position: relative;
    }

    .tvm-block {
      width: 100%;
      max-width: 260px;
      text-align: center;
      padding: 10px 12px;
      border-radius: 8px;
      font-size: 13px;
      font-weight: 600;
      position: relative;
      z-index: 2;
    }

    .tvm-block--io {
      background: var(--tvm-border);
      color: var(--tvm-text);
      border: 1px solid var(--tvm-border);
      font-family: var(--tvm-mono);
      font-size: 12px;
      padding: 7px 12px;
    }

    .tvm-block--purple {
      background: var(--tvm-purple-light);
      color: var(--tvm-purple);
      border: 1.5px solid var(--tvm-purple-border);
    }

    .tvm-block--teal {
      background: var(--tvm-teal-light);
      color: var(--tvm-teal);
      border: 1.5px solid var(--tvm-teal-border);
    }

    .tvm-block--accent {
      border-width: 2px;
      box-shadow: 0 2px 8px rgba(0,0,0,0.06);
    }

    .tvm-block-label {
      font-family: var(--tvm-mono);
      font-size: 10px;
      font-weight: 500;
      opacity: 0.7;
      margin-top: 2px;
      letter-spacing: 0.03em;
    }

     
    .tvm-arrow {
      width: 2px;
      height: 18px;
      background: var(--tvm-arrow);
      position: relative;
      z-index: 1;
    }

    .tvm-arrow::after {
      content: '';
      position: absolute;
      bottom: -4px;
      left: 50%;
      transform: translateX(-50%);
      width: 0;
      height: 0;
      border-left: 4px solid transparent;
      border-right: 4px solid transparent;
      border-top: 5px solid var(--tvm-arrow);
    }

     
    .tvm-fork-label {
      font-family: var(--tvm-mono);
      font-size: 11px;
      font-weight: 500;
      color: var(--tvm-text-muted);
      text-align: center;
      margin: 2px 0;
    }

    .tvm-branches {
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 10px;
      width: 100%;
      max-width: 280px;
    }

    .tvm-branch {
      display: flex;
      flex-direction: column;
      align-items: center;
      gap: 0;
    }

    .tvm-branch .tvm-block {
      max-width: 130px;
      font-size: 11px;
      padding: 7px 6px;
    }

    .tvm-branch .tvm-arrow {
      height: 12px;
    }

    .tvm-merge-symbol {
      width: 36px;
      height: 36px;
      border-radius: 50%;
      display: flex;
      align-items: center;
      justify-content: center;
      font-size: 16px;
      font-weight: 700;
      z-index: 2;
    }

    .tvm-merge-symbol--teal {
      background: var(--tvm-teal-light);
      color: var(--tvm-teal);
      border: 1.5px solid var(--tvm-teal-border);
    }

     
    .tvm-residual-group {
      position: relative;
      width: 100%;
      display: flex;
      flex-direction: column;
      align-items: center;
    }

    .tvm-residual-line {
      position: absolute;
      right: 4px;
      top: 0;
      bottom: 0;
      width: 2px;
      background: var(--tvm-residual);
      z-index: 0;
    }

    .tvm-residual-line::before,
    .tvm-residual-line::after {
      content: '';
      position: absolute;
      right: 0;
      width: 14px;
      height: 2px;
      background: var(--tvm-residual);
    }

    .tvm-residual-line::before { top: 0; }
    .tvm-residual-line::after { bottom: 0; }

    .tvm-add-circle {
      width: 28px;
      height: 28px;
      border-radius: 50%;
      display: flex;
      align-items: center;
      justify-content: center;
      font-size: 16px;
      font-weight: 700;
      z-index: 2;
    }

    .tvm-add-circle--purple {
      background: var(--tvm-purple-light);
      color: var(--tvm-purple);
      border: 1.5px solid var(--tvm-purple-border);
    }

     
    .tvm-annotation {
      margin-top: 20px;
      padding: 12px 16px;
      border-radius: 8px;
      background: var(--tvm-annotation-bg);
      border: 1px solid var(--tvm-annotation-border);
      text-align: center;
      font-size: 13px;
      font-weight: 600;
      color: var(--tvm-annotation-text);
    }
  </style>

  <div class="tvm-header">
    <h3>Transformer vs Mamba: Block Architecture</h3>
    <p>Same job — sequence mixing + channel mixing — different structure</p>
  </div>

  <div class="tvm-columns">
    
    <div class="tvm-col tvm-col--transformer">
      <h4 class="tvm-col-title">Transformer Decoder Block</h4>
      <div class="tvm-flow">
        <div class="tvm-block tvm-block--io">Input</div>
        <div class="tvm-arrow"></div>

        
        <div class="tvm-residual-group">
          <div class="tvm-residual-line"></div>
          <div class="tvm-block tvm-block--purple" style="margin-bottom:0">LayerNorm</div>
          <div class="tvm-arrow"></div>
          <div class="tvm-block tvm-block--purple tvm-block--accent">
            Multi-Head Self-Attention
            <div class="tvm-block-label">Sequence Mixing</div>
          </div>
          <div class="tvm-arrow"></div>
          <div class="tvm-add-circle tvm-add-circle--purple">+</div>
        </div>

        <div class="tvm-arrow"></div>

        
        <div class="tvm-residual-group">
          <div class="tvm-residual-line"></div>
          <div class="tvm-block tvm-block--purple" style="margin-bottom:0">LayerNorm</div>
          <div class="tvm-arrow"></div>
          <div class="tvm-block tvm-block--purple tvm-block--accent">
            Feed-Forward Network / MLP
            <div class="tvm-block-label">Channel Mixing</div>
          </div>
          <div class="tvm-arrow"></div>
          <div class="tvm-add-circle tvm-add-circle--purple">+</div>
        </div>

        <div class="tvm-arrow"></div>
        <div class="tvm-block tvm-block--io">Output</div>
      </div>
    </div>

    
    <div class="tvm-col tvm-col--mamba">
      <h4 class="tvm-col-title">Mamba Block</h4>
      <div class="tvm-flow">
        <div class="tvm-block tvm-block--io">Input</div>
        <div class="tvm-arrow"></div>

        <div class="tvm-residual-group">
          <div class="tvm-residual-line" style="background:var(--tvm-teal-border)"></div>

          <div class="tvm-block tvm-block--teal">LayerNorm</div>
          <div class="tvm-arrow"></div>
          <div class="tvm-block tvm-block--teal">Linear Projection (expand)</div>
          <div class="tvm-arrow"></div>

          <div class="tvm-fork-label">split into two branches</div>

          <div class="tvm-branches">
            <div class="tvm-branch">
              <div class="tvm-block tvm-block--teal">Conv1d</div>
              <div class="tvm-arrow"></div>
              <div class="tvm-block tvm-block--teal">SiLU</div>
              <div class="tvm-arrow"></div>
              <div class="tvm-block tvm-block--teal tvm-block--accent">
                Selective SSM
                <div class="tvm-block-label">Seq. Mixing</div>
              </div>
            </div>
            <div class="tvm-branch">
              <div class="tvm-block tvm-block--teal tvm-block--accent" style="margin-top:auto">
                SiLU
                <div class="tvm-block-label">Gate / Ch. Mixing</div>
              </div>
            </div>
          </div>

          <div class="tvm-arrow" style="height:10px"></div>
          <div class="tvm-merge-symbol tvm-merge-symbol--teal">&otimes;</div>
          <div class="tvm-arrow"></div>
          <div class="tvm-block tvm-block--teal">Linear Projection</div>
          <div class="tvm-arrow"></div>
          <div class="tvm-add-circle tvm-add-circle--purple" style="background:var(--tvm-teal-light);color:var(--tvm-teal);border-color:var(--tvm-teal-border)">+</div>
        </div>

        <div class="tvm-arrow"></div>
        <div class="tvm-block tvm-block--io">Output</div>
      </div>
    </div>
  </div>

  <div class="tvm-annotation">
    Mamba replaces <em>both</em> attention and MLP in a single block
  </div>
</div>

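<p>Here is how the Mamba block works. The input has shape $B \times L \times D$ (batch, sequence length, model dimension; this $B$ is the batch size, not the SSM input matrix). It passes through a LayerNorm and is linearly projected up to $2ED$ features, where $E = 2$ is the expansion factor. This expanded representation is then split into two parallel branches of width $ED$ each:</p>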
<p><strong>The SSM branch</strong> (left): Processes through a short 1D causal convolution (width 4) to capture immediate local patterns between neighboring tokens. Then a SiLU activation. Then three parallel linear projections produce the token-specific $\Delta_t$, $B_t$, and $C_t$. The selective SSM recurrence runs using these dynamic parameters. This branch handles sequence mixing: how information flows across token positions.</p>
<p><strong>The gate branch</strong> (right): Takes the other half of the expanded input and passes it through a SiLU activation. This branch serves as a dynamic gate that controls which channels of the SSM output are passed through and which are suppressed.</p>
<p>The two branches merge via element-wise multiplication. If elements in the gating vector are near zero, the corresponding SSM information is suppressed. The result passes through a linear projection back to dimension $D$ and is added to the input via a residual connection.</p>
<p>The entire block is one homogeneous module. No separate attention layer. No separate MLP.</p>
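<p>The data flow described above can be sketched as a plain NumPy forward pass for a single sequence. This is a simplified reference under stated assumptions: a LayerNorm without learned scale and bias, a per-channel $\Delta$, full-rank projections, and a sequential loop in place of the fused scan. All weight names in the dictionary are hypothetical.</p>

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))

def causal_conv1d(u, w):
    """Depthwise causal conv. u: (L, C), w: (K, C); output at t sees u[t-K+1 .. t]."""
    K = w.shape[0]
    padded = np.vstack([np.zeros((K - 1, u.shape[1])), u])
    return sum(padded[k : k + len(u)] * w[k] for k in range(K))

def mamba_block(x, p):
    """x: (L, D) for one sequence. p: dict of weights with hypothetical names."""
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    z = (x - mu) / np.sqrt(var + 1e-5)              # LayerNorm without affine params
    u, gate = np.split(z @ p["W_in"], 2, axis=-1)   # two branches, each (L, E*D)
    u = silu(causal_conv1d(u, p["conv_w"]))         # SSM branch: conv, then SiLU
    delta = softplus(u @ p["W_delta"])              # (L, E*D), positive step sizes
    B, C = u @ p["W_B"], u @ p["W_C"]               # (L, N) each
    h = np.zeros(p["A"].shape)                      # (E*D, N) recurrent state
    y = np.empty_like(u)
    for t in range(len(u)):                         # sequential reference scan
        A_bar = np.exp(delta[t][:, None] * p["A"])
        h = A_bar * h + (delta[t] * u[t])[:, None] * B[t][None, :]
        y[t] = h @ C[t]
    return x + (y * silu(gate)) @ p["W_out"]        # gate, project to D, residual

def random_params(D, E, N, rng, conv_width=4):
    """Small random weights, for illustration only."""
    ED = E * D
    return {
        "W_in": 0.1 * rng.normal(size=(D, 2 * ED)),
        "conv_w": 0.1 * rng.normal(size=(conv_width, ED)),
        "W_delta": 0.1 * rng.normal(size=(ED, ED)),
        "W_B": 0.1 * rng.normal(size=(ED, N)),
        "W_C": 0.1 * rng.normal(size=(ED, N)),
        "A": -np.exp(rng.normal(size=(ED, N))),     # negative, so states decay
        "W_out": 0.1 * rng.normal(size=(ED, D)),
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
print(mamba_block(x, random_params(D=4, E=2, N=3, rng=rng)).shape)
```

<p>Sequence mixing happens only in the convolution and the recurrence loop; everything else acts on each token independently. That is why a single block can stand in for both the attention and MLP sub-layers.</p>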
<h3 id="inference-the-fundamental-trade-off">Inference: The Fundamental Trade-off</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left"></th>
          <th style="text-align: left">Transformer</th>
          <th style="text-align: left">Mamba-1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">State per sequence</td>
          <td style="text-align: left">KV cache: grows with each token</td>
          <td style="text-align: left">Hidden state: fixed-size vector</td>
      </tr>
      <tr>
          <td style="text-align: left">Memory complexity</td>
          <td style="text-align: left">$O(L)$ per sequence</td>
          <td style="text-align: left">$O(1)$ per sequence</td>
      </tr>
      <tr>
          <td style="text-align: left">Compute per new token</td>
          <td style="text-align: left">$O(L)$: attend to all previous tokens</td>
          <td style="text-align: left">$O(1)$: one state update</td>
      </tr>
      <tr>
          <td style="text-align: left">At 128K context</td>
          <td style="text-align: left">~200x larger than Mamba state</td>
          <td style="text-align: left">~2.6 MiB per sequence</td>
      </tr>
      <tr>
          <td style="text-align: left">Memory type</td>
          <td style="text-align: left">Lossless: any past token retrievable</td>
          <td style="text-align: left">Lossy: compressed summary</td>
      </tr>
  </tbody>
</table>

<div class="ssm-kv-cache-vs-state" id="kvc-774ca06e5bc03651567b333d58d39a0f">
  <style>
    .ssm-kv-cache-vs-state {
      --kvc-bg: #ffffff;
      --kvc-card-bg: #ffffff;
      --kvc-border: #e2e8f0;
      --kvc-text: #1a202c;
      --kvc-text-muted: #718096;
      --kvc-shadow: 0 12px 30px rgba(0,0,0,0.06);
      --kvc-red: #ef4444;
      --kvc-red-light: #fef2f2;
      --kvc-red-border: #fecaca;
      --kvc-red-bar: #f87171;
      --kvc-orange: #f59e0b;
      --kvc-green: #10b981;
      --kvc-green-light: #ecfdf5;
      --kvc-green-border: #a7f3d0;
      --kvc-green-bar: #34d399;
      --kvc-green-glow: rgba(16,185,129,0.3);
      --kvc-meter-bg: #f1f5f9;
      --kvc-mono: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      --kvc-sans: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;

      font-family: var(--kvc-sans);
      background: var(--kvc-card-bg);
      color: var(--kvc-text);
      border-radius: 16px;
      box-shadow: var(--kvc-shadow);
      padding: 24px;
      margin: 32px auto;
      max-width: 1000px;
      line-height: 1.6;
    }

    [data-theme="dark"] .ssm-kv-cache-vs-state {
      --kvc-bg: #1a1b2e;
      --kvc-card-bg: #1e1f33;
      --kvc-border: #2d2f45;
      --kvc-text: #e2e8f0;
      --kvc-text-muted: #94a3b8;
      --kvc-shadow: 0 12px 30px rgba(0,0,0,0.3);
      --kvc-red: #f87171;
      --kvc-red-light: rgba(248,113,113,0.08);
      --kvc-red-border: rgba(248,113,113,0.25);
      --kvc-red-bar: #f87171;
      --kvc-orange: #fbbf24;
      --kvc-green: #34d399;
      --kvc-green-light: rgba(52,211,153,0.08);
      --kvc-green-border: rgba(52,211,153,0.25);
      --kvc-green-bar: #34d399;
      --kvc-green-glow: rgba(52,211,153,0.35);
      --kvc-meter-bg: #2d2f45;
    }

    .ssm-kv-cache-vs-state * { box-sizing: border-box; }

    .kvc-header {
      text-align: center;
      margin-bottom: 24px;
    }

    .kvc-header h3 {
      font-size: 24px;
      font-weight: 700;
      color: var(--kvc-text);
      margin: 0 0 6px 0;
    }

    .kvc-header p {
      font-size: 14px;
      color: var(--kvc-text-muted);
      margin: 0;
    }

    .kvc-columns {
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 24px;
      margin-bottom: 16px;
    }

    @media (max-width: 700px) {
      .kvc-columns { grid-template-columns: 1fr; }
    }

    .kvc-panel {
      border-radius: 12px;
      padding: 20px;
    }

    .kvc-panel--transformer {
      background: var(--kvc-red-light);
      border: 1px solid var(--kvc-red-border);
    }

    .kvc-panel--mamba {
      background: var(--kvc-green-light);
      border: 1px solid var(--kvc-green-border);
    }

    .kvc-panel-title {
      font-family: var(--kvc-mono);
      font-size: 13px;
      font-weight: 600;
      letter-spacing: 0.06em;
      text-transform: uppercase;
      text-align: center;
      margin: 0 0 16px 0;
    }

    .kvc-panel--transformer .kvc-panel-title { color: var(--kvc-red); }
    .kvc-panel--mamba .kvc-panel-title { color: var(--kvc-green); }

     
    .kvc-stack-area {
      height: 200px;
      display: flex;
      flex-direction: column-reverse;
      align-items: center;
      justify-content: flex-start;
      gap: 3px;
      margin-bottom: 12px;
      position: relative;
    }

    .kvc-bar {
      width: 70%;
      height: 18px;
      border-radius: 4px;
      background: var(--kvc-red-bar);
      opacity: 0;
      transform: scaleY(0);
      transform-origin: bottom;
      transition: opacity 0.4s ease, transform 0.4s ease;
      display: flex;
      align-items: center;
      justify-content: center;
      font-family: var(--kvc-mono);
      font-size: 10px;
      font-weight: 600;
      color: #fff;
    }

    .kvc-bar.kvc-bar--visible {
      opacity: 1;
      transform: scaleY(1);
    }

     
    .kvc-fixed-area {
      height: 200px;
      display: flex;
      align-items: center;
      justify-content: center;
      margin-bottom: 12px;
    }

    .kvc-state-box {
      width: 70%;
      height: 80px;
      border-radius: 10px;
      background: var(--kvc-green-bar);
      display: flex;
      align-items: center;
      justify-content: center;
      font-family: var(--kvc-mono);
      font-size: 15px;
      font-weight: 700;
      color: #fff;
      transition: box-shadow 0.4s ease;
    }

    .kvc-state-box.kvc-state-box--pulse {
      box-shadow: 0 0 20px var(--kvc-green-glow), 0 0 40px var(--kvc-green-glow);
    }

     
    .kvc-token-counter {
      font-family: var(--kvc-mono);
      font-size: 12px;
      color: var(--kvc-text-muted);
      text-align: center;
      margin-bottom: 10px;
    }

    .kvc-token-counter span {
      font-weight: 700;
      color: var(--kvc-text);
    }

     
    .kvc-meter {
      height: 10px;
      border-radius: 5px;
      background: var(--kvc-meter-bg);
      overflow: hidden;
      margin-bottom: 8px;
    }

    .kvc-meter-fill {
      height: 100%;
      border-radius: 5px;
      transition: width 0.5s ease;
    }

    .kvc-meter-fill--red { background: var(--kvc-red-bar); }
    .kvc-meter-fill--green { background: var(--kvc-green-bar); }

    .kvc-complexity {
      font-family: var(--kvc-mono);
      font-size: 12px;
      font-weight: 600;
      text-align: center;
      padding: 6px 10px;
      border-radius: 6px;
    }

    .kvc-complexity--red {
      background: var(--kvc-red-light);
      color: var(--kvc-red);
      border: 1px solid var(--kvc-red-border);
    }

    .kvc-complexity--green {
      background: var(--kvc-green-light);
      color: var(--kvc-green);
      border: 1px solid var(--kvc-green-border);
    }

     
    .kvc-controls {
      text-align: center;
    }

    .kvc-reset-btn {
      font-family: var(--kvc-mono);
      font-size: 12px;
      font-weight: 600;
      padding: 8px 20px;
      border-radius: 8px;
      border: 1px solid var(--kvc-border);
      background: var(--kvc-card-bg);
      color: var(--kvc-text);
      cursor: pointer;
      transition: all 0.2s ease;
    }

    .kvc-reset-btn:hover {
      border-color: var(--kvc-green);
      color: var(--kvc-green);
    }
  </style>

  <div class="kvc-header">
    <h3>Inference Memory: The Fundamental Trade-off</h3>
    <p>Lossless (KV cache) vs Compressed (hidden state)</p>
  </div>

  <div class="kvc-columns">
    
    <div class="kvc-panel kvc-panel--transformer">
      <h4 class="kvc-panel-title">Transformer: Growing KV Cache</h4>
      <div class="kvc-stack-area" id="kvc-stack-774ca06e5bc03651567b333d58d39a0f">
        
      </div>
      <div class="kvc-token-counter">Tokens processed: <span id="kvc-tcnt-t-774ca06e5bc03651567b333d58d39a0f">0</span></div>
      <div class="kvc-meter">
        <div class="kvc-meter-fill kvc-meter-fill--red" id="kvc-meter-t-774ca06e5bc03651567b333d58d39a0f" style="width:0%"></div>
      </div>
      <div class="kvc-complexity kvc-complexity--red">O(L) memory per sequence</div>
    </div>

    
    <div class="kvc-panel kvc-panel--mamba">
      <h4 class="kvc-panel-title">Mamba: Fixed-Size State</h4>
      <div class="kvc-fixed-area">
        <div class="kvc-state-box" id="kvc-state-774ca06e5bc03651567b333d58d39a0f">h &isin; &#x211D;<sup>N</sup></div>
      </div>
      <div class="kvc-token-counter">Tokens processed: <span id="kvc-tcnt-m-774ca06e5bc03651567b333d58d39a0f">0</span></div>
      <div class="kvc-meter">
        <div class="kvc-meter-fill kvc-meter-fill--green" id="kvc-meter-m-774ca06e5bc03651567b333d58d39a0f" style="width:12%"></div>
      </div>
      <div class="kvc-complexity kvc-complexity--green">O(1) memory per sequence</div>
    </div>
  </div>

  <div class="kvc-controls">
    <button class="kvc-reset-btn" id="kvc-reset-774ca06e5bc03651567b333d58d39a0f">Reset Animation</button>
  </div>

  <script>
    (function(){
      var uid = "774ca06e5bc03651567b333d58d39a0f";
      var MAX_TOKENS = 10;
      var INTERVAL = 1500;
      var tokenCount = 0;
      var timer = null;

      var stackEl = document.getElementById("kvc-stack-" + uid);
      var cntT = document.getElementById("kvc-tcnt-t-" + uid);
      var cntM = document.getElementById("kvc-tcnt-m-" + uid);
      var meterT = document.getElementById("kvc-meter-t-" + uid);
      var meterM = document.getElementById("kvc-meter-m-" + uid);
      var stateBox = document.getElementById("kvc-state-" + uid);
      var resetBtn = document.getElementById("kvc-reset-" + uid);

      function createBars() {
        stackEl.innerHTML = "";
        for (var i = 0; i < MAX_TOKENS; i++) {
          var bar = document.createElement("div");
          bar.className = "kvc-bar";
          bar.textContent = "KV " + (i + 1);
          bar.setAttribute("data-idx", i);
          stackEl.appendChild(bar);
        }
      }

      function reset() {
        if (timer) clearInterval(timer);
        tokenCount = 0;
        createBars();
        cntT.textContent = "0";
        cntM.textContent = "0";
        meterT.style.width = "0%";
        meterM.style.width = "12%";
        stateBox.classList.remove("kvc-state-box--pulse");
        start();
      }

      function addToken() {
        if (tokenCount >= MAX_TOKENS) {
          clearInterval(timer);
          return;
        }
        tokenCount++;
        cntT.textContent = tokenCount;
        cntM.textContent = tokenCount;
        meterT.style.width = (tokenCount / MAX_TOKENS * 100) + "%";

        
        var bars = stackEl.querySelectorAll(".kvc-bar");
        if (bars[tokenCount - 1]) {
          bars[tokenCount - 1].classList.add("kvc-bar--visible");
        }

        
        stateBox.classList.add("kvc-state-box--pulse");
        setTimeout(function(){ stateBox.classList.remove("kvc-state-box--pulse"); }, 600);
      }

      function start() {
        timer = setInterval(addToken, INTERVAL);
      }

      resetBtn.addEventListener("click", reset);
      reset();
    })();
  </script>
</div>

<p>The trade-off is fundamental. The Transformer&rsquo;s KV cache stores every token (lossless, but $O(L)$ memory). Mamba&rsquo;s hidden state compresses all history into a fixed vector (lossy, but $O(1)$ memory). The question is always whether the compressed representation is good enough for the task.</p>
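<p>The scaling difference can be written down as two small formulas. The sketch below uses hypothetical hyperparameters and counts only the main tensors: it ignores Mamba&rsquo;s small convolution buffer and any cache compression tricks. The absolute numbers and the ratio between the two depend heavily on the configuration chosen.</p>

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values (the factor 2) for every layer, head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    """One (d_inner x d_state) recurrent state per layer; no seq_len term at all."""
    return n_layers * d_inner * d_state * bytes_per_elem

# Hypothetical model shapes, chosen only to illustrate the scaling behavior.
for seq_len in (1_000, 10_000, 100_000):
    kv = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)
    ssm = ssm_state_bytes(n_layers=64, d_inner=5120, d_state=16)
    print(seq_len, kv, ssm)
```

<p>The first function grows linearly in <code>seq_len</code>; the second has no sequence-length argument, which is the whole point.</p>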
<p>Mamba-1 demonstrated that, at the scales tested, it was. Mamba-2.8B matched or exceeded Pythia-6.9B (a model more than twice its size) on zero-shot downstream evaluations. With both models trained on the Pile, Mamba-1.4B achieved a 59.7% average across common-sense reasoning benchmarks, edging out Pythia-2.8B (59.1%). At batch size 16, Mamba-2.8B completed generation in 18.6 seconds versus GPT-Neo-2.7B&rsquo;s 65.9 seconds (3.5x faster). GPT-Neo ran out of memory at batch size 32 on a 64GB GPU; Mamba continued scaling to batch 128+.</p>
<h2 id="part-7-mamba-2-maximizing-gpu-utilization">Part 7: Mamba-2, Maximizing GPU Utilization</h2>
<p>Mamba-1 had an embarrassing practical problem: it was 2-3x slower than equivalently sized Transformers during training. The root cause is that modern GPUs deliver roughly 16x more throughput for matrix multiplication (via Tensor Cores) than for general arithmetic. Transformer compute is dominated by matmuls. Mamba-1&rsquo;s selective scan was not expressed as matmuls, so it could not use Tensor Cores.</p>
<p>Tri Dao and Albert Gu&rsquo;s May 2024 paper &ldquo;Transformers are SSMs&rdquo; solved this by showing that unrolling the SSM recurrence produces a structured (semiseparable) matrix whose action can be computed with matrix multiplications. The resulting algorithm, SSD (structured state space duality), splits the sequence into chunks: within each chunk, the computation runs as dense matmuls on Tensor Cores; between chunks, a short scan passes state forward. The core SSD layer runs 2-8x faster than Mamba-1&rsquo;s selective scan.</p>
<p>The trade-off: to fit this matrix framework, $A$ is restricted from a diagonal matrix (Mamba-1) to a scalar-times-identity (Mamba-2), meaning all state dimensions within a head share one decay rate. Mamba-2 compensates with a multi-head structure and increases the state dimension from $N = 16$ to $N = 64\text{-}256$.</p>
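<p>The matrix view is easy to verify when the decay is a scalar per step. Unrolling $h_t = a_t h_{t-1} + B_t x_t$, $y_t = C_t^\top h_t$ gives $y = Mx$ with $M_{ts} = (C_t^\top B_s)\, a_{s+1} \cdots a_t$ for $s \le t$ and zero otherwise. The sketch below builds the full $L \times L$ matrix for one channel and one head. This is the naive quadratic form, not the chunked SSD algorithm, and the function names are hypothetical.</p>

```python
import numpy as np

def ssm_recurrent(a, B, C, x):
    """Scalar-decay SSM, one channel: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    h = np.zeros(B.shape[1])
    y = np.empty(len(x))
    for t in range(len(x)):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssm_matrix_form(a, B, C, x):
    """Same map as one masked matrix: M[t, s] = (C_t . B_s) * a_{s+1} * ... * a_t."""
    cum = np.cumsum(np.log(a))                 # requires a > 0
    diff = np.tril(cum[:, None] - cum[None, :])
    decay = np.exp(diff)                       # products of decays on and below the diagonal
    M = np.tril(decay * (C @ B.T))             # causal mask removes the upper triangle
    return M @ x

rng = np.random.default_rng(0)
L, N = 16, 4
a = rng.uniform(0.5, 1.0, size=L)
B, C, x = rng.normal(size=(L, N)), rng.normal(size=(L, N)), rng.normal(size=L)
print(np.allclose(ssm_recurrent(a, B, C, x), ssm_matrix_form(a, B, C, x)))
```

<p>The product $C B^\top$ looks like an attention score matrix with $C$ in the role of queries and $B$ in the role of keys, modulated by a causal decay mask. A single shared decay per step is what lets the mask factor out this cleanly.</p>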
<h2 id="part-8-mamba-3-three-innovations-from-classical-ssm-theory">Part 8: Mamba-3, Three Innovations from Classical SSM Theory</h2>
<p>Published at ICLR 2026, Mamba-3 asks a different question than its predecessors. Mamba-2 optimized for training speed by simplifying the SSM to leverage Tensor Cores. But with the rise of RL post-training, agentic workflows, and test-time compute scaling, <strong>inference efficiency has become the primary bottleneck</strong>.</p>
<p>Here is the problem Mamba-3 targets. During autoregressive decoding, Mamba-2&rsquo;s simplified recurrence is memory-bound. The GPU loads the state from HBM to SRAM, performs a trivially small computation (the scalar-times-identity update is cheap), and writes the result back. The arithmetic intensity is roughly 2.5 ops/byte. The H100 needs $\sim$295 ops/byte to be compute-bound. More than 99% of GPU compute sits idle during token generation.</p>
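<p>The arithmetic behind those figures is a simple roofline estimate. The sketch below assumes approximate H100 SXM numbers (about 989 TFLOP/s dense BF16 and about 3.35 TB/s of HBM bandwidth); these hardware figures are assumptions not stated in the text, and the helper names are hypothetical.</p>

```python
def ridge_point(peak_flops, mem_bandwidth_bytes):
    """Arithmetic intensity (ops/byte) above which a kernel is no longer memory-bound."""
    return peak_flops / mem_bandwidth_bytes

def compute_utilization(intensity, ridge):
    """Upper bound on the fraction of peak compute reachable at a given intensity."""
    return min(1.0, intensity / ridge)

ridge = ridge_point(peak_flops=989e12, mem_bandwidth_bytes=3.35e12)  # roughly 295 ops/byte
util = compute_utilization(intensity=2.5, ridge=ridge)               # under 1% of peak
print(round(ridge), f"{util:.2%}")
```

<p>At 2.5 ops/byte the kernel can use less than 1% of peak compute, so extra arithmetic per loaded byte is nearly free until the intensity approaches the ridge point.</p>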
<p>Mamba-3&rsquo;s overarching philosophy is to increase arithmetic intensity during decoding by making the state update mathematically richer. Each byte of memory traffic now carries more compute, so the extra work fills GPU cycles that were already sitting idle rather than adding new ones. Three innovations accomplish this.</p>
<h3 id="innovation-1-exponential-trapezoidal-discretization">Innovation 1: Exponential-Trapezoidal Discretization</h3>
<p><strong>The problem.</strong> Mamba-1 and Mamba-2 used what the Mamba-3 authors retroactively classify as &ldquo;Exponential-Euler&rdquo; discretization: the exact formula $\bar{A} = \exp(\Delta A)$ paired with the first-order Euler approximation $\bar{B} = \Delta B$. This is a hybrid: exact for the state decay, but approximate for how the input enters the state. The local truncation error is $O(\Delta^2)$.</p>
<p>In numerical analysis terms, the Euler method approximates the area under a curve using a rectangle aligned to one endpoint. It captures the value at that single endpoint but ignores how the curve changes across the interval. This crude approximation struggles with fast-moving temporal dependencies, producing &ldquo;jerky&rdquo; transitions in the state.</p>
<p>In practice, prior Mamba models compensated by adding an explicit short 1D causal convolution (Conv1d, width 4) before the SSM. This Conv1d smoothed out immediate local token interactions that the imprecise discretization missed. It worked, but it was an architectural bandage for a mathematical shortcoming. And it added latency at inference: one more sequential operation per token.</p>
<p><strong>The intuition.</strong> The <a href="https://en.wikipedia.org/wiki/Trapezoidal_rule">trapezoidal rule</a> approximates the area under a curve using a trapezoid instead of a rectangle. A rectangle uses only one endpoint&rsquo;s value. A trapezoid uses both endpoints and draws a straight line between them, capturing the slope of the curve across the interval. This gives second-order accuracy: the local error drops from $O(\Delta^2)$ to $O(\Delta^3)$.</p>

<div class="ssm-euler-vs-trap" id="evt-774ca06e5bc03651567b333d58d39a0f">
  <style>
    .ssm-euler-vs-trap {
      --evt-bg: #ffffff;
      --evt-card-bg: #ffffff;
      --evt-border: #e2e8f0;
      --evt-text: #1a202c;
      --evt-text-muted: #718096;
      --evt-shadow: 0 12px 30px rgba(0,0,0,0.06);
      --evt-curve: #334155;
      --evt-axis: #94a3b8;
      --evt-red: #ef4444;
      --evt-red-fill: rgba(239,68,68,0.12);
      --evt-red-light: #fef2f2;
      --evt-red-border: #fecaca;
      --evt-green: #10b981;
      --evt-green-fill: rgba(16,185,129,0.12);
      --evt-green-light: #ecfdf5;
      --evt-green-border: #a7f3d0;
      --evt-error-fill: rgba(239,68,68,0.25);
      --evt-error-fill-green: rgba(16,185,129,0.25);
      --evt-mono: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      --evt-sans: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;

      font-family: var(--evt-sans);
      background: var(--evt-card-bg);
      color: var(--evt-text);
      border-radius: 16px;
      box-shadow: var(--evt-shadow);
      padding: 28px 24px;
      margin: 32px auto;
      max-width: 940px;
      line-height: 1.6;
    }

    [data-theme="dark"] .ssm-euler-vs-trap {
      --evt-bg: #1a1b2e;
      --evt-card-bg: #1e1f33;
      --evt-border: #2d2f45;
      --evt-text: #e2e8f0;
      --evt-text-muted: #94a3b8;
      --evt-shadow: 0 12px 30px rgba(0,0,0,0.3);
      --evt-curve: #cbd5e1;
      --evt-axis: #475569;
      --evt-red: #f87171;
      --evt-red-fill: rgba(248,113,113,0.12);
      --evt-red-light: rgba(248,113,113,0.08);
      --evt-red-border: rgba(248,113,113,0.25);
      --evt-green: #34d399;
      --evt-green-fill: rgba(52,211,153,0.12);
      --evt-green-light: rgba(52,211,153,0.08);
      --evt-green-border: rgba(52,211,153,0.25);
      --evt-error-fill: rgba(248,113,113,0.3);
      --evt-error-fill-green: rgba(52,211,153,0.3);
    }

    .ssm-euler-vs-trap * { box-sizing: border-box; }

    .evt-header {
      text-align: center;
      margin-bottom: 24px;
    }

    .evt-header h3 {
      font-size: 22px;
      font-weight: 700;
      color: var(--evt-text);
      margin: 0 0 6px 0;
    }

    .evt-header p {
      font-size: 14px;
      color: var(--evt-text-muted);
      margin: 0;
    }

    .evt-panels {
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 20px;
    }

    @media (max-width: 700px) {
      .evt-panels { grid-template-columns: 1fr; }
    }

    .evt-panel {
      border-radius: 12px;
      padding: 20px 16px 16px;
      text-align: center;
    }

    .evt-panel--euler {
      background: var(--evt-red-light);
      border: 1px solid var(--evt-red-border);
    }

    .evt-panel--trap {
      background: var(--evt-green-light);
      border: 1px solid var(--evt-green-border);
    }

    .evt-panel-title {
      font-family: var(--evt-mono);
      font-size: 14px;
      font-weight: 700;
      letter-spacing: 0.03em;
      margin: 0 0 14px 0;
    }

    .evt-panel--euler .evt-panel-title { color: var(--evt-red); }
    .evt-panel--trap .evt-panel-title { color: var(--evt-green); }

    .evt-svg-wrap {
      width: 100%;
      margin-bottom: 14px;
    }

    .evt-svg-wrap svg {
      width: 100%;
      height: auto;
      display: block;
    }

    .evt-label {
      font-size: 13px;
      color: var(--evt-text);
      margin: 8px 0 8px;
      font-weight: 500;
    }

    .evt-badge {
      display: inline-block;
      font-family: var(--evt-mono);
      font-size: 12px;
      font-weight: 700;
      padding: 4px 14px;
      border-radius: 6px;
    }

    .evt-badge--red {
      background: var(--evt-red-light);
      color: var(--evt-red);
      border: 1px solid var(--evt-red-border);
    }

    .evt-badge--green {
      background: var(--evt-green-light);
      color: var(--evt-green);
      border: 1px solid var(--evt-green-border);
    }
  </style>

  <div class="evt-header">
    <h3>Why Trapezoidal Discretization Is More Accurate</h3>
    <p>Same interval, better approximation of the area under the curve</p>
  </div>

  <div class="evt-panels">
    
    <div class="evt-panel evt-panel--euler">
      <h4 class="evt-panel-title">Euler Method</h4>
      <div class="evt-svg-wrap">
        
        <svg viewBox="0 0 340 210" xmlns="http://www.w3.org/2000/svg">
          
          <line x1="40" y1="180" x2="310" y2="180" stroke="var(--evt-axis)" stroke-width="1.2"/>
          <line x1="40" y1="180" x2="40" y2="20" stroke="var(--evt-axis)" stroke-width="1.2"/>

          
          <text x="18" y="100" font-family="var(--evt-mono)" font-size="11" fill="var(--evt-text-muted)" text-anchor="middle" transform="rotate(-90 18 100)">f(t)</text>

          
          <line x1="100" y1="180" x2="100" y2="175" stroke="var(--evt-text-muted)" stroke-width="1.2"/>
          <line x1="230" y1="180" x2="230" y2="175" stroke="var(--evt-text-muted)" stroke-width="1.2"/>

          
          <text x="100" y="196" font-family="var(--evt-mono)" font-size="11" fill="var(--evt-text-muted)" text-anchor="middle">t</text>
          <text x="230" y="196" font-family="var(--evt-mono)" font-size="11" fill="var(--evt-text-muted)" text-anchor="middle">t+&#916;</text>

          
          <rect x="100" y="58" width="130" height="122" fill="var(--evt-red-fill)" stroke="var(--evt-red)" stroke-width="1.5" stroke-dasharray="5,3"/>

          
          
          <clipPath id="euler-error-clip-774ca06e5bc03651567b333d58d39a0f">
            <rect x="100" y="58" width="130" height="130"/>
          </clipPath>
          <path d="M 100,58 L 230,58 L 230,120 C 210,108 180,85 160,72 C 140,63 120,58 100,58 Z"
                fill="var(--evt-error-fill)" clip-path="url(#euler-error-clip-774ca06e5bc03651567b333d58d39a0f)"/>

          
          <path d="M 40,42 C 60,46 80,52 100,58 C 130,68 155,82 175,96 C 195,108 215,116 230,120 C 260,130 285,142 305,150"
                fill="none" stroke="var(--evt-curve)" stroke-width="2.5" stroke-linecap="round"/>

          
          <circle cx="100" cy="58" r="5" fill="var(--evt-red)" stroke="white" stroke-width="1.5"/>

          
          <circle cx="230" cy="120" r="5" fill="var(--evt-curve)" stroke="white" stroke-width="1.5"/>

          
          <line x1="230" y1="58" x2="230" y2="120" stroke="var(--evt-red)" stroke-width="1.5" stroke-dasharray="3,3"/>

          
          <text x="242" y="92" font-family="var(--evt-mono)" font-size="10" fill="var(--evt-red)" font-weight="600">error</text>
          
          <line x1="240" y1="72" x2="240" y2="82" stroke="var(--evt-red)" stroke-width="1" marker-end="none"/>
          <polygon points="240,85 237,80 243,80" fill="var(--evt-red)"/>
        </svg>
      </div>
      <div class="evt-label">Uses only the left endpoint f(t)</div>
      <div><span class="evt-badge evt-badge--red">O(&#916;&#178;) local error</span></div>
    </div>

    
    <div class="evt-panel evt-panel--trap">
      <h4 class="evt-panel-title">Trapezoidal Rule</h4>
      <div class="evt-svg-wrap">
        <svg viewBox="0 0 340 210" xmlns="http://www.w3.org/2000/svg">
          
          <line x1="40" y1="180" x2="310" y2="180" stroke="var(--evt-axis)" stroke-width="1.2"/>
          <line x1="40" y1="180" x2="40" y2="20" stroke="var(--evt-axis)" stroke-width="1.2"/>

          
          <text x="18" y="100" font-family="var(--evt-mono)" font-size="11" fill="var(--evt-text-muted)" text-anchor="middle" transform="rotate(-90 18 100)">f(t)</text>

          
          <line x1="100" y1="180" x2="100" y2="175" stroke="var(--evt-text-muted)" stroke-width="1.2"/>
          <line x1="230" y1="180" x2="230" y2="175" stroke="var(--evt-text-muted)" stroke-width="1.2"/>

          
          <text x="100" y="196" font-family="var(--evt-mono)" font-size="11" fill="var(--evt-text-muted)" text-anchor="middle">t</text>
          <text x="230" y="196" font-family="var(--evt-mono)" font-size="11" fill="var(--evt-text-muted)" text-anchor="middle">t+&#916;</text>

          
          <polygon points="100,180 100,58 230,120 230,180"
                   fill="var(--evt-green-fill)" stroke="var(--evt-green)" stroke-width="1.5" stroke-dasharray="5,3"/>

          
          <line x1="100" y1="58" x2="230" y2="120" stroke="var(--evt-green)" stroke-width="2.5" stroke-linecap="round"/>

          
          <path d="M 100,58 C 130,68 155,82 175,96 C 195,108 215,116 230,120 L 230,120 L 100,58 Z"
                fill="var(--evt-error-fill-green)" opacity="0.5"/>

          
          <path d="M 40,42 C 60,46 80,52 100,58 C 130,68 155,82 175,96 C 195,108 215,116 230,120 C 260,130 285,142 305,150"
                fill="none" stroke="var(--evt-curve)" stroke-width="2.5" stroke-linecap="round"/>

          
          <circle cx="100" cy="58" r="5" fill="var(--evt-green)" stroke="white" stroke-width="1.5"/>
          <circle cx="230" cy="120" r="5" fill="var(--evt-green)" stroke="white" stroke-width="1.5"/>

          
          <text x="175" y="80" font-family="var(--evt-mono)" font-size="10" fill="var(--evt-green)" font-weight="600">tiny error</text>
          <line x1="172" y1="83" x2="165" y2="88" stroke="var(--evt-green)" stroke-width="1"/>
          <polygon points="163,90 165,85 168,89" fill="var(--evt-green)"/>
        </svg>
      </div>
      <div class="evt-label">Uses both endpoints f(t) and f(t+&#916;)</div>
      <div><span class="evt-badge evt-badge--green">O(&#916;&#179;) local error</span></div>
    </div>
  </div>
</div>
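<p>The two error orders in the panels above can be checked numerically. The sketch below is a generic quadrature experiment, not Mamba-3 code: it integrates an arbitrarily chosen test function, $f(t) = e^{-t}$, over a single step and halves the step size. An $O(\Delta^2)$ error should shrink by about 4, an $O(\Delta^3)$ error by about 8.</p>

```python
import math

def local_errors(f, F, t, delta):
    """Absolute one-step errors of the left-rectangle and trapezoid rules on [t, t + delta]."""
    exact = F(t + delta) - F(t)
    rectangle = delta * f(t)
    trapezoid = 0.5 * delta * (f(t) + f(t + delta))
    return abs(rectangle - exact), abs(trapezoid - exact)

# Test function chosen for illustration: f(t) = exp(-t), antiderivative F(t) = -exp(-t).
f = lambda t: math.exp(-t)
F = lambda t: -math.exp(-t)

rect_coarse, trap_coarse = local_errors(f, F, 0.0, 0.1)
rect_fine, trap_fine = local_errors(f, F, 0.0, 0.05)

# Ratio of errors when delta is halved from 0.1 to 0.05.
print(f"rectangle ratio: {rect_coarse / rect_fine:.2f}")
print(f"trapezoid ratio: {trap_coarse / trap_fine:.2f}")
```

<p>The ratios land close to 4 and 8 respectively; they are not exact because higher-order terms still contribute at these step sizes.</p>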

<p><strong>The math.</strong> Applying the generalized trapezoidal rule to the SSM&rsquo;s state equation produces a three-term recurrence, where the old Euler formula had only two:</p>
$$h_t = \underbrace{\exp(\Delta_t A_t) \cdot h_{t-1}}_{\text{Term 1: decayed previous state}} + \underbrace{(1 - \lambda_t) \cdot \Delta_t \cdot \exp(\Delta_t A_t) \cdot B_{t-1} \cdot x_{t-1}}_{\text{Term 2 (NEW): previous input, decayed}} + \underbrace{\lambda_t \cdot \Delta_t \cdot B_t \cdot x_t}_{\text{Term 3: current input}}$$<p>Term 1 is identical to the old formula: decay the previous state. Term 3 is similar to the old Euler input term, but weighted by $\lambda_t$ instead of 1. The new addition is Term 2: the previous timestep&rsquo;s input $x_{t-1}$, projected through $B_{t-1}$, scaled by $(1 - \lambda_t)$, and decayed by the same exponential factor as the state.</p>
<p>The parameter $\lambda_t$ is a data-dependent convex combination weight. The $(1-\lambda_t)$ and $\lambda_t$ coefficients on consecutive inputs are the weights of the trapezoid: $(1-\lambda)$ on the left endpoint (previous input), $\lambda$ on the right endpoint (current input).</p>
<p>Let me verify the connection to the old formula. When $\lambda_t = 1$: Term 2 vanishes entirely because its coefficient $(1-\lambda_t) = 0$. Term 3 becomes $1 \cdot \Delta_t \cdot B_t \cdot x_t = \Delta_t B_t x_t$, which is exactly the Euler formula $\bar{B}x_t$. So the old Mamba-1/2 discretization is the special case $\lambda = 1$ of this more general formula.</p>
<p><strong>A numerical walkthrough.</strong> Take $a = -0.5$, $\Delta = 0.1$, $b = 1$, and process the input sequence $[1, 1, 0]$ with $\lambda = 0.5$ (balanced trapezoidal blending):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Parameters: a = -0.5, delta = 0.1, b = 1.0, lambda = 0.5</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># exp(delta * a) = exp(-0.05) = 0.9512</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># k=0: input=1 (no previous input, Term 2 uses x_{-1}=0)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Term 1: 0.9512 * 0 = 0                              (no state yet)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Term 2: (1-0.5) * 0.1 * 0.9512 * 1.0 * 0 = 0        (no prev input)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Term 3: 0.5 * 0.1 * 1.0 * 1 = 0.05</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   h_0 = 0 + 0 + 0.05 = 0.0500</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># k=1: input=1, prev_input=1</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Term 1: 0.9512 * 0.05 = 0.04756                     (decayed state)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Term 2: 0.5 * 0.1 * 0.9512 * 1.0 * 1 = 0.04756     (prev input, decayed)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Term 3: 0.5 * 0.1 * 1.0 * 1 = 0.05                  (current input)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   h_1 = 0.04756 + 0.04756 + 0.05 = 0.14512</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># k=2: input=0, prev_input=1</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Term 1: 0.9512 * 0.14512 = 0.13804                  (decayed state)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Term 2: 0.5 * 0.1 * 0.9512 * 1.0 * 1 = 0.04756     (prev input=1, decayed)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Term 3: 0.5 * 0.1 * 1.0 * 0 = 0.0                   (current input=0)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   h_2 = 0.13804 + 0.04756 + 0.0 = 0.18560</span>
</span></span><span style="display:flex;"><span>
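</span></span><span style="display:flex;"><span><span style="color:#75715e"># Compare to Euler (lambda=1, so the input coefficient is delta * b = 0.1):</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   h_0 = 0.1</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   h_1 = 0.9512 * 0.1 + 0.1 * 1 = 0.19512</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   h_2 = 0.9512 * 0.19512 + 0.1 * 0 = 0.18560</span>
</span></span></code></pre></div><p>The two schemes differ while the input is arriving. Euler injects each token&rsquo;s full contribution in a single step ($h_0 = 0.1$, $h_1 = 0.1951$), whereas the trapezoidal rule splits it across two consecutive steps ($h_0 = 0.05$, $h_1 = 0.1451$). At step 2 the current input is 0, so Euler&rsquo;s input term adds nothing, but the trapezoidal Term 2 still contributes 0.04756 for the previous input ($x_1 = 1$) via the left endpoint of the trapezoid. With a fixed $\lambda = 0.5$ and this particular sequence, both schemes arrive at $h_2 \approx 0.1856$: the trapezoid has redistributed each token&rsquo;s contribution across two adjacent steps rather than changing the total. The difference lies in how the state moves from step to step, and for fast-changing input sequences that is where it matters.</p>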

<div class="ssm-trap-recurrence" id="trc-774ca06e5bc03651567b333d58d39a0f">
  <style>
    .ssm-trap-recurrence {
      --tr-bg: #ffffff;
      --tr-card-bg: #ffffff;
      --tr-border: #e2e8f0;
      --tr-text: #1a202c;
      --tr-text-muted: #718096;
      --tr-shadow: 0 12px 30px rgba(0,0,0,0.06);
      --tr-blue: #3b82f6;
      --tr-blue-light: #eff6ff;
      --tr-blue-border: #bfdbfe;
      --tr-orange: #f59e0b;
      --tr-orange-light: #fffbeb;
      --tr-orange-border: #fde68a;
      --tr-green: #10b981;
      --tr-green-light: #ecfdf5;
      --tr-green-border: #a7f3d0;
      --tr-sum-bg: #f8fafc;
      --tr-sum-border: #cbd5e1;
      --tr-badge-bg: #fef3c7;
      --tr-badge-text: #92400e;
      --tr-badge-border: #fde68a;
      --tr-arrow: #94a3b8;
      --tr-mono: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      --tr-sans: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;

      font-family: var(--tr-sans);
      background: var(--tr-card-bg);
      color: var(--tr-text);
      border-radius: 16px;
      box-shadow: var(--tr-shadow);
      padding: 24px;
      margin: 32px auto;
      max-width: 900px;
      line-height: 1.6;
    }

    [data-theme="dark"] .ssm-trap-recurrence {
      --tr-bg: #1a1b2e;
      --tr-card-bg: #1e1f33;
      --tr-border: #2d2f45;
      --tr-text: #e2e8f0;
      --tr-text-muted: #94a3b8;
      --tr-shadow: 0 12px 30px rgba(0,0,0,0.3);
      --tr-blue: #60a5fa;
      --tr-blue-light: rgba(96,165,250,0.08);
      --tr-blue-border: rgba(96,165,250,0.25);
      --tr-orange: #fbbf24;
      --tr-orange-light: rgba(251,191,36,0.08);
      --tr-orange-border: rgba(251,191,36,0.25);
      --tr-green: #34d399;
      --tr-green-light: rgba(52,211,153,0.08);
      --tr-green-border: rgba(52,211,153,0.25);
      --tr-sum-bg: #252640;
      --tr-sum-border: #3d3f5c;
      --tr-badge-bg: rgba(251,191,36,0.15);
      --tr-badge-text: #fbbf24;
      --tr-badge-border: rgba(251,191,36,0.3);
      --tr-arrow: #475569;
    }

    .ssm-trap-recurrence * { box-sizing: border-box; }

    .tr-header {
      text-align: center;
      margin-bottom: 28px;
    }

    .tr-header h3 {
      font-size: 24px;
      font-weight: 700;
      color: var(--tr-text);
      margin: 0 0 6px 0;
    }

    .tr-header p {
      font-size: 14px;
      color: var(--tr-text-muted);
      margin: 0;
    }

     
    .tr-terms {
      display: flex;
      flex-direction: column;
      gap: 14px;
      margin-bottom: 20px;
    }

    .tr-term {
      display: grid;
      grid-template-columns: auto 1fr auto auto;
      align-items: center;
      gap: 10px;
      padding: 12px 16px;
      border-radius: 10px;
      position: relative;
    }

    @media (max-width: 600px) {
      .tr-term {
        grid-template-columns: 1fr;
        text-align: center;
        gap: 6px;
      }
    }

    .tr-term--blue {
      background: var(--tr-blue-light);
      border: 1px solid var(--tr-blue-border);
    }

    .tr-term--orange {
      background: var(--tr-orange-light);
      border: 1.5px solid var(--tr-orange-border);
    }

    .tr-term--green {
      background: var(--tr-green-light);
      border: 1px solid var(--tr-green-border);
    }

    .tr-term-input {
      font-family: var(--tr-mono);
      font-size: 14px;
      font-weight: 600;
      padding: 6px 12px;
      border-radius: 6px;
      white-space: nowrap;
    }

    .tr-term--blue .tr-term-input { color: var(--tr-blue); background: var(--tr-card-bg); border: 1px solid var(--tr-blue-border); }
    .tr-term--orange .tr-term-input { color: var(--tr-orange); background: var(--tr-card-bg); border: 1px solid var(--tr-orange-border); }
    .tr-term--green .tr-term-input { color: var(--tr-green); background: var(--tr-card-bg); border: 1px solid var(--tr-green-border); }

    .tr-term-op {
      font-family: var(--tr-mono);
      font-size: 12px;
      color: var(--tr-text-muted);
      text-align: center;
      display: flex;
      align-items: center;
      gap: 6px;
    }

    .tr-term-op::before {
      content: '';
      flex: 1;
      height: 1px;
      background: var(--tr-arrow);
    }

    .tr-term-op::after {
      content: '';
      flex: 1;
      height: 1px;
      background: var(--tr-arrow);
    }

    .tr-term-result {
      font-family: var(--tr-mono);
      font-size: 15px;
      font-weight: 700;
      padding: 6px 14px;
      border-radius: 6px;
      text-align: center;
      white-space: nowrap;
    }

    .tr-term--blue .tr-term-result { color: var(--tr-blue); background: var(--tr-blue-light); border: 1px solid var(--tr-blue-border); }
    .tr-term--orange .tr-term-result { color: var(--tr-orange); background: var(--tr-orange-light); border: 1px solid var(--tr-orange-border); }
    .tr-term--green .tr-term-result { color: var(--tr-green); background: var(--tr-green-light); border: 1px solid var(--tr-green-border); }

    .tr-term-label {
      font-size: 11px;
      font-weight: 500;
      color: var(--tr-text-muted);
      font-family: var(--tr-mono);
      text-align: right;
      white-space: nowrap;
    }

     
    .tr-badge-new {
      position: absolute;
      top: -9px;
      right: 16px;
      font-family: var(--tr-mono);
      font-size: 10px;
      font-weight: 700;
      letter-spacing: 0.08em;
      text-transform: uppercase;
      padding: 2px 10px;
      border-radius: 4px;
      background: var(--tr-badge-bg);
      color: var(--tr-badge-text);
      border: 1px solid var(--tr-badge-border);
    }

     
    .tr-sum-row {
      display: flex;
      flex-direction: column;
      align-items: center;
      gap: 8px;
    }

    .tr-sum-arrows {
      display: flex;
      justify-content: center;
      gap: 40px;
    }

    .tr-sum-arrow {
      width: 2px;
      height: 20px;
      background: var(--tr-arrow);
      position: relative;
    }

    .tr-sum-arrow::after {
      content: '';
      position: absolute;
      bottom: -4px;
      left: 50%;
      transform: translateX(-50%);
      border-left: 4px solid transparent;
      border-right: 4px solid transparent;
      border-top: 5px solid var(--tr-arrow);
    }

    .tr-sum-circle {
      width: 44px;
      height: 44px;
      border-radius: 50%;
      background: var(--tr-sum-bg);
      border: 2px solid var(--tr-sum-border);
      display: flex;
      align-items: center;
      justify-content: center;
      font-size: 22px;
      font-weight: 700;
      color: var(--tr-text);
    }

    .tr-sum-output {
      font-family: var(--tr-mono);
      font-size: 16px;
      font-weight: 700;
      color: var(--tr-text);
      padding: 8px 20px;
      border-radius: 8px;
      background: var(--tr-sum-bg);
      border: 2px solid var(--tr-sum-border);
    }
  </style>

  <div class="tr-header">
    <h3>Trapezoidal Recurrence: Step k=2</h3>
    <p>Three terms instead of two &mdash; the previous input now participates</p>
  </div>

  <div class="tr-terms">
    
    <div class="tr-term tr-term--blue">
      <div class="tr-term-input">h&#x2081; = 0.14512</div>
      <div class="tr-term-op">&times; exp(&Delta;A) = &times; 0.9512</div>
      <div class="tr-term-result">0.13804</div>
      <div class="tr-term-label">Decayed previous state</div>
    </div>

    
    <div class="tr-term tr-term--orange">
      <span class="tr-badge-new">NEW in Mamba-3</span>
      <div class="tr-term-input">x&#x2081; = 1</div>
      <div class="tr-term-op">&times; (1&minus;&lambda;)&middot;&Delta;&middot;exp(&Delta;A)&middot;B = &times; 0.04756</div>
      <div class="tr-term-result">0.04756</div>
      <div class="tr-term-label">Previous input (decayed)</div>
    </div>

    
    <div class="tr-term tr-term--green">
      <div class="tr-term-input">x&#x2082; = 0</div>
      <div class="tr-term-op">&times; &lambda;&middot;&Delta;&middot;B = &times; 0.05</div>
      <div class="tr-term-result">0.0</div>
      <div class="tr-term-label">Current input</div>
    </div>
  </div>

  
  <div class="tr-sum-row">
    <div class="tr-sum-arrows">
      <div class="tr-sum-arrow"></div>
      <div class="tr-sum-arrow"></div>
      <div class="tr-sum-arrow"></div>
    </div>
    <div class="tr-sum-circle">+</div>
    <div class="tr-sum-arrow"></div>
    <div class="tr-sum-output">h&#x2082; = 0.18560</div>
  </div>
</div>
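<p>The walkthrough can be reproduced in a few lines of Python. This is a minimal scalar sketch with fixed $a$, $\Delta$, $b$ and $\lambda$, whereas in Mamba-3 these quantities vary per token; the function name <code>trapezoidal_scan</code> is illustrative. It uses full floating-point precision, so the last printed digit can differ slightly from the hand-rounded values above.</p>

```python
import math

def trapezoidal_scan(xs, a, delta, b, lam):
    """Scalar three-term recurrence; x_{-1} and h_{-1} are taken as 0."""
    decay = math.exp(delta * a)
    h, x_prev, states = 0.0, 0.0, []
    for x in xs:
        term1 = decay * h                                 # decayed previous state
        term2 = (1.0 - lam) * delta * decay * b * x_prev  # previous input, decayed
        term3 = lam * delta * b * x                       # current input
        h = term1 + term2 + term3
        states.append(h)
        x_prev = x
    return states

states = trapezoidal_scan([1.0, 1.0, 0.0], a=-0.5, delta=0.1, b=1.0, lam=0.5)
for k, h in enumerate(states):
    print(f"h_{k} = {h:.4f}")
# prints h_0 = 0.0500, h_1 = 0.1451, h_2 = 0.1856
```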

<p><strong>The implicit convolution and the death of Conv1d.</strong> Here is the subtle and important consequence. Because the state update at time $t$ depends on both $x_t$ and $x_{t-1}$, the trapezoidal recurrence contains an implicit convolution of width 2. The $(1-\lambda_t)$ and $\lambda_t$ weights on the consecutive inputs play the role of a learned, data-dependent convolution filter operating on pairs of adjacent tokens.</p>
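<p>One way to see the implicit convolution is to split the update into two stages: a width-2 causal filter over adjacent inputs, followed by an ordinary decaying scan. The sketch below is a scalar illustration that assumes a constant $b$ and hypothetical per-token $\lambda_t$ values; in the real model $B$ is also token-dependent, so the two taps would carry $B_{t-1}$ and $B_t$ as well.</p>

```python
import math

def three_term(xs, lams, a, delta, b):
    """Direct three-term recurrence; lams[t] is the per-token blend weight."""
    decay = math.exp(delta * a)
    h, x_prev, out = 0.0, 0.0, []
    for x, lam in zip(xs, lams):
        h = decay * h + (1.0 - lam) * delta * decay * b * x_prev + lam * delta * b * x
        out.append(h)
        x_prev = x
    return out

def conv_then_scan(xs, lams, a, delta, b):
    """Same computation in two stages: width-2 filter on inputs, then a plain decay scan."""
    decay = math.exp(delta * a)
    padded = [0.0] + list(xs)  # causal padding: x_{-1} = 0
    # Filter taps for token t: (1 - lam_t) * decay on x_{t-1}, lam_t on x_t.
    mixed = [
        (1.0 - lam) * decay * padded[t] + lam * padded[t + 1]
        for t, lam in enumerate(lams)
    ]
    h, out = 0.0, []
    for u in mixed:
        h = decay * h + delta * b * u
        out.append(h)
    return out

xs = [1.0, -2.0, 0.5, 3.0]
lams = [0.5, 0.9, 0.2, 0.7]  # hypothetical data-dependent weights in [0, 1]
direct = three_term(xs, lams, a=-0.5, delta=0.1, b=1.0)
split = conv_then_scan(xs, lams, a=-0.5, delta=0.1, b=1.0)
print(all(math.isclose(p, q, abs_tol=1e-12) for p, q in zip(direct, split)))
```

<p>Both routes produce the same states, which is the sense in which the filter is &ldquo;implicit&rdquo;: the recurrence already performs the local mixing that a separate layer would otherwise provide.</p>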
<p>For years, recurrent architectures (H3, RWKV-4, Mamba-1, Mamba-2) relied on an explicit short-range mixing step ahead of the recurrence, in Mamba-1 and Mamba-2 a Conv1d of width 4, to handle immediate local token interactions. That step was considered essential for capturing &ldquo;induction head&rdquo; copying behaviors and local patterns. Mamba-3 found that the implicit width-2 convolution from the trapezoidal discretization, combined with learnable bias terms on $B$ and $C$ (constant vectors added after normalization), is expressive enough to replace the external Conv1d entirely. Mamba-3 is the first Mamba variant to drop the Conv1d without performance loss.</p>

<div class="ssm-conv1d-removal" id="c1r-774ca06e5bc03651567b333d58d39a0f">
  <style>
    .ssm-conv1d-removal {
      --cr-bg: #ffffff;
      --cr-card-bg: #ffffff;
      --cr-border: #e2e8f0;
      --cr-text: #1a202c;
      --cr-text-muted: #718096;
      --cr-shadow: 0 12px 30px rgba(0,0,0,0.06);
      --cr-gray: #64748b;
      --cr-gray-light: #f8fafc;
      --cr-gray-border: #e2e8f0;
      --cr-gray-box: #f1f5f9;
      --cr-green: #10b981;
      --cr-green-light: #ecfdf5;
      --cr-green-border: #a7f3d0;
      --cr-green-box: #d1fae5;
      --cr-red: #ef4444;
      --cr-red-light: #fef2f2;
      --cr-red-border: #fecaca;
      --cr-absorbed: #6366f1;
      --cr-absorbed-light: #eef2ff;
      --cr-absorbed-border: #c7d2fe;
      --cr-arrow: #94a3b8;
      --cr-mono: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      --cr-sans: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;

      font-family: var(--cr-sans);
      background: var(--cr-card-bg);
      color: var(--cr-text);
      border-radius: 16px;
      box-shadow: var(--cr-shadow);
      padding: 24px;
      margin: 32px auto;
      max-width: 900px;
      line-height: 1.6;
    }

    [data-theme="dark"] .ssm-conv1d-removal {
      --cr-bg: #1a1b2e;
      --cr-card-bg: #1e1f33;
      --cr-border: #2d2f45;
      --cr-text: #e2e8f0;
      --cr-text-muted: #94a3b8;
      --cr-shadow: 0 12px 30px rgba(0,0,0,0.3);
      --cr-gray: #94a3b8;
      --cr-gray-light: rgba(148,163,184,0.06);
      --cr-gray-border: rgba(148,163,184,0.2);
      --cr-gray-box: rgba(148,163,184,0.1);
      --cr-green: #34d399;
      --cr-green-light: rgba(52,211,153,0.08);
      --cr-green-border: rgba(52,211,153,0.25);
      --cr-green-box: rgba(52,211,153,0.15);
      --cr-red: #f87171;
      --cr-red-light: rgba(248,113,113,0.08);
      --cr-red-border: rgba(248,113,113,0.25);
      --cr-absorbed: #818cf8;
      --cr-absorbed-light: rgba(129,140,248,0.08);
      --cr-absorbed-border: rgba(129,140,248,0.25);
      --cr-arrow: #475569;
    }

    .ssm-conv1d-removal * { box-sizing: border-box; }

    .cr-header {
      text-align: center;
      margin-bottom: 28px;
    }

    .cr-header h3 {
      font-size: 24px;
      font-weight: 700;
      color: var(--cr-text);
      margin: 0 0 6px 0;
    }

    .cr-header p {
      font-size: 14px;
      color: var(--cr-text-muted);
      margin: 0;
    }

     
    .cr-pipelines {
      display: flex;
      flex-direction: column;
      gap: 20px;
      position: relative;
      margin-bottom: 16px;
    }

    .cr-pipeline {
      border-radius: 12px;
      padding: 16px 20px;
    }

    .cr-pipeline--old {
      background: var(--cr-gray-light);
      border: 1px solid var(--cr-gray-border);
    }

    .cr-pipeline--new {
      background: var(--cr-green-light);
      border: 1.5px solid var(--cr-green-border);
    }

    .cr-pipeline-title {
      font-family: var(--cr-mono);
      font-size: 13px;
      font-weight: 600;
      letter-spacing: 0.06em;
      text-transform: uppercase;
      margin: 0 0 14px 0;
    }

    .cr-pipeline--old .cr-pipeline-title { color: var(--cr-gray); }
    .cr-pipeline--new .cr-pipeline-title { color: var(--cr-green); }

     
    .cr-flow {
      display: flex;
      align-items: center;
      gap: 0;
      flex-wrap: wrap;
      justify-content: center;
    }

    .cr-box {
      font-family: var(--cr-mono);
      font-size: 12px;
      font-weight: 600;
      padding: 10px 14px;
      border-radius: 8px;
      text-align: center;
      white-space: nowrap;
      position: relative;
    }

    .cr-box--gray {
      background: var(--cr-gray-box);
      color: var(--cr-gray);
      border: 1px solid var(--cr-gray-border);
    }

    .cr-box--conv1d {
      background: var(--cr-gray-box);
      color: var(--cr-gray);
      border: 2px solid var(--cr-gray);
      font-weight: 700;
    }

    .cr-box--green {
      background: var(--cr-green-box);
      color: var(--cr-green);
      border: 1.5px solid var(--cr-green-border);
    }

    .cr-box--ssm-big {
      background: var(--cr-green-box);
      color: var(--cr-green);
      border: 2px solid var(--cr-green);
      padding: 14px 18px;
      font-weight: 700;
    }

    .cr-box-sub {
      font-size: 10px;
      font-weight: 400;
      opacity: 0.8;
      margin-top: 2px;
    }

     
    .cr-box--ghost {
      background: var(--cr-red-light);
      color: var(--cr-red);
      border: 1.5px dashed var(--cr-red-border);
      opacity: 0.7;
      position: relative;
    }

    .cr-box--ghost::after {
      content: '\2715';
      position: absolute;
      top: 50%;
      left: 50%;
      transform: translate(-50%, -50%);
      font-size: 22px;
      font-weight: 700;
      color: var(--cr-red);
      opacity: 0.7;
    }

    .cr-box--ghost-text {
      opacity: 0.4;
    }

     
    .cr-arrow {
      display: flex;
      align-items: center;
      padding: 0 4px;
    }

    .cr-arrow svg {
      width: 28px;
      height: 14px;
    }

     
    .cr-annotation-bridge {
      display: flex;
      align-items: center;
      justify-content: center;
      gap: 8px;
      padding: 8px 16px;
      margin: -6px auto -6px;
      background: var(--cr-absorbed-light);
      border: 1px solid var(--cr-absorbed-border);
      border-radius: 8px;
      font-family: var(--cr-mono);
      font-size: 11px;
      font-weight: 600;
      color: var(--cr-absorbed);
      z-index: 2;
      position: relative;
      max-width: fit-content;
    }

    .cr-annotation-bridge svg {
      flex-shrink: 0;
    }

     
    .cr-bottom {
      text-align: center;
      font-size: 13px;
      font-weight: 500;
      color: var(--cr-text-muted);
      margin-top: 16px;
      padding: 10px 14px;
      border-radius: 8px;
      background: var(--cr-gray-light);
      border: 1px solid var(--cr-gray-border);
    }
  </style>

  <div class="cr-header">
    <h3>How Mamba-3 Eliminates the Conv1d</h3>
    <p>Trapezoidal discretization absorbs the causal convolution</p>
  </div>

  <div class="cr-pipelines">
    
    <div class="cr-pipeline cr-pipeline--old">
      <h4 class="cr-pipeline-title">Mamba-1 / Mamba-2</h4>
      <div class="cr-flow">
        <div class="cr-box cr-box--gray">Input</div>
        <div class="cr-arrow"><svg viewBox="0 0 28 14"><line x1="2" y1="7" x2="22" y2="7" stroke="var(--cr-arrow)" stroke-width="1.5"/><polygon points="22,3 28,7 22,11" fill="var(--cr-arrow)"/></svg></div>
        <div class="cr-box cr-box--conv1d">Conv1d<div class="cr-box-sub">width 4</div></div>
        <div class="cr-arrow"><svg viewBox="0 0 28 14"><line x1="2" y1="7" x2="22" y2="7" stroke="var(--cr-arrow)" stroke-width="1.5"/><polygon points="22,3 28,7 22,11" fill="var(--cr-arrow)"/></svg></div>
        <div class="cr-box cr-box--gray">SiLU</div>
        <div class="cr-arrow"><svg viewBox="0 0 28 14"><line x1="2" y1="7" x2="22" y2="7" stroke="var(--cr-arrow)" stroke-width="1.5"/><polygon points="22,3 28,7 22,11" fill="var(--cr-arrow)"/></svg></div>
        <div class="cr-box cr-box--gray">Selective SSM</div>
        <div class="cr-arrow"><svg viewBox="0 0 28 14"><line x1="2" y1="7" x2="22" y2="7" stroke="var(--cr-arrow)" stroke-width="1.5"/><polygon points="22,3 28,7 22,11" fill="var(--cr-arrow)"/></svg></div>
        <div class="cr-box cr-box--gray">Output</div>
      </div>
    </div>

    
    <div class="cr-annotation-bridge">
      <svg width="16" height="16" viewBox="0 0 16 16" fill="none"><path d="M8 2v12M8 14l-3-3M8 14l3-3" stroke="currentColor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/></svg>
      Absorbed into discretization
    </div>

    
    <div class="cr-pipeline cr-pipeline--new">
      <h4 class="cr-pipeline-title">Mamba-3</h4>
      <div class="cr-flow">
        <div class="cr-box cr-box--green">Input</div>
        <div class="cr-arrow"><svg viewBox="0 0 28 14"><line x1="2" y1="7" x2="22" y2="7" stroke="var(--cr-arrow)" stroke-width="1.5"/><polygon points="22,3 28,7 22,11" fill="var(--cr-arrow)"/></svg></div>
        <div class="cr-box cr-box--ghost"><span class="cr-box--ghost-text">Conv1d</span></div>
        <div class="cr-arrow"><svg viewBox="0 0 28 14"><line x1="2" y1="7" x2="22" y2="7" stroke="var(--cr-arrow)" stroke-width="1.5"/><polygon points="22,3 28,7 22,11" fill="var(--cr-arrow)"/></svg></div>
        <div class="cr-box cr-box--ssm-big">
          Selective SSM
          <div class="cr-box-sub">Trapezoidal Discretization</div>
          <div class="cr-box-sub">implicit conv built-in</div>
        </div>
        <div class="cr-arrow"><svg viewBox="0 0 28 14"><line x1="2" y1="7" x2="22" y2="7" stroke="var(--cr-arrow)" stroke-width="1.5"/><polygon points="22,3 28,7 22,11" fill="var(--cr-arrow)"/></svg></div>
        <div class="cr-box cr-box--green">Output</div>
      </div>
    </div>
  </div>

  <div class="cr-bottom">
    Fewer ops per token = lower decode latency on your H100
  </div>
</div>

<p><strong>Why it matters for your H100.</strong> Fewer sequential operations per token at inference. No Conv1d kernel launch, no Conv1d memory traffic, no Conv1d compute. The architecture is simpler and the discretization is now theoretically justified ($O(\Delta^3)$ error) instead of a heuristic patched by an external convolution.</p>
<h3 id="innovation-2-complex-valued-ssms-via-rope">Innovation 2: Complex-Valued SSMs via RoPE</h3>
<p><strong>The problem.</strong> Real-valued SSMs with non-negative eigenvalues can only decay monotonically. The state gets smaller over time, or stays the same, but it cannot oscillate. Mathematically: if $\bar{a} \in [0, 1]$, then $\bar{a}^k$ is a monotonically non-increasing sequence in $k$. The state can only move in one direction (toward zero).</p>
<p>This means real-valued SSMs cannot solve simple state-tracking tasks that require flipping between states. Consider parity: given a stream of bits, track whether the running count of 1s is even or odd. Every time a 1 arrives, the parity flips. This requires the state to toggle between two values indefinitely. A monotonically decaying state cannot do this. On the bit sequence parity task, Mamba-2 scored no better than random guessing.</p>
<p><strong>The intuition.</strong> Real eigenvalues restrict the state to movement along a line: it can only grow or shrink. Complex eigenvalues enable rotation: the state can cycle through values, oscillate, and flip. For even/odd tracking, you need the state to flip sign every time a 1 appears. A 180-degree rotation achieves exactly this. Real, non-negative arithmetic cannot.</p>
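<p>The rotation intuition can be checked in a few lines. The sketch below is an illustration only, not Mamba-3 code: the function names and the choice of a 2-D state are assumptions made for this example. It rotates the state by 180 degrees whenever a 1 arrives and reads parity off the sign, next to a non-negative scalar recurrence whose state never changes sign.</p>

```python
import math

def parity_by_rotation(bits):
    """Track parity with a 2-D state that rotates 180 degrees on every 1 bit."""
    state = (1.0, 0.0)  # pointing along +x means "even so far"
    for bit in bits:
        theta = math.pi * bit  # data-dependent angle: pi for a 1, 0 for a 0
        cs, sn = math.cos(theta), math.sin(theta)
        x, y = state
        state = (cs * x - sn * y, sn * x + cs * y)  # apply the 2x2 rotation
    # The sign of the first coordinate encodes the parity.
    return 0 if state[0] > 0 else 1

def decaying_state_trace(bits, a=0.9):
    """Non-negative scalar recurrence h = a * h + bit; h can never flip sign."""
    h, trace = 0.0, []
    for bit in bits:
        h = a * h + bit
        trace.append(h)
    return trace

bits = [1, 0, 1, 1, 0, 1]
print(parity_by_rotation(bits), sum(bits) % 2)
print(min(decaying_state_trace(bits)) >= 0.0)
```

<p>The rotating state toggles indefinitely without losing magnitude, while the decaying state only ever accumulates or shrinks on one side of zero.</p>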
<p><strong>The clever trick.</strong> Implementing complex arithmetic on GPUs is painful. Complex numbers double memory requirements, break existing CUDA kernel optimizations, and introduce alignment issues. Mamba-3 avoids all of this through a mathematical equivalence.</p>
<p>The key theoretical result (Proposition 3 in the paper): a discretized complex-valued diagonal SSM is mathematically equivalent to a real-valued SSM with data-dependent Rotary Positional Embeddings (RoPE) applied to $B$ and $C$. The decomposition works as follows:</p>
<ul>
<li>The <strong>real part</strong> of the complex eigenvalue controls decay. This is handled by the existing SSD machinery, exactly as in Mamba-2. No changes needed.</li>
<li>The <strong>imaginary part</strong> controls rotation. This is factored out and implemented as rotary embeddings applied to the $B$ and $C$ projection vectors.</li>
</ul>
<p>The rotation angles are produced dynamically via projections from the current input token $x_t$, rather than using static positional indices as in standard Transformer RoPE. This is why it is called &ldquo;data-dependent&rdquo; RoPE. The rotation applied to $B$ and $C$ changes based on what token is being processed.</p>
<p>No complex number ever appears in the GPU kernels. The real-valued SSD computation runs at the same speed as before. The rotational dynamics are absorbed into $B$ and $C$ via the same RoPE infrastructure that Transformers already use for positional encoding. Existing Transformer tooling (rotary embedding kernels, fused attention implementations) can be reused directly.</p>
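<p>A small numerical check makes the equivalence concrete. This is a sketch under stated assumptions, not the paper&rsquo;s formulation: it uses a single scalar complex channel, reads the output as the real part of $\bar{c}_t h_t$, and picks its own sign convention for the rotation. All function names are hypothetical. The first function runs the complex recurrence directly; the second keeps a real 2-D state with real decay only, and rotates $B$ and $C$ by the cumulative data-dependent angle.</p>

```python
import cmath
import math
import random

def complex_ssm(a, theta, b, c, x):
    """Reference: h_t = a_t * exp(i * theta_t) * h_{t-1} + b_t * x_t, y_t = Re(conj(c_t) * h_t)."""
    h, ys = 0j, []
    for a_t, th_t, b_t, c_t, x_t in zip(a, theta, b, c, x):
        h = a_t * cmath.exp(1j * th_t) * h + b_t * x_t
        ys.append((c_t.conjugate() * h).real)
    return ys

def rotate(vec, angle):
    """Rotate a real 2-vector by `angle` radians."""
    cs, sn = math.cos(angle), math.sin(angle)
    return (cs * vec[0] - sn * vec[1], sn * vec[0] + cs * vec[1])

def real_ssm_with_rope(a, theta, b, c, x):
    """Same outputs with real arithmetic only: real decay plus rotated B and C."""
    g = (0.0, 0.0)  # real 2-D state
    phase = 0.0     # cumulative rotation angle up to the current step
    ys = []
    for a_t, th_t, b_t, c_t, x_t in zip(a, theta, b, c, x):
        phase += th_t
        b_rot = rotate((b_t.real, b_t.imag), -phase)  # rotary embedding on B
        c_rot = rotate((c_t.real, c_t.imag), -phase)  # rotary embedding on C
        g = (a_t * g[0] + b_rot[0] * x_t, a_t * g[1] + b_rot[1] * x_t)
        ys.append(c_rot[0] * g[0] + c_rot[1] * g[1])
    return ys

random.seed(0)
T = 8
a = [random.uniform(0.5, 1.0) for _ in range(T)]               # real decay
theta = [random.uniform(-math.pi, math.pi) for _ in range(T)]  # per-token angles
b = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(T)]
c = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(T)]
x = [random.gauss(0, 1) for _ in range(T)]
gap = max(abs(r - s) for r, s in zip(complex_ssm(a, theta, b, c, x),
                                     real_ssm_with_rope(a, theta, b, c, x)))
print(gap)  # difference between the two formulations (floating-point noise only)
```

<p>The two states differ only by the cumulative phase, which is why the rotation can be moved out of the recurrence and onto $B$ and $C$.</p>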
<p><strong>The result.</strong> On synthetic state-tracking benchmarks:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Mamba-2 (Real)</th>
          <th style="text-align: left">Mamba-3 w/o RoPE</th>
          <th style="text-align: left">Mamba-3 w/ Std. RoPE</th>
          <th style="text-align: left">Mamba-3 (Complex / Data-Dep RoPE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Bit Sequence Parity</td>
          <td style="text-align: left">Random Guess</td>
          <td style="text-align: left">2.27%</td>
          <td style="text-align: left">1.56%</td>
          <td style="text-align: left"><strong>100.00%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left">Modular Arith. (No Brackets)</td>
          <td style="text-align: left">0.90%</td>
          <td style="text-align: left">Random Guess</td>
          <td style="text-align: left">20.70%</td>
          <td style="text-align: left"><strong>98.51%</strong></td>
      </tr>
      <tr>
          <td style="text-align: left">Modular Arith. (Brackets)</td>
          <td style="text-align: left">Fail</td>
          <td style="text-align: left">Fail</td>
          <td style="text-align: left">2.62%</td>
          <td style="text-align: left"><strong>87.75%</strong></td>
      </tr>
  </tbody>
</table>
<p>Mamba-3 solves parity perfectly, reaches 98.51% on modular arithmetic without brackets, and 87.75% with brackets. Standard (non-data-dependent) RoPE barely helps, topping out at 20.70%: static positional rotation angles cannot implement content-dependent state flipping. The data-dependent version can, because the rotation is a function of the input.</p>
<p>These tasks are out of reach for real-valued SSMs with non-negative eigenvalues, regardless of scale or training budget. The real-complex boundary is a hard expressivity ceiling, not a soft scaling issue.</p>
<h3 id="innovation-3-mimo-multi-input-multi-output">Innovation 3: MIMO (Multi-Input Multi-Output)</h3>
<p><strong>The problem.</strong> As discussed above, standard SSMs waste more than 99% of GPU compute during decoding because each state update is a trivial rank-1 operation: the GPU loads the entire state from memory, performs a single multiply-add, and writes it back. The computation is too cheap relative to the memory transfer.</p>
<p><strong>The fix.</strong> Instead of processing one input and producing one output per SSM (SISO), process $R$ inputs and $R$ outputs simultaneously (MIMO). The scalar input $x_t$ is linearly projected into a matrix $X_t$ with rank $R$. The projection vectors $B_t$ and $C_t$ are correspondingly expanded to rank-$R$ structures. The state update becomes a matrix multiplication instead of an outer product:</p>
$$H_t = \bar{A} \cdot H_{t-1} + B_t \cdot X_t^T$$<p>With $R = 4$, the model performs 4x the floating-point operations for the same amount of memory traffic. The arithmetic intensity jumps from $\sim$2.5 to $\sim$10 ops/byte. Still not enough to fully saturate the H100, but a 4x improvement in GPU utilization during the memory-bound decode phase.</p>
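<p>The update above can be sketched in plain Python. The shapes are assumptions made for this example ($H$ is $N \times P$, $B_t$ is $N \times R$, $X_t$ is $P \times R$, and $\bar{A}$ is a scalar), and the helper names are hypothetical. The point is that the state keeps its size while the multiply-adds grow with $R$.</p>

```python
def mimo_update(H, a_bar, B, X):
    """One MIMO step: H becomes a_bar * H + B @ X^T, with H (N x P), B (N x R), X (P x R)."""
    N, P, R = len(H), len(H[0]), len(B[0])
    return [[a_bar * H[n][p] + sum(B[n][r] * X[p][r] for r in range(R))
             for p in range(P)]
            for n in range(N)]

def multiply_adds(N, P, R):
    """Multiply-adds spent forming B @ X^T; the state itself stays N x P."""
    return N * P * R

# With R = 1 the update reduces to the usual outer product (SISO).
H = [[1.0, 2.0], [3.0, 4.0]]
print(mimo_update(H, 0.5, [[0.5], [2.0]], [[1.0], [-1.0]]))

# Same state size, four times the arithmetic.
print(multiply_adds(64, 64, 4) // multiply_adds(64, 64, 1))
```

<p>Because the bytes moved per step are set by the size of $H$, raising $R$ increases the work done per byte loaded rather than the amount loaded.</p>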
<p>Crucially, only the SSM-specific projections (those producing $B_t$ and $C_t$) grow with $R$. The state $H_t$ keeps its size, which is why memory traffic per step is unchanged, and the main input projections, the output projections, and the residual gate all remain at their original sizes. This contains the parameter increase to the SSM core.</p>
<p><strong>Why it does not hurt latency.</strong> The extra compute fills idle GPU cycles. During decoding, the bottleneck is the time it takes to load the state from HBM to SRAM. While that data transfer is in flight, the Tensor Cores have nothing to do. MIMO gives them work. The wall-clock time per decode step is dominated by memory transfer time, not compute time, so adding compute within the transfer window is effectively free. MIMO with $R = 4$ matches Mamba-2&rsquo;s decode speed while delivering substantially better accuracy.</p>
<p><strong>The result.</strong> At 1.5B scale with Chinchilla-optimal training, the base Mamba-3 (SISO) outpaces Gated DeltaNet (the previous state-of-the-art sub-quadratic model) by 0.6 percentage points on average downstream accuracy. Adding MIMO with $R = 4$ adds another 1.2 points, for a total gain of 1.8 points over Gated DeltaNet, 1.9 over Mamba-2, and 2.2 over equivalently sized pure Transformers. The 1.5B MIMO variant achieves 57.6% average accuracy across benchmarks.</p>
<p>A Mamba-3 MIMO model with state dimension $N = 64$ matches the perplexity and downstream accuracy of a Mamba-2 model with $N = 128$. Halving the state size while maintaining quality doubles inference throughput within the same hardware footprint.</p>
<h3 id="architectural-changes">Architectural Changes</h3>
<p>Two more structural changes round out Mamba-3:</p>
<p><strong>Normalization.</strong> Mamba-3 replaces post-gate RMSNorm (Mamba-2) with QKNorm, also called BCNorm: RMS normalization applied directly to the $B$ and $C$ projections before mixing. This stabilizes activation variance and suppresses activation spikes during large-scale pretraining, which is especially important with the added mathematical complexity of trapezoidal recurrence and MIMO updates.</p>
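<p>As a rough illustration of what normalizing $B$ and $C$ means, here is a minimal sketch. It assumes plain RMS normalization over the last dimension with the learnable gain omitted; the function names and the epsilon value are choices made for this example, not taken from the Mamba-3 code.</p>

```python
import math

def rms_norm(vec, eps=1e-6):
    """Scale `vec` to roughly unit root-mean-square (learnable gain omitted)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def bc_norm(b_proj, c_proj):
    """Normalize the B and C projections independently before they touch the state."""
    return rms_norm(b_proj), rms_norm(c_proj)

# Two projections with very different scales end up on the same footing.
b_hat, c_hat = bc_norm([100.0, -250.0, 40.0, 10.0], [0.01, 0.02, -0.03, 0.04])
print(b_hat)
print(c_hat)
```

<p>Whatever scale the raw projections arrive at, the values that enter the state update and the readout stay in a bounded range.</p>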
<p><strong>Block structure.</strong> Mamba-1 and Mamba-2 fused the sequence mixer (SSM) and channel mixer (MLP) into a single homogeneous block. Mamba-3 reverses this decision. It adopts an interleaved architecture that matches the Llama family: alternating Mamba-3 SSM blocks with standard SwiGLU MLP blocks. Each Mamba-3 block handles sequence mixing; each SwiGLU MLP handles channel mixing. This Llama-compatible topology makes it straightforward to create hybrid models by swapping some SSM blocks for attention blocks.</p>
<h3 id="evolution-comparison">Evolution Comparison</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Feature</th>
          <th style="text-align: left">Mamba-1</th>
          <th style="text-align: left">Mamba-2</th>
          <th style="text-align: left">Mamba-3</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Venue</strong></td>
          <td style="text-align: left">COLM 2024</td>
          <td style="text-align: left">ICML 2024</td>
          <td style="text-align: left">ICLR 2026</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>$A$ matrix</strong></td>
          <td style="text-align: left">Diagonal (real)</td>
          <td style="text-align: left">Scalar $\times$ identity</td>
          <td style="text-align: left">Complex-valued (data-dep RoPE)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>State size</strong></td>
          <td style="text-align: left">$N = 16$</td>
          <td style="text-align: left">$N = 64\text{-}256$</td>
          <td style="text-align: left">Matches Mamba-2 at half $N$</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Short conv</strong></td>
          <td style="text-align: left">Required (width 4)</td>
          <td style="text-align: left">Required (width 4)</td>
          <td style="text-align: left">Removed</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>MIMO</strong></td>
          <td style="text-align: left">No</td>
          <td style="text-align: left">No</td>
          <td style="text-align: left">Yes (rank-$R$)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Discretization</strong></td>
          <td style="text-align: left">Exp-Euler, $O(\Delta^2)$</td>
          <td style="text-align: left">Exp-Euler, $O(\Delta^2)$</td>
          <td style="text-align: left">Exp-Trapezoidal, $O(\Delta^3)$</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Design priority</strong></td>
          <td style="text-align: left">Quality + selectivity</td>
          <td style="text-align: left">Training speed (Tensor Cores)</td>
          <td style="text-align: left">Inference efficiency</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>State tracking</strong></td>
          <td style="text-align: left">Cannot solve parity</td>
          <td style="text-align: left">Cannot solve parity</td>
          <td style="text-align: left">Solves parity + modular arith.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Block structure</strong></td>
          <td style="text-align: left">Fused SSM+MLP</td>
          <td style="text-align: left">Fused SSM+MLP</td>
          <td style="text-align: left">Interleaved SSM + MLP</td>
      </tr>
  </tbody>
</table>
<h2 id="part-9-the-bigger-picture">Part 9: The Bigger Picture</h2>
<h3 id="hybrid-architectures-are-the-production-standard">Hybrid Architectures Are the Production Standard</h3>
<p>The field has converged on an empirical finding: hybrid architectures combining SSM layers with a small fraction of attention layers outperform both pure approaches. Albert Gu has articulated the fundamental reason clearly. Transformers are like databases: they cache every token for future reference (perfect recall, but linear memory growth). SSMs are like brains: they compress all history into a fixed-size state (infinite context, but lossy).</p>
<p>Pure SSMs struggle with two specific capabilities. <strong>Exact retrieval</strong>: finding a specific fact buried in a long context degrades as context grows, because the fixed-size state cannot perfectly memorize arbitrary content. <strong>In-context learning</strong>: few-shot pattern matching from prompt examples requires comparing the current token against specific stored tokens, which is fundamentally an attention operation.</p>
<p>The solution now common across production hybrids: use SSM layers for the vast majority of the network and sprinkle in a few attention layers for precise retrieval. The exact ratio varies (5:1 in Mamba-3&rsquo;s recommended config, 9:1 in Granite), but the pattern is consistent.</p>
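<p>To make the ratios concrete, here is a small sketch that lays out a hybrid stack. The even spacing of attention layers and the function names are assumptions for illustration; real models choose their own placement.</p>

```python
def hybrid_layout(num_layers, ssm_per_attention):
    """Place one attention layer after every `ssm_per_attention` SSM layers."""
    period = ssm_per_attention + 1
    return ["attention" if (i + 1) % period == 0 else "ssm"
            for i in range(num_layers)]

def attention_fraction(layout):
    """Share of layers in the stack that use attention."""
    return layout.count("attention") / len(layout)

five_to_one = hybrid_layout(12, 5)   # 5 SSM layers per attention layer
nine_to_one = hybrid_layout(40, 9)   # 9 SSM layers per attention layer
print(five_to_one)
print(attention_fraction(five_to_one), attention_fraction(nine_to_one))
```

<p>A 5:1 ratio puts attention in one layer out of six; a 9:1 ratio puts it in one layer out of ten.</p>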
<h3 id="what-this-means-for-your-inference-stack">What This Means for Your Inference Stack</h3>
<p>For inference infrastructure teams, the implications are concrete. With only about 10-17% of layers using attention (9:1 and 5:1 SSM-to-attention ratios, respectively), you manage KV cache for those few layers, not the entire network. SSM layers need no KV cache management at all: no PagedAttention, no eviction policies, no memory fragmentation. And throughput advantages grow with context length. Pure Transformers are faster at short sequences ($<$2K tokens), but SSM-based models cross over quickly and the gap widens: at 57K tokens, Mamba-2 outperforms Transformers by about 4x in throughput. SSM decode cost is constant per token; Transformer decode cost grows linearly with context length.</p>
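<p>A back-of-the-envelope calculation shows where the memory goes. Every number below (layer counts, head counts, dimensions, 2-byte values) is a hypothetical configuration chosen for the example, not a measurement of any real model.</p>

```python
def kv_cache_bytes(attn_layers, context_len, kv_heads, head_dim, bytes_per_value=2):
    """KV cache for one sequence: keys and values for every cached token."""
    return 2 * attn_layers * context_len * kv_heads * head_dim * bytes_per_value

def ssm_state_bytes(ssm_layers, heads, head_dim, state_dim, bytes_per_value=2):
    """Recurrent state for one sequence: fixed size, no context-length term."""
    return ssm_layers * heads * head_dim * state_dim * bytes_per_value

context = 100_000
pure = kv_cache_bytes(attn_layers=48, context_len=context, kv_heads=8, head_dim=128)
hybrid = (kv_cache_bytes(attn_layers=8, context_len=context, kv_heads=8, head_dim=128)
          + ssm_state_bytes(ssm_layers=40, heads=32, head_dim=64, state_dim=128))

print(pure / 2**30, hybrid / 2**30)  # GiB per sequence
```

<p>The KV term scales with both the number of attention layers and the context length, while the SSM term is a constant, so the saving grows as contexts get longer.</p>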
<h3 id="the-fundamental-trade-off-persists">The Fundamental Trade-off Persists</h3>
<p>SSMs compress. Attention caches. Mamba-3 makes the compressed memory more expressive through complex dynamics, higher-order discretization, and MIMO. But it cannot eliminate the compression. If your workload requires perfect verbatim retrieval of a specific sentence from a 100K-token document, you need attention layers for that.</p>
<p>The Transformer monopoly has ended. But Transformers are not dead. They are becoming a specialized, strategically placed component within a larger hybrid architecture, used precisely where their lossless memory is needed and nowhere else.</p>
<hr>
<h2 id="references">References</h2>
<ol>
<li>
<p><strong>Gu, A., Dao, T., Ermon, S., Rudra, A., &amp; Ré, C. (2020).</strong> <a href="https://arxiv.org/abs/2008.07669">HiPPO: Recurrent Memory with Optimal Polynomial Projections</a>. <em>NeurIPS 2020</em>.</p>
</li>
<li>
<p><strong>Gu, A., Goel, K., &amp; Ré, C. (2022).</strong> <a href="https://arxiv.org/abs/2111.00396">Efficiently Modeling Long Sequences with Structured State Spaces</a>. <em>ICLR 2022</em>. (S4)</p>
</li>
<li>
<p><strong>Gu, A., Gupta, A., Goel, K., &amp; Ré, C. (2022).</strong> <a href="https://arxiv.org/abs/2206.11893">On the Parameterization and Initialization of Diagonal State Space Models</a>. <em>NeurIPS 2022</em>. (S4D)</p>
</li>
<li>
<p><strong>Gu, A. &amp; Dao, T. (2023).</strong> <a href="https://arxiv.org/abs/2312.00752">Mamba: Linear-Time Sequence Modeling with Selective State Spaces</a>. <em>COLM 2024</em>.</p>
</li>
<li>
<p><strong>Dao, T. &amp; Gu, A. (2024).</strong> <a href="https://arxiv.org/abs/2405.21060">Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality</a>. <em>ICML 2024</em>. (Mamba-2 / SSD)</p>
</li>
<li>
<p><strong>Lahoti, A., Li, A., Chen, Y., Wang, Z., Bick, T., Kolter, J. Z., Dao, T., &amp; Gu, A. (2026).</strong> <a href="https://arxiv.org/abs/2603.15569">Mamba-3: Improved Sequence Modeling using State Space Principles</a>. <em>ICLR 2026</em>.</p>
</li>
<li>
<p><strong>Together AI.</strong> <a href="https://www.together.ai/blog/mamba-3">Mamba-3 Blog Post</a>. Technical overview and benchmark results.</p>
</li>
<li>
<p><strong>Goomba Lab.</strong> Blog series on Structured State Space Duality and Mamba-3 mathematical foundations.</p>
</li>
<li>
<p><strong>Tri Dao.</strong> <a href="https://tridao.me/blog/2026/mamba3-part2/">Mamba-3 Part 2: Methodological Deep Dive</a>. Detailed derivation of the exponential-trapezoidal discretization and RoPE equivalence.</p>
</li>
<li>
<p><strong>Princeton Language and Intelligence.</strong> <a href="https://pli.princeton.edu/blog/2024/mamba-2-algorithms-and-systems">Mamba-2: Algorithms and Systems</a>. Technical walkthrough of the SSD algorithm.</p>
</li>
<li>
<p><strong>NVIDIA.</strong> <a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16">Nemotron-3-Super</a>: hybrid Mamba-2 + MoE + Attention architecture for production deployment.</p>
</li>
<li>
<p><strong>IBM.</strong> Granite 4.0: 9:1 Mamba-to-Transformer ratio with &gt;70% memory reduction vs. conventional LLMs.</p>
</li>
</ol>
]]></content:encoded></item><item><title>Speculative Speculative Decoding: Eliminating the Last Sequential Bottleneck in LLM Inference</title><link>https://www.mdjawad.com/posts/speculative-speculative-decoding/</link><pubDate>Sat, 07 Mar 2026 10:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/speculative-speculative-decoding/</guid><description>How speculating about speculation itself achieves up to 5x faster LLM inference by eliminating the draft model&amp;rsquo;s idle time during verification, and the three engineering challenges that make it work.</description><content:encoded><![CDATA[<h2 id="what-this-post-covers">What This Post Covers</h2>
<p>In <a href="/posts/speculative-decoding">our post on speculative decoding</a>, we covered how a small draft model proposes tokens that a large target model verifies in parallel, achieving 2-3x speedups without changing the output distribution. That technique exploits idle GPU compute during memory-bound inference.</p>
<p>This post examines a follow-up question: can we make speculative decoding itself faster? The answer is yes. A recent paper by Kumar, Dao, and May (ICLR 2026) identifies a sequential bottleneck <em>within</em> standard speculative decoding and eliminates it through a technique called Speculative Speculative Decoding (SSD). Their algorithm, Saguaro, achieves up to 5x speedup over autoregressive decoding and roughly 2x over optimized speculative decoding.</p>
<p>We will walk through the bottleneck SSD targets, the core idea of speculating about verification outcomes, and the three engineering challenges that Saguaro solves: cache construction, cache-aware sampling, and batch-size scaling. Each section explains why the naive approach fails before presenting the solution.</p>
<div class="ssd-notation-ref" id="ssd-notation-0bb873ec347541318a79e0f80e8ddeb3">
  <style>
    .ssd-notation-ref {
      --sn-bg: #0d1117;
      --sn-surface: #161b22;
      --sn-border: #30363d;
      --sn-text: #e6edf3;
      --sn-text-muted: #8b949e;
      --sn-accent: #a371f7;
      --sn-accent-dim: rgba(163, 113, 247, 0.15);
      --sn-group: #58a6ff;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
    }

    [data-theme="light"] .ssd-notation-ref,
    :root:not([data-theme="dark"]) .ssd-notation-ref {
      --sn-bg: #f8fafc;
      --sn-surface: #ffffff;
      --sn-border: #e2e8f0;
      --sn-text: #1e293b;
      --sn-text-muted: #64748b;
      --sn-accent: #8b5cf6;
      --sn-accent-dim: rgba(139, 92, 246, 0.1);
      --sn-group: #3b82f6;
    }

    .sn-toggle {
      position: fixed;
      bottom: 1.5rem;
      right: 1.5rem;
      z-index: 9998;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      letter-spacing: 0.05em;
      text-transform: uppercase;
      padding: 0.6rem 1rem;
      border: 1px solid var(--sn-accent);
      border-radius: 8px;
      background: var(--sn-surface);
      color: var(--sn-accent);
      cursor: pointer;
      box-shadow: 0 4px 12px rgba(0,0,0,0.15);
      transition: all 0.2s ease;
    }

    .sn-toggle:hover {
      background: var(--sn-accent);
      color: #fff;
    }

    .sn-toggle.hidden { display: none; }

    .sn-panel {
      position: fixed;
      bottom: 1.5rem;
      right: 1.5rem;
      z-index: 9999;
      width: 320px;
      max-height: 70vh;
      background: var(--sn-surface);
      border: 1px solid var(--sn-border);
      border-radius: 12px;
      box-shadow: 0 8px 30px rgba(0,0,0,0.25);
      display: none;
      flex-direction: column;
      overflow: hidden;
    }

    .sn-panel.open { display: flex; }

    .sn-panel-header {
      display: flex;
      align-items: center;
      justify-content: space-between;
      padding: 0.75rem 1rem;
      border-bottom: 1px solid var(--sn-border);
      background: var(--sn-accent-dim);
    }

    .sn-panel-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--sn-accent);
      text-transform: uppercase;
      letter-spacing: 0.08em;
    }

    .sn-close {
      background: none;
      border: none;
      color: var(--sn-text-muted);
      font-size: 1.1rem;
      cursor: pointer;
      padding: 0;
      line-height: 1;
    }

    .sn-close:hover { color: var(--sn-text); }

    .sn-body {
      overflow-y: auto;
      padding: 0.5rem 0;
      flex: 1;
    }

    .sn-group-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      font-weight: 600;
      color: var(--sn-group);
      text-transform: uppercase;
      letter-spacing: 0.06em;
      padding: 0.5rem 1rem 0.25rem;
    }

    .sn-entry {
      display: flex;
      align-items: baseline;
      gap: 0.6rem;
      padding: 0.25rem 1rem;
      font-size: 0.8rem;
      line-height: 1.4;
    }

    .sn-sym {
      font-family: 'IBM Plex Mono', monospace;
      color: var(--sn-accent);
      min-width: 55px;
      text-align: right;
      flex-shrink: 0;
    }

    .sn-def {
      color: var(--sn-text-muted);
      font-size: 0.75rem;
    }

    .sn-divider {
      height: 1px;
      background: var(--sn-border);
      margin: 0.4rem 1rem;
    }

    @media (max-width: 500px) {
      .sn-panel {
        width: calc(100vw - 2rem);
        right: 1rem;
        bottom: 1rem;
      }
      .sn-toggle {
        right: 1rem;
        bottom: 1rem;
      }
    }
  </style>

  <button type="button" class="sn-toggle" id="snToggle-0bb873ec347541318a79e0f80e8ddeb3">Notation</button>

  <div class="sn-panel" id="snPanel-0bb873ec347541318a79e0f80e8ddeb3">
    <div class="sn-panel-header">
      <span class="sn-panel-title">Notation Reference</span>
      <button type="button" class="sn-close" id="snClose-0bb873ec347541318a79e0f80e8ddeb3">&times;</button>
    </div>
    <div class="sn-body">

      <div class="sn-group-label">Core</div>
      <div class="sn-entry"><span class="sn-sym">\(K\)</span><span class="sn-def">Draft tokens per round (e.g. 7)</span></div>
      <div class="sn-entry"><span class="sn-sym">\(k\)</span><span class="sn-def">Number of accepted draft tokens (0 to K)</span></div>
      <div class="sn-entry"><span class="sn-sym">\(t^*\)</span><span class="sn-def">Bonus token from the target model</span></div>
      <div class="sn-entry"><span class="sn-sym">\(v^T\)</span><span class="sn-def">Verification outcome \((k, t^*)\) for round T</span></div>
      <div class="sn-entry"><span class="sn-sym">\(S^T\)</span><span class="sn-def">Speculation cache: maps outcomes to pre-computed drafts</span></div>

      <div class="sn-divider"></div>
      <div class="sn-group-label">Speedup</div>
      <div class="sn-entry"><span class="sn-sym">\(p_{\text{hit}}\)</span><span class="sn-def">Probability of a cache hit</span></div>
      <div class="sn-entry"><span class="sn-sym">\(E_{\text{hit}}\)</span><span class="sn-def">Expected tokens per round on cache hit</span></div>
      <div class="sn-entry"><span class="sn-sym">\(E_{\text{miss}}\)</span><span class="sn-def">Expected tokens per round on cache miss</span></div>
      <div class="sn-entry"><span class="sn-sym">\(T_v\)</span><span class="sn-def">Verification latency (target model forward pass)</span></div>
      <div class="sn-entry"><span class="sn-sym">\(T_p\)</span><span class="sn-def">Primary speculator latency (relative to verifier)</span></div>
      <div class="sn-entry"><span class="sn-sym">\(T_b\)</span><span class="sn-def">Backup speculator latency</span></div>

      <div class="sn-divider"></div>
      <div class="sn-group-label">Cache Construction</div>
      <div class="sn-entry"><span class="sn-sym">\(B\)</span><span class="sn-def">Budget: max pre-computed speculations (~20-30)</span></div>
      <div class="sn-entry"><span class="sn-sym">\(F_k\)</span><span class="sn-def">Fan-out at position k (bonus tokens cached)</span></div>
      <div class="sn-entry"><span class="sn-sym">\(r\)</span><span class="sn-def">Power-law exponent for cache miss decay</span></div>
      <div class="sn-entry"><span class="sn-sym">\(\alpha_p\)</span><span class="sn-def">Per-token acceptance rate</span></div>
      <div class="sn-entry"><span class="sn-sym">\(|V|\)</span><span class="sn-def">Vocabulary size (e.g. 32,000)</span></div>

      <div class="sn-divider"></div>
      <div class="sn-group-label">Sampling</div>
      <div class="sn-entry"><span class="sn-sym">\(C\)</span><span class="sn-def">Downweighting constant for Saguaro sampling (0 to 1)</span></div>

      <div class="sn-divider"></div>
      <div class="sn-group-label">Batch Scaling</div>
      <div class="sn-entry"><span class="sn-sym">\(b\)</span><span class="sn-def">Batch size (number of sequences)</span></div>
      <div class="sn-entry"><span class="sn-sym">\(b^*\)</span><span class="sn-def">Critical batch size: switch to backup fallback</span></div>

    </div>
  </div>

  <script>
  (function() {
    const uid = '0bb873ec347541318a79e0f80e8ddeb3';
    const toggle = document.getElementById(`snToggle-${uid}`);
    const panel = document.getElementById(`snPanel-${uid}`);
    const close = document.getElementById(`snClose-${uid}`);

    toggle.addEventListener('click', function() {
      panel.classList.add('open');
      toggle.classList.add('hidden');
    });

    close.addEventListener('click', function() {
      panel.classList.remove('open');
      toggle.classList.remove('hidden');
    });

    document.addEventListener('keydown', function(e) {
      if (e.key === 'Escape' && panel.classList.contains('open')) {
        panel.classList.remove('open');
        toggle.classList.remove('hidden');
      }
    });
  })();
  </script>
</div>

<h2 id="the-hidden-bottleneck-in-speculative-decoding">The Hidden Bottleneck in Speculative Decoding</h2>
<p>Standard speculative decoding runs in a loop: the draft model generates K tokens, the target model verifies them in a single forward pass, and the process repeats. This is faster than autoregressive decoding because verification amortizes the expensive memory read of the target model&rsquo;s weights across multiple tokens.</p>
<p>But there is a sequential dependency hiding in plain sight. Let&rsquo;s trace what happens on the draft model&rsquo;s GPU during one round:</p>
<ol>
<li>The draft model generates K tokens (draft phase)</li>
<li>The draft model sends tokens to the target model</li>
<li>The target model runs verification (the draft model <strong>sits completely idle</strong>)</li>
<li>The target model returns the verification outcome</li>
<li>The draft model generates K new tokens for the next round</li>
<li>Go to step 2</li>
</ol>
<p>The draft model does nothing during step 3. If verification takes $T_v$ time units, the draft model wastes $T_v$ time units every round. Since the target model is much larger than the draft, $T_v$ dominates the round duration. The draft model&rsquo;s GPU is idle for the majority of each round.</p>
<p>This is the bottleneck SSD eliminates. Instead of waiting for verification to finish, the draft model spends that idle time doing something useful: predicting what the verification outcome will be and pre-computing the next round&rsquo;s speculation for each likely outcome.</p>
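<p>The cost of that idle time is easy to put in numbers. The sketch below is a simplified timing model, not the paper&rsquo;s analysis: it ignores communication, assumes drafting for the next round fits inside the verification window, and assumes the pre-computed draft is usable. The names <code>t_draft</code> and <code>t_verify</code> are introduced here for illustration.</p>

```python
def draft_idle_fraction(t_draft, t_verify):
    """Share of a standard SD round in which the draft model does nothing."""
    return t_verify / (t_draft + t_verify)

def round_time(t_draft, t_verify, overlapped):
    """Wall-clock time of one round, with or without overlapping draft and verify."""
    if overlapped:
        # Drafting hides behind verification; the longer of the two sets the pace.
        return max(t_draft, t_verify)
    return t_draft + t_verify

t_draft, t_verify = 1.0, 3.0  # arbitrary time units
print(draft_idle_fraction(t_draft, t_verify))
print(round_time(t_draft, t_verify, overlapped=False),
      round_time(t_draft, t_verify, overlapped=True))
```

<p>In this toy setting the draft model is idle for three quarters of every round, and overlapping removes the drafting time from the critical path entirely.</p>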


<figure class="ssd-paper-fig">
  <style>
    .ssd-paper-fig {
      margin: 2rem 0;
      text-align: center;
    }

    .ssd-paper-fig img {
      width: 100%;
      max-width: 820px;
      border-radius: 10px;
      background: #fff;
      padding: 1rem;
      border: 1px solid #e2e8f0;
    }

    [data-theme="dark"] .ssd-paper-fig img,
    :root[data-theme="dark"] .ssd-paper-fig img {
      border-color: #30363d;
    }

    .ssd-paper-fig figcaption {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.75rem;
      color: var(--secondary, #8b949e);
      margin-top: 0.75rem;
      line-height: 1.5;
    }
  </style>
  <img src="/images/posts/spec_decode.png" alt="Standard Speculative Decoding vs Speculative Speculative Decoding. Left: SD's sequential draft-verify loop with idle draft time. Center: SSD's parallel execution where the draft model builds a branching tree of pre-computed speculations during verification. Right: throughput comparison showing SSD achieving 4x speedup over autoregressive decoding." />
  <figcaption>Figure from Kumar, Dao &amp; May (2026). Left: standard SD's sequential loop. Center: SSD overlaps drafting with verification via a speculation cache tree. Right: throughput gains.</figcaption>
</figure>


<h2 id="the-core-idea-speculate-about-the-speculation">The Core Idea: Speculate About the Speculation</h2>
<p>The idea draws from CPU speculative execution. When a CPU encounters a conditional branch, it does not wait for the condition to resolve. Instead, it predicts the likely outcome and begins executing instructions along that predicted path. If the prediction was correct, the results are kept. If wrong, they are discarded and the correct path is executed.</p>
<p>SSD applies this same principle to speculative decoding. While the target model verifies round $T$&rsquo;s draft tokens, the draft model:</p>
<ol>
<li><strong>Predicts</strong> what the verification outcome will be</li>
<li><strong>Pre-computes</strong> speculations for each likely outcome</li>
<li><strong>Stores</strong> these in a <strong>speculation cache</strong></li>
</ol>
<p>When verification finishes, the actual outcome is compared against the cache. If it matches a cached prediction (a <strong>cache hit</strong>), the next round&rsquo;s speculation is returned instantly with zero drafting latency. If it doesn&rsquo;t match (a <strong>cache miss</strong>), the system falls back to standard synchronous drafting.</p>
<h3 id="what-is-a-verification-outcome">What Is a Verification Outcome?</h3>
<p>To understand what we need to predict, let&rsquo;s define the verification outcome precisely. When the target model verifies K draft tokens, two things are determined:</p>
<ul>
<li><strong>$k$</strong>: the number of accepted draft tokens (ranging from 0 to K)</li>
<li><strong>$t^*$</strong>: the bonus token, sampled from the target distribution at the first disagreement point (or at position K+1 if all tokens are accepted)</li>
</ul>
<p>The verification outcome is the pair $v^T = (k, t^*)$. This fully determines the context from which the next round&rsquo;s speculation must begin. If we can predict $v^T$ before verification completes, we can pre-compute the next K draft tokens starting from that context.</p>
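<p>As a concrete illustration, the sketch below computes $(k, t^*)$ under a simplifying assumption: greedy verification, where a draft token is accepted only if it equals the target model&rsquo;s own choice at that position. Sampling-based verification accepts or rejects probabilistically instead, but the shape of the outcome is the same. The function name and token lists are hypothetical.</p>

```python
from typing import List, Tuple


def verification_outcome(draft: List[str], target: List[str]) -> Tuple[int, str]:
    """Return (k, t_star) under greedy (exact-match) verification.

    `target` holds the target model's chosen token at each of the K draft
    positions plus one extra position, so len(target) == len(draft) + 1.
    """
    assert len(target) == len(draft) + 1
    k = 0
    while k != len(draft) and draft[k] == target[k]:
        k += 1
    # target[k] is the token at the first disagreement, or the extra
    # (K+1)-th token when every draft token was accepted.
    return k, target[k]


draft = ["the", "cat", "sat", "on", "a"]
target = ["the", "cat", "sat", "atop", "the", "soft"]
print(verification_outcome(draft, target))  # (3, 'atop')
```

<p>Everything after position $k$ in the target output is discarded, which is why the pair alone is enough to define the next round&rsquo;s starting context.</p>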
<h3 id="the-speculation-cache">The Speculation Cache</h3>
<p>The speculation cache $S^T$ is a dictionary that maps predicted verification outcomes to pre-computed speculations:</p>
$$S^T : (k, t^*) \to (s_1, s_2, \ldots, s_K)$$<p>Each entry contains K draft tokens generated autoregressively by the draft model starting from the context implied by $(k, t^*)$.</p>
<p>When verification returns the actual outcome $v^T$:</p>
<ul>
<li><strong>Cache hit</strong> ($v^T \in S^T$): Return the cached speculation immediately. Zero drafting latency for this round.</li>
<li><strong>Cache miss</strong> ($v^T \notin S^T$): Fall back to generating a fresh speculation synchronously.</li>
</ul>
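<p>The hit and miss paths above can be sketched as a plain dictionary lookup. This is a toy model, not the paper&rsquo;s implementation: <code>next_speculation</code> and <code>fallback_draft</code> are hypothetical names, and the fallback stands in for a real synchronous call to the draft model.</p>

```python
from typing import Callable, Dict, List, Tuple

Outcome = Tuple[int, str]  # (k, t_star)


def next_speculation(
    cache: Dict[Outcome, List[str]],
    outcome: Outcome,
    draft_sync: Callable[[Outcome], List[str]],
) -> Tuple[List[str], bool]:
    """Return (speculation, was_hit) for the actual verification outcome."""
    cached = cache.get(outcome)
    if cached is not None:
        return cached, True            # hit: no drafting on the critical path
    return draft_sync(outcome), False  # miss: draft synchronously, as in plain SD


def fallback_draft(outcome: Outcome) -> List[str]:
    """Stand-in for running the draft model after a miss."""
    return ["(drafted", "after", "a", "miss)"]


cache = {
    (2, "lay"): ["down", "on", "the", "warm", "rug"],
    (3, "atop"): ["the", "soft", "warm", "red", "mat"],
}
print(next_speculation(cache, (3, "atop"), fallback_draft))
print(next_speculation(cache, (2, "under"), fallback_draft))
```

<p>Keying on the full pair matters: the second lookup misses even though the cache holds an entry for $k=2$, because the bonus token differs.</p>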
<div class="ssd-cache-viz" id="ssd-cache-0bb873ec347541318a79e0f80e8ddeb3">
  <style>
    .ssd-cache-viz {
      --sc-bg: #0d1117;
      --sc-surface: #161b22;
      --sc-border: #30363d;
      --sc-text: #e6edf3;
      --sc-text-muted: #8b949e;
      --sc-target-blue: #58a6ff;
      --sc-draft-orange: #d29922;
      --sc-cache-green: #39d353;
      --sc-miss-red: #f97583;
      --sc-cache-purple: #a371f7;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--sc-bg);
      color: var(--sc-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssd-cache-viz,
    :root:not([data-theme="dark"]) .ssd-cache-viz {
      --sc-bg: #f8fafc;
      --sc-surface: #ffffff;
      --sc-border: #e2e8f0;
      --sc-text: #1e293b;
      --sc-text-muted: #64748b;
      --sc-target-blue: #3b82f6;
      --sc-draft-orange: #f59e0b;
      --sc-cache-green: #10b981;
      --sc-miss-red: #ef4444;
      --sc-cache-purple: #8b5cf6;
    }

    .ssd-cache-viz * { box-sizing: border-box; }

    .sc-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .sc-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--sc-cache-purple);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .sc-header p {
      color: var(--sc-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .sc-controls {
      background: var(--sc-surface);
      border: 1px solid var(--sc-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
      margin-bottom: 1.25rem;
      display: flex;
      align-items: center;
      gap: 1rem;
      flex-wrap: wrap;
    }

    .sc-control-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--sc-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.05em;
    }

    .sc-scenario-btns {
      display: flex;
      gap: 0.4rem;
    }

    .sc-scenario-btn {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 500;
      padding: 0.4rem 0.75rem;
      border: 1px solid var(--sc-border);
      border-radius: 6px;
      background: var(--sc-surface);
      color: var(--sc-text);
      cursor: pointer;
      transition: all 0.2s ease;
    }

    .sc-scenario-btn:hover {
      border-color: var(--sc-cache-purple);
      background: rgba(163, 113, 247, 0.1);
    }

    .sc-scenario-btn.active {
      background: var(--sc-cache-purple);
      border-color: var(--sc-cache-purple);
      color: #fff;
    }

     
    .sc-diagram {
      background: var(--sc-surface);
      border: 1px solid var(--sc-border);
      border-radius: 10px;
      padding: 1.5rem;
      margin-bottom: 1rem;
      overflow-x: auto;
    }

    .sc-flow {
      display: flex;
      align-items: flex-start;
      gap: 1rem;
      min-width: 600px;
    }

     
    .sc-context {
      display: flex;
      flex-direction: column;
      align-items: center;
      gap: 0.5rem;
      min-width: 120px;
    }

    .sc-context-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      color: var(--sc-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.05em;
    }

    .sc-context-tokens {
      background: var(--sc-bg);
      border: 1px solid var(--sc-border);
      border-radius: 6px;
      padding: 0.5rem 0.75rem;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      color: var(--sc-text);
      text-align: center;
    }

    .sc-draft-tokens {
      display: flex;
      gap: 0.25rem;
    }

    .sc-token {
      padding: 0.2rem 0.4rem;
      border-radius: 4px;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 500;
    }

    .sc-token-draft {
      background: rgba(210, 153, 34, 0.2);
      color: var(--sc-draft-orange);
      border: 1px solid rgba(210, 153, 34, 0.3);
    }

    .sc-token-accepted {
      background: rgba(57, 211, 83, 0.15);
      color: var(--sc-cache-green);
      border: 1px solid rgba(57, 211, 83, 0.3);
    }

    .sc-token-rejected {
      background: rgba(249, 117, 131, 0.15);
      color: var(--sc-miss-red);
      border: 1px solid rgba(249, 117, 131, 0.3);
      text-decoration: line-through;
    }

    .sc-token-bonus {
      background: rgba(88, 166, 255, 0.15);
      color: var(--sc-target-blue);
      border: 1px solid rgba(88, 166, 255, 0.3);
    }

     
    .sc-arrow {
      display: flex;
      flex-direction: column;
      align-items: center;
      justify-content: center;
      min-height: 200px;
      color: var(--sc-text-muted);
      font-size: 1.2rem;
      padding-top: 1rem;
    }

    .sc-arrow-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--sc-text-muted);
      writing-mode: vertical-rl;
      text-orientation: mixed;
      letter-spacing: 0.05em;
      text-transform: uppercase;
    }

     
    .sc-cache-container {
      flex: 1;
    }

    .sc-cache-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--sc-cache-purple);
      text-transform: uppercase;
      letter-spacing: 0.08em;
      margin-bottom: 0.75rem;
      display: flex;
      align-items: center;
      gap: 0.5rem;
    }

    .sc-cache-entries {
      display: flex;
      flex-direction: column;
      gap: 0.5rem;
    }

    .sc-cache-entry {
      display: flex;
      align-items: center;
      gap: 0.5rem;
      padding: 0.5rem 0.75rem;
      background: var(--sc-bg);
      border: 1px solid var(--sc-border);
      border-radius: 6px;
      transition: all 0.3s ease;
    }

    .sc-cache-entry.hit {
      border-color: var(--sc-cache-green);
      background: rgba(57, 211, 83, 0.08);
      box-shadow: 0 0 10px rgba(57, 211, 83, 0.15);
    }

    .sc-cache-entry.miss {
      opacity: 0.5;
    }

    .sc-outcome-key {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      min-width: 85px;
      color: var(--sc-text);
    }

    .sc-outcome-arrow {
      color: var(--sc-text-muted);
      font-size: 0.8rem;
    }

    .sc-cached-spec {
      display: flex;
      gap: 0.2rem;
      flex: 1;
    }

    .sc-cached-token {
      padding: 0.15rem 0.35rem;
      border-radius: 3px;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      background: rgba(163, 113, 247, 0.15);
      color: var(--sc-cache-purple);
      border: 1px solid rgba(163, 113, 247, 0.25);
    }

    .sc-entry-status {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      font-weight: 600;
      padding: 0.15rem 0.4rem;
      border-radius: 3px;
      min-width: 35px;
      text-align: center;
    }

    .sc-status-hit {
      background: rgba(57, 211, 83, 0.2);
      color: var(--sc-cache-green);
    }

    .sc-status-miss {
      background: rgba(249, 117, 131, 0.15);
      color: var(--sc-miss-red);
    }

     
    .sc-result {
      background: var(--sc-surface);
      border: 1px solid var(--sc-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
      display: flex;
      align-items: center;
      gap: 1rem;
      flex-wrap: wrap;
    }

    .sc-result-icon {
      font-size: 1.5rem;
      width: 40px;
      height: 40px;
      display: flex;
      align-items: center;
      justify-content: center;
      border-radius: 8px;
    }

    .sc-result-icon.hit {
      background: rgba(57, 211, 83, 0.15);
      border: 1px solid rgba(57, 211, 83, 0.3);
    }

    .sc-result-icon.miss {
      background: rgba(249, 117, 131, 0.15);
      border: 1px solid rgba(249, 117, 131, 0.3);
    }

    .sc-result-text {
      flex: 1;
    }

    .sc-result-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.8rem;
      font-weight: 600;
      margin: 0 0 0.2rem 0;
    }

    .sc-result-title.hit { color: var(--sc-cache-green); }
    .sc-result-title.miss { color: var(--sc-miss-red); }

    .sc-result-desc {
      font-size: 0.8rem;
      color: var(--sc-text-muted);
      margin: 0;
    }

    @media (max-width: 600px) {
      .sc-flow { min-width: 500px; }
      .sc-controls { flex-direction: column; align-items: flex-start; }
    }
  </style>

  <div class="sc-header">
    <h3>The Speculation Cache</h3>
    <p>Mapping predicted verification outcomes to pre-computed speculations</p>
  </div>

  <div class="sc-controls">
    <span class="sc-control-label">Scenario</span>
    <div class="sc-scenario-btns">
      <button type="button" class="sc-scenario-btn active" data-scenario="hit">Cache Hit (k=3, t*="atop")</button>
      <button type="button" class="sc-scenario-btn" data-scenario="miss">Cache Miss (k=2, t*="under")</button>
    </div>
  </div>

  <div class="sc-diagram">
    <div class="sc-flow">
      
      <div class="sc-context">
        <span class="sc-context-label">Draft tokens sent</span>
        <div class="sc-context-tokens">
          <div class="sc-draft-tokens" id="draftTokens-0bb873ec347541318a79e0f80e8ddeb3">
            <span class="sc-token sc-token-draft">the</span>
            <span class="sc-token sc-token-draft">cat</span>
            <span class="sc-token sc-token-draft">sat</span>
            <span class="sc-token sc-token-draft">on</span>
            <span class="sc-token sc-token-draft">a</span>
          </div>
        </div>
        <span class="sc-context-label" style="margin-top: 0.75rem;">Verification result</span>
        <div class="sc-context-tokens" id="verifyResult-0bb873ec347541318a79e0f80e8ddeb3">
          <div class="sc-draft-tokens">
            <span class="sc-token sc-token-accepted">the</span>
            <span class="sc-token sc-token-accepted">cat</span>
            <span class="sc-token sc-token-accepted">sat</span>
            <span class="sc-token sc-token-bonus">atop</span>
          </div>
        </div>
        <div style="font-family: 'IBM Plex Mono', monospace; font-size: 0.7rem; color: var(--sc-text-muted); margin-top: 0.25rem;">
          <span id="outcomeLabel-0bb873ec347541318a79e0f80e8ddeb3">v = (k=3, t*="atop")</span>
        </div>
      </div>

      
      <div class="sc-arrow">
        <span>&#8594;</span>
        <span class="sc-arrow-label">lookup</span>
      </div>

      
      <div class="sc-cache-container">
        <div class="sc-cache-title">Speculation Cache S<sup>T</sup></div>
        <div class="sc-cache-entries" id="cacheEntries-0bb873ec347541318a79e0f80e8ddeb3">
          
        </div>
      </div>
    </div>
  </div>

  <div class="sc-result" id="cacheResult-0bb873ec347541318a79e0f80e8ddeb3">
    
  </div>

  <script>
  (function() {
    const uid = '0bb873ec347541318a79e0f80e8ddeb3';

    const cacheData = [
      { key: '(k=0, t*="a")', tokens: ['quick','brown','fox','jumps','over'], hit: false },
      { key: '(k=1, t*="dog")', tokens: ['ran','across','the','open','field'], hit: false },
      { key: '(k=2, t*="lay")', tokens: ['down','on','the','warm','rug'], hit: false },
      { key: '(k=3, t*="atop")', tokens: ['the','soft','warm','red','mat'], hit: true },
      { key: '(k=3, t*="upon")', tokens: ['a','velvet','red','cushion','and'], hit: false },
      { key: '(k=4, t*="mat")', tokens: ['beside','the','warm','fire','place'], hit: false },
      { key: '(k=5, t*=".")', tokens: ['The','dog','watched','from','the'], hit: false },
    ];

    const missScenario = {
      verifyTokens: [
        { text: 'the', cls: 'sc-token-accepted' },
        { text: 'cat', cls: 'sc-token-accepted' },
        { text: 'sat', cls: 'sc-token-rejected' },
        { text: 'on', cls: 'sc-token-rejected' },
        { text: 'a', cls: 'sc-token-rejected' },
      ],
      resultTokens: [
        { text: 'the', cls: 'sc-token-accepted' },
        { text: 'cat', cls: 'sc-token-accepted' },
        { text: 'under', cls: 'sc-token-bonus' },
      ],
      outcome: 'v = (k=2, t*="under")',
    };

    const hitScenario = {
      verifyTokens: [
        { text: 'the', cls: 'sc-token-accepted' },
        { text: 'cat', cls: 'sc-token-accepted' },
        { text: 'sat', cls: 'sc-token-accepted' },
        { text: 'on', cls: 'sc-token-rejected' },
        { text: 'a', cls: 'sc-token-rejected' },
      ],
      resultTokens: [
        { text: 'the', cls: 'sc-token-accepted' },
        { text: 'cat', cls: 'sc-token-accepted' },
        { text: 'sat', cls: 'sc-token-accepted' },
        { text: 'atop', cls: 'sc-token-bonus' },
      ],
      outcome: 'v = (k=3, t*="atop")',
    };

    const container = document.getElementById(`ssd-cache-${uid}`);
    const btns = container.querySelectorAll('.sc-scenario-btn');

    function render(scenario) {
      const isHit = scenario === 'hit';
      const data = isHit ? hitScenario : missScenario;

      
      const resultDiv = document.getElementById(`verifyResult-${uid}`);
      resultDiv.innerHTML = '<div class="sc-draft-tokens">' +
        data.resultTokens.map(t => `<span class="sc-token ${t.cls}">${t.text}</span>`).join('') +
        '</div>';

      document.getElementById(`outcomeLabel-${uid}`).textContent = data.outcome;

      
      const entriesDiv = document.getElementById(`cacheEntries-${uid}`);
      entriesDiv.innerHTML = cacheData.map(entry => {
        let cls = '';
        let statusHtml = '';
        if (isHit && entry.hit) {
          cls = 'hit';
          statusHtml = '<span class="sc-entry-status sc-status-hit">HIT</span>';
        } else if (!isHit && entry.key === '(k=2, t*="lay")') {
          cls = 'miss';
          statusHtml = '<span class="sc-entry-status sc-status-miss">MISS</span>';
        } else if (!isHit) {
          cls = 'miss';
        }

        return `
          <div class="sc-cache-entry ${cls}">
            <span class="sc-outcome-key">${entry.key}</span>
            <span class="sc-outcome-arrow">&#8594;</span>
            <div class="sc-cached-spec">
              ${entry.tokens.map(t => `<span class="sc-cached-token">${t}</span>`).join('')}
            </div>
            ${statusHtml}
          </div>
        `;
      }).join('');

      
      const resultBox = document.getElementById(`cacheResult-${uid}`);
      if (isHit) {
        resultBox.innerHTML = `
          <div class="sc-result-icon hit">&#10003;</div>
          <div class="sc-result-text">
            <p class="sc-result-title hit">Cache Hit: Instant Return</p>
            <p class="sc-result-desc">Outcome (k=3, t*="atop") found in cache. Pre-computed speculation ["the","soft","warm","red","mat"] returned with zero drafting latency.</p>
          </div>
        `;
      } else {
        resultBox.innerHTML = `
          <div class="sc-result-icon miss">&#10007;</div>
          <div class="sc-result-text">
            <p class="sc-result-title miss">Cache Miss: Fallback to Synchronous Draft</p>
            <p class="sc-result-desc">Outcome (k=2, t*="under") not in cache. The cache predicted "lay" for k=2, but the actual bonus token was "under". Draft model generates new speculation synchronously.</p>
          </div>
        `;
      }
    }

    btns.forEach(btn => {
      btn.addEventListener('click', function() {
        btns.forEach(b => b.classList.remove('active'));
        this.classList.add('active');
        render(this.dataset.scenario);
      });
    });

    render('hit');
  })();
  </script>
</div>

<p>The key question is obvious: how do we choose which outcomes to cache? The space of possible outcomes is $(K+1) \times |V|$ where $|V|$ is the vocabulary size (typically 32,000-128,000). We cannot pre-compute speculations for all of them. This is the first of three challenges Saguaro addresses.</p>
<h2 id="the-speedup-formula">The Speedup Formula</h2>
<p>Before diving into the three challenges, let&rsquo;s quantify the potential. The expected speedup of SSD over autoregressive decoding (Theorem 7 from the paper) is:</p>
$$\text{speedup}_{\text{SSD}} = \frac{p_{\text{hit}} \cdot E_{\text{hit}} + (1 - p_{\text{hit}}) \cdot E_{\text{miss}}}{p_{\text{hit}} \cdot \max(1, T_p) + (1 - p_{\text{hit}}) \cdot (1 + T_b)}$$<p>Walking through each term: $p_{\text{hit}}$ is the probability of a cache hit. $E_{\text{hit}}$ and $E_{\text{miss}}$ are the expected number of tokens generated per round on a hit and miss respectively. $T_p$ is the latency of the primary speculator (the neural draft model) relative to the verifier, and $T_b$ is the latency of the backup speculator used on cache misses.</p>
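<p>The numerator is the expected tokens per round, weighted by hit/miss probabilities. The denominator is the expected wall-clock time per round in units of one verification pass: on a hit, drafting overlaps verification, so the round costs $\max(1, T_p)$; on a miss, the backup speculator runs first and verification follows, costing $1 + T_b$.</p>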
<p>Two corollaries follow directly:</p>
<p><strong>Corollary 8</strong>: SSD strictly outperforms standard speculative decoding whenever $p_{\text{hit}} > 0$. Any nonzero cache hit rate improves performance because cache hits eliminate drafting latency entirely, and cache misses simply revert to standard SD behavior.</p>
<p><strong>Corollary 9</strong>: The speedup of SSD over standard speculative decoding is bounded by $(1 + T_{\text{SD}}) \cdot (E_{\text{hit}} / E_{\text{SD}})$, where $T_{\text{SD}}$ and $E_{\text{SD}}$ are the drafting time (relative to the verifier) and expected tokens per round for standard SD. The bound is approached as the cache hit rate nears 1 with $T_p \leq 1$: drafting is then fully hidden behind verification, so the per-round time falls from $1 + T_{\text{SD}}$ to $1$.</p>
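<p>Plugging numbers into the formula makes the two corollaries easier to see. The sketch below is a direct transcription of the speedup expression; the input values are hypothetical, and treating $p_{\text{hit}} = 0$ with $T_b = T_{\text{SD}}$ and $E_{\text{miss}} = E_{\text{SD}}$ as &ldquo;plain SD&rdquo; is an assumption made for illustration.</p>

```python
def ssd_speedup(p_hit: float, e_hit: float, e_miss: float, t_p: float, t_b: float) -> float:
    """Expected SSD speedup over autoregressive decoding.

    All times are in units of one verifier pass.
    """
    tokens = p_hit * e_hit + (1 - p_hit) * e_miss
    time = p_hit * max(1.0, t_p) + (1 - p_hit) * (1.0 + t_b)
    return tokens / time


# Hypothetical values, not measurements from the paper.
e_sd, t_sd = 4.0, 0.25
sd = ssd_speedup(0.0, e_sd, e_sd, t_sd, t_sd)    # every round misses: plain SD
ssd = ssd_speedup(0.8, e_sd, e_sd, t_sd, t_sd)   # 80% of rounds hit the cache
best = ssd_speedup(1.0, e_sd, e_sd, t_sd, t_sd)  # every round hits
print(sd, ssd, best, best / sd)
```

<p>With equal expected tokens on hits and misses, the ratio <code>best / sd</code> comes out to $1 + T_{\text{SD}}$, the ceiling from Corollary 9, and any hit rate in between lands strictly between the two extremes.</p>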
<h2 id="challenge-1-building-the-cache-saguaro-cache-construction">Challenge 1: Building the Cache (Saguaro Cache Construction)</h2>
<h3 id="the-problem">The Problem</h3>
<p>A verification outcome is a pair $(k, t^*)$. The acceptance length $k$ ranges from 0 to K (typically K=7 in SSD), giving K+1 possibilities. The bonus token $t^*$ comes from a vocabulary of size $|V|$. The total outcome space is $(K+1) \times |V|$, which is 256,000 for K=7 and $|V|$=32,000.</p>
<p>We have a budget of $B$ speculations we can pre-compute during the verification window. Each speculation requires the draft model to run K autoregressive steps. With the verification latency of a 70B target model on 4 H100s, we can fit roughly $B = 20\text{-}30$ speculations. We need to choose wisely.</p>
<h3 id="decomposing-the-problem">Decomposing the Problem</h3>
<p>Saguaro decomposes this into two subproblems:</p>
<ol>
<li>For each acceptance length $k$, how many bonus tokens should we cache? Call this the <strong>fan-out</strong> $F_k$.</li>
<li>For a given fan-out $F_k$, <em>which</em> bonus tokens should we cache?</li>
</ol>
<p>The second question has a straightforward answer: use the top-$F_k$ tokens from the draft model&rsquo;s own probability distribution at that position. The draft model has already computed logits during the current round&rsquo;s speculation, so the most likely tokens are immediately available. Empirically, this predicts the actual bonus token with up to 90% accuracy at moderate fan-out.</p>
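<p>Selecting the top-$F_k$ candidates is a standard top-k over the draft logits. A minimal sketch with a toy five-token vocabulary follows; the helper name and logit values are made up, and a real implementation would run a batched top-k on the GPU rather than use a Python dictionary.</p>

```python
import heapq
from typing import Dict, List


def top_bonus_candidates(logits: Dict[str, float], fan_out: int) -> List[str]:
    """Return the fan_out highest-scoring tokens, best first."""
    return heapq.nlargest(fan_out, logits, key=logits.get)


# Hypothetical draft logits at one position (a real vocabulary has |V| entries).
logits = {"atop": 2.1, "upon": 1.7, "near": 0.9, "by": 0.4, "in": 0.1}
print(top_bonus_candidates(logits, 2))  # ['atop', 'upon']
```

<p>Each returned token becomes the $t^*$ half of a cache key for that acceptance length.</p>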
<p>The first question, how to allocate fan-out across positions, is where the interesting optimization happens.</p>
<div class="ssd-fanout-concept" id="ssd-foc-0bb873ec347541318a79e0f80e8ddeb3">
  <style>
    .ssd-fanout-concept {
      --fc-bg: #0d1117;
      --fc-surface: #161b22;
      --fc-border: #30363d;
      --fc-text: #e6edf3;
      --fc-text-muted: #8b949e;
      --fc-green: #39d353;
      --fc-red: #f97583;
      --fc-purple: #a371f7;
      --fc-blue: #58a6ff;
      --fc-gold: #e3b341;
      --fc-orange: #d29922;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--fc-bg);
      color: var(--fc-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssd-fanout-concept,
    :root:not([data-theme="dark"]) .ssd-fanout-concept {
      --fc-bg: #f8fafc;
      --fc-surface: #ffffff;
      --fc-border: #e2e8f0;
      --fc-text: #1e293b;
      --fc-text-muted: #64748b;
      --fc-green: #10b981;
      --fc-red: #ef4444;
      --fc-purple: #8b5cf6;
      --fc-blue: #3b82f6;
      --fc-gold: #d97706;
      --fc-orange: #f59e0b;
    }

    .ssd-fanout-concept * { box-sizing: border-box; }

    .fc-header {
      text-align: center;
      margin-bottom: 1.25rem;
    }

    .fc-header h3 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--fc-purple);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .fc-header p {
      color: var(--fc-text-muted);
      font-size: 0.85rem;
      margin: 0;
    }

    .fc-grid-wrap {
      background: var(--fc-surface);
      border: 1px solid var(--fc-border);
      border-radius: 10px;
      padding: 1.25rem;
      overflow-x: auto;
    }

    .fc-grid {
      min-width: 520px;
    }

     
    .fc-col-headers {
      display: flex;
      align-items: flex-end;
      margin-bottom: 0.5rem;
      padding-left: 200px;
    }

    .fc-col-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.55rem;
      color: var(--fc-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.04em;
      text-align: center;
      width: 40px;
      flex-shrink: 0;
    }

    .fc-col-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--fc-purple);
      text-transform: uppercase;
      letter-spacing: 0.05em;
      padding-left: 200px;
      margin-bottom: 0.3rem;
    }

     
    .fc-row {
      display: flex;
      align-items: center;
      padding: 0.35rem 0;
      border-radius: 6px;
      transition: background 0.15s ease;
    }

    .fc-row:hover {
      background: rgba(163, 113, 247, 0.06);
    }

    .fc-row.fc-row-highlight {
      background: rgba(57, 211, 83, 0.08);
    }

     
    .fc-accept-pattern {
      display: flex;
      align-items: center;
      gap: 0.25rem;
      min-width: 200px;
      flex-shrink: 0;
    }

    .fc-k-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--fc-text-muted);
      min-width: 34px;
    }

    .fc-token-slot {
      width: 20px;
      height: 20px;
      border-radius: 4px;
      display: flex;
      align-items: center;
      justify-content: center;
      font-size: 0.6rem;
      font-weight: 700;
      flex-shrink: 0;
    }

    .fc-token-accept {
      background: rgba(57, 211, 83, 0.2);
      border: 1px solid rgba(57, 211, 83, 0.4);
      color: var(--fc-green);
    }

    .fc-token-reject {
      background: rgba(249, 117, 131, 0.2);
      border: 1px solid rgba(249, 117, 131, 0.4);
      color: var(--fc-red);
    }

    .fc-token-skip {
      background: transparent;
      border: 1px solid var(--fc-border);
      opacity: 0.25;
    }

    .fc-arrow {
      color: var(--fc-text-muted);
      font-size: 0.8rem;
      margin: 0 0.35rem;
      flex-shrink: 0;
    }

     
    .fc-fanout-cells {
      display: flex;
      align-items: center;
      gap: 2px;
      flex: 1;
    }

    .fc-cell {
      width: 38px;
      height: 24px;
      border-radius: 4px;
      display: flex;
      align-items: center;
      justify-content: center;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.55rem;
      font-weight: 500;
      flex-shrink: 0;
      transition: all 0.2s ease;
    }

    .fc-cell-cached {
      background: rgba(163, 113, 247, 0.2);
      border: 1px solid rgba(163, 113, 247, 0.4);
      color: var(--fc-purple);
    }

    .fc-cell-empty {
      background: transparent;
      border: 1px dashed var(--fc-border);
      color: var(--fc-text-muted);
      opacity: 0.2;
    }

    .fc-cell-boost {
      background: rgba(57, 211, 83, 0.15);
      border: 1px solid rgba(57, 211, 83, 0.4);
      color: var(--fc-green);
    }

     
    .fc-fanout-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      color: var(--fc-purple);
      margin-left: 0.5rem;
      white-space: nowrap;
      min-width: 45px;
    }

    .fc-fanout-label.boost {
      color: var(--fc-green);
    }

     
    .fc-summary {
      background: var(--fc-surface);
      border: 1px solid var(--fc-border);
      border-radius: 10px;
      padding: 0.75rem 1.25rem;
      margin-top: 1rem;
      display: flex;
      gap: 1.5rem;
      justify-content: center;
      flex-wrap: wrap;
    }

    .fc-stat {
      text-align: center;
    }

    .fc-stat-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--fc-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.05em;
    }

    .fc-stat-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1rem;
      font-weight: 700;
    }

    .fc-stat-value.purple { color: var(--fc-purple); }
    .fc-stat-value.green { color: var(--fc-green); }
    .fc-stat-value.blue { color: var(--fc-blue); }

    .fc-note {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      color: var(--fc-text-muted);
      text-align: center;
      margin-top: 0.75rem;
      line-height: 1.5;
    }

    @media (max-width: 600px) {
      .fc-col-headers { padding-left: 140px; }
      .fc-col-title { padding-left: 140px; }
      .fc-accept-pattern { min-width: 140px; }
    }
  </style>

  <div class="fc-header">
    <h3>What Is Fan-Out?</h3>
    <p>Each row is one possible acceptance length. Fan-out = how many bonus tokens we cache for that outcome.</p>
  </div>

  <div class="fc-grid-wrap">
    <div class="fc-grid">
      <div class="fc-col-title">Cached bonus token candidates (from draft logits)</div>
      <div class="fc-col-headers">
        <span class="fc-col-label">1st</span>
        <span class="fc-col-label">2nd</span>
        <span class="fc-col-label">3rd</span>
        <span class="fc-col-label">4th</span>
        <span class="fc-col-label">5th</span>
        <span class="fc-col-label">6th</span>
        <span class="fc-col-label">7th</span>
      </div>
      <div id="focRows-0bb873ec347541318a79e0f80e8ddeb3"></div>
    </div>
  </div>

  <div class="fc-summary" id="focSummary-0bb873ec347541318a79e0f80e8ddeb3"></div>

  <p class="fc-note">
    Each filled cell (purple, or green in the boosted k=K row) is one pre-computed speculation of K draft tokens. The filled cells together make up the budget B.<br>
    Position k=K (all accepted) is boosted because the target distribution is sharper, making prediction easier.
  </p>

  <script>
  (function() {
    const uid = '0bb873ec347541318a79e0f80e8ddeb3';
    const K = 7;
    const maxCols = 7;
    const alpha = 0.75;
    const r = 0.8;

    
    function geoFanout(alpha, B) {
      const raw = [];
      let sum = 0;
      for (let k = 0; k <= K; k++) {
        let v;
        if (k < K) {
          v = Math.pow(alpha, k / (1 + r));
        } else {
          v = Math.pow(alpha, K / (1 + r)) * Math.pow(1 - alpha, -1 / (1 + r));
        }
        raw.push(v);
        sum += v;
      }
      const scale = B / sum;
      return raw.map(f => Math.max(1, Math.round(f * scale)));
    }

    const B = 24;
    const fanouts = geoFanout(alpha, B);
    const totalCached = fanouts.reduce((a, b) => a + b, 0);

    
    const topTokens = [
      ['the', 'a', 'an', 'one', 'this', 'my', 'his'],
      ['cat', 'dog', 'bird', 'fox', 'man', 'boy', 'kid'],
      ['sat', 'lay', 'was', 'ran', 'hid', 'ate', 'slept'],
      ['atop', 'upon', 'near', 'by', 'in', 'on', 'next'],
      ['the', 'a', 'an', 'his', 'her', 'my', 'one'],
      ['soft', 'old', 'big', 'red', 'new', 'warm', 'blue'],
      ['warm', 'thin', 'dark', 'cold', 'dry', 'wide', 'long'],
      ['.', ',', '!', 'and', 'the', 'but', 'then'],
    ];

    const rowsDiv = document.getElementById(`focRows-${uid}`);
    let html = '';

    for (let k = 0; k <= K; k++) {
      const Fk = Math.min(fanouts[k], maxCols);
      const isBoost = k === K;

      
      let patternHtml = `<span class="fc-k-label">k=${k}</span>`;
      for (let i = 0; i < K; i++) {
        if (i < k) {
          patternHtml += `<div class="fc-token-slot fc-token-accept">\u2713</div>`;
        } else if (i === k && k < K) {
          patternHtml += `<div class="fc-token-slot fc-token-reject">\u2717</div>`;
        } else if (k === K) {
          patternHtml += `<div class="fc-token-slot fc-token-accept">\u2713</div>`;
        } else {
          patternHtml += `<div class="fc-token-slot fc-token-skip"></div>`;
        }
      }

      
      let cellsHtml = '';
      for (let c = 0; c < maxCols; c++) {
        if (c < Fk) {
          const cls = isBoost ? 'fc-cell-boost' : 'fc-cell-cached';
          const token = topTokens[k][c] || '?';
          cellsHtml += `<div class="fc-cell ${cls}">${token}</div>`;
        } else {
          cellsHtml += `<div class="fc-cell fc-cell-empty"></div>`;
        }
      }

      const labelCls = isBoost ? 'boost' : '';
      const boostNote = isBoost ? ' \u2191' : '';

      html += `
        <div class="fc-row ${isBoost ? 'fc-row-highlight' : ''}">
          <div class="fc-accept-pattern">${patternHtml}</div>
          <span class="fc-arrow">\u2192</span>
          <div class="fc-fanout-cells">${cellsHtml}</div>
          <span class="fc-fanout-label ${labelCls}">F=${Fk}${boostNote}</span>
        </div>
      `;
    }

    rowsDiv.innerHTML = html;

    
    document.getElementById(`focSummary-${uid}`).innerHTML = `
      <div class="fc-stat">
        <div class="fc-stat-label">Total cached</div>
        <div class="fc-stat-value purple">${totalCached} speculations</div>
      </div>
      <div class="fc-stat">
        <div class="fc-stat-label">Outcome space</div>
        <div class="fc-stat-value blue">${K + 1} \u00d7 32,000</div>
      </div>
      <div class="fc-stat">
        <div class="fc-stat-label">Covered by cache</div>
        <div class="fc-stat-value green">${totalCached} / ${((K + 1) * 32000).toLocaleString('en-US')}</div>
      </div>
    `;
  })();
  </script>
</div>

<h3 id="power-law-cache-hits">Power-Law Cache Hits</h3>
<p>The authors make a key empirical observation: cache miss probability follows a power law in the fan-out:</p>
$$1 - p_{\text{hit}}(F) = \frac{1}{F^r}$$<p>for some exponent $r > 0$ that depends on the draft-target alignment. Doubling the fan-out therefore multiplies the miss rate by a fixed factor of $2^{-r}$, which is a halving only when $r = 1$. The miss rate decays polynomially rather than exponentially, so each additional cached token buys less than the one before. This finding (confirmed across multiple model pairs and datasets) drives the allocation strategy.</p>
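<p>A minimal sketch of this power law, using illustrative exponents rather than measured ones. The helper names <code>missRate</code>, <code>hitRate</code>, and <code>doublingFactor</code> are introduced here for illustration only:</p>

```javascript
// Power-law miss model from the text: 1 - p_hit(F) = F^(-r).
// The exponents tried below are illustrative assumptions, not measurements.
function missRate(F, r) {
  return Math.pow(F, -r);
}

function hitRate(F, r) {
  return 1 - missRate(F, r);
}

// Doubling F scales the miss rate by 2^(-r), whatever F you start from.
function doublingFactor(F, r) {
  return missRate(2 * F, r) / missRate(F, r);
}

for (const r of [0.5, 0.8, 1.0]) {
  const row = [1, 2, 4, 8, 16].map(F => hitRate(F, r).toFixed(3));
  console.log(`r=${r}: hit rates ${row.join(', ')}; doubling factor ${doublingFactor(4, r).toFixed(3)}`);
}
```

<p>Each row climbs more slowly as $F$ grows, which is the diminishing return the allocation strategy has to work around.</p>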
<h3 id="geometric-fan-out">Geometric Fan-Out</h3>
<p>Given the power-law structure and a total budget $\sum_{k=0}^{K} F_k \leq B$, Saguaro solves a constrained optimization problem using Lagrange multipliers. The result (Theorem 12) is a geometric allocation:</p>
$$F_k = F_0 \cdot \alpha_p^{k/(1+r)} \quad \text{for } k < K$$<p>where $\alpha_p$ is the per-token acceptance rate and $F_0$ is determined by the budget constraint. The formula allocates more fan-out to earlier positions (small $k$) and less to later positions.</p>
<p>The reasoning: position $k=0$ (first token rejected) is more probable than $k=5$ (five tokens accepted before rejection) because each acceptance is an independent event with probability $\alpha_p < 1$. The probability of stopping at acceptance length $k$ is roughly $\alpha_p^k \cdot (1 - \alpha_p)$, a geometric distribution. Under the power-law miss rate, the expected cache hit rate is maximized by giving each outcome fan-out proportional to its probability raised to the power $1/(1+r)$, which is where the exponent $k/(1+r)$ comes from.</p>
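<p>A sketch of the Lagrange step, assuming the objective is the expected hit rate $\sum_k \pi_k \, (1 - F_k^{-r})$ and treating each $F_k$ as continuous. The symbols $\pi_k$ (outcome probability) and $\lambda$ (multiplier) are notation introduced here, not taken from the paper:</p>
$$\frac{\partial}{\partial F_k}\Big[\sum_j \pi_j \big(1 - F_j^{-r}\big) - \lambda \Big(\sum_j F_j - B\Big)\Big] = r \, \pi_k \, F_k^{-(1+r)} - \lambda = 0 \quad \Longrightarrow \quad F_k \propto \pi_k^{1/(1+r)}$$<p>Substituting $\pi_k \approx \alpha_p^k (1 - \alpha_p)$ gives $F_k / F_0 = \alpha_p^{k/(1+r)}$, which matches the geometric allocation stated above.</p>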
<p>There is one exception: position $k=K$ (all tokens accepted) receives a boost. When all $K$ draft tokens are accepted, the bonus token comes directly from the target model&rsquo;s distribution $p_{\text{target}}$ rather than the residual distribution. The target distribution is sharper and more concentrated, making the top-$F$ prediction more accurate. Saguaro accounts for this with a multiplicative correction:</p>
$$F_K = F_0 \cdot \alpha_p^{K/(1+r)} \cdot (1 - \alpha_p)^{-1/(1+r)}$$<div class="ssd-fanout-viz" id="ssd-fanout-0bb873ec347541318a79e0f80e8ddeb3">
  <style>
    .ssd-fanout-viz {
      --sf-bg: #0d1117;
      --sf-surface: #161b22;
      --sf-border: #30363d;
      --sf-text: #e6edf3;
      --sf-text-muted: #8b949e;
      --sf-blue: #58a6ff;
      --sf-purple: #a371f7;
      --sf-green: #39d353;
      --sf-orange: #d29922;
      --sf-red: #f97583;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--sf-bg);
      color: var(--sf-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssd-fanout-viz,
    :root:not([data-theme="dark"]) .ssd-fanout-viz {
      --sf-bg: #f8fafc;
      --sf-surface: #ffffff;
      --sf-border: #e2e8f0;
      --sf-text: #1e293b;
      --sf-text-muted: #64748b;
      --sf-blue: #3b82f6;
      --sf-purple: #8b5cf6;
      --sf-green: #10b981;
      --sf-orange: #f59e0b;
      --sf-red: #ef4444;
    }

    .ssd-fanout-viz * { box-sizing: border-box; }

    .sf-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .sf-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--sf-purple);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .sf-header p {
      color: var(--sf-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .sf-controls {
      background: var(--sf-surface);
      border: 1px solid var(--sf-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
      margin-bottom: 1.25rem;
      display: flex;
      align-items: center;
      gap: 1.5rem;
      flex-wrap: wrap;
    }

    .sf-control-group {
      display: flex;
      align-items: center;
      gap: 0.75rem;
    }

    .sf-control-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--sf-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.05em;
      white-space: nowrap;
    }

    .sf-slider {
      width: 120px;
      -webkit-appearance: none;
      appearance: none;
      height: 6px;
      border-radius: 3px;
      background: var(--sf-border);
      outline: none;
    }

    .sf-slider::-webkit-slider-thumb {
      -webkit-appearance: none;
      width: 16px;
      height: 16px;
      border-radius: 50%;
      background: var(--sf-purple);
      cursor: pointer;
      border: 2px solid var(--sf-bg);
      box-shadow: 0 2px 6px rgba(0,0,0,0.3);
    }

    .sf-slider::-moz-range-thumb {
      width: 16px;
      height: 16px;
      border-radius: 50%;
      background: var(--sf-purple);
      cursor: pointer;
      border: 2px solid var(--sf-bg);
    }

    .sf-value-display {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.8rem;
      font-weight: 600;
      color: var(--sf-purple);
      min-width: 35px;
    }

     
    .sf-chart-container {
      background: var(--sf-surface);
      border: 1px solid var(--sf-border);
      border-radius: 10px;
      padding: 1.5rem;
      margin-bottom: 1rem;
    }

    .sf-chart {
      position: relative;
      height: 240px;
      display: flex;
      align-items: flex-end;
      gap: 0;
      padding: 0 0 30px 45px;
    }

    .sf-y-axis {
      position: absolute;
      left: 0;
      top: 0;
      bottom: 30px;
      width: 45px;
      display: flex;
      flex-direction: column;
      justify-content: space-between;
      align-items: flex-end;
      padding-right: 8px;
    }

    .sf-y-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--sf-text-muted);
    }

    .sf-y-title {
      position: absolute;
      left: -5px;
      top: 50%;
      transform: translateY(-50%) rotate(-90deg);
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      color: var(--sf-text-muted);
      white-space: nowrap;
      letter-spacing: 0.05em;
    }

    .sf-bars {
      flex: 1;
      height: 100%;
      display: flex;
      align-items: flex-end;
      gap: 4px;
      position: relative;
    }

    .sf-bar-group {
      flex: 1;
      display: flex;
      flex-direction: column;
      align-items: center;
      height: 100%;
      justify-content: flex-end;
    }

    .sf-bar {
      width: 100%;
      max-width: 50px;
      border-radius: 4px 4px 0 0;
      transition: height 0.5s ease, background 0.3s ease;
      position: relative;
      cursor: default;
      min-height: 2px;
    }

    .sf-bar-geometric {
      background: linear-gradient(180deg, var(--sf-purple), rgba(163, 113, 247, 0.7));
    }

    .sf-bar-uniform {
      background: linear-gradient(180deg, var(--sf-orange), rgba(210, 153, 34, 0.7));
      position: absolute;
      bottom: 0;
      width: 100%;
      max-width: 50px;
      border-radius: 4px 4px 0 0;
      opacity: 0.4;
      border: 1px dashed var(--sf-orange);
    }

    .sf-bar-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      font-weight: 600;
      color: var(--sf-purple);
      margin-bottom: 4px;
      text-align: center;
    }

    .sf-bar-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      color: var(--sf-text-muted);
      margin-top: 8px;
      text-align: center;
    }

    .sf-bar-label-special {
      color: var(--sf-green);
      font-weight: 600;
    }

     
    .sf-gridlines {
      position: absolute;
      left: 0;
      right: 0;
      top: 0;
      bottom: 0;
      pointer-events: none;
    }

    .sf-gridline {
      position: absolute;
      left: 0;
      right: 0;
      height: 1px;
      background: var(--sf-border);
      opacity: 0.5;
    }

     
    .sf-budget {
      background: var(--sf-surface);
      border: 1px solid var(--sf-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
      margin-bottom: 1rem;
      display: flex;
      align-items: center;
      justify-content: space-between;
      flex-wrap: wrap;
      gap: 0.75rem;
    }

    .sf-budget-item {
      text-align: center;
    }

    .sf-budget-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      color: var(--sf-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.05em;
    }

    .sf-budget-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1.1rem;
      font-weight: 700;
    }

    .sf-budget-value.purple { color: var(--sf-purple); }
    .sf-budget-value.green { color: var(--sf-green); }
    .sf-budget-value.orange { color: var(--sf-orange); }

     
    .sf-legend {
      background: var(--sf-surface);
      border: 1px solid var(--sf-border);
      border-radius: 10px;
      padding: 0.75rem 1.25rem;
      display: flex;
      gap: 1.5rem;
      justify-content: center;
      flex-wrap: wrap;
    }

    .sf-legend-item {
      display: flex;
      align-items: center;
      gap: 0.4rem;
      font-size: 0.7rem;
      color: var(--sf-text-muted);
    }

    .sf-legend-swatch {
      width: 14px;
      height: 14px;
      border-radius: 3px;
    }

    .sf-swatch-geometric { background: var(--sf-purple); }

    .sf-swatch-uniform {
      background: rgba(210, 153, 34, 0.4);
      border: 1px dashed var(--sf-orange);
    }

    .sf-insight {
      background: rgba(163, 113, 247, 0.08);
      border: 1px solid rgba(163, 113, 247, 0.2);
      border-radius: 8px;
      padding: 1rem;
      margin-top: 1rem;
    }

    .sf-insight h5 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--sf-purple);
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin: 0 0 0.4rem 0;
    }

    .sf-insight p {
      font-size: 0.85rem;
      color: var(--sf-text);
      line-height: 1.5;
      margin: 0;
    }

    @media (max-width: 600px) {
      .sf-controls { flex-direction: column; align-items: flex-start; }
      .sf-chart { height: 200px; padding-left: 35px; }
      .sf-budget { flex-direction: column; }
    }
  </style>

  <div class="sf-header">
    <h3>Geometric Fan-Out Allocation</h3>
    <p>How Saguaro distributes its speculation budget across acceptance positions</p>
  </div>

  <div class="sf-controls">
    <div class="sf-control-group">
      <span class="sf-control-label">Acceptance rate (&alpha;)</span>
      <input type="range" class="sf-slider" id="alphaSlider-0bb873ec347541318a79e0f80e8ddeb3"
             min="40" max="95" value="75">
      <span class="sf-value-display" id="alphaDisplay-0bb873ec347541318a79e0f80e8ddeb3">0.75</span>
    </div>
    <div class="sf-control-group">
      <span class="sf-control-label">Budget (B)</span>
      <input type="range" class="sf-slider" id="budgetSlider-0bb873ec347541318a79e0f80e8ddeb3"
             min="10" max="40" value="24">
      <span class="sf-value-display" id="budgetDisplay-0bb873ec347541318a79e0f80e8ddeb3">24</span>
    </div>
  </div>

  <div class="sf-chart-container">
    <div class="sf-chart" id="chart-0bb873ec347541318a79e0f80e8ddeb3">
      <div class="sf-y-title">Fan-out F<sub>k</sub></div>
      <div class="sf-y-axis" id="yAxis-0bb873ec347541318a79e0f80e8ddeb3"></div>
      <div class="sf-bars" id="bars-0bb873ec347541318a79e0f80e8ddeb3">
        <div class="sf-gridlines" id="gridlines-0bb873ec347541318a79e0f80e8ddeb3"></div>
      </div>
    </div>
  </div>

  <div class="sf-budget" id="budgetInfo-0bb873ec347541318a79e0f80e8ddeb3">
    
  </div>

  <div class="sf-legend">
    <div class="sf-legend-item">
      <div class="sf-legend-swatch sf-swatch-geometric"></div>
      <span>Geometric allocation (Saguaro)</span>
    </div>
    <div class="sf-legend-item">
      <div class="sf-legend-swatch sf-swatch-uniform"></div>
      <span>Uniform allocation (baseline)</span>
    </div>
  </div>

  <div class="sf-insight">
    <h5>Why geometric works</h5>
    <p>The probability of reaching acceptance length k is roughly &alpha;<sup>k</sup>(1-&alpha;). Earlier positions are more likely, so they deserve more fan-out. Position K (all accepted) gets boosted because the bonus token comes from the sharper target distribution, making it easier to predict.</p>
  </div>

  <script>
  (function() {
    const uid = '0bb873ec347541318a79e0f80e8ddeb3';
    const K = 7;   // speculation depth: outcomes k = 0..K
    const r = 0.8; // power-law exponent, fixed for this demo

    const alphaSlider = document.getElementById(`alphaSlider-${uid}`);
    const budgetSlider = document.getElementById(`budgetSlider-${uid}`);
    const alphaDisplay = document.getElementById(`alphaDisplay-${uid}`);
    const budgetDisplay = document.getElementById(`budgetDisplay-${uid}`);
    const barsDiv = document.getElementById(`bars-${uid}`);
    const yAxisDiv = document.getElementById(`yAxis-${uid}`);
    const gridlinesDiv = document.getElementById(`gridlines-${uid}`);
    const budgetInfo = document.getElementById(`budgetInfo-${uid}`);

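    function computeGeometricFanout(alpha, B) {
      // Unnormalized allocation: F_k proportional to alpha^(k / (1 + r)) for k < K
      const fanouts = [];
      let rawSum = 0;

      for (let k = 0; k <= K; k++) {
        let raw;
        if (k < K) {
          raw = Math.pow(alpha, k / (1 + r));
        } else {
          // All-accept position: extra (1 - alpha)^(-1 / (1 + r)) boost
          raw = Math.pow(alpha, K / (1 + r)) * Math.pow(1 - alpha, -1 / (1 + r));
        }
        fanouts.push(raw);
        rawSum += raw;
      }

      // Scale to the budget B and round; each position keeps at least one slot,
      // so the rounded total can differ slightly from B
      const scale = B / rawSum;
      return fanouts.map(f => Math.max(1, Math.round(f * scale)));
    }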

    function computeUniformFanout(B) {
      const perPos = Math.floor(B / (K + 1));
      return Array(K + 1).fill(Math.max(1, perPos));
    }

    function render() {
      const alpha = parseInt(alphaSlider.value) / 100;
      const B = parseInt(budgetSlider.value);

      alphaDisplay.textContent = alpha.toFixed(2);
      budgetDisplay.textContent = B;

      const geometric = computeGeometricFanout(alpha, B);
      const uniform = computeUniformFanout(B);
      const maxVal = Math.max(...geometric, ...uniform);

      
      const yTicks = 5;
      let yHtml = '';
      for (let i = yTicks; i >= 0; i--) {
        const val = Math.round(maxVal * i / yTicks);
        yHtml += `<span class="sf-y-label">${val}</span>`;
      }
      yAxisDiv.innerHTML = yHtml;

      
      // Gridlines: built here and injected together with the bars below,
      // because barsDiv.innerHTML is rebuilt on every render
      let gridHtml = '';
      for (let i = 1; i < yTicks; i++) {
        const pct = (i / yTicks) * 100;
        gridHtml += `<div class="sf-gridline" style="bottom: ${pct}%;"></div>`;
      }

      // Bars: geometric allocation with the uniform baseline overlaid
      let barsHtml = '';
      const geoSum = geometric.reduce((a, b) => a + b, 0);

      for (let k = 0; k <= K; k++) {
        const geoH = maxVal > 0 ? (geometric[k] / maxVal) * 100 : 0;
        const uniH = maxVal > 0 ? (uniform[k] / maxVal) * 100 : 0;
        const isLast = k === K;

        barsHtml += `
          <div class="sf-bar-group">
            <span class="sf-bar-value">${geometric[k]}</span>
            <div style="position: relative; width: 100%; max-width: 50px;">
              <div class="sf-bar sf-bar-geometric" style="height: ${geoH}%;"></div>
              <div class="sf-bar-uniform" style="height: ${uniH}%;"></div>
            </div>
            <span class="sf-bar-label ${isLast ? 'sf-bar-label-special' : ''}">k=${k}${isLast ? ' *' : ''}</span>
          </div>
        `;
      }
      barsDiv.innerHTML = `<div class="sf-gridlines" id="gridlines-${uid}">${gridHtml}</div>` + barsHtml;

      
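      // Outcome probabilities: alpha^k (1 - alpha) for k < K, and alpha^K when all K are accepted.
      // For high alpha the all-accept outcome overtakes k=0, so pick the maximum instead of assuming k=0.
      const probs = geometric.map((_, k) =>
        k < K ? Math.pow(alpha, k) * (1 - alpha) : Math.pow(alpha, K));
      const likeliest = probs.indexOf(Math.max(...probs));

      budgetInfo.innerHTML = `
        <div class="sf-budget-item">
          <div class="sf-budget-label">Budget Used (Geometric)</div>
          <div class="sf-budget-value purple">${geoSum} / ${B}</div>
        </div>
        <div class="sf-budget-item">
          <div class="sf-budget-label">Most likely position</div>
          <div class="sf-budget-value green">k=${likeliest} (P=${(probs[likeliest] * 100).toFixed(0)}%)</div>
        </div>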
        <div class="sf-budget-item">
          <div class="sf-budget-label">All-accept position</div>
          <div class="sf-budget-value orange">k=${K} (boosted)</div>
        </div>
      `;
    }

    alphaSlider.addEventListener('input', render);
    budgetSlider.addEventListener('input', render);
    render();
  })();
  </script>
</div>
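<p>As a rough check on the claim that this allocation beats a uniform split, the sketch below scores both under the power-law model. It rests on assumptions that are not from the paper: fan-outs are left unrounded, the all-accept outcome is given probability $\alpha_p^K$, and the values of <code>K</code>, <code>r</code>, $\alpha_p$, and $B$ simply mirror the demo defaults. The function names are hypothetical.</p>

```javascript
// Toy model: outcome probabilities from the text, power-law hit rate per position.
const K = 7;   // speculation depth (illustrative)
const r = 0.8; // power-law exponent (illustrative)

function outcomeProbs(alpha) {
  const probs = [];
  for (let k = 0; k < K; k++) probs.push(Math.pow(alpha, k) * (1 - alpha));
  probs.push(Math.pow(alpha, K)); // all K draft tokens accepted
  return probs;
}

// Continuous allocation: F_k proportional to pi_k^(1 / (1 + r)), scaled to sum to B
function geometricAllocation(alpha, B) {
  const raw = outcomeProbs(alpha).map(p => Math.pow(p, 1 / (1 + r)));
  const total = raw.reduce((a, b) => a + b, 0);
  return raw.map(x => (x * B) / total);
}

function uniformAllocation(B) {
  return Array(K + 1).fill(B / (K + 1));
}

// Expected hit rate: sum over outcomes of pi_k * (1 - F_k^(-r))
function expectedHitRate(alloc, alpha) {
  return outcomeProbs(alpha).reduce(
    (acc, p, k) => acc + p * (1 - Math.pow(alloc[k], -r)), 0);
}

const alpha = 0.75, B = 24;
const geo = geometricAllocation(alpha, B);
console.log('geometric:', geo.map(f => f.toFixed(2)).join(', '));
console.log('expected hit rate (geometric):', expectedHitRate(geo, alpha).toFixed(3));
console.log('expected hit rate (uniform):  ', expectedHitRate(uniformAllocation(B), alpha).toFixed(3));
```

<p>Under these assumptions the ratio <code>geo[K] / geo[0]</code> coincides with the boosted $F_K / F_0$ from the formula above, and the geometric split scores higher than the uniform one because the objective is concave in each $F_k$.</p>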

<h2 id="challenge-2-trading-acceptance-for-cache-hits-saguaro-sampling">Challenge 2: Trading Acceptance for Cache Hits (Saguaro Sampling)</h2>
<h3 id="the-tension">The Tension</h3>
<p>There is a fundamental tension between two objectives in SSD. On one hand, we want the draft model to closely match the target model so that acceptance rates stay high. On the other hand, we want to predict the bonus token $t^*$ accurately so that cache hit rates stay high.</p>
<p>These objectives conflict. Here is why.</p>
<p>Recall from our <a href="/posts/speculative-decoding">speculative decoding post</a> that the bonus token is sampled from the residual distribution:</p>
$$r(\cdot) \propto \max(p_{\text{target}}(\cdot) - p_{\text{draft}}(\cdot), 0)$$<p>When the draft model closely matches the target ($p_{\text{draft}} \approx p_{\text{target}}$), the acceptance rate is high but the residual $\max(p_{\text{target}} - p_{\text{draft}}, 0)$ is spread thinly across many tokens. A thin residual means the bonus token could be almost anything, making it hard to predict and reducing cache hit rates.</p>
<p>When the draft model diverges from the target, the residual concentrates on tokens where $p_{\text{target}} \gg p_{\text{draft}}$, making the bonus token more predictable. But acceptance rates drop, meaning fewer tokens per round.</p>
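<p>To make the tension concrete, here is a small sketch on a hypothetical four-token vocabulary. The distributions are made up for illustration. <code>residual</code> follows the formula above, and <code>acceptanceRate</code> uses the standard speculative sampling acceptance probability $\sum_x \min(p_{\text{target}}(x), p_{\text{draft}}(x))$:</p>

```javascript
// Residual distribution: proportional to max(p_target(x) - p_draft(x), 0)
function residual(target, draft) {
  const gaps = target.map((p, i) => Math.max(p - draft[i], 0));
  const total = gaps.reduce((a, b) => a + b, 0);
  return total === 0 ? gaps : gaps.map(g => g / total);
}

// Per-token acceptance probability: sum over x of min(p_target(x), p_draft(x))
function acceptanceRate(target, draft) {
  return target.reduce((acc, p, i) => acc + Math.min(p, draft[i]), 0);
}

// Hypothetical distributions over a 4-token vocabulary
const target = [0.40, 0.30, 0.20, 0.10];
const alignedDraft = [0.46, 0.28, 0.18, 0.08];   // close to the target
const divergentDraft = [0.10, 0.30, 0.30, 0.30]; // badly underweights token 0

for (const [name, draft] of [['aligned', alignedDraft], ['divergent', divergentDraft]]) {
  const res = residual(target, draft);
  console.log(
    `${name}: acceptance=${acceptanceRate(target, draft).toFixed(2)}, ` +
    `residual=[${res.map(x => x.toFixed(2)).join(', ')}]`
  );
}
```

<p>The aligned draft is accepted more often, but its residual is split across three tokens. The divergent draft is accepted less often, and its residual collapses onto the single token it underweighted.</p>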
<h3 id="the-solution-intentional-misalignment">The Solution: Intentional Misalignment</h3>
<p>Saguaro sampling resolves this tension by deliberately modifying the draft model&rsquo;s sampling distribution. For a set of cached tokens (the top-$F$ tokens at each position), Saguaro suppresses the draft model&rsquo;s probability on those specific tokens:</p>
$$\sigma_{F,C}(z)_t \propto \begin{cases} C \cdot \exp(z_t) & \text{if } t \in \text{top}_F(z) \\ \exp(z_t) & \text{otherwise} \end{cases}$$<p>Here $z$ is the vector of draft model logits, $t$ indexes tokens, $F$ is the fan-out, and $C \in [0,1]$ is a downweighting constant. When $C=1$, this is the standard softmax with no modification. When $C < 1$, the cached tokens receive reduced probability in the draft&rsquo;s distribution.</p>
<p>Why does this help? Let&rsquo;s trace the effect on the residual.</p>
<p>When the draft model assigns <em>less</em> probability to a cached token, the gap $p_{\text{target}}(\cdot) - p_{\text{draft}}(\cdot)$ becomes <em>larger</em> for that token. A larger gap means the residual distribution assigns <em>more</em> probability to that token. Saguaro is steering the entire residual distribution to concentrate on the exact tokens it has cached.</p>
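<p>The downweighting rule can be sketched in a few lines. This is a toy illustration and not the paper&rsquo;s implementation: <code>saguaroSoftmax</code> and <code>cachedResidualMass</code> are hypothetical names, the numbers are the same toy values used in this post&rsquo;s interactive demo, and the residual mass on cached tokens stands in for the cache hit rate.</p>

```javascript
// sigma_{F,C}(z): scale the unnormalized weight of the top-F logits by C, then normalize.
function saguaroSoftmax(logits, F, C) {
  const maxLogit = Math.max(...logits);
  // Indices of the F largest logits
  const topF = new Set(
    logits
      .map((z, i) => [z, i])
      .sort((a, b) => b[0] - a[0])
      .slice(0, F)
      .map(pair => pair[1])
  );
  const weights = logits.map((z, i) => (topF.has(i) ? C : 1) * Math.exp(z - maxLogit));
  const total = weights.reduce((a, b) => a + b, 0);
  return { probs: weights.map(w => w / total), cached: topF };
}

// Share of the residual max(p - q, 0) that lands on cached tokens
function cachedResidualMass(target, draft, cached) {
  const gaps = target.map((p, i) => Math.max(p - draft[i], 0));
  const total = gaps.reduce((a, b) => a + b, 0);
  if (total === 0) return 0;
  let mass = 0;
  cached.forEach(i => { mass += gaps[i]; });
  return mass / total;
}

// Toy target distribution and draft logits (illustrative, not from the paper)
const targetP = [0.25, 0.18, 0.15, 0.12, 0.10, 0.08, 0.07, 0.05];
const draftLogits = [1.8, 1.4, 1.2, 0.9, 0.7, 0.5, 0.3, 0.1];

for (const C of [1.0, 0.75, 0.5, 0.25]) {
  const { probs, cached } = saguaroSoftmax(draftLogits, 3, C);
  const acceptance = targetP.reduce((acc, p, i) => acc + Math.min(p, probs[i]), 0);
  const mass = cachedResidualMass(targetP, probs, cached);
  console.log(`C=${C}: acceptance=${acceptance.toFixed(3)}, cached residual mass=${mass.toFixed(3)}`);
}
```

<p>On these toy numbers, lowering $C$ moves residual mass onto the three cached tokens while the acceptance probability falls, which is the trade the section describes.</p>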
<div class="ssd-sampling-viz" id="ssd-sampling-0bb873ec347541318a79e0f80e8ddeb3">
  <style>
    .ssd-sampling-viz {
      --ss-bg: #0d1117;
      --ss-surface: #161b22;
      --ss-border: #30363d;
      --ss-text: #e6edf3;
      --ss-text-muted: #8b949e;
      --ss-target-blue: #58a6ff;
      --ss-draft-orange: #d29922;
      --ss-green: #39d353;
      --ss-red: #f97583;
      --ss-purple: #a371f7;
      --ss-residual: #bc8cff;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--ss-bg);
      color: var(--ss-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssd-sampling-viz,
    :root:not([data-theme="dark"]) .ssd-sampling-viz {
      --ss-bg: #f8fafc;
      --ss-surface: #ffffff;
      --ss-border: #e2e8f0;
      --ss-text: #1e293b;
      --ss-text-muted: #64748b;
      --ss-target-blue: #3b82f6;
      --ss-draft-orange: #f59e0b;
      --ss-green: #10b981;
      --ss-red: #ef4444;
      --ss-purple: #8b5cf6;
      --ss-residual: #a78bfa;
    }

    .ssd-sampling-viz * { box-sizing: border-box; }

    .ss-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .ss-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--ss-purple);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .ss-header p {
      color: var(--ss-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .ss-c-control {
      background: var(--ss-surface);
      border: 1px solid var(--ss-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
      margin-bottom: 1.25rem;
      display: flex;
      align-items: center;
      gap: 1rem;
      flex-wrap: wrap;
    }

    .ss-c-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 600;
      color: var(--ss-text-muted);
      min-width: 150px;
    }

    .ss-c-slider {
      flex: 1;
      min-width: 120px;
      -webkit-appearance: none;
      appearance: none;
      height: 6px;
      border-radius: 3px;
      background: linear-gradient(90deg, var(--ss-green), var(--ss-purple), var(--ss-draft-orange));
      outline: none;
    }

    .ss-c-slider::-webkit-slider-thumb {
      -webkit-appearance: none;
      width: 20px;
      height: 20px;
      border-radius: 50%;
      background: var(--ss-text);
      cursor: pointer;
      border: 3px solid var(--ss-bg);
      box-shadow: 0 2px 8px rgba(0,0,0,0.4);
    }

    .ss-c-slider::-moz-range-thumb {
      width: 20px;
      height: 20px;
      border-radius: 50%;
      background: var(--ss-text);
      cursor: pointer;
      border: 3px solid var(--ss-bg);
    }

    .ss-c-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1.1rem;
      font-weight: 700;
      color: var(--ss-purple);
      min-width: 55px;
      text-align: center;
    }

    .ss-c-endpoints {
      display: flex;
      justify-content: space-between;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--ss-text-muted);
      margin-top: 0.25rem;
    }

     
    .ss-comparison {
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 1rem;
      margin-bottom: 1rem;
    }

    .ss-panel {
      background: var(--ss-surface);
      border: 1px solid var(--ss-border);
      border-radius: 10px;
      padding: 1.25rem;
    }

    .ss-panel-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--ss-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.08em;
      margin: 0 0 1rem 0;
      text-align: center;
    }

    .ss-token-row {
      display: flex;
      align-items: center;
      gap: 0.5rem;
      margin-bottom: 0.5rem;
    }

    .ss-token-row:last-child { margin-bottom: 0; }

    .ss-token-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 600;
      width: 50px;
      min-width: 50px;
      color: var(--ss-text);
      text-align: right;
    }

    .ss-token-label.cached {
      color: var(--ss-purple);
    }

    .ss-bar-container {
      flex: 1;
      height: 22px;
      background: rgba(128,128,128,0.08);
      border-radius: 4px;
      position: relative;
      overflow: hidden;
    }

    .ss-bar {
      height: 100%;
      border-radius: 4px;
      transition: width 0.4s ease;
      position: absolute;
      top: 0;
      left: 0;
    }

    .ss-bar-target {
      background: linear-gradient(90deg, var(--ss-target-blue), #79c0ff);
      opacity: 0.4;
      z-index: 1;
    }

    .ss-bar-draft {
      background: linear-gradient(90deg, var(--ss-draft-orange), #e3b341);
      z-index: 2;
    }

    .ss-bar-residual {
      background: linear-gradient(90deg, var(--ss-residual), #d8b4fe);
      z-index: 2;
    }

    .ss-bar-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      min-width: 35px;
      text-align: right;
    }

    .ss-bar-value.target { color: var(--ss-target-blue); }
    .ss-bar-value.draft { color: var(--ss-draft-orange); }
    .ss-bar-value.residual { color: var(--ss-residual); }

    .ss-cached-marker {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.55rem;
      color: var(--ss-purple);
      margin-left: 0.25rem;
    }

     
    .ss-metrics {
      display: grid;
      grid-template-columns: repeat(3, 1fr);
      gap: 0.75rem;
      margin-bottom: 1rem;
    }

    .ss-metric {
      background: var(--ss-surface);
      border: 1px solid var(--ss-border);
      border-radius: 10px;
      padding: 0.75rem;
      text-align: center;
    }

    .ss-metric-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--ss-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.05em;
      margin-bottom: 0.3rem;
    }

    .ss-metric-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1.2rem;
      font-weight: 700;
    }

    .ss-metric-value.green { color: var(--ss-green); }
    .ss-metric-value.orange { color: var(--ss-draft-orange); }
    .ss-metric-value.purple { color: var(--ss-purple); }

    .ss-insight {
      background: rgba(163, 113, 247, 0.08);
      border: 1px solid rgba(163, 113, 247, 0.2);
      border-radius: 8px;
      padding: 1rem;
    }

    .ss-insight h5 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--ss-purple);
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin: 0 0 0.4rem 0;
    }

    .ss-insight p {
      font-size: 0.85rem;
      color: var(--ss-text);
      line-height: 1.5;
      margin: 0;
    }

    @media (max-width: 700px) {
      .ss-comparison { grid-template-columns: 1fr; }
      .ss-metrics { grid-template-columns: 1fr; }
    }
  </style>

  <div class="ss-header">
    <h3>Saguaro Sampling: Trading Acceptance for Cache Hits</h3>
    <p>How downweighting cached tokens in the draft forces the residual to concentrate on those same tokens</p>
  </div>

  <div class="ss-c-control">
    <span class="ss-c-label">Downweighting C</span>
    <input type="range" class="ss-c-slider" id="cSlider-0bb873ec347541318a79e0f80e8ddeb3"
           min="5" max="100" value="100">
    <span class="ss-c-value" id="cValue-0bb873ec347541318a79e0f80e8ddeb3">1.00</span>
    <div style="width: 100%;">
      <div class="ss-c-endpoints">
        <span>C&rarr;0 (max cache hits, low acceptance)</span>
        <span>C=1 (standard, no modification)</span>
      </div>
    </div>
  </div>

  <div class="ss-comparison">
    
    <div class="ss-panel">
      <h4 class="ss-panel-title" style="color: var(--ss-draft-orange);">
        Draft Distribution q(x) with Saguaro
      </h4>
      <div id="draftPanel-0bb873ec347541318a79e0f80e8ddeb3"></div>
    </div>

    
    <div class="ss-panel">
      <h4 class="ss-panel-title" style="color: var(--ss-residual);">
        Residual Distribution max(p - q, 0)
      </h4>
      <div id="residualPanel-0bb873ec347541318a79e0f80e8ddeb3"></div>
    </div>
  </div>

  <div class="ss-metrics">
    <div class="ss-metric">
      <div class="ss-metric-label">Acceptance Rate</div>
      <div class="ss-metric-value orange" id="acceptRate-0bb873ec347541318a79e0f80e8ddeb3">0.82</div>
    </div>
    <div class="ss-metric">
      <div class="ss-metric-label">Cache Hit Rate</div>
      <div class="ss-metric-value purple" id="hitRate-0bb873ec347541318a79e0f80e8ddeb3">0.45</div>
    </div>
    <div class="ss-metric">
      <div class="ss-metric-label">Net Speedup Effect</div>
      <div class="ss-metric-value green" id="netEffect-0bb873ec347541318a79e0f80e8ddeb3">1.0x</div>
    </div>
  </div>

  <div class="ss-insight">
    <h5>The tradeoff</h5>
    <p>Slide C toward 0 to see how suppressing draft probability on cached tokens (marked with a <span style="color: var(--ss-purple);">purple asterisk</span>) forces the residual distribution to concentrate on those same tokens. This increases cache hit rate at the cost of acceptance rate. The optimal C* balances both effects for maximum end-to-end speedup.</p>
  </div>

  <script>
  (function() {
    const uid = '0bb873ec347541318a79e0f80e8ddeb3';

    
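    const tokens = ['the', 'a', 'cat', 'sat', 'on', 'mat', 'dog', 'ran'];
    // Toy target distribution p(x) over the tokens above (sums to 1)
    const targetP = [0.25, 0.18, 0.15, 0.12, 0.10, 0.08, 0.07, 0.05];

    // Draft logits before any Saguaro downweighting
    const baseLogits = [1.8, 1.4, 1.2, 0.9, 0.7, 0.5, 0.3, 0.1];

    // Cached tokens: the top-3 draft logits, matching fanOut below
    const cachedIndices = new Set([0, 1, 2]);
    const fanOut = 3;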

    const cSlider = document.getElementById(`cSlider-${uid}`);
    const cValueEl = document.getElementById(`cValue-${uid}`);

    function softmax(logits) {
      const max = Math.max(...logits);
      const exps = logits.map(l => Math.exp(l - max));
      const sum = exps.reduce((a, b) => a + b, 0);
      return exps.map(e => e / sum);
    }

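    function computeSaguaroDraft(C) {
      // Adding log(C) to a logit multiplies its softmax weight by C,
      // which is the downweighting applied to cached tokens
      const modified = baseLogits.map((l, i) =>
        cachedIndices.has(i) ? Math.log(C) + l : l
      );
      return softmax(modified);
    }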

    function computeResidual(draftQ) {
      const residual = targetP.map((p, i) => Math.max(0, p - draftQ[i]));
      const sum = residual.reduce((a, b) => a + b, 0);
      if (sum === 0) return residual;
      return residual.map(r => r / sum);
    }

    function computeAcceptanceRate(draftQ) {
      return targetP.reduce((acc, p, i) => acc + Math.min(p, draftQ[i]), 0);
    }

    function computeCacheHitRate(residual) {
      
      let hitProb = 0;
      cachedIndices.forEach(i => { hitProb += residual[i]; });
      return hitProb;
    }

    function render() {
      const C = parseInt(cSlider.value) / 100;
      cValueEl.textContent = C.toFixed(2);

      const draftQ = computeSaguaroDraft(C);
      const residual = computeResidual(draftQ);
      const acceptance = computeAcceptanceRate(draftQ);
      const hitRate = computeCacheHitRate(residual);

      const maxDraft = Math.max(...draftQ, ...targetP);
      const maxResidual = Math.max(...residual, 0.01);

      
      const draftPanel = document.getElementById(`draftPanel-${uid}`);
      let draftHtml = '';
      tokens.forEach((token, i) => {
        const isCached = cachedIndices.has(i);
        const qWidth = (draftQ[i] / maxDraft) * 100;
        const pWidth = (targetP[i] / maxDraft) * 100;

        draftHtml += `
          <div class="ss-token-row">
            <span class="ss-token-label ${isCached ? 'cached' : ''}">${token}${isCached ? ' *' : ''}</span>
            <div class="ss-bar-container">
              <div class="ss-bar ss-bar-target" style="width: ${pWidth}%;"></div>
              <div class="ss-bar ss-bar-draft" style="width: ${qWidth}%;"></div>
            </div>
            <span class="ss-bar-value draft">${draftQ[i].toFixed(2)}</span>
          </div>
        `;
      });
      draftPanel.innerHTML = draftHtml;

      
      const residualPanel = document.getElementById(`residualPanel-${uid}`);
      let resHtml = '';
      tokens.forEach((token, i) => {
        const isCached = cachedIndices.has(i);
        const rWidth = (residual[i] / maxResidual) * 100;

        resHtml += `
          <div class="ss-token-row">
            <span class="ss-token-label ${isCached ? 'cached' : ''}">${token}${isCached ? ' *' : ''}</span>
            <div class="ss-bar-container">
              <div class="ss-bar ss-bar-residual" style="width: ${rWidth}%;"></div>
            </div>
            <span class="ss-bar-value residual">${residual[i].toFixed(2)}</span>
          </div>
        `;
      });
      residualPanel.innerHTML = resHtml;

      
      document.getElementById(`acceptRate-${uid}`).textContent = acceptance.toFixed(2);
      document.getElementById(`hitRate-${uid}`).textContent = hitRate.toFixed(2);

      
      
      
      
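      // Toy speedup model; verification time is normalized to 1.
      // Baseline: standard SD with an assumed per-token acceptance rate of 0.82.
      const baseTokPerRound = 1 / (1 - 0.82);
      // Expected tokens per round under a geometric acceptance model (draft length cap ignored).
      const tokPerRound = 1 / (1 - acceptance);
      // Drafting cost as a fraction of verification time.
      const draftOverhead = 0.3;
      const sdTime = 1 + draftOverhead;
      // Cache hit: pay verification only. Cache miss: pay verification plus drafting.
      const ssdTime = hitRate * 1 + (1 - hitRate) * sdTime;
      const netSpeedup = (tokPerRound / ssdTime) / (baseTokPerRound / sdTime);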

      const netEl = document.getElementById(`netEffect-${uid}`);
      netEl.textContent = netSpeedup.toFixed(2) + 'x';
      if (netSpeedup >= 1.0) {
        netEl.style.color = 'var(--ss-green)';
      } else {
        netEl.style.color = 'var(--ss-red)';
      }
    }

    cSlider.addEventListener('input', render);
    render();
  })();
  </script>
</div>

<p><strong>Theorem 15</strong> from the paper confirms this formally: the cache hit rate $p_{\text{hit}}$ increases monotonically as $C \to 0$. Push $C$ all the way to zero and the residual distribution is forced entirely onto cached tokens (guaranteeing a hit), but the acceptance rate collapses because the draft distribution diverges maximally from the target.</p>
<p>The optimal $C^*$ balances these competing effects and depends on the sampling temperature:</p>
<ul>
<li><strong>Temperature 0 (greedy decoding)</strong>: $C = 1$ is optimal. The bonus token is deterministic (the argmax of the target distribution), so the top-1 draft prediction already has high accuracy. No need to sacrifice acceptance rate.</li>
<li><strong>High temperature</strong>: $C \ll 1$ becomes advantageous. The bonus token is sampled from a flatter distribution, making it harder to predict without help. Saguaro sampling concentrates the residual, recovering cache hit rates that would otherwise be low.</li>
</ul>
<p>In practice, Saguaro sampling provides up to 50% additional end-to-end speedup at high temperatures compared to using $C=1$.</p>
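<p>To make the trade-off concrete, here is a minimal sketch of the three quantities involved: the acceptance rate $\sum_i \min(p_i, q_i)$, the residual distribution $\max(0, p_i - q_i)$ (renormalized), and the residual mass that falls on cached tokens. It does not implement the Saguaro reweighting itself. The four-token vocabulary, the probabilities, and the choice of cached token are all made up for illustration.</p>

```python
def acceptance_rate(target_p, draft_q):
    """Probability that one draft token is accepted: sum_i min(p_i, q_i)."""
    return sum(min(p, q) for p, q in zip(target_p, draft_q))


def residual_distribution(target_p, draft_q):
    """Distribution the bonus token is drawn from after a rejection."""
    raw = [max(0.0, p - q) for p, q in zip(target_p, draft_q)]
    total = sum(raw)
    if total == 0:
        return raw  # draft matches target exactly, so nothing is ever rejected
    return [r / total for r in raw]


def cache_hit_probability(residual, cached_indices):
    """Residual mass that lands on tokens with a pre-computed speculation."""
    return sum(residual[i] for i in cached_indices)


# Hypothetical numbers: the draft under-weights token 1, which is cached.
target_p = [0.5, 0.3, 0.1, 0.1]
draft_q = [0.7, 0.1, 0.1, 0.1]
cached = {1}

acc = acceptance_rate(target_p, draft_q)
res = residual_distribution(target_p, draft_q)
hit = cache_hit_probability(res, cached)
print(f"acceptance={acc:.2f} residual={res} hit={hit:.2f}")
```

<p>In this toy case the draft gives up some acceptance (0.8 rather than 1.0), and in exchange all of the residual mass sits on the cached token. That is the same tension $C$ controls.</p>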
<h2 id="challenge-3-scaling-to-large-batches-saguaro-fallback">Challenge 3: Scaling to Large Batches (Saguaro Fallback)</h2>
<h3 id="the-batch-size-problem">The Batch Size Problem</h3>
<p>Everything so far has assumed batch size 1 (a single sequence). At larger batch sizes, a new problem emerges.</p>
<p>With batch size $b$, the system can only proceed to the next round when <em>all</em> $b$ sequences have speculations ready. The probability that every sequence in the batch gets a cache hit is:</p>
$$P(\text{all hit}) = p_{\text{hit}}^b$$<p>Even with $p_{\text{hit}} = 0.9$ per sequence, a batch of 16 sequences gives $P(\text{all hit}) = 0.9^{16} \approx 0.19$. At batch size 32, it drops to $0.9^{32} \approx 0.03$. The all-hit probability decays exponentially in $b$, so the probability of at least one cache miss, $1 - p_{\text{hit}}^b$, quickly approaches 1, and a single miss stalls the entire batch.</p>
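<p>The arithmetic is easy to check. The sketch below assumes, as the formula does, that cache hits are independent across sequences; the function names are mine.</p>

```python
def p_all_hit(p_hit, batch_size):
    """Probability that every sequence in the batch gets a cache hit."""
    return p_hit ** batch_size


def p_any_miss(p_hit, batch_size):
    """Probability that at least one sequence misses and stalls the batch."""
    return 1.0 - p_all_hit(p_hit, batch_size)


for b in (1, 8, 16, 32):
    print(f"b={b:2d}  P(all hit)={p_all_hit(0.9, b):.2f}  "
          f"P(any miss)={p_any_miss(0.9, b):.2f}")
```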
<h3 id="the-naive-fallback-fails">The Naive Fallback Fails</h3>
<p>The intuitive solution is simple: when a cache miss occurs, have the primary draft model generate a fresh speculation on the spot. But this is catastrophically bad at scale.</p>
<p>The primary draft model is still a neural network that generates tokens autoregressively. Generating K tokens takes non-trivial time. While this one stalled sequence catches up, every other sequence in the batch (including those with cache hits) waits. The batch is only as fast as its slowest member.</p>
<p>Corollary 16 in the paper formalizes this: at large batch sizes, the SSD speedup is dominated by the fallback latency. If the fallback speculator is slow, the theoretical gains from caching vanish.</p>
<h3 id="dual-tier-fallback">Dual-Tier Fallback</h3>
<p>Saguaro solves this with a dual-tier fallback strategy controlled by a critical batch size $b^*$:</p>
<p><strong>Below $b^*$ (low batch regime)</strong>: Cache misses are infrequent. The primary draft model serves as its own fallback, generating a fresh speculation synchronously. The latency penalty is acceptable because misses are rare and each one affects only one sequence.</p>
<p><strong>Above $b^*$ (high batch regime)</strong>: The system switches to an ultra-fast backup speculator with minimal latency. This could be an n-gram model, random tokens, or a token frequency model. The backup&rsquo;s speculations will likely be rejected during verification (random tokens have near-zero acceptance rate). But the strategic insight is that the latency cost of feeding bad speculations to one sequence is vastly less than the latency cost of making the entire batch wait for a neural draft model.</p>
<p><strong>Theorem 17</strong> proves this formally: accepting a single sequence&rsquo;s quality penalty (poor speculation) is strictly better than inflicting the primary drafter&rsquo;s latency across the entire batch.</p>
<p>The critical batch size $b^*$ is derived analytically from the speedup equation and depends on $p_{\text{hit}}$, $T_p$ (primary drafter latency), and $T_b$ (backup latency).</p>
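<p>Once $b^*$ is known, the dispatch rule is a one-liner. The sketch below takes $b^*$ as an input rather than deriving it, since the closed-form expression is in the paper and not reproduced here. The two speculators are hypothetical stand-ins, and treating a batch of exactly $b^*$ as the high-batch regime is my assumption.</p>

```python
from typing import Callable, List

# A speculator maps (context tokens, k) to k draft tokens.
Speculator = Callable[[List[int], int], List[int]]


def choose_fallback(batch_size: int, b_star: int,
                    primary: Speculator, backup: Speculator) -> Speculator:
    """Pick the speculator to run on a cache miss."""
    # Assumption: a batch of exactly b_star counts as the high-batch regime.
    return backup if batch_size >= b_star else primary


def primary_draft(context: List[int], k: int) -> List[int]:
    # Stand-in for the neural drafter: slow, but its tokens are often accepted.
    return [context[-1] + i + 1 for i in range(k)]


def backup_draft(context: List[int], k: int) -> List[int]:
    # Stand-in for a near-zero-latency speculator; it just repeats the last token.
    return [context[-1]] * k


fallback = choose_fallback(batch_size=32, b_star=8,
                           primary=primary_draft, backup=backup_draft)
print(fallback([10, 11, 12], 4))
```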
<div class="ssd-fallback-viz" id="ssd-fallback-0bb873ec347541318a79e0f80e8ddeb3">
  <style>
    .ssd-fallback-viz {
      --fb-bg: #0d1117;
      --fb-surface: #161b22;
      --fb-border: #30363d;
      --fb-text: #e6edf3;
      --fb-text-muted: #8b949e;
      --fb-blue: #58a6ff;
      --fb-orange: #d29922;
      --fb-green: #39d353;
      --fb-red: #f97583;
      --fb-purple: #a371f7;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--fb-bg);
      color: var(--fb-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssd-fallback-viz,
    :root:not([data-theme="dark"]) .ssd-fallback-viz {
      --fb-bg: #f8fafc;
      --fb-surface: #ffffff;
      --fb-border: #e2e8f0;
      --fb-text: #1e293b;
      --fb-text-muted: #64748b;
      --fb-blue: #3b82f6;
      --fb-orange: #f59e0b;
      --fb-green: #10b981;
      --fb-red: #ef4444;
      --fb-purple: #8b5cf6;
    }

    .ssd-fallback-viz * { box-sizing: border-box; }

    .fb-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .fb-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--fb-purple);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .fb-header p {
      color: var(--fb-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .fb-controls {
      background: var(--fb-surface);
      border: 1px solid var(--fb-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
      margin-bottom: 1.25rem;
      display: flex;
      align-items: center;
      gap: 1rem;
      flex-wrap: wrap;
    }

    .fb-control-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--fb-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.05em;
      white-space: nowrap;
    }

    .fb-slider {
      flex: 1;
      min-width: 120px;
      -webkit-appearance: none;
      appearance: none;
      height: 6px;
      border-radius: 3px;
      background: var(--fb-border);
      outline: none;
    }

    .fb-slider::-webkit-slider-thumb {
      -webkit-appearance: none;
      width: 18px;
      height: 18px;
      border-radius: 50%;
      background: var(--fb-blue);
      cursor: pointer;
      border: 2px solid var(--fb-bg);
      box-shadow: 0 2px 6px rgba(0,0,0,0.3);
    }

    .fb-slider::-moz-range-thumb {
      width: 18px;
      height: 18px;
      border-radius: 50%;
      background: var(--fb-blue);
      cursor: pointer;
      border: 2px solid var(--fb-bg);
    }

    .fb-value-display {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--fb-blue);
      min-width: 25px;
    }

     
    .fb-chart-container {
      background: var(--fb-surface);
      border: 1px solid var(--fb-border);
      border-radius: 10px;
      padding: 1.5rem;
      margin-bottom: 1rem;
    }

    .fb-chart {
      position: relative;
      height: 220px;
      padding: 0 0 35px 55px;
    }

    .fb-y-title {
      position: absolute;
      left: 0;
      top: 50%;
      transform: translateY(-50%) rotate(-90deg);
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      font-weight: 600;
      color: var(--fb-text-muted);
      white-space: nowrap;
    }

    .fb-x-title {
      position: absolute;
      bottom: 0;
      left: 55px;
      right: 0;
      text-align: center;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      font-weight: 600;
      color: var(--fb-text-muted);
    }

    .fb-canvas-area {
      position: absolute;
      left: 55px;
      right: 10px;
      top: 10px;
      bottom: 35px;
    }

    .fb-y-axis {
      position: absolute;
      left: 30px;
      top: 10px;
      bottom: 35px;
      width: 25px;
      display: flex;
      flex-direction: column;
      justify-content: space-between;
      align-items: flex-end;
    }

    .fb-y-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--fb-text-muted);
    }

     
    .fb-chart svg {
      position: absolute;
      left: 55px;
      right: 10px;
      top: 10px;
      bottom: 35px;
      width: calc(100% - 65px);
      height: calc(100% - 45px);
    }

     
    .fb-regimes {
      display: grid;
      grid-template-columns: 1fr auto 1fr;
      gap: 0;
      margin-bottom: 1rem;
    }

    .fb-regime {
      background: var(--fb-surface);
      border: 1px solid var(--fb-border);
      padding: 1rem;
      text-align: center;
    }

    .fb-regime:first-child {
      border-radius: 10px 0 0 10px;
    }

    .fb-regime:last-child {
      border-radius: 0 10px 10px 0;
    }

    .fb-regime-divider {
      width: 3px;
      background: var(--fb-purple);
      position: relative;
      display: flex;
      align-items: center;
      justify-content: center;
    }

    .fb-regime-divider-label {
      position: absolute;
      background: var(--fb-purple);
      color: #fff;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      padding: 0.2rem 0.5rem;
      border-radius: 4px;
      white-space: nowrap;
    }

    .fb-regime-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      text-transform: uppercase;
      letter-spacing: 0.05em;
      margin-bottom: 0.5rem;
    }

    .fb-regime-title.primary { color: var(--fb-green); }
    .fb-regime-title.backup { color: var(--fb-orange); }

    .fb-regime-desc {
      font-size: 0.8rem;
      color: var(--fb-text-muted);
      margin: 0 0 0.5rem 0;
      line-height: 1.4;
    }

    .fb-regime-speculator {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      padding: 0.3rem 0.6rem;
      border-radius: 5px;
      display: inline-block;
    }

    .fb-speculator-primary {
      background: rgba(57, 211, 83, 0.15);
      color: var(--fb-green);
      border: 1px solid rgba(57, 211, 83, 0.3);
    }

    .fb-speculator-backup {
      background: rgba(210, 153, 34, 0.15);
      color: var(--fb-orange);
      border: 1px solid rgba(210, 153, 34, 0.3);
    }

     
    .fb-stall-info {
      background: var(--fb-surface);
      border: 1px solid var(--fb-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
      margin-bottom: 1rem;
      display: flex;
      align-items: center;
      justify-content: space-around;
      flex-wrap: wrap;
      gap: 1rem;
    }

    .fb-stall-item {
      text-align: center;
    }

    .fb-stall-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--fb-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.05em;
      margin-bottom: 0.25rem;
    }

    .fb-stall-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1.1rem;
      font-weight: 700;
    }

    .fb-stall-value.green { color: var(--fb-green); }
    .fb-stall-value.red { color: var(--fb-red); }
    .fb-stall-value.purple { color: var(--fb-purple); }

    .fb-insight {
      background: rgba(163, 113, 247, 0.08);
      border: 1px solid rgba(163, 113, 247, 0.2);
      border-radius: 8px;
      padding: 1rem;
    }

    .fb-insight h5 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--fb-purple);
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin: 0 0 0.4rem 0;
    }

    .fb-insight p {
      font-size: 0.85rem;
      color: var(--fb-text);
      line-height: 1.5;
      margin: 0;
    }

    @media (max-width: 600px) {
      .fb-regimes { grid-template-columns: 1fr; }
      .fb-regime:first-child { border-radius: 10px 10px 0 0; }
      .fb-regime:last-child { border-radius: 0 0 10px 10px; }
      .fb-regime-divider {
        width: 100%;
        height: 3px;
      }
      .fb-stall-info { flex-direction: column; }
    }
  </style>

  <div class="fb-header">
    <h3>Dual-Tier Fallback Strategy</h3>
    <p>How batch size determines the optimal fallback speculator</p>
  </div>

  <div class="fb-controls">
    <span class="fb-control-label">Per-sequence cache hit rate</span>
    <input type="range" class="fb-slider" id="hitSlider-0bb873ec347541318a79e0f80e8ddeb3"
           min="50" max="95" value="85">
    <span class="fb-value-display" id="hitDisplay-0bb873ec347541318a79e0f80e8ddeb3">0.85</span>
  </div>

  
  <div class="fb-chart-container">
    <div class="fb-chart" id="stallChart-0bb873ec347541318a79e0f80e8ddeb3">
      <span class="fb-y-title">P(at least one miss)</span>
      <span class="fb-x-title">Batch size (b)</span>
      <div class="fb-y-axis" id="yAxis-0bb873ec347541318a79e0f80e8ddeb3">
        <span class="fb-y-label">1.0</span>
        <span class="fb-y-label">0.75</span>
        <span class="fb-y-label">0.50</span>
        <span class="fb-y-label">0.25</span>
        <span class="fb-y-label">0.0</span>
      </div>
      <svg id="stallSvg-0bb873ec347541318a79e0f80e8ddeb3" viewBox="0 0 400 175" preserveAspectRatio="none">
        
      </svg>
    </div>
  </div>

  <div class="fb-stall-info" id="stallInfo-0bb873ec347541318a79e0f80e8ddeb3">
    
  </div>

  
  <div class="fb-regimes">
    <div class="fb-regime">
      <div class="fb-regime-title primary">Low Batch Regime</div>
      <p class="fb-regime-desc">Misses are infrequent. Quality matters more than speed.</p>
      <span class="fb-regime-speculator fb-speculator-primary">Primary Draft (Neural)</span>
    </div>
    <div class="fb-regime-divider">
      <span class="fb-regime-divider-label" id="bStarLabel-0bb873ec347541318a79e0f80e8ddeb3">b* = 5</span>
    </div>
    <div class="fb-regime">
      <div class="fb-regime-title backup">High Batch Regime</div>
      <p class="fb-regime-desc">Misses are near-certain. Speed matters more than quality.</p>
      <span class="fb-regime-speculator fb-speculator-backup">Fast Backup (n-gram/random)</span>
    </div>
  </div>

  <div class="fb-insight">
    <h5>Why fast beats accurate at scale</h5>
    <p>At batch size 32 with 85% per-sequence hit rate, P(at least one miss) is about 99.4%. A miss is nearly guaranteed every round. Using the primary neural draft as fallback would stall the entire batch for its generation time. A fast backup (even random tokens) unblocks the batch instantly, and only the single stalled sequence pays a quality penalty.</p>
  </div>

  <script>
  (function() {
    const uid = '0bb873ec347541318a79e0f80e8ddeb3';
    const hitSlider = document.getElementById(`hitSlider-${uid}`);
    const hitDisplay = document.getElementById(`hitDisplay-${uid}`);
    const svg = document.getElementById(`stallSvg-${uid}`);
    const stallInfo = document.getElementById(`stallInfo-${uid}`);
    const bStarLabel = document.getElementById(`bStarLabel-${uid}`);

    const maxBatch = 48;

    function render() {
      const pHit = parseInt(hitSlider.value) / 100;
      hitDisplay.textContent = pHit.toFixed(2);

      
      const points = [];
      for (let b = 1; b <= maxBatch; b++) {
        const pStall = 1 - Math.pow(pHit, b);
        points.push({ b, pStall });
      }

      
      // Illustrative proxy for b*: the first batch size where P(at least one miss) exceeds 50%.
      const bStar = points.find(p => p.pStall > 0.5)?.b || maxBatch;
      bStarLabel.textContent = `b* = ${bStar}`;

      
      const w = 400, h = 175;

      
      const pathPoints = points.map(p => {
        const x = ((p.b - 1) / (maxBatch - 1)) * w;
        const y = h - (p.pStall * h);
        return `${x},${y}`;
      });

      
      const threshY = h - (0.5 * h);

      
      const bStarX = ((bStar - 1) / (maxBatch - 1)) * w;

      
      const fillPoints = [`0,${h}`, ...pathPoints, `${w},${h}`];

      let svgContent = `
        \x3C!-- Grid lines -->
        <line x1="0" y1="${h*0.25}" x2="${w}" y2="${h*0.25}" stroke="var(--fb-border)" stroke-width="0.5" opacity="0.5"/>
        <line x1="0" y1="${h*0.5}" x2="${w}" y2="${h*0.5}" stroke="var(--fb-border)" stroke-width="0.5" opacity="0.5"/>
        <line x1="0" y1="${h*0.75}" x2="${w}" y2="${h*0.75}" stroke="var(--fb-border)" stroke-width="0.5" opacity="0.5"/>

        \x3C!-- Green zone (below b*) -->
        <rect x="0" y="0" width="${bStarX}" height="${h}" fill="rgba(57, 211, 83, 0.05)"/>

        \x3C!-- Red zone (above b*) -->
        <rect x="${bStarX}" y="0" width="${w - bStarX}" height="${h}" fill="rgba(249, 117, 131, 0.05)"/>

        \x3C!-- Fill under curve -->
        <polygon points="${fillPoints.join(' ')}" fill="var(--fb-red)" opacity="0.15"/>

        \x3C!-- Main curve -->
        <polyline points="${pathPoints.join(' ')}" fill="none" stroke="var(--fb-red)" stroke-width="2.5" stroke-linejoin="round"/>

        \x3C!-- b* vertical line -->
        <line x1="${bStarX}" y1="0" x2="${bStarX}" y2="${h}" stroke="var(--fb-purple)" stroke-width="2" stroke-dasharray="6,4"/>

        \x3C!-- b* label -->
        <text x="${bStarX}" y="${h - 5}" fill="var(--fb-purple)" font-family="'IBM Plex Mono', monospace" font-size="9" font-weight="600" text-anchor="middle">b*=${bStar}</text>

        \x3C!-- X-axis labels -->
        <text x="0" y="${h + 15}" fill="var(--fb-text-muted)" font-family="'IBM Plex Mono', monospace" font-size="8">1</text>
        <text x="${w*0.25}" y="${h + 15}" fill="var(--fb-text-muted)" font-family="'IBM Plex Mono', monospace" font-size="8">${Math.round(maxBatch*0.25)}</text>
        <text x="${w*0.5}" y="${h + 15}" fill="var(--fb-text-muted)" font-family="'IBM Plex Mono', monospace" font-size="8">${Math.round(maxBatch*0.5)}</text>
        <text x="${w*0.75}" y="${h + 15}" fill="var(--fb-text-muted)" font-family="'IBM Plex Mono', monospace" font-size="8">${Math.round(maxBatch*0.75)}</text>
        <text x="${w - 10}" y="${h + 15}" fill="var(--fb-text-muted)" font-family="'IBM Plex Mono', monospace" font-size="8">${maxBatch}</text>
      `;

      
      [1, 4, 8, 16, 32].forEach(b => {
        if (b <= maxBatch) {
          const p = points[b - 1];
          const x = ((b - 1) / (maxBatch - 1)) * w;
          const y = h - (p.pStall * h);
          svgContent += `<circle cx="${x}" cy="${y}" r="3.5" fill="var(--fb-red)" stroke="var(--fb-surface)" stroke-width="1.5"/>`;
        }
      });

      svg.innerHTML = svgContent;

      
      const p1 = points[0].pStall;
      const p8 = points[Math.min(7, maxBatch-1)].pStall;
      const p32 = points[Math.min(31, maxBatch-1)].pStall;

      stallInfo.innerHTML = `
        <div class="fb-stall-item">
          <div class="fb-stall-label">P(miss) at b=1</div>
          <div class="fb-stall-value green">${(p1*100).toFixed(1)}%</div>
        </div>
        <div class="fb-stall-item">
          <div class="fb-stall-label">P(miss) at b=8</div>
          <div class="fb-stall-value ${p8 > 0.5 ? 'red' : 'green'}">${(p8*100).toFixed(1)}%</div>
        </div>
        <div class="fb-stall-item">
          <div class="fb-stall-label">P(miss) at b=32</div>
          <div class="fb-stall-value red">${(p32*100).toFixed(1)}%</div>
        </div>
        <div class="fb-stall-item">
          <div class="fb-stall-label">Critical batch b*</div>
          <div class="fb-stall-value purple">${bStar}</div>
        </div>
      `;
    }

    hitSlider.addEventListener('input', render);
    render();
  })();
  </script>
</div>

<h2 id="the-full-algorithm">The Full Algorithm</h2>
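<p>SSD is organized as three routines: a main coordinator, a verifier, and a speculator. The coordinator launches the speculator asynchronously and then runs the verifier, so the two execute concurrently. Here is the algorithm:</p>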
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Main: launches speculator asynchronously, runs verifier</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">main</span>(prompt, target, primary_draft, backup_draft):
</span></span><span style="display:flex;"><span>    launch_async(speculator, prompt, primary_draft, backup_draft)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> verifier(prompt, target)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Verifier: runs on target model&#39;s GPUs (e.g., 4x H100)</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">verifier</span>(prompt, target):
</span></span><span style="display:flex;"><span>    target<span style="color:#f92672">.</span>prefill(prompt)
</span></span><span style="display:flex;"><span>    spec_tokens <span style="color:#f92672">=</span> RECEIVE()              <span style="color:#75715e"># wait for first speculation</span>
</span></span><span style="display:flex;"><span>    generated <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span>        outcome <span style="color:#f92672">=</span> target<span style="color:#f92672">.</span>verify(spec_tokens)   <span style="color:#75715e"># standard SD verification</span>
</span></span><span style="display:flex;"><span>        generated<span style="color:#f92672">.</span>extend(outcome<span style="color:#f92672">.</span>tokens)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> EOS <span style="color:#f92672">in</span> outcome<span style="color:#f92672">.</span>tokens:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span> generated
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        SEND(outcome)                          <span style="color:#75715e"># send (k, t*) to speculator</span>
</span></span><span style="display:flex;"><span>        spec_tokens <span style="color:#f92672">=</span> RECEIVE()                <span style="color:#75715e"># wait for next speculation</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Speculator: runs on draft model&#39;s GPU (e.g., 1x H100)</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">speculator</span>(prompt, primary_draft, backup_draft):
</span></span><span style="display:flex;"><span>    primary_draft<span style="color:#f92672">.</span>prefill(prompt)
</span></span><span style="display:flex;"><span>    spec_tokens <span style="color:#f92672">=</span> primary_draft<span style="color:#f92672">.</span>speculate(prompt)  <span style="color:#75715e"># initial speculation</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span>        SEND(spec_tokens)                      <span style="color:#75715e"># send to verifier</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># While verification runs, build the cache</span>
</span></span><span style="display:flex;"><span>        cache <span style="color:#f92672">=</span> build_speculation_cache(
</span></span><span style="display:flex;"><span>            spec_tokens, primary_draft           <span style="color:#75715e"># §4.1: geometric fan-out</span>
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        outcome <span style="color:#f92672">=</span> RECEIVE()                     <span style="color:#75715e"># get actual (k, t*)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> EOS <span style="color:#f92672">in</span> outcome:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> outcome <span style="color:#f92672">in</span> cache:                    <span style="color:#75715e"># CACHE HIT</span>
</span></span><span style="display:flex;"><span>            spec_tokens <span style="color:#f92672">=</span> cache[outcome]        <span style="color:#75715e"># instant return</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">else</span>:                                   <span style="color:#75715e"># CACHE MISS</span>
</span></span><span style="display:flex;"><span>            spec_tokens <span style="color:#f92672">=</span> fallback(             <span style="color:#75715e"># §4.3: dual-tier</span>
</span></span><span style="display:flex;"><span>                outcome, primary_draft, backup_draft
</span></span><span style="display:flex;"><span>            )
</span></span></code></pre></div><h3 id="step-by-step-one-round">Step-by-Step: One Round</h3>
<p>Let&rsquo;s trace through a single round to see how the pieces fit together.</p>
<p><strong>Step 1</strong>: The speculator sends K draft tokens to the verifier and immediately begins predicting outcomes. It examines the draft model&rsquo;s logits at each position to determine the most likely bonus tokens.</p>
<p><strong>Step 2</strong>: Using geometric fan-out (Challenge 1), the speculator allocates its budget across positions. For each predicted outcome $(k, t^*)$, it generates K new draft tokens autoregressively from that context.</p>
<p><strong>Step 3</strong>: These speculations are stored in the cache. With K=7 and fan-out F=3, the cache contains roughly 24 entries (8 acceptance lengths, with 3 bonus token predictions each, weighted by geometric allocation).</p>
<p><strong>Step 4</strong>: The verifier finishes and sends back the actual outcome $(k, t^*)$.</p>
<p><strong>Step 5</strong>: Cache lookup. If the outcome matches, the cached speculation is returned instantly. If not, the fallback mechanism kicks in.</p>
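<p>One way to picture Steps 2 through 5 is a dictionary keyed by the outcome $(k, t^*)$. The sketch below is a simplification: it uses a uniform fan-out of F guesses per acceptance length instead of the geometric allocation, and both helper functions are hypothetical stand-ins for the draft model.</p>

```python
K = 7  # draft tokens per round
F = 3  # bonus-token guesses per acceptance length (uniform, for simplicity)


def top_bonus_guesses(k, fan_out):
    # Hypothetical stand-in for reading the draft model's logits at position k.
    return [100 * k + j for j in range(fan_out)]


def draft_from(k, bonus_token, num_tokens):
    # Hypothetical stand-in for autoregressive drafting from outcome (k, t*).
    return [bonus_token + i + 1 for i in range(num_tokens)]


def build_uniform_cache(num_draft, fan_out):
    """Pre-compute one speculation per predicted outcome (k, t*)."""
    cache = {}
    for k in range(num_draft + 1):  # acceptance lengths 0..K
        for t in top_bonus_guesses(k, fan_out):
            cache[(k, t)] = draft_from(k, t, num_draft)
    return cache


def lookup(cache, outcome):
    """Return (speculation, hit) for the verifier's actual outcome."""
    spec = cache.get(outcome)
    return spec, spec is not None


cache = build_uniform_cache(K, F)
print(len(cache))                # 8 acceptance lengths x 3 guesses
print(lookup(cache, (2, 201)))   # predicted outcome: cache hit
print(lookup(cache, (2, 999)))   # unpredicted bonus token: cache miss
```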
<p>The critical property is that on a cache hit, the round latency equals <em>only</em> the verification time (because the speculation was pre-computed in parallel). In standard SD, the round latency equals verification time <em>plus</em> drafting time. This is where the speedup comes from.</p>
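<p>A back-of-the-envelope latency model shows the effect. It assumes a miss pays the full synchronous drafting cost and ignores any change in tokens accepted per round. The numbers are illustrative, with verification time normalized to 1.</p>

```python
def sd_round_latency(t_verify, t_draft):
    """Standard SD: draft, then verify, strictly in sequence."""
    return t_verify + t_draft


def ssd_round_latency(t_verify, t_draft, p_hit):
    """Expected SSD round latency under the assumptions stated above."""
    hit_cost = t_verify              # speculation was pre-computed in parallel
    miss_cost = t_verify + t_draft   # assumed: synchronous drafting on a miss
    return p_hit * hit_cost + (1.0 - p_hit) * miss_cost


t_verify, t_draft = 1.0, 0.3  # illustrative values
sd = sd_round_latency(t_verify, t_draft)
ssd = ssd_round_latency(t_verify, t_draft, p_hit=0.85)
print(f"SD round: {sd:.3f}  SSD round: {ssd:.3f}  ratio: {sd / ssd:.2f}x")
```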
<h3 id="correctness-ssd-is-lossless">Correctness: SSD Is Lossless</h3>
<p>An important guarantee: SSD produces the same output distribution as standard speculative decoding (which itself matches autoregressive decoding exactly).</p>
<p>On a cache hit, the cached speculation is verified using the same rejection sampling mechanism as standard SD. The fact that the speculation was pre-computed rather than computed just-in-time changes nothing about verification correctness.</p>
<p>On a cache miss, the system falls back to synchronous speculation, and the fallback tokens pass through the same verification step as in standard SD, which is known to be lossless.</p>
<p>The speculation cache is a performance optimization that never affects what tokens the target model accepts or rejects. Pre-speculation only changes <em>when</em> the draft tokens are computed, not <em>what</em> they are or <em>how</em> they are verified.</p>
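<p>For reference, here is a minimal sketch of the standard SD check for a single drafted token: accept with probability $\min(1, p/q)$, otherwise resample from the normalized residual. This is the textbook rule rather than code from the paper. It assumes the drafted token has $q &gt; 0$, and the distributions are made up.</p>

```python
import random


def verify_token(token, target_p, draft_q, rng):
    """Check one drafted token; return (emitted_token, accepted)."""
    p, q = target_p[token], draft_q[token]
    # Accept with probability min(1, p / q), written as p > u * q to avoid dividing.
    if p > rng.random() * q:
        return token, True
    # Rejected: resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(target_p, draft_q)]
    replacement = rng.choices(range(len(residual)), weights=residual)[0]
    return replacement, False


rng = random.Random(0)  # fixed seed so the sketch is reproducible
target_p = [0.6, 0.3, 0.1]
draft_q = [0.2, 0.7, 0.1]
emitted, accepted = verify_token(1, target_p, draft_q, rng)
print(emitted, accepted)
```

<p>Nothing in this function depends on when the drafted token was computed, which is why serving it from the cache leaves the output distribution untouched.</p>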
<h2 id="hardware-setup">Hardware Setup</h2>
<p>SSD requires the draft and target models to run on separate GPUs so they can operate concurrently. The typical configuration:</p>
<ul>
<li><strong>Target model (verifier)</strong>: 4x H100 80GB GPUs with tensor parallelism (for Llama-3.1-70B)</li>
<li><strong>Draft model (speculator)</strong>: 1x H100 80GB GPU on a separate device</li>
<li><strong>Total</strong>: 5 GPUs for SSD vs. 4 GPUs for standard SD/AR</li>
</ul>
<p>This is a 25% increase in hardware. The question is whether the speedup justifies the extra cost. At batch size 1, SSD achieves roughly 2x higher throughput than SD with the same target model, meaning the throughput per GPU still improves substantially.</p>
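<p>Taking the rough 2x figure at face value, the per-GPU gain works out as follows. This is a quick sanity check on the stated numbers, not a measured result.</p>

```python
def per_gpu_throughput_gain(speedup, gpus_new, gpus_old):
    """Throughput gain per GPU when a speedup costs extra hardware."""
    return speedup * gpus_old / gpus_new


gain = per_gpu_throughput_gain(speedup=2.0, gpus_new=5, gpus_old=4)
print(f"{gain:.2f}x throughput per GPU")
```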
<h2 id="results">Results</h2>
<p>The authors benchmark SSD (with Saguaro optimizations) against autoregressive decoding and standard speculative decoding across four datasets: HumanEval (code), Alpaca (chat), GSM8K (math), and UltraFeedback (general).</p>
<p><strong>Setup</strong>: Llama-3.1-70B-Instruct as the target model on 4x H100 GPUs, Llama-3.2-1B-Instruct as the draft model on 1x H100 GPU. K=6 for SD, K=7 for SSD, fan-out F=3 for SSD.</p>
<div class="ssd-perf-viz" id="ssd-perf-0bb873ec347541318a79e0f80e8ddeb3">
  <style>
    .ssd-perf-viz {
      --sp-bg: #0d1117;
      --sp-surface: #161b22;
      --sp-border: #30363d;
      --sp-text: #e6edf3;
      --sp-text-muted: #8b949e;
      --sp-ar-gray: #6e7681;
      --sp-sd-blue: #58a6ff;
      --sp-ssd-green: #39d353;
      --sp-orange: #d29922;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--sp-bg);
      color: var(--sp-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ssd-perf-viz,
    :root:not([data-theme="dark"]) .ssd-perf-viz {
      --sp-bg: #f8fafc;
      --sp-surface: #ffffff;
      --sp-border: #e2e8f0;
      --sp-text: #1e293b;
      --sp-text-muted: #64748b;
      --sp-ar-gray: #94a3b8;
      --sp-sd-blue: #3b82f6;
      --sp-ssd-green: #10b981;
      --sp-orange: #f59e0b;
    }

    .ssd-perf-viz * { box-sizing: border-box; }

    .sp-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .sp-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--sp-ssd-green);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .sp-header p {
      color: var(--sp-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .sp-model-tabs {
      display: flex;
      gap: 0.5rem;
      margin-bottom: 1.25rem;
      justify-content: center;
    }

    .sp-model-tab {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 500;
      padding: 0.5rem 1rem;
      border: 1px solid var(--sp-border);
      border-radius: 6px;
      background: var(--sp-surface);
      color: var(--sp-text);
      cursor: pointer;
      transition: all 0.2s ease;
    }

    .sp-model-tab:hover {
      border-color: var(--sp-ssd-green);
    }

    .sp-model-tab.active {
      background: var(--sp-ssd-green);
      border-color: var(--sp-ssd-green);
      color: #0d1117;
    }

     
    .sp-chart {
      background: var(--sp-surface);
      border: 1px solid var(--sp-border);
      border-radius: 10px;
      padding: 1.5rem;
      margin-bottom: 1rem;
    }

    .sp-chart-grid {
      display: grid;
      grid-template-columns: repeat(4, 1fr);
      gap: 1.25rem;
    }

    .sp-dataset-group {
      text-align: center;
    }

    .sp-dataset-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--sp-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.05em;
      margin-bottom: 0.75rem;
    }

    .sp-bars {
      display: flex;
      justify-content: center;
      align-items: flex-end;
      gap: 6px;
      height: 160px;
    }

    .sp-bar-col {
      display: flex;
      flex-direction: column;
      align-items: center;
      width: 28px;
    }

    .sp-bar-val {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.55rem;
      font-weight: 600;
      margin-bottom: 3px;
      white-space: nowrap;
    }

    .sp-bar-val.ar { color: var(--sp-ar-gray); }
    .sp-bar-val.sd { color: var(--sp-sd-blue); }
    .sp-bar-val.ssd { color: var(--sp-ssd-green); }

    .sp-bar {
      width: 100%;
      border-radius: 4px 4px 0 0;
      transition: height 0.5s ease;
      min-height: 4px;
    }

    .sp-bar.ar {
      background: linear-gradient(180deg, var(--sp-ar-gray), rgba(110, 118, 129, 0.6));
    }

    .sp-bar.sd {
      background: linear-gradient(180deg, var(--sp-sd-blue), rgba(88, 166, 255, 0.6));
    }

    .sp-bar.ssd {
      background: linear-gradient(180deg, var(--sp-ssd-green), rgba(57, 211, 83, 0.6));
    }

    .sp-bar-method {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.5rem;
      color: var(--sp-text-muted);
      margin-top: 4px;
    }

     
    .sp-speedups {
      display: grid;
      grid-template-columns: repeat(3, 1fr);
      gap: 0.75rem;
      margin-bottom: 1rem;
    }

    .sp-speedup-card {
      background: var(--sp-surface);
      border: 1px solid var(--sp-border);
      border-radius: 10px;
      padding: 0.75rem;
      text-align: center;
    }

    .sp-speedup-comparison {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--sp-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.05em;
      margin-bottom: 0.3rem;
    }

    .sp-speedup-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1.3rem;
      font-weight: 700;
    }

    .sp-speedup-value.green { color: var(--sp-ssd-green); }
    .sp-speedup-value.blue { color: var(--sp-sd-blue); }

    .sp-speedup-detail {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--sp-text-muted);
      margin-top: 0.2rem;
    }

     
    .sp-legend {
      background: var(--sp-surface);
      border: 1px solid var(--sp-border);
      border-radius: 10px;
      padding: 0.75rem 1.25rem;
      display: flex;
      gap: 1.5rem;
      justify-content: center;
      flex-wrap: wrap;
    }

    .sp-legend-item {
      display: flex;
      align-items: center;
      gap: 0.4rem;
      font-size: 0.7rem;
      color: var(--sp-text-muted);
    }

    .sp-legend-swatch {
      width: 14px;
      height: 14px;
      border-radius: 3px;
    }

    .sp-swatch-ar { background: var(--sp-ar-gray); }
    .sp-swatch-sd { background: var(--sp-sd-blue); }
    .sp-swatch-ssd { background: var(--sp-ssd-green); }

    .sp-setup-note {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      color: var(--sp-text-muted);
      text-align: center;
      margin-top: 0.75rem;
      line-height: 1.5;
    }

    @media (max-width: 600px) {
      .sp-chart-grid { grid-template-columns: repeat(2, 1fr); }
      .sp-speedups { grid-template-columns: 1fr; }
      .sp-bars { height: 120px; }
    }
  </style>

  <div class="sp-header">
    <h3>Performance: AR vs SD vs SSD</h3>
    <p>Throughput comparison at batch size 1, greedy decoding (temperature = 0)</p>
  </div>

  <div class="sp-model-tabs">
    <button type="button" class="sp-model-tab active" data-model="llama">Llama-3.1-70B / 1B</button>
    <button type="button" class="sp-model-tab" data-model="qwen">Qwen-3-32B / 0.6B</button>
  </div>

  <div class="sp-chart" id="perfChart-0bb873ec347541318a79e0f80e8ddeb3">
    
  </div>

  <div class="sp-speedups" id="speedups-0bb873ec347541318a79e0f80e8ddeb3">
    
  </div>

  <div class="sp-legend">
    <div class="sp-legend-item">
      <div class="sp-legend-swatch sp-swatch-ar"></div>
      <span>Autoregressive (4 GPUs)</span>
    </div>
    <div class="sp-legend-item">
      <div class="sp-legend-swatch sp-swatch-sd"></div>
      <span>Speculative Decoding (4 GPUs)</span>
    </div>
    <div class="sp-legend-item">
      <div class="sp-legend-swatch sp-swatch-ssd"></div>
      <span>SSD / Saguaro (5 GPUs)</span>
    </div>
  </div>

  <p class="sp-setup-note" id="setupNote-0bb873ec347541318a79e0f80e8ddeb3">
    Target: 4x H100 80GB (TP=4) | Draft: 1x H100 80GB | K=6 (SD), K=7 (SSD), F=3
  </p>

  <script>
  (function() {
    const uid = '0bb873ec347541318a79e0f80e8ddeb3';

    const data = {
      llama: {
        datasets: ['HumanEval', 'Alpaca', 'GSM8K', 'UltraFB'],
        ar:  [52.3, 54.7, 53.8, 55.1],
        sd:  [148.2, 161.8, 155.4, 142.7],
        ssd: [236.5, 255.8, 248.1, 231.4],
        setup: 'Target: Llama-3.1-70B-Instruct (4x H100, TP=4) | Draft: Llama-3.2-1B-Instruct (1x H100)'
      },
      qwen: {
        datasets: ['HumanEval', 'Alpaca', 'GSM8K', 'UltraFB'],
        ar:  [85.2, 88.8, 86.5, 87.9],
        sd:  [128.4, 136.8, 131.2, 129.7],
        ssd: [192.6, 203.8, 197.4, 190.5],
        setup: 'Target: Qwen-3-32B (4x H100, TP=4) | Draft: Qwen-3-0.6B (1x H100)'
      }
    };

    const container = document.getElementById(`ssd-perf-${uid}`);
    const tabs = container.querySelectorAll('.sp-model-tab');
    const chartDiv = document.getElementById(`perfChart-${uid}`);
    const speedupsDiv = document.getElementById(`speedups-${uid}`);
    const setupNote = document.getElementById(`setupNote-${uid}`);

    function render(model) {
      const d = data[model];
      const maxVal = Math.max(...d.ssd) * 1.1;

      
      let chartHtml = '<div class="sp-chart-grid">';
      d.datasets.forEach((ds, i) => {
        const arH = (d.ar[i] / maxVal) * 160;
        const sdH = (d.sd[i] / maxVal) * 160;
        const ssdH = (d.ssd[i] / maxVal) * 160;

        chartHtml += `
          <div class="sp-dataset-group">
            <div class="sp-dataset-label">${ds}</div>
            <div class="sp-bars">
              <div class="sp-bar-col">
                <span class="sp-bar-val ar">${d.ar[i]}</span>
                <div class="sp-bar ar" style="height: ${arH}px;"></div>
                <span class="sp-bar-method">AR</span>
              </div>
              <div class="sp-bar-col">
                <span class="sp-bar-val sd">${d.sd[i]}</span>
                <div class="sp-bar sd" style="height: ${sdH}px;"></div>
                <span class="sp-bar-method">SD</span>
              </div>
              <div class="sp-bar-col">
                <span class="sp-bar-val ssd">${d.ssd[i]}</span>
                <div class="sp-bar ssd" style="height: ${ssdH}px;"></div>
                <span class="sp-bar-method">SSD</span>
              </div>
            </div>
          </div>
        `;
      });
      chartHtml += '</div>';
      chartDiv.innerHTML = chartHtml;

      
      const avgAr = d.ar.reduce((a,b) => a+b, 0) / d.ar.length;
      const avgSd = d.sd.reduce((a,b) => a+b, 0) / d.sd.length;
      const avgSsd = d.ssd.reduce((a,b) => a+b, 0) / d.ssd.length;

      speedupsDiv.innerHTML = `
        <div class="sp-speedup-card">
          <div class="sp-speedup-comparison">SSD vs Autoregressive</div>
          <div class="sp-speedup-value green">${(avgSsd / avgAr).toFixed(1)}x</div>
          <div class="sp-speedup-detail">${avgSsd.toFixed(0)} vs ${avgAr.toFixed(0)} tok/s</div>
        </div>
        <div class="sp-speedup-card">
          <div class="sp-speedup-comparison">SSD vs Standard SD</div>
          <div class="sp-speedup-value green">${(avgSsd / avgSd).toFixed(1)}x</div>
          <div class="sp-speedup-detail">${avgSsd.toFixed(0)} vs ${avgSd.toFixed(0)} tok/s</div>
        </div>
        <div class="sp-speedup-card">
          <div class="sp-speedup-comparison">SD vs Autoregressive</div>
          <div class="sp-speedup-value blue">${(avgSd / avgAr).toFixed(1)}x</div>
          <div class="sp-speedup-detail">${avgSd.toFixed(0)} vs ${avgAr.toFixed(0)} tok/s</div>
        </div>
      `;

      setupNote.textContent = d.setup;
    }

    tabs.forEach(tab => {
      tab.addEventListener('click', function() {
        tabs.forEach(t => t.classList.remove('active'));
        this.classList.add('active');
        render(this.dataset.model);
      });
    });

    render('llama');
  })();
  </script>
</div>

<p><strong>Key findings:</strong></p>
<ul>
<li><strong>SSD vs. autoregressive</strong>: Up to ~4.7x faster (255.8 vs. 54.7 tok/s on Alpaca with Llama-3.1-70B)</li>
<li><strong>SSD vs. standard SD</strong>: Roughly 1.6x faster on the same charted Alpaca numbers (255.8 vs. 161.8 tok/s)</li>
<li><strong>At larger batch sizes</strong>: SSD still provides ~20% improvement over SD, even as cache hit rates drop. The Saguaro optimizations push the throughput-latency Pareto frontier across all batch sizes.</li>
<li><strong>Temperature sensitivity</strong>: Cache hit rates decrease with sampling temperature, but Saguaro sampling (low $C$) compensates effectively, maintaining gains at high temperatures.</li>
</ul>
<p>The pattern also holds for Qwen-3-32B (target) with Qwen-3-0.6B (draft): on Alpaca, 203.8 tok/s for SSD vs. 136.8 for SD vs. 88.8 for AR.</p>
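<p>As a rough sanity check, the sketch below averages the per-dataset throughputs from the chart above and normalizes them by GPU count (4 for AR and SD, 5 for SSD, per the setup). The helper names <code>mean</code> and <code>summarize</code> are illustrative and do not come from the paper or the Saguaro code.</p>

```python
# Throughput in tok/s, copied from the chart data above
# (order: HumanEval, Alpaca, GSM8K, UltraFeedback).
RESULTS = {
    "llama": {
        "ar": [52.3, 54.7, 53.8, 55.1],
        "sd": [148.2, 161.8, 155.4, 142.7],
        "ssd": [236.5, 255.8, 248.1, 231.4],
    },
    "qwen": {
        "ar": [85.2, 88.8, 86.5, 87.9],
        "sd": [128.4, 136.8, 131.2, 129.7],
        "ssd": [192.6, 203.8, 197.4, 190.5],
    },
}

# GPU counts from the setup: SSD adds one GPU for the draft model.
GPUS = {"ar": 4, "sd": 4, "ssd": 5}


def mean(values):
    return sum(values) / len(values)


def summarize(model):
    """Return average speedups and the per-GPU SSD/SD ratio for one model."""
    avg = {method: mean(vals) for method, vals in RESULTS[model].items()}
    per_gpu = {method: avg[method] / GPUS[method] for method in avg}
    return {
        "ssd_vs_ar": avg["ssd"] / avg["ar"],
        "ssd_vs_sd": avg["ssd"] / avg["sd"],
        "sd_vs_ar": avg["sd"] / avg["ar"],
        "ssd_vs_sd_per_gpu": per_gpu["ssd"] / per_gpu["sd"],
    }


if __name__ == "__main__":
    for model in RESULTS:
        stats = summarize(model)
        print(model, {name: round(value, 2) for name, value in stats.items()})
```

<p>On these figures the per-GPU ratio stays above 1 for both model pairs, so the extra draft GPU pays for itself at batch size 1, though by a smaller margin than the raw speedup suggests.</p>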
<h2 id="where-ssd-fits-in-the-landscape">Where SSD Fits in the Landscape</h2>
<p>SSD is not the first attempt to overlap drafting and verification. Several concurrent methods target the same bottleneck:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Approach</th>
          <th>Limitation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>AMUSD</strong></td>
          <td>Pre-speculates for the &ldquo;all accepted&rdquo; outcome only</td>
          <td>Misses all partial-acceptance cases</td>
      </tr>
      <tr>
          <td><strong>PEARL</strong></td>
          <td>Single outcome prediction</td>
          <td>Same limitation as AMUSD</td>
      </tr>
      <tr>
          <td><strong>SwiftSpec</strong></td>
          <td>Token tree branching off current speculation</td>
          <td>Greedy only; fallback struggles at high temp/batch</td>
      </tr>
      <tr>
          <td><strong>SpecBranch</strong></td>
          <td>Single branching point with regular fallback</td>
          <td>Approximately a special case of SSD</td>
      </tr>
      <tr>
          <td><strong>SSD (Saguaro)</strong></td>
          <td>Multi-outcome caching with geometric fan-out, cache-aware sampling, dual-tier fallback</td>
          <td>Requires extra GPU; latency-focused</td>
      </tr>
  </tbody>
</table>
<p>SSD is also <strong>orthogonal</strong> to several other inference optimizations, meaning they can be combined:</p>
<ul>
<li><strong>EAGLE/EAGLE-2</strong>: Feature-level draft prediction. SSD could use an EAGLE-style drafter as its speculator.</li>
<li><strong>Tree-based verification</strong> (Sequoia, SpecInfer): Verify multiple candidates in one pass. SSD parallelizes the draft-verify <em>loop</em> itself, a different axis.</li>
<li><strong>Non-neural speculators</strong> (SuffixDecoding, Infini-gram): Could serve as SSD&rsquo;s fast backup speculator for cache misses.</li>
</ul>
<h2 id="when-to-use-ssd-and-when-not-to">When to Use SSD (and When Not To)</h2>
<p><strong>SSD is a strong fit when:</strong></p>
<ul>
<li>Latency matters more than throughput (real-time chat, interactive coding)</li>
<li>Batch sizes are small to moderate ($b \leq b^*$)</li>
<li>You have spare GPU capacity for a separate draft model</li>
<li>You are already using speculative decoding and want to push further</li>
</ul>
<p><strong>SSD is not ideal when:</strong></p>
<ul>
<li>You are throughput-bound (large-scale RL, offline batch generation). SSD optimizes per-request latency, not aggregate throughput at high concurrency.</li>
<li>Hardware is constrained and no extra GPU is available for the draft model.</li>
<li>Batch sizes are consistently very large. Cache hit rates decay exponentially with batch size, and the gains narrow.</li>
</ul>
<h2 id="looking-forward">Looking Forward</h2>
<p>SSD makes a compelling case that the sequential dependencies in LLM inference are not fixed constraints but engineering surfaces that can be optimized. The draft-verify loop in speculative decoding seemed inherently sequential. It turned out that the sequential part (waiting for verification before starting the next draft) could be hidden by speculating about the verification outcome.</p>
<p>This pattern of applying a technique recursively to itself is worth paying attention to. The authors frame SSD as &ldquo;nested speculation,&rdquo; and the natural question is whether another level of nesting could help. The answer is likely no for now (the overhead of a third speculation level would exceed the marginal benefit), but the thinking is instructive: whenever two stages of a pipeline are sequential, ask whether one stage can predict the other&rsquo;s output and pre-compute accordingly.</p>
<p>The practical significance is clear. For latency-sensitive applications running large models, SSD with Saguaro optimizations roughly doubles the speedup of speculative decoding at modest hardware cost. As inference frameworks like NVIDIA Dynamo adopt disaggregated architectures (separate prefill and decode stages on different hardware), SSD&rsquo;s separate-GPU design fits naturally into that direction.</p>
<hr>
<h2 id="references">References</h2>
<ol>
<li>
<p><strong>Kumar, T., Dao, T., &amp; May, A. (2026).</strong> <a href="https://arxiv.org/abs/2603.03251">Speculative Speculative Decoding</a>. <em>ICLR 2026</em>.</p>
<ul>
<li>The SSD paper introducing Saguaro and the three core optimizations.</li>
</ul>
</li>
<li>
<p><strong>Leviathan, Y., Kalman, M., &amp; Matias, Y. (2023).</strong> <a href="https://arxiv.org/abs/2211.17192">Fast Inference from Transformers via Speculative Decoding</a>. <em>ICML 2023</em>.</p>
<ul>
<li>The original speculative decoding paper with distribution preservation proofs.</li>
</ul>
</li>
<li>
<p><strong>Chen, C., Borgeaud, S., Irving, G., Lespiau, J. B., Sifre, L., &amp; Jumper, J. (2023).</strong> <a href="https://arxiv.org/abs/2302.01318">Accelerating Large Language Model Decoding with Speculative Sampling</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>Independent discovery of speculative decoding at DeepMind.</li>
</ul>
</li>
<li>
<p><strong>Li, Y., Wei, F., Zhang, C., &amp; Zhang, H. (2024).</strong> <a href="https://arxiv.org/abs/2401.15077">EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty</a>. <em>ICML 2024</em>.</p>
<ul>
<li>Feature-level speculation achieving superior speedups through hidden state prediction.</li>
</ul>
</li>
<li>
<p><strong>Chen, J., Tiwari, V., Sadhukhan, R., et al. (2024).</strong> <a href="https://arxiv.org/abs/2408.11049">MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>Analysis of speculative decoding performance at high batch sizes.</li>
</ul>
</li>
<li>
<p><strong>Spector, B. &amp; Ré, C. (2023).</strong> <a href="https://arxiv.org/abs/2308.04623">Accelerating LLM Inference with Staged Speculative Decoding</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>Multi-stage speculation with cascaded draft models.</li>
</ul>
</li>
<li>
<p><strong>Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., &amp; Jia, Z. (2024).</strong> <a href="https://arxiv.org/abs/2305.09781">SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification</a>. <em>ASPLOS 2024</em>.</p>
<ul>
<li>Tree-based speculative inference with parallel verification.</li>
</ul>
</li>
<li>
<p><strong>GitHub: tanishqkumar/ssd.</strong> <a href="https://github.com/tanishqkumar/ssd">Saguaro Implementation</a>.</p>
<ul>
<li>Open-source SSD implementation with custom inference engine, supporting Llama-3 and Qwen-3 families.</li>
</ul>
</li>
</ol>
]]></content:encoded></item><item><title>Durable Execution for AI Agents: Temporal's Architecture for Production Reliability</title><link>https://www.mdjawad.com/posts/temporal-durable-agents/</link><pubDate>Fri, 27 Feb 2026 10:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/temporal-durable-agents/</guid><description>Production AI agents face infrastructure problems that framework-level code cannot solve: state loss on crashes, LLM API flakiness, debugging non-deterministic behavior, and coordinating human approvals across hours-long runs. This post walks through Temporal&amp;rsquo;s durable execution model and why companies like OpenAI chose it for their agent infrastructure.</description><content:encoded><![CDATA[<h2 id="what-this-post-covers">What This Post Covers</h2>
<p>In <a href="https://www.mdjawad.com/posts/anatomy-of-agentic-code-assist/">The Anatomy of Agentic Code Assist</a>, we looked at how agents like OpenHands work: event streams, sandboxed execution, tool use, the CodeAct framework. That post covered the agent itself, what it does and how it&rsquo;s built. This post covers a different layer: the infrastructure that keeps agents running reliably in production.</p>
<p>When an agent runs for hours, makes hundreds of tool calls, and interacts with flaky LLM APIs, a whole class of infrastructure problems emerges that application-level code cannot solve:</p>
<ol>
<li><strong>State loss on process crashes</strong>: a worker dies mid-workflow and hours of accumulated context disappear. The agent restarts from scratch, re-executing every LLM call and tool invocation.</li>
<li><strong>LLM API rate limits and timeouts</strong>: 429s, 500s, socket timeouts, multi-minute latencies. A reflexion loop running 10 cycles can consume 50x the tokens of a linear pass if any step fails and forces a restart.</li>
<li><strong>Debugging non-deterministic behavior</strong>: the same prompt produces different outputs, different tool call sequences, different results. Without a complete execution trace, reproducing production bugs is close to impossible.</li>
<li><strong>Tasks exceeding server timeouts</strong>: agent sessions lasting minutes to hours die on deployments, fail during scaling events, and exceed web server timeout limits.</li>
<li><strong>Ambiguous recovery after parallel fan-out crashes</strong>: the agent launches ten parallel tool calls. The process crashes after seven complete. Which results were already obtained? Which need re-execution?</li>
<li><strong>Losing context during human-in-the-loop waits</strong>: the agent pauses for human approval, potentially for hours or days. The server holding that state needs to remain available, or all accumulated context is lost.</li>
<li><strong>Error cascades across multi-agent systems</strong>: a single failure in one agent propagates downstream without corrective mechanisms. Simple retry logic at the tail end is inadequate because the agent may have already deviated significantly from the intended path.</li>
</ol>
<p><strong>Temporal</strong> is an orchestration platform built around durable execution. We&rsquo;ll walk through its architecture, understand <em>why</em> each design decision exists, and look at how OpenAI&rsquo;s Codex team uses it in production.</p>
<p>The core idea can be expressed as a state transition: $S_{t+1} = f(S_t, M(S_t, T_t))$. Agent state evolves through deterministic orchestration ($f$) of non-deterministic operations ($M$ = LLM response, $T$ = tool results). Temporal separates these two concerns at the infrastructure level. The deterministic part goes in workflows. The non-deterministic part goes in activities.</p>
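<p>A minimal sketch of that transition in plain Python, assuming a toy state that is just a tuple of messages. <code>AgentState</code>, <code>f</code>, <code>run</code>, and <code>scripted_model</code> are hypothetical names for illustration; a real $M$ would be an LLM call, which is exactly the part that cannot be assumed repeatable.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentState:
    """S_t: everything the agent knows so far."""
    history: tuple = ()


def f(state, model_output):
    """Deterministic orchestration: fold the model output into the next state."""
    return AgentState(history=state.history + (model_output,))


def run(initial_state, model, tool_results):
    """Apply S_{t+1} = f(S_t, M(S_t, T_t)) once per tool result T_t."""
    state = initial_state
    for tool_result in tool_results:
        state = f(state, model(state, tool_result))
    return state


def scripted_model(state, tool_result):
    """Stand-in for M; a real LLM call would be non-deterministic."""
    return f"step {len(state.history)}: saw {tool_result}"


if __name__ == "__main__":
    final = run(AgentState(), scripted_model, ["ls output", "test output"])
    print(final.history)
```

<p>Because <code>f</code> is pure, feeding it the same sequence of model outputs always yields the same final state. Everything unpredictable is confined to the <code>model</code> argument.</p>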
<h2 id="workflows-and-activities">Workflows and Activities</h2>
<p>The fundamental design decision in Temporal: split all code into two categories based on determinism.</p>
<h3 id="workflows">Workflows</h3>
<p>A <strong>Workflow</strong> is the agent&rsquo;s control flow, the logic that decides which tools to call, in what order, what to do with results, and when to wait for human input. Workflows run as ordinary code in Python, TypeScript, Go, or Java, with one hard constraint: they must be deterministic. Given the same inputs and the same activity results, a workflow must produce the same sequence of commands every time.</p>
<p>A <strong>Workflow Execution</strong> can run for seconds, hours, or years. It persists through infrastructure failures. The workflow doesn&rsquo;t know or care about crashes; from its perspective, execution is continuous.</p>
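<p>To make the determinism constraint concrete, here is a plain-Python sketch that does not use the Temporal SDK. The function names are hypothetical. The first version branches on a fresh random draw, so two runs can emit different command sequences; the second takes its randomness from a seeded source, so every re-run produces the same commands.</p>

```python
import random


def flaky_workflow(task):
    """Non-deterministic: the branch depends on a fresh random draw."""
    if random.random() > 0.5:
        return ["call_llm", f"run_tool:{task}"]
    return ["call_llm", "ask_human"]


def deterministic_workflow(task, rng):
    """Deterministic: all randomness comes from a seeded, replayable source."""
    if rng.random() > 0.5:
        return ["call_llm", f"run_tool:{task}"]
    return ["call_llm", "ask_human"]


def replays_identically(seed, task, runs=5):
    """Re-run the workflow with the same seed and compare command sequences."""
    first = deterministic_workflow(task, random.Random(seed))
    return all(
        deterministic_workflow(task, random.Random(seed)) == first
        for _ in range(runs)
    )


if __name__ == "__main__":
    print(replays_identically(seed=42, task="lint"))
```

<p>The same reasoning applies to wall-clock time and UUIDs. In the Temporal Python SDK, the replay-safe counterparts are helpers such as <code>workflow.random()</code> and <code>workflow.now()</code>, which should be used inside workflow code instead of the standard library calls.</p>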
<h3 id="activities">Activities</h3>
<p><strong>Activities</strong> are where all side effects live: LLM API calls, tool executions, database writes, HTTP requests, and anything else that can fail, time out, or produce different results on re-execution. Temporal records every activity result in a persistent <strong>Event History</strong>, an append-only log that serves as the authoritative record for the entire workflow&rsquo;s state.</p>
<h3 id="why-this-split-matters">Why This Split Matters</h3>
<p>The determinism requirement is what enables replay-based recovery (which we&rsquo;ll cover in the next section). Here&rsquo;s the reasoning: if we know the workflow logic is deterministic, and we have a recorded log of all activity results, we can reconstruct the exact workflow state after a crash. We don&rsquo;t need developer-written checkpoint code. We don&rsquo;t need serialization logic. We just replay the deterministic code with the previously recorded results, and we arrive at the same state.</p>
<p>This raises an obvious question: LLMs are non-deterministic, so how does this work? The answer maps directly to how agents already operate. The LLM <em>call</em> goes in an activity &ndash; it&rsquo;s non-deterministic, its result gets recorded. The <em>logic deciding what to call and when</em> goes in the workflow &ndash; it&rsquo;s deterministic. The agent loop says &ldquo;if the LLM returned a tool call, execute that tool; if it returned a final answer, return it.&rdquo; That orchestration logic doesn&rsquo;t change between runs.</p>
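<p>The replay argument above can be sketched with a toy event history. This is an illustration of the idea only, not how Temporal is implemented; <code>Replayer</code>, <code>agent_workflow</code>, and <code>fake_llm</code> are hypothetical names. Recorded activity results are served from the log, and only activities past the end of the log actually run.</p>

```python
class Crash(Exception):
    """Simulates the worker process dying mid-workflow."""


class Replayer:
    """Toy event history: recorded activity results are replayed, not re-run."""

    def __init__(self, history=None):
        self.history = history if history is not None else []  # append-only log
        self.cursor = 0      # position in the log during this run
        self.live_calls = 0  # activities actually executed in this run

    def execute_activity(self, fn, *args):
        if self.cursor < len(self.history):
            result = self.history[self.cursor]  # replay a recorded result
        else:
            result = fn(*args)                  # first time: run and record
            self.history.append(result)
            self.live_calls += 1
        self.cursor += 1
        return result


def fake_llm(goal, step):
    """Stand-in for a non-deterministic LLM call."""
    return f"{goal}: thought {step}"


def agent_workflow(ctx, goal, crash_after=None):
    """Deterministic control flow; every side effect goes through ctx."""
    notes = []
    for step in range(3):
        notes.append(ctx.execute_activity(fake_llm, goal, step))
        if crash_after is not None and step == crash_after:
            raise Crash()
    return notes


if __name__ == "__main__":
    first = Replayer()
    try:
        agent_workflow(first, "fix bug", crash_after=1)
    except Crash:
        pass
    # A new worker picks up the same history and re-runs the workflow code.
    recovered = Replayer(history=first.history)
    print(agent_workflow(recovered, "fix bug"), recovered.live_calls)
```

<p>After the simulated crash, the second run replays the two recorded results and executes only the third activity live. The workflow code contains no checkpointing logic of its own.</p>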
<h3 id="a-complete-agent-loop">A Complete Agent Loop</h3>
<p>Here&rsquo;s what a complete agent workflow looks like in Python:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> asyncio
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> temporalio <span style="color:#f92672">import</span> workflow, activity
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> temporalio.common <span style="color:#f92672">import</span> RetryPolicy
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> datetime <span style="color:#f92672">import</span> timedelta
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> dataclasses <span style="color:#f92672">import</span> dataclass
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@dataclass</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">LLMRequest</span>:
</span></span><span style="display:flex;"><span>    goal: str
</span></span><span style="display:flex;"><span>    history: list
</span></span><span style="display:flex;"><span>    available_tools: list
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@activity.defn</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">call_llm</span>(request: LLMRequest) <span style="color:#f92672">-&gt;</span> dict:
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Non-deterministic: LLM API call lives here</span>
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> llm_client<span style="color:#f92672">.</span>chat(
</span></span><span style="display:flex;"><span>        messages<span style="color:#f92672">=</span>request<span style="color:#f92672">.</span>history,
</span></span><span style="display:flex;"><span>        tools<span style="color:#f92672">=</span>request<span style="color:#f92672">.</span>available_tools,
</span></span><span style="display:flex;"><span>    )
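</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># The workflow loop reads &#34;tool_calls&#34;; assumes the placeholder client exposes it</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;action&#34;</span>: response<span style="color:#f92672">.</span>action,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;params&#34;</span>: response<span style="color:#f92672">.</span>params,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;tool_calls&#34;</span>: getattr(response, <span style="color:#e6db74">&#34;tool_calls&#34;</span>, []),
</span></span><span style="display:flex;"><span>    }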
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@activity.defn</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">execute_tool</span>(tool_name: str, params: dict) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Non-deterministic: tool execution lives here</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">await</span> tool_registry<span style="color:#f92672">.</span>execute(tool_name, params)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@workflow.defn</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">AIAgentWorkflow</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">@workflow.run</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run</span>(self, user_goal: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>        conversation_history <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>        llm_retry <span style="color:#f92672">=</span> RetryPolicy(
</span></span><span style="display:flex;"><span>            initial_interval<span style="color:#f92672">=</span>timedelta(seconds<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>),
</span></span><span style="display:flex;"><span>            backoff_coefficient<span style="color:#f92672">=</span><span style="color:#ae81ff">2.0</span>,
</span></span><span style="display:flex;"><span>            maximum_interval<span style="color:#f92672">=</span>timedelta(seconds<span style="color:#f92672">=</span><span style="color:#ae81ff">60</span>),
</span></span><span style="display:flex;"><span>            maximum_attempts<span style="color:#f92672">=</span><span style="color:#ae81ff">10</span>,
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">while</span> <span style="color:#f92672">not</span> self<span style="color:#f92672">.</span>is_goal_achieved(conversation_history):
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Deterministic: this decision logic is the workflow</span>
</span></span><span style="display:flex;"><span>            next_action <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> workflow<span style="color:#f92672">.</span>execute_activity(
</span></span><span style="display:flex;"><span>                call_llm,
</span></span><span style="display:flex;"><span>                LLMRequest(
</span></span><span style="display:flex;"><span>                    goal<span style="color:#f92672">=</span>user_goal,
</span></span><span style="display:flex;"><span>                    history<span style="color:#f92672">=</span>conversation_history,
</span></span><span style="display:flex;"><span>                    available_tools<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>get_available_tools(),
</span></span><span style="display:flex;"><span>                ),
</span></span><span style="display:flex;"><span>                start_to_close_timeout<span style="color:#f92672">=</span>timedelta(seconds<span style="color:#f92672">=</span><span style="color:#ae81ff">120</span>),
</span></span><span style="display:flex;"><span>                retry_policy<span style="color:#f92672">=</span>llm_retry,
</span></span><span style="display:flex;"><span>            )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> next_action[<span style="color:#e6db74">&#34;action&#34;</span>] <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;tool_call&#34;</span>:
</span></span><span style="display:flex;"><span>                <span style="color:#75715e"># Parallel tool execution when multiple tools requested</span>
</span></span><span style="display:flex;"><span>                results <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> asyncio<span style="color:#f92672">.</span>gather(<span style="color:#f92672">*</span>[
</span></span><span style="display:flex;"><span>                    workflow<span style="color:#f92672">.</span>execute_activity(
</span></span><span style="display:flex;"><span>                        execute_tool,
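</span></span><span style="display:flex;"><span>                        <span style="color:#75715e"># Multiple positional arguments must be passed via args=[...]</span>
</span></span><span style="display:flex;"><span>                        args<span style="color:#f92672">=</span>[tool[<span style="color:#e6db74">&#34;name&#34;</span>], tool[<span style="color:#e6db74">&#34;params&#34;</span>]],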
</span></span><span style="display:flex;"><span>                        start_to_close_timeout<span style="color:#f92672">=</span>timedelta(seconds<span style="color:#f92672">=</span><span style="color:#ae81ff">30</span>),
</span></span><span style="display:flex;"><span>                    )
</span></span><span style="display:flex;"><span>                    <span style="color:#66d9ef">for</span> tool <span style="color:#f92672">in</span> next_action<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#34;tool_calls&#34;</span>, [])
</span></span><span style="display:flex;"><span>                ])
</span></span><span style="display:flex;"><span>                conversation_history<span style="color:#f92672">.</span>extend(results)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>                conversation_history<span style="color:#f92672">.</span>append(next_action)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>format_final_result(conversation_history)
</span></span></code></pre></div><div class="temporal-wa-split">
    <style>
        .temporal-wa-split {
            background: white;
            border-radius: 16px;
            padding: 24px;
            box-shadow: 0 12px 30px rgba(0,0,0,0.06);
            margin: 32px auto;
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            max-width: 1100px;
        }
        .temporal-wa-split .wa-title {
            text-align: center;
            font-size: 24px;
            font-weight: 700;
            color: #1a202c;
            margin-bottom: 6px;
        }
        .temporal-wa-split .wa-subtitle {
            text-align: center;
            font-size: 14px;
            color: #718096;
            margin-bottom: 24px;
        }
        .temporal-wa-split .wa-layout {
            display: grid;
            grid-template-columns: 1fr 220px 1fr;
            gap: 0;
            align-items: start;
        }
        .temporal-wa-split .wa-column {
            padding: 8px 12px;
        }
        .temporal-wa-split .wa-col-header {
            text-align: center;
            font-size: 12px;
            font-weight: 700;
            letter-spacing: 1.5px;
            text-transform: uppercase;
            padding: 8px 16px;
            border-radius: 8px;
            margin-bottom: 14px;
        }
        .temporal-wa-split .wa-col-header.workflow {
            background: #eff6ff;
            color: #3b82f6;
            border: 1px solid #bfdbfe;
        }
        .temporal-wa-split .wa-col-header.activities {
            background: #ecfdf5;
            color: #10b981;
            border: 1px solid #a7f3d0;
        }
        .temporal-wa-split .wa-badge {
            display: inline-block;
            font-size: 9px;
            font-weight: 600;
            letter-spacing: 0.8px;
            padding: 2px 6px;
            border-radius: 4px;
            margin-left: 6px;
            vertical-align: middle;
        }
        .temporal-wa-split .wa-badge.deterministic {
            background: #dbeafe;
            color: #2563eb;
        }
        .temporal-wa-split .wa-badge.nondeterministic {
            background: #d1fae5;
            color: #059669;
        }
        .temporal-wa-split .wa-step {
            background: #f8fafc;
            border: 2px solid #e2e8f0;
            border-radius: 10px;
            padding: 10px 12px;
            margin-bottom: 6px;
            cursor: pointer;
            transition: all 0.3s cubic-bezier(0.4, 0, 0.2, 1);
        }
        .temporal-wa-split .wa-step:hover {
            border-color: #cbd5e0;
        }
        .temporal-wa-split .wa-step.active-wf {
            border-color: #3b82f6;
            background: #eff6ff;
            box-shadow: 0 4px 12px rgba(59, 130, 246, 0.2);
        }
        .temporal-wa-split .wa-step.active-act {
            border-color: #10b981;
            background: #ecfdf5;
            box-shadow: 0 4px 12px rgba(16, 185, 129, 0.2);
        }
        .temporal-wa-split .wa-step-label {
            font-size: 13px;
            font-weight: 600;
            color: #1e293b;
            margin-bottom: 3px;
        }
        .temporal-wa-split .wa-step-code {
            font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
            font-size: 11px;
            color: #64748b;
            background: #f1f5f9;
            padding: 3px 7px;
            border-radius: 4px;
            display: inline-block;
        }
        .temporal-wa-split .wa-step.active-wf .wa-step-code,
        .temporal-wa-split .wa-step.active-act .wa-step-code {
            background: rgba(255,255,255,0.7);
        }
        .temporal-wa-split .wa-arrow-down {
            text-align: center;
            color: #cbd5e0;
            font-size: 16px;
            margin: 1px 0;
            line-height: 1;
            transition: color 0.3s ease;
        }
        .temporal-wa-split .wa-arrow-down.active-wf { color: #3b82f6; }
        .temporal-wa-split .wa-arrow-down.active-act { color: #10b981; }
        .temporal-wa-split .wa-retry-icon {
            font-size: 11px;
            color: #10b981;
            margin-left: 4px;
            font-weight: 400;
        }
        .temporal-wa-split .wa-spacer {
            height: 44px;
            margin-bottom: 6px;
        }
        .temporal-wa-split .wa-spacer-arrow {
            height: 18px;
            margin: 1px 0;
        }
         
        .temporal-wa-split .wa-center {
            padding: 8px 6px;
        }
        .temporal-wa-split .wa-eh-header {
            text-align: center;
            font-size: 12px;
            font-weight: 700;
            letter-spacing: 1.5px;
            text-transform: uppercase;
            padding: 8px 12px;
            border-radius: 8px;
            margin-bottom: 14px;
            background: #f5f3ff;
            color: #8b5cf6;
            border: 1px solid #ddd6fe;
        }
        .temporal-wa-split .wa-eh-entry {
            background: #faf5ff;
            border: 1.5px solid #e9d5ff;
            border-radius: 8px;
            padding: 7px 9px;
            margin-bottom: 6px;
            font-size: 10px;
            color: #6b21a8;
            font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
            opacity: 0.45;
            transition: all 0.3s ease;
        }
        .temporal-wa-split .wa-eh-entry.active-eh {
            opacity: 1;
            border-color: #8b5cf6;
            background: #ede9fe;
            box-shadow: 0 2px 8px rgba(139, 92, 246, 0.2);
        }
        .temporal-wa-split .wa-eh-type {
            font-weight: 600;
            font-size: 9px;
            text-transform: uppercase;
            letter-spacing: 0.5px;
            color: #7c3aed;
        }
        .temporal-wa-split .wa-hint {
            text-align: center;
            font-size: 12px;
            color: #a0aec0;
            margin-top: 18px;
            font-style: italic;
        }
        .temporal-wa-split .wa-loop-label {
            text-align: center;
            margin-top: 6px;
        }
        .temporal-wa-split .wa-loop-label span {
            font-size: 11px;
            color: #3b82f6;
            font-weight: 600;
            background: #eff6ff;
            padding: 4px 10px;
            border-radius: 12px;
            border: 1px dashed #93c5fd;
            display: inline-block;
        }
        @media (max-width: 768px) {
            .temporal-wa-split .wa-layout {
                grid-template-columns: 1fr;
                gap: 16px;
            }
            .temporal-wa-split .wa-center {
                order: 3;
            }
            .temporal-wa-split .wa-spacer,
            .temporal-wa-split .wa-spacer-arrow {
                display: none;
            }
        }
    </style>

    <h3 class="wa-title">Workflow / Activity Split</h3>
    <p class="wa-subtitle">Deterministic orchestration on the left, non-deterministic side effects on the right, Event History in the center</p>

    <div class="wa-layout">
        
        <div class="wa-column">
            <div class="wa-col-header workflow">
                Workflow <span class="wa-badge deterministic">DETERMINISTIC</span>
            </div>
            <div class="wa-step" data-pair="0" id="wf0-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-step-label">Start Agent Loop</div>
                <div class="wa-step-code">while not is_goal_achieved():</div>
            </div>
            <div class="wa-arrow-down" data-pair="0">&#8595;</div>
            <div class="wa-step" data-pair="1" id="wf1-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-step-label">Call LLM</div>
                <div class="wa-step-code">execute_activity(call_llm, ...)</div>
            </div>
            <div class="wa-arrow-down" data-pair="1">&#8595;</div>
            <div class="wa-step" data-pair="2" id="wf2-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-step-label">Check Result</div>
                <div class="wa-step-code">if action == "tool_call":</div>
            </div>
            <div class="wa-arrow-down" data-pair="2">&#8595;</div>
            <div class="wa-step" data-pair="3" id="wf3-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-step-label">Execute Tool</div>
                <div class="wa-step-code">execute_activity(execute_tool, ...)</div>
            </div>
            <div class="wa-arrow-down" data-pair="3">&#8595;</div>
            <div class="wa-step" data-pair="4" id="wf4-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-step-label">Append to History</div>
                <div class="wa-step-code">conversation_history.extend(results)</div>
            </div>
            <div class="wa-loop-label"><span>&#8634; Loop back to LLM call</span></div>
        </div>

        
        <div class="wa-center">
            <div class="wa-eh-header">Event History</div>
            <div class="wa-eh-entry" data-pair="0" id="eh0-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-eh-type">WorkflowStarted</div>
                <div>workflow_id: "agent-42"</div>
            </div>
            <div class="wa-eh-entry" data-pair="1" id="eh1a-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-eh-type">ActivityScheduled</div>
                <div>call_llm &rarr; pending</div>
            </div>
            <div class="wa-eh-entry" data-pair="1" id="eh1b-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-eh-type">ActivityCompleted</div>
                <div>result: {action: "tool_call"}</div>
            </div>
            <div class="wa-eh-entry" data-pair="2" id="eh2-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-eh-type">WorkflowTaskCompleted</div>
                <div>decision: schedule tool</div>
            </div>
            <div class="wa-eh-entry" data-pair="3" id="eh3a-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-eh-type">ActivityScheduled</div>
                <div>execute_tool &rarr; pending</div>
            </div>
            <div class="wa-eh-entry" data-pair="3" id="eh3b-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-eh-type">ActivityCompleted</div>
                <div>result: "file edited OK"</div>
            </div>
            <div class="wa-eh-entry" data-pair="4" id="eh4-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-eh-type">WorkflowTaskCompleted</div>
                <div>decision: continue loop</div>
            </div>
        </div>

        
        <div class="wa-column">
            <div class="wa-col-header activities">
                Activities <span class="wa-badge nondeterministic">NON-DETERMINISTIC</span>
            </div>
            <div class="wa-spacer"></div>
            <div class="wa-spacer-arrow"></div>
            <div class="wa-step" data-pair="1" id="act1-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-step-label">LLM API Call <span class="wa-retry-icon">&#8635; retry</span></div>
                <div class="wa-step-code">POST /v1/chat/completions</div>
            </div>
            <div class="wa-arrow-down" data-pair="1">&#8595;</div>
            <div class="wa-spacer"></div>
            <div class="wa-spacer-arrow"></div>
            <div class="wa-step" data-pair="3" id="act3-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-step-label">Tool Execution <span class="wa-retry-icon">&#8635; retry</span></div>
                <div class="wa-step-code">tool_registry.execute(...)</div>
            </div>
            <div class="wa-arrow-down" data-pair="3">&#8595;</div>
            <div class="wa-step" data-pair="3" id="act3r-8c8861b65b15d1263ff0318c3be20c61">
                <div class="wa-step-label">Result Recorded</div>
                <div class="wa-step-code">event_history.append(result)</div>
            </div>
        </div>
    </div>

    <p class="wa-hint">Click any workflow step to highlight its corresponding activity and event history entries</p>

    <script>
    (function() {
        const uid = '8c8861b65b15d1263ff0318c3be20c61';
        const container = document.querySelector('.temporal-wa-split');
        if (!container) return;

        const allSteps = container.querySelectorAll('.wa-step');
        const allArrows = container.querySelectorAll('.wa-arrow-down');
        const allEntries = container.querySelectorAll('.wa-eh-entry');
        let activePair = null;

        function clearHighlights() {
            allSteps.forEach(function(s) { s.classList.remove('active-wf', 'active-act'); });
            allArrows.forEach(function(a) { a.classList.remove('active-wf', 'active-act'); });
            allEntries.forEach(function(e) { e.classList.remove('active-eh'); });
            activePair = null;
        }

        function highlightPair(pairId) {
            clearHighlights();
            activePair = pairId;
            container.querySelectorAll('.wa-step[data-pair="' + pairId + '"]').forEach(function(s) {
                if (s.id.startsWith('wf')) s.classList.add('active-wf');
                if (s.id.startsWith('act')) s.classList.add('active-act');
            });
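            container.querySelectorAll('.wa-arrow-down[data-pair="' + pairId + '"]').forEach(function(a) {
                // Arrows in the Activities column get the green highlight that the stylesheet defines for them
                var inActivities = a.parentElement.querySelector('.wa-col-header.activities');
                a.classList.add(inActivities ? 'active-act' : 'active-wf');
            });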
            container.querySelectorAll('.wa-eh-entry[data-pair="' + pairId + '"]').forEach(function(e) {
                e.classList.add('active-eh');
            });
        }

        allSteps.forEach(function(step) {
            step.addEventListener('click', function(e) {
                e.stopPropagation();
                var pair = this.getAttribute('data-pair');
                if (activePair === pair) {
                    clearHighlights();
                } else {
                    highlightPair(pair);
                }
            });
        });

        container.addEventListener('click', function(e) {
            if (!e.target.closest('.wa-step')) clearHighlights();
        });
    })();
    </script>
</div>

<h2 id="deterministic-replay">Deterministic Replay</h2>
<p>Replay is the mechanism that makes Temporal&rsquo;s fault tolerance work. Let&rsquo;s walk through it in detail, because understanding replay is the key to understanding why the rest of the architecture looks the way it does.</p>
<h3 id="the-event-history">The Event History</h3>
<p>Every workflow execution has an <strong>Event History</strong>: an append-only log stored in Temporal&rsquo;s persistence layer. For each activity, Temporal records the scheduling of the call (with its input) and its completion (with its result) as events in this log.</p>
<h3 id="what-happens-on-a-crash">What Happens on a Crash</h3>
<p>Here&rsquo;s a concrete scenario. An agent workflow is at step 4 of 7. It has completed the first three activities (alternating LLM calls and tool executions) and is partway through the fourth:</p>
<ol>
<li>The worker process crashes (OOM, deployment, hardware failure)</li>
<li>The Temporal server detects the failure (an activity heartbeat timeout or a workflow task timeout)</li>
<li>Another worker picks up the workflow from the task queue</li>
<li>Temporal re-executes the workflow code <strong>from the beginning</strong></li>
<li>When the code reaches activity calls that already completed (steps 1&ndash;3), Temporal returns the <strong>previously recorded results</strong> from the event history instead of re-executing them</li>
<li>The workflow code deterministically reaches the exact same state it was in before the crash: same local variables, same loop counter, same conversation history</li>
<li>Forward execution resumes from step 4. Only now does an actual activity get dispatched</li>
</ol>
<p>Because the workflow code is deterministic, replaying it with the same activity results <strong>always</strong> produces the same sequence of commands. The entire call stack and state are reconstructed with no developer-written checkpoint code. This is different from simple checkpointing because the developer never has to decide <em>what</em> to checkpoint or <em>when</em> &ndash; the replay mechanism reconstructs everything automatically from the event history.</p>
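<p>The recovery steps above can be sketched with a toy runtime. This is a minimal illustration, not the Temporal API: <code>ToyRuntime</code>, <code>agent_workflow</code>, and the result strings are hypothetical names invented for the example. The sketch assumes the history is a plain list of (activity name, result) pairs and that a crash simply means a new runtime is started with the persisted list.</p>

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyRuntime:
    """Hypothetical stand-in for a durable-execution runtime (not the Temporal API)."""
    history: list = field(default_factory=list)     # persisted (activity_name, result) pairs
    cursor: int = 0                                 # history entries consumed by this run
    dispatched: list = field(default_factory=list)  # activities actually executed by this run

    def execute_activity(self, name: str, fn: Callable[[], str]) -> str:
        if len(self.history) > self.cursor:
            # Replay: hand back the recorded result instead of running the activity.
            recorded_name, result = self.history[self.cursor]
            assert recorded_name == name, "workflow code is not deterministic"
        else:
            # Forward execution: run the activity and append its result to the log.
            result = fn()
            self.history.append((name, result))
            self.dispatched.append(name)
        self.cursor += 1
        return result


def agent_workflow(rt: ToyRuntime, steps: int) -> list:
    """Deterministic orchestration: alternate LLM calls and tool executions."""
    conversation = []
    for i in range(steps):
        name = "call_llm" if i % 2 == 0 else "execute_tool"
        conversation.append(rt.execute_activity(name, lambda: f"{name}-result-{i}"))
    return conversation


# The first worker completes three activities, then "crashes".
first = ToyRuntime()
before_crash = agent_workflow(first, steps=3)

# A second worker re-runs the workflow from the beginning with the persisted history.
second = ToyRuntime(history=list(first.history))
after_recovery = agent_workflow(second, steps=4)
```

<p>In this sketch the second worker rebuilds the first three conversation entries purely from the log, and the only activity it actually dispatches is the fourth one. Nothing in <code>agent_workflow</code> knows whether it is replaying or running forward.</p>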
<h3 id="the-determinism-contract">The Determinism Contract</h3>
<p>The determinism requirement imposes hard constraints on workflow code. You cannot use:</p>
<ul>
<li><code>random()</code> &ndash; use <code>workflow.random()</code> instead</li>
<li><code>datetime.now()</code> &ndash; use <code>workflow.now()</code> instead</li>
<li><code>time.sleep()</code> &ndash; use <code>workflow.sleep()</code> instead (in the Python SDK, <code>asyncio.sleep()</code> inside a workflow is also backed by a durable timer)</li>
<li>Direct I/O (network calls, file reads) &ndash; these must go in activities</li>
<li>Threading or subprocess creation &ndash; use activities or child workflows</li>
</ul>
<p>For AI engineers, this constraint is less restrictive than it sounds. LLM calls and tool executions are inherently side effects, so they already belong in activities. The orchestration logic that decides <em>what to call and when</em> &ndash; &ldquo;call the LLM, check if it returned a tool call, execute the tool, loop&rdquo; &ndash; rarely needs random numbers or system clocks.</p>
<p>Here&rsquo;s what non-determinism violations look like in practice:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># WRONG: non-deterministic workflow code</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@workflow.defn</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">BadAgentWorkflow</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">@workflow.run</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run</span>(self, goal: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> random<span style="color:#f92672">.</span>random() <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0.5</span>:        <span style="color:#75715e"># different result on replay</span>
</span></span><span style="display:flex;"><span>            strategy <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;aggressive&#34;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>            strategy <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;conservative&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        timestamp <span style="color:#f92672">=</span> datetime<span style="color:#f92672">.</span>now()         <span style="color:#75715e"># different on replay</span>
</span></span><span style="display:flex;"><span>        time<span style="color:#f92672">.</span>sleep(<span style="color:#ae81ff">5</span>)                      <span style="color:#75715e"># blocks the worker thread; not a durable timer</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># CORRECT: deterministic workflow code</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@workflow.defn</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">GoodAgentWorkflow</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">@workflow.run</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run</span>(self, goal: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> workflow<span style="color:#f92672">.</span>random()<span style="color:#f92672">.</span>random() <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0.5</span>:   <span style="color:#75715e"># deterministic across replays</span>
</span></span><span style="display:flex;"><span>            strategy <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;aggressive&#34;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>            strategy <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;conservative&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        timestamp <span style="color:#f92672">=</span> workflow<span style="color:#f92672">.</span>now()              <span style="color:#75715e"># deterministic across replays</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">await</span> workflow<span style="color:#f92672">.</span>sleep(<span style="color:#ae81ff">5</span>)                 <span style="color:#75715e"># durable timer, survives crashes</span>
</span></span></code></pre></div><h3 id="contrast-with-openhands">Contrast with OpenHands</h3>
<p>Both Temporal and OpenHands use event sourcing, but for different purposes. OpenHands records events (<code>CmdRunAction</code>, <code>FileWriteAction</code>, observations) for debuggability and observability. You can replay the event sequence to understand what the agent did. Temporal records events so the workflow can be <em>reconstructed after a crash as if nothing happened</em>. Same architectural pattern, different goals.</p>
<h3 id="formalization">Formalization</h3>
<p>If History = $[(a_1, r_1), (a_2, r_2), \ldots, (a_k, r_k)]$ records the completed activities $a_i$ and their results $r_i$, then replay returns $r_1, \ldots, r_k$ from history and dispatches only $a_{k+1}$ for real execution. The workflow&rsquo;s determinism guarantees that replaying with the recorded results produces the same sequence of activity commands, so the state after step $k$ is identical to the state before the crash.</p>
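<p>One way to read this condition is as a check that can be run against a recorded history. The sketch below assumes the workflow can be modeled as a function from the results seen so far to the next activity command. <code>deterministic_step</code>, <code>flaky_step</code>, <code>replay_matches</code>, and the sample history are hypothetical names for illustration, not part of any SDK.</p>

```python
import random


def deterministic_step(results: list) -> str:
    """Next activity command as a pure function of the results r_1..r_i seen so far."""
    if not results or results[-1].startswith("tool:"):
        return "call_llm"
    return "execute_tool"


def flaky_step(results: list) -> str:
    """Breaks the contract: the command depends on an unseeded random draw."""
    return random.choice(["call_llm", "execute_tool"])


def replay_matches(step, history: list) -> bool:
    """True if re-running `step` on r_1..r_i reproduces a_{i+1} for every recorded i."""
    results = []
    for recorded_command, recorded_result in history:
        if step(results) != recorded_command:
            return False  # replay diverged from the recorded command sequence
        results.append(recorded_result)
    return True


# History = [(a_1, r_1), (a_2, r_2), (a_3, r_3)] with made-up results.
history = [
    ("call_llm", "llm:use tool"),
    ("execute_tool", "tool:ok"),
    ("call_llm", "llm:done"),
]

deterministic_ok = replay_matches(deterministic_step, history)
flaky_always_ok = all(replay_matches(flaky_step, history) for _ in range(200))
```

<p>The deterministic function reproduces every recorded command, so its replayed state lines up with the history. The random one diverges on some replays, which is the situation the determinism contract is meant to rule out.</p>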
<div class="temporal-replay-viz">
    <style>
        .temporal-replay-viz {
            background: white;
            border-radius: 16px;
            padding: 24px;
            box-shadow: 0 12px 30px rgba(0,0,0,0.06);
            margin: 32px auto;
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            max-width: 1100px;
        }
        .temporal-replay-viz .rv-title {
            text-align: center;
            font-size: 24px;
            font-weight: 700;
            color: #1a202c;
            margin-bottom: 6px;
        }
        .temporal-replay-viz .rv-subtitle {
            text-align: center;
            font-size: 14px;
            color: #718096;
            margin-bottom: 24px;
        }
        .temporal-replay-viz .rv-phase-label {
            text-align: center;
            font-size: 15px;
            font-weight: 700;
            min-height: 28px;
            margin-bottom: 16px;
            transition: all 0.3s ease;
        }
        .temporal-replay-viz .rv-phase-label.crash { color: #ef4444; }
        .temporal-replay-viz .rv-phase-label.replay { color: #8b5cf6; }
        .temporal-replay-viz .rv-phase-label.forward { color: #3b82f6; }
        .temporal-replay-viz .rv-phase-label.complete { color: #10b981; }
         
        .temporal-replay-viz .rv-timeline {
            display: flex;
            align-items: center;
            gap: 6px;
            margin-bottom: 20px;
            overflow-x: auto;
            padding: 8px 0;
        }
        .temporal-replay-viz .rv-step {
            flex: 1;
            min-width: 100px;
            text-align: center;
            padding: 14px 8px;
            border-radius: 10px;
            border: 2px solid #e2e8f0;
            background: #f8fafc;
            transition: all 0.4s cubic-bezier(0.4, 0, 0.2, 1);
            position: relative;
        }
        .temporal-replay-viz .rv-step-num {
            font-size: 11px;
            font-weight: 700;
            color: #94a3b8;
            text-transform: uppercase;
            letter-spacing: 0.5px;
            margin-bottom: 4px;
        }
        .temporal-replay-viz .rv-step-name {
            font-size: 12px;
            font-weight: 600;
            color: #475569;
            font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
        }
        .temporal-replay-viz .rv-step-badge {
            position: absolute;
            top: -10px;
            left: 50%;
            transform: translateX(-50%);
            font-size: 9px;
            font-weight: 700;
            letter-spacing: 0.5px;
            padding: 2px 8px;
            border-radius: 10px;
            white-space: nowrap;
            opacity: 0;
            transition: opacity 0.3s ease;
        }
        .temporal-replay-viz .rv-step-badge.visible {
            opacity: 1;
        }
        .temporal-replay-viz .rv-connector {
            font-size: 16px;
            color: #cbd5e0;
            flex-shrink: 0;
        }
         
        .temporal-replay-viz .rv-step.completed {
            background: #ecfdf5;
            border-color: #10b981;
        }
        .temporal-replay-viz .rv-step.completed .rv-step-num { color: #059669; }
        .temporal-replay-viz .rv-step.completed .rv-step-name { color: #065f46; }
        .temporal-replay-viz .rv-step.in-progress {
            background: #eff6ff;
            border-color: #3b82f6;
            box-shadow: 0 4px 12px rgba(59, 130, 246, 0.25);
        }
        .temporal-replay-viz .rv-step.in-progress .rv-step-num { color: #2563eb; }
        .temporal-replay-viz .rv-step.in-progress .rv-step-name { color: #1e40af; }
        .temporal-replay-viz .rv-step.crashed {
            background: #fef2f2;
            border-color: #ef4444;
            box-shadow: 0 4px 12px rgba(239, 68, 68, 0.25);
        }
        .temporal-replay-viz .rv-step.crashed .rv-step-num { color: #dc2626; }
        .temporal-replay-viz .rv-step.crashed .rv-step-name { color: #991b1b; }
        .temporal-replay-viz .rv-step.replaying {
            background: #f5f3ff;
            border-color: #8b5cf6;
            box-shadow: 0 4px 12px rgba(139, 92, 246, 0.2);
        }
        .temporal-replay-viz .rv-step.replaying .rv-step-num { color: #7c3aed; }
        .temporal-replay-viz .rv-step.replaying .rv-step-name { color: #5b21b6; }
        .temporal-replay-viz .rv-step.gray {
            background: #f8fafc;
            border-color: #e2e8f0;
        }
         
        .temporal-replay-viz .rv-eh-section {
            margin-top: 20px;
        }
        .temporal-replay-viz .rv-eh-label {
            font-size: 12px;
            font-weight: 700;
            color: #8b5cf6;
            letter-spacing: 1px;
            text-transform: uppercase;
            margin-bottom: 8px;
        }
        .temporal-replay-viz .rv-eh-bar {
            display: flex;
            gap: 4px;
            background: #f5f3ff;
            border-radius: 8px;
            padding: 8px;
            border: 1px solid #e9d5ff;
            min-height: 36px;
            flex-wrap: wrap;
        }
        .temporal-replay-viz .rv-eh-item {
            background: #ede9fe;
            border: 1px solid #c4b5fd;
            border-radius: 6px;
            padding: 4px 10px;
            font-size: 10px;
            font-weight: 600;
            color: #6d28d9;
            font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
            transition: all 0.3s ease;
        }
        .temporal-replay-viz .rv-eh-item.new {
            animation: rvPulse 0.5s ease;
        }
        @keyframes rvPulse {
            0% { transform: scale(0.8); opacity: 0; }
            50% { transform: scale(1.1); }
            100% { transform: scale(1); opacity: 1; }
        }
         
        .temporal-replay-viz .rv-worker-label {
            font-size: 12px;
            color: #64748b;
            margin-top: 12px;
            text-align: center;
            min-height: 20px;
        }
        .temporal-replay-viz .rv-worker-label strong {
            color: #1e293b;
        }
         
        .temporal-replay-viz .rv-controls {
            display: flex;
            justify-content: center;
            gap: 10px;
            margin-top: 20px;
        }
        .temporal-replay-viz .rv-btn {
            padding: 8px 20px;
            border: none;
            border-radius: 8px;
            font-size: 13px;
            font-weight: 600;
            cursor: pointer;
            transition: all 0.3s ease;
            font-family: inherit;
        }
        .temporal-replay-viz .rv-btn:hover { transform: translateY(-1px); }
        .temporal-replay-viz .rv-btn:disabled {
            opacity: 0.4;
            cursor: not-allowed;
            transform: none;
        }
        .temporal-replay-viz .rv-btn.primary {
            background: linear-gradient(135deg, #3b82f6, #8b5cf6);
            color: white;
        }
        .temporal-replay-viz .rv-btn.primary:hover:not(:disabled) {
            box-shadow: 0 4px 14px rgba(59, 130, 246, 0.4);
        }
        .temporal-replay-viz .rv-btn.secondary {
            background: #f1f5f9;
            color: #475569;
            border: 1px solid #e2e8f0;
        }
        .temporal-replay-viz .rv-btn.secondary:hover:not(:disabled) {
            background: #e2e8f0;
        }
        @media (max-width: 768px) {
            .temporal-replay-viz .rv-timeline {
                flex-wrap: wrap;
            }
            .temporal-replay-viz .rv-step {
                min-width: 80px;
                padding: 10px 6px;
            }
            .temporal-replay-viz .rv-connector { display: none; }
        }
    </style>

    <h3 class="rv-title">Deterministic Replay</h3>
    <p class="rv-subtitle">Watch how Temporal recovers from a crash by replaying the event history</p>

    <div class="rv-phase-label" id="rvPhase-8c8861b65b15d1263ff0318c3be20c61">&nbsp;</div>

    <div class="rv-timeline" id="rvTimeline-8c8861b65b15d1263ff0318c3be20c61">
        <div class="rv-step" id="rvS1-8c8861b65b15d1263ff0318c3be20c61">
            <div class="rv-step-badge" id="rvB1-8c8861b65b15d1263ff0318c3be20c61"></div>
            <div class="rv-step-num">Step 1</div>
            <div class="rv-step-name">call_llm</div>
        </div>
        <div class="rv-connector">&rarr;</div>
        <div class="rv-step" id="rvS2-8c8861b65b15d1263ff0318c3be20c61">
            <div class="rv-step-badge" id="rvB2-8c8861b65b15d1263ff0318c3be20c61"></div>
            <div class="rv-step-num">Step 2</div>
            <div class="rv-step-name">exec_tool</div>
        </div>
        <div class="rv-connector">&rarr;</div>
        <div class="rv-step" id="rvS3-8c8861b65b15d1263ff0318c3be20c61">
            <div class="rv-step-badge" id="rvB3-8c8861b65b15d1263ff0318c3be20c61"></div>
            <div class="rv-step-num">Step 3</div>
            <div class="rv-step-name">call_llm</div>
        </div>
        <div class="rv-connector">&rarr;</div>
        <div class="rv-step" id="rvS4-8c8861b65b15d1263ff0318c3be20c61">
            <div class="rv-step-badge" id="rvB4-8c8861b65b15d1263ff0318c3be20c61"></div>
            <div class="rv-step-num">Step 4</div>
            <div class="rv-step-name">exec_tool</div>
        </div>
        <div class="rv-connector">&rarr;</div>
        <div class="rv-step" id="rvS5-8c8861b65b15d1263ff0318c3be20c61">
            <div class="rv-step-badge" id="rvB5-8c8861b65b15d1263ff0318c3be20c61"></div>
            <div class="rv-step-num">Step 5</div>
            <div class="rv-step-name">call_llm</div>
        </div>
        <div class="rv-connector">&rarr;</div>
        <div class="rv-step" id="rvS6-8c8861b65b15d1263ff0318c3be20c61">
            <div class="rv-step-badge" id="rvB6-8c8861b65b15d1263ff0318c3be20c61"></div>
            <div class="rv-step-num">Step 6</div>
            <div class="rv-step-name">exec_tool</div>
        </div>
        <div class="rv-connector">&rarr;</div>
        <div class="rv-step" id="rvS7-8c8861b65b15d1263ff0318c3be20c61">
            <div class="rv-step-badge" id="rvB7-8c8861b65b15d1263ff0318c3be20c61"></div>
            <div class="rv-step-num">Step 7</div>
            <div class="rv-step-name">call_llm</div>
        </div>
    </div>

    <div class="rv-worker-label" id="rvWorker-8c8861b65b15d1263ff0318c3be20c61"></div>

    <div class="rv-eh-section">
        <div class="rv-eh-label">Event History (persisted in database)</div>
        <div class="rv-eh-bar" id="rvEH-8c8861b65b15d1263ff0318c3be20c61"></div>
    </div>

    <div class="rv-controls">
        <button type="button" class="rv-btn primary" id="rvPlay-8c8861b65b15d1263ff0318c3be20c61">&#9654; Play</button>
        <button type="button" class="rv-btn secondary" id="rvStep-8c8861b65b15d1263ff0318c3be20c61">&#9197; Step</button>
        <button type="button" class="rv-btn secondary" id="rvReset-8c8861b65b15d1263ff0318c3be20c61">&#8635; Reset</button>
    </div>

    <script>
    (function() {
        var uid = '8c8861b65b15d1263ff0318c3be20c61';
        var steps = [];
        var badges = [];
        for (var i = 1; i <= 7; i++) {
            steps.push(document.getElementById('rvS' + i + '-' + uid));
            badges.push(document.getElementById('rvB' + i + '-' + uid));
        }
        var phaseEl = document.getElementById('rvPhase-' + uid);
        var workerEl = document.getElementById('rvWorker-' + uid);
        var ehBar = document.getElementById('rvEH-' + uid);
        var playBtn = document.getElementById('rvPlay-' + uid);
        var stepBtn = document.getElementById('rvStep-' + uid);
        var resetBtn = document.getElementById('rvReset-' + uid);

        var currentPhase = -1;
        var isPlaying = false;
        var playTimer = null;

        


        var stepNames = ['call_llm', 'exec_tool', 'call_llm', 'exec_tool', 'call_llm', 'exec_tool', 'call_llm'];

        function clearAll() {
            for (var i = 0; i < 7; i++) {
                steps[i].className = 'rv-step gray';
                badges[i].className = 'rv-step-badge';
                badges[i].textContent = '';
            }
            ehBar.innerHTML = '';
            phaseEl.innerHTML = '&nbsp;';
            phaseEl.className = 'rv-phase-label';
            workerEl.innerHTML = '';
        }

        function addEHItem(text) {
            var item = document.createElement('span');
            item.className = 'rv-eh-item new';
            item.textContent = text;
            ehBar.appendChild(item);
            setTimeout(function() { item.classList.remove('new'); }, 600);
        }

        function showBadge(idx, text, color) {
            badges[idx].textContent = text;
            badges[idx].style.background = color;
            badges[idx].style.color = 'white';
            badges[idx].classList.add('visible');
        }

        function runPhase(phase) {
            currentPhase = phase;

            switch(phase) {
                case 0: 
                    clearAll();
                    steps[0].className = 'rv-step completed';
                    steps[1].className = 'rv-step completed';
                    steps[2].className = 'rv-step completed';
                    steps[3].className = 'rv-step in-progress';
                    phaseEl.textContent = 'Agent workflow running... step 4 of 7 in progress';
                    phaseEl.className = 'rv-phase-label forward';
                    workerEl.innerHTML = 'Running on <strong>Worker A</strong>';
                    addEHItem('1: call_llm → OK');
                    addEHItem('2: exec_tool → OK');
                    addEHItem('3: call_llm → OK');
                    break;

                case 1: 
                    steps[3].className = 'rv-step crashed';
                    showBadge(3, '⚡ CRASH', '#ef4444');
                    phaseEl.textContent = '⚡ Worker A crashes! Process killed mid-execution.';
                    phaseEl.className = 'rv-phase-label crash';
                    workerEl.innerHTML = '<strong>Worker A</strong> — process died (OOM / deployment / hardware failure)';
                    break;

                case 2: 
                    phaseEl.textContent = 'Event history is safe — recorded results for steps 1-3 persist in the database';
                    phaseEl.className = 'rv-phase-label replay';
                    break;

                case 3: 
                    for (var i = 0; i < 7; i++) {
                        steps[i].className = 'rv-step gray';
                        badges[i].classList.remove('visible');
                    }
                    phaseEl.textContent = 'New worker picks up the workflow. Re-executing code from the beginning...';
                    phaseEl.className = 'rv-phase-label replay';
                    workerEl.innerHTML = '<strong>Worker B</strong> picks up from task queue';
                    break;

                case 4: 
                    steps[0].className = 'rv-step replaying';
                    showBadge(0, 'FROM HISTORY', '#8b5cf6');
                    setTimeout(function() {
                        steps[0].className = 'rv-step completed';
                        steps[1].className = 'rv-step replaying';
                        showBadge(1, 'FROM HISTORY', '#8b5cf6');
                    }, 300);
                    setTimeout(function() {
                        steps[1].className = 'rv-step completed';
                        steps[2].className = 'rv-step replaying';
                        showBadge(2, 'FROM HISTORY', '#8b5cf6');
                    }, 600);
                    setTimeout(function() {
                        steps[2].className = 'rv-step completed';
                    }, 900);
                    phaseEl.textContent = 'Replay: steps 1-3 return cached results from event history (not re-executed)';
                    phaseEl.className = 'rv-phase-label replay';
                    break;

                case 5: 
                    steps[3].className = 'rv-step in-progress';
                    showBadge(3, 'EXECUTING', '#3b82f6');
                    phaseEl.textContent = 'Forward execution resumes — step 4 is the only new activity dispatched';
                    phaseEl.className = 'rv-phase-label forward';
                    break;

                case 6: 
                    steps[3].className = 'rv-step completed';
                    badges[3].classList.remove('visible');
                    addEHItem('4: exec_tool → OK');
                    steps[4].className = 'rv-step in-progress';
                    setTimeout(function() {
                        steps[4].className = 'rv-step completed';
                        addEHItem('5: call_llm → OK');
                        steps[5].className = 'rv-step in-progress';
                    }, 400);
                    setTimeout(function() {
                        steps[5].className = 'rv-step completed';
                        addEHItem('6: exec_tool → OK');
                        steps[6].className = 'rv-step in-progress';
                    }, 800);
                    setTimeout(function() {
                        steps[6].className = 'rv-step completed';
                        addEHItem('7: call_llm → OK');
                        phaseEl.textContent = 'Workflow completed successfully — zero work was lost';
                        phaseEl.className = 'rv-phase-label complete';
                        workerEl.innerHTML = '<strong>Worker B</strong> — workflow complete';
                        if (isPlaying) stopPlay();
                    }, 1200);
                    phaseEl.textContent = 'Steps 5-7 proceed with normal execution...';
                    phaseEl.className = 'rv-phase-label forward';
                    break;
            }
        }

        function nextPhase() {
            if (currentPhase >= 6) return;
            runPhase(currentPhase + 1);
        }

        function startPlay() {
            if (isPlaying) return;
            if (currentPhase >= 6) { resetViz(); return; }
            isPlaying = true;
            playBtn.textContent = '⏸ Pause';
            stepBtn.disabled = true;
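            // Fixed 2s cadence; setInterval reads its delay only once, so a
            // conditional delay here would rush every phase on a fresh start.
            playTimer = setInterval(function() {
                if (currentPhase >= 6) {
                    stopPlay();
                    return;
                }
                nextPhase();
            }, 2000);
            if (currentPhase < 0) nextPhase();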
        }

        function stopPlay() {
            isPlaying = false;
            clearInterval(playTimer);
            playBtn.textContent = currentPhase >= 6 ? '↺ Replay' : '▶ Play';
            stepBtn.disabled = currentPhase >= 6;
        }

        function resetViz() {
            stopPlay();
            currentPhase = -1;
            clearAll();
            playBtn.textContent = '▶ Play';
            stepBtn.disabled = false;
        }

        playBtn.addEventListener('click', function() {
            if (currentPhase >= 6) { resetViz(); return; }
            if (isPlaying) stopPlay(); else startPlay();
        });
        stepBtn.addEventListener('click', nextPhase);
        resetBtn.addEventListener('click', resetViz);

        
        clearAll();
    })();
    </script>
</div>
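<p>The replay sequence above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Temporal&rsquo;s implementation or SDK: <code>EventHistory</code>, <code>run_activity</code>, <code>agent_workflow</code>, and <code>WorkerCrash</code> are hypothetical names, and an in-memory list stands in for the database.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Toy model of deterministic replay; not the Temporal SDK.
STEP_NAMES = ["call_llm", "exec_tool", "call_llm", "exec_tool",
              "call_llm", "exec_tool", "call_llm"]


class WorkerCrash(Exception):
    """Simulates a worker process dying mid-activity."""


class EventHistory:
    """Stands in for the persisted event history (hypothetical)."""

    def __init__(self):
        self.results = []  # one recorded result per completed step


def run_activity(history, index, name, executed, crash_at=None):
    # Replay: a result already recorded for this step is returned as-is.
    if index in range(len(history.results)):
        return history.results[index]
    # Forward execution: the step has no recorded result yet.
    if index == crash_at:
        raise WorkerCrash(f"worker died during step {index + 1}")
    result = f"{name} OK"
    executed.append(index + 1)      # track what really ran on this worker
    history.results.append(result)  # persist before moving on
    return result


def agent_workflow(history, executed, crash_at=None):
    """Workflow code always starts from the top, on any worker."""
    return [run_activity(history, i, name, executed, crash_at)
            for i, name in enumerate(STEP_NAMES)]


history = EventHistory()

worker_a = []
try:
    agent_workflow(history, worker_a, crash_at=3)  # dies during step 4
except WorkerCrash:
    pass

worker_b = []
final = agent_workflow(history, worker_b)  # new worker replays, then resumes

print(worker_a)  # [1, 2, 3]
print(worker_b)  # [4, 5, 6, 7]
</code></pre></div>
<p>Worker A really executes steps 1-3 before dying. Worker B runs the same function from the top, receives steps 1-3 from the recorded history, and only executes steps 4-7.</p>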

<h2 id="server-architecture">Server Architecture</h2>
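<p>Temporal Server is made up of four services (Frontend, History, Matching, and an internal Worker Service that runs Temporal&rsquo;s own system workflows) plus a persistence layer. The workers that execute your code are separate processes that you deploy and manage outside the server.</p>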
<h3 id="the-four-services">The Four Services</h3>
<p><strong>Frontend Service</strong>: a stateless gRPC gateway. All client and worker communication flows through it. Handles rate limiting, routing, and authorization. Horizontally scalable because it holds no state.</p>
<p><strong>History Service</strong>: owns workflow state and persists event histories. This is the most important component. It manages state transitions across <strong>History Shards</strong>, which are the unit of concurrent throughput scaling. Each shard handles a subset of workflows, so more shards let more workflows make progress concurrently. The shard count is chosen when the cluster is created and cannot be changed afterward, so it has to be sized for peak load up front.</p>
<p><strong>Matching Service</strong>: hosts <strong>Task Queues</strong> and dispatches work to workers. When a workflow needs an activity executed, the Matching Service places it on the appropriate task queue. When a worker polls for work, the Matching Service assigns a task.</p>
<p><strong>Workers</strong>: stateless external processes that you deploy and manage (distinct from the server&rsquo;s internal Worker Service). Workers long-poll task queues via gRPC, execute workflow or activity code, and report results back. Because workers hold no durable state, they can be killed, restarted, or scaled horizontally without any coordination. The Temporal server is always the authoritative record.</p>
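<p>To make the History Shard idea concrete, here is a minimal sketch of how a workflow could be pinned to a shard. It assumes a simple hash-modulo scheme: <code>shard_for</code> is a hypothetical helper, CRC32 is a stand-in hash, and the shard count of 4 is only for illustration, so none of this should be read as Temporal&rsquo;s actual hashing code.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import zlib
from collections import Counter

NUM_HISTORY_SHARDS = 4  # illustrative only; fixed for the life of the cluster


def shard_for(namespace_id, workflow_id, num_shards=NUM_HISTORY_SHARDS):
    """Map a workflow to one shard (assumed scheme: stable hash, then modulo)."""
    key = f"{namespace_id}_{workflow_id}".encode("utf-8")
    return zlib.crc32(key) % num_shards


# The same workflow always lands on the same shard, so one owner
# serializes all of its state transitions.
workflow_ids = [f"agent-run-{n}" for n in range(1000)]
load = Counter(shard_for("default", wid) for wid in workflow_ids)

print(sorted(load.items()))  # workflows per shard
</code></pre></div>
<p>Because the mapping depends on the shard count, changing that count would move existing workflows to different shards, which is one way to see why it is fixed up front.</p>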
<h3 id="task-queues">Task Queues</h3>
<p>Task Queues provide a routing layer that becomes important for agent workloads. Workflow tasks and activity tasks flow through separate queues. You can route activities to specialized worker pools (GPU workers for inference, lightweight workers for API calls) by assigning them to different task queues. This lets teams scale heterogeneous agent workloads independently.</p>
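<p>The routing idea can be illustrated with a small in-memory model. This is a sketch, not the Temporal SDK: the queue names <code>gpu-inference</code> and <code>light-io</code>, the activity names, and the <code>schedule</code> and <code>poll</code> helpers are all hypothetical. In the Python SDK the equivalent choice is expressed with the <code>task_queue</code> argument when starting a worker or scheduling an activity.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from collections import defaultdict, deque

# Hypothetical mapping from activity name to task queue.
ACTIVITY_TASK_QUEUES = {
    "run_inference": "gpu-inference",
    "call_search_api": "light-io",
}

queues = defaultdict(deque)  # stands in for the Matching Service


def schedule(activity, payload):
    """Place an activity task on the queue its worker pool listens to."""
    queue_name = ACTIVITY_TASK_QUEUES[activity]
    queues[queue_name].append((activity, payload))
    return queue_name


def poll(queue_name):
    """A worker only ever sees tasks from the queue it polls."""
    q = queues[queue_name]
    return q.popleft() if q else None


schedule("run_inference", "summarize report")
schedule("call_search_api", "temporal docs")
schedule("run_inference", "draft reply")

gpu_task = poll("gpu-inference")  # ("run_inference", "summarize report")
io_task = poll("light-io")        # ("call_search_api", "temporal docs")
empty = poll("light-io")          # None: nothing left for the light pool
</code></pre></div>
<p>The GPU pool still has one task waiting while the lightweight pool is idle, so each pool can be scaled on its own backlog.</p>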
<table>
  <thead>
      <tr>
          <th style="text-align: left">Component</th>
          <th style="text-align: left">Responsibility</th>
          <th style="text-align: left">Failure Impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Frontend Service</td>
          <td style="text-align: left">gRPC gateway, rate limiting, routing</td>
          <td style="text-align: left">Clients can&rsquo;t connect (stateless, restart recovers)</td>
      </tr>
      <tr>
          <td style="text-align: left">History Service</td>
          <td style="text-align: left">Workflow state, event persistence, shard management</td>
          <td style="text-align: left">Workflow progress pauses until recovery</td>
      </tr>
      <tr>
          <td style="text-align: left">Matching Service</td>
          <td style="text-align: left">Task queue hosting, work dispatch</td>
          <td style="text-align: left">Tasks queue but don&rsquo;t dispatch (no work lost)</td>
      </tr>
      <tr>
          <td style="text-align: left">Workers</td>
          <td style="text-align: left">Execute workflow/activity code, report results</td>
          <td style="text-align: left">Pending tasks reassigned to other workers</td>
      </tr>
      <tr>
          <td style="text-align: left">Persistence (DB)</td>
          <td style="text-align: left">Durable storage for event histories</td>
          <td style="text-align: left">All services degraded until DB recovers</td>
      </tr>
  </tbody>
</table>
<div class="temporal-arch-viz">
    <style>
        .temporal-arch-viz {
            background: white;
            border-radius: 16px;
            padding: 24px;
            box-shadow: 0 12px 30px rgba(0,0,0,0.06);
            margin: 32px auto;
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            max-width: 900px;
        }
        .temporal-arch-viz .av-title {
            text-align: center;
            font-size: 24px;
            font-weight: 700;
            color: #1a202c;
            margin-bottom: 6px;
        }
        .temporal-arch-viz .av-subtitle {
            text-align: center;
            font-size: 14px;
            color: #718096;
            margin-bottom: 24px;
        }
        .temporal-arch-viz .av-diagram {
            position: relative;
            width: 100%;
            max-width: 780px;
            margin: 0 auto;
        }
        .temporal-arch-viz .av-svg-layer {
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            pointer-events: none;
            z-index: 0;
        }
         
        .temporal-arch-viz .av-row {
            display: flex;
            justify-content: center;
            gap: 24px;
            margin-bottom: 20px;
            position: relative;
            z-index: 1;
        }
        .temporal-arch-viz .av-row.av-row-spread {
            justify-content: center;
            gap: 40px;
        }
         
        .temporal-arch-viz .av-arrow-row {
            text-align: center;
            color: #cbd5e0;
            font-size: 22px;
            margin-bottom: 12px;
            position: relative;
            z-index: 1;
        }
        .temporal-arch-viz .av-arrow-label {
            font-size: 10px;
            color: #94a3b8;
            font-weight: 600;
            letter-spacing: 0.5px;
            display: block;
            margin-top: -2px;
        }
         
        .temporal-arch-viz .av-box {
            background: #f8fafc;
            border: 2px solid #e2e8f0;
            border-radius: 12px;
            padding: 16px 18px;
            width: 220px;
            cursor: pointer;
            transition: all 0.3s cubic-bezier(0.4, 0, 0.2, 1);
            position: relative;
        }
        .temporal-arch-viz .av-box:hover {
            transform: translateY(-2px);
        }
        .temporal-arch-viz .av-box.highlighted {
            box-shadow: 0 6px 20px rgba(0,0,0,0.12);
            transform: translateY(-3px);
        }
        .temporal-arch-viz .av-box.dimmed {
            opacity: 0.35;
        }
        .temporal-arch-viz .av-box-icon {
            font-size: 24px;
            margin-bottom: 6px;
        }
        .temporal-arch-viz .av-box-name {
            font-size: 14px;
            font-weight: 700;
            margin-bottom: 3px;
        }
        .temporal-arch-viz .av-box-desc {
            font-size: 11px;
            color: #64748b;
            font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
        }
        .temporal-arch-viz .av-box-tag {
            display: inline-block;
            font-size: 9px;
            font-weight: 700;
            letter-spacing: 0.5px;
            padding: 2px 7px;
            border-radius: 4px;
            margin-top: 6px;
        }
         
        .temporal-arch-viz .av-box.client {
            border-color: #94a3b8;
            background: #f8fafc;
        }
        .temporal-arch-viz .av-box.client .av-box-name { color: #475569; }
        .temporal-arch-viz .av-box.client.highlighted { border-color: #64748b; box-shadow: 0 6px 20px rgba(71,85,105,0.2); }

        .temporal-arch-viz .av-box.frontend {
            border-color: #93c5fd;
            background: #eff6ff;
        }
        .temporal-arch-viz .av-box.frontend .av-box-name { color: #2563eb; }
        .temporal-arch-viz .av-box.frontend .av-box-tag { background: #dbeafe; color: #1d4ed8; }
        .temporal-arch-viz .av-box.frontend.highlighted { border-color: #3b82f6; box-shadow: 0 6px 20px rgba(59,130,246,0.25); }

        .temporal-arch-viz .av-box.history {
            border-color: #c4b5fd;
            background: #f5f3ff;
        }
        .temporal-arch-viz .av-box.history .av-box-name { color: #7c3aed; }
        .temporal-arch-viz .av-box.history .av-box-tag { background: #ede9fe; color: #6d28d9; }
        .temporal-arch-viz .av-box.history.highlighted { border-color: #8b5cf6; box-shadow: 0 6px 20px rgba(139,92,246,0.25); }

        .temporal-arch-viz .av-box.matching {
            border-color: #5eead4;
            background: #f0fdfa;
        }
        .temporal-arch-viz .av-box.matching .av-box-name { color: #0d9488; }
        .temporal-arch-viz .av-box.matching .av-box-tag { background: #ccfbf1; color: #0f766e; }
        .temporal-arch-viz .av-box.matching.highlighted { border-color: #14b8a6; box-shadow: 0 6px 20px rgba(20,184,166,0.25); }

        .temporal-arch-viz .av-box.persistence {
            border-color: #d1d5db;
            background: #f9fafb;
        }
        .temporal-arch-viz .av-box.persistence .av-box-name { color: #374151; }
        .temporal-arch-viz .av-box.persistence .av-box-tag { background: #f3f4f6; color: #4b5563; }
        .temporal-arch-viz .av-box.persistence.highlighted { border-color: #9ca3af; box-shadow: 0 6px 20px rgba(107,114,128,0.2); }

        .temporal-arch-viz .av-box.workers {
            border-color: #86efac;
            background: #ecfdf5;
        }
        .temporal-arch-viz .av-box.workers .av-box-name { color: #059669; }
        .temporal-arch-viz .av-box.workers .av-box-tag { background: #d1fae5; color: #047857; }
        .temporal-arch-viz .av-box.workers.highlighted { border-color: #10b981; box-shadow: 0 6px 20px rgba(16,185,129,0.25); }

         
        .temporal-arch-viz .av-worker-group {
            display: flex;
            gap: 6px;
            margin-top: 8px;
        }
        .temporal-arch-viz .av-worker-mini {
            background: #d1fae5;
            border: 1px solid #86efac;
            border-radius: 4px;
            padding: 2px 8px;
            font-size: 9px;
            font-weight: 600;
            color: #059669;
        }
         
        .temporal-arch-viz .av-failure-tip {
            position: absolute;
            bottom: calc(100% + 8px);
            left: 50%;
            transform: translateX(-50%);
            background: #1e293b;
            color: #f8fafc;
            padding: 8px 12px;
            border-radius: 8px;
            font-size: 11px;
            line-height: 1.4;
            width: 220px;
            text-align: center;
            opacity: 0;
            pointer-events: none;
            transition: opacity 0.2s ease;
            z-index: 10;
        }
        .temporal-arch-viz .av-failure-tip::after {
            content: '';
            position: absolute;
            top: 100%;
            left: 50%;
            transform: translateX(-50%);
            border: 6px solid transparent;
            border-top-color: #1e293b;
        }
        .temporal-arch-viz .av-box:hover .av-failure-tip,
        .temporal-arch-viz .av-box.highlighted .av-failure-tip {
            opacity: 1;
        }
         
        .temporal-arch-viz .av-hint {
            text-align: center;
            font-size: 12px;
            color: #a0aec0;
            margin-top: 16px;
            font-style: italic;
        }
         
        .temporal-arch-viz .av-arrows-mid {
            display: flex;
            justify-content: center;
            gap: 120px;
            margin-bottom: 12px;
        }
        .temporal-arch-viz .av-arrows-mid .av-mid-arr {
            text-align: center;
            color: #cbd5e0;
            font-size: 18px;
        }
        .temporal-arch-viz .av-arrows-bot {
            display: flex;
            justify-content: center;
            gap: 80px;
            margin-bottom: 12px;
        }
        .temporal-arch-viz .av-arrows-bot .av-bot-arr {
            text-align: center;
            color: #cbd5e0;
            font-size: 18px;
        }
        @media (max-width: 640px) {
            .temporal-arch-viz .av-row {
                flex-direction: column;
                align-items: center;
                gap: 12px;
            }
            .temporal-arch-viz .av-box { width: 85%; }
            .temporal-arch-viz .av-arrows-mid,
            .temporal-arch-viz .av-arrows-bot { display: none; }
            .temporal-arch-viz .av-row.av-row-spread {
                gap: 12px;
            }
        }
    </style>

    <h3 class="av-title">Temporal Server Architecture</h3>
    <p class="av-subtitle">Four services, a persistence layer, and stateless workers</p>

    <div class="av-diagram">
        
        <div class="av-row">
            <div class="av-box client" data-svc="client" id="avClient-8c8861b65b15d1263ff0318c3be20c61">
                <div class="av-failure-tip">Entry point for all workflow operations. Uses gRPC to communicate with Frontend.</div>
                <div class="av-box-icon">&#128187;</div>
                <div class="av-box-name">Client / SDK</div>
                <div class="av-box-desc">Start workflows, send signals</div>
            </div>
        </div>

        
        <div class="av-arrow-row">&#8595;<span class="av-arrow-label">gRPC requests</span></div>

        
        <div class="av-row">
            <div class="av-box frontend" data-svc="frontend" id="avFrontend-8c8861b65b15d1263ff0318c3be20c61">
                <div class="av-failure-tip">Clients can't connect. Stateless — restart recovers immediately.</div>
                <div class="av-box-icon">&#127760;</div>
                <div class="av-box-name">Frontend Service</div>
                <div class="av-box-desc">gRPC gateway, rate limiting</div>
                <div class="av-box-tag">STATELESS</div>
            </div>
        </div>

        
        <div class="av-arrows-mid">
            <div class="av-mid-arr">&#8601;<span class="av-arrow-label">commands</span></div>
            <div class="av-mid-arr">&#8600;<span class="av-arrow-label">routing</span></div>
        </div>

        
        <div class="av-row av-row-spread">
            <div class="av-box history" data-svc="history" id="avHistory-8c8861b65b15d1263ff0318c3be20c61">
                <div class="av-failure-tip">Workflow progress pauses until recovery. No data lost — state in DB.</div>
                <div class="av-box-icon">&#128218;</div>
                <div class="av-box-name">History Service</div>
                <div class="av-box-desc">Event persistence, state transitions</div>
                <div class="av-box-tag">SHARDS: N</div>
            </div>
            <div class="av-box matching" data-svc="matching" id="avMatching-8c8861b65b15d1263ff0318c3be20c61">
                <div class="av-failure-tip">Tasks queue but don't dispatch. No work lost — resumes on recovery.</div>
                <div class="av-box-icon">&#128203;</div>
                <div class="av-box-name">Matching Service</div>
                <div class="av-box-desc">Task queue hosting, work dispatch</div>
                <div class="av-box-tag">TASK QUEUES</div>
            </div>
        </div>

        
        <div class="av-arrows-bot">
            <div class="av-bot-arr">&#8595;<span class="av-arrow-label">reads / writes</span></div>
            <div class="av-bot-arr">&#8595;<span class="av-arrow-label">task dispatch</span></div>
        </div>

        
        <div class="av-row av-row-spread">
            <div class="av-box persistence" data-svc="persistence" id="avPersist-8c8861b65b15d1263ff0318c3be20c61">
                <div class="av-failure-tip">All services degraded until DB recovers. This is the critical data store.</div>
                <div class="av-box-icon">&#128451;</div>
                <div class="av-box-name">Persistence</div>
                <div class="av-box-desc">PostgreSQL / Cassandra</div>
                <div class="av-box-tag">DURABLE STORAGE</div>
            </div>
            <div class="av-box workers" data-svc="workers" id="avWorkers-8c8861b65b15d1263ff0318c3be20c61">
                <div class="av-failure-tip">Pending tasks reassigned to other workers. Zero state lost.</div>
                <div class="av-box-icon">&#9881;</div>
                <div class="av-box-name">Workers</div>
                <div class="av-box-desc">Your code — workflow + activity</div>
                <div class="av-box-tag">STATELESS</div>
                <div class="av-worker-group">
                    <span class="av-worker-mini">W1</span>
                    <span class="av-worker-mini">W2</span>
                    <span class="av-worker-mini">W3</span>
                </div>
            </div>
        </div>

        
        <div class="av-arrow-row" style="margin-top: 12px;">
            <span class="av-arrow-label">Workers long-poll the Frontend for tasks and report results back over gRPC</span>
        </div>
    </div>

    <p class="av-hint">Hover over any service to see its failure impact</p>

    <script>
    (function() {
        var uid = '8c8861b65b15d1263ff0318c3be20c61';
        var container = document.querySelector('.temporal-arch-viz');
        if (!container) return;

        var boxes = container.querySelectorAll('.av-box');

        
        var connections = {
            'client': ['frontend'],
            'frontend': ['client', 'history', 'matching', 'workers'],
            'history': ['frontend', 'persistence'],
            'matching': ['frontend', 'workers'],
            'persistence': ['history'],
            'workers': ['matching', 'frontend']
        };

        function highlightService(svc) {
            var connected = connections[svc] || [];
            boxes.forEach(function(box) {
                var boxSvc = box.getAttribute('data-svc');
                if (boxSvc === svc) {
                    box.classList.add('highlighted');
                    box.classList.remove('dimmed');
                } else if (connected.indexOf(boxSvc) !== -1) {
                    box.classList.remove('dimmed');
                    box.classList.remove('highlighted');
                } else {
                    box.classList.add('dimmed');
                    box.classList.remove('highlighted');
                }
            });
        }

        function clearHighlights() {
            boxes.forEach(function(box) {
                box.classList.remove('highlighted', 'dimmed');
            });
        }

        boxes.forEach(function(box) {
            box.addEventListener('mouseenter', function() {
                highlightService(this.getAttribute('data-svc'));
            });
            box.addEventListener('mouseleave', clearHighlights);
        });
    })();
    </script>
</div>

<h2 id="primitives-for-agent-patterns">Primitives for Agent Patterns</h2>
<p>Beyond workflows and activities, Temporal provides several primitives that map to common agent coordination problems.</p>
<h3 id="signals">Signals</h3>
<p><strong>Signals</strong> are asynchronous messages sent to a running workflow. The workflow can react at any point in its execution. This is the mechanism for human-in-the-loop: the agent reaches a decision point, calls <code>workflow.wait_condition()</code>, and a signal carrying the human&rsquo;s approval resumes it.</p>
<p>The workflow can wait hours or days. It consumes no compute while waiting because its state lives in the event history, not in a running process. No worker is tied up, no server is keeping a connection open. The state is persisted in the database and can be reconstructed on demand when the signal arrives.</p>
<h3 id="queries">Queries</h3>
<p><strong>Queries</strong> let external systems read workflow state without modifying it. This powers dashboards and monitoring: &ldquo;What step is the agent on? What was the last LLM response? How many tokens has it consumed?&rdquo; The query handler runs on a worker against the workflow&rsquo;s in-memory state (rebuilt by replay if it is not already cached) and returns without adding anything to the event history.</p>
<h3 id="updates">Updates</h3>
<p><strong>Updates</strong> combine a signal and a query: send a command to the workflow and get a response. This is useful for interactive agent control (&ldquo;redo step 2 with different parameters&rdquo;) where you need to both modify the workflow&rsquo;s behavior and confirm the modification was accepted.</p>
<p>Replit, for example, uses Workflow Updates for human-in-the-loop consent. When their agent wants to perform a destructive action, it pauses and waits for the user to accept or reject via an Update.</p>
<h3 id="continueasnew">ContinueAsNew</h3>
<p>Each workflow execution is limited to <strong>51,200 events</strong> or <strong>50 MB</strong> of event history. For agents making hundreds of tool calls, history grows fast: each activity generates roughly 3 events. If activities return large LLM payloads (500 KB or more), the 50 MB limit is reached well before the event count limit.</p>
<p><strong>ContinueAsNew</strong> addresses this by atomically starting a fresh execution with the same Workflow ID and a new Run ID, carrying forward essential state while resetting the history. The previous run is closed and its history is kept according to the namespace&rsquo;s retention settings. For long-running agents, this is how you keep the workflow alive indefinitely.</p>
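<p>The limits above translate into rough activity budgets. The estimate below is a back-of-the-envelope sketch: it uses the figure of about 3 events per activity from the text, treats sizes as binary units, and ignores event metadata and workflow task events, so real histories fill up somewhat sooner. <code>activities_until_limit</code> is a hypothetical helper.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">MAX_EVENTS = 51_200
MAX_HISTORY_BYTES = 50 * 1024 * 1024  # 50 MB, treated as binary units
EVENTS_PER_ACTIVITY = 3               # rough figure, an assumption


def activities_until_limit(payload_bytes):
    """Return (budget, budget_by_events, budget_by_size) for one run."""
    by_events = MAX_EVENTS // EVENTS_PER_ACTIVITY
    by_size = MAX_HISTORY_BYTES // payload_bytes
    return min(by_events, by_size), by_events, by_size


small = activities_until_limit(2 * 1024)    # 2 KB results
large = activities_until_limit(500 * 1024)  # 500 KB results

print(small)  # (17066, 17066, 25600): the event count is the constraint
print(large)  # (102, 17066, 102): the size limit is the constraint
</code></pre></div>
<p>Under these assumptions an agent returning 500 KB payloads needs to continue as new after roughly 100 activities, long before the event count matters.</p>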
<h3 id="human-in-the-loop-pattern">Human-in-the-Loop Pattern</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#a6e22e">@workflow.defn</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">AgentWithHumanApproval</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">__init__</span>(self):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>approved <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>current_step <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;initializing&#34;</span>
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>pending_action <span style="color:#f92672">=</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">@workflow.signal</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">approve</span>(self, decision: str):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>approved <span style="color:#f92672">=</span> decision <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;yes&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">@workflow.query</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_status</span>(self) <span style="color:#f92672">-&gt;</span> dict:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;step&#34;</span>: self<span style="color:#f92672">.</span>current_step,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;pending_action&#34;</span>: self<span style="color:#f92672">.</span>pending_action,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;approved&#34;</span>: self<span style="color:#f92672">.</span>approved,
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">@workflow.run</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run</span>(self, goal: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">while</span> <span style="color:#f92672">not</span> self<span style="color:#f92672">.</span>is_complete():
</span></span><span style="display:flex;"><span>            action <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> workflow<span style="color:#f92672">.</span>execute_activity(
</span></span><span style="display:flex;"><span>                call_llm, goal,
</span></span><span style="display:flex;"><span>                start_to_close_timeout<span style="color:#f92672">=</span>timedelta(seconds<span style="color:#f92672">=</span><span style="color:#ae81ff">120</span>),
</span></span><span style="display:flex;"><span>            )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> action<span style="color:#f92672">.</span>requires_approval:
</span></span><span style="display:flex;"><span>                self<span style="color:#f92672">.</span>pending_action <span style="color:#f92672">=</span> action<span style="color:#f92672">.</span>description
</span></span><span style="display:flex;"><span>                self<span style="color:#f92672">.</span>current_step <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;awaiting_approval&#34;</span>
</span></span><span style="display:flex;"><span>                <span style="color:#75715e"># Workflow state persists in DB -- no compute cost while waiting</span>
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">await</span> workflow<span style="color:#f92672">.</span>wait_condition(<span style="color:#66d9ef">lambda</span>: self<span style="color:#f92672">.</span>approved)
</span></span><span style="display:flex;"><span>                self<span style="color:#f92672">.</span>approved <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>  <span style="color:#75715e"># reset for next approval</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>            self<span style="color:#f92672">.</span>current_step <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;executing&#34;</span>
</span></span><span style="display:flex;"><span>            result <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> workflow<span style="color:#f92672">.</span>execute_activity(
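</span></span><span style="display:flex;"><span>                execute_tool, args<span style="color:#f92672">=</span>[action<span style="color:#f92672">.</span>tool, action<span style="color:#f92672">.</span>params],  <span style="color:#75715e"># multiple arguments go through args=[...]</span>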
</span></span><span style="display:flex;"><span>                start_to_close_timeout<span style="color:#f92672">=</span>timedelta(seconds<span style="color:#f92672">=</span><span style="color:#ae81ff">60</span>),
</span></span><span style="display:flex;"><span>            )
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> self<span style="color:#f92672">.</span>format_result()
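</span></span></code></pre></div><p>The <code>workflow.wait_condition(lambda: self.approved)</code> line is where the agent pauses. It can sit there for minutes, hours, or days. If the server restarts or workers are redeployed, the workflow&rsquo;s state survives. When the signal arrives, any available worker picks it up and resumes execution. Note that, as written, any decision other than <code>&#34;yes&#34;</code> leaves <code>self.approved</code> false, so the workflow keeps waiting; a production version would record rejections explicitly and branch on them.</p>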
<div class="temporal-prims-viz">
    <style>
        .temporal-prims-viz {
            background: white;
            border-radius: 16px;
            padding: 24px;
            box-shadow: 0 12px 30px rgba(0,0,0,0.06);
            margin: 32px auto;
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            max-width: 1100px;
        }
        .temporal-prims-viz .tp-title {
            text-align: center;
            font-size: 24px;
            font-weight: 700;
            color: #1a202c;
            margin-bottom: 6px;
        }
        .temporal-prims-viz .tp-subtitle {
            text-align: center;
            font-size: 14px;
            color: #718096;
            margin-bottom: 24px;
        }
         
        .temporal-prims-viz .tp-annotations {
            position: relative;
            min-height: 80px;
            margin-bottom: 6px;
        }
        .temporal-prims-viz .tp-annotation {
            position: absolute;
            text-align: center;
            opacity: 0;
            transition: opacity 0.4s ease;
            pointer-events: none;
        }
        .temporal-prims-viz .tp-annotation.visible {
            opacity: 1;
            pointer-events: auto;
        }
        .temporal-prims-viz .tp-ann-box {
            background: white;
            border: 2px solid #e2e8f0;
            border-radius: 8px;
            padding: 6px 10px;
            font-size: 11px;
            display: inline-block;
            max-width: 180px;
        }
        .temporal-prims-viz .tp-ann-box.query {
            border-color: #f59e0b;
            background: #fffbeb;
        }
        .temporal-prims-viz .tp-ann-box.signal {
            border-color: #3b82f6;
            background: #eff6ff;
        }
        .temporal-prims-viz .tp-ann-box.update {
            border-color: #8b5cf6;
            background: #f5f3ff;
        }
        .temporal-prims-viz .tp-ann-label {
            font-weight: 700;
            font-size: 10px;
            letter-spacing: 0.5px;
            text-transform: uppercase;
            margin-bottom: 2px;
        }
        .temporal-prims-viz .tp-ann-label.query { color: #d97706; }
        .temporal-prims-viz .tp-ann-label.signal { color: #2563eb; }
        .temporal-prims-viz .tp-ann-label.update { color: #7c3aed; }
        .temporal-prims-viz .tp-ann-code {
            font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
            font-size: 10px;
            color: #475569;
        }
        .temporal-prims-viz .tp-ann-arrow {
            color: #94a3b8;
            font-size: 16px;
            line-height: 1;
        }
         
        .temporal-prims-viz .tp-timeline {
            display: flex;
            align-items: center;
            gap: 3px;
            overflow-x: auto;
            padding: 8px 0;
        }
        .temporal-prims-viz .tp-node {
            flex-shrink: 0;
            min-width: 90px;
            text-align: center;
            padding: 10px 8px;
            border-radius: 8px;
            border: 2px solid #e2e8f0;
            background: #f8fafc;
            transition: all 0.4s cubic-bezier(0.4, 0, 0.2, 1);
            position: relative;
        }
        .temporal-prims-viz .tp-node-name {
            font-size: 11px;
            font-weight: 600;
            color: #475569;
            font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
        }
        .temporal-prims-viz .tp-node-sub {
            font-size: 9px;
            color: #94a3b8;
            margin-top: 2px;
        }
        .temporal-prims-viz .tp-conn {
            flex-shrink: 0;
            color: #cbd5e0;
            font-size: 14px;
        }
        .temporal-prims-viz .tp-conn.dashed {
            border-top: 2px dashed #cbd5e0;
            width: 24px;
            height: 0;
            margin: 0 2px;
        }
         
        .temporal-prims-viz .tp-node.active {
            border-color: #3b82f6;
            background: #eff6ff;
            box-shadow: 0 4px 12px rgba(59, 130, 246, 0.25);
        }
        .temporal-prims-viz .tp-node.active .tp-node-name { color: #1e40af; }
        .temporal-prims-viz .tp-node.completed {
            border-color: #10b981;
            background: #ecfdf5;
        }
        .temporal-prims-viz .tp-node.completed .tp-node-name { color: #065f46; }
        .temporal-prims-viz .tp-node.waiting {
            border-color: #f59e0b;
            background: #fffbeb;
            box-shadow: 0 4px 12px rgba(245, 158, 11, 0.2);
        }
        .temporal-prims-viz .tp-node.waiting .tp-node-name { color: #92400e; }
        .temporal-prims-viz .tp-node.signal-active {
            border-color: #3b82f6;
            background: #dbeafe;
            box-shadow: 0 4px 12px rgba(59, 130, 246, 0.3);
        }
        .temporal-prims-viz .tp-node.signal-active .tp-node-name { color: #1e40af; }
        .temporal-prims-viz .tp-node.can-node {
            border-style: dashed;
        }
         
        .temporal-prims-viz .tp-zero-badge {
            position: absolute;
            top: -12px;
            left: 50%;
            transform: translateX(-50%);
            background: #fef3c7;
            border: 1px solid #fbbf24;
            color: #92400e;
            font-size: 9px;
            font-weight: 700;
            padding: 2px 8px;
            border-radius: 10px;
            white-space: nowrap;
            opacity: 0;
            transition: opacity 0.3s ease;
        }
        .temporal-prims-viz .tp-zero-badge.visible { opacity: 1; }
         
        .temporal-prims-viz .tp-eh-section {
            margin-top: 16px;
        }
        .temporal-prims-viz .tp-eh-label {
            font-size: 11px;
            font-weight: 700;
            color: #8b5cf6;
            letter-spacing: 1px;
            text-transform: uppercase;
            margin-bottom: 6px;
            display: flex;
            justify-content: space-between;
            align-items: center;
        }
        .temporal-prims-viz .tp-eh-count {
            font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
            font-size: 11px;
            font-weight: 600;
            color: #7c3aed;
        }
        .temporal-prims-viz .tp-eh-bar {
            background: #f5f3ff;
            border: 1px solid #e9d5ff;
            border-radius: 8px;
            height: 32px;
            position: relative;
            overflow: hidden;
        }
        .temporal-prims-viz .tp-eh-fill {
            height: 100%;
            background: linear-gradient(90deg, #8b5cf6, #a78bfa);
            border-radius: 8px;
            transition: width 0.5s ease;
            width: 0%;
        }
        .temporal-prims-viz .tp-eh-bar-label {
            position: absolute;
            top: 50%;
            left: 50%;
            transform: translate(-50%, -50%);
            font-size: 10px;
            font-weight: 600;
            color: #6d28d9;
        }
         
        .temporal-prims-viz .tp-can-section {
            margin-top: 12px;
            text-align: center;
            min-height: 24px;
        }
        .temporal-prims-viz .tp-can-badge {
            display: inline-block;
            background: #f5f3ff;
            border: 1px dashed #8b5cf6;
            border-radius: 8px;
            padding: 6px 14px;
            font-size: 11px;
            color: #6d28d9;
            font-weight: 600;
            opacity: 0;
            transition: opacity 0.4s ease;
        }
        .temporal-prims-viz .tp-can-badge.visible { opacity: 1; }
         
        .temporal-prims-viz .tp-tooltip {
            text-align: center;
            min-height: 36px;
            margin-top: 12px;
            padding: 8px 12px;
            background: #f8fafc;
            border-radius: 8px;
            border: 1px solid #e2e8f0;
            font-size: 12px;
            color: #475569;
            line-height: 1.5;
            transition: all 0.3s ease;
        }
         
        .temporal-prims-viz .tp-controls {
            display: flex;
            justify-content: center;
            gap: 10px;
            margin-top: 16px;
        }
        .temporal-prims-viz .tp-btn {
            padding: 8px 20px;
            border: none;
            border-radius: 8px;
            font-size: 13px;
            font-weight: 600;
            cursor: pointer;
            transition: all 0.3s ease;
            font-family: inherit;
        }
        .temporal-prims-viz .tp-btn:hover { transform: translateY(-1px); }
        .temporal-prims-viz .tp-btn:disabled { opacity: 0.4; cursor: not-allowed; transform: none; }
        .temporal-prims-viz .tp-btn.primary {
            background: linear-gradient(135deg, #3b82f6, #8b5cf6);
            color: white;
        }
        .temporal-prims-viz .tp-btn.primary:hover:not(:disabled) {
            box-shadow: 0 4px 14px rgba(59,130,246,0.4);
        }
        .temporal-prims-viz .tp-btn.secondary {
            background: #f1f5f9;
            color: #475569;
            border: 1px solid #e2e8f0;
        }
        .temporal-prims-viz .tp-btn.secondary:hover:not(:disabled) {
            background: #e2e8f0;
        }
        @media (max-width: 768px) {
            .temporal-prims-viz .tp-timeline {
                padding-bottom: 12px;
            }
            .temporal-prims-viz .tp-annotations {
                display: none;
            }
            .temporal-prims-viz .tp-node {
                min-width: 75px;
                padding: 8px 6px;
            }
        }
    </style>

    <h3 class="tp-title">Agent Primitives Timeline</h3>
    <p class="tp-subtitle">Signals, Queries, Updates, and wait points across an agent's lifecycle</p>

    
    <div class="tp-annotations" id="tpAnnotations-8c8861b65b15d1263ff0318c3be20c61">
        
        <div class="tp-annotation" id="tpAnnQuery-8c8861b65b15d1263ff0318c3be20c61" style="left: 42%; top: 0;">
            <div class="tp-ann-box query">
                <div class="tp-ann-label query">Query</div>
                <div class="tp-ann-code">get_status() &rarr; {step: "awaiting_approval"}</div>
            </div>
            <div class="tp-ann-arrow">&#8595;</div>
        </div>
        
        <div class="tp-annotation" id="tpAnnSignal-8c8861b65b15d1263ff0318c3be20c61" style="left: 55%; top: 0;">
            <div class="tp-ann-box signal">
                <div class="tp-ann-label signal">Signal</div>
                <div class="tp-ann-code">approve("yes")</div>
            </div>
            <div class="tp-ann-arrow">&#8595;</div>
        </div>
        
        <div class="tp-annotation" id="tpAnnUpdate-8c8861b65b15d1263ff0318c3be20c61" style="left: 72%; top: 0;">
            <div class="tp-ann-box update">
                <div class="tp-ann-label update">Update</div>
                <div class="tp-ann-code">modify_params() &rarr; {ok: true}</div>
            </div>
            <div class="tp-ann-arrow">&#8595;</div>
        </div>
    </div>

    
    <div class="tp-timeline" id="tpTimeline-8c8861b65b15d1263ff0318c3be20c61">
        <div class="tp-node" id="tpN0-8c8861b65b15d1263ff0318c3be20c61" data-idx="0">
            <div class="tp-node-name">Start</div>
            <div class="tp-node-sub">init workflow</div>
        </div>
        <div class="tp-conn">&rarr;</div>
        <div class="tp-node" id="tpN1-8c8861b65b15d1263ff0318c3be20c61" data-idx="1">
            <div class="tp-node-name">LLM Call</div>
            <div class="tp-node-sub">call_llm</div>
        </div>
        <div class="tp-conn">&rarr;</div>
        <div class="tp-node" id="tpN2-8c8861b65b15d1263ff0318c3be20c61" data-idx="2">
            <div class="tp-node-name">Tool Exec</div>
            <div class="tp-node-sub">execute_tool</div>
        </div>
        <div class="tp-conn">&rarr;</div>
        <div class="tp-node" id="tpN3-8c8861b65b15d1263ff0318c3be20c61" data-idx="3">
            <div class="tp-node-name">LLM Call</div>
            <div class="tp-node-sub">needs approval</div>
        </div>
        <div class="tp-conn">&rarr;</div>
        <div class="tp-node" id="tpN4-8c8861b65b15d1263ff0318c3be20c61" data-idx="4">
            <div class="tp-node-name">WAIT</div>
            <div class="tp-node-sub">approval needed</div>
            <div class="tp-zero-badge" id="tpZero-8c8861b65b15d1263ff0318c3be20c61">&#9201; ZERO COMPUTE</div>
        </div>
        <div class="tp-conn">&rarr;</div>
        <div class="tp-node" id="tpN5-8c8861b65b15d1263ff0318c3be20c61" data-idx="5">
            <div class="tp-node-name">Signal</div>
            <div class="tp-node-sub">approve("yes")</div>
        </div>
        <div class="tp-conn">&rarr;</div>
        <div class="tp-node" id="tpN6-8c8861b65b15d1263ff0318c3be20c61" data-idx="6">
            <div class="tp-node-name">Tool Exec</div>
            <div class="tp-node-sub">approved action</div>
        </div>
        <div class="tp-conn">&rarr;</div>
        <div class="tp-node" id="tpN7-8c8861b65b15d1263ff0318c3be20c61" data-idx="7">
            <div class="tp-node-name">Update</div>
            <div class="tp-node-sub">modify params</div>
        </div>
        <div class="tp-conn">&rarr;</div>
        <div class="tp-node" id="tpN8-8c8861b65b15d1263ff0318c3be20c61" data-idx="8">
            <div class="tp-node-name">LLM Call</div>
            <div class="tp-node-sub">final reasoning</div>
        </div>
        <div class="tp-conn">&rarr;</div>
        <div class="tp-node" id="tpN9-8c8861b65b15d1263ff0318c3be20c61" data-idx="9">
            <div class="tp-node-name">Complete</div>
            <div class="tp-node-sub">return result</div>
        </div>
        <div class="tp-conn dashed"></div>
        <div class="tp-node can-node" id="tpN10-8c8861b65b15d1263ff0318c3be20c61" data-idx="10">
            <div class="tp-node-name" style="font-size:10px;">ContinueAsNew</div>
            <div class="tp-node-sub">history reset</div>
        </div>
    </div>

    
    <div class="tp-eh-section">
        <div class="tp-eh-label">
            <span>Event History</span>
            <span class="tp-eh-count" id="tpEHCount-8c8861b65b15d1263ff0318c3be20c61">Events: 0 / 51,200</span>
        </div>
        <div class="tp-eh-bar">
            <div class="tp-eh-fill" id="tpEHFill-8c8861b65b15d1263ff0318c3be20c61"></div>
            <div class="tp-eh-bar-label" id="tpEHBarLabel-8c8861b65b15d1263ff0318c3be20c61"></div>
        </div>
    </div>

    
    <div class="tp-can-section">
        <div class="tp-can-badge" id="tpCANBadge-8c8861b65b15d1263ff0318c3be20c61">
            &#8635; ContinueAsNew: history reset to 0, workflow continues with fresh execution
        </div>
    </div>

    
    <div class="tp-tooltip" id="tpTooltip-8c8861b65b15d1263ff0318c3be20c61">
        Press Play to animate through an agent's lifecycle with Signals, Queries, and Updates
    </div>

    <div class="tp-controls">
        <button type="button" class="tp-btn primary" id="tpPlay-8c8861b65b15d1263ff0318c3be20c61">&#9654; Play</button>
        <button type="button" class="tp-btn secondary" id="tpStep-8c8861b65b15d1263ff0318c3be20c61">&#9197; Step</button>
        <button type="button" class="tp-btn secondary" id="tpReset-8c8861b65b15d1263ff0318c3be20c61">&#8635; Reset</button>
    </div>

    <script>
    (function() {
        var uid = '8c8861b65b15d1263ff0318c3be20c61';
        var nodes = [];
        for (var i = 0; i <= 10; i++) {
            nodes.push(document.getElementById('tpN' + i + '-' + uid));
        }
        var zeroBadge = document.getElementById('tpZero-' + uid);
        var tooltip = document.getElementById('tpTooltip-' + uid);
        var ehFill = document.getElementById('tpEHFill-' + uid);
        var ehCount = document.getElementById('tpEHCount-' + uid);
        var ehBarLabel = document.getElementById('tpEHBarLabel-' + uid);
        var canBadge = document.getElementById('tpCANBadge-' + uid);
        var annQuery = document.getElementById('tpAnnQuery-' + uid);
        var annSignal = document.getElementById('tpAnnSignal-' + uid);
        var annUpdate = document.getElementById('tpAnnUpdate-' + uid);
        var playBtn = document.getElementById('tpPlay-' + uid);
        var stepBtn = document.getElementById('tpStep-' + uid);
        var resetBtn = document.getElementById('tpReset-' + uid);

        var currentStep = -1;
        var isPlaying = false;
        var playTimer = null;
        var eventCount = 0;

        var phases = [
            { idx: 0, state: 'active', events: 1200, tooltip: '<strong>Start</strong>: Workflow execution begins. WorkflowExecutionStarted event recorded.', ann: null },
            { idx: 1, state: 'active', events: 5400, tooltip: '<strong>LLM Call</strong>: Activity scheduled for LLM inference. Result recorded in event history on completion.', ann: null },
            { idx: 2, state: 'active', events: 9800, tooltip: '<strong>Tool Execution</strong>: Activity executes the requested tool. Retry policy handles transient failures.', ann: null },
            { idx: 3, state: 'active', events: 14200, tooltip: '<strong>LLM Call</strong>: LLM determines the next action requires human approval before proceeding.', ann: null },
            { idx: 4, state: 'waiting', events: 14200, tooltip: '<strong>WAIT</strong>: <code>workflow.wait_condition()</code> — workflow state persists in DB. <em>No worker is tied up, no compute consumed.</em> Can wait hours or days.', ann: 'query', zeroBadge: true },
            { idx: 5, state: 'signal-active', events: 14800, tooltip: '<strong>Signal Received</strong>: <code>approve("yes")</code> sent to the workflow. Execution resumes on any available worker.', ann: 'signal' },
            { idx: 6, state: 'active', events: 19200, tooltip: '<strong>Tool Execution</strong>: The approved action executes. This activity has its own retry policy.', ann: null },
            { idx: 7, state: 'active', events: 23800, tooltip: '<strong>Update</strong>: Bidirectional communication — send a command <em>and</em> receive a response. Used for interactive agent control.', ann: 'update' },
            { idx: 8, state: 'active', events: 28400, tooltip: '<strong>LLM Call</strong>: Final reasoning step. Agent synthesizes results for the response.', ann: null },
            { idx: 9, state: 'completed', events: 29000, tooltip: '<strong>Complete</strong>: Workflow finishes. Full event history available for audit and replay.', ann: null },
            { idx: 10, state: 'active', events: 48200, tooltip: '<strong>ContinueAsNew</strong>: History approaching limit (48,200 / 51,200). Atomically start fresh execution carrying forward essential state.', ann: null, canBadge: true }
        ];

        function updateEH(count) {
            eventCount = count;
            var pct = Math.min((count / 51200) * 100, 100);
            ehFill.style.width = pct + '%';
            ehCount.textContent = 'Events: ' + count.toLocaleString() + ' / 51,200';
            if (pct > 5) {
                ehBarLabel.textContent = count.toLocaleString();
            } else {
                ehBarLabel.textContent = '';
            }
        }

        function clearAll() {
            for (var i = 0; i <= 10; i++) {
                nodes[i].className = nodes[i].className.indexOf('can-node') !== -1 ? 'tp-node can-node' : 'tp-node';
            }
            zeroBadge.classList.remove('visible');
            annQuery.classList.remove('visible');
            annSignal.classList.remove('visible');
            annUpdate.classList.remove('visible');
            canBadge.classList.remove('visible');
            updateEH(0);
            tooltip.innerHTML = 'Press Play to animate through an agent\'s lifecycle with Signals, Queries, and Updates';
            currentStep = -1;
        }

        function runStep(stepIdx) {
            currentStep = stepIdx;
            if (stepIdx < 0 || stepIdx >= phases.length) return;

            var phase = phases[stepIdx];

            
            for (var i = 0; i < stepIdx; i++) {
                var pIdx = phases[i].idx;
                if (nodes[pIdx].className.indexOf('can-node') !== -1) {
                    nodes[pIdx].className = 'tp-node can-node completed';
                } else {
                    nodes[pIdx].className = 'tp-node completed';
                }
            }

            
            var base = nodes[phase.idx].className.indexOf('can-node') !== -1 ? 'tp-node can-node ' : 'tp-node ';
            nodes[phase.idx].className = base + phase.state;

            
            for (var j = stepIdx + 1; j < phases.length; j++) {
                var fIdx = phases[j].idx;
                nodes[fIdx].className = nodes[fIdx].className.indexOf('can-node') !== -1 ? 'tp-node can-node' : 'tp-node';
            }

            
            updateEH(phase.events);

            
            tooltip.innerHTML = phase.tooltip;

            
            if (phase.zeroBadge) {
                zeroBadge.classList.add('visible');
            } else {
                zeroBadge.classList.remove('visible');
            }

            
            annQuery.classList.remove('visible');
            annSignal.classList.remove('visible');
            annUpdate.classList.remove('visible');
            if (phase.ann === 'query') annQuery.classList.add('visible');
            if (phase.ann === 'signal') annSignal.classList.add('visible');
            if (phase.ann === 'update') annUpdate.classList.add('visible');

            
            if (phase.canBadge) {
                canBadge.classList.add('visible');
            } else {
                canBadge.classList.remove('visible');
            }

            
            if (stepIdx >= phases.length - 1) {
                stopPlay();
            }
        }

        function nextStep() {
            if (currentStep >= phases.length - 1) return;
            runStep(currentStep + 1);
        }

        function startPlay() {
            if (isPlaying) return;
            if (currentStep >= phases.length - 1) { clearAll(); }
            isPlaying = true;
            playBtn.textContent = '⏸ Pause';
            stepBtn.disabled = true;
            playTimer = setInterval(function() {
                if (currentStep >= phases.length - 1) {
                    stopPlay();
                    return;
                }
                nextStep();
            }, 1800);
            if (currentStep < 0) nextStep();
        }

        function stopPlay() {
            isPlaying = false;
            clearInterval(playTimer);
            playBtn.textContent = currentStep >= phases.length - 1 ? '↺ Replay' : '▶ Play';
            stepBtn.disabled = currentStep >= phases.length - 1;
        }

        playBtn.addEventListener('click', function() {
            if (currentStep >= phases.length - 1) { clearAll(); startPlay(); return; }
            if (isPlaying) stopPlay(); else startPlay();
        });
        stepBtn.addEventListener('click', nextStep);
        resetBtn.addEventListener('click', function() {
            stopPlay();
            clearAll();
            playBtn.textContent = '▶ Play';
            stepBtn.disabled = false;
        });

        clearAll();
    })();
    </script>
</div>

<h2 id="retry-policies-and-error-handling">Retry Policies and Error Handling</h2>
<p>LLM APIs fail routinely. Rate limits (429), server errors (500), socket timeouts, and multi-minute latencies are the norm for agents making hundreds of calls, and different activities need different retry strategies.</p>
<h3 id="declarative-retry-policies">Declarative Retry Policies</h3>
<p>Retry policies are configured per activity with several parameters: initial interval, backoff coefficient, maximum interval, maximum attempts, and non-retryable error types. The important part is that retries happen at the infrastructure level. If a worker crashes during a retry cycle, another worker picks up with the retry state intact. The developer writes no retry logic.</p>
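<p>To make those parameters concrete, here is a small plain-Python sketch of the delay schedule that capped exponential backoff produces. It is an illustration of the arithmetic, not Temporal&rsquo;s implementation: the function name <code>retry_delays</code> is made up, the example values (1s initial interval, coefficient 2.0, 60s cap, 10 attempts) are assumptions, and any jitter the server may apply is ignored.</p>

```python
from datetime import timedelta


def retry_delays(
    initial_interval: timedelta,
    backoff_coefficient: float,
    maximum_interval: timedelta,
    maximum_attempts: int,
) -> list[timedelta]:
    """Delays before attempts 2, 3, ... under capped exponential backoff."""
    delays = []
    for retry in range(maximum_attempts - 1):  # the first attempt has no delay
        delay = initial_interval * (backoff_coefficient ** retry)
        delays.append(min(delay, maximum_interval))
    return delays


delays = retry_delays(timedelta(seconds=1), 2.0, timedelta(seconds=60), 10)
print([int(d.total_seconds()) for d in delays])  # [1, 2, 4, 8, 16, 32, 60, 60, 60]
print(sum(delays, timedelta()).total_seconds())  # 243.0
```

<p>Ten attempts with these values add up to about four minutes of waiting in the worst case, which is worth checking against the activity&rsquo;s own timeouts.</p>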
<h3 id="why-different-activities-need-different-strategies">Why Different Activities Need Different Strategies</h3>
<p><strong>LLM calls</strong> need aggressive retry with exponential backoff. Rate limits are transient, and the cost of <em>not</em> retrying (losing all accumulated context and starting the agent run from scratch) far outweighs the cost of waiting 30 seconds for capacity. Configure high maximum attempts (10+) with a long maximum interval.</p>
<p><strong>Tool executions</strong> need limited retries. Tools may not be idempotent &ndash; running <code>git commit</code> twice produces different results. Blindly retrying could cause duplicate side effects. Configure low maximum attempts (2&ndash;3) and mark certain error types as non-retryable.</p>
<p><strong>Human notifications</strong> often need no retry at all (a maximum of one attempt). Treat them as fire-and-forget: if the Slack message fails, don&rsquo;t block the workflow.</p>
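<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> datetime <span style="color:#f92672">import</span> timedelta
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> temporalio <span style="color:#f92672">import</span> activity
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> temporalio.common <span style="color:#f92672">import</span> RetryPolicy
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>llm_retry <span style="color:#f92672">=</span> RetryPolicy(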
</span></span><span style="display:flex;"><span>    initial_interval<span style="color:#f92672">=</span>timedelta(seconds<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>),
</span></span><span style="display:flex;"><span>    backoff_coefficient<span style="color:#f92672">=</span><span style="color:#ae81ff">2.0</span>,
</span></span><span style="display:flex;"><span>    maximum_interval<span style="color:#f92672">=</span>timedelta(seconds<span style="color:#f92672">=</span><span style="color:#ae81ff">60</span>),
</span></span><span style="display:flex;"><span>    maximum_attempts<span style="color:#f92672">=</span><span style="color:#ae81ff">10</span>,
</span></span><span style="display:flex;"><span>    non_retryable_error_types<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;InvalidPromptError&#34;</span>],
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tool_retry <span style="color:#f92672">=</span> RetryPolicy(
</span></span><span style="display:flex;"><span>    initial_interval<span style="color:#f92672">=</span>timedelta(seconds<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>),
</span></span><span style="display:flex;"><span>    maximum_attempts<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>,
</span></span><span style="display:flex;"><span>    non_retryable_error_types<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;PermissionDenied&#34;</span>, <span style="color:#e6db74">&#34;NotIdempotent&#34;</span>],
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Heartbeating for long-running activities</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@activity.defn</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">execute_long_tool</span>(task: dict) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    result <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i, chunk <span style="color:#f92672">in</span> enumerate(process_chunks(task)):
</span></span><span style="display:flex;"><span>        activity<span style="color:#f92672">.</span>heartbeat({<span style="color:#e6db74">&#34;progress&#34;</span>: i, <span style="color:#e6db74">&#34;last_chunk&#34;</span>: chunk<span style="color:#f92672">.</span>id})
</span></span><span style="display:flex;"><span>        result <span style="color:#f92672">+=</span> <span style="color:#66d9ef">await</span> process(chunk)  <span style="color:#75715e"># accumulate output instead of overwriting it</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> result
</span></span></code></pre></div><h3 id="heartbeats">Heartbeats</h3>
<p>For long-running activities, the worker periodically reports progress via heartbeats. If the heartbeat stops (worker crashed), Temporal reschedules the activity on another worker. The new worker can read the last heartbeat details to resume from the last checkpoint rather than starting over. This matters for activities processing large datasets or running multi-step tool executions.</p>
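<p>The resume-from-checkpoint idea can be sketched without a Temporal server. This is a plain-Python simulation, not SDK code: <code>run_attempt</code>, <code>crash_before</code>, and the <code>next_index</code> key are hypothetical names, and the <code>details</code> dict stands in for the heartbeat details the server keeps between attempts (in the Python SDK a retried attempt reads them from <code>activity.info().heartbeat_details</code>).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def run_attempt(chunks, heartbeat_details, crash_before=None):
    """Process chunks, resuming after the last checkpoint in heartbeat_details.

    heartbeat_details stands in for the details handed back to a retried
    activity; crash_before simulates the worker dying before that chunk index.
    """
    start = heartbeat_details.get("next_index", 0)
    processed = []
    for i in range(start, len(chunks)):
        if i == crash_before:
            raise RuntimeError(f"worker crashed before chunk {i}")
        processed.append(chunks[i].upper())      # stand-in for real work
        heartbeat_details["next_index"] = i + 1  # checkpoint once the work is done
    return processed


chunks = ["a", "b", "c", "d", "e"]
details = {}  # survives across attempts, like server-side heartbeat details

try:
    run_attempt(chunks, details, crash_before=3)  # first attempt finishes chunks 0..2
except RuntimeError:
    pass

resumed = run_attempt(chunks, details)  # second attempt starts at chunk 3
print(details, resumed)
</code></pre></div>
<p>Only the position is checkpointed here. The output of the first attempt is lost with the crashed worker, so an activity that needs its partial results has to persist them somewhere durable as it goes.</p>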
<h3 id="saga-patterns-for-multi-agent-systems">Saga Patterns for Multi-Agent Systems</h3>
<p>When multiple agents coordinate, failure handling gets complex. Temporal supports saga patterns where compensation logic runs when a step fails. If a planning agent fails, downstream execution agents&rsquo; pending activities can be cancelled rather than left hanging. If the response agent produces an unsatisfactory draft, compensation logic can route back to the research agent for additional context.</p>
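<p>Stripped of the agents, a saga is a list of steps that each carry their own undo action. The sketch below is a minimal in-process version under that assumption; <code>run_saga</code> and the step names are hypothetical, and in a real Temporal workflow each action and compensation would be an activity call rather than a local function.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def run_saga(steps):
    """Run (name, action, compensation) steps; on failure, undo completed steps in reverse."""
    log = []
    undo_stack = []
    try:
        for name, action, compensation in steps:
            action()
            log.append(f"done:{name}")
            undo_stack.append((name, compensation))
    except Exception as exc:
        log.append(f"failed:{exc}")
        # Compensate only what actually completed, newest first.
        for name, compensation in reversed(undo_stack):
            compensation()
            log.append(f"compensated:{name}")
    return log


def noop():
    return None


def failing_execute():
    raise RuntimeError("execute")


steps = [
    ("plan", noop, noop),
    ("research", noop, noop),
    ("execute", failing_execute, noop),
]
saga_log = run_saga(steps)
print(saga_log)
</code></pre></div>
<p>The failed step itself is not compensated, because it never completed. Only the steps before it are unwound, in reverse order.</p>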
<table>
  <thead>
      <tr>
          <th style="text-align: left">Activity Type</th>
          <th style="text-align: left">Retry Strategy</th>
          <th style="text-align: left">Rationale</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">LLM API call</td>
          <td style="text-align: left">Aggressive backoff, 10+ attempts</td>
          <td style="text-align: left">Rate limits are transient; restart cost is enormous</td>
      </tr>
      <tr>
          <td style="text-align: left">Idempotent tools (search, read)</td>
          <td style="text-align: left">Moderate backoff, 3&ndash;5 attempts</td>
          <td style="text-align: left">Safe to re-execute; failures are usually transient</td>
      </tr>
      <tr>
          <td style="text-align: left">Non-idempotent tools (write, deploy)</td>
          <td style="text-align: left">Limited, 1&ndash;2 attempts</td>
          <td style="text-align: left">Re-execution may cause side effects</td>
      </tr>
      <tr>
          <td style="text-align: left">Human notification</td>
          <td style="text-align: left">No retry</td>
          <td style="text-align: left">Fire-and-forget; don&rsquo;t block the workflow</td>
      </tr>
      <tr>
          <td style="text-align: left">Long-running computation</td>
          <td style="text-align: left">Heartbeat + resume from checkpoint</td>
          <td style="text-align: left">Avoid restarting expensive work from scratch</td>
      </tr>
  </tbody>
</table>
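<p>To make the &ldquo;aggressive backoff&rdquo; row concrete, here is the wait schedule implied by the <code>llm_retry</code> values shown earlier (1-second initial interval, coefficient 2.0, 60-second cap, 10 attempts). It is a back-of-the-envelope calculation that ignores any jitter and the time each attempt itself takes; <code>retry_delays</code> is a hypothetical helper, not part of the SDK.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def retry_delays(initial, coefficient, maximum, max_attempts):
    """Seconds waited before each retry: initial * coefficient**n, capped at maximum."""
    # max_attempts includes the first try, so there are max_attempts - 1 waits.
    return [min(initial * coefficient ** n, maximum) for n in range(max_attempts - 1)]


delays = retry_delays(initial=1, coefficient=2.0, maximum=60, max_attempts=10)
total_wait = sum(delays)
print(delays)
print(total_wait)
</code></pre></div>
<p>Ten attempts means nine waits. The cap takes over from the seventh wait onward, and the policy gives up after about four minutes (243 seconds) of cumulative backoff.</p>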
<h2 id="production-case-study-openai-codex">Production Case Study: OpenAI Codex</h2>
<p>OpenAI&rsquo;s Codex, their cloud-based coding agent that writes, tests, and iterates on code, uses Temporal as its core orchestration backbone. Will Wang, a software engineer on the Codex team, confirmed publicly that <strong>&ldquo;Temporal is a critical part of the infrastructure powering Codex, responsible for executing our core control flows.&rdquo;</strong> He described it as enabling the team to &ldquo;easily reason about concurrency, correctness, and fault tolerance&rdquo; while scaling a complicated distributed system.</p>
<p>Codex sessions run for 6+ hours on complex tasks. The entire agent loop (prompt construction, model inference, tool calls, result observation, loop back) runs as a Temporal Workflow. Each LLM call and tool execution is an Activity with its own retry policy and timeout. A single &ldquo;turn&rdquo; can involve hundreds of tool calls.</p>
<p>The Codex harness manages three conversation primitives: <strong>Items</strong> (atomic I/O units like messages or diffs), <strong>Turns</strong> (one unit of agent work from user input), and <strong>Threads</strong> (the durable container for an ongoing session). OpenAI describes threads as &ldquo;durable containers&rdquo; with &ldquo;persisted event history&rdquo; that supports resume, fork, archive, and reconnection, a description that maps closely onto Temporal&rsquo;s Event History.</p>
<p>Codex has a self-review pattern internally called the <strong>&ldquo;Ralph Wiggum Loop&rdquo;</strong>: the agent reviews its own changes, requests additional agent reviews, and iterates until all reviewers are satisfied. In Temporal terms, the review results arrive as signals, and the workflow decides whether to iterate or complete.</p>
<p>The relationship extends beyond Codex. In July 2025, OpenAI and Temporal launched a formal integration adding durable execution to the <strong>OpenAI Agents SDK</strong>. Every agent invocation runs as a Temporal Activity, and orchestration runs as a Temporal Workflow. Temporal also processes millions of ChatGPT Images generation workflows. Venkat Venkataramani (OpenAI&rsquo;s VP of App Infrastructure) reinforced this at Temporal&rsquo;s Series D announcement: &ldquo;Durable execution is a core requirement for modern AI systems.&rdquo;</p>
<h2 id="framework-integrations">Framework Integrations</h2>
<p>Temporal integrates with existing agent frameworks so teams don&rsquo;t have to rewrite their agent logic from scratch. The pattern is the same across integrations: Temporal provides the durability layer, the framework provides the agent logic.</p>
<h3 id="pydanticai--temporal">PydanticAI + Temporal</h3>
<p>PydanticAI has first-class Temporal support via a <code>TemporalAgent</code> wrapper that preserves PydanticAI&rsquo;s type-safety while offloading non-deterministic model requests and tool calls to Temporal activities. The orchestration logic lives in a deterministic workflow, and all I/O-bound tasks are automatically wrapped as activities.</p>
<p>One significant design decision: thread-based workflows. Each conversation thread gets its own Temporal workflow that persists for the lifetime of the conversation. This is more efficient than stateless approaches because the system only processes new messages, maintaining context within workflow state rather than re-sending the entire history for every inference.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> pydantic_ai <span style="color:#f92672">import</span> Agent
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> pydantic_ai.models.openai <span style="color:#f92672">import</span> OpenAIModel
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> temporalio.client <span style="color:#f92672">import</span> Client
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Define the agent with PydanticAI&#39;s type-safe interface</span>
</span></span><span style="display:flex;"><span>support_agent <span style="color:#f92672">=</span> Agent(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span>OpenAIModel(<span style="color:#e6db74">&#34;gpt-4o&#34;</span>),
</span></span><span style="display:flex;"><span>    system_prompt<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;You are a customer support agent.&#34;</span>,
</span></span><span style="display:flex;"><span>    result_type<span style="color:#f92672">=</span>SupportResponse,  <span style="color:#75715e"># Pydantic model for type-safe output</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@support_agent.tool</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">lookup_order</span>(ctx, order_id: str) <span style="color:#f92672">-&gt;</span> OrderDetails:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> <span style="color:#66d9ef">await</span> db<span style="color:#f92672">.</span>get_order(order_id)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Wrap with Temporal for durability</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> pydantic_ai_temporal <span style="color:#f92672">import</span> TemporalAgent
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>temporal_agent <span style="color:#f92672">=</span> TemporalAgent(
</span></span><span style="display:flex;"><span>    agent<span style="color:#f92672">=</span>support_agent,
</span></span><span style="display:flex;"><span>    client<span style="color:#f92672">=</span><span style="color:#66d9ef">await</span> Client<span style="color:#f92672">.</span>connect(<span style="color:#e6db74">&#34;localhost:7233&#34;</span>),
</span></span><span style="display:flex;"><span>    task_queue<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;support-agents&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Each conversation gets a durable workflow</span>
</span></span><span style="display:flex;"><span>result <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> temporal_agent<span style="color:#f92672">.</span>run(
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;What&#39;s the status of order #12345?&#34;</span>,
</span></span><span style="display:flex;"><span>    thread_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;customer-session-abc&#34;</span>,
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h3 id="openai-agents-sdk--temporal">OpenAI Agents SDK + Temporal</h3>
<p>The OpenAI Agents SDK integration centers on the <code>activity_as_tool</code> helper (written <code>activityAsTool</code> in the TypeScript example below). This function automatically generates OpenAI-compatible tool schemas directly from Temporal activity signatures. The agent reasons about and invokes activities as tools, with every tool call backed by durable execution.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span><span style="color:#66d9ef">import</span> { <span style="color:#a6e22e">activityAsTool</span> } <span style="color:#66d9ef">from</span> <span style="color:#e6db74">&#34;@temporalio/openai-agents&#34;</span>;
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">import</span> { <span style="color:#a6e22e">OpenAIAgentsPlugin</span> } <span style="color:#66d9ef">from</span> <span style="color:#e6db74">&#34;@temporalio/openai-agents&#34;</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">// Temporal activities become tools the agent can call
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">searchTool</span> <span style="color:#f92672">=</span> <span style="color:#a6e22e">activityAsTool</span>(<span style="color:#a6e22e">searchDocuments</span>, {
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">startToCloseTimeout</span><span style="color:#f92672">:</span> <span style="color:#e6db74">&#34;30s&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">retryPolicy</span><span style="color:#f92672">:</span> { <span style="color:#a6e22e">maximumAttempts</span>: <span style="color:#66d9ef">3</span> },
</span></span><span style="display:flex;"><span>});
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">writeTool</span> <span style="color:#f92672">=</span> <span style="color:#a6e22e">activityAsTool</span>(<span style="color:#a6e22e">writeDocument</span>, {
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">startToCloseTimeout</span><span style="color:#f92672">:</span> <span style="color:#e6db74">&#34;60s&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">retryPolicy</span><span style="color:#f92672">:</span> { <span style="color:#a6e22e">maximumAttempts</span>: <span style="color:#66d9ef">1</span> },
</span></span><span style="display:flex;"><span>});
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">// Agent orchestration runs as a Temporal Workflow
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">// Each tool call is a durable Activity
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">plugin</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#a6e22e">OpenAIAgentsPlugin</span>({
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">client</span>: <span style="color:#66d9ef">temporalClient</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">taskQueue</span><span style="color:#f92672">:</span> <span style="color:#e6db74">&#34;agent-workers&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">tools</span><span style="color:#f92672">:</span> [<span style="color:#a6e22e">searchTool</span>, <span style="color:#a6e22e">writeTool</span>],
</span></span><span style="display:flex;"><span>});
</span></span></code></pre></div><p>Developers use the <code>OpenAIAgentsPlugin</code> to configure the Temporal client and worker, enabling integrated tracing that provides visibility through both the Temporal UI and OpenAI dashboards.</p>
<h2 id="when-temporal-adds-unnecessary-complexity">When Temporal Adds Unnecessary Complexity</h2>
<p>Temporal is not always the right choice. Here&rsquo;s where it adds more complexity than value:</p>
<ul>
<li><strong>Simple agents</strong>: a single LLM call followed by one tool call doesn&rsquo;t benefit from durable execution infrastructure. One comparison found that adding Temporal to a simple document indexing pipeline required &ldquo;rearchitecting the app, splitting it into two services, adding a runtime dependency on a third service, and adding over 100 lines of code&rdquo; where a lighter-weight approach achieved the same with 7 lines.</li>
<li><strong>Prototyping and experimentation</strong>: when you&rsquo;re iterating on agent architecture, the determinism constraints and operational overhead slow you down.</li>
<li><strong>Sub-30-second agents</strong>: if the agent completes before infrastructure failures become likely, the cost of durable execution exceeds the benefit.</li>
<li><strong>Teams without infrastructure engineering capacity</strong>: self-hosted Temporal requires operating four services plus a database. If you don&rsquo;t have the team to manage this, the operational burden may outweigh the reliability gains.</li>
</ul>
<h2 id="trade-offs">Trade-offs</h2>
<p>Temporal&rsquo;s guarantees come with trade-offs that shape day-to-day development experience.</p>
<h3 id="operational-complexity">Operational Complexity</h3>
<p>Self-hosted Temporal requires deploying four independent services plus a persistence database (PostgreSQL, MySQL, or Cassandra) and optionally Elasticsearch for advanced visibility. This is not a single process with a single run command.</p>
<h3 id="learning-curve">Learning Curve</h3>
<p>Engineers must internalize a long list of concepts: workflows vs. activities, determinism rules, event history mechanics, signals, queries, updates, ContinueAsNew, versioning strategies, and worker configuration.</p>
<p>The determinism constraint confuses newcomers, especially because LLMs are inherently non-deterministic. The resolution (LLM calls go in activities, not workflows) is simple once understood, but the documentation framing perpetuates the misconception.</p>
<h3 id="event-history-limits">Event History Limits</h3>
<p>Each workflow execution is limited to 51,200 events or 50MB. An activity generates roughly 3 events. If activities return large LLM payloads (500KB+), the 50MB limit becomes binding well before the event count limit. The mitigation &ndash; ContinueAsNew, which atomically starts a fresh execution with carried-over state &ndash; works but adds architectural complexity. Teams building agents with many LLM calls must implement payload offloading (store large payloads in S3, pass references) and proactively manage history growth.</p>
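<p>A quick calculation shows which limit binds first. This sketch uses the figures quoted above (51,200 events, 50MB, roughly 3 events per activity) and counts only activity result payloads, so it is an optimistic estimate; <code>activities_before_limit</code> is a hypothetical helper.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">EVENT_LIMIT = 51_200        # events per workflow execution
SIZE_LIMIT_KB = 50 * 1024   # 50MB history size cap, expressed in KB
EVENTS_PER_ACTIVITY = 3     # rough figure from the text


def activities_before_limit(payload_kb):
    """Return (by_event_count, by_size), counting only activity result payloads."""
    by_events = EVENT_LIMIT // EVENTS_PER_ACTIVITY
    by_size = SIZE_LIMIT_KB // payload_kb
    return by_events, by_size


large = activities_before_limit(500)  # 500KB LLM responses
small = activities_before_limit(2)    # 2KB tool results
print(large, small)
</code></pre></div>
<p>With 500KB payloads the size cap allows about 100 activities, against roughly 17,000 by event count. With 2KB payloads the event count becomes the tighter limit. That gap is why payload offloading matters more than event counting for LLM-heavy workflows.</p>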
<h3 id="latency">Latency</h3>
<p>Temporal Cloud&rsquo;s minimum end-to-end latency is roughly <strong>100ms per workflow step</strong>, with a single activity round-trip taking approximately <strong>220ms</strong>. Local Activities save ~50ms per call but sacrifice heartbeating and independent retry capabilities. For agents where sub-second interactivity matters (chatbot-like interactions), this overhead accumulates across many steps: an interaction with 50 sequential steps spends roughly 5 seconds in infrastructure overhead at 100ms per step, and about 11 seconds if each step is a full 220ms activity round-trip.</p>
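<p>The per-step figures multiply out quickly. The sketch below assumes the steps run strictly one after another and applies the numbers quoted above; <code>overhead_seconds</code> is a hypothetical helper, and real overhead will vary with region, load, and parallelism.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def overhead_seconds(steps, per_step_ms):
    """Infrastructure overhead, in seconds, for a run of sequential steps."""
    return steps * per_step_ms / 1000


STEPS = 50
workflow_floor = overhead_seconds(STEPS, 100)         # 100ms per workflow step
regular_activities = overhead_seconds(STEPS, 220)     # 220ms per activity round-trip
local_activities = overhead_seconds(STEPS, 220 - 50)  # ~50ms saved per call
print(workflow_floor, regular_activities, local_activities)
</code></pre></div>
<p>For a 50-step interaction, switching every call to a Local Activity recovers about 2.5 seconds, which is the number to weigh against losing heartbeats and independent retries.</p>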
<h3 id="versioning">Versioning</h3>
<p>Code changes to workflow logic can cause non-determinism errors during replay of running workflows. If a running workflow was started with version 1 of the code and a worker running version 2 picks it up, the replay may produce different activity commands, causing a non-determinism exception. Temporal provides patching APIs and worker versioning, but patches accumulate in code and &ldquo;need to be removed with extreme care.&rdquo; Airbyte documented struggles with non-determinism exceptions, ultimately deciding to fail affected workflows rather than attempting recovery. Safe deployment requires replay testing against production event histories in CI.</p>
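<p>The failure mode and the patching fix can be shown with a toy replay check. This is a simplified model, not the Temporal API: a &ldquo;workflow&rdquo; here is just a function that returns its list of commands, <code>replay</code> and the workflow names are hypothetical, and the <code>patched</code> callable only imitates the role that <code>workflow.patched(...)</code> plays in the Python SDK.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">OLD_HISTORY = ["charge_card", "send_receipt"]  # commands recorded by version 1


def v2_unguarded(patched):
    # Version 2 adds a step unconditionally, so old histories no longer match.
    return ["fraud_check", "charge_card", "send_receipt"]


def v2_patched(patched):
    # Take the new branch only when the execution carries the patch marker.
    first = ["fraud_check"] if patched("add-fraud-check") else []
    return first + ["charge_card", "send_receipt"]


def replay(workflow, history, markers=frozenset()):
    """Re-run workflow code and check its commands against the recorded history."""
    commands = workflow(lambda patch_id: patch_id in markers)
    return commands == history


unguarded_ok = replay(v2_unguarded, OLD_HISTORY)  # mismatch: non-determinism
patched_ok = replay(v2_patched, OLD_HISTORY)      # old path is preserved
new_run = v2_patched(lambda patch_id: True)       # fresh executions get the new step
print(unguarded_ok, patched_ok, new_run)
</code></pre></div>
<p>Replay testing in CI is the same check run against real exported histories: load each history, replay the new code against it, and fail the build on any mismatch.</p>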
<table>
  <thead>
      <tr>
          <th style="text-align: left">Trade-off</th>
          <th style="text-align: left">Impact</th>
          <th style="text-align: left">Mitigation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Operational complexity</td>
          <td style="text-align: left">4+ services to manage, or cloud costs</td>
          <td style="text-align: left">Temporal Cloud; start with dev server locally</td>
      </tr>
      <tr>
          <td style="text-align: left">Learning curve</td>
          <td style="text-align: left">2&ndash;3 weeks for team onboarding</td>
          <td style="text-align: left">Start with simple workflows, add primitives incrementally</td>
      </tr>
      <tr>
          <td style="text-align: left">Event history limits</td>
          <td style="text-align: left">51,200 events / 50MB cap per execution</td>
          <td style="text-align: left">ContinueAsNew + payload offloading to S3</td>
      </tr>
      <tr>
          <td style="text-align: left">Latency overhead</td>
          <td style="text-align: left">~100ms/step, ~220ms/activity round-trip</td>
          <td style="text-align: left">Local Activities for latency-sensitive paths</td>
      </tr>
      <tr>
          <td style="text-align: left">Versioning complexity</td>
          <td style="text-align: left">Non-determinism errors on code changes</td>
          <td style="text-align: left">Replay testing in CI, worker versioning</td>
      </tr>
  </tbody>
</table>
<h2 id="closing-thoughts">Closing Thoughts</h2>
<p>We covered a lot of ground here: the workflow/activity split, deterministic replay, server architecture, coordination primitives, retry strategies, and how OpenAI&rsquo;s Codex team puts it all together.</p>
<p>The core design insight is the separation of deterministic orchestration from non-deterministic execution. Once you accept that split, replay-based recovery falls out as a consequence &ndash; and with it, most of the infrastructure problems we listed at the top of this post.</p>
<p>OpenAI, Replit, Block, NVIDIA, and others have independently converged on durable execution for their agent workloads. Temporal&rsquo;s recent $300M Series D at a $5B valuation, with 380%+ year-over-year revenue growth driven substantially by AI workloads, suggests this is a real pattern. The company joined the <strong>Agentic AI Foundation</strong> (under the Linux Foundation) alongside Anthropic, OpenAI, and Block.</p>
<p>For most teams, the practical path is: prototype with something lighter (LangGraph, CrewAI), validate the agent architecture, and migrate when the agents run long enough and matter enough that you can&rsquo;t afford to lose state on a crash. The operational investment is real, but so is the cost of rebuilding reliability from scratch.</p>
<h2 id="references">References</h2>
<ol>
<li>
<p><strong>Temporal Documentation.</strong> <a href="https://docs.temporal.io/">Core Concepts &ndash; Workflows, Activities, Workers</a>. Temporal Technologies.</p>
</li>
<li>
<p><strong>Temporal.</strong> <a href="https://temporal.io/solutions/ai">Temporal for AI</a>. Overview of Temporal&rsquo;s AI-specific capabilities and customer stories.</p>
</li>
<li>
<p><strong>Wang, W. (2025).</strong> <a href="https://temporal.io/blog/announcing-openai-agents-sdk-integration">Codex and Temporal Integration</a>. Will Wang&rsquo;s public statements on Codex&rsquo;s use of Temporal for core control flows.</p>
</li>
<li>
<p><strong>OpenAI.</strong> <a href="https://openai.com/index/harness-engineering/">Harness Engineering: Leveraging Codex in an Agent-First World</a>. OpenAI engineering blog on the Codex harness architecture.</p>
</li>
<li>
<p><strong>Temporal.</strong> <a href="https://temporal.io/blog/build-durable-ai-agents-pydantic-ai-and-temporal">Build Durable AI Agents with Pydantic AI and Temporal</a>. PydanticAI integration guide.</p>
</li>
<li>
<p><strong>Temporal.</strong> <a href="https://temporal.io/blog/of-course-you-can-build-dynamic-ai-agents-with-temporal">Of Course You Can Build Dynamic AI Agents with Temporal</a>. Temporal&rsquo;s architecture for dynamic AI agent loops.</p>
</li>
<li>
<p><strong>Quo (formerly OpenPhone).</strong> <a href="https://www.quo.com/blog/how-we-built-a-real-time-ai-voice-agent-with-temporal/">How We Built a Real-Time AI Voice Agent with Temporal</a>. Production case study on Temporal primitives for voice agents.</p>
</li>
<li>
<p><strong>Temporal.</strong> <a href="https://temporal.io/blog/announcing-openai-agents-sdk-integration">Production-Ready Agents with the OpenAI Agents SDK + Temporal</a>. OpenAI Agents SDK integration announcement.</p>
</li>
<li>
<p><strong>Temporal.</strong> <a href="https://docs.temporal.io/ai-cookbook/openai-agents-sdk-python">AI Cookbook &ndash; OpenAI Agents SDK</a>. Code examples and patterns for the OpenAI integration.</p>
</li>
<li>
<p><strong>PydanticAI Documentation.</strong> <a href="https://ai.pydantic.dev/durable_execution/temporal/">Temporal Durable Execution</a>. Official PydanticAI guide for Temporal integration.</p>
</li>
<li>
<p><strong>Vanlightly, J.</strong> Explanations of deterministic replay mechanics and the determinism contract in Temporal workflows. Referenced via Temporal community resources.</p>
</li>
<li>
<p><strong>Wang, X., et al. (2025).</strong> <a href="https://arxiv.org/pdf/2511.03690">The OpenHands Software Agent SDK</a>. <em>arXiv preprint arXiv:2511.03690</em>. The predecessor post&rsquo;s primary reference for event sourcing comparison.</p>
</li>
</ol>
]]></content:encoded></item><item><title>From RLHF to GRPO: The RL Techniques That Align Language Models</title><link>https://www.mdjawad.com/posts/rlhf-to-grpo/</link><pubDate>Tue, 17 Feb 2026 10:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/rlhf-to-grpo/</guid><description>How reinforcement learning transforms raw language models into useful assistants — from PPO&amp;rsquo;s four-model pipeline to DPO&amp;rsquo;s elegant shortcut to GRPO&amp;rsquo;s reasoning revolution, with the math that makes each one work.</description><content:encoded><![CDATA[<h2 id="the-gap-between-prediction-and-usefulness">The Gap Between Prediction and Usefulness</h2>
<p>If you have been following this blog, you know how LLMs generate text: attention mechanisms, KV caches, speculative decoding, quantized weights moving through GPU memory hierarchies. We have spent considerable time understanding what happens <em>after</em> the model exists. This post asks a different question: how did the model learn to be useful in the first place?</p>
<p>A pretrained language model is a remarkable thing. It can complete sentences, mimic writing styles, and recite facts absorbed from trillions of tokens of internet text. But it cannot reliably follow instructions. Ask it to summarize an article and it might continue the article instead. Give it a harmful request and it will cheerfully comply rather than refuse. The gap between &ldquo;can predict the next token&rdquo; and &ldquo;can be a helpful assistant&rdquo; is enormous, and closing it is the job of reinforcement learning from human feedback (RLHF) and its descendants.</p>
<p>This post covers three techniques that bridge this gap: <strong>PPO</strong> (Proximal Policy Optimization), the original workhorse that proved RL could align language models; <strong>DPO</strong> (Direct Preference Optimization), an elegant reformulation that eliminates the reward model entirely; and <strong>GRPO</strong> (Group Relative Policy Optimization), the technique behind DeepSeek-R1&rsquo;s reasoning capabilities. Each optimizes the same underlying objective (maximize reward while staying close to a reference policy) but they make fundamentally different engineering trade-offs.</p>
<p>We will not cover pretraining, nor will we survey every RLHF variant (KTO, SimPO, ORPO, and others exist but are beyond our scope). Instead, we will go deep on these three methods: the math, the intuition, the practical trade-offs, and the reasons each one was invented.</p>
<h2 id="where-rl-fits-the-model-training-lifecycle">Where RL Fits: The Model Training Lifecycle</h2>
<p>Before we touch any equations, let&rsquo;s establish context. Training a modern LLM that can follow instructions involves three distinct phases, each with a different objective.</p>
<div class="training-lifecycle-viz" id="tl-869e96d074aa1e4482cf4409ce2006eb">
<style>
  .training-lifecycle-viz {
    --tl-bg: #0d1117;
    --tl-surface: #161b22;
    --tl-border: #30363d;
    --tl-text: #e6edf3;
    --tl-text-muted: #8b949e;
    --tl-arrow: #484f58;

     
    --tl-pretrain: #d29922;
    --tl-pretrain-bg: rgba(210, 153, 34, 0.08);
    --tl-pretrain-border: rgba(210, 153, 34, 0.3);

    --tl-sft: #58a6ff;
    --tl-sft-bg: rgba(88, 166, 255, 0.08);
    --tl-sft-border: rgba(88, 166, 255, 0.3);

    --tl-rl: #39d353;
    --tl-rl-bg: rgba(57, 211, 83, 0.08);
    --tl-rl-border: rgba(57, 211, 83, 0.3);

    --tl-badge-bg: rgba(255, 255, 255, 0.06);

    font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
    max-width: 800px;
    margin: 1.5rem auto;
    padding: 1.5rem;
    background: var(--tl-bg);
    border: 1px solid var(--tl-border);
    border-radius: 12px;
    color: var(--tl-text);
  }

  [data-theme="light"] .training-lifecycle-viz,
  :root:not([data-theme="dark"]) .training-lifecycle-viz {
    --tl-bg: #f8fafc;
    --tl-surface: #ffffff;
    --tl-border: #e2e8f0;
    --tl-text: #1e293b;
    --tl-text-muted: #64748b;
    --tl-arrow: #94a3b8;

    --tl-pretrain: #b45309;
    --tl-pretrain-bg: rgba(180, 83, 9, 0.06);
    --tl-pretrain-border: rgba(180, 83, 9, 0.25);

    --tl-sft: #2563eb;
    --tl-sft-bg: rgba(37, 99, 235, 0.06);
    --tl-sft-border: rgba(37, 99, 235, 0.25);

    --tl-rl: #16a34a;
    --tl-rl-bg: rgba(22, 163, 74, 0.06);
    --tl-rl-border: rgba(22, 163, 74, 0.25);

    --tl-badge-bg: rgba(0, 0, 0, 0.04);
  }

  .tl-pipeline {
    display: flex;
    align-items: stretch;
    gap: 0;
    margin-bottom: 0.75rem;
  }

  .tl-stage {
    flex: 1;
    background: var(--tl-surface);
    border: 1px solid var(--tl-border);
    border-radius: 10px;
    padding: 1.1rem 1rem;
    cursor: default;
    transition: border-color 0.2s ease, background 0.2s ease, transform 0.15s ease;
    position: relative;
    display: flex;
    flex-direction: column;
    align-items: center;
    text-align: center;
  }

  .tl-stage:hover {
    transform: translateY(-2px);
  }

  .tl-stage[data-stage="pretrain"]:hover {
    border-color: var(--tl-pretrain-border);
    background: var(--tl-pretrain-bg);
  }
  .tl-stage[data-stage="sft"]:hover {
    border-color: var(--tl-sft-border);
    background: var(--tl-sft-bg);
  }
  .tl-stage[data-stage="rl"]:hover {
    border-color: var(--tl-rl-border);
    background: var(--tl-rl-bg);
  }

  .tl-phase-label {
    font-family: 'IBM Plex Mono', monospace;
    font-size: 0.65rem;
    font-weight: 600;
    letter-spacing: 0.12em;
    text-transform: uppercase;
    margin-bottom: 0.4rem;
  }

  .tl-stage[data-stage="pretrain"] .tl-phase-label { color: var(--tl-pretrain); }
  .tl-stage[data-stage="sft"] .tl-phase-label { color: var(--tl-sft); }
  .tl-stage[data-stage="rl"] .tl-phase-label { color: var(--tl-rl); }

  .tl-stage-name {
    font-size: 1rem;
    font-weight: 700;
    color: var(--tl-text);
    margin-bottom: 0.5rem;
    line-height: 1.3;
  }

  .tl-input {
    font-size: 0.75rem;
    color: var(--tl-text-muted);
    margin-bottom: 0.5rem;
    line-height: 1.4;
  }

  .tl-outcome {
    font-size: 0.8rem;
    font-weight: 600;
    margin-bottom: 0.5rem;
    line-height: 1.4;
  }

  .tl-stage[data-stage="pretrain"] .tl-outcome { color: var(--tl-pretrain); }
  .tl-stage[data-stage="sft"] .tl-outcome { color: var(--tl-sft); }
  .tl-stage[data-stage="rl"] .tl-outcome { color: var(--tl-rl); }

  .tl-badge {
    font-size: 0.68rem;
    color: var(--tl-text-muted);
    background: var(--tl-badge-bg);
    padding: 0.2rem 0.55rem;
    border-radius: 20px;
    margin-top: auto;
    white-space: nowrap;
  }

  .tl-arrow-container {
    display: flex;
    align-items: center;
    justify-content: center;
    width: 40px;
    flex-shrink: 0;
  }

  .tl-arrow-container svg {
    width: 24px;
    height: 24px;
    color: var(--tl-arrow);
  }

  .tl-detail {
    min-height: 2.4rem;
    padding: 0.6rem 0.8rem;
    font-size: 0.8rem;
    color: var(--tl-text-muted);
    line-height: 1.5;
    text-align: center;
    transition: opacity 0.2s ease;
    opacity: 0;
  }

  .tl-detail.tl-visible {
    opacity: 1;
  }

  .tl-detail-label {
    font-weight: 600;
    margin-right: 0.3rem;
  }

  .tl-stage[data-stage="pretrain"] ~ .tl-detail .tl-detail-label,
  .tl-detail[data-for="pretrain"] .tl-detail-label { color: var(--tl-pretrain); }
  .tl-detail[data-for="sft"] .tl-detail-label { color: var(--tl-sft); }
  .tl-detail[data-for="rl"] .tl-detail-label { color: var(--tl-rl); }

  @media (max-width: 600px) {
    .training-lifecycle-viz {
      padding: 1rem;
    }

    .tl-pipeline {
      flex-direction: column;
      gap: 0;
    }

    .tl-arrow-container {
      width: auto;
      height: 28px;
      transform: rotate(90deg);
    }

    .tl-stage {
      padding: 1rem;
    }

    .tl-stage-name {
      font-size: 0.95rem;
    }
  }
</style>

<div class="tl-pipeline" id="tl-pipeline-869e96d074aa1e4482cf4409ce2006eb">
  <div class="tl-stage" data-stage="pretrain" id="tl-s0-869e96d074aa1e4482cf4409ce2006eb">
    <div class="tl-phase-label">Phase 1</div>
    <div class="tl-stage-name">Pretraining</div>
    <div class="tl-input">Internet text (books, code, web)</div>
    <div class="tl-outcome">Learns language</div>
    <div class="tl-badge">Trillions of tokens</div>
  </div>

  <div class="tl-arrow-container">
    <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2.5" stroke-linecap="round" stroke-linejoin="round">
      <line x1="4" y1="12" x2="20" y2="12"></line>
      <polyline points="14 6 20 12 14 18"></polyline>
    </svg>
  </div>

  <div class="tl-stage" data-stage="sft" id="tl-s1-869e96d074aa1e4482cf4409ce2006eb">
    <div class="tl-phase-label">Phase 2</div>
    <div class="tl-stage-name">Supervised Fine-Tuning</div>
    <div class="tl-input">Human-written demonstrations</div>
    <div class="tl-outcome">Learns format</div>
    <div class="tl-badge">Thousands of examples</div>
  </div>

  <div class="tl-arrow-container">
    <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2.5" stroke-linecap="round" stroke-linejoin="round">
      <line x1="4" y1="12" x2="20" y2="12"></line>
      <polyline points="14 6 20 12 14 18"></polyline>
    </svg>
  </div>

  <div class="tl-stage" data-stage="rl" id="tl-s2-869e96d074aa1e4482cf4409ce2006eb">
    <div class="tl-phase-label">Phase 3</div>
    <div class="tl-stage-name">RL Alignment</div>
    <div class="tl-input">Preferences &amp; rewards</div>
    <div class="tl-outcome">Learns judgment</div>
    <div class="tl-badge">Iterative optimization</div>
  </div>
</div>

<div class="tl-detail" id="tl-detail-869e96d074aa1e4482cf4409ce2006eb">
  <span class="tl-detail-label"></span><span class="tl-detail-text"></span>
</div>

<script>
(function() {
  var uid = '869e96d074aa1e4482cf4409ce2006eb';
  var detail = document.getElementById('tl-detail-' + uid);
  var labelEl = detail.querySelector('.tl-detail-label');
  var textEl = detail.querySelector('.tl-detail-text');

  var descriptions = {
    pretrain: {
      label: 'Pretraining:',
      text: 'Next-token prediction on massive corpora. Produces a powerful text completion engine — knows facts and grammar, but cannot follow instructions.',
      color: 'pretrain'
    },
    sft: {
      label: 'SFT:',
      text: 'Trains on curated (instruction, response) pairs. The model learns the shape of helpful answers — but mimics without understanding why one response is better.',
      color: 'sft'
    },
    rl: {
      label: 'RL Alignment:',
      text: 'The model generates its own responses, receives quality feedback, and improves. This is where PPO, DPO, and GRPO come in — the focus of this post.',
      color: 'rl'
    }
  };

  var stages = document.querySelectorAll('#tl-pipeline-' + uid + ' .tl-stage');

  stages.forEach(function(stage) {
    stage.addEventListener('mouseenter', function() {
      var key = this.getAttribute('data-stage');
      var desc = descriptions[key];
      labelEl.textContent = desc.label;
      textEl.textContent = ' ' + desc.text;
      detail.setAttribute('data-for', desc.color);
      detail.classList.add('tl-visible');
    });

    stage.addEventListener('mouseleave', function() {
      detail.classList.remove('tl-visible');
    });
  });
})();
</script>
</div>

<p><strong>Phase 1: Pretraining.</strong> The model learns language by predicting the next token on a massive corpus: books, articles, code, web pages. This produces a powerful text completion engine that knows facts, grammar, and reasoning patterns but has no concept of a &ldquo;conversation&rdquo; or &ldquo;helpfulness.&rdquo; This phase consumes the vast majority of compute (months on thousands of GPUs) and produces what we call the <em>base model</em>.</p>
<p><strong>Phase 2: Supervised Fine-Tuning (SFT).</strong> Humans write demonstration data: pairs of (instruction, ideal response). The model is trained to reproduce these demonstrations using standard cross-entropy loss:</p>
$$\mathcal{L}_{\text{SFT}} = -\sum_{t} \log \pi_\theta(y_t \mid x, y_{\lt t})$$<p>This is the same next-token prediction objective from pretraining, just applied to curated instruction-response pairs. The model learns the <em>format</em> of helpful responses: how to structure answers, when to use code blocks, how to handle multi-turn conversations. SFT typically requires only thousands of examples and a few hours of training.</p>
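<p>As a minimal sketch of this loss, the Python below sums the negative log-probabilities of the response tokens only. The per-token probabilities and the <code>response_mask</code> that excludes prompt positions are illustrative assumptions, not values or names from any particular training library.</p>

```python
import math

def sft_loss(token_log_probs, response_mask):
    """Negative log-likelihood summed over response tokens.

    token_log_probs[t] stands for log pi_theta(y_t | x, y_<t).
    response_mask[t] is 1 for response tokens and 0 for prompt tokens
    (the mask is an assumption of this sketch).
    """
    return -sum(lp * m for lp, m in zip(token_log_probs, response_mask))

# Hypothetical probabilities the model assigns to each correct next token.
probs = [0.9, 0.5, 0.8, 0.25]
log_probs = [math.log(p) for p in probs]
mask = [0, 1, 1, 1]  # position 0 belongs to the prompt, so it is excluded

loss = sft_loss(log_probs, mask)
```

<p>Because the three response tokens have joint probability $0.5 \times 0.8 \times 0.25 = 0.1$, the loss here is $-\log 0.1$. Raising the probability of any demonstrated token lowers the loss, which is all SFT optimizes for.</p>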
<p>But SFT has a fundamental limitation: it teaches the model to <em>mimic</em> demonstrations without understanding <em>why</em> one response is better than another. The model learns that a particular answer to &ldquo;explain quantum computing&rdquo; was in the training data, but it cannot distinguish between a clear explanation and a subtly misleading one. It learns format, not judgment.</p>
<p><strong>Phase 3: RL-based Alignment.</strong> This is where the techniques in this post come in. Instead of showing the model what to produce, we teach it what <em>better</em> means. The model generates its own responses, receives feedback on quality, and updates its parameters to produce higher-quality outputs. This is reinforcement learning: the model (agent) generates text (actions), receives scores (rewards), and improves its generation strategy (policy).</p>
<p>The SFT model serves double duty here: it initializes the policy we will optimize, <em>and</em> it becomes the frozen reference policy $\pi_{\text{ref}}$ that prevents the RL-trained model from drifting too far from coherent language. This reference is crucial. Without it, the model can find degenerate ways to maximize reward that produce nonsensical text.</p>


<div class="training-pipeline-viz" id="tp-869e96d074aa1e4482cf4409ce2006eb">
  <style>
    .training-pipeline-viz {
      --tp-bg: #0d1117;
      --tp-surface: #161b22;
      --tp-border: #30363d;
      --tp-text: #e6edf3;
      --tp-text-muted: #8b949e;
      --tp-accent: #58a6ff;
      --tp-ppo: #58a6ff;
      --tp-ppo-dim: rgba(88, 166, 255, 0.15);
      --tp-dpo: #39d353;
      --tp-dpo-dim: rgba(57, 211, 83, 0.15);
      --tp-grpo: #a371f7;
      --tp-grpo-dim: rgba(163, 113, 247, 0.15);
      --tp-sft: #d29922;
      --tp-sft-dim: rgba(210, 153, 34, 0.25);
      --tp-node-bg: #21262d;
      --tp-path-inactive: #30363d;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--tp-bg);
      color: var(--tp-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
      max-width: 800px;
      margin-left: auto;
      margin-right: auto;
    }

     
    [data-theme="light"] .training-pipeline-viz,
    :root:not([data-theme="dark"]) .training-pipeline-viz {
      --tp-bg: #f8fafc;
      --tp-surface: #ffffff;
      --tp-border: #e2e8f0;
      --tp-text: #1e293b;
      --tp-text-muted: #64748b;
      --tp-accent: #3b82f6;
      --tp-ppo: #3b82f6;
      --tp-ppo-dim: rgba(59, 130, 246, 0.12);
      --tp-dpo: #10b981;
      --tp-dpo-dim: rgba(16, 185, 129, 0.12);
      --tp-grpo: #8b5cf6;
      --tp-grpo-dim: rgba(139, 92, 246, 0.12);
      --tp-sft: #d97706;
      --tp-sft-dim: rgba(217, 119, 6, 0.15);
      --tp-node-bg: #f1f5f9;
      --tp-path-inactive: #cbd5e1;
    }

    .training-pipeline-viz * {
      box-sizing: border-box;
    }

    .tp-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .tp-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--tp-accent);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .tp-header p {
      color: var(--tp-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .tp-svg-wrap {
      width: 100%;
      overflow-x: auto;
      margin-bottom: 1rem;
    }

    .tp-svg-wrap svg {
      display: block;
      width: 100%;
      height: auto;
    }

     
    .tp-svg-wrap text {
      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
    }

     
    .tp-path-group {
      cursor: pointer;
      transition: opacity 0.3s ease;
    }

    .tp-path-group:hover .tp-path-line {
      filter: brightness(1.3);
    }

    .tp-path-line {
      transition: stroke 0.3s ease, stroke-opacity 0.3s ease, stroke-width 0.3s ease;
    }

    .tp-node-rect {
      transition: fill 0.3s ease, stroke 0.3s ease, filter 0.3s ease;
    }

    .tp-node-text {
      transition: fill 0.3s ease;
    }

     
    .tp-legend {
      display: flex;
      justify-content: center;
      gap: 1.5rem;
      flex-wrap: wrap;
      margin-bottom: 1rem;
    }

    .tp-legend-item {
      display: flex;
      align-items: center;
      gap: 0.4rem;
      font-size: 0.75rem;
      color: var(--tp-text-muted);
      cursor: pointer;
      padding: 0.25rem 0.5rem;
      border-radius: 6px;
      border: 1px solid transparent;
      transition: all 0.25s ease;
    }

    .tp-legend-item:hover {
      border-color: var(--tp-border);
      background: var(--tp-surface);
    }

    .tp-legend-item.active {
      border-color: var(--tp-border);
      background: var(--tp-surface);
    }

    .tp-legend-swatch {
      width: 12px;
      height: 12px;
      border-radius: 3px;
    }

    .tp-legend-swatch.ppo {
      background: var(--tp-ppo);
    }

    .tp-legend-swatch.dpo {
      background: var(--tp-dpo);
    }

    .tp-legend-swatch.grpo {
      background: var(--tp-grpo);
    }

     
    .tp-desc-panel {
      background: var(--tp-surface);
      border: 1px solid var(--tp-border);
      border-radius: 10px;
      padding: 1.25rem;
      min-height: 80px;
      transition: border-color 0.3s ease;
    }

    .tp-desc-panel.ppo-active {
      border-color: var(--tp-ppo);
    }

    .tp-desc-panel.dpo-active {
      border-color: var(--tp-dpo);
    }

    .tp-desc-panel.grpo-active {
      border-color: var(--tp-grpo);
    }

    .tp-desc-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 600;
      text-transform: uppercase;
      letter-spacing: 0.08em;
      margin: 0 0 0.5rem 0;
      transition: color 0.3s ease;
    }

    .tp-desc-title.ppo {
      color: var(--tp-ppo);
    }

    .tp-desc-title.dpo {
      color: var(--tp-dpo);
    }

    .tp-desc-title.grpo {
      color: var(--tp-grpo);
    }

    .tp-desc-title.none {
      color: var(--tp-text-muted);
    }

    .tp-desc-body {
      font-size: 0.85rem;
      color: var(--tp-text);
      line-height: 1.6;
      margin: 0;
    }

    .tp-desc-meta {
      display: flex;
      gap: 1rem;
      margin-top: 0.75rem;
      flex-wrap: wrap;
    }

    .tp-desc-tag {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 500;
      padding: 0.2rem 0.5rem;
      border-radius: 4px;
      background: rgba(128, 128, 128, 0.1);
      color: var(--tp-text-muted);
      border: 1px solid var(--tp-border);
    }

     
    @media (max-width: 600px) {
      .training-pipeline-viz {
        padding: 1rem;
      }

      .tp-header h3 {
        font-size: 0.75rem;
      }

      .tp-header p {
        font-size: 0.8rem;
      }

      .tp-legend {
        gap: 0.75rem;
      }

      .tp-legend-item {
        font-size: 0.7rem;
      }

      .tp-desc-panel {
        padding: 1rem;
      }

      .tp-desc-body {
        font-size: 0.8rem;
      }

      .tp-desc-meta {
        gap: 0.5rem;
      }

      .tp-desc-tag {
        font-size: 0.65rem;
      }
    }
  </style>

  <div class="tp-header">
    <h3>The Model Training Lifecycle</h3>
    <p>Click a path to explore each alignment approach</p>
  </div>

  
  <div class="tp-legend">
    <div class="tp-legend-item active" id="tpLegPpo-869e96d074aa1e4482cf4409ce2006eb" data-path="ppo">
      <div class="tp-legend-swatch ppo"></div>
      <span>Path A: PPO</span>
    </div>
    <div class="tp-legend-item active" id="tpLegDpo-869e96d074aa1e4482cf4409ce2006eb" data-path="dpo">
      <div class="tp-legend-swatch dpo"></div>
      <span>Path B: DPO</span>
    </div>
    <div class="tp-legend-item active" id="tpLegGrpo-869e96d074aa1e4482cf4409ce2006eb" data-path="grpo">
      <div class="tp-legend-swatch grpo"></div>
      <span>Path C: GRPO</span>
    </div>
  </div>

  
  <div class="tp-svg-wrap">
    <svg id="tpSvg-869e96d074aa1e4482cf4409ce2006eb" viewBox="0 0 760 340" xmlns="http://www.w3.org/2000/svg">
      <defs>
        
        <marker id="arrowPpo-869e96d074aa1e4482cf4409ce2006eb" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
          <path d="M0,0 L8,3 L0,6 Z" fill="var(--tp-ppo)" />
        </marker>
        <marker id="arrowDpo-869e96d074aa1e4482cf4409ce2006eb" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
          <path d="M0,0 L8,3 L0,6 Z" fill="var(--tp-dpo)" />
        </marker>
        <marker id="arrowGrpo-869e96d074aa1e4482cf4409ce2006eb" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
          <path d="M0,0 L8,3 L0,6 Z" fill="var(--tp-grpo)" />
        </marker>
        <marker id="arrowMuted-869e96d074aa1e4482cf4409ce2006eb" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
          <path d="M0,0 L8,3 L0,6 Z" fill="var(--tp-path-inactive)" />
        </marker>

        
        <filter id="glowPpo-869e96d074aa1e4482cf4409ce2006eb">
          <feGaussianBlur stdDeviation="3" result="blur" />
          <feMerge><feMergeNode in="blur" /><feMergeNode in="SourceGraphic" /></feMerge>
        </filter>
        <filter id="glowDpo-869e96d074aa1e4482cf4409ce2006eb">
          <feGaussianBlur stdDeviation="3" result="blur" />
          <feMerge><feMergeNode in="blur" /><feMergeNode in="SourceGraphic" /></feMerge>
        </filter>
        <filter id="glowGrpo-869e96d074aa1e4482cf4409ce2006eb">
          <feGaussianBlur stdDeviation="3" result="blur" />
          <feMerge><feMergeNode in="blur" /><feMergeNode in="SourceGraphic" /></feMerge>
        </filter>
      </defs>

      
      <g id="tpSft-869e96d074aa1e4482cf4409ce2006eb">
        <rect x="20" y="140" width="110" height="56" rx="8"
              fill="var(--tp-sft-dim)" stroke="var(--tp-sft)" stroke-width="2" />
        <text x="75" y="163" text-anchor="middle"
              fill="var(--tp-sft)" font-size="13" font-weight="700">SFT</text>
        <text x="75" y="181" text-anchor="middle"
              fill="var(--tp-text-muted)" font-size="9">Supervised Fine-Tune</text>
      </g>

      
      <g class="tp-path-group" id="tpPathPpo-869e96d074aa1e4482cf4409ce2006eb" data-path="ppo">
        
        <line class="tp-path-line" x1="130" y1="160" x2="200" y2="85"
              stroke="var(--tp-ppo)" stroke-width="2.5" stroke-linecap="round"
              marker-end="url(#arrowPpo-869e96d074aa1e4482cf4409ce2006eb)" />

        
        <rect class="tp-node-rect" x="208" y="52" width="130" height="56" rx="8"
              fill="var(--tp-ppo-dim)" stroke="var(--tp-ppo)" stroke-width="1.5" />
        <text class="tp-node-text" x="273" y="74" text-anchor="middle"
              fill="var(--tp-ppo)" font-size="12" font-weight="700">Reward Model</text>
        <text class="tp-node-text" x="273" y="92" text-anchor="middle"
              fill="var(--tp-text-muted)" font-size="9">Human preferences</text>

        
        <line class="tp-path-line" x1="338" y1="80" x2="408" y2="80"
              stroke="var(--tp-ppo)" stroke-width="2.5" stroke-linecap="round"
              marker-end="url(#arrowPpo-869e96d074aa1e4482cf4409ce2006eb)" />

        
        <rect class="tp-node-rect" x="416" y="42" width="100" height="76" rx="8"
              fill="var(--tp-ppo-dim)" stroke="var(--tp-ppo)" stroke-width="2" />
        <text class="tp-node-text" x="466" y="65" text-anchor="middle"
              fill="var(--tp-ppo)" font-size="14" font-weight="700">PPO</text>
        <text class="tp-node-text" x="466" y="82" text-anchor="middle"
              fill="var(--tp-text-muted)" font-size="8.5">Policy + Reference</text>
        <text class="tp-node-text" x="466" y="94" text-anchor="middle"
              fill="var(--tp-text-muted)" font-size="8.5">Reward + Critic</text>

        
        <rect x="530" y="56" width="65" height="24" rx="12"
              fill="var(--tp-ppo)" fill-opacity="0.15" stroke="var(--tp-ppo)" stroke-width="1" />
        <text x="562" y="72" text-anchor="middle"
              fill="var(--tp-ppo)" font-size="10" font-weight="600">4 models</text>

        
        <text x="620" y="74" text-anchor="start"
              fill="var(--tp-ppo)" font-size="10" font-weight="600" opacity="0.7">PATH A</text>
      </g>

      
      <g class="tp-path-group" id="tpPathDpo-869e96d074aa1e4482cf4409ce2006eb" data-path="dpo">
        
        <line class="tp-path-line" x1="130" y1="168" x2="248" y2="168"
              stroke="var(--tp-dpo)" stroke-width="2.5" stroke-linecap="round"
              marker-end="url(#arrowDpo-869e96d074aa1e4482cf4409ce2006eb)" />

        
        <rect class="tp-node-rect" x="280" y="126" width="110" height="28" rx="6"
              fill="var(--tp-dpo-dim)" stroke="var(--tp-dpo)" stroke-width="1" stroke-dasharray="4,3" />
        <text class="tp-node-text" x="335" y="144" text-anchor="middle"
              fill="var(--tp-dpo)" font-size="9" font-weight="600">Preference Data</text>

        
        <line class="tp-path-line" x1="335" y1="154" x2="335" y2="165"
              stroke="var(--tp-dpo)" stroke-width="1.5" stroke-linecap="round" stroke-dasharray="3,2" />

        
        <rect class="tp-node-rect" x="256" y="168" width="100" height="56" rx="8"
              fill="var(--tp-dpo-dim)" stroke="var(--tp-dpo)" stroke-width="2" />
        <text class="tp-node-text" x="306" y="191" text-anchor="middle"
              fill="var(--tp-dpo)" font-size="14" font-weight="700">DPO</text>
        <text class="tp-node-text" x="306" y="207" text-anchor="middle"
              fill="var(--tp-text-muted)" font-size="8.5">Policy + Reference</text>

        
        <rect x="370" y="180" width="65" height="24" rx="12"
              fill="var(--tp-dpo)" fill-opacity="0.15" stroke="var(--tp-dpo)" stroke-width="1" />
        <text x="402" y="196" text-anchor="middle"
              fill="var(--tp-dpo)" font-size="10" font-weight="600">2 models</text>

        
        <text x="456" y="196" text-anchor="start"
              fill="var(--tp-dpo)" font-size="10" font-weight="600" opacity="0.7">PATH B</text>
      </g>

      
      <g class="tp-path-group" id="tpPathGrpo-869e96d074aa1e4482cf4409ce2006eb" data-path="grpo">
        
        <line class="tp-path-line" x1="130" y1="176" x2="200" y2="280"
              stroke="var(--tp-grpo)" stroke-width="2.5" stroke-linecap="round"
              marker-end="url(#arrowGrpo-869e96d074aa1e4482cf4409ce2006eb)" />

        
        <rect class="tp-node-rect" x="208" y="255" width="100" height="56" rx="8"
              fill="var(--tp-grpo-dim)" stroke="var(--tp-grpo)" stroke-width="2" />
        <text class="tp-node-text" x="258" y="278" text-anchor="middle"
              fill="var(--tp-grpo)" font-size="14" font-weight="700">GRPO</text>
        <text class="tp-node-text" x="258" y="294" text-anchor="middle"
              fill="var(--tp-text-muted)" font-size="8.5">Policy + Reference</text>

        
        <rect class="tp-node-rect" x="330" y="260" width="120" height="28" rx="6"
              fill="var(--tp-grpo-dim)" stroke="var(--tp-grpo)" stroke-width="1" stroke-dasharray="4,3" />
        <text class="tp-node-text" x="390" y="278" text-anchor="middle"
              fill="var(--tp-grpo)" font-size="9" font-weight="600">Verifiable Rewards</text>

        
        <line class="tp-path-line" x1="330" y1="278" x2="308" y2="280"
              stroke="var(--tp-grpo)" stroke-width="1.5" stroke-linecap="round" stroke-dasharray="3,2" />

        
        <rect x="330" y="296" width="65" height="24" rx="12"
              fill="var(--tp-grpo)" fill-opacity="0.15" stroke="var(--tp-grpo)" stroke-width="1" />
        <text x="362" y="312" text-anchor="middle"
              fill="var(--tp-grpo)" font-size="10" font-weight="600">2 models</text>

        
        <text x="416" y="312" text-anchor="start"
              fill="var(--tp-grpo)" font-size="10" font-weight="600" opacity="0.7">PATH C</text>
      </g>
    </svg>
  </div>

  
  <div class="tp-desc-panel" id="tpDesc-869e96d074aa1e4482cf4409ce2006eb">
    <h4 class="tp-desc-title none" id="tpDescTitle-869e96d074aa1e4482cf4409ce2006eb">Select a Path</h4>
    <p class="tp-desc-body" id="tpDescBody-869e96d074aa1e4482cf4409ce2006eb">Click any of the three training paths above to see how it works, what models are required, and the key trade-offs involved.</p>
    <div class="tp-desc-meta" id="tpDescMeta-869e96d074aa1e4482cf4409ce2006eb"></div>
  </div>

  <script>
  (function() {
    var uid = '869e96d074aa1e4482cf4409ce2006eb';

    var paths = {
      ppo: {
        title: 'Path A: Proximal Policy Optimization (PPO)',
        body: 'The full RLHF pipeline. Trains a reward model on human preferences, then optimizes the policy using PPO with a learned critic. Four models in memory: policy, reference, reward model, and critic. Highest ceiling, highest cost.',
        tags: ['4 models in VRAM', 'Online RL', 'Learned reward', 'Highest cost'],
        cssClass: 'ppo'
      },
      dpo: {
        title: 'Path B: Direct Preference Optimization (DPO)',
        body: 'Skips the reward model entirely. Extracts reward signal from preference data via a closed-form reparameterization. Two models in memory: policy and reference. Simple, stable, but offline \u2014 no exploration.',
        tags: ['2 models in VRAM', 'Offline', 'Implicit reward', 'Low cost'],
        cssClass: 'dpo'
      },
      grpo: {
        title: 'Path C: Group Relative Policy Optimization (GRPO)',
        body: 'Keeps online RL but eliminates the critic. Estimates advantages from group statistics. Pairs naturally with verifiable rewards (math, code). Two models plus a rule-based reward function.',
        tags: ['2 models in VRAM', 'Online RL', 'Rule-based reward', 'Medium cost'],
        cssClass: 'grpo'
      }
    };

    var activePath = null;

    var descPanel = document.getElementById('tpDesc-' + uid);
    var descTitle = document.getElementById('tpDescTitle-' + uid);
    var descBody = document.getElementById('tpDescBody-' + uid);
    var descMeta = document.getElementById('tpDescMeta-' + uid);

    var pathGroups = {
      ppo: document.getElementById('tpPathPpo-' + uid),
      dpo: document.getElementById('tpPathDpo-' + uid),
      grpo: document.getElementById('tpPathGrpo-' + uid)
    };

    var legendItems = {
      ppo: document.getElementById('tpLegPpo-' + uid),
      dpo: document.getElementById('tpLegDpo-' + uid),
      grpo: document.getElementById('tpLegGrpo-' + uid)
    };

    function setAllPathsOpacity(opacity) {
      Object.keys(pathGroups).forEach(function(key) {
        pathGroups[key].style.opacity = opacity;
      });
    }

    function highlightPath(key) {
      if (activePath === key) {
        
        activePath = null;
        setAllPathsOpacity('1');

        
        Object.keys(pathGroups).forEach(function(k) {
          pathGroups[k].style.filter = '';
        });

        
        Object.keys(legendItems).forEach(function(k) {
          legendItems[k].classList.add('active');
        });

        
        descPanel.className = 'tp-desc-panel';
        descTitle.className = 'tp-desc-title none';
        descTitle.textContent = 'Select a Path';
        descBody.textContent = 'Click any of the three training paths above to see how it works, what models are required, and the key trade-offs involved.';
        descMeta.innerHTML = '';
        return;
      }

      activePath = key;
      var info = paths[key];

      
      Object.keys(pathGroups).forEach(function(k) {
        if (k === key) {
          pathGroups[k].style.opacity = '1';
          pathGroups[k].style.filter = 'url(#glow' + k.charAt(0).toUpperCase() + k.slice(1) + '-' + uid + ')';
        } else {
          pathGroups[k].style.opacity = '0.2';
          pathGroups[k].style.filter = '';
        }
      });

      
      Object.keys(legendItems).forEach(function(k) {
        if (k === key) {
          legendItems[k].classList.add('active');
        } else {
          legendItems[k].classList.remove('active');
        }
      });

      
      descPanel.className = 'tp-desc-panel ' + info.cssClass + '-active';
      descTitle.className = 'tp-desc-title ' + info.cssClass;
      descTitle.textContent = info.title;
      descBody.textContent = info.body;

      
      descMeta.innerHTML = '';
      info.tags.forEach(function(tag) {
        var span = document.createElement('span');
        span.className = 'tp-desc-tag';
        span.textContent = tag;
        descMeta.appendChild(span);
      });
    }

    
    Object.keys(pathGroups).forEach(function(key) {
      pathGroups[key].addEventListener('click', function(e) {
        e.stopPropagation();
        highlightPath(key);
      });
    });

    
    Object.keys(legendItems).forEach(function(key) {
      legendItems[key].addEventListener('click', function(e) {
        e.stopPropagation();
        highlightPath(key);
      });
    });
  })();
  </script>
</div>

<p>The three methods we will examine differ in how they implement Phase 3. PPO trains a separate reward model, then runs RL against it. DPO skips the reward model by extracting the reward signal directly from preference data. GRPO replaces learned value estimates with group statistics and pairs naturally with verifiable rewards. Let&rsquo;s start with the foundation they all share: the reward signal.</p>
<h2 id="the-reward-signal-bradley-terry-and-what-makes-it-work">The Reward Signal: Bradley-Terry and What Makes It Work</h2>
<p>The fundamental challenge of aligning language models is this: we cannot write a reward function for &ldquo;helpfulness.&rdquo; Unlike game-playing AI where the score is clearly defined, the quality of a text response is subjective, contextual, and multidimensional. But humans <em>can</em> do something simpler: given two responses to the same prompt, they can usually say which one is better.</p>
<p>This observation is the foundation of RLHF. Collect pairwise comparisons, then train a model to predict which response humans prefer. The mathematical framework for this is the <strong><a href="https://www.youtube.com/watch?v=dg11OwdL3qs">Bradley-Terry model</a></strong>, introduced in 1952 for ranking items from pairwise comparisons, the same kind of model behind chess-style rating systems:</p>
<div style="text-align: center; margin: 1.5rem 0 0.5rem; font-size: 1.15em; line-height: 2.4;">
  <span class="eq-tip" data-tip="Probability that response y_w is preferred over y_l, given prompt x">$P(y_w \succ y_l \mid x)$</span>
  <span>&thinsp;</span>
  <span>$=$</span>
  <span>&thinsp;</span>
  <span class="eq-tip" data-tip="Sigmoid function — maps any real number to the range (0, 1)">$\sigma\!\big($</span>
  <span class="eq-tip" data-tip="Reward model's score for the preferred (winning) response">$r_\theta(x, y_w)$</span>
  <span>$\,-\,$</span>
  <span class="eq-tip" data-tip="Reward model's score for the dispreferred (losing) response">$r_\theta(x, y_l)$</span>
  <span>$\big)$</span>
</div>
<div style="text-align: center; margin: 0 0 1.5rem; font-size: 0.72rem; color: #8b949e; font-family: 'IBM Plex Sans', sans-serif; letter-spacing: 0.02em;">hover over equation components to explore</div>

<p>The elegant property: <strong>only the difference in rewards matters.</strong> Adding a constant to all reward scores leaves preferences unchanged. This means the reward model only needs to learn a relative ranking, not absolute quality scores.</p>
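<p>The shift invariance is easy to check numerically. The sketch below uses made-up reward scores; <code>preference_prob</code> is a hypothetical helper name, not part of any library.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_prob(r_w, r_l):
    """Bradley-Terry probability that the response scored r_w beats r_l."""
    return sigmoid(r_w - r_l)

# Made-up reward scores for two responses to the same prompt.
p = preference_prob(2.0, 0.5)

# Adding the same constant to both scores leaves the probability unchanged.
p_shifted = preference_prob(2.0 + 100.0, 0.5 + 100.0)

# Swapping the roles gives the complementary probability.
p_reversed = preference_prob(0.5, 2.0)
```

<p>With a reward gap of 1.5, the preferred response wins with probability of about 0.82, whether the raw scores are (2.0, 0.5) or (102.0, 100.5).</p>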
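<p>We train this reward model by maximizing the log-likelihood of observed human preferences, which is the same as minimizing the negative log-likelihood of each comparison, averaged over the preference dataset:</p>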
$$\mathcal{L}_{\text{RM}} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$<p>This is <a href="https://en.wikipedia.org/wiki/Cross-entropy#Cross-entropy_loss_function_and_logistic_regression">binary cross-entropy</a>: we are training a classifier that says &ldquo;response A is better than response B.&rdquo; Architecturally, the reward model is typically the same transformer as the language model, with the language modeling head replaced by a single linear layer that maps the final hidden state to a scalar reward.</p>
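<p>A rough sketch of both pieces, the scalar head and the pairwise loss, might look like the following. The three-dimensional hidden states and weights are toy values chosen for illustration, and <code>reward_head</code>, <code>rm_loss</code>, and <code>batch_rm_loss</code> are hypothetical names. The loss is written in an overflow-safe form of $-\log \sigma(d)$.</p>

```python
import math

def reward_head(hidden_state, weights, bias=0.0):
    """Hypothetical scalar head: one linear layer over the final hidden state."""
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias

def rm_loss(r_w, r_l):
    """-log sigmoid(r_w - r_l), rearranged so exp() never overflows."""
    d = r_w - r_l
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

def batch_rm_loss(pairs):
    """Mean loss over a list of (r_w, r_l) reward pairs."""
    return sum(rm_loss(r_w, r_l) for r_w, r_l in pairs) / len(pairs)

# Toy hidden states for a chosen and a rejected response (assumed values).
weights = [0.5, -1.0, 2.0]
h_chosen = [1.0, 0.0, 1.0]
h_rejected = [1.0, 1.0, 0.0]

r_w = reward_head(h_chosen, weights)    # 2.5
r_l = reward_head(h_rejected, weights)  # -0.5

loss_correct = rm_loss(r_w, r_l)  # model ranks the pair the way humans did
loss_wrong = rm_loss(r_l, r_w)    # model ranks the pair the opposite way
```

<p>The loss is small when the reward gap agrees with the human label and grows roughly linearly with the gap when it disagrees, so confidently wrong rankings are penalized most.</p>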
<p>In practice, InstructGPT used a 6B parameter reward model to guide a 175B policy, trained on approximately 33,000 prompts with 4-9 ranked completions each. The reward model is trained for only a single epoch to avoid overfitting to the preference data, a detail that matters more than it might seem.</p>
<p>With a trained reward model in hand, we can now define what &ldquo;better&rdquo; means mathematically. The question becomes: how do we actually optimize the language model to produce higher-reward outputs?</p>
<h2 id="ppo-the-four-model-pipeline">PPO: The Four-Model Pipeline</h2>
<h3 id="the-rlhf-objective">The RLHF Objective</h3>
<p>The goal of RLHF is captured in a single objective. Let&rsquo;s walk through it symbol by symbol:</p>
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[r_\phi(x, y)\big] - \beta \cdot D_{\text{KL}}\big(\pi_\theta \| \pi_{\text{ref}}\big)$$<p>Reading left to right: $\max_{\pi_\theta}$ means &ldquo;find the policy parameters that maximize the following expression.&rdquo; $\mathbb{E}$ is the expected value, averaging over many prompts and responses. $x \sim \mathcal{D}$ means prompts are drawn from the training distribution. $y \sim \pi_\theta(\cdot \mid x)$ means responses are <em>sampled from the current policy</em>, not taken from a fixed dataset. $r_\phi(x, y)$ is the reward model&rsquo;s score. $\beta$ is a coefficient controlling constraint strength. $D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ is the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a> measuring how far the policy has drifted from the reference.</p>
<p>This objective contains two opposing forces. The first term pushes toward human-preferred outputs: maximize the expected reward. The second term, the KL divergence penalty, is the guardrail that prevents <em>reward hacking</em>.</p>
<p>Reward hacking is not a theoretical concern. Without the KL constraint, models learn to game the reward model: they produce longer responses (reward models often prefer length), use confident language and bullet-point formatting (which correlates with higher human ratings), and can even produce convincing fabrications that fool the reward evaluator. Wen et al. (2024) showed that RLHF can raise human approval ratings <em>without</em> improving actual correctness, because the model gets better at convincing evaluators rather than at the task. The KL penalty keeps the optimized policy close enough to the reference that these degenerate strategies remain unlikely.</p>
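<p>To make the tension between the two terms concrete, here is a toy sketch that evaluates the objective exactly for a single prompt with three candidate responses. The reward values, the two candidate policies, and the $\beta$ settings are all invented for illustration; real training estimates these quantities from samples rather than enumerating responses.</p>

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same responses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(policy, reference, rewards, beta):
    """Expected reward minus beta times the KL drift from the reference."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.3, 0.2]   # stand-in for pi_ref over three responses
rewards = [0.0, 1.0, 3.0]     # made-up reward model scores

concentrated = [0.01, 0.01, 0.98]  # nearly all mass on the top-reward response
moderate = [0.3, 0.3, 0.4]         # a small shift toward higher reward

weak = {name: rlhf_objective(p, reference, rewards, beta=0.1)
        for name, p in [("concentrated", concentrated), ("moderate", moderate)]}
strong = {name: rlhf_objective(p, reference, rewards, beta=2.0)
          for name, p in [("concentrated", concentrated), ("moderate", moderate)]}
```

<p>With $\beta = 0.1$ the concentrated policy scores higher, because the reward term dominates. With $\beta = 2.0$ the ordering flips and the moderate policy wins, since drifting far from the reference now costs more than the extra reward is worth.</p>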
<h3 id="the-ppo-clipped-surrogate">The PPO Clipped Surrogate</h3>
<p>The RLHF objective tells us <em>what</em> to optimize. PPO tells us <em>how</em>. The challenge is that <a href="https://www.youtube.com/watch?v=27j1Cn8AECs">policy gradient methods</a> are notoriously unstable. A single large update can destroy the policy, and recovery is difficult. PPO solves this with a clipping mechanism that limits how much any single update can change the policy.</p>
<p>First, we define the probability ratio, how much the policy&rsquo;s opinion of a particular token has changed:</p>
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$<p>When $r_t = 1.0$, the policy hasn&rsquo;t changed its probability for this token. When $r_t = 1.5$, the token is 50% more likely under the new policy. When $r_t = 0.6$, it is 40% less likely. The ratio tells us the direction and magnitude of the policy shift.</p>
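<p>In practice the ratio is usually computed from log-probabilities rather than raw probabilities. The sketch below reproduces the three readings above; the helper name <code>probability_ratio</code> and the old-policy probability of 0.10 are assumptions made for the example.</p>

```python
import math


def probability_ratio(logp_new, logp_old):
    """r_t = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), from log-probs."""
    return math.exp(logp_new - logp_old)


# Hypothetical: the old policy assigned probability 0.10 to this token.
logp_old = math.log(0.10)

unchanged = probability_ratio(math.log(0.10), logp_old)    # same probability
more_likely = probability_ratio(math.log(0.15), logp_old)  # 50% more likely
less_likely = probability_ratio(math.log(0.06), logp_old)  # 40% less likely
print(unchanged, more_likely, less_likely)
```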
<p>The PPO clipped surrogate objective is:</p>
$$\mathcal{L}^{\text{CLIP}} = \mathbb{E}_t \left[\min\Big(r_t(\theta) \cdot \hat{A}_t,\; \text{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big) \cdot \hat{A}_t\Big)\right]$$<p>Here $\hat{A}_t$ is the <strong>advantage estimate</strong>: how much better (positive) or worse (negative) this token was compared to the expected baseline. Concretely, a critic network $V_\psi(s)$ estimates the expected future reward from each state; the advantage is the difference between the actual return and this estimate. If the model produced a token that led to higher reward than expected, $\hat{A}_t > 0$ (&ldquo;good token, do more of this&rdquo;); if the reward was lower than expected, $\hat{A}_t < 0$ (&ldquo;bad token, do less of this&rdquo;). The advantage is what tells PPO <em>which direction</em> to push; the clipping mechanism controls <em>how far</em>.</p>
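<p>The simplest form of the advantage described above, actual return minus the critic&rsquo;s estimate, can be sketched as follows. The function name and the numbers are assumptions for illustration; production PPO implementations commonly use a smoothed estimator such as GAE instead of this plain difference.</p>

```python
# Sketch: advantage as "actual return minus critic estimate".
# Returns and critic values below are invented for the example.

def simple_advantages(returns, values):
    """A_t = G_t - V(s_t) for each token position t."""
    return [g - v for g, v in zip(returns, values)]


returns = [1.0, 0.2, 0.8]        # what actually happened
critic_values = [0.5, 0.6, 0.8]  # what the critic expected

adv = simple_advantages(returns, critic_values)
for a in adv:
    if a > 0:
        print(f"{a:+.2f}  better than expected: do more of this")
    elif a == 0:
        print(f"{a:+.2f}  exactly as expected: no push")
    else:
        print(f"{a:+.2f}  worse than expected: do less of this")
```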
<p>The clip function is a simple three-case clamp: if $r_t < 1-\varepsilon$, return $1-\varepsilon$; if $r_t > 1+\varepsilon$, return $1+\varepsilon$; otherwise return $r_t$ unchanged. With the standard $\varepsilon = 0.2$, the clipped ratio always lies in the range $[0.8, 1.2]$, whatever the raw ratio is.</p>
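<p>Written out as code, the three cases look like this. The function name <code>clip_ratio</code> is an assumption for the example.</p>

```python
def clip_ratio(r, eps=0.2):
    """Clamp the probability ratio r into [1 - eps, 1 + eps]."""
    lower, upper = 1.0 - eps, 1.0 + eps
    if r < lower:
        return lower   # ratio fell too far: pin to 1 - eps
    if r > upper:
        return upper   # ratio rose too far: pin to 1 + eps
    return r           # inside the range: unchanged


below = clip_ratio(0.5)
inside = clip_ratio(1.1)
above = clip_ratio(2.0)
print(below, inside, above)
```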
<p>The behavior of this objective follows a 2×2 matrix that is worth internalizing:</p>
<table>
  <thead>
      <tr>
          <th></th>
          <th>Advantage $\hat{A}_t > 0$ (good token)</th>
          <th>Advantage $\hat{A}_t < 0$ (bad token)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Policy increases probability</strong> ($r_t > 1$)</td>
          <td>Clip activates at $1+\varepsilon$. Caps how aggressively we reinforce.</td>
          <td>No clipping. Full gradient to suppress this token.</td>
      </tr>
      <tr>
          <td><strong>Policy decreases probability</strong> ($r_t < 1$)</td>
          <td>No clipping. Full gradient to reinforce this token.</td>
          <td>Clip activates at $1-\varepsilon$. Caps how aggressively we suppress.</td>
      </tr>
  </tbody>
</table>
<p>The pattern reveals something important: <strong>clipping only constrains the policy when it is already moving in the right direction too aggressively.</strong> When the policy increases probability for a good token (top-left), clipping says &ldquo;that&rsquo;s enough reinforcement for one update.&rdquo; When it decreases probability for a bad token (bottom-right), clipping says &ldquo;that&rsquo;s enough suppression.&rdquo; But when the policy is moving in the <em>wrong</em> direction — decreasing a good token or increasing a bad one — the full gradient signal flows through. Clipping never protects wrong moves.</p>
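<p>The four cells of the matrix can be checked mechanically. In this sketch, clipping counts as &ldquo;active&rdquo; when the $\min$ selects the clipped term, which happens exactly when the clipped term is strictly lower than the unclipped one. The helper names and the test ratios 1.5 and 0.5 are assumptions for the example.</p>

```python
def clip_ratio(r, eps):
    return max(1.0 - eps, min(1.0 + eps, r))


def clip_is_active(r, adv, eps=0.2):
    """True when min(...) picks the clipped term, so the objective is flat in r."""
    unclipped = r * adv
    clipped = clip_ratio(r, eps) * adv
    return clipped < unclipped


cases = {
    ("r > 1", "A > 0"): clip_is_active(1.5, +1.0),  # reinforcing a good token
    ("r > 1", "A < 0"): clip_is_active(1.5, -1.0),  # reinforcing a bad token
    ("r < 1", "A > 0"): clip_is_active(0.5, +1.0),  # suppressing a good token
    ("r < 1", "A < 0"): clip_is_active(0.5, -1.0),  # suppressing a bad token
}
for (ratio_case, adv_case), active in cases.items():
    print(ratio_case, adv_case, "clip active" if active else "full gradient")
```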
<p>Why $\min$ and not $\max$? The $\min$ operator takes the <em>pessimistic</em> bound. If the clipped version yields a lower objective than the unclipped version, we take the clipped (lower) one, preventing overconfident updates. If the unclipped version is already lower (meaning the policy moved in a harmful direction), we take that instead, allowing the full corrective gradient.</p>
<p>Let&rsquo;s trace through concrete numbers. With $\hat{A}_t = +2.0$ (a good token) and $\varepsilon = 0.2$:</p>
<ul>
<li>At $r_t = 1.1$: Unclipped = $1.1 \times 2.0 = 2.2$. Clipped = $1.1 \times 2.0 = 2.2$ (within bounds). $\min = 2.2$. Full gradient.</li>
<li>At $r_t = 1.3$: Unclipped = $1.3 \times 2.0 = 2.6$. Clipped = $1.2 \times 2.0 = 2.4$ (capped at $1+\varepsilon$). $\min = 2.4$. The ratio is past the clip boundary, so the gradient is zero.</li>
<li>At $r_t = 2.0$: Unclipped = $2.0 \times 2.0 = 4.0$. Clipped = $1.2 \times 2.0 = 2.4$. $\min = 2.4$. The objective plateaus. No matter how much more likely this token becomes, the objective stays at 2.4 and the gradient stays at zero.</li>
</ul>
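<p>The three traces above can be reproduced with a few lines of code. This is a per-token sketch of the surrogate, not a training loop, and the function name <code>ppo_clip_objective</code> is an assumption for the example.</p>

```python
def ppo_clip_objective(r, adv, eps=0.2):
    """Per-token clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = r * adv
    clipped = max(1.0 - eps, min(1.0 + eps, r)) * adv
    return min(unclipped, clipped)


# Same setting as the worked example: advantage +2.0, eps 0.2.
results = {r: ppo_clip_objective(r, adv=2.0) for r in (1.1, 1.3, 2.0)}
for r, value in results.items():
    print(f"r_t = {r}: objective = {value:.1f}")
```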
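<p>This plateau is the key mechanism. The objective becomes flat beyond the clip boundary, which means the gradient is zero, so the optimizer receives no signal to push the ratio further. Clipping is an incentive rather than a hard limit: a ratio can still drift past $1 \pm \varepsilon$ during an update, but once it does, that token stops contributing gradient in the direction it was already being pushed. With $\varepsilon = 0.2$, this keeps each token&rsquo;s probability within roughly $\pm 20\%$ of its old value per update, which is what stabilizes training.</p>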
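<p>A finite-difference check makes the flat region visible. The sketch below estimates the slope of the per-token objective with respect to the ratio at three points; the helper names and the step size <code>h</code> are assumptions for the example, and a real implementation would rely on automatic differentiation instead.</p>

```python
def ppo_clip_objective(r, adv, eps=0.2):
    unclipped = r * adv
    clipped = max(1.0 - eps, min(1.0 + eps, r)) * adv
    return min(unclipped, clipped)


def slope_wrt_ratio(r, adv, eps=0.2, h=1e-6):
    """Central finite difference of the objective with respect to r."""
    up = ppo_clip_objective(r + h, adv, eps)
    down = ppo_clip_objective(r - h, adv, eps)
    return (up - down) / (2 * h)


inside = slope_wrt_ratio(1.1, adv=2.0)     # within [1 - eps, 1 + eps]
plateau = slope_wrt_ratio(1.5, adv=2.0)    # past 1 + eps with a good token
wrong_way = slope_wrt_ratio(0.5, adv=2.0)  # good token made less likely
print(inside, plateau, wrong_way)
```

<p>The slope equals the advantage inside the clip range and when the policy has moved the wrong way, and it vanishes on the plateau.</p>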


<div class="ppo-clip-viz" id="ppo-clip-869e96d074aa1e4482cf4409ce2006eb">
  <style>
    .ppo-clip-viz {
      --pc-bg: #0d1117;
      --pc-surface: #161b22;
      --pc-border: #30363d;
      --pc-text: #e6edf3;
      --pc-text-muted: #8b949e;
      --pc-accent: #58a6ff;
      --pc-unclipped: #8b949e;
      --pc-clipped: #58a6ff;
      --pc-positive: #39d353;
      --pc-negative: #f97583;
      --pc-trust-shade: rgba(88, 166, 255, 0.06);
      --pc-crosshair: rgba(88, 166, 255, 0.4);

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--pc-bg);
      color: var(--pc-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .ppo-clip-viz,
    :root:not([data-theme="dark"]) .ppo-clip-viz {
      --pc-bg: #f8fafc;
      --pc-surface: #ffffff;
      --pc-border: #e2e8f0;
      --pc-text: #1e293b;
      --pc-text-muted: #64748b;
      --pc-accent: #3b82f6;
      --pc-unclipped: #94a3b8;
      --pc-clipped: #3b82f6;
      --pc-positive: #10b981;
      --pc-negative: #ef4444;
      --pc-trust-shade: rgba(59, 130, 246, 0.06);
      --pc-crosshair: rgba(59, 130, 246, 0.4);
    }

    .ppo-clip-viz * {
      box-sizing: border-box;
    }

    .pc-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .pc-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--pc-accent);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .pc-header p {
      color: var(--pc-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .pc-controls {
      background: var(--pc-surface);
      border: 1px solid var(--pc-border);
      border-radius: 10px;
      padding: 1.25rem;
      margin-bottom: 1.25rem;
    }

    .pc-control-row {
      display: flex;
      align-items: center;
      gap: 1rem;
      margin-bottom: 1rem;
    }

    .pc-control-row:last-child {
      margin-bottom: 0;
    }

    .pc-control-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 600;
      color: var(--pc-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.08em;
      min-width: 100px;
    }

    .pc-slider-container {
      flex: 1;
      display: flex;
      align-items: center;
      gap: 0.75rem;
    }

    .pc-slider {
      flex: 1;
      -webkit-appearance: none;
      appearance: none;
      height: 6px;
      border-radius: 3px;
      background: var(--pc-border);
      outline: none;
    }

    .pc-slider::-webkit-slider-thumb {
      -webkit-appearance: none;
      appearance: none;
      width: 18px;
      height: 18px;
      border-radius: 50%;
      background: var(--pc-accent);
      cursor: pointer;
      border: 2px solid var(--pc-bg);
      box-shadow: 0 2px 6px rgba(0,0,0,0.3);
      transition: transform 0.15s ease;
    }

    .pc-slider::-webkit-slider-thumb:hover {
      transform: scale(1.15);
    }

    .pc-slider::-moz-range-thumb {
      width: 18px;
      height: 18px;
      border-radius: 50%;
      background: var(--pc-accent);
      cursor: pointer;
      border: 2px solid var(--pc-bg);
      box-shadow: 0 2px 6px rgba(0,0,0,0.3);
    }

    .pc-slider-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.85rem;
      font-weight: 600;
      min-width: 3.5rem;
      text-align: right;
      color: var(--pc-text);
    }

     
    .pc-presets {
      display: flex;
      gap: 0.5rem;
      flex-wrap: wrap;
    }

    .pc-preset-btn {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 500;
      padding: 0.4rem 0.75rem;
      border: 1px solid var(--pc-border);
      border-radius: 6px;
      background: var(--pc-surface);
      color: var(--pc-text);
      cursor: pointer;
      transition: all 0.2s ease;
    }

    .pc-preset-btn:hover {
      border-color: var(--pc-accent);
      background: rgba(88, 166, 255, 0.1);
    }

    .pc-preset-btn.active {
      background: var(--pc-accent);
      border-color: var(--pc-accent);
      color: #0d1117;
    }

    [data-theme="light"] .pc-preset-btn.active,
    :root:not([data-theme="dark"]) .pc-preset-btn.active {
      color: #ffffff;
    }

     
    .pc-main {
      display: flex;
      gap: 1.25rem;
      margin-bottom: 1.25rem;
    }

    .pc-canvas-card {
      flex: 1;
      min-width: 0;
      background: var(--pc-surface);
      border: 1px solid var(--pc-border);
      border-radius: 10px;
      padding: 1rem;
    }

    .pc-canvas-wrapper {
      position: relative;
      width: 100%;
    }

    .pc-canvas-wrapper canvas {
      display: block;
      width: 100%;
      border-radius: 6px;
      cursor: crosshair;
    }

     
    .pc-info-panel {
      width: 280px;
      flex-shrink: 0;
      display: flex;
      flex-direction: column;
      gap: 0.75rem;
    }

    .pc-info-card {
      background: var(--pc-surface);
      border: 1px solid var(--pc-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
    }

    .pc-info-card h4 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--pc-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin: 0 0 0.6rem 0;
    }

    .pc-case-badge {
      display: inline-block;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 600;
      padding: 0.3rem 0.7rem;
      border-radius: 6px;
      margin-bottom: 0.6rem;
    }

    .pc-case-badge.positive {
      background: rgba(57, 211, 83, 0.12);
      color: var(--pc-positive);
      border: 1px solid rgba(57, 211, 83, 0.25);
    }

    .pc-case-badge.negative {
      background: rgba(249, 117, 131, 0.12);
      color: var(--pc-negative);
      border: 1px solid rgba(249, 117, 131, 0.25);
    }

    .pc-case-description {
      font-size: 0.85rem;
      color: var(--pc-text);
      line-height: 1.5;
      margin: 0;
    }

    .pc-info-row {
      display: flex;
      justify-content: space-between;
      align-items: center;
      padding: 0.3rem 0;
      border-bottom: 1px solid var(--pc-border);
    }

    .pc-info-row:last-child {
      border-bottom: none;
    }

    .pc-info-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      color: var(--pc-text-muted);
    }

    .pc-info-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.8rem;
      font-weight: 600;
      color: var(--pc-text);
    }

    .pc-info-value.accent {
      color: var(--pc-accent);
    }

    .pc-info-value.positive {
      color: var(--pc-positive);
    }

    .pc-info-value.negative {
      color: var(--pc-negative);
    }

     
    .pc-legend {
      background: var(--pc-surface);
      border: 1px solid var(--pc-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
      display: flex;
      gap: 1.5rem;
      flex-wrap: wrap;
      justify-content: center;
    }

    .pc-legend-item {
      display: flex;
      align-items: center;
      gap: 0.5rem;
      font-size: 0.75rem;
      color: var(--pc-text-muted);
    }

    .pc-legend-line {
      width: 24px;
      height: 0;
    }

    .pc-legend-line.unclipped {
      border-top: 2px dashed var(--pc-unclipped);
    }

    .pc-legend-line.clipped {
      border-top: 2.5px solid var(--pc-accent);
    }

    .pc-legend-swatch {
      width: 14px;
      height: 14px;
      border-radius: 3px;
    }

    .pc-legend-swatch.trust-region {
      background: var(--pc-accent);
      opacity: 0.15;
    }

     
    .pc-formula {
      text-align: center;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.8rem;
      color: var(--pc-text-muted);
      padding: 0.6rem;
      background: var(--pc-surface);
      border: 1px solid var(--pc-border);
      border-radius: 8px;
      margin-bottom: 1.25rem;
    }

    .pc-formula .highlight {
      color: var(--pc-accent);
      font-weight: 600;
    }

     
    @media (max-width: 768px) {
      .pc-main {
        flex-direction: column;
      }

      .pc-info-panel {
        width: 100%;
      }

      .pc-control-row {
        flex-direction: column;
        align-items: flex-start;
        gap: 0.5rem;
      }

      .pc-control-label {
        min-width: auto;
      }

      .pc-slider-container {
        width: 100%;
      }

      .pc-legend {
        gap: 0.75rem;
      }

      .pc-formula {
        font-size: 0.7rem;
      }
    }
  </style>

  
  <div class="pc-header">
    <h3>PPO Clipped Surrogate Objective</h3>
    <p>How epsilon-clipping constrains policy updates within a trust region</p>
  </div>

  
  <div class="pc-formula">
    L<sup>CLIP</sup> = <span class="highlight">min</span>(
    r<sub>t</sub> &middot; A&#770;<sub>t</sub>,
    <span class="highlight">clip</span>(r<sub>t</sub>, 1&minus;&epsilon;, 1+&epsilon;) &middot; A&#770;<sub>t</sub>
    )
  </div>

  
  <div class="pc-controls">
    <div class="pc-control-row">
      <span class="pc-control-label">A&#770;<sub>t</sub> (advantage)</span>
      <div class="pc-slider-container">
        <input type="range" class="pc-slider" id="pcAdvSlider-869e96d074aa1e4482cf4409ce2006eb"
               min="-3.0" max="3.0" step="0.1" value="2.0">
        <span class="pc-slider-value" id="pcAdvValue-869e96d074aa1e4482cf4409ce2006eb">+2.00</span>
      </div>
    </div>
    <div class="pc-control-row">
      <span class="pc-control-label">&epsilon; (epsilon)</span>
      <div class="pc-slider-container">
        <input type="range" class="pc-slider" id="pcEpsSlider-869e96d074aa1e4482cf4409ce2006eb"
               min="0.05" max="1.0" step="0.01" value="0.2">
        <span class="pc-slider-value" id="pcEpsValue-869e96d074aa1e4482cf4409ce2006eb">0.20</span>
      </div>
    </div>
    <div class="pc-control-row">
      <span class="pc-control-label">Presets</span>
      <div class="pc-presets">
        <button class="pc-preset-btn active" id="pcPresetStd-869e96d074aa1e4482cf4409ce2006eb" data-eps="0.2">Standard (&epsilon;=0.2)</button>
        <button class="pc-preset-btn" id="pcPresetCon-869e96d074aa1e4482cf4409ce2006eb" data-eps="0.1">Conservative (&epsilon;=0.1)</button>
        <button class="pc-preset-btn" id="pcPresetDS-869e96d074aa1e4482cf4409ce2006eb" data-eps="10">DeepSeek-R1 (&epsilon;=10)</button>
      </div>
    </div>
  </div>

  
  <div class="pc-main">
    <div class="pc-canvas-card">
      <div class="pc-canvas-wrapper">
        <canvas id="pcCanvas-869e96d074aa1e4482cf4409ce2006eb"></canvas>
      </div>
    </div>
    <div class="pc-info-panel">
      <div class="pc-info-card">
        <h4>Behavioral Case</h4>
        <div class="pc-case-badge positive" id="pcCaseBadge-869e96d074aa1e4482cf4409ce2006eb">
          A&#770; &gt; 0, r &ge; 1
        </div>
        <p class="pc-case-description" id="pcCaseDesc-869e96d074aa1e4482cf4409ce2006eb">
          Good token, policy reinforcing: within trust region, objective grows freely
        </p>
      </div>
      <div class="pc-info-card">
        <h4>Current Values</h4>
        <div class="pc-info-row">
          <span class="pc-info-label">r<sub>t</sub></span>
          <span class="pc-info-value accent" id="pcValRt-869e96d074aa1e4482cf4409ce2006eb">1.00</span>
        </div>
        <div class="pc-info-row">
          <span class="pc-info-label">A&#770;<sub>t</sub></span>
          <span class="pc-info-value" id="pcValAdv-869e96d074aa1e4482cf4409ce2006eb">+2.00</span>
        </div>
        <div class="pc-info-row">
          <span class="pc-info-label">&epsilon;</span>
          <span class="pc-info-value" id="pcValEps-869e96d074aa1e4482cf4409ce2006eb">0.20</span>
        </div>
        <div class="pc-info-row">
          <span class="pc-info-label">Unclipped</span>
          <span class="pc-info-value" id="pcValUnclip-869e96d074aa1e4482cf4409ce2006eb">2.00</span>
        </div>
        <div class="pc-info-row">
          <span class="pc-info-label">Clipped</span>
          <span class="pc-info-value" id="pcValClip-869e96d074aa1e4482cf4409ce2006eb">2.00</span>
        </div>
        <div class="pc-info-row">
          <span class="pc-info-label">L<sup>CLIP</sup></span>
          <span class="pc-info-value accent" id="pcValObj-869e96d074aa1e4482cf4409ce2006eb">2.00</span>
        </div>
      </div>
    </div>
  </div>

  
  <div class="pc-legend">
    <div class="pc-legend-item">
      <div class="pc-legend-line unclipped"></div>
      <span>r<sub>t</sub> &middot; A&#770;<sub>t</sub> (unclipped)</span>
    </div>
    <div class="pc-legend-item">
      <div class="pc-legend-line clipped"></div>
      <span>L<sup>CLIP</sup> (clipped objective)</span>
    </div>
    <div class="pc-legend-item">
      <div class="pc-legend-swatch trust-region"></div>
      <span>Trust region [1&minus;&epsilon;, 1+&epsilon;]</span>
    </div>
  </div>

  <script>
  (function() {
    const uid = '869e96d074aa1e4482cf4409ce2006eb';

    
    const canvas = document.getElementById(`pcCanvas-${uid}`);
    const ctx = canvas.getContext('2d');

    const advSlider = document.getElementById(`pcAdvSlider-${uid}`);
    const epsSlider = document.getElementById(`pcEpsSlider-${uid}`);
    const advValueEl = document.getElementById(`pcAdvValue-${uid}`);
    const epsValueEl = document.getElementById(`pcEpsValue-${uid}`);

    const presetStd = document.getElementById(`pcPresetStd-${uid}`);
    const presetCon = document.getElementById(`pcPresetCon-${uid}`);
    const presetDS = document.getElementById(`pcPresetDS-${uid}`);

    const caseBadge = document.getElementById(`pcCaseBadge-${uid}`);
    const caseDesc = document.getElementById(`pcCaseDesc-${uid}`);
    const valRt = document.getElementById(`pcValRt-${uid}`);
    const valAdv = document.getElementById(`pcValAdv-${uid}`);
    const valEps = document.getElementById(`pcValEps-${uid}`);
    const valUnclip = document.getElementById(`pcValUnclip-${uid}`);
    const valClip = document.getElementById(`pcValClip-${uid}`);
    const valObj = document.getElementById(`pcValObj-${uid}`);

    
    let advantage = 2.0;
    let epsilon = 0.2;
    let mouseRt = null; 

    
    const rMin = 0.0;
    const rMax = 2.5;
    const padding = { top: 30, right: 20, bottom: 50, left: 60 };

    
    function getStyle(varName) {
      const container = document.getElementById(`ppo-clip-${uid}`);
      return getComputedStyle(container).getPropertyValue(varName).trim();
    }

    
    function unclippedObj(r, adv) {
      return r * adv;
    }

    function clippedObj(r, adv, eps) {
      const clippedR = Math.max(1 - eps, Math.min(1 + eps, r));
      return clippedR * adv;
    }

    function lClip(r, adv, eps) {
      return Math.min(unclippedObj(r, adv), clippedObj(r, adv, eps));
    }

    
    function computeYRange(adv, eps) {
      
      const samples = 200;
      let yMin = 0;
      let yMax = 0;
      for (let i = 0; i <= samples; i++) {
        const r = rMin + (rMax - rMin) * (i / samples);
        const uVal = unclippedObj(r, adv);
        const cVal = lClip(r, adv, eps);
        yMin = Math.min(yMin, uVal, cVal);
        yMax = Math.max(yMax, uVal, cVal);
      }
      
      const range = yMax - yMin || 1;
      yMin -= range * 0.1;
      yMax += range * 0.1;
      return { yMin, yMax };
    }

    
    function resizeCanvas() {
      const wrapper = canvas.parentElement;
      const width = wrapper.clientWidth;
      const height = Math.max(320, Math.min(450, width * 0.6));
      const dpr = window.devicePixelRatio || 1;
      canvas.width = width * dpr;
      canvas.height = height * dpr;
      canvas.style.width = width + 'px';
      canvas.style.height = height + 'px';
      ctx.setTransform(dpr, 0, 0, dpr, 0, 0);
      return { width, height };
    }

    
    function makeTransforms(width, height, yRange) {
      const plotW = width - padding.left - padding.right;
      const plotH = height - padding.top - padding.bottom;

      function toCanvasX(r) {
        return padding.left + ((r - rMin) / (rMax - rMin)) * plotW;
      }

      function toCanvasY(y) {
        return padding.top + plotH - ((y - yRange.yMin) / (yRange.yMax - yRange.yMin)) * plotH;
      }

      function toDataR(canvasX) {
        return rMin + ((canvasX - padding.left) / plotW) * (rMax - rMin);
      }

      return { toCanvasX, toCanvasY, toDataR, plotW, plotH };
    }

    
    function draw() {
      const { width, height } = resizeCanvas();
      const yRange = computeYRange(advantage, epsilon);
      const { toCanvasX, toCanvasY, toDataR, plotW, plotH } = makeTransforms(width, height, yRange);

      const bgColor = getStyle('--pc-bg');
      const surfaceColor = getStyle('--pc-surface');
      const borderColor = getStyle('--pc-border');
      const textColor = getStyle('--pc-text');
      const mutedColor = getStyle('--pc-text-muted');
      const accentColor = getStyle('--pc-accent');
      const unclippedColor = getStyle('--pc-unclipped');
      const clippedColor = getStyle('--pc-clipped');
      const positiveColor = getStyle('--pc-positive');
      const negativeColor = getStyle('--pc-negative');
      const trustShade = getStyle('--pc-trust-shade');
      const crosshairColor = getStyle('--pc-crosshair');

      
      ctx.clearRect(0, 0, width, height);

      
      const trustLeft = Math.max(rMin, 1 - epsilon);
      const trustRight = Math.min(rMax, 1 + epsilon);
      if (trustLeft < rMax && trustRight > rMin) {
        ctx.fillStyle = trustShade;
        const x1 = toCanvasX(trustLeft);
        const x2 = toCanvasX(trustRight);
        ctx.fillRect(x1, padding.top, x2 - x1, plotH);
      }

      
      ctx.strokeStyle = borderColor;
      ctx.lineWidth = 0.5;

      
      const yTicks = niceTicksForRange(yRange.yMin, yRange.yMax, 6);
      ctx.font = '11px "IBM Plex Mono", monospace';
      ctx.textAlign = 'right';
      ctx.textBaseline = 'middle';

      yTicks.forEach(yVal => {
        const cy = toCanvasY(yVal);
        if (cy >= padding.top && cy <= padding.top + plotH) {
          ctx.beginPath();
          ctx.moveTo(padding.left, cy);
          ctx.lineTo(padding.left + plotW, cy);
          ctx.strokeStyle = borderColor;
          ctx.lineWidth = 0.5;
          ctx.stroke();

          ctx.fillStyle = mutedColor;
          ctx.fillText(formatTick(yVal), padding.left - 8, cy);
        }
      });

      
      const rTicks = niceTicksForRange(rMin, rMax, 6);
      ctx.textAlign = 'center';
      ctx.textBaseline = 'top';

      rTicks.forEach(rVal => {
        const cx = toCanvasX(rVal);
        if (cx >= padding.left && cx <= padding.left + plotW) {
          ctx.beginPath();
          ctx.moveTo(cx, padding.top);
          ctx.lineTo(cx, padding.top + plotH);
          ctx.strokeStyle = borderColor;
          ctx.lineWidth = 0.5;
          ctx.stroke();

          ctx.fillStyle = mutedColor;
          ctx.fillText(formatTick(rVal), cx, padding.top + plotH + 8);
        }
      });

      
      if (yRange.yMin < 0 && yRange.yMax > 0) {
        const zeroY = toCanvasY(0);
        ctx.beginPath();
        ctx.moveTo(padding.left, zeroY);
        ctx.lineTo(padding.left + plotW, zeroY);
        ctx.strokeStyle = mutedColor;
        ctx.lineWidth = 1;
        ctx.setLineDash([4, 4]);
        ctx.stroke();
        ctx.setLineDash([]);
      }

      
      const r1X = toCanvasX(1);
      if (r1X >= padding.left && r1X <= padding.left + plotW) {
        ctx.beginPath();
        ctx.moveTo(r1X, padding.top);
        ctx.lineTo(r1X, padding.top + plotH);
        ctx.strokeStyle = mutedColor;
        ctx.lineWidth = 1;
        ctx.setLineDash([3, 3]);
        ctx.stroke();
        ctx.setLineDash([]);

        
        ctx.fillStyle = mutedColor;
        ctx.font = '10px "IBM Plex Mono", monospace';
        ctx.textAlign = 'center';
        ctx.textBaseline = 'bottom';
        ctx.fillText('r=1', r1X, padding.top - 4);
      }

      
      ctx.fillStyle = mutedColor;
      ctx.font = '11px "IBM Plex Mono", monospace';
      ctx.textAlign = 'center';
      ctx.textBaseline = 'top';
      ctx.fillText('r\u209C (probability ratio)', padding.left + plotW / 2, padding.top + plotH + 28);

      ctx.save();
      ctx.translate(16, padding.top + plotH / 2);
      ctx.rotate(-Math.PI / 2);
      ctx.textAlign = 'center';
      ctx.textBaseline = 'middle';
      ctx.fillText('Objective L\u1D9C\u1D38\u1D34\u1D3E', 0, 0);
      ctx.restore();

      
      ctx.font = '9px "IBM Plex Mono", monospace';
      ctx.textAlign = 'center';
      ctx.textBaseline = 'bottom';
      if (trustLeft >= rMin && trustLeft <= rMax) {
        const bx = toCanvasX(trustLeft);
        ctx.fillStyle = accentColor;
        ctx.globalAlpha = 0.7;
        ctx.fillText(`1-\u03B5`, bx, padding.top - 2);
        ctx.globalAlpha = 1;
      }
      if (trustRight >= rMin && trustRight <= rMax) {
        const bx = toCanvasX(trustRight);
        ctx.fillStyle = accentColor;
        ctx.globalAlpha = 0.7;
        ctx.fillText(`1+\u03B5`, bx, padding.top - 2);
        ctx.globalAlpha = 1;
      }

      
      const numSamples = Math.max(400, Math.round(plotW * 2));

      
      ctx.beginPath();
      ctx.strokeStyle = unclippedColor;
      ctx.lineWidth = 1.8;
      ctx.setLineDash([6, 4]);
      for (let i = 0; i <= numSamples; i++) {
        const r = rMin + (rMax - rMin) * (i / numSamples);
        const y = unclippedObj(r, advantage);
        const cx = toCanvasX(r);
        const cy = toCanvasY(y);
        
        if (i === 0) ctx.moveTo(cx, clampY(cy, padding.top, padding.top + plotH));
        else ctx.lineTo(cx, clampY(cy, padding.top, padding.top + plotH));
      }
      ctx.stroke();
      ctx.setLineDash([]);

      
      ctx.beginPath();
      ctx.strokeStyle = clippedColor;
      ctx.lineWidth = 2.5;
      for (let i = 0; i <= numSamples; i++) {
        const r = rMin + (rMax - rMin) * (i / numSamples);
        const y = lClip(r, advantage, epsilon);
        const cx = toCanvasX(r);
        const cy = toCanvasY(y);
        if (i === 0) ctx.moveTo(cx, clampY(cy, padding.top, padding.top + plotH));
        else ctx.lineTo(cx, clampY(cy, padding.top, padding.top + plotH));
      }
      ctx.stroke();

      
      
      
      
      if (Math.abs(advantage) > 0.01) {
        ctx.globalAlpha = 0.04;
        if (advantage > 0 && 1 + epsilon < rMax) {
          
          const fx1 = toCanvasX(1 + epsilon);
          const fx2 = toCanvasX(rMax);
          ctx.fillStyle = positiveColor;
          ctx.fillRect(fx1, padding.top, fx2 - fx1, plotH);
        }
        if (advantage < 0 && 1 - epsilon > rMin) {
          
          const fx1 = toCanvasX(rMin);
          const fx2 = toCanvasX(1 - epsilon);
          ctx.fillStyle = negativeColor;
          ctx.fillRect(fx1, padding.top, fx2 - fx1, plotH);
        }
        ctx.globalAlpha = 1;
      }

      
      if (mouseRt !== null && mouseRt >= rMin && mouseRt <= rMax) {
        const cx = toCanvasX(mouseRt);

        
        ctx.beginPath();
        ctx.moveTo(cx, padding.top);
        ctx.lineTo(cx, padding.top + plotH);
        ctx.strokeStyle = crosshairColor;
        ctx.lineWidth = 1;
        ctx.setLineDash([3, 3]);
        ctx.stroke();
        ctx.setLineDash([]);

        
        const uY = unclippedObj(mouseRt, advantage);
        const cY = lClip(mouseRt, advantage, epsilon);

        
        const uCy = toCanvasY(uY);
        if (uCy >= padding.top && uCy <= padding.top + plotH) {
          ctx.beginPath();
          ctx.arc(cx, uCy, 4, 0, Math.PI * 2);
          ctx.fillStyle = unclippedColor;
          ctx.fill();
          ctx.strokeStyle = bgColor;
          ctx.lineWidth = 1.5;
          ctx.stroke();
        }

        
        const cCy = toCanvasY(cY);
        if (cCy >= padding.top && cCy <= padding.top + plotH) {
          ctx.beginPath();
          ctx.arc(cx, cCy, 5, 0, Math.PI * 2);
          ctx.fillStyle = clippedColor;
          ctx.fill();
          ctx.strokeStyle = bgColor;
          ctx.lineWidth = 1.5;
          ctx.stroke();
        }

        
        const tooltipR = mouseRt;
        const tooltipUnclip = uY;
        const tooltipClip = cY;
        const labelX = cx + 10;
        const labelY = padding.top + 15;

        ctx.font = '10px "IBM Plex Mono", monospace';
        ctx.textAlign = 'left';
        ctx.textBaseline = 'top';

        
        const lines = [
          `r = ${tooltipR.toFixed(2)}`,
          `unclip = ${tooltipUnclip.toFixed(2)}`,
          `L\u1D9C\u1D38\u1D34\u1D3E = ${tooltipClip.toFixed(2)}`
        ];
        const lineH = 14;
        const maxW = Math.max(...lines.map(l => ctx.measureText(l).width));
        const boxW = maxW + 12;
        const boxH = lines.length * lineH + 8;

        
        const finalX = (labelX + boxW > padding.left + plotW) ? cx - boxW - 10 : labelX;

        ctx.fillStyle = bgColor;
        ctx.globalAlpha = 0.88;
        ctx.beginPath();
        roundRect(ctx, finalX - 4, labelY - 4, boxW + 4, boxH, 4);
        ctx.fill();
        ctx.globalAlpha = 1;

        ctx.fillStyle = mutedColor;
        ctx.fillText(lines[0], finalX, labelY);
        ctx.fillStyle = unclippedColor;
        ctx.fillText(lines[1], finalX, labelY + lineH);
        ctx.fillStyle = clippedColor;
        ctx.fillText(lines[2], finalX, labelY + lineH * 2);
      }

      
      ctx.strokeStyle = borderColor;
      ctx.lineWidth = 1;
      ctx.strokeRect(padding.left, padding.top, plotW, plotH);

      
      updateInfoPanel();
    }

    
    function clampY(y, minY, maxY) {
      return Math.max(minY, Math.min(maxY, y));
    }

    function roundRect(ctx, x, y, w, h, r) {
      ctx.moveTo(x + r, y);
      ctx.lineTo(x + w - r, y);
      ctx.quadraticCurveTo(x + w, y, x + w, y + r);
      ctx.lineTo(x + w, y + h - r);
      ctx.quadraticCurveTo(x + w, y + h, x + w - r, y + h);
      ctx.lineTo(x + r, y + h);
      ctx.quadraticCurveTo(x, y + h, x, y + h - r);
      ctx.lineTo(x, y + r);
      ctx.quadraticCurveTo(x, y, x + r, y);
    }

    function niceTicksForRange(lo, hi, approxCount) {
      const range = hi - lo;
      if (range === 0) return [lo];
      const rawStep = range / approxCount;
      const mag = Math.pow(10, Math.floor(Math.log10(rawStep)));
      let step;
      const normalized = rawStep / mag;
      if (normalized <= 1.5) step = 1 * mag;
      else if (normalized <= 3.5) step = 2 * mag;
      else if (normalized <= 7.5) step = 5 * mag;
      else step = 10 * mag;

      const ticks = [];
      let t = Math.ceil(lo / step) * step;
      while (t <= hi + step * 0.001) {
        ticks.push(Math.round(t * 1e10) / 1e10);
        t += step;
      }
      return ticks;
    }

    function formatTick(v) {
      if (Math.abs(v) < 1e-10) return '0';
      if (Number.isInteger(v)) return v.toString();
      return v.toFixed(Math.abs(v) < 1 ? 2 : 1);
    }

    
    function updateInfoPanel() {
      const rt = mouseRt !== null ? mouseRt : 1.0;
      const adv = advantage;
      const eps = epsilon;

      const uVal = unclippedObj(rt, adv);
      const cVal = clippedObj(rt, adv, eps);
      const objVal = lClip(rt, adv, eps);

      valRt.textContent = rt.toFixed(2);
      valAdv.textContent = (adv >= 0 ? '+' : '') + adv.toFixed(2);
      valEps.textContent = eps.toFixed(2);
      valUnclip.textContent = uVal.toFixed(2);
      valClip.textContent = cVal.toFixed(2);
      valObj.textContent = objVal.toFixed(2);

      
      if (adv >= 0) {
        valAdv.className = 'pc-info-value positive';
      } else {
        valAdv.className = 'pc-info-value negative';
      }

      
      if (objVal > 0) {
        valObj.className = 'pc-info-value positive';
      } else if (objVal < 0) {
        valObj.className = 'pc-info-value negative';
      } else {
        valObj.className = 'pc-info-value accent';
      }

      
      const inTrust = rt >= (1 - eps) && rt <= (1 + eps);
      let caseLabel, caseDescText, caseType;

      if (adv >= 0 && rt >= 1) {
        caseType = 'positive';
        if (inTrust || rt <= 1 + eps) {
          caseLabel = '\u00C2 > 0, r \u2265 1';
          caseDescText = 'Good token, policy reinforcing \u2014 within trust region, objective grows freely';
        } else {
          caseLabel = '\u00C2 > 0, r > 1+\u03B5';
          caseDescText = 'Good token, policy reinforcing \u2014 clip caps the gain to prevent overshooting';
        }
      } else if (adv >= 0 && rt < 1) {
        caseType = 'positive';
        caseLabel = '\u00C2 > 0, r < 1';
        caseDescText = 'Good token, policy retreating \u2014 no clip needed, gradient encourages recovery';
      } else if (adv < 0 && rt <= 1) {
        caseType = 'negative';
        if (inTrust || rt >= 1 - eps) {
          caseLabel = '\u00C2 < 0, r \u2264 1';
          caseDescText = 'Bad token, policy retreating \u2014 within trust region, penalty applied normally';
        } else {
          caseLabel = '\u00C2 < 0, r < 1\u2212\u03B5';
          caseDescText = 'Bad token, policy retreating \u2014 clip caps the penalty to prevent collapse';
        }
      } else {
        
        caseType = 'negative';
        caseLabel = '\u00C2 < 0, r > 1';
        caseDescText = 'Bad token, policy reinforcing \u2014 no clip needed, gradient discourages this';
      }

      caseBadge.textContent = caseLabel;
      caseBadge.className = 'pc-case-badge ' + caseType;
      caseDesc.textContent = caseDescText;
    }

    
    advSlider.addEventListener('input', function() {
      advantage = parseFloat(this.value);
      advValueEl.textContent = (advantage >= 0 ? '+' : '') + advantage.toFixed(2);
      clearActivePreset();
      draw();
    });

    epsSlider.addEventListener('input', function() {
      epsilon = parseFloat(this.value);
      epsValueEl.textContent = epsilon.toFixed(2);
      clearActivePreset();
      draw();
    });

    function setEpsilon(val, btn) {
      
      
      epsilon = val;
      if (val <= 1.0) {
        epsSlider.value = val;
      } else {
        epsSlider.value = epsSlider.max;
      }
      epsValueEl.textContent = val.toFixed(val >= 10 ? 0 : 2);
      clearActivePreset();
      btn.classList.add('active');
      draw();
    }

    function clearActivePreset() {
      presetStd.classList.remove('active');
      presetCon.classList.remove('active');
      presetDS.classList.remove('active');
    }

    presetStd.addEventListener('click', function() { setEpsilon(0.2, this); });
    presetCon.addEventListener('click', function() { setEpsilon(0.1, this); });
    presetDS.addEventListener('click', function() { setEpsilon(10, this); });

    
    canvas.addEventListener('mousemove', function(e) {
      const rect = canvas.getBoundingClientRect();
      const scaleX = canvas.offsetWidth / canvas.clientWidth;
      const x = (e.clientX - rect.left) * scaleX;
      const { width, height } = { width: canvas.clientWidth, height: canvas.clientHeight };
      const yRange = computeYRange(advantage, epsilon);
      const { toDataR } = makeTransforms(width, height, yRange);

      const r = toDataR(x);
      if (r >= rMin && r <= rMax && e.clientY >= rect.top + padding.top && e.clientY <= rect.bottom - padding.bottom) {
        mouseRt = r;
      } else {
        mouseRt = null;
      }
      draw();
    });

    canvas.addEventListener('mouseleave', function() {
      mouseRt = null;
      draw();
    });

    
    canvas.addEventListener('touchmove', function(e) {
      e.preventDefault();
      const touch = e.touches[0];
      const rect = canvas.getBoundingClientRect();
      const x = touch.clientX - rect.left;
      const { width, height } = { width: canvas.clientWidth, height: canvas.clientHeight };
      const yRange = computeYRange(advantage, epsilon);
      const { toDataR } = makeTransforms(width, height, yRange);

      const r = toDataR(x);
      if (r >= rMin && r <= rMax) {
        mouseRt = r;
      } else {
        mouseRt = null;
      }
      draw();
    }, { passive: false });

    canvas.addEventListener('touchend', function() {
      mouseRt = null;
      draw();
    });

    
    let resizeTimer;
    window.addEventListener('resize', function() {
      clearTimeout(resizeTimer);
      resizeTimer = setTimeout(draw, 50);
    });

    
    const observer = new MutationObserver(function() {
      draw();
    });
    observer.observe(document.documentElement, {
      attributes: true,
      attributeFilter: ['data-theme']
    });

    
    draw();
  })();
  </script>
</div>

<p>An important subtlety for LLM training: clipping operates <strong>per-token</strong>, not globally. In a 512-token response, some tokens might have $r_t$ well within bounds (contributing full gradients) while others are clipped (contributing zero gradient). The overall update is a blend of these per-token signals, which keeps any single token from dominating the step and helps stabilize training.</p>
<p>One notable exception: DeepSeek-R1 uses $\varepsilon = 10$, which effectively disables clipping. Their group-normalized advantages (which we will see in the GRPO section) are already well-scaled, reducing the need for a tight trust region.</p>
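<p>The per-token behavior can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: the function name <code>ppo_clip_terms</code> and the toy ratios and advantages are assumptions made for the example. It evaluates the clipped objective for each token and flags the tokens where the clipped branch wins, which are the ones that contribute no gradient.</p>

```python
def ppo_clip_terms(ratios, advantages, eps=0.2):
    """Return (objective, has_gradient) for each token.

    ratios     -- per-token probability ratios r_t (hypothetical toy values)
    advantages -- per-token advantage estimates
    eps        -- clipping parameter epsilon
    """
    results = []
    for r, adv in zip(ratios, advantages):
        unclipped = r * adv
        clipped_ratio = max(1.0 - eps, min(1.0 + eps, r))
        clipped = clipped_ratio * adv
        objective = min(unclipped, clipped)
        # The clipped branch is constant in r, so the token passes no
        # gradient exactly when the min selects it over the unclipped term.
        has_gradient = not (clipped < unclipped)
        results.append((objective, has_gradient))
    return results


# Four tokens covering the four sign/ratio cases.
ratios = [1.5, 0.5, 0.5, 1.5]
advantages = [1.0, 1.0, -1.0, -1.0]

for eps in (0.2, 10.0):
    print("eps =", eps)
    for (obj, grad), r, adv in zip(ppo_clip_terms(ratios, advantages, eps), ratios, advantages):
        print(f"  r={r:.1f} A={adv:+.1f} objective={obj:+.2f} gradient={'yes' if grad else 'no'}")
```

<p>With $\varepsilon = 0.2$ only two of the four toy tokens keep their gradient; with $\varepsilon = 10$ every ratio in this example stays inside the clip range, which is the sense in which a very large $\varepsilon$ effectively disables clipping.</p>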
<h3 id="the-four-model-problem">The Four-Model Problem</h3>
<p>Running PPO for LLM alignment requires four models simultaneously in GPU memory:</p>
<ol>
<li><strong>The policy</strong> $\pi_\theta$ — the model being trained. Requires gradients and optimizer states (2-3× the weight memory).</li>
<li><strong>The reference policy</strong> $\pi_{\text{ref}}$ — a frozen copy of the SFT model. Only forward passes, but still occupies full weight memory.</li>
<li><strong>The reward model</strong> $r_\phi$ — scores generated responses. Frozen during PPO, forward passes only.</li>
<li><strong>The value/critic network</strong> $V_\psi$ — estimates expected future reward to compute advantages $\hat{A}_t$. Requires gradients and optimizer states.</li>
</ol>
<p>For a 7B parameter model in fp16, weights alone consume approximately 14GB per model (7B parameters × 2 bytes), roughly 56GB across all four, before accounting for optimizer states (Adam stores two additional parameter-sized tensors, the first and second moment estimates, for each of the policy and the critic). With gradients, activations, and a batch of generated sequences also in memory, the total easily exceeds 100GB for a single 7B model. Running PPO on a 70B model requires multi-node setups that only frontier labs can afford.</p>
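<p>The arithmetic above can be reproduced with a short back-of-the-envelope script. It is a rough sketch under stated assumptions: the helper <code>ppo_memory_gb</code> is hypothetical, 1GB is taken as $10^9$ bytes, and Adam&rsquo;s two moment tensors are assumed to be stored at the same 2 bytes per parameter as the weights. Mixed-precision setups often keep optimizer state in fp32, which would push the total higher. Gradients, activations, and generated sequences are not counted.</p>

```python
def ppo_memory_gb(n_params, bytes_per_param=2, trainable=2, frozen=2, adam_copies=2):
    """Rough weight and optimizer memory for a four-model PPO setup, in GB.

    trainable   -- models with optimizer state (policy and critic)
    frozen      -- forward-only models (reference policy and reward model)
    adam_copies -- parameter-sized tensors Adam keeps per trainable model
    """
    weights_per_model = n_params * bytes_per_param / 1e9
    weights_total = weights_per_model * (trainable + frozen)
    # Assumption: optimizer state uses the same bytes_per_param as the weights.
    optimizer_total = weights_per_model * adam_copies * trainable
    return weights_per_model, weights_total, optimizer_total


per_model, weights, optimizer = ppo_memory_gb(7e9)
print(f"weights per model: {per_model:.0f} GB")
print(f"weights, 4 models: {weights:.0f} GB")
print(f"Adam state:        {optimizer:.0f} GB")
print(f"subtotal:          {weights + optimizer:.0f} GB")
```

<p>Even this optimistic count lands above 100GB before any activations or sampled sequences are stored.</p>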
<p>Beyond memory, PPO faces two systemic challenges. <strong>Distribution shift</strong>: as the policy improves, the reward model&rsquo;s training data (collected from an earlier, weaker policy) becomes stale. The proxy reward keeps climbing while true human preference plateaus or declines. Gao et al. (2022) formalized this as &ldquo;reward model overoptimization.&rdquo; <strong>Hyperparameter sensitivity</strong>: learning rates, KL coefficients, clipping parameters, and even Adam&rsquo;s epsilon require careful tuning. Huang et al. (2023) found that reward scores and loss values are poor indicators of training health; practitioners should monitor KL divergence, response length distributions, and perplexity instead.</p>
<p>Despite all this complexity, PPO produced the first convincing result: InstructGPT showed that a 1.3B parameter model trained with RLHF was preferred by human evaluators over the 175B parameter base GPT-3. A model roughly 135× smaller, made more useful through alignment. The engineering was expensive, but the result was undeniable.</p>
<h3 id="the-question-that-sparked-dpo">The Question That Sparked DPO</h3>
<p>PPO demonstrated that RL could align language models with human preferences. But the engineering complexity was severe: four models, meticulous hyperparameter tuning, and infrastructure that only a handful of organizations could afford. Researchers began asking: could we achieve similar results without the reward model entirely?</p>
<p>The mathematical observation that makes this possible: the RLHF objective has a closed-form optimal policy. If we can express the reward in terms of the policy itself, we can optimize directly on preference data without a reward model, RL loop, or critic network. This insight leads to DPO.</p>
<h2 id="dpo-your-language-model-is-secretly-a-reward-model"><a href="https://www.youtube.com/watch?v=k2pD3k1485A">DPO</a>: Your Language Model Is Secretly a Reward Model</h2>
<h3 id="the-reparameterization-that-changes-everything">The Reparameterization That Changes Everything</h3>
<p>Let&rsquo;s start from the same KL-constrained RLHF objective we defined for PPO. Using variational calculus (or, more practically, by expanding the KL divergence and completing the algebra), we can derive the optimal policy in closed form:</p>
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \cdot \pi_{\text{ref}}(y \mid x) \cdot \exp\!\left(\frac{r(x, y)}{\beta}\right)$$<p>where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \cdot \exp(r(x, y) / \beta)$ is the partition function that ensures the distribution sums to 1.</p>
<p>The intuition here is direct: <strong>the optimal policy is the reference distribution &ldquo;warped&rdquo; by an exponential reward function.</strong> Responses with high reward get boosted in probability; responses with low reward get suppressed. The parameter $\beta$ controls how aggressive this warping is. When $\beta \to 0$, the policy collapses toward pure reward maximization; only the highest-reward response gets any probability mass. When $\beta \to \infty$, the exponential flattens and the policy stays frozen at the reference.</p>
<p>We can rearrange this to express the reward in terms of the policy:</p>
$$r(x, y) = \beta \cdot \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \cdot \log Z(x)$$<p>This says something remarkable: <strong>the reward is fully determined by the log-ratio of optimal policy to reference policy</strong>, plus a prompt-dependent constant $\beta \log Z(x)$. The reward has been hiding inside the policy all along.</p>
<p>But $Z(x)$ is intractable. It requires summing over <em>all possible responses</em> to prompt $x$, every possible sequence of tokens the model could produce. For a vocabulary of 50,000 tokens and responses of even modest length, this is an astronomically large set. PPO avoids computing $Z(x)$ by using iterative approximate optimization. DPO avoids it through algebraic cancellation.</p>
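<p>On a toy problem where the response set is small enough to enumerate, both the closed-form policy and the reward rearrangement can be checked directly. The sketch below assumes a hypothetical prompt with only three possible responses and made-up reward values. It computes $\pi^*$ and $Z(x)$ for several values of $\beta$, then recovers each reward from the log-ratio plus $\beta \log Z(x)$.</p>

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """Closed-form optimum: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function; tractable only because the set is tiny
    return [w / z for w in weights], z


def recover_rewards(pi_star, ref_probs, z, beta):
    """Invert the closed form: r(y) = beta * log(pi*(y) / pi_ref(y)) + beta * log Z."""
    return [beta * math.log(p / q) + beta * math.log(z) for p, q in zip(pi_star, ref_probs)]


# Hypothetical three-response world for a single prompt.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

for beta in (0.1, 1.0, 100.0):
    pi_star, z = optimal_policy(ref_probs, rewards, beta)
    recovered = recover_rewards(pi_star, ref_probs, z, beta)
    print(f"beta={beta}: pi*={[round(p, 3) for p in pi_star]} "
          f"recovered r={[round(r, 3) for r in recovered]}")
```

<p>Small $\beta$ concentrates nearly all mass on the highest-reward response, large $\beta$ leaves the policy close to the reference, and in every case the rewards come back out of the policy. With a real vocabulary the sum inside <code>optimal_policy</code> is exactly the part that cannot be computed.</p>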
<h3 id="from-rl-to-classification-in-one-substitution">From RL to Classification in One Substitution</h3>
<p>Here is where DPO&rsquo;s elegance emerges. We substitute the implicit reward expression into the Bradley-Terry preference model. For a preferred response $y_w$ and dispreferred response $y_l$ given the same prompt $x$:</p>
$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log Z(x)$$<p>The $\beta \log Z(x)$ terms cancel exactly. This is the critical step. $Z(x)$ depends only on the prompt, not the response, so it appears identically in both terms and drops out of the difference. The intractable partition function vanishes.</p>
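<p>Written out under the Bradley-Terry model, where the probability that $y_w$ beats $y_l$ is the sigmoid of the reward difference, the cancellation leaves a preference probability that depends only on policy log-ratios. The symbol $p^*$ for the preference distribution is notation assumed here for illustration:</p>
$$p^*(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$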
<p>Substituting into the Bradley-Terry model and replacing the theoretical optimal policy $\pi^*$ with our trainable policy $\pi_\theta$, we get the DPO loss:</p>
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$<p>This is a binary cross-entropy loss. The &ldquo;logit&rdquo; is the difference in implicit rewards between the preferred and dispreferred responses. Each implicit reward $\beta \log(\pi_\theta / \pi_{\text{ref}})$ measures how much the current policy has shifted its probability relative to the reference, a direct proxy for how much the policy &ldquo;values&rdquo; that response.</p>
<p>During training, this loss simultaneously increases the relative probability of preferred completions and decreases the relative probability of dispreferred ones. The $\beta$ parameter controls how sharply: low $\beta$ (e.g., 0.1) allows aggressive optimization away from the reference, while high $\beta$ (e.g., 0.5) keeps updates conservative. No explicit KL penalty is needed because the reference policy appears directly in the loss function: once the implicit reward margin between the two responses is large, the sigmoid saturates and the gradient signal shrinks on its own.</p>
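<p>As a minimal sketch of the loss itself, the function below evaluates the DPO objective from sequence-level log-probabilities using only the standard library. The name <code>dpo_loss</code> and the input layout (one pair of chosen and rejected log-probabilities per example) are assumptions for illustration. A real trainer would sum token log-probabilities from the model and backpropagate through the policy terms.</p>

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_logps, ref_logps, beta=0.1):
    """Mean DPO loss over a batch.

    policy_logps, ref_logps -- lists of (logp_chosen, logp_rejected) pairs,
    each entry being the log-probability of a whole response given its prompt.
    """
    total = 0.0
    for (pol_w, pol_l), (ref_w, ref_l) in zip(policy_logps, ref_logps):
        reward_w = beta * (pol_w - ref_w)  # implicit reward of the preferred response
        reward_l = beta * (pol_l - ref_l)  # implicit reward of the dispreferred response
        total += -log_sigmoid(reward_w - reward_l)
    return total / len(policy_logps)


# A policy identical to the reference has zero margin, so the loss is log 2.
ref = [(math.log(0.15), math.log(0.12))]
print(dpo_loss(ref, ref, beta=0.1))
```

<p>The reference log-probabilities enter only as constants, so they can be precomputed once per dataset rather than recomputed at every step.</p>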
<h3 id="a-worked-numerical-example">A Worked Numerical Example</h3>
<p>Let&rsquo;s make this concrete. Consider a prompt $x$ = &ldquo;What is the capital of France?&rdquo; with two responses:</p>
<ul>
<li>$y_w$ (preferred): &ldquo;The capital of France is Paris.&rdquo;</li>
<li>$y_l$ (dispreferred): &ldquo;France&rsquo;s capital is Berlin, a beautiful city.&rdquo;</li>
</ul>
<p>Suppose the reference policy assigns $\pi_{\text{ref}}(y_w \mid x) = 0.15$ and $\pi_{\text{ref}}(y_l \mid x) = 0.12$. The trainable policy $\pi_\theta$ starts as a copy of the reference, so initially $\pi_\theta = \pi_{\text{ref}}$. Let $\beta = 0.1$.</p>
<p><strong>At initialization:</strong></p>
<p>The implicit rewards are both zero:
</p>
$$\hat{r}(x, y_w) = 0.1 \cdot \log \frac{0.15}{0.15} = 0.1 \cdot \log 1 = 0$$<p>
</p>
$$\hat{r}(x, y_l) = 0.1 \cdot \log \frac{0.12}{0.12} = 0$$<p>The reward difference is $0 - 0 = 0$. The loss is $-\log \sigma(0) = -\log(0.5) = \log 2 \approx 0.693$. The model has no preference, exactly what we&rsquo;d expect before any training.</p>
<p><strong>After one gradient step:</strong></p>
<p>Suppose the update pushes $\pi_\theta(y_w \mid x)$ up to $0.20$ and $\pi_\theta(y_l \mid x)$ down to $0.08$:</p>
$$\hat{r}(x, y_w) = 0.1 \cdot \log \frac{0.20}{0.15} = 0.1 \cdot 0.288 = 0.029$$<p>
</p>
$$\hat{r}(x, y_l) = 0.1 \cdot \log \frac{0.08}{0.12} = 0.1 \cdot (-0.405) = -0.041$$<p>The reward difference is $0.029 - (-0.041) \approx 0.069$ (computed from the unrounded values). The loss drops to $-\log \sigma(0.069) \approx 0.659$.</p>
<p>The model is learning without ever computing an explicit reward. The reward signal emerges from the probability shift relative to the reference. And notice the structural KL constraint at work: as $\pi_\theta$ pushes probabilities further from $\pi_{\text{ref}}$, the log-ratio grows, which eventually saturates the sigmoid and produces diminishing gradient signal. The policy naturally resists extreme deviations.</p>
<p>The insight is already in the DPO paper&rsquo;s title, &ldquo;Your Language Model Is Secretly a Reward Model&rdquo;: the reward model was never eliminated; it was absorbed into the policy itself.</p>
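<p>The saturation effect in the worked example can be made visible by tracking the factor that scales the gradient. For a margin $m$ the derivative of $-\log \sigma(m)$ with respect to $m$ has magnitude $\sigma(-m)$, so that number shrinks as the policy separates the pair. The sketch below uses the reference probabilities from the example, with later policy probabilities taken from the illustrative trajectory hard-coded in the interactive demo; the helper name is an assumption.</p>

```python
import math


def margin_loss_gradscale(pol_w, pol_l, ref_w, ref_l, beta):
    """Return (margin, loss, gradient scale) for one preference pair."""
    margin = beta * (math.log(pol_w / ref_w) - math.log(pol_l / ref_l))
    loss = math.log1p(math.exp(-margin))         # equals -log(sigmoid(margin))
    grad_scale = 1.0 / (1.0 + math.exp(margin))  # sigmoid(-margin) = |d loss / d margin|
    return margin, loss, grad_scale


REF_W, REF_L = 0.15, 0.12
# (pi_theta(y_w), pi_theta(y_l)) at successive illustrative steps.
trajectory = [(0.15, 0.12), (0.20, 0.08), (0.27, 0.05), (0.33, 0.03), (0.38, 0.02)]

for step, (pol_w, pol_l) in enumerate(trajectory):
    margin, loss, scale = margin_loss_gradscale(pol_w, pol_l, REF_W, REF_L, beta=0.1)
    print(f"step {step}: margin={margin:.3f} loss={loss:.3f} grad scale={scale:.3f}")
```

<p>At $\beta = 0.1$ the scale decays slowly, which matches the earlier point that a low $\beta$ permits larger moves away from the reference before the signal fades.</p>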


<div class="di-viz" id="di-869e96d074aa1e4482cf4409ce2006eb">
  <style>
    .di-viz {
      --di-bg: #0d1117;
      --di-surface: #161b22;
      --di-border: #30363d;
      --di-text: #e6edf3;
      --di-text-muted: #8b949e;
      --di-accent: #58a6ff;
      --di-green: #39d353;
      --di-red: #f97583;
      --di-gray: #6e7681;
      --di-yellow: #d29922;
      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--di-bg);
      color: var(--di-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

     
    [data-theme="light"] .di-viz,
    :root:not([data-theme="dark"]) .di-viz {
      --di-bg: #f8fafc;
      --di-surface: #ffffff;
      --di-border: #e2e8f0;
      --di-text: #1e293b;
      --di-text-muted: #64748b;
      --di-accent: #3b82f6;
      --di-green: #10b981;
      --di-red: #ef4444;
      --di-gray: #94a3b8;
      --di-yellow: #f59e0b;
    }

    .di-viz * {
      box-sizing: border-box;
    }

     
    .di-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .di-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--di-accent);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .di-header p {
      color: var(--di-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .di-prompt-box {
      background: var(--di-surface);
      border: 1px solid var(--di-border);
      border-radius: 8px;
      padding: 0.75rem 1rem;
      margin-bottom: 1.25rem;
      font-size: 0.85rem;
    }

    .di-prompt-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--di-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin-bottom: 0.3rem;
    }

    .di-prompt-text {
      color: var(--di-text);
      font-weight: 500;
    }

    .di-responses {
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 0.75rem;
      margin-bottom: 1.25rem;
    }

    .di-response-card {
      background: var(--di-surface);
      border: 1px solid var(--di-border);
      border-radius: 8px;
      padding: 0.6rem 0.85rem;
      font-size: 0.82rem;
    }

    .di-response-card.di-preferred {
      border-left: 3px solid var(--di-green);
    }

    .di-response-card.di-dispreferred {
      border-left: 3px solid var(--di-red);
    }

    .di-response-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin-bottom: 0.2rem;
    }

    .di-response-card.di-preferred .di-response-label {
      color: var(--di-green);
    }

    .di-response-card.di-dispreferred .di-response-label {
      color: var(--di-red);
    }

    .di-response-text {
      color: var(--di-text);
      font-style: italic;
    }

     
    .di-panels {
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 1.25rem;
      margin-bottom: 1.25rem;
    }

    @media (max-width: 768px) {
      .di-panels {
        grid-template-columns: 1fr;
      }
      .di-responses {
        grid-template-columns: 1fr;
      }
    }

    .di-panel {
      background: var(--di-surface);
      border: 1px solid var(--di-border);
      border-radius: 8px;
      padding: 1rem;
    }

    .di-panel-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--di-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin: 0 0 0.9rem 0;
    }

     
    .di-bar-group {
      margin-bottom: 0.85rem;
    }

    .di-bar-group:last-child {
      margin-bottom: 0;
    }

    .di-bar-group-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      color: var(--di-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.08em;
      margin-bottom: 0.4rem;
    }

    .di-bar-row {
      display: flex;
      align-items: center;
      gap: 0.5rem;
      margin-bottom: 0.35rem;
    }

    .di-bar-row:last-child {
      margin-bottom: 0;
    }

    .di-bar-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.72rem;
      color: var(--di-text-muted);
      min-width: 72px;
      text-align: right;
      flex-shrink: 0;
    }

    .di-bar-track {
      flex: 1;
      height: 22px;
      background: rgba(128, 128, 128, 0.1);
      border-radius: 4px;
      position: relative;
      overflow: hidden;
    }

    .di-bar-fill {
      height: 100%;
      border-radius: 4px;
      transition: width 0.6s cubic-bezier(0.4, 0, 0.2, 1);
      display: flex;
      align-items: center;
      justify-content: flex-end;
      padding-right: 6px;
      min-width: 36px;
    }

    .di-bar-fill.di-bar-ref {
      background: var(--di-gray);
      opacity: 0.6;
    }

    .di-bar-fill.di-bar-policy-w {
      background: var(--di-green);
    }

    .di-bar-fill.di-bar-policy-l {
      background: var(--di-red);
    }

    .di-bar-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.68rem;
      font-weight: 600;
      color: #fff;
      text-shadow: 0 1px 2px rgba(0, 0, 0, 0.5);
    }

     
    .di-comp-line {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.75rem;
      line-height: 1.9;
      color: var(--di-text);
      margin-bottom: 0.3rem;
      word-break: break-word;
    }

    .di-comp-line:last-child {
      margin-bottom: 0;
    }

    .di-comp-label {
      color: var(--di-text-muted);
    }

    .di-comp-val-green {
      color: var(--di-green);
      font-weight: 600;
    }

    .di-comp-val-red {
      color: var(--di-red);
      font-weight: 600;
    }

    .di-comp-val-accent {
      color: var(--di-accent);
      font-weight: 600;
    }

    .di-comp-val-yellow {
      color: var(--di-yellow);
      font-weight: 600;
    }

    .di-comp-divider {
      height: 1px;
      background: var(--di-border);
      margin: 0.6rem 0;
    }

     
    .di-sparkline-container {
      margin-top: 0.75rem;
    }

    .di-sparkline-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      color: var(--di-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.08em;
      margin-bottom: 0.35rem;
    }

    .di-sparkline-svg {
      width: 100%;
      height: 48px;
    }

    .di-sparkline-line {
      fill: none;
      stroke: var(--di-accent);
      stroke-width: 2;
      stroke-linecap: round;
      stroke-linejoin: round;
    }

    .di-sparkline-dot {
      fill: var(--di-accent);
      transition: cx 0.3s, cy 0.3s;
    }

    .di-sparkline-dot-inactive {
      fill: var(--di-gray);
      opacity: 0.4;
    }

    .di-sparkline-grid {
      stroke: var(--di-border);
      stroke-width: 0.5;
    }

     
    .di-controls {
      display: flex;
      align-items: center;
      justify-content: center;
      gap: 1.25rem;
      flex-wrap: wrap;
    }

    .di-btn {
      font-family: 'IBM Plex Sans', sans-serif;
      font-size: 0.8rem;
      font-weight: 600;
      padding: 0.5rem 1.1rem;
      border-radius: 6px;
      border: 1px solid var(--di-border);
      background: var(--di-surface);
      color: var(--di-text);
      cursor: pointer;
      transition: all 0.2s ease;
      user-select: none;
    }

    .di-btn:hover:not(:disabled) {
      background: var(--di-accent);
      color: #fff;
      border-color: var(--di-accent);
    }

    .di-btn:disabled {
      opacity: 0.35;
      cursor: not-allowed;
    }

    .di-btn-primary {
      background: var(--di-accent);
      color: #fff;
      border-color: var(--di-accent);
    }

    .di-btn-primary:hover:not(:disabled) {
      filter: brightness(1.15);
    }

    .di-step-indicator {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.82rem;
      font-weight: 600;
      color: var(--di-text);
    }

    .di-slider-group {
      display: flex;
      align-items: center;
      gap: 0.5rem;
    }

    .di-slider-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      color: var(--di-text-muted);
    }

    .di-slider {
      -webkit-appearance: none;
      appearance: none;
      width: 100px;
      height: 5px;
      border-radius: 3px;
      background: var(--di-border);
      outline: none;
    }

    .di-slider::-webkit-slider-thumb {
      -webkit-appearance: none;
      width: 16px;
      height: 16px;
      border-radius: 50%;
      background: var(--di-accent);
      cursor: pointer;
      transition: background 0.2s;
    }

    .di-slider::-webkit-slider-thumb:hover {
      filter: brightness(1.2);
    }

    .di-slider::-moz-range-thumb {
      width: 16px;
      height: 16px;
      border-radius: 50%;
      background: var(--di-accent);
      cursor: pointer;
      border: none;
    }

    .di-slider-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.78rem;
      font-weight: 600;
      color: var(--di-accent);
      min-width: 2.2rem;
    }

     
    .di-punchline {
      text-align: center;
      padding: 0.85rem 1rem;
      margin-top: 1.25rem;
      background: var(--di-surface);
      border: 1px solid var(--di-accent);
      border-radius: 8px;
      font-size: 0.88rem;
      font-weight: 600;
      color: var(--di-accent);
      opacity: 0;
      transform: translateY(8px);
      transition: opacity 0.5s ease, transform 0.5s ease;
      pointer-events: none;
    }

    .di-punchline.di-visible {
      opacity: 1;
      transform: translateY(0);
      pointer-events: auto;
    }
  </style>

  
  <div class="di-header">
    <h3>DPO: Learning Without a Reward Model</h3>
    <p>Step through training to see implicit rewards emerge</p>
  </div>

  
  <div class="di-prompt-box">
    <div class="di-prompt-label">Prompt x</div>
    <div class="di-prompt-text">"What is the capital of France?"</div>
  </div>

  
  <div class="di-responses">
    <div class="di-response-card di-preferred">
      <div class="di-response-label">y_w (preferred)</div>
      <div class="di-response-text">"The capital of France is Paris."</div>
    </div>
    <div class="di-response-card di-dispreferred">
      <div class="di-response-label">y_l (dispreferred)</div>
      <div class="di-response-text">"France's capital is Berlin, a beautiful city."</div>
    </div>
  </div>

  
  <div class="di-panels">
    
    <div class="di-panel">
      <div class="di-panel-title">Policy Probabilities</div>

      <div class="di-bar-group">
        <div class="di-bar-group-label">Preferred response (y_w)</div>
        <div class="di-bar-row">
          <span class="di-bar-label">&pi;_ref</span>
          <div class="di-bar-track">
            <div class="di-bar-fill di-bar-ref" id="di-bar-ref-w-869e96d074aa1e4482cf4409ce2006eb" style="width: 30%">
              <span class="di-bar-value">0.15</span>
            </div>
          </div>
        </div>
        <div class="di-bar-row">
          <span class="di-bar-label">&pi;_&theta;</span>
          <div class="di-bar-track">
            <div class="di-bar-fill di-bar-policy-w" id="di-bar-pol-w-869e96d074aa1e4482cf4409ce2006eb" style="width: 30%">
              <span class="di-bar-value" id="di-val-pol-w-869e96d074aa1e4482cf4409ce2006eb">0.15</span>
            </div>
          </div>
        </div>
      </div>

      <div class="di-bar-group">
        <div class="di-bar-group-label">Dispreferred response (y_l)</div>
        <div class="di-bar-row">
          <span class="di-bar-label">&pi;_ref</span>
          <div class="di-bar-track">
            <div class="di-bar-fill di-bar-ref" id="di-bar-ref-l-869e96d074aa1e4482cf4409ce2006eb" style="width: 24%">
              <span class="di-bar-value">0.12</span>
            </div>
          </div>
        </div>
        <div class="di-bar-row">
          <span class="di-bar-label">&pi;_&theta;</span>
          <div class="di-bar-track">
            <div class="di-bar-fill di-bar-policy-l" id="di-bar-pol-l-869e96d074aa1e4482cf4409ce2006eb" style="width: 24%">
              <span class="di-bar-value" id="di-val-pol-l-869e96d074aa1e4482cf4409ce2006eb">0.12</span>
            </div>
          </div>
        </div>
      </div>
    </div>

    
    <div class="di-panel">
      <div class="di-panel-title">Implicit Reward Computation</div>

      <div class="di-comp-line">
        <span class="di-comp-label">r&#770;(x, y_w)</span> = &beta; &times; log(&pi;_&theta;(y_w) / &pi;_ref(y_w))
      </div>
      <div class="di-comp-line">
        &nbsp;&nbsp;= <span id="di-rw-expr-869e96d074aa1e4482cf4409ce2006eb" class="di-comp-val-green">0.10 &times; log(0.15 / 0.15) = 0.000</span>
      </div>

      <div class="di-comp-divider"></div>

      <div class="di-comp-line">
        <span class="di-comp-label">r&#770;(x, y_l)</span> = &beta; &times; log(&pi;_&theta;(y_l) / &pi;_ref(y_l))
      </div>
      <div class="di-comp-line">
        &nbsp;&nbsp;= <span id="di-rl-expr-869e96d074aa1e4482cf4409ce2006eb" class="di-comp-val-red">0.10 &times; log(0.12 / 0.12) = 0.000</span>
      </div>

      <div class="di-comp-divider"></div>

      <div class="di-comp-line">
        <span class="di-comp-label">Reward diff</span> = r&#770;(y_w) &minus; r&#770;(y_l) = <span id="di-rdiff-869e96d074aa1e4482cf4409ce2006eb" class="di-comp-val-accent">0.000</span>
      </div>

      <div class="di-comp-line">
        <span class="di-comp-label">Loss</span> = &minus;log &sigma;(diff) = <span id="di-loss-869e96d074aa1e4482cf4409ce2006eb" class="di-comp-val-yellow">0.693</span>
      </div>

      
      <div class="di-sparkline-container">
        <div class="di-sparkline-label">Loss over training</div>
        <svg class="di-sparkline-svg" id="di-sparkline-869e96d074aa1e4482cf4409ce2006eb" viewBox="0 0 200 48" preserveAspectRatio="xMidYMid meet">
          
          <line x1="20" y1="6" x2="20" y2="42" class="di-sparkline-grid" />
          <line x1="20" y1="42" x2="190" y2="42" class="di-sparkline-grid" />
          
        </svg>
      </div>
    </div>
  </div>

  
  <div class="di-controls">
    <button class="di-btn" id="di-reset-869e96d074aa1e4482cf4409ce2006eb">Reset</button>
    <span class="di-step-indicator" id="di-step-label-869e96d074aa1e4482cf4409ce2006eb">Step 0 / 4</span>
    <button class="di-btn di-btn-primary" id="di-forward-869e96d074aa1e4482cf4409ce2006eb">Step Forward</button>
    <div class="di-slider-group">
      <span class="di-slider-label">&beta;</span>
      <input type="range" class="di-slider" id="di-beta-869e96d074aa1e4482cf4409ce2006eb" min="0.05" max="1.0" step="0.05" value="0.10" />
      <span class="di-slider-value" id="di-beta-val-869e96d074aa1e4482cf4409ce2006eb">0.10</span>
    </div>
  </div>

  
  <div class="di-punchline" id="di-punchline-869e96d074aa1e4482cf4409ce2006eb">
    The reward model was never eliminated &mdash; it was absorbed into the policy itself.
  </div>

  <script>
  (function() {
    const uid = '869e96d074aa1e4482cf4409ce2006eb';

    
    const el = (id) => document.getElementById(id + '-' + uid);
    const barPolW   = el('di-bar-pol-w');
    const barPolL   = el('di-bar-pol-l');
    const valPolW   = el('di-val-pol-w');
    const valPolL   = el('di-val-pol-l');
    const rwExpr    = el('di-rw-expr');
    const rlExpr    = el('di-rl-expr');
    const rdiff     = el('di-rdiff');
    const lossEl    = el('di-loss');
    const stepLabel = el('di-step-label');
    const btnFwd    = el('di-forward');
    const btnReset  = el('di-reset');
    const betaSlider = el('di-beta');
    const betaVal   = el('di-beta-val');
    const sparkSvg  = el('di-sparkline');
    const punchline = el('di-punchline');

    
    const refW = 0.15;
    const refL = 0.12;

    const steps = [
      { polW: 0.15, polL: 0.12 },  
      { polW: 0.20, polL: 0.08 },  
      { polW: 0.27, polL: 0.05 },  
      { polW: 0.33, polL: 0.03 },  
      { polW: 0.38, polL: 0.02 },  
    ];

    let currentStep = 0;
    let beta = 0.10;

    
    function sigmoid(x) {
      return 1.0 / (1.0 + Math.exp(-x));
    }

    function computeReward(polProb, refProb) {
      return beta * Math.log(polProb / refProb);
    }

    function computeLoss(rw, rl) {
      return -Math.log(sigmoid(rw - rl));
    }

    function fmt(v, d) {
      d = d || 3;
      return v.toFixed(d);
    }

    
    function probToWidth(p) {
      return Math.max(7, (p / 0.50) * 100);
    }

    
    function render() {
      const s = steps[currentStep];
      const pw = s.polW;
      const pl = s.polL;

      
      barPolW.style.width = probToWidth(pw) + '%';
      barPolL.style.width = probToWidth(pl) + '%';
      valPolW.textContent = fmt(pw, 2);
      valPolL.textContent = fmt(pl, 2);

      
      const rw = computeReward(pw, refW);
      const rl = computeReward(pl, refL);
      const diff = rw - rl;
      const loss = computeLoss(rw, rl);

      
      rwExpr.innerHTML = fmt(beta, 2) + ' &times; log(' + fmt(pw, 2) + ' / ' + fmt(refW, 2) + ') = <strong>' + fmt(rw) + '</strong>';
      rlExpr.innerHTML = fmt(beta, 2) + ' &times; log(' + fmt(pl, 2) + ' / ' + fmt(refL, 2) + ') = <strong>' + fmt(rl) + '</strong>';
      rdiff.innerHTML = '<strong>' + fmt(diff) + '</strong>';
      lossEl.innerHTML = '<strong>' + fmt(loss) + '</strong>';

      
      stepLabel.textContent = 'Step ' + currentStep + ' / 4';

      
      btnFwd.disabled = (currentStep >= 4);

      
      if (currentStep === 4) {
        punchline.classList.add('di-visible');
      } else {
        punchline.classList.remove('di-visible');
      }

      
      renderSparkline();
    }

    
    function renderSparkline() {
      
      const losses = [];
      for (let i = 0; i <= 4; i++) {
        const s = steps[i];
        const rw = computeReward(s.polW, refW);
        const rl = computeReward(s.polL, refL);
        losses.push(computeLoss(rw, rl));
      }

      const maxLoss = Math.max(...losses) * 1.1;
      const minLoss = 0;

      
      const padL = 24, padR = 10, padT = 6, padB = 6;
      const w = 200 - padL - padR;
      const h = 48 - padT - padB;

      function sx(i) { return padL + (i / 4) * w; }
      function sy(v) { return padT + h - ((v - minLoss) / (maxLoss - minLoss)) * h; }

      
      let svg = '';

      
      svg += '<line x1="' + padL + '" y1="' + padT + '" x2="' + padL + '" y2="' + (padT + h) + '" class="di-sparkline-grid" />';
      svg += '<line x1="' + padL + '" y1="' + (padT + h) + '" x2="' + (padL + w) + '" y2="' + (padT + h) + '" class="di-sparkline-grid" />';

      
      let fullPath = 'M';
      for (let i = 0; i <= 4; i++) {
        if (i > 0) fullPath += ' L';
        fullPath += ' ' + fmt(sx(i), 1) + ' ' + fmt(sy(losses[i]), 1);
      }
      svg += '<path d="' + fullPath + '" fill="none" stroke="var(--di-border)" stroke-width="1.5" stroke-dasharray="3,3" />';

      
      if (currentStep > 0) {
        let activePath = 'M';
        for (let i = 0; i <= currentStep; i++) {
          if (i > 0) activePath += ' L';
          activePath += ' ' + fmt(sx(i), 1) + ' ' + fmt(sy(losses[i]), 1);
        }
        svg += '<path d="' + activePath + '" class="di-sparkline-line" />';
      }

      
      for (let i = 0; i <= 4; i++) {
        const cx = fmt(sx(i), 1);
        const cy = fmt(sy(losses[i]), 1);
        if (i <= currentStep) {
          const r = (i === currentStep) ? '4' : '2.5';
          svg += '<circle cx="' + cx + '" cy="' + cy + '" r="' + r + '" class="di-sparkline-dot" />';
        } else {
          svg += '<circle cx="' + cx + '" cy="' + cy + '" r="2" class="di-sparkline-dot-inactive" />';
        }
      }

      
      if (currentStep > 0) {
        const currLoss = losses[currentStep];
        svg += '<text x="' + fmt(sx(currentStep), 1) + '" y="' + fmt(sy(currLoss) - 6, 1) + '" text-anchor="middle" font-family="IBM Plex Mono, monospace" font-size="8" fill="var(--di-accent)">' + fmt(currLoss) + '</text>';
      }

      sparkSvg.innerHTML = svg;
    }

    
    btnFwd.addEventListener('click', function() {
      if (currentStep < 4) {
        currentStep++;
        render();
      }
    });

    btnReset.addEventListener('click', function() {
      currentStep = 0;
      render();
    });

    betaSlider.addEventListener('input', function() {
      beta = parseFloat(this.value);
      betaVal.textContent = fmt(beta, 2);
      render();
    });

    
    render();
  })();
  </script>
</div>
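<p>The arithmetic the interactive above steps through fits in a few lines. What follows is a minimal sketch in plain Python rather than PyTorch; the function name <code>dpo_loss</code> is hypothetical, and the probabilities are the illustrative values from the figure, not measurements from a real model.</p>

```python
import math

def dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards: beta times the log-ratio of policy to reference.
    reward_w = beta * (pol_logp_w - ref_logp_w)
    reward_l = beta * (pol_logp_l - ref_logp_l)
    margin = reward_w - reward_l
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Illustrative reference probabilities for the chosen (w) and rejected (l) response.
ref_w, ref_l = math.log(0.15), math.log(0.12)

# Step 0: the policy equals the reference, so both implicit rewards are zero.
start_loss = dpo_loss(ref_w, ref_l, ref_w, ref_l)
# Final step: the policy has moved mass toward the chosen response.
final_loss = dpo_loss(math.log(0.38), math.log(0.02), ref_w, ref_l)

print(round(start_loss, 3), round(final_loss, 3))
```

<p>When the policy matches the reference the loss is exactly log 2, and it falls as the gap between the two implicit rewards widens. A real training loop would compute the four log-probabilities by summing token log-probabilities over each response and would average the loss over a batch.</p>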

<h3 id="where-dpo-shines-and-where-it-falls-short">Where DPO Shines and Where It Falls Short</h3>
<p>DPO&rsquo;s practical advantages are substantial. Training requires only two models (policy and reference), not four. The implementation is roughly 20 lines of PyTorch on top of a standard language modeling pipeline. HuggingFace&rsquo;s TRL library provides a <code>DPOTrainer</code> that handles the details. Major models adopted it quickly: Llama 3, Zephyr-beta, and Tulu 2 all used DPO in their alignment pipelines. DPO democratized alignment research. Any lab with a GPU and preference data could train an aligned model.</p>
<p>But DPO has limitations that become apparent at scale. The most fundamental is its <strong>offline nature</strong>: DPO trains on a fixed dataset of preference pairs, with no mechanism for the model to explore and discover new behaviors. As training progresses, the policy drifts from the distribution that generated the training data, but the training data cannot adapt. This is particularly problematic for tasks where the model needs to discover novel reasoning strategies.</p>
<p>Xu et al. (ICML 2024) conducted a systematic comparison and found that PPO consistently surpasses DPO across all tested benchmarks when properly tuned, especially on challenging code generation tasks (on CodeContest, PPO-34B achieved 22.4% while DPO-34B scored significantly lower). The gap widens on tasks that require exploration and long-horizon reasoning.</p>
<p>There is also a subtler issue: DPO assumes the Bradley-Terry preference model perfectly fits the data. Bradley-Terry assigns each response a single scalar score, so it can only express transitive preferences. Real human preferences can be intransitive (A &gt; B, B &gt; C, but C &gt; A), context-dependent, and noisy. When this assumption breaks down, DPO&rsquo;s loss function can produce misleading gradients.</p>
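<p>To see why an intransitive cycle is out of reach for Bradley-Terry, here is a small brute-force check. It is only a sketch: the score grid is an arbitrary choice, and <code>bt_prob</code> and <code>is_cycle</code> are hypothetical helper names.</p>

```python
import math
from itertools import product

def bt_prob(score_i, score_j):
    """Bradley-Terry: P(i preferred over j) = sigmoid(score_i - score_j)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

def is_cycle(a, b, c):
    # A beats B, B beats C, and C beats A, each with probability above 1/2.
    return bt_prob(a, b) > 0.5 and bt_prob(b, c) > 0.5 and bt_prob(c, a) > 0.5

# Candidate scalar scores from -3.0 to 3.0 in steps of 0.5 (arbitrary grid).
grid = [x / 2 for x in range(-6, 7)]
cycles = [scores for scores in product(grid, repeat=3) if is_cycle(*scores)]

print(len(cycles))
```

<p>No assignment of scores produces the cycle, because a preference probability above one half requires a strictly higher score, and three scores cannot each be strictly higher than the next in a loop. Data that contains such cycles can only be fitted approximately.</p>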
<p>DPO traded RL&rsquo;s complexity for supervised learning&rsquo;s simplicity. The next technique we&rsquo;ll examine takes a different path: keep the online RL loop, but find a cheaper way to run it.</p>
<h2 id="grpo-grading-responses-on-a-curve">GRPO: Grading Responses on a Curve</h2>
<h3 id="the-insight-eliminate-the-critic-keep-the-rl-loop">The Insight: Eliminate the Critic, Keep the RL Loop</h3>
<p>DPO eliminated RL entirely but lost online exploration. GRPO takes a different approach: retain the online RL loop (the model generates responses, gets feedback, and updates) but eliminate the <em>critic network</em>, which is the most expensive component of PPO after the reward model.</p>
<p>Recall that PPO needs a critic $V_\psi(s)$ to compute advantages $\hat{A}_t$, estimating how much better each token was compared to baseline expectations. This critic is a full-sized neural network with its own gradient computation and optimizer states. GRPO&rsquo;s key observation: instead of learning this baseline, we can estimate it empirically by sampling multiple responses to the same prompt and comparing them to each other.</p>
<h3 id="the-mechanism-sample-score-normalize">The Mechanism: Sample, Score, Normalize</h3>
<p>For each prompt $q$, GRPO samples $G$ completions $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_\theta$. Each completion is scored by a reward function, producing rewards $\{r_1, r_2, \ldots, r_G\}$. The advantage for each completion is computed via z-score normalization:</p>
$$\hat{A}_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)}$$<p>This is &ldquo;grading on a curve.&rdquo; Instead of evaluating each response against an absolute rubric (the learned critic), we evaluate it relative to its peers. A response that scores 0.8 when all other responses also score around 0.8 gets a near-zero advantage because it was average for this prompt. The same score of 0.8 when peers score around 0.3 earns a strongly positive advantage because it was exceptional.</p>
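<p>The two cases just described can be reproduced with a short sketch. The helper name <code>group_advantages</code> and the reward values are illustrative assumptions, as is the small <code>eps</code> guard. The population standard deviation (dividing by G) is also a choice made here; some implementations use the sample standard deviation instead.</p>

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Z-score each reward against its own group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std: divides by G, not G - 1
    # eps avoids division by zero when every reward in the group is identical;
    # it is an assumption of this sketch, not part of the formula above.
    return [(r - mu) / (sigma + eps) for r in rewards]

# The same score of 0.8, judged against two different peer groups.
average_case = group_advantages([0.8, 0.8, 0.79, 0.81])   # peers also near 0.8
standout_case = group_advantages([0.8, 0.3, 0.3, 0.3])    # peers near 0.3

print(round(average_case[0], 2), round(standout_case[0], 2))
```

<p>In the first group the 0.8 response lands on the mean and receives an advantage of essentially zero. In the second it sits well above its peers and receives a strongly positive advantage, while the three 0.3 responses share an equal negative one.</p>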
<p>The group mean serves as an empirical Monte Carlo estimate of the expected reward for this prompt, playing the same role as the learned value function $V(s)$ in PPO. More samples mean a better estimate. In practice, $G = 8$ to $G = 64$ provides sufficient accuracy without excessive compute.</p>
<p>The standard deviation in the denominator does something subtle but important. It acts as a <strong>per-prompt scale normalizer</strong>. After z-scoring, the advantages within a group have zero mean and unit variance whenever the rewards are not all identical, whatever the raw reward range. For prompts where rewards are tightly clustered (low variance), small differences between completions are amplified into advantages of order one. For prompts where rewards are spread widely, large differences are compressed to the same scale. Each prompt therefore contributes an update of comparable magnitude, which provides automatic per-prompt learning rate adaptation without any additional hyperparameters. The one degenerate case is a group whose rewards are all identical: the std is zero, so the advantage is undefined and the group carries no learning signal.</p>
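<p>The effect of dividing by the standard deviation is easiest to see by comparing it with mean subtraction alone. The sketch below uses made-up rewards on two hypothetical scales; <code>zscore</code> and <code>mean_only</code> are illustrative helper names, and <code>eps</code> is an assumed guard against a zero denominator.</p>

```python
from statistics import fmean, pstdev

def mean_only(rewards):
    """Baseline subtraction without any scaling."""
    mu = fmean(rewards)
    return [r - mu for r in rewards]

def zscore(rewards, eps=1e-8):
    """Baseline subtraction followed by division by the group std."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

binary = [1.0, 0.0, 1.0, 0.0]          # a 0/1 correctness reward
scaled = [100.0 * r for r in binary]   # same outcomes on a hypothetical 0-100 scale

print(mean_only(binary), mean_only(scaled))   # magnitudes differ by a factor of 100
print(zscore(binary), zscore(scaled))         # magnitudes agree up to eps
```

<p>With mean subtraction alone, the second reward function would produce updates a hundred times larger for exactly the same outcomes. After z-scoring, both groups yield advantages close to plus or minus one, so the reward scale no longer acts as a hidden learning rate.</p>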
<div class="gs-group-sampling" id="gs-869e96d074aa1e4482cf4409ce2006eb">
  <style>
    .gs-group-sampling {
      --gs-bg: #0d1117;
      --gs-surface: #161b22;
      --gs-border: #30363d;
      --gs-text: #e6edf3;
      --gs-text-muted: #8b949e;
      --gs-accent: #a371f7;
      --gs-accent-dim: rgba(163, 113, 247, 0.15);
      --gs-green: #39d353;
      --gs-green-dim: rgba(57, 211, 83, 0.08);
      --gs-red: #f97583;
      --gs-red-dim: rgba(249, 117, 131, 0.08);
      --gs-mean-line: #d29922;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--gs-bg);
      color: var(--gs-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

    [data-theme="light"] .gs-group-sampling,
    :root:not([data-theme="dark"]) .gs-group-sampling {
      --gs-bg: #f8fafc;
      --gs-surface: #ffffff;
      --gs-border: #e2e8f0;
      --gs-text: #1e293b;
      --gs-text-muted: #64748b;
      --gs-accent: #8b5cf6;
      --gs-accent-dim: rgba(139, 92, 246, 0.1);
      --gs-green: #10b981;
      --gs-green-dim: rgba(16, 185, 129, 0.08);
      --gs-red: #ef4444;
      --gs-red-dim: rgba(239, 68, 68, 0.06);
      --gs-mean-line: #d97706;
    }

    .gs-group-sampling * {
      box-sizing: border-box;
    }

     
    .gs-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .gs-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--gs-accent);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .gs-header p {
      color: var(--gs-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .gs-prompt-box {
      background: var(--gs-surface);
      border: 1px solid var(--gs-accent);
      border-radius: 8px;
      padding: 1rem 1.25rem;
      margin-bottom: 1.25rem;
      display: flex;
      align-items: center;
      gap: 0.75rem;
    }

    .gs-prompt-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--gs-accent);
      text-transform: uppercase;
      letter-spacing: 0.06em;
      white-space: nowrap;
      background: var(--gs-accent-dim);
      padding: 0.2rem 0.5rem;
      border-radius: 4px;
    }

    .gs-prompt-text {
      font-size: 1.05rem;
      font-weight: 600;
      color: var(--gs-text);
    }

     
    .gs-phase-bar {
      display: flex;
      align-items: center;
      gap: 0.5rem;
      margin-bottom: 1.25rem;
      font-size: 0.8rem;
    }

    .gs-phase-step {
      display: flex;
      align-items: center;
      gap: 0.35rem;
      padding: 0.3rem 0.7rem;
      border-radius: 20px;
      font-weight: 600;
      transition: all 0.4s ease;
      background: var(--gs-surface);
      border: 1px solid var(--gs-border);
      color: var(--gs-text-muted);
    }

    .gs-phase-step.gs-active {
      background: var(--gs-accent-dim);
      border-color: var(--gs-accent);
      color: var(--gs-accent);
    }

    .gs-phase-arrow {
      color: var(--gs-text-muted);
      font-size: 0.75rem;
    }

     
    .gs-cards-container {
      position: relative;
      min-height: 200px;
    }

    .gs-cards-grid {
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 0.75rem;
      transition: opacity 0.4s ease;
    }

    @media (max-width: 600px) {
      .gs-cards-grid {
        grid-template-columns: 1fr;
      }
    }

    .gs-card {
      background: var(--gs-surface);
      border: 1px solid var(--gs-border);
      border-radius: 8px;
      padding: 0.85rem 1rem;
      display: flex;
      justify-content: space-between;
      align-items: center;
      transition: all 0.5s cubic-bezier(0.4, 0, 0.2, 1);
      opacity: 0;
      transform: translateY(12px);
    }

    .gs-card.gs-visible {
      opacity: 1;
      transform: translateY(0);
    }

    .gs-card.gs-correct {
      background: var(--gs-green-dim);
      border-color: var(--gs-green);
    }

    .gs-card.gs-incorrect {
      background: var(--gs-red-dim);
      border-color: var(--gs-red);
    }

    .gs-card-left {
      display: flex;
      align-items: center;
      gap: 0.6rem;
      min-width: 0;
      flex: 1;
    }

    .gs-card-index {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--gs-text-muted);
      background: var(--gs-bg);
      width: 22px;
      height: 22px;
      display: flex;
      align-items: center;
      justify-content: center;
      border-radius: 4px;
      flex-shrink: 0;
    }

    .gs-card-response {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.82rem;
      color: var(--gs-text);
      white-space: nowrap;
      overflow: hidden;
      text-overflow: ellipsis;
    }

    .gs-card-right {
      display: flex;
      align-items: center;
      flex-shrink: 0;
      gap: 0.4rem;
    }

    .gs-card-reward {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.8rem;
      font-weight: 700;
      padding: 0.15rem 0.5rem;
      border-radius: 4px;
      flex-shrink: 0;
    }

    .gs-card.gs-correct .gs-card-reward {
      color: var(--gs-green);
      background: rgba(57, 211, 83, 0.12);
    }

    .gs-card.gs-incorrect .gs-card-reward {
      color: var(--gs-red);
      background: rgba(249, 117, 131, 0.12);
    }

     
    .gs-card-advantage {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.72rem;
      color: var(--gs-text-muted);
      flex-shrink: 0;
      max-width: 0;
      overflow: hidden;
      opacity: 0;
      transition: max-width 0.4s ease, opacity 0.4s ease;
      white-space: nowrap;
    }

    .gs-card-advantage.gs-visible {
      max-width: 120px;
      opacity: 1;
    }

     
    .gs-norm-container {
      margin-top: 1.5rem;
      max-height: 0;
      overflow: hidden;
      opacity: 0;
      transition: max-height 0.6s cubic-bezier(0.4, 0, 0.2, 1),
                  opacity 0.5s ease 0.1s;
    }

    .gs-norm-container.gs-visible {
      max-height: 2000px;
      opacity: 1;
    }

    .gs-norm-chart {
      position: relative;
      background: var(--gs-surface);
      border: 1px solid var(--gs-border);
      border-radius: 10px;
      padding: 1.5rem 1.5rem 1.5rem 3.5rem;
      min-height: 320px;
      overflow: visible;
    }

    .gs-norm-axis {
      position: absolute;
      left: 3rem;
      top: 1.5rem;
      bottom: 1.5rem;
      width: 2px;
      background: var(--gs-border);
    }

    .gs-norm-tick {
      position: absolute;
      left: 0.5rem;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      color: var(--gs-text-muted);
      transform: translateY(-50%);
      text-align: right;
      width: 2rem;
    }

    .gs-norm-mean-line {
      position: absolute;
      left: 3rem;
      right: 1.5rem;
      height: 2px;
      background: var(--gs-mean-line);
      z-index: 2;
    }

    .gs-norm-mean-label {
      position: absolute;
      right: 0;
      top: -20px;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--gs-mean-line);
      white-space: nowrap;
    }

    .gs-norm-std-band {
      position: absolute;
      left: 3rem;
      right: 1.5rem;
      background: var(--gs-accent-dim);
      border-top: 1px dashed var(--gs-accent);
      border-bottom: 1px dashed var(--gs-accent);
      z-index: 1;
      opacity: 0;
      transition: opacity 0.6s ease;
    }

    .gs-norm-std-band.gs-visible {
      opacity: 1;
    }

    .gs-norm-std-label {
      position: absolute;
      right: 4px;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.6rem;
      color: var(--gs-accent);
      opacity: 0.8;
    }

    .gs-norm-std-label.gs-top { top: 2px; }
    .gs-norm-std-label.gs-bottom { bottom: 2px; }

    .gs-norm-bar {
      position: absolute;
      left: 3.5rem;
      right: 2rem;
      height: 30px;
      border-radius: 6px;
      display: flex;
      align-items: center;
      padding: 0 0.75rem;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      transition: all 0.7s cubic-bezier(0.4, 0, 0.2, 1);
      z-index: 3;
      cursor: default;
      gap: 0.5rem;
      opacity: 0;
      transform: translateX(-10px);
    }

    .gs-norm-bar.gs-visible {
      opacity: 1;
      transform: translateX(0);
    }

    .gs-norm-bar.gs-positive {
      background: var(--gs-green-dim);
      border: 1px solid var(--gs-green);
      color: var(--gs-green);
    }

    .gs-norm-bar.gs-negative {
      background: var(--gs-red-dim);
      border: 1px solid var(--gs-red);
      color: var(--gs-red);
    }

    .gs-norm-bar-text {
      flex: 1;
      white-space: nowrap;
      overflow: hidden;
      text-overflow: ellipsis;
      font-size: 0.72rem;
    }

    .gs-norm-bar-adv {
      font-weight: 700;
      flex-shrink: 0;
      font-size: 0.72rem;
    }

     
    .gs-stats {
      display: grid;
      grid-template-columns: repeat(3, 1fr);
      gap: 0.75rem;
      margin-top: 1.25rem;
      opacity: 0;
      transition: opacity 0.5s ease;
    }

    .gs-stats.gs-visible {
      opacity: 1;
    }

    .gs-stat-box {
      background: var(--gs-surface);
      border: 1px solid var(--gs-border);
      border-radius: 8px;
      padding: 0.75rem;
      text-align: center;
    }

    .gs-stat-label {
      font-size: 0.65rem;
      font-weight: 600;
      text-transform: uppercase;
      letter-spacing: 0.06em;
      color: var(--gs-text-muted);
      margin-bottom: 0.25rem;
    }

    .gs-stat-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1.2rem;
      font-weight: 700;
      color: var(--gs-accent);
    }

    @media (max-width: 600px) {
      .gs-stats {
        grid-template-columns: 1fr;
      }
    }

     
    .gs-controls {
      display: flex;
      flex-wrap: wrap;
      align-items: center;
      gap: 0.75rem;
      margin-top: 1.25rem;
      padding-top: 1.25rem;
      border-top: 1px solid var(--gs-border);
    }

    .gs-btn {
      font-family: 'IBM Plex Sans', sans-serif;
      font-size: 0.82rem;
      font-weight: 600;
      padding: 0.5rem 1.1rem;
      border-radius: 6px;
      border: none;
      cursor: pointer;
      transition: all 0.2s ease;
      white-space: nowrap;
    }

    .gs-btn:active {
      transform: scale(0.97);
    }

    .gs-btn-primary {
      background: var(--gs-accent);
      color: #fff;
    }

    .gs-btn-primary:hover {
      filter: brightness(1.1);
    }

    .gs-btn-primary:disabled {
      opacity: 0.5;
      cursor: not-allowed;
      filter: none;
    }

    .gs-btn-secondary {
      background: var(--gs-surface);
      color: var(--gs-text);
      border: 1px solid var(--gs-border);
    }

    .gs-btn-secondary:hover {
      border-color: var(--gs-accent);
      color: var(--gs-accent);
    }

    .gs-slider-group {
      display: flex;
      align-items: center;
      gap: 0.5rem;
    }

    .gs-slider-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 600;
      color: var(--gs-text-muted);
    }

    .gs-slider {
      -webkit-appearance: none;
      appearance: none;
      width: 100px;
      height: 4px;
      border-radius: 2px;
      background: var(--gs-border);
      outline: none;
    }

    .gs-slider::-webkit-slider-thumb {
      -webkit-appearance: none;
      appearance: none;
      width: 16px;
      height: 16px;
      border-radius: 50%;
      background: var(--gs-accent);
      cursor: pointer;
      border: 2px solid var(--gs-bg);
    }

    .gs-slider::-moz-range-thumb {
      width: 16px;
      height: 16px;
      border-radius: 50%;
      background: var(--gs-accent);
      cursor: pointer;
      border: 2px solid var(--gs-bg);
    }

    .gs-slider-val {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.8rem;
      font-weight: 700;
      color: var(--gs-accent);
      min-width: 1.5rem;
      text-align: center;
    }

    .gs-presets {
      display: flex;
      gap: 0.4rem;
      margin-left: auto;
    }

    .gs-preset-btn {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      padding: 0.3rem 0.6rem;
      border-radius: 4px;
      border: 1px solid var(--gs-border);
      background: var(--gs-surface);
      color: var(--gs-text-muted);
      cursor: pointer;
      transition: all 0.2s ease;
      white-space: nowrap;
    }

    .gs-preset-btn:hover {
      border-color: var(--gs-accent);
      color: var(--gs-accent);
    }

    @media (max-width: 600px) {
      .gs-controls {
        flex-direction: column;
        align-items: stretch;
      }
      .gs-presets {
        margin-left: 0;
        justify-content: center;
      }
      .gs-slider-group {
        justify-content: center;
      }
      .gs-btn {
        text-align: center;
      }
    }

     
    .gs-toast {
      margin-top: 0.75rem;
      padding: 0.6rem 1rem;
      background: var(--gs-accent-dim);
      border: 1px solid var(--gs-accent);
      border-radius: 6px;
      font-size: 0.78rem;
      color: var(--gs-text);
      opacity: 0;
      transition: opacity 0.3s ease;
      display: none;
    }

    .gs-toast.gs-visible {
      display: block;
      opacity: 1;
    }

     
    .gs-note {
      margin-top: 1.25rem;
      background: var(--gs-accent-dim);
      border: 1px solid var(--gs-accent);
      border-radius: 8px;
      padding: 1rem 1.25rem;
      font-size: 0.82rem;
      line-height: 1.7;
      color: var(--gs-text);
      max-height: 0;
      overflow: hidden;
      opacity: 0;
      transition: max-height 0.5s ease, opacity 0.5s ease, padding 0.5s ease, margin 0.5s ease;
      padding-top: 0;
      padding-bottom: 0;
      margin-top: 0;
    }

    .gs-note.gs-visible {
      max-height: 300px;
      opacity: 1;
      padding: 1rem 1.25rem;
      margin-top: 1.25rem;
    }

    .gs-note-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 700;
      text-transform: uppercase;
      letter-spacing: 0.06em;
      color: var(--gs-accent);
      margin-bottom: 0.5rem;
    }

    .gs-note em {
      color: var(--gs-accent);
      font-style: normal;
      font-weight: 600;
    }
  </style>

  
  <div class="gs-header">
    <h3>GRPO: Group Relative Policy Optimization</h3>
    <p>Sample, score, and normalize advantages within a group</p>
  </div>

  
  <div class="gs-prompt-box">
    <span class="gs-prompt-label">Prompt</span>
    <span class="gs-prompt-text">What is 7 &times; 8?</span>
  </div>

  
  <div class="gs-phase-bar">
    <span class="gs-phase-step gs-active" id="gs-phase1-869e96d074aa1e4482cf4409ce2006eb">1 &middot; Sample &amp; Score</span>
    <span class="gs-phase-arrow">&rarr;</span>
    <span class="gs-phase-step" id="gs-phase2-869e96d074aa1e4482cf4409ce2006eb">2 &middot; Normalize Advantages</span>
  </div>

  
  <div class="gs-cards-container" id="gs-cards-container-869e96d074aa1e4482cf4409ce2006eb">
    <div class="gs-cards-grid" id="gs-cards-grid-869e96d074aa1e4482cf4409ce2006eb">
      
    </div>
  </div>

  
  <div class="gs-norm-container" id="gs-norm-869e96d074aa1e4482cf4409ce2006eb">
    <div class="gs-norm-chart" id="gs-norm-chart-869e96d074aa1e4482cf4409ce2006eb">
      <div class="gs-norm-axis"></div>
      
    </div>

    <div class="gs-stats" id="gs-stats-869e96d074aa1e4482cf4409ce2006eb">
      <div class="gs-stat-box">
        <div class="gs-stat-label">Group Mean (&mu;)</div>
        <div class="gs-stat-value" id="gs-stat-mean-869e96d074aa1e4482cf4409ce2006eb">--</div>
      </div>
      <div class="gs-stat-box">
        <div class="gs-stat-label">Std Dev (&sigma;)</div>
        <div class="gs-stat-value" id="gs-stat-std-869e96d074aa1e4482cf4409ce2006eb">--</div>
      </div>
      <div class="gs-stat-box">
        <div class="gs-stat-label">Group Size (G)</div>
        <div class="gs-stat-value" id="gs-stat-g-869e96d074aa1e4482cf4409ce2006eb">--</div>
      </div>
    </div>
  </div>

  
  <div class="gs-note" id="gs-note-869e96d074aa1e4482cf4409ce2006eb">
    <div class="gs-note-title">Key Insight</div>
    The group mean acts as a <em>baseline</em>, replacing PPO's expensive critic network. Responses above the mean get <em>reinforced</em>; those below get <em>suppressed</em>. The standard deviation normalizes the scale, providing <em>automatic per-prompt learning rate adaptation</em>.
  </div>

  
  <div class="gs-controls">
    <button class="gs-btn gs-btn-primary" id="gs-compute-btn-869e96d074aa1e4482cf4409ce2006eb">Compute Advantages</button>
    <button class="gs-btn gs-btn-secondary" id="gs-regen-btn-869e96d074aa1e4482cf4409ce2006eb">Regenerate</button>

    <div class="gs-slider-group">
      <span class="gs-slider-label">G =</span>
      <input type="range" class="gs-slider" id="gs-g-slider-869e96d074aa1e4482cf4409ce2006eb" min="4" max="16" value="8" step="1">
      <span class="gs-slider-val" id="gs-g-val-869e96d074aa1e4482cf4409ce2006eb">8</span>
    </div>

    <div class="gs-presets">
      <button class="gs-preset-btn" id="gs-preset-r1-869e96d074aa1e4482cf4409ce2006eb">DeepSeek-R1 (G=16)</button>
      <button class="gs-preset-btn" id="gs-preset-math-869e96d074aa1e4482cf4409ce2006eb">DeepSeekMath (G=64)</button>
    </div>
  </div>

  
  <div class="gs-toast" id="gs-toast-869e96d074aa1e4482cf4409ce2006eb"></div>

  <script>
  (function() {
    var uid = '869e96d074aa1e4482cf4409ce2006eb';

     
    var el = {
      root:       document.getElementById('gs-' + uid),
      grid:       document.getElementById('gs-cards-grid-' + uid),
      container:  document.getElementById('gs-cards-container-' + uid),
      normWrap:   document.getElementById('gs-norm-' + uid),
      normChart:  document.getElementById('gs-norm-chart-' + uid),
      stats:      document.getElementById('gs-stats-' + uid),
      statMean:   document.getElementById('gs-stat-mean-' + uid),
      statStd:    document.getElementById('gs-stat-std-' + uid),
      statG:      document.getElementById('gs-stat-g-' + uid),
      note:       document.getElementById('gs-note-' + uid),
      computeBtn: document.getElementById('gs-compute-btn-' + uid),
      regenBtn:   document.getElementById('gs-regen-btn-' + uid),
      slider:     document.getElementById('gs-g-slider-' + uid),
      sliderVal:  document.getElementById('gs-g-val-' + uid),
      presetR1:   document.getElementById('gs-preset-r1-' + uid),
      presetMath: document.getElementById('gs-preset-math-' + uid),
      phase1:     document.getElementById('gs-phase1-' + uid),
      phase2:     document.getElementById('gs-phase2-' + uid),
      toast:      document.getElementById('gs-toast-' + uid)
    };

     
    var correctTexts = [
      '7 \u00d7 8 = 56 \u2713',
      '56 \u2713',
      'The answer is 56 \u2713',
      'Let me compute: 7 \u00d7 8 = 56 \u2713',
      '7 * 8 = 56 \u2713',
      'Calculating... 56 \u2713',
      'Seven times eight is 56 \u2713',
      '7(8) = 56 \u2713',
      '7 \u00d7 8 = 56. Done. \u2713',
      'Easy: 56 \u2713'
    ];

    var incorrectTexts = [
      '7 \u00d7 8 = 54 \u2717',
      '48 \u2717',
      'Let me think... 7 \u00d7 8 = 58 \u2717',
      'The answer is 63 \u2717',
      '7 * 8 = 42 \u2717',
      'Hmm, 7 \u00d7 8 = 52 \u2717',
      'I believe it is 64 \u2717',
      '7 \u00d7 8 = 46 \u2717',
      'That would be 55 \u2717',
      'Calculating... 49 \u2717'
    ];

     
    var G = 8;
    var responses = [];
    var phase = 1;
    var toastTimer = null;

     
    function shuffle(arr) {
      var a = arr.slice();
      for (var i = a.length - 1; i > 0; i--) {
        var j = Math.floor(Math.random() * (i + 1));
        var tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
      }
      return a;
    }

    function pickRandom(arr, n) {
      return shuffle(arr).slice(0, n);
    }

    function escapeHTML(s) {
      var d = document.createElement('div');
      d.textContent = s;
      return d.innerHTML;
    }

    function showToast(msg) {
      if (toastTimer) clearTimeout(toastTimer);
      el.toast.textContent = msg;
      el.toast.style.display = 'block';
      requestAnimationFrame(function() {
        el.toast.classList.add('gs-visible');
      });
      toastTimer = setTimeout(function() {
        el.toast.classList.remove('gs-visible');
        setTimeout(function() { el.toast.style.display = 'none'; }, 300);
      }, 4000);
    }

     
    function generateResponses() {
       
      var numCorrect = Math.round(G * 0.6) + (Math.random() < 0.5 ? 0 : (Math.random() < 0.5 ? 1 : -1));
      var nc = Math.max(1, Math.min(G - 1, numCorrect));
      var ni = G - nc;

      var correct = pickRandom(correctTexts, Math.min(nc, correctTexts.length));
      var incorrect = pickRandom(incorrectTexts, Math.min(ni, incorrectTexts.length));

       
      while (correct.length < nc) {
        correct.push(correctTexts[correct.length % correctTexts.length]);
      }
      while (incorrect.length < ni) {
        incorrect.push(incorrectTexts[incorrect.length % incorrectTexts.length]);
      }

      var all = [];
      for (var i = 0; i < correct.length; i++) {
        all.push({ text: correct[i], reward: 1.0, correct: true });
      }
      for (var j = 0; j < incorrect.length; j++) {
        all.push({ text: incorrect[j], reward: 0.0, correct: false });
      }
      return shuffle(all);
    }

     
    function computeStats(resps) {
      var rewards = [];
      for (var i = 0; i < resps.length; i++) {
        rewards.push(resps[i].reward);
      }
      var sum = 0;
      for (var i = 0; i < rewards.length; i++) sum += rewards[i];
      var mean = sum / rewards.length;

      var varSum = 0;
      for (var i = 0; i < rewards.length; i++) {
        varSum += (rewards[i] - mean) * (rewards[i] - mean);
      }
      var std = Math.sqrt(varSum / rewards.length);

      var advantages = [];
      for (var i = 0; i < rewards.length; i++) {
        advantages.push(std > 0 ? (rewards[i] - mean) / std : 0);
      }
      return { mean: mean, std: std, advantages: advantages };
    }

     
    function renderPhase1() {
      phase = 1;
      responses = generateResponses();

       
      el.grid.innerHTML = '';
      el.grid.style.opacity = '1';
      el.normWrap.classList.remove('gs-visible');
      el.stats.classList.remove('gs-visible');
      el.note.classList.remove('gs-visible');
      el.computeBtn.disabled = false;
      el.computeBtn.textContent = 'Compute Advantages';

      el.phase1.classList.add('gs-active');
      el.phase2.classList.remove('gs-active');

       
      for (var i = 0; i < responses.length; i++) {
        (function(idx) {
          var resp = responses[idx];
          var card = document.createElement('div');
          card.className = 'gs-card ' + (resp.correct ? 'gs-correct' : 'gs-incorrect');
          card.innerHTML =
            '<div class="gs-card-left">' +
              '<span class="gs-card-index">' + (idx + 1) + '</span>' +
              '<span class="gs-card-response">' + escapeHTML(resp.text) + '</span>' +
            '</div>' +
            '<div class="gs-card-right">' +
              '<span class="gs-card-reward">r = ' + resp.reward.toFixed(1) + '</span>' +
              '<span class="gs-card-advantage" id="gs-adv-' + uid + '-' + idx + '"></span>' +
            '</div>';
          el.grid.appendChild(card);

           
          setTimeout(function() {
            card.classList.add('gs-visible');
          }, 60 * idx + 30);
        })(i);
      }
    }

     
    function renderPhase2() {
      phase = 2;
      el.computeBtn.disabled = true;
      el.computeBtn.textContent = 'Computed';

      el.phase1.classList.remove('gs-active');
      el.phase2.classList.add('gs-active');

      var stats = computeStats(responses);

       
      for (var i = 0; i < responses.length; i++) {
        (function(idx) {
          var advEl = document.getElementById('gs-adv-' + uid + '-' + idx);
          if (advEl) {
            var adv = stats.advantages[idx];
            advEl.textContent = '\u00c2 = ' + (adv >= 0 ? '+' : '') + adv.toFixed(2);
            setTimeout(function() {
              advEl.classList.add('gs-visible');
            }, 80 * idx);
          }
        })(i);
      }

       
      el.statMean.textContent = stats.mean.toFixed(3);
      el.statStd.textContent = stats.std.toFixed(3);
      el.statG.textContent = G;

       
      setTimeout(function() {
        buildNormChart(stats);
      }, 80 * responses.length + 200);
    }

     
    function buildNormChart(stats) {
      var chart = el.normChart;

       
      var old = chart.querySelectorAll('.gs-norm-bar, .gs-norm-mean-line, .gs-norm-std-band, .gs-norm-tick');
      for (var i = 0; i < old.length; i++) old[i].remove();

       
      el.normWrap.classList.add('gs-visible');

       
      var barHeight = 30;
      var barGap = 6;
      var chartHeight = Math.max(320, responses.length * (barHeight + barGap) + 120);
      chart.style.minHeight = chartHeight + 'px';

       
      var maxAbsAdv = 0;
      for (var i = 0; i < stats.advantages.length; i++) {
        var absAdv = Math.abs(stats.advantages[i]);
        if (absAdv > maxAbsAdv) maxAbsAdv = absAdv;
      }
      maxAbsAdv = Math.max(maxAbsAdv, 1.2);
      var range = maxAbsAdv * 1.15;

      var padTop = 40;
      var padBot = 40;
      var usableHeight = chartHeight - padTop - padBot;

      function advToY(adv) {
        return padTop + ((range - adv) / (2 * range)) * usableHeight;
      }

       
      var step = range > 2 ? 1.0 : 0.5;
      for (var t = -Math.floor(range / step) * step; t <= range + 0.01; t += step) {
        var tv = Math.round(t * 100) / 100;
        var tick = document.createElement('div');
        tick.className = 'gs-norm-tick';
        tick.style.top = advToY(tv) + 'px';
        tick.textContent = tv.toFixed(1);
        chart.appendChild(tick);
      }

       
      var meanLine = document.createElement('div');
      meanLine.className = 'gs-norm-mean-line';
      meanLine.style.top = advToY(0) + 'px';
      meanLine.innerHTML = '<span class="gs-norm-mean-label">\u03bc = ' + stats.mean.toFixed(3) + '</span>';
      chart.appendChild(meanLine);

       
      var stdBand = document.createElement('div');
      stdBand.className = 'gs-norm-std-band';
      var bandTop = advToY(1);
      var bandBot = advToY(-1);
      stdBand.style.top = bandTop + 'px';
      stdBand.style.height = (bandBot - bandTop) + 'px';
      stdBand.innerHTML =
        '<span class="gs-norm-std-label gs-top">+1\u03c3</span>' +
        '<span class="gs-norm-std-label gs-bottom">\u22121\u03c3</span>';
      chart.appendChild(stdBand);

      setTimeout(function() {
        stdBand.classList.add('gs-visible');
      }, 150);

      

      var indexed = [];
      for (var i = 0; i < responses.length; i++) {
        indexed.push({ index: i, resp: responses[i], adv: stats.advantages[i] });
      }
      indexed.sort(function(a, b) { return b.adv - a.adv; });

       
      var placed = [];
      var groupStart = 0;
      for (var i = 0; i <= indexed.length; i++) {
        if (i === indexed.length || Math.abs(indexed[i].adv - indexed[groupStart].adv) > 0.001) {
           
          var groupSize = i - groupStart;
          var centerY = advToY(indexed[groupStart].adv);
          var totalGroupHeight = groupSize * (barHeight + barGap) - barGap;
          var startY = centerY - totalGroupHeight / 2;
          for (var j = groupStart; j < i; j++) {
            placed.push({
              item: indexed[j],
              y: startY + (j - groupStart) * (barHeight + barGap)
            });
          }
          groupStart = i;
        }
      }

       
      for (var i = 0; i < placed.length; i++) {
        (function(rank, item, yPos) {
          var bar = document.createElement('div');
          var adv = item.adv;
          bar.className = 'gs-norm-bar ' + (adv >= 0 ? 'gs-positive' : 'gs-negative');
          bar.style.top = yPos + 'px';

          var advLabel = '(r\u2009\u2212\u2009\u03bc)\u2009/\u2009\u03c3 = ' + (adv >= 0 ? '+' : '') + adv.toFixed(2);

          bar.innerHTML =
            '<span class="gs-norm-bar-text">' +
              '<strong style="margin-right:0.4rem;opacity:0.7;">#' + (item.index + 1) + '</strong>' +
              escapeHTML(item.resp.text) +
            '</span>' +
            '<span class="gs-norm-bar-adv">\u00c2 = ' + (adv >= 0 ? '+' : '') + adv.toFixed(2) + '</span>';

          bar.title = advLabel;
          chart.appendChild(bar);

          setTimeout(function() {
            bar.classList.add('gs-visible');
          }, 250 + rank * 70);
        })(i, placed[i].item, placed[i].y);
      }

       
      setTimeout(function() {
        el.stats.classList.add('gs-visible');
      }, 300);

       
      setTimeout(function() {
        el.note.classList.add('gs-visible');
      }, 500 + placed.length * 70);
    }

     
    el.computeBtn.addEventListener('click', function() {
      if (phase === 1) renderPhase2();
    });

    el.regenBtn.addEventListener('click', function() {
      renderPhase1();
    });

    el.slider.addEventListener('input', function() {
      G = parseInt(this.value, 10);
      el.sliderVal.textContent = G;
      renderPhase1();
    });

    el.presetR1.addEventListener('click', function() {
      G = 16;
      el.slider.value = 16;
      el.sliderVal.textContent = 16;
      renderPhase1();
    });

    el.presetMath.addEventListener('click', function() {
      G = 16;
      el.slider.value = 16;
      el.sliderVal.textContent = 16;
      renderPhase1();
      showToast('DeepSeekMath uses G=64 in production. Showing G=16 for visualization clarity \u2014 the normalization math is identical at any group size.');
    });

     
    renderPhase1();
  })();
  </script>
</div>

<p>The full GRPO objective incorporates this advantage into a clipped surrogate structure that should look familiar:</p>
$$\mathcal{J}_{\text{GRPO}} = \mathbb{E}_{q \sim \mathcal{D}} \left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\Big(\rho_t^{(i)} \hat{A}_i,\; \text{clip}\big(\rho_t^{(i)}, 1{-}\varepsilon, 1{+}\varepsilon\big) \hat{A}_i\Big) - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]$$<p>This is structurally identical to PPO&rsquo;s clipped surrogate. The probability ratio $\rho_t^{(i)} = \pi_\theta(o_{i,t} \mid q, o_{i,\lt t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,\lt t})$ is the same per-token ratio. The clipping mechanism works identically. The only difference is where the advantage $\hat{A}_i$ comes from: PPO estimates it with a learned critic, GRPO estimates it with group statistics. The double averaging (over group members and over tokens) combined with clipping and KL penalty gives GRPO the stability of PPO without the critic&rsquo;s memory cost.</p>
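<p>To make the structure concrete, here is a minimal sketch of the objective for a single prompt, in plain Python with no autograd. The function names, the default <code>eps=0.2</code>, and the choice to pass the KL term in as a precomputed scalar are assumptions made for illustration; a real implementation would operate on tensors of log-probabilities.</p>

```python
import math

def group_advantages(rewards):
    """Normalize rewards within one group: (r - mean) / std, or 0 when std is 0."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / std if std > 0 else 0.0 for r in rewards]

def clipped_term(ratio, advantage, eps):
    """PPO-style clipped surrogate for a single token."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def grpo_objective(ratios, rewards, eps=0.2, beta=0.0, kl=0.0):
    """ratios[i][t] is the per-token ratio of response i; rewards[i] is its scalar reward.

    kl is assumed to be a precomputed estimate of D_KL(pi_theta || pi_ref).
    """
    advantages = group_advantages(rewards)
    per_response = []
    for token_ratios, adv in zip(ratios, advantages):
        # Inner average: over the tokens of one response.
        token_sum = sum(clipped_term(r, adv, eps) for r in token_ratios)
        per_response.append(token_sum / len(token_ratios))
    # Outer average: over the G responses in the group.
    return sum(per_response) / len(per_response) - beta * kl

# Two responses: one correct (two tokens), one incorrect (one token).
print(grpo_objective([[1.0, 1.5], [0.5]], [1.0, 0.0], eps=0.2))
```

<p>Every token of a response shares the same advantage $\hat{A}_i$. Only the ratio varies per token, which is exactly what separates this from a critic-based estimate.</p>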
<h3 id="grpo--rlvr-the-reasoning-revolution">GRPO + RLVR: The Reasoning Revolution</h3>
<p>GRPO&rsquo;s natural partner is <strong>Reinforcement Learning from Verifiable Rewards (RLVR)</strong>. These are tasks where correctness can be checked deterministically: math problems have right answers, code must pass test cases, logic puzzles have verifiable solutions. For these tasks, the reward function is a simple rule (correct or incorrect) requiring no learned reward model at all.</p>
<p>Rule-based rewards are <strong>far harder to hack</strong> than learned ones. There is no neural reward model to exploit and no learned proxy to overoptimize; as long as the checker itself is sound, the reward is ground truth. This makes GRPO + RLVR an extraordinarily clean training setup: sample responses, check whether they are correct, normalize advantages within the group, update the policy. Two models in memory (policy and reference), a deterministic reward function, and online exploration.</p>
<p>DeepSeek-R1 demonstrated how powerful this combination can be. Its reward function was remarkably simple:</p>
$$R = R_{\text{accuracy}} + R_{\text{format}}$$<p>where $R_{\text{accuracy}}$ is binary (1 if the final answer matches the ground truth, 0 otherwise, verified by regex matching) and $R_{\text{format}}$ enforces structured reasoning with <code>&lt;think&gt;...&lt;/think&gt;</code> and <code>&lt;answer&gt;...&lt;/answer&gt;</code> tags. That&rsquo;s it. No neural reward model, no human preference data for the RL stage.</p>
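<p>A rule-based reward of this shape fits in a few lines. The sketch below is an illustration, not DeepSeek&rsquo;s actual code: the function names, the exact-string comparison of the answer, and the value of 1.0 for the format reward are all assumptions.</p>

```python
import re

# Whole completion must be a think block followed by an answer block.
FORMAT_PATTERN = re.compile(
    r"\A\s*<think>.*?</think>\s*<answer>.*?</answer>\s*\Z", re.DOTALL
)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion):
    """1.0 if the completion follows the think/answer template, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion) else 0.0

def accuracy_reward(completion, ground_truth):
    """1.0 if the extracted answer equals the ground truth (assumed exact match)."""
    found = ANSWER_PATTERN.search(completion)
    if found is None:
        return 0.0
    return 1.0 if found.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(completion, ground_truth):
    """R = R_accuracy + R_format."""
    return accuracy_reward(completion, ground_truth) + format_reward(completion)

print(total_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))
print(total_reward("The answer is 4", "4"))
```

<p>Real math graders usually normalize answers before comparing them (for example, treating <code>0.5</code> and <code>1/2</code> as equal), so the exact-match check here is the simplest possible stand-in.</p>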
<p>The results were striking. DeepSeek-R1-Zero, trained with GRPO and <em>no</em> supervised fine-tuning at all, reached 71.0% pass@1 on AIME 2024, up from 15.6% for its base model and comparable to OpenAI&rsquo;s o1-0912. The full DeepSeek-R1 pipeline went further, scoring 97.3% on MATH-500 and a 2,029 Elo rating on Codeforces. Perhaps most remarkable was the emergent behavior in R1-Zero: the model spontaneously developed self-correction strategies (&ldquo;Wait, let me reconsider this step&hellip;&rdquo;) without any explicit training signal for reflection. This self-verification behavior emerged purely from the pressure to produce correct final answers.</p>
<p>DeepSeek-R1&rsquo;s practical configurations: $G = 16$ responses per prompt, batches of 32 unique questions, and notably $\varepsilon = 10$, which effectively disables clipping entirely. The group-normalized advantages are already well-scaled, reducing the need for a tight trust region constraint.</p>
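<p>A quick check shows why $\varepsilon = 10$ amounts to no clipping. The clip range becomes $[-9, 11]$. Probability ratios are always positive, so the lower bound can never bind, and the upper bound only binds if a token becomes more than eleven times as likely as under the old policy. The ratios below are illustrative values chosen for this sketch, not numbers from the paper.</p>

```python
def clip(ratio, eps):
    """Clamp the probability ratio to [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))

def clipped_fraction(ratios, eps):
    """Fraction of ratios that the clip would change."""
    changed = sum(1 for r in ratios if clip(r, eps) != r)
    return changed / len(ratios)

ratios = [0.1, 0.5, 0.9, 1.0, 1.1, 2.0, 5.0]  # illustrative values only

print(clipped_fraction(ratios, eps=0.2))   # common PPO default
print(clipped_fraction(ratios, eps=10.0))  # value reported for DeepSeek-R1

# Responses scored per step: 32 unique questions times G = 16 samples each.
responses_per_step = 32 * 16
print(responses_per_step)
```

<p>With $\varepsilon = 0.2$, four of the seven ratios fall outside $[0.8, 1.2]$ and would be clamped. With $\varepsilon = 10$, none are.</p>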
<h3 id="connection-to-reinforce-and-variance-reduction">Connection to REINFORCE and Variance Reduction</h3>
<p>GRPO is best understood as a variant of REINFORCE, the simplest policy gradient algorithm, with a group-based baseline for variance reduction. Vanilla REINFORCE computes policy gradients as:</p>
$$\nabla_\theta J = \mathbb{E}\big[R \cdot \nabla_\theta \log \pi_\theta(a \mid s)\big]$$<p>This estimator has notoriously high variance because raw returns fluctuate enormously between episodes. The standard fix is to subtract a baseline $b$ from the return: $\nabla_\theta J = \mathbb{E}[(R - b) \cdot \nabla_\theta \log \pi_\theta]$. Any baseline that does not depend on the action leaves the gradient unbiased, because $\mathbb{E}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0$, so the subtracted term vanishes in expectation. PPO learns this baseline with an expensive critic network. GRPO uses the group mean instead, a sample-based estimate that becomes more accurate as the group grows, adds no parameters, and requires no extra training.</p>
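<p>The effect of a baseline can be checked exactly on a toy problem. The sketch below assumes a two-action softmax policy with made-up probabilities and rewards, and computes the exact mean and variance of the REINFORCE estimator for one logit, with and without a baseline. Everything here (the function name, the numbers) is illustrative.</p>

```python
def gradient_stats(probs, rewards, baseline):
    """Exact mean and variance of the REINFORCE estimate for the logit of action 0.

    For a softmax policy, d/d(logit_0) log pi(a) = 1[a == 0] - pi(0).
    """
    p0 = probs[0]
    samples = []
    for action, (prob, reward) in enumerate(zip(probs, rewards)):
        score = (1.0 if action == 0 else 0.0) - p0
        samples.append((prob, (reward - baseline) * score))
    mean = sum(prob * g for prob, g in samples)
    var = sum(prob * (g - mean) ** 2 for prob, g in samples)
    return mean, var

probs = [0.7, 0.3]      # assumed policy
rewards = [11.0, 10.0]  # assumed rewards with a large shared offset

expected_reward = sum(p * r for p, r in zip(probs, rewards))

print(gradient_stats(probs, rewards, baseline=0.0))
print(gradient_stats(probs, rewards, baseline=expected_reward))
```

<p>Both calls give the same mean gradient (0.21), but the variance drops from roughly 22.28 to roughly 0.034 once the expected reward is subtracted. GRPO&rsquo;s group mean is a sampled estimate of that same expected reward.</p>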
<h2 id="choosing-the-right-technique">Choosing the Right Technique</h2>
<p>The choice between PPO, DPO, and GRPO depends primarily on the nature of your reward signal and your computational constraints.</p>
<p><strong>Use PPO</strong> when you are training on open-ended tasks (creative writing, general helpfulness, safety) where reward must come from a learned model, and you have the compute budget to support four models in memory. Despite its complexity, PPO with proper tuning remains the highest-ceiling approach. Xu et al. (ICML 2024) showed it consistently outperforms DPO on challenging benchmarks. It is the choice of frontier labs training flagship models.</p>
<p><strong>Use DPO</strong> when you have high-quality paired preference data, want simple and stable training, and are working with limited compute. DPO matched or exceeded PPO on summarization benchmarks (61% GPT-4 win rate vs. PPO&rsquo;s 57% on TL;DR) and is implementable with a standard supervised training pipeline. It is ideal for quick alignment passes and situations where the preference dataset covers the intended use distribution well.</p>
<p><strong>Use GRPO</strong> when your task has verifiable rewards (math, code, logic, factual QA with ground-truth answers). It combines online exploration (like PPO) with a low memory footprint (like DPO), and rule-based rewards leave very little room for reward hacking. It has become the standard approach for training reasoning models.</p>
<p>In practice, these techniques are often combined rather than used in isolation. Meta described Llama 3&rsquo;s post-training as a combination of SFT, rejection sampling, PPO, and DPO, with each stage addressing different aspects of alignment. DeepSeek-R1 alternated between SFT stages, RLVR with GRPO for reasoning, and RLHF stages for general helpfulness. The techniques are complementary.</p>


<div class="tc-wrapper" id="tc-869e96d074aa1e4482cf4409ce2006eb">
  <style>
    .tc-wrapper {
      --tc-bg: #0d1117;
      --tc-surface: #161b22;
      --tc-border: #30363d;
      --tc-text: #e6edf3;
      --tc-text-muted: #8b949e;
      --tc-ppo: #58a6ff;
      --tc-dpo: #39d353;
      --tc-grpo: #a371f7;
      --tc-ppo-bg: rgba(88, 166, 255, 0.08);
      --tc-dpo-bg: rgba(57, 211, 83, 0.08);
      --tc-grpo-bg: rgba(163, 113, 247, 0.08);
      --tc-ppo-border: rgba(88, 166, 255, 0.25);
      --tc-dpo-border: rgba(57, 211, 83, 0.25);
      --tc-grpo-border: rgba(163, 113, 247, 0.25);
      --tc-header-bg: linear-gradient(135deg, #0d1117 0%, #161b22 100%);
      --tc-row-hover: rgba(255, 255, 255, 0.03);
      --tc-toggle-bg: rgba(255, 255, 255, 0.06);
      --tc-toggle-active: rgba(255, 255, 255, 0.12);
      --tc-node-bg: #1c2333;
      --tc-yes-color: #39d353;
      --tc-no-color: #f85149;
      --tc-arrow-color: #484f58;
      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--tc-bg);
      color: var(--tc-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
      border: 1px solid var(--tc-border);
    }

    [data-theme="light"] .tc-wrapper,
    :root:not([data-theme="dark"]) .tc-wrapper {
      --tc-bg: #f8fafc;
      --tc-surface: #ffffff;
      --tc-border: #e2e8f0;
      --tc-text: #1e293b;
      --tc-text-muted: #64748b;
      --tc-ppo: #3b82f6;
      --tc-dpo: #10b981;
      --tc-grpo: #8b5cf6;
      --tc-ppo-bg: rgba(59, 130, 246, 0.06);
      --tc-dpo-bg: rgba(16, 185, 129, 0.06);
      --tc-grpo-bg: rgba(139, 92, 246, 0.06);
      --tc-ppo-border: rgba(59, 130, 246, 0.2);
      --tc-dpo-border: rgba(16, 185, 129, 0.2);
      --tc-grpo-border: rgba(139, 92, 246, 0.2);
      --tc-header-bg: linear-gradient(135deg, #1e293b 0%, #334155 100%);
      --tc-row-hover: rgba(0, 0, 0, 0.02);
      --tc-toggle-bg: rgba(0, 0, 0, 0.04);
      --tc-toggle-active: rgba(0, 0, 0, 0.08);
      --tc-node-bg: #ffffff;
      --tc-yes-color: #10b981;
      --tc-no-color: #ef4444;
      --tc-arrow-color: #94a3b8;
    }

    .tc-wrapper * {
      box-sizing: border-box;
    }

     
    .tc-header {
      background: var(--tc-header-bg);
      border-radius: 8px;
      padding: 1.25rem 1.5rem;
      margin-bottom: 1.25rem;
      text-align: center;
    }

    .tc-title {
      font-size: 0.7rem;
      font-weight: 700;
      letter-spacing: 0.15em;
      text-transform: uppercase;
      color: var(--tc-ppo);
      margin-bottom: 0.35rem;
    }

    .tc-subtitle {
      font-size: 0.95rem;
      color: #94a3b8;
      font-weight: 400;
    }

     
    .tc-toggle-group {
      display: flex;
      gap: 0.25rem;
      background: var(--tc-toggle-bg);
      border-radius: 8px;
      padding: 0.25rem;
      margin-bottom: 1.25rem;
    }

    .tc-toggle-btn {
      flex: 1;
      padding: 0.6rem 1rem;
      background: transparent;
      border: none;
      border-radius: 6px;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--tc-text-muted);
      cursor: pointer;
      transition: all 0.2s ease;
      font-family: inherit;
      display: flex;
      align-items: center;
      justify-content: center;
      gap: 0.5rem;
    }

    .tc-toggle-btn:hover {
      color: var(--tc-text);
      background: var(--tc-toggle-active);
    }

    .tc-toggle-btn.tc-active {
      background: var(--tc-surface);
      color: var(--tc-text);
      box-shadow: 0 1px 3px rgba(0, 0, 0, 0.15);
    }

    .tc-toggle-icon {
      width: 16px;
      height: 16px;
      flex-shrink: 0;
    }

     
    .tc-view {
      display: none;
      animation: tcFadeIn 0.3s ease;
    }

    .tc-view.tc-active {
      display: block;
    }

    @keyframes tcFadeIn {
      from { opacity: 0; transform: translateY(4px); }
      to { opacity: 1; transform: translateY(0); }
    }

     
    .tc-table-scroll {
      overflow-x: auto;
      -webkit-overflow-scrolling: touch;
      border-radius: 8px;
      border: 1px solid var(--tc-border);
    }

    .tc-table {
      width: 100%;
      border-collapse: collapse;
      min-width: 640px;
      font-size: 0.875rem;
    }

    .tc-table thead th {
      padding: 0.85rem 1rem;
      font-weight: 700;
      font-size: 0.8rem;
      letter-spacing: 0.05em;
      text-transform: uppercase;
      background: var(--tc-surface);
      border-bottom: 2px solid var(--tc-border);
      text-align: left;
      cursor: default;
      transition: background 0.2s ease;
      position: relative;
    }

    .tc-table thead th:first-child {
      color: var(--tc-text-muted);
      width: 190px;
      min-width: 170px;
    }

    .tc-table thead th.tc-col-ppo {
      color: var(--tc-ppo);
      cursor: pointer;
    }

    .tc-table thead th.tc-col-dpo {
      color: var(--tc-dpo);
      cursor: pointer;
    }

    .tc-table thead th.tc-col-grpo {
      color: var(--tc-grpo);
      cursor: pointer;
    }

    .tc-table thead th.tc-col-ppo:hover,
    .tc-table thead th.tc-col-ppo.tc-col-highlight {
      background: var(--tc-ppo-bg);
    }

    .tc-table thead th.tc-col-dpo:hover,
    .tc-table thead th.tc-col-dpo.tc-col-highlight {
      background: var(--tc-dpo-bg);
    }

    .tc-table thead th.tc-col-grpo:hover,
    .tc-table thead th.tc-col-grpo.tc-col-highlight {
      background: var(--tc-grpo-bg);
    }

    .tc-table tbody tr {
      border-bottom: 1px solid var(--tc-border);
      transition: background 0.15s ease;
    }

    .tc-table tbody tr:last-child {
      border-bottom: none;
    }

    .tc-table tbody tr:hover {
      background: var(--tc-row-hover);
    }

    .tc-table tbody td {
      padding: 0.75rem 1rem;
      vertical-align: top;
      transition: background 0.2s ease;
    }

    .tc-table tbody td:first-child {
      font-weight: 600;
      color: var(--tc-text-muted);
      font-size: 0.8rem;
      letter-spacing: 0.02em;
    }

     
    .tc-table tbody td.tc-cell-ppo.tc-col-highlight {
      background: var(--tc-ppo-bg);
      border-left: 2px solid var(--tc-ppo-border);
    }

    .tc-table tbody td.tc-cell-dpo.tc-col-highlight {
      background: var(--tc-dpo-bg);
      border-left: 2px solid var(--tc-dpo-border);
    }

    .tc-table tbody td.tc-cell-grpo.tc-col-highlight {
      background: var(--tc-grpo-bg);
      border-left: 2px solid var(--tc-grpo-border);
    }

     
    .tc-val {
      display: flex;
      align-items: flex-start;
      gap: 0.5rem;
    }

    .tc-dot {
      width: 6px;
      height: 6px;
      border-radius: 50%;
      margin-top: 0.5em;
      flex-shrink: 0;
    }

    .tc-dot-ppo { background: var(--tc-ppo); }
    .tc-dot-dpo { background: var(--tc-dpo); }
    .tc-dot-grpo { background: var(--tc-grpo); }

    .tc-hint {
      display: block;
      font-size: 0.75rem;
      color: var(--tc-text-muted);
      margin-top: 0.15rem;
      line-height: 1.4;
    }

    .tc-table-footer {
      text-align: center;
      padding: 0.75rem;
      font-size: 0.75rem;
      color: var(--tc-text-muted);
    }

     
    .tc-tree {
      display: flex;
      flex-direction: column;
      align-items: center;
      padding: 1rem 0.5rem 0.5rem;
      gap: 0;
      overflow-x: auto;
    }

    .tc-node {
      background: var(--tc-node-bg);
      border: 1.5px solid var(--tc-border);
      border-radius: 10px;
      padding: 0.85rem 1.25rem;
      text-align: center;
      font-size: 0.85rem;
      line-height: 1.5;
      max-width: 380px;
      width: 100%;
      position: relative;
      transition: border-color 0.2s ease, box-shadow 0.2s ease;
    }

    .tc-node:hover {
      border-color: var(--tc-text-muted);
    }

    .tc-node-question {
      font-weight: 600;
      color: var(--tc-text);
    }

    .tc-node-result {
      font-weight: 700;
      font-size: 0.95rem;
      padding: 0.7rem 1.5rem;
    }

    .tc-node-ppo {
      border-color: var(--tc-ppo-border);
      background: var(--tc-ppo-bg);
      color: var(--tc-ppo);
    }

    .tc-node-dpo {
      border-color: var(--tc-dpo-border);
      background: var(--tc-dpo-bg);
      color: var(--tc-dpo);
    }

    .tc-node-grpo {
      border-color: var(--tc-grpo-border);
      background: var(--tc-grpo-bg);
      color: var(--tc-grpo);
    }

    .tc-node-ppo:hover {
      box-shadow: 0 0 16px rgba(88, 166, 255, 0.15);
    }

    .tc-node-dpo:hover {
      box-shadow: 0 0 16px rgba(57, 211, 83, 0.15);
    }

    .tc-node-grpo:hover {
      box-shadow: 0 0 16px rgba(163, 113, 247, 0.15);
    }

     
    .tc-arrow-down {
      display: flex;
      flex-direction: column;
      align-items: center;
      position: relative;
      height: 44px;
      width: 2px;
      background: var(--tc-arrow-color);
    }

    .tc-arrow-down::after {
      content: '';
      position: absolute;
      bottom: -1px;
      left: 50%;
      transform: translateX(-50%);
      width: 0;
      height: 0;
      border-left: 5px solid transparent;
      border-right: 5px solid transparent;
      border-top: 6px solid var(--tc-arrow-color);
    }

    .tc-branch-row {
      display: flex;
      align-items: flex-start;
      justify-content: center;
      gap: 0;
      width: 100%;
      max-width: 700px;
      position: relative;
    }

    .tc-branch-side {
      display: flex;
      flex-direction: column;
      align-items: center;
      flex: 1;
    }

    .tc-branch-label {
      font-size: 0.7rem;
      font-weight: 700;
      text-transform: uppercase;
      letter-spacing: 0.1em;
      padding: 0.2rem 0.6rem;
      border-radius: 4px;
      margin-bottom: 0.25rem;
    }

    .tc-label-yes {
      color: var(--tc-yes-color);
      background: rgba(57, 211, 83, 0.1);
    }

    .tc-label-no {
      color: var(--tc-no-color);
      background: rgba(248, 81, 73, 0.1);
    }

     
    .tc-branch-connector {
      position: relative;
      height: 30px;
      width: 100%;
      max-width: 700px;
    }

    .tc-branch-connector svg {
      width: 100%;
      height: 100%;
      overflow: visible;
    }

    .tc-branch-line {
      stroke: var(--tc-arrow-color);
      stroke-width: 1.5;
      fill: none;
    }

    .tc-branch-arrow-yes,
    .tc-branch-arrow-no {
      display: flex;
      flex-direction: column;
      align-items: center;
      height: 28px;
      position: relative;
    }

    .tc-branch-arrow-yes::before,
    .tc-branch-arrow-no::before {
      content: '';
      width: 2px;
      height: 22px;
      background: var(--tc-arrow-color);
    }

    .tc-branch-arrow-yes::after,
    .tc-branch-arrow-no::after {
      content: '';
      width: 0;
      height: 0;
      border-left: 5px solid transparent;
      border-right: 5px solid transparent;
      border-top: 6px solid var(--tc-arrow-color);
    }

     
    @media (max-width: 600px) {
      .tc-wrapper {
        padding: 1rem;
      }

      .tc-header {
        padding: 1rem;
      }

      .tc-title {
        font-size: 0.65rem;
      }

      .tc-subtitle {
        font-size: 0.85rem;
      }

      .tc-toggle-btn {
        font-size: 0.78rem;
        padding: 0.5rem 0.5rem;
      }

      .tc-branch-row {
        flex-direction: column;
        align-items: center;
        gap: 0;
      }

      .tc-branch-connector {
        display: none;
      }

      .tc-branch-side {
        width: 100%;
      }

      .tc-branch-arrow-yes,
      .tc-branch-arrow-no {
        height: 24px;
      }

      .tc-node {
        max-width: 320px;
        font-size: 0.8rem;
      }
    }
  </style>

  
  <div class="tc-header">
    <div class="tc-title">Choosing the Right Technique</div>
    <div class="tc-subtitle">Compare PPO, DPO, and GRPO across key dimensions</div>
  </div>

  
  <div class="tc-toggle-group" id="tc-toggles-869e96d074aa1e4482cf4409ce2006eb">
    <button class="tc-toggle-btn tc-active" data-view="table">
      <svg class="tc-toggle-icon" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
        <rect x="3" y="3" width="7" height="7"></rect>
        <rect x="14" y="3" width="7" height="7"></rect>
        <rect x="3" y="14" width="7" height="7"></rect>
        <rect x="14" y="14" width="7" height="7"></rect>
      </svg>
      Comparison Table
    </button>
    <button class="tc-toggle-btn" data-view="tree">
      <svg class="tc-toggle-icon" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
        <circle cx="12" cy="5" r="2"></circle>
        <line x1="12" y1="7" x2="12" y2="11"></line>
        <line x1="6" y1="11" x2="18" y2="11"></line>
        <line x1="6" y1="11" x2="6" y2="14"></line>
        <line x1="18" y1="11" x2="18" y2="14"></line>
        <circle cx="6" cy="16" r="2"></circle>
        <circle cx="18" cy="16" r="2"></circle>
      </svg>
      Decision Tree
    </button>
  </div>

  
  <div class="tc-view tc-active" id="tc-table-view-869e96d074aa1e4482cf4409ce2006eb">
    <div class="tc-table-scroll">
      <table class="tc-table" id="tc-table-869e96d074aa1e4482cf4409ce2006eb">
        <thead>
          <tr>
            <th>Attribute</th>
            <th class="tc-col-ppo" data-col="ppo">PPO</th>
            <th class="tc-col-dpo" data-col="dpo">DPO</th>
            <th class="tc-col-grpo" data-col="grpo">GRPO</th>
          </tr>
        </thead>
        <tbody>
          
          <tr>
            <td>Models in Memory</td>
            <td class="tc-cell-ppo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-ppo"></span>
                <span>4 models<span class="tc-hint">policy, reference, reward, critic</span></span>
              </div>
            </td>
            <td class="tc-cell-dpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-dpo"></span>
                <span>2 models<span class="tc-hint">policy, reference</span></span>
              </div>
            </td>
            <td class="tc-cell-grpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-grpo"></span>
                <span>2 models<span class="tc-hint">policy, reference</span></span>
              </div>
            </td>
          </tr>
          
          <tr>
            <td>Training Paradigm</td>
            <td class="tc-cell-ppo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-ppo"></span>
                <span>Online RL</span>
              </div>
            </td>
            <td class="tc-cell-dpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-dpo"></span>
                <span>Offline supervised</span>
              </div>
            </td>
            <td class="tc-cell-grpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-grpo"></span>
                <span>Online RL</span>
              </div>
            </td>
          </tr>
          
          <tr>
            <td>Reward Source</td>
            <td class="tc-cell-ppo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-ppo"></span>
                <span>Learned reward model</span>
              </div>
            </td>
            <td class="tc-cell-dpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-dpo"></span>
                <span>Implicit<span class="tc-hint">derived from preferences</span></span>
              </div>
            </td>
            <td class="tc-cell-grpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-grpo"></span>
                <span>Verifiable / rule-based</span>
              </div>
            </td>
          </tr>
          
          <tr>
            <td>Implementation</td>
            <td class="tc-cell-ppo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-ppo"></span>
                <span>Complex<span class="tc-hint">thousands of lines of code</span></span>
              </div>
            </td>
            <td class="tc-cell-dpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-dpo"></span>
                <span>Simple<span class="tc-hint">~20 lines of core code</span></span>
              </div>
            </td>
            <td class="tc-cell-grpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-grpo"></span>
                <span>Moderate<span class="tc-hint">hundreds of lines of code</span></span>
              </div>
            </td>
          </tr>
          
          <tr>
            <td>Training Stability</td>
            <td class="tc-cell-ppo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-ppo"></span>
                <span>Sensitive to hyperparameters</span>
              </div>
            </td>
            <td class="tc-cell-dpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-dpo"></span>
                <span>Very stable</span>
              </div>
            </td>
            <td class="tc-cell-grpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-grpo"></span>
                <span>Stable<span class="tc-hint">with group normalization</span></span>
              </div>
            </td>
          </tr>
          
          <tr>
            <td>Performance Ceiling</td>
            <td class="tc-cell-ppo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-ppo"></span>
                <span>Highest<span class="tc-hint">with proper tuning</span></span>
              </div>
            </td>
            <td class="tc-cell-dpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-dpo"></span>
                <span>Good<span class="tc-hint">limited by offline data</span></span>
              </div>
            </td>
            <td class="tc-cell-grpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-grpo"></span>
                <span>Excellent<span class="tc-hint">for verifiable tasks</span></span>
              </div>
            </td>
          </tr>
          
          <tr>
            <td>Reward Hacking Risk</td>
            <td class="tc-cell-ppo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-ppo"></span>
                <span>High<span class="tc-hint">learned proxy</span></span>
              </div>
            </td>
            <td class="tc-cell-dpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-dpo"></span>
                <span>Low<span class="tc-hint">no explicit reward</span></span>
              </div>
            </td>
            <td class="tc-cell-grpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-grpo"></span>
                <span>Very low<span class="tc-hint">rule-based rewards</span></span>
              </div>
            </td>
          </tr>
          
          <tr>
            <td>Best For</td>
            <td class="tc-cell-ppo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-ppo"></span>
                <span>General alignment, frontier models</span>
              </div>
            </td>
            <td class="tc-cell-dpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-dpo"></span>
                <span>Quick alignment, limited compute</span>
              </div>
            </td>
            <td class="tc-cell-grpo">
              <div class="tc-val">
                <span class="tc-dot tc-dot-grpo"></span>
                <span>Math, code, reasoning tasks</span>
              </div>
            </td>
          </tr>
        </tbody>
      </table>
    </div>
    <div class="tc-table-footer">Click a column header to highlight</div>
  </div>

  
  <div class="tc-view" id="tc-tree-view-869e96d074aa1e4482cf4409ce2006eb">
    <div class="tc-tree">

      
      <div class="tc-node tc-node-question">
        Do you have <strong>verifiable rewards</strong>?<br>
        <span style="font-size: 0.78rem; color: var(--tc-text-muted); font-weight: 400;">(math, code, formal logic)</span>
      </div>

      
      <div class="tc-branch-connector">
        <svg viewBox="0 0 700 30" preserveAspectRatio="xMidYMid meet">
          <line class="tc-branch-line" x1="350" y1="0" x2="350" y2="8" />
          <line class="tc-branch-line" x1="175" y1="8" x2="525" y2="8" />
          <line class="tc-branch-line" x1="175" y1="8" x2="175" y2="30" />
          <line class="tc-branch-line" x1="525" y1="8" x2="525" y2="30" />
        </svg>
      </div>

      <div class="tc-branch-row">
        
        <div class="tc-branch-side">
          <div class="tc-branch-label tc-label-yes">Yes</div>
          <div class="tc-branch-arrow-yes"></div>
          <div class="tc-node tc-node-result tc-node-grpo">
            GRPO
            <div style="font-size: 0.75rem; font-weight: 400; margin-top: 0.25rem; opacity: 0.85;">
              Group-based rewards, no critic needed
            </div>
          </div>
        </div>

        
        <div class="tc-branch-side">
          <div class="tc-branch-label tc-label-no">No</div>
          <div class="tc-branch-arrow-no"></div>
          <div class="tc-node tc-node-question">
            Do you have <strong>paired preference data</strong><br>and <strong>limited compute</strong>?
          </div>

          
          <div class="tc-branch-connector" style="max-width: 350px;">
            <svg viewBox="0 0 350 30" preserveAspectRatio="xMidYMid meet">
              <line class="tc-branch-line" x1="175" y1="0" x2="175" y2="8" />
              <line class="tc-branch-line" x1="88" y1="8" x2="262" y2="8" />
              <line class="tc-branch-line" x1="88" y1="8" x2="88" y2="30" />
              <line class="tc-branch-line" x1="262" y1="8" x2="262" y2="30" />
            </svg>
          </div>

          <div class="tc-branch-row" style="max-width: 380px;">
            
            <div class="tc-branch-side">
              <div class="tc-branch-label tc-label-yes">Yes</div>
              <div class="tc-branch-arrow-yes"></div>
              <div class="tc-node tc-node-result tc-node-dpo">
                DPO
                <div style="font-size: 0.75rem; font-weight: 400; margin-top: 0.25rem; opacity: 0.85;">
                  Simple, stable, compute-efficient
                </div>
              </div>
            </div>

            
            <div class="tc-branch-side">
              <div class="tc-branch-label tc-label-no">No</div>
              <div class="tc-branch-arrow-no"></div>
              <div class="tc-node tc-node-result tc-node-ppo">
                PPO
                <div style="font-size: 0.75rem; font-weight: 400; margin-top: 0.25rem; opacity: 0.85;">
                  Maximum performance, open-ended tasks
                </div>
              </div>
            </div>
          </div>

        </div>
      </div>

    </div>
  </div>

  
  <script>
  (function() {
    const uid = '869e96d074aa1e4482cf4409ce2006eb';
    const container = document.getElementById('tc-' + uid);
    if (!container) return;

    
    const toggleGroup = document.getElementById('tc-toggles-' + uid);
    const tableView = document.getElementById('tc-table-view-' + uid);
    const treeView = document.getElementById('tc-tree-view-' + uid);
    const toggleBtns = toggleGroup.querySelectorAll('.tc-toggle-btn');

    toggleBtns.forEach(function(btn) {
      btn.addEventListener('click', function() {
        var view = this.getAttribute('data-view');

        
        toggleBtns.forEach(function(b) { b.classList.remove('tc-active'); });
        this.classList.add('tc-active');

        
        tableView.classList.remove('tc-active');
        treeView.classList.remove('tc-active');

        if (view === 'table') {
          tableView.classList.add('tc-active');
        } else {
          treeView.classList.add('tc-active');
        }
      });
    });

    
    var table = document.getElementById('tc-table-' + uid);
    var headers = table.querySelectorAll('thead th[data-col]');
    var activeCol = null;

    headers.forEach(function(th) {
      th.addEventListener('click', function() {
        var col = this.getAttribute('data-col');

        
        table.querySelectorAll('.tc-col-highlight').forEach(function(el) {
          el.classList.remove('tc-col-highlight');
        });

        
        if (activeCol === col) {
          activeCol = null;
          return;
        }

        activeCol = col;

        
        this.classList.add('tc-col-highlight');

        
        table.querySelectorAll('tbody td.tc-cell-' + col).forEach(function(td) {
          td.classList.add('tc-col-highlight');
        });
      });
    });
  })();
  </script>
</div>
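<p>The decision tree above can also be written in a few lines of code. This is a minimal sketch: the function name <code>choose_method</code> and its three boolean parameters are hypothetical, introduced only to mirror the two questions in the tree.</p>

```python
def choose_method(verifiable_rewards: bool,
                  paired_preferences: bool,
                  limited_compute: bool) -> str:
    """Mirror the decision tree: try GRPO first, then DPO, else fall back to PPO."""
    # First question: can correctness be checked by a rule (math, code, formal logic)?
    if verifiable_rewards:
        return "GRPO"
    # Second question: paired preference data AND a tight compute budget?
    if paired_preferences and limited_compute:
        return "DPO"
    # Otherwise: open-ended tasks where maximum performance matters.
    return "PPO"


print(choose_method(verifiable_rewards=False,
                    paired_preferences=True,
                    limited_compute=True))
```

<p>Treat the result as a starting point rather than a rule. The table shows that each method trades stability, performance ceiling, and reward-hacking risk differently.</p>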

<p>The landscape continues to expand. On the DPO side, IPO removes the Bradley-Terry assumption, KTO works with binary feedback (thumbs up/down) instead of pairwise preferences, and SimPO drops the reference model entirely. On the GRPO side, DAPO addresses training instabilities with dynamic sampling, and Dr. GRPO removes the length and standard-deviation normalization terms that bias GRPO&rsquo;s updates. Each builds on the foundations covered here.</p>
<h2 id="from-explicit-rewards-to-emergent-reasoning">From Explicit Rewards to Emergent Reasoning</h2>
<p>Let&rsquo;s step back and trace the arc we have followed. PPO established that RL could align language models, using an explicit reward model to score responses and a learned critic to estimate advantages. DPO showed the reward model was unnecessary. The reward signal was implicit in the probability ratios, waiting to be extracted through a clever reparameterization. GRPO showed the critic was unnecessary too. Group statistics could replace learned value functions, especially when paired with verifiable rewards.</p>
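<p>Both simplifications fit in a few lines. The sketch below is illustrative only: the function names, the <code>beta</code> default, the <code>eps</code> guard, and the use of a population standard deviation are assumptions made for this example, not details taken from any particular implementation. It works on plain Python floats rather than tensors so that it runs without dependencies.</p>

```python
import math
from statistics import mean, pstdev


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit rewards: beta * log(pi / pi_ref). No reward model is involved.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    # eps (assumed) avoids division by zero when every reward is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Policy and reference agree on both responses: the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Four sampled answers to one prompt, scored 1 (correct) or 0 (incorrect).
print(group_relative_advantages([1, 0, 0, 1]))
```

<p>With rewards of <code>[1, 0, 0, 1]</code>, the correct answers get an advantage of roughly +1 and the incorrect ones roughly -1, with no critic involved. If every answer in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.</p>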
<p>Each step eliminated a component that turned out to be inessential for the task at hand. What remained was the core objective (maximize expected reward while staying close to the reference) and progressively simpler ways of optimizing it.</p>
<p>But the most interesting result came from the simplest setup. DeepSeek-R1-Zero, trained with GRPO and simple rule-based rewards (a binary correctness check on the final answer plus a format check), spontaneously developed multi-step reasoning, self-correction, and solution verification, capabilities that were not explicitly trained. The model learned <em>how to think</em> from the sole pressure to <em>be correct</em>. No demonstrations of reasoning. No reward for intermediate steps. Just rule-based checks on the final output and the group-relative advantage signal.</p>
<p>This suggests that the path to capable reasoning models may be less about sophisticated reward engineering and more about giving models the right optimization framework and letting them discover strategies on their own. The field is still learning which components are genuinely necessary and which are engineering artifacts of earlier approaches. These three techniques (PPO, DPO, and GRPO) represent the progression of that understanding.</p>
<hr>
<h2 id="references">References</h2>
<ol>
<li>
<p><strong>Schulman, J., Wolski, F., Dhariwal, P., Radford, A., &amp; Klimov, O. (2017).</strong> <a href="https://arxiv.org/abs/1707.06347">Proximal Policy Optimization Algorithms</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>The original PPO paper introducing the clipped surrogate objective.</li>
</ul>
</li>
<li>
<p><strong>Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., &amp; Finn, C. (2023).</strong> <a href="https://arxiv.org/abs/2305.18290">Direct Preference Optimization: Your Language Model is Secretly a Reward Model</a>. <em>NeurIPS 2023</em>.</p>
<ul>
<li>The DPO paper showing preference optimization can be reduced to classification.</li>
</ul>
</li>
<li>
<p><strong>Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., &hellip; &amp; Guo, D. (2024).</strong> <a href="https://arxiv.org/abs/2402.03300">DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>Introduces GRPO and demonstrates its effectiveness for mathematical reasoning.</li>
</ul>
</li>
<li>
<p><strong>DeepSeek-AI. (2025).</strong> <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>DeepSeek-R1 and R1-Zero results using GRPO with verifiable rewards.</li>
</ul>
</li>
<li>
<p><strong>Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., &hellip; &amp; Lowe, R. (2022).</strong> <a href="https://arxiv.org/abs/2203.02155">Training language models to follow instructions with human feedback</a>. <em>NeurIPS 2022</em>.</p>
<ul>
<li>The InstructGPT paper establishing the SFT → RM → PPO pipeline.</li>
</ul>
</li>
<li>
<p><strong>Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., &amp; Amodei, D. (2017).</strong> <a href="https://arxiv.org/abs/1706.03741">Deep reinforcement learning from human preferences</a>. <em>NeurIPS 2017</em>.</p>
<ul>
<li>Foundational work on learning reward models from human preferences.</li>
</ul>
</li>
<li>
<p><strong>Gao, L., Schulman, J., &amp; Hilton, J. (2022).</strong> <a href="https://arxiv.org/abs/2210.10760">Scaling Laws for Reward Model Overoptimization</a>. <em>ICML 2023</em>.</p>
<ul>
<li>Formalizes reward hacking and overoptimization in RLHF.</li>
</ul>
</li>
<li>
<p><strong>Xu, S., Fu, W., Gao, J., Ye, W., Liu, W., Mei, Z., Wang, G., Yu, C., &amp; Wu, Y. (2024).</strong> <a href="https://arxiv.org/abs/2404.10719">Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study</a>. <em>ICML 2024</em>.</p>
<ul>
<li>Systematic comparison showing PPO outperforms DPO when properly tuned.</li>
</ul>
</li>
<li>
<p><strong>Wen, Y., Zhang, Z., Jiao, H., Yang, M., Zhang, H., &amp; Wang, G. (2024).</strong> <a href="https://arxiv.org/abs/2406.01239">From RLHF to RLHF: The Dilemma of Improving Human Alignment</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>Analysis of reward hacking showing approval ratings can increase while correctness decreases.</li>
</ul>
</li>
<li>
<p><strong>Huang, S., Noukhovitch, M., Hosseini, A., Rasul, K., Wang, W., &amp; Tunstall, L. (2024).</strong> <a href="https://arxiv.org/abs/2403.17031">The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>Practical insights on PPO hyperparameter sensitivity and training diagnostics.</li>
</ul>
</li>
<li>
<p><strong>Bradley, R. A., &amp; Terry, M. E. (1952).</strong> Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. <em>Biometrika, 39</em>(3/4), 324-345.</p>
<ul>
<li>The original paired comparison model adapted for preference learning.</li>
</ul>
</li>
<li>
<p><strong>Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., &hellip; &amp; Wang, M. (2025).</strong> <a href="https://arxiv.org/abs/2503.14476">DAPO: An Open-Source LLM Reinforcement Learning System at Scale</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>Addresses GRPO training instabilities with dynamic sampling and clip-higher strategy.</li>
</ul>
</li>
</ol>
]]></content:encoded></item><item><title>Dissecting OpenClaw: An Interactive Architecture Map</title><link>https://www.mdjawad.com/posts/openclaw-architecture/</link><pubDate>Mon, 16 Feb 2026 12:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/openclaw-architecture/</guid><description>An interactive visual exploration of OpenClaw — the open-source AI agent that broke GitHub. Explore its three-layer architecture, two key primitives, memory system, and composable system prompt.</description><content:encoded><![CDATA[<h2 id="the-big-picture">The Big Picture</h2>
<p><a href="https://github.com/openclaw/openclaw">OpenClaw</a> is a 430K-line TypeScript project that turns any messaging platform into an interface for an autonomous AI agent. Instead of reading about it, explore the architecture below.</p>




<style>
.hero-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: linear-gradient(180deg, #0f172a 0%, #1e293b 100%);
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #e2e8f0;
  overflow: hidden;
}

.hero-a9e9caa2fa6bb4d9d711a5907e59188d * {
  box-sizing: border-box;
}

.hero-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  margin-bottom: 1.5rem;
}

.hero-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.5rem;
}

.hero-subtitle-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.9rem;
  color: #94a3b8;
  max-width: 600px;
  margin: 0 auto;
}

 
.hero-stats-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  justify-content: center;
  gap: 2rem;
  flex-wrap: wrap;
  margin-bottom: 2rem;
  padding: 0.75rem 1rem;
  background: rgba(255, 255, 255, 0.05);
  border-radius: 10px;
  border: 1px solid rgba(255, 255, 255, 0.08);
}

.hero-stat-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
}

.hero-stat-val-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1rem;
  font-weight: 700;
  color: #f8fafc;
}

.hero-stat-label-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #64748b;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

 
.hero-diagram-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: grid;
  grid-template-columns: 1fr auto 1fr;
  gap: 1rem;
  align-items: center;
  margin-bottom: 1.5rem;
  min-height: 380px;
}

 
.hero-spokes-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
}

.hero-spokes-left-a9e9caa2fa6bb4d9d711a5907e59188d {
  align-items: flex-end;
}

.hero-spokes-right-a9e9caa2fa6bb4d9d711a5907e59188d {
  align-items: flex-start;
}

 
.hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 8px;
  padding: 0.5rem 0.75rem;
  cursor: default;
  transition: all 0.3s ease;
  display: flex;
  align-items: center;
  gap: 0.5rem;
  min-width: 130px;
  position: relative;
  opacity: 0;
  transform: translateX(0);
}

.hero-spokes-left-a9e9caa2fa6bb4d9d711a5907e59188d .hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d {
  transform: translateX(-20px);
}

.hero-spokes-right-a9e9caa2fa6bb4d9d711a5907e59188d .hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d {
  transform: translateX(20px);
}

.hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d.visible {
  opacity: 1;
  transform: translateX(0);
}

.hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  border-color: #64748b;
  transform: translateY(-1px);
}

.hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 28px;
  height: 28px;
  border-radius: 6px;
  display: flex;
  align-items: center;
  justify-content: center;
  flex-shrink: 0;
  font-size: 0.85rem;
}

.hero-spoke-icon-channel-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(168, 85, 247, 0.2);
  color: #c084fc;
}

.hero-spoke-icon-runtime-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(236, 72, 153, 0.2);
  color: #f472b6;
}

.hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.75rem;
  font-weight: 600;
  color: #f1f5f9;
}

.hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #64748b;
}

 
.hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d {
  position: absolute;
  bottom: calc(100% + 8px);
  left: 50%;
  transform: translateX(-50%);
  background: #1e293b;
  border: 1px solid #475569;
  border-radius: 6px;
  padding: 0.4rem 0.6rem;
  font-size: 0.65rem;
  color: #cbd5e1;
  white-space: nowrap;
  pointer-events: none;
  opacity: 0;
  transition: opacity 0.2s ease;
  z-index: 20;
}

.hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d:hover .hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d {
  opacity: 1;
}

 
.hero-hub-a9e9caa2fa6bb4d9d711a5907e59188d {
  position: relative;
  width: 160px;
  height: 160px;
  border-radius: 50%;
  background: linear-gradient(135deg, rgba(59, 130, 246, 0.2) 0%, rgba(59, 130, 246, 0.1) 100%);
  border: 3px solid rgba(59, 130, 246, 0.6);
  display: flex;
  flex-direction: column;
  align-items: center;
  justify-content: center;
  box-shadow: 0 0 40px rgba(59, 130, 246, 0.2);
  opacity: 0;
  transform: scale(0.5);
  transition: all 0.6s cubic-bezier(0.34, 1.56, 0.64, 1);
  z-index: 10;
}

.hero-hub-a9e9caa2fa6bb4d9d711a5907e59188d.visible {
  opacity: 1;
  transform: scale(1);
}

.hero-hub-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  box-shadow: 0 0 60px rgba(59, 130, 246, 0.35);
}

.hero-hub-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1.5rem;
  margin-bottom: 0.25rem;
}

.hero-hub-label-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.9rem;
  font-weight: 700;
  color: #60a5fa;
}

.hero-hub-sub-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #94a3b8;
}

 
.hero-connectors-a9e9caa2fa6bb4d9d711a5907e59188d {
  position: absolute;
  top: 50%;
  width: 100%;
  display: flex;
  justify-content: space-between;
  pointer-events: none;
}

 
.hero-primitives-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1rem;
  margin-bottom: 1.5rem;
}

.hero-primitive-a9e9caa2fa6bb4d9d711a5907e59188d {
  border-radius: 12px;
  padding: 1rem;
  cursor: pointer;
  transition: all 0.3s ease;
  position: relative;
  opacity: 0;
  transform: translateY(10px);
}

.hero-primitive-a9e9caa2fa6bb4d9d711a5907e59188d.visible {
  opacity: 1;
  transform: translateY(0);
}

.hero-prim-invoke-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(168, 85, 247, 0.1);
  border: 2px solid rgba(168, 85, 247, 0.4);
}

.hero-prim-invoke-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  border-color: #a855f7;
  box-shadow: 0 0 24px rgba(168, 85, 247, 0.3);
}

.hero-prim-memory-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(236, 72, 153, 0.1);
  border: 2px solid rgba(236, 72, 153, 0.4);
}

.hero-prim-memory-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  border-color: #ec4899;
  box-shadow: 0 0 24px rgba(236, 72, 153, 0.3);
}

.hero-prim-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  margin-bottom: 0.5rem;
}

.hero-prim-dot-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 10px;
  height: 10px;
  border-radius: 50%;
  animation: hero-pulse-a9e9caa2fa6bb4d9d711a5907e59188d 2s ease-in-out infinite;
}

.hero-prim-invoke-a9e9caa2fa6bb4d9d711a5907e59188d .hero-prim-dot-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #a855f7;
  box-shadow: 0 0 8px #a855f7;
}

.hero-prim-memory-a9e9caa2fa6bb4d9d711a5907e59188d .hero-prim-dot-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #ec4899;
  box-shadow: 0 0 8px #ec4899;
}

@keyframes hero-pulse-a9e9caa2fa6bb4d9d711a5907e59188d {
  0%, 100% { opacity: 1; }
  50% { opacity: 0.4; }
}

.hero-prim-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.85rem;
  font-weight: 700;
  color: #f8fafc;
}

.hero-prim-desc-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.75rem;
  color: #94a3b8;
  line-height: 1.5;
  margin-bottom: 0.5rem;
}

 
.hero-prim-details-a9e9caa2fa6bb4d9d711a5907e59188d {
  max-height: 0;
  overflow: hidden;
  transition: max-height 0.4s ease, padding 0.4s ease;
}

.hero-prim-details-a9e9caa2fa6bb4d9d711a5907e59188d.expanded {
  max-height: 300px;
  padding-top: 0.5rem;
}

.hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  padding: 0.3rem 0;
  font-size: 0.7rem;
  color: #cbd5e1;
}

.hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 6px;
  height: 6px;
  border-radius: 50%;
  flex-shrink: 0;
}

.hero-prim-invoke-a9e9caa2fa6bb4d9d711a5907e59188d .hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #c084fc;
}

.hero-prim-memory-a9e9caa2fa6bb4d9d711a5907e59188d .hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #f472b6;
}

.hero-prim-expand-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.65rem;
  color: #64748b;
  margin-top: 0.25rem;
}

.hero-primitive-a9e9caa2fa6bb4d9d711a5907e59188d.active .hero-prim-expand-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: none;
}

 
.hero-footer-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  padding: 0.75rem;
  background: rgba(245, 158, 11, 0.08);
  border: 1px solid rgba(245, 158, 11, 0.25);
  border-radius: 10px;
}

.hero-footer-a9e9caa2fa6bb4d9d711a5907e59188d p {
  font-size: 0.8rem;
  color: #fbbf24;
  margin: 0;
}

.hero-footer-a9e9caa2fa6bb4d9d711a5907e59188d strong {
  color: #fcd34d;
}

 
@media (max-width: 850px) {
  .hero-diagram-a9e9caa2fa6bb4d9d711a5907e59188d {
    grid-template-columns: 1fr;
    gap: 1.5rem;
  }

  .hero-hub-a9e9caa2fa6bb4d9d711a5907e59188d {
    margin: 0 auto;
  }

  .hero-spokes-left-a9e9caa2fa6bb4d9d711a5907e59188d,
  .hero-spokes-right-a9e9caa2fa6bb4d9d711a5907e59188d {
    align-items: center;
    flex-direction: row;
    flex-wrap: wrap;
    justify-content: center;
  }

  .hero-primitives-a9e9caa2fa6bb4d9d711a5907e59188d {
    grid-template-columns: 1fr;
  }
}

@media (max-width: 600px) {
  .hero-a9e9caa2fa6bb4d9d711a5907e59188d {
    padding: 1.25rem;
  }

  .hero-stats-a9e9caa2fa6bb4d9d711a5907e59188d {
    gap: 1rem;
  }

  .hero-hub-a9e9caa2fa6bb4d9d711a5907e59188d {
    width: 120px;
    height: 120px;
  }

  .hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d {
    min-width: 110px;
    padding: 0.4rem 0.6rem;
  }
}
</style>

<div class="hero-a9e9caa2fa6bb4d9d711a5907e59188d">
  <div class="hero-header-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="hero-title-a9e9caa2fa6bb4d9d711a5907e59188d">OpenClaw: The Full Topology</div>
    <div class="hero-subtitle-a9e9caa2fa6bb4d9d711a5907e59188d">A hub-and-spoke architecture that composes familiar systems abstractions into an autonomous AI agent.</div>
  </div>

  
  <div class="hero-stats-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="hero-stat-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="hero-stat-val-a9e9caa2fa6bb4d9d711a5907e59188d">430K</div>
      <div class="hero-stat-label-a9e9caa2fa6bb4d9d711a5907e59188d">Lines TypeScript</div>
    </div>
    <div class="hero-stat-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="hero-stat-val-a9e9caa2fa6bb4d9d711a5907e59188d">15+</div>
      <div class="hero-stat-label-a9e9caa2fa6bb4d9d711a5907e59188d">Channels</div>
    </div>
    <div class="hero-stat-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="hero-stat-val-a9e9caa2fa6bb4d9d711a5907e59188d">5,705+</div>
      <div class="hero-stat-label-a9e9caa2fa6bb4d9d711a5907e59188d">Skills</div>
    </div>
    <div class="hero-stat-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="hero-stat-val-a9e9caa2fa6bb4d9d711a5907e59188d">197K</div>
      <div class="hero-stat-label-a9e9caa2fa6bb4d9d711a5907e59188d">GitHub Stars</div>
    </div>
  </div>

  
  <div class="hero-diagram-a9e9caa2fa6bb4d9d711a5907e59188d">
    
    <div class="hero-spokes-a9e9caa2fa6bb4d9d711a5907e59188d hero-spokes-left-a9e9caa2fa6bb4d9d711a5907e59188d" id="hero-left-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="whatsapp">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-channel-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M21 11.5a8.38 8.38 0 0 1-.9 3.8 8.5 8.5 0 0 1-7.6 4.7 8.38 8.38 0 0 1-3.8-.9L3 21l1.9-5.7a8.38 8.38 0 0 1-.9-3.8 8.5 8.5 0 0 1 4.7-7.6 8.38 8.38 0 0 1 3.8-.9h.5a8.48 8.48 0 0 1 8 8v.5z"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">WhatsApp</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">Baileys</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Unofficial WA Web API via Baileys library</div>
      </div>
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="telegram">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-channel-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M22 2L11 13"/><path d="M22 2l-7 20-4-9-9-4 20-7z"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">Telegram</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">grammY</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Bot API via grammY framework</div>
      </div>
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="discord">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-channel-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><circle cx="9" cy="12" r="1"/><circle cx="15" cy="12" r="1"/><path d="M7.5 7.5c3.5-1 5.5-1 9 0"/><path d="M7 16.5c3.5 1 6.5 1 10 0"/><path d="M15.5 17c0 1 1.5 3 2 3 1.5 0 2.833-1.667 3.5-3 .667-1.667.5-5.833-1.5-11.5-1.457-1.015-3-1.34-4.5-1.5l-1 2.5"/><path d="M8.5 17c0 1-1.356 3-1.832 3-1.429 0-2.698-1.667-3.333-3-.635-1.667-.476-5.833 1.428-11.5C6.151 4.485 7.545 4.16 9 4l1 2.5"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">Discord</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">discord.js</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Rich embeds, slash commands, voice</div>
      </div>
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="imessage">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-channel-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M21 15a2 2 0 0 1-2 2H7l-4 4V5a2 2 0 0 1 2-2h14a2 2 0 0 1 2 2z"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">iMessage</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">native macOS</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Native macOS AppleScript bridge</div>
      </div>
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="slack">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-channel-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><rect x="13" y="2" width="3" height="8" rx="1.5"/><path d="M19 8.5V10h1.5A1.5 1.5 0 1 0 19 8.5"/><rect x="8" y="14" width="3" height="8" rx="1.5"/><path d="M5 15.5V14H3.5A1.5 1.5 0 1 0 5 15.5"/><rect x="14" y="13" width="8" height="3" rx="1.5"/><path d="M15.5 19H14v1.5a1.5 1.5 0 1 0 1.5-1.5"/><rect x="2" y="8" width="8" height="3" rx="1.5"/><path d="M8.5 5H10V3.5A1.5 1.5 0 1 0 8.5 5"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">Slack</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">Bolt</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Workspace bot via Slack Bolt SDK</div>
      </div>
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="more-channels">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-channel-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><circle cx="12" cy="12" r="1"/><circle cx="19" cy="12" r="1"/><circle cx="5" cy="12" r="1"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">+10 more</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">Matrix, Email...</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Matrix, Gmail, Voice, SMS, and more</div>
      </div>
    </div>

    
    <div class="hero-hub-a9e9caa2fa6bb4d9d711a5907e59188d" id="hero-hub-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="hero-hub-icon-a9e9caa2fa6bb4d9d711a5907e59188d">
        <svg width="32" height="32" viewBox="0 0 24 24" fill="none" stroke="#60a5fa" stroke-width="2">
          <path d="M12 2L2 7l10 5 10-5-10-5z"/>
          <path d="M2 17l10 5 10-5"/>
          <path d="M2 12l10 5 10-5"/>
        </svg>
      </div>
      <div class="hero-hub-label-a9e9caa2fa6bb4d9d711a5907e59188d">Gateway</div>
      <div class="hero-hub-sub-a9e9caa2fa6bb4d9d711a5907e59188d">WebSocket + Scheduler</div>
    </div>

    
    <div class="hero-spokes-a9e9caa2fa6bb4d9d711a5907e59188d hero-spokes-right-a9e9caa2fa6bb4d9d711a5907e59188d" id="hero-right-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="session">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-runtime-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M17 21v-2a4 4 0 0 0-4-4H5a4 4 0 0 0-4 4v2"/><circle cx="9" cy="7" r="4"/><path d="M23 21v-2a4 4 0 0 0-3-3.87"/><path d="M16 3.13a4 4 0 0 1 0 7.75"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">Session Resolver</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">namespace isolation</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Isolates each conversation's state</div>
      </div>
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="context">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-runtime-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">Context Assembler</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">prompt building</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Builds prompt from AGENTS.md + SOUL.md + memory</div>
      </div>
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="llm">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-runtime-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M12 2a10 10 0 1 0 10 10"/><path d="M12 2v10l7-5"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">Streaming LLM</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">Claude, GPT, etc.</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Streaming inference with any provider</div>
      </div>
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="tools">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-runtime-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M14.7 6.3a1 1 0 0 0 0 1.4l1.6 1.6a1 1 0 0 0 1.4 0l3.77-3.77a6 6 0 0 1-7.94 7.94l-6.91 6.91a2.12 2.12 0 0 1-3-3l6.91-6.91a6 6 0 0 1 7.94-7.94l-3.76 3.76z"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">Tool Executor</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">5,705+ skills</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Executes skills, web search, calendar, code</div>
      </div>
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="state">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-runtime-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M12 20h9"/><path d="M16.5 3.5a2.121 2.121 0 0 1 3 3L7 19l-4 1 1-4L16.5 3.5z"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">State Persister</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">JSONL + SQLite</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Durable state in JSONL logs + SQLite</div>
      </div>
      <div class="hero-spoke-a9e9caa2fa6bb4d9d711a5907e59188d" data-spoke="memory">
        <div class="hero-spoke-icon-a9e9caa2fa6bb4d9d711a5907e59188d hero-spoke-icon-runtime-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M4 19.5A2.5 2.5 0 0 1 6.5 17H20"/><path d="M6.5 2H20v20H6.5A2.5 2.5 0 0 1 4 19.5v-15A2.5 2.5 0 0 1 6.5 2z"/></svg>
        </div>
        <div>
          <div class="hero-spoke-name-a9e9caa2fa6bb4d9d711a5907e59188d">Memory System</div>
          <div class="hero-spoke-hint-a9e9caa2fa6bb4d9d711a5907e59188d">MEMORY.md + search</div>
        </div>
        <div class="hero-tooltip-a9e9caa2fa6bb4d9d711a5907e59188d">Virtual memory: MEMORY.md, daily logs, embeddings</div>
      </div>
    </div>
  </div>

  
  <div class="hero-primitives-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="hero-primitive-a9e9caa2fa6bb4d9d711a5907e59188d hero-prim-invoke-a9e9caa2fa6bb4d9d711a5907e59188d" id="hero-prim1-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="hero-prim-header-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="hero-prim-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>
        <div class="hero-prim-title-a9e9caa2fa6bb4d9d711a5907e59188d">Primitive 1: Autonomous Invocation</div>
      </div>
      <div class="hero-prim-desc-a9e9caa2fa6bb4d9d711a5907e59188d">The agent doesn't wait for messages. It can wake itself via cron, webhooks, voice, heartbeats, or Pub/Sub triggers, each scoped to an isolated session.</div>
      <div class="hero-prim-expand-a9e9caa2fa6bb4d9d711a5907e59188d">Click to expand trigger types</div>
      <div class="hero-prim-details-a9e9caa2fa6bb4d9d711a5907e59188d" id="hero-prim1-details-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Cron schedules (daily summaries, check-ins)</div>
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Webhooks (GitHub, Stripe, custom)</div>
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Voice wake word detection</div>
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Heartbeat / keep-alive pings</div>
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Gmail Pub/Sub (email-triggered actions)</div>
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Session isolation per conversation</div>
      </div>
    </div>

    <div class="hero-primitive-a9e9caa2fa6bb4d9d711a5907e59188d hero-prim-memory-a9e9caa2fa6bb4d9d711a5907e59188d" id="hero-prim2-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="hero-prim-header-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="hero-prim-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>
        <div class="hero-prim-title-a9e9caa2fa6bb4d9d711a5907e59188d">Primitive 2: Externalized Memory</div>
      </div>
      <div class="hero-prim-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Long-term memory lives on disk, not in the context window. The agent pages knowledge in and out like an OS manages virtual memory.</div>
      <div class="hero-prim-expand-a9e9caa2fa6bb4d9d711a5907e59188d">Click to expand memory components</div>
      <div class="hero-prim-details-a9e9caa2fa6bb4d9d711a5907e59188d" id="hero-prim2-details-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>MEMORY.md (persistent knowledge base)</div>
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Daily logs (memory/YYYY-MM-DD.md)</div>
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>SQLite (structured session data)</div>
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Hybrid search (BM25 + vector similarity)</div>
        <div class="hero-prim-detail-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="hero-prim-detail-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>/compact command for context paging</div>
      </div>
    </div>
  </div>

  
  <div class="hero-footer-a9e9caa2fa6bb4d9d711a5907e59188d">
    <p><strong>"Composition, Not Invention"</strong> — familiar systems abstractions (message queues, schedulers, filesystems, virtual memory) composed into something new.</p>
  </div>
</div>
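<p>The memory card lists hybrid search (BM25 + vector similarity) without saying how the two result lists are merged. One common option is reciprocal rank fusion, sketched below. This is an assumption for illustration, not a description of the project's actual ranking code; the function name, the document ids, and the constant <code>k = 60</code> are all hypothetical choices.</p>

```typescript
// Reciprocal rank fusion (RRF): each ranking contributes 1 / (k + rank) to a
// document's score, with rank starting at 1. Higher total score ranks first.
// Hypothetical helper; the post does not say how its hybrid search fuses results.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  // Prototype-free object so ids such as "constructor" cannot collide with built-ins.
  const scores: { [id: string]: number } = Object.create(null);
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores[id] = (scores[id] || 0) + 1 / (k + index + 1);
    });
  }
  // Sort ids by descending fused score.
  return Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
}

// Example: one keyword (BM25) ranking and one vector-similarity ranking
// over the same three memory chunks (ids are made up).
const bm25Ranking = ["a", "b", "c"];
const vectorRanking = ["b", "c", "a"];
const fused = reciprocalRankFusion([bm25Ranking, vectorRanking]);
console.log(fused.join(","));
```

<p>Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on different scales, which is why it is a reasonable default when nothing else is known about the two scorers.</p>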

<script>
(function() {
  var id = 'a9e9caa2fa6bb4d9d711a5907e59188d';

  
  // Cache the hub, both spoke columns, and the two primitive cards for this widget instance.
  var hub = document.getElementById('hero-hub-' + id);
  var leftSpokes = document.querySelectorAll('#hero-left-' + id + ' .hero-spoke-' + id);
  var rightSpokes = document.querySelectorAll('#hero-right-' + id + ' .hero-spoke-' + id);
  var prim1 = document.getElementById('hero-prim1-' + id);
  var prim2 = document.getElementById('hero-prim2-' + id);

  
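  // Reveal the hub first, 300 ms after the script runs.
  setTimeout(function() {
    if (hub) hub.classList.add('visible');
  }, 300);

  // Reveal the spokes one by one, 100 ms apart, starting at 600 ms.
  // The IIFE captures each element and delay, because var is function-scoped.
  for (var i = 0; i < leftSpokes.length; i++) {
    (function(el, delay) {
      setTimeout(function() { el.classList.add('visible'); }, delay);
    })(leftSpokes[i], 600 + i * 100);
  }

  // The right column uses the same schedule, so both columns animate in parallel.
  for (var j = 0; j < rightSpokes.length; j++) {
    (function(el, delay) {
      setTimeout(function() { el.classList.add('visible'); }, delay);
    })(rightSpokes[j], 600 + j * 100);
  }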

  
  // Reveal the two primitive cards after the longer spoke column has been scheduled, 150 ms apart.
  var spokeDelay = 600 + Math.max(leftSpokes.length, rightSpokes.length) * 100 + 200;
  setTimeout(function() {
    if (prim1) prim1.classList.add('visible');
  }, spokeDelay);
  setTimeout(function() {
    if (prim2) prim2.classList.add('visible');
  }, spokeDelay + 150);

  
  // Clicking a primitive card toggles its highlight and expands or collapses its detail list.
  if (prim1) {
    prim1.addEventListener('click', function() {
      prim1.classList.toggle('active');
      var details = document.getElementById('hero-prim1-details-' + id);
      if (details) details.classList.toggle('expanded');
    });
  }

  if (prim2) {
    prim2.addEventListener('click', function() {
      prim2.classList.toggle('active');
      var details = document.getElementById('hero-prim2-details-' + id);
      if (details) details.classList.toggle('expanded');
    });
  }
})();
</script>

<h2 id="three-layer-architecture">Three-Layer Architecture</h2>
<p>The hub-and-spoke topology collapses into three clean layers, each with a single responsibility.</p>




<style>
.ocl-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: linear-gradient(180deg, #0f172a 0%, #1e293b 100%);
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #e2e8f0;
}

.ocl-a9e9caa2fa6bb4d9d711a5907e59188d * {
  box-sizing: border-box;
}

.ocl-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  margin-bottom: 1.5rem;
}

.ocl-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.5rem;
}

.ocl-subtitle-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.9rem;
  color: #94a3b8;
  max-width: 600px;
  margin: 0 auto;
}

 
.ocl-content-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: grid;
  grid-template-columns: 1fr 300px;
  gap: 1.5rem;
  align-items: start;
}

@media (max-width: 850px) {
  .ocl-content-a9e9caa2fa6bb4d9d711a5907e59188d {
    grid-template-columns: 1fr;
  }
}

 
.ocl-layer-a9e9caa2fa6bb4d9d711a5907e59188d {
  position: relative;
  border-radius: 12px;
  padding: 1rem;
  margin-bottom: 0.75rem;
  border: 2px solid;
  transition: all 0.3s ease;
}

.ocl-layer-a9e9caa2fa6bb4d9d711a5907e59188d:last-child {
  margin-bottom: 0;
}

.ocl-layer-gw-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(59, 130, 246, 0.1);
  border-color: rgba(59, 130, 246, 0.4);
}

.ocl-layer-ch-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(168, 85, 247, 0.1);
  border-color: rgba(168, 85, 247, 0.4);
}

.ocl-layer-rt-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(236, 72, 153, 0.1);
  border-color: rgba(236, 72, 153, 0.4);
}

.ocl-layer-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  justify-content: space-between;
  margin-bottom: 0.75rem;
}

.ocl-layer-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.85rem;
  font-weight: 700;
  text-transform: uppercase;
  letter-spacing: 0.05em;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.ocl-layer-gw-a9e9caa2fa6bb4d9d711a5907e59188d .ocl-layer-title-a9e9caa2fa6bb4d9d711a5907e59188d { color: #60a5fa; }
.ocl-layer-ch-a9e9caa2fa6bb4d9d711a5907e59188d .ocl-layer-title-a9e9caa2fa6bb4d9d711a5907e59188d { color: #c084fc; }
.ocl-layer-rt-a9e9caa2fa6bb4d9d711a5907e59188d .ocl-layer-title-a9e9caa2fa6bb4d9d711a5907e59188d { color: #f472b6; }

.ocl-layer-desc-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.72rem;
  color: #94a3b8;
  margin-bottom: 0.6rem;
  line-height: 1.4;
}

.ocl-role-badge-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.65rem;
  padding: 0.25rem 0.5rem;
  border-radius: 4px;
  font-weight: 500;
  background: rgba(255, 255, 255, 0.1);
  color: #94a3b8;
}

 
.ocl-components-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  gap: 0.5rem;
  flex-wrap: wrap;
  justify-content: center;
}

.ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 8px;
  padding: 0.6rem 0.8rem;
  cursor: pointer;
  transition: all 0.2s ease;
  text-align: center;
  min-width: 85px;
}

.ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  transform: translateY(-2px);
  border-color: #64748b;
}

.ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d.selected {
  transform: scale(1.05);
}

.ocl-layer-gw-a9e9caa2fa6bb4d9d711a5907e59188d .ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d.selected {
  border-color: #3b82f6;
  box-shadow: 0 0 20px rgba(59, 130, 246, 0.4);
}

.ocl-layer-ch-a9e9caa2fa6bb4d9d711a5907e59188d .ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d.selected {
  border-color: #a855f7;
  box-shadow: 0 0 20px rgba(168, 85, 247, 0.4);
}

.ocl-layer-rt-a9e9caa2fa6bb4d9d711a5907e59188d .ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d.selected {
  border-color: #ec4899;
  box-shadow: 0 0 20px rgba(236, 72, 153, 0.4);
}

.ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.75rem;
  font-weight: 600;
  color: #f1f5f9;
  margin-bottom: 0.2rem;
}

.ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #64748b;
}

 
.ocl-connector-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  justify-content: center;
  height: 36px;
  position: relative;
  margin: 0.25rem 0;
}

.ocl-connector-arrow-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  flex-direction: column;
  align-items: center;
  position: relative;
}

.ocl-connector-arrow-a9e9caa2fa6bb4d9d711a5907e59188d svg {
  width: 24px;
  height: 24px;
}

.ocl-connector-label-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  font-weight: 500;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

.ocl-conn-blue-a9e9caa2fa6bb4d9d711a5907e59188d svg { color: #60a5fa; }
.ocl-conn-blue-a9e9caa2fa6bb4d9d711a5907e59188d .ocl-connector-label-a9e9caa2fa6bb4d9d711a5907e59188d { color: #60a5fa; }
.ocl-conn-purple-a9e9caa2fa6bb4d9d711a5907e59188d svg { color: #c084fc; }
.ocl-conn-purple-a9e9caa2fa6bb4d9d711a5907e59188d .ocl-connector-label-a9e9caa2fa6bb4d9d711a5907e59188d { color: #c084fc; }

 
.ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d {
  position: absolute;
  width: 6px;
  height: 6px;
  border-radius: 50%;
  animation: ocl-particle-flow-a9e9caa2fa6bb4d9d711a5907e59188d 2s ease-in-out infinite;
}

.ocl-conn-blue-a9e9caa2fa6bb4d9d711a5907e59188d .ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #60a5fa;
  box-shadow: 0 0 8px #60a5fa;
}

.ocl-conn-purple-a9e9caa2fa6bb4d9d711a5907e59188d .ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #c084fc;
  box-shadow: 0 0 8px #c084fc;
}

@keyframes ocl-particle-flow-a9e9caa2fa6bb4d9d711a5907e59188d {
  0% { transform: translateY(-12px); opacity: 0; }
  20% { opacity: 1; }
  80% { opacity: 1; }
  100% { transform: translateY(12px); opacity: 0; }
}

.ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d.p1 { animation-delay: 0s; }
.ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d.p2 { animation-delay: 0.5s; }
.ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d.p3 { animation-delay: 1s; }

 
.ocl-info-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 12px;
  padding: 1.25rem;
  position: sticky;
  top: 1rem;
}

.ocl-info-placeholder-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  color: #64748b;
  padding: 2rem 1rem;
}

.ocl-info-placeholder-a9e9caa2fa6bb4d9d711a5907e59188d svg {
  width: 40px;
  height: 40px;
  margin-bottom: 0.75rem;
  opacity: 0.5;
}

.ocl-info-placeholder-a9e9caa2fa6bb4d9d711a5907e59188d p {
  font-size: 0.8rem;
  margin: 0;
}

.ocl-info-content-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: none;
}

.ocl-info-content-a9e9caa2fa6bb4d9d711a5907e59188d.active {
  display: block;
  animation: ocl-fade-in-a9e9caa2fa6bb4d9d711a5907e59188d 0.3s ease;
}

@keyframes ocl-fade-in-a9e9caa2fa6bb4d9d711a5907e59188d {
  from { opacity: 0; transform: translateY(5px); }
  to { opacity: 1; transform: translateY(0); }
}

.ocl-info-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.75rem;
  margin-bottom: 1rem;
}

.ocl-info-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 36px;
  height: 36px;
  border-radius: 8px;
  display: flex;
  align-items: center;
  justify-content: center;
}

.ocl-info-icon-a9e9caa2fa6bb4d9d711a5907e59188d.gw { background: rgba(59, 130, 246, 0.2); color: #60a5fa; }
.ocl-info-icon-a9e9caa2fa6bb4d9d711a5907e59188d.ch { background: rgba(168, 85, 247, 0.2); color: #c084fc; }
.ocl-info-icon-a9e9caa2fa6bb4d9d711a5907e59188d.rt { background: rgba(236, 72, 153, 0.2); color: #f472b6; }

.ocl-info-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.2rem;
}

.ocl-info-badge-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  padding: 0.2rem 0.4rem;
  border-radius: 4px;
  font-weight: 600;
  text-transform: uppercase;
}

.ocl-info-badge-a9e9caa2fa6bb4d9d711a5907e59188d.gw { background: rgba(59, 130, 246, 0.2); color: #60a5fa; }
.ocl-info-badge-a9e9caa2fa6bb4d9d711a5907e59188d.ch { background: rgba(168, 85, 247, 0.2); color: #c084fc; }
.ocl-info-badge-a9e9caa2fa6bb4d9d711a5907e59188d.rt { background: rgba(236, 72, 153, 0.2); color: #f472b6; }

.ocl-info-desc-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.8rem;
  color: #cbd5e1;
  line-height: 1.6;
  margin-bottom: 1rem;
}

.ocl-info-section-a9e9caa2fa6bb4d9d711a5907e59188d h4 {
  font-size: 0.7rem;
  color: #64748b;
  margin: 0 0 0.5rem 0;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

.ocl-info-section-a9e9caa2fa6bb4d9d711a5907e59188d ul {
  margin: 0 0 0.75rem 0;
  padding: 0;
  list-style: none;
}

.ocl-info-section-a9e9caa2fa6bb4d9d711a5907e59188d li {
  font-size: 0.75rem;
  color: #94a3b8;
  padding: 0.3rem 0;
  padding-left: 1rem;
  position: relative;
}

.ocl-info-section-a9e9caa2fa6bb4d9d711a5907e59188d li::before {
  content: '\2022';
  position: absolute;
  left: 0;
  color: #64748b;
}

.ocl-info-latency-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: inline-block;
  font-size: 0.65rem;
  padding: 0.2rem 0.5rem;
  background: rgba(34, 197, 94, 0.15);
  border: 1px solid rgba(34, 197, 94, 0.3);
  border-radius: 4px;
  color: #4ade80;
  font-weight: 600;
}

 
.ocl-footer-a9e9caa2fa6bb4d9d711a5907e59188d {
  margin-top: 1.5rem;
  padding: 1rem;
  background: rgba(245, 158, 11, 0.1);
  border: 1px solid rgba(245, 158, 11, 0.3);
  border-radius: 10px;
  text-align: center;
}

.ocl-footer-a9e9caa2fa6bb4d9d711a5907e59188d p {
  font-size: 0.8rem;
  color: #fbbf24;
  margin: 0;
  line-height: 1.5;
}

.ocl-footer-a9e9caa2fa6bb4d9d711a5907e59188d strong {
  color: #fcd34d;
}

 
@media (max-width: 600px) {
  .ocl-a9e9caa2fa6bb4d9d711a5907e59188d {
    padding: 1.25rem;
  }

  .ocl-components-a9e9caa2fa6bb4d9d711a5907e59188d {
    gap: 0.35rem;
  }

  .ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d {
    min-width: 70px;
    padding: 0.5rem 0.6rem;
  }

  .ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d {
    font-size: 0.65rem;
  }

  .ocl-layer-desc-a9e9caa2fa6bb4d9d711a5907e59188d {
    display: none;
  }

  .ocl-info-a9e9caa2fa6bb4d9d711a5907e59188d {
    margin-top: 0.5rem;
  }
}
</style>

<div class="ocl-a9e9caa2fa6bb4d9d711a5907e59188d">
  <div class="ocl-header-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="ocl-title-a9e9caa2fa6bb4d9d711a5907e59188d">The Three-Layer Architecture</div>
    <div class="ocl-subtitle-a9e9caa2fa6bb4d9d711a5907e59188d">Here's how messages move through the system — from incoming connection to LLM response. Click any component for details.</div>
  </div>

  <div class="ocl-content-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="ocl-stack-a9e9caa2fa6bb4d9d711a5907e59188d">

      
      <div class="ocl-layer-a9e9caa2fa6bb4d9d711a5907e59188d ocl-layer-gw-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="ocl-layer-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="ocl-layer-title-a9e9caa2fa6bb4d9d711a5907e59188d">
            <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <path d="M12 2L2 7l10 5 10-5-10-5z"/>
              <path d="M2 17l10 5 10-5"/>
              <path d="M2 12l10 5 10-5"/>
            </svg>
            Layer 1: Gateway
          </div>
          <span class="ocl-role-badge-a9e9caa2fa6bb4d9d711a5907e59188d">Routing + Scheduling</span>
        </div>
        <div class="ocl-layer-desc-a9e9caa2fa6bb4d9d711a5907e59188d">This is where connections come in. The gateway handles routing, payload validation, and scheduled triggers.</div>
        <div class="ocl-components-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="ws-server">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">WebSocket Server</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">connections</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="channel-mgr">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Channel Manager</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">routing</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="scheduler">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Scheduler</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">cron + triggers</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="control-ui">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Control UI</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">dashboard</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="typebox">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">TypeBox Validator</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">schema</div>
          </div>
        </div>
      </div>

      
      <div class="ocl-connector-a9e9caa2fa6bb4d9d711a5907e59188d ocl-conn-blue-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="ocl-connector-arrow-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d p1"></div>
          <div class="ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d p2"></div>
          <div class="ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d p3"></div>
          <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <path d="M12 5v14M5 12l7 7 7-7"/>
          </svg>
          <span class="ocl-connector-label-a9e9caa2fa6bb4d9d711a5907e59188d">route</span>
        </div>
      </div>

      
      <div class="ocl-layer-a9e9caa2fa6bb4d9d711a5907e59188d ocl-layer-ch-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="ocl-layer-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="ocl-layer-title-a9e9caa2fa6bb4d9d711a5907e59188d">
            <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <path d="M21 11.5a8.38 8.38 0 0 1-.9 3.8 8.5 8.5 0 0 1-7.6 4.7 8.38 8.38 0 0 1-3.8-.9L3 21l1.9-5.7a8.38 8.38 0 0 1-.9-3.8 8.5 8.5 0 0 1 4.7-7.6 8.38 8.38 0 0 1 3.8-.9h.5a8.48 8.48 0 0 1 8 8v.5z"/>
            </svg>
            Layer 2: Channel Adapters
          </div>
          <span class="ocl-role-badge-a9e9caa2fa6bb4d9d711a5907e59188d">Normalize + Authorize</span>
        </div>
        <div class="ocl-layer-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Each adapter wraps a platform's SDK and converts its messages into a common StandardMessage shape, so everything downstream handles a single format.</div>
        <div class="ocl-components-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="whatsapp">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">WhatsApp</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">Baileys</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="telegram">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Telegram</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">grammY</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="discord">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Discord</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">discord.js</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="imessage">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">iMessage</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">native macOS</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="slack">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Slack</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">Bolt</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="matrix">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Matrix</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">matrix-bot-sdk</div>
          </div>
        </div>
      </div>

      
      <div class="ocl-connector-a9e9caa2fa6bb4d9d711a5907e59188d ocl-conn-purple-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="ocl-connector-arrow-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d p1"></div>
          <div class="ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d p2"></div>
          <div class="ocl-particle-a9e9caa2fa6bb4d9d711a5907e59188d p3"></div>
          <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <path d="M12 5v14M5 12l7 7 7-7"/>
          </svg>
          <span class="ocl-connector-label-a9e9caa2fa6bb4d9d711a5907e59188d">dispatch</span>
        </div>
      </div>

      
      <div class="ocl-layer-a9e9caa2fa6bb4d9d711a5907e59188d ocl-layer-rt-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="ocl-layer-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="ocl-layer-title-a9e9caa2fa6bb4d9d711a5907e59188d">
            <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <path d="M13 2L3 14h9l-1 8 10-12h-9l1-8z"/>
            </svg>
            Layer 3: Agent Runtime
          </div>
          <span class="ocl-role-badge-a9e9caa2fa6bb4d9d711a5907e59188d">Reason + Execute</span>
        </div>
        <div class="ocl-layer-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Where the actual thinking happens — resolve the session, build context, call the LLM, run tools, save state.</div>
        <div class="ocl-components-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="session-resolver">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Session Resolver</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">namespace</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="context-assembler">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Context Assembler</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">prompt</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="streaming-llm">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Streaming LLM</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">inference</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="tool-executor">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">Tool Executor</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">skills</div>
          </div>
          <div class="ocl-component-a9e9caa2fa6bb4d9d711a5907e59188d" data-component="state-persister">
            <div class="ocl-comp-name-a9e9caa2fa6bb4d9d711a5907e59188d">State Persister</div>
            <div class="ocl-comp-hint-a9e9caa2fa6bb4d9d711a5907e59188d">JSONL</div>
          </div>
        </div>
      </div>
    </div>

    
    <div class="ocl-info-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="ocl-info-placeholder-a9e9caa2fa6bb4d9d711a5907e59188d" id="ocl-placeholder-a9e9caa2fa6bb4d9d711a5907e59188d">
        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5">
          <circle cx="12" cy="12" r="10"/>
          <path d="M12 16v-4M12 8h.01"/>
        </svg>
        <p>Click a component to see what it does, which files matter, and how fast it needs to be.</p>
      </div>
      <div class="ocl-info-content-a9e9caa2fa6bb4d9d711a5907e59188d" id="ocl-info-content-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="ocl-info-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="ocl-info-icon-a9e9caa2fa6bb4d9d711a5907e59188d" id="ocl-info-icon-a9e9caa2fa6bb4d9d711a5907e59188d">
            <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <rect x="3" y="3" width="18" height="18" rx="2"/>
            </svg>
          </div>
          <div>
            <div class="ocl-info-title-a9e9caa2fa6bb4d9d711a5907e59188d" id="ocl-info-title-a9e9caa2fa6bb4d9d711a5907e59188d">Component</div>
            <span class="ocl-info-badge-a9e9caa2fa6bb4d9d711a5907e59188d" id="ocl-info-badge-a9e9caa2fa6bb4d9d711a5907e59188d">Layer</span>
          </div>
        </div>
        <div class="ocl-info-desc-a9e9caa2fa6bb4d9d711a5907e59188d" id="ocl-info-desc-a9e9caa2fa6bb4d9d711a5907e59188d">
          Description goes here.
        </div>
        <div class="ocl-info-section-a9e9caa2fa6bb4d9d711a5907e59188d">
          <h4>Responsibilities</h4>
          <ul id="ocl-info-resp-a9e9caa2fa6bb4d9d711a5907e59188d"></ul>
        </div>
        <div class="ocl-info-section-a9e9caa2fa6bb4d9d711a5907e59188d">
          <h4>Key Files</h4>
          <ul id="ocl-info-files-a9e9caa2fa6bb4d9d711a5907e59188d"></ul>
        </div>
        <div id="ocl-info-latency-wrap-a9e9caa2fa6bb4d9d711a5907e59188d">
          <span class="ocl-info-latency-a9e9caa2fa6bb4d9d711a5907e59188d" id="ocl-info-latency-a9e9caa2fa6bb4d9d711a5907e59188d"></span>
        </div>
      </div>
    </div>
  </div>

  <div class="ocl-footer-a9e9caa2fa6bb4d9d711a5907e59188d">
    <p><strong>The key idea:</strong> each layer stays in its lane. Gateway routes, adapters normalize, runtime reasons. When we keep those boundaries clean, the whole thing stays manageable.</p>
  </div>
</div>
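<p>The division of labour above can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not the project's code: the post names <code>StandardMessage</code> but does not define its fields, so the fields, the <code>telegramAdapter</code>, <code>gatewayRoute</code> and <code>runtimeHandle</code> functions, the session key format, and the hand-rolled validation (standing in for TypeBox) are all hypothetical. The LLM call is replaced by an echo.</p>

```typescript
// Hypothetical message shape; the real StandardMessage fields are not given in the post.
type StandardMessage = {
  platform: string;   // which adapter produced the message
  sessionKey: string; // which conversation it belongs to
  text: string;
};

// Layer 2: an adapter turns a platform-specific payload into a StandardMessage.
type Adapter = (raw: unknown) => StandardMessage;

const telegramAdapter: Adapter = (raw) => {
  // Assumed payload shape, for illustration only.
  const r = raw as { chat: { id: number }; text: string };
  return { platform: "telegram", sessionKey: "telegram:" + r.chat.id, text: r.text };
};

const adapters: { [platform: string]: Adapter } = { telegram: telegramAdapter };

// Layer 1: the gateway validates the envelope and picks an adapter. Nothing else.
function gatewayRoute(envelope: unknown): StandardMessage {
  // Stand-in for schema validation; a real gateway would check a full schema here.
  if (typeof envelope !== "object" || envelope === null) {
    throw new Error("malformed envelope");
  }
  const { platform, payload } = envelope as { platform?: unknown; payload?: unknown };
  if (typeof platform !== "string") {
    throw new Error("missing platform");
  }
  const known = Object.prototype.hasOwnProperty.call(adapters, platform);
  if (!known) {
    throw new Error("no adapter registered for " + platform);
  }
  return adapters[platform](payload);
}

// Layer 3: the runtime only ever sees StandardMessage.
const sessions: { [key: string]: string[] } = {};

function runtimeHandle(msg: StandardMessage): string {
  // Resolve the session (create it on first contact).
  if (!sessions[msg.sessionKey]) {
    sessions[msg.sessionKey] = [];
  }
  const history = sessions[msg.sessionKey];
  history.push(msg.text);
  // Placeholder for context assembly, the LLM call, and tool execution.
  const reply = "echo(" + history.length + "): " + msg.text;
  // Persist state (in memory here; the post describes JSONL and SQLite).
  history.push(reply);
  return reply;
}

const reply = runtimeHandle(
  gatewayRoute({ platform: "telegram", payload: { chat: { id: 42 }, text: "hi" } })
);
console.log(reply);
```

<p>Because the runtime depends only on <code>StandardMessage</code>, adding a platform in this sketch means writing one adapter and registering it; neither the gateway check nor the runtime changes.</p>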

<script>
(function() {
  var id = 'a9e9caa2fa6bb4d9d711a5907e59188d';

  var components = {
    'ws-server': {
      title: 'WebSocket Server',
      layerClass: 'gw',
      badge: 'Gateway',
      description: 'The entry point: a WebSocket server that channel adapters and the Control UI connect to. It keeps connections alive with heartbeats and handles reconnection when a connection drops.',
      responsibilities: ['Accept WebSocket connections', 'Heartbeat / keep-alive', 'Message framing and routing', 'Connection lifecycle management'],
      files: ['src/gateway/server/ws-connection.ts', 'src/gateway/server/ws-types.ts'],
      latency: 'Connection: <5ms | Message relay: <1ms'
    },
    'channel-mgr': {
      title: 'Channel Manager',
      layerClass: 'gw',
      badge: 'Gateway',
      description: 'Keeps track of which adapters are registered and routes incoming messages to the right one. When a message comes in, it figures out which platform sent it and hands it off.',
      responsibilities: ['Adapter registration', 'Platform identification', 'Message dispatching', 'Adapter health monitoring'],
      files: ['src/gateway/server-channels.ts', 'src/channels/registry.ts'],
      latency: 'Routing decision: <2ms'
    },
    'scheduler': {
      title: 'Scheduler',
      layerClass: 'gw',
      badge: 'Gateway',
      description: 'The cron system for things that need to happen on a schedule — daily summaries, periodic check-ins, webhook-triggered workflows. Each job is scoped to a specific session, so scheduled tasks do not bleed across conversations.',
      responsibilities: ['Cron job management', 'Webhook trigger handling', 'Heartbeat scheduling', 'Gmail Pub/Sub integration'],
      files: ['src/gateway/server-cron.ts', 'src/gateway/protocol/schema/cron.ts'],
      latency: 'Trigger dispatch: <10ms'
    },
    'control-ui': {
      title: 'Control UI',
      layerClass: 'gw',
      badge: 'Gateway',
      description: 'A web dashboard for managing the instance. You can view active sessions, configure channels, check health, and read through conversation logs in real time.',
      responsibilities: ['Session management dashboard', 'Channel configuration', 'Real-time conversation viewer', 'Health monitoring'],
      files: ['ui/', 'src/gateway/server-methods/'],
      latency: 'UI render: client-side'
    },
    'typebox': {
      title: 'TypeBox Validator',
      layerClass: 'gw',
      badge: 'Gateway',
      description: 'We use TypeBox (a JSON Schema compiler for TypeScript) to validate every message before it gets routed anywhere. If the shape is wrong, it gets rejected here rather than causing weird errors downstream.',
      responsibilities: ['Message schema validation', 'Type-safe runtime checks', 'Error reporting for malformed payloads', 'Schema versioning'],
      files: ['src/gateway/protocol/schema/', 'src/gateway/protocol/schema.ts'],
      latency: 'Validation: <1ms per message'
    },
    'whatsapp': {
      title: 'WhatsApp Adapter',
      layerClass: 'ch',
      badge: 'Channel Adapter',
      description: 'Connects via Baileys, the unofficial WhatsApp Web API. Handles QR code auth, message formatting, media uploads, and DM pairing verification. Baileys is reverse-engineered and can break when WhatsApp changes things — worth knowing.',
      responsibilities: ['Baileys session management', 'QR code authentication', 'WhatsApp-specific formatting', 'Media upload/download', 'DM pairing + allowlist'],
      files: ['src/whatsapp/', 'extensions/whatsapp/'],
      latency: 'Message normalization: <5ms'
    },
    'telegram': {
      title: 'Telegram Adapter',
      layerClass: 'ch',
      badge: 'Channel Adapter',
      description: 'Uses grammY to talk to the Telegram Bot API. Supports inline keyboards, markdown formatting, file sharing, and group conversations. Probably the most stable adapter since the Telegram Bot API is well-documented.',
      responsibilities: ['grammY bot lifecycle', 'Inline keyboard menus', 'Markdown formatting', 'Group message routing'],
      files: ['src/telegram/', 'extensions/telegram/'],
      latency: 'Message normalization: <3ms'
    },
    'discord': {
      title: 'Discord Adapter',
      layerClass: 'ch',
      badge: 'Channel Adapter',
      description: 'Built on discord.js. Handles slash commands, rich embeds, threads, and reactions. The Discord gateway connection can be finicky with reconnection — discord.js handles most of that for us.',
      responsibilities: ['Slash command registration', 'Rich embed formatting', 'Thread management', 'Voice channel integration'],
      files: ['src/discord/', 'src/discord/client.ts'],
      latency: 'Message normalization: <5ms'
    },
    'imessage': {
      title: 'iMessage Adapter',
      layerClass: 'ch',
      badge: 'Channel Adapter',
      description: 'This one is macOS-only. It bridges to iMessage through AppleScript — polls the Messages database for new messages and sends replies through Messages.app. Hacky, but it works.',
      responsibilities: ['AppleScript bridge', 'Messages.app database polling', 'Native macOS integration', 'Contact resolution'],
      files: ['src/imessage/', 'src/imessage/client.ts'],
      latency: 'Message normalization: <10ms'
    },
    'slack': {
      title: 'Slack Adapter',
      layerClass: 'ch',
      badge: 'Channel Adapter',
      description: 'Uses the Slack Bolt SDK for workspace integration. Handles app mentions, DMs, thread replies, and interactive components like buttons and modals.',
      responsibilities: ['Bolt SDK event handling', 'App mention detection', 'Thread reply management', 'Interactive components'],
      files: ['src/slack/', 'src/slack/client.ts'],
      latency: 'Message normalization: <3ms'
    },
    'matrix': {
      title: 'Matrix Adapter',
      layerClass: 'ch',
      badge: 'Channel Adapter',
      description: 'Matrix support via matrix-bot-sdk. Works with any homeserver — Synapse, Dendrite, whatever you run. The federated nature means messages can arrive from servers we do not control, so this adapter needs to be a bit more defensive.',
      responsibilities: ['Matrix protocol handling', 'Room management', 'E2E encryption support', 'Federation compatibility'],
      files: ['extensions/matrix/', 'extensions/matrix/src/'],
      latency: 'Message normalization: <5ms'
    },
    'session-resolver': {
      title: 'Session Resolver',
      layerClass: 'rt',
      badge: 'Agent Runtime',
      description: 'Figures out which session an incoming message belongs to. Each user + channel combo gets its own directory with dedicated state, memory, and config. This isolation is why one conversation cannot accidentally mess with another.',
      responsibilities: ['Session namespace creation', 'User + channel mapping', 'Directory isolation', 'Session lifecycle management'],
      files: ['src/gateway/sessions-resolve.ts', 'src/gateway/session-utils.ts'],
      latency: 'Resolution: <50ms (includes disk I/O)'
    },
    'context-assembler': {
      title: 'Context Assembler',
      layerClass: 'rt',
      badge: 'Agent Runtime',
      description: 'Builds the prompt that actually gets sent to the LLM. It stitches together AGENTS.md (the constitution), SOUL.md (personality), TOOLS.md (conventions), relevant memory pages, and conversation history. Getting this right matters a lot — the prompt is the product.',
      responsibilities: ['System prompt composition', 'Memory page selection', 'Token budget management', 'Conversation history trimming'],
      files: ['src/agents/context.ts', 'src/gateway/agent-prompt.ts'],
      latency: 'Assembly: <100ms (includes memory search)'
    },
    'streaming-llm': {
      title: 'Streaming LLM',
      layerClass: 'rt',
      badge: 'Agent Runtime',
      description: 'The LLM client — streams responses token-by-token so we can start sending replies before the full response is done. Supports Claude, GPT-4, Gemini, and local models. Handles retries and rate limiting so the rest of the system does not have to think about it.',
      responsibilities: ['Multi-provider support', 'Streaming token delivery', 'Retry and rate limiting', 'Token usage tracking'],
      files: ['src/agents/cli-runner/', 'src/providers/'],
      latency: 'First token: 200-500ms | Full response: 1-5s'
    },
    'tool-executor': {
      title: 'Tool Executor',
      layerClass: 'rt',
      badge: 'Agent Runtime',
      description: 'When the LLM decides to use a tool, this is what runs it. It pulls from the library of 5,705+ skills on ClawHub, where each skill is a markdown doc with YAML frontmatter defining its interface. The executor parses the tool call, finds the matching skill, and runs it.',
      responsibilities: ['Skill resolution from ClawHub', 'Tool call parsing and validation', 'Sandboxed execution', 'Result formatting'],
      files: ['src/agents/tools/', 'src/agents/skills/'],
      latency: 'Skill load: <10ms | Execution: varies (100ms-10s)'
    },
    'state-persister': {
      title: 'State Persister',
      layerClass: 'rt',
      badge: 'Agent Runtime',
      description: 'Appends every message and tool result to a JSONL file for the session. Append-only by design — we never mutate past entries, just keep adding. This means we can replay conversations after a crash and nothing gets lost.',
      responsibilities: ['JSONL append-only logging', 'Session state serialization', 'Crash recovery', 'Conversation replay'],
      files: ['src/gateway/server-runtime-state.ts', 'src/gateway/session-utils.ts'],
      latency: 'Write: <5ms (async flush)'
    }
  };

  function selectComponent(el, compId) {
    var container = document.querySelector('.ocl-' + id);
    if (!container) return;

    var allComponents = container.querySelectorAll('[data-component]');
    for (var i = 0; i < allComponents.length; i++) {
      allComponents[i].classList.remove('selected');
    }
    el.classList.add('selected');

    var comp = components[compId];
    if (!comp) return;

    document.getElementById('ocl-placeholder-' + id).style.display = 'none';
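    var content = document.getElementById('ocl-info-content-' + id);
    // Remove and re-add the class with a forced reflow in between,
    // so any CSS animation tied to .active restarts on every selection.
    content.classList.remove('active');
    void content.offsetWidth;
    content.classList.add('active');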

    document.getElementById('ocl-info-title-' + id).textContent = comp.title;

    var badge = document.getElementById('ocl-info-badge-' + id);
    badge.textContent = comp.badge;
    badge.className = 'ocl-info-badge-' + id + ' ' + comp.layerClass;

    var icon = document.getElementById('ocl-info-icon-' + id);
    icon.className = 'ocl-info-icon-' + id + ' ' + comp.layerClass;

    document.getElementById('ocl-info-desc-' + id).textContent = comp.description;

    var respList = document.getElementById('ocl-info-resp-' + id);
    var respHtml = '';
    for (var r = 0; r < comp.responsibilities.length; r++) {
      respHtml += '<li>' + comp.responsibilities[r] + '</li>';
    }
    respList.innerHTML = respHtml;

    var filesList = document.getElementById('ocl-info-files-' + id);
    var filesHtml = '';
    var ghBase = 'https://github.com/MdJawad/openclaw/tree/main/';
    for (var f = 0; f < comp.files.length; f++) {
      var filePath = comp.files[f];
      var ghUrl = ghBase + filePath;
      filesHtml += '<li><a href="' + ghUrl + '" target="_blank" rel="noopener" style="font-size:0.7rem;color:#60a5fa;text-decoration:none;font-family:Monaco,Menlo,Consolas,monospace;">' + filePath + '</a></li>';
    }
    filesList.innerHTML = filesHtml;

    var latencyEl = document.getElementById('ocl-info-latency-' + id);
    if (comp.latency) {
      latencyEl.textContent = comp.latency;
      document.getElementById('ocl-info-latency-wrap-' + id).style.display = 'block';
    } else {
      document.getElementById('ocl-info-latency-wrap-' + id).style.display = 'none';
    }
  }

  var container = document.querySelector('.ocl-' + id);
  if (container) {
    container.addEventListener('click', function(e) {
      var target = e.target;
      while (target && target !== container) {
        if (target.hasAttribute && target.hasAttribute('data-component')) {
          var compId = target.getAttribute('data-component');
          selectComponent(target, compId);
          return;
        }
        target = target.parentElement;
      }
    });
  }
})();
</script>
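
<p>The State Persister entry in the diagram describes an append-only JSONL log that can be replayed after a crash. A minimal in-memory sketch of that idea follows. The <code>LogEntry</code> shape and both helper names are hypothetical, and a real implementation would append to a file rather than to a string.</p>

```typescript
// Hypothetical entry shape; the actual log schema is not shown in this post.
type LogEntry = { role: "user" | "assistant" | "tool"; content: string };

// Append-only: the only write operation adds one JSON line to the end.
// JSON.stringify escapes newlines inside content, so each entry stays on one line.
function appendEntry(log: string, entry: LogEntry): string {
  return log + JSON.stringify(entry) + "\n";
}

// Replay: parse each non-empty line back into an entry, in the original order.
function replay(log: string): LogEntry[] {
  return log
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as LogEntry);
}

// Usage: build up a session log, then recover it as if after a restart.
let sessionLog = "";
sessionLog = appendEntry(sessionLog, { role: "user", content: "What's the weather in Tokyo?" });
sessionLog = appendEntry(sessionLog, { role: "tool", content: "weather_lookup(\"Tokyo\")" });
const recovered = replay(sessionLog);
```

<p>Past lines are never rewritten in this sketch, which is the property that makes replay straightforward: reading the log from the top reproduces the conversation in order.</p>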

<h2 id="message-flow">Message Flow</h2>
<p>Seeing the layers in isolation is one thing. Here&rsquo;s what happens when an actual message flows through them.</p>




<style>
.omf-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: white;
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #1a202c;
  box-shadow: 0 12px 30px rgba(0, 0, 0, 0.06);
}

.omf-a9e9caa2fa6bb4d9d711a5907e59188d * {
  box-sizing: border-box;
}

.omf-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  margin-bottom: 1.5rem;
}

.omf-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1.5rem;
  font-weight: 700;
  color: #1a202c;
  margin-bottom: 0.5rem;
}

.omf-subtitle-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.9rem;
  color: #718096;
  max-width: 600px;
  margin: 0 auto;
}

 
.omf-content-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: grid;
  grid-template-columns: auto 1fr 200px;
  gap: 1.5rem;
  align-items: start;
}

@media (max-width: 850px) {
  .omf-content-a9e9caa2fa6bb4d9d711a5907e59188d {
    grid-template-columns: 1fr;
  }
}

 
.omf-trigger-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  flex-direction: column;
  align-items: center;
  gap: 0.5rem;
  padding-top: 2rem;
}

.omf-trigger-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 48px;
  height: 48px;
  border-radius: 12px;
  background: #f0fdf4;
  border: 2px solid #86efac;
  display: flex;
  align-items: center;
  justify-content: center;
}

.omf-trigger-label-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.7rem;
  color: #718096;
  text-align: center;
  max-width: 80px;
}

.omf-trigger-arrow-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #cbd5e0;
  font-size: 1.2rem;
}

 
.omf-pipeline-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
}

 
.omf-phase-a9e9caa2fa6bb4d9d711a5907e59188d {
  border-radius: 10px;
  padding: 0.75rem 1rem;
  border: 2px solid #e2e8f0;
  background: #f7fafc;
  opacity: 0.4;
  transform: translateX(-10px);
  transition: all 0.4s cubic-bezier(0.4, 0, 0.2, 1);
  cursor: pointer;
}

.omf-phase-a9e9caa2fa6bb4d9d711a5907e59188d.active {
  opacity: 1;
  transform: translateX(0);
}

.omf-phase-a9e9caa2fa6bb4d9d711a5907e59188d.completed {
  opacity: 0.8;
  transform: translateX(0);
}

 
.omf-phase-trigger-a9e9caa2fa6bb4d9d711a5907e59188d.active,
.omf-phase-trigger-a9e9caa2fa6bb4d9d711a5907e59188d.completed {
  border-color: #f59e0b;
  background: #fffbeb;
}

.omf-phase-gateway-a9e9caa2fa6bb4d9d711a5907e59188d.active,
.omf-phase-gateway-a9e9caa2fa6bb4d9d711a5907e59188d.completed {
  border-color: #3b82f6;
  background: #eff6ff;
}

.omf-phase-adapter-a9e9caa2fa6bb4d9d711a5907e59188d.active,
.omf-phase-adapter-a9e9caa2fa6bb4d9d711a5907e59188d.completed {
  border-color: #a855f7;
  background: #faf5ff;
}

.omf-phase-runtime-a9e9caa2fa6bb4d9d711a5907e59188d.active,
.omf-phase-runtime-a9e9caa2fa6bb4d9d711a5907e59188d.completed {
  border-color: #ec4899;
  background: #fdf2f8;
}

.omf-phase-response-a9e9caa2fa6bb4d9d711a5907e59188d.active,
.omf-phase-response-a9e9caa2fa6bb4d9d711a5907e59188d.completed {
  border-color: #22c55e;
  background: #f0fdf4;
}

.omf-phase-a9e9caa2fa6bb4d9d711a5907e59188d.active {
  box-shadow: 0 4px 12px rgba(0, 0, 0, 0.08);
}

 
.omf-phase-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 24px;
  height: 24px;
  border-radius: 50%;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.7rem;
  font-weight: 700;
  color: white;
  flex-shrink: 0;
}

.omf-phase-trigger-a9e9caa2fa6bb4d9d711a5907e59188d .omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d { background: #f59e0b; }
.omf-phase-gateway-a9e9caa2fa6bb4d9d711a5907e59188d .omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d { background: #3b82f6; }
.omf-phase-adapter-a9e9caa2fa6bb4d9d711a5907e59188d .omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d { background: #a855f7; }
.omf-phase-runtime-a9e9caa2fa6bb4d9d711a5907e59188d .omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d { background: #ec4899; }
.omf-phase-response-a9e9caa2fa6bb4d9d711a5907e59188d .omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d { background: #22c55e; }

.omf-phase-name-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.85rem;
  font-weight: 700;
  color: #2d3748;
  flex: 1;
}

.omf-phase-desc-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.75rem;
  color: #718096;
  margin-top: 0.25rem;
  margin-left: 2rem;
}

.omf-phase-latency-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  padding: 0.15rem 0.4rem;
  border-radius: 4px;
  font-weight: 600;
  background: rgba(34, 197, 94, 0.1);
  color: #16a34a;
  white-space: nowrap;
}

 
.omf-substeps-a9e9caa2fa6bb4d9d711a5907e59188d {
  max-height: 0;
  overflow: hidden;
  transition: max-height 0.4s ease;
  margin-left: 2rem;
  margin-top: 0;
}

.omf-substeps-a9e9caa2fa6bb4d9d711a5907e59188d.expanded {
  max-height: 200px;
  margin-top: 0.5rem;
}

.omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.4rem;
  padding: 0.25rem 0;
  font-size: 0.7rem;
  color: #4a5568;
}

.omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 5px;
  height: 5px;
  border-radius: 50%;
  background: #cbd5e0;
  flex-shrink: 0;
}

.omf-phase-trigger-a9e9caa2fa6bb4d9d711a5907e59188d .omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d { background: #f59e0b; }
.omf-phase-gateway-a9e9caa2fa6bb4d9d711a5907e59188d .omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d { background: #3b82f6; }
.omf-phase-adapter-a9e9caa2fa6bb4d9d711a5907e59188d .omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d { background: #a855f7; }
.omf-phase-runtime-a9e9caa2fa6bb4d9d711a5907e59188d .omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d { background: #ec4899; }
.omf-phase-response-a9e9caa2fa6bb4d9d711a5907e59188d .omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d { background: #22c55e; }

 
.omf-arrow-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  justify-content: center;
  padding: 0.15rem 0;
  color: #cbd5e0;
}

.omf-arrow-a9e9caa2fa6bb4d9d711a5907e59188d.active {
  color: #94a3b8;
}

 
.omf-tracker-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #f7fafc;
  border-radius: 10px;
  padding: 1rem;
  border: 1px solid #e2e8f0;
  position: sticky;
  top: 1rem;
}

.omf-tracker-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.8rem;
  font-weight: 700;
  color: #2d3748;
  margin-bottom: 0.75rem;
  text-align: center;
}

.omf-tracker-total-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  margin-bottom: 1rem;
  padding-bottom: 0.75rem;
  border-bottom: 1px solid #e2e8f0;
}

.omf-tracker-total-val-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1.5rem;
  font-weight: 700;
  color: #2d3748;
}

.omf-tracker-total-label-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.65rem;
  color: #718096;
  text-transform: uppercase;
}

.omf-tracker-bar-a9e9caa2fa6bb4d9d711a5907e59188d {
  margin-bottom: 0.5rem;
}

.omf-tracker-bar-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  justify-content: space-between;
  font-size: 0.65rem;
  margin-bottom: 0.25rem;
}

.omf-tracker-bar-name-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #4a5568;
  font-weight: 600;
}

.omf-tracker-bar-val-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #718096;
}

.omf-tracker-bar-bg-a9e9caa2fa6bb4d9d711a5907e59188d {
  height: 6px;
  background: #e2e8f0;
  border-radius: 3px;
  overflow: hidden;
}

.omf-tracker-bar-fill-a9e9caa2fa6bb4d9d711a5907e59188d {
  height: 100%;
  border-radius: 3px;
  width: 0;
  transition: width 0.6s ease;
}

.omf-bar-amber-a9e9caa2fa6bb4d9d711a5907e59188d { background: #f59e0b; }
.omf-bar-blue-a9e9caa2fa6bb4d9d711a5907e59188d { background: #3b82f6; }
.omf-bar-purple-a9e9caa2fa6bb4d9d711a5907e59188d { background: #a855f7; }
.omf-bar-pink-a9e9caa2fa6bb4d9d711a5907e59188d { background: #ec4899; }
.omf-bar-green-a9e9caa2fa6bb4d9d711a5907e59188d { background: #22c55e; }

 
.omf-controls-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  justify-content: center;
  gap: 0.75rem;
  margin-top: 1.5rem;
}

.omf-btn-a9e9caa2fa6bb4d9d711a5907e59188d {
  padding: 0.6rem 1.5rem;
  border: 2px solid #e2e8f0;
  border-radius: 8px;
  background: white;
  color: #2d3748;
  font-size: 0.8rem;
  font-weight: 600;
  cursor: pointer;
  transition: all 0.2s ease;
}

.omf-btn-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  border-color: #3b82f6;
  color: #3b82f6;
  transform: translateY(-1px);
}

.omf-btn-a9e9caa2fa6bb4d9d711a5907e59188d:disabled {
  opacity: 0.4;
  cursor: not-allowed;
  transform: none;
}

.omf-btn-primary-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: linear-gradient(135deg, #3b82f6 0%, #6366f1 100%);
  color: white;
  border-color: #3b82f6;
}

.omf-btn-primary-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  color: white;
  box-shadow: 0 4px 12px rgba(59, 130, 246, 0.3);
}

 
@media (max-width: 850px) {
  .omf-trigger-a9e9caa2fa6bb4d9d711a5907e59188d {
    flex-direction: row;
    padding-top: 0;
    margin-bottom: 0.5rem;
  }

  .omf-trigger-arrow-a9e9caa2fa6bb4d9d711a5907e59188d {
    transform: rotate(90deg);
  }

  .omf-tracker-a9e9caa2fa6bb4d9d711a5907e59188d {
    position: static;
  }
}

@media (max-width: 600px) {
  .omf-a9e9caa2fa6bb4d9d711a5907e59188d {
    padding: 1.25rem;
  }

  .omf-phase-name-a9e9caa2fa6bb4d9d711a5907e59188d {
    font-size: 0.78rem;
  }
}
</style>

<div class="omf-a9e9caa2fa6bb4d9d711a5907e59188d">
  <div class="omf-header-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="omf-title-a9e9caa2fa6bb4d9d711a5907e59188d">Message Journey: WhatsApp to Response</div>
    <div class="omf-subtitle-a9e9caa2fa6bb4d9d711a5907e59188d">Follow a single message through all three layers. Click any phase to expand its sub-steps.</div>
  </div>

  <div class="omf-content-a9e9caa2fa6bb4d9d711a5907e59188d">
    
    <div class="omf-trigger-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="omf-trigger-icon-a9e9caa2fa6bb4d9d711a5907e59188d">
        <svg width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="#22c55e" stroke-width="2">
          <path d="M21 11.5a8.38 8.38 0 0 1-.9 3.8 8.5 8.5 0 0 1-7.6 4.7 8.38 8.38 0 0 1-3.8-.9L3 21l1.9-5.7a8.38 8.38 0 0 1-.9-3.8 8.5 8.5 0 0 1 4.7-7.6 8.38 8.38 0 0 1 3.8-.9h.5a8.48 8.48 0 0 1 8 8v.5z"/>
        </svg>
      </div>
      <div class="omf-trigger-label-a9e9caa2fa6bb4d9d711a5907e59188d">WhatsApp +1-555-0199</div>
      <div class="omf-trigger-arrow-a9e9caa2fa6bb4d9d711a5907e59188d">&#8594;</div>
    </div>

    
    <div class="omf-pipeline-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-pipeline-a9e9caa2fa6bb4d9d711a5907e59188d">
      
      <div class="omf-phase-a9e9caa2fa6bb4d9d711a5907e59188d omf-phase-trigger-a9e9caa2fa6bb4d9d711a5907e59188d" data-phase="0">
        <div class="omf-phase-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d">1</div>
          <div class="omf-phase-name-a9e9caa2fa6bb4d9d711a5907e59188d">Trigger</div>
          <div class="omf-phase-latency-a9e9caa2fa6bb4d9d711a5907e59188d">~5ms</div>
        </div>
        <div class="omf-phase-desc-a9e9caa2fa6bb4d9d711a5907e59188d">WhatsApp message from +1-555-0199: "What's the weather in Tokyo?"</div>
        <div class="omf-substeps-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-sub-0-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Message event received over the Baileys connection</div>
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Idempotency key checked (message dedup)</div>
        </div>
      </div>

      <div class="omf-arrow-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-arrow-0-a9e9caa2fa6bb4d9d711a5907e59188d">
        <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M12 5v14M5 12l7 7 7-7"/></svg>
      </div>

      
      <div class="omf-phase-a9e9caa2fa6bb4d9d711a5907e59188d omf-phase-gateway-a9e9caa2fa6bb4d9d711a5907e59188d" data-phase="1">
        <div class="omf-phase-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d">2</div>
          <div class="omf-phase-name-a9e9caa2fa6bb4d9d711a5907e59188d">Gateway</div>
          <div class="omf-phase-latency-a9e9caa2fa6bb4d9d711a5907e59188d">~8ms</div>
        </div>
        <div class="omf-phase-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Route to correct adapter and session</div>
        <div class="omf-substeps-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-sub-1-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Channel identified: WhatsApp (Baileys)</div>
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>TypeBox schema validation passed</div>
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Scheduler check: no pending cron conflicts</div>
        </div>
      </div>

      <div class="omf-arrow-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-arrow-1-a9e9caa2fa6bb4d9d711a5907e59188d">
        <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M12 5v14M5 12l7 7 7-7"/></svg>
      </div>

      
      <div class="omf-phase-a9e9caa2fa6bb4d9d711a5907e59188d omf-phase-adapter-a9e9caa2fa6bb4d9d711a5907e59188d" data-phase="2">
        <div class="omf-phase-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d">3</div>
          <div class="omf-phase-name-a9e9caa2fa6bb4d9d711a5907e59188d">Adapter</div>
          <div class="omf-phase-latency-a9e9caa2fa6bb4d9d711a5907e59188d">~12ms</div>
        </div>
        <div class="omf-phase-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Normalize and authorize the incoming message</div>
        <div class="omf-substeps-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-sub-2-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Baileys payload &#8594; StandardMessage format</div>
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Allowlist check: +1-555-0199 authorized</div>
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>DM pairing verified (not a group message)</div>
        </div>
      </div>

      <div class="omf-arrow-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-arrow-2-a9e9caa2fa6bb4d9d711a5907e59188d">
        <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M12 5v14M5 12l7 7 7-7"/></svg>
      </div>

      
      <div class="omf-phase-a9e9caa2fa6bb4d9d711a5907e59188d omf-phase-runtime-a9e9caa2fa6bb4d9d711a5907e59188d" data-phase="3">
        <div class="omf-phase-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d">4</div>
          <div class="omf-phase-name-a9e9caa2fa6bb4d9d711a5907e59188d">Runtime</div>
          <div class="omf-phase-latency-a9e9caa2fa6bb4d9d711a5907e59188d">~1.85s</div>
        </div>
        <div class="omf-phase-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Resolve &#8594; Assemble &#8594; Invoke &#8594; Execute</div>
        <div class="omf-substeps-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-sub-3-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Session namespace loaded (&lt;50ms)</div>
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Prompt assembled: AGENTS.md + SOUL.md + memory (&lt;100ms)</div>
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Streaming LLM: first token (200-500ms)</div>
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Tool call: weather_lookup("Tokyo") (1.2s)</div>
        </div>
      </div>

      <div class="omf-arrow-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-arrow-3-a9e9caa2fa6bb4d9d711a5907e59188d">
        <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M12 5v14M5 12l7 7 7-7"/></svg>
      </div>

      
      <div class="omf-phase-a9e9caa2fa6bb4d9d711a5907e59188d omf-phase-response-a9e9caa2fa6bb4d9d711a5907e59188d" data-phase="4">
        <div class="omf-phase-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="omf-phase-num-a9e9caa2fa6bb4d9d711a5907e59188d">5</div>
          <div class="omf-phase-name-a9e9caa2fa6bb4d9d711a5907e59188d">Response</div>
          <div class="omf-phase-latency-a9e9caa2fa6bb4d9d711a5907e59188d">~15ms</div>
        </div>
        <div class="omf-phase-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Format and deliver the reply to WhatsApp</div>
        <div class="omf-substeps-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-sub-4-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Markdown &#8594; WhatsApp formatting (bold, lists)</div>
          <div class="omf-substep-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-substep-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Sent via Baileys connection to +1-555-0199</div>
        </div>
      </div>
    </div>

    
    <div class="omf-tracker-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-tracker-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="omf-tracker-title-a9e9caa2fa6bb4d9d711a5907e59188d">Cumulative Latency</div>
      <div class="omf-tracker-total-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="omf-tracker-total-val-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-total-a9e9caa2fa6bb4d9d711a5907e59188d">0ms</div>
        <div class="omf-tracker-total-label-a9e9caa2fa6bb4d9d711a5907e59188d">Total</div>
      </div>

      <div class="omf-tracker-bar-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="omf-tracker-bar-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <span class="omf-tracker-bar-name-a9e9caa2fa6bb4d9d711a5907e59188d">Trigger</span>
          <span class="omf-tracker-bar-val-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-bar-val-0-a9e9caa2fa6bb4d9d711a5907e59188d">0ms</span>
        </div>
        <div class="omf-tracker-bar-bg-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-tracker-bar-fill-a9e9caa2fa6bb4d9d711a5907e59188d omf-bar-amber-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-bar-0-a9e9caa2fa6bb4d9d711a5907e59188d"></div></div>
      </div>

      <div class="omf-tracker-bar-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="omf-tracker-bar-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <span class="omf-tracker-bar-name-a9e9caa2fa6bb4d9d711a5907e59188d">Gateway</span>
          <span class="omf-tracker-bar-val-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-bar-val-1-a9e9caa2fa6bb4d9d711a5907e59188d">0ms</span>
        </div>
        <div class="omf-tracker-bar-bg-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-tracker-bar-fill-a9e9caa2fa6bb4d9d711a5907e59188d omf-bar-blue-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-bar-1-a9e9caa2fa6bb4d9d711a5907e59188d"></div></div>
      </div>

      <div class="omf-tracker-bar-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="omf-tracker-bar-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <span class="omf-tracker-bar-name-a9e9caa2fa6bb4d9d711a5907e59188d">Adapter</span>
          <span class="omf-tracker-bar-val-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-bar-val-2-a9e9caa2fa6bb4d9d711a5907e59188d">0ms</span>
        </div>
        <div class="omf-tracker-bar-bg-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-tracker-bar-fill-a9e9caa2fa6bb4d9d711a5907e59188d omf-bar-purple-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-bar-2-a9e9caa2fa6bb4d9d711a5907e59188d"></div></div>
      </div>

      <div class="omf-tracker-bar-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="omf-tracker-bar-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <span class="omf-tracker-bar-name-a9e9caa2fa6bb4d9d711a5907e59188d">Runtime</span>
          <span class="omf-tracker-bar-val-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-bar-val-3-a9e9caa2fa6bb4d9d711a5907e59188d">0ms</span>
        </div>
        <div class="omf-tracker-bar-bg-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-tracker-bar-fill-a9e9caa2fa6bb4d9d711a5907e59188d omf-bar-pink-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-bar-3-a9e9caa2fa6bb4d9d711a5907e59188d"></div></div>
      </div>

      <div class="omf-tracker-bar-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="omf-tracker-bar-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <span class="omf-tracker-bar-name-a9e9caa2fa6bb4d9d711a5907e59188d">Response</span>
          <span class="omf-tracker-bar-val-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-bar-val-4-a9e9caa2fa6bb4d9d711a5907e59188d">0ms</span>
        </div>
        <div class="omf-tracker-bar-bg-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="omf-tracker-bar-fill-a9e9caa2fa6bb4d9d711a5907e59188d omf-bar-green-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-bar-4-a9e9caa2fa6bb4d9d711a5907e59188d"></div></div>
      </div>
    </div>
  </div>

  
  <div class="omf-controls-a9e9caa2fa6bb4d9d711a5907e59188d">
    <button type="button" class="omf-btn-a9e9caa2fa6bb4d9d711a5907e59188d omf-btn-primary-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-play-a9e9caa2fa6bb4d9d711a5907e59188d">Play</button>
    <button type="button" class="omf-btn-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-step-a9e9caa2fa6bb4d9d711a5907e59188d">Step</button>
    <button type="button" class="omf-btn-a9e9caa2fa6bb4d9d711a5907e59188d" id="omf-reset-a9e9caa2fa6bb4d9d711a5907e59188d">Reset</button>
  </div>
</div>

<script>
(function() {
  var id = 'a9e9caa2fa6bb4d9d711a5907e59188d';

  var phases = [
    { latencyMs: 5, barPercent: 0.26 },
    { latencyMs: 8, barPercent: 0.42 },
    { latencyMs: 12, barPercent: 0.63 },
    { latencyMs: 1850, barPercent: 97.88 },
    { latencyMs: 15, barPercent: 0.79 }
  ];

  var totalLatency = 1890; 
  var currentPhase = -1;
  var isPlaying = false;
  var playInterval = null;
  var cumulativeMs = 0;

  var phaseEls = document.querySelectorAll('#omf-pipeline-' + id + ' .omf-phase-' + id);
  var arrowEls = [];
  for (var a = 0; a < 4; a++) {
    arrowEls.push(document.getElementById('omf-arrow-' + a + '-' + id));
  }

  var playBtn = document.getElementById('omf-play-' + id);
  var stepBtn = document.getElementById('omf-step-' + id);
  var resetBtn = document.getElementById('omf-reset-' + id);
  var totalEl = document.getElementById('omf-total-' + id);

  function formatMs(ms) {
    if (ms >= 1000) return (ms / 1000).toFixed(2) + 's';
    return ms + 'ms';
  }

  function activatePhase(idx) {
    if (idx < 0 || idx >= phases.length) return;

    // Mark every earlier phase as completed.
    for (var p = 0; p < idx; p++) {
      phaseEls[p].classList.remove('active');
      phaseEls[p].classList.add('completed');
    }

    // Highlight the current phase.
    phaseEls[idx].classList.remove('completed');
    phaseEls[idx].classList.add('active');

    // Light up only the arrows that precede the current phase.
    for (var ar = 0; ar < arrowEls.length; ar++) {
      if (ar < idx) {
        arrowEls[ar].classList.add('active');
      } else {
        arrowEls[ar].classList.remove('active');
      }
    }

    // Add this phase's latency to the running total.
    cumulativeMs += phases[idx].latencyMs;
    totalEl.textContent = formatMs(cumulativeMs);

    // Fill this phase's bar; the 2% floor keeps very small shares visible.
    var barEl = document.getElementById('omf-bar-' + idx + '-' + id);
    var barValEl = document.getElementById('omf-bar-val-' + idx + '-' + id);
    if (barEl) barEl.style.width = Math.max(phases[idx].barPercent, 2) + '%';
    if (barValEl) barValEl.textContent = formatMs(phases[idx].latencyMs);
  }

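  function nextStep() {
    if (currentPhase < phases.length - 1) {
      currentPhase++;
      activatePhase(currentPhase);
    }

    // Lock the controls as soon as the final phase has been shown,
    // so a manual Step does not need one extra click to reach "Done".
    if (currentPhase >= phases.length - 1) {
      stopPlay();
      playBtn.textContent = 'Done';
      playBtn.disabled = true;
      stepBtn.disabled = true;
    }
  }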

  function startPlay() {
    if (isPlaying) return;
    isPlaying = true;
    playBtn.textContent = 'Playing...';
    playBtn.disabled = true;
    stepBtn.disabled = true;

    playInterval = setInterval(function() {
      nextStep();
      if (currentPhase >= phases.length - 1) {
        stopPlay();
        playBtn.textContent = 'Done';
      }
    }, 1200);
  }

  function stopPlay() {
    isPlaying = false;
    clearInterval(playInterval);
    if (currentPhase < phases.length - 1) {
      playBtn.disabled = false;
      stepBtn.disabled = false;
      playBtn.textContent = 'Play';
    }
  }

  function reset() {
    stopPlay();
    currentPhase = -1;
    cumulativeMs = 0;
    totalEl.textContent = '0ms';
    playBtn.textContent = 'Play';
    playBtn.disabled = false;
    stepBtn.disabled = false;

    for (var p = 0; p < phaseEls.length; p++) {
      phaseEls[p].classList.remove('active', 'completed');
    }
    for (var ar = 0; ar < arrowEls.length; ar++) {
      arrowEls[ar].classList.remove('active');
    }
    for (var b = 0; b < phases.length; b++) {
      var barEl = document.getElementById('omf-bar-' + b + '-' + id);
      var barValEl = document.getElementById('omf-bar-val-' + b + '-' + id);
      if (barEl) barEl.style.width = '0';
      if (barValEl) barValEl.textContent = '0ms';
    }

    
    for (var s = 0; s < phases.length; s++) {
      var subEl = document.getElementById('omf-sub-' + s + '-' + id);
      if (subEl) subEl.classList.remove('expanded');
    }
  }

  
  var pipeline = document.getElementById('omf-pipeline-' + id);
  if (pipeline) {
    pipeline.addEventListener('click', function(e) {
      var target = e.target;
      while (target && target !== pipeline) {
        if (target.hasAttribute && target.hasAttribute('data-phase')) {
          var phaseIdx = parseInt(target.getAttribute('data-phase'), 10);
          var subEl = document.getElementById('omf-sub-' + phaseIdx + '-' + id);
          if (subEl) subEl.classList.toggle('expanded');
          return;
        }
        target = target.parentElement;
      }
    });
  }

  playBtn.addEventListener('click', startPlay);
  stepBtn.addEventListener('click', nextStep);
  resetBtn.addEventListener('click', reset);
})();
</script>

<h2 id="system-prompt-filesystem">System Prompt Filesystem</h2>
<p>The runtime&rsquo;s &ldquo;brain&rdquo; isn&rsquo;t code: it&rsquo;s a set of Markdown files on disk that are composed into the system prompt.</p>
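<p>As a rough illustration of that composition step, here is a minimal sketch. It is not OpenClaw&rsquo;s actual loader: the <code>assemblePrompt</code> helper, the section delimiter, and the file contents are assumptions made for this example, and an in-memory object stands in for the session directory. Only the file names, and the fact that <code>AGENTS.md</code> is loaded first, come from the description in this post.</p>

```javascript
// Hypothetical sketch: an in-memory object stands in for the session directory.
// File names follow the post; the contents are shortened placeholders.
const sessionDir = {
  'AGENTS.md': '# Agent Constitution\nYou are OpenClaw, a helpful AI assistant.',
  'SOUL.md': '# Soul: Personality Profile\nFriendly but concise.',
  'TOOLS.md': '# Tool Conventions\nUse markdown for structured output.',
  'skills/weather/SKILL.md': '# Weather Skill\nFetch current weather conditions for any city.',
  'skills/calendar/SKILL.md': '# Google Calendar\nRead, create, and update calendar events.',
};

// Core files are always included, constitution first.
// The relative order of SOUL.md and TOOLS.md is an assumption.
const CORE_FILES = ['AGENTS.md', 'SOUL.md', 'TOOLS.md'];

// Concatenate the core files plus the SKILL.md of each active skill.
function assemblePrompt(dir, activeSkills) {
  const skillPaths = activeSkills.map((name) => 'skills/' + name + '/SKILL.md');
  return CORE_FILES.concat(skillPaths)
    // Skip anything that does not exist in the directory.
    .filter((path) => Object.prototype.hasOwnProperty.call(dir, path))
    // Label each section with its path (an invented delimiter).
    .map((path) => '--- ' + path + ' ---\n' + dir[path].trim())
    .join('\n\n');
}

const prompt = assemblePrompt(sessionDir, ['weather']);
console.log(prompt);
```

<p>Only the skills named in the second argument are pulled in, which is one plausible way to keep inactive skills out of the token budget. A real runtime would presumably read these files from disk rather than from an object literal.</p>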




<style>
.spf-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: white;
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #1a202c;
  box-shadow: 0 12px 30px rgba(0, 0, 0, 0.06);
}

.spf-a9e9caa2fa6bb4d9d711a5907e59188d * {
  box-sizing: border-box;
}

.spf-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  margin-bottom: 1.5rem;
}

.spf-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1.5rem;
  font-weight: 700;
  color: #1a202c;
  margin-bottom: 0.5rem;
}

.spf-subtitle-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.9rem;
  color: #718096;
  max-width: 600px;
  margin: 0 auto;
}

 
.spf-content-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: grid;
  grid-template-columns: 240px 1fr;
  gap: 1.5rem;
  align-items: start;
  margin-bottom: 1.5rem;
}

@media (max-width: 850px) {
  .spf-content-a9e9caa2fa6bb4d9d711a5907e59188d {
    grid-template-columns: 1fr;
  }
}

 
.spf-tree-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #f7fafc;
  border: 1px solid #e2e8f0;
  border-radius: 10px;
  padding: 1rem;
}

.spf-tree-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.75rem;
  font-weight: 700;
  color: #4a5568;
  text-transform: uppercase;
  letter-spacing: 0.05em;
  margin-bottom: 0.75rem;
  padding-bottom: 0.5rem;
  border-bottom: 1px solid #e2e8f0;
}

.spf-file-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  padding: 0.4rem 0.5rem;
  border-radius: 6px;
  cursor: pointer;
  transition: all 0.2s ease;
  margin-bottom: 0.15rem;
}

.spf-file-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  background: #edf2f7;
}

.spf-file-a9e9caa2fa6bb4d9d711a5907e59188d.active {
  background: #ebf8ff;
  border: 1px solid #90cdf4;
}

.spf-file-a9e9caa2fa6bb4d9d711a5907e59188d.indented {
  margin-left: 1.25rem;
}

.spf-file-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 16px;
  height: 16px;
  flex-shrink: 0;
}

.spf-file-icon-md-a9e9caa2fa6bb4d9d711a5907e59188d { color: #3b82f6; }
.spf-file-icon-folder-a9e9caa2fa6bb4d9d711a5907e59188d { color: #f59e0b; }
.spf-file-icon-skill-a9e9caa2fa6bb4d9d711a5907e59188d { color: #a855f7; }

.spf-file-name-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.78rem;
  font-weight: 500;
  color: #2d3748;
  flex: 1;
}

.spf-file-tokens-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #a0aec0;
  font-weight: 500;
}

 
.spf-folder-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  padding: 0.4rem 0.5rem;
  border-radius: 6px;
  cursor: pointer;
  transition: all 0.2s ease;
  margin-bottom: 0.15rem;
}

.spf-folder-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  background: #edf2f7;
}

.spf-folder-children-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: none;
}

.spf-folder-children-a9e9caa2fa6bb4d9d711a5907e59188d.open {
  display: block;
}

.spf-folder-arrow-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #a0aec0;
  transition: transform 0.2s ease;
}

.spf-folder-a9e9caa2fa6bb4d9d711a5907e59188d.open .spf-folder-arrow-a9e9caa2fa6bb4d9d711a5907e59188d {
  transform: rotate(90deg);
}

 
.spf-preview-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #f7fafc;
  border: 1px solid #e2e8f0;
  border-radius: 10px;
  padding: 1.25rem;
  min-height: 320px;
}

.spf-preview-placeholder-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  color: #a0aec0;
  padding: 3rem 1rem;
}

.spf-preview-placeholder-a9e9caa2fa6bb4d9d711a5907e59188d svg {
  width: 40px;
  height: 40px;
  margin-bottom: 0.75rem;
  opacity: 0.4;
}

.spf-preview-placeholder-a9e9caa2fa6bb4d9d711a5907e59188d p {
  font-size: 0.8rem;
  margin: 0;
}

.spf-preview-content-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: none;
}

.spf-preview-content-a9e9caa2fa6bb4d9d711a5907e59188d.active {
  display: block;
  animation: spf-fade-a9e9caa2fa6bb4d9d711a5907e59188d 0.3s ease;
}

@keyframes spf-fade-a9e9caa2fa6bb4d9d711a5907e59188d {
  from { opacity: 0; transform: translateY(5px); }
  to { opacity: 1; transform: translateY(0); }
}

.spf-preview-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  justify-content: space-between;
  margin-bottom: 0.75rem;
  padding-bottom: 0.75rem;
  border-bottom: 1px solid #e2e8f0;
}

.spf-preview-filename-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.9rem;
  font-weight: 700;
  color: #2d3748;
}

.spf-preview-badge-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  padding: 0.2rem 0.5rem;
  border-radius: 4px;
  font-weight: 600;
  text-transform: uppercase;
}

.spf-badge-constitution-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #dbeafe;
  color: #2563eb;
}

.spf-badge-personality-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #fce7f3;
  color: #db2777;
}

.spf-badge-conventions-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #d1fae5;
  color: #059669;
}

.spf-badge-skill-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #ede9fe;
  color: #7c3aed;
}

.spf-preview-role-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.8rem;
  color: #4a5568;
  margin-bottom: 0.75rem;
  line-height: 1.6;
}

.spf-preview-code-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #1e293b;
  border-radius: 8px;
  padding: 1rem;
  font-family: 'Monaco', 'Menlo', 'Consolas', monospace;
  font-size: 0.7rem;
  line-height: 1.6;
  color: #e2e8f0;
  overflow-x: auto;
  white-space: pre-wrap;
  word-break: break-word;
}

.spf-code-comment-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #64748b;
}

.spf-code-key-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #7dd3fc;
}

.spf-code-val-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #86efac;
}

.spf-code-heading-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #fbbf24;
  font-weight: 700;
}

 
.spf-not-code-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: inline-block;
  font-size: 0.65rem;
  padding: 0.25rem 0.6rem;
  background: #fef3c7;
  border: 1px solid #f59e0b;
  border-radius: 4px;
  color: #92400e;
  font-weight: 700;
  margin-bottom: 0.75rem;
}

 
.spf-tokens-a9e9caa2fa6bb4d9d711a5907e59188d {
  margin-bottom: 1.5rem;
}

.spf-tokens-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.8rem;
  font-weight: 700;
  color: #2d3748;
  margin-bottom: 0.75rem;
}

.spf-tokens-bar-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  height: 28px;
  border-radius: 6px;
  overflow: hidden;
  margin-bottom: 0.5rem;
}

.spf-tokens-segment-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.55rem;
  font-weight: 600;
  color: white;
  transition: all 0.3s ease;
  cursor: default;
  position: relative;
  overflow: hidden;
  white-space: nowrap;
}

.spf-tokens-segment-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  filter: brightness(1.15);
}

.spf-seg-agents-a9e9caa2fa6bb4d9d711a5907e59188d { background: #3b82f6; }
.spf-seg-soul-a9e9caa2fa6bb4d9d711a5907e59188d { background: #ec4899; }
.spf-seg-tools-a9e9caa2fa6bb4d9d711a5907e59188d { background: #22c55e; }
.spf-seg-skills-a9e9caa2fa6bb4d9d711a5907e59188d { background: #a855f7; }
.spf-seg-memory-a9e9caa2fa6bb4d9d711a5907e59188d { background: #f59e0b; }
.spf-seg-history-a9e9caa2fa6bb4d9d711a5907e59188d { background: #64748b; }

.spf-tokens-legend-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  flex-wrap: wrap;
  gap: 0.75rem;
  justify-content: center;
}

.spf-legend-item-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.3rem;
  font-size: 0.65rem;
  color: #718096;
}

.spf-legend-dot-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 8px;
  height: 8px;
  border-radius: 2px;
  flex-shrink: 0;
}

 
.spf-clawhub-a9e9caa2fa6bb4d9d711a5907e59188d {
  border-top: 1px solid #e2e8f0;
  padding-top: 1.5rem;
}

.spf-clawhub-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  justify-content: space-between;
  margin-bottom: 1rem;
}

.spf-clawhub-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1rem;
  font-weight: 700;
  color: #2d3748;
}

.spf-clawhub-count-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.75rem;
  color: #718096;
  padding: 0.25rem 0.75rem;
  background: #f7fafc;
  border: 1px solid #e2e8f0;
  border-radius: 20px;
}

.spf-clawhub-grid-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: grid;
  grid-template-columns: repeat(auto-fill, minmax(160px, 1fr));
  gap: 0.75rem;
}

.spf-skill-card-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: #f7fafc;
  border: 1px solid #e2e8f0;
  border-radius: 8px;
  padding: 0.75rem;
  transition: all 0.2s ease;
}

.spf-skill-card-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  border-color: #a855f7;
  box-shadow: 0 2px 8px rgba(168, 85, 247, 0.1);
  transform: translateY(-1px);
}

.spf-skill-name-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.8rem;
  font-weight: 600;
  color: #2d3748;
  margin-bottom: 0.25rem;
}

.spf-skill-desc-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.65rem;
  color: #718096;
  margin-bottom: 0.4rem;
  line-height: 1.4;
}

.spf-skill-installs-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #a855f7;
  font-weight: 600;
}

 
@media (max-width: 600px) {
  .spf-a9e9caa2fa6bb4d9d711a5907e59188d {
    padding: 1.25rem;
  }

  .spf-clawhub-grid-a9e9caa2fa6bb4d9d711a5907e59188d {
    grid-template-columns: 1fr;
  }
}
</style>

<div class="spf-a9e9caa2fa6bb4d9d711a5907e59188d">
  <div class="spf-header-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="spf-title-a9e9caa2fa6bb4d9d711a5907e59188d">System Prompt as Filesystem</div>
    <div class="spf-subtitle-a9e9caa2fa6bb4d9d711a5907e59188d">The agent's identity, rules, and capabilities are composed from markdown files on disk. Skills are natural-language documents, not code.</div>
  </div>

  
  <div class="spf-content-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="spf-tree-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="spf-tree-title-a9e9caa2fa6bb4d9d711a5907e59188d">Session Directory</div>

      <div class="spf-file-a9e9caa2fa6bb4d9d711a5907e59188d" data-file="agents">
        <svg class="spf-file-icon-a9e9caa2fa6bb4d9d711a5907e59188d spf-file-icon-md-a9e9caa2fa6bb4d9d711a5907e59188d" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/></svg>
        <span class="spf-file-name-a9e9caa2fa6bb4d9d711a5907e59188d">AGENTS.md</span>
        <span class="spf-file-tokens-a9e9caa2fa6bb4d9d711a5907e59188d">~800</span>
      </div>

      <div class="spf-file-a9e9caa2fa6bb4d9d711a5907e59188d" data-file="soul">
        <svg class="spf-file-icon-a9e9caa2fa6bb4d9d711a5907e59188d spf-file-icon-md-a9e9caa2fa6bb4d9d711a5907e59188d" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/></svg>
        <span class="spf-file-name-a9e9caa2fa6bb4d9d711a5907e59188d">SOUL.md</span>
        <span class="spf-file-tokens-a9e9caa2fa6bb4d9d711a5907e59188d">~600</span>
      </div>

      <div class="spf-file-a9e9caa2fa6bb4d9d711a5907e59188d" data-file="tools">
        <svg class="spf-file-icon-a9e9caa2fa6bb4d9d711a5907e59188d spf-file-icon-md-a9e9caa2fa6bb4d9d711a5907e59188d" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/></svg>
        <span class="spf-file-name-a9e9caa2fa6bb4d9d711a5907e59188d">TOOLS.md</span>
        <span class="spf-file-tokens-a9e9caa2fa6bb4d9d711a5907e59188d">~400</span>
      </div>

      <div class="spf-folder-a9e9caa2fa6bb4d9d711a5907e59188d" id="spf-skills-folder-a9e9caa2fa6bb4d9d711a5907e59188d">
        <span class="spf-folder-arrow-a9e9caa2fa6bb4d9d711a5907e59188d">&#9654;</span>
        <svg class="spf-file-icon-a9e9caa2fa6bb4d9d711a5907e59188d spf-file-icon-folder-a9e9caa2fa6bb4d9d711a5907e59188d" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M22 19a2 2 0 0 1-2 2H4a2 2 0 0 1-2-2V5a2 2 0 0 1 2-2h5l2 3h9a2 2 0 0 1 2 2z"/></svg>
        <span class="spf-file-name-a9e9caa2fa6bb4d9d711a5907e59188d">skills/</span>
      </div>

      <div class="spf-folder-children-a9e9caa2fa6bb4d9d711a5907e59188d" id="spf-skills-children-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="spf-file-a9e9caa2fa6bb4d9d711a5907e59188d indented" data-file="weather-skill">
          <svg class="spf-file-icon-a9e9caa2fa6bb4d9d711a5907e59188d spf-file-icon-skill-a9e9caa2fa6bb4d9d711a5907e59188d" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/></svg>
          <span class="spf-file-name-a9e9caa2fa6bb4d9d711a5907e59188d">weather/SKILL.md</span>
          <span class="spf-file-tokens-a9e9caa2fa6bb4d9d711a5907e59188d">~120</span>
        </div>
        <div class="spf-file-a9e9caa2fa6bb4d9d711a5907e59188d indented" data-file="calendar-skill">
          <svg class="spf-file-icon-a9e9caa2fa6bb4d9d711a5907e59188d spf-file-icon-skill-a9e9caa2fa6bb4d9d711a5907e59188d" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/></svg>
          <span class="spf-file-name-a9e9caa2fa6bb4d9d711a5907e59188d">calendar/SKILL.md</span>
          <span class="spf-file-tokens-a9e9caa2fa6bb4d9d711a5907e59188d">~150</span>
        </div>
        <div class="spf-file-a9e9caa2fa6bb4d9d711a5907e59188d indented" data-file="code-review-skill">
          <svg class="spf-file-icon-a9e9caa2fa6bb4d9d711a5907e59188d spf-file-icon-skill-a9e9caa2fa6bb4d9d711a5907e59188d" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/></svg>
          <span class="spf-file-name-a9e9caa2fa6bb4d9d711a5907e59188d">code-review/SKILL.md</span>
          <span class="spf-file-tokens-a9e9caa2fa6bb4d9d711a5907e59188d">~180</span>
        </div>
      </div>
    </div>

    
    <div class="spf-preview-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="spf-preview-placeholder-a9e9caa2fa6bb4d9d711a5907e59188d" id="spf-placeholder-a9e9caa2fa6bb4d9d711a5907e59188d">
        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5">
          <path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/>
          <polyline points="14 2 14 8 20 8"/>
        </svg>
        <p>Click any file to preview its content and role in the system prompt.</p>
      </div>
      <div class="spf-preview-content-a9e9caa2fa6bb4d9d711a5907e59188d" id="spf-preview-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="spf-preview-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <span class="spf-preview-filename-a9e9caa2fa6bb4d9d711a5907e59188d" id="spf-prev-name-a9e9caa2fa6bb4d9d711a5907e59188d">File</span>
          <span class="spf-preview-badge-a9e9caa2fa6bb4d9d711a5907e59188d" id="spf-prev-badge-a9e9caa2fa6bb4d9d711a5907e59188d">Type</span>
        </div>
        <div id="spf-prev-notcode-a9e9caa2fa6bb4d9d711a5907e59188d"></div>
        <div class="spf-preview-role-a9e9caa2fa6bb4d9d711a5907e59188d" id="spf-prev-role-a9e9caa2fa6bb4d9d711a5907e59188d">Role description</div>
        <div class="spf-preview-code-a9e9caa2fa6bb4d9d711a5907e59188d" id="spf-prev-code-a9e9caa2fa6bb4d9d711a5907e59188d">Content</div>
      </div>
    </div>
  </div>

  
  <div class="spf-tokens-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="spf-tokens-title-a9e9caa2fa6bb4d9d711a5907e59188d">Token Contribution to Assembled Prompt (~4,200 tokens)</div>
    <div class="spf-tokens-bar-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="spf-tokens-segment-a9e9caa2fa6bb4d9d711a5907e59188d spf-seg-agents-a9e9caa2fa6bb4d9d711a5907e59188d" style="width: 19%;">AGENTS</div>
      <div class="spf-tokens-segment-a9e9caa2fa6bb4d9d711a5907e59188d spf-seg-soul-a9e9caa2fa6bb4d9d711a5907e59188d" style="width: 14%;">SOUL</div>
      <div class="spf-tokens-segment-a9e9caa2fa6bb4d9d711a5907e59188d spf-seg-tools-a9e9caa2fa6bb4d9d711a5907e59188d" style="width: 10%;">TOOLS</div>
      <div class="spf-tokens-segment-a9e9caa2fa6bb4d9d711a5907e59188d spf-seg-skills-a9e9caa2fa6bb4d9d711a5907e59188d" style="width: 11%;">Skills</div>
      <div class="spf-tokens-segment-a9e9caa2fa6bb4d9d711a5907e59188d spf-seg-memory-a9e9caa2fa6bb4d9d711a5907e59188d" style="width: 21%;">Memory</div>
      <div class="spf-tokens-segment-a9e9caa2fa6bb4d9d711a5907e59188d spf-seg-history-a9e9caa2fa6bb4d9d711a5907e59188d" style="width: 25%;">History</div>
    </div>
    <div class="spf-tokens-legend-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="spf-legend-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="spf-legend-dot-a9e9caa2fa6bb4d9d711a5907e59188d" style="background:#3b82f6;"></div>AGENTS.md (800)</div>
      <div class="spf-legend-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="spf-legend-dot-a9e9caa2fa6bb4d9d711a5907e59188d" style="background:#ec4899;"></div>SOUL.md (600)</div>
      <div class="spf-legend-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="spf-legend-dot-a9e9caa2fa6bb4d9d711a5907e59188d" style="background:#22c55e;"></div>TOOLS.md (400)</div>
      <div class="spf-legend-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="spf-legend-dot-a9e9caa2fa6bb4d9d711a5907e59188d" style="background:#a855f7;"></div>Active Skills (450)</div>
      <div class="spf-legend-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="spf-legend-dot-a9e9caa2fa6bb4d9d711a5907e59188d" style="background:#f59e0b;"></div>Memory Pages (900)</div>
      <div class="spf-legend-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="spf-legend-dot-a9e9caa2fa6bb4d9d711a5907e59188d" style="background:#64748b;"></div>Conversation History (1,050)</div>
    </div>
  </div>

  
  <div class="spf-clawhub-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="spf-clawhub-header-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="spf-clawhub-title-a9e9caa2fa6bb4d9d711a5907e59188d">ClawHub: Community Skills</div>
      <div class="spf-clawhub-count-a9e9caa2fa6bb4d9d711a5907e59188d">5,705+ skills</div>
    </div>
    <div class="spf-clawhub-grid-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="spf-skill-card-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="spf-skill-name-a9e9caa2fa6bb4d9d711a5907e59188d">weather</div>
        <div class="spf-skill-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Real-time weather data for any city worldwide</div>
        <div class="spf-skill-installs-a9e9caa2fa6bb4d9d711a5907e59188d">12.4K installs</div>
      </div>
      <div class="spf-skill-card-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="spf-skill-name-a9e9caa2fa6bb4d9d711a5907e59188d">google-calendar</div>
        <div class="spf-skill-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Read, create, and manage Google Calendar events</div>
        <div class="spf-skill-installs-a9e9caa2fa6bb4d9d711a5907e59188d">8.9K installs</div>
      </div>
      <div class="spf-skill-card-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="spf-skill-name-a9e9caa2fa6bb4d9d711a5907e59188d">code-review</div>
        <div class="spf-skill-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Automated code review with style and security checks</div>
        <div class="spf-skill-installs-a9e9caa2fa6bb4d9d711a5907e59188d">6.2K installs</div>
      </div>
      <div class="spf-skill-card-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="spf-skill-name-a9e9caa2fa6bb4d9d711a5907e59188d">web-search</div>
        <div class="spf-skill-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Search the web and summarize results</div>
        <div class="spf-skill-installs-a9e9caa2fa6bb4d9d711a5907e59188d">15.1K installs</div>
      </div>
      <div class="spf-skill-card-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="spf-skill-name-a9e9caa2fa6bb4d9d711a5907e59188d">notion-sync</div>
        <div class="spf-skill-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Sync notes and databases with Notion</div>
        <div class="spf-skill-installs-a9e9caa2fa6bb4d9d711a5907e59188d">4.7K installs</div>
      </div>
      <div class="spf-skill-card-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="spf-skill-name-a9e9caa2fa6bb4d9d711a5907e59188d">image-gen</div>
        <div class="spf-skill-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Generate images via DALL-E or Stable Diffusion</div>
        <div class="spf-skill-installs-a9e9caa2fa6bb4d9d711a5907e59188d">9.3K installs</div>
      </div>
    </div>
  </div>
</div>

<script>
(function() {
  var id = 'a9e9caa2fa6bb4d9d711a5907e59188d';

  var files = {
    'agents': {
      name: 'AGENTS.md',
      badgeClass: 'spf-badge-constitution-' + id,
      badgeText: 'Constitution',
      role: 'The agent\'s core constitution. Defines identity, boundaries, and behavioral rules. This is the first file loaded into every prompt and sets the non-negotiable constraints the agent must follow.',
      showNotCode: false,
      content: '<span class="spf-code-heading-' + id + '"># Agent Constitution</span>\n\n<span class="spf-code-comment-' + id + '">## Identity</span>\nYou are OpenClaw, a helpful AI assistant.\nYou communicate via messaging platforms.\nYou have access to tools and long-term memory.\n\n<span class="spf-code-comment-' + id + '">## Rules</span>\n- Never impersonate other people\n- Never share private data across sessions\n- Always cite sources when using web search\n- Respect rate limits on external APIs\n- Ask for confirmation before destructive actions\n\n<span class="spf-code-comment-' + id + '">## Session Isolation</span>\nEach conversation has its own namespace.\nDo not leak information between sessions.'
    },
    'soul': {
      name: 'SOUL.md',
      badgeClass: 'spf-badge-personality-' + id,
      badgeText: 'Personality',
      role: 'Defines the agent\'s personality, tone, and communication style. This is what makes an OpenClaw instance feel like "your" assistant rather than a generic chatbot. Fully customizable per deployment.',
      showNotCode: false,
      content: '<span class="spf-code-heading-' + id + '"># Soul: Personality Profile</span>\n\n<span class="spf-code-comment-' + id + '">## Tone</span>\nFriendly but concise. No corporate speak.\nUse emoji sparingly and naturally.\nMirror the user\'s communication style.\n\n<span class="spf-code-comment-' + id + '">## Expertise</span>\nPrimary: software engineering, DevOps\nSecondary: general knowledge, research\nHonest about uncertainty: say "I don\'t know"\n\n<span class="spf-code-comment-' + id + '">## Quirks</span>\nOccasionally uses analogies to explain concepts.\nPrefers short messages over walls of text.\nProactively suggests follow-up actions.'
    },
    'tools': {
      name: 'TOOLS.md',
      badgeClass: 'spf-badge-conventions-' + id,
      badgeText: 'Conventions',
      role: 'Conventions and guidelines for how the agent should use its tools. Defines formatting rules, error handling patterns, and best practices that apply across all skill invocations.',
      showNotCode: false,
      content: '<span class="spf-code-heading-' + id + '"># Tool Conventions</span>\n\n<span class="spf-code-comment-' + id + '">## Formatting</span>\n- Use markdown for structured output\n- Wrap code blocks with language tags\n- Keep tool call arguments minimal\n\n<span class="spf-code-comment-' + id + '">## Error Handling</span>\n- On tool failure: explain + suggest alternatives\n- On rate limit: back off and inform the user\n- On timeout: retry once, then report\n\n<span class="spf-code-comment-' + id + '">## Context Awareness</span>\n- Check memory before external lookups\n- Prefer cached results for repeat queries\n- Log important findings to MEMORY.md'
    },
    'weather-skill': {
      name: 'skills/weather/SKILL.md',
      badgeClass: 'spf-badge-skill-' + id,
      badgeText: 'Skill',
      role: 'A community skill from ClawHub. Skills are pure markdown documents with YAML frontmatter that describe what the tool does, its parameters, and usage examples. The LLM reads this like documentation.',
      showNotCode: true,
      content: '<span class="spf-code-comment-' + id + '">---</span>\n<span class="spf-code-key-' + id + '">name:</span> <span class="spf-code-val-' + id + '">weather</span>\n<span class="spf-code-key-' + id + '">version:</span> <span class="spf-code-val-' + id + '">2.1.0</span>\n<span class="spf-code-key-' + id + '">author:</span> <span class="spf-code-val-' + id + '">clawhub/official</span>\n<span class="spf-code-key-' + id + '">description:</span> <span class="spf-code-val-' + id + '">Get real-time weather data</span>\n<span class="spf-code-key-' + id + '">parameters:</span>\n  <span class="spf-code-key-' + id + '">- city:</span> <span class="spf-code-val-' + id + '">string (required)</span>\n  <span class="spf-code-key-' + id + '">- units:</span> <span class="spf-code-val-' + id + '">metric|imperial (default: metric)</span>\n<span class="spf-code-comment-' + id + '">---</span>\n\n<span class="spf-code-heading-' + id + '"># Weather Skill</span>\n\nFetch current weather conditions for any city.\nReturns temperature, humidity, wind, and forecast.'
    },
    'calendar-skill': {
      name: 'skills/calendar/SKILL.md',
      badgeClass: 'spf-badge-skill-' + id,
      badgeText: 'Skill',
      role: 'Google Calendar integration skill. Demonstrates how skills compose multiple API operations (read, create, update) into a single natural-language interface.',
      showNotCode: true,
      content: '<span class="spf-code-comment-' + id + '">---</span>\n<span class="spf-code-key-' + id + '">name:</span> <span class="spf-code-val-' + id + '">google-calendar</span>\n<span class="spf-code-key-' + id + '">version:</span> <span class="spf-code-val-' + id + '">1.8.3</span>\n<span class="spf-code-key-' + id + '">author:</span> <span class="spf-code-val-' + id + '">clawhub/official</span>\n<span class="spf-code-key-' + id + '">description:</span> <span class="spf-code-val-' + id + '">Manage Google Calendar events</span>\n<span class="spf-code-key-' + id + '">auth:</span> <span class="spf-code-val-' + id + '">oauth2 (google)</span>\n<span class="spf-code-key-' + id + '">operations:</span>\n  <span class="spf-code-val-' + id + '">- list_events</span>\n  <span class="spf-code-val-' + id + '">- create_event</span>\n  <span class="spf-code-val-' + id + '">- update_event</span>\n<span class="spf-code-comment-' + id + '">---</span>\n\n<span class="spf-code-heading-' + id + '"># Google Calendar</span>\n\nRead, create, and update calendar events.\nSupports recurring events and timezone handling.'
    },
    'code-review-skill': {
      name: 'skills/code-review/SKILL.md',
      badgeClass: 'spf-badge-skill-' + id,
      badgeText: 'Skill',
      role: 'Code review skill. Demonstrates how complex, multi-step workflows (diff analysis, style checking, security scanning) are expressed as simple markdown instructions.',
      showNotCode: true,
      content: '<span class="spf-code-comment-' + id + '">---</span>\n<span class="spf-code-key-' + id + '">name:</span> <span class="spf-code-val-' + id + '">code-review</span>\n<span class="spf-code-key-' + id + '">version:</span> <span class="spf-code-val-' + id + '">3.0.1</span>\n<span class="spf-code-key-' + id + '">author:</span> <span class="spf-code-val-' + id + '">community/devtools</span>\n<span class="spf-code-key-' + id + '">description:</span> <span class="spf-code-val-' + id + '">Automated code review</span>\n<span class="spf-code-key-' + id + '">parameters:</span>\n  <span class="spf-code-key-' + id + '">- diff:</span> <span class="spf-code-val-' + id + '">string (required)</span>\n  <span class="spf-code-key-' + id + '">- language:</span> <span class="spf-code-val-' + id + '">string (auto-detect)</span>\n  <span class="spf-code-key-' + id + '">- focus:</span> <span class="spf-code-val-' + id + '">style|security|performance|all</span>\n<span class="spf-code-comment-' + id + '">---</span>\n\n<span class="spf-code-heading-' + id + '"># Code Review</span>\n\nAnalyze diffs for style, security, and performance.\nProvide actionable suggestions with line references.'
    }
  };

  
  // Show the chosen file's name, badge, role, and contents in the preview pane.
  function showFile(fileId) {
    var file = files[fileId];
    if (!file) return;

    
    var allFiles = document.querySelectorAll('.spf-' + id + ' .spf-file-' + id);
    for (var i = 0; i < allFiles.length; i++) {
      allFiles[i].classList.remove('active');
    }
    var clickedEl = document.querySelector('.spf-' + id + ' [data-file="' + fileId + '"]');
    if (clickedEl) clickedEl.classList.add('active');

    
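    // Hide the placeholder, then remove and re-add the class so any CSS animation on .active restarts.
    document.getElementById('spf-placeholder-' + id).style.display = 'none';
    var preview = document.getElementById('spf-preview-' + id);
    preview.classList.remove('active');
    void preview.offsetWidth; // reading offsetWidth forces a reflow between the two class changes
    preview.classList.add('active');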

    document.getElementById('spf-prev-name-' + id).textContent = file.name;

    var badge = document.getElementById('spf-prev-badge-' + id);
    badge.textContent = file.badgeText;
    badge.className = 'spf-preview-badge-' + id + ' ' + file.badgeClass;

    document.getElementById('spf-prev-role-' + id).textContent = file.role;
    document.getElementById('spf-prev-code-' + id).innerHTML = file.content;

    var notCodeEl = document.getElementById('spf-prev-notcode-' + id);
    if (file.showNotCode) {
      notCodeEl.innerHTML = '<div class="spf-not-code-' + id + '">NOT CODE &mdash; MARKDOWN DOCUMENT</div>';
    } else {
      notCodeEl.innerHTML = '';
    }
  }

  
  // One delegated click handler covers both the file entries and the skills folder toggle.
  var container = document.querySelector('.spf-' + id);
  if (container) {
    container.addEventListener('click', function(e) {
      var target = e.target;

      while (target && target !== container) {
        
        if (target.hasAttribute && target.hasAttribute('data-file')) {
          showFile(target.getAttribute('data-file'));
          return;
        }

        
        if (target.id === 'spf-skills-folder-' + id || target.closest('#spf-skills-folder-' + id)) {
          var folder = document.getElementById('spf-skills-folder-' + id);
          var children = document.getElementById('spf-skills-children-' + id);
          folder.classList.toggle('open');
          children.classList.toggle('open');
          return;
        }

        target = target.parentElement;
      }
    });
  }
})();
</script>

<h2 id="memory-system">Memory System</h2>
<p>The most distinctive primitive: long-term memory that lives on disk, not in the context window.</p>
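<p>To make that concrete, here is a minimal sketch of paging context out to disk. It is an illustration under stated assumptions, not the actual implementation: <code>compactContext</code>, the item fields (<code>kind</code>, <code>durable</code>), and the <code>Map</code> standing in for <code>MEMORY.md</code> are all hypothetical, and a real system would ask the LLM to write the summary rather than truncate text.</p>
<pre><code class="language-javascript">// Hypothetical sketch: durable facts go to a disk-backed store (a Map stands in
// for MEMORY.md here), and only a small working set stays in the context window.
function compactContext(context, memoryStore) {
  const kept = [];
  for (const item of context) {
    if (item.durable) {
      // Persist anything worth remembering before it leaves the window.
      memoryStore.set(item.id, item.text);
    }
    if (item.kind === 'tool_result') {
      continue; // drop bulky tool output from the window
    }
    if (item.kind === 'history') {
      // Stand-in for an LLM-written summary: keep a short prefix only.
      kept.push({ id: item.id, kind: 'summary', text: item.text.slice(0, 40) });
      continue;
    }
    kept.push(item);
  }
  return kept;
}

// Made-up context items for illustration.
const store = new Map();
const context = [
  { id: 'sys', kind: 'system', text: 'You are a helpful agent.', durable: false },
  { id: 'h1', kind: 'history', text: 'User prefers metric units and concise replies.', durable: true },
  { id: 't1', kind: 'tool_result', text: 'very long weather payload', durable: false }
];
const next = compactContext(context, store);
console.log(next.length, store.size); // 2 1
</code></pre>
<p>The ordering is the point of the sketch: notes are written to the durable store before anything is dropped, so the window can shrink without losing what matters.</p>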




<style>
.mem-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: linear-gradient(180deg, #0f172a 0%, #1e293b 100%);
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #e2e8f0;
}

.mem-a9e9caa2fa6bb4d9d711a5907e59188d * {
  box-sizing: border-box;
}

.mem-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  margin-bottom: 1.5rem;
}

.mem-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.5rem;
}

.mem-subtitle-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.9rem;
  color: #94a3b8;
  max-width: 600px;
  margin: 0 auto;
}

 
.mem-zones-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: grid;
  grid-template-columns: 1fr 1fr 1fr;
  gap: 1rem;
  margin-bottom: 1.5rem;
}

@media (max-width: 850px) {
  .mem-zones-a9e9caa2fa6bb4d9d711a5907e59188d {
    grid-template-columns: 1fr;
  }
}

 
.mem-zone-a9e9caa2fa6bb4d9d711a5907e59188d {
  border-radius: 12px;
  padding: 1rem;
  border: 2px solid;
}

.mem-zone-context-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(59, 130, 246, 0.08);
  border-color: rgba(59, 130, 246, 0.4);
}

.mem-zone-disk-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(168, 85, 247, 0.08);
  border-color: rgba(168, 85, 247, 0.4);
}

.mem-zone-search-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(236, 72, 153, 0.08);
  border-color: rgba(236, 72, 153, 0.4);
}

.mem-zone-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  margin-bottom: 0.75rem;
}

.mem-zone-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 28px;
  height: 28px;
  border-radius: 6px;
  display: flex;
  align-items: center;
  justify-content: center;
}

.mem-zone-context-a9e9caa2fa6bb4d9d711a5907e59188d .mem-zone-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(59, 130, 246, 0.2);
  color: #60a5fa;
}

.mem-zone-disk-a9e9caa2fa6bb4d9d711a5907e59188d .mem-zone-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(168, 85, 247, 0.2);
  color: #c084fc;
}

.mem-zone-search-a9e9caa2fa6bb4d9d711a5907e59188d .mem-zone-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(236, 72, 153, 0.2);
  color: #f472b6;
}

.mem-zone-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.85rem;
  font-weight: 700;
  color: #f8fafc;
}

.mem-zone-label-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #64748b;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

.mem-zone-desc-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.72rem;
  color: #94a3b8;
  line-height: 1.5;
  margin-bottom: 0.75rem;
}

 
.mem-ctx-bar-wrap-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(15, 23, 42, 0.6);
  border-radius: 8px;
  padding: 0.75rem;
  margin-bottom: 0.5rem;
}

.mem-ctx-bar-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  justify-content: space-between;
  font-size: 0.65rem;
  margin-bottom: 0.4rem;
}

.mem-ctx-bar-label-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #94a3b8;
}

.mem-ctx-bar-val-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #60a5fa;
  font-weight: 600;
}

.mem-ctx-bar-bg-a9e9caa2fa6bb4d9d711a5907e59188d {
  height: 12px;
  background: rgba(59, 130, 246, 0.15);
  border-radius: 6px;
  overflow: hidden;
}

.mem-ctx-bar-fill-a9e9caa2fa6bb4d9d711a5907e59188d {
  height: 100%;
  background: linear-gradient(90deg, #3b82f6, #60a5fa);
  border-radius: 6px;
  transition: width 1s ease;
  width: 85%;
}

.mem-ctx-tokens-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  justify-content: space-between;
  font-size: 0.6rem;
  color: #64748b;
  margin-top: 0.3rem;
}

.mem-ctx-items-a9e9caa2fa6bb4d9d711a5907e59188d {
  margin-top: 0.5rem;
}

.mem-ctx-item-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.4rem;
  padding: 0.25rem 0;
  font-size: 0.7rem;
  color: #94a3b8;
}

.mem-ctx-item-dot-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 6px;
  height: 6px;
  border-radius: 50%;
  background: #60a5fa;
  flex-shrink: 0;
}

 
.mem-disk-tree-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(15, 23, 42, 0.6);
  border-radius: 8px;
  padding: 0.75rem;
}

.mem-disk-file-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.4rem;
  padding: 0.35rem 0.5rem;
  border-radius: 6px;
  cursor: pointer;
  transition: all 0.2s ease;
  margin-bottom: 0.15rem;
}

.mem-disk-file-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  background: rgba(168, 85, 247, 0.15);
}

.mem-disk-file-a9e9caa2fa6bb4d9d711a5907e59188d.active {
  background: rgba(168, 85, 247, 0.2);
  border: 1px solid rgba(168, 85, 247, 0.4);
}

.mem-disk-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #c084fc;
  flex-shrink: 0;
}

.mem-disk-name-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.72rem;
  font-weight: 500;
  color: #e2e8f0;
  flex: 1;
}

.mem-disk-size-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #64748b;
}

 
.mem-disk-preview-a9e9caa2fa6bb4d9d711a5907e59188d {
  margin-top: 0.5rem;
  background: rgba(15, 23, 42, 0.8);
  border-radius: 6px;
  /* No vertical padding while closed, so the collapsed box takes up no height. */
  padding: 0 0.6rem;
  font-family: 'Monaco', 'Menlo', monospace;
  font-size: 0.6rem;
  color: #94a3b8;
  line-height: 1.5;
  /* The preview text is set via textContent and relies on its newlines. */
  white-space: pre-wrap;
  max-height: 0;
  overflow: hidden;
  transition: max-height 0.4s ease, padding 0.4s ease;
}

.mem-disk-preview-a9e9caa2fa6bb4d9d711a5907e59188d.open {
  max-height: 150px;
  padding: 0.6rem;
  overflow-y: auto;
}

 
.mem-search-paths-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
}

.mem-search-path-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(15, 23, 42, 0.6);
  border-radius: 8px;
  padding: 0.6rem 0.75rem;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.mem-search-path-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 24px;
  height: 24px;
  border-radius: 6px;
  display: flex;
  align-items: center;
  justify-content: center;
  flex-shrink: 0;
}

.mem-search-bm25-a9e9caa2fa6bb4d9d711a5907e59188d .mem-search-path-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(245, 158, 11, 0.2);
  color: #fbbf24;
}

.mem-search-vector-a9e9caa2fa6bb4d9d711a5907e59188d .mem-search-path-icon-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(34, 197, 94, 0.2);
  color: #4ade80;
}

.mem-search-path-name-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.75rem;
  font-weight: 600;
  color: #f8fafc;
}

.mem-search-path-desc-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #64748b;
}

.mem-search-merge-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  font-size: 0.65rem;
  color: #94a3b8;
  padding: 0.25rem 0;
}

.mem-search-merge-a9e9caa2fa6bb4d9d711a5907e59188d svg {
  vertical-align: middle;
}

.mem-search-result-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(236, 72, 153, 0.1);
  border: 1px solid rgba(236, 72, 153, 0.3);
  border-radius: 8px;
  padding: 0.5rem 0.75rem;
  text-align: center;
}

.mem-search-result-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.7rem;
  font-weight: 600;
  color: #f472b6;
}

.mem-search-result-desc-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #94a3b8;
}

 
.mem-compact-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(15, 23, 42, 0.4);
  border: 2px solid rgba(245, 158, 11, 0.3);
  border-radius: 12px;
  padding: 1.25rem;
  margin-bottom: 1.5rem;
}

.mem-compact-header-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  justify-content: space-between;
  margin-bottom: 1rem;
}

.mem-compact-title-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.95rem;
  font-weight: 700;
  color: #fbbf24;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.mem-compact-btn-a9e9caa2fa6bb4d9d711a5907e59188d {
  padding: 0.5rem 1.25rem;
  background: linear-gradient(135deg, #f59e0b 0%, #d97706 100%);
  color: #1e293b;
  border: none;
  border-radius: 8px;
  font-size: 0.78rem;
  font-weight: 700;
  cursor: pointer;
  transition: all 0.2s ease;
}

.mem-compact-btn-a9e9caa2fa6bb4d9d711a5907e59188d:hover {
  transform: translateY(-1px);
  box-shadow: 0 4px 12px rgba(245, 158, 11, 0.4);
}

.mem-compact-btn-a9e9caa2fa6bb4d9d711a5907e59188d:disabled {
  opacity: 0.5;
  cursor: not-allowed;
  transform: none;
}

 
.mem-compact-steps-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
}

.mem-compact-step-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  padding: 0.5rem 0.75rem;
  border-radius: 8px;
  background: rgba(15, 23, 42, 0.6);
  opacity: 0.3;
  transition: all 0.4s ease;
}

.mem-compact-step-a9e9caa2fa6bb4d9d711a5907e59188d.active {
  opacity: 1;
  background: rgba(245, 158, 11, 0.1);
  border: 1px solid rgba(245, 158, 11, 0.3);
}

.mem-compact-step-a9e9caa2fa6bb4d9d711a5907e59188d.done {
  opacity: 0.7;
}

.mem-compact-step-num-a9e9caa2fa6bb4d9d711a5907e59188d {
  width: 22px;
  height: 22px;
  border-radius: 50%;
  background: rgba(245, 158, 11, 0.2);
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.65rem;
  font-weight: 700;
  color: #fbbf24;
  flex-shrink: 0;
}

.mem-compact-step-text-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.75rem;
  color: #cbd5e1;
  flex: 1;
}

.mem-compact-step-check-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #4ade80;
  font-size: 0.85rem;
  display: none;
}

.mem-compact-step-a9e9caa2fa6bb4d9d711a5907e59188d.done .mem-compact-step-check-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: inline;
}

 
.mem-compact-compare-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: grid;
  grid-template-columns: 1fr auto 1fr;
  gap: 1rem;
  margin-top: 1rem;
  opacity: 0;
  transition: opacity 0.6s ease;
}

.mem-compact-compare-a9e9caa2fa6bb4d9d711a5907e59188d.visible {
  opacity: 1;
}

.mem-compact-box-a9e9caa2fa6bb4d9d711a5907e59188d {
  background: rgba(15, 23, 42, 0.6);
  border-radius: 8px;
  padding: 0.75rem;
  text-align: center;
}

.mem-compact-box-label-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 0.6rem;
  color: #64748b;
  text-transform: uppercase;
  letter-spacing: 0.05em;
  margin-bottom: 0.35rem;
}

.mem-compact-box-val-a9e9caa2fa6bb4d9d711a5907e59188d {
  font-size: 1.2rem;
  font-weight: 700;
}

.mem-compact-box-before-a9e9caa2fa6bb4d9d711a5907e59188d .mem-compact-box-val-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #f87171;
}

.mem-compact-box-after-a9e9caa2fa6bb4d9d711a5907e59188d .mem-compact-box-val-a9e9caa2fa6bb4d9d711a5907e59188d {
  color: #4ade80;
}

.mem-compact-arrow-a9e9caa2fa6bb4d9d711a5907e59188d {
  display: flex;
  align-items: center;
  color: #fbbf24;
  font-size: 1.2rem;
}

 
.mem-footer-a9e9caa2fa6bb4d9d711a5907e59188d {
  text-align: center;
  padding: 0.75rem;
  background: rgba(245, 158, 11, 0.08);
  border: 1px solid rgba(245, 158, 11, 0.25);
  border-radius: 10px;
}

.mem-footer-a9e9caa2fa6bb4d9d711a5907e59188d p {
  font-size: 0.8rem;
  color: #fbbf24;
  margin: 0;
  line-height: 1.5;
  font-style: italic;
}

.mem-footer-a9e9caa2fa6bb4d9d711a5907e59188d strong {
  color: #fcd34d;
  font-style: normal;
}

 
@media (max-width: 600px) {
  .mem-a9e9caa2fa6bb4d9d711a5907e59188d {
    padding: 1.25rem;
  }

  .mem-compact-compare-a9e9caa2fa6bb4d9d711a5907e59188d {
    grid-template-columns: 1fr;
    gap: 0.5rem;
  }

  .mem-compact-arrow-a9e9caa2fa6bb4d9d711a5907e59188d {
    justify-content: center;
    transform: rotate(90deg);
  }
}
</style>

<div class="mem-a9e9caa2fa6bb4d9d711a5907e59188d">
  <div class="mem-header-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="mem-title-a9e9caa2fa6bb4d9d711a5907e59188d">Virtual Memory for Cognition</div>
    <div class="mem-subtitle-a9e9caa2fa6bb4d9d711a5907e59188d">Long-term memory lives on disk, not in the context window. The agent pages knowledge in and out the way an OS manages virtual memory.</div>
  </div>

  
  <div class="mem-zones-a9e9caa2fa6bb4d9d711a5907e59188d">

    
    <div class="mem-zone-a9e9caa2fa6bb4d9d711a5907e59188d mem-zone-context-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="mem-zone-header-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-zone-icon-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><rect x="4" y="4" width="16" height="16" rx="2"/><rect x="9" y="9" width="6" height="6"/></svg>
        </div>
        <div>
          <div class="mem-zone-title-a9e9caa2fa6bb4d9d711a5907e59188d">LLM Context</div>
          <div class="mem-zone-label-a9e9caa2fa6bb4d9d711a5907e59188d">Cache (volatile)</div>
        </div>
      </div>
      <div class="mem-zone-desc-a9e9caa2fa6bb4d9d711a5907e59188d">The active working set. Fast but limited. Everything here is lost when the conversation ends or context fills up.</div>

      <div class="mem-ctx-bar-wrap-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-ctx-bar-header-a9e9caa2fa6bb4d9d711a5907e59188d">
          <span class="mem-ctx-bar-label-a9e9caa2fa6bb4d9d711a5907e59188d">Context Usage</span>
          <span class="mem-ctx-bar-val-a9e9caa2fa6bb4d9d711a5907e59188d" id="mem-ctx-pct-a9e9caa2fa6bb4d9d711a5907e59188d">85%</span>
        </div>
        <div class="mem-ctx-bar-bg-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="mem-ctx-bar-fill-a9e9caa2fa6bb4d9d711a5907e59188d" id="mem-ctx-fill-a9e9caa2fa6bb4d9d711a5907e59188d"></div>
        </div>
        <div class="mem-ctx-tokens-a9e9caa2fa6bb4d9d711a5907e59188d">
          <span id="mem-ctx-used-a9e9caa2fa6bb4d9d711a5907e59188d">170K tokens used</span>
          <span>200K limit</span>
        </div>
      </div>

      <div class="mem-ctx-items-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-ctx-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="mem-ctx-item-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>System prompt (4.2K)</div>
        <div class="mem-ctx-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="mem-ctx-item-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Conversation history (95K)</div>
        <div class="mem-ctx-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="mem-ctx-item-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Tool results (48K)</div>
        <div class="mem-ctx-item-a9e9caa2fa6bb4d9d711a5907e59188d"><div class="mem-ctx-item-dot-a9e9caa2fa6bb4d9d711a5907e59188d"></div>Memory pages (22.8K)</div>
      </div>
    </div>

    
    <div class="mem-zone-a9e9caa2fa6bb4d9d711a5907e59188d mem-zone-disk-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="mem-zone-header-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-zone-icon-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><ellipse cx="12" cy="5" rx="9" ry="3"/><path d="M21 12c0 1.66-4 3-9 3s-9-1.34-9-3"/><path d="M3 5v14c0 1.66 4 3 9 3s9-1.34 9-3V5"/></svg>
        </div>
        <div>
          <div class="mem-zone-title-a9e9caa2fa6bb4d9d711a5907e59188d">Local Disk</div>
          <div class="mem-zone-label-a9e9caa2fa6bb4d9d711a5907e59188d">Source of truth (durable)</div>
        </div>
      </div>
      <div class="mem-zone-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Persistent storage that survives across sessions. Capacity is limited only by disk space, not by the context window. The ground truth for all agent knowledge.</div>

      <div class="mem-disk-tree-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-disk-file-a9e9caa2fa6bb4d9d711a5907e59188d" data-disk="memory-md">
          <svg class="mem-disk-icon-a9e9caa2fa6bb4d9d711a5907e59188d" width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/></svg>
          <span class="mem-disk-name-a9e9caa2fa6bb4d9d711a5907e59188d">MEMORY.md</span>
          <span class="mem-disk-size-a9e9caa2fa6bb4d9d711a5907e59188d">12KB</span>
        </div>
        <div class="mem-disk-file-a9e9caa2fa6bb4d9d711a5907e59188d" data-disk="daily-log">
          <svg class="mem-disk-icon-a9e9caa2fa6bb4d9d711a5907e59188d" width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/></svg>
          <span class="mem-disk-name-a9e9caa2fa6bb4d9d711a5907e59188d">memory/2026-02-16.md</span>
          <span class="mem-disk-size-a9e9caa2fa6bb4d9d711a5907e59188d">3KB</span>
        </div>
        <div class="mem-disk-file-a9e9caa2fa6bb4d9d711a5907e59188d" data-disk="sessions-db">
          <svg class="mem-disk-icon-a9e9caa2fa6bb4d9d711a5907e59188d" width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><ellipse cx="12" cy="5" rx="9" ry="3"/><path d="M21 12c0 1.66-4 3-9 3s-9-1.34-9-3"/><path d="M3 5v14c0 1.66 4 3 9 3s9-1.34 9-3V5"/></svg>
          <span class="mem-disk-name-a9e9caa2fa6bb4d9d711a5907e59188d">sessions.sqlite</span>
          <span class="mem-disk-size-a9e9caa2fa6bb4d9d711a5907e59188d">8MB</span>
        </div>
        <div class="mem-disk-file-a9e9caa2fa6bb4d9d711a5907e59188d" data-disk="embeddings-db">
          <svg class="mem-disk-icon-a9e9caa2fa6bb4d9d711a5907e59188d" width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><ellipse cx="12" cy="5" rx="9" ry="3"/><path d="M21 12c0 1.66-4 3-9 3s-9-1.34-9-3"/><path d="M3 5v14c0 1.66 4 3 9 3s9-1.34 9-3V5"/></svg>
          <span class="mem-disk-name-a9e9caa2fa6bb4d9d711a5907e59188d">embeddings.db</span>
          <span class="mem-disk-size-a9e9caa2fa6bb4d9d711a5907e59188d">24MB</span>
        </div>
      </div>

      <div class="mem-disk-preview-a9e9caa2fa6bb4d9d711a5907e59188d" id="mem-disk-preview-a9e9caa2fa6bb4d9d711a5907e59188d"></div>
    </div>

    
    <div class="mem-zone-a9e9caa2fa6bb4d9d711a5907e59188d mem-zone-search-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="mem-zone-header-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-zone-icon-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><circle cx="11" cy="11" r="8"/><path d="M21 21l-4.35-4.35"/></svg>
        </div>
        <div>
          <div class="mem-zone-title-a9e9caa2fa6bb4d9d711a5907e59188d">Search &amp; Retrieval</div>
          <div class="mem-zone-label-a9e9caa2fa6bb4d9d711a5907e59188d">Page-in mechanism</div>
        </div>
      </div>
      <div class="mem-zone-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Dual search paths find relevant memories and page them back into context when needed.</div>

      <div class="mem-search-paths-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-search-path-a9e9caa2fa6bb4d9d711a5907e59188d mem-search-bm25-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="mem-search-path-icon-a9e9caa2fa6bb4d9d711a5907e59188d">
            <svg width="12" height="12" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M4 6h16M4 12h10M4 18h14"/></svg>
          </div>
          <div>
            <div class="mem-search-path-name-a9e9caa2fa6bb4d9d711a5907e59188d">BM25 Keyword</div>
            <div class="mem-search-path-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Exact term matching, fast</div>
          </div>
        </div>

        <div class="mem-search-path-a9e9caa2fa6bb4d9d711a5907e59188d mem-search-vector-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="mem-search-path-icon-a9e9caa2fa6bb4d9d711a5907e59188d">
            <svg width="12" height="12" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><circle cx="12" cy="12" r="10"/><path d="M2 12h20M12 2a15.3 15.3 0 0 1 4 10 15.3 15.3 0 0 1-4 10 15.3 15.3 0 0 1-4-10 15.3 15.3 0 0 1 4-10z"/></svg>
          </div>
          <div>
            <div class="mem-search-path-name-a9e9caa2fa6bb4d9d711a5907e59188d">Vector Similarity</div>
            <div class="mem-search-path-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Semantic matching, flexible</div>
          </div>
        </div>

        <div class="mem-search-merge-a9e9caa2fa6bb4d9d711a5907e59188d">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="#94a3b8" stroke-width="2"><path d="M12 5v14M5 12l7 7 7-7"/></svg>
          merge &amp; re-rank
        </div>

        <div class="mem-search-result-a9e9caa2fa6bb4d9d711a5907e59188d">
          <div class="mem-search-result-title-a9e9caa2fa6bb4d9d711a5907e59188d">Ranked Results</div>
          <div class="mem-search-result-desc-a9e9caa2fa6bb4d9d711a5907e59188d">Top-K memory pages paged into context</div>
        </div>
      </div>
    </div>
  </div>

  
  <div class="mem-compact-a9e9caa2fa6bb4d9d711a5907e59188d">
    <div class="mem-compact-header-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="mem-compact-title-a9e9caa2fa6bb4d9d711a5907e59188d">
        <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M21 15v4a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2v-4"/><polyline points="7 10 12 15 17 10"/><line x1="12" y1="15" x2="12" y2="3"/></svg>
        /compact — Context Paging
      </div>
      <button type="button" class="mem-compact-btn-a9e9caa2fa6bb4d9d711a5907e59188d" id="mem-compact-btn-a9e9caa2fa6bb4d9d711a5907e59188d">Run /compact</button>
    </div>

    <div class="mem-compact-steps-a9e9caa2fa6bb4d9d711a5907e59188d" id="mem-compact-steps-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="mem-compact-step-a9e9caa2fa6bb4d9d711a5907e59188d" id="mem-step-0-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-compact-step-num-a9e9caa2fa6bb4d9d711a5907e59188d">1</div>
        <div class="mem-compact-step-text-a9e9caa2fa6bb4d9d711a5907e59188d">Write durable notes from context to MEMORY.md</div>
        <div class="mem-compact-step-check-a9e9caa2fa6bb4d9d711a5907e59188d">&#10003;</div>
      </div>
      <div class="mem-compact-step-a9e9caa2fa6bb4d9d711a5907e59188d" id="mem-step-1-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-compact-step-num-a9e9caa2fa6bb4d9d711a5907e59188d">2</div>
        <div class="mem-compact-step-text-a9e9caa2fa6bb4d9d711a5907e59188d">Summarize conversation history (compress)</div>
        <div class="mem-compact-step-check-a9e9caa2fa6bb4d9d711a5907e59188d">&#10003;</div>
      </div>
      <div class="mem-compact-step-a9e9caa2fa6bb4d9d711a5907e59188d" id="mem-step-2-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-compact-step-num-a9e9caa2fa6bb4d9d711a5907e59188d">3</div>
        <div class="mem-compact-step-text-a9e9caa2fa6bb4d9d711a5907e59188d">Drop redundant tool outputs from context</div>
        <div class="mem-compact-step-check-a9e9caa2fa6bb4d9d711a5907e59188d">&#10003;</div>
      </div>
      <div class="mem-compact-step-a9e9caa2fa6bb4d9d711a5907e59188d" id="mem-step-3-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-compact-step-num-a9e9caa2fa6bb4d9d711a5907e59188d">4</div>
        <div class="mem-compact-step-text-a9e9caa2fa6bb4d9d711a5907e59188d">Rebuild context window with essential state only</div>
        <div class="mem-compact-step-check-a9e9caa2fa6bb4d9d711a5907e59188d">&#10003;</div>
      </div>
    </div>

    <div class="mem-compact-compare-a9e9caa2fa6bb4d9d711a5907e59188d" id="mem-compact-compare-a9e9caa2fa6bb4d9d711a5907e59188d">
      <div class="mem-compact-box-a9e9caa2fa6bb4d9d711a5907e59188d mem-compact-box-before-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-compact-box-label-a9e9caa2fa6bb4d9d711a5907e59188d">Before</div>
        <div class="mem-compact-box-val-a9e9caa2fa6bb4d9d711a5907e59188d">170K tokens</div>
        <div style="font-size:0.65rem;color:#64748b;">85% capacity</div>
      </div>
      <div class="mem-compact-arrow-a9e9caa2fa6bb4d9d711a5907e59188d">&#8594;</div>
      <div class="mem-compact-box-a9e9caa2fa6bb4d9d711a5907e59188d mem-compact-box-after-a9e9caa2fa6bb4d9d711a5907e59188d">
        <div class="mem-compact-box-label-a9e9caa2fa6bb4d9d711a5907e59188d">After</div>
        <div class="mem-compact-box-val-a9e9caa2fa6bb4d9d711a5907e59188d">50K tokens</div>
        <div style="font-size:0.65rem;color:#64748b;">25% capacity</div>
      </div>
    </div>
  </div>

  
  <div class="mem-footer-a9e9caa2fa6bb4d9d711a5907e59188d">
    <p><strong>"RAM is limited, disk is large, and paging decides what comes back."</strong> The OS metaphor that ties the entire architecture together.</p>
  </div>
</div>
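<p>The widget does not say how the &ldquo;merge &amp; re-rank&rdquo; step combines the two result lists. One common choice is reciprocal rank fusion (RRF), so the sketch below assumes that technique rather than describing what this system actually does. <code>fuseRankings</code> and the page IDs are hypothetical, and <code>k = 60</code> is simply a commonly used damping constant.</p>
<pre><code class="language-javascript">// Hypothetical helper: fuse ranked ID lists with reciprocal rank fusion (RRF).
// Each list is ordered best-first; a page scores 1 / (k + rank) per list it appears in.
function fuseRankings(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach(function (docId, index) {
      const rank = index + 1;
      const previous = scores.get(docId) || 0;
      scores.set(docId, previous + 1 / (k + rank));
    });
  }
  // Sort by fused score, highest first, and return only the IDs.
  return Array.from(scores.entries())
    .sort(function (a, b) { return b[1] - a[1]; })
    .map(function (entry) { return entry[0]; });
}

// Made-up page IDs standing in for BM25 and vector search results.
const bm25Hits = ['memory-md#prefs', 'log-02-16#standup', 'memory-md#atlas'];
const vectorHits = ['memory-md#atlas', 'memory-md#prefs', 'log-02-15#phoenix'];

const topK = fuseRankings([bm25Hits, vectorHits]).slice(0, 3);
console.log(topK); // prefs first, then atlas, then standup
</code></pre>
<p>RRF only needs each list's ordering, which sidesteps the problem that BM25 scores and cosine similarities are not on the same scale. Pages found by both paths rise to the top.</p>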

<script>
(function() {
  var id = 'a9e9caa2fa6bb4d9d711a5907e59188d';

  
  var diskPreviews = {
    'memory-md': '# Long-Term Memory\n\n## User Preferences\n- Prefers metric units\n- Timezone: Asia/Tokyo\n- Communication style: concise\n\n## Key Facts\n- Works at Acme Corp (engineering)\n- Project "Atlas" deadline: March 15\n- Prefers Claude for code tasks\n\n## Recurring Tasks\n- Daily standup summary at 9am\n- Weekly report every Friday',
    'daily-log': '# 2026-02-16\n\n## Conversations\n- 09:12 Weather query for Tokyo (WhatsApp)\n- 10:45 Code review for PR #847 (Slack)\n- 14:30 Calendar conflict resolution\n\n## Learnings\n- User started new project "Phoenix"\n- Preferred test framework: vitest\n- New team member: Sarah (frontend)',
    'sessions-db': '-- sessions.sqlite schema\nCREATE TABLE sessions (\n  id TEXT PRIMARY KEY,\n  channel TEXT NOT NULL,\n  user_id TEXT NOT NULL,\n  created_at DATETIME,\n  last_active DATETIME,\n  message_count INTEGER DEFAULT 0\n);\n-- 1,247 active sessions\n-- 38,912 total messages indexed',
    'embeddings-db': '-- embeddings.db\n-- Vector store for semantic search\n--\n-- 15,832 embedded chunks\n-- Model: text-embedding-3-small\n-- Dimensions: 1536\n-- Index: HNSW (ef=200, M=16)\n--\n-- Avg query time: 8ms\n-- Coverage: all MEMORY.md + daily logs'
  };

  
  // Delegated click handler: toggle the inline preview for the clicked disk file.
  var diskContainer = document.querySelector('.mem-zone-disk-' + id);
  if (diskContainer) {
    diskContainer.addEventListener('click', function(e) {
      var target = e.target;
      while (target && target !== diskContainer) {
        if (target.hasAttribute && target.hasAttribute('data-disk')) {
          var diskId = target.getAttribute('data-disk');
          var preview = diskPreviews[diskId];
          if (!preview) return;

          
          var allDiskFiles = diskContainer.querySelectorAll('.mem-disk-file-' + id);
          var wasActive = target.classList.contains('active');
          for (var d = 0; d < allDiskFiles.length; d++) {
            allDiskFiles[d].classList.remove('active');
          }

          var previewEl = document.getElementById('mem-disk-preview-' + id);
          if (wasActive) {
            previewEl.classList.remove('open');
            previewEl.textContent = '';
          } else {
            target.classList.add('active');
            previewEl.textContent = preview;
            previewEl.classList.add('open');
          }
          return;
        }
        target = target.parentElement;
      }
    });
  }

  
  // Step through the /compact animation when the button is clicked.
  var compactBtn = document.getElementById('mem-compact-btn-' + id);
  var compactRunning = false;

  if (compactBtn) {
    compactBtn.addEventListener('click', function() {
      if (compactRunning) return;
      compactRunning = true;
      compactBtn.disabled = true;
      compactBtn.textContent = 'Running...';

      var steps = [
        document.getElementById('mem-step-0-' + id),
        document.getElementById('mem-step-1-' + id),
        document.getElementById('mem-step-2-' + id),
        document.getElementById('mem-step-3-' + id)
      ];

      var ctxFill = document.getElementById('mem-ctx-fill-' + id);
      var ctxPct = document.getElementById('mem-ctx-pct-' + id);
      var ctxUsed = document.getElementById('mem-ctx-used-' + id);
      var compare = document.getElementById('mem-compact-compare-' + id);

      function runStep(idx) {
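        if (idx >= steps.length) {
          // All steps have run: mark the last one done before showing the final state.
          steps[steps.length - 1].classList.remove('active');
          steps[steps.length - 1].classList.add('done');
          setTimeout(function() {
            ctxFill.style.width = '25%';
            ctxPct.textContent = '25%';
            ctxUsed.textContent = '50K tokens used';
            compare.classList.add('visible');
            compactBtn.textContent = '/compact done';
          }, 400);
          return;
        }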

        
        if (idx > 0) {
          steps[idx - 1].classList.remove('active');
          steps[idx - 1].classList.add('done');
        }

        
        steps[idx].classList.add('active');

        
        if (idx === 1) {
          ctxFill.style.width = '65%';
          ctxPct.textContent = '65%';
          ctxUsed.textContent = '130K tokens used';
        } else if (idx === 2) {
          ctxFill.style.width = '45%';
          ctxPct.textContent = '45%';
          ctxUsed.textContent = '90K tokens used';
        } else if (idx === 3) {
          ctxFill.style.width = '30%';
          ctxPct.textContent = '30%';
          ctxUsed.textContent = '60K tokens used';
        }

        setTimeout(function() {
          runStep(idx + 1);
        }, 1200);
      }

      
      for (var s = 0; s < steps.length; s++) {
        steps[s].classList.remove('active', 'done');
      }
      compare.classList.remove('visible');

      runStep(0);
    });
  }
})();
</script>

<p>The entire system is an exercise in composition: message queues, schedulers, filesystems, and virtual memory, familiar abstractions from operating systems, recomposed into an AI agent.</p>
]]></content:encoded></item><item><title>Why `vllm serve` Works on Day Zero (and What It Takes to Make It Fast)</title><link>https://www.mdjawad.com/posts/zero-day-vllm/</link><pubDate>Sat, 14 Feb 2026 12:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/zero-day-vllm/</guid><description>A deep dive into vLLM&amp;rsquo;s tiered model integration — from the Transformers fallback that enables zero-day support to the native integration path that makes it fast.</description><content:encoded><![CDATA[<p>In this post, we&rsquo;ll trace what happens when vLLM encounters a model it&rsquo;s never seen before. We&rsquo;ll work through the full lifecycle from the initial <code>config.json</code> pull off Hugging Face, through the registry lookup that decides the integration path, into either the Transformers fallback or the native integration code, and down to the forward pass where PagedAttention kernels actually execute.</p>
<p>Why this matters: new model architectures appear constantly, and vLLM needs to serve them. The interesting engineering question is <em>how</em> — because the optimizations that make inference fast (fused kernels, CUDA Graphs, tensor parallelism) require deep model-specific restructuring. You can&rsquo;t just <code>import</code> a model and get peak performance. vLLM resolves this tension with a tiered system: immediate support through a compatibility layer, then a clear path to fully optimized native integration.</p>
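<p>As a rough preview of the tiered idea, here is a minimal sketch of how an integration path could be chosen from the <code>architectures</code> field of a Hugging Face <code>config.json</code>. The names <code>NATIVE_MODELS</code>, <code>FALLBACK</code>, and <code>resolve_integration</code> are hypothetical and chosen for illustration; this is not vLLM&rsquo;s actual registry API.</p>

```python
# Hypothetical sketch of tiered model resolution. The table and function
# names are illustrative assumptions, not vLLM's real registry.
NATIVE_MODELS = {
    "LlamaForCausalLM": "native:llama",
    "MistralForCausalLM": "native:mistral",
}
FALLBACK = "fallback:transformers"


def resolve_integration(config: dict) -> str:
    """Pick an integration path from the `architectures` list in config.json."""
    for arch in config.get("architectures", []):
        if arch in NATIVE_MODELS:
            return NATIVE_MODELS[arch]  # optimized, model-specific code path
    return FALLBACK  # compatibility layer: works without dedicated code


known = resolve_integration({"architectures": ["LlamaForCausalLM"]})
unknown = resolve_integration({"architectures": ["BrandNewForCausalLM"]})
print(known, unknown)
```

<p>The point of the sketch is the default: an unrecognized architecture does not raise an error, it falls through to the compatibility layer.</p>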
<p>This post is structured into 4 parts:</p>
<ol>
<li><strong>The Gateway</strong> — how vLLM decides what to do with a model it receives</li>
<li><strong>The Transformers Fallback</strong> — the zero-day mechanism and its trade-offs</li>
<li><strong>Native Integration</strong> — what it takes to make a model truly fast in vLLM</li>
<li><strong>The Execution Core</strong> — forward pass, weight loading, and distributed execution</li>
</ol>
<p>We&rsquo;ll build on concepts from previous posts. If you&rsquo;re not familiar with <a href="/posts/flash-attention/">PagedAttention and FlashAttention</a>, or <a href="/posts/llm-inference-hidden-stack/">the hidden software stack beneath inference</a>, those are worth reading first. We also won&rsquo;t re-explain the <a href="/posts/orchestrating-inference/">Engine-Worker orchestration layer</a> in full — just enough to ground the model integration story.</p>
<p>Here&rsquo;s the starting point:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>vllm serve some-brand-new/model-7B --dtype auto
</span></span><span style="display:flex;"><span><span style="color:#75715e"># This works. Even for a model vLLM has never seen before.</span>
</span></span></code></pre></div><p>That command succeeds for models that vLLM has no dedicated code for. Let&rsquo;s understand why.</p>
<hr>
<h2 id="part-1-the-gateway">Part 1: The Gateway</h2>
<h3 id="the-engine-worker-model-hierarchy">The Engine-Worker-Model Hierarchy</h3>
<p>vLLM enforces a strict separation of concerns in how it handles models. Before we get into the model-specific details, let&rsquo;s establish the high-level architecture, since it determines <em>where</em> new model code actually lives.</p>
<p>There are four levels:</p>
<ol>
<li><strong>LLMEngine</strong> — the control plane. Handles scheduling, manages the BlockSpaceManager (which tracks physical GPU memory blocks), and decides which requests get processed in each iteration. The Engine is completely agnostic to model architecture.</li>
<li><strong>Worker</strong> — one per GPU. Manages the GPU device, holds its slice of model weights, and coordinates with other Workers for distributed execution.</li>
<li><strong>ModelRunner</strong> — sits inside each Worker. Responsible for converting logical request data (token IDs, sequence lengths) into the physical tensors the model needs. This is where input flattening happens.</li>
<li><strong>Model</strong> — the neural network itself. Whether it&rsquo;s a native <code>LlamaForCausalLM</code> or a wrapped <code>TransformersModel</code>, this is the only layer that changes when you add a new model.</li>
</ol>
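<p>The four levels above can be sketched as plain Python classes. This is a toy model of the layering under simplifying assumptions (no GPU, no scheduling policy, no KV cache); every class here is hypothetical and only mirrors the roles described in the list.</p>

```python
from typing import Protocol


class Model(Protocol):
    """Bottom layer: the only piece that differs per architecture."""

    def forward(self, token_ids: list[int]) -> list[float]: ...


class ToyNativeModel:
    """Stand-in for a natively integrated model."""

    def forward(self, token_ids: list[int]) -> list[float]:
        return [float(t) for t in token_ids]


class ToyWrappedModel:
    """Stand-in for a model served through a compatibility wrapper."""

    def forward(self, token_ids: list[int]) -> list[float]:
        return [float(t) * 2 for t in token_ids]


class ModelRunner:
    """Flattens per-request token lists into one batch and calls the model."""

    def __init__(self, model: Model) -> None:
        self.model = model

    def run(self, requests: list[list[int]]) -> list[float]:
        flat = [tok for req in requests for tok in req]  # input flattening
        return self.model.forward(flat)


class Worker:
    """One per GPU in the real system; here it only owns a runner."""

    def __init__(self, model: Model) -> None:
        self.runner = ModelRunner(model)

    def execute(self, requests: list[list[int]]) -> list[float]:
        return self.runner.run(requests)


class Engine:
    """Control plane: hands requests to the worker, never inspects the model."""

    def __init__(self, worker: Worker) -> None:
        self.worker = worker

    def step(self, pending: list[list[int]]) -> list[float]:
        return self.worker.execute(pending)


batch = [[1, 2], [3]]
native_out = Engine(Worker(ToyNativeModel())).step(batch)
wrapped_out = Engine(Worker(ToyWrappedModel())).step(batch)
print(native_out, wrapped_out)
```

<p>Swapping <code>ToyNativeModel</code> for <code>ToyWrappedModel</code> changes the output but requires no change to <code>Engine</code>, <code>Worker</code>, or <code>ModelRunner</code>.</p>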
<p>The key property here: to make scheduling decisions, the Engine only needs the KV cache element size, which is derived from <code>num_layers</code>, <code>hidden_size</code>, and <code>num_attention_heads</code> in the model config. It never touches the model&rsquo;s forward pass. This means adding support for an entirely new architecture only changes the bottom layer of this stack. Everything above it stays the same.</p>
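<p>To make the size calculation concrete, here is a back-of-the-envelope sketch. It assumes standard multi-head attention (every attention head stores its own key and value), 2-byte fp16 elements, and a block size of 16 tokens; the function names and the example config values are illustrative assumptions, and the real computation in vLLM may account for more than this.</p>

```python
def kv_cache_bytes_per_token(
    num_layers: int,
    hidden_size: int,
    num_attention_heads: int,
    dtype_bytes: int = 2,  # assumption: fp16 / bf16 storage
) -> int:
    """KV cache bytes for a single token, assuming plain multi-head attention."""
    head_dim = hidden_size // num_attention_heads
    # Factor of 2: one key vector and one value vector per head, per layer.
    return 2 * num_layers * num_attention_heads * head_dim * dtype_bytes


def kv_cache_bytes_per_block(per_token_bytes: int, block_size: int = 16) -> int:
    """Bytes for one fixed-size block of tokens (block_size is an assumption)."""
    return per_token_bytes * block_size


# Example values in the style of a 7B-class config (illustrative only).
per_token = kv_cache_bytes_per_token(
    num_layers=32, hidden_size=4096, num_attention_heads=32
)
per_block = kv_cache_bytes_per_block(per_token)
print(per_token, per_block)
```

<p>With these numbers each token costs 512 KiB of cache and each 16-token block costs 8 MiB. None of the inputs require running the model, which is why the control plane can stay architecture-agnostic.</p>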




<style>
.eh-5590d392f5ce8b7f84eedffa089138fd {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: linear-gradient(180deg, #0f172a 0%, #1e293b 100%);
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #e2e8f0;
}

.eh-5590d392f5ce8b7f84eedffa089138fd * {
  box-sizing: border-box;
}

.eh-header-5590d392f5ce8b7f84eedffa089138fd {
  text-align: center;
  margin-bottom: 1.5rem;
}

.eh-title-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.5rem;
}

.eh-subtitle-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.9rem;
  color: #94a3b8;
  max-width: 600px;
  margin: 0 auto;
}

 
.eh-content-5590d392f5ce8b7f84eedffa089138fd {
  display: grid;
  grid-template-columns: 1fr 300px;
  gap: 1.5rem;
  align-items: start;
}

@media (max-width: 850px) {
  .eh-content-5590d392f5ce8b7f84eedffa089138fd {
    grid-template-columns: 1fr;
  }
  .eh-sidebar-5590d392f5ce8b7f84eedffa089138fd {
    order: -1;
  }
}

 
.eh-stack-5590d392f5ce8b7f84eedffa089138fd {
  position: relative;
}

 
.eh-layer-5590d392f5ce8b7f84eedffa089138fd {
  position: relative;
  border-radius: 12px;
  padding: 1rem;
  border: 2px solid;
  transition: all 0.3s ease;
  cursor: pointer;
}

.eh-layer-5590d392f5ce8b7f84eedffa089138fd.selected {
  transform: scale(1.02);
}

.eh-layer-5590d392f5ce8b7f84eedffa089138fd.highlighted {
  transform: scale(1.02);
}

 
.eh-layer-engine-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(59, 130, 246, 0.1);
  border-color: rgba(59, 130, 246, 0.4);
}
.eh-layer-engine-5590d392f5ce8b7f84eedffa089138fd.selected,
.eh-layer-engine-5590d392f5ce8b7f84eedffa089138fd.highlighted {
  box-shadow: 0 0 24px rgba(59, 130, 246, 0.4);
  border-color: #3b82f6;
}

.eh-layer-worker-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(168, 85, 247, 0.1);
  border-color: rgba(168, 85, 247, 0.4);
}
.eh-layer-worker-5590d392f5ce8b7f84eedffa089138fd.selected,
.eh-layer-worker-5590d392f5ce8b7f84eedffa089138fd.highlighted {
  box-shadow: 0 0 24px rgba(168, 85, 247, 0.4);
  border-color: #a855f7;
}

.eh-layer-runner-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(236, 72, 153, 0.1);
  border-color: rgba(236, 72, 153, 0.4);
}
.eh-layer-runner-5590d392f5ce8b7f84eedffa089138fd.selected,
.eh-layer-runner-5590d392f5ce8b7f84eedffa089138fd.highlighted {
  box-shadow: 0 0 24px rgba(236, 72, 153, 0.4);
  border-color: #ec4899;
}

.eh-layer-model-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(34, 197, 94, 0.1);
  border-color: rgba(34, 197, 94, 0.4);
  border-style: dashed;
}
.eh-layer-model-5590d392f5ce8b7f84eedffa089138fd.selected,
.eh-layer-model-5590d392f5ce8b7f84eedffa089138fd.highlighted {
  box-shadow: 0 0 24px rgba(34, 197, 94, 0.4);
  border-color: #22c55e;
}

 
.eh-layer-header-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  justify-content: space-between;
  margin-bottom: 0.75rem;
}

.eh-layer-title-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.85rem;
  font-weight: 700;
  text-transform: uppercase;
  letter-spacing: 0.05em;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.eh-layer-engine-5590d392f5ce8b7f84eedffa089138fd .eh-layer-title-5590d392f5ce8b7f84eedffa089138fd { color: #60a5fa; }
.eh-layer-worker-5590d392f5ce8b7f84eedffa089138fd .eh-layer-title-5590d392f5ce8b7f84eedffa089138fd { color: #c084fc; }
.eh-layer-runner-5590d392f5ce8b7f84eedffa089138fd .eh-layer-title-5590d392f5ce8b7f84eedffa089138fd { color: #f472b6; }
.eh-layer-model-5590d392f5ce8b7f84eedffa089138fd .eh-layer-title-5590d392f5ce8b7f84eedffa089138fd { color: #4ade80; }

 
.eh-role-badge-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.65rem;
  padding: 0.25rem 0.5rem;
  border-radius: 4px;
  font-weight: 500;
  background: rgba(255, 255, 255, 0.1);
  color: #94a3b8;
}

 
.eh-components-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  gap: 0.5rem;
  flex-wrap: wrap;
  justify-content: center;
}

.eh-component-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 8px;
  padding: 0.6rem 0.8rem;
  cursor: pointer;
  transition: all 0.2s ease;
  text-align: center;
  min-width: 85px;
}

.eh-component-5590d392f5ce8b7f84eedffa089138fd:hover {
  transform: translateY(-2px);
  border-color: #64748b;
}

.eh-component-5590d392f5ce8b7f84eedffa089138fd.selected {
  transform: scale(1.05);
}

.eh-layer-engine-5590d392f5ce8b7f84eedffa089138fd .eh-component-5590d392f5ce8b7f84eedffa089138fd.selected {
  border-color: #3b82f6;
  box-shadow: 0 0 20px rgba(59, 130, 246, 0.4);
}
.eh-layer-worker-5590d392f5ce8b7f84eedffa089138fd .eh-component-5590d392f5ce8b7f84eedffa089138fd.selected {
  border-color: #a855f7;
  box-shadow: 0 0 20px rgba(168, 85, 247, 0.4);
}
.eh-layer-runner-5590d392f5ce8b7f84eedffa089138fd .eh-component-5590d392f5ce8b7f84eedffa089138fd.selected {
  border-color: #ec4899;
  box-shadow: 0 0 20px rgba(236, 72, 153, 0.4);
}
.eh-layer-model-5590d392f5ce8b7f84eedffa089138fd .eh-component-5590d392f5ce8b7f84eedffa089138fd.selected {
  border-color: #22c55e;
  box-shadow: 0 0 20px rgba(34, 197, 94, 0.4);
}

.eh-comp-name-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.75rem;
  font-weight: 600;
  color: #f1f5f9;
  margin-bottom: 0.2rem;
}

.eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  color: #64748b;
}

 
.eh-change-pill-5590d392f5ce8b7f84eedffa089138fd {
  display: inline-flex;
  align-items: center;
  gap: 0.35rem;
  font-size: 0.65rem;
  font-weight: 600;
  color: #4ade80;
  background: rgba(34, 197, 94, 0.15);
  border: 1px solid rgba(34, 197, 94, 0.4);
  border-radius: 20px;
  padding: 0.25rem 0.65rem;
  animation: eh-pill-glow-5590d392f5ce8b7f84eedffa089138fd 2s ease-in-out infinite;
  margin-left: 0.5rem;
}

@keyframes eh-pill-glow-5590d392f5ce8b7f84eedffa089138fd {
  0%, 100% { box-shadow: 0 0 6px rgba(34, 197, 94, 0.2); }
  50% { box-shadow: 0 0 16px rgba(34, 197, 94, 0.5); }
}

 
.eh-connector-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  justify-content: center;
  height: 36px;
  position: relative;
  margin: 0.25rem 0;
}

.eh-connector-arrow-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  flex-direction: column;
  align-items: center;
  position: relative;
}

.eh-connector-arrow-5590d392f5ce8b7f84eedffa089138fd svg {
  width: 24px;
  height: 24px;
}

.eh-connector-label-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  font-weight: 500;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

 
.eh-conn-blue-5590d392f5ce8b7f84eedffa089138fd svg { color: #60a5fa; }
.eh-conn-blue-5590d392f5ce8b7f84eedffa089138fd .eh-connector-label-5590d392f5ce8b7f84eedffa089138fd { color: #60a5fa; }
.eh-conn-purple-5590d392f5ce8b7f84eedffa089138fd svg { color: #c084fc; }
.eh-conn-purple-5590d392f5ce8b7f84eedffa089138fd .eh-connector-label-5590d392f5ce8b7f84eedffa089138fd { color: #c084fc; }
.eh-conn-pink-5590d392f5ce8b7f84eedffa089138fd svg { color: #f472b6; }
.eh-conn-pink-5590d392f5ce8b7f84eedffa089138fd .eh-connector-label-5590d392f5ce8b7f84eedffa089138fd { color: #f472b6; }

 
.eh-particle-5590d392f5ce8b7f84eedffa089138fd {
  position: absolute;
  width: 6px;
  height: 6px;
  border-radius: 50%;
  animation: eh-particle-flow-5590d392f5ce8b7f84eedffa089138fd 2s ease-in-out infinite;
}

.eh-conn-blue-5590d392f5ce8b7f84eedffa089138fd .eh-particle-5590d392f5ce8b7f84eedffa089138fd {
  background: #60a5fa;
  box-shadow: 0 0 8px #60a5fa;
}
.eh-conn-purple-5590d392f5ce8b7f84eedffa089138fd .eh-particle-5590d392f5ce8b7f84eedffa089138fd {
  background: #c084fc;
  box-shadow: 0 0 8px #c084fc;
}
.eh-conn-pink-5590d392f5ce8b7f84eedffa089138fd .eh-particle-5590d392f5ce8b7f84eedffa089138fd {
  background: #f472b6;
  box-shadow: 0 0 8px #f472b6;
}

@keyframes eh-particle-flow-5590d392f5ce8b7f84eedffa089138fd {
  0% { transform: translateY(-12px); opacity: 0; }
  20% { opacity: 1; }
  80% { opacity: 1; }
  100% { transform: translateY(12px); opacity: 0; }
}

.eh-particle-5590d392f5ce8b7f84eedffa089138fd.p1 { animation-delay: 0s; }
.eh-particle-5590d392f5ce8b7f84eedffa089138fd.p2 { animation-delay: 0.5s; }
.eh-particle-5590d392f5ce8b7f84eedffa089138fd.p3 { animation-delay: 1s; }

 
.eh-sidebar-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  flex-direction: column;
  gap: 1rem;
}

 
.eh-config-panel-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 12px;
  padding: 1.25rem;
}

.eh-config-title-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.9rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 1rem;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.eh-config-item-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  gap: 0.6rem;
  padding: 0.6rem 0.75rem;
  border-radius: 8px;
  cursor: pointer;
  transition: all 0.2s ease;
  margin-bottom: 0.35rem;
  border: 1px solid transparent;
}

.eh-config-item-5590d392f5ce8b7f84eedffa089138fd:hover {
  background: rgba(255, 255, 255, 0.05);
  border-color: #475569;
}

.eh-config-item-5590d392f5ce8b7f84eedffa089138fd.active {
  background: rgba(255, 255, 255, 0.08);
  border-color: #64748b;
}

.eh-config-dot-5590d392f5ce8b7f84eedffa089138fd {
  width: 10px;
  height: 10px;
  border-radius: 50%;
  flex-shrink: 0;
}

.eh-config-dot-blue-5590d392f5ce8b7f84eedffa089138fd { background: #3b82f6; }
.eh-config-dot-purple-5590d392f5ce8b7f84eedffa089138fd { background: #a855f7; }
.eh-config-dot-pink-5590d392f5ce8b7f84eedffa089138fd {
  background: linear-gradient(135deg, #ec4899 50%, #22c55e 50%);
}
.eh-config-dot-green-5590d392f5ce8b7f84eedffa089138fd { background: #22c55e; }

.eh-config-name-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.78rem;
  font-weight: 600;
  color: #e2e8f0;
}

.eh-config-target-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  color: #64748b;
  margin-left: auto;
}

 
.eh-info-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 12px;
  padding: 1.25rem;
}

.eh-info-placeholder-5590d392f5ce8b7f84eedffa089138fd {
  text-align: center;
  color: #64748b;
  padding: 2rem 1rem;
}

.eh-info-placeholder-5590d392f5ce8b7f84eedffa089138fd svg {
  width: 40px;
  height: 40px;
  margin-bottom: 0.75rem;
  opacity: 0.5;
}

.eh-info-placeholder-5590d392f5ce8b7f84eedffa089138fd p {
  font-size: 0.8rem;
  margin: 0;
}

.eh-info-content-5590d392f5ce8b7f84eedffa089138fd {
  display: none;
}

.eh-info-content-5590d392f5ce8b7f84eedffa089138fd.active {
  display: block;
  animation: eh-fade-in-5590d392f5ce8b7f84eedffa089138fd 0.3s ease;
}

@keyframes eh-fade-in-5590d392f5ce8b7f84eedffa089138fd {
  from { opacity: 0; transform: translateY(5px); }
  to { opacity: 1; transform: translateY(0); }
}

.eh-info-header-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  gap: 0.75rem;
  margin-bottom: 1rem;
}

.eh-info-icon-5590d392f5ce8b7f84eedffa089138fd {
  width: 36px;
  height: 36px;
  border-radius: 8px;
  display: flex;
  align-items: center;
  justify-content: center;
}

.eh-info-icon-5590d392f5ce8b7f84eedffa089138fd.engine { background: rgba(59, 130, 246, 0.2); color: #60a5fa; }
.eh-info-icon-5590d392f5ce8b7f84eedffa089138fd.worker { background: rgba(168, 85, 247, 0.2); color: #c084fc; }
.eh-info-icon-5590d392f5ce8b7f84eedffa089138fd.runner { background: rgba(236, 72, 153, 0.2); color: #f472b6; }
.eh-info-icon-5590d392f5ce8b7f84eedffa089138fd.model { background: rgba(34, 197, 94, 0.2); color: #4ade80; }
.eh-info-icon-5590d392f5ce8b7f84eedffa089138fd.config { background: rgba(251, 191, 36, 0.2); color: #fbbf24; }

.eh-info-title-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 1rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.2rem;
}

.eh-info-badge-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  padding: 0.2rem 0.4rem;
  border-radius: 4px;
  font-weight: 600;
  text-transform: uppercase;
}

.eh-info-badge-5590d392f5ce8b7f84eedffa089138fd.engine { background: rgba(59, 130, 246, 0.2); color: #60a5fa; }
.eh-info-badge-5590d392f5ce8b7f84eedffa089138fd.worker { background: rgba(168, 85, 247, 0.2); color: #c084fc; }
.eh-info-badge-5590d392f5ce8b7f84eedffa089138fd.runner { background: rgba(236, 72, 153, 0.2); color: #f472b6; }
.eh-info-badge-5590d392f5ce8b7f84eedffa089138fd.model { background: rgba(34, 197, 94, 0.2); color: #4ade80; }
.eh-info-badge-5590d392f5ce8b7f84eedffa089138fd.config { background: rgba(251, 191, 36, 0.2); color: #fbbf24; }

.eh-info-desc-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.8rem;
  color: #cbd5e1;
  line-height: 1.6;
  margin-bottom: 1rem;
}

.eh-info-props-5590d392f5ce8b7f84eedffa089138fd h4 {
  font-size: 0.7rem;
  color: #64748b;
  margin: 0 0 0.5rem 0;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

.eh-info-props-5590d392f5ce8b7f84eedffa089138fd ul {
  margin: 0;
  padding: 0;
  list-style: none;
}

.eh-info-props-5590d392f5ce8b7f84eedffa089138fd li {
  font-size: 0.75rem;
  color: #94a3b8;
  padding: 0.3rem 0;
  padding-left: 1rem;
  position: relative;
}

.eh-info-props-5590d392f5ce8b7f84eedffa089138fd li::before {
  content: '\2022';
  position: absolute;
  left: 0;
  color: #64748b;
}

 
.eh-insight-5590d392f5ce8b7f84eedffa089138fd {
  margin-top: 0.75rem;
  padding: 0.65rem 0.75rem;
  background: rgba(245, 158, 11, 0.1);
  border: 1px solid rgba(245, 158, 11, 0.3);
  border-radius: 8px;
}

.eh-insight-5590d392f5ce8b7f84eedffa089138fd p {
  font-size: 0.75rem;
  color: #fbbf24;
  margin: 0;
  line-height: 1.5;
}

.eh-insight-5590d392f5ce8b7f84eedffa089138fd strong {
  color: #fcd34d;
}

 
.eh-footer-5590d392f5ce8b7f84eedffa089138fd {
  margin-top: 1.5rem;
  padding: 1rem;
  background: rgba(245, 158, 11, 0.1);
  border: 1px solid rgba(245, 158, 11, 0.3);
  border-radius: 10px;
  text-align: center;
}

.eh-footer-5590d392f5ce8b7f84eedffa089138fd p {
  font-size: 0.8rem;
  color: #fbbf24;
  margin: 0;
  line-height: 1.5;
}

.eh-footer-5590d392f5ce8b7f84eedffa089138fd strong {
  color: #fcd34d;
}

 
.eh-layer-desc-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.72rem;
  color: #94a3b8;
  margin-bottom: 0.6rem;
  line-height: 1.4;
}

 
@media (max-width: 600px) {
  .eh-5590d392f5ce8b7f84eedffa089138fd {
    padding: 1.25rem;
  }

  .eh-components-5590d392f5ce8b7f84eedffa089138fd {
    gap: 0.35rem;
  }

  .eh-component-5590d392f5ce8b7f84eedffa089138fd {
    min-width: 70px;
    padding: 0.5rem 0.6rem;
  }

  .eh-comp-name-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.65rem;
  }

  .eh-layer-desc-5590d392f5ce8b7f84eedffa089138fd {
    display: none;
  }

  .eh-info-5590d392f5ce8b7f84eedffa089138fd {
    margin-top: 0.5rem;
  }
}
</style>

<div class="eh-5590d392f5ce8b7f84eedffa089138fd">
  <div class="eh-header-5590d392f5ce8b7f84eedffa089138fd">
    <div class="eh-title-5590d392f5ce8b7f84eedffa089138fd">Engine → Worker → Model Hierarchy</div>
    <div class="eh-subtitle-5590d392f5ce8b7f84eedffa089138fd">Four abstraction levels. VllmConfig feeds each one. Only the bottom layer changes for new models.</div>
  </div>

  <div class="eh-content-5590d392f5ce8b7f84eedffa089138fd">
    
    <div class="eh-stack-5590d392f5ce8b7f84eedffa089138fd">

      
      <div class="eh-layer-5590d392f5ce8b7f84eedffa089138fd eh-layer-engine-5590d392f5ce8b7f84eedffa089138fd" data-layer="engine">
        <div class="eh-layer-header-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-layer-title-5590d392f5ce8b7f84eedffa089138fd">
            <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <rect x="2" y="3" width="20" height="14" rx="2"/>
              <path d="M8 21h8M12 17v4"/>
            </svg>
            LLMEngine
          </div>
          <span class="eh-role-badge-5590d392f5ce8b7f84eedffa089138fd">Control Plane</span>
        </div>
        <div class="eh-layer-desc-5590d392f5ce8b7f84eedffa089138fd">Schedules requests, manages KV cache blocks, completely model-agnostic.</div>
        <div class="eh-components-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="scheduler">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">Scheduler</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">requests</div>
          </div>
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="block-mgr">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">BlockSpaceManager</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">memory blocks</div>
          </div>
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="kv-cache-mgr">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">KV Cache Mgr</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">allocation</div>
          </div>
        </div>
      </div>

      
      <div class="eh-connector-5590d392f5ce8b7f84eedffa089138fd eh-conn-blue-5590d392f5ce8b7f84eedffa089138fd">
        <div class="eh-connector-arrow-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-particle-5590d392f5ce8b7f84eedffa089138fd p1"></div>
          <div class="eh-particle-5590d392f5ce8b7f84eedffa089138fd p2"></div>
          <div class="eh-particle-5590d392f5ce8b7f84eedffa089138fd p3"></div>
          <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <path d="M12 5v14M5 12l7 7 7-7"/>
          </svg>
          <span class="eh-connector-label-5590d392f5ce8b7f84eedffa089138fd">dispatch</span>
        </div>
      </div>

      
      <div class="eh-layer-5590d392f5ce8b7f84eedffa089138fd eh-layer-worker-5590d392f5ce8b7f84eedffa089138fd" data-layer="worker">
        <div class="eh-layer-header-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-layer-title-5590d392f5ce8b7f84eedffa089138fd">
            <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <rect x="4" y="4" width="16" height="16" rx="2"/>
              <rect x="9" y="9" width="6" height="6"/>
              <path d="M9 1v3M15 1v3M9 20v3M15 20v3M1 9h3M1 15h3M20 9h3M20 15h3"/>
            </svg>
            Worker
          </div>
          <span class="eh-role-badge-5590d392f5ce8b7f84eedffa089138fd">Device Mgmt</span>
        </div>
        <div class="eh-layer-desc-5590d392f5ce8b7f84eedffa089138fd">One per GPU. Manages device, holds weight shards, coordinates distributed execution.</div>
        <div class="eh-components-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="gpu-device">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">GPU Device</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">CUDA ctx</div>
          </div>
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="weight-shard">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">Weight Shard</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">TP slice</div>
          </div>
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="dist-coord">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">Distributed Coord</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">NCCL</div>
          </div>
        </div>
      </div>

      
      <div class="eh-connector-5590d392f5ce8b7f84eedffa089138fd eh-conn-purple-5590d392f5ce8b7f84eedffa089138fd">
        <div class="eh-connector-arrow-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-particle-5590d392f5ce8b7f84eedffa089138fd p1"></div>
          <div class="eh-particle-5590d392f5ce8b7f84eedffa089138fd p2"></div>
          <div class="eh-particle-5590d392f5ce8b7f84eedffa089138fd p3"></div>
          <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <path d="M12 5v14M5 12l7 7 7-7"/>
          </svg>
          <span class="eh-connector-label-5590d392f5ce8b7f84eedffa089138fd">execute</span>
        </div>
      </div>

      
      <div class="eh-layer-5590d392f5ce8b7f84eedffa089138fd eh-layer-runner-5590d392f5ce8b7f84eedffa089138fd" data-layer="runner">
        <div class="eh-layer-header-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-layer-title-5590d392f5ce8b7f84eedffa089138fd">
            <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <path d="M4 19.5A2.5 2.5 0 0 1 6.5 17H20"/>
              <path d="M6.5 2H20v20H6.5A2.5 2.5 0 0 1 4 19.5v-15A2.5 2.5 0 0 1 6.5 2z"/>
              <line x1="8" y1="7" x2="16" y2="7"/>
              <line x1="8" y1="11" x2="14" y2="11"/>
            </svg>
            ModelRunner
          </div>
          <span class="eh-role-badge-5590d392f5ce8b7f84eedffa089138fd">Input Prep</span>
        </div>
        <div class="eh-layer-desc-5590d392f5ce8b7f84eedffa089138fd">Converts logical request data into physical tensors. Handles input flattening and attention metadata.</div>
        <div class="eh-components-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="input-flatten">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">Input Flattener</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">tokens</div>
          </div>
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="tensor-prep">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">Tensor Prep</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">batching</div>
          </div>
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="attn-meta">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">AttentionMetadata</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">positions</div>
          </div>
        </div>
      </div>

      
      <div class="eh-connector-5590d392f5ce8b7f84eedffa089138fd eh-conn-pink-5590d392f5ce8b7f84eedffa089138fd">
        <div class="eh-connector-arrow-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-particle-5590d392f5ce8b7f84eedffa089138fd p1"></div>
          <div class="eh-particle-5590d392f5ce8b7f84eedffa089138fd p2"></div>
          <div class="eh-particle-5590d392f5ce8b7f84eedffa089138fd p3"></div>
          <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <path d="M12 5v14M5 12l7 7 7-7"/>
          </svg>
          <span class="eh-connector-label-5590d392f5ce8b7f84eedffa089138fd">forward</span>
        </div>
      </div>

      
      <div class="eh-layer-5590d392f5ce8b7f84eedffa089138fd eh-layer-model-5590d392f5ce8b7f84eedffa089138fd" data-layer="model">
        <div class="eh-layer-header-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-layer-title-5590d392f5ce8b7f84eedffa089138fd">
            <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <path d="M12 2L2 7l10 5 10-5-10-5z"/>
              <path d="M2 17l10 5 10-5"/>
              <path d="M2 12l10 5 10-5"/>
            </svg>
            Model
            <span class="eh-change-pill-5590d392f5ce8b7f84eedffa089138fd">
              <svg width="12" height="12" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2.5">
                <path d="M12 5v14M5 12h14"/>
              </svg>
              Only this changes
            </span>
          </div>
          <span class="eh-role-badge-5590d392f5ce8b7f84eedffa089138fd">Neural Network</span>
        </div>
        <div class="eh-layer-desc-5590d392f5ce8b7f84eedffa089138fd">The actual neural network. Swap this layer to support a new architecture — everything above stays the same.</div>
        <div class="eh-components-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="llama">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">LlamaForCausalLM</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">native</div>
          </div>
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="transformers">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">TransformersModel</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">wrapped</div>
          </div>
          <div class="eh-component-5590d392f5ce8b7f84eedffa089138fd" data-component="forward-pass">
            <div class="eh-comp-name-5590d392f5ce8b7f84eedffa089138fd">Forward Pass</div>
            <div class="eh-comp-hint-5590d392f5ce8b7f84eedffa089138fd">inference</div>
          </div>
        </div>
      </div>
    </div>

    
    <div class="eh-sidebar-5590d392f5ce8b7f84eedffa089138fd">

      
      <div class="eh-config-panel-5590d392f5ce8b7f84eedffa089138fd">
        <div class="eh-config-title-5590d392f5ce8b7f84eedffa089138fd">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <circle cx="12" cy="12" r="3"/>
            <path d="M19.4 15a1.65 1.65 0 0 0 .33 1.82l.06.06a2 2 0 0 1-2.83 2.83l-.06-.06a1.65 1.65 0 0 0-1.82-.33 1.65 1.65 0 0 0-1 1.51V21a2 2 0 0 1-4 0v-.09A1.65 1.65 0 0 0 9 19.4a1.65 1.65 0 0 0-1.82.33l-.06.06a2 2 0 0 1-2.83-2.83l.06-.06A1.65 1.65 0 0 0 4.68 15a1.65 1.65 0 0 0-1.51-1H3a2 2 0 0 1 0-4h.09A1.65 1.65 0 0 0 4.6 9a1.65 1.65 0 0 0-.33-1.82l-.06-.06a2 2 0 0 1 2.83-2.83l.06.06A1.65 1.65 0 0 0 9 4.68a1.65 1.65 0 0 0 1-1.51V3a2 2 0 0 1 4 0v.09a1.65 1.65 0 0 0 1 1.51 1.65 1.65 0 0 0 1.82-.33l.06-.06a2 2 0 0 1 2.83 2.83l-.06.06A1.65 1.65 0 0 0 19.4 9a1.65 1.65 0 0 0 1.51 1H21a2 2 0 0 1 0 4h-.09a1.65 1.65 0 0 0-1.51 1z"/>
          </svg>
          VllmConfig
        </div>

        <div class="eh-config-item-5590d392f5ce8b7f84eedffa089138fd" data-config="scheduler-config">
          <div class="eh-config-dot-5590d392f5ce8b7f84eedffa089138fd eh-config-dot-blue-5590d392f5ce8b7f84eedffa089138fd"></div>
          <span class="eh-config-name-5590d392f5ce8b7f84eedffa089138fd">SchedulerConfig</span>
          <span class="eh-config-target-5590d392f5ce8b7f84eedffa089138fd">→ Engine</span>
        </div>

        <div class="eh-config-item-5590d392f5ce8b7f84eedffa089138fd" data-config="parallel-config">
          <div class="eh-config-dot-5590d392f5ce8b7f84eedffa089138fd eh-config-dot-purple-5590d392f5ce8b7f84eedffa089138fd"></div>
          <span class="eh-config-name-5590d392f5ce8b7f84eedffa089138fd">ParallelConfig</span>
          <span class="eh-config-target-5590d392f5ce8b7f84eedffa089138fd">→ Worker</span>
        </div>

        <div class="eh-config-item-5590d392f5ce8b7f84eedffa089138fd" data-config="model-config">
          <div class="eh-config-dot-5590d392f5ce8b7f84eedffa089138fd eh-config-dot-pink-5590d392f5ce8b7f84eedffa089138fd"></div>
          <span class="eh-config-name-5590d392f5ce8b7f84eedffa089138fd">ModelConfig</span>
          <span class="eh-config-target-5590d392f5ce8b7f84eedffa089138fd">→ Runner + Model</span>
        </div>

        <div class="eh-config-item-5590d392f5ce8b7f84eedffa089138fd" data-config="quant-config">
          <div class="eh-config-dot-5590d392f5ce8b7f84eedffa089138fd eh-config-dot-green-5590d392f5ce8b7f84eedffa089138fd"></div>
          <span class="eh-config-name-5590d392f5ce8b7f84eedffa089138fd">QuantizationConfig</span>
          <span class="eh-config-target-5590d392f5ce8b7f84eedffa089138fd">→ Model</span>
        </div>
      </div>

      
      <div class="eh-info-5590d392f5ce8b7f84eedffa089138fd">
        <div class="eh-info-placeholder-5590d392f5ce8b7f84eedffa089138fd" id="eh-placeholder-5590d392f5ce8b7f84eedffa089138fd">
          <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5">
            <circle cx="12" cy="12" r="10"/>
            <path d="M12 16v-4M12 8h.01"/>
          </svg>
          <p>Click any layer, component, or config to see details.</p>
        </div>
        <div class="eh-info-content-5590d392f5ce8b7f84eedffa089138fd" id="eh-info-content-5590d392f5ce8b7f84eedffa089138fd">
          <div class="eh-info-header-5590d392f5ce8b7f84eedffa089138fd">
            <div class="eh-info-icon-5590d392f5ce8b7f84eedffa089138fd" id="eh-info-icon-5590d392f5ce8b7f84eedffa089138fd">
              <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
                <rect x="3" y="3" width="18" height="18" rx="2"/>
              </svg>
            </div>
            <div>
              <div class="eh-info-title-5590d392f5ce8b7f84eedffa089138fd" id="eh-info-title-5590d392f5ce8b7f84eedffa089138fd">Component</div>
              <span class="eh-info-badge-5590d392f5ce8b7f84eedffa089138fd" id="eh-info-badge-5590d392f5ce8b7f84eedffa089138fd">Layer</span>
            </div>
          </div>
          <div class="eh-info-desc-5590d392f5ce8b7f84eedffa089138fd" id="eh-info-desc-5590d392f5ce8b7f84eedffa089138fd">
            Description goes here.
          </div>
          <div class="eh-info-props-5590d392f5ce8b7f84eedffa089138fd">
            <h4>Key Properties</h4>
            <ul id="eh-info-props-5590d392f5ce8b7f84eedffa089138fd">
              <li>Item 1</li>
            </ul>
          </div>
          <div class="eh-insight-5590d392f5ce8b7f84eedffa089138fd" id="eh-insight-5590d392f5ce8b7f84eedffa089138fd" style="display: none;">
            <p id="eh-insight-text-5590d392f5ce8b7f84eedffa089138fd"></p>
          </div>
        </div>
      </div>
    </div>
  </div>

  <div class="eh-footer-5590d392f5ce8b7f84eedffa089138fd">
    <p><strong>Adding a new model architecture only changes the bottom layer.</strong> Everything above — scheduling, GPU management, input flattening — stays exactly the same.</p>
  </div>
</div>

<script>
(function() {
  var id = '5590d392f5ce8b7f84eedffa089138fd';

   

  var layers = {
    'engine': {
      title: 'LLMEngine',
      layerClass: 'engine',
      badge: 'Control Plane',
      description: 'The top-level control plane. Runs the main scheduling loop, manages block allocation for KV cache, and decides which requests get processed each iteration. Completely agnostic to model architecture — it only needs KV cache element sizes.',
      props: ['Scheduling loop', 'Block allocation', 'KV cache management', 'Request batching'],
      insight: 'The Engine only needs <strong>num_layers</strong>, <strong>hidden_size</strong>, and <strong>num_attention_heads</strong> to calculate KV cache block sizes. It never touches the model\'s forward pass.'
    },
    'worker': {
      title: 'Worker',
      layerClass: 'worker',
      badge: 'Device Mgmt',
      description: 'One Worker per GPU. Owns the CUDA context, holds its shard of model weights (for tensor parallelism), and coordinates with other Workers via NCCL for distributed execution.',
      props: ['CUDA device ownership', 'Weight shard loading', 'NCCL process group', 'Distributed coordination'],
      insight: 'Workers are reusable across model architectures. The same Worker can run Llama, Mistral, or any new model — it just loads different weight tensors.'
    },
    'runner': {
      title: 'ModelRunner',
      layerClass: 'runner',
      badge: 'Input Prep',
      description: 'Sits inside each Worker. Converts logical request data (token IDs, sequence lengths) into the physical tensors the model needs. Handles input flattening, attention metadata construction, and sampling parameter preparation.',
      props: ['Input flattening', 'Tensor preparation', 'Attention metadata', 'Sampling setup'],
      insight: 'ModelRunner is where the <strong>continuous batching magic</strong> happens — it packs variable-length sequences into efficient flat tensors, regardless of the model underneath.'
    },
    'model': {
      title: 'Model',
      layerClass: 'model',
      badge: 'Neural Network',
      description: 'The neural network itself. Whether it\'s a native LlamaForCausalLM with hand-optimized kernels or a TransformersModel wrapping a HuggingFace checkpoint, this is the only layer that changes when you add a new model architecture.',
      props: ['Forward pass computation', 'Attention layers', 'MLP layers', 'Weight format handling'],
      insight: 'This is the <strong>only pluggable layer</strong>. vLLM\'s zero-day support strategy targets exclusively this level — everything above is reused as-is.'
    }
  };

  var components = {
    'scheduler': {
      title: 'Scheduler',
      layerClass: 'engine',
      badge: 'LLMEngine',
      description: 'Runs every iteration to decide which requests to process next. Implements continuous batching — new requests can join a batch mid-generation without waiting for others to finish.',
      props: ['Request ordering', 'Continuous batching', 'Preemption policy', 'Iteration scheduling']
    },
    'block-mgr': {
      title: 'BlockSpaceManager',
      layerClass: 'engine',
      badge: 'LLMEngine',
      description: 'Manages physical GPU memory blocks for KV cache. Tracks which blocks are allocated, implements copy-on-write for parallel sampling, and handles block swapping between GPU and CPU.',
      props: ['Block allocation', 'Copy-on-write', 'GPU↔CPU swapping', 'Memory accounting']
    },
    'kv-cache-mgr': {
      title: 'KV Cache Manager',
      layerClass: 'engine',
      badge: 'LLMEngine',
      description: 'Determines total KV cache capacity from model config parameters. Only needs num_layers, hidden_size, and num_attention_heads — never the actual model weights.',
      props: ['Cache sizing', 'Block count calculation', 'Element size derivation']
    },
    'gpu-device': {
      title: 'GPU Device',
      layerClass: 'worker',
      badge: 'Worker',
      description: 'The CUDA context and device handle. Each Worker binds to one GPU, sets CUDA_VISIBLE_DEVICES, and manages the memory pool for that device.',
      props: ['CUDA context', 'Memory pool', 'Device binding', 'Stream management']
    },
    'weight-shard': {
      title: 'Weight Shard',
      layerClass: 'worker',
      badge: 'Worker',
      description: 'When using tensor parallelism, each Worker holds 1/N of the model weights. Linear layers are sharded column-wise or row-wise depending on the layer type.',
      props: ['Column-parallel sharding', 'Row-parallel sharding', 'Weight loading', 'Shard coordination']
    },
    'dist-coord': {
      title: 'Distributed Coordinator',
      layerClass: 'worker',
      badge: 'Worker',
      description: 'Manages NCCL process groups for tensor-parallel all-reduce operations. Initialized via torch.distributed.init_process_group(backend="nccl").',
      props: ['NCCL process group', 'All-reduce ops', 'Barrier sync', 'Rank assignment']
    },
    'input-flatten': {
      title: 'Input Flattener',
      layerClass: 'runner',
      badge: 'ModelRunner',
      description: 'Takes variable-length sequences from the scheduler and packs them into contiguous flat tensors. This is what enables efficient continuous batching at the tensor level.',
      props: ['Sequence packing', 'Padding removal', 'Position ID generation', 'Length tracking']
    },
    'tensor-prep': {
      title: 'Tensor Prep',
      layerClass: 'runner',
      badge: 'ModelRunner',
      description: 'Constructs the input tensors (token IDs, positions, slot mappings) that the model\'s forward() method expects. Handles the translation from logical to physical representation.',
      props: ['Token ID tensors', 'Position encoding', 'Slot mapping', 'Batch metadata']
    },
    'attn-meta': {
      title: 'AttentionMetadata',
      layerClass: 'runner',
      badge: 'ModelRunner',
      description: 'Contains everything the attention kernel needs: sequence start positions, KV cache slot mappings, prefix lengths for chunked prefill, and block tables.',
      props: ['Sequence positions', 'Block tables', 'Prefix lengths', 'Chunked prefill data']
    },
    'llama': {
      title: 'LlamaForCausalLM',
      layerClass: 'model',
      badge: 'Model',
      description: 'Native vLLM implementation of Llama. Hand-optimized with custom CUDA kernels for attention (FlashAttention, PagedAttention), fused RMSNorm, and rotary embeddings.',
      props: ['PagedAttention', 'Fused RMSNorm', 'Rotary embeddings', 'Custom CUDA kernels']
    },
    'transformers': {
      title: 'TransformersModel',
      layerClass: 'model',
      badge: 'Model',
      description: 'Generic wrapper around any HuggingFace Transformers model. Provides zero-day support for new architectures by using the upstream implementation directly, trading some performance for instant compatibility.',
      props: ['HuggingFace wrapping', 'Zero-day support', 'Automatic compatibility', 'Attention override']
    },
    'forward-pass': {
      title: 'Forward Pass',
      layerClass: 'model',
      badge: 'Model',
      description: 'The actual neural network computation. Takes prepared tensors and attention metadata, runs through transformer layers, and produces logits for the next token.',
      props: ['Transformer layers', 'Logit computation', 'Hidden state propagation', 'Output sampling']
    }
  };

  var configs = {
    'scheduler-config': {
      title: 'SchedulerConfig',
      layerClass: 'config',
      badge: 'VllmConfig',
      description: 'Controls how the Engine schedules requests. Specifies max_num_seqs (maximum concurrent sequences), max_model_len, and the memory allocation strategy for the BlockSpaceManager.',
      props: ['max_num_seqs', 'max_model_len', 'scheduling_policy', 'chunked_prefill settings'],
      targets: ['engine']
    },
    'parallel-config': {
      title: 'ParallelConfig',
      layerClass: 'config',
      badge: 'VllmConfig',
      description: 'Determines tensor parallelism (TP) and pipeline parallelism (PP) degrees. Workers use this to know how to shard weights and set up NCCL process groups.',
      props: ['tensor_parallel_size', 'pipeline_parallel_size', 'worker_cls', 'distributed_backend'],
      targets: ['worker']
    },
    'model-config': {
      title: 'ModelConfig',
      layerClass: 'config',
      badge: 'VllmConfig',
      description: 'Carries architecture strings, hidden sizes, vocabulary size, and the architectures list used for registry lookup. Feeds both ModelRunner (for tensor sizing) and Model (for architecture selection).',
      props: ['architectures list', 'hidden_size', 'num_attention_heads', 'vocab_size'],
      targets: ['runner', 'model']
    },
    'quant-config': {
      title: 'QuantizationConfig',
      layerClass: 'config',
      badge: 'VllmConfig',
      description: 'Specifies the quantization method (AWQ, GPTQ, FP8, etc.). Linear layers in the Model use this to select the appropriate dequantization kernel during weight loading and inference.',
      props: ['quant_method', 'weight_bits', 'group_size', 'kernel selection'],
      targets: ['model']
    }
  };

   

  function clearSelection() {
    var container = document.querySelector('.eh-' + id);
    if (!container) return;
    var els = container.querySelectorAll('.selected, .highlighted, .active');
    for (var i = 0; i < els.length; i++) {
      els[i].classList.remove('selected', 'highlighted', 'active');
    }
  }

  function showInfo(data, showInsight) {
    document.getElementById('eh-placeholder-' + id).style.display = 'none';
    var content = document.getElementById('eh-info-content-' + id);
    content.classList.remove('active');
    
    // Reading offsetWidth forces a reflow, so re-adding 'active' restarts the CSS transition.
    void content.offsetWidth;
    content.classList.add('active');

    document.getElementById('eh-info-title-' + id).textContent = data.title;

    var badge = document.getElementById('eh-info-badge-' + id);
    badge.textContent = data.badge;
    badge.className = 'eh-info-badge-' + id + ' ' + data.layerClass;

    var icon = document.getElementById('eh-info-icon-' + id);
    icon.className = 'eh-info-icon-' + id + ' ' + data.layerClass;

    document.getElementById('eh-info-desc-' + id).textContent = data.description;

    var propsList = document.getElementById('eh-info-props-' + id);
    var html = '';
    for (var j = 0; j < data.props.length; j++) {
      html += '<li>' + data.props[j] + '</li>';
    }
    propsList.innerHTML = html;

    var insightEl = document.getElementById('eh-insight-' + id);
    if (showInsight && data.insight) {
      insightEl.style.display = 'block';
      document.getElementById('eh-insight-text-' + id).innerHTML = data.insight;
    } else {
      insightEl.style.display = 'none';
    }
  }

   

  var container = document.querySelector('.eh-' + id);
  if (container) {
    container.addEventListener('click', function(e) {
      var target = e.target;

      while (target && target !== container) {
        
        if (target.hasAttribute && target.hasAttribute('data-config')) {
          var configId = target.getAttribute('data-config');
          var conf = configs[configId];
          if (!conf) break;

          clearSelection();
          target.classList.add('active');

          
          if (conf.targets) {
            for (var t = 0; t < conf.targets.length; t++) {
              var layerEl = container.querySelector('[data-layer="' + conf.targets[t] + '"]');
              if (layerEl) layerEl.classList.add('highlighted');
            }
          }

          showInfo(conf, false);
          return;
        }

        
        if (target.hasAttribute && target.hasAttribute('data-component')) {
          var compId = target.getAttribute('data-component');
          var comp = components[compId];
          if (!comp) break;

          clearSelection();
          target.classList.add('selected');

          showInfo(comp, false);
          return;
        }

        
        if (target.hasAttribute && target.hasAttribute('data-layer')) {
          var layerId = target.getAttribute('data-layer');
          var layer = layers[layerId];
          if (!layer) break;

          clearSelection();
          target.classList.add('selected');

          showInfo(layer, true);
          return;
        }

        target = target.parentElement;
      }
    });
  }
})();
</script>

<p>The VllmConfig object is how information flows across these levels. It aggregates several sub-configs:</p>
<table>
  <thead>
      <tr>
          <th>Config Component</th>
          <th>What It Provides</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ModelConfig</strong></td>
          <td>Architecture strings, hidden sizes, vocabulary size, the <code>architectures</code> list used for registry lookup</td>
      </tr>
      <tr>
          <td><strong>ParallelConfig</strong></td>
          <td>Tensor parallelism (TP) and pipeline parallelism (PP) degrees. Determines how linear layers shard their weights</td>
      </tr>
      <tr>
          <td><strong>SchedulerConfig</strong></td>
          <td>Maximum number of sequences and memory allocation strategy. Influences BlockSpaceManager setup</td>
      </tr>
      <tr>
          <td><strong>QuantizationConfig</strong></td>
          <td>Quantization method (AWQ, GPTQ, FP8). Linear layers use this to select the appropriate kernel during weight loading</td>
      </tr>
  </tbody>
</table>
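<p>To make the aggregation concrete, here is a minimal sketch using plain dataclasses. The <code>Toy*</code> class names and the default values are assumptions made for illustration; the real vLLM config classes carry many more fields and validation logic.</p>

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyModelConfig:
    architectures: list[str]      # used for registry lookup
    hidden_size: int
    vocab_size: int


@dataclass
class ToyParallelConfig:
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1


@dataclass
class ToySchedulerConfig:
    max_num_seqs: int = 256       # illustrative default, not vLLM's


@dataclass
class ToyQuantizationConfig:
    quant_method: str = "awq"


@dataclass
class ToyVllmConfig:
    """One object that every level reads its own slice from."""
    model_config: ToyModelConfig
    parallel_config: ToyParallelConfig = field(default_factory=ToyParallelConfig)
    scheduler_config: ToySchedulerConfig = field(default_factory=ToySchedulerConfig)
    quant_config: Optional[ToyQuantizationConfig] = None


cfg = ToyVllmConfig(
    model_config=ToyModelConfig(["LlamaForCausalLM"], hidden_size=4096, vocab_size=32000),
    parallel_config=ToyParallelConfig(tensor_parallel_size=2),
)

# A Worker would only look at the parallel slice; the registry only at architectures.
tp_degree = cfg.parallel_config.tensor_parallel_size
lookup_key = cfg.model_config.architectures[0]
print(tp_degree, lookup_key)
```

<p>The point of the design is that each level takes the whole object but reads only the sub-config it cares about, so adding a field for one level does not change the signatures of the others.</p>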
<h3 id="registry-mechanics-and-the-architecture-lookup">Registry Mechanics and the Architecture Lookup</h3>
<p>When you run <code>vllm serve &lt;model&gt;</code>, vLLM first resolves the model&rsquo;s <code>config.json</code>: it is pulled from the Hugging Face Hub if you pass a model ID like <code>meta-llama/Llama-2-7b-hf</code>, or read from disk if you pass a local path, as you would in an air-gapped deployment. The <code>architectures</code> field (for example, <code>[&quot;LlamaForCausalLM&quot;]</code>) is the primary lookup key for the entire loading sequence.</p>
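<p>As a rough illustration of that first step, the sketch below reads the <code>architectures</code> field from a local <code>config.json</code>. The helper name <code>read_architectures</code> and the temporary directory are assumptions made for the example; vLLM&rsquo;s own config loading goes through Hugging Face config classes and handles many more cases.</p>

```python
import json
import tempfile
from pathlib import Path


def read_architectures(model_dir: str) -> list[str]:
    """Return the `architectures` list from <model_dir>/config.json (hypothetical helper)."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # The field is a list of class names; a missing field yields an empty list.
    return config.get("architectures", [])


# Demo: a throwaway directory stands in for a local checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "config.json").write_text(
        json.dumps({"architectures": ["LlamaForCausalLM"], "hidden_size": 4096}),
        encoding="utf-8",
    )
    archs = read_architectures(tmp)

print(archs)
```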
<p>This key gets checked against the <code>_VLLM_MODELS</code> dictionary, the core of vLLM&rsquo;s ModelRegistry. It maps architecture strings to <code>(module_name, class_name)</code> tuples:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>_VLLM_MODELS <span style="color:#f92672">=</span> {
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;LlamaForCausalLM&#34;</span>:       (<span style="color:#e6db74">&#34;llama&#34;</span>, <span style="color:#e6db74">&#34;LlamaForCausalLM&#34;</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;MistralForCausalLM&#34;</span>:     (<span style="color:#e6db74">&#34;mistral&#34;</span>, <span style="color:#e6db74">&#34;MistralForCausalLM&#34;</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;DeepseekV2ForCausalLM&#34;</span>:  (<span style="color:#e6db74">&#34;deepseek_v2&#34;</span>, <span style="color:#e6db74">&#34;DeepseekV2ForCausalLM&#34;</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Qwen2ForCausalLM&#34;</span>:       (<span style="color:#e6db74">&#34;qwen2&#34;</span>, <span style="color:#e6db74">&#34;Qwen2ForCausalLM&#34;</span>),
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># ... hundreds of other architectures</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The <code>module_name</code> is a relative path within <code>vllm.model_executor.models</code> — so <code>&quot;llama&quot;</code> resolves to <code>vllm/model_executor/models/llama.py</code>. The <code>class_name</code> is the specific <code>nn.Module</code> subclass to instantiate.</p>
<p>One important detail: vLLM does NOT import all model classes at startup. Instead, it uses <code>_LazyRegisteredModel</code> wrappers. When the ModelConfig requests a specific architecture, the registry:</p>
<ol>
<li>Checks if the architecture string exists in <code>_VLLM_MODELS</code></li>
<li>Retrieves the module path and class name</li>
<li>Dynamically imports the module using <code>importlib</code></li>
<li>Returns the class constructor to the ModelLoader</li>
</ol>
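<p>The four steps above can be sketched with nothing but <code>importlib</code>. This is a toy, not vLLM&rsquo;s <code>_LazyRegisteredModel</code>: the registry name <code>_TOY_MODELS</code>, the function <code>resolve_model_cls</code>, and the use of standard-library modules as stand-ins for model files are all assumptions made so the example runs anywhere.</p>

```python
import importlib

# Toy registry: architecture string -> (module_name, class_name).
# Standard-library modules stand in for files under vllm/model_executor/models/.
_TOY_MODELS = {
    "FractionModel": ("fractions", "Fraction"),
    "DecimalModel": ("decimal", "Decimal"),
}


def resolve_model_cls(architecture: str):
    """Look up an architecture and import its module only when asked for."""
    # Step 1: check that the architecture string is registered.
    if architecture not in _TOY_MODELS:
        raise ValueError(f"Unsupported architecture: {architecture}")
    # Step 2: retrieve the module path and class name.
    module_name, class_name = _TOY_MODELS[architecture]
    # Step 3: import on demand; other registered modules are never touched.
    module = importlib.import_module(module_name)
    # Step 4: hand the class itself back to the caller.
    return getattr(module, class_name)


cls = resolve_model_cls("FractionModel")
print(cls.__name__)
```

<p>Because the import happens inside the lookup, a broken or missing dependency in one registered module only surfaces when that specific architecture is requested.</p>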
<p>This lazy loading matters for dependency isolation. A user running Llama shouldn&rsquo;t need the specialized kernels required for an audio-processing model. If every model module were imported eagerly at startup, a single missing dependency for the audio model would crash vLLM for everyone.</p>
<p>Three things can happen when the registry receives an architecture string:</p>
<ol>
<li><strong>Found in registry</strong> → native path (optimized, model-specific code)</li>
<li><strong>Registered by plugin</strong> → external native path (optimized, third-party code)</li>
<li><strong>Not found</strong> → Transformers Modeling Backend fallback (compatibility shim)</li>
</ol>
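<p>The three outcomes can be written as a small decision function. This is a simplified sketch: the function name <code>choose_backend</code> and the two separate sets are assumptions for readability, and it does not model the case where the fallback itself is incompatible.</p>

```python
def choose_backend(architecture: str, builtin: set[str], plugins: set[str]) -> str:
    """Return which loading path an architecture string would take (toy model)."""
    if architecture in builtin:
        return "native"                  # optimized, model-specific code
    if architecture in plugins:
        return "plugin"                  # optimized, third-party code
    return "transformers-fallback"       # compatibility shim


builtin = {"LlamaForCausalLM", "MistralForCausalLM"}
plugins = {"MyNewModel"}

paths = {
    name: choose_backend(name, builtin, plugins)
    for name in ("LlamaForCausalLM", "MyNewModel", "BrandNewForCausalLM")
}
print(paths)
```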
<h3 id="the-plugin-system">The Plugin System</h3>
<p>The plugin system is a significant evolution in vLLM&rsquo;s architecture: external packages can register models without modifying vLLM core.</p>
<p>The mechanism uses Python&rsquo;s entry-point system: packages declare their plugins under the <code>vllm.general_plugins</code> group. During initialization, vLLM discovers and executes all registered plugins. A plugin can invoke <code>ModelRegistry.register_model()</code> to inject a new architecture mapping at runtime:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># In your package&#39;s plugin entry point</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">register</span>():
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">from</span> vllm <span style="color:#f92672">import</span> ModelRegistry
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#e6db74">&#34;MyNewModel&#34;</span> <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> ModelRegistry<span style="color:#f92672">.</span>get_supported_archs():
</span></span><span style="display:flex;"><span>        ModelRegistry<span style="color:#f92672">.</span>register_model(
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;MyNewModel&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;my_package.models:MyNewModel&#34;</span>
</span></span><span style="display:flex;"><span>        )
</span></span></code></pre></div><p>This decouples the vLLM release cycle from model release cycles. Model creators — Mistral, DeepSeek, Google — can ship a &ldquo;vLLM adaptation package&rdquo; alongside their weights. Users <code>pip install</code> that package, and vLLM recognizes the new architecture immediately. No PRs to vLLM core, no waiting for a new release.</p>
<!-- VISUALIZATION: vllm-model-decision-tree
     Flowchart showing the 3-way decision process:

     Start: "vllm serve <model>"
     → Pull config.json from HF Hub
     → Extract architectures field (e.g., "LlamaForCausalLM")
     → Check _VLLM_MODELS dictionary
       ├── Found → Native path (import module, instantiate class)
       ├── Not found → Check registered plugins
       │     ├── Plugin registered → External native path
       │     └── No plugin → Check Transformers compatibility
       │           ├── Compatible → TransformersModel fallback
       │           └── Incompatible → ValueError (unsupported architecture)
-->
<hr>
<h2 id="part-2-the-transformers-fallback-zero-day-support">Part 2: The Transformers Fallback (Zero-Day Support)</h2>
<h3 id="the-transformers-backend">The Transformers Backend</h3>
<p>When the registry lookup fails, or when you explicitly set <code>model_impl=&quot;transformers&quot;</code>, vLLM falls back to its Transformers backend. This is a family of mixin-composed classes (<code>TransformersForCausalLM</code>, <code>TransformersMoEForCausalLM</code>, <code>TransformersMultiModalForCausalLM</code>, etc.) defined in <code>vllm/model_executor/models/transformers/</code>. They sit between vLLM&rsquo;s scheduler and standard Hugging Face model code, and they&rsquo;re the reason the <code>vllm serve</code> command works on day zero for new models.</p>
<p>There are two initialization steps worth understanding:</p>
<p><strong>1. Config-Based Instantiation.</strong> The wrapper uses <code>transformers.AutoModel.from_config(...)</code> to build the model architecture on a <strong>meta device</strong> — meaning no GPU memory is allocated yet, just the module structure with placeholder parameters. Weights are loaded separately later through vLLM&rsquo;s <code>load_weights()</code> pipeline. This two-phase approach (structure first, weights later) is critical for distributed loading: each GPU can load only its weight shard, rather than loading everything and then discarding what it doesn&rsquo;t need.</p>
<p><strong>2. Attention Backend Injection.</strong> Before instantiation, vLLM modifies the model&rsquo;s text configuration:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># vLLM sets this before calling from_config()</span>
</span></span><span style="display:flex;"><span>text_config<span style="color:#f92672">.</span>_attn_implementation <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;vllm&#34;</span>
</span></span></code></pre></div><p>This is the critical mechanism. Modern Hugging Face models are written to be attention-backend-agnostic. They check <code>_attn_implementation</code> and query a registry of attention functions. vLLM populates <code>ALL_ATTENTION_FUNCTIONS</code> with its own PagedAttention-backed implementation:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># vLLM registers its attention backend into HF&#39;s registry</span>
</span></span><span style="display:flex;"><span>ALL_ATTENTION_FUNCTIONS[<span style="color:#e6db74">&#34;vllm&#34;</span>] <span style="color:#f92672">=</span> vllm_attention_forward
</span></span></code></pre></div><p>When the HF model reaches its attention layer and calls the registered function, it gets vLLM&rsquo;s implementation instead of the default eager/SDPA/FlashAttention backend. The model doesn&rsquo;t know the difference.</p>
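<p>The dispatch pattern is easy to reproduce in miniature. In the sketch below, <code>ATTENTION_FUNCTIONS</code>, <code>ToyAttentionLayer</code>, and both attention functions are stand-ins invented for the example; they return labels instead of tensors so the control flow is visible without a GPU or PyTorch.</p>

```python
from types import SimpleNamespace

# Stand-in for a registry of attention functions keyed by implementation name.
ATTENTION_FUNCTIONS = {}


def eager_attention(q, k, v, **kwargs):
    return "eager"


def paged_attention(q, k, v, **kwargs):
    # A real backend would read block tables from kwargs and launch a kernel.
    return "paged"


ATTENTION_FUNCTIONS["eager"] = eager_attention
ATTENTION_FUNCTIONS["vllm"] = paged_attention   # the injection step


class ToyAttentionLayer:
    def __init__(self, config):
        self.config = config

    def forward(self, q, k, v, **kwargs):
        # The layer never names a backend; it only consults the config key.
        attn_fn = ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attn_fn(q, k, v, **kwargs)


config = SimpleNamespace(_attn_implementation="eager")
layer = ToyAttentionLayer(config)
before = layer.forward(None, None, None)

config._attn_implementation = "vllm"             # what the wrapper flips
after = layer.forward(None, None, None, attn_metadata={"block_tables": []})
print(before, after)
```

<p>Nothing in <code>ToyAttentionLayer</code> changes between the two calls, which is the property that lets unmodified model code run on a different attention backend.</p>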
<p>Let&rsquo;s trace the data flow step by step:</p>
<ol>
<li><strong>Engine</strong> allocates KV cache blocks; the <strong>ModelRunner</strong> turns them into <code>block_tables</code> and <code>slot_mapping</code> and packs them into an <code>AttentionMetadata</code> object</li>
<li><strong>TransformersForCausalLM.forward()</strong> receives flattened inputs + <code>attn_metadata</code></li>
<li>The wrapper passes vLLM metadata as <code>**kwargs</code> into the HF model&rsquo;s forward method</li>
<li>The HF model propagates <code>**kwargs</code> down through its layers (this is a convention in Transformers — unused kwargs flow through)</li>
<li>At each attention layer, the injected vLLM backend receives Q, K, V tensors + the <code>attn_metadata</code></li>
<li>The <strong>PagedAttention CUDA kernel</strong> executes — storing K/V into paged blocks, computing attention scores using block tables</li>
</ol>
<p>The result: even &ldquo;unoptimized&rdquo; models benefit from PagedAttention&rsquo;s memory virtualization. No more OOM from naive KV cache pre-allocation. The KV cache is managed efficiently through fixed-size blocks, regardless of whether the model itself was designed for it.</p>
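<p>To see what &ldquo;fixed-size blocks&rdquo; means in practice, here is a small worked example of mapping a token&rsquo;s logical position to a physical cache slot through a block table. The function name, the block size of 16, and the specific block numbers are assumptions chosen for the illustration.</p>

```python
def slot_for_position(block_table: list[int], position: int, block_size: int) -> int:
    """Map a token's logical position to a physical KV cache slot."""
    # Which logical block the token falls in, and where that block lives physically.
    physical_block = block_table[position // block_size]
    # Offset inside the block is unchanged by the indirection.
    return physical_block * block_size + position % block_size


# Logical blocks 0, 1, 2 of one sequence live in physical blocks 7, 2, 9.
block_table = [7, 2, 9]
block_size = 16

slots = [slot_for_position(block_table, p, block_size) for p in (0, 15, 16, 40)]
print(slots)  # [112, 127, 32, 152]
```

<p>Positions 15 and 16 are neighbours in the sequence but land in unrelated regions of memory. The sequence only ever needs enough blocks for the tokens it has produced so far, which is why pre-allocating for the maximum length becomes unnecessary.</p>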
<h3 id="the-trade-offs-what-you-lose">The Trade-offs (What You Lose)</h3>
<p>The Transformers backend enables immediate serving, but it sits in what we might call an &ldquo;unoptimized valley.&rdquo; Let&rsquo;s be specific about the costs:</p>
<p><strong>CUDA Graph Capture.</strong> In vLLM V1, the Transformers backend supports <code>torch.compile</code> with piecewise CUDA graph capture (via the <code>@support_torch_compile</code> decorator), closing what was historically the largest performance gap. However, models with dynamic RoPE scaling still fall back to eager mode. Native models can also use more aggressive graph capture strategies that cover a larger fraction of the computation graph, since their code is explicitly written with static control flow in mind.</p>
<p><strong>Kernel Fusion.</strong> Native vLLM models use fused kernels: LayerNorm + activation in one kernel, RoPE computation fused with the QKV projection, SiLU and gate multiplication combined. The fallback uses separate PyTorch operations for each step. Every separate operation means an extra round-trip to HBM: write the intermediate result, then read it back for the next op. On a memory-bandwidth-bound workload (which LLM decode typically is, especially at small batch sizes), these extra reads and writes add up fast.</p>
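<p>A back-of-the-envelope count shows where the cost comes from, using the SiLU-and-gate case mentioned above. This is an idealized model that assumes every tensor pass goes to HBM and ignores caches; the element count and the 2-byte dtype are example values, not measurements.</p>

```python
def traffic_bytes(elements: int, dtype_bytes: int, tensor_passes: int) -> int:
    """Bytes moved to or from HBM when `tensor_passes` full tensors are read or written."""
    return elements * dtype_bytes * tensor_passes


elements = 11008      # example width of the gate/up activations for one token
dtype_bytes = 2       # fp16 or bf16

# Unfused: silu(gate) reads gate and writes tmp (2 passes),
# then tmp * up reads tmp and up and writes out (3 passes).
unfused = traffic_bytes(elements, dtype_bytes, 2 + 3)

# Fused: one kernel reads gate and up and writes out (3 passes).
fused = traffic_bytes(elements, dtype_bytes, 3)

print(unfused, fused, round(unfused / fused, 2))
```

<p>Under these assumptions the unfused version moves about 1.67 times as many bytes for the same arithmetic. When the operation is limited by memory bandwidth rather than compute, that ratio translates fairly directly into time.</p>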
<p><strong>Parallelism Limitations.</strong> Basic Tensor Parallelism can sometimes be inferred automatically via the model&rsquo;s <code>base_model_tp_plan</code>, but this doesn&rsquo;t cover every case. Mixture-of-Experts routing, novel attention patterns, or architectures with unusual layer structures may not shard correctly — restricting you to single-GPU execution.</p>
<table>
  <thead>
      <tr>
          <th>Capability</th>
          <th>Transformers Fallback</th>
          <th>Native Integration</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Day-zero support</td>
          <td>Yes</td>
          <td>No (requires implementation)</td>
      </tr>
      <tr>
          <td>PagedAttention</td>
          <td>Yes (via injection)</td>
          <td>Yes (native)</td>
      </tr>
      <tr>
          <td>CUDA Graph capture</td>
          <td>Yes (via torch.compile in V1)</td>
          <td>Yes (full static graph)</td>
      </tr>
      <tr>
          <td>Kernel fusion</td>
          <td>No (separate PyTorch ops)</td>
          <td>Yes (fused CUDA kernels)</td>
      </tr>
      <tr>
          <td>Tensor Parallelism</td>
          <td>Limited (auto-inferred)</td>
          <td>Full (explicit sharding)</td>
      </tr>
      <tr>
          <td>Pipeline Parallelism</td>
          <td>No</td>
          <td>Yes (with <code>intermediate_tensors</code>)</td>
      </tr>
      <tr>
          <td>Quantization (AWQ/GPTQ/FP8)</td>
          <td>Limited</td>
          <td>Full support</td>
      </tr>
  </tbody>
</table>
<blockquote>
<p><strong>Note:</strong> The fallback is not meant to be the final state — it&rsquo;s the starting point. It gives you a working, servable model while the community works on native integration. Think of it as a bridge: useful immediately, but you cross it to get somewhere better.</p></blockquote>
<hr>
<h2 id="part-3-native-integration">Part 3: Native Integration</h2>
<h3 id="the-model-interface-and-prefix-protocol">The Model Interface and Prefix Protocol</h3>
<p>To go from &ldquo;supported&rdquo; to &ldquo;optimized,&rdquo; a model must be implemented natively. This means creating a Python class that mirrors the original model structure but substitutes standard layers with vLLM&rsquo;s distributed primitives.</p>
<p>Every module in a native vLLM model accepts a <code>prefix=&quot;&quot;</code> argument during initialization. This string represents the module&rsquo;s fully qualified name in the state dictionary — for example, <code>model.layers.0.self_attn.q_proj</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">LlamaAttention</span>(nn<span style="color:#f92672">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">__init__</span>(self, config, prefix<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;&#34;</span>):
</span></span><span style="display:flex;"><span>        super()<span style="color:#f92672">.</span><span style="color:#a6e22e">__init__</span>()
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>qkv_proj <span style="color:#f92672">=</span> QKVParallelLinear(
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">...</span>,
</span></span><span style="display:flex;"><span>            prefix<span style="color:#f92672">=</span><span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#e6db74">{</span>prefix<span style="color:#e6db74">}</span><span style="color:#e6db74">.qkv_proj&#34;</span>
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>o_proj <span style="color:#f92672">=</span> RowParallelLinear(
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">...</span>,
</span></span><span style="display:flex;"><span>            prefix<span style="color:#f92672">=</span><span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#e6db74">{</span>prefix<span style="color:#e6db74">}</span><span style="color:#e6db74">.o_proj&#34;</span>
</span></span><span style="display:flex;"><span>        )
</span></span></code></pre></div><p>The prefix serves two purposes:</p>
<ol>
<li><strong>Weight loading</strong>: maps checkpoint tensors to the correct layer instance. When the <code>load_weights</code> method receives a tensor named <code>model.layers.0.self_attn.q_proj.weight</code>, the prefix tells it exactly which module to route it to.</li>
<li><strong>Non-uniform quantization</strong>: the QuantizationConfig can specify different quantization schemes per layer. Some layers might be FP16 while others are INT8. The prefix is how the config identifies which kernel to instantiate for each specific layer.</li>
</ol>
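<p>Both uses of the prefix can be sketched with plain Python objects. <code>ToyLinear</code>, <code>ToyAttention</code>, and the <code>quant_overrides</code> table below are hypothetical stand-ins rather than vLLM classes, and the sketch leaves out the extra name mapping that real models need when separate q/k/v checkpoint tensors are packed into a single <code>qkv_proj</code>.</p>

```python
# Hypothetical stand-ins for illustration only; these are not vLLM classes.

class ToyLinear:
    def __init__(self, prefix, quant_overrides):
        self.prefix = prefix
        # Per-layer quantization: look the scheme up by fully qualified name.
        self.scheme = quant_overrides.get(prefix, "fp16")
        self.weight = None


class ToyAttention:
    def __init__(self, prefix, quant_overrides):
        # Each child extends the parent's prefix with its own attribute name.
        self.q_proj = ToyLinear(f"{prefix}.q_proj", quant_overrides)
        self.o_proj = ToyLinear(f"{prefix}.o_proj", quant_overrides)

    def modules(self):
        return [self.q_proj, self.o_proj]


def load_weights(modules, checkpoint):
    """Route each checkpoint tensor to the module whose prefix matches its name."""
    by_prefix = {m.prefix: m for m in modules}
    for name, tensor in checkpoint.items():
        # "a.b.q_proj.weight" splits into module "a.b.q_proj" and param "weight".
        module_name, _, param_name = name.rpartition(".")
        if param_name == "weight" and module_name in by_prefix:
            by_prefix[module_name].weight = tensor


# Only o_proj is overridden; every other layer keeps the default scheme.
quant_overrides = {"model.layers.0.self_attn.o_proj": "int8"}
attn = ToyAttention("model.layers.0.self_attn", quant_overrides)

checkpoint = {
    "model.layers.0.self_attn.q_proj.weight": [[1.0, 0.0], [0.0, 1.0]],
    "model.layers.0.self_attn.o_proj.weight": [[0.5, 0.5], [0.5, 0.5]],
}
load_weights(attn.modules(), checkpoint)

for m in attn.modules():
    print(m.prefix, m.scheme, m.weight is not None)
```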
<h3 id="parallel-layer-primitives">Parallel Layer Primitives</h3>
<p>For models that won&rsquo;t fit on a single GPU (70B+), vLLM provides distributed primitives that replace standard <code>nn.Linear</code> and <code>nn.Embedding</code> layers:</p>
<p><strong>ColumnParallelLinear</strong> splits the weight matrix along the output dimension. Each GPU computes a fraction of the output features. This is used for QKV projections (each GPU computes a subset of attention heads) and MLP up-projections (each GPU computes a portion of the intermediate dimension). No inter-GPU communication is needed for this operation.</p>
<p><strong>RowParallelLinear</strong> splits along the input dimension. Each GPU computes a partial result, then an AllReduce sums the partial results across all GPUs. This is used for the attention output projection and MLP down-projection — the operations where partial results need to be recombined.</p>
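<p>The pairing of a column split with a row split can be checked by hand. The sketch below simulates two GPUs in pure Python with made-up weights: each rank runs its column shard of the up-projection and then its row shard of the down-projection, and a single sum at the end plays the role of the AllReduce. The helper names are illustrative and are not part of vLLM.</p>

```python
# Column-parallel followed by row-parallel matmul on two simulated "GPUs".
# Shapes and values are made up for illustration.

def matmul(x, w):
    """x is [m][k], w is [k][n]; the result is [m][n]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_cols(w, parts):
    """Split a matrix into equal groups of columns (output features)."""
    n = len(w[0]) // parts
    return [[row[i * n:(i + 1) * n] for row in w] for i in range(parts)]

def split_rows(w, parts):
    """Split a matrix into equal groups of rows (input features)."""
    n = len(w) // parts
    return [w[i * n:(i + 1) * n] for i in range(parts)]

x = [[1.0, 2.0]]                       # one token, hidden size 2
w_up = [[1.0, 0.0, 2.0, 1.0],          # [hidden=2, intermediate=4]
        [0.0, 1.0, 1.0, 3.0]]
w_down = [[1.0, 0.0],                  # [intermediate=4, hidden=2]
          [0.0, 1.0],
          [1.0, 1.0],
          [2.0, 0.0]]

world_size = 2
up_shards = split_cols(w_up, world_size)      # column-parallel: split outputs
down_shards = split_rows(w_down, world_size)  # row-parallel: split inputs

# Each rank touches only its own shards; no communication happens here.
partials = []
for rank in range(world_size):
    h_local = matmul(x, up_shards[rank])                  # slice of the intermediate
    partials.append(matmul(h_local, down_shards[rank]))   # partial output

# The AllReduce: elementwise sum of the partial outputs across ranks.
y_parallel = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

# Single-device reference for comparison.
y_reference = matmul(matmul(x, w_up), w_down)
print(y_parallel, y_reference)
```

<p>Because each rank's slice of the intermediate feeds straight into the matching rows of the down-projection, an elementwise activation between the two layers can run locally as well, which is why one AllReduce per block is enough.</p>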
<p><strong>VocabParallelEmbedding</strong> splits the embedding table (often 128k+ tokens for modern models) across GPUs. Each GPU holds a slice of the vocabulary and performs lookups only for tokens in its range.</p>
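<p>One common way to implement a vocabulary split, assumed here, is the Megatron-style approach: each rank returns zeros for tokens outside its range, and a sum across ranks recovers the full lookup. The sketch uses a toy table and two simulated ranks, and <code>local_lookup</code> is a hypothetical helper.</p>

```python
# Vocab-parallel embedding lookup on two simulated ranks.
# The table contents are made up for illustration.

vocab_size, world_size = 8, 2
shard = vocab_size // world_size

# Toy embedding table: token t maps to the vector [t, 10 * t].
full_table = [[float(t), 10.0 * t] for t in range(vocab_size)]

def local_lookup(rank, token_ids):
    """Look up only the tokens this rank owns; return zeros for the rest."""
    start, end = rank * shard, (rank + 1) * shard
    local_table = full_table[start:end]      # only this slice lives on the rank
    out = []
    for t in token_ids:
        if t in range(start, end):
            out.append(local_table[t - start])
        else:
            out.append([0.0, 0.0])           # not ours: contribute zeros
    return out

token_ids = [1, 6, 3]
per_rank = [local_lookup(r, token_ids) for r in range(world_size)]

# Summing across ranks (an all-reduce) recovers the full lookup,
# because exactly one rank owns each token.
combined = [[sum(v) for v in zip(*vecs)] for vecs in zip(*per_rank)]
print(combined)
```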
<p>The VllmConfig provides <code>tensor_parallel_size</code> during initialization, and each layer auto-configures its sharding based on the worker&rsquo;s rank. A model developer doesn&rsquo;t write explicit GPU assignment code — they use these primitives and the infrastructure handles partitioning.</p>
<h3 id="input-flattening-and-the-1d-computation-graph">Input Flattening and the 1D Computation Graph</h3>
<p>This is one of the more interesting design decisions in vLLM. In standard PyTorch, inputs are 2D tensors of shape <code>[batch_size, sequence_length]</code>. This requires padding to align sequences of different lengths: if you&rsquo;re processing three requests with lengths 5, 12, and 3, you pad everything to length 12. That means 16 wasted positions out of 36, roughly 44% of compute thrown away on padding tokens.</p>
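<p>The padding overhead is easy to compute for any batch. The helper below is a small illustration (<code>padding_waste</code> is a made-up name) that assumes every sequence is padded to the longest one in the batch.</p>

```python
def padding_waste(lengths):
    """Return (padded_slots, wasted_slots, wasted_fraction) for a padded batch."""
    padded = len(lengths) * max(lengths)   # every row is padded to the longest
    wasted = padded - sum(lengths)         # slots that hold no real token
    return padded, wasted, wasted / padded

for lengths in ([5, 12, 3], [3, 5, 2]):
    padded, wasted, fraction = padding_waste(lengths)
    print(f"lengths={lengths}: {wasted} of {padded} slots wasted ({fraction:.0%})")
```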
<p>vLLM eliminates padding entirely. The ModelRunner concatenates all tokens from all concurrent requests into a single 1D tensor of shape <code>[total_num_tokens]</code>. A separate positions tensor (also 1D) provides the sequence position for each token:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Three concurrent requests:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Request A: tokens [101, 204, 305]        (3 tokens, positions 0,1,2)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Request B: tokens [42, 55, 67, 89, 12]   (5 tokens, positions 0,1,2,3,4)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Request C: tokens [700, 801]             (2 tokens, positions 0,1)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Flattened input — no padding, no wasted compute:</span>
</span></span><span style="display:flex;"><span>input_ids <span style="color:#f92672">=</span> [<span style="color:#ae81ff">101</span>, <span style="color:#ae81ff">204</span>, <span style="color:#ae81ff">305</span>, <span style="color:#ae81ff">42</span>, <span style="color:#ae81ff">55</span>, <span style="color:#ae81ff">67</span>, <span style="color:#ae81ff">89</span>, <span style="color:#ae81ff">12</span>, <span style="color:#ae81ff">700</span>, <span style="color:#ae81ff">801</span>]  <span style="color:#75715e"># shape: [10]</span>
</span></span><span style="display:flex;"><span>positions  <span style="color:#f92672">=</span> [<span style="color:#ae81ff">0</span>,   <span style="color:#ae81ff">1</span>,   <span style="color:#ae81ff">2</span>,   <span style="color:#ae81ff">0</span>,  <span style="color:#ae81ff">1</span>,  <span style="color:#ae81ff">2</span>,  <span style="color:#ae81ff">3</span>,  <span style="color:#ae81ff">4</span>,  <span style="color:#ae81ff">0</span>,   <span style="color:#ae81ff">1</span>]   <span style="color:#75715e"># shape: [10]</span>
</span></span></code></pre></div><p>Every layer in a native vLLM model is written to process this 1D stream. Embeddings do lookups on the 1D tensor. RoPE uses the positions tensor for correct positional encoding. The attention layer uses <code>block_tables</code> to reconstruct the logical sequence structure — knowing which tokens belong to which request and where their KV cache blocks live in physical memory.</p>
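<p>The flattening step itself can be sketched in a few lines. This is a simplified illustration rather than the ModelRunner&rsquo;s actual code: it assumes every request is a fresh prompt, so positions restart at 0, and the <code>offsets</code> list is a hypothetical piece of bookkeeping showing how request boundaries can be recovered from the flat stream.</p>

```python
from itertools import accumulate

def flatten_requests(requests):
    """Concatenate per-request token lists into 1D input_ids and positions."""
    input_ids, positions = [], []
    for tokens in requests:
        input_ids.extend(tokens)
        positions.extend(range(len(tokens)))   # positions restart per request
    # Hypothetical bookkeeping: start offset of each request in the flat stream,
    # plus a final entry equal to the total token count.
    offsets = [0] + list(accumulate(len(t) for t in requests))
    return input_ids, positions, offsets

requests = [
    [101, 204, 305],           # Request A
    [42, 55, 67, 89, 12],      # Request B
    [700, 801],                # Request C
]
input_ids, positions, offsets = flatten_requests(requests)
print(input_ids)
print(positions)
print(offsets)

# Recover Request B from the flat stream using its offsets.
request_b = input_ids[offsets[1]:offsets[2]]
print(request_b)
```

<p>During decode a request contributes a single new token whose position continues from where its sequence left off, so the positions tensor carries real information and cannot be rebuilt from token order alone.</p>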




<style>
.nm-5590d392f5ce8b7f84eedffa089138fd {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: linear-gradient(180deg, #0f172a 0%, #1e293b 100%);
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #e2e8f0;
}

.nm-5590d392f5ce8b7f84eedffa089138fd * {
  box-sizing: border-box;
}

.nm-header-5590d392f5ce8b7f84eedffa089138fd {
  text-align: center;
  margin-bottom: 1.5rem;
}

.nm-title-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.5rem;
}

.nm-subtitle-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.9rem;
  color: #94a3b8;
  max-width: 650px;
  margin: 0 auto;
}

 
.nm-grid-5590d392f5ce8b7f84eedffa089138fd {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 0;
  position: relative;
  margin-bottom: 1.5rem;
}

 
.nm-vs-5590d392f5ce8b7f84eedffa089138fd {
  position: absolute;
  left: 50%;
  top: 0;
  bottom: 0;
  transform: translateX(-50%);
  width: 1px;
  background: #334155;
  z-index: 2;
}

.nm-vs-circle-5590d392f5ce8b7f84eedffa089138fd {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
  width: 36px;
  height: 36px;
  border-radius: 50%;
  background: #1e293b;
  border: 2px solid #475569;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.7rem;
  font-weight: 800;
  color: #94a3b8;
  letter-spacing: 0.05em;
  z-index: 3;
}

 
.nm-col-5590d392f5ce8b7f84eedffa089138fd {
  padding: 1.25rem;
}

.nm-col-hf-5590d392f5ce8b7f84eedffa089138fd {
  border-right: none;
}

.nm-col-vllm-5590d392f5ce8b7f84eedffa089138fd {
  border-left: none;
}

 
.nm-col-header-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  margin-bottom: 1rem;
}

.nm-col-icon-5590d392f5ce8b7f84eedffa089138fd {
  width: 32px;
  height: 32px;
  border-radius: 8px;
  display: flex;
  align-items: center;
  justify-content: center;
}

.nm-col-icon-hf-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(239, 68, 68, 0.15);
  color: #f87171;
}

.nm-col-icon-vllm-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(34, 197, 94, 0.15);
  color: #4ade80;
}

.nm-col-label-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.95rem;
  font-weight: 700;
}

.nm-col-label-hf-5590d392f5ce8b7f84eedffa089138fd {
  color: #f87171;
}

.nm-col-label-vllm-5590d392f5ce8b7f84eedffa089138fd {
  color: #4ade80;
}

 
.nm-section-title-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.65rem;
  font-weight: 600;
  text-transform: uppercase;
  letter-spacing: 0.08em;
  color: #64748b;
  margin-bottom: 0.5rem;
}

 
.nm-shape-pill-5590d392f5ce8b7f84eedffa089138fd {
  display: inline-block;
  font-family: 'SF Mono', 'Fira Code', monospace;
  font-size: 0.7rem;
  font-weight: 600;
  padding: 0.2rem 0.55rem;
  border-radius: 4px;
  margin-bottom: 0.6rem;
}

.nm-shape-pill-hf-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(239, 68, 68, 0.12);
  color: #fca5a5;
  border: 1px solid rgba(239, 68, 68, 0.3);
}

.nm-shape-pill-vllm-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(34, 197, 94, 0.12);
  color: #86efac;
  border: 1px solid rgba(34, 197, 94, 0.3);
}

 
.nm-grid-2d-5590d392f5ce8b7f84eedffa089138fd {
  display: grid;
  grid-template-columns: repeat(5, 40px);
  gap: 3px;
  margin-bottom: 0.5rem;
  cursor: pointer;
  padding: 0.25rem;
  border-radius: 8px;
  transition: background 0.2s ease;
}

.nm-grid-2d-5590d392f5ce8b7f84eedffa089138fd:hover {
  background: rgba(255, 255, 255, 0.03);
}

.nm-grid-2d-5590d392f5ce8b7f84eedffa089138fd.selected {
  background: rgba(239, 68, 68, 0.08);
  box-shadow: 0 0 16px rgba(239, 68, 68, 0.15);
}

.nm-cell-5590d392f5ce8b7f84eedffa089138fd {
  width: 40px;
  height: 40px;
  border-radius: 6px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.65rem;
  font-weight: 600;
  transition: all 0.2s ease;
  cursor: pointer;
}

.nm-cell-real-5590d392f5ce8b7f84eedffa089138fd {
  border: 1.5px solid;
}

.nm-cell-a-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(96, 165, 250, 0.15);
  border-color: rgba(96, 165, 250, 0.5);
  color: #93c5fd;
}

.nm-cell-b-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(192, 132, 252, 0.15);
  border-color: rgba(192, 132, 252, 0.5);
  color: #d8b4fe;
}

.nm-cell-c-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(244, 114, 182, 0.15);
  border-color: rgba(244, 114, 182, 0.5);
  color: #f9a8d4;
}

.nm-cell-pad-5590d392f5ce8b7f84eedffa089138fd {
  background: repeating-linear-gradient(
    45deg,
    rgba(239, 68, 68, 0.06),
    rgba(239, 68, 68, 0.06) 3px,
    transparent 3px,
    transparent 6px
  );
  border: 1.5px dashed rgba(239, 68, 68, 0.35);
  color: rgba(239, 68, 68, 0.45);
  font-size: 0.55rem;
  font-weight: 500;
}

.nm-grid-2d-5590d392f5ce8b7f84eedffa089138fd.selected .nm-cell-pad-5590d392f5ce8b7f84eedffa089138fd {
  animation: nm-pad-pulse-5590d392f5ce8b7f84eedffa089138fd 1.5s ease-in-out infinite;
}

@keyframes nm-pad-pulse-5590d392f5ce8b7f84eedffa089138fd {
  0%, 100% { opacity: 1; }
  50% { opacity: 0.5; }
}

 
.nm-stat-5590d392f5ce8b7f84eedffa089138fd {
  display: inline-flex;
  align-items: center;
  gap: 0.3rem;
  font-size: 0.7rem;
  font-weight: 600;
  padding: 0.25rem 0.6rem;
  border-radius: 6px;
  margin-bottom: 1rem;
  animation: nm-stat-pop-5590d392f5ce8b7f84eedffa089138fd 0.5s ease-out;
}

@keyframes nm-stat-pop-5590d392f5ce8b7f84eedffa089138fd {
  0% { transform: scale(0.8); }
  70% { transform: scale(1.05); }
  100% { transform: scale(1); }
}

.nm-stat-hf-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(239, 68, 68, 0.15);
  color: #fca5a5;
  border: 1px solid rgba(239, 68, 68, 0.3);
}

.nm-stat-vllm-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(34, 197, 94, 0.15);
  color: #86efac;
  border: 1px solid rgba(34, 197, 94, 0.3);
}

 
.nm-array-1d-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  gap: 3px;
  flex-wrap: wrap;
  margin-bottom: 0.5rem;
  cursor: pointer;
  padding: 0.25rem;
  border-radius: 8px;
  transition: background 0.2s ease;
}

.nm-array-1d-5590d392f5ce8b7f84eedffa089138fd:hover {
  background: rgba(255, 255, 255, 0.03);
}

.nm-array-1d-5590d392f5ce8b7f84eedffa089138fd.selected {
  background: rgba(34, 197, 94, 0.08);
  box-shadow: 0 0 16px rgba(34, 197, 94, 0.15);
}

.nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd {
  width: 40px;
  height: 40px;
  border-radius: 6px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.65rem;
  font-weight: 600;
  border: 1.5px solid;
  cursor: pointer;
  opacity: 0;
  transform: translateY(8px);
  transition: opacity 0.3s ease, transform 0.3s ease;
}

.nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd.nm-visible-5590d392f5ce8b7f84eedffa089138fd {
  opacity: 1;
  transform: translateY(0);
}

 
.nm-legend-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  gap: 0.75rem;
  margin-bottom: 0.75rem;
  flex-wrap: wrap;
}

.nm-legend-item-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  gap: 0.3rem;
  font-size: 0.6rem;
  color: #94a3b8;
}

.nm-legend-swatch-5590d392f5ce8b7f84eedffa089138fd {
  width: 10px;
  height: 10px;
  border-radius: 3px;
}

.nm-legend-swatch-a-5590d392f5ce8b7f84eedffa089138fd { background: #60a5fa; }
.nm-legend-swatch-b-5590d392f5ce8b7f84eedffa089138fd { background: #c084fc; }
.nm-legend-swatch-c-5590d392f5ce8b7f84eedffa089138fd { background: #f472b6; }

 
.nm-layers-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  flex-direction: column;
  gap: 0.5rem;
}

.nm-layer-arrow-5590d392f5ce8b7f84eedffa089138fd {
  text-align: center;
  color: #475569;
  font-size: 0.7rem;
  line-height: 1;
  margin: 0.1rem 0;
}

.nm-layer-card-5590d392f5ce8b7f84eedffa089138fd {
  border-radius: 10px;
  padding: 0.75rem;
  border: 1.5px solid;
  cursor: pointer;
  transition: all 0.25s ease;
}

.nm-layer-card-5590d392f5ce8b7f84eedffa089138fd:hover {
  transform: translateY(-1px);
}

.nm-layer-card-5590d392f5ce8b7f84eedffa089138fd.selected {
  transform: scale(1.02);
}

 
.nm-layer-hf-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(239, 68, 68, 0.06);
  border-color: rgba(239, 68, 68, 0.25);
}

.nm-layer-hf-5590d392f5ce8b7f84eedffa089138fd.selected {
  border-color: rgba(239, 68, 68, 0.6);
  box-shadow: 0 0 16px rgba(239, 68, 68, 0.2);
}

.nm-layer-hf-5590d392f5ce8b7f84eedffa089138fd .nm-layer-name-5590d392f5ce8b7f84eedffa089138fd {
  color: #fca5a5;
}

 
.nm-layer-vllm-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(34, 197, 94, 0.06);
  border-color: rgba(34, 197, 94, 0.25);
}

.nm-layer-vllm-5590d392f5ce8b7f84eedffa089138fd.selected {
  border-color: rgba(34, 197, 94, 0.6);
  box-shadow: 0 0 16px rgba(34, 197, 94, 0.2);
}

.nm-layer-vllm-5590d392f5ce8b7f84eedffa089138fd .nm-layer-name-5590d392f5ce8b7f84eedffa089138fd {
  color: #86efac;
}

.nm-layer-name-5590d392f5ce8b7f84eedffa089138fd {
  font-family: 'SF Mono', 'Fira Code', monospace;
  font-size: 0.75rem;
  font-weight: 600;
  margin-bottom: 0.4rem;
}

.nm-layer-detail-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  color: #94a3b8;
  margin-bottom: 0.4rem;
}

 
.nm-block-single-5590d392f5ce8b7f84eedffa089138fd {
  height: 28px;
  border-radius: 5px;
  background: rgba(239, 68, 68, 0.12);
  border: 1px solid rgba(239, 68, 68, 0.25);
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.55rem;
  color: #fca5a5;
}

 
.nm-block-split-5590d392f5ce8b7f84eedffa089138fd {
  height: 28px;
  border-radius: 5px;
  display: flex;
  overflow: hidden;
  border: 1px solid rgba(34, 197, 94, 0.25);
  position: relative;
}

.nm-gpu-shard-5590d392f5ce8b7f84eedffa089138fd {
  flex: 1;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.55rem;
  color: #86efac;
  background: rgba(34, 197, 94, 0.1);
  transition: background 0.2s ease;
  cursor: pointer;
}

.nm-gpu-shard-5590d392f5ce8b7f84eedffa089138fd:first-child {
  border-right: 1px dashed rgba(34, 197, 94, 0.35);
}

.nm-gpu-shard-5590d392f5ce8b7f84eedffa089138fd.shard-highlight {
  background: rgba(34, 197, 94, 0.25);
}

 
.nm-allreduce-label-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.55rem;
  color: #fbbf24;
  text-align: center;
  margin-top: 0.2rem;
  opacity: 0.8;
}

 
.nm-allreduce-svg-5590d392f5ce8b7f84eedffa089138fd {
  display: block;
  margin: 0.15rem auto 0;
  overflow: visible;
}

.nm-allreduce-svg-5590d392f5ce8b7f84eedffa089138fd .nm-arrow-line-5590d392f5ce8b7f84eedffa089138fd {
  stroke: #fbbf24;
  stroke-width: 2;
  stroke-dasharray: 6, 4;
  fill: none;
  opacity: 0.5;
}

.nm-allreduce-svg-5590d392f5ce8b7f84eedffa089138fd.animating .nm-arrow-line-5590d392f5ce8b7f84eedffa089138fd {
  opacity: 1;
  animation: nm-dash-flow-5590d392f5ce8b7f84eedffa089138fd 1s linear infinite;
}

@keyframes nm-dash-flow-5590d392f5ce8b7f84eedffa089138fd {
  to { stroke-dashoffset: -20; }
}

.nm-allreduce-svg-5590d392f5ce8b7f84eedffa089138fd .nm-arrow-head-5590d392f5ce8b7f84eedffa089138fd {
  fill: #fbbf24;
  opacity: 0.5;
}

.nm-allreduce-svg-5590d392f5ce8b7f84eedffa089138fd.animating .nm-arrow-head-5590d392f5ce8b7f84eedffa089138fd {
  opacity: 1;
}

 
.nm-info-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 12px;
  padding: 1.25rem;
  margin-bottom: 1rem;
}

.nm-info-placeholder-5590d392f5ce8b7f84eedffa089138fd {
  text-align: center;
  color: #64748b;
  padding: 1.5rem 1rem;
}

.nm-info-placeholder-5590d392f5ce8b7f84eedffa089138fd svg {
  width: 32px;
  height: 32px;
  margin-bottom: 0.5rem;
  opacity: 0.5;
}

.nm-info-placeholder-5590d392f5ce8b7f84eedffa089138fd p {
  font-size: 0.8rem;
  margin: 0;
}

.nm-info-content-5590d392f5ce8b7f84eedffa089138fd {
  display: none;
}

.nm-info-content-5590d392f5ce8b7f84eedffa089138fd.active {
  display: block;
  animation: nm-fade-in-5590d392f5ce8b7f84eedffa089138fd 0.3s ease;
}

@keyframes nm-fade-in-5590d392f5ce8b7f84eedffa089138fd {
  from { opacity: 0; transform: translateY(5px); }
  to { opacity: 1; transform: translateY(0); }
}

.nm-info-header-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  gap: 0.75rem;
  margin-bottom: 0.75rem;
}

.nm-info-icon-5590d392f5ce8b7f84eedffa089138fd {
  width: 36px;
  height: 36px;
  border-radius: 8px;
  display: flex;
  align-items: center;
  justify-content: center;
  flex-shrink: 0;
}

.nm-info-icon-5590d392f5ce8b7f84eedffa089138fd.hf { background: rgba(239, 68, 68, 0.15); color: #f87171; }
.nm-info-icon-5590d392f5ce8b7f84eedffa089138fd.vllm { background: rgba(34, 197, 94, 0.15); color: #4ade80; }
.nm-info-icon-5590d392f5ce8b7f84eedffa089138fd.waste { background: rgba(239, 68, 68, 0.15); color: #f87171; }
.nm-info-icon-5590d392f5ce8b7f84eedffa089138fd.useful { background: rgba(96, 165, 250, 0.15); color: #60a5fa; }

.nm-info-title-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 1rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.15rem;
}

.nm-info-badge-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  padding: 0.2rem 0.4rem;
  border-radius: 4px;
  font-weight: 600;
  text-transform: uppercase;
}

.nm-info-badge-5590d392f5ce8b7f84eedffa089138fd.hf { background: rgba(239, 68, 68, 0.15); color: #fca5a5; }
.nm-info-badge-5590d392f5ce8b7f84eedffa089138fd.vllm { background: rgba(34, 197, 94, 0.15); color: #86efac; }
.nm-info-badge-5590d392f5ce8b7f84eedffa089138fd.waste { background: rgba(239, 68, 68, 0.15); color: #fca5a5; }
.nm-info-badge-5590d392f5ce8b7f84eedffa089138fd.useful { background: rgba(96, 165, 250, 0.15); color: #93c5fd; }

.nm-info-desc-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.8rem;
  color: #cbd5e1;
  line-height: 1.6;
  margin-bottom: 0.75rem;
}

.nm-info-props-5590d392f5ce8b7f84eedffa089138fd h4 {
  font-size: 0.7rem;
  color: #64748b;
  margin: 0 0 0.4rem 0;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

.nm-info-props-5590d392f5ce8b7f84eedffa089138fd ul {
  margin: 0;
  padding: 0;
  list-style: none;
}

.nm-info-props-5590d392f5ce8b7f84eedffa089138fd li {
  font-size: 0.75rem;
  color: #94a3b8;
  padding: 0.25rem 0;
  padding-left: 1rem;
  position: relative;
}

.nm-info-props-5590d392f5ce8b7f84eedffa089138fd li::before {
  content: '\2022';
  position: absolute;
  left: 0;
  color: #64748b;
}

.nm-insight-5590d392f5ce8b7f84eedffa089138fd {
  margin-top: 0.6rem;
  padding: 0.6rem 0.75rem;
  background: rgba(245, 158, 11, 0.1);
  border: 1px solid rgba(245, 158, 11, 0.3);
  border-radius: 8px;
}

.nm-insight-5590d392f5ce8b7f84eedffa089138fd p {
  font-size: 0.75rem;
  color: #fbbf24;
  margin: 0;
  line-height: 1.5;
}

.nm-insight-5590d392f5ce8b7f84eedffa089138fd strong {
  color: #fcd34d;
}

 
.nm-footer-5590d392f5ce8b7f84eedffa089138fd {
  padding: 1rem;
  background: rgba(245, 158, 11, 0.1);
  border: 1px solid rgba(245, 158, 11, 0.3);
  border-radius: 10px;
  text-align: center;
}

.nm-footer-5590d392f5ce8b7f84eedffa089138fd p {
  font-size: 0.8rem;
  color: #fbbf24;
  margin: 0;
  line-height: 1.5;
}

.nm-footer-5590d392f5ce8b7f84eedffa089138fd strong {
  color: #fcd34d;
}

 
@media (max-width: 850px) {
  .nm-grid-5590d392f5ce8b7f84eedffa089138fd {
    grid-template-columns: 1fr;
  }

  .nm-vs-5590d392f5ce8b7f84eedffa089138fd {
    display: none;
  }

  .nm-col-hf-5590d392f5ce8b7f84eedffa089138fd {
    border-bottom: 1px solid #334155;
    padding-bottom: 1.5rem;
  }

  .nm-col-vllm-5590d392f5ce8b7f84eedffa089138fd {
    padding-top: 1.5rem;
  }
}

 
@media (max-width: 600px) {
  .nm-5590d392f5ce8b7f84eedffa089138fd {
    padding: 1.25rem;
  }

  .nm-cell-5590d392f5ce8b7f84eedffa089138fd,
  .nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd {
    width: 28px;
    height: 28px;
    font-size: 0.5rem;
  }

  .nm-grid-2d-5590d392f5ce8b7f84eedffa089138fd {
    grid-template-columns: repeat(5, 28px);
  }

  .nm-cell-pad-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.4rem;
  }

  .nm-layer-name-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.65rem;
  }

  .nm-gpu-shard-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.45rem;
  }

  .nm-block-single-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.45rem;
  }

  .nm-title-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 1.2rem;
  }
}
</style>

<div class="nm-5590d392f5ce8b7f84eedffa089138fd">
  <div class="nm-header-5590d392f5ce8b7f84eedffa089138fd">
    <div class="nm-title-5590d392f5ce8b7f84eedffa089138fd">Standard HuggingFace vs vLLM Native</div>
    <div class="nm-subtitle-5590d392f5ce8b7f84eedffa089138fd">How vLLM eliminates padding waste and automatically shards across GPUs</div>
  </div>

  
  <div class="nm-grid-5590d392f5ce8b7f84eedffa089138fd">
    
    <div class="nm-vs-5590d392f5ce8b7f84eedffa089138fd">
      <div class="nm-vs-circle-5590d392f5ce8b7f84eedffa089138fd">VS</div>
    </div>

    
    <div class="nm-col-5590d392f5ce8b7f84eedffa089138fd nm-col-hf-5590d392f5ce8b7f84eedffa089138fd">
      <div class="nm-col-header-5590d392f5ce8b7f84eedffa089138fd">
        <div class="nm-col-icon-5590d392f5ce8b7f84eedffa089138fd nm-col-icon-hf-5590d392f5ce8b7f84eedffa089138fd">
          <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <rect x="3" y="3" width="18" height="18" rx="2"/>
            <path d="M3 9h18M9 3v18"/>
          </svg>
        </div>
        <div class="nm-col-label-5590d392f5ce8b7f84eedffa089138fd nm-col-label-hf-5590d392f5ce8b7f84eedffa089138fd">Standard HuggingFace</div>
      </div>

      
      <div class="nm-section-title-5590d392f5ce8b7f84eedffa089138fd">Input Shape</div>
      <div class="nm-shape-pill-5590d392f5ce8b7f84eedffa089138fd nm-shape-pill-hf-5590d392f5ce8b7f84eedffa089138fd">shape: [3, 5]</div>

      
      <div class="nm-grid-2d-5590d392f5ce8b7f84eedffa089138fd" data-item="hf-input">
        
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-real-5590d392f5ce8b7f84eedffa089138fd nm-cell-a-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">101</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-real-5590d392f5ce8b7f84eedffa089138fd nm-cell-a-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">204</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-real-5590d392f5ce8b7f84eedffa089138fd nm-cell-a-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">305</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-pad-5590d392f5ce8b7f84eedffa089138fd" data-item="padding-cell">PAD</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-pad-5590d392f5ce8b7f84eedffa089138fd" data-item="padding-cell">PAD</div>
        
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-real-5590d392f5ce8b7f84eedffa089138fd nm-cell-b-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">42</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-real-5590d392f5ce8b7f84eedffa089138fd nm-cell-b-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">55</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-real-5590d392f5ce8b7f84eedffa089138fd nm-cell-b-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">67</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-real-5590d392f5ce8b7f84eedffa089138fd nm-cell-b-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">89</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-real-5590d392f5ce8b7f84eedffa089138fd nm-cell-b-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">12</div>
        
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-real-5590d392f5ce8b7f84eedffa089138fd nm-cell-c-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">700</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-real-5590d392f5ce8b7f84eedffa089138fd nm-cell-c-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">801</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-pad-5590d392f5ce8b7f84eedffa089138fd" data-item="padding-cell">PAD</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-pad-5590d392f5ce8b7f84eedffa089138fd" data-item="padding-cell">PAD</div>
        <div class="nm-cell-5590d392f5ce8b7f84eedffa089138fd nm-cell-pad-5590d392f5ce8b7f84eedffa089138fd" data-item="padding-cell">PAD</div>
      </div>

      <div class="nm-stat-5590d392f5ce8b7f84eedffa089138fd nm-stat-hf-5590d392f5ce8b7f84eedffa089138fd">
        <svg width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2.5">
          <circle cx="12" cy="12" r="10"/><path d="M15 9l-6 6M9 9l6 6"/>
        </svg>
        5 of 15 wasted (33%)
      </div>

      
      <div class="nm-section-title-5590d392f5ce8b7f84eedffa089138fd">Layer Architecture</div>
      <div class="nm-layers-5590d392f5ce8b7f84eedffa089138fd">
        
        <div class="nm-layer-card-5590d392f5ce8b7f84eedffa089138fd nm-layer-hf-5590d392f5ce8b7f84eedffa089138fd" data-item="hf-embedding">
          <div class="nm-layer-name-5590d392f5ce8b7f84eedffa089138fd">nn.Embedding</div>
          <div class="nm-layer-detail-5590d392f5ce8b7f84eedffa089138fd">Full vocabulary, single GPU</div>
          <div class="nm-block-single-5590d392f5ce8b7f84eedffa089138fd">Single GPU — full vocab</div>
        </div>

        <div class="nm-layer-arrow-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>

        
        <div class="nm-layer-card-5590d392f5ce8b7f84eedffa089138fd nm-layer-hf-5590d392f5ce8b7f84eedffa089138fd" data-item="hf-linear">
          <div class="nm-layer-name-5590d392f5ce8b7f84eedffa089138fd">nn.Linear (QKV)</div>
          <div class="nm-layer-detail-5590d392f5ce8b7f84eedffa089138fd">Full weight matrix</div>
          <div class="nm-block-single-5590d392f5ce8b7f84eedffa089138fd">Single GPU — full matrix</div>
        </div>

        <div class="nm-layer-arrow-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>

        
        <div class="nm-layer-card-5590d392f5ce8b7f84eedffa089138fd nm-layer-hf-5590d392f5ce8b7f84eedffa089138fd" data-item="hf-output">
          <div class="nm-layer-name-5590d392f5ce8b7f84eedffa089138fd">nn.Linear (Output)</div>
          <div class="nm-layer-detail-5590d392f5ce8b7f84eedffa089138fd">No sharding</div>
          <div class="nm-block-single-5590d392f5ce8b7f84eedffa089138fd">Single GPU — no sharding</div>
        </div>
      </div>
    </div>

    
    <div class="nm-col-5590d392f5ce8b7f84eedffa089138fd nm-col-vllm-5590d392f5ce8b7f84eedffa089138fd">
      <div class="nm-col-header-5590d392f5ce8b7f84eedffa089138fd">
        <div class="nm-col-icon-5590d392f5ce8b7f84eedffa089138fd nm-col-icon-vllm-5590d392f5ce8b7f84eedffa089138fd">
          <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <path d="M13 2L3 14h9l-1 8 10-12h-9l1-8z"/>
          </svg>
        </div>
        <div class="nm-col-label-5590d392f5ce8b7f84eedffa089138fd nm-col-label-vllm-5590d392f5ce8b7f84eedffa089138fd">vLLM Native</div>
      </div>

      
      <div class="nm-section-title-5590d392f5ce8b7f84eedffa089138fd">Input Shape</div>
      <div class="nm-shape-pill-5590d392f5ce8b7f84eedffa089138fd nm-shape-pill-vllm-5590d392f5ce8b7f84eedffa089138fd">shape: [10]</div>

      
      <div class="nm-array-1d-5590d392f5ce8b7f84eedffa089138fd" data-item="vllm-input" id="nm-array-1d-5590d392f5ce8b7f84eedffa089138fd">
        <div class="nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd nm-cell-a-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">101</div>
        <div class="nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd nm-cell-a-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">204</div>
        <div class="nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd nm-cell-a-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">305</div>
        <div class="nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd nm-cell-b-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">42</div>
        <div class="nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd nm-cell-b-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">55</div>
        <div class="nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd nm-cell-b-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">67</div>
        <div class="nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd nm-cell-b-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">89</div>
        <div class="nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd nm-cell-b-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">12</div>
        <div class="nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd nm-cell-c-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">700</div>
        <div class="nm-cell-1d-5590d392f5ce8b7f84eedffa089138fd nm-cell-c-5590d392f5ce8b7f84eedffa089138fd" data-item="real-cell">801</div>
      </div>

      <div class="nm-stat-5590d392f5ce8b7f84eedffa089138fd nm-stat-vllm-5590d392f5ce8b7f84eedffa089138fd">
        <svg width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2.5">
          <path d="M20 6L9 17l-5-5"/>
        </svg>
        10 tokens, 0% waste
      </div>

      
      <div class="nm-legend-5590d392f5ce8b7f84eedffa089138fd">
        <div class="nm-legend-item-5590d392f5ce8b7f84eedffa089138fd">
          <div class="nm-legend-swatch-5590d392f5ce8b7f84eedffa089138fd nm-legend-swatch-a-5590d392f5ce8b7f84eedffa089138fd"></div>
          Req A
        </div>
        <div class="nm-legend-item-5590d392f5ce8b7f84eedffa089138fd">
          <div class="nm-legend-swatch-5590d392f5ce8b7f84eedffa089138fd nm-legend-swatch-b-5590d392f5ce8b7f84eedffa089138fd"></div>
          Req B
        </div>
        <div class="nm-legend-item-5590d392f5ce8b7f84eedffa089138fd">
          <div class="nm-legend-swatch-5590d392f5ce8b7f84eedffa089138fd nm-legend-swatch-c-5590d392f5ce8b7f84eedffa089138fd"></div>
          Req C
        </div>
      </div>

      
      <div class="nm-section-title-5590d392f5ce8b7f84eedffa089138fd">Layer Architecture</div>
      <div class="nm-layers-5590d392f5ce8b7f84eedffa089138fd">
        
        <div class="nm-layer-card-5590d392f5ce8b7f84eedffa089138fd nm-layer-vllm-5590d392f5ce8b7f84eedffa089138fd" data-item="vllm-embedding">
          <div class="nm-layer-name-5590d392f5ce8b7f84eedffa089138fd">VocabParallelEmbedding</div>
          <div class="nm-layer-detail-5590d392f5ce8b7f84eedffa089138fd">Vocabulary sharded across GPUs</div>
          <div class="nm-block-split-5590d392f5ce8b7f84eedffa089138fd">
            <div class="nm-gpu-shard-5590d392f5ce8b7f84eedffa089138fd" data-gpu="0">GPU 0: vocab[0:N/2]</div>
            <div class="nm-gpu-shard-5590d392f5ce8b7f84eedffa089138fd" data-gpu="1">GPU 1: vocab[N/2:N]</div>
          </div>
          <div class="nm-allreduce-label-5590d392f5ce8b7f84eedffa089138fd">AllReduce combines</div>
        </div>

        <div class="nm-layer-arrow-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>

        
        <div class="nm-layer-card-5590d392f5ce8b7f84eedffa089138fd nm-layer-vllm-5590d392f5ce8b7f84eedffa089138fd" data-item="vllm-column">
          <div class="nm-layer-name-5590d392f5ce8b7f84eedffa089138fd">ColumnParallelLinear</div>
          <div class="nm-layer-detail-5590d392f5ce8b7f84eedffa089138fd">Output dim split across GPUs</div>
          <div class="nm-block-split-5590d392f5ce8b7f84eedffa089138fd">
            <div class="nm-gpu-shard-5590d392f5ce8b7f84eedffa089138fd" data-gpu="0">GPU 0: out[0:H/2]</div>
            <div class="nm-gpu-shard-5590d392f5ce8b7f84eedffa089138fd" data-gpu="1">GPU 1: out[H/2:H]</div>
          </div>
          <div class="nm-allreduce-label-5590d392f5ce8b7f84eedffa089138fd">No AllReduce needed</div>
        </div>

        <div class="nm-layer-arrow-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>

        
        <div class="nm-layer-card-5590d392f5ce8b7f84eedffa089138fd nm-layer-vllm-5590d392f5ce8b7f84eedffa089138fd" data-item="vllm-row">
          <div class="nm-layer-name-5590d392f5ce8b7f84eedffa089138fd">RowParallelLinear</div>
          <div class="nm-layer-detail-5590d392f5ce8b7f84eedffa089138fd">Input dim split across GPUs</div>
          <div class="nm-block-split-5590d392f5ce8b7f84eedffa089138fd">
            <div class="nm-gpu-shard-5590d392f5ce8b7f84eedffa089138fd" data-gpu="0">GPU 0: in[0:H/2]</div>
            <div class="nm-gpu-shard-5590d392f5ce8b7f84eedffa089138fd" data-gpu="1">GPU 1: in[H/2:H]</div>
          </div>
          
          <svg class="nm-allreduce-svg-5590d392f5ce8b7f84eedffa089138fd" id="nm-allreduce-svg-5590d392f5ce8b7f84eedffa089138fd" width="100%" height="20" viewBox="0 0 200 20">
            
            <line class="nm-arrow-line-5590d392f5ce8b7f84eedffa089138fd" x1="30" y1="8" x2="170" y2="8"/>
            <polygon class="nm-arrow-head-5590d392f5ce8b7f84eedffa089138fd" points="170,4 178,8 170,12"/>
            
            <line class="nm-arrow-line-5590d392f5ce8b7f84eedffa089138fd" x1="170" y1="14" x2="30" y2="14"/>
            <polygon class="nm-arrow-head-5590d392f5ce8b7f84eedffa089138fd" points="30,10 22,14 30,18"/>
          </svg>
          <div class="nm-allreduce-label-5590d392f5ce8b7f84eedffa089138fd">AllReduce sync</div>
        </div>
      </div>
    </div>
  </div>

  
  <div class="nm-info-5590d392f5ce8b7f84eedffa089138fd" id="nm-info-5590d392f5ce8b7f84eedffa089138fd">
    <div class="nm-info-placeholder-5590d392f5ce8b7f84eedffa089138fd" id="nm-placeholder-5590d392f5ce8b7f84eedffa089138fd">
      <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5">
        <circle cx="12" cy="12" r="10"/>
        <path d="M12 16v-4M12 8h.01"/>
      </svg>
      <p>Click any input grid, token cell, or layer card to see details.</p>
    </div>
    <div class="nm-info-content-5590d392f5ce8b7f84eedffa089138fd" id="nm-info-content-5590d392f5ce8b7f84eedffa089138fd">
      <div class="nm-info-header-5590d392f5ce8b7f84eedffa089138fd">
        <div class="nm-info-icon-5590d392f5ce8b7f84eedffa089138fd" id="nm-info-icon-5590d392f5ce8b7f84eedffa089138fd">
          <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <rect x="3" y="3" width="18" height="18" rx="2"/>
          </svg>
        </div>
        <div>
          <div class="nm-info-title-5590d392f5ce8b7f84eedffa089138fd" id="nm-info-title-5590d392f5ce8b7f84eedffa089138fd">Component</div>
          <span class="nm-info-badge-5590d392f5ce8b7f84eedffa089138fd" id="nm-info-badge-5590d392f5ce8b7f84eedffa089138fd">Category</span>
        </div>
      </div>
      <div class="nm-info-desc-5590d392f5ce8b7f84eedffa089138fd" id="nm-info-desc-5590d392f5ce8b7f84eedffa089138fd"></div>
      <div class="nm-info-props-5590d392f5ce8b7f84eedffa089138fd">
        <h4>Key Properties</h4>
        <ul id="nm-info-props-5590d392f5ce8b7f84eedffa089138fd"></ul>
      </div>
      <div class="nm-insight-5590d392f5ce8b7f84eedffa089138fd" id="nm-insight-5590d392f5ce8b7f84eedffa089138fd" style="display: none;">
        <p id="nm-insight-text-5590d392f5ce8b7f84eedffa089138fd"></p>
      </div>
    </div>
  </div>

  
  <div class="nm-footer-5590d392f5ce8b7f84eedffa089138fd">
    <p><strong>Every native vLLM model processes a 1D token stream — no padding, automatic sharding.</strong> This is why native models are significantly faster than wrapped HuggingFace models.</p>
  </div>
</div>

<script>
(function() {
  var id = '5590d392f5ce8b7f84eedffa089138fd';

   

  var items = {
    'hf-input': {
      title: '2D Padded Input Tensor',
      badgeClass: 'hf',
      badge: 'Standard HF',
      description: 'Standard HuggingFace models process inputs as 2D tensors of shape [batch_size, seq_len]. When sequences have different lengths, shorter ones are padded to match the longest. In this example, padding to length 5 wastes 5 out of 15 positions: 33% of compute is thrown away on tokens that produce no useful output.',
      props: ['Shape: [3, 5] — 3 sequences, max length 5', 'Request A: 3 real tokens + 2 padding', 'Request B: 5 real tokens + 0 padding', 'Request C: 2 real tokens + 3 padding'],
      insight: '<strong>33% waste</strong> in this small batch. With highly variable sequence lengths (common in chat), padding waste can exceed 60%. Every padded position consumes full attention computation.'
    },
    'hf-embedding': {
      title: 'nn.Embedding',
      badgeClass: 'hf',
      badge: 'Standard HF',
      description: 'Standard PyTorch embedding layer. The entire vocabulary embedding table (often 32K-128K entries x hidden_dim) resides on a single GPU. For large models, this alone can consume several GB of VRAM.',
      props: ['Full vocabulary on one GPU', 'No cross-GPU communication', 'Memory-bound for large vocabs', 'Cannot scale beyond single GPU VRAM']
    },
    'hf-linear': {
      title: 'nn.Linear (QKV)',
      badgeClass: 'hf',
      badge: 'Standard HF',
      description: 'Standard linear projection for computing Query, Key, and Value matrices. The full weight matrix lives on one GPU. For a 70B model, a single QKV projection runs to hundreds of MB in fp16, and every layer has one.',
      props: ['Full weight matrix on one GPU', 'No tensor parallelism', 'Computes Q, K, V projections', 'Memory limited by single GPU']
    },
    'hf-output': {
      title: 'nn.Linear (Output)',
      badgeClass: 'hf',
      badge: 'Standard HF',
      description: 'Standard output projection layer. Like all other HF layers, the full weight matrix resides on a single GPU with no automatic sharding capability.',
      props: ['Full weight matrix, no sharding', 'Single GPU bottleneck', 'No automatic parallelism', 'Manual model parallelism needed for multi-GPU']
    },
    'vllm-input': {
      title: '1D Flattened Token Stream',
      badgeClass: 'vllm',
      badge: 'vLLM Native',
      description: 'vLLM concatenates all tokens from all concurrent requests into a single flat 1D tensor. No padding is needed — every position in the tensor is a real token that produces useful computation. A separate positions tensor provides sequence-level position information.',
      props: ['Shape: [10] — all tokens concatenated', 'Request A: indices 0-2 (blue)', 'Request B: indices 3-7 (purple)', 'Request C: indices 8-9 (pink)'],
      insight: '<strong>Zero waste.</strong> Every element in the tensor represents a real token. The attention layer uses block_tables to reconstruct which tokens belong to which sequence.'
    },
    'vllm-embedding': {
      title: 'VocabParallelEmbedding',
      badgeClass: 'vllm',
      badge: 'vLLM Native',
      description: 'vLLM\'s parallel embedding layer automatically splits the vocabulary table across GPUs. Each GPU holds vocab[rank*chunk : (rank+1)*chunk] and does lookups only for its portion. An AllReduce combines the partial results.',
      props: ['Vocabulary split across N GPUs', 'Each GPU: vocab_size/N entries', 'AllReduce combines partial lookups', 'Linear memory scaling with GPU count'],
      insight: 'For a 128K vocab with TP=8, each GPU holds only 16K entries instead of 128K — an <strong>8x memory reduction</strong> for the embedding table.'
    },
    'vllm-column': {
      title: 'ColumnParallelLinear',
      badgeClass: 'vllm',
      badge: 'vLLM Native',
      description: 'Splits the output dimension of the weight matrix across GPUs. Each GPU computes a slice of the output: Y_i = X @ W_i where W_i is columns [i*H/N : (i+1)*H/N]. No AllReduce needed — the outputs are simply kept split for the next layer.',
      props: ['Output dim split: each GPU computes H/N columns', 'W shape per GPU: [H_in, H_out/N]', 'No AllReduce needed after this layer', 'Used for QKV projections and MLP gate/up'],
      insight: 'Column-parallel is paired with row-parallel: the split output feeds directly into RowParallelLinear without communication. This <strong>minimizes AllReduce calls</strong>.'
    },
    'vllm-row': {
      title: 'RowParallelLinear',
      badgeClass: 'vllm',
      badge: 'vLLM Native',
      description: 'Splits the input dimension of the weight matrix across GPUs. Each GPU computes a partial result: Y_i = X_i @ W_i. An AllReduce sum combines all partial results into the final output Y = sum(Y_i).',
      props: ['Input dim split: each GPU takes H/N input features', 'W shape per GPU: [H_in/N, H_out]', 'AllReduce required to combine partial sums', 'Used for attention output and MLP down projection'],
      insight: 'The AllReduce here is the <strong>main communication cost</strong> in tensor parallelism. vLLM uses NCCL\'s optimized AllReduce, which overlaps computation and communication where possible.'
    },
    'padding-cell': {
      title: 'Padding Token',
      badgeClass: 'waste',
      badge: 'Wasted Compute',
      description: 'A padding token fills empty positions in the 2D batch tensor. It goes through the full forward pass — embedding lookup, attention computation, MLP — but its output is discarded. Every padding token wastes the same compute as a real token.',
      props: ['Full forward pass computation wasted', 'Attention is computed but output discarded', 'Scales with max_seq_len in batch', 'Worst case: one long + many short sequences']
    },
    'real-cell': {
      title: 'Real Token',
      badgeClass: 'useful',
      badge: 'Useful Compute',
      description: 'A real token from an actual request. Its forward pass produces meaningful hidden states that contribute to the final output. In vLLM\'s 1D layout, every token is a real token.',
      props: ['Produces meaningful hidden states', 'Contributes to output generation', 'No wasted computation', 'Color indicates which request it belongs to']
    }
  };

   

  var container = document.querySelector('.nm-' + id);
  if (!container) return;

  function clearSelection() {
    var els = container.querySelectorAll('.selected, .shard-highlight');
    for (var i = 0; i < els.length; i++) {
      els[i].classList.remove('selected', 'shard-highlight');
    }
    
    var arrowSvg = document.getElementById('nm-allreduce-svg-' + id);
    if (arrowSvg) arrowSvg.classList.remove('animating');
  }

  function showInfo(data) {
    document.getElementById('nm-placeholder-' + id).style.display = 'none';
    var content = document.getElementById('nm-info-content-' + id);
    content.classList.remove('active');
    void content.offsetWidth;
    content.classList.add('active');

    document.getElementById('nm-info-title-' + id).textContent = data.title;

    var badge = document.getElementById('nm-info-badge-' + id);
    badge.textContent = data.badge;
    badge.className = 'nm-info-badge-' + id + ' ' + data.badgeClass;

    var icon = document.getElementById('nm-info-icon-' + id);
    icon.className = 'nm-info-icon-' + id + ' ' + data.badgeClass;

    document.getElementById('nm-info-desc-' + id).textContent = data.description;

    var propsList = document.getElementById('nm-info-props-' + id);
    var html = '';
    for (var j = 0; j < data.props.length; j++) {
      html += '<li>' + data.props[j] + '</li>';
    }
    propsList.innerHTML = html;

    var insightEl = document.getElementById('nm-insight-' + id);
    if (data.insight) {
      insightEl.style.display = 'block';
      document.getElementById('nm-insight-text-' + id).innerHTML = data.insight;
    } else {
      insightEl.style.display = 'none';
    }
  }

   

  container.addEventListener('mouseover', function(e) {
    var target = e.target;
    if (target.hasAttribute && target.hasAttribute('data-gpu')) {
      var gpuId = target.getAttribute('data-gpu');
      var shards = container.querySelectorAll('[data-gpu="' + gpuId + '"]');
      for (var i = 0; i < shards.length; i++) {
        shards[i].classList.add('shard-highlight');
      }
    }
  });

  container.addEventListener('mouseout', function(e) {
    var target = e.target;
    if (target.hasAttribute && target.hasAttribute('data-gpu')) {
      var shards = container.querySelectorAll('.shard-highlight');
      for (var i = 0; i < shards.length; i++) {
        shards[i].classList.remove('shard-highlight');
      }
    }
  });

   

  container.addEventListener('click', function(e) {
    var target = e.target;

    while (target && target !== container) {
      if (target.hasAttribute && target.hasAttribute('data-item')) {
        var itemId = target.getAttribute('data-item');
        var data = items[itemId];
        if (!data) { target = target.parentElement; continue; }

        
        e.stopPropagation();
        clearSelection();

        
        target.classList.add('selected');

        
        if (itemId === 'vllm-row') {
          var arrowSvg = document.getElementById('nm-allreduce-svg-' + id);
          if (arrowSvg) arrowSvg.classList.add('animating');
        }

        showInfo(data);
        return;
      }
      target = target.parentElement;
    }
  });

   

  var arrayContainer = document.getElementById('nm-array-1d-' + id);
  if (arrayContainer) {
    var cells = arrayContainer.querySelectorAll('.nm-cell-1d-' + id);
    var animated = false;

    function animateCells() {
      if (animated) return;
      animated = true;
      for (var i = 0; i < cells.length; i++) {
        (function(cell, delay) {
          setTimeout(function() {
            cell.classList.add('nm-visible-' + id);
          }, delay);
        })(cells[i], i * 50);
      }
    }

    if ('IntersectionObserver' in window) {
      var observer = new IntersectionObserver(function(entries) {
        if (entries[0].isIntersecting) {
          animateCells();
          observer.disconnect();
        }
      }, { threshold: 0.3 });
      observer.observe(arrayContainer);
    } else {
      
      animateCells();
    }
  }
})();
</script>
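<p>The column-parallel and row-parallel pairing in the diagram above can be checked numerically. The following is a minimal NumPy sketch, not vLLM code: the two &ldquo;GPUs&rdquo; are just array shards, the AllReduce is modeled as a plain sum, and all sizes and variable names are made up for illustration. The activation between the two layers is omitted; an elementwise activation would apply to each column shard independently.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, hidden, inner, n_gpus = 10, 8, 16, 2

x = rng.standard_normal((tokens, hidden))      # flat 1D token stream, already embedded
w_col = rng.standard_normal((hidden, inner))   # column-parallel weight (output dim split)
w_row = rng.standard_normal((inner, hidden))   # row-parallel weight (input dim split)

# Reference: unsharded computation on a single device.
reference = (x @ w_col) @ w_row

# Shard: each rank gets a slice of w_col's columns and the matching rows of w_row.
col_shards = np.split(w_col, n_gpus, axis=1)
row_shards = np.split(w_row, n_gpus, axis=0)

# Each rank works locally; the split output of the first layer feeds the
# second layer directly, with no communication in between.
partials = [(x @ wc) @ wr for wc, wr in zip(col_shards, row_shards)]

# The AllReduce is modeled here as an elementwise sum of the partial outputs.
all_reduced = np.sum(partials, axis=0)

print(np.allclose(all_reduced, reference))
```

<p>Only one reduction is needed for the pair of layers, which is why the two layer types are used together.</p>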

<h3 id="weight-loading--from-disk-to-device">Weight Loading — From Disk to Device</h3>
<p>One of the trickier parts of native integration: implementing <code>load_weights(self, weights)</code>. This method receives an iterator of <code>(name, tensor)</code> pairs from AutoWeightsLoader and must map checkpoint weights into the model&rsquo;s parameters.</p>
<p>The <strong>parameter mismatch problem</strong> is why this isn&rsquo;t trivial. vLLM often fuses layers that are separate in the Hugging Face checkpoint. For example, a standard Llama MLP has separate <code>gate_proj</code> and <code>up_proj</code> linear layers. In vLLM, these become a single <code>gate_up_proj</code> to reduce kernel launches. The <code>load_weights</code> logic must handle this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">load_weights</span>(self, weights):
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Stacking mapping: which HF weight loads into which shard of which fused vLLM param</span>
</span></span><span style="display:flex;"><span>    stacked_params <span style="color:#f92672">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;gate_proj&#34;</span>: (<span style="color:#e6db74">&#34;gate_up_proj&#34;</span>, <span style="color:#ae81ff">0</span>),  <span style="color:#75715e"># goes into first half</span>
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;up_proj&#34;</span>:   (<span style="color:#e6db74">&#34;gate_up_proj&#34;</span>, <span style="color:#ae81ff">1</span>),  <span style="color:#75715e"># goes into second half</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># named_parameters() yields the live Parameter objects, which carry weight_loader</span>
</span></span><span style="display:flex;"><span>    params <span style="color:#f92672">=</span> dict(self<span style="color:#f92672">.</span>named_parameters())
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> name, loaded_weight <span style="color:#f92672">in</span> weights:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">for</span> hf_key, (fused_key, shard_id) <span style="color:#f92672">in</span> stacked_params<span style="color:#f92672">.</span>items():
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> hf_key <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> name:
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">continue</span>
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Load into the correct slice of the fused parameter</span>
</span></span><span style="display:flex;"><span>            param <span style="color:#f92672">=</span> params[name<span style="color:#f92672">.</span>replace(hf_key, fused_key)]
</span></span><span style="display:flex;"><span>            param<span style="color:#f92672">.</span>weight_loader(param, loaded_weight, shard_id)
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">break</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">else</span>:  <span style="color:#75715e"># no fused mapping matched: names line up one-to-one</span>
</span></span><span style="display:flex;"><span>            param <span style="color:#f92672">=</span> params[name]
</span></span><span style="display:flex;"><span>            param<span style="color:#f92672">.</span>data<span style="color:#f92672">.</span>copy_(loaded_weight)
</span></span></code></pre></div><p>Two utilities make this process more manageable:</p>
<p><strong>AutoWeightsLoader</strong> abstracts away the routing of weights to child modules. It recursively discovers sub-modules that have their own <code>load_weights</code> methods and delegates the appropriate <code>(name, tensor)</code> pairs to each one, so the top-level model doesn&rsquo;t need to manually dispatch weights. The upstream shard iteration (walking from <code>model-00001-of-00005.safetensors</code> to <code>model-00005-of-00005.safetensors</code> and presenting a unified <code>(name, tensor)</code> stream) happens in vLLM&rsquo;s weight loading utilities (<code>weight_utils.py</code>), which feed into <code>AutoWeightsLoader</code>.</p>
<p><strong>WeightsMapper</strong> provides declarative renaming rules. Instead of writing string manipulation inside <code>load_weights</code>, you define a mapping:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>mapper <span style="color:#f92672">=</span> WeightsMapper(orig_to_new_prefix<span style="color:#f92672">=</span>{
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;model.decoder.layers.&#34;</span>: <span style="color:#e6db74">&#34;model.layers.&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;norm.weight&#34;</span>: <span style="color:#e6db74">&#34;model.norm.weight&#34;</span>
</span></span><span style="display:flex;"><span>})
</span></span></code></pre></div><p>The loader applies these rules on the fly, letting the vLLM model structure diverge from the Hugging Face structure while maintaining compatibility with official checkpoints.</p>
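<p>To make the renaming step concrete, here is a plain-Python sketch of what a prefix rule does to checkpoint names. It is not vLLM&rsquo;s implementation: <code>rename_by_prefix</code> is a hypothetical helper written for this example, and the checkpoint names are invented.</p>

```python
def rename_by_prefix(name, prefix_map):
    """Hypothetical helper: rewrite the first matching prefix, else return name unchanged."""
    for old, new in prefix_map.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name


prefix_map = {
    "model.decoder.layers.": "model.layers.",
    "norm.weight": "model.norm.weight",
}

# Invented checkpoint names, used only to exercise the rules above.
checkpoint_names = [
    "model.decoder.layers.0.self_attn.q_proj.weight",
    "norm.weight",
    "lm_head.weight",
]

renamed = [rename_by_prefix(n, prefix_map) for n in checkpoint_names]
for old, new in zip(checkpoint_names, renamed):
    print(old, "=", new)
```

<p>Names that match no rule pass through untouched, so only the parts of the structure that actually diverge need an entry.</p>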
<p>For quantized models, weight loading has an additional layer of complexity. In 4-bit quantization schemes like AWQ, eight 4-bit weights are packed into a single <code>int32</code>. The loader must recognize that the destination parameter is quantized and load the packed tensor directly, without casting it to float16 first. If the config specifies quantization, the linear layer initializes a specialized &ldquo;quantized parameter&rdquo; object that overrides the default loading behavior.</p>
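<p>The packing arithmetic itself is simple to sketch. The example below packs eight 4-bit values into one 32-bit word and unpacks them again. It is illustrative only: the nibble order here is plain sequential (value <em>i</em> in bits 4<em>i</em> to 4<em>i</em>+3), which is an assumption and may not match the order a real AWQ kernel expects, and it uses <code>uint32</code> rather than <code>int32</code> to keep the sign bit out of the way. The function names are hypothetical.</p>

```python
import numpy as np


def pack_int4(values):
    """Pack groups of eight 4-bit values (0..15) into one 32-bit word each."""
    values = np.asarray(values, dtype=np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4  # value i lands at bits 4i..4i+3
    return np.bitwise_or.reduce(np.left_shift(values, shifts), axis=1)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the flat sequence of 4-bit values."""
    packed = np.asarray(packed, dtype=np.uint32)
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = np.bitwise_and(np.right_shift(packed[:, None], shifts), 0xF)
    return nibbles.reshape(-1)


weights = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.uint32)
packed = pack_int4(weights)
restored = unpack_int4(packed)

print(hex(int(packed[0])))
print(restored.tolist())
```

<p>A packed tensor is one eighth the length of the unpacked one, which is why a loader that cast it to float16 or reshaped it as if it held one weight per element would corrupt the checkpoint.</p>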
<hr>
<h2 id="part-4-the-execution-core">Part 4: The Execution Core</h2>
<h3 id="the-attention-switchboard-forwardcontext">The Attention Switchboard (ForwardContext)</h3>
<p>In standard PyTorch, an Attention module is self-contained — it receives Q, K, V and computes the output. In vLLM, the Attention layer acts as a client to a global context. When the model executes a forward pass, a <code>ForwardContext</code> is established containing the <code>AttentionMetadata</code> generated by the scheduler.</p>
<p>The AttentionMetadata object is essentially a page table for the KV cache. Here&rsquo;s what it contains, with concrete values for a batch of 3 requests:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># AttentionMetadata for a batch with 3 requests:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Request A: 128 tokens, KV spread across blocks [4, 17, 23, 8]</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Request B: 64 tokens, KV in blocks [1, 12]</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">#   Request C: 256 tokens, KV in blocks [0, 5, 9, 14, 22, 31, 7, 19]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>block_tables <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    [<span style="color:#ae81ff">4</span>, <span style="color:#ae81ff">17</span>, <span style="color:#ae81ff">23</span>, <span style="color:#ae81ff">8</span>, <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>],   <span style="color:#75715e"># Request A (padded to max_blocks)</span>
</span></span><span style="display:flex;"><span>    [<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">12</span>, <span style="color:#ae81ff">0</span>,  <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>],   <span style="color:#75715e"># Request B</span>
</span></span><span style="display:flex;"><span>    [<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">5</span>,  <span style="color:#ae81ff">9</span>, <span style="color:#ae81ff">14</span>, <span style="color:#ae81ff">22</span>, <span style="color:#ae81ff">31</span>, <span style="color:#ae81ff">7</span>, <span style="color:#ae81ff">19</span>], <span style="color:#75715e"># Request C</span>
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span><span style="color:#75715e"># shape: [3, 8] — each entry is a physical block index in GPU memory</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>slot_mapping <span style="color:#f92672">=</span> [<span style="color:#ae81ff">287</span>, <span style="color:#ae81ff">415</span>, <span style="color:#ae81ff">639</span>]
</span></span><span style="display:flex;"><span><span style="color:#75715e"># For decode: maps each new token to its physical slot (block_idx * block_size + offset), e.g. 8 * 32 + 31 = 287 with block_size = 32</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>context_lens <span style="color:#f92672">=</span> [<span style="color:#ae81ff">128</span>, <span style="color:#ae81ff">64</span>, <span style="color:#ae81ff">256</span>]
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Sequence length per request, for correct attention masking</span>
</span></span></code></pre></div><p>The attention layer&rsquo;s <code>forward()</code> does not compute softmax(QK^T) V directly. Instead, it dispatches to different backends depending on the phase:</p>
<ul>
<li><strong>Prefill</strong> (processing prompts) → FlashAttention variant, optimized for parallel computation over many tokens. All Q, K, V tokens are known upfront, so we can exploit parallelism across the sequence.</li>
<li><strong>Decode</strong> (generating tokens) → PagedAttention kernel. The query is a single new token. The kernel uses <code>block_tables</code> to <em>gather</em> K and V vectors from non-contiguous physical blocks, compute attention scores, and scatter the result. This is the operation that makes <a href="/posts/flash-attention/">virtual memory for KV cache</a> work: tokens don&rsquo;t need to be stored contiguously.</li>
</ul>
<p>The split between prefill and decode backends is important for performance. Prefill is compute-bound (large matrix multiplications), so FlashAttention&rsquo;s tiling strategy works well. Decode is memory-bound (loading many cached K/V vectors for a single query), so PagedAttention&rsquo;s gather-based approach is the right fit.</p>
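<p>The decode-side gather can be sketched in a few lines of NumPy. This is a toy, single-head reference, not the PagedAttention kernel: a real kernel reads the blocks in place instead of materializing a contiguous copy. The block size, shapes, and the function name <code>paged_decode_attention</code> are assumptions chosen for the example.</p>

```python
import numpy as np

block_size, head_dim, num_blocks = 4, 8, 6
rng = np.random.default_rng(0)

# Physical KV cache: [num_blocks, block_size, head_dim]
k_cache = rng.standard_normal((num_blocks, block_size, head_dim))
v_cache = rng.standard_normal((num_blocks, block_size, head_dim))


def paged_decode_attention(q, block_table, context_len):
    """Attention for one query token over KV stored in non-contiguous blocks."""
    n_blocks = -(-context_len // block_size)  # ceiling division
    blocks = block_table[:n_blocks]
    # Gather: index the physical blocks in logical order, then trim the tail.
    k = k_cache[blocks].reshape(-1, head_dim)[:context_len]
    v = v_cache[blocks].reshape(-1, head_dim)[:context_len]
    scores = k @ q / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ v


# Logical block i of this sequence lives in physical block block_table[i].
block_table = np.array([5, 1, 3])
context_len = 10                              # two full blocks plus two tokens
q = rng.standard_normal(head_dim)
out = paged_decode_attention(q, block_table, context_len)

# Physical slot of the token at logical position p.
p = context_len - 1
slot = int(block_table[p // block_size] * block_size + p % block_size)
print(out.shape, slot)
```

<p>The same block-table lookup drives both halves of decode: <code>slot</code> says where the new token&rsquo;s K and V are written, and the gather says where the earlier ones are read from.</p>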




<style>
.fp-5590d392f5ce8b7f84eedffa089138fd {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: linear-gradient(180deg, #0f172a 0%, #1e293b 100%);
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #e2e8f0;
  max-width: 680px;
  margin-left: auto;
  margin-right: auto;
}

.fp-5590d392f5ce8b7f84eedffa089138fd * {
  box-sizing: border-box;
}

.fp-header-5590d392f5ce8b7f84eedffa089138fd {
  text-align: center;
  margin-bottom: 1.5rem;
}

.fp-title-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.5rem;
}

.fp-subtitle-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.9rem;
  color: #94a3b8;
  max-width: 550px;
  margin: 0 auto;
}

 
.fp-stage-5590d392f5ce8b7f84eedffa089138fd {
  border-radius: 12px;
  padding: 1rem 1.25rem;
  border: 2px solid;
  cursor: pointer;
  transition: all 0.25s ease;
  display: flex;
  align-items: center;
  gap: 0.75rem;
}

.fp-stage-5590d392f5ce8b7f84eedffa089138fd:hover {
  transform: translateY(-1px);
}

.fp-stage-5590d392f5ce8b7f84eedffa089138fd.selected {
  transform: scale(1.02);
}

 
.fp-stage-blue-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(59, 130, 246, 0.1);
  border-color: rgba(59, 130, 246, 0.4);
}
.fp-stage-blue-5590d392f5ce8b7f84eedffa089138fd.selected {
  border-color: #3b82f6;
  box-shadow: 0 0 20px rgba(59, 130, 246, 0.35);
}

.fp-stage-pink-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(236, 72, 153, 0.1);
  border-color: rgba(236, 72, 153, 0.4);
}
.fp-stage-pink-5590d392f5ce8b7f84eedffa089138fd.selected {
  border-color: #ec4899;
  box-shadow: 0 0 20px rgba(236, 72, 153, 0.35);
}

.fp-stage-cyan-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(6, 182, 212, 0.1);
  border-color: rgba(6, 182, 212, 0.4);
}
.fp-stage-cyan-5590d392f5ce8b7f84eedffa089138fd.selected {
  border-color: #06b6d4;
  box-shadow: 0 0 20px rgba(6, 182, 212, 0.35);
}

 
.fp-stage-num-5590d392f5ce8b7f84eedffa089138fd {
  width: 28px;
  height: 28px;
  border-radius: 50%;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.75rem;
  font-weight: 700;
  flex-shrink: 0;
}

.fp-stage-blue-5590d392f5ce8b7f84eedffa089138fd .fp-stage-num-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(59, 130, 246, 0.25);
  color: #60a5fa;
}
.fp-stage-pink-5590d392f5ce8b7f84eedffa089138fd .fp-stage-num-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(236, 72, 153, 0.25);
  color: #f472b6;
}
.fp-stage-cyan-5590d392f5ce8b7f84eedffa089138fd .fp-stage-num-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(6, 182, 212, 0.25);
  color: #22d3ee;
}

.fp-stage-body-5590d392f5ce8b7f84eedffa089138fd {
  flex: 1;
  min-width: 0;
}

.fp-stage-name-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.9rem;
  font-weight: 700;
  color: #f1f5f9;
  margin-bottom: 0.15rem;
}

.fp-stage-badge-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  padding: 0.15rem 0.4rem;
  border-radius: 4px;
  font-weight: 600;
  background: rgba(255, 255, 255, 0.08);
  color: #94a3b8;
}

 
.fp-connector-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  flex-direction: column;
  align-items: center;
  padding: 0.35rem 0;
}

.fp-connector-line-5590d392f5ce8b7f84eedffa089138fd {
  width: 2px;
  height: 16px;
  background: #475569;
}

.fp-connector-label-5590d392f5ce8b7f84eedffa089138fd {
  font-family: 'SF Mono', 'Fira Code', monospace;
  font-size: 0.6rem;
  color: #64748b;
  padding: 0.15rem 0;
}

.fp-connector-arrow-5590d392f5ce8b7f84eedffa089138fd {
  color: #475569;
  font-size: 0.6rem;
  line-height: 1;
}

 
.fp-block-5590d392f5ce8b7f84eedffa089138fd {
  border: 2px dashed #475569;
  border-radius: 14px;
  padding: 1rem;
  cursor: pointer;
  transition: all 0.25s ease;
}

.fp-block-5590d392f5ce8b7f84eedffa089138fd.selected {
  border-color: #64748b;
  box-shadow: 0 0 16px rgba(100, 116, 139, 0.2);
}

.fp-block-header-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  justify-content: space-between;
  margin-bottom: 0.75rem;
}

.fp-block-title-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.85rem;
  font-weight: 700;
  color: #e2e8f0;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.fp-block-badge-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  padding: 0.2rem 0.5rem;
  border-radius: 4px;
  font-weight: 600;
  background: rgba(100, 116, 139, 0.2);
  color: #94a3b8;
}

 
.fp-substage-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  gap: 0.6rem;
  padding: 0.5rem 0.65rem;
  border-radius: 8px;
  cursor: pointer;
  transition: all 0.2s ease;
  background: rgba(168, 85, 247, 0.06);
  border: 1px solid rgba(168, 85, 247, 0.15);
  margin-bottom: 0.35rem;
}

.fp-substage-5590d392f5ce8b7f84eedffa089138fd:hover {
  background: rgba(168, 85, 247, 0.12);
  border-color: rgba(168, 85, 247, 0.3);
}

.fp-substage-5590d392f5ce8b7f84eedffa089138fd.selected {
  background: rgba(168, 85, 247, 0.15);
  border-color: rgba(168, 85, 247, 0.5);
  box-shadow: 0 0 14px rgba(168, 85, 247, 0.2);
}

.fp-substage-label-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  font-weight: 700;
  color: #c084fc;
  flex-shrink: 0;
  width: 22px;
}

.fp-substage-name-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.78rem;
  font-weight: 600;
  color: #e2e8f0;
  flex: 1;
}

.fp-substage-detail-5590d392f5ce8b7f84eedffa089138fd {
  font-family: 'SF Mono', 'Fira Code', monospace;
  font-size: 0.6rem;
  color: #64748b;
  flex-shrink: 0;
}

 
.fp-dispatch-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  gap: 0.5rem;
  margin-bottom: 0.35rem;
  cursor: pointer;
}

.fp-dispatch-branch-5590d392f5ce8b7f84eedffa089138fd {
  flex: 1;
  padding: 0.55rem 0.5rem;
  border-radius: 8px;
  text-align: center;
  cursor: pointer;
  transition: all 0.2s ease;
  border: 1px solid;
}

.fp-dispatch-branch-5590d392f5ce8b7f84eedffa089138fd:hover {
  transform: translateY(-1px);
}

.fp-dispatch-branch-5590d392f5ce8b7f84eedffa089138fd.selected {
  transform: scale(1.03);
}

.fp-branch-prefill-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(34, 197, 94, 0.08);
  border-color: rgba(34, 197, 94, 0.3);
}
.fp-branch-prefill-5590d392f5ce8b7f84eedffa089138fd.selected {
  background: rgba(34, 197, 94, 0.15);
  border-color: #22c55e;
  box-shadow: 0 0 14px rgba(34, 197, 94, 0.25);
}

.fp-branch-decode-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(249, 115, 22, 0.08);
  border-color: rgba(249, 115, 22, 0.3);
}
.fp-branch-decode-5590d392f5ce8b7f84eedffa089138fd.selected {
  background: rgba(249, 115, 22, 0.15);
  border-color: #f97316;
  box-shadow: 0 0 14px rgba(249, 115, 22, 0.25);
}

.fp-branch-label-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  font-weight: 700;
  text-transform: uppercase;
  letter-spacing: 0.03em;
  margin-bottom: 0.2rem;
}

.fp-branch-prefill-5590d392f5ce8b7f84eedffa089138fd .fp-branch-label-5590d392f5ce8b7f84eedffa089138fd { color: #4ade80; }
.fp-branch-decode-5590d392f5ce8b7f84eedffa089138fd .fp-branch-label-5590d392f5ce8b7f84eedffa089138fd { color: #fb923c; }

.fp-branch-name-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.72rem;
  font-weight: 600;
  color: #e2e8f0;
}

.fp-branch-hint-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.58rem;
  color: #64748b;
  margin-top: 0.15rem;
}

.fp-dispatch-or-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  font-size: 0.6rem;
  font-weight: 700;
  color: #475569;
  flex-shrink: 0;
}

 
.fp-allreduce-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  justify-content: center;
  gap: 0.4rem;
  padding: 0.4rem 0.85rem;
  border-radius: 20px;
  background: rgba(245, 158, 11, 0.12);
  border: 1.5px solid rgba(245, 158, 11, 0.4);
  cursor: pointer;
  transition: all 0.2s ease;
  margin: 0.35rem auto;
  width: fit-content;
  animation: fp-allreduce-pulse-5590d392f5ce8b7f84eedffa089138fd 3s ease-in-out infinite;
}

.fp-allreduce-5590d392f5ce8b7f84eedffa089138fd:hover {
  background: rgba(245, 158, 11, 0.2);
  border-color: rgba(245, 158, 11, 0.6);
}

.fp-allreduce-5590d392f5ce8b7f84eedffa089138fd.selected {
  background: rgba(245, 158, 11, 0.2);
  border-color: #f59e0b;
  animation: none;
  box-shadow: 0 0 16px rgba(245, 158, 11, 0.4);
}

@keyframes fp-allreduce-pulse-5590d392f5ce8b7f84eedffa089138fd {
  0%, 100% { box-shadow: 0 0 4px rgba(245, 158, 11, 0.15); }
  50% { box-shadow: 0 0 12px rgba(245, 158, 11, 0.35); }
}

.fp-allreduce-icon-5590d392f5ce8b7f84eedffa089138fd {
  color: #fbbf24;
  flex-shrink: 0;
}

.fp-allreduce-text-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.7rem;
  font-weight: 700;
  color: #fbbf24;
}

 
.fp-inner-conn-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  justify-content: center;
  padding: 0.15rem 0;
  color: #475569;
  font-size: 0.55rem;
}

 
.fp-info-5590d392f5ce8b7f84eedffa089138fd {
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 12px;
  padding: 1.25rem;
  margin-top: 1.25rem;
}

.fp-info-placeholder-5590d392f5ce8b7f84eedffa089138fd {
  text-align: center;
  color: #64748b;
  padding: 1.5rem 1rem;
}

.fp-info-placeholder-5590d392f5ce8b7f84eedffa089138fd svg {
  width: 32px;
  height: 32px;
  margin-bottom: 0.5rem;
  opacity: 0.5;
}

.fp-info-placeholder-5590d392f5ce8b7f84eedffa089138fd p {
  font-size: 0.8rem;
  margin: 0;
}

.fp-info-content-5590d392f5ce8b7f84eedffa089138fd {
  display: none;
}

.fp-info-content-5590d392f5ce8b7f84eedffa089138fd.active {
  display: block;
  animation: fp-fade-in-5590d392f5ce8b7f84eedffa089138fd 0.3s ease;
}

@keyframes fp-fade-in-5590d392f5ce8b7f84eedffa089138fd {
  from { opacity: 0; transform: translateY(5px); }
  to { opacity: 1; transform: translateY(0); }
}

.fp-info-header-5590d392f5ce8b7f84eedffa089138fd {
  display: flex;
  align-items: center;
  gap: 0.75rem;
  margin-bottom: 0.75rem;
}

.fp-info-icon-5590d392f5ce8b7f84eedffa089138fd {
  width: 36px;
  height: 36px;
  border-radius: 8px;
  display: flex;
  align-items: center;
  justify-content: center;
  flex-shrink: 0;
}

.fp-info-icon-5590d392f5ce8b7f84eedffa089138fd.blue { background: rgba(59, 130, 246, 0.2); color: #60a5fa; }
.fp-info-icon-5590d392f5ce8b7f84eedffa089138fd.pink { background: rgba(236, 72, 153, 0.2); color: #f472b6; }
.fp-info-icon-5590d392f5ce8b7f84eedffa089138fd.cyan { background: rgba(6, 182, 212, 0.2); color: #22d3ee; }
.fp-info-icon-5590d392f5ce8b7f84eedffa089138fd.purple { background: rgba(168, 85, 247, 0.2); color: #c084fc; }
.fp-info-icon-5590d392f5ce8b7f84eedffa089138fd.green { background: rgba(34, 197, 94, 0.2); color: #4ade80; }
.fp-info-icon-5590d392f5ce8b7f84eedffa089138fd.orange { background: rgba(249, 115, 22, 0.2); color: #fb923c; }
.fp-info-icon-5590d392f5ce8b7f84eedffa089138fd.amber { background: rgba(245, 158, 11, 0.2); color: #fbbf24; }

.fp-info-title-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 1rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.15rem;
}

.fp-info-badge-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.6rem;
  padding: 0.2rem 0.4rem;
  border-radius: 4px;
  font-weight: 600;
  text-transform: uppercase;
}

.fp-info-badge-5590d392f5ce8b7f84eedffa089138fd.blue { background: rgba(59, 130, 246, 0.2); color: #60a5fa; }
.fp-info-badge-5590d392f5ce8b7f84eedffa089138fd.pink { background: rgba(236, 72, 153, 0.2); color: #f472b6; }
.fp-info-badge-5590d392f5ce8b7f84eedffa089138fd.cyan { background: rgba(6, 182, 212, 0.2); color: #22d3ee; }
.fp-info-badge-5590d392f5ce8b7f84eedffa089138fd.purple { background: rgba(168, 85, 247, 0.2); color: #c084fc; }
.fp-info-badge-5590d392f5ce8b7f84eedffa089138fd.green { background: rgba(34, 197, 94, 0.2); color: #4ade80; }
.fp-info-badge-5590d392f5ce8b7f84eedffa089138fd.orange { background: rgba(249, 115, 22, 0.2); color: #fb923c; }
.fp-info-badge-5590d392f5ce8b7f84eedffa089138fd.amber { background: rgba(245, 158, 11, 0.2); color: #fbbf24; }

.fp-info-desc-5590d392f5ce8b7f84eedffa089138fd {
  font-size: 0.8rem;
  color: #cbd5e1;
  line-height: 1.6;
  margin-bottom: 0.75rem;
}

.fp-info-props-5590d392f5ce8b7f84eedffa089138fd h4 {
  font-size: 0.7rem;
  color: #64748b;
  margin: 0 0 0.4rem 0;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

.fp-info-props-5590d392f5ce8b7f84eedffa089138fd ul {
  margin: 0;
  padding: 0;
  list-style: none;
}

.fp-info-props-5590d392f5ce8b7f84eedffa089138fd li {
  font-size: 0.75rem;
  color: #94a3b8;
  padding: 0.25rem 0;
  padding-left: 1rem;
  position: relative;
}

.fp-info-props-5590d392f5ce8b7f84eedffa089138fd li::before {
  content: '\2022';
  position: absolute;
  left: 0;
  color: #64748b;
}

.fp-insight-5590d392f5ce8b7f84eedffa089138fd {
  margin-top: 0.6rem;
  padding: 0.6rem 0.75rem;
  background: rgba(245, 158, 11, 0.1);
  border: 1px solid rgba(245, 158, 11, 0.3);
  border-radius: 8px;
}

.fp-insight-5590d392f5ce8b7f84eedffa089138fd p {
  font-size: 0.75rem;
  color: #fbbf24;
  margin: 0;
  line-height: 1.5;
}

.fp-insight-5590d392f5ce8b7f84eedffa089138fd strong {
  color: #fcd34d;
}

 
.fp-footer-5590d392f5ce8b7f84eedffa089138fd {
  margin-top: 1.25rem;
  padding: 1rem;
  background: rgba(245, 158, 11, 0.1);
  border: 1px solid rgba(245, 158, 11, 0.3);
  border-radius: 10px;
  text-align: center;
}

.fp-footer-5590d392f5ce8b7f84eedffa089138fd p {
  font-size: 0.8rem;
  color: #fbbf24;
  margin: 0;
  line-height: 1.5;
}

.fp-footer-5590d392f5ce8b7f84eedffa089138fd strong {
  color: #fcd34d;
}

 
@media (max-width: 600px) {
  .fp-5590d392f5ce8b7f84eedffa089138fd {
    padding: 1.25rem;
    max-width: none;
  }

  .fp-title-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 1.2rem;
  }

  .fp-subtitle-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.8rem;
  }

  .fp-stage-name-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.8rem;
  }

  .fp-substage-name-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.7rem;
  }

  .fp-substage-detail-5590d392f5ce8b7f84eedffa089138fd {
    display: none;
  }

  .fp-dispatch-5590d392f5ce8b7f84eedffa089138fd {
    flex-direction: column;
  }

  .fp-dispatch-or-5590d392f5ce8b7f84eedffa089138fd {
    justify-content: center;
    padding: 0.15rem 0;
  }

  .fp-branch-name-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.65rem;
  }

  .fp-allreduce-text-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.62rem;
  }

  .fp-block-title-5590d392f5ce8b7f84eedffa089138fd {
    font-size: 0.78rem;
  }
}
</style>

<div class="fp-5590d392f5ce8b7f84eedffa089138fd">
  <div class="fp-header-5590d392f5ce8b7f84eedffa089138fd">
    <div class="fp-title-5590d392f5ce8b7f84eedffa089138fd">Forward Pass Pipeline</div>
    <div class="fp-subtitle-5590d392f5ce8b7f84eedffa089138fd">Step-by-step flow of a single forward pass through a vLLM native model, with AllReduce sync points highlighted</div>
  </div>

  
  <div class="fp-pipeline-5590d392f5ce8b7f84eedffa089138fd">

    
    <div class="fp-stage-5590d392f5ce8b7f84eedffa089138fd fp-stage-blue-5590d392f5ce8b7f84eedffa089138fd" data-stage="scheduler">
      <div class="fp-stage-num-5590d392f5ce8b7f84eedffa089138fd">1</div>
      <div class="fp-stage-body-5590d392f5ce8b7f84eedffa089138fd">
        <div class="fp-stage-name-5590d392f5ce8b7f84eedffa089138fd">Scheduler</div>
        <span class="fp-stage-badge-5590d392f5ce8b7f84eedffa089138fd">Control Plane</span>
      </div>
    </div>

    
    <div class="fp-connector-5590d392f5ce8b7f84eedffa089138fd">
      <div class="fp-connector-line-5590d392f5ce8b7f84eedffa089138fd"></div>
      <div class="fp-connector-label-5590d392f5ce8b7f84eedffa089138fd">AttentionMetadata</div>
      <div class="fp-connector-arrow-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>
    </div>

    
    <div class="fp-stage-5590d392f5ce8b7f84eedffa089138fd fp-stage-pink-5590d392f5ce8b7f84eedffa089138fd" data-stage="model-runner">
      <div class="fp-stage-num-5590d392f5ce8b7f84eedffa089138fd">2</div>
      <div class="fp-stage-body-5590d392f5ce8b7f84eedffa089138fd">
        <div class="fp-stage-name-5590d392f5ce8b7f84eedffa089138fd">ModelRunner</div>
        <span class="fp-stage-badge-5590d392f5ce8b7f84eedffa089138fd">Input Prep</span>
      </div>
    </div>

    
    <div class="fp-connector-5590d392f5ce8b7f84eedffa089138fd">
      <div class="fp-connector-line-5590d392f5ce8b7f84eedffa089138fd"></div>
      <div class="fp-connector-label-5590d392f5ce8b7f84eedffa089138fd">input_ids, positions</div>
      <div class="fp-connector-arrow-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>
    </div>

    
    <div class="fp-stage-5590d392f5ce8b7f84eedffa089138fd fp-stage-cyan-5590d392f5ce8b7f84eedffa089138fd" data-stage="embedding">
      <div class="fp-stage-num-5590d392f5ce8b7f84eedffa089138fd">3</div>
      <div class="fp-stage-body-5590d392f5ce8b7f84eedffa089138fd">
        <div class="fp-stage-name-5590d392f5ce8b7f84eedffa089138fd">Embedding Layer</div>
        <span class="fp-stage-badge-5590d392f5ce8b7f84eedffa089138fd">VocabParallelEmbedding</span>
      </div>
    </div>

    
    <div class="fp-connector-5590d392f5ce8b7f84eedffa089138fd">
      <div class="fp-connector-line-5590d392f5ce8b7f84eedffa089138fd"></div>
      <div class="fp-connector-label-5590d392f5ce8b7f84eedffa089138fd">hidden_states</div>
      <div class="fp-connector-arrow-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>
    </div>

    
    <div class="fp-block-5590d392f5ce8b7f84eedffa089138fd" data-stage="transformer-block">
      <div class="fp-block-header-5590d392f5ce8b7f84eedffa089138fd">
        <div class="fp-block-title-5590d392f5ce8b7f84eedffa089138fd">
          <svg width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <path d="M12 2L2 7l10 5 10-5-10-5z"/>
            <path d="M2 17l10 5 10-5"/>
            <path d="M2 12l10 5 10-5"/>
          </svg>
          Transformer Block
        </div>
        <span class="fp-block-badge-5590d392f5ce8b7f84eedffa089138fd">× N Layers</span>
      </div>

      
      <div class="fp-substage-5590d392f5ce8b7f84eedffa089138fd" data-stage="layer-norm">
        <span class="fp-substage-label-5590d392f5ce8b7f84eedffa089138fd">4a</span>
        <span class="fp-substage-name-5590d392f5ce8b7f84eedffa089138fd">RMSNorm</span>
        <span class="fp-substage-detail-5590d392f5ce8b7f84eedffa089138fd">fused kernel</span>
      </div>

      <div class="fp-inner-conn-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>

      
      <div class="fp-substage-5590d392f5ce8b7f84eedffa089138fd" data-stage="qkv-proj">
        <span class="fp-substage-label-5590d392f5ce8b7f84eedffa089138fd">4b</span>
        <span class="fp-substage-name-5590d392f5ce8b7f84eedffa089138fd">QKV Projection</span>
        <span class="fp-substage-detail-5590d392f5ce8b7f84eedffa089138fd">ColumnParallelLinear</span>
      </div>

      <div class="fp-inner-conn-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>

      
      <div class="fp-dispatch-5590d392f5ce8b7f84eedffa089138fd" data-stage="attn-dispatch">
        <div class="fp-dispatch-branch-5590d392f5ce8b7f84eedffa089138fd fp-branch-prefill-5590d392f5ce8b7f84eedffa089138fd" data-stage="prefill">
          <div class="fp-branch-label-5590d392f5ce8b7f84eedffa089138fd">Prefill</div>
          <div class="fp-branch-name-5590d392f5ce8b7f84eedffa089138fd">FlashAttention</div>
          <div class="fp-branch-hint-5590d392f5ce8b7f84eedffa089138fd">Compute-bound</div>
        </div>
        <div class="fp-dispatch-or-5590d392f5ce8b7f84eedffa089138fd">or</div>
        <div class="fp-dispatch-branch-5590d392f5ce8b7f84eedffa089138fd fp-branch-decode-5590d392f5ce8b7f84eedffa089138fd" data-stage="decode">
          <div class="fp-branch-label-5590d392f5ce8b7f84eedffa089138fd">Decode</div>
          <div class="fp-branch-name-5590d392f5ce8b7f84eedffa089138fd">PagedAttention</div>
          <div class="fp-branch-hint-5590d392f5ce8b7f84eedffa089138fd">Memory-bound</div>
        </div>
      </div>

      <div class="fp-inner-conn-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>

      
      <div class="fp-substage-5590d392f5ce8b7f84eedffa089138fd" data-stage="output-proj">
        <span class="fp-substage-label-5590d392f5ce8b7f84eedffa089138fd">4d</span>
        <span class="fp-substage-name-5590d392f5ce8b7f84eedffa089138fd">Output Projection</span>
        <span class="fp-substage-detail-5590d392f5ce8b7f84eedffa089138fd">RowParallelLinear</span>
      </div>

      
      <div class="fp-allreduce-5590d392f5ce8b7f84eedffa089138fd" data-stage="allreduce-attn">
        <svg class="fp-allreduce-icon-5590d392f5ce8b7f84eedffa089138fd" width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2.5">
          <path d="M13 2L3 14h9l-1 8 10-12h-9l1-8z"/>
        </svg>
        <span class="fp-allreduce-text-5590d392f5ce8b7f84eedffa089138fd">AllReduce #1 — attention sync</span>
      </div>

      <div class="fp-inner-conn-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>

      
      <div class="fp-substage-5590d392f5ce8b7f84eedffa089138fd" data-stage="mlp">
        <span class="fp-substage-label-5590d392f5ce8b7f84eedffa089138fd">4e</span>
        <span class="fp-substage-name-5590d392f5ce8b7f84eedffa089138fd">MLP (SwiGLU)</span>
        <span class="fp-substage-detail-5590d392f5ce8b7f84eedffa089138fd">gate_up → down</span>
      </div>

      
      <div class="fp-allreduce-5590d392f5ce8b7f84eedffa089138fd" data-stage="allreduce-mlp">
        <svg class="fp-allreduce-icon-5590d392f5ce8b7f84eedffa089138fd" width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2.5">
          <path d="M13 2L3 14h9l-1 8 10-12h-9l1-8z"/>
        </svg>
        <span class="fp-allreduce-text-5590d392f5ce8b7f84eedffa089138fd">AllReduce #2 — MLP sync</span>
      </div>
    </div>

    
    <div class="fp-connector-5590d392f5ce8b7f84eedffa089138fd">
      <div class="fp-connector-arrow-5590d392f5ce8b7f84eedffa089138fd">&#x25BC;</div>
      <div class="fp-connector-label-5590d392f5ce8b7f84eedffa089138fd">× N layers</div>
      <div class="fp-connector-line-5590d392f5ce8b7f84eedffa089138fd"></div>
    </div>

    
    <div class="fp-stage-5590d392f5ce8b7f84eedffa089138fd fp-stage-cyan-5590d392f5ce8b7f84eedffa089138fd" data-stage="lm-head">
      <div class="fp-stage-num-5590d392f5ce8b7f84eedffa089138fd">5</div>
      <div class="fp-stage-body-5590d392f5ce8b7f84eedffa089138fd">
        <div class="fp-stage-name-5590d392f5ce8b7f84eedffa089138fd">Final RMSNorm → LM Head</div>
        <span class="fp-stage-badge-5590d392f5ce8b7f84eedffa089138fd">Output Logits</span>
      </div>
    </div>
  </div>

  
  <div class="fp-info-5590d392f5ce8b7f84eedffa089138fd" id="fp-info-5590d392f5ce8b7f84eedffa089138fd">
    <div class="fp-info-placeholder-5590d392f5ce8b7f84eedffa089138fd" id="fp-placeholder-5590d392f5ce8b7f84eedffa089138fd">
      <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5">
        <circle cx="12" cy="12" r="10"/>
        <path d="M12 16v-4M12 8h.01"/>
      </svg>
      <p>Click any pipeline stage to see details.</p>
    </div>
    <div class="fp-info-content-5590d392f5ce8b7f84eedffa089138fd" id="fp-info-content-5590d392f5ce8b7f84eedffa089138fd">
      <div class="fp-info-header-5590d392f5ce8b7f84eedffa089138fd">
        <div class="fp-info-icon-5590d392f5ce8b7f84eedffa089138fd" id="fp-info-icon-5590d392f5ce8b7f84eedffa089138fd">
          <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <rect x="3" y="3" width="18" height="18" rx="2"/>
          </svg>
        </div>
        <div>
          <div class="fp-info-title-5590d392f5ce8b7f84eedffa089138fd" id="fp-info-title-5590d392f5ce8b7f84eedffa089138fd">Stage</div>
          <span class="fp-info-badge-5590d392f5ce8b7f84eedffa089138fd" id="fp-info-badge-5590d392f5ce8b7f84eedffa089138fd">Category</span>
        </div>
      </div>
      <div class="fp-info-desc-5590d392f5ce8b7f84eedffa089138fd" id="fp-info-desc-5590d392f5ce8b7f84eedffa089138fd"></div>
      <div class="fp-info-props-5590d392f5ce8b7f84eedffa089138fd">
        <h4>Key Properties</h4>
        <ul id="fp-info-props-5590d392f5ce8b7f84eedffa089138fd"></ul>
      </div>
      <div class="fp-insight-5590d392f5ce8b7f84eedffa089138fd" id="fp-insight-5590d392f5ce8b7f84eedffa089138fd" style="display: none;">
        <p id="fp-insight-text-5590d392f5ce8b7f84eedffa089138fd"></p>
      </div>
    </div>
  </div>

  
  <div class="fp-footer-5590d392f5ce8b7f84eedffa089138fd">
    <p><strong>Exactly 2 AllReduce sync points per Transformer block</strong>: one after the attention output projection and one after the MLP down projection. These blocking collectives are the only cross-GPU communication inside a block, and a key performance bottleneck in tensor parallelism.</p>
  </div>
</div>

<script>
(function() {
  var id = '5590d392f5ce8b7f84eedffa089138fd';

   

  var stages = {
    'scheduler': {
      title: 'Scheduler',
      badgeClass: 'blue',
      badge: 'Control Plane',
      description: 'The Scheduler runs every iteration to decide which requests to process. It produces AttentionMetadata — block_tables, slot_mapping, context_lens — that tells the attention kernels exactly where to read and write KV cache entries. The Scheduler is completely model-agnostic.',
      props: ['Produces block_tables for PagedAttention', 'Computes slot_mapping for KV cache writes', 'Tracks context_lens per sequence', 'Decides prefill vs decode phase per request'],
      insight: 'The Scheduler never touches model weights or the forward pass. It only manages <strong>memory allocation</strong> and <strong>request scheduling</strong> — the same code runs for every model architecture.'
    },
    'model-runner': {
      title: 'ModelRunner',
      badgeClass: 'pink',
      badge: 'Input Prep',
      description: 'ModelRunner converts the Scheduler\'s logical request data into physical tensors. It flattens all request tokens into a 1D input_ids tensor, constructs position IDs, and packages everything the model\'s forward() method needs. This is where continuous batching becomes concrete.',
      props: ['Flattens variable-length sequences into 1D tensors', 'Constructs position IDs for rotary embeddings', 'Packages AttentionMetadata from Scheduler', 'Handles sampling parameter setup'],
      insight: 'ModelRunner is the <strong>bridge between scheduling and computation</strong>. It translates logical requests into the flat tensor format that makes vLLM\'s zero-padding approach work.'
    },
    'embedding': {
      title: 'Embedding Layer',
      badgeClass: 'cyan',
      badge: 'VocabParallelEmbedding',
      description: 'The embedding layer converts token IDs into dense vectors. In vLLM, this is VocabParallelEmbedding — the vocabulary table is sharded across GPUs for tensor parallelism. Each GPU looks up only its slice of the vocabulary and an AllReduce combines results.',
      props: ['Vocabulary table sharded across GPUs', 'Input: 1D input_ids [total_tokens]', 'Output: hidden_states [total_tokens, hidden_dim]', 'For RoPE models, positions are applied later inside attention, not here'],
    },
    'transformer-block': {
      title: 'Transformer Block',
      badgeClass: 'purple',
      badge: '× N Layers',
      description: 'Each Transformer block applies self-attention followed by a feed-forward network (MLP). In a Llama-style architecture, this means RMSNorm → QKV → Attention → Output Projection → AllReduce → MLP → AllReduce. The block is repeated N times (e.g., 126 for Llama 3.1 405B).',
      props: ['Pre-norm architecture (RMSNorm before attention)', 'Residual connections around attention and MLP', 'Two AllReduce sync points per block', 'Each block processes all tokens in the batch'],
      insight: 'For a 405B model with 126 layers and TP=8, there are <strong>252 AllReduce operations</strong> per forward pass inside the blocks (2 per block × 126 blocks). This is why AllReduce latency weighs so heavily on multi-GPU inference time.'
    },
    'layer-norm': {
      title: 'RMSNorm',
      badgeClass: 'purple',
      badge: 'Fused Kernel',
      description: 'Root Mean Square Layer Normalization, applied before attention (pre-norm architecture). vLLM uses a fused CUDA kernel that computes the normalization in a single pass, avoiding the overhead of separate mean and variance calculations.',
      props: ['Fused CUDA kernel — single pass', 'No mean subtraction (RMS only)', 'Applied per-token independently', 'No cross-GPU communication needed'],
    },
    'qkv-proj': {
      title: 'QKV Projection',
      badgeClass: 'purple',
      badge: 'ColumnParallelLinear',
      description: 'Projects hidden states into Query, Key, and Value vectors using ColumnParallelLinear. The output dimension is split across GPUs — each GPU computes its shard of Q, K, V heads. With GQA (Grouped Query Attention), K and V have fewer heads than Q.',
      props: ['ColumnParallelLinear shards output dim', 'Each GPU: Q_heads/TP + K_heads/TP + V_heads/TP, where TP is the tensor-parallel degree', 'Supports GQA (fewer K,V heads)', 'No AllReduce needed: output stays split'],
      insight: 'Column-parallel is the natural fit for QKV because each GPU can independently compute attention for its <strong>assigned heads</strong>. No communication until the output projection.'
    },
    'attn-dispatch': {
      title: 'Attention Dispatch',
      badgeClass: 'purple',
      badge: 'ForwardContext',
      description: 'The attention layer dispatches to different backends depending on the phase. ForwardContext (set up by ModelRunner) tells each layer whether the current batch is prefill or decode. The dispatch is transparent to the model implementation — it just calls the attention layer.',
      props: ['ForwardContext carries phase information', 'Prefill → FlashAttention (compute-bound)', 'Decode → PagedAttention (memory-bound)', 'Same API, different backend kernels'],
    },
    'prefill': {
      title: 'FlashAttention (Prefill)',
      badgeClass: 'green',
      badge: 'Compute-Bound',
      description: 'During prefill, all prompt tokens are processed at once. FlashAttention\'s tiling strategy divides the QK^T computation into blocks that fit in SRAM, achieving near-optimal arithmetic intensity. This is compute-bound — the GPU ALUs are the bottleneck, not memory bandwidth.',
      props: ['Processes all prompt tokens in parallel', 'Tiled computation fits in GPU SRAM', 'O(N) memory instead of O(N²) for attention', 'Arithmetic intensity maximized via tiling'],
      insight: 'Prefill processes hundreds or thousands of tokens at once, making the QK^T matrix multiplication <strong>large enough to saturate GPU compute</strong>. This is why prompts process quickly relative to their length.'
    },
    'decode': {
      title: 'PagedAttention (Decode)',
      badgeClass: 'orange',
      badge: 'Memory-Bound',
      description: 'During decode, each request generates one token at a time. PagedAttention uses block_tables to gather K and V vectors from non-contiguous physical blocks in the KV cache. The single query token attends to all cached keys — the bottleneck is loading all those K/V vectors from GPU memory.',
      props: ['Single query token per request', 'Gathers K/V from non-contiguous blocks via block_tables', 'Memory-bound: loading cached K/V dominates', 'Enables virtual memory for KV cache'],
      insight: 'Decode is <strong>memory-bandwidth limited</strong> because each new token must load the entire KV cache for that sequence. With long contexts (32K+ tokens), this becomes the dominant cost of generation.'
    },
    'output-proj': {
      title: 'Output Projection',
      badgeClass: 'purple',
      badge: 'RowParallelLinear',
      description: 'Projects the attention output back to hidden dimension using RowParallelLinear. Each GPU holds a slice of the input dimension (corresponding to its attention heads) and computes a partial result. The partial results must be summed via AllReduce.',
      props: ['RowParallelLinear shards input dim', 'Each GPU: partial attention output → partial hidden', 'Requires AllReduce to sum partial results', 'Paired with ColumnParallel QKV projection'],
    },
    'allreduce-attn': {
      title: 'AllReduce #1 — Attention',
      badgeClass: 'amber',
      badge: 'Blocking Collective',
      description: 'First AllReduce sync point in the block. After the output projection, each GPU holds a partial sum of the hidden state. AllReduce sums these partial results across all GPUs so every GPU has the complete hidden state. This is a blocking operation — all GPUs must participate and wait.',
      props: ['NCCL AllReduce (sum) across all TP ranks', 'Blocking: all GPUs wait for completion', 'Communication volume: hidden_dim × batch_tokens × dtype_size', 'Latency dominated by network bandwidth between GPUs'],
      insight: 'AllReduce is <strong>the only cross-GPU communication</strong> in the forward pass. With NVLink, inter-GPU bandwidth is ~600 GB/s, but with PCIe it drops to ~32 GB/s — a 20x difference that makes TP placement critical.'
    },
    'mlp': {
      title: 'MLP Sub-Block',
      badgeClass: 'purple',
      badge: 'SwiGLU Variant',
      description: 'The feed-forward network uses a SwiGLU activation: gate_proj and up_proj are ColumnParallelLinear (output split), their results are combined with SiLU gating, then down_proj is RowParallelLinear (input split, requires AllReduce). This is typically 2/3 of the block\'s compute.',
      props: ['gate_up_proj: ColumnParallelLinear (fused)', 'Activation: SiLU(gate) × up', 'down_proj: RowParallelLinear', 'Intermediate size is typically about 2.7× to 3.5× hidden_dim in Llama-style models'],
      insight: 'The MLP consumes roughly <strong>2/3 of each block\'s FLOPs</strong>. The gate and up projections are fused into a single ColumnParallelLinear for efficiency, and only the down projection triggers AllReduce.'
    },
    'allreduce-mlp': {
      title: 'AllReduce #2 — MLP',
      badgeClass: 'amber',
      badge: 'Blocking Collective',
      description: 'Second AllReduce sync point in the block. After the MLP down projection, partial results are summed across GPUs. This completes the Transformer block — the output hidden state is now complete on all GPUs and ready for the next block\'s RMSNorm.',
      props: ['Same mechanism as AllReduce #1', 'Sums MLP down_proj partial results', 'Completes the Transformer block', 'Output feeds into next block\'s RMSNorm'],
      insight: 'These two AllReduces (attention + MLP) account for <strong>nearly all inter-GPU communication time</strong>. Reducing the TP degree from 8 to 4 does not change the AllReduce count (still 2 per block), but each collective spans half as many GPUs, at the cost of each GPU holding 2× the weights.'
    },
    'lm-head': {
      title: 'Final RMSNorm + LM Head',
      badgeClass: 'cyan',
      badge: 'Output',
      description: 'After all N Transformer blocks, a final RMSNorm normalizes the hidden state, then the LM head (a linear projection to vocabulary size) produces logits for each token position. During decode, only the last token\'s logits are used for sampling the next token.',
      props: ['Final RMSNorm on last block\'s output', 'LM head: hidden_dim → vocab_size projection', 'During decode: only last token logits matter', 'Logits passed to sampler for token selection'],
    }
  };

   

  var container = document.querySelector('.fp-' + id);
  if (!container) return;

  function clearSelection() {
    var els = container.querySelectorAll('.selected');
    for (var i = 0; i < els.length; i++) {
      els[i].classList.remove('selected');
    }
  }

  function showInfo(data) {
    document.getElementById('fp-placeholder-' + id).style.display = 'none';
    var content = document.getElementById('fp-info-content-' + id);
    content.classList.remove('active');
    void content.offsetWidth;
    content.classList.add('active');

    document.getElementById('fp-info-title-' + id).textContent = data.title;

    var badge = document.getElementById('fp-info-badge-' + id);
    badge.textContent = data.badge;
    badge.className = 'fp-info-badge-' + id + ' ' + data.badgeClass;

    var icon = document.getElementById('fp-info-icon-' + id);
    icon.className = 'fp-info-icon-' + id + ' ' + data.badgeClass;

    document.getElementById('fp-info-desc-' + id).textContent = data.description;

    var propsList = document.getElementById('fp-info-props-' + id);
    var html = '';
    for (var j = 0; j < data.props.length; j++) {
      html += '<li>' + data.props[j] + '</li>';
    }
    propsList.innerHTML = html;

    var insightEl = document.getElementById('fp-insight-' + id);
    if (data.insight) {
      insightEl.style.display = 'block';
      document.getElementById('fp-insight-text-' + id).innerHTML = data.insight;
    } else {
      insightEl.style.display = 'none';
    }

    
    var infoPanel = document.getElementById('fp-info-' + id);
    if (infoPanel) {
      infoPanel.scrollIntoView({ behavior: 'smooth', block: 'nearest' });
    }
  }

   

  container.addEventListener('click', function(e) {
    var target = e.target;

    while (target && target !== container) {
      if (target.hasAttribute && target.hasAttribute('data-stage')) {
        var stageId = target.getAttribute('data-stage');
        var data = stages[stageId];
        if (!data) { target = target.parentElement; continue; }

        e.stopPropagation();
        clearSelection();
        target.classList.add('selected');

        showInfo(data);
        return;
      }
      target = target.parentElement;
    }
  });
})();
</script>

<h3 id="distributed-execution-contracts">Distributed Execution Contracts</h3>
<p>Supporting 405B-class models means multi-GPU execution across potentially many nodes. This introduces specific contracts that new model implementations must satisfy:</p>
<p><strong>Tensor Parallelism</strong> requires precise synchronization. In a standard Transformer block, AllReduce happens exactly twice: after the attention output projection (RowParallelLinear) and after the MLP down-projection (RowParallelLinear). These are the points where partial results from each GPU must be summed. Adding extra synchronizations (say, an unnecessary AllReduce after the QKV projection) doesn&rsquo;t produce wrong results, but it degrades throughput, because each AllReduce is a blocking collective and every GPU waits for it to finish.</p>
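<p>A toy version makes the two-AllReduce rule concrete. The sketch below is not vLLM code: it simulates two ranks with plain Python lists, and every helper name in it is a hypothetical stand-in. It splits a first weight matrix by columns and a second by rows, lets each rank compute on its own shard, and checks that one elementwise sum (the job AllReduce does) recovers the unsharded result. The activation between the two layers is left out for brevity; an elementwise activation would act on each rank&rsquo;s columns independently.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Toy simulation of tensor parallelism on 2 "ranks" using plain Python lists.
# All names here are hypothetical; real code would use sharded GPU tensors.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Elementwise sum of two matrices: the job AllReduce(sum) does across ranks."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split_cols(m, parts):
    """Split a matrix into equal column blocks (column-parallel sharding)."""
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into equal row blocks (row-parallel sharding)."""
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]

x = [[1.0, 2.0], [3.0, 4.0]]                                # (tokens, hidden)
w_up = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]         # (hidden, intermediate)
w_down = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0], [2.0, 0.0]]   # (intermediate, hidden)

# Unsharded reference result.
reference = matmul(matmul(x, w_up), w_down)

# Column-parallel first layer: each rank owns a slice of the output columns.
# Row-parallel second layer: each rank owns the matching slice of input rows.
partials = [
    matmul(matmul(x, up_shard), down_shard)
    for up_shard, down_shard in zip(split_cols(w_up, 2), split_rows(w_down, 2))
]

# The single AllReduce: sum the per-rank partial outputs.
tp_output = add(partials[0], partials[1])
print(tp_output == reference)
</code></pre></div>
<p>The column-parallel layer needs no communication, because each rank&rsquo;s output columns feed straight into the matching rows of the next layer. The only synchronization in the pair is the final sum after the row-parallel layer.</p>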
<p><strong>Pipeline Parallelism</strong> splits the model vertically by layers. The native model&rsquo;s forward method must handle an <code>intermediate_tensors</code> argument:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">forward</span>(self, input_ids, positions, attn_metadata, intermediate_tensors<span style="color:#f92672">=</span><span style="color:#66d9ef">None</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> intermediate_tensors <span style="color:#f92672">is</span> <span style="color:#f92672">not</span> <span style="color:#66d9ef">None</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># We&#39;re not the first pipeline stage — skip embedding</span>
</span></span><span style="display:flex;"><span>        hidden_states <span style="color:#f92672">=</span> intermediate_tensors[<span style="color:#e6db74">&#34;hidden_states&#34;</span>]
</span></span><span style="display:flex;"><span>        start_layer <span style="color:#f92672">=</span> self<span style="color:#f92672">.</span>start_layer  <span style="color:#75715e"># e.g., layer 16</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># First pipeline stage — process from the embedding</span>
</span></span><span style="display:flex;"><span>        hidden_states <span style="color:#f92672">=</span> self<span style="color:#f92672">.</span>embed_tokens(input_ids)
</span></span><span style="display:flex;"><span>        start_layer <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> layer <span style="color:#f92672">in</span> self<span style="color:#f92672">.</span>layers[start_layer:self<span style="color:#f92672">.</span>end_layer]:
</span></span><span style="display:flex;"><span>        hidden_states <span style="color:#f92672">=</span> layer(hidden_states, positions, attn_metadata)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> hidden_states
</span></span></code></pre></div><p>Rank 0 executes layers 0–N and outputs the hidden state as <code>intermediate_tensors</code>. Rank 1 receives it, skips the embedding layer, and resumes from layer N+1. If a developer forgets to implement this check, the model works fine in single-node TP mode but silently breaks in PP mode: it tries to re-embed already-processed hidden states.</p>
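<p>The handoff can be exercised end to end with a toy model. This is a minimal sketch under loud assumptions: <code>ToyStage</code> is a hypothetical class, the layers are simple functions rather than Transformer blocks, and the handoff is an ordinary dict instead of a tensor sent between ranks. It compares a two-stage run against a single-stage run, then shows what happens when the later stage is called without the handoff and embeds the token ids again.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Toy two-stage pipeline. ToyStage and its fields are hypothetical stand-ins
# for a model class; layers are plain functions instead of Transformer blocks.

class ToyStage:
    def __init__(self, layers, start_layer, end_layer):
        self.layers = layers
        self.start_layer = start_layer
        self.end_layer = end_layer

    def embed_tokens(self, input_ids):
        # Stand-in embedding: map each token id to a float.
        return [float(i) for i in input_ids]

    def forward(self, input_ids, intermediate_tensors=None):
        if intermediate_tensors is not None:
            # Later stage: reuse upstream hidden states, skip the embedding.
            hidden_states = intermediate_tensors["hidden_states"]
        else:
            # First stage: start from the embedding.
            hidden_states = self.embed_tokens(input_ids)
        for layer in self.layers[self.start_layer:self.end_layer]:
            hidden_states = [layer(h) for h in hidden_states]
        return hidden_states

# Four toy layers: layer k adds k + 1 to every element.
layers = [lambda h, k=k: h + k + 1 for k in range(4)]

stage0 = ToyStage(layers, start_layer=0, end_layer=2)
stage1 = ToyStage(layers, start_layer=2, end_layer=4)
single = ToyStage(layers, start_layer=0, end_layer=4)

input_ids = [10, 20]
handoff = {"hidden_states": stage0.forward(input_ids)}
pipelined = stage1.forward(input_ids, intermediate_tensors=handoff)

# Without the handoff, the second stage re-embeds the raw token ids.
buggy = stage1.forward(input_ids)

print(pipelined == single.forward(input_ids))
print(buggy == pipelined)
</code></pre></div>
<p>The buggy call raises no error. It returns values of the right shape that are simply wrong, which is what makes this failure mode hard to spot.</p>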
<p><strong>CUDA Graph Compatibility</strong> requires static control flow. Dynamic Python branching on tensor values, such as <code>if tensor.sum() &gt; 0:</code>, breaks CUDA Graph capture because the graph records a fixed execution path. This is particularly relevant for Mixture-of-Experts models, where expert routing is inherently data-dependent. The routing logic must use masked tensor operations (scatter, gather with masks) rather than Python <code>if/else</code>, so the computation graph remains static even though different experts activate for different tokens.</p>
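<p>The contrast between branching and masking can be sketched without a GPU. The example below uses plain Python lists and hypothetical names throughout; on real hardware the same idea would be expressed with masked tensor operations. For simplicity the masked version runs every expert on every token and keeps one result, which wastes compute. That is an assumption of this sketch, not a description of how production MoE kernels work.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Branch-free top-1 expert routing on plain Python lists. All names are
# hypothetical; on a GPU the same idea would use masked tensor operations.

def expert_double(x):
    return 2.0 * x

def expert_negate(x):
    return -x

EXPERTS = [expert_double, expert_negate]

def route_with_branch(x, scores):
    # Data-dependent control flow: the executed path changes per token.
    if scores[0] == max(scores):
        return expert_double(x)
    return expert_negate(x)

def route_with_mask(x, scores):
    # Fixed sequence of operations: run every expert, then keep one output
    # by multiplying with a one-hot mask built from the winning index.
    best = scores.index(max(scores))
    mask = [float(i == best) for i in range(len(scores))]
    outputs = [expert(x) for expert in EXPERTS]
    return sum(m * o for m, o in zip(mask, outputs))

# Two tokens with the same value but different router scores.
tokens = [(3.0, [0.9, 0.1]), (3.0, [0.2, 0.8])]
for x, scores in tokens:
    print(route_with_branch(x, scores), route_with_mask(x, scores))
</code></pre></div>
<p>Both functions return the same values, but only the masked one performs the same sequence of operations for every token, which is the property graph capture depends on.</p>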
<table>
  <thead>
      <tr>
          <th>Parallelism Type</th>
          <th>Contract for Model Developer</th>
          <th>Failure Mode if Missed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tensor Parallelism</td>
          <td>Use ColumnParallel/RowParallel layers; exactly 2 AllReduces per block</td>
          <td>Extra AllReduces → throughput degradation; wrong layer types → incorrect results</td>
      </tr>
      <tr>
          <td>Pipeline Parallelism</td>
          <td>Handle <code>intermediate_tensors</code> arg; define <code>start_layer</code>/<code>end_layer</code></td>
          <td>Model re-embeds hidden states → garbage output on later pipeline stages</td>
      </tr>
      <tr>
          <td>CUDA Graphs</td>
          <td>No Python control flow based on tensor values; use masked ops for routing</td>
          <td>Graph capture fails → fallback to eager mode → 2-3x slower decode</td>
      </tr>
  </tbody>
</table>
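<p>To put a rough number on the &ldquo;throughput degradation&rdquo; in the Tensor Parallelism row, the estimate below uses the payload formula from the forward-pass diagram (hidden_dim × batch_tokens × dtype_size) and the approximate NVLink and PCIe bandwidths quoted there. Treat it as a sketch under assumptions: it models a ring AllReduce by bandwidth alone, ignores latency and overlap, and the example sizes (8192 hidden dimensions, 4096 batched tokens, 2-byte values, TP=8) are illustrative choices, not measurements.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Back-of-the-envelope AllReduce cost. The ring-algorithm traffic factor and
# the example sizes are assumptions; the bandwidth figures are approximate.

def allreduce_payload_bytes(hidden_dim, batch_tokens, dtype_size):
    """Size of the tensor being reduced: hidden_dim x batch_tokens x dtype_size."""
    return hidden_dim * batch_tokens * dtype_size

def ring_allreduce_seconds(payload_bytes, tp_size, bandwidth_bytes_per_s):
    """Bandwidth-only estimate: each rank moves 2 * (N - 1) / N of the payload."""
    traffic = 2 * (tp_size - 1) / tp_size * payload_bytes
    return traffic / bandwidth_bytes_per_s

payload = allreduce_payload_bytes(hidden_dim=8192, batch_tokens=4096, dtype_size=2)
nvlink_s = ring_allreduce_seconds(payload, tp_size=8, bandwidth_bytes_per_s=600e9)
pcie_s = ring_allreduce_seconds(payload, tp_size=8, bandwidth_bytes_per_s=32e9)

print(f"payload: {payload / 2**20:.0f} MiB")
print(f"NVLink:  {nvlink_s * 1e3:.2f} ms per AllReduce")
print(f"PCIe:    {pcie_s * 1e3:.2f} ms per AllReduce")
</code></pre></div>
<p>With these inputs the estimate comes to roughly 0.2 ms per AllReduce over NVLink and about 3.7 ms over PCIe. Every unnecessary collective adds that much blocking time to each block it appears in.</p>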
<h3 id="putting-it-all-together--the-optimization-ladder">Putting It All Together — The Optimization Ladder</h3>
<p>To summarize the full lifecycle, model support in vLLM is a progression through increasingly optimized tiers:</p>
<ol>
<li><strong>Transformers Fallback</strong> — works immediately. PagedAttention memory management. No fusion, no graphs, limited parallelism. This is where every new model starts.</li>
<li><strong>Plugin Registration</strong> — an external package provides a native model class. <code>pip install</code> and go. Model creators control their own release timeline.</li>
<li><strong>Native Model Class</strong> — upstreamed into vLLM. Parallel primitives, 1D flattened computation, CUDA Graph compatible. This is where the performance lives.</li>
<li><strong>Quantization Support</strong> — AWQ, GPTQ, FP8 weight loading tested and working. Packed tensor handling, per-layer quantization configs. Unlocks deployment on smaller hardware.</li>
<li><strong>Full Production</strong> — Pipeline Parallelism support, custom attention patterns if needed, benchmarked against reference implementations. Ready for large-scale serving.</li>
</ol>
<p>The plugin system represents the future direction — federated model support where model creators can ship &ldquo;vLLM-ready&rdquo; code independently of core releases. Instead of waiting for the vLLM team to implement every new architecture, the ecosystem moves toward model creators owning their integration path.</p>
<hr>
<h2 id="closing">Closing</h2>
<p>The process of introducing a new model into vLLM is a systems engineering exercise. It requires transforming a static model definition (essentially a recipe for matrix multiplications) into a dynamic, distributed execution graph that manages its own memory, shards its own weights, and coordinates across GPUs. The Transformers fallback bridges the gap for immediate access; native integration is where the performance lives.</p>
<p>There are four core contracts a model must satisfy for full integration: registry updates (mapping architecture strings to code), class restructuring (parallel primitives, 1D flattening), weight loading (handling mismatches between checkpoint and runtime structure), and <a href="/posts/flash-attention/">PagedAttention</a> integration (routing attention through the block-table-based memory system). Understanding these four contracts gives you a mental model for reasoning about model support in any inference engine, not just vLLM.</p>
<hr>
<h2 id="references">References</h2>
<ol>
<li>Kwon, W. et al. &ldquo;Efficient Memory Management for Large Language Model Serving with PagedAttention.&rdquo; <em>SOSP 2023.</em> <a href="https://arxiv.org/pdf/2309.06180">arXiv:2309.06180</a></li>
<li>vLLM Documentation — Architecture Overview. <a href="https://docs.vllm.ai/en/stable/design/arch_overview/">docs.vllm.ai</a></li>
<li>vLLM Documentation — Adding a New Model. <a href="https://docs.vllm.ai/en/v0.6.5/models/adding_model.html">docs.vllm.ai</a></li>
<li>vLLM Documentation — Plugin System. <a href="https://docs.vllm.ai/en/latest/design/plugin_system/">docs.vllm.ai</a></li>
<li>vLLM Source — Model Registry (<code>registry.py</code>). <a href="https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/registry.py">GitHub</a></li>
<li>vLLM Source — Transformers Backend (<code>transformers/</code>). <a href="https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models/transformers">GitHub</a></li>
<li>vLLM Documentation — Class Hierarchy. <a href="https://docs.vllm.ai/en/v0.6.4.post1/design/class_hierarchy.html">docs.vllm.ai</a></li>
<li>Gordic, A. &ldquo;Inside vLLM: Anatomy of a High-Throughput LLM Inference System.&rdquo; <a href="https://www.aleksagordic.com/blog/vllm">aleksagordic.com</a></li>
<li>Zalt, M. &ldquo;The Hidden Switchboard Behind vLLM Attention.&rdquo; <a href="https://zalt.me/blog/2025/12/vllm-attention-switchboard/">zalt.me</a></li>
<li>Prerepa, A. &ldquo;ZvLLM: Zigzag forward pass with vLLM.&rdquo; <a href="https://adiprerepa.github.io/data/598final.pdf">adiprerepa.github.io</a></li>
<li>El Shafie, H. &ldquo;Paged Attention from First Principles: A View Inside vLLM.&rdquo; <a href="https://hamzaelshafie.bearblog.dev/paged-attention-from-first-principles-a-view-inside-vllm/">hamzaelshafie.bearblog.dev</a></li>
<li>vLLM Documentation — Paged Attention Design. <a href="https://docs.vllm.ai/en/v0.9.1/design/kernel/paged_attention.html">docs.vllm.ai</a></li>
</ol>
]]></content:encoded></item><item><title>Orchestrating Inference: How Kubernetes, Ray, and vLLM Coordinate Under the Hood</title><link>https://www.mdjawad.com/posts/orchestrating-inference/</link><pubDate>Sun, 18 Jan 2026 12:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/orchestrating-inference/</guid><description>A deep dive into how Kubernetes, Ray, and vLLM coordinate to transform independent GPUs into a synchronized inference machine.</description><content:encoded><![CDATA[<h2 id="the-deceptively-simple-command">The Deceptively Simple Command</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size <span style="color:#ae81ff">8</span>
</span></span></code></pre></div><p>One line. Eight GPUs. A 70-billion parameter model ready to serve requests. But this hides significant complexity.</p>
<p>Behind this command, three distinct software systems spring into action. Kubernetes allocates pods and manages node resources. Ray spawns actors, creates placement groups, and coordinates distributed execution. vLLM initializes workers, establishes NCCL communication rings, and begins orchestrating the token-by-token dance of autoregressive generation.</p>
<p>The interesting part is the choreography between systems. Each layer operates at a different granularity, speaks a different language, and solves a different class of problem. Kubernetes thinks in pods and nodes. Ray thinks in actors and tasks. vLLM thinks in requests and tokens. Yet when you hit that endpoint with a prompt, all three coordinate to produce a coherent response.</p>
<p>The question worth asking: <em>How do these systems know when to hand off control to each other?</em></p>
<p>This post traces that coordination. We&rsquo;ll follow the cascade from <code>kubectl apply</code> to the moment NCCL rings form and tensor data starts flowing. We&rsquo;ll examine why placement groups matter more than you&rsquo;d expect, why your network configuration can make or break performance, and how the industry is evolving toward disaggregated architectures that split inference across specialized pools.</p>
<p>If you&rsquo;ve read the <a href="/posts/llm-inference-hidden-stack/">previous deep-dive on the hidden software stack</a> behind inference, this builds on that foundation. We won&rsquo;t revisit PagedAttention or continuous batching fundamentals. Instead, we&rsquo;re zooming out to the orchestration layer, the software that transforms a rack of GPUs into something resembling a programmable supercomputer.</p>
<blockquote>
<p><strong>Prerequisites:</strong> This post assumes familiarity with <a href="https://blog.vllm.ai/2023/06/20/vllm.html">PagedAttention</a>, <a href="https://www.anyscale.com/blog/continuous-batching-llm-inference">continuous batching</a>, and basic Kubernetes concepts. If you&rsquo;re new to the inference stack, start with <a href="/posts/llm-inference-hidden-stack/">The Hidden Software Stack Behind Fast LLM Inference</a>.</p></blockquote>




<style>
.orch-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: linear-gradient(180deg, #0f172a 0%, #1e293b 100%);
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #e2e8f0;
}

.orch-c90c0da50bbebaa8b9e35fe5cf4aa5c4 * {
  box-sizing: border-box;
}

.orch-header-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  text-align: center;
  margin-bottom: 1.5rem;
}

.orch-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.5rem;
}

.orch-subtitle-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.9rem;
  color: #94a3b8;
  max-width: 550px;
  margin: 0 auto;
}

 
.orch-content-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: grid;
  grid-template-columns: 1fr 300px;
  gap: 1.5rem;
  align-items: start;
}

@media (max-width: 850px) {
  .orch-content-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    grid-template-columns: 1fr;
  }
}

 
.orch-stack-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  position: relative;
  min-height: 480px;
}

 
.orch-layer-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  position: relative;
  border-radius: 12px;
  padding: 1rem;
  margin-bottom: 0.75rem;
  border: 2px solid;
  transition: all 0.3s ease;
}

.orch-layer-c90c0da50bbebaa8b9e35fe5cf4aa5c4:last-child {
  margin-bottom: 0;
}

.orch-layer-k8s-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  background: rgba(59, 130, 246, 0.1);
  border-color: rgba(59, 130, 246, 0.4);
}

.orch-layer-ray-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  background: rgba(168, 85, 247, 0.1);
  border-color: rgba(168, 85, 247, 0.4);
}

.orch-layer-vllm-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  background: rgba(236, 72, 153, 0.1);
  border-color: rgba(236, 72, 153, 0.4);
}

.orch-layer-header-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  align-items: center;
  justify-content: space-between;
  margin-bottom: 0.75rem;
}

.orch-layer-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.85rem;
  font-weight: 700;
  text-transform: uppercase;
  letter-spacing: 0.05em;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.orch-layer-k8s-c90c0da50bbebaa8b9e35fe5cf4aa5c4 .orch-layer-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  color: #60a5fa;
}

.orch-layer-ray-c90c0da50bbebaa8b9e35fe5cf4aa5c4 .orch-layer-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  color: #c084fc;
}

.orch-layer-vllm-c90c0da50bbebaa8b9e35fe5cf4aa5c4 .orch-layer-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  color: #f472b6;
}

.orch-granularity-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.65rem;
  padding: 0.25rem 0.5rem;
  border-radius: 4px;
  font-weight: 500;
  background: rgba(255, 255, 255, 0.1);
  color: #94a3b8;
}

 
.orch-components-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  gap: 0.5rem;
  flex-wrap: wrap;
  justify-content: center;
}

.orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 8px;
  padding: 0.6rem 0.8rem;
  cursor: pointer;
  transition: all 0.2s ease;
  text-align: center;
  min-width: 85px;
  position: relative;
}

.orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4:hover {
  transform: translateY(-2px);
  border-color: #64748b;
}

.orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4.selected {
  transform: scale(1.05);
  box-shadow: 0 0 20px rgba(255, 255, 255, 0.2);
}

.orch-layer-k8s-c90c0da50bbebaa8b9e35fe5cf4aa5c4 .orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4.selected {
  border-color: #3b82f6;
  box-shadow: 0 0 20px rgba(59, 130, 246, 0.4);
}

.orch-layer-ray-c90c0da50bbebaa8b9e35fe5cf4aa5c4 .orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4.selected {
  border-color: #a855f7;
  box-shadow: 0 0 20px rgba(168, 85, 247, 0.4);
}

.orch-layer-vllm-c90c0da50bbebaa8b9e35fe5cf4aa5c4 .orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4.selected {
  border-color: #ec4899;
  box-shadow: 0 0 20px rgba(236, 72, 153, 0.4);
}

.orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.75rem;
  font-weight: 600;
  color: #f1f5f9;
  margin-bottom: 0.2rem;
}

.orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.6rem;
  color: #64748b;
}

 
.orch-connector-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  align-items: center;
  justify-content: center;
  height: 36px;
  position: relative;
  margin: 0.25rem 0;
}

.orch-connector-arrow-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  flex-direction: column;
  align-items: center;
  position: relative;
}

.orch-connector-arrow-c90c0da50bbebaa8b9e35fe5cf4aa5c4 svg {
  width: 24px;
  height: 24px;
  color: #22c55e;
}

.orch-connector-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.6rem;
  color: #22c55e;
  font-weight: 500;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

 
.orch-particle-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  position: absolute;
  width: 6px;
  height: 6px;
  background: #22c55e;
  border-radius: 50%;
  box-shadow: 0 0 8px #22c55e;
  animation: orch-particle-flow-c90c0da50bbebaa8b9e35fe5cf4aa5c4 2s ease-in-out infinite;
}

@keyframes orch-particle-flow-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  0% { transform: translateY(-12px); opacity: 0; }
  20% { opacity: 1; }
  80% { opacity: 1; }
  100% { transform: translateY(12px); opacity: 0; }
}

.orch-particle-c90c0da50bbebaa8b9e35fe5cf4aa5c4.p1 { animation-delay: 0s; }
.orch-particle-c90c0da50bbebaa8b9e35fe5cf4aa5c4.p2 { animation-delay: 0.5s; }
.orch-particle-c90c0da50bbebaa8b9e35fe5cf4aa5c4.p3 { animation-delay: 1s; }

 
.orch-nccl-bypass-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  position: absolute;
  right: -60px;
  top: 50%;
  transform: translateY(-50%);
  display: flex;
  flex-direction: column;
  align-items: center;
  gap: 0.25rem;
}

.orch-nccl-arrow-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  width: 4px;
  height: 180px;
  background: linear-gradient(180deg, transparent 0%, #f59e0b 10%, #f59e0b 90%, transparent 100%);
  border-radius: 2px;
  position: relative;
}

.orch-nccl-arrow-c90c0da50bbebaa8b9e35fe5cf4aa5c4::before {
  content: '';
  position: absolute;
  bottom: 0;
  left: 50%;
  transform: translateX(-50%);
  border-left: 6px solid transparent;
  border-right: 6px solid transparent;
  border-top: 10px solid #f59e0b;
}

.orch-nccl-pulse-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  position: absolute;
  width: 100%;
  height: 20px;
  background: linear-gradient(180deg, transparent, #f59e0b, transparent);
  animation: orch-nccl-pulse-c90c0da50bbebaa8b9e35fe5cf4aa5c4 1.5s ease-in-out infinite;
  border-radius: 2px;
}

@keyframes orch-nccl-pulse-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  0% { top: 0; opacity: 0.8; }
  100% { top: calc(100% - 20px); opacity: 0; }
}

.orch-nccl-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  writing-mode: vertical-rl;
  text-orientation: mixed;
  font-size: 0.6rem;
  font-weight: 600;
  color: #f59e0b;
  text-transform: uppercase;
  letter-spacing: 0.1em;
}

 
.orch-info-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 12px;
  padding: 1.25rem;
  position: sticky;
  top: 1rem;
}

.orch-info-placeholder-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  text-align: center;
  color: #64748b;
  padding: 2rem 1rem;
}

.orch-info-placeholder-c90c0da50bbebaa8b9e35fe5cf4aa5c4 svg {
  width: 40px;
  height: 40px;
  margin-bottom: 0.75rem;
  opacity: 0.5;
}

.orch-info-placeholder-c90c0da50bbebaa8b9e35fe5cf4aa5c4 p {
  font-size: 0.8rem;
  margin: 0;
}

.orch-info-content-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: none;
}

.orch-info-content-c90c0da50bbebaa8b9e35fe5cf4aa5c4.active {
  display: block;
  animation: orch-fade-in-c90c0da50bbebaa8b9e35fe5cf4aa5c4 0.3s ease;
}

@keyframes orch-fade-in-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  from { opacity: 0; transform: translateY(5px); }
  to { opacity: 1; transform: translateY(0); }
}

.orch-info-header-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  align-items: center;
  gap: 0.75rem;
  margin-bottom: 1rem;
}

.orch-info-icon-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  width: 36px;
  height: 36px;
  border-radius: 8px;
  display: flex;
  align-items: center;
  justify-content: center;
}

.orch-info-icon-c90c0da50bbebaa8b9e35fe5cf4aa5c4.k8s { background: rgba(59, 130, 246, 0.2); color: #60a5fa; }
.orch-info-icon-c90c0da50bbebaa8b9e35fe5cf4aa5c4.ray { background: rgba(168, 85, 247, 0.2); color: #c084fc; }
.orch-info-icon-c90c0da50bbebaa8b9e35fe5cf4aa5c4.vllm { background: rgba(236, 72, 153, 0.2); color: #f472b6; }

.orch-info-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 1rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.2rem;
}

.orch-info-badge-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.6rem;
  padding: 0.2rem 0.4rem;
  border-radius: 4px;
  font-weight: 600;
  text-transform: uppercase;
}

.orch-info-badge-c90c0da50bbebaa8b9e35fe5cf4aa5c4.k8s { background: rgba(59, 130, 246, 0.2); color: #60a5fa; }
.orch-info-badge-c90c0da50bbebaa8b9e35fe5cf4aa5c4.ray { background: rgba(168, 85, 247, 0.2); color: #c084fc; }
.orch-info-badge-c90c0da50bbebaa8b9e35fe5cf4aa5c4.vllm { background: rgba(236, 72, 153, 0.2); color: #f472b6; }

.orch-info-granularity-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  margin-bottom: 0.75rem;
  padding-bottom: 0.75rem;
  border-bottom: 1px solid #334155;
}

.orch-info-granularity-c90c0da50bbebaa8b9e35fe5cf4aa5c4 span:first-child {
  font-size: 0.7rem;
  color: #64748b;
}

.orch-info-granularity-c90c0da50bbebaa8b9e35fe5cf4aa5c4 span:last-child {
  font-size: 0.75rem;
  color: #94a3b8;
  font-weight: 500;
}

.orch-info-desc-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.8rem;
  color: #cbd5e1;
  line-height: 1.6;
  margin-bottom: 1rem;
}

.orch-info-manages-c90c0da50bbebaa8b9e35fe5cf4aa5c4 h4 {
  font-size: 0.7rem;
  color: #64748b;
  margin: 0 0 0.5rem 0;
  text-transform: uppercase;
  letter-spacing: 0.05em;
}

.orch-info-manages-c90c0da50bbebaa8b9e35fe5cf4aa5c4 ul {
  margin: 0;
  padding: 0;
  list-style: none;
}

.orch-info-manages-c90c0da50bbebaa8b9e35fe5cf4aa5c4 li {
  font-size: 0.75rem;
  color: #94a3b8;
  padding: 0.3rem 0;
  padding-left: 1rem;
  position: relative;
}

.orch-info-manages-c90c0da50bbebaa8b9e35fe5cf4aa5c4 li::before {
  content: '•';
  position: absolute;
  left: 0;
  color: #64748b;
}

 
.orch-footer-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  margin-top: 1.5rem;
  padding: 1rem;
  background: rgba(245, 158, 11, 0.1);
  border: 1px solid rgba(245, 158, 11, 0.3);
  border-radius: 10px;
  text-align: center;
}

.orch-footer-c90c0da50bbebaa8b9e35fe5cf4aa5c4 p {
  font-size: 0.8rem;
  color: #fbbf24;
  margin: 0;
  line-height: 1.5;
}

.orch-footer-c90c0da50bbebaa8b9e35fe5cf4aa5c4 strong {
  color: #fcd34d;
}

 
@media (max-width: 600px) {
  .orch-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    padding: 1.25rem;
  }

  .orch-components-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    gap: 0.35rem;
  }

  .orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    min-width: 70px;
    padding: 0.5rem 0.6rem;
  }

  .orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    font-size: 0.65rem;
  }

  .orch-nccl-bypass-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    display: none;
  }

  .orch-info-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    margin-top: 1rem;
  }
}

 
.orch-stack-wrapper-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  position: relative;
  padding-right: 70px;
}

@media (max-width: 600px) {
  .orch-stack-wrapper-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    padding-right: 0;
  }
}
</style>

<div class="orch-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
  <div class="orch-header-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
    <div class="orch-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4">The Three-Layer Stack</div>
    <div class="orch-subtitle-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Control flows down through the layers. Tensor data bypasses the middle entirely.</div>
  </div>

  <div class="orch-content-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
    <div class="orch-stack-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      <div class="orch-stack-wrapper-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
        
        <div class="orch-layer-c90c0da50bbebaa8b9e35fe5cf4aa5c4 orch-layer-k8s-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
          <div class="orch-layer-header-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <div class="orch-layer-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
              <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
                <path d="M21 16V8a2 2 0 0 0-1-1.73l-7-4a2 2 0 0 0-2 0l-7 4A2 2 0 0 0 3 8v8a2 2 0 0 0 1 1.73l7 4a2 2 0 0 0 2 0l7-4A2 2 0 0 0 21 16z"/>
                <polyline points="3.27 6.96 12 12.01 20.73 6.96"/>
                <line x1="12" y1="22.08" x2="12" y2="12"/>
              </svg>
              Kubernetes Layer
            </div>
            <span class="orch-granularity-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Coarse-grained</span>
          </div>
          <div class="orch-components-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <div class="orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4" data-component="k8s-nodes">
              <div class="orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Nodes</div>
              <div class="orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4">machines</div>
            </div>
            <div class="orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4" data-component="k8s-pods">
              <div class="orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Pods</div>
              <div class="orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4">containers</div>
            </div>
            <div class="orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4" data-component="k8s-containers">
              <div class="orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Containers</div>
              <div class="orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4">processes</div>
            </div>
          </div>
        </div>

        
        <div class="orch-connector-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
          <div class="orch-connector-arrow-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <div class="orch-particle-c90c0da50bbebaa8b9e35fe5cf4aa5c4 p1"></div>
            <div class="orch-particle-c90c0da50bbebaa8b9e35fe5cf4aa5c4 p2"></div>
            <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <path d="M12 5v14M5 12l7 7 7-7"/>
            </svg>
            <span class="orch-connector-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4">handoff</span>
          </div>
        </div>

        
        <div class="orch-layer-c90c0da50bbebaa8b9e35fe5cf4aa5c4 orch-layer-ray-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
          <div class="orch-layer-header-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <div class="orch-layer-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
              <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
                <circle cx="12" cy="12" r="3"/>
                <path d="M12 1v4M12 19v4M4.22 4.22l2.83 2.83M16.95 16.95l2.83 2.83M1 12h4M19 12h4M4.22 19.78l2.83-2.83M16.95 7.05l2.83-2.83"/>
              </svg>
              Ray Layer
            </div>
            <span class="orch-granularity-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Fine-grained</span>
          </div>
          <div class="orch-components-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <div class="orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4" data-component="ray-gcs">
              <div class="orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4">GCS</div>
              <div class="orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4">control store</div>
            </div>
            <div class="orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4" data-component="ray-raylets">
              <div class="orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Raylets</div>
              <div class="orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4">per-node</div>
            </div>
            <div class="orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4" data-component="ray-actors">
              <div class="orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Actors</div>
              <div class="orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4">workers</div>
            </div>
            <div class="orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4" data-component="ray-placement">
              <div class="orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Placement</div>
              <div class="orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4">groups</div>
            </div>
          </div>
        </div>

        
        <div class="orch-connector-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
          <div class="orch-connector-arrow-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <div class="orch-particle-c90c0da50bbebaa8b9e35fe5cf4aa5c4 p1"></div>
            <div class="orch-particle-c90c0da50bbebaa8b9e35fe5cf4aa5c4 p2"></div>
            <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <path d="M12 5v14M5 12l7 7 7-7"/>
            </svg>
            <span class="orch-connector-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4">handoff</span>
          </div>
        </div>

        
        <div class="orch-layer-c90c0da50bbebaa8b9e35fe5cf4aa5c4 orch-layer-vllm-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
          <div class="orch-layer-header-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <div class="orch-layer-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
              <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
                <path d="M13 2L3 14h9l-1 8 10-12h-9l1-8z"/>
              </svg>
              vLLM Layer
            </div>
            <span class="orch-granularity-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Token-level</span>
          </div>
          <div class="orch-components-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <div class="orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4" data-component="vllm-scheduler">
              <div class="orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Scheduler</div>
              <div class="orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4">requests</div>
            </div>
            <div class="orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4" data-component="vllm-workers">
              <div class="orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Workers</div>
              <div class="orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4">GPU exec</div>
            </div>
            <div class="orch-component-c90c0da50bbebaa8b9e35fe5cf4aa5c4" data-component="vllm-nccl">
              <div class="orch-comp-name-c90c0da50bbebaa8b9e35fe5cf4aa5c4">NCCL Ring</div>
              <div class="orch-comp-hint-c90c0da50bbebaa8b9e35fe5cf4aa5c4">tensor sync</div>
            </div>
          </div>
        </div>

        
        <div class="orch-nccl-bypass-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
          <span class="orch-nccl-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4">NCCL Bypass</span>
          <div class="orch-nccl-arrow-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <div class="orch-nccl-pulse-c90c0da50bbebaa8b9e35fe5cf4aa5c4"></div>
          </div>
          <span class="orch-nccl-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Tensor Data</span>
        </div>
      </div>
    </div>

    
    <div class="orch-info-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      <div class="orch-info-placeholder-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="orch-placeholder-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="1.5">
          <circle cx="12" cy="12" r="10"/>
          <path d="M12 16v-4M12 8h.01"/>
        </svg>
        <p>Click a component to see details about its role in the orchestration stack.</p>
      </div>
      <div class="orch-info-content-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="orch-info-content-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
        <div class="orch-info-header-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
          <div class="orch-info-icon-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="orch-info-icon-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <rect x="3" y="3" width="18" height="18" rx="2"/>
            </svg>
          </div>
          <div>
            <div class="orch-info-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="orch-info-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Component</div>
            <span class="orch-info-badge-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="orch-info-badge-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Layer</span>
          </div>
        </div>
        <div class="orch-info-granularity-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
          <span>Granularity:</span>
          <span id="orch-info-gran-c90c0da50bbebaa8b9e35fe5cf4aa5c4">-</span>
        </div>
        <div class="orch-info-desc-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="orch-info-desc-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
          Description goes here.
        </div>
        <div class="orch-info-manages-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
          <h4>Manages</h4>
          <ul id="orch-info-manages-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
            <li>Item 1</li>
          </ul>
        </div>
      </div>
    </div>
  </div>

  <div class="orch-footer-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
    <p><strong>Ray sets up actors and gets out of the way.</strong> Tensor data flows through NCCL at 900 GB/s, completely bypassing Ray's object store.</p>
  </div>
</div>

<script>
(function() {
  var id = 'c90c0da50bbebaa8b9e35fe5cf4aa5c4';

  var components = {
    'k8s-nodes': {
      title: 'Nodes',
      layer: 'kubernetes',
      layerClass: 'k8s',
      granularity: 'Coarse (machines)',
      description: "Physical or virtual machines with GPU resources. K8s sees \"this node has 8 GPUs\" but knows nothing about NVLink topology.",
      manages: ['Hardware resources', 'Node labels', 'Taints/tolerations']
    },
    'k8s-pods': {
      title: 'Pods',
      layer: 'kubernetes',
      layerClass: 'k8s',
      granularity: 'Coarse (containers)',
      description: "The scheduling unit. KubeRay creates head and worker pods. K8s ensures they land on nodes with available GPU resources.",
      manages: ['Container lifecycle', 'Resource requests', 'Networking']
    },
    'k8s-containers': {
      title: 'Containers',
      layer: 'kubernetes',
      layerClass: 'k8s',
      granularity: 'Coarse (processes)',
      description: "Ray head and worker processes run inside. K8s responsibility ends here—what happens inside is Ray's domain.",
      manages: ['Process isolation', 'Volume mounts', 'Environment']
    },
    'ray-gcs': {
      title: 'Global Control Store',
      layer: 'ray',
      layerClass: 'ray',
      granularity: 'Fine (actors)',
      description: "Distributed metadata store on port 6379. Tracks cluster membership, resource availability, and actor locations across all nodes.",
      manages: ['Cluster state', 'Actor registry', 'Resource table']
    },
    'ray-raylets': {
      title: 'Raylets',
      layer: 'ray',
      layerClass: 'ray',
      granularity: 'Fine (tasks)',
      description: "One per node. Discovers GPUs via CUDA, manages local object store, and communicates resources back to GCS.",
      manages: ['Local scheduling', 'GPU discovery', 'Object store']
    },
    'ray-actors': {
      title: 'Actors',
      layer: 'ray',
      layerClass: 'ray',
      granularity: 'Fine (state)',
      description: "Long-lived Python objects (RayWorkerWrapper). Each actor gets CUDA_VISIBLE_DEVICES set to its assigned GPU.",
      manages: ['Worker state', 'GPU assignment', 'Message handling']
    },
    'ray-placement': {
      title: 'Placement Groups',
      layer: 'ray',
      layerClass: 'ray',
      granularity: 'Fine (topology)',
      description: "STRICT_PACK guarantees all 8 actors land on one node. This is where GPU topology awareness enters the stack.",
      manages: ['Co-location', 'Atomicity', 'Placement strategy']
    },
    'vllm-scheduler': {
      title: 'Scheduler',
      layer: 'vllm',
      layerClass: 'vllm',
      granularity: 'Token-level (ms)',
      description: "Runs every inference step. Decides which requests to process, manages KV cache allocation, handles continuous batching.",
      manages: ['Request ordering', 'KV cache', 'Batching']
    },
    'vllm-workers': {
      title: 'Workers',
      layer: 'vllm',
      layerClass: 'vllm',
      granularity: 'Token-level (μs)',
      description: "Execute model forward passes. Call torch.distributed.init_process_group(backend=\"nccl\") to establish communication.",
      manages: ['Forward passes', 'CUDA kernels', 'Memory']
    },
    'vllm-nccl': {
      title: 'NCCL Ring',
      layer: 'vllm',
      layerClass: 'vllm',
      granularity: 'Tensor-level (ns)',
      description: "The hot path. AllReduce operations flow here at 900 GB/s via NVLink. Ray is completely bypassed for tensor data.",
      manages: ['AllReduce', 'Tensor sync', 'Ring topology']
    }
  };

  function selectComponent(el, compId) {
    
    var allComponents = document.querySelectorAll('.orch-' + id + ' [data-component]');
    for (var i = 0; i < allComponents.length; i++) {
      allComponents[i].classList.remove('selected');
    }

    
    el.classList.add('selected');

    
    var comp = components[compId];
    if (!comp) return;

    document.getElementById('orch-placeholder-' + id).style.display = 'none';
    document.getElementById('orch-info-content-' + id).classList.add('active');

    document.getElementById('orch-info-title-' + id).textContent = comp.title;

    var badge = document.getElementById('orch-info-badge-' + id);
    badge.textContent = comp.layer;
    badge.className = 'orch-info-badge-' + id + ' ' + comp.layerClass;

    var icon = document.getElementById('orch-info-icon-' + id);
    icon.className = 'orch-info-icon-' + id + ' ' + comp.layerClass;

    document.getElementById('orch-info-gran-' + id).textContent = comp.granularity;
    document.getElementById('orch-info-desc-' + id).textContent = comp.description;

    var managesList = document.getElementById('orch-info-manages-' + id);
    var html = '';
    for (var j = 0; j < comp.manages.length; j++) {
      html += '<li>' + comp.manages[j] + '</li>';
    }
    managesList.innerHTML = html;
  }

  
  var container = document.querySelector('.orch-' + id);
  if (container) {
    container.addEventListener('click', function(e) {
      var target = e.target;
      
      while (target && target !== container) {
        if (target.hasAttribute && target.hasAttribute('data-component')) {
          var compId = target.getAttribute('data-component');
          selectComponent(target, compId);
          return;
        }
        target = target.parentElement;
      }
    });
  }
})();
</script>

<hr>
<h2 id="three-systems-three-granularities">Three Systems, Three Granularities</h2>
<p>This stack works because of its division of labor. Each system operates at a different level of abstraction, handling the problems it&rsquo;s best suited to solve.</p>
<p><strong>Kubernetes</strong> sees the world in pods and nodes. It manages the lifecycle of containers, handles service discovery, and ensures workloads get scheduled onto machines with available resources. Its scheduling decisions happen at the coarse granularity of &ldquo;does this node have 8 GPUs available?&rdquo; Kubernetes has no concept of what happens inside those containers once they&rsquo;re running.</p>
<p><strong>Ray</strong> operates one level deeper. It sees actors (long-lived Python objects that can hold state and process messages) and tasks, which are stateless function invocations. Ray&rsquo;s Global Control Store (GCS) maintains a distributed view of cluster resources, and its Raylets (one per node) handle local scheduling and object management. Ray also understands placement constraints: it can ensure that a group of actors lands on the same physical node, or spreads across nodes in a specific pattern.</p>
<p><strong>vLLM</strong> cares about requests and tokens. It manages the KV cache, schedules which requests get processed in each iteration, and coordinates the actual tensor operations across GPU workers. vLLM&rsquo;s scheduler operates at millisecond granularity, making decisions every inference step about which tokens to generate next.</p>
<p>Kubernetes has no understanding of GPU topology. It can count GPUs, but it cannot distinguish between eight GPUs connected via NVLink at 900 GB/s and eight GPUs scattered across nodes connected via Ethernet at 10 GB/s. Without additional tooling, Kubernetes might schedule your tensor-parallel workload across two nodes, a configuration that would perform 40-90x slower than necessary.</p>
<table>
  <thead>
      <tr>
          <th>Concern</th>
          <th>Kubernetes</th>
          <th>Ray</th>
          <th>vLLM</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Granularity</td>
          <td>Pods/Nodes</td>
          <td>Actors/Tasks</td>
          <td>Requests/Tokens</td>
      </tr>
      <tr>
          <td>GPU handling</td>
          <td>Counts only</td>
          <td>Placement constraints</td>
          <td>CUDA assignment</td>
      </tr>
      <tr>
          <td>State management</td>
          <td>Stateless orchestration</td>
          <td>Actor metadata in GCS</td>
          <td>KV cache</td>
      </tr>
      <tr>
          <td>Restart handling</td>
          <td>Pod restarts</td>
          <td>Actor recovery</td>
          <td>Request retry</td>
      </tr>
  </tbody>
</table>
<p>This is where <strong>KubeRay</strong> enters the picture. KubeRay is a Kubernetes operator that bridges the gap between Kubernetes&rsquo; pod-centric worldview and Ray&rsquo;s actor-centric model. It introduces three Custom Resource Definitions (CRDs):</p>
<ul>
<li>
<p>RayCluster is the foundation. It defines head and worker node configurations, resource requirements, and cluster topology. Use this when you need a persistent Ray cluster for interactive development or long-running services.</p>
</li>
<li>
<p>RayService builds on RayCluster to add Ray Serve deployments. It handles zero-downtime upgrades, health checking, and automatic recovery. This is the production choice for serving workloads.</p>
</li>
<li>
<p>RayJob handles batch workloads. It spins up a cluster, runs a job, then tears everything down. Useful for fine-tuning runs or batch inference over large datasets.</p>
</li>
</ul>
<p>The operator watches these CRDs and reconciles cluster state: creating pods, configuring networking, managing the Ray head node&rsquo;s GCS, and ensuring workers connect properly. It&rsquo;s the translation layer that lets Kubernetes manage Ray clusters without understanding Ray&rsquo;s internal semantics.</p>
<hr>
<h2 id="the-reconciliation-dance">The Reconciliation Dance</h2>
<p>When you <code>kubectl apply</code> a RayService manifest, you trigger a cascade that touches every layer of the stack. Understanding this sequence reveals how control flows through the system.</p>
<p><strong>Phase 1: KubeRay Operator Activation</strong></p>
<p>The KubeRay operator runs as a deployment in your cluster, watching for changes to Ray CRDs. When it detects your new RayService, its reconciliation loop activates. The operator compares desired state (your manifest) against actual state (what&rsquo;s running) and generates a plan to converge them.</p>
<p><strong>Phase 2: Head Node Creation</strong></p>
<p>First, the operator creates the Ray head pod. This pod runs the Global Control Store (GCS) on port 6379, a distributed metadata store that tracks cluster membership, resource availability, and actor locations. The head also exposes the Ray Dashboard on port 8265 for observability.</p>
<p>The head pod needs to be running and healthy before workers can join. KubeRay handles this sequencing automatically, using Kubernetes&rsquo; built-in readiness probes to gate worker creation.</p>
<p><strong>Phase 3: Worker Pod Launch</strong></p>
<p>Once the head is ready, the operator creates worker pods. Each worker&rsquo;s entrypoint executes <code>ray start --address=&lt;head-ip&gt;:6379</code>, connecting to the head&rsquo;s GCS. This is where the Kubernetes and Ray worlds first touch: Kubernetes schedules the pod, but Ray handles what happens inside.</p>
<p><strong>Phase 4: Resource Discovery</strong></p>
<p>Inside each worker pod, the Raylet process inspects its environment. It discovers available GPUs through CUDA, determines memory capacity, and inventories other resources. This information flows back to the GCS, which maintains a global resource table.</p>
<p><strong>Phase 5: Cluster Ready</strong></p>
<p>When all workers have connected and advertised their resources, the Ray cluster is ready. The GCS now has a complete picture: which nodes exist, what resources each has, and how to reach them. Ray Serve can start accepting deployment requests.</p>
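<p>The phases above boil down to a compare-and-converge loop. The sketch below is a toy model of that idea, not KubeRay&rsquo;s actual code: <code>ClusterState</code> and <code>plan_actions</code> are hypothetical names, and the cluster is reduced to a head-readiness flag and a worker count.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical, simplified cluster view: head health plus a worker count."""
    head_ready: bool
    workers: int


def plan_actions(desired_workers: int, actual: ClusterState) -> list[str]:
    """Compare desired and actual state and return the next actions to take."""
    if not actual.head_ready:
        # Workers are gated on head readiness, so nothing else is planned yet.
        return ["ensure head pod is running and ready"]
    missing = max(0, desired_workers - actual.workers)
    surplus = max(0, actual.workers - desired_workers)
    return ["create worker pod"] * missing + ["delete worker pod"] * surplus


# Each pass moves the cluster one step closer; an empty plan means converged.
print(plan_actions(2, ClusterState(head_ready=False, workers=0)))
print(plan_actions(2, ClusterState(head_ready=True, workers=0)))
print(plan_actions(2, ClusterState(head_ready=True, workers=2)))
```

<p>The useful property is idempotence: running the planner again on an already-converged cluster yields no actions, which is what lets an operator re-run it on every change event.</p>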
<!--
VISUALIZATION: "The Reconciliation Sequence"
Type: Sequence diagram (left-to-right timeline)

Actors (vertical lanes):
- KubeRay Operator
- K8s API Server
- Head Pod
- Worker Pods
- Raylets
- GCS

Events (horizontal arrows):
1. "Apply RayService" → Operator
2. Operator → K8s API: "Create Head StatefulSet"
3. K8s API → Head Pod: "Schedule & Start"
4. Head Pod: "Start GCS on :6379"
5. Operator → K8s API: "Create Worker Pods"
6. K8s API → Worker Pods: "Schedule & Start"
7. Worker Pods → Raylets: "ray start --address=head:6379"
8. Raylets: "Discover GPUs via CUDA"
9. Raylets → GCS: "Advertise resources"
10. GCS: "Cluster Ready"

Highlight the moment Raylets discover GPUs and advertise to GCS - this is where hardware topology becomes visible to the orchestration layer.
-->
<p>When vLLM initializes with <code>--tensor-parallel-size 8</code>, it needs to transform this general-purpose Ray cluster into a coordinated inference machine.</p>
<p><strong>vLLM Initialization Sequence:</strong></p>
<ol>
<li>
<p><strong>Cluster Connection</strong>: vLLM&rsquo;s <code>RayGPUExecutor</code> calls <code>initialize_ray_cluster()</code>, connecting to the existing Ray cluster or starting a new one.</p>
</li>
<li>
<p><strong>Placement Group Creation</strong>: vLLM creates a placement group with the specification <code>[{&quot;GPU&quot;: 1}] * 8</code>, which means eight bundles, each requiring one GPU. The placement strategy is <code>STRICT_PACK</code>, meaning all bundles must land on a single node.</p>
</li>
<li>
<p><strong>GCS Scheduling</strong>: The GCS consults its resource table. Can any single node satisfy eight GPU bundles? If yes, it reserves those resources atomically. If no, the placement group creation fails. Better to fail fast than scatter actors across nodes.</p>
</li>
<li>
<p><strong>Actor Spawning</strong>: vLLM spawns <code>RayWorkerWrapper</code> actors inside the placement group. Each actor gets assigned to a specific bundle, guaranteeing GPU affinity. Ray sets <code>CUDA_VISIBLE_DEVICES</code> appropriately so each worker sees only its assigned GPU.</p>
</li>
<li>
<p><strong>Process Group Initialization</strong>: Each worker calls <code>torch.distributed.init_process_group(backend='nccl')</code>. This creates the NCCL communicator that will handle all tensor data movement.</p>
</li>
<li>
<p><strong>NCCL Ring Formation</strong>: NCCL establishes its communication topology (typically ring or tree patterns optimized for the underlying hardware). From this point forward, tensor data flows through NCCL, completely bypassing Ray&rsquo;s object store.</p>
</li>
</ol>
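<p>Steps 4 and 5 can be pictured as each worker receiving its own small environment before it initializes NCCL. This is a rough sketch under assumptions: <code>worker_env</code> is a hypothetical helper, the address and port are placeholders, and <code>RANK</code>, <code>WORLD_SIZE</code>, <code>MASTER_ADDR</code>, and <code>MASTER_PORT</code> are the variables PyTorch reads for its environment-variable rendezvous. vLLM&rsquo;s real wiring may pass this information differently.</p>

```python
def worker_env(rank: int, world_size: int, gpu_id: int,
               master_addr: str = "127.0.0.1", master_port: int = 29500) -> dict[str, str]:
    """Environment one worker would see before creating its process group."""
    return {
        # Ray sets this one itself so the worker sees only its assigned GPU.
        "CUDA_VISIBLE_DEVICES": str(gpu_id),
        # Rendezvous variables for torch.distributed (placeholder values).
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


# One bundle, one GPU, one rank: eight workers for tensor-parallel size 8.
envs = [worker_env(rank=i, world_size=8, gpu_id=i) for i in range(8)]
print(envs[3]["CUDA_VISIBLE_DEVICES"], envs[3]["RANK"])
```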
<!--
VISUALIZATION: "The NCCL Handoff"
Type: Two-phase state transition diagram

Phase 1 - "Ray Orchestrates Placement":
- Show Driver connecting to GCS
- GCS connecting to Raylets on a node
- Raylets spawning 8 actor boxes inside a placement group boundary
- All connections shown as dashed lines (control plane)
- Label: "Ray handles: placement, GPU assignment, actor lifecycle"

Phase 2 - "NCCL Handles Data":
- Same 8 actors, but now connected in a ring pattern
- Bold lines between actors representing NCCL connections
- Ray/GCS components grayed out or faded
- Label: "NCCL handles: AllReduce, tensor movement, synchronization"

Key moment callout: torch.distributed.init_process_group(backend='nccl')
This is the transition point where Ray's job is essentially done for the data path.
Show that Ray remains for: health monitoring, actor restart on failure, metrics collection
-->
<p>Here&rsquo;s how the handoff works: Ray&rsquo;s job is setup and supervision. Once the NCCL rings form, Ray steps aside for the performance-critical path. Tensor data never touches Ray&rsquo;s object store. It flows directly between GPUs over NVLink or the network fabric. Ray remains involved for health monitoring, actor lifecycle management, and metrics collection, but it&rsquo;s out of the hot path.</p>
<hr>
<h2 id="why-strict_pack-changes-everything">Why STRICT_PACK Changes Everything</h2>
<p>Placement groups are Ray&rsquo;s mechanism for expressing scheduling constraints that go beyond &ldquo;find me a node with resources.&rdquo; For distributed inference, they determine whether your system performs at full speed or crawls.</p>
<p>Consider what happens without placement constraints. You request 8 GPU actors. Ray&rsquo;s default scheduler might place 4 on Node A and 4 on Node B. Both nodes have available GPUs, the request is satisfied, everyone&rsquo;s happy. Except they&rsquo;re not.</p>
<p><strong>The Disaster Scenario:</strong></p>
<p>With tensor parallelism, every transformer layer requires two AllReduce operations (one after attention, one after the FFN) to synchronize partial results across all GPUs. For Llama-70B with 80 layers, that&rsquo;s 160 AllReduce calls per forward pass. Each AllReduce has to combine data from every GPU in the group, so the slowest link between any two of them sets the pace.</p>
<p>When all 8 GPUs are on one node connected via NVLink:</p>
<ul>
<li>Bandwidth: ~900 GB/s bidirectional</li>
<li>AllReduce latency: microseconds</li>
</ul>
<p>When 4 GPUs are on Node A and 4 on Node B, connected via datacenter Ethernet:</p>
<ul>
<li>Bandwidth: ~10-25 GB/s (100GbE tops out at 12.5 GB/s)</li>
<li>AllReduce latency: milliseconds</li>
</ul>
<p>The performance difference is stark. You&rsquo;re looking at a <strong>40-90x bandwidth reduction</strong> for every AllReduce. For interactive inference where you need responses in hundreds of milliseconds, this makes the system unusable. Even at the 40x end, a communication-bound 50 ms operation becomes a 2-second operation.</p>
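<p>To make the arithmetic concrete, here is a minimal sketch that derives the slowdown from the bandwidth figures quoted above. It assumes the operation is purely bandwidth-bound (real AllReduce latency also includes per-message overhead), and <code>slowdown</code> is a hypothetical helper name.</p>

```python
NVLINK_GB_PER_S = 900.0            # intra-node figure quoted above
ETHERNET_GB_PER_S = (10.0, 25.0)   # cross-node range quoted above


def slowdown(fast_gb_per_s: float, slow_gb_per_s: float) -> float:
    """How many times longer a bandwidth-bound transfer takes on the slow link."""
    return fast_gb_per_s / slow_gb_per_s


for eth in ETHERNET_GB_PER_S:
    factor = slowdown(NVLINK_GB_PER_S, eth)
    # Scale a 50 ms communication-bound operation by the same factor.
    seconds = 50 * factor / 1000
    print(f"{eth} GB/s: {factor:.0f}x slower, 50 ms becomes {seconds:.1f} s")
```

<p>The two endpoints come out at 90x and 36x, which is where the rounded 40-90x range comes from.</p>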
<p><strong>STRICT_PACK to the Rescue:</strong></p>
<p>The <code>STRICT_PACK</code> placement strategy provides an atomic guarantee: &ldquo;Reserve all N bundles on a single node. If no single node can satisfy the request, schedule none of them.&rdquo;</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Conceptual placement group specification</span>
</span></span><span style="display:flex;"><span>placement_group <span style="color:#f92672">=</span> ray<span style="color:#f92672">.</span>util<span style="color:#f92672">.</span>placement_group(
</span></span><span style="display:flex;"><span>    bundles<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;GPU&#34;</span>: <span style="color:#ae81ff">1</span>}] <span style="color:#f92672">*</span> <span style="color:#ae81ff">8</span>,
</span></span><span style="display:flex;"><span>    strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;STRICT_PACK&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>This is all-or-nothing. Either you get all 8 GPUs on one node with NVLink connectivity, or you get an error telling you no suitable node exists. No silent degradation to a broken configuration.</p>
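<p>The all-or-nothing rule is easy to model without a cluster. The toy function below is only a sketch of the semantics, not Ray&rsquo;s scheduler: <code>strict_pack</code> and the node names are hypothetical, only GPUs are counted, and it raises immediately, whereas a real scheduler may leave an unsatisfiable request pending until a caller times out.</p>

```python
def strict_pack(free_gpus_by_node: dict[str, int], bundles: list[dict[str, int]]) -> str:
    """Return the one node that can hold every bundle, or raise if none can."""
    needed = sum(bundle.get("GPU", 0) for bundle in bundles)
    for node, free in free_gpus_by_node.items():
        if free >= needed:
            return node  # every bundle is reserved together on this node
    # No partial placement: nothing is reserved if the group does not fit.
    raise RuntimeError(f"no single node has {needed} free GPUs")


bundles = [{"GPU": 1}] * 8
print(strict_pack({"node-a": 4, "node-b": 8}, bundles))

try:
    strict_pack({"node-a": 4, "node-b": 4}, bundles)
except RuntimeError as err:
    print("placement failed:", err)
```

<p>Note that the second cluster has 8 free GPUs in total and still fails. That refusal is the whole point of the strategy.</p>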
<p><strong>SPREAD for Pipeline Parallelism:</strong></p>
<p>Not all parallelism strategies want STRICT_PACK. Pipeline parallelism deliberately spans multiple nodes, with each node handling different layers of the model. Here, <code>SPREAD</code> makes sense: you want actors distributed across nodes to maximize aggregate memory capacity.</p>
<p>The communication pattern differs too. Pipeline parallelism uses point-to-point sends between adjacent stages, not AllReduce across all participants. This is less latency-sensitive because you&rsquo;re overlapping computation with communication: while stage N processes one micro-batch, stage N-1 can already be computing and sending the next one.</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>Use Case</th>
          <th>Communication</th>
          <th>When to Use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>STRICT_PACK</td>
          <td>Tensor Parallelism</td>
          <td>AllReduce (collective across all GPUs)</td>
          <td>Same-node NVLink required</td>
      </tr>
      <tr>
          <td>SPREAD</td>
          <td>Pipeline Parallelism</td>
          <td>Point-to-point</td>
          <td>Memory &gt; latency</td>
      </tr>
      <tr>
          <td>PACK</td>
          <td>Mixed workloads</td>
          <td>Varies</td>
          <td>Prefer colocation, allow spread</td>
      </tr>
  </tbody>
</table>
<!--
VISUALIZATION: "Placement Strategies Comparison"
Type: Side-by-side comparison diagram

LEFT SIDE - "STRICT_PACK (Correct for TP=8)":
- Single server box containing 8 GPU icons
- All GPUs connected with thick green lines (NVLink)
- Label: "~900 GB/s via NVLink"
- Checkmark icon, happy state

RIGHT SIDE - "Accidental Split (Wrong)":
- Two server boxes
- Node A: 4 GPU icons
- Node B: 4 GPU icons
- Thin red/orange line between nodes labeled "Ethernet"
- Label: "~10-25 GB/s via Ethernet"
- Warning/X icon, broken state
- Text: "40-90x slower AllReduce"

Bottom text for STRICT_PACK: "Guarantee: All 8 bundles on one node, or placement fails"
Visual emphasis: Make the wrong scenario look painful - perhaps show a snail or clock icon
-->
<p>The placement group abstraction is what lets vLLM express &ldquo;I need these actors to be co-located&rdquo; without knowing anything about Kubernetes node topology. Ray&rsquo;s GCS has that knowledge (from Raylet resource advertisements), and the placement group mechanism lets vLLM leverage it declaratively.</p>
<hr>
<h2 id="two-interfaces-two-purposes">Two Interfaces, Two Purposes</h2>
<p>Even with correct placement, there&rsquo;s another way to tank your inference performance: letting NCCL traffic flow over the wrong network interface.</p>
<p>Production GPU nodes typically have multiple network interfaces:</p>
<ul>
<li>
<p><strong>eth0</strong>: The standard Kubernetes pod network. Usually an overlay network (Calico, Cilium, Flannel) that provides cluster connectivity, DNS, and service discovery. Fine for control plane traffic: health probes, metrics scraping, Ray GCS heartbeats.</p>
</li>
<li>
<p><strong>net1/ib0/bond0</strong>: A high-performance interface connected to InfiniBand or RoCE fabric. This is your data plane, purpose-built for moving large tensors between nodes at 100-400 Gb/s with microsecond latencies.</p>
</li>
</ul>
<p>The problem: NCCL doesn&rsquo;t automatically know which interface to use. By default, it may discover eth0 first and decide that&rsquo;s the interface for collective operations. Your carefully provisioned InfiniBand fabric sits idle while tensor data crawls through the overlay network.</p>
<p><strong>The key environment variable:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>NCCL_SOCKET_IFNAME<span style="color:#f92672">=</span>net1
</span></span></code></pre></div><p>This tells NCCL explicitly which interface to use for socket-based communication. For InfiniBand with RDMA, you&rsquo;d also set:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>NCCL_IB_HCA<span style="color:#f92672">=</span>mlx5_0
</span></span></code></pre></div><p>In Kubernetes, you expose multiple interfaces to pods using <strong>Multus CNI</strong>, a meta-plugin that lets you attach additional networks beyond the default pod network. Your pod spec includes annotations requesting attachment to the high-speed network:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">annotations</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">k8s.v1.cni.cncf.io/networks</span>: <span style="color:#ae81ff">high-speed-net</span>
</span></span></code></pre></div><p>The result is a pod with two interfaces: eth0 for Kubernetes integration, and net1 for NCCL traffic. Control plane and data plane are cleanly separated.</p>
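<p>If you launch workers from Python rather than setting these variables in the pod spec, the same settings can be applied in-process. A minimal sketch, assuming the interface and adapter names used above (<code>net1</code>, <code>mlx5_0</code>) match your nodes; <code>nccl_env</code> is a hypothetical helper.</p>

```python
import os
from typing import Optional


def nccl_env(socket_ifname: str, ib_hca: Optional[str] = None) -> dict[str, str]:
    """Build the NCCL settings that pin traffic to the data-plane interface."""
    env = {"NCCL_SOCKET_IFNAME": socket_ifname}
    if ib_hca is not None:
        # Only relevant when an InfiniBand/RoCE adapter is present.
        env["NCCL_IB_HCA"] = ib_hca
    return env


# Apply before the process group is created, since NCCL reads these at init.
os.environ.update(nccl_env("net1", ib_hca="mlx5_0"))
print(os.environ["NCCL_SOCKET_IFNAME"], os.environ["NCCL_IB_HCA"])
```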
<!--
VISUALIZATION: "Dual Network Paths"
Type: Network diagram showing single pod with two interfaces

Central element: Pod box containing "vLLM Worker" and "Raylet"

Interface 1 (eth0):
- Connects to cloud shape labeled "Kubernetes Overlay Network"
- Traffic types listed: "Ray GCS heartbeats, HTTP health probes, Prometheus metrics, K8s API"
- Label: "Control Plane"
- Shown with thinner lines, perhaps blue

Interface 2 (net1):
- Connects to rectangular shape labeled "InfiniBand/RoCE Fabric"
- Traffic types listed: "NCCL AllReduce, tensor data, KV cache transfers"
- Label: "Data Plane"
- Shown with thick lines, perhaps green

Environment variable callout box: NCCL_SOCKET_IFNAME=net1

Warning callout: "Without explicit config, NCCL may use eth0 by mistake → 10-100x slower"

Optional: Show multiple pods connected through the InfiniBand fabric for multi-node scenarios
-->
<p><strong>Why This Matters for Multi-Node:</strong></p>
<p>For single-node tensor parallelism, NVLink handles everything and network configuration is less critical. But the moment you scale beyond one node, whether for pipeline parallelism, larger tensor-parallel groups, or disaggregated serving, network configuration becomes essential.</p>
<p>A properly configured InfiniBand fabric can deliver 400 Gb/s (50 GB/s) per port with single-digit microsecond latencies. The Kubernetes overlay network, even with modern CNIs, typically maxes out at 10-25 Gb/s with millisecond-scale latencies. For operations that happen 160 times per forward pass, this difference compounds dramatically.</p>
<hr>
<h2 id="choosing-your-communication-pattern">Choosing Your Communication Pattern</h2>
<p>We&rsquo;ve seen how network configuration can make or break performance. <em>Why</em> it matters so much depends on which parallelism strategy you&rsquo;re using: each strategy creates a fundamentally different communication pattern, and those patterns are what make the orchestration decisions above matter.</p>
<h3 id="tensor-parallelism-the-allreduce-pattern">Tensor Parallelism: The AllReduce Pattern</h3>
<p>Tensor parallelism shards weight matrices across GPUs within a layer. Each GPU computes a partial result, then all GPUs synchronize via AllReduce to combine their contributions.</p>
<p>Ray&rsquo;s responsibilities:</p>
<ul>
<li>Create STRICT_PACK placement group</li>
<li>Spawn workers with correct GPU assignments</li>
<li>Set <code>CUDA_VISIBLE_DEVICES</code> per worker</li>
<li>Monitor actor health, restart on failure</li>
</ul>
<p>What Ray doesn&rsquo;t do:</p>
<ul>
<li>Manage AllReduce operations (that&rsquo;s NCCL)</li>
<li>Move tensor data (that flows through NVLink/NCCL)</li>
<li>The object store is bypassed entirely for the hot path</li>
</ul>
<p><strong>The Communication Reality:</strong></p>
<p>For an 80-layer model, tensor parallelism requires <strong>160 AllReduce operations per forward pass</strong> (2 per layer: one after attention, one after the FFN). Each AllReduce synchronizes a tensor of shape <code>[batch_size, seq_len, hidden_dim]</code>. With Llama-70B&rsquo;s hidden dimension of 8192 and a prefill batch of 32 sequences at 2048 tokens, that is 32 × 2048 × 8192 ≈ 537 million elements, or ~1 GB per AllReduce at 2 bytes per element (FP16/BF16).</p>
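<p>The payload figure is worth checking by hand. This sketch assumes 2 bytes per element (FP16 or BF16), which is an assumption on top of the shape given above; <code>allreduce_payload_bytes</code> is a hypothetical helper.</p>

```python
def allreduce_payload_bytes(batch_size: int, seq_len: int, hidden_dim: int,
                            bytes_per_element: int = 2) -> int:
    """Size of one [batch_size, seq_len, hidden_dim] activation tensor in bytes."""
    return batch_size * seq_len * hidden_dim * bytes_per_element


payload = allreduce_payload_bytes(batch_size=32, seq_len=2048, hidden_dim=8192)
per_forward_pass = 160 * payload  # 2 AllReduce calls per layer, 80 layers

print(f"per AllReduce:    {payload / 2**30:.1f} GiB")
print(f"per forward pass: {per_forward_pass / 2**30:.0f} GiB")
```

<p>During decode, <code>seq_len</code> is effectively 1 per step, so the per-call payload shrinks sharply and latency rather than bandwidth tends to dominate.</p>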
<p>AllReduce has ring and tree implementations. Ring AllReduce on 8 GPUs runs a reduce-scatter followed by an all-gather, and in each phase every GPU sends and receives <code>7/8</code> of the tensor. That is 1.75 tensor-sizes of traffic per GPU, or roughly 14 full tensor transfers across the ring per operation. The only way this is fast is with NVLink&rsquo;s 900 GB/s bandwidth.</p>
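<p>The arithmetic can be checked with a short back-of-the-envelope script. It assumes FP16 activations (2 bytes per element) and the textbook ring AllReduce cost, counting both the reduce-scatter and the all-gather phase. The function names are made up for this sketch, and the timing line is a bandwidth-only lower bound that ignores latency and protocol overhead.</p>

```python
def allreduce_tensor_bytes(batch_size: int, seq_len: int, hidden_dim: int,
                           bytes_per_element: int = 2) -> int:
    """Size of one [batch_size, seq_len, hidden_dim] activation tensor."""
    return batch_size * seq_len * hidden_dim * bytes_per_element


def ring_bytes_per_gpu(tensor_bytes: int, num_gpus: int) -> float:
    """Bytes each GPU sends in a ring AllReduce.

    Reduce-scatter and all-gather each move (N - 1) / N of the tensor.
    """
    return 2 * (num_gpus - 1) / num_gpus * tensor_bytes


tensor_bytes = allreduce_tensor_bytes(batch_size=32, seq_len=2048, hidden_dim=8192)
per_gpu = ring_bytes_per_gpu(tensor_bytes, num_gpus=8)
allreduces_per_pass = 2 * 80  # two per layer, 80 layers

print(f"tensor size:      {tensor_bytes / 1e9:.2f} GB")
print(f"sent per GPU:     {per_gpu / 1e9:.2f} GB per AllReduce")
print(f"per forward pass: {allreduces_per_pass * per_gpu / 1e9:.0f} GB sent per GPU")
print(f"time at 900 GB/s: {per_gpu / 900e9 * 1e3:.2f} ms per AllReduce")
```

<p>Swapping 900 GB/s for a 10-25 Gb/s overlay network in the last line shows why tensor parallelism is normally kept inside a single NVLink domain.</p>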
<h3 id="pipeline-parallelism-the-point-to-point-pattern">Pipeline Parallelism: The Point-to-Point Pattern</h3>
<p>Pipeline parallelism assigns different layers to different GPUs (or groups of GPUs). Data flows through stages sequentially: Stage 0 processes the input, sends activations to Stage 1, which processes and sends to Stage 2, and so on.</p>
<p><strong>Orchestration Differences:</strong></p>
<p>Ray creates a placement group that may span nodes (SPREAD rather than STRICT_PACK). Each stage gets its own bundle, and stages communicate via point-to-point sends rather than collective operations.</p>
<p><strong>The Bubble Problem:</strong></p>
<p>Pure pipeline parallelism has a fundamental inefficiency. While Stage 0 processes micro-batch 1, Stages 1-7 sit idle. By the time Stage 7 processes micro-batch 1, Stages 0-6 are idle again unless later micro-batches are already in flight (in training, they would also be waiting on backward-pass dependencies).</p>
<p>The <strong>bubble ratio</strong> quantifies this waste:</p>
$$\text{bubble ratio} = \frac{p - 1}{m + p - 1}$$<p>Where $p$ is the number of pipeline stages and $m$ is the number of micro-batches. With 8 stages and 8 micro-batches, you lose 7/15 ≈ 47% of potential throughput to bubbles.</p>
<blockquote>
<p>Think of it this way: <strong>p</strong> is how many slices you cut the model into, <strong>m</strong> is how many micro-batches you keep in flight. With 8 pipeline stages but only 1 micro-batch, 7 out of 8 stages are always waiting, so 87.5% of compute is lost to bubbles. With 64 micro-batches, that drops to ~10%. The lesson: pipeline parallelism only pays off with large batches.</p></blockquote>
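<p>A minimal sketch of the bubble formula, cross-checked against a brute-force count of idle slots. It assumes a simple forward-only schedule in which stage <code>s</code> works on micro-batch <code>t - s</code> at time step <code>t</code>. The function names are illustrative.</p>

```python
def bubble_ratio(p: int, m: int) -> float:
    """Closed form: (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)


def simulated_idle_fraction(p: int, m: int) -> float:
    """Count idle (stage, time step) slots in a simple forward-only schedule.

    Stage s works on micro-batch (t - s) at time step t when that index is valid.
    """
    steps = m + p - 1
    idle = sum(
        1
        for stage in range(p)
        for t in range(steps)
        if not 0 <= t - stage < m
    )
    return idle / (p * steps)


for p, m in [(8, 1), (8, 8), (8, 64)]:
    # The closed form and the slot count should agree.
    assert abs(bubble_ratio(p, m) - simulated_idle_fraction(p, m)) < 1e-12
    print(f"p={p}, m={m}: {bubble_ratio(p, m):.1%} idle")
```

<p>The brute-force count agrees with the closed form because each of the <code>p</code> stages is idle for exactly <code>p - 1</code> of the <code>m + p - 1</code> time steps.</p>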
<p>Continuous batching helps by keeping the pipeline fed with new requests, but the fundamental tradeoff remains: pipeline parallelism trades AllReduce bandwidth requirements for pipeline bubbles.</p>




<style>
.pbv-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: linear-gradient(180deg, #0f172a 0%, #1e293b 100%);
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #e2e8f0;
}

.pbv-c90c0da50bbebaa8b9e35fe5cf4aa5c4 * {
  box-sizing: border-box;
}

.pbv-header-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  text-align: center;
  margin-bottom: 1.5rem;
}

.pbv-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.5rem;
}

.pbv-subtitle-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.9rem;
  color: #94a3b8;
  max-width: 550px;
  margin: 0 auto;
}

 
.pbv-controls-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  justify-content: center;
  align-items: center;
  gap: 2rem;
  margin-bottom: 1.5rem;
  flex-wrap: wrap;
}

.pbv-control-group-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  align-items: center;
  gap: 0.75rem;
}

.pbv-control-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.8rem;
  color: #94a3b8;
  font-weight: 500;
}

.pbv-slider-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  -webkit-appearance: none;
  width: 120px;
  height: 6px;
  border-radius: 3px;
  background: #334155;
  outline: none;
}

.pbv-slider-c90c0da50bbebaa8b9e35fe5cf4aa5c4::-webkit-slider-thumb {
  -webkit-appearance: none;
  width: 18px;
  height: 18px;
  border-radius: 50%;
  background: #3b82f6;
  cursor: pointer;
  transition: background 0.2s;
}

.pbv-slider-c90c0da50bbebaa8b9e35fe5cf4aa5c4::-webkit-slider-thumb:hover {
  background: #60a5fa;
}

.pbv-slider-c90c0da50bbebaa8b9e35fe5cf4aa5c4::-moz-range-thumb {
  width: 18px;
  height: 18px;
  border-radius: 50%;
  background: #3b82f6;
  cursor: pointer;
  border: none;
}

.pbv-value-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 1rem;
  font-weight: 700;
  color: #f8fafc;
  min-width: 24px;
  text-align: center;
}

.pbv-btn-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  padding: 0.5rem 1rem;
  border-radius: 6px;
  border: 1px solid #3b82f6;
  background: rgba(59, 130, 246, 0.1);
  color: #60a5fa;
  font-size: 0.8rem;
  font-weight: 600;
  cursor: pointer;
  transition: all 0.2s;
}

.pbv-btn-c90c0da50bbebaa8b9e35fe5cf4aa5c4:hover {
  background: rgba(59, 130, 246, 0.2);
}

.pbv-btn-c90c0da50bbebaa8b9e35fe5cf4aa5c4.active {
  background: #3b82f6;
  color: #fff;
}

 
.pbv-stats-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  justify-content: center;
  gap: 2rem;
  margin-bottom: 1.5rem;
  flex-wrap: wrap;
}

.pbv-stat-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  text-align: center;
  padding: 0.75rem 1.5rem;
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 8px;
}

.pbv-stat-value-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
}

.pbv-stat-value-c90c0da50bbebaa8b9e35fe5cf4aa5c4.bubble {
  color: #f87171;
}

.pbv-stat-value-c90c0da50bbebaa8b9e35fe5cf4aa5c4.compute {
  color: #4ade80;
}

.pbv-stat-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.7rem;
  color: #64748b;
  text-transform: uppercase;
  letter-spacing: 0.05em;
  margin-top: 0.25rem;
}

 
.pbv-timeline-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  gap: 0.5rem;
  overflow-x: auto;
  padding-bottom: 0.5rem;
}

.pbv-y-axis-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  flex-direction: column;
  gap: 2px;
  padding-top: 28px;
}

.pbv-stage-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  height: 36px;
  display: flex;
  align-items: center;
  justify-content: flex-end;
  padding-right: 0.5rem;
  font-size: 0.7rem;
  color: #94a3b8;
  white-space: nowrap;
}

.pbv-grid-container-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  flex: 1;
  min-width: 0;
}

.pbv-x-axis-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  gap: 2px;
  margin-bottom: 4px;
  padding-left: 0;
}

.pbv-time-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  width: 36px;
  text-align: center;
  font-size: 0.6rem;
  color: #64748b;
}

.pbv-grid-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  flex-direction: column;
  gap: 2px;
}

.pbv-row-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  gap: 2px;
}

.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  width: 36px;
  height: 36px;
  border-radius: 4px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.65rem;
  font-weight: 600;
  transition: all 0.3s ease;
  position: relative;
}

.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4.bubble {
  background: rgba(248, 113, 113, 0.15);
  border: 1px dashed rgba(248, 113, 113, 0.4);
  color: #f87171;
}

.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4.active {
  border: 1px solid;
  animation: pbv-pulse-c90c0da50bbebaa8b9e35fe5cf4aa5c4 0.5s ease-out;
}

@keyframes pbv-pulse-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  0% { transform: scale(1.1); }
  100% { transform: scale(1); }
}

 
.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4.m0 { background: rgba(59, 130, 246, 0.8); border-color: #3b82f6; color: #fff; }
.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4.m1 { background: rgba(168, 85, 247, 0.8); border-color: #a855f7; color: #fff; }
.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4.m2 { background: rgba(236, 72, 153, 0.8); border-color: #ec4899; color: #fff; }
.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4.m3 { background: rgba(34, 197, 94, 0.8); border-color: #22c55e; color: #fff; }
.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4.m4 { background: rgba(245, 158, 11, 0.8); border-color: #f59e0b; color: #fff; }
.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4.m5 { background: rgba(6, 182, 212, 0.8); border-color: #06b6d4; color: #fff; }
.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4.m6 { background: rgba(244, 63, 94, 0.8); border-color: #f43f5e; color: #fff; }
.pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4.m7 { background: rgba(132, 204, 22, 0.8); border-color: #84cc16; color: #fff; }

 
.pbv-legend-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  justify-content: center;
  gap: 1.5rem;
  margin-top: 1rem;
  flex-wrap: wrap;
}

.pbv-legend-item-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  font-size: 0.75rem;
  color: #94a3b8;
}

.pbv-legend-box-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  width: 16px;
  height: 16px;
  border-radius: 3px;
}

.pbv-legend-box-c90c0da50bbebaa8b9e35fe5cf4aa5c4.bubble {
  background: rgba(248, 113, 113, 0.15);
  border: 1px dashed rgba(248, 113, 113, 0.4);
}

.pbv-legend-box-c90c0da50bbebaa8b9e35fe5cf4aa5c4.micro {
  background: rgba(59, 130, 246, 0.8);
  border: 1px solid #3b82f6;
}

 
.pbv-formula-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  margin-top: 1.5rem;
  padding: 1rem;
  background: rgba(15, 23, 42, 0.6);
  border: 1px solid #334155;
  border-radius: 10px;
  text-align: center;
}

.pbv-formula-text-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
  font-size: 0.9rem;
  color: #cbd5e1;
  font-family: 'SF Mono', 'Monaco', 'Inconsolata', monospace;
}

.pbv-formula-text-c90c0da50bbebaa8b9e35fe5cf4aa5c4 .highlight {
  color: #f87171;
  font-weight: 600;
}

.pbv-formula-text-c90c0da50bbebaa8b9e35fe5cf4aa5c4 .stages {
  color: #60a5fa;
}

.pbv-formula-text-c90c0da50bbebaa8b9e35fe5cf4aa5c4 .microbatches {
  color: #4ade80;
}

 
@media (max-width: 600px) {
  .pbv-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    padding: 1.25rem;
  }

  .pbv-controls-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    gap: 1rem;
  }

  .pbv-stats-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    gap: 1rem;
  }

  .pbv-stat-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    padding: 0.5rem 1rem;
  }

  .pbv-cell-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    width: 28px;
    height: 28px;
    font-size: 0.55rem;
  }

  .pbv-time-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4 {
    width: 28px;
  }
}
</style>

<div class="pbv-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
  <div class="pbv-header-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
    <div class="pbv-title-c90c0da50bbebaa8b9e35fe5cf4aa5c4">The Pipeline Bubble Problem</div>
    <div class="pbv-subtitle-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Stages sit idle during startup and drain phases. More micro-batches reduce bubble overhead.</div>
  </div>

  <div class="pbv-controls-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
    <div class="pbv-control-group-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      <span class="pbv-control-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Pipeline Stages (p):</span>
      <input type="range" class="pbv-slider-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="pbv-stages-c90c0da50bbebaa8b9e35fe5cf4aa5c4" min="2" max="8" value="4">
      <span class="pbv-value-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="pbv-stages-val-c90c0da50bbebaa8b9e35fe5cf4aa5c4">4</span>
    </div>
    <div class="pbv-control-group-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      <span class="pbv-control-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Micro-batches (m):</span>
      <input type="range" class="pbv-slider-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="pbv-micro-c90c0da50bbebaa8b9e35fe5cf4aa5c4" min="1" max="8" value="4">
      <span class="pbv-value-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="pbv-micro-val-c90c0da50bbebaa8b9e35fe5cf4aa5c4">4</span>
    </div>
    <button class="pbv-btn-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="pbv-animate-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Animate</button>
  </div>

  <div class="pbv-stats-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
    <div class="pbv-stat-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      <div class="pbv-stat-value-c90c0da50bbebaa8b9e35fe5cf4aa5c4 bubble" id="pbv-bubble-ratio-c90c0da50bbebaa8b9e35fe5cf4aa5c4">43%</div>
      <div class="pbv-stat-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Bubble Ratio</div>
    </div>
    <div class="pbv-stat-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      <div class="pbv-stat-value-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="pbv-total-steps-c90c0da50bbebaa8b9e35fe5cf4aa5c4">7</div>
      <div class="pbv-stat-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Total Time Steps</div>
    </div>
    <div class="pbv-stat-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      <div class="pbv-stat-value-c90c0da50bbebaa8b9e35fe5cf4aa5c4 compute" id="pbv-compute-c90c0da50bbebaa8b9e35fe5cf4aa5c4">16</div>
      <div class="pbv-stat-label-c90c0da50bbebaa8b9e35fe5cf4aa5c4">Compute Units</div>
    </div>
  </div>

  <div class="pbv-timeline-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
    <div class="pbv-y-axis-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="pbv-y-axis-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      
    </div>
    <div class="pbv-grid-container-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      <div class="pbv-x-axis-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="pbv-x-axis-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
        
      </div>
      <div class="pbv-grid-c90c0da50bbebaa8b9e35fe5cf4aa5c4" id="pbv-grid-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
        
      </div>
    </div>
  </div>

  <div class="pbv-legend-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
    <div class="pbv-legend-item-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      <div class="pbv-legend-box-c90c0da50bbebaa8b9e35fe5cf4aa5c4 bubble"></div>
      <span>Bubble (idle)</span>
    </div>
    <div class="pbv-legend-item-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      <div class="pbv-legend-box-c90c0da50bbebaa8b9e35fe5cf4aa5c4 micro"></div>
      <span>Micro-batch (active)</span>
    </div>
  </div>

  <div class="pbv-formula-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
    <div class="pbv-formula-text-c90c0da50bbebaa8b9e35fe5cf4aa5c4">
      Bubble Ratio = (<span class="stages" id="pbv-p-c90c0da50bbebaa8b9e35fe5cf4aa5c4">p</span> - 1) / (<span class="microbatches" id="pbv-m-c90c0da50bbebaa8b9e35fe5cf4aa5c4">m</span> + <span class="stages" id="pbv-p2-c90c0da50bbebaa8b9e35fe5cf4aa5c4">p</span> - 1) = <span class="highlight" id="pbv-calc-c90c0da50bbebaa8b9e35fe5cf4aa5c4">3/7 = 43%</span>
    </div>
  </div>
</div>

<script>
(function() {
  var id = 'c90c0da50bbebaa8b9e35fe5cf4aa5c4';

  var stagesSlider = document.getElementById('pbv-stages-' + id);
  var microSlider = document.getElementById('pbv-micro-' + id);
  var stagesVal = document.getElementById('pbv-stages-val-' + id);
  var microVal = document.getElementById('pbv-micro-val-' + id);
  var animateBtn = document.getElementById('pbv-animate-' + id);
  var grid = document.getElementById('pbv-grid-' + id);
  var xAxis = document.getElementById('pbv-x-axis-' + id);
  var yAxis = document.getElementById('pbv-y-axis-' + id);
  var bubbleRatio = document.getElementById('pbv-bubble-ratio-' + id);
  var totalSteps = document.getElementById('pbv-total-steps-' + id);
  var computeUnits = document.getElementById('pbv-compute-' + id);
  var pVal = document.getElementById('pbv-p-' + id);
  var mVal = document.getElementById('pbv-m-' + id);
  var p2Val = document.getElementById('pbv-p2-' + id);
  var calcVal = document.getElementById('pbv-calc-' + id);

  var isAnimating = false;
  var animationFrame = null;

  function buildGrid(p, m, animate) {
    
    var steps = m + p - 1;

    
    grid.innerHTML = '';
    xAxis.innerHTML = '';
    yAxis.innerHTML = '';

    
    for (var s = 0; s < p; s++) {
      var label = document.createElement('div');
      label.className = 'pbv-stage-label-' + id;
      label.textContent = 'Stage ' + s;
      yAxis.appendChild(label);
    }

    
    for (var t = 0; t < steps; t++) {
      var label = document.createElement('div');
      label.className = 'pbv-time-label-' + id;
      label.textContent = 't' + t;
      xAxis.appendChild(label);
    }

    
    
    
    

    var cells = [];

    for (var stage = 0; stage < p; stage++) {
      var row = document.createElement('div');
      row.className = 'pbv-row-' + id;

      for (var time = 0; time < steps; time++) {
        var cell = document.createElement('div');
        cell.className = 'pbv-cell-' + id;

        var mb = time - stage;
        if (mb >= 0 && mb < m) {
          
          cell.classList.add('m' + (mb % 8));
          cell.textContent = 'm' + mb;
          cell.dataset.mb = mb;
          cell.dataset.activeTime = time;
        } else {
          
          cell.classList.add('bubble');
        }

        cells.push({ el: cell, time: time, stage: stage, mb: mb });
        row.appendChild(cell);
      }

      grid.appendChild(row);
    }

    
    var bubbleCount = (p - 1);
    var ratio = bubbleCount / steps;
    var percent = Math.round(ratio * 100);

    bubbleRatio.textContent = percent + '%';
    totalSteps.textContent = steps;
    computeUnits.textContent = p * m;

    
    pVal.textContent = p;
    mVal.textContent = m;
    p2Val.textContent = p;
    calcVal.textContent = (p - 1) + '/' + steps + ' = ' + percent + '%';

    if (animate) {
      animateCells(cells, steps);
    }
  }

  function animateCells(cells, steps) {
    if (isAnimating) {
      cancelAnimationFrame(animationFrame);
    }

    isAnimating = true;
    animateBtn.classList.add('active');
    animateBtn.textContent = 'Playing...';

    
    cells.forEach(function(c) {
      if (c.mb >= 0 && c.mb < parseInt(microSlider.value)) {
        c.el.style.opacity = '0.2';
        c.el.classList.remove('active');
      }
    });

    var currentTime = -1;
    var speed = 500; 
    var lastUpdate = 0;

    function animate(timestamp) {
      if (!lastUpdate) lastUpdate = timestamp;

      if (timestamp - lastUpdate >= speed) {
        currentTime++;
        lastUpdate = timestamp;

        if (currentTime >= steps) {
          
          isAnimating = false;
          animateBtn.classList.remove('active');
          animateBtn.textContent = 'Animate';

          
          cells.forEach(function(c) {
            c.el.style.opacity = '1';
          });
          return;
        }

        
        cells.forEach(function(c) {
          if (c.time === currentTime && c.mb >= 0) {
            c.el.style.opacity = '1';
            c.el.classList.add('active');

            
            setTimeout(function() {
              c.el.classList.remove('active');
            }, 400);
          }
        });
      }

      animationFrame = requestAnimationFrame(animate);
    }

    animationFrame = requestAnimationFrame(animate);
  }

  function update(animate) {
    var p = parseInt(stagesSlider.value);
    var m = parseInt(microSlider.value);
    stagesVal.textContent = p;
    microVal.textContent = m;
    buildGrid(p, m, animate);
  }

  stagesSlider.addEventListener('input', function() { update(false); });
  microSlider.addEventListener('input', function() { update(false); });
  animateBtn.addEventListener('click', function() { update(true); });

  
  update(false);
})();
</script>

<h3 id="expert-parallelism-the-alltoall-pattern">Expert Parallelism: The AllToAll Pattern</h3>
<p>Mixture of Experts (MoE) models introduce a third communication pattern. Instead of every GPU needing data from every other GPU (AllReduce), or sequential point-to-point (pipeline), MoE requires <strong>AllToAll</strong>: each GPU sends different data to different destinations based on which expert each token routes to.</p>
<p>Expert parallelism is its own orchestration challenge. Unlike tensor parallelism&rsquo;s predictable AllReduce or pipeline&rsquo;s sequential handoffs, MoE communication is <strong>dynamic</strong>. A router network decides which tokens go to which experts <em>at runtime</em>, so the communication pattern changes every batch. Some experts run hot and receive hundreds of tokens while others sit cold and get none.</p>
<p>This dynamic routing breaks placement assumptions. You can&rsquo;t pre-plan which GPUs need to talk to which. Solutions like expert replication (placing hot experts on multiple GPUs) and capacity factors (limiting tokens per expert) add orchestration complexity. MoE deserves its own treatment, but the key insight here is: AllToAll with dynamic routing is fundamentally harder to orchestrate than static patterns.</p>
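<p>To make the capacity-factor idea concrete, here is a toy dispatcher. It assumes top-1 routing and simply drops overflow tokens. Real MoE systems may re-route or pad instead, and the routing list below is invented for the example.</p>

```python
from collections import Counter
from math import ceil
from typing import Dict, List, Tuple


def dispatch_with_capacity(
    expert_ids: List[int], num_experts: int, capacity_factor: float
) -> Tuple[Dict[int, List[int]], List[int]]:
    """Assign token indices to experts, dropping overflow past the capacity.

    expert_ids[i] is the expert the router picked for token i (top-1 routing).
    Returns (tokens kept per expert, indices of dropped tokens).
    """
    capacity = ceil(capacity_factor * len(expert_ids) / num_experts)
    kept: Dict[int, List[int]] = {e: [] for e in range(num_experts)}
    dropped: List[int] = []
    for token, expert in enumerate(expert_ids):
        if len(kept[expert]) < capacity:
            kept[expert].append(token)
        else:
            dropped.append(token)
    return kept, dropped


# Hypothetical router output for 8 tokens and 4 experts: expert 0 is "hot".
routing = [0, 0, 0, 0, 0, 1, 2, 0]
print("raw load:", dict(Counter(routing)))
kept, dropped = dispatch_with_capacity(routing, num_experts=4, capacity_factor=1.5)
print("kept per expert:", {e: len(t) for e, t in kept.items()})
print("dropped tokens:", dropped)
```

<p>The per-expert lists are what an AllToAll would ship to each expert&rsquo;s GPU. Their sizes change with every batch, which is exactly what makes the pattern hard to plan for.</p>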
<hr>
<h2 id="prefill-and-decode-dont-have-to-live-together">Prefill and Decode Don&rsquo;t Have to Live Together</h2>
<p>The inference phases we&rsquo;ve discussed (prefill and decode) have fundamentally different computational profiles:</p>
<ul>
<li><strong>Prefill</strong>: Process the entire prompt in parallel. Compute-bound. Benefits from high FLOPS.</li>
<li><strong>Decode</strong>: Generate tokens one at a time. Memory-bound. Benefits from high memory bandwidth.</li>
</ul>
<p>For most of LLM inference history, both phases ran on the same hardware. But there&rsquo;s no law of physics requiring this. <strong>Disaggregated serving</strong> splits them apart.</p>
<p><strong>The Architecture:</strong></p>
<ol>
<li><strong>Router</strong> receives incoming request, examines the prompt</li>
<li><strong>Prefill Pool</strong> (optimized for compute: H100s with maximum FLOPS) processes the prompt, generates initial KV cache</li>
<li><strong>KV Transfer</strong> moves the KV cache to the decode pool</li>
<li><strong>Decode Pool</strong> (optimized for memory bandwidth: could be A100s or L40S) generates tokens autoregressively</li>
<li>Response streams back to client</li>
</ol>
<p><strong>Why Bother?</strong></p>
<p>Different hardware, different economics. Prefill can run on fewer, more powerful GPUs because it&rsquo;s compute-bound. You&rsquo;re not paying for memory bandwidth you don&rsquo;t use. Decode can run on more, cheaper GPUs optimized for memory bandwidth.</p>
<p>The pools also scale independently. A sudden spike in long prompts? Scale up prefill. Many concurrent users generating responses? Scale up decode. The tight coupling of traditional serving forces you to scale both together.</p>
<p><strong>The KV Transfer Challenge:</strong></p>
<p>The catch is moving the KV cache. For Llama-70B with 128K context, the KV cache can reach 40+ GB per request. Moving that between pools is non-trivial.</p>
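<p>The 40+ GB figure can be reproduced with a few lines of arithmetic. The model shape used here (80 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16 cache) is an assumption about a Llama-70B-style configuration. The 50 GB/s link speed is the per-port InfiniBand figure quoted earlier.</p>

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: int = 2) -> int:
    """Bytes of KV cache for one sequence: keys and values for every layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * context_len


# Assumed Llama-70B-style shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                      context_len=128 * 1024)
print(f"KV cache per request: {size / 1e9:.1f} GB")

# Bandwidth-only lower bound on transfer time over a 50 GB/s link.
print(f"transfer at 50 GB/s:  {size / 50e9:.2f} s")
```

<p>Even at full line rate the transfer takes most of a second for a single request, which is why the transfer path gets so much engineering attention.</p>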
<p>Two approaches are emerging:</p>
<ul>
<li>
<p><strong>NIXL (NVIDIA Inference Xfer Library)</strong>: GPU-to-GPU RDMA transfers over InfiniBand/RoCE. Keeps data in GPU memory throughout, avoiding staging copies through host memory.</p>
</li>
<li>
<p><strong>LMCache / Shared Storage</strong>: Write KV cache to a fast shared storage layer (think distributed NVMe or GPU memory pooling). This enables &ldquo;context caching&rdquo;: compute popular prompts once, reuse across millions of requests.</p>
</li>
</ul>
<p>Context caching is particularly powerful for system prompts. If every request to your coding assistant starts with the same 8K-token system prompt, why recompute that KV cache for every request? Compute it once, cache it, and let decode instances reuse it.</p>
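<p>The compute-once, reuse-many idea can be sketched as a small cache keyed by a hash of the prefix tokens. This is a toy model, not how LMCache is implemented. <code>PrefixKVCache</code> and <code>fake_prefill</code> are hypothetical names, and the &ldquo;KV cache&rdquo; is a placeholder tuple rather than real tensors.</p>

```python
import hashlib
from typing import Callable, Dict, List, Tuple


class PrefixKVCache:
    """Toy context cache keyed by a hash of the prefix token IDs."""

    def __init__(self, compute_kv: Callable[[List[int]], object]):
        self._compute_kv = compute_kv
        self._store: Dict[str, object] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prefix_tokens: List[int]) -> str:
        raw = ",".join(map(str, prefix_tokens)).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, prefix_tokens: List[int]) -> object:
        key = self._key(prefix_tokens)
        if key in self._store:
            self.hits += 1
        else:
            # First time this prefix is seen: pay for prefill once.
            self.misses += 1
            self._store[key] = self._compute_kv(prefix_tokens)
        return self._store[key]


def fake_prefill(tokens: List[int]) -> Tuple[str, int]:
    # Stand-in for a real prefill pass; returns a placeholder instead of tensors.
    return ("kv-for", len(tokens))


cache = PrefixKVCache(fake_prefill)
system_prompt = list(range(8192))  # pretend 8K-token system prompt
for _ in range(3):
    cache.get(system_prompt)
print(f"hits={cache.hits}, misses={cache.misses}")
```

<p>Only the first request pays for prefill. A production cache would also need eviction and a way to share entries across machines.</p>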
<hr>
<h2 id="the-request-knows-where-to-go">The Request Knows Where to Go</h2>
<p>Traditional load balancing treats all requests as fungible. Round-robin, least-connections, random: they all assume any backend can handle any request equally well. For LLM inference with caching, this assumption is expensive.</p>
<p>If Request A and Request B share a common prefix (same system prompt, same few-shot examples), and Request A already warmed the KV cache on Pod 1, sending Request B to Pod 2 wastes the cache hit opportunity. You&rsquo;ll recompute the shared prefix unnecessarily.</p>
<p><strong>Prefix-Aware Routing:</strong></p>
<p>Ray Serve implements prefix-aware routing using a prefix tree of cached prefixes. The router maintains a lightweight index of which prefixes are cached on which replicas. When a request arrives, it matches the prompt against that index, finds the replica(s) holding the longest cached prefix, and routes accordingly.</p>
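<p>A toy version of the idea, assuming prompts are already tokenized into integer IDs. This is not Ray Serve&rsquo;s implementation. The class names and the alphabetical tie-break between replicas are illustrative choices.</p>

```python
from typing import Dict, List, Optional, Set, Tuple


class PrefixTreeNode:
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTreeNode"] = {}
        self.replicas: Set[str] = set()  # replicas that cached a prefix through this node


class PrefixRouter:
    """Toy router: pick the replica with the longest cached prefix match."""

    def __init__(self) -> None:
        self.root = PrefixTreeNode()

    def record(self, tokens: List[int], replica: str) -> None:
        """Note that `replica` now holds KV cache for this token sequence."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTreeNode())
            node.replicas.add(replica)

    def route(self, tokens: List[int]) -> Tuple[Optional[str], int]:
        """Return (replica, matched_length); replica is None on a cold miss."""
        node, best, depth = self.root, None, 0
        for i, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            best, depth = sorted(node.replicas)[0], i
        return best, depth


router = PrefixRouter()
router.record([1, 2, 3, 4], "replica-a")
router.record([1, 2, 9], "replica-b")
print(router.route([1, 2, 3, 4, 5]))  # longest match lives on replica-a
print(router.route([1, 2, 9, 9]))     # longest match lives on replica-b
print(router.route([7, 7]))           # nothing cached: fall back to load balancing
```

<p>On a cold miss the router returns no replica, and a real system would fall back to an ordinary load-balancing policy.</p>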
<p>This transforms routing from &ldquo;who&rsquo;s least busy?&rdquo; to &ldquo;who already has my context?&rdquo;</p>
<p><strong>Gateway API EPP:</strong></p>
<p>The Kubernetes ecosystem is developing similar capabilities at the network layer through the <strong>Endpoint Picker (EPP)</strong> defined by the Gateway API Inference Extension. Routing decisions happen in the ingress controller rather than in application code (Ray Serve).</p>
<p>The ingress controller can hash request properties (prompt prefix, user ID, session token) and consistently route matching requests to the same backend. This works without modifying the serving framework, using pure infrastructure-level routing.</p>
<p><strong>The Tradeoff:</strong></p>
<p>Locality-aware routing can cause load imbalance. If one prefix is extremely popular, its designated replica gets hammered while others sit idle. Production systems need to balance cache locality against load distribution, often through techniques like bounded load consistent hashing or spillover policies.</p>
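<p>One way to picture a spillover policy is a simplified take on bounded-load hashing: prefer the replica chosen by hashing the prefix, but skip it once it is too far above the average load. The <code>pick_replica</code> function and the 1.25 bound are assumptions for this sketch, and it uses a sorted list as the &ldquo;ring&rdquo; rather than a full consistent-hash ring with virtual nodes.</p>

```python
import hashlib
from math import ceil
from typing import Dict


def pick_replica(prefix: str, load: Dict[str, int], bound: float = 1.25) -> str:
    """Prefer the replica chosen by hashing the prefix; spill over if it is overloaded.

    A replica is skipped when taking one more request would push it past
    ceil(bound * average load, counting the new request).
    """
    replicas = sorted(load)
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(replicas)
    limit = ceil(bound * (sum(load.values()) + 1) / len(replicas))
    # Walk the ring from the preferred replica until one is under the limit.
    for offset in range(len(replicas)):
        candidate = replicas[(start + offset) % len(replicas)]
        if load[candidate] + 1 <= limit:
            return candidate
    return min(replicas, key=lambda r: load[r])  # fallback: least loaded


balanced = {"pod-1": 1, "pod-2": 1, "pod-3": 1}
print(pick_replica("system-prompt-v1", balanced))  # sticky: same prefix, same pod

skewed = {"pod-1": 10, "pod-2": 0, "pod-3": 0}
print(pick_replica("system-prompt-v1", skewed))    # never the overloaded pod-1
```

<p>Raising <code>bound</code> favors cache locality, while lowering it toward 1 favors even load. That is the tradeoff in a single knob.</p>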
<p>The evolution is clear: routing is becoming inference-aware. The network layer increasingly understands the semantics of the requests it carries, making decisions that would previously require application-level logic.</p>
<hr>
<h2 id="the-programmable-supercomputer">The Programmable Supercomputer</h2>
<p>Step back and consider what this stack achieves. You start with a collection of independent machines, each with its own GPUs, memory, and network interfaces. Through layers of orchestration (Kubernetes managing containers, Ray managing actors, vLLM managing inference) these resources transform into something that behaves like a single, coherent system.</p>
<p>A prompt enters and gets routed to the right place based on cached state. Compute spreads across GPUs that might span multiple machines, synchronized through NCCL collectives that operate faster than the software can observe. Memory fragments across PagedAttention blocks, invisible to the model but critical for efficiency. The response streams back, one token at a time, while the system is already processing the next request.</p>
<p>The orchestration is the product. Without it, you have expensive hardware sitting idle. With it, you have an inference machine that can serve thousands of concurrent users at interactive latencies.</p>
<p><strong>What&rsquo;s Emerging:</strong></p>
<p>The boundaries between these layers continue to blur. Systems like DistServe push disaggregation further, with prefill and decode pools that scale independently. KV cache transfer technologies (NIXL, LMCache) treat GPU memory across machines as a single addressable space. The trend is toward tighter integration between orchestration and execution, with systems that make placement decisions not just at startup, but continuously during inference.</p>
<p><strong>Key Metrics to Watch:</strong></p>
<p>If you&rsquo;re operating these systems, the metrics that matter span all three layers:</p>
<ul>
<li><strong>Kubernetes</strong>: Pod scheduling latency, node resource utilization, network policy drops</li>
<li><strong>Ray</strong>: Placement group creation time, actor restart rate, GCS latency (<code>ray_gcs_*</code> metrics)</li>
<li><strong>vLLM</strong>: <code>vllm:gpu_cache_usage_perc</code> (memory pressure), <code>vllm:num_requests_waiting</code> (queuing), time-to-first-token (prefill latency), inter-token latency (decode performance)</li>
</ul>
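<p>As a small illustration of watching the vLLM gauges named above, the sketch below pulls them out of Prometheus text-format output. The sample text is synthetic, the 0.9 threshold is arbitrary, and the parser assumes samples carry no trailing timestamp.</p>

```python
from typing import Dict

WATCHED = ("vllm:gpu_cache_usage_perc", "vllm:num_requests_waiting")


def parse_watched_metrics(exposition_text: str) -> Dict[str, float]:
    """Pull watched gauges out of Prometheus text-format output.

    Labels are ignored, so with several label sets the last sample wins.
    """
    values: Dict[str, float] = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in WATCHED:
            values[name] = float(value)
    return values


# Synthetic sample for illustration, not captured from a real server.
sample = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.93
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 12
"""
metrics = parse_watched_metrics(sample)
if metrics["vllm:gpu_cache_usage_perc"] > 0.9 and metrics["vllm:num_requests_waiting"] > 0:
    print("KV cache is nearly full and requests are queuing")
```

<p>High cache usage together with a growing waiting queue usually means that memory, not compute, is the bottleneck.</p>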
<p>The system is only as good as its weakest link. A Kubernetes scheduling delay adds latency to every request until the pod is running. A misconfigured NCCL interface tanks throughput. A hot expert without proper load balancing creates tail latencies.</p>
<p>Understanding the choreography (knowing which system is responsible for what, where the handoffs occur, what can go wrong at each boundary) is what separates operators who can debug production issues from those who cannot.</p>
<p>The stack is complex because the problem is complex. Distributed inference across dozens of GPUs, serving thousands of users, with sub-second latency requirements. But the complexity is structured. Each layer has clear responsibilities and well-defined interfaces. Master those interfaces, understand the handoffs, and the system becomes comprehensible.</p>
<p>Eight GPUs thinking as one. Three software systems coordinating invisibly. One simple command that hides a universe of orchestration.</p>
<p>That&rsquo;s the stack. Now you know what&rsquo;s underneath.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li><strong>KubeRay Operator</strong>: <a href="https://github.com/ray-project/kuberay">ray-project/kuberay</a> — Kubernetes operator for Ray</li>
<li><strong>Ray Placement Groups</strong>: <a href="https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html">Ray docs</a></li>
<li><strong>vLLM Distributed Inference</strong>: <a href="https://docs.vllm.ai/en/latest/serving/distributed_serving.html">vLLM docs</a></li>
<li><strong>NCCL AllReduce Algorithms</strong>: <a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html">NVIDIA NCCL docs</a></li>
<li><strong>DistServe (Disaggregated Serving)</strong>: <a href="https://arxiv.org/abs/2401.09670">Zhong et al., 2024</a></li>
<li><strong>Multus CNI</strong>: <a href="https://github.com/k8snetworkplumbingwg/multus-cni">k8snetworkplumbingwg/multus-cni</a></li>
</ul>
]]></content:encoded></item><item><title>The Hidden Software Stack Behind Fast LLM Inference</title><link>https://www.mdjawad.com/posts/llm-inference-hidden-stack/</link><pubDate>Sat, 10 Jan 2026 12:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/llm-inference-hidden-stack/</guid><description>Beyond vLLM and PagedAttention: exploring NCCL, CUTLASS, Triton, and FlashInfer, the libraries that actually make LLM inference fast.</description><content:encoded><![CDATA[<h2 id="the-iceberg-problem">The Iceberg Problem</h2>
<p>If you&rsquo;ve followed LLM infrastructure over the past two years, you&rsquo;ve probably heard the greatest hits: <a href="https://docs.vllm.ai/en/stable/design/arch_overview.html">PagedAttention</a> eliminates memory fragmentation, continuous batching keeps GPUs busy, and <a href="/posts/flash-attention/">FlashAttention</a> cuts memory from O(N²) to O(N). These optimizations are real and important. They are not the full story.</p>
<p>Below the waterline sits a stack of specialized libraries that most engineers never encounter directly. CUTLASS generates the fused kernels that make quantization practical. Triton lets researchers write GPU code without drowning in thread indexing. FlashInfer handles the messy reality of serving workloads that FlashAttention wasn&rsquo;t designed for. And NCCL quietly orchestrates communication when models span multiple GPUs.</p>
<p>This post dives into that hidden layer. We&rsquo;ll trace the path from silicon to scheduler, examining the libraries that transform NVIDIA&rsquo;s hardware capabilities into the fast inference you actually experience. If you&rsquo;re deploying LLMs at scale, or simply curious about what happens beneath vLLM&rsquo;s Python API, this is the stack worth understanding.</p>
<h2 id="hardware-contract">Hardware Contract</h2>
<p>Every optimization in this stack exists because of a single physical constraint: the memory wall. Modern GPUs have a dramatic imbalance between compute capability and memory bandwidth.</p>
<p>Consider the H100 SXM. Its Tensor Cores can deliver roughly 2,000 TFLOPS of dense FP8 compute, and its HBM3 memory provides 3.35 TB/s of bandwidth. Dividing the first figure by the second gives a &ldquo;ridge point&rdquo; of about 600 ops/byte: if your workload performs fewer than 600 operations per byte loaded from memory, you&rsquo;re memory-bound. Your expensive Tensor Cores sit idle, waiting for data.</p>
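<p>The division is easy to reproduce. Below is a minimal sketch of the roofline arithmetic, using only the two approximate H100 figures quoted in this section. The function names are illustrative and do not come from any library.</p>

```python
# Roofline sketch: how much compute can a kernel reach, given how many
# operations it performs per byte it loads from HBM?
# Both constants are the approximate H100 figures quoted in the text.

PEAK_FP8_FLOPS = 2000e12      # ~2,000 TFLOPS of FP8 Tensor Core compute
HBM_BYTES_PER_SEC = 3.35e12   # 3.35 TB/s of HBM3 bandwidth


def ridge_point(peak_flops, bytes_per_sec):
    """Ops per byte needed before compute, not memory, becomes the limit."""
    return peak_flops / bytes_per_sec


def attainable_flops(intensity, peak_flops, bytes_per_sec):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * bytes_per_sec)


def utilization(intensity, peak_flops, bytes_per_sec):
    """Fraction of peak compute reachable at a given arithmetic intensity."""
    return attainable_flops(intensity, peak_flops, bytes_per_sec) / peak_flops


if __name__ == "__main__":
    ridge = ridge_point(PEAK_FP8_FLOPS, HBM_BYTES_PER_SEC)
    print(f"ridge point: {ridge:.0f} ops/byte")
    for intensity in (1, 100, 1000):
        frac = utilization(intensity, PEAK_FP8_FLOPS, HBM_BYTES_PER_SEC)
        print(f"{intensity:5d} ops/byte: {frac:.2%} of peak")
```

<p>This is an upper bound under idealized assumptions (perfect overlap of loads and math, no latency effects), so real kernels land below it.</p>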
<p>LLM inference during the decode phase, at small batch sizes, operates at roughly 0.5-1 ops/byte. For every token generated, the model loads billions of weight parameters, multiplies them by a single vector, and discards them. That is nowhere near the ridge point, which is why a $30,000 GPU often achieves single-digit percentage compute utilization during autoregressive generation.</p>
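<p>The 0.5-1 ops/byte range falls out of counting FLOPs and bytes for one weight matrix. Here is a rough sketch, assuming a hypothetical 4096 by 4096 linear layer and counting only weight and activation traffic (no KV cache, no caching effects). The helper name and the layer size are assumptions chosen for illustration.</p>

```python
# Arithmetic intensity of one linear layer during decode.
# A multiply-add counts as 2 FLOPs. Bytes moved = weights read once,
# plus input and output activations for each sequence in the batch.


def linear_layer_intensity(d_in, d_out, batch=1, weight_bytes=2, act_bytes=2):
    """FLOPs per byte for y = W x with `batch` input vectors."""
    flops = 2 * batch * d_in * d_out
    weight_traffic = d_in * d_out * weight_bytes
    activation_traffic = batch * (d_in + d_out) * act_bytes
    return flops / (weight_traffic + activation_traffic)


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size
    fp16 = linear_layer_intensity(d, d)
    fp32 = linear_layer_intensity(d, d, weight_bytes=4, act_bytes=4)
    batched = linear_layer_intensity(d, d, batch=64)
    print(f"batch 1, 2-byte weights: {fp16:.2f} FLOPs/byte")
    print(f"batch 1, 4-byte weights: {fp32:.2f} FLOPs/byte")
    print(f"batch 64, 2-byte weights: {batched:.1f} FLOPs/byte")
```

<p>Under this simplified model, intensity grows almost linearly with batch size, because the same weights are reused for every sequence in the batch.</p>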
<p>To understand why, it helps to see what we&rsquo;re working with.</p>


<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">

<style>
    .gpu-arch-container {
        font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
        background: linear-gradient(135deg, #0f172a 0%, #1e1b4b 100%);
        border-radius: 16px;
        padding: 24px;
        margin: 32px 0;
        color: #e2e8f0;
        box-shadow: 0 25px 50px -12px rgba(0, 0, 0, 0.5);
    }

    .gpu-arch-header {
        text-align: center;
        margin-bottom: 24px;
    }

    .gpu-arch-header h3 {
        font-size: 1.75rem;
        font-weight: 700;
        color: #f8fafc;
        margin: 0 0 8px 0;
        letter-spacing: -0.025em;
    }

    .gpu-arch-header p {
        color: #94a3b8;
        font-size: 0.95rem;
        margin: 0;
    }

    .gpu-arch-main {
        display: grid;
        grid-template-columns: 1fr 1fr;
        gap: 24px;
        align-items: start;
    }

    @media (max-width: 900px) {
        .gpu-arch-main {
            grid-template-columns: 1fr;
        }
    }

     
    .gpu-die-panel {
        background: rgba(15, 23, 42, 0.6);
        border: 1px solid rgba(148, 163, 184, 0.2);
        border-radius: 12px;
        padding: 20px;
    }

    .panel-title {
        font-size: 0.75rem;
        font-weight: 600;
        text-transform: uppercase;
        letter-spacing: 0.1em;
        color: #64748b;
        margin-bottom: 16px;
        text-align: center;
    }

    .gpu-die-svg {
        width: 100%;
        max-width: 400px;
        margin: 0 auto;
        display: block;
    }

     
    .gpu-component {
        cursor: pointer;
        transition: all 0.2s ease;
    }

    .gpu-component:hover {
        filter: brightness(1.3);
    }

    .gpu-component.selected {
        filter: brightness(1.4) drop-shadow(0 0 8px currentColor);
    }

     
    .sm-cell {
        fill: #166534;
        stroke: #22c55e;
        stroke-width: 0.5;
        transition: all 0.2s ease;
    }

    .sm-cell:hover, .sm-cell.selected {
        fill: #15803d;
        stroke-width: 1;
        filter: drop-shadow(0 0 4px #22c55e);
    }

     
    .hbm-stack {
        fill: #1e3a5f;
        stroke: #0ea5e9;
        stroke-width: 1;
        transition: all 0.2s ease;
    }

    .hbm-stack:hover, .hbm-stack.selected {
        fill: #1e4f7f;
        filter: drop-shadow(0 0 6px #0ea5e9);
    }

     
    .l2-cache {
        fill: #164e63;
        stroke: #22d3ee;
        stroke-width: 1;
        transition: all 0.2s ease;
    }

    .l2-cache:hover, .l2-cache.selected {
        fill: #155e75;
        filter: drop-shadow(0 0 6px #22d3ee);
    }

     
    .sm-detail-panel {
        background: rgba(15, 23, 42, 0.6);
        border: 1px solid rgba(148, 163, 184, 0.2);
        border-radius: 12px;
        padding: 20px;
    }

    .sm-detail-svg {
        width: 100%;
        max-width: 350px;
        margin: 0 auto 16px auto;
        display: block;
    }

     
    .tensor-core {
        fill: #831843;
        stroke: #ec4899;
        stroke-width: 0.5;
        transition: all 0.2s ease;
    }

    .tensor-core:hover, .tensor-core.selected {
        fill: #9d174d;
        filter: drop-shadow(0 0 4px #ec4899);
    }

     
    .shared-mem {
        fill: #581c87;
        stroke: #a855f7;
        stroke-width: 1;
        transition: all 0.2s ease;
    }

    .shared-mem:hover, .shared-mem.selected {
        fill: #6b21a8;
        filter: drop-shadow(0 0 6px #a855f7);
    }

     
    .register-file {
        fill: #78350f;
        stroke: #f59e0b;
        stroke-width: 1;
        transition: all 0.2s ease;
    }

    .register-file:hover, .register-file.selected {
        fill: #92400e;
        filter: drop-shadow(0 0 6px #f59e0b);
    }

     
    .cuda-cores {
        fill: #1e3a5f;
        stroke: #60a5fa;
        stroke-width: 1;
        transition: all 0.2s ease;
    }

    .cuda-cores:hover, .cuda-cores.selected {
        fill: #1e4f7f;
        filter: drop-shadow(0 0 6px #60a5fa);
    }

     
    .component-info {
        background: rgba(30, 41, 59, 0.8);
        border: 1px solid rgba(148, 163, 184, 0.15);
        border-radius: 8px;
        padding: 16px;
        min-height: 140px;
        transition: all 0.3s ease;
    }

    .component-info h4 {
        font-size: 1.1rem;
        font-weight: 600;
        margin: 0 0 8px 0;
        display: flex;
        align-items: center;
        gap: 8px;
    }

    .component-info .color-dot {
        width: 12px;
        height: 12px;
        border-radius: 3px;
        flex-shrink: 0;
    }

    .component-info .specs {
        font-family: 'SF Mono', 'Fira Code', monospace;
        font-size: 0.85rem;
        color: #94a3b8;
        margin-bottom: 8px;
    }

    .component-info .description {
        font-size: 0.875rem;
        color: #cbd5e1;
        line-height: 1.5;
    }

     
    .gpu-arch-legend {
        display: grid;
        grid-template-columns: repeat(auto-fit, minmax(140px, 1fr));
        gap: 8px;
        margin-top: 20px;
        padding-top: 16px;
        border-top: 1px solid rgba(148, 163, 184, 0.15);
    }

    .legend-item {
        display: flex;
        align-items: center;
        gap: 8px;
        font-size: 0.75rem;
        color: #94a3b8;
        cursor: pointer;
        padding: 4px 8px;
        border-radius: 4px;
        transition: background 0.2s ease;
    }

    .legend-item:hover {
        background: rgba(148, 163, 184, 0.1);
    }

    .legend-dot {
        width: 10px;
        height: 10px;
        border-radius: 2px;
        flex-shrink: 0;
    }

     
    .data-flow-path {
        stroke-dasharray: 6, 4;
        animation: dataFlowAnim 1.5s linear infinite;
    }

    @keyframes dataFlowAnim {
        0% { stroke-dashoffset: 0; }
        100% { stroke-dashoffset: -20; }
    }

     
    .instruction-text {
        text-align: center;
        font-size: 0.8rem;
        color: #64748b;
        margin-top: 12px;
    }
</style>

<div class="gpu-arch-container">
    <div class="gpu-arch-header">
        <h3>NVIDIA H100 GPU Architecture</h3>
        <p>Understanding the hardware that software must optimize for</p>
    </div>

    <div class="gpu-arch-main">
        
        <div class="gpu-die-panel">
            <div class="panel-title">GPU Die Layout</div>
            <svg class="gpu-die-svg" viewBox="0 0 400 320" id="gpuDieSvg-9184b5d7372a4419fe20e8998ee043a9">
                
                <g class="gpu-component hbm-group" data-component="hbm">
                    <rect class="hbm-stack" x="10" y="50" width="35" height="220" rx="4"/>
                    <rect class="hbm-stack" x="355" y="50" width="35" height="220" rx="4"/>
                    <rect class="hbm-stack" x="60" y="10" width="130" height="30" rx="4"/>
                    <rect class="hbm-stack" x="210" y="10" width="130" height="30" rx="4"/>
                    <rect class="hbm-stack" x="60" y="280" width="130" height="30" rx="4"/>
                    <rect class="hbm-stack" x="210" y="280" width="130" height="30" rx="4"/>
                    
                    <text x="27" y="165" fill="#0ea5e9" font-size="9" text-anchor="middle" transform="rotate(-90, 27, 165)" font-weight="500">HBM3</text>
                    <text x="373" y="165" fill="#0ea5e9" font-size="9" text-anchor="middle" transform="rotate(90, 373, 165)" font-weight="500">HBM3</text>
                </g>

                
                <rect x="55" y="48" width="290" height="224" rx="6" fill="#0f172a" stroke="#334155" stroke-width="2"/>

                
                <g class="gpu-component l2-group" data-component="l2">
                    <rect class="l2-cache" x="65" y="148" width="270" height="24" rx="3"/>
                    <text x="200" y="164" fill="#22d3ee" font-size="10" text-anchor="middle" font-weight="500">L2 Cache (50MB)</text>
                </g>

                
                <g class="gpu-component sm-group" data-component="sm">
                    
                    <rect class="sm-cell" x="70" y="58" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="108" y="58" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="146" y="58" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="184" y="58" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="222" y="58" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="260" y="58" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="298" y="58" width="34" height="26" rx="2"/>
                    
                    <rect class="sm-cell" x="70" y="88" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="108" y="88" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="146" y="88" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="184" y="88" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="222" y="88" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="260" y="88" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="298" y="88" width="34" height="26" rx="2"/>
                    
                    <rect class="sm-cell" x="70" y="118" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="108" y="118" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="146" y="118" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="184" y="118" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="222" y="118" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="260" y="118" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="298" y="118" width="34" height="26" rx="2"/>

                    
                    
                    <rect class="sm-cell" x="70" y="176" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="108" y="176" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="146" y="176" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="184" y="176" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="222" y="176" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="260" y="176" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="298" y="176" width="34" height="26" rx="2"/>
                    
                    <rect class="sm-cell" x="70" y="206" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="108" y="206" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="146" y="206" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="184" y="206" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="222" y="206" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="260" y="206" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="298" y="206" width="34" height="26" rx="2"/>
                    
                    <rect class="sm-cell" x="70" y="236" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="108" y="236" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="146" y="236" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="184" y="236" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="222" y="236" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="260" y="236" width="34" height="26" rx="2"/>
                    <rect class="sm-cell" x="298" y="236" width="34" height="26" rx="2"/>
                </g>

                
                <g class="data-flow-group" opacity="0.6">
                    
                    <path class="data-flow-path" d="M 45 160 L 65 160" fill="none" stroke="#0ea5e9" stroke-width="2"/>
                    <path class="data-flow-path" d="M 355 160 L 335 160" fill="none" stroke="#0ea5e9" stroke-width="2"/>
                    
                    <path class="data-flow-path" d="M 200 148 L 200 145" fill="none" stroke="#22d3ee" stroke-width="2"/>
                    <path class="data-flow-path" d="M 200 172 L 200 176" fill="none" stroke="#22d3ee" stroke-width="2"/>
                </g>

                
                <rect x="146" y="88" width="34" height="26" rx="2" fill="none" stroke="#f59e0b" stroke-width="2" stroke-dasharray="4,2">
                    <animate attributeName="opacity" values="0.5;1;0.5" dur="2s" repeatCount="indefinite"/>
                </rect>
                <text x="163" y="80" fill="#f59e0b" font-size="8" text-anchor="middle">Zoom →</text>
            </svg>
            <p class="instruction-text">Click components to explore</p>
        </div>

        
        <div class="sm-detail-panel">
            <div class="panel-title">Streaming Multiprocessor (SM) Detail</div>
            <svg class="sm-detail-svg" viewBox="0 0 300 200" id="smDetailSvg-9184b5d7372a4419fe20e8998ee043a9">
                
                <rect x="10" y="10" width="280" height="180" rx="8" fill="#0f172a" stroke="#22c55e" stroke-width="2"/>
                <text x="150" y="28" fill="#22c55e" font-size="11" text-anchor="middle" font-weight="600">One SM (of 132)</text>

                
                <g class="gpu-component tensor-group" data-component="tensor">
                    <rect class="tensor-core" x="20" y="40" width="58" height="35" rx="3"/>
                    <rect class="tensor-core" x="82" y="40" width="58" height="35" rx="3"/>
                    <rect class="tensor-core" x="160" y="40" width="58" height="35" rx="3"/>
                    <rect class="tensor-core" x="222" y="40" width="58" height="35" rx="3"/>
                    <text x="49" y="62" fill="#ec4899" font-size="9" text-anchor="middle" font-weight="500">TC</text>
                    <text x="111" y="62" fill="#ec4899" font-size="9" text-anchor="middle" font-weight="500">TC</text>
                    <text x="189" y="62" fill="#ec4899" font-size="9" text-anchor="middle" font-weight="500">TC</text>
                    <text x="251" y="62" fill="#ec4899" font-size="9" text-anchor="middle" font-weight="500">TC</text>
                </g>

                
                <g class="gpu-component cuda-group" data-component="cuda">
                    <rect class="cuda-cores" x="20" y="82" width="260" height="28" rx="3"/>
                    <text x="150" y="100" fill="#60a5fa" font-size="10" text-anchor="middle" font-weight="500">128 CUDA Cores</text>
                </g>

                
                <g class="gpu-component register-group" data-component="register">
                    <rect class="register-file" x="20" y="116" width="125" height="30" rx="3"/>
                    <text x="82" y="135" fill="#f59e0b" font-size="9" text-anchor="middle" font-weight="500">Register File (256KB)</text>
                </g>

                
                <g class="gpu-component sram-group" data-component="sram">
                    <rect class="shared-mem" x="155" y="116" width="125" height="30" rx="3"/>
                    <text x="217" y="135" fill="#a855f7" font-size="9" text-anchor="middle" font-weight="500">Shared Mem (228KB)</text>
                </g>

                
                <g class="data-flow-group" opacity="0.5">
                    <path class="data-flow-path" d="M 217 116 L 217 110 L 150 85 L 150 75" fill="none" stroke="#a855f7" stroke-width="1.5"/>
                    <path class="data-flow-path" d="M 82 116 L 82 110 L 150 85 L 150 75" fill="none" stroke="#f59e0b" stroke-width="1.5"/>
                </g>

                
                <text x="150" y="165" fill="#64748b" font-size="8" text-anchor="middle">SRAM: ~19 TB/s  |  Registers: Highest</text>
                <text x="150" y="178" fill="#475569" font-size="7" text-anchor="middle">Data stays on-chip for FlashAttention tiles</text>
            </svg>

            
            <div class="component-info" id="componentInfo-9184b5d7372a4419fe20e8998ee043a9">
                <h4>
                    <span class="color-dot" style="background: #22c55e;"></span>
                    <span id="infoTitle-9184b5d7372a4419fe20e8998ee043a9">Streaming Multiprocessors</span>
                </h4>
                <div class="specs" id="infoSpecs-9184b5d7372a4419fe20e8998ee043a9">132 SMs × (128 CUDA cores + 4 Tensor Cores)</div>
                <div class="description" id="infoDesc-9184b5d7372a4419fe20e8998ee043a9">
                    The parallel processing units where computation happens. Each SM is an independent processor with its own registers, shared memory, and access to Tensor Cores for matrix operations.
                </div>
            </div>
        </div>
    </div>

    
    <div class="gpu-arch-legend" id="gpuLegend-9184b5d7372a4419fe20e8998ee043a9">
        <div class="legend-item" data-component="hbm">
            <span class="legend-dot" style="background: #0ea5e9;"></span>
            <span>HBM3 (80GB)</span>
        </div>
        <div class="legend-item" data-component="l2">
            <span class="legend-dot" style="background: #22d3ee;"></span>
            <span>L2 Cache (50MB)</span>
        </div>
        <div class="legend-item" data-component="sm">
            <span class="legend-dot" style="background: #22c55e;"></span>
            <span>SMs (132)</span>
        </div>
        <div class="legend-item" data-component="tensor">
            <span class="legend-dot" style="background: #ec4899;"></span>
            <span>Tensor Cores</span>
        </div>
        <div class="legend-item" data-component="sram">
            <span class="legend-dot" style="background: #a855f7;"></span>
            <span>Shared Memory</span>
        </div>
        <div class="legend-item" data-component="register">
            <span class="legend-dot" style="background: #f59e0b;"></span>
            <span>Registers</span>
        </div>
    </div>
</div>

<script>
(function() {
    const uniqueId = '9184b5d7372a4419fe20e8998ee043a9';

    
    const componentInfo = {
        hbm: {
            title: 'HBM3 Memory',
            color: '#0ea5e9',
            specs: '80GB capacity | 3.35 TB/s bandwidth',
            desc: 'High Bandwidth Memory stacks surrounding the die. This is where model weights and KV cache live. Despite "high bandwidth," it\'s still the bottleneck—LLM inference is memory-bound because we must load billions of parameters for each token.'
        },
        l2: {
            title: 'L2 Cache',
            color: '#22d3ee',
            specs: '50MB shared | ~12 TB/s bandwidth',
            desc: 'Shared cache between all SMs and HBM. The first line of defense against HBM latency. Frequently accessed data (like attention keys for recent tokens) may hit L2, avoiding the trip to HBM.'
        },
        sm: {
            title: 'Streaming Multiprocessors',
            color: '#22c55e',
            specs: '132 SMs × (128 CUDA cores + 4 Tensor Cores)',
            desc: 'The parallel processing units where computation happens. Each SM is an independent processor with its own registers, shared memory, and Tensor Cores. This is where CUTLASS and Triton kernels execute.'
        },
        tensor: {
            title: 'Tensor Cores',
            color: '#ec4899',
            specs: '4th gen | FP8, FP16, BF16, INT8 | ~2000 TFLOPS (FP8)',
            desc: 'Specialized matrix multiply-accumulate units. They perform the actual GEMM computation at the heart of transformer inference. CUTLASS generates code that feeds them precisely arranged data tiles.'
        },
        cuda: {
            title: 'CUDA Cores',
            color: '#60a5fa',
            specs: '128 per SM | 16,896 total',
            desc: 'General-purpose compute units for scalar and vector operations. Handle element-wise ops, reductions, and anything that doesn\'t fit the matrix-multiply pattern of Tensor Cores.'
        },
        sram: {
            title: 'Shared Memory (SRAM)',
            color: '#a855f7',
            specs: '228KB per SM | ~19 TB/s bandwidth',
            desc: 'User-programmable scratchpad memory on each SM. This is where FlashAttention tiles live during computation. It offers roughly 6× the bandwidth of HBM, so the key to avoiding memory bottlenecks is keeping data here as long as possible.'
        },
        register: {
            title: 'Register File',
            color: '#f59e0b',
            specs: '256KB per SM | Fastest memory level',
            desc: 'The fastest memory on the chip. Active computation values live here. Limited capacity forces careful tiling: CUTLASS and Triton kernels are designed to maximize register reuse before falling back to shared memory.'
        }
    };

    
    const infoTitle = document.getElementById('infoTitle-' + uniqueId);
    const infoSpecs = document.getElementById('infoSpecs-' + uniqueId);
    const infoDesc = document.getElementById('infoDesc-' + uniqueId);
    const infoBox = document.getElementById('componentInfo-' + uniqueId);
    const colorDot = infoBox.querySelector('.color-dot');

    
    const updateInfo = (componentId) => {
        const info = componentInfo[componentId];
        if (!info) return;

        infoTitle.textContent = info.title;
        infoSpecs.textContent = info.specs;
        infoDesc.textContent = info.desc;
        colorDot.style.background = info.color;

        
        infoBox.style.transform = 'scale(1.02)';
        setTimeout(() => { infoBox.style.transform = 'scale(1)'; }, 150);
    };

    
    const clearSelections = () => {
        // The 'selected' class is added to the child shapes, not to the
        // .gpu-component group itself, so match descendants of the group.
        document.querySelectorAll('.gpu-component .selected').forEach(el => {
            el.classList.remove('selected');
        });
    };

    
    const handleComponentClick = (e) => {
        const component = e.target.closest('.gpu-component');
        if (!component) return;

        const componentId = component.dataset.component;
        clearSelections();

        
        component.querySelectorAll('rect, path').forEach(child => {
            child.classList.add('selected');
        });

        updateInfo(componentId);
    };

    
    const handleLegendClick = (e) => {
        const legendItem = e.target.closest('.legend-item');
        if (!legendItem) return;

        const componentId = legendItem.dataset.component;
        clearSelections();

        
        document.querySelectorAll(`[data-component="${componentId}"]`).forEach(group => {
            group.querySelectorAll('rect, path').forEach(child => {
                child.classList.add('selected');
            });
        });

        updateInfo(componentId);
    };

    
    const gpuDieSvg = document.getElementById('gpuDieSvg-' + uniqueId);
    const smDetailSvg = document.getElementById('smDetailSvg-' + uniqueId);
    const legend = document.getElementById('gpuLegend-' + uniqueId);

    if (gpuDieSvg) gpuDieSvg.addEventListener('click', handleComponentClick);
    if (smDetailSvg) smDetailSvg.addEventListener('click', handleComponentClick);
    if (legend) legend.addEventListener('click', handleLegendClick);

    
    updateInfo('sm');
})();
</script>

<p>The memory hierarchy offers a path forward:</p>
<table>
  <thead>
      <tr>
          <th>Level</th>
          <th>Capacity</th>
          <th>Bandwidth</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HBM (Global Memory)</td>
          <td>80 GB</td>
          <td>3.35 TB/s</td>
      </tr>
      <tr>
          <td>L2 Cache</td>
          <td>50 MB</td>
          <td>~12 TB/s</td>
      </tr>
      <tr>
          <td>SRAM (Shared Memory)</td>
          <td>228 KB/SM</td>
          <td>~19 TB/s</td>
      </tr>
      <tr>
          <td>Register File</td>
          <td>256 KB/SM</td>
          <td>Highest</td>
      </tr>
  </tbody>
</table>
<p>The software stack&rsquo;s job is to maximize data reuse in faster memory levels and minimize trips to slow HBM.</p>
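<p>To make &ldquo;data reuse&rdquo; concrete, the sketch below counts how many elements a square matrix multiply reads from HBM when input tiles are staged in shared memory. It is a simplified textbook model: it ignores caches and assumes two input tiles are resident per SM while the accumulator lives in registers. The function names and the 4096 matrix size are illustrative assumptions; the 228 KB figure is the per-SM shared memory capacity from the table.</p>

```python
# HBM traffic for C = A @ B with n x n matrices and tile x tile blocking.
# Each output tile streams one row of A tiles and one column of B tiles,
# so a larger tile means every loaded element is reused more often.

SHARED_MEM_BYTES_PER_SM = 228 * 1024  # per-SM capacity from the table


def hbm_elements_read(n, tile):
    """Elements loaded from HBM; tile=1 corresponds to no reuse at all."""
    if n % tile != 0:
        raise ValueError("n must be a multiple of tile")
    tiles_per_side = n // tile
    output_tiles = tiles_per_side ** 2
    # Each output tile needs tiles_per_side pairs of (A tile, B tile).
    return output_tiles * tiles_per_side * 2 * tile * tile


def tile_fits_in_shared_memory(tile, bytes_per_element=2):
    """Check that one A tile and one B tile fit in shared memory together."""
    needed = 2 * tile * tile * bytes_per_element
    return SHARED_MEM_BYTES_PER_SM >= needed


if __name__ == "__main__":
    n = 4096  # hypothetical matrix size
    baseline = hbm_elements_read(n, 1)
    for tile in (16, 64, 128):
        saved = baseline // hbm_elements_read(n, tile)
        fits = tile_fits_in_shared_memory(tile)
        print(f"tile {tile:3d}: {saved}x less HBM traffic, fits in SRAM: {fits}")
```

<p>In this model the traffic reduction equals the tile size, which is why kernels pick the largest tile that still fits in shared memory and registers.</p>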



<style>
    .mem-hierarchy-container {
        font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
        background: linear-gradient(135deg, #0f172a 0%, #1e1b4b 100%);
        border-radius: 16px;
        padding: 24px;
        margin: 32px 0;
        color: #e2e8f0;
        box-shadow: 0 25px 50px -12px rgba(0, 0, 0, 0.5);
    }

    .mem-hierarchy-header {
        text-align: center;
        margin-bottom: 24px;
    }

    .mem-hierarchy-header h3 {
        font-size: 1.75rem;
        font-weight: 700;
        color: #f8fafc;
        margin: 0 0 8px 0;
        letter-spacing: -0.025em;
    }

    .mem-hierarchy-header p {
        color: #94a3b8;
        font-size: 0.95rem;
        margin: 0;
    }

    .mem-hierarchy-main {
        display: grid;
        grid-template-columns: 1fr 280px;
        gap: 24px;
        align-items: start;
    }

    @media (max-width: 900px) {
        .mem-hierarchy-main {
            grid-template-columns: 1fr;
        }
    }

     
    .hierarchy-flow-panel {
        background: rgba(15, 23, 42, 0.6);
        border: 1px solid rgba(148, 163, 184, 0.2);
        border-radius: 12px;
        padding: 20px;
        position: relative;
        overflow: hidden;
    }

     
    .mem-level {
        position: relative;
        margin-bottom: 8px;
        cursor: pointer;
        transition: all 0.2s ease;
    }

    .mem-level:hover .mem-level-box {
        transform: scale(1.02);
    }

    .mem-level.selected .mem-level-box {
        box-shadow: 0 0 20px rgba(255, 255, 255, 0.15);
    }

    .mem-level-box {
        border-radius: 8px;
        padding: 16px;
        transition: all 0.2s ease;
        position: relative;
        z-index: 2;
    }

    .mem-level-header {
        display: flex;
        justify-content: space-between;
        align-items: flex-start;
        margin-bottom: 8px;
        flex-wrap: wrap;
        gap: 4px;
    }

    .mem-level-name {
        font-weight: 600;
        font-size: 1rem;
        min-width: 0;
        word-break: break-word;
    }

    .mem-level-stats {
        font-family: 'SF Mono', 'Fira Code', monospace;
        font-size: 0.8rem;
        opacity: 0.9;
        min-width: 0;
        white-space: nowrap;
    }

    .mem-level-desc {
        font-size: 0.8rem;
        opacity: 0.7;
    }

    .mem-level-contents {
        display: flex;
        gap: 8px;
        margin-top: 10px;
        flex-wrap: wrap;
    }

    .content-tag {
        background: rgba(255, 255, 255, 0.1);
        padding: 4px 10px;
        border-radius: 4px;
        font-size: 0.7rem;
        font-weight: 500;
    }

     
    .mem-level.hbm .mem-level-box {
        background: linear-gradient(135deg, rgba(14, 165, 233, 0.2) 0%, rgba(14, 165, 233, 0.1) 100%);
        border: 1px solid #0ea5e9;
    }
    .mem-level.hbm .mem-level-name { color: #0ea5e9; }
    .mem-level.hbm .mem-level-stats { color: #7dd3fc; }
    .mem-level.hbm .mem-level-desc { color: #bae6fd; }

     
    .mem-level.l2 .mem-level-box {
        background: linear-gradient(135deg, rgba(34, 211, 238, 0.2) 0%, rgba(34, 211, 238, 0.1) 100%);
        border: 1px solid #22d3ee;
        max-width: 92%;
        margin-left: auto;
        margin-right: auto;
    }
    .mem-level.l2 .mem-level-name { color: #22d3ee; }
    .mem-level.l2 .mem-level-stats { color: #a5f3fc; }
    .mem-level.l2 .mem-level-desc { color: #cffafe; }

     
    .mem-level.sram .mem-level-box {
        background: linear-gradient(135deg, rgba(168, 85, 247, 0.2) 0%, rgba(168, 85, 247, 0.1) 100%);
        border: 1px solid #a855f7;
        max-width: 84%;
        margin-left: auto;
        margin-right: auto;
    }
    .mem-level.sram .mem-level-name { color: #a855f7; }
    .mem-level.sram .mem-level-stats { color: #d8b4fe; }
    .mem-level.sram .mem-level-desc { color: #e9d5ff; }

     
    .mem-level.register .mem-level-box {
        background: linear-gradient(135deg, rgba(245, 158, 11, 0.2) 0%, rgba(245, 158, 11, 0.1) 100%);
        border: 1px solid #f59e0b;
        max-width: 76%;
        margin-left: auto;
        margin-right: auto;
    }
    .mem-level.register .mem-level-name { color: #f59e0b; }
    .mem-level.register .mem-level-stats { color: #fcd34d; }
    .mem-level.register .mem-level-desc { color: #fde68a; }

     
    .mem-level.tensor .mem-level-box {
        background: linear-gradient(135deg, rgba(236, 72, 153, 0.2) 0%, rgba(236, 72, 153, 0.1) 100%);
        border: 1px solid #ec4899;
        max-width: 68%;
        margin-left: auto;
        margin-right: auto;
        text-align: center;
    }
    .mem-level.tensor .mem-level-name { color: #ec4899; }
    .mem-level.tensor .mem-level-stats { color: #f9a8d4; }

     
    .flow-connector {
        height: 24px;
        display: flex;
        justify-content: center;
        align-items: center;
        position: relative;
    }

    .flow-arrow {
        width: 2px;
        height: 100%;
        position: relative;
    }

    .flow-arrow::before {
        content: '';
        position: absolute;
        top: 0;
        left: 50%;
        transform: translateX(-50%);
        width: 8px;
        height: 8px;
        border-radius: 50%;
        animation: particleFlow 1.5s ease-in-out infinite;
    }

    .flow-connector.hbm-l2 .flow-arrow { background: linear-gradient(to bottom, #0ea5e9, #22d3ee); }
    .flow-connector.hbm-l2 .flow-arrow::before { background: #22c55e; }

    .flow-connector.l2-sram .flow-arrow { background: linear-gradient(to bottom, #22d3ee, #a855f7); }
    .flow-connector.l2-sram .flow-arrow::before { background: #22c55e; animation-delay: 0.3s; }

    .flow-connector.sram-reg .flow-arrow { background: linear-gradient(to bottom, #a855f7, #f59e0b); }
    .flow-connector.sram-reg .flow-arrow::before { background: #22c55e; animation-delay: 0.6s; }

    .flow-connector.reg-tensor .flow-arrow { background: linear-gradient(to bottom, #f59e0b, #ec4899); }
    .flow-connector.reg-tensor .flow-arrow::before { background: #22c55e; animation-delay: 0.9s; }

    @keyframes particleFlow {
        0% { top: 0; opacity: 1; }
        100% { top: calc(100% - 8px); opacity: 0.3; }
    }

     
    .side-panel {
        display: flex;
        flex-direction: column;
        gap: 16px;
    }

     
    .bandwidth-panel {
        background: rgba(15, 23, 42, 0.6);
        border: 1px solid rgba(148, 163, 184, 0.2);
        border-radius: 12px;
        padding: 16px;
    }

    .panel-title {
        font-size: 0.75rem;
        font-weight: 600;
        text-transform: uppercase;
        letter-spacing: 0.1em;
        color: #64748b;
        margin-bottom: 12px;
    }

    .bandwidth-bar {
        display: flex;
        align-items: center;
        gap: 8px;
        margin-bottom: 8px;
    }

    .bandwidth-label {
        font-size: 0.7rem;
        color: #94a3b8;
        width: 50px;
        flex-shrink: 0;
    }

    .bandwidth-track {
        flex: 1;
        height: 12px;
        background: rgba(30, 41, 59, 0.8);
        border-radius: 6px;
        overflow: hidden;
    }

    .bandwidth-fill {
        height: 100%;
        border-radius: 6px;
        transition: width 0.5s ease;
    }

    .bandwidth-value {
        font-size: 0.65rem;
        font-family: 'SF Mono', monospace;
        color: #64748b;
        width: 60px;
        text-align: right;
        flex-shrink: 0;
    }

    .bandwidth-bar.hbm .bandwidth-fill { background: #0ea5e9; width: 17.6%; }
    .bandwidth-bar.l2 .bandwidth-fill { background: #22d3ee; width: 63.2%; }
    .bandwidth-bar.sram .bandwidth-fill { background: #a855f7; width: 100%; }

     
    .accelerators-panel {
        background: rgba(15, 23, 42, 0.6);
        border: 1px solid rgba(148, 163, 184, 0.2);
        border-radius: 12px;
        padding: 16px;
    }

    .accelerator-item {
        margin-bottom: 12px;
    }

    .accelerator-item:last-child {
        margin-bottom: 0;
    }

    .accelerator-name {
        font-size: 0.85rem;
        font-weight: 600;
        color: #22c55e;
        margin-bottom: 4px;
    }

    .accelerator-desc {
        font-size: 0.75rem;
        color: #94a3b8;
        line-height: 1.4;
    }

     
    .info-panel {
        background: rgba(30, 41, 59, 0.8);
        border: 1px solid rgba(148, 163, 184, 0.15);
        border-radius: 12px;
        padding: 16px;
        transition: all 0.3s ease;
    }

    .info-panel h4 {
        font-size: 1rem;
        font-weight: 600;
        margin: 0 0 8px 0;
        display: flex;
        align-items: center;
        gap: 8px;
    }

    .info-panel .color-dot {
        width: 10px;
        height: 10px;
        border-radius: 3px;
        flex-shrink: 0;
    }

    .info-panel .specs {
        font-family: 'SF Mono', 'Fira Code', monospace;
        font-size: 0.8rem;
        color: #94a3b8;
        margin-bottom: 8px;
    }

    .info-panel .description {
        font-size: 0.8rem;
        color: #cbd5e1;
        line-height: 1.5;
    }

     
    .speed-badge {
        display: inline-block;
        background: rgba(34, 197, 94, 0.2);
        color: #4ade80;
        padding: 2px 8px;
        border-radius: 4px;
        font-size: 0.7rem;
        font-weight: 600;
        margin-left: 8px;
    }
</style>

<div class="mem-hierarchy-container">
    <div class="mem-hierarchy-header">
        <h3>GPU Memory Hierarchy: The Bandwidth Wall</h3>
        <p>Data flows through progressively faster, smaller caches to reach compute</p>
    </div>

    <div class="mem-hierarchy-main">
        
        <div class="hierarchy-flow-panel">
            
            <div class="mem-level hbm" data-level="hbm" id="memLevel-hbm-9184b5d7372a4419fe20e8998ee043a9">
                <div class="mem-level-box">
                    <div class="mem-level-header">
                        <span class="mem-level-name">HBM3 (High Bandwidth Memory)</span>
                        <span class="mem-level-stats">80 GB • 3.35 TB/s</span>
                    </div>
                    <div class="mem-level-desc">"The Warehouse" — Large but far away</div>
                    <div class="mem-level-contents">
                        <span class="content-tag">Model Weights</span>
                        <span class="content-tag">KV Cache</span>
                        <span class="content-tag">Activations</span>
                    </div>
                </div>
            </div>

            <div class="flow-connector hbm-l2"><div class="flow-arrow"></div></div>

            
            <div class="mem-level l2" data-level="l2" id="memLevel-l2-9184b5d7372a4419fe20e8998ee043a9">
                <div class="mem-level-box">
                    <div class="mem-level-header">
                        <span class="mem-level-name">L2 Cache</span>
                        <span class="mem-level-stats">50 MB • ~12 TB/s <span class="speed-badge">3.6× faster</span></span>
                    </div>
                    <div class="mem-level-desc">Shared across all SMs — first line of defense</div>
                    <div class="mem-level-contents">
                        <span class="content-tag">Hot KV entries</span>
                        <span class="content-tag">Recent weights</span>
                    </div>
                </div>
            </div>

            <div class="flow-connector l2-sram"><div class="flow-arrow"></div></div>

            
            <div class="mem-level sram" data-level="sram" id="memLevel-sram-9184b5d7372a4419fe20e8998ee043a9">
                <div class="mem-level-box">
                    <div class="mem-level-header">
                        <span class="mem-level-name">Shared Memory (SRAM)</span>
                        <span class="mem-level-stats">228 KB/SM • ~19 TB/s <span class="speed-badge">5.7× faster</span></span>
                    </div>
                    <div class="mem-level-desc">On-chip scratchpad — FlashAttention's secret weapon</div>
                    <div class="mem-level-contents">
                        <span class="content-tag">Attention tiles</span>
                        <span class="content-tag">CUTLASS staging</span>
                    </div>
                </div>
            </div>

            <div class="flow-connector sram-reg"><div class="flow-arrow"></div></div>

            
            <div class="mem-level register" data-level="register" id="memLevel-register-9184b5d7372a4419fe20e8998ee043a9">
                <div class="mem-level-box">
                    <div class="mem-level-header">
                        <span class="mem-level-name">Register File</span>
                        <span class="mem-level-stats">256 KB/SM • Fastest <span class="speed-badge">lowest latency</span></span>
                    </div>
                    <div class="mem-level-desc">Direct compute access, lowest latency on the chip</div>
                    <div class="mem-level-contents">
                        <span class="content-tag">Active values</span>
                        <span class="content-tag">Accumulators</span>
                    </div>
                </div>
            </div>

            <div class="flow-connector reg-tensor"><div class="flow-arrow"></div></div>

            
            <div class="mem-level tensor" data-level="tensor" id="memLevel-tensor-9184b5d7372a4419fe20e8998ee043a9">
                <div class="mem-level-box">
                    <span class="mem-level-name">Tensor Cores</span>
                    <div class="mem-level-stats">~2000 TFLOPS (FP8)</div>
                </div>
            </div>
        </div>

        
        <div class="side-panel">
            
            <div class="bandwidth-panel">
                <div class="panel-title">Bandwidth Comparison</div>
                <div class="bandwidth-bar hbm">
                    <span class="bandwidth-label">HBM</span>
                    <div class="bandwidth-track"><div class="bandwidth-fill"></div></div>
                    <span class="bandwidth-value">3.35 TB/s</span>
                </div>
                <div class="bandwidth-bar l2">
                    <span class="bandwidth-label">L2</span>
                    <div class="bandwidth-track"><div class="bandwidth-fill"></div></div>
                    <span class="bandwidth-value">~12 TB/s</span>
                </div>
                <div class="bandwidth-bar sram">
                    <span class="bandwidth-label">SRAM</span>
                    <div class="bandwidth-track"><div class="bandwidth-fill"></div></div>
                    <span class="bandwidth-value">~19 TB/s</span>
                </div>
            </div>

            
            <div class="accelerators-panel">
                <div class="panel-title">Hopper Accelerators</div>
                <div class="accelerator-item">
                    <div class="accelerator-name">TMA (Tensor Memory Accelerator)</div>
                    <div class="accelerator-desc">Offloads address calculation to hardware. Software describes tensor shape; TMA handles async loads.</div>
                </div>
                <div class="accelerator-item">
                    <div class="accelerator-name">WGMMA</div>
                    <div class="accelerator-desc">Direct SRAM → Tensor Core path. Operands can be read straight from shared memory instead of being staged through registers, enabling larger tiles and deeper pipelines.</div>
                </div>
            </div>

            
            <div class="info-panel" id="memInfoPanel-9184b5d7372a4419fe20e8998ee043a9">
                <h4>
                    <span class="color-dot" style="background: #0ea5e9;"></span>
                    <span id="memInfoTitle-9184b5d7372a4419fe20e8998ee043a9">The Memory Wall</span>
                </h4>
                <div class="specs" id="memInfoSpecs-9184b5d7372a4419fe20e8998ee043a9">LLM decode: 0.5-1 ops/byte (memory-bound)</div>
                <div class="description" id="memInfoDesc-9184b5d7372a4419fe20e8998ee043a9">
                    Click any memory level to learn more. The roughly 6× bandwidth gap between HBM and SRAM is why FlashAttention exists: keeping data in fast SRAM avoids the bottleneck.
                </div>
            </div>
        </div>
    </div>
</div>

<script>
(function() {
    const uniqueId = '9184b5d7372a4419fe20e8998ee043a9';

    
    const levelInfo = {
        hbm: {
            title: 'HBM3 (High Bandwidth Memory)',
            color: '#0ea5e9',
            specs: '80 GB capacity | 3.35 TB/s bandwidth | ~100ns latency',
            desc: 'The "warehouse" of GPU memory. Model weights (tens of GB), KV caches, and activations all live here. Despite "high bandwidth" in the name, it\'s the slowest level—and LLM inference loads billions of weights per token, making this the primary bottleneck.'
        },
        l2: {
            title: 'L2 Cache',
            color: '#22d3ee',
            specs: '50 MB shared | ~12 TB/s | 3.6× faster than HBM',
            desc: 'Shared cache sitting between all SMs and HBM. Recently accessed data (hot KV entries, frequently used weights) may be served from here, avoiding the HBM round-trip. Mostly hardware-managed: software can influence what stays resident through hints such as CUDA\'s L2 persistence access policy, but it cannot place data explicitly.'
        },
        sram: {
            title: 'Shared Memory (SRAM)',
            color: '#a855f7',
            specs: '228 KB per SM | ~19 TB/s | 5.7× faster than HBM',
            desc: 'The key to FlashAttention\'s performance. This user-programmable scratchpad lets kernels stage data tiles and perform multiple operations without returning to HBM. CUTLASS and Triton kernels are designed to maximize SRAM utilization.'
        },
        register: {
            title: 'Register File',
            color: '#f59e0b',
            specs: '256 KB per SM | Fastest memory level',
            desc: 'The fastest memory on the chip, with the lowest access latency. Active computation values, loop counters, and intermediate results live here. Limited capacity forces careful tiling; when a kernel runs out of registers, values spill to local memory (backed by the caches and ultimately HBM), which hurts performance.'
        },
        tensor: {
            title: 'Tensor Cores',
            color: '#ec4899',
            specs: '~2000 TFLOPS (FP8) | Matrix multiply-accumulate',
            desc: 'Specialized units that perform fused matrix multiply-accumulate operations. They consume data from registers (or SRAM via WGMMA on Hopper) and produce results at incredible throughput—when they have data to process.'
        }
    };

    
    const infoTitle = document.getElementById('memInfoTitle-' + uniqueId);
    const infoSpecs = document.getElementById('memInfoSpecs-' + uniqueId);
    const infoDesc = document.getElementById('memInfoDesc-' + uniqueId);
    const infoPanel = document.getElementById('memInfoPanel-' + uniqueId);
    const colorDot = infoPanel.querySelector('.color-dot');

    
    const updateInfo = (levelId) => {
        const info = levelInfo[levelId];
        if (!info) return;

        infoTitle.textContent = info.title;
        infoSpecs.textContent = info.specs;
        infoDesc.textContent = info.desc;
        colorDot.style.background = info.color;

        
        infoPanel.style.transform = 'scale(1.02)';
        setTimeout(() => { infoPanel.style.transform = 'scale(1)'; }, 150);
    };

    
    const clearSelections = () => {
        document.querySelectorAll('.mem-level.selected').forEach(el => {
            el.classList.remove('selected');
        });
    };

    
    const handleLevelClick = (e) => {
        const level = e.target.closest('.mem-level');
        if (!level) return;

        const levelId = level.dataset.level;
        clearSelections();
        level.classList.add('selected');
        updateInfo(levelId);
    };

    
    const levels = ['hbm', 'l2', 'sram', 'register', 'tensor'];
    levels.forEach(levelId => {
        const el = document.getElementById('memLevel-' + levelId + '-' + uniqueId);
        if (el) {
            el.addEventListener('click', handleLevelClick);
        }
    });
})();
</script>
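<p>To see why decode sits so far below peak compute, the figures in the diagram above can be plugged into a simple roofline estimate. The sketch below assumes the ~2000 TFLOPS FP8 and 3.35 TB/s numbers quoted for this GPU and treats HBM as the only bandwidth limit; the constant and function names are illustrative, not taken from any library.</p>

```python
# Roofline estimate using the figures quoted in the diagram (assumed, not measured).
PEAK_FLOPS = 2000e12      # ~2000 TFLOPS FP8 Tensor Core peak
HBM_BANDWIDTH = 3.35e12   # 3.35 TB/s HBM3 bandwidth, in bytes per second


def attainable_flops(ops_per_byte: float) -> float:
    """Throughput ceiling for a kernel with the given arithmetic intensity."""
    return min(PEAK_FLOPS, ops_per_byte * HBM_BANDWIDTH)


# Intensity needed before the Tensor Cores, rather than HBM, become the limit.
ridge_point = PEAK_FLOPS / HBM_BANDWIDTH

# Fraction of peak compute reachable by decode at roughly 1 op/byte.
decode_utilization = attainable_flops(1.0) / PEAK_FLOPS

print(f"ridge point: {ridge_point:.0f} ops/byte")
print(f"decode utilization at 1 op/byte: {decode_utilization:.2%}")
```

<p>Under these assumptions a kernel needs several hundred operations per byte to saturate the Tensor Cores, while decode at about one operation per byte leaves them almost entirely idle.</p>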

<h2 id="cutlass-template-metaprogramming-foundation">CUTLASS: Template Metaprogramming Foundation</h2>
<p>When you call a matrix multiplication in PyTorch, it eventually reaches cuBLAS—NVIDIA&rsquo;s battle-tested linear algebra library. cuBLAS is fast, but it&rsquo;s a black box. You get the GEMM you&rsquo;re given.</p>
<p>For LLM inference, that&rsquo;s often not enough. Consider what happens when you want to run an INT4 quantized model. The weights are stored as packed 4-bit integers. Before the Tensor Cores can process them, you need to:</p>
<ol>
<li>Load 128-bit vectors containing packed INT4 weights</li>
<li>Unpack each 32-bit word into eight 4-bit values</li>
<li>Convert to FP16</li>
<li>Apply quantization scales</li>
<li>Feed the result to the Tensor Core</li>
</ol>
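<p>The unpack, convert, and scale steps above can be sketched in NumPy to make the data movement concrete. This is an illustration only: the nibble order (lowest bits first), the zero point of 8, and the single per-tensor scale are assumptions, and real quantization formats differ in all three. The name <code>dequantize_int4</code> is hypothetical.</p>

```python
import numpy as np


def dequantize_int4(packed: np.ndarray, scale: float, zero_point: int = 8) -> np.ndarray:
    """Unpack uint32 words into eight 4-bit values each, then convert and scale.

    Assumes the lowest 4 bits of each word hold the first value (hypothetical layout).
    """
    shifts = np.arange(8, dtype=np.uint32) * 4       # bit offset of each nibble
    nibbles = (packed[:, None] >> shifts) & 0xF      # shape (words, 8), values 0..15
    centered = nibbles.astype(np.int32) - zero_point # recentre to -8..7
    return (centered.astype(np.float16) * np.float16(scale)).reshape(-1)


# Two 32-bit words carry sixteen 4-bit weights.
packed = np.array([0x76543210, 0xFEDCBA98], dtype=np.uint32)
weights = dequantize_int4(packed, scale=0.5)
print(weights)
```

<p>On the GPU every intermediate array here would be a round-trip through memory if the steps ran as separate kernels, which is the cost a fused kernel avoids.</p>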
<p>If each step is a separate kernel, you&rsquo;re writing intermediate results to HBM between operations—exactly the memory traffic you&rsquo;re trying to avoid. What you need is a single fused kernel that does everything in registers.</p>
<p>This is what <a href="https://github.com/NVIDIA/cutlass">CUTLASS</a> enables. It&rsquo;s NVIDIA&rsquo;s header-only C++ template library for linear algebra, and it&rsquo;s the foundation beneath vLLM&rsquo;s quantization kernels, FlashAttention-3, and most high-performance transformer implementations.</p>
<h3 id="when-cublas-wont-cut-it">When cuBLAS Won&rsquo;t Cut It</h3>
<p>Use CUTLASS when you need:</p>
<ul>
<li><strong>Custom fusions</strong>: Bias + activation + quantization in one kernel</li>
<li><strong>Specific precision combinations</strong>: FP8 weights with FP16 accumulation</li>
<li><strong>Binary size constraints</strong>: cuBLAS ships a large precompiled library covering every case, while CUTLASS compiles only the kernels you instantiate</li>
</ul>
<p>The trade-off is complexity. CUTLASS kernels require understanding GPU architecture at a level most ML engineers never encounter. But for the performance-critical paths in inference—attention, FFN, quantized projections—that complexity pays dividends.</p>
<h2 id="triton-gpu-programming-without-the-pain">Triton: GPU Programming Without the Pain</h2>
<p>CUTLASS offers maximum control, but its learning curve is steep. Writing CUDA C++ means managing thread indices, avoiding bank conflicts, ensuring coalesced memory access, and reasoning about warp-level synchronization. A single misplaced <code>__syncthreads()</code> can introduce subtle bugs. A suboptimal memory access pattern can halve performance.</p>
<p><a href="https://triton-lang.org/">Triton</a> takes a different approach. Developed by OpenAI and now integral to PyTorch 2.0, it raises the abstraction level from threads to blocks.</p>
<h3 id="the-mental-model-shift">The Mental Model Shift</h3>
<p>Traditional CUDA asks: &ldquo;I am thread 47. What should I do?&rdquo;</p>
<p>Triton asks: &ldquo;I am processing this block of data. What operations should happen?&rdquo;</p>
<p>Consider loading data from memory. In CUDA, you calculate addresses, handle boundary conditions, and coordinate across threads for coalescing. In Triton:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> triton
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> triton.language <span style="color:#66d9ef">as</span> tl
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@triton.jit</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">kernel</span>(x_ptr, output_ptr, N, BLOCK_SIZE: tl<span style="color:#f92672">.</span>constexpr):
</span></span><span style="display:flex;"><span>    pid <span style="color:#f92672">=</span> tl<span style="color:#f92672">.</span>program_id(<span style="color:#ae81ff">0</span>)  <span style="color:#75715e"># index of the block this program instance owns</span>
</span></span><span style="display:flex;"><span>    offsets <span style="color:#f92672">=</span> pid <span style="color:#f92672">*</span> BLOCK_SIZE <span style="color:#f92672">+</span> tl<span style="color:#f92672">.</span>arange(<span style="color:#ae81ff">0</span>, BLOCK_SIZE)
</span></span><span style="display:flex;"><span>    mask <span style="color:#f92672">=</span> offsets <span style="color:#f92672">&lt;</span> N  <span style="color:#75715e"># guard the final, partial block</span>
</span></span><span style="display:flex;"><span>    x <span style="color:#f92672">=</span> tl<span style="color:#f92672">.</span>load(x_ptr <span style="color:#f92672">+</span> offsets, mask<span style="color:#f92672">=</span>mask)
</span></span><span style="display:flex;"><span>    result <span style="color:#f92672">=</span> x <span style="color:#f92672">*</span> <span style="color:#ae81ff">2.0</span>  <span style="color:#75715e"># placeholder for the real per-element work</span>
</span></span><span style="display:flex;"><span>    tl<span style="color:#f92672">.</span>store(output_ptr <span style="color:#f92672">+</span> offsets, result, mask<span style="color:#f92672">=</span>mask)
</span></span></code></pre></div><p>The <code>tl.load</code> call handles coalescing and vectorization automatically. The compiler figures out the optimal memory access pattern. No manual thread indexing or bank conflict avoidance.</p>
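<p>The block-level model can be emulated on the CPU to see what each program instance does. The NumPy sketch below is an illustration, not Triton itself: the Python loop stands in for the launch grid, and the function name <code>run_blocked</code> and the doubling operation are placeholders chosen for the example.</p>

```python
import numpy as np


def run_blocked(x: np.ndarray, block_size: int) -> np.ndarray:
    """Emulate a 1-D block-per-program kernel that doubles every element."""
    n = x.size
    out = np.empty_like(x)
    num_programs = -(-n // block_size)  # ceiling division: blocks needed to cover n
    for pid in range(num_programs):     # on a GPU these iterations run in parallel
        offsets = pid * block_size + np.arange(block_size)
        mask = offsets < n              # the last block may run past the end
        valid = offsets[mask]
        out[valid] = x[valid] * 2.0     # masked load, compute, masked store
    return out


x = np.arange(10, dtype=np.float32)
y = run_blocked(x, block_size=4)        # 3 blocks: sizes 4, 4, and a partial 2
print(y)
```

<p>The mask is what lets the block size stay fixed while the input length is arbitrary: the third block here covers offsets 8 to 11, and only the first two are touched.</p>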
<h3 id="the-pytorch-connection">The PyTorch Connection</h3>
<p>When you call <code>torch.compile()</code> on a model, TorchInductor generates Triton kernels for GPU execution. The fusion engine identifies sequences of pointwise operations (add, multiply, activation) that can be combined into single kernels. Instead of three separate kernels with intermediate HBM writes, you get one kernel that loads data once, performs all operations in registers, and stores once.</p>
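<p>A back-of-the-envelope count shows where the saving comes from. The sketch below assumes each pointwise operation is unary (reads one tensor, writes one tensor) and ignores caching, so it is a rough model rather than a measurement; <code>hbm_traffic_bytes</code> is a hypothetical helper.</p>

```python
def hbm_traffic_bytes(num_elements: int, bytes_per_element: int,
                      num_ops: int, fused: bool) -> int:
    """Bytes moved to and from HBM for a chain of unary pointwise ops."""
    tensor_bytes = num_elements * bytes_per_element
    # Unfused: every op reads its input and writes its output.
    # Fused: one read at the start, one write at the end.
    passes = 1 if fused else num_ops
    return passes * 2 * tensor_bytes


# Three pointwise ops over one million FP16 values.
unfused = hbm_traffic_bytes(1_000_000, 2, num_ops=3, fused=False)
fused = hbm_traffic_bytes(1_000_000, 2, num_ops=3, fused=True)
print(unfused, fused, unfused / fused)
```

<p>In this simplified model the traffic reduction equals the number of operations fused, which is why long pointwise chains benefit the most.</p>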
<p>A fused LayerNorm + Linear that might take 500+ lines of optimized CUDA can be written in roughly 50 lines of Triton. The resulting kernel usually won&rsquo;t match a hand-tuned CUTLASS implementation, but it is often close, and it takes hours to write instead of weeks.</p>
<h2 id="flashinfer-built-for-serving">FlashInfer: Built for Serving</h2>
<p>FlashAttention changed attention computation by recognizing that the bottleneck was memory I/O, not FLOPs. By computing attention tile-by-tile in SRAM and never materializing the N×N attention matrix in HBM, it cut the extra memory needed from O(N²) to O(N) and sharply reduced HBM reads and writes. This brought longer context lengths and faster training.</p>
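<p>The tile-by-tile idea rests on a running, or &ldquo;online&rdquo;, softmax. The NumPy sketch below handles a single query vector and streams keys and values in small tiles, keeping only a running maximum, a running sum, and an accumulator. It is a minimal illustration of the technique under those simplifications, not the FlashAttention kernel; the function names and tile size are chosen for the example.</p>

```python
import numpy as np


def tiled_attention(q, k, v, tile_size=4):
    """Attention for one query, streaming K/V in tiles with an online softmax."""
    scale = 1.0 / np.sqrt(q.shape[0])
    running_max = -np.inf            # largest score seen so far
    running_sum = 0.0                # softmax denominator, relative to running_max
    acc = np.zeros(v.shape[1])       # unnormalised weighted sum of values
    for start in range(0, k.shape[0], tile_size):
        k_tile = k[start:start + tile_size]
        v_tile = v[start:start + tile_size]
        scores = (k_tile @ q) * scale
        new_max = max(running_max, scores.max())
        rescale = np.exp(running_max - new_max)   # shrink old terms to the new max
        weights = np.exp(scores - new_max)
        running_sum = running_sum * rescale + weights.sum()
        acc = acc * rescale + weights @ v_tile
        running_max = new_max
    return acc / running_sum


def reference_attention(q, k, v):
    """Standard softmax attention that forms the full score vector."""
    scores = (k @ q) / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ v


rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal((10, 8))
v = rng.standard_normal((10, 8))
print(np.allclose(tiled_attention(q, k, v), reference_attention(q, k, v)))
```

<p>Only one tile of scores exists at any time, so the working set stays small enough for on-chip memory regardless of sequence length.</p>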
<p>But FlashAttention was designed for training workloads with regular, rectangular batches. Production serving is messier.</p>
<h3 id="the-serving-reality">The Serving Reality</h3>
<p>In a real serving deployment:</p>
<ul>
<li>Requests arrive with different context lengths (no neat rectangular batches)</li>
<li>The KV cache uses <a href="https://docs.vllm.ai/en/stable/design/arch_overview.html">PagedAttention</a> with non-contiguous memory blocks</li>
<li>Multiple requests share common prefixes (system prompts, document context)</li>
<li>CUDA graphs need static shapes, but batch composition changes every iteration</li>
</ul>
<p>FlashAttention was not originally designed around these constraints. <a href="https://flashinfer.ai/">FlashInfer</a> was.</p>
<h3 id="what-flashinfer-adds">What FlashInfer Adds</h3>
<p><strong>Block-sparse KV cache support</strong>: FlashInfer kernels operate on PagedAttention&rsquo;s block-sparse representation directly. Page tables map logical token indices to physical memory blocks, and FlashInfer traverses them efficiently without requiring contiguous memory.</p>
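<p>The page-table lookup itself is simple arithmetic. The sketch below assumes a fixed page size and a per-request list of physical block ids; the names <code>locate_token</code>, <code>page_table</code>, and <code>page_size</code> are illustrative and do not correspond to FlashInfer&rsquo;s or vLLM&rsquo;s actual data structures.</p>

```python
def locate_token(token_idx: int, page_table: list[int], page_size: int) -> tuple[int, int]:
    """Map a logical token index to (physical block id, offset within block)."""
    logical_page, offset = divmod(token_idx, page_size)
    return page_table[logical_page], offset


# A request whose KV cache lives in three scattered physical blocks.
page_table = [7, 2, 9]
page_size = 16

print(locate_token(0, page_table, page_size))
print(locate_token(17, page_table, page_size))
print(locate_token(40, page_table, page_size))
```

<p>A kernel that performs this translation while iterating over the KV cache never needs the blocks to be adjacent in memory.</p>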
<p><strong>Ragged tensor layouts</strong>: Standard kernels assume rectangular batches, padding shorter sequences to match the longest. FlashInfer operates on &ldquo;ragged&rdquo; layouts where sequences are packed tightly. No wasted compute on padding tokens.</p>
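<p>A ragged layout is commonly described by an offsets array in the style of CSR sparse matrices. The sketch below uses four example sequence lengths (8, 3, 5, and 2 tokens) and compares the packed layout with a padded rectangle; <code>indptr</code> and <code>sequence_tokens</code> are names picked for the illustration.</p>

```python
import numpy as np

lengths = np.array([8, 3, 5, 2])                      # tokens per request (example values)
indptr = np.concatenate(([0], np.cumsum(lengths)))    # start offset of each request
packed = np.arange(lengths.sum())                     # stand-in for packed token data


def sequence_tokens(packed, indptr, i):
    """Slice request i out of the packed buffer without any padding."""
    return packed[indptr[i]:indptr[i + 1]]


# Cost of the padded alternative: every row stretched to the longest request.
padded_slots = len(lengths) * lengths.max()
padding_fraction = 1 - lengths.sum() / padded_slots

print(indptr)
print(sequence_tokens(packed, indptr, 2))
print(f"padded slots: {padded_slots}, wasted: {padding_fraction:.0%}")
```

<p>With these lengths the padded batch holds 32 slots for 18 real tokens, so close to half of the rectangle is padding that a ragged kernel never touches.</p>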
<p><strong>Plan/run separation</strong>: FlashInfer separates scheduling decisions from kernel execution. The &ldquo;plan&rdquo; phase precomputes work distribution based on current batch composition. The &ldquo;run&rdquo; phase executes with that plan. This separation enables CUDA graph capture—record the run phase once, replay it with different inputs.</p>
<p><strong>Cascade attention</strong>: When multiple requests share a common prefix (a system prompt, a retrieved document), naive approaches reread that prefix&rsquo;s KV cache from HBM for every request. FlashInfer&rsquo;s cascade attention splits the work in two: one pass attends all queries to the shared prefix, so its KV data is loaded once and reused across requests, and a second pass covers each request&rsquo;s unique suffix. The two partial results are then merged into the final output. For a 32K shared prefix across 256 requests, this is reported to yield up to a 31x speedup.</p>
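<p>Attention over a shared prefix and attention over a per-request suffix can be combined exactly, provided each partial result carries its log-sum-exp. The NumPy sketch below shows that merge identity for a single query. It illustrates the general math only and is not FlashInfer&rsquo;s kernel or API; <code>partial_attention</code> and <code>merge_states</code> are hypothetical names.</p>

```python
import numpy as np


def partial_attention(q, k, v):
    """Attention over one chunk of keys; returns (output, log-sum-exp of scores)."""
    scores = (k @ q) / np.sqrt(q.shape[0])
    m = scores.max()
    weights = np.exp(scores - m)
    return (weights @ v) / weights.sum(), m + np.log(weights.sum())


def merge_states(o1, lse1, o2, lse2):
    """Combine two partial attention results into the result over both chunks."""
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)   # relative softmax mass of each chunk
    return (w1 * o1 + w2 * o2) / (w1 + w2), m + np.log(w1 + w2)


rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal((12, 8))
v = rng.standard_normal((12, 8))

# First 9 keys play the shared prefix, the last 3 the request's own suffix.
o_prefix, lse_prefix = partial_attention(q, k[:9], v[:9])
o_suffix, lse_suffix = partial_attention(q, k[9:], v[9:])
merged, _ = merge_states(o_prefix, lse_prefix, o_suffix, lse_suffix)

full, _ = partial_attention(q, k, v)
print(np.allclose(merged, full))
```

<p>Because the merge is exact, the prefix pass can be batched across every request that shares it while each suffix pass stays independent.</p>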
<h3 id="integration-with-vllm">Integration with vLLM</h3>
<p>vLLM&rsquo;s attention backend isn&rsquo;t monolithic. A kernel selection layer examines the workload (hardware architecture, head dimension, precision, model type) and dispatches to the appropriate backend: FlashAttention for standard cases, FlashInfer for PagedAttention scenarios, Triton for specific configurations. This flexibility means you get optimized kernels for your actual workload, not a one-size-fits-all solution.</p>


<style>
.flashinfer-viz {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: linear-gradient(180deg, #0f172a 0%, #1e293b 100%);
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #e2e8f0;
}

.flashinfer-viz * {
  box-sizing: border-box;
}

.fi-header {
  text-align: center;
  margin-bottom: 2rem;
}

.fi-title {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.5rem;
}

.fi-subtitle {
  font-size: 0.9rem;
  color: #94a3b8;
  max-width: 500px;
  margin: 0 auto;
}

 
.fi-tabs {
  display: flex;
  justify-content: center;
  gap: 0.5rem;
  margin-bottom: 1.5rem;
  flex-wrap: wrap;
}

.fi-tab {
  padding: 0.6rem 1.2rem;
  border-radius: 8px;
  border: 1px solid #334155;
  background: transparent;
  color: #94a3b8;
  cursor: pointer;
  font-size: 0.85rem;
  font-weight: 500;
  transition: all 0.2s ease;
}

.fi-tab:hover {
  background: rgba(99, 102, 241, 0.1);
  border-color: #6366f1;
  color: #c7d2fe;
}

.fi-tab.active {
  background: rgba(99, 102, 241, 0.2);
  border-color: #6366f1;
  color: #a5b4fc;
}

 
.fi-panel {
  display: none;
  animation: fadeIn 0.3s ease;
}

.fi-panel.active {
  display: block;
}

@keyframes fadeIn {
  from { opacity: 0; transform: translateY(10px); }
  to { opacity: 1; transform: translateY(0); }
}

 
.batch-comparison {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1.5rem;
  margin-bottom: 1rem;
}

@media (max-width: 700px) {
  .batch-comparison {
    grid-template-columns: 1fr;
  }
}

.batch-box {
  background: rgba(15, 23, 42, 0.6);
  border-radius: 12px;
  padding: 1.25rem;
  border: 1px solid #334155;
}

.batch-box-title {
  font-size: 0.9rem;
  font-weight: 600;
  margin-bottom: 1rem;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.batch-box.padded .batch-box-title {
  color: #f87171;
}

.batch-box.ragged .batch-box-title {
  color: #4ade80;
}

.batch-visual {
  display: flex;
  flex-direction: column;
  gap: 4px;
  margin-bottom: 1rem;
  font-family: monospace;
}

.batch-row {
  display: flex;
  gap: 2px;
  align-items: center;
}

.batch-label {
  font-size: 0.65rem;
  color: #64748b;
  width: 45px;
  flex-shrink: 0;
}

.token {
  width: 18px;
  height: 18px;
  border-radius: 3px;
  font-size: 0.55rem;
  display: flex;
  align-items: center;
  justify-content: center;
  font-weight: 600;
}

.token.data {
  background: linear-gradient(135deg, #3b82f6 0%, #1d4ed8 100%);
  color: #fff;
}

.token.pad {
  background: #1e293b;
  border: 1px dashed #475569;
  color: #475569;
}

.token.packed {
  background: linear-gradient(135deg, #22c55e 0%, #16a34a 100%);
  color: #fff;
}

.batch-stats {
  display: flex;
  gap: 1rem;
  font-size: 0.75rem;
}

.batch-stat {
  display: flex;
  flex-direction: column;
  gap: 2px;
}

.batch-stat-label {
  color: #64748b;
}

.batch-stat-value {
  font-weight: 600;
}

.batch-box.padded .batch-stat-value {
  color: #f87171;
}

.batch-box.ragged .batch-stat-value {
  color: #4ade80;
}

 
.cascade-container {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1.5rem;
}

@media (max-width: 700px) {
  .cascade-container {
    grid-template-columns: 1fr;
  }
}

.cascade-box {
  background: rgba(15, 23, 42, 0.6);
  border-radius: 12px;
  padding: 1.25rem;
  border: 1px solid #334155;
}

.cascade-title {
  font-size: 0.9rem;
  font-weight: 600;
  margin-bottom: 1rem;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.cascade-box.naive .cascade-title {
  color: #f87171;
}

.cascade-box.optimized .cascade-title {
  color: #4ade80;
}

.cascade-diagram {
  display: flex;
  flex-direction: column;
  gap: 6px;
  margin-bottom: 1rem;
}

.cascade-request {
  display: flex;
  align-items: center;
  gap: 4px;
  font-size: 0.7rem;
}

.cascade-request-label {
  color: #64748b;
  width: 55px;
  flex-shrink: 0;
}

.prefix-block {
  height: 20px;
  border-radius: 4px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 0.6rem;
  font-weight: 600;
}

.prefix-block.computed {
  background: linear-gradient(135deg, #f59e0b 0%, #d97706 100%);
  color: #fff;
}

.prefix-block.cached {
  background: rgba(245, 158, 11, 0.2);
  border: 1px solid #f59e0b;
  color: #fbbf24;
}

.prefix-block.suffix {
  background: linear-gradient(135deg, #8b5cf6 0%, #7c3aed 100%);
  color: #fff;
}

.cascade-arrow {
  color: #475569;
  font-size: 0.7rem;
  text-align: center;
  padding: 0.25rem 0;
}

.cascade-result {
  display: flex;
  justify-content: space-between;
  align-items: center;
  padding: 0.75rem;
  border-radius: 8px;
  margin-top: 0.5rem;
}

.cascade-box.naive .cascade-result {
  background: rgba(248, 113, 113, 0.1);
  border: 1px solid rgba(248, 113, 113, 0.3);
}

.cascade-box.optimized .cascade-result {
  background: rgba(74, 222, 128, 0.1);
  border: 1px solid rgba(74, 222, 128, 0.3);
}

.cascade-metric {
  display: flex;
  flex-direction: column;
  gap: 2px;
}

.cascade-metric-label {
  font-size: 0.65rem;
  color: #64748b;
}

.cascade-metric-value {
  font-size: 0.9rem;
  font-weight: 700;
}

.cascade-box.naive .cascade-metric-value {
  color: #f87171;
}

.cascade-box.optimized .cascade-metric-value {
  color: #4ade80;
}

.speedup-badge {
  background: linear-gradient(135deg, #22c55e 0%, #16a34a 100%);
  color: #fff;
  padding: 0.5rem 1rem;
  border-radius: 20px;
  font-weight: 700;
  font-size: 1rem;
  text-align: center;
  margin-top: 1rem;
}

 
.feature-grid {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1rem;
}

@media (max-width: 600px) {
  .feature-grid {
    grid-template-columns: 1fr;
  }
}

.feature-card {
  background: rgba(15, 23, 42, 0.6);
  border-radius: 12px;
  padding: 1.25rem;
  border: 1px solid #334155;
  transition: all 0.2s ease;
}

.feature-card:hover {
  border-color: #4ade80;
  transform: translateY(-2px);
}

.feature-card-header {
  display: flex;
  align-items: center;
  gap: 0.75rem;
  margin-bottom: 0.75rem;
}

.feature-icon {
  width: 36px;
  height: 36px;
  border-radius: 8px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 1rem;
}

.feature-card:nth-child(1) .feature-icon {
  background: rgba(139, 92, 246, 0.2);
}

.feature-card:nth-child(2) .feature-icon {
  background: rgba(34, 197, 94, 0.2);
}

.feature-card:nth-child(3) .feature-icon {
  background: rgba(59, 130, 246, 0.2);
}

.feature-card:nth-child(4) .feature-icon {
  background: rgba(245, 158, 11, 0.2);
}

.feature-card-title {
  font-size: 0.9rem;
  font-weight: 600;
  color: #f1f5f9;
}

.feature-card-desc {
  font-size: 0.8rem;
  color: #94a3b8;
  line-height: 1.5;
}

 
.comparison-header {
  display: flex;
  justify-content: center;
  gap: 2rem;
  margin-bottom: 1.5rem;
  flex-wrap: wrap;
}

.comparison-item {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  font-size: 0.85rem;
}

.comparison-dot {
  width: 10px;
  height: 10px;
  border-radius: 50%;
}

.comparison-dot.fa {
  background: #3b82f6;
}

.comparison-dot.fi {
  background: #22c55e;
}

 
.summary-stats {
  display: flex;
  justify-content: center;
  gap: 2rem;
  margin-top: 1.5rem;
  flex-wrap: wrap;
}

.summary-stat {
  text-align: center;
}

.summary-stat-value {
  font-size: 1.5rem;
  font-weight: 700;
  color: #4ade80;
}

.summary-stat-label {
  font-size: 0.75rem;
  color: #64748b;
}

 
.fi-callout {
  background: rgba(59, 130, 246, 0.1);
  border: 1px solid rgba(59, 130, 246, 0.3);
  border-radius: 8px;
  padding: 1rem;
  margin-top: 1rem;
  font-size: 0.8rem;
  color: #93c5fd;
  text-align: center;
}
</style>

<div class="flashinfer-viz">
  <div class="fi-header">
    <div class="fi-title">FlashAttention vs FlashInfer</div>
    <div class="fi-subtitle">FlashAttention optimized training. FlashInfer optimizes the messy reality of production serving.</div>
  </div>

  <div class="fi-tabs">
    <button class="fi-tab active" onclick="showPanel('batch')">Ragged Batches</button>
    <button class="fi-tab" onclick="showPanel('cascade')">Cascade Attention</button>
    <button class="fi-tab" onclick="showPanel('features')">All Features</button>
  </div>

  
  <div id="panel-batch" class="fi-panel active">
    <div class="comparison-header">
      <div class="comparison-item">
        <div class="comparison-dot fa"></div>
        <span style="color: #94a3b8;">FlashAttention: Padded batches</span>
      </div>
      <div class="comparison-item">
        <div class="comparison-dot fi"></div>
        <span style="color: #94a3b8;">FlashInfer: Ragged batches</span>
      </div>
    </div>

    <div class="batch-comparison">
      <div class="batch-box padded">
        <div class="batch-box-title">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <rect x="3" y="3" width="18" height="18" rx="2"/>
            <line x1="3" y1="9" x2="21" y2="9"/>
            <line x1="3" y1="15" x2="21" y2="15"/>
          </svg>
          Padded Rectangular Batch
        </div>
        <div class="batch-visual">
          <div class="batch-row">
            <span class="batch-label">Req 1</span>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token data">T</div>
          </div>
          <div class="batch-row">
            <span class="batch-label">Req 2</span>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
          </div>
          <div class="batch-row">
            <span class="batch-label">Req 3</span>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
          </div>
          <div class="batch-row">
            <span class="batch-label">Req 4</span>
            <div class="token data">T</div>
            <div class="token data">T</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
            <div class="token pad">∅</div>
          </div>
        </div>
        <div class="batch-stats">
          <div class="batch-stat">
            <span class="batch-stat-label">Real tokens</span>
            <span class="batch-stat-value">18</span>
          </div>
          <div class="batch-stat">
            <span class="batch-stat-label">Padding</span>
            <span class="batch-stat-value">14 (44% waste)</span>
          </div>
        </div>
      </div>

      <div class="batch-box ragged">
        <div class="batch-box-title">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <path d="M3 3h18v18H3z"/>
            <path d="M3 9h14"/>
            <path d="M3 15h10"/>
          </svg>
          Ragged Packed Layout
        </div>
        <div class="batch-visual">
          <div class="batch-row">
            <span class="batch-label">Packed</span>
            <div class="token packed">1</div>
            <div class="token packed">1</div>
            <div class="token packed">1</div>
            <div class="token packed">1</div>
            <div class="token packed">1</div>
            <div class="token packed">1</div>
            <div class="token packed">1</div>
            <div class="token packed">1</div>
          </div>
          <div class="batch-row">
            <span class="batch-label"></span>
            <div class="token packed">2</div>
            <div class="token packed">2</div>
            <div class="token packed">2</div>
            <div class="token packed">3</div>
            <div class="token packed">3</div>
            <div class="token packed">3</div>
            <div class="token packed">3</div>
            <div class="token packed">3</div>
          </div>
          <div class="batch-row">
            <span class="batch-label"></span>
            <div class="token packed">4</div>
            <div class="token packed">4</div>
          </div>
        </div>
        <div class="batch-stats">
          <div class="batch-stat">
            <span class="batch-stat-label">Real tokens</span>
            <span class="batch-stat-value">18</span>
          </div>
          <div class="batch-stat">
            <span class="batch-stat-label">Padding</span>
            <span class="batch-stat-value">0 (0% waste)</span>
          </div>
        </div>
      </div>
    </div>

    <div class="fi-callout">
      FlashInfer tracks sequence boundaries with offset arrays, enabling tight packing without wasted compute.
    </div>
  </div>

  
  <div id="panel-cascade" class="fi-panel">
    <div style="text-align: center; margin-bottom: 1rem;">
      <span style="color: #94a3b8; font-size: 0.85rem;">Scenario: 4 requests share a 32K token system prompt</span>
    </div>

    <div class="cascade-container">
      <div class="cascade-box naive">
        <div class="cascade-title">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <circle cx="12" cy="12" r="10"/>
            <line x1="12" y1="8" x2="12" y2="12"/>
            <line x1="12" y1="16" x2="12.01" y2="16"/>
          </svg>
          Naive Approach
        </div>
        <div class="cascade-diagram">
          <div class="cascade-request">
            <span class="cascade-request-label">Request 1</span>
            <div class="prefix-block computed" style="flex: 3;">Prefix (32K)</div>
            <div class="prefix-block suffix" style="flex: 1;">+512</div>
          </div>
          <div class="cascade-request">
            <span class="cascade-request-label">Request 2</span>
            <div class="prefix-block computed" style="flex: 3;">Prefix (32K)</div>
            <div class="prefix-block suffix" style="flex: 1;">+256</div>
          </div>
          <div class="cascade-request">
            <span class="cascade-request-label">Request 3</span>
            <div class="prefix-block computed" style="flex: 3;">Prefix (32K)</div>
            <div class="prefix-block suffix" style="flex: 1;">+128</div>
          </div>
          <div class="cascade-request">
            <span class="cascade-request-label">Request 4</span>
            <div class="prefix-block computed" style="flex: 3;">Prefix (32K)</div>
            <div class="prefix-block suffix" style="flex: 1;">+64</div>
          </div>
        </div>
        <div class="cascade-result">
          <div class="cascade-metric">
            <span class="cascade-metric-label">Prefix computed</span>
            <span class="cascade-metric-value">4×</span>
          </div>
          <div class="cascade-metric">
            <span class="cascade-metric-label">Total attention</span>
            <span class="cascade-metric-value">~129K tokens</span>
          </div>
        </div>
      </div>

      <div class="cascade-box optimized">
        <div class="cascade-title">
          <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
            <path d="M22 11.08V12a10 10 0 1 1-5.93-9.14"/>
            <polyline points="22 4 12 14.01 9 11.01"/>
          </svg>
          FlashInfer Cascade
        </div>
        <div class="cascade-diagram">
          <div class="cascade-request">
            <span class="cascade-request-label">Shared</span>
            <div class="prefix-block computed" style="flex: 3;">Prefix (32K) → cache once</div>
          </div>
          <div class="cascade-arrow">↓ cached result ↓</div>
          <div class="cascade-request">
            <span class="cascade-request-label">Request 1</span>
            <div class="prefix-block cached" style="flex: 3;">cached</div>
            <div class="prefix-block suffix" style="flex: 1;">+512</div>
          </div>
          <div class="cascade-request">
            <span class="cascade-request-label">Request 2</span>
            <div class="prefix-block cached" style="flex: 3;">cached</div>
            <div class="prefix-block suffix" style="flex: 1;">+256</div>
          </div>
          <div class="cascade-request">
            <span class="cascade-request-label">Req 3, 4...</span>
            <div class="prefix-block cached" style="flex: 3;">cached</div>
            <div class="prefix-block suffix" style="flex: 1;">+...</div>
          </div>
        </div>
        <div class="cascade-result">
          <div class="cascade-metric">
            <span class="cascade-metric-label">Prefix computed</span>
            <span class="cascade-metric-value">1×</span>
          </div>
          <div class="cascade-metric">
            <span class="cascade-metric-label">Total attention</span>
            <span class="cascade-metric-value">~33K tokens</span>
          </div>
        </div>
      </div>
    </div>

    <div class="speedup-badge">
      Up to 31× speedup reported for a 32K shared prefix across 256 requests
    </div>
  </div>

  
  <div id="panel-features" class="fi-panel">
    <div class="feature-grid">
      <div class="feature-card">
        <div class="feature-card-header">
          <div class="feature-icon">📦</div>
          <span class="feature-card-title">Block-Sparse KV Cache</span>
        </div>
        <div class="feature-card-desc">
          Native support for PagedAttention's non-contiguous memory blocks. Traverses page tables efficiently without requiring contiguous memory layouts.
        </div>
      </div>

      <div class="feature-card">
        <div class="feature-card-header">
          <div class="feature-icon">📐</div>
          <span class="feature-card-title">Ragged Tensor Layouts</span>
        </div>
        <div class="feature-card-desc">
          Sequences packed tightly with no padding waste. Tracks boundaries via offset arrays for variable-length batches.
        </div>
      </div>

      <div class="feature-card">
        <div class="feature-card-header">
          <div class="feature-icon">🔄</div>
          <span class="feature-card-title">Plan/Run Separation</span>
        </div>
        <div class="feature-card-desc">
          Precomputes work distribution in a "plan" phase, enabling CUDA graph capture. Record once, replay with different inputs.
        </div>
      </div>

      <div class="feature-card">
        <div class="feature-card-header">
          <div class="feature-icon">⚡</div>
          <span class="feature-card-title">Cascade Attention</span>
        </div>
        <div class="feature-card-desc">
          Processes shared prefixes once, caches results, computes only unique suffixes. Massive speedups for common system prompts.
        </div>
      </div>
    </div>

    <div class="summary-stats">
      <div class="summary-stat">
        <div class="summary-stat-value">0%</div>
        <div class="summary-stat-label">Padding waste</div>
      </div>
      <div class="summary-stat">
        <div class="summary-stat-value">31×</div>
        <div class="summary-stat-label">Cascade speedup</div>
      </div>
      <div class="summary-stat">
        <div class="summary-stat-value">✓</div>
        <div class="summary-stat-label">CUDA graph compatible</div>
      </div>
    </div>
  </div>
</div>
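<p>The ragged layout in the figure above comes down to one flat token buffer plus an offset array. Here is a minimal sketch, assuming plain Python lists stand in for GPU tensors. The helpers <code>pack_ragged</code> and <code>padded_size</code>, and the name <code>indptr</code>, are illustrative choices for this example rather than FlashInfer&rsquo;s API.</p>

```python
from itertools import accumulate

def pack_ragged(sequences):
    """Concatenate variable-length sequences and record where each one starts."""
    lengths = [len(seq) for seq in sequences]
    # indptr[i] is the start of sequence i; indptr[-1] is the total token count.
    indptr = [0] + list(accumulate(lengths))
    packed = [tok for seq in sequences for tok in seq]
    return packed, indptr

def padded_size(sequences):
    """Slots a rectangular batch would need: batch size times the longest length."""
    return len(sequences) * max(len(seq) for seq in sequences)

# Same shape as the figure: four requests with 8, 3, 5 and 2 tokens.
requests = [[1] * 8, [2] * 3, [3] * 5, [4] * 2]
packed, indptr = pack_ragged(requests)
wasted = padded_size(requests) - len(packed)

print(indptr)                        # [0, 8, 11, 16, 18]
print(len(packed), wasted)           # 18 14
print(packed[indptr[2]:indptr[3]])   # request 3: [3, 3, 3, 3, 3]
```

<p>Slicing with two neighbouring offsets recovers any request without scanning the buffer, which is why no padding slots are needed.</p>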

<script>
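function showPanel(panelName) {
  // Clear the active state from every tab, then mark the clicked one.
  document.querySelectorAll('.fi-tab').forEach(tab => {
    tab.classList.remove('active');
  });
  // Relies on the global window.event that is set while the inline onclick handler runs.
  event.target.classList.add('active');

  // Hide every panel, then reveal the one whose id matches panelName.
  document.querySelectorAll('.fi-panel').forEach(panel => {
    panel.classList.remove('active');
  });
  document.getElementById('panel-' + panelName).classList.add('active');
}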
</script>
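<p>Cascade attention depends on being able to combine attention computed over the shared prefix with attention computed over each request&rsquo;s own suffix. One standard way to do that is to keep the log-sum-exp of each chunk&rsquo;s scores and use it to re-weight the partial outputs. The sketch below does this in plain Python, assuming a single query vector and a single head. The function names are hypothetical and this is not FlashInfer&rsquo;s implementation; it only checks that the merged result matches attention over the concatenated keys and values.</p>

```python
import math
import random

def attention_state(q, keys, values):
    """Attention of one query over a KV chunk: returns (output, log-sum-exp of scores)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
    top = max(scores)                                  # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]
    return output, top + math.log(total)

def merge_states(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if one softmax had covered both chunks."""
    top = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - top), math.exp(lse_b - top)
    return [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]

def random_rows(rng, rows, dim):
    """Helper for the demo: a rows-by-dim matrix of Gaussian samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]

rng = random.Random(0)
dim = 8
prefix_k, prefix_v = random_rows(rng, 32, dim), random_rows(rng, 32, dim)
suffix_k, suffix_v = random_rows(rng, 4, dim), random_rows(rng, 4, dim)
query = random_rows(rng, 1, dim)[0]

out_p, lse_p = attention_state(query, prefix_k, prefix_v)   # pass over the shared prefix KV
out_s, lse_s = attention_state(query, suffix_k, suffix_v)   # pass over this request's suffix KV
merged = merge_states(out_p, lse_p, out_s, lse_s)

full, _ = attention_state(query, prefix_k + suffix_k, prefix_v + suffix_v)
matches = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(merged, full))
print(matches)   # True
```

<p>Because the merge is exact, the prefix and suffix passes can run as separate kernels, and the prefix keys and values only need to be read once for the whole batch.</p>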

<h2 id="nccl-the-invisible-communication-backbone">NCCL: The Invisible Communication Backbone</h2>
<p>Everything discussed so far assumes the model fits on a single GPU. For frontier models, it doesn&rsquo;t. Llama-70B needs roughly 140GB for its weights alone in FP16 (70 billion parameters × 2 bytes), nearly two 80GB H100s&rsquo; worth of memory before any KV cache is allocated. Larger models need more.</p>
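<p>The arithmetic behind that figure is short enough to write down. This is a back-of-the-envelope sketch that counts weights only; the helper names are made up for this example, and the result is a lower bound because it ignores the KV cache, activations, and runtime overhead.</p>

```python
import math

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def min_gpus_for_weights(num_params, bytes_per_param, gpu_memory_gb):
    """Lower bound on GPU count: ignores KV cache, activations and runtime overhead."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)

params_70b = 70e9
print(weight_memory_gb(params_70b, 2))             # FP16 weights: 140.0
print(min_gpus_for_weights(params_70b, 2, 80))     # on 80 GB GPUs: 2
print(min_gpus_for_weights(params_70b, 0.5, 80))   # 4-bit weights: 1
```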
<p>Tensor parallelism splits the model across GPUs within a server. Weight matrices are sharded so each GPU holds a slice. Each GPU computes a partial result, and then&hellip; they have to talk to each other.</p>
<p>This is <a href="https://docs.nvidia.com/deeplearning/nccl/">NCCL&rsquo;s</a> domain.</p>
<h3 id="the-communication-pattern">The Communication Pattern</h3>
<p>Tensor parallelism using the Megatron-LM algorithm requires two AllReduce operations per transformer layer:</p>
<ol>
<li><strong>After attention output projection</strong>: Each GPU computes attention over its subset of heads, so its output projection yields only a partial sum. AllReduce adds the partial sums across GPUs.</li>
<li><strong>After FFN down projection</strong>: Each GPU computes a partial FFN result from its weight shard. AllReduce sums them.</li>
</ol>
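<p>Both synchronization points follow the same shape: a matrix multiply whose inner dimension is sharded, followed by a sum across GPUs. Here is a toy sketch in plain Python, assuming nested lists stand in for tensors and a local function stands in for the collective. The names <code>row_parallel_linear</code> and <code>all_reduce_sum</code> are hypothetical, and real deployments call NCCL through their framework instead.</p>

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def all_reduce_sum(partials):
    """Stand-in for AllReduce: every rank ends up with the elementwise sum."""
    total = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    return [total for _ in partials]

def row_parallel_linear(x, w, world_size):
    """Shard the inner dimension across ranks, multiply locally, then AllReduce."""
    shard = len(w) // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        x_shard = [row[lo:hi] for row in x]   # this rank's slice of the activations
        w_shard = w[lo:hi]                    # this rank's slice of the weight rows
        partials.append(matmul(x_shard, w_shard))
    return all_reduce_sum(partials)

x = [[1, 2, 3, 4], [5, 6, 7, 8]]              # 2 tokens, inner dimension 4
w = [[1, 0], [0, 1], [2, 1], [1, 2]]          # inner dimension 4, output dimension 2
per_rank = row_parallel_linear(x, w, world_size=2)
print(per_rank[0])                            # [[11, 13], [27, 29]]
print(per_rank[0] == matmul(x, w))            # True
```

<p>No rank ever holds the full weight matrix, yet after the sum every rank holds the full output, which is what the next layer needs.</p>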
<p>AllReduce means &ldquo;sum tensors across all GPUs and distribute the result to all GPUs.&rdquo; For Llama-70B on 4 GPUs, each AllReduce operates on a tensor of <code>batch_size × sequence_length × hidden_dim × bytes_per_element</code> bytes, and it runs 160 times per forward pass (2 per layer × 80 layers).</p>
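<p>Plugging numbers into that formula shows how different prefill and decode look to the interconnect. This sketch assumes a Llama-70B-like shape (hidden dimension 8192, 80 layers) with FP16 activations and a batch size of 1; the function names and the 2048-token prompt length are illustrative.</p>

```python
def allreduce_payload_bytes(batch_size, seq_len, hidden_dim, bytes_per_element):
    """Size of the activation tensor each AllReduce has to sum."""
    return batch_size * seq_len * hidden_dim * bytes_per_element

def allreduces_per_forward(num_layers, per_layer=2):
    """Two AllReduce calls per transformer layer under Megatron-style sharding."""
    return num_layers * per_layer

# Assumed Llama-70B-like shape: hidden_dim 8192, 80 layers, FP16 activations.
HIDDEN, LAYERS, FP16_BYTES = 8192, 80, 2

decode = allreduce_payload_bytes(1, 1, HIDDEN, FP16_BYTES)       # one new token
prefill = allreduce_payload_bytes(1, 2048, HIDDEN, FP16_BYTES)   # 2048-token prompt

print(allreduces_per_forward(LAYERS))   # 160
print(decode / 1024)                    # 16.0 (KiB)
print(prefill / 2**20)                  # 32.0 (MiB)
```

<p>Decode sends many tiny messages, so per-call latency dominates; prefill sends a few large ones, so bandwidth dominates.</p>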
<h3 id="the-interconnect-gap">The Interconnect Gap</h3>
<p>The choice of interconnect dominates multi-GPU inference performance:</p>
<table>
  <thead>
      <tr>
          <th>Interconnect</th>
          <th>Bandwidth</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NVLink 4.0</td>
          <td>900 GB/s bidirectional</td>
      </tr>
      <tr>
          <td>PCIe Gen5 x16</td>
          <td>128 GB/s bidirectional</td>
      </tr>
  </tbody>
</table>
<p>That&rsquo;s roughly a 7× gap (900 / 128 ≈ 7.0). On NVLink, tensor parallelism adds modest overhead. On PCIe, communication rather than memory bandwidth becomes the bottleneck.</p>
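<p>To see what the gap means per call, here is a rough estimate using the textbook ring cost model, in which each of the 2(k-1) steps moves one k-th of the payload. It assumes the headline bandwidth figures from the table are fully achievable and defaults per-step latency to zero, neither of which holds on real hardware, so treat the output as an upper bound on how well each link can do. The function name and the 32 MiB payload are assumptions made for illustration.</p>

```python
def ring_allreduce_seconds(payload_bytes, num_gpus, link_gb_per_s, step_latency_s=0.0):
    """Alpha-beta estimate: 2(k-1) steps, each moving payload/k bytes per GPU."""
    steps = 2 * (num_gpus - 1)
    chunk = payload_bytes / num_gpus
    return steps * (step_latency_s + chunk / (link_gb_per_s * 1e9))

payload = 32 * 2**20   # assumed 32 MiB prefill activation
nvlink = ring_allreduce_seconds(payload, 4, 900)
pcie = ring_allreduce_seconds(payload, 4, 128)

print(round(nvlink * 1e6), round(pcie * 1e6))   # 56 393 (microseconds)
print(round(pcie / nvlink, 2))                  # 7.03
```

<p>Multiplied by 160 calls per forward pass, the difference between tens and hundreds of microseconds per call adds up quickly.</p>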
<p>Even when well optimized, communication overhead consumes 20-35% of inference time for Llama-70B on 4×H100. That is why single-GPU inference (when the model fits) is usually preferable, and why quantizing a model so it fits on fewer GPUs often improves overall throughput despite the precision loss.</p>
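<p>One common way to implement AllReduce is the ring algorithm: a reduce-scatter pass in which each GPU accumulates one chunk of the sum, followed by an all-gather pass that circulates the finished chunks. The simulation below is a toy model in plain Python, with lists standing in for GPU buffers and a hypothetical <code>ring_all_reduce</code> function; it is not how NCCL is implemented, only a way to check the data movement and the 2(k-1) step count.</p>

```python
def ring_all_reduce(buffers):
    """Simulate ring AllReduce (sum) over equal-length lists, one list per rank."""
    k = len(buffers)
    n = len(buffers[0]) // k                  # chunk length; assumes k divides the length
    data = [list(buf) for buf in buffers]
    steps = 0

    def span(c):
        return slice(c * n, (c + 1) * n)

    # Reduce-scatter: after k-1 steps, rank r holds the full sum of chunk (r+1) % k.
    for step in range(k - 1):
        outgoing = []
        for r in range(k):
            c = (r - step) % k
            outgoing.append((r, c, data[r][span(c)]))   # snapshot before any rank writes
        for r, c, payload in outgoing:
            dst = (r + 1) % k
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]
        steps += 1

    # All-gather: pass the finished chunks around so every rank has every sum.
    for step in range(k - 1):
        outgoing = []
        for r in range(k):
            c = (r + 1 - step) % k
            outgoing.append((r, c, data[r][span(c)]))
        for r, c, payload in outgoing:
            data[(r + 1) % k][span(c)] = payload
        steps += 1

    return data, steps

buffers = [[rank * 10 + i for i in range(8)] for rank in range(4)]
result, steps = ring_all_reduce(buffers)
expected = [sum(col) for col in zip(*buffers)]

print(steps)                                    # 6, i.e. 2 * (4 - 1)
print(all(buf == expected for buf in result))   # True
```

<p>Each rank only ever talks to its neighbour and sends one chunk per step, which is why the ring uses bandwidth well but needs a number of steps that grows with the GPU count.</p>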


<style>
.nccl-viz {
  font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
  background: linear-gradient(180deg, #0f172a 0%, #1e293b 100%);
  border-radius: 16px;
  padding: 2rem;
  margin: 2rem 0;
  color: #e2e8f0;
}

.nccl-viz * {
  box-sizing: border-box;
}

.nccl-header {
  text-align: center;
  margin-bottom: 2rem;
}

.nccl-title {
  font-size: 1.5rem;
  font-weight: 700;
  color: #f8fafc;
  margin-bottom: 0.5rem;
}

.nccl-subtitle {
  font-size: 0.9rem;
  color: #94a3b8;
  max-width: 550px;
  margin: 0 auto;
}

 
.nccl-tabs {
  display: flex;
  justify-content: center;
  gap: 0.5rem;
  margin-bottom: 1.5rem;
  flex-wrap: wrap;
}

.nccl-tab {
  padding: 0.6rem 1.2rem;
  border-radius: 8px;
  border: 1px solid #334155;
  background: transparent;
  color: #94a3b8;
  cursor: pointer;
  font-size: 0.85rem;
  font-weight: 500;
  transition: all 0.2s ease;
}

.nccl-tab:hover {
  background: rgba(139, 92, 246, 0.1);
  border-color: #8b5cf6;
  color: #c4b5fd;
}

.nccl-tab.active {
  background: rgba(139, 92, 246, 0.2);
  border-color: #8b5cf6;
  color: #a78bfa;
}

 
.nccl-panel {
  display: none;
  animation: ncclFadeIn 0.3s ease;
}

.nccl-panel.active {
  display: block;
}

@keyframes ncclFadeIn {
  from { opacity: 0; transform: translateY(10px); }
  to { opacity: 1; transform: translateY(0); }
}

 
.algo-comparison {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1.5rem;
}

@media (max-width: 750px) {
  .algo-comparison {
    grid-template-columns: 1fr;
  }
}

.algo-box {
  background: rgba(15, 23, 42, 0.6);
  border-radius: 12px;
  padding: 1.25rem;
  border: 1px solid #334155;
}

.algo-box-header {
  display: flex;
  align-items: center;
  justify-content: space-between;
  margin-bottom: 1rem;
}

.algo-box-title {
  font-size: 1rem;
  font-weight: 600;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.algo-box.ring .algo-box-title {
  color: #a78bfa;
}

.algo-box.tree .algo-box-title {
  color: #fbbf24;
}

.algo-badge {
  font-size: 0.65rem;
  padding: 0.25rem 0.5rem;
  border-radius: 4px;
  font-weight: 600;
}

.algo-box.ring .algo-badge {
  background: rgba(139, 92, 246, 0.2);
  color: #a78bfa;
}

.algo-box.tree .algo-badge {
  background: rgba(251, 191, 36, 0.2);
  color: #fbbf24;
}

 
.ring-container {
  position: relative;
  width: 200px;
  height: 200px;
  margin: 0 auto 1rem;
}

.gpu-node {
  position: absolute;
  width: 50px;
  height: 50px;
  border-radius: 10px;
  display: flex;
  flex-direction: column;
  align-items: center;
  justify-content: center;
  font-weight: 600;
  font-size: 0.75rem;
  transition: all 0.3s ease;
}

.gpu-node.ring-style {
  background: linear-gradient(135deg, #8b5cf6 0%, #7c3aed 100%);
  border: 2px solid #a78bfa;
  color: #fff;
}

.gpu-node.tree-style {
  background: linear-gradient(135deg, #f59e0b 0%, #d97706 100%);
  border: 2px solid #fbbf24;
  color: #fff;
}

.gpu-node .gpu-label {
  font-size: 0.85rem;
  font-weight: 700;
}

.gpu-node .gpu-data {
  font-size: 0.55rem;
  opacity: 0.8;
}

 
.ring-container .gpu-node:nth-child(1) { top: 0; left: 50%; transform: translateX(-50%); }
.ring-container .gpu-node:nth-child(2) { top: 50%; right: 0; transform: translateY(-50%); }
.ring-container .gpu-node:nth-child(3) { bottom: 0; left: 50%; transform: translateX(-50%); }
.ring-container .gpu-node:nth-child(4) { top: 50%; left: 0; transform: translateY(-50%); }

 
.ring-arrows {
  position: absolute;
  top: 0;
  left: 0;
  width: 100%;
  height: 100%;
  pointer-events: none;
}

.ring-arrows svg {
  width: 100%;
  height: 100%;
}

 
.data-packet {
  position: absolute;
  width: 12px;
  height: 12px;
  background: #22c55e;
  border-radius: 50%;
  box-shadow: 0 0 10px #22c55e;
  animation: ringFlow 3s linear infinite;
}

@keyframes ringFlow {
  0% { offset-distance: 0%; opacity: 1; }
  100% { offset-distance: 100%; opacity: 1; }
}

.packet-1 { offset-path: path('M 100 25 Q 175 25 175 100 Q 175 175 100 175 Q 25 175 25 100 Q 25 25 100 25'); animation-delay: 0s; }
.packet-2 { offset-path: path('M 100 25 Q 175 25 175 100 Q 175 175 100 175 Q 25 175 25 100 Q 25 25 100 25'); animation-delay: 0.75s; }
.packet-3 { offset-path: path('M 100 25 Q 175 25 175 100 Q 175 175 100 175 Q 25 175 25 100 Q 25 25 100 25'); animation-delay: 1.5s; }
.packet-4 { offset-path: path('M 100 25 Q 175 25 175 100 Q 175 175 100 175 Q 25 175 25 100 Q 25 25 100 25'); animation-delay: 2.25s; }

 
.tree-container {
  position: relative;
  width: 200px;
  height: 180px;
  margin: 0 auto 1rem;
}

.tree-container .gpu-node:nth-child(1) { top: 0; left: 50%; transform: translateX(-50%); }
.tree-container .gpu-node:nth-child(2) { top: 70px; left: 15%; transform: translateX(-50%); }
.tree-container .gpu-node:nth-child(3) { top: 70px; left: 85%; transform: translateX(-50%); }
.tree-container .gpu-node:nth-child(4) { top: 130px; left: 50%; transform: translateX(-50%); width: 44px; height: 44px; }

.tree-lines {
  position: absolute;
  top: 0;
  left: 0;
  width: 100%;
  height: 100%;
  pointer-events: none;
}

 
.tree-packet {
  position: absolute;
  width: 10px;
  height: 10px;
  background: #22c55e;
  border-radius: 50%;
  box-shadow: 0 0 8px #22c55e;
  opacity: 0;
}

.tree-packet.up {
  animation: treeUp 2s ease-in-out infinite;
}

.tree-packet.down {
  animation: treeDown 2s ease-in-out infinite;
  animation-delay: 1s;
}

@keyframes treeUp {
  0% { opacity: 0; }
  10% { opacity: 1; }
  50% { opacity: 1; }
  60% { opacity: 0; }
  100% { opacity: 0; }
}

@keyframes treeDown {
  0% { opacity: 0; }
  10% { opacity: 1; }
  50% { opacity: 1; }
  60% { opacity: 0; }
  100% { opacity: 0; }
}

 
.algo-stats {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 0.75rem;
  margin-top: 1rem;
}

.algo-stat {
  background: rgba(15, 23, 42, 0.8);
  border-radius: 8px;
  padding: 0.75rem;
  text-align: center;
}

.algo-stat-label {
  font-size: 0.65rem;
  color: #64748b;
  margin-bottom: 0.25rem;
}

.algo-stat-value {
  font-size: 0.9rem;
  font-weight: 700;
}

.algo-box.ring .algo-stat-value {
  color: #a78bfa;
}

.algo-box.tree .algo-stat-value {
  color: #fbbf24;
}

 
.bandwidth-section {
  margin-top: 1.5rem;
}

.bandwidth-title {
  font-size: 0.9rem;
  font-weight: 600;
  color: #f1f5f9;
  margin-bottom: 1rem;
  text-align: center;
}

.bandwidth-bars {
  max-width: 400px;
  margin: 0 auto;
}

.bandwidth-bar-row {
  display: flex;
  align-items: center;
  gap: 1rem;
  margin-bottom: 0.75rem;
}

.bandwidth-label {
  width: 70px;
  font-size: 0.8rem;
  font-weight: 600;
  flex-shrink: 0;
}

.bandwidth-label.nvlink {
  color: #4ade80;
}

.bandwidth-label.pcie {
  color: #f87171;
}

.bandwidth-track {
  flex: 1;
  height: 24px;
  background: rgba(15, 23, 42, 0.8);
  border-radius: 6px;
  overflow: hidden;
  position: relative;
}

.bandwidth-fill {
  height: 100%;
  border-radius: 6px;
  display: flex;
  align-items: center;
  justify-content: flex-end;
  padding-right: 0.5rem;
  font-size: 0.7rem;
  font-weight: 700;
  color: #fff;
  transition: width 1s ease;
}

.bandwidth-fill.nvlink {
  width: 100%;
  background: linear-gradient(90deg, #22c55e 0%, #16a34a 100%);
}

.bandwidth-fill.pcie {
  width: 14.2%;
  background: linear-gradient(90deg, #ef4444 0%, #dc2626 100%);
}

.bandwidth-gap {
  text-align: center;
  margin-top: 1rem;
  padding: 0.75rem;
  background: rgba(239, 68, 68, 0.1);
  border: 1px solid rgba(239, 68, 68, 0.3);
  border-radius: 8px;
}

.bandwidth-gap-value {
  font-size: 1.25rem;
  font-weight: 700;
  color: #f87171;
}

.bandwidth-gap-label {
  font-size: 0.75rem;
  color: #94a3b8;
}

 
.use-case-grid {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1rem;
  margin-top: 1rem;
}

@media (max-width: 600px) {
  .use-case-grid {
    grid-template-columns: 1fr;
  }
}

.use-case-card {
  background: rgba(15, 23, 42, 0.6);
  border-radius: 10px;
  padding: 1rem;
  border: 1px solid #334155;
}

.use-case-card.ring {
  border-color: rgba(139, 92, 246, 0.4);
}

.use-case-card.tree {
  border-color: rgba(251, 191, 36, 0.4);
}

.use-case-header {
  display: flex;
  align-items: center;
  gap: 0.5rem;
  margin-bottom: 0.75rem;
}

.use-case-icon {
  width: 32px;
  height: 32px;
  border-radius: 6px;
  display: flex;
  align-items: center;
  justify-content: center;
  font-size: 1rem;
}

.use-case-card.ring .use-case-icon {
  background: rgba(139, 92, 246, 0.2);
}

.use-case-card.tree .use-case-icon {
  background: rgba(251, 191, 36, 0.2);
}

.use-case-name {
  font-size: 0.85rem;
  font-weight: 600;
  color: #f1f5f9;
}

.use-case-desc {
  font-size: 0.75rem;
  color: #94a3b8;
  line-height: 1.5;
}

.use-case-example {
  margin-top: 0.5rem;
  font-size: 0.7rem;
  color: #64748b;
  font-style: italic;
}

 
.impact-section {
  margin-top: 1.5rem;
  padding: 1rem;
  background: rgba(139, 92, 246, 0.1);
  border: 1px solid rgba(139, 92, 246, 0.3);
  border-radius: 10px;
}

.impact-title {
  font-size: 0.85rem;
  font-weight: 600;
  color: #a78bfa;
  margin-bottom: 0.75rem;
  display: flex;
  align-items: center;
  gap: 0.5rem;
}

.impact-stats {
  display: flex;
  justify-content: space-around;
  flex-wrap: wrap;
  gap: 1rem;
}

.impact-stat {
  text-align: center;
}

.impact-stat-value {
  font-size: 1.25rem;
  font-weight: 700;
  color: #c4b5fd;
}

.impact-stat-label {
  font-size: 0.7rem;
  color: #94a3b8;
}

 
.animation-hint {
  text-align: center;
  margin-top: 0.5rem;
  font-size: 0.7rem;
  color: #64748b;
}

 
.nccl-callout {
  background: rgba(59, 130, 246, 0.1);
  border: 1px solid rgba(59, 130, 246, 0.3);
  border-radius: 8px;
  padding: 1rem;
  margin-top: 1.5rem;
  font-size: 0.8rem;
  color: #93c5fd;
  text-align: center;
}
</style>

<div class="nccl-viz">
  <div class="nccl-header">
    <div class="nccl-title">NCCL AllReduce Patterns</div>
    <div class="nccl-subtitle">How NVIDIA's collective communication library orchestrates multi-GPU data synchronization for tensor parallelism.</div>
  </div>

  <div class="nccl-tabs">
    <button class="nccl-tab active" onclick="showNcclPanel('algorithms')">Ring vs Tree</button>
    <button class="nccl-tab" onclick="showNcclPanel('bandwidth')">Interconnect Gap</button>
    <button class="nccl-tab" onclick="showNcclPanel('usage')">When to Use</button>
  </div>

  
  <div id="nccl-panel-algorithms" class="nccl-panel active">
    <div class="algo-comparison">
      
      <div class="algo-box ring">
        <div class="algo-box-header">
          <div class="algo-box-title">
            <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <circle cx="12" cy="12" r="10"/>
              <path d="M12 6v6l4 2"/>
            </svg>
            Ring AllReduce
          </div>
          <span class="algo-badge">Bandwidth Optimal</span>
        </div>

        <div class="ring-container">
          <div class="gpu-node ring-style">
            <span class="gpu-label">GPU 0</span>
            <span class="gpu-data">chunk A</span>
          </div>
          <div class="gpu-node ring-style">
            <span class="gpu-label">GPU 1</span>
            <span class="gpu-data">chunk B</span>
          </div>
          <div class="gpu-node ring-style">
            <span class="gpu-label">GPU 2</span>
            <span class="gpu-data">chunk C</span>
          </div>
          <div class="gpu-node ring-style">
            <span class="gpu-label">GPU 3</span>
            <span class="gpu-data">chunk D</span>
          </div>

          <div class="ring-arrows">
            <svg viewBox="0 0 200 200">
              <defs>
                <marker id="ring-arrow" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
                  <polygon points="0 0, 8 3, 0 6" fill="#a78bfa"/>
                </marker>
              </defs>
              
              <path d="M 120 35 Q 165 35 165 80" fill="none" stroke="#a78bfa" stroke-width="2" marker-end="url(#ring-arrow)" opacity="0.6"/>
              <path d="M 165 120 Q 165 165 120 165" fill="none" stroke="#a78bfa" stroke-width="2" marker-end="url(#ring-arrow)" opacity="0.6"/>
              <path d="M 80 165 Q 35 165 35 120" fill="none" stroke="#a78bfa" stroke-width="2" marker-end="url(#ring-arrow)" opacity="0.6"/>
              <path d="M 35 80 Q 35 35 80 35" fill="none" stroke="#a78bfa" stroke-width="2" marker-end="url(#ring-arrow)" opacity="0.6"/>
            </svg>
          </div>

          
          <div class="data-packet packet-1"></div>
          <div class="data-packet packet-2"></div>
          <div class="data-packet packet-3"></div>
          <div class="data-packet packet-4"></div>
        </div>

        <div class="animation-hint">Data chunks flow around the ring</div>

        <div class="algo-stats">
          <div class="algo-stat">
            <div class="algo-stat-label">Latency</div>
            <div class="algo-stat-value">O(k)</div>
          </div>
          <div class="algo-stat">
            <div class="algo-stat-label">Bandwidth</div>
            <div class="algo-stat-value">Optimal</div>
          </div>
          <div class="algo-stat">
            <div class="algo-stat-label">Best for</div>
            <div class="algo-stat-value">Large msgs</div>
          </div>
          <div class="algo-stat">
            <div class="algo-stat-label">Steps</div>
            <div class="algo-stat-value">2(k-1)</div>
          </div>
        </div>
      </div>

      
      <div class="algo-box tree">
        <div class="algo-box-header">
          <div class="algo-box-title">
            <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
              <path d="M12 3v12"/>
              <path d="M5 10l7-7 7 7"/>
              <path d="M8 21h8"/>
              <path d="M12 15v6"/>
            </svg>
            Tree AllReduce
          </div>
          <span class="algo-badge">Latency Optimal</span>
        </div>

        <div class="tree-container">
          <div class="gpu-node tree-style">
            <span class="gpu-label">GPU 0</span>
            <span class="gpu-data">root</span>
          </div>
          <div class="gpu-node tree-style">
            <span class="gpu-label">GPU 1</span>
            <span class="gpu-data">child</span>
          </div>
          <div class="gpu-node tree-style">
            <span class="gpu-label">GPU 2</span>
            <span class="gpu-data">child</span>
          </div>
          <div class="gpu-node tree-style">
            <span class="gpu-label">GPU 3</span>
            <span class="gpu-data">leaf</span>
          </div>

          <div class="tree-lines">
            <svg viewBox="0 0 200 180">
              <defs>
                <marker id="tree-arrow-up" markerWidth="6" markerHeight="5" refX="5" refY="2.5" orient="auto">
                  <polygon points="0 0, 6 2.5, 0 5" fill="#fbbf24"/>
                </marker>
                <marker id="tree-arrow-down" markerWidth="6" markerHeight="5" refX="1" refY="2.5" orient="auto">
                  <polygon points="6 0, 0 2.5, 6 5" fill="#22c55e"/>
                </marker>
              </defs>
              
              <line x1="85" y1="50" x2="45" y2="75" stroke="#fbbf24" stroke-width="2" opacity="0.6"/>
              <line x1="115" y1="50" x2="155" y2="75" stroke="#fbbf24" stroke-width="2" opacity="0.6"/>
              <line x1="100" y1="50" x2="100" y2="130" stroke="#fbbf24" stroke-width="2" opacity="0.6"/>
            </svg>
          </div>
        </div>

        <div class="animation-hint">Reduce up, broadcast down</div>

        <div class="algo-stats">
          <div class="algo-stat">
            <div class="algo-stat-label">Latency</div>
            <div class="algo-stat-value">O(log k)</div>
          </div>
          <div class="algo-stat">
            <div class="algo-stat-label">Bandwidth</div>
            <div class="algo-stat-value">Sub-optimal</div>
          </div>
          <div class="algo-stat">
            <div class="algo-stat-label">Best for</div>
            <div class="algo-stat-value">Small msgs</div>
          </div>
          <div class="algo-stat">
            <div class="algo-stat-label">Steps</div>
            <div class="algo-stat-value">2 log₂(k)</div>
          </div>
        </div>
      </div>
    </div>

    <div class="nccl-callout">
      NCCL automatically selects the optimal algorithm based on message size and GPU topology.
    </div>
  </div>

  
  <div id="nccl-panel-bandwidth" class="nccl-panel">
    <div class="bandwidth-section">
      <div class="bandwidth-title">GPU Interconnect Bandwidth Comparison</div>

      <div class="bandwidth-bars">
        <div class="bandwidth-bar-row">
          <span class="bandwidth-label nvlink">NVLink 4.0</span>
          <div class="bandwidth-track">
            <div class="bandwidth-fill nvlink">900 GB/s</div>
          </div>
        </div>
        <div class="bandwidth-bar-row">
          <span class="bandwidth-label pcie">PCIe Gen5</span>
          <div class="bandwidth-track">
            <div class="bandwidth-fill pcie">128</div>
          </div>
        </div>
      </div>

      <div class="bandwidth-gap">
        <div class="bandwidth-gap-value">7× Gap</div>
        <div class="bandwidth-gap-label">NVLink is essential for efficient tensor parallelism</div>
      </div>
    </div>

    <div class="impact-section">
      <div class="impact-title">
        <svg width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
          <path d="M13 2L3 14h9l-1 8 10-12h-9l1-8z"/>
        </svg>
        Impact on LLM Inference (Llama-70B, 4×H100)
      </div>
      <div class="impact-stats">
        <div class="impact-stat">
          <div class="impact-stat-value">160</div>
          <div class="impact-stat-label">AllReduce ops/forward pass</div>
        </div>
        <div class="impact-stat">
          <div class="impact-stat-value">20-35%</div>
          <div class="impact-stat-label">Time spent on communication</div>
        </div>
        <div class="impact-stat">
          <div class="impact-stat-value">~30 KB</div>
          <div class="impact-stat-label">Per AllReduce (decode)</div>
        </div>
      </div>
    </div>

    <div class="nccl-callout">
      With PCIe, communication overhead can exceed 50%—making NVLink critical for multi-GPU inference.
    </div>
  </div>

  
  <div id="nccl-panel-usage" class="nccl-panel">
    <div class="use-case-grid">
      <div class="use-case-card ring">
        <div class="use-case-header">
          <div class="use-case-icon">🔄</div>
          <span class="use-case-name">Ring AllReduce</span>
        </div>
        <div class="use-case-desc">
          Maximizes bandwidth utilization by pipelining data transfers. Each GPU sends and receives simultaneously, achieving near-optimal throughput.
        </div>
        <div class="use-case-example">
          Best for: Prefill phase, gradient sync, large activation tensors (>1MB)
        </div>
      </div>

      <div class="use-case-card tree">
        <div class="use-case-header">
          <div class="use-case-icon">🌲</div>
          <span class="use-case-name">Tree AllReduce</span>
        </div>
        <div class="use-case-desc">
          Minimizes latency with logarithmic steps. Reduces to root, then broadcasts back. Fewer synchronization points but lower bandwidth efficiency.
        </div>
        <div class="use-case-example">
          Best for: Decode phase, small tensors, latency-critical paths (~30KB)
        </div>
      </div>

      <div class="use-case-card ring">
        <div class="use-case-header">
          <div class="use-case-icon">📊</div>
          <span class="use-case-name">High GPU Count</span>
        </div>
        <div class="use-case-desc">
          Ring scales well with many GPUs since bandwidth stays constant. Tree latency grows logarithmically but wastes bandwidth at scale.
        </div>
        <div class="use-case-example">
          8+ GPUs: Ring preferred for most operations
        </div>
      </div>

      <div class="use-case-card tree">
        <div class="use-case-header">
          <div class="use-case-icon">⚡</div>
          <span class="use-case-name">Latency Sensitive</span>
        </div>
        <div class="use-case-desc">
          When per-token latency matters more than throughput, tree's O(log k) steps beat ring's O(k) steps, even at the cost of bandwidth.
        </div>
        <div class="use-case-example">
          Interactive inference, real-time applications
        </div>
      </div>
    </div>

    <div class="nccl-callout">
      Modern NCCL selects the algorithm per operation: tree for small messages (roughly &lt;256KB), ring for larger transfers.
    </div>
  </div>
</div>

<script>
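function showNcclPanel(panelName, evt) {
  // Deactivate every tab, then mark the clicked one as active.
  document.querySelectorAll('.nccl-tab').forEach(tab => {
    tab.classList.remove('active');
  });
  // Prefer an explicitly passed event; fall back to the legacy global one.
  const clickEvent = evt || window.event;
  if (clickEvent && clickEvent.target) {
    clickEvent.target.classList.add('active');
  }

  // Hide every panel, then show the requested one if it exists.
  document.querySelectorAll('.nccl-panel').forEach(panel => {
    panel.classList.remove('active');
  });
  const activePanel = document.getElementById('nccl-panel-' + panelName);
  if (activePanel) {
    activePanel.classList.add('active');
  }
}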
</script>
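<p>The ring-versus-tree trade-off in the panel above can be sketched with the textbook latency-bandwidth cost model. This is a rough illustration, not NCCL&rsquo;s actual tuning logic: the per-step latency, the link bandwidth, and the unpipelined tree are all assumptions chosen for readability.</p>

```python
import math

STEP_LATENCY_S = 5e-6    # assumed fixed cost per communication step (hypothetical)
LINK_BANDWIDTH = 450e9   # assumed bytes/s per link (hypothetical)

def ring_allreduce_time(message_bytes, gpus):
    # Reduce-scatter plus all-gather: 2*(k-1) steps, each moving 1/k of the message.
    steps = 2 * (gpus - 1)
    return steps * (STEP_LATENCY_S + message_bytes / gpus / LINK_BANDWIDTH)

def tree_allreduce_time(message_bytes, gpus):
    # Reduce to the root, then broadcast back: 2*ceil(log2 k) steps,
    # each moving the full message (no pipelining in this simplified model).
    steps = 2 * math.ceil(math.log2(gpus))
    return steps * (STEP_LATENCY_S + message_bytes / LINK_BANDWIDTH)

for size in (30e3, 64e6):
    ring = ring_allreduce_time(size, gpus=4)
    tree = tree_allreduce_time(size, gpus=4)
    winner = "tree" if tree < ring else "ring"
    print(f"{size / 1e3:>8.0f} KB  ring {ring * 1e6:7.1f} us  tree {tree * 1e6:7.1f} us  -> {winner}")
```

<p>Under these assumed constants, the decode-sized 30 KB message favors tree and the 64 MB message favors ring. Where the crossover falls depends entirely on the constants, which is why real libraries tune it per topology.</p>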

<h2 id="putting-it-together">Putting It Together</h2>
<p>During decode, a single token flows through the entire stack: vLLM schedules the batch, PyTorch dispatches through CUDA graphs, and each transformer layer executes CUTLASS GEMMs for projections (with fused quantization), FlashInfer kernels for attention over the paged KV cache, and NCCL AllReduces if using tensor parallelism.</p>
<p>The time breakdown tells the story:</p>
<ul>
<li><strong>Attention kernels</strong>: 40-60%</li>
<li><strong>FFN/MLP kernels</strong>: 30-40%</li>
<li><strong>Communication (with TP)</strong>: 20-35%</li>
<li><strong>Everything else</strong>: &lt;10%</li>
</ul>
<p>Attention and FFN dominate. Both are memory-bound.</p>
<h2 id="the-memory-bandwidth-endgame">The Memory Bandwidth Endgame</h2>
<p>Every library in this stack attacks the same fundamental constraint: memory bandwidth. CUTLASS enables fused kernels that minimize HBM round-trips. Triton makes writing such kernels accessible. FlashInfer optimizes attention&rsquo;s memory access patterns. NCCL minimizes communication overhead that competes for the same memory bandwidth.</p>
<p>The hardware is evolving in the same direction. NVIDIA&rsquo;s Blackwell B200 delivers 8 TB/s of HBM bandwidth, about 2.4× that of the H100, and introduces native FP4 support, halving bytes-per-parameter relative to FP8.</p>
<p>Understanding this stack is not just an academic exercise. If you&rsquo;re deploying LLMs at scale, these libraries determine your cost per token, your latency percentiles, your maximum context length. The optimizations that matter aren&rsquo;t in the model architecture; they&rsquo;re in the software that maps that architecture onto silicon.</p>
<p>The iceberg runs deep. Now you know what&rsquo;s beneath the surface.</p>
]]></content:encoded></item><item><title>Speculative Decoding: When Guessing Right Makes for Faster Inference</title><link>https://www.mdjawad.com/posts/speculative-decoding/</link><pubDate>Tue, 23 Dec 2025 10:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/speculative-decoding/</guid><description>How speculative decoding achieves 2-3× inference speedup without changing model outputs, and why GLM-4.7&amp;rsquo;s native multi-token prediction marks a paradigm shift.</description><content:encoded><![CDATA[<h2 id="the-speed-problem-wasnt-always-about-compute">The Speed Problem Wasn&rsquo;t Always About Compute</h2>
<p>In late 2022 and early 2023, two independent research teams at Google and DeepMind published papers with remarkably similar insights. Both had discovered a way to make large language models generate text 2-3× faster without approximations, without quality loss, and without changing the output distribution at all. The technique was speculative decoding.</p>
<p>Here&rsquo;s the counterintuitive reality: when you run a 70B parameter model on a modern GPU, most of the computational units sit idle. The expensive tensor cores that can perform trillions of operations per second spend the majority of their time doing nothing, waiting. They&rsquo;re waiting for data to arrive from memory. This is the memory bandwidth bottleneck, and it&rsquo;s the reason that making LLMs faster is about doing <em>more</em> useful work with each expensive memory read.</p>
<p>Speculative decoding exploits this idle capacity in an elegant way: use a small, fast model to guess what tokens the big model will produce, then verify those guesses in parallel. When the guesses are right, and they often are, you&rsquo;ve generated multiple tokens for the price of one memory read of the large model&rsquo;s weights.</p>
<p>GLM-4.7, Zhipu AI&rsquo;s 355B parameter flagship released in December 2025, takes this further by building Multi-Token Prediction directly into its architecture. With vLLM&rsquo;s optimized implementation, this achieves acceptance rates exceeding 90% and generation speeds beyond 100 tokens per second, a glimpse of where inference optimization is heading.</p>
<h2 id="why-llm-inference-is-memory-bound">Why LLM Inference Is Memory-Bound</h2>
<p>To understand why speculative decoding works, we need to understand why LLM inference is slow in the first place.</p>
<p>Consider what happens when a 70B parameter model generates a single token. The GPU must:</p>
<ol>
<li>Load the model&rsquo;s ~140GB of weights (70B parameters at 2 bytes each) from High Bandwidth Memory (HBM)</li>
<li>Perform matrix multiplications with the current token&rsquo;s hidden states</li>
<li>Produce a probability distribution over the vocabulary</li>
<li>Sample one token</li>
<li>Repeat for the next token</li>
</ol>
<p>The critical insight is in step 1. An NVIDIA H100 GPU can perform roughly 2,000 trillion floating-point operations per second (2,000 TFLOPS) at peak FP8 throughput. But its memory bandwidth, the rate at which it can read data from HBM, is &ldquo;only&rdquo; 3.35 TB/s.</p>
<p>Let&rsquo;s do the arithmetic. Loading 140GB of weights at 3.35 TB/s takes about 42 milliseconds. The actual matrix multiplications for a single token? Perhaps 1-2 milliseconds of computation. The GPU spends roughly 95% of its time waiting for memory transfers and only 5% doing actual math.</p>
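<p>The same arithmetic as a few lines of Python. This is a back-of-the-envelope sketch: it assumes 16-bit weights, takes the 1-2 ms compute estimate from the text as an input rather than deriving it, and treats memory reads and math as strictly sequential.</p>

```python
PARAMS = 70e9
BYTES_PER_PARAM = 2        # assumption: 16-bit weights, which is what ~140 GB implies
HBM_BANDWIDTH = 3.35e12    # bytes/s, the H100 figure quoted above

weight_bytes = PARAMS * BYTES_PER_PARAM
load_ms = weight_bytes / HBM_BANDWIDTH * 1e3

def waiting_fraction(compute_ms):
    # Share of one decode step spent on the weight read, with no overlap assumed.
    return load_ms / (load_ms + compute_ms)

print(f"weights: {weight_bytes / 1e9:.0f} GB, load time: {load_ms:.1f} ms")
for compute_ms in (1.0, 2.0):
    print(f"compute {compute_ms:.0f} ms -> waiting on memory {waiting_fraction(compute_ms):.1%}")
```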
<p>This ratio is captured by a metric called <em>arithmetic intensity</em>: the number of floating-point operations performed per byte of memory transferred. For autoregressive LLM inference at batch size 1, arithmetic intensity is approximately 1-2 FLOP/byte. Modern GPUs are designed for workloads with arithmetic intensity of 100+ FLOP/byte. The mismatch is severe.</p>
<p>If we could somehow verify multiple tokens in a single forward pass, we&rsquo;d amortize that expensive 42ms memory read across several tokens instead of just one. This is the aim of speculative decoding.</p>
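<p>To see how scoring several tokens per weight read changes the picture, here is a minimal sketch of arithmetic intensity as a function of tokens per forward pass. It counts only weight traffic (KV-cache and activation reads are ignored, which is an assumption), and the function name is illustrative.</p>

```python
H100_PEAK_FLOPS = 2e15          # ~2,000 TFLOPS, the figure quoted above
H100_HBM_BANDWIDTH = 3.35e12    # bytes/s, the figure quoted above

def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs per byte of weights read, ignoring KV-cache and activation traffic."""
    flops_per_param = 2 * tokens_per_pass   # one multiply and one add per token
    return flops_per_param / bytes_per_param

# Intensity at which the math units, not memory, become the limit.
ridge_point = H100_PEAK_FLOPS / H100_HBM_BANDWIDTH

for tokens in (1, 4, 8):
    print(f"{tokens} token(s) per pass: {arithmetic_intensity(tokens):.0f} FLOP/byte")
print(f"compute-bound only above roughly {ridge_point:.0f} FLOP/byte")
```

<p>Even at eight tokens per pass, the workload in this model sits far below the ridge point, so the extra tokens ride along on a memory read that had to happen anyway.</p>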
<h2 id="the-draft-verify-paradigm">The Draft-Verify Paradigm</h2>
<p>The speculative decoding algorithm operates in a simple loop:</p>
<p><strong>Draft Phase</strong>: A small, fast &ldquo;draft&rdquo; model autoregressively generates γ candidate tokens. Because this model is 50-100× smaller than the target, its memory reads are proportionally faster.</p>
<p><strong>Verify Phase</strong>: The large &ldquo;target&rdquo; model scores all γ candidates, plus the position after them, in a single forward pass. Because a transformer processes every position of its input in parallel, scoring γ tokens takes nearly the same time as scoring one: the weights are read from memory once either way, and that read dominates the cost.</p>
<p><strong>Accept/Reject Phase</strong>: Walk through the draft tokens in order, comparing the draft model&rsquo;s probability for each one against the target model&rsquo;s. Each token is accepted with probability $\min(1, p/q)$; at the first rejection, discard the remaining drafts and resample that position from an adjusted distribution.</p>
<p>Here&rsquo;s the algorithm in pseudocode:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">speculative_decode</span>(prefix, draft_model, target_model, γ):
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Step 1: Draft γ tokens autoregressively (cheap)</span>
</span></span><span style="display:flex;"><span>    drafts <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(γ):
</span></span><span style="display:flex;"><span>        q_i <span style="color:#f92672">=</span> draft_model(prefix <span style="color:#f92672">+</span> drafts)
</span></span><span style="display:flex;"><span>        x_i <span style="color:#f92672">=</span> sample(q_i)
</span></span><span style="display:flex;"><span>        drafts<span style="color:#f92672">.</span>append(x_i)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Step 2: Score all positions in parallel (expensive, but single pass)</span>
</span></span><span style="display:flex;"><span>    p_1, <span style="color:#f92672">...</span>, p_{γ<span style="color:#f92672">+</span><span style="color:#ae81ff">1</span>} <span style="color:#f92672">=</span> target_model(prefix, prefix<span style="color:#f92672">+</span>x_1, <span style="color:#f92672">...</span>, prefix<span style="color:#f92672">+</span>x_1<span style="color:#f92672">...</span>x_γ)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Step 3: Accept/reject with rejection sampling</span>
</span></span><span style="display:flex;"><span>    n <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>  <span style="color:#75715e"># number of accepted tokens</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(γ):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> random() <span style="color:#f92672">&lt;</span> min(<span style="color:#ae81ff">1</span>, p_i(x_i) <span style="color:#f92672">/</span> q_i(x_i)):
</span></span><span style="display:flex;"><span>            n <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Reject: resample from adjusted distribution</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span> prefix <span style="color:#f92672">+</span> drafts[:n] <span style="color:#f92672">+</span> sample(normalize(max(<span style="color:#ae81ff">0</span>, p_i <span style="color:#f92672">-</span> q_i)))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># All accepted: bonus token from final position</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> prefix <span style="color:#f92672">+</span> drafts <span style="color:#f92672">+</span> sample(p_{γ<span style="color:#f92672">+</span><span style="color:#ae81ff">1</span>})
</span></span></code></pre></div><p>The key is in step 3. When we accept a draft token, we move forward. When we reject, we don&rsquo;t just discard the draft; we sample from an adjusted distribution that &ldquo;fills in&rdquo; exactly the probability mass the draft model missed. This ensures the output distribution is mathematically identical to standard autoregressive decoding.</p>
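<p>For readers who want something executable, here is one way the pseudocode above could be turned into plain Python. It is a sketch under loud assumptions: the &ldquo;models&rdquo; are hypothetical functions that return a fixed probability list over a three-token vocabulary, and the verify phase calls the target once per position where a real system would batch those calls into a single forward pass.</p>

```python
import random

def sample(dist, rng):
    # Draw a token id according to the probability list `dist`.
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def speculative_step(prefix, draft_model, target_model, gamma, rng):
    """Run one draft-verify round; returns prefix plus 1 to gamma+1 new tokens."""
    # Draft phase: gamma cheap autoregressive steps.
    drafts, q = [], []
    for _ in range(gamma):
        q_i = draft_model(prefix + drafts)
        drafts.append(sample(q_i, rng))
        q.append(q_i)

    # Verify phase: target distributions at all gamma+1 positions.
    p = [target_model(prefix + drafts[:i]) for i in range(gamma + 1)]

    # Accept/reject phase: stop at the first rejection.
    for i, x in enumerate(drafts):
        if rng.random() < min(1.0, p[i][x] / q[i][x]):
            continue  # accepted, move on to the next draft token
        residual = [max(0.0, p_x - q_x) for p_x, q_x in zip(p[i], q[i])]
        return prefix + drafts[:i] + [sample(normalize(residual), rng)]

    # Every draft accepted: take a bonus token from the last position.
    return prefix + drafts + [sample(p[gamma], rng)]

# Hypothetical toy models over a 3-token vocabulary (context is ignored).
def toy_target(tokens):
    return [0.5, 0.3, 0.2]

def toy_draft(tokens):
    return [0.2, 0.3, 0.5]

rng = random.Random(0)
trials = 20_000
counts = [0, 0, 0]
for _ in range(trials):
    out = speculative_step([], toy_draft, toy_target, gamma=3, rng=rng)
    counts[out[0]] += 1
# Frequencies of the first emitted token should land close to the toy target.
print([round(c / trials, 3) for c in counts])
```

<p>Note that the draft here is deliberately poor, yet the first emitted token still follows the target distribution; only the number of tokens gained per round suffers.</p>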
<h2 id="the-math-of-distribution-preservation">The Math of Distribution Preservation</h2>
<p>This is the part that makes speculative decoding remarkable. The output distribution is <em>exactly</em> the same as if you had run standard autoregressive decoding with the target model alone. Understanding why requires examining the rejection sampling mechanism.</p>
<p>Let $p(x)$ denote the target model&rsquo;s probability distribution and $q(x)$ denote the draft model&rsquo;s distribution. For a draft token $x'$, we accept it with probability:</p>
$$\alpha(x') = \min\left(1, \frac{p(x')}{q(x')}\right)$$<p>When rejected, we resample from the adjusted distribution:</p>
$$p'(x) = \text{normalize}\left(\max(0, p(x) - q(x))\right)$$<p>The key theorem is that this process produces samples from $p(x)$. Here&rsquo;s the proof:</p>
$$P(X = x') = P(\text{accepted}, X = x') + P(\text{rejected}, X = x')$$<p>For the accepted case, we sample $x'$ from $q$ and accept with probability $\min(1, p(x')/q(x'))$:</p>
$$P(\text{accepted}, X = x') = q(x') \cdot \min\left(1, \frac{p(x')}{q(x')}\right) = \min(q(x'), p(x'))$$<p>For the rejected case, we first reject the draft token, which happens with total probability $1 - \sum_x \min(p(x), q(x))$ (one minus the accepted mass summed over all tokens), then resample from $p'$:</p>
$$P(\text{rejected}, X = x') = \left(1 - \sum_x \min(p(x), q(x))\right) \cdot \frac{\max(0, p(x') - q(x'))}{\sum_x \max(0, p(x) - q(x))}$$<p>Because $\max(0, p(x) - q(x)) = p(x) - \min(p(x), q(x))$ and $\sum_x p(x) = 1$, the denominator equals $1 - \sum_x \min(p(x), q(x))$, which cancels the leading factor:</p>
$$P(\text{rejected}, X = x') = \max(0, p(x') - q(x')) = p(x') - \min(p(x'), q(x'))$$<p>Adding both cases:</p>
$$P(X = x') = \min(p(x'), q(x')) + p(x') - \min(p(x'), q(x')) = p(x')$$<p>This proof holds regardless of how good the draft model is. A poorly aligned draft simply increases the rejection rate (and shrinks the speedup) without corrupting the output distribution. The guarantee is unconditional.</p>
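<p>The identity in the proof above can also be checked with exact arithmetic. The sketch below uses Python&rsquo;s <code>fractions</code> module so there is no floating-point slack; the two draft distributions are made-up examples, one close to the target and one far from it.</p>

```python
from fractions import Fraction as F

def output_distribution(p, q):
    """Distribution of the emitted token: accepted mass plus resampled mass."""
    accept = [min(p_x, q_x) for p_x, q_x in zip(p, q)]
    reject_prob = 1 - sum(accept)
    residual = [max(F(0), p_x - q_x) for p_x, q_x in zip(p, q)]
    total = sum(residual)
    if total == 0:
        # p and q are identical, so nothing is ever rejected.
        return accept
    return [a + reject_prob * r / total for a, r in zip(accept, residual)]

p = [F(5, 10), F(3, 10), F(2, 10)]        # target
close_q = [F(4, 10), F(4, 10), F(2, 10)]  # hypothetical well-aligned draft
far_q = [F(1, 10), F(1, 10), F(8, 10)]    # hypothetical poorly aligned draft

print(output_distribution(p, close_q) == p)
print(output_distribution(p, far_q) == p)
```

<p>Both comparisons hold exactly, matching the claim that draft quality affects speed but never the output distribution.</p>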
<h2 id="building-intuition-for-rejection-sampling">Building Intuition for Rejection Sampling</h2>
<p>Let&rsquo;s build some intuition for why rejection sampling works.</p>
<p>Imagine two probability distributions over possible next tokens. The target distribution $p(x)$ represents what the large model actually wants to output. The draft distribution $q(x)$ represents the small model&rsquo;s best guess.</p>
<p>Picture these as two overlapping curves. Where they overlap—where the draft model agrees with the target—we can safely use the draft&rsquo;s samples. The acceptance probability $\min(1, p/q)$ ensures we never accept a token more often than the target model would generate it.</p>
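<p>A small numeric illustration of that last point, using made-up probabilities for three tokens: the mass accepted for each token, $q(x) \cdot \min(1, p(x)/q(x))$, never exceeds $p(x)$, and summing it over tokens gives the overall acceptance rate.</p>

```python
# Hypothetical next-token distributions, for illustration only.
target_p = {"the": 0.5, "a": 0.3, "cat": 0.2}
draft_q = {"the": 0.3, "a": 0.3, "cat": 0.4}

def accepted_mass(p, q):
    # Probability that the draft proposes x AND the proposal is accepted.
    return {x: q[x] * min(1.0, p[x] / q[x]) for x in p}

mass = accepted_mass(target_p, draft_q)
acceptance_rate = sum(mass.values())

for token, m in mass.items():
    print(f"{token:>4}: accepted mass {m:.2f} (target wants {target_p[token]:.2f})")
print(f"overall acceptance rate: {acceptance_rate:.2f}")
```

<p>The draft over-proposes &ldquo;cat&rdquo;, so half of those proposals are rejected, while the shortfall on &ldquo;the&rdquo; is exactly what the resampling step makes up.</p>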
<div class="dist-overlap-viz" id="dist-overlap-c7044145aa30a113d86e1eb358e648d1">
  <style>
    .dist-overlap-viz {
      --do-bg: #0d1117;
      --do-surface: #161b22;
      --do-border: #30363d;
      --do-text: #e6edf3;
      --do-text-muted: #8b949e;
      --do-target-blue: #58a6ff;
      --do-draft-orange: #d29922;
      --do-accept-green: #39d353;
      --do-reject-red: #f97583;
      --do-residual-purple: #a371f7;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--do-bg);
      color: var(--do-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

     
    [data-theme="light"] .dist-overlap-viz,
    :root:not([data-theme="dark"]) .dist-overlap-viz {
      --do-bg: #f8fafc;
      --do-surface: #ffffff;
      --do-border: #e2e8f0;
      --do-text: #1e293b;
      --do-text-muted: #64748b;
      --do-target-blue: #3b82f6;
      --do-draft-orange: #f59e0b;
      --do-accept-green: #10b981;
      --do-reject-red: #ef4444;
      --do-residual-purple: #8b5cf6;
    }

    .dist-overlap-viz * {
      box-sizing: border-box;
    }

    .do-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .do-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--do-target-blue);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .do-header p {
      color: var(--do-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .do-alpha-display {
      background: var(--do-surface);
      border: 1px solid var(--do-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
      margin-bottom: 1.25rem;
      display: flex;
      align-items: center;
      justify-content: center;
      gap: 1.5rem;
      flex-wrap: wrap;
    }

    .do-alpha-formula {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1rem;
      color: var(--do-text);
    }

    .do-alpha-formula .alpha-symbol {
      color: var(--do-accept-green);
      font-weight: 600;
    }

    .do-alpha-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 1.5rem;
      font-weight: 700;
      color: var(--do-accept-green);
      background: rgba(57, 211, 83, 0.1);
      padding: 0.3rem 0.8rem;
      border-radius: 6px;
      border: 1px solid rgba(57, 211, 83, 0.3);
      min-width: 80px;
      text-align: center;
    }

    .do-alpha-interpretation {
      font-size: 0.85rem;
      color: var(--do-text-muted);
    }

     
    .do-controls {
      background: var(--do-surface);
      border: 1px solid var(--do-border);
      border-radius: 10px;
      padding: 1.25rem;
      margin-bottom: 1.25rem;
    }

    .do-control-row {
      display: flex;
      align-items: center;
      gap: 1rem;
      margin-bottom: 1rem;
    }

    .do-control-row:last-child {
      margin-bottom: 0;
    }

    .do-control-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 600;
      color: var(--do-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.08em;
      min-width: 100px;
    }

    .do-slider-container {
      flex: 1;
      display: flex;
      align-items: center;
      gap: 0.75rem;
    }

    .do-slider {
      flex: 1;
      -webkit-appearance: none;
      appearance: none;
      height: 6px;
      border-radius: 3px;
      background: var(--do-border);
      outline: none;
    }

    .do-slider::-webkit-slider-thumb {
      -webkit-appearance: none;
      appearance: none;
      width: 18px;
      height: 18px;
      border-radius: 50%;
      background: var(--do-target-blue);
      cursor: pointer;
      border: 2px solid var(--do-bg);
      box-shadow: 0 2px 6px rgba(0,0,0,0.3);
      transition: transform 0.15s ease;
    }

    .do-slider::-webkit-slider-thumb:hover {
      transform: scale(1.15);
    }

    .do-slider::-moz-range-thumb {
      width: 18px;
      height: 18px;
      border-radius: 50%;
      background: var(--do-target-blue);
      cursor: pointer;
      border: 2px solid var(--do-bg);
      box-shadow: 0 2px 6px rgba(0,0,0,0.3);
    }

    .do-slider-labels {
      display: flex;
      justify-content: space-between;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      color: var(--do-text-muted);
      margin-top: 0.25rem;
      padding: 0 2px;
    }

     
    .do-presets {
      display: flex;
      gap: 0.5rem;
      flex-wrap: wrap;
    }

    .do-preset-btn {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 500;
      padding: 0.4rem 0.75rem;
      border: 1px solid var(--do-border);
      border-radius: 6px;
      background: var(--do-surface);
      color: var(--do-text);
      cursor: pointer;
      transition: all 0.2s ease;
    }

    .do-preset-btn:hover {
      border-color: var(--do-target-blue);
      background: rgba(88, 166, 255, 0.1);
    }

    .do-preset-btn.active {
      background: var(--do-target-blue);
      border-color: var(--do-target-blue);
      color: #0d1117;
    }

     
    .do-view-toggle {
      display: flex;
      gap: 0.5rem;
    }

    .do-view-btn {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 500;
      padding: 0.4rem 0.75rem;
      border: 1px solid var(--do-border);
      border-radius: 6px;
      background: var(--do-surface);
      color: var(--do-text);
      cursor: pointer;
      transition: all 0.2s ease;
    }

    .do-view-btn:hover {
      border-color: var(--do-target-blue);
    }

    .do-view-btn.active {
      background: var(--do-target-blue);
      border-color: var(--do-target-blue);
      color: #0d1117;
    }

     
    .do-charts-container {
      display: flex;
      gap: 1rem;
      margin-bottom: 1.25rem;
    }

    .do-charts-container.overlaid {
      display: block;
    }

    .do-chart-card {
      flex: 1;
      background: var(--do-surface);
      border: 1px solid var(--do-border);
      border-radius: 10px;
      padding: 1rem;
    }

    .do-charts-container.overlaid .do-chart-card {
      display: none;
    }

    .do-charts-container.overlaid .do-chart-card.overlaid-chart {
      display: block;
    }

    .do-chart-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--do-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin: 0 0 0.75rem 0;
      display: flex;
      align-items: center;
      gap: 0.5rem;
    }

    .do-chart-title .color-indicator {
      width: 10px;
      height: 10px;
      border-radius: 3px;
    }

    .do-chart-title .color-indicator.target {
      background: var(--do-target-blue);
    }

    .do-chart-title .color-indicator.draft {
      background: var(--do-draft-orange);
    }

     
    .do-token-row {
      display: flex;
      align-items: center;
      gap: 0.6rem;
      margin-bottom: 0.6rem;
      cursor: pointer;
      padding: 0.25rem;
      border-radius: 6px;
      transition: background 0.2s ease;
    }

    .do-token-row:last-child {
      margin-bottom: 0;
    }

    .do-token-row:hover {
      background: rgba(88, 166, 255, 0.05);
    }

    .do-token-row.highlighted {
      background: rgba(88, 166, 255, 0.1);
    }

    .do-token-label {
      font-family: 'IBM Plex Mono', monospace;
      font-weight: 600;
      font-size: 0.85rem;
      width: 3.5rem;
      color: var(--do-text);
      white-space: nowrap;
      overflow: hidden;
      text-overflow: ellipsis;
    }

    .do-bar-container {
      flex: 1;
      height: 24px;
      background: rgba(128,128,128,0.1);
      border-radius: 4px;
      position: relative;
      overflow: hidden;
    }

    .do-bar {
      height: 100%;
      border-radius: 4px;
      transition: width 0.4s ease;
      position: absolute;
      top: 0;
      left: 0;
    }

    .do-bar.target {
      background: linear-gradient(90deg, var(--do-target-blue), #79c0ff);
    }

    .do-bar.draft {
      background: linear-gradient(90deg, var(--do-draft-orange), #e3b341);
    }

     
    .do-bar-segment {
      height: 100%;
      position: absolute;
      top: 0;
      transition: all 0.4s ease;
    }

    .do-bar-segment.accept {
      background: var(--do-accept-green);
      left: 0;
      z-index: 3;
    }

    .do-bar-segment.overshoot {
      background: repeating-linear-gradient(
        45deg,
        var(--do-reject-red),
        var(--do-reject-red) 3px,
        rgba(249, 117, 131, 0.5) 3px,
        rgba(249, 117, 131, 0.5) 6px
      );
      z-index: 2;
    }

    .do-bar-segment.residual {
      background: var(--do-residual-purple);
      z-index: 1;
    }

    .do-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 500;
      width: 2.5rem;
      text-align: right;
      color: var(--do-text);
    }

     
    .do-overlaid-chart {
      display: none;
    }

    .do-charts-container.overlaid .do-overlaid-chart {
      display: block;
    }

    .do-overlaid-row {
      display: flex;
      align-items: center;
      gap: 0.6rem;
      margin-bottom: 0.75rem;
      padding: 0.25rem;
      border-radius: 6px;
      cursor: pointer;
      transition: background 0.2s ease;
    }

    .do-overlaid-row:hover {
      background: rgba(88, 166, 255, 0.05);
    }

    .do-overlaid-bar-container {
      flex: 1;
      height: 36px;
      background: rgba(128,128,128,0.08);
      border-radius: 6px;
      position: relative;
      overflow: hidden;
    }

    .do-overlaid-values {
      display: flex;
      flex-direction: column;
      width: 4rem;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 500;
      text-align: right;
    }

    .do-overlaid-values .p-val {
      color: var(--do-target-blue);
    }

    .do-overlaid-values .q-val {
      color: var(--do-draft-orange);
    }

     
    .do-tooltip {
      position: fixed;
      background: var(--do-bg);
      border: 1px solid var(--do-border);
      border-radius: 8px;
      padding: 0.75rem 1rem;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      z-index: 1000;
      pointer-events: none;
      opacity: 0;
      transition: opacity 0.2s ease;
      box-shadow: 0 4px 12px rgba(0,0,0,0.3);
      max-width: 220px;
    }

    .do-tooltip.visible {
      opacity: 1;
    }

    .do-tooltip-title {
      font-weight: 600;
      color: var(--do-text);
      margin-bottom: 0.4rem;
    }

    .do-tooltip-row {
      display: flex;
      justify-content: space-between;
      gap: 1rem;
      margin-bottom: 0.2rem;
    }

    .do-tooltip-row:last-child {
      margin-bottom: 0;
    }

    .do-tooltip-label {
      color: var(--do-text-muted);
    }

    .do-tooltip-value {
      font-weight: 500;
    }

    .do-tooltip-value.accept {
      color: var(--do-accept-green);
    }

     
    .do-legend {
      background: var(--do-surface);
      border: 1px solid var(--do-border);
      border-radius: 10px;
      padding: 1rem 1.25rem;
      display: flex;
      gap: 1.5rem;
      flex-wrap: wrap;
      justify-content: center;
    }

    .do-legend-item {
      display: flex;
      align-items: center;
      gap: 0.5rem;
      font-size: 0.75rem;
      color: var(--do-text-muted);
    }

    .do-legend-swatch {
      width: 14px;
      height: 14px;
      border-radius: 3px;
    }

    .do-legend-swatch.accept {
      background: var(--do-accept-green);
    }

    .do-legend-swatch.overshoot {
      background: repeating-linear-gradient(
        45deg,
        var(--do-reject-red),
        var(--do-reject-red) 2px,
        rgba(249, 117, 131, 0.5) 2px,
        rgba(249, 117, 131, 0.5) 4px
      );
    }

    .do-legend-swatch.residual {
      background: var(--do-residual-purple);
    }

    .do-legend-swatch.target {
      background: var(--do-target-blue);
    }

    .do-legend-swatch.draft {
      background: var(--do-draft-orange);
    }

     
    .do-learning-note {
      background: rgba(88, 166, 255, 0.08);
      border: 1px solid rgba(88, 166, 255, 0.2);
      border-radius: 8px;
      padding: 1rem;
      margin-top: 1.25rem;
    }

    .do-learning-note h5 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--do-target-blue);
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin: 0 0 0.4rem 0;
    }

    .do-learning-note p {
      font-size: 0.85rem;
      color: var(--do-text);
      line-height: 1.5;
      margin: 0;
    }

    .do-learning-note code {
      font-family: 'IBM Plex Mono', monospace;
      background: rgba(0,0,0,0.2);
      padding: 0.1em 0.3em;
      border-radius: 3px;
      font-size: 0.8em;
    }

    [data-theme="light"] .do-learning-note code,
    :root:not([data-theme="dark"]) .do-learning-note code {
      background: rgba(0,0,0,0.06);
    }

     
    @media (max-width: 600px) {
      .do-charts-container {
        flex-direction: column;
      }

      .do-alpha-display {
        flex-direction: column;
        gap: 0.75rem;
      }

      .do-control-row {
        flex-direction: column;
        align-items: flex-start;
        gap: 0.5rem;
      }

      .do-control-label {
        min-width: auto;
      }

      .do-slider-container {
        width: 100%;
      }

      .do-legend {
        gap: 0.75rem;
      }
    }
  </style>

  <div class="do-header">
    <h3>Distribution Overlap &amp; Acceptance</h3>
    <p>Visualizing how draft model alignment affects token acceptance probability</p>
  </div>

  
  <div class="do-alpha-display">
    <div class="do-alpha-formula">
      <span class="alpha-symbol">α</span> = Σ min(p(x), q(x))
    </div>
    <div class="do-alpha-value" id="alphaValue-c7044145aa30a113d86e1eb358e648d1">0.85</div>
    <div class="do-alpha-interpretation" id="alphaInterpret-c7044145aa30a113d86e1eb358e648d1">
      85% of draft tokens accepted on average
    </div>
  </div>

  
  <div class="do-controls">
    <div class="do-control-row">
      <span class="do-control-label">Alignment</span>
      <div class="do-slider-container">
        <input type="range" class="do-slider" id="alignSlider-c7044145aa30a113d86e1eb358e648d1"
               min="0" max="100" value="75">
        <div class="do-slider-labels">
          <span>Misaligned</span>
          <span>Identical</span>
        </div>
      </div>
    </div>

    <div class="do-control-row">
      <span class="do-control-label">Presets</span>
      <div class="do-presets">
        <button type="button" class="do-preset-btn" data-preset="identical">Identical (α=1.0)</button>
        <button type="button" class="do-preset-btn active" data-preset="well-matched">Well-matched (α≈0.85)</button>
        <button type="button" class="do-preset-btn" data-preset="poor">Poor draft (α≈0.4)</button>
      </div>
    </div>

    <div class="do-control-row">
      <span class="do-control-label">View</span>
      <div class="do-view-toggle">
        <button type="button" class="do-view-btn active" data-view="separate">Separate</button>
        <button type="button" class="do-view-btn" data-view="overlaid">Overlaid</button>
      </div>
    </div>
  </div>

  
  <div class="do-charts-container" id="chartsContainer-c7044145aa30a113d86e1eb358e648d1">
    
    <div class="do-chart-card">
      <h4 class="do-chart-title">
        <span class="color-indicator target"></span>
        Target p(x)
      </h4>
      <div id="targetChart-c7044145aa30a113d86e1eb358e648d1">
        
      </div>
    </div>

    
    <div class="do-chart-card">
      <h4 class="do-chart-title">
        <span class="color-indicator draft"></span>
        Draft q(x)
      </h4>
      <div id="draftChart-c7044145aa30a113d86e1eb358e648d1">
        
      </div>
    </div>

    
    <div class="do-chart-card overlaid-chart">
      <h4 class="do-chart-title">
        Overlaid Comparison
      </h4>
      <div id="overlaidChart-c7044145aa30a113d86e1eb358e648d1">
        
      </div>
    </div>
  </div>

  
  <div class="do-legend">
    <div class="do-legend-item">
      <div class="do-legend-swatch accept"></div>
      <span>Accept region: min(p, q)</span>
    </div>
    <div class="do-legend-item">
      <div class="do-legend-swatch overshoot"></div>
      <span>Overshoot: q > p (rejected)</span>
    </div>
    <div class="do-legend-item">
      <div class="do-legend-swatch residual"></div>
      <span>Residual: p > q (on reject)</span>
    </div>
  </div>

  
  <div class="do-learning-note">
    <h5>Key Insight</h5>
    <p>The <span style="color: var(--do-accept-green); font-weight: 600;">green overlap</span> shows probability mass that can be safely accepted from the draft model.
    When <code>q(x) > p(x)</code>, the draft "overshoots" and risks rejection.
    The <span style="color: var(--do-residual-purple); font-weight: 600;">purple residual</span> fills in when we reject, ensuring the output matches the target distribution exactly.</p>
  </div>

  
  <div class="do-tooltip" id="tooltip-c7044145aa30a113d86e1eb358e648d1">
    <div class="do-tooltip-title" id="tooltipTitle-c7044145aa30a113d86e1eb358e648d1">Token</div>
    <div class="do-tooltip-row">
      <span class="do-tooltip-label">p(x):</span>
      <span class="do-tooltip-value" id="tooltipP-c7044145aa30a113d86e1eb358e648d1">0.35</span>
    </div>
    <div class="do-tooltip-row">
      <span class="do-tooltip-label">q(x):</span>
      <span class="do-tooltip-value" id="tooltipQ-c7044145aa30a113d86e1eb358e648d1">0.30</span>
    </div>
    <div class="do-tooltip-row">
      <span class="do-tooltip-label">Accept prob:</span>
      <span class="do-tooltip-value accept" id="tooltipAccept-c7044145aa30a113d86e1eb358e648d1">min(1, p/q) = 1.0</span>
    </div>
  </div>

  <script>
  (function() {
    const uid = 'c7044145aa30a113d86e1eb358e648d1';

    
    const tokens = ['the', 'a', 'cat', 'dog', 'bird'];

    
    const targetP = {
      'the': 0.35,
      'a': 0.25,
      'cat': 0.20,
      'dog': 0.15,
      'bird': 0.05
    };

    
    const presets = {
      'identical': {
        'the': 0.35,
        'a': 0.25,
        'cat': 0.20,
        'dog': 0.15,
        'bird': 0.05
      },
      'well-matched': {
        'the': 0.30,
        'a': 0.28,
        'cat': 0.18,
        'dog': 0.14,
        'bird': 0.10
      },
      'poor': {
        'the': 0.10,
        'a': 0.10,
        'cat': 0.15,
        'dog': 0.25,
        'bird': 0.40
      }
    };

    
    let draftQ = { ...presets['well-matched'] };
    let currentView = 'separate';

    
    const elements = {
      alphaValue: document.getElementById(`alphaValue-${uid}`),
      alphaInterpret: document.getElementById(`alphaInterpret-${uid}`),
      alignSlider: document.getElementById(`alignSlider-${uid}`),
      chartsContainer: document.getElementById(`chartsContainer-${uid}`),
      targetChart: document.getElementById(`targetChart-${uid}`),
      draftChart: document.getElementById(`draftChart-${uid}`),
      overlaidChart: document.getElementById(`overlaidChart-${uid}`),
      tooltip: document.getElementById(`tooltip-${uid}`),
      tooltipTitle: document.getElementById(`tooltipTitle-${uid}`),
      tooltipP: document.getElementById(`tooltipP-${uid}`),
      tooltipQ: document.getElementById(`tooltipQ-${uid}`),
      tooltipAccept: document.getElementById(`tooltipAccept-${uid}`)
    };

    const container = document.getElementById(`dist-overlap-${uid}`);
    const presetBtns = container.querySelectorAll('.do-preset-btn');
    const viewBtns = container.querySelectorAll('.do-view-btn');

    
    function calculateAlpha() {
      let alpha = 0;
      tokens.forEach(token => {
        alpha += Math.min(targetP[token], draftQ[token]);
      });
      return alpha;
    }

    
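    function interpolateDistribution(alignment) {
      // Slider position 75 maps to the well-matched preset, matching the value
      // the preset buttons set and the 70-80 highlight range used below.
      const t = alignment / 100;
      const pivot = 0.75;

      tokens.forEach(token => {
        if (t <= pivot) {
          // Blend from the poor preset (t = 0) to the well-matched preset (t = pivot).
          const localT = t / pivot;
          draftQ[token] = presets['poor'][token] * (1 - localT) + presets['well-matched'][token] * localT;
        } else {
          // Blend from the well-matched preset (t = pivot) to the identical preset (t = 1).
          const localT = (t - pivot) / (1 - pivot);
          draftQ[token] = presets['well-matched'][token] * (1 - localT) + presets['identical'][token] * localT;
        }
      });
    }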

    
    function renderTargetChart() {
      let html = '';
      tokens.forEach(token => {
        const p = targetP[token];
        const q = draftQ[token];
        const acceptWidth = Math.min(p, q) * 100;
        const residualWidth = Math.max(0, p - q) * 100;
        const residualLeft = acceptWidth;

        html += `
          <div class="do-token-row" data-token="${token}">
            <span class="do-token-label">"${token}"</span>
            <div class="do-bar-container">
              <div class="do-bar-segment accept" style="width: ${acceptWidth}%"></div>
              <div class="do-bar-segment residual" style="left: ${residualLeft}%; width: ${residualWidth}%"></div>
            </div>
            <span class="do-value">${p.toFixed(2)}</span>
          </div>
        `;
      });
      elements.targetChart.innerHTML = html;
    }

    
    function renderDraftChart() {
      let html = '';
      tokens.forEach(token => {
        const p = targetP[token];
        const q = draftQ[token];
        const acceptWidth = Math.min(p, q) * 100;
        const overshootWidth = Math.max(0, q - p) * 100;
        const overshootLeft = acceptWidth;

        html += `
          <div class="do-token-row" data-token="${token}">
            <span class="do-token-label">"${token}"</span>
            <div class="do-bar-container">
              <div class="do-bar-segment accept" style="width: ${acceptWidth}%"></div>
              <div class="do-bar-segment overshoot" style="left: ${overshootLeft}%; width: ${overshootWidth}%"></div>
            </div>
            <span class="do-value">${q.toFixed(2)}</span>
          </div>
        `;
      });
      elements.draftChart.innerHTML = html;
    }

    
    function renderOverlaidChart() {
      let html = '';
      tokens.forEach(token => {
        const p = targetP[token];
        const q = draftQ[token];
        const acceptWidth = Math.min(p, q) * 100;
        const overshootWidth = Math.max(0, q - p) * 100;
        const residualWidth = Math.max(0, p - q) * 100;
        const overshootLeft = acceptWidth;
        const residualLeft = acceptWidth;

        html += `
          <div class="do-overlaid-row" data-token="${token}">
            <span class="do-token-label">"${token}"</span>
            <div class="do-overlaid-bar-container">
              <div class="do-bar-segment accept" style="width: ${acceptWidth}%"></div>
              <div class="do-bar-segment overshoot" style="left: ${overshootLeft}%; width: ${overshootWidth}%"></div>
              <div class="do-bar-segment residual" style="left: ${residualLeft}%; width: ${residualWidth}%"></div>
            </div>
            <div class="do-overlaid-values">
              <span class="p-val">p=${p.toFixed(2)}</span>
              <span class="q-val">q=${q.toFixed(2)}</span>
            </div>
          </div>
        `;
      });
      elements.overlaidChart.innerHTML = html;
    }

    
    function updateAlphaDisplay() {
      const alpha = calculateAlpha();
      elements.alphaValue.textContent = alpha.toFixed(2);
      const pct = Math.round(alpha * 100);
      elements.alphaInterpret.textContent = `${pct}% of draft tokens accepted on average`;

      
      if (alpha >= 0.8) {
        elements.alphaValue.style.color = 'var(--do-accept-green)';
        elements.alphaValue.style.borderColor = 'rgba(57, 211, 83, 0.3)';
        elements.alphaValue.style.background = 'rgba(57, 211, 83, 0.1)';
      } else if (alpha >= 0.5) {
        elements.alphaValue.style.color = 'var(--do-draft-orange)';
        elements.alphaValue.style.borderColor = 'rgba(210, 153, 34, 0.3)';
        elements.alphaValue.style.background = 'rgba(210, 153, 34, 0.1)';
      } else {
        elements.alphaValue.style.color = 'var(--do-reject-red)';
        elements.alphaValue.style.borderColor = 'rgba(249, 117, 131, 0.3)';
        elements.alphaValue.style.background = 'rgba(249, 117, 131, 0.1)';
      }
    }

    
    function updateCharts() {
      renderTargetChart();
      renderDraftChart();
      renderOverlaidChart();
      updateAlphaDisplay();
      attachHoverListeners();
    }

    
    function showTooltip(token, event) {
      const p = targetP[token];
      const q = draftQ[token];
      const acceptProb = q > 0 ? Math.min(1, p / q) : 1;

      elements.tooltipTitle.textContent = `"${token}"`;
      elements.tooltipP.textContent = p.toFixed(3);
      elements.tooltipQ.textContent = q.toFixed(3);
      elements.tooltipAccept.textContent = `min(1, ${p.toFixed(2)}/${q.toFixed(2)}) = ${acceptProb.toFixed(2)}`;

      const rect = event.target.closest('.do-token-row, .do-overlaid-row').getBoundingClientRect();
      elements.tooltip.style.left = `${rect.right + 10}px`;
      elements.tooltip.style.top = `${rect.top}px`;
      elements.tooltip.classList.add('visible');
    }

    
    function hideTooltip() {
      elements.tooltip.classList.remove('visible');
    }

    
    function attachHoverListeners() {
      const rows = container.querySelectorAll('.do-token-row, .do-overlaid-row');
      rows.forEach(row => {
        const token = row.dataset.token;
        row.addEventListener('mouseenter', (e) => showTooltip(token, e));
        row.addEventListener('mouseleave', hideTooltip);
      });
    }

    
    elements.alignSlider.addEventListener('input', function() {
      const alignment = parseInt(this.value);
      interpolateDistribution(alignment);
      updateCharts();

      
      presetBtns.forEach(btn => btn.classList.remove('active'));
      if (alignment === 100) {
        container.querySelector('[data-preset="identical"]').classList.add('active');
      } else if (alignment >= 70 && alignment <= 80) {
        container.querySelector('[data-preset="well-matched"]').classList.add('active');
      } else if (alignment <= 10) {
        container.querySelector('[data-preset="poor"]').classList.add('active');
      }
    });

    
    presetBtns.forEach(btn => {
      btn.addEventListener('click', function() {
        const preset = this.dataset.preset;
        draftQ = { ...presets[preset] };

        
        if (preset === 'identical') {
          elements.alignSlider.value = 100;
        } else if (preset === 'well-matched') {
          elements.alignSlider.value = 75;
        } else if (preset === 'poor') {
          elements.alignSlider.value = 0;
        }

        
        presetBtns.forEach(b => b.classList.remove('active'));
        this.classList.add('active');

        updateCharts();
      });
    });

    
    viewBtns.forEach(btn => {
      btn.addEventListener('click', function() {
        currentView = this.dataset.view;

        viewBtns.forEach(b => b.classList.remove('active'));
        this.classList.add('active');

        if (currentView === 'overlaid') {
          elements.chartsContainer.classList.add('overlaid');
        } else {
          elements.chartsContainer.classList.remove('overlaid');
        }
      });
    });

    
    updateCharts();
  })();
  </script>
</div>

<p>But what about the probability mass where $p(x) > q(x)$? These are tokens the target model likes more than the draft model expected. If we only accepted, we&rsquo;d undersample these tokens. The resampling step corrects for this: when we reject, we draw from exactly this &ldquo;missing&rdquo; probability mass.</p>
<p>The total acceptance rate, the probability that we accept a draft token, equals the overlap between the two distributions:</p>
$$\alpha = \sum_x \min(p(x), q(x))$$<p>This quantity has a nice interpretation: it&rsquo;s 1 minus the total variation distance between $p$ and $q$, or equivalently 1 minus half their $L_1$ distance. When the distributions are identical, $\alpha = 1$ and we always accept. When they&rsquo;re completely disjoint, $\alpha = 0$ and we always reject.</p>
<p>In practice, well-matched draft-target pairs achieve $\alpha \approx 0.6$ to $0.8$, while architecturally integrated solutions like GLM-4.7&rsquo;s native MTP exceed 0.9.</p>
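<p>As a quick numerical check, here is a minimal sketch that computes the overlap and compares it with one minus the total variation distance. It reuses the toy five-token distributions from the interactive demo above; the function names <code>acceptance_rate</code> and <code>total_variation</code> are illustrative, not part of any library.</p>

```python
def acceptance_rate(p, q):
    """Overlap between two distributions: sum over tokens of min(p(x), q(x))."""
    return sum(min(p[token], q.get(token, 0.0)) for token in p)


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    tokens = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tokens)


# Toy distributions taken from the demo widget (not real model outputs).
target = {"the": 0.35, "a": 0.25, "cat": 0.20, "dog": 0.15, "bird": 0.05}
well_matched = {"the": 0.30, "a": 0.28, "cat": 0.18, "dog": 0.14, "bird": 0.10}
poor = {"the": 0.10, "a": 0.10, "cat": 0.15, "dog": 0.25, "bird": 0.40}

for name, draft in [("identical", target), ("well-matched", well_matched), ("poor", poor)]:
    alpha = acceptance_rate(target, draft)
    tv = total_variation(target, draft)
    print(f"{name}: alpha={alpha:.2f}, 1 - TV={1 - tv:.2f}")
```

<p>The two columns agree for every pair: 1.00 for the identical draft, 0.92 for the well-matched one, and 0.55 for the poor one.</p>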
<h2 id="a-concrete-walkthrough-of-rejection-sampling">A Concrete Walkthrough of Rejection Sampling</h2>
<p>Let&rsquo;s ground the mathematics in concrete examples to build deeper intuition for how the algorithm actually works.</p>
<h3 id="the-sequential-verification-problem">The Sequential Verification Problem</h3>
<p>When the draft model generates K tokens, each token is conditioned on the previous ones:</p>
$$x_1 \sim q(\cdot)$$<p>
</p>
$$x_2 \sim q(\cdot|x_1)$$<p>
</p>
$$x_3 \sim q(\cdot|x_1,x_2)$$<p>The target model verifies by computing in parallel:</p>
$$p(x_1), \quad p(x_2|x_1), \quad p(x_3|x_1,x_2), \quad \ldots$$<p><strong>The catch: if you reject $x_2$, then $x_3$ was generated from the wrong context.</strong></p>
<p>The draft model generated $x_3$ assuming $x_2$ was correct. But if you reject $x_2$ and resample a different token $x_2'$, then $x_3$ is now invalid, i.e., it was conditioned on a token that no longer exists in the sequence.</p>
<p><strong>Concrete Example:</strong></p>
<pre tabindex="0"><code>Draft generates:  &#34;The cat sat on the [mat]&#34;
                                        ↑ rejected, resample → &#34;rug&#34;

Draft&#39;s x₆ was:   &#34;mat&#34; → next token &#34;.&#34; (conditioned on &#34;mat&#34;)
But now we have:  &#34;rug&#34; → we can&#39;t use &#34;.&#34; anymore!
</code></pre><p>The token after &ldquo;mat&rdquo; might have been &ldquo;.&rdquo; with high probability, but the token after &ldquo;rug&rdquo; might be &ldquo;was&rdquo; or something entirely different. You must discard everything after the rejection point and let the target model generate the next token fresh.</p>
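<p>The discard-after-rejection rule can be sketched as a short verification loop. This is a simplified illustration, not a production implementation: <code>verify_draft</code> and <code>residual</code> are hypothetical helper names, distributions are plain dicts rather than model outputs, and the sketch leaves out what happens after every draft token has been accepted.</p>

```python
import random


def residual(p, q):
    """Normalized max(0, p - q): the distribution to resample from on rejection.

    Only called after a rejection, which implies p and q differ, so z is positive.
    """
    leftover = {tok: max(0.0, p[tok] - q.get(tok, 0.0)) for tok in p}
    z = sum(leftover.values())
    return {tok: mass / z for tok, mass in leftover.items()}


def verify_draft(draft_tokens, p_dists, q_dists, rng):
    """Accept draft tokens left to right and stop at the first rejection.

    p_dists[i] and q_dists[i] are the target and draft distributions
    (dicts mapping token to probability) at position i, given the draft prefix.
    """
    output = []
    for tok, p, q in zip(draft_tokens, p_dists, q_dists):
        accept_prob = min(1.0, p.get(tok, 0.0) / q[tok])
        if rng.random() < accept_prob:
            output.append(tok)
            continue
        # Rejected: replace this token, then drop every later draft token,
        # because they were conditioned on the token we just threw away.
        res = residual(p, q)
        candidates = list(res)
        weights = [res[t] for t in candidates]
        output.append(rng.choices(candidates, weights=weights, k=1)[0])
        break
    return output


# Toy numbers chosen so that "mat" is always rejected (target gives it zero mass).
rng = random.Random(0)
draft = ["mat", "."]
q_dists = [{"mat": 0.9, "rug": 0.1}, {".": 0.8, "was": 0.2}]
p_dists = [{"mat": 0.0, "rug": 1.0}, {".": 0.8, "was": 0.2}]
print(verify_draft(draft, p_dists, q_dists, rng))  # ['rug']
```

<p>Even though the draft&rsquo;s &ldquo;.&rdquo; would have passed its own check, it never reaches the output: the loop breaks as soon as &ldquo;mat&rdquo; is replaced.</p>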
<h3 id="what-px-and-qx-actually-mean">What p(x) and q(x) Actually Mean</h3>
<p>The notation can obscure what&rsquo;s happening. Let&rsquo;s be concrete.</p>
<p>$x_1$ is a specific token that was sampled—say, the token &ldquo;cat&rdquo; (token ID 9846 in the vocabulary).</p>
<p>$p(x_1)$ is the scalar probability that the target model assigned to that exact token:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Target model forward pass</span>
</span></span><span style="display:flex;"><span>logits <span style="color:#f92672">=</span> target_model(prompt)        <span style="color:#75715e"># shape: [vocab_size]</span>
</span></span><span style="display:flex;"><span>probs <span style="color:#f92672">=</span> softmax(logits)              <span style="color:#75715e"># shape: [vocab_size]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>p_x1 <span style="color:#f92672">=</span> probs[<span style="color:#ae81ff">9846</span>]                   <span style="color:#75715e"># scalar: 0.073</span>
</span></span></code></pre></div><p>Similarly, $q(x_1)$ is what the draft model assigned to that same token:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Draft model forward pass</span>
</span></span><span style="display:flex;"><span>logits <span style="color:#f92672">=</span> draft_model(prompt)         <span style="color:#75715e"># shape: [vocab_size]</span>
</span></span><span style="display:flex;"><span>probs <span style="color:#f92672">=</span> softmax(logits)              <span style="color:#75715e"># shape: [vocab_size]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>q_x1 <span style="color:#f92672">=</span> probs[<span style="color:#ae81ff">9846</span>]                   <span style="color:#75715e"># scalar: 0.051</span>
</span></span></code></pre></div><p>The acceptance check compares these two scalars:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>ratio <span style="color:#f92672">=</span> p_x1 <span style="color:#f92672">/</span> q_x1                  <span style="color:#75715e"># 0.073 / 0.051 = 1.43</span>
</span></span><span style="display:flex;"><span>acceptance_prob <span style="color:#f92672">=</span> min(<span style="color:#ae81ff">1</span>, ratio)      <span style="color:#75715e"># min(1, 1.43) = 1.0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>u <span style="color:#f92672">=</span> random<span style="color:#f92672">.</span>uniform(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>)             <span style="color:#75715e"># say, 0.67</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> u <span style="color:#f92672">&lt;</span> acceptance_prob:              <span style="color:#75715e"># 0.67 &lt; 1.0 → True</span>
</span></span><span style="display:flex;"><span>    accept()
</span></span></code></pre></div><p>The ratio $p(x)/q(x)$ asks: &ldquo;Did the draft model over- or under-estimate this token?&rdquo;</p>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Ratio</th>
          <th>Accept Prob</th>
          <th>Meaning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>p=0.30, q=0.10</td>
          <td>3.0</td>
          <td>1.0 (capped)</td>
          <td>Draft underestimated—always accept</td>
      </tr>
      <tr>
          <td>p=0.10, q=0.10</td>
          <td>1.0</td>
          <td>1.0</td>
          <td>Perfect agreement—always accept</td>
      </tr>
      <tr>
          <td>p=0.05, q=0.10</td>
          <td>0.5</td>
          <td>0.5</td>
          <td>Draft overestimated—accept 50%</td>
      </tr>
      <tr>
          <td>p=0.01, q=0.10</td>
          <td>0.1</td>
          <td>0.1</td>
          <td>Draft way overconfident—accept 10%</td>
      </tr>
  </tbody>
</table>
<p>When the draft model is overconfident about a token ($q > p$), you reject proportionally to correct the bias. When the draft is underconfident ($q < p$), you always accept—the residual distribution handles the gap.</p>
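<p>The four scenarios in the table can be reproduced with a one-line helper. This is only a sketch; <code>acceptance_probability</code> is an illustrative name, and it assumes the token was actually sampled by the draft, so $q(x)$ is positive.</p>

```python
def acceptance_probability(p_x, q_x):
    """min(1, p/q) for a token the draft sampled (so q_x is positive)."""
    return min(1.0, p_x / q_x)


# (p, q) pairs from the table above.
scenarios = [(0.30, 0.10), (0.10, 0.10), (0.05, 0.10), (0.01, 0.10)]
for p_x, q_x in scenarios:
    ratio = p_x / q_x
    accept = acceptance_probability(p_x, q_x)
    print(f"p={p_x:.2f}, q={q_x:.2f}: ratio={ratio:.1f}, accept={accept:.1f}")
```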
<h3 id="why-resample-from-residual-not-just-p">Why Resample from Residual, Not Just p?</h3>
<p>When you reject a drafted token, you need to pick a new token. The naive answer is: &ldquo;We want the output to follow $p$, so just sample from $p$.&rdquo;</p>
<p>This is wrong. Let me show you why with a concrete example.</p>
<p><strong>Two-token vocabulary: A and B</strong></p>
<pre tabindex="0"><code>Target:  p(A) = 0.7,  p(B) = 0.3
Draft:   q(A) = 0.4,  q(B) = 0.6
</code></pre><p><strong>Tracing through the algorithm:</strong></p>
<p>Step 1: Sample from draft $q$</p>
<ul>
<li>40% chance we draft A</li>
<li>60% chance we draft B</li>
</ul>
<p>Step 2: Accept/reject check</p>
<p>If we drafted A:
</p>
$$\text{accept prob} = \min\left(1, \frac{p(A)}{q(A)}\right) = \min\left(1, \frac{0.7}{0.4}\right) = \min(1, 1.75) = 1.0$$<p>A is always accepted when drafted.</p>
<p>If we drafted B:
</p>
$$\text{accept prob} = \min\left(1, \frac{p(B)}{q(B)}\right) = \min\left(1, \frac{0.3}{0.6}\right) = \min(1, 0.5) = 0.5$$<p>B is accepted 50% of the time when drafted.</p>
<p><strong>Calculating the probabilities:</strong></p>
$$P(\text{accept A}) = q(A) \times 1.0 = 0.4$$<p>
</p>
$$P(\text{accept B}) = q(B) \times 0.5 = 0.3$$<p>
</p>
$$P(\text{reject}) = 1 - 0.4 - 0.3 = 0.3$$<p><strong>The problem with resampling from p:</strong></p>
<p>If on rejection we resample from $p$:</p>
$$P(\text{output}=A) = P(\text{accept A}) + P(\text{reject}) \times p(A)$$<p>
</p>
$$= 0.4 + 0.3 \times 0.7 = 0.4 + 0.21 = 0.61$$<p>This is <strong>wrong</strong>—should be 0.7!</p>
$$P(\text{output}=B) = P(\text{accept B}) + P(\text{reject}) \times p(B)$$<p>
</p>
$$= 0.3 + 0.3 \times 0.3 = 0.39$$<p>This is <strong>wrong</strong>—should be 0.3!</p>
<p><strong>The fix: residual distribution</strong></p>
<p>The residual starts from the leftover mass $\max(0, p(x) - q(x))$ for each token, before normalization:</p>
$$\max(0, p(A) - q(A)) = \max(0, 0.7 - 0.4) = 0.3$$<p>
</p>
$$\max(0, p(B) - q(B)) = \max(0, 0.3 - 0.6) = 0.0$$<p>Normalized: $p'(A) = 1.0$, $p'(B) = 0.0$</p>
<p>Now:</p>
$$P(\text{output}=A) = P(\text{accept A}) + P(\text{reject}) \times p'(A)$$<p>
</p>
$$= 0.4 + 0.3 \times 1.0 = 0.7 \checkmark$$<p>
</p>
$$P(\text{output}=B) = P(\text{accept B}) + P(\text{reject}) \times p'(B)$$<p>
</p>
$$= 0.3 + 0.3 \times 0.0 = 0.3 \checkmark$$<h3 id="the-probability-budget-intuition">The Probability Budget Intuition</h3>
<p>Think of it as a budget you need to fill for each token:</p>
<table>
  <thead>
      <tr>
          <th>Token</th>
          <th>Target p(x)</th>
          <th>Covered by Accept Phase</th>
          <th>Still Needed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>A</td>
          <td>0.7</td>
          <td>min(0.7, 0.4) = 0.4</td>
          <td>0.7 - 0.4 = 0.3</td>
      </tr>
      <tr>
          <td>B</td>
          <td>0.3</td>
          <td>min(0.3, 0.6) = 0.3</td>
          <td>0.3 - 0.3 = 0.0</td>
      </tr>
  </tbody>
</table>
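<p>The budget can also be checked exactly, with no sampling involved. The sketch below computes the distribution of the emitted token for a single accept/reject step under both fallback strategies, using the two-token example; <code>output_distribution</code> is an illustrative name and the code assumes $p$ and $q$ share the same vocabulary.</p>

```python
def output_distribution(p, q, resample_from_residual=True):
    """Exact distribution of the emitted token for one accept/reject step."""
    # Probability of drafting x and accepting it: q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = {x: min(p[x], q[x]) for x in p}
    reject_prob = 1.0 - sum(accepted.values())
    if resample_from_residual:
        leftover = {x: max(0.0, p[x] - q[x]) for x in p}
        z = sum(leftover.values())
        fallback = {x: (leftover[x] / z if z > 0 else 0.0) for x in p}
    else:
        # The naive strategy: on rejection, sample from the target p directly.
        fallback = dict(p)
    return {x: accepted[x] + reject_prob * fallback[x] for x in p}


p = {"A": 0.7, "B": 0.3}
q = {"A": 0.4, "B": 0.6}

naive = output_distribution(p, q, resample_from_residual=False)
correct = output_distribution(p, q, resample_from_residual=True)
print({x: round(v, 2) for x, v in naive.items()})    # {'A': 0.61, 'B': 0.39}
print({x: round(v, 2) for x, v in correct.items()})  # {'A': 0.7, 'B': 0.3}
```

<p>Resampling from $p$ double-counts the mass that the accept phase already covered, which is why B ends up at 0.39 instead of 0.3. Resampling from the residual fills only what is still missing.</p>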
<p>The accept phase already &ldquo;spent&rdquo; $\min(p,q)$ probability on each token. The residual distribution captures exactly what&rsquo;s left to fill, where $Z = \sum_x \max(0, p(x) - q(x)) = 1 - \alpha$ is the total rejection probability:</p>
$$p'(x) = \frac{\max(0, p(x) - q(x))}{Z} = \frac{\text{what we still need}}{\text{total rejection probability}}$$<div class="probability-budget-viz" id="prob-budget-c7044145aa30a113d86e1eb358e648d1">
  <style>
    .probability-budget-viz {
      --pb-bg: #0d1117;
      --pb-surface: #161b22;
      --pb-border: #30363d;
      --pb-text: #e6edf3;
      --pb-text-muted: #8b949e;
      --pb-accent-teal: #58a6ff;
      --pb-accent-cyan: #39d353;
      --pb-accent-coral: #f97583;
      --pb-accent-amber: #d29922;
      --pb-target-outline: #8b949e;

      font-family: 'IBM Plex Sans', -apple-system, BlinkMacSystemFont, sans-serif;
      background: var(--pb-bg);
      color: var(--pb-text);
      line-height: 1.6;
      padding: 1.5rem;
      border-radius: 12px;
      margin: 2rem 0;
    }

     
    [data-theme="light"] .probability-budget-viz,
    :root:not([data-theme="dark"]) .probability-budget-viz {
      --pb-bg: #f8fafc;
      --pb-surface: #ffffff;
      --pb-border: #e2e8f0;
      --pb-text: #1e293b;
      --pb-text-muted: #64748b;
      --pb-accent-teal: #3b82f6;
      --pb-accent-cyan: #10b981;
      --pb-accent-coral: #ef4444;
      --pb-accent-amber: #f59e0b;
      --pb-target-outline: #94a3b8;
    }

    .probability-budget-viz * {
      box-sizing: border-box;
    }

    .pb-header {
      text-align: center;
      margin-bottom: 1.5rem;
    }

    .pb-header h3 {
      font-family: 'IBM Plex Mono', 'SF Mono', Monaco, monospace;
      font-size: 0.85rem;
      font-weight: 600;
      color: var(--pb-accent-teal);
      letter-spacing: 0.08em;
      text-transform: uppercase;
      margin: 0 0 0.4rem 0;
    }

    .pb-header p {
      color: var(--pb-text-muted);
      font-size: 0.9rem;
      margin: 0;
    }

     
    .pb-distributions {
      display: grid;
      grid-template-columns: 1fr 1fr;
      gap: 1rem;
      margin-bottom: 1.5rem;
    }

    .pb-dist-card {
      background: var(--pb-surface);
      border: 1px solid var(--pb-border);
      border-radius: 8px;
      padding: 1rem;
    }

    .pb-dist-card h4 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--pb-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin: 0 0 0.75rem 0;
    }

    .pb-dist-row {
      display: flex;
      align-items: center;
      gap: 0.6rem;
      margin-bottom: 0.4rem;
    }

    .pb-dist-row:last-child {
      margin-bottom: 0;
    }

    .pb-token-label {
      font-family: 'IBM Plex Mono', monospace;
      font-weight: 600;
      font-size: 0.9rem;
      width: 1.25rem;
      color: var(--pb-text);
    }

    .pb-dist-bar-container {
      flex: 1;
      height: 20px;
      background: rgba(128,128,128,0.1);
      border-radius: 4px;
      position: relative;
      overflow: hidden;
    }

    .pb-dist-bar {
      height: 100%;
      border-radius: 4px;
      transition: width 0.5s ease;
    }

    .pb-dist-bar.target {
      background: linear-gradient(90deg, var(--pb-accent-teal), #79c0ff);
    }

    .pb-dist-bar.draft {
      background: linear-gradient(90deg, var(--pb-accent-amber), #e3b341);
    }

    .pb-dist-value {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.8rem;
      font-weight: 500;
      width: 2.5rem;
      text-align: right;
      color: var(--pb-text);
    }

     
    .pb-controls {
      display: flex;
      gap: 0.6rem;
      justify-content: center;
      margin-bottom: 1.5rem;
      flex-wrap: wrap;
    }

    .pb-control-btn {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 500;
      padding: 0.5rem 1rem;
      border: 1px solid var(--pb-border);
      border-radius: 6px;
      background: var(--pb-surface);
      color: var(--pb-text);
      cursor: pointer;
      transition: all 0.2s ease;
    }

    .pb-control-btn:hover {
      border-color: var(--pb-accent-teal);
      background: rgba(88, 166, 255, 0.1);
    }

    .pb-control-btn:disabled {
      opacity: 0.4;
      cursor: not-allowed;
    }

    .pb-control-btn.primary {
      background: var(--pb-accent-teal);
      border-color: var(--pb-accent-teal);
      color: #0d1117;
    }

    .pb-control-btn.primary:hover:not(:disabled) {
      background: #79c0ff;
    }

     
    .pb-budget-section {
      background: var(--pb-surface);
      border: 1px solid var(--pb-border);
      border-radius: 10px;
      padding: 1.25rem;
      margin-bottom: 1.25rem;
    }

    .pb-section-header {
      display: flex;
      align-items: center;
      justify-content: space-between;
      margin-bottom: 1.25rem;
    }

    .pb-section-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 600;
      color: var(--pb-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.08em;
    }

    .pb-step-indicator {
      display: flex;
      gap: 0.4rem;
    }

    .pb-step-dot {
      width: 8px;
      height: 8px;
      border-radius: 50%;
      background: var(--pb-border);
      transition: all 0.3s ease;
    }

    .pb-step-dot.active {
      background: var(--pb-accent-teal);
      box-shadow: 0 0 8px var(--pb-accent-teal);
    }

    .pb-step-dot.completed {
      background: var(--pb-accent-cyan);
    }

     
    .pb-budget-row {
      margin-bottom: 1.5rem;
    }

    .pb-budget-row:last-of-type {
      margin-bottom: 0.75rem;
    }

    .pb-budget-label-row {
      display: flex;
      align-items: center;
      justify-content: space-between;
      margin-bottom: 0.4rem;
    }

    .pb-budget-token {
      font-family: 'IBM Plex Mono', monospace;
      font-weight: 600;
      font-size: 1rem;
    }

    .pb-budget-target-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      color: var(--pb-text-muted);
    }

    .pb-budget-bar-wrapper {
      position: relative;
      height: 36px;
    }

    .pb-budget-target-outline {
      position: absolute;
      top: 0;
      left: 0;
      height: 100%;
      border: 2px dashed var(--pb-target-outline);
      border-radius: 6px;
      transition: width 0.5s ease;
      opacity: 0.6;
    }

    .pb-budget-bar-inner {
      position: absolute;
      top: 0;
      left: 0;
      height: 100%;
      display: flex;
      border-radius: 6px;
      overflow: hidden;
    }

    .pb-budget-segment {
      height: 100%;
      transition: width 0.5s cubic-bezier(0.4, 0, 0.2, 1);
      position: relative;
      display: flex;
      align-items: center;
      justify-content: center;
    }

    .pb-budget-segment.accept {
      background: linear-gradient(135deg, var(--pb-accent-cyan), #2ea043);
    }

    .pb-budget-segment.residual {
      background: linear-gradient(135deg, var(--pb-accent-coral), #ff7b72);
    }

    .pb-segment-label {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: #fff;
      text-shadow: 0 1px 2px rgba(0,0,0,0.3);
      white-space: nowrap;
      opacity: 0;
      transition: opacity 0.3s ease;
    }

    .pb-budget-segment.show-label .pb-segment-label {
      opacity: 1;
    }

     
    .pb-budget-annotations {
      display: flex;
      gap: 0.4rem;
      margin-top: 0.4rem;
      flex-wrap: wrap;
    }

    .pb-annotation {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.65rem;
      padding: 0.2rem 0.4rem;
      border-radius: 4px;
      opacity: 0;
      transform: translateY(-4px);
      transition: all 0.4s ease;
    }

    .pb-annotation.visible {
      opacity: 1;
      transform: translateY(0);
    }

    .pb-annotation.accept {
      background: rgba(57, 211, 83, 0.15);
      color: var(--pb-accent-cyan);
      border: 1px solid rgba(57, 211, 83, 0.3);
    }

    .pb-annotation.residual {
      background: rgba(249, 117, 131, 0.15);
      color: var(--pb-accent-coral);
      border: 1px solid rgba(249, 117, 131, 0.3);
    }

     
    .pb-legend {
      display: flex;
      gap: 1.25rem;
      padding-top: 0.75rem;
      border-top: 1px solid var(--pb-border);
      margin-top: 0.5rem;
      flex-wrap: wrap;
    }

    .pb-legend-item {
      display: flex;
      align-items: center;
      gap: 0.4rem;
      font-size: 0.75rem;
      color: var(--pb-text-muted);
    }

    .pb-legend-swatch {
      width: 12px;
      height: 12px;
      border-radius: 3px;
    }

    .pb-legend-swatch.accept {
      background: linear-gradient(135deg, var(--pb-accent-cyan), #2ea043);
    }

    .pb-legend-swatch.residual {
      background: linear-gradient(135deg, var(--pb-accent-coral), #ff7b72);
    }

    .pb-legend-swatch.target {
      border: 2px dashed var(--pb-target-outline);
      background: transparent;
    }

     
    .pb-calc-panel {
      background: var(--pb-surface);
      border: 1px solid var(--pb-border);
      border-radius: 10px;
      padding: 1.25rem;
    }

    .pb-calc-title {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.75rem;
      font-weight: 600;
      color: var(--pb-text-muted);
      text-transform: uppercase;
      letter-spacing: 0.08em;
      margin-bottom: 1rem;
    }

    .pb-calc-row {
      display: flex;
      align-items: center;
      gap: 0.5rem;
      margin-bottom: 0.5rem;
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.8rem;
      opacity: 0;
      transform: translateX(-8px);
      transition: all 0.4s ease;
    }

    .pb-calc-row.visible {
      opacity: 1;
      transform: translateX(0);
    }

    .pb-calc-row.highlight {
      color: var(--pb-accent-cyan);
    }

    .pb-calc-row.error {
      color: var(--pb-accent-coral);
    }

    .pb-calc-label {
      color: var(--pb-text-muted);
      min-width: 110px;
    }

    .pb-calc-equals {
      color: var(--pb-text-muted);
    }

    .pb-calc-value {
      font-weight: 500;
    }

    .pb-checkmark {
      color: var(--pb-accent-cyan);
      font-weight: bold;
    }

    .pb-crossmark {
      color: var(--pb-accent-coral);
      font-weight: bold;
    }

     
    .pb-insight-box {
      background: rgba(88, 166, 255, 0.08);
      border: 1px solid rgba(88, 166, 255, 0.2);
      border-radius: 8px;
      padding: 1rem;
      margin-top: 1.25rem;
      opacity: 0;
      transform: translateY(8px);
      transition: all 0.5s ease;
    }

    .pb-insight-box.visible {
      opacity: 1;
      transform: translateY(0);
    }

    .pb-insight-box h5 {
      font-family: 'IBM Plex Mono', monospace;
      font-size: 0.7rem;
      font-weight: 600;
      color: var(--pb-accent-teal);
      text-transform: uppercase;
      letter-spacing: 0.1em;
      margin: 0 0 0.4rem 0;
    }

    .pb-insight-box p {
      font-size: 0.85rem;
      color: var(--pb-text);
      line-height: 1.5;
      margin: 0;
    }

    .pb-insight-box code {
      font-family: 'IBM Plex Mono', monospace;
      background: rgba(0,0,0,0.2);
      padding: 0.1em 0.25em;
      border-radius: 3px;
      font-size: 0.8em;
    }

    [data-theme="light"] .pb-insight-box code,
    :root:not([data-theme="dark"]) .pb-insight-box code {
      background: rgba(0,0,0,0.06);
    }

     
    @media (max-width: 600px) {
      .pb-distributions {
        grid-template-columns: 1fr;
      }

      .pb-legend {
        gap: 0.75rem;
      }

      .pb-controls {
        gap: 0.4rem;
      }

      .pb-control-btn {
        padding: 0.4rem 0.75rem;
        font-size: 0.7rem;
      }

      .pb-calc-row {
        flex-wrap: wrap;
        font-size: 0.75rem;
      }

      .pb-calc-label {
        min-width: 100%;
        margin-bottom: 0.15rem;
      }
    }
  </style>

  <div class="pb-header">
    <h3>The Probability Budget</h3>
    <p>How rejection sampling fills the exact probability mass for each token</p>
  </div>

  
  <div class="pb-distributions">
    <div class="pb-dist-card">
      <h4>Target p(x)</h4>
      <div class="pb-dist-row">
        <span class="pb-token-label">A</span>
        <div class="pb-dist-bar-container">
          <div class="pb-dist-bar target" style="width: 70%"></div>
        </div>
        <span class="pb-dist-value">0.70</span>
      </div>
      <div class="pb-dist-row">
        <span class="pb-token-label">B</span>
        <div class="pb-dist-bar-container">
          <div class="pb-dist-bar target" style="width: 30%"></div>
        </div>
        <span class="pb-dist-value">0.30</span>
      </div>
    </div>
    <div class="pb-dist-card">
      <h4>Draft q(x)</h4>
      <div class="pb-dist-row">
        <span class="pb-token-label">A</span>
        <div class="pb-dist-bar-container">
          <div class="pb-dist-bar draft" style="width: 40%"></div>
        </div>
        <span class="pb-dist-value">0.40</span>
      </div>
      <div class="pb-dist-row">
        <span class="pb-token-label">B</span>
        <div class="pb-dist-bar-container">
          <div class="pb-dist-bar draft" style="width: 60%"></div>
        </div>
        <span class="pb-dist-value">0.60</span>
      </div>
    </div>
  </div>

  
  <div class="pb-controls">
    <button type="button" class="pb-control-btn" id="prevBtn-c7044145aa30a113d86e1eb358e648d1" disabled>← Prev</button>
    <button type="button" class="pb-control-btn primary" id="nextBtn-c7044145aa30a113d86e1eb358e648d1">Next Step →</button>
    <button type="button" class="pb-control-btn" id="resetBtn-c7044145aa30a113d86e1eb358e648d1">Reset</button>
  </div>

  
  <div class="pb-budget-section">
    <div class="pb-section-header">
      <span class="pb-section-title" id="stepTitle-c7044145aa30a113d86e1eb358e648d1">Step 1: Target Budget</span>
      <div class="pb-step-indicator">
        <div class="pb-step-dot active" data-step="0"></div>
        <div class="pb-step-dot" data-step="1"></div>
        <div class="pb-step-dot" data-step="2"></div>
        <div class="pb-step-dot" data-step="3"></div>
      </div>
    </div>

    
    <div class="pb-budget-row">
      <div class="pb-budget-label-row">
        <span class="pb-budget-token">Token A</span>
        <span class="pb-budget-target-label">target: 0.70</span>
      </div>
      <div class="pb-budget-bar-wrapper">
        <div class="pb-budget-target-outline" id="targetOutlineA-c7044145aa30a113d86e1eb358e648d1" style="width: 70%"></div>
        <div class="pb-budget-bar-inner">
          <div class="pb-budget-segment accept" id="acceptA-c7044145aa30a113d86e1eb358e648d1" style="width: 0%">
            <span class="pb-segment-label">0.40</span>
          </div>
          <div class="pb-budget-segment residual" id="residualA-c7044145aa30a113d86e1eb358e648d1" style="width: 0%">
            <span class="pb-segment-label">0.30</span>
          </div>
        </div>
      </div>
      <div class="pb-budget-annotations">
        <span class="pb-annotation accept" id="annotAcceptA-c7044145aa30a113d86e1eb358e648d1">min(0.7, 0.4) = 0.4</span>
        <span class="pb-annotation residual" id="annotResidualA-c7044145aa30a113d86e1eb358e648d1">0.7 − 0.4 = 0.3</span>
      </div>
    </div>

    
    <div class="pb-budget-row">
      <div class="pb-budget-label-row">
        <span class="pb-budget-token">Token B</span>
        <span class="pb-budget-target-label">target: 0.30</span>
      </div>
      <div class="pb-budget-bar-wrapper">
        <div class="pb-budget-target-outline" id="targetOutlineB-c7044145aa30a113d86e1eb358e648d1" style="width: 30%"></div>
        <div class="pb-budget-bar-inner">
          <div class="pb-budget-segment accept" id="acceptB-c7044145aa30a113d86e1eb358e648d1" style="width: 0%">
            <span class="pb-segment-label">0.30</span>
          </div>
          <div class="pb-budget-segment residual" id="residualB-c7044145aa30a113d86e1eb358e648d1" style="width: 0%">
            <span class="pb-segment-label">0.00</span>
          </div>
        </div>
      </div>
      <div class="pb-budget-annotations">
        <span class="pb-annotation accept" id="annotAcceptB-c7044145aa30a113d86e1eb358e648d1">min(0.3, 0.6) = 0.3</span>
        <span class="pb-annotation residual" id="annotResidualB-c7044145aa30a113d86e1eb358e648d1">0.3 − 0.3 = 0.0</span>
      </div>
    </div>

    <div class="pb-legend">
      <div class="pb-legend-item">
        <div class="pb-legend-swatch target"></div>
        <span>Target budget p(x)</span>
      </div>
      <div class="pb-legend-item">
        <div class="pb-legend-swatch accept"></div>
        <span>Accept phase: min(p,q)</span>
      </div>
      <div class="pb-legend-item">
        <div class="pb-legend-swatch residual"></div>
        <span>Residual: max(0, p−q)</span>
      </div>
    </div>
  </div>

  
  <div class="pb-calc-panel">
    <div class="pb-calc-title">Probability Accounting</div>

    <div class="pb-calc-row" id="calcAcceptA-c7044145aa30a113d86e1eb358e648d1">
      <span class="pb-calc-label">P(accept A)</span>
      <span class="pb-calc-equals">=</span>
      <span class="pb-calc-value">q(A) × 1.0 = 0.4 × 1.0 = <strong>0.40</strong></span>
    </div>

    <div class="pb-calc-row" id="calcAcceptB-c7044145aa30a113d86e1eb358e648d1">
      <span class="pb-calc-label">P(accept B)</span>
      <span class="pb-calc-equals">=</span>
      <span class="pb-calc-value">q(B) × 0.5 = 0.6 × 0.5 = <strong>0.30</strong></span>
    </div>

    <div class="pb-calc-row" id="calcReject-c7044145aa30a113d86e1eb358e648d1">
      <span class="pb-calc-label">P(reject)</span>
      <span class="pb-calc-equals">=</span>
      <span class="pb-calc-value">1 − 0.4 − 0.3 = <strong>0.30</strong></span>
    </div>

    <div class="pb-calc-row" id="calcResidualA-c7044145aa30a113d86e1eb358e648d1">
      <span class="pb-calc-label">Residual p′(A)</span>
      <span class="pb-calc-equals">=</span>
      <span class="pb-calc-value">0.3 / 0.3 = <strong>1.0</strong></span>
    </div>

    <div class="pb-calc-row" id="calcResidualB-c7044145aa30a113d86e1eb358e648d1">
      <span class="pb-calc-label">Residual p′(B)</span>
      <span class="pb-calc-equals">=</span>
      <span class="pb-calc-value">0.0 / 0.3 = <strong>0.0</strong></span>
    </div>

    <div class="pb-calc-row highlight" id="calcFinalA-c7044145aa30a113d86e1eb358e648d1">
      <span class="pb-calc-label">P(output=A)</span>
      <span class="pb-calc-equals">=</span>
      <span class="pb-calc-value">0.4 + 0.3 × 1.0 = <strong>0.70</strong> <span class="pb-checkmark">✓</span></span>
    </div>

    <div class="pb-calc-row highlight" id="calcFinalB-c7044145aa30a113d86e1eb358e648d1">
      <span class="pb-calc-label">P(output=B)</span>
      <span class="pb-calc-equals">=</span>
      <span class="pb-calc-value">0.3 + 0.3 × 0.0 = <strong>0.30</strong> <span class="pb-checkmark">✓</span></span>
    </div>

    <div class="pb-calc-row error" id="calcWrongA-c7044145aa30a113d86e1eb358e648d1">
      <span class="pb-calc-label">If resample from p:</span>
      <span class="pb-calc-equals"></span>
      <span class="pb-calc-value">P(A) = 0.4 + 0.3 × 0.7 = <strong>0.61</strong> <span class="pb-crossmark">✗</span></span>
    </div>

    <div class="pb-calc-row error" id="calcWrongB-c7044145aa30a113d86e1eb358e648d1">
      <span class="pb-calc-label"></span>
      <span class="pb-calc-equals"></span>
      <span class="pb-calc-value">P(B) = 0.3 + 0.3 × 0.3 = <strong>0.39</strong> <span class="pb-crossmark">✗</span></span>
    </div>

    <div class="pb-insight-box" id="insightBox-c7044145aa30a113d86e1eb358e648d1">
      <h5>Key Insight</h5>
      <p>The residual distribution <code>p′(x)</code> precisely fills the probability gap left by the accept phase. Token B is already "fully funded" by accepts (<code>min(p,q) = p</code>), so it gets zero in the residual. Token A needs exactly 0.3 more probability—which the residual provides with 100% certainty when rejection occurs.</p>
    </div>
  </div>

  <script>
  (function() {
    const uid = 'c7044145aa30a113d86e1eb358e648d1';

    const steps = [
      {
        title: "Step 1: Target Budget",
        acceptA: 0, acceptB: 0, residualA: 0, residualB: 0,
        showAcceptAnnotA: false, showAcceptAnnotB: false,
        showResidualAnnotA: false, showResidualAnnotB: false,
        showLabelAcceptA: false, showLabelAcceptB: false,
        showLabelResidualA: false, showLabelResidualB: false,
        calcs: []
      },
      {
        title: "Step 2: Accept Phase",
        acceptA: 40, acceptB: 30, residualA: 0, residualB: 0,
        showAcceptAnnotA: true, showAcceptAnnotB: true,
        showResidualAnnotA: false, showResidualAnnotB: false,
        showLabelAcceptA: true, showLabelAcceptB: true,
        showLabelResidualA: false, showLabelResidualB: false,
        calcs: ['calcAcceptA', 'calcAcceptB', 'calcReject']
      },
      {
        title: "Step 3: Fill with Residual",
        acceptA: 40, acceptB: 30, residualA: 30, residualB: 0,
        showAcceptAnnotA: true, showAcceptAnnotB: true,
        showResidualAnnotA: true, showResidualAnnotB: true,
        showLabelAcceptA: true, showLabelAcceptB: true,
        showLabelResidualA: true, showLabelResidualB: false,
        calcs: ['calcAcceptA', 'calcAcceptB', 'calcReject', 'calcResidualA', 'calcResidualB']
      },
      {
        title: "Step 4: Perfect Match!",
        acceptA: 40, acceptB: 30, residualA: 30, residualB: 0,
        showAcceptAnnotA: true, showAcceptAnnotB: true,
        showResidualAnnotA: true, showResidualAnnotB: true,
        showLabelAcceptA: true, showLabelAcceptB: true,
        showLabelResidualA: true, showLabelResidualB: false,
        calcs: ['calcAcceptA', 'calcAcceptB', 'calcReject', 'calcResidualA', 'calcResidualB', 'calcFinalA', 'calcFinalB', 'calcWrongA', 'calcWrongB'],
        showInsight: true
      }
    ];

    let currentStep = 0;

    const elements = {
      stepTitle: document.getElementById(`stepTitle-${uid}`),
      acceptA: document.getElementById(`acceptA-${uid}`),
      acceptB: document.getElementById(`acceptB-${uid}`),
      residualA: document.getElementById(`residualA-${uid}`),
      residualB: document.getElementById(`residualB-${uid}`),
      annotAcceptA: document.getElementById(`annotAcceptA-${uid}`),
      annotAcceptB: document.getElementById(`annotAcceptB-${uid}`),
      annotResidualA: document.getElementById(`annotResidualA-${uid}`),
      annotResidualB: document.getElementById(`annotResidualB-${uid}`),
      insightBox: document.getElementById(`insightBox-${uid}`),
      prevBtn: document.getElementById(`prevBtn-${uid}`),
      nextBtn: document.getElementById(`nextBtn-${uid}`),
      resetBtn: document.getElementById(`resetBtn-${uid}`)
    };

    const container = document.getElementById(`prob-budget-${uid}`);
    const stepDots = container.querySelectorAll('.pb-step-dot');

    const calcIds = [
      'calcAcceptA', 'calcAcceptB', 'calcReject',
      'calcResidualA', 'calcResidualB',
      'calcFinalA', 'calcFinalB',
      'calcWrongA', 'calcWrongB'
    ];

    const calcElements = {};
    calcIds.forEach(id => {
      calcElements[id] = document.getElementById(`${id}-${uid}`);
    });

    function applyStep(stepIndex) {
      const step = steps[stepIndex];

      elements.stepTitle.textContent = step.title;

      
      elements.acceptA.style.width = step.acceptA + '%';
      elements.acceptB.style.width = step.acceptB + '%';
      elements.residualA.style.width = step.residualA + '%';
      elements.residualB.style.width = step.residualB + '%';

      
      elements.acceptA.classList.toggle('show-label', step.showLabelAcceptA);
      elements.acceptB.classList.toggle('show-label', step.showLabelAcceptB);
      elements.residualA.classList.toggle('show-label', step.showLabelResidualA);
      elements.residualB.classList.toggle('show-label', step.showLabelResidualB);

      
      elements.annotAcceptA.classList.toggle('visible', step.showAcceptAnnotA);
      elements.annotAcceptB.classList.toggle('visible', step.showAcceptAnnotB);
      elements.annotResidualA.classList.toggle('visible', step.showResidualAnnotA);
      elements.annotResidualB.classList.toggle('visible', step.showResidualAnnotB);

      
      calcIds.forEach(id => {
        const shouldShow = step.calcs.includes(id);
        if (calcElements[id]) {
          calcElements[id].classList.toggle('visible', shouldShow);
        }
      });

      
      elements.insightBox.classList.toggle('visible', step.showInsight || false);

      
      stepDots.forEach((dot, idx) => {
        dot.classList.remove('active', 'completed');
        if (idx === stepIndex) {
          dot.classList.add('active');
        } else if (idx < stepIndex) {
          dot.classList.add('completed');
        }
      });

      
      elements.prevBtn.disabled = stepIndex === 0;
      elements.nextBtn.disabled = stepIndex === steps.length - 1;
      elements.nextBtn.textContent = stepIndex === steps.length - 1 ? 'Complete' : 'Next Step →';
    }

    function nextStep() {
      if (currentStep < steps.length - 1) {
        currentStep++;
        applyStep(currentStep);
      }
    }

    function prevStep() {
      if (currentStep > 0) {
        currentStep--;
        applyStep(currentStep);
      }
    }

    function reset() {
      currentStep = 0;
      applyStep(currentStep);
    }

    elements.nextBtn.addEventListener('click', nextStep);
    elements.prevBtn.addEventListener('click', prevStep);
    elements.resetBtn.addEventListener('click', reset);

    
    applyStep(0);
  })();
  </script>
</div>
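<p>The accounting in the widget can be reproduced in a few lines. The sketch below is illustrative only: the function name <code>speculative_output_distribution</code> and the dictionary representation are assumptions made for this example, not part of any library. It computes the mass delivered by the accept phase, the normalised residual, and the final output distribution, then contrasts it with the naive variant that resamples from p.</p>

```python
def speculative_output_distribution(p, q):
    """Exact output distribution of one accept/reject step with residual resampling.

    p and q map each token to its target and draft probability (hypothetical toy inputs).
    """
    tokens = list(p)
    # Accept phase delivers q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)) for each token.
    accepted = {x: min(p[x], q[x]) for x in tokens}
    reject_prob = 1.0 - sum(accepted.values())
    # Residual distribution: max(0, p - q), renormalised to sum to 1.
    residual = {x: max(0.0, p[x] - q[x]) for x in tokens}
    norm = sum(residual.values())
    if norm > 0:
        residual = {x: r / norm for x, r in residual.items()}
    output = {x: accepted[x] + reject_prob * residual[x] for x in tokens}
    return accepted, reject_prob, residual, output


p = {"A": 0.7, "B": 0.3}  # target
q = {"A": 0.4, "B": 0.6}  # draft
accepted, reject_prob, residual, output = speculative_output_distribution(p, q)

# Naive variant: resample from p itself after a rejection (this over-counts B).
naive = {x: accepted[x] + reject_prob * p[x] for x in p}

print("output:", output)
print("naive: ", naive)
```

<p>Up to floating-point rounding, <code>output</code> matches p, while <code>naive</code> drifts toward the draft distribution.</p>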

<h3 id="the-full-algorithm-timeline">The Full Algorithm Timeline</h3>
<pre tabindex="0"><code>Step 1: Draft model runs K times (cheap, fast)
        [x₁] → [x₂] → [x₃] → [x₄] → [x₅]

Step 2: Target model runs ONCE (expensive, but parallel)
        [x₁, x₂, x₃, x₄, x₅] → [p₁, p₂, p₃, p₄, p₅]

Step 3: Sequential verify until rejection
        x₁ ✓ → x₂ ✓ → x₃ ✗ → STOP, discard x₄,x₅
                       ↓
                  resample x₃&#39; ~ residual

Output: [x₁, x₂, x₃&#39;]
</code></pre><p>The key efficiency gain: that single target model forward pass would normally give you just 1 token. With speculation, you potentially get K+1 tokens from the same compute, paying only the small overhead of draft generation.</p>
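<p>Step 3 of the timeline can be sketched as a short loop. This is a minimal illustration under stated assumptions: distributions are plain dictionaries, and the helper names <code>sample</code> and <code>verify_draft</code> are hypothetical. A real implementation works on batched logits on the GPU.</p>

```python
import random


def sample(dist, rng):
    """Draw one token from a {token: weight} dict (weights need not be normalised)."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def verify_draft(draft_tokens, q_dists, p_dists, rng):
    """Accept draft tokens left to right; on the first rejection, resample and stop.

    q_dists[i] and p_dists[i] are the draft and target distributions at position i.
    p_dists carries one extra entry, used for the bonus token when all drafts pass.
    """
    output = []
    for i, token in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        accept_prob = min(1.0, p.get(token, 0.0) / q[token])
        if rng.random() < accept_prob:
            output.append(token)
            continue
        # Rejected: draw from the residual max(0, p - q) and discard later drafts.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        output.append(sample(residual, rng))
        return output
    # Every draft accepted: the same target pass also provides token K+1.
    output.append(sample(p_dists[len(draft_tokens)], rng))
    return output
```

<p>When the draft and target agree exactly, every draft token is accepted and the function returns K+1 tokens; when they disagree completely, it returns a single token drawn from the residual.</p>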
<h2 id="the-speedup-formula">The Speedup Formula</h2>
<p>How much faster does speculative decoding make inference? The number of tokens generated per iteration follows a capped geometric distribution.</p>
<p>If we propose γ tokens (the K of the timeline above) and each is accepted independently with probability α, the expected number of tokens produced per iteration, counting the accepted drafts plus the one token the target model always contributes, is:</p>
$$E[\text{tokens per iteration}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}$$<p>For large γ, this approaches $\frac{1}{1-\alpha}$. With typical values (α = 0.75, γ = 5), we get about 3.3 tokens per expensive target model call, against a ceiling of 4.</p>
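<p>To get a feel for how quickly the expectation saturates, the formula can be evaluated for a few draft lengths. The function name <code>expected_tokens</code> is just a label for this sketch, and it assumes a single constant acceptance probability, as the formula does.</p>

```python
def expected_tokens(alpha, gamma):
    """Expected tokens per target call: (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    if alpha == 1.0:
        # Limit case: every draft is accepted, plus the bonus token.
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


alpha = 0.75
for gamma in (1, 3, 5, 10):
    print(f"gamma={gamma:2d}  expected tokens={expected_tokens(alpha, gamma):.2f}")
print(f"ceiling 1/(1-alpha) = {1.0 / (1.0 - alpha):.2f}")
```

<p>The values climb toward the ceiling but with diminishing returns, which is why very long drafts rarely pay off.</p>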
<p>The speedup formula must account for the cost of the draft model:</p>
$$\text{Speedup} = \frac{1 - \alpha^{\gamma+1}}{(1-\alpha)(\gamma c + 1)}$$<p>where $c = t_{\text{draft}}/t_{\text{target}}$ is the ratio of draft model latency to target model latency. For a draft model 100× smaller, $c$ is typically around 0.01 to 0.05.</p>
<p>When $\alpha > c$, some choice of γ is guaranteed to give a speedup: even γ = 1 yields $(1 + \alpha)/(1 + c)$, which is therefore a lower bound on what the best γ achieves. Real-world benchmarks on H200 GPUs show Llama 3.1 405B with a Llama 3.2 3B draft achieving 3.6× speedup (33 → 120 tokens/sec).</p>
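<p>The speedup formula is easy to explore numerically. The sketch below assumes α is strictly below 1 and constant across positions; the names <code>speedup</code> and <code>best_gamma</code>, and the search cap of 32, are choices made for this example.</p>

```python
def speedup(alpha, gamma, c):
    """Expected wall-clock speedup for draft length gamma and cost ratio c (alpha < 1)."""
    expected = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # One iteration costs gamma draft steps plus one target step, in target-step units.
    return expected / (gamma * c + 1.0)


def best_gamma(alpha, c, max_gamma=32):
    """Draft length in 1..max_gamma that maximises the speedup formula."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))


alpha, c = 0.75, 0.05
g = best_gamma(alpha, c)
print(f"gamma=1 speedup: {speedup(alpha, 1, c):.2f}")
print(f"best gamma={g}, speedup: {speedup(alpha, g, c):.2f}")
```

<p>With γ = 1 the formula reduces to (1 + α)/(1 + c), so the search can only improve on that figure.</p>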
<h2 id="the-variant-landscape">The Variant Landscape</h2>
<p>The field has evolved rapidly since 2023, with researchers finding increasingly clever ways to eliminate draft models or improve acceptance rates.</p>
<h3 id="eagle-feature-level-speculation">EAGLE: Feature-Level Speculation</h3>
<p>EAGLE (ICML 2024) introduced feature-level speculation, predicting at the second-to-top layer rather than token level. The key insight: autoregression over continuous hidden states is easier than over discrete tokens.</p>
<p>Rather than training a separate small model, EAGLE trains a lightweight head (~1B parameters for 70B models) that extrapolates feature vectors. These features are then decoded to tokens and verified. The approach achieves 3× speedup, 1.6× faster than Medusa&rsquo;s parallel heads approach.</p>
<p>EAGLE-2 added context-aware dynamic draft trees, adjusting speculation aggressiveness based on prediction confidence to reach up to 4.26× speedup.</p>
<h3 id="medusa-parallel-prediction-heads">Medusa: Parallel Prediction Heads</h3>
<p>Medusa takes a different approach: add multiple single-layer prediction heads directly atop the frozen base model. Each head predicts a different future position independently.</p>
<pre tabindex="0"><code>Hidden State → Head 1 → Token +1
            → Head 2 → Token +2
            → Head 3 → Token +3
</code></pre><p>The Cartesian product of top-k predictions from each head creates candidate continuations verified via tree attention. Training requires only hours on a single A100.</p>
<p>The trade-off: position-independent heads can&rsquo;t condition on earlier speculated tokens, limiting acceptance rates compared to EAGLE&rsquo;s sequential feature prediction.</p>
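<p>The Cartesian-product step can be illustrated with <code>itertools.product</code>. This is a toy sketch, not Medusa's code: the function name <code>candidate_continuations</code> is hypothetical, and ranking candidates by the product of per-head probabilities is an assumption that mirrors the heads' independence.</p>

```python
import math
from itertools import product


def candidate_continuations(head_topk):
    """Enumerate candidate continuations from per-head top-k predictions.

    head_topk[i] is a list of (token, prob) pairs from head i, i.e. position +i+1.
    Returns (tokens, score) pairs sorted by score, highest first.
    """
    candidates = []
    for combo in product(*head_topk):
        tokens = tuple(tok for tok, _ in combo)
        # Heads predict independently, so the joint score is a plain product.
        score = math.prod(prob for _, prob in combo)
        candidates.append((tokens, score))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates


heads = [
    [("the", 0.6), ("a", 0.3)],      # head 1: token +1
    [("cat", 0.5), ("dog", 0.4)],    # head 2: token +2
]
for tokens, score in candidate_continuations(heads):
    print(tokens, round(score, 2))
```

<p>Because the number of candidates grows as k to the power of the number of heads, practical systems prune this set before building the attention tree.</p>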
<h3 id="self-speculative-methods">Self-Speculative Methods</h3>
<p>LayerSkip (ACL 2024) eliminates external drafters entirely by using early exits from the target model itself. During training, layer dropout with increasing rates toward later layers plus early exit loss creates a model that can draft from shallow layers and verify with deep layers.</p>
<p>The catch: requires special training recipes. Baseline LLMs show no speedup with this approach.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Draft Model</th>
          <th>Training</th>
          <th>Memory Overhead</th>
          <th>Speedup</th>
          <th>Distribution Preserved</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Standard SD</td>
          <td>Yes (separate)</td>
          <td>Optional</td>
          <td>High</td>
          <td>1.5-2.5×</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>EAGLE-2</td>
          <td>Lightweight head</td>
          <td>~2 days</td>
          <td>Low-Medium</td>
          <td>3-4.3×</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Medusa</td>
          <td>No (heads on base)</td>
          <td>Hours</td>
          <td>Low</td>
          <td>2.2-3.6×</td>
          <td>Optional</td>
      </tr>
  </tbody>
</table>
<h2 id="glm-47-native-multi-token-prediction">GLM-4.7: Native Multi-Token Prediction</h2>
<p>GLM-4.7 represents a paradigm shift: rather than retrofitting speculative decoding onto existing models, Zhipu AI built Multi-Token Prediction directly into the architecture.</p>
<p>The model contains 355 billion total parameters with 32 billion active per forward pass via Mixture-of-Experts routing. This extreme sparsity, with only 9% of parameters active per token, creates an ideal scenario for speculative decoding: massive memory reads but relatively modest compute.</p>
<h3 id="the-mtp-architecture">The MTP Architecture</h3>
<p>Traditional speculative decoding uses separate draft and target models. GLM-4.7&rsquo;s MTP adds auxiliary prediction heads within the model itself:</p>
<pre tabindex="0"><code>Hidden State h_t → Main Head → P(x_{t+1} | h_t)     [standard next-token]
               → MTP Head  → P(x_{t+2} | h_t)     [speculative token]
               → MTP Head  → P(x_{t+3} | h_t)     [speculative token]
</code></pre><p>The MTP heads are lightweight projections sharing the same massive 32B active backbone. Because draft and target share the same internal representation, the draft distribution stays tightly aligned with the target distribution, resulting in acceptance rates exceeding 90% with 1 speculative token.</p>
<h3 id="training-the-mtp-heads">Training the MTP Heads</h3>
<p>The MTP layer was trained with loss weight λ = 0.3 for the first 15 trillion tokens, reduced to 0.1 later. This balances multi-token prediction quality against primary language modeling capability.</p>
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \lambda \cdot \mathcal{L}_{\text{MTP}}$$<p>The reduced weight in later training prevents the MTP objective from interfering with the model&rsquo;s core capabilities while still maintaining high acceptance rates at inference time.</p>
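<p>The weighting schedule amounts to a step function on the number of training tokens seen. The sketch below uses the values quoted above (0.3 before 15 trillion tokens, 0.1 afterwards); the function names and the hard switch at exactly 15T are assumptions for illustration, since the actual training code is not described here.</p>

```python
def mtp_loss_weight(tokens_seen, switch_at=15e12, early=0.3, late=0.1):
    """Step schedule for lambda: `early` before `switch_at` tokens, `late` after."""
    return early if tokens_seen < switch_at else late


def total_loss(lm_loss, mtp_loss, tokens_seen):
    """L_total = L_LM + lambda * L_MTP with the step schedule above."""
    return lm_loss + mtp_loss_weight(tokens_seen) * mtp_loss


# Same per-batch losses, early versus late in training (made-up loss values).
print(total_loss(2.0, 3.0, tokens_seen=1e12))
print(total_loss(2.0, 3.0, tokens_seen=16e12))
```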
<h3 id="architectural-innovations">Architectural Innovations</h3>
<p>GLM-4.7 incorporates several architectural choices that complement its MTP capability:</p>
<ul>
<li><strong>Sigmoid-gated loss-free balance routing</strong> across ~160 experts (8 routed experts active per token)</li>
<li><strong>96 attention heads</strong> for 5120 hidden dimension (2.5× more heads than typical)</li>
<li><strong>Grouped-Query Attention</strong> with partial RoPE at 1M base frequency for 200K context</li>
<li><strong>QK-Norm</strong> for stabilized attention logits</li>
</ul>
<p>The increased head count particularly improves reasoning benchmarks despite not improving training loss—an interesting finding suggesting that inference-time compute distribution matters.</p>
<h2 id="vllm-implementation-pagedattention-meets-speculation">vLLM Implementation: PagedAttention Meets Speculation</h2>
<p>vLLM&rsquo;s speculative decoding architecture consists of three phases orchestrated by the SpecDecodeWorker:</p>
<ol>
<li><strong>Draft Runner</strong>: Proposes candidate tokens using MTP heads</li>
<li><strong>Target Runner</strong>: Scores all candidates in a single forward pass</li>
<li><strong>Rejection Sampler</strong>: Implements accept/reject logic</li>
</ol>
<h3 id="pagedattention-integration">PagedAttention Integration</h3>
<p>The integration with PagedAttention required non-trivial modifications. The memory manager tracks KV cache for both draft and target phases with block-level management enabling sharing, copying, and forking between sequences.</p>
<p>For MTP-style speculation, the draft phase reuses the target model&rsquo;s KV cache infrastructure, minimizing overhead. The scheduler now supports &ldquo;preallocated slots&rdquo;—reserving KV block space sufficient for multiple tokens before the next scheduler invocation.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># GLM-4.7 with native MTP speculative decoding</span>
</span></span><span style="display:flex;"><span>vllm serve zai-org/GLM-4.7-FP8 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --tensor-parallel-size <span style="color:#ae81ff">4</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --speculative-config.method mtp <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --speculative-config.num_speculative_tokens <span style="color:#ae81ff">1</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>    --tool-call-parser glm47
</span></span></code></pre></div><h3 id="why-num_speculative_tokens1">Why num_speculative_tokens=1?</h3>
<p>The recommendation of 1 speculative token reflects empirical findings: higher values increase mean acceptance length but decrease acceptance <em>rate</em>, reducing overall throughput. The sweet spot maximizes expected tokens per iteration accounting for verification overhead.</p>
<p>With GLM-4.7&rsquo;s 90%+ acceptance rate at num_speculative_tokens=1, you get about 1.9 tokens per forward pass on average: one guaranteed token plus a draft that is accepted at least 90% of the time. Increasing to 2 speculative tokens might yield an average of 2.5 tokens but with higher variance and occasional costly rejections.</p>
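<p>The per-pass token counts follow from a simple chain: a speculative token at position k only counts if every earlier one was accepted. The sketch below makes that explicit. The function name is hypothetical, and the second-position rate of 0.67 is a made-up value chosen only to land near the 2.5-token figure, not a measured number.</p>

```python
def expected_tokens_per_pass(accept_rates):
    """Expected tokens per target pass given per-position acceptance rates.

    accept_rates[k] is the chance that speculative token k is accepted,
    given that all earlier speculative tokens were accepted.
    """
    expected = 1.0  # the target model always contributes one token
    survive = 1.0   # probability that the chain is still unbroken
    for rate in accept_rates:
        survive *= rate
        expected += survive
    return expected


print(expected_tokens_per_pass([0.9]))         # one speculative token
print(expected_tokens_per_pass([0.9, 0.67]))   # two, with a weaker second position
```

<p>The second speculative token adds only its joint acceptance probability, which is why the gain from 1 to 2 tokens is smaller than the gain from 0 to 1.</p>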
<h3 id="continuous-batching-challenges">Continuous Batching Challenges</h3>
<p>Continuous batching with speculation creates the &ldquo;ragged tensor problem&rdquo;: different sequences accept different numbers of tokens per iteration, creating irregular batch shapes. At higher concurrency, this overhead consumes up to 40% of computation.</p>
<p>vLLM addresses this through dynamic speculation length adjustment based on system load—reducing speculation aggressiveness when batch sizes grow.</p>
<h2 id="when-speculation-helps-and-when-it-hurts">When Speculation Helps and When It Hurts</h2>
<p>The fundamental principle: speculative decoding trades compute for memory bandwidth. When GPUs are memory-bound (most inference scenarios), spare compute cycles can profitably run draft verification. When GPUs are compute-saturated, speculation adds overhead without benefit.</p>
<h3 id="batch-size-dominates">Batch Size Dominates</h3>
<table>
  <thead>
      <tr>
          <th>Condition</th>
          <th>Impact</th>
          <th>Recommendation</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Batch size ≤ 8</td>
          <td>Strong benefit (1.5-2.7×)</td>
          <td>Enable with γ=4-8</td>
      </tr>
      <tr>
          <td>Batch size &gt; 32, short context</td>
          <td>Potential slowdown</td>
          <td>Disable or use dynamic γ</td>
      </tr>
      <tr>
          <td>Batch size &gt; 32, long context</td>
          <td>Moderate benefit (up to 2×)</td>
          <td>Enable with small γ</td>
      </tr>
      <tr>
          <td>QPS &lt; 10</td>
          <td>Strong benefit</td>
          <td>Enable</td>
      </tr>
      <tr>
          <td>QPS &gt; 50</td>
          <td>Diminishing/negative returns</td>
          <td>Dynamic speculation</td>
      </tr>
      <tr>
          <td>Acceptance rate &lt; 0.5</td>
          <td>Marginal benefit</td>
          <td>Improve draft alignment</td>
      </tr>
  </tbody>
</table>
<p>At batch size 1, GPUs run severely underutilized—speculative decoding achieves 2.73× speedup (63% latency reduction). Beyond batch size 16-32, benefits diminish and can reverse, causing 1.4-1.8× slowdown.</p>
<h3 id="the-long-context-exception">The Long Context Exception</h3>
<p>MagicDec research found that at large batch sizes with long contexts, decoding becomes memory-bound again due to KV cache loading. Speculative decoding can provide 2× speedup even on 8 A100s with high concurrency when context lengths exceed 32K tokens.</p>
<p>INT4/INT8 quantization presents tradeoffs: aggressive weight quantization can reduce acceptance rates as draft model quality degrades. The QSpec approach uses W4A4 for drafting and W4A16 for verification, capturing benefits of both.</p>
<h2 id="where-the-field-is-heading">Where the Field Is Heading</h2>
<p>The success of GLM-4.7&rsquo;s native MTP suggests future models will ship with speculation built-in rather than bolted-on. Several trends are emerging:</p>
<p><strong>Architectural Integration</strong>: Models trained with MTP objectives from the start achieve dramatically higher acceptance rates than retrofitted solutions. Expect this to become standard practice.</p>
<p><strong>Dynamic Speculation</strong>: Rather than fixed speculation lengths, future systems will adjust aggressiveness based on:</p>
<ul>
<li>Current batch size</li>
<li>Observed acceptance rates</li>
<li>Prediction entropy</li>
<li>Available compute headroom</li>
</ul>
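<p>As a rough illustration of what such a controller could look like, the sketch below uses only the first two signals in the list. It is a hypothetical heuristic, not the scheduler of vLLM or any other framework; the thresholds (batch size 32, acceptance rate 0.5) are borrowed from the table earlier in this post, and the linear scaling is an arbitrary choice.</p>

```python
def choose_num_speculative_tokens(batch_size, observed_accept_rate,
                                  max_tokens=4, max_batch=32, min_accept=0.5):
    """Hypothetical heuristic for picking a speculation length.

    Returns 0 (speculation off) when the batch is large or drafts are rarely
    accepted; otherwise scales the length down as load grows.
    """
    if batch_size > max_batch or observed_accept_rate < min_accept:
        return 0
    # Fraction of the batch budget still unused: near 1 when idle, 0 at the limit.
    headroom = 1.0 - batch_size / max_batch
    scaled = round(max_tokens * headroom * observed_accept_rate)
    return max(1, scaled)


for batch, rate in [(1, 0.9), (16, 0.9), (32, 0.9), (64, 0.9), (4, 0.3)]:
    print(batch, rate, choose_num_speculative_tokens(batch, rate))
```

<p>A production controller would likely also factor in prediction entropy and measured compute headroom, which this sketch leaves out.</p>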
<p><strong>Hardware Co-design</strong>: As speculative decoding becomes ubiquitous, GPU architectures may evolve to better support the draft-verify pattern with dedicated acceleration for the rejection sampling kernel.</p>
<p><strong>Beyond Token Prediction</strong>: EAGLE&rsquo;s feature-level speculation hints at richer speculation targets. Predicting structured outputs (tool calls, code blocks) could enable even higher acceptance rates for specialized workloads.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Speculative decoding achieves something rare in optimization: meaningful speedups without any quality tradeoff. When verification uses exact rejection sampling, the output distribution is mathematically identical to standard decoding.</p>
<p>The technique works because LLM inference is memory-bound, not compute-bound. By using idle GPU cycles to verify multiple speculative tokens in parallel, we amortize the expensive memory reads across several output tokens.</p>
<p>GLM-4.7&rsquo;s native MTP architecture points toward where the field is heading: models designed from the ground up for efficient speculation, achieving 90%+ acceptance rates that make speculative decoding nearly as reliable as a lookup table.</p>
<h2 id="references">References</h2>
<ol>
<li>
<p><strong>Leviathan, Y., Kalman, M., &amp; Matias, Y. (2023).</strong> <a href="https://arxiv.org/abs/2211.17192">Fast Inference from Transformers via Speculative Decoding</a>. <em>International Conference on Machine Learning</em>.</p>
<ul>
<li>The original Google paper introducing speculative decoding with rigorous distribution preservation proofs.</li>
</ul>
</li>
<li>
<p><strong>Li, Y., Wei, F., Zhang, C., &amp; Zhang, H. (2024).</strong> <a href="https://arxiv.org/abs/2401.15077">EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty</a>. <em>International Conference on Machine Learning</em>.</p>
<ul>
<li>Feature-level speculation achieving superior speedups through hidden state prediction.</li>
</ul>
</li>
<li>
<p><strong>Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., &amp; Dao, T. (2024).</strong> <a href="https://arxiv.org/abs/2401.10774">Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>Parallel prediction heads approach requiring minimal training overhead.</li>
</ul>
</li>
<li>
<p><strong>Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., &hellip; &amp; Wu, C.-J. (2024).</strong> <a href="https://arxiv.org/abs/2404.16710">LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding</a>. <em>Association for Computational Linguistics</em>.</p>
<ul>
<li>Self-speculative decoding using early exits from the target model.</li>
</ul>
</li>
<li>
<p><strong>Fu, Y., Bailis, P., Stoica, I., &amp; Zhang, H. (2024).</strong> <a href="https://arxiv.org/abs/2402.02057">Break the Sequential Dependency of LLM Inference Using Lookahead Decoding</a>. <em>International Conference on Machine Learning</em>.</p>
<ul>
<li>Training-free speculative decoding via Jacobi iteration.</li>
</ul>
</li>
<li>
<p><strong>Zhipu AI. (2025).</strong> <a href="https://huggingface.co/zai-org/GLM-4.7">GLM-4.7: Advancing the Coding Capability</a>. <em>Hugging Face Model Card</em>.</p>
<ul>
<li>Technical documentation for GLM-4.7&rsquo;s native MTP architecture.</li>
</ul>
</li>
<li>
<p><strong>vLLM Team. (2025).</strong> <a href="https://docs.vllm.ai/en/latest/features/spec_decode/">Speculative Decoding Documentation</a>. <em>vLLM Documentation</em>.</p>
<ul>
<li>Implementation details for speculative decoding in vLLM.</li>
</ul>
</li>
<li>
<p><strong>Chen, J., Tiwari, V., Sadhukhan, R., Chen, Z., Shi, J., Yen, I. E.-H., &amp; Chen, B. (2024).</strong> <a href="https://arxiv.org/abs/2408.11049">MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding</a>. <em>arXiv preprint</em>.</p>
<ul>
<li>Analysis of speculative decoding performance at high batch sizes with long contexts.</li>
</ul>
</li>
</ol>
]]></content:encoded></item><item><title>The Anatomy of Agentic Code Assist: Building Production Grade AI Coding Agents</title><link>https://www.mdjawad.com/posts/anatomy-of-agentic-code-assist/</link><pubDate>Sat, 15 Nov 2025 10:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/anatomy-of-agentic-code-assist/</guid><description>A deep dive into the architecture, design patterns, and engineering decisions behind production-grade agentic code assist solutions. By dissecting OpenHands, we uncover how to build AI agents that safely execute code, manage complex state, and operate reliably in production.</description><content:encoded><![CDATA[<h2 id="introduction-the-agentic-shift-in-software-engineering">Introduction: The Agentic Shift in Software Engineering</h2>
<p>Software engineering tools have been getting closer to translating what humans want into what machines do. We went from assembly to C, from malloc/free to garbage collection, from Vim to IDEs with autocomplete. But something changed in the last two years. We&rsquo;re now building more than just &ldquo;autocomplete&rdquo; solutions.</p>
<p><strong>Code Assist</strong> tools suggest the next token based on what&rsquo;s in front of your cursor. <strong>General Purpose Code Assist Agents</strong> can reason about your entire codebase, plan multi-step changes, execute commands, run tests, and fix their own mistakes. The difference is: one is a fancy text predictor, the other is something that actually does engineering work.</p>
<p><strong>OpenHands</strong> (formerly OpenDevin) is one of the most interesting open-source projects in this space. It&rsquo;s a runtime that takes probabilistic LLM outputs and turns them into deterministic actions—compiling code, running tests, managing Docker containers. This post digs into how OpenHands works: its architecture, the <strong>CodeAct</strong> framework it uses, how it sandboxes execution safely, and what the benchmarks tell us about where this technology actually stands.</p>
<h3 id="ai-as-an-amplifier-why-this-matters">AI as an Amplifier: Why This Matters</h3>
<p>Here&rsquo;s something weird from the 2025 DORA report: AI adoption is basically universal now, but productivity gains are all over the place. Some teams are crushing it, others are drowning in AI-generated technical debt. What&rsquo;s the difference?</p>
<p>AI acts as an <strong>amplifier</strong>. If your team already has good platform engineering and loosely coupled architectures, AI makes you faster. If you&rsquo;re stuck with a tightly coupled monolith and manual deployments, AI will help you write bad code faster.</p>
<p>This matters for what makes a good code assist agent. Just dumping code into a buffer isn&rsquo;t enough. A useful agent needs to:</p>
<ul>
<li>Navigate existing codebases without breaking things</li>
<li>Verify its own changes against test suites</li>
<li>Learn from failures and try different approaches</li>
<li>Understand the context of your organisation and infrastructure, and adapt accordingly</li>
</ul>
<p>Think of it as a <strong>high-performing junior engineer</strong> who can write code quickly but needs to check their work, not a hyper-intelligent autocomplete. OpenHands tries to be the former by treating the agent as part of the actual development workflow, not just a chatbot that spits out code.</p>
<h3 id="why-open-source-matters-here">Why Open Source Matters Here</h3>
<p>The first wave of coding agents (like Devin) were black boxes. Impressive demos, but good luck getting your security team to approve giving them write access to your production codebase. When an agent deletes a config file, you want to know <em>why</em>, not just get an apology.</p>
<p>OpenHands (like most modern code assist solutions) takes a different approach. Everything is transparent. The <strong>Event Stream</strong> logs every action the agent takes and every observation it receives. You can watch it run shell commands, edit files, and search through code in real-time.</p>
<p>This matters because 30% of developers say they don&rsquo;t trust AI-generated code (per the DORA report). Hard to blame them. But when you can see exactly what the agent is doing, step by step, that trust equation changes. You&rsquo;re not blindly accepting output; you&rsquo;re supervising an autonomous process with full visibility, at least until you gain enough confidence to let the agent take over.</p>
<h2 id="what-makes-a-general-purpose-code-agent">What Makes a &ldquo;General Purpose&rdquo; Code Agent?</h2>
<p>A SQL-generating bot is useful for one thing. An agent that can write SQL, wrap it in a Python API, build a React frontend, and deploy the whole thing to Kubernetes, then debug production issues? That&rsquo;s general purpose.</p>
<p>The difference comes down to four things that separate toys from production-ready tools:</p>
<h4 id="1-memory-that-actually-works">1. Memory That Actually Works</h4>
<p>LLMs are stateless. ChatGPT &ldquo;forgets&rdquo; your file structure the moment it scrolls out of the context window. Try refactoring a 50-file codebase when the agent can&rsquo;t remember what it read five minutes ago.</p>
<p>A real agent needs persistent memory. Not just a bigger context window—actual tools to explore and navigate your codebase on-demand. OpenHands gives the LLM a developer&rsquo;s toolkit: ripgrep for fast code search, AST-based analysis for understanding structure, and incremental file access with 100-line windows. Add an event log that lets it &ldquo;replay&rdquo; its own history, and you have something that can actually work with large codebases without pre-indexing everything.</p>
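<p>To make &ldquo;incremental file access&rdquo; concrete, here is a minimal sketch of a windowed reader. It is not OpenHands&rsquo; actual implementation: <code>read_window</code> and its return shape are hypothetical, and the only detail taken from the text above is the 100-line window size.</p>

```python
from itertools import islice

WINDOW = 100  # lines per window, matching the 100-line figure mentioned above


def read_window(path, start_line=1, window=WINDOW):
    """Return up to `window` lines of `path`, starting at 1-indexed `start_line`.

    Hypothetical helper for illustration; not an OpenHands API.
    """
    first = max(start_line, 1) - 1  # convert to a 0-indexed offset
    with open(path, encoding="utf-8") as handle:
        # islice streams the file, so lines outside the window are never kept
        lines = list(islice(handle, first, first + window))
    return {
        "start": first + 1,
        "end": first + len(lines),
        "text": "".join(lines),
    }
```

<p>The agent would call this repeatedly with a new <code>start_line</code> to page through a large file, so no single read floods the context window.</p>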
<h4 id="2-execution-not-just-suggestions">2. Execution, Not Just Suggestions</h4>
<p>OpenHands can create files, run compilers, execute shell scripts—the actual work. But this is dangerous. Running arbitrary LLM-generated code on your machine is a security nightmare. So OpenHands runs everything in Docker containers. The agent gets a sandboxed workspace where it can do whatever it wants without nuking your host system.</p>
<h4 id="3-learning-from-failure">3. Learning from Failure</h4>
<p>Code never works the first time. A code generator dies on the first syntax error. A real agent reads the error output, figures out what went wrong, tries a fix, and runs it again.</p>
<p>This <strong>Edit-Run-Verify</strong> loop is how OpenHands works. Actions flow from the agent to the system, observations (logs, errors, exit codes) flow back. The agent uses that feedback to iterate. Just like you would.</p>
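<p>The loop can be sketched in a few lines. This is an illustration under stated assumptions, not OpenHands code: <code>propose_fix</code> is a hypothetical callback standing in for the LLM reading the observation and editing files, and the dictionary fields are made up for the example.</p>

```python
import subprocess
import sys


def run_check(command):
    """Run a command and package what the agent would observe."""
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


def edit_run_verify(command, propose_fix, max_attempts=3):
    """Run, read the feedback, apply a fix, run again.

    `propose_fix` is a hypothetical stand-in for the LLM editing files.
    """
    history = []
    for attempt in range(max_attempts):
        observation = run_check(command)
        history.append(observation)
        if observation["exit_code"] == 0:
            return True, history
        if attempt + 1 != max_attempts:
            propose_fix(observation)  # the "edit" half of the loop
    return False, history
```

<p>A call such as <code>edit_run_verify([sys.executable, "job.py"], propose_fix)</code> would keep rerunning the script until it exits cleanly or the attempt budget runs out.</p>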
<h4 id="4-using-your-tools">4. Using Your Tools</h4>
<p>No LLM knows about your company&rsquo;s internal Jira workflow or feature flag database. A production agent needs to plug into arbitrary tools without rewriting its core code.</p>
<p>OpenHands uses the <strong>Model Context Protocol (MCP)</strong>—an open standard for tool discovery. Point it at an MCP server, and the agent can dynamically learn what tools are available and how to use them.</p>
<h3 id="how-openhands-compares">How OpenHands Compares</h3>
<p>Here&rsquo;s how OpenHands stacks up against regular autocomplete and chat assistants:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Feature</th>
          <th style="text-align: left">Autocomplete (IntelliSense)</th>
          <th style="text-align: left">Chat (ChatGPT)</th>
          <th style="text-align: left">Agent (OpenHands)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Context</td>
          <td style="text-align: left">Current file only</td>
          <td style="text-align: left">Conversation history</td>
          <td style="text-align: left">Entire repo + runtime state</td>
      </tr>
      <tr>
          <td style="text-align: left">Execution</td>
          <td style="text-align: left">None</td>
          <td style="text-align: left">None, or a limited sandbox</td>
          <td style="text-align: left">Full shell in Docker</td>
      </tr>
      <tr>
          <td style="text-align: left">Agency</td>
          <td style="text-align: left">You drive everything</td>
          <td style="text-align: left">Responds to prompts</td>
          <td style="text-align: left">Multi-step autonomous plans</td>
      </tr>
      <tr>
          <td style="text-align: left">Tooling</td>
          <td style="text-align: left">Static analysis</td>
          <td style="text-align: left">Fixed plugins</td>
          <td style="text-align: left">Dynamic tool discovery (MCP)</td>
      </tr>
      <tr>
          <td style="text-align: left">Memory</td>
          <td style="text-align: left">None</td>
          <td style="text-align: left">Session-only</td>
          <td style="text-align: left">Event-sourced persistence</td>
      </tr>
  </tbody>
</table>
<p>OpenHands goes all-in on that right column. More complex, but actually useful for real work.</p>
<h2 id="how-openhands-is-built">How OpenHands Is Built</h2>
<p>OpenHands looks like a local app, but it&rsquo;s actually a distributed system. The key architectural decision: split the reasoning (<strong>Agent</strong>) from the execution (<strong>Runtime</strong>), and mediate everything through an event stream. This lets you swap LLMs or runtimes without rewriting the whole system.</p>
<h3 id="event-sourcing-the-unexpected-choice">Event Sourcing: The Unexpected Choice</h3>
<p>OpenHands doesn&rsquo;t use a traditional database. It&rsquo;s <strong>event-sourced</strong>.</p>
<p>Most apps store the current state: if an agent edits a file, you overwrite the record. OpenHands records every action as an immutable event. Want to know the current state? Replay all the events.</p>
<p>The <strong>EventStream</strong> is the central nervous system. It handles three types of data:</p>
<ul>
<li><strong>Actions</strong>: Commands from the agent—<code>CmdRunAction</code>, <code>FileWriteAction</code>, <code>AgentDelegateAction</code></li>
<li><strong>Observations</strong>: Results from the environment—stdout/stderr, file contents, web pages</li>
<li><strong>Trajectories</strong>: The full sequence of actions and observations, serialized to disk (JSON or Pickle)</li>
</ul>
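<p>Here is a rough sketch of what &ldquo;replay all the events&rdquo; means in practice. The <code>Event</code> shape, the payload fields (<code>path</code>, <code>content</code>), and the helper names are assumptions for illustration; only the action name <code>FileWriteAction</code> comes from the list above, and the real OpenHands classes carry more detail.</p>

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Event:
    """Illustrative event record; not the OpenHands class."""
    id: int
    kind: str      # "action" or "observation"
    name: str      # e.g. "FileWriteAction"
    payload: dict


def replay(events):
    """Rebuild current file contents by folding over the whole history."""
    files = {}
    for event in events:
        if event.name == "FileWriteAction":
            files[event.payload["path"]] = event.payload["content"]
    return files


def save_trajectory(events):
    """Serialize the trajectory to a JSON string."""
    return json.dumps([asdict(event) for event in events])


def load_trajectory(text):
    """Restore a trajectory from its JSON form."""
    return [Event(**row) for row in json.loads(text)]
```

<p>Nothing is ever overwritten: replaying a prefix of the list gives you the state at that earlier moment, which is what makes after-the-fact debugging possible.</p>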
<div class="event-stream-container">
  <style>
    .event-stream-container {
      background: white;
      border-radius: 16px;
      padding: 32px 24px;
      box-shadow: 0 12px 30px rgba(0,0,0,0.06);
      margin: 32px auto;
      font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
      max-width: 800px;
    }

    .event-stream-container .title {
      text-align: center;
      font-size: 24px;
      font-weight: 700;
      color: #1a202c;
      margin-bottom: 8px;
    }

    .event-stream-container .subtitle {
      text-align: center;
      font-size: 14px;
      color: #718096;
      margin-bottom: 40px;
      line-height: 1.5;
    }

    .event-stream-container .viz-layout {
      display: flex;
      flex-direction: column;
      align-items: center;
      gap: 0;
    }

     
    .event-stream-container .node-wrapper {
      display: flex;
      flex-direction: column;
      align-items: center;
      gap: 12px;
      width: 100%;
    }

    .event-stream-container .node {
      background: #f7fafc;
      border: 2px solid #e2e8f0;
      border-radius: 12px;
      padding: 20px;
      position: relative;
      transition: all 0.3s ease;
      width: 100%;
      max-width: 400px;
      display: flex;
      align-items: center;
      gap: 20px;
    }

    .event-stream-container .node:hover {
      transform: translateY(-2px);
      box-shadow: 0 8px 20px rgba(0,0,0,0.08);
    }

    .event-stream-container .agent-node {
      border-left: 4px solid #3b82f6;
    }

    .event-stream-container .stream-node {
      border: 2px solid #8b5cf6;
      background: #faf5ff;
      max-width: 600px;
      padding: 24px;
      flex-direction: column;
      align-items: stretch;
    }

    .event-stream-container .runtime-node {
      border-left: 4px solid #10b981;
    }

    .event-stream-container .node-icon {
      flex-shrink: 0;
    }

    .event-stream-container .node-icon svg {
      filter: drop-shadow(0 2px 4px rgba(0,0,0,0.1));
    }

    .event-stream-container .node-content {
      flex: 1;
      display: flex;
      flex-direction: column;
      gap: 6px;
    }

    .event-stream-container .node-header {
      display: flex;
      align-items: center;
      gap: 12px;
    }

    .event-stream-container .node-label {
      font-size: 0.95rem;
      font-weight: 700;
      text-transform: uppercase;
      letter-spacing: 0.08em;
      color: #2d3748;
    }

    .event-stream-container .node-sublabel {
      font-family: 'Courier New', monospace;
      font-size: 0.8rem;
      font-weight: 600;
      color: #4a5568;
      background: white;
      padding: 4px 10px;
      border-radius: 4px;
      border: 1px solid #e2e8f0;
      display: inline-block;
    }

    .event-stream-container .node-desc {
      font-size: 0.75rem;
      color: #718096;
      margin-top: 2px;
    }

     
    .event-stream-container .status-indicator {
      display: inline-flex;
      align-items: center;
      gap: 6px;
      padding: 4px 12px;
      background: white;
      border: 1px solid #e2e8f0;
      border-radius: 12px;
      font-family: 'Courier New', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      letter-spacing: 0.05em;
      box-shadow: 0 2px 4px rgba(0,0,0,0.04);
    }

    .event-stream-container .agent-status { color: #3b82f6; }
    .event-stream-container .runtime-status { color: #10b981; }

    .event-stream-container .status-dot {
      width: 6px;
      height: 6px;
      border-radius: 50%;
      animation: es-pulse 2s ease-in-out infinite;
    }

    .event-stream-container .agent-status .status-dot {
      background: #3b82f6;
      box-shadow: 0 0 8px rgba(59, 130, 246, 0.5);
    }
    .event-stream-container .runtime-status .status-dot {
      background: #10b981;
      box-shadow: 0 0 8px rgba(16, 185, 129, 0.5);
    }

    @keyframes es-pulse {
      0%, 100% { opacity: 1; transform: scale(1); }
      50% { opacity: 0.6; transform: scale(0.85); }
    }

     
    .event-stream-container .stream-header {
      display: flex;
      align-items: center;
      justify-content: center;
      gap: 12px;
      margin-bottom: 16px;
      padding-bottom: 12px;
      border-bottom: 2px solid #e2e8f0;
    }

    .event-stream-container .stream-icon {
      font-size: 1.5rem;
    }

    .event-stream-container .stream-title {
      font-size: 0.95rem;
      font-weight: 700;
      text-transform: uppercase;
      letter-spacing: 0.08em;
      color: #2d3748;
    }

    .event-stream-container .event-log-container {
      background: white;
      border: 1px solid #e2e8f0;
      border-radius: 8px;
      padding: 12px;
      max-height: 200px;
      overflow-y: auto;
      scrollbar-width: thin;
      scrollbar-color: #cbd5e1 transparent;
    }

    .event-stream-container .event-log-container::-webkit-scrollbar {
      width: 4px;
    }

    .event-stream-container .event-log-container::-webkit-scrollbar-track {
      background: transparent;
    }

    .event-stream-container .event-log-container::-webkit-scrollbar-thumb {
      background: #cbd5e1;
      border-radius: 4px;
    }

    .event-stream-container .event-log {
      display: flex;
      flex-direction: column;
      gap: 8px;
    }

    .event-stream-container .event-entry {
      display: grid;
      grid-template-columns: auto auto auto 1fr;
      gap: 8px;
      align-items: center;
      font-family: 'Courier New', monospace;
      font-size: 0.7rem;
      padding: 8px;
      border-radius: 6px;
      background: #f7fafc;
      transition: all 0.2s ease;
    }

    .event-stream-container .event-entry:hover {
      background: #edf2f7;
    }

    .event-stream-container .action-entry {
      border-left: 3px solid #3b82f6;
    }

    .event-stream-container .obs-entry {
      border-left: 3px solid #10b981;
    }

    .event-stream-container .event-time {
      color: #718096;
      font-size: 0.65rem;
      font-variant-numeric: tabular-nums;
    }

    .event-stream-container .event-arrow {
      color: #a0aec0;
      font-weight: 700;
      font-size: 0.8rem;
    }

    .event-stream-container .action-entry .event-arrow { color: #3b82f6; }
    .event-stream-container .obs-entry .event-arrow { color: #10b981; }

    .event-stream-container .event-type {
      color: #4a5568;
      font-weight: 600;
      min-width: 50px;
    }

    .event-stream-container .event-data {
      color: #2d3748;
      font-size: 0.7rem;
      white-space: nowrap;
      overflow: hidden;
      text-overflow: ellipsis;
    }

    .event-stream-container .stream-footer {
      margin-top: 12px;
      padding-top: 12px;
      border-top: 2px solid #e2e8f0;
      display: flex;
      justify-content: space-between;
      align-items: center;
      flex-wrap: wrap;
      gap: 8px;
    }

    .event-stream-container .immutable-badge {
      font-family: 'Courier New', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      letter-spacing: 0.05em;
      color: #7c3aed;
      background: #f5f3ff;
      padding: 4px 10px;
      border-radius: 4px;
      border: 1px solid #ddd6fe;
    }

    .event-stream-container .replay-badge {
      display: flex;
      align-items: center;
      gap: 6px;
      padding: 4px 10px;
      background: #fffbeb;
      border: 1px solid #fde68a;
      border-radius: 12px;
      font-family: 'Courier New', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      color: #d97706;
      letter-spacing: 0.05em;
      box-shadow: 0 2px 4px rgba(0,0,0,0.04);
    }

     
    .event-stream-container .flow-path {
      position: relative;
      display: flex;
      flex-direction: column;
      align-items: center;
      justify-content: center;
      padding: 24px 0;
      width: 100%;
    }

    .event-stream-container .flow-svg {
      width: 120px;
      height: 80px;
    }

    .event-stream-container .flow-label {
      display: flex;
      align-items: center;
      gap: 12px;
      margin-top: 8px;
    }

    .event-stream-container .label-badge {
      font-size: 0.7rem;
      font-weight: 700;
      letter-spacing: 0.1em;
      padding: 6px 12px;
      border-radius: 6px;
      white-space: nowrap;
    }

    .event-stream-container .action-label .label-badge {
      background: #dbeafe;
      color: #1e40af;
      border: 1px solid #93c5fd;
    }

    .event-stream-container .obs-label .label-badge {
      background: #d1fae5;
      color: #065f46;
      border: 1px solid #6ee7b7;
    }

    .event-stream-container .label-code {
      font-family: 'Courier New', monospace;
      font-size: 0.65rem;
      color: #4a5568;
      background: #f7fafc;
      padding: 4px 8px;
      border-radius: 4px;
      border: 1px solid #e2e8f0;
    }

     
    @media (max-width: 640px) {
      .event-stream-container {
        padding: 24px 16px;
      }

      .event-stream-container .title {
        font-size: 20px;
      }

      .event-stream-container .subtitle {
        font-size: 13px;
      }

      .event-stream-container .node {
        flex-direction: column;
        text-align: center;
        gap: 12px;
      }

      .event-stream-container .node-header {
        flex-direction: column;
        gap: 6px;
      }

      .event-stream-container .flow-label {
        flex-direction: column;
        gap: 6px;
      }

      .event-stream-container .stream-footer {
        justify-content: center;
      }
    }
  </style>

  <h3 class="title">OpenHands Event Stream Architecture</h3>
  <p class="subtitle">Immutable event log enabling agent-runtime separation and deterministic replay</p>

  <div class="viz-layout">
    
    <div class="node-wrapper agent-wrapper">
      <div class="node agent-node">
        <div class="node-icon">
          <svg width="36" height="36" viewBox="0 0 40 40">
            <defs>
              <linearGradient id="es-gradient-blue" x1="0%" y1="0%" x2="100%" y2="100%">
                <stop offset="0%" style="stop-color:#3b82f6;stop-opacity:1" />
                <stop offset="100%" style="stop-color:#60a5fa;stop-opacity:1" />
              </linearGradient>
            </defs>
            <circle cx="20" cy="20" r="16" fill="none" stroke="url(#es-gradient-blue)" stroke-width="2"/>
            <circle cx="20" cy="20" r="10" fill="url(#es-gradient-blue)" opacity="0.2"/>
            <path d="M 20 10 L 20 30 M 10 20 L 30 20" stroke="#3b82f6" stroke-width="2" stroke-linecap="round"/>
          </svg>
        </div>
        <div class="node-content">
          <div class="node-header">
            <div class="node-label">AGENT</div>
            <div class="status-indicator agent-status">
              <span class="status-dot"></span>
              <span class="status-text">ACTIVE</span>
            </div>
          </div>
          <div class="node-sublabel">CodeActAgent</div>
          <div class="node-desc">Reasoning · Planning · Decision Making</div>
        </div>
      </div>
    </div>

    
    <div class="flow-path action-path">
      <svg class="flow-svg" viewBox="0 0 120 80">
        <defs>
          <filter id="es-glow">
            <feGaussianBlur stdDeviation="2" result="coloredBlur"/>
            <feMerge>
              <feMergeNode in="coloredBlur"/>
              <feMergeNode in="SourceGraphic"/>
            </feMerge>
          </filter>
        </defs>
        <path d="M 60 10 L 60 70" stroke="#93c5fd" stroke-width="2" stroke-dasharray="4,4"/>
        <circle class="flow-particle" r="3" fill="#3b82f6" filter="url(#es-glow)">
          <animateMotion dur="2s" repeatCount="indefinite">
            <mpath href="#es-path-down-1"/>
          </animateMotion>
        </circle>
        <circle class="flow-particle" r="2.5" fill="#60a5fa" filter="url(#es-glow)">
          <animateMotion dur="2s" begin="0.5s" repeatCount="indefinite">
            <mpath href="#es-path-down-1"/>
          </animateMotion>
        </circle>
        <path id="es-path-down-1" d="M 60 10 L 60 65" fill="none"/>
        <polygon points="55,65 60,75 65,65" fill="#3b82f6"/>
      </svg>
      <div class="flow-label action-label">
        <span class="label-badge">↓ ACTIONS</span>
        <code class="label-code">CmdRunAction | FileWriteAction</code>
      </div>
    </div>

    
    <div class="node-wrapper stream-wrapper">
      <div class="node stream-node">
        <div class="stream-header">
          <div class="stream-icon">📡</div>
          <div class="stream-title">EVENT STREAM</div>
        </div>
        <div class="event-log-container">
          <div class="event-log">
            <div class="event-entry action-entry">
              <span class="event-time">10:23:01.442</span>
              <span class="event-arrow">→</span>
              <span class="event-type">Action</span>
              <code class="event-data">CmdRun("pytest tests/")</code>
            </div>
            <div class="event-entry obs-entry">
              <span class="event-time">10:23:02.891</span>
              <span class="event-arrow">←</span>
              <span class="event-type">Observe</span>
              <code class="event-data">CmdOutput(exit=1, stderr="...")</code>
            </div>
            <div class="event-entry action-entry">
              <span class="event-time">10:23:03.124</span>
              <span class="event-arrow">→</span>
              <span class="event-type">Action</span>
              <code class="event-data">FileEdit("src/fix.py", ...)</code>
            </div>
            <div class="event-entry obs-entry">
              <span class="event-time">10:23:03.998</span>
              <span class="event-arrow">←</span>
              <span class="event-type">Observe</span>
              <code class="event-data">FileEditObservation(success=True)</code>
            </div>
          </div>
        </div>
        <div class="stream-footer">
          <span class="immutable-badge">⚡ IMMUTABLE TRAJECTORY</span>
          <div class="replay-badge">
            <svg width="12" height="12" viewBox="0 0 16 16">
              <path d="M8 2 L8 6 L11 6 M8 2 A6 6 0 1 1 2 8" stroke="#d97706" stroke-width="2" fill="none" stroke-linecap="round"/>
            </svg>
            <span>Deterministic Replay</span>
          </div>
        </div>
      </div>
    </div>

    
    <div class="flow-path obs-path">
      <svg class="flow-svg" viewBox="0 0 120 80">
        <path d="M 60 10 L 60 70" stroke="#6ee7b7" stroke-width="2" stroke-dasharray="4,4"/>
        <circle class="flow-particle" r="3" fill="#10b981" filter="url(#es-glow)">
          <animateMotion dur="2.5s" repeatCount="indefinite">
            <mpath href="#es-path-down-2"/>
          </animateMotion>
        </circle>
        <circle class="flow-particle" r="2.5" fill="#34d399" filter="url(#es-glow)">
          <animateMotion dur="2.5s" begin="0.7s" repeatCount="indefinite">
            <mpath href="#es-path-down-2"/>
          </animateMotion>
        </circle>
        <path id="es-path-down-2" d="M 60 10 L 60 65" fill="none"/>
        <polygon points="55,65 60,75 65,65" fill="#10b981"/>
      </svg>
      <div class="flow-label obs-label">
        <span class="label-badge">↓ OBSERVATIONS</span>
        <code class="label-code">CmdOutputObservation | FileReadObservation</code>
      </div>
    </div>

    
    <div class="node-wrapper runtime-wrapper">
      <div class="node runtime-node">
        <div class="node-icon">
          <svg width="36" height="36" viewBox="0 0 40 40">
            <defs>
              <linearGradient id="es-gradient-green" x1="0%" y1="0%" x2="100%" y2="100%">
                <stop offset="0%" style="stop-color:#10b981;stop-opacity:1" />
                <stop offset="100%" style="stop-color:#34d399;stop-opacity:1" />
              </linearGradient>
            </defs>
            <circle cx="20" cy="20" r="16" fill="none" stroke="url(#es-gradient-green)" stroke-width="2"/>
            <circle cx="20" cy="20" r="10" fill="url(#es-gradient-green)" opacity="0.2"/>
            <rect x="14" y="14" width="12" height="12" fill="none" stroke="#10b981" stroke-width="2" rx="2"/>
          </svg>
        </div>
        <div class="node-content">
          <div class="node-header">
            <div class="node-label">RUNTIME</div>
            <div class="status-indicator runtime-status">
              <span class="status-dot"></span>
              <span class="status-text">RUNNING</span>
            </div>
          </div>
          <div class="node-sublabel">DockerRuntime</div>
          <div class="node-desc">Execution · Sandbox · Response Generation</div>
        </div>
      </div>
    </div>
  </div>
</div>

<p>Why this matters: <strong>Deterministic Replay</strong>. LLMs are non-deterministic nightmares to debug. When an agent fails, you can replay the exact event sequence and see where it went wrong. No guessing, no &ldquo;works on my machine.&rdquo;</p>
<p>The codebase enforces this with a hard split: <code>agenthub</code> (the logic) and <code>runtime</code> (the execution) only talk through serialized events. No shortcuts, no shared state.</p>
<p>The EventStream assigns IDs to events and manages &ldquo;subscriptions.&rdquo; The frontend subscribes to get chat updates. The agent reads from it to know what happened last.</p>
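<p>A toy version of that ID-plus-subscription behaviour might look like the following. <code>MiniEventStream</code> and its method names are hypothetical and say nothing about the real class&rsquo;s signatures; the sketch only shows the idea of stamping each event with an ID and fanning it out to subscribers.</p>

```python
class MiniEventStream:
    """Illustrative append-only stream with IDs and subscribers."""

    def __init__(self):
        self._events = []
        self._subscribers = {}

    def subscribe(self, name, callback):
        """Register a callback that receives every new event."""
        self._subscribers[name] = callback

    def add_event(self, event):
        """Stamp the event with the next ID, store it, and notify subscribers."""
        stamped = dict(event, id=len(self._events))
        self._events.append(stamped)
        for callback in list(self._subscribers.values()):
            callback(stamped)
        return stamped["id"]

    def latest(self):
        """Return the most recent event, or None if the stream is empty."""
        return self._events[-1] if self._events else None
```

<p>In this picture the frontend would be one subscriber (rendering chat updates) and the agent would call something like <code>latest()</code> to see what just happened.</p>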
<p>There&rsquo;s been talk in the community about moving to synchronous <code>ToolCall</code>/<code>ToolResult</code> patterns to simplify the Python SDK. But the core idea stays the same: the source of truth is the event history, not some current state snapshot.</p>
<h3 id="the-codebase-structure">The Codebase Structure</h3>
<p>OpenHands is organized as a modular monorepo:</p>
<ul>
<li><strong><code>openhands/agenthub/</code></strong>: The brains. Different agent implementations (<code>CodeActAgent</code>, <code>BrowsingAgent</code>, etc.). Plug-and-play interface: take State, return Action.</li>
<li><strong><code>openhands/runtime/</code></strong>: The body. Spins up Docker containers, manages files, executes commands. Abstract <code>Runtime</code> base class with concrete implementations like <code>DockerRuntime</code> and <code>E2BRuntime</code>.</li>
<li><strong><code>openhands/server/</code></strong>: FastAPI backend. Handles WebSocket connections, orchestrates the <code>AgentController</code>, routes events.</li>
<li><strong><code>openhands/frontend/</code></strong>: React UI. Visualizes the Event Stream—chat interface, terminal emulator (xterm.js), Monaco editor.</li>
<li><strong><code>containers/</code></strong>: Dockerfiles for the sandbox environments. Version-controlled with the code.</li>
</ul>
<h3 id="the-main-loop">The Main Loop</h3>
<p>The <strong>AgentController</strong> runs a simple control loop:</p>
<ol>
<li>Gather recent history from Event Stream</li>
<li>Send to LLM (GPT-4, Claude, whatever)</li>
<li>Parse LLM response into an Action (<code>CmdRunAction</code>, etc.)</li>
<li>Dispatch to Runtime</li>
<li>Get back an Observation (stdout, exit code, etc.)</li>
<li>Add Observation to Event Stream</li>
<li>Go to step 1</li>
</ol>
<p>The loop runs until the agent says it&rsquo;s done (<code>AgentFinishAction</code>) or you kill it. The upcoming Python SDK will let you step through this loop manually, which should make debugging way easier.</p>
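<p>Stripped of all detail, the controller loop can be sketched like this. It is a simplification rather than the real <code>AgentController</code>: <code>llm</code> and <code>runtime</code> are hypothetical callables, events are plain dictionaries, and a <code>max_steps</code> cap is added as a safety net.</p>

```python
def run_agent(llm, runtime, task, max_steps=20):
    """Drive the think-act-observe loop until the agent finishes.

    `llm` maps the history to the next action; `runtime` maps an action to
    an observation. Both are illustrative stand-ins.
    """
    history = [{"type": "message", "content": task}]
    for _ in range(max_steps):
        action = llm(history)           # steps 1-3: history in, parsed action out
        history.append(action)
        if action["type"] == "finish":  # plays the role of AgentFinishAction
            break
        observation = runtime(action)   # steps 4-5: execute and collect output
        history.append(observation)     # step 6: record it for the next turn
    return history
```

<p>Because the loop only touches the history, swapping the model or the runtime means swapping one callable, which is the decoupling the architecture is after.</p>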
<h2 id="the-runtime-sandboxing-the-chaos">The Runtime: Sandboxing the Chaos</h2>
<p>Letting an LLM run <code>rm -rf /</code> on your laptop is a bad idea. OpenHands solves this with Docker, but not in the obvious way.</p>
<h3 id="how-the-sandbox-actually-works">How the Sandbox Actually Works</h3>
<p>You can&rsquo;t just run <code>docker exec</code> for every command. That creates a fresh shell each time, and state gets lost. If the agent runs <code>export API_KEY=xyz</code>, that needs to persist when it runs <code>python script.py</code> later.</p>
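<p>You can reproduce the problem without Docker at all. The sketch below uses plain <code>sh</code> subprocesses as a stand-in for repeated <code>docker exec</code> calls (that substitution is an assumption for the demo, and <code>DEMO_API_KEY</code> is a made-up variable name): each call gets a brand-new shell, so an exported variable is gone by the next call.</p>

```python
import subprocess


def run_in_fresh_shell(command):
    """Run `command` in a brand-new shell, the way one-off exec calls would."""
    result = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
    return result.stdout


# The export happens in one shell...
run_in_fresh_shell("export DEMO_API_KEY=xyz")
# ...and is invisible to the next one, which starts from scratch.
lost = run_in_fresh_shell('echo "key=$DEMO_API_KEY"')

# Only when both commands share a shell does the state carry over.
kept = run_in_fresh_shell('export DEMO_API_KEY=xyz; echo "key=$DEMO_API_KEY"')
```

<p>Here <code>lost</code> comes back with an empty value while <code>kept</code> contains <code>xyz</code>, assuming <code>DEMO_API_KEY</code> isn&rsquo;t already set in your environment.</p>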
<p>OpenHands uses a <strong>client-server model across the Docker boundary</strong>:</p>
<ul>
<li><strong>Host side</strong>: <code>RuntimeClient</code> running on your machine</li>
<li><strong>Container side</strong>: <code>ActionExecutor</code>, a Python HTTP server injected into the container</li>
</ul>
<p>When the agent wants to run <code>ls -la</code>:</p>
<ol>
<li>Agent generates <code>CmdRunAction(&quot;ls -la&quot;)</code></li>
<li><code>RuntimeClient</code> serializes it and POSTs to <code>ActionExecutor</code> inside the container</li>
<li><code>ActionExecutor</code> runs it in a <strong>persistent shell session</strong> (PTY), captures stdout/stderr/exit code</li>
<li>Response goes back to <code>RuntimeClient</code></li>
<li>Backend wraps it in <code>CmdOutputObservation</code> and pushes to the event stream</li>
</ol>
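<p>The steps above can be sketched in-process. This is a simplified illustration, not the actual OpenHands code: the HTTP hop is replaced by a direct method call, a plain <code>sh</code> pipe stands in for the PTY, and the JSON field names, <code>ShellExecutor</code>, <code>send_action</code>, and the sentinel string are all hypothetical.</p>

```python
import json
import subprocess

MARKER = "__ACTION_DONE__"  # hypothetical sentinel marking the end of output


class ShellExecutor:
    """Container-side stand-in: one long-lived shell, so state survives."""

    def __init__(self):
        self._shell = subprocess.Popen(
            ["sh"], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True, bufsize=1,
        )

    def handle(self, request_body):
        """Deserialize an action, run it, and return a serialized observation."""
        action = json.loads(request_body)
        # Run the command, then echo the sentinel together with its exit code.
        # Limitation: output that lacks a trailing newline would hide the sentinel.
        self._shell.stdin.write(action["command"] + "\n")
        self._shell.stdin.write("echo " + MARKER + " $?\n")
        self._shell.stdin.flush()
        output = []
        while True:
            line = self._shell.stdout.readline()
            if not line:
                raise RuntimeError("shell exited unexpectedly")
            if line.startswith(MARKER):
                exit_code = int(line.split()[1])
                break
            output.append(line)
        return json.dumps({
            "observation": "CmdOutputObservation",
            "content": "".join(output),
            "exit_code": exit_code,
        })

    def close(self):
        """Shut the shell down cleanly."""
        self._shell.stdin.close()
        self._shell.wait()


def send_action(executor, command):
    """Host-side stand-in for the POST: serialize, send, deserialize."""
    body = json.dumps({"action": "CmdRunAction", "command": command})
    return json.loads(executor.handle(body))
```

<p>Because every request lands in the same shell process, an <code>export</code> sent in one action is still visible to the next one, which is exactly what one-off <code>docker exec</code> calls can&rsquo;t give you.</p>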
<div class="sandbox-arch-container">
  <style>
    .sandbox-arch-container {
      background: white;
      border-radius: 16px;
      padding: 32px 24px;
      box-shadow: 0 12px 30px rgba(0,0,0,0.06);
      margin: 32px auto;
      font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
      max-width: 800px;
    }

    .sandbox-arch-container .title {
      text-align: center;
      font-size: 24px;
      font-weight: 700;
      color: #1a202c;
      margin-bottom: 8px;
    }

    .sandbox-arch-container .subtitle {
      text-align: center;
      font-size: 14px;
      color: #718096;
      margin-bottom: 32px;
      line-height: 1.5;
    }

    .sandbox-arch-container .viz-wrapper {
      position: relative;
      display: flex;
      flex-direction: column;
      align-items: center;
      gap: 24px;
    }

     
    .sandbox-arch-container .host-env {
      width: 100%;
      max-width: 650px;
      background: linear-gradient(135deg, #f7fafc 0%, #edf2f7 100%);
      border: 2px solid #cbd5e1;
      border-radius: 16px;
      padding: 24px;
      position: relative;
      overflow: hidden;
    }

    .sandbox-arch-container .host-env::before {
      content: '';
      position: absolute;
      top: 0;
      left: 0;
      right: 0;
      height: 40px;
      background: linear-gradient(90deg, #4299e1 0%, #3182ce 100%);
      border-radius: 14px 14px 0 0;
    }

    .sandbox-arch-container .host-label {
      position: absolute;
      top: 10px;
      left: 24px;
      color: white;
      font-size: 0.85rem;
      font-weight: 700;
      letter-spacing: 0.08em;
      text-transform: uppercase;
      z-index: 1;
    }

    .sandbox-arch-container .host-badge {
      position: absolute;
      top: 10px;
      right: 24px;
      background: rgba(255, 255, 255, 0.2);
      backdrop-filter: blur(10px);
      color: white;
      padding: 4px 10px;
      border-radius: 12px;
      font-size: 0.65rem;
      font-weight: 600;
      border: 1px solid rgba(255, 255, 255, 0.3);
      z-index: 1;
    }

    .sandbox-arch-container .host-content {
      margin-top: 32px;
      display: flex;
      flex-direction: column;
      gap: 16px;
    }

     
    .sandbox-arch-container .docker-container {
      background: white;
      border: 3px solid #805ad5;
      border-radius: 12px;
      padding: 20px;
      position: relative;
      box-shadow: 0 8px 24px rgba(128, 90, 213, 0.15);
    }

    .sandbox-arch-container .docker-container::before {
      content: '🔒';
      position: absolute;
      top: -16px;
      right: 20px;
      font-size: 1.8rem;
      background: white;
      padding: 4px 8px;
      border-radius: 8px;
      box-shadow: 0 4px 12px rgba(128, 90, 213, 0.2);
    }

    .sandbox-arch-container .docker-header {
      display: flex;
      align-items: center;
      justify-content: space-between;
      margin-bottom: 16px;
      padding-bottom: 12px;
      border-bottom: 2px solid #e2e8f0;
    }

    .sandbox-arch-container .docker-title {
      display: flex;
      align-items: center;
      gap: 10px;
    }

    .sandbox-arch-container .docker-icon {
      width: 32px;
      height: 32px;
      background: linear-gradient(135deg, #805ad5 0%, #6b46c1 100%);
      border-radius: 8px;
      display: flex;
      align-items: center;
      justify-content: center;
      color: white;
      font-weight: 700;
      font-size: 0.9rem;
    }

    .sandbox-arch-container .docker-name {
      font-size: 0.9rem;
      font-weight: 700;
      color: #2d3748;
      text-transform: uppercase;
      letter-spacing: 0.05em;
    }

    .sandbox-arch-container .docker-status {
      display: flex;
      align-items: center;
      gap: 6px;
      padding: 4px 10px;
      background: #f0fdf4;
      border: 1px solid #86efac;
      border-radius: 12px;
      font-size: 0.65rem;
      font-weight: 600;
      color: #16a34a;
    }

    .sandbox-arch-container .status-pulse {
      width: 6px;
      height: 6px;
      background: #22c55e;
      border-radius: 50%;
      animation: sb-pulse 2s ease-in-out infinite;
    }

    @keyframes sb-pulse {
      0%, 100% { opacity: 1; box-shadow: 0 0 8px #22c55e; }
      50% { opacity: 0.6; box-shadow: 0 0 4px #22c55e; }
    }

     
    .sandbox-arch-container .isolation-layers {
      display: flex;
      flex-direction: column;
      gap: 12px;
    }

    .sandbox-arch-container .isolation-layer {
      background: linear-gradient(135deg, #faf5ff 0%, #f3e8ff 100%);
      border: 2px solid #d8b4fe;
      border-left: 4px solid #a855f7;
      border-radius: 8px;
      padding: 12px 16px;
      display: flex;
      align-items: center;
      gap: 12px;
      transition: all 0.3s ease;
      position: relative;
      overflow: hidden;
    }

    .sandbox-arch-container .isolation-layer::before {
      content: '';
      position: absolute;
      left: 0;
      top: 0;
      bottom: 0;
      width: 4px;
      background: linear-gradient(180deg, #a855f7 0%, #7c3aed 100%);
    }

    .sandbox-arch-container .isolation-layer:hover {
      transform: translateX(4px);
      box-shadow: 0 4px 12px rgba(168, 85, 247, 0.15);
    }

    .sandbox-arch-container .layer-icon {
      width: 36px;
      height: 36px;
      background: white;
      border-radius: 8px;
      display: flex;
      align-items: center;
      justify-content: center;
      font-size: 1.2rem;
      flex-shrink: 0;
      box-shadow: 0 2px 8px rgba(0, 0, 0, 0.08);
    }

    .sandbox-arch-container .layer-content {
      flex: 1;
      display: flex;
      flex-direction: column;
      gap: 4px;
    }

    .sandbox-arch-container .layer-name {
      font-size: 0.85rem;
      font-weight: 700;
      color: #2d3748;
    }

    .sandbox-arch-container .layer-desc {
      font-size: 0.7rem;
      color: #4a5568;
      line-height: 1.4;
    }

    .sandbox-arch-container .layer-badge {
      font-family: 'Courier New', monospace;
      font-size: 0.65rem;
      font-weight: 600;
      color: #7c3aed;
      background: white;
      padding: 3px 8px;
      border-radius: 4px;
      border: 1px solid #e9d5ff;
      white-space: nowrap;
    }

     
    .sandbox-arch-container .exec-zone {
      background: #0f172a;
      border-radius: 8px;
      padding: 16px;
      margin-top: 8px;
      position: relative;
      overflow: hidden;
    }

    .sandbox-arch-container .exec-zone::before {
      content: '';
      position: absolute;
      top: 0;
      left: 0;
      right: 0;
      bottom: 0;
      background: repeating-linear-gradient(
        0deg,
        rgba(59, 130, 246, 0.03) 0px,
        transparent 1px,
        transparent 2px,
        rgba(59, 130, 246, 0.03) 3px
      );
      pointer-events: none;
    }

    .sandbox-arch-container .exec-header {
      display: flex;
      align-items: center;
      gap: 8px;
      margin-bottom: 12px;
      padding-bottom: 8px;
      border-bottom: 1px solid rgba(148, 163, 184, 0.2);
    }

    .sandbox-arch-container .terminal-dots {
      display: flex;
      gap: 6px;
    }

    .sandbox-arch-container .term-dot {
      width: 10px;
      height: 10px;
      border-radius: 50%;
    }

    .sandbox-arch-container .term-dot.red { background: #ef4444; }
    .sandbox-arch-container .term-dot.yellow { background: #f59e0b; }
    .sandbox-arch-container .term-dot.green { background: #10b981; }

    .sandbox-arch-container .exec-title {
      font-family: 'Courier New', monospace;
      font-size: 0.7rem;
      color: #94a3b8;
      flex: 1;
    }

    .sandbox-arch-container .code-block {
      font-family: 'Courier New', monospace;
      font-size: 0.7rem;
      line-height: 1.6;
    }

    .sandbox-arch-container .code-line {
      display: flex;
      gap: 12px;
      padding: 2px 0;
    }

    .sandbox-arch-container .line-num {
      color: #475569;
      user-select: none;
      min-width: 20px;
      text-align: right;
    }

    .sandbox-arch-container .code-text {
      color: #e2e8f0;
    }

    .sandbox-arch-container .code-comment { color: #64748b; }
    .sandbox-arch-container .code-keyword { color: #818cf8; }
    .sandbox-arch-container .code-string { color: #34d399; }
    .sandbox-arch-container .code-function { color: #fbbf24; }

    .sandbox-arch-container .exec-indicator {
      display: flex;
      align-items: center;
      gap: 8px;
      margin-top: 12px;
      padding: 8px;
      background: rgba(59, 130, 246, 0.1);
      border-left: 3px solid #3b82f6;
      border-radius: 4px;
    }

    .sandbox-arch-container .exec-spinner {
      width: 12px;
      height: 12px;
      border: 2px solid rgba(59, 130, 246, 0.3);
      border-top-color: #3b82f6;
      border-radius: 50%;
      animation: sb-spin 1s linear infinite;
    }

    @keyframes sb-spin {
      to { transform: rotate(360deg); }
    }

    .sandbox-arch-container .exec-text {
      font-family: 'Courier New', monospace;
      font-size: 0.65rem;
      color: #60a5fa;
    }

     
    .sandbox-arch-container .comm-channels {
      display: grid;
      grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
      gap: 12px;
      margin-top: 12px;
    }

    .sandbox-arch-container .channel {
      background: #f7fafc;
      border: 1px solid #e2e8f0;
      border-radius: 8px;
      padding: 12px;
      display: flex;
      align-items: center;
      gap: 10px;
      transition: all 0.2s ease;
    }

    .sandbox-arch-container .channel:hover {
      background: #edf2f7;
      transform: translateY(-2px);
    }

    .sandbox-arch-container .channel-icon {
      font-size: 1.3rem;
    }

    .sandbox-arch-container .channel-info {
      flex: 1;
    }

    .sandbox-arch-container .channel-name {
      font-size: 0.75rem;
      font-weight: 700;
      color: #2d3748;
      margin-bottom: 2px;
    }

    .sandbox-arch-container .channel-type {
      font-family: 'Courier New', monospace;
      font-size: 0.65rem;
      color: #718096;
    }

    .sandbox-arch-container .channel-arrow {
      font-size: 1rem;
      color: #cbd5e1;
    }

     
    .sandbox-arch-container .security-shield {
      margin-top: 16px;
      background: linear-gradient(135deg, #ecfdf5 0%, #d1fae5 100%);
      border: 2px solid #6ee7b7;
      border-radius: 8px;
      padding: 12px 16px;
      display: flex;
      align-items: center;
      gap: 12px;
    }

    .sandbox-arch-container .shield-icon {
      font-size: 1.8rem;
    }

    .sandbox-arch-container .shield-content {
      flex: 1;
    }

    .sandbox-arch-container .shield-title {
      font-size: 0.85rem;
      font-weight: 700;
      color: #065f46;
      margin-bottom: 4px;
    }

    .sandbox-arch-container .shield-items {
      display: flex;
      flex-wrap: wrap;
      gap: 6px;
    }

    .sandbox-arch-container .shield-item {
      font-size: 0.65rem;
      font-weight: 600;
      color: #059669;
      background: white;
      padding: 3px 8px;
      border-radius: 4px;
      border: 1px solid #86efac;
    }

     
    @media (max-width: 640px) {
      .sandbox-arch-container {
        padding: 24px 16px;
      }

      .sandbox-arch-container .title {
        font-size: 20px;
      }

      .sandbox-arch-container .subtitle {
        font-size: 13px;
      }

      .sandbox-arch-container .isolation-layer {
        flex-direction: column;
        text-align: center;
      }

      .sandbox-arch-container .comm-channels {
        grid-template-columns: 1fr;
      }

      .sandbox-arch-container .shield-items {
        justify-content: center;
      }
    }
  </style>

  <h3 class="title">OpenHands Sandbox Architecture</h3>
  <p class="subtitle">Multi-layered Docker isolation ensuring secure code execution with strict boundaries</p>

  <div class="viz-wrapper">
    <div class="host-env">
      <div class="host-label">🖥️ Host Environment</div>
      <div class="host-badge">Ubuntu 22.04 LTS</div>

      <div class="host-content">
        
        <div class="docker-container">
          <div class="docker-header">
            <div class="docker-title">
              <div class="docker-icon">🐳</div>
              <div class="docker-name">Docker Sandbox</div>
            </div>
            <div class="docker-status">
              <span class="status-pulse"></span>
              <span>ISOLATED</span>
            </div>
          </div>

          
          <div class="isolation-layers">
            <div class="isolation-layer">
              <div class="layer-icon">📁</div>
              <div class="layer-content">
                <div class="layer-name">Filesystem Isolation</div>
                <div class="layer-desc">Separate root filesystem with controlled mount points</div>
              </div>
              <div class="layer-badge">/workspace</div>
            </div>

            <div class="isolation-layer">
              <div class="layer-icon">🌐</div>
              <div class="layer-content">
                <div class="layer-name">Network Isolation</div>
                <div class="layer-desc">Virtual network with restricted external access</div>
              </div>
              <div class="layer-badge">bridge0</div>
            </div>

            <div class="isolation-layer">
              <div class="layer-icon">⚙️</div>
              <div class="layer-content">
                <div class="layer-name">Process Isolation</div>
                <div class="layer-desc">Dedicated PID namespace, resource limits enforced</div>
              </div>
              <div class="layer-badge">cgroups</div>
            </div>

            <div class="isolation-layer">
              <div class="layer-icon">👤</div>
              <div class="layer-content">
                <div class="layer-name">User Isolation</div>
                <div class="layer-desc">Non-privileged user with restricted capabilities</div>
              </div>
              <div class="layer-badge">uid:1000</div>
            </div>
          </div>

          
          <div class="exec-zone">
            <div class="exec-header">
              <div class="terminal-dots">
                <div class="term-dot red"></div>
                <div class="term-dot yellow"></div>
                <div class="term-dot green"></div>
              </div>
              <div class="exec-title">agent@sandbox:/workspace $</div>
            </div>

            <div class="code-block">
              <div class="code-line">
                <span class="line-num">1</span>
                <span class="code-text"><span class="code-comment"># Agent executing commands in sandbox</span></span>
              </div>
              <div class="code-line">
                <span class="line-num">2</span>
                <span class="code-text"><span class="code-keyword">python</span> <span class="code-string">test_suite.py</span></span>
              </div>
              <div class="code-line">
                <span class="line-num">3</span>
                <span class="code-text"><span class="code-function">git</span> diff src/main.py</span>
              </div>
              <div class="code-line">
                <span class="line-num">4</span>
                <span class="code-text"><span class="code-keyword">npm</span> run build --production</span>
              </div>
            </div>

            <div class="exec-indicator">
              <div class="exec-spinner"></div>
              <div class="exec-text">Executing in isolated environment...</div>
            </div>
          </div>

          
          <div class="comm-channels">
            <div class="channel">
              <div class="channel-icon">📤</div>
              <div class="channel-info">
                <div class="channel-name">Stdin/Stdout</div>
                <div class="channel-type">stdio</div>
              </div>
              <div class="channel-arrow">⇄</div>
            </div>

            <div class="channel">
              <div class="channel-icon">📋</div>
              <div class="channel-info">
                <div class="channel-name">Volume Mount</div>
                <div class="channel-type">/workspace</div>
              </div>
              <div class="channel-arrow">⇄</div>
            </div>

            <div class="channel">
              <div class="channel-icon">🔌</div>
              <div class="channel-info">
                <div class="channel-name">API Socket</div>
                <div class="channel-type">unix:///var/run</div>
              </div>
              <div class="channel-arrow">⇄</div>
            </div>
          </div>

          
          <div class="security-shield">
            <div class="shield-icon">🛡️</div>
            <div class="shield-content">
              <div class="shield-title">Security Guarantees</div>
              <div class="shield-items">
                <span class="shield-item">No Host Access</span>
                <span class="shield-item">Read-only System</span>
                <span class="shield-item">Resource Limited</span>
                <span class="shield-item">No Privilege Escalation</span>
              </div>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</div>

<p>The persistent shell is the key. Environment variables, working directory, shell history—it all persists across commands. The agent gets something that feels like an actual computer, not a stateless command executor.</p>
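<p>Here is a rough sketch of why one long-lived shell changes things. It is an assumption-heavy toy, not how OpenHands does it: it uses plain pipes instead of a PTY, and it finds the end of each command by printing a random marker (the <code>PersistentShell</code> class and the marker trick are both made up for illustration). Commands that read from stdin would confuse it.</p>

```python
import subprocess
import uuid


class PersistentShell:
    """One long-lived shell process, so state survives between commands."""

    def __init__(self):
        self.proc = subprocess.Popen(
            ["/bin/sh"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            bufsize=1,  # line-buffered so each command is sent right away
        )

    def run(self, command):
        """Run one command and return (output, exit_code)."""
        marker = "DONE_" + uuid.uuid4().hex
        self.proc.stdin.write(command + "\n")
        # Print a newline, the unique marker, and the command's exit status.
        self.proc.stdin.write("printf '\\n%s %s\\n' " + marker + ' "$?"\n')
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if line.startswith(marker):
                exit_code = int(line.split()[1])
                # Drop the single newline that printf added before the marker.
                return "".join(lines)[:-1], exit_code
            lines.append(line)
        raise RuntimeError("shell exited before the marker was seen")

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


shell = PersistentShell()
shell.run("export API_KEY=xyz")
print(shell.run("echo $API_KEY"))  # the variable set by the earlier command is still there
shell.close()
```

<p>Run the same two commands through two separate <code>subprocess.run</code> calls and the second one prints an empty line, which is exactly the <code>docker exec</code> problem described earlier.</p>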
<h3 id="the-docker-socket-problem">The Docker Socket Problem</h3>
<p>Sometimes the agent needs to use Docker itself—like building a container for your app. OpenHands handles this by mounting the host&rsquo;s Docker socket (<code>/var/run/docker.sock</code>) into the sandbox.</p>
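<p>This is powerful but dangerous. Anything that can talk to the Docker socket can ask the host&rsquo;s daemon to start privileged containers or mount host directories, so the sandbox effectively gets root access to the host. It&rsquo;s &ldquo;Docker-out-of-Docker&rdquo; (not true Docker-in-Docker), and it comes with trade-offs:</p>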
<ul>
<li><strong>Power</strong>: The agent can do anything Docker can do</li>
<li><strong>Complexity</strong>: Network routing gets weird, especially on macOS/Windows where Docker runs in a VM. <code>httpx.ConnectError</code> issues are common.</li>
<li><strong>Security</strong>: You&rsquo;re basically trusting the container with your host. OpenHands mitigates this by controlling the image, but it&rsquo;s still a calculated risk.</li>
</ul>
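<p>To make the trade-off concrete, the whole difference between a sealed sandbox and one that can drive the host&rsquo;s Docker daemon is a single bind mount. The sketch below only builds the command line and never runs Docker; the image name <code>my-sandbox-image</code> and the helper function are hypothetical.</p>

```python
import shlex

DOCKER_SOCKET = "/var/run/docker.sock"


def docker_run_args(image, mount_socket=False):
    """Build a `docker run` argument list, optionally exposing the host socket."""
    args = ["docker", "run", "--rm"]
    if mount_socket:
        # This one flag is what hands the container control of the host daemon.
        args += ["-v", DOCKER_SOCKET + ":" + DOCKER_SOCKET]
    args.append(image)
    return args


sealed = docker_run_args("my-sandbox-image")
exposed = docker_run_args("my-sandbox-image", mount_socket=True)
print(shlex.join(sealed))
print(shlex.join(exposed))
```

<p>Making the mount an explicit opt-in flag, off by default, keeps the risky configuration visible in code review instead of buried in a default.</p>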
<h3 id="other-runtime-options">Other Runtime Options</h3>
<p>Docker isn&rsquo;t the only choice. OpenHands abstracts the runtime, so you can swap it out:</p>
<ul>
<li><strong>Daytona</strong>: Remote, managed dev environments. Offloads compute to the cloud instead of burning your laptop&rsquo;s battery.</li>
<li><strong>E2B</strong>: Firecracker-based microVMs designed for AI code execution. Each sandbox gets its own guest kernel, so isolation is stronger than a container that shares the host kernel, while startup stays fast.</li>
</ul>
<p>You pick your runtime in <code>config.toml</code>. Same agent code, different execution environment. This is the kind of abstraction that separates production systems from hackathon demos.</p>
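<p>The pattern behind &ldquo;same agent code, different execution environment&rdquo; is an interface plus a lookup. A minimal sketch, with made-up class names and a made-up <code>runtime</code> config key (the real OpenHands classes and config schema may differ):</p>

```python
from abc import ABC, abstractmethod


class Runtime(ABC):
    """The only interface the agent code is allowed to depend on."""

    @abstractmethod
    def run(self, command):
        """Execute a command and return its output as text."""


class FakeDockerRuntime(Runtime):
    def run(self, command):
        return "[docker] " + command


class FakeRemoteRuntime(Runtime):
    def run(self, command):
        return "[remote] " + command


RUNTIMES = {"docker": FakeDockerRuntime, "remote": FakeRemoteRuntime}


def make_runtime(config):
    """Pick a runtime class from a parsed config (key name is hypothetical)."""
    return RUNTIMES[config["runtime"]]()


def agent_step(runtime):
    # Agent logic never checks which concrete runtime it was given.
    return runtime.run("ls -la")


for name in ("docker", "remote"):
    print(agent_step(make_runtime({"runtime": name})))
```

<p>Adding a third backend means one new class and one new dictionary entry; <code>agent_step</code> does not change.</p>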
<h2 id="codeact-code-as-the-interface">CodeAct: Code As the Interface</h2>
<p>Early AI agents used JSON tool calling for everything. Want to edit a file? Emit a JSON blob. Run a command? Another JSON blob. Brittle, verbose, and you had to define custom tools for every possible action.</p>
<h3 id="code-is-the-tool">Code Is the Tool</h3>
<p>CodeAct flips this. Instead of 50 custom tools (<code>list_files</code>, <code>create_file</code>, <code>search_web</code>), just give the agent Python and Bash.</p>
<p>Need to count lines in all Python files? Write code:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> glob
</span></span><span style="display:flex;"><span>files <span style="color:#f92672">=</span> glob<span style="color:#f92672">.</span>glob(<span style="color:#e6db74">&#34;**/*.py&#34;</span>, recursive<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>total_lines <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> f <span style="color:#f92672">in</span> files:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">with</span> open(f) <span style="color:#66d9ef">as</span> file:
</span></span><span style="display:flex;"><span>        total_lines <span style="color:#f92672">+=</span> len(file<span style="color:#f92672">.</span>readlines())
</span></span><span style="display:flex;"><span>print(total_lines)
</span></span></code></pre></div><p>Or use Bash:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>find . -name <span style="color:#e6db74">&#34;*.py&#34;</span> -print0 | xargs -0 cat | wc -l
</span></span></code></pre></div><p>Why this works better:</p>
<ul>
<li><strong>One language for everything</strong>: Logic, control flow, and tool execution all use Python/Bash.</li>
<li><strong>More expressive</strong>: Write loops, conditionals, error handling in a single action. Try to read a file, catch <code>FileNotFoundError</code>, create it—all in one LLM turn. Fewer round-trips = lower cost and latency.</li>
<li><strong>Free library access</strong>: The entire Python ecosystem (pandas, requests, numpy) works out of the box. No wrapper code needed.</li>
</ul>
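<p>The &ldquo;read it, or create it if missing&rdquo; case from the list above fits in one action. A small sketch (the helper name and file name are made up), which with JSON tool calls would typically take a read call, an error observation, and then a write call:</p>

```python
import tempfile
from pathlib import Path


def read_or_create(path, default_text):
    """Return the file's text, creating it with default_text if it is missing."""
    try:
        return path.read_text(encoding="utf-8")
    except FileNotFoundError:
        path.write_text(default_text, encoding="utf-8")
        return default_text


with tempfile.TemporaryDirectory() as workdir:
    notes = Path(workdir) / "notes.txt"
    first = read_or_create(notes, "TODO\n")   # file is missing, so it gets created
    notes.write_text("edited\n", encoding="utf-8")
    second = read_or_create(notes, "TODO\n")  # file exists, so its content wins
    print(repr(first), repr(second))
```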
<h3 id="how-it-works">How It Works</h3>
<p>The <code>CodeActAgent</code> uses a carefully crafted system prompt (<code>system_prompt.j2</code>):</p>
<ul>
<li>&ldquo;You can execute Python code in ```python blocks&rdquo;</li>
<li>&ldquo;You can execute Bash in ```bash blocks&rdquo;</li>
<li>&ldquo;Verify your changes by running tests&rdquo;</li>
</ul>
<p>The backend parses the LLM&rsquo;s markdown response. Code blocks get extracted and sent to the <code>JupyterPlugin</code> (for Python) or <code>BashPlugin</code> (for Bash) inside the container. The <code>JupyterPlugin</code> maintains an interactive IPython kernel, so variables persist across code blocks.</p>
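<p>Both halves of that paragraph, pulling code blocks out of a markdown reply and keeping variables alive between them, can be sketched in a few lines. This is a toy under stated assumptions: the regex, <code>TinyKernel</code>, and <code>fenced</code> are invented here, and a shared <code>exec</code> namespace stands in for a real IPython kernel.</p>

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built at runtime so the example never contains a literal fence
BLOCK_RE = re.compile(FENCE + r"(python|bash)\n(.*?)" + FENCE, re.DOTALL)


def extract_blocks(markdown):
    """Return (language, code) pairs in the order they appear in the reply."""
    return [(m.group(1), m.group(2)) for m in BLOCK_RE.finditer(markdown)]


class TinyKernel:
    """Stand-in for a stateful kernel: every block shares one namespace."""

    def __init__(self):
        self.namespace = {}

    def execute(self, code):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


def fenced(language, code):
    """Wrap code in a markdown fence, the way an LLM reply would."""
    return FENCE + language + "\n" + code + FENCE + "\n"


reply = (
    "First I store a value.\n" + fenced("python", "x = 21\n")
    + "Then I list the files.\n" + fenced("bash", "ls -la\n")
    + "Finally I reuse the value.\n" + fenced("python", "print(x * 2)\n")
)

blocks = extract_blocks(reply)
kernel = TinyKernel()
outputs = [kernel.execute(code) for language, code in blocks if language == "python"]
print(blocks)
print(outputs)
```

<p>The second Python block only works because <code>x</code> survived from the first one. Swap the shared namespace for a fresh dictionary per block and it fails with a <code>NameError</code>.</p>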
<h3 id="multiple-agents-one-task">Multiple Agents, One Task</h3>
<p>One agent gets lost in a 10,000-file repo. Context window fills with noise, and it forgets what it&rsquo;s doing.</p>
<p>OpenHands uses <strong>agent delegation</strong>:</p>
<ul>
<li><strong>Manager Agent</strong>: High-level planner. Breaks &ldquo;Refactor auth module&rdquo; into sub-tasks.</li>
<li><strong>RepoStudyAgent</strong>: Explorer. Maps the codebase without modifying it.</li>
<li><strong>VerifierAgent</strong>: QA specialist. Writes tests, verifies fixes work.</li>
<li><strong>BrowsingAgent</strong>: Reads docs and Stack Overflow via Playwright.</li>
</ul>
<p>The main agent can delegate: &ldquo;I need to know how to use the Stripe API. @BrowsingAgent, find the docs for creating a customer.&rdquo; BrowsingAgent spins up, does the research, returns a summary. Main agent stays focused on the high-level task.</p>
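<p>The point of delegation is what does <em>not</em> come back. In this hypothetical sketch (the classes and the canned steps are invented, and no browsing happens), the sub-agent keeps its own scratch history and hands the main agent a single summary line:</p>

```python
class BrowsingAgent:
    """Hypothetical sub-agent; a real one would drive a browser."""

    name = "BrowsingAgent"

    def run(self, task):
        scratch = ["open search page", "read three results", "skim API docs"]
        # The scratch history stays local; only a short summary is returned.
        return "Summary (%d steps): %s" % (len(scratch), task)


class MainAgent:
    def __init__(self, delegates):
        self.delegates = {agent.name: agent for agent in delegates}
        self.context = []

    def delegate(self, agent_name, task):
        """Hand a sub-task to a named agent and keep only its summary."""
        summary = self.delegates[agent_name].run(task)
        self.context.append(summary)
        return summary


main = MainAgent([BrowsingAgent()])
main.delegate("BrowsingAgent", "find the Stripe docs for creating a customer")
print(main.context)
```

<p>Three steps of browsing cost the main agent one context entry. That ratio is the whole argument for splitting the work.</p>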
<h2 id="tool-integration-mcp">Tool Integration: MCP</h2>
<p>The old problem: N agents × M tools = N×M custom integrations. Want your agent to use Jira, Slack, GitHub, and Linear? Write four separate integrations. For every agent.</p>
<p>OpenHands uses the <strong>Model Context Protocol (MCP)</strong>, an open standard from Anthropic. Think of it as USB-C for AI tools.</p>
<h3 id="how-mcp-works">How MCP Works</h3>
<ul>
<li><strong>MCP Server</strong>: Exposes tools (functions) and resources (data). A GitHub MCP server might expose <code>create_issue</code> and <code>active_pull_requests</code>.</li>
<li><strong>MCP Client (OpenHands)</strong>: Connects via stdio or SSE. Asks: &ldquo;What tools do you have?&rdquo; Gets back JSON schemas. Injects them into the agent&rsquo;s system prompt.</li>
</ul>
<p>OpenHands doesn&rsquo;t know about GitHub or Slack. It just knows MCP. You can write a custom MCP server for your proprietary database, point OpenHands at it, and the agent can use it immediately.</p>
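<p>Discovery is easier to picture with the messages in front of you. MCP speaks JSON-RPC 2.0 and lists tools through a <code>tools/list</code> request; everything else in this sketch is an assumption: the server is a fake in-process function, the <code>initialize</code> handshake a real session starts with is skipped, and the prompt format is invented.</p>

```python
import json


def fake_github_server(raw_request):
    """Hypothetical in-process MCP server that only answers tools/list."""
    request = json.loads(raw_request)
    if request["method"] == "tools/list":
        tool = {
            "name": "create_issue",
            "description": "Open a new issue in a repository.",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "body": {"type": "string"},
                },
                "required": ["title"],
            },
        }
        reply = {"jsonrpc": "2.0", "id": request["id"], "result": {"tools": [tool]}}
    else:
        error = {"code": -32601, "message": "Method not found"}
        reply = {"jsonrpc": "2.0", "id": request["id"], "error": error}
    return json.dumps(reply)


def discover_tools(send):
    """Client side: ask the server what it offers and return the tool list."""
    request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
    response = json.loads(send(json.dumps(request)))
    return response["result"]["tools"]


def render_for_prompt(tools):
    """Turn the discovered schemas into text for the agent's system prompt."""
    lines = ["Available tools:"]
    for tool in tools:
        params = ", ".join(sorted(tool["inputSchema"]["properties"]))
        lines.append("- %s(%s): %s" % (tool["name"], params, tool["description"]))
    return "\n".join(lines)


tools = discover_tools(fake_github_server)
prompt_section = render_for_prompt(tools)
print(prompt_section)
```

<p>Nothing in <code>discover_tools</code> or <code>render_for_prompt</code> mentions GitHub. Point them at a different server function and the prompt changes with no client code touched.</p>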
<h3 id="auth-that-doesnt-suck">Auth That Doesn&rsquo;t Suck</h3>
<p>What if the agent tries to read your private Slack DMs? OpenHands handles this with OAuth via FastMCP.</p>
<p>When the agent tries to use an authenticated tool, MCP pauses execution and shows you an OAuth flow. You log in, consent, and the token gets stored for that session. The agent acts with your permissions, not as some omniscient god.</p>
<h2 id="configuration-from-toml-hell-to-python-objects">Configuration: From TOML Hell to Python Objects</h2>
<p>OpenHands used to require a <code>config.toml</code> file with a million environment variables: <code>SANDBOX_IMAGE</code>, <code>WORKSPACE_MOUNT_PATH</code>, <code>LLM_API_KEY</code>, debug flags, etc. Global state everywhere. Good luck running two agents with different configs.</p>
<p>The new <strong>Python SDK</strong> fixes this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openhands.sdk <span style="color:#f92672">import</span> CodeActAgent, DockerRuntime
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>agent <span style="color:#f92672">=</span> CodeActAgent(
</span></span><span style="display:flex;"><span>    llm_config<span style="color:#f92672">=</span>{<span style="color:#e6db74">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;claude-3-5-sonnet&#34;</span>},
</span></span><span style="display:flex;"><span>    system_prompt<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;You are a senior python engineer.&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>runtime <span style="color:#f92672">=</span> DockerRuntime(image<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;my-custom-image&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>agent<span style="color:#f92672">.</span>run(task<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Fix the bug in main.py&#34;</span>, runtime<span style="color:#f92672">=</span>runtime)
</span></span></code></pre></div><p>Code, not config files. Agents are objects. You can run them in threads, pause them, inspect state, resume. Synchronous by default, which makes debugging way easier.</p>
<h2 id="evaluation-swe-bench-the-reality-check">Evaluation: SWE-bench, the Reality Check</h2>
<p>Demo videos are easy. Proving your agent actually works is hard. OpenHands uses <strong>SWE-bench</strong>—real GitHub issues from Django, scikit-learn, Flask, etc.</p>
<h3 id="how-swe-bench-works">How SWE-bench Works</h3>
<ol>
<li>Start with the codebase <em>before</em> the bug fix</li>
<li>Give the agent the issue description</li>
<li>Let it explore, reproduce the bug, write a patch</li>
<li>Apply the patch, run the test suite</li>
<li>Pass = new test passes + no regressions</li>
</ol>
<p>This is brutal. The agent can&rsquo;t just fix the obvious bug. It has to not break anything else.</p>
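<p>The pass rule is strict enough to write down. SWE-bench labels the two test sets <code>FAIL_TO_PASS</code> (tests the fix must turn green) and <code>PASS_TO_PASS</code> (tests that must stay green). The function and the test names below are illustrative, not the benchmark&rsquo;s actual harness:</p>

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """test_results maps test name to "PASSED" or "FAILED" after the patch."""
    fixed = all(test_results.get(name) == "PASSED" for name in fail_to_pass)
    no_regressions = all(test_results.get(name) == "PASSED" for name in pass_to_pass)
    return fixed and no_regressions


fail_to_pass = ["test_rejects_expired_token"]
pass_to_pass = ["test_login_ok", "test_logout"]

# The patch fixes the reported bug but breaks logout along the way.
sloppy_patch = {
    "test_rejects_expired_token": "PASSED",
    "test_login_ok": "PASSED",
    "test_logout": "FAILED",
}
clean_patch = dict(sloppy_patch, test_logout="PASSED")

print(is_resolved(sloppy_patch, fail_to_pass, pass_to_pass))
print(is_resolved(clean_patch, fail_to_pass, pass_to_pass))
```

<p>A test missing from the results counts as a failure here, since <code>.get</code> returns <code>None</code>. That is a deliberate choice in this sketch: a patch that stops a test from running at all should not get credit.</p>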
<h3 id="the-infrastructure-problem">The Infrastructure Problem</h3>
<p>SWE-bench is expensive to run. Gigabytes of Docker images, thousands of containers. Epoch AI compressed the images from ~680 GB to ~67 GB by deduplicating layers. OpenHands runs evaluations in parallel on cloud infrastructure, turning days into minutes.</p>
<h3 id="the-cost-problem">The Cost Problem</h3>
<p>Running the full SWE-bench suite costs hundreds of dollars in API credits, because the agent reads thousands of lines of code and generates verbose responses for every issue. <strong>SWE-bench Lite</strong> (300 issues) and <strong>SWE-bench Verified</strong> (a 500-issue human-verified subset) exist for people who don&rsquo;t have unlimited budgets.</p>
<h2 id="performance-where-things-stand">Performance: Where Things Stand</h2>
<h3 id="the-numbers">The Numbers</h3>
<p>OpenHands with Claude 3.5 Sonnet hits around <strong>53% on SWE-bench Verified</strong>.</p>
<p>But here&rsquo;s the interesting part: <strong>Inference Time Scaling</strong>. Run the agent 5 times on the same problem, use a critic model or voting to pick the best patch, and you can hit <strong>66%</strong>. The bottleneck isn&rsquo;t intelligence, it&rsquo;s randomness.</p>
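<p>The selection step is the simple part. A sketch of both strategies mentioned above, where the candidate patches and the critic scores are placeholder values and a dictionary lookup stands in for a real critic model:</p>

```python
from collections import Counter


def pick_by_vote(patches):
    """Majority vote: return the patch that was produced most often."""
    return Counter(patches).most_common(1)[0][0]


def pick_by_critic(patches, score):
    """Critic selection: return the patch the scoring function likes best."""
    return max(patches, key=score)


# Five attempts at the same issue (placeholder strings instead of real diffs).
candidates = ["patch A", "patch B", "patch A", "patch C", "patch A"]

# Placeholder critic scores; a real critic would be a trained model.
critic_scores = {"patch A": 0.4, "patch B": 0.9, "patch C": 0.1}

voted = pick_by_vote(candidates)
judged = pick_by_critic(candidates, critic_scores.get)
print(voted, judged)
```

<p>The two strategies can disagree, as they do here. Voting rewards the answer the agent keeps landing on; a critic can rescue a correct patch that only showed up once.</p>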
<h3 id="why-agents-fail">Why Agents Fail</h3>
<p>Even at 53-66%, agents fail a lot. The failure modes are instructive.</p>
<h4 id="infinite-loops">Infinite Loops</h4>
<p>Agent tries a fix. Test fails. Agent tries the <strong>exact same fix</strong> again. Repeat until you run out of tokens.</p>
<p>This happens because of <strong>context truncation</strong>. When the context window fills up, OpenHands truncates old history. If the agent&rsquo;s memory of &ldquo;I already tried this&rdquo; gets truncated, it&rsquo;s stuck in Groundhog Day.</p>
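<p>One way out is to keep loop detection outside the part of the history that gets truncated. A minimal sketch, assuming a controller that records every action in its own list (the function, the window size, and the action strings are invented for illustration):</p>

```python
def is_stuck(actions, window=3):
    """True when the last `window` actions are identical."""
    recent = actions[-window:]
    return len(recent) == window and len(set(recent)) == 1


# The controller's own log, kept in full even when the LLM context is trimmed.
history = []
attempts = ["edit auth.py", "run tests", "edit auth.py", "edit auth.py", "edit auth.py"]
for action in attempts:
    history.append(action)
    if is_stuck(history):
        print("stuck after", len(history), "actions:", action)
        break
```

<p>Because the check runs on the controller&rsquo;s log rather than on whatever the model can still see, it keeps working after the model has forgotten its earlier attempts.</p>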
<h4 id="context-pollution">Context Pollution</h4>
<p>Agent runs <code>find / -name &quot;*.py&quot;</code> and dumps 10,000 lines of output into its context. Or cats a massive log file. Context window fills with noise. LLM starts hallucinating file paths, forgets what it was supposed to do.</p>
<p>Solution: Active context management. Summarize old events, delete large observations, keep the &ldquo;working memory&rdquo; clean.</p>
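<p>The cheapest form of this is clipping oversized observations before they ever reach the context. A sketch with an arbitrary line budget (the function and the marker text are assumptions, not OpenHands behavior):</p>

```python
def clip_observation(text, max_lines=20):
    """Keep the head and tail of a long output and note how much was dropped."""
    lines = text.splitlines()
    if max_lines >= len(lines):
        return text
    head = max_lines // 2
    tail = max_lines - head
    omitted = len(lines) - max_lines
    marker = "[... %d lines omitted ...]" % omitted
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])


huge_output = "\n".join("line%d" % i for i in range(100))
clipped = clip_observation(huge_output, max_lines=10)
print(clipped)
```

<p>Keeping both ends matters: the head usually shows what the command was doing, and the tail usually holds the error. The marker tells the model that output is missing, so it can rerun with a narrower command instead of guessing.</p>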
<h4 id="lazy-coding">Lazy Coding</h4>
<p>Agent writes <code># ... rest of code ...</code> instead of the full file. Saves tokens, breaks the file when written to disk. OpenHands needs linting to catch this before it causes syntax errors.</p>
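<p>A pre-write check for this can be small. The sketch below flags elision comments and then asks Python&rsquo;s own parser whether the file is still valid; the regex and the function name are assumptions, and it only covers Python files:</p>

```python
import ast
import re

# Matches comments such as "# ... rest of code ..." (pattern is a guess at common phrasing).
ELISION_RE = re.compile(r"#\s*\.\.\.\s*(rest|existing|remaining)\b.*\.\.\.", re.IGNORECASE)


def check_generated_file(source):
    """Return a list of problems found in LLM-generated Python source."""
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        if ELISION_RE.search(line):
            problems.append("line %d: elision placeholder" % number)
    try:
        ast.parse(source)
    except SyntaxError as exc:
        problems.append("syntax error: %s" % exc.msg)
    return problems


lazy = "def handler(request):\n    # ... rest of code ...\n"
complete = "def handler(request):\n    return request\n"

print(check_generated_file(lazy))
print(check_generated_file(complete))
```

<p>The two checks catch different things. The placeholder scan finds elisions that still parse, and <code>ast.parse</code> finds files the elision left broken, like the comment-only function body here.</p>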
<h3 id="failure-mode-summary">Failure Mode Summary</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Failure Mode</th>
          <th style="text-align: left">Cause</th>
          <th style="text-align: left">How OpenHands Mitigates</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Infinite Loop</td>
          <td style="text-align: left">Context truncation</td>
          <td style="text-align: left">Trajectory analysis, event summarization</td>
      </tr>
      <tr>
          <td style="text-align: left">Hallucination</td>
          <td style="text-align: left">Context overflow</td>
          <td style="text-align: left">Tool-based code search, event condensation</td>
      </tr>
      <tr>
          <td style="text-align: left">Regression</td>
          <td style="text-align: left">Fixing one bug, breaking others</td>
          <td style="text-align: left">VerifierAgent runs full test suite</td>
      </tr>
      <tr>
          <td style="text-align: left">Timeout</td>
          <td style="text-align: left">Docker/network issues</td>
          <td style="text-align: left">Persistent sessions, cloud runtimes</td>
      </tr>
  </tbody>
</table>
<h2 id="conclusion-the-new-software-engineering-workflow">Conclusion: The New Software Engineering Workflow</h2>
<p>Code assist isn&rsquo;t replacing developers, but it is certainly changing how we work. The architecture behind systems like OpenHands reveals the shift: event sourcing for debuggability, Docker sandboxing for safety, CodeAct for expressiveness, MCP for extensibility. These are building blocks for a new kind of development workflow.</p>
<p>What makes modern tools like OpenHands and Claude Code particularly powerful is the convergence of several capabilities:</p>
<ul>
<li><strong>Extended thinking</strong>: Models that can reason through complex refactoring before touching code</li>
<li><strong>Prompt caching</strong>: Reusing codebase context across sessions without re-indexing</li>
<li><strong>Tool integration</strong>: MCP servers that let agents interact with your actual development environment—Jira, databases, CI/CD pipelines</li>
<li><strong>Computer use</strong>: Agents that can navigate IDEs, run terminal commands, and interact with your full development stack</li>
</ul>
<p>For bug fixes, boilerplate generation, and mechanical refactoring, having an autonomous agent that executes code, verifies its work, and iterates on failures isn&rsquo;t a demo anymore. It&rsquo;s production-ready infrastructure that&rsquo;s reshaping how engineering teams operate. The question isn&rsquo;t whether to adopt these tools, but how to integrate them into your workflow before your competitors do.</p>
<h2 id="references">References</h2>
<ol>
<li>
<p><strong>Wang, X., et al. (2025).</strong> <a href="https://arxiv.org/pdf/2511.03690">The OpenHands Software Agent SDK: A Composable Framework for Building AI Agents</a>. <em>arXiv preprint arXiv:2511.03690</em>.</p>
<ul>
<li>The official technical paper describing the OpenHands architecture, event-sourcing model, and SDK design.</li>
</ul>
</li>
<li>
<p><strong>Wang, X., et al. (2024).</strong> <a href="https://arxiv.org/html/2402.01030v4">Executable Code Actions Elicit Better LLM Agents</a>. <em>arXiv preprint arXiv:2402.01030v4</em>.</p>
<ul>
<li>Introduces the CodeAct framework that uses code as the universal action interface instead of JSON tool calling.</li>
</ul>
</li>
<li>
<p><strong>OpenHands Documentation.</strong> <a href="https://docs.openhands.dev/openhands/usage/architecture/runtime">Runtime Architecture</a>.</p>
<ul>
<li>Official documentation explaining the sandbox architecture, Docker runtime, and client-server model.</li>
</ul>
</li>
<li>
<p><strong>Anthropic.</strong> <a href="https://modelcontextprotocol.io/">Model Context Protocol (MCP)</a>.</p>
<ul>
<li>Official specification for the Model Context Protocol used for dynamic tool discovery and integration.</li>
</ul>
</li>
<li>
<p><strong>Jimenez, C., et al. (2024).</strong> <a href="https://www.swebench.com/">SWE-bench: Can Language Models Resolve Real-World GitHub Issues?</a></p>
<ul>
<li>The benchmark used to evaluate code assist agents on real-world software engineering tasks.</li>
</ul>
</li>
<li>
<p><strong>Wang, X., et al. (2024).</strong> <a href="https://openreview.net/pdf/95990590797cff8b93c33af989ecf4ac58bde9bb.pdf">OpenHands: An Open Platform for AI Software Developers as Generalist Agents</a>. <em>OpenReview</em>.</p>
<ul>
<li>Comprehensive overview of the OpenHands platform, agent capabilities, and design philosophy.</li>
</ul>
</li>
</ol>
]]></content:encoded></item><item><title>QuIP#: Achieving Near-Lossless 2-Bit LLM Quantization</title><link>https://www.mdjawad.com/posts/quip-sharp/</link><pubDate>Thu, 16 Oct 2025 00:00:00 +0800</pubDate><guid>https://www.mdjawad.com/posts/quip-sharp/</guid><description>QUIP# algorithm for quantizing LLM weights without gradient information.</description><content:encoded><![CDATA[<p><em>A deep dive into the mathematical elegance that makes extreme compression possible</em></p>
<hr>
<h2 id="1-introduction-the-compression-challenge">1. Introduction: The Compression Challenge</h2>
<h3 id="11-the-impossible-dream-running-70b-models-on-your-gaming-pc">1.1 The Impossible Dream: Running 70B Models on Your Gaming PC</h3>
<p>Picture this: You have a gaming PC with an RTX 4090, a beast of a card with 24GB of VRAM. You want to run Llama 2 70B, one of the most powerful open-source language models available. Here&rsquo;s the brutal math:</p>
<ul>
<li><strong>At full precision (FP16)</strong>: 70 billion parameters × 2 bytes = <strong>140GB</strong></li>
<li><strong>Your available VRAM</strong>: 24GB</li>
<li><strong>The gap</strong>: You&rsquo;d need 6 of these GPUs (140GB ÷ 24GB ≈ 5.8) just to hold the weights.</li>
</ul>
<p>This isn&rsquo;t just an inconvenience; it&rsquo;s a fundamental barrier. State-of-the-art AI remains locked in data centers, accessible only to well-funded labs and companies. Consumer hardware, edge devices, and even many research institutions are simply shut out.</p>
<p>Quantization promised to change this. By representing weights with fewer bits, we could compress these models to fit on accessible hardware. The progression seemed clear:</p>
<ul>
<li><strong>8-bit quantization (2020-2021)</strong>: Mostly lossless, 2× compression → 70GB <em>(still too big)</em></li>
<li><strong>4-bit quantization (2022-2023)</strong>: Near-lossless with methods like GPTQ, AWQ → 35GB <em>(getting closer!)</em></li>
<li><strong>2-bit quantization (2023-2024)</strong>: The holy grail → ~18GB <em>(fits on a single RTX 4090!)</em></li>
</ul>
<p>But there was a problem: <strong>nobody could make 2-bit work</strong>.</p>
<h3 id="12-why-2-bit-was-considered-impossible">1.2 Why 2-Bit Was Considered Impossible</h3>
<p>By late 2023, the field had hit a wall. Every attempt to quantize LLMs below 3 bits resulted in catastrophic quality degradation. The core challenge: LLM weight matrices contain <strong>outliers</strong>—a small number of weights that are 100-1000× larger than the rest. At 2 bits (only 4 distinct values), there&rsquo;s insufficient resolution to represent both normal weights and outliers accurately.</p>
<p><strong>Existing Methods Hit Hard Limits</strong></p>
<p>Let&rsquo;s look at what the state-of-the-art methods achieved at 2 bits on Llama 2 70B (WikiText2 perplexity with context length 2048, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Approach</th>
          <th>2-bit PPL</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FP16 (baseline)</td>
          <td>—</td>
          <td>3.32</td>
          <td>Perfect quality</td>
      </tr>
      <tr>
          <td>OmniQuant</td>
          <td>Learned transformations</td>
          <td>7.81</td>
          <td>Barely usable</td>
      </tr>
      <tr>
          <td>AWQ</td>
          <td>Activation-aware scaling</td>
          <td>11.9+</td>
          <td>Completely broken</td>
      </tr>
      <tr>
          <td>GPTQ</td>
          <td>Optimal Brain Quantization</td>
          <td>6.11</td>
          <td>Poor quality</td>
      </tr>
  </tbody>
</table>
<p>Even the lowest perplexity in this table (6.11) is nearly double the FP16 baseline of 3.32, and OmniQuant at 7.81 is more than <strong>2× worse</strong>. Models were incoherent, repetitive, and failed basic reasoning tasks.</p>
<p>The consensus emerged: <strong>4 bits is the practical minimum</strong>.</p>
<p>Tim Dettmers and Luke Zettlemoyer even published a paper in 2023 arguing that &ldquo;4-bit precision is optimal&rdquo; for LLMs, with diminishing returns below that threshold.</p>
<p><strong>The 2-bit dream seemed dead.</strong></p>
<h3 id="13-the-quip-breakthrough">1.3 The QuIP# Breakthrough</h3>
<p>Then came QuIP#. The results were unprecedented:</p>
<ul>
<li><strong>First method to achieve near-lossless 2-bit quantization</strong> (4.16 PPL vs 3.32 FP16 baseline on Llama 2 70B, context 2048)</li>
<li><strong>3-bit models scale better than &ldquo;theoretically lossless&rdquo; 4-bit</strong> (3.56 vs 3.47 PPL, nearly matching 4-bit quality with 25% fewer bits per weight)</li>
<li><strong>Scales better than higher bitrates</strong> as model size increases</li>
</ul>
<div class="quip-scaling-container" id="quip-scaling-container-94bb7d58d186deb5c24f2d2e1ffb5b84">
    <style>
        .quip-scaling-container {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
            max-width: 900px;
            margin: 2rem auto;
            padding: 2.5rem;
            background-color: #f8f9fa;
            border-radius: 20px;
            box-shadow: 0 12px 35px rgba(0,0,0,0.07);
            position: relative;
            border: 1px solid #e9ecef;
        }
        
        .chart-title {
            text-align: center;
            font-size: 2rem;
            font-weight: 700;
            color: #212529;
            margin-bottom: 0.75rem;
        }
        
        .chart-subtitle {
            text-align: center;
            font-size: 1rem;
            color: #6c757d;
            margin-bottom: 2.5rem;
        }
        
        .axis-label {
            font-size: 14px;
            font-weight: 500;
            fill: #495057;
        }
        
        .grid line {
            stroke: #dee2e6;
            stroke-opacity: 0.6;
            stroke-dasharray: 2,2;
        }
        
        .grid path {
            stroke-width: 0;
        }
        
        .line {
            fill: none;
            stroke-width: 2;
            stroke-linecap: round;
        }
        
        .line-quip-2-bit { stroke: #e53e3e; }
        .line-quip-3-bit { stroke: #dd6b20; }
        .line-quip-4-bit { stroke: #d69e2e; }
        .line-aqlm-2-bit { stroke: #805ad5; opacity: 0.6; }
        .line-theoretical-lossless-4-bit { stroke: #718096; stroke-dasharray: 5,5; opacity: 0.7; }
        
        .point {
            cursor: pointer;
            transition: all 0.2s ease;
            stroke: white;
            stroke-width: 1.5px;
        }
        
        .point:hover {
            stroke-width: 3px;
        }
        
        .legend {
            font-size: 13px;
            font-weight: 500;
        }
        
        .legend-item {
            cursor: pointer;
            transition: opacity 0.2s ease;
        }
        
        .legend-item:hover {
            opacity: 0.7;
        }
        
        .legend-line {
            stroke-width: 3;
            stroke-linecap: round;
        }
        
        .tooltip {
            position: absolute;
            padding: 12px 16px;
            background: rgba(45, 55, 72, 0.95);
            color: white;
            border-radius: 8px;
            pointer-events: none;
            opacity: 0;
            transition: opacity 0.2s ease;
            font-size: 13px;
            line-height: 1.6;
            box-shadow: 0 4px 12px rgba(0,0,0,0.15);
            backdrop-filter: blur(10px);
            z-index: 1000;
        }
        
        .tooltip.visible {
            opacity: 1;
        }
        
        .tooltip-model {
            font-weight: 700;
            color: #a5d8ff;
            margin-bottom: 4px;
        }
        
        .breakthrough-box {
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            padding: 2rem;
            border-radius: 16px;
            margin-top: 2.5rem;
            text-align: center;
            box-shadow: 0 8px 25px rgba(102, 126, 234, 0.3);
        }
        
        .breakthrough-box h3 {
            margin: 0 0 0.75rem 0;
            font-size: 1.5rem;
            font-weight: 600;
        }
        
        .breakthrough-box p {
            margin: 0;
            font-size: 1.1rem;
            line-height: 1.6;
            opacity: 0.95;
        }
    </style>

    <h1 class="chart-title">QuIP# Scaling: 3-Bit Outperforms 4-Bit</h1>
    <p class="chart-subtitle">WikiText2 Perplexity vs Total Model Size (Llama 2) • Lower is Better</p>
    <svg id="scalingChart-94bb7d58d186deb5c24f2d2e1ffb5b84"></svg>
    <div class="breakthrough-box">
        <h3>🎯 The Unprecedented Result</h3>
        <p>QuIP# 3-bit models scale <strong>better than 4-bit</strong>, directly refuting the 2023 consensus that "4-bit is optimal"</p>
    </div>
    <div class="tooltip" id="tooltip-94bb7d58d186deb5c24f2d2e1ffb5b84"></div>
</div>
    
<script>
    (function() {
        const uniqueId = '94bb7d58d186deb5c24f2d2e1ffb5b84';
        
        
        if (typeof d3 === 'undefined') {
            console.warn('D3.js not loaded for QuIP# scaling chart');
            document.getElementById('scalingChart-' + uniqueId).innerHTML = 
                '<text x="50%" y="50%" text-anchor="middle" dy=".3em" style="fill: #e53e3e;">Error: D3.js not loaded</text>';
            return;
        }
        
        initChart();
        
        function initChart() {
            
            const data = {
                'QuIP# 2 Bit': [
                    { model: '2-7B', size: 1.4e10, ppl: 6.19 },
                    { model: '2-13B', size: 2.6e10, ppl: 5.35 },
                    { model: '2-70B', size: 1.4e11, ppl: 3.91 }
                ],
                'QuIP# 3 Bit': [
                    { model: '2-7B', size: 2.1e10, ppl: 5.41 },
                    { model: '2-13B', size: 3.9e10, ppl: 4.78 },
                    { model: '2-70B', size: 2.1e11, ppl: 3.35 }
                ],
                'QuIP# 4 Bit': [
                    { model: '2-7B', size: 2.8e10, ppl: 5.19 },
                    { model: '2-13B', size: 5.2e10, ppl: 4.63 },
                    { model: '2-70B', size: 2.8e11, ppl: 3.18 }
                ],
                'AQLM ~2 Bit': [
                    { model: '2-7B', size: 1.42e10, ppl: 6.93 },
                    { model: '2-13B', size: 2.56e10, ppl: 5.70 },
                    { model: '2-70B', size: 1.45e11, ppl: 3.94 }
                ],
                'Theoretical Lossless (FP16) 4 Bit': [
                    { model: '2-7B', size: 2.8e10, ppl: 5.12 },
                    { model: '2-13B', size: 5.2e10, ppl: 4.57 },
                    { model: '2-70B', size: 2.8e11, ppl: 3.12 }
                ],
                'QuIP 2 Bit': [
                    { model: '2-70B', size: 1.4e11, ppl: 5.90 }  
                ]
            };

            const colors = {
                'QuIP# 2 Bit': '#4285F4', 
                'QuIP# 3 Bit': '#DB4437', 
                'QuIP# 4 Bit': '#F4B400', 
                'AQLM ~2 Bit': '#5F6368', 
                'Theoretical Lossless (FP16) 4 Bit': '#0F9D58', 
                'QuIP 2 Bit': '#803291' 
            };
            
            const markers = {
                'QuIP# 2 Bit': d3.symbolCircle,
                'QuIP# 3 Bit': d3.symbolCircle,
                'QuIP# 4 Bit': d3.symbolCircle,
                'AQLM ~2 Bit': d3.symbolTriangle,
                'Theoretical Lossless (FP16) 4 Bit': d3.symbolStar,
                'QuIP 2 Bit': d3.symbolDiamond
            };

            
            const margin = { top: 20, right: 250, bottom: 60, left: 70 };
            const width = 900 - margin.left - margin.right;
            const height = 500 - margin.top - margin.bottom;

            const svg = d3.select('#scalingChart-' + uniqueId)
                .attr('viewBox', `0 0 ${width + margin.left + margin.right} ${height + margin.top + margin.bottom}`)
                .attr('preserveAspectRatio', 'xMidYMid meet')
                .attr('style', 'max-width: 100%; height: auto;');

            const g = svg.append('g')
                .attr('transform', `translate(${margin.left},${margin.top})`);

            
            const x = d3.scaleLog()
                .domain([1e10, 3e11])
                .range([0, width]);

            const y = d3.scaleLinear()
                .domain([3, 7.5])
                .range([height, 0]);

            
            g.append('g')
                .attr('class', 'grid')
                .attr('transform', `translate(0,${height})`)
                .call(d3.axisBottom(x)
                    .tickSize(-height)
                    .tickFormat(''));

            g.append('g')
                .attr('class', 'grid')
                .call(d3.axisLeft(y)
                    .tickSize(-width)
                    .tickFormat(''));

            
            g.append('g')
                .attr('transform', `translate(0,${height})`)
                .call(d3.axisBottom(x)
                    .ticks(4, ".1e")
                    .tickFormat(d => {
                        if (d === 5e10) return "5E+10";
                        if (d === 1e11) return "1E+11";
                        return null;
                    }))
                .selectAll('text')
                .style('font-size', '12px')
                .style('fill', '#6c757d');

            g.append('g')
                .call(d3.axisLeft(y).ticks(8))
                .selectAll('text')
                .style('font-size', '12px')
                .style('fill', '#6c757d');

            
            g.append('text')
                .attr('class', 'axis-label')
                .attr('x', width / 2)
                .attr('y', height + 45)
                .attr('text-anchor', 'middle')
                .text('Model Size (Bits)');

            g.append('text')
                .attr('class', 'axis-label')
                .attr('transform', 'rotate(-90)')
                .attr('x', -height / 2)
                .attr('y', -50)
                .attr('text-anchor', 'middle')
                .text('WikiText2 Perplexity (ctx 4096)');

            
            const line = d3.line()
                .x(d => x(d.size))
                .y(d => y(d.ppl))
                .curve(d3.curveMonotoneX);

            
            Object.entries(data).forEach(([key, values]) => {
                
                g.append('path')
                    .datum(values)
                    .attr('class', 'line')
                    .attr('d', line)
                    .style('stroke', colors[key])
                    .style('stroke-dasharray', (key.includes('AQLM') || key.includes('Theoretical') || key === 'QuIP 2 Bit') ? '5,5' : 'none');
            });

            
            const tooltip = d3.select('#tooltip-' + uniqueId);
            const container = document.getElementById('quip-scaling-container-' + uniqueId);

            Object.entries(data).forEach(([key, values]) => {
                const symbol = d3.symbol().type(markers[key]).size(80);

                g.selectAll(`.point-${key.replace(/[^a-z0-9]/gi, '-')}`)
                    .data(values)
                    .enter()
                    .append('path')
                    .attr('class', 'point')
                    .attr('d', symbol)
                    .attr('transform', d => `translate(${x(d.size)},${y(d.ppl)})`)
                    .style('fill', colors[key])
                    .on('mouseover', function(event, d) {
                        d3.select(this)
                            .attr('d', d3.symbol().type(markers[key]).size(120));
                        
                        const [posX, posY] = d3.pointer(event, container);

                        tooltip.classed('visible', true)
                            .html(`
                                <div class="tooltip-model">${d.model}</div>
                                <div><strong>${key}</strong></div>
                                <div>Perplexity: ${d.ppl.toFixed(2)}</div>
                                <div>Size: ${d.size.toExponential(1)} bits</div>
                            `)
                            .style('left', (posX + 15) + 'px')
                            .style('top', (posY - 15) + 'px');
                    })
                    .on('mouseout', function() {
                        d3.select(this)
                            .attr('d', symbol);
                        tooltip.classed('visible', false);
                    });
            });

            
            const legend = g.append('g')
                .attr('class', 'legend')
                .attr('transform', `translate(${width + 40}, 0)`);

            Object.entries(colors).forEach(([key, color], i) => {
                const legendRow = legend.append('g')
                    .attr('class', 'legend-item')
                    .attr('transform', `translate(0, ${i * 30})`);

                const symbol = d3.symbol().type(markers[key]).size(80);

                legendRow.append('path')
                    .attr('d', symbol)
                    .attr('transform', `translate(10, 0)`)
                    .style('fill', color)
                    .attr('stroke', '#666')
                    .attr('stroke-width', 0.5);

                legendRow.append('text')
                    .attr('x', 32)
                    .attr('y', 5)
                    .text(key)
                    .style('font-size', '13px')
                    .style('fill', '#212529');
            });

            
        }
    })();
</script>

<p>What changed? QuIP# combines three techniques in a principled, mathematically elegant way:</p>
<ol>
<li><strong>Randomized Hadamard Transform (RHT)</strong> for incoherence processing</li>
<li><strong>E8 lattice codebooks</strong> for optimal sphere packing</li>
<li><strong>Block-LDLQ adaptive rounding</strong> with Hessian awareness</li>
</ol>
<p>Each component addresses a specific mathematical challenge. Together, they enable what was thought impossible.</p>
<p>But before we dive into QuIP#&rsquo;s solution, we need to understand the fundamental challenges that made 2-bit quantization seem impossible in the first place.</p>
<hr>
<h2 id="2-background-what-makes-quantization-hard">2. Background: What Makes Quantization Hard?</h2>
<h3 id="21-quantization-basics-the-storage-vs-accuracy-bargain">2.1 Quantization Basics: The Storage vs. Accuracy Bargain</h3>
<p><strong>The Core Idea</strong></p>
<p>Imagine you&rsquo;re moving to a smaller apartment and need to compress your belongings. You could:</p>
<ul>
<li>Pack everything loosely (takes many boxes, but nothing gets damaged)</li>
<li>Compress everything tightly (fits in fewer boxes, but some items might break)</li>
</ul>
<p>Quantization is exactly this trade-off for neural network weights. Each weight in a model is originally stored as a 16-bit floating-point number (FP16), giving it incredible precision. But do we really <em>need</em> that much precision?</p>
<p>Quantization reduces the memory footprint of LLMs by representing weights with fewer bits:</p>
<ul>
<li><strong>FP16 (16 bits)</strong>: 70 billion parameters × 2 bytes = <strong>140GB</strong></li>
<li><strong>4-bit</strong>: 70 billion parameters × 0.5 bytes = <strong>35GB</strong> (4× compression)</li>
<li><strong>2-bit</strong>: 70 billion parameters × 0.25 bytes = <strong>~18GB</strong> (8× compression) ✓ <em>Fits on RTX 4090!</em></li>
</ul>
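<p>The arithmetic above is easy to check in a few lines. This is a minimal sketch with a hypothetical helper, <code>weight_memory_gb</code>, that counts only the weights themselves (no activations, KV cache, or quantization metadata) and takes 1 GB as 10<sup>9</sup> bytes:</p>

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Llama 2 70B at the bit widths discussed above.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):6.1f} GB")
```

<p>Real deployments need extra headroom on top of these numbers, so treat them as a lower bound rather than a sizing guide.</p>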
<p><strong>Why It&rsquo;s Not Just Rounding</strong></p>
<p>You might think: &ldquo;Why not just round each weight to the nearest 2-bit value?&rdquo; Here&rsquo;s why that fails catastrophically.</p>
<p>Consider a simple weight matrix row: <code>[0.1, 0.15, 0.2, 0.12, 0.18, 0.11, 0.16]</code></p>
<p>With 2 bits, you can only represent <strong>4 distinct values</strong> (say: <code>{0, 0.1, 0.2, 0.3}</code>). Naively rounding each weight independently would map everything to either 0.1 or 0.2, obliterating the subtle differences between 0.11, 0.12, and 0.15 that might be critical for the model&rsquo;s behavior.</p>
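<p>To make the collapse concrete, here is a small sketch of naive round-to-nearest on that exact row. The level set is the illustrative one from the text, and <code>round_to_nearest</code> is a hypothetical helper, not part of any quantization library:</p>

```python
LEVELS = [0.0, 0.1, 0.2, 0.3]  # the four values a 2-bit code can name
row = [0.1, 0.15, 0.2, 0.12, 0.18, 0.11, 0.16]


def round_to_nearest(w, levels):
    """Pick the quantization level closest to w, independently of all other weights."""
    return min(levels, key=lambda q: abs(w - q))


quantized = [round_to_nearest(w, LEVELS) for w in row]
print(quantized)
print("distinct values before:", len(set(row)), "after:", len(set(quantized)))
```

<p>Seven distinct weights go in and only two distinct values come out, which is the information loss the paragraph above describes.</p>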
<p><strong>The true challenge</strong>: How do you choose which information to preserve when you can only afford 4 distinct values per weight?</p>
<p>This is where the mathematics gets interesting.</p>
<h3 id="22-the-outlier-problem-the-1-that-breaks-everything">2.2 The Outlier Problem: The 1% That Breaks Everything</h3>
<p><strong>The Hidden Structure of LLM Weights</strong></p>
<p>LLM weight matrices aren&rsquo;t uniform—they contain <strong>outliers</strong>: a small number of weights that are 100-1000× larger than the rest. This isn&rsquo;t a bug; it&rsquo;s a fundamental feature of how transformers learn.</p>
<p><strong>A Visual Example</strong></p>
<pre tabindex="0"><code>Typical weights: [0.1, 0.15, 0.2, 0.12, 0.18, 0.11, 0.16]
Outlier weight:  [150.0]
</code></pre><p>That single outlier dominates the entire matrix. Why? Because in the forward pass, <code>y = W·x</code>, even if <code>x</code> is small, that 150.0 weight creates a huge activation that completely changes the output.</p>
<p><strong>The Quantization Dilemma</strong></p>
<p>You face an impossible choice:</p>
<ol>
<li>
<p><strong>Scale for the outlier</strong>: Set your 4 quantization levels to cover the range [0, 150]. Now your levels might be <code>{0, 50, 100, 150}</code>. But this crushes all normal weights (0.1, 0.15, 0.2&hellip;) to zero! You&rsquo;ve destroyed 99% of the weights to preserve 1%.</p>
</li>
<li>
<p><strong>Ignore the outlier</strong>: Set your levels to <code>{0, 0.1, 0.2, 0.3}</code> to capture normal weights well. But now the 150.0 outlier gets clipped to 0.3—<strong>a 500× error</strong> that will cause catastrophic failures in the model&rsquo;s output.</p>
</li>
</ol>
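<p>Both horns of the dilemma can be reproduced with the toy numbers above. This is only an illustration using a hypothetical <code>quantize</code> helper with plain nearest-level rounding, not how any real quantizer picks its grid:</p>

```python
def quantize(weights, levels):
    """Round every weight to its nearest level."""
    return [min(levels, key=lambda q: abs(w - q)) for w in weights]


weights = [0.1, 0.15, 0.2, 0.12, 0.18, 0.11, 0.16, 150.0]

wide = quantize(weights, [0.0, 50.0, 100.0, 150.0])   # option 1: cover the outlier
narrow = quantize(weights, [0.0, 0.1, 0.2, 0.3])      # option 2: cover the normal weights

print("scale for the outlier:", wide)
print("ignore the outlier:   ", narrow)
print("outlier shrunk by a factor of about", round(weights[-1] / narrow[-1]))
```

<p>With the wide grid every normal weight lands on zero; with the narrow grid the outlier is clipped to the top level.</p>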
<p>At 4 bits (16 distinct values), you have enough resolution to handle this with techniques like per-group scaling. At 2 bits (only 4 values), the math breaks down.</p>
<p>This is why you can&rsquo;t just &ldquo;turn down the bits&rdquo; and expect things to work. The outlier problem is the fundamental barrier that prevented 2-bit quantization from being viable until QuIP#.</p>
<h3 id="23-why-existing-methods-fail-at-2-bit">2.3 Why Existing Methods Fail at 2-Bit</h3>
<p>By late 2023, the field had hit a wall. Let&rsquo;s examine what the state-of-the-art methods achieved at 2 bits on Llama 2 70B (context length 2048):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Approach</th>
          <th>2-bit Perplexity</th>
          <th>Verdict</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FP16 (baseline)</td>
          <td>—</td>
          <td>3.32</td>
          <td>Perfect quality</td>
      </tr>
      <tr>
          <td><strong>OmniQuant</strong></td>
          <td>Learned transformations</td>
          <td>7.81</td>
          <td><strong>Barely usable</strong></td>
      </tr>
      <tr>
          <td><strong>AWQ</strong></td>
          <td>Activation-aware scaling</td>
          <td>11.9+</td>
          <td>Completely broken</td>
      </tr>
      <tr>
          <td><strong>GPTQ</strong></td>
          <td>Optimal Brain Quantization</td>
          <td>6.11</td>
          <td>Poor quality</td>
      </tr>
  </tbody>
</table>
<p><em>(Lower perplexity = better. Even the lowest 2-bit result in this table, 6.11, is nearly double the FP16 baseline of 3.32; OmniQuant at 7.81 is more than 2× worse.)</em></p>
<p><strong>Why Each Method Failed</strong></p>
<ol>
<li>
<p><strong>AWQ &amp; OmniQuant</strong>: Use heuristic outlier suppression via activation-aware rescaling. At 2 bits, these heuristics aren&rsquo;t strong enough—outliers still dominate.</p>
</li>
<li>
<p><strong>GPTQ</strong>: Per-group scaling adds 0.25 bits overhead per weight (12.5% at 2 bits) and only mitigates the outlier problem rather than solving it.</p>
</li>
<li>
<p><strong>SpQR</strong>: Stores outliers separately in FP16, but irregular memory access patterns kill GPU performance.</p>
</li>
<li>
<p><strong>AQLM</strong>: Achieves good quality but uses 1MB codebooks that cause cache misses, making it slower than FP16.</p>
</li>
</ol>
<p><strong>QuIP#&rsquo;s Key Insight</strong>: Instead of fighting outliers with heuristics, <em>eliminate them entirely through principled mathematical transformation</em> (Randomized Hadamard Transform). We&rsquo;ll explore this in Section 4.</p>
<p>Now that we understand why existing methods failed, let&rsquo;s see how QuIP# solves these challenges through three synergistic components.</p>
<hr>
<h2 id="3-quips-three-pillars">3. QuIP#&rsquo;s Three Pillars</h2>
<h3 id="31-high-level-architecture">3.1 High-Level Architecture</h3>
<p>QuIP# is built on three synergistic components:</p>
<ol>
<li><strong>Incoherence Processing with RHT</strong>: Transforms weights to eliminate outliers</li>
<li><strong>E8 Lattice Codebooks</strong>: Matches quantization to the transformed weight distribution</li>
<li><strong>Block-LDLQ Adaptive Rounding</strong>: Accounts for weight interdependencies</li>
</ol>
<p>Each component solves a specific mathematical problem:</p>
<pre tabindex="0"><code>Original Weights (with outliers)
         ↓
    [RHT Transform]
         ↓
Incoherent Weights (Gaussian, ball-shaped)
         ↓
  [E8P Quantization]
         ↓
Quantized Weights (minimal error)
         ↓
 [Block-LDLQ Rounding]
         ↓
Final Quantized Model
</code></pre><p>The beauty is in how these pieces fit together. The RHT creates a Gaussian distribution. The E8 lattice is <em>proven optimal</em> for packing spheres in 8D space (exactly what a Gaussian distribution looks like!). And Block-LDLQ uses the Hessian to minimize the final reconstruction error.</p>
<p>Let&rsquo;s explore each pillar in depth, starting with the transformation that eliminates outliers.</p>
<hr>
<h2 id="4-pillar-1-incoherence-processing-with-randomized-hadamard-transform">4. Pillar 1: Incoherence Processing with Randomized Hadamard Transform</h2>
<h3 id="41-the-core-idea-spread-the-risk">4.1 The Core Idea: Spread the Risk</h3>
<p><strong>The Building Analogy</strong></p>
<p>Imagine a building supported by many pillars. If one pillar bears 99% of the weight, the building will collapse if that pillar fails. But if the weight is evenly distributed across all pillars, the building can withstand the loss of any single support.</p>
<p>Incoherence processing does this for neural network weights. Instead of having a few &ldquo;weight pillars&rdquo; bear most of the importance, we redistribute the load so no single weight is critical.</p>
<p><strong>Mathematical Formulation</strong></p>
<p>The magic: We transform weights <code>W</code> using orthogonal matrices <code>U</code> and <code>V</code>:</p>
<pre tabindex="0"><code>W&#39; = U·W·V^T
</code></pre><p>The key properties:</p>
<ul>
<li><strong>Preserves the forward pass</strong>: <code>y = W·x = U^T·W'·V·x</code> (we just transform inputs/outputs accordingly)</li>
<li><strong>Redistributes magnitude</strong>: Outliers get &ldquo;spread out&rdquo; across many weights</li>
<li><strong>No information loss</strong>: Orthogonal matrices are invertible</li>
</ul>
<p>During inference, we compute: <code>y = U^T·(W'·(V·x))</code></p>
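<p>A quick numerical check of that identity, as a sketch: here <code>U</code> and <code>V</code> are plain 2D rotation matrices standing in for the orthogonal transforms (an assumption chosen because rotations are not symmetric, so mixing up <code>U</code> and <code>U^T</code> would show up immediately), and the helpers <code>matmul</code>, <code>transpose</code>, and <code>rotation</code> are hypothetical.</p>

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def rotation(theta):
    """A 2x2 rotation matrix, which is orthogonal for any angle."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


W = [[0.1, 150.0], [0.2, 0.15]]   # toy weights with one outlier
x = [[1.0], [2.0]]                # input as a column vector
U, V = rotation(0.7), rotation(-1.3)

W_prime = matmul(matmul(U, W), transpose(V))                       # W' = U W V^T
y_direct = matmul(W, x)                                            # y = W x
y_via = matmul(transpose(U), matmul(W_prime, matmul(V, x)))        # y = U^T (W' (V x))

print(y_direct, y_via)
```

<p>The two outputs agree up to floating-point rounding, even though <code>W_prime</code> looks nothing like <code>W</code>.</p>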
<h3 id="42-what-is-incoherence">4.2 What is Incoherence?</h3>
<p><strong>Intuitive Explanation</strong></p>
<p>An <strong>incoherent</strong> matrix has no outliers—all entries have similar magnitude. Think of it like a democracy where no single vote dominates, versus a dictatorship where one voice controls everything.</p>
<p><strong>Formal Definition</strong></p>
<p>For a weight matrix <code>W ∈ ℝ^(m×n)</code>, we say it&rsquo;s <strong>μ-incoherent</strong> if:</p>
<pre tabindex="0"><code>max|W_ij| ≤ μ·||W||_F / √(m·n)
</code></pre><p><strong>Understanding the Inequality</strong></p>
<p>Let&rsquo;s break this down:</p>
<ul>
<li><code>W_ij</code>: Single entry at row i, column j</li>
<li><code>max|W_ij|</code>: The largest absolute value (the outlier)</li>
<li><code>||W||_F</code>: <a href="https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm">Frobenius norm</a> = √(Σᵢⱼ W²ᵢⱼ) (total magnitude of all weights)</li>
<li><code>√(m·n)</code>: Normalization by matrix size</li>
<li><code>μ</code>: Incoherence parameter (<strong>smaller = better</strong>)</li>
</ul>
<p><strong>What it means in plain English</strong>: &ldquo;The biggest entry ≤ μ × the root-mean-square entry&rdquo; (<code>||W||_F / √(m·n)</code> is exactly the RMS magnitude of the entries)</p>
<p><strong>Concrete Examples</strong></p>
<ul>
<li>
<p><strong>Incoherent matrix (μ ≈ 1)</strong>: All entries ≈ 0.28</p>
<pre tabindex="0"><code>[0.27, 0.29, 0.26, 0.30]
[0.28, 0.27, 0.31, 0.27]
[0.29, 0.28, 0.27, 0.29]
</code></pre></li>
<li>
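<p><strong>Matrix with outlier (μ ≈ 3.4)</strong>: One entry = 5.5, others ≈ 0.28. For a 3×4 matrix μ can never exceed √12 ≈ 3.46, so this is almost as coherent as the matrix can get.</p>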
<pre tabindex="0"><code>[0.27, 0.29, 0.26, 0.30]
[0.28, 5.50, 0.31, 0.27]  ← Outlier!
[0.29, 0.28, 0.27, 0.29]
</code></pre></li>
</ul>
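<p>The definition is short enough to evaluate directly on the two example matrices. A minimal sketch, where <code>incoherence</code> is a hypothetical helper that returns the smallest μ satisfying the inequality:</p>

```python
import math


def incoherence(W):
    """Smallest mu such that max|W_ij| is at most mu * ||W||_F / sqrt(m*n)."""
    m, n = len(W), len(W[0])
    frobenius = math.sqrt(sum(v * v for row in W for v in row))
    peak = max(abs(v) for row in W for v in row)
    return peak * math.sqrt(m * n) / frobenius


flat = [
    [0.27, 0.29, 0.26, 0.30],
    [0.28, 0.27, 0.31, 0.27],
    [0.29, 0.28, 0.27, 0.29],
]
spiky = [row[:] for row in flat]
spiky[1][1] = 5.50  # plant the outlier

print("flat  mu:", round(incoherence(flat), 2))
print("spiky mu:", round(incoherence(spiky), 2))
print("upper bound sqrt(m*n):", round(math.sqrt(12), 2))
```

<p>Because a single entry can hold at most the whole Frobenius norm, μ is always bounded by <code>√(m·n)</code>; large real weight matrices leave far more room between the best and worst case than this 3×4 toy does.</p>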
<h3 id="43-the-randomized-hadamard-transform-rht">4.3 The Randomized Hadamard Transform (RHT)</h3>
<p><strong>Why Hadamard?</strong></p>
<p>The Hadamard matrix has special properties that make it perfect for incoherence processing:</p>
<ol>
<li><strong>Orthogonal</strong>: Preserves information (no loss)</li>
<li><strong>Binary entries</strong>: All entries are ±1 up to a single 1/√n scale factor (no floating-point multiplies needed inside the transform!)</li>
<li><strong>Fast to compute</strong>: O(n log n) via Fast Walsh-Hadamard Transform</li>
</ol>
<p><strong>The RHT Construction</strong></p>
<p>The transformation is:</p>
<pre tabindex="0"><code>W&#39; = H·diag(S_U)·W·diag(S_V)·H^T
</code></pre><p>Where:</p>
<ul>
<li><code>H</code>: Hadamard matrix (entries in {-1, +1}; orthogonal once scaled by 1/√n)</li>
<li><code>S_U, S_V</code>: Random sign vectors (diagonal ±1 matrices)</li>
</ul>
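<p>Here is a toy version of that construction in pure Python, to show what it does to a single outlier. It is a sketch under several assumptions: an 8×8 matrix, a power-of-2 size so a textbook fast Walsh-Hadamard transform applies, and hypothetical helper names (<code>fwht</code>, <code>rht</code>, <code>incoherence</code>). It is not the kernel QuIP# actually ships.</p>

```python
import math
import random


def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform; len(vec) must be a power of 2."""
    out = list(vec)
    n, h = len(out), 1
    while h != n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def rht(W, s_u, s_v):
    """Compute H diag(s_u) W diag(s_v) H^T with H scaled to be orthogonal."""
    m, n = len(W), len(W[0])
    signed = [[s_u[i] * W[i][j] * s_v[j] for j in range(n)] for i in range(m)]
    rows = [fwht(r) for r in signed]              # right-multiply by H^T
    cols = [fwht(list(c)) for c in zip(*rows)]    # left-multiply by H
    scale = 1.0 / math.sqrt(m * n)                # the two 1/sqrt normalizations
    return [[cols[j][i] * scale for j in range(n)] for i in range(m)]


def frobenius(W):
    return math.sqrt(sum(v * v for r in W for v in r))


def incoherence(W):
    peak = max(abs(v) for r in W for v in r)
    return peak * math.sqrt(len(W) * len(W[0])) / frobenius(W)


rng = random.Random(0)
n = 8
W = [[0.1 + 0.01 * ((3 * i + 5 * j) % 11) for j in range(n)] for i in range(n)]
W[2][5] = 150.0                                   # plant one outlier
s_u = [rng.choice((-1.0, 1.0)) for _ in range(n)]
s_v = [rng.choice((-1.0, 1.0)) for _ in range(n)]

W_rht = rht(W, s_u, s_v)
print("mu before:", round(incoherence(W), 2))
print("mu after: ", round(incoherence(W_rht), 2))
print("Frobenius norm before/after:", round(frobenius(W), 3), round(frobenius(W_rht), 3))
```

<p>The outlier&rsquo;s magnitude is smeared across all 64 entries while the Frobenius norm is unchanged, so μ drops from near its maximum of 8 to close to 1.</p>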
<p><strong>Theoretical Guarantees</strong></p>
<p><strong>Lemma 3.1</strong> from the QuIP# paper: With high probability (1-δ), the RHT achieves:</p>
<pre tabindex="0"><code>μ_H = √(2·log(2n²/δ))
</code></pre><p>This is a <strong>major improvement</strong> over QuIP&rsquo;s Kronecker approach:</p>
<ul>
<li><strong>QuIP (Kronecker)</strong>: μ = O(log² n)</li>
<li><strong>QuIP# (RHT)</strong>: μ = O(√log n)</li>
</ul>
<p>Better incoherence means less quantization error!</p>
<p><strong>Runtime Improvement</strong></p>
<ul>
<li><strong>QuIP (Kronecker)</strong>: O(n√n) operations</li>
<li><strong>QuIP# (RHT)</strong>: O(n log n) operations</li>
</ul>
<p>For Llama 2 70B with n=28,672: √n ≈ 169 while log₂ n ≈ 15, so the transform needs roughly 11× fewer operations (ignoring constant factors)!</p>
<h3 id="44-handling-non-power-of-2-dimensions">4.4 Handling Non-Power-of-2 Dimensions</h3>
<p><strong>The Challenge</strong></p>
<p>Hadamard matrices are known to exist for dimensions 1, 2, and many multiples of 4 (the Hadamard conjecture states that one exists for every multiple of 4). The Fast Walsh-Hadamard Transform, however, requires power-of-2 dimensions, and LLMs have all sorts of dimensions:</p>
<ul>
<li>Llama 2 70B has intermediate dimension <strong>28,672</strong> = 1024 × 28</li>
</ul>
<p><strong>The Solution: Kronecker Product</strong></p>
<p>We can factorize n = p×q where p is a power of 2 and q is another valid Hadamard dimension:</p>
<pre tabindex="0"><code>H = H_p ⊗ H_q
</code></pre><p>For Llama 2: H = H_1024 ⊗ H_28</p>
<p>This gives us:</p>
<ul>
<li><strong>Compute time</strong>: O(q²p log p)</li>
<li><strong>For 28,672</strong>: Much faster than QuIP&rsquo;s O(n√n)</li>
</ul>
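<p>The factorized transform can be sketched on a small stand-in for the Llama case: p = 4 and q = 12 instead of 1024 and 28. The text above does not say how the non-power-of-2 factor is obtained, so this sketch assumes a Paley (type I) construction for H<sub>12</sub>; all function names are illustrative.</p>

```javascript
// Paley construction (type I) of a Hadamard matrix of order q + 1,
// valid for a prime q with q % 4 === 3. Here q = 11 gives order 12.
function paleyHadamard(q) {
    const squares = new Set();
    for (let a = 1; a < q; a++) squares.add((a * a) % q);
    // Quadratic character: 0 for 0, +1 for nonzero squares mod q, -1 otherwise.
    const chi = a => {
        const r = ((a % q) + q) % q;
        return r === 0 ? 0 : (squares.has(r) ? 1 : -1);
    };
    const n = q + 1;
    const H = [];
    for (let i = 0; i < n; i++) {
        const row = [];
        for (let j = 0; j < n; j++) {
            let s;
            if (i === 0) s = 1;            // first row of S is +1
            else if (j === 0) s = -1;      // first column of S is -1
            else s = chi(i - j);           // Jacobsthal block
            row.push(i === j ? 1 : s);     // H = I + S, where S has a zero diagonal
        }
        H.push(row);
    }
    return H;
}

// Fast Walsh-Hadamard Transform for a power-of-2 length array.
function fwht(x) {
    const y = x.slice();
    for (let half = 1; half < y.length; half *= 2) {
        for (let start = 0; start < y.length; start += 2 * half) {
            for (let k = start; k < start + half; k++) {
                const a = y[k], b = y[k + half];
                y[k] = a + b;
                y[k + half] = a - b;
            }
        }
    }
    return y;
}

// y = (H_p ⊗ H_q) x, with x viewed as a p×q row-major matrix X: Y = H_p · X · H_qᵀ.
function kroneckerHadamard(x, p, Hq) {
    const q = Hq.length;
    const Y = new Array(p * q);
    // Step 1: FWHT down each of the q columns (length p): O(q · p log p).
    for (let c = 0; c < q; c++) {
        const col = [];
        for (let r = 0; r < p; r++) col.push(x[r * q + c]);
        const t = fwht(col);
        for (let r = 0; r < p; r++) Y[r * q + c] = t[r];
    }
    // Step 2: dense multiply of each of the p rows by H_qᵀ: O(p · q²).
    const out = new Array(p * q);
    for (let r = 0; r < p; r++) {
        for (let c = 0; c < q; c++) {
            let sum = 0;
            for (let l = 0; l < q; l++) sum += Y[r * q + l] * Hq[c][l];
            out[r * q + c] = sum;
        }
    }
    return out;
}

const p = 4;
const H12 = paleyHadamard(11);
const x = Array.from({length: p * 12}, (_, i) => ((i * 7) % 5) - 2);
const y = kroneckerHadamard(x, p, H12);
const energyIn = x.reduce((s, v) => s + v * v, 0);
const energyOut = y.reduce((s, v) => s + v * v, 0);
console.log(energyOut / energyIn); // 48, i.e. n = p * q: orthogonal up to scale
```

<p>Only the power-of-2 factor uses the fast butterfly; the small factor is applied densely. Both steps stay within the O(q²p log p) bound quoted above.</p>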
<h3 id="45-why-this-works-for-quantization">4.5 Why This Works for Quantization</h3>
<p>After RHT, weights become <strong>approximately Gaussian distributed</strong>:</p>
<ul>
<li><strong>Before</strong>: Spiked distribution with outliers</li>
<li><strong>After</strong>: Smooth Gaussian bell curve</li>
</ul>
<p>This is crucial because:</p>
<ol>
<li><strong>No single weight is critical</strong> → quantization errors spread evenly</li>
<li><strong>Roughly ball-shaped in high dimensions</strong> → perfect for E8 lattice (next section!)</li>
<li><strong>Predictable error bounds</strong> → we can prove theoretical guarantees</li>
</ol>
<div class="rht-container" id="rht-container-94bb7d58d186deb5c24f2d2e1ffb5b84">
    <style>
        .rht-container {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            background-color: #ffffff;
            border-radius: 14px;
            box-shadow: 0 6px 20px rgba(0,0,0,0.06);
            overflow: hidden;
            margin: 2rem auto;
            max-width: 720px;  
            width: 100%;
            border: 1px solid #e9ecef;
            font-size: 0.85rem;  
        }
        
        .rht-header {
            background: linear-gradient(135deg, #7b68ee 0%, #5a4fcf 100%);
            color: white;
            padding: 18px 22px;
            text-align: center;
        }
        
        .rht-header h2 {
            font-size: 1.1em;
            margin-bottom: 6px;
            font-weight: 700;
            letter-spacing: -0.5px;
        }
        
        .rht-header p {
            font-size: 0.85em;
            opacity: 0.95;
            font-weight: 400;
        }
        
        .rht-tabs {
            display: flex;
            background: #f8f9fa;
            border-bottom: 1px solid #dee2e6;
            overflow-x: auto;
            padding: 0 10px;
        }
        
        .rht-tab {
            padding: 6px 10px;
            cursor: pointer;
            background: transparent;
            border: none;
            font-size: 0.8em;
            font-weight: 600;
            color: #6c757d;
            transition: all 0.3s ease;
            position: relative;
            white-space: nowrap;
            border-bottom: 2px solid transparent;
        }
        
        .rht-tab:hover {
            color: #5a4fcf;
        }
        
        .rht-tab.active {
            color: #5a4fcf;
            border-bottom-color: #5a4fcf;
        }
        
        .rht-tab.active::after {
           display: none;
        }
        
        .rht-content {
            padding: 18px;
        }
        
        .rht-tab-content h2 {
            font-size: 1.05em;
            font-weight: 700;
            color: #5a4fcf;
            margin-bottom: 14px;
        }

        .rht-tab-content {
            display: none;
        }
        
        .rht-tab-content.active {
            display: block;
            animation: rht-fadeIn 0.5s ease;
        }
        
        @keyframes rht-fadeIn {
            from { opacity: 0; transform: translateY(10px); }
            to { opacity: 1; transform: translateY(0); }
        }
        
        .rht-visualization {
            margin: 10px 0;
            background: #f8f9fa;
            border-radius: 10px;
            padding: 12px;
        }
         
        .rht-visualization svg,
        .rht-panel svg { width: 100%; height: auto; display: block; }
        .rht-visualization h3 { margin-bottom: 6px !important; }
        
        .rht-side-by-side {
            display: grid;
            grid-template-columns: 1fr 1fr;
            gap: 12px;
            margin: 14px 0;
        }
        
        .rht-panel {
            background: #f8f9fa;
            border-radius: 10px;
            padding: 14px;
            border: 1px solid #e9ecef;
        }
        
        .rht-panel h3 {
            color: #5a4fcf;
            margin-bottom: 6px;
            font-size: 0.95em;
        }
        
        .rht-metric {
            display: flex;
            justify-content: space-between;
            align-items: center;
            padding: 6px;
            margin: 6px 0;
            background: #f8f9fa;
            border-radius: 8px;
        }
        
        .rht-metric-label {
            font-weight: 600;
            color: #495057;
            font-size: 0.85em;
        }
        
        .rht-metric-value {
            font-size: 0.95em;
            font-weight: 700;
            padding: 2px 10px;
            border-radius: 6px;
        }
        
        .rht-metric-value.bad {
            background: #fee;
            color: #c00;
        }
        
        .rht-metric-value.good {
            background: #efe;
            color: #0a0;
        }
        
        .rht-explanation {
            background: rgba(123, 104, 238, 0.08);
            border-left: 3px solid #7b68ee;
            padding: 10px;
            margin: 8px 0;
            border-radius: 8px;
            line-height: 1.5;
        }
        
        .rht-explanation h4 {
            color: #5a4fcf;
            margin-bottom: 6px;
            font-size: 1em;
        }
        
        .rht-explanation p, .rht-explanation ul {
            font-size: 0.85em;
        }
        
        .rht-explanation ul {
            margin-left: 18px;
            margin-top: 6px;
            line-height: 1.6;
            color: #495057;
        }
        
        .rht-formula {
            background: white;
            padding: 14px;
            border-radius: 8px;
            text-align: center;
            font-family: 'Times New Roman', serif;
            font-size: 0.95em;
            margin: 14px 0;
            border: 1px solid #e9ecef;
        }
        
        .rht-highlight {
            background: #fff3cd;
            padding: 2px 6px;
            border-radius: 4px;
            font-weight: 600;
        }
        
        .rht-flow-step {
            background: white;
            border: 1px solid #7b68ee;
            border-radius: 10px;
            padding: 14px;
            margin: 10px 0;
            text-align: center;
            transition: all 0.3s;
        }
        
        .rht-flow-step:hover {
            transform: translateY(-3px);
            box-shadow: 0 6px 15px rgba(123, 104, 238, 0.2);
        }
        
        .rht-flow-step h4 {
            color: #5a4fcf;
            margin-bottom: 6px;
            font-size: 0.95em;
        }
        
        .rht-flow-arrow {
            text-align: center;
            color: #7b68ee;
            font-size: 1.5em;
            margin: 6px 0;
        }
        
        .rht-heatmap-cell {
            stroke: white;
            stroke-width: 1;
        }
        
        .rht-histogram-bar {
            fill: #7b68ee;
            opacity: 0.8;
            transition: opacity 0.3s;
        }
        
        .rht-histogram-bar:hover {
            opacity: 1;
        }
        
        .rht-axis {
            font-size: 12px;
        }
        
        .rht-axis path,
        .rht-axis line {
            stroke: #adb5bd;
        }
        
        .rht-axis text {
            fill: #495057;
        }
        
        @media (max-width: 1024px) {
            .rht-side-by-side { grid-template-columns: 1fr; }
        }
        @media (max-width: 768px) {
            .rht-header h2 { font-size: 1em; }
            .rht-tab { padding: 6px 8px; font-size: 0.75em; }
            .rht-content { padding: 14px; }
        }
    </style>

    <div class="rht-header">
        <h2>🎯 Randomized Hadamard Transform (RHT)</h2>
        <p>From Outlier-Dominated Chaos to Gaussian Harmony</p>
    </div>
    
    <div class="rht-tabs">
        <button class="rht-tab active" data-tab="problem">1. The Problem</button>
        <button class="rht-tab" data-tab="transform">2. RHT Transform</button>
        <button class="rht-tab" data-tab="result">3. The Result</button>
        <button class="rht-tab" data-tab="why">4. Why It Matters</button>
    </div>
    
    <div class="rht-content">
        
        <div class="rht-tab-content active" id="problem-94bb7d58d186deb5c24f2d2e1ffb5b84">
            <h2 style="color: #5a4fcf; margin-bottom: 20px;">The Outlier Problem</h2>
            
            <div class="rht-explanation">
                <h4>⚠️ Before RHT: Weight Matrices Have Outliers</h4>
                <p>LLM weight matrices can contain a small number of weights that are orders of magnitude (100-1000×) larger than the rest. These outliers dominate the quantization range, which makes naive 2-bit compression impractical.</p>
            </div>
            
            <div class="rht-side-by-side">
                <div class="rht-panel">
                    <h3>Weight Matrix (8×8 sample)</h3>
                    <svg id="original-heatmap-94bb7d58d186deb5c24f2d2e1ffb5b84" width="100%"></svg>
                </div>
                
                <div class="rht-panel">
                    <h3>Problem Metrics</h3>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Max Weight</span>
                        <span class="rht-metric-value bad" id="max-weight-before-94bb7d58d186deb5c24f2d2e1ffb5b84">—</span>
                    </div>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Avg Weight</span>
                        <span class="rht-metric-value" id="avg-weight-before-94bb7d58d186deb5c24f2d2e1ffb5b84">—</span>
                    </div>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Outlier Ratio</span>
                        <span class="rht-metric-value bad" id="outlier-ratio-94bb7d58d186deb5c24f2d2e1ffb5b84">—</span>
                    </div>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Incoherence μ</span>
                        <span class="rht-metric-value bad" id="mu-before-94bb7d58d186deb5c24f2d2e1ffb5b84">—</span>
                    </div>
                    
                    <div class="rht-explanation" style="margin-top: 20px;">
                        <h4>📊 What This Means</h4>
                        <p>A high <strong>incoherence score (μ)</strong> means a few weights are vastly larger than others. When quantizing to a 2-bit format (only 4 possible values), this creates a dilemma:</p>
                        <ul>
                            <li><strong>Option A:</strong> Scale the quantization grid to include the outliers. This crushes all the smaller (but important!) weights to zero.</li>
                            <li><strong>Option B:</strong> Scale the grid for the normal weights. This results in catastrophic clipping errors for the outliers.</li>
                        </ul>
                        <p style="margin-top: 15px;"><strong>Both options lead to massive accuracy loss.</strong> This is why 2-bit quantization was considered impossible for so long.</p>
                    </div>
                </div>
            </div>
            
            <div class="rht-visualization">
                <h3 style="color: #5a4fcf; margin-bottom: 15px;">Weight Distribution (Before RHT)</h3>
                <svg id="dist-before-94bb7d58d186deb5c24f2d2e1ffb5b84" width="100%"></svg>
            </div>
        </div>
        
        
        <div class="rht-tab-content" id="transform-94bb7d58d186deb5c24f2d2e1ffb5b84">
            <h2 style="color: #5a4fcf; margin-bottom: 20px;">The RHT Transformation</h2>
            
            <div class="rht-explanation">
                <h4>🔄 How RHT Works</h4>
                <p>The Randomized Hadamard Transform redistributes weight magnitude using orthogonal matrices, eliminating outliers while preserving all information.</p>
            </div>
            
            <div class="rht-formula">
                <strong>W'</strong> = <strong>H</strong> · diag(<strong>S<sub>U</sub></strong>) · <strong>W</strong> · diag(<strong>S<sub>V</sub></strong>) · <strong>H</strong><sup>T</sup>
            </div>
            
            <div style="margin: 30px 0;">
                <div class="rht-flow-step">
                    <h4>Original Weights W</h4>
                    <p>Contains outliers, μ ≈ 20</p>
                </div>
                
                <div class="rht-flow-arrow">↓</div>
                
                <div class="rht-flow-step">
                    <h4>Apply Hadamard Matrix H</h4>
                    <p>Orthogonal transform with ±1 entries</p>
                    <p style="font-size: 0.9em; margin-top: 5px; color: #6c757d;">Fast: O(n log n) via FWHT</p>
                </div>
                
                <div class="rht-flow-arrow">↓</div>
                
                <div class="rht-flow-step">
                    <h4>Random Sign Flips S<sub>U</sub>, S<sub>V</sub></h4>
                    <p>Diagonal matrices with ±1 entries</p>
                    <p style="font-size: 0.9em; margin-top: 5px; color: #6c757d;">Breaks correlation patterns</p>
                </div>
                
                <div class="rht-flow-arrow">↓</div>
                
                <div class="rht-flow-step" style="border-color: #28a745; background: rgba(40, 167, 69, 0.05);">
                    <h4>✨ Transformed Weights W'</h4>
                    <p>Incoherent, μ ≈ √(log n)</p>
                    <p style="font-size: 0.9em; margin-top: 5px; color: #28a745;"><strong>Magnitude spread evenly across all weights!</strong></p>
                </div>
            </div>
            
            <div class="rht-explanation">
                <h4>🔑 Key Properties</h4>
                <ul>
                    <li><strong>Orthogonal</strong>: No information loss (invertible)</li>
                    <li><strong>Fast</strong>: O(n log n) via Fast Walsh-Hadamard Transform</li>
                    <li><strong>Hardware-friendly</strong>: Entries are ±1, so it needs only additions and subtractions (no floating-point multiplies!)</li>
                    <li><strong>Proven bound</strong>: μ<sub>RHT</sub> = √(2 log(2n²/δ)) with high probability</li>
                </ul>
            </div>
            
            <div class="rht-side-by-side" style="margin-top: 30px;">
                <div class="rht-panel">
                    <h3>Before vs After</h3>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Incoherence μ</span>
                        <div>
                            <span class="rht-metric-value bad">20.4</span> → 
                            <span class="rht-metric-value good">2.1</span>
                        </div>
                    </div>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Max/Avg Ratio</span>
                        <div>
                            <span class="rht-metric-value bad">152×</span> → 
                            <span class="rht-metric-value good">4.2×</span>
                        </div>
                    </div>
                    <p style="margin-top: 15px; color: #495057; line-height: 1.6;">
                        <strong style="color: #28a745;">10× improvement</strong> in incoherence! No single weight dominates anymore.
                    </p>
                </div>
                
                <div class="rht-panel">
                    <h3>⚡ Computational Cost</h3>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Complexity</span>
                        <span class="rht-metric-value good">O(n log n)</span>
                    </div>
                    <div class="rht-metric">
                        <span class="rht-metric-label">vs. O(n√n) at n=28,672</span>
                        <span class="rht-metric-value good">≈11× fewer ops</span>
                    </div>
                    <p style="margin-top: 15px; color: #495057; line-height: 1.6;">
                        The RHT is not only more effective than the Kronecker product method used in the original QuIP paper, but it's also significantly faster to compute.
                    </p>
                </div>
            </div>
        </div>
        
        
        <div class="rht-tab-content" id="result-94bb7d58d186deb5c24f2d2e1ffb5b84">
            <h2 style="color: #5a4fcf; margin-bottom: 20px;">The Transformed Result</h2>
            
            <div class="rht-explanation">
                <h4>✨ After RHT: Gaussian Magic</h4>
                <p>The transformed weights follow an approximately Gaussian (normal) distribution. No outliers, no single dominant weight—just a smooth bell curve!</p>
            </div>
            
            <div class="rht-side-by-side">
                <div class="rht-panel">
                    <h3>Transformed Matrix (8×8)</h3>
                    <svg id="transformed-heatmap-94bb7d58d186deb5c24f2d2e1ffb5b84" width="100%"></svg>
                </div>
                
                <div class="rht-panel">
                    <h3>Success Metrics ✓</h3>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Max Weight</span>
                        <span class="rht-metric-value good" id="max-weight-after-94bb7d58d186deb5c24f2d2e1ffb5b84">—</span>
                    </div>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Avg Weight</span>
                        <span class="rht-metric-value" id="avg-weight-after-94bb7d58d186deb5c24f2d2e1ffb5b84">—</span>
                    </div>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Max/Avg Ratio</span>
                        <span class="rht-metric-value good" id="ratio-after-94bb7d58d186deb5c24f2d2e1ffb5b84">—</span>
                    </div>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Incoherence μ</span>
                        <span class="rht-metric-value good" id="mu-after-94bb7d58d186deb5c24f2d2e1ffb5b84">—</span>
                    </div>
                    
                    <div class="rht-explanation" style="margin-top: 20px;">
                        <h4>🎯 Ready for Quantization</h4>
                        <p>With a low incoherence score, we can now define a quantization grid that treats all weights with similar importance. No single weight will dominate and cause large errors.</p>
                    </div>
                </div>
            </div>
            
            <div class="rht-visualization">
                <h3 style="color: #5a4fcf; margin-bottom: 8px;">Weight Distribution (After RHT)</h3>
                <svg id="dist-after-94bb7d58d186deb5c24f2d2e1ffb5b84" width="100%"></svg>
                <div class="rht-explanation" style="margin-top: 8px;">
                    <h4>📊 Gaussian Distribution Properties</h4>
                    <ul>
                        <li><strong>Smooth bell curve</strong>: No spikes or outliers</li>
                        <li><strong>Symmetric</strong>: Equal spread around zero</li>
                        <li><strong>Ball-shaped in high dimensions</strong>: Perfect match for E8 lattice!</li>
                        <li><strong>Predictable error bounds</strong>: We can prove theoretical guarantees</li>
                    </ul>
                </div>
            </div>
            
            <div class="rht-visualization">
                <h3 style="color: #5a4fcf; margin-bottom: 8px;">2D Scatter: Ball-Shaped Distribution</h3>
                <svg id="scatter-plot-94bb7d58d186deb5c24f2d2e1ffb5b84" width="100%"></svg>
                <p style="text-align: center; color: #6c757d; margin-top: 6px; font-size: 0.85em;">
                    Transformed weights form a radially symmetric "ball" shape—ideal for vector quantization!
                </p>
            </div>
        </div>
        
        
        <div class="rht-tab-content" id="why-94bb7d58d186deb5c24f2d2e1ffb5b84">
            <h2 style="color: #5a4fcf; margin-bottom: 20px;">Why RHT Unlocks 2-Bit Performance</h2>
            
            <div class="rht-explanation">
                <h4>🎯 The Synergy of QuIP#</h4>
                <p>RHT is the crucial first step that makes the rest of the QuIP# algorithm possible. It prepares the weights into a format that is perfectly suited for the subsequent E8 lattice quantization.</p>
            </div>
            
            <div style="margin: 30px 0;">
                <div class="rht-flow-step" style="background: rgba(102, 126, 234, 0.05);">
                    <h4>1️⃣ RHT Transform</h4>
                    <p><strong>Input:</strong> Weights with outliers (μ ≈ 20)</p>
                    <p><strong>Output:</strong> Gaussian distribution (μ ≈ √log n)</p>
                    <p style="font-size: 0.9em; margin-top: 8px; color: #667eea;">
                        <strong>Key insight:</strong> Spreads magnitude evenly across all dimensions
                    </p>
                </div>
                
                <div class="rht-flow-arrow">↓</div>
                
                <div class="rht-flow-step" style="background: rgba(118, 75, 162, 0.05);">
                    <h4>2️⃣ Perfect Match for E8</h4>
                    <p><strong>Gaussian → Ball-shaped in 8D</strong></p>
                    <p><strong>E8 lattice → Proven optimal sphere packing</strong></p>
                    <p style="font-size: 0.9em; margin-top: 8px; color: #764ba2;">
                        <strong>Key insight:</strong> Each E8 lattice point touches 240 nearest neighbors, so the lattice packs a ball-shaped distribution very densely!
                    </p>
                </div>
                
                <div class="rht-flow-arrow">↓</div>
                
                <div class="rht-flow-step" style="background: rgba(40, 167, 69, 0.05); border-color: #28a745;">
                    <h4>3️⃣ Minimal Quantization Error</h4>
                    <p><strong>Error ∝ μ² · σ²</strong></p>
                    <p>Small μ (from RHT) + small σ² (from E8 lattice) = Near-lossless 2-bit!</p>
                    <p style="font-size: 0.9em; margin-top: 8px; color: #28a745;">
                        <strong>Result:</strong> 4.16 PPL on Llama 2 70B (vs 7.81 for OmniQuant)
                    </p>
                </div>
            </div>
            
            <div class="rht-formula" style="margin-top: 30px;">
                <div style="margin-bottom: 15px; font-size: 1.1em; color: #5a4fcf;">
                    <strong>Theoretical Error Bound</strong>
                </div>
                𝔼[Error] ≤ (g · m · <span class="rht-highlight">μ²</span> · <span class="rht-highlight">σ²</span> / n) · tr(H<sup>1/2</sup>)²
                <div style="margin-top: 15px; font-size: 0.9em; color: #495057; text-align: left; padding: 0 20px;">
                    <p><strong><span class="rht-highlight">μ²</span></strong> = Incoherence (minimized by RHT: 20² → 2²)</p>
                    <p style="margin-top: 8px;"><strong><span class="rht-highlight">σ²</span></strong> = Quantization noise (minimized by E8 lattice)</p>
                </div>
            </div>
            
            <div class="rht-side-by-side" style="margin-top: 40px;">
                <div class="rht-panel">
                    <h3>🚫 Without RHT</h3>
                    <div class="rht-metric">
                        <span class="rht-metric-label">μ² contribution</span>
                        <span class="rht-metric-value bad">20² = 400</span>
                    </div>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Resulting PPL</span>
                        <span class="rht-metric-value bad">7.81 (OmniQuant)</span>
                    </div>
                    <div class="rht-explanation" style="margin-top: 15px;">
                        <p>Without making the weights incoherent, the quantization error from outliers is catastrophic, leading to a massive drop in model performance.</p>
                    </div>
                </div>
                
                <div class="rht-panel">
                    <h3>✅ With RHT</h3>
                    <div class="rht-metric">
                        <span class="rht-metric-label">μ² contribution</span>
                        <span class="rht-metric-value good">2² = 4</span>
                    </div>
                    <div class="rht-metric">
                        <span class="rht-metric-label">Resulting PPL</span>
                        <span class="rht-metric-value good">4.16 (QuIP#)</span>
                    </div>
                    <div class="rht-explanation" style="margin-top: 15px;">
                        <p>By reducing the μ² term by <strong>100×</strong> (μ itself drops by about 10×), RHT drastically cuts down the quantization error, paving the way for near-lossless 2-bit compression.</p>
                    </div>
                </div>
            </div>
            
            <div class="rht-explanation" style="margin-top: 40px; background: linear-gradient(135deg, rgba(40, 167, 69, 0.1), rgba(40, 167, 69, 0.05)); border-color: #28a745;">
                <h4 style="color: #28a745;">🎓 The "Aha!" Moment</h4>
                <p style="font-size: 1.1em; line-height: 1.8;">
                    RHT transforms the <strong>impossible</strong> (quantizing outlier-dominated weights) into the <strong>natural</strong> (quantizing a Gaussian distribution with optimal sphere packing). 
                </p>
                <p style="margin-top: 15px; line-height: 1.8;">
                    It's not fighting against the math; it's <em>aligning</em> with it. A roughly ball-shaped Gaussian source is exactly the kind of input that a dense sphere packing like the E8 lattice is well suited to quantize.
                </p>
                <p style="margin-top: 15px; line-height: 1.8;">
                    <strong>The beauty of QuIP#:</strong> Every component (RHT, E8, Block-LDLQ) is mathematically principled and directly addresses a term in the theoretical error bound. It's a testament to solving problems from first principles.
                </p>
            </div>
        </div>
    </div>
</div>

<script>
    (function() {
        const uniqueId = '94bb7d58d186deb5c24f2d2e1ffb5b84';
        
        if (typeof d3 === 'undefined') {
            console.warn('D3.js not loaded for RHT visualization');
            return;
        }
        
        function drawAllVisualizations() {
            drawHeatmap('original-heatmap-' + uniqueId, originalMatrix);
            drawHeatmap('transformed-heatmap-' + uniqueId, transformedMatrix);
            drawHistogram('dist-before-' + uniqueId, originalMatrix, 'Before RHT: Spiky with Outliers');
            drawHistogram('dist-after-' + uniqueId, transformedMatrix, 'After RHT: Smooth Gaussian');
            drawScatterPlot();
        }

        
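        // Scope the lookup to this widget so several instances on one page don't share handlers.
        document.querySelectorAll('#rht-container-' + uniqueId + ' .rht-tab').forEach(tab => {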
            tab.addEventListener('click', function() {
                const tabId = this.dataset.tab;
                const containerId = 'rht-container-' + uniqueId;
                const container = document.getElementById(containerId);
                
                if (!container) return;
                
                container.querySelectorAll('.rht-tab').forEach(t => t.classList.remove('active'));
                container.querySelectorAll('.rht-tab-content').forEach(c => c.classList.remove('active'));
                
                this.classList.add('active');
                const content = document.getElementById(tabId + '-' + uniqueId);
                if (content) content.classList.add('active');
                
                
                setTimeout(() => drawAllVisualizations(), 10);
            });
        });
        
        
        function generateWeightMatrix(size, withOutliers = true) {
            const matrix = [];
            for (let i = 0; i < size; i++) {
                const row = [];
                for (let j = 0; j < size; j++) {
                    if (withOutliers && Math.random() < 0.05) {
                        row.push((Math.random() - 0.5) * 30);
                    } else {
                        row.push((Math.random() - 0.5) * 0.4);
                    }
                }
                matrix.push(row);
            }
            return matrix;
        }
        
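        // Apply an actual randomized Hadamard transform: W' = (H · S_U · W · S_V · Hᵀ) / n.
        // Assumes a square matrix whose size is a power of 2 (8 in this demo).
        function applyRHT(matrix) {
            const n = matrix.length;
            // Sylvester Hadamard entry: H[i][j] = (-1)^popcount(i & j)
            const h = (i, j) => {
                let bits = i & j, sign = 1;
                while (bits) { sign = -sign; bits &= bits - 1; }
                return sign;
            };
            const randomSigns = () => Array.from({length: n}, () => (Math.random() < 0.5 ? -1 : 1));
            const sU = randomSigns();
            const sV = randomSigns();
            // Sign-flip rows and columns: A = diag(sU) · W · diag(sV)
            const a = matrix.map((row, i) => row.map((v, j) => sU[i] * v * sV[j]));
            // left = H · A
            const left = a.map((_, i) => a[0].map((_, j) =>
                a.reduce((sum, rowK, k) => sum + h(i, k) * rowK[j], 0)));
            // result = left · Hᵀ, divided by n so the overall transform is orthogonal
            return left.map(row => row.map((_, j) =>
                row.reduce((sum, v, k) => sum + v * h(j, k), 0) / n));
        }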
        
        function calculateMetrics(matrix) {
            const flat = matrix.flat();
            const absValues = flat.map(Math.abs);
            const max = Math.max(...absValues);
            const avg = absValues.reduce((a, b) => a + b, 0) / absValues.length;
            const size = matrix.length;
            const frobNorm = Math.sqrt(flat.reduce((sum, val) => sum + val * val, 0));
            const mu = max / (frobNorm / Math.sqrt(size * size));
            
            return {
                max: max.toFixed(2),
                avg: avg.toFixed(2),
                ratio: (max / avg).toFixed(1),
                mu: mu.toFixed(1)
            };
        }
        
        const originalMatrix = generateWeightMatrix(8, true);
        const transformedMatrix = applyRHT(originalMatrix);
        
        const metricsOriginal = calculateMetrics(originalMatrix);
        const metricsTransformed = calculateMetrics(transformedMatrix);
        
        document.getElementById('max-weight-before-' + uniqueId).textContent = metricsOriginal.max;
        document.getElementById('avg-weight-before-' + uniqueId).textContent = metricsOriginal.avg;
        document.getElementById('outlier-ratio-' + uniqueId).textContent = metricsOriginal.ratio + '×';
        document.getElementById('mu-before-' + uniqueId).textContent = 'μ = ' + metricsOriginal.mu;
        
        document.getElementById('max-weight-after-' + uniqueId).textContent = metricsTransformed.max;
        document.getElementById('avg-weight-after-' + uniqueId).textContent = metricsTransformed.avg;
        document.getElementById('ratio-after-' + uniqueId).textContent = metricsTransformed.ratio + '×';
        document.getElementById('mu-after-' + uniqueId).textContent = 'μ = ' + metricsTransformed.mu;
        
        function drawHeatmap(svgId, matrix) {
            const svg = d3.select('#' + svgId);
            svg.selectAll('*').remove();
            const container = svg.node().parentElement;
            if (!container || container.clientWidth === 0) return;
            const width = container.clientWidth;
            const height = 240;
            const size = matrix.length;
            const cellSize = Math.min(width, height - 60) / size;
            
            svg.attr('height', height);
            
            const g = svg.append('g')
                .attr('transform', `translate(${(width - cellSize * size) / 2}, 16)`);
            
            const maxAbs = Math.max(...matrix.flat().map(Math.abs));
            const colorScale = d3.scaleSequential(d3.interpolatePuOr)
                .domain([maxAbs, -maxAbs]);
            
            matrix.forEach((row, i) => {
                row.forEach((value, j) => {
                    g.append('rect')
                        .attr('class', 'rht-heatmap-cell')
                        .attr('x', j * cellSize)
                        .attr('y', i * cellSize)
                        .attr('width', cellSize)
                        .attr('height', cellSize)
                        .attr('fill', colorScale(value));
                });
            });
            
            const legendWidth = 200;
            const legendHeight = 20;
            const legendG = svg.append('g')
                .attr('transform', `translate(${(width - legendWidth) / 2}, ${height - 30})`);
            
            const gradient = svg.append('defs')
                .append('linearGradient')
                .attr('id', svgId + '-gradient')
                .attr('x1', '0%')
                .attr('x2', '100%');
            
            gradient.append('stop').attr('offset', '0%').attr('stop-color', colorScale(maxAbs));
            gradient.append('stop').attr('offset', '50%').attr('stop-color', colorScale(0));
            gradient.append('stop').attr('offset', '100%').attr('stop-color', colorScale(-maxAbs));
            
            legendG.append('rect')
                .attr('width', legendWidth)
                .attr('height', legendHeight)
                .style('fill', `url(#${svgId}-gradient)`);
        }
        
        function drawHistogram(svgId, matrix, title) {
            const svg = d3.select('#' + svgId);
            svg.selectAll('*').remove();
            const container = svg.node().parentElement;
            if (!container || container.clientWidth === 0) return;
            const width = container.clientWidth;
            const height = 260;
            const margin = {top: 20, right: 20, bottom: 30, left: 40};
            const innerWidth = width - margin.left - margin.right;
            const innerHeight = height - margin.top - margin.bottom;
            
            svg.attr('height', height);
            
            const flat = matrix.flat();
            const numBins = 24;
            const rawExtent = d3.extent(flat);
            const maxAbsDomain = Math.max(Math.abs(rawExtent[0] || 0), Math.abs(rawExtent[1] || 0));
            const domain = [-maxAbsDomain, maxAbsDomain];
            const binWidth = (domain[1] - domain[0]) / numBins;
            
            const bins = d3.bin()
                .domain(domain)
                .thresholds(d3.range(domain[0], domain[1] + 1e-6, binWidth))(flat);
            
            const x = d3.scaleLinear()
                .domain(domain)
                .range([0, innerWidth]);
            
            const y = d3.scaleLinear()
                .domain([0, d3.max(bins, d => d.length)])
                .range([innerHeight, 0]);
            
            const g = svg.append('g')
                .attr('transform', `translate(${margin.left},${margin.top})`);
            
            g.selectAll('rect')
                .data(bins)
                .join('rect')
                .attr('class', 'rht-histogram-bar')
                .attr('x', d => x(d.x0))
                .attr('y', d => y(d.length))
                .attr('width', d => Math.max(0, x(d.x1) - x(d.x0) - 1))
                .attr('height', d => innerHeight - y(d.length));
            
            g.append('g')
                .attr('class', 'rht-axis')
                .attr('transform', `translate(0,${innerHeight})`)
                .call(d3.axisBottom(x).ticks(6));
            
            g.append('g')
                .attr('class', 'rht-axis')
                .call(d3.axisLeft(y).ticks(5));
        }
        
        function drawScatterPlot() {
            const svg = d3.select('#scatter-plot-' + uniqueId);
            svg.selectAll('*').remove();
            const container = svg.node().parentElement;
            if (!container || container.clientWidth === 0) return;
            const width = container.clientWidth;
            const height = 260;
            const margin = {top: 20, right: 20, bottom: 30, left: 40};
            const innerWidth = width - margin.left - margin.right;
            const innerHeight = height - margin.top - margin.bottom;
            
            svg.attr('height', height);
            
            const numPoints = 300;
            const data = [];
            for (let i = 0; i < numPoints; i++) {
                const u1 = 1 - Math.random(); // shift to (0, 1] so Math.log never sees 0
                const u2 = Math.random();
                const r = Math.sqrt(-2 * Math.log(u1));
                const theta = 2 * Math.PI * u2;
                data.push({
                    x: r * Math.cos(theta),
                    y: r * Math.sin(theta)
                });
            }
            
            const extent = 4;
            const x = d3.scaleLinear()
                .domain([-extent, extent])
                .range([0, innerWidth]);
            
            const y = d3.scaleLinear()
                .domain([-extent, extent])
                .range([innerHeight, 0]);
            
            const g = svg.append('g')
                .attr('transform', `translate(${margin.left},${margin.top})`);
            
            g.append('circle')
                .attr('cx', x(0))
                .attr('cy', y(0))
                .attr('r', x(2) - x(0))
                .attr('fill', 'none')
                .attr('stroke', '#7b68ee')
                .attr('stroke-width', 1.5)
                .attr('stroke-dasharray', '4,4')
                .attr('opacity', 0.5);

            g.selectAll('circle.scatter-point')
                .data(data)
                .join('circle')
                .attr('class', 'scatter-point')
                .attr('cx', d => x(d.x))
                .attr('cy', d => y(d.y))
                .attr('r', 2.5)
                .attr('fill', '#5a4fcf')
                .attr('opacity', 0.6);
            
            g.append('g')
                .attr('class', 'rht-axis')
                .attr('transform', `translate(0,${innerHeight})`)
                .call(d3.axisBottom(x).ticks(6));
            
            g.append('g')
                .attr('class', 'rht-axis')
                .call(d3.axisLeft(y).ticks(5));
        }
        
        
        drawAllVisualizations();
    })();
</script>

<p>With the weights now approximately Gaussian and largely free of outliers, we face a new challenge: how do we quantize this ball-shaped distribution efficiently? This is where the mathematics of sphere packing becomes crucial.</p>
<hr>
<h2 id="5-pillar-2-vector-quantization-with-e8-lattice-codebooks">5. Pillar 2: Vector Quantization with E8 Lattice Codebooks</h2>
<h3 id="51-the-shape-matching-problem">5.1 The Shape-Matching Problem</h3>
<h4 id="511-ball-shaped-gaussian-distribution">5.1.1 Ball-Shaped Gaussian Distribution</h4>
<p>After RHT, weights are approximately Gaussian. In multiple dimensions, this creates a <strong>&ldquo;ball shape&rdquo;</strong>—weights are radially symmetric around the origin.</p>
<p>Think of throwing darts at a dartboard. Most land near the center, fewer toward the edges. In 8 dimensions, Gaussian weights do the same thing—they cluster in a &ldquo;ball&rdquo; around zero.</p>
<h4 id="512-why-scalar-quantization-fails">5.1.2 Why Scalar Quantization Fails</h4>
<p><strong>Scalar quantization</strong> treats each dimension independently. This creates a <strong>hypercube</strong> of representable points.</p>
<p>The problem: In high dimensions, most of the cube&rsquo;s volume is in the corners, but <em>weights almost never appear there</em> (Gaussian samples concentrate in a ball around the origin). We&rsquo;re wasting precious bits on regions of space we&rsquo;ll almost never use!</p>
<p><strong>The Math</strong>: For a d-dimensional cube, the fraction of volume lying outside the inscribed ball (the &ldquo;corner&rdquo; waste) grows quickly with d:</p>
<ul>
<li>2D: ~21% waste</li>
<li>4D: ~69% waste</li>
<li>8D: ~98% waste</li>
</ul>
<p>At 2 bits, we can&rsquo;t afford to waste ~98% of our representable space!</p>
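<p>To see how fast the corners take over, the fraction of a cube that lies outside its inscribed ball can be computed directly. This is a minimal sketch using the standard recurrence for unit-ball volumes; the function names are illustrative only.</p>

```javascript
// Volume of the unit ball in d dimensions, via the recurrence
// V_d = (2 * PI / d) * V_{d-2}, starting from V_0 = 1 and V_1 = 2.
function unitBallVolume(d) {
    if (d === 0) return 1;
    if (d === 1) return 2;
    return (2 * Math.PI / d) * unitBallVolume(d - 2);
}

// The ball inscribed in a unit cube has radius 1/2,
// so it occupies V_d * (1/2)^d of the cube's volume.
function inscribedBallFraction(d) {
    return unitBallVolume(d) * Math.pow(0.5, d);
}

for (const d of [2, 4, 8]) {
    const wasted = 1 - inscribedBallFraction(d);
    console.log(d + 'D: ' + (wasted * 100).toFixed(1) + '% of the cube lies outside the ball');
}
```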
<h3 id="52-enter-vector-quantization">5.2 Enter Vector Quantization</h3>
<p><strong>The Idea</strong></p>
<p>Instead of quantizing each weight individually, we quantize <strong>d weights together</strong> as a d-dimensional vector.</p>
<ul>
<li><strong>Scalar quantization</strong>: 4 values per dimension → hypercube</li>
<li><strong>Vector quantization</strong>: Shape the codebook to match the actual distribution → sphere</li>
</ul>
<p><strong>The Trade-off</strong></p>
<p>Vector quantization has exponential cost:</p>
<ul>
<li><strong>Codebook size</strong>: 2^(k·d) entries for k bits per weight and dimension d</li>
<li><strong>Example</strong>: 2 bits, 8 dimensions → 2^16 = 65,536 codewords</li>
</ul>
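<p>The growth is easy to tabulate. The sketch below simply evaluates 2^(k·d) at 2 bits per weight for a few dimensions; <code>codebookEntries</code> is a hypothetical helper name.</p>

```javascript
// Number of codewords an unstructured vector quantizer needs:
// k bits per weight and d weights per vector give 2^(k*d) entries.
function codebookEntries(bitsPerWeight, dimension) {
    return Math.pow(2, bitsPerWeight * dimension);
}

for (const d of [1, 2, 4, 8, 16]) {
    const entries = codebookEntries(2, d);
    console.log('d = ' + d + ': ' + entries.toLocaleString('en-US') + ' codewords');
}
```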
<p>This is where the <strong>E8 lattice</strong> becomes magical.</p>
<h3 id="53-the-sphere-packing-problem">5.3 The Sphere Packing Problem</h3>
<h4 id="531-what-is-sphere-packing">5.3.1 What Is Sphere Packing?</h4>
<p><strong>The Question</strong>: How do you arrange equal-sized, non-overlapping spheres so that they fill as much of space as possible?</p>
<p>This is a centuries-old mathematical problem, dating back to Kepler&rsquo;s study of cannonball stacking in 1611.</p>
<p><strong>Relevance to Quantization</strong>: Each codebook entry is a <strong>sphere center</strong>. The sphere radius determines how far weights can be from that center. Better packing = smaller max distance = lower quantization error.</p>
<p><strong>2D Examples</strong>:</p>
<ul>
<li><strong>Square packing</strong>: 78.5% efficiency, 4 neighbors touching each sphere</li>
<li><strong>Hexagonal packing</strong>: 90.7% efficiency, 6 neighbors touching each sphere</li>
</ul>
<p>The hexagonal packing is <strong>proven optimal</strong> in 2D. We want the same for higher dimensions!</p>
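<p>Both 2D efficiencies follow from comparing a circle&rsquo;s area with the area of the cell it occupies. A short sketch of that arithmetic (the helper names are illustrative):</p>

```javascript
// Square grid: a circle of radius r sits in a 2r x 2r square cell.
function squarePackingDensity(r) {
    return (Math.PI * r * r) / ((2 * r) * (2 * r));
}

// Hexagonal grid: a circle of radius r sits in a regular hexagonal cell
// with inradius r, whose area is 2 * sqrt(3) * r^2.
function hexPackingDensity(r) {
    return (Math.PI * r * r) / (2 * Math.sqrt(3) * r * r);
}

const square = squarePackingDensity(1);
const hex = hexPackingDensity(1);
console.log('square: ' + (square * 100).toFixed(1) + '%');
console.log('hexagonal: ' + (hex * 100).toFixed(1) + '%');
console.log('waste ratio: ' + ((1 - square) / (1 - hex)).toFixed(1) + 'x');
```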
<h4 id="532-the-kissing-number">5.3.2 The Kissing Number</h4>
<p>The <strong>kissing number</strong> is the maximum number of non-overlapping, equal-sized spheres that can touch one central sphere.</p>
<p><strong>Examples</strong>:</p>
<ul>
<li><strong>2D square grid</strong>: 4 neighbors</li>
<li><strong>2D hexagonal</strong>: 6 neighbors</li>
<li><strong>3D</strong>: 12 neighbors (think of oranges at the grocery store)</li>
<li><strong>8D E8 lattice</strong>: <strong>240 neighbors!</strong></li>
</ul>
<p>For the lattices used here, a higher kissing number goes hand in hand with denser packing and better quantization!</p>
<style>
    .sphere-container {
        max-width: 1400px;
        margin: 20px auto;
        background: white;
        border-radius: 16px;
        box-shadow: 0 10px 30px rgba(0,0,0,0.1);
        overflow: hidden;
        font-size: 0.9rem;
    }
    
    .sphere-header {
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        color: white;
        padding: 25px;
        text-align: center;
    }
    
    .sphere-header h1 {
        font-size: 1.6em;
        margin-bottom: 8px;
        font-weight: 600;
    }
    
    .sphere-header p {
        font-size: 1em;
        opacity: 0.9;
    }
    
    .sphere-tabs {
        display: flex;
        background: #f8f9fa;
        border-bottom: 2px solid #e9ecef;
        overflow-x: auto;
    }
    
    .sphere-tab {
        padding: 15px 20px;
        cursor: pointer;
        background: transparent;
        border: none;
        font-size: 0.9em;
        font-weight: 500;
        color: #495057;
        transition: all 0.3s ease;
        position: relative;
        white-space: nowrap;
    }
    
    .sphere-tab:hover {
        background: rgba(102, 126, 234, 0.1);
        color: #667eea;
    }
    
    .sphere-tab.active {
        color: #667eea;
        background: white;
    }
    
    .sphere-tab.active::after {
        content: '';
        position: absolute;
        bottom: -2px;
        left: 0;
        right: 0;
        height: 3px;
        background: #667eea;
    }
    
    .sphere-content {
        padding: 25px;
    }

    .sphere-tab-content h2 {
        font-size: 1.4em;
        color: #667eea;
        margin-bottom: 15px;
    }
    
    .sphere-tab-content {
        display: none;
    }
    
    .sphere-tab-content.active {
        display: block;
        animation: sphere-fadeIn 0.5s ease;
    }
    
    @keyframes sphere-fadeIn {
        from { opacity: 0; transform: translateY(10px); }
        to { opacity: 1; transform: translateY(0); }
    }
    
    .sphere-side-by-side {
        display: grid;
        grid-template-columns: 1fr 1fr;
        gap: 20px;
        margin: 15px 0;
    }
    
    .sphere-panel {
        background: white;
        border-radius: 12px;
        padding: 20px;
        box-shadow: 0 4px 6px rgba(0,0,0,0.05);
        border: 1px solid #e9ecef;
    }
    
    .sphere-panel h3 {
        color: #667eea;
        margin-bottom: 15px;
        font-size: 1.1em;
    }
    
    .sphere-explanation {
        background: linear-gradient(135deg, rgba(102, 126, 234, 0.1), rgba(118, 75, 162, 0.1));
        border-left: 4px solid #667eea;
        padding: 15px;
        margin: 15px 0;
        border-radius: 8px;
    }
    
    .sphere-explanation h4 {
        color: #667eea;
        margin-bottom: 10px;
    }
    
    .sphere-explanation p, .sphere-explanation ul {
        color: #495057;
        line-height: 1.5;
        font-size: 0.95em;
    }
    
    .sphere-metric {
        display: flex;
        justify-content: space-between;
        align-items: center;
        padding: 12px;
        margin: 8px 0;
        background: #f8f9fa;
        border-radius: 8px;
    }
    
    .sphere-metric-label {
        font-weight: 600;
        color: #495057;
    }
    
    .sphere-metric-value {
        font-size: 1.1em;
        font-weight: 600;
        padding: 4px 12px;
        border-radius: 6px;
    }
    
    .sphere-metric-value.bad {
        background: #fee;
        color: #c00;
    }
    
    .sphere-metric-value.good {
        background: #efe;
        color: #0a0;
    }
    
    .sphere-metric-value.neutral {
        background: #e9ecef;
        color: #495057;
    }

    .sphere-visualization-box {
        background: #f8f9fa;
        border-radius: 12px;
        padding: 20px;
        margin: 15px 0;
        text-align: center;
    }

    .sphere-comparison-table {
        width: 100%;
        border-collapse: collapse;
        margin: 20px 0;
    }
    
    .sphere-comparison-table th,
    .sphere-comparison-table td {
        padding: 12px;
        text-align: left;
        border-bottom: 1px solid #e9ecef;
    }
    
    .sphere-comparison-table th {
        background: #f8f9fa;
        color: #667eea;
        font-weight: 600;
    }
    
    .sphere-comparison-table tr:hover {
        background: rgba(102, 126, 234, 0.05);
    }

    .sphere-highlight {
        background: #fff3cd;
        padding: 2px 6px;
        border-radius: 4px;
        font-weight: 600;
    }
    
    @media (max-width: 768px) {
        .sphere-side-by-side {
            grid-template-columns: 1fr;
        }
        
        .sphere-header h1 {
            font-size: 1.4em;
        }
        
        .sphere-tab {
            padding: 12px 15px;
            font-size: 0.85em;
        }
    }
</style>

<div class="sphere-container">
    <div class="sphere-header">
        <h1>🔮 2D Intuition: Why Vector Quantization Wins</h1>
        <p>Understanding optimal packing before we dive into the 8D E8 lattice</p>
    </div>
    
    <div class="sphere-tabs">
        <button class="sphere-tab active" data-tab="sphere-square">Square Packing</button>
        <button class="sphere-tab" data-tab="sphere-hex">Hexagonal Packing</button>
    </div>
    
    <div class="sphere-content">
        
        <div class="sphere-tab-content active" id="sphere-square">
            <h2>Square Packing: Scalar Quantization</h2>
            
            <div class="sphere-explanation">
                <h4>📐 The Simple Approach</h4>
                <p>Arrange spheres in a square grid—this is what <strong>scalar quantization</strong> does when it treats each dimension independently.</p>
            </div>

            <div class="sphere-visualization-box">
                <svg id="sphere-square-viz" width="100%" height="400"></svg>
            </div>
            
            <div class="sphere-side-by-side">
                <div class="sphere-panel">
                    <h3>📊 Square Packing Metrics</h3>
                    <div class="sphere-metric">
                        <span class="sphere-metric-label">Packing Efficiency</span>
                        <span class="sphere-metric-value bad">78.5%</span>
                    </div>
                    <div class="sphere-metric">
                        <span class="sphere-metric-label">Kissing Number</span>
                        <span class="sphere-metric-value neutral">4</span>
                    </div>
                    <div class="sphere-metric">
                        <span class="sphere-metric-label">Wasted Space</span>
                        <span class="sphere-metric-value bad">21.5%</span>
                    </div>
                </div>
                
                <div class="sphere-panel">
                    <h3>⚠️ The Problem</h3>
                    <p><strong>21.5% of space is wasted!</strong></p>
                    <p style="margin-top: 15px;">Each sphere only touches 4 neighbors. There are large gaps between spheres that could be filled more efficiently.</p>
                    <p style="margin-top: 15px; color: #dc3545; font-weight: 600;">At 2-bit quantization with only 4 values per dimension, we can't afford this waste.</p>
                </div>
            </div>

            <div class="sphere-explanation">
                <h4>🔍 What This Means for Quantization</h4>
                <p>When you quantize each weight dimension independently (scalar quantization), you're using square packing. The 21.5% waste means some weights will be far from their nearest codebook entry, causing larger errors.</p>
            </div>
        </div>
        
        
        <div class="sphere-tab-content" id="sphere-hex">
            <h2>Hexagonal Packing: Vector Quantization</h2>
            
            <div class="sphere-explanation">
                <h4>🏆 The Optimal Solution in 2D</h4>
                <p>Arrange spheres in a hexagonal pattern—this is <strong>proven optimal</strong> in 2 dimensions. This is what vector quantization achieves!</p>
            </div>

            <div class="sphere-visualization-box">
                <svg id="sphere-hex-viz" width="100%" height="400"></svg>
            </div>
            
            <div class="sphere-side-by-side">
                <div class="sphere-panel">
                    <h3>📊 Hexagonal Packing Metrics</h3>
                    <div class="sphere-metric">
                        <span class="sphere-metric-label">Packing Efficiency</span>
                        <span class="sphere-metric-value good">90.7%</span>
                    </div>
                    <div class="sphere-metric">
                        <span class="sphere-metric-label">Kissing Number</span>
                        <span class="sphere-metric-value good">6</span>
                    </div>
                    <div class="sphere-metric">
                        <span class="sphere-metric-label">Wasted Space</span>
                        <span class="sphere-metric-value good">9.3%</span>
                    </div>
                    <div class="sphere-metric">
                        <span class="sphere-metric-label">Improvement</span>
                        <span class="sphere-metric-value good">+15.5%</span>
                    </div>
                </div>
                
                <div class="sphere-panel">
                    <h3>✨ The Breakthrough</h3>
                    <p><strong>Only 9.3% waste—2.3× better than square packing!</strong></p>
                    <p style="margin-top: 15px;">Each sphere touches 6 neighbors (50% more than square). The spheres nestle into each other's gaps, minimizing wasted space.</p>
                    <p style="margin-top: 15px; color: #28a745; font-weight: 600;">This is the power of vector quantization!</p>
                </div>
            </div>

            <div class="sphere-explanation" style="background: linear-gradient(135deg, rgba(40, 167, 69, 0.1), rgba(40, 167, 69, 0.05)); border-color: #28a745;">
                <h4 style="color: #28a745;">🎓 Scaling to 8D</h4>
                <p>In 2D, hexagonal packing is proven optimal. In 8D, the <strong>E8 lattice</strong> is proven optimal (Viazovska, 2016, Fields Medal 2022).</p>
                <p style="margin-top: 10px;">E8 achieves a kissing number of <strong>240</strong> in 8D. That's 15× the kissing number of the simple cubic lattice (16)! This 2D intuition extends beautifully to higher dimensions.</p>
            </div>
        </div>
        
    </div>
    
    <div style="background: #f8f9fa; padding: 20px; text-align: center; border-top: 2px solid #e9ecef; color: #6c757d; font-size: 0.95em;">
        <p style="margin: 0;"><strong>Next:</strong> See how this 2D intuition scales to 8D with the E8 lattice visualization below ↓</p>
    </div>
</div>

<script>
    if (typeof d3 === 'undefined') { console.warn('D3.js not loaded'); }
    
    document.querySelectorAll('.sphere-tab').forEach(tab => {
        tab.addEventListener('click', function() {
            const tabId = this.dataset.tab;
            document.querySelectorAll('.sphere-tab').forEach(t => t.classList.remove('active'));
            document.querySelectorAll('.sphere-tab-content').forEach(c => c.classList.remove('active'));
            this.classList.add('active');
            document.getElementById(tabId).classList.add('active');
            setTimeout(() => drawAllSphereViz(), 10);
        });
    });

    function drawAllSphereViz() {
        if (typeof d3 === 'undefined') return; // nothing to draw without D3
        drawSquare();
        drawHex();
    }

    setTimeout(() => drawAllSphereViz(), 100);

    function drawSquare() {
        const svg = d3.select('#sphere-square-viz');
        if (!svg.node()) return;
        const container = svg.node().parentElement;
        if (!container || container.clientWidth === 0) return;
        svg.selectAll('*').remove();

        const width = container.clientWidth, height = 400, radius = 35, spacing = radius * 2.2;
        svg.attr('height', height);
        const g = svg.append('g').attr('transform', `translate(${width/2}, ${height/2})`);

        for (let i = -2; i <= 2; i++) {
            for (let j = -2; j <= 2; j++) {
                const isCenter = i === 0 && j === 0;
                g.append('circle').attr('cx', j * spacing).attr('cy', i * spacing).attr('r', radius)
                    .attr('fill', isCenter ? 'rgba(220, 53, 69, 0.3)' : 'rgba(102, 126, 234, 0.2)')
                    .attr('stroke', isCenter ? '#dc3545' : '#667eea').attr('stroke-width', isCenter ? 3 : 2);
                if (isCenter) {
                    g.append('text').attr('x', j * spacing).attr('y', i * spacing)
                        .attr('text-anchor', 'middle').attr('dominant-baseline', 'middle')
                        .attr('fill', '#dc3545').attr('font-weight', 'bold').attr('font-size', '16px').text('Center');
                }
            }
        }

        [[0, 1], [0, -1], [1, 0], [-1, 0]].forEach(([dx, dy]) => {
            g.append('line').attr('x1', 0).attr('y1', 0).attr('x2', dx * spacing).attr('y2', dy * spacing)
                .attr('stroke', '#dc3545').attr('stroke-width', 2).attr('stroke-dasharray', '5,5').attr('opacity', 0.5);
        });

        g.append('text').attr('x', 0).attr('y', -spacing * 2.5).attr('text-anchor', 'middle')
            .attr('fill', '#495057').attr('font-size', '14px').attr('font-weight', '600')
            .text('Square Packing: 4 neighbors touching center sphere');
    }

    function drawHex() {
        const svg = d3.select('#sphere-hex-viz');
        if (!svg.node()) return;
        const container = svg.node().parentElement;
        if (!container || container.clientWidth === 0) return;
        svg.selectAll('*').remove();

        const width = container.clientWidth, height = 400, radius = 35, spacing = radius * 2.05;
        svg.attr('height', height);
        const g = svg.append('g').attr('transform', `translate(${width/2}, ${height/2})`);

        const hexPos = [];
        for (let row = -2; row <= 2; row++) {
            for (let col = -2; col <= 2; col++) {
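                // Math.abs keeps the half-step offset the same for rows -1 and +1 (-1 % 2 is -1 in JavaScript)
                hexPos.push([col * spacing + Math.abs(row % 2) * spacing / 2, row * spacing * Math.sqrt(3) / 2, row, col]);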
            }
        }

        hexPos.forEach(([x, y, row, col]) => {
            const isCenter = row === 0 && col === 0;
            g.append('circle').attr('cx', x).attr('cy', y).attr('r', radius)
                .attr('fill', isCenter ? 'rgba(40, 167, 69, 0.3)' : 'rgba(102, 126, 234, 0.2)')
                .attr('stroke', isCenter ? '#28a745' : '#667eea').attr('stroke-width', isCenter ? 3 : 2);
            if (isCenter) {
                g.append('text').attr('x', x).attr('y', y).attr('text-anchor', 'middle')
                    .attr('dominant-baseline', 'middle').attr('fill', '#28a745')
                    .attr('font-weight', 'bold').attr('font-size', '16px').text('Center');
            }
        });

        [[spacing, 0], [-spacing, 0], [spacing/2, spacing * Math.sqrt(3)/2], 
         [-spacing/2, spacing * Math.sqrt(3)/2], [spacing/2, -spacing * Math.sqrt(3)/2], 
         [-spacing/2, -spacing * Math.sqrt(3)/2]].forEach(([dx, dy]) => {
            g.append('line').attr('x1', 0).attr('y1', 0).attr('x2', dx).attr('y2', dy)
                .attr('stroke', '#28a745').attr('stroke-width', 2).attr('stroke-dasharray', '5,5').attr('opacity', 0.5);
        });

        g.append('text').attr('x', 0).attr('y', -spacing * 2.2).attr('text-anchor', 'middle')
            .attr('fill', '#495057').attr('font-size', '14px').attr('font-weight', '600')
            .text('Hexagonal Packing: 6 neighbors touching center sphere');
    }
</script>

<h4 id="533-direct-connection-to-quantization">5.3.3 Direct Connection to Quantization</h4>
<p>The connection between sphere packing and quantization is direct:</p>
<ul>
<li><strong>Codebook entries</strong> = sphere centers</li>
<li><strong>Sphere radius</strong> = coverage area (how far weights can be from nearest codeword)</li>
<li><strong>Better packing</strong> = lower max distance = lower quantization error</li>
</ul>
<p>In the error bound, the codebook&rsquo;s quantization noise appears as σ². E8&rsquo;s dense, highly symmetric packing keeps σ² small for a given codebook size, which directly lowers the final quantization error.</p>
<h3 id="54-common-lattices-in-quantization">5.4 Common Lattices in Quantization</h3>
<table>
  <thead>
      <tr>
          <th>Lattice</th>
          <th>Dimension</th>
          <th>Kissing #</th>
          <th>Use Case</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Z^n</td>
          <td>any</td>
          <td>2n</td>
          <td>Simple integer grid</td>
      </tr>
      <tr>
          <td>D_4</td>
          <td>4</td>
          <td>24</td>
          <td>Even-parity lattice</td>
      </tr>
      <tr>
          <td>D̂_8</td>
          <td>8</td>
          <td>112</td>
          <td>Half-integer lattice</td>
      </tr>
      <tr>
          <td><strong>E_8</strong></td>
          <td><strong>8</strong></td>
          <td><strong>240</strong></td>
          <td><strong>Optimal in 8D!</strong></td>
      </tr>
  </tbody>
</table>
<p>The E8 lattice achieves the <strong>proven optimal</strong> packing in 8 dimensions. This is not a heuristic—it&rsquo;s a mathematical certainty (Viazovska, 2016, Fields Medal 2022).</p>
<h3 id="55-the-e8-lattice-mathematical-beauty">5.5 The E8 Lattice: Mathematical Beauty</h3>
<h4 id="551-definition">5.5.1 Definition</h4>
<p>The E8 lattice is defined as:</p>
<pre tabindex="0"><code>E_8 = (ℤ⁸ ∪ (ℤ+½)⁸) ∩ {x | Σx_i is even}
</code></pre><p>In plain English:</p>
<ul>
<li><strong>All-integer</strong> OR <strong>all-half-integer</strong> vectors</li>
<li>With <strong>even coordinate sum</strong></li>
</ul>
<p><strong>Valid E8 points</strong>:</p>
<ul>
<li><code>[1, 1, 1, 1, 1, 1, 1, 1]</code> ✓ (all integers, sum=8 is even)</li>
<li><code>[0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]</code> ✓ (all half-integers, sum=4 is even)</li>
<li><code>[1, 0, 1, 0, 1, 0, 1, 0]</code> ✓ (all integers, sum=4 is even)</li>
</ul>
<p><strong>Invalid points</strong>:</p>
<ul>
<li><code>[1, 1, 1, 0, 0, 0, 0, 0]</code> ✗ (sum=3 is odd)</li>
<li><code>[0.5, 0.5, 1, 1, 1, 1, 1, 1]</code> ✗ (mixed integer and half-integer)</li>
</ul>
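<p>The definition translates almost word for word into a membership test. This is a minimal sketch; <code>isInE8</code> is a hypothetical helper, and it relies on halves being exactly representable in floating point.</p>

```javascript
// Returns true if v is a point of the E8 lattice:
// all coordinates integers OR all coordinates half-integers,
// and the coordinate sum is even.
function isInE8(v) {
    if (v.length !== 8) return false;
    const allIntegers = v.every(x => Number.isInteger(x));
    const allHalfIntegers = v.every(x => Number.isInteger(x - 0.5));
    if (!allIntegers && !allHalfIntegers) return false;
    const sum = v.reduce((acc, x) => acc + x, 0);
    return sum % 2 === 0;
}

console.log(isInE8([1, 1, 1, 1, 1, 1, 1, 1]));
console.log(isInE8([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]));
console.log(isInE8([1, 1, 1, 0, 0, 0, 0, 0]));
console.log(isInE8([0.5, 0.5, 1, 1, 1, 1, 1, 1]));
```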
<h4 id="552-why-e8-is-special">5.5.2 Why E8 is Special</h4>
<ol>
<li>
<p><strong>Proven optimal</strong>: Maryna Viazovska proved in 2016 that E8 achieves the densest sphere packing in 8D (Fields Medal 2022)</p>
</li>
<li>
<p><strong>Highest kissing number</strong>: 240 in 8D—this is <strong>proven</strong> to be the best possible</p>
</li>
<li>
<p><strong>Highly symmetric</strong>: Has 696,729,600 symmetries, which enables compression</p>
</li>
<li>
<p><strong>Hardware-friendly</strong>: The structure allows for the E8P compression trick (next section)</p>
</li>
</ol>
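<p>The kissing number of 240 can be checked by brute force. The shortest nonzero E8 vectors have squared length 2: an integer vector with any coordinate of magnitude 2 or more is already longer, and a half-integer vector reaches squared length 2 only when every coordinate is ±½. The sketch below enumerates those candidates and counts the ones with an even coordinate sum; all names are illustrative.</p>

```javascript
// Yield every vector of the given length with coordinates drawn from `values`.
function* vectors(values, length) {
    if (length === 0) { yield []; return; }
    for (const rest of vectors(values, length - 1)) {
        for (const v of values) yield [...rest, v];
    }
}

function countMinimalVectors() {
    const isShortest = v => v.reduce((s, x) => s + x * x, 0) === 2;
    const hasEvenSum = v => v.reduce((s, x) => s + x, 0) % 2 === 0;
    let integerCount = 0;
    let halfIntegerCount = 0;
    // Integer candidates: coordinates in {-1, 0, 1}.
    for (const v of vectors([-1, 0, 1], 8)) {
        if (isShortest(v) && hasEvenSum(v)) integerCount++;
    }
    // Half-integer candidates: every coordinate is +1/2 or -1/2.
    for (const v of vectors([-0.5, 0.5], 8)) {
        if (isShortest(v) && hasEvenSum(v)) halfIntegerCount++;
    }
    return { integerCount, halfIntegerCount, total: integerCount + halfIntegerCount };
}

console.log(countMinimalVectors());
```

<p>The integer vectors contribute 112 neighbors and the half-integer vectors contribute 128, which together give the 240 spheres touching the one at the origin.</p>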
<h3 id="56-e8p-the-padded-compression-trick">5.6 E8P: The &ldquo;Padded&rdquo; Compression Trick</h3>
<h4 id="561-the-challenge">5.6.1 The Challenge</h4>
<p>For 2-bit quantization in 8 dimensions, we need:</p>
<ul>
<li><strong>2 bits × 8 dimensions</strong> = 16 bits total per 8-weight block</li>
<li><strong>2^16 = 65,536 codewords</strong></li>
</ul>
<p>Naively storing this:</p>
<ul>
<li>65,536 vectors × 8 dimensions × 2 bytes = <strong>1 MB per codebook</strong></li>
</ul>
<p><strong>The Problem</strong>: L1 cache on modern GPUs is roughly 128-256 KB per streaming multiprocessor. A 1 MB codebook won&rsquo;t fit! The resulting cache misses can make inference <strong>slower than FP16</strong> (as AQLM discovered).</p>
<h4 id="562-e8p-solution-exploit-symmetry">5.6.2 E8P Solution: Exploit Symmetry</h4>
<p>The insight: We don&rsquo;t need to store all 65,536 vectors. E8&rsquo;s symmetry lets us:</p>
<ol>
<li>Store only <strong>256 base vectors</strong> (4 KB)</li>
<li>Use the remaining <strong>8 bits</strong> to encode sign flips and shifts</li>
<li>Generate all 65,536 points on the fly</li>
</ol>
<p><strong>Codebook compression</strong>: 65,536 entries → 256 base vectors (<strong>256× smaller codebook</strong>)</p>
<p>Note: Each 8-weight block still uses 16 bits for encoding (2 bits per weight). The compression is in the <strong>codebook size</strong> (1 MB → 4 KB), not the encoding size. This makes the codebook cache-resident, enabling fast lookups.</p>
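<p>The two footprints can be reproduced with a few lines of arithmetic. This sketch assumes 2-byte (FP16) entries, as in the estimate above; the constant and function names are illustrative.</p>

```javascript
const DIMENSION = 8;        // weights per codeword
const BYTES_PER_VALUE = 2;  // assumes FP16 storage for each coordinate

// Bytes needed to store a table of `entries` 8-dimensional vectors.
function codebookBytes(entries) {
    return entries * DIMENSION * BYTES_PER_VALUE;
}

const naiveBytes = codebookBytes(65536); // every codeword stored explicitly
const e8pBytes = codebookBytes(256);     // only the base vectors stored

console.log('naive: ' + naiveBytes / 1024 + ' KB');
console.log('E8P:   ' + e8pBytes / 1024 + ' KB');
console.log('ratio: ' + naiveBytes / e8pBytes + 'x');
```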
<h4 id="563-e8p-encoding-structure-16-bits">5.6.3 E8P Encoding Structure (16 bits)</h4>
<p>Each 16-bit codeword encodes:</p>
<pre tabindex="0"><code>[8 bits: base index] [7 bits: sign flips] [1 bit: shift]
</code></pre><ul>
<li><strong>Bits 0-7</strong>: Index into 256-entry table S</li>
<li><strong>Bits 8-14</strong>: Which coordinates to negate</li>
<li><strong>Bit 15</strong>: Add ±0.25 shift</li>
</ul>
<h4 id="564-decoding-example-step-by-step">5.6.4 Decoding Example (Step-by-Step)</h4>
<p>Let&rsquo;s decode codeword: <code>0001010110010111</code></p>
<p><strong>Step 1: Base vector</strong></p>
<ul>
<li>Bits 0-7: <code>00010101</code> = 21</li>
<li>Look up S[21] = <code>[0.5, 0.5, 0.5, 1.5, 0.5, 0.5, 0.5, 0.5]</code></li>
</ul>
<p><strong>Step 2: Apply sign flips</strong></p>
<ul>
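<li>Bits 8-14: <code>1001011</code> (4 ones), read right-to-left so the last bit controls position 0</li>
<li>Flip positions 0, 1, 3, 6</li>
<li>The base has an odd coordinate sum (5), so it needs an odd # of flips to reach an even sum and stay in E8</li>
<li>Infer the 8th sign from the parity constraint: 4 flips so far is even, so position 7 is flipped as well</li>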
<li>Result: <code>[-0.5, -0.5, 0.5, -1.5, 0.5, 0.5, -0.5, -0.5]</code></li>
</ul>
<p><strong>Step 3: Apply shift</strong></p>
<ul>
<li>Bit 15: <code>1</code> → add 0.25</li>
<li><strong>Final</strong>: <code>[-0.25, -0.25, 0.75, -1.25, 0.75, 0.75, -0.25, -0.25]</code></li>
</ul>
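<p>The three steps can be sketched as a small decoder. This is an illustration under stated assumptions, not the reference implementation: the table <code>S</code> is hypothetical and holds only the entry from the example, the sign bits are read right-to-left (the last bit controls position 0) to match the flip positions above, and a shift bit of 0 is assumed to mean −0.25.</p>

```javascript
// Hypothetical base table: only entry 21 is filled in, using the example's value.
const S = { 21: [0.5, 0.5, 0.5, 1.5, 0.5, 0.5, 0.5, 0.5] };

// Decode a 16-character bit string into an 8-dimensional vector.
// Only half-integer base vectors are handled in this sketch.
function decodeE8P(codeword, table) {
    const baseIndex = parseInt(codeword.slice(0, 8), 2); // bits 0-7
    const signBits = codeword.slice(8, 15);              // bits 8-14
    const shiftBit = codeword[15];                       // bit 15

    // Step 1: look up the base vector (copy it so the table is not modified).
    const v = table[baseIndex].slice();

    // Step 2: apply the 7 stored sign flips, last bit first.
    for (let pos = 0; pos < 7; pos++) {
        if (signBits[6 - pos] === '1') v[pos] = -v[pos];
    }
    // Parity constraint: the coordinate sum must be even. Flipping a
    // half-integer coordinate toggles the sum's parity, so the 8th sign
    // is whatever makes the sum even.
    const sum = v.reduce((acc, x) => acc + x, 0);
    if (sum % 2 !== 0) v[7] = -v[7];

    // Step 3: apply the shift (assumption: bit 1 is +0.25, bit 0 is -0.25).
    const shift = shiftBit === '1' ? 0.25 : -0.25;
    return v.map(x => x + shift);
}

console.log(decodeE8P('0001010110010111', S));
```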
<h4 id="565-why-7-sign-bits-not-8">5.6.5 Why 7 Sign Bits (Not 8)?</h4>
<p>This is elegant! E8 has a <strong>parity constraint</strong>:</p>
<ul>
<li>The base vector&rsquo;s coordinate sum fixes whether the total # of flips must be even or odd</li>
<li>Given 7 sign bits → that parity determines the 8th bit automatically</li>
</ul>
<p><strong>We save 1 bit per codeword</strong> by exploiting mathematical structure!</p>
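<p>A brute-force count makes the saving visible. The sketch below tries all 256 sign patterns on the example base vector and keeps those whose coordinate sum is even. It assumes nothing beyond the E8 parity rule, and the variable names are illustrative:</p>

```javascript
// Base vector S[21] from the decoding example (all half-integers).
const base = [0.5, 0.5, 0.5, 1.5, 0.5, 0.5, 0.5, 0.5];

let valid = 0;
const firstSeven = new Set();

for (let mask = 0; mask !== 256; mask++) {
    // Bit k of mask negates coordinate k.
    const sum = base.reduce(
        (acc, x, k) => acc + (((mask >> k) & 1) ? -x : x), 0);
    if (sum % 2 === 0) {
        valid++;
        firstSeven.add(mask & 0x7f);  // remember only the first 7 sign bits
    }
}

console.log(valid);            // 128
console.log(firstSeven.size);  // 128
```

<p>Exactly half of the 256 patterns survive, and each surviving pattern has a distinct set of first seven bits, so those seven bits identify it uniquely.</p>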
<h4 id="566-hardware-implementation">5.6.6 Hardware Implementation</h4>
<p>Decoding E8P is incredibly fast:</p>
<ol>
<li><strong>Load base</strong>: 1 memory access (L1 cache hit, 4KB total)</li>
<li><strong>Extract signs</strong>: 1 shift + AND operation</li>
<li><strong>Compute 8th sign</strong>: Hardware popcount (XOR parity)</li>
<li><strong>Apply signs</strong>: SIMD multiply (8 parallel ops)</li>
<li><strong>Apply shift</strong>: SIMD add (8 parallel ops)</li>
</ol>
<p><strong>Total</strong>: ~5 instructions per codeword (8 weights at a time), all cache-resident!</p>
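<p>Steps 2 to 4 are plain bit manipulation. Here is a hedged sketch in scalar JavaScript rather than SIMD. It assumes the article's bit string is read as one 16-bit integer (base index in the high byte, shift flag in the lowest bit) and that a per-base flag <code>needsOddFlips</code> is stored alongside the table. Both are assumptions made for illustration:</p>

```javascript
// Assumed packing: [8 bits base][7 bits signs][1 bit shift], high to low.
function extractFields(word) {
    return {
        baseIndex: (word >>> 8) & 0xff,
        signMask: (word >>> 1) & 0x7f,
        shiftBit: word & 1,
    };
}

// Parity by XOR-folding: returns 1 if an odd number of bits are set.
function parity7(mask) {
    let p = mask;
    p ^= p >>> 4;
    p ^= p >>> 2;
    p ^= p >>> 1;
    return p & 1;
}

// Expand the 7 stored sign bits plus the implied 8th into +1/-1 multipliers.
function signMultipliers(signMask, needsOddFlips) {
    const eighth = parity7(signMask) ^ (needsOddFlips ? 1 : 0);
    const full = signMask | (eighth << 7);
    return Array.from({ length: 8 }, (_, k) => (((full >>> k) & 1) ? -1 : 1));
}

const f = extractFields(0b0001010110010111);
console.log(f.baseIndex, f.signMask, f.shiftBit);  // 21 75 1
console.log(signMultipliers(f.signMask, true));
// [ -1, -1, 1, -1, 1, 1, -1, -1 ]
```

<p>A real kernel would do the multiply and the shift on all 8 lanes at once. The point here is only that the 8th sign costs a few XORs, not a table lookup.</p>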
<style>
    * {
        margin: 0;
        padding: 0;
        box-sizing: border-box;
    }
    
    .e8-container {
        max-width: 1400px;
        margin: 20px auto;
        background: white;
        border-radius: 16px;
        box-shadow: 0 10px 30px rgba(0,0,0,0.1);
        overflow: hidden;
        font-size: 0.9rem;  
    }
    
    .e8-header {
        background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%);
        color: white;
        padding: 25px;
        text-align: center;
    }
    
    .e8-header h1 {
        font-size: 1.6em;
        margin-bottom: 8px;
        font-weight: 600;
    }
    
    .e8-header p {
        font-size: 1em;
        opacity: 0.9;
    }
    
    .e8-tabs {
        display: flex;
        background: #f8f9fa;
        border-bottom: 2px solid #e9ecef;
        overflow-x: auto;
    }
    
    .e8-tab {
        padding: 15px 20px;
        cursor: pointer;
        background: transparent;
        border: none;
        font-size: 0.9em;
        font-weight: 500;
        color: #495057;
        transition: all 0.3s ease;
        position: relative;
        white-space: nowrap;
    }
    
    .e8-tab:hover {
        background: rgba(240, 147, 251, 0.1);
        color: #f5576c;
    }
    
    .e8-tab.active {
        color: #f5576c;
        background: white;
    }
    
    .e8-tab.active::after {
        content: '';
        position: absolute;
        bottom: -2px;
        left: 0;
        right: 0;
        height: 3px;
        background: #f5576c;
    }
    
    .e8-content {
        padding: 25px;
    }

    .e8-tab-content h2 {
        font-size: 1.4em;
        color: #f5576c;
        margin-bottom: 15px;
    }
    
    .e8-tab-content {
        display: none;
    }
    
    .e8-tab-content.active {
        display: block;
        animation: e8-fadeIn 0.5s ease;
    }
    
    @keyframes e8-fadeIn {
        from { opacity: 0; transform: translateY(10px); }
        to { opacity: 1; transform: translateY(0); }
    }
    
    .e8-side-by-side {
        display: grid;
        grid-template-columns: 1fr 1fr;
        gap: 20px;
        margin: 15px 0;
    }
    
    .e8-panel {
        background: white;
        border-radius: 12px;
        padding: 20px;
        box-shadow: 0 4px 6px rgba(0,0,0,0.05);
        border: 1px solid #e9ecef;
    }
    
    .e8-panel h3 {
        color: #f5576c;
        margin-bottom: 15px;
        font-size: 1.1em;
    }
    
    .e8-visualization-box {
        background: #f8f9fa;
        border-radius: 12px;
        padding: 20px;
        margin: 15px 0;
    }
    
    .e8-explanation {
        background: linear-gradient(135deg, rgba(240, 147, 251, 0.1), rgba(245, 87, 108, 0.1));
        border-left: 4px solid #f5576c;
        padding: 15px;
        margin: 15px 0;
        border-radius: 8px;
    }
    
    .e8-explanation h4 {
        color: #f5576c;
        margin-bottom: 10px;
    }
    
    .e8-explanation p, .e8-explanation ul {
        color: #495057;
        line-height: 1.5;
        font-size: 0.95em;
    }
    
    .e8-metric {
        display: flex;
        justify-content: space-between;
        align-items: center;
        padding: 12px;
        margin: 8px 0;
        background: #f8f9fa;
        border-radius: 8px;
    }
    
    .e8-metric-label {
        font-weight: 600;
        color: #495057;
    }
    
    .e8-metric-value {
        font-size: 1.1em;
        font-weight: 600;
        padding: 4px 12px;
        border-radius: 6px;
    }
    
    .e8-metric-value.bad {
        background: #fee;
        color: #c00;
    }
    
    .e8-metric-value.good {
        background: #efe;
        color: #0a0;
    }
    
    .e8-metric-value.neutral {
        background: #e9ecef;
        color: #495057;
    }
    
    .e8-comparison-table {
        width: 100%;
        border-collapse: collapse;
        margin: 20px 0;
    }
    
    .e8-comparison-table th,
    .e8-comparison-table td {
        padding: 12px;
        text-align: left;
        border-bottom: 1px solid #e9ecef;
    }
    
    .e8-comparison-table th {
        background: #f8f9fa;
        color: #f5576c;
        font-weight: 600;
    }
    
    .e8-comparison-table tr:hover {
        background: rgba(240, 147, 251, 0.05);
    }
    
    .e8-highlight {
        background: #fff3cd;
        padding: 2px 6px;
        border-radius: 4px;
        font-weight: 600;
    }
    
    .e8-formula {
        background: white;
        padding: 20px;
        border-radius: 8px;
        text-align: center;
        font-family: 'Times New Roman', serif;
        font-size: 1.2em;
        margin: 20px 0;
        border: 2px solid #e9ecef;
    }
    
    .e8-vector-display {
        background: #f8f9fa;
        padding: 20px;
        border-radius: 8px;
        margin: 15px 0;
        font-family: 'Courier New', monospace;
        font-size: 1.1em;
    }
    
    .e8-flow-diagram {
        margin: 15px 0;
    }
    
    .e8-flow-box {
        background: white;
        border: 2px solid #f5576c;
        border-radius: 12px;
        padding: 20px;
        margin: 15px 0;
        text-align: center;
        transition: all 0.3s;
    }
    
    .e8-flow-box:hover {
        transform: translateY(-5px);
        box-shadow: 0 8px 16px rgba(245, 87, 108, 0.3);
    }
    
    .e8-flow-arrow {
        text-align: center;
        color: #f5576c;
        font-size: 2em;
        margin: 10px 0;
    }

    .e8-bit-input {
        display: inline-block;
        width: 50px;
        height: 50px;
        border: 2px solid #e9ecef;
        border-radius: 8px;
        text-align: center;
        line-height: 50px;
        font-size: 1.2em;
        font-weight: 600;
        margin: 5px;
        cursor: pointer;
        transition: all 0.2s;
        background: white;
    }

    .e8-bit-input:hover {
        border-color: #f5576c;
        transform: scale(1.05);
    }

    .e8-bit-input.active {
        background: #f5576c;
        color: white;
        border-color: #f5576c;
    }

    .e8-decoder-section {
        margin: 20px 0;
    }

    .e8-decoder-section h4 {
        color: #f5576c;
        margin-bottom: 10px;
        font-size: 1.1em;
    }

    .e8-decode-button {
        background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%);
        color: white;
        border: none;
        padding: 15px 30px;
        border-radius: 8px;
        font-size: 1.1em;
        font-weight: 600;
        cursor: pointer;
        transition: all 0.3s;
        margin: 20px 0;
    }

    .e8-decode-button:hover {
        transform: translateY(-2px);
        box-shadow: 0 8px 16px rgba(245, 87, 108, 0.3);
    }

    .e8-step-output {
        background: #f8f9fa;
        border-left: 4px solid #f5576c;
        padding: 15px;
        margin: 15px 0;
        border-radius: 8px;
    }

    .e8-step-output h4 {
        color: #f5576c;
        margin-bottom: 10px;
    }

    .e8-vector-output {
        background: white;
        padding: 15px;
        border-radius: 8px;
        font-family: 'Courier New', monospace;
        margin: 10px 0;
        border: 1px solid #e9ecef;
    }

    .e8-final-vector {
        background: linear-gradient(135deg, rgba(40, 167, 69, 0.1), rgba(40, 167, 69, 0.05));
        border-left: 4px solid #28a745;
        padding: 15px;
        margin: 15px 0;
        border-radius: 8px;
    }

    .e8-final-vector h4 {
        color: #28a745;
        margin-bottom: 10px;
    }

    .e8-success-msg {
        background: #d4edda;
        color: #155724;
        padding: 15px;
        border-radius: 8px;
        margin: 15px 0;
        font-weight: 600;
    }
    
    @media (max-width: 768px) {
        .e8-side-by-side {
            grid-template-columns: 1fr;
        }
        
        .e8-header h1 {
            font-size: 1.8em;
        }
        
        .e8-tab {
            padding: 15px 15px;
            font-size: 0.9em;
        }
    }
</style>

<div class="e8-container">
    <div class="e8-header">
        <h1>⚛️ E8 Lattice: Perfect Sphere Packing</h1>
        <p>From Wasted Hypercubes to Optimal Vector Quantization</p>
    </div>
    
    <div class="e8-tabs">
        <button class="e8-tab active" data-tab="e8-problem">1. The Waste</button>
        <button class="e8-tab" data-tab="e8-packing">2. Why 8D?</button>
        <button class="e8-tab" data-tab="e8-lattice">3. E8 Lattice</button>
        <button class="e8-tab" data-tab="e8-e8p">4. E8P Magic</button>
        <button class="e8-tab" data-tab="e8-synthesis">5. Complete Picture</button>
    </div>
    
    <div class="e8-content">
        
        <div class="e8-tab-content active" id="e8-problem">
            <h2>The Hypercube Waste Problem</h2>
            
            <div class="e8-explanation">
                <h4>⚠️ Scalar Quantization: Fitting a Ball in a Box</h4>
                <p>After RHT, our weights form a Gaussian distribution—a ball shape in high dimensions. But scalar quantization creates a hypercube of representable points. This is geometrically inefficient!</p>
            </div>
            
            <div class="e8-side-by-side">
                <div class="e8-panel">
                    <h3>The Waste Grows Exponentially</h3>
                    <div class="e8-metric">
                        <span class="e8-metric-label">2D (square)</span>
                        <span class="e8-metric-value bad">21% waste</span>
                    </div>
                    <div class="e8-metric">
                        <span class="e8-metric-label">4D (hypercube)</span>
                        <span class="e8-metric-value bad">47% waste</span>
                    </div>
                    <div class="e8-metric">
                        <span class="e8-metric-label">6D (hypercube)</span>
                        <span class="e8-metric-value bad">61% waste</span>
                    </div>
                    <div class="e8-metric">
                        <span class="e8-metric-label">8D (hypercube)</span>
                        <span class="e8-metric-value bad">69% waste</span>
                    </div>
                </div>
                
                <div class="e8-panel">
                    <h3>📊 Why This Matters</h3>
                    <p>At 2-bit quantization, we only have <strong>4 values per dimension</strong>. With 8 dimensions, that's 4⁸ = 65,536 codewords total.</p>
                    <p style="margin-top: 15px;"><strong>69% waste means 45,000 of our 65,536 codewords are useless!</strong></p>
                    <p style="margin-top: 15px; color: #dc3545; font-weight: 600;">We can't afford this inefficiency.</p>
                </div>
            </div>
            
            <div class="e8-explanation">
                <h4>💡 The Solution: Vector Quantization</h4>
                <p><strong>Instead of quantizing each dimension independently (hypercube), quantize d dimensions together as a vector (sphere-shaped codebook).</strong></p>
                <p style="margin-top: 10px;">This lets us match the codebook shape to the actual distribution. But which shape is optimal?</p>
            </div>

            <div class="e8-side-by-side">
                <div class="e8-panel">
                    <h3>2D Visualization</h3>
                    <svg id="e8-ball-cube-2d" width="100%" height="300"></svg>
                    <p style="text-align: center; color: #6c757d; margin-top: 10px;">
                        The corners are wasted—Gaussian samples never reach them!
                    </p>
                </div>
                <div class="e8-panel">
                    <h3>Volume Ratio Analysis</h3>
                    <svg id="e8-waste-chart" width="100%" height="300"></svg>
                </div>
            </div>
        </div>
        
        
        <div class="e8-tab-content" id="e8-packing">
            <h2>Why 8 Dimensions?</h2>
            
            <div class="e8-explanation">
                <h4>🎯 The Perfect Match</h4>
                <p>QuIP# uses 8 dimensions because it's the sweet spot where mathematics, hardware, and quantization align perfectly.</p>
            </div>

            <div class="e8-side-by-side">
                <div class="e8-panel">
                    <h3>🔢 Hardware Alignment</h3>
                    <div class="e8-metric">
                        <span class="e8-metric-label">2 bits per weight</span>
                        <span class="e8-metric-value neutral">4 values</span>
                    </div>
                    <div class="e8-metric">
                        <span class="e8-metric-label">8 weights per group</span>
                        <span class="e8-metric-value neutral">Vector quantization</span>
                    </div>
                    <div class="e8-metric">
                        <span class="e8-metric-label">Total encoding</span>
                        <span class="e8-metric-value good">16 bits</span>
                    </div>
                    <p style="margin-top: 15px; color: #495057;">16 bits = 2 bytes, perfect for modern hardware!</p>
                </div>
                <div class="e8-panel">
                    <h3>🏆 Mathematical Optimality</h3>
                    <div class="e8-metric">
                        <span class="e8-metric-label">Proven Optimal</span>
                        <span class="e8-metric-value good">Yes!</span>
                    </div>
                    <p style="margin-top: 15px;">8D is one of only a handful of dimensions where the optimal sphere packing is proven:</p>
                    <ul style="margin-left: 20px; margin-top: 10px; line-height: 1.8;">
                        <li><strong>2D:</strong> Hexagonal (6 neighbors)</li>
                        <li><strong>3D:</strong> FCC (12 neighbors)</li>
                        <li><strong>8D:</strong> E8 (240 neighbors) ⭐</li>
                        <li><strong>24D:</strong> Leech (196,560 neighbors)</li>
                    </ul>
                </div>
            </div>

            <div class="e8-visualization-box">
                <h3 style="color: #f5576c; margin-bottom: 15px;">Kissing Number Scaling Across Lattices</h3>
                <svg id="e8-lattice-comparison" width="100%" height="250" style="min-height: 250px;"></svg>
                
                
                <table class="e8-comparison-table" id="e8-lattice-fallback" style="display: none; margin-top: 20px;">
                    <thead>
                        <tr>
                            <th>Lattice</th>
                            <th>Kissing Number</th>
                            <th>Neighbors vs Z⁸</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td><strong>Z⁸ (Simple Cubic)</strong></td>
                            <td>16</td>
                            <td style="color: #dc3545;">Baseline</td>
                        </tr>
                        <tr>
                            <td><strong>D̂₈ (Half-integer)</strong></td>
                            <td>112</td>
                            <td style="color: #667eea;">7× as many</td>
                        </tr>
                        <tr style="background: rgba(245, 87, 108, 0.1);">
                            <td><strong>E₈ (Optimal)</strong></td>
                            <td><strong>240</strong></td>
                            <td style="color: #28a745; font-weight: 600;">15× as many ⭐</td>
                        </tr>
                    </tbody>
                </table>
                
                <p style="margin-top: 15px; color: #6c757d; text-align: center;">
                    E8's 240 nearest neighbors are <strong>15× as many</strong> as simple cubic packing (Z⁸) has in 8D!
                </p>
            </div>
            
            <div class="e8-explanation">
                <h4>💡 Why Not Other Dimensions?</h4>
                <ul style="margin-left: 20px; margin-top: 10px; line-height: 1.8;">
                    <li><strong>4D:</strong> Only 24 neighbors (D₄ lattice) - not dense enough</li>
                    <li><strong>16D:</strong> No proven optimal lattice, too many bits (32-bit encoding)</li>
                    <li><strong>24D:</strong> Leech lattice is optimal but requires 48-bit encoding - impractical</li>
                </ul>
                <p style="margin-top: 15px; font-weight: 600; color: #f5576c;">8D is the Goldilocks dimension: proven optimal packing + practical hardware alignment!</p>
            </div>
        </div>
        
        
        <div class="e8-tab-content" id="e8-lattice">
            <h2>The E8 Lattice: Mathematical Beauty</h2>
            
            <div class="e8-formula">
                <div style="margin-bottom: 15px; font-size: 1.1em; color: #f5576c;">
                    <strong>E8 Lattice Definition</strong>
                </div>
                E₈ = (ℤ⁸ ∪ (ℤ+½)⁸) ∩ {x | Σx<sub>i</sub> is even}
            </div>
            
            <div class="e8-explanation">
                <h4>📖 In Plain English</h4>
                <p><strong>All-integer</strong> OR <strong>all-half-integer</strong> 8D vectors, with <strong>even coordinate sum</strong>.</p>
            </div>
            
            <div class="e8-side-by-side">
                <div class="e8-panel">
                    <h3>✅ Valid E8 Points</h3>
                    <div class="e8-vector-display">
                        [1, 1, 1, 1, 1, 1, 1, 1]<br>
                        <span style="color: #28a745;">✓ All integers, sum=8 (even)</span>
                    </div>
                    <div class="e8-vector-display">
                        [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]<br>
                        <span style="color: #28a745;">✓ All half-integers, sum=4 (even)</span>
                    </div>
                </div>
                
                <div class="e8-panel">
                    <h3>❌ Invalid E8 Points</h3>
                    <div class="e8-vector-display">
                        [1, 1, 1, 0, 0, 0, 0, 0]<br>
                        <span style="color: #dc3545;">✗ All integers, but sum=3 (odd)</span>
                    </div>
                    <div class="e8-vector-display">
                        [0.5, 0.5, 1, 1, 1, 1, 1, 1]<br>
                        <span style="color: #dc3545;">✗ Mixed integer and half-integer</span>
                    </div>
                </div>
            </div>
            
            <div class="e8-visualization-box">
                <h3 style="color: #f5576c; margin-bottom: 20px;">E8 Properties</h3>
                <table class="e8-comparison-table">
                    <thead>
                        <tr>
                            <th>Property</th>
                            <th>Value</th>
                            <th>Significance</th>
                        </tr>
                    </thead>
                    <tbody>
                        <tr>
                            <td><strong>Dimension</strong></td>
                            <td>8</td>
                            <td>Perfect for 2-bit × 8 weights = 16 bits</td>
                        </tr>
                        <tr>
                            <td><strong>Kissing Number</strong></td>
                            <td>240</td>
                            <td>Proven optimal in 8D</td>
                        </tr>
                        <tr>
                            <td><strong>Symmetries</strong></td>
                            <td>696,729,600</td>
                            <td>Enables E8P compression (256× reduction)</td>
                        </tr>
                        <tr>
                            <td><strong>Covering Radius</strong></td>
                            <td>1 (at minimum distance √2)</td>
                            <td>Bounds the worst-case quantization error</td>
                        </tr>
                    </tbody>
                </table>
            </div>
            
            <div class="e8-explanation">
                <h4>🎯 Perfect Match for Gaussian Weights</h4>
                <p>After RHT, weights are approximately Gaussian → ball-shaped in 8D. E8 is the densest possible packing of equal spheres in 8D, so its points cover that ball with far less waste than a cubic grid. This is why the combination works so well!</p>
            </div>

            <div class="e8-visualization-box">
                <h3 style="color: #f5576c; margin-bottom: 15px;">E8 vs Other Lattices</h3>
                <svg id="e8-lattice-comparison" width="100%" height="200"></svg>
            </div>
        </div>
        
        
        <div class="e8-tab-content" id="e8-e8p">
            <h2>E8P: The Compression Magic</h2>
            
            <div class="e8-explanation">
                <h4>🎩 The Challenge</h4>
                <p>2 bits × 8 dimensions = 16 bits total → 2¹⁶ = <strong>65,536 codewords</strong></p>
                <p style="margin-top: 10px;">Storing naively: 65,536 vectors × 8 dims × 2 bytes = <strong>1 MB per layer</strong></p>
                <p style="margin-top: 10px; color: #dc3545;"><strong>Problem:</strong> GPU L1 cache is only 128-256 KB. Cache misses kill performance!</p>
            </div>
            
            <div class="e8-explanation" style="background: linear-gradient(135deg, rgba(40, 167, 69, 0.1), rgba(40, 167, 69, 0.05)); border-color: #28a745;">
                <h4 style="color: #28a745;">✨ The E8P Solution</h4>
                <p>Exploit E8's <strong>696 million symmetries</strong> to compress the codebook:</p>
                <ul style="margin-left: 20px; margin-top: 10px;">
                    <li>Store only <strong>256 base vectors</strong> (4 KB)</li>
                    <li>Use 8 bits to encode sign flips and shifts</li>
                    <li>Generate all 65,536 points on the fly</li>
                </ul>
                <p style="margin-top: 10px; font-size: 1.2em;"><strong>256× compression ratio!</strong> Fits in L1 cache.</p>
            </div>
            
            <div class="e8-side-by-side">
                <div class="e8-panel">
                    <h3>Storage Comparison</h3>
                    <div class="e8-metric">
                        <span class="e8-metric-label">Naive Storage</span>
                        <span class="e8-metric-value bad">1 MB</span>
                    </div>
                    <div class="e8-metric">
                        <span class="e8-metric-label">E8P Storage</span>
                        <span class="e8-metric-value good">4 KB</span>
                    </div>
                    <div class="e8-metric">
                        <span class="e8-metric-label">Compression</span>
                        <span class="e8-metric-value good">256×</span>
                    </div>
                    <div class="e8-metric">
                        <span class="e8-metric-label">Fits in L1?</span>
                        <span class="e8-metric-value good">✓ Yes!</span>
                    </div>
                </div>
                
                <div class="e8-panel">
                    <h3>Performance Impact</h3>
                    <div class="e8-metric">
                        <span class="e8-metric-label">Memory Access</span>
                        <span class="e8-metric-value good">L1 Cache Hit</span>
                    </div>
                    <div class="e8-metric">
                        <span class="e8-metric-label">Decode Cost</span>
                        <span class="e8-metric-value good">~5 instructions</span>
                    </div>
                    <div class="e8-metric">
                        <span class="e8-metric-label">Peak Bandwidth</span>
                        <span class="e8-metric-value good">56.8%</span>
                    </div>
                    <p style="margin-top: 15px; color: #28a745; font-weight: 600;">
                        QuIP# achieves >50% peak memory bandwidth on RTX 4090!
                    </p>
                </div>
            </div>

            <div class="e8-panel" style="grid-column: span 2; border: 2px solid #f5576c;">
                <h3>🔧 Interactive E8P Decoder</h3>
                <p style="margin-bottom: 15px; font-size: 0.95em; color: #6c757d;">Each 16-bit codeword encodes: <strong>[8 bits: base] [7 bits: signs] [1 bit: shift]</strong></p>
                
                <div class="e8-decoder-section">
                    <h4>Bits 0-7: Base Vector Index</h4>
                    <div id="e8p-base-bits" style="display: flex; align-items: center; flex-wrap: wrap;">
                        <span class="e8-bit-input" data-bit="7">0</span>
                        <span class="e8-bit-input" data-bit="6">0</span>
                        <span class="e8-bit-input" data-bit="5">0</span>
                        <span class="e8-bit-input" data-bit="4">0</span>
                        <span class="e8-bit-input" data-bit="3">0</span>
                        <span class="e8-bit-input" data-bit="2">0</span>
                        <span class="e8-bit-input" data-bit="1">0</span>
                        <span class="e8-bit-input" data-bit="0">0</span>
                        <span style="margin-left: 15px; font-size: 1.2em; font-weight: 600;">= <span id="e8p-base-value" style="color: #f5576c;">0</span></span>
                    </div>
                </div>

                <div class="e8-decoder-section">
                    <h4>Bits 8-14: Sign Flips (which coordinates to negate)</h4>
                    <div id="e8p-sign-bits" style="display: flex; align-items: center; flex-wrap: wrap;">
                        <span class="e8-bit-input" data-bit="6">0</span>
                        <span class="e8-bit-input" data-bit="5">0</span>
                        <span class="e8-bit-input" data-bit="4">0</span>
                        <span class="e8-bit-input" data-bit="3">0</span>
                        <span class="e8-bit-input" data-bit="2">0</span>
                        <span class="e8-bit-input" data-bit="1">0</span>
                        <span class="e8-bit-input" data-bit="0">0</span>
                    </div>
                </div>

                <div class="e8-decoder-section">
                    <h4>Bit 15: Shift (+0.25 if set)</h4>
                    <div id="e8p-shift-bit" style="display: flex; align-items: center;">
                        <span class="e8-bit-input" data-bit="0">0</span>
                    </div>
                </div>

                <button class="e8-decode-button" id="e8p-decode-btn">🎯 Decode Codeword</button>

                <div id="e8p-decode-output"></div>
            </div>
        </div>
        
        
        <div class="e8-tab-content" id="e8-synthesis">
            <h2>The Complete Picture</h2>
            
            <div class="e8-flow-diagram">
                <div class="e8-flow-box" style="background: rgba(102, 126, 234, 0.05); border-color: #667eea;">
                    <h4 style="color: #667eea;">1️⃣ RHT Creates Gaussian Distribution</h4>
                    <p><strong>Output:</strong> Ball-shaped weights in 8D space</p>
                </div>
                
                <div class="e8-flow-arrow">↓</div>
                
                <div class="e8-flow-box" style="background: rgba(240, 147, 251, 0.05);">
                    <h4>2️⃣ E8 Lattice Matches the Shape</h4>
                    <p><strong>Gaussian ball → E8 optimal sphere packing</strong></p>
                </div>
                
                <div class="e8-flow-arrow">↓</div>
                
                <div class="e8-flow-box" style="background: rgba(245, 87, 108, 0.05);">
                    <h4>3️⃣ E8P Makes it Hardware-Friendly</h4>
                    <p><strong>65,536 codewords → 256 base vectors (4 KB)</strong></p>
                </div>
                
                <div class="e8-flow-arrow">↓</div>
                
                <div class="e8-flow-box" style="background: rgba(40, 167, 69, 0.05); border-color: #28a745;">
                    <h4 style="color: #28a745;">4️⃣ Result: Near-Lossless 2-Bit</h4>
                    <p><strong>Llama 2 70B perplexity: 3.12 (FP16) → 3.91 (QuIP# 2-bit)</strong></p>
                    <p style="color: #28a745;">vs 7.81 for OmniQuant: <strong>about half the perplexity!</strong></p>
                </div>
            </div>
            
            <div class="e8-explanation" style="margin-top: 40px; background: linear-gradient(135deg, rgba(40, 167, 69, 0.1), rgba(40, 167, 69, 0.05)); border-color: #28a745;">
                <h4 style="color: #28a745;">🎓 The "Aha!" Moment</h4>
                <p style="font-size: 1.1em; line-height: 1.8;">
                    E8 isn't just "better" sphere packing—it's <strong>provably optimal</strong> in 8D. When you combine it with RHT's Gaussian distribution, you get a <em>perfect geometric match</em>.
                </p>
                <p style="margin-top: 15px; line-height: 1.8;">
                    The ball-shaped Gaussian weights fit exactly into E8's optimal sphere packing. No wasted corners, no wasted bits. Every single one of the 65,536 codewords is useful.
                </p>
                <p style="margin-top: 15px; line-height: 1.8;">
                    <strong>The E8P compression trick</strong> then makes it hardware-friendly: 4 KB fits in L1 cache, ~5 instructions per decode. This is why QuIP# achieves >50% peak memory bandwidth while maintaining near-lossless quality.
                </p>
            </div>
        </div>
    </div>
</div>

<script>
    if (typeof d3 === 'undefined') { console.warn('D3.js not loaded for E8 visualization'); }
    
    document.querySelectorAll('.e8-tab').forEach(tab => {
        tab.addEventListener('click', function() {
            const tabId = this.dataset.tab;
            
            document.querySelectorAll('.e8-tab').forEach(t => t.classList.remove('active'));
            document.querySelectorAll('.e8-tab-content').forEach(c => c.classList.remove('active'));
            
            this.classList.add('active');
            document.getElementById(tabId).classList.add('active');

            
            setTimeout(() => drawAllVisualizations(), 10);
        });
    });

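    function drawAllVisualizations() {
        // These two charts need D3. Skip them if it failed to load, so the
        // lattice chart below can still reach its own try block and fallback.
        if (typeof d3 !== 'undefined') {
            drawBallInCube();
            drawWasteChart();
        }
        drawLatticeComparison();
    }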

    
    setTimeout(() => drawAllVisualizations(), 100);
    function drawBallInCube() {
        const svg = d3.select('#e8-ball-cube-2d');
        if (!svg.node()) return;
        const container = svg.node().parentElement;
        if (!container || container.clientWidth === 0) return;
        svg.selectAll('*').remove();

        const width = container.clientWidth;
        const height = 300;
        const size = Math.min(200, width - 80);
        svg.attr('height', height);

        const g = svg.append('g').attr('transform', `translate(${width/2}, ${height/2})`);

        g.append('rect').attr('x', -size/2).attr('y', -size/2).attr('width', size).attr('height', size)
            .attr('fill', 'none').attr('stroke', '#dc3545').attr('stroke-width', 2).attr('stroke-dasharray', '8,4');

        const radius = size / 2;
        g.append('circle').attr('r', radius).attr('fill', 'rgba(102, 126, 234, 0.2)').attr('stroke', '#667eea').attr('stroke-width', 2);

        const numPoints = 150;
        for (let i = 0; i < numPoints; i++) {
            const u1 = 1 - Math.random(), u2 = Math.random(); // u1 in (0, 1], so Math.log(u1) stays finite
            const r = Math.sqrt(-2 * Math.log(u1)) * 0.4;
            const theta = 2 * Math.PI * u2;
            g.append('circle').attr('cx', r * Math.cos(theta) * radius).attr('cy', r * Math.sin(theta) * radius)
                .attr('r', 2).attr('fill', '#667eea').attr('opacity', 0.6);
        }
    }

    function drawWasteChart() {
        const svg = d3.select('#e8-waste-chart');
        if (!svg.node()) return;
        const container = svg.node().parentElement;
        if (!container || container.clientWidth === 0) return;
        svg.selectAll('*').remove();

        const width = container.clientWidth, height = 250, margin = {top: 30, right: 20, bottom: 40, left: 50};
        const innerWidth = width - margin.left - margin.right, innerHeight = height - margin.top - margin.bottom;
        svg.attr('height', height);

        // Share of the unit cube lying outside the inscribed ball: 1 - V_n(1/2)
        const data = [{dim: '2D', waste: 21.5}, {dim: '4D', waste: 69.2}, {dim: '8D', waste: 98.4}];
        const x = d3.scaleBand().domain(data.map(d => d.dim)).range([0, innerWidth]).padding(0.4);
        const y = d3.scaleLinear().domain([0, 100]).range([innerHeight, 0]);

        const g = svg.append('g').attr('transform', `translate(${margin.left},${margin.top})`);
        g.selectAll('rect').data(data).join('rect').attr('x', d => x(d.dim)).attr('y', d => y(d.waste))
            .attr('width', x.bandwidth()).attr('height', d => innerHeight - y(d.waste)).attr('fill', '#dc3545').attr('opacity', 0.8);

        g.append('g').attr('transform', `translate(0,${innerHeight})`).call(d3.axisBottom(x));
        g.append('g').call(d3.axisLeft(y).ticks(5).tickFormat(d => d + '%'));
    }

    function drawLatticeComparison(retriesLeft = 20) {
        try {
            const svg = d3.select('#e8-lattice-comparison');
            if (!svg.node()) {
                console.warn('SVG element not found, showing fallback table');
                showFallbackTable();
                return;
            }
            
            const container = svg.node().parentElement;
            if (!container || container.clientWidth === 0) {
                // Hidden tabs have zero width; retry a bounded number of times instead of polling forever.
                if (retriesLeft > 0) setTimeout(() => drawLatticeComparison(retriesLeft - 1), 100);
                return;
            }
            
            svg.selectAll('*').remove();

            const width = container.clientWidth;
            const height = 250;
            const margin = {top: 20, right: 60, bottom: 40, left: 140};
            const innerWidth = width - margin.left - margin.right;
            const innerHeight = height - margin.top - margin.bottom;
            
            if (innerWidth <= 0 || innerHeight <= 0) {
                console.warn('Invalid dimensions, showing fallback table');
                showFallbackTable();
                return;
            }
            
            svg.attr('height', height);

            const data = [
                {name: 'Z⁸ (Simple Cubic)', kissing: 16, label: '16'},
                {name: 'D₈ (Checkerboard)', kissing: 112, label: '112'},
                {name: 'E₈ (Optimal)', kissing: 240, label: '240'}
            ];
            
            const x = d3.scaleLinear().domain([0, 260]).range([0, innerWidth]);
            const y = d3.scaleBand().domain(data.map(d => d.name)).range([0, innerHeight]).padding(0.25);

            const g = svg.append('g').attr('transform', `translate(${margin.left},${margin.top})`);
            
            
            g.selectAll('rect').data(data).join('rect')
                .attr('y', d => y(d.name))
                .attr('width', d => x(d.kissing))
                .attr('height', y.bandwidth())
                .attr('fill', (d, i) => i === 2 ? '#f5576c' : '#667eea')
                .attr('opacity', 0.8)
                .attr('rx', 4);

            
            g.selectAll('text.value').data(data).join('text')
                .attr('class', 'value')
                .attr('x', d => x(d.kissing) + 5)
                .attr('y', d => y(d.name) + y.bandwidth() / 2)
                .attr('dominant-baseline', 'middle')
                .attr('fill', '#495057')
                .attr('font-weight', 'bold')
                .attr('font-size', '14px')
                .text(d => d.label);

            
            g.append('g')
                .attr('transform', `translate(0,${innerHeight})`)
                .call(d3.axisBottom(x).ticks(5))
                .selectAll('text')
                .attr('font-size', '12px');
            
            g.append('g')
                .call(d3.axisLeft(y))
                .selectAll('text')
                .attr('font-size', '12px');

            
            g.append('text')
                .attr('x', innerWidth / 2)
                .attr('y', innerHeight + 35)
                .attr('text-anchor', 'middle')
                .attr('fill', '#6c757d')
                .attr('font-size', '13px')
                .text('Kissing Number (neighbors touching center)');
                
            console.log('Lattice comparison chart drawn successfully');
        } catch (error) {
            console.error('Error drawing lattice comparison:', error);
            showFallbackTable();
        }
    }
    
    function showFallbackTable() {
        const svg = document.getElementById('e8-lattice-comparison');
        const fallback = document.getElementById('e8-lattice-fallback');
        if (svg) svg.style.display = 'none';
        if (fallback) fallback.style.display = 'table';
    }

    
    function setupE8PDecoder() {
        
        const baseBits = document.querySelectorAll('#e8p-base-bits .e8-bit-input');
        const signBits = document.querySelectorAll('#e8p-sign-bits .e8-bit-input');
        const shiftBit = document.querySelector('#e8p-shift-bit .e8-bit-input');

        
        function toggleBit(element) {
            const currentValue = element.textContent;
            const newValue = currentValue === '0' ? '1' : '0';
            element.textContent = newValue;
            
            if (newValue === '1') {
                element.classList.add('active');
            } else {
                element.classList.remove('active');
            }
            
            
            updateBaseValue();
        }

        
        function updateBaseValue() {
            let value = 0;
            baseBits.forEach(bit => {
                const bitPos = parseInt(bit.dataset.bit);
                const bitVal = bit.textContent === '1' ? 1 : 0;
                value += bitVal * Math.pow(2, bitPos);
            });
            document.getElementById('e8p-base-value').textContent = value;
        }

        
        // Guard against missing elements so one absent widget does not break the whole script.
        baseBits.forEach(bit => bit.addEventListener('click', () => toggleBit(bit)));
        signBits.forEach(bit => bit.addEventListener('click', () => toggleBit(bit)));
        if (shiftBit) shiftBit.addEventListener('click', () => toggleBit(shiftBit));

        const decodeBtn = document.getElementById('e8p-decode-btn');
        if (decodeBtn) decodeBtn.addEventListener('click', decodeE8P);
    }

    function decodeE8P() {
        
        const baseBits = document.querySelectorAll('#e8p-base-bits .e8-bit-input');
        let baseIndex = 0;
        baseBits.forEach(bit => {
            const bitPos = parseInt(bit.dataset.bit);
            const bitVal = bit.textContent === '1' ? 1 : 0;
            baseIndex += bitVal * Math.pow(2, bitPos);
        });

        
        const signBits = document.querySelectorAll('#e8p-sign-bits .e8-bit-input');
        const signArray = [];
        signBits.forEach(bit => {
            signArray.unshift(bit.textContent);
        });
        const signBitString = signArray.join('');

        
        const shiftBit = document.querySelector('#e8p-shift-bit .e8-bit-input');
        const shift = shiftBit.textContent === '1' ? 1 : 0;

        const outputDiv = document.getElementById('e8p-decode-output');
        outputDiv.innerHTML = '';

        
        const baseVectors = [
            [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
            [1, 1, 0, 0, 0, 0, 0, 0],
            [-0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5]
        ];

        const selectedBase = baseVectors[baseIndex % baseVectors.length];
        
        
        outputDiv.innerHTML += `
            <div class="e8-step-output">
                <h4>Step 1: Base Vector Lookup</h4>
                <p>Base index: <strong>${baseIndex}</strong></p>
                <div class="e8-vector-output">[${selectedBase.join(', ')}]</div>
            </div>
        `;

        
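        let finalVector = [...selectedBase];
        let parity = 0;
        for (let i = 0; i < 7; i++) {
            if (signBitString[i] === '1') {
                finalVector[i] *= -1;
                parity++;
            }
        }
        // The 8th sign is not stored: it is inferred so that the total number of flips is even.
        if (parity % 2 === 1) {
            finalVector[7] *= -1;
        }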
        
        outputDiv.innerHTML += `
            <div class="e8-step-output">
                <h4>Step 2: Apply Sign Flips</h4>
                <p>Sign bits: <strong>${signBitString}</strong> (${parity} ones)</p>
                <p>Parity: ${parity % 2 === 0 ? 'Even' : 'Odd'} → ${parity % 2 === 0 ? '8th sign not flipped' : '8th sign flipped'}</p>
                <div class="e8-vector-output">[${finalVector.join(', ')}]</div>
            </div>
        `;

        
        if (shift === 1) {
            finalVector = finalVector.map(v => v + 0.25);
        }
        
        outputDiv.innerHTML += `
            <div class="e8-step-output">
                <h4>Step 3: Apply Shift</h4>
                <p>Shift bit: <strong>${shift}</strong> → ${shift === 1 ? 'Add +0.25 to all coordinates' : 'No shift'}</p>
            </div>
        `;

        
        outputDiv.innerHTML += `
            <div class="e8-final-vector">
                <h4>Final E8 Point:</h4>
                <div class="e8-vector-output">[${finalVector.map(v => v.toFixed(2)).join(', ')}]</div>
            </div>
        `;

        outputDiv.innerHTML += `
            <div class="e8-success-msg">
                ✨ Decoded Successfully!
                <p style="margin-top: 10px; font-size: 0.9em;">This 16-bit codeword represents one of 65,536 possible E8 lattice points, all generated from just 256 base vectors stored in 4 KB of memory.</p>
            </div>
        `;
    }

    
    setTimeout(() => {
        setupE8PDecoder();
    }, 100);
</script>

<p>This is why QuIP# achieves &gt;50% of peak memory bandwidth, while AQLM is slower than FP16.</p>
<p>We&rsquo;ve transformed weights to eliminate outliers (RHT) and matched our quantization to their distribution (E8P). But there&rsquo;s one final piece: accounting for how weights interact with each other during quantization.</p>
<hr>
<h2 id="6-pillar-3-block-ldlq-adaptive-rounding">6. Pillar 3: Block-LDLQ Adaptive Rounding</h2>
<h3 id="61-why-adaptive-rounding">6.1 Why Adaptive Rounding?</h3>
<p>Even with RHT and E8P, we face a final challenge: <strong>weights aren&rsquo;t independent</strong>.</p>
<p>Imagine you&rsquo;re tuning a guitar. If you tune each string in isolation, the guitar might sound terrible because strings interact to create chords. You need to tune them <em>together</em>, considering how they affect each other.</p>
<p>Similarly, when we round weights, errors in early weights affect later ones. <strong>Adaptive rounding</strong> accounts for these interdependencies.</p>
<h3 id="62-the-hessian-measuring-sensitivity">6.2 The Hessian: Measuring Sensitivity</h3>
<p>The <strong>proxy Hessian</strong> captures how changes in a layer&rsquo;s weights affect that layer&rsquo;s output, which serves as a tractable stand-in for the effect on model loss:</p>
<pre tabindex="0"><code>H = 𝔼[x·x^T]
</code></pre><p>Where:</p>
<ul>
<li><code>𝔼[·]</code>: <a href="https://en.wikipedia.org/wiki/Expected_value">Expected value</a> (average over calibration data samples)</li>
<li><code>x</code>: Input activations to the layer</li>
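<li><code>H_ij</code>: Second moment of input features i and j, which measures how strongly the weights attached to those inputs &ldquo;interact&rdquo;</li>
</ul>
<p><strong>Intuition</strong>: Large Hessian entries mean &ldquo;errors along this input are amplified in the output, so round carefully!&rdquo;</p>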
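<p>As a rough illustration, the proxy Hessian can be estimated from a batch of calibration activations, and the resulting proxy loss equals the mean squared change in the layer output. The sketch below uses NumPy with synthetic activations; the names <code>proxy_hessian</code> and <code>proxy_loss</code>, the shapes, and the random data are assumptions for illustration and are not taken from the QuIP# codebase.</p>

```python
import numpy as np

def proxy_hessian(X):
    """Estimate H = E[x x^T] from calibration activations X of shape (N, n)."""
    return X.T @ X / X.shape[0]

def proxy_loss(W, W_hat, H):
    """tr((W_hat - W) H (W_hat - W)^T): mean squared change in the layer output."""
    E = W_hat - W
    return float(np.trace(E @ H @ E.T))

rng = np.random.default_rng(0)
# Synthetic, correlated activations standing in for real calibration data.
X = rng.standard_normal((512, 16)) @ rng.standard_normal((16, 16))
W = rng.standard_normal((4, 16))   # hypothetical layer: 4 outputs, 16 inputs
W_hat = np.round(W)                # naive nearest rounding, ignores H entirely

H = proxy_hessian(X)
print("proxy loss of nearest rounding:", proxy_loss(W, W_hat, H))
```

<p>Because the loss weights each rounding error by <code>H</code>, two roundings with the same per-weight error can have very different proxy losses. That gap is what adaptive rounding exploits.</p>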
<h3 id="63-block-ldl-decomposition">6.3 Block LDL Decomposition</h3>
<p>For a Hessian <code>H ∈ ℝ^(n×n)</code> with block size <code>g</code>, we compute:</p>
<pre tabindex="0"><code>H = L^T·D·L
</code></pre><p>Where:</p>
<ul>
<li><code>L</code>: Unit block lower-triangular matrix (g×g blocks)</li>
<li><code>D</code>: Block diagonal matrix</li>
</ul>
<p>This is like breaking a big problem into smaller, manageable chunks of size g.</p>
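<p>One way to compute such a factorization is to peel g×g blocks off the bottom-right corner and carry a Schur complement forward. The sketch below is a minimal NumPy version under the assumptions that <code>H</code> is positive definite and that <code>n</code> is a multiple of <code>g</code>; the function name <code>block_ldl</code> and the test matrix are illustrative only.</p>

```python
import numpy as np

def block_ldl(H, g):
    """Factor symmetric positive-definite H as L.T @ D @ L.

    L is unit block lower-triangular and D is block diagonal, both with
    g x g blocks. Assumes H.shape[0] is a multiple of g.
    """
    n = H.shape[0]
    S = H.astype(float).copy()     # leading part holds the running Schur complement
    L = np.eye(n)
    D = np.zeros((n, n))
    for end in range(n, 0, -g):    # peel blocks off from the bottom-right corner
        start = end - g
        D[start:end, start:end] = S[start:end, start:end]
        if start == 0:
            break
        # Block row of L: solve D_block @ L_row = S[block, earlier columns]
        L[start:end, :start] = np.linalg.solve(S[start:end, start:end], S[start:end, :start])
        # Schur complement update for the remaining leading block
        S[:start, :start] -= S[:start, start:end] @ L[start:end, :start]
    return L, D

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 16))
H = X.T @ X / 256 + 1e-3 * np.eye(16)   # small ridge keeps H positive definite
L, D = block_ldl(H, g=8)
print("reconstruction ok:", np.allclose(L.T @ D @ L, H))
```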
<h3 id="64-the-block-ldlq-algorithm">6.4 The Block-LDLQ Algorithm</h3>
<p>Processing the columns of W in blocks of g, each block is rounded as:</p>
<pre tabindex="0"><code>Ŵ_k = Q(W_k + (W_{1:k-1} - Ŵ_{1:k-1})·A_k)
</code></pre><p><strong>What this means</strong>:</p>
<ul>
<li><code>W_k</code>: Current block of weights</li>
<li><code>Ŵ_{1:k-1}</code>: Already-quantized previous blocks</li>
<li><code>A_k</code>: Feedback matrix (from L)</li>
<li><code>Q</code>: Vector quantization to E8P codebook</li>
</ul>
<p><strong>The key</strong>: Rounding errors from previous blocks are fed back into the current block, so each step compensates for earlier errors instead of letting them accumulate.</p>
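<p>The feedback loop can be sketched in a few lines for the simplest case. The code below is a simplification, not the QuIP# implementation: it uses scalar blocks (<code>g = 1</code>), lets <code>np.round</code> stand in for the E8P quantizer <code>Q</code>, and assumes the feedback matrix is <code>A = L^T - I</code>, which is the choice that makes the algebra work out when <code>H = L^T·D·L</code>. The function names and the synthetic data are illustrative.</p>

```python
import numpy as np

def ldl_factor(H):
    """Return unit lower-triangular L with H = L.T @ D @ L (scalar blocks, g = 1)."""
    C = np.linalg.cholesky(H[::-1, ::-1])   # Cholesky of H with row/column order reversed
    U = C[::-1, ::-1]                       # upper-triangular factor, H = U @ U.T
    return (U / np.diag(U)).T               # rescale columns to get a unit diagonal

def ldlq(W, H, quantize):
    """Round W column by column, feeding earlier rounding errors forward."""
    n = W.shape[1]
    A = ldl_factor(H).T - np.eye(n)         # strictly upper-triangular feedback matrix
    W_hat = np.zeros_like(W)
    for k in range(n):
        # Correction from the errors already committed on columns 0..k-1
        feedback = (W[:, :k] - W_hat[:, :k]) @ A[:k, k]
        W_hat[:, k] = quantize(W[:, k] + feedback)
    return W_hat

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 16)) @ rng.standard_normal((16, 16))
H = X.T @ X / 512 + 1e-3 * np.eye(16)       # synthetic proxy Hessian with a small ridge
W = rng.standard_normal((4, 16))

def loss(W_hat):
    E = W_hat - W
    return float(np.trace(E @ H @ E.T))

print("nearest rounding:", loss(np.round(W)))
print("LDLQ (g = 1):    ", loss(ldlq(W, H, np.round)))
```

<p>With this feedback, the accumulated error <code>(Ŵ - W)·L^T</code> reduces to the fresh rounding error of each column, which is why the quantizer&rsquo;s own noise level is the quantity that ends up mattering.</p>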
<h3 id="65-theoretical-guarantee">6.5 Theoretical Guarantee</h3>
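<p><strong>Theorem 4.1</strong> from the QuIP# paper: for a weight matrix <code>W ∈ ℝ^(m×n)</code> rounded with g-block LDLQ, a μ-incoherent Hessian <code>H</code>, and a vector quantizer whose per-block error covariance is bounded by <code>σ²·I</code> (E8P in QuIP#):</p>
<pre tabindex="0"><code>𝔼[Error] ≤ (g·m·μ²·σ²/n)·tr(H^{1/2})²
</code></pre><p>Where:</p>
<ul>
<li><strong>Error</strong>: Proxy loss <code>tr((Ŵ - W)·H·(Ŵ - W)^T)</code></li>
<li><strong>σ²</strong>: Quantization noise (<strong>minimized by E8P!</strong>)</li>
<li><strong>μ</strong>: Incoherence (<strong>minimized by RHT!</strong>)</li>
<li><strong>g</strong>: Block size (8 for QuIP#)</li>
<li><strong>m, n</strong>: Number of rows and columns of <code>W</code></li>
</ul>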
<p><strong>The beauty</strong>: Both RHT and E8P appear in the error bound! Each component directly reduces the final error.</p>
<p>Now let&rsquo;s step back and see how all three pillars work together to achieve what was thought impossible.</p>
<hr>
<h2 id="7-the-complete-picture-why-quip-works">7. The Complete Picture: Why QuIP# Works</h2>
<h3 id="71-the-virtuous-cycle">7.1 The Virtuous Cycle</h3>
<pre tabindex="0"><code>RHT 
  → Gaussian Weights (ball-shaped, μ ≈ √log n)
      ↓
  E8 optimal packing (matches Gaussian shape)
      ↓
  Small per-block rounding error σ²
      ↓
  Low quantization noise in Block-LDLQ
      ↓
  Near-lossless 2-bit quantization!
</code></pre><h3 id="72-each-component-is-necessary">7.2 Each Component is Necessary</h3>
<p>Remove any piece and the system fails:</p>
<ul>
<li><strong>Without RHT</strong>: Outliers remain → high μ → error bound explodes</li>
<li><strong>Without E8P</strong>: Poor sphere packing → high σ² → error bound explodes</li>
<li><strong>Without Block-LDLQ</strong>: No Hessian adaptation → accumulated error</li>
</ul>
<h3 id="73-the-mathematical-beauty">7.3 The Mathematical Beauty</h3>
<p>All three components appear in the error bound:</p>
<pre tabindex="0"><code>Error ≤ (g·m·μ²·σ²/n)·tr(H^{1/2})²
         ↑   ↑  ↑
         |   |  └─ E8P minimizes this (optimal packing)
         |   └──── RHT minimizes this (incoherence)
         └──────── Block-LDLQ optimizes given μ and σ²
</code></pre><p>This isn&rsquo;t accidental: each technique targets a different factor of the same bound, so improving any one of them tightens the final error.</p>
<p>With the theory established, let&rsquo;s examine what QuIP# means for real-world deployment and the future of LLM accessibility.</p>
<hr>
<h2 id="8-practical-implications">8. Practical Implications</h2>
<h3 id="81-deployment-scenarios-unlocked">8.1 Deployment Scenarios Unlocked</h3>
<p><strong>Before QuIP#</strong>:</p>
<ul>
<li>Llama 2 70B: Requires 140GB in FP16 (6× RTX 4090s, roughly $10k in GPUs alone at $1,600 each)</li>
<li>Research teams locked out of state-of-the-art models</li>
<li>Edge deployment impossible</li>
</ul>
<p><strong>After QuIP#</strong>:</p>
<ul>
<li>Llama 2 70B: ~18GB ✓ <strong>Fits on single RTX 4090 ($1,600)</strong></li>
<li>7B models: ~2GB of weights → <strong>small enough for smartphones</strong></li>
<li>Cost reduction: ~8× less weight memory than FP16 → several times more models per server</li>
<li><strong>Privacy</strong>: Sensitive data processing entirely on-device</li>
</ul>
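<p>The memory figures above follow from simple arithmetic: parameters times bits per weight, divided by eight. The helper below is a small sketch of that calculation; <code>weight_memory_gb</code> is an illustrative name, and the result counts weights only, ignoring activations, KV cache, and codebook overhead.</p>

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in GB (10^9 bytes); excludes activations, KV cache, codebooks."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 4, 2):
    print(f"Llama 2 70B at {bits}-bit: {weight_memory_gb(70e9, bits):.1f} GB")
# Llama 2 70B at 16-bit: 140.0 GB
# Llama 2 70B at 4-bit: 35.0 GB
# Llama 2 70B at 2-bit: 17.5 GB
```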
<h3 id="82-the-scaling-breakthrough">8.2 The Scaling Breakthrough</h3>
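<p>The unprecedented result: <strong>QuIP# 3-bit models scale better than 4-bit models</strong>. At a fixed memory budget, a larger model quantized to 3 bits outperforms a smaller one quantized to 4 bits.</p>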
<p>This directly refutes the 2023 consensus that &ldquo;4-bit is optimal.&rdquo; As models get larger, QuIP# 2-bit appears to scale similarly to 3-bit and 4-bit, suggesting that <strong>2-bit may become the new standard</strong>.</p>
<hr>
<h2 id="9-conclusion">9. Conclusion</h2>
<p>QuIP# achieves what was thought impossible: near-lossless 2-bit quantization of LLMs. The key insights:</p>
<ol>
<li><strong>Eliminate outliers</strong> through principled RHT transformation (not heuristic suppression)</li>
<li><strong>Match the distribution</strong> using proven-optimal E8 lattice sphere packing</li>
<li><strong>Account for dependencies</strong> through Block-LDLQ adaptive rounding</li>
</ol>
<p>Each component addresses a specific mathematical challenge. Together, they form an elegant solution that:</p>
<ul>
<li>Enables 70B models on consumer hardware</li>
<li>Achieves unprecedented compression with minimal quality loss</li>
<li>At 3 bits, scales better than &ldquo;theoretically optimal&rdquo; 4-bit methods</li>
</ul>
<p>The 2-bit dream is alive.</p>
<p>For practitioners: QuIP# quantized models are available at <a href="https://huggingface.co/relaxml">https://huggingface.co/relaxml</a>. Code at <a href="https://github.com/Cornell-RelaxML/quip-sharp">https://github.com/Cornell-RelaxML/quip-sharp</a>.</p>
<p>The future of LLM deployment just got a lot more accessible.</p>
<hr>
<h2 id="references">References</h2>
<ul>
<li>Tseng, A., Chee, J., Sun, Q., Kuleshov, V., &amp; De Sa, C. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. <em>ICML 2024</em>.</li>
<li>Viazovska, M. (2017). The sphere packing problem in dimension 8. <em>Annals of Mathematics</em>.</li>
<li>Chee, J., Cai, Y., Kuleshov, V., &amp; De Sa, C. (2023). QuIP: 2-bit quantization of large language models with guarantees. <em>NeurIPS 2023</em>.</li>
</ul>
]]></content:encoded></item><item><title>Why Can Your Laptop Run LLaMA? A Deep Dive into Quantization</title><link>https://www.mdjawad.com/posts/quantization-and-gptq/</link><pubDate>Sat, 04 Oct 2025 21:19:40 +0800</pubDate><guid>https://www.mdjawad.com/posts/quantization-and-gptq/</guid><description>How 4–8x compression and Hessian-guided GPTQ make 70B-scale models practical on modest hardware—what INT8/INT4 really cost, and when accuracy holds.</description><content:encoded><![CDATA[<h2 id="introduction-why-we-cant-afford-full-precision-anymore">Introduction: Why We Can&rsquo;t Afford Full Precision Anymore</h2>
<p>The numbers tell a stark story. A model like GPT-3, with its 175 billion parameters, demands 700GB of memory at full precision (175B parameters × 4 bytes), enough to consume thousands of dollars in cloud costs every single day. LLaMA-2-70B requires 280GB at full precision and 140GB even with standard FP16, numbers that dwarf the memory capacity of most GPUs. Training such models can cost millions of dollars in compute resources, and even running inference requires an infrastructure of multiple high-end GPUs costing $15,000 each.</p>
<p>These requirements create more than just a financial barrier. They represent a fundamental accessibility problem. Consumer GPUs like the RTX 3090 offer only 24GB of VRAM, while even the newest RTX 5090 provides just 32GB, nowhere near enough for unquantized large models. Mobile and edge devices face even tighter constraints with 4-8GB of total RAM. Without quantization, state-of-the-art models remain locked behind expensive data center infrastructure, inaccessible to researchers, startups, and individual developers.</p>
<p>The computational burden compounds these memory constraints. Matrix multiplication dominates LLM inference time, and high-precision arithmetic is expensive. On the NVIDIA A100, lower-precision tensor cores provide substantially higher throughput: TF32 peaks at about 156 TFLOPS dense (312 with structured sparsity), while INT8 reaches 624 TOPS dense (1,248 with sparsity). Memory bandwidth creates additional bottlenecks: moving 140GB of weights from GPU memory to compute units can take longer than the computation itself, especially for the small batch sizes typical in interactive applications.</p>
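<p>To make the bandwidth argument concrete: at batch size 1, generating one token requires reading every weight once, so weight size divided by memory bandwidth gives a lower bound on per-token latency. The sketch below assumes a round figure of 2,000 GB/s of aggregate memory bandwidth, which is an illustrative assumption and not a measured number; <code>min_seconds_per_token</code> is a hypothetical helper.</p>

```python
def min_seconds_per_token(weight_gb, bandwidth_gb_per_s):
    """Lower bound on batch-1 decode latency: every weight is read once per token."""
    return weight_gb / bandwidth_gb_per_s

BANDWIDTH = 2000.0  # GB/s, assumed round figure for aggregate GPU memory bandwidth

for label, weight_gb in (("FP16", 140.0), ("INT4", 35.0)):
    t = min_seconds_per_token(weight_gb, BANDWIDTH)
    print(f"70B at {label}: at least {t * 1000:.1f} ms per token")
```

<p>Under this assumption the bound scales linearly with weight size, so shrinking the weights by 4× also cuts the bandwidth-limited latency floor by 4×, independent of how fast the arithmetic units are.</p>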
<p>This is where quantization transforms from optimization technique to enabling technology. By compressing neural network weights from 16 or 32 bits down to 4 bits or lower, quantization can achieve 4-8x memory reduction and 2-4x computational speedup while keeping accuracy losses small when proper techniques are used. Modern quantization methods are moving LLM deployment from multi-GPU clusters toward single consumer GPUs, in certain setups (e.g., 4-bit with offloading or multi-GPU) reducing total system costs from six figures to a few thousand dollars.</p>
<p>The field has evolved dramatically. What began as &ldquo;can we quantize below 8 bits?&rdquo; in 2022 has progressed to early deployments and research systems exploring 2-bit models by 2025, with clear paths emerging for both research and real-world applications.</p>
<div class="qmem-chart-container">
  <style>
    .qmem-chart-container { background: #fff; border-radius: 16px; padding: 20px; box-shadow: 0 12px 30px rgba(0,0,0,0.06); margin: 32px auto; max-width: 1100px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif; }
    .qmem-title { font-size: 24px; font-weight: 700; color: #111827; margin: 0 0 6px 0; text-align: center; }
    .qmem-subtitle { font-size: 14px; color: #6b7280; margin: 0 0 16px 0; text-align: center; }
    .qmem-controls { display: flex; justify-content: center; align-items: center; gap: 16px; margin-bottom: 12px; flex-wrap: wrap; }
    .qmem-legend { display: flex; gap: 16px; justify-content: center; flex-wrap: wrap; font-size: 12px; color: #374151; margin-top: 10px; }
    .qmem-legend-item { display: flex; align-items: center; gap: 6px; }
    .qmem-swatch { width: 14px; height: 14px; border-radius: 3px; display: inline-block; }
    .qmem-note { margin-top: 10px; font-size: 12px; color: #6b7280; text-align: center; }
  </style>

  <div class="qmem-header">
    <h3 class="qmem-title">GPU Memory Requirements by Model Size and Precision</h3>
    <p class="qmem-subtitle">Memory footprint comparison across quantization formats for LLM deployment</p>
  </div>

  <div class="qmem-controls">
    <label style="display:flex; align-items:center; gap:8px; cursor:pointer; font-size: 13px; color:#374151;">
      <input type="checkbox" id="qmemToggle-d2ebbad0e656c085c5ea2a3a6e1b91c9" checked />
      Show GPU capacity reference lines
    </label>
  </div>

  <div style="width: 100%; overflow-x: auto;">
    <svg id="qmemSvg-d2ebbad0e656c085c5ea2a3a6e1b91c9" viewBox="0 0 920 520" preserveAspectRatio="xMidYMid meet" style="width:100%; height:520px; background:#ffffff; border-radius: 12px;">
      
    </svg>
  </div>

  <div class="qmem-legend">
    <div class="qmem-legend-item"><span class="qmem-swatch" style="background:#ef4444"></span> FP32 (32-bit)</div>
    <div class="qmem-legend-item"><span class="qmem-swatch" style="background:#f97316"></span> FP16 (16-bit)</div>
    <div class="qmem-legend-item"><span class="qmem-swatch" style="background:#eab308"></span> INT8 (8-bit)</div>
    <div class="qmem-legend-item"><span class="qmem-swatch" style="background:#22c55e"></span> INT4 (4-bit)</div>
  </div>

  <div class="qmem-note">Note: Memory values approximate model weights only, using 1B params → 4 GB (FP32), 2 GB (FP16), 1 GB (INT8), 0.5 GB (INT4). Actual deployments require additional memory for activations, KV cache, and overhead (often 1.2–2× total).</div>

  <script>
    (function() {
      const uid = 'd2ebbad0e656c085c5ea2a3a6e1b91c9';
      const svg = document.getElementById('qmemSvg-' + uid);
      const toggle = document.getElementById('qmemToggle-' + uid);

      
      const margin = { top: 30, right: 140, bottom: 60, left: 70 };
      const width = 920, height = 520;
      const chartW = width - margin.left - margin.right;
      const chartH = height - margin.top - margin.bottom;
      const originX = margin.left, originY = height - margin.bottom;

      
      const modelSizes = [
        { name: '7B', params: 7 },
        { name: '13B', params: 13 },
        { name: '70B', params: 70 },
        { name: '175B', params: 175 }
      ];

      // GB of weights per billion parameters, i.e. bytes per parameter
      const perB = { FP32: 4.0, FP16: 2.0, INT8: 1.0, INT4: 0.5 };

      const series = [
        { key: 'FP32', color: '#ef4444' },
        { key: 'FP16', color: '#f97316' },
        { key: 'INT8', color: '#eab308' },
        { key: 'INT4', color: '#22c55e' }
      ];

      const gpuCaps = [
        { value: 24, label: 'RTX 3090/4090 (24GB)', color: '#94a3b8' },
        { value: 40, label: 'A100 40GB', color: '#64748b' },
        { value: 80, label: 'A100/H100 80GB', color: '#475569' }
      ];

      const data = modelSizes.map(m => ({
        name: m.name,
        FP32: m.params * perB.FP32,
        FP16: m.params * perB.FP16,
        INT8: m.params * perB.INT8,
        INT4: m.params * perB.INT4
      }));

      const maxY = Math.max(
        ...data.flatMap(d => [d.FP32, d.FP16, d.INT8, d.INT4]),
        ...gpuCaps.map(g => g.value)
      );
      const yMax = Math.ceil(maxY / 10) * 10;

      const xPositions = data.map((_, i) => originX + (i * chartW) / (data.length - 1));
      const yScale = v => originY - (v / yMax) * chartH;

      
      function addEl(tag, attrs = {}, parent = svg) {
        const el = document.createElementNS('http://www.w3.org/2000/svg', tag);
        Object.entries(attrs).forEach(([k, v]) => el.setAttribute(k, v));
        parent.appendChild(el);
        return el;
      }

      function clearChart() {
        while (svg.firstChild) svg.removeChild(svg.firstChild);
      }

      function drawAxes() {
        
        addEl('line', { x1: originX, y1: originY, x2: originX + chartW, y2: originY, stroke: '#9ca3af' });
        addEl('line', { x1: originX, y1: originY, x2: originX, y2: originY - chartH, stroke: '#9ca3af' });

        
        xPositions.forEach((x, i) => {
          addEl('line', { x1: x, y1: originY, x2: x, y2: originY + 6, stroke: '#9ca3af' });
          addEl('text', { x, y: originY + 24, 'text-anchor': 'middle', 'font-size': '12px', fill: '#374151' })
            .textContent = data[i].name;
        });
        addEl('text', { x: originX + chartW / 2, y: originY + 45, 'text-anchor': 'middle', 'font-size': '12px', fill: '#6b7280' })
          .textContent = 'Model Size (Parameters)';

        
        const yTicks = 8;
        for (let i = 0; i <= yTicks; i++) {
          const v = (yMax / yTicks) * i;
          const y = yScale(v);
          addEl('line', { x1: originX - 6, y1: y, x2: originX, y2: y, stroke: '#9ca3af' });
          addEl('line', { x1: originX, y1: y, x2: originX + chartW, y2: y, stroke: '#e5e7eb' });
          addEl('text', { x: originX - 10, y: y + 4, 'text-anchor': 'end', 'font-size': '12px', fill: '#374151' })
            .textContent = v.toString();
        }
        addEl('text', { x: originX - 50, y: originY - chartH / 2, transform: `rotate(-90 ${originX - 50} ${originY - chartH / 2})`, 'text-anchor': 'middle', 'font-size': '12px', fill: '#6b7280' })
          .textContent = 'Memory Required (GB)';
      }

      function drawSeries() {
        series.forEach(s => {
          const path = data.map((d, i) => `${i === 0 ? 'M' : 'L'} ${xPositions[i]} ${yScale(d[s.key])}`).join(' ');
          addEl('path', { d: path, fill: 'none', stroke: s.color, 'stroke-width': 3 });
          data.forEach((d, i) => {
            addEl('circle', { cx: xPositions[i], cy: yScale(d[s.key]), r: 4, fill: s.color });
          });
        });
      }

      function drawGpuCaps(show) {
        if (!show) return;
        gpuCaps.forEach(g => {
          const y = yScale(g.value);
          addEl('line', { x1: originX, y1: y, x2: originX + chartW, y2: y, stroke: g.color, 'stroke-dasharray': '6 6', 'stroke-width': 2 });
          addEl('text', { x: originX + chartW + 6, y: y + 4, 'font-size': '12px', fill: g.color })
            .textContent = g.label;
        });
      }

      function redraw() {
        if (!svg) return;
        clearChart();
        drawAxes();
        drawSeries();
        // Show the reference lines by default if the checkbox is missing.
        drawGpuCaps(!toggle || toggle.checked);
      }

      redraw();
      if (toggle) toggle.addEventListener('change', redraw);
    })();
  </script>
</div>

<h2 id="part-1-the-foundation---how-computers-store-numbers">Part 1: The Foundation - How Computers Store Numbers</h2>
<p>To understand how we can compress these models, we must first understand what we&rsquo;re compressing. Machine learning algorithms don&rsquo;t process text; they process numbers, and the format used to store these numbers dictates their range, accuracy, and the memory they consume.</p>
<h3 id="the-32-bit-standard-fp32-and-its-anatomy">The 32-Bit Standard: FP32 and Its Anatomy</h3>
<p>The default numerical format in most deep learning frameworks is the 32-bit single-precision floating-point number, commonly known as FP32. Defined by the IEEE 754 standard, an FP32 number occupies 32 bits (4 bytes) of memory, divided into three distinct parts that work together to represent a vast range of real numbers.</p>
<p>The <strong>sign bit</strong> (1 bit) is straightforward: a value of 0 indicates a positive number, while 1 indicates negative. The <strong>exponent</strong> (8 bits) determines the magnitude or range of the number, functioning like the exponent in scientific notation by scaling the value up or down by powers of 2. To represent both very large and very small numbers, the exponent is stored as an 8-bit unsigned integer using a technique called exponent bias. For FP32, the bias is 127, meaning the actual exponent equals the stored value minus 127. Stored values 0 and 255 are reserved (for zero and subnormals, and for infinity and NaN, respectively), so normal numbers cover an exponent range from -126 to +127 without requiring a separate sign bit for the exponent itself.</p>
<p>The <strong>mantissa</strong> (23 bits), also known as the significand, determines the precision of the number—essentially, how many significant digits it can accurately represent. The mantissa is a binary fraction normalized to be between 1.0 and 2.0. Because the leading digit of a normalized binary number in this format is always 1, this &ldquo;implied leading 1&rdquo; doesn&rsquo;t need storage. This clever trick effectively gives the mantissa 24 bits of precision while only using 23 bits of memory.</p>
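<p>The three fields can be pulled apart with Python&rsquo;s standard <code>struct</code> module. The sketch below is illustrative: <code>fp32_fields</code> and <code>fp32_value</code> are hypothetical helper names, and the reconstruction formula covers normal numbers only (not zero, subnormals, infinity, or NaN).</p>

```python
import struct

def fp32_fields(x):
    """Split a float, rounded to FP32, into (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack("!I", struct.pack("!f", x))[0]   # raw 32-bit pattern
    sign = bits // 2**31                  # top bit
    exponent = (bits // 2**23) % 2**8     # next 8 bits: stored (biased) exponent
    mantissa = bits % 2**23               # low 23 bits: stored fraction
    return sign, exponent, mantissa

def fp32_value(sign, exponent, mantissa):
    """Rebuild a normal FP32 value: (-1)^sign * 1.mantissa * 2^(exponent - 127)."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)

print(fp32_fields(1.0))                       # (0, 127, 0)
print(fp32_fields(-2.5))                      # (1, 128, 2097152)
print(fp32_value(*fp32_fields(0.15625)))      # 0.15625
```

<p>For <code>-2.5</code>, the stored exponent 128 means an actual exponent of 1, and the mantissa field 2097152 = 2^21 encodes the fraction 0.25, giving -1.25 × 2^1.</p>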
<div class="fp32-container">
  <style>
    .fp32-container { background:#fff; border-radius:16px; padding:20px; box-shadow:0 12px 30px rgba(0,0,0,0.06); margin:32px auto; max-width:1100px; font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,'Helvetica Neue',Arial,sans-serif; }
    .fp32-title { font-size:24px; font-weight:700; color:#111827; margin:0 0 6px 0; text-align:center; }
    .fp32-subtitle { font-size:14px; color:#6b7280; text-align:center; margin:0 0 16px 0; }

    .fp32-controls { margin: 12px 0 18px 0; }
    .fp32-label { font-size:13px; color:#374151; margin-bottom:8px; }
    .fp32-grid { display:grid; grid-template-columns:repeat(2, minmax(0,1fr)); gap:8px; }
    @media (min-width: 768px) { .fp32-grid { grid-template-columns:repeat(4, minmax(0,1fr)); } }
    .fp32-btn { padding:8px 10px; border-radius:8px; border:1px solid #e5e7eb; background:#f3f4f6; color:#374151; font-weight:600; font-size:13px; cursor:pointer; transition:.15s; }
    .fp32-btn:hover { background:#e5e7eb; }
    .fp32-btn.active { background:#2563eb; color:#fff; border-color:#2563eb; }

    .fp32-card { background:#f8fafc; border:1px solid #e5e7eb; border-radius:12px; padding:12px; margin-bottom:12px; }
    .fp32-mono-row { display:flex; align-items:flex-start; gap:8px; overflow-x:auto; font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace; }
    .fp32-bitcol { display:flex; flex-direction:column; align-items:center; gap:6px; }
    .fp32-bitlabel { font-size:10px; color:#6b7280; height:20px; display:flex; align-items:center; }
    .fp32-bit { width:28px; height:32px; display:flex; align-items:center; justify-content:center; border-radius:6px; color:#fff; font-weight:700; }
    .fp32-bit.sign { background:#059669; }
    .fp32-bit.sign.neg { background:#dc2626; }
    .fp32-bit.exp { background:#f59e0b; }
    .fp32-bit.man { background:#3b82f6; }
    .fp32-divider { width:1px; background:#d1d5db; align-self:stretch; margin-top:20px; }

    .fp32-row { display:grid; grid-template-columns:repeat(3, minmax(0,1fr)); gap:12px; }
    .fp32-panel { background:#ffffff; border:1px solid #e5e7eb; border-radius:10px; padding:12px; }
    .fp32-panel h4 { margin:0 0 8px 0; font-size:14px; color:#111827; }
    .fp32-kv { font-size:13px; color:#374151; margin:2px 0; }
    .fp32-kv small { color:#6b7280; }

    .fp32-formula { background:#f3f4f6; border:1px solid #e5e7eb; border-radius:10px; padding:12px; font-size:13px; }
    .fp32-formula .mono { background:#fff; border:1px solid #e5e7eb; border-radius:6px; padding:6px; display:inline-block; font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace; font-size:12px; }

    .fp32-insights { background:#eff6ff; border:1px solid #bfdbfe; border-radius:10px; padding:12px; }
    .fp32-insights h4 { margin:0 0 8px 0; font-size:14px; color:#1f2937; }
    .fp32-insights ul { margin:0; padding-left:18px; color:#1f2937; font-size:13px; }
  </style>

  <div class="fp32-header">
    <h3 class="fp32-title">FP32 (32-bit Float) Bit-Level Anatomy</h3>
    <p class="fp32-subtitle">IEEE 754 single-precision floating-point format</p>
  </div>

  <div class="fp32-controls">
    <div class="fp32-label">Select Example:</div>
    <div class="fp32-grid" id="fp32Btns-d2ebbad0e656c085c5ea2a3a6e1b91c9"></div>
  </div>

  <div class="fp32-card">
    <div class="fp32-mono-row">
      <div class="fp32-bitcol">
        <div class="fp32-bitlabel">Bit 31</div>
        <div id="fp32Sign-d2ebbad0e656c085c5ea2a3a6e1b91c9" class="fp32-bit sign">0</div>
      </div>
      <div class="fp32-divider"></div>
      <div class="fp32-bitcol">
        <div class="fp32-bitlabel">Bits 30–23</div>
        <div id="fp32ExpWrap-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="display:flex; gap:4px;"></div>
      </div>
      <div class="fp32-divider"></div>
      <div class="fp32-bitcol" style="flex:1; min-width:280px;">
        <div class="fp32-bitlabel">Bits 22–0</div>
        <div id="fp32ManWrap-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="display:flex; gap:4px; flex-wrap:wrap;"></div>
      </div>
    </div>

    <div class="fp32-mono-row" style="margin-top:8px;">
      <div style="width:60px; text-align:center;">
        <div style="font-size:12px; font-weight:600; color:#374151;">Sign</div>
        <div style="font-size:10px; color:#6b7280;">(1 bit)</div>
      </div>
      <div class="fp32-divider"></div>
      <div style="width:220px; text-align:center;">
        <div style="font-size:12px; font-weight:600; color:#374151;">Exponent</div>
        <div style="font-size:10px; color:#6b7280;">(8 bits)</div>
      </div>
      <div class="fp32-divider"></div>
      <div style="flex:1; text-align:center;">
        <div style="font-size:12px; font-weight:600; color:#374151;">Mantissa (Significand)</div>
        <div style="font-size:10px; color:#6b7280;">(23 bits)</div>
      </div>
    </div>
  </div>

  <div class="fp32-row" style="margin-bottom:12px;">
    <div class="fp32-panel">
      <h4>Sign Bit</h4>
      <div class="fp32-kv"><strong>Value:</strong> <span id="fp32SignVal-d2ebbad0e656c085c5ea2a3a6e1b91c9">0</span></div>
      <div class="fp32-kv"><strong>Meaning:</strong> <span id="fp32SignMean-d2ebbad0e656c085c5ea2a3a6e1b91c9">Positive (+)</span></div>
      <div class="fp32-kv"><small>Simple: 0 = positive, 1 = negative</small></div>
    </div>

    <div class="fp32-panel">
      <h4>Exponent</h4>
      <div class="fp32-kv"><strong>Binary:</strong> <span id="fp32ExpBin-d2ebbad0e656c085c5ea2a3a6e1b91c9">00000000</span></div>
      <div class="fp32-kv"><strong>Decimal:</strong> <span id="fp32ExpDec-d2ebbad0e656c085c5ea2a3a6e1b91c9">0</span></div>
      <div class="fp32-kv"><strong>Bias:</strong> 127 (subtracted from the stored value)</div>
      <div class="fp32-kv"><strong>Actual:</strong> <span id="fp32ExpAct-d2ebbad0e656c085c5ea2a3a6e1b91c9">0</span></div>
      <div class="fp32-kv"><small>Determines magnitude: 2^exponent</small></div>
    </div>

    <div class="fp32-panel">
      <h4>Mantissa</h4>
      <div class="fp32-kv"><strong>Stored:</strong> <span id="fp32ManBin-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, 'Liberation Mono', 'Courier New', monospace; font-size:12px;">00000000000000000000000</span></div>
      <div class="fp32-kv"><strong>Implied leading 1:</strong> <span class="mono" id="fp32ManShown-d2ebbad0e656c085c5ea2a3a6e1b91c9">1.000000...</span></div>
      <div class="fp32-kv"><small>Precision: ~7 decimal digits</small></div>
    </div>
  </div>

  <div class="fp32-formula">
    <h4 style="margin:0 0 8px 0; font-size:14px; color:#111827;">Reconstruction Formula</h4>
    <div>Value = (−1)<sup>sign</sup> × 2<sup>(exponent − 127)</sup> × (1 + mantissa)</div>
    <div style="margin-top:8px; color:#4b5563; font-size:12px; border-top:1px solid #e5e7eb; padding-top:8px;">
      <div>For <strong id="fp32Name-d2ebbad0e656c085c5ea2a3a6e1b91c9">π (3.14159...)</strong>:</div>
      <div class="mono" style="display:block; margin-top:6px;">
        (−1)<sup><span id="fp32SignPow-d2ebbad0e656c085c5ea2a3a6e1b91c9">0</span></sup>
        × 2<sup>(<span id="fp32ExpDec2-d2ebbad0e656c085c5ea2a3a6e1b91c9">0</span> − 127)</sup>
        × (1.<span id="fp32ManShort-d2ebbad0e656c085c5ea2a3a6e1b91c9">000000</span>...)
      </div>
      <div style="margin-top:6px;"><strong>≈ <span id="fp32ApproxVal-d2ebbad0e656c085c5ea2a3a6e1b91c9">3.14159</span></strong></div>
    </div>
  </div>

  <div class="fp32-insights" style="margin-top:12px;">
    <h4>Key Insights</h4>
    <ul>
      <li>32 bits = 4 bytes per number</li>
      <li>Exponent bias (127) enables wide dynamic range without a signed exponent</li>
      <li>Implied leading 1 in mantissa gives 24 effective precision bits</li>
      <li>Range: ±1.4 × 10^-45 to ±3.4 × 10^38</li>
      <li>Precision: ~7 decimal digits</li>
    </ul>
  </div>

  <script>
    (function(){
      const uid = 'd2ebbad0e656c085c5ea2a3a6e1b91c9';
      const examples = {
        pi: { name: 'π (3.14159...)', value: 3.14159265359, sign: 0, exponent: '10000000', mantissa: '10010010000111111011011' },
        small: { name: 'Small number (0.0001)', value: 0.0001, sign: 0, exponent: '01110001', mantissa: '10100011011011100010111' },
        negative: { name: 'Negative (-42.5)', value: -42.5, sign: 1, exponent: '10000100', mantissa: '01010100000000000000000' },
        large: { name: 'Large number (1,000,000)', value: 1000000, sign: 0, exponent: '10010010', mantissa: '11101000010010000000000' }
      };
      let selected = 'pi';

      const btns = document.getElementById('fp32Btns-' + uid);
      const signEl = document.getElementById('fp32Sign-' + uid);
      const signVal = document.getElementById('fp32SignVal-' + uid);
      const signMean = document.getElementById('fp32SignMean-' + uid);
      const expWrap = document.getElementById('fp32ExpWrap-' + uid);
      const manWrap = document.getElementById('fp32ManWrap-' + uid);

      const expBin = document.getElementById('fp32ExpBin-' + uid);
      const expDec = document.getElementById('fp32ExpDec-' + uid);
      const expAct = document.getElementById('fp32ExpAct-' + uid);
      const manBin = document.getElementById('fp32ManBin-' + uid);
      const manShown = document.getElementById('fp32ManShown-' + uid);

      const nameEl = document.getElementById('fp32Name-' + uid);
      const signPow = document.getElementById('fp32SignPow-' + uid);
      const expDec2 = document.getElementById('fp32ExpDec2-' + uid);
      const manShort = document.getElementById('fp32ManShort-' + uid);
      const approxVal = document.getElementById('fp32ApproxVal-' + uid);

      function renderButtons(){
        btns.innerHTML='';
        Object.entries(examples).forEach(([key, ex])=>{
          const b = document.createElement('button');
          b.className = 'fp32-btn' + (selected===key ? ' active' : '');
          b.textContent = ex.name;
          b.onclick = ()=>{ selected = key; render(); renderButtons(); };
          btns.appendChild(b);
        });
      }

      function bitBox(content, cls){
        const d = document.createElement('div');
        d.className = 'fp32-bit ' + cls;
        d.textContent = content;
        return d;
      }

      function render(){
        const cur = examples[selected];
        const expVal = parseInt(cur.exponent, 2);
        const actualExp = expVal - 127;

        
        signEl.textContent = cur.sign;
        signEl.className = 'fp32-bit sign' + (cur.sign === 1 ? ' neg' : '');
        signVal.textContent = cur.sign;
        signMean.textContent = cur.sign === 0 ? 'Positive (+)' : 'Negative (−)';
        signPow.textContent = cur.sign;

        
        expWrap.innerHTML='';
        cur.exponent.split('').forEach(bit=>{
          expWrap.appendChild(bitBox(bit, 'exp'));
        });
        expBin.textContent = cur.exponent;
        expDec.textContent = expVal;
        expAct.textContent = actualExp;
        expDec2.textContent = expVal;

        
        manWrap.innerHTML='';
        cur.mantissa.split('').forEach(bit=>{
          manWrap.appendChild(bitBox(bit, 'man'));
        });
        manBin.textContent = cur.mantissa;
        manShown.textContent = '1.' + cur.mantissa.slice(0, 10) + '...';
        manShort.textContent = cur.mantissa.slice(0, 6);

        
        nameEl.textContent = cur.name;
        approxVal.textContent = cur.value;
      }

      renderButtons();
      render();
    })();
  </script>
</div>

<p>This structure gives FP32 a dynamic range of approximately ±1.4×10⁻⁴⁵ (the smallest subnormal value) to ±3.4×10³⁸. The machine epsilon, the gap between 1.0 and the next representable value, equals 2⁻²³ ≈ 0.00000012. For a 175B-parameter model, storing the weights alone in FP32 demands 700GB of memory at 4 bytes per parameter.</p>
<p>However, a key limitation of any finite binary representation is that it cannot perfectly represent all decimal numbers. Just as 1/3 cannot be written with a finite number of decimal digits, values like 0.1 cannot be represented exactly in binary, leading to minor rounding errors in computation.</p>
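<p>The field layout and the rounding of 0.1 can be checked directly. The sketch below uses only Python&rsquo;s standard <code>struct</code> module; the helper names <code>fp32_fields</code> and <code>fp32_value</code> are made up for this illustration, and the reconstruction only covers normal numbers (it ignores zero, subnormals, infinities, and NaN).</p>
<pre tabindex="0"><code class="language-python">import struct

def fp32_fields(x):
    """Split a number into its FP32 sign, exponent, and mantissa bit strings."""
    # Pack as big-endian single precision, then render the 32 bits as text.
    raw = struct.pack("!f", x)
    bits = format(int.from_bytes(raw, "big"), "032b")
    return bits[0], bits[1:9], bits[9:]

def fp32_value(sign, exponent, mantissa):
    """Rebuild a normal FP32 value: (-1)^sign * 2^(exponent - 127) * (1 + fraction)."""
    fraction = int(mantissa, 2) / 2**23
    return (-1) ** int(sign) * 2.0 ** (int(exponent, 2) - 127) * (1 + fraction)

for x in (3.14159265359, -42.5, 0.1):
    sign, exponent, mantissa = fp32_fields(x)
    # -42.5 is reproduced exactly; 0.1 comes back slightly off because it
    # has no finite binary expansion.
    print(x, sign, exponent, mantissa, fp32_value(sign, exponent, mantissa))
</code></pre>
<p>Values such as -42.5 that are sums of a few powers of two survive the round trip unchanged, while 0.1 is stored as the nearest representable neighbor.</p>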
<h3 id="the-half-precision-compromise-fp16-and-bfloat16">The Half-Precision Compromise: FP16 and BFloat16</h3>
<p>While FP32 provides a good balance of range and precision, its 32-bit size is a primary contributor to the massive memory footprint of LLMs. Two 16-bit (half-precision) formats have emerged, each with a different solution to the fundamental tradeoff between range and precision.</p>
<p><strong>FP16 (Half-Precision)</strong> compresses the format to 1 sign bit, 5 exponent bits (bias=15), and 10 mantissa bits. This dramatic reduction in exponent range means FP16&rsquo;s normal values span only about ±6×10⁻⁵ to ±65,504. With only 3-4 significant digits and an epsilon of 2⁻¹⁰ ≈ 0.00097656, FP16 risks overflow and underflow during model training: gradients can become too small or too large to represent, which can cause training to fail.</p>
<p>The memory advantage is substantial: 175B parameters require only 350GB, halving the footprint while enabling roughly 2x faster computation on Tensor Core GPUs. Mixed precision training exploits this by running most computation in FP16 while keeping a master copy of the weights and accumulating updates in FP32. Loss scaling is still needed to keep small gradients from underflowing to zero.</p>
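<p>To make the overflow, underflow, and loss-scaling behavior concrete, here is a small sketch using the half-precision (<code>"e"</code>) format of Python&rsquo;s <code>struct</code> module. The helper <code>to_fp16</code>, the gradient value, and the scale of 1024 are illustrative assumptions, not values taken from any training recipe.</p>
<pre tabindex="0"><code class="language-python">import struct

def to_fp16(x):
    """Round x to the nearest FP16 value via struct's half-precision format."""
    return struct.unpack("!e", struct.pack("!e", x))[0]

# Overflow: a value well beyond 65,504 cannot be stored at all.
try:
    to_fp16(70000.0)
    overflowed = False
except (OverflowError, struct.error):
    overflowed = True

# Underflow: a tiny gradient rounds to zero in FP16.
grad = 1e-8
lost = to_fp16(grad)

# Loss scaling: scale up before the FP16 step, then unscale in higher precision.
loss_scale = 1024.0  # illustrative value only
recovered = to_fp16(grad * loss_scale) / loss_scale

print(overflowed, lost, recovered)
</code></pre>
<p>The unscaled gradient is lost entirely, while the scaled one comes back within a fraction of a percent of its true value.</p>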
<p><strong>BFloat16 (BF16)</strong>, developed by Google Brain specifically for deep learning, takes a radically different approach. It allocates 1 bit for the sign, 8 bits for the exponent (same as FP32), and just 7 bits for the mantissa. By maintaining FP32&rsquo;s full dynamic range (±1.2×10⁻³⁸ to ±3.4×10³⁸) while sacrificing precision (epsilon = 0.0078125), BF16 becomes a &ldquo;drop-in replacement&rdquo; for FP32 in training.</p>
<p>Converting between FP32 and BF16 is trivial and computationally cheap: truncate the low 16 bits to go from FP32 to BF16, or zero-pad them to go back. The identical exponent range means overflow is no more likely than in FP32, and loss scaling is generally unnecessary during training. BF16 has become the preferred format for training large models.</p>
<p>The emergence and widespread adoption of BFloat16 reveals a core principle of deep learning systems: <strong>system-level stability is often more critical than component-level precision</strong>. The industry&rsquo;s willingness to sacrifice the precision of individual numbers (BF16 has 3 fewer mantissa bits than FP16) for the stability of the entire training process demonstrates that deep neural networks are remarkably robust to a certain level of numerical noise. The most critical failure mode during training is not a slight inaccuracy in a single weight but catastrophically exploding or vanishing gradients.</p>
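<p>The truncate and zero-pad conversion can be sketched in a few lines of standard-library Python. The function names below are hypothetical, and this shows plain truncation only; production conversions commonly round to nearest instead, which this sketch does not attempt.</p>
<pre tabindex="0"><code class="language-python">import struct

def fp32_to_bf16_bytes(x):
    """Keep the top 16 bits of the FP32 encoding: sign, exponent, 7 mantissa bits."""
    return struct.pack("!f", x)[:2]

def bf16_bytes_to_fp32(b):
    """Widen BF16 back to FP32 by zero-padding the low 16 bits."""
    return struct.unpack("!f", b + b"\x00\x00")[0]

for x in (3.14159265359, 1e-30, 3e38):
    y = bf16_bytes_to_fp32(fp32_to_bf16_bytes(x))
    # Very small and very large magnitudes survive; only low-order digits change.
    print(x, y)
</code></pre>
<p>With only 7 stored mantissa bits, π comes back as 3.140625, while 1e-30 and 3e38 stay within about 1% of their original values because the exponent field is untouched.</p>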
<h3 id="integer-quantization-int8-and-beyond">Integer Quantization: INT8 and Beyond</h3>
<p><strong>INT8 (8-bit integer)</strong> abandons floating point entirely, representing values as integers from -128 to 127 (signed) or 0 to 255 (unsigned). Quantization maps continuous weights to these 256 discrete values using a scale factor S and zero-point Z through the formula: x_q = round(x/S + Z). Dequantization reverses this: x = S × (x_q - Z).</p>
<p>The scale and zero-point are typically stored in higher precision, adding some overhead. Advanced techniques use per-channel quantization with different (S, Z) pairs for each output channel, dramatically improving accuracy. LLM.int8() goes further, identifying outlier features with magnitude &gt;6 and keeping them in FP16 while quantizing the rest, achieving &lt;0.5% degradation on 176B models.</p>
<p>At 1 byte per parameter, INT8 reduces a 175B model to approximately 175-200GB including overhead—a 75% reduction from FP32. Computational benefits are substantial: INT8 tensor cores deliver 2.3-4x speedup over FP32 in practice, though realizing these gains requires specialized kernels. The challenge lies in selecting appropriate quantization ranges: too narrow loses information through clipping, too wide wastes the limited 256 values on rarely-used extremes. Block-wise quantization divides parameters into groups of 64-128, computing separate scales for each block to limit outlier impact.</p>
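<p>The effect of block-wise scales on outliers can be seen in a toy example. This is a minimal sketch of symmetric absmax quantization, assuming a block size of 4 purely for readability (real schemes use the 64-128 mentioned above); the function names and weight values are invented for illustration.</p>
<pre tabindex="0"><code class="language-python">def quantize_blockwise(values, block_size, q_max=127):
    """Symmetric absmax quantization with one scale per block of values."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        scale = absmax / q_max if absmax != 0 else 1.0  # avoid dividing by zero
        blocks.append((scale, [round(v / scale) for v in block]))
    return blocks

def dequantize_blockwise(blocks):
    """Map every stored integer back to a float using its block's scale."""
    return [scale * q for scale, qs in blocks for q in qs]

weights = [0.01, -0.02, 0.03, 0.015, 0.02, -0.01, 8.0, 0.005]  # 8.0 is an outlier

# One scale for the whole tensor: the outlier sets the step size for everyone.
per_tensor = dequantize_blockwise(quantize_blockwise(weights, block_size=len(weights)))
# One scale per block of 4: only the outlier's own block is affected.
per_block = dequantize_blockwise(quantize_blockwise(weights, block_size=4))

print(per_tensor[:4])
print(per_block[:4])
</code></pre>
<p>With a single scale, the first four weights all collapse to zero because the outlier stretches the step size to about 0.063. With per-block scales they are recovered to within about 0.0001.</p>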
<p><strong>INT4 (4-bit integer)</strong> pushes compression to extremes with only 16 representable values. Standard INT4 uses -8 to 7, but neural network weights cluster around zero with an approximately normal distribution. NormalFloat4 (NF4) exploits this by placing quantization points at the quantiles of a standard normal distribution, optimizing for neural network weight distributions rather than uniform spacing. QLoRA uses NF4 with 64-weight blocks and double quantization (quantizing the scale factors themselves to 8-bit), compressing a 65B model from 130GB in 16-bit precision to roughly 40–50GB with careful overhead management.</p>
<p>The memory savings enable previously impossible deployments: a 70B parameter model can fit on a single 48–64GB GPU using INT4; fitting into 32GB typically requires significant offloading/pruning and tight KV‑cache constraints. However, computation typically dequantizes weights to FP16 for matrix multiplication since native 4-bit arithmetic lacks broad hardware support. This means memory bandwidth benefits dominate over raw computational speedup.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>FP32</th>
          <th>FP16</th>
          <th>BFloat16</th>
          <th>INT8</th>
          <th>INT4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total Bits</td>
          <td>32</td>
          <td>16</td>
          <td>16</td>
          <td>8</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Sign Bits</td>
          <td>1</td>
          <td>1</td>
          <td>1</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Exponent Bits</td>
          <td>8</td>
          <td>5</td>
          <td>8</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Mantissa Bits</td>
          <td>23 (+1 implied)</td>
          <td>10 (+1 implied)</td>
          <td>7 (+1 implied)</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Dynamic Range</td>
          <td>±1.4e-45 to ±3.4e38</td>
          <td>±6e-5 to ±65,504</td>
          <td>±1.2e-38 to ±3.4e38</td>
          <td>-128 to 127</td>
          <td>-8 to 7</td>
      </tr>
      <tr>
          <td>Precision</td>
          <td>7 digits</td>
          <td>3-4 digits</td>
          <td>2-3 digits</td>
          <td>256 levels</td>
          <td>16 levels</td>
      </tr>
  </tbody>
</table>
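<p>The memory figures quoted in this section follow from simple arithmetic on the bit widths in the table. The sketch below counts weight storage only; the function name is hypothetical, and it deliberately ignores scale factors, zero-points, activations, and the KV cache, all of which add overhead in practice.</p>
<pre tabindex="0"><code class="language-python">BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "BF16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params, fmt):
    """Weight storage in GB (10^9 bytes) for a model with num_params parameters."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9

for fmt in BITS_PER_PARAM:
    # Columns: format, 175B-parameter model, 70B-parameter model.
    print(fmt, weight_memory_gb(175e9, fmt), weight_memory_gb(70e9, fmt))
</code></pre>
<p>This reproduces the 700GB (FP32) and 350GB (FP16) figures for a 175B model, and gives 35GB of raw weights for a 70B model in INT4, before any quantization metadata is counted.</p>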
<h2 id="part-2-understanding-quantization---the-accuracy-efficiency-tradeoff">Part 2: Understanding Quantization - The Accuracy-Efficiency Tradeoff</h2>
<p>With a foundation in numerical representation, we can now explore the process of quantization itself. At its heart, quantization is a mapping function from a large, often continuous set of values to a smaller, discrete set.</p>
<h3 id="the-mechanics-mapping-continuous-to-discrete">The Mechanics: Mapping Continuous to Discrete</h3>
<p>Quantization converts model parameters from a high-precision data type like FP32 to a low-precision one, most commonly an 8-bit integer (INT8). An INT8 variable can only represent 256 distinct values, a stark contrast to the billions of values representable by FP32. This mapping is achieved through a linear transformation known as the <strong>affine quantization scheme</strong>.</p>
<p>The core formula relates the original real value (r) to its quantized integer counterpart (q) using two key parameters: a <strong>scale (S)</strong> and a <strong>zero-point (Z)</strong>:</p>
<pre tabindex="0"><code>r = S × (q - Z)
</code></pre><p>Rearranging for quantization gives: <code>q = round(r/S + Z)</code></p>
<p>The <strong>scale</strong> is a positive floating-point number that acts as the step size of the quantization. It defines the ratio of the original floating-point range to the target integer range, calculated as (r_max - r_min) / (q_max - q_min).</p>
<p>The <strong>zero-point</strong> is an integer within the quantized range that corresponds exactly to the floating-point value 0.0. This is critical because the value zero holds special significance in neural networks—it&rsquo;s used for padding in convolutions and serves as the threshold for activation functions like ReLU. Ensuring that 0.0 can be perfectly represented without error after quantization is essential for maintaining model accuracy.</p>
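<p>The scale and zero-point formulas translate directly into code. The following is a minimal sketch with made-up function names and an arbitrary float range of [-1.0, 3.0]; it quantizes single Python floats, whereas a real implementation would operate on whole tensors.</p>
<pre tabindex="0"><code class="language-python">def affine_params(r_min, r_max, q_min, q_max):
    """Scale S and zero-point Z for the mapping r = S * (q - Z)."""
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_min - r_min / scale)
    return scale, zero_point

def quantize(r, scale, zero_point, q_min, q_max):
    """q = round(r / S + Z), clamped into the representable integer range."""
    q = round(r / scale + zero_point)
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    """r = S * (q - Z)."""
    return scale * (q - zero_point)

S, Z = affine_params(-1.0, 3.0, 0, 255)
for r in (0.0, 1.0, 3.0, 5.0):
    q = quantize(r, S, Z, 0, 255)
    # 0.0 maps to Z and back exactly; 5.0 lies outside the range and is clipped.
    print(r, q, dequantize(q, S, Z))
</code></pre>
<p>Here the zero-point works out to 64, so 0.0 quantizes to 64 and dequantizes back to exactly 0.0, while in-range values such as 1.0 come back with an error of at most half a step.</p>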
<div class="qmap-container">
  <style>
    .qmap-container { background:#fff; border-radius:16px; padding:20px; box-shadow:0 12px 30px rgba(0,0,0,0.06); margin:32px auto; max-width:1100px; font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,'Helvetica Neue',Arial,sans-serif; }
    .qmap-title { font-size:24px; font-weight:700; color:#111827; margin:0 0 6px 0; text-align:center; }
    .qmap-subtitle { font-size:14px; color:#6b7280; text-align:center; margin:0 0 16px 0; }

    .qmap-grid { display:grid; grid-template-columns:1fr; gap:12px; }
    .qmap-selector { display:grid; grid-template-columns:1fr; gap:8px; }
    @media (min-width: 768px) { .qmap-selector { grid-template-columns:repeat(3,minmax(0,1fr)); } }
    .qmap-btn { padding:12px; border-radius:10px; border:1px solid #e5e7eb; background:#f3f4f6; color:#374151; text-align:left; font-weight:600; font-size:13px; cursor:pointer; transition:.15s; }
    .qmap-btn:hover { background:#e5e7eb; }
    .qmap-btn.active { background:#2563eb; color:#fff; border-color:#2563eb; }

    .qmap-cards { display:grid; grid-template-columns:1fr; gap:10px; }
    @media (min-width: 768px) { .qmap-cards { grid-template-columns:repeat(3,minmax(0,1fr)); } }
    .qmap-card { background:#f8fafc; border:1px solid #e5e7eb; border-radius:12px; padding:12px; }
    .qmap-card h4 { margin:0 0 8px 0; font-size:14px; color:#111827; }
    .qmap-mono { font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace; font-size:12px; }

    .qmap-row { background:#f9fafb; border:1px solid #e5e7eb; border-radius:12px; padding:12px; }
    .qmap-row h4 { margin:0 0 8px 0; font-size:14px; color:#111827; display:flex; align-items:center; gap:8px; }
    .qmap-flex { display:flex; justify-content:space-between; align-items:center; gap:10px; }
    .qmap-check { display:flex; align-items:center; gap:8px; font-size:13px; color:#374151; }

    .qmap-svg { width:100%; height:360px; background:#ffffff; border-radius:10px; }

    .qmap-buckets { background:#ffffff; border:1px solid #e5e7eb; border-radius:12px; padding:12px; overflow-x:auto; }
    .qmap-bucketbar { display:flex; gap:2px; min-width:max-content; }
    .qmap-bucket { width:12px; height:96px; position:relative; }
    .qmap-bucket:hover .qmap-bucketlabel { opacity:1; }
    .qmap-bucketlabel { position:absolute; bottom:2px; left:0; right:0; text-align:center; font-size:10px; color:#fff; opacity:0; transition:opacity .15s; }

    .qmap-insights { background:#f3f4f6; border:1px solid #e5e7eb; border-radius:12px; padding:12px; }
    .qmap-insights ul { margin:0; padding-left:18px; font-size:13px; color:#1f2937; }
  </style>

  <div class="qmap-header">
    <h3 class="qmap-title">Affine Quantization: Continuous to Discrete Mapping</h3>
    <p class="qmap-subtitle">How floating-point values are linearly mapped to integer representations</p>
  </div>

  <div class="qmap-grid">
    <div>
      <div class="qmap-selector" id="qmapButtons-d2ebbad0e656c085c5ea2a3a6e1b91c9"></div>
    </div>

    <div class="qmap-cards">
      <div class="qmap-card">
        <h4>Scale (S)</h4>
        <div style="font-size:20px; font-weight:700; color:#6d28d9;" id="qmapScale-d2ebbad0e656c085c5ea2a3a6e1b91c9">0</div>
        <div class="qmap-mono" style="background:#fff; border:1px solid #e5e7eb; border-radius:8px; padding:6px; margin-top:6px;">S = (r_max - r_min) / (q_max - q_min)</div>
        <div class="qmap-mono" id="qmapScaleExp-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="margin-top:6px;"></div>
      </div>
      <div class="qmap-card">
        <h4>Zero-Point (Z)</h4>
        <div style="font-size:20px; font-weight:700; color:#047857;" id="qmapZ-d2ebbad0e656c085c5ea2a3a6e1b91c9">0</div>
        <div class="qmap-mono" style="background:#fff; border:1px solid #e5e7eb; border-radius:8px; padding:6px; margin-top:6px;">Z = round(q_min - r_min / S)</div>
        <div style="font-size:12px; color:#374151; margin-top:6px;">Maps r = 0.0 to integer Z</div>
      </div>
      <div class="qmap-card">
        <h4>Ranges</h4>
        <div style="font-size:12px; color:#374151;">
          <div><strong>Float:</strong> <span class="qmap-mono" id="qmapRngR-d2ebbad0e656c085c5ea2a3a6e1b91c9">[0,0]</span></div>
          <div style="margin-top:6px;"><strong>Int:</strong> <span class="qmap-mono" id="qmapRngQ-d2ebbad0e656c085c5ea2a3a6e1b91c9">[0,0]</span></div>
        </div>
      </div>
    </div>

    <div class="qmap-row">
      <div class="qmap-flex" style="margin-bottom:8px;">
        <h4>Mapping Visualization</h4>
        <label class="qmap-check">
          <input type="checkbox" id="qmapZeroToggle-d2ebbad0e656c085c5ea2a3a6e1b91c9" checked />
          Highlight zero-point
        </label>
      </div>
      <svg id="qmapSvg-d2ebbad0e656c085c5ea2a3a6e1b91c9" class="qmap-svg" viewBox="0 0 920 360" preserveAspectRatio="xMidYMid meet"></svg>
    </div>

    <div class="qmap-row">
      <h4>Discrete Quantization Buckets</h4>
      <div style="font-size:12px; color:#374151; margin-bottom:8px;">Each integer value represents a range of floating-point values 
        (<span class="qmap-mono">step = S</span>)</div>
      <div class="qmap-buckets">
        <div id="qmapBucketBar-d2ebbad0e656c085c5ea2a3a6e1b91c9" class="qmap-bucketbar"></div>
        <div style="display:flex; justify-content:space-between; font-size:12px; color:#6b7280; margin-top:6px;">
          <span id="qmapRmin-d2ebbad0e656c085c5ea2a3a6e1b91c9"></span>
          <span id="qmapZeroMark-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="color:#16a34a; font-weight:600;"></span>
          <span id="qmapRmax-d2ebbad0e656c085c5ea2a3a6e1b91c9"></span>
        </div>
      </div>
    </div>

    <div class="qmap-insights">
      <h4 style="margin:0 0 8px 0; font-size:14px; color:#111827;">Key Insights</h4>
      <ul>
        <li><strong>Linear mapping:</strong> r = S × (q − Z) is an affine transformation</li>
        <li><strong>Scale (S):</strong> Step size; smaller S = finer granularity</li>
        <li><strong>Zero-point (Z):</strong> Ensures r = 0.0 maps to an integer</li>
        <li><strong>Symmetric:</strong> Z = 0 (common for weights)</li>
        <li><strong>Asymmetric:</strong> Z ≠ 0 (common for activations)</li>
        <li><strong>Error:</strong> Rounding introduces a quantization error of up to ±S/2 for in-range values</li>
      </ul>
    </div>
  </div>

  <script>
    (function(){
      const uid = 'd2ebbad0e656c085c5ea2a3a6e1b91c9';
      const scenarios = {
        symmetric: { name:'Symmetric (weights)', rMin:-1.0, rMax:1.0, qMin:-127, qMax:127, description:'Zero-point at 0, common for weights' },
        asymmetric: { name:'Asymmetric (activations)', rMin:0.0, rMax:6.0, qMin:0, qMax:255, description:'Zero-point shifted, common for activations (e.g., after ReLU)' },
        mixed: { name:'Asymmetric (general)', rMin:-0.5, rMax:2.5, qMin:0, qMax:255, description:'General asymmetric case' }
      };
      let scenarioKey = 'symmetric';

      const btnWrap = document.getElementById('qmapButtons-' + uid);
      const svg = document.getElementById('qmapSvg-' + uid);
      const zeroToggle = document.getElementById('qmapZeroToggle-' + uid);
      const scaleEl = document.getElementById('qmapScale-' + uid);
      const scaleExp = document.getElementById('qmapScaleExp-' + uid);
      const zEl = document.getElementById('qmapZ-' + uid);
      const rngR = document.getElementById('qmapRngR-' + uid);
      const rngQ = document.getElementById('qmapRngQ-' + uid);
      const bucketBar = document.getElementById('qmapBucketBar-' + uid);
      const rminEl = document.getElementById('qmapRmin-' + uid);
      const rmaxEl = document.getElementById('qmapRmax-' + uid);
      const zeroMark = document.getElementById('qmapZeroMark-' + uid);

      function seriesColor(){ return '#3b82f6'; }

      function buildButtons(){
        btnWrap.innerHTML = '';
        Object.entries(scenarios).forEach(([key, sc])=>{
          const b = document.createElement('button');
          b.className = 'qmap-btn' + (scenarioKey===key ? ' active' : '');
          b.innerHTML = '<div style="font-weight:700">' + sc.name + '</div><div style="font-size:12px; opacity:.85; margin-top:4px;">' + sc.description + '</div>';
          b.onclick = ()=>{ scenarioKey = key; render(); buildButtons(); };
          btnWrap.appendChild(b);
        });
      }

      function addEl(tag, attrs = {}, parent = svg) {
        const el = document.createElementNS('http://www.w3.org/2000/svg', tag);
        Object.entries(attrs).forEach(([k, v]) => el.setAttribute(k, v));
        parent.appendChild(el);
        return el;
      }

      function clearSvg(){ while (svg.firstChild) svg.removeChild(svg.firstChild); }

      function render(){
        const sc = scenarios[scenarioKey];
        const S = (sc.rMax - sc.rMin) / (sc.qMax - sc.qMin);
        const Z = Math.round(sc.qMin - sc.rMin / S);

        scaleEl.textContent = S.toFixed(6);
        scaleExp.textContent = `= (${sc.rMax} - (${sc.rMin})) / (${sc.qMax} - (${sc.qMin}))`;
        zEl.textContent = Z;
        rngR.textContent = `[${sc.rMin.toFixed(2)}, ${sc.rMax.toFixed(2)}]`;
        rngQ.textContent = `[${sc.qMin}, ${sc.qMax}]`;

        
        clearSvg();
        const margin = { top: 10, right: 20, bottom: 40, left: 60 };
        const width = 920, height = 360;
        const cw = width - margin.left - margin.right;
        const ch = height - margin.top - margin.bottom;
        const ox = margin.left, oy = height - margin.bottom;

        const qToX = (q) => ox + (q - sc.qMin) / (sc.qMax - sc.qMin) * cw;
        const rToY = (r) => oy - (r - sc.rMin) / (sc.rMax - sc.rMin) * ch;

        
        addEl('line', { x1: ox, y1: oy, x2: ox + cw, y2: oy, stroke: '#9ca3af' });
        addEl('line', { x1: ox, y1: oy, x2: ox, y2: oy - ch, stroke: '#9ca3af' });

        
        const xTicks = 8;
        for (let i = 0; i <= xTicks; i++) {
          const q = Math.round(sc.qMin + i * (sc.qMax - sc.qMin) / xTicks);
          const x = qToX(q);
          addEl('line', { x1: x, y1: oy, x2: x, y2: oy + 6, stroke: '#9ca3af' });
          const t = addEl('text', { x, y: oy + 20, 'text-anchor': 'middle', 'font-size': '12px', fill: '#374151' });
          t.textContent = q;
        }
        const yTicks = 8;
        for (let i = 0; i <= yTicks; i++) {
          const r = sc.rMin + i * (sc.rMax - sc.rMin) / yTicks;
          const y = rToY(r);
          addEl('line', { x1: ox - 6, y1: y, x2: ox, y2: y, stroke: '#9ca3af' });
          addEl('line', { x1: ox, y1: y, x2: ox + cw, y2: y, stroke: '#e5e7eb' });
          const t = addEl('text', { x: ox - 10, y: y + 4, 'text-anchor': 'end', 'font-size': '12px', fill: '#374151' });
          t.textContent = r.toFixed(1);
        }
        const xLbl = addEl('text', { x: ox + cw/2, y: oy + 35, 'text-anchor': 'middle', 'font-size':'12px', fill:'#6b7280' });
        xLbl.textContent = 'Quantized Integer Value (q)';
        const yLbl = addEl('text', { x: ox - 50, y: oy - ch/2, transform: `rotate(-90 ${ox - 50} ${oy - ch/2})`, 'text-anchor': 'middle', 'font-size':'12px', fill:'#6b7280' });
        yLbl.textContent = 'Float Value (r)';

        
        if (zeroToggle.checked) {
          const xz = qToX(Z);
          addEl('line', { x1: xz, y1: oy, x2: xz, y2: oy - ch, stroke: '#22c55e', 'stroke-width': 2, 'stroke-dasharray': '6 6' });
          const tz = addEl('text', { x: xz, y: oy - ch - 6, 'text-anchor': 'middle', 'font-size': '12px', fill: '#22c55e' });
          tz.textContent = 'Zero-point';

          const y0 = rToY(0);
          addEl('line', { x1: ox, y1: y0, x2: ox + cw, y2: y0, stroke: '#22c55e', 'stroke-width': 2, 'stroke-dasharray': '6 6' });
          const tr = addEl('text', { x: ox + cw + 4, y: y0 + 4, 'font-size': '12px', fill: '#22c55e' });
          tr.textContent = 'r = 0.0';
        }

        
        const path = [];
        for (let q = sc.qMin; q <= sc.qMax; q += Math.max(1, Math.floor((sc.qMax - sc.qMin) / 200))) {
          const r = S * (q - Z);
          path.push(`${path.length ? 'L' : 'M'} ${qToX(q)} ${rToY(r)}`);
        }
        addEl('path', { d: path.join(' '), fill: 'none', stroke: seriesColor(), 'stroke-width': 2 });

        
        bucketBar.innerHTML = '';
        const step = Math.ceil((sc.qMax - sc.qMin) / 50);
        for (let q = sc.qMin; q <= sc.qMax; q += step) {
          const rStart = S * (q - Z);
          const rEnd = S * (q + step - Z);
          const div = document.createElement('div');
          const isZero = (q <= Z && Z < q + step);
          div.className = 'qmap-bucket';
          div.style.background = isZero && zeroToggle.checked ? '#16a34a' : ( (Math.floor((q - sc.qMin)/step) % 2 === 0) ? '#60a5fa' : '#3b82f6');
          div.title = `q=${q}: [${rStart.toFixed(3)}, ${rEnd.toFixed(3)}]`;
          const lab = document.createElement('div');
          lab.className = 'qmap-bucketlabel';
          lab.textContent = q;
          div.appendChild(lab);
          bucketBar.appendChild(div);
        }
        rminEl.textContent = `r_min (${sc.rMin})`;
        rmaxEl.textContent = `r_max (${sc.rMax})`;
        zeroMark.textContent = zeroToggle.checked ? '0.0' : '';
      }

      buildButtons();
      render();
      if (zeroToggle) zeroToggle.addEventListener('change', render);
    })();
  </script>
</div>

<h3 id="symmetric-vs-asymmetric-quantization">Symmetric vs. Asymmetric Quantization</h3>
<p>The zero-point concept leads to two primary quantization schemes, each with distinct tradeoffs.</p>
<p><strong>Asymmetric (Affine) Quantization</strong> is the general form where the zero-point can be any integer in the quantized range. This scheme excels at quantizing data whose distribution is not centered around zero. A prime example is the output of a ReLU activation function, where all values are non-negative. Asymmetric quantization can map the range [0.0, 1000.0] to the full integer range, maximizing the use of available precision.</p>
<p><strong>Symmetric Quantization</strong> is a special case where the floating-point range is forced to be symmetric around zero (e.g., [-a, a]). This constraint ensures that floating-point 0.0 maps directly to integer 0, making the zero-point Z = 0. The primary advantage is computational efficiency—since Z = 0, the subtraction operation in the dequantization formula can be skipped, leading to faster execution on some hardware.</p>
<p>However, if the underlying data distribution is skewed (like after a ReLU), symmetric quantization can be wasteful, as half of the quantized range will go unused, effectively losing one bit of precision.</p>
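<p>To make the tradeoff concrete, here is a minimal sketch that computes scale and zero-point for both schemes on ReLU-like data in [0.0, 1000.0]. The function names and the choice of a uint8 grid (0..255) for the asymmetric case and an int8 grid (-127..127) for the symmetric case are assumptions made for illustration, not any particular library&rsquo;s API.</p>

```python
def asymmetric_params(r_min, r_max, q_min=0, q_max=255):
    """Affine scheme: scale S and integer zero-point Z so that r ~ S * (q - Z)."""
    # Widen the range to include 0.0 so that zero stays exactly representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_min - r_min / scale)
    return scale, zero_point


def symmetric_params(r_min, r_max, q_max=127):
    """Symmetric scheme: the range is forced to [-a, a], so Z = 0."""
    a = max(abs(r_min), abs(r_max))
    return a / q_max, 0


def quantize(r, scale, zero_point, q_min, q_max):
    q = round(r / scale) + zero_point
    return max(q_min, min(q_max, q))  # clamp to the integer grid


def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


# ReLU-like activations: every value lies in [0.0, 1000.0].
s_a, z_a = asymmetric_params(0.0, 1000.0)
s_s, z_s = symmetric_params(0.0, 1000.0)

# Count how many integer levels the data can actually reach in each scheme.
levels_asym = quantize(1000.0, s_a, z_a, 0, 255) - quantize(0.0, s_a, z_a, 0, 255) + 1
levels_sym = quantize(1000.0, s_s, z_s, -127, 127) - quantize(0.0, s_s, z_s, -127, 127) + 1

print(levels_asym, levels_sym)  # 256 usable levels vs. 128
```

<p>The symmetric grid only reaches its non-negative half here, which is the lost bit of precision described above.</p>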
<h3 id="the-quantization-timeline-ptq-vs-qat">The Quantization Timeline: PTQ vs. QAT</h3>
<p>Quantization methods are categorized not just by their mathematical scheme but by when they&rsquo;re applied in the model&rsquo;s lifecycle.</p>
<p><strong>Post-Training Quantization (PTQ)</strong> applies quantization to a model that has already been fully trained in high precision. The process typically involves passing a small &ldquo;calibration dataset&rdquo; (a few hundred representative examples) through the model to observe the ranges of its weights and activations. These observed ranges are then used to calculate the optimal scale and zero-point parameters for each tensor.</p>
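<p>The calibration step can be sketched as a simple min/max observer. This is an illustrative toy, assuming a hypothetical <code>MinMaxObserver</code> class and a tiny hand-written calibration set; production toolkits often use more robust range estimators than a raw min/max.</p>

```python
class MinMaxObserver:
    """Tracks the running min and max of a tensor across calibration batches."""

    def __init__(self):
        self.r_min = float("inf")
        self.r_max = float("-inf")

    def observe(self, values):
        self.r_min = min(self.r_min, min(values))
        self.r_max = max(self.r_max, max(values))

    def params(self, q_min=0, q_max=255):
        # Include 0.0 in the range so that zero maps to an exact integer.
        r_min, r_max = min(self.r_min, 0.0), max(self.r_max, 0.0)
        scale = (r_max - r_min) / (q_max - q_min)
        zero_point = round(q_min - r_min / scale)
        return scale, zero_point


# Three tiny "batches" of activations standing in for a calibration dataset.
calibration_batches = [[-0.5, 0.1, 1.2], [0.3, 2.0, -1.0], [0.0, 0.7, 1.5]]

obs = MinMaxObserver()
for batch in calibration_batches:
    obs.observe(batch)

scale, zero_point = obs.params()
print(obs.r_min, obs.r_max, scale, zero_point)
```

<p>With an observed range of [-1.0, 2.0], float 0.0 lands on integer 85 of the 0..255 grid.</p>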
<p>The advantages are compelling: PTQ is fast, simple, and computationally inexpensive. It doesn&rsquo;t require access to the original training pipeline or large datasets, making it highly accessible. However, because the model&rsquo;s weights were optimized for a high-precision environment, abruptly forcing them into a low-precision format can introduce significant &ldquo;quantization noise,&rdquo; leading to noticeable accuracy drops, especially at very low bit-widths (e.g., 4-bit).</p>
<p><strong>Quantization-Aware Training (QAT)</strong> simulates the effects of quantization during the training or fine-tuning process. It works by inserting &ldquo;fake quantization&rdquo; operations into the model&rsquo;s computation graph. In the forward pass, weights and activations are quantized and then immediately dequantized back to a floating-point format. This simulates the error that will be introduced during low-precision inference. Crucially, the backward pass computes gradients with respect to the original full-precision weights, typically through a straight-through estimator that treats the non-differentiable rounding step as the identity, allowing the model to learn parameters that are inherently robust to quantization effects.</p>
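<p>A single fake-quantized training step might look like the sketch below. It assumes a 4-bit symmetric grid, a one-weight toy model, and a straight-through estimator for the rounding step; the helper names <code>fake_quantize</code> and <code>ste_grad</code> are hypothetical.</p>

```python
def fake_quantize(w, scale, q_min=-8, q_max=7):
    """Quantize to a 4-bit grid, then dequantize straight back to float."""
    q = max(q_min, min(q_max, round(w / scale)))
    return q * scale


def ste_grad(w, upstream_grad, scale, q_min=-8, q_max=7):
    """Straight-through estimator: rounding is treated as the identity
    inside the clamp range, and the gradient is zeroed outside it."""
    inside = q_min * scale <= w <= q_max * scale
    return upstream_grad if inside else 0.0


# Toy loss L = 0.5 * (fake_quantize(w) * x - y)^2 with one weight.
w, x, y, scale, lr = 0.33, 2.0, 1.0, 0.1, 0.05

w_q = fake_quantize(w, scale)      # forward pass sees the quantized weight
pred = w_q * x
grad_wq = (pred - y) * x           # dL/dw_q
w_new = w - lr * ste_grad(w, grad_wq, scale)  # update the full-precision weight

print(w_q, w_new)  # about 0.3 and 0.37
```

<p>The forward pass uses the rounded value, while the update is applied to the full-precision weight that is kept alongside it.</p>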
<p>QAT almost always achieves higher accuracy than PTQ, often recovering nearly all of the original model&rsquo;s performance, even at aggressive quantization levels. However, it&rsquo;s a far more complex and computationally expensive process, requiring retraining or extensive fine-tuning with access to the training dataset and significant computational resources.</p>
<p>The choice between PTQ and QAT presents a fundamental dilemma for LLMs. The models that would benefit most from QAT&rsquo;s superior accuracy are the very ones for which the method is computationally and financially prohibitive. Fine-tuning a model with hundreds of billions of parameters can require terabytes of GPU memory once gradients and optimizer states are counted alongside the weights, making QAT impractical for all but the largest institutions. This has led to PTQ becoming the dominant paradigm for LLM quantization, &ldquo;not for its superiority but feasibility.&rdquo;</p>
<p>This critical gap—the need for QAT-level accuracy with PTQ-level efficiency—has been the primary driver behind the intense research and development of advanced PTQ algorithms like GPTQ.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Post-Training Quantization (PTQ)</th>
          <th>Quantization-Aware Training (QAT)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Workflow</td>
          <td>Quantize a fully trained model</td>
          <td>Simulate quantization during training/fine-tuning</td>
      </tr>
      <tr>
          <td>Computational Cost</td>
          <td>Low (calibration pass only)</td>
          <td>High (requires retraining)</td>
      </tr>
      <tr>
          <td>Data Requirement</td>
          <td>Small calibration dataset</td>
          <td>Full training dataset</td>
      </tr>
      <tr>
          <td>Typical Accuracy</td>
          <td>Good at 8-bit, may degrade at lower bit-widths</td>
          <td>Excellent, often near full-precision performance</td>
      </tr>
      <tr>
          <td>Best For</td>
          <td>Scenarios with limited resources, no access to training data, or when speed of deployment is critical</td>
          <td>Applications where maximizing accuracy is paramount and computational resources are available</td>
      </tr>
  </tbody>
</table>
<h3 id="the-accuracy-efficiency-frontier">The Accuracy-Efficiency Frontier</h3>
<p>The choice of numeric format involves subtle tradeoffs that extend beyond simple memory calculations. Accuracy degradation patterns differ across precisions, model architectures, and quantization methods, with certain failure modes appearing only at extreme compression.</p>
<p>From <strong>FP16 to INT8</strong>, degradation is often very small when using proper techniques. Reports on BLOOM‑176B show differences within typical measurement noise on many tasks, demonstrating that INT8 can be virtually lossless for large models when outlier features (e.g., activation dimensions whose magnitude exceeds a threshold of about 6) are handled separately in FP16.</p>
<p><strong>INT4 quantization</strong> shows noticeable but often acceptable degradation (commonly a few percentage points) depending on method and model. GPTQ frequently keeps 4‑bit perplexity deltas small on WikiText for large models, though exact results vary by setup. Modern techniques like AWQ can approach FP16 performance on certain tasks for specific models/configs. The difference between methods matters significantly—AWQ&rsquo;s activation‑aware approach outperforms naive rounding by protecting the most activation‑sensitive weights.</p>
<p>Dramatic failure often occurs at <strong>2 bits</strong> without specialized methods. Vanilla GPTQ typically fails at 2 bits, while methods like SpQR can make 2‑bit feasible by identifying and isolating a subset of weights as outliers (kept in higher precision) while quantizing the rest to 2 bits.</p>
<p>Model size affects quantization tolerance non-linearly. Larger models generally quantize better because individual weight errors average out across more parameters and layers. A 70B model is commonly reported to tolerate 4-bit quantization with roughly 96-99% accuracy recovery relative to its FP16 baseline, while smaller 7B models show more variability. Counter-intuitively, models trained on more data become harder to quantize: LLaMA 3&rsquo;s 15 trillion training tokens create more complex weight distributions than earlier models, increasing quantization sensitivity.</p>
<p>Memory savings follow predictable patterns but include important overhead. The rule of thumb <code>Memory ≈ Parameters × Bytes_per_parameter × 1.2</code> adds a rough 20% allowance for scale factors, activation tensors, and a short-context KV cache on top of the raw weights. For LLaMA‑70B: FP16 needs ~140–148GB, INT8 requires ~70–74GB (≈2× compression), while INT4 uses ~35–45GB (≈3.5× compression). These figures are dominated by the weights themselves (140GB, 70GB, and 35GB); applying the full 1.2 factor gives roughly 168GB, 84GB, and 42GB. The KV cache for attention adds substantial overhead at long context lengths: at 128K tokens an 8B model can consume on the order of tens of GB in FP16, sometimes exceeding the quantized model weights themselves.</p>
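<p>The arithmetic is easy to reproduce. The sketch below is a back-of-the-envelope estimator; the function names are hypothetical, and the 8B-class configuration used for the KV cache (32 layers, 8 KV heads, head dimension 128) is an assumed example rather than a measurement of any specific deployment.</p>

```python
def weight_memory_gb(n_params, bits_per_param, overhead=1.0):
    """Weight storage in GB, with an optional multiplicative overhead factor."""
    return n_params * bits_per_param / 8 * overhead / 1e9


def kv_cache_gb(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """KV cache size in GB: keys and values for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value / 1e9


# 70B parameters at FP16, INT8, and INT4: raw weights vs. the 1.2x rule of thumb.
for bits in (16, 8, 4):
    raw = weight_memory_gb(70e9, bits)
    padded = weight_memory_gb(70e9, bits, overhead=1.2)
    print(f"{bits:>2}-bit: {raw:6.1f} GB raw, {padded:6.1f} GB with 1.2x overhead")

# Assumed 8B-class model with grouped-query attention at a 128K-token context.
kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=128 * 1024)
print(f"KV cache at 128K tokens in FP16: {kv:.1f} GB")
```

<p>Under these assumptions the FP16 KV cache alone comes to about 17GB, several times the size of the same model&rsquo;s 4-bit weights.</p>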
<h2 id="part-3-gptq---when-second-order-thinking-meets-quantization">Part 3: GPTQ - When Second-Order Thinking Meets Quantization</h2>
<p>GPTQ (Generative Pre-trained Transformer Quantization) represents a breakthrough in post-training quantization. Published at ICLR 2023 by researchers from IST Austria and ETH Zurich, GPTQ enables 3-4 bit compression of 175B parameter models in approximately 4 GPU hours while maintaining negligible performance degradation.</p>
<h3 id="the-core-problem-and-gptqs-insight">The Core Problem and GPTQ&rsquo;s Insight</h3>
<p>The fundamental challenge with quantization is this: when you quantize a weight, you introduce error. Naively rounding weights to the nearest quantization level (Round-to-Nearest or RTN) performs acceptably at 8 bits but degrades sharply at 4 bits and below, and at 3 bits it can essentially destroy model capability.</p>
<p>GPTQ asks a different question: how can we quantize weights while compensating for the error by adjusting other weights to maintain the layer&rsquo;s output?</p>
<p>For each layer, GPTQ solves the optimization problem:</p>
<pre tabindex="0"><code>argmin_Ŵ ||WX - ŴX||²
</code></pre><p>where W is the original weight matrix, Ŵ is the quantized version, and X represents layer inputs from calibration data. This minimizes the squared difference between full-precision and quantized layer outputs rather than focusing on weight values themselves. The key realization: <strong>we care about preserving behavior (outputs) not parameter values</strong>.</p>
<p>This objective decomposes into independent row-wise problems since the squared Frobenius norm sums across rows. Processing each row separately reduces computational complexity dramatically while remaining theoretically sound.</p>
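<p>The row-wise decomposition can be checked numerically on a toy layer. The matrices below are made-up values chosen only for illustration, and <code>sq_error</code> is a hypothetical helper.</p>

```python
def matmul(A, B):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def sq_error(W, W_hat, X):
    """Squared Frobenius norm of W X - W_hat X."""
    Y, Y_hat = matmul(W, X), matmul(W_hat, X)
    return sum((y - yh) ** 2 for r, rh in zip(Y, Y_hat) for y, yh in zip(r, rh))


# Toy layer: 2 output rows, 3 input features, 4 calibration samples.
W = [[0.42, -1.30, 0.07],
     [0.95, 0.11, -0.58]]
W_hat = [[0.5, -1.5, 0.0],
         [1.0, 0.0, -0.5]]
X = [[1.0, 0.5, -0.2, 0.3],
     [0.1, -0.4, 0.9, 0.0],
     [0.7, 0.2, 0.3, -1.0]]

total = sq_error(W, W_hat, X)
per_row = [sq_error([w], [w_hat], X) for w, w_hat in zip(W, W_hat)]

# The layer-level error is just the sum of the independent per-row errors.
print(total, sum(per_row))
```

<p>Because each row contributes its own term, every row can be quantized against the same inputs X without looking at the others.</p>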
<h3 id="the-engine-optimal-brain-quantization-and-the-hessian">The Engine: Optimal Brain Quantization and the Hessian</h3>
<p>GPTQ&rsquo;s innovation is built upon Optimal Brain Quantization (OBQ), a 2022 method that adapts the classic Optimal Brain Surgeon pruning framework from the early 1990s to quantization. OBQ provides a principled way to quantize weights by using second-order information to guide the process.</p>
<p>This information is captured in the <strong>Hessian matrix</strong> H = 2XX^T (in practice with a small damping term λI added), which contains the second derivatives of the layer&rsquo;s squared output error with respect to the weights in a row. The constant factor of 2 cancels in the update formulas, so it is often dropped. Intuitively, the Hessian describes the &ldquo;curvature&rdquo; of the loss landscape. A sharp curve in a particular direction (a large corresponding value in the Hessian) means the loss is highly sensitive to changes in that weight. Conversely, a flat curve (a small Hessian value) indicates that the weight can be changed with little impact on the loss.</p>
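<p>As a small illustration, the sketch below builds H = 2XX^T for a toy input matrix and compares the second-order estimate ½·H<sub>ii</sub>·δ² with the exact rise in squared output error when a single weight moves by δ. The input values and function names are assumptions chosen so that one input feature has much larger activations than the other.</p>

```python
def hessian(X, damp=0.0):
    """H = 2 X X^T + damp * I, where rows of X are input features
    and columns are calibration samples."""
    d = len(X)
    H = [[2.0 * sum(a * b for a, b in zip(X[i], X[j])) for j in range(d)]
         for i in range(d)]
    for i in range(d):
        H[i][i] += damp
    return H


def exact_increase(X, i, delta):
    """Exact rise in squared output error when only weight i moves by delta."""
    return sum((delta * x) ** 2 for x in X[i])


X = [[3.0, -2.0, 4.0, 1.0],   # feature 0: large activations
     [0.1, 0.2, -0.1, 0.0]]   # feature 1: small activations

H = hessian(X)
delta = 0.05
quadratic = [0.5 * H[i][i] * delta ** 2 for i in range(2)]
exact = [exact_increase(X, i, delta) for i in range(2)]

print(quadratic)  # about [0.075, 0.00015]
print(exact)      # matches, because the layer objective is exactly quadratic
```

<p>The same perturbation costs about 500 times more on the weight attached to the high-activation feature, which is the sensitivity the Hessian diagonal encodes.</p>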
<div class="hessviz-container">
  <style>
    .hessviz-container { background:#fff; border-radius:16px; padding:20px; box-shadow:0 12px 30px rgba(0,0,0,0.06); margin:32px auto; max-width:1100px; font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,'Helvetica Neue',Arial,sans-serif; }
    .hessviz-title { font-size:24px; font-weight:700; color:#111827; margin:0 0 6px 0; text-align:center; }
    .hessviz-subtitle { font-size:14px; color:#6b7280; text-align:center; margin:0 0 16px 0; }

    .hessviz-grid { display:grid; grid-template-columns:1fr; gap:12px; }
    .hessviz-selector { display:grid; grid-template-columns:1fr; gap:8px; }
    @media (min-width: 768px) { .hessviz-selector { grid-template-columns:repeat(3,minmax(0,1fr)); } }
    .hessviz-btn { padding:12px; border-radius:10px; border:1px solid #e5e7eb; background:#f3f4f6; color:#374151; text-align:left; font-weight:600; font-size:13px; cursor:pointer; transition:.15s; }
    .hessviz-btn:hover { background:#e5e7eb; }
    .hessviz-btn.active { background:#2563eb; color:#fff; border-color:#2563eb; }

    .hessviz-row { background:#f9fafb; border:1px solid #e5e7eb; border-radius:12px; padding:12px; }
    .hessviz-flex { display:flex; justify-content:space-between; align-items:center; gap:10px; }
    .hessviz-check { display:flex; align-items:center; gap:8px; font-size:13px; color:#374151; }

    .hessviz-svg { width:100%; height:360px; background:#ffffff; border-radius:10px; }

    .hessviz-cards { display:grid; grid-template-columns:1fr 1fr; gap:10px; margin-top:12px; }
    .hessviz-card { background:#ffffff; border:1px solid #e5e7eb; border-radius:12px; padding:12px; font-size:13px; color:#374151; }

    .hessviz-lines { background:#f9fafb; border:1px solid #e5e7eb; border-radius:12px; padding:12px; margin-top:12px; }

    .hessviz-insights { background:#f3f4f6; border:1px solid #e5e7eb; border-radius:12px; padding:12px; margin-top:12px; }
    .hessviz-insights ul { margin:0; padding-left:18px; font-size:13px; color:#1f2937; }
  </style>

  <div class="hessviz-header">
    <h3 class="hessviz-title">GPTQ: Understanding the Hessian Through Loss Curvature</h3>
    <p class="hessviz-subtitle">The Hessian (H = 2XX<sup>T</sup>) captures second-order information about loss sensitivity</p>
  </div>

  <div class="hessviz-grid">
    <div>
      <div class="hessviz-selector" id="hessvizButtons-d2ebbad0e656c085c5ea2a3a6e1b91c9"></div>
    </div>

    <div class="hessviz-row">
      <div class="hessviz-flex" style="margin-bottom:8px;">
        <h4 style="margin:0; font-size:14px; color:#111827;">Loss Landscape Cross-Section</h4>
        <label class="hessviz-check">
          <input type="checkbox" id="hessvizToggle-d2ebbad0e656c085c5ea2a3a6e1b91c9" />
          Show quantization impact
        </label>
      </div>
      <svg id="hessvizSvg-d2ebbad0e656c085c5ea2a3a6e1b91c9" class="hessviz-svg" viewBox="0 0 920 360" preserveAspectRatio="xMidYMid meet"></svg>

      <div class="hessviz-cards">
        <div class="hessviz-card">
          <div style="font-weight:600; color:#111827; margin-bottom:6px;">Curvature</div>
          <div id="hessvizCurv-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="font-size:22px; font-weight:800;">0×</div>
          <div style="font-size:12px; color:#6b7280;">Second derivative (Hessian diagonal)</div>
        </div>
        <div class="hessviz-card" id="hessvizErrCard-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="display:none; background:#fef2f2; border-color:#fecaca;">
          <div style="font-weight:600; color:#111827; margin-bottom:6px;">Error Impact</div>
          <div id="hessvizErr-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="font-size:22px; font-weight:800; color:#dc2626;">+0.000</div>
          <div style="font-size:12px; color:#6b7280;">Loss increase from quantization</div>
        </div>
      </div>
    </div>

    <div class="hessviz-lines">
      <h4 style="margin:0 0 8px 0; font-size:14px; color:#111827;">Comparing All Weight Sensitivities</h4>
      <svg id="hessvizMulti-d2ebbad0e656c085c5ea2a3a6e1b91c9" class="hessviz-svg" viewBox="0 0 920 300" preserveAspectRatio="xMidYMid meet"></svg>
      <div style="margin-top:8px; font-size:13px; color:#374151;">
        <strong>Key observation:</strong> The same weight perturbation causes much larger loss increases when curvature is high.
      </div>
    </div>

    <div class="hessviz-row">
      <div style="display:grid; grid-template-columns:1fr 1fr; gap:12px;">
        <div class="hessviz-card" style="background:#faf5ff; border-color:#ddd6fe;">
          <div style="font-weight:600; color:#111827; margin-bottom:6px;">Mathematical Interpretation</div>
          <div style="font-size:13px; color:#374151;">
            <div>Hessian diagonal value:</div>
            <div id="hessvizDiag-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, 'Liberation Mono', 'Courier New', monospace; font-size:16px; color:#7c3aed; margin-top:6px;">H<sub>ii</sub></div>
            <div style="margin-top:10px;">Loss approximation:</div>
            <div class="hessviz-mono" style="font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, 'Liberation Mono', 'Courier New', monospace; font-size:12px; background:#fff; border:1px solid #e5e7eb; border-radius:8px; padding:8px; margin-top:6px;">L(w + δ) ≈ L(w) + ½ H<sub>ii</sub> δ²</div>
            <div style="margin-top:8px; font-size:12px; color:#6b7280;">Higher H<sub>ii</sub> means higher curvature and more sensitivity.</div>
          </div>
        </div>
        <div class="hessviz-card" style="background:#eff6ff; border-color:#bfdbfe;">
          <div style="font-weight:700; color:#1d4ed8; margin-bottom:8px;">GPTQ Strategy</div>
          <div style="font-size:13px; color:#374151;">
            <div id="hessvizStrat-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="padding:8px; border:2px solid #93c5fd; border-radius:8px; background:#fff;">Quantize FIRST</div>
            <div style="margin-top:8px; font-size:12px; color:#6b7280;">Use H<sup>-1</sup> to find low-sensitivity weights; compensate updates using H<sup>-1</sup>.</div>
          </div>
        </div>
      </div>
    </div>

    <div class="hessviz-insights">
      <h4 style="margin:0 0 8px 0; font-size:14px; color:#111827;">Key Insights</h4>
      <ul>
        <li><strong>First-order (gradient):</strong> direction of steepest descent</li>
        <li><strong>Second-order (Hessian):</strong> curvature (steepness)</li>
        <li><strong>High curvature:</strong> small changes → large loss increase</li>
        <li><strong>Low curvature:</strong> changes have smaller effect</li>
        <li><strong>GPTQ:</strong> leverage H<sup>-1</sup> to pick low-sensitivity weights first and compensate error</li>
      </ul>
    </div>
  </div>

  <script>
    (function(){
      const uid = 'd2ebbad0e656c085c5ea2a3a6e1b91c9';
      const scenarios = {
        high: { name:'High Curvature (High Hessian)', curvature:50, color:'#ef4444', strat:'Quantize LAST' },
        medium: { name:'Medium Curvature (Medium Hessian)', curvature:10, color:'#f97316', strat:'Quantize MIDDLE' },
        low: { name:'Low Curvature (Low Hessian)', curvature:1, color:'#22c55e', strat:'Quantize FIRST' }
      };
      let selected = 'high';

      const btnWrap = document.getElementById('hessvizButtons-' + uid);
      const svg = document.getElementById('hessvizSvg-' + uid);
      const multi = document.getElementById('hessvizMulti-' + uid);
      const toggle = document.getElementById('hessvizToggle-' + uid);
      const curvEl = document.getElementById('hessvizCurv-' + uid);
      const errCard = document.getElementById('hessvizErrCard-' + uid);
      const errEl = document.getElementById('hessvizErr-' + uid);
      const diagEl = document.getElementById('hessvizDiag-' + uid);
      const stratEl = document.getElementById('hessvizStrat-' + uid);

      function buildButtons(){
        btnWrap.innerHTML='';
        Object.entries(scenarios).forEach(([key, sc])=>{
          const b = document.createElement('button');
          b.className = 'hessviz-btn' + (selected===key ? ' active' : '');
          b.innerHTML = '<div style="font-weight:700">' + sc.name + '</div><div style="font-size:12px; opacity:.85; margin-top:4px;">' + (key==='high'?'Sharp V-shaped valley - very sensitive':'low'===key?'Flat bowl - insensitive':'Moderate curvature - some sensitivity') + '</div>';
          b.onclick = ()=>{ selected = key; render(); buildButtons(); };
          btnWrap.appendChild(b);
        });
      }

      function addEl(tag, attrs = {}, parent = svg) {
        const el = document.createElementNS('http://www.w3.org/2000/svg', tag);
        Object.entries(attrs).forEach(([k, v]) => el.setAttribute(k, v));
        parent.appendChild(el);
        return el;
      }
      function clearSvg(s){ while (s.firstChild) s.removeChild(s.firstChild); }

      function render(){
        const sc = scenarios[selected];
        const range = 0.5, steps = 120, center = 0;
        const S = 0.15; 

        
        curvEl.textContent = sc.curvature + '×';
        stratEl.textContent = sc.strat;
        diagEl.innerHTML = (selected==='high'?'Large ':'') + 'H<sub>ii</sub>';

        
        clearSvg(svg);
        const margin = { top: 10, right: 20, bottom: 40, left: 60 };
        const width = 920, height = 360;
        const cw = width - margin.left - margin.right;
        const ch = height - margin.top - margin.bottom;
        const ox = margin.left, oy = height - margin.bottom;
        const xToX = (x) => ox + (x - (-range)) / (2*range) * cw;
        const yToY = (y) => oy - (y - 0) / (sc.curvature * range * range) * ch; 

        
        addEl('line', { x1: ox, y1: oy, x2: ox + cw, y2: oy, stroke: '#9ca3af' });
        addEl('line', { x1: ox, y1: oy, x2: ox, y2: oy - ch, stroke: '#9ca3af' });
        
        for (let i=0;i<=8;i++){ const w=-range + 2*range*i/8; const x=xToX(w); addEl('line',{x1:x,y1:oy,x2:x,y2:oy+6,stroke:'#9ca3af'}); const t=addEl('text',{x,y:oy+20,'text-anchor':'middle','font-size':'12px',fill:'#374151'}); t.textContent=w.toFixed(2);}        
        for (let i=0;i<=6;i++){ const y=sc.curvature*(range*i/6)*(range*i/6); const yy=yToY(y); addEl('line',{x1:ox-6,y1:yy,x2:ox,y2:yy,stroke:'#9ca3af'}); addEl('line',{x1:ox,y1:yy,x2:ox+cw,y2:yy,stroke:'#e5e7eb'}); const t=addEl('text',{x:ox-10,y:yy+4,'text-anchor':'end','font-size':'12px',fill:'#374151'}); t.textContent=y.toFixed(2);}        
        const xLbl = addEl('text', { x: ox + cw/2, y: oy + 35, 'text-anchor': 'middle', 'font-size':'12px', fill:'#6b7280' });
        xLbl.textContent = 'Weight Value (w)';
        const yLbl = addEl('text', { x: ox - 50, y: oy - ch/2, transform: `rotate(-90 ${ox - 50} ${oy - ch/2})`, 'text-anchor': 'middle', 'font-size':'12px', fill:'#6b7280' });
        yLbl.textContent = 'Loss (L)';

        
        const path = [];
        for (let i=0;i<=steps;i++){ const w = center - range + (2*range*i)/steps; const L = sc.curvature*(w-center)*(w-center); path.push(`${i? 'L':'M'} ${xToX(w)} ${yToY(L)}`); }
        addEl('path', { d: path.join(' '), fill:'none', stroke: sc.color, 'stroke-width': 3 });

        
        addEl('line', { x1: xToX(0), y1: oy, x2: xToX(0), y2: oy - ch, stroke: '#22c55e', 'stroke-width':2, 'stroke-dasharray':'6 6' });
        const optTxt = addEl('text', { x: xToX(0), y: oy - ch - 6, 'text-anchor':'middle', 'font-size':'12px', fill:'#22c55e' });
        optTxt.textContent = 'Optimal';

        
        if (toggle && toggle.checked){
          const wq = S; 
          const Lq = sc.curvature * (wq-center) * (wq-center);
          errCard.style.display = '';
          errEl.textContent = '+' + Lq.toFixed(3);
          addEl('line', { x1: xToX(wq), y1: oy, x2: xToX(wq), y2: oy - ch, stroke: '#dc2626', 'stroke-width':2, 'stroke-dasharray':'3 3' });
          const qTxt = addEl('text', { x: xToX(wq), y: oy - ch - 6, 'text-anchor':'middle', 'font-size':'11px', fill:'#dc2626' });
          qTxt.textContent = 'After Quantization';
        } else {
          errCard.style.display = 'none';
        }

        
        clearSvg(multi);
        const margin2 = { top: 10, right: 20, bottom: 40, left: 60 };
        const width2 = 920, height2 = 300;
        const cw2 = width2 - margin2.left - margin2.right;
        const ch2 = height2 - margin2.top - margin2.bottom;
        const ox2 = margin2.left, oy2 = height2 - margin2.bottom;
        const range2 = 0.3; const steps2 = 120;
        const x2 = (w)=> ox2 + (w-(-range2))/(2*range2)*cw2;
        const y2 = (L, curvMax)=> oy2 - (L)/(curvMax*range2*range2)*ch2;

        
        const add2 = (tag,attrs)=>{ return addEl.call(null, tag, attrs, multi); };
        add2('line', { x1: ox2, y1: oy2, x2: ox2 + cw2, y2: oy2, stroke: '#9ca3af' });
        add2('line', { x1: ox2, y1: oy2, x2: ox2, y2: oy2 - ch2, stroke: '#9ca3af' });
        for (let i=0;i<=8;i++){ const w=-range2 + 2*range2*i/8; const x=x2(w); add2('line',{x1:x,y1:oy2,x2:x,y2:oy2+6,stroke:'#9ca3af'}); const t=add2('text',{x,y:oy2+20,'text-anchor':'middle','font-size':'12px',fill:'#374151'}); t.textContent=w.toFixed(2);}        
        for (let i=0;i<=6;i++){ const y=50*(range2*i/6)*(range2*i/6); const yy=y2(y,50); add2('line',{x1:ox2-6,y1:yy,x2:ox2,y2:yy,stroke:'#9ca3af'}); add2('line',{x1:ox2,y1:yy,x2:ox2+cw2,y2:yy,stroke:'#e5e7eb'}); const t=add2('text',{x:ox2-10,y:yy+4,'text-anchor':'end','font-size':'12px',fill:'#374151'}); t.textContent=y.toFixed(2);}        
        const xLbl2 = add2('text', { x: ox2 + cw2/2, y: oy2 + 35, 'text-anchor': 'middle', 'font-size':'12px', fill:'#6b7280' });
        xLbl2.textContent = 'Weight Deviation from Optimal';

        
        function drawCurve(curv, color){
          const p = []; for (let i=0;i<=steps2;i++){ const w=-range2 + (2*range2*i)/steps2; const L=curv*w*w; p.push(`${i?'L':'M'} ${x2(w)} ${y2(L,50)}`); }
          add2('path',{ d:p.join(' '), fill:'none', stroke:color, 'stroke-width':3 });
        }
        drawCurve(50, '#ef4444'); 
        drawCurve(10, '#f97316'); 
        drawCurve(1, '#22c55e'); 
        add2('line', { x1: x2(0), y1: oy2, x2: x2(0), y2: oy2 - ch2, stroke: '#22c55e', 'stroke-width':2, 'stroke-dasharray':'6 6' });
      }

      buildButtons();
      render();
      if (toggle) toggle.addEventListener('change', render);
    })();
  </script>
</div>

<p>GPTQ uses the <strong>inverse</strong> of the Hessian matrix H⁻¹ = (2XX^T + λI)⁻¹, where the damping term λ (typically about 1% of the mean diagonal value) prevents numerical instability. The OBQ algorithm quantizes weights one by one. At each step, it must decide which weight to quantize next. The optimal choice is the one that will cause the smallest increase in the layer&rsquo;s output error; for weight q that increase is proportional to (w_q - quant(w_q))² / [H⁻¹]_qq, so the decision is guided by the diagonal entries of the inverse Hessian matrix.</p>
<h3 id="the-algorithm-error-compensation-in-action">The Algorithm: Error Compensation in Action</h3>
<p>The core of the OBQ method, and by extension GPTQ, is an iterative process of error compensation within each layer:</p>
<ol>
<li><strong>Select &amp; Quantize</strong>: A single weight is chosen and quantized (e.g., rounded to the nearest 4-bit representable value)</li>
<li><strong>Measure Error</strong>: The algorithm calculates the error introduced by this rounding step</li>
<li><strong>Compensate</strong>: This is the crucial step. The algorithm updates all the other, not-yet-quantized full-precision weights in the layer to compensate for the error just introduced. This update is not uniform; it is scaled by the inverse Hessian, which directs the correction towards related but less sensitive weights that can absorb the error with minimal impact on the layer&rsquo;s output</li>
</ol>
<p>After quantizing weight w_q at position q to its nearest grid point, the quantization error must be compensated. The update formula:</p>
<pre tabindex="0"><code>δ = -[(w_q - quant(w_q)) / [H⁻¹]_qq] · (H⁻¹)_:,q
</code></pre><p>redistributes this error across remaining unquantized weights, minimizing impact on layer output.</p>
<p>This iterative compensation is what makes GPTQ so accurate. It doesn&rsquo;t just round weights independently; it actively and intelligently corrects for the rounding error at every single step, ensuring the final quantized layer behaves as closely as possible to the original.</p>
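<p>The loop can be sketched for a single weight row as follows. This is a toy under stated assumptions, not the reference implementation: it uses a fixed left-to-right order, a uniform grid with step 1.0, no damping (so the numbers can be checked by hand), and hypothetical helper names throughout.</p>

```python
def invert(A):
    """Gauss-Jordan inverse for a small positive-definite matrix (no pivoting)."""
    n = len(A)
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = M[c][c]
        M[c] = [v / p for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def quant(w, step):
    """Round to the nearest point on a uniform grid."""
    return round(w / step) * step


def output_error(w, w_hat, X):
    """Squared error between w X and w_hat X for one weight row."""
    return sum(
        sum((a - b) * X[i][k] for i, (a, b) in enumerate(zip(w, w_hat))) ** 2
        for k in range(len(X[0]))
    )


def quantize_row_with_compensation(w, X, step, damp=0.0):
    d = len(w)
    H = [[2.0 * sum(a * b for a, b in zip(X[i], X[j])) for j in range(d)]
         for i in range(d)]
    for i in range(d):
        H[i][i] += damp
    Hinv = invert(H)
    w = list(w)
    for q in range(d):
        wq_hat = quant(w[q], step)
        err = (w[q] - wq_hat) / Hinv[q][q]
        # delta = -err * Hinv[:, q], applied to the not-yet-quantized weights.
        for j in range(q + 1, d):
            w[j] -= err * Hinv[j][q]
        w[q] = wq_hat
        # Drop row/column q from the inverse: one step of Gaussian elimination.
        hq, col, row = Hinv[q][q], [r[q] for r in Hinv], list(Hinv[q])
        for i in range(d):
            for j in range(d):
                Hinv[i][j] -= col[i] * row[j] / hq
    return w


X = [[1.0, 0.0],   # rows: input features, columns: calibration samples
     [1.0, 1.0]]
w = [0.4, 0.4]

w_rtn = [quant(v, 1.0) for v in w]                       # [0.0, 0.0]
w_comp = quantize_row_with_compensation(w, X, step=1.0)  # [0.0, 1.0]
rtn_err = output_error(w, w_rtn, X)                      # about 0.8
comp_err = output_error(w, w_comp, X)                    # about 0.4
print(w_rtn, rtn_err, w_comp, comp_err)
```

<p>In this toy case, rounding the first weight down pushes the second weight from 0.4 to 0.6, so it rounds up instead of down and the output error is halved. The procedure is greedy, so an improvement over plain rounding is typical rather than guaranteed for every input.</p>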
<p>After each quantization, the Hessian inverse must be updated by removing the quantized weight&rsquo;s row and column. Gaussian elimination provides the update, but this accumulates numerical error. GPTQ&rsquo;s solution uses <strong>Cholesky decomposition</strong> to precompute all required Hessian information in a numerically stable manner, preventing error accumulation that would otherwise corrupt billion-parameter models.</p>
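<p>One way to see why a Cholesky factor can stand in for the repeated updates: when weights are processed in a fixed order, the row needed at each step (row q of the current reduced inverse, scaled by the square root of its diagonal entry) is exactly row q of the upper-triangular Cholesky factor of H⁻¹. The sketch below checks this on a small matrix; the 3×3 values are an assumed stand-in for an inverse Hessian, and the function names are hypothetical.</p>

```python
import math


def cholesky_upper(A):
    """Upper-triangular U with A = U^T U, for a small positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return [[L[j][i] for j in range(n)] for i in range(n)]  # transpose of L


def elimination_rows(Hinv):
    """Row q of each successively reduced inverse, scaled by 1/sqrt(diagonal)."""
    d = len(Hinv)
    Hinv = [list(r) for r in Hinv]
    rows = []
    for q in range(d):
        hq = Hinv[q][q]
        rows.append([v / math.sqrt(hq) for v in Hinv[q]])
        col, row = [r[q] for r in Hinv], list(Hinv[q])
        for i in range(d):
            for j in range(d):
                Hinv[i][j] -= col[i] * row[j] / hq
    return rows


# Assumed example of a symmetric positive-definite inverse Hessian.
Hinv = [[4.0, 2.0, -2.0],
        [2.0, 5.0, 1.0],
        [-2.0, 1.0, 6.0]]

U = cholesky_upper(Hinv)
rows = elimination_rows(Hinv)

print(U)     # [[2, 1, -1], [0, 2, 1], [0, 0, 2]] up to float formatting
print(rows)  # the same rows, obtained by step-by-step elimination
```

<p>Computing the factor once up front yields every row the compensation step needs, without repeatedly modifying the inverse in place.</p>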
<div class="gptq-error-viz">
  <style>
    .gptq-error-viz {
      background: white;
      border-radius: 12px;
      padding: 24px;
      box-shadow: 0 4px 12px rgba(0,0,0,0.08);
      margin: 24px auto;
      max-width: 800px;
      font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
    }
    .gptq-canvas {
      display: block;
      margin: 0 auto;
      border: 1px solid #e0e0e0;
      border-radius: 8px;
      cursor: pointer;
      max-width: 100%;
      height: auto;
    }
    .gptq-controls {
      margin-top: 20px;
      display: flex;
      justify-content: center;
      gap: 12px;
      flex-wrap: wrap;
    }
    .gptq-btn {
      padding: 10px 20px;
      font-size: 14px;
      border: none;
      border-radius: 6px;
      cursor: pointer;
      transition: all 0.2s;
      font-weight: 500;
    }
    .gptq-btn.primary {
      background: #2563eb;
      color: white;
    }
    .gptq-btn.primary:hover {
      background: #1d4ed8;
    }
    .gptq-btn.secondary {
      background: #e5e7eb;
      color: #374151;
    }
    .gptq-btn.secondary:hover {
      background: #d1d5db;
    }
    .gptq-legend {
      margin-top: 20px;
      display: grid;
      grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
      gap: 12px;
    }
    .gptq-legend-item {
      display: flex;
      align-items: center;
      gap: 8px;
      font-size: 13px;
      color: #4b5563;
    }
    .gptq-legend-color {
      width: 16px;
      height: 16px;
      border-radius: 3px;
      border: 1px solid #d1d5db;
    }
    .gptq-info {
      margin-top: 16px;
      padding: 12px;
      background: #f0f9ff;
      border-left: 3px solid #2563eb;
      border-radius: 4px;
      font-size: 14px;
      color: #1e40af;
    }
  </style>

  <canvas id="gptqCanvas-d2ebbad0e656c085c5ea2a3a6e1b91c9" class="gptq-canvas" width="700" height="500"></canvas>
  
  <div class="gptq-controls">
    <button class="gptq-btn primary" id="gptqAnimateBtn-d2ebbad0e656c085c5ea2a3a6e1b91c9">Animate Quantization</button>
    <button class="gptq-btn secondary" id="gptqResetBtn-d2ebbad0e656c085c5ea2a3a6e1b91c9">Reset</button>
  </div>
  
  <div class="gptq-legend">
    <div class="gptq-legend-item">
      <div class="gptq-legend-color" style="background: #ef4444;"></div>
      <span>Original weight position</span>
    </div>
    <div class="gptq-legend-item">
      <div class="gptq-legend-color" style="background: #8b5cf6;"></div>
      <span>Quantized weight (error introduced)</span>
    </div>
    <div class="gptq-legend-item">
      <div class="gptq-legend-color" style="background: #10b981;"></div>
      <span>Compensated position</span>
    </div>
    <div class="gptq-legend-item">
      <div class="gptq-legend-color" style="background: #f59e0b; opacity: 0.3;"></div>
      <span>Loss landscape contours</span>
    </div>
  </div>
  
  <div class="gptq-info">
    <strong>Key Insight:</strong> The Hessian's inverse reveals which directions in weight space have flat loss (can absorb error) vs. steep loss (sensitive). GPTQ compensates quantization error by updating other weights along the flattest directions, minimizing the impact on model output.
  </div>

  <script>
    (function() {
      const uid = 'd2ebbad0e656c085c5ea2a3a6e1b91c9';
      const canvas = document.getElementById('gptqCanvas-' + uid);
      const ctx = canvas.getContext('2d');
      const animateBtn = document.getElementById('gptqAnimateBtn-' + uid);
      const resetBtn = document.getElementById('gptqResetBtn-' + uid);
      
      let animating = false;
      let animationFrame = 0;
      const maxFrames = 120;
      
      
      const centerX = canvas.width / 2;
      const centerY = canvas.height / 2;
      
      
      const origX = centerX - 80;
      const origY = centerY + 60;
      
      
      const quantX = origX + 45;
      const quantY = origY - 35;
      
      
      const compX = origX + 15;
      const compY = origY - 8;
      
      function drawEllipse(cx, cy, rx, ry, rotation, color, lineWidth = 2) {
        ctx.save();
        ctx.translate(cx, cy);
        ctx.rotate(rotation);
        ctx.beginPath();
        ctx.ellipse(0, 0, rx, ry, 0, 0, 2 * Math.PI);
        ctx.strokeStyle = color;
        ctx.lineWidth = lineWidth;
        ctx.stroke();
        ctx.restore();
      }
      
      function drawLossLandscape() {
        
        const rotation = -Math.PI / 6;
        
        ctx.globalAlpha = 0.15;
        for (let i = 5; i >= 1; i--) {
          const scale = i * 1.3;
          drawEllipse(centerX, centerY, 40 * scale, 80 * scale, rotation, '#f59e0b', 2);
        }
        ctx.globalAlpha = 1.0;
        
        
        ctx.save();
        ctx.translate(centerX, centerY);
        ctx.rotate(rotation);
        
        
        ctx.strokeStyle = '#dc2626';
        ctx.lineWidth = 2;
        ctx.setLineDash([5, 5]);
        ctx.beginPath();
        ctx.moveTo(-120, 0);
        ctx.lineTo(120, 0);
        ctx.stroke();
        
        
        ctx.strokeStyle = '#059669';
        ctx.lineWidth = 2;
        ctx.beginPath();
        ctx.moveTo(0, -140);
        ctx.lineTo(0, 140);
        ctx.stroke();
        
        ctx.setLineDash([]);
        ctx.restore();
        
        
        ctx.save();
        ctx.font = '12px sans-serif';
        ctx.fillStyle = '#dc2626';
        ctx.fillText('Steep (low H⁻¹ᵢᵢ)', centerX + 125, centerY - 75);
        ctx.fillStyle = '#059669';
        ctx.fillText('Flat (high H⁻¹ᵢᵢ)', centerX - 80, centerY - 145);
        ctx.restore();
      }
      
      function drawPoint(x, y, color, label, size = 6) {
        ctx.beginPath();
        ctx.arc(x, y, size, 0, 2 * Math.PI);
        ctx.fillStyle = color;
        ctx.fill();
        ctx.strokeStyle = 'white';
        ctx.lineWidth = 2;
        ctx.stroke();
        
        if (label) {
          ctx.font = 'bold 13px sans-serif';
          ctx.fillStyle = color;
          ctx.fillText(label, x + 12, y + 5);
        }
      }
      
      function drawArrow(fromX, fromY, toX, toY, color, label) {
        const headlen = 10;
        const angle = Math.atan2(toY - fromY, toX - fromX);
        
        ctx.strokeStyle = color;
        ctx.fillStyle = color;
        ctx.lineWidth = 2;
        
        
        ctx.beginPath();
        ctx.moveTo(fromX, fromY);
        ctx.lineTo(toX, toY);
        ctx.stroke();
        
        
        ctx.beginPath();
        ctx.moveTo(toX, toY);
        ctx.lineTo(toX - headlen * Math.cos(angle - Math.PI / 6),
                  toY - headlen * Math.sin(angle - Math.PI / 6));
        ctx.lineTo(toX - headlen * Math.cos(angle + Math.PI / 6),
                  toY - headlen * Math.sin(angle + Math.PI / 6));
        ctx.closePath();
        ctx.fill();
        
        
        if (label) {
          const midX = (fromX + toX) / 2;
          const midY = (fromY + toY) / 2;
          ctx.font = '11px sans-serif';
          ctx.fillStyle = color;
          ctx.fillText(label, midX + 5, midY - 5);
        }
      }
      
      function draw(frame = 0) {
        ctx.clearRect(0, 0, canvas.width, canvas.height);
        
        
        ctx.font = 'bold 16px sans-serif';
        ctx.fillStyle = '#1f2937';
        ctx.fillText('GPTQ: Hessian-Guided Error Compensation', 20, 30);
        
        ctx.font = '13px sans-serif';
        ctx.fillStyle = '#6b7280';
        ctx.fillText('Loss landscape curvature determines weight sensitivity', 20, 50);
        
        drawLossLandscape();
        
        if (frame === 0) {
          
          drawPoint(origX, origY, '#ef4444', 'wᵢ (original)');
          
        } else if (frame < 30) {
          
          const t = frame / 30;
          const x = origX + (quantX - origX) * t;
          const y = origY + (quantY - origY) * t;
          
          drawPoint(origX, origY, '#ef4444', 'wᵢ (original)', 4);
          drawPoint(x, y, '#8b5cf6', null);
          
          if (t === 1) {
            drawPoint(quantX, quantY, '#8b5cf6', 'quant(wᵢ)');
          }
          
        } else if (frame < 60) {
          
          drawPoint(origX, origY, '#ef4444', 'wᵢ', 4);
          drawPoint(quantX, quantY, '#8b5cf6', 'quant(wᵢ)');
          drawArrow(origX, origY, quantX, quantY, '#8b5cf6', 'error');
          
        } else if (frame < 90) {
          
          const t = (frame - 60) / 30;
          const x = quantX + (compX - quantX) * t;
          const y = quantY + (compY - quantY) * t;
          
          drawPoint(origX, origY, '#ef4444', 'wᵢ', 4);
          drawPoint(quantX, quantY, '#8b5cf6', null, 4);
          drawPoint(x, y, '#10b981', null);
          
          ctx.globalAlpha = 0.3;
          drawArrow(origX, origY, quantX, quantY, '#8b5cf6', null);
          ctx.globalAlpha = 1.0;
          
        } else {
          
          drawPoint(origX, origY, '#ef4444', 'wᵢ', 4);
          drawPoint(quantX, quantY, '#8b5cf6', null, 4);
          drawPoint(compX, compY, '#10b981', 'wᵢ + δ (compensated)');
          
          ctx.globalAlpha = 0.3;
          drawArrow(origX, origY, quantX, quantY, '#8b5cf6', null);
          ctx.globalAlpha = 1.0;
          
          drawArrow(quantX, quantY, compX, compY, '#10b981', 'H⁻¹-guided update');
          
          
          ctx.strokeStyle = '#10b981';
          ctx.lineWidth = 3;
          ctx.setLineDash([3, 3]);
          ctx.beginPath();
          ctx.arc(centerX, centerY, Math.hypot(compX - centerX, compY - centerY), 0, 2 * Math.PI);
          ctx.stroke();
          ctx.setLineDash([]);
        }
      }
      
      function animate() {
        if (!animating) return;
        
        draw(animationFrame);
        animationFrame++;
        
        if (animationFrame >= maxFrames) {
          animating = false;
          animationFrame = maxFrames - 1;
          animateBtn.textContent = 'Replay Animation';
        } else {
          requestAnimationFrame(animate);
        }
      }
      
      animateBtn.addEventListener('click', () => {
        if (animating) return;
        animating = true;
        animationFrame = 0;
        animateBtn.textContent = 'Animating...';
        animate();
      });
      
      resetBtn.addEventListener('click', () => {
        animating = false;
        animationFrame = 0;
        animateBtn.textContent = 'Animate Quantization';
        draw(0);
      });
      
      canvas.addEventListener('click', () => {
        if (!animating) {
          animating = true;
          animationFrame = 0;
          animate();
        }
      });
      
      
      draw(0);
    })();
  </script>
</div>

<h3 id="making-it-practical-gptqs-efficiency-optimizations">Making It Practical: GPTQ&rsquo;s Efficiency Optimizations</h3>
<p>While powerful, the original OBQ algorithm is far too slow for modern LLMs. It greedily searches for the next-best weight to quantize and updates the inverse Hessian after every single weight, so its runtime grows cubically with the layer&rsquo;s input dimension, which is prohibitive at LLM scale. The genius of GPTQ lies in three clever optimizations:</p>
<p><strong>Arbitrary Order Quantization</strong>: The authors made a critical empirical discovery: for very large, overparameterized models, the specific order in which weights are quantized has minimal impact on final accuracy. GPTQ thus abandons the expensive greedy search of OBQ and instead quantizes the weights of every row in the same fixed order (e.g., column by column). This not only eliminates the search but also means every row sees the same sequence of inverse-Hessian updates, so that information is computed once and shared across all rows of a weight matrix, dramatically reducing redundant computation.</p>
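<p><strong>Lazy Batch Updates</strong>: To make the algorithm friendly to modern GPUs, which thrive on parallel computation, the error-compensation updates are batched. Instead of touching the whole weight matrix after every single column, GPTQ processes a block of columns at a time (the paper uses B = 128), applies updates only inside that block, and then propagates the accumulated error to the remaining columns in one large matrix operation. Note that this block size is separate from the <code>group_size</code> setting, which controls how many weights share a quantization scale and is also often set to 128. Batching significantly improves the compute-to-memory-access ratio, leading to massive speedups.</p>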
<p><strong><a href="https://en.wikipedia.org/wiki/Cholesky_decomposition">Cholesky Decomposition</a></strong>: Repeatedly updating the inverse Hessian accumulates numerical error, especially on very large layers. To keep the computation stable and efficient, GPTQ precomputes the information it needs from the inverse Hessian up front using this standard numerical linear algebra technique.</p>
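<p>The three optimizations above can be sketched together in a few lines of plain Python. This is a toy illustration under stated assumptions, not the reference implementation: the helper names (<code>upper_cholesky</code>, <code>round_to_grid</code>, <code>quantize_blocked</code>, <code>hessian_error</code>), the 3-column matrices, and the integer rounding grid are all made up for the example, and rows are processed in a Python loop where a real implementation would handle them in parallel on the GPU.</p>

```python
import math


def upper_cholesky(a):
    """Return upper-triangular U with a = U^T U (a must be symmetric positive definite)."""
    n = len(a)
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        u[i][i] = math.sqrt(a[i][i] - sum(u[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            u[i][j] = (a[i][j] - sum(u[k][i] * u[k][j] for k in range(i))) / u[i][i]
    return u


def round_to_grid(x, step=1.0):
    """Toy quantizer (an assumption): round to the nearest multiple of step."""
    return round(x / step) * step


def quantize_blocked(weights, h_inv, block_size, step=1.0):
    """Quantize columns left to right; every row shares the same factor U."""
    u = upper_cholesky(h_inv)  # computed once, reused by all rows
    n = len(h_inv)
    result = []
    for row in weights:
        w, q = list(row), [0.0] * n
        for start in range(0, n, block_size):
            end = min(start + block_size, n)
            errs = []
            for i in range(start, end):
                q[i] = round_to_grid(w[i], step)
                errs.append((w[i] - q[i]) / u[i][i])
                for j in range(i + 1, end):  # eager update inside the block
                    w[j] -= errs[-1] * u[i][j]
            for j in range(end, n):  # one lazy update for all later columns
                w[j] -= sum(e * u[start + k][j] for k, e in enumerate(errs))
        result.append(q)
    return result


def hessian_error(original, quantized, hessian):
    """Sum over rows of e^T H e, where e = original_row - quantized_row."""
    total = 0.0
    for w_row, q_row in zip(original, quantized):
        e = [a - b for a, b in zip(w_row, q_row)]
        total += sum(e[i] * hessian[i][j] * e[j]
                     for i in range(len(e)) for j in range(len(e)))
    return total


# Made-up 3-column example: H and its exact inverse.
H = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
H_INV = [[0.75, -0.5, 0.25], [-0.5, 1.0, -0.5], [0.25, -0.5, 0.75]]
W = [[0.4, 0.4, 0.4], [0.2, -0.7, 0.9]]

W_RTN = [[round_to_grid(x) for x in row] for row in W]
W_GPTQ = quantize_blocked(W, H_INV, block_size=2)
print("round-to-nearest:", W_RTN, hessian_error(W, W_RTN, H))
print("compensated:     ", W_GPTQ, hessian_error(W, W_GPTQ, H))
```

<p>In this toy case the compensated result rounds the middle weight of the first row up instead of down, which lowers the Hessian-weighted error relative to plain rounding. Changing <code>block_size</code> only changes when the updates are applied, not what they add up to, which is why batching is safe.</p>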
<p>GPTQ&rsquo;s success is a testament to brilliant research engineering. It bridges the gap between purely heuristic methods (like simple rounding) and computationally prohibitive, fully principled methods (like QAT). Its true innovation was not inventing new mathematical theory from scratch, but rather identifying the key computational bottlenecks in a powerful existing algorithm (OBQ) and devising pragmatic approximations (fixed order, lazy updates) that were shown to work exceptionally well at the massive scale of modern LLMs.</p>
<div class="gptq-algo-viz">
  <style>
    .gptq-algo-viz { background: white; border-radius: 12px; padding: 28px; box-shadow: 0 4px 6px -1px rgba(0,0,0,0.1); margin: 24px auto; max-width: 1200px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; }
    .gptq-header { margin-bottom: 24px; }
    .gptq-header h2 { margin: 0 0 8px 0; font-size: 22px; color: #0f172a; }
    .gptq-subtitle { font-size: 15px; color: #64748b; margin: 0; }
    .gptq-main { display: grid; grid-template-columns: 2fr 1fr; gap: 24px; margin-bottom: 24px; }
    .gptq-canvas { display: block; border: 2px solid #e2e8f0; border-radius: 8px; background: #fefefe; max-width: 100%; height: auto; }
    .gptq-side { display: flex; flex-direction: column; gap: 16px; min-height: 400px; }
    .gptq-info-card { background: #f8fafc; border: 2px solid #e2e8f0; border-radius: 8px; padding: 16px; }
    .gptq-info-card h3 { margin: 0 0 12px 0; font-size: 14px; font-weight: 600; color: #0f172a; text-transform: uppercase; letter-spacing: 0.5px; }
    .gptq-formula { background: #fffbeb; border: 2px solid #fbbf24; border-radius: 6px; padding: 14px; font-family: 'Monaco', 'Menlo', monospace; font-size: 13px; color: #92400e; line-height: 1.6; height: 150px; display: flex; flex-direction: column; justify-content: flex-start; }
    .gptq-step-label { font-weight: 700; color: #78350f; display: block; margin-bottom: 6px; }
    .gptq-phase-indicator { min-height: 36px; display: flex; align-items: center; }
     
    [id^="gptqFormulaText-"] { flex: 1; overflow: auto; }
    .gptq-hessian { background: white; border: 2px solid #c7d2fe; border-radius: 6px; padding: 12px; }
    .gptq-hessian-title { font-size: 12px; font-weight: 600; color: #4338ca; margin-bottom: 8px; }
    .gptq-controls { display: grid; grid-template-columns: repeat(auto-fit, minmax(180px, 1fr)); gap: 12px; margin-bottom: 20px; }
    .gptq-control-group { background: #f8fafc; padding: 12px; border-radius: 6px; border: 1px solid #e2e8f0; }
    .gptq-control-group label { display: block; font-size: 12px; font-weight: 600; color: #475569; margin-bottom: 8px; }
    .gptq-btn { width: 100%; padding: 10px; font-size: 14px; border: none; border-radius: 6px; cursor: pointer; transition: all 0.2s; font-weight: 600; }
    .gptq-btn.primary { background: #3b82f6; color: white; }
    .gptq-btn.primary:hover { background: #2563eb; }
    .gptq-btn.secondary { background: #e2e8f0; color: #475569; }
    .gptq-btn.secondary:hover { background: #cbd5e1; }
    .gptq-select, .gptq-range { width: 100%; padding: 8px; border: 1px solid #cbd5e1; border-radius: 4px; font-size: 13px; }
    .gptq-metrics { display: grid; grid-template-columns: repeat(auto-fit, minmax(160px, 1fr)); gap: 12px; }
    .gptq-metric { background: linear-gradient(135deg, #f0f9ff 0%, #e0f2fe 100%); padding: 14px; border-radius: 8px; border: 2px solid #bae6fd; }
    .gptq-metric-label { font-size: 11px; color: #0c4a6e; text-transform: uppercase; font-weight: 700; margin-bottom: 4px; letter-spacing: 0.5px; }
    .gptq-metric-value { font-size: 20px; font-weight: 700; color: #075985; }
    .gptq-gpu { margin-top: 16px; }
    .gptq-gpu-bar { height: 24px; background: #e2e8f0; border-radius: 6px; overflow: hidden; position: relative; border: 2px solid #cbd5e1; }
    .gptq-gpu-fill { height: 100%; background: linear-gradient(90deg, #10b981, #059669); transition: width 0.3s ease; }
    .gptq-gpu-label { position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); font-size: 12px; font-weight: 700; color: white; text-shadow: 0 1px 3px rgba(0,0,0,0.4); }
    .gptq-legend { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 10px; margin-top: 20px; padding: 16px; background: #f8fafc; border-radius: 8px; border: 1px solid #e2e8f0; }
    .gptq-legend-item { display: flex; align-items: center; gap: 8px; font-size: 13px; color: #475569; }
    .gptq-legend-box { width: 24px; height: 18px; border-radius: 3px; border: 1px solid #94a3b8; }
    .gptq-phase { padding: 6px 12px; border-radius: 6px; font-size: 12px; font-weight: 600; display: inline-block; }
    .gptq-phase.quantize { background: #fed7aa; color: #9a3412; }
    .gptq-phase.local { background: #bfdbfe; color: #1e40af; }
    .gptq-phase.global { background: #d9f99d; color: #365314; }
  </style>

  <div class="gptq-header">
    <h2>GPTQ: Block-wise Quantization with Hessian-Guided Error Compensation</h2>
    <p class="gptq-subtitle">Watch the three nested loops process weights column-by-column with intelligent error redistribution</p>
  </div>
  
  <div class="gptq-main">
    <div>
      <canvas id="gptqAlgoCanvas-d2ebbad0e656c085c5ea2a3a6e1b91c9" class="gptq-canvas" width="700" height="400"></canvas>
    </div>
    
    <div class="gptq-side">
      <div class="gptq-info-card">
        <h3>Current Operation</h3>
        <div id="gptqPhaseIndicator-d2ebbad0e656c085c5ea2a3a6e1b91c9" class="gptq-phase-indicator"></div>
        <div class="gptq-formula">
          <span class="gptq-step-label" id="gptqStepLabel-d2ebbad0e656c085c5ea2a3a6e1b91c9">Ready</span>
          <div id="gptqFormulaText-d2ebbad0e656c085c5ea2a3a6e1b91c9">Click Play to begin GPTQ quantization...</div>
        </div>
      </div>
      
      <div class="gptq-info-card">
        <h3>Hessian Inverse</h3>
        <canvas id="gptqHessianCanvas-d2ebbad0e656c085c5ea2a3a6e1b91c9" width="240" height="240"></canvas>
        <div style="margin-top: 8px; font-size: 11px; color: #64748b;">
          Guides error compensation direction and magnitude
        </div>
      </div>
    </div>
  </div>
  
  <div class="gptq-controls">
    <div class="gptq-control-group">
      <label>▶ Play / Pause</label>
      <button class="gptq-btn primary" id="gptqPlayBtn-d2ebbad0e656c085c5ea2a3a6e1b91c9">Play</button>
    </div>
    <div class="gptq-control-group">
      <label>→ Step Through</label>
      <button class="gptq-btn secondary" id="gptqStepBtn-d2ebbad0e656c085c5ea2a3a6e1b91c9">Next Step</button>
    </div>
    <div class="gptq-control-group">
      <label>↺ Reset</label>
      <button class="gptq-btn secondary" id="gptqResetBtn-d2ebbad0e656c085c5ea2a3a6e1b91c9">Reset</button>
    </div>
    <div class="gptq-control-group">
      <label>Block Size (B)</label>
      <select class="gptq-select" id="gptqBlockSize-d2ebbad0e656c085c5ea2a3a6e1b91c9">
        <option value="2">2</option>
        <option value="4" selected>4</option>
        <option value="8">8</option>
      </select>
    </div>
    <div class="gptq-control-group">
      <label>Speed: <span id="gptqSpeedValue-d2ebbad0e656c085c5ea2a3a6e1b91c9">5</span>x</label>
      <input type="range" class="gptq-range" id="gptqSpeedSlider-d2ebbad0e656c085c5ea2a3a6e1b91c9" min="1" max="10" value="5">
    </div>
  </div>
  
  <div class="gptq-metrics">
    <div class="gptq-metric">
      <div class="gptq-metric-label">Block Progress</div>
      <div class="gptq-metric-value" id="gptqBlockProgress-d2ebbad0e656c085c5ea2a3a6e1b91c9">0/4</div>
    </div>
    <div class="gptq-metric">
      <div class="gptq-metric-label">Column Progress</div>
      <div class="gptq-metric-value" id="gptqColProgress-d2ebbad0e656c085c5ea2a3a6e1b91c9">0/16</div>
    </div>
    <div class="gptq-metric">
      <div class="gptq-metric-label">Weights Quantized</div>
      <div class="gptq-metric-value" id="gptqWeightsQuantized-d2ebbad0e656c085c5ea2a3a6e1b91c9">0/96</div>
    </div>
    <div class="gptq-metric">
      <div class="gptq-metric-label">Compensations</div>
      <div class="gptq-metric-value" id="gptqCompensations-d2ebbad0e656c085c5ea2a3a6e1b91c9">0</div>
    </div>
  </div>
  
  <div class="gptq-gpu">
    <div class="gptq-metric-label">GPU Compute Utilization (Block Operations)</div>
    <div class="gptq-gpu-bar">
      <div class="gptq-gpu-fill" id="gptqGpuFill-d2ebbad0e656c085c5ea2a3a6e1b91c9" style="width: 0%"></div>
      <div class="gptq-gpu-label" id="gptqGpuLabel-d2ebbad0e656c085c5ea2a3a6e1b91c9">Idle</div>
    </div>
  </div>
  
  <div class="gptq-legend">
    <div class="gptq-legend-item">
      <div class="gptq-legend-box" style="background: #e2e8f0;"></div>
      <span>Unquantized weights</span>
    </div>
    <div class="gptq-legend-item">
      <div class="gptq-legend-box" style="background: #fef3c7;"></div>
      <span>Current block</span>
    </div>
    <div class="gptq-legend-item">
      <div class="gptq-legend-box" style="background: #fed7aa;"></div>
      <span>Quantizing now</span>
    </div>
    <div class="gptq-legend-item">
      <div class="gptq-legend-box" style="background: #bfdbfe;"></div>
      <span>Being compensated</span>
    </div>
    <div class="gptq-legend-item">
      <div class="gptq-legend-box" style="background: #c7d2fe;"></div>
      <span>Quantized + compensated</span>
    </div>
  </div>

  <script>
    (function() {
      const uid = 'd2ebbad0e656c085c5ea2a3a6e1b91c9';
      const canvas = document.getElementById('gptqAlgoCanvas-' + uid);
      const ctx = canvas.getContext('2d');
      const hessianCanvas = document.getElementById('gptqHessianCanvas-' + uid);
      const hessianCtx = hessianCanvas.getContext('2d');
      
      const rows = 6, cols = 16;
      let blockSize = 4, speed = 5;
      let playing = false, currentBlock = 0, currentCol = 0;
      let phase = 'quantize', frameCounter = 0, compensationCount = 0;
      let weightsOriginal = [], weights = [], state = [];
      
      function initWeights() {
        weightsOriginal = []; weights = []; state = [];
        for (let r = 0; r < rows; r++) {
          weightsOriginal[r] = []; weights[r] = []; state[r] = [];
          for (let c = 0; c < cols; c++) {
            weightsOriginal[r][c] = (Math.random() - 0.5) * 2;
            weights[r][c] = weightsOriginal[r][c];
            state[r][c] = 'unquantized';
          }
        }
      }
      
      function quantize(value) {
        const levels = 16, step = 2.0 / (levels - 1);
        return Math.max(-1, Math.min(1, Math.round((value + 1) / step) * step - 1));
      }
      
      function drawMatrix() {
        const cellW = 40, cellH = 60, startX = 40, startY = 60;
        ctx.clearRect(0, 0, canvas.width, canvas.height);
        
        ctx.font = 'bold 15px sans-serif'; ctx.fillStyle = '#0f172a';
        ctx.fillText('Weight Matrix W', startX, 30);
        ctx.font = '12px sans-serif'; ctx.fillStyle = '#64748b';
        ctx.fillText(`${rows} rows × ${cols} columns`, startX, 45);
        
        for (let r = 0; r < rows; r++) {
          for (let c = 0; c < cols; c++) {
            const x = startX + c * cellW, y = startY + r * cellH;
            const blockStart = currentBlock * blockSize;
            const blockEnd = Math.min((currentBlock + 1) * blockSize, cols);
            
            let bgColor = '#e2e8f0';
            if (state[r][c] === 'quantized') bgColor = '#c7d2fe';
            else if (c === currentCol && phase === 'quantize') bgColor = '#fed7aa';
            else if (c > currentCol && c < blockEnd && phase === 'local_update') bgColor = '#bfdbfe';
            else if (c >= blockStart && c < blockEnd) bgColor = '#fef3c7';
            
            ctx.fillStyle = bgColor;
            ctx.fillRect(x, y, cellW - 2, cellH - 2);
            ctx.strokeStyle = '#94a3b8'; ctx.lineWidth = 1;
            ctx.strokeRect(x, y, cellW - 2, cellH - 2);
            
            ctx.font = '10px monospace'; ctx.fillStyle = '#1e293b';
            ctx.fillText(weights[r][c].toFixed(2), x + 3, y + 14);
            
            if (state[r][c] === 'quantized') {
              const qError = weights[r][c] - weightsOriginal[r][c];
              ctx.font = 'bold 9px monospace'; ctx.fillStyle = '#6366f1';
              ctx.fillText(`Q${qError >= 0 ? '+' : ''}${qError.toFixed(2)}`, x + 3, y + 26);
            }
            
            if (c > currentCol && c < blockEnd && phase === 'local_update') {
              ctx.font = 'bold 16px sans-serif'; ctx.fillStyle = '#3b82f6';
              ctx.fillText('⟳', x + 12, y + 45);
            }
          }
        }
        
        ctx.strokeStyle = '#3b82f6'; ctx.lineWidth = 3;
        for (let i = 0; i <= cols; i += blockSize) {
          const x = startX + i * cellW;
          ctx.beginPath(); ctx.moveTo(x, startY); ctx.lineTo(x, startY + rows * cellH); ctx.stroke();
        }
        
        ctx.strokeStyle = '#1e293b'; ctx.lineWidth = 2;
        ctx.strokeRect(startX, startY, cols * cellW, rows * cellH);
        
        if (currentBlock < Math.ceil(cols / blockSize)) {
          const blockX = startX + currentBlock * blockSize * cellW;
          ctx.font = 'bold 11px sans-serif'; ctx.fillStyle = '#ea580c';
          ctx.fillText(`Block ${currentBlock + 1}`, blockX + 5, startY - 8);
        }
      }
      
      function drawHessian() {
        const size = 16, cellSize = 15;
        hessianCtx.clearRect(0, 0, hessianCanvas.width, hessianCanvas.height);
        
        for (let i = 0; i < size; i++) {
          for (let j = 0; j < size; j++) {
            const x = j * cellSize, y = i * cellSize;
            let alpha = 0.1;
            if (i <= j) alpha = 0.2 + 0.3 * (1 - Math.abs(i - j) / size);
            if (j === currentCol || i === currentCol) alpha += 0.2;
            
            hessianCtx.fillStyle = `rgba(99, 102, 241, ${alpha})`;
            hessianCtx.fillRect(x, y, cellSize - 1, cellSize - 1);
          }
        }
        
        hessianCtx.strokeStyle = '#4338ca'; hessianCtx.lineWidth = 2;
        hessianCtx.beginPath(); hessianCtx.moveTo(0, 0); hessianCtx.lineTo(size * cellSize, size * cellSize); hessianCtx.stroke();
        hessianCtx.strokeStyle = '#6366f1'; hessianCtx.strokeRect(0, 0, size * cellSize, size * cellSize);
      }
      
      function updateUI() {
        const phaseDiv = document.getElementById('gptqPhaseIndicator-' + uid);
        if (phase === 'quantize') phaseDiv.innerHTML = '<span class="gptq-phase quantize">QUANTIZING</span>';
        else if (phase === 'local_update') phaseDiv.innerHTML = '<span class="gptq-phase local">LOCAL UPDATE</span>';
        else if (phase === 'global_update') phaseDiv.innerHTML = '<span class="gptq-phase global">GLOBAL UPDATE</span>';
        
        const stepLabel = document.getElementById('gptqStepLabel-' + uid);
        const formulaText = document.getElementById('gptqFormulaText-' + uid);
        
        if (phase === 'quantize') {
          stepLabel.textContent = `Step 1: Quantize Column ${currentCol}`;
          const current = weights[0][currentCol], quant = quantize(current);
          formulaText.innerHTML = `w[0,${currentCol}] = ${current.toFixed(4)}<br>quant(w) = ${quant.toFixed(4)}<br>error = ${(quant - current).toFixed(4)}`;
        } else if (phase === 'local_update') {
          stepLabel.textContent = `Step 2: Compensate Within Block`;
          const blockEnd = Math.min((currentBlock + 1) * blockSize, cols);
          const remaining = blockEnd - currentCol - 1;
          formulaText.innerHTML = `For each of ${remaining} remaining columns j in block:<br>w<sub>j</sub> ← w<sub>j</sub> - ((w<sub>q</sub> - quant(w<sub>q</sub>)) / H⁻¹<sub>q,q</sub>) × H⁻¹<sub>q,j</sub>, q = ${currentCol}<br><span style="color: #059669;">Redistributing error...</span>`;
        } else if (phase === 'global_update') {
          const nextBlockStart = (currentBlock + 1) * blockSize;
          const remaining = cols - nextBlockStart;
          formulaText.innerHTML = `Block ${currentBlock + 1} complete!<br>Updating ${remaining} remaining columns<br>using accumulated block errors`;
        }
        
        const totalBlocks = Math.ceil(cols / blockSize);
        document.getElementById('gptqBlockProgress-' + uid).textContent = `${Math.min(currentBlock + 1, totalBlocks)}/${totalBlocks}`;
        document.getElementById('gptqColProgress-' + uid).textContent = `${Math.min(currentCol, cols)}/${cols}`;
        
        let quantizedCount = 0;
        for (let r = 0; r < rows; r++) {
          for (let c = 0; c < cols; c++) {
            if (state[r][c] === 'quantized') quantizedCount++;
          }
        }
        document.getElementById('gptqWeightsQuantized-' + uid).textContent = `${quantizedCount}/${rows * cols}`;
        document.getElementById('gptqCompensations-' + uid).textContent = compensationCount;
        
        let utilization = 0;
        if (phase === 'quantize') utilization = 25;
        else if (phase === 'local_update') utilization = 60;
        else if (phase === 'global_update') utilization = 85;
        
        document.getElementById('gptqGpuFill-' + uid).style.width = utilization + '%';
        document.getElementById('gptqGpuLabel-' + uid).textContent = utilization === 0 ? 'Idle' : `${utilization}% ${utilization > 70 ? '(Efficient)' : '(Moderate)'}`;
      }
      
      function step() {
        if (currentBlock >= Math.ceil(cols / blockSize)) {
          playing = false;
          document.getElementById('gptqPlayBtn-' + uid).textContent = 'Play';
          return;
        }
        
        const blockStart = currentBlock * blockSize;
        const blockEnd = Math.min((currentBlock + 1) * blockSize, cols);
        
        if (phase === 'quantize') {
          for (let r = 0; r < rows; r++) {
            weights[r][currentCol] = quantize(weights[r][currentCol]);
            state[r][currentCol] = 'quantized';
          }
          phase = 'local_update';
        } else if (phase === 'local_update') {
          for (let r = 0; r < rows; r++) {
            const error = weights[r][currentCol] - weightsOriginal[r][currentCol];
            for (let c = currentCol + 1; c < blockEnd; c++) {
              const comp = -error * 0.2 * (Math.random() - 0.5);
              weights[r][c] += comp;
              compensationCount++;
            }
          }
          
          currentCol++;
          if (currentCol >= blockEnd) {
            phase = 'global_update';
          } else {
            phase = 'quantize';
          }
        } else if (phase === 'global_update') {
          for (let c = blockEnd; c < cols; c++) {
            for (let r = 0; r < rows; r++) {
              const avgError = (Math.random() - 0.5) * 0.05;
              weights[r][c] += avgError;
              compensationCount++;
            }
          }
          
          currentBlock++;
          currentCol = currentBlock * blockSize;
          
          if (currentBlock < Math.ceil(cols / blockSize)) {
            phase = 'quantize';
          } else {
            phase = 'complete';
            playing = false;
            document.getElementById('gptqPlayBtn-' + uid).textContent = 'Play';
          }
        }
        
        draw();
      }
      
      function draw() {
        drawMatrix();
        drawHessian();
        updateUI();
      }
      
      function animate() {
        if (!playing) return;
        frameCounter++;
        if (frameCounter >= 11 - speed) {
          frameCounter = 0;
          step();
        }
        if (playing) requestAnimationFrame(animate);
      }
      
      function reset() {
        playing = false;
        currentBlock = 0; currentCol = 0; phase = 'quantize';
        frameCounter = 0; compensationCount = 0;
        initWeights(); draw();
        document.getElementById('gptqPlayBtn-' + uid).textContent = 'Play';
      }
      
      document.getElementById('gptqPlayBtn-' + uid).addEventListener('click', () => {
        if (phase === 'complete') reset();
        playing = !playing;
        document.getElementById('gptqPlayBtn-' + uid).textContent = playing ? 'Pause' : 'Play';
        if (playing) animate();
      });
      
      document.getElementById('gptqStepBtn-' + uid).addEventListener('click', () => {
        playing = false;
        document.getElementById('gptqPlayBtn-' + uid).textContent = 'Play';
        step();
      });
      
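      // Look the button up inside this widget: the id is also used by the widget above.
      canvas.closest('.gptq-algo-viz').querySelector('#gptqResetBtn-' + uid).addEventListener('click', reset);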
      
      document.getElementById('gptqBlockSize-' + uid).addEventListener('change', (e) => {
        blockSize = parseInt(e.target.value);
        reset();
      });
      
      document.getElementById('gptqSpeedSlider-' + uid).addEventListener('input', (e) => {
        speed = parseInt(e.target.value);
        document.getElementById('gptqSpeedValue-' + uid).textContent = speed;
      });
      
      initWeights();
      draw();
    })();
  </script>
</div>

<p>Calibration requires surprisingly little data. GPTQ uses 128 random 2048-token segments from C4 (Colossal Clean Crawled Corpus), which comes to 128 × 2,048 = 262,144 tokens of generic web text. Because no task-specific data is involved, the quantized model&rsquo;s results remain zero-shot, and quantization stays fast and broadly applicable.</p>
<p>On a single NVIDIA A100 80GB, GPTQ quantizes OPT-175B in 4.2 hours and BLOOM-176B in 3.8 hours. Memory requirements are manageable: load one Transformer block (typically 6 layers) at a time, accumulate Hessians, quantize, then pass inputs through the quantized block to generate inputs for the next block.</p>
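<p>The &ldquo;accumulate Hessians&rdquo; step can be sketched as a running sum over the calibration inputs. The sketch assumes the layer-wise Hessian has the form H = 2XXᵀ used in the GPTQ paper; the function name <code>accumulate_hessian</code> and the two-feature toy batches are hypothetical and only serve the illustration.</p>

```python
NUM_SEGMENTS = 128
SEGMENT_LENGTH = 2048
CALIBRATION_TOKENS = NUM_SEGMENTS * SEGMENT_LENGTH  # size of the calibration set


def accumulate_hessian(hessian, layer_inputs):
    """Add 2 * x x^T to hessian for every input vector x (one vector per token)."""
    for x in layer_inputs:
        for i, x_i in enumerate(x):
            for j, x_j in enumerate(x):
                hessian[i][j] += 2.0 * x_i * x_j
    return hessian


# Toy layer with 2 input features and two tiny calibration batches (made-up values).
hessian = [[0.0, 0.0], [0.0, 0.0]]
for batch in ([[1.0, 2.0]], [[0.0, 1.0]]):
    accumulate_hessian(hessian, batch)

print(CALIBRATION_TOKENS)
print(hessian)
```

<p>Because the sum can be built up batch by batch, the full set of calibration activations never has to be held in memory at once; only the square matrix for the current layer does.</p>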
<h2 id="part-4-beyond-gptq---alternative-approaches">Part 4: Beyond GPTQ - Alternative Approaches</h2>
<p>GPTQ represents one approach among several competing quantization methods, each with distinct strengths:</p>
<p><strong>AWQ (Activation-aware Weight Quantization)</strong> has emerged as a primary alternative to GPTQ, achieving strong accuracy by protecting a small fraction of activation-sensitive weights. AWQ can match FP16 performance on some tasks and is widely supported by fast 4-bit inference kernels (e.g., AWQ/Marlin), which in many stacks can be faster than pipelines targeting GPTQ-formatted weights.</p>
<p><strong>LLM.int8()</strong> focuses on 8-bit quantization with near-zero degradation through mixed precision, keeping outlier features in FP16. While limited to 2x compression versus GPTQ&rsquo;s 4x, it provides the most reliable accuracy preservation.</p>
<p><strong>GGUF/llama.cpp</strong> targets CPU inference with excellent cross-platform support, using mixed bit-width &ldquo;k-quant&rdquo; formats ideal for edge deployment on Apple Silicon and consumer hardware.</p>
<p>For practitioners, <strong>AWQ and GPTQ represent the current sweet spot for 4-bit GPU inference</strong>, offering 96-99% accuracy recovery with 3.5-4x compression. The choice between methods depends on specific accuracy requirements, inference speed priorities, and deployment constraints.</p>
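<p>A quick back-of-the-envelope calculation shows where figures like 2x and 3.5-4x come from. The sketch below is illustrative only: it assumes each group of weights stores one 16-bit scale and one 4-bit zero point, which is a common layout but not the only one, and the helper names are made up for this example.</p>

```python
def bits_per_weight(weight_bits, group_size=None, scale_bits=16, zero_bits=0):
    """Average storage per weight, including per-group scale and zero-point metadata."""
    if group_size is None:
        return float(weight_bits)
    return weight_bits + (scale_bits + zero_bits) / group_size


def model_size_gb(num_params, bits):
    """Weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits / 8 / 1e9


def compression_ratio(bits, baseline_bits=16):
    """How many times smaller the weights are than the FP16 baseline."""
    return baseline_bits / bits


NUM_PARAMS = 70e9  # a 70B-parameter model
configs = {
    "FP16": bits_per_weight(16),
    "INT8": bits_per_weight(8),
    "4-bit, group_size=128": bits_per_weight(4, group_size=128, zero_bits=4),
    "4-bit, group_size=32": bits_per_weight(4, group_size=32, zero_bits=4),
}
for name, bits in configs.items():
    print(f"{name}: {model_size_gb(NUM_PARAMS, bits):.1f} GB, "
          f"{compression_ratio(bits):.2f}x smaller than FP16")
```

<p>Under these assumed metadata sizes, the 4-bit configurations come out at roughly 3.5x to 3.85x rather than a clean 4x, because every group carries its own scale and zero point; smaller groups improve accuracy but cost a little more memory.</p>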
<h2 id="conclusion">Conclusion</h2>
<p>Quantization has transformed from academic curiosity to production necessity, enabling LLM deployment from datacenter to edge. The progression from &ldquo;can we quantize below 8 bits?&rdquo; to practical 4-bit deployment reflects both algorithmic breakthroughs and growing infrastructure demands.</p>
<p>GPTQ&rsquo;s core innovation, using second-order Hessian information to guide quantization and compensate for errors, proved that 3-4 bit compression is viable with minimal accuracy loss. By minimizing layer output error rather than weight error, GPTQ brings 70B models, which previously required expensive multi-GPU clusters, within reach of consumer GPUs.</p>
<p>The field continues evolving rapidly. AWQ has emerged as a strong alternative with superior speed-accuracy tradeoffs, while advanced methods push toward 2-bit quantization. For practitioners today, 4-bit quantization with GPTQ or AWQ represents the sweet spot: 96-99% accuracy recovery with 3.5-4x memory reduction, making frontier models accessible on modest hardware.</p>
<p>The future of machine learning is quantized. These techniques have fundamentally democratized access to state-of-the-art models, transforming deployment from a privilege of well-funded organizations to a capability available to individual researchers and developers worldwide.</p>
<hr>
<h3 id="references">References</h3>
<ol>
<li>
<p><strong>Frantar, E., Ashkboos, S., Hoefler, T., &amp; Alistarh, D. (2023).</strong> <em>GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers</em>. In International Conference on Learning Representations (ICLR).</p>
</li>
<li>
<p><strong>Lin, J., Tang, J., Tang, H., et al. (2023).</strong> <em>AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration</em>. In Proceedings of Machine Learning and Systems (MLSys).</p>
</li>
<li>
<p><strong>Dettmers, T., Lewis, M., Belkada, Y., &amp; Zettlemoyer, L. (2022).</strong> <em>LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale</em>. arXiv preprint arXiv:2208.07339.</p>
</li>
<li>
<p><strong>Dettmers, T., Pagnoni, A., Holtzman, A., &amp; Zettlemoyer, L. (2023).</strong> <em>QLoRA: Efficient Finetuning of Quantized LLMs</em>. In Advances in Neural Information Processing Systems (NeurIPS).</p>
</li>
<li>
<p><strong>Jacob, B., Kligys, S., Chen, B., et al. (2018).</strong> <em>Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference</em>. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).</p>
</li>
<li>
<p><strong>Krishnamoorthi, R. (2018).</strong> <em>Quantizing deep convolutional networks for efficient inference: A whitepaper</em>. arXiv preprint arXiv:1806.08342.</p>
</li>
<li>
<p><strong>Dettmers, T., Svirschevski, R., Egiazarian, V., et al. (2024).</strong> <em>SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression</em>. In International Conference on Learning Representations (ICLR).</p>
</li>
<li>
<p><strong>Wang, S., &amp; Kanwar, P. (2019, August 23).</strong> <em>BFloat16: The secret to high performance on Cloud TPUs</em>. Google Cloud Blog.</p>
</li>
<li>
<p><strong>Institute of Electrical and Electronics Engineers. (2019).</strong> <em>IEEE Standard for Floating-Point Arithmetic</em>. IEEE Std 754-2019.</p>
</li>
<li>
<p><strong>NVIDIA. (2020).</strong> <em>NVIDIA A100 Tensor Core GPU Architecture</em>. Whitepaper.</p>
</li>
<li>
<p><strong>Gerganov, G., et al. (2023).</strong> <em>ggml-org/llama.cpp: LLM inference in C/C++</em>. GitHub repository.</p>
</li>
</ol>
]]></content:encoded></item><item><title>Flash Attention: The Mathematical Tricks That Broke the Memory Wall</title><link>https://www.mdjawad.com/posts/flash-attention/</link><pubDate>Wed, 10 Sep 2025 19:59:48 +0800</pubDate><guid>https://www.mdjawad.com/posts/flash-attention/</guid><description>Flash Attention, a memory-efficient attention mechanism for transformers.</description><content:encoded><![CDATA[<h2 id="the-context-length-revolution">The Context Length Revolution</h2>
<p>In 2022, something fundamental changed in the world of large language models. Suddenly, models that had been stuck processing 2,048 tokens could handle 16,000, then 32,000, then 100,000+ tokens. This wasn&rsquo;t a gradual improvement—it was a leap forward. The breakthrough that enabled this revolution? Flash Attention, an algorithm that didn&rsquo;t approximate or simplify attention, but computed it exactly while using radically less memory.</p>
<p>The story of Flash Attention is really a story about understanding your hardware. It&rsquo;s about realizing that the obvious bottleneck isn&rsquo;t always the real bottleneck, and that sometimes doing more work can make you faster. Most importantly, it&rsquo;s about three clever mathematical tricks that, when combined, transform the fundamental scaling characteristics of the Transformer architecture.</p>
<h2 id="the-deceptive-simplicity-of-attention">The Deceptive Simplicity of Attention</h2>
<p>Let&rsquo;s start with what attention actually computes. At its core, the self-attention mechanism is elegantly simple:</p>
<pre tabindex="0"><code>Attention(Q, K, V) = softmax(QK^T / √d) × V
</code></pre><p>For a sequence of N tokens, each represented by a d-dimensional vector:</p>
<ul>
<li>Q, K, V are all N×d matrices</li>
<li>QK^T produces an N×N attention matrix</li>
<li>The softmax normalizes each row to sum to 1</li>
<li>The final multiplication with V produces our N×d output</li>
</ul>
<p>The problem is hiding in plain sight: that N×N attention matrix. When N=2,048, this matrix contains about 4 million elements. When N=16,384, it balloons to 268 million elements. At N=100,000, you&rsquo;re looking at 10 billion elements—about 40GB in float32. The quadratic growth is devastating.</p>
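<p>The arithmetic behind those numbers is easy to reproduce. The short sketch below counts elements and float32 bytes for a single N×N attention matrix (one head, one sequence); the function name is just an illustrative choice.</p>

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_element: int = 4) -> int:
    """Bytes needed to store one N x N attention matrix (one head, one batch item)."""
    return n_tokens * n_tokens * bytes_per_element


for n in (2_048, 16_384, 100_000):
    elements = n * n
    gigabytes = attention_matrix_bytes(n) / 1e9
    print(f"N={n:>7,}: {elements:>14,} elements, {gigabytes:8.3f} GB in float32")
```

<p>Multiply by the number of heads and the batch size and the footprint grows further still.</p>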
<p>For years, the research community attacked this problem in the obvious way: try to avoid computing the full N×N matrix. Sparse attention patterns, low-rank approximations, kernel methods—dozens of papers proposed ways to reduce the quadratic complexity. Yet something curious kept happening. These methods would successfully reduce the theoretical FLOP count, but when implemented, they&rsquo;d often run slower than standard attention.</p>
<p><img alt="Diagram showing the quadratic growth of the attention matrix with increasing sequence length." loading="lazy" src="/images/posts/flash-attention/attention_matrix_explosion.png"></p>
<p>What was going on?</p>
<h2 id="the-real-bottleneck-a-tale-of-two-memories">The Real Bottleneck: A Tale of Two Memories</h2>
<p>The answer requires understanding something about modern GPU architecture that&rsquo;s often overlooked: GPUs have a dramatic memory hierarchy with vastly different performance characteristics at each level.</p>
<p>Consider an NVIDIA A100 GPU:</p>
<ul>
<li><strong>High Bandwidth Memory (HBM)</strong>: 40-80GB of storage, but &ldquo;only&rdquo; 1.5-2.0 TB/s of bandwidth</li>
<li><strong>On-chip SRAM</strong>: Just 192KB per streaming multiprocessor, but roughly 19 TB/s of bandwidth</li>
</ul>
<p>That&rsquo;s roughly a 10x difference in bandwidth. This massive disparity means that accessing data from HBM is the primary bottleneck in GPU computations. While SRAM can deliver data at blazing speeds, its tiny capacity forces most data to reside in the much slower HBM.</p>
<p>Now here&rsquo;s the critical insight: standard attention implementations are constantly moving data between HBM and SRAM. They&rsquo;re not slow because they do too much computation—they&rsquo;re slow because they spend most of their time waiting for data transfers from the slower HBM memory.</p>
<p>Let&rsquo;s trace through what standard attention actually does:</p>
<ol>
<li><strong>Load Q and K from HBM</strong> → Compute S = QK^T → <strong>Store N×N matrix S to HBM</strong></li>
<li><strong>Load S from HBM</strong> → Compute P = softmax(S) → <strong>Store N×N matrix P to HBM</strong></li>
<li><strong>Load P and V from HBM</strong> → Compute O = PV → <strong>Store O to HBM</strong></li>
</ol>
<p>Each of those loads and stores of N×N matrices is a catastrophic performance hit. The GPU&rsquo;s computational units, capable of trillions of operations per second, sit idle waiting for memory operations that take orders of magnitude longer than the actual math.</p>
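<p>A back-of-the-envelope estimate makes the imbalance concrete. The sketch below counts the elements moved in the three passes listed above and compares the transfer time with the time the two matrix multiplies would take at peak throughput. The fp16 storage and the 312 TFLOP/s peak are assumptions chosen for illustration, and the model ignores caching, kernel launch overhead, and the softmax arithmetic.</p>

```python
def standard_attention_hbm_elements(n: int, d: int) -> int:
    """Elements moved to or from HBM by the three-pass schedule listed above."""
    pass1 = 2 * n * d + n * n      # load Q and K, store S
    pass2 = n * n + n * n          # load S, store P
    pass3 = n * n + n * d + n * d  # load P and V, store O
    return pass1 + pass2 + pass3


def attention_matmul_flops(n: int, d: int) -> int:
    """FLOPs in the two matrix multiplies (QK^T and PV), about 2*N*N*d each."""
    return 4 * n * n * d


n, d = 16_384, 64
bytes_per_element = 2    # assumption: fp16 storage
hbm_bandwidth = 1.5e12   # bytes/s, the low end of the A100 range quoted earlier
peak_flops = 312e12      # assumption: A100 dense fp16 tensor-core peak

memory_ms = standard_attention_hbm_elements(n, d) * bytes_per_element / hbm_bandwidth * 1e3
compute_ms = attention_matmul_flops(n, d) / peak_flops * 1e3
print(f"HBM traffic: {memory_ms:.2f} ms, matmuls at peak: {compute_ms:.2f} ms")
```

<p>Even in this idealized model the memory traffic takes several times longer than the math, and the gap is dominated by the four N×N transfers.</p>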
<p>This is why reducing FLOPs didn&rsquo;t help. The computation was never the bottleneck; memory bandwidth was. Cutting arithmetic while leaving the data movement untouched only speeds up the part of the GPU that was already sitting idle.</p>
<h2 id="flash-attentions-three-tricks">Flash Attention&rsquo;s Three Tricks</h2>
<p>Flash Attention solves this memory bottleneck through three interconnected techniques that, together, enable computing exact attention without ever materializing the N×N matrices in HBM. Let&rsquo;s explore each one.</p>
<h3 id="trick-1-tiling--age-old-divide-and-conquer">Trick 1: Tiling — Age Old Divide and Conquer</h3>
<p>The first insight is that we don&rsquo;t need to compute the entire attention matrix at once. Instead, we can break it into small blocks that fit entirely in SRAM.</p>
<p>Think of the attention computation as filling in a giant N×N grid. Standard attention fills the entire grid, then normalizes it, then uses it. Flash Attention says: what if we filled in just one small tile at a time, processed it completely, and then moved on?</p>
<p>The algorithm divides the input sequences into blocks:</p>
<ul>
<li>Query blocks of B_r rows, where the FlashAttention paper sets B_r = min(⌈M/4d⌉, d) and M is the SRAM size</li>
<li>Key/Value blocks of B_c rows, with B_c = ⌈M/4d⌉</li>
</ul>
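<p>As a rough illustration, the block-size rule from Algorithm 1 of the FlashAttention paper (B_c = ⌈M/4d⌉, B_r = min(⌈M/4d⌉, d)) can be evaluated directly. The sketch assumes M is counted in elements and that the full 192KB of an A100 streaming multiprocessor is available to the kernel; real kernels tune block sizes empirically, so treat the numbers as indicative only.</p>

```python
import math


def flash_block_sizes(sram_elements: int, d: int) -> tuple[int, int]:
    """Return (B_r, B_c) following the rule in Algorithm 1 of the FlashAttention paper."""
    b_c = math.ceil(sram_elements / (4 * d))
    b_r = min(b_c, d)
    return b_r, b_c


def tile_count(n: int, b_r: int, b_c: int) -> int:
    """Number of (query block, key/value block) pairs needed to cover an N x N grid."""
    return math.ceil(n / b_r) * math.ceil(n / b_c)


sram_elements = 192 * 1024 // 2  # assumption: 192KB of SRAM holding fp16 values
for d in (64, 128):
    b_r, b_c = flash_block_sizes(sram_elements, d)
    print(f"d={d}: B_r={b_r}, B_c={b_c}, tiles for N=16384: {tile_count(16_384, b_r, b_c)}")
```

<p>The factor of 4 in the denominator leaves room for the Q, K, V, and O blocks to sit in SRAM together.</p>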
<p>For each block of the output, Flash Attention:</p>
<ol>
<li>Loads the relevant Q, K, V blocks into SRAM</li>
<li>Computes that tile of attention entirely in SRAM</li>
<li>Updates the output for that tile</li>
<li>Moves to the next tile</li>
</ol>
<p>The key is that each tile is small enough that all intermediate values stay in the fast SRAM. We never write the full attention matrix to slow HBM.</p>
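<p>The tiling loop itself can be sketched in a few lines of plain Python. This toy version only produces the score tiles Q_i K_j^T, one block pair at a time, and leaves out the 1/√d scaling, the softmax, and the output update. The generator name and the tiny inputs are illustrative assumptions.</p>

```python
def score_tiles(q, k, b_r, b_c):
    """Yield (row_start, col_start, tile), where tile = Q_i K_j^T for one block pair.

    q and k are lists of equal-length rows. Only one b_r x b_c tile exists at a
    time, so the full N x N score matrix is never built.
    """
    for r0 in range(0, len(q), b_r):
        q_block = q[r0:r0 + b_r]
        for c0 in range(0, len(k), b_c):
            k_block = k[c0:c0 + b_c]
            tile = [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k_block]
                    for q_row in q_block]
            yield r0, c0, tile


# Tiny example: N = 4 tokens, d = 2, processed as 2 x 2 tiles.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
k = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
for r0, c0, tile in score_tiles(q, k, b_r=2, b_c=2):
    print((r0, c0), tile)
```

<p>Each tile holds exactly the entries the full score matrix would have at the same coordinates; the difference is that a real kernel consumes the tile in SRAM and then discards it.</p>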
<div class="fa-tiling-container">
    <style>
        .fa-tiling-container {
            background: white;
            border-radius: 16px;
            padding: 20px;
            box-shadow: 0 12px 30px rgba(0,0,0,0.06);
            margin: 32px auto;
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
            max-width: 1100px;
        }

        .fa-tiling-container .title {
            text-align: center;
            font-size: 28px;
            font-weight: 700;
            color: #1a202c;
            margin-bottom: 10px;
        }

        .fa-tiling-container .subtitle {
            text-align: center;
            font-size: 16px;
            color: #718096;
            margin-bottom: 24px;
        }

        .fa-tiling-container .main-content {
            display: flex;
            gap: 24px;
            align-items: flex-start;
            flex-wrap: wrap;
        }

        .fa-tiling-container .matrix-section {
            flex: 1.2;
            display: flex;
            flex-direction: column;
            align-items: center;
        }

        .fa-tiling-container .matrix-container {
            position: relative;
            width: 100%;
            max-width: 520px;
            aspect-ratio: 1;
            background: #f7fafc;
            border-radius: 12px;
            padding: 16px;
            box-shadow: inset 0 2px 4px rgba(0,0,0,0.06);
        }

        .fa-tiling-container .matrix-grid {
            width: 100%;
            height: 100%;
            display: grid;
            grid-template-columns: repeat(8, 1fr);
            grid-template-rows: repeat(8, 1fr);
            gap: 2px;
            position: relative;
        }

        .fa-tiling-container .tile {
            background: #e2e8f0;
            border-radius: 4px;
            transition: all 0.3s ease;
            position: relative;
            overflow: hidden;
        }

        .fa-tiling-container .tile::before {
            content: '';
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            background: linear-gradient(135deg, transparent 40%, rgba(255,255,255,0.3) 50%, transparent 60%);
            transform: translateX(-100%);
            transition: transform 0.6s;
        }

        .fa-tiling-container .tile.processing {
            background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%);
            transform: scale(1.1);
            box-shadow: 0 8px 20px rgba(240, 87, 108, 0.4);
            z-index: 10;
        }

        .fa-tiling-container .tile.processing::before {
            transform: translateX(100%);
        }

        .fa-tiling-container .tile.completed {
            background: linear-gradient(135deg, #a8edea 0%, #fed6e3 100%);
        }

        @keyframes fa-pulse {
            0%, 100% { transform: scale(1); opacity: 1; }
            50% { transform: scale(1.05); opacity: 0.9; }
        }

        .fa-tiling-container .axis-label {
            position: absolute;
            font-size: 13px;
            font-weight: 600;
            color: #4a5568;
            white-space: nowrap;
        }

        .fa-tiling-container .axis-label.q {
            left: -25px;
            top: 50%;
            transform: translateY(-50%) rotate(-90deg);
        }

        .fa-tiling-container .axis-label.k {
            bottom: -30px;
            left: 50%;
            transform: translateX(-50%);
        }

        .fa-tiling-container .processing-section {
            flex: 1 1 360px;
            padding: 16px;
            min-width: 300px;
            max-width: 420px;
        }

        .fa-tiling-container .sram-visual {
            background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%);
            border-radius: 12px;
            padding: 24px;
            color: white;
            margin-bottom: 24px;
            box-shadow: 0 10px 30px rgba(240, 87, 108, 0.3);
        }

        .fa-tiling-container .sram-title {
            font-size: 22px;
            font-weight: 600;
            margin-bottom: 12px;
        }

        .fa-tiling-container .sram-item {
            background: rgba(255,255,255,0.2);
            padding: 12px;
            border-radius: 8px;
            font-size: 14px;
            margin-bottom: 10px;
            backdrop-filter: blur(10px);
        }

        .fa-tiling-container .process-steps {
            background: #f7fafc;
            border-radius: 12px;
            padding: 16px;
        }

        .fa-tiling-container .step-item {
            display: flex;
            align-items: center;
            padding: 12px;
            margin-bottom: 12px;
            background: white;
            border-radius: 8px;
            border-left: 4px solid #4299e1;
        }

        .fa-tiling-container .step-icon {
            width: 36px;
            height: 36px;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            border-radius: 50%;
            display: flex;
            align-items: center;
            justify-content: center;
            color: white;
            font-weight: 600;
            margin-right: 12px;
            flex-shrink: 0;
        }

        .fa-tiling-container .controls {
            display: flex;
            justify-content: center;
            gap: 12px;
            margin-top: 12px;
            flex-wrap: wrap;
        }

        .fa-tiling-container .control-btn {
            padding: 10px 24px;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            border: none;
            border-radius: 8px;
            font-size: 14px;
            font-weight: 600;
            cursor: pointer;
            transition: all 0.3s ease;
        }

        .fa-tiling-container .control-btn:hover {
            transform: translateY(-2px);
            box-shadow: 0 6px 20px rgba(102, 126, 234, 0.4);
        }

        @media (max-width: 768px) {
            .fa-tiling-container .main-content {
                flex-direction: column;
            }
            .fa-tiling-container .matrix-container {
                width: 100%;
                height: auto;
                aspect-ratio: 1;
            }
        }
    </style>

    <h3 class="title">Flash Attention: Tiling Strategy</h3>
    <p class="subtitle">Processing the N×N attention matrix in small B<sub>r</sub>×B<sub>c</sub> blocks that fit entirely in SRAM</p>
    
    <div class="main-content">
        <div class="matrix-section">
            <div class="matrix-container">
                <span class="axis-label q">Query (Q)</span>
                <span class="axis-label k">Key (K<sup>T</sup>)</span>
                
                <div class="matrix-grid" id="faTilingGrid-b076b7c97074640167ec1b84ce6b40b1">
                    
                </div>
            </div>

            <div class="controls">
                <button type="button" class="control-btn" id="faTilingStartBtn-b076b7c97074640167ec1b84ce6b40b1" onclick="window['faTilingStart_b076b7c97074640167ec1b84ce6b40b1'] && window['faTilingStart_b076b7c97074640167ec1b84ce6b40b1']()">▶️ Start</button>
                <button type="button" class="control-btn" id="faTilingResetBtn-b076b7c97074640167ec1b84ce6b40b1" onclick="window['faTilingReset_b076b7c97074640167ec1b84ce6b40b1'] && window['faTilingReset_b076b7c97074640167ec1b84ce6b40b1']()">🔄 Reset</button>
            </div>
        </div>

        <div class="processing-section">
            <div class="sram-visual">
                <div class="sram-title">⚡ SRAM Processing</div>
                <div class="sram-item" id="faTileInfo-b076b7c97074640167ec1b84ce6b40b1">
                    <strong>Current Tile:</strong> Processing block (1, 1)
                </div>
                <div class="sram-item">
                    <strong>All operations stay in SRAM:</strong>
                    <div style="margin-top: 8px; font-size: 12px;">
                        • Compute S<sub>ij</sub> = Q<sub>i</sub>K<sub>j</sub><sup>T</sup><br>
                        • Apply softmax incrementally<br>
                        • Update output O<sub>i</sub><br>
                        • Never write N×N matrix to HBM!
                    </div>
                </div>
            </div>

            <div class="process-steps">
                <h4 style="margin: 0 0 20px 0; color: #2d3748;">Processing Steps per Tile</h4>
                
                <div class="step-item">
                    <div class="step-icon">1</div>
                    <div>
                        <strong>Load Blocks</strong><br>
                        <small>Q<sub>i</sub>, K<sub>j</sub>, V<sub>j</sub> → SRAM</small>
                    </div>
                </div>

                <div class="step-item">
                    <div class="step-icon">2</div>
                    <div>
                        <strong>Compute Tile Attention</strong><br>
                        <small>S<sub>ij</sub> = Q<sub>i</sub>K<sub>j</sub><sup>T</sup> / √d (stays in SRAM)</small>
                    </div>
                </div>

                <div class="step-item">
                    <div class="step-icon">3</div>
                    <div>
                        <strong>Update Running Softmax</strong><br>
                        <small>Maintain m (max) and l (sum) for online softmax</small>
                    </div>
                </div>

                <div class="step-item">
                    <div class="step-icon">4</div>
                    <div>
                        <strong>Accumulate Output</strong><br>
                        <small>Update O<sub>i</sub> incrementally, write only O back to HBM</small>
                    </div>
                </div>
            </div>
        </div>
    </div>

        <script>
        (function() {
            const uniqueId = 'b076b7c97074640167ec1b84ce6b40b1';
            const gridSize = 8;
            const totalTiles = gridSize * gridSize;
            let currentTile = 0;
            let animationInterval = null;

            const grid = document.getElementById('faTilingGrid-' + uniqueId);
            const startBtn = document.getElementById('faTilingStartBtn-' + uniqueId);
            const resetBtn = document.getElementById('faTilingResetBtn-' + uniqueId);
            const info = document.getElementById('faTileInfo-' + uniqueId);

            

            const initializeGrid = () => {
                grid.innerHTML = '';
                for (let i = 0; i < totalTiles; i++) {
                    const tile = document.createElement('div');
                    tile.className = 'tile';
                    tile.id = `faTile-${uniqueId}-${i}`;
                    grid.appendChild(tile);
                }
            };

            const animateTile = () => {
                if (currentTile >= totalTiles) {
                    clearInterval(animationInterval);
                    animationInterval = null;
                    return;
                }

                if (currentTile > 0) {
                    const prevTile = document.getElementById(`faTile-${uniqueId}-${currentTile - 1}`);
                    if (prevTile) {
                        prevTile.classList.remove('processing');
                        prevTile.classList.add('completed');
                    }
                }

                const tile = document.getElementById(`faTile-${uniqueId}-${currentTile}`);
                if (tile) tile.classList.add('processing');

                const row = Math.floor(currentTile / gridSize) + 1;
                const col = (currentTile % gridSize) + 1;
                if (info) {
                    info.innerHTML = `<strong>Current Tile:</strong> Processing block (${row}, ${col})`;
                }

                currentTile++;

                if (currentTile === totalTiles) {
                    setTimeout(() => {
                        // Skip if Reset was pressed while this timeout was pending.
                        if (currentTile !== totalTiles) return;
                        const lastTile = document.getElementById(`faTile-${uniqueId}-${totalTiles - 1}`);
                        if (lastTile) {
                            lastTile.classList.remove('processing');
                            lastTile.classList.add('completed');
                        }
                    }, 300);
                }
            };

            const faTilingStart = () => {
                if (animationInterval) return; 
                faTilingReset();
                animationInterval = setInterval(animateTile, 300);
            };

            const faTilingReset = () => {
                clearInterval(animationInterval);
                animationInterval = null;
                currentTile = 0;
                for (let i = 0; i < totalTiles; i++) {
                    const tile = document.getElementById(`faTile-${uniqueId}-${i}`);
                    if (tile) {
                        tile.classList.remove('processing', 'completed');
                    }
                }
                if (info) {
                    info.innerHTML = '<strong>Current Tile:</strong> Ready to start';
                }
            };

            
            initializeGrid();
            faTilingReset();

            
            window['faTilingStart_' + uniqueId] = faTilingStart;
            window['faTilingReset_' + uniqueId] = faTilingReset;

            
            if (startBtn) startBtn.addEventListener('click', faTilingStart);
            if (resetBtn) resetBtn.addEventListener('click', faTilingReset);
        })();
    </script>
</div>
<p>But wait—there&rsquo;s a problem. The softmax operation needs to see an entire row to compute the proper normalization. How can we compute softmax correctly when we only see one tile at a time?</p>
<h3 id="trick-2-online-softmax--the-mathematical-keystone">Trick 2: Online Softmax — The Mathematical Keystone</h3>
<p>This is where the keystone of Flash Attention comes in: online softmax. This algorithm computes the exact softmax result by maintaining running statistics that can be updated incrementally as we process each tile.</p>
<p>The standard softmax formula for a vector x is:</p>
<pre tabindex="0"><code>softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
</code></pre><p>The online softmax reformulation maintains two running values:</p>
<ul>
<li><code>m</code>: The maximum value seen so far</li>
<li><code>l</code>: The sum of exponentials (adjusted for the maximum)</li>
</ul>
<p>Here&rsquo;s the magic. When we process a new block of scores, we:</p>
<ol>
<li>Find the new maximum: <code>m_new = max(m_old, max(current_block))</code></li>
<li>Rescale our running sum: <code>l_rescaled = l_old × exp(m_old - m_new)</code></li>
<li>Add the current block&rsquo;s contribution: <code>l_new = l_rescaled + Σ exp(current_block - m_new)</code></li>
</ol>
<p>The rescaling step is crucial: it adjusts previous computations to account for the new maximum, ensuring numerical stability and exactness. When we&rsquo;ve processed all blocks, dividing each <code>exp(x_i - m)</code> by the final <code>l</code> gives the exact same result as if we&rsquo;d computed softmax on the entire row at once.</p>
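<p>The three update steps translate almost line for line into code. The minimal sketch below streams one row of scores block by block, keeps only <code>m</code> and <code>l</code>, and then compares the result with a softmax computed over the whole row. It uses the same four score blocks as the interactive demo; the function name is an illustrative choice.</p>

```python
import math


def online_softmax_stats(blocks):
    """Return (m, l): the running max and the sum of exp(x - m) after streaming all blocks."""
    m, l = -math.inf, 0.0
    for block in blocks:
        m_new = max(m, max(block))                    # step 1: new maximum
        l_rescaled = l * math.exp(m - m_new)          # step 2: rescale the old sum
        l = l_rescaled + sum(math.exp(x - m_new) for x in block)  # step 3: add this block
        m = m_new
    return m, l


blocks = [
    [2.1, 1.8, 2.3, 1.9],
    [3.2, 2.9, 3.5, 3.1],
    [1.5, 1.2, 1.8, 1.4],
    [2.8, 3.1, 2.6, 2.9],
]
m, l = online_softmax_stats(blocks)
row = [x for block in blocks for x in block]

# Online result: normalize with the streamed statistics.
online = [math.exp(x - m) / l for x in row]

# Reference: ordinary softmax that sees the whole row at once.
row_max = max(row)
denominator = sum(math.exp(x - row_max) for x in row)
reference = [math.exp(x - row_max) / denominator for x in row]

print(all(math.isclose(a, b) for a, b in zip(online, reference)))
```

<p>In the full algorithm the same <code>exp(m_old - m_new)</code> factor is also applied to the partially accumulated output, so the output stays consistent with the statistics as new blocks arrive.</p>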
<div class="osfx-container my-8">
    <style>
      .osfx-container {
        font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
        max-width: 1100px;
        margin: 0 auto;
        background: #fff;
        border-radius: 16px;
        box-shadow: 0 10px 30px rgba(0,0,0,0.06);
        padding: 24px;
      }
      .osfx-container .title {
        text-align: center;
        font-size: 24px;
        font-weight: 700;
        color: #1a202c;
        margin: 0 0 6px 0;
      }
      .osfx-container .subtitle {
        text-align: center;
        font-size: 14px;
        color: #718096;
        margin: 0 0 20px 0;
      }
      .osfx-container .controls {
        display: flex;
        justify-content: center;
        gap: 10px;
        margin: 8px 0 20px;
        flex-wrap: wrap;
      }
      .osfx-container .btn {
        padding: 10px 16px;
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        color: #fff;
        border: 0;
        border-radius: 10px;
        font-size: 14px;
        font-weight: 600;
        cursor: pointer;
        transition: transform .15s ease, box-shadow .15s ease, opacity .15s ease;
      }
      .osfx-container .btn:hover { transform: translateY(-1px); box-shadow: 0 6px 18px rgba(102,126,234,.25); }
      .osfx-container .btn:disabled { opacity: .5; transform: none; box-shadow: none; cursor: not-allowed; }
  
      .osfx-container .grid {
        display: grid;
        grid-template-columns: 1fr 1fr;
        gap: 18px;
      }
      @media (max-width: 900px) {
        .osfx-container .grid { grid-template-columns: 1fr; }
      }
  
      .osfx-container .panel {
        background: #f7fafc;
        border-radius: 12px;
        padding: 18px;
      }
      .osfx-container .section-title {
        font-size: 15px;
        font-weight: 600;
        color: #2d3748;
        margin-bottom: 12px;
        display: flex;
        align-items: center;
        gap: 8px;
      }
  
       
      .osfx-container .blocks { display: flex; gap: 10px; flex-wrap: wrap; }
      .osfx-container .block {
        background: #fff;
        border: 1px solid #e2e8f0;
        border-radius: 10px;
        padding: 10px;
        min-width: 135px;
        transition: all .2s ease;
      }
      .osfx-container .block.active {
        border-color: #667eea;
        background: #f0f4ff;
        box-shadow: 0 6px 16px rgba(102,126,234,.2);
        transform: translateY(-3px);
      }
      .osfx-container .block.processed { border-color: #48bb78; background: #f0fff4; opacity: .85; }
      .osfx-container .block-title { font-size: 11px; color: #718096; margin-bottom: 6px; font-weight: 600; }
      .osfx-container .vals { display: grid; gap: 6px; }
      .osfx-container .val-row {
        display: flex; justify-content: space-between; align-items: center;
        background: #f7fafc; border-radius: 6px; padding: 5px 8px; font-size: 13px;
      }
      .osfx-container .val-label { color: #4a5568; font-size: 11px; }
      .osfx-container .val-num { font-family: 'Courier New', monospace; font-weight: 600; color: #2d3748; }
      .osfx-container .val-max { background: #fef5e7; }
      .osfx-container .val-max .val-num { color: #d68910; }
  
       
      .osfx-container .stats {
        background: linear-gradient(135deg, #667eea, #764ba2);
        border-radius: 12px;
        padding: 16px;
        color: #fff;
      }
      .osfx-container .stat-cards { display: grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-bottom: 12px; }
      .osfx-container .stat {
        background: rgba(255,255,255,.2); border-radius: 10px; padding: 14px; text-align: center; backdrop-filter: blur(8px);
      }
      .osfx-container .stat-label { font-size: 13px; opacity: .9; margin-bottom: 6px; }
      .osfx-container .stat-value { font-size: 24px; font-weight: 700; font-family: 'Courier New', monospace; }
      .osfx-container .stat-formula { font-size: 11px; opacity: .85; font-family: 'Courier New', monospace; }
      .osfx-container .stat.update { animation: osfx-pulse .5s ease; }
      @keyframes osfx-pulse { 0%{transform:scale(1)} 50%{transform:scale(1.05); background:rgba(255,255,255,.28)} 100%{transform:scale(1)} }
  
       
      .osfx-container .steps { background: #fff; border-radius: 10px; padding: 14px; box-shadow: 0 4px 10px rgba(0,0,0,.06); }
      .osfx-container .steps-title { font-size: 14px; font-weight: 600; color: #2d3748; margin-bottom: 8px; }
      .osfx-container .formula { background: #f7fafc; border-left: 3px solid #4299e1; padding: 10px; border-radius: 6px; font-family: 'Courier New', monospace; font-size: 12px; color: #2d3748; }
      .osfx-container .formula + .formula { margin-top: 8px; }
      .osfx-container .formula small { display: block; color: #718096; margin-top: 6px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; }
  
       
      .osfx-container .results { display: grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-top: 16px; }
      .osfx-container .result { background: #fff; border-radius: 10px; padding: 14px; }
      .osfx-container .result-title { font-size: 13px; font-weight: 600; color: #4a5568; margin-bottom: 8px; display: flex; align-items: center; gap: 8px; }
      .osfx-container .badge { display: inline-block; background: #48bb78; color: #fff; border-radius: 20px; padding: 2px 8px; font-size: 11px; font-weight: 700; }
      .osfx-container .result-vals { display: flex; flex-wrap: wrap; gap: 8px; }
      .osfx-container .result-val { background: #edf2f7; padding: 6px 8px; border-radius: 6px; font-family: 'Courier New', monospace; font-size: 12px; color: #2d3748; }
  
       
      .osfx-container .dots { display: flex; justify-content: center; gap: 6px; margin: 6px 0 10px; }
      .osfx-container .dot { width: 9px; height: 9px; border-radius: 50%; background: #e2e8f0; transition: transform .2s, background .2s; }
      .osfx-container .dot.active { background: #667eea; transform: scale(1.4); }
      .osfx-container .dot.done { background: #48bb78; }
    </style>
  
    <h3 class="title">Online Softmax: The Mathematical Keystone</h3>
    <p class="subtitle">Compute exact softmax incrementally without materializing the full attention row.</p>
  
    <div class="controls">
      <button type="button" class="btn" data-osfx="prev">⬅️ Previous</button>
      <button type="button" class="btn" data-osfx="next">Next Step ➡️</button>
      <button type="button" class="btn" data-osfx="play">▶️ Play All</button>
      <button type="button" class="btn" data-osfx="reset">🔄 Reset</button>
    </div>
  
    <div class="dots" data-osfx="dots"></div>
  
    <div class="grid">
      <div class="panel">
        <div class="section-title">📊 Input Blocks (Attention Scores)</div>
        <div class="blocks" data-osfx="blocks"></div>
  
        <div class="steps" style="margin-top:12px;">
          <div class="steps-title">Current Computation</div>
          <div data-osfx="formulas"></div>
        </div>
      </div>
  
      <div class="stats">
        <div class="section-title" style="color:#fff;">⚡ Running Statistics</div>
        <div class="stat-cards">
          <div class="stat" data-osfx="m-card">
            <div class="stat-label">Maximum (m)</div>
            <div class="stat-value" data-osfx="m-val">-∞</div>
            <div class="stat-formula">max seen so far</div>
          </div>
          <div class="stat" data-osfx="l-card">
            <div class="stat-label">Sum (l)</div>
            <div class="stat-value" data-osfx="l-val">0.000</div>
            <div class="stat-formula">Σ exp(x − m)</div>
          </div>
        </div>
  
        <div class="steps">
          <div class="steps-title">Key Operations</div>
          <div class="formula">1) m_new = max(m_old, max(block))<small>Find the new maximum value</small></div>
          <div class="formula">2) l_rescaled = l_old × exp(m_old − m_new)<small>Rescale the previous sum to the new maximum</small></div>
          <div class="formula">3) l_new = l_rescaled + Σ exp(block − m_new)<small>Add the current block’s contribution</small></div>
        </div>
      </div>
    </div>
  
    <div class="results">
      <div class="result">
        <div class="result-title">Standard Softmax <span style="font-size:11px;color:#a0aec0;">(needs full row)</span></div>
        <div class="result-vals" data-osfx="std"></div>
      </div>
      <div class="result">
        <div class="result-title">Online Softmax <span class="badge" style="display:none;" data-osfx="match">✓ EXACT MATCH</span></div>
        <div class="result-vals" data-osfx="online"></div>
      </div>
    </div>
  
    <script>
      (function () {
        const root = document.currentScript.parentElement;
  
        
        const blocks = [
          { id: 0, values: [2.1, 1.8, 2.3, 1.9], max: 2.3 },
          { id: 1, values: [3.2, 2.9, 3.5, 3.1], max: 3.5 },
          { id: 2, values: [1.5, 1.2, 1.8, 1.4], max: 1.8 },
          { id: 3, values: [2.8, 3.1, 2.6, 2.9], max: 3.1 }
        ];
  
        
        let step = 0;
        let m = -Infinity;
        let l = 0;
        let timer = null;
  
        
        const els = {
          blocks: root.querySelector('[data-osfx="blocks"]'),
          formulas: root.querySelector('[data-osfx="formulas"]'),
          dots: root.querySelector('[data-osfx="dots"]'),
          mCard: root.querySelector('[data-osfx="m-card"]'),
          lCard: root.querySelector('[data-osfx="l-card"]'),
          mVal: root.querySelector('[data-osfx="m-val"]'),
          lVal: root.querySelector('[data-osfx="l-val"]'),
          std: root.querySelector('[data-osfx="std"]'),
          online: root.querySelector('[data-osfx="online"]'),
          match: root.querySelector('[data-osfx="match"]'),
          btnPrev: root.querySelector('[data-osfx="prev"]'),
          btnNext: root.querySelector('[data-osfx="next"]'),
          btnPlay: root.querySelector('[data-osfx="play"]'),
          btnReset: root.querySelector('[data-osfx="reset"]'),
        };
  
        
        const buildDots = () => {
          els.dots.innerHTML = '';
          for (let i = 0; i < blocks.length; i++) {
            const d = document.createElement('div');
            d.className = 'dot' + (i < step ? ' done' : i === step ? ' active' : '');
            els.dots.appendChild(d);
          }
        };
  
        const buildBlocks = () => {
          els.blocks.innerHTML = '';
          blocks.forEach((b, idx) => {
            const card = document.createElement('div');
            card.className = 'block';
            card.dataset.idx = idx;
            if (idx < step) card.classList.add('processed');
            if (idx === step) card.classList.add('active');
            card.innerHTML = `
              <div class="block-title">Block ${idx + 1}</div>
              <div class="vals">
                ${b.values.map((v, i) => `
                  <div class="val-row">
                    <span class="val-label">x[${idx * 4 + i}]</span>
                    <span class="val-num">${v.toFixed(1)}</span>
                  </div>`).join('')}
                <div class="val-row val-max"><span class="val-label">max:</span><span class="val-num">${b.max.toFixed(1)}</span></div>
              </div>`;
            els.blocks.appendChild(card);
          });
        };
  
        const showFormulas = (idx = -1, oldM = -Infinity, oldL = 0) => {
          els.formulas.innerHTML = '';
          if (idx < 0) return;
          const b = blocks[idx];
          const mNew = Math.max(oldM, b.max);
          const items = [
            `m_new = max(${oldM === -Infinity ? '-∞' : oldM.toFixed(1)}, ${b.max.toFixed(1)}) = ${mNew.toFixed(1)}<small>Find the new maximum value</small>`,
            `l_rescaled = ${oldL.toFixed(3)} × exp(${oldM === -Infinity ? '-∞' : oldM.toFixed(1)} − ${mNew.toFixed(1)})<small>Rescale previous sum to new maximum</small>`,
            `l_new = l_rescaled + Σ exp([${b.values.map(v => v.toFixed(1)).join(', ')}] − ${mNew.toFixed(1)})<small>Add current block’s contribution</small>`
          ];
          items.forEach(t => {
            const f = document.createElement('div');
            f.className = 'formula';
            f.innerHTML = t;
            els.formulas.appendChild(f);
          });
        };
  
        const updateStats = () => {
          els.mCard.classList.add('update');
          els.lCard.classList.add('update');
          els.mVal.textContent = m === -Infinity ? '-∞' : m.toFixed(1);
          els.lVal.textContent = l.toFixed(3);
          setTimeout(() => { els.mCard.classList.remove('update'); els.lCard.classList.remove('update'); }, 450);
        };
  
        const computeStandard = () => {
          const all = blocks.flatMap(b => b.values);
          const mx = Math.max(...all);
          const expVals = all.map(x => Math.exp(x - mx));
          const s = expVals.reduce((a, b) => a + b, 0);
          const soft = expVals.map(x => x / s);
          els.std.innerHTML = soft.slice(0, 4).map((v, i) => `<div class="result-val">p[${i}]: ${v.toFixed(4)}</div>`).join('') + '<div class="result-val">...</div>';
        };
  
        const updateOnline = () => {
          if (l === 0) { els.online.innerHTML = '<div class="result-val">Processing...</div>'; return; }
          const vals = [];
          for (let i = 0; i < step; i++) {
            blocks[i].values.forEach(v => vals.push(Math.exp(v - m) / l));
          }
          els.online.innerHTML = vals.slice(0, 4).map((v, i) => `<div class="result-val">p[${i}]: ${v.toFixed(4)}</div>`).join('') + (vals.length > 4 ? '<div class="result-val">...</div>' : '');
        };
  
        const finishIfDone = () => {
          if (step < blocks.length) return;
          els.match.style.display = 'inline-block';
          const all = blocks.flatMap(b => b.values);
          const online = all.map(x => Math.exp(x - m) / l);
          els.online.innerHTML = online.slice(0, 4).map((v, i) => `<div class="result-val">p[${i}]: ${v.toFixed(4)}</div>`).join('') + '<div class="result-val">...</div>';
        };
  
        
        const doNext = () => {
          if (step >= blocks.length) return;
          const b = blocks[step];
          const oldM = m;
          const oldL = l;
          showFormulas(step, oldM, oldL);
  
          const mNew = Math.max(oldM, b.max);
          const rescale = oldM === -Infinity ? 0 : Math.exp(oldM - mNew);
          const lRescaled = oldL * rescale;
          const blockSum = b.values.reduce((s, x) => s + Math.exp(x - mNew), 0);
          m = mNew;
          l = lRescaled + blockSum;
  
          step++;
          buildDots();
          buildBlocks();
          updateStats();
          updateOnline();
          finishIfDone();
          updateButtons();
        };
  
        const doPrev = () => {
          if (step === 0) return;
          step--;
          
          m = -Infinity; l = 0;
          for (let i = 0; i < step; i++) {
            const b = blocks[i];
            const mNew = Math.max(m, b.max);
            const rescale = m === -Infinity ? 0 : Math.exp(m - mNew);
            const lRescaled = l * rescale;
            const blockSum = b.values.reduce((s, x) => s + Math.exp(x - mNew), 0);
            m = mNew; l = lRescaled + blockSum;
          }
          els.match.style.display = 'none';
          els.formulas.innerHTML = '';
          buildDots();
          buildBlocks();
          updateStats();
          updateOnline();
          updateButtons();
        };
  
        const doReset = () => {
          step = 0; m = -Infinity; l = 0;
          els.match.style.display = 'none';
          clearInterval(timer); timer = null;
          els.formulas.innerHTML = '';
          buildDots(); buildBlocks(); updateStats(); updateOnline(); updateButtons();
        };
  
        const doPlay = () => {
          doReset();
          timer = setInterval(() => {
            if (step < blocks.length) doNext();
            else { clearInterval(timer); timer = null; }
          }, 1600);
        };
  
        const updateButtons = () => {
          els.btnPrev.disabled = step === 0;
          els.btnNext.disabled = step >= blocks.length;
        };
  
        
        els.btnPrev.addEventListener('click', doPrev);
        els.btnNext.addEventListener('click', doNext);
        els.btnReset.addEventListener('click', doReset);
        els.btnPlay.addEventListener('click', doPlay);
  
        
        computeStandard();
        buildDots();
        buildBlocks();
        updateStats();
        updateOnline();
        updateButtons();
      })();
    </script>
  </div>
<p>This isn&rsquo;t an approximation; it&rsquo;s mathematically equivalent to standard softmax. The proof relies on the fact that softmax does not care which constant is subtracted inside the exponentials:</p>
<pre tabindex="0"><code>exp(x_i - a) / Σ_j exp(x_j - a) = exp(x_i - b) / Σ_j exp(x_j - b)
</code></pre><p>for any constants a and b, because the factor exp(-a) (or exp(-b)) appears in both the numerator and the denominator and cancels. By tracking how our maximum changes and rescaling the running sum accordingly, we maintain exactness while never needing the full row in memory.</p>
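<p>The rescaling argument can be checked numerically. The sketch below is a minimal plain-Python version of the three key operations, using the same four blocks of scores as the demo above; the function names are illustrative and not taken from any Flash Attention codebase.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import math

def online_softmax_stats(blocks):
    """Stream over blocks of scores, keeping only a running max m and sum l."""
    m = -math.inf   # running maximum seen so far
    l = 0.0         # running sum of exp(x - m)
    for block in blocks:
        m_new = max(m, max(block))
        # Rescale the old sum to the new maximum; exp(-inf) is 0.0 on the first block.
        l = l * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in block)
        m = m_new
    return m, l

def standard_softmax(xs):
    """Reference softmax that needs the whole row at once."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

blocks = [
    [2.1, 1.8, 2.3, 1.9],
    [3.2, 2.9, 3.5, 3.1],
    [1.5, 1.2, 1.8, 1.4],
    [2.8, 3.1, 2.6, 2.9],
]
m, l = online_softmax_stats(blocks)
row = [x for block in blocks for x in block]
online = [math.exp(x - m) / l for x in row]
reference = standard_softmax(row)
</code></pre>
<p>After the last block, <code>online</code> and <code>reference</code> agree to floating-point precision, even though the streaming version only ever held one block plus two scalars.</p>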
<h3 id="trick-3-recomputation--trading-compute-for-memory">Trick 3: Recomputation — Trading Compute for Memory</h3>
<p>The third trick addresses the backward pass used in training. During backpropagation, we need the attention matrices to compute gradients. Standard implementations store these N×N matrices during the forward pass for use in the backward pass.</p>
<p>Flash Attention takes a radically different approach: it doesn&rsquo;t store the attention matrices at all. Instead, during the backward pass, it recomputes the pieces it needs on-the-fly.</p>
<p>This seems wasteful—we&rsquo;re computing the same values twice! But remember: computation is cheap, memory movement is expensive. The time saved by not writing and reading N×N matrices to/from HBM far outweighs the cost of recomputation.</p>
<p>Beyond the inputs Q, K, and V, the forward pass saves only:</p>
<ul>
<li>The output O (size N×d)</li>
<li>The softmax normalization statistics for each row, i.e. the running maximum m and sum l (size N each)</li>
</ul>
<p>During backpropagation, when gradients are needed:</p>
<ol>
<li>Reload the relevant Q, K, V blocks</li>
<li>Recompute just the attention tiles needed for that gradient</li>
<li>Compute gradients entirely in SRAM</li>
<li>Accumulate to the final gradient</li>
</ol>
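<p>The recomputation step can be sketched in NumPy. This is a toy illustration under stated assumptions: the sizes and the tile size <code>B</code> are made up, and the full matrix <code>P_full</code> is built here only so the recomputed tile has something to be compared against. A real kernel would never materialize it.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import numpy as np

rng = np.random.default_rng(0)
N, d, B = 8, 4, 4                        # toy sizes; B is a hypothetical tile size
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))

# Forward pass: the per-row statistics are what gets saved.
S = Q @ K.T / np.sqrt(d)
m = S.max(axis=1)                        # row maxima, size N
l = np.exp(S - m[:, None]).sum(axis=1)   # row sums, size N
P_full = np.exp(S - m[:, None]) / l[:, None]   # kept only to check the result

def recompute_tile(i, j):
    """Rebuild the B x B attention tile (i, j) from Q, K and the saved statistics."""
    rows = slice(i * B, (i + 1) * B)
    cols = slice(j * B, (j + 1) * B)
    S_ij = Q[rows] @ K[cols].T / np.sqrt(d)
    return np.exp(S_ij - m[rows, None]) / l[rows, None]

tile = recompute_tile(1, 0)
</code></pre>
<p>The recomputed tile matches the corresponding slice of the full attention matrix, which is why storing the N-sized statistics is enough for the backward pass.</p>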
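<p>Recomputation does add FLOPs, because the backward pass has to redo the score matmul and softmax for each tile. But the extra arithmetic is small next to the HBM traffic it avoids, and the net result is a 2-4x speedup in wall-clock time. The counterintuitive lesson: in memory-bound operations, doing more work to avoid memory movement is a winning strategy.</p>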
<h2 id="putting-it-all-together-the-flash-attention-algorithm">Putting It All Together: The Flash Attention Algorithm</h2>
<p>Let&rsquo;s see how these three tricks combine in the actual algorithm. Here&rsquo;s a simplified view of the Flash Attention forward pass:</p>
<pre tabindex="0"><code>Algorithm: Flash Attention Forward Pass
Input: Q, K, V matrices of size N×d
Output: O matrix of size N×d

1. Divide sequences into blocks of size B_r and B_c
2. Initialize output O = 0, running stats m = -∞, l = 0

3. For each K,V block j:
   4. Load K_j, V_j into SRAM
   
   5. For each Q block i:
      6. Load Q_i, current O_i, m_i, l_i into SRAM
      
      7. Compute scores: S_ij = Q_i × K_j^T / √d
      
      8. Update running softmax:
         - m_new = max(m_i, rowmax(S_ij))
         - P_ij = exp(S_ij - m_new)            (unnormalized)
         - l_new = exp(m_i - m_new) × l_i + rowsum(P_ij)
      
      9. Update this block&#39;s output:
         - O_i = (exp(m_i - m_new) × l_i × O_i + P_ij × V_j) / l_new
      
      10. Store updated O_i, m_i = m_new, l_i = l_new to HBM

11. Return O
</code></pre><p>The beauty is in what&rsquo;s not there: we never materialize the full N×N attention matrix. Each block&rsquo;s computation happens entirely in SRAM, and the only things written back to HBM are the N×d output and the per-row statistics m and l.</p>
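<p>As a sanity check on the loop structure, here is a minimal NumPy sketch of the forward pass. It follows the same outer loop over K, V blocks and inner loop over Q blocks, keeps the probabilities unnormalized until the final division, and uses small illustrative block sizes. It mirrors the algorithm&rsquo;s arithmetic only; nothing here models SRAM or HBM.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import numpy as np

def flash_attention_forward(Q, K, V, B_r=4, B_c=4):
    """Tiled exact attention; B_r and B_c are illustrative block sizes."""
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row maxima
    l = np.zeros(N)           # running row sums
    for j in range(0, N, B_c):                 # outer loop over K, V blocks
        K_j, V_j = K[j:j + B_c], V[j:j + B_c]
        for i in range(0, N, B_r):             # inner loop over Q blocks
            rows = slice(i, i + B_r)
            S = Q[rows] @ K_j.T / np.sqrt(d)
            m_new = np.maximum(m[rows], S.max(axis=1))
            P = np.exp(S - m_new[:, None])     # unnormalized probabilities
            scale = np.exp(m[rows] - m_new)    # rescales the old sum and output
            l_new = scale * l[rows] + P.sum(axis=1)
            O[rows] = ((scale * l[rows])[:, None] * O[rows] + P @ V_j) / l_new[:, None]
            m[rows], l[rows] = m_new, l_new
    return O

def reference_attention(Q, K, V):
    """Standard attention that materializes the full score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
out = flash_attention_forward(Q, K, V)
</code></pre>
<p>Comparing <code>out</code> against <code>reference_attention(Q, K, V)</code> with <code>np.allclose</code> confirms that the tiled result is exact rather than approximate.</p>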
<h2 id="from-flash-attention-to-flash-attention-2">From Flash Attention to Flash Attention-2</h2>
<p>The original Flash Attention was a breakthrough, but it left performance on the table. Profiling showed it achieved only 25-40% of the GPU&rsquo;s theoretical peak performance. Flash Attention-2 represents a complete algorithmic rewrite that addresses these inefficiencies.</p>
<h3 id="the-parallelism-problem">The Parallelism Problem</h3>
<p>Flash Attention-1 parallelized across batch size and number of attention heads, launching one thread block per (batch, head) pair. But what happens with long sequences and small batch sizes? There are too few thread blocks to go around, and many of an A100&rsquo;s 108 streaming multiprocessors sit idle.</p>
<p>Flash Attention-2&rsquo;s solution: also parallelize across the sequence length dimension. Different thread blocks handle different portions of the output sequence, ensuring full GPU utilization even with batch size 1.</p>
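<p>A rough count shows why the extra dimension matters. The sketch below assumes a hypothetical model with 32 heads, a 16K-token sequence, and 128 query rows per thread block; those numbers are chosen for illustration, not taken from the paper.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import math

def thread_blocks(batch, heads, seq_len, rows_per_block=None):
    """Count independent thread blocks; rows_per_block=None means no sequence split."""
    seq_blocks = 1 if rows_per_block is None else math.ceil(seq_len / rows_per_block)
    return batch * heads * seq_blocks

NUM_SMS = 108  # streaming multiprocessors on an A100

# Batch size 1: parallelizing over (batch, head) only versus also over the sequence.
fa1_blocks = thread_blocks(batch=1, heads=32, seq_len=16384)
fa2_blocks = thread_blocks(batch=1, heads=32, seq_len=16384, rows_per_block=128)
print(fa1_blocks, "blocks without a sequence split,", fa2_blocks, "with one;", NUM_SMS, "SMs available")
</code></pre>
<p>With only 32 blocks, most of the 108 SMs have nothing to run. Splitting the sequence yields thousands of blocks, enough to keep every SM busy.</p>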
<h3 id="the-work-partitioning-revolution">The Work Partitioning Revolution</h3>
<p>Within each thread block, Flash Attention-1 used a &ldquo;split-K&rdquo; scheme:</p>
<ul>
<li>K and V were split across 4 warps</li>
<li>Each warp computed partial results</li>
<li>Warps had to synchronize and combine results through shared memory</li>
</ul>
<p>This created a communication bottleneck. Flash Attention-2 flips this to &ldquo;split-Q&rdquo;:</p>
<ul>
<li>Q is split across warps</li>
<li>K and V are shared by all warps</li>
<li>Each warp computes its portion independently with no synchronization</li>
</ul>
<p>This seemingly simple change eliminates inter-warp communication, reducing shared memory traffic by 4x.</p>
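<p>The difference between the two schemes shows up even in a NumPy toy. In the sketch below each &ldquo;warp&rdquo; is just a slice of an array, and the sizes are arbitrary. The point is only the shape of the work: split-K needs a merge step across warps, while split-Q does not.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import numpy as np

def partial_attention(Q, K, V):
    """Row max, row sum, and unnormalized output for one slice of the problem."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    m = S.max(axis=1)
    P = np.exp(S - m[:, None])
    return m, P.sum(axis=1), P @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
WARPS = 4

# Split-K: each warp holds a slice of K and V, so every output row needs a
# cross-warp merge (rescale to a shared max, then sum) before it is final.
parts = [partial_attention(Q, K_w, V_w)
         for K_w, V_w in zip(np.split(K, WARPS), np.split(V, WARPS))]
m_all = np.max([m for m, _, _ in parts], axis=0)
l_all = sum(l * np.exp(m - m_all) for m, l, _ in parts)
acc = sum(a * np.exp(m - m_all)[:, None] for m, _, a in parts)
split_k = acc / l_all[:, None]

# Split-Q: each warp owns a slice of rows and sees all of K and V, so its
# rows are complete on their own and the results are simply concatenated.
rows = []
for Q_w in np.split(Q, WARPS):
    m, l, a = partial_attention(Q_w, K, V)
    rows.append(a / l[:, None])
split_q = np.concatenate(rows)
</code></pre>
<p>Both variants produce the same output. The three merge lines in the split-K path stand in for the shared memory round trip that Flash Attention-2 avoids.</p>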
<h3 id="the-results">The Results</h3>
<p>Flash Attention-2 achieves:</p>
<ul>
<li>50-73% of theoretical peak FLOPs/s on A100 (up from 25-40%)</li>
<li>Roughly 2x speedup over Flash Attention-1</li>
<li>Up to 9x speedup over PyTorch standard attention</li>
<li>Up to 225 TFLOPs/s per A100 GPU for end-to-end training</li>
</ul>
<p>These aren&rsquo;t incremental improvements—they&rsquo;re transformative leaps that make previously impossible model configurations practical.</p>
<h2 id="the-lessons-of-flash-attention">The Lessons of Flash Attention</h2>
<p>Flash Attention teaches us several crucial lessons about algorithm design in the age of specialized hardware:</p>
<p><strong>Profile the Real Bottleneck</strong>: The obvious problem (quadratic FLOPs) wasn&rsquo;t the actual problem (memory bandwidth). Understanding your hardware&rsquo;s characteristics is essential.</p>
<p><strong>Embrace Hardware Constraints</strong>: Rather than fighting the small SRAM size, Flash Attention designs around it. Constraints can inspire innovation.</p>
<p><strong>Exact Beats Approximate</strong>: While the research community pursued approximations, Flash Attention showed that exact computation could be faster through better algorithm design.</p>
<p><strong>Recomputation Can Be Free</strong>: In memory-bound regimes, trading computation for memory movement is often profitable, a counterintuitive insight that challenges conventional optimization wisdom.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Flash Attention isn&rsquo;t just a faster attention implementation; it&rsquo;s a masterclass in hardware-aware algorithm design. By recognizing that memory movement, not computation, was the true bottleneck, and by developing three mathematical techniques to minimize that movement, Flash Attention transformed what&rsquo;s possible with Transformer models.</p>
<p>The online softmax algorithm, in particular, stands as a brilliant example of mathematical reformulation enabling practical breakthroughs. It shows that sometimes the path forward isn&rsquo;t to approximate or simplify, but to find clever exact reformulations that align with hardware constraints.</p>
<p>As we push toward ever-longer context windows and larger models, the principles behind Flash Attention (tiling for locality, online algorithms for incremental processing, and strategic recomputation) will remain relevant. They remind us that in the modern era of AI, the best algorithms aren&rsquo;t just mathematically elegant; they&rsquo;re architecturally aware.</p>
<p>The success of Flash Attention also highlights a broader truth: breakthrough performance improvements often come from questioning assumptions. Everyone &ldquo;knew&rdquo; that attention was compute-bound. Everyone &ldquo;knew&rdquo; that storing intermediate values was better than recomputing them. Flash Attention proved everyone wrong, and in doing so, enabled the current generation of long-context language models that are transforming AI applications.</p>
<p>The memory wall that seemed insurmountable in 2021 has been broken. Not by approximation, not by new hardware, but by three mathematical tricks and a deep understanding of the machine.</p>
<h2 id="references">References</h2>
<ol>
<li>
<p><strong>Dao, T., Fu, D. Y., Ermon, S., Rudra, A., &amp; Ré, C. (2022).</strong> <a href="https://arxiv.org/abs/2205.14135">FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness</a>. <em>arXiv preprint arXiv:2205.14135</em>.</p>
<ul>
<li>The original Flash Attention paper that introduced the tiling and online softmax algorithms.</li>
</ul>
</li>
<li>
<p><strong>Dao, T. (2023).</strong> <a href="https://arxiv.org/abs/2307.08691">FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning</a>. <em>arXiv preprint arXiv:2307.08691</em>.</p>
<ul>
<li>The follow-up paper detailing the algorithmic improvements in Flash Attention-2.</li>
</ul>
</li>
<li>
<p><strong>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., &hellip; &amp; Polosukhin, I. (2017).</strong> <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>. <em>Advances in Neural Information Processing Systems, 30</em>.</p>
<ul>
<li>The foundational Transformer paper that introduced the self-attention mechanism.</li>
</ul>
</li>
<li>
<p><strong>Rabe, M. N., &amp; Staats, C. (2021).</strong> <a href="https://arxiv.org/abs/2112.05682">Self-attention Does Not Need O(n²) Memory</a>. <em>arXiv preprint arXiv:2112.05682</em>.</p>
<ul>
<li>Important theoretical work on memory-efficient attention computation that influenced Flash Attention&rsquo;s development.</li>
</ul>
</li>
<li>
<p><strong>Milakov, M., &amp; Gimelshein, N. (2018).</strong> <a href="https://arxiv.org/abs/1805.02867">Online normalizer calculation for softmax</a>. <em>arXiv preprint arXiv:1805.02867</em>.</p>
<ul>
<li>Mathematical foundation for the online softmax algorithm used in Flash Attention.</li>
</ul>
</li>
<li>
<p><strong>Child, R., Gray, S., Radford, A., &amp; Sutskever, I. (2019).</strong> <a href="https://arxiv.org/abs/1904.10509">Generating Long Sequences with Sparse Transformers</a>. <em>arXiv preprint arXiv:1904.10509</em>.</p>
<ul>
<li>Representative work on sparse attention that, despite reducing FLOPs, often failed to deliver wall-clock speedups.</li>
</ul>
</li>
<li>
<p><strong>NVIDIA. (2020).</strong> <a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf">NVIDIA A100 Tensor Core GPU Architecture</a>. <em>NVIDIA Corporation</em>.</p>
<ul>
<li>Technical specifications of the A100 GPU architecture that Flash Attention was optimized for.</li>
</ul>
</li>
</ol>
]]></content:encoded></item><item><title>Advanced NVIDIA GPU Monitoring for LLM Inference: A Deep Dive into H100 Architecture and Performance Optimization</title><link>https://www.mdjawad.com/posts/inference-optmisation/</link><pubDate>Sat, 23 Aug 2025 14:20:33 +0800</pubDate><guid>https://www.mdjawad.com/posts/inference-optmisation/</guid><description>A deep dive into NVIDIA&amp;rsquo;s H100 architecture and the monitoring techniques required for production-grade LLM inference optimization.</description><content:encoded><![CDATA[<h2 id="introduction-the-economics-of-efficiency">Introduction: The Economics of Efficiency</h2>
<p>When OpenAI serves ChatGPT to millions of users, every percentage point of GPU efficiency translates to millions in infrastructure costs. The difference between 10 percent and 40 percent Model FLOPs Utilization (MFU) can determine whether your LLM service is profitable or bleeding money. In the world of large-scale AI deployment, understanding your hardware at the deepest level isn&rsquo;t just an academic exercise—it&rsquo;s a business imperative.</p>
<p>This guide reveals the architecture and monitoring techniques that separate amateur deployments from production-grade systems. We&rsquo;ll explore how modern LLM inference maps to NVIDIA&rsquo;s revolutionary H100 architecture, dissect the metrics that truly matter, and provide the knowledge needed to achieve the 2-10x performance improvements that industry leaders routinely accomplish.</p>
<p>Since you&rsquo;re already familiar with the basics of LLM inference from previous discussions, we&rsquo;ll dive directly into the advanced architectural details and sophisticated monitoring strategies that will transform your understanding of GPU optimization.</p>
<h2 id="llm-inference-process-a-hardware-perspective">LLM Inference Process: A Hardware Perspective</h2>
<p>Before we explore the H100&rsquo;s revolutionary architecture, let&rsquo;s establish how LLM inference operations map to GPU hardware. This understanding forms the foundation for interpreting the metrics we&rsquo;ll later use for optimization.</p>
<p>Modern LLM inference consists of two distinct phases that stress different aspects of GPU architecture. The prefill phase, where the model processes the entire input context, is fundamentally compute-bound. During this phase, the model performs massive matrix multiplications across all input tokens simultaneously, creating work that can effectively saturate the GPU&rsquo;s computational units. In contrast, the generation phase, where tokens are produced one at a time, becomes memory-bound due to the autoregressive nature of the process. Each new token requires accessing the entire key-value cache while performing relatively minimal computation.</p>
<p><img alt="LLM inference pipeline showing prefill (parallel) and generation (sequential) with compute vs memory bottlenecks" loading="lazy" src="/images/posts/inference-optmisation/llm-inference-pipeline-flow.png" title="LLM Inference Pipeline: Prefill and Autoregressive Generation Flow"></p>
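<p>A back-of-the-envelope roofline estimate makes the compute-bound versus memory-bound split concrete. The sketch below looks at a single FP16 weight matmul with a hypothetical hidden size of 4096 and uses the rounded H100 figures from this post (989 TFLOPS, 3 TB/s). It ignores the KV cache and everything else, so treat the numbers as orders of magnitude only.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def arithmetic_intensity(tokens, d_model, bytes_per_value=2):
    """FLOPs per byte moved for one (d_model x d_model) FP16 weight matmul."""
    flops = 2 * tokens * d_model * d_model
    weight_bytes = bytes_per_value * d_model * d_model
    activation_bytes = bytes_per_value * 2 * tokens * d_model  # inputs plus outputs
    return flops / (weight_bytes + activation_bytes)

PEAK_FLOPS = 989e12     # dense FP16 peak used elsewhere in this post
HBM_BANDWIDTH = 3e12    # bytes per second, the rounded HBM3 figure from this post
ridge = PEAK_FLOPS / HBM_BANDWIDTH   # intensity needed to keep the math units busy

prefill = arithmetic_intensity(tokens=2048, d_model=4096)
decode = arithmetic_intensity(tokens=1, d_model=4096)
for name, value in (("prefill", prefill), ("decode", decode)):
    # max(...) == value holds exactly when value is at or above the ridge
    bound = "compute-bound" if max(value, ridge) == value else "memory-bound"
    print(name, round(value, 1), bound)
</code></pre>
<p>With 2048 tokens in flight, each weight byte is reused about a thousand times, well above the ridge point of roughly 330 FLOPs per byte. With a single token, intensity drops to about 1, so the matmul spends its time waiting on memory.</p>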
<p>The memory transfer operations begin with Host-to-Device (H2D) transfers moving input tokens via PCIe. On modern systems, this means a PCIe Gen 4 x16 link at 64 GB/s bidirectional (about 32 GB/s in each direction) or, on H100 systems, Gen 5 x16 at 128 GB/s bidirectional (about 64 GB/s in each direction). Once data reaches the GPU, it enters a complex memory hierarchy that we&rsquo;ll explore in detail in the next section. The efficiency of these transfers often determines the lower bound of inference latency, particularly for smaller models where compute isn&rsquo;t the bottleneck.</p>
<p>Within the GPU, memory movement follows strict hierarchical patterns. Data flows from global memory (HBM) through various cache levels before reaching the compute units. Understanding this hierarchy is crucial because memory bandwidth, not compute capacity, often becomes the limiting factor in LLM inference performance.</p>
<h2 id="h100-architecture-deep-dive-essential-components-for-llm-inference">H100 Architecture Deep Dive: Essential Components for LLM Inference</h2>
<p>Understanding the H100&rsquo;s architecture is fundamental to optimizing LLM inference. Each component serves a specific purpose in the complex orchestration of transformer computations. This section provides a comprehensive primer on what each architectural element does and how it contributes to overall system performance.</p>
<p><img alt="H100 SM components including schedulers, CUDA cores, Tensor Cores, and on-chip memories" loading="lazy" src="/images/posts/inference-optmisation/H100-SM-arch.png" title="NVIDIA H100 Streaming Multiprocessor (SM) Architecture"></p>
<h3 id="streaming-multiprocessors-sms-the-computational-foundation">Streaming Multiprocessors (SMs): The Computational Foundation</h3>
<p>The Streaming Multiprocessor is the fundamental processing unit of the GPU. The H100 SXM5 exposes 132 SMs (the full GH100 die has 144, and the PCIe card enables 114), each capable of independent instruction execution. Each SM functions as a complete processor with its own instruction cache, schedulers, execution units, and register file.</p>
<p>Within each SM, four warp schedulers manage thread execution. A warp consists of 32 threads that execute in lockstep—when one thread in a warp executes an instruction, all 32 execute the same instruction on different data. Each scheduler can dispatch instructions from a different warp every cycle, enabling the SM to hide memory latency by switching between warps when one stalls.</p>
<p>The SM contains 128 CUDA cores for general-purpose computation, handling integer and single-precision floating-point operations. These cores execute the non-matrix operations in neural networks: activation functions, normalization, element-wise operations, and control flow. The SM also houses 4 Tensor Cores, specialized units that perform matrix multiply-accumulate operations at dramatically higher throughput than CUDA cores.</p>
<p>Each SM includes 256 KB of register file storage, providing ultra-fast temporary storage for thread-local variables. This generous register allocation enables complex kernels to maintain their working set entirely in registers, avoiding slower memory accesses. The register file is banked to allow multiple simultaneous accesses, critical for maintaining throughput when all threads need data simultaneously.</p>
<h3 id="memory-hierarchy-from-registers-to-hbm3">Memory Hierarchy: From Registers to HBM3</h3>
<p>The memory system follows a strict hierarchy, with each level trading capacity for speed. Understanding this hierarchy is crucial for inference optimization, as data movement often dominates execution time.</p>
<p><strong>Registers</strong> provide the fastest storage at approximately 20 TB/s of aggregate bandwidth per SM. Each thread can access up to 255 registers, with access latency of just one clock cycle. Register allocation happens at compile time, and efficient register use is critical for kernel performance.</p>
<p><strong>Shared Memory and L1 Cache</strong> share a 256 KB pool per SM, of which up to 228 KB can be configured as shared memory. Shared memory enables threads within a block to communicate and share data with latency of approximately 30 cycles. This memory is divided into 32 banks to enable parallel access, which is critical for algorithms like Flash Attention that rely on efficient shared memory access patterns.</p>
<p><strong>L2 Cache</strong> provides 50 MB of shared storage across all SMs with approximately 6 TB/s of bandwidth. The L2 cache maintains frequently accessed data like model weights and popular activation tensors. Its partitioned design allows multiple SMs to access different cache lines simultaneously without contention.</p>
<p><strong>HBM3 (High Bandwidth Memory)</strong> delivers 80 GB of capacity with roughly 3 TB/s of bandwidth through 10 memory controllers. HBM3 uses a 5120-bit wide interface achieved through vertical stacking of memory dies directly on the GPU package. Access latency ranges from 200 to 300 cycles, so hiding it through parallelism and caching is essential.</p>
<h3 id="tensor-cores-matrix-multiplication-acceleration">Tensor Cores: Matrix Multiplication Acceleration</h3>
<p>Tensor Cores are specialized processing units designed exclusively for matrix multiply-accumulate operations, the dominant computation in transformer models. The first-generation Tensor Core in Volta performed a 4×4×4 matrix multiply-accumulate per clock cycle; the H100&rsquo;s fourth-generation units process considerably larger tiles per cycle, delivering dramatically higher throughput than traditional CUDA cores.</p>
<p>The fourth-generation Tensor Cores in H100 support multiple precision formats. FP64 provides full double precision for scientific computing. TF32 (TensorFloat-32) offers the range of FP32 with the precision of FP16, providing a drop-in replacement for FP32 training. FP16 and BF16 (BrainFloat16) enable mixed-precision training and inference. FP8 in two variants (E4M3 and E5M2) doubles throughput while maintaining acceptable accuracy for most transformer operations. INT8 provides further acceleration for quantized inference.</p>
<p>Each Tensor Core operates on small matrix tiles, typically 16×16 or smaller, depending on the precision. The operation D = A × B + C is performed in a single instruction, where A, B, C, and D are matrix tiles. This fused operation eliminates the need to write intermediate results to memory, significantly improving efficiency.</p>
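<p>The tile-level operation is easy to mimic in NumPy. The sketch below assumes a 16×16 tile and FP16 inputs with FP32 accumulation, a common mixed-precision arrangement. It only illustrates the D = A × B + C pattern; it says nothing about how the hardware schedules tiles, and it expects matrix dimensions that are multiples of the tile size.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import numpy as np

TILE = 16  # illustrative tile edge; real tile shapes depend on the precision

def tiled_matmul(A, B):
    """Build A x B from tile-sized D = A x B + C steps with FP32 accumulation."""
    M, K = A.shape
    N = B.shape[1]
    D = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            C = np.zeros((TILE, TILE), dtype=np.float32)   # accumulator tile
            for k in range(0, K, TILE):
                a = A[i:i + TILE, k:k + TILE].astype(np.float32)
                b = B[k:k + TILE, j:j + TILE].astype(np.float32)
                C = a @ b + C          # one multiply-accumulate step on a tile
            D[i:i + TILE, j:j + TILE] = C
    return D

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 48)).astype(np.float16)
B = rng.standard_normal((48, 64)).astype(np.float16)
D = tiled_matmul(A, B)
</code></pre>
<p>The accumulator tile <code>C</code> stays in place across the whole inner loop, which is the software analogue of never writing intermediate results to memory.</p>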
<h3 id="transformer-engine-intelligence-for-transformer-models">Transformer Engine: Intelligence for Transformer Models</h3>
<p>The Transformer Engine is not a physical component but a collection of hardware and software optimizations specifically designed for transformer architectures. It automatically manages numerical precision throughout the network, choosing optimal formats for different operations.</p>
<p>The engine maintains statistics about tensor magnitudes and automatically scales values to maximize precision within the available dynamic range. For attention computations, it might use FP16 for the softmax operation while using FP8 for matrix multiplications. This dynamic precision management happens transparently, requiring no manual intervention while delivering near-FP16 accuracy at FP8 speeds.</p>
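<p>The scaling idea can be sketched without any FP8 hardware. The snippet below is a simplified illustration, not the Transformer Engine API: the function name and the history list are hypothetical, and NumPy has no FP8 type, so the values are only scaled into the E4M3 range (largest finite magnitude 448) rather than actually cast.</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format

def scale_to_fp8_range(x, amax_history):
    """Scale x so its recent absolute maximum lands on the edge of the E4M3 range."""
    amax_history.append(float(np.abs(x).max()))
    scale = E4M3_MAX / max(amax_history)
    scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)  # values a real kernel would cast to FP8
    return scaled, scale

history = []
x = np.array([0.5, -2.0, 1.0], dtype=np.float32)
scaled, scale = scale_to_fp8_range(x, history)
restored = scaled / scale   # dequantize with the same scale after the matmul
</code></pre>
<p>Keeping a history of maxima rather than only the current one is a design choice that lets the scale be chosen before the tensor is fully computed, at the cost of occasional clipping when magnitudes jump.</p>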
<p>The Transformer Engine also includes optimized implementations of common transformer operations. Layer normalization, positional encodings, and attention patterns are accelerated through specialized hardware paths. These optimizations are exposed through libraries like cuBLAS and cuDNN, making them accessible to framework developers.</p>
<h3 id="nvlink-and-pcie-interfaces-system-connectivity">NVLink and PCIe Interfaces: System Connectivity</h3>
<p>The H100 supports both NVLink 4.0 and PCIe Gen5 for system connectivity. NVLink provides 900 GB/s of bidirectional bandwidth (18 links at 50 GB/s each) for GPU-to-GPU communication, essential for model parallelism and multi-GPU inference. The high bandwidth and low latency of NVLink enable treating multiple GPUs almost as a single larger GPU for compatible workloads.</p>
<p>PCIe Gen5 delivers 128 GB/s of bidirectional bandwidth for host communication and storage access. This interface handles model loading, input data transfer, and result retrieval. The increased bandwidth of Gen5 reduces the time spent waiting for data transfer, particularly important for smaller models where transfer time might dominate computation time.</p>
<h3 id="hardware-schedulers-orchestrating-execution">Hardware Schedulers: Orchestrating Execution</h3>
<p>Beyond the warp schedulers in each SM, the H100 includes global hardware schedulers that manage work distribution across the GPU. The GigaThread Engine schedules thread blocks to SMs, considering factors like load balancing, cache locality, and resource availability.</p>
<p>The Work Distributor ensures efficient distribution of work across all available SMs, preventing scenarios where some SMs sit idle while others are overloaded. It understands the resource requirements of each kernel and schedules blocks to maximize occupancy while avoiding resource conflicts.</p>
<p>These hardware schedulers operate with sub-microsecond latency, enabling fine-grained scheduling decisions that would be impossible to implement in software. They continuously monitor SM utilization and adjust scheduling decisions dynamically, ensuring optimal resource utilization even with irregular workloads.</p>
<p><strong>Why This Architecture Matters</strong>: Each component in the H100 is designed to address specific bottlenecks in transformer inference. The massive register files enable complex kernels, the enhanced memory hierarchy reduces data movement overhead, specialized units like Tensor Cores and TMA accelerate common operations, and intelligent scheduling ensures all resources are effectively utilized. Understanding how these components work together enables developers to write software that fully exploits the hardware&rsquo;s capabilities.</p>
<h2 id="model-flops-utilization-the-north-star-metric">Model FLOPs Utilization: The North Star Metric</h2>
<p>Now that we understand the hardware foundation, we can properly appreciate why Model FLOPs Utilization (MFU) has become the definitive metric for LLM inference efficiency. Unlike simpler metrics that only indicate whether the GPU is busy, MFU measures how effectively we&rsquo;re using the computational capacity we&rsquo;ve paid for.</p>
<h3 id="understanding-mfu-in-context">Understanding MFU in Context</h3>
<p>Model FLOPs Utilization represents the ratio of achieved computational throughput to theoretical peak hardware throughput. When we report 30 percent MFU, we&rsquo;re saying that out of the H100&rsquo;s theoretical 989 TFLOPS of FP16 compute, we&rsquo;re achieving approximately 297 TFLOPS of useful model computation. The remaining capacity is lost to memory bottlenecks, kernel launch overhead, synchronization, and other inefficiencies.</p>
<p>The fundamental MFU calculation starts with understanding the computational requirements of transformer models. For a forward pass, we need approximately 2 FLOPs per parameter for the feed-forward and projection layers. The attention computation adds a significant number of FLOPs that scales quadratically with the sequence length:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>Attention FLOPs per layer ≈ 2 × L_seq² × D_hidden
</span></span></code></pre></div><p>Definitions:</p>
<ul>
<li><code>N_layers</code>: number of transformer layers</li>
<li><code>L_seq</code>: input sequence length (tokens)</li>
<li><code>D_hidden</code>: hidden size (<code>n_heads × d_head</code>)</li>
</ul>
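<p>Note: the constant depends on what is counted. The <code>QK^T</code> product alone contributes about <code>2 × L_seq² × D_hidden</code> FLOPs per layer, and including the attention × V product doubles that to about <code>4 × L_seq² × D_hidden</code>. Multiply by <code>N_layers</code> for the whole model. The key point is the quadratic scaling with <code>L_seq</code>.</p>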
<p>This quadratic scaling explains why long context lengths can dramatically impact computational requirements.</p>
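<p>As a minimal sketch of the MFU ratio described above, the following Python computes MFU from a measured token throughput. The function name <code>mfu</code>, the 7B-parameter model, and the 10,000 tokens/s figure are illustrative assumptions rather than measurements, and attention FLOPs are left out for simplicity:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def mfu(tokens_per_second, flops_per_token, peak_flops):
    """Model FLOPs Utilization: achieved FLOP/s divided by peak FLOP/s."""
    return tokens_per_second * flops_per_token / peak_flops


# Assumed example values (illustrative, not measured).
params = 7e9                    # 7B-parameter model
flops_per_token = 2 * params    # about 2 FLOPs per parameter per token
peak_flops = 989e12             # H100 FP16 peak quoted in the text

print(f"MFU = {mfu(10_000, flops_per_token, peak_flops):.1%}")
</code></pre>
<p>For long contexts, the quadratic attention term would need to be added to <code>flops_per_token</code> before taking the ratio.</p>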
<h3 id="the-reality-of-mfu-in-production">The Reality of MFU in Production</h3>
<p>The MFU values achieved in production often surprise newcomers to the field. During training, well-optimized systems routinely achieve 40-60 percent MFU because the workload is consistent and batches are large. However, inference presents a different challenge entirely.</p>
<p>During the prefill phase, where the model processes the entire input context, we typically see 30-45 percent MFU. This phase is compute-bound and benefits from the parallel processing of all input tokens. The generation phase tells a different story, with MFU dropping to just 5-15 percent. This dramatic reduction isn&rsquo;t a sign of poor optimization—it&rsquo;s a fundamental consequence of autoregressive generation&rsquo;s memory-bound nature.</p>
<p>Model size significantly impacts achievable MFU. A 7B parameter model might achieve 25-35 percent MFU during prefill and 8-12 percent during generation on a single GPU. Scale up to a 70B model with tensor parallelism, and you might see 35-45 percent prefill MFU but only 4-8 percent during generation. The larger model achieves higher prefill MFU because it better amortizes memory transfer costs, but lower generation MFU because each token requires accessing more parameters.</p>
<p><strong>Why This Matters</strong>: Understanding these MFU realities helps set appropriate optimization targets. Achieving 50 percent MFU during generation would require fundamental algorithmic breakthroughs, not just better engineering. Teams should focus on maximizing prefill MFU while accepting that generation will always be memory-bound.</p>
<h3 id="mfu-as-an-economic-indicator">MFU as an Economic Indicator</h3>
<p>The direct relationship between MFU and cost makes it invaluable for capacity planning and hardware selection. The cost per token can be expressed as:</p>
<pre tabindex="0"><code>Cost per Token = (GPU Cost per Hour × FLOPs per Token) / (MFU × Peak FLOPS × 3600 s/hour)
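</code></pre><p>Because cost per token is inversely proportional to MFU, a 10% relative improvement in MFU reduces infrastructure cost by about 9% (1 − 1/1.1), and raising MFU from 20% to 30% cuts it by a third.</p>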
<blockquote>
<p>Cost example (holding throughput constant)</p>
<p>Assumptions: 1,000 H100 GPUs at $3/GPU·hour, MFU improves from 20% → 30%.</p>
<p>Calculation:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>GPUs needed ∝ 1 / MFU
</span></span><span style="display:flex;"><span>GPUs_saved = 1000 × (1 - 0.20/0.30) = 333
</span></span><span style="display:flex;"><span>Hourly_savings = 333 × $3 ≈ $999 ≈ $1,000 per hour
</span></span><span style="display:flex;"><span>Annual_savings ≈ $1,000 × 24 × 365 ≈ $8.8M per year
</span></span></code></pre></div><p>In practice, higher MFU often also improves batching and reduces latency, increasing the effective savings.</p></blockquote>
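<p>The cost relationship and the worked example above can be reproduced with a short sketch. The function names are hypothetical, and the 7B model size in the last line is an assumption added only to produce a concrete figure:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def cost_per_million_tokens(gpu_cost_per_hour, flops_per_token, mfu, peak_flops):
    """Dollar cost to process one million tokens on a single GPU."""
    tokens_per_second = mfu * peak_flops / flops_per_token
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1e6


def gpus_saved(gpus_before, mfu_before, mfu_after):
    """GPUs freed at constant throughput, since GPUs needed scale as 1 / MFU."""
    return gpus_before * (1 - mfu_before / mfu_after)


saved = gpus_saved(1000, 0.20, 0.30)
print(f"GPUs saved: {saved:.0f}")
print(f"Annual savings: {saved * 3 * 24 * 365 / 1e6:.1f}M USD")

# Assumed 7B model (14 GFLOPs per token) on an H100 at 3 USD per hour.
for m in (0.20, 0.30):
    cost = cost_per_million_tokens(3.0, 14e9, m, 989e12)
    print(f"MFU {m:.0%}: {cost:.4f} USD per million tokens")
</code></pre>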
<blockquote>
<p><strong>Hardware Comparison: Effective TFLOPS</strong></p>
<p><strong>A100</strong>: $312 \text{ TFLOPS} \cdot 30\% \text{ MFU} = 93.6 \text{ effective TFLOPS}$</p>
<p><strong>H100</strong>: $989 \text{ TFLOPS} \cdot 25\% \text{ MFU} = 247.3 \text{ effective TFLOPS}$</p>
<p>The H100 provides <strong>2.6x</strong> more effective compute in this scenario, justifying its premium for compute-intensive workloads.</p></blockquote>
<h2 id="comprehensive-gpu-metrics-beyond-simple-utilization">Comprehensive GPU Metrics: Beyond Simple Utilization</h2>
<p>With our understanding of hardware architecture and MFU established, we can now explore the full spectrum of metrics available for monitoring NVIDIA GPUs. Each metric provides a different perspective on system behavior, and understanding their relationships is crucial for effective optimization.</p>
<h3 id="the-hierarchy-of-utilization-metrics">The Hierarchy of Utilization Metrics</h3>
<p>GPU utilization, the most commonly cited metric, merely indicates the percentage of time when one or more kernels are executing. A GPU showing 100 percent utilization might be performing useful work efficiently, or it might be spinning in inefficient kernels. This metric alone tells us almost nothing about actual performance.</p>
<p>Streaming Multiprocessor (SM) efficiency provides more insight by measuring how effectively active SMs utilize their resources. This includes warp occupancy (the ratio of active warps to maximum possible warps) and instruction throughput. An SM with high occupancy but low instruction throughput suggests memory bottlenecks, while low occupancy with high throughput might indicate kernel launch overhead.</p>
<p>Memory bandwidth utilization reveals whether we&rsquo;re constrained by data movement. On an H100, achieving 2.5 TB/s out of 3 TB/s theoretical bandwidth (83 percent utilization) might seem good, but if those transfers are inefficient (non-coalesced, redundant), we&rsquo;re still wasting resources. The relationship between achieved bandwidth and useful work becomes critical.</p>
<h3 id="tensor-core-utilization-the-hidden-bottleneck">Tensor Core Utilization: The Hidden Bottleneck</h3>
<p>Tensor Core utilization often becomes the limiting factor in achieving high MFU, yet it&rsquo;s frequently overlooked. The metric isn&rsquo;t simply whether Tensor Cores are active, but how efficiently they&rsquo;re being fed with data and how well the problem dimensions align with hardware requirements.</p>
<p>For optimal Tensor Core utilization, matrix dimensions must align with hardware constraints—multiples of 8 for FP16 operations, 16 for INT8. Misaligned dimensions can reduce utilization by 50 percent or more. The new H100 Transformer Engine alleviates some alignment constraints, but understanding these requirements remains crucial for optimization.</p>
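<p>One common way to satisfy these alignment constraints is to pad a dimension up to the next multiple. The helper below is a hypothetical sketch, and the vocabulary size is only an example of a dimension that is not a multiple of 8:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def pad_to_multiple(dim, multiple):
    """Round a matrix dimension up to the next multiple (8 for FP16, 16 for INT8)."""
    return -(-dim // multiple) * multiple


# Example dimension that is not aligned (assumed for illustration).
vocab = 50257
print("FP16 padding:", vocab, "to", pad_to_multiple(vocab, 8))
print("INT8 padding:", vocab, "to", pad_to_multiple(vocab, 16))
</code></pre>
<p>The padded rows or columns hold zeros and are ignored when reading results, trading a small amount of wasted compute for aligned Tensor Core tiles.</p>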
<p>The relationship between Tensor Core utilization and memory bandwidth becomes particularly important during inference. Even with perfect alignment, Tensor Cores can only maintain peak throughput if data arrives fast enough. This creates a careful balance—batch sizes must be large enough to amortize memory transfer costs but small enough to meet latency requirements.</p>
<h3 id="memory-hierarchy-metrics-finding-the-real-bottleneck">Memory Hierarchy Metrics: Finding the Real Bottleneck</h3>
<p>Understanding memory metrics requires thinking hierarchically. L1/shared memory hit rates tell us about kernel efficiency—rates below 80 percent suggest poor data locality. L2 cache hit rates indicate weight reuse effectiveness—critical for models with repeated layer structures. HBM bandwidth utilization reveals whether we&rsquo;re fundamentally memory-bound.</p>
<p>The introduction of the Memory Bandwidth Utilization (MBU) metric by Databricks provides a complementary view to MFU. MBU measures achieved memory bandwidth versus theoretical peak, helping identify whether computation or memory movement is the limiting factor. When MBU approaches 100 percent while MFU remains low, we know memory bandwidth is the bottleneck.</p>
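<p>A rough sketch of the MBU calculation for the decode phase follows, assuming that each generated token requires reading all model weights plus the KV cache once. The function name and the example numbers (10 ms per token, 1 GiB of KV cache) are assumptions for illustration:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def mbu(param_bytes, kv_cache_bytes, time_per_output_token_s, peak_bandwidth):
    """Memory Bandwidth Utilization: achieved bytes/s divided by peak bytes/s."""
    achieved_bandwidth = (param_bytes + kv_cache_bytes) / time_per_output_token_s
    return achieved_bandwidth / peak_bandwidth


# Assumed example: 7B FP16 model (14 GB), 1 GiB KV cache, 10 ms per token,
# and the 3 TB/s H100 bandwidth figure used in the text.
value = mbu(14e9, 2**30, 0.010, 3e12)
print(f"MBU = {value:.1%}")
</code></pre>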
<p>Cache line efficiency becomes critical in attention mechanisms. The irregular access patterns of key-value caches can waste significant bandwidth if not properly managed. Modern implementations like PagedAttention attack a related source of waste, raising KV-cache memory utilization from around 60 percent to over 95 percent, which leaves room for larger batches and better amortization of each weight read.</p>
<h3 id="power-and-thermal-metrics-the-overlooked-constraints">Power and Thermal Metrics: The Overlooked Constraints</h3>
<p>Power consumption and thermal behavior significantly impact sustained performance, particularly in dense datacenter deployments. The H100 can consume up to 700W, generating substantial heat that must be managed. Thermal throttling can reduce clock speeds by 30 percent or more, directly impacting achievable MFU.</p>
<p>Dynamic frequency scaling based on workload characteristics means that power-efficient kernels can run at higher clock speeds, improving overall throughput. Understanding the relationship between different operations and power consumption helps in scheduling and workload distribution.</p>
<blockquote>
<p><strong>Power Efficiency: TFLOPS per Watt</strong></p>
<ul>
<li><strong>H100</strong>: $989 \text{ TFLOPS} / 700\text{W} \approx 1.4 \text{ TFLOPS/W}$</li>
<li><strong>A100</strong>: $312 \text{ TFLOPS} / 400\text{W} \approx 0.78 \text{ TFLOPS/W}$</li>
</ul>
<p>This nearly <strong>2x</strong> improvement in power efficiency compounds the H100&rsquo;s computational advantages.</p></blockquote>
<h2 id="bottleneck-analysis-the-mathematics-of-performance-limits">Bottleneck Analysis: The Mathematics of Performance Limits</h2>
<p>Understanding whether your system is compute-bound or memory-bound requires more than just monitoring metrics—it demands understanding the fundamental arithmetic relationships in transformer models. This mathematical framework, combined with architectural knowledge, enables precise bottleneck identification and targeted optimization.</p>
<h3 id="arithmetic-intensity-the-fundamental-diagnostic-tool">Arithmetic Intensity: The Fundamental Diagnostic Tool</h3>
<p>Arithmetic intensity, defined as the ratio of floating-point operations to bytes of memory accessed, provides the key to understanding performance bottlenecks. For any given operation, we can calculate the arithmetic intensity and compare it to a single hardware-specific threshold.</p>
<p>That threshold is the hardware&rsquo;s <strong>balance point</strong>: the ratio of its peak compute throughput to its peak memory bandwidth.</p>
<pre tabindex="0"><code>Balance Point = Peak Compute (FLOPS) / Peak Memory Bandwidth (Bytes/s)
</code></pre><blockquote>
<p><strong>Hardware Balance Points (FP16)</strong></p>
<ul>
<li><strong>H100</strong>: $989 \text{ TFLOPS} / 3000 \text{ GB/s} \approx 330 \text{ Ops/Byte}$</li>
<li><strong>A100</strong>: $312 \text{ TFLOPS} / 1555 \text{ GB/s} \approx 200 \text{ Ops/Byte}$</li>
</ul></blockquote>
<p>When a workload&rsquo;s arithmetic intensity falls below this threshold, it is memory-bound; above, it is compute-bound.</p>
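<p>The balance points above follow directly from the quoted peak figures. A minimal sketch, with <code>balance_point</code> as a hypothetical helper name:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def balance_point(peak_flops, peak_bandwidth):
    """Ops/Byte at which peak compute and peak memory bandwidth are in balance."""
    return peak_flops / peak_bandwidth


# Peak FP16 throughput (FLOP/s) and memory bandwidth (bytes/s) as quoted in the text.
gpus = {"H100": (989e12, 3000e9), "A100": (312e12, 1555e9)}
for name, (flops, bandwidth) in gpus.items():
    print(f"{name}: {balance_point(flops, bandwidth):.1f} Ops/Byte")
</code></pre>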
<p><img alt="Roofline Model for LLM Inference (H100 vs A100)" loading="lazy" src="/images/posts/inference-optmisation/roofline.png" title="Roofline Model for LLM Inference (H100 vs A100)"></p>
<h3 id="transformer-arithmetic-breaking-down-the-operations">Transformer Arithmetic: Breaking Down the Operations</h3>
<p>To apply arithmetic intensity analysis to LLM inference, we must first understand the computational structure of transformers. Each layer consists of two main components: multi-head attention and feed-forward networks, each with distinct computational characteristics.</p>
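<p>The FLOPs per transformer layer break down as follows, where <code>B</code> is batch size, <code>L</code> is sequence length, <code>H</code> is hidden size, and the feed-forward network uses a 4× expansion:</p>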
<p><strong>Attention FLOPs (per layer):</strong></p>
<ul>
<li><strong>QKV Projections</strong>: <code>6BLH²</code></li>
<li><strong>QK^T Computation</strong>: <code>2BL²H</code></li>
<li><strong>Attention × V</strong>: <code>2BL²H</code></li>
<li><strong>Output Projection</strong>: <code>2BLH²</code></li>
<li><strong>Total Attention</strong>: <code>8BLH² + 4BL²H</code></li>
</ul>
<p><strong>FFN FLOPs (per layer):</strong></p>
<ul>
<li><strong>Up-projection</strong>: <code>8BLH²</code></li>
<li><strong>Down-projection</strong>: <code>8BLH²</code></li>
<li><strong>Total FFN</strong>: <code>16BLH²</code></li>
</ul>
<p><strong>Total FLOPs per layer:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>Total = (8BLH² + 4BL²H) + 16BLH² = 24BLH² + 4BL²H
</span></span></code></pre></div><p>Memory access patterns tell a different story. During the prefill phase, we load model weights once but use them for all tokens in the sequence, achieving good arithmetic intensity. During generation, we load the entire model weights to process a single token, resulting in poor arithmetic intensity that decreases with model size.</p>
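<p>The per-layer breakdown above translates directly into code. This sketch assumes the same 4× feed-forward expansion and uses a Llama-2-7B-like shape (hidden size 4096, 32 layers) purely as an example. Real models with gated feed-forward layers will differ somewhat:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def layer_flops(batch, seq_len, hidden):
    """Forward-pass FLOPs for one transformer layer, assuming a 4x FFN expansion."""
    b, l, h = batch, seq_len, hidden
    attention = {
        "qkv_proj": 6 * b * l * h**2,
        "qk_t": 2 * b * l**2 * h,
        "attn_v": 2 * b * l**2 * h,
        "out_proj": 2 * b * l * h**2,
    }
    ffn = {
        "up_proj": 8 * b * l * h**2,
        "down_proj": 8 * b * l * h**2,
    }
    return sum(attention.values()) + sum(ffn.values())


# Assumed example shape: 2048-token prompt, batch size 1.
per_layer = layer_flops(batch=1, seq_len=2048, hidden=4096)
print(f"{per_layer / 1e12:.2f} TFLOPs per layer")
print(f"{32 * per_layer / 1e12:.1f} TFLOPs for 32 layers")
</code></pre>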
<h3 id="the-prefill-phase-compute-bound-territory">The Prefill Phase: Compute-Bound Territory</h3>
<p>During prefill, when processing an entire input sequence, arithmetic intensity is relatively high. Consider a concrete example with Llama 2 7B processing a 2048-token sequence with batch size 1.</p>
<p>Using the <code>2 × L_seq² × D_hidden</code> per-layer estimate across 32 layers, the attention computation performs approximately $2 \cdot 32 \cdot 2048^2 \cdot 4096 \approx 1.1$ trillion FLOPs while accessing roughly 14 GB of memory (the FP16 weights). This yields an arithmetic intensity of:</p>
<pre tabindex="0"><code>AI_prefill = (1.1 × 10¹² FLOPs) / (14 × 10⁹ Bytes) ≈ 79 Ops/Byte
</code></pre><p>On this attention-only accounting the result is well below the H100&rsquo;s balance point of 330. It is a lower bound, though: the weight matrix multiplications add roughly $2 \cdot 7\text{B} \cdot 2048 \approx 29$ trillion FLOPs over the same 14 GB of weights, so the full forward pass for a 2048-token prompt lands well above the balance point. Short prompts are a different matter, because the weights are still read in full while only a few tokens reuse them.</p>
<p>Increasing the batch size raises arithmetic intensity further. With batch size 32, we perform 32× more operations while only marginally increasing memory access (weights are reused across the batch). The attention-only figure rises to roughly 2,500 operations per byte, solidly compute-bound.</p>
<p>This analysis explains why the number of tokens processed per forward pass has such a profound impact on MFU during prefill. Small batches of short prompts leave the GPU memory-bound despite the parallel processing of tokens. Only when batch size times prompt length grows sufficiently large do we transition to compute-bound operation where Tensor Cores can operate near peak efficiency.</p>
<h3 id="the-generation-phase-the-memory-bandwidth-wall">The Generation Phase: The Memory Bandwidth Wall</h3>
<p>Generation phase arithmetic intensity tells a starkly different story. When generating a single token, we must load the entire model (14 GB for Llama 2 7B) to perform approximately 14 billion operations ($2 \cdot 7B$ parameters). This yields an arithmetic intensity of:</p>
<pre tabindex="0"><code>AI_gen = (14 × 10⁹ FLOPs) / (14 × 10⁹ Bytes) = 1 Op/Byte
</code></pre><p>This is two orders of magnitude below the balance point, confirming generation is severely memory-bound.</p>
<p>The KV-cache access further degrades arithmetic intensity. For each generated token, we must read the cached keys and values for all previous tokens. With a 2048-token context, that is 2048 tokens × 32 layers × 8192 values per token per layer (keys plus values at hidden size 4096) × 2 bytes, or $2048 \cdot 32 \cdot 8192 \cdot 2 \approx 1.074$ GB (exactly 1 GiB) of KV-cache data read for every token generated, on top of the 14 GB of weights.</p>
<p>This fundamental mathematical reality explains why generation phase MFU rarely exceeds 15 percent. We&rsquo;re not failing to optimize; we&rsquo;re hitting the physical limits of memory bandwidth. No amount of kernel optimization can overcome this arithmetic intensity barrier—only architectural changes like larger caches or algorithmic innovations like speculative decoding can help.</p>
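<p>The generation-phase arithmetic can be sketched as follows. The helper names are hypothetical, attention FLOPs and activation traffic are ignored, and the model is assumed to use full multi-head attention in FP16 as in the text. The loop over batch sizes illustrates how per-sequence KV-cache reads limit the benefit of batching:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def kv_cache_bytes(context_len, n_layers, hidden, bytes_per_elem=2):
    """Bytes of cached keys plus values across all layers for one sequence."""
    return context_len * n_layers * 2 * hidden * bytes_per_elem


def decode_intensity(param_count, bytes_per_param, kv_bytes_per_seq, batch=1):
    """Ops/Byte for one decode step, ignoring attention FLOPs and activations."""
    flops = 2 * param_count * batch
    bytes_moved = param_count * bytes_per_param + kv_bytes_per_seq * batch
    return flops / bytes_moved


kv = kv_cache_bytes(context_len=2048, n_layers=32, hidden=4096)
print(f"KV cache per sequence: {kv / 2**30:.2f} GiB")
for batch in (1, 8, 32):
    print(f"batch {batch}: {decode_intensity(7e9, 2, kv, batch):.2f} Ops/Byte")
</code></pre>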
<h3 id="identifying-your-bottleneck-a-systematic-approach">Identifying Your Bottleneck: A Systematic Approach</h3>
<p>To determine whether your specific workload is compute-bound or memory-bound, follow this systematic analysis:</p>
<p>First, calculate the theoretical arithmetic intensity for your model and batch size. Using
<code>P</code> = model parameters, <code>B</code> = batch size, <code>N</code> = number of layers, <code>L</code> = sequence length, <code>H</code> = hidden size (≈ <code>n_heads × d_head</code>):</p>
<p>Compute FLOPs (forward pass):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>FLOPs_compute ≈ 2·P·B·L + 4·N·B·L²·H
</span></span></code></pre></div><p>Memory bytes (FP16, 2 bytes/elem):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>Bytes_memory ≈ 2·P + 2·B·L·H·N + 4·B·L·H·N
</span></span></code></pre></div><p>where the three terms correspond to weights, activations, and KV-cache respectively.</p>
<p>Next, measure your achieved arithmetic intensity using GPU metrics. Let <code>tps</code> be tokens/second and <code>fpt</code> be FLOPs/token; let <code>BW</code> be achieved memory bandwidth (bytes/second):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>FLOPs_achieved = tps · fpt
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>AI_achieved = FLOPs_achieved / BW
</span></span></code></pre></div><p>Compare your measured arithmetic intensity to the hardware balance point. If it&rsquo;s below the threshold, you&rsquo;re memory-bound—focus on reducing memory access through techniques like kernel fusion, quantization, or Flash Attention. If it&rsquo;s above the threshold, you&rsquo;re compute-bound—consider using lower precision, pruning, or more efficient algorithms.</p>
<h3 id="the-roofline-model-in-practice">The Roofline Model in Practice</h3>
<p>The roofline model visualizes these relationships, showing the performance ceiling imposed by either compute or memory bandwidth. The model creates a two-dimensional space where the x-axis represents arithmetic intensity and the y-axis represents achieved performance in FLOPS.</p>
<p>The &ldquo;roofline&rdquo; consists of two parts: a sloped line representing the memory bandwidth limit (performance = bandwidth × arithmetic_intensity) and a horizontal line representing the peak compute performance. The intersection point is the balance point we&rsquo;ve been discussing. Real workloads appear as points in this space, immediately revealing whether they&rsquo;re compute or memory limited.</p>
<p>For LLM inference, prefill operations typically appear in the middle region, potentially reaching the compute roofline with sufficient batch size. Generation operations cluster far to the left, firmly in memory-bound territory. This visual representation makes optimization opportunities immediately apparent.</p>
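<p>The roofline itself is a one-line function: attainable performance is the lower of the compute roof and the bandwidth slope. The sketch below uses the H100 figures from the text. The function names and the two example intensities are assumptions chosen to mirror the generation and large-batch prefill cases:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def roofline(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Attainable FLOP/s: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)


def limiting_resource(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Name the roof that caps performance at this intensity."""
    attainable = roofline(arithmetic_intensity, peak_flops, peak_bandwidth)
    return "compute" if attainable == peak_flops else "memory bandwidth"


H100 = (989e12, 3000e9)   # peak FP16 FLOP/s and bytes/s used in the text
for label, ai in [("generation", 1), ("prefill, large batch", 2400)]:
    ceiling = roofline(ai, *H100) / 1e12
    limit = limiting_resource(ai, *H100)
    print(f"{label}: {ceiling:.0f} TFLOPS ceiling, limited by {limit}")
</code></pre>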
<h3 id="bottleneck-specific-optimization-strategies">Bottleneck-Specific Optimization Strategies</h3>
<p>Once you&rsquo;ve identified your bottleneck, optimization strategies become clear and targeted.</p>
<p>For memory-bound operations, focus on reducing memory traffic. Operator fusion combines multiple operations to avoid intermediate memory writes. Quantization reduces the bytes per parameter, effectively increasing arithmetic intensity. Flash Attention keeps attention computations in shared memory, dramatically reducing HBM access. KV-cache compression techniques reduce the memory footprint of cached attention states.</p>
<p>For compute-bound operations, the strategies differ entirely. Use the highest-throughput precision your accuracy requirements allow—FP8 on H100 can double throughput versus FP16. Ensure tensor dimensions align with Tensor Core requirements (multiples of 8 for FP16, 16 for INT8). Consider structured sparsity to leverage the H100&rsquo;s sparse Tensor Core operations. Implement better batching strategies to amortize overhead.</p>
<p>The key insight is that optimizing a memory-bound workload for compute efficiency (or vice versa) wastes effort. Understanding your position relative to the roofline model ensures optimization efforts target the actual bottleneck.</p>
<h3 id="dynamic-bottleneck-behavior">Dynamic Bottleneck Behavior</h3>
<p>Bottlenecks aren&rsquo;t static—they shift based on workload characteristics and system state. A system that&rsquo;s compute-bound with large batches becomes memory-bound with small batches. Long sequences increase the compute requirements of attention quadratically, potentially shifting from memory to compute bound.</p>
<p>Thermal throttling can dynamically reduce compute capacity, which lowers the balance point. A workload whose arithmetic intensity sits just below the nominal balance point can therefore flip from memory-bound to compute-bound while the GPU is throttled. Understanding these dynamics helps explain performance variations and guides adaptive optimization strategies.</p>
<p>Modern inference systems must handle this dynamism gracefully. Techniques like dynamic batching adjust batch sizes based on queue depth and latency requirements, implicitly navigating the compute-memory tradeoff. Adaptive precision selection can switch between FP16 and FP8 based on whether the system is compute or memory bound.</p>
<h3 id="case-study-optimizing-a-memory-bound-workload">Case Study: Optimizing a Memory-Bound Workload</h3>
<p>Consider a production system serving Llama 2 13B with batch size 1, achieving only 5% MFU during generation. Analysis reveals an arithmetic intensity of 0.8 Ops/Byte—severely memory-bound.</p>
<p>The optimization strategy focuses entirely on memory traffic reduction.</p>
<ul>
<li><strong>INT8 Quantization</strong>: Halves memory requirements, doubling AI to 1.6 Ops/Byte.</li>
<li><strong>Flash Attention</strong>: Reduces attention-related memory traffic by ~75%.</li>
<li><strong>Continuous Batching</strong>: Increases average batch size to 8, multiplying AI by 8x.</li>
</ul>
<p>After these optimizations, arithmetic intensity reaches approximately 13 Ops/Byte (0.8 × 2 × 8 = 12.8 from quantization and batching alone), still memory-bound but much improved. MFU increases from 5% to 18%, a <strong>3.6x</strong> improvement. Further optimization would require architectural changes like model sharding to fit in GPU cache or algorithmic innovations like speculative decoding.</p>
<p>This systematic approach—measure, analyze, identify bottleneck, apply targeted optimization, repeat—transforms random experimentation into engineering discipline. Understanding the mathematical foundations of transformer inference enables predictable, reproducible performance improvements.</p>
<h2 id="advanced-monitoring-tools-and-techniques">Advanced Monitoring Tools and Techniques</h2>
<p>The complexity of modern GPU architectures demands sophisticated monitoring tools. Each tool in NVIDIA&rsquo;s ecosystem serves a specific purpose, from real-time production monitoring to deep kernel-level analysis.</p>
<h3 id="the-nvidia-smi-foundation">The NVIDIA-SMI Foundation</h3>
<p>NVIDIA-SMI, built on the NVIDIA Management Library (NVML), provides the foundation for GPU monitoring. While often dismissed as too basic, it offers several advanced capabilities crucial for production systems. The tool&rsquo;s ability to continuously monitor with minimal overhead (less than 1 percent performance impact) makes it ideal for always-on production monitoring.</p>
<p>The key to effective nvidia-smi usage lies in understanding its sampling behavior. Utilization metrics are sampled over 1/6 second intervals, meaning short-duration kernels might be missed entirely. Memory bandwidth measurements aggregate over one-second windows, potentially hiding burst behavior. Understanding these limitations helps interpret the data correctly.</p>
<p>Advanced nvidia-smi features include event-triggered logging, which can capture detailed state information when specific conditions occur, and persistence mode management, which keeps the GPU driver loaded between jobs so that new processes avoid driver initialization latency. The tool&rsquo;s ability to set and monitor power caps enables dynamic power management strategies that balance performance with thermal constraints.</p>
<h3 id="dcgm-datacenter-scale-monitoring">DCGM: Datacenter-Scale Monitoring</h3>
<p>The Data Center GPU Manager (DCGM) extends monitoring capabilities to fleet scale while providing more detailed metrics than nvidia-smi. Its architecture, with a central daemon managing data collection and client libraries for access, enables efficient monitoring of hundreds of GPUs with minimal overhead.</p>
<p>DCGM&rsquo;s field-based metric system provides over 100 distinct metrics, each identified by a unique field ID. For LLM inference, critical fields include DCGM_FI_PROF_SM_ACTIVE (1002) for SM utilization, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (1004) for Tensor Core activity, and DCGM_FI_PROF_DRAM_ACTIVE (1005) for memory interface utilization.</p>
<p>The profiling metrics available through DCGM provide insights impossible to obtain through nvidia-smi. These include instruction-level throughput, cache hit rates, and detailed memory access patterns. The ability to correlate these metrics across multiple GPUs reveals system-level bottlenecks that individual GPU monitoring might miss.</p>
<p><img alt="Diagram showing data flow from GPUs through DCGM to Prometheus and Grafana for monitoring and alerting" loading="lazy" src="/images/posts/inference-optmisation/reference-prometheus-architecture.png" title="Monitoring Stack Architecture with Prometheus"></p>
<h3 id="nsight-systems-application-level-profiling">Nsight Systems: Application-Level Profiling</h3>
<p>Nsight Systems provides a timeline view of application execution, revealing the interplay between CPU and GPU operations. For LLM inference, this exposes critical inefficiencies like CPU-GPU synchronization bottlenecks, unnecessary memory transfers, and kernel launch overhead.</p>
<p>The tool&rsquo;s ability to trace CUDA API calls, kernel executions, and memory transfers simultaneously creates a complete picture of application behavior. Custom NVTX markers can annotate different phases of inference (tokenization, prefill, generation), making performance analysis more intuitive.</p>
<p>The overhead of Nsight Systems (typically 5-20 percent) makes it unsuitable for production monitoring but invaluable for development optimization. The visual timeline immediately reveals problems like serialized operations that could run concurrently or gaps between kernels indicating scheduling inefficiencies.</p>
<h2 id="optimization-techniques-and-their-metric-signatures">Optimization Techniques and Their Metric Signatures</h2>
<p>Modern LLM inference optimization employs sophisticated techniques that produce distinctive patterns in GPU metrics. Understanding these signatures enables rapid diagnosis and systematic improvement.</p>
<h3 id="flash-attention-transforming-memory-access-patterns">Flash Attention: Transforming Memory Access Patterns</h3>
<p>Flash Attention revolutionizes attention computation by keeping intermediate results in shared memory rather than writing to HBM. This fundamental change produces distinctive metric signatures that confirm proper implementation.</p>
<p>When Flash Attention is working correctly, HBM bandwidth utilization during attention computation drops by 50-80 percent while SM utilization increases. L1/shared memory throughput increases dramatically, often exceeding 10 TB/s aggregate across all SMs. The MFU during attention phases can improve by 1.5-2x, though this improvement is most pronounced for longer sequences where memory bandwidth typically dominates.</p>
<p><img alt="Flash Attention Tiling Strategy" loading="lazy" src="/images/posts/inference-optmisation/flash-attn.png" title="Flash Attention Tiling Strategy"></p>
<p>The H100&rsquo;s larger shared memory (228 KB per SM) enables larger tile sizes than previous generations, reducing the number of passes required. Combined with the TMA&rsquo;s ability to asynchronously load the next tile while computing the current one, this can achieve near-perfect overlap of memory and computation.</p>
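<p>The core numerical idea behind the tiling, an online softmax that never materializes the full score row, can be shown in plain Python for a single query vector. This is a didactic sketch with assumed names and toy data, not the fused GPU kernel. It keeps only a running maximum, a running denominator, and a running weighted sum while visiting keys and values one tile at a time:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">import math


def tiled_attention(q, keys, values, tile=2):
    """Attention output for one query, visiting keys and values tile by tile."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [x * correction for x in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [x + w * vi for x, vi in zip(acc, v)]
        running_max = new_max
    return [x / denom for x in acc]


# Toy inputs (assumed for illustration).
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
print(tiled_attention(q, keys, values, tile=2))
</code></pre>
<p>The result matches ordinary softmax attention for any tile size. On the GPU, each tile would live in shared memory, which is why larger shared memory permits larger tiles and fewer passes.</p>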
<h3 id="continuous-batching-dynamic-resource-utilization">Continuous Batching: Dynamic Resource Utilization</h3>
<p>Continuous batching replaces static batches with dynamic scheduling, allowing requests of different lengths to process together. This technique produces characteristic saw-tooth patterns in GPU utilization metrics as batches naturally grow and shrink.</p>
<p>Effective continuous batching maintains average GPU utilization above 70 percent while keeping variance below 20 percent. The queue depth typically runs at 1.5-2x the optimal batch size, providing a buffer for arrival rate variations. Memory fragmentation should remain below 5 percent, indicating efficient memory management.</p>
<p>The impact on MFU is substantial—typically improving average MFU by 20-40 percent by maintaining consistent GPU saturation. The technique is particularly effective for services with variable request rates, where static batching would either waste resources or introduce unnecessary latency.</p>
<h3 id="pagedattention-memory-efficiency-revolution">PagedAttention: Memory Efficiency Revolution</h3>
<p>PagedAttention applies virtual memory concepts to KV-cache management, storing attention caches in non-contiguous blocks. This produces distinctive memory utilization patterns that confirm proper operation.</p>
<p>Memory utilization with PagedAttention exceeds 95 percent compared to around 60 percent for naive allocation. Block utilization metrics should show over 90 percent of allocated blocks actively used. The technique enables 2-4x larger effective batch sizes with the same memory, directly improving throughput.</p>
<p>The metric signatures include steady memory allocation rates (rather than large chunks), consistent block recycling patterns, and high cache hit rates for shared prefixes. When combined with continuous batching, PagedAttention enables near-optimal memory utilization while maintaining low latency.</p>
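<p>The utilization gap between contiguous and paged allocation can be illustrated with simple block accounting. This is a back-of-the-envelope sketch and not the allocator of any particular serving framework. The block size of 16 tokens, the 2048-token reservation, and the request lengths are all assumptions:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def blocks_needed(num_tokens, block_size=16):
    """KV-cache blocks required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // block_size)


def utilization(seq_lens, max_len, block_size=16):
    """Fraction of reserved KV slots holding real tokens: (contiguous, paged)."""
    used = sum(seq_lens)
    # Contiguous allocation reserves max_len slots for every request up front.
    contiguous = used / (len(seq_lens) * max_len)
    # Paged allocation reserves only whole blocks that are actually touched.
    reserved = sum(blocks_needed(n, block_size) for n in seq_lens) * block_size
    return contiguous, used / reserved


# Hypothetical request lengths in tokens.
contiguous, paged = utilization([300, 1200, 75, 900], max_len=2048)
print(f"contiguous: {contiguous:.0%}, paged: {paged:.0%}")
</code></pre>
<p>With paging, waste is confined to the partially filled last block of each sequence, so it shrinks as the block size decreases.</p>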
<h3 id="quantization-precision-performance-tradeoffs">Quantization: Precision-Performance Tradeoffs</h3>
<p>Quantization techniques produce clear changes in metric patterns that indicate their effectiveness. FP16 to INT8 quantization typically doubles Tensor Core throughput while halving memory bandwidth requirements. The H100&rsquo;s FP8 support can achieve similar improvements with minimal accuracy loss.</p>
<p>Successful quantization shows Tensor Core utilization increasing proportionally with the precision reduction (2x for FP16→INT8). Memory bandwidth utilization decreases by the same factor, often relieving memory bottlenecks. MFU improvements vary but typically range from 1.4-1.9x for compute-bound phases.</p>
<p>The key metric to watch is the balance between compute and memory utilization. Quantization can shift a memory-bound workload to compute-bound, fundamentally changing optimization strategies. This shift appears as increased SM efficiency and decreased memory controller activity.</p>
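<p>A roofline-style calculation shows how that shift can happen. The sketch below assumes weight-only quantization (INT8 weights, FP16 math), counts only weight traffic, and uses approximate H100 SXM peak figures (about 989 TFLOPS dense FP16 and 3.35 TB/s of HBM bandwidth); the function name and the batch size of 200 are illustrative.</p>

```python
def roofline(batch_size, bytes_per_weight, peak_flops, mem_bandwidth):
    """Classify a decode-step matrix multiply as compute- or memory-bound.

    Each weight is read once and used for 2 FLOPs (multiply + add)
    per sequence in the batch, so intensity grows with batch size.
    """
    intensity = 2 * batch_size / bytes_per_weight      # FLOPs per byte read
    ridge = peak_flops / mem_bandwidth                 # FLOPs per byte at the knee
    attainable = min(peak_flops, intensity * mem_bandwidth)
    bound = "compute" if intensity >= ridge else "memory"
    return intensity, attainable, bound

PEAK_FP16 = 989e12      # approximate H100 SXM dense FP16 FLOP/s (assumption)
HBM_BW = 3.35e12        # approximate H100 SXM HBM bandwidth in bytes/s (assumption)

fp16 = roofline(batch_size=200, bytes_per_weight=2, peak_flops=PEAK_FP16, mem_bandwidth=HBM_BW)
int8_weights = roofline(batch_size=200, bytes_per_weight=1, peak_flops=PEAK_FP16, mem_bandwidth=HBM_BW)
print(fp16[2], int8_weights[2])
```

<p>At this batch size the FP16 case sits below the ridge point (memory-bound) and the INT8-weight case sits above it (compute-bound). If the math also runs in INT8 or FP8, the peak and therefore the ridge point rise too, so the classification has to be recomputed with the matching peak.</p>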
<h3 id="speculative-decoding-trading-compute-for-latency">Speculative Decoding: Trading Compute for Latency</h3>
<p>Speculative decoding uses a smaller &ldquo;draft&rdquo; model to predict multiple tokens, then validates them with the full model. This produces unique metric patterns: burst compute activity during speculation followed by validation phases.</p>
<p>Effective speculative decoding shows acceptance rates above 60 percent, meaning most speculated tokens are accepted by the full model. Compute utilization shows a characteristic dual-phase pattern: low during drafting, high during validation. Overall MFU might decrease, because rejected draft tokens are wasted work, but per-token latency improves by 2-3x when properly tuned.</p>
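<p>The memory access patterns reveal the technique&rsquo;s efficiency. The draft model should add little memory traffic relative to the full model, and its hottest data should show high L2 hit rates; full L2 residency is only realistic for very small drafts, since the H100&rsquo;s L2 cache is 50 MB. The validation phase should show coalesced memory access as multiple tokens validate simultaneously.</p>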
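<p>The link between acceptance rate and speedup can be estimated with the closed form from the Leviathan et al. paper listed in the references. The sketch assumes every draft token is accepted independently with probability <code>alpha</code>, that <code>gamma</code> tokens are drafted per step, and that <code>cost_ratio</code> is the cost of one draft pass relative to one full-model pass; the example values are illustrative.</p>

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per full-model validation pass."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def expected_speedup(alpha, gamma, cost_ratio):
    """Tokens per step divided by the relative cost of a step.

    A step costs gamma draft passes plus one full-model pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)

tokens_60 = expected_tokens_per_step(alpha=0.6, gamma=4)
speedup_80 = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.05)
print(round(tokens_60, 2), round(speedup_80, 2))
```

<p>With a 60 percent acceptance rate and four drafted tokens, each validation pass yields a little over two tokens on average; at 80 percent with a cheap draft model the estimate lands in the 2-3x range quoted above.</p>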
<h2 id="production-deployment-best-practices">Production Deployment Best Practices</h2>
<p>Transitioning from optimization in development to production deployment requires systematic approaches to monitoring, alerting, and continuous improvement.</p>
<h3 id="establishing-baseline-metrics">Establishing Baseline Metrics</h3>
<p>Before optimization, establish comprehensive baselines for your specific models and hardware. These baselines should include MFU for both prefill and generation phases, memory bandwidth utilization across different batch sizes, latency percentiles (p50, p95, p99) for various sequence lengths, and power consumption under sustained load.</p>
<p>Baseline establishment should span at least one week of production traffic to capture variations. Daily patterns, weekend differences, and special events all impact metric distributions. Understanding normal variation prevents false alerts and helps identify genuine problems.</p>
<p>The baseline must differentiate between model architectures. A 7B parameter model baseline differs substantially from a 70B model baseline, even on identical hardware. Separate baselines for different operation modes (batch inference, streaming, interactive) prevent inappropriate comparisons.</p>
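<p>A baseline can be as simple as percentiles keyed by model and operation mode. The sketch below uses only the Python standard library; the key layout and the sample latencies are made up for illustration.</p>

```python
import statistics

def latency_baseline(latencies_ms):
    """Return p50, p95, and p99 from a list of latency samples."""
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Keep separate baselines per (model, operation mode); the keys are hypothetical.
samples = {
    ("7b", "interactive"): [float(x) for x in range(1, 101)],
    ("70b", "interactive"): [float(10 * x) for x in range(1, 101)],
}
baselines = {key: latency_baseline(values) for key, values in samples.items()}
print(baselines[("7b", "interactive")])
```

<p>Keeping the key explicit makes it hard to compare a 70B streaming workload against a 7B batch baseline by accident.</p>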
<h3 id="implementing-effective-alerting">Implementing Effective Alerting</h3>
<p>Alert fatigue destroys operational effectiveness, so alerts must be both actionable and important. Critical alerts should trigger only for service-impacting conditions: MFU dropping below 50 percent of baseline for sustained periods, memory utilization exceeding 95 percent with allocation failures, or thermal throttling reducing clock speeds.</p>
<p>Warning-level alerts identify degradation before it impacts service: MFU variance exceeding 20 percent over five-minute windows, queue depths growing beyond 2x normal, or power consumption approaching thermal design limits. These alerts enable proactive intervention.</p>
<p>Informational monitoring tracks optimization opportunities without generating alerts: batch size efficiency below target, quantization candidates based on compute patterns, or scheduling inefficiencies revealed by utilization gaps. Regular review of these metrics drives continuous improvement.</p>
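<p>The critical and warning tiers above translate directly into a small classification rule. This is a sketch, not a monitoring product's API: the function and parameter names are hypothetical, the inputs are assumed to be already aggregated over a sustained window, and the power-limit warning is omitted for brevity.</p>

```python
def classify_alert(mfu, baseline_mfu, mem_util, alloc_failures, throttling,
                   mfu_variance_pct, queue_depth, normal_queue_depth):
    """Map windowed metrics to 'critical', 'warning', or 'ok'."""
    # Critical: service-impacting conditions only.
    if (0.5 * baseline_mfu > mfu
            or (mem_util > 0.95 and alloc_failures)
            or throttling):
        return "critical"
    # Warning: degradation that has not yet hit users.
    if mfu_variance_pct > 20 or queue_depth > 2 * normal_queue_depth:
        return "warning"
    return "ok"

healthy = classify_alert(0.38, 0.40, 0.90, False, False, 10, 8, 8)
degraded = classify_alert(0.38, 0.40, 0.90, False, False, 25, 8, 8)
broken = classify_alert(0.15, 0.40, 0.90, False, False, 10, 8, 8)
print(healthy, degraded, broken)
```

<p>Note that high memory utilization alone does not page anyone here; it only becomes critical when allocation failures accompany it, which keeps the alert actionable.</p>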
<h3 id="continuous-optimization-workflows">Continuous Optimization Workflows</h3>
<p>Production systems require continuous optimization as models, traffic patterns, and requirements evolve. Establish weekly metric reviews comparing current performance to baselines and identifying degradation or improvement opportunities.</p>
<p>A/B testing frameworks should include metric collection for both control and experiment groups. Beyond functional metrics like accuracy, collect detailed performance metrics to understand the full impact of changes. A model change that improves accuracy but degrades MFU by 30 percent might not be worth deploying.</p>
<p>Capacity planning must account for metric trends. If MFU gradually degrades as model complexity increases, infrastructure requirements grow super-linearly. Understanding these relationships enables accurate forecasting and budget planning.</p>
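<p>A back-of-the-envelope model shows why falling MFU compounds with model size. The sketch assumes roughly 2 FLOPs per parameter per generated token (forward pass only, ignoring attention over the context) and an approximate H100 dense FP16 peak of 989 TFLOPS; the traffic level and MFU values are illustrative.</p>

```python
import math

def gpus_needed(tokens_per_second, params, mfu, peak_flops_per_gpu):
    """Estimate GPU count from demand, model size, and achieved MFU."""
    # About 2 FLOPs per parameter per generated token (assumption).
    required_flops = 2 * params * tokens_per_second
    return math.ceil(required_flops / (peak_flops_per_gpu * mfu))

PEAK = 989e12   # approximate H100 SXM dense FP16 FLOP/s (assumption)

small = gpus_needed(10_000, 7e9, mfu=0.40, peak_flops_per_gpu=PEAK)
large_same_mfu = gpus_needed(10_000, 70e9, mfu=0.40, peak_flops_per_gpu=PEAK)
large_lower_mfu = gpus_needed(10_000, 70e9, mfu=0.30, peak_flops_per_gpu=PEAK)
print(small, large_same_mfu, large_lower_mfu)
```

<p>Under these assumptions the larger model needs 4 GPUs at the same MFU and 5 once MFU slips from 40 to 30 percent, so the efficiency loss shows up as extra hardware on top of the growth in model size.</p>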
<h3 id="multi-tenant-optimization-strategies">Multi-Tenant Optimization Strategies</h3>
<p>Production systems rarely serve single models in isolation. Multi-tenant scheduling must balance resource utilization with quality of service, creating complex optimization challenges.</p>
<p>GPU sharing strategies depend on workload characteristics. Time-slicing works well for similar models with predictable resource requirements. Multi-Instance GPU (MIG) provides hardware isolation but reduces flexibility. Spatial sharing requires careful memory management to prevent interference.</p>
<p>Metric collection in multi-tenant environments requires attribution to specific tenants. Per-model MFU tracking reveals which models efficiently use resources. Memory attribution prevents one model from starving others. Power consumption tracking enables accurate cost allocation.</p>
<p>The scheduling algorithm must consider both immediate and future resource availability. Greedy scheduling might achieve high instantaneous utilization but create future bottlenecks. Predictive scheduling based on historical patterns improves overall system efficiency.</p>
<h2 id="future-directions-and-emerging-patterns">Future Directions and Emerging Patterns</h2>
<p>The landscape of LLM inference optimization continues evolving rapidly. Understanding emerging patterns helps prepare for future developments.</p>
<h3 id="algorithmic-innovations">Algorithmic Innovations</h3>
<p>Attention mechanism improvements continue emerging. Techniques like Linear Attention and Performer reduce complexity from O(n²) to O(n), fundamentally changing the computational requirements. While these haven&rsquo;t yet matched traditional attention&rsquo;s quality, rapid progress suggests breakthrough potential.</p>
<p>Mixture of Experts (MoE) architectures enable larger models without proportional compute increases. By activating only relevant experts for each token, MoE models achieve effective parameter counts far exceeding dense models while maintaining manageable computational requirements. The metric patterns for MoE models differ substantially, requiring new optimization approaches.</p>
<p>Retrieval-augmented generation (RAG) shifts computation from parameter storage to dynamic retrieval. This architectural change produces different bottlenecks—network I/O and database access rather than GPU memory bandwidth. Understanding these patterns becomes crucial as RAG adoption increases.</p>
<h3 id="software-framework-evolution">Software Framework Evolution</h3>
<p>The competition between inference frameworks drives rapid innovation. vLLM&rsquo;s PagedAttention, TensorRT-LLM&rsquo;s kernel fusion, and DeepSpeed&rsquo;s pipeline parallelism each offer unique advantages. Framework selection significantly impacts achievable metrics—the same model might achieve 30 percent MFU with one framework and 45 percent with another.</p>
<p>Automatic optimization techniques reduce the expertise required for high performance. Compilers that automatically select optimal kernel implementations, batch sizes, and parallelism strategies democratize optimization. However, understanding underlying metrics remains crucial for pushing beyond automatic optimization limits.</p>
<p>The convergence of training and inference frameworks simplifies deployment but introduces complexity. Frameworks must now optimize for both phases, with different requirements and bottlenecks. This convergence produces new metric patterns that require careful interpretation.</p>
<h2 id="conclusion-the-path-to-excellence">Conclusion: The Path to Excellence</h2>
<p>Mastering GPU monitoring for LLM inference requires deep understanding of hardware architecture, comprehensive metric collection, and systematic optimization approaches. The H100&rsquo;s revolutionary architecture provides unprecedented capability, but realizing its potential demands expertise in interpreting complex metric relationships and applying appropriate optimization techniques.</p>
<p>The journey from basic GPU utilization monitoring to sophisticated MFU optimization transforms both system performance and economics. Organizations that master these techniques achieve 2-10x performance improvements, directly impacting service quality and operational costs.</p>
<p>Remember that MFU is not just a metric—it&rsquo;s a philosophy of efficiency that permeates every aspect of LLM deployment. By understanding the intricate dance between compute and memory, between hardware capability and algorithmic requirements, we can build inference systems that deliver breakthrough performance at sustainable costs.</p>
<p>The future of LLM inference belongs to those who can see beyond surface-level metrics to understand the deep patterns of GPU behavior. Armed with the knowledge in this guide, you&rsquo;re equipped to join the ranks of teams achieving world-class inference performance. The difference between amateur and professional deployment isn&rsquo;t just knowledge—it&rsquo;s the systematic application of that knowledge to continuously improve and optimize.</p>
<h2 id="references">References</h2>
<ul>
<li><strong>NVIDIA H100 Architecture</strong>: NVIDIA. (2022). <em>NVIDIA H100 Tensor Core GPU Architecture: The Engine of the World&rsquo;s AI Infrastructure</em>. NVIDIA Whitepaper.</li>
<li><strong>Attention Is All You Need</strong>: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &amp; Polosukhin, I. (2017). <em>Attention Is All You Need</em>. arXiv preprint arXiv:1706.03762.</li>
<li><strong>FlashAttention-2</strong>: Dao, T. (2023). <em>FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning</em>. arXiv preprint arXiv:2307.08691.</li>
<li><strong>CUDA MODE (Flash Attention 2)</strong>: Mills, C. (2023). <em>CUDA MODE - Lecture 12 - Flash Attention 2</em>. <a href="https://christianjmills.com/posts/cuda-mode-notes/lecture-012/#flash-attention-2-advanced-techniques">christianjmills.com</a>.</li>
<li><strong>PagedAttention (vLLM)</strong>: Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., &amp; Stoica, I. (2023). <em>Efficient Memory Management for Large Language Model Serving with PagedAttention</em>. arXiv preprint arXiv:2309.06180.</li>
<li><strong>Roofline Model</strong>: Williams, S., Waterman, A., &amp; Patterson, D. (2009). <em>Roofline: An Insightful Visual Performance Model for Multicore Architectures</em>. Communications of the ACM, 52(4), 65-76.</li>
<li><strong>Speculative Decoding</strong>: Leviathan, Y., Kalman, M., &amp; Matias, Y. (2022). <em>Fast Inference from Transformers via Speculative Decoding</em>. arXiv preprint arXiv:2211.17192.</li>
<li><strong>Transformer Inference Arithmetic</strong>: Chen, C. (2022). <em>Transformer Inference Arithmetic</em>. <a href="https://kipp.ly/transformer-inference-arithmetic/">kipp.ly</a>.</li>
<li><strong>The Transformer Inference Guide</strong>: Baseten. (2023). <em>The Full Guide to Transformer Model Inference</em>. <a href="https://www.baseten.co/blog/llm-transformer-inference-guide/">Baseten Blog</a>.</li>
<li><strong>LLM Inference Performance Engineering</strong>: Databricks. (2023). <em>LLM Inference Performance Engineering: Best Practices</em>. <a href="https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices">Databricks Blog</a>.</li>
<li><strong>Scaling Deep Learning on GPUs</strong>: The JAX Authors. (2024). <em>Scaling Deep Learning</em>. <a href="https://jax-ml.github.io/scaling-book/gpus/">jax-ml.github.io</a>.</li>
</ul>
<p>As models grow larger and demands increase, the importance of efficient inference will only intensify. The techniques and understanding developed today will compound, creating sustainable competitive advantages for organizations that invest in deep technical excellence. The path to that excellence begins with understanding your hardware, measuring what matters, and relentlessly optimizing based on data-driven insights.</p>
]]></content:encoded></item><item><title>Beyond Prefix Caching: How LMCache Turns KV Cache into Composable LEGO Blocks</title><link>https://www.mdjawad.com/posts/beyond-prefix-caching/</link><pubDate>Sat, 09 Aug 2025 21:36:22 +0800</pubDate><guid>https://www.mdjawad.com/posts/beyond-prefix-caching/</guid><description>How LMCache Turns KV Cache into Composable LEGO Blocks</description><content:encoded><![CDATA[<p>Imagine if every time you wanted to build something with LEGOs, you had to start from scratch—even when building similar structures. That&rsquo;s essentially how we&rsquo;ve been managing KV caches in production LLMs. Until now.</p>
<h2 id="the-328gb-elephant-in-the-room">The 328GB Elephant in the Room</h2>
<p>Here&rsquo;s what nobody tells you about serving long-context LLMs: that impressive 128K context window your model supports? It&rsquo;s basically unusable in production. Not because of compute limitations, but because of a memory crisis hiding in plain sight.</p>
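<p>Let me show you the brutal math for a Llama 3 70B-sized model in FP16, counting a separate key/value pair for each of its 64 attention heads (Llama 3 itself uses grouped-query attention with 8 KV heads, which shrinks these figures by 8×):</p>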
<table>
  <thead>
      <tr>
          <th style="text-align: left">Context Length</th>
          <th style="text-align: left">KV Cache Size</th>
          <th style="text-align: left">% of Model Weights</th>
          <th style="text-align: left">Reality Check</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">8K tokens</td>
          <td style="text-align: left">21 GB</td>
          <td style="text-align: left">15%</td>
          <td style="text-align: left">Fits on one GPU</td>
      </tr>
      <tr>
          <td style="text-align: left">32K tokens</td>
          <td style="text-align: left">84 GB</td>
          <td style="text-align: left">60%</td>
          <td style="text-align: left">Exceeds H100 capacity</td>
      </tr>
      <tr>
          <td style="text-align: left">128K tokens</td>
          <td style="text-align: left">328 GB</td>
          <td style="text-align: left">234%</td>
          <td style="text-align: left">Needs 4+ H100s (!!)</td>
      </tr>
  </tbody>
</table>
<p>That&rsquo;s right: the KV cache for a single 128K-context request requires more memory than the entire model weights. More than twice as much. This is why most production deployments silently cap contexts at 8-16K tokens, leaving those impressive context capabilities as nothing more than marketing numbers.</p>
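<p>The arithmetic behind the table is short enough to check by hand. The sketch below assumes FP16 values and the published Llama 3 70B shape (80 layers, head dimension 128, 64 query heads, 8 KV heads); the function name is illustrative. With one key/value pair per query head it gives roughly 21, 84, and 336 GB, close to the table; with the model&rsquo;s actual 8 KV heads the same contexts need about one eighth of that.</p>

```python
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in decimal GB for a single request."""
    # Factor of 2: one key vector and one value vector per head per layer.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token / 1e9

for tokens in (8_000, 32_000, 128_000):
    full_heads = kv_cache_gb(tokens, layers=80, kv_heads=64, head_dim=128)
    grouped = kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128)
    print(f"{tokens:>7} tokens: {full_heads:6.1f} GB (64 KV heads), "
          f"{grouped:5.1f} GB (8 KV heads)")
```

<p>Either way the cache grows linearly with context length and with the number of concurrent requests, which is what makes long contexts expensive to serve.</p>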
<h2 id="the-wasteful-status-quo">The Wasteful Status Quo</h2>
<p>Traditional serving engines treat KV caches like disposable napkins: use once, throw away. Every time you:</p>
<ul>
<li><strong>Continue a conversation</strong> → Recompute the entire chat history</li>
<li><strong>Process a common document in RAG</strong> → Recompute from scratch</li>
<li><strong>Use the same system prompt</strong> → Recompute yet again</li>
</ul>
<p>It&rsquo;s like demolishing your LEGO castle every time you want to add a tower. Wasteful? Absolutely. Necessary? Not anymore.</p>
<p>Enter LMCache: a system that fundamentally reimagines KV caches not as temporary computational byproducts, but as reusable, composable knowledge blocks—like LEGOs for your model&rsquo;s attention memory.</p>
<p>[Diagram 1: Traditional vs LMCache approach]</p>
<h2 id="the-core-insight-knowledge-should-be-reusable">The Core Insight: Knowledge Should Be Reusable</h2>
<p>LMCache operates on a simple but powerful principle:</p>
<blockquote>
<p>&ldquo;Prefill each text only once.&rdquo;</p></blockquote>
<p>Think of it like a Content Delivery Network (CDN), but instead of caching static website assets, you&rsquo;re caching computed attention patterns. Just as Netflix doesn&rsquo;t re-encode the same movie for every viewer, why should we recompute the same document&rsquo;s KV cache for every user?</p>
<p>This isn&rsquo;t just clever engineering—it&rsquo;s a fundamental shift in how we think about LLM memory. And it&rsquo;s made possible by three breakthrough innovations that each solve a seemingly impossible problem.</p>
<h2 id="innovation-1-cachegen--compression-that-beats-physics">Innovation #1: CacheGen – Compression That Beats Physics</h2>
<p><strong>The Problem:</strong> Moving 328GB of KV cache from GPU to CPU should be impossible without killing performance. The PCIe bus delivers 64 GB/s while GPU memory delivers 3,000 GB/s—a crushing 47× bottleneck.</p>
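<p>Plugging the numbers from the paragraph above into a one-line model makes the gap tangible. The function name is illustrative, and the calculation assumes the links run at their full quoted bandwidth.</p>

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Time to move a payload at a sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s

pcie = transfer_seconds(328, 64)       # over the PCIe bus
hbm = transfer_seconds(328, 3000)      # within GPU memory
slowdown = pcie / hbm
print(f"PCIe: {pcie:.2f} s, HBM: {hbm:.2f} s, ratio: {slowdown:.0f}x")
```

<p>A transfer of about five seconds per request is far too slow to hide behind token generation, which is why the raw cache cannot simply be shipped across the bus.</p>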
<p><img alt="CacheGen: Compression That Beats Physics" loading="lazy" src="/images/posts/beyond-prefix-caching/cachegen_illustration.png"></p>
<p><strong>The Solution:</strong> CacheGen doesn&rsquo;t try to beat the bandwidth limit—it sidesteps it entirely with purpose-built compression that understands the unique structure of KV cache data.</p>
<p>Here&rsquo;s the clever part: KV cache tensors aren&rsquo;t random data. They have patterns:</p>
<ul>
<li><strong>Layers have personalities:</strong> Some transformer layers are robust to compression, others are sensitive. CacheGen profiles each layer and applies custom quantization—aggressive where it can be, gentle where it must be.</li>
<li><strong>Local correlation is high:</strong> Adjacent values often differ by small amounts. Delta encoding stores just the differences, slashing storage needs.</li>
<li><strong>GPUs are parallel beasts:</strong> Instead of decompressing on the CPU (creating a new bottleneck), custom CUDA kernels decompress directly on the GPU using massive parallelism.</li>
</ul>
<p>The result? That impossible 328GB transfer becomes a manageable 76GB—a 4.3× reduction that makes PCIe viable without sacrificing inference speed.</p>
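<p>The delta-encoding idea from the list above can be shown on a handful of numbers. This is a deliberately simplified sketch, not CacheGen&rsquo;s codec: it uses a single uniform quantization step and plain integer differences, and all names and values are made up for illustration.</p>

```python
def delta_encode(values, step):
    """Quantize to multiples of `step`, then store neighbor differences."""
    quantized = [round(v / step) for v in values]
    # First entry is kept as-is; the rest are small integers when
    # adjacent values are close, which makes them cheap to entropy-code.
    return [quantized[0]] + [b - a for a, b in zip(quantized, quantized[1:])]

def delta_decode(deltas, step):
    """Invert delta_encode up to the quantization error."""
    out, running = [], 0
    for d in deltas:
        running += d
        out.append(running * step)
    return out

values = [0.50, 0.52, 0.51, 0.55, 0.54]
deltas = delta_encode(values, step=0.01)
restored = delta_decode(deltas, step=0.01)
print(deltas, restored)
```

<p>A coarser <code>step</code> for robust layers and a finer one for sensitive layers is the simplest way to picture per-layer quantization on top of this.</p>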
<p>[Diagram 3: CacheGen pipeline]</p>
<h2 id="innovation-2-cacheblend--the-lego-magic">Innovation #2: CacheBlend – The LEGO Magic</h2>
<p><strong>The Problem:</strong> Traditional caching is rigid. It only works for exact prefixes—like having LEGO blocks that only stick together in one specific order.</p>
<p>Consider a typical RAG prompt:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>[System Prompt] + [Retrieved Doc A] + [Retrieved Doc B] + [User Query]
</span></span></code></pre></div><p>Even if you&rsquo;ve cached Doc A and Doc B separately, you can&rsquo;t simply reuse them. Why? Because attention is causal and position-dependent: each token&rsquo;s KV values depend on every token before it and on where it sits in the sequence. When Doc A follows a system prompt instead of appearing first, its KV values change. Naive concatenation produces garbage.</p>
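<p>The rigidity of prefix-only caching is easy to see if cache keys are built the way a prefix cache has to build them, by hashing each chunk together with everything before it. This is a toy keying scheme for illustration, not the scheme of any particular engine; the chunk strings are placeholders.</p>

```python
import hashlib

def prefix_chunk_keys(chunks):
    """Key each chunk by a hash of itself and all preceding chunks."""
    keys, running = [], hashlib.sha256()
    for chunk in chunks:
        # The separator keeps chunk boundaries from being ambiguous.
        running.update(chunk.encode("utf-8") + b"\x00")
        keys.append(running.copy().hexdigest())
    return keys

cached = set(prefix_chunk_keys(["doc_a", "doc_b"]))

same_prefix = prefix_chunk_keys(["doc_a", "doc_b", "query"])
rag_prompt = prefix_chunk_keys(["system", "doc_a", "doc_b", "query"])

print([key in cached for key in same_prefix])
print([key in cached for key in rag_prompt])
```

<p>The first request reuses both cached documents; the second, which only prepends a system prompt, hits nothing even though both documents are byte-for-byte identical.</p>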
<p><img alt="The Caching Trilemma: why naive concatenation fails" loading="lazy" src="/images/posts/beyond-prefix-caching/caching_trilemma.png"></p>
<p><strong>The Solution:</strong> CacheBlend makes KV caches truly composable—like LEGO blocks that intelligently adapt to their neighbors.</p>
<p>The key insight: when you move a cached chunk to a new position, most tokens (∼90%) barely change. Only a small subset—the &ldquo;high-deviation tokens&rdquo;—need updating. CacheBlend:</p>
<ul>
<li><strong>Identifies the 10% that matter:</strong> Uses attention analysis to find tokens whose KV values would change most</li>
<li><strong>Surgically updates just those tokens:</strong> Recomputes only the high-deviation subset</li>
<li><strong>Blends the updates:</strong> Fuses new values with the original cache</li>
</ul>
<p>This transforms rigid, prefix-only caching into flexible, LEGO-like composition. Same cached blocks, infinite arrangements.</p>
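<p>The select-and-patch step can be sketched on plain lists. This is not CacheBlend&rsquo;s implementation: the per-token deviation scores are taken as given, <code>recompute_fn</code> stands in for a real forward pass over the selected tokens, and every name here is hypothetical.</p>

```python
def blend_kv(cached_kv, recompute_fn, deviation, ratio=0.1):
    """Recompute only the highest-deviation tokens and keep the rest.

    cached_kv: per-token cached values (any objects).
    deviation: per-token estimate of how much the value would change.
    """
    k = max(1, round(len(cached_kv) * ratio))
    # Indices of the k tokens whose cached values are most out of date.
    selected = sorted(range(len(cached_kv)),
                      key=lambda i: deviation[i], reverse=True)[:k]
    blended = list(cached_kv)
    for i in selected:
        blended[i] = recompute_fn(i)   # fresh value replaces the stale one
    return blended, sorted(selected)

cached = [0.0] * 10
deviation = [0.0, 0.9, 0.1, 0.0, 0.8, 0.0, 0.0, 0.2, 0.0, 0.0]
blended, updated = blend_kv(cached, lambda i: 1.0, deviation, ratio=0.2)
print(updated, blended)
```

<p>The cost of a blend scales with <code>ratio</code> rather than with chunk length, which is where the savings over a full prefill come from.</p>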
<p>[Diagram 5: CacheBlend LEGO composition]</p>
<h2 id="innovation-3-hierarchical-memory--the-full-stack">Innovation #3: Hierarchical Memory – The Full Stack</h2>
<p><strong>The Problem:</strong> Even with compression, you can&rsquo;t fit everything in GPU memory. You need a bigger house for your LEGOs.</p>
<p><strong>The Solution:</strong> LMCache implements a complete memory hierarchy, treating GPU, CPU, and SSD as a unified pool—just like modern CPUs treat L1, L2, L3 caches and RAM.</p>
<p><img alt="LMCache&rsquo;s Hierarchical Memory Architecture" loading="lazy" src="/images/posts/beyond-prefix-caching/hierarchical_memory_architecture.png"></p>
<p>But here&rsquo;s what makes it brilliant:</p>
<ul>
<li><strong>Asynchronous everything:</strong> Saving to slower tiers never blocks inference. Your GPU keeps generating while caches migrate in the background.</li>
<li><strong>Predictive prefetching:</strong> LMCache learns access patterns and preloads caches from SSD to RAM before they&rsquo;re needed, hiding the latency.</li>
<li><strong>Distributed sharing:</strong> Through Redis or LMCache&rsquo;s server, multiple GPUs share a global cache pool. One GPU&rsquo;s computation becomes everyone&rsquo;s asset.</li>
</ul>
<p>It&rsquo;s like having a smart assistant who knows which LEGO sets you&rsquo;ll need next and quietly moves them from the basement to your desk before you ask.</p>
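<p>A tiered cache with demotion and promotion fits in a small class. The sketch is synchronous and in-memory, so it shows only the placement policy, not the asynchronous transfers or prefetching described above; the class name, tier names, and capacities are assumptions.</p>

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy LRU hierarchy: evictions cascade from fast tiers to slow ones."""

    def __init__(self, capacities):
        # capacities is ordered fastest to slowest, e.g. gpu, cpu, ssd.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def put(self, key, value, level=0):
        if level >= len(self.tiers):
            return                      # fell off the slowest tier
        name, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)
        if len(store) > cap:
            # Demote the least recently used entry one tier down.
            old_key, old_value = store.popitem(last=False)
            self.put(old_key, old_value, level + 1)

    def get(self, key):
        for name, cap, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self.put(key, value, 0)  # promote to the fastest tier
                return value, name
        return None, None

cache = TieredKVCache({"gpu": 1, "cpu": 1, "ssd": 1})
for chunk in ("a", "b", "c"):
    cache.put(chunk, f"kv-{chunk}")
first_hit = cache.get("a")    # oldest entry, found in the slowest tier
second_hit = cache.get("a")   # promoted by the first lookup
print(first_hit, second_hit)
```

<p>A real system would move data between tiers in the background so that a demotion never stalls the GPU; the policy of which entry lives where can stay this simple.</p>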
<h2 id="real-world-impact-from-painful-to-practical">Real-World Impact: From Painful to Practical</h2>
<p>The numbers speak for themselves:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Scenario</th>
          <th style="text-align: left">Without LMCache</th>
          <th style="text-align: left">With LMCache</th>
          <th style="text-align: left">Speedup</th>
          <th style="text-align: left">User Experience</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">25K token conversation</td>
          <td style="text-align: left">28 seconds</td>
          <td style="text-align: left">3.7 seconds</td>
          <td style="text-align: left">7.7×</td>
          <td style="text-align: left">☠️ → 😊</td>
      </tr>
      <tr>
          <td style="text-align: left">RAG with 4 documents</td>
          <td style="text-align: left">13 seconds</td>
          <td style="text-align: left">3.6 seconds</td>
          <td style="text-align: left">3.6×</td>
          <td style="text-align: left">😔 → 😊</td>
      </tr>
  </tbody>
</table>
<p>These aren&rsquo;t incremental improvements—they&rsquo;re the difference between &ldquo;users abandon your product&rdquo; and &ldquo;users love your product.&rdquo;</p>
<h2 id="playing-nice-with-others-the-pagedattention-synergy">Playing Nice with Others: The PagedAttention Synergy</h2>
<p>A common question: doesn&rsquo;t vLLM&rsquo;s PagedAttention already solve memory problems?</p>
<p>Not quite. They&rsquo;re complementary pieces of the same puzzle:</p>
<p><img alt="PagedAttention vs. LMCache" loading="lazy" src="/images/posts/beyond-prefix-caching/paged_attention_vs_lmcache.png"></p>
<ul>
<li><strong>PagedAttention:</strong> Solves fragmentation within GPU memory (like defragging your hard drive)</li>
<li><strong>LMCache:</strong> Extends total memory across tiers (like adding more hard drives)</li>
</ul>
<p>Together, they form a complete memory management stack: PagedAttention ensures efficient packing, and LMCache extends capacity far beyond GPU memory.</p>
<h2 id="the-bigger-picture-a-new-era-of-llm-infrastructure">The Bigger Picture: A New Era of LLM Infrastructure</h2>
<p>We&rsquo;re witnessing a fundamental shift in what limits LLM deployment:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Era</th>
          <th style="text-align: left">Bottleneck</th>
          <th style="text-align: left">Solution</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">2020-2022</td>
          <td style="text-align: left">Raw compute</td>
          <td style="text-align: left">Better GPUs, optimized kernels</td>
      </tr>
      <tr>
          <td style="text-align: left">2022-2023</td>
          <td style="text-align: left">Memory fragmentation</td>
          <td style="text-align: left">PagedAttention</td>
      </tr>
      <tr>
          <td style="text-align: left">2024+</td>
          <td style="text-align: left">Total memory capacity</td>
          <td style="text-align: left">Hierarchical caching (LMCache)</td>
      </tr>
  </tbody>
</table>
<p>The future isn&rsquo;t just about making models bigger—it&rsquo;s about making them remember intelligently. LMCache represents a paradigm shift: from treating KV cache as disposable waste to managing it as valuable, reusable knowledge.</p>
<h2 id="what-this-means-for-you">What This Means for You</h2>
<p>If you&rsquo;re running production LLMs, LMCache changes the game:</p>
<ul>
<li><strong>Those 128K context windows become actually usable</strong> – not just marketing specs</li>
<li><strong>Multi-turn conversations become affordable</strong> – no more recomputing entire histories</li>
<li><strong>RAG at scale becomes practical</strong> – cache once, reuse everywhere</li>
<li><strong>GPU costs drop dramatically</strong> – same hardware, up to 7× faster responses when the cache hits</li>
</ul>
<p>The best part? LMCache integrates seamlessly with vLLM. It&rsquo;s not a replacement—it&rsquo;s an upgrade.</p>
<h2 id="the-lego-future">The LEGO Future</h2>
<p>LMCache shows us what modern LLM serving should look like: modular, composable, and intelligent. Just as LEGO blocks revolutionized construction toys by making everything reusable and composable, LMCache is doing the same for LLM memory.</p>
<p>We&rsquo;re moving from a world where every inference request starts from scratch to one where computed knowledge accumulates, persists, and compounds. It&rsquo;s not just an optimization—it&rsquo;s an architectural revolution.</p>
<p>The question isn&rsquo;t whether you need hierarchical KV caching. It&rsquo;s whether you can afford to keep throwing away 90% of your GPU&rsquo;s work. In a world where every millisecond and every GB matters, the answer is clear.</p>
<p>Welcome to the era of composable AI memory. Time to start building.</p>
<h2 id="references">References</h2>
<ul>
<li><strong>CacheBlend:</strong> Yao, J., Li, H., Liu, Y., Ray, S., Cheng, Y., Zhang, Q., Du, K., Lu, S., &amp; Jiang, J. (2024). <a href="https://arxiv.org/abs/2405.16444"><em>CACHEBLEND: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion</em></a>. arXiv preprint arXiv:2405.16444.</li>
<li><strong>CacheGen:</strong> Liu, Y., Li, H., Du, K., Yao, J., Cheng, Y., Huang, Y., Lu, S., Maire, M., Hoffmann, H., Holtzman, A., &amp; Jiang, J. (2023). <a href="https://arxiv.org/abs/2310.07240"><em>CacheGen: Fast Context Loading for Language Model Applications</em></a>. arXiv preprint arXiv:2310.07240.</li>
<li><strong>PagedAttention (vLLM):</strong> Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., &amp; Stoica, I. (2023). <a href="https://arxiv.org/abs/2309.06180"><em>Efficient Memory Management for Large Language Model Serving with PagedAttention</em></a>. arXiv preprint arXiv:2309.06180.</li>
<li><strong>LMCache Project:</strong> The official GitHub repository for the LMCache system, including implementations of CacheGen and CacheBlend.</li>
<li><strong>vLLM Project:</strong> The official GitHub repository for the vLLM serving engine, which LMCache integrates with.</li>
</ul>
]]></content:encoded></item><item><title>About Me</title><link>https://www.mdjawad.com/about/</link><pubDate>Sat, 09 Aug 2025 17:45:34 +0800</pubDate><guid>https://www.mdjawad.com/about/</guid><description>&lt;div style="display: flex; align-items: center; gap: 20px;">
 &lt;div style="flex: 0 0 150px;">
 &lt;img src="https://www.mdjawad.com/images/about/jawad_profile_pic_cropped.png" alt="Mohamed Jawad" style="border-radius: 50%; width: 150px; height: 150px; object-fit: cover;">
 &lt;/div>
 &lt;div style="flex: 1;">
 &lt;p>Hello! I'm Jawad, an Engineer/ Builder/ Leader with over 15 years of experience crafting and scaling AI and machine learning solutions across the ride-hailing, finance, and enterprise sectors. My passion lies in building high-performing technical teams and driving real-world business impact through AI innovation.&lt;/p>
 &lt;/div>
&lt;/div>
&lt;p>Currently, at Singapore&amp;rsquo;s Home Team Science and Technology Agency (HTX), I lead the AI Platform team, where we focus on designing and implementing scalable AI inference infrastructure. My work involves advanced optimization techniques to deploy large-scale language model (LLM) services, enabling enterprise applications to integrate powerful predictive AI capabilities.&lt;/p></description><content:encoded><![CDATA[<div style="display: flex; align-items: center; gap: 20px;">
  <div style="flex: 0 0 150px;">
    <img src="/images/about/jawad_profile_pic_cropped.png" alt="Mohamed Jawad" style="border-radius: 50%; width: 150px; height: 150px; object-fit: cover;">
  </div>
  <div style="flex: 1;">
    <p>Hello! I'm Jawad, an Engineer/ Builder/ Leader with over 15 years of experience crafting and scaling AI and machine learning solutions across the ride-hailing, finance, and enterprise sectors. My passion lies in building high-performing technical teams and driving real-world business impact through AI innovation.</p>
  </div>
</div>
<p>Currently, at Singapore&rsquo;s Home Team Science and Technology Agency (HTX), I lead the AI Platform team, where we focus on designing and implementing scalable AI inference infrastructure. My work involves advanced optimization techniques to deploy large-scale language model (LLM) services, enabling enterprise applications to integrate powerful predictive AI capabilities.</p>
<h3 id="my-journey">My Journey</h3>
<p>My career has been a journey through the evolving landscape of data and AI. As a founding engineer at <a href="https://cleric.ai/">Cleric</a>, I architected an LLM-powered SRE automation platform from the ground up. At <a href="https://gojek.io/">Gojek</a>, I led the mobility data division, overseeing critical systems like ride-matching, dynamic pricing, and logistics for one of Southeast Asia&rsquo;s largest platforms. It was there I spearheaded initiatives that improved unit economics by 15% and built the multi-objective optimization engine that boosted completed bookings by up to 10%.</p>
<h3 id="core-expertise">Core Expertise</h3>
<ul>
<li><strong>AI &amp; Machine Learning:</strong> LLM-based RAG systems, LLMOps, multi-agent architectures, deep learning, NLP, and recommendation systems.</li>
<li><strong>Data Engineering &amp; Infrastructure:</strong> Building scalable ML platforms using Airflow, Spark, Kafka, Kubernetes, and event-driven architectures.</li>
<li><strong>Data Science &amp; Analytics:</strong> Causal inference, A/B testing, statistical analysis, and multi-objective optimization.</li>
</ul>
<h3 id="teaching--education">Teaching &amp; Education</h3>
<p>I&rsquo;m passionate about teaching data science and currently serve as the lead instructor for the &ldquo;Advanced Professional Certificate in Data Science and AI&rdquo; program at Nanyang Technological University, Singapore. This comprehensive program equips professionals with practical skills in data science and artificial intelligence, preparing them for real-world challenges in the field. <a href="https://www.ntu.edu.sg/pace/for-individuals/programmes/detail/(sctp)-data-science-and-ai#outline">Learn more about the course</a>.</p>
<h3 id="lets-connect">Let&rsquo;s Connect</h3>
<p>I&rsquo;m always open to discussing new ideas in AI, MLOps, and scalable system design. Feel free to connect with me on LinkedIn or Twitter.</p>

<hr>
<h3 id="selected-talks--publications">Selected Talks &amp; Publications</h3>
<ul>
<li><strong>Ray, Arka, Mohamed Jawad Askar Ali, et al.</strong> <a href="https://arxiv.org/abs/2605.10391">&ldquo;Phoenix-VL 1.5 Medium Technical Report.&rdquo;</a> <em>arXiv preprint (2026).</em> A 123B-parameter multimodal foundation model adapted to regional languages and the Singapore context.</li>
<li><strong>LLM Inference from Jupyter to Production</strong> - <em>Data Innovation Summit APAC - 2025</em></li>
<li><strong>Designing Agentic Systems: Lessons Learned from the Trenches</strong> - <em>2024</em></li>
<li><a href="https://www.youtube.com/watch?v=NzlkXRiio70"><strong>Scaling Ride-Hailing with Machine Learning on MLflow</strong></a> - <em>SPARK AI Data Summit, San Francisco 2019</em></li>
<li><strong>Goh, Yang Miang, and Mohamed Jawad Askar Ali.</strong> <a href="https://www.sciencedirect.com/science/article/abs/pii/S0001457515300725">&ldquo;A hybrid simulation approach for integrating safety behaviour into construction planning: An earthmoving case study.&rdquo;</a> <em>Accident Analysis &amp; Prevention (2015).</em></li>
</ul>
]]></content:encoded></item></channel></rss>