The Evolution of Attention, Part 1: From MHA to Latent Compression
Part 1 of 2. Every attention variant since 2019 fights the same number: KV cache bytes per token. This post traces the first wave of answers, from MHA through MQA and GQA, to DeepSeek-V2’s Multi-head Latent Attention. We end at the 57× cache reduction that comes from caching a low-rank latent and never materializing K or V at inference.