Architecture

The Evolution of Attention, Part 1: From MHA to Latent Compression

Part 1 of 2. Every attention variant since 2019 fights the same number: KV cache bytes per token. This post traces the first wave of answers, from MHA through MQA and GQA, to DeepSeek-V2’s Multi-head Latent Attention. We end at the 57× cache reduction that comes from caching a low-rank latent and never materializing K or V at inference.

The Platform Around the Agent: What Enterprise Architects Actually Build

Most enterprises have bought an AI coding agent and are stuck. The ones generating real productivity gains didn’t win by picking a better model. They built a platform around the agent. This post walks through the five control-plane responsibilities that separate the 11% of AI-native orgs from the 95% reporting zero ROI, grounded in public deployments from Block, Shopify, Atlassian, Airbnb, and others.

Inside Claude Code: Anatomy of a 512K-Line AI Agent

An interactive technical breakdown of Claude Code’s architecture — from the query loop and five compaction mechanisms to the permission pipeline and feature flags. Based on source code analysis of ~1,884 TypeScript files.

State Space Models and the Mamba Architecture: From First Principles to Mamba-3

NVIDIA’s Nemotron-3-Super, IBM’s Granite, and AI21’s Jamba all ship hybrid SSM-Transformer architectures in production. This post builds State Space Models from scratch, starting with a single differential equation, and works up through HiPPO, S4, and the three generations of Mamba to explain why.

Durable Execution for AI Agents: Temporal’s Architecture for Production Reliability

Production AI agents face infrastructure problems that framework-level code cannot solve: state loss on crashes, LLM API flakiness, debugging non-deterministic behavior, and coordinating human approvals across hours-long runs. This post walks through Temporal’s durable execution model and why companies like OpenAI chose it for their agent infrastructure.

Dissecting OpenClaw: An Interactive Architecture Map

An interactive visual exploration of OpenClaw — the open-source AI agent that broke GitHub. Explore its three-layer architecture, two key primitives, memory system, and composable system prompt.

Why `vllm serve` Works on Day Zero (and What It Takes to Make It Fast)

A deep dive into vLLM’s tiered model integration — from the Transformers fallback that enables zero-day support to the native integration path that makes it fast.

The Anatomy of Agentic Code Assist: Building Production Grade AI Coding Agents

A deep dive into the architecture, design patterns, and engineering decisions behind production-grade agentic code assist solutions. By dissecting OpenHands, we uncover how to build AI agents that safely execute code, manage complex state, and operate reliably in production.