A deep dive into vLLM’s tiered model integration — from the Transformers fallback that enables zero-day support to the native integration path that makes it fast.

Orchestrating Inference: How Kubernetes, Ray, and vLLM Coordinate Under the Hood
A deep dive into how Kubernetes, Ray, and vLLM coordinate to transform independent GPUs into a synchronized inference machine.

The Hidden Software Stack Behind Fast LLM Inference
Beyond vLLM and PagedAttention: exploring NCCL, CUTLASS, Triton, and FlashInfer, the libraries that actually make LLM inference fast.

Speculative Decoding: When Guessing Right Makes for Faster Inference
How speculative decoding achieves 2-3× inference speedup without changing model outputs, and why GLM-4.7’s native multi-token prediction marks a paradigm shift.

The Anatomy of Agentic Code Assist: Building Production-Grade AI Coding Agents
A deep dive into the architecture, design patterns, and engineering decisions behind production-grade agentic code assist solutions. By dissecting OpenHands, we uncover how to build AI agents that safely execute code, manage complex state, and operate reliably in production.

QuIP#: Achieving Near-Lossless 2-Bit LLM Quantization
The QuIP# algorithm for quantizing LLM weights to 2 bits without gradient information.

Why Can Your Laptop Run LLaMA? A Deep Dive into Quantization
How 4-8× compression and Hessian-guided GPTQ make 70B-scale models practical on modest hardware: what INT8/INT4 really cost, and when accuracy holds.

Flash Attention: The Mathematical Tricks That Broke the Memory Wall
How Flash Attention’s tiling and online softmax compute exact attention without ever materializing the full attention matrix.

Advanced NVIDIA GPU Monitoring for LLM Inference: A Deep Dive into H100 Architecture and Performance Optimization
A deep dive into NVIDIA’s H100 architecture and the monitoring techniques required for production-grade LLM inference optimization.

Beyond Prefix Caching: How LMCache Turns KV Cache into Composable LEGO Blocks
How LMCache stores and reuses KV cache beyond simple prefix matching, treating cached segments as composable building blocks.