MdJawad

The Anatomy of Agentic Code Assist: Building Production Grade AI Coding Agents

A deep dive into the architecture, design patterns, and engineering decisions behind production-grade agentic code assist solutions. By dissecting OpenHands, we uncover how to build AI agents that safely execute code, manage complex state, and operate reliably in production.

QuIP#: Achieving Near-Lossless 2-Bit LLM Quantization

The QuIP# algorithm for quantizing LLM weights without gradient information.

Why Can Your Laptop Run LLaMA? A Deep Dive into Quantization

How 4–8x compression and Hessian-guided GPTQ make 70B-scale models practical on modest hardware: what INT8/INT4 really cost, and when accuracy holds.

Flash Attention: The Mathematical Tricks That Broke the Memory Wall

Flash Attention, a memory-efficient attention mechanism for transformers.

Advanced NVIDIA GPU Monitoring for LLM Inference: A Deep Dive into H100 Architecture and Performance Optimization

A deep dive into NVIDIA’s H100 architecture and the monitoring techniques required for production-grade LLM inference optimization.

Beyond Prefix Caching: How LMCache Turns KV Cache into Composable LEGO Blocks

How LMCache Turns KV Cache into Composable LEGO Blocks