In this post, we’ll trace what happens when vLLM encounters a model it’s never seen before. We’ll work through the full lifecycle from the initial config.json pull off Hugging Face, through the registry lookup that decides the integration path, into either the Transformers fallback or the native integration code, and down to the forward pass where PagedAttention kernels actually execute.
Why this matters: new model architectures appear constantly, and vLLM needs to serve them. The interesting engineering question is how — because the optimizations that make inference fast (fused kernels, CUDA Graphs, tensor parallelism) require deep model-specific restructuring. You can’t just import a model and get peak performance. vLLM resolves this tension with a tiered system: immediate support through a compatibility layer, then a clear path to fully optimized native integration.
This post is structured into 4 parts:
- The Gateway — how vLLM decides what to do with a model it receives
- The Transformers Fallback — the zero-day mechanism and its trade-offs
- Native Integration — what it takes to make a model truly fast in vLLM
- The Execution Core — forward pass, weight loading, and distributed execution
We’ll build on concepts from previous posts. If you’re not familiar with PagedAttention and FlashAttention, or the hidden software stack beneath inference, those are worth reading first. We also won’t re-explain the Engine-Worker orchestration layer in full — just enough to ground the model integration story.
Here’s the starting point:
vllm serve some-brand-new/model-7B --dtype auto
# This works. Even for a model vLLM has never seen before.
That command succeeds for models that vLLM has no dedicated code for. Let’s understand why.
Part 1: The Gateway
The Engine-Worker-Model Hierarchy
vLLM enforces a strict separation of concerns in how it handles models. Before we get into the model-specific details, let’s establish the high-level architecture, since it determines where new model code actually lives.
There are four levels:
- LLMEngine — the control plane. Handles scheduling, manages the BlockSpaceManager (which tracks physical GPU memory blocks), and decides which requests get processed in each iteration. The Engine is completely agnostic to model architecture.
- Worker — one per GPU. Manages the GPU device, holds its slice of model weights, and coordinates with other Workers for distributed execution.
- ModelRunner — sits inside each Worker. Responsible for converting logical request data (token IDs, sequence lengths) into the physical tensors the model needs. This is where input flattening happens.
- Model — the neural network itself. Whether it's a native LlamaForCausalLM or a wrapped TransformersModel, this is the only layer that changes when you add a new model.
The key property here: the Engine only needs the KV cache element size (derived from num_layers, hidden_size, and num_attention_heads in the model config) to make scheduling decisions. It never touches the model's forward pass. This means adding support for an entirely new architecture only changes the bottom layer of this stack. Everything above it stays the same.
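As a rough illustration of what that KV cache element size amounts to, here's a back-of-the-envelope sketch for Llama-2-7B-like dimensions (the numbers and the block size of 16 are assumptions for illustration, not pulled from vLLM's code):
num_layers = 32          # config.json: num_hidden_layers
num_kv_heads = 32        # num_key_value_heads (no GQA in this example)
head_dim = 128           # hidden_size / num_attention_heads = 4096 / 32
dtype_bytes = 2          # float16 / bfloat16
block_size = 16          # tokens per KV cache block (vLLM's default)

# K and V, per token, summed over all layers
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
kv_bytes_per_block = kv_bytes_per_token * block_size

print(kv_bytes_per_token // 1024, "KiB per token")     # 512 KiB
print(kv_bytes_per_block // 2**20, "MiB per block")    # 8 MiB
This per-block figure is what the BlockSpaceManager works with when it decides how many blocks fit into the GPU memory left over after the weights are loaded.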
The VllmConfig object is how information flows across these levels. It aggregates several sub-configs:
| Config Component | What It Provides |
|---|---|
| ModelConfig | Architecture strings, hidden sizes, vocabulary size, the architectures list used for registry lookup |
| ParallelConfig | Tensor parallelism (TP) and pipeline parallelism (PP) degrees. Determines how linear layers shard their weights |
| SchedulerConfig | Maximum number of sequences and memory allocation strategy. Influences BlockSpaceManager setup |
| QuantizationConfig | Quantization method (AWQ, GPTQ, FP8). Linear layers use this to select the appropriate kernel during weight loading |
Registry Mechanics and the Architecture Lookup
When you run vllm serve <model>, the first thing that happens is a config.json resolution — either pulled from the Hugging Face Hub (if you pass a model ID like meta-llama/Llama-2-7b-hf) or read from disk (if you pass a local path, as you would in an air-gapped deployment). The architectures field — for example, ["LlamaForCausalLM"] — is the primary lookup key for the entire loading sequence.
This key gets checked against the _VLLM_MODELS dictionary, the core of vLLM’s ModelRegistry. It maps architecture strings to (module_name, class_name) tuples:
_VLLM_MODELS = {
"LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
"MistralForCausalLM": ("mistral", "MistralForCausalLM"),
"DeepseekV2ForCausalLM": ("deepseek_v2", "DeepseekV2ForCausalLM"),
"Qwen2ForCausalLM": ("qwen2", "Qwen2ForCausalLM"),
# ... hundreds of other architectures
}
The module_name is a relative path within vllm.model_executor.models — so "llama" resolves to vllm/model_executor/models/llama.py. The class_name is the specific nn.Module subclass to instantiate.
One important detail: vLLM does NOT import all model classes at startup. Instead, it uses _LazyRegisteredModel wrappers. When the ModelConfig requests a specific architecture, the registry:
- Checks if the architecture string exists in _VLLM_MODELS
- Retrieves the module path and class name
- Dynamically imports the module using importlib
- Returns the class constructor to the ModelLoader
This lazy loading matters for dependency isolation. A user running Llama shouldn’t need the specific kernels required for an audio-processing model. If those kernels aren’t installed and the audio model is loaded eagerly at startup, vLLM crashes for everyone.
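A minimal sketch of that lazy resolution, loosely modeled on the registry's behavior (the class and method names here are illustrative, not vLLM's exact internals):
import importlib

class LazyRegisteredModel:
    # Holds a (module_name, class_name) pair; imports nothing until asked.
    def __init__(self, module_name: str, class_name: str):
        self.module_name = module_name   # e.g. "llama"
        self.class_name = class_name     # e.g. "LlamaForCausalLM"

    def load_model_cls(self):
        # The import happens only when this architecture is actually requested,
        # so unrelated model dependencies never load at startup.
        module = importlib.import_module(
            f"vllm.model_executor.models.{self.module_name}")
        return getattr(module, self.class_name)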
Three things can happen when the registry receives an architecture string:
- Found in registry → native path (optimized, model-specific code)
- Registered by plugin → external native path (optimized, third-party code)
- Not found → Transformers Modeling Backend fallback (compatibility shim)
The Plugin System
This is a significant evolution in vLLM’s architecture. External packages can register models without modifying vLLM core.
The mechanism uses Python entry points: external packages register a callable under the vllm.general_plugins group. During initialization, vLLM discovers and executes all registered plugins. A plugin can invoke ModelRegistry.register_model() to inject a new architecture mapping at runtime:
# In your package's plugin entry point
def register():
from vllm import ModelRegistry
if "MyNewModel" not in ModelRegistry.get_supported_archs():
ModelRegistry.register_model(
"MyNewModel",
"my_package.models:MyNewModel"
)
This decouples the vLLM release cycle from model release cycles. Model creators — Mistral, DeepSeek, Google — can ship a “vLLM adaptation package” alongside their weights. Users pip install that package, and vLLM recognizes the new architecture immediately. No PRs to vLLM core, no waiting for a new release.
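For that register() function to be discovered, the package advertises it under the vllm.general_plugins entry-point group. A minimal setuptools sketch (the package and module names are hypothetical):
# setup.py of the hypothetical adaptation package
from setuptools import setup

setup(
    name="my-vllm-plugin",
    version="0.1.0",
    packages=["my_package"],
    entry_points={
        "vllm.general_plugins": [
            # vLLM calls this function during its plugin discovery phase
            "register_my_model = my_package.plugin:register",
        ],
    },
)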
Part 2: The Transformers Fallback (Zero-Day Support)
The Transformers Backend
When the registry lookup fails — or when you explicitly set model_impl="transformers" — vLLM resolves to its Transformers backend. This is a family of mixin-composed classes (TransformersForCausalLM, TransformersMoEForCausalLM, TransformersMultiModalForCausalLM, etc.) defined in vllm/model_executor/models/transformers/. They sit between vLLM’s scheduler and standard Hugging Face model code, and they’re the reason that vllm serve command works on day zero for new models.
There are two initialization steps worth understanding:
1. Config-Based Instantiation. The wrapper uses transformers.AutoModel.from_config(...) to build the model architecture on a meta device — meaning no GPU memory is allocated yet, just the module structure with placeholder parameters. Weights are loaded separately later through vLLM’s load_weights() pipeline. This two-phase approach (structure first, weights later) is critical for distributed loading: each GPU can load only its weight shard, rather than loading everything and then discarding what it doesn’t need.
2. Attention Backend Injection. Before instantiation, vLLM modifies the model’s text configuration:
# vLLM sets this before calling from_config()
text_config._attn_implementation = "vllm"
This is the critical mechanism. Modern Hugging Face models are written to be attention-backend-agnostic. They check _attn_implementation and query a registry of attention functions. vLLM populates ALL_ATTENTION_FUNCTIONS with its own PagedAttention-backed implementation:
# vLLM registers its attention backend into HF's registry
ALL_ATTENTION_FUNCTIONS["vllm"] = vllm_attention_forward
When the HF model reaches its attention layer and calls the registered function, it gets vLLM’s implementation instead of the default eager/SDPA/FlashAttention backend. The model doesn’t know the difference.
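To make the dispatch concrete, here's a schematic of what happens inside an HF attention layer, using stand-ins for the real registry and layer code (the names and signatures are illustrative, not the exact Transformers or vLLM APIs):
# Schematic only: stand-ins for HF's attention registry and a model layer
ALL_ATTENTION_FUNCTIONS = {}

def vllm_attention_forward(module, query, key, value, attention_mask, **kwargs):
    # In real vLLM this writes K/V into paged blocks and runs PagedAttention,
    # using the attn_metadata that arrived via **kwargs.
    raise NotImplementedError

ALL_ATTENTION_FUNCTIONS["vllm"] = vllm_attention_forward

class HFAttentionLayer:
    def __init__(self, config):
        self.config = config   # config._attn_implementation == "vllm"

    def forward(self, query, key, value, **kwargs):
        # The layer looks up whichever backend the config names; it has no
        # idea that the function it gets back is vLLM's.
        attn_fn = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attn_fn(self, query, key, value, attention_mask=None, **kwargs)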
Let’s trace the data flow step by step:
- Engine generates block_tables and slot_mapping → packs them into an AttentionMetadata object
- TransformersForCausalLM.forward() receives flattened inputs + attn_metadata
- The wrapper passes vLLM metadata as **kwargs into the HF model's forward method
- The HF model propagates **kwargs down through its layers (this is a convention in Transformers — unused kwargs flow through)
- At each attention layer, the injected vLLM backend receives Q, K, V tensors + the attn_metadata
- The PagedAttention CUDA kernel executes — storing K/V into paged blocks, computing attention scores using block tables
The result: even “unoptimized” models benefit from PagedAttention’s memory virtualization. No more OOM from naive KV cache pre-allocation. The KV cache is managed efficiently through fixed-size blocks, regardless of whether the model itself was designed for it.
The Trade-offs (What You Lose)
The Transformers backend enables immediate serving, but it sits in what we might call an “unoptimized valley.” Let’s be specific about the costs:
CUDA Graph Capture. In vLLM V1, the Transformers backend supports torch.compile with piecewise CUDA graph capture (via the @support_torch_compile decorator), closing what was historically the largest performance gap. However, models with dynamic RoPE scaling still fall back to eager mode. And native models can leverage more aggressive graph capture strategies that cover a larger fraction of the computation graph, since their code is explicitly written with static control flow in mind.
Kernel Fusion. Native vLLM models use fused kernels — LayerNorm + activation in one kernel, RoPE computation fused with the QKV projection, SiLU and gate multiplication combined. The fallback uses separate PyTorch operations for each step. Every separate operation means an extra round-trip to HBM: write intermediate result, read it back for the next op. On a memory-bandwidth-bound workload (which LLM decode always is), these extra reads and writes add up fast.
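To make the HBM round-trip point concrete, here's an illustrative contrast (not vLLM code) for the SiLU-and-multiply step of a Llama-style MLP, with assumed shapes:
import torch
import torch.nn.functional as F

tokens, intermediate = 10, 11008
gate_up = torch.randn(tokens, 2 * intermediate)   # output of a fused gate_up_proj

# Unfused path (what the fallback effectively does): chunk, activate, multiply.
# Each op writes an intermediate tensor to HBM and reads it back for the next.
gate, up = gate_up.chunk(2, dim=-1)
out_unfused = F.silu(gate) * up

# A native vLLM model replaces these separate ops with one fused kernel
# (SiluAndMul), which reads gate_up once and writes the result once.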
Parallelism Limitations. Basic Tensor Parallelism can sometimes be inferred automatically via the model’s base_model_tp_plan, but this doesn’t cover every case. Mixture-of-Experts routing, novel attention patterns, or architectures with unusual layer structures may not shard correctly — restricting you to single-GPU execution.
| Capability | Transformers Fallback | Native Integration |
|---|---|---|
| Day-zero support | Yes | No (requires implementation) |
| PagedAttention | Yes (via injection) | Yes (native) |
| CUDA Graph capture | Yes (via torch.compile in V1) | Yes (full static graph) |
| Kernel fusion | No (separate PyTorch ops) | Yes (fused CUDA kernels) |
| Tensor Parallelism | Limited (auto-inferred) | Full (explicit sharding) |
| Pipeline Parallelism | No | Yes (with intermediate_tensors) |
| Quantization (AWQ/GPTQ/FP8) | Limited | Full support |
Note: The fallback is not meant to be the final state — it’s the starting point. It gives you a working, servable model while the community works on native integration. Think of it as a bridge: useful immediately, but you cross it to get somewhere better.
Part 3: Native Integration
The Model Interface and Prefix Protocol
To go from “supported” to “optimized,” a model must be implemented natively. This means creating a Python class that mirrors the original model structure but substitutes standard layers with vLLM’s distributed primitives.
Every module in a native vLLM model accepts a prefix="" argument during initialization. This string represents the module’s fully qualified name in the state dictionary — for example, model.layers.0.self_attn.q_proj.
class LlamaAttention(nn.Module):
def __init__(self, config, prefix=""):
super().__init__()
self.qkv_proj = QKVParallelLinear(
...,
prefix=f"{prefix}.qkv_proj"
)
self.o_proj = RowParallelLinear(
...,
prefix=f"{prefix}.o_proj"
)
The prefix serves two purposes:
- Weight loading: maps checkpoint tensors to the correct layer instance. When the load_weights method receives a tensor named model.layers.0.self_attn.q_proj.weight, the prefix tells it exactly which module to route it to.
- Non-uniform quantization: the QuantizationConfig can specify different quantization schemes per layer. Some layers might be FP16 while others are INT8. The prefix is how the config identifies which kernel to instantiate for each specific layer.
Parallel Layer Primitives
For models that won’t fit on a single GPU (70B+), vLLM provides distributed primitives that replace standard nn.Linear and nn.Embedding layers:
ColumnParallelLinear splits the weight matrix along the output dimension. Each GPU computes a fraction of the output features. This is used for QKV projections (each GPU computes a subset of attention heads) and MLP up-projections (each GPU computes a portion of the intermediate dimension). No inter-GPU communication is needed for this operation.
RowParallelLinear splits along the input dimension. Each GPU computes a partial result, then an AllReduce sums the partial results across all GPUs. This is used for the attention output projection and MLP down-projection — the operations where partial results need to be recombined.
VocabParallelEmbedding splits the embedding table (often 128k+ tokens for modern models) across GPUs. Each GPU holds a slice of the vocabulary and performs lookups only for tokens in its range.
The VllmConfig provides tensor_parallel_size during initialization, and each layer auto-configures its sharding based on the worker’s rank. A model developer doesn’t write explicit GPU assignment code — they use these primitives and the infrastructure handles partitioning.
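A small sketch of the sharding arithmetic these primitives implement, assuming tp_size=4 and Llama-7B-like dimensions (this is the math, not vLLM's layer code):
import torch

tp_size, hidden, intermediate = 4, 4096, 11008

# ColumnParallelLinear: split the OUTPUT dimension. Each rank owns
# intermediate / tp_size output features; no communication is needed.
up_proj_weight = torch.randn(intermediate, hidden)   # nn.Linear stores [out, in]
col_shards = up_proj_weight.chunk(tp_size, dim=0)
print(col_shards[0].shape)   # torch.Size([2752, 4096])

# RowParallelLinear: split the INPUT dimension. Each rank produces a
# full-size partial output; an AllReduce then sums the partials.
down_proj_weight = torch.randn(hidden, intermediate)
row_shards = down_proj_weight.chunk(tp_size, dim=1)
print(row_shards[0].shape)   # torch.Size([4096, 2752])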
Input Flattening and the 1D Computation Graph
This is one of the more interesting design decisions in vLLM. In standard PyTorch, inputs are 2D tensors of shape [batch_size, sequence_length]. This requires padding to align sequences of different lengths — if you're processing three requests with lengths 5, 12, and 3, you pad everything to length 12. That means 16 wasted positions out of 36, roughly 44% of compute thrown away on padding tokens.
vLLM eliminates padding entirely. The ModelRunner concatenates all tokens from all concurrent requests into a single 1D tensor of shape [total_num_tokens]. A separate positions tensor (also 1D) provides the sequence position for each token:
# Three concurrent requests:
# Request A: tokens [101, 204, 305] (3 tokens, positions 0,1,2)
# Request B: tokens [42, 55, 67, 89, 12] (5 tokens, positions 0,1,2,3,4)
# Request C: tokens [700, 801] (2 tokens, positions 0,1)
# Flattened input — no padding, no wasted compute:
input_ids = [101, 204, 305, 42, 55, 67, 89, 12, 700, 801] # shape: [10]
positions = [0, 1, 2, 0, 1, 2, 3, 4, 0, 1] # shape: [10]
Every layer in a native vLLM model is written to process this 1D stream. Embeddings do lookups on the 1D tensor. RoPE uses the positions tensor for correct positional encoding. The attention layer uses block_tables to reconstruct the logical sequence structure — knowing which tokens belong to which request and where their KV cache blocks live in physical memory.
Weight Loading — From Disk to Device
One of the trickier parts of native integration: implementing load_weights(self, weights). This method receives an iterator of (name, tensor) pairs from AutoWeightsLoader and must map checkpoint weights into the model’s parameters.
The parameter mismatch problem is why this isn’t trivial. vLLM often fuses layers that are separate in the Hugging Face checkpoint. For example, a standard Llama MLP has separate gate_proj and up_proj linear layers. In vLLM, these become a single gate_up_proj to reduce kernel launches. The load_weights logic must handle this:
def load_weights(self, weights):
    # Stacked-parameter mapping: (vLLM param name, HF shard name, shard index).
    # gate_proj and up_proj from the checkpoint both land in gate_up_proj.
    stacked_params_mapping = [
        ("gate_up_proj", "gate_proj", 0),  # first half of the fused weight
        ("gate_up_proj", "up_proj", 1),    # second half of the fused weight
    ]
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        for param_name, shard_name, shard_id in stacked_params_mapping:
            if shard_name not in name:
                continue
            # Fused case: route this shard into the correct slice of gate_up_proj
            param = params_dict[name.replace(shard_name, param_name)]
            param.weight_loader(param, loaded_weight, shard_id)
            break
        else:
            # Regular case: use the parameter's own weight_loader if it has one
            # (parallel layers attach loaders that handle TP sharding)
            param = params_dict[name]
            weight_loader = getattr(param, "weight_loader", None)
            if weight_loader is not None:
                weight_loader(param, loaded_weight)
            else:
                param.data.copy_(loaded_weight)
Two utilities make this process more manageable:
AutoWeightsLoader abstracts away the routing of weights to child modules. It recursively discovers sub-modules that have their own load_weights methods and delegates the appropriate (name, tensor) pairs to each one, so the top-level model doesn’t need to manually dispatch weights. The upstream shard iteration — walking through model-00001-of-00005.safetensors through model-00005-of-00005.safetensors and presenting a unified (name, tensor) stream — happens in vLLM’s weight loading utilities (weight_utils.py), which feed into AutoWeightsLoader.
WeightsMapper provides declarative renaming rules. Instead of writing string manipulation inside load_weights, you define a mapping:
mapper = WeightsMapper(orig_to_new_prefix={
"model.decoder.layers.": "model.layers.",
"norm.weight": "model.norm.weight"
})
The loader applies these rules on the fly, letting the vLLM model structure diverge from the Hugging Face structure while maintaining compatibility with official checkpoints.
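Under the hood this amounts to a rename pass over the (name, tensor) stream. A minimal sketch of what applying prefix rules looks like (illustrative; the real WeightsMapper also handles suffix and substring rules):
def apply_prefix_mapping(weights, orig_to_new_prefix):
    # Rename checkpoint tensor names on the fly, before they reach load_weights
    for name, tensor in weights:
        for old, new in orig_to_new_prefix.items():
            if name.startswith(old):
                name = new + name[len(old):]
                break
        yield name, tensor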
For quantized models, weight loading has an additional layer of complexity. In 4-bit quantization schemes like AWQ, eight 4-bit weights are packed into a single int32. The loader must recognize that the destination parameter is quantized and load the packed tensor directly, no casting to float16 first. If the config specifies quantization, the linear layer initializes a specialized “quantized parameter” object that overrides the default loading behavior.
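A pure-Python illustration of that packing density (real AWQ/GPTQ layouts interleave the nibbles differently, but the arithmetic is the same):
nibbles = [3, 14, 7, 0, 9, 12, 1, 5]      # eight 4-bit weight values
packed = 0
for i, v in enumerate(nibbles):
    packed |= v << (4 * i)                 # one int32 now carries all eight

# The loader must copy such packed tensors verbatim into the quantized
# parameter; casting to float16 first would destroy the bit layout.
unpacked = [(packed >> (4 * i)) & 0xF for i in range(8)]
assert unpacked == nibbles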
Part 4: The Execution Core
The Attention Switchboard (ForwardContext)
In standard PyTorch, an Attention module is self-contained — it receives Q, K, V and computes the output. In vLLM, the Attention layer acts as a client to a global context. When the model executes a forward pass, a ForwardContext is established containing the AttentionMetadata generated by the scheduler.
The AttentionMetadata object is essentially a page table for the KV cache. Here’s what it contains, with concrete values for a batch of 3 requests:
# AttentionMetadata for a batch with 3 requests:
# Request A: 128 tokens, KV spread across blocks [4, 17, 23, 8]
# Request B: 64 tokens, KV in blocks [1, 12]
# Request C: 256 tokens, KV in blocks [0, 5, 9, 14, 22, 31, 7, 19]
block_tables = [
[4, 17, 23, 8, 0, 0, 0, 0], # Request A (padded to max_blocks)
[1, 12, 0, 0, 0, 0, 0, 0], # Request B
[0, 5, 9, 14, 22, 31, 7, 19], # Request C
]
# shape: [3, 8] — each entry is a physical block index in GPU memory
slot_mapping = [512, 65, 1024]
# For decode: maps each new token to its physical slot (block_idx * block_size + offset)
context_lens = [128, 64, 256]
# Sequence length per request, for correct attention masking
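To make the slot_mapping comment concrete, here's the arithmetic for a single decode-time token write, assuming vLLM's default block_size of 16 (the numbers are illustrative and independent of the batch above):
block_size = 16
block_table = [4, 17, 23, 8]     # physical blocks owned by one request
position = 37                    # logical position of the token being written

physical_block = block_table[position // block_size]   # logical block 2 -> block 23
offset = position % block_size                         # 37 % 16 = 5
slot = physical_block * block_size + offset            # 23 * 16 + 5 = 373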
The attention layer’s forward() does not compute QK^T * V directly. Instead, it dispatches to different backends depending on the phase:
- Prefill (processing prompts) → FlashAttention variant, optimized for parallel computation over many tokens. All Q, K, V tokens are known upfront, so we can exploit parallelism across the sequence.
- Decode (generating tokens) → PagedAttention kernel. The query is a single new token. The kernel uses block_tables to gather K and V vectors from non-contiguous physical blocks, compute attention scores, and scatter the result. This is the operation that makes virtual memory for the KV cache work: tokens don't need to be stored contiguously.
The split between prefill and decode backends is important for performance. Prefill is compute-bound (large matrix multiplications), so FlashAttention’s tiling strategy works well. Decode is memory-bound (loading many cached K/V vectors for a single query), so PagedAttention’s gather-based approach is the right fit.
Distributed Execution Contracts
Supporting 405B-class models means multi-GPU execution across potentially many nodes. This introduces specific contracts that new model implementations must satisfy:
Tensor Parallelism requires precise synchronization. In a standard Transformer block, AllReduce happens exactly twice: after the attention output projection (RowParallelLinear) and after the MLP down-projection (also RowParallelLinear). These are the points where partial results from each GPU must be summed. Adding extra synchronizations (say, an unnecessary AllReduce after the QKV projection) doesn't produce wrong results, but it degrades throughput. Each AllReduce is a blocking collective, so all GPUs wait.
Pipeline Parallelism splits the model vertically by layers. The native model’s forward method must handle an intermediate_tensors argument:
def forward(self, input_ids, positions, attn_metadata, intermediate_tensors=None):
if intermediate_tensors is not None:
# We're not the first pipeline stage — skip embedding
hidden_states = intermediate_tensors["hidden_states"]
start_layer = self.start_layer # e.g., layer 16
else:
# First pipeline stage — process from the embedding
hidden_states = self.embed_tokens(input_ids)
start_layer = 0
for layer in self.layers[start_layer:self.end_layer]:
hidden_states = layer(hidden_states, positions, attn_metadata)
return hidden_states
Rank 0 executes layers 0–N, outputs the hidden state as intermediate_tensors. Rank 1 receives it, skips the embedding layer, and resumes from layer N+1. If a developer forgets to implement this check, the model works fine in single-node TP mode but silently breaks in PP mode: it tries to re-embed already-processed hidden states.
CUDA Graph Compatibility requires static control flow. Dynamic Python branching based on tensor values like if tensor.sum() > 0: breaks CUDA Graph capture because the graph records a fixed execution path. This is particularly relevant for Mixture-of-Experts models where expert routing is inherently data-dependent. The routing logic must use masked tensor operations (scatter, gather with masks) rather than Python if/else, so the computation graph remains static even though different experts activate for different tokens.
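As a toy illustration of that requirement (this is not vLLM's MoE code), here is top-1 routing written with a fixed-trip-count loop and torch.where, so the captured graph is identical no matter which experts the router picks:
import torch

hidden = torch.randn(10, 64)                # 10 tokens from the flattened batch
router_logits = torch.randn(10, 4)          # 4 experts, top-1 routing
expert_ids = router_logits.argmax(dim=-1)

out = torch.zeros_like(hidden)
for e in range(4):                          # Python loop with a FIXED trip count
    expert_out = hidden * (e + 1.0)         # stand-in for expert e's MLP
    mask = (expert_ids == e).unsqueeze(-1)  # data dependence lives in the mask
    out = torch.where(mask, expert_out, out)  # no tensor-dependent Python branch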
| Parallelism Type | Contract for Model Developer | Failure Mode if Missed |
|---|---|---|
| Tensor Parallelism | Use ColumnParallel/RowParallel layers; exactly 2 AllReduces per block | Extra AllReduces → throughput degradation; wrong layer types → incorrect results |
| Pipeline Parallelism | Handle intermediate_tensors arg; define start_layer/end_layer | Model re-embeds hidden states → garbage output on later pipeline stages |
| CUDA Graphs | No Python control flow based on tensor values; use masked ops for routing | Graph capture fails → fallback to eager mode → 2-3x slower decode |
Putting It All Together — The Optimization Ladder
To summarize the full lifecycle, model support in vLLM is a progression through increasingly optimized tiers:
- Transformers Fallback — works immediately. PagedAttention memory management, but no kernel fusion and only limited graph capture and parallelism. This is where every new model starts.
- Plugin Registration — an external package provides a native model class. pip install and go. Model creators control their own release timeline.
- Native Model Class — upstreamed into vLLM. Parallel primitives, 1D flattened computation, CUDA Graph compatible. This is where the performance lives.
- Quantization Support — AWQ, GPTQ, FP8 weight loading tested and working. Packed tensor handling, per-layer quantization configs. Unlocks deployment on smaller hardware.
- Full Production — Pipeline Parallelism support, custom attention patterns if needed, benchmarked against reference implementations. Ready for large-scale serving.
The plugin system represents the future direction — federated model support where model creators can ship “vLLM-ready” code independently of core releases. Instead of waiting for the vLLM team to implement every new architecture, the ecosystem moves toward model creators owning their integration path.
Closing
The process of introducing a new model into vLLM is a systems engineering exercise. It requires transforming a static model definition (essentially a recipe for matrix multiplications) into a dynamic, distributed execution graph that manages its own memory, shards its own weights, and coordinates across GPUs. The Transformers fallback bridges the gap for immediate access; native integration is where the performance lives.
There are four core contracts a model must satisfy for full integration: registry updates (mapping architecture strings to code), class restructuring (parallel primitives, 1D flattening), weight loading (handling mismatches between checkpoint and runtime structure), and PagedAttention integration (routing attention through the block-table-based memory system). Understanding these four contracts gives you a mental model for reasoning about model support in any inference engine, not just vLLM.
References
- Kwon, W. et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. arXiv:2309.06180
- vLLM Documentation — Architecture Overview. docs.vllm.ai
- vLLM Documentation — Adding a New Model. docs.vllm.ai
- vLLM Documentation — Plugin System. docs.vllm.ai
- vLLM Source — Model Registry (registry.py). GitHub
- vLLM Source — Transformers Backend (transformers/). GitHub
- vLLM Documentation — Class Hierarchy. docs.vllm.ai
- Gordic, A. “Inside vLLM: Anatomy of a High-Throughput LLM Inference System.” aleksagordic.com
- Zalt, M. “The Hidden Switchboard Behind vLLM Attention.” zalt.me
- Prerepa, A. “ZvLLM: Zigzag forward pass with vLLM.” adiprerepa.github.io
- El Shafie, H. “Paged Attention from First Principles: A View Inside vLLM.” hamzaelshafie.bearblog.dev
- vLLM Documentation — Paged Attention Design. docs.vllm.ai