In this post, we’ll trace what happens when vLLM encounters a model it’s never seen before. We’ll work through the full lifecycle from the initial config.json pull off Hugging Face, through the registry lookup that decides the integration path, into either the Transformers fallback or the native integration code, and down to the forward pass where PagedAttention kernels actually execute.
Why this matters: new model architectures appear constantly, and vLLM needs to serve them. The interesting engineering question is how — because the optimizations that make inference fast (fused kernels, CUDA Graphs, tensor parallelism) require deep model-specific restructuring. You can’t just import a model and get peak performance. vLLM resolves this tension with a tiered system: immediate support through a compatibility layer, then a clear path to fully optimized native integration.
This post is structured into 4 parts:
- The Gateway — how vLLM decides what to do with a model it receives
- The Transformers Fallback — the zero-day mechanism and its trade-offs
- Native Integration — what it takes to make a model truly fast in vLLM
- The Execution Core — forward pass, weight loading, and distributed execution
We’ll build on concepts from previous posts. If you’re not familiar with PagedAttention and FlashAttention, or the hidden software stack beneath inference, those are worth reading first. We also won’t re-explain the Engine-Worker orchestration layer in full — just enough to ground the model integration story.
Here’s the starting point:
vllm serve some-brand-new/model-7B --dtype auto
# This works. Even for a model vLLM has never seen before.
That command succeeds for models that vLLM has no dedicated code for. Let’s understand why.
Part 1: The Gateway
The Engine-Worker-Model Hierarchy
vLLM enforces a strict separation of concerns in how it handles models. Before we get into the model-specific details, let’s establish the high-level architecture, since it determines where new model code actually lives.
There are four levels:
- LLMEngine — the control plane. Handles scheduling, manages the BlockSpaceManager (which tracks physical GPU memory blocks), and decides which requests get processed in each iteration. The Engine is completely agnostic to model architecture.
- Worker — one per GPU. Manages the GPU device, holds its slice of model weights, and coordinates with other Workers for distributed execution.
- ModelRunner — sits inside each Worker. Responsible for converting logical request data (token IDs, sequence lengths) into the physical tensors the model needs. This is where input flattening happens.
- Model — the neural network itself. Whether it's a native LlamaForCausalLM or a wrapped TransformersModel, this is the only layer that changes when you add a new model.
The key property here: the Engine only needs the KV cache element size (derived from num_layers, hidden_size, and num_attention_heads in the model config) to make scheduling decisions. It never touches the model's forward pass. This means adding support for an entirely new architecture only changes the bottom layer of this stack. Everything above it stays the same.
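As a rough illustration of what that KV cache element size amounts to, here's a back-of-the-envelope sketch for Llama-2-7B-like dimensions (the numbers and the block size of 16 are assumptions for illustration, not pulled from vLLM's code):
num_layers = 32          # config.json: num_hidden_layers
num_kv_heads = 32        # num_key_value_heads (no GQA in this example)
head_dim = 128           # hidden_size / num_attention_heads = 4096 / 32
dtype_bytes = 2          # float16 / bfloat16
block_size = 16          # tokens per KV cache block (vLLM's default)

# K and V, per token, summed over all layers
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
kv_bytes_per_block = kv_bytes_per_token * block_size

print(kv_bytes_per_token // 1024, "KiB per token")     # 512 KiB
print(kv_bytes_per_block // 2**20, "MiB per block")    # 8 MiB
This per-block figure is what the BlockSpaceManager works with when it decides how many blocks fit into the GPU memory left over after the weights are loaded.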
The VllmConfig object is how information flows across these levels. It aggregates several sub-configs:
| Config Component | What It Provides |
|---|---|
| ModelConfig | Architecture strings, hidden sizes, vocabulary size, the architectures list used for registry lookup |
| ParallelConfig | Tensor parallelism (TP) and pipeline parallelism (PP) degrees. Determines how linear layers shard their weights |
| SchedulerConfig | Maximum number of sequences and memory allocation strategy. Influences BlockSpaceManager setup |
| QuantizationConfig | Quantization method (AWQ, GPTQ, FP8). Linear layers use this to select the appropriate kernel during weight loading |
Registry Mechanics and the Architecture Lookup
When you run vllm serve <model>, the first thing that happens is a config.json resolution — either pulled from the Hugging Face Hub (if you pass a model ID like meta-llama/Llama-2-7b-hf) or read from disk (if you pass a local path, as you would in an air-gapped deployment). The architectures field — for example, ["LlamaForCausalLM"] — is the primary lookup key for the entire loading sequence.
This key gets checked against the _VLLM_MODELS dictionary, the core of vLLM’s ModelRegistry. It maps architecture strings to (module_name, class_name) tuples:
_VLLM_MODELS = {
"LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
"MistralForCausalLM": ("mistral", "MistralForCausalLM"),
"DeepseekV2ForCausalLM": ("deepseek_v2", "DeepseekV2ForCausalLM"),
"Qwen2ForCausalLM": ("qwen2", "Qwen2ForCausalLM"),
# ... hundreds of other architectures
}
The module_name is a relative path within vllm.model_executor.models — so "llama" resolves to vllm/model_executor/models/llama.py. The class_name is the specific nn.Module subclass to instantiate.
One important detail: vLLM does NOT import all model classes at startup. Instead, it uses _LazyRegisteredModel wrappers. When the ModelConfig requests a specific architecture, the registry:
- Checks if the architecture string exists in _VLLM_MODELS
- Retrieves the module path and class name
- Dynamically imports the module using importlib
- Returns the class constructor to the ModelLoader
This lazy loading matters for dependency isolation. A user running Llama shouldn’t need the specific kernels required for an audio-processing model. If those kernels aren’t installed and the audio model is loaded eagerly at startup, vLLM crashes for everyone.
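A minimal sketch of that lazy resolution, loosely modeled on the registry's behavior (the class and method names here are illustrative, not vLLM's exact internals):
import importlib

class LazyRegisteredModel:
    # Holds a (module_name, class_name) pair; imports nothing until asked.
    def __init__(self, module_name: str, class_name: str):
        self.module_name = module_name   # e.g. "llama"
        self.class_name = class_name     # e.g. "LlamaForCausalLM"

    def load_model_cls(self):
        # The import happens only when this architecture is actually requested,
        # so unrelated model dependencies never load at startup.
        module = importlib.import_module(
            f"vllm.model_executor.models.{self.module_name}")
        return getattr(module, self.class_name)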
Three things can happen when the registry receives an architecture string:
- Found in registry → native path (optimized, model-specific code)
- Registered by plugin → external native path (optimized, third-party code)
- Not found → Transformers Modeling Backend fallback (compatibility shim)
The Plugin System
This is a significant evolution in vLLM’s architecture. External packages can register models without modifying vLLM core.
The mechanism uses Python entry points: external packages register a callable under the vllm.general_plugins group. During initialization, vLLM discovers and executes all registered plugins. A plugin can invoke ModelRegistry.register_model() to inject a new architecture mapping at runtime:
# In your package's plugin entry point
def register():
from vllm import ModelRegistry
if "MyNewModel" not in ModelRegistry.get_supported_archs():
ModelRegistry.register_model(
"MyNewModel",
"my_package.models:MyNewModel"
)
This decouples the vLLM release cycle from model release cycles. Model creators — Mistral, DeepSeek, Google — can ship a “vLLM adaptation package” alongside their weights. Users pip install that package, and vLLM recognizes the new architecture immediately. No PRs to vLLM core, no waiting for a new release.
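For that register() function to be discovered, the package advertises it under the vllm.general_plugins entry-point group. A minimal setuptools sketch (the package and module names are hypothetical):
# setup.py of the hypothetical adaptation package
from setuptools import setup

setup(
    name="my-vllm-plugin",
    version="0.1.0",
    packages=["my_package"],
    entry_points={
        "vllm.general_plugins": [
            # vLLM calls this function during its plugin discovery phase
            "register_my_model = my_package.plugin:register",
        ],
    },
)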
Part 2: The Transformers Fallback (Zero-Day Support)
The Transformers Backend
When the registry lookup fails — or when you explicitly set model_impl="transformers" — vLLM resolves to its Transformers backend. This is a family of mixin-composed classes (TransformersForCausalLM, TransformersMoEForCausalLM, TransformersMultiModalForCausalLM, etc.) defined in vllm/model_executor/models/transformers/. They sit between vLLM’s scheduler and standard Hugging Face model code, and they’re the reason that vllm serve command works on day zero for new models.
There are two initialization steps worth understanding:
1. Config-Based Instantiation. The wrapper uses transformers.AutoModel.from_config(...) to build the model architecture on a meta device — meaning no GPU memory is allocated yet, just the module structure with placeholder parameters. Weights are loaded separately later through vLLM’s load_weights() pipeline. This two-phase approach (structure first, weights later) is critical for distributed loading: each GPU can load only its weight shard, rather than loading everything and then discarding what it doesn’t need.
2. Attention Backend Injection. Before instantiation, vLLM modifies the model’s text configuration:
# vLLM sets this before calling from_config()
text_config._attn_implementation = "vllm"
This is the critical mechanism. Modern Hugging Face models are written to be attention-backend-agnostic. They check _attn_implementation and query a registry of attention functions. vLLM populates ALL_ATTENTION_FUNCTIONS with its own PagedAttention-backed implementation:
# vLLM registers its attention backend into HF's registry
ALL_ATTENTION_FUNCTIONS["vllm"] = vllm_attention_forward
When the HF model reaches its attention layer and calls the registered function, it gets vLLM’s implementation instead of the default eager/SDPA/FlashAttention backend. The model doesn’t know the difference.
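To make the dispatch concrete, here's a schematic of what happens inside an HF attention layer, using stand-ins for the real registry and layer code (the names and signatures are illustrative, not the exact Transformers or vLLM APIs):
# Schematic only: stand-ins for HF's attention registry and a model layer
ALL_ATTENTION_FUNCTIONS = {}

def vllm_attention_forward(module, query, key, value, attention_mask, **kwargs):
    # In real vLLM this writes K/V into paged blocks and runs PagedAttention,
    # using the attn_metadata that arrived via **kwargs.
    raise NotImplementedError

ALL_ATTENTION_FUNCTIONS["vllm"] = vllm_attention_forward

class HFAttentionLayer:
    def __init__(self, config):
        self.config = config   # config._attn_implementation == "vllm"

    def forward(self, query, key, value, **kwargs):
        # The layer looks up whichever backend the config names; it has no
        # idea that the function it gets back is vLLM's.
        attn_fn = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attn_fn(self, query, key, value, attention_mask=None, **kwargs)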
Let’s trace the data flow step by step:
- Engine generates block_tables and slot_mapping → packs them into an AttentionMetadata object
- TransformersForCausalLM.forward() receives flattened inputs + attn_metadata
- The wrapper passes vLLM metadata as **kwargs into the HF model's forward method
- The HF model propagates **kwargs down through its layers (this is a convention in Transformers — unused kwargs flow through)
- At each attention layer, the injected vLLM backend receives Q, K, V tensors + the attn_metadata
- The PagedAttention CUDA kernel executes — storing K/V into paged blocks, computing attention scores using block tables
The result: even “unoptimized” models benefit from PagedAttention’s memory virtualization. No more OOM from naive KV cache pre-allocation. The KV cache is managed efficiently through fixed-size blocks, regardless of whether the model itself was designed for it.
The Trade-offs (What You Lose)
The Transformers backend enables immediate serving, but it sits in what we might call an “unoptimized valley.” Let’s be specific about the costs:
CUDA Graph Capture. In vLLM V1, the Transformers backend supports torch.compile with piecewise CUDA graph capture (via the @support_torch_compile decorator), closing what was historically the largest performance gap. However, models with dynamic RoPE scaling still fall back to eager mode. And native models can leverage more aggressive graph capture strategies that cover a larger fraction of the computation graph, since their code is explicitly written with static control flow in mind.
Kernel Fusion. Native vLLM models use fused kernels — LayerNorm + activation in one kernel, RoPE computation fused with the QKV projection, SiLU and gate multiplication combined. The fallback uses separate PyTorch operations for each step. Every separate operation means an extra round-trip to HBM: write intermediate result, read it back for the next op. On a memory-bandwidth-bound workload (which LLM decode always is), these extra reads and writes add up fast.
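To make the HBM round-trip point concrete, here's an illustrative contrast (not vLLM code) for the SiLU-and-multiply step of a Llama-style MLP, with assumed shapes:
import torch
import torch.nn.functional as F

tokens, intermediate = 10, 11008
gate_up = torch.randn(tokens, 2 * intermediate)   # output of a fused gate_up_proj

# Unfused path (what the fallback effectively does): chunk, activate, multiply.
# Each op writes an intermediate tensor to HBM and reads it back for the next.
gate, up = gate_up.chunk(2, dim=-1)
out_unfused = F.silu(gate) * up

# A native vLLM model replaces these separate ops with one fused kernel
# (SiluAndMul), which reads gate_up once and writes the result once.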
Parallelism Limitations. Basic Tensor Parallelism can sometimes be inferred automatically via the model’s base_model_tp_plan, but this doesn’t cover every case. Mixture-of-Experts routing, novel attention patterns, or architectures with unusual layer structures may not shard correctly — restricting you to single-GPU execution.
| Capability | Transformers Fallback | Native Integration |
|---|---|---|
| Day-zero support | Yes | No (requires implementation) |
| PagedAttention | Yes (via injection) | Yes (native) |
| CUDA Graph capture | Yes (via torch.compile in V1) | Yes (full static graph) |
| Kernel fusion | No (separate PyTorch ops) | Yes (fused CUDA kernels) |
| Tensor Parallelism | Limited (auto-inferred) | Full (explicit sharding) |
| Pipeline Parallelism | No | Yes (with intermediate_tensors) |
| Quantization (AWQ/GPTQ/FP8) | Limited | Full support |
Note: The fallback is not meant to be the final state — it’s the starting point. It gives you a working, servable model while the community works on native integration. Think of it as a bridge: useful immediately, but you cross it to get somewhere better.
Part 3: Native Integration
The Model Interface and Prefix Protocol
To go from “supported” to “optimized,” a model must be implemented natively. This means creating a Python class that mirrors the original model structure but substitutes standard layers with vLLM’s distributed primitives.
Every module in a native vLLM model accepts a prefix="" argument during initialization. This string represents the module’s fully qualified name in the state dictionary — for example, model.layers.0.self_attn.q_proj.
class LlamaAttention(nn.Module):
def __init__(self, config, prefix=""):
super().__init__()
self.qkv_proj = QKVParallelLinear(
...,
prefix=f"{prefix}.qkv_proj"
)
self.o_proj = RowParallelLinear(
...,
prefix=f"{prefix}.o_proj"
)
The prefix serves two purposes:
- Weight loading: maps checkpoint tensors to the correct layer instance. When the load_weights method receives a tensor named model.layers.0.self_attn.q_proj.weight, the prefix tells it exactly which module to route it to.
- Non-uniform quantization: the QuantizationConfig can specify different quantization schemes per layer. Some layers might be FP16 while others are INT8. The prefix is how the config identifies which kernel to instantiate for each specific layer.
Parallel Layer Primitives
For models that won’t fit on a single GPU (70B+), vLLM provides distributed primitives that replace standard nn.Linear and nn.Embedding layers:
ColumnParallelLinear splits the weight matrix along the output dimension. Each GPU computes a fraction of the output features. This is used for QKV projections (each GPU computes a subset of attention heads) and MLP up-projections (each GPU computes a portion of the intermediate dimension). No inter-GPU communication is needed for this operation.
RowParallelLinear splits along the input dimension. Each GPU computes a partial result, then an AllReduce sums the partial results across all GPUs. This is used for the attention output projection and MLP down-projection — the operations where partial results need to be recombined.
VocabParallelEmbedding splits the embedding table (often 128k+ tokens for modern models) across GPUs. Each GPU holds a slice of the vocabulary and performs lookups only for tokens in its range.
The VllmConfig provides tensor_parallel_size during initialization, and each layer auto-configures its sharding based on the worker’s rank. A model developer doesn’t write explicit GPU assignment code — they use these primitives and the infrastructure handles partitioning.
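A small sketch of the sharding arithmetic these primitives implement, assuming tp_size=4 and Llama-7B-like dimensions (this is the math, not vLLM's layer code):
import torch

tp_size, hidden, intermediate = 4, 4096, 11008

# ColumnParallelLinear: split the OUTPUT dimension. Each rank owns
# intermediate / tp_size output features; no communication is needed.
up_proj_weight = torch.randn(intermediate, hidden)   # nn.Linear stores [out, in]
col_shards = up_proj_weight.chunk(tp_size, dim=0)
print(col_shards[0].shape)   # torch.Size([2752, 4096])

# RowParallelLinear: split the INPUT dimension. Each rank produces a
# full-size partial output; an AllReduce then sums the partials.
down_proj_weight = torch.randn(hidden, intermediate)
row_shards = down_proj_weight.chunk(tp_size, dim=1)
print(row_shards[0].shape)   # torch.Size([4096, 2752])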
Input Flattening and the 1D Computation Graph
This is one of the more interesting design decisions in vLLM. In standard PyTorch, inputs are 2D tensors of shape [batch_size, sequence_length]. This requires padding to align sequences of different lengths — if you're processing three requests with lengths 5, 12, and 3, you pad everything to length 12. That means 16 wasted positions out of 36, roughly 44% of compute thrown away on padding tokens.
vLLM eliminates padding entirely. The ModelRunner concatenates all tokens from all concurrent requests into a single 1D tensor of shape [total_num_tokens]. A separate positions tensor (also 1D) provides the sequence position for each token:
# Three concurrent requests:
# Request A: tokens [101, 204, 305] (3 tokens, positions 0,1,2)
# Request B: tokens [42, 55, 67, 89, 12] (5 tokens, positions 0,1,2,3,4)
# Request C: tokens [700, 801] (2 tokens, positions 0,1)
# Flattened input — no padding, no wasted compute:
input_ids = [101, 204, 305, 42, 55, 67, 89, 12, 700, 801] # shape: [10]
positions = [0, 1, 2, 0, 1, 2, 3, 4, 0, 1] # shape: [10]
Every layer in a native vLLM model is written to process this 1D stream. Embeddings do lookups on the 1D tensor. RoPE uses the positions tensor for correct positional encoding. The attention layer uses block_tables to reconstruct the logical sequence structure — knowing which tokens belong to which request and where their KV cache blocks live in physical memory.
Weight Loading — From Disk to Device
One of the trickier parts of native integration: implementing load_weights(self, weights). This method receives an iterator of (name, tensor) pairs from AutoWeightsLoader and must map checkpoint weights into the model’s parameters.
The parameter mismatch problem is why this isn’t trivial. vLLM often fuses layers that are separate in the Hugging Face checkpoint. For example, a standard Llama MLP has separate gate_proj and up_proj linear layers. In vLLM, these become a single gate_up_proj to reduce kernel launches. The load_weights logic must handle this:
def load_weights(self, weights):
    # Stacked-parameter mapping: (vLLM param name, HF shard name, shard index).
    # gate_proj and up_proj from the checkpoint both land in gate_up_proj.
    stacked_params_mapping = [
        ("gate_up_proj", "gate_proj", 0),  # first half of the fused weight
        ("gate_up_proj", "up_proj", 1),    # second half of the fused weight
    ]
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        for param_name, shard_name, shard_id in stacked_params_mapping:
            if shard_name not in name:
                continue
            # Fused case: route this shard into the correct slice of gate_up_proj
            param = params_dict[name.replace(shard_name, param_name)]
            param.weight_loader(param, loaded_weight, shard_id)
            break
        else:
            # Regular case: use the parameter's own weight_loader if it has one
            # (parallel layers attach loaders that handle TP sharding)
            param = params_dict[name]
            weight_loader = getattr(param, "weight_loader", None)
            if weight_loader is not None:
                weight_loader(param, loaded_weight)
            else:
                param.data.copy_(loaded_weight)
Two utilities make this process more manageable:
AutoWeightsLoader abstracts away the routing of weights to child modules. It recursively discovers sub-modules that have their own load_weights methods and delegates the appropriate (name, tensor) pairs to each one, so the top-level model doesn’t need to manually dispatch weights. The upstream shard iteration — walking through model-00001-of-00005.safetensors through model-00005-of-00005.safetensors and presenting a unified (name, tensor) stream — happens in vLLM’s weight loading utilities (weight_utils.py), which feed into AutoWeightsLoader.
WeightsMapper provides declarative renaming rules. Instead of writing string manipulation inside load_weights, you define a mapping:
mapper = WeightsMapper(orig_to_new_prefix={
"model.decoder.layers.": "model.layers.",
"norm.weight": "model.norm.weight"
})
The loader applies these rules on the fly, letting the vLLM model structure diverge from the Hugging Face structure while maintaining compatibility with official checkpoints.
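Under the hood this amounts to a rename pass over the (name, tensor) stream. A minimal sketch of what applying prefix rules looks like (illustrative; the real WeightsMapper also handles suffix and substring rules):
def apply_prefix_mapping(weights, orig_to_new_prefix):
    # Rename checkpoint tensor names on the fly, before they reach load_weights
    for name, tensor in weights:
        for old, new in orig_to_new_prefix.items():
            if name.startswith(old):
                name = new + name[len(old):]
                break
        yield name, tensor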
For quantized models, weight loading has an additional layer of complexity. In 4-bit quantization schemes like AWQ, eight 4-bit weights are packed into a single int32. The loader must recognize that the destination parameter is quantized and load the packed tensor directly, no casting to float16 first. If the config specifies quantization, the linear layer initializes a specialized “quantized parameter” object that overrides the default loading behavior.
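A pure-Python illustration of that packing density (real AWQ/GPTQ layouts interleave the nibbles differently, but the arithmetic is the same):
nibbles = [3, 14, 7, 0, 9, 12, 1, 5]      # eight 4-bit weight values
packed = 0
for i, v in enumerate(nibbles):
    packed |= v << (4 * i)                 # one int32 now carries all eight

# The loader must copy such packed tensors verbatim into the quantized
# parameter; casting to float16 first would destroy the bit layout.
unpacked = [(packed >> (4 * i)) & 0xF for i in range(8)]
assert unpacked == nibbles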
Part 4: The Execution Core
The Attention Switchboard (ForwardContext)
In standard PyTorch, an Attention module is self-contained — it receives Q, K, V and computes the output. In vLLM, the Attention layer acts as a client to a global context. When the model executes a forward pass, a ForwardContext is established containing the AttentionMetadata generated by the scheduler.
The AttentionMetadata object is essentially a page table for the KV cache. Here’s what it contains, with concrete values for a batch of 3 requests:
# AttentionMetadata for a batch with 3 requests:
# Request A: 128 tokens, KV spread across blocks [4, 17, 23, 8]
# Request B: 64 tokens, KV in blocks [1, 12]
# Request C: 256 tokens, KV in blocks [0, 5, 9, 14, 22, 31, 7, 19]
block_tables = [
[4, 17, 23, 8, 0, 0, 0, 0], # Request A (padded to max_blocks)
[1, 12, 0, 0, 0, 0, 0, 0], # Request B
[0, 5, 9, 14, 22, 31, 7, 19], # Request C
]
# shape: [3, 8] — each entry is a physical block index in GPU memory
slot_mapping = [512, 65, 1024]
# For decode: maps each new token to its physical slot (block_idx * block_size + offset)
context_lens = [128, 64, 256]
# Sequence length per request, for correct attention masking
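To make the slot_mapping comment concrete, here's the arithmetic for a single decode-time token write, assuming vLLM's default block_size of 16 (the numbers are illustrative and independent of the batch above):
block_size = 16
block_table = [4, 17, 23, 8]     # physical blocks owned by one request
position = 37                    # logical position of the token being written

physical_block = block_table[position // block_size]   # logical block 2 -> block 23
offset = position % block_size                         # 37 % 16 = 5
slot = physical_block * block_size + offset            # 23 * 16 + 5 = 373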
The attention layer’s forward() does not compute QK^T * V directly. Instead, it dispatches to different backends depending on the phase:
- Prefill (processing prompts) → FlashAttention variant, optimized for parallel computation over many tokens. All Q, K, V tokens are known upfront, so we can exploit parallelism across the sequence.
- Decode (generating tokens) → PagedAttention kernel. The query is a single new token. The kernel uses block_tables to gather K and V vectors from non-contiguous physical blocks, compute attention scores, and scatter the result. This is the operation that makes virtual memory for the KV cache work: tokens don't need to be stored contiguously.
The split between prefill and decode backends is important for performance. Prefill is compute-bound (large matrix multiplications), so FlashAttention’s tiling strategy works well. Decode is memory-bound (loading many cached K/V vectors for a single query), so PagedAttention’s gather-based approach is the right fit.
Distributed Execution Contracts
Supporting 405B-class models means multi-GPU execution across potentially many nodes. This introduces specific contracts that new model implementations must satisfy:
Tensor Parallelism requires precise synchronization. In a standard Transformer block, AllReduce happens exactly twice: after the attention output projection (RowParallelLinear) and after the MLP down-projection (also RowParallelLinear). These are the points where partial results from each GPU must be summed. Adding extra synchronizations (say, an unnecessary AllReduce after the QKV projection) doesn't produce wrong results, but it degrades throughput. Each AllReduce is a blocking collective, so all GPUs wait.
Pipeline Parallelism splits the model vertically by layers. The native model’s forward method must handle an intermediate_tensors argument:
def forward(self, input_ids, positions, attn_metadata, intermediate_tensors=None):
if intermediate_tensors is not None:
# We're not the first pipeline stage — skip embedding
hidden_states = intermediate_tensors["hidden_states"]
start_layer = self.start_layer # e.g., layer 16
else:
# First pipeline stage — process from the embedding
hidden_states = self.embed_tokens(input_ids)
start_layer = 0
for layer in self.layers[start_layer:self.end_layer]:
hidden_states = layer(hidden_states, positions, attn_metadata)
return hidden_states
Rank 0 executes layers 0–N, outputs the hidden state as intermediate_tensors. Rank 1 receives it, skips the embedding layer, and resumes from layer N+1. If a developer forgets to implement this check, the model works fine in single-node TP mode but silently breaks in PP mode: it tries to re-embed already-processed hidden states.
CUDA Graph Compatibility requires static control flow. Dynamic Python branching based on tensor values like if tensor.sum() > 0: breaks CUDA Graph capture because the graph records a fixed execution path. This is particularly relevant for Mixture-of-Experts models where expert routing is inherently data-dependent. The routing logic must use masked tensor operations (scatter, gather with masks) rather than Python if/else, so the computation graph remains static even though different experts activate for different tokens.
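As a toy illustration of that requirement (this is not vLLM's MoE code), here is top-1 routing written with a fixed-trip-count loop and torch.where, so the captured graph is identical no matter which experts the router picks:
import torch

hidden = torch.randn(10, 64)                # 10 tokens from the flattened batch
router_logits = torch.randn(10, 4)          # 4 experts, top-1 routing
expert_ids = router_logits.argmax(dim=-1)

out = torch.zeros_like(hidden)
for e in range(4):                          # Python loop with a FIXED trip count
    expert_out = hidden * (e + 1.0)         # stand-in for expert e's MLP
    mask = (expert_ids == e).unsqueeze(-1)  # data dependence lives in the mask
    out = torch.where(mask, expert_out, out)  # no tensor-dependent Python branch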
| Parallelism Type | Contract for Model Developer | Failure Mode if Missed |
|---|---|---|
| Tensor Parallelism | Use ColumnParallel/RowParallel layers; exactly 2 AllReduces per block | Extra AllReduces → throughput degradation; wrong layer types → incorrect results |
| Pipeline Parallelism | Handle intermediate_tensors arg; define start_layer/end_layer | Model re-embeds hidden states → garbage output on later pipeline stages |
| CUDA Graphs | No Python control flow based on tensor values; use masked ops for routing | Graph capture fails → fallback to eager mode → 2-3x slower decode |
Putting It All Together — The Optimization Ladder
To summarize the full lifecycle, model support in vLLM is a progression through increasingly optimized tiers:
- Transformers Fallback — works immediately. PagedAttention memory management, but no kernel fusion and only limited graph capture and parallelism. This is where every new model starts.
- Plugin Registration — an external package provides a native model class. pip install and go. Model creators control their own release timeline.
- Native Model Class — upstreamed into vLLM. Parallel primitives, 1D flattened computation, CUDA Graph compatible. This is where the performance lives.
- Quantization Support — AWQ, GPTQ, FP8 weight loading tested and working. Packed tensor handling, per-layer quantization configs. Unlocks deployment on smaller hardware.
- Full Production — Pipeline Parallelism support, custom attention patterns if needed, benchmarked against reference implementations. Ready for large-scale serving.
The plugin system represents the future direction — federated model support where model creators can ship “vLLM-ready” code independently of core releases. Instead of waiting for the vLLM team to implement every new architecture, the ecosystem moves toward model creators owning their integration path.
Closing
The process of introducing a new model into vLLM is a systems engineering exercise. It requires transforming a static model definition (essentially a recipe for matrix multiplications) into a dynamic, distributed execution graph that manages its own memory, shards its own weights, and coordinates across GPUs. The Transformers fallback bridges the gap for immediate access; native integration is where the performance lives.
There are four core contracts a model must satisfy for full integration: registry updates (mapping architecture strings to code), class restructuring (parallel primitives, 1D flattening), weight loading (handling mismatches between checkpoint and runtime structure), and PagedAttention integration (routing attention through the block-table-based memory system). Understanding these four contracts gives you a mental model for reasoning about model support in any inference engine, not just vLLM.
References
- Kwon, W. et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. arXiv:2309.06180
- vLLM Documentation — Architecture Overview. docs.vllm.ai
- vLLM Documentation — Adding a New Model. docs.vllm.ai
- vLLM Documentation — Plugin System. docs.vllm.ai
- vLLM Source — Model Registry (registry.py). GitHub
- vLLM Source — Transformers Backend (transformers/). GitHub
- vLLM Documentation — Class Hierarchy. docs.vllm.ai
- Gordic, A. “Inside vLLM: Anatomy of a High-Throughput LLM Inference System.” aleksagordic.com
- Zalt, M. “The Hidden Switchboard Behind vLLM Attention.” zalt.me
- Prerepa, A. “ZvLLM: Zigzag forward pass with vLLM.” adiprerepa.github.io
- El Shafie, H. “Paged Attention from First Principles: A View Inside vLLM.” hamzaelshafie.bearblog.dev
- vLLM Documentation — Paged Attention Design. docs.vllm.ai