The Deceptively Simple Command

vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8

One line. Eight GPUs. A 70-billion parameter model ready to serve requests. But this hides significant complexity.

Behind this command, three distinct software systems spring into action. Kubernetes allocates pods and manages node resources. Ray spawns actors, creates placement groups, and coordinates distributed execution. vLLM initializes workers, establishes NCCL communication rings, and begins orchestrating the token-by-token dance of autoregressive generation.

The interesting part is the choreography between systems. Each layer operates at a different granularity, speaks a different language, and solves a different class of problem. Kubernetes thinks in pods and nodes. Ray thinks in actors and tasks. vLLM thinks in requests and tokens. Yet when you hit that endpoint with a prompt, all three coordinate to produce a coherent response.

The question worth asking: How do these systems know when to hand off control to each other?

This post traces that coordination. We’ll follow the cascade from kubectl apply to the moment NCCL rings form and tensor data starts flowing. We’ll examine why placement groups matter more than you’d expect, why your network configuration can make or break performance, and how the industry is evolving toward disaggregated architectures that split inference across specialized pools.

If you’ve read the previous deep-dive on the hidden software stack behind inference, this builds on that foundation. We won’t revisit PagedAttention or continuous batching fundamentals. Instead, we’re zooming out to the orchestration layer, the software that transforms a rack of GPUs into something resembling a programmable supercomputer.

Prerequisites: This post assumes familiarity with PagedAttention, continuous batching, and basic Kubernetes concepts. If you’re new to the inference stack, start with The Hidden Software Stack Behind Fast LLM Inference.

The Three-Layer Stack

[Interactive diagram: the Kubernetes layer (coarse-grained: nodes, pods, containers) hands off to the Ray layer (fine-grained: GCS control store, per-node Raylets, actors, placement groups), which hands off to the vLLM layer (token-level: scheduler, GPU workers, NCCL ring for tensor sync). Control flows down through the layers; tensor data bypasses the middle entirely, flowing over NCCL.]

Three Systems, Three Granularities

This stack works because of its division of labor. Each system operates at a different level of abstraction, handling the problems it’s best suited to solve.

Kubernetes sees the world in pods and nodes. It manages the lifecycle of containers, handles service discovery, and ensures workloads get scheduled onto machines with available resources. Its scheduling decisions happen at the coarse granularity of “does this node have 8 GPUs available?” Kubernetes has no concept of what happens inside those containers once they’re running.

Ray operates one level deeper. It sees actors (long-lived Python objects that can hold state and process messages) and tasks, which are stateless function invocations. Ray’s Global Control Store (GCS) maintains a distributed view of cluster resources, and its Raylets (one per node) handle local scheduling and object management. Ray also understands placement constraints: it can ensure that a group of actors lands on the same physical node, or spreads across nodes in a specific pattern.
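
To make the actor/task distinction concrete, here is a toy Ray sketch (the ModelShard class is illustrative and assumes a node with a free GPU; it is not part of vLLM):

import ray

ray.init()  # local cluster for illustration; in production this attaches to an existing GCS

@ray.remote
def tokenize(text: str) -> list[str]:
    # A task: stateless, can run on any node with a free CPU
    return text.split()

@ray.remote(num_gpus=1)
class ModelShard:
    # An actor: a long-lived process that holds state (here a counter;
    # in vLLM, model weights and KV cache) and owns one GPU
    def __init__(self):
        self.calls = 0

    def forward(self, tokens: list[str]) -> int:
        self.calls += 1
        return len(tokens)

tokens = ray.get(tokenize.remote("the quick brown fox"))
shard = ModelShard.remote()  # the GCS and Raylets pick a node with a free GPU
print(ray.get(shard.forward.remote(tokens)))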

vLLM cares about requests and tokens. It manages the KV cache, schedules which requests get processed in each iteration, and coordinates the actual tensor operations across GPU workers. vLLM’s scheduler operates at millisecond granularity, making decisions every inference step about which tokens to generate next.

Kubernetes has no understanding of GPU topology. It can count GPUs, but it cannot distinguish between eight GPUs connected via NVLink at 900 GB/s and eight GPUs scattered across nodes connected via Ethernet at 10 GB/s. Without additional tooling, Kubernetes might schedule your tensor-parallel workload across two nodes, a configuration with 40-90x less inter-GPU bandwidth than a single NVLink-connected node.

Concern          | Kubernetes               | Ray                    | vLLM
-----------------|--------------------------|------------------------|----------------
Granularity      | Pods/Nodes               | Actors/Tasks           | Requests/Tokens
GPU handling     | Counts only              | Placement constraints  | CUDA assignment
State management | Stateless orchestration  | Actor state in GCS     | KV cache
Restart handling | Pod restarts             | Actor recovery         | Request retry

This is where KubeRay enters the picture. KubeRay is a Kubernetes operator that bridges the gap between Kubernetes’ pod-centric worldview and Ray’s actor-centric model. It introduces three Custom Resource Definitions (CRDs):

  • RayCluster is the foundation. It defines head and worker node configurations, resource requirements, and cluster topology. Use this when you need a persistent Ray cluster for interactive development or long-running services.

  • RayService builds on RayCluster to add Ray Serve deployments. It handles zero-downtime upgrades, health checking, and automatic recovery. This is the production choice for serving workloads.

  • RayJob handles batch workloads. It spins up a cluster, runs a job, then tears everything down. Useful for fine-tuning runs or batch inference over large datasets.

The operator watches these CRDs and reconciles cluster state: creating pods, configuring networking, managing the Ray head node’s GCS, and ensuring workers connect properly. It’s the translation layer that lets Kubernetes manage Ray clusters without understanding Ray’s internal semantics.


The Reconciliation Dance

When you kubectl apply a RayService manifest, you trigger a cascade that touches every layer of the stack. Understanding this sequence reveals how control flows through the system.

Phase 1: KubeRay Operator Activation

The KubeRay operator runs as a deployment in your cluster, watching for changes to Ray CRDs. When it detects your new RayService, its reconciliation loop activates. The operator compares desired state (your manifest) against actual state (what’s running) and generates a plan to converge them.

Phase 2: Head Node Creation

First, the operator creates the Ray head pod. This pod runs the Global Control Store (GCS) on port 6379, a distributed metadata store that tracks cluster membership, resource availability, and actor locations. The head also exposes the Ray Dashboard on port 8265 for observability.

The head pod needs to be running and healthy before workers can join. KubeRay handles this sequencing automatically, using Kubernetes’ built-in readiness probes to gate worker creation.

Phase 3: Worker Pod Launch

Once the head is ready, the operator creates worker pods. Each worker’s entrypoint executes ray start --address=<head-ip>:6379, connecting to the head’s GCS. This is where the Kubernetes and Ray worlds first touch: Kubernetes schedules the pod, but Ray handles what happens inside.

Phase 4: Resource Discovery

Inside each worker pod, the Raylet process inspects its environment. It discovers available GPUs through CUDA, determines memory capacity, and inventories other resources. This information flows back to the GCS, which maintains a global resource table.

Phase 5: Cluster Ready

When all workers have connected and advertised their resources, the Ray cluster is ready. The GCS now has a complete picture: which nodes exist, what resources each has, and how to reach them. Ray Serve can start accepting deployment requests.
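
A quick sanity check at this point is to ask the GCS what it thinks the cluster looks like. A minimal sketch, assuming it runs from inside the cluster (for example, from the head pod):

import ray

# Attach to the existing cluster rather than starting a local one
ray.init(address="auto")

# Aggregate resources advertised by every Raylet to the GCS
print(ray.cluster_resources())    # e.g. {'CPU': 96.0, 'GPU': 8.0, ...}
print(ray.available_resources())  # what is currently unreserved

# Per-node breakdown: one entry per Raylet that has registered with the GCS
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"].get("GPU", 0))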

When vLLM initializes with --tensor-parallel-size 8, it needs to transform this general-purpose Ray cluster into a coordinated inference machine.

vLLM Initialization Sequence:

  1. Cluster Connection: vLLM’s RayGPUExecutor calls initialize_ray_cluster(), connecting to the existing Ray cluster or starting a new one.

  2. Placement Group Creation: vLLM creates a placement group with the specification [{"GPU": 1}] * 8, which means eight bundles, each requiring one GPU. The placement strategy is STRICT_PACK, meaning all bundles must land on a single node.

  3. GCS Scheduling: The GCS consults its resource table. Can any single node satisfy eight GPU bundles? If yes, it reserves those resources atomically. If no, the placement group creation fails. Better to fail fast than scatter actors across nodes.

  4. Actor Spawning: vLLM spawns RayWorkerWrapper actors inside the placement group. Each actor gets assigned to a specific bundle, guaranteeing GPU affinity. Ray sets CUDA_VISIBLE_DEVICES appropriately so each worker sees only its assigned GPU.

  5. Process Group Initialization: Each worker calls torch.distributed.init_process_group(backend='nccl'). This creates the NCCL communicator that will handle all tensor data movement.

  6. NCCL Ring Formation: NCCL establishes its communication topology (typically ring or tree patterns optimized for the underlying hardware). From this point forward, tensor data flows through NCCL, completely bypassing Ray’s object store.

Here’s how the handoff works: Ray’s job is setup and supervision. Once the NCCL rings form, Ray steps aside for the performance-critical path. Tensor data never touches Ray’s object store. It flows directly between GPUs over NVLink or the network fabric. Ray remains involved for health monitoring, actor lifecycle management, and metrics collection, but it’s out of the hot path.
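
Steps 2 through 6 can be pictured as a short Ray + PyTorch sketch. This is a simplified illustration, not vLLM's actual RayWorkerWrapper code; the Worker class, master address, and port are placeholders:

import ray
import torch
import torch.distributed as dist
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")  # attach to the cluster KubeRay already built

@ray.remote(num_gpus=1)
class Worker:
    def init_nccl(self, rank: int, world_size: int, master_addr: str, master_port: int):
        # Ray has already set CUDA_VISIBLE_DEVICES, so device 0 is this worker's GPU
        torch.cuda.set_device(0)
        dist.init_process_group(
            backend="nccl",
            init_method=f"tcp://{master_addr}:{master_port}",
            rank=rank,
            world_size=world_size,
        )
        return dist.get_rank()

# Steps 2-3: reserve eight single-GPU bundles on one node (fails fast if impossible)
pg = ray.util.placement_group([{"GPU": 1}] * 8, strategy="STRICT_PACK")
ray.get(pg.ready(), timeout=60)

# Step 4: pin one actor to each bundle
workers = [
    Worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(8)
]

# Steps 5-6: every worker joins the same NCCL process group
ray.get([
    w.init_nccl.remote(rank=i, world_size=8,
                       master_addr="10.0.0.5", master_port=29500)  # placeholder address
    for i, w in enumerate(workers)
])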


Why STRICT_PACK Changes Everything

Placement groups are Ray’s mechanism for expressing scheduling constraints that go beyond “find me a node with resources.” For distributed inference, they determine whether your system performs at full speed or crawls.

Consider what happens without placement constraints. You request 8 GPU actors. Ray’s default scheduler might place 4 on Node A and 4 on Node B. Both nodes have available GPUs, the request is satisfied, everyone’s happy. Except they’re not.

The Disaster Scenario:

With tensor parallelism, every transformer layer requires an AllReduce operation to synchronize partial results across all GPUs. For a Llama-70B with 80 layers, that's 160 AllReduce calls per forward pass. Each AllReduce involves every GPU in the group.

When all 8 GPUs are on one node connected via NVLink:

  • Bandwidth: ~900 GB/s bidirectional
  • AllReduce latency: microseconds

When 4 GPUs are on Node A and 4 on Node B, connected via datacenter Ethernet:

  • Bandwidth: ~10-25 GB/s (100GbE or 200GbE NICs at best)
  • AllReduce latency: milliseconds

The performance difference is stark. You’re looking at a 40-90x bandwidth reduction for every AllReduce. For interactive inference where you need responses in hundreds of milliseconds, this makes the system unusable. A 50ms operation becomes a 2-second operation.

STRICT_PACK to the Rescue:

The STRICT_PACK placement strategy provides an atomic guarantee: “Reserve all N bundles on a single node. If no single node can satisfy the request, schedule none of them.”

# Conceptual placement group specification (a sketch of what vLLM does internally)
import ray

ray.init(address="auto")  # attach to the running Ray cluster
pg = ray.util.placement_group(
    bundles=[{"GPU": 1}] * 8,   # eight bundles, one GPU each
    strategy="STRICT_PACK",     # all eight must land on a single node
)
# Wait for the reservation; raises GetTimeoutError if no node has 8 free GPUs
ray.get(pg.ready(), timeout=60)

This is all-or-nothing. Either you get all 8 GPUs on one node with NVLink connectivity, or you get an error telling you no suitable node exists. No silent degradation to a broken configuration.

SPREAD for Pipeline Parallelism:

Not all parallelism strategies want STRICT_PACK. Pipeline parallelism deliberately spans multiple nodes, with each node handling different layers of the model. Here, SPREAD makes sense: you want actors distributed across nodes to maximize aggregate memory capacity.

The communication pattern differs too. Pipeline parallelism uses point-to-point sends between adjacent stages, not AllReduce across all participants. This is less latency-sensitive because you’re overlapping computation with communication: while stage N processes micro-batch B, stage N-1 can send micro-batch C.

Strategy    | Use Case             | Communication          | When to Use
------------|----------------------|------------------------|--------------------------------
STRICT_PACK | Tensor Parallelism   | AllReduce (all-to-all) | Same-node NVLink required
SPREAD      | Pipeline Parallelism | Point-to-point         | Memory > latency
PACK        | Mixed workloads      | Varies                 | Prefer colocation, allow spread

The placement group abstraction is what lets vLLM express “I need these actors to be co-located” without knowing anything about Kubernetes node topology. Ray’s GCS has that knowledge (from Raylet resource advertisements), and the placement group mechanism lets vLLM leverage it declaratively.
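
For contrast, a pipeline-parallel deployment flips the strategy. A minimal sketch (stage count illustrative; ray imported and initialized as in the earlier snippet):

# Pipeline parallelism: one single-GPU bundle per stage, spread across nodes
pg = ray.util.placement_group(
    bundles=[{"GPU": 1}] * 4,   # four pipeline stages (illustrative count)
    strategy="SPREAD",          # prefer a distinct node for each bundle
)
ray.get(pg.ready())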


Two Interfaces, Two Purposes

Even with correct placement, there’s another way to tank your inference performance: letting NCCL traffic flow over the wrong network interface.

Production GPU nodes typically have multiple network interfaces:

  • eth0: The standard Kubernetes pod network. Usually an overlay network (Calico, Cilium, Flannel) that provides cluster connectivity, DNS, and service discovery. Fine for control plane traffic: health probes, metrics scraping, Ray GCS heartbeats.

  • net1/ib0/bond0: A high-performance interface connected to InfiniBand or RoCE fabric. This is your data plane, purpose-built for moving large tensors between nodes at 100-400 Gb/s with microsecond latencies.

The problem: NCCL doesn’t automatically know which interface to use. By default, it may discover eth0 first and decide that’s the interface for collective operations. Your carefully provisioned InfiniBand fabric sits idle while tensor data crawls through the overlay network.

The key environment variable:

NCCL_SOCKET_IFNAME=net1

This tells NCCL explicitly which interface to use for socket-based communication. For InfiniBand with RDMA, you’d also set:

NCCL_IB_HCA=mlx5_0

In Kubernetes, you expose multiple interfaces to pods using Multus CNI, a meta-plugin that lets you attach additional networks beyond the default pod network. Your pod spec includes annotations requesting attachment to the high-speed network:

annotations:
  k8s.v1.cni.cncf.io/networks: high-speed-net

The result is a pod with two interfaces: eth0 for Kubernetes integration, and net1 for NCCL traffic. Control plane and data plane are cleanly separated.
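
One way to make the interface choice explicit is to set the NCCL variables in the worker's environment before the process group is created. A sketch, assuming the high-speed interface shows up as net1 and the HCA is named mlx5_0 (both values are deployment-specific):

import os
import torch.distributed as dist

# Pin NCCL to the data-plane interface; must happen before init_process_group
os.environ.setdefault("NCCL_SOCKET_IFNAME", "net1")  # socket traffic and bootstrap
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")       # RDMA over the InfiniBand HCA
os.environ.setdefault("NCCL_DEBUG", "INFO")          # log which interface NCCL actually picks

# Reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the environment set by the launcher
dist.init_process_group(backend="nccl")

In practice these variables usually land in the pod spec's env block, so every worker inherits them without code changes.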

Why This Matters for Multi-Node:

For single-node tensor parallelism, NVLink handles everything and network configuration is less critical. But the moment you scale beyond one node, whether for pipeline parallelism, larger tensor-parallel groups, or disaggregated serving, network configuration becomes essential.

A properly configured InfiniBand fabric can deliver 400 Gb/s (50 GB/s) per port with single-digit microsecond latencies. The Kubernetes overlay network, even with modern CNIs, typically maxes out at 10-25 Gb/s with millisecond-scale latencies. For operations that happen 160 times per forward pass, this difference compounds dramatically.


Choosing Your Communication Pattern

We’ve seen how network configuration can make or break performance. The reason network matters so much depends on which parallelism strategy you’re using, and each strategy creates fundamentally different communication patterns.

Parallelism isn’t one-size-fits-all. Different strategies create different communication patterns, and understanding these patterns reveals why orchestration decisions matter.

Tensor Parallelism: The AllReduce Pattern

Tensor parallelism shards weight matrices across GPUs within a layer. Each GPU computes a partial result, then all GPUs synchronize via AllReduce to combine their contributions.

Ray’s responsibilities:

  • Create STRICT_PACK placement group
  • Spawn workers with correct GPU assignments
  • Set CUDA_VISIBLE_DEVICES per worker
  • Monitor actor health, restart on failure

What Ray doesn’t do:

  • Manage AllReduce operations (that’s NCCL)
  • Move tensor data (that flows through NVLink/NCCL)
  • The object store is bypassed entirely for the hot path

The Communication Reality:

For an 80-layer model, tensor parallelism requires 160 AllReduce operations per forward pass (2 per layer—one after attention, one after FFN). Each AllReduce synchronizes tensors sized [batch_size, seq_len, hidden_dim]. With Llama-70B’s hidden dimension of 8192 and a batch of 32 sequences at 2048 tokens, you’re moving ~1 GB per AllReduce.

AllReduce has ring and tree implementations. Ring AllReduce on 8 GPUs requires each GPU to send and receive roughly 2 × (7/8) of the tensor, 7/8 in the reduce-scatter phase and another 7/8 in the all-gather phase, so every operation pushes close to twice the tensor size through each GPU's links. The only way this is fast is with NVLink's 900 GB/s bandwidth.
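
A quick back-of-the-envelope check of those numbers (fp16 activations assumed; the bandwidth figures are the rough ones used earlier, so treat the results as orders of magnitude rather than measurements):

# Size of one AllReduce payload for the prefill batch described above (fp16)
batch, seq_len, hidden = 32, 2048, 8192
bytes_per_elem = 2
payload_gb = batch * seq_len * hidden * bytes_per_elem / 1e9   # ~1.07 GB

# Ring AllReduce: each GPU sends and receives ~2(N-1)/N of the payload
n_gpus = 8
per_gpu_gb = payload_gb * 2 * (n_gpus - 1) / n_gpus            # ~1.9 GB

for name, bw in [("NVLink (900 GB/s)", 900.0), ("25 GB/s Ethernet", 25.0)]:
    print(f"{name:>18}: {per_gpu_gb / bw * 1e3:7.2f} ms per AllReduce")

These are bandwidth-only estimates for the large prefill batch above; a single decode step moves a tensor orders of magnitude smaller, which is where the microsecond-scale AllReduce figures earlier come from.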

Pipeline Parallelism: The Point-to-Point Pattern

Pipeline parallelism assigns different layers to different GPUs (or groups of GPUs). Data flows through stages sequentially: Stage 0 processes the input, sends activations to Stage 1, which processes and sends to Stage 2, and so on.

Orchestration Differences:

Ray creates a placement group that may span nodes (SPREAD rather than STRICT_PACK). Each stage gets its own bundle, and stages communicate via point-to-point sends rather than collective operations.

The Bubble Problem:

Pure pipeline parallelism has a fundamental inefficiency. While Stage 0 processes micro-batch 1, Stages 1-7 sit idle waiting for work to reach them. While Stage 7 processes the final micro-batch, Stages 0-6 have already drained and sit idle again. The pipeline is only fully utilized in the middle of a batch.

The bubble ratio quantifies this waste:

$$\text{bubble ratio} = \frac{p - 1}{m + p - 1}$$

Where $p$ is the number of pipeline stages and $m$ is the number of micro-batches. With 8 stages and 8 micro-batches, you lose 7/15 ≈ 47% of potential throughput to bubbles.

Think of it this way: p is how many slices you cut the model into, m is how many micro-batches you push through those slices. With 8 pipeline stages but only 1 micro-batch, 7 out of 8 stages are always waiting, meaning 87.5% of compute is wasted on bubbles. With 64 micro-batches, that drops to ~10%. The lesson: pipeline parallelism only pays off with large batches.

Continuous batching helps by keeping the pipeline fed with new requests, but the fundamental tradeoff remains: pipeline parallelism trades AllReduce bandwidth requirements for pipeline bubbles.
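
The formula is simple enough to sanity-check directly:

def bubble_ratio(p: int, m: int) -> float:
    """Fraction of stage-time lost to pipeline fill/drain bubbles."""
    return (p - 1) / (m + p - 1)

for p, m in [(8, 1), (8, 8), (8, 64)]:
    print(f"p={p:>2}, m={m:>3}: {bubble_ratio(p, m):5.1%} of compute idle")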

[Interactive figure: The Pipeline Bubble Problem. Stages sit idle during the startup and drain phases; more micro-batches reduce bubble overhead. With p = 4 stages and m = 4 micro-batches, the bubble ratio is (4 - 1) / (4 + 4 - 1) = 3/7 ≈ 43%.]

Expert Parallelism: The AllToAll Pattern

Mixture of Experts (MoE) models introduce a third communication pattern. Instead of every GPU needing data from every other GPU (AllReduce), or sequential point-to-point (pipeline), MoE requires AllToAll: each GPU sends different data to different destinations based on which expert each token routes to.

The orchestration complexity increases significantly. Unlike tensor parallelism's predictable AllReduce or pipeline parallelism's sequential handoffs, MoE communication is dynamic: a router network decides which tokens go to which experts at runtime, so the communication pattern changes every batch. Some experts end up hot (receiving hundreds of tokens) while others sit cold (receiving none).

This dynamic routing breaks placement assumptions. You can't pre-plan which GPUs need to talk to which. Solutions like expert replication (placing hot experts on multiple GPUs) and capacity factors (limiting tokens per expert) add further orchestration complexity. MoE deserves its own treatment, but the key insight here is that AllToAll with dynamic routing is fundamentally harder to orchestrate than static patterns.
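
A toy simulation makes the hot/cold imbalance and the role of a capacity factor concrete (purely illustrative: the router here is a skewed random choice, whereas real routers are learned and typically top-k):

import collections
import random

random.seed(0)
num_experts, num_tokens, capacity_factor = 8, 1024, 1.25
capacity = int(capacity_factor * num_tokens / num_experts)   # max tokens per expert

# Skewed "router": some experts are simply more popular this batch
weights = [2 ** i for i in range(num_experts)]
counts = collections.Counter(
    random.choices(range(num_experts), weights=weights, k=num_tokens)
)

for expert in range(num_experts):
    n = counts[expert]
    dropped = max(0, n - capacity)
    print(f"expert {expert}: {n:4d} tokens, {dropped:3d} over capacity ({capacity})")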


Prefill and Decode Don’t Have to Live Together

The inference phases we’ve discussed (prefill and decode) have fundamentally different computational profiles:

  • Prefill: Process the entire prompt in parallel. Compute-bound. Benefits from high FLOPS.
  • Decode: Generate tokens one at a time. Memory-bound. Benefits from high memory bandwidth.

For most of LLM inference history, both phases ran on the same hardware. But there’s no law of physics requiring this. Disaggregated serving splits them apart.

The Architecture:

  1. Router receives incoming request, examines the prompt
  2. Prefill Pool (optimized for compute: H100s with maximum FLOPS) processes the prompt, generates initial KV cache
  3. KV Transfer moves the KV cache to the decode pool
  4. Decode Pool (optimized for memory bandwidth: could be A100s or L40S) generates tokens autoregressively
  5. Response streams back to client

Why Bother?

Different hardware, different economics. Prefill can run on fewer, more powerful GPUs because it’s compute-bound. You’re not paying for memory bandwidth you don’t use. Decode can run on more, cheaper GPUs optimized for memory bandwidth.

The pools also scale independently. A sudden spike in long prompts? Scale up prefill. Many concurrent users generating responses? Scale up decode. The tight coupling of traditional serving forces you to scale both together.

The KV Transfer Challenge:

The catch is moving the KV cache. For Llama-70B with 128K context, the KV cache can reach 40+ GB per request. Moving that between pools is non-trivial.
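
That 40+ GB figure follows directly from the model shape. A quick calculation, assuming the published Llama-3.1-70B configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and an fp16 cache:

# KV cache size for one 128K-token request on Llama-3.1-70B (fp16, GQA)
layers        = 80
kv_heads      = 8          # grouped-query attention: 8 KV heads (vs 64 query heads)
head_dim      = 128
bytes_per_val = 2          # fp16
context       = 128 * 1024

bytes_per_token = layers * 2 * kv_heads * head_dim * bytes_per_val   # K and V
total_gb = bytes_per_token * context / 1e9
print(f"{bytes_per_token / 1024:.0f} KiB per token, {total_gb:.1f} GB for the full context")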

Two approaches are emerging:

  • NIXL (NVIDIA Inference Transfer Library): GPU-to-GPU RDMA transfers over InfiniBand/RoCE. Keeps data on GPU memory throughout, avoiding PCIe bottlenecks.

  • LMCache / Shared Storage: Write KV cache to a fast shared storage layer (think distributed NVMe or GPU memory pooling). This enables “context caching”: compute popular prompts once, reuse across millions of requests.

Context caching is particularly powerful for system prompts. If every request to your coding assistant starts with the same 8K token system prompt, why recompute that KV cache for every request? Compute it once, cache it, and let decode instances reuse it.


The Request Knows Where to Go

Traditional load balancing treats all requests as fungible. Round-robin, least-connections, random: they all assume any backend can handle any request equally well. For LLM inference with caching, this assumption is expensive.

If Request A and Request B share a common prefix (same system prompt, same few-shot examples), and Request A already warmed the KV cache on Pod 1, sending Request B to Pod 2 wastes the cache hit opportunity. You’ll recompute the shared prefix unnecessarily.

Prefix-Aware Routing:

Ray Serve implements prefix-aware routing using a prefix tree of cached prefixes. The router maintains a lightweight index of which prefixes are cached on which replicas. When a request arrives, it hashes the prefix, looks up which replica(s) have it cached, and routes accordingly.

This transforms routing from “who’s least busy?” to “who already has my context?”
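
A stripped-down sketch of the idea (this is not Ray Serve's implementation; it swaps the prefix tree for a simple hash index, adds a crude load bound, and every name here is illustrative):

import hashlib
from collections import defaultdict

class PrefixRouter:
    """Route to a replica that already holds the prompt's prefix, unless it is overloaded."""

    def __init__(self, replicas: list[str], max_load: int = 8):
        self.replicas = replicas
        self.max_load = max_load
        self.prefix_index: dict[str, str] = {}        # prefix hash -> replica
        self.load: dict[str, int] = defaultdict(int)  # replica -> in-flight requests

    def route(self, prompt: str, prefix_len: int = 512) -> str:
        key = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
        replica = self.prefix_index.get(key)
        if replica is None or self.load[replica] >= self.max_load:
            # Cache miss or hot replica: spill over to the least-loaded replica
            replica = min(self.replicas, key=lambda r: self.load[r])
            self.prefix_index[key] = replica          # future matches go here
        self.load[replica] += 1
        return replica

router = PrefixRouter(["pod-0", "pod-1", "pod-2"])
print(router.route("SYSTEM: You are a helpful coding assistant...\nUSER: fix my loop"))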

Gateway API EPP:

The Kubernetes ecosystem is developing similar capabilities at the network layer through Gateway API’s Endpoint Picker (EPP) extension. Routing decisions happen in the ingress controller rather than in application code (Ray Serve).

The ingress controller can hash request properties (prompt prefix, user ID, session token) and consistently route matching requests to the same backend. This works without modifying the serving framework, using pure infrastructure-level routing.

The Tradeoff:

Locality-aware routing can cause load imbalance. If one prefix is extremely popular, its designated replica gets hammered while others sit idle. Production systems need to balance cache locality against load distribution, often through techniques like bounded load consistent hashing or spillover policies.

The evolution is clear: routing is becoming inference-aware. The network layer increasingly understands the semantics of the requests it carries, making decisions that would previously require application-level logic.


The Programmable Supercomputer

Step back and consider what this stack achieves. You start with a collection of independent machines, each with its own GPUs, memory, and network interfaces. Through layers of orchestration (Kubernetes managing containers, Ray managing actors, vLLM managing inference) these resources transform into something that behaves like a single, coherent system.

A prompt enters and gets routed to the right place based on cached state. Compute spreads across GPUs that might span multiple machines, synchronized through NCCL collectives that operate faster than the software can observe. Memory fragments across PagedAttention blocks, invisible to the model but critical for efficiency. The response streams back, one token at a time, while the system is already processing the next request.

The orchestration is the product. Without it, you have expensive hardware sitting idle. With it, you have an inference machine that can serve thousands of concurrent users at interactive latencies.

What’s Emerging:

The boundaries between these layers continue to blur. Systems like DistServe push disaggregation further, with prefill and decode pools that scale independently. KV cache transfer technologies (NIXL, LMCache) treat GPU memory across machines as a single addressable space. The trend is toward tighter integration between orchestration and execution, with systems that make placement decisions not just at startup, but continuously during inference.

Key Metrics to Watch:

If you’re operating these systems, the metrics that matter span all three layers:

  • Kubernetes: Pod scheduling latency, node resource utilization, network policy drops
  • Ray: Placement group creation time, actor restart rate, GCS latency (ray_gcs_* metrics)
  • vLLM: vllm:gpu_cache_usage_perc (memory pressure), vllm:num_requests_waiting (queuing), time-to-first-token (prefill latency), inter-token-latency (decode performance)

The system is only as good as its weakest link. A Kubernetes scheduling delay adds latency to every request until the pod is running. A misconfigured NCCL interface tanks throughput. A hot expert without proper load balancing creates tail latencies.

Understanding the choreography (knowing which system is responsible for what, where the handoffs occur, what can go wrong at each boundary) is what separates operators who can debug production issues from those who cannot.

The stack is complex because the problem is complex. Distributed inference across dozens of GPUs, serving thousands of users, with sub-second latency requirements. But the complexity is structured. Each layer has clear responsibilities and well-defined interfaces. Master those interfaces, understand the handoffs, and the system becomes comprehensible.

Eight GPUs thinking as one. Three software systems coordinating invisibly. One simple command that hides a universe of orchestration.

That’s the stack. Now you know what’s underneath.

