1. Executive Summary: The Bifurcation of Intelligence Infrastructure
The rapid proliferation of Large Language Models (LLMs) has precipitated a fundamental paradigm shift in the design of distributed computing systems. Unlike traditional deep learning workloads, which were largely characterized by static computational graphs and predictable resource consumption, generative AI workloads introduce extreme variability, state dependency, and a distinct dichotomy between compute-bound and memory-bound phases. In response to these challenges, modern inference architectures have evolved from monolithic server binaries into complex, disaggregated distributed systems. The defining characteristic of this new generation of infrastructure is the strict architectural separation between the Control Plane—responsible for orchestration, policy enforcement, and global state management—and the Data Plane—dedicated to high-throughput, low-latency tensor execution and memory management.
This report provides an exhaustive analysis of this architectural evolution. We explore how the control plane has matured into a sophisticated decision-making engine capable of predictive autoscaling and semantic routing, while the data plane has adopted operating system concepts (such as virtual memory and process scheduling) to optimize hardware utilization. Furthermore, we examine the emergence of “prefill-decode disaggregation,” a strategy that physically separates the processing of input prompts from token generation to resolve resource contention, and the integration of specialized hardware like SmartNICs to offload data management tasks. By synthesizing research from systems such as vLLM, Orca, DistServe, Mooncake, and KServe, this document offers a detailed roadmap of the current state and future trajectory of inference scheduling architecture.
2. The Evolution of Inference: From Monoliths to Disaggregated Planes
To understand the necessity of the control/data plane separation, one must first appreciate the limitations of the monolithic architectures that preceded it. In the era of smaller models (e.g., BERT, ResNet), inference was stateless and compute-bound. A single server process could receive a request, execute the forward pass, and return the result within milliseconds. Scaling was achieved simply by replicating this monolithic process behind a load balancer.
However, the autoregressive nature of Generative Pre-trained Transformers (GPT) introduced two critical complexities that broke this model:
- State Management (KV Cache): Generation requires maintaining a Key-Value (KV) cache for each active request. This cache grows dynamically with sequence length, consuming gigabytes of High Bandwidth Memory (HBM). In a monolithic setup, managing this state alongside computation led to severe memory fragmentation and underutilization.1
- Phase Heterogeneity: An inference request consists of two distinct phases with opposing hardware requirements. The Prefill phase (processing the input prompt) is compute-intensive and highly parallelizable. The Decode phase (generating tokens one by one) is memory-bandwidth-bound and serial. Colocating these phases on the same hardware without sophisticated scheduling results in “pipeline bubbles” and interference, where memory-bound tasks stall compute-bound ones.3
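To make the first point concrete, a rough sizing of the KV cache illustrates why state management dominates: the figures below assume a 7B-class model (32 layers, 32 KV heads, head dimension 128, FP16) and are illustrative, not measurements.

```python
# Back-of-the-envelope KV cache sizing (illustrative dimensions, not a benchmark).
# Per token, each layer stores one key and one value vector per KV head:
#   bytes_per_token = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_elem

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    """KV cache footprint in bytes for one sequence of `seq_len` tokens (FP16)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

one_token = kv_cache_bytes(1)               # ~0.5 MiB per token
full_ctx = kv_cache_bytes(4096) / 2**30     # ~2 GiB for a single 4K-token sequence
print(f"{one_token / 2**20:.2f} MiB/token, {full_ctx:.2f} GiB at 4K context")
```

At these rates, a few dozen concurrent long-context requests exhaust the HBM of a single accelerator, which is why cache management becomes a first-class scheduling concern.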
The modern architecture addresses these issues by decoupling the system into two independent planes 5:
- The Control Plane: A “brain” that operates at a cluster-wide scope. It abstracts the complexity of the hardware, manages the lifecycle of models, handles user authentication and quotas, and makes coarse-grained scheduling decisions (e.g., which replica handles a request). It is optimized for availability, consistency, and feature richness.7
- The Data Plane: A “muscle” that operates at the device scope. It is responsible for the actual execution of the neural network, managing GPU memory pointers, scheduling CUDA kernels, and handling tensor parallelism. It is optimized for raw throughput, microsecond latency, and maximizing hardware occupancy.9
3. The Control Plane: Orchestration, Policy, and Global State
The control plane acts as the system’s central nervous system. In frameworks like KServe, Ray Serve, and AIBrix, the control plane is designed to be largely stateless and recoverable, persisting its configuration in distributed stores (like etcd or the Ray Global Control Store) while communicating with workers via lightweight protocols.11
3.1. Workload Orchestration and Lifecycle Management
The primary responsibility of the control plane is Model Lifecycle Management. This involves the complex choreography of provisioning resources, downloading massive model weights (often hundreds of gigabytes), and initializing distributed runtimes.
3.1.1. The Controller Actor and Deployment
In Ray Serve, for instance, a global Controller actor manages the state of the cluster. It reconciles the “desired state” (e.g., 10 replicas of Llama-3-70B) with the “actual state.” When a new deployment is requested, the Controller does not merely spawn a process; it must negotiate with the cluster resource manager (e.g., KubeRay) to find nodes with the specific hardware topology required—such as ensuring 8 H100 GPUs are available on a single node for high-bandwidth NVLink interconnects.11
This placement logic is becoming increasingly sophisticated. Systems like AIBrix employ “Best-Fit Decreasing” (BFD) algorithms to solve the bin-packing problem of fitting models onto heterogeneous clusters. AIBrix’s control plane creates an abstraction layer that allows it to schedule models based on “node affinity” and “LoRA locality.” If a request requires a specific Low-Rank Adaptation (LoRA) adapter, the control plane attempts to route it to a worker that already has that adapter loaded in memory, avoiding the latency of hot-swapping weights.14
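AIBrix's placement logic weighs several signals (affinity, locality, interconnect topology); the snippet below is only a minimal sketch of the best-fit-decreasing step, under the simplifying assumption that models are characterized solely by their VRAM demand and nodes solely by their free VRAM.

```python
# Minimal best-fit-decreasing (BFD) placement sketch. Real placers also weigh
# node affinity, LoRA locality, and interconnect topology.

def best_fit_decreasing(models, nodes):
    """models: {name: vram_gb}, nodes: {name: free_vram_gb} -> {model: node}."""
    placement = {}
    free = dict(nodes)
    # Place the largest models first so smaller ones fill the remaining gaps.
    for model, demand in sorted(models.items(), key=lambda kv: -kv[1]):
        # Best fit: choose the node whose leftover capacity after placement is smallest.
        candidates = [(cap - demand, node) for node, cap in free.items() if cap >= demand]
        if not candidates:
            raise RuntimeError(f"no node can host {model} ({demand} GB)")
        _, node = min(candidates)
        placement[model] = node
        free[node] -= demand
    return placement

print(best_fit_decreasing(
    {"llama-70b": 140, "llama-8b": 16, "mistral-7b": 15},
    {"node-a": 160, "node-b": 80},
))
```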
3.1.2. The Gateway and Semantic Routing
The entry point to the control plane is no longer a Layer 4 load balancer but a Layer 7 AI Gateway. The Envoy AI Gateway exemplifies this shift. It operates as a two-tier architecture:
- Tier 1 (Global): Handles authentication, global rate limiting, and coarse routing (e.g., separating internal vs. external traffic).
- Tier 2 (Cluster): Handles fine-grained traffic distribution to specific model instances.16
Crucially, modern control planes implement Semantic Routing. Instead of Round-Robin, the router analyzes the incoming prompt. Using techniques like locality-sensitive hashing on the system prompt or shared prefix, the control plane routes requests with similar contexts to the same specific worker instances. This allows the data plane to leverage Prefix Caching (RadixAttention), where the KV cache for the common prefix is already present in GPU memory, allowing the worker to skip the prefill computation for that portion. This tight coupling between the router’s logic and the data plane’s cache state is a key optimization in RAG (Retrieval Augmented Generation) workflows.18
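A simplified illustration of the routing idea follows; the hashing scheme, worker list, and prefix length are assumptions for the sketch, whereas production gateways combine locality-sensitive hashing with load-aware fallbacks.

```python
# Simplified prefix-affinity routing: requests sharing the same system prompt
# are pinned to the same worker so that worker's prefix cache stays hot.
import hashlib

WORKERS = ["worker-0", "worker-1", "worker-2", "worker-3"]
PREFIX_CHARS = 48  # hash only the leading characters, which here cover the shared system prompt

def route(prompt: str, workers=WORKERS) -> str:
    prefix = prompt[:PREFIX_CHARS]
    digest = hashlib.sha256(prefix.encode()).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]

system_prompt = "You are a helpful assistant for ACME support tickets.\n"
# Both requests share the system prompt, so they hash to the same worker,
# whose prefix cache already holds the KV blocks for that prompt.
print(route(system_prompt + "User: my order is late"))
print(route(system_prompt + "User: how do I reset my password"))
```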
3.2. Advanced Autoscaling Paradigms
Autoscaling in LLM inference differs fundamentally from stateless microservices due to the high startup latency (cold start) of large models. The control plane must balance the cost of idle GPUs against the risk of SLA violations.
3.2.1. From Reactive to Metric-Driven Scaling
Traditional Kubernetes autoscaling (HPA) relies on CPU or memory usage, which are poor proxies for LLM load. Systems like KServe now integrate with KEDA (Kubernetes Event-driven Autoscaling) to scale based on inference-specific metrics.
- Concurrency-Based: Ray Serve scales based on target_ongoing_requests per replica. If the number of concurrent requests exceeds a threshold, new replicas are provisioned.20
- SLA-Based: Advanced setups use Time Per Output Token (TPOT) or Token Velocity as the scaling metric. If the system detects that token generation speed is degrading due to batch saturation, it triggers a scale-out event even if the GPU utilization is technically high. This ensures that latency guarantees are met regardless of throughput.21
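A simplified sketch combining the two signals appears below; the metric names, thresholds, and decision rule are illustrative, not the actual Ray Serve or KEDA configuration.

```python
# Illustrative SLA-driven scale-out decision combining concurrency and TPOT signals.
from dataclasses import dataclass

@dataclass
class ReplicaMetrics:
    ongoing_requests: int   # concurrent requests on this replica
    p95_tpot_ms: float      # 95th-percentile time per output token

def desired_replicas(metrics: list[ReplicaMetrics],
                     current: int,
                     target_ongoing: int = 32,
                     tpot_slo_ms: float = 60.0) -> int:
    total_inflight = sum(m.ongoing_requests for m in metrics)
    worst_tpot = max((m.p95_tpot_ms for m in metrics), default=0.0)

    # Concurrency-based target: keep roughly `target_ongoing` requests per replica.
    target = max(1, -(-total_inflight // target_ongoing))  # ceiling division

    # SLA-based override: if token velocity is degrading, add capacity even if
    # raw utilization still looks acceptable.
    if worst_tpot > tpot_slo_ms:
        target = max(target, current + 1)
    return target

replicas = [ReplicaMetrics(40, 72.0), ReplicaMetrics(35, 55.0)]
print(desired_replicas(replicas, current=2))  # -> 3: both signals ask for more capacity
```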
3.2.2. Predictive and Proactive Scaling
Reactive scaling inevitably leads to latency spikes during bursts due to the minutes-long model loading time. Newer research systems like SageServe and ThrottLLeM introduce predictive control planes.
- Mechanism: These systems employ time-series forecasting (e.g., Gamma-Poisson processes) to predict arrival rates.
- Proactive Provisioning: Based on these predictions, the control plane spins up “shadow” instances before the traffic spike arrives.
- Instance Donation: SageServe goes further by utilizing a “holistic deployment stack.” During valley periods, it creates “surplus” instances that can be donated to lower-priority spot workloads or batch processing tasks, reclaiming them instantly when high-priority inference demand returns. This minimizes the economic waste of over-provisioning.23
3.3. Multi-Model and Heterogeneous Management
The control plane also manages the complexity of serving multiple models on shared infrastructure. AIBrix and vLLM support multi-LoRA serving, where a single base model (frozen in GPU memory) serves requests for dozens of different fine-tuned adapters. The control plane acts as a registry for these adapters, scheduling requests to the appropriate workers and instructing the data plane to swap small adapter weights in and out of the compute path dynamically. This reduces the VRAM requirement by orders of magnitude compared to serving dedicated replicas for each fine-tune.13
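A toy sketch of the adapter-locality idea follows; the registry structure and routing policy are illustrative and do not reflect AIBrix's or vLLM's actual API.

```python
# Toy adapter-locality router: prefer workers that already have the requested
# LoRA adapter resident; otherwise load it on the least-loaded worker.

class LoRARegistry:
    def __init__(self, workers):
        self.loaded = {w: set() for w in workers}   # worker -> resident adapters
        self.load = {w: 0 for w in workers}         # worker -> in-flight requests

    def route(self, adapter: str) -> str:
        warm = [w for w, adapters in self.loaded.items() if adapter in adapters]
        if warm:
            worker = min(warm, key=self.load.__getitem__)
        else:
            worker = min(self.load, key=self.load.__getitem__)
            self.loaded[worker].add(adapter)        # instruct the data plane to swap it in
        self.load[worker] += 1
        return worker

registry = LoRARegistry(["gpu-0", "gpu-1"])
print(registry.route("billing-lora"))   # cold: adapter loaded on the least-loaded worker
print(registry.route("billing-lora"))   # warm: routed back to the same worker
```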
4. The Data Plane: High-Performance Execution and Scheduling
While the control plane manages the cluster, the data plane manages the GPU. The modern data plane has evolved into a highly specialized operating system for tensors, handling memory allocation, process scheduling, and hardware acceleration with microsecond precision.
4.1. Memory Management: The PagedAttention Revolution
The most significant bottleneck in LLM inference is memory bandwidth, specifically the management of the KV cache. In early systems, the data plane allocated a contiguous block of VRAM for the maximum possible sequence length of a request. Since most requests are shorter than the maximum, this led to massive internal fragmentation. Furthermore, the requirement for contiguous memory caused external fragmentation, preventing the usage of scattered free memory blocks.1
vLLM introduced PagedAttention, a mechanism inspired by OS virtual memory paging.
- Block Tables: The KV cache is divided into fixed-size blocks (e.g., 16 or 32 tokens).
- Non-Contiguous Allocation: These blocks can be stored anywhere in physical memory. A block table maps the logical token sequence to physical block addresses.
- Impact: This eliminates external fragmentation and reduces internal fragmentation to only the last partial block. It allows the data plane to fit significantly more requests into a single batch (higher batch size), directly increasing throughput.28
- Jenga Extensions: The research system Jenga extends this concept to handle heterogeneous embeddings. In complex pipelines involving multimodal inputs or different embedding models, standard page sizes might still be inefficient. Jenga uses a two-level allocator based on the least common multiple (LCM) of embedding sizes to optimize the packing of diverse data types in the cache.30
- eLLM and Ballooning: Another system, eLLM, introduces a “virtual tensor abstraction” and a memory ballooning mechanism. It allows the data plane to oversubscribe GPU memory by transparently swapping pages to host CPU memory when VRAM is under pressure, using a “lightweight scheduling strategy” to minimize the performance impact of these swaps.31
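A minimal sketch of the block-table mechanism described above is shown below: a free list of physical blocks plus a per-sequence logical-to-physical mapping. vLLM's actual block manager additionally handles prefix sharing, copy-on-write, and swapping.

```python
# Minimal paged KV-cache block table: fixed-size physical blocks handed out
# from a free list, mapped per sequence without any contiguity requirement.

BLOCK_SIZE = 16  # tokens per KV block

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, seq_len: int):
        """Ensure the sequence has a physical block for its (seq_len + 1)-th token."""
        table = self.block_tables.setdefault(seq_id, [])
        if seq_len % BLOCK_SIZE == 0:          # current block is full (or this is the first token)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks: preempt or swap a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

mgr = BlockManager(num_physical_blocks=4)
for t in range(20):                 # a 20-token sequence needs ceil(20 / 16) = 2 blocks
    mgr.append_token("req-1", t)
print(mgr.block_tables["req-1"])    # e.g. [3, 2]: physically non-contiguous blocks
```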
4.2. Intra-Instance Scheduling Algorithms
The data plane’s scheduler decides which requests to execute in the next GPU kernel launch. This is no longer a simple First-In-First-Out (FIFO) queue.
4.2.1. Iteration-Level Scheduling (Orca)
Orca pioneered Iteration-Level Scheduling (also known as continuous batching or cellular batching).
- The Problem: In static batching, the GPU waits for the longest request in a batch to finish before returning results for any request.
- The Solution: The scheduler operates at the granularity of a single token generation step (iteration). At the end of each iteration, completed requests are removed, and new requests are added to the running batch.
- Mechanism: The scheduler invokes the execution engine to run only one iteration. This ensures that short requests exit the system immediately, drastically reducing average latency.3
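Schematically, the loop looks roughly like the following, where `engine.step()` and the request objects are placeholders for the actual execution engine interface.

```python
# Schematic continuous-batching loop (iteration-level scheduling).
# `engine.step(batch)` stands in for one forward pass that generates exactly
# one token per running request.
from collections import deque

def serve_loop(engine, waiting: deque, max_batch_size: int = 64):
    running = []
    while running or waiting:
        # Admit new requests at iteration granularity, not batch granularity.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        finished = engine.step(running)   # one decode iteration for the whole batch

        # Completed requests leave immediately; short requests never wait for long ones.
        for req in finished:
            req.send_response()
            running.remove(req)
```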
4.2.2. Stall-Free Batching (Sarathi-Serve)
While continuous batching solves the straggler problem, it introduces a new one: Prefill Interference. When a new request joins the batch, its prefill phase (processing the whole prompt) takes much longer than the single-token decode steps of existing requests. This causes a “stall” or “hiccup” in the generation of the running requests.
Sarathi-Serve addresses this with Chunked-Prefills and Stall-Free Scheduling.
- Chunking: It splits the prefill of a long prompt into smaller chunks (e.g., 512 tokens).
- Piggybacking: It schedules one prefill chunk alongside the decode steps of other requests. It calculates a “token budget” for each iteration that fills the GPU’s compute capacity without exceeding the latency deadline (TBT SLO).
- Result: The prefill is amortized over multiple iterations. The ongoing decodes are not stalled, maintaining a smooth stream of tokens for users while maximizing “goodput” (throughput that meets SLOs).34
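A sketch of the batch-composition step under these rules follows; the token budget and data structures are illustrative, whereas the real system derives the budget from profiling and the TBT SLO.

```python
# Sketch of stall-free batch composition: fill each iteration's token budget
# with all ongoing decodes first (one token each), then piggyback as much of a
# pending prefill as still fits within the budget.

TOKEN_BUDGET = 512  # tokens the GPU can process per iteration within the TBT SLO

def compose_iteration(decode_reqs, prefill_queue):
    """Return (num_decode_tokens, prefill_chunk_tokens) for the next iteration.

    `prefill_queue` holds pending requests with a `remaining_prompt_tokens`
    attribute (assumed here for the sketch).
    """
    budget = TOKEN_BUDGET - len(decode_reqs)   # every running decode consumes one slot
    if budget <= 0 or not prefill_queue:
        return len(decode_reqs), 0
    req = prefill_queue[0]
    chunk = min(budget, req.remaining_prompt_tokens)
    req.remaining_prompt_tokens -= chunk
    if req.remaining_prompt_tokens == 0:
        prefill_queue.pop(0)    # prompt fully prefilled; the request starts decoding next
    return len(decode_reqs), chunk
```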
4.2.3. QoS-Driven Scheduling (Niyama)
Niyama moves beyond simple fairness to Quality of Service (QoS) enforcement.
- Dynamic Chunk Size Prediction: Instead of fixed chunks, Niyama uses a lightweight Random Forest model trained on profiling data to predict the optimal chunk size for the current system state.
- Hybrid Prioritization: It maintains separate queues for “interactive” (latency-sensitive) and “batch” (throughput-oriented) requests. Its scheduler uses a weighted formula considering both the deadline proximity and the estimated remaining processing time to prioritize requests. This allows the system to effectively “relegate” batch jobs during load spikes to protect the interactive experience.37
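The exact scoring function is not reproduced here; the snippet below only illustrates the general shape of a deadline- and work-aware priority with made-up weights.

```python
# Illustrative priority score in the spirit of deadline-aware hybrid scheduling:
# urgency grows as the deadline approaches and shrinks with the estimated
# remaining work. The weights and formula are invented for this sketch.
import time

def priority(deadline_s: float, est_remaining_s: float,
             now: float | None = None,
             w_deadline: float = 1.0, w_work: float = 0.5) -> float:
    now = time.time() if now is None else now
    slack = max(deadline_s - now, 1e-3)        # seconds until the SLO is violated
    return w_deadline / slack - w_work * est_remaining_s

# An interactive request 2 s from its deadline outranks a batch job with a
# 60 s deadline even if the batch job has less remaining work.
t0 = 0.0
print(priority(t0 + 2, est_remaining_s=1.0, now=t0) >
      priority(t0 + 60, est_remaining_s=0.5, now=t0))   # True
```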
4.3. Distributed Data Plane Protocols
When a model spans multiple GPUs (Tensor Parallelism) or nodes (Pipeline Parallelism), the data plane requires a high-performance communication fabric.
- Ray vs. NCCL: While the control plane often uses Ray actors for orchestration, the data plane typically bypasses Ray’s object store for critical tensor operations. It establishes direct NCCL (NVIDIA Collective Communications Library) communicators between GPUs. This allows for kernel-bypass networking (GPU-Direct RDMA), enabling tensors to move between GPU memories across the network without touching the CPU.38
- Shared Memory IPC: For single-node multi-process setups (e.g., a vision-language model where the vision encoder runs in a separate process), vLLM has implemented a shared memory Inter-Process Communication (IPC) mechanism. This uses a ring buffer in /dev/shm to pass large tensors (like image embeddings) between processes without serialization/deserialization overhead, significantly improving throughput for multimodal inference.40
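As a minimal illustration of the underlying idea, the example below uses Python's standard shared-memory facility rather than vLLM's actual ring-buffer message queue.

```python
# Minimal zero-copy hand-off of a tensor through shared memory, in the spirit
# of the /dev/shm mechanism described above.
import numpy as np
from multiprocessing import shared_memory

# Producer (e.g., a vision encoder process) writes embeddings into a shared segment.
embeddings = np.random.rand(1, 576, 4096).astype(np.float32)
shm = shared_memory.SharedMemory(create=True, size=embeddings.nbytes)
shared_view = np.ndarray(embeddings.shape, dtype=embeddings.dtype, buffer=shm.buf)
shared_view[:] = embeddings           # one copy into shared memory, no serialization

# Consumer (the LLM worker process) attaches by name and reads without copying.
peer = shared_memory.SharedMemory(name=shm.name)
received = np.ndarray(embeddings.shape, dtype=embeddings.dtype, buffer=peer.buf)
assert np.array_equal(received, embeddings)

peer.close()
shm.close()
shm.unlink()
```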
4.4. Hardware Offloading: The SmartNIC Data Plane
An emerging trend is the offloading of data plane tasks to specialized hardware. ShadowServe proposes a functional disaggregation where the SmartNIC (specifically NVIDIA BlueField-3 DPUs) takes over KV cache management.
- Pipeline: The SmartNIC handles the network fetch of KV cache blocks, decompression (using on-chip hardware accelerators), and dequantization.
- DMA Push: It then uses Direct Memory Access (DMA) to push the prepared tensors directly into the GPU’s HBM.
- Benefit: This removes the CPU from the critical data path and prevents the GPU from stalling while waiting for memory fetches. The “chunked pipelining” on the SmartNIC ensures that while one chunk is being transferred, the next is being decompressed, saturating the PCIe bus.42
5. Disaggregated Architectures: The Prefill-Decode Split
The most radical architectural shift in recent years is the physical separation of the Prefill and Decode phases into entirely different clusters. This is known as PD-Disaggregation.
5.1. The Theoretical Basis for Separation
The separation is driven by the conflicting hardware affinities of the two phases:
- Prefill: Compute-bound. Benefits from massive parallelism. Ideal for GPUs with high FLOPs (Tensor Cores) but potentially less memory bandwidth.
- Decode: Memory-bound. Benefits from high memory bandwidth (HBM). Ideal for GPUs with massive memory bandwidth but potentially fewer compute cores.
Colocating them forces a compromise. PD-Disaggregation allows for independent scaling. If users are sending long documents (high prefill load) but asking for short summaries (low decode load), the system can scale the prefill cluster independently.4
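A rough roofline-style calculation makes the asymmetry concrete; the model size is an example and the estimate counts only weight traffic, ignoring KV-cache and activation reads.

```python
# Rough roofline-style intuition for the prefill/decode split. Per forward pass,
# a dense transformer does ~2 FLOPs per parameter per token, while the weights
# must be read from HBM once regardless of how many tokens are processed.

params = 70e9          # 70B-parameter model (example)
bytes_per_param = 2    # FP16 weights
weight_bytes = params * bytes_per_param

def arithmetic_intensity(tokens_per_pass: int) -> float:
    flops = 2 * params * tokens_per_pass
    return flops / weight_bytes            # FLOPs per byte of weight traffic

print(arithmetic_intensity(4096))   # prefill of a 4K prompt: ~4096 FLOPs/byte -> compute-bound
print(arithmetic_intensity(8))      # decode step for a batch of 8: ~8 FLOPs/byte -> memory-bound
```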
5.2. DistServe: Goodput Optimization
DistServe is a seminal system in this domain. It focuses on maximizing Goodput—defined as the request rate where both Time-To-First-Token (TTFT) and Time-Per-Output-Token (TPOT) SLOs are met.
- Placement Strategy: DistServe analyzes the workload and partitions resources into a prefill pool and a decode pool. It might assign 2 GPUs to prefill and 6 to decode for a chatbot workload.
- Interference Elimination: By isolating prefill, decode requests never experience the “stalls” discussed earlier.
- Performance: Evaluations show DistServe can improve goodput by up to 4.48x compared to vLLM, or sustain SLOs up to 10.2x stricter on the same hardware.43
5.3. Mooncake: The KVCache-Centric Architecture
Mooncake, used by the Kimi AI platform, takes a data-centric approach. It views the entire cluster’s memory (GPU HBM, CPU RAM, NVMe SSDs) as a single disaggregated storage pool for KV caches.
- Mooncake Store: A distributed object store optimized for KV blocks.
- The Conductor: A global scheduler that dispatches requests based on data locality. If a request’s prefix is cached on Node A’s SSD, the Conductor might route the task to Node A to minimize network transfer, or pre-fetch the data to Node B’s HBM via RDMA.
- Performance: This architecture allows Mooncake to handle “highly overloaded” scenarios (100 billion tokens/day) by effectively using idle resources (CPU/SSD) as a spillover buffer for the GPU.47
5.4. Splitwise and KV Cache Transfer
Splitwise also separates the phases but focuses on the logistics of the transfer. The challenge is that the KV cache generated by the prefill phase must be moved to the decode phase machine.
- Bandwidth Bottleneck: Transferring the full FP16 cache is slow.
- Optimization: Systems use Quantization (compressing KV cache to INT8 or FP8) and Sparsity (transferring only important tokens) to reduce the transfer volume. They utilize high-speed interconnects (Infiniband/RDMA) to ensure the network transfer latency is lower than the time saved by the disaggregation.44
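Illustrative arithmetic for the transfer cost appears below; the model dimensions and link speed are example values, not measurements from any of the cited systems.

```python
# Example transfer-cost arithmetic for moving a prefill-produced KV cache to a
# decode node, comparing an FP16 cache with an INT8-quantized one.

def kv_transfer(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                bytes_per_elem=2, link_gbytes_per_s=50.0):
    """Return (cache_bytes, transfer_seconds) for one sequence."""
    cache_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len
    return cache_bytes, cache_bytes / (link_gbytes_per_s * 1e9)

fp16_bytes, fp16_t = kv_transfer(8192)                      # FP16 over a ~400 Gb/s RDMA link
int8_bytes, int8_t = kv_transfer(8192, bytes_per_elem=1)    # INT8-quantized cache
print(f"FP16: {fp16_bytes / 2**30:.2f} GiB in {fp16_t * 1e3:.1f} ms; "
      f"INT8: {int8_bytes / 2**30:.2f} GiB in {int8_t * 1e3:.1f} ms")
```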
6. Fault Tolerance and Operational Reliability
In distributed systems, failures are inevitable. The separation of planes simplifies resilience strategies.
6.1. Data Plane Resilience: State Replication
If a worker node executing a Decode phase crashes, the KV cache in its HBM is lost. Recomputing it (re-running prefill) is expensive.
DejaVu introduces a high-availability mechanism for the data plane.
- Streaming Replication: It asynchronously streams the KV cache to a replica node or persistent storage during generation.
- Microbatch Swapping: It ensures consistent snapshots of the state.
- Fast Recovery: When a failure is detected, the control plane redirects the request to a healthy node, which loads the latest checkpointed KV cache and resumes generation. This reduces recovery time from the scale of seconds (re-computation) to milliseconds (state loading).51
6.2. Control Plane Reliability
By decoupling the control plane from the heavy compute path, the system ensures that a “GPU hang” (common in CUDA workloads) does not crash the management layer. The control agent on the node remains responsive, allowing it to report the failure to the central controller, cordon off the node, and trigger an automated restart of the inference engine process. This “stateless control” pattern is detailed in AWS reliability guidelines and Ray Serve’s architecture.8
7. Protocols and Standardization: KServe V2
To enable the interoperability of these diverse components (e.g., a KEDA autoscaler talking to a vLLM engine), the industry has standardized on the KServe V2 Open Inference Protocol.
7.1. Protocol Specifications
The V2 protocol defines a standard JSON/gRPC schema for inference.
- Endpoints: It standardizes /v2/health/live, /v2/health/ready, and /v2/models/{name}/infer.
- Generative Extensions: Unlike the V1 protocol (designed for classifiers like ResNet), V2 includes extensions for LLM parameters: temperature, top_p, echo, and stop sequences.
- Interoperability: This allows control planes (like Envoy AI Gateway) to treat backend engines (Triton, vLLM, TGI) as interchangeable “black boxes”.55
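As a rough illustration, a generation call against a V2-compliant endpoint might look like the following. The endpoint path is standardized, but the parameter names and where they are carried vary by backend, so the payload should be read as a sketch rather than a canonical example.

```python
# Sketch of a V2 Open Inference Protocol call. The /v2/models/{name}/infer
# endpoint is standardized; the placement of generation parameters
# (temperature, top_p, etc.) differs between backends.
import requests

payload = {
    "inputs": [
        {"name": "prompt", "shape": [1], "datatype": "BYTES",
         "data": ["Summarize the control/data plane split in one sentence."]}
    ],
    "parameters": {"temperature": 0.2, "top_p": 0.9, "max_tokens": 128},
}

resp = requests.post(
    "http://gateway.example.com/v2/models/llama-3-70b/infer",  # hypothetical gateway URL
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["outputs"][0]["data"][0])
```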
7.2. gRPC and Bi-Directional Streaming
For LLMs, the request-response model of HTTP is inefficient. The V2 protocol emphasizes gRPC Bi-Directional Streaming.
- Mechanism: The client opens a persistent HTTP/2 connection. The server pushes token chunks (Server-Sent Events or gRPC messages) as they are generated.
- Benefits: This reduces the TCP handshake overhead and allows for immediate feedback—for example, a user can cancel a generation mid-stream, and the control plane can immediately signal the data plane to abort the computation, freeing up resources.57
8. Comparison of Key Architectures
To synthesize the differences between these systems, we present a comparative analysis of their scheduling and architectural choices.
| Feature | vLLM | Orca | Sarathi-Serve | DistServe | Mooncake |
| --- | --- | --- | --- | --- | --- |
| Primary Innovation | PagedAttention (Memory) | Iteration-Level Scheduling | Stall-Free Batching | Prefill-Decode Disaggregation | KVCache-Centric Storage |
| Scheduling Granularity | Iteration (Continuous) | Iteration | Iteration (Chunked) | Phase-Specific | Global / Data-Locality |
| Batching Strategy | FCFS / Priority | FCFS | Piggybacking Prefill on Decode | Split Pools | Disaggregated Resource Pools |
| Control/Data Split | Yes (Ray/IPC) | Yes (Scheduler/Engine) | Yes | Yes (Distinct Clusters) | Yes (Conductor/Store) |
| Key Optimization | Zero Fragmentation | Minimal Queuing Delay | Consistent Inter-Token Latency | Goodput (SLA Compliance) | Throughput via Offload |
| Fault Tolerance | Checkpointing (Basic) | Basic | Basic | Replication-Aware | Highly Resilient (Store) |
9. Future Directions and Emerging Trends
The trajectory of inference architecture points toward further granularization and the “serverless-ification” of the data plane.
- Serverless Data Planes: Technologies like PipeBoost are reducing the cold-start time of models to milliseconds using parallelized model loading and shared memory snapshots. This will enable control planes to spin up data plane workers per request, eliminating idle costs entirely.60
- Optical Data Planes: As the bottleneck is fundamentally data movement (memory bandwidth and interconnects), future data planes may integrate optical networking directly into the inter-chip fabric to facilitate the massive KV cache transfers required by disaggregated architectures.
- The Rise of the “Inference Operating System”: We are witnessing the emergence of a standardized “Inference OS.” Kubernetes provides the kernel (resource management), KServe provides the init system (lifecycle), vLLM/Triton provides the runtime, and Envoy provides the networking. The clear separation of control and data planes is the architectural pattern that makes this stack composable and scalable.
10. Conclusion
The modern LLM inference stack has matured from a monolithic deep learning script into a complex, multi-layered distributed system. The strict separation of the Control Plane and Data Plane is the linchpin of this architecture. It allows the system to solve two distinct optimization problems simultaneously: the “macro” problem of resource availability and cost (solved by the control plane’s autoscalers and routers) and the “micro” problem of hardware saturation and latency (solved by the data plane’s schedulers and memory managers).
The innovations analyzed in this report—from PagedAttention and Iteration-Level Scheduling to the radical Disaggregated Prefill-Decode architectures—demonstrate a consistent trend: adapting software structures to the unique physical realities of autoregressive generation on heterogeneous hardware. As models grow larger and context windows continue to expand, this architectural bifurcation will only deepen, driving the industry toward hyper-specialized, highly efficient, and reliable intelligence infrastructure.
