1. Executive Summary: The Bifurcation of Intelligence Infrastructure
The rapid proliferation of Large Language Models (LLMs) has precipitated a fundamental paradigm shift in the design of distributed computing systems. Unlike traditional deep learning workloads, which were largely characterized by static computational graphs and predictable resource consumption, generative AI workloads introduce extreme variability, state dependency, and a distinct dichotomy between compute-bound and memory-bound phases. In response to these challenges, modern inference architectures have evolved from monolithic server binaries into complex, disaggregated distributed systems. The defining characteristic of this new generation of infrastructure is the strict architectural separation between the Control Plane—responsible for orchestration, policy enforcement, and global state management—and the Data Plane—dedicated to high-throughput, low-latency tensor execution and memory management.
This report provides an exhaustive analysis of this architectural evolution. We explore how the control plane has matured into a sophisticated decision-making engine capable of predictive autoscaling and semantic routing, while the data plane has adopted operating system concepts (such as virtual memory and process scheduling) to optimize hardware utilization. Furthermore, we examine the emergence of “prefill-decode disaggregation,” a strategy that physically separates the processing of input prompts from token generation to resolve resource contention, and the integration of specialized hardware like SmartNICs to offload data management tasks. By synthesizing research from systems such as vLLM, Orca, DistServe, Mooncake, and KServe, this document offers a detailed roadmap of the current state and future trajectory of inference scheduling architecture.
2. The Evolution of Inference: From Monoliths to Disaggregated Planes
To understand the necessity of the control/data plane separation, one must first appreciate the limitations of the monolithic architectures that preceded it. In the era of smaller models (e.g., BERT, ResNet), inference was stateless and compute-bound. A single server process could receive a request, execute the forward pass, and return the result within milliseconds. Scaling was achieved simply by replicating this monolithic process behind a load balancer.
However, the autoregressive nature of Generative Pre-trained Transformers (GPT) introduced two critical complexities that broke this model:
- State Management (KV Cache): Generation requires maintaining a Key-Value (KV) cache for each active request. This cache grows dynamically with sequence length, consuming gigabytes of High Bandwidth Memory (HBM). In a monolithic setup, managing this state alongside computation led to severe memory fragmentation and underutilization.1
- Phase Heterogeneity: An inference request consists of two distinct phases with opposing hardware requirements. The Prefill phase (processing the input prompt) is compute-intensive and highly parallelizable. The Decode phase (generating tokens one by one) is memory-bandwidth-bound and serial. Colocating these phases on the same hardware without sophisticated scheduling results in “pipeline bubbles” and interference, where memory-bound tasks stall compute-bound ones.3
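To make the first point concrete, a rough sizing of the KV cache illustrates why state management dominates: the figures below assume a 7B-class model (32 layers, 32 KV heads, head dimension 128, FP16) and are illustrative, not measurements.

```python
# Back-of-the-envelope KV cache sizing (illustrative dimensions, not a benchmark).
# Per token, each layer stores one key and one value vector per KV head:
#   bytes_per_token = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_elem

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    """KV cache footprint in bytes for one sequence of `seq_len` tokens (FP16)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

one_token = kv_cache_bytes(1)               # ~0.5 MiB per token
full_ctx = kv_cache_bytes(4096) / 2**30     # ~2 GiB for a single 4K-token sequence
print(f"{one_token / 2**20:.2f} MiB/token, {full_ctx:.2f} GiB at 4K context")
```

At these rates, a few dozen concurrent long-context requests exhaust the HBM of a single accelerator, which is why cache management becomes a first-class scheduling concern.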
The modern architecture addresses these issues by decoupling the system into two independent planes 5:
- The Control Plane: A “brain” that operates at a cluster-wide scope. It abstracts the complexity of the hardware, manages the lifecycle of models, handles user authentication and quotas, and makes coarse-grained scheduling decisions (e.g., which replica handles a request). It is optimized for availability, consistency, and feature richness.7
- The Data Plane: A “muscle” that operates at the device scope. It is responsible for the actual execution of the neural network, managing GPU memory pointers, scheduling CUDA kernels, and handling tensor parallelism. It is optimized for raw throughput, microsecond latency, and maximizing hardware occupancy.9
3. The Control Plane: Orchestration, Policy, and Global State
The control plane acts as the system’s central nervous system. In frameworks like KServe, Ray Serve, and AIBrix, the control plane is designed to be largely stateless and recoverable, persisting its configuration in distributed stores (like etcd or the Ray Global Control Store) while communicating with workers via lightweight protocols.11
3.1. Workload Orchestration and Lifecycle Management
The primary responsibility of the control plane is Model Lifecycle Management. This involves the complex choreography of provisioning resources, downloading massive model weights (often hundreds of gigabytes), and initializing distributed runtimes.
3.1.1. The Controller Actor and Deployment
In Ray Serve, for instance, a global Controller actor manages the state of the cluster. It reconciles the “desired state” (e.g., 10 replicas of Llama-3-70B) with the “actual state.” When a new deployment is requested, the Controller does not merely spawn a process; it must negotiate with the cluster resource manager (e.g., KubeRay) to find nodes with the specific hardware topology required—such as ensuring 8 H100 GPUs are available on a single node for high-bandwidth NVLink interconnects.11
This placement logic is becoming increasingly sophisticated. Systems like AIBrix employ “Best-Fit Decreasing” (BFD) algorithms to solve the bin-packing problem of fitting models onto heterogeneous clusters. AIBrix’s control plane creates an abstraction layer that allows it to schedule models based on “node affinity” and “LoRA locality.” If a request requires a specific Low-Rank Adaptation (LoRA) adapter, the control plane attempts to route it to a worker that already has that adapter loaded in memory, avoiding the latency of hot-swapping weights.14
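AIBrix's placement logic weighs several signals (affinity, locality, interconnect topology); the snippet below is only a minimal sketch of the best-fit-decreasing step, under the simplifying assumption that models are characterized solely by their VRAM demand and nodes solely by their free VRAM.

```python
# Minimal best-fit-decreasing (BFD) placement sketch. Real placers also weigh
# node affinity, LoRA locality, and interconnect topology.

def best_fit_decreasing(models, nodes):
    """models: {name: vram_gb}, nodes: {name: free_vram_gb} -> {model: node}."""
    placement = {}
    free = dict(nodes)
    # Place the largest models first so smaller ones fill the remaining gaps.
    for model, demand in sorted(models.items(), key=lambda kv: -kv[1]):
        # Best fit: choose the node whose leftover capacity after placement is smallest.
        candidates = [(cap - demand, node) for node, cap in free.items() if cap >= demand]
        if not candidates:
            raise RuntimeError(f"no node can host {model} ({demand} GB)")
        _, node = min(candidates)
        placement[model] = node
        free[node] -= demand
    return placement

print(best_fit_decreasing(
    {"llama-70b": 140, "llama-8b": 16, "mistral-7b": 15},
    {"node-a": 160, "node-b": 80},
))
```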
3.1.2. The Gateway and Semantic Routing
The entry point to the control plane is no longer a Layer 4 load balancer but a Layer 7 AI Gateway. The Envoy AI Gateway exemplifies this shift. It operates as a two-tier architecture:
- Tier 1 (Global): Handles authentication, global rate limiting, and coarse routing (e.g., separating internal vs. external traffic).
- Tier 2 (Cluster): Handles fine-grained traffic distribution to specific model instances.16
Crucially, modern control planes implement Semantic Routing. Instead of Round-Robin, the router analyzes the incoming prompt. Using techniques like locality-sensitive hashing on the system prompt or shared prefix, the control plane routes requests with similar contexts to the same specific worker instances. This allows the data plane to leverage Prefix Caching (RadixAttention), where the KV cache for the common prefix is already present in GPU memory, allowing the worker to skip the prefill computation for that portion. This tight coupling between the router’s logic and the data plane’s cache state is a key optimization in RAG (Retrieval Augmented Generation) workflows.18
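A simplified illustration of the routing idea follows; the hashing scheme, worker list, and prefix length are assumptions for the sketch, whereas production gateways combine locality-sensitive hashing with load-aware fallbacks.

```python
# Simplified prefix-affinity routing: requests sharing the same system prompt
# are pinned to the same worker so that worker's prefix cache stays hot.
import hashlib

WORKERS = ["worker-0", "worker-1", "worker-2", "worker-3"]
PREFIX_CHARS = 48  # hash only the leading characters, which here cover the shared system prompt

def route(prompt: str, workers=WORKERS) -> str:
    prefix = prompt[:PREFIX_CHARS]
    digest = hashlib.sha256(prefix.encode()).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]

system_prompt = "You are a helpful assistant for ACME support tickets.\n"
# Both requests share the system prompt, so they hash to the same worker,
# whose prefix cache already holds the KV blocks for that prompt.
print(route(system_prompt + "User: my order is late"))
print(route(system_prompt + "User: how do I reset my password"))
```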
3.2. Advanced Autoscaling Paradigms
Autoscaling in LLM inference differs fundamentally from stateless microservices due to the high startup latency (cold start) of large models. The control plane must balance the cost of idle GPUs against the risk of SLA violations.
3.2.1. From Reactive to Metric-Driven Scaling
Traditional Kubernetes autoscaling (HPA) relies on CPU or memory usage, which are poor proxies for LLM load. Systems like KServe now integrate with KEDA (Kubernetes Event-driven Autoscaling) to scale based on inference-specific metrics.
- Concurrency-Based: Ray Serve scales based on target_ongoing_requests per replica. If the number of concurrent requests exceeds a threshold, new replicas are provisioned.20
- SLA-Based: Advanced setups use Time Per Output Token (TPOT) or Token Velocity as the scaling metric. If the system detects that token generation speed is degrading due to batch saturation, it triggers a scale-out event even if the GPU utilization is technically high. This ensures that latency guarantees are met regardless of throughput.21
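A simplified sketch combining the two signals appears below; the metric names, thresholds, and decision rule are illustrative, not the actual Ray Serve or KEDA configuration.

```python
# Illustrative SLA-driven scale-out decision combining concurrency and TPOT signals.
from dataclasses import dataclass

@dataclass
class ReplicaMetrics:
    ongoing_requests: int   # concurrent requests on this replica
    p95_tpot_ms: float      # 95th-percentile time per output token

def desired_replicas(metrics: list[ReplicaMetrics],
                     current: int,
                     target_ongoing: int = 32,
                     tpot_slo_ms: float = 60.0) -> int:
    total_inflight = sum(m.ongoing_requests for m in metrics)
    worst_tpot = max((m.p95_tpot_ms for m in metrics), default=0.0)

    # Concurrency-based target: keep roughly `target_ongoing` requests per replica.
    target = max(1, -(-total_inflight // target_ongoing))  # ceiling division

    # SLA-based override: if token velocity is degrading, add capacity even if
    # raw utilization still looks acceptable.
    if worst_tpot > tpot_slo_ms:
        target = max(target, current + 1)
    return target

replicas = [ReplicaMetrics(40, 72.0), ReplicaMetrics(35, 55.0)]
print(desired_replicas(replicas, current=2))  # -> 3: both signals ask for more capacity
```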
3.2.2. Predictive and Proactive Scaling
Reactive scaling inevitably leads to latency spikes during bursts due to the minutes-long model loading time. Newer research systems like SageServe and ThrottLLeM introduce predictive control planes.
- Mechanism: These systems employ time-series forecasting (e.g., Gamma-Poisson processes) to predict arrival rates.
- Proactive Provisioning: Based on these predictions, the control plane spins up “shadow” instances before the traffic spike arrives.
- Instance Donation: SageServe goes further by utilizing a “holistic deployment stack.” During valley periods, it creates “surplus” instances that can be donated to lower-priority spot workloads or batch processing tasks, reclaiming them instantly when high-priority inference demand returns. This minimizes the economic waste of over-provisioning.23
3.3. Multi-Model and Heterogeneous Management
The control plane also manages the complexity of serving multiple models on shared infrastructure. AIBrix and vLLM support multi-LoRA serving, where a single base model (frozen in GPU memory) serves requests for dozens of different fine-tuned adapters. The control plane acts as a registry for these adapters, scheduling requests to the appropriate workers and instructing the data plane to swap small adapter weights in and out of the compute path dynamically. This reduces the VRAM requirement by orders of magnitude compared to serving dedicated replicas for each fine-tune.13
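A toy sketch of the adapter-locality idea follows; the registry structure and routing policy are illustrative and do not reflect AIBrix's or vLLM's actual API.

```python
# Toy adapter-locality router: prefer workers that already have the requested
# LoRA adapter resident; otherwise load it on the least-loaded worker.

class LoRARegistry:
    def __init__(self, workers):
        self.loaded = {w: set() for w in workers}   # worker -> resident adapters
        self.load = {w: 0 for w in workers}         # worker -> in-flight requests

    def route(self, adapter: str) -> str:
        warm = [w for w, adapters in self.loaded.items() if adapter in adapters]
        if warm:
            worker = min(warm, key=self.load.__getitem__)
        else:
            worker = min(self.load, key=self.load.__getitem__)
            self.loaded[worker].add(adapter)        # instruct the data plane to swap it in
        self.load[worker] += 1
        return worker

registry = LoRARegistry(["gpu-0", "gpu-1"])
print(registry.route("billing-lora"))   # cold: adapter loaded on the least-loaded worker
print(registry.route("billing-lora"))   # warm: routed back to the same worker
```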
4. The Data Plane: High-Performance Execution and Scheduling
While the control plane manages the cluster, the data plane manages the GPU. The modern data plane has evolved into a highly specialized operating system for tensors, handling memory allocation, process scheduling, and hardware acceleration with microsecond precision.
4.1. Memory Management: The PagedAttention Revolution
The most significant bottleneck in LLM inference is memory bandwidth, specifically the management of the KV cache. In early systems, the data plane allocated a contiguous block of VRAM for the maximum possible sequence length of a request. Since most requests are shorter than the maximum, this led to massive internal fragmentation. Furthermore, the requirement for contiguous memory caused external fragmentation, preventing the usage of scattered free memory blocks.1
vLLM introduced PagedAttention, a mechanism inspired by OS virtual memory paging.
- Block Tables: The KV cache is divided into fixed-size blocks (e.g., 16 or 32 tokens).
- Non-Contiguous Allocation: These blocks can be stored anywhere in physical memory. A block table maps the logical token sequence to physical block addresses.
- Impact: This eliminates external fragmentation and reduces internal fragmentation to only the last partial block. It allows the data plane to fit significantly more requests into a single batch (higher batch size), directly increasing throughput.28
- Jenga Extensions: The research system Jenga extends this concept to handle heterogeneous embeddings. In complex pipelines involving multimodal inputs or different embedding models, standard page sizes might still be inefficient. Jenga uses a two-level allocator based on the least common multiple (LCM) of embedding sizes to optimize the packing of diverse data types in the cache.30
- eLLM and Ballooning: Another system, eLLM, introduces a “virtual tensor abstraction” and a memory ballooning mechanism. It allows the data plane to oversubscribe GPU memory by transparently swapping pages to host CPU memory when VRAM is under pressure, using a “lightweight scheduling strategy” to minimize the performance impact of these swaps.31
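A minimal sketch of the block-table mechanism described above is shown below: a free list of physical blocks plus a per-sequence logical-to-physical mapping. vLLM's actual block manager additionally handles prefix sharing, copy-on-write, and swapping.

```python
# Minimal paged KV-cache block table: fixed-size physical blocks handed out
# from a free list, mapped per sequence without any contiguity requirement.

BLOCK_SIZE = 16  # tokens per KV block

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, seq_len: int):
        """Ensure the sequence has a physical block for its (seq_len + 1)-th token."""
        table = self.block_tables.setdefault(seq_id, [])
        if seq_len % BLOCK_SIZE == 0:          # current block is full (or this is the first token)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks: preempt or swap a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

mgr = BlockManager(num_physical_blocks=4)
for t in range(20):                 # a 20-token sequence needs ceil(20 / 16) = 2 blocks
    mgr.append_token("req-1", t)
print(mgr.block_tables["req-1"])    # e.g. [3, 2]: physically non-contiguous blocks
```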
4.2. Intra-Instance Scheduling Algorithms
The data plane’s scheduler decides which requests to execute in the next GPU kernel launch. This is no longer a simple First-In-First-Out (FIFO) queue.
4.2.1. Iteration-Level Scheduling (Orca)
Orca pioneered Iteration-Level Scheduling (also known as continuous batching or cellular batching).
- The Problem: In static batching, the GPU waits for the longest request in a batch to finish before returning results for any request.
- The Solution: The scheduler operates at the granularity of a single token generation step (iteration). At the end of each iteration, completed requests are removed, and new requests are added to the running batch.
- Mechanism: The scheduler invokes the execution engine to run only one iteration. This ensures that short requests exit the system immediately, drastically reducing average latency.3
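Schematically, the loop looks roughly like the following, where `engine.step()` and the request objects are placeholders for the actual execution engine interface.

```python
# Schematic continuous-batching loop (iteration-level scheduling).
# `engine.step(batch)` stands in for one forward pass that generates exactly
# one token per running request.
from collections import deque

def serve_loop(engine, waiting: deque, max_batch_size: int = 64):
    running = []
    while running or waiting:
        # Admit new requests at iteration granularity, not batch granularity.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        finished = engine.step(running)   # one decode iteration for the whole batch

        # Completed requests leave immediately; short requests never wait for long ones.
        for req in finished:
            req.send_response()
            running.remove(req)
```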
4.2.2. Stall-Free Batching (Sarathi-Serve)
While continuous batching solves the straggler problem, it introduces a new one: Prefill Interference. When a new request joins the batch, its prefill phase (processing the whole prompt) takes much longer than the single-token decode steps of existing requests. This causes a “stall” or “hiccup” in the generation of the running requests.
Sarathi-Serve addresses this with Chunked-Prefills and Stall-Free Scheduling.
- Chunking: It splits the prefill of a long prompt into smaller chunks (e.g., 512 tokens).
- Piggybacking: It schedules one prefill chunk alongside the decode steps of other requests. It calculates a “token budget” for each iteration that fills the GPU’s compute capacity without exceeding the latency deadline (TBT SLO).
- Result: The prefill is amortized over multiple iterations. The ongoing decodes are not stalled, maintaining a smooth stream of tokens for users while maximizing “goodput” (throughput that meets SLOs).34
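A sketch of the batch-composition step under these rules follows; the token budget and data structures are illustrative, whereas the real system derives the budget from profiling and the TBT SLO.

```python
# Sketch of stall-free batch composition: fill each iteration's token budget
# with all ongoing decodes first (one token each), then piggyback as much of a
# pending prefill as still fits within the budget.

TOKEN_BUDGET = 512  # tokens the GPU can process per iteration within the TBT SLO

def compose_iteration(decode_reqs, prefill_queue):
    """Return (num_decode_tokens, prefill_chunk_tokens) for the next iteration.

    `prefill_queue` holds pending requests with a `remaining_prompt_tokens`
    attribute (assumed here for the sketch).
    """
    budget = TOKEN_BUDGET - len(decode_reqs)   # every running decode consumes one slot
    if budget <= 0 or not prefill_queue:
        return len(decode_reqs), 0
    req = prefill_queue[0]
    chunk = min(budget, req.remaining_prompt_tokens)
    req.remaining_prompt_tokens -= chunk
    if req.remaining_prompt_tokens == 0:
        prefill_queue.pop(0)    # prompt fully prefilled; the request starts decoding next
    return len(decode_reqs), chunk
```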
4.2.3. QoS-Driven Scheduling (Niyama)
Niyama moves beyond simple fairness to Quality of Service (QoS) enforcement.
- Dynamic Chunk Size Prediction: Instead of fixed chunks, Niyama uses a lightweight Random Forest model trained on profiling data to predict the optimal chunk size for the current system state.
- Hybrid Prioritization: It maintains separate queues for “interactive” (latency-sensitive) and “batch” (throughput-oriented) requests. Its scheduler uses a weighted formula considering both the deadline proximity and the estimated remaining processing time to prioritize requests. This allows the system to effectively “relegate” batch jobs during load spikes to protect the interactive experience.37
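The exact scoring function is not reproduced here; the snippet below only illustrates the general shape of a deadline- and work-aware priority with made-up weights.

```python
# Illustrative priority score in the spirit of deadline-aware hybrid scheduling:
# urgency grows as the deadline approaches and shrinks with the estimated
# remaining work. The weights and formula are invented for this sketch.
import time

def priority(deadline_s: float, est_remaining_s: float,
             now: float | None = None,
             w_deadline: float = 1.0, w_work: float = 0.5) -> float:
    now = time.time() if now is None else now
    slack = max(deadline_s - now, 1e-3)        # seconds until the SLO is violated
    return w_deadline / slack - w_work * est_remaining_s

# An interactive request 2 s from its deadline outranks a batch job with a
# 60 s deadline even if the batch job has less remaining work.
t0 = 0.0
print(priority(t0 + 2, est_remaining_s=1.0, now=t0) >
      priority(t0 + 60, est_remaining_s=0.5, now=t0))   # True
```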
4.3. Distributed Data Plane Protocols
When a model spans multiple GPUs (Tensor Parallelism) or nodes (Pipeline Parallelism), the data plane requires a high-performance communication fabric.
- Ray vs. NCCL: While the control plane often uses Ray actors for orchestration, the data plane typically bypasses Ray’s object store for critical tensor operations. It establishes direct NCCL (NVIDIA Collective Communications Library) communicators between GPUs. This allows for kernel-bypass networking (GPU-Direct RDMA), enabling tensors to move between GPU memories across the network without touching the CPU.38
- Shared Memory IPC: For single-node multi-process setups (e.g., a vision-language model where the vision encoder runs in a separate process), vLLM has implemented a shared memory Inter-Process Communication (IPC) mechanism. This uses a ring buffer in /dev/shm to pass large tensors (like image embeddings) between processes without serialization/deserialization overhead, significantly improving throughput for multimodal inference.40
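As a minimal illustration of the underlying idea, the example below uses Python's standard shared-memory facility rather than vLLM's actual ring-buffer message queue.

```python
# Minimal zero-copy hand-off of a tensor through shared memory, in the spirit
# of the /dev/shm mechanism described above.
import numpy as np
from multiprocessing import shared_memory

# Producer (e.g., a vision encoder process) writes embeddings into a shared segment.
embeddings = np.random.rand(1, 576, 4096).astype(np.float32)
shm = shared_memory.SharedMemory(create=True, size=embeddings.nbytes)
shared_view = np.ndarray(embeddings.shape, dtype=embeddings.dtype, buffer=shm.buf)
shared_view[:] = embeddings           # one copy into shared memory, no serialization

# Consumer (the LLM worker process) attaches by name and reads without copying.
peer = shared_memory.SharedMemory(name=shm.name)
received = np.ndarray(embeddings.shape, dtype=embeddings.dtype, buffer=peer.buf)
assert np.array_equal(received, embeddings)

peer.close()
shm.close()
shm.unlink()
```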
4.4. Hardware Offloading: The SmartNIC Data Plane
An emerging trend is the offloading of data plane tasks to specialized hardware. ShadowServe proposes a functional disaggregation where the SmartNIC (specifically NVIDIA BlueField-3 DPUs) takes over KV cache management.
- Pipeline: The SmartNIC handles the network fetch of KV cache blocks, decompression (using on-chip hardware accelerators), and dequantization.
- DMA Push: It then uses Direct Memory Access (DMA) to push the prepared tensors directly into the GPU’s HBM.
- Benefit: This removes the CPU from the critical data path and prevents the GPU from stalling while waiting for memory fetches. The “chunked pipelining” on the SmartNIC ensures that while one chunk is being transferred, the next is being decompressed, saturating the PCIe bus.42
5. Disaggregated Architectures: The Prefill-Decode Split
The most radical architectural shift in recent years is the physical separation of the Prefill and Decode phases into entirely different clusters. This is known as PD-Disaggregation.
5.1. The Theoretical Basis for Separation
The separation is driven by the conflicting hardware affinities of the two phases:
- Prefill: Compute-bound. Benefits from massive parallelism. Ideal for GPUs with high FLOPs (Tensor Cores) but potentially less memory bandwidth.
- Decode: Memory-bound. Benefits from high memory bandwidth (HBM). Ideal for GPUs with massive memory bandwidth but potentially fewer compute cores.
Colocating them forces a compromise. PD-Disaggregation allows for independent scaling. If users are sending long documents (high prefill load) but asking for short summaries (low decode load), the system can scale the prefill cluster independently.4
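A rough roofline-style calculation makes the asymmetry concrete; the model size is an example and the estimate counts only weight traffic, ignoring KV-cache and activation reads.

```python
# Rough roofline-style intuition for the prefill/decode split. Per forward pass,
# a dense transformer does ~2 FLOPs per parameter per token, while the weights
# must be read from HBM once regardless of how many tokens are processed.

params = 70e9          # 70B-parameter model (example)
bytes_per_param = 2    # FP16 weights
weight_bytes = params * bytes_per_param

def arithmetic_intensity(tokens_per_pass: int) -> float:
    flops = 2 * params * tokens_per_pass
    return flops / weight_bytes            # FLOPs per byte of weight traffic

print(arithmetic_intensity(4096))   # prefill of a 4K prompt: ~4096 FLOPs/byte -> compute-bound
print(arithmetic_intensity(8))      # decode step for a batch of 8: ~8 FLOPs/byte -> memory-bound
```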
5.2. DistServe: Goodput Optimization
DistServe is a seminal system in this domain. It focuses on maximizing Goodput—defined as the request rate where both Time-To-First-Token (TTFT) and Time-Per-Output-Token (TPOT) SLOs are met.
- Placement Strategy: DistServe analyzes the workload and partitions resources into a prefill pool and a decode pool. It might assign 2 GPUs to prefill and 6 to decode for a chatbot workload.
- Interference Elimination: By isolating prefill, decode requests never experience the “stalls” discussed earlier.
- Performance: Evaluations show DistServe can improve goodput by up to 4.48x compared to vLLM, or sustain SLOs up to 10.2x stricter on the same hardware.43
5.3. Mooncake: The KVCache-Centric Architecture
Mooncake, used by the Kimi AI platform, takes a data-centric approach. It views the entire cluster’s memory (GPU HBM, CPU RAM, NVMe SSDs) as a single disaggregated storage pool for KV caches.
- Mooncake Store: A distributed object store optimized for KV blocks.
- The Conductor: A global scheduler that dispatches requests based on data locality. If a request’s prefix is cached on Node A’s SSD, the Conductor might route the task to Node A to minimize network transfer, or pre-fetch the data to Node B’s HBM via RDMA.
- Performance: This architecture allows Mooncake to handle “highly overloaded” scenarios (100 billion tokens/day) by effectively using idle resources (CPU/SSD) as a spillover buffer for the GPU.47
5.4. Splitwise and KV Cache Transfer
Splitwise also separates the phases but focuses on the logistics of the transfer. The challenge is that the KV cache generated by the prefill phase must be moved to the decode phase machine.
- Bandwidth Bottleneck: Transferring the full FP16 cache is slow.
- Optimization: Systems use Quantization (compressing KV cache to INT8 or FP8) and Sparsity (transferring only important tokens) to reduce the transfer volume. They utilize high-speed interconnects (Infiniband/RDMA) to ensure the network transfer latency is lower than the time saved by the disaggregation.44
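Illustrative arithmetic for the transfer cost appears below; the model dimensions and link speed are example values, not measurements from any of the cited systems.

```python
# Example transfer-cost arithmetic for moving a prefill-produced KV cache to a
# decode node, comparing an FP16 cache with an INT8-quantized one.

def kv_transfer(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                bytes_per_elem=2, link_gbytes_per_s=50.0):
    """Return (cache_bytes, transfer_seconds) for one sequence."""
    cache_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len
    return cache_bytes, cache_bytes / (link_gbytes_per_s * 1e9)

fp16_bytes, fp16_t = kv_transfer(8192)                      # FP16 over a ~400 Gb/s RDMA link
int8_bytes, int8_t = kv_transfer(8192, bytes_per_elem=1)    # INT8-quantized cache
print(f"FP16: {fp16_bytes / 2**30:.2f} GiB in {fp16_t * 1e3:.1f} ms; "
      f"INT8: {int8_bytes / 2**30:.2f} GiB in {int8_t * 1e3:.1f} ms")
```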
6. Fault Tolerance and Operational Reliability
In distributed systems, failures are inevitable. The separation of planes simplifies resilience strategies.
6.1. Data Plane Resilience: State Replication
If a worker node executing a Decode phase crashes, the KV cache in its HBM is lost. Recomputing it (re-running prefill) is expensive.
DejaVu introduces a high-availability mechanism for the data plane.
- Streaming Replication: It asynchronously streams the KV cache to a replica node or persistent storage during generation.
- Microbatch Swapping: It ensures consistent snapshots of the state.
- Fast Recovery: When a failure is detected, the control plane redirects the request to a healthy node, which loads the latest checkpointed KV cache and resumes generation. This reduces recovery time from the scale of seconds (re-computation) to milliseconds (state loading).51
6.2. Control Plane Reliability
By decoupling the control plane from the heavy compute path, the system ensures that a “GPU hang” (common in CUDA workloads) does not crash the management layer. The control agent on the node remains responsive, allowing it to report the failure to the central controller, cordon off the node, and trigger an automated restart of the inference engine process. This “stateless control” pattern is detailed in AWS reliability guidelines and Ray Serve’s architecture.8
7. Protocols and Standardization: KServe V2
To enable the interoperability of these diverse components (e.g., a KEDA autoscaler talking to a vLLM engine), the industry has standardized on the KServe V2 Open Inference Protocol.
7.1. Protocol Specifications
The V2 protocol defines a standard JSON/gRPC schema for inference.
- Endpoints: It standardizes /v2/health/live, /v2/health/ready, and /v2/models/{name}/infer.
- Generative Extensions: Unlike the V1 protocol (designed for classifiers like ResNet), V2 includes extensions for LLM parameters: temperature, top_p, echo, and stop sequences.
- Interoperability: This allows control planes (like Envoy AI Gateway) to treat backend engines (Triton, vLLM, TGI) as interchangeable “black boxes”.55
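As a rough illustration, a generation call against a V2-compliant endpoint might look like the following. The endpoint path is standardized, but the parameter names and where they are carried vary by backend, so the payload should be read as a sketch rather than a canonical example.

```python
# Sketch of a V2 Open Inference Protocol call. The /v2/models/{name}/infer
# endpoint is standardized; the placement of generation parameters
# (temperature, top_p, etc.) differs between backends.
import requests

payload = {
    "inputs": [
        {"name": "prompt", "shape": [1], "datatype": "BYTES",
         "data": ["Summarize the control/data plane split in one sentence."]}
    ],
    "parameters": {"temperature": 0.2, "top_p": 0.9, "max_tokens": 128},
}

resp = requests.post(
    "http://gateway.example.com/v2/models/llama-3-70b/infer",  # hypothetical gateway URL
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["outputs"][0]["data"][0])
```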
7.2. gRPC and Bi-Directional Streaming
For LLMs, the request-response model of HTTP is inefficient. The V2 protocol emphasizes gRPC Bi-Directional Streaming.
- Mechanism: The client opens a persistent HTTP/2 connection. The server pushes token chunks (Server-Sent Events or gRPC messages) as they are generated.
- Benefits: This reduces the TCP handshake overhead and allows for immediate feedback—for example, a user can cancel a generation mid-stream, and the control plane can immediately signal the data plane to abort the computation, freeing up resources.57
8. Comparison of Key Architectures
To synthesize the differences between these systems, we present a comparative analysis of their scheduling and architectural choices.
| Feature | vLLM | Orca | Sarathi-Serve | DistServe | Mooncake |
| --- | --- | --- | --- | --- | --- |
| Primary Innovation | PagedAttention (Memory) | Iteration-Level Scheduling | Stall-Free Batching | Prefill-Decode Disaggregation | KVCache-Centric Storage |
| Scheduling Granularity | Iteration (Continuous) | Iteration | Iteration (Chunked) | Phase-Specific | Global / Data-Locality |
| Batching Strategy | FCFS / Priority | FCFS | Piggybacking Prefill on Decode | Split Pools | Disaggregated Resource Pools |
| Control/Data Split | Yes (Ray/IPC) | Yes (Scheduler/Engine) | Yes | Yes (Distinct Clusters) | Yes (Conductor/Store) |
| Key Optimization | Zero Fragmentation | Minimal Queuing Delay | Consistent Inter-Token Latency | Goodput (SLA Compliance) | Throughput via Offload |
| Fault Tolerance | Checkpointing (Basic) | Basic | Basic | Replication-Aware | Highly Resilient (Store) |
9. Future Directions and Emerging Trends
The trajectory of inference architecture points toward further granularization and the “serverless-ification” of the data plane.
- Serverless Data Planes: Technologies like PipeBoost are reducing the cold-start time of models to milliseconds using parallelized model loading and shared memory snapshots. This will enable control planes to spin up data plane workers per request, eliminating idle costs entirely.60
- Optical Data Planes: As the bottleneck is fundamentally data movement (memory bandwidth and interconnects), future data planes may integrate optical networking directly into the inter-chip fabric to facilitate the massive KV cache transfers required by disaggregated architectures.
- The Rise of the “Inference Operating System”: We are witnessing the emergence of a standardized “Inference OS.” Kubernetes provides the kernel (resource management), KServe provides the init system (lifecycle), vLLM/Triton provides the runtime, and Envoy provides the networking. The clear separation of control and data planes is the architectural pattern that makes this stack composable and scalable.
10. Conclusion
The modern LLM inference stack has matured from a monolithic deep learning script into a complex, multi-layered distributed system. The strict separation of the Control Plane and Data Plane is the linchpin of this architecture. It allows the system to solve two distinct optimization problems simultaneously: the “macro” problem of resource availability and cost (solved by the control plane’s autoscalers and routers) and the “micro” problem of hardware saturation and latency (solved by the data plane’s schedulers and memory managers).
The innovations analyzed in this report—from PagedAttention and Iteration-Level Scheduling to the radical Disaggregated Prefill-Decode architectures—demonstrate a consistent trend: adapting software structures to the unique physical realities of autoregressive generation on heterogeneous hardware. As models grow larger and context windows continue to expand, this architectural bifurcation will only deepen, driving the industry toward hyper-specialized, highly efficient, and reliable intelligence infrastructure.
