The Throughput Imperative in LLM Serving
The deployment of Large Language Models (LLMs) in production environments has shifted the primary engineering challenge from model training to efficient, scalable inference. While these models possess unprecedented capabilities, their sheer size and unique computational patterns present formidable obstacles to achieving the high throughput and low latency required by real-time applications. At the heart of this challenge lies a fundamental mismatch between the autoregressive nature of LLM inference and the massively parallel architecture of the Graphics Processing Units (GPUs) on which they run. This section establishes the system-level context for this problem, detailing the architectural bottlenecks that render naive inference strategies inefficient and introducing batching as the foundational optimization that paves the way for more advanced techniques.
The Memory-Bound Nature of Autoregressive Inference
Despite the immense computational power of modern GPUs, capable of performing trillions of floating-point operations per second (FLOPS), LLM inference workloads often fail to fully utilize this capacity.1 The primary reason for this inefficiency is that LLM inference is fundamentally memory-bound, not compute-bound.1 The core bottleneck is the time required to load the model’s vast parameters—often numbering in the tens or hundreds of billions—from the GPU’s high-bandwidth memory (HBM) into the on-chip SRAM of the streaming multiprocessors (SMs) where computation occurs.1
This memory transfer latency significantly dominates the actual time spent on mathematical computations, particularly during the iterative token generation phase of inference.3 A typical LLM forward pass involves loading gigabytes of weight data to process a comparatively small amount of activation data. Consequently, the powerful arithmetic units of the GPU spend a significant portion of their time idle, waiting for data to arrive from HBM.1 This chronic underutilization of expensive hardware resources leads directly to suboptimal performance, low throughput, and poor cost-efficiency, creating a critical need for system-level optimizations that can better saturate the GPU’s computational capabilities.
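To make the imbalance concrete, the following back-of-envelope calculation is a rough sketch that assumes H100-class hardware figures and a hypothetical 13-billion-parameter model in FP16; none of these numbers come from the cited sources or a specific benchmark. It compares the time needed to stream the weights from HBM against the time needed to perform the corresponding arithmetic for one decode step.

```python
# Illustrative back-of-envelope only: hardware numbers (HBM bandwidth, peak
# FP16 throughput) are assumed, H100-class values, not measurements.

WEIGHT_BYTES = 13e9 * 2          # 13B parameters in FP16 -> ~26 GB of weights
HBM_BANDWIDTH = 3.35e12          # assumed ~3.35 TB/s HBM bandwidth
PEAK_FP16_FLOPS = 990e12         # assumed ~990 TFLOP/s dense FP16 peak

# Decoding one token touches every weight once: roughly 2 FLOPs per parameter.
flops_per_token = 2 * 13e9

time_to_stream_weights = WEIGHT_BYTES / HBM_BANDWIDTH      # ~7.8 ms
time_to_compute = flops_per_token / PEAK_FP16_FLOPS        # ~0.026 ms

print(f"memory time  ~ {time_to_stream_weights*1e3:.1f} ms per decode step")
print(f"compute time ~ {time_to_compute*1e3:.3f} ms per decode step")
# The roughly 300x gap between the two is why single-sequence decoding leaves
# the arithmetic units idle, waiting on HBM.
```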
The Dichotomy of LLM Inference Phases: Prefill and Decode
The challenge of GPU utilization is further compounded by the fact that LLM inference is not a monolithic computational process. It comprises two distinct phases with starkly different performance profiles, creating a complex scheduling problem for any serving system.6
The Prefill Phase, also known as the prompt processing or initiation phase, is the initial step where the model processes the entire input prompt simultaneously.6 This phase is characterized by large matrix-matrix multiplications (GEMM operations) that can effectively parallelize across the input sequence. As a result, the prefill phase is generally compute-bound, capable of saturating the GPU’s computational resources, especially for long prompts where the attention mechanism’s computational complexity scales quadratically with the sequence length.5
The Decode Phase, in contrast, is the iterative, autoregressive process of generating the output sequence one token at a time.6 Each decoding step involves a forward pass to predict the next token, which is then appended to the input for the subsequent step. Computationally, each step is equivalent to a matrix-vector operation, which is too small to fully leverage the GPU’s massively parallel architecture.5 The dominant operation in the decode phase is reading the entire, ever-growing Key-Value (KV) cache from HBM. The KV cache stores the attention keys and values for all previously processed tokens, and it must be accessed at every step. This makes the decode phase quintessentially memory-bound.5
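The memory pressure of the decode phase can be illustrated with a simple sizing sketch. The model dimensions below are assumed, roughly Llama-2-13B-like values chosen for illustration, not figures taken from the cited sources.

```python
# Rough size of the KV cache that must be read from HBM at every decode step.
# Model dimensions are assumed, Llama-2-13B-like values for illustration.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 40,
                   n_kv_heads: int = 40,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    # 2x for keys and values; one entry per layer, head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

for ctx in (1_024, 4_096, 16_384):
    gb = kv_cache_bytes(ctx) / 1e9
    print(f"context {ctx:>6} tokens -> ~{gb:.2f} GB of KV cache per sequence")
# Every decode step re-reads this entire cache, so per-token latency grows
# with context length even though each step produces only one new token.
```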
This prefill/decode dichotomy is the root cause of many advanced scheduling challenges. A simple batching strategy that treats all requests identically will inevitably be inefficient because it fails to account for these two different performance profiles. A long, compute-intensive prefill operation can stall the single-token decode steps of many other users, re-introducing a form of system-level inefficiency even within advanced batching frameworks.8 Therefore, the core systems problem is not just batching requests, but intelligently scheduling the distinct sub-computations (prefill vs. decode) of those requests. This realization drives the architectural differences between various state-of-the-art inference frameworks.
Batching as a Foundational Optimization
To counteract the severe underutilization of GPUs caused by the memory-bound nature of single-sequence inference, the most fundamental optimization is batching. Instead of processing requests sequentially, a serving system can group multiple requests together and process them as a single batch.1 This approach allows the system to load the model weights from HBM once per layer and then apply them to many different input sequences simultaneously.4
By transforming the computation from a series of small matrix-vector operations into a single, large batch matrix-matrix multiplication, batching dramatically increases arithmetic intensity. This better utilizes the GPU’s parallel architecture, effectively amortizing the high cost of memory transfers across multiple requests.2 The result is a substantial improvement in aggregate throughput, measured in tokens per second, and a more cost-effective use of hardware.1 This principle of amortizing memory access costs through parallel computation is the foundational concept that motivates the development of all sophisticated batching strategies.
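A rough sketch of this effect, using assumed layer dimensions, shows how the arithmetic intensity (FLOPs per byte moved) of a single decode-time linear layer grows almost linearly with the batch size.

```python
# Arithmetic intensity of one linear layer during decode, as a function of
# batch size. Layer dimensions and dtype are assumed, illustrative values.

def arithmetic_intensity(batch: int, d_in: int = 5120, d_out: int = 5120,
                         dtype_bytes: int = 2) -> float:
    flops = 2 * batch * d_in * d_out                  # one GEMM: B x d_in @ d_in x d_out
    weight_bytes = d_in * d_out * dtype_bytes         # weights loaded once per batch
    act_bytes = batch * (d_in + d_out) * dtype_bytes  # inputs read + outputs written
    return flops / (weight_bytes + act_bytes)

for b in (1, 8, 32, 128):
    print(f"batch {b:>3}: ~{arithmetic_intensity(b):.0f} FLOPs per byte")
# Intensity grows almost linearly with batch size: this is exactly the
# amortization of weight loads that batching provides.
```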
A Taxonomy of Batching Strategies
The evolution of batching techniques for LLM inference reflects a progressively deeper understanding of the workload’s unique characteristics. The journey from simple, rigid methods to highly dynamic, fine-grained scheduling illustrates a clear progression in system design, aimed at systematically eliminating sources of inefficiency. This section provides a taxonomy of these strategies, detailing their mechanisms, inherent limitations, and the logical evolution that led to the development of continuous batching.
Static Batching: The Inflexible Foundation
Static batching is the most straightforward implementation of the batching principle.2 In this approach, the inference server waits until a fixed number of requests, corresponding to a predetermined batch size, has arrived. Only then does it group these requests into a single tensor, process them simultaneously, and wait for every single request in the batch to complete its full generation before returning the results and proceeding to the next batch.2
While simple to implement, this request-level, atomic batching model introduces a critical, performance-killing flaw known as Head-of-Line (HOL) blocking.6 In any realistic production scenario, incoming requests will have varying prompt lengths and will require generating output sequences of different lengths. Because the entire batch is treated as an indivisible unit of work, its completion time is dictated by the single longest-running request. Consequently, GPU resources that were allocated to sequences that finish early—for example, a simple question-answering request batched with a long document summarization task—sit idle, waiting for the straggler to complete. This results in significant waste of compute time and memory, often visualized as “white blocks” of underutilization in timelines of GPU activity.1 The phenomenon is a classic performance pathology in computer networking and operating systems, where a single slow packet or process can hold up an entire queue.13
Furthermore, to form a single, uniform tensor that can be processed by the GPU, all sequences in a static batch must be padded with special tokens to match the length of the longest sequence in the batch. This forces the GPU to perform a substantial amount of useless computation on these padding tokens, wasting both compute cycles and valuable memory.6
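The combined cost of padding and HOL blocking can be illustrated with a small, self-contained toy simulation; the request sizes and batch size below are arbitrary, illustrative values rather than measurements from any real system.

```python
import random
from dataclasses import dataclass

# Toy simulation of static batching (no real model): each "request" needs a
# random number of decode steps, and the batch occupies the GPU until its
# slowest member finishes. Numbers and structure are purely illustrative.

@dataclass
class Request:
    prompt_len: int
    output_len: int          # tokens the request will actually generate

def simulate_static_batch(requests: list[Request]) -> None:
    # Padding: every prompt is padded to the longest prompt in the batch.
    padded_prompt = max(r.prompt_len for r in requests)
    padding_waste = sum(padded_prompt - r.prompt_len for r in requests)

    # HOL blocking: the batch runs for as many steps as the longest output.
    batch_steps = max(r.output_len for r in requests)
    useful_steps = sum(r.output_len for r in requests)
    idle_slots = batch_steps * len(requests) - useful_steps

    print(f"padding tokens wasted:          {padding_waste}")
    print(f"idle decode slots (stragglers): {idle_slots} "
          f"of {batch_steps * len(requests)} total")

random.seed(0)
batch = [Request(random.randint(20, 400), random.randint(10, 500)) for _ in range(8)]
simulate_static_batch(batch)
```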
Despite these significant drawbacks, static batching remains the optimal choice for a specific niche: offline, predictable workloads where latency is not a primary concern and requests are largely homogeneous. Examples include nightly jobs for bulk document processing or large-scale data analysis.15 In such controlled environments, the variance in generation lengths is minimal, which naturally mitigates the impact of HOL blocking. Here, the low scheduling overhead of the static approach can lead to higher peak throughput compared to more complex dynamic methods.17
Dynamic Batching: A Reactive Improvement
Dynamic batching represents a reactive enhancement to the static model, designed to improve responsiveness in environments with variable traffic.2 The core mechanism introduces a time-based trigger in addition to the size-based one. The server collects incoming requests and forms a batch either when the maximum configured batch size is reached or when a pre-defined time window expires, whichever occurs first.2
The primary advantage of this approach is that it improves the latency for individual requests, particularly during periods of low traffic. It ensures that the first few requests in a potential batch are not forced to wait indefinitely for the batch to fill up, striking a better balance between latency and throughput.2 It effectively solves the problem of when to start processing a batch.
However, the fundamental flaw of request-level atomicity persists. Dynamic batching still operates at the granularity of an entire request. Once a batch is formed and dispatched to the GPU, it is immutable and suffers from the exact same HOL blocking and padding inefficiencies as static batching.2 It improves the batch formation process but does nothing to address the profound underutilization within the batch.
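A minimal sketch of the dynamic batch-formation trigger is shown below. The batch size, the wait window, and the downstream run_request_level_batch callable are hypothetical placeholders; note that the dispatched batch is still processed atomically, exactly as described above.

```python
import time
from queue import Queue, Empty

# Minimal sketch of a dynamic batch former: dispatch when either the batch is
# full or the wait window expires, whichever comes first. The downstream
# `run_request_level_batch` callable is a hypothetical placeholder.

MAX_BATCH_SIZE = 8
MAX_WAIT_S = 0.050   # assumed 50 ms batching window

def form_batch(request_queue: Queue) -> list:
    batch = [request_queue.get()]          # block until at least one request
    deadline = time.monotonic() + MAX_WAIT_S
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                          # time window expired
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break
    return batch                           # still processed atomically afterwards

def serve(request_queue: Queue, run_request_level_batch):
    while True:
        run_request_level_batch(form_batch(request_queue))
```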
The progression from static to dynamic and ultimately to continuous batching reflects a fundamental shift in system design philosophy. Static batching is resource-centric; its primary goal is to create a full batch to maximize the GPU’s utilization for a single compute kernel, largely ignoring the individual characteristics of the requests. Dynamic batching is request-centric; it introduces a time-out to improve the latency for the first request in a batch, acknowledging that request-level metrics matter and prioritizing getting a request started over waiting for perfect resource utilization. Continuous batching, as the next section will detail, is work-centric. It decomposes requests into their smallest unit of work—a single token generation—and schedules this work dynamically. This fine-grained, work-centric view is what allows it to achieve maximum resource utilization without penalizing individual requests, representing a more sophisticated and efficient paradigm.
| Feature | Static Batching | Dynamic Batching | Continuous Batching |
| --- | --- | --- | --- |
| Batch Formation Trigger | Fixed number of requests arrive 2 | Fixed number of requests OR time window expires 2 | Continuous admission as resources free up 18 |
| Scheduling Granularity | Request-level (entire sequence) 14 | Request-level (entire sequence) 2 | Iteration-level (single token step) 14 |
| Batch Composition | Fixed and immutable once launched 14 | Fixed and immutable once launched 2 | Dynamic; changes at every iteration 14 |
| Head-of-Line Blocking | Severe; batch waits for the longest request 6 | Severe; batch waits for the longest request 2 | Eliminated; requests finish and exit independently 6 |
| GPU Utilization | Low and variable (“sawtooth” pattern) due to idle time 1 | Low and variable; similar to static once batch starts 2 | Consistently high; freed resources are immediately backfilled 14 |
| Padding Overhead | High; all sequences padded to the longest in the batch 14 | High; all sequences padded to the longest in the batch 2 | Eliminated; no padding across sequences is required 14 |
| Ideal Use Case | Offline, bulk processing with homogeneous requests 16 | General-purpose, balancing latency and throughput, but suboptimal for LLMs 2 | Online, interactive applications with heterogeneous requests (e.g., chatbots) 16 |
The Core Mechanism of Continuous Batching: Iteration-Level Scheduling
Continuous batching represents a paradigm shift in how inference requests are scheduled and executed. It moves away from the coarse-grained, request-level batching of its predecessors to a fine-grained, dynamic approach that maximizes hardware utilization by fundamentally rethinking the unit of schedulable work. This section provides a deep, algorithmic breakdown of its core mechanism, iteration-level scheduling, and explains precisely how this innovation solves the long-standing problems of head-of-line blocking and padding.
The Paradigm Shift: From Request-Level to Iteration-Level Granularity
The central innovation of continuous batching, first formally proposed and analyzed in the Orca paper from OSDI ’22, is to change the scheduling quantum from an entire request to a single autoregressive step, referred to as an “iteration”.1 In this model, the system no longer conceives of its workload as discrete “batches of requests” that must be processed atomically. Instead, it manages a continuous, dynamic pool of active requests, and the fundamental unit of work is the generation of the next token for every request in that pool.
The “batch” in a continuous batching system is therefore an ephemeral concept. It is simply the set of active sequences being processed at a given iteration, a composition that can—and frequently does—change at every single step.14 This decouples the lifecycle of an individual request from the lifecycle of any other request in the system, eliminating the artificial synchronization barriers that plague static and dynamic batching.
The Continuous Batching Algorithm: A Step-by-Step Walkthrough
The operational flow of a continuous batching system is a tight, continuous loop managed by a central scheduler. The process can be broken down into the following logical steps; a minimal scheduler sketch in Python follows the list:
- Request Arrival: New inference requests arrive asynchronously at the server. Instead of being held to form a static batch, they are immediately placed into a waiting_queue.14
- Iteration Forward Pass: At each time step, the scheduler takes all sequences currently in the active_batch and executes a single forward pass on the GPU. This single pass performs one decoding step for every sequence in the active_batch, generating exactly one new token for each of them in parallel.6
- Completion Check: After the forward pass is complete, the system inspects the newly generated tokens for each sequence. If a token is a designated end-of-sequence (EOS) token, or if a sequence has reached its user-defined maximum length, that sequence is marked as completed.18
- Immediate Eviction: All sequences that were marked as completed are immediately removed from the active_batch. This is a critical step: their associated resources, most importantly the GPU memory allocated for their KV cache, are instantly freed and returned to the system’s global memory pool.6
- Admission of New Requests: The scheduler now assesses the available system capacity, considering both GPU memory and any configured concurrency limits (e.g., maximum number of batched tokens). If capacity is available, it pulls new requests from the head of the waiting_queue and adds them to the active_batch.1 These newly admitted requests will have their prefill phase executed as part of the next iteration’s forward pass, alongside the decode steps for the other requests already in the batch.
- Loop: The process immediately repeats from the Iteration Forward Pass step. The active_batch is in a state of constant flux, with sequences joining and leaving at every iteration. There are no artificial synchronization points; the system is always performing useful work as long as there are requests to be processed.14
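The following minimal sketch ties these steps together into a single scheduler loop. The model_step callable, the attributes on the sequence objects (num_tokens, max_len, append_token, finish), and the token-budget admission check are hypothetical stand-ins for engine-specific components, not the API of any particular framework.

```python
from collections import deque

# Minimal sketch of the iteration-level scheduling loop described above.
# `model_step` and the sequence attributes are hypothetical stand-ins.

EOS_TOKEN = 2
MAX_BATCHED_TOKENS = 4096

def has_capacity(active, candidate) -> bool:
    # Simplified admission check: stay under a token budget. Real engines also
    # check the number of free KV-cache blocks before admitting a request.
    in_flight = sum(seq.num_tokens for seq in active)
    return in_flight + candidate.num_tokens <= MAX_BATCHED_TOKENS

def serving_loop(model_step, waiting_queue: deque):
    active = []
    while True:
        # Admission: backfill freed capacity with waiting requests.
        while waiting_queue and has_capacity(active, waiting_queue[0]):
            active.append(waiting_queue.popleft())
        if not active:
            continue  # (a real server would wait here for new arrivals)

        # One forward pass produces exactly one token per active sequence;
        # newly admitted requests get their prefill folded into this step.
        new_tokens = model_step(active)

        still_active = []
        for seq, tok in zip(active, new_tokens):
            seq.append_token(tok)
            # Completion check and immediate eviction: finished sequences
            # leave the batch and release their KV-cache memory at once.
            if tok == EOS_TOKEN or seq.num_tokens >= seq.max_len:
                seq.finish()
            else:
                still_active.append(seq)
        active = still_active  # the batch composition mutates every iteration
```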
This continuous cycle transforms the GPU utilization profile. Static batching exhibits a “sawtooth” pattern: utilization spikes to 100% when a full batch is running, then gradually declines as shorter requests finish and their allocated slots go idle. Utilization then drops to zero while the system waits for the last straggler to complete and for a new batch to assemble.14 Continuous batching smooths this into a consistently high plateau. The “time between batches” is eliminated because the process is continuous, and the “decline as requests finish” is eliminated because freed resources are immediately backfilled with new work.2 This sustained high utilization is the direct source of the dramatic throughput gains observed in these systems.
How Iteration-Level Scheduling Solves HOL Blocking and Padding
The algorithmic design of iteration-level scheduling directly addresses the core inefficiencies of previous methods:
- Eliminating Head-of-Line Blocking: Because a request is evicted from the active_batch the moment it generates its final token, its completion is entirely independent of any other request. It never has to wait for a longer-running request that happened to be scheduled at the same time. This directly and completely resolves the primary source of inefficiency and latency variance in request-level batching systems.6
- Eliminating Padding: The concept of padding a batch to a uniform length becomes obsolete. At each iteration, the GPU kernel operates on sequences of their actual, current lengths. There is no need to add extraneous padding tokens to equalize sequence lengths across the batch, which saves a massive amount of wasted computation and memory.6
A Note on Terminology
The rapid development and commercialization of this technology have led to a variety of terms being used to describe the same core concept, which can be a source of confusion. This report uses “continuous batching” as the canonical term, but it is important to recognize its common synonyms:
- In-flight batching: This is the term predominantly used by NVIDIA in the context of its TensorRT-LLM framework.2
- Iteration-level scheduling: This is the original, more descriptive academic term introduced in the Orca paper.1
- Persistent batching: Used by frameworks like LMDeploy.2
- Dynamic batching: While this term historically referred to the time-window-based approach, it is now sometimes used more loosely to encompass the modern, iteration-level technique.1
Understanding that these terms generally refer to the same fundamental algorithm of dynamically managing a batch at the single-token level is key to navigating the technical literature and documentation of different inference frameworks.
PagedAttention: The Symbiotic Partner to Continuous Batching
The introduction of continuous batching, while solving the critical problem of GPU underutilization, simultaneously creates a new and severe challenge: memory management. The highly dynamic, fine-grained nature of iteration-level scheduling, with requests of varying and constantly growing lengths entering and leaving the system at every step, makes managing the memory for the KV cache exceptionally difficult. Without an efficient solution to this problem, the theoretical gains of continuous batching would be unrealizable in practice. PagedAttention, an algorithm introduced by vLLM, provides this solution by drawing inspiration from classical memory management techniques in operating systems.
The Memory Management Crisis Induced by Continuous Batching
The dynamic nature of continuous batching places extreme pressure on the GPU memory allocator.18 Each request in the active_batch maintains its own KV cache, which grows by one token’s worth of data at every single decoding step. This means the system must efficiently manage a large number of memory blocks of variable and constantly increasing size. Using traditional, malloc-style contiguous memory allocation in this environment leads to two critical and often fatal problems:
- Internal Fragmentation: A naive but common strategy to avoid frequent reallocations is to pre-allocate a single contiguous block of memory for each request, large enough to hold its KV cache up to the maximum possible sequence length. This was a drawback of the original Orca system.24 However, since most requests will generate far fewer tokens than the maximum, a significant portion of this pre-allocated memory remains unused, leading to massive waste. This is known as internal fragmentation.18
- External Fragmentation: The alternative is to allocate memory for the KV cache dynamically as it grows. However, the continuous cycle of allocating and freeing variable-sized chunks of memory quickly leads to a state where the GPU’s free memory is broken up into many small, non-contiguous “holes.” This is external fragmentation. The system may have enough total free memory to accommodate a new request, but it cannot find a single contiguous block large enough to satisfy the allocation. This results in out-of-memory (OOM) errors and service failures even when sufficient memory is technically available.18
PagedAttention: Virtual Memory for the GPU
To solve this memory management crisis, the vLLM project introduced PagedAttention, an algorithm explicitly inspired by the concepts of virtual memory and paging used in modern operating systems for decades.6 The core idea is to abandon the requirement for contiguous memory allocation for the KV cache.
The mechanism of PagedAttention is as follows (a simplified block-table sketch in Python appears after the list):
- Physical Blocks: The entire GPU memory region allocated for KV caches is partitioned into a large pool of small, fixed-size physical blocks. These blocks are the fundamental unit of memory allocation.18
- Logical Blocks: From the perspective of a single inference request, its KV cache is still viewed as a contiguous sequence of logical blocks.18
- Block Table (Page Table): The crucial link between the logical and physical views is a per-request data structure called a block table. This table functions exactly like a page table in an operating system, mapping the logical block indices of a sequence to the memory addresses of physical blocks on the GPU. Crucially, these physical blocks do not need to be contiguous in memory.6 When the attention kernel needs to access the KV cache for a sequence, it uses the block table to find and gather the data from these scattered physical blocks.
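The bookkeeping described above can be sketched in a few lines. The block size, class names, and data structures below are simplified illustrations of the idea, not vLLM’s actual implementation.

```python
# Minimal sketch of PagedAttention-style bookkeeping: a pool of fixed-size
# physical blocks plus a per-sequence block table mapping logical block index
# -> physical block id. Block size and structure are illustrative only.

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed)

class BlockAllocator:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV-cache blocks")
        return self.free_blocks.pop()          # O(1); any free block will do

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class SequenceKVCache:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []       # logical index -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:  # last block full (or first token)
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_location(self, token_pos: int) -> tuple[int, int]:
        # Where the attention kernel finds this token's K/V entries.
        return self.block_table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

    def release(self) -> None:
        for block in self.block_table:
            self.allocator.free(block)
        self.block_table.clear()
```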
How PagedAttention Unlocks Efficiency
By implementing this virtual memory system for the KV cache, PagedAttention elegantly solves the fragmentation crisis and introduces new efficiencies:
- Solving Fragmentation: Because the system now works with small, fixed-size blocks, external fragmentation is completely eliminated. Any free physical block can be used to satisfy an allocation request. Internal fragmentation is also minimized, confined only to the unused space in the very last block of each sequence.18 This allows for much denser packing of sequences into GPU memory, which directly translates to larger effective batch sizes and higher throughput.
- Low-Overhead Management: Allocating and freeing fixed-size blocks is a computationally trivial and fast operation (e.g., pushing to or popping from a free list), which is essential to keep up with the high frequency of memory operations demanded by continuous batching’s step-by-step nature.18
The relationship between these two technologies is deeply symbiotic. PagedAttention is not merely an optimization for continuous batching; it is a foundational enabler. Without a robust solution to the memory fragmentation problem, the performance gains from continuous scheduling would be completely undermined by memory management overhead and frequent OOM failures. The two technologies co-evolved to create a viable, high-performance LLM serving architecture.
This co-evolution mirrors the historical development of operating systems. The initial problem was task execution on a shared resource (CPU/GPU). Naive, single-request inference is like a single-tasking OS. Static batching is analogous to early batch processing systems. Continuous batching introduced preemption and time-slicing, akin to the shift to multitasking operating systems, which in turn created a memory management crisis (fragmentation). PagedAttention is a direct application of virtual memory and paging, the canonical OS solution to that crisis. This parallel strongly suggests that future challenges in LLM serving, such as quality-of-service and fairness, will likely be solved by adapting other well-established principles from OS and distributed systems research.
Advanced Capabilities: Zero-Cost Memory Sharing
Beyond solving fragmentation, PagedAttention’s architecture unlocks powerful memory sharing optimizations with almost no overhead, which are particularly beneficial for complex sampling strategies; a copy-on-write sketch follows the list:
- Parallel Sampling & Beam Search: These techniques involve generating multiple potential output sequences from a single prompt. In a traditional system, this would require duplicating the entire prompt’s KV cache for each candidate sequence. With PagedAttention, this becomes a zero-cost operation. The block tables for all n candidate sequences simply contain pointers to the exact same physical blocks that store the shared prompt’s KV cache. No data duplication is needed, saving a significant amount of memory and time.18
- Copy-on-Write (CoW): When one of the shared sequences diverges from the others (e.g., a new token is generated in one beam), the system does not need to copy the entire shared history. It simply allocates a new physical block for the new KV data, copies the contents of only the last shared block, makes the modification, and updates the diverging sequence’s block table to point to this new block. This Copy-on-Write mechanism dramatically reduces the memory and computational overhead of branching generation paths.18
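A minimal sketch of prefix sharing with copy-on-write over such a paged cache is shown below; the reference-counting scheme and function names are simplified assumptions for illustration, not vLLM’s code.

```python
from collections import defaultdict

# Minimal sketch of prefix sharing with copy-on-write over a paged KV cache.

class PagedKVPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = defaultdict(int)

    def allocate(self) -> int:
        blk = self.free.pop()
        self.refcount[blk] = 1
        return blk

    def share(self, blk: int) -> None:
        self.refcount[blk] += 1                 # zero-copy: just bump the count

    def release(self, blk: int) -> None:
        self.refcount[blk] -= 1
        if self.refcount[blk] == 0:
            self.free.append(blk)

def fork(block_table: list[int], pool: PagedKVPool) -> list[int]:
    """Give a new candidate sequence the same prompt KV cache for free."""
    for blk in block_table:
        pool.share(blk)
    return list(block_table)

def write_last_block(block_table: list[int], pool: PagedKVPool) -> None:
    """Copy-on-write: materialize a private copy only when a shared block
    is about to diverge (e.g., a new token is appended in one beam)."""
    last = block_table[-1]
    if pool.refcount[last] > 1:
        new_blk = pool.allocate()
        # (copy the KV data of `last` into `new_blk` on the GPU here)
        pool.release(last)
        block_table[-1] = new_blk
```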
A Survey of State-of-the-Art Inference Frameworks
The theoretical concepts of iteration-level scheduling and paged memory management have been implemented, adapted, and optimized by a variety of academic and commercial inference frameworks. While the core principles are shared, each framework has its own terminology, architectural nuances, and unique optimizations that reflect different design philosophies and target use cases. This section provides a survey of the leading inference servers, tracing the evolution of these ideas from their academic origins to their highly-optimized, production-ready implementations.
The Genesis: Orca (OSDI ’22)
The 2022 paper “Orca: A Distributed Serving System for Transformer-Based Generative Models” is the pioneering academic work that formally introduced and validated the concept of iteration-level scheduling.1 Its publication marked a turning point in LLM serving research, demonstrating that decoupling request lifecycles from batch execution could yield massive performance improvements. The paper’s evaluation on a GPT-3 175B model claimed a landmark 36.9x throughput improvement over NVIDIA’s then state-of-the-art FasterTransformer library, all while maintaining the same level of latency.19
However, the original Orca implementation also highlighted the initial difficulty of the problem. To handle requests with different sequence lengths within a single iteration, Orca employed selective batching. It would only batch operations that were independent of the sequence length, such as the linear transformations. The attention mechanism, which requires inputs of a consistent shape, was executed sequentially for each request in the batch.9 This workaround underscored the memory layout challenges that later systems would solve more elegantly with custom kernels and paged memory management. Orca’s key contribution was proving the scheduling principle; subsequent systems would perfect the memory and execution model.
vLLM: The Canonical Open-Source Implementation
vLLM, developed by researchers at UC Berkeley, is widely regarded as the canonical open-source implementation that tightly integrated continuous batching with its novel PagedAttention algorithm.3 By solving the memory fragmentation crisis that made continuous batching difficult to implement efficiently, vLLM created a powerful and robust architectural pattern.
The open-sourcing of vLLM was a pivotal moment for the community. It set a new, much higher performance baseline for LLM serving and drove the widespread adoption of these combined techniques.27 Its success has made it a foundational component in the LLM ecosystem, with many other serving solutions and MLOps platforms, such as RayLLM and Wallaroo, using vLLM as a high-performance backend engine.29 Benchmarks demonstrated that vLLM could achieve up to 24x higher throughput than standard Hugging Face Transformers implementations.27
NVIDIA TensorRT-LLM: Hardware-Aware Optimization
NVIDIA’s TensorRT-LLM is a high-performance inference library designed to extract maximum performance from NVIDIA GPUs. It implements the continuous batching algorithm under the name in-flight batching (IFB).2 The library’s core strength lies in its use of the TensorRT deep learning compiler, which generates highly optimized, hardware-specific CUDA kernels for every operation in the model, delivering state-of-the-art performance.22
For memory management, TensorRT-LLM utilizes a Paged KV Cache, which is conceptually similar to vLLM’s PagedAttention, to solve memory fragmentation and enable efficient batching.31 A key architectural differentiator for TensorRT-LLM is its implementation of Chunked Prefill. This advanced scheduling feature directly addresses the prefill-decode asymmetry. It can break a long, compute-intensive prefill operation into smaller, more manageable “chunks.” This allows the scheduler to interleave the fast decode steps from other active requests with these prefill chunks, which improves interactivity and reduces the latency “bubble” that can occur when a request with a very long prompt enters the system.31
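Conceptually, chunked prefill can be sketched as a per-iteration token budget that mixes one chunk of a pending prefill with the waiting decode steps. The chunk size, token budget, and scheduling function below are assumed, illustrative values and do not reflect TensorRT-LLM’s actual scheduler.

```python
# Conceptual sketch of chunked prefill: a long prompt is split into fixed-size
# chunks so that each iteration mixes one prefill chunk with pending decode
# steps instead of stalling them. All constants are assumed values.

CHUNK_SIZE = 512          # prompt tokens processed per iteration
TOKEN_BUDGET = 1024       # max tokens scheduled per iteration

def schedule_iteration(prefill_remaining: dict[str, int], decoding: list[str]):
    """Return (prefill work, decode work) for one iteration."""
    budget = TOKEN_BUDGET - len(decoding)      # 1 token per decoding sequence
    prefill_work = {}
    for req_id, remaining in prefill_remaining.items():
        if budget <= 0:
            break
        take = min(remaining, CHUNK_SIZE, budget)
        prefill_work[req_id] = take
        budget -= take
    return prefill_work, decoding

# Example: a 2,000-token prompt arrives while 200 sequences are decoding.
work, decodes = schedule_iteration({"long_prompt": 2000}, [f"seq{i}" for i in range(200)])
print(work)          # {'long_prompt': 512} -> prefill advances by one chunk
print(len(decodes))  # 200                  -> ongoing decodes are not starved
```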
Hugging Face Text Generation Inference (TGI)
Text Generation Inference (TGI) is a production-grade, open-source inference server from Hugging Face that has become an industry standard due to its robustness, ease of use, and comprehensive feature set.33 TGI implements continuous batching and incorporates key performance optimizations, including Paged Attention and Flash Attention, to achieve high throughput.34
TGI is particularly well-known for its tight integration with the Hugging Face ecosystem, offering broad, often day-one, support for the most popular open-source models.33 Its production-ready features, such as token streaming via Server-Sent Events (SSE), tensor parallelism for multi-GPU inference, and detailed Prometheus metrics, make it a popular choice for deploying LLMs at scale.33
DeepSpeed-Inference & DeepSpeed-FastGen: A Different Scheduling Philosophy
DeepSpeed-Inference, part of Microsoft’s DeepSpeed ecosystem, and its successor DeepSpeed-FastGen, present an alternative approach to scheduling within the continuous batching paradigm. While they also employ continuous batching and a blocked KV cache, their unique innovation is a proactive scheduling strategy called Dynamic SplitFuse.35
Instead of passively scheduling requests as they are, Dynamic SplitFuse actively reshapes the workload to create more uniform and efficient batches for the GPU. It works in two ways: it decomposes very long prompts into smaller chunks to be processed over multiple iterations, and it composes multiple short prompts together to fill a target token budget.35 This strategy is designed to smooth out the computational variance between the prefill and decode phases, directly tackling their asymmetric performance profiles. This different architectural choice leads to a distinct performance profile. Published benchmarks and analysis suggest that DeepSpeed-FastGen excels specifically in workloads characterized by very long prompts and short generated outputs, as this is where its prompt decomposition strategy provides the most benefit. In other common scenarios, particularly those involving long output generations, the memory management of systems like vLLM can be superior.36
| Framework | Batching Terminology | Key Memory Tech | Differentiating Feature(s) |
| --- | --- | --- | --- |
| Orca (OSDI ’22) | Iteration-Level Scheduling | Contiguous Allocation (Pre-vLLM) | Pioneered the scheduling concept; used “selective batching” for attention 1 |
| vLLM | Continuous Batching | PagedAttention | Canonical open-source integration of continuous batching and paged memory 18 |
| NVIDIA TensorRT-LLM | In-Flight Batching (IFB) | Paged KV Cache | Deep compiler optimizations (TensorRT); Chunked Prefill for managing long prompts 22 |
| Hugging Face TGI | Continuous Batching | Paged Attention | Production-ready, broad model support, tight integration with Hugging Face ecosystem 33 |
| DeepSpeed-FastGen | Continuous Batching | Blocked KV Cache | Dynamic SplitFuse: Proactive scheduling that decomposes long prompts and composes short ones 35 |
Performance Analysis: Quantifying the Gains and Understanding the Trade-offs
The theoretical advantages of continuous batching translate into substantial, measurable improvements in the performance of LLM serving systems. However, a nuanced understanding requires looking beyond headline figures to analyze the specific metrics that matter for different applications and to recognize the inherent trade-offs that persist even with these advanced techniques. This section synthesizes performance data from academic papers and industry benchmarks to quantify the gains of continuous batching and delineate the scenarios where older methods may still be preferable.
Defining Performance: Key Metrics for LLM Serving
Evaluating the performance of an LLM inference system requires a multi-faceted approach, as a single metric cannot capture the complex interplay between system capacity and user experience. The key metrics are as follows; a small sketch that computes them from request timestamps appears after the list:
- Throughput: This measures the aggregate processing rate of the entire system. It is most commonly expressed as Tokens Per Second (TPS), which reflects the total number of output tokens generated by the server across all concurrent requests.17 It can also be measured in Requests Per Second (RPS), though this can be less informative given the high variability in request complexity.38 Throughput is the primary metric for assessing system capacity and cost-efficiency.
- Latency: This measures the speed of the system from the perspective of a single user. It is typically broken down into several components:
- Time to First Token (TTFT): The duration from the moment a user’s request arrives at the server to the moment the first output token is generated and sent back. This is a critical metric for interactive applications like chatbots, as it determines the initial responsiveness of the system.38
- Time Between Tokens (TBT) or Time Per Output Token (TPOT): The average time taken to generate each subsequent token after the first. This metric determines the perceived “fluidity” or speed of the text stream as it is being generated.40
- End-to-End Latency: The total time required to generate the full response for a request, from arrival to the final token.41
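These definitions translate directly into a small measurement helper. The field names and the timing model below are assumptions made for illustration, not a standard benchmarking harness.

```python
from dataclasses import dataclass

# Derive the metrics above from per-request timestamps (illustrative only).

@dataclass
class RequestTrace:
    arrival: float            # when the request reached the server (seconds)
    first_token: float        # when the first output token was emitted
    finished: float           # when the final token was emitted
    output_tokens: int

def ttft(t: RequestTrace) -> float:
    return t.first_token - t.arrival

def tpot(t: RequestTrace) -> float:
    # Average time per output token after the first one (TBT / TPOT).
    return (t.finished - t.first_token) / max(t.output_tokens - 1, 1)

def e2e_latency(t: RequestTrace) -> float:
    return t.finished - t.arrival

def system_throughput_tps(traces: list[RequestTrace]) -> float:
    # Aggregate output tokens per second over the observed window.
    window = max(t.finished for t in traces) - min(t.arrival for t in traces)
    return sum(t.output_tokens for t in traces) / window

trace = RequestTrace(arrival=0.00, first_token=0.35, finished=4.35, output_tokens=201)
print(f"TTFT {ttft(trace):.2f}s, TPOT {tpot(trace)*1000:.0f}ms, "
      f"E2E {e2e_latency(trace):.2f}s")
```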
Synthesized Benchmark Results
The performance impact of adopting continuous batching is dramatic, particularly when compared to naive or early-generation batching methods. The literature is replete with claims of order-of-magnitude improvements:
- The foundational Orca paper reported a 36.9x throughput gain over NVIDIA FasterTransformer, a highly optimized library that used a more static form of batching.19
- Benchmarks for vLLM have shown up to 24x higher throughput compared to standard HuggingFace Transformers (which lacks continuous batching) and significant gains over earlier versions of TGI.27
- An influential blog post and benchmark from Anyscale demonstrated a 23x throughput increase while simultaneously reducing the 50th percentile latency by using continuous batching.24
- Industry analyses frequently cite that continuous batching can achieve 10x to 20x better throughput than traditional dynamic batching.11
While these figures highlight the transformative impact of the technology, more recent studies comparing state-of-the-art continuous batching implementations against highly optimized static batching baselines show more modest, though still very significant, gains. For instance, one recent paper demonstrated throughput improvements in the range of 8% to 28% over the static batching policy in a vLLM implementation, along with a 22% increase in request capacity under specific Service-Level Agreement (SLA) constraints.43
The Inherent Throughput-Latency Trade-off
A critical principle that governs all batching systems, including continuous batching, is the fundamental trade-off between aggregate throughput and per-request latency. Increasing the number of concurrent requests in the active_batch will almost always increase the total system throughput (TPS). However, because the computational work of each iteration grows with the batch size, the time taken to complete that iteration also increases. This directly translates to a higher Time Between Tokens (TBT) for every individual user.11
System operators must therefore make a strategic decision based on their application’s requirements. For a service aiming to support the maximum number of concurrent users at the lowest cost, the system should be tuned to maximize throughput, even if it means slightly higher latency for each user. Conversely, for an application where user experience and real-time interactivity are paramount, the system might be configured with a lower concurrency limit to minimize TBT and TTFT, at the expense of total system throughput.17
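The trade-off can be made concrete with a toy cost model in which per-iteration time grows linearly with the number of concurrent sequences; all constants below are assumed, illustrative values rather than measurements.

```python
# Toy cost model of the throughput/latency trade-off: per-iteration time grows
# with the number of concurrent sequences, so aggregate TPS rises while each
# user's time-between-tokens (TBT) gets worse. All constants are assumed.

FIXED_MS = 6.0        # per-iteration overhead + weight streaming (assumed)
PER_SEQ_MS = 0.25     # marginal cost of one more sequence per step (assumed)

def iteration_time_ms(batch_size: int) -> float:
    return FIXED_MS + PER_SEQ_MS * batch_size

for b in (1, 8, 32, 128):
    tbt = iteration_time_ms(b)                  # latency seen by each user
    tps = b * 1000.0 / tbt                      # aggregate tokens per second
    print(f"batch {b:>3}: TBT {tbt:5.1f} ms, throughput {tps:7.1f} tok/s")
# Under these assumed constants, throughput climbs from ~160 to ~3,370 tok/s
# while per-token latency grows about six-fold; the operator picks the point.
```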
The Limits of Dynamism: When Static Batching is Superior
Despite the clear advantages of continuous batching for dynamic, online workloads, there are specific scenarios where the older, simpler static batching method is not only viable but can be superior.
The ideal scenario for static batching is for offline, high-volume, homogeneous workloads.16 This refers to tasks where latency is not a concern (e.g., a nightly job) and where the input and output lengths of requests are highly predictable and similar. In this controlled environment, the primary penalty of static batching—HOL blocking—is naturally minimized because there are no “straggler” requests.
Under these conditions, the complex, dynamic scheduling logic of a continuous batching engine becomes unnecessary overhead. The system is constantly making fine-grained scheduling decisions that provide no benefit because the workload is uniform. Experiments have shown that for these types of offline tasks, static batching can achieve better peak performance and higher GPU efficiency, particularly at large batch sizes (e.g., 64 or more), where the system reaches a saturation point and the lower overhead of the simpler scheduling mechanism becomes a tangible advantage.17 Therefore, for non-interactive, bulk-processing use cases, a carefully tuned static batching pipeline remains a highly effective and efficient solution.
| Comparison | Throughput (TPS) | Latency (TTFT/TBT) | Key Finding | Source(s) |
| --- | --- | --- | --- | --- |
| Orca vs. FasterTransformer | 36.9x higher | Same latency | Iteration-level scheduling provides massive gains over request-level batching. | 19 |
| vLLM vs. HF Transformers | Up to 24x higher | Lower | PagedAttention combined with continuous batching is vastly superior to naive batching. | 27 |
| Anyscale Benchmark | 23x higher | Reduced p50 latency | Continuous batching boosts throughput and can improve latency for the median user. | 24 |
| Continuous vs. Dynamic Batching | 10x-20x higher | N/A | The elimination of HOL blocking within the batch is the key driver of improvement. | 11 |
| Dynamic vs. Static Batching (vLLM) | 8% – 28% higher | N/A | Even against an optimized static baseline, dynamic adjustment provides significant gains. | 43 |
| Static vs. Continuous (Offline) | Higher at large batch sizes | Higher (by design) | For homogeneous, offline workloads, static batching’s lower overhead can win. | 17 |
Advanced Challenges and Future Directions
While continuous batching and PagedAttention have resolved the most glaring inefficiencies in LLM serving, they have also revealed a new set of more subtle and complex challenges. The frontier of research and development has now moved beyond the initial breakthrough to focus on refining scheduling policies, managing system-level complexities, and achieving deeper synergies with other advanced optimizations. The evolution of the LLM inference stack is increasingly mirroring that of a specialized, high-performance operating system for the GPU.
The Lingering Specter of Head-of-Line Blocking
Continuous batching effectively eliminates HOL blocking at the request level. However, a more subtle form of HOL blocking can still manifest at the phase level due to the prefill-decode dichotomy. When a new request with a very long prompt is admitted into the active_batch, its compute-intensive prefill operation can dominate the GPU’s resources for that iteration. This can momentarily delay the fast, memory-bound decode steps for all other active requests in the batch, creating a perceptible “stutter” or increase in TBT for ongoing generations.7
This phase-level blocking is the primary motivation behind the next generation of advanced schedulers. Techniques like TensorRT-LLM’s Chunked Prefill and DeepSpeed-FastGen’s Dynamic SplitFuse are explicitly designed to mitigate this problem by breaking up large prefill computations into smaller pieces, allowing for finer-grained interleaving of prefill and decode work.
The Next Frontier in Scheduling: Beyond FCFS
Most current continuous batching implementations use a simple First-Come-First-Served (FCFS) policy for admitting new requests from the waiting queue. While fair, FCFS is known to be suboptimal for minimizing average job completion time. This has led to research into more intelligent, workload-aware scheduling policies drawn from classical operating systems theory.
- Predictive Scheduling (Shortest Job First): A more advanced approach involves using a lightweight model to predict the expected output generation length of each incoming request. The scheduler can then prioritize admitting requests predicted to be shorter. This is a direct application of the Shortest Job First (SJF) scheduling policy, which is provably optimal for minimizing average waiting time in traditional OS scheduling.45 By processing many short jobs quickly, the system can improve average latency and increase overall throughput.
- Multi-Bin Batching: This is a practical and effective approximation of SJF. Instead of a single waiting queue, the system maintains multiple queues, or “bins,” for requests of different predicted lengths (e.g., short, medium, long). The scheduler can then form batches by drawing requests from the same bin. This ensures that requests within a given batch have similar execution times, which minimizes the variance and idle time caused by differing generation lengths, further reducing the impact of any residual HOL blocking effects.48 A minimal sketch of this binning approach is shown below.
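In the sketch below, the length predictor is treated as given, and the bin edges, class names, and SJF-style bin preference are hypothetical illustration choices rather than any published system’s design.

```python
import bisect
from collections import defaultdict, deque

# Minimal sketch of multi-bin admission: requests are routed into bins by a
# predicted output length, and a batch is drawn from a single bin so that its
# members have similar expected runtimes. Predictor and edges are placeholders.

BIN_EDGES = [64, 256, 1024]   # token-count boundaries: short / medium / long / very long

def bin_for(predicted_output_len: int) -> int:
    return bisect.bisect_right(BIN_EDGES, predicted_output_len)

class MultiBinQueue:
    def __init__(self):
        self.bins = defaultdict(deque)

    def submit(self, request, predicted_output_len: int) -> None:
        self.bins[bin_for(predicted_output_len)].append(request)

    def next_batch(self, max_size: int):
        # Prefer the bin with the shortest predicted jobs (an SJF-like bias),
        # and only mix requests that fall in the same bin.
        for b in sorted(self.bins):
            if self.bins[b]:
                q = self.bins[b]
                return [q.popleft() for _ in range(min(max_size, len(q)))]
        return []
```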
System-Level Complexities and Overheads
The dynamism and flexibility of continuous batching come at the cost of increased system complexity and potential performance overheads.
- Scheduling Overhead: The logic required to make fine-grained, iteration-level scheduling decisions—including checking for completed requests, managing memory, and admitting new requests—is inherently more complex and computationally expensive than the simple counter used in static batching. This overhead, while typically small, can become a bottleneck in certain scenarios, particularly with highly heterogeneous workloads that have a wide variance in prompt lengths.17
- Memory Preemption and Swapping: To gracefully handle unpredictable traffic spikes that exceed available GPU memory, systems equipped with PagedAttention can implement preemption. In this mechanism, the KV cache blocks of a running, lower-priority request are evicted to CPU host memory (a process known as swapping) to free up GPU memory for a new, higher-priority request. While this is a crucial feature for preventing OOM errors and maintaining service availability, the process of swapping data between GPU and CPU memory introduces significant latency overhead and must be used judiciously.24
Synergy with Other Advanced Optimizations
Continuous batching and PagedAttention do not exist in a vacuum; they form the foundational scheduling and memory management layer upon which other advanced inference optimizations can be built.
- Speculative Decoding: This technique accelerates inference by using a smaller, faster “draft” model to generate several candidate tokens in parallel. The large, primary model then verifies these draft tokens in a single forward pass, potentially accepting multiple tokens at once. The efficient prefix sharing enabled by PagedAttention is critical for this process. All of the draft sequences can share the same base KV cache at zero memory cost, making the verification step highly efficient.18 A greedy draft-and-verify sketch appears after this list.
- Quantization: Techniques that reduce the numerical precision of the model’s weights (e.g., from 16-bit floating point to 8-bit integers) decrease the memory footprint of both the model parameters and the KV cache. This directly allows more requests to be packed into the same amount of GPU memory, increasing the effective batch size of the continuous batcher and leading to higher overall throughput.3
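As referenced above, a simplified, greedy-acceptance sketch of the draft-and-verify loop is shown below. The draft_model and target_model_batch callables are hypothetical stand-ins; real systems sample probabilistically and run the verification as one batched forward pass over the shared, paged KV cache.

```python
# Greedy-acceptance sketch of speculative decoding. `draft_model(ctx)` returns
# the draft's next token; `target_model_batch(prefix, draft)` returns the
# target model's greedy prediction at each drafted position in one pass.
# Both callables are hypothetical placeholders for illustration.

def speculative_step(prefix: list[int], draft_model, target_model_batch,
                     k: int = 4) -> list[int]:
    # 1. The cheap draft model proposes k candidate tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. One target forward pass scores every drafted position at once
    #    (this is where PagedAttention's shared prefix cache pays off).
    target_preds = target_model_batch(prefix, draft)

    # 3. Accept the longest run of draft tokens the target agrees with, then
    #    take the target's own token at the first disagreement.
    accepted = []
    for drafted, predicted in zip(draft, target_preds):
        if predicted != drafted:
            accepted.append(predicted)
            break
        accepted.append(drafted)
    return accepted
```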
The evolution from simple FCFS schedulers to predictive SJF and multi-queue systems, combined with the adoption of virtual memory (PagedAttention) and swapping, confirms that the LLM inference stack is developing into a specialized, user-space Operating System for the GPU. Its core purpose is to manage the complex interplay of compute, memory, and I/O resources to serve diverse, concurrent workloads with specific Quality-of-Service (QoS) requirements. This framing provides a powerful mental model for understanding current system designs and predicting future research directions in the field.
Conclusion and Strategic Recommendations for System Architects
The advent of continuous batching, in concert with enabling memory management technologies like PagedAttention, marks a pivotal moment in the operationalization of Large Language Models. This architectural paradigm has moved beyond academic research to become the undisputed industry standard for high-performance LLM serving, fundamentally resolving the head-of-line blocking and resource underutilization that plagued earlier systems. The result is an order-of-magnitude improvement in throughput and GPU efficiency, making real-time, interactive AI applications economically and technically viable at scale.
Synthesis: The New Standard for LLM Serving
Continuous batching, through its core mechanism of iteration-level scheduling, has successfully transformed the LLM inference problem. By shifting the scheduling quantum from the entire request to the single token, it allows for the dynamic and independent management of each request’s lifecycle. This fine-grained control keeps expensive GPU resources consistently saturated with useful work, maximizing throughput without the artificial synchronization barriers of static and dynamic batching.11 When paired with a paged memory system for the KV cache, it forms a robust and highly efficient foundation for any modern LLM serving stack.
Strategic Recommendations for Implementation
The choice of an optimal batching strategy and inference framework is not absolute but is critically dependent on the specific characteristics of the target workload. System architects must perform a careful analysis of their application’s requirements to make an informed decision.
- For Online, Interactive Services (e.g., Chatbots, AI Assistants, Copilots): For any application where user-perceived latency is a critical factor, continuous batching is mandatory. The primary goal is to achieve a delicate balance between high system throughput (to serve many users) and low latency (to ensure a responsive user experience). The premier open-source frameworks for this use case are vLLM, NVIDIA TensorRT-LLM, and Hugging Face Text Generation Inference (TGI), each offering a mature and highly-optimized implementation of the continuous batching paradigm.16
- For Offline, Bulk Processing (e.g., Document Summarization, Batch Data Analysis): In scenarios where latency is irrelevant and the primary goal is to maximize raw throughput, static batching should be evaluated. If the workload consists of requests with relatively homogeneous input and output lengths, the HOL blocking penalty is minimized. In this case, the lower scheduling overhead of a simple static batching implementation may yield higher peak throughput than a more complex continuous batching engine. A direct benchmark of both strategies on the target workload and hardware is essential to validate the optimal approach.16
- For Complex, Mixed, or Long-Prompt Workloads: For environments that must serve a mix of interactive and batch requests, or for applications that frequently process extremely long input contexts (e.g., RAG with many documents), a continuous batching framework offers the greatest flexibility and robust performance. For workloads specifically dominated by very long prompts, architects should consider frameworks with advanced prefill management strategies, such as TensorRT-LLM with Chunked Prefill or DeepSpeed-FastGen with Dynamic SplitFuse, as these are designed to mitigate the phase-level HOL blocking caused by large prefill computations.
Future Outlook
The primary focus of LLM inference optimization has decisively shifted. The question is no longer if one should use continuous batching, but rather how to refine and enhance it. The next wave of innovation will concentrate on developing more intelligent, workload-aware scheduling policies that move beyond simple FCFS to incorporate principles like shortest-job-first and quality-of-service guarantees. We will also see deeper hardware-software co-design to further optimize memory access patterns and computation. Ultimately, the future of LLM serving lies in the seamless integration of continuous batching as a foundational layer with other advanced optimizations, such as speculative decoding and aggressive quantization, to continue pushing the boundaries of performance, efficiency, and cost-effectiveness in the era of generative AI.
