Section 1: An Introduction to the LLM Serving Challenge
The deployment of Large Language Models (LLMs) in production has exposed a fundamental conflict between service providers and end-users. This tension is rooted in two opposing goals: maximizing throughput and minimizing latency.1
1.1 The Central Conflict: Throughput vs. Latency
For cloud vendors and AI service providers, the primary objective is high throughput.2 This is measured in metrics like tokens per second or requests per second, and it is the key to maximizing the utilization of expensive GPU hardware.1 High throughput lowers the cost per token, enabling a scalable and profitable business.2
For the end-user of an interactive application, the primary concern is low latency.2 Whether in a chatbot or a code assistant, the user demands a high Quality-of-Service (QoS), which is defined by a feeling of responsiveness and “real-time” interaction.1
This economic and engineering trade-off is the central challenge of LLM serving.3 Every architectural decision, from batching algorithms to model sharding, is an attempt to navigate this core conflict.4
1.2 The Two-Phase Problem: The Prefill vs. Decode Dichotomy
The technical root of the throughput-latency conflict lies in the unique, two-phase nature of LLM inference. Every user request forces the system to execute two distinct workloads with diametrically opposed performance characteristics.5
Phase 1: Prefill (Compute-Bound)
The prefill stage involves processing all tokens in the user’s input prompt in parallel.6 In this phase, the model generates the Key-Value (KV) cache, a data structure that stores the attention state of the prompt.9 Because this involves large, parallel matrix multiplications across all input tokens, the prefill stage is compute-bound.5 It can effectively saturate the GPU’s computational units, and its duration is proportional to the length of the input prompt.8
Phase 2: Decode (Memory-Bound)
The decode stage is the auto-regressive generation of the response, one token at a time.7 Each newly generated token depends on all the tokens that came before it.6 This process is memory-bound.1
The bottleneck is not the mathematical computation, which is small for a single token.8 Instead, the bottleneck is the memory bandwidth.1 For every single token generated, the GPU must load the entire multi-billion-parameter model and the entire (and growing) KV cache from high-bandwidth memory (HBM). This operation starves the GPU’s powerful compute units, leaving them severely underutilized.2
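To make this concrete, here is a back-of-the-envelope sketch of the floor that weight traffic alone places on decode speed. The numbers are illustrative assumptions, not measurements: a 70B-parameter model in FP16 and roughly H100-class HBM bandwidth.

```python
# Back-of-the-envelope: per-token decode floor imposed by weight traffic alone.
# Illustrative assumptions: 70B parameters in FP16, ~H100-class HBM bandwidth.

MODEL_PARAMS = 70e9        # assumed parameter count
BYTES_PER_PARAM = 2        # FP16 weights
HBM_BANDWIDTH = 3.35e12    # ~3.35 TB/s (assumed)

weight_bytes = MODEL_PARAMS * BYTES_PER_PARAM      # ~140 GB read per decode step
min_step_time = weight_bytes / HBM_BANDWIDTH       # seconds per generated token

print(f"Weight traffic per token: {weight_bytes / 1e9:.0f} GB")
print(f"Bandwidth floor: {min_step_time * 1e3:.1f} ms/token "
      f"(~{1 / min_step_time:.0f} tokens/s ceiling per sequence)")
```

Even ignoring the KV cache, the weights alone cap single-sequence generation at a few dozen tokens per second, while the arithmetic for one token takes only a small fraction of that time; the compute units spend most of each step waiting on memory.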
1.3 The Interleaving Bottleneck
The fundamental challenge for any LLM serving system is that every single request interleaves these two wildly different compute paradigms.4 The system must continuously schedule and execute a mix of compute-heavy parallel tasks (prefills) and memory-bandwidth-heavy sequential tasks (decodes).
A naive scheduler is forced into a difficult choice 4:
- Prioritize Prefill (Optimize for Throughput): When a new, compute-heavy prefill request arrives, the system stalls all ongoing, low-latency decode requests to process it. This gets new work onto the GPU quickly, maximizing throughput. However, it destroys the user experience for everyone else, causing “generation stalls” and high perceived latency.2
- Prioritize Decode (Optimize for Latency): The system finishes all current decode requests before starting any new prefill. This provides a smooth experience for existing users, but it wastes GPU compute cycles and lowers overall system throughput, as the GPU sits idle waiting to start new work.4
This prefill-decode dichotomy is the “Rosetta Stone” of LLM inference performance. Every optimization discussed in this report—batching, sharding, quantization, and speculative decoding—is a direct attempt to solve the architectural mismatch and scheduling problems created by these two phases. Some advanced architectures even propose using entirely different hardware for each phase, highlighting how distinct they are.14
Section 2: Deconstructing Performance: Latency and the Perception of Speed
To optimize the user experience, “latency” must be broken down into specific, user-facing metrics that quantify the “feel” of an AI application. The two most important metrics are Time to First Token (TTFT) and Time Per Output Token (TPOT).16
2.1 Metrics That Define the User Experience
- Time to First Token (TTFT): The duration from when a user submits a request to when the first token of the response appears on their screen.16
- Time Per Output Token (TPOT): The average time between subsequent output tokens after the first one. This measures the “speed” of text generation.16
- End-to-End Latency: The total time from request submission to the final token of the response.17 This can be calculated using the formula: $Latency = TTFT + (TPOT \times (Total\_Output\_Tokens - 1))$.7
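As a quick sanity check on that decomposition, a tiny helper with made-up numbers:

```python
def end_to_end_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Latency = TTFT + TPOT * (output_tokens - 1), per the formula above."""
    return ttft_s + tpot_s * (output_tokens - 1)

# Example: 400 ms TTFT, 50 ms/token, 200-token response (illustrative values).
print(end_to_end_latency(0.4, 0.05, 200))  # 0.4 + 0.05 * 199 = 10.35 seconds
```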
2.2 Mapping Architectural Phases to User-Facing Metrics
These user-facing metrics map directly to the two-phase technical problem identified in Section 1:
- TTFT is the “Prefill Cost”: TTFT is dominated by the prefill stage.12 It is the direct, user-felt time cost of processing the entire input prompt (plus any network and queuing delays) and generating the very first token.7
- TPOT is the “Decode Cost”: TPOT is the direct, user-felt cost of the decode stage.19 It is a pure measure of the system’s performance in the memory-bandwidth-bound, auto-regressive generation loop.7
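Because both metrics are observable from the client side of any token-streaming API, they can be measured without instrumenting the server. The sketch below assumes a hypothetical stream_tokens(prompt) generator that yields tokens as they arrive; only the timing logic matters.

```python
import time

def measure_ttft_tpot(stream_tokens, prompt):
    """Measure TTFT and mean TPOT from any iterator that yields tokens as they arrive.

    `stream_tokens` is a hypothetical streaming client; substitute your server's SDK.
    """
    start = time.perf_counter()
    first_token_time = None
    token_times = []

    for _ in stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now        # everything before this point is queueing + prefill
        token_times.append(now)

    ttft = first_token_time - start
    # Average gap between consecutive tokens after the first = TPOT (decode cost).
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot
```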
2.3 How TTFT and TPOT Shape Application “Feel”
Optimizing for TTFT vs. TPOT has radically different impacts on user perception.
TTFT: The “Silent Killer” of Responsiveness
A high TTFT is the initial “pause” or “lag” 19 that makes an application feel “dead” or “broken.” Even if the subsequent text generation (TPOT) is instantaneous, a long initial wait shatters the illusion of interactivity and conversation.16 This metric is therefore the primary driver of perceived responsiveness.
- Use Case (Chatbot): A conversational AI must have a low TTFT to feel responsive. A common target is under 500 milliseconds.16
- Use Case (Code Completion): The requirement is even more extreme. A code assistant must integrate into a developer’s “flow state” and feel “instant,” demanding a TTFT well below 100 milliseconds.16
TPOT: The “Flow State” of Generation
TPOT determines the “smoothness” and “speed” of the streaming response.7 A low, stable TPOT feels fluid and natural. A high or, just as importantly, a variable TPOT makes the text appear in “bursts,” which can be just as disruptive to the user experience as a high TTFT.16
However, TPOT has a “perceptual floor” defined by human reading speed. A user in a chatbot is reading as the text is generated.21 One analysis notes that a TPOT of 100 milliseconds (10 tokens/second) is equivalent to approximately 450 words per minute, which is faster than a typical person can read.7
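The arithmetic behind that figure, using the common rule of thumb of roughly 0.75 words per token (an approximation, not a measured constant):

$$\frac{1\ \text{token}}{0.1\ \text{s}} \times 60\ \tfrac{\text{s}}{\text{min}} \times 0.75\ \tfrac{\text{words}}{\text{token}} \approx 450\ \text{words per minute}$$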
2.4 The Optimization Trap (TTFT vs. TPOT)
This “perceptual floor” for TPOT reveals a critical optimization trap. Optimizing TPOT beyond the point of human perception (e.g., from 80ms to 40ms) provides no discernible user benefit 21 and is a wasted engineering effort. That effort could have been redirected to the far more critical TTFT.
Furthermore, optimizing for overall system throughput (e.g., by using large batches) actively degrades both metrics. A large batch of 16 requests will have a higher TTFT (as 16 prefills must be processed) and a higher TPOT for each user (as the GPU’s memory bandwidth is split 16 ways).7 This makes an architecture tuned purely for maximum throughput a poor fit for interactive applications.
The optimal strategy for interactive apps is therefore bifurcated:
- Dedicate all available resources to minimizing TTFT to the lowest possible value (the “responsiveness” bottleneck).
- Optimize TPOT only to the point of “perceptual smoothness” (e.g., 50-150ms per token).7
| Metric | Technical Driver | System Bottleneck | User Perception | Critical For |
| --- | --- | --- | --- | --- |
| TTFT | Prefill Stage 12 | Compute-bound 8 | “Responsiveness,” “Wait,” “Lag” 19 | Chatbots, Code Assistants 16 |
| TPOT | Decode Stage 19 | Memory-bandwidth-bound [1, 5] | “Speed,” “Flow,” “Smoothness” 7 | Streaming Chat, Long-form Generation 16 |
Section 3: The Throughput Engine: Batching Strategies from Static to Continuous
3.1 Why Batching is Non-Negotiable
As established in Section 1, the decode phase is memory-bound and leaves GPU compute units dramatically underutilized.2 Running a single request (a batch size of 1) is profoundly inefficient and cost-prohibitive.
Batching—processing multiple requests in parallel—is the primary technique for increasing compute utilization. By feeding the GPU enough parallel work, the system can better hide the memory-bound nature of individual decodes, significantly increasing overall system throughput 7 and reducing the operational cost per request.23
3.2 The Evolution of Batching Algorithms
The strategy used to group requests for batching has evolved significantly, with each step attempting to solve the inefficiencies of the last.
- Static Batching: This is the most basic approach. The server waits to collect a full, fixed-size batch (e.g., 16 requests), processes all of them simultaneously, and only returns the results when all requests in the batch are complete.24
- Failure Mode: This method suffers from catastrophic “Head-of-Line Blocking”.9 The entire batch is held hostage by the single longest-running request.26 If 15 requests finish in 2 seconds but one request takes 30 seconds, the 15 completed requests sit idle, holding their GPU resources until the 30-second request is done.27 This results in massive GPU idle time and terrible latency for most users.
- Use Case: This approach is only viable for offline, predictable workloads where latency is irrelevant, such as daily document processing.23
- Dynamic Batching: This is a simple compromise. The server launches a batch when it is either full or after a set time window (e.g., 100ms) has passed, whichever comes first.24
- Failure Mode: While it improves average latency by preventing indefinite waits 24, it still operates at the request level. It still suffers from head-of-line blocking within the batch 26 and does not solve the core problem of variable generation lengths.7
- Continuous Batching (or “In-Flight” Batching): This is the state-of-the-art solution that revolutionized LLM serving. This strategy decouples the batch from individual requests by operating at the iteration level.27
- Mechanism: The server maintains a perpetually running batch of tokens. The moment any request in the batch finishes (i.e., generates its end-of-sequence token), it is immediately evicted, and the scheduler inserts a new, waiting request into the now-open slot.9 (A simplified version of this loop is sketched after this list.)
- Impact: This approach eliminates the head-of-line blocking problem. It keeps the GPU constantly full, maximizing resource utilization 27 and dramatically increasing throughput—in some cases by 10-20x over naive batching.27
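The sketch below is a stripped-down version of that iteration-level loop. The request objects and the model_step function are invented for illustration; real schedulers (for example vLLM's) additionally handle prefill admission, preemption, and memory pressure.

```python
from collections import deque

def continuous_batching_loop(waiting: deque, model_step, max_batch: int):
    """Iteration-level scheduling: admit and evict requests on every decode step."""
    running = []
    while running or waiting:
        # Fill any open slots immediately from the waiting queue.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # One forward pass advances every running request by a single token.
        new_tokens = model_step(running)   # hypothetical: returns one token per request

        finished = []
        for req, tok in zip(running, new_tokens):
            req.tokens.append(tok)
            if tok == req.eos_token or len(req.tokens) >= req.max_tokens:
                finished.append(req)

        # Evict finished requests right away; their slots are reused next iteration.
        running = [r for r in running if r not in finished]
```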
3.3 The Enabler: vLLM and PagedAttention
Continuous batching is a brilliant scheduling algorithm, but it creates a new, severe technical problem: memory fragmentation. As requests with different KV cache sizes 9 are constantly swapped in and out, they leave “holes” of unusable memory in the GPU’s VRAM.
Traditional serving systems (like the original FasterTransformer) allocated memory in a wasteful way:
- They reserved a single, contiguous block of VRAM per request.28
- This block had to be large enough for the maximum possible output length (e.g., 2048 tokens).
- This led to:
- Internal Fragmentation: A request that only generates 100 tokens would waste 95% of its allocated 2048-token block.28
- External Fragmentation: The “holes” left by finished requests were often too small or awkwardly shaped to fit new requests, leading to Out-of-Memory (OOM) errors even when total free memory was high.28
The vLLM project solved this with PagedAttention, an algorithm inspired by virtual memory in operating systems.27
- Mechanism: PagedAttention partitions the KV cache into small, fixed-size “KV blocks” (analogous to memory pages).28 (A toy allocator illustrating this idea is sketched after this list.)
- Impact: These blocks do not need to be stored contiguously in VRAM.
- Solves Fragmentation: A new request’s blocks can be scattered across the VRAM, filling in the “holes” left by evicted requests.
- Near-Zero Waste: Memory is allocated “just-in-time,” one block at a time, as new tokens are generated. There is no large-scale pre-allocation.27
- Enables Sharing: Multiple sequences from the same request (e.g., in beam search) can now share the same underlying KV blocks, further saving memory.28
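A toy allocator illustrating the block-table idea, under the simplifying assumptions that a block holds 16 tokens and that only block IDs are tracked (the real vLLM allocator also manages GPU tensors, copy-on-write sharing, and eviction):

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: the KV cache grows one fixed-size block
    at a time, and a sequence's blocks need not be contiguous in memory."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                  # tokens per KV block
        self.free_blocks = list(range(num_blocks))    # physical block IDs
        self.block_tables = {}                        # seq_id -> list of block IDs

    def append_token(self, seq_id: str, position: int):
        """Allocate a new block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % self.block_size == 0:           # first token of a new block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or queue the request")
            table.append(self.free_blocks.pop())
        return table[-1]                              # physical block holding this token

    def free(self, seq_id: str):
        """Evicting a finished request returns its scattered blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```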
3.4 The Engine for Continuous Batching
Continuous batching is the temporal optimization (the scheduling algorithm), but PagedAttention is the spatial optimization (the memory management system) that makes it truly effective. Benchmarks have shown that vLLM (which uses PagedAttention) can more than double the performance of other continuous batching systems.27
This is because PagedAttention’s efficient memory management 29 directly translates to higher throughput. By eliminating memory waste and fragmentation, it allows the system to fit a dramatically larger effective batch onto the same GPU, compounding the gains of the continuous batching scheduler.
Section 4: The Scale Imperative: Model Sharding and Parallelism Trade-offs
4.1 The Need for Sharding: When Models Don’t Fit
The strategies discussed so far assume the model fits on a single GPU. This is no longer the case. Modern flagship models, such as Llama 3.1 405B 31 or even 70B models (which require ~140 GB in FP16), are far too large for the VRAM of a single GPU like the NVIDIA H100 (80 GB).7
Model Sharding (or model parallelism) is the necessity of splitting a single model’s weights and computational graph across multiple GPUs.2 This is distinct from data parallelism, which replicates the full model on each device and therefore cannot help when a single copy does not fit in one GPU’s memory.9
4.2 Dissecting Parallelism Strategies
There are two primary methods for sharding a model for inference:
- Pipeline Parallelism (PP): This is an “inter-layer” parallelism.35 The model is split vertically, like a factory assembly line.9 For example, GPU 1 handles layers 1-20, GPU 2 handles layers 21-40, and so on.
- Pro: This is conceptually simpler and has a lower communication burden per token.
- Con: It creates “pipeline bubbles”.2 GPU 2 must sit idle waiting for GPU 1 to finish processing its micro-batch and pass the activations forward.32 This “bubble” of idle time reduces hardware utilization.
- Tensor Parallelism (TP): This is an “intra-layer” parallelism.35 Each individual layer (e.g., a large weight matrix) is sliced into shards that are distributed across GPUs.32 (A numerical sketch of this split appears after this list.)
- Pro: All GPUs work simultaneously on their “slice” of the same layer.32 This eliminates the pipeline bubble.
- Con: It introduces high communication overhead.32 After computing their partial results, all GPUs must synchronize and aggregate their work using a collective operation like “All-Reduce”.34 This synchronization must happen for every layer of the model, for every step.
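A numerical toy of this intra-layer split for a single MLP, with NumPy standing in for the per-GPU work (dimensions are arbitrary): each "GPU" holds a column shard of the first weight matrix and a row shard of the second, and their partial outputs must be summed across devices, which is exactly the All-Reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_gpus = 8, 16, 2

x = rng.standard_normal((1, d_model))            # one token's activations
W1 = rng.standard_normal((d_model, d_ff))        # up-projection
W2 = rng.standard_normal((d_ff, d_model))        # down-projection

# Tensor parallelism: split W1 by columns and W2 by rows across the GPUs.
W1_shards = np.split(W1, num_gpus, axis=1)
W2_shards = np.split(W2, num_gpus, axis=0)

partials = []
for g in range(num_gpus):
    h = np.maximum(x @ W1_shards[g], 0)          # each "GPU" computes its slice locally
    partials.append(h @ W2_shards[g])            # partial result of the down-projection

# The All-Reduce: partial outputs must be summed across GPUs after every layer.
y_tp = sum(partials)

y_ref = np.maximum(x @ W1, 0) @ W2               # unsharded reference
assert np.allclose(y_tp, y_ref)
```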
4.3 The Prefill-Phase “All-Reduce” Bottleneck
The choice between PP and TP is not static; its performance is critically dependent on the inference phase (prefill vs. decode).
A crucial finding is that Tensor Parallelism’s high communication overhead becomes a crippling bottleneck during the prefill stage.34
- The prefill stage (Section 1) processes many tokens (the entire prompt) in parallel.11
- The Tensor Parallelism “All-Reduce” operation (Section 4.2) requires communicating data proportional to the number of tokens being processed.37
- Therefore, during prefill, TP must execute a massive All-Reduce operation. This saturates the inter-GPU communication link (e.g., NVLink, or much worse, PCIe).38
- As shown in performance breakdowns, this “communication overhead” explodes as the input prompt length increases, quickly becoming the dominant bottleneck, eclipsing computation itself.38
Conversely, during the decode phase, only a single token is processed. The All-Reduce operation is tiny and extremely fast.
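A rough feel for the asymmetry, using assumed Llama-70B-like dimensions and ignoring algorithm-specific constants: the payload of each per-layer All-Reduce scales with the number of tokens in flight.

```python
# Per-layer All-Reduce payload ~ tokens * hidden_size * bytes_per_element
# (illustrative; the exact volume depends on the TP degree and algorithm).

hidden_size = 8192        # Llama-70B-like (assumed)
bytes_per_elem = 2        # FP16 activations

def all_reduce_bytes(num_tokens: int) -> int:
    return num_tokens * hidden_size * bytes_per_elem

prefill = all_reduce_bytes(4096)   # a 4K-token prompt processed at once
decode = all_reduce_bytes(1)       # one token per request per step

print(f"prefill: {prefill / 1e6:.0f} MB per layer, decode: {decode / 1e3:.0f} KB per layer")
# Prefill moves roughly 4096x more data per layer than a single decode step.
```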
This leads to a deeply non-obvious conclusion 15:
- Pipeline Parallelism (PP) is more efficient for the prefill stage because it avoids the All-Reduce bottleneck.
- Tensor Parallelism (TP) is more efficient for the decode stage because it avoids the pipeline bubble and keeps all GPUs active.
This complex, phase-dependent trade-off means that the most advanced serving systems must employ sophisticated hybrid-parallelism strategies, which are extraordinarily difficult to implement and tune.
| Strategy | Splitting Method | Communication | Prefill-Phase Bottleneck | Decode-Phase Bottleneck |
| --- | --- | --- | --- | --- |
| Pipeline (PP) | Inter-layer (layers 1-20 on GPU 1, 21-40 on GPU 2) [9, 35] | Point-to-point (forward pass) 32 | Low | Pipeline Bubbles (GPU idling) [2, 32] |
| Tensor (TP) | Intra-layer (each layer sliced across GPUs) [35, 36] | All-Reduce (collective sync) 34 | Communication Overhead (link saturation) 34 | Low |
Section 5: The Efficiency Mandate: Model Quantization and its Trade-offs
5.1 Quantization: The Three-Fold Performance Lever
Quantization is the process of reducing the numerical precision of a model’s weights and, in some cases, its activations. This typically means converting 16-bit floating-point (FP16) numbers to 8-bit or 4-bit integers (INT8/INT4).40
This technique is often misunderstood as merely “making models smaller”.42 In reality, it is a primary performance optimization that provides three distinct and powerful benefits for inference:
- Reduces Memory Capacity Needs: A 70B parameter model, which is ~140 GB in FP16, becomes ~35 GB in 4-bit precision.42 This smaller footprint allows the model to fit on a single GPU (e.g., an 80 GB H100) 7, potentially eliminating the need for model sharding (Section 4) and all its associated communication overheads.
- Reduces Memory Bandwidth Bottleneck: This is the most critical benefit for the decode phase. The decode bottleneck is memory bandwidth.5 By quantizing weights from 16-bit to 4-bit, the system moves 4x less data from VRAM to the processor for every single token generated. This directly accelerates the decode phase and lowers TPOT.43
- Increases Compute Speed: Specialized hardware, such as NVIDIA’s Tensor Cores, can perform mathematical operations on lower-precision data types (like INT8 or the newer FP8) significantly faster than on FP16.44
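Extending the back-of-the-envelope sketch from Section 1 (same assumed 70B model and H100-class bandwidth), the decode-speed ceiling implied by weight traffic scales directly with precision:

```python
# Decode-speed ceiling from weight traffic alone, at different precisions.
# Same illustrative 70B / ~3.35 TB/s assumptions as the Section 1 sketch.

MODEL_PARAMS = 70e9
HBM_BANDWIDTH = 3.35e12   # bytes/s (assumed)

for name, bytes_per_param in [("FP16", 2.0), ("INT8/FP8", 1.0), ("INT4", 0.5)]:
    weight_bytes = MODEL_PARAMS * bytes_per_param
    floor_s = weight_bytes / HBM_BANDWIDTH
    print(f"{name:8s} ~{weight_bytes / 1e9:5.0f} GB of weights, "
          f"ceiling ~{1 / floor_s:4.0f} tokens/s per sequence")
```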
5.2 A Taxonomy of Modern Quantization Techniques
- GGUF: A file format from the llama.cpp community (the successor to GGML), highly optimized for running models efficiently on consumer hardware, including CPUs and Apple Silicon (Macs).40
- GPTQ (Generative Pre-trained Transformer Quantization): A popular Post-Training Quantization (PTQ) method that is computationally expensive to apply but produces highly accurate 4-bit models.40
- AWQ (Activation-Aware Weight Quantization): A more advanced PTQ method. It observes that most accuracy loss comes from a small fraction of “salient” weight channels, identified by the magnitude of the activations flowing through them. AWQ protects those channels (via per-channel scaling) so that the remaining weights can be quantized aggressively with minimal accuracy loss.40
- Hardware-Native (FP8): An 8-bit floating-point format introduced with NVIDIA’s Hopper (H100) and Blackwell (B200) GPUs.41 FP8 offers the computational speed of INT8 while retaining better accuracy thanks to its floating-point dynamic range (it can represent both very small and very large values). It requires direct hardware support via technologies like the “Transformer Engine”.44
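As one hedged, concrete illustration of how such methods surface to practitioners: engines like vLLM expose pre-quantized checkpoints through a constructor argument. The model name below is a placeholder, and the accepted quantization values vary by release.

```python
# Sketch: serving an AWQ-quantized checkpoint with vLLM (API details vary by version).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-70B-AWQ",   # placeholder: any AWQ-quantized checkpoint
    quantization="awq",                 # recent versions also accept e.g. "gptq" or "fp8"
)
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Summarize the throughput-latency trade-off."], params)
print(outputs[0].outputs[0].text)
```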
5.3 Debunking the Accuracy “Myth”
A persistent fear has been that quantization achieves its performance gains by sacrificing model accuracy.42 While this was true of older methods, for modern, large-scale models, this trade-off has been largely eliminated.
An exhaustive study that ran over half a million evaluations on quantized models found 47:
- Negligible Degradation: For large models (70B, 405B), 8-bit and 4-bit quantization show “negligible performance degradation”.47
- Competitive Accuracy: Models “show very competitive accuracy recovery” across a wide range of academic and coding benchmarks.47
- No Discernible Difference: On average, the quantized models showed “no discernible differences” from their full-precision counterparts in terms of semantic quality and reliability.47
For most production use cases, modern quantization (like AWQ or 4-bit GPTQ) is not a difficult trade-off. It is a nearly “free” and essential performance gain.47
5.4 The Quantization-Batching-Sharding Synergy
Quantization’s true power is not just its direct benefits (lower TPOT), but its indirect role as an enabler for other optimizations, creating a powerful synergistic effect.48
Consider this chain of events for a 70B (140 GB FP16) model:
- Baseline: The 140 GB model requires at least 2-way sharding (e.g., 2x H100 GPUs).7 This immediately introduces the sharding bottlenecks from Section 4 (e.g., the prefill All-Reduce).38
- Apply 4-bit Quantization: The model is now only 35 GB.42
- Synergy 1 (Eliminate Sharding): The 35 GB model now fits comfortably on a single 80 GB H100 GPU.48 This completely eliminates the need for sharding, and the entire class of communication bottlenecks disappears.
- Synergy 2 (Supercharge Batching): The model only occupies 35 GB of the 80 GB VRAM. This leaves a massive 45 GB of VRAM purely for storing the KV cache.
- Synergy 3 (Compounded Gains): This enormous KV cache budget allows the PagedAttention (Section 3) memory manager to support a dramatically larger continuous batch.2
In this scenario, quantization did not just make the model 4x smaller. It unlocked the ability to avoid sharding (eliminating a bottleneck) and supercharge batching (multiplying throughput), all while also directly speeding up the decode (TPOT) by reducing the memory bandwidth bottleneck.43 This multi-level interaction is why quantization is a foundational component of all modern LLM serving.
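Putting rough numbers on Synergy 2, assuming Llama-2-70B-like attention dimensions (80 layers, 8 grouped-query KV heads of head dimension 128, FP16 KV cache) and ignoring activation and framework overhead:

```python
# How much KV cache fits in the VRAM freed by 4-bit weight quantization?
# Llama-2-70B-like dimensions assumed; all numbers illustrative.

layers, kv_heads, head_dim = 80, 8, 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V, FP16
# = 2 * 80 * 8 * 128 * 2 bytes ~= 0.33 MB per token

vram_gb = 80          # single H100
weights_gb = 35       # 70B model at ~4 bits per parameter
kv_budget = (vram_gb - weights_gb) * 1e9

print(f"KV per token: {kv_bytes_per_token / 1e6:.2f} MB")
print(f"KV budget:    {kv_budget / 1e9:.0f} GB "
      f"-> ~{kv_budget / kv_bytes_per_token:,.0f} tokens across the whole batch")
```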
Section 6: Synthesizing the Stack: Analyzing Modern Serving Engines
The concepts from the previous sections—continuous batching, PagedAttention, sharding, and quantization—are bundled into production-ready serving frameworks. Analyzing the two most prominent high-performance frameworks reveals a key strategic choice for organizations.
6.1 Case Study 1: Hugging Face TGI (Text Generation Inference)
- Architecture: TGI is a robust, production-focused serving solution, built with a combination of Rust, Python, and gRPC for high performance and safety.50 It is designed for broad compatibility and ease of deployment.52
- Key Features: TGI serves as an integration layer for the “best-of-breed” open-source optimizations. Its stack includes:
- Continuous Batching: To maximize throughput.51
- PagedAttention: Integrated for efficient memory management.50
- Tensor Parallelism: For sharding models across multiple GPUs.53
- Quantization: Supports popular methods like bitsandbytes and GPTQ.51
- Production Readiness: Includes built-in distributed tracing (OpenTelemetry), Prometheus metrics, and Server-Sent Events (SSE) for token streaming.51
- Significance: TGI represents the standardized, flexible, open-source solution for deploying a wide variety of LLMs (e.g., Llama, Falcon, StarCoder).51
6.2 Case Study 2: NVIDIA TensorRT-LLM
- Architecture: TensorRT-LLM is less a server than a compiler and runtime library.54 It ingests a model (like Llama) and compiles it into a highly optimized “engine” file, custom-built for a specific NVIDIA GPU architecture.
- Key Features: This is a vertically integrated, hardware-software co-designed solution.56
- In-Flight Batching: NVIDIA’s term for continuous batching.44
- PagedAttention: Implemented for memory management.55
- Deep Hardware Optimization: This is its key differentiator. The compiler automatically rewrites the model to use NVIDIA’s proprietary Hopper Transformer Engine and native FP8 quantization.44 This is not a generic optimization; it is a hardware-specific acceleration that unlocks the full potential of H100/B200 GPUs.44
- Optimized Kernels: It leverages fused kernels and FlashAttention to minimize memory operations.44
- Significance: TensorRT-LLM represents the absolute performance ceiling for NVIDIA hardware.58 It achieves this by trading flexibility (it is NVIDIA-only) for raw, record-breaking performance.60
6.3 The Ecosystem vs. Performance Trade-off
The choice between TGI and TensorRT-LLM is a classic strategic decision: open-source flexibility versus vertically-integrated performance.
- TGI is the “Linux” or “Kubernetes” of LLM serving. It is built on open standards, supports a wide range of models 51 and hardware (including AMD GPUs) 50, and offers maximum flexibility and transparency.
- TensorRT-LLM is the “macOS.” It provides a seamless and unparalleled performance experience 54 if and only if you are fully committed to the NVIDIA hardware ecosystem.55 The “automatic FP8 optimization” 44 is a hardware-level feature that open-source frameworks cannot fully replicate.
A technical leader must decide: Is it worth locking into the NVIDIA ecosystem to gain the absolute best performance (TensorRT-LLM)? Or is it more strategic to prioritize flexibility, portability, and open-source transparency (TGI), even if it means leaving some performance on the table?
| Framework | Key Features | Quantization Support | Hardware Specialization | Best-Fit Environment |
| --- | --- | --- | --- | --- |
| TGI | Continuous Batching, PagedAttention, Tensor Parallelism 53 | bitsandbytes, GPTQ 53 | Broad (NVIDIA, AMD, etc.) 50 | Production open-source, flexible/hybrid-cloud deployments. |
| TensorRT-LLM | In-Flight Batching, PagedAttention, Optimized Kernels 55 | Native FP8/FP4 44 | NVIDIA Hardware-Only 55 | Bleeding-edge performance on NVIDIA H100/B200+ hardware. |
| vLLM | PagedAttention 28, Continuous Batching [30] | AWQ, GPTQ [33] | NVIDIA (primary), AMD ROCm support | SOTA open-source engine; its PagedAttention design has been adopted by other servers, including TGI. |
Section 7: The Next Frontier: Advanced Architectures and Future Bottlenecks
The optimizations discussed so far—batching, sharding, and quantization—mitigate the core prefill/decode bottlenecks. The next frontier of research aims to break them entirely.
7.1 Breaking the Auto-Regressive Chain: Speculative Decoding
- The Problem: The decode phase is fundamentally sequential. It generates one token at a time. This auto-regressive loop is the final, stubborn bottleneck that batching and quantization can only speed up but never eliminate.61
- The Solution: Speculative Decoding, also known as the “draft-then-verify” paradigm.62
- Mechanism (a minimal greedy-decoding sketch follows this list):
- Draft: A small, fast “draft model” 63 rapidly generates a “draft” of K future tokens (e.g., 5-10 tokens) in sequence.
- Verify: The large, slow “target model” (the actual model) then takes all K draft tokens and verifies them all at once, in a single parallel pass.61
- Impact: This technique cleverly converts the performance problem. Instead of K sequential, memory-bound decode steps, the system performs one large, compute-bound verification pass.66 GPUs are excellent at compute-bound parallel work. This typically achieves a 2-4x speedup in the decode phase 61 while producing a mathematically identical output distribution to the original model. It is not an approximation; the acceleration is lossless.63
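A minimal sketch of one draft-then-verify step under greedy decoding, with toy callables standing in for the two models. Production systems verify sampled tokens with a rejection-sampling test, which is what preserves the target model's output distribution; the greedy case shown here is the simplest variant that is exactly equivalent to decoding with the target model alone.

```python
def speculative_decode_step(target_predict_fn, draft_next_token, tokens, k=5):
    """One draft-then-verify step under greedy decoding (toy interfaces).

    draft_next_token(tokens) -> next token id from the small draft model.
    target_predict_fn(tokens) -> the target model's greedy (argmax) prediction
    after every position, computed in ONE parallel (compute-bound) pass:
    preds[i] is the target's next-token choice given tokens[: i + 1].
    """
    # 1) Draft: the small model proposes k tokens sequentially (cheap but serial).
    draft = []
    for _ in range(k):
        draft.append(draft_next_token(tokens + draft))

    # 2) Verify: one target-model pass scores the prompt plus all k draft tokens.
    preds = target_predict_fn(tokens + draft)

    # 3) Accept the longest agreeing prefix, then take the target's own token at the
    #    first disagreement (or a bonus token if the whole draft was accepted).
    accepted = []
    for i, tok in enumerate(draft):
        target_choice = preds[len(tokens) + i - 1]
        if tok == target_choice:
            accepted.append(tok)
        else:
            accepted.append(target_choice)
            break
    else:
        accepted.append(preds[len(tokens) + k - 1])

    return tokens + accepted   # between 1 and k+1 tokens produced per target pass
```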
7.2 The Rise of Sparse Models: Serving Mixture of Experts (MoE)
- Architecture: Flagship models such as Mixtral use a Mixture of Experts (MoE) architecture (Llama 3.1 405B 31, by contrast, is a dense model). A “router” (or gating network) dynamically selects a small subset of “experts” (e.g., 2 out of 8) to process each individual token.68 (A toy router is sketched after this list.)
- The Serving Nightmare: While this “sparse” approach is highly efficient for training, it creates a unique and severe serving challenge:
- Massive Memory Footprint: To process any token, all 8 experts must be resident in VRAM, even though only 2 are used.70 The memory-capacity problem thus dwarfs that of a dense model with the same per-token compute.
- Routing Overhead: The gating network is itself a small neural network that must be run for every token, adding a new source of latency.68
- Load Imbalance: The router often develops “preferences” and sends a disproportionate number of tokens to the same “popular” experts, creating new processing hotspots while other multi-billion-parameter experts sit idle in VRAM.71
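A toy top-2 gating computation in NumPy (dimensions are arbitrary): the router is a single linear layer whose softmax scores choose which experts process each token, and the skew of those choices is precisely where the load-imbalance problem originates.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, top_k = 4, 8, 8, 2

x = rng.standard_normal((num_tokens, d_model))          # token activations
W_gate = rng.standard_normal((d_model, num_experts))    # the router: one linear layer

logits = x @ W_gate
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)   # softmax per token

# Each token is routed to its top-2 experts; the remaining experts still
# occupy VRAM but do no work for this token.
top_experts = np.argsort(-probs, axis=-1)[:, :top_k]
weights = np.take_along_axis(probs, top_experts, axis=-1)
weights /= weights.sum(axis=-1, keepdims=True)           # renormalize over the chosen 2

for t in range(num_tokens):
    print(f"token {t}: experts {top_experts[t].tolist()} "
          f"with weights {weights[t].round(2).tolist()}")
```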
7.3 The Final Frontier: Hardware-Software Co-Design
The future of performance lies not just in clever software (like vLLM) running on general-purpose hardware (GPUs), but in specialized hardware (ASICs) co-designed with the software stack to solve inference-specific bottlenecks.56
- Example 1 (Evolution): NVIDIA’s Transformer Engine: As discussed in Section 6, the H100 GPU (hardware) was designed to accelerate FP8 matrix math. TensorRT-LLM (software) was designed to use that hardware feature.44 This is co-design.
- Example 2 (Revolution): Groq’s LPU (Language Processing Unit): This is the radical example.75 The LPU is not a GPU. It is a “streaming processor” or Application-Specific Integrated Circuit (ASIC) 77 designed only for AI inference. Its key architectural feature is the elimination of the external memory (HBM) bottleneck that plagues GPUs.75
- The Implication: The LPU architecture delivers deterministic and extremely low TPOT, sustaining generation rates of several hundred tokens per second per user.76 It fundamentally breaks the memory-bound decode problem.
This development completely flips the script on LLM optimization. If the memory-bound decode phase is solved and TPOT becomes negligible, then all the complex optimizations we have developed for it (advanced batching, PagedAttention, speculative decoding) become far less relevant. The entire performance bottleneck of the system collapses onto one thing: the compute-bound prefill stage.10 This shows that the endgame for performance is hardware-software co-design, moving from optimizing software for GPUs to building new hardware that removes their fundamental bottlenecks.
Section 8: Strategic Recommendations: Architecting for the Use Case
8.1 The Synthesis: No “Best” Architecture
There is no single “best” LLM serving architecture.79 There is only the “optimal” architecture for a specific use case and its associated Service Level Objective (SLO).2 The four pillars—latency, batching, sharding, and quantization—are a set of interconnected dials to be tuned.
8.2 Scenario 1: The Interactive Chatbot (e.g., Customer Service, AI Assistant)
- Priority: Lowest possible TTFT (Time to First Token).16 The user must feel the AI is “listening” and responsive (e.g., < 500ms). A stable, “readable” TPOT (Time Per Output Token) is a secondary, but important, goal.7
- Architecture:
- Batching: Continuous Batching (e.g., via vLLM or TGI) is mandatory to handle bursty, unpredictable traffic 79 while maintaining fairness and high throughput.
- Memory: PagedAttention is essential to enable continuous batching.
- Quantization: Aggressive 4-bit (e.g., AWQ) or FP8 quantization is critical. It directly lowers TPOT (by reducing memory bandwidth) and indirectly lowers TTFT (by freeing VRAM for a larger, more fluid batch).43
- Optimizations: Speculative Decoding is highly valuable here, as it directly attacks the latency of the decode phase.61
8.3 Scenario 2: The Co-Pilot (e.g., Code Generation, In-line Assistant)
- Priority: Ultra-low TTFT (< 100ms).16 This requirement is more stringent than a chatbot. The completion must feel instantaneous to avoid breaking the user’s “flow state.”
- Architecture:
- Batching: Batch size must be kept very small (or even 1) to guarantee these strict TTFT SLOs. This sacrifices throughput and dramatically increases cost per request.7
- Optimizations: Speculative Decoding is mandatory to make generation feel instant.
- Hardware: This is the prime use case for Hardware-Software Co-Design. Specialized hardware like Groq’s LPU 76, which excels at low-latency streaming (TPOT) and has low overhead, is ideal for this “low-latency, low-batch” workload.
8.4 Scenario 3: The Offline Analyst (e.g., Batch Document Processing, RAG Pipeline)
- Priority: Maximum Throughput (e.g., documents per hour, tokens per second).22 Per-request latency is irrelevant. A 30-second or 5-minute wait for a large report is perfectly acceptable.16
- Architecture:
- Batching: Static Batching is the ideal choice.23 The goal is to pack the GPU full with a massive, fixed batch size (e.g., 64, 128, or 256) to maximize compute saturation and amortize the cost.7
- Quantization: Use the most aggressive quantization possible (e.g., 4-bit) to fit the largest possible batch into VRAM.23
- Sharding: Use TP/PP to scale across many GPUs and process even larger global batches.2
- Optimizations: Speculative decoding is not needed and would only add unnecessary overhead.
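For this scenario, a hedged sketch of what “pack the GPU full” looks like with an open-source engine (vLLM shown; the checkpoint name and the load_documents helper are placeholders, and vLLM's internal scheduler will still form the batches):

```python
# Offline, throughput-oriented batch job (vLLM sketch; names are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-70B-AWQ",   # placeholder 4-bit checkpoint (see Section 5)
    quantization="awq",
    gpu_memory_utilization=0.95,        # leave nearly all VRAM for weights + KV cache
)
params = SamplingParams(max_tokens=512, temperature=0.0)

documents = load_documents()            # hypothetical: thousands of prompts from a RAG/ETL pipeline
# Submit everything at once: per-request latency is irrelevant, so the engine is
# free to keep the largest batch it can fit in flight until the job drains.
results = llm.generate(documents, params)
summaries = [r.outputs[0].text for r in results]
```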
8.5 Final Decision-Making Framework
The four pillars of serving architecture are an interconnected system. The optimal configuration is derived by following this decision-making process:
- The Use Case (e.g., Chat vs. Batch) 23 dictates…
- The Primary Metric (e.g., TTFT vs. Throughput), which dictates…
- The Batching Strategy (e.g., Continuous vs. Static).24
- The Model Size 31 dictates the Sharding Strategy (if any).2
- Quantization 43 and Memory Management (PagedAttention) 28 are then used as levers to manage the new bottlenecks (e.g., sharding overhead 34) and maximize the effectiveness of the chosen batching strategy.
- Finally, Advanced Techniques (Speculative Decoding) 61 and Specialized Hardware (LPUs) 75 are applied to break the fundamental trade-offs that the other levers can only manage.
| Use-Case Scenario | Primary Metric | Optimal Batching Strategy | Key Architectural Optimizations |
| --- | --- | --- | --- |
| Interactive Chatbot [22] | Low TTFT, Stable TPOT | Continuous Batching (vLLM) 27 | PagedAttention, Speculative Decoding, 4-bit Quantization [28, 61] |
| Real-time Co-Pilot 16 | Ultra-Low TTFT (<100ms) | Small/Dynamic Batch or Specialized Hardware | Speculative Decoding, Hardware-Software Co-Design (e.g., LPU) [76, 81] |
| Offline Batch Processing 23 | Maximum Throughput | Static Batching 23 | Maximize Quantization, Maximize Batch Size, Pipeline Parallelism 7 |
