Executive Summary
The paradigm of Large Language Model (LLM) deployment has fundamentally shifted from a training-centric challenge to an inference-bound bottleneck. As models scale into the regime of hundreds of billions of parameters—exemplified by architectures like Llama-3-405B and DeepSeek-V3—the constraints of memory bandwidth, interconnect latency, and compute density have necessitated a radical reimagining of the inference stack. This report provides an exhaustive analysis of the four pillars of modern inference optimization: advanced quantization techniques, speculative decoding architectures, Mixture-of-Experts (MoE) routing dynamics, and high-performance serving infrastructure.
The analysis reveals that inference optimization is no longer a post-hoc compression step but a fundamental driver of model architecture. Innovations such as Multi-Head Latent Attention (MLA) and auxiliary-loss-free load balancing are explicitly designed to circumvent the physical limitations of GPU memory hierarchies. Furthermore, the convergence of techniques—such as the use of quantized MoEs as speculative draft models (MoE-SpeQ)—demonstrates a maturation of the field where individual optimizations are co-designed to mask specific hardware bottlenecks like PCIe bandwidth and attention computation latency.
1. The Physics of Inference: Quantization and Precision Scaling
The computational cost of Large Language Models is dominated by the movement of data. In the autoregressive generation phase, known as decoding, the arithmetic intensity (FLOPs per byte) is notoriously low, making the process memory-bandwidth bound. Quantization, the reduction of numerical precision, serves as the primary lever to alleviate this bottleneck by increasing the effective memory bandwidth and compute throughput of hardware accelerators.
1.1 Theoretical Foundations and the Shift to Floating Point
Traditionally, quantization in deep learning relied on integer arithmetic (INT8), leveraging the wide availability of integer processing units. However, the distribution of weights and activations in transformer-based LLMs, particularly at scales exceeding 100 billion parameters, exhibits properties that challenge uniform integer quantization. Tensors often possess “outlier” features—activations with magnitudes significantly larger than the mean—which, when clipped or scaled uniformly, result in catastrophic degradation of model accuracy.1
This limitation has catalyzed the industry-wide shift toward 8-bit Floating Point (FP8) formats, specifically tailored for the non-uniform distributions characteristic of neural networks. Unlike integers, floating-point representations allocate bits to an exponent and a mantissa, creating a non-uniform quantization grid that provides higher precision for values near zero (where most weights cluster) and wider dynamic range for outliers.
The FP8 standard, as implemented in NVIDIA’s Hopper architecture and supported by frameworks such as vLLM, defines two distinct formats 2:
- E4M3 (1 Sign, 4 Exponent, 3 Mantissa): This format sacrifices dynamic range for precision. It is capable of representing values up to $\pm 448$ and NaN. Due to its higher precision, E4M3 is the preferred format for the forward pass of inference, specifically for weights and activations where maintaining the fidelity of the signal is paramount.3
- E5M2 (1 Sign, 5 Exponent, 2 Mantissa): This format mirrors the dynamic range of FP16 but with significantly reduced precision. It is primarily utilized for gradients during training or for tensors with extreme dynamic ranges where preventing overflow is more critical than minute precision.3
The superiority of FP8 over INT8 lies in its robustness. Research across multi-cluster GPU environments has demonstrated that FP8 consistently emerges as the most reliable option across diverse tasks, particularly for models like Llama-3-405B where integer-based methods like SmoothQuant begin to struggle with instruction-following capabilities.1
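To make the scaling concrete, the following is a minimal per-tensor E4M3 round-trip sketch in PyTorch (assuming a build that exposes `torch.float8_e4m3fn`); production stacks compute scales during calibration and run the matmuls in native FP8 kernels rather than round-tripping through FP32.

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp8_quantize_per_tensor(x: torch.Tensor):
    """Per-tensor FP8 (E4M3) quantization: scale into [-448, 448], then cast."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = E4M3_MAX / amax                        # one scale shared by the whole tensor
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) / scale

w = torch.randn(4096, 4096)
w_fp8, s = fp8_quantize_per_tensor(w)
print("mean abs round-trip error:", (fp8_dequantize(w_fp8, s) - w).abs().mean().item())
```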
1.2 Advanced Quantization Algorithms: GPTQ, AWQ, and SmoothQuant
While FP8 represents the future of hardware-accelerated inference, the vast majority of deployed hardware (e.g., NVIDIA Ampere A100s) and specific accuracy requirements necessitate a diverse toolkit of quantization algorithms. These methods generally fall into Post-Training Quantization (PTQ) categories, differing in how they handle the sensitivity of specific weights.
1.2.1 Inverse Hessian Optimization (GPTQ)
GPTQ (Generative Pre-trained Transformer Quantization) represents a mathematically rigorous approach to weight-only quantization. It formulates quantization as an optimization problem, aiming to minimize the reconstruction error of the layer’s output. By utilizing second-order information—specifically the inverse Hessian of the loss function—GPTQ identifies how to adjust the remaining unquantized weights to compensate for the error introduced by quantizing a specific weight.1
- Mechanism: GPTQ quantizes weights column by column, updating the remaining unquantized weights in the block to preserve the layer’s activation output (sketched after this list).
- Trade-off: While GPTQ achieves high compression ratios (e.g., 4-bit weights), empirical analysis shows it can induce significant accuracy drops in smaller models (<7B parameters) where parameter redundancy is lower. However, for 70B+ models, it remains a highly effective baseline.1
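The error-compensation step can be sketched in a few lines. This is a simplified illustration under idealized assumptions (a single per-row scale, a dense inverse Hessian); the real GPTQ algorithm uses per-group scales, a Cholesky factorization, and lazy batched updates.

```python
import torch

def quantize_rtn(col, scale):
    """Round-to-nearest onto a symmetric 4-bit grid, then dequantize."""
    return torch.clamp(torch.round(col / scale), -8, 7) * scale

def gptq_layer_sketch(W, X, damp=0.01):
    """Quantize W (out_features x in_features) one column at a time, spreading each
    column's rounding error over the not-yet-quantized columns via the inverse Hessian."""
    H = X @ X.T                                        # proxy Hessian over input features
    H += damp * torch.diag(H).mean() * torch.eye(H.shape[0])
    Hinv = torch.linalg.inv(H)                         # real GPTQ uses a Cholesky-based form
    W = W.clone()
    Q = torch.zeros_like(W)
    scale = W.abs().amax(dim=1) / 7                    # one scale per output row, for brevity
    for j in range(W.shape[1]):
        Q[:, j] = quantize_rtn(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Q

W, X = torch.randn(256, 64), torch.randn(64, 512)      # X: calibration activations
print((gptq_layer_sketch(W, X) - W).abs().mean())
```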
1.2.2 Activation-Aware Weight Quantization (AWQ)
AWQ challenges the assumption that all weights are equally important or that importance correlates strictly with weight magnitude. Instead, AWQ posits that the importance of a weight is determined by the magnitude of the activation it processes. Weights that multiply large activation values (outliers) are critical for preserving the signal.4
- Mechanism: AWQ identifies salient weights based on activation statistics and protects them. Rather than leaving them in FP16 (which would complicate the kernel), AWQ applies a per-channel scaling factor that effectively increases the dynamic range for these critical channels before quantization (see the sketch after this list).
- Performance: AWQ consistently outperforms GPTQ in instruction-following benchmarks (IFEval) and hallucination detection (TruthfulQA), particularly in scenarios involving weight-only quantization. It is generally robust across varying model architectures and sizes.1
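A minimal sketch of the scaling idea, assuming a fixed exponent `alpha`; the actual AWQ method searches the scaling strength per layer against calibration loss and uses grouped quantization.

```python
import torch

def awq_sketch(W, calib_acts, alpha=0.5):
    """Scale salient input channels (large mean activation magnitude) up before 4-bit
    rounding; the inverse scale is folded back so the layer's function is preserved."""
    act_mag = calib_acts.abs().mean(dim=0)                   # (in_features,)
    s = act_mag.clamp(min=1e-5) ** alpha                     # per-channel salience scale
    W_s = W * s                                              # broadcast over input channels
    scale = W_s.abs().amax(dim=1, keepdim=True) / 7          # symmetric INT4, per output row
    W_q = torch.clamp(torch.round(W_s / scale), -8, 7) * scale
    return W_q / s           # undo s; salient channels were rounded on a finer grid

W, acts = torch.randn(256, 64), torch.randn(1024, 64)
print((awq_sketch(W, acts) - W).abs().mean())
```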
1.2.3 SmoothQuant and Outlier Migration
SmoothQuant addresses the difficulty of quantizing activations in large models. In architectures beyond 6.7B parameters, systematic outliers appear in specific activation channels. SmoothQuant mathematically migrates the difficulty of quantization from the activations to the weights. By applying a smoothing factor—dividing the activation by a scale $s$ and multiplying the weight by $s$—it squashes the activation outliers, making the activation distribution easier to quantize to INT8, while the weights (which are easier to handle) absorb the complexity.1
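A minimal sketch of the migration, assuming the commonly used smoothing strength $\alpha = 0.5$; note that the transformed layer computes exactly the same function before quantization is applied.

```python
import torch

def smoothquant_migrate(W, X, alpha=0.5):
    """Per-channel smoothing: divide activations by s and multiply weights by s, so
    (X / s) @ (W * s).T reproduces X @ W.T while flattening activation outliers."""
    act_max = X.abs().amax(dim=0)                  # per-input-channel activation peak
    w_max = W.abs().amax(dim=0)                    # per-input-channel weight peak
    s = (act_max.clamp(min=1e-5) ** alpha) / (w_max.clamp(min=1e-5) ** (1 - alpha))
    return W * s, X / s

W = torch.randn(256, 64)
X = torch.randn(32, 64) * torch.tensor([10.0] * 4 + [1.0] * 60)   # 4 outlier channels
W_s, X_s = smoothquant_migrate(W, X)
print(torch.allclose(X @ W.T, X_s @ W_s.T, atol=1e-4))            # mapping unchanged
```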
Table 1: Comparative Analysis of Advanced Quantization Techniques
| Technique | Precision Target | Primary Mechanism | Optimal Use Case | Key Limitations |
| --- | --- | --- | --- | --- |
| FP8 (E4M3) | W8A8 (Float) | Non-uniform grid, per-tensor scaling | Hopper (H100) / MI300X inference | Requires newer hardware for acceleration |
| AWQ | W4A16 (Int) | Activation-based salience protection | Small to medium models, edge deployment | Calibration required, weight-only |
| GPTQ | W4A16 (Int) | Inverse Hessian error minimization | High compression storage, older GPUs | Higher degradation in small models |
| SmoothQuant | W8A8 (Int) | Migration of outliers from activations to weights | A100/A10 clusters, moderate scale | Struggles with 400B+ model outliers |
1.3 Mixed-Precision Frameworks and QSPEC
An emerging frontier in quantization is the decoupling of precision requirements between the different phases of token generation. The QSPEC (Quantized Speculative Decoding) paradigm operates on the insight that the “drafting” phase of inference—where potential future tokens are guessed—is tolerant of lower precision errors than the “verification” phase.5
In a QSPEC implementation, the system maintains a single model but utilizes it in two modes:
- Drafting Mode: Executes aggressively quantized kernels (e.g., W4A4) to maximize throughput and minimize memory reads. This allows for the rapid generation of candidate tokens.
- Verification Mode: Executes higher precision kernels (e.g., W4A16 or FP16) to validate the candidates.
Crucially, QSPEC shares the memory of the weights and KV cache between these modes, avoiding the VRAM overhead of loading two separate models. Empirical results demonstrate that this approach can recover the accuracy loss associated with W4A4 quantization (often >50% on reasoning tasks) while retaining the speed benefits, achieving up to 1.64x throughput improvements.5
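The sharing idea can be illustrated with a toy module (this is not QSPEC's implementation): a single weight tensor backs both paths, with the draft path simulating a low-precision kernel via on-the-fly round-to-nearest and the verify path running at full precision.

```python
import torch

class DualPrecisionLinear(torch.nn.Module):
    """One shared weight tensor, two execution modes (no duplicated model in memory)."""
    def __init__(self, out_features, in_features):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                    requires_grad=False)
        self.scale = self.w.abs().amax(dim=1, keepdim=True) / 7

    def forward(self, x, mode="verify"):
        if mode == "draft":   # stand-in for a fused low-precision (e.g. W4A4) kernel
            w_q = torch.clamp(torch.round(self.w / self.scale), -8, 7) * self.scale
            return x @ w_q.T
        return x @ self.w.T   # higher-precision verification path

layer = DualPrecisionLinear(128, 64)
x = torch.randn(4, 64)
print((layer(x, "draft") - layer(x, "verify")).abs().max())   # drafting error to be caught by verification
```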
1.4 Case Study: DeepSeek-V3 and FP8 Training Integration
The DeepSeek-V3 model serves as a paradigmatic example of integrating FP8 deeply into the model lifecycle, moving beyond post-training quantization to an FP8-native training framework. Training a 671-billion-parameter model in FP8 required solving significant challenges related to numerical stability and gradient precision.6
Technical Innovations in FP8 Training:
- Fine-Grained Block Scaling: Standard per-tensor scaling is insufficient at this scale because of the dynamic range of values. DeepSeek implements block-wise scaling, in which 128×128 sub-blocks of weight matrices and 1×128 blocks of activations are scaled independently. This localizes the impact of outliers, preventing them from destroying the quantization resolution of the entire tensor (a sketch follows this list).8
- Decoupled Accumulation: While the matrix multiplication (GEMM) occurs in FP8 via Tensor Cores, the accumulation of these products is prone to “swamping”—where small updates are lost when added to large accumulators. DeepSeek utilizes a hybrid strategy where accumulation is promoted to FP32 in the CUDA cores (or specific registers) to preserve the fidelity of gradients and updates.8
- DualPipe Scheduling: To manage the communication overhead of such a massive model, DeepSeek employs a “DualPipe” algorithm that overlaps the forward and backward pass chunks with bidirectional pipeline communication. This hides the latency of moving the FP8 weights and gradients between nodes.6
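A sketch of the tile-wise scaling referenced in the first bullet above, assuming 128×128 weight tiles; in the actual system each scaled tile is cast to E4M3 and multiplied by FP8 Tensor Core GEMMs with higher-precision accumulation.

```python
import torch

E4M3_MAX = 448.0

def blockwise_fp8_scales(W, block=128):
    """Compute one FP8 scale per (block x block) tile so that a single outlier only
    degrades the quantization resolution of its own tile, not the whole tensor."""
    rows, cols = W.shape
    scales = torch.empty(rows // block, cols // block)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            amax = W[i:i + block, j:j + block].abs().max().clamp(min=1e-12)
            scales[i // block, j // block] = E4M3_MAX / amax
    return scales      # each tile is cast to E4M3 as (tile * scale) before the GEMM

W = torch.randn(256, 256)
W[0, 0] = 500.0        # an outlier only shrinks the scale of tile (0, 0)
print(blockwise_fp8_scales(W))
```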
2. Speculative Decoding: Breaking the Serial Dependency
Autoregressive decoding is inherently serial; generating the $N$-th token strictly requires the completion of the $(N-1)$-th token. This serial dependency results in memory-bound execution, as the entire model (often hundreds of gigabytes) must be moved from High-Bandwidth Memory (HBM) to the compute units for every single token. Speculative Decoding (SD) breaks this dependency by decoupling generation (drafting) from verification.
2.1 The Arithmetic of Speculation
The fundamental premise of SD is that a cheaper “Draft Model” can predict $K$ tokens in the time it takes the “Target Model” to generate one. The Target Model then verifies these $K$ tokens in a single parallel forward pass. The theoretical speedup is governed by the Acceptance Rate ($\alpha$)—the probability that the draft matches the target.11
If the verification step is parallelized efficiently, the cost of verifying $K$ tokens is roughly equivalent to generating a single token. Thus, if $\alpha$ is high, the system produces multiple tokens per target-model pass, amortizing the memory access cost.
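Under the simplifying assumption that each drafted token is accepted independently with probability $\alpha$, the expected yield and speedup can be computed directly; `draft_cost_ratio` below denotes the cost of one draft step relative to one target step.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model pass when k tokens are drafted and
    each is accepted independently with probability alpha (includes the bonus token)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Wall-clock speedup vs. plain autoregressive decoding: k draft steps plus one
    parallel verification pass replace expected_accepted() sequential target steps."""
    return expected_accepted(alpha, k) / (k * draft_cost_ratio + 1)

for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(expected_speedup(alpha, k=4, draft_cost_ratio=0.05), 2))
```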
2.2 Evolution of Drafting Architectures
The effectiveness of SD depends entirely on the quality and cost of the draft mechanism. Several architectures have evolved to optimize this trade-off.
2.2.1 Independent Draft Models
The classical approach pairs a small model (e.g., Llama-7B) with a large target (e.g., Llama-70B).
- Challenge: This introduces a “Distribution Mismatch.” If the small model is not aligned with the large model (e.g., different training data or chat templates), $\alpha$ drops precipitously, leading to a net slowdown.11 Furthermore, hosting two separate models consumes valuable VRAM and introduces context switching overheads.
2.2.2 Integrated Heads (Medusa)
To eliminate the need for a separate model, the Medusa architecture augments the target model with multiple “Medusa Heads”—extra Multi-Layer Perceptron (MLP) layers on top of the final hidden state.
- Mechanism: The model’s original LM head still predicts token $t+1$; Medusa head $k$ predicts token $t+1+k$ (i.e., $t+2$, $t+3$, and so on), all from the hidden state at step $t$, as sketched after this list.
- Implication: This allows the target model to “self-speculate” without loading extra weights. It generates a tree of candidates in a single pass.
- Limitation: The draft quality decays rapidly for deeper tokens ($t+3, t+4$) because the single hidden state at $t$ contains diminishing information about the distant future.13
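A simplified sketch of the head structure (the actual Medusa heads use a residual block on top of the shared LM head, not a bare linear layer): every head reads the same step-$t$ hidden state and proposes a token at a different future offset.

```python
import torch

class MedusaHeadsSketch(torch.nn.Module):
    """K extra heads on the final hidden state: head k proposes the token at t+1+k
    (the base LM head already covers t+1)."""
    def __init__(self, hidden_size, vocab_size, num_heads=3):
        super().__init__()
        self.heads = torch.nn.ModuleList(
            torch.nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)
        )

    def forward(self, h_t):                        # h_t: (batch, hidden)
        # every head sees only the step-t hidden state, which is why accuracy
        # decays for the more distant offsets
        return [head(h_t).argmax(dim=-1) for head in self.heads]

draft = MedusaHeadsSketch(hidden_size=512, vocab_size=32000)
print(draft(torch.randn(1, 512)))                  # candidates for t+2, t+3, t+4
```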
2.2.3 Feature-Level Extrapolation (EAGLE)
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) shifts speculation from the discrete token space to the continuous feature space.
- Insight: Feature vectors (hidden states) change more smoothly and predictably than discrete token ids.
- Mechanism: EAGLE uses a lightweight autoregression head to predict the next time step’s feature vector (the top-layer hidden state) and then decodes these predicted features into tokens. This “feature-level draft” is computationally cheap but captures the semantic trajectory of the sequence better than direct token prediction.15
- EAGLE-2 (Dynamic Trees): While EAGLE-1 used static draft trees, EAGLE-2 introduces dynamic tree construction. It assesses the confidence of the draft predictions at runtime to dynamically expand or prune branches of the draft tree, allocating compute only to high-probability paths.16
2.3 Verification Algorithms: From Rejection to Trees
Once drafts are generated, the target model must verify them. The algorithm used for verification determines the upper bound of performance.
2.3.1 Rejection Sampling
The standard verification method has the target model compute its probability distribution at every drafted position. A drafted token $x$ is accepted when $r < \min\left(1, \frac{P_{target}(x)}{P_{draft}(x)}\right)$, where $r$ is a uniform random draw in $[0, 1)$; on rejection, a replacement token is sampled from the residual distribution $\max(P_{target} - P_{draft}, 0)$, normalized (sketched below).
- Bottleneck: This is fundamentally sequential regarding acceptance. If the second drafted token is rejected, all subsequent tokens (3, 4, 5…) are discarded, even if they were correct in context. This waste limits the effective acceptance length.18
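A minimal sketch of this acceptance rule, assuming `p_target` and `p_draft` are `(K, vocab)` probability tensors aligned with the `K` drafted token positions; a production implementation also samples one bonus token from the target distribution when every draft is accepted.

```python
import torch

def verify_draft(p_target, p_draft, draft_tokens):
    """Accept drafted tokens left to right; on the first rejection, resample from the
    residual distribution max(p_target - p_draft, 0) and stop."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if torch.rand(()) < torch.clamp(p_target[i, tok] / p_draft[i, tok], max=1.0):
            accepted.append(tok)
        else:
            residual = torch.clamp(p_target[i] - p_draft[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    return accepted

K, V = 4, 10
p_t = torch.softmax(torch.randn(K, V), dim=-1)
p_d = torch.softmax(torch.randn(K, V), dim=-1)
print(verify_draft(p_t, p_d, draft_tokens=[1, 4, 2, 7]))
```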
2.3.2 Tree Attention and OPT-Tree
To overcome the sequential rejection bottleneck, advanced methods utilize Tree Attention. Instead of verifying a single chain of tokens, the target model verifies a branching tree of hypotheses in parallel.
- Tree Attention Mask: The target model is fed a flattened representation of the tree. A specialized attention mask ensures that when the model computes the attention for a node, it only attends to that node’s specific ancestors in the tree, maintaining causal consistency across multiple diverging paths simultaneously (a mask-construction sketch follows this list).19
- OPT-Tree: This algorithm dynamically searches for the optimal tree structure (topology) to draft. By analyzing the acceptance probabilities of previous steps, OPT-Tree constructs a draft tree that maximizes the mathematical expectation of the accepted sequence length. For example, if the model is uncertain about the next token (high entropy), it might generate a wide, shallow tree. If it is confident, it generates a deep, narrow chain.19
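A small sketch of the mask construction referenced above, assuming the flattened tree is described by a parent-index array; each node may attend only to itself and its ancestors.

```python
import torch

def tree_attention_mask(parent):
    """Build a boolean attention mask for a flattened draft tree.
    parent[i] is the index of node i's parent (-1 for the root). Node i may attend
    only to itself and its ancestors, preserving causality along every branch."""
    n = len(parent)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parent[j]
    return mask

# A root with two children; the second child has one child of its own:
#        0
#       / \
#      1   2
#          |
#          3
print(tree_attention_mask([-1, 0, 0, 2]).int())
```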
2.3.3 Traversal Verification
Traditional verification processes the tree top-down. Traversal Verification algorithms instead verify candidate sequences from the leaves toward the root. This allows the system to identify valid sub-sequences that pass through a low-probability intermediate node, effectively “rescuing” correct tokens that simple top-down rejection sampling would discard.21
2.4 Co-Designing MoE and Speculation: MoE-SpeQ
Applying speculative decoding to Mixture-of-Experts (MoE) models introduces a specific “I/O Wall.” In a dense model, weights are static. In an offloaded MoE (where experts reside in CPU RAM), generating $K$ speculative tokens might require fetching $K \times N$ different experts. If these fetches happen sequentially during the verification phase, the PCIe latency destroys any performance gain.22
MoE-SpeQ solves this by using the draft model as a prefetcher.
- Quantized Draft: It uses a quantized (INT4) version of the MoE as the draft model. This model is small enough to run quickly.
- Speculative Prefetching: The draft model predicts not just the tokens, but the expert indices that will be required to verify them.
- Latency Hiding: While the GPU is busy computing the draft, the system proactively fetches the predicted experts from CPU memory to GPU VRAM. By the time the target model begins verification, the necessary experts are already resident in high-speed memory (see the sketch after this list).
- Result: This co-design transforms the I/O latency from a blocking operation into a hidden background task, achieving up to 2.34x speedups on memory-constrained devices.22
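A generic sketch of the overlap pattern, not MoE-SpeQ's actual code: the experts the draft predicts are copied to the GPU on a side CUDA stream while drafting continues on the default stream, and verification waits on that stream before it begins. Requires a CUDA device, and pinned host tensors for the copies to truly overlap.

```python
import torch

def prefetch_predicted_experts(cpu_experts, predicted_ids, device="cuda"):
    """Launch host-to-device copies for the experts the draft expects to need, on a
    side stream so the transfers overlap with ongoing draft computation."""
    side = torch.cuda.Stream()
    resident = {}
    with torch.cuda.stream(side):
        for eid in predicted_ids:
            # non_blocking copies only overlap if the source CPU tensor is pinned
            resident[eid] = cpu_experts[eid].to(device, non_blocking=True)
    return side, resident

if torch.cuda.is_available():
    experts = {i: torch.randn(1024, 4096).pin_memory() for i in range(8)}
    side, resident = prefetch_predicted_experts(experts, predicted_ids=[1, 3, 6])
    # ... draft computation proceeds on the default stream here ...
    torch.cuda.current_stream().wait_stream(side)  # verification starts after the copies land
    print(sorted(resident))
```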
3. Mixture-of-Experts: Routing, Load Balancing, and Architecture
The Mixture-of-Experts (MoE) architecture has become the standard for scaling model capacity while constraining inference costs. By activating only a subset of parameters per token (e.g., 37B out of 671B in DeepSeek-V3), MoE decouples model size from FLOPs. However, this introduces complex dynamics in routing and load balancing that define the inference performance.
3.1 Architectural Innovations: The DeepSeek Paradigm
DeepSeek-V3 and its “DeepSeekMoE” architecture represent a significant deviation from the standard MoE designs (like Mixtral or Switch Transformer).
3.1.1 Fine-Grained Expert Segmentation
Standard MoE architectures often use a small number of large experts (e.g., 8 experts, select 2). DeepSeek segments these into many smaller experts (e.g., 64 routed experts, select 8).
- Benefit: This promotes hyper-specialization. A monolithic “Coding” expert is forced to learn Python, C++, and Java syntax simultaneously. Fine-grained experts allow the model to dedicate specific sub-experts to “Python Indentation,” “C++ Pointers,” and “Java Classes.” The router can then mix and match these specific skills dynamically for a given token.23
3.1.2 Shared Expert Isolation
DeepSeek introduces “Shared Experts” that are always active for every token, bypassing the router.
- Insight: Certain knowledge (basic grammar, common function words, general syntax) is required for almost every token. In standard MoE, this “common knowledge” must be duplicated across every expert so that it is available regardless of which expert is selected. This is redundant and wastes parameter budget.
- Solution: By isolating this into a Shared Expert, the model ensures common knowledge is always available. The routed experts are then freed to focus exclusively on specialized, long-tail knowledge. This significantly improves parameter efficiency.23
3.2 Load Balancing: The Auxiliary-Loss-Free Breakthrough
A critical failure mode in MoE is Expert Collapse, where the router learns to send all tokens to a single expert, ignoring the others. This reduces the effective capacity of the model to that of a single expert.
The Traditional Solution: Auxiliary Loss
Standard implementations add a loss term to the training objective: $L = L_{\text{text}} + \alpha\, L_{\text{aux}}$. This $L_{\text{aux}}$ penalizes the model if the distribution of tokens across experts is not uniform.
- The Conflict: This creates a competitive objective. The model wants to minimize perplexity (language quality) but is forced to route sub-optimally to satisfy the load balancing constraint. High $\alpha$ degrades model quality; low $\alpha$ leads to collapse.25
DeepSeek’s Auxiliary-Loss-Free Strategy
DeepSeek-V3 eliminates the auxiliary loss. Instead, it uses a Dynamic Bias mechanism, sketched after the list below.
- Mechanism: A bias term $b_i$ is added to each expert’s routing score when ranking: $\text{Score}_i = s_i(x) + b_i$, where $s_i(x)$ is the router’s token-to-expert affinity for expert $i$.
- Feedback Loop: If expert $i$ is receiving too many tokens (overloaded), its bias $b_i$ is decremented. If it is underloaded, $b_i$ is incremented.
- The Key Innovation: This bias is used only for the Top-K selection logic to determine which experts process the token. However, for the final weighted combination of outputs, the bias is removed, and the original affinity scores are used. This ensures that the gradient flow is driven purely by the language modeling objective, while the routing distribution is mechanically forced to balance via the bias. This decoupling preserves model quality while ensuring near-perfect hardware utilization.25
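A minimal sketch of the feedback loop, assuming a sigmoid affinity and a fixed bias-update step `gamma` (an illustrative value): the bias participates only in the Top-K ranking, the unbiased affinities weight the expert outputs, and after each batch the bias of overloaded experts is nudged down while that of underloaded experts is nudged up.

```python
import torch

def route_with_bias(affinity, bias, k=2):
    """Select experts with biased scores; weight outputs with the unbiased affinities."""
    topk = torch.topk(affinity + bias, k, dim=-1).indices          # (tokens, k)
    weights = torch.gather(affinity, 1, topk)                      # gradients follow affinity only
    return topk, weights / weights.sum(dim=-1, keepdim=True)

def update_bias(bias, topk, num_experts, gamma=0.001):
    """Nudge the bias down for overloaded experts and up for underloaded ones."""
    load = torch.bincount(topk.flatten(), minlength=num_experts).float()
    return bias - gamma * torch.sign(load - load.mean())

num_experts, num_tokens = 8, 1024
affinity = torch.sigmoid(torch.randn(num_tokens, num_experts))     # token-to-expert affinities
bias = torch.zeros(num_experts)
for _ in range(200):                    # the bias drifts to counteract overloaded experts
    topk, w = route_with_bias(affinity, bias)
    bias = update_bias(bias, topk, num_experts)
print(torch.bincount(topk.flatten(), minlength=num_experts))
```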
3.3 Multi-Head Latent Attention (MLA) and KV Cache Compression
While MoE optimizes the Feed-Forward Networks (FFN), the Attention mechanism remains a bottleneck, particularly regarding the Key-Value (KV) cache memory. For a model with long context (e.g., 128k tokens), the KV cache can grow to hundreds of gigabytes, forcing small batch sizes and low throughput.
DeepSeek-V3 utilizes Multi-Head Latent Attention (MLA) to compress this cache; a shape-level sketch follows the list below.
- Mechanism: Instead of storing the full high-dimensional Key and Value vectors for every token, MLA projects the attention input into a low-dimensional Latent Vector ($c_{KV}$).
- Compression: Only this compressed latent vector is stored in the KV cache.
- Decompression: During the attention operation, the latent vector is up-projected (via matrices $W_{UK}, W_{UV}$) to reconstruct the keys and values.
- Decoupled RoPE: Rotary Positional Embeddings (RoPE) are sensitive to absolute values and difficult to compress. MLA handles this by decoupling the positional part of the key into a separate, uncompressed vector that is concatenated during computation.
- Impact: MLA reduces the KV cache size by approximately 93% compared to standard Multi-Head Attention (MHA). This allows DeepSeek-V3 to serve significantly larger batch sizes on the same hardware compared to models using Grouped Query Attention (GQA), which typically achieves only a 2x-8x reduction.28
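A shape-level sketch of the compression path with illustrative dimensions; the decoupled RoPE key and the matrix absorptions used in the real kernels are omitted.

```python
import torch

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

W_dkv = torch.randn(d_model, d_latent) * 0.02            # down-projection (cached path)
W_uk = torch.randn(d_latent, n_heads * d_head) * 0.02    # up-projection for keys
W_uv = torch.randn(d_latent, n_heads * d_head) * 0.02    # up-projection for values

def mla_cache_step(h_t, kv_cache):
    """Only the low-dimensional latent c_kv is appended to the cache."""
    c_kv = h_t @ W_dkv                                    # (1, d_latent)
    kv_cache.append(c_kv)
    return kv_cache

def mla_reconstruct(kv_cache):
    """At attention time, keys and values are re-expanded from the cached latents."""
    c = torch.cat(kv_cache, dim=0)                        # (seq, d_latent)
    k = (c @ W_uk).view(-1, n_heads, d_head)
    v = (c @ W_uv).view(-1, n_heads, d_head)
    return k, v

cache = []
for _ in range(8):
    cache = mla_cache_step(torch.randn(1, d_model), cache)
k, v = mla_reconstruct(cache)
print(k.shape, v.shape)   # cached per token: d_latent floats vs. 2 * n_heads * d_head for MHA
```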
3.4 Expert Offloading and Handling Stragglers
In inference scenarios where the model exceeds GPU memory (e.g., running Mixtral on consumer hardware), Expert Offloading is required.
MoE-Infinity and Caching:
MoE-Infinity optimizes offloading by tracing expert activation patterns. It observes that expert usage is sparse and exhibits temporal locality.
- Activation-Aware Prefetching: By analyzing the sequence, it predicts which experts will be needed next and moves them from CPU to GPU.
- Caching Policy: It maintains a “hot” set of experts in VRAM. Unlike LRU (Least Recently Used), it uses usage frequency and activation traces to determine which experts to evict, minimizing PCIe traffic.31
Straggler Effect and Capacity-Aware Inference:
In distributed inference, if 8 GPUs each host different experts, the latency of the layer is determined by the slowest GPU (the one with the “hot” expert). This is the Straggler Effect.
- Capacity-Aware Token Drop: If an expert’s queue exceeds a capacity threshold (e.g., 1.2x the average load), excess tokens are dropped or rerouted to a Shared Expert.
- Token Reroute: Alternatively, tokens are rerouted to their 2nd or 3rd choice expert if that expert is underutilized.
- Result: This significantly tightens the latency distribution (p99 latency), preventing a single popular expert from stalling the entire cluster.33
4. Serving Infrastructure: The System Layer
The theoretical gains of quantization, speculation, and MoE routing are only realized through robust serving infrastructure. The ecosystem is currently defined by three primary frameworks: vLLM, Text Generation Inference (TGI), and TensorRT-LLM, each offering distinct approaches to memory and scheduling.
4.1 PagedAttention and Block Tables
The “fragmentation” of GPU memory was a primary bottleneck in early LLM serving. PagedAttention (vLLM) solved this by importing the concept of virtual memory paging from Operating Systems.35
- The Problem: Pre-allocating contiguous memory for a 2048-token sequence results in massive waste if the request finishes after 100 tokens. “Internal fragmentation” and “External fragmentation” prevented effective batching.
- The Solution: PagedAttention divides the KV cache into fixed-size Blocks (e.g., 16 tokens). These blocks can be stored anywhere in physical non-contiguous GPU memory.
- Block Table: A software-managed table maps the “Logical” token sequence (0, 1, 2…) to “Physical” block addresses, as sketched after this list.
- Copy-on-Write: This architecture enables efficient parallel sampling. If a prompt branches into three different beam search candidates, they all share the physical blocks of the prompt. New blocks are allocated only when the candidates diverge. This prefix sharing reduces memory usage by up to 55% in complex sampling scenarios.37
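A minimal sketch of the bookkeeping, not vLLM's implementation: a free-list allocator hands out fixed-size physical blocks, each sequence keeps a block table from logical block index to physical block id, and forking a sequence copies only the table while bumping reference counts (copy-on-write).

```python
BLOCK_SIZE = 16

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self):
        blk = self.free.pop(0)
        self.refcount[blk] = 1
        return blk

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []          # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:      # current block full (or first token)
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def fork(self):
        """Copy-on-write: the child shares every physical block with the parent."""
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for blk in child.block_table:
            self.allocator.refcount[blk] += 1
        return child

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(40):                    # 40 tokens -> 3 physical blocks
    seq.append_token()
branch = seq.fork()                    # shares all 3 blocks until it diverges
print(seq.block_table, alloc.refcount[:4])
```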
4.2 Continuous Batching (In-Flight Batching)
Traditional “Static Batching” waits for every request in a batch to complete before starting a new batch. This causes “bubbles” where GPUs idle while waiting for the one long sequence to finish.
Continuous Batching operates at the iteration granularity.
- Mechanism: After every token generation step, the scheduler checks whether any sequence has finished. If so, it is evicted, and a new request from the waiting queue is inserted into the batch immediately (see the toy scheduler loop after this list).
- Impact: This maximizes GPU occupancy. At any given microsecond, the GPU is processing as many tokens as memory permits. This improves throughput by 10-20x over static batching for workloads with high variance in output length.39
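A toy scheduler loop that captures the iteration-level policy, under the simplifying assumption that each request's remaining output length is known up front; a real scheduler also gates admission on free KV-cache blocks and handles preemption.

```python
from collections import deque
import random

MAX_BATCH = 4
waiting = deque(random.randint(3, 12) for _ in range(10))   # remaining tokens per request
running = []
step = 0

while waiting or running:
    # admit new requests whenever a batch slot is free (iteration-level scheduling)
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    # one decode iteration: every running sequence produces one token
    running = [r - 1 for r in running]
    # evict finished sequences immediately instead of waiting for the whole batch
    running = [r for r in running if r > 0]
    step += 1

print(f"finished all requests in {step} decode iterations")
```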
4.3 Distributed Inference: TP, PP, and EP
Serving massive models requires partitioning them across multiple GPUs.
- Tensor Parallelism (TP): Splits individual matrix multiplications across GPUs (e.g., dividing the Query, Key, Value matrices).
- Pros: Reduces latency for single requests.
- Cons: Requires massive communication bandwidth (All-Reduce) after every layer. Feasible only within a single node (NVLink).41
- Pipeline Parallelism (PP): Splits the model by layers (GPU 1 gets layers 1-10, GPU 2 gets 11-20).
- Pros: Low communication overhead (only point-to-point between stages).
- Cons: Introduces “Pipeline Bubbles” where GPUs wait for data. DeepSeek’s DualPipe minimizes this by interleaving forward and backward chunks bi-directionally.6
- Expert Parallelism (EP): Specifically for MoE. Different experts are placed on different GPUs.
- Mechanism: Requires an All-to-All communication primitive. Tokens are dispatched from their source GPU to the GPU hosting the selected expert, processed, and then returned.
- Challenge: If routing is unbalanced, All-to-All communication becomes a bottleneck due to network congestion.
4.4 Framework Comparison and Selection Strategy
Table 2: Comparative Analysis of Serving Frameworks
| Feature | vLLM | HuggingFace TGI | TensorRT-LLM |
| --- | --- | --- | --- |
| Core Philosophy | High throughput via PagedAttention | Ease of use & Ecosystem integration | Maximum performance via compilation |
| Quantization Support | FP8 (W8A8), AWQ, GPTQ | FP8, EETQ, AWQ, GPTQ | FP8, INT8, INT4 (Best support) |
| MoE Support | Native, supports DeepSeek/Mixtral | Native, optimized kernels | Highly optimized FusedMoE kernels |
| Long Context | PagedAttention | Chunking & Prefix Caching | In-flight batching |
| Performance Profile | High Throughput, Good Latency | Balanced, Low Latency (v3) | Lowest Latency, Max Throughput |
| Deployment | Python-centric, flexible | Docker container, API-ready | Requires building “Engines” |
| Key Differentiator | FP8 W8A8 support for Hopper | Chunking for RAG workloads | Kernel Fusion & NVIDIA optimization |
Selection Guidance:
- Use vLLM for high-throughput batch processing and serving state-of-the-art open models (like DeepSeek) immediately upon release. Its support for FP8 on H100s is cutting-edge.35
- Use TensorRT-LLM for latency-critical applications on NVIDIA hardware where engineering resources allow for the compilation step. It extracts the absolute maximum FLOPs from the hardware.40
- Use TGI for RAG applications requiring massive context processing. TGI v3’s “Chunking” feature allows it to process 200k+ token prompts significantly faster by caching and reusing prefixes intelligently.43
5. Future Outlook: The Convergence of Hardware and Algorithms
The trajectory of inference optimization points toward a unified “Hardware-Software Co-Design.”
- FP8 as the Standard: With DeepSeek-V3 demonstrating stable FP8 training, the industry will likely standardize on FP8 for both training and inference, eliminating the quantization conversion step entirely.
- Speculation as Default: As models get larger and memory walls steeper, speculative decoding (likely via integrated methods like Medusa or EAGLE-2) will become a default “on” feature rather than an optimization option.
- Dynamic Architectures: The success of dynamic routing (MoE) and dynamic precision (QSPEC) suggests future models will be fluid—adjusting their compute path, precision, and memory footprint per token based on real-time difficulty and system load.
The optimization of LLM inference is no longer about finding a single “magic bullet” but about orchestrating a symphony of techniques—quantization, speculation, routing, and system scheduling—to mask the physical limitations of silicon and extract intelligence at scale.
Works cited
- A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B (arXiv:2409.11055v6 [cs.CL]), accessed on December 22, 2025, https://arxiv.org/pdf/2409.11055
- FP8 Quantization in Deep Neural Networks – Emergent Mind, accessed on December 22, 2025, https://www.emergentmind.com/topics/fp8-quantization
- FP8 W8A8 – vLLM, accessed on December 22, 2025, https://docs.vllm.ai/en/v0.11.0/features/quantization/fp8.html
- A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2409.11055v1
- QSPEC: Speculative Decoding with Complementary Quantization Schemes – ACL Anthology, accessed on December 22, 2025, https://aclanthology.org/2025.emnlp-main.240.pdf
- DeepSeek-V3 Explained: Optimizing Efficiency and Scale – ADaSci, accessed on December 22, 2025, https://adasci.org/deepseek-v3-explained-optimizing-efficiency-and-scale/
- DeepSeek-V3 Technical Report – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2412.19437v1
- DeepSeek-R1 and FP8 Mixed-Precision Training – Colfax Research, accessed on December 22, 2025, https://research.colfax-intl.com/deepseek-r1-and-fp8-mixed-precision-training/
- DeepSeek Technical Analysis — (5) FP8 Training | by Jinpeng Zhang – Medium, accessed on December 22, 2025, https://dataturbo.medium.com/deepseek-technical-analysis-5-fp8-training-ff34768727b8
- Dispelling DeepSeek Myths, Studying V3 – Creative Strategies, accessed on December 22, 2025, https://creativestrategies.com/dispelling-deepseek-myths-studying-v3/
- Decoding Speculative Decoding – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2402.01528v3
- Speculative Sampling in LLMs: Speeding Up Inference with Drafts, Verification & Parallelism, accessed on December 22, 2025, https://medium.com/@xiaxiami/speculative-sampling-in-llms-speeding-up-inference-with-drafts-verification-parallelism-6d948d268a87
- Speculative Sampling — TensorRT-LLM – GitHub Pages, accessed on December 22, 2025, https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html
- Efficient and Scalable Speculative Decoding with Multi-Stream Attention – ACL Anthology, accessed on December 22, 2025, https://aclanthology.org/2025.emnlp-main.986.pdf
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty – arXiv, accessed on December 22, 2025, https://arxiv.org/pdf/2401.15077
- Speculative decoding, accessed on December 22, 2025, https://aarnphm.xyz/thoughts/Speculative-decoding
- Speculative Decoding and Beyond: An In-Depth Survey of Techniques – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2502.19732v4
- An Introduction to Speculative Decoding for Reducing Latency in AI Inference, accessed on December 22, 2025, https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
- OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure – MIT Press Direct, accessed on December 22, 2025, https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00735/128189/OPT-Tree-Speculative-Decoding-with-Adaptive-Draft
- [Feature]: Tree-Attention Support for Speculative Decoding · Issue #18327 · vllm-project/vllm, accessed on December 22, 2025, https://github.com/vllm-project/vllm/issues/18327
- Traversal Verification for Speculative Tree Decoding – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2505.12398v1
- MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2511.14102v1
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2401.06066v1
- DeepSeek-V3 Explained 2: DeepSeekMoE | by Shirley Li – AI Advances, accessed on December 22, 2025, https://ai.gopubby.com/deepseek-v3-explained-2-deepseekmoe-106cffcc56c1
- DeepSeek-V3 Explained 3: Auxiliary-Loss-Free Load Balancing | by Shirley Li – AI Advances, accessed on December 22, 2025, https://ai.gopubby.com/deepseek-v3-explained-3-auxiliary-loss-free-load-balancing-4beeb734ab1f
- DeepSeek-V3 — Advances in MoE Load Balancing and Multi-Token Prediction Training | by Yugen.ai – Medium, accessed on December 22, 2025, https://medium.com/yugen-ai-technology-blog/deepseek-v3-advances-in-moe-load-balancing-and-multi-token-prediction-training-f6d68c59749c
- DeepSeek-V3 Technical Report – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2412.19437v2
- DeepSeek-V3 Technical Report – The VITALab website, accessed on December 22, 2025, https://vitalab.github.io/article/2025/02/11/DeepSeekV3.html
- Inside DeepSeek V3: Breaking Down Multi-Head Latent Attention (MLA) – Medium, accessed on December 22, 2025, https://medium.com/@ahabb/inside-deepseek-v3-breaking-down-multi-head-latent-attention-mla-72a71fa5771d
- TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup, accessed on December 22, 2025, https://arxiv.org/html/2502.07864v4
- MoE-Infinity: Offloading-Efficient MoE Model Serving – Semantic Scholar, accessed on December 22, 2025, https://www.semanticscholar.org/paper/MoE-Infinity%3A-Offloading-Efficient-MoE-Model-Xue-Fu/b43e2cd01d23f3bdb90751d0d2893bd8388f1a71
- MoE-Infinity/README.md at main – GitHub, accessed on December 22, 2025, https://github.com/EfficientMoE/MoE-Infinity/blob/main/README.md
- [Literature Review] Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts – Moonlight, accessed on December 22, 2025, https://www.themoonlight.io/en/review/capacity-aware-inference-mitigating-the-straggler-effect-in-mixture-of-experts
- Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts, accessed on December 22, 2025, https://www.researchgate.net/publication/389695134_Capacity-Aware_Inference_Mitigating_the_Straggler_Effect_in_Mixture_of_Experts
- Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2511.17593v1
- What is PagedAttention? – Hopsworks, accessed on December 22, 2025, https://www.hopsworks.ai/dictionary/pagedattention
- ultimate guide to PagedAttention – Newline.co, accessed on December 22, 2025, https://www.newline.co/@zaoyang/ultimate-guide-to-pagedattention–0da4bc75
- The Architecture Behind vLLM: How PagedAttention Improves Memory Utilization – Medium, accessed on December 22, 2025, https://medium.com/@mandeep0405/the-architecture-behind-vllm-how-pagedattention-improves-memory-utilization-2f9b25272110
- Continuous vs dynamic batching for AI inference – Baseten, accessed on December 22, 2025, https://www.baseten.co/blog/continuous-vs-dynamic-batching-for-ai-inference/
- vLLM vs TGI vs TensorRT‑LLM vs Ollama – Compute with Hivenet, accessed on December 22, 2025, https://compute.hivenet.com/post/vllm-vs-tgi-vs-tensorrt-llm-vs-ollama
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2410.12247v2
- The vLLM MoE Playbook: A Practical Guide to TP, DP, PP and Expert Parallelism, accessed on December 22, 2025, https://rocm.blogs.amd.com/software-tools-optimization/vllm-moe-guide/README.html
- vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy, A Deep Technical Comparison for Production LLM Inference – MarkTechPost, accessed on December 22, 2025, https://www.marktechpost.com/2025/11/19/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference/
