The Convergence of Concurrency: Resolving the Contention Between Continuous Batching and Speculative Decoding in Large-Scale LLM Inference

1. Introduction: The Economic and Physical Constraints of Intelligence

The rapid proliferation of Large Language Models (LLMs) into production environments—spanning generative chatbots, code assistants, and automated reasoning agents—has precipitated a fundamental shift in the architectural priorities of inference systems. In the early stages of the “generative AI” era, the primary optimization metric was Time-to-First-Token (TTFT), a latency-centric measure reflecting the responsiveness of the system to a single user. However, as deployment scales to millions of concurrent requests, the economic viability of these systems is increasingly dictated by Throughput (tokens generated per second per dollar) and Hardware Utilization.

This transition has brought two powerful optimization paradigms into direct conflict: Continuous Batching and Speculative Decoding. Continuous batching, popularized by systems like vLLM and Orca, maximizes the utilization of High-Bandwidth Memory (HBM) by dynamically scheduling requests at the token level, effectively saturating the Graphics Processing Unit (GPU) compute cores. Conversely, Speculative Decoding (SD) was originally conceived as a latency-optimization technique that leverages “idle” GPU cycles—cycles that exist because memory bandwidth limits the rate at which a single sequence can be processed—to verify multiple future tokens in parallel.

The central thesis of this report, supported by research emerging in late 2024 and 2025, is that batching has effectively compressed the available idle computing power, rendering traditional speculative decoding strategies inefficient or even deleterious in high-load scenarios.1 As batch sizes increase to maximize throughput, the “idle” gaps that SD relies on disappear, leading to resource contention where drafting and verification compete for the same saturated arithmetic units.

This analysis provides an exhaustive examination of this contention and the subsequent wave of architectural innovations designed to resolve it. We explore SpecFormer’s unification of attention mechanisms to eliminate expensive prefix trees 1, Falcon’s semi-autoregressive drafting for training stability 4, Mirror Speculative Decoding’s utilization of heterogeneous hardware to physically separate draft and verify stages 5, and the EqSpec/EXSpec frameworks that solve the critical “ragged tensor” problem to ensure mathematical correctness in batched speculation.6 By synthesizing these developments, we articulate a new standard for high-performance, cost-effective LLM inference.

2. The Physics of Interference: The Roofline Model and the Memory Wall

To understand why batching and speculative decoding are currently at odds, one must first rigorously define the physical constraints of the hardware executing these models. The performance of any Deep Learning workload is governed by the Roofline Model, which plots performance (FLOPs/second) against Arithmetic Intensity (FLOPs/byte of memory transferred).7

2.1 The Agony of Autoregression

Generative LLMs are autoregressive: the generation of token $t$ depends on token $t-1$. This sequential dependency forces the hardware to load the model’s entire weight matrix $\Theta$ from HBM to the on-chip SRAM for every single token generated.

For a model like Llama-3-70B (approx. 70 billion parameters, FP16 precision), the weights constitute roughly 140 GB of data. Generating one token requires moving this 140 GB across the memory bus. If the GPU is an NVIDIA H100 with a memory bandwidth of roughly 3.35 TB/s, the theoretical minimum time to load weights is:

 

$$T_{load} = \frac{140 \text{ GB}}{3350 \text{ GB/s}} \approx 41.8 \text{ ms}$$

However, the computation required is merely a matrix-vector multiplication (the active vector being the single token embedding). The arithmetic intensity is extremely low—approximately 1 FLOP per byte transferred. Since modern GPUs are capable of nearly 1,000 TFLOPs (FP16), the compute units (Tensor Cores) finish the calculation in microseconds, spending the vast majority of the 41.8 ms cycle waiting for memory. This state is defined as being Memory-Bound.8
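
The arithmetic above can be reproduced with a few lines of back-of-the-envelope Python. The figures are the same assumptions used in the example (70B parameters in FP16, roughly 3.35 TB/s of bandwidth, roughly 1,000 TFLOPS of FP16 compute), not measured values.

```python
# Back-of-the-envelope check that single-sequence decoding is memory-bound.
# Numbers mirror the example above (Llama-3-70B in FP16 on an H100-class GPU)
# and are approximations, not measured values.

PARAMS = 70e9
WEIGHT_BYTES = PARAMS * 2        # FP16: 2 bytes per parameter -> ~140 GB
HBM_BANDWIDTH = 3.35e12          # ~3.35 TB/s
PEAK_FLOPS = 1.0e15              # ~1,000 TFLOPS (FP16 tensor cores)

t_load = WEIGHT_BYTES / HBM_BANDWIDTH     # time to stream the weights once
t_compute = (2 * PARAMS) / PEAK_FLOPS     # ~2 FLOPs per parameter per token

print(f"weight load : {t_load * 1e3:.1f} ms")     # ~41.8 ms
print(f"compute     : {t_compute * 1e3:.2f} ms")  # ~0.14 ms
print(f"tensor cores idle for ~{100 * (1 - t_compute / t_load):.1f}% of the step")
```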

2.2 The “Free Lunch” of Speculative Decoding

Speculative Decoding (SD) exploits this massive inefficiency. Since the Tensor Cores are idle for >90% of the cycle in a single-batch setting, SD proposes using a smaller “draft” model to guess the next $K$ tokens (e.g., $K=4$). The large target model then verifies all $K$ tokens in a single forward pass.

Verification is a matrix-matrix operation (checking 4 tokens simultaneously). This increases the arithmetic intensity by a factor of $K$ without significantly increasing the memory transfer time (since the weights are loaded once for the batch of $K$ tokens). In the low-batch regime ($B=1$), SD effectively converts memory-bound latency into compute-bound work, utilizing the “idle” resources to accelerate generation.1
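
The following is a minimal sketch of one draft-and-verify round, using greedy verification for clarity (production systems typically use stochastic acceptance sampling, which preserves the target distribution). The `draft_next` and `target_greedy` callables are hypothetical stand-ins for real model invocations.

```python
# Minimal sketch of one speculative-decoding round with greedy verification.
# `draft_next` and `target_greedy` are hypothetical stand-ins for model calls;
# the point is the accept-the-longest-matching-prefix logic.

from typing import Callable, List

def speculative_round(
    context: List[int],
    k: int,
    draft_next: Callable[[List[int]], int],
    target_greedy: Callable[[List[int], int], List[int]],
) -> List[int]:
    """Draft k tokens cheaply, then verify all of them in one target pass.

    draft_next(tokens) -> the draft model's next token (called k times).
    target_greedy(tokens, k) -> the target's greedy prediction after each of
    the last k+1 prefixes of `tokens`, oldest prefix first.
    """
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # Single parallel verification pass over context + draft (matrix-matrix work).
    preds = target_greedy(context + draft, k)

    accepted = []
    for i in range(k):
        if draft[i] == preds[i]:
            accepted.append(draft[i])   # target agrees: keep the draft token
        else:
            accepted.append(preds[i])   # mismatch: emit the target's token, stop
            return accepted
    accepted.append(preds[k])           # all k accepted: one bonus token for free
    return accepted

# Toy stand-ins where both "models" continue the pattern x -> x + 1 (mod 10):
toy_draft = lambda toks: (toks[-1] + 1) % 10
toy_target = lambda toks, k: [(toks[len(toks) - k + i - 1] + 1) % 10
                              for i in range(k + 1)]
print(speculative_round([1, 2, 3], 3, toy_draft, toy_target))  # -> [4, 5, 6, 7]
```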

2.3 The Impact of Batching on Arithmetic Intensity

Batching is the standard solution to the memory wall. By grouping $B$ user requests together, the system loads the weights once and applies them to $B$ tokens simultaneously. The arithmetic intensity scales linearly with $B$:

 

$$\text{Intensity}_{Batched} \approx B \times \text{Intensity}_{Single}$$

As $B$ increases, the workload moves from the memory-bound region of the roofline graph toward the compute-bound region. In production systems serving thousands of users, schedulers drive $B$ as high as memory capacity allows (often $B > 128$). At these levels, the Tensor Cores are fully saturated processing the main batch. The “idle computing power” that SD requires is no longer idle; it has been harvested to serve other users. This is the crux of the problem: Batching compresses available idle computing power, turning SD from an optimization into a burden.1
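
Extending the earlier back-of-the-envelope numbers, a short calculation estimates the batch size at which the workload crosses from memory-bound to compute-bound. The figures remain assumptions (FP16 weights, KV-cache traffic ignored), so the crossover point is illustrative only.

```python
# Rough estimate of the batch size at which decoding stops being memory-bound,
# using the same assumed figures as before (70B params in FP16, H100-class GPU).

PARAMS = 70e9
WEIGHT_BYTES = PARAMS * 2            # ~140 GB of weights streamed per step
HBM_BANDWIDTH = 3.35e12              # bytes/s
PEAK_FLOPS = 1.0e15                  # FLOPs/s

t_memory = WEIGHT_BYTES / HBM_BANDWIDTH   # fixed cost: stream the weights once
flops_per_token = 2 * PARAMS              # ~2 FLOPs per parameter per token

# Compute time grows linearly with B; weight-load time stays roughly constant
# (ignoring KV-cache traffic, which grows with B and tightens this estimate).
crossover_B = t_memory * PEAK_FLOPS / flops_per_token
print(f"compute time matches weight-load time near B ~ {crossover_B:.0f}")
# With FP8 weights the crossover roughly halves, which is why B > 128 can
# already sit near saturation in practice.
```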

3. The Hegemony of Continuous Batching

The industry standard for handling this high-throughput requirement is Continuous Batching (often referred to as cellular or iteration-level batching), popularized by the vLLM and Orca systems.11 Understanding its mechanism is crucial to understanding why it is so hostile to traditional speculative decoding.

3.1 From Static to Continuous Scheduling

In traditional Static Batching, the server waits for $B$ requests to accumulate. It then processes them in lockstep. If one request finishes early (generating a short answer), it must wait for the longest request in the batch to complete, wasting compute slots on “padding” tokens.

Continuous Batching fundamentally alters this by operating at the granularity of a single iteration.11

  1. Iteration-Level Scheduling: At the end of every token generation step, the scheduler checks the status of all sequences.
  2. Immediate Eviction and Injection: Completed sequences are immediately evicted. Pending requests from the queue are immediately injected into the newly freed slots.
  3. Result: The GPU always operates on a “full” batch of valid tokens. There is almost no waste due to padding.

This mechanism is extremely effective at maximizing GPU Utilization. Metrics from industry deployments indicate that continuous batching can improve throughput by factors of 10$\times$ to 20$\times$ over static batching.13 Consequently, any “slack” in the system is aggressively filled with new revenue-generating work.
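
The iteration-level loop can be sketched in a few lines. The `Request` class and `model_step` stand-in below are hypothetical simplifications for illustration, not vLLM's or Orca's actual scheduler API.

```python
# Minimal sketch of iteration-level (continuous) batching. The Request class
# and `model_step` are hypothetical simplifications, not a real serving API.

from collections import deque
from dataclasses import dataclass, field
import random

@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens

def model_step(batch):
    """Stand-in for one forward pass: appends one 'token' per active request."""
    for req in batch:
        req.generated.append(random.randint(0, 31999))

def serve(queue: deque, max_batch: int, steps: int):
    active = []
    for _ in range(steps):
        # 1. Evict finished sequences immediately (no lockstep waiting).
        active = [r for r in active if not r.finished()]
        # 2. Inject waiting requests into the freed slots.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        if not active:
            break
        # 3. Run exactly one decode iteration over the (always full) batch.
        model_step(active)

queue = deque(Request(prompt_len=32, max_new_tokens=random.randint(4, 64))
              for _ in range(100))
serve(queue, max_batch=16, steps=1000)
```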

3.2 The Enabler: PagedAttention

Continuous batching is physically enabled by PagedAttention.9 Traditional attention mechanisms require contiguous memory blocks for the Key-Value (KV) cache, leading to fragmentation and limiting the maximum batch size. PagedAttention applies the concept of virtual memory paging to the KV cache, allowing memory to be allocated in non-contiguous blocks. This dramatically reduces memory waste, allowing the scheduler to fit even more concurrent sequences into HBM, further increasing the batch size $B$ and pushing the system deeper into the compute-bound regime.
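
The paging idea reduces to a per-sequence block table that maps logical token positions to fixed-size physical blocks. The toy class below illustrates that mapping only; it is not vLLM's memory manager.

```python
# Toy sketch of KV-cache paging: fixed-size physical blocks plus a per-sequence
# block table, so logically contiguous tokens need not be physically contiguous.
# This mirrors the idea behind PagedAttention, not its actual implementation.

BLOCK_SIZE = 16  # tokens per KV block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}          # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, token_index: int) -> tuple[int, int]:
        """Return (physical_block, offset) where this token's KV entry lives."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block, offset = divmod(token_index, BLOCK_SIZE)
        if logical_block == len(table):           # need a fresh block
            table.append(self.free_blocks.pop())  # any free block will do
        return table[logical_block], offset

    def release(self, seq_id: int):
        """Sequence finished: return its blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for t in range(40):                      # sequence 0 grows to 40 tokens
    cache.append_token(seq_id=0, token_index=t)
print(cache.block_tables[0])             # non-contiguous physical blocks
cache.release(0)
```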

3.3 The “Unfairness” of Stall-Free Scheduling

While continuous batching optimizes global throughput, recent 2025 research has highlighted a critical flaw: Unfairness. Stall-free schedulers (like those in vLLM) prioritize the “decode” phase of active requests to minimize Inter-Token Latency (ITL). This creates a “head-of-line blocking” effect for new requests waiting in the “prefill” (prompt processing) phase, leading to high Time-to-First-Token (TTFT) tail latencies.14

FairBatching 14 is a novel scheduler developed to address this. It introduces an adaptive batch capacity mechanism that dynamically adjusts the computational budget. By acknowledging the non-monotonic nature of Time-Between-Tokens (TBT), FairBatching reclaims resources from bursting decode tasks to serve prefill surges. This nuance is critical: even highly optimized schedulers are now dynamically trading off resources between prefill and decode, leaving absolutely zero margin for the overhead of inefficient speculative decoding.

4. The Conflict: Resource Contention and the “Ragged Tensor”

When Speculative Decoding is superimposed onto a Continuously Batched system, two distinct failure modes emerge: Compute Contention and Memory/Tensor Misalignment.

4.1 Compute Contention

In a batch of size $B=128$, the GPU is calculating logits for 128 tokens per step. If we enable Speculative Decoding with a draft length $K=3$, the verification step now requires processing $128 \times (3+1) = 512$ tokens per step.

If the GPU was already near saturation at 128 tokens (a common scenario with large models on H100s using FP8), demanding 512 tokens’ worth of compute forces the operation to slow down roughly linearly. The verification time $T_{verify}$ exceeds the time it would have taken to just generate the tokens autoregressively. This is the Batching-Speculation Contention: batching fills the compute budget with width (more users), while speculation attempts to fill it with depth (more tokens per user). Under a fixed compute budget, the two compete directly for the same cycles.1
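
A worked example makes the contention concrete. The saturation threshold below is the assumed crossover from Section 2 (roughly 300 tokens per step for the FP16 figures, and roughly half that under FP8), so treat the exact numbers as illustrative.

```python
# Worked example of batching-speculation contention on a single decode step.
# The saturation threshold is an assumed figure (see the roofline estimate
# earlier); real values depend on model size, precision, and hardware.

SATURATION_TOKENS = 300      # assumed tokens/step at which compute saturates

def step_token_load(batch_size: int, draft_len: int) -> int:
    # Verification scores each sequence's draft tokens plus one bonus position.
    return batch_size * (draft_len + 1)

for B, K in [(8, 3), (64, 3), (128, 3)]:
    load = step_token_load(B, K)
    if load <= SATURATION_TOKENS:
        verdict = "fits in the idle compute window"
    else:
        verdict = "exceeds the budget: verification slows every request"
    print(f"B={B:3d}, K={K}: {load:4d} tokens/step -> {verdict}")
```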

4.2 The Ragged Tensor Problem

A more subtle, systems-level failure occurs in memory management. In standard decoding, every sequence in the batch advances by exactly one token. The data structures are regular.

In Speculative Decoding, Sequence A might accept 3 draft tokens, Sequence B might accept 0, and Sequence C might accept 5. The batch becomes “ragged”.16

  • The Alignment Issue: The input tensor for the next step is no longer a neat $B \times L$ rectangle. It is a jagged array. Standard CUDA kernels (FlashAttention, matrix multiplies) are optimized for rectangular dense tensors.
  • The Corruption Risk: Research by Zhang et al. (2025) 6 demonstrated that naive implementations often fail to handle this correctly, leading to Output Equivalence Violations. The position IDs, attention masks, and KV-cache pointers get desynchronized, causing the model to attend to the wrong history. The system might run fast, but it generates garbage.
  • The Overhead of Realignment: To fix this, the system must perform “Unpad-Append-Repad” operations to realign the tensors (a minimal sketch follows this list). This data movement overhead can consume up to 40% of the total inference time, completely negating the speedup of speculation.6
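
Below is the realignment sketch referenced above, on plain Python lists. The real cost arises from performing the same unpad-append-repad dance on GPU tensors in lockstep with position IDs and KV-cache pointers, which is where the reported ~40% overhead comes from.

```python
# Minimal sketch of the "unpad-append-repad" realignment after a speculative
# step: each sequence accepted a different number of draft tokens, so the
# batch must be re-rectangularized before the next dense forward pass.

PAD = -1

def unpad_append_repad(padded_batch, accepted_per_seq):
    # 1. Unpad: recover each sequence's true tokens.
    sequences = [[t for t in row if t != PAD] for row in padded_batch]
    # 2. Append: add only the tokens each sequence actually accepted.
    for seq, accepted in zip(sequences, accepted_per_seq):
        seq.extend(accepted)
    # 3. Repad: restore a rectangular [B, max_len] layout for dense kernels.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [PAD] * (max_len - len(seq)) for seq in sequences]

batch = [[11, 12, PAD], [21, 22, 23], [31, PAD, PAD]]
accepted = [[13, 14, 15], [], [32]]          # 3, 0, and 1 accepted tokens
for row in unpad_append_repad(batch, accepted):
    print(row)
```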

5. Architectural Solution A: SpecFormer and the Unification of Attention

To survive in this high-contention, large-batch environment, Speculative Decoding must fundamentally change its verification cost structure. It cannot afford to verify large “draft trees” (e.g., checking 64 possible branches) because the batch size multiplier makes the compute cost prohibitive.

SpecFormer 1, introduced in research accepted for AAAI 2026, proposes a novel architecture designed specifically for this constraint.

5.1 The Philosophy: Better Drafts, No Trees

The authors argue that the reliance on complex verification trees is a symptom of weak draft models. If the draft model is accurate enough, a single candidate sequence is sufficient. SpecFormer aims to maximize the quality of the draft sequence while maintaining extreme computational efficiency.2

5.2 Unidirectional and Bidirectional Fusion

Standard autoregressive models use Unidirectional (Causal) Attention: token $t$ can only see $t-1 \dots 0$. This is necessary for generation but limits the model’s ability to understand the local context of the phrase it is building.

BERT-style models use Bidirectional Attention, seeing the whole sequence, but cannot generate text.

SpecFormer integrates both:

  1. Context Causal Attention: It uses a standard causal mask to extract information from the prompt and the generated history, ensuring global consistency and causality.2
  2. Draft Bi-directional Attention: For the short sequence of draft tokens (e.g., the next 5 tokens), SpecFormer treats them as a “block.” Within this block, it allows Bidirectional Attention. Token $t+2$ can attend to token $t+4$ within the draft window.2

5.3 Mechanism and Benefits

This fused attention mechanism allows the draft model to capture strong local correlations (e.g., “New” implies “York” strongly in both directions locally) that a pure causal model might miss or be less confident about.
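
The hybrid masking pattern described above can be illustrated with a small mask constructor: causal over the context, fully visible within the draft block. This follows the textual description only and is not SpecFormer's actual implementation.

```python
# Conceptual sketch of the fused mask: causal attention over the context,
# bidirectional attention within the short draft block.

def fused_attention_mask(context_len: int, draft_len: int):
    """Return a T x T boolean mask; mask[i][j] is True if i may attend to j."""
    T = context_len + draft_len
    mask = [[False] * T for _ in range(T)]
    for i in range(T):
        for j in range(T):
            if j <= i:
                mask[i][j] = True        # causal: every position sees the past
            elif i >= context_len and j >= context_len:
                mask[i][j] = True        # bidirectional inside the draft block
    return mask

# Context of 4 tokens followed by a 3-token draft block:
for row in fused_attention_mask(context_len=4, draft_len=3):
    print("".join("1" if ok else "." for ok in row))
# The first four rows are strictly causal; the last three rows (the draft
# block) see the full context plus every other draft token, including future ones.
```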

  • Prefix Tree Elimination: Because the draft quality is higher, SpecFormer does not need to generate a tree of hypotheses. It generates a single, high-probability chain.
  • Batch Compatibility: Verifying a single chain (per user) is far cheaper than verifying a tree. This keeps the total token load on the GPU within limits, allowing SpecFormer to provide acceleration even when the batch size is large.1
  • Performance: Experiments demonstrate that SpecFormer achieves consistent speedups in large-batch scenarios where methods like Medusa or EAGLE collapse due to verification overhead.1

6. Architectural Solution B: Falcon and Semi-Autoregressive Drafting

Parallel to SpecFormer, the Falcon framework (AAAI 2025) attacks the problem via Semi-Autoregressive Drafting and advanced distillation techniques.4

6.1 Semi-Autoregressive Generation

Pure Non-Autoregressive (NAR) models generate all tokens simultaneously (independent of each other). This is fast but usually results in incoherent text (“The cat sat on the mat” -> “The sat cat mat on”).

Falcon employs a Semi-Autoregressive approach. It generates tokens in blocks, but maintains dependencies within the block. This strikes a balance: it is faster than token-by-token generation (latency reduction) but more accurate than pure NAR (quality preservation).4
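
A minimal sketch of block-wise (semi-autoregressive) drafting is shown below. The `block_model` callable is a hypothetical stand-in for a Falcon-style draft head that emits `block_size` tokens per call.

```python
# Sketch of semi-autoregressive drafting: the draft model emits a block of
# tokens per call instead of one, keeping dependencies across calls without
# paying for a separate forward pass per token. `block_model` is hypothetical.

def draft_semi_autoregressive(context, block_model, num_blocks, block_size):
    tokens = list(context)
    for _ in range(num_blocks):
        # One call produces `block_size` tokens conditioned on everything so
        # far; intra-block dependencies are learned at training time rather
        # than unrolled token by token here.
        block = block_model(tokens, block_size)
        tokens.extend(block)
    return tokens[len(context):]

# Toy stand-in: each block just continues an arithmetic pattern.
toy_block_model = lambda toks, k: [toks[-1] + i + 1 for i in range(k)]
print(draft_semi_autoregressive([5], toy_block_model, num_blocks=2, block_size=3))
# -> [6, 7, 8, 9, 10, 11]
```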

6.2 Coupled Sequential Glancing Distillation

The core innovation of Falcon is a training technique called Coupled Sequential Glancing Distillation.4

  • The Problem: Distilling a large model (Teacher) into a small draft model (Student) usually involves minimizing the KL-divergence between their output distributions at each step independently.
  • The Glancing Solution: Falcon’s training process allows the student model to “glance” at the teacher’s future predictions during the training of a block. It effectively burns the teacher’s multi-step reasoning into the student’s single-step block generation weights.
  • Result: This fortifies the inter-token dependencies within the draft block, significantly increasing the Acceptance Rate of the draft tokens. A higher acceptance rate means fewer verification failures, which directly translates to less wasted compute in a batched setting.

6.3 Custom-Designed Decoding Tree

Falcon also implements a dynamic decoding tree that adjusts based on confidence. Unlike SpecFormer’s “no tree” approach, Falcon uses a “smart tree.” It can dynamically expand or contract the search space. If the draft model is highly confident (low entropy), it generates a single chain. If it is uncertain, it branches. This adaptivity is crucial for resource-constrained batching, as it allocates compute only where it is statistically likely to yield a return.4
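
The confidence-gated branching idea can be illustrated with a small entropy check: keep a single chain when the draft distribution is peaked, branch when it is flat. This is a conceptual sketch of the adaptivity described above, not Falcon's tree-construction algorithm.

```python
# Conceptual sketch of confidence-gated branching: expand the candidate tree
# only where the draft distribution is uncertain.

import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def candidates_for_step(draft_probs, entropy_threshold=1.0, max_branches=3):
    """Return the token ids to keep at this step of the draft tree."""
    ranked = sorted(range(len(draft_probs)), key=lambda t: -draft_probs[t])
    if entropy(draft_probs) < entropy_threshold:
        return ranked[:1]              # confident: single chain, minimal cost
    return ranked[:max_branches]       # uncertain: branch, spend more compute

print(candidates_for_step([0.9, 0.05, 0.03, 0.02]))   # -> [0] (single chain)
print(candidates_for_step([0.3, 0.3, 0.2, 0.2]))      # -> [0, 1, 2] (branch)
```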

7. Architectural Solution C: Mirror-SD and Heterogeneous Offloading

While SpecFormer and Falcon optimize the software, Mirror Speculative Decoding (Mirror-SD) argues that the hardware architecture itself is the bottleneck. It proposes solving resource contention by physically separating the contestants.5

7.1 The GPU-NPU Dichotomy

Modern inference servers and Edge SoCs (like the Apple M-series or NVIDIA Grace-Hopper) often contain heterogeneous accelerators:

  • GPUs: High throughput, massive parallelism, optimized for large batch processing.
  • NPUs (Neural Processing Units): Compute-dense, energy-efficient, often optimized for matrix arithmetic but with different memory hierarchies.

Standard approaches leave the NPU idle. Mirror-SD offloads the Draft Model to the NPU while keeping the Target Model on the GPU.5

7.2 Breaking the Serial Barrier

By running the draft on the NPU, the GPU is completely freed from the sequential overhead of drafting.

  1. Parallel Execution: The NPU generates draft tokens for step $t+1$. Simultaneously, the GPU verifies the tokens for step $t$ (a pipelining sketch follows this list).
  2. Bidirectional Speculation: The draft speculates tokens for the target. The target, in turn, speculates “correction paths” for the draft.
  3. Speculative Streaming (SS): To manage the bandwidth link between NPU and GPU (which is often slower than internal memory), Mirror-SD uses speculative streaming. The draft emits multiple tokens per step into a buffer, creating a pipelined flow rather than a stop-and-wait protocol.5
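
The pipelining sketch referenced above is shown here with Python threads and sleeps standing in for the NPU-resident drafter and GPU-resident verifier. It demonstrates only the overlap of drafting step $t+1$ with verifying step $t$, not Mirror-SD's actual runtime.

```python
# Sketch of draft/verify pipelining in the spirit of Mirror-SD: while the
# target verifies the draft for step t, the drafter is already producing the
# draft for step t+1 on a separate accelerator. Both workers are simulated
# stand-ins (sleeps), not real model or device calls.

import time
from concurrent.futures import ThreadPoolExecutor

def draft_on_npu(step: int) -> list[int]:
    time.sleep(0.02)                     # pretend: the NPU drafts K tokens
    return [step * 10 + i for i in range(4)]

def verify_on_gpu(draft: list[int]) -> int:
    time.sleep(0.03)                     # pretend: the GPU verifies K tokens
    return len(draft)                    # pretend every token was accepted

with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.time()
    total_accepted = 0
    next_draft = pool.submit(draft_on_npu, 0)
    for step in range(1, 6):
        draft = next_draft.result()                   # previous draft is ready
        next_draft = pool.submit(draft_on_npu, step)  # draft step t+1 ...
        total_accepted += pool.submit(verify_on_gpu, draft).result()  # ... while verifying step t
    print(f"accepted {total_accepted} tokens in {time.time() - start:.2f} s "
          f"(a fully serial loop would take ~0.25 s)")
```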

7.3 Performance and Scalability

Mirror-SD achieves 2.8$\times$ – 5.8$\times$ wall-time speedups on SpecBench, outperforming EAGLE3 by 30%.5 This is particularly relevant for “Edge” or “Local” LLM inference (e.g., on laptops or phones), where “batch size” might be small (1-4), but resource contention between the display, OS, and LLM is high. By utilizing the NPU, Mirror-SD effectively “creates” new compute resources rather than fighting for the existing GPU cycles.

8. Operational Correctness: EqSpec and EXSpec

Regardless of the architecture used (SpecFormer, Falcon, or Mirror-SD), applying speculation to a batch requires solving the Ragged Tensor problem to ensure correctness.

8.1 EqSpec: The Correctness-First Paradigm

EqSpec is a framework that formally defines the synchronization invariants required for Batch Speculative Decoding.6

  • It proves that simply padding sequences to the longest draft length violates output equivalence because it alters the positional embeddings seen by the padding tokens (which might affect subsequent tokens via attention leakage in some implementations).
  • EqSpec mandates a specific “Unpad-Append-Repad” sequence. This guarantees output equivalence with standard decoding (roughly 95% of outputs match exactly in practice, with the remainder diverging only due to floating-point non-determinism), but it exposes the high cost of this tensor manipulation.6

8.2 EXSpec: The Sliding Pool Scheduler

To reduce the overhead of EqSpec, EXSpec introduces a novel scheduling algorithm.6

  • Sliding Pool: Instead of a rigid batch, EXSpec views active requests as a pool.
  • Length Grouping: It dynamically groups sequences that happen to have the same current length or the same number of accepted draft tokens.
  • Mechanism: If User A and User B both accept 3 tokens, they are grouped into a micro-batch for the next step. User C (who accepted 0) is grouped with User D (who accepted 0); the sketch following this list illustrates the grouping.
  • Impact: This maximizes the “rectangularity” of the tensors naturally, reducing the need for expensive padding or realignment memory operations. EXSpec achieves up to 3$\times$ throughput improvement at batch size 8, effectively mitigating the ragged tensor penalty.6
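
As referenced above, the grouping idea reduces to bucketing the active pool by current sequence length so that each micro-batch is naturally rectangular. The sketch below illustrates the scheduling concept only, not the EXSpec implementation.

```python
# Conceptual sketch of sliding-pool length grouping: after a speculative step,
# group active sequences whose current lengths match, so each micro-batch is
# rectangular and needs no padding or realignment.

from collections import defaultdict

def group_by_length(pool):
    """pool: dict of seq_id -> current token count. Returns micro-batches."""
    buckets = defaultdict(list)
    for seq_id, length in pool.items():
        buckets[length].append(seq_id)
    return list(buckets.values())

# Sequence lengths after a speculative step with uneven acceptance counts:
pool = {"A": 107, "B": 104, "C": 109, "D": 104, "E": 107, "F": 104}
for micro_batch in group_by_length(pool):
    print(micro_batch)     # ['A', 'E'], then ['B', 'D', 'F'], then ['C']
```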

9. Comparative Analysis and Industry Implications

The following table summarizes the key attributes of the discussed solutions, contrasting them against the baseline of Continuous Batching.

| Metric | Continuous Batching (vLLM) | SpecFormer | Falcon | Mirror-SD | EqSpec/EXSpec |
| --- | --- | --- | --- | --- | --- |
| Primary Goal | Max Throughput | Latency in Batches | Training Stability | Hardware Utilization | Correctness |
| Idle Compute | Consumed by Batching | Uses Residual Compute | Efficiently Distilled | Offloaded to NPU | Managed via Scheduling |
| Draft Method | N/A | Uni+Bi-Directional | Semi-Autoregressive | NPU-Resident | Model Agnostic |
| Key Risk | High Latency (TTFT) | Draft Quality | Training Complexity | Inter-chip Bandwidth | Alignment Overhead |
| Hardware | GPU | GPU | GPU | GPU + NPU | GPU |
| Speedup | Baseline | Consistent Acceleration | High Accuracy | 2.8x – 5.8x | 3x (vs Naive SD) |

9.1 The Economic Calculus of “Idle Compute”

The concept of “idle compute” is not just technical; it is economic. In serverless environments (like AWS Lambda or specialized LLM endpoints), providers charge for “GB-seconds” of compute.

  • Batching Strategy: Providers prefer Continuous Batching because it maximizes the revenue generated per active GPU second. A fully saturated GPU is a profitable GPU.15
  • User Perspective: Users want low latency. Speculative Decoding offers this.
  • The Conflict: Providers are disincentivized to enable SD if it reduces the total batch size they can fit (and thus their revenue throughput).
  • Resolution: Techniques like SpecFormer and EXSpec are critical because they allow providers to offer the latency benefits of SD to users without sacrificing the throughput economics of Batching. They lower the “cost of speculation.”

9.2 Hardware Trends: MI300X vs. H100

Recent benchmarks 24 comparing AMD’s MI300X to NVIDIA’s H100 highlight the role of memory bandwidth. The MI300X, with higher HBM capacity and bandwidth, sustains performance at larger batch sizes (128+) better than the H100, which suffers from KV cache eviction earlier. This suggests that hardware with massive memory bandwidth (like the MI300X) might be more forgiving of the “Batching vs. Speculation” conflict, as the “Memory Wall” is pushed further back, leaving a larger window of “idle compute” for speculation even at moderate batch sizes.

10. Conclusion: The Integrated Future of Inference

The narrative that “Batching replaces Speculative Decoding” is a simplification. While Batching undeniably compresses the idle compute resources that naive Speculative Decoding relies on, the future of inference lies in the integration of these technologies, not their mutual exclusion.

The research of 2025/2026 demonstrates a maturation of the field. We have moved beyond “plug-and-play” draft models toward deep systemic optimizations. SpecFormer re-engineers the fundamental attention mechanics to suit the constraints of batched verification. Falcon utilizes advanced distillation to make drafts smarter, not larger. Mirror-SD rewrites the hardware mapping to exploit heterogeneous silicon. And EqSpec/EXSpec provides the rigorous mathematical foundation required to run these complex, ragged workloads without corrupting data.

For the practitioner, the path forward involves a hardware-software co-design approach:

  1. Adopt Continuous Batching (via vLLM/TGI) as the baseline for throughput.
  2. Implement FairBatching schedulers to protect TTFT.
  3. Deploy Batch-Aware Speculation (like SpecFormer or EXSpec-wrapped models) to reclaim latency gains without killing throughput.
  4. Leverage Heterogeneous Hardware (NPU/GPU splits) where available to physically bypass contention.

In this converged architecture, the “idle” compute is no longer found by accident; it is engineered by design.

Key Data Sources and Citations

  • Core Conflict: [1]
  • SpecFormer: [1]
  • Falcon: [4]
  • Mirror-SD: [5]
  • EqSpec/EXSpec: [6]
  • Continuous Batching/vLLM: [9]
  • Hardware/Economics: [23]

Works cited

  1. [2511.20340] Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios – arXiv, accessed on December 13, 2025, https://arxiv.org/abs/2511.20340
  2. Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2511.20340v1
  3. Roofline model for Sun ultraSPARc t2+. – ResearchGate, accessed on December 13, 2025, https://www.researchgate.net/figure/Roofline-model-for-Sun-ultraSPARc-t2_fig3_220423225
  4. Falcon: Faster and Parallel Inference of Large Language Models Through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree – ResearchGate, accessed on December 13, 2025, https://www.researchgate.net/publication/390699823_Falcon_Faster_and_Parallel_Inference_of_Large_Language_Models_Through_Enhanced_Semi-Autoregressive_Drafting_and_Custom-Designed_Decoding_Tree
  5. Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2510.13161v1
  6. Batch Speculative Decoding Done Right – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2510.22876v1
  7. Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference – UPCommons, accessed on December 13, 2025, https://upcommons.upc.edu/bitstreams/82e2be60-b600-4fa1-90ff-08d66f1cac7a/download
  8. The Engineering Guide to Efficient LLM Inference: Metrics, Memory, and Mathematics, accessed on December 13, 2025, https://pub.towardsai.net/the-engineering-guide-to-efficient-llm-inference-metrics-memory-and-mathematics-3aead91c99cc
  9. Inside vLLM: Anatomy of a High-Throughput LLM Inference System – Aleksa Gordić, accessed on December 13, 2025, https://www.aleksagordic.com/blog/vllm
  10. Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios – ResearchGate, accessed on December 13, 2025, https://www.researchgate.net/publication/397983460_Scaling_LLM_Speculative_Decoding_Non-Autoregressive_Forecasting_in_Large-Batch_Scenarios
  11. How to Speed up AI Inference with vLLM Continuous Batching …, accessed on December 13, 2025, https://voice.ai/hub/tts/vllm-continuous-batching/
  12. LLM Inference: Continuous Batching and PagedAttention – Insu Jang, accessed on December 13, 2025, https://insujang.github.io/2024-01-07/llm-inference-continuous-batching-and-pagedattention/
  13. How continuous batching enables 23x throughput in LLM inference while reducing p50 latency – Anyscale, accessed on December 13, 2025, https://www.anyscale.com/blog/continuous-batching-llm-inference
  14. FairBatching: Fairness-Aware Batch Formation for LLM Inference – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2510.14392v1
  15. A Systematic Characterization of LLM Inference on GPUs – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2512.01644v1
  16. [2510.22876] Batch Speculative Decoding Done Right – arXiv, accessed on December 13, 2025, https://arxiv.org/abs/2510.22876
  17. The ragged tensor problem in batch speculative decoding. Differing… | Download Scientific Diagram – ResearchGate, accessed on December 13, 2025, https://www.researchgate.net/figure/The-ragged-tensor-problem-in-batch-speculative-decoding-Differing-numbers-of-accepted_fig1_396966883
  18. Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios – OpenReview, accessed on December 13, 2025, https://openreview.net/attachment?id=h6Ft5NiKMa&name=pdf
  19. Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, accessed on December 13, 2025, https://arxiv.org/html/2410.04466v4
  20. Daily Papers – Hugging Face, accessed on December 13, 2025, https://huggingface.co/papers?q=Speculative%20Diffusion%20Decoding
  21. Daily Papers – Hugging Face, accessed on December 13, 2025, https://huggingface.co/papers?q=streaming%20multi-processor
  22. Batch Speculative Decoding Done Right – ChatPaper, accessed on December 13, 2025, https://chatpaper.com/paper/203837
  23. How to Reduce LLM Spending by 30% Without Sacrificing Performance | by Future AGI, accessed on December 13, 2025, https://medium.com/@future_agi/how-to-reduce-llm-spending-by-30-without-sacrificing-performance-88101ddf8953
  24. Best practices for competitive inference optimization on AMD Instinct™ MI300X GPUs, accessed on December 13, 2025, https://rocm.blogs.amd.com/artificial-intelligence/LLM_Inference/README.html
  25. Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference – arXiv, accessed on December 13, 2025, https://arxiv.org/pdf/2510.13161
  26. Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference – ResearchGate, accessed on December 13, 2025, https://www.researchgate.net/publication/396517284_Mirror_Speculative_Decoding_Breaking_the_Serial_Barrier_in_LLM_Inference
  27. Sydney Blanchard – Database Trends and Applications, accessed on December 13, 2025, https://www.dbta.com/Authors/Sydney-Blanchard-9611.aspx