Executive Summary
The proliferation of Large Language Models (LLMs) supporting context windows extending from 128,000 to over 10 million tokens has fundamentally altered the computational and architectural requirements of inference systems. As of late 2025, the primary bottleneck in serving these “infinite context” models has shifted from compute capabilities (FLOPS) to memory capacity and bandwidth, specifically regarding the Key-Value (KV) cache. The KV cache—the intermediate state required to avoid redundant computation in autoregressive decoding—has become the dominant consumer of GPU memory, often exceeding the aggregate High Bandwidth Memory (HBM) capacity of entire server clusters.
This report provides an exhaustive, expert-level analysis of the distributed systems, algorithms, and hardware architectures developed to manage this “Memory Wall.” We analyze the transition from single-node execution to disaggregated serving, detailing the mechanisms of middleware solutions like LMCache, DistKV-LLM, and NVIDIA Dynamo, alongside algorithmic innovations such as Ring Attention, TokenRing, and DeepSpeed Ulysses. Furthermore, we explore the integration of Compute Express Link (CXL) and Near-Data Processing (NDP) as the critical hardware substrate for next-generation inference. The analysis indicates that the future of LLM serving lies in a disaggregated, memory-centric architecture where the KV cache is treated not as a transient buffer, but as a persistent, distributed asset managed by intelligent fabrics spanning the data center.
1. The Physics of Long-Context Inference and the Memory Wall
To comprehend the necessity of distributed KV cache management, one must first rigorously quantify the resource demands of modern LLM inference. The core mechanism of the Transformer architecture involves the self-attention layer, where each token attends to all previous tokens in the sequence. In the autoregressive generation process, the model predicts the next token based on the entire history. To prevent the computationally prohibitive re-processing of this history for every new token generation, the Key (K) and Value (V) matrices computed for previous tokens are cached in GPU memory.
1.1 The Scaling Laws of KV Memory
The memory footprint of the KV cache scales not with model size alone but with the sequence length ($L$) and batch size ($B$). For a model with $N_{layers}$ layers, $N_{heads}$ attention heads of dimension $d_{head}$ (so hidden dimension $H = N_{heads} \cdot d_{head}$), and element size $P$ (e.g., 2 bytes for FP16), the cache size $M_{KV}$ is governed by the equation:
$$M_{KV} = 2 \cdot B \cdot L \cdot N_{layers} \cdot N_{heads} \cdot d_{head} \cdot P$$
For a 70-billion parameter model (e.g., Llama-3-70B) serving requests with a 1-million token context window, the KV cache requirements are staggering. A single request of 1 million tokens, assuming standard FP16 precision and typical architectural hyperparameters, can require hundreds of gigabytes of memory.1 This far exceeds the HBM capacity of a single NVIDIA H100 (80GB) or even a standard node equipped with 8 GPUs. As context lengths expand to 10M tokens, as seen in models from Google and others, the KV cache, rather than the model weights or activation buffers, becomes the dominant contributor to memory consumption.3
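The arithmetic behind that claim is easy to reproduce. The sketch below simply applies the formula above; the hyperparameters are illustrative Llama-3-70B-like values (80 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16), and for standard multi-head attention the number of KV heads is just $N_{heads}$.

```python
# Back-of-the-envelope KV cache sizing per the formula above. Hyperparameters are
# illustrative (Llama-3-70B-like: 80 layers, 8 KV heads via grouped-query
# attention, head dim 128); exact values vary by model.
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    """M_KV = 2 (K and V) * B * L * N_layers * N_kv_heads * d_head * P."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * d_head * bytes_per_elem

size = kv_cache_bytes(batch=1, seq_len=1_000_000, n_layers=80, n_kv_heads=8, d_head=128)
print(f"{size / 1e9:.0f} GB per request")   # ~328 GB, i.e., several H100s' worth of HBM
```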
This scaling behavior introduces a critical “Memory Wall.” While GPU compute throughput has increased exponentially, memory capacity and bandwidth have lagged. The KV cache footprint scales linearly with context length, but the attention mechanism’s computational complexity scales quadratically ($O(L^2)$) during the prefill phase and linearly during decoding. However, in the regime of long contexts, the sheer volume of data that must be moved from HBM to the Tensor Cores during the decode phase saturates memory bandwidth, making the system bandwidth-bound rather than compute-bound.5
1.2 The Prefill-Decode Dichotomy and Disaggregation
Inference workloads are characterized by two distinct phases with opposing resource profiles, a dichotomy that drives modern system design:
- Prefill Phase (Compute-Bound): The prompt is processed in parallel; all of its tokens are fed into the model simultaneously. The attention matrix is dense, and the GPU’s Tensor Cores are fully saturated performing matrix multiplications. Latency is determined by the raw FLOPS of the hardware.
- Decode Phase (Memory-Bound): Tokens are generated sequentially. For each new token, the model must read the entire KV cache (representing the history) from memory to compute attention scores. The arithmetic intensity—the ratio of floating-point operations to bytes accessed—is extremely low. As the sequence length grows, the time spent moving data dominates the time spent computing.5
This divergence has led to the architectural paradigm of disaggregated serving. In legacy systems, a single GPU handled both phases. In disaggregated architectures, the cluster is divided into “prefill instances” (optimized for compute) and “decode instances” (optimized for memory capacity and bandwidth). The prefill instances generate the initial KV cache, which is then transferred to decode instances for the long-tail generation process. This separation allows for independent scaling of resources but introduces a new bottleneck: the network bandwidth required to transport terabytes of KV cache data between nodes.6
2. Distributed Attention Algorithms
While hardware scaling addresses capacity, the computational complexity of attention over millions of tokens requires specialized algorithmic parallelism. Single-GPU execution is infeasible at these lengths, so the sequence itself must be partitioned across multiple devices. The 2024-2025 period has seen the maturation of three primary strategies: Ring Attention, TokenRing, and DeepSpeed Ulysses.
2.1 Ring Attention: The Blockwise Foundation
Ring Attention represents the foundational breakthrough for infinite-context processing. It circumvents the memory constraints of individual devices by performing self-attention and feedforward computations in a blockwise fashion, distributing the sequence dimension across a logical ring of devices.8
2.1.1 Mechanism and Data Flow
In a Ring Attention setup, the input sequence is partitioned into blocks. Each GPU manages a specific block of Queries (Q), Keys (K), and Values (V). The calculation of attention scores requires every Query to interact with every Key in the global sequence. To achieve this without gathering the full sequence on every device (which would violate memory constraints), the devices pass their KV blocks to their neighbor in the ring.
The process operates in steps. In step $i$, a GPU computes the partial attention scores between its local Query block and the currently resident KV block. Simultaneously, it sends this KV block to the next GPU in the ring and receives a new KV block from the previous GPU. This “stream-and-compute” approach ensures that the full KV cache never needs to be materialized on a single device. The quadratic memory cost is effectively distributed across the cluster.10
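A minimal single-process sketch of this schedule follows, assuming non-causal attention and a single head. The `(rank + step) % n` indexing stands in for the asynchronous send/receive of KV blocks between ring neighbors, and the log-sum-exp merge is what allows partial results from different blocks to be combined exactly.

```python
# Single-process sketch of the Ring Attention schedule (numerics only). The
# rotation (rank + step) % n stands in for the per-step send/recv of KV blocks;
# each iteration of the inner loop would run on a different GPU in practice.
import numpy as np

def local_attention(q, k, v):
    """Partial attention over one KV block, returning (block_out, block_lse)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])               # [q_len, kv_len]
    m = scores.max(axis=-1, keepdims=True)
    lse = (m + np.log(np.exp(scores - m).sum(-1, keepdims=True))).squeeze(-1)
    return np.exp(scores - lse[:, None]) @ v, lse          # softmax(scores) @ V

def ring_attention(q_blocks, kv_blocks):
    """q_blocks[i]: queries owned by rank i; kv_blocks[i]: (K, V) owned by rank i."""
    n = len(q_blocks)
    d_v = kv_blocks[0][1].shape[-1]
    outs = [np.zeros((q.shape[0], d_v)) for q in q_blocks]
    lses = [np.full(q.shape[0], -np.inf) for q in q_blocks]
    for step in range(n):
        for rank in range(n):                              # the "devices", in parallel
            k, v = kv_blocks[(rank + step) % n]            # KV block resident this step
            o, l = local_attention(q_blocks[rank], k, v)
            new_lse = np.logaddexp(lses[rank], l)          # online-softmax merge
            outs[rank] = (outs[rank] * np.exp(lses[rank] - new_lse)[:, None]
                          + o * np.exp(l - new_lse)[:, None])
            lses[rank] = new_lse
    return outs                                            # exact attention, blockwise
```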
2.1.2 Overlapping Communication and Computation
The critical efficiency of Ring Attention stems from its ability to overlap communication with computation. While the GPU’s compute units are calculating attention for the current block, the interconnect (e.g., NVLink or InfiniBand) is transmitting the data for the next block. Ideally, if the computation time exceeds the transmission time, the communication latency is completely hidden. However, as the number of devices ($N$) increases, the sequence chunk size per device decreases, potentially reducing the computation time per step while communication overhead remains constant or decreases linearly, leading to efficiency degradation at extreme scales.11
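A rough roofline-style model makes this degradation concrete. In the sketch below (illustrative numbers, standard multi-head attention, FP16; the bandwidth and TFLOPS figures are placeholders, not measurements), a ring step stays fully overlapped as long as blockwise attention on the current block takes at least as long as the transfer of the next KV block.

```python
# Rough overlap model for one ring step: the P2P transfer of the next KV block is
# hidden when compute on the current block takes at least as long. Illustrative
# placeholder numbers only.
def ring_step_overlap(q_tokens, kv_block_tokens, d_model,
                      link_GBps=450.0, attn_TFLOPS=400.0, bytes_per_elem=2):
    comm_s = 2 * kv_block_tokens * d_model * bytes_per_elem / (link_GBps * 1e9)
    compute_s = 4 * q_tokens * kv_block_tokens * d_model / (attn_TFLOPS * 1e12)
    return compute_s >= comm_s, comm_s, compute_s

# 8K-token chunks on a d_model = 8192 model: compute (~5.5 ms) easily hides comm (~0.6 ms).
# Shrink the chunk 8x (more devices, same context): compute falls to ~86 us while comm
# only falls to ~75 us, because compute shrinks quadratically in the chunk size but
# communication shrinks linearly -- the margin for hiding the transfer vanishes.
hidden, comm_s, compute_s = ring_step_overlap(8_192, 8_192, 8_192)
```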
2.2 TokenRing: Optimizing Bidirectional Bandwidth
While Ring Attention allows for context scaling, it relies on unidirectional peer-to-peer (P2P) communication, typically sending data only to the “next” device. This leaves the reverse bandwidth of full-duplex interconnects (like NVLink) idle. TokenRing addresses this inefficiency by leveraging bidirectional communication to further maximize the overlap of data transmission and computation.11
2.2.1 Concurrent Transmission Architecture
TokenRing partitions the Q, K, and V tensors along the token dimension but fundamentally alters the data flow. Instead of merely rotating KV blocks, TokenRing enables the concurrent transmission of Query blocks and block outputs—specifically block_out (the partial attention output) and block_lse (log-sum-exp statistics required for the Softmax normalization).
By utilizing a fully connected mesh topology or optimized ring variants, TokenRing allows a device to send its Query block to a neighbor while simultaneously receiving a Query block from another, or exchanging partial results. This effectively doubles the utilized bandwidth compared to standard Ring Attention. The algorithm is particularly effective in reducing the “bubble” time—the idle time associated with the fill-up and drain phases of the pipeline. Experimental results demonstrate that TokenRing significantly enhances throughput and reduces communication latency, particularly in scenarios where the communication-to-computation ratio is high.11
2.3 DeepSpeed Ulysses: The All-to-All Approach
DeepSpeed Ulysses adopts a fundamentally different parallelism strategy. Instead of rotating data in a ring, it utilizes collective communication primitives to partition the computation by attention heads rather than by sequence blocks during the attention phase.13
2.3.1 The Transpose Mechanism
In the Ulysses architecture, the input sequence is initially partitioned across GPUs (Sequence Parallelism). When the attention operation begins, the system triggers an All-to-All collective communication operation. This “transpose” operation reshuffles the data such that each GPU receives the full sequence but only for a specific subset of attention heads.
For example, if there are 8 GPUs and 64 attention heads, each GPU receives the full sequence for 8 heads. This allows the GPU to execute the standard FlashAttention kernel (or any optimized local attention kernel) without modification, as it has all the necessary K and V data for its assigned heads locally. Once the attention computation is complete, another All-to-All transpose returns the output to the original sequence-partitioned layout for the subsequent Feed-Forward Network (FFN) layers.14
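A shape-level sketch of this "transpose" is shown below with simulated ranks in a single process; in a real deployment the slicing and concatenation are performed by an all-to-all collective. The dimensions match the example above (8 ranks, 64 heads).

```python
# Shape-level sketch of the Ulysses all-to-all. Before: each rank holds its
# sequence shard for every head, [L/P, H, d]. After: each rank holds the full
# sequence for its own subset of heads, [L, H/P, d].
import numpy as np

P, L, H, d = 8, 64, 64, 128                       # ranks, tokens, heads, head dim
shards = [np.random.randn(L // P, H, d) for _ in range(P)]

def ulysses_all_to_all(shards):
    P = len(shards)
    hpr = shards[0].shape[1] // P                  # heads per rank
    # Rank r gathers, from every rank s, that rank's tokens for the heads r owns.
    return [np.concatenate([s[:, r * hpr:(r + 1) * hpr, :] for s in shards], axis=0)
            for r in range(P)]

gathered = ulysses_all_to_all(shards)
assert gathered[0].shape == (L, H // P, d)         # full sequence, 8 of the 64 heads
# gathered[r] can now be fed to an unmodified local attention kernel; a second
# all-to-all restores the [L/P, H, d] layout for the following FFN layers.
```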
2.3.2 Trade-offs and Constraints
The primary advantage of Ulysses is its kernel agnosticism; it does not require specialized “ring” kernels and can leverage the highly optimized FlashAttention-3 immediately upon release. However, its scalability is strictly limited by the number of attention heads. The parallelism degree cannot exceed the number of heads, which can be a constraint for certain model architectures (e.g., those using Grouped Query Attention with few KV heads). Furthermore, the All-to-All operation places immense stress on the network bisection bandwidth. Unlike Ring Attention, which utilizes neighbor-to-neighbor links (efficient on linear topologies), Ulysses requires a high-bandwidth, fully connected fabric (like NVSwitch) to perform efficiently. In bandwidth-constrained environments (e.g., Ethernet), Ulysses often underperforms Ring Attention.13
2.4 Comparative Analysis of Parallelism Strategies
The choice between these algorithms depends heavily on the underlying hardware topology and the specific model architecture.
| Feature | Ring Attention | TokenRing | DeepSpeed Ulysses |
| --- | --- | --- | --- |
| Partitioning Axis | Sequence Dimension | Sequence Dimension | Sequence (Input) $\rightarrow$ Heads (Compute) |
| Communication Pattern | P2P Ring (Unidirectional) | P2P Mesh/Ring (Bidirectional) | All-to-All Collective |
| Network Requirement | Neighbor-to-Neighbor (linear) | Mesh / Bidirectional Ring | High Bisection Bandwidth (NVSwitch) |
| Overlap Potential | High (KV comms hidden by compute) | Very High (Utilizes bidirectional BW) | Moderate (Harder to hide large collectives) |
| Scalability Limit | Device Count (Context Length) | Device Count | Number of Attention Heads |
| Kernel Complexity | High (Requires custom Ring kernels) | High (Custom communication schedule) | Low (Standard FlashAttention) |
| Primary Bottleneck | Latency of P2P steps | Orchestration complexity | Network Bisection Bandwidth |
Table 1: Comparative Analysis of Distributed Sequence Parallelism Strategies for Long-Context Inference.
3. Middleware Architectures and Disaggregated Serving
While distributed attention algorithms solve the compute problem, the management of the KV cache data itself—its storage, retrieval, and movement—requires a sophisticated middleware layer. The transition to disaggregated serving, where prefill and decode occur on different nodes, necessitates a “storage-centric” view of inference.
3.1 LMCache: The Unified Storage Substrate
LMCache has emerged as a critical open-source solution designed to abstract the complexities of KV cache management. It functions as a middleware layer sitting between the inference engine (e.g., vLLM, SGLang) and the backend storage hierarchy. Its primary innovation is transforming the KV cache from an internal engine state into a first-class storage primitive that can be shared across engines and queries.16
3.1.1 Architectural Decomposition
LMCache is architected to keep pace with the rapid evolution of inference engines, whose internal memory layouts change frequently as new models arrive (on the order of 15-20 per week). It employs a Connector pattern:
- KV Connector: This component interfaces directly with the inference engine (e.g., vLLM). It captures the GPU memory addresses of KV pages. Crucially, it decouples the engine’s internal memory management from the storage logic, allowing LMCache to support new engines without rewriting the storage backend.16
- Token Processor: This module implements the logic for prefix caching. It analyzes incoming requests to identify “new” tokens versus “redundant” tokens that match existing prefixes in the storage. This is vital for RAG workflows where a massive system prompt or document is reused across many user queries.17
- Storage Manager & Hierarchy: LMCache manages a multi-tiered storage hierarchy including local CPU memory, local NVMe, remote persistent disk, and Redis. The system intelligently promotes and demotes KV blocks based on access frequency; a minimal sketch of this tiered lookup follows this list.
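The following is a hypothetical sketch of the tiered lookup idea only; the class and method names are invented here and are not LMCache's actual API. KV chunks are keyed by a hash of their token prefix, probed from the fastest tier to the slowest, and promoted toward CPU memory on a hit.

```python
# Hypothetical tiered KV store (illustrative only; not the LMCache API).
import hashlib

class TieredKVStore:
    def __init__(self):
        # Probed fastest -> slowest; real backends include CPU RAM, local NVMe,
        # remote persistent disk, and Redis.
        self.tiers = {"cpu_ram": {}, "local_nvme": {}, "remote": {}}

    @staticmethod
    def chunk_key(token_ids):
        return hashlib.sha256(str(list(token_ids)).encode()).hexdigest()

    def get(self, token_ids):
        key = self.chunk_key(token_ids)
        for tier in self.tiers.values():
            if key in tier:
                self.tiers["cpu_ram"][key] = tier[key]     # promote the hot chunk
                return tier[key]
        return None                                        # miss: prefill must recompute

    def put(self, token_ids, kv_tensors, tier="cpu_ram"):
        self.tiers[tier][self.chunk_key(token_ids)] = kv_tensors
```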
3.1.2 Optimization Mechanisms
LMCache implements several performance-critical optimizations. Layer-wise pipelining allows the transfer of KV cache for layer $N+1$ to occur simultaneously with the computation of layer $N$. This hides the significant latency of fetching data from remote storage. Additionally, it supports delayed decode storing, where small, granular page updates from the decode phase are aggregated into larger chunks before being committed to storage, reducing the I/O operations per second (IOPS) overhead on the storage backend.16 The system also utilizes zero-copy mechanisms to move data between the network interface card (NIC) and GPU memory, bypassing the CPU to minimize latency.16
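The pipelining pattern itself is simple to express. The sketch below is an abstract illustration using a background thread; `fetch_layer_kv` and `compute_layer` are placeholder callables rather than LMCache functions, and a production system would overlap the transfer with CUDA streams and GPUDirect rather than Python threads.

```python
# Abstract sketch of layer-wise pipelining: while layer n computes, the KV chunk
# for layer n+1 is fetched in the background so the fetch latency is hidden.
from concurrent.futures import ThreadPoolExecutor

def run_with_layerwise_prefetch(n_layers, fetch_layer_kv, compute_layer):
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_layer_kv, 0)                  # start fetching layer 0
        for layer in range(n_layers):
            kv = pending.result()                               # wait for this layer's KV
            if layer + 1 < n_layers:
                pending = io.submit(fetch_layer_kv, layer + 1)  # overlap the next fetch
            compute_layer(layer, kv)                            # compute hides the fetch
```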
3.2 NVIDIA Dynamo and NIXL
For enterprise-grade, high-performance deployments, NVIDIA Dynamo provides a comprehensive framework for distributed inference. It addresses the “difficult UX” of manually managing distributed state by providing a unified control plane.
3.2.1 The NIXL Transport Layer
The engine of Dynamo is the NVIDIA Inference Transfer Library (NIXL). NIXL is a specialized communication library optimized for the bursty, latency-sensitive traffic patterns of inference. Unlike generic MPI or TCP stacks, NIXL is aware of the GPU memory hierarchy. It abstracts the underlying physical transport (NVLink, InfiniBand, Ethernet) and supports a plugin architecture for various storage backends.6
A key feature of NIXL is its support for GPUDirect Storage and RDMA. This enables zero-copy transfers where KV blocks are streamed directly from a storage appliance (e.g., a WEKA data grid) to the GPU HBM, completely bypassing the host CPU. This architecture eliminates the “jitter” caused by CPU OS scheduling and frees up the CPU to handle the complex logic of the Dynamo Planner. NIXL essentially turns the storage layer into an extension of the GPU memory.19
3.2.2 The Smart Router and Radix Trees
Dynamo employs a Smart Router that fundamentally changes how requests are distributed. Traditional load balancers use Round Robin or Least Connections algorithms. The Smart Router, however, is KV-cache aware. It maintains a global index of which KV blocks reside on which GPU workers, structured as a Radix Tree. When a new request arrives, the router queries this tree to find the worker that already holds the longest matching prefix (e.g., a cached system prompt or document). It then routes the request to that specific worker. This maximizes the cache hit rate, drastically reducing the need for the prefill phase and lowering the Time To First Token (TTFT).20
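A simplified version of this routing logic is sketched below. For brevity it uses a token-level prefix trie rather than a compressed radix tree, and the class names are invented for illustration: cached prefixes are registered per worker, and an incoming prompt is routed to a worker found at the deepest matching node.

```python
# Simplified KV-cache-aware router: cached prefixes live in a token-level trie
# (a compressed radix tree in the real system); requests go to the worker that
# already holds the longest matching prefix.
class PrefixNode:
    def __init__(self):
        self.children = {}
        self.workers = set()              # workers caching the prefix ending here

class KVAwareRouter:
    def __init__(self):
        self.root = PrefixNode()

    def register(self, token_ids, worker_id):
        node = self.root
        for t in token_ids:               # every prefix of a cached sequence is cached
            node = node.children.setdefault(t, PrefixNode())
            node.workers.add(worker_id)

    def route(self, token_ids, default_worker):
        node, best = self.root, (0, default_worker)
        for depth, t in enumerate(token_ids, start=1):
            if t not in node.children:
                break
            node = node.children[t]
            if node.workers:
                best = (depth, next(iter(node.workers)))   # deepest match so far
        return best                        # (matched prefix length, chosen worker)
```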
3.3 DistKV-LLM: Dynamic Memory Orchestration
DistKV-LLM proposes a decentralized approach to memory management, viewing the entire cluster’s memory as a unified pool.
3.3.1 The Coherence Protocol and rBlocks
DistKV-LLM segments the KV cache into rBlocks (replicated blocks), which are manageable sub-units that can be independently stored, migrated, and retrieved. The system implements a coherence protocol similar to CPU cache coherency (MESI), ensuring that when a KV block is updated during generation, all distributed copies are invalidated or updated. This allows the system to support complex beam search or parallel sampling patterns where multiple sequences diverge from a common prefix.7
3.3.2 Proactive Memory Seeking
Unlike systems that statically partition memory, DistKV-LLM allows an instance facing a memory deficit (e.g., due to a surprisingly long generation) to proactively seek and reserve memory on less burdened instances. This dynamic elasticity ensures that a single long-context query does not crash a node due to Out-Of-Memory (OOM) errors, but rather spills over transparently to available resources elsewhere in the datacenter.7
3.4 llm-d: Kubernetes-Native Integration
llm-d provides the bridge between these high-performance inference techniques and the standard cloud-native operating model, integrating deeply into the Kubernetes networking stack.
3.4.1 The External Processing Pod (EPP)
llm-d introduces an External Processing Pod (EPP) that hooks into the Kubernetes Gateway API. The EPP inspects the headers and body of incoming HTTP requests to extract prompt metadata. It then queries the cluster state to score available pods based on “cache warmth”—a metric indicating how much of the required KV cache is already resident. This routing logic happens at the ingress layer, ensuring that requests are directed to the optimal pod before they even reach the model server.22
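A hypothetical scoring heuristic in this spirit might combine cache warmth with queue depth, as sketched below; the weights and field names are invented for illustration and are not llm-d configuration.

```python
# Hypothetical "cache warmth" scoring: prefer pods that already hold more of the
# prompt's KV prefix, penalized by their current queue depth. Weights are
# illustrative, not llm-d defaults.
def score_pod(matched_prefix_tokens, prompt_tokens, queued_requests,
              warmth_weight=1.0, load_weight=0.2):
    warmth = matched_prefix_tokens / max(prompt_tokens, 1)   # fraction already cached
    return warmth_weight * warmth - load_weight * queued_requests

def pick_pod(candidates):
    """candidates: iterable of (pod_name, matched_prefix_tokens, prompt_tokens, queued)."""
    return max(candidates, key=lambda c: score_pod(c[1], c[2], c[3]))[0]
```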
3.4.2 Heterogeneous Hardware Orchestration
A standout feature of llm-d is its support for heterogeneous hardware. It can orchestrate a cluster where NVIDIA H100s are dedicated to the compute-intensive prefill phase, while older A100s or even CPU-based nodes handle the memory-bound decode phase or store cold KV blocks. This tiered architecture allows organizations to optimize Total Cost of Ownership (TCO) by matching hardware capabilities to the specific physics of each inference phase.24
4. Hardware Acceleration: CXL and Near-Data Processing
As software pushes the limits of existing interconnects, hardware architectures in 2025 are undergoing a revolution centered on Compute Express Link (CXL). This open standard allows for the disaggregation of memory from the compute node, addressing the fundamental capacity limitations of HBM.
4.1 CXL-Enabled Memory Expansion (Beluga)
CXL enables GPUs to access remote memory with load/store semantics, just like local DRAM, but over a PCIe-based fabric. Systems like Beluga leverage this to create a massive, shared KV cache pool.
In the Beluga architecture, a CXL memory pool is attached to the GPU server via a CXL switch. This pool appears to the OS and the GPU as a vast extension of physical memory. When the GPU HBM is full, KV blocks are evicted to the CXL pool. Crucially, because CXL supports memory semantics (load/store), the GPU can access this data at cache-line granularity without the high overhead of block-based IO calls required for NVMe SSDs. Beluga demonstrates that CXL memory can reduce the Time-To-First-Token (TTFT) by nearly 90% compared to swapping to local SSDs, effectively breaking the memory capacity barrier for single-node serving.25
4.2 CXL-NDP: Processing-Near-Memory
While CXL solves the capacity problem, the bandwidth of the CXL link (typically PCIe Gen5 or Gen6 speeds) is significantly lower than HBM bandwidth. Moving terabytes of KV cache across the CXL link for every token generation can saturate the bus and stall the GPU. CXL-NDP (Near-Data Processing) solves this by moving the compute to the data.
4.2.1 Offloading Attention Scores
In a CXL-NDP architecture, the CXL memory module is equipped with lightweight compute units (e.g., RISC-V cores or small accelerators). Instead of fetching the entire Key matrix to the GPU to compute attention scores, the GPU sends the Query vector to the CXL device. The CXL device computes the dot products between the Query and the locally stored Keys and returns only the resulting attention scores (or the top-k highest scores) to the GPU.
This operation reduces the data movement by orders of magnitude. The massive Key matrix never leaves the CXL module; only the tiny Query vector and the resulting scores traverse the interconnect. This architecture allows the system to scale to millions of tokens without being bottlenecked by the CXL bandwidth.4
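Functionally, the offload amounts to the exchange sketched below: the query vector crosses the link, the dot products against the resident Keys happen next to the memory, and only the top-k candidates return for the GPU to finish an approximate softmax. Class and method names are placeholders, not a real device API.

```python
# Sketch of query offload to a near-data processor (single head, numpy).
# Only `query` and k rows come back across the CXL link; the Key matrix stays put.
import numpy as np

class NDPKVModule:
    """Stand-in for a CXL memory module with a lightweight compute unit."""
    def __init__(self, keys, values):            # [L, d] each, resident in CXL DRAM
        self.keys, self.values = keys, values

    def topk_attend(self, query, k=256):
        scores = self.keys @ query               # dot products computed near the data
        idx = np.argpartition(scores, -k)[-k:]   # top-k token positions
        return scores[idx], self.values[idx]     # only k scores + k value rows return

def sparse_attention_on_gpu(query, ndp, k=256):
    s, v = ndp.topk_attend(query, k)
    w = np.exp(s - s.max()); w /= w.sum()        # softmax over the k candidates only
    return w @ v                                 # approximate attention output
```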
4.2.2 Transparent Hardware Compression
Advanced CXL-NDP controllers also implement transparent lossless compression. As KV blocks are written to the CXL memory, a hardware engine compresses them using algorithms like LZ4 or ZSTD. They are decompressed on the fly when read. Furthermore, precision-scalable bit-plane layouts allow for dynamic quantization. For example, the system might retrieve only the most significant bits of the Value vectors for the initial attention pass, fetching the full precision only for the top-ranking tokens. This effectively amplifies the available bandwidth, improving throughput by over 40% in memory-constrained scenarios.27
5. Algorithmic Compressions and Hybrid Architectures
While distributed systems manage the “supply” of memory, algorithmic research focuses on reducing the “demand.” Innovations in 2025 are moving away from brute-force caching toward smarter, compressed representations of context.
5.1 Infini-attention and Compressive Memory
Infini-attention introduces a modification to the standard Transformer block to support effectively infinite context within a bounded memory footprint. It integrates a compressive memory module into the attention mechanism.
Instead of storing the discrete KV pairs for the entire history, Infini-attention maintains a fixed-size buffer. As the attention window slides forward, old KV states are not discarded but are “compressed” into this memory buffer using a linear attention mechanism. The model reuses the Query, Key, and Value states to update this compressed representation. During retrieval, the model attends to both the local, high-resolution context (standard attention) and the global, compressed context (linear attention) simultaneously. This allows the model to recall information from millions of tokens back without maintaining a multi-terabyte cache.29
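A compact sketch of the update/retrieve cycle is given below for a single head, assuming the ELU+1 feature map commonly used in linear-attention formulations; the real layer additionally learns a gate that mixes this memory read with local softmax attention.

```python
# Compressive-memory sketch in the spirit of Infini-attention (single head).
# Old KV segments are folded into a fixed-size associative matrix instead of
# being kept as discrete cache entries.
import numpy as np

def elu_plus_one(x):                              # feature map for linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

class CompressiveMemory:
    def __init__(self, d_key, d_value):
        self.M = np.zeros((d_key, d_value))       # fixed-size associative matrix
        self.z = np.zeros(d_key)                  # normalization accumulator

    def update(self, K, V):                       # fold an old segment into memory
        sK = elu_plus_one(K)                      # [seg_len, d_key]
        self.M += sK.T @ V                        # memory changes value, not size
        self.z += sK.sum(axis=0)

    def retrieve(self, Q):                        # read global context for new queries
        sQ = elu_plus_one(Q)                      # [q_len, d_key]
        return (sQ @ self.M) / (sQ @ self.z + 1e-6)[:, None]
```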
5.2 Hybrid SSM-Transformer Models (Jamba)
Jamba 1.5 represents a shift toward Hybrid SSM-Transformer architectures. Pure Transformers pay quadratic attention compute and a KV cache that grows linearly with context length. State Space Models (SSMs) like Mamba scale linearly in compute and maintain a constant-size state—they do not grow a KV cache at all.
Jamba interleaves Mamba layers with standard Transformer layers. For example, a model might have one Transformer layer for every seven Mamba layers. The Mamba layers handle the bulk of the temporal processing with a fixed, small memory state. The sparse Transformer layers provide the “associative recall” capability that SSMs sometimes lack. This hybrid design allows Jamba-1.5-Large to achieve an effective context length of 256K tokens while fitting on a single 8-GPU node, a feat that would require massive clusters for a pure Transformer of equivalent parameter count. The KV cache footprint is reduced to a fraction (e.g., 1/8th) of a standard model.32
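The memory arithmetic follows directly from the sizing formula in Section 1.1: only the attention layers contribute KV cache, so a 1:7 attention-to-Mamba ratio cuts the footprint to roughly one eighth, as the sketch below illustrates with hypothetical layer counts and head dimensions.

```python
# Only the Transformer (attention) layers of a hybrid stack hold KV cache; the
# Mamba layers carry a small constant-size recurrent state instead.
def hybrid_kv_bytes(total_layers, attn_every, tokens, n_kv_heads, d_head, p=2):
    attn_layers = total_layers // attn_every           # e.g., 1 attention layer per 8
    return 2 * tokens * attn_layers * n_kv_heads * d_head * p

full = hybrid_kv_bytes(72, 1, 256_000, 8, 128)          # hypothetical pure-Transformer stack
hybrid = hybrid_kv_bytes(72, 8, 256_000, 8, 128)        # 1:7 attention:Mamba interleaving
print(hybrid / full)                                    # 0.125 -> ~1/8 of the footprint
```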
5.3 Advanced Eviction Policies
For pure Transformers, eviction policies determine which tokens can be safely discarded. Research in 2025 has moved from simple heuristics toward semantic-aware eviction.
- H2O and Scissorhands: These algorithms are based on the observation of “heavy hitter” tokens. Empirically, a small percentage of tokens (often punctuation or specific semantic anchors) receive the vast majority of attention mass. H2O dynamically identifies these tokens and retains them, while evicting the “long tail” of low-importance tokens. This can reduce cache size by 80% with negligible accuracy loss.3 (A minimal sketch of this style of eviction follows this list.)
- vLLM’s LRU Policy: vLLM implements a robust block-manager-based eviction. Under memory pressure, it first evicts blocks with a reference count of zero (blocks not currently part of any active request’s beam search) and, among those, applies a Least Recently Used (LRU) policy. This ensures that shared prefixes (like system prompts) are kept in memory as long as possible.35
- Anchor Direction (AnDPro): New research suggests that eviction should not be based solely on attention weights, but on the spatial relationship of token value states. AnDPro projects value vectors onto an “Anchor Direction” to determine their semantic contribution to the output, providing a more robust metric for eviction than simple attention scores.36
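The sketch below illustrates the heavy-hitter style of eviction referenced above: cumulative attention mass is tracked per cached token, and when the cache exceeds its budget the policy keeps a recent window plus the highest-scoring older tokens. It is a simplification in the spirit of H2O, not its reference implementation, and assumes the budget exceeds the recent window.

```python
# Heavy-hitter eviction sketch (simplified H2O-style policy; assumes
# budget > recent_window). acc_attention[i] accumulates, every decode step, the
# attention weight the new query placed on cached token i (summed over heads).
import numpy as np

def tokens_to_keep(acc_attention, cache_len, budget, recent_window=128):
    """Return sorted indices of cached tokens to retain (at most `budget`)."""
    if cache_len <= budget:
        return np.arange(cache_len)
    recent = np.arange(cache_len - recent_window, cache_len)   # always keep the tail
    n_heavy = budget - recent_window
    older = acc_attention[:cache_len - recent_window]
    heavy = np.argsort(older)[-n_heavy:]                       # largest cumulative mass
    return np.sort(np.concatenate([heavy, recent]))
```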
6. Integration and Future Outlook
The landscape of 2025 is defined by the convergence of these technologies into unified “Inference Operating Systems.” The siloed optimizations of 2023—a better kernel here, a scheduler there—have merged into cohesive stacks.
6.1 The Convergence of Systems
A production-grade inference stack in 2025 integrates these layers:
- Orchestration: llm-d runs on Kubernetes, using the EPP to route requests based on LMCache state availability.
- Engine: vLLM or TensorRT-LLM executes the model, utilizing Ring Attention or FlashAttention-3 kernels.
- State Layer: LMCache acts as the unified state store, utilizing NIXL to stream data over InfiniBand or NVLink Switch.
- Hardware: CXL 3.0 pools provide elastic capacity, with NDP modules offloading the pre-filtering of attention scores for ultra-long contexts.
6.2 DeepSpeed’s Democratization
DeepSpeed continues to play a vital role in democratizing access. Its ZeRO-Inference updates in 2025 focus on making these capabilities accessible on commodity hardware. By combining 4-bit quantization of weights and KV cache offloading to CPU RAM, DeepSpeed allows researchers to run infinite-context models on single-node setups that would previously require a supercomputer. While latency is higher than the optimized Dynamo stack, the throughput-per-dollar ratio makes it the standard for batch processing and academic research.37
6.3 Strategic Implications
The ability to handle infinite contexts changes the economic model of AI. Retrieval Augmented Generation (RAG) is undergoing a paradigm shift. Previously, RAG was a search problem: finding the top-5 chunks to feed a limited context window. Now, RAG is becoming a “context stuffing” problem: loading entire corpora (books, codebases, legal libraries) into the model’s active memory.
The efficiency of the distributed KV management system—how fast it can load these contexts and how cheaply it can store them—directly determines the viability of this new paradigm. Systems that can effectively pipeline memory transfers (like Ring Attention) and manage tiered storage (like LMCache) will dominate. The “Memory Wall” has not stopped the expansion of AI; it has merely redefined the challenge from one of computation to one of logistics—the logistics of moving intelligent state across the datacenter.
Conclusion
The transition to infinite-context LLMs has necessitated a complete re-architecture of the inference stack. The “Memory Wall” is being dismantled not by a single breakthrough, but by a concerted effort across algorithms, middleware, and hardware. Ring Attention and TokenRing have solved the distributed compute problem, allowing sequence parallelism to scale. LMCache and DistKV-LLM have transformed the KV cache from a transient buffer into a persistent, managed asset. NVIDIA Dynamo and NIXL have optimized the transport layer to near-physical limits. Finally, CXL and Near-Data Processing are redefining the hardware topology, blurring the line between memory and storage. As we move through 2025, the defining characteristic of a state-of-the-art AI system is no longer just its FLOPS, but its ability to orchestrate the distributed memory of the entire data center as a single, coherent cognitive surface.
Report compiled by:
Senior Principal Systems Architect, AI Infrastructure Division
December 2025
Works cited
- SCBench: A KV Cache-Centric Analysis of Long-Context Methods – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2412.10319v2
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, accessed on December 13, 2025, https://www.stat.berkeley.edu/~mmahoney/pubs/neurips-2024-kvquant.pdf
- FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference – ACL Anthology, accessed on December 13, 2025, https://aclanthology.org/2025.findings-emnlp.515.pdf
- Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2511.00321v1
- KV Caching with vLLM, LMCache, and Ceph – llm-d, accessed on December 13, 2025, https://llm-d.ai/blog/kv-caching-vllm-lmcache-ceph
- High Level Architecture — NVIDIA Dynamo Documentation, accessed on December 13, 2025, https://docs.nvidia.com/dynamo/latest/design_docs/architecture.html
- Efficient LLM Service for Long Context with DistAttention and Distributed KVCache – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2401.02669v1
- RingAttention with Blockwise Transformers for Near-Infinite Context – OpenReview, accessed on December 13, 2025, https://openreview.net/forum?id=WsRHpHH4s0
- RingAttention with Blockwise Transformers for Near-Infinite Context – ICLR Proceedings, accessed on December 13, 2025, https://proceedings.iclr.cc/paper_files/paper/2024/file/1119587863e78451f080da2a768c4935-Paper-Conference.pdf
- Ring Attention with Blockwise Transformers for Near-Infinite Context – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2310.01889v1
- TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2412.20501v1
- TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication | Request PDF – ResearchGate, accessed on December 13, 2025, https://www.researchgate.net/publication/387540476_TokenRing_An_Efficient_Parallelism_Framework_for_Infinite-Context_LLMs_via_Bidirectional_Communication
- Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2510.17896v1
- Ultra-Long Sequence Parallelism: Ulysses + Ring-Attention Technical Principles and Implementation – Hugging Face, accessed on December 13, 2025, https://huggingface.co/blog/exploding-gradients/ulysses-ring-attention
- Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism | OpenReview, accessed on December 13, 2025, https://openreview.net/forum?id=W7sVYFJAEp
- An Efficient KV Cache Layer for Enterprise-Scale LLM … – LMCache, accessed on December 13, 2025, https://lmcache.ai/tech_report.pdf
- LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2510.09665v1
- ai-dynamo/nixl: NVIDIA Inference Xfer Library (NIXL) – GitHub, accessed on December 13, 2025, https://github.com/ai-dynamo/nixl
- WEKA Accelerates AI Inference with NVIDIA Dynamo and NVIDIA NIXL, accessed on December 13, 2025, https://www.weka.io/blog/ai-ml/weka-accelerates-ai-inference-with-nvidia-dynamo-and-nvidia-nixl/
- NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models | NVIDIA Technical Blog, accessed on December 13, 2025, https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/
- [R] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache – Reddit, accessed on December 13, 2025, https://www.reddit.com/r/MachineLearning/comments/191iqxj/r_infinitellm_efficient_llm_service_for_long/
- Master KV cache aware routing with llm-d for efficient AI inference – Red Hat Developer, accessed on December 13, 2025, https://developers.redhat.com/articles/2025/10/07/master-kv-cache-aware-routing-llm-d-efficient-ai-inference
- llm-d Architecture, accessed on December 13, 2025, https://llm-d.ai/docs/architecture
- Why vLLM is the best choice for AI inference today – Red Hat Developer, accessed on December 13, 2025, https://developers.redhat.com/articles/2025/10/30/why-vllm-best-choice-ai-inference-today
- Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2511.20172v1
- [2511.20172] Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management – arXiv, accessed on December 13, 2025, https://arxiv.org/abs/2511.20172
- Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2509.03377v2
- [2509.03377] Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing – arXiv, accessed on December 13, 2025, https://arxiv.org/abs/2509.03377
- A failed experiment: Infini-Attention, and why we should keep trying? – Hugging Face, accessed on December 13, 2025, https://huggingface.co/blog/infini-attention
- Infinite Attention: Scaling LLMs for Context Understanding – ZONTAL, accessed on December 13, 2025, https://zontal.io/infinite-attention-scaling-language-models/
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention – arXiv, accessed on December 13, 2025, https://arxiv.org/abs/2404.07143
- JAMBA: HYBRID TRANSFORMER-MAMBA LANGUAGE MODELS – ICLR Proceedings, accessed on December 13, 2025, https://proceedings.iclr.cc/paper_files/paper/2025/file/a9ed43fa31dc8b4a7d7a673d713dcb5f-Paper-Conference.pdf
- Hymba: A Hybrid-head Architecture for Small Language Models – Jan Kautz, accessed on December 13, 2025, https://www.jankautz.com/publications/Hymba_ICLR25.pdf
- LLM Inference Series: 4. KV caching, a deeper look | by Pierre Lienhart | Medium, accessed on December 13, 2025, https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8
- Implementation — vLLM, accessed on December 13, 2025, https://docs.vllm.ai/en/v0.6.0/automatic_prefix_caching/details.html
- Accurate KV Cache Eviction via Anchor Direction Projection for Efficient LLM Inference, accessed on December 13, 2025, https://neurips.cc/virtual/2025/poster/117838
- ZeRO-Inference: 20X faster inference through weight quantization and KV cache offloading – GitHub, accessed on December 13, 2025, https://github.com/deepspeedai/DeepSpeedExamples/blob/master/inference/huggingface/zero_inference/README.md
- ZeRO-Inference: Democratizing massive model inference – DeepSpeed, accessed on December 13, 2025, https://www.deepspeed.ai/2022/09/09/zero-inference.html
- Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache, accessed on December 13, 2025, https://developer.nvidia.com/blog/optimizing-inference-for-long-context-and-large-batch-sizes-with-nvfp4-kv-cache/
