The Stateful Turn: Evolution of Prefix and Prompt Caching in Large Language Model Architectures

1. The Contextual Crisis in Agentic Artificial Intelligence

The widespread deployment of Large Language Models (LLMs) is precipitating a fundamental architectural shift in how artificial intelligence is served, priced, and engineered. As the industry graduates from ephemeral, single-turn chat interfaces to persistent, multi-turn agentic workflows, the underlying stateless paradigms of model inference are collapsing under the weight of their own inefficiency. We are witnessing the end of “amnesiac” intelligence—where models must be re-taught the entire context of a conversation with every new interaction—and the rise of Stateful Inference, enabled by the rapid evolution of Prefix and Prompt Caching.

This transformation is driven by a critical economic and computational bottleneck known as the “100:1” asymmetry. In advanced agent workflows, the ratio of input tokens (context reading) to output tokens (generation) frequently exceeds 100:1.1 An autonomous coding agent, for instance, must ingest thousands of lines of documentation, file structures, and previous debugging attempts (the “100”) merely to output a single line of corrected code (the “1”). In a traditional, stateless inference model, this massive context is re-processed for every single step of the agent’s reasoning loop. This redundancy is not merely inefficient; it is the primary driver of latency and the dominant cost center for enterprise AI, creating a “context tax” that scales linearly with the complexity of the task.3

The industry’s response has been a bifurcated evolution of caching technologies. On one side, proprietary model providers like Anthropic and Google have productized caching as a feature of their APIs, introducing novel pricing models that drastically discount cached inputs—up to 90% in the case of Claude 3.5 Sonnet.5 On the other side, the open-source ecosystem, led by frameworks such as vLLM, SGLang, and TensorRT-LLM, has engaged in an algorithmic arms race, developing sophisticated memory management techniques like PagedAttention and RadixAttention to optimize throughput on private infrastructure.6

This report provides an exhaustive analysis of this technological inflection point. We will dissect the theoretical mechanics of Key-Value (KV) caching, contrast the competing algorithmic approaches of hashing versus tree-structured memory, analyze the divergent product strategies of major cloud providers, and explore the emerging frontier of Agentic Plan Caching (APC), which seeks to cache not just the data, but the reasoning process itself.

2. The Physics of Inference: Why Caching is Non-Negotiable

To understand the necessity of prefix caching, one must first deconstruct the computational physics of the Transformer architecture, specifically the distinction between the prefill and decode phases.

2.1 The Attention Bottleneck and HBM Bandwidth

The Transformer architecture relies on the self-attention mechanism to determine the relationship between tokens in a sequence. For every token processed, the model generates three vectors: Query ($Q$), Key ($K$), and Value ($V$). The attention score is derived from the dot product of the current $Q$ vector with the $K$ vectors of all preceding tokens. This operation is fundamentally memory-intensive.

In the prefill phase, the model processes the user’s prompt. While this can be parallelized effectively across GPU Tensor Cores, it requires loading the massive model weights and writing the resulting KV vectors to the GPU’s High Bandwidth Memory (HBM). As context lengths grow—now reaching 1 million or 2 million tokens—the sheer volume of data movement becomes the bottleneck. The “Time-to-First-Token” (TTFT)—the latency the user perceives before the model begins writing—is largely determined by how fast the hardware can crunch through this initial context.8

In the decode phase, the model generates the response one token at a time. This is an autoregressive process: to generate token $N+1$, the model needs the Keys and Values of tokens $1$ through $N$. Without caching, the model would have to recompute those vectors for the entire sequence on every step. The KV Cache was introduced to solve this for a single request: it stores the K and V vectors in HBM so they can be reused during generation.
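The mechanics are easy to see in a stripped-down, single-head sketch (NumPy, purely illustrative; production engines fuse this logic into batched GPU kernels):

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])      # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 64
K_cache, V_cache = [], []                      # grows by one entry per token

def decode_step(q_new, k_new, v_new):
    """Append this token's K/V, then attend over the whole history.
    Without the cache, K and V for every earlier token would have to be
    recomputed from the hidden states on each decode step."""
    K_cache.append(k_new)
    V_cache.append(v_new)
    return attend(q_new, np.stack(K_cache), np.stack(V_cache))
```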

However, traditional KV caching is ephemeral. Once the request is finished, the cache is discarded. If the user immediately sends a follow-up question (“Can you explain that in more detail?”), the server receives the original prompt plus the new question. In a stateless architecture, it treats this as a brand-new sequence, forcing the GPU to re-compute the K and V vectors for the entire conversation history from scratch.

2.2 The “100:1” Asymmetry in Agentic Loops

This redundancy is manageable for short chats. It is catastrophic for agents. Autonomous agents operate in loops: Observation $\rightarrow$ Thought $\rightarrow$ Action $\rightarrow$ Observation.

Consider a “Research Agent” tasked with summarizing a 50,000-token legal document.

  1. Turn 1: The agent reads the 50,000-token document and the 2,000-token system prompt. (Input: 52k). It outputs a search query (Output: 20 tokens).
  2. Turn 2: The agent receives the search results (1,000 tokens). The context now contains the original document, the system prompt, the search query, and the results. (Input: 53k). It decides to refine the search.
  3. Turn 3: The process repeats.

By Turn 10, the agent has re-processed the static 50,000-token document ten times, consuming over 500,000 input tokens to generate perhaps 500 output tokens, a read-to-write asymmetry of roughly 1,000:1 that comfortably exceeds the 100:1 pattern described above.1

The financial implication is severe. At standard GPT-4-class pricing (~$10/1M input tokens), this ten-step workflow costs roughly $5.00 for inputs alone. With Prefix Caching, the 50,000-token document is processed once; subsequent turns pay full price only for the incremental tokens. Per-turn prefill work drops from linear in the full context ($O(N)$) to near-constant with respect to the cached prefix ($O(1)$ for the prefix, plus work proportional to the new tokens), and the cost collapses by 90% or more.5
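A back-of-the-envelope sketch of this arithmetic, assuming the ~$10/1M base price above and a flat 90% discount on cached reads (the 52,000-token prefix is the document plus system prompt from the example; per-turn incremental tokens are omitted for brevity):

```python
PRICE_PER_TOKEN = 10.00 / 1_000_000     # ~$10 per 1M input tokens
STATIC_PREFIX   = 52_000                # 50k document + 2k system prompt
TURNS           = 10
READ_DISCOUNT   = 0.10                  # cached reads billed at ~10% of base

# Stateless: the static prefix is re-billed at full price on every turn.
stateless = TURNS * STATIC_PREFIX * PRICE_PER_TOKEN

# Cached: one full-price pass over the prefix, then discounted reads.
cached = (STATIC_PREFIX * PRICE_PER_TOKEN
          + (TURNS - 1) * STATIC_PREFIX * PRICE_PER_TOKEN * READ_DISCOUNT)

print(f"stateless ≈ ${stateless:.2f}, cached ≈ ${cached:.2f}")
# stateless ≈ $5.20, cached ≈ $0.99 for the static prefix alone; the gap
# widens toward the quoted ~90% as the loop runs for more turns.
```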

2.3 The Solution: From Request-Local to Global Persistence

Prefix Caching promotes the KV cache from a temporary, request-local artifact to a persistent, global asset. It recognizes that in a multi-tenant environment—or a multi-turn agent session—the prefixes are often identical.

By storing the KV cache in a globally addressable structure (like a hash table or a radix tree) within the inference server’s memory, the engine can check incoming requests against stored states. If a match is found (a “cache hit”), the prefill phase is skipped entirely. The engine loads the pre-computed K and V vectors and jumps straight to processing the new tokens. This reduces the TTFT from seconds to milliseconds and frees up massive amounts of compute resources.11

3. Algorithmic Architectures: The Divergence of Implementation

While the concept of reusing the KV cache is universal, the engineering implementation differs significantly across frameworks. The two dominant approaches are Block-Level Hashing (championed by vLLM) and Radix Trees (championed by SGLang).

3.1 Block-Level Hashing (The vLLM Approach)

The vLLM project revolutionized LLM serving with PagedAttention, an algorithm inspired by operating system memory paging. In this architecture, the KV cache is not stored in contiguous memory blocks (which causes fragmentation) but is broken into fixed-size “pages” or “blocks” (e.g., 16 or 32 tokens per block).

To enable Automatic Prefix Caching (APC), vLLM treats these blocks as uniquely addressable content.

  • Mechanism: It computes a hash for each block from the tokens it contains and the hash of the preceding block: $H_i = \text{Hash}(\text{Tokens}_i, H_{i-1})$ (see the sketch at the end of this subsection).
  • Global Hash Table: These hashes map to physical block indices in the GPU memory. When a request arrives, the scheduler hashes its prefix blocks and queries the global table.
  • Deduplication: If two requests (e.g., two users accessing the same chatbot system prompt) generate the same block hashes, they point to the same physical memory. This allows a single copy of the system prompt to serve thousands of concurrent users.6

Strengths:

  • Efficiency: Hash lookups are $O(1)$ and extremely fast.
  • Simplicity: It integrates naturally with the PagedAttention memory manager.
  • Fragmentation: Eliminates external memory fragmentation.

Weaknesses:

  • Rigidity: The hashing is sensitive to block alignment. If a user inserts a single character at the start of the prompt, all subsequent block boundaries shift, changing the tokens inside every block. This produces entirely new hashes, causing a complete cache miss (the “avalanche effect”).11
  • Exact Match Dependency: It struggles with “fuzzy” matching or slight variations in system prompts.
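A minimal sketch of this chained block hashing (hypothetical code, not vLLM's actual implementation) also makes the alignment weakness concrete:

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block; vLLM uses fixed-size blocks (e.g. 16 or 32)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chain each block's hash with its predecessor's, so a hit on block i
    implies the entire prefix up to and including block i is identical."""
    hashes, prev = [], "root"
    full_blocks = len(token_ids) // BLOCK_SIZE
    for b in range(full_blocks):
        block = token_ids[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        h = hashlib.sha256(f"{prev}|{block}".encode()).hexdigest()
        hashes.append(h)
        prev = h
    return hashes

# Hypothetical global table mapping a block hash to a physical KV block index.
kv_block_table: dict[str, int] = {}

def cached_prefix_blocks(token_ids: list[int]) -> int:
    """How many leading blocks of this request are already resident."""
    hits = 0
    for h in block_hashes(token_ids):
        if h not in kv_block_table:
            break
        hits += 1
    return hits
```

Prepending even a single token to token_ids shifts every block boundary, so every hash in the chain changes and cached_prefix_blocks falls back to zero: the avalanche effect described above.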

3.2 RadixAttention (The SGLang Approach)

SGLang (Structured Generation Language) takes a different approach designed specifically for the complex, branching structure of agentic conversations. Instead of a flat hash table, it organizes the KV cache as a Radix Tree (also known as a compact prefix tree).

  • Tree Structure: The root of the tree represents an empty context. Each edge represents a sequence of tokens. Each node represents the cached KV state at that point in the sequence.
  • Dynamic Matching: When a request arrives, the scheduler traverses the tree to find the longest common prefix. It does not need to align with arbitrary block boundaries; it matches as many tokens as possible (see the sketch after this list).
  • Branching Support: This is ideal for “Tree of Thoughts” or “Beam Search” workflows. If an agent explores three different solutions to a problem, they all share the common trunk (the problem description). SGLang maintains this trunk and branches out, whereas block hashing might struggle to manage the shared references as efficiently during complex evictions.14
  • Eviction Policy: SGLang implements an LRU (Least Recently Used) eviction policy on the nodes of the tree. This allows it to surgically prune old conversation branches while preserving the frequently accessed “trunk” (system prompts), often achieving higher cache hit rates in dynamic multi-turn scenarios compared to vLLM.15
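To make the contrast concrete, here is a toy, uncompressed prefix tree standing in for SGLang's radix tree (illustrative only): matching proceeds token by token, so any shared prefix is reused regardless of block alignment, and recency stamps support LRU pruning of cold branches.

```python
import time

class Node:
    """One node per token; real RadixAttention compresses token runs into
    single edges and attaches references to the KV blocks they cover."""
    def __init__(self):
        self.children: dict[int, "Node"] = {}
        self.last_used: float = 0.0     # recency, used for LRU eviction

class PrefixTree:
    def __init__(self):
        self.root = Node()

    def insert(self, token_ids: list[int]) -> None:
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, Node())
            node.last_used = time.monotonic()

    def longest_cached_prefix(self, token_ids: list[int]) -> int:
        """Length of the longest cached prefix; no block alignment required."""
        node, matched = self.root, 0
        for t in token_ids:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_used = time.monotonic()
            matched += 1
        return matched
```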

Comparison Table: vLLM APC vs. SGLang RadixAttention

 

| Feature | vLLM (Automatic Prefix Caching) | SGLang (RadixAttention) |
| --- | --- | --- |
| Data Structure | Global hash table (block-based) | Radix tree (token-sequence based) |
| Matching Logic | Exact block hash match | Longest common prefix match |
| Alignment Sensitivity | High (block shifts break the cache) | Low (token-granular matching) |
| Best Use Case | High-throughput batching of identical prompts | Multi-turn chat, agent loops, branching reasoning |
| Eviction Strategy | Ref-count based LRU on blocks | LRU on tree nodes |
| Throughput (Batch) | Excellent | Very High |
| Latency (Multi-turn) | Good | Superior (10-20% better in benchmarks)16 |

3.3 TensorRT-LLM and the Compilation Paradigm

TensorRT-LLM, NVIDIA’s high-performance inference library, approaches optimization from a different angle. Historically, TensorRT relied on static compilation—building an execution plan (an “engine”) optimized for a specific model and hardware configuration.

  • Context Reuse: While earlier versions focused on raw compute throughput (using kernel fusion and FP8 quantization), recent updates have integrated context reuse capabilities. However, TensorRT-LLM traditionally requires more explicit configuration and “engine building” steps compared to the dynamic, runtime-managed approaches of vLLM and SGLang.17
  • Performance Profile: TensorRT-LLM often wins on raw tokens-per-second (throughput) due to deep hardware optimization (e.g., utilizing H100 Tensor Cores more effectively), but vLLM and SGLang frequently win on Time-to-First-Token (TTFT) in dynamic workloads because their cache management is more flexible.19
  • Integration: TensorRT-LLM is typically deployed via the Triton Inference Server. For caching to work effectively across requests, it requires careful orchestration at the Triton layer or through the Inflight Batcher.21

4. Proprietary Implementations: The Walled Gardens

The major cloud AI providers have wrapped these algorithmic concepts into easy-to-use APIs, each with distinct pricing and retention models.

4.1 Anthropic Claude: The Explicit Control Model

Anthropic has taken a developer-centric approach with Explicit Prompt Caching. This design philosophy prioritizes determinism and cost control.

  • Mechanism: Developers must explicitly tag segments of their prompt with cache_control: {"type": "ephemeral"}. The system caches the prefix up to this breakpoint (an example request appears after this list).
  • Architecture: Supported on Claude 3.5 Sonnet, Haiku, and Opus. The cache allows for up to 4 breakpoints, enabling a “layered” caching strategy (e.g., System Prompt -> Tool Definitions -> Few-Shot Examples -> Conversation History).5
  • Pricing Economics: Anthropic introduces a “Write” premium and a “Read” discount.
  • Cache Write: ~25% more expensive than standard input tokens ($3.75/M for Sonnet).
  • Cache Read: ~90% cheaper than standard input tokens ($0.30/M for Sonnet).
  • Break-Even: The break-even point is incredibly low—often just 2 calls. If a context is reused even once, the user saves money.5
  • TTL (Time-to-Live): The default TTL is 5 minutes, which resets (“rolling TTL”) every time the cache is accessed. This makes it ideal for active agent loops but less suitable for long-term storage (e.g., a “knowledge base” accessed once a day). An extended 1-hour TTL is available for a fee.22
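A sketch of what an explicit breakpoint looks like in practice, built around the cache_control field documented above (the system-prompt placeholder and model alias are illustrative, and SDK details may differ slightly from this shape):

```python
import anthropic

client = anthropic.Anthropic()              # reads ANTHROPIC_API_KEY from the env

LONG_STATIC_SYSTEM_PROMPT = "..."           # placeholder for the expensive, reusable prefix

response = client.messages.create(
    model="claude-3-5-sonnet-latest",       # model alias assumed for illustration
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},   # cache the prefix up to here
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 4 of the contract."}],
)
# The usage block reports cache-write vs. cache-read token counts, which is how
# you verify that subsequent calls are actually hitting the cache.
print(response.usage)
```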

4.2 Google Vertex AI / Gemini: The Scale and Storage Model

Google leverages its massive TPU infrastructure to offer a more flexible, storage-oriented model known as Context Caching.

  • Two Modes:
  • Implicit Caching: Enabled by default on Gemini 1.5 Flash/Pro. Google automatically detects repeated prefixes and applies the discount. There is no “write” surcharge, making this a “free lunch” feature for developers.23
  • Explicit Caching: Designed for massive, persistent contexts (up to 2 million tokens). Users create a CachedContent resource that acts like a temporary file.
  • Storage-Based Pricing: Unlike Anthropic, Google charges a storage fee (e.g., $1.00 to $4.50 per million tokens per hour, depending on the model). This effectively treats the model’s context window as rented hot storage: users can upload entire books, codebases, or video libraries once and query them repeatedly without re-uploading.25
  • Multimodal Advantage: A critical differentiator is Gemini’s ability to cache multimodal inputs. An agent can cache a 1-hour video file (audio and visual tokens) and perform multiple distinct analytical tasks on it. This avoids the massive compute cost of re-encoding the video for every question.23

4.3 AWS Bedrock and Amazon Nova

Amazon Bedrock serves as a distribution hub, supporting Anthropic’s caching features while introducing its own for the Amazon Nova model family.

  • Integration: Bedrock unifies the caching API across models, simplifying the switch between Claude 3.5 Sonnet and Amazon Nova Pro.
  • Cross-Region Inference: While caching is typically regional (data stored in us-east-1 cannot be accessed in us-west-2), Bedrock’s inference profiles help manage routing to ensure cache hits.
  • Nova Economics: Amazon positions Nova Micro/Lite/Pro as cost-efficient alternatives for high-throughput agent loops, offering comparable savings of roughly 85-90% on cached input tokens.4

4.4 OpenAI: The Late Adopter?

Historically, OpenAI has been reluctant to expose manual caching controls. However, recent documentation and pricing updates indicate a shift toward Automatic (Implicit) Caching.

  • Recent Developments: Reports indicate OpenAI now applies automatic caching for prompts exceeding 1,024 tokens.
  • Pricing: The discount structure is less aggressive than Anthropic’s. For GPT-4o, cached inputs are priced at $1.25/M tokens (vs. $2.50/M standard), a 50% discount compared to Anthropic’s 90% discount. The same 50% discount applies to the o1 reasoning models, where the absolute savings are larger because the base price is higher ($7.50/M cached vs. $15.00/M standard).31
  • Latency: OpenAI promises up to 80% latency reduction, bringing its “Time-to-First-Token” in line with competitors for long-context tasks.33

5. Strategic Engineering for the 100:1 Workflow

To fully exploit these technologies, organizations must fundamentally re-architect how they construct prompts and manage agent state. The goal is to maximize Cache Hit Rate (CHR). A low CHR in an agentic workflow is not just a performance bug; it is a financial leak.

5.1 The “Append-Only” Context Pattern

The cardinal rule of prefix caching is stability. The prefix must remain byte-identical to trigger a cache hit.

  • The Anti-Pattern: Placing dynamic variables at the start of the prompt.
  • Example: System Prompt: “Current Time: 12:05:01. You are a coding assistant…”
  • Result: The timestamp changes every second. The hash changes every second. The cache is invalidated every request.
  • The Solution: Move all dynamic content to the end of the prompt.
  • Structure: [Static System Prompt] -> [Tool Definitions] -> [Few-Shot Examples] -> [Conversation History] -> [Dynamic Data / Current Query].
  • Result: The static blocks, often 90% or more of the prompt, stay byte-identical across requests, so nearly the entire prefix can be served from cache (a minimal sketch of this assembly follows).13
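A minimal sketch of this assembly order, assuming the uppercase names are placeholders for an application's own static strings:

```python
SYSTEM_PROMPT     = "You are a coding assistant..."   # never changes
TOOL_DEFINITIONS  = "..."                             # never changes
FEW_SHOT_EXAMPLES = "..."                             # never changes

STATIC_PREFIX = SYSTEM_PROMPT + TOOL_DEFINITIONS + FEW_SHOT_EXAMPLES

def build_prompt(history: str, user_query: str, now: str) -> str:
    # Dynamic material (timestamps, the latest query) goes last, so the static
    # prefix stays byte-identical and keeps hashing to the same cache entries.
    return f"{STATIC_PREFIX}{history}\nCurrent time: {now}\nUser: {user_query}"
```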

5.2 Deterministic Serialization

Agents typically communicate using JSON. However, standard JSON serialization is not canonical: key order depends on how the object was constructed, so two semantically identical objects can serialize to different byte sequences.

  • Case A: {"tool": "web_search", "query": "latest news"}
  • Case B: {"query": "latest news", "tool": "web_search"}
    To an LLM, these are semantically identical. To a caching algorithm (hash table or radix tree), they are completely different byte strings.
  • Best Practice: Engineers must enforce canonical JSON serialization (e.g., sort_keys=True and fixed separators in Python) for all structured data injected into the prompt context, as shown below. This ensures that the same logical object always produces the same byte sequence.2
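In Python, canonicalization is a one-liner with json.dumps; fixing both key order and separators guarantees a stable byte sequence:

```python
import json

tool_call = {"tool": "web_search", "query": "latest news"}

canonical = json.dumps(tool_call, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
# '{"query":"latest news","tool":"web_search"}' -- identical bytes no matter
# how the dict was originally constructed, so the cache hash is stable too.
```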

5.3 Context Engineering and Offloading

Even with caching, context windows have limits. Context Engineering is the discipline of optimizing what goes into the cache.

  • Compression: Instead of caching raw HTML, agents should cache Markdown or heavily summarized abstracts.
  • Offloading: For massive datasets (e.g., 100,000 lines of logs), agents should “offload” the data to an external store (like a file) and cache only the file handle or a summary. The agent then uses a tool to “grep” or “read” specific chunks of the file on demand. This hybrid approach keeps the cached context lean while retaining on-demand access to arbitrarily large data (a minimal sketch follows).1
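A toy illustration of the offloading pattern (the file path and tool names are hypothetical): the bulky payload lives on disk, and only a short handle string ever enters the cached context.

```python
import pathlib
import re

def offload(log_text: str, path: str = "workspace/run_001.log") -> str:
    """Persist bulky data outside the prompt; return a short handle/summary."""
    p = pathlib.Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(log_text)
    return f"[{len(log_text.splitlines())} log lines offloaded to {path}]"

def grep_tool(path: str, pattern: str, limit: int = 20) -> list[str]:
    """Agent tool: pull back only the matching slices on demand."""
    lines = pathlib.Path(path).read_text().splitlines()
    return [ln for ln in lines if re.search(pattern, ln)][:limit]
```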

5.4 RAG vs. Context Caching

A common strategic question is: Should I use a Vector Database (RAG) or just cache the documents in the context?

  • Small/Medium Corpus (< 2M tokens): Cache it. Google’s Gemini 1.5 Pro allows up to 2 million tokens. Caching the entire corpus eliminates the complexity of the retrieval step (chunking, embedding, indexing) and allows the model to perform “global reasoning” across the entire dataset (e.g., “Summarize the themes across all these documents”), which RAG cannot do effectively.
  • Large Corpus (> 2M tokens): Use RAG to retrieve the relevant ~100k tokens, then cache that retrieved subset for the duration of the user session.4

6. Integration and Routing: The Distributed Systems Challenge

In a production environment, inference is rarely served by a single GPU. It is served by a cluster of tens or hundreds. This introduces the Routing Problem.

6.1 The Failure of Random Load Balancing

Standard load balancers (Round Robin, Least Connections) are disastrous for prompt caching.

  • Scenario: User A sends a request. It lands on GPU 1, which processes and caches the prefix.
  • Next Turn: User A sends the follow-up. The load balancer sends it to GPU 2.
  • Result: GPU 2 does not have the cache. It must re-process the prefix. GPU 1’s cache sits idle. The system performs redundant work, and latency spikes.

6.2 BentoML and Prefix-Aware Routing

Frameworks like BentoML and llm-d solve this via Prefix-Aware (or Affinity) Routing.

  • Mechanism: The router sits in front of the GPU cluster. It inspects the incoming prompt and computes the hash of the prefix.
  • Affinity Map: The router maintains a map of which GPU workers hold which prefix hashes.
  • Routing Logic: It directs the request to the worker that already has the data. If multiple workers have it, it load-balances among them. If none have it, it selects the least-loaded worker and updates the map.8
  • Impact: Benchmarks show that precise prefix-aware routing can roughly double cluster throughput and cut latency by as much as 57x relative to random routing in cache-heavy workloads (a toy version of the affinity logic follows).36
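A toy version of that affinity logic, where a hash of the leading characters stands in for a proper token-prefix hash (a production router would also handle eviction, replication, and worker failure):

```python
import hashlib

class PrefixAwareRouter:
    def __init__(self, workers: list[str]):
        self.affinity: dict[str, str] = {}        # prefix hash -> worker id
        self.load = {w: 0 for w in workers}       # naive in-flight request count

    def route(self, prompt: str, prefix_chars: int = 4096) -> str:
        key = hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()
        worker = self.affinity.get(key)
        if worker is None:                        # no worker has this prefix yet
            worker = min(self.load, key=self.load.get)   # pick the least loaded
            self.affinity[key] = worker           # remember who will cache it
        self.load[worker] += 1
        return worker
```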

6.3 Disaggregated Architectures (Mooncake)

The future of high-scale inference lies in Disaggregation. Projects like Mooncake decouple the Prefill compute from the Decode compute.

  • Prefill Instances: Specialized nodes (perhaps with high compute, lower memory) process prompts and generate KV caches.
  • Decode Instances: Specialized nodes (with massive memory) hold the caches and generate tokens.
  • Transfer: The KV cache is transferred between nodes over high-speed interconnects (InfiniBand, NVLink).
  • Benefit: This allows independent scaling. If an application has long prompts but short answers (100:1), you can scale up Prefill nodes without paying for unnecessary Decode capacity.8

7. Beyond Tokens: Agentic Plan Caching (APC)

While prefix caching optimizes the input (the “Reading” phase), researchers are now targeting the output (the “Thinking” phase) with Agentic Plan Caching (APC).

7.1 The Limits of Token Caching

Prefix caching helps the model read the problem faster, but the model still has to reason about the solution. For a complex task like “Plan a marketing strategy,” the reasoning steps (generating the plan) are computationally expensive and slow.

7.2 Caching the Reasoning Process

APC operates on the insight that for semantically similar tasks, the structure of the plan is often identical, even if the specific details differ.

  • Mechanism (sketched in code after this list):
  1. Offline: When an agent successfully solves a task, APC extracts the “Plan Template” (the abstract sequence of steps).
  2. Online: When a new request arrives, APC uses a lightweight classifier to find a matching Plan Template.
  3. Adaptation: Instead of asking a heavy model (e.g., Claude Opus) to generate a plan from scratch, APC asks a lightweight model (e.g., Claude Haiku) to adapt the cached template to the current variables.
  • Results: Research demonstrates that APC can reduce serving costs by an additional 50% and latency by 27% on top of standard prefix caching, effectively bypassing the heavy reasoning loops for routine tasks.3
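A highly simplified sketch of the APC loop; the classify helper, heavy_model, and light_model interfaces are hypothetical stand-ins for the components described above:

```python
plan_cache: dict[str, list[str]] = {}   # task category -> abstract plan template

def classify(task: str) -> str:
    """Stand-in for the lightweight classifier that buckets incoming tasks."""
    return "marketing_strategy" if "marketing" in task.lower() else "generic"

def get_plan(task: str, heavy_model, light_model) -> list[str]:
    key = classify(task)
    template = plan_cache.get(key)
    if template is None:
        template = heavy_model.plan(task)        # expensive reasoning, done once
        plan_cache[key] = list(template)         # keep only the abstract steps
        return template
    return light_model.adapt(template, task)     # cheap adaptation to new details
```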

8. Financial Analysis and TCO

The shift to caching necessitates a new Total Cost of Ownership (TCO) model. The metric of “Cost per 1M Tokens” is no longer a single number; it is a composite function of Cache Hit Rate (CHR).

8.1 The “Effective Cost” Formula

 

$$\text{EffectiveCost} = (1 - \text{CHR}) \times \text{WriteCost} + \text{CHR} \times \text{ReadCost}$$

Case Study: Enterprise Knowledge Agent

  • Write Cost: $3.75 (Anthropic Sonnet)
  • Read Cost: $0.30
  • Scenario A (Poor Engineering): 10% Cache Hit Rate.
  • Cost = (0.90 × $3.75) + (0.10 × $0.30) ≈ $3.40/M.
  • Scenario B (Optimized Engineering): 90% Cache Hit Rate.
  • Cost = (0.10 × $3.75) + (0.90 × $0.30) ≈ $0.64/M (both computed in the sketch below).
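The two scenarios drop straight out of the formula:

```python
def effective_cost(chr_rate: float, write_cost: float, read_cost: float) -> float:
    """Blended $/1M input tokens as a function of cache hit rate (CHR)."""
    return (1 - chr_rate) * write_cost + chr_rate * read_cost

WRITE, READ = 3.75, 0.30                       # Sonnet figures from above
print(effective_cost(0.10, WRITE, READ))       # ≈ 3.40 (poorly engineered prompts)
print(effective_cost(0.90, WRITE, READ))       # ≈ 0.64 (cache-friendly prompts)
```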

Conclusion: Good engineering (maximizing CHR) cuts the cost of intelligence by a factor of roughly 5.3. This arbitrage opportunity rewards teams that master the intricacies of prompt structure and routing.5

9. Conclusion: The Era of Persistent Intelligence

The evolution of Prefix and Prompt Caching marks the maturation of the Generative AI infrastructure stack. We are moving away from the brute-force inefficiency of stateless inference, in which the same quadratic attention work is redone over an unchanged context on every call, toward a sophisticated, stateful architecture that values context persistence.

For the developer, this unlocks the “100:1” workflow, making complex, context-heavy agents economically viable. For the architect, it mandates a shift in infrastructure—from simple load balancers to prefix-aware routers, and from ephemeral compute to persistent memory management.

The future belongs to systems that treat context not as a disposable input, but as a long-term asset. Whether through the seamless “Implicit” caching of Google Gemini, the precise “Explicit” controls of Anthropic, or the high-performance “Radix Trees” of SGLang, the industry has delivered the tools to solve the context bottleneck. The challenge now lies in the engineering discipline to wield them effectively.

Works cited

  1. Deep Dive into Context Engineering for Agents – Galileo AI, accessed on December 13, 2025, https://galileo.ai/blog/context-engineering-for-agents
  2. Context Engineering for AI Agents: Lessons from Building Manus, accessed on December 13, 2025, https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
  3. Cache Mechanism for Agent RAG Systems – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2511.02919v1
  4. Amazon Bedrock Prompt Caching with the Amazon Nova Model – AWS Builder Center, accessed on December 13, 2025, https://builder.aws.com/content/33zsO2Bc7UbsnLb9XTOBcuv1O2c/amazon-bedrock-prompt-caching-with-the-amazon-nova-model
  5. Prompt caching | Generative AI on Vertex AI – Google Cloud Documentation, accessed on December 13, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/partner-models/claude/prompt-caching
  6. Automatic Prefix Caching – vLLM, accessed on December 13, 2025, https://docs.vllm.ai/en/v0.7.3/design/automatic_prefix_caching.html
  7. [2312.07104] SGLang: Efficient Execution of Structured Language Model Programs – arXiv, accessed on December 13, 2025, https://arxiv.org/abs/2312.07104
  8. The Shift to Distributed LLM Inference: 3 Key Technologies Breaking Single-Node Bottlenecks – BentoML, accessed on December 13, 2025, https://www.bentoml.com/blog/the-shift-to-distributed-llm-inference
  9. Reproducible Performance Metrics for LLM inference – Anyscale, accessed on December 13, 2025, https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference
  10. Welcome to LLMflation – LLM inference cost is going down fast | Andreessen Horowitz, accessed on December 13, 2025, https://a16z.com/llmflation-llm-inference-cost/
  11. Prefix caching | LLM Inference Handbook – BentoML, accessed on December 13, 2025, https://bentoml.com/llm/inference-optimization/prefix-caching
  12. Automatic Prefix Caching – vLLM, accessed on December 13, 2025, https://docs.vllm.ai/en/stable/design/prefix_caching/
  13. How prompt caching works – Paged Attention and Automatic Prefix Caching plus practical tips | sankalp’s blog, accessed on December 13, 2025, https://sankalp.bearblog.dev/how-prompt-caching-works/
  14. Fast and Expressive LLM Inference with RadixAttention and SGLang | LMSYS Org, accessed on December 13, 2025, https://lmsys.org/blog/2024-01-17-sglang/
  15. SGLang vs. vLLM: The New Throughput King? | by Aparna Pradhan | Nov, 2025 | Medium, accessed on December 13, 2025, https://medium.com/@ap3617180/sglang-vs-vllm-the-new-throughput-king-7daec596f7fa
  16. When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse, accessed on December 13, 2025, https://www.runpod.io/blog/sglang-vs-vllm-kv-cache
  17. Why vLLM is the best choice for AI inference today – Red Hat Developer, accessed on December 13, 2025, https://developers.redhat.com/articles/2025/10/30/why-vllm-best-choice-ai-inference-today
  18. vLLM vs TensorRT-LLM: Key differences, performance, and how to run them – Northflank, accessed on December 13, 2025, https://northflank.com/blog/vllm-vs-tensorrt-llm-and-how-to-run-them
  19. Why is vLLM Outperforming TensorRT-LLM (Nvidia’s deployment library)? My Shocking Benchmarks on GPT-OSS-120B with H100 : r/LocalLLaMA – Reddit, accessed on December 13, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1oyawkl/why_is_vllm_outperforming_tensorrtllm_nvidias/
  20. Benchmarking LLM Inference Backends – BentoML, accessed on December 13, 2025, https://www.bentoml.com/blog/benchmarking-llm-inference-backends
  21. Boost LLM Throughput: vLLM vs. Sglang and Other Serving …, accessed on December 13, 2025, https://tensorfuse.io/blog/llm-throughput-vllm-vs-sglang
  22. Prompt caching – Claude Docs, accessed on December 13, 2025, https://platform.claude.com/docs/en/build-with-claude/prompt-caching
  23. Context caching overview | Generative AI on Vertex AI – Google Cloud Documentation, accessed on December 13, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview
  24. Gemini 2.5 Models now support implicit caching – Google Developers Blog, accessed on December 13, 2025, https://developers.googleblog.com/en/gemini-2-5-models-now-support-implicit-caching/
  25. Vertex AI context caching | Google Cloud Blog, accessed on December 13, 2025, https://cloud.google.com/blog/products/ai-machine-learning/vertex-ai-context-caching
  26. Lowering Your Gemini API Bill: A Guide to Context Caching – DEV Community, accessed on December 13, 2025, https://dev.to/rawheel/lowering-your-gemini-api-bill-a-guide-to-context-caching-aag
  27. Lowering Your Gemini API Bill: A Guide to Context Caching | by Raheel Siddiqui | Medium, accessed on December 13, 2025, https://rawheel.medium.com/lowering-your-gemini-api-bill-a-guide-to-context-caching-0e1f4d0cb3f8
  28. Gemini 2.5 Flash | Generative AI on Vertex AI – Google Cloud Documentation, accessed on December 13, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash
  29. Effectively use prompt caching on Amazon Bedrock | Artificial Intelligence – AWS, accessed on December 13, 2025, https://aws.amazon.com/blogs/machine-learning/effectively-use-prompt-caching-on-amazon-bedrock/
  30. Amazon Bedrock announces general availability of prompt caching – AWS, accessed on December 13, 2025, https://aws.amazon.com/about-aws/whats-new/2025/04/amazon-bedrock-general-availability-prompt-caching/
  31. Prompt Caching in the API – OpenAI, accessed on December 13, 2025, https://openai.com/index/api-prompt-caching/
  32. Pricing | OpenAI API, accessed on December 13, 2025, https://platform.openai.com/docs/pricing
  33. Prompt Caching – Humanloop, accessed on December 13, 2025, https://humanloop.com/blog/prompt-caching
  34. Amazon Bedrock Prompt Caching: Saving Time and Money in LLM Applications – Caylent, accessed on December 13, 2025, https://caylent.com/blog/prompt-caching-saving-time-and-money-in-llm-applications
  35. Prefix-aware routing — Ray 2.52.1 – Ray Docs, accessed on December 13, 2025, https://docs.ray.io/en/latest/serve/llm/user-guides/prefix-aware-routing.html
  36. KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d, accessed on December 13, 2025, https://llm-d.ai/blog/kvcache-wins-you-can-see
  37. kvcache-ai/Mooncake: Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. – GitHub, accessed on December 13, 2025, https://github.com/kvcache-ai/Mooncake
  38. Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents – OpenReview, accessed on December 13, 2025, https://openreview.net/pdf?id=n4V3MSqK77
  39. Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching – arXiv, accessed on December 13, 2025, https://arxiv.org/html/2506.14852v1