I. Executive Summary
The proliferation of Generative AI and Large Language Models (LLMs) has introduced significant operational challenges in terms of computational cost and response latency. Caching is a foundational strategy to mitigate these challenges, and caching prompt embeddings—the numerical representation of input text—is paramount. This report provides a comprehensive architectural blueprint for implementing robust caching strategies for LLM prompt embeddings using Redis.
The analysis distinguishes between two primary caching models:
- Exact-Match Caching: A simple, high-speed approach that stores responses based on an exact hash of the input prompt. This is effective for high-frequency, identical queries.
- Semantic Caching: A sophisticated, vector-based approach that stores responses based on the semantic meaning of the prompt. This allows the cache to serve results for queries that are phrased differently but have the same intent, dramatically increasing the cache hit rate.
Redis is uniquely positioned as the ideal platform for this task. It is not merely a key-value cache but a multi-model database that can function simultaneously as an in-memory key-value store (for exact-match), a high-performance vector database (for semantic search), a session store, and a message broker. This multi-model capability allows architects to build a complete, real-time AI application stack on a single, unified platform, simplifying infrastructure and reducing latency.
This report details the business case for caching, compares exact-match and semantic patterns, and provides production-grade implementation details for both. It concludes with advanced strategies for performance tuning, cost optimization via vector compression, and lifecycle management, presenting a final, unified architectural blueprint for a scalable, efficient, and cost-effective LLM caching system.

II. Foundations: Prompt Embeddings and the Caching Imperative
A. What are Prompt Embeddings?
Before implementing a cache, it is critical to understand what is being cached. When a user sends a prompt (e.g., “What is the capital of France?”) to an LLM, the model does not understand text directly. The input is first processed by an embedding model.1
This model converts the non-numeric text into a numeric representation called an embedding, which is a high-dimensional vector (a long list of numbers).2 This vector captures the semantic meaning and context of the original text, allowing the LLM to mathematically determine relationships between concepts.4 For example, the vectors for “king” and “queen” will be mathematically close to each other. These vectors are the fundamental unit that LLMs use to process, compare, and understand human language.1
B. The Business Imperative: Why Cache LLM Calls?
LLM inference is computationally intensive and expensive. Every API call to a model like GPT-4o incurs a direct monetary cost (paid per token) and a time cost (latency).6 Caching is the primary strategy to address both.
- Cost Reduction: Many applications, particularly enterprise chatbots, receive a high volume of repetitive or semantically similar queries.6 For example, “How do I reset my password?” and “I forgot my password, what do I do?” are different prompts with the same intent.8 By caching the response to the first query and serving it for the second, the redundant LLM API call is eliminated. This can reduce input token costs by up to 90%.9
- Latency Reduction: Caching moves the response from a slow, multi-second computation (the LLM) to a fast, millisecond retrieval (the cache).11 For real-time applications like chatbots, this is critical for user experience. Benchmarks show that prompt caching can reduce latency by up to 80-85%.6
- Scalability: Caching reduces the computational load on the core LLM, allowing the system to handle a higher volume of users without scaling up expensive GPU resources.6
C. The Platform Caching Model (e.g., OpenAI)
It is important to note that some LLM providers, like OpenAI, implement their own form of prompt caching automatically.9 However, this caching is typically limited to exact prefix matching. The system routes requests that share a sufficiently long, identical prefix to a server that recently processed the same prompt. This is effective for static system prompts or instructions placed at the beginning of a prompt.9
This built-in caching is beneficial but insufficient. It cannot handle queries that are semantically identical but textually different (e.g., “password reset” vs. “forgot password”). To achieve this, a more sophisticated, application-layer caching strategy is required.
III. Redis as the Unified Platform for AI Caching
While many databases can store vectors, Redis presents a unique strategic advantage by being a multi-model database designed for real-time performance.
A. The Multi-Model Advantage
A “pure” vector database (e.g., Pinecone, Milvus) is specialized for one task: storing and searching vectors.13 A production AI application, however, requires more. A typical Retrieval-Augmented Generation (RAG) application needs:
- A Vector Database for RAG document retrieval.
- A Semantic Cache for LLM responses (also a vector database).
- A Key-Value Cache for exact-match caching.
- A Session Store for managing conversation history.
- A Message Broker for asynchronous tasks or invalidation.
Using specialized tools for each function results in a complex, fragmented architecture with multiple data silos. Redis, by contrast, can serve all these functions within a single, high-performance system.15 An AI architect can use Redis to store RAG documents in RedisJSON, manage vectors with Redis Search, handle exact-match caching with standard keys, store conversation history in Redis Hashes, and manage cache invalidation with Redis Pub/Sub.15 This consolidation drastically simplifies the architecture, reduces data-transfer latency, and lowers the total cost of ownership.16
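To make the multi-model point concrete, the following minimal redis-py sketch shows a single Redis Stack client acting as document store, exact-match cache, session store, and message broker at once; the key names and field values are illustrative (not taken from the cited sources), and the vector-search role is covered in detail in Parts 1 and 2.
Python
# Minimal sketch: one Redis Stack client serving several roles at once.
# Key names (doc:123, session:user-123, cache:invalidate) are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)

# RAG document store (RedisJSON)
r.json().set("doc:123", "$", {"title": "Password policy", "body": "..."})

# Exact-match cache entry (plain key-value with a TTL)
r.set("llm:exact:abc123", "Cached LLM response", ex=3600)

# Session store (Redis Hash)
r.hset("session:user-123", mapping={"last_prompt": "reset password", "turns": 4})

# Invalidation signal (Pub/Sub)
r.publish("cache:invalidate", "doc:123")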
B. Redis vs. Specialized Vector Databases
When evaluating Redis specifically for its vector search capabilities, it compares favorably against specialized, “pure” vector databases.
- Redis vs. Pinecone: Pinecone is a proprietary, purpose-built vector database.13 While effective for basic vector storage, Redis Enterprise is positioned as a more comprehensive real-time data platform, offering capabilities beyond vector search, such as high availability, multi-tenancy, and durability, that specialized vendors are still working to match.18 Benchmarking against Pinecone is legally restricted by its “DeWitt clause,” which forbids publishing performance evaluations.20
- Redis vs. Milvus: Milvus is an open-source, purpose-built vector database designed for high-performance vector search, particularly at billion-scale.14 Redis, by contrast, is a mature in-memory key-value store that has added vector search through a module.14 Some analyses suggest Milvus is finely tuned for high-demand vector workloads, while Redis offers broader versatility.21 Conversely, benchmarks published by Redis claim superior query throughput (up to 3.3x higher QPS than Milvus) and lower indexing times (up to 2.8x lower).20
For many AI applications, the combination of Redis’s “good enough” or superior vector performance with its dominant, industry-standard in-memory caching and multi-model features makes it the most practical and architecturally simple choice.22
IV. Part 1: Implementing Exact-Match Caching in Redis
This is the most basic caching strategy, which targets identical, repeated prompts.
A. Concept and Pattern
The pattern, often called “query caching,” is a simple implementation of the cache-aside pattern.23
- Prompt Identification: The application receives a user prompt. A unique, deterministic key is generated from the exact prompt string, typically using a fast hashing algorithm like MD5 or SHA-256.23
- Cache Lookup: The application checks Redis for the existence of this key.23
- Cache Hit: If the key exists, the stored response is retrieved from Redis and returned to the user. The LLM is never called.23
- Cache Miss: If the key does not exist, the application forwards the prompt to the LLM. The LLM’s response is then stored in Redis using the hash key—critically, with a Time-to-Live (TTL) set—and returned to the user.23
This pattern is incredibly fast but completely rigid; even a single-character change in the prompt (e.g., an extra space) will result in a cache miss.8
B. Implementation with redis-py
Using the standard redis-py library, the implementation is straightforward.
Python
# [26, 59, 60]
import redis
import hashlib
import os

# Assumes llm.invoke() is defined elsewhere
# from langchain_openai import OpenAI
# llm = OpenAI()

try:
    r = redis.Redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))
    r.ping()
    print("Connected to Redis")
except redis.exceptions.ConnectionError as e:
    print(f"Redis connection error: {e}")
    # Handle connection failure


def get_response_exact_match(prompt: str) -> str:
    # 1. Create a deterministic key from the exact prompt text
    cache_key = f"llm:exact:{hashlib.md5(prompt.encode()).hexdigest()}"

    try:
        # 2. Cache lookup
        cached_response = r.get(cache_key)
        if cached_response:
            print("Cache Hit! (Exact)")
            return cached_response.decode("utf-8")
    except redis.exceptions.RedisError as e:
        print(f"Redis GET error: {e}")
        # Fall through: go directly to the LLM

    # 3. Cache miss: call the LLM
    print("Cache Miss! (Exact)")
    response = llm.invoke(prompt)

    try:
        # 4. Store in the cache with a 1-hour (3600 s) TTL
        r.set(cache_key, response, ex=3600)
    except redis.exceptions.RedisError as e:
        print(f"Redis SET error: {e}")
        # Non-blocking error; just return the response

    return response
C. Framework Integration (LangChain)
Frameworks like LangChain abstract this entire process into a simple, drop-in integration.
Python
# [26, 59]
import time
import os

from redis import Redis
from langchain.globals import set_llm_cache
from langchain.cache import RedisCache
from langchain_openai import OpenAI

# Initialize the exact-match cache and register it globally
redis_client = Redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))
set_llm_cache(RedisCache(redis_=redis_client))

llm = OpenAI()

# --- First call (Cache Miss) ---
start_time = time.time()
response1 = llm.invoke("Explain the concept of caching in three sentences.")
time1 = time.time() - start_time
print(f"First call (Miss):\nTime: {time1:.2f} seconds\n")

# --- Second call (Cache Hit) ---
start_time = time.time()
response2 = llm.invoke("Explain the concept of caching in three sentences.")
time2 = time.time() - start_time
print(f"Second call (Hit):\nTime: {time2:.2f} seconds\n")

# A 25.40x speed improvement was demonstrated in one test
print(f"Speed improvement: {time1 / time2:.2f}x faster")
This example, based on documentation 26, highlights the dramatic performance gain, showing a 25.40x speed improvement by retrieving the response from the cache in 0.05 seconds versus 1.16 seconds from the LLM.
V. Part 2: Implementing Semantic Caching with Redis Vector Search
This advanced strategy provides far higher cache hit rates by matching prompts based on meaning, not just exact text.25 This requires Redis to be used as a vector database, leveraging the RediSearch module.27
A. Defining the Schema and Index
Before storing vectors, an index schema must be defined to tell Redis how to store and query them.27 Data can be stored in Redis Hashes or JSON documents.28
Indexing Algorithms
The architect must first choose an indexing algorithm:
- FLAT (Flat Index): This performs a brute-force, k-nearest neighbor (KNN) search. It checks the query vector against every other vector in the index, guaranteeing 100% accuracy. It is recommended for smaller datasets (< 1 million vectors) where perfect accuracy is non-negotiable.28
- HNSW (Hierarchical Navigable Small World): This is an Approximate Nearest Neighbor (ANN) algorithm. It builds a fast, multi-layered graph structure to enable highly scalable and rapid searches, even with billions of vectors. This is the recommended choice for most production systems at scale, as it offers an excellent balance of speed and accuracy.29
Vector Data Types
The TYPE parameter is the simplest lever for cost-optimization.
- FLOAT32: The standard 32-bit floating-point precision for vectors. Most embedding models output in this format.28 Each dimension uses 4 bytes.
- FLOAT16 / BFLOAT16: Half-precision (16-bit) types. These types immediately cut the memory footprint of the vectors in half.29 This is the most basic form of quantization and a powerful “Day 1” optimization for managing the cost of in-memory storage, with a potentially minor and often acceptable trade-off in precision.
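For a sense of scale, the back-of-the-envelope arithmetic below shows why halving the per-dimension width matters. It is a rough sketch that counts only the raw vector payload and ignores HNSW graph and key/metadata overhead.
Python
dims = 1536            # e.g., OpenAI text-embedding-ada-002
num_vectors = 1_000_000

float32_bytes = num_vectors * dims * 4   # 4 bytes per dimension
float16_bytes = num_vectors * dims * 2   # 2 bytes per dimension

print(f"FLOAT32: {float32_bytes / 1024**3:.1f} GiB")  # ~5.7 GiB
print(f"FLOAT16: {float16_bytes / 1024**3:.1f} GiB")  # ~2.9 GiB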
Example Schema Definition (Redis-CLI)
The following annotated FT.CREATE command defines an index for a semantic cache. The # comments are explanatory annotations and must be removed before running the command in redis-cli.
Code snippet
# Creates a new index named 'llm_semantic_cache'
FT.CREATE llm_semantic_cache
  # Store data in Redis Hashes
  ON HASH
  # Index all keys with this prefix
  PREFIX 1 "llm:semantic:"
  SCHEMA
    # --- Metadata fields for hybrid search ---
    prompt TEXT
    response TEXT
    model_version TAG SEPARATOR |
    user_id TAG SEPARATOR |
    # --- The vector field (6 = number of attribute arguments that follow) ---
    prompt_vector VECTOR HNSW 6
      # Most embedding models output 32-bit floats
      TYPE FLOAT32
      # Dimensions for OpenAI 'text-embedding-ada-002'
      DIM 1536
      # CRITICAL: use COSINE for text embeddings
      DISTANCE_METRIC COSINE
B. The Similarity Metric: A Critical Choice (Cosine vs. Euclidean)
The DISTANCE_METRIC parameter is arguably the most important logical choice in the schema. It defines what “similar” means.
- Cosine Similarity (COSINE): This metric measures the angle between two vectors, effectively ignoring their magnitude (length).30 It is the industry standard for NLP and text analysis because it measures semantic direction or intent.30 For example, the prompts “Give me a recipe” and “Give me ten recipes” have different magnitudes but share the same semantic intent. Cosine similarity correctly identifies them as being very close.32
- Euclidean Distance (L2): This metric measures the straight-line distance between the endpoints of two vectors.30 It is highly sensitive to vector magnitude. In the recipe example, it would find “one recipe” and “ten recipes” to be very far apart and thus dissimilar. Using the wrong metric, such as L2 for text embeddings, can “fail” retrieval and cause the LLM to see irrelevant context.33
The choice of metric is dictated by the embedding model itself. For modern text-embedding models (from OpenAI, Hugging Face, etc.), Cosine similarity is the correct and intended metric. Using L2 for text-based semantic caching is a conceptual error that will lead to poor cache performance. Therefore, the architectural recommendation is non-negotiable: Always use DISTANCE_METRIC COSINE for prompt embedding caching.
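The difference is easy to demonstrate with two toy vectors that point in the same direction but differ in magnitude. This is a contrived, low-dimensional example for illustration only; real text embeddings are high-dimensional.
Python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction ("same intent"), 10x the magnitude

cosine_distance = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
l2_distance = np.linalg.norm(a - b)

print(f"Cosine distance: {cosine_distance:.4f}")  # 0.0000 -> treated as identical
print(f"L2 distance:     {l2_distance:.4f}")      # ~33.67 -> treated as very dissimilar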
C. The Search Query: K-Nearest Neighbor (KNN) in Practice
The “cache lookup” operation is now a vector search query.29 The application generates an embedding for the new prompt and then uses the FT.SEARCH command to find the “k-nearest neighbors” (e.g., the 1 closest match) in the index.
A simple KNN search is insufficient for production. A production-grade cache must be multi-tenant and version-aware. A query from user_A must never return a cached response from user_B. This is accomplished using Hybrid Querying—combining the KNN vector search with metadata filters on the TAG fields defined in the schema.27
Example Hybrid Query (Redis-CLI)
This query finds the top 1 nearest neighbor that also matches the specified user_id and model_version.
Code snippet
# Find the 1 nearest neighbor that ALSO matches the metadata tags.
# The filter and the KNN clause form a single quoted query string;
# note that '-' is a special character in TAG queries and must be escaped.
FT.SEARCH llm_semantic_cache
  "(@user_id:{user\-123} @model_version:{gpt\-4o})=>[KNN 1 @prompt_vector $query_vector AS vector_score]"
  # $query_vector is a parameter containing the raw 1536-dim FLOAT32 vector bytes
  PARAMS 2 query_vector "…"
  # Return the metadata and the similarity score
  RETURN 4 prompt response user_id vector_score
  # ASC = smallest distance (closest match)
  SORTBY vector_score ASC
  DIALECT 2
This hybrid pattern is critical for preventing data leaks between users and ensuring cache consistency when LLM models are updated.
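The same hybrid lookup can be issued from application code with redis-py. The sketch below assumes the index defined earlier, a connected client r, and a query_embedding produced by the same embedding model used at write time; the function and variable names are illustrative.
Python
import numpy as np
from redis.commands.search.query import Query

def semantic_lookup(r, query_embedding, user_id: str, model_version: str):
    # '-' is special in TAG query syntax and must be escaped (e.g., user\-123)
    esc = lambda v: v.replace("-", "\\-")
    query_str = (
        f"(@user_id:{{{esc(user_id)}}} @model_version:{{{esc(model_version)}}})"
        "=>[KNN 1 @prompt_vector $query_vector AS vector_score]"
    )
    q = (
        Query(query_str)
        .return_fields("prompt", "response", "user_id", "vector_score")
        .sort_by("vector_score")  # ascending: smallest distance = closest match
        .dialect(2)
    )
    params = {"query_vector": np.array(query_embedding, dtype=np.float32).tobytes()}
    result = r.ft("llm_semantic_cache").search(q, query_params=params)
    return result.docs[0] if result.docs else None  # None = cache miss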
D. Framework Integration (The “Easy Button”)
Manually managing index creation, vectorization, and hybrid queries is complex. AI libraries abstract this process.
- RedisVL (Native Python Client)
redisvl is Redis’s official, high-level Python library for AI applications.36 It provides a SemanticCache class that automates the entire workflow: index creation, prompt-to-vector embedding, and the check/store logic.38
Python
from redisvl.extensions.cache.llm import SemanticCache
from redisvl.utils.vectorize import HFTextVectorizer  # or OpenAITextVectorizer
import os

# 1. Initialize the cache. This auto-creates the index in Redis.
#    The vectorizer embeds prompts internally on check() and store().
llmcache = SemanticCache(
    name="llmcache",
    prefix="llm:semantic",  # Key prefix
    redis_url=os.getenv("REDIS_URL", "redis://localhost:6379"),
    vectorizer=HFTextVectorizer(),
    # Cosine DISTANCE threshold (1.0 - 0.9 similarity = 0.1)
    distance_threshold=0.1,
)


def answer(question: str) -> str:
    # 2. Check the cache (this embeds the prompt and runs a KNN search)
    if results := llmcache.check(prompt=question):
        print("Cache Hit! (Semantic)")
        # check() returns a list of cached entries, closest match first
        return results[0]["response"]

    # 3. Cache miss: call the LLM and store the new response
    print("Cache Miss! (Semantic)")
    response = "Paris"  # llm.invoke(question)
    llmcache.store(prompt=question, response=response)
    return response


answer("What is the capital of France?")
- LangChain Integration
LangChain also provides a RedisSemanticCache that can be set as the global LLM cache, automating the process for any llm.invoke call.26
Python
# [26, 40, 44, 61]
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_redis import RedisSemanticCache
import os

# 1. Initialize the embedding model
embeddings = OpenAIEmbeddings()

# 2. Initialize and set the global cache
#    Note: distance_threshold is COSINE DISTANCE;
#    0.2 is equivalent to 0.8 cosine similarity.
set_llm_cache(
    RedisSemanticCache(
        redis_url=os.getenv("REDIS_URL", "redis://localhost:6379"),
        embeddings=embeddings,
        distance_threshold=0.2,
    )
)

llm = OpenAI()

# First call (Cache Miss)
llm.invoke("What is the capital of France?")

# Second call (Semantic Cache Hit)
# This different phrasing will be matched by the vector search
llm.invoke("Tell me the capital of France")
- LlamaIndex Integration
LlamaIndex also integrates deeply with Redis as a VectorStore, which can be used to build RAG and caching workflows.27
VI. Advanced Optimization: Tuning for Performance, Cost, and Accuracy
Deploying a semantic cache to production requires fine-tuning three key areas: the similarity threshold (logic), the index parameters (database performance), and vector compression (cost).
A. Tuning the Similarity Threshold
The distance_threshold is the single most important logical parameter.38 It defines the “hitbox” for a cache hit and represents a direct trade-off between cost savings (hit rate) and correctness (accuracy).45
- Low Threshold (e.g., 0.7 Similarity / 0.3 Distance): This creates a wide net, increasing the cache hit rate and saving more money. However, it significantly increases the risk of “false positives”—serving an irrelevant answer for a query that is only vaguely related.31
- High Threshold (e.g., 0.95 Similarity / 0.05 Distance): This creates a narrow net, decreasing the hit rate but ensuring that only highly similar queries are matched, maximizing accuracy.31
While studies suggest a 0.8 (Cosine Similarity) 31 or 0.85 46 threshold provides an optimal balance, this value is not universal. It is highly domain-specific. For example, a query for “how to build muscle quickly” requires a different answer than “how to build muscle sustainably.” A low threshold might incorrectly group them, which could be dangerous in a fitness application.48
The threshold is a business logic parameter that must be empirically tested against a validation dataset for the specific domain.
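A minimal sketch of such an empirical test is shown below. It assumes an embed() function that returns embedding vectors and a hand-labeled validation_set of (prompt_a, prompt_b, should_match) triples for the target domain; both are assumptions for illustration, not artifacts of the cited studies.
Python
import numpy as np

def cosine_similarity(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sweep_thresholds(validation_set, embed, thresholds=(0.80, 0.85, 0.90, 0.95)):
    """Report hit rate and false-positive rate for each candidate threshold."""
    for t in thresholds:
        hits = false_positives = 0
        for prompt_a, prompt_b, should_match in validation_set:
            sim = cosine_similarity(embed(prompt_a), embed(prompt_b))
            if sim >= t:
                hits += 1
                if not should_match:
                    false_positives += 1
        print(f"threshold={t:.2f}  "
              f"hit_rate={hits / len(validation_set):.2f}  "
              f"false_positive_rate={false_positives / max(hits, 1):.2f}")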
Table 6.1: Similarity Threshold Tuning Guide (Cosine Similarity)
| Similarity Threshold | Distance Threshold (1 - Similarity) | Typical Use Case | Hit Rate | Accuracy |
| 0.95 – 0.99 | 0.05 – 0.01 | Highly Sensitive Domains (Legal, Medical) | Very Low | Maximum accuracy; no false positives. |
| 0.85 – 0.90 | 0.15 – 0.10 | Production RAG, Tech Support 46 | Medium | Good accuracy; catches phrasing variants. |
| 0.80 – 0.85 | 0.20 – 0.15 | General Chatbots (Optimal Balance) 31 | High | Best blend of savings and accuracy. |
| < 0.80 | > 0.20 | Conceptual Clustering | Very High | High risk of irrelevant/incorrect answers.31 |
B. Tuning HNSW Hyperparameters
If the threshold is the logic tuner, the HNSW parameters are the database performance tuners. They control the trade-off between index build time, memory usage, and query speed/accuracy.49
Table 6.2: HNSW Hyperparameter Trade-offs 49
| Parameter | Definition (Redis Default) | High Value (e.g., 500) | Low Value (e.g., 100) |
| M | Max connections per node (16) | Pro: Better recall. Con: More memory, slower indexing. | Pro: Less memory, faster indexing. Con: Lower recall. |
| EF_CONSTRUCTION | Build-time search depth (200) | Pro: Higher quality index, better query accuracy. Con: Much slower index build time. | Pro: Faster index build time. Con: Lower quality index, lower query accuracy. |
| EF_RUNTIME | Query-time search depth (10) | Pro: Higher query accuracy/recall. Con: Slower query latency. | Pro: Faster query latency. Con: Lower query accuracy/recall. |
For a “build-once, query-millions” workload like a cache, the optimal strategy is to over-invest at build time to enable minimal work at query time. This means setting a high M (e.g., 64) and a high EF_CONSTRUCTION (e.g., 500). This creates a highly optimized index graph. Because the graph is so well-structured, a query can achieve high accuracy with a very low EF_RUNTIME (e.g., 10-20), delivering both high accuracy and the sub-millisecond latency expected of a cache.49
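As a sketch of this configuration in redis-py (assuming a connected client r; the field names mirror the FT.CREATE example earlier, and the parameter values follow the recommendation above but should be treated as starting points to benchmark, not fixed prescriptions):
Python
from redis.commands.search.field import TextField, TagField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

schema = (
    TextField("prompt"),
    TextField("response"),
    TagField("model_version"),
    TagField("user_id"),
    VectorField(
        "prompt_vector",
        "HNSW",
        {
            "TYPE": "FLOAT32",
            "DIM": 1536,
            "DISTANCE_METRIC": "COSINE",
            "M": 64,                  # dense graph: more memory, better recall
            "EF_CONSTRUCTION": 500,   # slow, high-quality build
        },
    ),
)
r.ft("llm_semantic_cache").create_index(
    schema,
    definition=IndexDefinition(prefix=["llm:semantic:"], index_type=IndexType.HASH),
)

# At query time, keep the search depth small; the well-built graph preserves recall.
knn_clause = "=>[KNN 1 @prompt_vector $query_vector]=>{$EF_RUNTIME: 20}"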
C. Managing the Memory Footprint: Quantization and Compression
The primary argument against using in-memory Redis for large-scale vector search is cost, as RAM is expensive.50 A cache with millions of 1536-dimension FLOAT32 vectors can consume hundreds of gigabytes.50
Redis Enterprise addresses this directly with advanced vector quantization and compression techniques, such as LVQ (Locally-adaptive Vector Quantization) and LeanVec.50 These features are not just optimizations; they are the strategic answer to making large-scale in-memory vector search economically viable.
These techniques compress the vectors, dramatically reducing the memory footprint (by 60% or more) while maintaining high performance.51 In fact, because graph-based search is memory-bound, searching smaller, compressed vectors can often be faster (up to 2x) than searching the full FLOAT32 vectors.50
Table 6.3: Vector Compression Trade-offs in Redis 50
| Compression Type | Memory Savings (vs. FLOAT32) | Search Performance | Ingestion Time |
| FLOAT32 (Baseline) | None | Baseline | Baseline |
| LeanVec4x8 | High (60%+) | Very Fast (up to 2x) | Slower (up to 1.7x) |
| LVQ8 | High (60%+) | Very Fast | Slower (up to 2.6x) |
| LVQ4 | Highest | Fast | Slower |
The trade-off is a longer one-time ingestion cost for massive long-term savings in memory cost and query speed.50 Note that these optimizations are currently tuned for x86 platforms and may be “impractical on ARM platforms today” due to slower ingestion.50
VII. Production Lifecycle: Cache Management and Invalidation
A cache is a living system that requires policies for eviction (when full) and invalidation (when stale).
A. Cache Eviction Strategies
When Redis reaches its maxmemory limit, it must evict (delete) keys to make room for new data. The eviction policy determines which keys are deleted.52
- noeviction: Blocks all new writes when memory is full. This is bad for a cache.
- allkeys-lru: Evicts the Least Recently Used (LRU) key from all keys.
- volatile-lru: Evicts the LRU key only from keys that have an expiration (TTL) set.52 This is the default policy in Redis Cloud/Enterprise deployments; open-source Redis defaults to noeviction.
- allkeys-lfu / volatile-lfu: Evicts the Least Frequently Used (LFU) key.
- volatile-ttl: Evicts the key with the shortest TTL remaining.54
The choice of eviction policy is critical in a multi-model Redis instance. If allkeys-lru is used, a high-volume cache write operation (all “new” keys) could cause Redis to evict “old” keys, such as a user’s active session key or a critical RAG document. This would be a catastrophic failure.
The correct architectural pattern is to:
- Set all cache entries (both exact-match and semantic) with a TTL (e.g., 24 hours).
- Store all persistent data (user sessions, RAG documents) without a TTL.
- Set the Redis eviction policy to volatile-lru.55
This creates two classes of data. The eviction policy will only ever target the “cache” class, protecting the persistent data from being deleted.
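A minimal sketch of this two-class setup is shown below, assuming a self-managed Redis instance where CONFIG SET is permitted (managed services usually expose these settings through their console or configuration file) and illustrative key names.
Python
# Eviction only ever considers keys that carry a TTL (volatile-lru).
r.config_set("maxmemory", "4gb")                  # illustrative limit
r.config_set("maxmemory-policy", "volatile-lru")

# Cache class: 24-hour TTL -> eligible for eviction
r.set("llm:exact:abc123", "cached response", ex=86400)

# Persistent class: no TTL -> never targeted by volatile-lru
r.hset("user:profile:user-123", mapping={"location": "Paris", "role": "admin"})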
B. Advanced Cache Invalidation
A TTL handles old data, but it cannot handle stale data. If a document in a RAG system is updated, all cached LLM responses based on that document are now incorrect and must be invalidated immediately.56
- Strategy 1: Prompt Versioning: As discussed in the Hybrid Query section (V.C), a model_version or data_version tag should be stored with the cached entry.57 When a model or document set is updated, the application simply queries for the new version tag (e.g., @data_version:{v2}). All v1 entries are instantly “invalidated” (they are no longer queried) and will eventually be evicted by their TTL.57
- Strategy 2: Event-Driven Invalidation: This is a proactive strategy. When a source document is updated, an event is published to trigger the deletion of related cache entries.58 While this can be done with external message queues 58, the multi-model “Redis Advantage” provides a self-contained solution: Redis Pub/Sub.16
The event-driven architecture works as follows:
- A “Document Update Service” writes a new document to Redis (e.g., JSON.SET doc:123…).
- Immediately after, it publishes an invalidation event: PUBLISH cache:invalidate ‘doc:123’.
- A separate “Cache Invalidation Worker” is subscribed to this channel: SUBSCRIBE cache:invalidate.
- When the worker receives the ‘doc:123’ message, it performs a hybrid search to find all cache keys tagged with doc_id:{123} and deletes them.
This pattern creates a clean, real-time invalidation loop that solves the stale cache problem using only Redis’s built-in features.
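A sketch of the worker side of this loop follows. It assumes cache entries carry a doc_id TAG field linking them to their source document (an extension of the schema shown earlier) and a connected client r.
Python
from redis.commands.search.query import Query

def run_invalidation_worker(r):
    pubsub = r.pubsub()
    pubsub.subscribe("cache:invalidate")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations
        doc_key = message["data"].decode()          # e.g., 'doc:123'
        doc_id = doc_key.split(":", 1)[1]           # -> '123'
        # Find every cache entry tagged with this document and delete it
        q = Query(f"@doc_id:{{{doc_id}}}").no_content().paging(0, 1000)
        res = r.ft("llm_semantic_cache").search(q)
        if res.docs:
            r.delete(*[doc.id for doc in res.docs])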
VIII. Final Architectural Recommendations & Blueprint
A. Recommended Stack & Configuration
- Database: Redis Stack (or Redis Enterprise) to ensure RediSearch (vector) and RedisJSON capabilities.
- Caching Strategy: A tiered, context-aware hybrid model.
- Layer 1: RedisCache (e.g., via LangChain) for simple, exact-match caching.
- Layer 2: RedisVL.SemanticCache for high-intent, generic user queries.38
- Layer 3: A custom Context-Enabled Semantic Caching (CESC) pattern.8 On a semantic cache hit, retrieve the generic response, retrieve the user’s context (e.g., location, role) from a separate Redis Hash, and send both to a fast, cheap LLM (e.g., GPT-4o-mini) for real-time personalization. This provides the speed and cost-savings of caching with the power of personalization.
- Vector Index: HNSW for performance at scale.29
- Index Configuration:
- DISTANCE_METRIC COSINE.30 This is non-negotiable for text.
- High EF_CONSTRUCTION (e.g., 500) and M (e.g., 64) for high-quality index builds.49
- Low EF_RUNTIME (e.g., 20) for fast queries.49
- Memory Management:
- Start with TYPE FLOAT32.28
- Migrate to LVQ or LeanVec compression when vector count exceeds 1 million entries or memory cost becomes a concern.51
- Tuning & Management:
- Threshold: Begin empirical testing with a distance_threshold of 0.15 (Similarity 0.85) and adjust based on domain-specific accuracy requirements.46
- Eviction Policy: volatile-lru.52
- TTL: Set a 24-hour TTL on all cache entries.
- Invalidation: Use event-driven invalidation via Redis Pub/Sub for RAG-based caches.58
B. Final Architectural Blueprint: The Unified AI Caching Flow
The following describes the complete, production-grade query lifecycle that integrates all recommended patterns.
Flow 1: Cache Miss (Populating the Cache)
- A user sends a query. The Application generates an embedding (Query Vector).
- The App executes a hybrid FT.SEARCH in Redis for the Query Vector + @user_id + @data_version.
- Redis returns a Cache Miss.
- The App retrieves RAG documents from Redis (e.g., JSON.GET doc:123).
- The App calls the expensive, high-quality LLM (e.g., GPT-4o) with the prompt and RAG context.
- The App receives the Response.
- The App stores the Response in the semantic cache: HSET llm:semantic:abc prompt “…” response “…” vector “…” user_id “…” data_version “…” and sets a 24-hour EXPIRE.
- The Response is returned to the user.
Flow 2: Cache Hit (Context-Enabled Semantic Cache)
- A user sends a query. The Application generates an embedding (Query Vector).
- The App executes a hybrid FT.SEARCH in Redis.
- Redis returns a Cache Hit, providing the Generic Response from a previous query.
- The App fetches the current user’s context from a separate Redis Hash (e.g., HGET user:profile:user-123 ‘location’).
- The App calls a fast, cheap LLM (e.g., GPT-4o-mini) with the Generic Response and the User Context, instructing it to “personalize this answer for a user in [location].”
- This lightweight model returns a Personalized Response.
- The Personalized Response is returned to the user. This flow is 40% faster and 90% cheaper than a full LLM call but provides a superior, contextualized experience.8
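A sketch of the personalization step on a cache hit, using the OpenAI Python SDK; the prompt wording, hash field names, and model choice are illustrative rather than prescribed.
Python
from openai import OpenAI

client = OpenAI()

def personalize(r, generic_response: str, user_id: str) -> str:
    # Fetch the user's context from the session/profile hash
    location = r.hget(f"user:profile:{user_id}", "location")
    location = location.decode() if location else "an unspecified location"
    # A cheap, fast model rewrites the cached answer for this user
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Personalize this answer for a user in {location}. "
                        "Keep the factual content unchanged."},
            {"role": "user", "content": generic_response},
        ],
    )
    return completion.choices[0].message.content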
Flow 3: Invalidation (Event-Driven)
- An external system updates a RAG document. An “Update Service” writes the new data: JSON.SET doc:123….
- The Update Service publishes an event: PUBLISH cache:invalidate ‘doc:123’.
- A “Cache Invalidation Worker” subscribed to the channel receives the message.
- The Worker executes FT.SEARCH llm_semantic_cache “@doc_id:{123}” to find all cache entries associated with that document.
- The Worker deletes all returned keys (e.g., DEL llm:semantic:abc…).
- The cache is now clean, and the next query for this topic will trigger a “Cache Miss,” populating the cache with the new, correct information.
