{"id":7815,"date":"2025-11-27T15:31:10","date_gmt":"2025-11-27T15:31:10","guid":{"rendered":"https:\/\/uplatz.com\/blog\/?p=7815"},"modified":"2025-11-27T16:40:21","modified_gmt":"2025-11-27T16:40:21","slug":"a-solutions-architects-guide-to-caching-llm-prompt-embeddings-with-redis","status":"publish","type":"post","link":"https:\/\/uplatz.com\/blog\/a-solutions-architects-guide-to-caching-llm-prompt-embeddings-with-redis\/","title":{"rendered":"A Solutions Architect&#8217;s Guide to Caching LLM Prompt Embeddings with Redis"},"content":{"rendered":"<h2><b>I. Executive Summary<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The proliferation of Generative AI and Large Language Models (LLMs) has introduced significant operational challenges in terms of computational cost and response latency. Caching is a foundational strategy to mitigate these challenges, and caching prompt embeddings\u2014the numerical representation of input text\u2014is paramount. This report provides a comprehensive architectural blueprint for implementing robust caching strategies for LLM prompt embeddings using Redis.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The analysis distinguishes between two primary caching models:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Exact-Match Caching:<\/b><span style=\"font-weight: 400;\"> A simple, high-speed approach that stores responses based on an exact hash of the input prompt. This is effective for high-frequency, identical queries.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Semantic Caching:<\/b><span style=\"font-weight: 400;\"> A sophisticated, vector-based approach that stores responses based on the <\/span><i><span style=\"font-weight: 400;\">semantic meaning<\/span><\/i><span style=\"font-weight: 400;\"> of the prompt. This allows the cache to serve results for queries that are phrased differently but have the same intent, dramatically increasing the cache hit rate.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">Redis is uniquely positioned as the ideal platform for this task. It is not merely a key-value cache but a multi-model database that can function simultaneously as an in-memory key-value store (for exact-match), a high-performance vector database (for semantic search), a session store, and a message broker. This multi-model capability allows architects to build a complete, real-time AI application stack on a single, unified platform, simplifying infrastructure and reducing latency.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This report details the business case for caching, compares exact-match and semantic patterns, and provides production-grade implementation details for both. 
It concludes with advanced strategies for performance tuning, cost optimization via vector compression, and lifecycle management, presenting a final, unified architectural blueprint for a scalable, efficient, and cost-effective LLM caching system.<\/span><\/p>
<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-7887\" src=\"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/A-Solutions-Architects-Guide-to-Caching-LLM-Prompt-Embeddings-with-Redis-1024x576.jpg\" alt=\"A Solutions Architect's Guide to Caching LLM Prompt Embeddings with Redis\" width=\"840\" height=\"473\" srcset=\"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/A-Solutions-Architects-Guide-to-Caching-LLM-Prompt-Embeddings-with-Redis-1024x576.jpg 1024w, https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/A-Solutions-Architects-Guide-to-Caching-LLM-Prompt-Embeddings-with-Redis-300x169.jpg 300w, https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/A-Solutions-Architects-Guide-to-Caching-LLM-Prompt-Embeddings-with-Redis-768x432.jpg 768w, https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2025\/11\/A-Solutions-Architects-Guide-to-Caching-LLM-Prompt-Embeddings-with-Redis.jpg 1280w\" sizes=\"auto, (max-width: 840px) 100vw, 840px\" \/><\/p>
<h2><b>II. Foundations: Prompt Embeddings and the Caching Imperative<\/b><\/h2>
<h3><b>A. What are Prompt Embeddings?<\/b><\/h3>
<p>Before implementing a cache, it is critical to understand <i>what<\/i> is being cached. When a user sends a prompt (e.g., \"What is the capital of France?\") to an LLM, the model does not understand text directly. The input is first processed by an <i>embedding model<\/i>.1<\/p>
<p>This model converts the non-numeric text into a <i>numeric representation<\/i> called an embedding, which is a high-dimensional vector (a long list of numbers).2 This vector captures the <i>semantic meaning<\/i> and <i>context<\/i> of the original text, allowing the LLM to mathematically determine relationships between concepts.4 For example, the vectors for \"king\" and \"queen\" will be mathematically close to each other. These vectors are the fundamental unit that LLMs use to process, compare, and understand human language.1<\/p>
<h3><b>B. The Business Imperative: Why Cache LLM Calls?<\/b><\/h3>
<p><span style=\"font-weight: 400;\">LLM inference is computationally intensive and expensive. 
Every API call to a model like GPT-4o incurs a direct monetary cost (paid per token) and a time cost (latency).<\/span><span style=\"font-weight: 400;\">6<\/span><span style=\"font-weight: 400;\"> Caching is the primary strategy to address both.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Cost Reduction:<\/b><span style=\"font-weight: 400;\"> Many applications, particularly enterprise chatbots, receive a high volume of repetitive or semantically similar queries.<\/span><span style=\"font-weight: 400;\">6<\/span><span style=\"font-weight: 400;\"> For example, &#8220;How do I reset my password?&#8221; and &#8220;I forgot my password, what do I do?&#8221; are different prompts with the same intent.<\/span><span style=\"font-weight: 400;\">8<\/span><span style=\"font-weight: 400;\"> By caching the response to the first query and serving it for the second, the redundant LLM API call is eliminated. This can reduce input token costs by up to 90%.<\/span><span style=\"font-weight: 400;\">9<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Latency Reduction:<\/b><span style=\"font-weight: 400;\"> Caching moves the response from a slow, multi-second computation (the LLM) to a fast, millisecond retrieval (the cache).<\/span><span style=\"font-weight: 400;\">11<\/span><span style=\"font-weight: 400;\"> For real-time applications like chatbots, this is critical for user experience. Benchmarks show that prompt caching can reduce latency by up to 80-85%.<\/span><span style=\"font-weight: 400;\">6<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Scalability:<\/b><span style=\"font-weight: 400;\"> Caching reduces the computational load on the core LLM, allowing the system to handle a higher volume of users without scaling up expensive GPU resources.<\/span><span style=\"font-weight: 400;\">6<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3><b>C. The Platform Caching Model (e.g., OpenAI)<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">It is important to note that some LLM providers, like OpenAI, implement their own form of <\/span><i><span style=\"font-weight: 400;\">prompt caching<\/span><\/i><span style=\"font-weight: 400;\"> automatically.<\/span><span style=\"font-weight: 400;\">9<\/span><span style=\"font-weight: 400;\"> However, this caching is typically limited to <\/span><i><span style=\"font-weight: 400;\">exact prefix matching<\/span><\/i><span style=\"font-weight: 400;\">. The system routes requests with an identical prefix (e.g., the first 256 tokens) to a server that recently processed the same prompt. This is effective for static system prompts or instructions placed at the beginning of a prompt.<\/span><span style=\"font-weight: 400;\">9<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This built-in caching is beneficial but insufficient. It cannot handle queries that are semantically identical but textually different (e.g., &#8220;password reset&#8221; vs. &#8220;forgot password&#8221;). To achieve this, a more sophisticated, application-layer caching strategy is required.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>III. Redis as the Unified Platform for AI Caching<\/b><\/h2>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">While many databases can store vectors, Redis presents a unique strategic advantage by being a <\/span><i><span style=\"font-weight: 400;\">multi-model<\/span><\/i><span style=\"font-weight: 400;\"> database designed for real-time performance.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>A. 
The Multi-Model Advantage<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">A &#8220;pure&#8221; vector database (e.g., Pinecone, Milvus) is specialized for one task: storing and searching vectors.<\/span><span style=\"font-weight: 400;\">13<\/span><span style=\"font-weight: 400;\"> A production AI application, however, requires more. A typical Retrieval-Augmented Generation (RAG) application needs:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A <\/span><b>Vector Database<\/b><span style=\"font-weight: 400;\"> for RAG document retrieval.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A <\/span><b>Semantic Cache<\/b><span style=\"font-weight: 400;\"> for LLM responses (also a vector database).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A <\/span><b>Key-Value Cache<\/b><span style=\"font-weight: 400;\"> for exact-match caching.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A <\/span><b>Session Store<\/b><span style=\"font-weight: 400;\"> for managing conversation history.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A <\/span><b>Message Broker<\/b><span style=\"font-weight: 400;\"> for asynchronous tasks or invalidation.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">Using specialized tools for each function results in a complex, fragmented architecture with multiple data silos. Redis, by contrast, can serve all these functions within a single, high-performance system.<\/span><span style=\"font-weight: 400;\">15<\/span><span style=\"font-weight: 400;\"> An AI architect can use Redis to store RAG documents in RedisJSON, manage vectors with Redis Search, handle exact-match caching with standard keys, store conversation history in Redis Hashes, and manage cache invalidation with Redis Pub\/Sub.<\/span><span style=\"font-weight: 400;\">15<\/span><span style=\"font-weight: 400;\"> This consolidation drastically simplifies the architecture, reduces data-transfer latency, and lowers the total cost of ownership.<\/span><span style=\"font-weight: 400;\">16<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>B. Redis vs. Specialized Vector Databases<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">When evaluating Redis specifically for its vector search capabilities, it compares favorably against specialized, &#8220;pure&#8221; vector databases.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Redis vs. Pinecone:<\/b><span style=\"font-weight: 400;\"> Pinecone is a proprietary, purpose-built vector database.<\/span><span style=\"font-weight: 400;\">13<\/span><span style=\"font-weight: 400;\"> While effective for basic vector storage, Redis Enterprise is presented as a more comprehensive real-time data platform, offering capabilities beyond vector search, such as high availability, multi-tenancy, and durability, which specialized vendors are now building to catch up.<\/span><span style=\"font-weight: 400;\">18<\/span><span style=\"font-weight: 400;\"> Benchmarking against Pinecone is legally restricted by their &#8220;DeWitt clause,&#8221; which forbids publishing performance evaluations.<\/span><span style=\"font-weight: 400;\">20<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Redis vs. 
Milvus:<\/b><span style=\"font-weight: 400;\"> Milvus is an open-source, purpose-built vector database designed for high-performance vector search, particularly at billion-scale.<\/span><span style=\"font-weight: 400;\">14<\/span><span style=\"font-weight: 400;\"> However, Redis is a mature, in-memory key-value store that has added vector search as a module.<\/span><span style=\"font-weight: 400;\">14<\/span><span style=\"font-weight: 400;\"> Some analyses suggest Milvus is finely tuned for high-demand vector workloads, while Redis offers broader versatility.<\/span><span style=\"font-weight: 400;\">21<\/span><span style=\"font-weight: 400;\"> Conversely, benchmarks published by Redis claim superior querying throughput (up to 3.3x higher QPS than Milvus) and lower indexing times (up to 2.8x lower).<\/span><span style=\"font-weight: 400;\">20<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">For many AI applications, the combination of Redis&#8217;s &#8220;good enough&#8221; or superior vector performance with its dominant, industry-standard in-memory caching and multi-model features makes it the most practical and architecturally simple choice.<\/span><span style=\"font-weight: 400;\">22<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><b>IV. Part 1: Implementing Exact-Match Caching in Redis<\/b><\/h2>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">This is the most basic caching strategy, which targets identical, repeated prompts.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>A. Concept and Pattern<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The pattern, often called &#8220;query caching,&#8221; is a simple implementation of the cache-aside pattern.<\/span><span style=\"font-weight: 400;\">23<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Prompt Identification:<\/b><span style=\"font-weight: 400;\"> The application receives a user prompt. A unique, deterministic key is generated from the exact prompt string, typically using a fast hashing algorithm like MD5 or SHA-256.<\/span><span style=\"font-weight: 400;\">23<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Cache Lookup:<\/b><span style=\"font-weight: 400;\"> The application checks Redis for the existence of this key.<\/span><span style=\"font-weight: 400;\">23<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Cache Hit:<\/b><span style=\"font-weight: 400;\"> If the key exists, the stored response is retrieved from Redis and returned to the user. The LLM is never called.<\/span><span style=\"font-weight: 400;\">23<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Cache Miss:<\/b><span style=\"font-weight: 400;\"> If the key does not exist, the application forwards the prompt to the LLM. The LLM&#8217;s response is then stored in Redis using the hash key\u2014critically, with a Time-to-Live (TTL) set\u2014and returned to the user.<\/span><span style=\"font-weight: 400;\">23<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">This pattern is incredibly fast but completely rigid; even a single-character change in the prompt (e.g., an extra space) will result in a cache miss.<\/span><span style=\"font-weight: 400;\">8<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>B. 
Implementation with redis-py<\/b><\/h3>
<p>Using the standard redis-py library, the implementation is straightforward. Redis errors are treated as non-fatal, so a cache outage degrades to direct LLM calls rather than failing the request.<\/p>
<p>Python<\/p>
<pre># [26, 59, 60]
import redis
import hashlib
import os

# Assumes llm.invoke() is defined elsewhere
# from langchain_openai import OpenAI
# llm = OpenAI()

try:
    r = redis.Redis.from_url(os.getenv(\"REDIS_URL\", \"redis:\/\/localhost:6379\"))
    r.ping()
    print(\"Connected to Redis\")
except redis.exceptions.ConnectionError as e:
    print(f\"Redis connection error: {e}\")
    # Handle connection failure

def get_response_exact_match(prompt: str):
    # 1. Create a deterministic key from the exact prompt string
    cache_key = f\"llm:exact:{hashlib.md5(prompt.encode()).hexdigest()}\"

    try:
        # 2. Cache Lookup
        cached_response = r.get(cache_key)
        # 3. Cache Hit: return the stored response; the LLM is never called
        if cached_response:
            print(\"Cache Hit! (Exact)\")
            return cached_response.decode(\"utf-8\")
    except redis.exceptions.RedisError as e:
        print(f\"Redis GET error: {e}\")
        # Fallback: go directly to the LLM

    # 4. Cache Miss: call the LLM, then store the response
    print(\"Cache Miss! (Exact)\")
    response = llm.invoke(prompt)

    try:
        # Store with a 1-hour (3600s) TTL
        r.set(cache_key, response, ex=3600)
    except redis.exceptions.RedisError as e:
        print(f\"Redis SET error: {e}\")
        # Non-blocking error; still return the response

    return response<\/pre>
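<p>Because the key is a hash of the raw string, trivial formatting differences (an extra space, different casing) produce different keys and therefore cache misses. A minimal mitigation, sketched below on the assumption that whitespace and case normalization are acceptable for the application, is to canonicalize the prompt before hashing (make_exact_key is an illustrative helper, not part of any library):<\/p>
<p>Python<\/p>
<pre>import hashlib

def make_exact_key(prompt: str) -> str:
    # Collapse runs of whitespace and lowercase the text so trivial
    # formatting differences map to the same cache key.
    canonical = \" \".join(prompt.lower().split())
    return f\"llm:exact:{hashlib.md5(canonical.encode()).hexdigest()}\"<\/pre>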
<h3><b>C. Framework Integration (LangChain)<\/b><\/h3>
<p>Frameworks like LangChain abstract this entire process into a simple, drop-in integration.<\/p>
<p>Python<\/p>
<pre># [26, 59]
import langchain
import time
import os
from redis import Redis
from langchain.cache import RedisCache
from langchain_openai import OpenAI

# Initialize the cache
redis_client = Redis.from_url(os.getenv(\"REDIS_URL\", \"redis:\/\/localhost:6379\"))
langchain.llm_cache = RedisCache(redis_=redis_client)

llm = OpenAI()

# First call (Cache Miss)
start_time = time.time()
response1 = llm.invoke(\"Explain the concept of caching in three sentences.\")
time1 = time.time() - start_time
print(f\"First call (Miss):\\nTime: {time1:.2f} seconds\\n\")

# Second call (Cache Hit)
start_time = time.time()
response2 = llm.invoke(\"Explain the concept of caching in three sentences.\")
time2 = time.time() - start_time
print(f\"Second call (Hit):\\nTime: {time2:.2f} seconds\\n\")

print(f\"Speed improvement: {time1 \/ time2:.2f}x faster\")<\/pre>
<p>This example, based on documentation,26 highlights the dramatic performance gain: a 25.40x speed improvement, retrieving the response from the cache in 0.05 seconds versus 1.16 seconds from the LLM.<\/p>
<h2><b>V. Part 2: Implementing Semantic Caching with Redis Vector Search<\/b><\/h2>
<p>This advanced strategy provides far higher cache hit rates by matching prompts based on meaning, not just exact text.25 It requires Redis to be used as a vector database, leveraging the RediSearch module.27<\/p>
<h3><b>A. Defining the Schema and Index<\/b><\/h3>
<p>Before storing vectors, an index schema must be defined to tell Redis how to store and query them.27 Data can be stored in Redis Hashes or JSON documents.28<\/p>
<p><b>Indexing Algorithms<\/b><\/p>
<p>The architect must first choose an indexing algorithm:<\/p>
<ul>
<li><b>FLAT (Flat Index):<\/b> Performs a brute-force k-nearest-neighbor (KNN) search, checking the query vector against every other vector in the index and guaranteeing 100% accuracy. It is recommended for smaller datasets (&lt; 1 million vectors) where perfect accuracy is non-negotiable.28<\/li>
<li><b>HNSW (Hierarchical Navigable Small World):<\/b> An Approximate Nearest Neighbor (ANN) algorithm. It builds a fast, multi-layered graph structure that enables highly scalable and rapid searches, even with billions of vectors. This is the recommended choice for most production systems at scale, as it offers an excellent balance of speed and accuracy.29<\/li>
<\/ul>
<p><b>Vector Data Types<\/b><\/p>
<p>The TYPE parameter is the simplest lever for cost optimization.<\/p>
<ul>
<li>FLOAT32: The standard 32-bit floating-point precision for vectors. Most embedding models output in this format.28 Each dimension uses 4 bytes.<\/li>
<li>FLOAT16 \/ BFLOAT16: Half-precision (16-bit) types that immediately cut the memory footprint of the vectors in half.29 This is the most basic form of quantization and a powerful \"Day 1\" optimization for managing the cost of in-memory storage, with a potentially minor and often acceptable trade-off in precision.<\/li>
<\/ul>
<p><b>Example Schema Definition (Redis-CLI)<\/b><\/p>
<p>The following FT.CREATE command defines an index for a semantic cache (the comments are annotations, not part of the command).<\/p>
<p>Code snippet<\/p>
<pre># Creates a new index named 'llm_semantic_cache'
FT.CREATE llm_semantic_cache

    # Store data in Redis Hashes
    ON HASH

    # Index all keys with this prefix
    PREFIX 1 \"llm:semantic:\"

    SCHEMA
        # --- Metadata fields for hybrid search ---
        prompt TEXT
        response TEXT
        model_version TAG SEPARATOR |
        user_id TAG SEPARATOR |

        # --- The vector field ---
        prompt_vector VECTOR HNSW 6
            TYPE FLOAT32
            # Dimensions for OpenAI 'text-embedding-ada-002'
            DIM 1536
            # CRITICAL: use COSINE for text embeddings
            DISTANCE_METRIC COSINE<\/pre>
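<p>Once the index exists, each cache entry is written as a Redis Hash under the indexed prefix. The sketch below (an illustration using redis-py and numpy; embed() stands in for whatever embedding model is in use, and store_cache_entry is a hypothetical helper) shows the key detail: the vector must be serialized as raw FLOAT32 bytes to match the schema's TYPE and DIM:<\/p>
<p>Python<\/p>
<pre>import numpy as np
import redis

r = redis.Redis.from_url(\"redis:\/\/localhost:6379\")

def store_cache_entry(entry_id: str, prompt: str, response: str,
                      embedding, user_id: str, model_version: str):
    # Serialize the embedding as raw FLOAT32 bytes, matching
    # TYPE FLOAT32 and DIM 1536 in the index schema.
    vector_bytes = np.asarray(embedding, dtype=np.float32).tobytes()
    r.hset(f\"llm:semantic:{entry_id}\", mapping={
        \"prompt\": prompt,
        \"response\": response,
        \"model_version\": model_version,
        \"user_id\": user_id,
        \"prompt_vector\": vector_bytes,
    })

# Usage: store_cache_entry(\"001\", prompt, llm_response, embed(prompt), \"user-123\", \"gpt-4o\")<\/pre>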
<h3><b>B. The Similarity Metric: A Critical Choice (Cosine vs. Euclidean)<\/b><\/h3>
<p>The DISTANCE_METRIC parameter is arguably the most important <i>logical<\/i> choice in the schema. It defines what \"similar\" means.<\/p>
<ol>
<li><b>Cosine Similarity (COSINE):<\/b> Measures the <i>angle<\/i> between two vectors, effectively ignoring their magnitude (length).30 It is the industry standard for NLP and text analysis because it measures semantic <i>direction<\/i> or <i>intent<\/i>.30 For example, the prompts \"Give me a recipe\" and \"Give me ten recipes\" have different magnitudes but share the same semantic intent; cosine similarity correctly identifies them as being very close.32<\/li>
<li><b>Euclidean Distance (L2):<\/b> Measures the <i>straight-line distance<\/i> between the endpoints of two vectors.30 It is highly sensitive to vector magnitude. In the recipe example, it would find \"one recipe\" and \"ten recipes\" to be very far apart and thus dissimilar. Using the wrong metric, such as L2 for text embeddings, can cause retrieval to fail and feed the LLM irrelevant context.33<\/li>
<\/ol>
<p>The choice of metric is dictated by the embedding model itself. For modern text-embedding models (from OpenAI, Hugging Face, etc.), <b>Cosine similarity is the correct and intended metric.<\/b> Using L2 for text-based semantic caching is a conceptual error that will lead to poor cache performance. The architectural recommendation is therefore non-negotiable: <b>always use DISTANCE_METRIC COSINE for prompt-embedding caching.<\/b><\/p>
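<p>The distinction is easy to verify numerically. In the toy sketch below (illustrative vectors, not real embeddings), two vectors pointing in the same direction but with different magnitudes are identical under cosine similarity yet far apart under L2:<\/p>
<p>Python<\/p>
<pre>import numpy as np

a = np.array([1.0, 2.0, 3.0])     # e.g., \"Give me a recipe\"
b = np.array([10.0, 20.0, 30.0])  # same direction, 10x the magnitude

cosine_sim = np.dot(a, b) \/ (np.linalg.norm(a) * np.linalg.norm(b))
l2_dist = np.linalg.norm(a - b)

print(f\"Cosine similarity: {cosine_sim:.3f}\")   # 1.000 -> same intent
print(f\"Euclidean distance: {l2_dist:.3f}\")     # 33.675 -> \"dissimilar\"<\/pre>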
<h3><b>C. The Search Query: K-Nearest Neighbor (KNN) in Practice<\/b><\/h3>
<p>The \"cache lookup\" operation is now a vector search query.29 The application generates an embedding for the new prompt and then uses the FT.SEARCH command to find the \"k-nearest neighbors\" (e.g., the 1 closest match) in the index.<\/p>
<p>A simple KNN search is insufficient for production. A production-grade cache must be multi-tenant and version-aware: a query from user_A must never return a cached response from user_B. This is accomplished using <b>Hybrid Querying<\/b>, combining the KNN vector search with metadata filters on the TAG fields defined in the schema.27<\/p>
<p><b>Example Hybrid Query (Redis-CLI)<\/b><\/p>
<p>This query finds the top 1 nearest neighbor that <i>also<\/i> matches the specified user_id and model_version. (RediSearch returns the KNN distance in a field named after the vector field, here __prompt_vector_score.)<\/p>
<p>Code snippet<\/p>
<pre># Find the 1 nearest neighbor that ALSO matches the
# user_id and model_version metadata tags
FT.SEARCH llm_semantic_cache
    \"(@user_id:{user-123} @model_version:{gpt-4o})=>[KNN 1 @prompt_vector $query_vector]\"

    # $query_vector is a parameter containing the 1536-dim vector bytes
    PARAMS 2 query_vector \"...\"

    # Return the matched prompt, response, user_id, and similarity score
    RETURN 4 prompt response user_id __prompt_vector_score

    # ASC = smallest distance (closest match) first
    SORTBY __prompt_vector_score ASC
    DIALECT 2<\/pre>
<p>This hybrid pattern is critical for preventing data leaks between users and ensuring cache consistency when LLM models are updated.<\/p>
<h3><b>D. Framework Integration (The \"Easy Button\")<\/b><\/h3>
<p>Manually managing index creation, vectorization, and hybrid queries is complex.<\/p>
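<p>As a sketch of what that manual work involves, the hybrid lookup above can be driven from redis-py roughly as follows (assuming the index defined earlier; query_embedding would come from the application's embedding model, and semantic_lookup is an illustrative helper):<\/p>
<p>Python<\/p>
<pre>import numpy as np
from redis.commands.search.query import Query

def semantic_lookup(r, query_embedding, user_id: str, model_version: str):
    # Hybrid query: KNN vector search constrained by metadata TAG filters
    q = (
        Query(
            f\"(@user_id:{{{user_id}}} @model_version:{{{model_version}}})\"
            \"=>[KNN 1 @prompt_vector $query_vector]\"
        )
        .sort_by(\"__prompt_vector_score\")
        .return_fields(\"prompt\", \"response\", \"__prompt_vector_score\")
        .dialect(2)
    )
    params = {\"query_vector\": np.asarray(query_embedding, dtype=np.float32).tobytes()}
    return r.ft(\"llm_semantic_cache\").search(q, query_params=params).docs<\/pre>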
<p>AI libraries abstract this process.<\/p>
<ol>
<li>RedisVL (Native Python Client)<\/li>
<\/ol>
<p>redisvl is Redis's official, high-level Python library for AI applications.36 It provides a SemanticCache class that automates the entire workflow: index creation, prompt-to-vector embedding, and the check\/store logic.38<\/p>
<p>Python<\/p>
<pre>from redisvl.extensions.cache.llm import SemanticCache
import os

# 1. Initialize the cache. This auto-creates the index in Redis.
# A text vectorizer (e.g., HFTextVectorizer) embeds prompts internally.
llmcache = SemanticCache(
    name=\"llmcache\",
    prefix=\"llm:semantic\",  # Key prefix
    redis_url=os.getenv(\"REDIS_URL\", \"redis:\/\/localhost:6379\"),
    # Cosine DISTANCE threshold (1.0 - 0.9 similarity = 0.1)
    distance_threshold=0.1,
)

def get_response_semantic(question: str) -> str:
    # 2. Check the cache (this embeds the prompt and runs a KNN search)
    if results := llmcache.check(prompt=question):
        print(\"Cache Hit! (Semantic)\")
        # check() returns a list of cached results
        return results[0][\"response\"]

    # 3. Cache Miss: call the LLM and store the new response
    print(\"Cache Miss! (Semantic)\")
    response = \"Paris\"  # llm.invoke(question)
    llmcache.store(prompt=question, response=response)
    return response

get_response_semantic(\"What is the capital of France?\")<\/pre>
<ol start=\"2\">
<li>LangChain Integration<\/li>
<\/ol>
<p>LangChain also provides a RedisSemanticCache that can be set as the global LLM cache, automating the process for any llm.invoke call.26<\/p>
<p>Python<\/p>
<pre># [26, 40, 44, 61]
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_redis import RedisSemanticCache
import os

# 1. Initialize the embedding model
embeddings = OpenAIEmbeddings()

# 2. Initialize and set the global cache.
# Note: distance_threshold is COSINE DISTANCE;
# 0.2 is equivalent to 0.8 Cosine Similarity.
set_llm_cache(
    RedisSemanticCache(
        redis_url=os.getenv(\"REDIS_URL\", \"redis:\/\/localhost:6379\"),
        embeddings=embeddings,
        distance_threshold=0.2,
    )
)

llm = OpenAI()

# First call (Cache Miss)
llm.invoke(\"What is the capital of France?\")

# Second call (Semantic Cache Hit)
# This different phrasing will be matched by the vector search
llm.invoke(\"Tell me the capital of France\")<\/pre>
<ol start=\"3\">
<li>LlamaIndex Integration<\/li>
<\/ol>
<p>LlamaIndex also integrates deeply with Redis as a VectorStore, which can be used to build RAG and caching workflows.27<\/p>
<h2><b>VI. 
Advanced Optimization: Tuning for Performance, Cost, and Accuracy<\/b><\/h2>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Deploying a semantic cache to production requires fine-tuning three key areas: the similarity threshold (logic), the index parameters (database performance), and vector compression (cost).<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>A. Tuning the Similarity Threshold<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The distance_threshold is the <\/span><i><span style=\"font-weight: 400;\">single most important logical parameter<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><span style=\"font-weight: 400;\">38<\/span><span style=\"font-weight: 400;\"> It defines the &#8220;hitbox&#8221; for a cache hit and represents a direct trade-off between cost savings (hit rate) and correctness (accuracy).<\/span><span style=\"font-weight: 400;\">45<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Low Threshold (e.g., 0.7 Similarity \/ 0.3 Distance):<\/b><span style=\"font-weight: 400;\"> This creates a <\/span><i><span style=\"font-weight: 400;\">wide<\/span><\/i><span style=\"font-weight: 400;\"> net, increasing the cache hit rate and saving more money. However, it significantly increases the risk of &#8220;false positives&#8221;\u2014serving an irrelevant answer for a query that is only vaguely related.<\/span><span style=\"font-weight: 400;\">31<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>High Threshold (e.g., 0.95 Similarity \/ 0.05 Distance):<\/b><span style=\"font-weight: 400;\"> This creates a <\/span><i><span style=\"font-weight: 400;\">narrow<\/span><\/i><span style=\"font-weight: 400;\"> net, decreasing the hit rate but ensuring that only <\/span><i><span style=\"font-weight: 400;\">highly<\/span><\/i><span style=\"font-weight: 400;\"> similar queries are matched, maximizing accuracy.<\/span><span style=\"font-weight: 400;\">31<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">While studies suggest a 0.8 (Cosine Similarity) <\/span><span style=\"font-weight: 400;\">31<\/span><span style=\"font-weight: 400;\"> or 0.85 <\/span><span style=\"font-weight: 400;\">46<\/span><span style=\"font-weight: 400;\"> threshold provides an optimal balance, this value is <\/span><b>not universal<\/b><span style=\"font-weight: 400;\">. It is <\/span><i><span style=\"font-weight: 400;\">highly domain-specific<\/span><\/i><span style=\"font-weight: 400;\">. 
For example, a query for \"how to build muscle quickly\" requires a different answer than \"how to build muscle sustainably.\" A low threshold might incorrectly group them, which could be dangerous in a fitness application.48<\/span><\/p>
<p>The threshold is a business-logic parameter that must be empirically tested against a validation dataset for the specific domain.<\/p>
<p><b>Table 6.1: Similarity Threshold Tuning Guide (Cosine Similarity)<\/b><\/p>
<table>
<tbody>
<tr>
<td><b>Similarity Threshold<\/b><\/td>
<td><b>Distance Threshold (1 - Sim)<\/b><\/td>
<td><b>Typical Use Case<\/b><\/td>
<td><b>Resulting Hit Rate<\/b><\/td>
<td><b>Accuracy Trade-off<\/b><\/td>
<\/tr>
<tr>
<td>0.95 - 0.99<\/td>
<td>0.05 - 0.01<\/td>
<td>Highly Sensitive Domains (Legal, Medical)<\/td>
<td>Very Low<\/td>
<td>Maximum accuracy; no false positives.<\/td>
<\/tr>
<tr>
<td>0.85 - 0.90<\/td>
<td>0.15 - 0.10<\/td>
<td>Production RAG, Tech Support 46<\/td>
<td>Medium<\/td>
<td>Good accuracy; catches phrasing variants.<\/td>
<\/tr>
<tr>
<td>0.80 - 0.85<\/td>
<td>0.20 - 0.15<\/td>
<td>General Chatbots (Optimal Balance) 31<\/td>
<td>High<\/td>
<td>Best blend of savings and accuracy.<\/td>
<\/tr>
<tr>
<td>&lt; 0.80<\/td>
<td>&gt; 0.20<\/td>
<td>Conceptual Clustering<\/td>
<td>Very High<\/td>
<td>High risk of irrelevant\/incorrect answers.31<\/td>
<\/tr>
<\/tbody>
<\/table>
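<p>Because the right value is domain-specific, a practical approach is to sweep candidate distance thresholds over a small labeled validation set of prompt pairs. The sketch below makes that concrete (pairs and embed() are placeholders for the application's own labeled data and embedding model; sweep_thresholds is an illustrative helper):<\/p>
<p>Python<\/p>
<pre>import numpy as np

def cosine_distance(u, v) -> float:
    u, v = np.asarray(u), np.asarray(v)
    return 1.0 - np.dot(u, v) \/ (np.linalg.norm(u) * np.linalg.norm(v))

def sweep_thresholds(pairs, embed, thresholds=(0.05, 0.10, 0.15, 0.20)):
    # pairs: (prompt_a, prompt_b, should_hit) tuples labeled by hand
    for t in thresholds:
        agree = [
            (cosine_distance(embed(a), embed(b)) &lt;= t) == should_hit
            for a, b, should_hit in pairs
        ]
        print(f\"threshold={t:.2f}  agreement={sum(agree) \/ len(agree):.1%}\")<\/pre>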
They control the trade-off between index build time, memory usage, and query speed\/accuracy.<\/span><span style=\"font-weight: 400;\">49<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Table 6.2: HNSW Hyperparameter Trade-offs <\/span><span style=\"font-weight: 400;\">49<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Parameter<\/b><\/td>\n<td><b>Definition (Redis Default)<\/b><\/td>\n<td><b>Higher Value<\/b><\/td>\n<td><b>Lower Value<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>M<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Max connections per node (16)<\/span><\/td>\n<td><b>Pro:<\/b><span style=\"font-weight: 400;\"> Better recall.<\/span><\/p>\n<p><b>Con:<\/b><span style=\"font-weight: 400;\"> More memory, slower indexing.<\/span><\/td>\n<td><b>Pro:<\/b><span style=\"font-weight: 400;\"> Less memory, faster indexing.<\/span><\/p>\n<p><b>Con:<\/b><span style=\"font-weight: 400;\"> Lower recall.<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>EF_CONSTRUCTION<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Build-time search depth (200)<\/span><\/td>\n<td><b>Pro:<\/b><span style=\"font-weight: 400;\"> Higher quality index, better query accuracy.<\/span><\/p>\n<p><b>Con:<\/b><span style=\"font-weight: 400;\"> Much slower index build time.<\/span><\/td>\n<td><b>Pro:<\/b><span style=\"font-weight: 400;\"> Faster index build time.<\/span><\/p>\n<p><b>Con:<\/b><span style=\"font-weight: 400;\"> Lower quality index, lower query accuracy.<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>EF_RUNTIME<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Query-time search depth (10)<\/span><\/td>\n<td><b>Pro:<\/b><span style=\"font-weight: 400;\"> Higher query accuracy\/recall.<\/span><\/p>\n<p><b>Con:<\/b><span style=\"font-weight: 400;\"> Higher query latency.<\/span><\/td>\n<td><b>Pro:<\/b><span style=\"font-weight: 400;\"> Lower query latency.<\/span><\/p>\n<p><b>Con:<\/b><span style=\"font-weight: 400;\"> Lower query accuracy\/recall.<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">For a &#8220;build-once, query-millions&#8221; workload like a cache, the optimal strategy is to <\/span><b>over-invest at build time to enable minimal work at query time.<\/b><span style=\"font-weight: 400;\"> This means setting a <\/span><b>high M (e.g., 64)<\/b><span style=\"font-weight: 400;\"> and a <\/span><b>high EF_CONSTRUCTION (e.g., 500)<\/b><span style=\"font-weight: 400;\">. This creates a highly optimized index graph. Because the graph is so well-structured, a query can achieve high accuracy with a <\/span><b>very low EF_RUNTIME (e.g., 10-20)<\/b><span style=\"font-weight: 400;\">, delivering both high accuracy and the sub-millisecond latency expected of a cache.<\/span><span style=\"font-weight: 400;\">49<\/span><\/p>
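<p><span style=\"font-weight: 400;\">A sketch of this configuration with redis-py follows. The index name, key prefix, tag fields (user_id, data_version, doc_id), and 1536-dimension FLOAT32 vector are assumptions for illustration; only the M, EF_CONSTRUCTION, and EF_RUNTIME values reflect the strategy above.<\/span><\/p>
<pre><code>import redis
from redis.commands.search.field import TagField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host='localhost', port=6379)

# Build-heavy, query-light HNSW index for the semantic cache.
r.ft('llm_semantic_cache').create_index(
    fields=[
        TagField('user_id'),
        TagField('data_version'),
        TagField('doc_id'),          # used later for event-driven invalidation
        VectorField('vector', 'HNSW', {
            'TYPE': 'FLOAT32',
            'DIM': 1536,             # e.g., a 1536-dim embedding model
            'DISTANCE_METRIC': 'COSINE',
            'M': 64,                 # dense graph: more RAM, better recall
            'EF_CONSTRUCTION': 500,  # slow, one-time, high-quality build
            'EF_RUNTIME': 20,        # default search depth: fast queries
        }),
    ],
    definition=IndexDefinition(prefix=['llm:semantic:'], index_type=IndexType.HASH),
)
<\/code><\/pre>
<p>&nbsp;<\/p>
<h3><b>C.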
Managing the Memory Footprint: Quantization and Compression<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The primary argument against using in-memory Redis for large-scale vector search is cost, as RAM is expensive.<\/span><span style=\"font-weight: 400;\">50<\/span><span style=\"font-weight: 400;\"> A cache with tens of millions of 1536-dimension FLOAT32 vectors can consume hundreds of gigabytes.<\/span><span style=\"font-weight: 400;\">50<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Redis Enterprise addresses this directly with advanced <\/span><b>vector quantization and compression<\/b><span style=\"font-weight: 400;\"> techniques, such as LVQ (Locally-adaptive Vector Quantization) and LeanVec.<\/span><span style=\"font-weight: 400;\">50<\/span><span style=\"font-weight: 400;\"> These features are not just optimizations; they are the strategic answer to making large-scale in-memory vector search <\/span><i><span style=\"font-weight: 400;\">economically viable<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">These techniques compress the vectors, dramatically reducing the memory footprint (by 60% or more) while maintaining high performance.<\/span><span style=\"font-weight: 400;\">51<\/span><span style=\"font-weight: 400;\"> In fact, because graph-based search is memory-bound, searching smaller, compressed vectors can often be <\/span><i><span style=\"font-weight: 400;\">faster<\/span><\/i><span style=\"font-weight: 400;\"> (up to 2x) than searching the full FLOAT32 vectors.<\/span><span style=\"font-weight: 400;\">50<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Table 6.3: Vector Compression Trade-offs in Redis <\/span><span style=\"font-weight: 400;\">50<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Compression Type<\/b><\/td>\n<td><b>Memory Savings (vs. FLOAT32)<\/b><\/td>\n<td><b>Search Performance<\/b><\/td>\n<td><b>Ingestion Time<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">FLOAT32 (Baseline)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">None<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Baseline<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Baseline<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">LeanVec4x8<\/span><\/td>\n<td><span style=\"font-weight: 400;\">High (60%+)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Very Fast (up to 2x)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Slower (up to 1.7x)<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">LVQ8<\/span><\/td>\n<td><span style=\"font-weight: 400;\">High (60%+)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Very Fast<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Slower (up to 2.6x)<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">LVQ4<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Highest<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Fast<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Slower<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">The trade-off is a longer one-time ingestion cost for massive long-term savings in memory cost and query speed.<\/span><span style=\"font-weight: 400;\">50<\/span><span style=\"font-weight: 400;\"> Note that these optimizations are currently tuned for x86 platforms and may be &#8220;impractical on ARM platforms today&#8221; due to slower ingestion.<\/span><span style=\"font-weight: 400;\">50<\/span><\/p>
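<p><span style=\"font-weight: 400;\">A back-of-envelope sizing helper makes the RAM argument concrete. The HNSW graph term below is a rough assumption (M edges per node at roughly 8 bytes each), not exact Redis accounting; at 10 million 1536-dimension vectors it yields roughly 62 GB raw, dropping to roughly 28 GB at the 60% compression cited above.<\/span><\/p>
<pre><code># Rough sizing model: raw vector bytes plus an approximate HNSW graph
# overhead (M edges per node at ~8 bytes each). Not exact Redis accounting.
def cache_footprint_gb(n_vectors, dim=1536, m=64, compression=0.0):
    raw = n_vectors * dim * 4 * (1.0 - compression)   # FLOAT32 = 4 bytes/dim
    graph = n_vectors * m * 8                         # assumed edge cost
    return (raw + graph) / 1024**3

print(f'10M vectors, FLOAT32:        {cache_footprint_gb(10_000_000):.1f} GB')
print(f'10M vectors, 60% compressed: {cache_footprint_gb(10_000_000, compression=0.6):.1f} GB')
<\/code><\/pre>
<p>&nbsp;<\/p>
<h2><b>VII.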
Production Lifecycle: Cache Management and Invalidation<\/b><\/h2>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">A cache is a living system that requires policies for eviction (when full) and invalidation (when stale).<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>A. Cache Eviction Strategies<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">When Redis reaches its maxmemory limit, it must evict (delete) keys to make room for new data. The eviction policy determines <\/span><i><span style=\"font-weight: 400;\">which<\/span><\/i><span style=\"font-weight: 400;\"> keys are deleted.<\/span><span style=\"font-weight: 400;\">52<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">noeviction: Blocks all new writes when memory is full. This is unsuitable for a cache.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">allkeys-lru: Evicts the <\/span><i><span style=\"font-weight: 400;\">Least Recently Used<\/span><\/i><span style=\"font-weight: 400;\"> (LRU) key from <\/span><i><span style=\"font-weight: 400;\">all<\/span><\/i><span style=\"font-weight: 400;\"> keys.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">volatile-lru: Evicts the LRU key <\/span><i><span style=\"font-weight: 400;\">only from keys that have an expiration (TTL) set<\/span><\/i><span style=\"font-weight: 400;\">. This is the default on Redis Enterprise and Redis Cloud; open-source Redis defaults to noeviction.<\/span><span style=\"font-weight: 400;\">52<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">allkeys-lfu \/ volatile-lfu: Evicts the <\/span><i><span style=\"font-weight: 400;\">Least Frequently Used<\/span><\/i><span style=\"font-weight: 400;\"> (LFU) key.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">volatile-ttl: Evicts the key with the shortest TTL remaining.<\/span><span style=\"font-weight: 400;\">54<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">The choice of eviction policy is critical in a multi-model Redis instance. If allkeys-lru is used, a high-volume cache write operation (all &#8220;new&#8221; keys) could cause Redis to evict &#8220;old&#8221; keys, such as a user&#8217;s <\/span><i><span style=\"font-weight: 400;\">active session key<\/span><\/i><span style=\"font-weight: 400;\"> or a critical RAG document. This would be a catastrophic failure.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The correct architectural pattern is to:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Set <\/span><b>all cache entries<\/b><span style=\"font-weight: 400;\"> (both exact-match and semantic) with a TTL (e.g., 24 hours).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Store <\/span><b>all persistent data<\/b><span style=\"font-weight: 400;\"> (user sessions, RAG documents) <\/span><i><span style=\"font-weight: 400;\">without<\/span><\/i><span style=\"font-weight: 400;\"> a TTL.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Set the Redis eviction policy to <\/span><b>volatile-lru<\/b><span style=\"font-weight: 400;\"> (as shown in the sketch below).<\/span><span style=\"font-weight: 400;\">55<\/span><\/li>\n<\/ol>
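<p><span style=\"font-weight: 400;\">A minimal redis-py sketch of this pattern follows, with illustrative key names; in production the maxmemory settings belong in redis.conf rather than CONFIG SET.<\/span><\/p>
<pre><code>import redis

r = redis.Redis(host='localhost', port=6379)

# Normally set in redis.conf; CONFIG SET shown here for illustration only.
r.config_set('maxmemory', '4gb')
r.config_set('maxmemory-policy', 'volatile-lru')

# Cache entry: carries a TTL, so it is eligible for eviction.
r.hset('llm:semantic:abc', mapping={'prompt': '...', 'response': '...'})
r.expire('llm:semantic:abc', 60 * 60 * 24)      # 24-hour TTL

# Persistent data (session, RAG document): no TTL, never evicted.
r.hset('user:profile:user-123', mapping={'location': 'London'})
<\/code><\/pre>
<p><span style=\"font-weight: 400;\">This creates two classes of data.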
The eviction policy will <\/span><i><span style=\"font-weight: 400;\">only<\/span><\/i><span style=\"font-weight: 400;\"> ever target the &#8220;cache&#8221; class, protecting the persistent data from being deleted.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><b>B. Advanced Cache Invalidation<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">A TTL handles old data, but it cannot handle <\/span><i><span style=\"font-weight: 400;\">stale<\/span><\/i><span style=\"font-weight: 400;\"> data. If a document in a RAG system is updated, all cached LLM responses based on that document are now incorrect and must be invalidated immediately.<\/span><span style=\"font-weight: 400;\">56<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Strategy 1: Prompt Versioning:<\/b><span style=\"font-weight: 400;\"> As discussed in the Hybrid Query section (V.C), a model_version or data_version tag should be stored with the cached entry.<\/span><span style=\"font-weight: 400;\">57<\/span><span style=\"font-weight: 400;\"> When a model or document set is updated, the application simply queries for the new version tag (e.g., @data_version:{v2}). All v1 entries are instantly &#8220;invalidated&#8221; (they are no longer queried) and will eventually be evicted by their TTL.<\/span><span style=\"font-weight: 400;\">57<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Strategy 2: Event-Driven Invalidation:<\/b><span style=\"font-weight: 400;\"> This is a proactive strategy. When a source document is updated, an event is published to trigger the deletion of related cache entries.<\/span><span style=\"font-weight: 400;\">58<\/span><span style=\"font-weight: 400;\"> While this can be done with external message queues <\/span><span style=\"font-weight: 400;\">58<\/span><span style=\"font-weight: 400;\">, the multi-model &#8220;Redis Advantage&#8221; provides a self-contained solution: <\/span><b>Redis Pub\/Sub<\/b><span style=\"font-weight: 400;\">.<\/span><span style=\"font-weight: 400;\">16<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">The event-driven architecture works as follows:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A &#8220;Document Update Service&#8221; writes a new document to Redis (e.g., JSON.SET doc:123&#8230;).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Immediately after, it publishes an invalidation event: PUBLISH cache:invalidate &#8216;doc:123&#8217;.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A separate &#8220;Cache Invalidation Worker&#8221; is subscribed to this channel: SUBSCRIBE cache:invalidate.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">When the worker receives the &#8216;doc:123&#8217; message, it performs a hybrid search to find all cache keys tagged with doc_id:{123} and deletes them.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">This pattern creates a clean, real-time invalidation loop that solves the stale cache problem using only Redis&#8217;s built-in features; a worker implementing the loop is sketched below.<\/span><\/p>
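<p><span style=\"font-weight: 400;\">The following redis-py sketch shows the worker side of this loop. The channel name, index name, and doc_id tag field are the illustrative names used in the steps above, and production code would add error handling and batching.<\/span><\/p>
<pre><code>import redis
from redis.commands.search.query import Query

r = redis.Redis(host='localhost', port=6379)
pubsub = r.pubsub()
pubsub.subscribe('cache:invalidate')

for message in pubsub.listen():
    if message['type'] != 'message':
        continue                                  # skip subscribe confirmations
    doc_id = message['data'].decode()             # e.g., 'doc:123'
    tag = doc_id.split(':')[1]                    # '123'
    # Find every cache entry tagged with the updated document.
    results = r.ft('llm_semantic_cache').search(
        Query(f'@doc_id:{{{tag}}}').no_content().paging(0, 1000)
    )
    if results.docs:
        r.delete(*[doc.id for doc in results.docs])
<\/code><\/pre>
<p>&nbsp;<\/p>
<h2><b>VIII. Final Architectural Recommendations &amp; Blueprint<\/b><\/h2>
<p>&nbsp;<\/p>
<h3><b>A.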
Recommended Stack &amp; Configuration<\/b><\/h3>\n<p>&nbsp;<\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Database:<\/b><span style=\"font-weight: 400;\"> Redis Stack (or Redis Enterprise) to ensure RediSearch (vector) and RedisJSON capabilities.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Caching Strategy:<\/b><span style=\"font-weight: 400;\"> A tiered, context-aware hybrid model.<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>Layer 1:<\/b><span style=\"font-weight: 400;\"> RedisCache (e.g., via LangChain) for simple, exact-match caching.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>Layer 2:<\/b><span style=\"font-weight: 400;\"> RedisVL.SemanticCache for high-intent, generic user queries.<\/span><span style=\"font-weight: 400;\">38<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>Layer 3:<\/b><span style=\"font-weight: 400;\"> A custom <\/span><b>Context-Enabled Semantic Caching (CESC)<\/b><span style=\"font-weight: 400;\"> pattern.<\/span><span style=\"font-weight: 400;\">8<\/span><span style=\"font-weight: 400;\"> On a semantic cache hit, retrieve the generic response, retrieve the user&#8217;s context (e.g., location, role) from a separate Redis Hash, and send both to a <\/span><i><span style=\"font-weight: 400;\">fast, cheap LLM<\/span><\/i><span style=\"font-weight: 400;\"> (e.g., GPT-4o-mini) for real-time personalization. This provides the speed and cost-savings of caching with the power of personalization.<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Vector Index:<\/b><span style=\"font-weight: 400;\"> HNSW for performance at scale.<\/span><span style=\"font-weight: 400;\">29<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Index Configuration:<\/b><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">DISTANCE_METRIC COSINE.<\/span><span style=\"font-weight: 400;\">30<\/span><span style=\"font-weight: 400;\"> This is non-negotiable for text.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">High EF_CONSTRUCTION (e.g., 500) and M (e.g., 64) for high-quality index builds.<\/span><span style=\"font-weight: 400;\">49<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Low EF_RUNTIME (e.g., 20) for fast queries.<\/span><span style=\"font-weight: 400;\">49<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Memory Management:<\/b><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Start with TYPE FLOAT32.<\/span><span style=\"font-weight: 400;\">28<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Migrate to LVQ or LeanVec compression when vector count exceeds 1 million entries or memory cost becomes a concern.<\/span><span style=\"font-weight: 400;\">51<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Tuning &amp; Management:<\/b><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>Threshold:<\/b><span style=\"font-weight: 400;\"> Begin empirical testing with a distance_threshold of 0.15 (Similarity 0.85) and adjust based on domain-specific accuracy requirements.<\/span><span style=\"font-weight: 400;\">46<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>Eviction 
Policy:<\/b><span style=\"font-weight: 400;\"> volatile-lru.<\/span><span style=\"font-weight: 400;\">52<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>TTL:<\/b><span style=\"font-weight: 400;\"> Set a 24-hour TTL on all cache entries.<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Invalidation:<\/b><span style=\"font-weight: 400;\"> Use event-driven invalidation via Redis Pub\/Sub for RAG-based caches.<\/span><span style=\"font-weight: 400;\">58<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3><b>B. Final Architectural Blueprint: The Unified AI Caching Flow<\/b><\/h3>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The following describes the complete, production-grade query lifecycle that integrates all recommended patterns.<\/span><\/p>\n<p><b>Flow 1: Cache Miss (Populating the Cache)<\/b><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A user sends a query. The Application generates an embedding (Query Vector).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The App executes a hybrid FT.SEARCH in Redis for the Query Vector + @user_id + @data_version.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Redis returns a <\/span><b>Cache Miss<\/b><span style=\"font-weight: 400;\">.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The App retrieves RAG documents from Redis (e.g., JSON.GET doc:123).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The App calls the expensive, high-quality LLM (e.g., GPT-4o) with the prompt and RAG context.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The App receives the Response.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The App stores the Response in the semantic cache: HSET llm:semantic:abc prompt &#8220;&#8230;&#8221; response &#8220;&#8230;&#8221; vector &#8220;&#8230;&#8221; user_id &#8220;&#8230;&#8221; data_version &#8220;&#8230;&#8221; and sets a 24-hour EXPIRE.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The Response is returned to the user (the whole flow is sketched after this list).<\/span><\/li>\n<\/ol>
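<p><span style=\"font-weight: 400;\">Condensed into code, Flow 1 might look like the sketch below. embed(), fetch_rag_docs(), and call_llm() are hypothetical stand-ins for the embedding model, RAG retrieval, and the expensive LLM; the 0.15 distance threshold, 24-hour TTL, and index layout follow the recommendations above.<\/span><\/p>
<pre><code>import hashlib
import redis
from redis.commands.search.query import Query

r = redis.Redis(host='localhost', port=6379)
TTL_SECONDS = 60 * 60 * 24          # 24-hour TTL per the recommendations

def answer(prompt, user_id, data_version):
    # embed(), fetch_rag_docs(), and call_llm() are hypothetical stand-ins
    # for an embedding model, RAG retrieval, and an expensive LLM client.
    vec = embed(prompt)             # float32 numpy vector, e.g. 1536 dims
    q = (
        Query(f'(@user_id:{{{user_id}}} @data_version:{{{data_version}}})'
              '=>[KNN 1 @vector $vec AS score]')
        .return_fields('response', 'score')
        .dialect(2)
    )
    res = r.ft('llm_semantic_cache').search(q, {'vec': vec.tobytes()})
    if res.docs and float(res.docs[0].score) <= 0.15:    # distance threshold
        return res.docs[0].response                      # cache hit
    # Cache miss: do the expensive work, then populate the cache.
    context = fetch_rag_docs(prompt)                     # hypothetical
    response = call_llm(prompt, context)                 # hypothetical (GPT-4o)
    key = 'llm:semantic:' + hashlib.sha1(prompt.encode()).hexdigest()[:12]
    r.hset(key, mapping={
        'prompt': prompt, 'response': response, 'vector': vec.tobytes(),
        'user_id': user_id, 'data_version': data_version,
    })
    r.expire(key, TTL_SECONDS)
    return response
<\/code><\/pre>
<p><b>Flow 2: Cache Hit (Context-Enabled Semantic Cache)<\/b><\/p>
<ol>
<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A user sends a query.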
The Application generates an embedding (Query Vector).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The App executes a hybrid FT.SEARCH in Redis.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Redis returns a <\/span><b>Cache Hit<\/b><span style=\"font-weight: 400;\">, providing the Generic Response from a previous query.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The App fetches the <\/span><i><span style=\"font-weight: 400;\">current user&#8217;s context<\/span><\/i><span style=\"font-weight: 400;\"> from a separate Redis Hash (e.g., HGET user:profile:user-123 &#8216;location&#8217;).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The App calls a <\/span><i><span style=\"font-weight: 400;\">fast, cheap LLM<\/span><\/i><span style=\"font-weight: 400;\"> (e.g., GPT-4o-mini) with the Generic Response and the User Context, instructing it to &#8220;personalize this answer for a user in [location]&#8221; (sketched after Flow 3 below).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">This lightweight model returns a Personalized Response.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The Personalized Response is returned to the user. This flow is 40% faster and 90% cheaper than a full LLM call but provides a superior, contextualized experience.<\/span><span style=\"font-weight: 400;\">8<\/span><\/li>\n<\/ol>\n<p><b>Flow 3: Invalidation (Event-Driven)<\/b><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">An external system updates a RAG document. An &#8220;Update Service&#8221; writes the new data: JSON.SET doc:123&#8230;.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The Update Service publishes an event: PUBLISH cache:invalidate &#8216;doc:123&#8217;.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A &#8220;Cache Invalidation Worker&#8221; subscribed to the channel receives the message.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The Worker executes FT.SEARCH llm_semantic_cache &#8220;@doc_id:{123}&#8221; to find all cache entries associated with that document.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The Worker deletes all returned keys (e.g., DEL llm:semantic:abc&#8230;).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The cache is now clean, and the next query for this topic will trigger a &#8220;Cache Miss,&#8221; populating the cache with the new, correct information.<\/span><\/li>\n<\/ol>
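<p><span style=\"font-weight: 400;\">Finally, the personalization step at the heart of Flow 2 (step 5) can be sketched as follows; chat() is a hypothetical wrapper around an LLM provider&#8217;s API, and the key names are the illustrative ones used above.<\/span><\/p>
<pre><code>import redis

r = redis.Redis(host='localhost', port=6379)

def personalize(generic_response, user_id):
    # chat() is a hypothetical wrapper around an LLM provider's API.
    location = r.hget(f'user:profile:{user_id}', 'location')
    location = location.decode() if location else 'an unknown location'
    instruction = (
        f'Personalize this answer for a user in {location}, '
        f'keeping the facts unchanged: {generic_response}'
    )
    return chat(model='gpt-4o-mini', prompt=instruction)   # hypothetical call
<\/code><\/pre>
<p><span style=\"font-weight: 400;\">Because the cached generic answer already contains the substance, the rewrite prompt can safely go to the smallest, cheapest model available, which is what preserves the cache&#8217;s latency and cost profile.<\/span><\/p>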