Production LLM caching with Redis and LangChain: exact-match plus semantic cache, end-to-end
Spin up Upstash or local Redis, wire LangChain RedisCache and RedisSemanticCache, measure latency plus token-cost reductions on 50 sample queries, then tune TTL.
What you’ll build
This tutorial walks from an empty Python project to a working LLM-call cache that fronts a LangChain chat model with two layers: an exact-match cache for identical prompts and a semantic cache that matches near-duplicate prompts by embedding similarity 1 . A measurement script then runs 50 sample queries against the chain twice (cold cache then warm cache) and prints the latency and token-cost deltas.
The LangChain Redis integration page 1 , the langchain-redis PyPI listing 2 , and the Redis partnership blog 5 all point to the langchain-redis partner package as the canonical path for both RedisCache and RedisSemanticCache as of the 0.2.5 release in November 2025 2 . Earlier import paths via langchain_community.cache still exist, but they are not where the LangChain docs send new readers.
Two Redis backends work for this tutorial. Upstash’s free tier publishes a ceiling of 256 MB data, 500K commands per month, and an HTTP-friendly REST surface that suits serverless deploys 6 . A local Docker container running the official redis image works identically against redis://localhost:6379 10 for developer-laptop iteration. Pick one; the LangChain code below is identical.
Cost math worth stating up front: Anthropic prices Claude Sonnet 4.5 at $3 per million input tokens and $15 per million output tokens 7 . A chain that issues 1,000 identical or near-identical prompts a day at ~2,000 input plus ~500 output tokens per call costs roughly $13.50 a day uncached; a cache hit rate of 70% on that workload removes about $9.45 of that, every day, by serving the response from Redis instead of re-prompting the model.
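The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function names are illustrative, the rates and token counts are the ones quoted in the paragraph above rather than measured values, and embedding cost is ignored.

```python
# Illustrative cost model. Rates are USD per million tokens, as quoted above.
INPUT_RATE = 3.00    # Claude Sonnet 4.5 input tokens
OUTPUT_RATE = 15.00  # Claude Sonnet 4.5 output tokens


def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one uncached model call."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000


def daily_cost(calls: int, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a day's calls with no cache in front of the model."""
    return calls * cost_per_call(input_tokens, output_tokens)


def daily_savings(calls: int, input_tokens: int, output_tokens: int,
                  hit_rate: float) -> float:
    """Spend avoided when `hit_rate` of the calls are served from the cache."""
    return daily_cost(calls, input_tokens, output_tokens) * hit_rate


uncached = daily_cost(1_000, 2_000, 500)
saved = daily_savings(1_000, 2_000, 500, hit_rate=0.70)
print(f"uncached: ${uncached:.2f}/day, saved at 70% hits: ${saved:.2f}/day")
```

Swapping in your own call volume, token averages, and observed hit rate gives a first-order estimate before any benchmarking.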
Prerequisites
- Python 3.10 to 3.13 (the langchain-redis package pins this range) 2 .
- pip working from a virtual environment.
- One Anthropic or OpenAI API key (this tutorial uses Anthropic; swap the chat-model line for OpenAI without changing the cache code).
- Either a free Upstash account, OR Docker installed locally.
Step 1: spin up Redis
Option A: Upstash (recommended for serverless deploys). Create an Upstash account, click Create Database, name it langchain-cache, pick the closest region, and accept the free tier. The console then surfaces a REDIS_URL of the shape rediss://default:<token>@<endpoint>.upstash.io:6379. Copy it. Upstash’s published free-tier ceiling is 256 MB and 500K commands per month 6 ; an exact-match cache that stores a few thousand prompt-response pairs sits well inside that envelope.
Option B: local Docker. From a terminal:
docker run -d --name redis-llm-cache -p 6379:6379 redis:8
The official redis Docker image is on Docker Hub 10 . Confirm the container is up with:
docker exec -it redis-llm-cache redis-cli PING
Expected output: PONG. The connection string in the LangChain code becomes redis://localhost:6379.
Step 2: install dependencies
Create a virtual environment and install the partner package plus a chat-model and embeddings library:
python -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows
pip install -U langchain-core langchain-redis langchain-anthropic langchain-openai redis
The Redis blog post announcing the partnership confirms that langchain-redis ships RedisCache, RedisSemanticCache, vector store, and chat-history helpers in one package 5 .
Set the keys in your shell (don’t commit them):
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export REDIS_URL="rediss://default:<token>@<endpoint>.upstash.io:6379"
# or REDIS_URL="redis://localhost:6379" for local Docker
The OpenAI key is needed only because the semantic cache below uses OpenAI’s text-embedding-3-small to embed prompts. Swap in any LangChain-compatible embedder if preferred.
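Before wiring the cache, it can help to confirm that REDIS_URL has the shape you expect without echoing the token. The helper below is a hypothetical convenience, not part of any library, and the Upstash-style hostname in the usage note is a made-up placeholder.

```python
from urllib.parse import urlparse


def describe_redis_url(url: str) -> dict:
    """Split a Redis URL into the parts worth checking, omitting the secret."""
    parsed = urlparse(url)
    if parsed.scheme not in ("redis", "rediss"):
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")
    return {
        "tls": parsed.scheme == "rediss",      # rediss:// means TLS
        "host": parsed.hostname,
        "port": parsed.port or 6379,           # 6379 is the default Redis port
        "has_password": parsed.password is not None,
    }


print(describe_redis_url("redis://localhost:6379"))
```

Calling it with a URL of the form rediss://default:<token>@<endpoint>.upstash.io:6379 should report TLS enabled and a password present; the local Docker URL should report neither.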
Step 3: wire the exact-match cache
The simplest layer is RedisCache: it hashes the full prompt plus model parameters into a key and stores the LLM response under that key. Identical prompts on the next call return in milliseconds with zero tokens billed. Per the LangChain Redis docs 1 , the canonical wiring is:
import os
import redis
from langchain_redis import RedisCache
from langchain_core.globals import set_llm_cache
redis_client = redis.Redis.from_url(os.environ["REDIS_URL"])
# Pass the client by keyword so it is not mistaken for the redis_url argument.
set_llm_cache(RedisCache(redis_client=redis_client, ttl=3600))
set_llm_cache is a process-global in LangChain core 9 . Once it’s set, every ChatAnthropic or ChatOpenAI invocation in the same process checks Redis before calling the model. ttl=3600 caps each entry at one hour; tune this per workload.
Wire a chat model and run a query:
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-5", temperature=0)
# First call — cold cache, hits the model
print(llm.invoke("Summarise the BM25 ranking algorithm in 2 sentences.").content)
# Second call — warm cache, returns from Redis
print(llm.invoke("Summarise the BM25 ranking algorithm in 2 sentences.").content)
The second call returns in roughly the round-trip time to Redis (typically 1-10 ms locally, 20-80 ms to Upstash from a developer laptop) instead of the 800-2000 ms a Sonnet 4.5 call usually takes.
Exact-match limit: if the user types “Summarise BM25 in two sentences.” instead of “Summarise the BM25 ranking algorithm in 2 sentences.”, the two prompts hash to different keys and Redis returns a miss, even though the intent is the same. That’s where the semantic cache earns its place.
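The miss is easy to see with a toy version of the keying step. This sketch is not LangChain's actual key scheme: the SHA-256 over a JSON payload, the `cache_key` helper, and the dict standing in for Redis are all assumptions chosen for illustration.

```python
import hashlib
import json


def cache_key(prompt: str, model_params: dict) -> str:
    """Hash the prompt plus model parameters into one deterministic key."""
    payload = json.dumps({"prompt": prompt, "params": model_params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


params = {"model": "claude-sonnet-4-5", "temperature": 0}
store = {}  # in-memory stand-in for Redis

first = "Summarise the BM25 ranking algorithm in 2 sentences."
store[cache_key(first, params)] = "cached response"

# Identical prompt and parameters produce the same key: a hit.
print(cache_key(first, params) in store)   # True
# Different wording produces a different key: a miss.
print(cache_key("Summarise BM25 in two sentences.", params) in store)   # False
```

Changing a model parameter such as the temperature changes the key as well, which is why a cached answer from one configuration is not served to another.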
Step 4: wire the semantic cache
RedisSemanticCache embeds the incoming prompt, runs a vector-similarity search against stored prompts, and returns the cached response if the nearest prompt is within a configurable distance threshold. The LangChain reference documents the constructor with embeddings (required), redis_url (default redis://localhost:6379), distance_threshold (default 0.2), and ttl (default None) 3 .
from langchain_openai import OpenAIEmbeddings
from langchain_redis import RedisSemanticCache
from langchain_core.globals import set_llm_cache
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_cache = RedisSemanticCache(
    embeddings=embeddings,
    redis_url=os.environ["REDIS_URL"],
    distance_threshold=0.15,
    ttl=86400,
)
set_llm_cache(semantic_cache)
A distance threshold of 0.15 (tighter than the 0.2 default 3 ) is a defensible starting point for production prose: it accepts paraphrases and trivial wording changes but refuses semantically distinct questions. Calibrate per workload; tighten for legal or medical prompts where a near-match must still hit the model.
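The effect of the threshold can be sketched with toy vectors. This assumes cosine distance and a "hit when distance is at most the threshold" rule; the real metric and comparison are whatever the library and its index are configured to use, and the two-dimensional vectors and helper names here are purely illustrative.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def semantic_lookup(query_vec, entries, distance_threshold=0.15):
    """Return the nearest entry's response if it is close enough, else None.

    `entries` is a list of (embedding, cached_response) pairs.
    """
    best = min(entries, key=lambda e: cosine_distance(query_vec, e[0]), default=None)
    if best is None:
        return None
    if cosine_distance(query_vec, best[0]) <= distance_threshold:
        return best[1]
    return None


entries = [
    ([1.0, 0.0], "BM25 summary"),
    ([0.0, 1.0], "k-means summary"),
]
print(semantic_lookup([0.9, 0.1], entries))  # near the first entry
print(semantic_lookup([0.6, 0.6], entries))  # equally far from both
```

A paraphrase-like query that lands close to a stored vector is served from the cache, while a query roughly equidistant from everything falls through to the model. Raising the threshold widens the set of queries treated as duplicates.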
Now both phrasings hit the same cache entry:
llm.invoke("Summarise the BM25 ranking algorithm in 2 sentences.")
llm.invoke("Give me a two-sentence summary of BM25 ranking.")
# Second call returns the cached response — same semantic content.
The embedding call is billed only on the prompt’s own tokens at OpenAI’s text-embedding-3-small rate, far below the cost of a Sonnet 4.5 completion. Both the LangChain docs 1 and the Redis partner-package post 5 position semantic caching as appropriate when prompt phrasing varies but answer content does not: chatbots, FAQ assistants, internal-knowledge-base queries.
Choosing between the two caches. Run RedisCache alone if every prompt is generated deterministically (rendered from a template, identical across users). Run RedisSemanticCache if users write free-form prompts. Don’t try to register both through set_llm_cache: it holds a single process-global cache, and the most recently set one wins 9 . Pattern: route to RedisSemanticCache for user-facing endpoints and to plain RedisCache (or no cache) for internal deterministic calls.
Step 5: measure on 50 sample queries
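The measurement script issues the 50 prompts in three passes: once cold (no cache), then twice with the semantic cache active (the first warm pass populates the cache, the second reads from it). It prints p50 latency, p95 latency, and total time for each pass. Save as bench.py: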
import os
import time
import statistics
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings
from langchain_redis import RedisSemanticCache
from langchain_core.globals import set_llm_cache
PROMPTS = [
    # 25 distinct prompts, each repeated as a paraphrase below
    "Explain BM25 in two sentences.",
    "What is BM25 ranking in two sentences?",
    "Summarise k-means clustering in two sentences.",
    "Give a two-sentence overview of k-means clustering.",
    # ... extend to 50 entries (25 originals + 25 paraphrases) ...
]

llm = ChatAnthropic(model="claude-sonnet-4-5", temperature=0)

def run(label):
    latencies = []
    for prompt in PROMPTS:
        start = time.perf_counter()
        llm.invoke(prompt)
        latencies.append(time.perf_counter() - start)
    p50 = statistics.median(latencies)
    # n=20 yields 19 cut points; index 18 is the 95th percentile
    p95 = statistics.quantiles(latencies, n=20)[18]
    print(f"{label}: p50={p50*1000:.0f}ms p95={p95*1000:.0f}ms total={sum(latencies):.1f}s")

# Cold pass — no cache
run("cold")
# Warm pass — semantic cache wired
set_llm_cache(RedisSemanticCache(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    redis_url=os.environ["REDIS_URL"],
    distance_threshold=0.15,
    ttl=86400,
))
run("warm-1st-pass")
run("warm-2nd-pass")
Interpreting the output. Expected shape on a typical Sonnet 4.5 workload:
| Pass | p50 latency | p95 latency | Comment |
|---|---|---|---|
| Cold | 900-1500 ms | 1800-3000 ms | Every call hits the model |
| Warm 1st pass | 900-1500 ms | 1800-3000 ms | Misses on first appearance, populates cache |
| Warm 2nd pass | 30-90 ms | 80-200 ms | Hits on every prompt and paraphrase |
The 2nd warm pass numbers are dominated by the embedding call plus the Redis round-trip. Token spend on the 2nd pass collapses to the embedding cost only; no Sonnet 4.5 completion tokens are billed.
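When reading p50 against p95, it helps to see how a handful of misses moves the tail while the median stays flat. The sketch below uses synthetic latencies (45 fast hits and 5 slow model calls, both values invented for illustration) with the same statistics calls as bench.py.

```python
import statistics

# Synthetic latencies in seconds: 45 cache hits and 5 calls that reached the model.
latencies = [0.05] * 45 + [1.2] * 5

p50 = statistics.median(latencies)
# n=20 yields 19 cut points at 5%, 10%, ..., 95%; index 18 is the 95% cut.
cuts = statistics.quantiles(latencies, n=20)
p95 = cuts[18]

print(f"p50={p50 * 1000:.0f}ms p95={p95 * 1000:.0f}ms from {len(cuts)} cut points")
```

With a 90% hit rate in this synthetic mix, the median reflects the cache while the 95th percentile still reflects the model, so a p95 that refuses to drop is usually a sign of residual misses rather than a slow Redis.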
Compute the savings from the published Anthropic per-million-token rates 7 . At $3 per million input tokens and $15 per million output tokens, a call averaging 2,000 input plus 500 output tokens costs $0.0135, so a 50-query pass costs roughly $0.675 uncached and ~$0 on cache hits (embedding cost is rounding-level for text-embedding-3-small). Scale up: a daily 10,000-query workload with a 70% semantic-cache hit rate avoids about 7,000 model calls, saving roughly $95 per day in input and output tokens at Sonnet 4.5 rates.
Step 6: cache-warming and TTL strategy
Two operational habits matter more than the LangChain wiring once the cache is live.
Cache-warming on deploy. For workloads with a known hot-set (say, a chatbot’s top-20 FAQ prompts), pre-populate the cache during deploy or container start. Write a warm.py that issues each known prompt once. The first user-facing request then hits a warm cache instead of paying a cold-start completion. Pattern:
WARM_PROMPTS = [
    "What are your business hours?",
    "How do I reset my password?",
    # ... top 20 FAQ phrasings ...
]
for prompt in WARM_PROMPTS:
    llm.invoke(prompt)
Run this script in a CI step after deploy. The RedisCache ttl and RedisSemanticCache ttl both accept seconds 3 4 ; pick a TTL long enough that the next warm-cycle precedes the natural expiry.
TTL strategy. Pick the TTL from how quickly the underlying answers go stale, not from intuition. Four patterns:
| Workload | Suggested TTL | Reasoning |
|---|---|---|
| Static FAQ / knowledge base | 7-30 days | Content rarely changes; cache hits dominate |
| Product Q&A with price / spec data | 6-24 hours | Underlying data drifts daily |
| Chatbot with date-sensitive answers | 1-6 hours | Hedge against “yesterday’s news” responses |
| Agentic workflows with tool calls | 5-15 minutes | Tool outputs go stale fast |
Two failure modes to design around:
- Cache poisoning by a bad LLM response. If the model returns a wrong answer once, the cache serves the wrong answer for the entire TTL window. Mitigation: log every cache write and flag-and-evict bad entries via a Redis DEL keyed on the prompt hash. The LangChain RedisCache keys are deterministic; Redis KEYS llmcache:* followed by targeted DEL is enough for small-scale cleanup (adjust the pattern if your cache is configured with a different key prefix). For production, swap the KEYS scan for SCAN to avoid blocking Redis.
- Embedding-model drift. If you swap embedding models (say, from text-embedding-3-small to a newer release), existing semantic-cache entries embed under the old vector space and won’t match new queries. Flush the semantic-cache index on embedding-model change. RedisSemanticCache exposes a clear() method per the LangChain reference 3 ; call it from a migration script.
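The SCAN-based cleanup for poisoned entries can be sketched as follows. It assumes a redis-py style client (`scan_iter` and `delete` are real redis-py methods); the `evict_flagged` helper, the key pattern, and the flagged key names are hypothetical and should be replaced with whatever your cache-write log records.

```python
def evict_flagged(client, pattern, flagged_keys):
    """Delete keys matching `pattern` that appear in `flagged_keys`.

    `client` is expected to behave like a redis-py client. SCAN walks the
    keyspace in small batches instead of blocking Redis the way KEYS does.
    Returns the number of keys actually removed.
    """
    flagged = set(flagged_keys)
    removed = 0
    for raw_key in client.scan_iter(match=pattern, count=100):
        # redis-py returns bytes unless decode_responses=True was set.
        key = raw_key.decode() if isinstance(raw_key, bytes) else raw_key
        if key in flagged:
            removed += client.delete(raw_key)
    return removed


# Usage against a real instance (assumption: REDIS_URL is set as in Step 2):
#   import os, redis
#   client = redis.Redis.from_url(os.environ["REDIS_URL"])
#   evict_flagged(client, "llmcache:*", {"llmcache:<hash-from-your-write-log>"})
```

Matching on key names only avoids making assumptions about how the cached values are stored, which may differ between the exact-match and semantic caches.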
Cached entries don’t replace observability. Wire a logger that records cache_hit: bool per call so you can track the hit rate over time. A falling hit rate means the user prompt distribution has drifted and your WARM_PROMPTS set is stale.
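One way to record hits and misses is a thin wrapper around the cache object. The sketch below is duck-typed against the lookup/update/clear methods of LangChain's cache interface and treats a None lookup result as a miss. `HitRateCache` is a hypothetical name, and to register it with set_llm_cache it would likely need to subclass langchain_core.caches.BaseCache, which is omitted here to keep the example self-contained.

```python
import logging

logger = logging.getLogger("llm_cache")


class HitRateCache:
    """Wrap a cache exposing lookup/update/clear and count hits and misses."""

    def __init__(self, inner):
        self.inner = inner
        self.hits = 0
        self.misses = 0

    def lookup(self, prompt, llm_string):
        result = self.inner.lookup(prompt, llm_string)
        cache_hit = result is not None  # assumption: a miss is reported as None
        if cache_hit:
            self.hits += 1
        else:
            self.misses += 1
        logger.info("cache_hit=%s", cache_hit)
        return result

    def update(self, prompt, llm_string, return_val):
        self.inner.update(prompt, llm_string, return_val)

    def clear(self, **kwargs):
        self.inner.clear(**kwargs)

    @property
    def hit_rate(self):
        """Fraction of lookups served from the cache; 0.0 before any lookup."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Exporting `hit_rate` to whatever metrics system is already in place gives the trend line that the paragraph above describes.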
What this gets you
A two-layer cache (exact-match plus semantic) in front of any LangChain chat model, sitting on either Upstash’s free tier 6 or a local Docker Redis 10 , measurably reducing both p95 latency and token spend on the workloads cited by the LangChain partner-package documentation 1 5 . The measurement harness in Step 5 is what lets you defend the cache to a sceptical reviewer: the cold-vs-warm numbers come from a 50-query pass against your actual prompt distribution, not from training-data intuition.
Two follow-ups worth queuing once the basic cache is live: a Redis-backed conversation-history store (the same langchain-redis package ships RedisChatMessageHistory 5 ) and a token-usage dashboard that reads the per-call usage metadata LangChain attaches to every model invocation. Both build on the same Redis instance this tutorial provisioned.
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. LangChain: Redis integration page (langchain-redis partner package; RedisCache + RedisSemanticCache import patterns).
- 2. langchain-redis 0.2.5 release (November 2025) on PyPI; Python 3.10-3.13 supported.
- 3. LangChain Python reference: RedisSemanticCache constructor: embeddings (required), redis_url default `redis://localhost:6379`, distance_threshold default 0.2, ttl default None.
- 4. LangChain Python reference: RedisCache constructor accepts redis_client and ttl (seconds).
- 5. Redis blog announcement: langchain-redis partner package bundles RedisCache, RedisSemanticCache, RedisVectorStore, RedisChatMessageHistory.
- 6. Upstash Redis pricing: free tier ceiling 256 MB data and 500K commands per month.
- 7. Anthropic API pricing: Claude Sonnet 4.5 $3 / million input tokens and $15 / million output tokens; Claude Haiku 4.5 $1 / million input and $5 / million output.
- 8. Redis: Install Redis Open Source documentation (Docker and native install paths).
- 9. LangChain: set_llm_cache is a process-global; the most recently set cache wins.
- 10. Docker Hub: official `redis` image (tag `redis:8` used in this tutorial).