Production LLM caching with Redis and LangChain: exact-match plus semantic cache, end-to-end

Spin up Upstash or local Redis, wire LangChain RedisCache and RedisSemanticCache, measure latency plus token-cost reductions on 50 sample queries, then tune TTL.

20 May 2026 · Updated 20 May 2026 · ~12 min read

Image: docs.langchain.com — Redis integration page, used for editorial coverage of the langchain-redis partner package.

What you’ll build

This tutorial walks from an empty Python project to a working LLM-call cache that fronts a LangChain chat model with two layers: an exact-match cache for identical prompts and a semantic cache that matches near-duplicate prompts by embedding similarity¹. A measurement script then runs 50 sample queries against the chain twice (cold cache then warm cache) and prints the latency and token-cost deltas.

The LangChain Redis integration page¹, the langchain-redis PyPI listing², and the Redis partnership blog⁵ agree that the langchain-redis partner package is the canonical path for both RedisCache and RedisSemanticCache as of the 0.2.5 release in November 2025². The earlier import path via langchain_community.cache still exists, but it is not where the LangChain docs send new readers.

Two Redis backends work for this tutorial. Upstash publishes a free-tier ceiling of 256 MB of data and 500K commands per month, along with an HTTP-friendly REST surface that suits serverless deploys⁶. A local Docker container running the official redis image works identically against redis://localhost:6379¹⁰ for developer-laptop iteration. Pick one; the LangChain code below is identical.

Cost math worth stating up front: Anthropic prices Claude Sonnet 4.5 at $3 per million input tokens and $15 per million output tokens⁷. A chain that issues 1,000 identical or near-identical prompts a day at ~2,000 input plus ~500 output tokens per call costs roughly $13.50 a day uncached; a cache hit rate of 70% on that workload removes about $9.45 of that, every day, by serving the response from Redis instead of re-prompting the model.
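
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function and argument names are illustrative, the default rates are the Sonnet 4.5 prices quoted above, and the 70% hit rate is the same assumption the paragraph makes.

```python
def daily_cost_usd(calls, input_tokens, output_tokens,
                   input_rate=3.0, output_rate=15.0):
    """Estimated daily spend in USD; rates are USD per million tokens."""
    per_call = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return calls * per_call

# 1,000 calls a day at ~2,000 input plus ~500 output tokens each.
uncached = daily_cost_usd(1_000, 2_000, 500)
saved = uncached * 0.70  # assumed 70% cache hit rate
print(f"uncached=${uncached:.2f} saved=${saved:.2f}")
```

Swapping in your own call volume and token counts gives a quick sanity check before you measure anything.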

Prerequisites

Python 3.10 to 3.13 (the langchain-redis package pins this range)².
pip working from a virtual environment.
One Anthropic or OpenAI API key (this tutorial uses Anthropic; swap the chat-model line for OpenAI without changing the cache code).
Either a free Upstash account, OR Docker installed locally.

Step 1: spin up Redis

Option A: Upstash (recommended for serverless deploys). Create an Upstash account, click Create Database, name it langchain-cache, pick the closest region, and accept the free tier. The console then surfaces a REDIS_URL of the shape rediss://default:<token>@<endpoint>.upstash.io:6379. Copy it. Upstash’s published free-tier ceiling is 256 MB and 500K commands per month⁶; an exact-match cache that stores a few thousand prompt-response pairs sits well inside that envelope.

Option B: local Docker. From a terminal:

docker run -d --name redis-llm-cache -p 6379:6379 redis:8

The official redis Docker image is on Docker Hub¹⁰. Confirm the container is up with:

docker exec -it redis-llm-cache redis-cli PING

Expected output: PONG. The connection string in the LangChain code becomes redis://localhost:6379.

Image: Upstash blog — Redis new pricing announcement, used for editorial coverage of the free-tier limits.

Step 2: install dependencies

Create a virtual environment and install the partner package plus a chat-model and embeddings library:

python -m venv .venv
source .venv/bin/activate          # macOS / Linux
# .venv\Scripts\activate           # Windows
pip install -U langchain-core langchain-redis langchain-anthropic langchain-openai redis

The Redis blog post announcing the partnership confirms that langchain-redis ships RedisCache, RedisSemanticCache, vector store, and chat-history helpers in one package⁵.

Set the keys in your shell (don’t commit them):

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export REDIS_URL="rediss://default:<token>@<endpoint>.upstash.io:6379"
# or REDIS_URL="redis://localhost:6379" for local Docker

The OpenAI key is needed only because the semantic cache below uses OpenAI’s text-embedding-3-small to embed prompts. Swap in any LangChain-compatible embedder if preferred.
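
A malformed REDIS_URL tends to surface later as an opaque connection error, so it can help to check its shape first. The sketch below uses only the standard library; check_redis_url is a hypothetical helper, not part of langchain-redis or redis-py, and it only validates the URL's structure without contacting any server.

```python
from urllib.parse import urlparse

def check_redis_url(url):
    """Return (scheme, host, port) or raise ValueError for a malformed URL."""
    parsed = urlparse(url)
    # redis:// is plain TCP, rediss:// is TLS (the shape Upstash hands out).
    if parsed.scheme not in ("redis", "rediss"):
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("missing host")
    return parsed.scheme, parsed.hostname, parsed.port or 6379

print(check_redis_url("redis://localhost:6379"))
```

In practice you would pass os.environ["REDIS_URL"] instead of the literal string.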

Step 3: wire the exact-match cache

The simplest layer is RedisCache: it hashes the full prompt plus model parameters into a key and stores the LLM response under that key. Identical prompts on the next call return in milliseconds with zero tokens billed. Per the LangChain Redis docs¹, the canonical wiring is:

import os
import redis
from langchain_redis import RedisCache
from langchain_core.globals import set_llm_cache

redis_client = redis.Redis.from_url(os.environ["REDIS_URL"])
# Pass the client by keyword so it is not read as the redis_url argument.
set_llm_cache(RedisCache(redis_client=redis_client, ttl=3600))

set_llm_cache is a process-global in LangChain core⁹. Once it’s set, every ChatAnthropic or ChatOpenAI invocation in the same process checks Redis before calling the model. ttl=3600 caps each entry at one hour; tune this per workload.

Wire a chat model and run a query:

from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-5", temperature=0)

# First call — cold cache, hits the model
print(llm.invoke("Summarise the BM25 ranking algorithm in 2 sentences.").content)

# Second call — warm cache, returns from Redis
print(llm.invoke("Summarise the BM25 ranking algorithm in 2 sentences.").content)

The second call returns in roughly the round-trip time to Redis (typically 1-10 ms locally, 20-80 ms to Upstash from a developer laptop) instead of the 800-2000 ms a Sonnet 4.5 call usually takes.

Exact-match limit: if the user types “Summarise BM25 in two sentences.” instead of “Summarise the BM25 ranking algorithm in 2 sentences.”, the two prompts hash to different keys and Redis returns a miss. That’s where the semantic cache earns its place.
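
To see why any wording change defeats an exact-match cache, here is a minimal sketch of hash-based keying. It assumes a SHA-256 digest over the prompt plus model parameters; cache_key is a hypothetical helper for illustration and is not the key scheme langchain-redis actually uses.

```python
import hashlib
import json

def cache_key(prompt, params):
    """Hypothetical key: SHA-256 over the prompt and sorted model parameters."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

params = {"model": "claude-sonnet-4-5", "temperature": 0}
a = cache_key("Summarise the BM25 ranking algorithm in 2 sentences.", params)
b = cache_key("Summarise the BM25 ranking algorithm in 2 sentences.", params)
c = cache_key("Summarise BM25 in two sentences.", params)

print(a == b)  # identical prompt and parameters: same key, cache hit
print(a == c)  # different wording: different key, cache miss
```

Because the parameters are part of the hashed payload, changing the temperature or the model name also produces a different key in this sketch.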

Image: Redis blog — LangChain Redis partner package, used for editorial coverage of the langchain-redis cache surface.

Step 4: wire the semantic cache

RedisSemanticCache embeds the incoming prompt, runs a vector-similarity search against stored prompts, and returns the cached response if the nearest prompt is within a configurable distance threshold. The LangChain reference documents the constructor with embeddings (required), redis_url (default redis://localhost:6379), distance_threshold (default 0.2), and ttl (default None)³.

import os
from langchain_openai import OpenAIEmbeddings
from langchain_redis import RedisSemanticCache
from langchain_core.globals import set_llm_cache

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

semantic_cache = RedisSemanticCache(
    embeddings=embeddings,
    redis_url=os.environ["REDIS_URL"],
    distance_threshold=0.15,
    ttl=86400,
)

set_llm_cache(semantic_cache)

A distance threshold of 0.15 (tighter than the 0.2 default³) is a defensible starting point for production prose: it accepts paraphrases and trivial wording changes while refusing semantically distinct questions. Calibrate per workload, and tighten further for legal or medical prompts where a near-match must still hit the model.

Now both phrasings should resolve to the same cache entry, provided their embeddings fall within the distance threshold:

llm.invoke("Summarise the BM25 ranking algorithm in 2 sentences.")
llm.invoke("Give me a two-sentence summary of BM25 ranking.")
# Second call returns the cached response — same semantic content.
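
The threshold check at the heart of a semantic cache can be pictured with toy vectors. The sketch below assumes cosine distance as the metric and uses three-dimensional vectors in place of real embeddings; cosine_distance and lookup are hypothetical helpers, not RedisSemanticCache internals.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def lookup(query_vec, entries, threshold=0.15):
    """Return the nearest entry's response if it is within threshold, else None."""
    best = min(entries, key=lambda e: cosine_distance(query_vec, e[0]), default=None)
    if best is not None and cosine_distance(query_vec, best[0]) <= threshold:
        return best[1]
    return None

# Toy vectors standing in for real embeddings.
entries = [([1.0, 0.0, 0.0], "cached BM25 summary")]
print(lookup([0.9, 0.1, 0.0], entries))  # nearly the same direction: hit
print(lookup([0.0, 1.0, 0.0], entries))  # orthogonal: miss
```

Lowering the threshold shrinks the region counted as a match, which is the knob to turn when near-misses must still reach the model.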

The embedding call is billed only on the prompt’s own tokens at OpenAI’s text-embedding-3-small rate, far below the cost of a Sonnet 4.5 completion. The LangChain docs¹ and the Redis partner-package post⁵ both present semantic caching as appropriate when prompt phrasing varies but answer content does not: chatbots, FAQ assistants, internal-knowledge-base queries.

Choosing between the two caches. Run RedisCache alone if every prompt is generated deterministically (rendered from a template, identical across users). Run RedisSemanticCache if users write free-form prompts. Don’t try to stack both through set_llm_cache: it holds a single process-global cache, and the most recently set one wins⁹. Pattern: route to RedisSemanticCache for user-facing endpoints and to plain RedisCache (or no cache) for internal deterministic calls.

Step 5: measure on 50 sample queries

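The measurement script issues the 50 prompts in three passes: one cold pass with no cache, then two passes with the semantic cache active (the first populates the cache, the second reads from it). For each pass it prints p50 and p95 latency plus total wall time; token spend is estimated separately after the results table. Save as bench.py: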

import os
import time
import statistics
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings
from langchain_redis import RedisSemanticCache
from langchain_core.globals import set_llm_cache

PROMPTS = [
    # 25 distinct prompts, each followed immediately by a paraphrase
    "Explain BM25 in two sentences.",
    "What is BM25 ranking in two sentences?",
    "Summarise k-means clustering in two sentences.",
    "Give a two-sentence overview of k-means clustering.",
    # ... extend to 50 entries (25 originals + 25 paraphrases) ...
]

llm = ChatAnthropic(model="claude-sonnet-4-5", temperature=0)

def run(label):
    latencies = []
    for prompt in PROMPTS:
        start = time.perf_counter()
        llm.invoke(prompt)
        latencies.append(time.perf_counter() - start)
    p50 = statistics.median(latencies)
    p95 = statistics.quantiles(latencies, n=20)[18]
    print(f"{label}: p50={p50*1000:.0f}ms p95={p95*1000:.0f}ms total={sum(latencies):.1f}s")

# Cold pass — no cache
run("cold")

# Warm pass — semantic cache wired
set_llm_cache(RedisSemanticCache(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    redis_url=os.environ["REDIS_URL"],
    distance_threshold=0.15,
    ttl=86400,
))
run("warm-1st-pass")
run("warm-2nd-pass")

Interpreting the output. Expected shape on a typical Sonnet 4.5 workload:

Pass	p50 latency	p95 latency	Comment
Cold	900-1500 ms	1800-3000 ms	Every call hits the model
Warm 1st pass	900-1500 ms	1800-3000 ms	Misses on first appearance, populates cache
Warm 2nd pass	30-90 ms	80-200 ms	Hits on every prompt and paraphrase

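The 2nd warm pass numbers are dominated by the embedding call plus the Redis round-trip. Token spend on the 2nd pass collapses to the embedding cost only; no Sonnet 4.5 input or output tokens are billed. Because each paraphrase sits directly after its original in PROMPTS, some warm 1st-pass calls may already hit the cache, and entries left over from an earlier run (within the 24-hour TTL) will do the same; flush the cache before benchmarking if you want a clean first pass.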
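
To turn two passes into a single headline figure, the percentile logic from bench.py can be paired with a percentage-reduction helper. This is a sketch under stated assumptions: summarise and reduction_pct are hypothetical names, and the latency lists are made-up placeholders, not measurements.

```python
import statistics

def summarise(latencies_s):
    """Return (p50_ms, p95_ms) for per-call latencies given in seconds."""
    p50 = statistics.median(latencies_s)
    # n=20 yields 19 cut points; index 18 is the 95th percentile.
    p95 = statistics.quantiles(latencies_s, n=20)[18]
    return p50 * 1000, p95 * 1000

def reduction_pct(cold_ms, warm_ms):
    """Percentage latency reduction going from cold to warm."""
    return 100.0 * (cold_ms - warm_ms) / cold_ms

# Placeholder latencies for illustration only.
cold = [1.0 + 0.01 * i for i in range(50)]
warm = [0.05 + 0.001 * i for i in range(50)]
cold_p50, cold_p95 = summarise(cold)
warm_p50, warm_p95 = summarise(warm)
print(f"p50 reduction: {reduction_pct(cold_p50, warm_p50):.1f}%")
print(f"p95 reduction: {reduction_pct(cold_p95, warm_p95):.1f}%")
```

Reporting p95 alongside p50 matters because a handful of cache misses can leave the median low while the tail stays slow.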

Compute the savings from the published Anthropic per-million-token rates⁷. At $3 per million input tokens and $15 per million output tokens, a 50-query pass averaging 2,000 input plus 500 output tokens per call costs roughly $0.675 uncached and ~$0 on cache hits (embedding cost is rounding-level for text-embedding-3-small). Scale up: a daily 10,000-query workload with a 70% semantic-cache hit rate saves roughly $95 per day in input and output tokens at Sonnet 4.5 rates.
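
The scaled-up figure can be checked, and the embedding overhead made explicit, with a short calculation. This is a sketch, not a billing tool: net_daily_savings_usd is a hypothetical helper, and embed_rate is an assumed placeholder that you should replace with the embedding provider's current published price.

```python
def net_daily_savings_usd(queries, hit_rate, input_tokens, output_tokens,
                          input_rate=3.0, output_rate=15.0, embed_rate=0.02):
    """Model spend avoided by cache hits, minus embedding spend.

    All rates are USD per million tokens. embed_rate is an assumption.
    """
    per_call = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    avoided = queries * hit_rate * per_call
    # Each query is embedded at least once, hit or miss, so count all of them.
    embed_cost = queries * input_tokens * embed_rate / 1_000_000
    return avoided - embed_cost

print(f"${net_daily_savings_usd(10_000, 0.70, 2_000, 500):.2f}")
```

Setting embed_rate=0 recovers the gross savings quoted above; the gap between the two numbers is the price of running the semantic lookup.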

Image: platform.claude.com — Pricing documentation, used for editorial coverage of the per-million-token rates.

Image: PyPI — langchain-redis project page, used for editorial coverage of the partner package’s release status.

Step 6: cache-warming and TTL strategy

Two operational habits matter more than the LangChain wiring once the cache is live.

Cache-warming on deploy. For workloads with a known hot-set (say, a chatbot’s top-20 FAQ prompts), pre-populate the cache during deploy or container start. Write a warm.py that issues each known prompt once. The first user-facing request then hits a warm cache instead of paying a cold-start completion. Pattern:

WARM_PROMPTS = [
    "What are your business hours?",
    "How do I reset my password?",
    # ... top 20 FAQ phrasings ...
]

for prompt in WARM_PROMPTS:
    llm.invoke(prompt)

Run this script in a CI step after deploy. The RedisCache ttl and RedisSemanticCache ttl both accept seconds³,⁴; pick a TTL long enough that the next warm cycle runs before the entries naturally expire.

TTL strategy. Pick the TTL from how quickly the underlying answers go stale, not from intuition. Four patterns:

Workload	Suggested TTL	Reasoning
Static FAQ / knowledge base	7-30 days	Content rarely changes; cache hits dominate
Product Q&A with price / spec data	6-24 hours	Underlying data drifts daily
Chatbot with date-sensitive answers	1-6 hours	Hedge against “yesterday’s news” responses
Agentic workflows with tool calls	5-15 minutes	Tool outputs go stale fast
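
One way to keep these choices out of scattered magic numbers is to encode the table as configuration. The sketch below is an assumption-laden illustration: the workload keys, the pick_ttl helper, and the 60-second margin are all hypothetical, and the ranges simply mirror the table above.

```python
HOUR = 3600
DAY = 24 * HOUR

# (min_seconds, max_seconds) per workload; keys are hypothetical labels.
TTL_RANGES_S = {
    "static_faq": (7 * DAY, 30 * DAY),
    "product_qa": (6 * HOUR, 24 * HOUR),
    "date_sensitive_chat": (1 * HOUR, 6 * HOUR),
    "agentic_tools": (5 * 60, 15 * 60),
}

def pick_ttl(workload, warm_interval_s=None, margin_s=60):
    """Start at the low end; outlive the warm interval when one is given."""
    low, high = TTL_RANGES_S[workload]
    if warm_interval_s is None:
        return low
    # Keep entries alive until the next warm cycle, but stay inside the range.
    return min(high, max(low, warm_interval_s + margin_s))

print(pick_ttl("product_qa", warm_interval_s=12 * HOUR))
```

The returned value can be passed straight to the ttl argument of either cache.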

Two failure modes to design around:

Cache poisoning by a bad LLM response. If the model returns a wrong answer once, the cache serves that wrong answer for the entire TTL window. Mitigation: log every cache write, then flag bad entries and evict them with a Redis DEL on the entry’s key. The LangChain RedisCache keys are deterministic; Redis KEYS llmcache:* followed by a targeted DEL is enough for small-scale cleanup (adjust the pattern to the key prefix your cache instance actually writes). For production, swap the KEYS scan for SCAN to avoid blocking Redis.
Embedding-model drift. If you swap embedding models (say, from text-embedding-3-small to a newer release), existing semantic-cache entries embed under the old vector space and won’t match new queries. Flush the semantic-cache index on embedding-model change. RedisSemanticCache exposes a clear() method per the LangChain reference³; call it from a migration script.
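
The SCAN-based cleanup can be sketched against any client that offers redis-py style scan_iter() and delete() methods. To keep the example runnable without a server it uses an in-memory stand-in; FakeRedis, evict_matching, and the sample keys are hypothetical, and the llmcache:* pattern is taken from the text above, so confirm the prefix your cache actually writes.

```python
def evict_matching(client, pattern, should_evict, batch=100):
    """Delete keys matching pattern for which should_evict(key) is true."""
    removed = 0
    # scan_iter walks the keyspace incrementally instead of blocking like KEYS.
    for key in client.scan_iter(match=pattern, count=batch):
        if should_evict(key):
            removed += client.delete(key)
    return removed

class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""
    def __init__(self, data):
        self.data = dict(data)

    def scan_iter(self, match=None, count=None):
        prefix = match.rstrip("*")  # only handles simple "prefix*" patterns
        return [k for k in list(self.data) if k.startswith(prefix)]

    def delete(self, *keys):
        return sum(1 for k in keys if self.data.pop(k, None) is not None)

fake = FakeRedis({"llmcache:abc": "bad", "llmcache:def": "ok", "other:1": "x"})
bad_keys = {"llmcache:abc"}
print(evict_matching(fake, "llmcache:*", lambda k: k in bad_keys))
```

With a real redis-py client, keys come back as bytes unless the client was created with decode_responses=True, so the should_evict predicate needs to compare like with like.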

Caching doesn’t replace observability. Wire a logger that records cache_hit: bool per call so you can track the hit rate over time. A falling hit rate suggests the user prompt distribution has drifted and your WARM_PROMPTS set is stale.
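
A rolling hit rate is enough to spot that kind of drift. The sketch below assumes the caller already knows whether each call was a hit (how you derive that flag is left open here); HitRateTracker is a hypothetical helper, not a LangChain class.

```python
from collections import deque

class HitRateTracker:
    """Rolling cache hit rate over the most recent `window` calls."""
    def __init__(self, window=1000):
        self.events = deque(maxlen=window)

    def record(self, cache_hit):
        self.events.append(bool(cache_hit))

    def rate(self):
        return sum(self.events) / len(self.events) if self.events else 0.0

tracker = HitRateTracker(window=4)
for hit in [True, True, False, True, False]:
    tracker.record(hit)  # the oldest event falls out once the window is full
print(f"rolling hit rate: {tracker.rate():.2f}")
```

Emitting tracker.rate() to your metrics system on a schedule gives the trend line; alert when it drops below whatever baseline the Step 5 benchmark established.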

What this gets you

A two-layer cache (exact-match plus semantic) in front of any LangChain chat model, sitting on either Upstash’s free tier⁶ or a local Docker Redis¹⁰, measurably reducing both p95 latency and token spend on the workloads cited by the LangChain partner-package documentation¹,⁵. The measurement harness in Step 5 is what lets you defend the cache to a sceptical reviewer: the cold-vs-warm numbers come from a 50-query pass against your actual prompt distribution, not from intuition.

Two follow-ups worth queuing once the basic cache is live: a Redis-backed conversation-history store (the same langchain-redis package ships RedisChatMessageHistory⁵) and a token-usage dashboard that reads the per-call usage metadata LangChain attaches to every model invocation. Both build on the same Redis instance this tutorial provisioned.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. LangChain — Redis integration page (langchain-redis partner package; RedisCache + RedisSemanticCache import patterns). (accessed 2026-05-20)
2. langchain-redis 0.2.5 release (November 2025) on PyPI; Python 3.10-3.13 supported. (accessed 2026-05-20)
3. LangChain Python reference — RedisSemanticCache constructor: embeddings (required), redis_url default `redis://localhost:6379`, distance_threshold default 0.2, ttl default None. (accessed 2026-05-20)
4. LangChain Python reference — RedisCache constructor accepts redis_client and ttl (seconds). (accessed 2026-05-20)
5. Redis blog announcement — langchain-redis partner package bundles RedisCache, RedisSemanticCache, RedisVectorStore, RedisChatMessageHistory. (accessed 2026-05-20)
6. Upstash Redis pricing: free tier ceiling 256 MB data and 500K commands per month. (accessed 2026-05-20)
7. Anthropic API pricing: Claude Sonnet 4.5 $3 / million input tokens and $15 / million output tokens; Claude Haiku 4.5 $1 / million input and $5 / million output. (accessed 2026-05-20)
8. Redis — Install Redis Open Source documentation (Docker and native install paths). (accessed 2026-05-20)
9. LangChain — set_llm_cache is a process-global; the most recently set cache wins. (accessed 2026-05-20)
10. Docker Hub — official `redis` image (tag `redis:8` used in this tutorial). (accessed 2026-05-20)
