Redis is the in-memory backbone of high-performance AI systems. Here are the 6 roles it plays in modern agent architectures.
Redis keeps showing up in AI architectures because AI workloads need what Redis does best: sub-millisecond data access, flexible data structures, and real-time capabilities. Every production LLM application we have built uses Redis for at least three different purposes. This guide covers the 6 roles Redis plays in AI agent stacks, with implementation examples for each.
Semantic caching is the highest-impact use case: store LLM responses indexed by their query embedding, and when a semantically similar question arrives, return the cached response instead of calling the LLM API.
# Redis as semantic cache with vector search
import numpy as np
from redis import Redis
from redis.commands.search.query import Query

redis = Redis()  # assumes Redis Stack with the RediSearch module loaded

async def semantic_cache_lookup(query: str, threshold: float = 0.95):
    embedding = await get_embedding(query)  # returns a float32 numpy array
    q = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .return_fields("response", "score")
        .dialect(2)
    )
    result = redis.ft("cache_idx").search(
        q, {"vec": embedding.astype(np.float32).tobytes()}
    )
    if result.docs:
        # KNN returns a distance; for COSINE, similarity = 1 - distance
        similarity = 1 - float(result.docs[0].score)
        if similarity >= threshold:
            return result.docs[0].response  # cache hit
    return None  # cache miss, call the LLM API
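The lookup assumes cached entries are Redis Hashes holding the embedding as raw float32 bytes alongside the response text. A minimal sketch of the write path (`build_cache_entry` and the `cache:` key prefix are illustrative names, not part of any library):

```python
import numpy as np

def build_cache_entry(embedding: np.ndarray, response: str) -> dict:
    """Build the HASH field mapping the vector index searches over:
    raw float32 bytes for the embedding plus the response text."""
    return {
        "embedding": embedding.astype(np.float32).tobytes(),
        "response": response,
    }

# With a redis-py client `r`, writing and expiring an entry looks like:
#   r.hset(f"cache:{query_id}", mapping=build_cache_entry(vec, text))
#   r.expire(f"cache:{query_id}", 86400)  # evict stale answers after 24h
```

The TTL keeps the cache from serving outdated answers indefinitely; tune it to how quickly your domain's "correct" responses drift.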
This directly impacts your LLM costs. For more optimization techniques, see our full guide on reducing OpenAI costs by 60%.
AI agents need fast access to conversation history, user preferences, and session metadata. Redis Strings and Hashes provide O(1) lookups, and key TTLs handle automatic expiration.
# Store conversation state with auto-expiry
import json

async def save_session(session_id: str, messages: list):
    redis.set(
        f"session:{session_id}",
        json.dumps(messages),
        ex=3600,  # expire 1 hour after the last write
    )

async def get_session(session_id: str) -> list:
    data = redis.get(f"session:{session_id}")
    return json.loads(data) if data else []
For a deeper dive on conversation memory, see our guide on managing AI agent memory.
Redis Stack includes a vector search module (RediSearch) that supports HNSW and flat indexing. For datasets under 1 million vectors, Redis can serve as both your vector store and your cache, eliminating the need for a separate vector database.
For larger datasets, use dedicated vector databases like Pinecone, Weaviate, or PGVector and keep Redis as the cache/state layer.
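Creating the vector index is a one-time setup step. The parameters below are illustrative defaults, not tuned recommendations; with redis-py against a running Redis Stack instance, they feed straight into `FT.CREATE`:

```python
# HNSW index parameters for RediSearch (illustrative values; tune per dataset)
HNSW_PARAMS = {
    "TYPE": "FLOAT32",
    "DIM": 1536,                 # embedding dimensionality
    "DISTANCE_METRIC": "COSINE",
    "M": 16,                     # HNSW graph connectivity
    "EF_CONSTRUCTION": 200,      # build-time accuracy/speed trade-off
}

# With a redis-py client `r`, the index over keys prefixed "cache:" is:
#
#   from redis.commands.search.field import TextField, VectorField
#   from redis.commands.search.indexDefinition import IndexDefinition, IndexType
#   r.ft("cache_idx").create_index(
#       (TextField("response"), VectorField("embedding", "HNSW", HNSW_PARAMS)),
#       definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
#   )
```

HNSW trades a little recall for much faster queries than a FLAT index; at the sub-1M scale discussed above, either works.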
Redis Sorted Sets and sliding window algorithms are the standard for rate limiting. For AI applications, implement two levels: a per-user limit that caps requests per minute, and a global limit that caps total LLM calls across all users.
This prevents individual users from running up your API costs and protects against abuse.
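A minimal in-process sketch of the sliding-window pattern (names are illustrative; in production the timestamp set lives in a Redis Sorted Set per user, and the three annotated steps run as ZREMRANGEBYSCORE, ZADD, and ZCARD in a single pipeline):

```python
import time

class SlidingWindowLimiter:
    """In-process sketch of the Sorted-Set sliding-window pattern."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.hits = {}  # user_id -> list of request timestamps (the Sorted Set)

    def allow(self, user_id: str, now=None) -> bool:
        now = time.time() if now is None else now
        cutoff = now - self.window
        # Drop entries that fell out of the window        (ZREMRANGEBYSCORE)
        hits = [t for t in self.hits.get(user_id, []) if t > cutoff]
        # Record this request                              (ZADD)
        hits.append(now)
        self.hits[user_id] = hits
        # Count requests still inside the window          (ZCARD)
        return len(hits) <= self.limit
```

Running the Redis version through a pipeline keeps all three steps in one round trip, so the check adds well under a millisecond per request.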
Redis Lists and Streams power lightweight task queues. For AI workloads, use Redis as the broker for Celery or ARQ to run long-running jobs (batch embeddings, document processing, scheduled agent runs) outside the request path.
This pattern is central to scaling FastAPI for high-throughput AI workloads.
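Under the hood, the broker pattern is just a producer pushing serialized tasks onto a list and a worker blocking on it. A sketch with an illustrative task schema and queue name:

```python
import json
import time

def make_task(task_type: str, payload: dict) -> str:
    """Serialize a task for a Redis List queue (illustrative schema)."""
    return json.dumps({
        "type": task_type,
        "payload": payload,
        "enqueued_at": time.time(),
    })

# Producer (API handler) with a redis-py client `r`:
#   r.rpush("tasks:llm", make_task("summarize", {"doc_id": 42}))
#
# Worker loop:
#   _, raw = r.blpop("tasks:llm")   # blocks until a task arrives
#   task = json.loads(raw)
```

Lists give at-most-once delivery; if a worker crash mid-task must not lose work, use Redis Streams with consumer groups (XADD/XREADGROUP/XACK) instead.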
In multi-agent systems where agents need to communicate, Redis Pub/Sub provides lightweight real-time messaging. Agent A publishes a task result, Agent B subscribes and continues processing.
For more complex inter-agent patterns, see our guide on agent-to-agent communication.
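The Pub/Sub handoff described above needs only a shared channel name and an agreed message envelope. A sketch (channel and schema are illustrative):

```python
import json

def agent_message(sender: str, task_id: str, result: dict) -> str:
    """Envelope for inter-agent messages over Pub/Sub (illustrative schema)."""
    return json.dumps({"sender": sender, "task_id": task_id, "result": result})

# Agent A publishes a task result with a redis-py client `r`:
#   r.publish("agents:results", agent_message("agent_a", "task-1", {"ok": True}))
#
# Agent B subscribes and continues processing:
#   p = r.pubsub()
#   p.subscribe("agents:results")
#   for msg in p.listen():
#       if msg["type"] == "message":
#           handle(json.loads(msg["data"]))
```

Note that Pub/Sub is fire-and-forget: a subscriber that is offline misses the message. If delivery must survive restarts, use a Stream instead of a channel.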
Typical Production Setup:
User -> FastAPI -> Redis (cache check) -> LLM API (if cache miss) -> Redis (cache store + session update) -> User
Async path: Event -> Redis Queue -> Worker -> LLM API -> Redis (store result) -> Webhook
Why Redis and not Memcached? Memcached doesn't support vector search, data persistence, or pub/sub. For AI workloads that need semantic caching, session state, and task queues from the same system, Redis is the clear choice.
Redis Cloud (managed) for most teams. AWS ElastiCache for AWS-native stacks. Self-hosted Redis Stack if you need vector search with control over the exact version. For infrastructure automation, provision it with Terraform.
Each cached response requires: embedding (6 KB for 1536 dims) + response text (avg 2 KB) + metadata (0.5 KB) = ~8.5 KB per entry. 100,000 cached responses use about 850 MB. A 2 GB Redis instance handles most production caching needs.
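The sizing arithmetic checks out in a few lines (the ~850 MB figure in the text uses decimal megabytes; in binary MiB it comes to about 830):

```python
# Back-of-envelope cache sizing, using the figures above
EMBEDDING_DIMS = 1536
embedding_kb = EMBEDDING_DIMS * 4 / 1024  # float32 -> 6.0 KB per embedding
entry_kb = embedding_kb + 2 + 0.5         # + response (~2 KB) + metadata (~0.5 KB)
total_mib = 100_000 * entry_kb / 1024     # ~830 MiB for 100k cached entries
```

This excludes Redis's own per-key overhead and index memory, so leave headroom; a 2 GB instance still fits comfortably.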
We implement Redis-powered caching, state management, and coordination layers for AI applications.
Get Infrastructure Help