Infrastructure

The Role of Redis in AI Agent Architectures

Redis is the in-memory backbone of high-performance AI systems. Here are the 6 roles it plays in modern agent architectures.

Redis keeps showing up in AI architectures because AI workloads need what Redis does best: sub-millisecond data access, flexible data structures, and real-time capabilities. Every production LLM application we have built uses Redis for at least three different purposes. This guide covers the 6 roles Redis plays in AI agent stacks, with implementation examples for each.

Role 1: Semantic Cache (Savings: 20-40% on API costs)

The highest-impact use case. Store LLM responses indexed by their query embedding. When a semantically similar question arrives, return the cached response instead of calling the LLM API.

```python
# Redis as semantic cache with vector search
from redis import Redis
from redis.commands.search.query import Query

redis = Redis()

async def semantic_cache_lookup(query: str, threshold: float = 0.95):
    embedding = await get_embedding(query)  # float32 numpy array
    q = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .return_fields("response", "score")
        .dialect(2)
    )
    result = redis.ft("cache_idx").search(
        q, query_params={"vec": embedding.tobytes()}
    )
    # RediSearch returns a *distance* in `score` (lower = more similar);
    # with COSINE distance, similarity = 1 - distance.
    if result.docs and 1 - float(result.docs[0].score) > threshold:
        return result.docs[0].response  # cache hit
    return None  # cache miss, call the LLM API
```

This directly impacts your LLM costs. For more optimization techniques, see our full guide on reducing OpenAI costs by 60%.

Role 2: Conversation State and Session Management

AI agents need fast access to conversation history, user preferences, and session metadata. Redis Hashes and Sorted Sets provide O(1) lookups with automatic expiration.

```python
# Store conversation state with auto-expiry
import json

async def save_session(session_id: str, messages: list):
    redis.set(
        f"session:{session_id}",
        json.dumps(messages),
        ex=3600,  # expire after 1 hour of inactivity
    )

async def get_session(session_id: str) -> list:
    data = redis.get(f"session:{session_id}")
    return json.loads(data) if data else []
```

For a deeper dive on conversation memory, see our guide on managing AI agent memory.

Role 3: Vector Search (Small-Scale)

Redis Stack includes a vector search module (RediSearch) that supports HNSW and flat indexing. For datasets under 1 million vectors, Redis can serve as both your vector store and your cache, eliminating the need for a separate vector database.

For larger datasets, use dedicated vector databases like Pinecone, Weaviate, or PGVector and keep Redis as the cache/state layer.
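As a sketch of what the small-scale setup looks like, the index behind a semantic cache could be created as follows. The index name, key prefix, and helper function here are illustrative assumptions; the raw `FT.CREATE` command form avoids version-specific redis-py schema helpers:

```python
# Hypothetical sketch: create a RediSearch vector index for a semantic cache.
# The index name "cache_idx", the "cache:" key prefix, and the helper name
# are assumptions for illustration.

def create_cache_index(r, dim: int = 1536) -> None:
    # HNSW with COSINE distance is a reasonable default for sub-1M-vector sets.
    r.execute_command(
        "FT.CREATE", "cache_idx",
        "ON", "HASH",
        "PREFIX", 1, "cache:",
        "SCHEMA",
        "response", "TEXT",
        "embedding", "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32",
        "DIM", dim,
        "DISTANCE_METRIC", "COSINE",
    )
```

Documents are then written as Hashes under the `cache:` prefix, and the KNN query from Role 1 searches them directly.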

Role 4: Rate Limiting and Token Budget Management

Redis Sorted Sets and sliding window algorithms are the standard for rate limiting. For AI applications, implement two levels:

  • Request-level: Max 60 API calls per user per minute.
  • Token-level: Max 100,000 tokens per user per day (budget control).

This prevents individual users from running up your API costs and protects against abuse.
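A minimal sketch of both levels, assuming a redis-py client; the key names, limits, and function names are illustrative (in production, wrap each function's commands in a pipeline or Lua script for atomicity):

```python
# Hypothetical sketch: request-level and token-level limits.
import time
import uuid
from datetime import date

def allow_request(r, user_id: str, limit: int = 60, window_s: int = 60) -> bool:
    """Request-level: sliding window over a Sorted Set (score = timestamp)."""
    key = f"ratelimit:{user_id}"
    now = time.time()
    r.zremrangebyscore(key, 0, now - window_s)  # drop entries outside the window
    if r.zcard(key) >= limit:                   # count what's left
        return False
    r.zadd(key, {uuid.uuid4().hex: now})        # record this request
    r.expire(key, window_s)
    return True

def charge_tokens(r, user_id: str, tokens: int, daily_cap: int = 100_000) -> bool:
    """Token-level: a per-day counter enforcing the token budget."""
    key = f"tokens:{user_id}:{date.today().isoformat()}"
    used = r.incrby(key, tokens)
    r.expire(key, 86_400)
    return used <= daily_cap
```

Call `charge_tokens` after each LLM response, using the token counts reported by the API, and refuse further calls once it returns False.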

Role 5: Task Queues for Async Processing

Redis Lists and Streams power lightweight task queues. For AI workloads, use Redis as the broker for Celery or ARQ to handle:

  • Background document processing and embedding generation
  • Batch LLM calls that don't require real-time response
  • Async webhook deliveries after agent task completion

This pattern is central to scaling FastAPI for high-throughput AI workloads.
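Stripped of broker-framework machinery, the underlying pattern can be sketched with a Redis List; the key name, payload shape, and function names below are illustrative assumptions (Celery or ARQ add retries, acknowledgment, and result backends on top of this):

```python
# Hypothetical sketch: a minimal FIFO job queue on a Redis List.
import json

QUEUE_KEY = "jobs:embeddings"  # assumed key name

def enqueue_embedding_job(r, doc_id: str, text: str) -> None:
    # LPUSH on one end, BRPOP on the other gives FIFO ordering.
    r.lpush(QUEUE_KEY, json.dumps({"doc_id": doc_id, "text": text}))

def pop_job(r, timeout: int = 5):
    # BRPOP blocks up to `timeout` seconds, returning (key, payload) or None.
    item = r.brpop(QUEUE_KEY, timeout=timeout)
    if item is None:
        return None
    _key, payload = item
    return json.loads(payload)
```

A worker process loops on `pop_job`, generates the embedding, and writes the result back to Redis or the vector store.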

Role 6: Pub/Sub for Agent Coordination

In multi-agent systems where agents need to communicate, Redis Pub/Sub provides lightweight real-time messaging. Agent A publishes a task result, Agent B subscribes and continues processing.
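A sketch of that publish/subscribe exchange with redis-py; the channel naming scheme and payload shape are assumptions for illustration:

```python
# Hypothetical sketch: agent coordination over Redis Pub/Sub.
import json

def publish_result(r, agent: str, task_id: str, result: dict) -> int:
    """Agent A's side: publish a task result; returns subscriber count."""
    channel = f"agent:{agent}:results"
    return r.publish(channel, json.dumps({"task_id": task_id, "result": result}))

def listen_for_results(r, agent: str):
    """Agent B's side: block and yield decoded results as they arrive."""
    ps = r.pubsub()
    ps.subscribe(f"agent:{agent}:results")
    for msg in ps.listen():
        if msg["type"] == "message":
            yield json.loads(msg["data"])
```

Note that Pub/Sub is fire-and-forget: messages published while a subscriber is offline are lost. When agents need durable delivery, use Redis Streams with consumer groups instead.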

For more complex inter-agent patterns, see our guide on agent-to-agent communication.

Architecture: Redis in the AI Stack

Typical Production Setup:

User -> FastAPI -> Redis (cache check) -> LLM API (if cache miss) -> Redis (cache store + session update) -> User
Async path: Event -> Redis Queue -> Worker -> LLM API -> Redis (store result) -> Webhook

Frequently Asked Questions

Redis or Memcached for AI caching?

Redis. Memcached doesn't support vector search, data persistence, or pub/sub. For AI workloads where you need semantic caching, session state, and task queues from the same system, Redis is the clear choice.

What Redis deployment should I use?

Redis Cloud (managed) for most teams. AWS ElastiCache for AWS-native stacks. Self-hosted Redis Stack if you need vector search with specific version control. For infrastructure automation, use Terraform.

How much memory does a semantic cache need?

Each cached response requires: embedding (1536 float32 dimensions x 4 bytes = 6 KB) + response text (avg 2 KB) + metadata (0.5 KB) = ~8.5 KB per entry. 100,000 cached responses use about 850 MB. A 2 GB Redis instance handles most production caching needs.

Optimize Your AI Infrastructure

We implement Redis-powered caching, state management, and coordination layers for AI applications.

Get Infrastructure Help
© 2026 EkaivaKriti. All rights reserved.