Agent Architecture

How to Manage Memory and
Long-Term Context for AI Agents

Your agent forgets who the user is every time they start a new session. Here is how to fix that with short-term, long-term, and episodic memory systems.

LLMs have no persistent memory. Every API call starts from zero. The "memory" in ChatGPT is just the conversation history stuffed into the context window. Once that window fills up (or the session ends), all context is lost. For production AI agents, this is a critical limitation. A customer support agent that forgets the user's issue mid-conversation is useless. A personal assistant that doesn't remember your last 10 interactions provides no continuity. This guide covers the three memory architectures that solve this problem.

The Three Types of AI Agent Memory

| Memory Type | What It Stores | Duration | Storage |
| --- | --- | --- | --- |
| Short-term (working) | Current conversation | Single session | Context window / Redis |
| Long-term (semantic) | Facts, preferences, knowledge | Permanent | Vector store / database |
| Episodic | Past interaction summaries | Permanent | Summary store |

Short-Term Memory: Managing the Context Window

Short-term memory is the conversation history within the current session. The problem: even with modern 128K-token context windows, resending the entire history on every request wastes tokens and money.

Strategy: Sliding Window with Summarization

Keep the last N messages in full, summarize older messages into a compressed version. This preserves recent context in detail while retaining the gist of earlier conversation.

```python
# Sliding window with summary buffer
class ConversationMemory:
    def __init__(self, window_size=10):
        self.recent_messages = []  # full messages
        self.summary = ""          # compressed older context
        self.window_size = window_size

    def add_message(self, role, content):
        self.recent_messages.append({"role": role, "content": content})
        if len(self.recent_messages) > self.window_size:
            # Evict the oldest message and fold it into the running summary.
            # summarize() is an LLM call that merges the old summary with
            # the evicted message into a new compressed summary.
            oldest = self.recent_messages.pop(0)
            self.summary = summarize(self.summary, oldest)

    def get_context(self):
        # Summary first, then the verbatim recent messages
        return [
            {"role": "system", "content": f"Prior context: {self.summary}"},
            *self.recent_messages,
        ]
```

Store the conversation state in Redis for fast access and persistence across server restarts.
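One way to do that persistence, as a minimal sketch: the functions below assume a redis-py-style client (`set`/`get`/`expire`), and the `conv:<session_id>` key scheme and 24-hour TTL are illustrative choices, not requirements.

```python
import json

def save_state(client, session_id, memory, ttl_seconds=86400):
    """Persist conversation state to Redis (any client with set/get/expire works)."""
    key = f"conv:{session_id}"
    payload = json.dumps({
        "summary": memory["summary"],
        "recent_messages": memory["recent_messages"],
    })
    client.set(key, payload)
    client.expire(key, ttl_seconds)  # expire stale sessions automatically

def load_state(client, session_id):
    """Restore state after a server restart; empty state if nothing stored."""
    raw = client.get(f"conv:{session_id}")
    if raw is None:
        return {"summary": "", "recent_messages": []}
    return json.loads(raw)
```

Because the state is plain JSON, the same functions work against any key-value store with a compatible interface.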

Long-Term Memory: Facts and Preferences

Long-term memory stores persistent facts about the user, their preferences, and domain knowledge. This information survives across sessions and is retrieved using semantic search when relevant to the current query.

```python
# Store and retrieve user-specific memories
async def store_memory(user_id: str, fact: str):
    embedding = await get_embedding(fact)
    await vector_store.upsert(
        id=generate_id(),
        values=embedding,
        metadata={"user_id": user_id, "fact": fact, "type": "memory"},
        namespace=user_id,  # per-user namespace isolates each user's memories
    )

async def recall_memories(user_id: str, query: str, k=5):
    embedding = await get_embedding(query)
    results = await vector_store.query(
        vector=embedding, top_k=k, namespace=user_id
    )
    return [r.metadata["fact"] for r in results.matches]
```

The key architectural decision is when to write to long-term memory. Two approaches:

  • Explicit extraction: After each conversation, run a summary LLM call that extracts new facts worth remembering. "The user prefers Python over JavaScript." "The user's project uses PostgreSQL."
  • Continuous extraction: After every message, check if it contains a memorizable fact. More real-time but higher API cost.
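A sketch of the explicit-extraction approach: one LLM call at session end, with the model's response parsed line by line. The prompt wording and the `llm` callable are assumptions; any prompt-in, text-out interface fits here.

```python
EXTRACTION_PROMPT = """Review the conversation below and list any durable facts
about the user worth remembering (preferences, projects, constraints).
Return one fact per line, or NONE if there is nothing to store.

Conversation:
{transcript}"""

def extract_facts(transcript: str, llm) -> list[str]:
    """Run one extraction call at session end; `llm` is any prompt -> text callable."""
    response = llm(EXTRACTION_PROMPT.format(transcript=transcript))
    # Strip list markers the model may add, drop blanks and the NONE sentinel
    lines = [line.strip("-• ").strip() for line in response.splitlines()]
    return [line for line in lines if line and line.upper() != "NONE"]
```

Each returned fact can then be passed to `store_memory` from the previous section.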

Episodic Memory: Conversation Summaries

Episodic memory stores summaries of past interactions. When a user returns, the agent can recall "In our last conversation, you asked about setting up a multi-tenant RAG system. We discussed Pinecone namespaces and you decided to go with the namespace-per-tenant approach."

Store episode summaries as structured records: date, topic, key decisions, action items, and outcome. Retrieve the most recent or most relevant episodes at the start of each new session.
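One way to shape those structured records, using the fields listed above. `Episode` and `render_recap` are illustrative names, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """Structured summary of one past session: date, topic, decisions, outcome."""
    user_id: str
    day: str  # ISO date string, e.g. "2026-01-15"
    topic: str
    key_decisions: list = field(default_factory=list)
    action_items: list = field(default_factory=list)
    outcome: str = ""

def render_recap(episodes: list) -> str:
    """Turn retrieved episodes into a recap line for the new session's prompt."""
    parts = []
    for ep in episodes:
        decisions = "; ".join(ep.key_decisions) or "no decisions recorded"
        parts.append(f"On {ep.day} we discussed {ep.topic} ({decisions}).")
    return " ".join(parts)
```

Storing episodes as flat records like this keeps them queryable by date or topic without an embedding lookup, though you can also embed the summaries for relevance-based retrieval.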

Putting It Together: The Memory Architecture

Complete Memory Pipeline:

1. New session starts -> Retrieve relevant long-term memories and recent episodes for the user
2. Each message -> Append to short-term memory (sliding window)
3. Each response -> Check for new facts to store in long-term memory
4. Session ends -> Generate episode summary, store in episodic memory, persist short-term state
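The four steps above can be wired together in one thin coordinator. This is a sketch: `recall`, `extract`, and `summarize_session` are injected callables standing in for the vector-store and LLM calls from the earlier sections.

```python
class AgentMemory:
    """Coordinates the three memory layers across a session lifecycle."""

    def __init__(self, recall, extract, summarize_session):
        self.recall = recall                      # (user_id, query) -> facts
        self.extract = extract                    # message text -> new facts
        self.summarize_session = summarize_session  # transcript -> episode summary
        self.short_term = []   # step 2: sliding-window message buffer
        self.long_term = []    # facts recalled or learned this session

    def start_session(self, user_id, first_query):
        # Step 1: pull relevant long-term memories for this user
        self.long_term = list(self.recall(user_id, first_query))
        return self.long_term

    def on_message(self, role, content):
        # Steps 2-3: buffer the message, harvest any new facts it contains
        self.short_term.append({"role": role, "content": content})
        self.long_term.extend(self.extract(content))

    def end_session(self):
        # Step 4: produce the episode summary from the session transcript
        return self.summarize_session(self.short_term)
```

In production, `end_session` would also persist the summary to the episodic store and flush the short-term buffer to Redis.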

This architecture works with any agent framework. For LangGraph-based agents, the memory system integrates with the state management layer. For the full agent architecture, see our production agent blueprint.

Frequently Asked Questions

How do I handle memory for multi-tenant agents?

Use per-user namespaces in your vector store so each user's memories are isolated. This follows the same pattern as multi-tenant RAG with Pinecone.

Should I let users delete their memories?

Yes. Provide a "forget me" feature that deletes all stored memories for a user. This is a privacy best practice and a requirement under regulations like GDPR. For healthcare applications, see our HIPAA compliance guide.

What about the cost of all these extra LLM calls for memory management?

Summarization and fact extraction calls are typically short (500-1000 tokens). The cost is $0.001-0.005 per session. For cost optimization strategies, see our guide on reducing OpenAI costs by 60%.

Build Agents That Remember

We implement memory systems for AI agents that provide continuity, personalization, and context-awareness across sessions.

Discuss Your Agent Project
© 2026 EkaivaKriti. All rights reserved.