Enterprise RAG

How to Implement Multi-Tenant RAG with Pinecone

Complete architecture guide for building SaaS AI products where each customer's data stays isolated, searchable, and secure.

Multi-tenant RAG is the backbone of every SaaS AI product. Each customer uploads their own documents, and the system must retrieve only from that customer's data, never mixing tenants. Pinecone's namespace and metadata filtering features make this straightforward to implement, but the architecture decisions around isolation models, ingestion pipelines, and access control determine whether your system scales to 10 tenants or 10,000.

The Three Isolation Models

Before writing any code, you need to choose an isolation model. Each has different trade-offs for cost, security, and performance.

| Model | How It Works | Security | Cost | Best For |
| --- | --- | --- | --- | --- |
| Index per tenant | Separate Pinecone index per customer | Strongest | Highest | Enterprise / regulated |
| Namespace per tenant | One index, separate namespace per customer | Strong | Medium | Most SaaS products |
| Metadata filtering | Shared namespace, `tenant_id` in metadata | Adequate | Lowest | Small-scale / prototypes |

Recommendation:

Use the namespace-per-tenant model for most SaaS products. Pinecone namespaces provide strong query isolation (a query to namespace A will never return results from namespace B), cost efficiency (one index), and simple management. Reserve index-per-tenant for regulated industries like legal AI and healthcare where compliance requires physical data separation.
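The practical difference between the two shared-index models shows up in how you build the query. A minimal sketch, assuming a hypothetical `build_query_kwargs` helper (not part of the Pinecone SDK), illustrates where the isolation lives in each case:

```python
# Sketch: what you'd pass to index.query() under each shared-index
# isolation model. build_query_kwargs is an illustrative helper only.

def build_query_kwargs(model: str, tenant_id: str,
                       embedding: list, top_k: int = 5) -> dict:
    """Return query kwargs for the chosen isolation model."""
    base = {"vector": embedding, "top_k": top_k, "include_metadata": True}
    if model == "namespace":
        # Namespace-per-tenant: Pinecone enforces the boundary.
        base["namespace"] = tenant_id
    elif model == "metadata":
        # Metadata filtering: isolation holds only if every query
        # remembers to include this filter.
        base["filter"] = {"tenant_id": {"$eq": tenant_id}}
    else:
        raise ValueError(f"unknown isolation model: {model}")
    return base
```

Note the asymmetry: with namespaces, forgetting the parameter returns nothing from other tenants; with metadata filtering, forgetting the filter returns everything. That failure mode is why namespaces are the safer default.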

Step-by-Step Implementation

Step 1: Index Setup and Configuration

```python
# Create a single serverless index for all tenants
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

pc.create_index(
    name="multi-tenant-rag",
    dimension=1536,  # OpenAI text-embedding-3-small
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
```

Step 2: Document Ingestion with Tenant Isolation

```python
# Upsert documents into a tenant-specific namespace
def ingest_document(tenant_id: str, doc_id: str, doc_chunks: list):
    index = pc.Index("multi-tenant-rag")
    vectors = []
    for i, chunk in enumerate(doc_chunks):
        embedding = get_embedding(chunk.text)
        vectors.append({
            "id": f"{tenant_id}_{doc_id}_{i}",
            "values": embedding,
            "metadata": {
                "text": chunk.text,
                "source": chunk.source_file,
                "uploaded_at": chunk.timestamp,
            },
        })
    # Namespace = tenant_id ensures isolation
    index.upsert(vectors=vectors, namespace=tenant_id)
```

Step 3: Tenant-Scoped Retrieval

```python
# Query only within the authenticated tenant's namespace
def retrieve(tenant_id: str, query: str, top_k: int = 5):
    index = pc.Index("multi-tenant-rag")
    query_embedding = get_embedding(query)
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        namespace=tenant_id,  # isolation happens here
        include_metadata=True,
    )
    return [match.metadata["text"] for match in results.matches]
```

Step 4: Access Control Layer

The namespace parameter is your data boundary, but you must enforce it at the application layer. Never accept the tenant_id from the client request body; derive it from the authenticated session or a verified JWT.

```python
# FastAPI endpoint with tenant extraction from auth
@app.post("/api/query")
async def query_endpoint(
    request: QueryRequest,
    tenant_id: str = Depends(get_tenant_from_token),
):
    # tenant_id comes from JWT, not request body
    chunks = retrieve(tenant_id, request.question)
    answer = generate_answer(request.question, chunks)
    return {"answer": answer, "sources": chunks}
```
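The claim-extraction side of `get_tenant_from_token` can be sketched as a small pure function. This is illustrative, assuming the JWT carries a `tenant_id` claim; the claim name and error type are assumptions, and in production the claims dict must come from a signature-verified token (e.g. via PyJWT), never an unverified payload:

```python
# Minimal sketch of tenant extraction from verified JWT claims.
# Assumes a "tenant_id" claim; adapt the claim name to your identity provider.

def get_tenant_from_claims(claims: dict) -> str:
    """Pull the tenant identifier out of already-verified JWT claims."""
    tenant_id = claims.get("tenant_id")
    if not tenant_id:
        # Reject outright; never fall back to anything client-supplied.
        raise PermissionError("token missing tenant_id claim")
    return tenant_id
```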

Scaling Considerations

  • Ingestion throughput: Pinecone serverless handles up to 100 upserts/second per namespace. For bulk uploads, batch vectors in groups of 100 and use async upserts.
  • Query latency: Namespace queries add no measurable overhead compared to non-namespaced queries. P99 latency stays under 100ms for indexes under 10M vectors.
  • Cost management: With serverless Pinecone, you pay per query and per stored vector. Monitor per-tenant usage and implement rate limiting for tenants on free plans.
  • Tenant deletion: Deleting a namespace removes all vectors for that tenant in a single API call. This is essential for GDPR compliance and account deletion.
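The batching advice above can be sketched with a small helper; the batch size of 100 follows the recommendation in this section, and the commented usage lines assume the `index`, `vectors`, and `tenant_id` names from the ingestion step:

```python
# Split vectors into fixed-size batches for bulk upserts.
def batch_vectors(vectors: list, batch_size: int = 100):
    """Yield successive batches of at most batch_size vectors."""
    for start in range(0, len(vectors), batch_size):
        yield vectors[start:start + batch_size]

# Usage sketch (assumes index/vectors/tenant_id from the ingestion step):
# for batch in batch_vectors(vectors):
#     index.upsert(vectors=batch, namespace=tenant_id)
#
# Tenant deletion: one call removes every vector in the namespace.
# index.delete(delete_all=True, namespace=tenant_id)
```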

For a deeper comparison of vector database options, see our Pinecone vs Weaviate vs PGVector analysis. For the API layer, our guide on scaling FastAPI covers the patterns for high-throughput AI endpoints.

Common Pitfalls and How to Avoid Them

Pitfall: Tenant ID Injection

If the tenant_id is passed in the request body, a malicious user can query another tenant's data. Always derive tenant_id from the authentication layer.

Pitfall: Shared Embedding Models Leaking Context

If you fine-tune an embedding model on one tenant's data and use it for all tenants, the model may encode proprietary concepts from the training tenant's documents. Use general-purpose embedding models for multi-tenant deployments.

Pitfall: No Tenant-Level Monitoring

Without per-tenant metrics, you can't identify which tenant is causing performance issues or excessive costs. Log tenant_id with every query and build dashboards showing query volume, latency, and error rates per tenant.
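A minimal sketch of per-tenant query logging, assuming a hypothetical `log_query` helper; the field names are illustrative, and only the query length is recorded to avoid storing raw customer text in logs:

```python
import logging
import time

logger = logging.getLogger("rag.queries")

def log_query(tenant_id: str, query: str,
              started_at: float, n_results: int) -> dict:
    """Build and emit a structured per-tenant record for one retrieval call."""
    record = {
        "tenant_id": tenant_id,
        "query_chars": len(query),  # length only, not the raw text
        "latency_ms": round((time.monotonic() - started_at) * 1000, 1),
        "results": n_results,
    }
    logger.info("rag_query", extra=record)
    return record
```

Feeding these records into a dashboard keyed on `tenant_id` gives you the query-volume, latency, and error views this pitfall calls for.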

Frequently Asked Questions

How many tenants can a single Pinecone index support?

Pinecone serverless indexes support up to 10,000 namespaces per index. For most SaaS products, this is sufficient. If you exceed this, shard across multiple indexes with a routing layer.
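The routing layer can be as simple as a deterministic hash of the tenant ID over a fixed list of index names. A sketch, where the shard names are assumptions to adapt to your own naming scheme:

```python
import hashlib

# Hypothetical shard names; use however you actually name your indexes.
INDEX_SHARDS = ["multi-tenant-rag-0", "multi-tenant-rag-1", "multi-tenant-rag-2"]

def index_for_tenant(tenant_id: str) -> str:
    """Deterministically map a tenant to one index shard."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return INDEX_SHARDS[int(digest, 16) % len(INDEX_SHARDS)]
```

Because the mapping is deterministic, ingestion and retrieval always land on the same shard for a given tenant; note that adding shards later remaps tenants, so growing the list requires a migration plan.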

Should I use Pinecone namespaces or metadata filtering?

Namespaces for tenant isolation. Metadata filtering for sub-tenant filtering (e.g., filtering by document type, date, or department within a tenant's namespace). They serve different purposes and are often used together.
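Combining the two looks like this in a retrieval call. A sketch with the index passed in as a parameter; the `doc_type` metadata field is an illustrative assumption, not something Pinecone defines:

```python
def retrieve_filtered(index, tenant_id: str, embedding: list,
                      doc_type: str, top_k: int = 5):
    """Query within a tenant's namespace, narrowed by a metadata filter."""
    return index.query(
        vector=embedding,
        top_k=top_k,
        namespace=tenant_id,                     # who may see the data
        filter={"doc_type": {"$eq": doc_type}},  # which slice of it
        include_metadata=True,
    )
```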

How do I handle the RAG quality issues that come with scale?

As you add more tenants with diverse document types, your chunking and retrieval strategy needs tuning. Read our guide on fixing RAG failures with agentic AI for advanced retrieval patterns.

Build Your Multi-Tenant AI Product

We architect and deploy multi-tenant RAG systems for SaaS companies. From prototype to production at scale.

© 2026 EkaivaKriti. All rights reserved.