FastAPI is fast by default. Getting it to handle production AI traffic at scale requires knowing where the bottlenecks are and how to eliminate them.
FastAPI handles 10,000+ requests per second out of the box for simple JSON responses. But AI endpoints are different. Each request involves LLM API calls (2-10 seconds), vector database queries (50-200ms), and compute-heavy processing. A naive FastAPI setup will saturate at 50-100 concurrent AI requests. This guide shows you how to push that to 10,000+ concurrent requests, enough to handle 1 million AI interactions per day.
AI API endpoints are overwhelmingly I/O-bound. The majority of request time is spent waiting for external services (LLM APIs, vector databases, downstream APIs). The CPU work (prompt construction, output parsing) is negligible. This means the optimization strategy is about concurrent I/O handling, not raw compute speed.
Rule of Thumb:
If your endpoint spends 90% of its time waiting for external API calls, you can handle 10x more concurrent requests just by switching from synchronous to asynchronous code. This single change is worth more than any infrastructure scaling.
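The effect is easy to demonstrate with `asyncio.sleep` standing in for a slow external API call (a sketch, not a benchmark — the 0.05 s latency is an arbitrary placeholder):

```python
import asyncio
import time

async def fake_api_call():
    await asyncio.sleep(0.05)  # stand-in for a slow external round trip

async def sequential(n):
    # One call at a time: total time ~ n * latency
    for _ in range(n):
        await fake_api_call()

async def concurrent(n):
    # All calls in flight at once: total time ~ 1 * latency
    await asyncio.gather(*(fake_api_call() for _ in range(n)))

start = time.perf_counter()
asyncio.run(sequential(10))
sequential_s = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(concurrent(10))
concurrent_s = time.perf_counter() - start
```

Ten sequential calls take at least ten times the single-call latency; ten concurrent calls take roughly one. That ratio is the headroom async unlocks on I/O-bound endpoints.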
Every external call must be async. Synchronous HTTP calls block the entire event loop and limit your throughput to the number of Uvicorn workers.
```python
# Use httpx instead of requests, asyncpg instead of psycopg2
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from httpx import AsyncClient

app = FastAPI()
client = AsyncClient(timeout=30.0)

@app.post("/api/chat")
async def chat(request: ChatRequest):
    # Parallel external calls
    embedding_task = get_embedding_async(request.message)
    history_task = get_chat_history_async(request.session_id)
    embedding, history = await asyncio.gather(
        embedding_task, history_task
    )
    chunks = await vector_search_async(embedding)
    return StreamingResponse(
        generate_stream(request.message, chunks, history)
    )
```
Every database connection, HTTP client, and API session should be pooled. Creating a new connection per request adds 50-200ms of overhead and quickly exhausts system resources.
```python
# Shared connection pools initialized at startup
import asyncpg
import redis.asyncio as aioredis  # successor to the standalone aioredis package
from httpx import AsyncClient, Limits

@app.on_event("startup")
async def startup():
    app.state.http_client = AsyncClient(
        limits=Limits(max_connections=100, max_keepalive_connections=20)
    )
    app.state.db_pool = await asyncpg.create_pool(
        DATABASE_URL, min_size=10, max_size=50
    )
    app.state.redis = aioredis.from_url(REDIS_URL)  # from_url is synchronous
```
Run Uvicorn behind Gunicorn with the right number of workers. The formula depends on whether you are running async or sync code:
```bash
# For async FastAPI (recommended)
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker \
    --bind 0.0.0.0:8000 --timeout 120

# Workers = 2 * CPU_CORES + 1 (for I/O-bound)
# Each async worker handles thousands of concurrent connections
```
For AI endpoints, 4-8 workers on a 4-core machine is typically optimal. Each worker handles thousands of concurrent async connections. Going beyond 8 workers on 4 cores adds context-switching overhead without benefit.
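The worker formula can be expressed as a small helper (a hypothetical utility, not part of Gunicorn's API):

```python
import os

def worker_count(io_bound: bool = True) -> int:
    # Rule of thumb: 2 * cores + 1 for I/O-bound async services;
    # roughly one worker per core for CPU-bound work.
    cores = os.cpu_count() or 1
    return 2 * cores + 1 if io_bound else cores
```

On a 4-core machine this yields 9 for I/O-bound work; cap it around 8 per the guidance above to avoid context-switching overhead.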
For LLM responses, streaming is both a UX improvement and a scaling strategy. A streaming endpoint holds a connection open but consumes minimal server resources while waiting for tokens. Non-streaming endpoints hold a worker thread blocked for the entire generation time (3-10 seconds), severely limiting concurrency. See our complete LLM latency optimization guide for implementation details.
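A minimal streaming sketch: wrap an async token stream in Server-Sent Events framing and hand it to `StreamingResponse`. The function names (`sse_events`, `fake_tokens`) are illustrative, not from any library:

```python
import asyncio

async def sse_events(token_stream):
    # Wrap an async stream of tokens in Server-Sent Events framing.
    # The connection stays open, but the worker is free between tokens.
    async for token in token_stream:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

# In a FastAPI endpoint this would be returned as:
#   StreamingResponse(sse_events(llm_token_stream), media_type="text/event-stream")

async def demo():
    async def fake_tokens():  # stand-in for an LLM token stream
        for t in ("Hello", " world"):
            yield t
    return [chunk async for chunk in sse_events(fake_tokens())]

chunks = asyncio.run(demo())
```

Because the generator yields as tokens arrive, the event loop can interleave thousands of such open connections on a single worker.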
Add Redis as a caching layer for three things: embeddings of frequent queries, full responses to repeated identical prompts, and session data such as chat history.
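One common pattern here is cache-aside for embeddings. A minimal sketch (the `DictCache` below is an in-memory stand-in for `app.state.redis`, and `cached_embedding` is an illustrative name, not a library function):

```python
import asyncio
import hashlib
import json

async def cached_embedding(cache, text, compute):
    # Cache-aside: derive a key from the input, return a hit if present,
    # otherwise compute, store, and return.
    key = "emb:" + hashlib.sha256(text.encode()).hexdigest()
    hit = await cache.get(key)
    if hit is not None:
        return json.loads(hit)
    value = await compute(text)
    await cache.set(key, json.dumps(value))  # with real Redis, add an expiry
    return value

class DictCache:
    # In-memory stand-in for an async Redis client in this sketch.
    def __init__(self):
        self.data = {}
    async def get(self, key):
        return self.data.get(key)
    async def set(self, key, value):
        self.data[key] = value

async def demo():
    calls = []
    async def fake_embed(text):  # stand-in for an embedding API call
        calls.append(text)
        return [0.1, 0.2, 0.3]
    cache = DictCache()
    first = await cached_embedding(cache, "hello", fake_embed)
    second = await cached_embedding(cache, "hello", fake_embed)
    return first, second, len(calls)

first, second, api_calls = asyncio.run(demo())
```

The second lookup never touches the embedding API: one upstream call serves every repeat of the same input.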
Not every request needs real-time processing. For batch operations (document analysis, bulk embedding), use a task queue (Celery with Redis, or ARQ for async). This prevents long-running tasks from consuming your API workers.
```python
# Separate real-time and batch endpoints
@app.post("/api/chat")  # Real-time, streaming
async def chat(): ...

@app.post("/api/analyze-document")  # Batch, queued
async def analyze(request: AnalyzeRequest):
    task_id = await enqueue(analyze_document, request.doc_id)
    return {"task_id": task_id, "status": "processing"}
```
When a single server is no longer enough, scale horizontally. Deploy multiple FastAPI instances behind a load balancer. Since the state lives in Redis and PostgreSQL (not in the application), any instance can handle any request.
For infrastructure automation, see our guide on Terraform with AWS Bedrock. For the complete AI architecture, our production agent blueprint covers how FastAPI fits into the bigger picture.
FastAPI has native async support, automatic OpenAPI docs, Pydantic validation, and significantly higher throughput for I/O-bound workloads. Django and Flask work but require more effort to achieve the same performance. Read more about why Python dominates the AI stack.
Serverless has cold start issues and execution time limits that make it challenging for LLM workloads. See our analysis of deploying LLMs on AWS Lambda for the trade-offs.
Use Prometheus + Grafana for metrics (request latency, error rates, throughput). Add OpenTelemetry for distributed tracing across your FastAPI service, LLM calls, and vector database queries.
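Request latency can be captured at the ASGI layer. A pure-Python sketch (in production you would feed these observations into a `prometheus_client` Histogram rather than an in-memory dict; `LatencyMiddleware` is an illustrative name):

```python
import asyncio
import time

class LatencyMiddleware:
    # Minimal ASGI middleware recording per-path request latency.
    def __init__(self, app, sink):
        self.app = app
        self.sink = sink  # {path: [durations in seconds]}

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            self.sink.setdefault(scope["path"], []).append(
                time.perf_counter() - start
            )

async def demo():
    async def inner_app(scope, receive, send):  # stand-in for a FastAPI app
        await asyncio.sleep(0.01)
    sink = {}
    app = LatencyMiddleware(inner_app, sink)
    await app({"type": "http", "path": "/api/chat"}, None, None)
    return sink

sink = asyncio.run(demo())
```

Measuring in the `finally` block ensures failed requests are recorded too, which is exactly where error-rate and tail-latency problems show up first.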
We build and scale FastAPI backends for AI products. From architecture design to production deployment.
Get Scaling Help