FastAPI is fast by default. Getting it to handle production AI traffic at scale requires knowing where the bottlenecks are and how to eliminate them.
FastAPI handles 10,000+ requests per second out of the box for simple JSON responses. But AI endpoints are different. Each request involves LLM API calls (2-10 seconds), vector database queries (50-200ms), and compute-heavy processing. A naive FastAPI setup will saturate at 50-100 concurrent AI requests. This guide shows you how to push that to 10,000+ concurrent requests, enough to handle 1 million AI interactions per day.
AI API endpoints are overwhelmingly I/O-bound. The majority of request time is spent waiting for external services (LLM APIs, vector databases, downstream APIs). The CPU work (prompt construction, output parsing) is negligible. This means the optimization strategy is about concurrent I/O handling, not raw compute speed.
Rule of Thumb:
If your endpoint spends 90% of its time waiting for external API calls, you can handle 10x more concurrent requests just by switching from synchronous to asynchronous code. This single change is worth more than any infrastructure scaling.
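The effect is easy to demonstrate with `asyncio.sleep` standing in for a slow external API call (a sketch, not a benchmark — the 0.05 s latency is an arbitrary placeholder):

```python
import asyncio
import time

async def fake_api_call():
    await asyncio.sleep(0.05)  # stand-in for a slow external round trip

async def sequential(n):
    # One call at a time: total time ~ n * latency
    for _ in range(n):
        await fake_api_call()

async def concurrent(n):
    # All calls in flight at once: total time ~ 1 * latency
    await asyncio.gather(*(fake_api_call() for _ in range(n)))

start = time.perf_counter()
asyncio.run(sequential(10))
sequential_s = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(concurrent(10))
concurrent_s = time.perf_counter() - start
```

Ten sequential calls take at least ten times the single-call latency; ten concurrent calls take roughly one. That ratio is the headroom async unlocks on I/O-bound endpoints.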
Every external call must be async. Synchronous HTTP calls block the entire event loop and limit your throughput to the number of Uvicorn workers.
```python
# Use httpx instead of requests, asyncpg instead of psycopg2
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from httpx import AsyncClient

app = FastAPI()
client = AsyncClient(timeout=30.0)

@app.post("/api/chat")
async def chat(request: ChatRequest):
    # Parallel external calls
    embedding_task = get_embedding_async(request.message)
    history_task = get_chat_history_async(request.session_id)
    embedding, history = await asyncio.gather(
        embedding_task, history_task
    )
    chunks = await vector_search_async(embedding)
    return StreamingResponse(
        generate_stream(request.message, chunks, history)
    )
```
Every database connection, HTTP client, and API session should be pooled. Creating a new connection per request adds 50-200ms of overhead and quickly exhausts system resources.
```python
# Shared connection pools initialized at startup
import asyncpg
import redis.asyncio as aioredis  # successor to the standalone aioredis package
from httpx import AsyncClient, Limits

@app.on_event("startup")
async def startup():
    app.state.http_client = AsyncClient(
        limits=Limits(max_connections=100, max_keepalive_connections=20)
    )
    app.state.db_pool = await asyncpg.create_pool(
        DATABASE_URL, min_size=10, max_size=50
    )
    app.state.redis = aioredis.from_url(REDIS_URL)  # from_url is synchronous
```
Run Uvicorn behind Gunicorn with the right number of workers. The formula depends on whether you are running async or sync code:
```bash
# For async FastAPI (recommended)
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker \
    --bind 0.0.0.0:8000 --timeout 120

# Workers = 2 * CPU_CORES + 1 (for I/O-bound)
# Each async worker handles thousands of concurrent connections
```
For AI endpoints, 4-8 workers on a 4-core machine is typically optimal. Each worker handles thousands of concurrent async connections. Going beyond 8 workers on 4 cores adds context-switching overhead without benefit.
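The worker formula can be expressed as a small helper (a hypothetical utility, not part of Gunicorn's API):

```python
import os

def worker_count(io_bound: bool = True) -> int:
    # Rule of thumb: 2 * cores + 1 for I/O-bound async services;
    # roughly one worker per core for CPU-bound work.
    cores = os.cpu_count() or 1
    return 2 * cores + 1 if io_bound else cores
```

On a 4-core machine this yields 9 for I/O-bound work; cap it around 8 per the guidance above to avoid context-switching overhead.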
For LLM responses, streaming is both a UX improvement and a scaling strategy. A streaming endpoint holds a connection open but consumes minimal server resources while waiting for tokens. Non-streaming endpoints hold a worker thread blocked for the entire generation time (3-10 seconds), severely limiting concurrency. See our complete LLM latency optimization guide for implementation details.
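A minimal streaming sketch: wrap an async token stream in Server-Sent Events framing and hand it to `StreamingResponse`. The function names (`sse_events`, `fake_tokens`) are illustrative, not from any library:

```python
import asyncio

async def sse_events(token_stream):
    # Wrap an async stream of tokens in Server-Sent Events framing.
    # The connection stays open, but the worker is free between tokens.
    async for token in token_stream:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

# In a FastAPI endpoint this would be returned as:
#   StreamingResponse(sse_events(llm_token_stream), media_type="text/event-stream")

async def demo():
    async def fake_tokens():  # stand-in for an LLM token stream
        for t in ("Hello", " world"):
            yield t
    return [chunk async for chunk in sse_events(fake_tokens())]

chunks = asyncio.run(demo())
```

Because the generator yields as tokens arrive, the event loop can interleave thousands of such open connections on a single worker.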
Add Redis as a caching layer for three things: embeddings of frequent queries, full responses to repeated identical prompts, and session data such as chat history.
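One common pattern here is cache-aside for embeddings. A minimal sketch (the `DictCache` below is an in-memory stand-in for `app.state.redis`, and `cached_embedding` is an illustrative name, not a library function):

```python
import asyncio
import hashlib
import json

async def cached_embedding(cache, text, compute):
    # Cache-aside: derive a key from the input, return a hit if present,
    # otherwise compute, store, and return.
    key = "emb:" + hashlib.sha256(text.encode()).hexdigest()
    hit = await cache.get(key)
    if hit is not None:
        return json.loads(hit)
    value = await compute(text)
    await cache.set(key, json.dumps(value))  # with real Redis, add an expiry
    return value

class DictCache:
    # In-memory stand-in for an async Redis client in this sketch.
    def __init__(self):
        self.data = {}
    async def get(self, key):
        return self.data.get(key)
    async def set(self, key, value):
        self.data[key] = value

async def demo():
    calls = []
    async def fake_embed(text):  # stand-in for an embedding API call
        calls.append(text)
        return [0.1, 0.2, 0.3]
    cache = DictCache()
    first = await cached_embedding(cache, "hello", fake_embed)
    second = await cached_embedding(cache, "hello", fake_embed)
    return first, second, len(calls)

first, second, api_calls = asyncio.run(demo())
```

The second lookup never touches the embedding API: one upstream call serves every repeat of the same input.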
Not every request needs real-time processing. For batch operations (document analysis, bulk embedding), use a task queue (Celery with Redis, or ARQ for async). This prevents long-running tasks from consuming your API workers.
```python
# Separate real-time and batch endpoints
@app.post("/api/chat")  # Real-time, streaming
async def chat(): ...

@app.post("/api/analyze-document")  # Batch, queued
async def analyze(request: AnalyzeRequest):
    task_id = await enqueue(analyze_document, request.doc_id)
    return {"task_id": task_id, "status": "processing"}
```
When a single server is no longer enough, scale horizontally. Deploy multiple FastAPI instances behind a load balancer. Since the state lives in Redis and PostgreSQL (not in the application), any instance can handle any request.
For infrastructure automation, see our guide on Terraform with AWS Bedrock. For the complete AI architecture, our production agent blueprint covers how FastAPI fits into the bigger picture.
FastAPI has native async support, automatic OpenAPI docs, Pydantic validation, and significantly higher throughput for I/O-bound workloads. Django and Flask work but require more effort to achieve the same performance. Read more about why Python dominates the AI stack.
Serverless has cold start issues and execution time limits that make it challenging for LLM workloads. See our analysis of deploying LLMs on AWS Lambda for the trade-offs.
Use Prometheus + Grafana for metrics (request latency, error rates, throughput). Add OpenTelemetry for distributed tracing across your FastAPI service, LLM calls, and vector database queries.
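Request latency can be captured at the ASGI layer. A pure-Python sketch (in production you would feed these observations into a `prometheus_client` Histogram rather than an in-memory dict; `LatencyMiddleware` is an illustrative name):

```python
import asyncio
import time

class LatencyMiddleware:
    # Minimal ASGI middleware recording per-path request latency.
    def __init__(self, app, sink):
        self.app = app
        self.sink = sink  # {path: [durations in seconds]}

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            self.sink.setdefault(scope["path"], []).append(
                time.perf_counter() - start
            )

async def demo():
    async def inner_app(scope, receive, send):  # stand-in for a FastAPI app
        await asyncio.sleep(0.01)
    sink = {}
    app = LatencyMiddleware(inner_app, sink)
    await app({"type": "http", "path": "/api/chat"}, None, None)
    return sink

sink = asyncio.run(demo())
```

Measuring in the `finally` block ensures failed requests are recorded too, which is exactly where error-rate and tail-latency problems show up first.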
We build and scale FastAPI backends for AI products. From architecture design to production deployment.
Get Scaling Help