
API Design That Actually Scales

Patterns from production systems, not textbooks.

Custom Development, APIs · Jan 15, 2025 · 14 min read

We could tell you to be consistent with your naming conventions and use proper HTTP status codes. You already know that. What we actually want to talk about is what happens when your API hits real traffic and things start breaking.

The stuff that matters at scale isn't in the tutorials. It's infrastructure decisions that determine your monthly bill. Caching strategies that turn a 5-second timeout into a 50ms response. Knowing when your database is about to become your bottleneck - ideally before it does.

And increasingly, it's about the fact that modern APIs don't exist alone. They orchestrate calls to AI providers, CRMs, payment processors, video generation services. Your reliability is now bounded by your least reliable dependency, and your costs are driven by services you don't control. That changes everything about how you architect.

The infrastructure decision nobody thinks about until the bill arrives

Before you write any code, you need to decide how requests reach your backend. This is the decision that determines your monthly bill and your scaling ceiling - and it's almost never discussed in tutorials.

On AWS, the choice typically comes down to API Gateway versus Application Load Balancer. Here's what the numbers actually look like:

  • API Gateway HTTP APIs: $1.00 per million requests. Built-in throttling, JWT validation, and request transformation. Hard limit of 10,000 requests per second (can request increase).
  • API Gateway REST APIs: $3.50 per million requests. Adds caching, request validation, and VTL transformations. Same 10k RPS limit.
  • Application Load Balancer: ~$15-20/month base plus LCU charges. No rate limiting or caching built-in, but virtually unlimited throughput (100,000+ RPS).

We've watched companies spend $3,000/month on API Gateway when an ALB would have cost $200. We've also seen the reverse - teams running ALBs with hand-rolled rate limiting code that took weeks to build, when API Gateway would have given them the same thing out of the box.

The break-even point against API Gateway's REST API pricing is roughly 5.3 million requests per month. Below that, API Gateway wins. Above that, ALB wins - sometimes dramatically. Do the math for your traffic before you commit.
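If you want to sanity-check your own numbers, the math is only a few lines. Here's a rough sketch in Python using the list prices above; it ignores ALB LCU charges, data transfer, and free tiers, so treat the crossover as approximate.

```python
# Rough monthly cost comparison: API Gateway vs. ALB, using the list prices above.
# Ignores ALB LCU charges, data transfer, and free tiers - ballpark only.

HTTP_API_PER_MILLION = 1.00   # API Gateway HTTP APIs
REST_API_PER_MILLION = 3.50   # API Gateway REST APIs
ALB_BASE_MONTHLY = 18.00      # ~$15-20/month base, before LCUs

def monthly_cost(requests_per_month: int) -> dict:
    millions = requests_per_month / 1_000_000
    return {
        "http_api": round(millions * HTTP_API_PER_MILLION, 2),
        "rest_api": round(millions * REST_API_PER_MILLION, 2),
        "alb_base": ALB_BASE_MONTHLY,
    }

for requests in (1_000_000, 5_300_000, 50_000_000):
    print(requests, monthly_cost(requests))
# At ~5.3M requests/month, REST API pricing (~$18.55) crosses the ALB base cost.
```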

Caching: where everyone says they do it and almost nobody does it right

Everyone knows caching improves performance. What surprises us is how few teams actually implement it correctly. The difference between a good caching strategy and a bad one isn't 10% - it's 10-100x in latency.

Redis with ElastiCache 7.1 delivers P99 latency under 1 millisecond. We've tested this across production workloads - it holds up. A single cluster handles 500+ million requests per second. The question isn't whether Redis is fast enough. It's whether you're actually using it, and using it correctly.

  • Cache-aside (lazy loading): Check cache first, fetch from database on miss, populate cache. Works for read-heavy workloads where stale data is acceptable. Simple to implement, but cache misses hit the database cold.
  • Write-through: Write to cache and database simultaneously. Guarantees consistency but adds latency to every write. Use when you can't tolerate stale reads.
  • Write-behind: Write to cache immediately, persist to database asynchronously. Maximum write throughput but risks data loss on cache failure. Only for data you can afford to lose.

Here's where most teams fail: cache invalidation. The safe pattern we recommend: delete cache entries instead of updating them (deletes are idempotent), always set a TTL as a safety net, and invalidate as close to the database write as possible. For complex systems, tag-based invalidation lets you group related entries and clear them together.

The most common caching bug we see: updating the cache entry instead of deleting it. When your update fails halfway through, you've got a cache that lies to your users. Delete and let the next read repopulate. It's slower but it's correct.
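Here's what that looks like in practice: a minimal cache-aside sketch with delete-on-write invalidation using redis-py. The database accessors (fetch_user_from_db, update_user_in_db) are hypothetical stand-ins, and the TTL is the safety net described above.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 300  # safety net even if an invalidation is missed

def get_user(user_id: str) -> dict:
    """Cache-aside read: try Redis first, fall back to the database on a miss."""
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    user = fetch_user_from_db(user_id)              # hypothetical DB accessor
    r.set(key, json.dumps(user), ex=CACHE_TTL_SECONDS)
    return user

def update_user(user_id: str, fields: dict) -> None:
    """Write path: update the database, then DELETE the cache entry.
    Deletes are idempotent; the next read repopulates the cache."""
    update_user_in_db(user_id, fields)              # hypothetical DB accessor
    r.delete(f"user:{user_id}")
```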

Rate limiting: skip this and watch your AI budget disappear in a weekend

We've watched teams skip rate limiting until their AI provider bills hit $10K in a weekend. One runaway client, one bug that retries infinitely, one malicious actor - and your margins are gone. Rate limiting isn't about being stingy with your API. It's about survival.

The algorithm you choose affects both accuracy and what happens at the edges:

  • Fixed window: Simple to implement but vulnerable to burst attacks at window boundaries. A client can make 2x your limit by timing requests at the edge of two windows. Fine for internal APIs, not for public ones.
  • Sliding window: True enforcement of "N requests in any X-second period." More accurate but requires tracking individual request timestamps. This is what we use for most production APIs.
  • Token bucket: Allows controlled bursts while maintaining average rate limits. Good for mobile apps and other naturally bursty traffic patterns.
  • Leaky bucket: Smooths output to a constant rate regardless of input bursts. Use when downstream systems need predictable load.

For distributed systems, we implement sliding window rate limiting in Redis with Lua scripts. The entire check-and-increment must be atomic - otherwise race conditions let traffic through. A sorted set with timestamps as scores handles the window trimming efficiently. If this sounds complicated, that's because it is. Get it from a library, not from scratch.
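For illustration only, here's a minimal sketch of that sorted-set approach with redis-py; key names and limits are placeholders, and a maintained library will handle the edge cases this glosses over.

```python
import time
import redis

r = redis.Redis()

# Atomic sliding-window check: trim old entries, count, and conditionally add.
SLIDING_WINDOW_LUA = """
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
if redis.call('ZCARD', key) < limit then
  redis.call('ZADD', key, now, now .. '-' .. math.random())
  redis.call('EXPIRE', key, math.ceil(window))
  return 1
end
return 0
"""
check_rate_limit = r.register_script(SLIDING_WINDOW_LUA)

def allow_request(api_key: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """True if this request fits within `limit` requests per `window_seconds`."""
    allowed = check_rate_limit(
        keys=[f"ratelimit:{api_key}"],
        args=[time.time(), window_seconds, limit],
    )
    return bool(allowed)
```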

Scope your rate limits by dimension: per-IP for anonymous traffic (first line of defense), per-API-key for authenticated requests (primary control), and per-user for consumer apps (tiered plans). Most APIs need all three. If you only implement one, you'll regret it.

Your database will be your bottleneck

This isn't a prediction. It's a pattern. Your database will become your bottleneck before anything else. The patterns that prevent this aren't complicated, but they require decisions upfront that are painful to change later.

The N+1 query problem kills more APIs than any other issue. You fetch a list of 100 items, then query related data for each one individually. That's 101 queries instead of 2. At scale, this turns a 50ms endpoint into a 5-second timeout. The DataLoader pattern batches and deduplicates these requests within a single request cycle - but you have to know you need it.
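The simplest fix doesn't need a library: batch the second lookup. A sketch against a hypothetical db handle that returns dict rows - the same idea DataLoader automates.

```python
# N+1 version: 1 query for the list + 1 query per item = 101 round trips.
# Batched version: 2 queries total, regardless of list size.

def get_orders_with_customers(db) -> list[dict]:
    orders = db.execute(
        "SELECT id, customer_id, total FROM orders LIMIT 100"
    ).fetchall()

    # One query for ALL related customers instead of one per order.
    customer_ids = list({o["customer_id"] for o in orders})
    customers = db.execute(
        "SELECT id, name, email FROM customers WHERE id = ANY(%s)",
        (customer_ids,),
    ).fetchall()
    by_id = {c["id"]: c for c in customers}

    return [{**o, "customer": by_id.get(o["customer_id"])} for o in orders]
```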

  • Connection pooling: PostgreSQL uses ~9.5MB per connection (not 256KB like MySQL). Without pooling, 100 concurrent requests can exhaust server memory. Use PgBouncer or RDS Proxy. A reasonable sizing rule: the lesser of (CPU cores x 2) or (10% of RAM divided by ~9.5MB per connection). We've seen teams debug memory issues for weeks before realizing their connection count was the problem. A pool sized this way is sketched after this list.
  • Read replicas: Route reads to replicas, writes to primary. Target 75% of traffic to replicas. Aurora replication lag averages ~20ms - acceptable for most read-after-write scenarios. Tag queries in your ORM so routing is automatic.
  • Polyglot persistence: Not everything belongs in Postgres. Session data in Redis (TTL-based expiry). Full-text search in Elasticsearch. Time-series metrics in TimescaleDB. User-generated content in S3 with DynamoDB metadata. Use the right store for each data shape.
  • Materialized views for API shapes: When your API response joins 5 tables, consider materializing it. Postgres materialized views refresh on-demand. For faster updates, use triggers to maintain denormalized tables incrementally. Trade write complexity for read speed.
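If a server-side pooler like PgBouncer or RDS Proxy isn't an option, an application-level pool with an explicit ceiling is the minimum. A sketch with psycopg2's ThreadedConnectionPool, sized by the rule above; the connection details and RAM figure are placeholders.

```python
import os
from psycopg2.pool import ThreadedConnectionPool

CPU_CORES = os.cpu_count() or 4
RAM_GB = 16                    # placeholder: set to your instance size
MB_PER_CONNECTION = 9.5        # rough PostgreSQL per-connection overhead

# Cap the pool at the lesser of (cores x 2) or what ~10% of RAM allows.
memory_cap = int((RAM_GB * 0.10 * 1024) / MB_PER_CONNECTION)
max_connections = min(CPU_CORES * 2, memory_cap)

pool = ThreadedConnectionPool(
    minconn=2,
    maxconn=max_connections,
    host="db.example.internal",  # placeholder connection details
    dbname="app",
    user="app",
    password=os.environ.get("DB_PASSWORD", ""),
)

def run_query(sql: str, params: tuple = ()):
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        pool.putconn(conn)       # always hand the connection back
```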

Schema design for APIs requires thinking about your query patterns upfront. If you'll filter by status and sort by date on every list endpoint, that compound index should exist before you ship. If your JSON columns are frequently queried, consider JSONB with GIN indexes. The best index strategy comes from your API contract, not your entity relationships.

Selective denormalization is not a sin. When P95/P99 latencies are dominated by joins across 3+ tables, store the pre-joined shape directly. We use the mantra: "Normalize until it hurts, denormalize until it works."

For write-heavy APIs (event ingestion, logging, IoT), the CQRS pattern separates read and write models. Writes go to an append-only event store optimized for inserts. A projection service builds read-optimized views asynchronously. This eliminates contention between read and write workloads at the database level.

Monitoring: because "it seems fine" isn't a monitoring strategy

Average latency is a vanity metric. If your average is 100ms but your P99 is 3 seconds, 1% of your users are having a terrible experience. At a million requests per day, that's 10,000 painfully slow requests hitting real users. And you'll never know from your average.

  • P50 (median): The typical user experience. Target under 100ms for web APIs. Detects broad regressions.
  • P95: Your primary SLA metric. Target under 200ms. This is what 19 out of 20 users experience.
  • P99: Exposes architectural bottlenecks. Target under 500ms. If P99 is 10x your P50, you have a tail latency problem worth investigating.

For error rate alerting, we use Google's SRE approach: page immediately if you burn 2% of your error budget in 1 hour, create a ticket if you burn 10% in 3 days. This balances urgency with alert fatigue. Nothing kills an on-call rotation faster than alerts that cry wolf.
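Translated into burn-rate thresholds (assuming the common 30-day SLO window), those two conditions work out like this:

```python
# Error-budget burn rate = fraction of budget consumed / fraction of window elapsed.
# Assumes a 30-day SLO window, the usual default.

SLO_WINDOW_HOURS = 30 * 24  # 720 hours

def burn_rate(budget_fraction: float, elapsed_hours: float) -> float:
    return budget_fraction / (elapsed_hours / SLO_WINDOW_HOURS)

page_threshold = burn_rate(0.02, 1)          # 2% of budget in 1 hour  -> 14.4
ticket_threshold = burn_rate(0.10, 3 * 24)   # 10% of budget in 3 days -> 1.0

print(page_threshold, ticket_threshold)
```

In other words, page when errors are burning budget about 14x faster than sustainable, and open a ticket when they're burning at roughly the sustainable rate.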

OpenTelemetry has become the standard for distributed tracing. Start with auto-instrumentation, add manual spans for business-critical paths. Correlate logs with traces using SPAN_ID and TRACE_ID. The goal isn't to trace everything - it's to trace what matters when something breaks.

What happens when your API goes down

Your API is only as reliable as the clients consuming it. If clients hammer you during outages, they make recovery harder. We've seen APIs stay down an extra hour because clients kept retrying aggressively the moment the server came back up.

Exponential backoff with jitter prevents thundering herds. When a thousand clients retry at exactly the same time, they create the same spike that caused the failure. Adding random jitter (0-500ms) spreads retries across time. A minimal retry sketch follows the list below.

  • Circuit breakers: After 5-10 consecutive failures, stop trying for 30-60 seconds. Test with a single request (half-open state) before resuming full traffic. Prevents cascade failures.
  • Rate limit headers: Return X-RateLimit-Remaining so clients can slow down before hitting 429. Proactive throttling at 10% remaining prevents the cliff.
  • Retry-After header: Tell clients exactly when to retry. Removes guesswork and prevents premature retries that extend outages.
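Here's a minimal retry sketch that combines exponential backoff, 0-500ms jitter, and the Retry-After header, using the requests library; the URL and limits are placeholders.

```python
import random
import time
import requests

def call_with_retries(url: str, max_attempts: int = 5) -> requests.Response:
    """GET with exponential backoff, 0-500ms jitter, and Retry-After support."""
    for attempt in range(max_attempts):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429 and resp.status_code < 500:
            return resp                 # success or a non-retryable client error

        retry_after = resp.headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)  # assumes the seconds form, not an HTTP date
        else:
            delay = (2 ** attempt) + random.uniform(0, 0.5)  # backoff + jitter
        time.sleep(delay)
    raise RuntimeError(f"{url} still failing after {max_attempts} attempts")
```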

Document your rate limits, retry expectations, and error formats. The best API documentation answers "what do I do when this fails?" not just "how do I call this endpoint?" If your docs don't cover failure modes, your clients will guess - usually wrong.

The reality of third-party APIs: you're at their mercy

Most production APIs aren't standalone systems. They orchestrate calls to Salesforce, Stripe, AI providers, and dozens of other services. Your API's reliability is now bounded by your least reliable dependency - and you don't control any of them.

The patterns you use for your own rate limiting don't apply when you're the client. Third-party rate limits are non-negotiable, poorly documented, and often inconsistent. AI provider rate limits vary by model, account tier, and time of day. Salesforce's daily API limits are calculated over a rolling 24-hour window, not a calendar day. HubSpot throttles at 100 requests per 10 seconds, then blocks for 11 seconds. Nobody tells you this until you hit the wall.

  • Proactive rate tracking: Don't wait for 429s. Track your consumption against known limits. Most AI providers return rate limit headers - use them. Budget 80% of your limit and queue the rest.
  • Per-vendor circuit breakers: A Salesforce outage shouldn't break your AI integration. Isolate circuit breakers by vendor. When Salesforce trips, your CRM sync pauses but AI features continue. We learned this the hard way. A minimal breaker is sketched after this list.
  • Response caching with semantic keys: Cache expensive API responses by semantic meaning, not just URL. Two different customer questions might map to the same embedding vector - cache the embedding, not the endpoint call.
  • Vendor abstraction layers: Wrap third-party clients in your own interface. When HeyGen changes their API (and they will), you update one adapter, not fifty callsites. This also enables easy mocking in tests.
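A minimal per-vendor breaker looks something like the sketch below. Thresholds and vendor names are illustrative, and a maintained library covers the concurrency and observability details this skips.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures, half-open again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: skipping call")
            # Cooldown elapsed: half-open, let this one request test the waters.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.opened_at = None
        return result

# One breaker per vendor, so a Salesforce outage never trips the AI path.
breakers = {
    "salesforce": CircuitBreaker(),
    "openai": CircuitBreaker(),
    "stripe": CircuitBreaker(),
}
# breakers["salesforce"].call(sync_contact, contact)   # hypothetical usage
```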

Third-party APIs deprecate endpoints constantly. Twilio sunset Authy. Salesforce removes API versions after ~3 years. Build deprecation monitoring into your CI - flag any API calls to endpoints with announced sunset dates. Future you will thank present you.

For multi-vendor orchestration, the saga pattern prevents partial failures. If creating a customer requires Salesforce (CRM), Stripe (billing), and SendGrid (welcome email), you need compensating actions. When Stripe fails after Salesforce succeeds, you need to either retry Stripe or roll back the Salesforce contact. Queued sagas with explicit compensation handlers are the only reliable approach.
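A minimal saga runner captures the idea: each step pairs an action with a compensating action, and a failure unwinds whatever already succeeded. The vendor-specific functions here are hypothetical.

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables.
    On failure, run compensations for completed steps in reverse order,
    then re-raise so the caller can queue a retry."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                try:
                    undo()
                except Exception:
                    pass  # log and alert in real code; don't mask the original error
            raise

# Hypothetical customer-creation saga across three vendors:
# run_saga([
#     (create_salesforce_contact, delete_salesforce_contact),
#     (create_stripe_customer,    delete_stripe_customer),
#     (send_welcome_email,        lambda: None),   # emails can't be un-sent
# ])
```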

When synchronous breaks: async processing

Not everything fits in a request-response cycle. AI video generation takes 2-5 minutes. Batch CRM imports process thousands of records. Report generation queries terabytes. We've seen teams try to shove all of this into synchronous endpoints. It never works.

The decision framework is simple: if P95 latency exceeds your frontend timeout (typically 30 seconds), it needs to be async. If it can fail partially and retry matters, it needs to be async. If users benefit from progress updates, it needs to be async.

  • Queue-based processing (SQS/RabbitMQ): Best for fire-and-forget tasks or when you need guaranteed delivery. SQS standard queues cost $0.40 per million requests with at-least-once delivery. FIFO queues add ordering guarantees at $0.50 per million. Dead letter queues catch failures after configurable retries.
  • Job status pattern: Return a job ID immediately, let clients poll for status. Store job state in Redis with TTL. Expose GET /jobs/{id} with states: queued, processing, completed, failed. Include progress percentage for long operations.
  • Webhooks for completion: Polling is inefficient at scale. Let clients register webhook URLs and POST results when done. Include HMAC signatures for verification (SHA-256 with shared secret; a sketch follows this list). Retry webhooks with exponential backoff up to 3 times over 24 hours.
  • Server-Sent Events (SSE): For real-time progress without WebSocket complexity. Works through HTTP proxies, auto-reconnects, one-way push from server. Perfect for AI streaming responses - users see tokens as they're generated rather than waiting for completion.
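The HMAC part is small enough to show. A sketch of signing and verifying a webhook payload; the header name and secret handling are illustrative.

```python
import hashlib
import hmac

WEBHOOK_SECRET = b"shared-secret-from-your-dashboard"   # placeholder

def sign_payload(body: bytes) -> str:
    """Producer side: send this as a signature header (e.g. X-Signature)."""
    return hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()

def verify_signature(body: bytes, received_signature: str) -> bool:
    """Consumer side: constant-time comparison prevents timing attacks."""
    expected = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_signature)
```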

For AI integrations specifically, the token-streaming pattern has become standard: return tokens as they're generated instead of waiting for the full completion. Users see the response start in ~200ms rather than after 5-15 seconds. This requires SSE or WebSockets, but the UX improvement is dramatic.
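A minimal SSE sketch with FastAPI shows the shape of it; the token generator is a stand-in for a streaming AI client.

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    """Stand-in for a streaming AI client; yields tokens as SSE events."""
    for token in ["Hello", ", ", "world", "!"]:    # placeholder tokens
        await asyncio.sleep(0.05)
        yield f"data: {token}\n\n"                 # SSE event framing
    yield "data: [DONE]\n\n"

@app.get("/chat/stream")
async def stream_chat(prompt: str):
    return StreamingResponse(generate_tokens(prompt), media_type="text/event-stream")
```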

Job cleanup matters. A HeyGen video generation job produces a 500MB file. Store the result URL, not the file, in your job status. Set aggressive TTLs on job metadata (24-48 hours) and implement archival for compliance needs. Orphaned job data is a common source of runaway storage costs - we've seen S3 bills double from forgotten temp files.

AI costs will eat you alive if you let them

AI APIs changed the economics of API design. A single large language model call can cost $0.01-0.15 depending on provider and model. HeyGen video generation runs $0.10-1.00+ per minute. At scale, a poorly optimized AI feature costs more than your entire infrastructure.

The math is unforgiving: 100,000 users making 10 LLM calls per day at $0.05 average is $50,000 a day in AI costs alone. Without caching and optimization, AI features don't scale - they bankrupt you. We've seen startups burn through their seed round on AI bills because nobody did the math.

  • Semantic caching: Cache AI responses by meaning, not exact input. Use embedding similarity (cosine > 0.95) to match semantically equivalent queries. "What's the weather in NYC?" and "NYC weather today" can share a cached response. This cuts redundant AI calls by 30-60%. A minimal version is sketched after this list.
  • Model tiering: Not every request needs your most expensive model. Route simple queries to smaller, cheaper models. Use classification to determine complexity, then escalate. Most chatbot queries are simple - save the expensive model for complex reasoning.
  • Request batching: Many AI providers offer batch APIs at 50% discount with 24-hour turnaround. Queue non-urgent requests (nightly report generation, bulk content analysis) and batch them. Same quality, half the cost.
  • Prompt optimization: Tokens cost money. A 2,000 token system prompt on every request adds up. Cache system prompts in context. Use prompt compression techniques. Strip unnecessary formatting from inputs. Measure cost per feature, not just per request.
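A minimal semantic-cache sketch: embed() and call_llm() are hypothetical stand-ins for your embedding and completion calls, and the linear scan would be a vector index in production.

```python
import math

_cache: list[tuple[list[float], str]] = []   # (embedding, cached response)
SIMILARITY_THRESHOLD = 0.95                  # cosine threshold from the list above

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def cached_completion(query: str) -> str:
    vec = embed(query)                        # hypothetical embedding API call
    for stored_vec, response in _cache:
        if cosine(vec, stored_vec) >= SIMILARITY_THRESHOLD:
            return response                   # semantically equivalent: reuse it
    response = call_llm(query)                # hypothetical (expensive) LLM call
    _cache.append((vec, response))
    return response
```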

The build-vs-buy calculation shifts with AI. Self-hosting open models on dedicated GPU instances costs $15,000-30,000/month but handles unlimited requests. At roughly 300,000-500,000 API calls per month, self-hosting breaks even. Below that, API costs are cheaper. Above that, dedicated infrastructure wins. OpenRouter and AWS Bedrock let you experiment with multiple providers before committing.

Track AI costs per feature, per user, and per customer segment. Some features have 10x the cost of others. Some customers generate 100x the AI spend. Without granular cost attribution, you can't price appropriately or identify optimization targets. We've helped teams cut AI costs by 70% just by finding the one feature that was hemorrhaging money.

For third-party integrations beyond AI, the same principles apply. Salesforce API calls are limited and precious - cache CRM data aggressively (1-5 minute TTL for frequently accessed records). Twilio charges per message - batch notifications into digests. Every API call has a cost, even if it's rate limits rather than dollars.

Building an API that needs to scale?

We've made these mistakes so you don't have to. Let's talk about your architecture before you discover the problems in production.

Start a Conversation

or email partner@greenfieldlabsai.com
