
Building Production Systems That Scale

How we choose technology stacks, integrate external services, and ship applications that hold up under real-world load.

Jan 17, 2025 · 9 min read

When someone asks "what stack should we use?", they deserve better than "React and Next.js" followed by a shrug. Technology choices compound. The framework you pick affects hiring, maintenance costs, and what you can build five years from now.

We spend a lot of time staying current on what actually works in production, not just what has momentum on Twitter. And we've developed some strong opinions about when the defaults are wrong.

The stack conversation most consultants skip

When a client comes to us with a project, we don't start with "we use React." We start with questions: What does your team know? What's your budget for infrastructure? How important is time-to-market versus long-term flexibility? What's the realistic traffic profile?

These conversations reveal constraints that matter more than framework benchmarks. A Laravel shop building a new product should probably use Laravel with Livewire, not pivot to a JavaScript ecosystem they'll struggle to maintain. A team obsessed with type safety and future flexibility might benefit from newer options that avoid vendor lock-in.

We've watched teams waste six months learning a stack that wasn't right for their constraints. The "modern" choice isn't always the right choice.

The best stack is the one your team can build, deploy, debug at 2am, and hire for. Technical elegance means nothing if you can't ship or sustain it.

When the defaults are wrong

We actively track what works in production, including options that challenge conventional wisdom. The "just use Next.js" answer is lazy. Here's when we recommend something else:

  • Content-heavy sites: Astro delivers 40% faster load times with 90% less JavaScript than React-based alternatives. For marketing sites, documentation, and content platforms, shipping minimal JS to the browser isn't a constraint - it's an advantage. We've built sites that load in under a second on 3G.
  • Teams with PHP experience: Laravel with Livewire provides 80% of what most apps need at 50% of the complexity. Infrastructure costs run 40% lower. If your team knows PHP, don't migrate to JavaScript for its own sake. That's a trap.
  • Type safety without vendor lock-in: TanStack Start has reached release candidate status with full-stack type safety, server functions, and 30-35% smaller bundles than Next.js. It deploys anywhere, not just Vercel. If you care about not being locked into one platform, this is worth evaluating.
  • Global low-latency requirements: Cloudflare Workers offer 0ms cold starts and zero egress costs with R2 storage. We've seen 80% cost reductions versus AWS Lambda for appropriate workloads. The catch: stricter execution limits. Know your constraints. (A minimal Worker sketch follows this list.)
  • Complex interactive applications: Next.js remains capable for sophisticated SPAs, but teams should be aware of recent security concerns including CVE-2025-55182 (CVSS 10.0 remote code execution) and middleware authorization bypass issues. We still use it when appropriate - but we patch aggressively.
  • Next.js on AWS (our preferred approach): When the rest of your stack lives on AWS - Bedrock for AI, DynamoDB, S3, Lambda - deploying Next.js to Amplify Gen 2 is the move. The backend integration is seamless: type-safe data access, auth that just works with Cognito, and infrastructure-as-code that deploys alongside your frontend. We've shipped a lot of apps this way. The DX is excellent and you're not paying Vercel's markup while your data already lives in AWS.
  • Server-rendered interactivity: HTMX can reduce codebase size by up to 67% for applications where most interactivity is form submissions and partial page updates. Not every app needs a JavaScript framework. Sometimes the boring choice is the right choice.
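
To make the Workers option concrete, here's a minimal sketch of an edge handler serving content straight out of R2. It assumes the @cloudflare/workers-types definitions and a bucket binding named ASSETS configured in wrangler.toml; the binding name and cache TTL are ours, not a prescribed setup.

```typescript
// Minimal Cloudflare Worker serving objects from an R2 bucket.
// Assumes a binding named ASSETS in wrangler.toml and types from
// @cloudflare/workers-types.

export interface Env {
  ASSETS: R2Bucket;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const key = new URL(request.url).pathname.slice(1) || "index.html";

    // R2 reads carry no egress fee; the Worker itself has no cold start.
    const object = await env.ASSETS.get(key);
    if (object === null) {
      return new Response("Not found", { status: 404 });
    }

    return new Response(object.body, {
      headers: {
        "content-type":
          object.httpMetadata?.contentType ?? "application/octet-stream",
        // Let the edge cache absorb repeat reads for an hour.
        "cache-control": "public, max-age=3600",
      },
    });
  },
};
```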

Framework popularity is not a proxy for suitability. We evaluate options based on your specific constraints, not industry defaults. The "everyone uses X" argument has cost teams millions in technical debt.

Your app is mostly glue code now

Modern production systems rarely exist in isolation. A typical application we build integrates with AI APIs for content generation, a CRM for customer data, a payment processor, external analytics, and several domain-specific services. Your application becomes an orchestration layer - and that changes everything.

This means your code isn't the whole system. It's the glue between services, each with their own latency profiles, failure modes, and cost structures. When something breaks, it's usually at a boundary you don't control.

  • AI API integration: Services like HeyGen for video generation or language models for text operate on different timescales than user requests. A video generation call might take 90 seconds. You can't block a web request for that. This forces async patterns whether you planned for them or not.
  • CRM and customer data: Salesforce, HubSpot, and similar platforms become sources of truth for customer relationships. Your application needs to sync, not duplicate. Duplicate data will diverge - it's not if, it's when.
  • Payment processing: Stripe, Square, and similar services handle complexity you don't want to own. But their webhooks, idempotency requirements, and edge cases demand careful integration. We've debugged payment issues that took weeks to surface.
  • External data sources: APIs for market data, weather, shipping rates - they all have rate limits, outages, and stale data. Your system needs to handle all three gracefully (see the sketch after this list). The happy path is easy. The edge cases are where production systems die.
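
The external-data bullet is where we see the most hand-waving, so here's a hedged sketch of what "handle all three gracefully" can look like: a fetch wrapper with a timeout, backoff retries for rate limits and outages, and a last-known-good fallback when everything fails. The function name and in-memory cache are illustrative; in production that cache would live in Redis or your database.

```typescript
// Hedged sketch of wrapping an external data API (rates, weather, shipping)
// so that rate limits, outages, and stale data each have an explicit path.

type Cached<T> = { value: T; fetchedAt: number };
const lastGood = new Map<string, Cached<unknown>>();

export async function fetchWithFallback<T>(
  key: string,
  url: string,
  { retries = 2, timeoutMs = 5_000 } = {}
): Promise<{ data: T; stale: boolean }> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });

      if (res.ok) {
        const data = (await res.json()) as T;
        lastGood.set(key, { value: data, fetchedAt: Date.now() });
        return { data, stale: false };
      }
      if (res.status !== 429 && res.status < 500) {
        break; // Client errors other than rate limiting won't improve on retry.
      }
      // Rate limited or upstream trouble: back off and try again.
      await new Promise((r) => setTimeout(r, 2 ** attempt * 500));
    } catch {
      // Timeout or network failure: loop around for another attempt.
    }
  }

  // Every attempt failed: serve the last known-good value if we have one.
  const cached = lastGood.get(key) as Cached<T> | undefined;
  if (cached) return { data: cached.value, stale: true };
  throw new Error(`No data available for ${key}`);
}
```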

Async processing: where most production bugs hide

Not everything is request-response. When you integrate with services that take seconds or minutes to complete, or when you need to process data in bulk, synchronous patterns break down. And this is where we see the most production bugs.

  • When to use queues: Any operation that might take more than a few seconds, might fail and need retry, or needs to happen eventually but not immediately. Email sending, PDF generation, AI processing, data imports - they all fit this pattern. If you're not using queues for these, you're setting yourself up for failure.
  • Job status tracking: Users need to know what's happening. A simple status table with job ID, state, progress percentage, and error messages handles most cases. Expose it via polling or WebSocket based on your latency requirements (a minimal sketch follows this list).
  • Webhooks versus polling: If the external service supports webhooks, use them. You get faster notification and lower load. But always have a polling fallback - webhooks fail silently more often than you'd think.
  • Long-running operations: Video generation with HeyGen might take 90 seconds. Large language model processing can take 30 seconds. Design your UX around these realities rather than pretending they don't exist. Users can handle "this will take a minute" - they can't handle a spinner that spins forever.
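
Here's a minimal sketch of the queue-plus-status pattern described above, with an in-memory map standing in for the status table and the queue consumer (in production: a real table plus BullMQ, SQS, or similar). The function names and the simulated render are ours, purely for illustration.

```typescript
import { randomUUID } from "node:crypto";

// Sketch of the queue-plus-status pattern, with an in-memory map standing in
// for the status table and the queue consumer.

type JobState = "queued" | "running" | "succeeded" | "failed";

interface JobStatus {
  id: string;
  state: JobState;
  progress: number; // 0-100
  error?: string;
  updatedAt: Date;
}

const statusTable = new Map<string, JobStatus>();

// The web request only records the job and hands it off - it never blocks on
// the 90-second render itself.
export function requestVideoRender(script: string): string {
  const id = randomUUID();
  statusTable.set(id, { id, state: "queued", progress: 0, updatedAt: new Date() });

  // In production this would be queue.add(...) picked up by a worker process.
  void runRender(id, script);
  return id; // the client polls getJobStatus(id) or subscribes via WebSocket
}

async function runRender(id: string, script: string): Promise<void> {
  const update = (patch: Partial<JobStatus>) =>
    statusTable.set(id, { ...statusTable.get(id)!, ...patch, updatedAt: new Date() });

  update({ state: "running" });
  try {
    // The real call (e.g. a HeyGen-style render of `script`) might take ~90s.
    await new Promise((resolve) => setTimeout(resolve, 1_000));
    update({ state: "succeeded", progress: 100 });
  } catch (err) {
    update({ state: "failed", error: (err as Error).message });
  }
}

// Polling endpoint: the client asks "what's happening?" until a terminal state.
export function getJobStatus(id: string): JobStatus | undefined {
  return statusTable.get(id);
}
```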

The async boundary is where most production bugs hide. Race conditions, duplicate processing, lost jobs - they all live here. Invest in observability for your queue infrastructure before you need it.

Infrastructure costs: the conversation nobody wants to have

Infrastructure decisions affect your monthly bill in ways that compound over time. We've seen teams pay ten times more than necessary because nobody thought about cost when choosing architecture. And once you're in production, moving to cheaper infrastructure is technically possible but rarely happens.

  • Cloudflare versus AWS: Cloudflare Workers have no cold starts and R2 storage has zero egress fees. For read-heavy workloads with global users, this can cut infrastructure costs by 70-80% compared to Lambda plus S3. The catch: Workers have stricter execution limits. Know your constraints before you commit.
  • AI API costs at scale: Large language model calls at $0.01-0.15 per request add up fast. A feature that calls the API on every page load can cost more than your entire hosting bill. Cache aggressively, batch when possible, and use cheaper models for simple tasks. We've helped teams cut AI costs by 70% just by adding semantic caching.
  • Database costs: Managed PostgreSQL on AWS RDS costs significantly more than the same database on Railway or Supabase for many workloads. Evaluate based on your actual requirements, not brand familiarity. "We use AWS for everything" is not a strategy - it's inertia.
  • When to cache expensive operations: Any external API call that returns the same result for the same input should probably be cached. Any computation that takes more than 100ms and runs frequently should probably be cached (a minimal sketch follows this list). Cache invalidation is hard, but not caching is often more expensive.
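
As an illustration of the caching guidance above, here's a minimal memoization wrapper with a TTL. The `cached` helper and the stand-in `summarize` call are ours; in a real system the store would be Redis (or semantic caching for LLM calls) rather than an in-process Map.

```typescript
// A minimal TTL cache for expensive calls: same input, same result, so don't
// pay for the API call (or the >100ms computation) twice.

type Entry<T> = { value: T; expiresAt: number };

export function cached<Args extends unknown[], T>(
  fn: (...args: Args) => Promise<T>,
  ttlMs: number
): (...args: Args) => Promise<T> {
  const store = new Map<string, Entry<T>>();

  return async (...args: Args): Promise<T> => {
    const key = JSON.stringify(args);
    const hit = store.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;

    const value = await fn(...args);
    store.set(key, { value, expiresAt: Date.now() + ttlMs });
    return value;
  };
}

// Example: an LLM summary that would otherwise be regenerated on every page load.
const summarize = async (text: string) => `summary of ${text.length} chars`;
export const cachedSummarize = cached(summarize, 60 * 60 * 1000); // 1-hour TTL
```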

We review infrastructure costs as part of architecture design, not as an afterthought when the bill arrives. The time to optimize is before you're locked into a platform.

The last 20% takes 80% of the effort

Shipping a production system involves more than getting features working locally. The last 20% of reliability work often takes 80% of the effort - but it's what separates applications that work from applications that keep working.

  • Observability across services: When a user reports something is slow, you need to trace a request across your application, your database, and three external APIs to find the bottleneck. Structured logging, distributed tracing, and centralized error tracking are not optional. We've debugged issues that would have taken days without proper tracing.
  • Error handling that spans boundaries: When Stripe returns a transient error, what happens? When your AI API times out, does the job retry or fail silently? Every external boundary needs explicit error handling, not just the happy path. The happy path is maybe 60% of your code. The other 40% is what happens when things go wrong.
  • Deployment and reliability: Zero-downtime deployments, database migrations that don't lock tables, rollback capability, and health checks are table stakes (a health-check sketch follows this list). If deploying scares you, something is wrong with your process. Deploy should be boring.
  • Security posture: Authentication, authorization, input validation, and dependency updates are ongoing work, not one-time setup. We track CVEs for our dependencies and have upgrade paths ready. The Next.js CVE mentioned earlier? We had patches deployed within hours of disclosure.
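
To give one concrete example of the table stakes above, here's a hedged sketch of a health check that reports each dependency separately, so a flaky external API degrades the app rather than failing it outright. The probe targets and the criticality rules are assumptions, not a prescription.

```typescript
// Hedged sketch of a health check that reports each dependency separately, so
// a flaky external API shows up as "degraded" instead of taking the whole app
// out of rotation. The probes and criticality rules here are assumptions.

interface HealthReport {
  status: "ok" | "degraded" | "down";
  checks: Record<string, { ok: boolean; latencyMs: number; error?: string }>;
}

async function probe(name: string, fn: () => Promise<void>) {
  const start = Date.now();
  try {
    await fn();
    return [name, { ok: true, latencyMs: Date.now() - start }] as const;
  } catch (err) {
    return [
      name,
      { ok: false, latencyMs: Date.now() - start, error: (err as Error).message },
    ] as const;
  }
}

export async function healthCheck(): Promise<HealthReport> {
  const checks = Object.fromEntries(
    await Promise.all([
      probe("database", async () => {
        // e.g. await db.query("select 1")
      }),
      probe("payments", async () => {
        // e.g. a lightweight ping against the payment provider's API
      }),
    ])
  );

  const failures = Object.values(checks).filter((c) => !c.ok).length;
  // The database is critical; an external API being down only degrades us.
  const status = checks.database.ok ? (failures === 0 ? "ok" : "degraded") : "down";
  return { status, checks };
}
```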

Production readiness isn't a checkbox. It's a continuous practice of identifying failure modes and building resilience against them.
