
Building Integrations That Don't Break

A practical guide to system integration patterns, error handling, and the decisions that determine whether your integrations help or hurt.

Jan 15, 2025 · 6 min read

Every company eventually becomes an integration company. Your CRM talks to your billing system. Your e-commerce platform syncs with your warehouse. Your support tool pulls data from everywhere.

Most of these integrations are fragile. They break silently, lose data, and create more problems than they solve. We've inherited enough of these disasters to know the patterns that fail - and the patterns that don't.

The pattern decision that shapes everything

Before writing any code, you need to decide how data will flow between systems. This decision has cascading effects on reliability, latency, and complexity. Get it wrong and you'll be living with that decision for years.

  • Real-time webhooks: System A pushes to System B immediately when data changes. Low latency, but requires B to be available. Good for: notifications, critical updates. Bad for: systems with unreliable uptime.
  • Batch synchronization: Run a job every N minutes/hours to sync data. Higher latency, but more resilient. Good for: reports, analytics, non-urgent syncs. This is often the right choice even when it feels "less elegant."
  • Event-driven (via queue): System A publishes events. System B consumes them when ready. Decoupled and reliable, but more complex. Good for: high-volume, mission-critical flows. Worth the complexity when you need it.
  • API polling: System B periodically asks System A for updates. Simple, but inefficient and adds latency. Good for: when webhooks aren't available. Don't be ashamed of polling - sometimes it's the pragmatic choice.

There's no universally "best" pattern. The right choice depends on your volume, latency requirements, system capabilities, and how important it is that data never gets lost. We've built all four - the "right" answer changes with context.
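As a concrete illustration of the "unglamorous" polling option, here's a minimal sketch of cursor-based incremental polling. The API URL, the updated_since parameter, and the response fields are hypothetical stand-ins, not any specific vendor's API; in production the cursor would live in a durable store rather than a module-level variable.

```python
# A minimal sketch of cursor-based API polling against a hypothetical
# source system. Cursor storage and the destination write are stubbed out.
import time
import requests

API_BASE = "https://api.example-crm.com"   # hypothetical source system
POLL_INTERVAL = 300                        # seconds; tune to your latency needs
_cursor = "1970-01-01T00:00:00Z"           # persist this durably in real use


def upsert_into_destination(record: dict) -> None:
    # Stand-in for an idempotent write to the target system.
    print("upserting", record.get("id"))


def sync_once() -> None:
    global _cursor
    # Ask only for records changed since the last successful sync.
    resp = requests.get(
        f"{API_BASE}/contacts",
        params={"updated_since": _cursor},
        timeout=30,
    )
    resp.raise_for_status()
    payload = resp.json()
    for record in payload["records"]:
        upsert_into_destination(record)
    # Advance the cursor only after every record has been written,
    # so a failed run is retried from the same point.
    _cursor = payload["server_time"]


if __name__ == "__main__":
    while True:
        sync_once()
        time.sleep(POLL_INTERVAL)
```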

Failures will happen. Plan for them.

Integrations fail. Networks go down. APIs rate limit. Services have outages. The question isn't whether failures happen - it's whether your integration recovers or loses data.

  • Retry with backoff: Don't retry immediately. Use exponential backoff (1s, 2s, 4s, 8s...) to avoid overwhelming a struggling system, as sketched after this list. We've seen integrations make outages worse by hammering systems that were trying to recover.
  • Dead letter queues: When retries fail, don't drop the data. Put it somewhere you can investigate and reprocess later. Without this, you're just hoping nothing ever fails.
  • Circuit breakers: If a system is consistently failing, stop hitting it. Resume automatically when it recovers. This is basic reliability engineering.
  • Idempotency: Processing the same message twice should have the same result as processing it once. This is critical for retry safety. If your integration isn't idempotent, you don't have an integration - you have a time bomb.
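Here's a minimal sketch of retry-with-backoff and idempotency working together. The deliver() call, the in-memory key set, and the dead-letter list are stand-ins for your real transport, a durable store (for example a unique database constraint), and an actual dead letter queue.

```python
# A minimal sketch: exponential backoff with jitter, an idempotency key so a
# retried message can't be applied twice, and a dead-letter fallback.
import random
import time

MAX_ATTEMPTS = 5
processed_keys: set[str] = set()   # use a durable store in production
dead_letter: list[dict] = []       # stand-in for a real dead letter queue


def deliver(message: dict) -> None:
    # Stand-in for the real outbound call (HTTP POST, queue publish, etc.).
    raise ConnectionError("downstream unavailable")


def process(message: dict) -> None:
    key = message["idempotency_key"]
    if key in processed_keys:
        return                      # already handled; retries are safe

    for attempt in range(MAX_ATTEMPTS):
        try:
            deliver(message)
            processed_keys.add(key)
            return
        except ConnectionError:
            # Backoff of 1s, 2s, 4s, 8s... plus jitter, so a struggling
            # downstream isn't hammered by synchronized retries.
            time.sleep(2 ** attempt + random.random())

    # Retries exhausted: park the message for inspection, don't drop it.
    dead_letter.append(message)
```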

Data transformation: where integrations go to die

Data rarely maps cleanly from one system to another. Field names differ. Formats vary. What's required in System A is optional in System B. This is where most integrations break.

The most common integration bug: assuming data will be in the format you expect. That optional field? It'll be null in production. That date field? Someone will send an invalid format. Validate everything. Trust nothing.
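A minimal sketch of what "validate everything" looks like in practice. The field names and fallback rules here are hypothetical; the point is that every assumption ("the date is ISO 8601", "email is present") gets checked rather than trusted.

```python
# Defensive parsing of an incoming payload with explicit failure modes.
from datetime import datetime


class ValidationError(Exception):
    pass


def parse_signup(payload: dict) -> dict:
    email = payload.get("email")
    if not email or "@" not in email:
        raise ValidationError(f"missing or malformed email: {email!r}")

    raw_date = payload.get("signed_up_at")
    try:
        signed_up_at = datetime.fromisoformat(raw_date) if raw_date else None
    except (TypeError, ValueError):
        raise ValidationError(f"unparseable signed_up_at: {raw_date!r}")

    # "Optional" really means optional: decide the fallback explicitly.
    plan = payload.get("plan") or "free"

    return {"email": email, "signed_up_at": signed_up_at, "plan": plan}
```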

  • Map explicitly: Document every field mapping. Don't rely on "same names means same meaning." We've debugged issues where "status" in one system meant something completely different in another.
  • Handle nulls and missing data: What happens when a required field is missing? Define the behavior explicitly. "It should never happen" is not a strategy.
  • Preserve originals: Store the raw incoming data before transformation. You'll need it for debugging. When something goes wrong three months from now, you'll thank yourself.
  • Version your transformations: As business logic changes, you need to know which version processed each record. Without this, debugging historical data is nearly impossible.
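Putting the last three points together, here's a minimal sketch of an explicit mapping that also preserves the raw payload and stamps each record with the transform version that produced it. The source and destination field names are hypothetical.

```python
# Explicit field mapping + raw payload preservation + transform versioning.
import json
from datetime import datetime, timezone

TRANSFORM_VERSION = "2025-01-15.1"   # bump whenever the mapping logic changes

# Documented mapping: source field -> destination field.
FIELD_MAP = {
    "AccountName": "company_name",
    "Tier__c": "plan_tier",
    "MRR__c": "monthly_revenue",
}


def transform(raw: dict) -> dict:
    record = {dest: raw.get(src) for src, dest in FIELD_MAP.items()}
    record["_raw_payload"] = json.dumps(raw)            # keep the original for debugging
    record["_transform_version"] = TRANSFORM_VERSION    # which logic produced this row
    record["_ingested_at"] = datetime.now(timezone.utc).isoformat()
    return record
```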

If you're not monitoring it, you don't know it's broken

An integration without monitoring is a time bomb. It will fail, and you won't know until someone complains about missing data days later. We've seen integrations fail silently for weeks before anyone noticed.

  • Success/failure counts: Track how many records processed successfully vs. failed. Alert on unusual patterns. A sudden spike in failures is usually a sign something changed upstream.
  • Latency metrics: How long does a sync take? Is it getting slower? Catch problems before they become outages. Gradual degradation is a warning sign.
  • Data freshness: When did the last successful sync complete? Alert if it's been too long. "I thought it was running" is not acceptable in production.
  • Error categorization: Not all errors are equal. Distinguish between "a retry will fix it" and "this needs human attention." Without this distinction, you get alert fatigue or missed critical issues.
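A minimal sketch of the freshness check and error categorization, assuming you record a timestamp after each successful sync and have an alert() hook wired to whatever notification tool you actually use.

```python
# Freshness check and a simple transient-vs-needs-human error split.
from datetime import datetime, timedelta, timezone

FRESHNESS_THRESHOLD = timedelta(hours=2)

# Errors a retry will plausibly fix vs. ones that need a person.
TRANSIENT = (ConnectionError, TimeoutError)


def alert(message: str) -> None:
    print("ALERT:", message)   # stand-in for PagerDuty/Slack/etc.


def check_freshness(last_success: datetime) -> None:
    age = datetime.now(timezone.utc) - last_success
    if age > FRESHNESS_THRESHOLD:
        alert(f"last successful sync was {age} ago")


def categorize(error: Exception) -> str:
    return "transient" if isinstance(error, TRANSIENT) else "needs-human"
```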

When to go event-driven (and when not to)

When integrations grow beyond simple point-to-point connections, event-driven architecture becomes essential. Instead of systems directly calling each other, they publish events that any interested system can consume. But this adds real complexity - use it when you need it.

  • Salesforce Platform Events: Salesforce can publish custom events that external systems consume via CometD or the Pub/Sub API. A contact status change triggers an event, your AWS Lambda picks it up and updates the warehouse system. No Apex calling external APIs, no governor limits on outbound callouts. This is how we build Salesforce integrations that actually scale.
  • AWS EventBridge: Route events between AWS services and SaaS providers with content-based filtering. Costs ~$1 per million events. Partner integrations include Salesforce, Zendesk, and Datadog - no custom polling required. We've replaced a lot of fragile polling with EventBridge.
  • Change Data Capture (CDC): Instead of applications publishing events, the database streams changes. Debezium, AWS DMS, or native logical replication in PostgreSQL 10+ capture inserts, updates, and deletes. Your integrations react to actual data changes, not application-level events you might forget to publish. This is a game-changer for complex systems.
  • Transactional outbox pattern: When you need guaranteed delivery without distributed transactions, write the event to an outbox table in the same transaction as your data change. A separate process publishes from the outbox. Either both happen or neither does. This is the only reliable way to ensure consistency across systems (see the sketch after this list).
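A minimal sketch of the outbox pattern, using SQLite for brevity; the table names and event shape are hypothetical. The data change and the outbox row commit in the same transaction, so neither can happen without the other.

```python
# Transactional outbox: the business write and the event row share a transaction;
# a separate process publishes unpublished rows to the broker.
import json
import sqlite3

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS contacts (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE IF NOT EXISTS outbox "
    "(id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, published INTEGER DEFAULT 0)"
)


def update_contact_status(contact_id: str, status: str) -> None:
    with conn:  # one transaction: both rows commit, or neither does
        conn.execute(
            "INSERT INTO contacts (id, status) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
            (contact_id, status),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "contact.status_changed",
                         "contact_id": contact_id, "status": status}),),
        )


def publish_pending() -> None:
    # Runs separately (cron, worker loop): push unpublished events to the
    # broker, then mark them published.
    rows = conn.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        print("publishing", payload)   # stand-in for SNS/Kafka/EventBridge
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
```

Note that the publisher can still crash after sending an event but before marking the row, so consumers need to be idempotent: the outbox gives you at-least-once delivery, not exactly-once.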

Event-driven architecture adds complexity. Start with direct integrations for simple cases. Move to events when you have multiple consumers for the same data changes, or when you need to decouple systems for reliability. Don't over-engineer simple problems.

Zapier and iPaaS: be honest about the tradeoffs

Low-code integration platforms have their place. They're great for simple, low-volume use cases where you need something working quickly. We use them when they're the right tool.

  • Good for: Simple triggers, low volume, non-critical processes, quick prototypes, teams without developers. If it fits these constraints, use it. Don't over-engineer.
  • Bad for: Complex logic, high volume, data transformations, error handling requirements, cost-sensitive operations. We've migrated a lot of Zapier workflows to custom code when they hit these walls.

The hidden cost of Zapier: it's cheap to start, expensive to scale. 1000 tasks/month is $20. 100,000 tasks/month is $1000+. Do the math before committing. We've seen teams discover this the hard way.
