Integration Rescue: A Case Study in Stabilizing a Legacy Partner
How we stabilized a failing legacy partner integration—seams, quarantines, and drills—in weeks instead of rewriting the system.

One legacy partner accounted for 30% of our workflow. Their API was flaky, undocumented, and politically impossible to replace.
The conventional wisdom was: rewrite. Build a new integration from scratch. Replace the legacy partner with a modern alternative.
The reality was: we couldn't. The partner was embedded in contracts. The data they provided was irreplaceable. The timeline for a rewrite was 12+ months. We needed stability in weeks.
Here's how we stabilized a failing integration without pausing the business or launching a rewrite.
The Symptoms
When I joined this rescue project, the integration was in crisis mode:
Silent nightly failures:
Jobs that pulled data from the partner would fail without obvious errors. They'd time out, or return partial data, or succeed on retry but with different results. Ops fixed data by hand every morning—a ritual that took 2-3 hours.
Unpredictable latency:
The partner API would respond in 200ms for days, then spike to 30 seconds without warning. Our queues starved. Downstream processes waited. Users saw delays.
Schema changes without notice:
The partner would add fields, remove fields, or change data types without announcements. Our ingestion would break. Reports would show blanks or wrong values. Support got hammered with "where is my update?" calls.
No visibility:
When things went wrong—and they went wrong constantly—we had no way to answer basic questions:
- How many records failed?
- What specifically failed?
- When did the failure start?
- What changed?
We were debugging with logs and prayer.
The Wrong Solution: Rewrite Everything
The temptation was strong: rip out the integration and rebuild it from scratch. Modern API standards. Proper error handling. Real monitoring.
Here's why we didn't:
Timeline:
A clean rewrite would take 12-18 months. The business couldn't survive 12-18 months of the current failure rate. We needed improvement in weeks, not quarters.
Political reality:
The partner wasn't going away. Contracts were signed. Data was exclusive. Even if we built a beautiful new integration, it would still be talking to the same flaky API.
Risk:
Rewrites fail. Often. We could burn 12 months building something that didn't work any better, and we'd have two broken integrations instead of one.
The decision:
Stabilize what exists. Build resilience around the unreliable partner. Make the flakiness manageable instead of trying to eliminate it.
Step 1: Build the Missing Seam
The first problem: partner calls were scattered throughout the codebase. Some code called the API directly. Some code used a partial wrapper. Error handling was inconsistent.
The fix: adapter layer.
We wrapped all partner calls in a dedicated module. Direct API calls were banned—every interaction went through the adapter.
The adapter handled:
- Authentication (consistent token management)
- Retries with exponential backoff and jitter
- Timeout handling with sensible defaults
- Response parsing and error classification
- Logging with correlation IDs
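To make the seam concrete, here is a minimal sketch of such an adapter in Python. All names (`PartnerAdapter`, `TransientError`, `PartnerUnavailable`) are illustrative, not the actual module; the key idea is exponential backoff with full jitter so retries from many workers don't synchronize into a thundering herd:

```python
import random
import time

class TransientError(Exception):
    """A failure worth retrying (timeout, 5xx, partial response)."""

class PartnerUnavailable(Exception):
    """Raised after retries are exhausted; carries the correlation ID."""

class PartnerAdapter:
    """Single seam for all partner calls. Direct API calls are banned."""

    def __init__(self, transport, max_retries=3, base_delay=0.5, timeout=10.0):
        # transport is any callable(endpoint, payload, timeout) -> response
        self.transport = transport
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.timeout = timeout

    def call(self, endpoint, payload, correlation_id):
        last_error = None
        for attempt in range(self.max_retries + 1):
            try:
                return self.transport(endpoint, payload, timeout=self.timeout)
            except TransientError as err:
                last_error = err
                # Full jitter: sleep a random amount in [0, base * 2^attempt)
                delay = random.uniform(0, self.base_delay * (2 ** attempt))
                time.sleep(delay)
        raise PartnerUnavailable(correlation_id) from last_error
```

Because the transport is injected, the same adapter runs against the real API in production and against fixtures or fakes in tests.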
Contract documentation:
We reverse-engineered the API behavior from logs and code. We documented:
- Request schemas (what we send)
- Response schemas (what we expect back)
- Error shapes (what failures look like)
- Known quirks (undocumented behaviors we'd discovered)
This documentation didn't exist anywhere. The partner's official docs were wrong or incomplete. Our documentation reflected reality.
Contract tests:
We captured real payloads—both successful and failed—and turned them into test fixtures. Every build ran these fixtures through the adapter. If parsing broke, CI failed before production.
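In spirit, a contract test is just a captured payload run through the parser. A toy version (the fixtures and field names here are invented stand-ins for the real captured payloads):

```python
# Fixtures captured from real traffic (abridged, hypothetical shapes).
GOOD_FIXTURE = {"id": "r-100", "status": "active", "updated_at": "2021-06-01T00:00:00Z"}
BAD_FIXTURE = {"id": "r-101"}  # missing required fields, as once seen in production

REQUIRED_FIELDS = {"id", "status", "updated_at"}

def parse_record(payload):
    """Reject anything that doesn't match the documented contract."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return {"id": payload["id"], "status": payload["status"]}

def test_good_fixture_parses():
    assert parse_record(GOOD_FIXTURE)["id"] == "r-100"

def test_bad_fixture_is_rejected():
    try:
        parse_record(BAD_FIXTURE)
    except ValueError:
        pass  # expected: contract violation must fail loudly
    else:
        raise AssertionError("expected rejection")
```

In CI these run against every build, so a parsing regression fails the pipeline before it fails production.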
Idempotency:
Every write operation got a deterministic key. If we retried (and we would retry often), duplicates were prevented automatically. Support stopped seeing duplicate records.
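A deterministic key can be as simple as a hash of the canonicalized payload. This sketch (names invented; a real system would persist `seen` in the database, not in memory) shows the shape:

```python
import hashlib
import json

def idempotency_key(operation, payload):
    """Same logical write always hashes to the same key, regardless of dict order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{operation}:{digest[:16]}"

seen = set()  # stand-in for a unique-key column in the database

def write_once(operation, payload, store):
    key = idempotency_key(operation, payload)
    if key in seen:
        return "duplicate-skipped"  # a retry landed; nothing written twice
    seen.add(key)
    store.append(payload)
    return "written"
```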
Step 2: Quarantine Bad Inputs
The second problem: bad data from the partner was corrupting our production tables. Schema changes caused parsing failures that left partial records. Invalid values caused downstream errors.
The fix: edge validation with quarantine.
Before any partner data touched production, it went through validation:
- Schema check: Does this match our expected structure?
- Value validation: Are fields within expected ranges?
- Consistency check: Does this reference entities we know about?
Records that failed validation went to quarantine—a separate table with:
- The raw payload (what we received)
- The failure reason (why it was rejected)
- Timestamp (when it arrived)
- Correlation ID (how to trace it)
Quarantine records didn't corrupt production. They waited for human review or automatic retry after investigation.
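The three checks and the quarantine record can be sketched in a few lines. Field names (`amount`, `entity_ref`) are illustrative, and lists stand in for the real tables:

```python
import time

def validate(record, known_entities):
    """Return a failure reason, or None if the record is clean."""
    if not isinstance(record.get("id"), str):
        return "schema: id must be a string"
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        return "value: amount out of range"
    if record.get("entity_ref") not in known_entities:
        return "consistency: unknown entity_ref"
    return None

def ingest(record, production, quarantine, known_entities, correlation_id):
    reason = validate(record, known_entities)
    if reason is None:
        production.append(record)
    else:
        quarantine.append({
            "raw_payload": record,        # what we received
            "failure_reason": reason,     # why it was rejected
            "received_at": time.time(),   # when it arrived
            "correlation_id": correlation_id,  # how to trace it
        })
```

The point is that the bad record still exists, fully intact, with enough context to diagnose and replay it later.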
Ops UI:
We built a simple dashboard showing:
- Quarantine count by error type
- Age of oldest quarantine item (staleness alert)
- Quick action buttons for common fixes
- Audit logs of all resolutions
Ops could resolve quarantine issues without engineering. Engineering focused on systemic fixes, not individual data problems.
Outcome:
Bad data stopped contaminating production tables. Reports stopped showing blanks. Support tickets dropped 40% in the first week.
Step 3: Control the Blast Radius
The third problem: when the partner was slow or failing, our entire system degraded. Queues backed up. Thread pools exhausted. Users saw errors across unrelated features.
Circuit breaker:
We implemented a circuit breaker (a pattern that stops calling a failing service to prevent cascade failures):
- Closed state: Normal operation, requests go through
- Open state: Partner failing, requests immediately return cached/error without calling
- Half-open state: Testing recovery, a few requests go through to see if partner is back
Thresholds: If error rate exceeded 50% over 30 seconds, or if latency exceeded 10 seconds on 3 consecutive requests, the breaker opened.
When open, we served cached "last known good" data where safe, with UI banners explaining the situation.
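For illustration, here is a minimal breaker keyed on a consecutive-failure count (the production thresholds above used a rolling error-rate window; this simplification keeps the state machine visible). The injectable `clock` makes the timeout testable:

```python
import time

class CircuitBreaker:
    """Closed -> open after repeated failures; open -> half-open after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Should this request be sent to the partner at all?"""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # let probe traffic through
                return True
            return False  # fail fast: serve cache or error instead
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Callers check `allow()` before each partner call; when it returns False, they serve the cached "last known good" data instead.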
Back-pressure:
We capped queue depth per tenant. If one customer's bulk import was hammering the partner, their queue grew, but other customers' traffic continued normally.
When queues hit 80% capacity, new work got a "busy, retry soon" response instead of queueing indefinitely. Users saw clear messages, not timeouts.
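A per-tenant cap is a small amount of code for a large amount of isolation. A sketch (class and return values invented for illustration):

```python
from collections import deque

class TenantQueues:
    """Per-tenant queues with a depth cap; one tenant's burst can't starve the rest."""

    def __init__(self, max_depth=100, busy_threshold=0.8):
        self.max_depth = max_depth
        self.busy_threshold = busy_threshold
        self.queues = {}

    def enqueue(self, tenant, job):
        q = self.queues.setdefault(tenant, deque())
        if len(q) >= self.max_depth * self.busy_threshold:
            return "busy-retry-soon"  # explicit signal instead of a silent timeout
        q.append(job)
        return "queued"
```

One tenant hitting its cap gets a clear "busy" answer while every other tenant's queue keeps accepting work.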
Fallback for high-risk writes:
For critical operations (like payment-related data), we queued failed writes for later retry instead of failing immediately. Ops got alerts. Users got "processing, we'll notify you" messages. Nothing was lost.
Step 4: Observability and Runbooks
The fourth problem: when failures happened, no one knew why. Debugging was archaeology.
Metrics:
We instrumented everything:
- Success/error rates by endpoint
- Latency percentiles (p50, p95, p99)
- Queue depth and age
- Circuit breaker state and trip count
- Quarantine counts by error type
All visible on a single dashboard. One glance told you: is the partner healthy?
Logs with correlation:
Every request got a correlation ID that followed it through:
- Our initial receipt
- The partner API call
- Any retries
- The final outcome
We included payload hashes for forensics. When something went wrong, we could reconstruct exactly what happened.
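The mechanics are simple: mint one ID at receipt, stamp it on every log event, and hash the payload rather than storing it. A sketch, with a list standing in for the real log sink:

```python
import hashlib
import json
import uuid

def new_correlation_id():
    """Minted once when we first receive a request."""
    return uuid.uuid4().hex

def log_event(log, correlation_id, stage, payload):
    """Append a structured event; the payload is recorded as a hash for forensics."""
    payload_hash = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()[:12]
    log.append({
        "correlation_id": correlation_id,
        "stage": stage,
        "payload_hash": payload_hash,
    })

def trace(log, correlation_id):
    """Reconstruct the ordered stages a single request went through."""
    return [e["stage"] for e in log if e["correlation_id"] == correlation_id]
```

Matching payload hashes across stages prove the same data flowed end to end; a mismatch pinpoints where it mutated.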
Runbooks:
We documented specific response procedures:
"Partner API slow" runbook:
- Check breaker state on dashboard
- If breaker is open, verify cache is serving
- Check partner's status page (if they have one)
- If latency >30s and not improving, enable drain mode
- Notify customer success with ETA
"Schema drift detected" runbook:
- Check quarantine for new error types
- Capture sample payloads for contract update
- Update adapter if change is additive
- If change is breaking, escalate to partner contact
- Block schema promotion until tests pass
"Mass replay needed" runbook:
- Identify scope (which records, which time range)
- Verify idempotency keys are in place
- Stage replay in sandbox
- Verify results before production
- Run production replay with monitoring
Step 5: Drills and Playbacks
The fifth problem: we didn't know if our fixes would hold under pressure. Theory is nice. Reality is humbling.
Replay harness:
We built a system to record failed payloads and replay them against sandboxes. Before any fix went to production, we:
- Captured the failing scenarios
- Applied the fix
- Replayed against sandbox
- Verified correct behavior
- Only then promoted to production
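The harness itself is little more than a recorder and a loop. A minimal sketch (the class name and report shape are invented; the real handler would be the sandboxed adapter):

```python
class ReplayHarness:
    """Record failing payloads in production, replay them against a sandbox handler."""

    def __init__(self):
        self.recorded = []

    def record(self, payload, error):
        self.recorded.append({"payload": payload, "error": str(error)})

    def replay(self, handler):
        """Run every recorded payload through a candidate fix; report pass/fail."""
        passed = failed = 0
        for item in self.recorded:
            try:
                handler(item["payload"])
                passed += 1
            except Exception:
                failed += 1
        return {"passed": passed, "failed": failed}
```

A fix is only promoted when `replay` reports zero failures against the full recorded set.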
Chaos hour:
For one month, weekly, we simulated partner failures:
- Injected latency (made the API slow)
- Returned 500 errors randomly
- Sent malformed responses
- Simulated complete outages
We timed detection. We timed recovery. We found gaps and fixed them.
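Injecting these failures doesn't require special tooling. A toy injector, assuming the partner call is an ordinary callable behind the adapter seam (names and rates are illustrative):

```python
import random
import time

def chaos_wrap(call, error_rate=0.3, latency_s=0.0, rng=None):
    """Wrap a partner call with injected latency and random failures for drills."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if latency_s:
            time.sleep(latency_s)  # simulate a slow partner
        if rng.random() < error_rate:
            raise RuntimeError("chaos: injected 500")  # simulate a server error
        return call(*args, **kwargs)

    return wrapped
```

During a drill, the real transport is swapped for a wrapped one, and you watch whether breakers trip, quarantines fill, and alerts fire as designed.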
Playback reviews:
After each real incident, we asked:
- What guardrail should have caught this?
- What test case is missing?
- What runbook step was unclear?
Every incident bought us a new guardrail or test. The system got stronger over time.
Results
Stability:
Error rates dropped approximately 70%. The nightly failure ritual became a weekly check that usually found nothing. Ops got their mornings back.
Transparency:
Support used the dashboard to answer "where is my update?" They could see: your record is in quarantine because the partner sent invalid data. We're working on it. No more escalations to engineering for basic questions.
Resilience to partner issues:
When the partner shipped a schema change without notice (and they did, twice more), our CI caught it before production. We shipped adapter patches the same day. No customer-visible impact.
Risk posture:
The partner was still flaky. Their API was still undocumented. But we'd built a layer of resilience around them. Their unreliability became manageable, not catastrophic.
Timeline
Week 1-2:
- Built adapter layer
- Documented contracts from reverse engineering
- Added basic logging with correlation IDs
Week 3-4:
- Implemented quarantine
- Built ops dashboard
- Added circuit breaker
Week 5-6:
- Added back-pressure
- Built replay harness
- Wrote initial runbooks
Week 7-8:
- First chaos hour drill
- Fixed gaps found in drill
- Refined thresholds and alerts
Ongoing:
- Weekly playbacks of incidents
- Monthly chaos drills
- Continuous improvement of contract tests
Total elapsed time to stability: 8 weeks. Total elapsed time if we'd done a rewrite: 12-18 months (optimistically). The rescue approach was roughly 6-10x faster.
Lessons You Can Apply
If there's no seam, build one:
Adapters, contracts, and tests are the foundation. Until you have a clean boundary around the problem, you can't fix the problem.
Quarantine unknowns; contain before you cleanse:
Bad data will come. The question is whether it corrupts production or waits safely for review. Always choose safely.
Add breakers and back-pressure before traffic spikes, not after:
The time to build resilience is before the crisis, not during it. If you wait until the partner is failing to add circuit breakers, you're too late.
Run drills; practice creates calm:
Incidents are stressful. Practiced incidents are procedural. Every drill teaches you something and makes the next real incident less chaotic.
Every incident should buy you a new guardrail:
If the same failure can happen twice, you haven't fixed it. Turn every incident into a test case, a runbook step, or a new threshold.
You Don't Need a Rewrite
The instinct to rewrite is strong. Clean code! Modern patterns! Proper architecture!
But rewrites are risky, slow, and often don't solve the underlying problem. If your partner is flaky, a beautiful new integration is still talking to a flaky partner.
Build resilience around unreliable dependencies. Make the flakiness manageable. Ship improvements in weeks, not quarters.
You don't need a rewrite to stabilize a bad integration. You need seams, visibility, and rehearsed responses.
Context → Decision → Outcome → Metric
- Context: Legacy partner integration handling 30% of workflow, failing daily, no documentation, no visibility, no resilience. Business couldn't wait for rewrite.
- Decision: Stabilize in place using adapter pattern, quarantine, circuit breakers, observability, and drills instead of pursuing 12+ month rewrite.
- Outcome: Integration stabilized in 8 weeks. Daily failures became rare. Support could answer questions without engineering. Partner schema changes handled same-day.
- Metric: Error rates dropped ~70%. Quarantine visibility eliminated 40% of support tickets. Two partner schema changes handled with zero customer-visible impact.
Anecdote: The Schema Change That Didn't Break Us
Six weeks after the stabilization project, the partner shipped a schema change. No announcement. No warning. Just different data arriving at 2 a.m.
The old system would have broken silently. Data would have corrupted. Morning ops would have spent hours untangling.
Here's what actually happened:
- 2:17 a.m.: New records started arriving with an unexpected field.
- 2:17 a.m.: Adapter validation flagged them as "unknown shape."
- 2:17 a.m.: Records went to quarantine instead of production.
- 2:18 a.m.: Alert fired: "Quarantine spike, new error type."
- 6:30 a.m.: On-call engineer woke up, saw the alert, checked quarantine.
- 6:45 a.m.: Identified the change: new field "credentialSubType" added.
- 7:00 a.m.: Updated adapter to accept the new field.
- 7:30 a.m.: CI passed. Deployed.
- 7:45 a.m.: Replayed quarantine. All records processed correctly.
Customer-visible impact: zero. No one outside the engineering team knew it had happened.
The partner's project manager emailed us later that day: "Hey, we added a new field to the API. Let us know if you need any help adapting."
We replied: "Thanks for letting us know. We adapted this morning."
That's what resilience looks like. The partner was still unreliable. But we'd built a system that absorbed their unreliability without passing it to customers.
Mini Checklist: Integration Rescue
- [ ] Adapter layer wraps all partner calls (no direct API access)
- [ ] Contracts documented from reverse engineering (not just official docs)
- [ ] Contract tests using captured real payloads
- [ ] Idempotency keys on all write operations
- [ ] Edge validation before data touches production
- [ ] Quarantine table with reason codes and timestamps
- [ ] Ops UI for quarantine resolution without engineering
- [ ] Circuit breaker with sensible thresholds
- [ ] Back-pressure (queue caps, "busy" responses)
- [ ] Metrics dashboard: error rates, latency, queue depth, breaker state
- [ ] Correlation IDs from receipt through final outcome
- [ ] Runbooks for common failure scenarios
- [ ] Replay harness for safe testing of fixes
- [ ] Monthly chaos drills simulating partner failures
- [ ] Post-incident reviews that produce new guardrails