Integration Rescue: A Case Study in Stabilizing a Legacy Partner
How we stabilized a failing legacy partner integration—seams, quarantines, and drills—in weeks instead of rewriting the system.

One legacy partner accounted for 30% of our workflow. Their API was flaky, undocumented, and politically impossible to replace.
The conventional wisdom was: rewrite. Build a new integration from scratch. Replace the legacy partner with a modern alternative.
The reality was: we couldn't. The partner was embedded in contracts. The data they provided was irreplaceable. The timeline for a rewrite was 12+ months. We needed stability in weeks.
Here's how we stabilized a failing integration without pausing the business or launching a rewrite.
The Symptoms
When I joined this rescue project, the integration was in crisis mode:
Silent nightly failures:
Jobs that pulled data from the partner would fail without obvious errors. They'd time out, or return partial data, or succeed on retry but with different results. Ops fixed data by hand every morning—a ritual that took 2-3 hours.
Unpredictable latency:
The partner API would respond in 200ms for days, then spike to 30 seconds without warning. Our queues starved. Downstream processes waited. Users saw delays.
Schema changes without notice:
The partner would add fields, remove fields, or change data types without announcements. Our ingestion would break. Reports would show blanks or wrong values. Support got hammered with "where is my update?" calls.
No visibility:
When things went wrong—and they went wrong constantly—we had no way to answer basic questions:
- How many records failed?
- What specifically failed?
- When did the failure start?
- What changed?
We were debugging with logs and prayer.
The Wrong Solution: Rewrite Everything
The temptation was strong: rip out the integration and rebuild it from scratch. Modern API standards. Proper error handling. Real monitoring.
Here's why we didn't:
Timeline:
A clean rewrite would take 12-18 months. The business couldn't survive 12-18 months of the current failure rate. We needed improvement in weeks, not quarters.
Political reality:
The partner wasn't going away. Contracts were signed. Data was exclusive. Even if we built a beautiful new integration, it would still be talking to the same flaky API.
Risk:
Rewrites fail. Often. We could burn 12 months building something that didn't work any better, and we'd have two broken integrations instead of one.
The decision:
Stabilize what exists. Build resilience around the unreliable partner. Make the flakiness manageable instead of trying to eliminate it.
Step 1: Build the Missing Seam
The first problem: partner calls were scattered throughout the codebase. Some code called the API directly. Some code used a partial wrapper. Error handling was inconsistent.
The fix: adapter layer.
We wrapped all partner calls in a dedicated module. Direct API calls were banned—every interaction went through the adapter.
The adapter handled:
- Authentication (consistent token management)
- Retries with exponential backoff and jitter
- Timeout handling with sensible defaults
- Response parsing and error classification
- Logging with correlation IDs
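To make the seam concrete, here is a minimal sketch of such an adapter in Python. All names (`PartnerAdapter`, `TransientError`, `PartnerUnavailable`) are illustrative, not the actual module; the key idea is exponential backoff with full jitter so retries from many workers don't synchronize into a thundering herd:

```python
import random
import time

class TransientError(Exception):
    """A failure worth retrying (timeout, 5xx, partial response)."""

class PartnerUnavailable(Exception):
    """Raised after retries are exhausted; carries the correlation ID."""

class PartnerAdapter:
    """Single seam for all partner calls. Direct API calls are banned."""

    def __init__(self, transport, max_retries=3, base_delay=0.5, timeout=10.0):
        # transport is any callable(endpoint, payload, timeout) -> response
        self.transport = transport
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.timeout = timeout

    def call(self, endpoint, payload, correlation_id):
        last_error = None
        for attempt in range(self.max_retries + 1):
            try:
                return self.transport(endpoint, payload, timeout=self.timeout)
            except TransientError as err:
                last_error = err
                # Full jitter: sleep a random amount in [0, base * 2^attempt)
                delay = random.uniform(0, self.base_delay * (2 ** attempt))
                time.sleep(delay)
        raise PartnerUnavailable(correlation_id) from last_error
```

Because the transport is injected, the same adapter runs against the real API in production and against fixtures or fakes in tests.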
Contract documentation:
We reverse-engineered the API behavior from logs and code. We documented:
- Request schemas (what we send)
- Response schemas (what we expect back)
- Error shapes (what failures look like)
- Known quirks (undocumented behaviors we'd discovered)
This documentation didn't exist anywhere. The partner's official docs were wrong or incomplete. Our documentation reflected reality.
Contract tests:
We captured real payloads—both successful and failed—and turned them into test fixtures. Every build ran these fixtures through the adapter. If parsing broke, CI failed before production.
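In spirit, a contract test is just a captured payload run through the parser. A toy version (the fixtures and field names here are invented stand-ins for the real captured payloads):

```python
# Fixtures captured from real traffic (abridged, hypothetical shapes).
GOOD_FIXTURE = {"id": "r-100", "status": "active", "updated_at": "2021-06-01T00:00:00Z"}
BAD_FIXTURE = {"id": "r-101"}  # missing required fields, as once seen in production

REQUIRED_FIELDS = {"id", "status", "updated_at"}

def parse_record(payload):
    """Reject anything that doesn't match the documented contract."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return {"id": payload["id"], "status": payload["status"]}

def test_good_fixture_parses():
    assert parse_record(GOOD_FIXTURE)["id"] == "r-100"

def test_bad_fixture_is_rejected():
    try:
        parse_record(BAD_FIXTURE)
    except ValueError:
        pass  # expected: contract violation must fail loudly
    else:
        raise AssertionError("expected rejection")
```

In CI these run against every build, so a parsing regression fails the pipeline before it fails production.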
Idempotency:
Every write operation got a deterministic key. If we retried (and we would retry often), duplicates were prevented automatically. Support stopped seeing duplicate records.
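A deterministic key can be as simple as a hash of the canonicalized payload. This sketch (names invented; a real system would persist `seen` in the database, not in memory) shows the shape:

```python
import hashlib
import json

def idempotency_key(operation, payload):
    """Same logical write always hashes to the same key, regardless of dict order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{operation}:{digest[:16]}"

seen = set()  # stand-in for a unique-key column in the database

def write_once(operation, payload, store):
    key = idempotency_key(operation, payload)
    if key in seen:
        return "duplicate-skipped"  # a retry landed; nothing written twice
    seen.add(key)
    store.append(payload)
    return "written"
```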
Step 2: Quarantine Bad Inputs
The second problem: bad data from the partner was corrupting our production tables. Schema changes caused parsing failures that left partial records. Invalid values caused downstream errors.
The fix: edge validation with quarantine.
Before any partner data touched production, it went through validation:
- Schema check: Does this match our expected structure?
- Value validation: Are fields within expected ranges?
- Consistency check: Does this reference entities we know about?
Records that failed validation went to quarantine—a separate table with:
- The raw payload (what we received)
- The failure reason (why it was rejected)
- Timestamp (when it arrived)
- Correlation ID (how to trace it)
Quarantine records didn't corrupt production. They waited for human review or automatic retry after investigation.
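The three checks and the quarantine record can be sketched in a few lines. Field names (`amount`, `entity_ref`) are illustrative, and lists stand in for the real tables:

```python
import time

def validate(record, known_entities):
    """Return a failure reason, or None if the record is clean."""
    if not isinstance(record.get("id"), str):
        return "schema: id must be a string"
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        return "value: amount out of range"
    if record.get("entity_ref") not in known_entities:
        return "consistency: unknown entity_ref"
    return None

def ingest(record, production, quarantine, known_entities, correlation_id):
    reason = validate(record, known_entities)
    if reason is None:
        production.append(record)
    else:
        quarantine.append({
            "raw_payload": record,        # what we received
            "failure_reason": reason,     # why it was rejected
            "received_at": time.time(),   # when it arrived
            "correlation_id": correlation_id,  # how to trace it
        })
```

The point is that the bad record still exists, fully intact, with enough context to diagnose and replay it later.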
Ops UI:
We built a simple dashboard showing:
- Quarantine count by error type
- Age of oldest quarantine item (staleness alert)
- Quick action buttons for common fixes
- Audit logs of all resolutions
Ops could resolve quarantine issues without engineering. Engineering focused on systemic fixes, not individual data problems.
Outcome:
Bad data stopped contaminating production tables. Reports stopped showing blanks. Support tickets dropped 40% in the first week.
Step 3: Control the Blast Radius
The third problem: when the partner was slow or failing, our entire system degraded. Queues backed up. Thread pools exhausted. Users saw errors across unrelated features.
Circuit breaker:
We implemented a circuit breaker (a pattern that stops calling a failing service to prevent cascade failures):
- Closed state: Normal operation, requests go through
- Open state: Partner failing, requests immediately return cached/error without calling
- Half-open state: Testing recovery, a few requests go through to see if partner is back
Thresholds: If error rate exceeded 50% over 30 seconds, or if latency exceeded 10 seconds on 3 consecutive requests, the breaker opened.
When open, we served cached "last known good" data where safe, with UI banners explaining the situation.
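For illustration, here is a minimal breaker keyed on a consecutive-failure count (the production thresholds above used a rolling error-rate window; this simplification keeps the state machine visible). The injectable `clock` makes the timeout testable:

```python
import time

class CircuitBreaker:
    """Closed -> open after repeated failures; open -> half-open after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Should this request be sent to the partner at all?"""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # let probe traffic through
                return True
            return False  # fail fast: serve cache or error instead
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Callers check `allow()` before each partner call; when it returns False, they serve the cached "last known good" data instead.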
Back-pressure:
We capped queue depth per tenant. If one customer's bulk import was hammering the partner, their queue grew, but other customers' traffic continued normally.
When queues hit 80% capacity, new work got a "busy, retry soon" response instead of queueing indefinitely. Users saw clear messages, not timeouts.
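A per-tenant cap is a small amount of code for a large amount of isolation. A sketch (class and return values invented for illustration):

```python
from collections import deque

class TenantQueues:
    """Per-tenant queues with a depth cap; one tenant's burst can't starve the rest."""

    def __init__(self, max_depth=100, busy_threshold=0.8):
        self.max_depth = max_depth
        self.busy_threshold = busy_threshold
        self.queues = {}

    def enqueue(self, tenant, job):
        q = self.queues.setdefault(tenant, deque())
        if len(q) >= self.max_depth * self.busy_threshold:
            return "busy-retry-soon"  # explicit signal instead of a silent timeout
        q.append(job)
        return "queued"
```

One tenant hitting its cap gets a clear "busy" answer while every other tenant's queue keeps accepting work.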
Fallback for high-risk writes:
For critical operations (like payment-related data), we queued failed writes for later retry instead of failing immediately. Ops got alerts. Users got "processing, we'll notify you" messages. Nothing was lost.
Step 4: Observability and Runbooks
The fourth problem: when failures happened, no one knew why. Debugging was archaeology.
Metrics:
We instrumented everything:
- Success/error rates by endpoint
- Latency percentiles (p50, p95, p99)
- Queue depth and age
- Circuit breaker state and trip count
- Quarantine counts by error type
All visible on a single dashboard. One glance told you: is the partner healthy?
Logs with correlation:
Every request got a correlation ID that followed it through:
- Our initial receipt
- The partner API call
- Any retries
- The final outcome
We included payload hashes for forensics. When something went wrong, we could reconstruct exactly what happened.
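The mechanics are simple: mint one ID at receipt, stamp it on every log event, and hash the payload rather than storing it. A sketch, with a list standing in for the real log sink:

```python
import hashlib
import json
import uuid

def new_correlation_id():
    """Minted once when we first receive a request."""
    return uuid.uuid4().hex

def log_event(log, correlation_id, stage, payload):
    """Append a structured event; the payload is recorded as a hash for forensics."""
    payload_hash = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()[:12]
    log.append({
        "correlation_id": correlation_id,
        "stage": stage,
        "payload_hash": payload_hash,
    })

def trace(log, correlation_id):
    """Reconstruct the ordered stages a single request went through."""
    return [e["stage"] for e in log if e["correlation_id"] == correlation_id]
```

Matching payload hashes across stages prove the same data flowed end to end; a mismatch pinpoints where it mutated.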
Runbooks:
We documented specific response procedures:
"Partner API slow" runbook:
- Check breaker state on dashboard
- If breaker is open, verify cache is serving
- Check partner's status page (if they have one)
- If latency >30s and not improving, enable drain mode
- Notify customer success with ETA
"Schema drift detected" runbook:
- Check quarantine for new error types
- Capture sample payloads for contract update
- Update adapter if change is additive
- If change is breaking, escalate to partner contact
- Block schema promotion until tests pass
"Mass replay needed" runbook:
- Identify scope (which records, which time range)
- Verify idempotency keys are in place
- Stage replay in sandbox
- Verify results before production
- Run production replay with monitoring
Step 5: Drills and Playbacks
The fifth problem: we didn't know if our fixes would hold under pressure. Theory is nice. Reality is humbling.
Replay harness:
We built a system to record failed payloads and replay them against sandboxes. Before any fix went to production, we:
- Captured the failing scenarios
- Applied the fix
- Replayed against sandbox
- Verified correct behavior
- Only then promoted to production
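The harness itself is little more than a recorder and a loop. A minimal sketch (the class name and report shape are invented; the real handler would be the sandboxed adapter):

```python
class ReplayHarness:
    """Record failing payloads in production, replay them against a sandbox handler."""

    def __init__(self):
        self.recorded = []

    def record(self, payload, error):
        self.recorded.append({"payload": payload, "error": str(error)})

    def replay(self, handler):
        """Run every recorded payload through a candidate fix; report pass/fail."""
        passed = failed = 0
        for item in self.recorded:
            try:
                handler(item["payload"])
                passed += 1
            except Exception:
                failed += 1
        return {"passed": passed, "failed": failed}
```

A fix is only promoted when `replay` reports zero failures against the full recorded set.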
Chaos hour:
For one month, weekly, we simulated partner failures:
- Injected latency (made the API slow)
- Returned 500 errors randomly
- Sent malformed responses
- Simulated complete outages
We timed detection. We timed recovery. We found gaps and fixed them.
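Injecting these failures doesn't require special tooling. A toy injector, assuming the partner call is an ordinary callable behind the adapter seam (names and rates are illustrative):

```python
import random
import time

def chaos_wrap(call, error_rate=0.3, latency_s=0.0, rng=None):
    """Wrap a partner call with injected latency and random failures for drills."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if latency_s:
            time.sleep(latency_s)  # simulate a slow partner
        if rng.random() < error_rate:
            raise RuntimeError("chaos: injected 500")  # simulate a server error
        return call(*args, **kwargs)

    return wrapped
```

During a drill, the real transport is swapped for a wrapped one, and you watch whether breakers trip, quarantines fill, and alerts fire as designed.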
Playback reviews:
After each real incident, we asked:
- What guardrail should have caught this?
- What test case is missing?
- What runbook step was unclear?
Every incident bought us a new guardrail or test. The system got stronger over time.
Results
Stability:
Error rates dropped approximately 70%. The nightly failure ritual became a weekly check that usually found nothing. Ops got their mornings back.
Transparency:
Support used the dashboard to answer "where is my update?" They could see: your record is in quarantine because the partner sent invalid data. We're working on it. No more escalations to engineering for basic questions.
Resilience to partner issues:
When the partner shipped a schema change without notice (and they did, twice more), our CI caught it before production. We shipped adapter patches the same day. No customer-visible impact.
Risk posture:
The partner was still flaky. Their API was still undocumented. But we'd built a layer of resilience around them. Their unreliability became manageable, not catastrophic.
Timeline
Week 1-2:
- Built adapter layer
- Documented contracts from reverse engineering
- Added basic logging with correlation IDs
Week 3-4:
- Implemented quarantine
- Built ops dashboard
- Added circuit breaker
Week 5-6:
- Added back-pressure
- Built replay harness
- Wrote initial runbooks
Week 7-8:
- First chaos hour drill
- Fixed gaps found in drill
- Refined thresholds and alerts
Ongoing:
- Weekly playbacks of incidents
- Monthly chaos drills
- Continuous improvement of contract tests
Total elapsed time to stability: 8 weeks. Total elapsed time if we'd done a rewrite: 12-18 months (optimistically). The rescue approach was roughly 6-10x faster.
Lessons You Can Apply
If there's no seam, build one:
Adapters, contracts, and tests are the foundation. Until you have a clean boundary around the problem, you can't fix the problem.
Quarantine unknowns; contain before you cleanse:
Bad data will come. The question is whether it corrupts production or waits safely for review. Always choose safely.
Add breakers and back-pressure before traffic spikes, not after:
The time to build resilience is before the crisis, not during it. If you wait until the partner is failing to add circuit breakers, you're too late.
Run drills; practice creates calm:
Incidents are stressful. Practiced incidents are procedural. Every drill teaches you something and makes the next real incident less chaotic.
Every incident should buy you a new guardrail:
If the same failure can happen twice, you haven't fixed it. Turn every incident into a test case, a runbook step, or a new threshold.
You Don't Need a Rewrite
The instinct to rewrite is strong. Clean code! Modern patterns! Proper architecture!
But rewrites are risky, slow, and often don't solve the underlying problem. If your partner is flaky, a beautiful new integration is still talking to a flaky partner.
Build resilience around unreliable dependencies. Make the flakiness manageable. Ship improvements in weeks, not quarters.
You don't need a rewrite to stabilize a bad integration. You need seams, visibility, and rehearsed responses.
Context → Decision → Outcome → Metric
- Context: Legacy partner integration handling 30% of workflow, failing daily, no documentation, no visibility, no resilience. Business couldn't wait for rewrite.
- Decision: Stabilize in place using adapter pattern, quarantine, circuit breakers, observability, and drills instead of pursuing 12+ month rewrite.
- Outcome: Integration stabilized in 8 weeks. Daily failures became rare. Support could answer questions without engineering. Partner schema changes handled same-day.
- Metric: Error rates dropped ~70%. Quarantine visibility eliminated 40% of support tickets. Two partner schema changes handled with zero customer-visible impact.
Anecdote: The Schema Change That Didn't Break Us
Six weeks after the stabilization project, the partner shipped a schema change. No announcement. No warning. Just different data arriving at 2 a.m.
The old system would have broken silently. Data would have corrupted. Morning ops would have spent hours untangling.
Here's what actually happened:
- 2:17 a.m.: New records started arriving with an unexpected field.
- 2:17 a.m.: Adapter validation flagged them as "unknown shape."
- 2:17 a.m.: Records went to quarantine instead of production.
- 2:18 a.m.: Alert fired: "Quarantine spike, new error type."
- 6:30 a.m.: On-call engineer woke up, saw the alert, checked quarantine.
- 6:45 a.m.: Identified the change: new field "credentialSubType" added.
- 7:00 a.m.: Updated adapter to accept the new field.
- 7:30 a.m.: CI passed. Deployed.
- 7:45 a.m.: Replayed quarantine. All records processed correctly.
Customer-visible impact: zero. No one outside the engineering team knew it had happened.
The partner's project manager emailed us later that day: "Hey, we added a new field to the API. Let us know if you need any help adapting."
We replied: "Thanks for letting us know. We adapted this morning."
That's what resilience looks like. The partner was still unreliable. But we'd built a system that absorbed their unreliability without passing it to customers.
Mini Checklist: Integration Rescue
- [ ] Adapter layer wraps all partner calls (no direct API access)
- [ ] Contracts documented from reverse engineering (not just official docs)
- [ ] Contract tests using captured real payloads
- [ ] Idempotency keys on all write operations
- [ ] Edge validation before data touches production
- [ ] Quarantine table with reason codes and timestamps
- [ ] Ops UI for quarantine resolution without engineering
- [ ] Circuit breaker with sensible thresholds
- [ ] Back-pressure (queue caps, "busy" responses)
- [ ] Metrics dashboard: error rates, latency, queue depth, breaker state
- [ ] Correlation IDs from receipt through final outcome
- [ ] Runbooks for common failure scenarios
- [ ] Replay harness for safe testing of fixes
- [ ] Monthly chaos drills simulating partner failures
- [ ] Post-incident reviews that produce new guardrails