1. The Problem
The operations team was drowning in EDI failures. Every week, someone would get paged at 2AM because a batch posting had failed silently, leaving orders stuck in limbo. The existing integration had no retry logic, no exception handling, and no way to identify which specific transactions had failed. Partner onboarding was a nightmare—each new retail partner meant weeks of mapping adjustments and prayer.
2. Investigation
- Mapped existing data flows from Cleo CIC to NetSuite (846 inventory, 850 orders, 856 ASNs)
- Identified 7 distinct failure modes in the existing scripts
- Found that 60% of failures were due to item matching issues (GTIN/UPC mismatches; see the short example after this list)
- Discovered governance limit violations during peak posting times
- Documented undocumented field mappings across 4 retail partners
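Most of the item-matching failures came down to the same product being identified two different ways. Below is a minimal illustration of the normalization involved, assuming the common case of a partner sending a 14-digit GTIN while the NetSuite item record stores a 12-digit UPC; the function name and example codes are illustrative, not taken from the production engine.

```javascript
// Illustrative only: the same item arrives as UPC-12 on the item record and
// as GTIN-14 from the partner. Zero-padding both to 14 digits makes them equal.
function normalizeToGtin14(code) {
  const digits = String(code).replace(/\D/g, '')  // keep digits only
  return digits.padStart(14, '0')                 // UPC-12 / EAN-13 -> GTIN-14
}

// normalizeToGtin14('036000291452')   -> '00036000291452'
// normalizeToGtin14('00036000291452') -> '00036000291452'
```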
3. Architecture
We designed a resilient pipeline with three core principles: fail safely, recover automatically, and surface exceptions with context. The new architecture introduces an exception queue as a first-class citizen, retryable Map/Reduce jobs with exponential backoff, and a matching engine that gracefully handles GTIN/UPC variations.
[Diagram: Retail EDI Integration Architecture]
```javascript
// Exception queue handler with context
function handleException(context) {
  const exception = {
    transactionId: context.txnId,
    failureType: context.error.type,
    payload: context.rawPayload,
    attemptCount: context.attempts,
    lastAttempt: new Date().toISOString(),
    suggestedAction: inferAction(context.error)
  }

  record.create({
    type: 'customrecord_edi_exception',
    values: exception
  })

  // Alert only on business-critical failures
  if (isCritical(context.error)) {
    notify.ops(exception)
  }
}
```
Exception handling with context preservation
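The retryable side of the pipeline is easier to see with a concrete policy. The sketch below shows one way the retry-or-queue decision could look; the failure-type labels, attempt budget, and delay values are assumptions for illustration, not the production configuration.

```javascript
// Illustrative retry policy: transient failures get exponential backoff with
// jitter; everything else (or anything past the budget) stays in the queue.
const MAX_ATTEMPTS = 5
const BASE_DELAY_MS = 30 * 1000       // first retry after ~30 seconds
const MAX_DELAY_MS = 30 * 60 * 1000   // cap backoff at 30 minutes
const TRANSIENT_FAILURES = ['RECORD_LOCKED', 'TIMEOUT', 'CONCURRENCY_LIMIT']  // assumed labels

function computeBackoffMs(attempt) {
  // 30s, 60s, 120s, ... with up to 10% jitter so retries do not stampede
  const exponential = Math.min(BASE_DELAY_MS * Math.pow(2, attempt - 1), MAX_DELAY_MS)
  return Math.round(exponential + Math.random() * 0.1 * exponential)
}

function nextAction(exception) {
  if (TRANSIENT_FAILURES.indexOf(exception.failureType) === -1) {
    return { action: 'queue', reason: 'non-transient failure, needs review' }
  }
  if (exception.attemptCount >= MAX_ATTEMPTS) {
    return { action: 'queue', reason: 'retry budget exhausted' }
  }
  return { action: 'retry', delayMs: computeBackoffMs(exception.attemptCount) }
}

// nextAction({ failureType: 'RECORD_LOCKED', attemptCount: 2 })
//   -> { action: 'retry', delayMs: ~60000 }
```

Keeping the decision in one pure function makes it easy to unit test outside NetSuite, which matters when the failure modes only reproduce under load.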
4. Implementation
- Built GTIN/UPC matching engine with fallback logic and fuzzy matching (see the first sketch after this list)
- Implemented Map/Reduce with governance-aware throttling that pauses before hitting limits (see the second sketch after this list)
- Created exception queue with replay capability and suggested actions
- Added comprehensive logging with transaction correlation IDs
- Deployed incrementally with shadow mode validation before cutover
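To make the matching-engine bullet concrete, here is a condensed sketch of a fallback cascade in the spirit described above. The lookup helpers (`findByGtin`, `findBySku`, `fuzzyMatchByDescription`), the confidence values, and the 0.85 threshold are hypothetical stand-ins, not the real implementation.

```javascript
// Hypothetical matching cascade: exact GTIN-14 first, then partner SKU
// cross-reference, then fuzzy description match as a last resort. Anything
// below the confidence floor goes to the exception queue instead of guessing.
function matchItem(line, lookups) {
  const gtin = normalizeToGtin14(line.itemCode)  // helper from the earlier example

  const exact = lookups.findByGtin(gtin)
  if (exact) return { item: exact, method: 'gtin', confidence: 1.0 }

  const bySku = lookups.findBySku(line.partnerSku)
  if (bySku) return { item: bySku, method: 'partner-sku', confidence: 0.9 }

  const fuzzy = lookups.fuzzyMatchByDescription(line.description)
  if (fuzzy && fuzzy.score >= 0.85) {
    return { item: fuzzy.item, method: 'fuzzy', confidence: fuzzy.score }
  }

  return { item: null, method: 'none', confidence: 0 }  // exception queue takes it from here
}
```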
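For the governance-aware throttling, a minimal sketch of the idea as it might sit inside a SuiteScript 2.x reduce stage. The unit threshold, the replay-queue helper, and the posting helper are assumptions; only `runtime.getCurrentScript().getRemainingUsage()` is the standard API.

```javascript
// Sketch of a governance check inside a Map/Reduce reduce stage.
// MIN_UNITS_PER_POSTING and the two helpers below are illustrative assumptions.
define(['N/runtime'], function (runtime) {
  const MIN_UNITS_PER_POSTING = 100  // assumed worst-case units to post one document

  function reduce(context) {
    const postings = context.values.map(function (v) { return JSON.parse(v) })

    for (let i = 0; i < postings.length; i++) {
      const remaining = runtime.getCurrentScript().getRemainingUsage()
      if (remaining < MIN_UNITS_PER_POSTING) {
        // Pause before the limit: park the untouched remainder for replay
        // instead of failing mid-record.
        deferToReplayQueue(postings.slice(i))
        break
      }
      postToNetSuite(postings[i])
    }
  }

  // Hypothetical stubs; the real versions create custom records / transactions.
  function deferToReplayQueue(remainder) { log.audit('deferred', remainder.length) }
  function postToNetSuite(posting) { log.debug('posted', posting.transactionId) }

  return { reduce: reduce }
})
```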
5. Outcome
Three months post-deployment, the team has had exactly one 2AM incident (caused by an upstream partner format change, not our code). The exception queue now handles edge cases gracefully, giving the ops team clear next steps instead of mystery failures. Partner onboarding dropped from weeks to days thanks to the new matching engine.
Lessons Learned
- Shadow mode deployment was critical—caught 3 edge cases before production
- Exception queue design matters more than retry logic
- Governance limits are real constraints that need first-class handling
First written: 2024-06-15 · Last revised: 2025-01-07