Guides

Self-correcting integrations

Stop ordering pipelines by hand. Design topologies that detect missing prerequisites, fix them on the fly, and continue — so saying RUN at the start is enough.


Most "the sync didn't work" stories from production end the same way: "someone forgot to run the categories job before the products job." Two systems, one fragile assumption about the order in which their pipelines fire. Every cron, every manual replay, every onboarding of a new tenant becomes another chance to get the order wrong.

The mindset shift that fixes this for good: stop relying on the order in which jobs run, and design the process itself to handle a missing prerequisite. Done well, you can say RUN at the start of the day and the integration just takes care of itself — including the cases that traditionally needed a runbook. Two systems start to feel like one, instead of two systems with a brittle dependency between their pipelines.

This guide walks through the pattern with a concrete example most teams have shipped at least once: syncing a product catalog into a target where products require existing categories.

The classic fragile setup #

The naive design has two pipelines:

  1. Categories sync — runs first, creates/updates categories in the target.
  2. Products sync — runs second, creates products. Each POST /products is rejected by the target if the product references a category that does not exist there yet.

This works as long as:

  • Categories are always synced first.
  • Categories are fully synced before products start.
  • A product never refers to a brand-new category that didn't exist when categories last ran.
  • No retry, no replay, no onboarding ever shifts the order.

In practice, one of those bullets gets violated within a quarter. The fix is usually not "run them in the right order more carefully" — the fix is to remove the dependency on order altogether.
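To make the brittleness concrete, here is a minimal sketch of the naive design, with a dict-backed stand-in for the target system. `sync_categories`, `sync_products`, and the data shapes are all hypothetical — the point is only that correctness depends entirely on call order:

```python
target_categories = set()   # categories the target system knows about
rejected = []               # products the target rejected

def sync_categories(category_ids):
    # Pipeline 1: creates/updates categories in the target.
    target_categories.update(category_ids)

def sync_products(products):
    # Pipeline 2: the target rejects a product whose category is missing.
    for product in products:
        if product["category_id"] in target_categories:
            pass                      # POST /products succeeds
        else:
            rejected.append(product)  # hard rejection, not a soft warning

# Run in the "right" order and everything works:
sync_categories({"cat-1"})
sync_products([{"sku": "A", "category_id": "cat-1"}])

# A replay that fires products before categories fails on valid data:
sync_products([{"sku": "B", "category_id": "cat-2"}])
```

Nothing in the code enforces the ordering assumption; it lives only in whoever schedules the crons.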

Side-by-side comparison: a fragile two-cron pipeline that depends on manual ordering versus a single self-correcting products topology triggered by RUN that handles missing prerequisites internally
Two ways to wire the same business goal: external ordering vs a topology that repairs its own preconditions.

The five building blocks #

The self-correcting version of the products topology is built from five small pieces. None of them require a special feature; they are the same primitives you already use every day.

1. Detect #

Before pushing a product, check whether the prerequisite (the category) exists in the target: a simple GET /categories/{id} against the target system, returning either exists or does not exist.
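A sketch of the check, treating a 404 as "does not exist" and anything other than 200/404 as a real error. The `get` callable and the `Response` shape are hypothetical stand-ins for whatever HTTP client your topology node uses:

```python
from dataclasses import dataclass

@dataclass
class Response:
    status: int

def category_exists(get, category_id):
    # `get` performs GET /categories/{id} against the target system.
    response = get(f"/categories/{category_id}")
    if response.status == 200:
        return True      # exists: safe to push the product
    if response.status == 404:
        return False     # does not exist yet: trigger the detour
    raise RuntimeError(f"unexpected status {response.status} from target")
```

Keeping "missing" and "broken" distinct matters: only a clean 404 should send the product down the detour branch; a 500 should fail loudly like any other error.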

Done naively, this doubles your traffic against the target API. So the next building block is mandatory.

2. Cache #

Cache the exists answer per category id, with a TTL that matches how often categories actually change in your business (minutes for fast-moving catalogs, hours for stable ones). Now ten products from the same category cost one extra GET, not ten. See Cache for efficient API calls for the three-tier cache pattern this fits into.

A cache hit means exists — push the product. A cache miss means we need to verify, which often resolves to does not exist the first time we encounter a brand-new category.
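A minimal sketch of such a cache, with an injectable clock so the TTL behavior is testable. This is an illustration of the per-id TTL idea, not the platform's built-in cache:

```python
import time

class ExistenceCache:
    """Caches the exists answer per category id, with a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock            # injectable for testing
        self._entries = {}            # category_id -> (exists, expires_at)

    def get(self, category_id):
        entry = self._entries.get(category_id)
        if entry is None:
            return None                          # miss: go verify
        exists, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[category_id]       # stale: verify again
            return None
        return exists

    def put(self, category_id, exists):
        self._entries[category_id] = (exists, self.clock() + self.ttl)

    def invalidate(self, category_id):
        # Used by the "category is now ready" callback in step 4, so the
        # next check re-verifies against the target.
        self._entries.pop(category_id, None)
```

`get` returning `None` (unknown) rather than `False` keeps the three states apart: known-exists, known-missing, and "ask the target".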

3. Detour #

When the check returns does not exist, don't fail the product and don't try to "create the category inline". Instead, route the product down a second branch of the topology that produces a message into a dedicated category-sync topology, asking it to sync exactly that category id. The product itself is not lost — it is parked, waiting for a callback.

This is the part where the integration starts behaving more like one system than two. The products topology no longer assumes anything about what categories already exist; it just says "please make sure this one is there" and waits.
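The branch logic itself stays tiny. In this sketch every argument is a hypothetical stand-in for a topology node: `category_exists(id)` is the cached check, `park(product)` stashes the product for later, and `request_category_sync(id)` produces the message that starts the dedicated category-sync topology:

```python
def route_product(product, category_exists, park, request_category_sync):
    # The detect/detour branch of the products topology.
    category_id = product["category_id"]
    if category_exists(category_id):
        return "push"                     # straight through to the target
    park(product)                         # not failed, not lost: waiting
    request_category_sync(category_id)    # "please make sure this one is there"
    return "detoured"
```

Note there is no error path here: a missing category is a valid outcome with its own branch, not an exception.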

4. Callback #

The category-sync topology fetches that category from the source, upserts it into the target, and ends with a node that produces a message back into the products topology at a defined start event reserved for "category is now ready". The waiting product enters the products topology again, this time with the cache freshly invalidated for that category id, and its check returns exists — so it goes straight through to the target.

A few important notes about this round-trip:

  • Persist what the callback needs. The product payload (or at least an id you can rehydrate from) travels with the message that detours into the category-sync topology, or is stashed in your own state store keyed by the category id. Whichever is simpler in your topology — both work, both survive a platform restart.
  • The callback is just another start event. Nothing exotic about it. The products topology exposes a second start event that operators can also trigger manually (useful when debugging or when the category was created out-of-band by a human in the target system).
  • Failures are visible. If the category sync fails, the failed message lands in the category topology's Trash just like any other failure. The waiting product is still parked; an operator either fixes the category and approves the failed message (which fires the callback), or rejects it (and the product can be moved to its own Trash entry separately).
Swim-lane diagram. Top lane is the products topology with start events A and B; the detect step branches to push-to-target on cache hit or to a parked state on cache miss. The bottom lane is the category-sync topology that fetches and upserts the category, then produces a callback message back into start event B in the products topology, which resumes the parked product.
Detour and callback round-trip: solid arrows are in-process flow, dashed arrows are cross-topology messages.
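The callback handler can be sketched in a few lines. Here `cache` maps category ids to cached exists answers, `parked` maps category ids to waiting products, and `reenter(product)` sends a product back through the detect step — all three are hypothetical stand-ins for your own state store and topology wiring:

```python
def on_category_ready(category_id, cache, parked, reenter):
    # The "category is now ready" start event of the products topology.
    cache.pop(category_id, None)             # invalidate: force a fresh check
    for product in parked.pop(category_id, []):
        reenter(product)                     # this time the check finds the category
```

Because the handler only invalidates and re-enters, triggering it manually (for a category created out-of-band) is safe: the worst case is one extra verify against the target.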

5. Coalesce #

The naive version of the detour pattern has a thundering-herd problem: if 5,000 products in the same batch reference the same brand-new category, all 5,000 will independently miss the cache, detour, and trigger 5,000 sync requests for the same category. The category-sync topology will dedupe most of them, but the work was wasted at every layer.

The fix: when a product detours because of category id X, also park every other product currently in flight that references the same id. The simplest version is a queue keyed by category id; while there is a pending sync for X, products with category X go into that queue instead of independently checking the cache. When the callback for X fires, the queue drains in one go.

This usually halves or quarters the volume hitting the target during catalog onboarding, without any code changes in the target system or the source.

Two-panel comparison. Without coalesce, five products that all need the same missing category each independently trigger a category sync, producing one productive sync request and four wasted ones. With coalesce, only the first product triggers the sync; the other four are parked in a wait queue keyed by the category id and a single callback unblocks all five.
Coalescing siblings behind the in-flight sync: one productive sync, the rest ride for free.
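A sketch of the queue keyed by category id. `request_sync` is a hypothetical stand-in for producing the category-sync message; everything else is plain bookkeeping:

```python
class Coalescer:
    """Parks every product whose category sync is already in flight,
    so one missing category triggers exactly one sync request."""

    def __init__(self, request_sync):
        self.request_sync = request_sync
        self._waiting = {}               # category_id -> parked products

    def on_missing(self, category_id, product):
        queue = self._waiting.get(category_id)
        if queue is None:
            self._waiting[category_id] = [product]
            self.request_sync(category_id)   # only the first product triggers it
        else:
            queue.append(product)            # siblings ride for free

    def on_ready(self, category_id):
        # Called from the callback: drain the whole queue in one go.
        return self._waiting.pop(category_id, [])
```

The existence of a queue for X doubles as the "sync in flight for X" flag, so no separate lock or status field is needed.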

What this looks like in operation #

The end-to-end shape is intentionally boring:

  1. Product arrives in the products topology.
  2. Detect checks the cache, then optionally the target.
  3. If exists, push to target. Done.
  4. If does not exist, detour the product (and coalesce its siblings) into a wait queue and trigger the category-sync topology.
  5. Category-sync runs, then calls back into the products topology when ready.
  6. Waiting products re-enter, hit the cache that is now warm, and push to target.

In the topology heatmap the slot where the detour happens is not a failure — it is a valid branch with its own outcome. In the Limiter view, the queue of products waiting on a category callback is visible by topology, so operators can see "products topology has 5,000 messages parked waiting for category sync" without having to read logs. Failures, if any, surface in Trash through the normal Approve / Edit / Reject workflow — see Operational visibility.
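The six steps above can be tied together in one small dispatch function. Every name here is a hypothetical stand-in: `cache` is a dict of category id to exists answer, `target_has(id)` is the GET against the target, `detour(id, product)` parks, coalesces, and triggers the category sync, and `push(product)` writes to the target:

```python
def handle_product(product, cache, target_has, detour, push):
    # One product's trip through the self-correcting topology.
    cid = product["category_id"]
    exists = cache.get(cid)          # step 2: check the cache...
    if exists is None:
        exists = target_has(cid)     # ...then optionally the target
        cache[cid] = exists
    if exists:
        push(product)                # step 3: exists, push, done
        return "pushed"
    detour(cid, product)             # steps 4-5: park and trigger the sync
    return "parked"
```

When the callback later fires, the parked products simply go through `handle_product` again with a freshly invalidated cache entry — step 6 is the same code path as step 1.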

When (not) to use it #

Use this pattern when:

  • The target system enforces a prerequisite that the source can produce on demand (categories, parent products, accounts, vendors, tax codes).
  • Out-of-order arrival is realistic — webhooks fire as users edit, batch loads include brand-new entities, replays happen.
  • The cost of a missed prerequisite is a hard error (rejection, half-created entity, audit incident), not just a soft warning.

Don't reach for it when:

  • The target can auto-create the prerequisite from the same call (POST /products will silently create the category for you). Save yourself the topology and let the target handle it.
  • The prerequisite has hundreds of dependencies of its own. At that point, a real pre-sync of the master data is cheaper than detouring per item.
  • The prerequisite is volatile (it changes faster than any TTL you could safely cache). Hold a stale exists answer for too long and you'll merrily push products into a category that was deleted thirty seconds ago.

The bigger picture #

This is what we mean when we talk about smart processes. The integration is no longer a script that has to be triggered in the right order; it is a graph that knows its own preconditions and how to recover from missing ones. The day-to-day operator command shrinks to RUN, and the platform handles the cases that used to be runbook items.

Two systems still have two APIs and two data models. But the topology between them, designed this way, behaves like a single system — with the integrity, ordering, and self-healing that a single system would have given you for free.