High-Volume Catalog Reconciliation — Ten Million Records, Fifteen Hundred Real Changes
How Orchesty reconciles ten-million-row catalogs every night by combining the orchesty/comparator-worker with a paginated source pull, turning a brute-force full snapshot into a clean delta without help from the source system.
A surprising amount of integration work is wasted re-confirming the same facts every night. A nightly job pulls ten million products from a PIM, transforms each one, and pushes it into the e-shop and the marketplace — even though, on a typical day, fifteen hundred of those products actually changed. The other 9,998,500 are write traffic the destination didn't need, audit-log noise the operations team has to filter through, and rate-limit budget that could have gone to something useful.
This use case shows how Orchesty handles that exact scenario at the ten-million-row scale, using the orchesty/comparator-worker that ships with the platform. The pattern is the same one described in the data comparator guide; this article focuses on what it looks like in production when the numbers get large.
Why the source rarely helps #
In an ideal world the PIM (or ERP, or supplier feed) would tell you "here is what changed since last night". In practice, large catalog systems often:
- Don't expose a reliable `lastUpdated` filter, or expose one that misses bulk imports done by other integrations.
- Don't have webhooks, or have webhooks that drop messages under load and were never designed as a system of record.
- Have a CDC stream you can theoretically subscribe to, but only after a six-month vendor change request.
The pragmatic answer is to stop asking the source for help and let Orchesty figure out what changed by comparing this run's snapshot against the last one. The Comparator worker does exactly that, against a Redis-backed snapshot it maintains itself.
The shape of the topology #
A nightly cron triggers a paginated pull from the source, each item is mapped into the target shape, and the Comparator Filter drops everything that has not actually changed:
- Source pull — a paginated batch fetch (see the pagination guide). One process message per item; ten million messages over the night.
- Mapper — converts each item from the source shape to the target shape. The comparator runs after the mapper, so it judges changes on the form that actually matters for the destination. If the source toggles a field that you don't even forward, the comparator drops it as unchanged.
- Comparator Filter — the `orchesty/comparator-worker` node. Compares every incoming item against the per-collection snapshot in Redis (keyed by `masterKey` and `idField`) and forwards only the items whose payload differs.
- Downstream write — `POST`/`PATCH`/`DELETE` to the e-shop, marketplace, ERP, or whatever the target is. Sees roughly the 0.015% of messages that actually represent change.
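The per-item decision the filter makes can be sketched in a few lines of TypeScript. This is an illustrative model, not the worker's actual API: an in-memory map stands in for the Redis snapshot, and `compare` is a hypothetical name.

```typescript
import { createHash } from "crypto";

// Stand-in for the Redis-backed snapshot (really keyed by masterKey + idField).
const snapshot = new Map<string, string>();

// Hash the mapped (target-shape) payload — the comparison runs after the mapper,
// so only fields that actually reach the destination influence the hash.
function payloadHash(item: Record<string, unknown>): string {
  return createHash("sha256").update(JSON.stringify(item)).digest("hex");
}

// Forward only items whose payload differs from the stored snapshot entry.
function compare(id: string, item: Record<string, unknown>): boolean {
  const hash = payloadHash(item);
  if (snapshot.get(id) === hash) return false; // unchanged — drop
  snapshot.set(id, hash);                      // new or changed — store and forward
  return true;
}
```

On the first run every `compare` call returns `true` (the snapshot is empty — the bootstrap write storm); on the next run, unchanged items return `false` and are dropped before the destination ever sees them.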
Configuration that matters at scale #
Every Comparator Filter node takes a configuration block. Three keys decide whether the integration scales cleanly or drowns in false positives:
| Key | What it does at this scale |
|---|---|
| `masterKey` | The Redis key namespace for this collection. Use one per source × target pair (e.g. `pim-products-eshop`); never reuse across topologies that compare different shapes. |
| `idField` | Path to the unique id inside each item. A wrong id field is the single most common reason a comparator returns "everything changed" the first night you turn it on. |
| `excludedFields` | Paths to ignore. Volatile fields (`updatedAt`, server-side counters, `lastViewedAt`, computed totals) will mark every record as changed if they end up in the comparison hash. Be ruthless here. |
Plus three operational keys that matter at the ten-million scale:
| Key | When you'll need it |
|---|---|
| `lock` | Prevents two runs of the same topology overlapping on the same `masterKey` — important when a nightly run takes longer than expected and the next one is about to start. |
| `ttl` | Per-snapshot TTL in seconds. Leave `null` (keep until manually invalidated) when the source is reliable; set it when an old snapshot is worse than rebuilding. |
| `skipComparison` | Returns every item as-is, ignoring the cache. Useful when the destination has lost data and you need a one-off forced resync without wiping the snapshot (the cache is then valid again from the next normal run). |
Detecting deletions #
A comparator cannot tell that something has been removed from the source unless it knows what the source currently contains in full. The worker handles this through three coordinated fields you set on the source connector:
- `deleted: true` — opt in to deletion detection.
- `totalCount: 10000000` — total number of items in the entire run (set on every message).
- `isLast: true` — set on the final batch only.
Concretely: ten million products paginated 1,000 per page is 10,000 process messages. Set `totalCount: 10000000` on every one, and `isLast: true` only on the ten-thousandth. When the worker sees `isLast`, it now has the complete universe of ids for this run and can return the ids that were in the previous snapshot but not in this one — those are the deletes. They flow downstream as a separate stream you route to your DELETE endpoint.
This is genuinely hard to implement correctly by hand on a stream of paginated batches. With the worker, it is one boolean and a counter.
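The universe-diff the worker performs can be sketched as follows. This is an illustrative TypeScript model under stated assumptions — an in-memory set stands in for the previous Redis snapshot, and `onBatch` is a hypothetical name, not the worker's API; only the three message fields mirror the real configuration:

```typescript
// Ids present in the previous run's snapshot (stand-in for Redis state).
const previousIds = new Set(["sku-1", "sku-2", "sku-3"]);

// Ids observed across this run's paginated batches.
const seenThisRun = new Set<string>();

// Called once per process message (one paginated batch). Returns the deleted
// ids once the full universe has been observed (isLast), otherwise null.
function onBatch(ids: string[], totalCount: number, isLast: boolean): string[] | null {
  for (const id of ids) seenThisRun.add(id);
  if (!isLast) return null;
  if (seenThisRun.size !== totalCount) {
    // A dropped batch would make the "universe" incomplete — refuse to guess.
    throw new Error(`expected ${totalCount} items, saw ${seenThisRun.size}`);
  }
  // In the old snapshot but not in this run => deleted at the source.
  return [...previousIds].filter((id) => !seenThisRun.has(id));
}
```

The `totalCount` check is the part hand-rolled versions usually skip: without it, a single dropped page silently turns every missing product into a "delete".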
Force-resync without losing the snapshot #
The day the destination loses data and asks you to "send everything again", you have two escape hatches:
- Comparator Invalidate node. A small one-off topology (or a manually triggered branch) wipes a `masterKey` for the whole collection, or a single `externalId` for one record. The next regular run rebuilds the snapshot from scratch.
- `skipComparison: true` on the regular Comparator Filter. The worker forwards every item this run, ignoring the cache, but the cache stays valid. The next normal run resumes filtering against the snapshot.

Use Invalidate when the snapshot itself is suspect (schema drift, `ttl` mistake). Use `skipComparison` when the destination is the one that lost data and the snapshot is fine.
Why a custom PHP comparator falls over here #
Plenty of teams have written a "compute the SHA-256 of each row, compare it to last night's hash table" script in PHP, Node.js, or Python. For a few thousand rows it works fine; the data comparator guide keeps both the Node.js and PHP samples for that exact case.
At ten million rows, three things change:
- Throughput. PHP per-request memory pressure, lack of true async I/O, and the cost of opcache warmups make every per-record comparison expensive. The Comparator worker is written in TypeScript, runs as a long-lived process, and amortizes connection setup across the whole run.
- State. A naive script keeps last night's hashes in a file or a database. Reading and updating ten million rows from a relational table on every run becomes the bottleneck the comparator was supposed to remove. The worker keeps its snapshot in Redis, where per-key reads and updates are cheap enough to keep pace with the run.
- Deletion detection. Hand-rolled comparators almost always punt on deletes — "we'll add it later". Then a year later a deleted product sits in the e-shop catalog because nobody hooked up the universe-aware logic. The worker has it built in.
The general rule: if the run is small and the language is fixed (e.g. an existing PHP shop with one nightly job), a custom comparator is fine and the guide shows you how. If you are pulling six-figure-or-larger snapshots regularly, use the worker.
Operational notes #
- Bootstrap. The first run is a write storm — the cache is empty, so every item looks like a create. Plan for it. Either run the bootstrap in a low-traffic window with destination rate limits cranked up, or use `skipComparison` for the first run with a downstream that is idempotent.
- Field selection discipline. Run the comparator in shadow mode first (drop output before the destination) and look at the created/updated counts after a few runs. If "updated" is suspiciously close to "total items processed", you have a volatile field leaking into the hash — add it to `excludedFields` and re-run.
- Concurrency. Two topologies that both write to the same `masterKey` will corrupt each other's snapshot. Use `lock: true` on the Comparator Filter to make the second one wait, or split the `masterKey` so they don't collide in the first place.
- Memory and Redis sizing. Snapshots scale with item count and average payload size after `excludedFields`. Ten million products × ~300 bytes per stored hash + metadata fits comfortably in single-digit GBs. Provision Redis accordingly and set `maxmemory-policy noeviction` — eviction would silently break delta detection.
- Audit. Every dropped message still generates a process trail in Orchesty, so the audit log shows "this item was checked, it was unchanged, exiting cleanly". The downstream audit is the one that gets quiet — only the 1,500 real changes appear, which is the entire point.
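The "volatile field leaking into the hash" failure mode from the field-selection note above is easy to demonstrate. A minimal sketch, not the worker's internal hashing — `comparisonHash` is hypothetical and handles only top-level fields, whereas the real `excludedFields` takes paths:

```typescript
import { createHash } from "crypto";

// Drop excluded top-level fields before hashing (nested paths omitted
// to keep the sketch short).
function comparisonHash(item: Record<string, unknown>, excludedFields: string[]): string {
  const kept = Object.fromEntries(
    Object.entries(item).filter(([key]) => !excludedFields.includes(key))
  );
  return createHash("sha256").update(JSON.stringify(kept)).digest("hex");
}

// Same product on two consecutive nights — nothing meaningful changed,
// but the source bumped updatedAt anyway.
const monday  = { sku: "p-1", price: 100, updatedAt: "2024-05-06T01:00:00Z" };
const tuesday = { sku: "p-1", price: 100, updatedAt: "2024-05-07T01:00:00Z" };

// With updatedAt in the hash, the unchanged product looks changed every night.
const naive = comparisonHash(monday, []) !== comparisonHash(tuesday, []);
// With it excluded, the hashes match and the item is correctly dropped.
const stable = comparisonHash(monday, ["updatedAt"]) === comparisonHash(tuesday, ["updatedAt"]);
```

One leaked timestamp is the difference between 1,500 writes a night and 10,000,000 — which is why the shadow-mode check on the created/updated counts is worth the extra runs.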
Summary of results #
- Write traffic cut by ~99.985% on a typical night — 1,500 writes instead of 10,000,000.
- Same topology handles bootstrap and steady-state. The first run does the full write because the cache is empty; every subsequent run filters down to the delta.
- Deletes are first-class — no extra topology, no manual reconciliation, just
deleted: true+totalCount+isLast. - Two operational escape hatches —
Comparator Invalidatefor snapshot problems,skipComparisonfor destination problems, neither requiring a code change. - No source cooperation required — the source still sends the full snapshot every night; the platform turns it into a delta on the way through.
Related #
- Data comparator guide — the underlying pattern in depth, including the `orchesty/comparator-worker` reference and the roll-your-own Node.js / PHP samples for smaller cases.
- Pagination guide — the upstream batch fan-out that feeds the Comparator Filter one item at a time.
- ID mapping guide — typically runs before the comparator to enrich each item with the destination id.
- Eshop synchronization use case — the same Comparator pattern applied in a smaller, multi-stream e-commerce context.
- Operational visibility — Limiter view and dashboards for keeping an eye on the bootstrap write storm.