Handling large datasets with Orchesty
Streaming, persistence, message-level error handling, parallel vs. ordered processing, and batch pagination — the patterns that let Orchesty move millions of records reliably.
Moving large volumes of data is where naive integrations fall apart. A script that comfortably processes 1,000 records will quietly die on 10 million: it runs out of memory, hits an API timeout halfway through, swallows a single bad record and corrupts the rest, or finishes "successfully" with half the data missing. Anyone who has rebuilt the same nightly job for the third time knows the feeling.
Orchesty is built for the opposite assumption: that data volumes will grow, that source systems will misbehave, and that the question is not if something fails but when. This guide walks through the concepts that make large-dataset processing safe and predictable on the platform: streaming, persistence, message-level error handling, parallelism control, and the Batch node for pagination.
1. Streaming Beats Loading
The first shift is conceptual. Most integrations are written as if data is a thing you load: pull all 500,000 orders, hold them in memory, transform them, push them to the destination. That model breaks the moment the dataset stops fitting into RAM, or the moment a single network blip forces you to start over.
Orchesty treats data as a thing that flows. A topology is a pipeline of small, independent nodes connected by queues. A record enters at one end, moves through transformations and enrichments, and lands at the destination. The system never has to hold the whole dataset in memory at once: only as much as the current step is actively processing.
The benefits are immediate:
- Memory stays flat regardless of dataset size. Processing 10 GB looks the same as processing 10 MB from the worker's point of view.
- Throughput is bounded by your slowest step, not by your largest record set.
- You see results as they happen: the first records reach the destination while the source is still being read.
This is what people usually mean when they say "streaming architecture", but in Orchesty it is the default execution model, not an opt-in optimization.
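To make the contrast concrete, here is a plain TypeScript sketch outside Orchesty. fetchPage and writeRecord are hypothetical stand-ins for a paginated source and a destination API; the point is where the memory goes, not the specific calls.

```typescript
// Hypothetical stand-ins for a paginated source and a destination API.
interface Page { records: string[]; nextCursor?: number }

const SOURCE = Array.from({ length: 10_000 }, (_, i) => `record-${i}`);

async function fetchPage(cursor = 0): Promise<Page> {
  const records = SOURCE.slice(cursor, cursor + 100);
  const next = cursor + 100;
  return { records, nextCursor: next < SOURCE.length ? next : undefined };
}

async function writeRecord(record: string): Promise<void> {
  // push one record to the destination
}

// "Load" model: the whole dataset sits in memory before anything is written.
async function loadThenWrite(): Promise<void> {
  const all: string[] = [];
  let cursor: number | undefined = 0;
  while (cursor !== undefined) {
    const page: Page = await fetchPage(cursor);
    all.push(...page.records);              // memory grows with the dataset
    cursor = page.nextCursor;
  }
  for (const record of all) await writeRecord(record);
}

// "Flow" model: one page in memory at a time; first results land immediately.
async function streamThrough(): Promise<void> {
  let cursor: number | undefined = 0;
  while (cursor !== undefined) {
    const page: Page = await fetchPage(cursor);
    for (const record of page.records) await writeRecord(record); // writing while still reading
    cursor = page.nextCursor;
  }
}
```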
2. Load Distribution Comes for Free
Because every step in a topology consumes from a queue, you can scale any step independently by adding more worker instances. Orchesty distributes incoming messages across them automatically.
In practice this means:
- A slow transformation step can be scaled up without touching the rest of the topology.
- A burst of incoming data is absorbed by the queue and processed as fast as the workers can handle it, instead of overwhelming downstream systems.
- Adding capacity is a configuration change, not a rewrite.
The same model that makes Orchesty horizontally scalable also makes it predictable under load: queues smooth out spikes, and back-pressure protects the systems you integrate with from being hammered.
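Under the hood this is the classic competing-consumers pattern on RabbitMQ, which Orchesty manages for you. A bare amqplib sketch of the mechanism, with an illustrative queue name and workers:

```typescript
import * as amqp from 'amqplib';

// Illustration of the underlying mechanism, not Orchesty configuration:
// several identical workers consume the same queue, and the broker spreads
// the messages across them.
async function startWorker(name: string): Promise<void> {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertQueue('transform-step', { durable: true });

  await channel.consume('transform-step', (msg) => {
    if (msg === null) return;
    console.log(`${name} handling ${msg.content.toString()}`);
    channel.ack(msg);
  });
}

// Scaling the step = starting more identical workers on the same queue.
void startWorker('worker-1');
void startWorker('worker-2');
```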
3. Persistence at Every Step
Speed is worthless if a crash means starting from scratch. Orchesty persists every message in transit. When a node finishes processing, the result is written to the next queue before the input message is acknowledged. If a worker dies mid-processing, the message is redelivered to another instance and resumes from the last persisted point.
That single guarantee changes how you design pipelines:
- Long-running jobs are safe to interrupt. Restarting a worker, deploying a new version, or losing a node never loses data.
- You can resume, not restart. A failure on record 487,000 of 500,000 does not mean re-reading the source. The platform already knows which records are done and which are still in flight.
- State lives in the platform, not in the worker. Workers can be stateless and disposable, which is what makes scaling trivial.
The trade-off is a tiny per-message overhead. In return, you stop writing checkpointing logic by hand for every integration.
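For intuition, here is a simplified amqplib sketch of that ordering. It is not Orchesty's internal code, and a production version would also wait for publisher confirms before acknowledging, but it shows why a crash between the two steps cannot lose a message:

```typescript
import * as amqp from 'amqplib';

// Simplified sketch of the guarantee: the result is written to the next
// queue before the input is acknowledged. If the worker dies in between,
// the broker redelivers the input message to another instance.
async function runNode(): Promise<void> {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertQueue('step-in', { durable: true });
  await channel.assertQueue('step-out', { durable: true });

  await channel.consume('step-in', (msg) => {
    if (msg === null) return;
    const result = transform(msg.content);                         // the node's actual work
    channel.sendToQueue('step-out', result, { persistent: true }); // persist the result first
    channel.ack(msg);                                              // only then ack the input
  });
}

function transform(input: Buffer): Buffer {
  return Buffer.from(input.toString().toUpperCase()); // placeholder transformation
}
```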
4. Message-Level Error Handling
The classic batch script has one big failure mode: the run failed. You don't know which records made it, which didn't, and whether re-running will create duplicates.
Orchesty handles errors at the level of the individual message. When a node throws, only that message is affected. The rest of the dataset keeps flowing through the topology, and the error is recorded against that specific record rather than the whole job.
Configurable retry on every connector
Every connector has its own retry configuration: how many attempts, with what backoff, and on which kinds of errors. Transient problems (a timeout, a 502, a momentary network blip) are absorbed automatically without ever surfacing to an operator. Only messages that exhaust their retries continue down the failure path.
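As a sketch of what such a policy does behind the scenes (the parameter names and the transient-error check below are illustrative, not Orchesty's configuration schema):

```typescript
// Sketch of a connector retry policy. maxAttempts, baseDelayMs and the
// transient-error check are illustrative; in Orchesty they are configuration
// on the connector, not hand-written code.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // A non-transient error, or exhausted retries, continues down the failure path.
      if (!isTransient(err) || attempt >= maxAttempts) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1); // exponential backoff
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

function isTransient(err: unknown): boolean {
  // Network blips, timeouts, 429s and 5xx responses are worth retrying;
  // other 4xx errors (bad payloads, validation failures) are not.
  const status = (err as { status?: number }).status;
  return status === undefined || status === 429 || status >= 500;
}
```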
Failed messages land in Trash
Messages that ultimately fail are not lost. They land in the Trash, a dedicated collection of failed messages tied to the topology and the exact node where they failed. From there you can:
- Inspect the message payload and the error that caused it.
- Edit and replay it from the same point in the topology where it failed, without re-running the upstream steps.
- Discard it if it's not worth processing.
- Replay in bulk. If you know a batch of messages failed because a downstream service was down and is now back up, you can release them all at once from the same node and let them continue through the rest of the pipeline.
This is the difference between "the nightly sync broke, debug tomorrow" and "42 records out of 500,000 need attention, the rest are already in the destination".
Built-in rate limiter for batch processing
Large batches put pressure on the APIs you call. Orchesty includes a global limiter that enforces rate limits at the platform level: you declare how many requests per second a given service tolerates, and the platform throttles outgoing calls across all connectors and all topologies that talk to that service.
For batch processing this matters a lot. Without a limiter, scaling up workers means scaling up the rate at which you hit the source or destination API, which usually ends in 429s, throttling, or being blocked. With the limiter, you can crank parallelism as high as you want and trust that the platform will smoothly pace the actual outbound traffic to stay within the limits the API publishes.
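Conceptually the limiter behaves like a token bucket shared by every connector that talks to the same service. A minimal TypeScript sketch, with an illustrative rate and a hypothetical API host:

```typescript
// Minimal token-bucket sketch of what a platform-level limiter does.
// Orchesty's limiter is declared per service, not written by hand; the
// numbers and names here are illustrative.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private readonly ratePerSecond: number) {
    this.tokens = ratePerSecond;
  }

  async take(): Promise<void> {
    for (;;) {
      const now = Date.now();
      this.tokens = Math.min(
        this.ratePerSecond,
        this.tokens + ((now - this.lastRefill) / 1000) * this.ratePerSecond,
      );
      this.lastRefill = now;
      if (this.tokens >= 1) { this.tokens -= 1; return; }
      await new Promise((resolve) => setTimeout(resolve, 50)); // wait for a refill
    }
  }
}

// All callers of the same service share one bucket, so adding workers
// increases concurrency without increasing the outbound request rate.
const serviceLimit = new TokenBucket(10); // e.g. the API allows 10 req/s

async function callApi(path: string): Promise<Response> {
  await serviceLimit.take();
  return fetch(`https://api.example.com${path}`); // hypothetical service
}
```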
5. Parallel vs. Ordered Processing
Most large datasets can be processed in parallel: each record is independent, and the more workers you throw at the queue, the faster you finish. Some can't: events for the same customer, the same order, or the same inventory item must be applied in the order they happened, otherwise the final state is wrong.
Orchesty lets you choose, per node, with a single setting: prefetch. It applies to consumer nodes (Connector, Batch, Custom Action) and accepts values from 1 to 20.
- Prefetch 1 (default, ordered): each worker takes one message at a time and finishes it before taking the next. The original order of events is preserved end-to-end. Slower, but correct for ordered streams.
- Higher prefetch (parallel): the platform pulls several messages at once and the worker processes them concurrently. Throughput grows roughly with the prefetch value, with no ordering guarantee. This is the right setting for the vast majority of integrations once you've confirmed the consumer is order-insensitive.
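For readers who know RabbitMQ: the setting maps to the broker's per-consumer prefetch. A bare amqplib illustration, with a hypothetical queue name and handler:

```typescript
import * as amqp from 'amqplib';

// What Orchesty's prefetch setting maps to underneath, shown with plain
// amqplib. In Orchesty you set the number on the node instead.
async function startConsumer(prefetchCount: number): Promise<void> {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertQueue('orders', { durable: true });

  // prefetch 1: one unacknowledged message at a time, order preserved.
  // prefetch 20: up to 20 in flight concurrently, no ordering guarantee.
  await channel.prefetch(prefetchCount);

  await channel.consume('orders', async (msg) => {
    if (msg === null) return;
    await handleOrder(msg.content); // the node's work
    channel.ack(msg);
  });
}

async function handleOrder(content: Buffer): Promise<void> {
  // process one order record
}
```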
You don't need to write your own queue logic, locking, or sequencing. Picking the right prefetch — and republishing the topology so the bridge picks it up — is the entire configuration. The detailed sizing guide and the republish flow live in Operations: Prefetch.
6. Pagination with the Batch Node
Streaming solves what to do with data once you have it. Pagination solves how to get it from source systems that won't hand it to you all at once.
Almost every modern API exposes data page by page (cursor, offset, token). Doing this by hand is a recipe for two bugs: forgetting to fetch the last page, or fetching the same page twice. Orchesty provides a dedicated Batch node for this.
A Batch node is a smart iterator. It calls the source, emits records into the topology, and tells the platform whether there is another page to fetch. The platform calls it again with the cursor it returned. Records start flowing downstream while the next page is still being fetched, so the rest of the topology is already working while pagination continues in the background.
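The contract looks roughly like this. The shape below is a hedged sketch, not the exact SDK signature (the real interface lives in the Pagination & Batch documentation linked at the end of this section), and the source API is hypothetical:

```typescript
// Hedged sketch of the Batch node contract: one page per invocation, with
// the cursor handed back to the platform. Illustrative shape only.
interface BatchResult {
  messages: unknown[];   // records emitted into the topology right away
  nextCursor?: string;   // present means "call me again with this cursor"
}

async function fetchOrdersBatch(cursor?: string): Promise<BatchResult> {
  const url = new URL('https://api.example.com/orders'); // hypothetical source API
  if (cursor !== undefined) url.searchParams.set('cursor', cursor);

  const response = await fetch(url);
  const body = (await response.json()) as { items: unknown[]; next?: string };

  return {
    messages: body.items,   // these start flowing downstream immediately
    nextCursor: body.next,  // undefined on the last page ends the iteration
  };
}
```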
You also decide how records leave the Batch node:
- One message per record for maximum parallelism (every downstream worker can grab one).
- Chunks of N when the destination has a bulk endpoint (one message carries a small array).
- One message per page when downstream needs to see the page as a whole.
The Batch node makes "loop until there are no more pages" disappear from your code. The platform handles the iteration, the persistence between pages, and the resume-on-failure semantics. The detailed SDK reference lives in the Pagination & Batch documentation.
7. Common Patterns
A few recurring shapes worth naming, because they cover most large-dataset use cases:
Full sync with delta follow-up
A one-time Batch pulls everything (initial load), then a scheduled topology pulls only what changed since the last run (delta). The platform stores the cursor or timestamp for you between runs. This is the standard pattern for replicating a source into a target.
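A sketch of the delta half, with a Map standing in for the cursor storage the platform provides between runs and a hypothetical updated_since filter on the source API:

```typescript
// Sketch of the delta pattern. The Map stands in for platform-managed
// cursor storage; the updated_since parameter is a hypothetical source
// API filter.
const cursorStore = new Map<string, string>();

async function deltaSync(): Promise<void> {
  const since = cursorStore.get('orders:lastRun') ?? '1970-01-01T00:00:00Z';
  const startedAt = new Date().toISOString();

  const response = await fetch(
    `https://api.example.com/orders?updated_since=${encodeURIComponent(since)}`,
  );
  const changed = (await response.json()) as unknown[];

  for (const record of changed) {
    await emitDownstream(record); // hand each changed record to the topology
  }

  // Advance the cursor only after the run succeeds, so a failed run is
  // retried from the same point instead of silently skipping records.
  cursorStore.set('orders:lastRun', startedAt);
}

async function emitDownstream(record: unknown): Promise<void> {
  // write to the target / publish into the next queue
}
```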
Staged transformation
Heavy transformations are split across multiple nodes (parse, normalize, validate, enrich, write). Each stage is a queue boundary, which means each stage can be scaled, retried, and monitored independently. A bottleneck shows up as a growing queue at one specific step, which makes performance tuning a measurement exercise instead of a guessing game.
Architect's note: think in records, not in jobs
The biggest mindset shift when moving from scripts to Orchesty is to stop designing the job and start designing the journey of one record. If a single record can be processed safely, in isolation, with a clear definition of "done", then a million records can be processed too: in parallel, resumable, and observable. Most large-dataset problems become simple once you stop thinking about the batch and start thinking about the unit.
Next Steps
Explore other topics in the Learn section or check out our Documentation.