Pagination & Batch: Handling Large Datasets

When dealing with enterprise-scale data, the biggest enemy is memory exhaustion. Trying to pull 100,000 records from an API and process them in a single synchronous loop is a recipe for a system crash.

Orchesty solves this through the Batch Component, a specialized producer designed to paginate through data sources and distribute the workload across the platform's asynchronous pipeline.

1. The Anatomy of a Batch Process #

A Batch connector in Orchesty doesn't just "fetch data." It acts as a smart iterator. Instead of returning a massive array, it communicates with the orchestration layer using specific SDK methods to define how data should be "shredded" and when the next page should be fetched.
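
To make the shape of such a connector concrete, here is a minimal sketch of a single batch iteration. The class name, base class, endpoint, and response fields are illustrative placeholders; a cursor getter such as getBatchCursor() is assumed to exist alongside the SDK methods described in the sections below.
Node.js
// Minimal sketch of one batch iteration (class, endpoint, and field names are illustrative).
class OrdersBatch extends ABatchNode {
    async processAction(dto) {
        // Resume from the stored cursor, or start at page 1 on the first run.
        const page = Number(dto.getBatchCursor('1'));

        // Fetch exactly one page from the source system.
        const res = await fetch(`https://api.example.com/orders?page=${page}`);
        const body = await res.json();

        // Shred the page: every order becomes an independent downstream message.
        body.data.forEach((order) => dto.addItem(order));

        // More pages? Ask the orchestration layer to run this node again.
        if (body.hasNextPage) {
            dto.setBatchCursor(String(page + 1));
        }

        return dto;
    }
}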

2. Output Methods: Controlling Data Atomization #

The way you output data from a Batch determines how much parallel pressure you put on your downstream nodes. The Orchesty SDK provides three primary methods:

A. addItem(): The Granular Approach #

This is the foundation of Atomization. Every time you call addItem(), you create a new, independent message in the downstream queue.

  • Use Case: You fetch a page of 100 items and want each item to be processed individually and in parallel.
  • Power Move: You can even "chunk" your data. If you have 100 items but want to process them in groups of 10 to respect a downstream API's batch limit:
Node.js
const items = body.data; // 100 items from API
const chunkSize = 10;
for (let i = 0; i < items.length; i += chunkSize) {
    dto.addItem(items.slice(i, i + chunkSize));
}

Result: Downstream nodes receive 10 messages, each containing 10 items.

B. setItemList(): The Flattening Tool #

This is a convenience method for handling arrays. It has two modes (both sketched in the example after this list):

  • setItemList(items) (Default): It automatically calls addItem() for every single item in the list. This "shreds" the array into N separate messages.
  • setItemList(items, true) (As Batch): It wraps the entire list into a single message. Use this if the downstream node expects a bulk input.
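
A minimal sketch of the two modes, assuming the fetched page is already available in body.data; a real connector would pick one of them:
Node.js
const items = body.data;

// Default mode: shreds the array, turning every element into its own message.
dto.setItemList(items);

// "As batch" mode: wraps the whole array in a single message for bulk-oriented followers.
// dto.setItemList(items, true);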

C. setBatchCursor(): The Infinite Loop #

This is how you handle pagination without hitting timeouts. By setting a cursor, you tell Orchesty: "I'm done with this page, but there's more. Run me again with this token."
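
The cursor itself is just a string, so it can carry whatever your source API hands back, such as an opaque continuation token. A hedged sketch follows (the nextToken field and the cursor getter are assumptions):
Node.js
// Hypothetical response shape: { data: [...], nextToken: 'abc123' | null }
const token = dto.getBatchCursor('');              // empty string on the first run
const url = `https://api.example.com/orders${token ? `?cursor=${token}` : ''}`;
const body = await (await fetch(url)).json();

body.data.forEach((item) => dto.addItem(item));

if (body.nextToken) {
    dto.setBatchCursor(body.nextToken);            // "run me again with this token"
}
// No nextToken means no cursor is set, and the batch finishes after this page.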

3. Strategic Iteration: The iterateOnly Flag #

This is a critical architectural choice for developers. When you call setBatchCursor(cursor, iterateOnly), you decide the flow of data:

  • setBatchCursor(cursor) (Default): Fetches the next page AND simultaneously sends the current items to followers. Best for real-time processing, where data starts flowing through the topology while the batch is still fetching the next pages.
  • setBatchCursor(cursor, true): Fetches the next page but DOES NOT send data to followers yet. Best for aggregation, when you need to fetch all pages first (e.g., to calculate a sum or sort the data) before passing anything on.
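
As an illustration of the aggregation mode, the following sketch sums an amount across all pages before emitting a single result. Carrying the running total inside the cursor string is just one possible pattern; the endpoint and field names are assumptions:
Node.js
// First run: no cursor yet, so fall back to page 1 with a zero total.
const state = JSON.parse(dto.getBatchCursor(JSON.stringify({ page: 1, total: 0 })));

const res = await fetch(`https://api.example.com/invoices?page=${state.page}`);
const body = await res.json();
const total = state.total + body.data.reduce((sum, invoice) => sum + invoice.amount, 0);

if (body.hasNextPage) {
    // iterateOnly = true: fetch the next page, but send nothing to followers yet.
    dto.setBatchCursor(JSON.stringify({ page: state.page + 1, total }), true);
} else {
    // Last page: emit one aggregated message now that every page has been seen.
    dto.addItem({ grandTotal: total });
}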

4. Why This Architecture is Superior #

Memory Efficiency #

Because Orchesty processes data page-by-page and persists the state in RabbitMQ, your Worker never needs more RAM than is required to hold a single page. You can process a 10GB dataset using a microservice with only 512MB of RAM.

Failure Recovery #

If the 50th page fails due to a network error, you don't start from zero. The platform knows exactly which cursor was last successful. You can fix the issue and resume precisely where you left off.

Visual Clarity #

In the Visual Designer, a Batch node looks like a single step. However, under the hood, it can trigger millions of individual processes, all handled with the same reliability as a single request.

Throughput vs. Ordering #

Per-node parallelism is tuned through prefetch: the default of 1 keeps strict input order, while raising it (up to 20) trades ordering for throughput. Keep prefetch at 1 on order-sensitive consumers downstream of the batch, and raise it freely where each item is independent.

Architect's Note: The Shredding Strategy

When designing a batch, ask yourself: How small should my items be? If you "shred" your data into individual items (addItem), you maximize horizontal scaling. Orchesty can process those items across dozens of worker instances simultaneously. If you keep them in large chunks, you reduce queue overhead but lose parallel processing power.
