Trash inbox
When a message exhausts its retries, the platform routes it to the Trash: a per-topology inbox of failed messages. Trash entries hold the original payload, the error, and the node where the failure happened. From the Admin UI, an operator can inspect, edit, approve, or reject them — without touching code.
For how a message ends up in Trash (which exception to throw, how OnRepeatException and STOP_AND_FAILED map to retries vs. failures), see Building nodes: Error handling and retries.
What an operator can do
From a Trash entry in the Admin UI:
- Inspect the payload, the error, and the failing node to understand what happened.
- Edit the payload if the failure was caused by bad data, then Approve to re-inject it from the failed node.
- Approve as-is, after the upstream issue is fixed.
- Bulk-approve a group when many messages failed for the same reason (for example, a downstream API was down and is now back).
- Reject if the message is no longer relevant.
The UI actions on a Trash entry are Approve / Edit / Reject (individually or in bulk). Under the hood, Approve performs a replay — the message is re-injected at the failed step, not at the start of the topology, so upstream side effects don't repeat. The word replay refers to that re-injection mechanic; the operator-facing button is Approve.
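The difference between re-injecting at the failed step and restarting the topology can be shown with a toy model. This is a minimal sketch, not the platform's implementation: it assumes a linear topology expressed as a list of functions, and the names TrashEntry, run_from, approve, charge, and ship are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Node = Callable[[dict], dict]


@dataclass
class TrashEntry:
    """Hypothetical shape: the payload, the error, and the failing node."""
    payload: dict
    error: str
    failed_node: int  # index of the node that raised


def run_from(topology: list[Node], payload: dict, start: int = 0
             ) -> tuple[Optional[dict], Optional[TrashEntry]]:
    """Run nodes from `start`; return (result, None) or (None, entry)."""
    for index in range(start, len(topology)):
        try:
            payload = topology[index](payload)
        except Exception as exc:
            # The entry keeps the payload as it was when it entered the node.
            return None, TrashEntry(payload, str(exc), index)
    return payload, None


def approve(topology: list[Node], entry: TrashEntry,
            edited_payload: Optional[dict] = None):
    """Replay: re-inject at the failed node, optionally with an edited payload."""
    payload = entry.payload if edited_payload is None else edited_payload
    return run_from(topology, payload, start=entry.failed_node)


# Example: `charge` has a side effect that must not repeat on replay.
calls = []


def charge(p: dict) -> dict:
    calls.append("charge")
    return {**p, "charged": True}


def ship(p: dict) -> dict:
    calls.append("ship")
    if not p.get("address"):
        raise ValueError("missing address")
    return {**p, "shipped": True}


topology = [charge, ship]
result, entry = run_from(topology, {"order": 1})           # fails in `ship`
fixed, failure = approve(topology, entry,
                         {**entry.payload, "address": "Main St 1"})
```

In this model the upstream node (charge) runs once, while the failed node (ship) runs twice: once for the original attempt and once for the replay with the edited payload.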
Operational handoff
Trash is the channel between developers and operators:
- Developers decide what should retry vs. fail (in code).
- Operators decide what to do with the things that ultimately failed (in the Admin UI).
Wire alerts so the team knows to look at Trash without polling it — see Notifications.
Why Trash beats writing your own dead-letter queue
Every team that has rolled their own dead-letter store eventually rebuilds the same UI: a list, a filter by topology and date, an "edit and resubmit" form, a bulk action. Trash is exactly that, included, and tied to the original process and its logs. Don't build your own.
Duplicate failure guard
When the same node fails the same way many times inside a single process run (typically a batch hitting an upstream that is down), the platform stops persisting identical entries once a threshold N is reached. Only the first N copies are kept in Trash; subsequent duplicates are acknowledged off the queue but not stored, and a single notification fires per group.
Key behaviour:
- The guard is keyed by node + correlation-id + error message. Different nodes, different runs, or different error texts are tracked independently.
- A duplicate group is held in memory by the bridge process for 10 minutes. After 10 minutes of inactivity for that key, the counter resets and the next failure starts a fresh group.
- Excess duplicates are still acknowledged, so the work does not pile up on the broker; they just do not produce a Trash row.
- One limit_overflow notification (severity critical, limit_type=trash_duplication) is published the moment the threshold is reached for a given group, so the team is told once that "this process has gone runaway" rather than receiving thousands of identical alerts.
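The behaviour listed above can be modelled in a few lines. This is a hedged sketch rather than the bridge's actual code: the class DuplicateGuard, its method names, and the shape of the notification dictionary are hypothetical. It also assumes the notification fires on the N-th failure of a group, which is one reading of "the moment the threshold is reached".

```python
import time
from dataclasses import dataclass


@dataclass
class _Group:
    count: int
    last_seen: float


class DuplicateGuard:
    """Toy in-memory model of the duplicate failure guard."""

    def __init__(self, threshold, window_seconds=600.0, clock=time.monotonic):
        self.threshold = threshold            # 0 disables the guard
        self.window_seconds = window_seconds  # 10 minutes of inactivity
        self.clock = clock                    # injectable for testing
        self.groups = {}
        self.notifications = []

    def record_failure(self, node, correlation_id, error_message):
        """Return True if this failure should be stored as a Trash entry."""
        if self.threshold <= 0:
            return True  # guard disabled: every failure is stored

        key = (node, correlation_id, error_message)
        now = self.clock()
        group = self.groups.get(key)
        # Start a fresh group if the key is new or has been idle too long.
        if group is None or now - group.last_seen >= self.window_seconds:
            group = _Group(count=0, last_seen=now)
            self.groups[key] = group

        group.count += 1
        group.last_seen = now

        # Assumption: notify exactly once, when the N-th duplicate arrives.
        if group.count == self.threshold:
            self.notifications.append({
                "type": "limit_overflow",
                "severity": "critical",
                "limit_type": "trash_duplication",
                "key": key,
            })
        # The first N copies are kept; later ones are acknowledged but dropped.
        return group.count <= self.threshold
```

With a threshold of 3, five identical failures yield three stored entries, two dropped duplicates, and one notification. A failure from a different node, run, or error text starts its own group.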
Operational response
The intended response to a duplicate-guard event is not to resolve the kept duplicates one by one. It is to:
- Identify why the upstream failed (the Trash entries that were kept are enough to diagnose).
- Wait for the upstream to recover, or fix the data / configuration that caused the failure.
- Rerun the affected process once recovery is confirmed. A clean rerun is cheaper and safer than approving thousands of stale messages from a previous run.
The dropped duplicates are intentionally not recoverable; they represent the same problem captured many times and a fresh run is the correct primitive.
Defaults and configuration
| Deployment | Default | How to change |
|---|---|---|
| Cloud instances | Enabled, threshold 1,000 | Currently platform-wide; per-plan tuning is on the roadmap. Not user-configurable. |
| Self-hosted (Community / Enterprise) | Disabled (0) | Set ORCHESTY_LIMIT_TRASH_DUPLICATION on the platform backend deployment to a positive integer. The bridge re-reads it from GET /api/status on its next refresh (every LIMITS_CHECK_INTERVAL, default 60 seconds). |
Setting the env variable to 0 is the explicit way to disable the guard on a deployment that previously had it enabled. The bridge logs the current effective value at startup, so you can verify the configuration reached the runtime.
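The "0 disables, positive integer enables" semantics can be sketched as a small helper. The function effective_threshold is hypothetical, and how the platform backend actually parses the variable is not described here. Treating unset, negative, or non-numeric values as "disabled" is an assumption of this sketch.

```python
import os

ENV_NAME = "ORCHESTY_LIMIT_TRASH_DUPLICATION"


def effective_threshold(environ=os.environ):
    """Return the duplicate-guard threshold; 0 means the guard is disabled."""
    raw = environ.get(ENV_NAME, "0").strip()
    try:
        value = int(raw)
    except ValueError:
        return 0  # assumption: unparsable values fall back to disabled
    return value if value > 0 else 0
```

For example, effective_threshold({ENV_NAME: "1000"}) returns 1000, while an empty mapping returns 0, matching the self-hosted default of a disabled guard.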
What this is and is not
This guard is a protection mechanism against runaway duplicates from one specific failure mode (an upstream goes down during a batch and produces thousands of identical Trash entries). It is not a general rate limit on failed messages, not a retry-policy knob, and not a tool to suppress legitimate distinct failures. Different errors, different nodes, and different runs are always counted independently.
See also
- Building nodes: Error handling and retries — the in-code side of the same workflow.
- Operations: Notifications — alert on failed messages.
- Operations: Logging
- Operations: Integration monitoring