Observability in practice
How operators use Orchesty's dashboards, Trash, and Limiter view day to day: from incident triage to vendor evidence to draining a stuck rate-limit queue.
Operations teams don't read logs for fun. They open the Admin UI when a notification fires, when a customer asks "why didn't this order sync?", or when a vendor demands evidence of a call that they lost. This guide walks through the workflows that come up week after week, the views Orchesty surfaces for each one, and — equally important — what the platform deliberately does not try to be.
The mental model #
Three facts to keep in mind before opening any dashboard:
- Dashboards are not a real-time event stream. Metrics from every component are sampled on a one-minute cadence and aggregated. The unit is a time slot with a status, not an individual event. A red square means that something failed inside that window for that entity, nothing more granular (the slot model is sketched after this list).
- Correlation IDs are the glue. There is no single monolithic record of a process. Metrics live with the connector, failures with the Trash, queues with the Limiter, payloads with the process logs. The drill-down paths in the UI walk these stores by correlation ID; you don't have to do the joins.
- Per-entity historical questions are bound by log retention. The horizon for "show me everything that happened to invoice 12345 last quarter" is set by your log retention policy. The Trace feature (Pro & Enterprise) aggregates that information into a per-entity report, but it cannot reach further back than the logs survive.
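To make the first point concrete, here is a minimal sketch of the slot model in TypeScript. The type names and shapes are illustrative assumptions rather than Orchesty's internal data model; what matters is that the bucket keeps a status, not the events that produced it:

```typescript
// A minimal sketch of the slot model; type names and shapes are
// illustrative, not Orchesty's internal data model.
type ProcessEvent = { topologyId: string; timestamp: Date; ok: boolean };
type Slot = { topologyId: string; minute: string; status: "green" | "red" };

function aggregateIntoSlots(events: ProcessEvent[]): Slot[] {
  const failedByKey = new Map<string, boolean>();
  for (const e of events) {
    // Bucket key: the entity plus the one-minute window the event falls into.
    const minute = e.timestamp.toISOString().slice(0, 16); // e.g. "2024-05-01T09:41"
    const key = `${e.topologyId}|${minute}`;
    // A single failure anywhere in the window is enough to turn the slot red.
    failedByKey.set(key, (failedByKey.get(key) ?? false) || !e.ok);
  }
  const slots: Slot[] = [];
  for (const [key, failed] of failedByKey) {
    const [topologyId, minute] = key.split("|");
    slots.push({ topologyId, minute, status: failed ? "red" : "green" });
  }
  return slots;
}
```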
With those out of the way, here are the workflows.
1. Incident triage from a notification #
A notification fires in your team's Slack channel: "Process failure on shopify-orders-sync". The clock starts, and the goal is to know within a minute whether this is a one-off bad payload, a vendor outage, or a wider problem.
- Open the topology heatmap from the link in the notification. Scan the last hour of slots. A single red square is probably a one-off; red running across consecutive slots means the failure is ongoing; one red slot followed by green again means the upstream had a hiccup that has already cleared.
- Click the red slot to expand the list of processes that ran inside it, filtered to failed only. The list shows the originating node, the connector that errored, and the error message.
- Click a representative failed process to open the process detail. Confirm which nodes were visited, which one failed, and what the connector returned. If the same connector and the same error recur in every failed process, you have the cause.
- Jump to the Trash entries for that process. From there, an operator can:
- Approve the message as-is if the upstream issue is now resolved (re-injection happens at the failed step, not at the start of the topology).
- Edit the payload first if the failure was caused by bad data, then approve.
- Reject if the message is no longer relevant.
- Bulk-approve every entry that failed for the same reason in one pass (a sketch of this pass follows).
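For the bulk case, a hedged sketch of what that last pass amounts to. The endpoints, the entry shape, and the token handling here are hypothetical, invented purely for illustration; the supported workflow is the Trash inbox in the Admin UI:

```typescript
// Hedged sketch of the bulk-approve pass. The endpoints, TrashEntry shape,
// and bearer token are hypothetical; the supported workflow is the Trash
// inbox in the Admin UI (see Operations: Trash inbox).
type TrashEntry = { id: string; errorMessage: string };

async function bulkApproveByError(baseUrl: string, token: string, error: string): Promise<number> {
  const headers = { Authorization: `Bearer ${token}` };
  // Hypothetical listing endpoint for current Trash entries.
  const res = await fetch(`${baseUrl}/trash`, { headers });
  const entries: TrashEntry[] = await res.json();
  // Approve only the entries that failed for the same, now-resolved reason.
  const matching = entries.filter((e) => e.errorMessage === error);
  for (const entry of matching) {
    // Hypothetical approve endpoint; approval re-injects at the failed step.
    await fetch(`${baseUrl}/trash/${entry.id}/approve`, { method: "POST", headers });
  }
  return matching.length;
}
```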
Most incidents close at step four. If the symptom is "the slot is red but the failures are spread across several connectors", jump straight to workflow 2.
2. Stuck Limiter queue #
The symptom is "everything against vendor X is slow" — not "a topology is failing", but "messages are accumulating and the team can see ETAs to drain blowing out". This is a Limiter problem, not a Trash problem.
- Open the Limiter view. It lists every rate-limited application with the number of messages currently queued, the configured rate, and a computed time to drain — for example "hubspot:user-a — 18,400 messages queued, ETA 2.5h" (the arithmetic is sketched after this list).
- Drill into the offending application to see which topologies are contributing to the queue. Often one runaway producer (a backfill, a cron that fired twice, an upstream webhook storm) is responsible for most of the volume; the rest of the traffic is just queued behind it.
- Decide what to terminate. Operators can terminate specific processes feeding the queue, or every process of an entire topology, directly from this view. Terminating the offender unblocks the queue for everyone else.
- Verify the drain. Refresh the Limiter view and watch the ETA come down at the configured rate. If it doesn't, the upstream is rejecting calls for another reason and the next stop is the application heatmap (workflow 3).
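The ETA column is plain arithmetic: messages queued divided by the effective rate. A worked version of the example from the list above, where the roughly two-requests-per-second rate is an assumed configuration value:

```typescript
// Time to drain = messages queued / effective rate. The ~2 requests/second
// rate below is an assumed configuration value, not a platform default.
function drainEtaHours(queuedMessages: number, ratePerSecond: number): number {
  return queuedMessages / ratePerSecond / 3600;
}

// 18,400 queued messages at ~2 req/s drain in roughly two and a half hours.
console.log(drainEtaHours(18_400, 2).toFixed(1)); // "2.6"
```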
Configuration of the Limiter itself lives on the connector node — see Operations: Rate limiting. This view is its operational counterpart.
3. Vendor dispute: evidence on the wire #
A partner claims they never received a webhook, or they say their API never returned the response your team is referencing. The conversation needs to move from opinion to evidence.
- Open the application heatmap for the vendor in question. A red row across the same window the partner is asking about is the first piece of evidence — the issue is upstream, and a screenshot of it is part of the answer.
- Click the connector to reach its overview, which shows current and recent status against that application.
- Filter the per-connector communication history by the relevant time window or, better, by the correlation ID of the disputed business operation. The history is queried out of the platform's process logs and surfaces the exact request that left the platform, the exact response that came back, the headers, the status code, and the timestamp (the filter is sketched after this list).
- Export the relevant entries to attach to the support ticket.
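Under the hood this is a filter over the process logs, and the UI runs it for you. A sketch under an assumed record shape, not the platform's actual log schema:

```typescript
// Sketch of the evidence filter; the WireRecord shape is an illustrative
// assumption. The UI performs the equivalent query against process logs.
type WireRecord = {
  correlationId: string;
  timestamp: Date;
  request: { method: string; url: string; headers: Record<string, string>; body: string };
  response: { status: number; headers: Record<string, string>; body: string };
};

function evidenceFor(logs: WireRecord[], correlationId: string, from: Date, to: Date): WireRecord[] {
  return logs.filter(
    (r) =>
      r.correlationId === correlationId &&
      r.timestamp.getTime() >= from.getTime() &&
      r.timestamp.getTime() <= to.getTime(),
  );
}

// JSON.stringify(evidenceFor(logs, "corr-id-of-the-dispute", from, to), null, 2)
// is the exportable artifact to attach to the support ticket.
```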
The horizon is bounded by log retention, the same retention that limits Trace. If the dispute concerns events older than the configured window, the on-the-wire records may no longer be available — plan retention according to the disputes you actually need to defend.
4. Tracing one entity #
A business stakeholder asks "what happened to order 8841 last Tuesday?" — every node it touched, every transformation, every external call.
Without Trace. Search the process logs by the correlation ID associated with that order, or by the entity key your topology stamps onto the message context. Each match is one process; chain the matches by correlation ID to reconstruct the full journey. Practical horizon = log retention.
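That manual query is roughly the following sketch; the record shape and field names are assumptions for illustration:

```typescript
// The manual query, sketched with assumed field names: filter the process
// logs by correlation ID, then order the matches into a single timeline.
type ProcessRecord = { correlationId: string; node: string; startedAt: Date; failed: boolean };

function walkEntity(logs: ProcessRecord[], correlationId: string): ProcessRecord[] {
  return logs
    .filter((r) => r.correlationId === correlationId)
    .sort((a, b) => a.startedAt.getTime() - b.startedAt.getTime());
}

// walkEntity(logs, "corr-id-for-order-8841") reconstructs the journey, but
// only as far back as log retention reaches.
```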
With Trace (Pro & Enterprise). The Trace auditing feature aggregates the same log records into per-entity reports automatically, using the entity keys you define. Each boundary in the timeline carries a real delivery status badge (Delivered, Failed, Repeating, Limit, Trashed). The horizon is still bound by log retention — Trace turns the lookup from "engineer writes a query" into "operator opens a report", but it does not extend the retention window itself.
5. What the dashboards are not #
Naming this explicitly avoids the disappointment of expecting one thing and getting another:
- Not a real-time event stream. Metrics are sampled on a one-minute cadence and aggregated. In high-throughput production this is the right trade-off; an event stream would either drop most events or fall behind. If you need an event-by-event tap, that's what process logs are for.
- Not a single source of truth. Metrics, failures, queue state, and payloads each live in the store best suited for them. Correlation IDs join them at read time; nothing pretends to be one giant table.
- Not an unbounded multi-year audit trail. Per-entity historical reports are bound by log retention. If your audit horizon is five years, plan retention accordingly — and accept the storage cost — rather than expect the platform to keep an unbounded history for free.
Operating with these constraints is what makes the views fast and the platform honest. Within them, you have everything you need to triage an incident, prove what was on the wire, drain a stuck queue, and answer the per-entity questions whose horizon you can afford to keep.
Related #
- Operational visibility — the canonical description of every view referenced above.
- Trace auditing — the per-entity AI auditing assistant (Pro & Enterprise).
- Operations: Integration monitoring — operations reference for the monitoring surface.
- Operations: Trash inbox — the Approve / Edit / Reject workflow in detail.
- Operations: Rate limiting — Limiter configuration that the Limiter view operates on.