Integration monitoring
The process record tells you what happened to one run. Integration monitoring tells you what is happening across all your processes right now: how many are running, how many are failing, how fast each topology is moving, where the slow nodes are.
For background, see Concepts: Processes and Messages.
What's in the dashboards #
The Admin UI ships with built-in dashboards organized in three layers:
Instance overview #
The "is everything fine?" view:
- Processes started in the last hour / day / 7 days.
- Processes currently in flight.
- Failure rate, broken down by topology.
- Trash inbox size, by topology.
- Notifications fired in the last day.
If something looks abnormal, this is where you spot it first.
Screenshot pending: Instance overview dashboard (top of page; process counters, failure-rate sparkline, Trash inbox widget; target 1280 × 620).
Per-topology dashboard #
The "where is this topology spending time?" view. For the selected topology:
- Per-node throughput (messages / minute).
- Per-node latency (p50 / p95 / p99).
- Per-node error rate.
- Queue depth between nodes (where backpressure is building up).
- Last 24h and last 7 days side-by-side.
This is the dashboard you open when a topology feels slow or when a node is suspect.
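As a reminder of what the latency columns mean: p50/p95/p99 are percentiles of the per-node latency distribution, so p95 = 240 ms reads as "95% of calls finished in 240 ms or less". The platform computes these for you; the sketch below is illustrative only, using the common nearest-rank definition on a made-up sample:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at 1-based rank ceil(pct/100 * n)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical last 10 call latencies (ms) for one node.
latencies_ms = [12, 14, 15, 15, 16, 18, 21, 35, 90, 240]

p50 = percentile(latencies_ms, 50)  # 16  -> the "typical" call
p95 = percentile(latencies_ms, 95)  # 240 -> the slow tail
p99 = percentile(latencies_ms, 99)  # 240
```

Note how a single 240 ms outlier leaves p50 untouched but dominates p95/p99; that is why a rising p95 with a flat p50 usually points at a tail problem (retries, one slow downstream), not a uniformly slow node.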
Per-process detail #
The "what happened in this run?" view. Open the process detail page in the Admin UI from the per-topology dashboard by clicking any failed or slow process. It shows which nodes were visited, which succeeded, which failed, and basic per-node metrics. For a richer per-step payload audit, see Trace auditing (Pro & Enterprise).
What you don't have to instrument #
You do not need to add custom metrics to a worker for any of the above. The platform measures node-level latency, throughput, and result codes from the moment a node starts processing a message; the dashboards just visualize what is already collected.
You still want logs for the "what happened inside one call?" question. See Logging.
Health vs alerting #
Dashboards are for looking. Notifications are for being told. The two should not duplicate each other:
- Use Notifications for things that need a human to act now: failed messages, instance limits being reached.
- Use dashboards for trend questions: "are we slower this week?", "which topology is responsible for most of the failures?", "is the Trash inbox growing or draining?".
Exporting metrics #
If you already run Grafana, Datadog, or a similar stack, the platform exposes:
- A Prometheus-compatible metrics endpoint for instance-level metrics (queue depths, broker health, worker process counts).
- A per-topology metrics export you can scrape on a schedule.
Use these to bring Orchesty health into the same dashboards you already watch for the rest of your infrastructure.
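A Prometheus-compatible endpoint serves plain text in the standard exposition format, so any scraper can consume it. As a minimal sketch of what that looks like, here is a parser for the simple `name{labels} value` lines (the endpoint URL and metric names below are hypothetical, not the platform's actual ones; a real setup would just point Prometheus or Datadog at the endpoint instead):

```python
import urllib.request

METRICS_URL = "http://localhost:8080/metrics"  # hypothetical; use your instance's endpoint

def parse_prometheus_text(body: str) -> dict:
    """Parse Prometheus text exposition lines into {metric_with_labels: float}.

    Simplified: ignores HELP/TYPE comments and optional timestamps.
    """
    metrics = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")  # split on the last space
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # skip lines that are not "name value"
    return metrics

def fetch_metrics(url: str = METRICS_URL) -> dict:
    """Scrape the endpoint and return parsed metrics."""
    with urllib.request.urlopen(url) as resp:
        return parse_prometheus_text(resp.read().decode())

# Example exposition body with made-up metric names:
sample = """\
# HELP queue_depth Messages waiting between nodes
# TYPE queue_depth gauge
queue_depth{topology="billing",node="mapper"} 42
worker_processes 8
"""
parsed = parse_prometheus_text(sample)
```

In practice you would not hand-roll this; a `scrape_configs` entry in Prometheus (or the Datadog OpenMetrics check) pointed at the endpoint does the same job continuously.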