State Machines

State machines are the core orchestration mechanism in UMH Core. Every component is managed by a finite state machine (FSM) with clearly defined states and transitions. This provides predictable, observable behavior and enables reliable error handling and recovery.

How It Works

UMH Core uses hierarchical state machines where components build upon each other:

Bridge (formerly Protocol Converter) = Connection + Source Flow + Sink Flow
Flow (DataFlow Component) = Benthos instance with lifecycle management
Benthos Flow = Individual Benthos process with detailed startup phases
Connection = Network probe service (typically nmap-based)

Each component inherits lifecycle states (to_be_created, creating, removing, removed) and adds operational states specific to its function. The Agent continuously reconciles desired vs actual state, triggering appropriate transitions based on observed conditions.
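The reconciliation idea above can be illustrated with a minimal sketch. It is not the Agent's actual implementation: the `Component` class, its `desired`/`actual` fields, and the `next_event` and `reconcile` helpers are hypothetical names, and the rule of firing only `start` or `stop` is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Component:
    name: str
    desired: str  # state requested by configuration (hypothetical field)
    actual: str   # state most recently observed (hypothetical field)


def next_event(component: Component) -> Optional[str]:
    """Pick the event that moves `actual` one step toward `desired`."""
    if component.actual == component.desired:
        return None      # converged, nothing to do
    if component.actual in ("starting", "stopping"):
        return None      # already in transit; wait for the next observation
    if component.desired == "stopped":
        return "stop"    # wind the component down
    if component.actual == "stopped":
        return "start"   # bring the component up
    return None          # running but not converged; other events drive this


def reconcile(components: List[Component]) -> List[Tuple[str, str]]:
    """One reconciliation pass: return the (name, event) pairs to fire."""
    events = []
    for component in components:
        event = next_event(component)
        if event is not None:
            events.append((component.name, event))
    return events
```

Calling `reconcile` once per tick with freshly observed states would give a loop of the kind described above; a component that has already converged produces no event.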

1 — Redpanda Service

| State | Verified | What it means | How it is entered | How it leaves |
|---|---|---|---|---|
| stopped | ✅ | redpanda process not running. | stop_done from stopping, or initial create. | start → starting |
| starting | ✅ | S6 launching broker; health checks pending. | start event from stopped. | start_done → idle; start_failed → stopped |
| idle | ✅ | Broker healthy, no data for 30 s (default idle window). | start_done or no_data_timeout from active. | data_received → active; degraded → degraded; stop → stopping |
| active | ✅ | Broker healthy & BytesIn/OutPerSec > 0. | data_received from idle. | no_data_timeout → idle; degraded → degraded; stop → stopping |
| ⚠️ degraded | ✅ | Broker running but ≥1 health-check failing (disk-space-low, cpu-saturated, etc.). | degraded from idle/active. | recovered → idle; stop → stopping |
| stopping | ✅ | Graceful shutdown (draining clients). | stop from any running state. | stop_done → stopped |
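The Redpanda transitions listed above can be written down as a lookup table. The sketch below encodes only the exit transitions stated in the table; the names `REDPANDA_TRANSITIONS` and `fire` are illustrative and do not come from UMH Core.

```python
# (current state, event) -> next state, taken from the "How it leaves" column.
REDPANDA_TRANSITIONS = {
    ("stopped", "start"): "starting",
    ("starting", "start_done"): "idle",
    ("starting", "start_failed"): "stopped",
    ("idle", "data_received"): "active",
    ("idle", "degraded"): "degraded",
    ("idle", "stop"): "stopping",
    ("active", "no_data_timeout"): "idle",
    ("active", "degraded"): "degraded",
    ("active", "stop"): "stopping",
    ("degraded", "recovered"): "idle",
    ("degraded", "stop"): "stopping",
    ("stopping", "stop_done"): "stopped",
}


def fire(state: str, event: str) -> str:
    """Return the next state, or raise if the event is not valid here."""
    try:
        return REDPANDA_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


def run(state: str, events) -> str:
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = fire(state, event)
    return state
```

For example, `run("stopped", ["start", "start_done", "data_received"])` walks the broker from `stopped` to `active`, while an event the table does not list for the current state raises `ValueError` instead of being silently ignored.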

2 — Container Monitor

| State | Verified | Meaning | Enter trigger | Exit trigger |
|---|---|---|---|---|
| active | ✅ | CPU < 85 %, RAM < 90 %, Disk < 90 %. | metrics_all_ok after monitor start or from degraded. | metrics_not_ok → degraded |
| ⚠️ degraded | ✅ | One of the above limits breached for 15 s. | metrics_not_ok | metrics_all_ok → active |
| monitoring_stopped | ✅ | Watchdog disabled. | stop_monitoring_done | start_monitoring → monitoring_starting |
| monitoring_starting | ✅ | Monitor service booting. | start_monitoring | start_monitoring_done → degraded (initial) |
| monitoring_stopping | ✅ | Monitor shutting down. | stop_monitoring | stop_monitoring_done → monitoring_stopped |
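As an illustration of the threshold logic, the sketch below evaluates the three limits and switches to degraded only after a breach has lasted 15 s. Reading "breached for 15 s" as a sustained breach is an assumption, and the `ContainerHealth` class and its `observe` method are hypothetical names, not part of UMH Core.

```python
CPU_LIMIT = 85.0       # percent
RAM_LIMIT = 90.0       # percent
DISK_LIMIT = 90.0      # percent
BREACH_SECONDS = 15.0  # how long a breach must last before degrading (assumed)


class ContainerHealth:
    def __init__(self) -> None:
        # The monitor enters degraded first, per start_monitoring_done above.
        self.state = "degraded"
        self._breach_since = None

    def observe(self, now: float, cpu: float, ram: float, disk: float) -> str:
        """Feed one metrics sample (timestamp in seconds) and return the state."""
        all_ok = cpu < CPU_LIMIT and ram < RAM_LIMIT and disk < DISK_LIMIT
        if all_ok:
            self._breach_since = None
            self.state = "active"          # metrics_all_ok
        else:
            if self._breach_since is None:
                self._breach_since = now   # remember when the breach began
            if now - self._breach_since >= BREACH_SECONDS:
                self.state = "degraded"    # metrics_not_ok
        return self.state
```

A short CPU spike that clears within the window leaves the state at active, whereas a breach that is still present 15 s after it began flips the state to degraded until all three metrics are back under their limits.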

3 — Agent Monitor

| State | Verified | Meaning | Enter | Exit |
|---|---|---|---|---|
| active | ✅ | Agent connected & internal tasks OK. | metrics_all_ok | metrics_not_ok → degraded |
| ⚠️ degraded | ✅ | Cloud unreachable / auth error / task panic. | metrics_not_ok | metrics_all_ok → active |
| monitoring_stopped | ✅ | Agent health monitor off. | stop_monitoring_done | start_monitoring → monitoring_starting |
| monitoring_starting | ✅ | Starting health checks. | start_monitoring | start_monitoring_done → degraded (initial) |
| monitoring_stopping | ✅ | Halting checks. | stop_monitoring | stop_monitoring_done → monitoring_stopped |

4 — DataFlow Component (Bridge)

Aggregate Bridge FSM

| State | Verified | Meaning | Enter | Exit |
|---|---|---|---|---|
| stopped | ✅ | All sub-services stopped. | stop_done or after create. | start → starting |
| starting | ✅ | Launching source & sink Benthos + connection monitor. | start | start_done → idle; start_failed → starting_failed |
| starting_failed | ✅ | At least one sub-service failed during start. | start_failed | Manual retry (start) or removal |
| idle | ✅ | Sub-services healthy, no payload for 30 s. | start_done, no_data_received, recovered | data_received → active; benthos_degraded → degraded; stop → stopping |
| active | ✅ | Data moving through at least one flow. | data_received | no_data_received → idle; benthos_degraded → degraded; stop → stopping |
| ⚠️ degraded | ✅ | ≥1 sub-FSM degraded/down (connection lost, flow error). | benthos_degraded | benthos_recovered → idle; stop → stopping |
| stopping | ✅ | Stopping Benthos + connection monitor. | stop | stop_done → stopped |
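One way to picture the aggregate FSM is as a function of its three sub-FSM states. The sketch below is a simplification under stated assumptions: the real FSM is event-driven, `starting_failed` is not modelled, the function name `aggregate_bridge_state` is hypothetical, and the reachable connection state is written as `up` purely for illustration.

```python
def aggregate_bridge_state(connection: str, source: str, sink: str) -> str:
    """Derive a Bridge-level state from the connection and the two flows."""
    subs = (connection, source, sink)
    if all(s == "stopped" for s in subs):
        return "stopped"      # all sub-services stopped
    if any(s == "stopping" for s in subs):
        return "stopping"     # shutdown in progress
    if any(s.startswith("starting") for s in subs):
        return "starting"     # covers the detailed Benthos startup phases
    if any(s in ("degraded", "down") for s in subs):
        return "degraded"     # at least one sub-FSM degraded or down
    if "active" in (source, sink):
        return "active"       # data moving through at least one flow
    return "idle"             # healthy, but no payload
```

The ordering of the checks matters: a lost connection outranks an active flow, so `aggregate_bridge_state("down", "active", "idle")` reports degraded rather than active.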

4.1 Connection Service FSM

| State | Verified | Meaning |
|---|---|---|
| starting | ✅ | Probe service launching. |
| up | ✅ | Target reachable. |
| down | ✅ | Target unreachable. |
| ⚠️ degraded | ✅ | Flaky / intermittent responses. |
| stopping | ✅ | Probe shutting down. |
| stopped | ✅ | Probe disabled. |

4.2 Benthos Flow (Source/Sink)

| State | Verified | Meaning |
|---|---|---|
| stopped | ✅ | Service file present, process not running. |
| starting | ✅ | S6 launched process. |
| starting_config_loading | ✅ | Benthos parsing YAML pipeline. |
| starting_waiting_for_healthchecks | ✅ | Pipeline loaded; waiting for plugin health. |
| starting_waiting_for_service_to_remain_running | ✅ | Stability grace period. |
| idle | ✅ | Flow running, no msgs for idle window. |
| active | ✅ | Processing messages. |
| ⚠️ degraded | ✅ | Flow running but error state (e.g., endpoint retries). |
| stopping | ✅ | Graceful SIGTERM underway. |

Idle/Active timeout: default 30 s (DFC_IDLE_WINDOW).
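The idle/active switch driven by this window can be sketched as a small tracker. This is an illustration only: `ActivityTracker`, `on_message`, and `tick` are hypothetical names, and timestamps are plain seconds supplied by the caller.

```python
class ActivityTracker:
    def __init__(self, idle_window: float = 30.0) -> None:
        self.idle_window = idle_window  # seconds without messages before idle
        self.state = "idle"
        self._last_message = None

    def on_message(self, now: float) -> None:
        """Record a processed message; corresponds to data_received."""
        self._last_message = now
        self.state = "active"

    def tick(self, now: float) -> str:
        """Re-evaluate the state; falls back to idle once the window elapses."""
        if self.state == "active" and now - self._last_message >= self.idle_window:
            self.state = "idle"
        return self.state
```

With the default window, a flow that processed its last message at t = 5 s is still reported as active at t = 34.9 s and becomes idle at t = 35 s.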

5 — Topic Browser Service

The Topic Browser service manages real-time topic discovery and caching.

| State | Verified | Description | Enter Trigger | Exit Trigger |
|---|---|---|---|---|
| stopped | ✅ | Service not running | Initial state or stop_done | start → starting |
| starting | ✅ | Service initialization | start | benthos_started → starting_benthos |
| starting_benthos | ✅ | Benthos starting | benthos_started | redpanda_started → starting_redpanda |
| starting_redpanda | ✅ | Redpanda connection | redpanda_started | start_done → idle |
| idle | ✅ | Healthy, no active data | start_done or recovered | data_received → active |
| active | ✅ | Processing topic data | data_received | no_data_timeout → idle |
| ⚠️ degraded_benthos | ✅ | Benthos degraded | benthos_degraded | recovered → idle |
| ⚠️ degraded_redpanda | ✅ | Redpanda degraded | redpanda_degraded | recovered → idle |
| stopping | ✅ | Graceful shutdown | stop | stop_done → stopped |

Default: Active (runs automatically)

Transitions: idle ↔ active based on topic activity

Recovery: Automatic from degraded states when underlying services recover

Quick Defaults

| Parameter | Default | Source Const / Env |
|---|---|---|
| Idle window (Redpanda) | 30 s | REDPANDA_IDLE_WINDOW |
| Idle window (Bridge) | 30 s | DFC_IDLE_WINDOW |
| Container CPU limit | 85 % | CONTAINER_CPU_LIMIT |
| Container RAM limit | 90 % | CONTAINER_RAM_LIMIT |
| Container Disk limit | 90 % | CONTAINER_DISK_LIMIT |
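If these names are read as environment variables, resolving them with their defaults could look like the sketch below. Two assumptions are made here: that the values are plain numbers (seconds or percent) without a unit suffix, and that an unset variable falls back to the default. The `load_settings` helper is a hypothetical name.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults from the table above: windows in seconds, limits in percent.
DEFAULTS = {
    "REDPANDA_IDLE_WINDOW": 30.0,
    "DFC_IDLE_WINDOW": 30.0,
    "CONTAINER_CPU_LIMIT": 85.0,
    "CONTAINER_RAM_LIMIT": 90.0,
    "CONTAINER_DISK_LIMIT": 90.0,
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, float]:
    """Resolve each parameter from `env`, falling back to the defaults."""
    source = os.environ if env is None else env
    settings = {}
    for name, default in DEFAULTS.items():
        raw = source.get(name)
        # Assumes a bare number such as "30"; a value like "30s" would raise.
        settings[name] = default if raw is None else float(raw)
    return settings
```

Passing a mapping explicitly, as in `load_settings({"CONTAINER_CPU_LIMIT": "70"})`, makes the override behaviour easy to check without touching the process environment.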

