State Machines
State machines are the core orchestration mechanism in UMH Core. Every component is managed by a finite state machine (FSM) with clearly defined states and transitions. This provides predictable, observable behavior and enables reliable error handling and recovery.
How It Works
UMH Core uses hierarchical state machines where components build upon each other:
Bridge = Connection + Read Flow + Write Flow
Flow = Benthos instance with lifecycle management
Benthos Flow = Individual Benthos process with detailed startup phases
Connection = Network probe service (typically nmap-based)
Each component inherits lifecycle states (to_be_created, creating, removing, removed) and adds operational states specific to its function. The Agent continuously reconciles desired vs actual state, triggering appropriate transitions based on observed conditions.
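The reconcile idea above can be illustrated with a minimal Python sketch. Only the four lifecycle state names come from this page; the operational states (`stopped`, `running`), the event names, and the `Component`, `next_event`, and `reconcile` helpers are hypothetical and are not UMH Core's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Lifecycle state names are taken from the text. The operational states
# ("stopped", "running") and all event names are illustrative assumptions.
TRANSITIONS = {
    ("to_be_created", "create"): "creating",
    ("creating", "create_done"): "stopped",
    ("stopped", "start"): "running",
    ("running", "stop"): "stopped",
    ("stopped", "remove"): "removing",
    ("removing", "remove_done"): "removed",
}


@dataclass
class Component:
    name: str
    state: str = "to_be_created"

    def fire(self, event: str) -> None:
        """Apply one event, rejecting transitions the table does not allow."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{self.name}: {event!r} not allowed in {self.state!r}")
        self.state = TRANSITIONS[key]


def next_event(current: str, desired: str) -> Optional[str]:
    """Pick the single event that moves `current` one step toward `desired`."""
    if current == desired:
        return None
    if current == "to_be_created":
        return "create"
    if current == "creating":
        return "create_done"
    if current == "removing":
        return "remove_done"
    if desired == "running" and current == "stopped":
        return "start"
    if desired == "stopped" and current == "running":
        return "stop"
    if desired == "removed":
        return "stop" if current == "running" else "remove"
    return None


def reconcile(component: Component, desired: str) -> None:
    """One reconcile tick: take at most one step toward the desired state."""
    event = next_event(component.state, desired)
    if event is not None:
        component.fire(event)
```

Calling `reconcile` repeatedly with the same desired state walks the component one transition per tick and then becomes a no-op, which is the property that makes a reconcile loop safe to run continuously.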
1 — Redpanda Service
| State | Description | Entered via | Exits |
| --- | --- | --- | --- |
| stopped | redpanda process not running. | stop_done from stopping, or initial create. | start → starting |
| starting | S6 launching broker; health checks pending. | start event from stopped. | start_done → idle; start_failed → stopped |
| idle | Broker healthy, no data for 30 s (default idle window). | start_done or no_data_timeout from active. | data_received → active; degraded → degraded; stop → stopping |
| active | Broker healthy & BytesIn/OutPerSec > 0. | data_received from idle. | no_data_timeout → idle; degraded → degraded; stop → stopping |
| ⚠️ degraded | Broker running but ≥1 health-check failing (disk-space-low, cpu-saturated, etc.). | degraded from idle/active. | recovered → idle; stop → stopping |
| stopping | Graceful shutdown (draining clients). | stop from any running state. | stop_done → stopped |
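As a sketch, the Redpanda transitions listed above can be encoded as a lookup table in Python. The state and event names are copied from this section; the `REDPANDA_TRANSITIONS`, `step`, and `run` names are hypothetical and only illustrate how such a table can be checked, not how UMH Core implements it.

```python
# Transition table transcribed from the Redpanda Service section.
REDPANDA_TRANSITIONS = {
    "stopped": {"start": "starting"},
    "starting": {"start_done": "idle", "start_failed": "stopped"},
    "idle": {"data_received": "active", "degraded": "degraded", "stop": "stopping"},
    "active": {"no_data_timeout": "idle", "degraded": "degraded", "stop": "stopping"},
    "degraded": {"recovered": "idle", "stop": "stopping"},
    "stopping": {"stop_done": "stopped"},
}


def step(state: str, event: str) -> str:
    """Return the next state, or raise if the event is not valid in `state`."""
    try:
        return REDPANDA_TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


def run(events, state: str = "stopped") -> str:
    """Replay a sequence of events starting from `state`."""
    for event in events:
        state = step(state, event)
    return state
```

Keeping the transitions in plain data makes invalid events fail loudly instead of silently moving the service into an undefined state.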
2 — Container Monitor
| State | Description | Entered via | Exits |
| --- | --- | --- | --- |
| active | CPU < 85 %, RAM < 90 %, Disk < 90 %. | metrics_all_ok after monitor start or from degraded. | metrics_not_ok → degraded |
| ⚠️ degraded | One of the above limits breached for 15 s. | metrics_not_ok | metrics_all_ok → active |
| monitoring_stopped | Watchdog disabled. | stop_monitoring_done | start_monitoring → monitoring_starting |
| monitoring_starting | Monitor service booting. | start_monitoring | start_monitoring_done → degraded (initial) |
| monitoring_stopping | Monitor shutting down. | stop_monitoring | stop_monitoring_done → monitoring_stopped |
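The threshold logic can be sketched as follows. The limits (85 % CPU, 90 % RAM, 90 % disk) and the 15 s breach duration come from the table; the `ContainerMonitor` class, its `observe` method, and the choice to return to active on the first healthy sample are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

CPU_LIMIT = 85.0        # percent, from the table
RAM_LIMIT = 90.0        # percent, from the table
DISK_LIMIT = 90.0       # percent, from the table
BREACH_WINDOW_S = 15.0  # a breach must persist this long before degrading


@dataclass
class ContainerMonitor:
    # The table lists degraded as the initial state after monitoring starts.
    state: str = "degraded"
    breach_since: Optional[float] = None

    def observe(self, now: float, cpu: float, ram: float, disk: float) -> str:
        """Feed one metrics sample (timestamps in seconds) and return the state."""
        all_ok = cpu < CPU_LIMIT and ram < RAM_LIMIT and disk < DISK_LIMIT
        if all_ok:
            # metrics_all_ok: assumed to clear the breach timer immediately.
            self.breach_since = None
            self.state = "active"
        else:
            if self.breach_since is None:
                self.breach_since = now
            if now - self.breach_since >= BREACH_WINDOW_S:
                # metrics_not_ok: the breach has lasted the full window.
                self.state = "degraded"
        return self.state
```

Tracking when a breach started, rather than reacting to single samples, keeps short spikes from flipping the state back and forth.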
3 — Agent Monitor
| State | Description | Entered via | Exits |
| --- | --- | --- | --- |
| active | Agent connected & internal tasks OK. | metrics_all_ok | metrics_not_ok → degraded |
| ⚠️ degraded | Cloud unreachable / auth error / task panic. | metrics_not_ok | metrics_all_ok → active |
| monitoring_stopped | Agent health monitor off. | stop_monitoring_done | start_monitoring → monitoring_starting |
| monitoring_starting | Starting health checks. | start_monitoring | start_monitoring_done → degraded (initial) |
| monitoring_stopping | Halting checks. | stop_monitoring | stop_monitoring_done → monitoring_stopped |
4 — Bridge
Aggregate Bridge FSM
| State | Description | Status reason | Entered via | Exits |
| --- | --- | --- | --- | --- |
| stopped | All sub-services stopped. | "stopped" | stop_done or after create. | start → starting_connection |
| starting_connection | Waiting for connection to establish. | "starting: waiting for connection" | start | start_connection_up → starting_redpanda |
| starting_redpanda | Connection up, waiting for message broker. | "starting: redpanda not healthy" | start_connection_up | start_redpanda_up → starting_dfc |
| starting_dfc | Connection + Redpanda up, waiting for flow. | "starting: flow not running" | start_redpanda_up | start_dfc_up → idle; start_failed_dfc_missing → starting_failed_dfc_missing |
| starting_failed_dfc | Flow component failed to start. | "starting failed: flow in error state" | start_failed_dfc | Manual retry or removal |
| starting_failed_dfc_missing | No flow configured. | "starting failed: no flows configured" | start_failed_dfc_missing | start_retry (when flow added) or removal |
| idle | All healthy, no data for 30 s. | "idling: no messages processed in 60s" | start_dfc_up, no_data_timeout, recovered | data_received → active; degraded events → degraded_*; stop → stopping |
| active | Processing data through flows. | "" (empty when fully healthy) | data_received | no_data_timeout → idle; degraded events → degraded_*; stop → stopping |
| ⚠️ degraded_connection | Connection lost/flaky after successful start. | "connection degraded: probe timeout after 30s" | connection_unhealthy | recovered → idle; stop → stopping |
| ⚠️ degraded_redpanda | Message broker issues after successful start. | "redpanda degraded: not responding" | redpanda_degraded | recovered → idle; stop → stopping |
| ⚠️ degraded_dfc | Flow component issues after successful start. | "flow degraded: benthos service not running" | dfc_degraded | recovered → idle; stop → stopping |
| ⚠️ degraded_other | Inconsistent component states detected. | "other degraded: inconsistent states" | degraded_other | recovered → idle; stop → stopping |
| stopping | Stopping all components. | "stopping" | stop | stop_done → stopped |
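The Bridge startup order (connection, then Redpanda, then flow) can be sketched as a function that reports which phase is currently blocking startup. The state names and status reason strings are copied from the table; the function name, its boolean inputs, and the ordering of the "no flow configured" check before the "flow not running" check are assumptions made for illustration.

```python
from typing import Optional, Tuple


def bridge_startup_phase(
    connection_up: bool,
    redpanda_healthy: bool,
    flow_configured: bool,
    flow_running: bool,
) -> Optional[Tuple[str, str]]:
    """Return (state, status reason) for the phase blocking startup.

    Returns None when nothing blocks startup, i.e. the point at which the
    FSM would fire start_dfc_up and enter idle.
    """
    if not connection_up:
        return "starting_connection", "starting: waiting for connection"
    if not redpanda_healthy:
        return "starting_redpanda", "starting: redpanda not healthy"
    if not flow_configured:
        return "starting_failed_dfc_missing", "starting failed: no flows configured"
    if not flow_running:
        return "starting_dfc", "starting: flow not running"
    return None
```

Because the checks run in dependency order, the reported reason always names the earliest unmet prerequisite, which is usually the most useful thing to show an operator.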
4.1 Connection Service FSM
| State | Description |
| --- | --- |
| starting | Probe service launching. |
| up | Target reachable. |
| down | Target unreachable. |
| ⚠️ degraded | Flaky / intermittent responses. |
| stopping | Probe shutting down. |
| stopped | Probe disabled. |
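One way to distinguish up, down, and degraded is to look at a sliding window of recent probe results. This page does not describe how UMH Core detects flakiness, so the window-based rule below, the `ConnectionProbe` class, and its `window` parameter are all assumptions.

```python
from collections import deque


class ConnectionProbe:
    """Classify a target from the last `window` probe results (hypothetical rule)."""

    def __init__(self, window: int = 5) -> None:
        self.results = deque(maxlen=window)

    def record(self, reachable: bool) -> str:
        """Store one probe result and return the resulting state name."""
        self.results.append(reachable)
        if all(self.results):
            return "up"          # every recent probe succeeded
        if not any(self.results):
            return "down"        # every recent probe failed
        return "degraded"        # mixed results: flaky or intermittent
```

A larger window makes the classification slower to react but less sensitive to a single dropped probe.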
4.2 Benthos Flow (Source/Sink)
| State | Description |
| --- | --- |
| stopped | Service file present, process not running. |
| starting | S6 launched process. |
| starting_config_loading | Benthos parsing YAML pipeline. |
| starting_waiting_for_healthchecks | Pipeline loaded; waiting for plugin health. |
| starting_waiting_for_service_to_remain_running | Stability grace period. |
| idle | Flow running, no msgs for idle window. |
| active | Processing messages. |
| ⚠️ degraded | Flow running but error state (e.g., endpoint retries). |
| stopping | Graceful SIGTERM underway. |
Idle/Active timeout: default 30 s (BRIDGE_IDLE_WINDOW).
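The idle/active decision can be sketched as a comparison between the time since the last message and the idle window. The 30 s default comes from this page; the Python constant, the function name, and the use of plain second-based timestamps are assumptions, and how `BRIDGE_IDLE_WINDOW` is actually configured is not stated here.

```python
# Default idle window in seconds, mirroring BRIDGE_IDLE_WINDOW from the text.
BRIDGE_IDLE_WINDOW_S = 30.0


def flow_activity_state(
    last_message_at: float,
    now: float,
    idle_window_s: float = BRIDGE_IDLE_WINDOW_S,
) -> str:
    """Return "idle" once no message has been seen for the idle window."""
    return "idle" if now - last_message_at >= idle_window_s else "active"
```

For example, a flow whose last message arrived 10 s ago is still active under the default window, while one that has been silent for 30 s or more is reported as idle.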
5 — Topic Browser Service
The Topic Browser service manages real-time topic discovery and caching.
| State | Description | Entered via | Exits |
| --- | --- | --- | --- |
| stopped | Service not running | Initial state or stop_done | start → starting |
| starting | Service initialization | start | benthos_started → starting_benthos |
| starting_benthos | Benthos starting | benthos_started | redpanda_started → starting_redpanda |
| starting_redpanda | Redpanda connection | redpanda_started | start_done → idle |
| idle | Healthy, no active data | start_done or recovered | data_received → active |
| active | Processing topic data | data_received | no_data_timeout → idle |
| ⚠️ degraded_benthos | Benthos degraded | benthos_degraded | recovered → idle |
| ⚠️ degraded_redpanda | Redpanda degraded | redpanda_degraded | recovered → idle |
| stopping | Graceful shutdown | stop | stop_done → stopped |
Default: Active (runs automatically)
Transitions: idle ↔ active based on topic activity
Recovery: Automatic from degraded states when underlying services recover
Quick Defaults
| Setting | Default | Identifier |
| --- | --- | --- |
| Idle window (Redpanda) | 30 s | REDPANDA_IDLE_WINDOW |
| Idle window (Bridge) | 30 s | BRIDGE_IDLE_WINDOW |
| Container CPU limit | 85 % | CONTAINER_CPU_LIMIT |
| Container RAM limit | 90 % | CONTAINER_RAM_LIMIT |
| Container Disk limit | 90 % | CONTAINER_DISK_LIMIT |