State Machines

State machines are the core orchestration mechanism in UMH Core. Every component is managed by a finite state machine (FSM) with clearly defined states and transitions. This provides predictable, observable behavior and enables reliable error handling and recovery.

How It Works

UMH Core uses hierarchical state machines where components build upon each other:

  • Bridge = Connection + Read Flow + Write Flow

  • Flow = Benthos instance with lifecycle management

  • Benthos Flow = Individual Benthos process with detailed startup phases

  • Connection = Network probe service (typically nmap-based)

Each component inherits lifecycle states (to_be_created, creating, removing, removed) and adds operational states specific to its function. The Agent continuously reconciles desired vs actual state, triggering appropriate transitions based on observed conditions.
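The reconciliation idea can be sketched as a function that compares desired and observed state and picks the next event. This is a minimal illustration, not the UMH Core implementation: the `Component` type, the `reconcile` function, the desired-state values `running`/`stopped`, and the `create` event are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    desired: str   # hypothetical desired-state values: "running" or "stopped"
    observed: str  # last state reported by the component's FSM


def reconcile(c: Component) -> Optional[str]:
    """Return the event that moves `observed` one step toward `desired`."""
    if c.observed == "to_be_created":
        return "create"  # hypothetical event name
    if c.observed in ("creating", "removing", "removed"):
        return None  # lifecycle transition in progress or finished; wait
    if c.desired == "stopped" and c.observed not in ("stopped", "stopping"):
        return "stop"
    if c.desired == "running" and c.observed == "stopped":
        return "start"
    return None  # already converging or converged
```

Calling `reconcile` once per control-loop tick for every component yields at most one event per component, which keeps each transition observable on its own.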

1 — Redpanda Service

| State | What it means | How it is entered | How it leaves |
| --- | --- | --- | --- |
| stopped | redpanda process not running. | stop_done from stopping, or initial create. | start → starting |
| starting | S6 launching broker; health checks pending. | start event from stopped. | start_done → idle; start_failed → stopped |
| idle | Broker healthy, no data for 30 s (default idle window). | start_done or no_data_timeout from active. | data_received → active; degraded → degraded; stop → stopping |
| active | Broker healthy & BytesIn/OutPerSec > 0. | data_received from idle. | no_data_timeout → idle; degraded → degraded; stop → stopping |
| ⚠️ degraded | Broker running but ≥1 health‑check failing (disk-space-low, cpu-saturated, etc.). | degraded from idle/active. | recovered → idle; stop → stopping |
| stopping | Graceful shutdown (draining clients). | stop from any running state. | stop_done → stopped |
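The Redpanda table can be read as a lookup from (state, event) to the next state. The sketch below encodes only the exits listed in the "How it leaves" column; the names `REDPANDA_TRANSITIONS`, `step`, and `run` are illustrative and not taken from the UMH Core code base.

```python
# (current state, event) -> next state, copied from the exit column of the table
REDPANDA_TRANSITIONS = {
    ("stopped", "start"): "starting",
    ("starting", "start_done"): "idle",
    ("starting", "start_failed"): "stopped",
    ("idle", "data_received"): "active",
    ("idle", "degraded"): "degraded",
    ("idle", "stop"): "stopping",
    ("active", "no_data_timeout"): "idle",
    ("active", "degraded"): "degraded",
    ("active", "stop"): "stopping",
    ("degraded", "recovered"): "idle",
    ("degraded", "stop"): "stopping",
    ("stopping", "stop_done"): "stopped",
}


def step(state: str, event: str) -> str:
    """Apply one event; reject events the table does not allow in this state."""
    try:
        return REDPANDA_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


def run(state: str, events: list) -> str:
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

For example, `run("stopped", ["start", "start_done", "data_received"])` walks stopped, starting, idle, active. Rejecting unknown pairs instead of ignoring them makes an unexpected event visible immediately.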


2 — Container Monitor

| State | Meaning | Enter trigger | Exit trigger |
| --- | --- | --- | --- |
| active | CPU < 85 %, RAM < 90 %, Disk < 90 %. | metrics_all_ok after monitor start or from degraded. | metrics_not_ok → degraded |
| ⚠️ degraded | One of the above limits breached for 15 s. | metrics_not_ok | metrics_all_ok → active |
| monitoring_stopped | Watchdog disabled. | stop_monitoring_done | start_monitoring → monitoring_starting |
| monitoring_starting | Monitor service booting. | start_monitoring | start_monitoring_done → degraded (initial) |
| monitoring_stopping | Monitor shutting down. | stop_monitoring | stop_monitoring_done → monitoring_stopped |
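The active/degraded decision combines three thresholds with a 15 s breach duration. The following sketch shows one way such a check could work; the `ContainerMonitor` class and its `observe` method are hypothetical, and recovering to active on the first healthy sample is an assumption, since the table does not state a recovery delay.

```python
from dataclasses import dataclass
from typing import Optional

# Limits and breach duration from the table above (percent and seconds)
CPU_LIMIT, RAM_LIMIT, DISK_LIMIT = 85.0, 90.0, 90.0
BREACH_SECONDS = 15.0


@dataclass
class ContainerMonitor:
    state: str = "degraded"  # the table lists degraded as the initial state
    breach_since: Optional[float] = None  # time the current breach started

    def observe(self, now: float, cpu: float, ram: float, disk: float) -> str:
        """Feed one metrics sample (percentages) taken at time `now` (seconds)."""
        all_ok = cpu < CPU_LIMIT and ram < RAM_LIMIT and disk < DISK_LIMIT
        if all_ok:
            self.breach_since = None
            self.state = "active"  # metrics_all_ok
        else:
            if self.breach_since is None:
                self.breach_since = now
            if now - self.breach_since >= BREACH_SECONDS:
                self.state = "degraded"  # metrics_not_ok after a sustained breach
        return self.state
```

A short CPU spike therefore leaves the monitor in active; only a breach that lasts the full 15 s flips it to degraded.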


3 — Agent Monitor

| State | Meaning | Enter | Exit |
| --- | --- | --- | --- |
| active | Agent connected & internal tasks OK. | metrics_all_ok | metrics_not_ok → degraded |
| ⚠️ degraded | Cloud unreachable / auth error / task panic. | metrics_not_ok | metrics_all_ok → active |
| monitoring_stopped | Agent health monitor off. | stop_monitoring_done | start_monitoring → monitoring_starting |
| monitoring_starting | Starting health checks. | start_monitoring | start_monitoring_done → degraded (initial) |
| monitoring_stopping | Halting checks. | stop_monitoring | stop_monitoring_done → monitoring_stopped |


4 — Bridge

Aggregate Bridge FSM

| State | Meaning | Status Reason Examples | Enter | Exit |
| --- | --- | --- | --- | --- |
| stopped | All sub‑services stopped. | "stopped" | stop_done or after create. | start → starting_connection |
| starting_connection | Waiting for connection to establish. | "starting: waiting for connection" | start | start_connection_up → starting_redpanda |
| starting_redpanda | Connection up, waiting for message broker. | "starting: redpanda not healthy" | start_connection_up | start_redpanda_up → starting_dfc |
| starting_dfc | Connection + Redpanda up, waiting for flow. | "starting: flow not running" | start_redpanda_up | start_dfc_up → idle; start_failed_dfc_missing → starting_failed_dfc_missing |
| starting_failed_dfc | Flow component failed to start. | "starting failed: flow in error state" | start_failed_dfc | Manual retry or removal |
| starting_failed_dfc_missing | No flow configured. | "starting failed: no flows configured" | start_failed_dfc_missing | start_retry (when flow added) or removal |
| idle | All healthy, no data for 30 s. | "idling: no messages processed in 60s" | start_dfc_up, no_data_timeout, recovered | data_received → active; degraded events → degraded_*; stop → stopping |
| active | Processing data through flows. | "" (empty when fully healthy) | data_received | no_data_timeout → idle; degraded events → degraded_*; stop → stopping |
| ⚠️ degraded_connection | Connection lost/flaky after successful start. | "connection degraded: probe timeout after 30s" | connection_unhealthy | recovered → idle; stop → stopping |
| ⚠️ degraded_redpanda | Message broker issues after successful start. | "redpanda degraded: not responding" | redpanda_degraded | recovered → idle; stop → stopping |
| ⚠️ degraded_dfc | Flow component issues after successful start. | "flow degraded: benthos service not running" | dfc_degraded | recovered → idle; stop → stopping |
| ⚠️ degraded_other | Inconsistent component states detected. | "other degraded: inconsistent states" | degraded_other | recovered → idle; stop → stopping |
| stopping | Stopping all components. | "stopping" | stop | stop_done → stopped |
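The Bridge starts its sub-components in a fixed order (connection, then Redpanda, then the flow) and later reports which one is unhealthy. The sketch below illustrates both steps under stated assumptions: the function names are hypothetical, "healthy" is taken to mean `up` for the connection and `idle` or `active` for Redpanda and the flow, and the priority order used when several components are unhealthy is a guess. The `degraded_other` case is left out.

```python
from typing import Optional

HEALTHY = {"idle", "active"}  # assumed healthy states for Redpanda and the flow


def advance_startup(bridge: str, connection: str, redpanda: str,
                    flow: Optional[str]) -> str:
    """Move the Bridge one startup step forward; `flow` is None if none is configured."""
    if bridge == "starting_connection" and connection == "up":
        return "starting_redpanda"  # start_connection_up
    if bridge == "starting_redpanda" and redpanda in HEALTHY:
        return "starting_dfc"  # start_redpanda_up
    if bridge == "starting_dfc":
        if flow is None:
            return "starting_failed_dfc_missing"  # start_failed_dfc_missing
        if flow in HEALTHY:
            return "idle"  # start_dfc_up
    return bridge  # keep waiting


def degraded_state(connection: str, redpanda: str,
                   flow: Optional[str]) -> Optional[str]:
    """Name the degraded_* state for a running Bridge, or None if all is healthy."""
    if connection != "up":
        return "degraded_connection"
    if redpanda not in HEALTHY:
        return "degraded_redpanda"
    if flow not in HEALTHY:
        return "degraded_dfc"
    return None
```

Because each startup state waits on exactly one dependency, the status reason can always name the single component that is blocking progress.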

4.1 Connection Service FSM

| State | Meaning |
| --- | --- |
| starting | Probe service launching. |
| up | Target reachable. |
| down | Target unreachable. |
| ⚠️ degraded | Flaky / intermittent responses. |
| stopping | Probe shutting down. |
| stopped | Probe disabled. |

4.2 Benthos Flow (Source / Sink)

| State | Meaning |
| --- | --- |
| stopped | Service file present, process not running. |
| starting | S6 launched process. |
| starting_config_loading | Benthos parsing YAML pipeline. |
| starting_waiting_for_healthchecks | Pipeline loaded; waiting for plugin health. |
| starting_waiting_for_service_to_remain_running | Stability grace period. |
| idle | Flow running, no msgs for idle window. |
| active | Processing messages. |
| ⚠️ degraded | Flow running but error state (e.g., endpoint retries). |
| stopping | Graceful SIGTERM underway. |

Idle/Active timeout: default 30 s (BRIDGE_IDLE_WINDOW).
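The startup phases and the idle window can be illustrated with two small helpers. This is a sketch only: the function names are hypothetical, the phase order follows the table rows, and entering idle after the last startup phase is an assumption.

```python
from typing import Optional

# Startup phases in the order they appear in the table
STARTUP_PHASES = [
    "starting",
    "starting_config_loading",
    "starting_waiting_for_healthchecks",
    "starting_waiting_for_service_to_remain_running",
]
IDLE_WINDOW_SECONDS = 30.0  # default idle window stated above


def next_startup_phase(phase: str) -> str:
    """Return the phase after `phase`; assumes the flow enters idle after the last one."""
    i = STARTUP_PHASES.index(phase)  # raises ValueError for non-startup states
    return STARTUP_PHASES[i + 1] if i + 1 < len(STARTUP_PHASES) else "idle"


def running_state(now: float, last_message_at: Optional[float]) -> str:
    """Classify a healthy running flow as active or idle using the idle window."""
    if last_message_at is not None and now - last_message_at < IDLE_WINDOW_SECONDS:
        return "active"
    return "idle"
```

With these defaults, a flow whose last message arrived 10 s ago is classified as active, and one that has been silent for 30 s or more is classified as idle.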


5 — Topic Browser Service

The Topic Browser service manages real-time topic discovery and caching.

| State | Description | Enter Trigger | Exit Trigger |
| --- | --- | --- | --- |
| stopped | Service not running | Initial state or stop_done | start → starting |
| starting | Service initialization | start | benthos_started → starting_benthos |
| starting_benthos | Benthos starting | benthos_started | redpanda_started → starting_redpanda |
| starting_redpanda | Redpanda connection | redpanda_started | start_done → idle |
| idle | Healthy, no active data | start_done or recovered | data_received → active |
| active | Processing topic data | data_received | no_data_timeout → idle |
| ⚠️ degraded_benthos | Benthos degraded | benthos_degraded | recovered → idle |
| ⚠️ degraded_redpanda | Redpanda degraded | redpanda_degraded | recovered → idle |
| stopping | Graceful shutdown | stop | stop_done → stopped |

  • Default: Active (runs automatically)

  • Transitions: idle ↔ active based on topic activity

  • Recovery: Automatic from degraded states when underlying services recover


Quick Defaults

| Parameter | Default | Source Const / Env |
| --- | --- | --- |
| Idle window (Redpanda) | 30 s | REDPANDA_IDLE_WINDOW |
| Idle window (Bridge) | 30 s | BRIDGE_IDLE_WINDOW |
| Container CPU limit | 85 % | CONTAINER_CPU_LIMIT |
| Container RAM limit | 90 % | CONTAINER_RAM_LIMIT |
| Container Disk limit | 90 % | CONTAINER_DISK_LIMIT |
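If these names are exposed as environment variables, a tool that mirrors the defaults could resolve them as sketched below. This assumes each variable holds a plain number (seconds or percent); the actual value format and whether UMH Core reads them from the environment at all are not confirmed here, and `load_settings` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Defaults from the table above (seconds for windows, percent for limits)
DEFAULTS = {
    "REDPANDA_IDLE_WINDOW": 30.0,
    "BRIDGE_IDLE_WINDOW": 30.0,
    "CONTAINER_CPU_LIMIT": 85.0,
    "CONTAINER_RAM_LIMIT": 90.0,
    "CONTAINER_DISK_LIMIT": 90.0,
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return every setting, using the environment value when present."""
    env = os.environ if env is None else env
    # Assumption: values are plain numbers such as "60" or "87.5"
    return {name: float(env.get(name, default)) for name, default in DEFAULTS.items()}
```

Passing a mapping instead of reading `os.environ` directly keeps the helper easy to test: `load_settings({"BRIDGE_IDLE_WINDOW": "60"})` overrides only the Bridge idle window and leaves the other values at their defaults.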
