State Machines

State machines are the core orchestration mechanism in UMH Core. Every component is managed by a finite state machine (FSM) with clearly defined states and transitions. This provides predictable, observable behavior and enables reliable error handling and recovery.

How It Works

UMH Core uses hierarchical state machines where components build upon each other:

  • Bridge (formerly Protocol Converter) = Connection + Source Flow + Sink Flow

  • Flow (DataFlow Component) = Benthos instance with lifecycle management

  • Benthos Flow = Individual Benthos process with detailed startup phases

  • Connection = Network probe service (typically nmap-based)

Each component inherits lifecycle states (to_be_created, creating, removing, removed) and adds operational states specific to its function. The Agent continuously reconciles desired vs actual state, triggering appropriate transitions based on observed conditions.
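The reconciliation idea can be pictured with a minimal Python sketch. It is not UMH Core's actual code: the `Component` class, the `reconcile_once` function, the `bridge-1` name, and the linear `PATH` ordering are all assumptions made for illustration. Only the state names themselves come from this page, and the sketch models forward progress only.

```python
from dataclasses import dataclass

# Hypothetical forward path for one component: lifecycle states first,
# then operational states. The ordering is an assumption for this sketch.
PATH = ["to_be_created", "creating", "stopped", "starting", "idle"]


@dataclass
class Component:
    name: str
    desired: str  # state requested by configuration
    actual: str   # state last observed


def reconcile_once(c: Component) -> bool:
    """Advance `actual` one step toward `desired`; return True if it moved."""
    if c.actual == c.desired:
        return False
    i, j = PATH.index(c.actual), PATH.index(c.desired)
    if j < i:
        raise ValueError("this sketch only models forward progress")
    c.actual = PATH[i + 1]
    return True


bridge = Component("bridge-1", desired="idle", actual="to_be_created")
visited = [bridge.actual]
while reconcile_once(bridge):
    visited.append(bridge.actual)
print(visited)
```

Each loop iteration stands in for one reconcile tick: compare desired and actual state, take at most one transition, and stop once they match.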

1 — Redpanda Service

| State | Verified | What it means | How it is entered | How it leaves |
|---|---|---|---|---|
| stopped | | redpanda process not running. | stop_done from stopping, or initial create. | start → starting |
| starting | | S6 launching broker; health checks pending. | start event from stopped. | start_done → idle; start_failed → stopped |
| idle | | Broker healthy, no data for 30 s (default idle window). | start_done or no_data_timeout from active. | data_received → active; degraded → degraded; stop → stopping |
| active | | Broker healthy & BytesIn/OutPerSec > 0. | data_received from idle. | no_data_timeout → idle; degraded → degraded; stop → stopping |
| ⚠️ degraded | | Broker running but ≥1 health‑check failing (disk-space-low, cpu-saturated, etc.). | degraded from idle/active. | recovered → idle; stop → stopping |
| stopping | | Graceful shutdown (draining clients). | stop from any running state. | stop_done → stopped |
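As an illustration, the Redpanda table can be written down as a lookup from (state, event) to next state. This is a sketch in Python, not the UMH Core implementation; the names `REDPANDA_TRANSITIONS` and `step` are hypothetical, and only the transitions listed in the "How it leaves" column are included.

```python
# (state, event) -> next state, transcribed from the "How it leaves" column.
REDPANDA_TRANSITIONS = {
    ("stopped", "start"): "starting",
    ("starting", "start_done"): "idle",
    ("starting", "start_failed"): "stopped",
    ("idle", "data_received"): "active",
    ("idle", "degraded"): "degraded",
    ("idle", "stop"): "stopping",
    ("active", "no_data_timeout"): "idle",
    ("active", "degraded"): "degraded",
    ("active", "stop"): "stopping",
    ("degraded", "recovered"): "idle",
    ("degraded", "stop"): "stopping",
    ("stopping", "stop_done"): "stopped",
}


def step(state: str, event: str) -> str:
    """Return the next state, or raise if the table has no such transition."""
    try:
        return REDPANDA_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


state = "stopped"
for event in ["start", "start_done", "data_received", "degraded",
              "recovered", "stop", "stop_done"]:
    state = step(state, event)
print(state)  # stopped
```

Rejecting unknown (state, event) pairs instead of ignoring them makes an unexpected event visible immediately, which is one way to get the predictable behavior the FSM approach aims for.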


2 — Container Monitor

| State | Verified | Meaning | Enter trigger | Exit trigger |
|---|---|---|---|---|
| active | | CPU < 85 %, RAM < 90 %, Disk < 90 %. | metrics_all_ok after monitor start or from degraded. | metrics_not_ok → degraded |
| ⚠️ degraded | | One of the above limits breached for 15 s. | metrics_not_ok | metrics_all_ok → active |
| monitoring_stopped | | Watchdog disabled. | stop_monitoring_done | start_monitoring → monitoring_starting |
| monitoring_starting | | Monitor service booting. | start_monitoring | start_monitoring_done → degraded (initial) |
| monitoring_stopping | | Monitor shutting down. | stop_monitoring | stop_monitoring_done → monitoring_stopped |
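The threshold rule for the Container Monitor can be sketched as follows. The limits and the 15 s duration come from the table; everything else is an assumption for illustration, including the function names, the sample format, and the reading that a breach must be continuous for 15 s before the monitor counts as degraded.

```python
# Limits (percent) and breach duration (seconds) as documented in the table.
LIMITS = {"cpu": 85.0, "ram": 90.0, "disk": 90.0}
BREACH_SECONDS = 15.0


def breached(metrics: dict) -> bool:
    """True if any metric is at or above its limit (active requires strictly below)."""
    return any(metrics[name] >= limit for name, limit in LIMITS.items())


def classify(samples: list) -> str:
    """Classify a series of (timestamp_seconds, metrics) pairs, oldest first.

    Assumption: the monitor turns "degraded" once a breach has lasted
    BREACH_SECONDS without interruption, and returns to "active" on the
    first sample where all metrics are back under their limits.
    """
    breach_start = None
    state = "active"
    for ts, metrics in samples:
        if breached(metrics):
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= BREACH_SECONDS:
                state = "degraded"
        else:
            breach_start = None
            state = "active"
    return state


ok = {"cpu": 40.0, "ram": 50.0, "disk": 60.0}
hot = {"cpu": 95.0, "ram": 50.0, "disk": 60.0}
print(classify([(0, ok), (5, hot), (10, hot)]))    # short spike
print(classify([(0, hot), (10, hot), (20, hot)]))  # sustained breach
```

Under these assumptions, a spike shorter than 15 s leaves the monitor in active, while a breach that persists across the whole window moves it to degraded.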


3 — Agent Monitor

| State | Verified | Meaning | Enter | Exit |
|---|---|---|---|---|
| active | | Agent connected & internal tasks OK. | metrics_all_ok | metrics_not_ok → degraded |
| ⚠️ degraded | | Cloud unreachable / auth error / task panic. | metrics_not_ok | metrics_all_ok → active |
| monitoring_stopped | | Agent health monitor off. | stop_monitoring_done | start_monitoring → monitoring_starting |
| monitoring_starting | | Starting health checks. | start_monitoring | start_monitoring_done → degraded (initial) |
| monitoring_stopping | | Halting checks. | stop_monitoring | stop_monitoring_done → monitoring_stopped |


4 — DataFlow Component (Bridge)

Aggregate Bridge FSM

| State | Verified | Meaning | Enter | Exit |
|---|---|---|---|---|
| stopped | | All sub‑services stopped. | stop_done or after create. | start → starting |
| starting | | Launching source & sink Benthos + connection monitor. | start | start_done → idle; start_failed → starting_failed |
| starting_failed | | At least one sub‑service failed during start. | start_failed | Manual retry (start) or removal |
| idle | | Sub‑services healthy, no payload for 30 s. | start_done, no_data_received, recovered | data_received → active; benthos_degraded → degraded; stop → stopping |
| active | | Data moving through at least one flow. | data_received | no_data_received → idle; benthos_degraded → degraded; stop → stopping |
| ⚠️ degraded | | ≥1 sub‑FSM degraded/down (connection lost, flow error). | benthos_degraded | benthos_recovered → idle; stop → stopping |
| stopping | | Stopping Benthos + connection monitor. | stop | stop_done → stopped |
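Because the Bridge aggregates a Connection, a source flow, and a sink flow, its running state can be thought of as a function of the three sub-FSM states. The Python sketch below is one possible reading of the "Meaning" column, not the actual aggregation logic: the function name `bridge_state` and the precedence (degraded before active before idle) are assumptions, and start-up and shutdown states are deliberately left out.

```python
# Sub-FSM states this sketch accepts; start/stop phases are not modeled.
CONNECTION_STATES = {"up", "down", "degraded"}
FLOW_STATES = {"idle", "active", "degraded"}


def bridge_state(connection: str, source: str, sink: str) -> str:
    """Derive an aggregate Bridge state from running sub-FSM states."""
    if connection not in CONNECTION_STATES:
        raise ValueError(f"unsupported connection state: {connection!r}")
    flows = (source, sink)
    for flow in flows:
        if flow not in FLOW_STATES:
            raise ValueError(f"unsupported flow state: {flow!r}")
    # Any degraded or down sub-FSM degrades the whole Bridge.
    if connection in ("down", "degraded") or "degraded" in flows:
        return "degraded"
    # Data moving through at least one flow makes the Bridge active.
    if "active" in flows:
        return "active"
    # Everything healthy but no payload.
    return "idle"


print(bridge_state("up", "active", "idle"))
print(bridge_state("down", "active", "active"))
print(bridge_state("up", "idle", "idle"))
```

Checking for degradation first means a lost connection is reported even while a flow still shows traffic, which matches the table's description of degraded as "≥1 sub‑FSM degraded/down".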

4.1 Connection Service FSM

| State | Verified | Meaning |
|---|---|---|
| starting | | Probe service launching. |
| up | | Target reachable. |
| down | | Target unreachable. |
| ⚠️ degraded | | Flaky / intermittent responses. |
| stopping | | Probe shutting down. |
| stopped | | Probe disabled. |

4.2 Benthos Flow (Source / Sink)

| State | Verified | Meaning |
|---|---|---|
| stopped | | Service file present, process not running. |
| starting | | S6 launched process. |
| starting_config_loading | | Benthos parsing YAML pipeline. |
| starting_waiting_for_healthchecks | | Pipeline loaded; waiting for plugin health. |
| starting_waiting_for_service_to_remain_running | | Stability grace period. |
| idle | | Flow running, no msgs for idle window. |
| active | | Processing messages. |
| ⚠️ degraded | | Flow running but error state (e.g., endpoint retries). |
| stopping | | Graceful SIGTERM underway. |

Idle/Active timeout: default 30 s (DFC_IDLE_WINDOW).
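The startup phases and the idle window can be illustrated with a short Python sketch. It rests on two assumptions that this page does not state explicitly: that the startup phases run in the order they are listed, and that a flow which finishes startup lands in idle. The function names `next_phase` and `flow_activity` are hypothetical.

```python
IDLE_WINDOW_SECONDS = 30.0  # documented default (DFC_IDLE_WINDOW)

# Startup phases in the order they are listed for the Benthos Flow.
STARTUP_PHASES = [
    "starting",
    "starting_config_loading",
    "starting_waiting_for_healthchecks",
    "starting_waiting_for_service_to_remain_running",
]


def next_phase(current: str) -> str:
    """Return the phase after `current`; assume "idle" follows the last phase."""
    i = STARTUP_PHASES.index(current)
    return STARTUP_PHASES[i + 1] if i + 1 < len(STARTUP_PHASES) else "idle"


def flow_activity(now: float, last_message_at: float,
                  window: float = IDLE_WINDOW_SECONDS) -> str:
    """Return "active" if a message arrived within the window, else "idle"."""
    return "active" if now - last_message_at < window else "idle"


phase = "starting"
while phase != "idle":
    phase = next_phase(phase)
print(phase)
print(flow_activity(now=100.0, last_message_at=90.0))
print(flow_activity(now=100.0, last_message_at=60.0))
```

In this sketch a flow whose last message is exactly one idle window old already counts as idle; the real boundary behavior may differ.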


5 — Topic Browser Service

The Topic Browser service manages real-time topic discovery and caching.

| State | Verified | Description | Enter Trigger | Exit Trigger |
|---|---|---|---|---|
| stopped | | Service not running | Initial state or stop_done | start → starting |
| starting | | Service initialization | start | benthos_started → starting_benthos |
| starting_benthos | | Benthos starting | benthos_started | redpanda_started → starting_redpanda |
| starting_redpanda | | Redpanda connection | redpanda_started | start_done → idle |
| idle | | Healthy, no active data | start_done or recovered | data_received → active |
| active | | Processing topic data | data_received | no_data_timeout → idle |
| ⚠️ degraded_benthos | | Benthos degraded | benthos_degraded | recovered → idle |
| ⚠️ degraded_redpanda | | Redpanda degraded | redpanda_degraded | recovered → idle |
| stopping | | Graceful shutdown | stop | stop_done → stopped |

  • Default: Active (runs automatically)

  • Transitions: idle ↔ active based on topic activity

  • Recovery: Automatic from degraded states when underlying services recover
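The staged startup and automatic recovery of the Topic Browser can be traced with a small Python sketch. The transitions are transcribed from the table; the table does not say which states the two degraded states are entered from, so entering them from idle and active is an assumption, as are the names `TOPIC_BROWSER_TRANSITIONS` and `trace`.

```python
TOPIC_BROWSER_TRANSITIONS = {
    # Staged startup, as listed in the table.
    ("stopped", "start"): "starting",
    ("starting", "benthos_started"): "starting_benthos",
    ("starting_benthos", "redpanda_started"): "starting_redpanda",
    ("starting_redpanda", "start_done"): "idle",
    # idle <-> active based on topic activity.
    ("idle", "data_received"): "active",
    ("active", "no_data_timeout"): "idle",
    # Assumption: degraded states are reachable from idle and active.
    ("idle", "benthos_degraded"): "degraded_benthos",
    ("active", "benthos_degraded"): "degraded_benthos",
    ("idle", "redpanda_degraded"): "degraded_redpanda",
    ("active", "redpanda_degraded"): "degraded_redpanda",
    # Automatic recovery returns to idle.
    ("degraded_benthos", "recovered"): "idle",
    ("degraded_redpanda", "recovered"): "idle",
    ("stopping", "stop_done"): "stopped",
}


def trace(start: str, events: list) -> list:
    """Return every state visited while applying `events` in order."""
    states = [start]
    for event in events:
        states.append(TOPIC_BROWSER_TRANSITIONS[(states[-1], event)])
    return states


path = trace("stopped", ["start", "benthos_started", "redpanda_started",
                         "start_done", "data_received",
                         "benthos_degraded", "recovered"])
print(" -> ".join(path))
```

The trace shows recovery landing in idle rather than active, as the table specifies; the service then becomes active again only on the next data_received event.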


Quick Defaults

| Parameter | Default | Source Const / Env |
|---|---|---|
| Idle window (Redpanda) | 30 s | REDPANDA_IDLE_WINDOW |
| Idle window (Bridge) | 30 s | DFC_IDLE_WINDOW |
| Container CPU limit | 85 % | CONTAINER_CPU_LIMIT |
| Container RAM limit | 90 % | CONTAINER_RAM_LIMIT |
| Container Disk limit | 90 % | CONTAINER_DISK_LIMIT |
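For tooling that needs the same defaults, a lookup with optional overrides might look like the Python sketch below. The table labels the last column "Source Const / Env" without saying whether each name is a compile-time constant or an environment variable, so treating them as environment variables holding plain numbers is an assumption, and `load_settings` is a hypothetical helper.

```python
import os

# Documented defaults: idle windows in seconds, limits in percent.
DEFAULTS = {
    "REDPANDA_IDLE_WINDOW": 30.0,
    "DFC_IDLE_WINDOW": 30.0,
    "CONTAINER_CPU_LIMIT": 85.0,
    "CONTAINER_RAM_LIMIT": 90.0,
    "CONTAINER_DISK_LIMIT": 90.0,
}


def load_settings(env=None) -> dict:
    """Return the defaults, overridden by numeric values found in `env`.

    Assumption: overrides are plain numbers such as "45" or "92.5".
    """
    env = os.environ if env is None else env
    settings = dict(DEFAULTS)
    for name in DEFAULTS:
        raw = env.get(name)
        if raw is not None:
            settings[name] = float(raw)
    return settings


print(load_settings({"DFC_IDLE_WINDOW": "45"})["DFC_IDLE_WINDOW"])
```

Passing a dictionary instead of reading `os.environ` directly keeps the helper easy to test.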
