State Machines

State machines are the core orchestration mechanism in UMH Core. Every component is managed by a finite state machine (FSM) with clearly defined states and transitions. This provides predictable, observable behavior and enables reliable error handling and recovery.

How It Works

UMH Core uses hierarchical state machines where components build upon each other:

  • Bridge (formerly Protocol Converter) = Connection + Source Flow + Sink Flow

  • Flow (DataFlow Component) = Benthos instance with lifecycle management

  • Benthos Flow = Individual Benthos process with detailed startup phases

  • Connection = Network probe service (typically nmap-based)

Each component inherits lifecycle states (to_be_created, creating, removing, removed) and adds operational states specific to its function. The Agent continuously reconciles desired vs actual state, triggering appropriate transitions based on observed conditions.
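
The reconciliation described above can be pictured with a small sketch. It is illustrative only: the `Component` type, the `desired` values `"running"` and `"stopped"`, and the rule for choosing an event are assumptions made for this example, not the Agent's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Lifecycle states named in the text; every component shares them.
LIFECYCLE_STATES = {"to_be_created", "creating", "removing", "removed"}


@dataclass
class Component:  # hypothetical type, for illustration only
    name: str
    current: str  # observed (actual) state
    desired: str  # "running" or "stopped" (assumed vocabulary)


def next_event(c: Component) -> Optional[str]:
    """Pick the event that moves a component toward its desired state."""
    if c.current in LIFECYCLE_STATES:
        return None  # creation and removal are left to the lifecycle layer
    if c.desired == "running" and c.current == "stopped":
        return "start"
    if c.desired == "stopped" and c.current not in ("stopped", "stopping"):
        return "stop"
    return None  # already converged, or a transition is in flight


def reconcile(components: List[Component]) -> Dict[str, str]:
    """One reconciliation pass: map component name to the event to fire."""
    events = {}
    for c in components:
        event = next_event(c)
        if event is not None:
            events[c.name] = event
    return events
```

Calling `reconcile` repeatedly with freshly observed states models the continuous loop: each pass fires only the events needed to close the gap between desired and actual state.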

1 - Redpanda Service

| State | Verified | What it means | How it is entered | How it leaves |
| --- | --- | --- | --- | --- |
| stopped | ✅ | redpanda process not running. | stop_done from stopping, or initial create. | start → starting |
| starting | ✅ | S6 launching broker; health checks pending. | start event from stopped. | start_done → idle; start_failed → stopped |
| idle | ✅ | Broker healthy, no data for 30 s (default idle window). | start_done or no_data_timeout from active. | data_received → active; degraded → degraded; stop → stopping |
| active | ✅ | Broker healthy & BytesIn/OutPerSec > 0. | data_received from idle. | no_data_timeout → idle; degraded → degraded; stop → stopping |
| ⚠️ degraded | ✅ | Broker running but ≥1 health-check failing (disk-space-low, cpu-saturated, etc.). | degraded from idle/active. | recovered → idle; stop → stopping |
| stopping | ✅ | Graceful shutdown (draining clients). | stop from any running state. | stop_done → stopped |
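
The Redpanda transitions listed above can be written down as a lookup table keyed by (state, event). This is a minimal sketch for reading the table, not the code UMH Core runs; the names `REDPANDA_TRANSITIONS`, `fire`, and `run` are made up for the example, and only the exits listed in the "How it leaves" column are encoded.

```python
# (current state, event) -> next state, copied from the table above.
REDPANDA_TRANSITIONS = {
    ("stopped", "start"): "starting",
    ("starting", "start_done"): "idle",
    ("starting", "start_failed"): "stopped",
    ("idle", "data_received"): "active",
    ("idle", "degraded"): "degraded",
    ("idle", "stop"): "stopping",
    ("active", "no_data_timeout"): "idle",
    ("active", "degraded"): "degraded",
    ("active", "stop"): "stopping",
    ("degraded", "recovered"): "idle",
    ("degraded", "stop"): "stopping",
    ("stopping", "stop_done"): "stopped",
}


def fire(state: str, event: str) -> str:
    """Return the next state, or raise if the event is not listed for this state."""
    try:
        return REDPANDA_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


def run(state: str, events: list) -> str:
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = fire(state, event)
    return state
```

Rejecting unlisted events instead of ignoring them is a design choice for the sketch: it makes an unexpected event visible rather than silently leaving the state unchanged.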


2 - Container Monitor

| State | Verified | Meaning | Enter trigger | Exit trigger |
| --- | --- | --- | --- | --- |
| active | ✅ | CPU < 85 %, RAM < 90 %, Disk < 90 %. | metrics_all_ok after monitor start or from degraded. | metrics_not_ok → degraded |
| ⚠️ degraded | ✅ | One of the above limits breached for 15 s. | metrics_not_ok | metrics_all_ok → active |
| monitoring_stopped | ✅ | Watchdog disabled. | stop_monitoring_done | start_monitoring → monitoring_starting |
| monitoring_starting | ✅ | Monitor service booting. | start_monitoring | start_monitoring_done → degraded (initial) |
| monitoring_stopping | ✅ | Monitor shutting down. | stop_monitoring | stop_monitoring_done → monitoring_stopped |
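
As a rough illustration of the active/degraded rule above, the sketch below keeps the monitor in `active` until a limit has been breached for 15 s. The class and its fields are hypothetical. Two behaviours are assumptions, since the table does not spell them out: recovery to `active` is immediate once all metrics are back under their limits, and the 15 s window restarts after any healthy sample.

```python
from dataclasses import dataclass
from typing import Optional

CPU_LIMIT = 85.0        # percent, from the table above
RAM_LIMIT = 90.0        # percent
DISK_LIMIT = 90.0       # percent
BREACH_WINDOW_S = 15.0  # a breach must persist this long before degrading


def limits_ok(cpu: float, ram: float, disk: float) -> bool:
    """True when every metric is strictly below its limit."""
    return cpu < CPU_LIMIT and ram < RAM_LIMIT and disk < DISK_LIMIT


@dataclass
class ContainerMonitor:  # hypothetical helper, not the UMH Core type
    state: str = "degraded"  # start_monitoring_done lands in degraded first
    breach_since: Optional[float] = None

    def observe(self, now: float, cpu: float, ram: float, disk: float) -> str:
        """Feed one metrics sample (timestamp in seconds) and return the state."""
        if limits_ok(cpu, ram, disk):
            self.breach_since = None
            self.state = "active"  # corresponds to metrics_all_ok
        else:
            if self.breach_since is None:
                self.breach_since = now  # first sample of this breach
            if now - self.breach_since >= BREACH_WINDOW_S:
                self.state = "degraded"  # corresponds to metrics_not_ok
        return self.state
```

Passing the timestamp in explicitly, rather than reading a clock inside `observe`, keeps the sketch deterministic and easy to test.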


3 - Agent Monitor

| State | Verified | Meaning | Enter | Exit |
| --- | --- | --- | --- | --- |
| active | ✅ | Agent connected & internal tasks OK. | metrics_all_ok | metrics_not_ok → degraded |
| ⚠️ degraded | ✅ | Cloud unreachable / auth error / task panic. | metrics_not_ok | metrics_all_ok → active |
| monitoring_stopped | ✅ | Agent health monitor off. | stop_monitoring_done | start_monitoring → monitoring_starting |
| monitoring_starting | ✅ | Starting health checks. | start_monitoring | start_monitoring_done → degraded (initial) |
| monitoring_stopping | ✅ | Halting checks. | stop_monitoring | stop_monitoring_done → monitoring_stopped |


4 - DataFlow Component (Bridge)

Aggregate Bridge FSM

| State | Verified | Meaning | Enter | Exit |
| --- | --- | --- | --- | --- |
| stopped | ✅ | All sub-services stopped. | stop_done or after create. | start → starting |
| starting | ✅ | Launching source & sink Benthos + connection monitor. | start | start_done → idle; start_failed → starting_failed |
| starting_failed | ✅ | At least one sub-service failed during start. | start_failed | Manual retry (start) or removal |
| idle | ✅ | Sub-services healthy, no payload for 30 s. | start_done, no_data_received, recovered | data_received → active; benthos_degraded → degraded; stop → stopping |
| active | ✅ | Data moving through at least one flow. | data_received | no_data_received → idle; benthos_degraded → degraded; stop → stopping |
| ⚠️ degraded | ✅ | ≥1 sub-FSM degraded/down (connection lost, flow error). | benthos_degraded | benthos_recovered → idle; stop → stopping |
| stopping | ✅ | Stopping Benthos + connection monitor. | stop | stop_done → stopped |

4.1 Connection Service FSM

| State | Verified | Meaning |
| --- | --- | --- |
| starting | ✅ | Probe service launching. |
| up | ✅ | Target reachable. |
| down | ✅ | Target unreachable. |
| ⚠️ degraded | ✅ | Flaky / intermittent responses. |
| stopping | ✅ | Probe shutting down. |
| stopped | ✅ | Probe disabled. |

4.2 Benthos Flow (Source/Sink)

| State | Verified | Meaning |
| --- | --- | --- |
| stopped | ✅ | Service file present, process not running. |
| starting | ✅ | S6 launched process. |
| starting_config_loading | ✅ | Benthos parsing YAML pipeline. |
| starting_waiting_for_healthchecks | ✅ | Pipeline loaded; waiting for plugin health. |
| starting_waiting_for_service_to_remain_running | ✅ | Stability grace period. |
| idle | ✅ | Flow running, no msgs for idle window. |
| active | ✅ | Processing messages. |
| ⚠️ degraded | ✅ | Flow running but error state (e.g., endpoint retries). |
| stopping | ✅ | Graceful SIGTERM underway. |

Idle/Active timeout: default 30 s (DFC_IDLE_WINDOW).
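
To make the relationship between the aggregate Bridge state and its three sub-FSMs concrete, here is one possible way to derive the Bridge state from the Connection, source flow, and sink flow states. The function and its precedence order (stopped, then degraded, then stopping, then active/idle, otherwise starting) are assumptions for illustration; the real Bridge FSM is event-driven and also has a `starting_failed` state that this sketch does not model.

```python
CONNECTION_BAD = {"down", "degraded"}  # connection states treated as unhealthy
FLOW_RUNNING = {"idle", "active"}      # Benthos flow states past startup


def bridge_state(connection: str, source: str, sink: str) -> str:
    """Derive an aggregate Bridge state from sub-FSM states (assumed rules)."""
    flows = (source, sink)
    subs = (connection, source, sink)
    if all(s == "stopped" for s in subs):
        return "stopped"  # all sub-services stopped
    if connection in CONNECTION_BAD or "degraded" in flows:
        return "degraded"  # at least one sub-FSM degraded or down
    if "stopping" in subs:
        return "stopping"
    if connection == "up" and all(f in FLOW_RUNNING for f in flows):
        # Data moving through at least one flow makes the Bridge active.
        return "active" if "active" in flows else "idle"
    return "starting"  # some sub-service is still in a startup phase


if __name__ == "__main__":
    print(bridge_state("up", "active", "idle"))
    print(bridge_state("down", "active", "active"))
```

Checking for degraded sub-FSMs before checking for activity reflects the table's wording that a single degraded or down sub-FSM is enough to degrade the whole Bridge, even while data still flows.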


5 - Topic Browser Service

The Topic Browser service manages real-time topic discovery and caching.

| State | Verified | Description | Enter Trigger | Exit Trigger |
| --- | --- | --- | --- | --- |
| stopped | ✅ | Service not running | Initial state or stop_done | start → starting |
| starting | ✅ | Service initialization | start | benthos_started → starting_benthos |
| starting_benthos | ✅ | Benthos starting | benthos_started | redpanda_started → starting_redpanda |
| starting_redpanda | ✅ | Redpanda connection | redpanda_started | start_done → idle |
| idle | ✅ | Healthy, no active data | start_done or recovered | data_received → active |
| active | ✅ | Processing topic data | data_received | no_data_timeout → idle |
| ⚠️ degraded_benthos | ✅ | Benthos degraded | benthos_degraded | recovered → idle |
| ⚠️ degraded_redpanda | ✅ | Redpanda degraded | redpanda_degraded | recovered → idle |
| stopping | ✅ | Graceful shutdown | stop | stop_done → stopped |

  • Default: Active (runs automatically)

  • Transitions: idle ↔ active based on topic activity

  • Recovery: Automatic from degraded states when underlying services recover
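
The staged startup of the Topic Browser (service, then Benthos, then Redpanda) is easier to follow as a trace of visited states. The sketch below encodes only the Exit Trigger column of the table; `TOPIC_BROWSER_EXITS` and `trace` are hypothetical names, and the table does not say from which states `stop`, `benthos_degraded`, or `redpanda_degraded` may fire, so those events are left out.

```python
# state -> {event: next state}, taken from the Exit Trigger column above.
TOPIC_BROWSER_EXITS = {
    "stopped": {"start": "starting"},
    "starting": {"benthos_started": "starting_benthos"},
    "starting_benthos": {"redpanda_started": "starting_redpanda"},
    "starting_redpanda": {"start_done": "idle"},
    "idle": {"data_received": "active"},
    "active": {"no_data_timeout": "idle"},
    "degraded_benthos": {"recovered": "idle"},
    "degraded_redpanda": {"recovered": "idle"},
    "stopping": {"stop_done": "stopped"},
}


def trace(start: str, events: list) -> list:
    """Return every state visited, stopping at the first unlisted event."""
    path = [start]
    for event in events:
        exits = TOPIC_BROWSER_EXITS[path[-1]]
        if event not in exits:
            break  # the table lists no such exit for the current state
        path.append(exits[event])
    return path


if __name__ == "__main__":
    startup = ["start", "benthos_started", "redpanda_started", "start_done"]
    print(" -> ".join(trace("stopped", startup)))
```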


Quick Defaults

| Parameter | Default | Source Const / Env |
| --- | --- | --- |
| Idle window (Redpanda) | 30 s | REDPANDA_IDLE_WINDOW |
| Idle window (Bridge) | 30 s | DFC_IDLE_WINDOW |
| Container CPU limit | 85 % | CONTAINER_CPU_LIMIT |
| Container RAM limit | 90 % | CONTAINER_RAM_LIMIT |
| Container Disk limit | 90 % | CONTAINER_DISK_LIMIT |
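
If these names are exposed as environment variables (the column header allows for either a source constant or an environment variable, so this is an assumption), a tool that needs the same thresholds could resolve them as sketched below. The value format, a plain number in seconds or percent, is likewise assumed, and `load_settings` is a hypothetical helper.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the table above: seconds for windows, percent for limits.
DEFAULTS = {
    "REDPANDA_IDLE_WINDOW": 30.0,
    "DFC_IDLE_WINDOW": 30.0,
    "CONTAINER_CPU_LIMIT": 85.0,
    "CONTAINER_RAM_LIMIT": 90.0,
    "CONTAINER_DISK_LIMIT": 90.0,
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, float]:
    """Resolve each parameter from env, falling back to the documented default."""
    env = os.environ if env is None else env
    settings = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        # Assumption: values are plain numbers such as "45" or "92.5".
        settings[name] = float(raw) if raw not in (None, "") else default
    return settings
```

For example, `load_settings({"DFC_IDLE_WINDOW": "45"})` overrides only the Bridge idle window and keeps every other value at its default.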
