Finite State Machines for Real-Time Alerting in Elixir → Lawrence Gosset

Every 30 seconds, our LLM pipeline produces observations — person counts, stock levels, cleanliness scores, desk occupancy. Some of those observations need to trigger alerts. But LLMs hallucinate. Lighting changes. Cameras glitch. A shadow falls across the frame and the model counts an extra person for one cycle.

Firing an alert on a single observation is noisy. If “person count > 5” triggers an alert every time the LLM says 6, and the LLM is wrong 10% of the time, then at one observation every 30 seconds you’re sending a false alert roughly every 5 minutes (one bad observation in ten). That’s worse than no alerting at all, because users learn to ignore it.

Our first version did exactly this. Single observation, fire immediately. Within a week of testing we had customers asking to disable alerts entirely. The signal-to-noise ratio was terrible. We needed a system that confirms before it escalates — something that tracks whether a condition persists across multiple observations, enforces cooldown periods after resolution, and handles the concurrency of multiple Oban workers evaluating the same monitors simultaneously.

Monitors and rules as data

We wanted alert conditions defined in the database, not in code. The thinking here was the same as with our data-driven prompt pipeline — if adding a new alert type requires a code change and deploy, we’d be the bottleneck every time a customer wants to watch for something new.

An AlertMonitor references a specific observation output by ID and defines the condition to watch for:

schema "alert_monitors" do
  field(:name, :string)
  field(:operator, Ecto.Enum, values: [:gt, :lt, :gte, :lte, :eq, :ne, :in, :not_in])
  field(:alert_on_value, Scout.Ecto.JsonbAny)
  field(:window_seconds, :integer, default: 300)
  field(:breach_count, :integer, default: 1)
  field(:cooldown_seconds, :integer, default: 1800)
  field(:severity, Ecto.Enum, values: [:low, :medium, :high, :critical], default: :medium)
  field(:active, :boolean, default: true)

  belongs_to(:camera, Camera)
  belongs_to(:observation_output, ObservationOutput)
end

A monitor reads as: “For this camera, when this observation output matches this condition at least breach_count times within window_seconds, fire an alert at this severity. After resolution, wait cooldown_seconds before allowing it to fire again.”

For example: “Alert when person_count is greater than 5, at least 3 times in 300 seconds, with HIGH severity and a 30-minute cooldown.”
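To make that reading concrete, here is a minimal sketch of the example monitor as plain data, with a helper that renders it back into the sentence above. The MonitorExample module and describe/2 are hypothetical names used only for illustration; the field names come from the schema, and a plain map stands in for the Ecto struct.

```elixir
defmodule MonitorExample do
  # Hypothetical helper, not part of the article's codebase.
  # Human-readable labels for a few of the schema's operators.
  @operator_text %{gt: "greater than", lt: "less than", eq: "equal to", ne: "not equal to"}

  # Renders a monitor (a plain map using the schema's field names) as a sentence.
  def describe(monitor, output_name) do
    op = Map.fetch!(@operator_text, monitor.operator)
    severity = monitor.severity |> Atom.to_string() |> String.upcase()

    "Alert when #{output_name} is #{op} #{inspect(monitor.alert_on_value)}, " <>
      "at least #{monitor.breach_count} times in #{monitor.window_seconds} seconds, " <>
      "with #{severity} severity and a #{div(monitor.cooldown_seconds, 60)}-minute cooldown."
  end
end

monitor = %{
  operator: :gt,
  alert_on_value: 5,
  window_seconds: 300,
  breach_count: 3,
  cooldown_seconds: 1800,
  severity: :high
}

IO.puts(MonitorExample.describe(monitor, "person_count"))
```

Because the monitor is only data, the same six fields can describe a stock-level or cleanliness alert without any new code.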

The alert_on_value field uses a JSONB column (via a custom JsonbAny Ecto type) so it can hold any value type — a boolean, a number, a string, or a list for :in/:not_in operators. We went back and forth on whether to use separate columns per type vs. a single JSONB column. The JSONB approach won because it keeps the schema simple and the operator validation handles type safety at write time:

defp check_operator_validity(changeset, :boolean, operator) do
  if operator in [:eq, :ne], do: changeset,
  else: add_error(changeset, :operator, "must be 'eq' or 'ne' for boolean outputs")
end

defp check_operator_validity(changeset, :enum, operator) do
  if operator in [:eq, :ne, :in, :not_in], do: changeset,
  else: add_error(changeset, :operator, "must be 'eq', 'ne', 'in', or 'not_in' for enum outputs")
end

defp check_operator_validity(changeset, :number, _operator), do: changeset

The output type is derived from the observation output’s JSON Schema — {"type": "boolean"} → :boolean, {"type": "string", "enum": [...]} → :enum, {"type": "integer"} → :number. This is the same schema that constrains the LLM’s structured output, so the monitor’s condition types are guaranteed to match what the LLM actually returns.
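The mapping from JSON Schema to output type could look roughly like the sketch below. OutputTypeSketch and from_schema/1 are hypothetical names, the schema is assumed to be already decoded into a map with string keys, and treating "number" the same as "integer" is an assumption rather than something the article states.

```elixir
defmodule OutputTypeSketch do
  # Hypothetical module: maps a decoded JSON Schema fragment to the
  # output types used by the operator validation above.

  # A string constrained by "enum" is treated as an enum output.
  def from_schema(%{"type" => "string", "enum" => values}) when is_list(values), do: {:ok, :enum}
  def from_schema(%{"type" => "boolean"}), do: {:ok, :boolean}
  # Assumption: JSON Schema "number" is handled the same way as "integer".
  def from_schema(%{"type" => type}) when type in ["integer", "number"], do: {:ok, :number}
  # Anything else is not a monitorable output type in this sketch.
  def from_schema(_schema), do: :error
end

IO.inspect(OutputTypeSketch.from_schema(%{"type" => "boolean"}))
IO.inspect(OutputTypeSketch.from_schema(%{"type" => "string", "enum" => ["clean", "dirty"]}))
IO.inspect(OutputTypeSketch.from_schema(%{"type" => "integer"}))
```

The result of a function like this is what would be passed as the second argument to check_operator_validity/3.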

Like the prompt pipeline, adding a new alert type is a database operation. Create a monitor record pointing at the observation output you want to watch, set the condition, and the evaluator picks it up on the next analysis cycle.

The alert lifecycle as a state machine

An AlertInstance tracks one alert condition through its lifecycle. We considered a simple status column with ad-hoc updates, but the trouble is that status transitions end up scattered across conditionals — can you go from “resolved” back to “tracking”? Can you skip “tracking” and go straight to “alerting”? Without explicit rules, it’s easy to end up with invalid state transitions that are hard to debug.

We use Fsmx to define the states and enforce valid transitions:

use Fsmx.Fsm,
  transitions: %{
    "tracking" => ["alerting", "resolved"],
    "alerting" => ["resolved"]
  }

Three states, three transitions:

tracking → alerting    (condition confirmed by a second observation)
tracking → resolved    (condition cleared before confirmation)
alerting → resolved    (condition cleared while actively alerting)

The AlertInstance schema carries the timestamps for each lifecycle event:

schema "alert_instances" do
  field(:state, Ecto.Enum, values: [:tracking, :alerting, :resolved], default: :tracking)
  field(:first_detected_at, :utc_datetime)
  field(:last_seen_at, :utc_datetime)
  field(:triggered_at, :utc_datetime)     # nil while tracking, set on escalation
  field(:resolved_at, :utc_datetime)      # nil until resolved
  field(:trigger_value, :string)          # the value that caused escalation

  belongs_to(:alert_monitor, AlertMonitor)
  belongs_to(:camera, Camera)
  belongs_to(:current_observation, AnalysisObservation)
end

Fsmx’s before_transition callbacks handle the timestamp bookkeeping:

def before_transition(alert, "tracking", "alerting", %{value: value}) do
  now = DateTime.utc_now() |> DateTime.truncate(:second)
  trigger_value = if is_nil(value), do: nil, else: to_string(value)

  {:ok, %{alert | triggered_at: now, trigger_value: trigger_value, last_seen_at: now}}
end

def before_transition(alert, _from_state, "resolved", _metadata) do
  now = DateTime.utc_now() |> DateTime.truncate(:second)
  {:ok, %{alert | resolved_at: now, last_seen_at: now}}
end

The transitions are explicit and enforced. You can’t go from alerting back to tracking; Fsmx will reject it. Nothing skips straight to alerting either: every new instance starts in :tracking (the schema default), and :alerting is only reachable from there. The lifecycle is encoded in the state machine definition, not scattered across conditionals.
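To see what the transition map enforces, here is a small stand-alone sketch of the rule. It is not how Fsmx is implemented. TransitionSketch and allowed?/2 are hypothetical names, and only the map itself is taken from the definition above.

```elixir
defmodule TransitionSketch do
  # Same transition map as the Fsmx definition above.
  @transitions %{
    "tracking" => ["alerting", "resolved"],
    "alerting" => ["resolved"]
  }

  # Returns true only when `to` is listed as a destination of `from`.
  # "resolved" has no entry, so it is a terminal state.
  def allowed?(from, to), do: to in Map.get(@transitions, from, [])
end

IO.inspect(TransitionSketch.allowed?("tracking", "alerting"))  # true
IO.inspect(TransitionSketch.allowed?("alerting", "tracking"))  # false
IO.inspect(TransitionSketch.allowed?("resolved", "tracking"))  # false
```

The last call shows why a re-triggered condition needs a new AlertInstance: a resolved instance has nowhere left to go.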

Each alert instance is also immutable in the sense that each on/off cycle creates a new instance. When an alert resolves and the condition triggers again later, that’s a new AlertInstance, not a reopened one. This gives you a clean audit trail — every alert that ever fired, when it started, when it escalated, when it resolved.

Two-step confirmation

This is the core idea that fixed our false positive problem. When a condition is first met, the alert enters :tracking state. Only if the condition persists on a subsequent, different observation does it escalate to :alerting.

defp handle_tracking_state(alert, monitor, obs) do
  if alert.current_observation_id == obs.id do
    # Same observation that created the alert — stay in tracking
    :no_change
  else
    # Different observation confirms persistence — transition to alerting
    apply_state_transition_to_alerting(alert, monitor, obs)
  end
end

The check is alert.current_observation_id == obs.id. When an alert is created in :tracking, it stores the ID of the observation that triggered it. On the next evaluation cycle (30 seconds later), if the condition is still met with a different observation, that’s confirmation — the condition is real, not a single-frame glitch.

This filters out:

LLM hallucinations: the model miscounts one frame, but the next frame is correct
Transient conditions: a delivery person walks through the frame for one cycle
Camera glitches: a brief exposure change produces one bad analysis

The confirmation step is not configurable per-monitor — and this was a deliberate choice. The two-step approach isn’t about “how many consecutive breaches” — that’s what breach_count and window_seconds handle. The two-step check is specifically about requiring a different observation to confirm, preventing the same (potentially hallucinated) data point from both creating and confirming an alert.

If the condition isn’t met on the next evaluation, the alert resolves from :tracking — it never escalated, and the user never saw it. The tracking state is invisible to the user; it’s internal bookkeeping.
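A rough back-of-the-envelope estimate shows why the second observation matters. The sketch below assumes each observation is wrong independently with probability 0.1 (the figure used earlier) and a 30-second cycle. FalsePositiveEstimate is a hypothetical name, and the independence assumption is optimistic: a lighting change that persists across frames produces correlated errors, so the real improvement will be smaller.

```elixir
defmodule FalsePositiveEstimate do
  @cycle_seconds 30

  # Approximate expected number of runs of `n` consecutive wrong observations
  # per hour, ASSUMING each observation is wrong independently with probability `p`.
  def per_hour(p, n) do
    cycles_per_hour = div(3600, @cycle_seconds)
    cycles_per_hour * :math.pow(p, n)
  end
end

# Single-observation alerting: every wrong observation fires.
IO.inspect(Float.round(FalsePositiveEstimate.per_hour(0.1, 1), 2))
# Two-step confirmation: two wrong observations in a row are needed.
IO.inspect(Float.round(FalsePositiveEstimate.per_hour(0.1, 2), 2))
```

Under these assumptions the estimate drops from about 12 false alerts an hour to about 1.2, at the cost of one extra 30-second cycle before a real alert becomes visible.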

Evaluation logic

We considered running alert evaluation as a separate Oban job: poll monitors on a schedule, check recent observations. The trouble is that this introduces a delay between when an observation is saved and when it’s evaluated, and for real-time alerting that delay matters. Instead, alert evaluation runs inline after every analysis. The Executor calls Alerts.Evaluator.evaluate_batch(saved_observations) synchronously after observations are saved, whether they come from a fresh LLM analysis, a reprocessing run, or a duplicate clone.

The evaluator preloads everything upfront to avoid N+1 queries:

def evaluate_batch(observations) when is_list(observations) do
  observations = Repo.preload(observations, :analysis)

  keys =
    observations
    |> Enum.map(fn obs -> {obs.analysis.camera_id, obs.observation_output_id} end)
    |> Enum.uniq()

  # Four batched preload calls cover all data needed for evaluation
  monitors_by_key = load_monitors(keys)
  all_monitors = Enum.flat_map(monitors_by_key, fn {_key, monitors} -> monitors end)
  monitor_ids = Enum.map(all_monitors, & &1.id)
  alerts_by_monitor = load_active_alerts(monitor_ids)
  {breach_counts_cache, _metrics} = preload_breach_counts(all_monitors, observations)
  cooldown_cache = preload_cooldown_status(monitor_ids)

  # Tally {triggered, resolved} across every observation/monitor pair
  Enum.reduce(observations, {0, 0}, fn obs, {triggered, resolved} ->
    monitors = Map.get(monitors_by_key, {obs.analysis.camera_id, obs.observation_output_id}, [])

    Enum.reduce(monitors, {triggered, resolved}, fn monitor, {t, r} ->
      case evaluate_monitor(monitor, obs, alerts_by_monitor, breach_counts_cache, cooldown_cache) do
        :triggered -> {t + 1, r}
        :resolved -> {t, r + 1}
        :no_change -> {t, r}
      end
    end)
  end)
end

For each monitor, evaluate_monitor makes a decision based on four inputs:

Does the current observation breach the condition? Direct comparison using the monitor’s operator:

def compare(value, :gt, threshold) when is_number(value) and is_number(threshold), do: value > threshold
def compare(value, :lt, threshold) when is_number(value) and is_number(threshold), do: value < threshold
def compare(value, :eq, threshold), do: value == threshold
def compare(value, :in, threshold) when is_list(threshold), do: value in threshold
# ...
def compare(_, _, _), do: false

How many breaches in the time window? Queried per-monitor with a 10-second granularity cache to avoid redundant queries:

window_start_rounded = round_datetime_to_10s(window_start)
window_end_rounded = round_datetime_to_10s(current_time)
cache_key = {monitor.id, window_start_rounded, window_end_rounded}

if Map.has_key?(cache, cache_key) do
  {cache, %{metrics | cache_hits: metrics.cache_hits + 1}}
else
  breach_count = query_breach_count(monitor, window_start, current_time)
  {Map.put(cache, cache_key, breach_count), %{metrics | cache_misses: metrics.cache_misses + 1}}
end

Is this monitor in cooldown? Checked against the most recent resolved_at timestamp.
Is there an existing active alert? Determines whether to create, escalate, resolve, or ignore.
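The article does not show round_datetime_to_10s, so the following is only a guess at its shape: a sketch that floors a timestamp to the start of its 10-second bucket and builds a cache key like the one in the breach-count snippet. WindowCacheSketch, floor_to_10s/1 and cache_key/3 are hypothetical names, and flooring (rather than rounding to nearest) is an assumption.

```elixir
defmodule WindowCacheSketch do
  # Assumption: "rounding" means flooring to the start of the 10-second bucket.
  def floor_to_10s(%DateTime{} = dt) do
    dt
    |> DateTime.to_unix()
    |> div(10)
    |> Kernel.*(10)
    |> DateTime.from_unix!()
  end

  # Builds a cache key shaped like the one in the breach-count snippet.
  def cache_key(monitor_id, window_start, current_time) do
    {monitor_id, floor_to_10s(window_start), floor_to_10s(current_time)}
  end
end

# Two evaluations 4 seconds apart land in the same buckets and share a key,
# so the second one would be a cache hit.
a = WindowCacheSketch.cache_key(42, ~U[2024-01-01 11:55:01Z], ~U[2024-01-01 12:00:01Z])
b = WindowCacheSketch.cache_key(42, ~U[2024-01-01 11:55:05Z], ~U[2024-01-01 12:00:05Z])
IO.inspect(a == b)
```

The trade-off is that a cached count can be up to 10 seconds stale, which is small next to a 30-second observation cycle.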

The decision matrix:

Active alert?   Should alert?   Current state   Action
-------------   -------------   -------------   ------
No              Yes             n/a             Create alert in :tracking
Yes             Yes             :tracking       Transition to :alerting (if different observation)
Yes             Yes             :alerting       Update last_seen_at, stay alerting
Yes             No              Any             Transition to :resolved
No              No              n/a             No action

Where “should alert” means: breach count meets threshold AND monitor is not in cooldown.
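The decision matrix can be written as a pure function over three inputs, which makes each row easy to test in isolation. This is a sketch, not the evaluator's actual code: DecisionSketch, decide/3, and the returned action atoms are hypothetical, and the alert is represented as nil or a plain map.

```elixir
defmodule DecisionSketch do
  # Hypothetical pure version of the decision matrix. `alert` is nil or a map
  # with :state and :current_observation_id; `obs_id` is the observation
  # currently being evaluated.

  def should_alert?(breaches, required, in_cooldown?), do: breaches >= required and not in_cooldown?

  # No active alert, condition holds: start tracking.
  def decide(nil, true, _obs_id), do: :create_tracking
  # No active alert, condition does not hold: nothing to do.
  def decide(nil, false, _obs_id), do: :no_action
  # Active alert but the condition no longer holds: resolve from any state.
  def decide(%{}, false, _obs_id), do: :resolve
  # Tracking, evaluated again with the observation that created it: wait.
  def decide(%{state: :tracking, current_observation_id: id}, true, id), do: :no_action
  # Tracking, confirmed by a different observation: escalate.
  def decide(%{state: :tracking}, true, _obs_id), do: :escalate_to_alerting
  # Already alerting and still breaching: refresh last_seen_at.
  def decide(%{state: :alerting}, true, _obs_id), do: :touch_last_seen
end

tracking = %{state: :tracking, current_observation_id: 1}
IO.inspect(DecisionSketch.decide(nil, true, 1))       # :create_tracking
IO.inspect(DecisionSketch.decide(tracking, true, 1))  # :no_action
IO.inspect(DecisionSketch.decide(tracking, true, 2))  # :escalate_to_alerting
```

The fourth clause reuses the variable id in both positions, so it only matches when the incoming observation is the one that created the alert. That is the two-step confirmation rule expressed as a pattern match.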

Cooldown and race conditions

Cooldown

After an alert resolves, you don’t want it firing again on the next 30-second cycle. We initially didn’t have a cooldown, and the result was alert spam — a condition would hover right at the threshold, triggering and resolving every other cycle. The cooldown_seconds field (default: 1800, i.e. 30 minutes) prevents retriggering.

Cooldown status is preloaded in batch by querying the most recent resolved_at for each monitor:

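defp preload_cooldown_status(monitor_ids) do
  now = DateTime.utc_now()

  # {monitor_id, cooldown_seconds} pairs for the monitors being evaluated
  # (assumed here to be fetched by ID)
  monitors =
    from(m in AlertMonitor,
      where: m.id in ^monitor_ids,
      select: {m.id, m.cooldown_seconds}
    )
    |> Repo.all()

  # Most recent resolution per monitor; monitors that never resolved are absent
  monitors_with_recent_resolved =
    from(a in AlertInstance,
      where: a.alert_monitor_id in ^monitor_ids,
      where: a.state == :resolved,
      where: not is_nil(a.resolved_at),
      group_by: a.alert_monitor_id,
      select: {a.alert_monitor_id, max(a.resolved_at)}
    )
    |> Repo.all()
    |> Map.new()

  Enum.into(monitors, %{}, fn {monitor_id, cooldown_seconds} ->
    in_cooldown =
      case Map.get(monitors_with_recent_resolved, monitor_id) do
        nil ->
          false

        most_recent_resolved ->
          # In cooldown if the last resolution falls inside the cooldown window
          cooldown_start = DateTime.add(now, -cooldown_seconds, :second)
          DateTime.compare(most_recent_resolved, cooldown_start) in [:gt, :eq]
      end

    {monitor_id, in_cooldown}
  end)
end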

The cooldown check feeds directly into the should_alert calculation:

should_alert = breach_count >= monitor.breach_count and not in_cooldown

If a monitor is in cooldown, should_alert is false regardless of breach count. If an active alert exists, it resolves. If no alert exists, nothing happens.
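The time comparison at the heart of the cooldown check is easy to get backwards, so it helps to pull it out into a pure function and try a few timestamps. CooldownSketch and in_cooldown?/3 are hypothetical names; the comparison itself mirrors the one in preload_cooldown_status above.

```elixir
defmodule CooldownSketch do
  # Hypothetical pure helper mirroring the comparison in preload_cooldown_status/1.
  # `last_resolved_at` is nil when the monitor has never resolved an alert.
  def in_cooldown?(nil, _cooldown_seconds, _now), do: false

  def in_cooldown?(%DateTime{} = last_resolved_at, cooldown_seconds, %DateTime{} = now) do
    cooldown_start = DateTime.add(now, -cooldown_seconds, :second)
    # In cooldown when the last resolution is at or after the start of the window.
    DateTime.compare(last_resolved_at, cooldown_start) in [:gt, :eq]
  end
end

now = ~U[2024-01-01 12:00:00Z]
IO.inspect(CooldownSketch.in_cooldown?(~U[2024-01-01 11:45:00Z], 1800, now))  # resolved 15 minutes ago: true
IO.inspect(CooldownSketch.in_cooldown?(~U[2024-01-01 11:15:00Z], 1800, now))  # resolved 45 minutes ago: false
IO.inspect(CooldownSketch.in_cooldown?(nil, 1800, now))                       # never resolved: false
```

Passing now in as an argument, rather than calling DateTime.utc_now() inside, is what keeps the function deterministic and testable.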

Race conditions

With 12 concurrent processing workers, two workers can evaluate the same monitor simultaneously. Both see no active alert, both try to create one. This was one of those bugs that only showed up under load — in development with a single worker, everything was fine.

The defense is a partial unique index in Postgres:

CREATE UNIQUE INDEX one_active_alert_per_monitor
  ON alert_instances(alert_monitor_id)
  WHERE state IN ('tracking', 'alerting')

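Only one alert instance per monitor can exist in :tracking or :alerting state at any time. The second insert fails with a unique constraint violation. For Ecto to return that failure as a changeset error instead of raising Ecto.ConstraintError, the changeset has to declare the index by name with unique_constraint/3.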

The evaluator catches this and recovers:

defp handle_insert_error(errors, monitor, obs, breach_counts_cache) do
  case Keyword.get(errors, :alert_monitor_id) do
    {_, [constraint: :unique, constraint_name: "one_active_alert_per_monitor"]} ->
      case get_active_alert(monitor.id) do
        nil -> :no_change
        alert -> apply_transition(alert, monitor, obs, true, true, nil, breach_counts_cache)
      end
  end
end

Worker A wins the insert and creates the alert in :tracking. Worker B’s insert fails, it re-fetches the alert Worker A created, and applies its transition — which either stays in :tracking (same observation) or escalates to :alerting (different observation). No duplicate alerts, no lost transitions.

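State transitions themselves are Repo.update! calls on the existing record, and they rely on Ecto’s optimistic locking for concurrency. Optimistic locking is opt-in: the changeset has to call Ecto.Changeset.optimistic_lock/3 against a version column (conventionally lock_version), which the schema excerpt above doesn’t show. With that in place, if two workers try to transition the same alert simultaneously, one succeeds and the other raises Ecto.StaleEntryError. The caller can rescue that and treat it as a no-op, because the state has already moved.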

Closing the loop with the LLM

The alert system doesn’t just sit downstream of the LLM pipeline — it feeds back into it. As covered in Data-Driven LLM Prompts, active alert monitors for a camera are included in the prompt:

MONITORED CONDITIONS (these are scored as problems when they occur):
- person_count: Alert when is greater than 5 → HIGH severity

This means the LLM knows which observations carry business significance for this camera. When a monitored condition is present, the prompt instructs the LLM to weight the health_score heavily downward — a CRITICAL severity condition should drive the score to 0-30.

The connection is data-driven end to end: observation outputs define what the LLM looks at, monitors define what triggers alerts on those outputs, and those same monitors are fed back into the prompt so the LLM can prioritize accordingly. Add a new monitor, and the LLM automatically adjusts its scoring behavior on the next capture cycle. No code changes anywhere in the chain.

On state transitions, the evaluator broadcasts via PubSub to the organization’s alert topic:

Phoenix.PubSub.broadcast(Scout.PubSub, "alerts:org:#{org_id}",
  {:alert_triggered, %{alert: alert, monitor: monitor, severity: monitor.severity,
    message: "#{monitor.name} triggered with value: #{alert.trigger_value}"}})

LiveView dashboards subscribe to these topics and update in real time: the user sees the alert appear within the same 30-second cycle that escalated it to :alerting. No separate WebSocket server, no Pusher, no separate real-time layer. Phoenix PubSub and LiveView handle the entire chain from state transition to browser update.

It’s worth stepping back and noting what’s not in this system. There’s no external message queue for event distribution. No cron job polling for alert conditions. No separate WebSocket service for pushing updates. The evaluator runs inline in the Oban worker, broadcasts via PubSub, and LiveView picks it up — all within the same BEAM VM. In a typical Node.js or Python stack, you’d likely need Redis for pub/sub, a separate WebSocket gateway, and a polling job to check alert conditions. Here, the concurrency primitives and PubSub are built into the platform.

What this doesn’t handle yet

A few things we’re still thinking through:

Alert grouping. If three cameras in the same area all fire at once, that’s probably one event, not three. We don’t have any grouping or correlation logic yet — each monitor is independent. For now this hasn’t been a big problem because most deployments have 1–2 cameras, but it’ll matter as we scale.
Escalation paths. Right now an alert is either active or resolved. There’s no “alert has been active for 30 minutes and nobody acknowledged it, escalate.” We’ll likely add states for this, and Fsmx makes that straightforward — just extend the transition map.
Notification channels. Alerts broadcast via PubSub to LiveView dashboards, but there’s no email, SMS, or webhook integration yet. The PubSub broadcast is the hook point for adding these, but the integrations themselves are still to be built.