Scout is an AI-powered camera monitoring system for small businesses — restaurants, retail stores, offices. Raspberry Pi devices running Elixir Nerves sit on-site, connected to IP cameras over the local network. Every 30 seconds, each Pi captures a frame from every camera and uploads two variants to a backend, where a vision LLM analyzes the scene and fires alerts when something needs attention.

The Pis run headless in environments we don’t control. Networks drop. Power cuts happen. The backend goes down for deploys — 9 incidents over 5 months, the longest lasting 47 minutes. The system needs to keep capturing, queue what it can’t upload, and recover without human intervention.

Why Nerves

We considered a few options here. A Python script in a Docker container on the Pi was the obvious first thought — it’s how most people approach edge computing. The trouble is that Docker on a Pi adds overhead we didn’t want, and a Python process doesn’t give you much in the way of fault recovery. If the script crashes, you need systemd or something similar to restart it, and restarting the whole process means losing any in-memory state.

Nerves gives us the full OTP supervision tree on bare metal — the same fault-tolerance primitives that run telecom switches, but on a Raspberry Pi. When a process crashes, its supervisor restarts it. When a subsystem fails repeatedly, the parent supervisor can restart the whole subtree. This isn’t application-level retry logic bolted on after the fact — it’s the runtime’s fundamental execution model.

The device boots into a read-only firmware image where a minimal init process (erlinit) starts the BEAM VM directly on top of the Linux kernel. There’s no general-purpose OS userland to crash, no container runtime to wedge, no systemd to configure, no Kubernetes to orchestrate, and nothing to apt-get update. It’s also the same Elixir platform that runs our cloud backend, our LLM pipeline, our AI agent, and our LiveView dashboards: one language, one ecosystem, from the browser to the Pi.

The capture cycle

Each camera gets a CameraWorker GenServer that runs a 30-second capture loop:

@capture_interval 30_000
@capture_timeout  120_000

defp schedule_capture do
  Process.send_after(self(), :capture_cycle, @capture_interval)
end

When :capture_cycle fires, the worker spawns an unlinked Task for the actual capture — RTSP frame grab via FFmpeg, image processing, upload handoff — and immediately schedules the next cycle. The 30-second rhythm is independent of how long the capture takes.

defp perform_capture_with_config(state, camera_info, rtsp_url, run_timestamp) do
  worker_pid = self()

  {:ok, task_pid} =
    Task.start(fn ->
      execute_capture_pipeline(state, camera_info, rtsp_url, run_timestamp)
      send(worker_pid, :capture_complete)
    end)

  schedule_capture()

  timeout_ref = Process.send_after(self(), :capture_timeout, @capture_timeout)
  {:noreply, %{state | capturing: true, capture_timeout_ref: timeout_ref, capture_task_pid: task_pid}}
end

The task is unlinked — if it crashes, the CameraWorker doesn’t go down with it. But tasks can hang (an RTSP stream that never responds, a camera that accepts the connection but never sends frames), so there’s a 120-second timeout safety net. We arrived at 120 seconds after watching production logs — some cameras on congested networks take up to 90 seconds for a single frame grab, so we needed headroom:

def handle_info(:capture_timeout, state) do
  if state.capturing do
    if pid = state.capture_task_pid, do: Process.exit(pid, :kill)

    Logger.warning("Capture timeout - killed task and forcing reset",
      namespace: "scout_hub.camera.worker.capture.timeout",
      camera_mac: state.camera_mac
    )

    {:noreply, %{state | capturing: false, capture_timeout_ref: nil, capture_task_pid: nil}}
  else
    {:noreply, state}
  end
end

On timeout, the worker kills the zombie task, resets its state, and continues the cycle. No restart, no crash — it just skips that capture. If the next :capture_cycle fires while a capture is still in progress, the worker skips and reschedules:

defp perform_capture(state) do
  if state.capturing do
    schedule_capture()
    {:noreply, state}
  else
    perform_capture_task(state)
  end
end

The loop never blocks, never overlaps, and self-heals from hung FFmpeg processes.
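The snippets above cover the timeout path but not the success path. Here is a minimal sketch of what the :capture_complete handler could look like, assuming the same state keys as above. The module name is hypothetical, and the real handler may well do more.

```elixir
defmodule CaptureCompleteSketch do
  # Hypothetical stand-in for the CameraWorker success path.
  # Assumes the state keys used in the snippets above.
  def handle_info(:capture_complete, state) do
    # Cancel the pending :capture_timeout so it cannot fire for a
    # capture that has already finished.
    if ref = state.capture_timeout_ref, do: Process.cancel_timer(ref)

    {:noreply, %{state | capturing: false, capture_timeout_ref: nil, capture_task_pid: nil}}
  end
end
```

Clearing capturing here is what lets the next :capture_cycle start a fresh task instead of skipping.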

Variants and parallel processing

Each captured frame produces two image variants, generated in a single FFmpeg filter_complex pass:

  • Original — full resolution, quality 2 (near-lossless), the archival copy
  • Compressed — scaled to max 1024px on the longest edge, quality 18. This is what the LLM analyzes. It also embeds a 65×64 grayscale thumbnail used to compute a perceptual hash for server-side deduplication — identical scenes skip LLM analysis entirely
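For readers unfamiliar with difference hashing, here is a rough sketch of how a dHash can be derived from a grayscale grid that is one pixel wider than it is tall. This is an illustration of the general technique, not Scout's actual implementation. The module name and the list-of-rows input format are assumptions.

```elixir
defmodule DHashSketch do
  # `rows` is a list of rows; each row is a list of grayscale values (0..255).
  # Every pair of horizontally adjacent pixels yields one bit:
  # 1 if the left pixel is brighter than the right one, else 0.
  def bits(rows) do
    for row <- rows,
        [left, right] <- Enum.chunk_every(row, 2, 1, :discard),
        do: if(left > right, do: 1, else: 0)
  end

  # Packs the bits into a binary and renders it as hex.
  # Assumes the total bit count is a multiple of 8.
  def hex(rows) do
    packed = for bit <- bits(rows), into: <<>>, do: <<bit::1>>
    Base.encode16(packed)
  end

  # Number of differing bits between two grids of the same size.
  # A small distance suggests the two frames show the same scene.
  def hamming(rows_a, rows_b) do
    Enum.zip(bits(rows_a), bits(rows_b))
    |> Enum.count(fn {a, b} -> a != b end)
  end
end
```

With a 65×64 grid, each row of 65 pixels contributes 64 comparisons, so the hash is 64 × 64 = 4,096 bits long.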

Each variant gets its own VariantWorker GenServer, running under its own supervisor:

VariantSupervisor (one_for_one)
├── VariantWorker[:compressed]
└── VariantWorker[:original]

Why not process both in one worker? This was a deliberate decision around isolation. If the original upload fails and backs off for 5 minutes, the compressed variant — the one the LLM needs — still uploads immediately. Each worker has its own retry queue, its own backoff schedule, its own crash domain. A corrupt file that crashes one worker doesn’t touch the other.

The ImageProcessor creates both variants on disk, then hands them off with async casts:

VariantWorker.upload_variant(camera_mac, :original, original_variant)
VariantWorker.upload_variant(camera_mac, :compressed, compressed_variant)

Control returns immediately. The capture loop doesn’t wait for uploads.

The supervision tree

ScoutHub.Supervisor (one_for_one)
├── Phoenix.PubSub
├── ScoutHub.Repo (SQLite)
├── ScoutHub.TaskSupervisor
├── CameraWorkerRegistry / VariantWorkerRegistry / VariantSupervisorRegistry
└── CameraWorkerSupervisor (DynamicSupervisor, max_restarts: 25/60s)
    └── For each camera:
        ├── VariantSupervisor (one_for_one, max_restarts: 50/60s)
        │   ├── VariantWorker[:compressed]  (restart: :permanent)
        │   └── VariantWorker[:original]    (restart: :permanent)
        └── CameraWorker (restart: :transient)

Every design decision here is about blast radius:

one_for_one everywhere. Cameras are independent, variants are independent. One crash shouldn’t restart siblings.

DynamicSupervisor at the top. Cameras are discovered at runtime — the Pi scans the local network, finds IP cameras, and starts workers dynamically. We don’t know at boot time how many cameras exist.

:transient for CameraWorker. This one took us a while to get right. When a camera goes offline, the worker exits with :normal — the supervisor respects that and doesn’t restart it. No restart churn for legitimately inactive cameras. VariantWorkers are :permanent because they need to stay alive to drain their retry queues even when new captures aren’t coming in.

max_restarts: 50/60s on VariantSupervisor. During a sustained network outage, upload failures can cause rapid worker restarts. Fifty per minute gives the system room to ride out transient issues. The workers wrap uploads in try/rescue so crashes should be rare in practice — the high limit is a safety net for edge cases we haven’t anticipated.

max_restarts: 25/60s on CameraWorkerSupervisor. Lower because camera capture failures are less frequent than upload failures, and each restart here affects an entire camera’s worker tree.
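The difference between :transient and :permanent is easy to check in isolation. The sketch below uses a plain Supervisor rather than the DynamicSupervisor from the tree above, and every module and process name in it is hypothetical. It only demonstrates the restart semantics, not Scout's actual workers.

```elixir
defmodule RestartSketch.Worker do
  use GenServer

  # The restart type is passed in so one module can play both roles.
  def child_spec({id, restart}) do
    %{id: id, start: {__MODULE__, :start_link, [id]}, restart: restart}
  end

  def start_link(id), do: GenServer.start_link(__MODULE__, id, name: id)

  @impl true
  def init(id), do: {:ok, id}

  # Simulates a clean shutdown, such as a camera going offline.
  @impl true
  def handle_cast(:go_offline, state), do: {:stop, :normal, state}
end

defmodule RestartSketch do
  def start do
    children = [
      {RestartSketch.Worker, {:camera_worker, :transient}},
      {RestartSketch.Worker, {:variant_worker, :permanent}}
    ]

    # one_for_one: a restart of one child never touches its sibling.
    Supervisor.start_link(children, strategy: :one_for_one, max_restarts: 50, max_seconds: 60)
  end
end
```

After casting :go_offline to both processes, :camera_worker stays down because a :normal exit is not restarted under :transient, while :variant_worker comes back under a new pid.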

SQLite as a durable queue

We explored a few options for the retry queue: ETS, DETS, raw files on disk, and SQLite. ETS is gone on crash — non-starter for a device that loses power unexpectedly. DETS is durable but can’t do indexed queries like “give me the 10 oldest items for this camera where next_retry_at is in the past.” Raw files work but you end up reimplementing a database. SQLite gives us Ecto schemas, migrations, indexed queries, and crash safety in a single embedded file.

Every failed upload gets written to a SQLite table:

# upload_queue table
camera_mac       :string     # which camera
image_path       :string     # path to file on disk
unix_timestamp   :integer    # when the frame was captured
variant_type     :string     # "original" or "compressed"
perceptual_hash  :string     # dHash hex (compressed only)
attempt_count    :integer    # retry counter (default: 0)
next_retry_at    :integer    # eligible for retry after this
first_failed_at  :integer    # when the first failure occurred
last_error       :string     # inspect() of last error reason
created_at       :integer    # when enqueued

Enqueue sets next_retry_at = now so the item is immediately eligible:

def enqueue(camera_mac, attrs) do
  now = System.system_time(:second)

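  # Rebind attrs: maps are immutable, so the piped result must be captured
  # or the defaults below never reach the changeset.
  attrs =
    attrs
    |> Map.put(:camera_mac, camera_mac)
    |> Map.put(:created_at, now)
    |> Map.put_new(:attempt_count, 0)
    |> Map.put_new(:next_retry_at, now)
    |> Map.put_new(:first_failed_at, now)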

  %Item{}
  |> Item.changeset(attrs)
  |> Repo.insert()
end

Dequeue is just Repo.delete(item) — called on successful upload, max retries exceeded, or permanent failure (auth errors, validation errors, files older than 24 hours). Workers poll with adaptive intervals: every 5 seconds when the queue has items, every 30 seconds when it’s empty. Items are processed oldest-first so nothing gets starved.
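The selection and polling rules can be modeled without a database. The sketch below works on plain maps instead of Ecto records, so it is an in-memory approximation of the query, not the real one. The module name, the batch size of 10, and sorting by unix_timestamp (rather than created_at) are all assumptions.

```elixir
defmodule QueuePollingSketch do
  @busy_interval_ms 5_000
  @idle_interval_ms 30_000

  # Items eligible for retry: same camera, next_retry_at not in the future,
  # oldest capture first, capped at `limit`.
  def due_items(items, camera_mac, now, limit \\ 10) do
    items
    |> Enum.filter(&(&1.camera_mac == camera_mac and &1.next_retry_at <= now))
    |> Enum.sort_by(& &1.unix_timestamp)
    |> Enum.take(limit)
  end

  # Poll quickly while there is a backlog, slowly when the queue is empty.
  def next_poll_interval(queue_count) when queue_count > 0, do: @busy_interval_ms
  def next_poll_interval(0), do: @idle_interval_ms
end
```

In the real system the filter, sort, and limit would presumably be pushed down into an indexed SQLite query, which is the capability that ruled out DETS.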

Exponential backoff

When uploads fail, the worker backs off before retrying:

def update_retry_metadata(%Item{} = item, error_reason \\ nil) do
  now = System.system_time(:second)
  new_attempt_count = item.attempt_count + 1

  backoff_seconds =
    case new_attempt_count do
      1 -> 30
      2 -> 60
      3 -> 120
      4 -> 300
      5 -> 600
      _ -> 1800
    end

  item
  |> Item.changeset(%{
    attempt_count: new_attempt_count,
    next_retry_at: now + backoff_seconds,
    last_error: if(error_reason, do: inspect(error_reason))
  })
  |> Repo.update()
end

No jitter — for a single Pi with a handful of cameras, thundering herd isn’t a concern. The cap at 30 minutes means the system keeps trying during extended outages without hammering the backend. After 20 attempts (~7.3 hours of cumulative backoff), the item is dequeued and the file cleaned up.
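The ~7.3 hour figure can be reproduced from the schedule. The sketch below assumes the 20th failure dequeues the item immediately instead of scheduling another wait, so only 19 backoffs are summed. The module and function names are made up for illustration.

```elixir
defmodule BackoffMathSketch do
  # Same schedule as update_retry_metadata/2 above.
  def backoff_seconds(attempt) do
    case attempt do
      1 -> 30
      2 -> 60
      3 -> 120
      4 -> 300
      5 -> 600
      _ -> 1800
    end
  end

  # Total wait scheduled before the item is dropped on failure `max_attempts`.
  # Assumption: the final failure schedules no further wait, so only
  # max_attempts - 1 backoffs are counted.
  def cumulative_seconds(max_attempts \\ 20) do
    1..(max_attempts - 1)//1
    |> Enum.map(&backoff_seconds/1)
    |> Enum.sum()
  end
end
```

The first five waits add up to 1,110 seconds and the next fourteen add 14 × 1,800 = 25,200, for a total of 26,310 seconds, or about 7.3 hours. If the 20th failure also scheduled a wait, the total would be 28,110 seconds, or about 7.8 hours.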

Rate-limited responses (HTTP 429) get a fixed 60-second backoff regardless of attempt count, so the worker respects the backend’s signal rather than following its own schedule. Auth failures, validation errors, and corrupt files get dequeued immediately — no point retrying something that will never succeed.
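One way to express this classification is a single function that maps an error to an action. The error shapes below are hypothetical, since the post does not show what the upload code actually returns. Treating 401/403 as auth failures and 422 as a validation error is likewise an assumption.

```elixir
defmodule ErrorClassificationSketch do
  @rate_limit_backoff_s 60

  # Rate limited: fixed wait, independent of attempt count.
  def classify({:http_error, 429}), do: {:retry_after, @rate_limit_backoff_s}

  # Permanent failures: retrying cannot succeed, so drop the item.
  def classify({:http_error, status}) when status in [401, 403], do: :dequeue
  def classify({:http_error, 422}), do: :dequeue
  def classify(:corrupt_file), do: :dequeue

  # Everything else (timeouts, 5xx, connection errors) follows the
  # exponential backoff schedule.
  def classify(_other), do: :backoff
end
```

Keeping the decision in one place makes it easy to see which errors consume retry attempts and which ones never enter the backoff schedule at all.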

When supervisors give up

This is the subtlest failure mode we hit, and the one that taught us the most about OTP supervision.

When a VariantWorker crashes repeatedly, the VariantSupervisor exhausts its max_restarts and shuts itself down. The CameraWorker still captures frames every 30 seconds. The ImageProcessor still creates variants. And it still calls:

GenServer.cast(via_tuple(camera_mac, :compressed), {:upload_variant, variant})

The problem: GenServer.cast to a dead process silently returns :ok. No error, no exception. The message vanishes. The file sits on disk. Nothing is enqueued to SQLite. The system looks healthy — captures running, no errors in logs — but uploads are silently dropping.
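The behavior is easy to reproduce in isolation. This small sketch (module name hypothetical) starts a throwaway process, waits until it has exited, and then casts to its pid:

```elixir
defmodule SilentCastSketch do
  def cast_to_dead_process do
    {pid, ref} = spawn_monitor(fn -> :ok end)

    # Wait until the process has definitely exited.
    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
    end

    # The process is gone, yet cast/2 still reports success.
    {Process.alive?(pid), GenServer.cast(pid, {:upload_variant, :ignored})}
  end
end
```

Calling SilentCastSketch.cast_to_dead_process() returns {false, :ok}. A cast is fire-and-forget by design, so delivery has to be verified some other way.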

We discovered this during a Fly.io outage that lasted about an hour. Every upload attempt failed, the VariantWorkers crashed repeatedly, the supervisors exhausted their restart limits and gave up. The CameraWorker kept capturing — doing its job — but the casts to dead workers vanished silently. When the backend came back, there was nothing in the queue to retry. An hour of captures, gone.

The fix is a liveness check before every handoff:

def upload_variant(camera_mac, variant_type, variant) do
  case worker_alive?(camera_mac, variant_type) do
    true ->
      GenServer.cast(via_tuple(camera_mac, variant_type), {:upload_variant, variant})

    false ->
      Logger.warning("Variant worker dead, enqueueing directly for recovery",
        namespace: "scout_hub.variant_worker.fallback_enqueue",
        camera_mac: camera_mac,
        variant_type: variant_type
      )

      fallback_enqueue(camera_mac, variant_type, variant)
  end
end

defp worker_alive?(camera_mac, variant_type) do
  case Registry.lookup(ScoutHub.VariantWorkerRegistry, {camera_mac, variant_type}) do
    [{pid, _}] -> Process.alive?(pid)
    [] -> false
  end
end

If the worker is dead, fallback_enqueue writes directly to the SQLite queue. No captures are lost.

But the system doesn’t just accept the degraded state. The CameraWorker runs a health check every 60 seconds:

defp check_variant_worker_health(state) do
  # Always check — not just when queue > 0.
  # Silent cast drops mean files accumulate on disk
  # without queue entries, so queue_count can be 0 while workers are dead.
  if not VariantSupervisor.variant_workers_alive?(state.camera_mac) do
    case CameraWorkerSupervisor.restart_variant_supervisor(state.camera_mac) do
      {:ok, _pid} ->
        Logger.info("Variant worker recovery succeeded")

      {:error, reason} ->
        Logger.error("Variant worker recovery failed", error: inspect(reason))
    end
  end
end

Dead workers get restarted. Fresh workers boot up and start draining the SQLite queue. The comment in the code tells the story: the check always runs, not just when the queue is non-empty, because silent cast drops leave files on disk without queue entries, so the queue count can read 0 while the workers are dead. The fallback enqueue prevents that now, but the health check is belt-and-suspenders.

Authentication via PubSub

Workers need an auth token to request presigned S3 URLs from the backend. Rather than each worker managing its own token, a single DeviceAuthServer process owns the lifecycle and broadcasts state changes via PubSub.

The token is long-lived, issued during device provisioning, and stored in SQLite encrypted with AES-256-GCM. It’s validated hourly against the backend. Workers subscribe to the "device_auth" topic and handle two events:

Phoenix.PubSub.subscribe(ScoutHub.PubSub, "device_auth")

def handle_info({:token_refreshed, _token}, state) do
  send(self(), :process_retries)
  {:noreply, %{state | waiting_for_token: false}}
end

def handle_info(:token_invalidated, state) do
  {:noreply, %{state | waiting_for_token: true}}
end

One broadcast reaches every worker instantly. When a new token arrives, all workers start uploading immediately rather than discovering it on their next poll cycle. The waiting_for_token flag means workers don’t waste cycles attempting uploads they know will fail — they keep capturing and queuing, then flush everything when the token arrives.
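The broadcasting side is not shown above. With Phoenix.PubSub it would presumably be a call like Phoenix.PubSub.broadcast(ScoutHub.PubSub, "device_auth", {:token_refreshed, token}). To illustrate the fan-out without that dependency, here is a stand-in built on Elixir's Registry with duplicate keys. It swaps in a different mechanism for demonstration purposes, and the module and registry names are hypothetical.

```elixir
defmodule AuthBroadcastSketch do
  # Dependency-free stand-in for Phoenix.PubSub, built on Registry.
  @registry AuthBroadcastSketch.Registry
  @topic "device_auth"

  def start_link, do: Registry.start_link(keys: :duplicate, name: @registry)

  # Called by each worker; the calling process becomes a subscriber.
  def subscribe, do: Registry.register(@registry, @topic, nil)

  # Called by the single token owner; every subscriber gets the same message.
  def broadcast(message) do
    Registry.dispatch(@registry, @topic, fn subscribers ->
      for {pid, _value} <- subscribers, do: send(pid, message)
    end)
  end
end
```

A subscribed worker receives {:token_refreshed, token} in its mailbox and can handle it in handle_info/2 exactly as in the snippet above.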

Trade-offs

This architecture isn’t simple. SQLite adds a dependency and disk I/O on hardware where SD card writes have a finite lifespan. The supervision tree has three levels of nesting. There’s a Registry, a PubSub topic, adaptive polling intervals, error classification logic, a health check recovery loop. We went back and forth on whether the complexity was justified for a system that typically runs 1–2 cameras per device.

But it works. Over 90 days in production, a single hub running 1–2 cameras has processed 143,000+ capture cycles with a 99.9% upload success rate — 118,824 successful uploads against 117 failures. The upload queue average sits at 0.00 almost every day, meaning retries resolve quickly. 26 infrastructure retries (presigned URL and S3 errors) all resolved automatically. Zero fallback enqueue events — the VariantSupervisor has never exhausted its restarts in production, but the safety net is there. The backend had 9 incidents over 5 months (99.91% uptime, longest outage 47 minutes), and the hub rode out every one without losing data.

Each camera produces ~2,880 captures per day, which matches the theoretical maximum for a 30-second cycle (86,400 seconds per day / 30 = 2,880). In other words, the pipeline keeps pace without dropping frames.

In a Python or Node.js equivalent, you’d be stitching together systemd for restarts, a separate SQLite wrapper or Redis for queuing, a threading library for parallelism, and a lot of hope that your error handling covers every edge case. Here, supervision trees, GenServers, and Registry ship with OTP and Elixir, while PubSub and the Ecto-backed SQLite queue are standard ecosystem libraries built on the same process model, so they compose naturally.

The alternative — a simpler pipeline without queuing or independent workers — would work until the first network blip. On a device you can’t SSH into during a dinner rush, “works until” isn’t good enough.