Database Transaction Patterns for LLM Pipelines → Lawrence Gosset

If you wrap an LLM API call inside a database transaction, the connection is held for the entire duration — 5 to 30 seconds per call. With Oban processing jobs concurrently, you quickly exhaust the connection pool.

We learned this the hard way during a backfill operation. With ~200 workers running and a 10-connection pool, we hit cascading DBConnection.ConnectionError timeouts. Connection queue times jumped from <10ms to 2+ seconds. Oban jobs started logging "queued and checked out connection for >15000ms" before timing out entirely. The dashboard became unresponsive. Of the first 585 jobs that succeeded, the next ~150 all failed — the pool was fully saturated, and every new job queued behind jobs that were themselves waiting on connections held by other jobs waiting on LLM responses.

The root cause was simple: each Oban worker checked out a database connection at the start of the job, called the LLM, and only released the connection when the whole transaction committed. Of those ~200 total workers, around 25 were in the processing queue making LLM calls — but that was enough. 25 workers × 15 second average hold ÷ 10 connections = permanent contention.

The naive approach

The intuitive approach looks like this — a single transaction wrapping the entire operation:

def perform(%Oban.Job{args: %{"capture_id" => capture_id}}) do
  Repo.transaction(fn ->
    capture = Repo.get!(Capture, capture_id)
    camera = Cameras.get_camera!(capture.camera_id)

    analysis = Analytics.create_analysis!(%{
      capture_id: capture.id,
      camera_id: camera.id,
      status: :processing
    })

    # This takes 5-30 seconds. The connection is held the entire time.
    {:ok, observations} = LLM.analyze_image(capture.s3_key, camera.profile)

    Observations.insert_all!(analysis, observations)
    Analytics.update_analysis!(analysis, %{status: :completed})
  end)
end

This seems reasonable at first. If the LLM call fails, the transaction rolls back — no orphaned analysis records, no partial state. Clean atomicity.

But atomicity comes at a cost: the database connection is checked out for the entire duration of the Repo.transaction block. A 100ms database operation becomes a 15-second connection hold because the LLM call sits in the middle. Multiply that by your Oban queue concurrency and you’ve exhausted the pool.

Three-phase transactions

The fix is to break the transaction into three discrete steps with no transaction held during the LLM call:

with {:ok, analysis} <- create_analysis_in_transaction(capture),
     {:ok, analysis_with_observations} <- execute_llm_analysis(analysis),
     {:ok, :ok} <- mark_analysis_as_completed(analysis_with_observations.id) do
  completed_analysis = %{analysis_with_observations | status: :completed}
  broadcast_analysis_event(completed_analysis, capture)
  {:ok, completed_analysis}
end

Three phases, three separate connection checkouts.

Phase 1 creates the analysis record in a short transaction (~100ms). It loads the camera, validates it has an analysis profile, and calls find_or_create_analysis — which is idempotent for retry safety:

defp create_analysis_in_transaction(capture) do
  with {:ok, camera} <- load_camera(capture.camera_id),
       {:ok, profile_id} <- get_profile_id(camera) do
    attrs = %{
      capture_id: capture.id,
      camera_id: capture.camera_id,
      organization_id: capture.organization_id,
      profile_id: profile_id,
      status: :processing,
      event_metadata: %{
        "event_id" => Ecto.UUID.generate(),
        "discovered_at" => DateTime.to_iso8601(DateTime.utc_now()),
        "source" => "oban_llm_analysis",
        "s3_key" => capture.s3_key
      }
    }

    Analytics.find_or_create_analysis(attrs)
  end
end

Phase 2 executes the LLM call with no long-held database connections. The executor downloads the image from S3, sends it to the LLM, parses the response, and inserts observations. Individual queries still briefly check out and release connections, but there’s no wrapping transaction holding one for the duration:

defp execute_llm_analysis(analysis) do
  case Observations.Executor.execute_analysis_observations(analysis) do
    {:ok, updated_analysis} -> {:ok, updated_analysis}
    {:error, reason} -> {:error, reason}
  end
end

Phase 3 marks the analysis as completed in another short transaction (~100ms):

defp mark_analysis_as_completed(analysis_id) do
  Repo.transaction(fn ->
    with {:ok, analysis} <- fetch_analysis(analysis_id),
         {:ok, _updated} <- Analytics.update_analysis(analysis, %{status: :completed}) do
      :ok
    else
      {:error, reason} -> Repo.rollback(reason)
    end
  end)
end

The connection hold time dropped from 5-30 seconds to ~200ms total across both transactions. At 12 concurrent processing workers, that’s the difference between needing 180 connections and needing 2-3. Worth noting: there’s no Redis backing this queue, no Sidekiq, no Kafka — it’s Postgres all the way down, via Oban. The job queue, the retries, the scheduling, the concurrency control — all in the same database the application already uses.

The job also overrides Oban’s default timeout to accommodate the LLM call: def timeout(_job), do: :timer.seconds(150). Without this, Oban would kill the job after 60 seconds — well within the range of a slow LLM response.

What happens when phase 3 fails

This is the question that made us hesitant about the three-phase approach initially. With a single transaction, failure at any point rolls everything back cleanly. With three phases, you can end up in a state where the LLM call succeeded and observations are saved, but the final status update failed. Maybe the database had a brief hiccup, maybe PgBouncer closed an idle connection at the wrong moment.

The answer is idempotency. Oban retries the whole job. On retry, the with chain runs from the top — but find_or_create_analysis handles the race:

def find_or_create_analysis(attrs) do
  Repo.transaction(fn ->
    with {:ok, capture} <- lock_capture_for_update(attrs.capture_id),
         {:ok, existing} <- find_existing_analysis_for_capture(attrs.capture_id),
         {:ok, analysis} <- handle_analysis_creation(existing, attrs, capture) do
      analysis
    else
      {:error, reason} -> Repo.rollback(reason)
    end
  end)
end

It locks the capture row (SELECT ... FOR UPDATE), checks if an analysis already exists for this capture. If one does, it returns it. If not, it creates a new one. The row lock prevents race conditions from concurrent retries.

On retry after a Phase 3 failure:

Phase 1 finds the existing analysis — returns it
Phase 2 calls the executor, but observations already exist — the executor detects this and skips the LLM call
Phase 3 retries the status update

Only the status update re-executes. No wasted LLM calls, no duplicate data. This took some care to get right, but the idempotency at each phase means we can safely retry the whole job without worrying about side effects.

Dual repos

Beyond the three-phase split, we separate Oban’s internal operations from application queries entirely. This was another lesson from the backfill incident — even after fixing the transaction pattern, we noticed Oban’s own operations (polling for jobs, updating heartbeats) were competing with our application queries for the same connections.

Two Ecto repos, same database, separate connection pools:

# config/runtime.exs

# Scout.Repo — web requests + job business logic
config :scout, Scout.Repo,
  pool_size: 10,
  prepare: :unnamed,
  queue_target: 5000,
  timeout: 30_000,
  parameters: [application_name: "Scout"]

# Scout.ObanRepo — Oban framework operations only
config :scout, Scout.ObanRepo,
  pool_size: 6,
  prepare: :unnamed,
  queue_target: 1000,
  timeout: 15_000,
  parameters: [application_name: "ScoutOban"]

Why separate repos instead of just increasing the pool size? Isolation. Oban performs fast, frequent operations — polling for new jobs, updating job status, recording heartbeats — all ~10-50ms queries. If they share a pool with application queries that occasionally hold connections longer, Oban’s internal operations get starved. A backed-up Oban means jobs stop getting dequeued even though workers are idle.

The separate pool guarantees Oban always has connections available for its framework operations, independent of what the application is doing. Oban is pointed at the dedicated repo with repo: Scout.ObanRepo in the Oban config. The fix was straightforward because of how Elixir and Ecto handle connections — explicit checkouts, visible transaction boundaries, and Oban’s queue model. In a framework where connections are implicitly held or managed by an ORM’s session layer, splitting a transaction into three phases and separating connection pools would be a much more invasive change. Here, it was a with chain and a second repo module.

The distinct application_name parameters show up in pg_stat_activity, so you can see exactly which pool is consuming connections when debugging in production. This was a small thing that saved us a lot of time — when connection issues crop up, you can immediately tell whether it’s application queries or Oban operations that are backed up.

Queue concurrency and pool sizing

Here’s the queue configuration:

config :scout, Oban,
  repo: Scout.ObanRepo,
  notifier: {Oban.Notifiers.Phoenix, pubsub: Scout.PubSub},
  queues: [
    processing: 12,          # LLM analysis (the slow ones, 5-30s)
    profile_generation: 1,   # Expensive LLM profile creation
    summary_generation: 20,  # Summary creation
    embeddings: 25,          # Vector embeddings (fast API calls)
    insights: 2,             # Daily insights
    scheduled: 1,            # Cron jobs
    default: 2               # Misc
  ]

~63 workers per VM, sharing 10 Scout.Repo connections. The math works because of the three-phase pattern: most workers spend 90%+ of their time waiting on external APIs with no database connection held. At any given moment, only a fraction of workers are actually executing database operations. The pool just needs to handle that effective concurrency, not the total worker count.

The pool sizing evolved through trial and error. We started at 10 connections, bumped to 14 during stress testing (before the three-phase fix), tried 20 after adding PgBouncer, then settled back at 10. The insight: with PgBouncer, you want pools as small as possible. PgBouncer closes idle connections, causing "tcp recv (idle): closed" errors on reconnect. Oversized pools mean more idle connections, more reconnection churn. Keep the pools tight.

The PgBouncer gotchas

If you deploy Oban on Fly.io’s Managed Postgres, you’ll hit a few things that aren’t obvious from the documentation.

LISTEN/NOTIFY doesn’t work through PgBouncer. Oban’s default notifier uses Postgres LISTEN/NOTIFY to wake workers when new jobs are inserted. PgBouncer in transaction mode doesn’t support this — the connection that issued LISTEN gets returned to the pool, and a different client gets the notification (or nobody does).

The symptom is subtle. Jobs get inserted but Oban doesn’t pick them up until its next polling interval. Instead of near-instant execution, jobs sit in the queue for seconds. Under load, the polling interval can’t keep up and the queue grows. We spent a while debugging this before realizing it was PgBouncer-related.

The fix is one line:

notifier: {Oban.Notifiers.Phoenix, pubsub: Scout.PubSub}

Oban.Notifiers.Phoenix uses your existing Phoenix PubSub instead of Postgres channels. Job notifications route through PubSub, bypassing PgBouncer entirely.

Named prepared statements don’t work. PgBouncer multiplexes connections, so a prepared statement created on one connection might not exist when your next query runs on a different one. Set prepare: :unnamed on both repos. We hit this as intermittent "prepared statement does not exist" errors that only showed up under load — easy to dismiss as flaky until you understand the cause.

Idle connections get closed. PgBouncer reclaims idle server connections, so Postgrex may try to use a connection that’s been closed underneath it. You’ll see "tcp recv (idle): closed" in your logs. Keep pool sizes small to minimize idle connections — which circles back to why tight pools are better than generous ones in a PgBouncer environment.