Saving 30% on LLM Costs with Perceptual Hashing in Elixir → Lawrence Gosset

I run a system called Scout that captures images from security cameras every 30 seconds and sends them to a vision LLM for analysis. The problem is obvious once you think about it: security cameras spend most of their time looking at nothing changing. A hallway at 2am looks the same at 2:00:00 as it does at 2:00:30, and at 2:01:00, and for the next six hours. Every one of those frames was costing me an API call.

The fix was perceptual hashing — a way to fingerprint images so that visually similar frames produce similar hashes. If the current frame’s hash is close enough to a recent one, skip the LLM call entirely and clone the previous result. In production, this catches about 30% of all frames as duplicates, which translates directly to a 30% reduction in LLM API costs.

Why not just compare pixels?

Before we got to perceptual hashing, we tried a naive approach: compare raw pixel values between consecutive frames. The trouble is that security cameras are noisy. JPEG compression artifacts differ between frames. Auto-exposure adjusts constantly. A cloud passes over and the whole image shifts 5% brighter. Two frames that look identical to a human can differ in thousands of pixel values.

A cryptographic hash like SHA-256 is even worse — it’s designed so that even a single pixel change produces a completely different hash. Great for integrity checks, useless for similarity detection.

Perceptual hashes work the opposite way. They’re designed so that visually similar images produce similar hashes. A frame with slightly different lighting, minor JPEG compression artifacts, or a few shifted pixels will hash to a value that’s very close to the original. The distance between two hashes tells you how visually different the images are.
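To make the contrast with cryptographic hashes concrete, here is a minimal sketch using two synthetic byte buffers as stand-ins for frames. The module name and the buffers are illustrative assumptions, not part of Scout. The buffers differ in a single bit, yet their SHA-256 digests differ in many bit positions.

```elixir
defmodule AvalancheDemo do
  # Counts the bit positions in which two equal-length binaries differ.
  def bit_difference(a, b) when byte_size(a) == byte_size(b) do
    for <<bit::1 <- :crypto.exor(a, b)>>, reduce: 0 do
      acc -> acc + bit
    end
  end

  def sha256(data), do: :crypto.hash(:sha256, data)
end

# Two stand-in "frames": identical except for the lowest bit of the last byte.
frame_a = :binary.copy(<<128>>, 4160)
frame_b = :binary.copy(<<128>>, 4159) <> <<129>>

raw_diff = AvalancheDemo.bit_difference(frame_a, frame_b)

digest_diff =
  AvalancheDemo.bit_difference(
    AvalancheDemo.sha256(frame_a),
    AvalancheDemo.sha256(frame_b)
  )

IO.puts("raw bytes differ in #{raw_diff} bit(s)")
IO.puts("SHA-256 digests differ in #{digest_diff} of 256 bits")
```

The digest distance says nothing about how similar the inputs were, which is why it cannot serve as a similarity measure.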

There are several perceptual hashing algorithms — aHash (average hash), pHash (DCT-based), and dHash (difference hash). I went with dHash because it’s fast, simple to implement, and resistant to minor brightness and contrast changes — exactly what you get from a camera feed that shifts subtly as lighting conditions change throughout the day. pHash would give slightly better accuracy for some edge cases, but the DCT computation is heavier and the accuracy difference didn’t justify it for our use case.

dHash: the difference hash

The idea behind dHash is to reduce the image to a sequence of relative brightness comparisons. Instead of looking at absolute pixel values (which shift with lighting), you compare each pixel to its neighbor: is the pixel to the right brighter or darker?

Here’s the implementation that runs on the Raspberry Pi. First, convert the captured image to raw grayscale pixels using FFmpeg:

args = [
  "-hide_banner", "-loglevel", "error",
  "-i", input_path,
  "-vf", "format=gray,scale=65:64:flags=bilinear",
  "-f", "rawvideo", "-pix_fmt", "gray",
  output_path
]

The width is 65, not 64, because dHash compares each pixel to its right neighbor: you need N+1 pixels across to get N comparisons. The height is 64 so that taking every other row leaves a clean 32-row grid. Bilinear filtering smooths out JPEG artifacts during the downscale.
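One way to run that argument list from Elixir is System.cmd/3. The sketch below assumes an ffmpeg binary on the PATH; the module name, function names, and error tuples are hypothetical. It also checks that the output is exactly 65 × 64 bytes, since 8-bit grayscale uses one byte per pixel.

```elixir
defmodule GrayFrame do
  @width 65
  @height 64

  # One byte per pixel in 8-bit grayscale.
  def expected_bytes, do: @width * @height

  def ffmpeg_args(input_path, output_path) do
    [
      "-hide_banner", "-loglevel", "error",
      "-i", input_path,
      "-vf", "format=gray,scale=#{@width}:#{@height}:flags=bilinear",
      "-f", "rawvideo", "-pix_fmt", "gray",
      output_path
    ]
  end

  # Runs FFmpeg, then reads back the raw pixels and validates their size.
  # System.cmd/3 raises if the ffmpeg executable cannot be found.
  def convert(input_path, output_path) do
    args = ffmpeg_args(input_path, output_path)

    case System.cmd("ffmpeg", args, stderr_to_stdout: true) do
      {_output, 0} ->
        case File.read(output_path) do
          {:ok, pixels} when byte_size(pixels) == @width * @height -> {:ok, pixels}
          {:ok, pixels} -> {:error, {:unexpected_size, byte_size(pixels)}}
          {:error, reason} -> {:error, reason}
        end

      {output, status} ->
        {:error, {:ffmpeg_failed, status, output}}
    end
  end
end

IO.inspect(GrayFrame.expected_bytes())
# 4160
```

Validating the byte count up front means a truncated or mis-scaled frame fails loudly instead of producing a hash of the wrong length.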

From those 65×64 raw bytes, the hash is built in three steps:

Sample a comparison grid — take every other row and every other pixel position, producing a 32×32 grid from the 65×64 source
Compare adjacent pixels — for each sampled position, if the left pixel is brighter than the right, emit a 1 bit, otherwise 0
Pack the bits into a 128-byte binary (1024 bits) and hex-encode it

The bit packing compares adjacent pixels and shifts results into bytes:

# <<< and ||| come from the Bitwise module (import Bitwise)
# For each sampled pixel position, compare left vs right
new_bit = if left > right, do: 1, else: 0
new_bits = bits <<< 1 ||| new_bit

# When we have 8 bits, pack into a byte
<<acc::binary, new_bits::8>>

The final hash is a 256-character hex string:

hash_binary = compute_hash_binary_sampled(pixel_data, <<>>, 0)
Base.encode16(hash_binary, case: :upper)
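Putting the three steps together, a self-contained sketch might look like the following. It is not the production compute_hash_binary_sampled/3: it assumes row-major pixel data and assumes the sampled rows and columns are the even-numbered ones (0, 2, ..., 62), which the excerpts above do not confirm. The module name is hypothetical.

```elixir
defmodule DHashSketch do
  @width 65
  @height 64

  # Builds a 1024-bit hash from 65x64 row-major grayscale bytes and
  # returns it as a 256-character uppercase hex string.
  def compute(pixels) when byte_size(pixels) == @width * @height do
    comparisons =
      for y <- 0..(@height - 2)//2, x <- 0..(@width - 3)//2 do
        left = :binary.at(pixels, y * @width + x)
        right = :binary.at(pixels, y * @width + x + 1)
        if left > right, do: 1, else: 0
      end

    # 32 rows x 32 comparisons = 1024 bits = 128 bytes.
    comparisons
    |> Enum.reduce(<<>>, fn bit, acc -> <<acc::bitstring, bit::1>> end)
    |> Base.encode16(case: :upper)
  end
end

# A left-to-right falling gradient: every left pixel is brighter than its neighbor.
row = :binary.list_to_bin(Enum.map(0..64, fn x -> 200 - x end))
falling = :binary.copy(row, 64)

# The same scene, uniformly 10 levels brighter.
brighter_row = :binary.list_to_bin(Enum.map(0..64, fn x -> 210 - x end))
brighter = :binary.copy(brighter_row, 64)

IO.inspect(String.length(DHashSketch.compute(falling)))
# 256
IO.inspect(DHashSketch.compute(falling) == DHashSketch.compute(brighter))
# true
```

The second check illustrates the property the algorithm relies on: a uniform brightness shift leaves every left-versus-right comparison unchanged, so the hash is identical.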

The 1024-bit hash is larger than the typical 64-bit dHash you’ll see in tutorials. We started with 64-bit and found it wasn’t discriminative enough — security camera frames can have subtle but meaningful differences (a person standing in a doorway versus an empty doorway) that a 64-bit hash would miss. The extra resolution costs a few more bytes of storage and a few more microseconds of comparison, but it dramatically reduced our false positive rate.

Comparing hashes with Hamming distance

The Hamming distance between two hashes is the number of bit positions where they differ. Two identical images have a distance of 0. Two completely unrelated images will differ in roughly half their bits (around 512 out of 1024).

In Elixir, computing this is clean thanks to Erlang’s :crypto module:

def hamming_distance(hash1, hash2) do
  hash1
  |> :crypto.exor(hash2)
  |> count_set_bits()
end

:crypto.exor/2 XORs the two binaries, producing a result where each 1 bit represents a position where the hashes differ. Then count the set bits using Brian Kernighan’s algorithm, which clears one bit per iteration:

defp count_set_bits(binary) do
  binary
  |> :binary.bin_to_list()
  |> Enum.reduce(0, fn byte, acc ->
    acc + count_bits_in_byte(byte)
  end)
end

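# n &&& (n - 1) clears the lowest set bit, so the recursion runs once per set bit.
# &&& comes from the Bitwise module (import Bitwise).
defp count_bits_in_byte(0), do: 0
defp count_bits_in_byte(n), do: 1 + count_bits_in_byte(n &&& n - 1)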

Once FFmpeg has produced the raw grayscale bytes, the hashing and comparison need no external packages, no custom NIF, no C extension, and no image processing library, just the Erlang standard library: :crypto.exor/2 for the XOR, :binary.bin_to_list/1 for byte iteration, and Elixir’s binary pattern matching for the bit packing. This is the kind of low-level binary work that would typically push you towards a C extension or a Python NumPy dependency, but Erlang’s binary handling makes it natural to express in pure Elixir. On a Raspberry Pi, this runs in microseconds even for the full 128-byte hashes.
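For reference, here are the pieces above assembled into one module with a small worked example. The module name and the distance_hex/2 helper are assumptions added for illustration. The helper reflects that the hashes are hex-encoded: XOR-ing the hex characters directly would compare ASCII codes rather than hash bits, so the strings are decoded first.

```elixir
defmodule HammingSketch do
  import Bitwise

  # Decodes two hex-encoded hashes, then compares them bit by bit.
  def distance_hex(hex1, hex2) do
    hamming_distance(
      Base.decode16!(hex1, case: :mixed),
      Base.decode16!(hex2, case: :mixed)
    )
  end

  # :crypto.exor/2 requires binaries of equal size, hence the guard.
  def hamming_distance(hash1, hash2) when byte_size(hash1) == byte_size(hash2) do
    hash1
    |> :crypto.exor(hash2)
    |> count_set_bits()
  end

  defp count_set_bits(binary) do
    binary
    |> :binary.bin_to_list()
    |> Enum.reduce(0, fn byte, acc -> acc + count_bits_in_byte(byte) end)
  end

  # Kernighan's trick: n &&& (n - 1) clears the lowest set bit.
  defp count_bits_in_byte(0), do: 0
  defp count_bits_in_byte(n), do: 1 + count_bits_in_byte(n &&& n - 1)
end

# 0b10101010 XOR 0b10100110 = 0b00001100, which has two set bits.
IO.inspect(HammingSketch.hamming_distance(<<0b1010_1010>>, <<0b1010_0110>>))
# 2

# FF XOR 0F = F0 (4 bits), 00 XOR 01 = 01 (1 bit).
IO.inspect(HammingSketch.distance_hex("FF00", "0F01"))
# 5
```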

Two-phase duplicate lookup

Computing Hamming distance is cheap for a single pair, but scanning every recent hash in the database would be too slow. We needed a way to narrow down candidates first.

Phase 1: Prefix match. Generate a short prefix from bytes sampled across the hash rather than from its first few bytes, which would be biased toward one region of the image. The function picks 16 evenly spaced byte positions (every eighth byte), then keeps the first 8 sampled bytes. Note that, as written, those 8 bytes come from offsets 0 through 56, so the stored prefix reflects the first half of the hash:

def generate_prefix_distributed(dhash) when byte_size(dhash) == 128 do
  positions = [0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120]

  sampled_bytes =
    Enum.map(positions, fn pos ->
      <<_::binary-size(pos), byte::8, _::binary>> = dhash
      byte
    end)

  prefix_binary = sampled_bytes |> Enum.take(8) |> :erlang.list_to_binary()
  Base.encode16(prefix_binary, case: :lower)
end
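To see which bytes end up in the prefix, the sketch below runs an equivalent function on a synthetic hash whose byte at offset i has the value i. The module name and the synthetic input are assumptions for illustration. It uses :binary.at/2 instead of pattern matching, which reads the same bytes.

```elixir
defmodule PrefixSketch do
  @positions [0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120]

  def generate_prefix_distributed(dhash) when byte_size(dhash) == 128 do
    @positions
    |> Enum.map(fn pos -> :binary.at(dhash, pos) end)
    |> Enum.take(8)
    |> :erlang.list_to_binary()
    |> Base.encode16(case: :lower)
  end
end

# Byte i holds the value i, so the prefix spells out the sampled offsets.
dhash = :binary.list_to_bin(Enum.to_list(0..127))

IO.puts(PrefixSketch.generate_prefix_distributed(dhash))
# 0008101820283038
```

The output decodes to offsets 0, 8, 16, 24, 32, 40, 48, and 56: eight bytes, sixteen hex characters.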

This 16-character hex prefix is stored as a separate indexed column. Query for recent non-duplicate captures from the same camera where the prefix matches, joining on the analysis table to ensure there’s actually a result to clone:

@similarity_threshold 26
@lookback_hours 24

from(c in Capture,
  join: ac in AnalysisCapture, on: ac.capture_id == c.id,
  where: c.camera_id == ^camera_id,
  where: c.hash_prefix == ^prefix,
  where: c.is_duplicate == false,
  where: not is_nil(c.dhash),
  where: c.timestamp >= ^cutoff,
  order_by: [desc: c.timestamp],
  limit: 100
)
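The query pins a cutoff value that the excerpt does not show being computed. A minimal sketch, assuming timestamps are stored as UTC DateTime values (the module name is hypothetical):

```elixir
defmodule Lookback do
  @lookback_hours 24

  # Oldest timestamp that still falls inside the lookback window.
  def cutoff(now \\ DateTime.utc_now()) do
    DateTime.add(now, -(@lookback_hours * 3600), :second)
  end
end

# A fixed instant, chosen only to make the example reproducible.
now = ~U[2024-01-02 03:00:00Z]

IO.inspect(Lookback.cutoff(now))
# ~U[2024-01-01 03:00:00Z]
```

Taking the current time as an argument keeps the function easy to test with a fixed instant.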

Phase 2: Hamming distance. For each candidate, compute the full Hamming distance and keep the closest match below threshold:

defp find_similar_capture(dhash, candidates) do
  candidates
  |> Enum.map(fn candidate ->
    distance = hamming_distance(dhash, candidate.dhash)
    {candidate, distance}
  end)
  |> Enum.filter(fn {_capture, distance} -> distance < @similarity_threshold end)
  |> Enum.min_by(fn {_capture, distance} -> distance end, fn -> nil end)
end

If the prefix match finds nothing, a fallback query drops the prefix constraint and checks up to 200 recent non-duplicate captures from the same camera within the 24-hour window. This catches cases where the image changed just enough to flip a prefix byte but is still a near-duplicate overall. It’s a wider net, but the Hamming distance computation on 200 candidates is still sub-millisecond.
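The control flow of the two phases plus the fallback can be sketched without a database. Here the two queries are replaced by hypothetical zero-arity functions that return lists of maps with :id and :dhash keys. The {:unique, nil} return shape is also an assumption, since the source only shows {:unique, _}.

```elixir
defmodule TwoPhaseSketch do
  @similarity_threshold 26

  # The fallback lookup runs only when the prefix candidates yield no match,
  # because || short-circuits on the first non-nil result.
  def check_duplicate(dhash, prefix_lookup, fallback_lookup) do
    match =
      closest_match(dhash, prefix_lookup.()) ||
        closest_match(dhash, fallback_lookup.())

    case match do
      {capture, _distance} -> {:duplicate, capture}
      nil -> {:unique, nil}
    end
  end

  defp closest_match(dhash, candidates) do
    candidates
    |> Enum.map(fn candidate -> {candidate, hamming_distance(dhash, candidate.dhash)} end)
    |> Enum.filter(fn {_capture, distance} -> distance < @similarity_threshold end)
    |> Enum.min_by(fn {_capture, distance} -> distance end, fn -> nil end)
  end

  defp hamming_distance(a, b) do
    for <<bit::1 <- :crypto.exor(a, b)>>, reduce: 0 do
      acc -> acc + bit
    end
  end
end

current = :binary.copy(<<0>>, 128)
near = %{id: 2, dhash: <<0b0000_0111>> <> :binary.copy(<<0>>, 127)}
far = %{id: 3, dhash: :binary.copy(<<255>>, 128)}

# The prefix query finds nothing, so the fallback candidates are checked.
{:duplicate, matched} =
  TwoPhaseSketch.check_duplicate(current, fn -> [] end, fn -> [far, near] end)

IO.inspect(matched.id)
# 2
```

Passing the lookups as functions keeps the fallback query lazy, so it is not issued when the prefix phase already found a match.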

Plugging it into the pipeline

The dedup check lives in the job that processes each capture. The routing decision is straightforward:

with {:ok, capture} <- maybe_detect_duplicate(capture) do
  if capture.is_duplicate do
    handle_duplicate_analysis(capture, args)
  else
    handle_llm_analysis(capture, s3_key, args)
  end
end

maybe_detect_duplicate runs the two-phase lookup. If a match is found, it marks the capture with the match before returning:

case DuplicateDetector.check_duplicate(metadata) do
  {:duplicate, original_capture} ->
    capture
    |> Ecto.Changeset.change(%{
      is_duplicate: true,
      similar_to_id: original_capture.id
    })
    |> Repo.update()

  {:unique, _} ->
    {:ok, capture}
end

handle_duplicate_analysis then clones the full result — observations, embeddings, everything — scoped to the same organization as a defense-in-depth check. handle_llm_analysis proceeds with the normal LLM call.

The clone is a full transactional operation, not just a pointer. Downstream consumers (alerts, dashboards, reports) don’t know or care whether a result came from a fresh LLM call or a clone. The data model is the same either way. But every clone is tagged with cloned_from_analysis_id and metadata marking it as "source" => "cloned_from_duplicate", so you can always audit the dedup rate and trace any result back to the original LLM call that produced it.

Choosing a threshold

The similarity threshold is the Hamming distance a frame must stay under to count as a “duplicate.” Too low and you miss near-duplicates, too high and you start cloning results for genuinely different scenes. Getting this wrong in either direction is bad: miss a duplicate and you waste money, clone a different scene and you miss an alert.

With a 1024-bit hash and a strict less-than comparison, a threshold of 26 means a match can differ in at most 25 bits, so at least 999 of the 1024 bits (about 97.5%) must agree. In practice, this is conservative enough that false positives are extremely rare: you’d need nearly identical frames to get below 26 bits of difference.
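The arithmetic behind that percentage, written out as a short script (the variable names are illustrative):

```elixir
hash_bits = 1024
threshold = 26

# The filter is `distance < threshold`, so the largest accepted distance is 25.
max_accepted = threshold - 1
min_matching_bits = hash_bits - max_accepted
min_similarity = Float.round(min_matching_bits / hash_bits * 100, 2)

IO.puts("at least #{min_matching_bits} of #{hash_bits} bits match (#{min_similarity}%)")
# at least 999 of 1024 bits match (97.56%)
```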

I arrived at 26 through experimentation on production data. The approach was straightforward:

Capture a week of frames with no deduplication
Compute pairwise Hamming distances for consecutive frames from each camera
Plot the distribution — there’s a clear bimodal split between “nothing changed” (distances 0–20) and “something happened” (distances 80+), with a wide empty gap between roughly 25 and 70
Set the threshold in the gap between the two clusters — 26 sits just above the “identical” cluster with plenty of margin before the “different” cluster begins

The wide gap was reassuring — it means the threshold isn’t sensitive to small changes. Anywhere from 25 to 60 would have worked, but we chose the conservative end to minimize the risk of cloning results for scenes that actually changed.
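The survey itself needs very little code. The sketch below assumes each camera’s hashes are available as raw binaries in capture order; the module name and the 10-bit bucket width are illustrative choices. It computes the distance between each frame and the next, then buckets the distances so the two clusters and the gap between them show up in the counts.

```elixir
defmodule ThresholdSurvey do
  # Hamming distance between each hash and its successor.
  def consecutive_distances(hashes) do
    hashes
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(fn [a, b] -> hamming_distance(a, b) end)
  end

  # Buckets distances by `width` bits: %{bucket_start => count}.
  def histogram(distances, width \\ 10) do
    Enum.frequencies_by(distances, fn d -> div(d, width) * width end)
  end

  defp hamming_distance(a, b) do
    for <<bit::1 <- :crypto.exor(a, b)>>, reduce: 0 do
      acc -> acc + bit
    end
  end
end

# Tiny 2-byte stand-ins for real 128-byte hashes.
hashes = [<<0, 0>>, <<0, 1>>, <<255, 255>>]

distances = ThresholdSurvey.consecutive_distances(hashes)
IO.inspect(distances)
# [1, 15]
IO.inspect(ThresholdSurvey.histogram(distances))
# %{0 => 1, 10 => 1}
```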

The lookback window is 24 hours. Beyond that, lighting conditions shift enough that even truly unchanged scenes start to diverge.

When cloning isn’t safe

Not every duplicate should be cloned. We learned this when we updated an analysis profile and couldn’t figure out why the new observations weren’t showing up — the dedup was matching against results produced by the old profile. The system checks for a few conditions before reusing a previous result:

Analysis profile changed. Each camera has a configurable analysis profile that determines what the LLM looks for. If the profile changed since the original analysis, the old result was produced against different criteria and shouldn’t be reused.
Original analysis not found. If the matched capture’s analysis was deleted or never completed, there’s nothing to clone.
Clone fails. If the transactional clone operation fails for any reason, the system falls back to making a normal LLM call. The duplicate detection is an optimisation, not a gate — a failure in the dedup path should never block processing.

In all three cases, the fallback is the same: proceed with the LLM call as if no duplicate was found.
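One way to express “every failure falls back to the LLM” is a single with chain. Everything in this sketch is hypothetical: the module, the :profile_version field used to detect a changed profile, and the clone_fun and llm_fun callbacks standing in for the real clone and LLM paths.

```elixir
defmodule CloneGuard do
  # No analysis to clone from.
  def decide(nil, _current_profile_version), do: :llm

  # The analysis was produced under a different profile.
  def decide(%{profile_version: version}, current) when version != current, do: :llm

  def decide(%{profile_version: _same}, _current), do: :clone

  # Any outcome other than a successful clone ends in a normal LLM call.
  def run(analysis, current_profile_version, clone_fun, llm_fun) do
    with :clone <- decide(analysis, current_profile_version),
         {:ok, result} <- clone_fun.(analysis) do
      {:ok, result}
    else
      _ -> llm_fun.()
    end
  end
end

llm = fn -> {:ok, :fresh_llm_result} end
failing_clone = fn _analysis -> {:error, :transaction_failed} end

IO.inspect(CloneGuard.run(%{profile_version: 2}, 2, failing_clone, llm))
# {:ok, :fresh_llm_result}
```

Funnelling all three conditions through one else clause makes it hard for a new failure mode to block processing by accident.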

Results and trade-offs

In production, the duplicate detection rate runs at about 30% of all captures. This varies by camera — a camera pointed at a quiet hallway at night might hit 80% duplicates, while one watching a busy entrance during the day might only hit 10%. The 30% is the fleet-wide average.

The cost savings are essentially linear: 30% fewer LLM API calls means roughly 30% less spend on the vision model. There’s a small overhead for the hashing and lookup, but it’s negligible compared to the cost of an API call.

Beyond cost, there’s a latency benefit too. Cloned results are available immediately instead of waiting for a 5–30 second LLM round trip. For the alerting system that sits downstream, this means faster response times for the frames that do matter, because the pipeline isn’t backed up processing identical frames.

The trade-off is that cloned results are, by definition, stale. If something did change but the hash distance was below threshold, we’d miss it. With a 97.5% similarity requirement that’s extremely unlikely for meaningful changes — but it’s worth being honest that this is an optimisation that trades a tiny amount of recall for significant cost savings. In 90 days of production, we haven’t caught a false positive that resulted in a missed alert, but we watch the telemetry closely.

The core algorithm is simple — the dHash itself is about 50 lines of Elixir. But the production system around it is more involved: prefix generation, the two-phase lookup with fallback, transactional cloning with bulk inserts, organization-scoped validation, telemetry, and the various failure-mode fallbacks. The algorithm was an afternoon; the production hardening took longer. But it’s been the highest-ROI optimisation I’ve made to this system.