Error Handling in Computer Vision Pipelines for Retail Shelf Analytics

Within the Image Parsing & Computer Vision Workflows section, error handling is the control plane that decides what a shelf-analytics pipeline does when an input, a model, or a downstream contract behaves badly — and in live retail, something always does. Store-level imagery arrives under fluctuating fluorescent and LED lighting, partial occlusion by carts and shoppers, and mid-shift planogram resets executed by floor staff, none of which appear in curated training sets. When a vision pipeline fails silently against that reality, category managers receive distorted compliance scores, inventory reconciliation breaks, and automated replenishment fires false purchase orders. This component therefore treats failure as a first-class, typed signal: every stage either produces a trustworthy record or routes the frame to a quarantine path that preserves the audit trail, never a fabricated verdict.

This page specifies the error-handling layer itself — the data contract it imposes on every frame, the guard logic that wraps ingestion, preprocessing, inference, and post-processing, the thresholds that govern degradation, and the failure modes that only surface at fleet scale. The slow-burn case where the model keeps running but accuracy decays is handled in the companion walkthrough on Debugging Vision Model Drift in Retail Environments; here the focus is the deterministic exception routing that runs in front of every stage.

Concept & Data Contract

The error-handling layer consumes the same validated capture envelopes the rest of the pipeline uses and wraps a deterministic status around each one. Its single invariant is that a frame’s processing status must transition explicitly through a finite set of states, and any unrecoverable transition lands the frame in a terminal sink — REJECTED (the input is unusable and a recapture is requested) or QUARANTINED (the input is held with full context for human or automated review) — rather than allowing a partial result to flow downstream. A compliance record is only ever emitted from the COMPLETED state.

The states form a strict progression: RECEIVED → VALIDATED → CORRECTED → INFERENCING → SCORED → COMPLETED, with REJECTED and QUARANTINED reachable from any non-terminal state. Encoding this as an enum rather than free-form strings is what stops a phantom score — a frame that skipped the quality gate or whose inference degraded — from reaching an executive dashboard. The bucketing and model-tier assignment that precede this layer come from Vision Model Routing for Shelf Detection; the error-handling layer never re-routes, it only guards and records.

The typed contract makes the boundary explicit:

from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import NewType

CaptureId = NewType("CaptureId", str)


class FrameStatus(str, Enum):
    RECEIVED = "received"
    VALIDATED = "validated"
    CORRECTED = "corrected"
    INFERENCING = "inferencing"
    SCORED = "scored"
    COMPLETED = "completed"
    REJECTED = "rejected"        # unusable input; recapture requested
    QUARANTINED = "quarantined"  # held with context for review


class RejectReason(str, Enum):
    SCHEMA_INVALID = "schema_invalid"
    CORRUPT_PAYLOAD = "corrupt_payload"
    LOW_QUALITY = "low_quality"
    MODEL_UNAVAILABLE = "model_unavailable"
    LOW_CONFIDENCE = "low_confidence"
    SPATIAL_VIOLATION = "spatial_violation"
    CATALOG_MISMATCH = "catalog_mismatch"


@dataclass(frozen=True, slots=True)
class FrameContext:
    """Travels with a frame through every guarded stage."""
    capture_id: CaptureId
    store_id: str
    fixture_id: str
    planogram_id: str
    image_uri: str
    capture_timestamp: str          # ISO-8601, UTC
    status: FrameStatus = FrameStatus.RECEIVED
    reject_reason: RejectReason | None = None
    trace_id: str = ""
    attempts: int = 0
    notes: tuple[str, ...] = field(default_factory=tuple)
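
The strict progression described earlier can be made checkable with an explicit transition table. The following is a minimal sketch, not the platform's implementation: `ALLOWED` and `transition` are hypothetical names, and the enum is repeated only so the snippet runs on its own.

```python
from enum import Enum


class FrameStatus(str, Enum):
    RECEIVED = "received"
    VALIDATED = "validated"
    CORRECTED = "corrected"
    INFERENCING = "inferencing"
    SCORED = "scored"
    COMPLETED = "completed"
    REJECTED = "rejected"
    QUARANTINED = "quarantined"


_HAPPY_PATH = [
    FrameStatus.RECEIVED,
    FrameStatus.VALIDATED,
    FrameStatus.CORRECTED,
    FrameStatus.INFERENCING,
    FrameStatus.SCORED,
    FrameStatus.COMPLETED,
]
_SINKS = {FrameStatus.REJECTED, FrameStatus.QUARANTINED}

# Hypothetical table: every non-terminal state may advance exactly one step
# or fall into one of the two terminal sinks. Terminal states have no entry.
ALLOWED = {
    current: {nxt} | _SINKS
    for current, nxt in zip(_HAPPY_PATH, _HAPPY_PATH[1:])
}


def transition(current: FrameStatus, target: FrameStatus) -> FrameStatus:
    """Return target if the move is legal, otherwise raise ValueError."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Calling `transition(FrameStatus.RECEIVED, FrameStatus.SCORED)` raises, which is the property that keeps a frame from skipping the quality gate. Keeping the table as data rather than scattered `if` checks also makes it easy to review when a state is added.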

When a frame is quarantined, the layer serializes a self-describing dead-letter record so the asset can be replayed or audited without re-deriving any context. That record reuses the same typed fields the rest of the platform speaks, so a reviewer reconciling it against a compliance store sees a consistent shape:

{
  "capture_id": "c_8f21a4",
  "store_id": "store_0421",
  "fixture_id": "fx_endcap_12",
  "planogram_id": "pg_2026_q2_beverages",
  "status": "quarantined",
  "reject_reason": "low_confidence",
  "trace_id": "t_19ab77",
  "attempts": 3,
  "capture_timestamp": "2026-06-28T08:14:02Z",
  "image_uri": "s3://shelf-raw/store_0421/fx_endcap_12/c_8f21a4.jpg",
  "stage": "inference",
  "observed_confidence": 0.41
}
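
Before replaying, a consumer might first check that a dead-letter record carries every field needed to re-submit it. This is a minimal sketch that assumes the field names from the example record above; `REQUIRED_FIELDS` and `missing_fields` are hypothetical helpers, and treating all of these fields as mandatory is an assumption.

```python
import json

# Field names copied from the example record; requiring all of them is an assumption.
REQUIRED_FIELDS = (
    "capture_id", "store_id", "fixture_id", "planogram_id", "status",
    "reject_reason", "trace_id", "attempts", "capture_timestamp",
    "image_uri", "stage",
)


def missing_fields(raw: str) -> list:
    """Return the required fields that are absent, null, or empty in a record."""
    record = json.loads(raw)
    # A value of 0 (for example attempts=0) is valid, so only None and "" count as missing.
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]
```

A record that returns a non-empty list cannot be replayed as-is and is better routed to manual review than re-submitted.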

Implementation Architecture

The layer is a set of guards — one per stage — that share a single rule: catch the stage’s failure mode, transition the FrameContext, and either advance or dead-letter, but never raise into the worker loop in a way that kills a batch of healthy frames. We implement each guard as a context manager so the transition and the telemetry emission happen on every exit path, including exceptions. This is the production-quality alternative to scattering try/except around ad-hoc dictionaries, and it keeps the state machine enforceable in one place.

import logging
from contextlib import contextmanager
from dataclasses import replace
from typing import Iterator

logger = logging.getLogger("cv.errors")


class QualityRejected(Exception):
    """Raised by the preprocessing gate for an unusable frame."""
    def __init__(self, reason: RejectReason, detail: str) -> None:
        super().__init__(detail)
        self.reason = reason


class DeadLetterQueue:
    """Durable sink for frames that cannot produce a trustworthy verdict."""
    def __init__(self, broker) -> None:
        self._broker = broker

    def send(self, ctx: FrameContext, stage: str, detail: str) -> None:
        self._broker.publish("cv.deadletter", {
            "capture_id": ctx.capture_id,
            "store_id": ctx.store_id,
            "fixture_id": ctx.fixture_id,
            "planogram_id": ctx.planogram_id,
            "status": ctx.status.value,
            "reject_reason": (ctx.reject_reason or RejectReason.CORRUPT_PAYLOAD).value,
            "trace_id": ctx.trace_id,
            "attempts": ctx.attempts,
            "stage": stage,
            "detail": detail,
            "image_uri": ctx.image_uri,
            "capture_timestamp": ctx.capture_timestamp,
        })


@contextmanager
def stage_guard(
    ctx: FrameContext,
    *,
    stage: str,
    advance_to: FrameStatus,
    dlq: DeadLetterQueue,
) -> Iterator[list[FrameContext]]:
    """
    Wrap one pipeline stage. On success, advance the frame's status.
    On an unusable input (LOW_QUALITY), reject it so a recapture can be
    requested; on any other quality/contract failure, quarantine it. On
    an unexpected error, quarantine it too rather than corrupting the stream.
    """
    box: list[FrameContext] = [ctx]
    try:
        yield box
    except QualityRejected as exc:
        # Unusable input is REJECTED; everything else is held for review.
        status = (FrameStatus.REJECTED if exc.reason is RejectReason.LOW_QUALITY
                  else FrameStatus.QUARANTINED)
        # Build from box[0] so updates made inside the stage (attempts, notes) survive.
        terminal = replace(box[0], status=status, reject_reason=exc.reason)
        dlq.send(terminal, stage, str(exc))
        logger.warning("%s %s at %s: %s", status.value, ctx.capture_id, stage, exc)
        box[0] = terminal
    except Exception as exc:  # defensive: never let one frame kill the worker
        terminal = replace(box[0], status=FrameStatus.QUARANTINED,
                           reject_reason=RejectReason.CORRUPT_PAYLOAD)
        dlq.send(terminal, stage, repr(exc))
        logger.exception("unexpected failure on %s at %s", ctx.capture_id, stage)
        box[0] = terminal
    else:
        box[0] = replace(box[0], status=advance_to)
        logger.debug("%s advanced to %s", ctx.capture_id, advance_to.value)

The two stages most worth protecting explicitly are ingestion and inference. At ingestion, every payload passes strict schema validation against the typed envelope — mandatory store_id, fixture_id, planogram_id, capture_timestamp and an image_hash — with retries on transient network faults using exponential backoff and jitter to avoid a thundering herd against the object store. At inference, the model call is wrapped in a circuit breaker so that a model registry that has gone stale, a GPU that is fragmenting memory, or a load that has blown the latency budget degrades to a fallback tier instead of stalling the whole worker pool:

import time


class CircuitBreaker:
    """Trips to OPEN after repeated failures; degrades inference to a fallback tier."""
    def __init__(self, fail_threshold: int = 5, reset_after_s: float = 30.0) -> None:
        self._fail_threshold = fail_threshold
        self._reset_after_s = reset_after_s
        self._failures = 0
        self._opened_at: float | None = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if time.monotonic() - self._opened_at >= self._reset_after_s:
            # Half-open: close the breaker but stay one failure away from
            # re-tripping, so a single failed trial call re-opens it.
            self._opened_at = None
            self._failures = self._fail_threshold - 1
            return False
        return True

    def record(self, ok: bool) -> None:
        if ok:
            self._failures = 0
            self._opened_at = None
        else:
            self._failures += 1
            if self._failures >= self._fail_threshold:
                self._opened_at = time.monotonic()


def run_inference(ctx: FrameContext, primary, fallback, breaker: CircuitBreaker,
                  min_confidence: float = 0.65):
    """Route through the primary detector unless the breaker is open or it under-performs."""
    use_primary = not breaker.is_open
    detector = primary if use_primary else fallback
    try:
        result = detector.predict(ctx.image_uri)
    except Exception as exc:
        if use_primary:
            # Only primary outcomes drive the breaker; a fallback failure
            # must not extend the OPEN window.
            breaker.record(ok=False)
        raise QualityRejected(
            RejectReason.MODEL_UNAVAILABLE, f"{detector.name} failed"
        ) from exc
    if use_primary:
        # Recording success only for the primary keeps a healthy fallback
        # from closing the breaker while the primary is still down.
        breaker.record(ok=True)
        if result.confidence < min_confidence:
            # Primary returned but is not confident enough; try the broader fallback.
            result = fallback.predict(ctx.image_uri)
    return result
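
The ingestion-side retry policy mentioned above (exponential backoff with jitter) might look like the sketch below. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and the exponential cap; that choice, along with the names `TransientFetchError`, `backoff_delays`, and `retry_with_backoff`, is an assumption for illustration rather than the pipeline's actual code.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class TransientFetchError(Exception):
    """Hypothetical marker for retryable network or object-store faults."""


def backoff_delays(max_retries: int = 4, base_s: float = 0.5,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter delays: delay i is uniform in [0, base_s * 2**i]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, base_s * 2 ** attempt) for attempt in range(max_retries)]


def retry_with_backoff(fetch: Callable[[], T], *, max_retries: int = 4,
                       base_s: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch; on TransientFetchError, wait and retry up to max_retries times."""
    for delay in backoff_delays(max_retries, base_s):
        try:
            return fetch()
        except TransientFetchError:
            sleep(delay)
    # Final attempt: let the error propagate so the caller can dead-letter the frame.
    return fetch()
```

With the defaults, the delay caps are 0.5, 1, 2, and 4 seconds, and the call is attempted at most five times. Because the retry count is bounded, a frame that exhausts it surfaces an exception instead of spinning, which lets the surrounding guard dead-letter it.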

Production Configuration & Tuning

Every threshold in this layer is an operational dial, not a constant, and belongs in config so it can be tuned per store tier without a redeploy. A reasonable starting profile, expressed as environment variables:

# Ingestion guard
CV_INGEST_MAX_RETRIES=4
CV_INGEST_BACKOFF_BASE_S=0.5      # 0.5, 1, 2, 4 with jitter
CV_INGEST_SCHEMA_STRICT=true

# Preprocessing quality gate
CV_BLUR_LAPLACIAN_MIN=120.0       # below this variance == too blurry
CV_GLARE_OVEREXPOSED_MAX=0.06     # max fraction of clipped highlight pixels
CV_MIN_SKU_PIXEL_DENSITY=0.012    # min SKU px / frame px to resolve facings

# Inference circuit breaker + degradation
CV_BREAKER_FAIL_THRESHOLD=5
CV_BREAKER_RESET_AFTER_S=30
CV_MIN_CONFIDENCE=0.65            # below this on primary == try fallback
CV_PARTIAL_COMPLIANCE_FLOOR=0.55  # below this == PARTIAL_COMPLIANCE, no restock trigger

# Post-processing spatial guard
CV_NMS_IOU=0.55
CV_ASPECT_TOLERANCE=0.15          # +/- vs catalog packaging aspect ratio
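
One way to consume a profile like this is to parse it once at startup into a frozen, typed object, so that a malformed value fails fast instead of surfacing mid-batch. The sketch below covers only four of the variables; `GateConfig`, `load_gate_config`, and the range check are hypothetical, and the defaults simply mirror the starting profile above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class GateConfig:
    blur_laplacian_min: float
    glare_overexposed_max: float
    min_confidence: float
    partial_compliance_floor: float


def _fraction(env: Mapping[str, str], name: str, default: str) -> float:
    """Parse a value that is assumed to be a fraction in [0, 1]."""
    value = float(env.get(name, default))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be within [0, 1], got {value}")
    return value


def load_gate_config(env: Mapping[str, str] = os.environ) -> GateConfig:
    """Read thresholds from the environment, falling back to the starting profile."""
    return GateConfig(
        blur_laplacian_min=float(env.get("CV_BLUR_LAPLACIAN_MIN", "120.0")),
        glare_overexposed_max=_fraction(env, "CV_GLARE_OVEREXPOSED_MAX", "0.06"),
        min_confidence=_fraction(env, "CV_MIN_CONFIDENCE", "0.65"),
        partial_compliance_floor=_fraction(env, "CV_PARTIAL_COMPLIANCE_FLOOR", "0.55"),
    )
```

Passing a plain dict as `env` keeps the loader testable without mutating the process environment.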

The preprocessing gate runs cheap, deterministic OpenCV checks before any frame reaches a GPU: Laplacian variance below CV_BLUR_LAPLACIAN_MIN flags blur (motion or defocus), an HSV highlight-clipping ratio above CV_GLARE_OVEREXPOSED_MAX flags glare or overexposure, and a resolution/aspect check enforces the minimum SKU pixel density set by CV_MIN_SKU_PIXEL_DENSITY. A frame that fails the gate is not silently dropped: it raises QualityRejected, and depending on which metric failed the layer either runs an automated lighting-correction pass or pushes a recapture request to the associate’s mobile app. Tune CV_MIN_CONFIDENCE against the same compliance-accuracy target used in Threshold Tuning for Compliance Accuracy: raising it cuts false positives but pushes more frames into the fallback tier, so CV_MIN_CONFIDENCE and the compliance-accuracy threshold must be tuned together rather than in isolation.
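
The blur and glare checks can be illustrated with a small NumPy-only sketch. This is not the gate's actual code: an OpenCV pipeline would more typically compute the blur metric as `cv2.Laplacian(gray, cv2.CV_64F).var()`, and the `clip_at` level of 250 and the function names here are assumptions.

```python
import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian over the interior of a grayscale image."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def clipped_fraction(value_channel: np.ndarray, clip_at: int = 250) -> float:
    """Fraction of pixels at or above clip_at on a 0-255 brightness scale."""
    return float((value_channel >= clip_at).mean())


def failed_checks(gray: np.ndarray, value_channel: np.ndarray,
                  blur_min: float = 120.0, glare_max: float = 0.06) -> list:
    """Return the names of the quality checks this frame fails."""
    failures = []
    if laplacian_variance(gray) < blur_min:
        failures.append("blur")
    if clipped_fraction(value_channel) > glare_max:
        failures.append("glare")
    return failures
```

Returning the list of failed checks, rather than a single boolean, is what allows the caller to choose between lighting correction and a recapture request.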

The degradation floor CV_PARTIAL_COMPLIANCE_FLOOR is the most consequential dial. When localization confidence for an aisle segment drops below it, the layer must emit a PARTIAL_COMPLIANCE status with an explicit out_of_stock_flags and misplaced_sku_list payload, suppress any automated restock trigger for that segment, and raise a high-priority alert — never a binary compliant/non-compliant flag built on statistically insignificant detections.
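
The degradation rule can be sketched as a small pure function. The payload field names `out_of_stock_flags` and `misplaced_sku_list` come from the text; `SegmentVerdict`, `score_segment`, and the `COMPLIANT` / `NON_COMPLIANT` labels are hypothetical, as is the choice to allow a restock trigger only when out-of-stock SKUs were confidently detected.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SegmentVerdict:
    status: str                      # "COMPLIANT", "NON_COMPLIANT", or "PARTIAL_COMPLIANCE"
    restock_trigger_allowed: bool
    high_priority_alert: bool
    out_of_stock_flags: Tuple[str, ...] = ()
    misplaced_sku_list: Tuple[str, ...] = ()


def score_segment(confidence: float, out_of_stock: Sequence[str],
                  misplaced: Sequence[str], floor: float = 0.55) -> SegmentVerdict:
    """Degrade to PARTIAL_COMPLIANCE below the floor instead of a binary flag."""
    flags, moved = tuple(out_of_stock), tuple(misplaced)
    if confidence < floor:
        # Below the floor: keep the evidence, suppress restock, raise an alert.
        return SegmentVerdict("PARTIAL_COMPLIANCE", False, True, flags, moved)
    status = "NON_COMPLIANT" if (flags or moved) else "COMPLIANT"
    return SegmentVerdict(status, bool(flags), False, flags, moved)
```

Because the degraded verdict is a distinct status rather than a low score, a downstream consumer has to handle it explicitly.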

Failure Modes & Debugging Workflow

When error rates spike or compliance numbers look wrong, work the diagnosis in order rather than guessing:

1. Distinguish a rejection spike from a quarantine spike. A rising REJECTED rate means inputs are genuinely unusable, so check the preprocessing gate counters by store. If one store dominates, the cause is usually hardware: a fixed camera that has drifted out of focus or a lighting retrofit. A rising QUARANTINED rate with healthy inputs points instead at the model or the contract, not the photo.
2. Confirm the circuit breaker is tripping for the right reason. A breaker stuck OPEN starves the primary detector and silently routes everything to the lower-capacity fallback, which depresses confidence fleet-wide. Log every state change with its failure count; an OPEN breaker with no corresponding GPU or registry error suggests that CV_BREAKER_FAIL_THRESHOLD is too low for normal transient noise and that the breaker is flapping.
3. Reproduce schema rejections from the dead-letter record. Because each quarantined frame carries its full FrameContext and image_uri, a SCHEMA_INVALID reason should be replayable directly. A burst of them usually means an upstream capture-app release changed a field name or dropped image_hash, so diff the rejected payloads against the typed envelope rather than inspecting images.
4. Separate spatial violations from catalog mismatches. A SPATIAL_VIOLATION (a box outside the shelf plane or with an impossible aspect ratio) is a detection or homography problem owned by Bounding Box Extraction & SKU Localization; a CATALOG_MISMATCH (a valid box mapped to a deprecated SKU) is a reference-data problem. Routing both to the same quarantine reason hides which team should act, so keep the reason codes distinct.
5. Trace frames that never reach a terminal state. A capture_id acknowledged at ingestion but absent from both the compliance store and the dead-letter queue is a leaked frame: a guard that swallowed an exception without dead-lettering, or a worker that died mid-stage. The stage_guard handles every exit path precisely to prevent this; if it still happens, audit for code paths that bypass the guard.
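
The leaked-frame check in the last step reduces to a set difference over capture IDs. A minimal sketch, assuming the three ID collections can be exported from ingestion acknowledgements, the compliance store, and the dead-letter queue; `find_leaked_frames` is a hypothetical helper.

```python
from typing import Iterable, Set


def find_leaked_frames(acknowledged: Iterable[str],
                       completed: Iterable[str],
                       dead_lettered: Iterable[str]) -> Set[str]:
    """Return capture IDs acknowledged at ingestion that reached no terminal sink."""
    return set(acknowledged) - set(completed) - set(dead_lettered)
```

In practice the comparison should exclude frames still in flight, for example by only considering captures older than the pipeline's maximum processing time.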

The recurring root cause behind most error-handling incidents is the same: a threshold chosen once against average conditions and never re-validated against a worst case. Load-test the gates with synthetic imagery that simulates the heaviest glare, deepest occlusion, and lowest resolution you expect during a national promotional reset, and set the floors against that worst case rather than the median store.

Scaling & Performance Benchmarks

Error handling has to be cheap, because it runs on every frame whether or not the frame ultimately succeeds. The preprocessing gate and schema validation are CPU-bound and should add no more than a few milliseconds per frame. If they creep into double digits, a likely cause is that the Laplacian and HSV passes are running at full resolution instead of on a downsampled thumbnail; resize to a fixed working size before computing quality metrics, and calibrate CV_BLUR_LAPLACIAN_MIN at that size, since Laplacian variance depends on resolution. The guard’s own overhead is a state transition and a structured log line, which is negligible next to inference.

The metric that governs capacity here is dead-letter throughput relative to ingestion rate. A healthy steady state keeps the quarantine ratio low and stable; a sustained climb predicts a backlog of frames that will miss a category manager’s briefing window. Alert when the QUARANTINED fraction over a rolling 100-frame window exceeds your store-tier baseline, and autoscale review-queue consumers on dead-letter depth rather than CPU. Because quarantined frames carry full context, replay is parallelizable: a recovered model or corrected reference catalog can drain a backlog by re-submitting dead-letter records straight back to the appropriate stage.
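
The rolling-window alert can be sketched with a bounded deque. `QuarantineRateMonitor` is a hypothetical name, and staying silent until the window has filled is an assumption made here to avoid noisy alerts on the first few frames.

```python
from collections import deque


class QuarantineRateMonitor:
    """Tracks the QUARANTINED fraction over the most recent `window` frames."""

    def __init__(self, baseline: float, window: int = 100) -> None:
        self._baseline = baseline
        self._window = window
        self._outcomes = deque(maxlen=window)  # oldest outcome is evicted automatically

    def observe(self, quarantined: bool) -> bool:
        """Record one frame; return True when the full window exceeds the baseline."""
        self._outcomes.append(quarantined)
        if len(self._outcomes) < self._window:
            return False  # assumption: no alert until the window has filled
        return sum(self._outcomes) / self._window > self._baseline
```

One monitor per store tier, each with its own baseline, matches the per-tier alerting described above.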

For resilience under correlated failure — a regional network outage, an object-store brownout — the ingestion guard’s backoff must be bounded so retries do not amplify the incident, and frames that exhaust retries should dead-letter rather than spin. The broader offline-buffering strategy for an entire store losing connectivity is covered in Fallback Routing for Offline Store Scenarios; within this layer, the contract is simply that a transient outage produces deferred, replayable dead-letters, never lost frames or fabricated scores. Emit telemetry per stage — ingestion latency, gate rejection ratio, breaker state changes, confidence distribution, and partial-compliance fallbacks — so a regression is attributable to a stage instead of chased across the whole pipeline.

Frequently Asked Questions

Should a low-quality shelf photo be rejected or quarantined?
It depends on whether the input is recoverable. A frame that fails the blur or glare gate but might be salvageable is routed to an automated lighting-correction pass first; only if the corrected frame still fails the quality gate is it terminally REJECTED with a recapture request to the associate. A frame that is structurally fine but produced a low-confidence or contract-violating result is QUARANTINED instead, because the photo is usable and the problem is downstream. That distinction is what lets you triage the two failure classes to different teams.

Why use a circuit breaker instead of just retrying the inference call?
Retrying a failing detector under load makes the incident worse: it adds requests to a GPU that is already saturated or to a registry that is already stale. A breaker trips to OPEN after CV_BREAKER_FAIL_THRESHOLD consecutive failures and routes traffic to a broader, lower-capacity fallback tier so the pipeline keeps producing verdicts, flagged as degraded, while the primary recovers. Blind retries also make latency unpredictable and break SLAs, whereas the breaker caps the time spent on a failing primary.

How do I stop a phantom compliance score from reaching a dashboard?
Enforce the state machine: a compliance record is only emitted from the COMPLETED state, and every transition into it is guarded. When localization confidence for a segment falls below CV_PARTIAL_COMPLIANCE_FLOOR, the layer emits a PARTIAL_COMPLIANCE status with explicit out_of_stock_flags and misplaced_sku_list fields and suppresses restock triggers, never a binary flag. Because the status is typed, a downstream consumer cannot silently treat a degraded result as a confident one.

What belongs in a dead-letter record so a frame can actually be replayed?
The full FrameContext plus the failing stage and reason: capture_id, store_id, fixture_id, planogram_id, image_uri, capture_timestamp, trace_id, attempts, the reject_reason, and a human-readable detail. With that, a recovered model or corrected catalog can re-submit the record straight to the stage that failed, so the frame is recovered rather than re-derived from scratch or lost.

How is this different from debugging model drift?
This layer handles deterministic, per-frame failures (a corrupt payload, a tripped breaker, a box outside the shelf plane) that have a clear pass/fail boundary at the moment they occur. Drift is a gradual, statistical erosion of accuracy where the model keeps running and every individual frame looks plausible. The two are complementary: the error-handling layer quarantines hard failures in real time, while drift detection watches the confidence distribution over rolling windows to catch the slow decay.

Related

Debugging Vision Model Drift in Retail Environments: the statistical counterpart to this layer, for accuracy that decays without hard failures
Async Image Batching for High-Volume Stores: the broker and dead-letter semantics these guards rely on at scale
Vision Model Routing for Shelf Detection: the upstream tiering that the circuit breaker degrades across
Bounding Box Extraction & SKU Localization: the detection stage whose spatial-constraint violations this layer quarantines
Image Parsing & Computer Vision Workflows: the workflow section this error-handling layer belongs to
