What IoU threshold should I use for NMS on dense shelves?

Start at 0.50 for standard density and tighten to 0.55 only when facings are tightly packed and a single product generates multiple overlapping boxes. If lower tiers show IoU drift, loosen to 0.45 rather than raising it, because aggressive merging at a high threshold fuses adjacent facings and undercounts the shelf.

Bounding Box Extraction & SKU Localization for Retail Shelf Analytics

Within the Image Parsing & Computer Vision Workflows section, bounding box extraction and SKU localization are the stage that turns an unstructured shelf photo into deterministic, planogram-ready coordinates — and everything downstream depends on getting it right. When a field associate, an autonomous cart, or a fixed gondola camera captures an endcap, this component must convert a raw pixel array into a set of boxes, each anchored to a physical shelf tier and resolved to a specific stock-keeping unit, with a confidence score and a provenance trail attached. That is not an academic detection exercise; it is the operational precondition for out-of-stock flagging, share-of-shelf calculation, and promotional-execution auditing. This page defines the data contract the stage enforces, the detection-plus-normalization architecture that runs underneath it, the thresholds that tune it, the failure modes you will actually hit on glossy packaging and tilted captures, and the throughput numbers to design against.

The localization stage: a validated image envelope is detected, suppressed at IoU 0.50, warped onto a fronto-parallel plane and snapped to four normalized shelf tiers, then resolved through a layered embedding → barcode → spatial-rule cascade into a typed localized-SKU record.

Concept & Data Contract Jump to heading

This component sits between inference routing and compliance scoring, so it has two hard boundaries. At the inbound boundary it consumes a validated image envelope — the same normalized payload produced upstream, carrying store_id, fixture_id, fixture_class, planogram_id, capture_timestamp, a decodable image_uri, and the sharpness_score that already gated the frame. At the outbound boundary it produces a localized-SKU record: a list of detections, each expressed in shelf-relative normalized coordinates rather than raw pixels, with a resolved sku, the shelf_tier it was snapped to, the confidence of both the box and the SKU match, and a resolution_source that names how the identity was decided. Everything between those boundaries is replaceable; the two contracts are not.

A localized-SKU record looks like this:

{
  "capture_id": "8f3c2a1e-7b4d-4e2a-9c11-5a6b7c8d9e0f",
  "fixture_id": "GONDOLA-A14-BAY3",
  "planogram_id": "PLN-2026Q2-CEREAL-A14",
  "capture_timestamp": "2026-06-28T07:14:32Z",
  "shelf_plane_confidence": 0.93,
  "detections": [
    {
      "sku": "SKU-0049221",
      "shelf_tier": 2,
      "bbox_norm": [0.118, 0.502, 0.196, 0.741],
      "box_confidence": 0.88,
      "sku_confidence": 0.91,
      "resolution_source": "embedding",
      "facings": 1
    },
    {
      "sku": "SKU-0051180",
      "shelf_tier": 2,
      "bbox_norm": [0.205, 0.500, 0.281, 0.739],
      "box_confidence": 0.83,
      "sku_confidence": 0.86,
      "resolution_source": "barcode",
      "facings": 2
    }
  ]
}

Those normalized boxes are what the planogram-sync section consumes: the slot-level pass/fail logic in position validation algorithms for planograms reads bbox_norm and shelf_tier directly, then emits the misplaced_sku_list, out_of_stock_flags, and price_tag_mismatch_count fields of the final compliance record. Codifying this localization record as a typed model — pydantic at the service boundary, the same schema mirrored in the warehouse — is what makes the join auditable: when a downstream score looks wrong, you are debugging a named field, not a malformed dictionary three stages deep. The cardinal rule of the contract is that coordinates leave this stage normalized to the shelf plane, never in raw pixels, so no downstream consumer ever performs a pixel-to-inch conversion of its own.

Implementation Architecture Jump to heading

The transform runs in three ordered passes: detect, normalize, resolve. Each is independently testable, and each writes its intermediate output so a failure can be localized to a single pass rather than the whole stage.

Detection. The pass begins with an anchor-free detector — a YOLO-family model, RT-DETR, or an EfficientDet variant — chosen by the upstream router and trained to isolate individual product facings from cluttered retail backgrounds. The detector emits raw boxes parameterized as (x_min, y_min, x_max, y_max) with class logits and confidence. Redundant detections collapse through Non-Maximum Suppression at an IoU threshold of 0.50 for standard density, tightening to 0.55 for tightly packed facings. Which detector variant runs here is decided upstream by vision model routing for shelf detection, so a dense promotional island receives an occlusion-hardened head while a sparse aisle shot is processed by a lighter tier under a stricter latency budget.

Normalization. Raw pixel coordinates carry no spatial semantics — a box at (1420, 890) is meaningless without physical context — so the detector output is warped onto a fronto-parallel shelf plane and snapped to logical tiers. Shelf images are rarely captured orthogonally; the pass estimates a homography matrix H from detected shelf-edge lines (or a one-time calibration target during store setup), computed with a Direct Linear Transform and refined with RANSAC to reject outlier keypoints. Once warped, each detection is mapped to a normalized Y-range per tier (for example tier 1: 0.00–0.25, tier 2: 0.25–0.50) and snapped to the nearest boundary within a tolerance of ±0.03 of image height, which absorbs minor camera tilt and packaging overhang without letting a box drift to the wrong shelf.

import numpy as np
import cv2
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizeConfig:
    tier_count: int = 4
    snap_tolerance: float = 0.03   # fraction of image height
    ransac_reproj_px: float = 4.0


def normalize_to_shelf_plane(
    boxes_px: np.ndarray,
    src_corners: np.ndarray,
    dst_corners: np.ndarray,
    img_shape: tuple[int, int],
    cfg: NormalizeConfig,
) -> tuple[np.ndarray, np.ndarray, float]:
    """Warp pixel-space xyxy boxes onto a fronto-parallel shelf plane and
    snap each to a logical tier.

    Returns normalized xyxy boxes, an int tier index per box, and a
    confidence score for the recovered shelf plane (RANSAC inlier ratio).
    Raises ValueError if a stable homography cannot be recovered, which the
    caller must route to the dead-letter path rather than score blindly.
    """
    if boxes_px.shape[0] == 0:
        return np.empty((0, 4)), np.empty((0,), dtype=int), 0.0

    H, inliers = cv2.findHomography(
        src_corners, dst_corners, cv2.RANSAC, cfg.ransac_reproj_px
    )
    if H is None or inliers is None:
        raise ValueError("homography_failed: insufficient stable shelf edges")
    plane_conf = float(inliers.mean())

    h, w = img_shape
    centers = np.stack(
        [(boxes_px[:, 0] + boxes_px[:, 2]) / 2,
         (boxes_px[:, 1] + boxes_px[:, 3]) / 2], axis=1
    ).reshape(-1, 1, 2).astype(np.float32)
    warped = cv2.perspectiveTransform(centers, H).reshape(-1, 2)

    # Normalize warped corners of each box to [0, 1] in plane coordinates.
    norm = boxes_px.astype(np.float64).copy()
    norm[:, [0, 2]] /= w
    norm[:, [1, 3]] /= h

    tier_h = 1.0 / cfg.tier_count
    v = (warped[:, 1] / h)
    raw_tier = np.floor(v / tier_h).astype(int)
    # Snap to the nearer boundary when within tolerance of an edge.
    edge_dist = np.minimum(v % tier_h, tier_h - (v % tier_h))
    snap = edge_dist <= cfg.snap_tolerance
    raw_tier[snap & ((v % tier_h) > tier_h / 2)] += 1
    tiers = np.clip(raw_tier, 0, cfg.tier_count - 1)
    return norm, tiers, plane_conf

Resolution. Localization alone does not satisfy compliance — each box must be deterministically resolved to a sku or GTIN. The pass uses a layered strategy. First, a lightweight Siamese or CLIP-style embedding of the cropped facing is matched by cosine similarity against a pre-indexed catalog of approved product images; a similarity at or above 0.82 triggers a direct assignment with resolution_source: "embedding". When similarity falls into an ambiguous band of 0.65–0.82, the pass triggers region-specific OCR or 1D/2D barcode decoding and cross-references the decoded value against the master item database, tagging the result barcode or ocr. Finally, spatial rule enforcement applies: if a detection violates a contracted brand-block adjacency, it is flagged for review rather than forced into a low-confidence match. Confidence and source propagate through every layer, which is exactly what makes the output auditable in a merchandising dispute. pydantic is the deliberate choice at the boundary because it converts a soft “looks fine” dictionary into a hard pass/fail with a named offending field before the record ever reaches the warehouse.

Production Configuration & Tuning Jump to heading

Localization behavior is governed by a handful of values that belong in versioned configuration, not source, so a precision regression can be attributed to a specific release rather than chased through the model. Detection confidence and suppression are per-fixture, never global: dense, glare-prone facings tolerate a lower cutoff to preserve recall, while sparse aisles enforce a stricter cutoff to suppress phantom boxes. The embedding match threshold and the ambiguous OCR band are pinned alongside them, because moving the band silently changes how many detections fall through to the slower barcode path. A representative block:

detection:
  default_conf: 0.50
  nms_iou: 0.50
  dense_facing_nms_iou: 0.55
  per_fixture:
    refrigerated_cooler: { conf: 0.40 }
    dry_grocery_gondola: { conf: 0.55 }
    endcap_promo:        { conf: 0.50 }
normalize:
  tier_count: 4
  snap_tolerance: 0.03            # +/- fraction of image height
  ransac_reproj_px: 4.0
  min_plane_confidence: 0.70      # below this -> dead-letter, do not score
resolution:
  embedding_match_min: 0.82       # direct SKU assignment at or above
  ambiguous_band: [0.65, 0.82]    # fall through to OCR / barcode
  aspect_ratio_tolerance: 0.15    # reject boxes outside catalog spec
preprocess:
  clahe_clip_limit: 2.0           # glare/specular stabilization

The snap_tolerance is the single most sensitive geometric knob: set it too tight and boxes near a shelf lip drift to the wrong tier on a mild camera pitch; set it too loose and a genuinely two-tier facing collapses into one. The min_plane_confidence gate is what keeps an unrecoverable homography from silently producing plausible-looking but spatially wrong coordinates — below it, the frame is dead-lettered, not scored. Sensitive values — catalog index credentials, object-store keys, warehouse DSNs — belong in environment variables or a secrets manager, never the YAML.

Failure Modes & Debugging Workflow Jump to heading

Localization fails in predictable, observable ways. Treat each as a routine diagnostic with a known path rather than a mystery.

IoU drift by shelf tier. Symptom: predicted boxes match ground truth well on upper tiers but degrade to IoU below 0.50 on lower ones. Root cause is almost always a stale or mis-recovered homography — the lower tiers sit furthest from the calibrated plane, so perspective error accumulates there. Reproduce by overlaying predicted boxes on 500+ annotated validation frames and plotting IoU per tier; fix by recomputing H from fresh edge detections or lowering nms_iou to 0.45 to stop aggressive box merging.
Bimodal SKU-confidence distribution. Symptom: a confidence histogram with a heavy tail below 0.60. This signals catalog mismatch or insufficient training diversity, not random noise. Diagnose with a per-fixture confusion matrix; recover by hard-mining the low-confidence crops into the next training cycle, the same discipline owned by the error handling in computer vision pipelines workflows, with threshold recovery calibrated in threshold tuning for compliance accuracy.
Phantom detections on non-product elements. Symptom: price tags, shelf clips, and promotional shelf-talkers surface as facings, inflating counts. The full suppression playbook — aspect-ratio gating, ROI masking, OCR rejection of pricing text — is the subject of the child guide on reducing false positives in SKU bounding boxes. Reproduce by filtering detections whose crops decode to pricing symbols or promo keywords and confirming they vanish.
Glare-fractured boxes on glossy packaging. Symptom: a single product yields two stacked partial boxes split by a specular highlight. Apply CLAHE or a polarized-filter simulation at preprocessing (clahe_clip_limit: 2.0) to stabilize edges before inference, and verify the fragments fuse back into one box afterward.
Swapped SKUs across flavor variants. Symptom: near-identical packaging (flavor extensions, multipacks) resolves to the wrong sku at high confidence. Cause: embedding separation is too small for the variant pair. Fix by adding the confusable pair as hard negatives and tightening embedding_match_min for that category so ambiguous matches fall through to barcode verification.

The calibration sequence that brings a new store format online without scoring blind follows a fixed order:

Capture a calibration set. Photograph each fixture class under its real lighting and hand-annotate 200+ facings as ground truth before any production scoring runs.
Recover and validate the shelf plane. Compute the homography from shelf edges, confirm shelf_plane_confidence clears min_plane_confidence, and visually verify the warped image is fronto-parallel.
Tune detection thresholds per fixture. Sweep confidence and nms_iou, then lock the values that hold IoU above 0.50 across all tiers into the versioned config.
Calibrate SKU resolution. Set embedding_match_min and the ambiguous band against the confusion matrix, adding confusable variant pairs as hard negatives.
Gate and ship. Assert the localized-SKU contract against the warehouse schema, enable per-tier IoU and confidence-drift alerting, then promote the config.

Scaling & Performance Benchmarks Jump to heading

The numbers worth designing against are operational. Hold P95 localization latency — validated envelope in to localized-SKU record out — under 900 milliseconds for a standard gondola frame, with detection on the GPU and the normalize-plus-resolve passes on an asynchronous CPU worker pool so the GPU is never blocked on RANSAC or OCR. A single mid-range GPU worker running the standard detector at batch 16 sustains roughly 120–180 frames per second; the embedding match adds negligible latency against a warm catalog index, while the OCR/barcode fallback is 5–10x slower per crop, so the design goal is to keep the share of detections falling into the ambiguous band below 15% — that ratio, not raw detector speed, is what governs tail latency under load.

Concurrency is bounded by the broker ahead of this stage, not by the workers: the broker-side batching and backpressure tuning that keeps tail latency flat through a morning store-walk burst is detailed in async image batching for high-volume stores. Cost optimization comes from three places — quantizing both the detector and the embedding model to ONNX Runtime or TensorRT to raise per-GPU batch density without measurable accuracy loss; caching the catalog embedding index in GPU memory so similarity search never round-trips to disk; and pinning a per-fixture confidence floor so light fixtures route to a cheaper detector tier. When a model checkpoint changes, tag every subsequent record with a pipeline version identifier so a shift in compliance metrics can be traced to an algorithm change rather than a real shelf condition.

Frequently Asked Questions Jump to heading

Why normalize coordinates here instead of letting the compliance scorer do it? Because normalization needs the homography and the shelf-edge geometry that only exist at this stage, and doing it once keeps every downstream consumer free of pixel-to-inch conversion errors. If the scorer re-derived coordinates from raw pixels it would need the camera calibration too, duplicating logic and inviting drift. The contract is that boxes leave this stage already snapped to a normalized shelf plane with a shelf_tier, so position validation reads bbox_norm directly.

What IoU threshold should I use for Non-Maximum Suppression on dense shelves? Start at 0.50 for standard density and tighten to 0.55 only when facings are tightly packed and a single product is generating multiple overlapping boxes. If lower tiers show IoU drift against ground truth, loosen to 0.45 rather than raising it — aggressive merging at a high threshold is what fuses two adjacent facings into one and undercounts the shelf.

How does the pipeline decide between an embedding match and a barcode scan? Cosine similarity against the catalog index decides first: at or above 0.82 the SKU is assigned directly from the visual embedding. Between 0.65 and 0.82 the detection is ambiguous, so the pass falls through to region OCR or barcode decoding and verifies against the master item database. Anything below 0.65, or a detection that violates a brand-block adjacency rule, is flagged for review rather than force-matched.

What happens when the homography can’t be recovered? The frame is dead-lettered, not scored. The min_plane_confidence gate (default 0.70, the RANSAC inlier ratio) blocks any capture whose shelf plane is unstable, because a bad warp produces coordinates that look plausible but place boxes on the wrong tier. Routing those frames to the dead-letter path preserves them for re-capture or manual review instead of poisoning the compliance record.

How do I stop flavor variants from being assigned the wrong SKU? Add the confusable variant pair to training as explicit hard negatives so the embedding model learns the fine-grained packaging difference, and raise embedding_match_min for that category so borderline matches drop into the barcode-verification band. The decoded barcode is authoritative for variants whose visual difference is genuinely below the model’s resolving power.

Reducing False Positives in SKU Bounding Boxes — suppressing phantom detections from price tags, glare, and promotional clutter
Vision Model Routing for Shelf Detection — how the detector variant feeding this stage is chosen per fixture
Async Image Batching for High-Volume Stores — the broker batching and backpressure that keep localization latency flat
Position Validation Algorithms for Planograms — the downstream consumer of this stage’s normalized coordinates
Image Parsing & Computer Vision Workflows — the workflow section this component belongs to

Bounding Box Extraction & SKU Localization for Retail Shelf Analytics

Concept & Data Contract Jump to heading#

Implementation Architecture Jump to heading#

Production Configuration & Tuning Jump to heading#

Failure Modes & Debugging Workflow Jump to heading#

Scaling & Performance Benchmarks Jump to heading#

Frequently Asked Questions Jump to heading#

Related Jump to heading#