Threshold Tuning for Compliance Accuracy

Within the Planogram Sync & SKU Mapping Strategies section, threshold tuning is the calibration layer that decides whether a shelf analytics platform produces compliance scores a category manager will act on, or noise an auditor learns to ignore. Every upstream stage emits continuous, probabilistic signal — classification logits, bounding-box overlap, positional offsets, facing tallies — and none of it becomes an operational decision until a threshold collapses it into a binary pass or fail. Set those cut-points wrong and the platform either floods stores with phantom violations that trigger costly manual audits, or silently misses the out-of-stocks and misplaced facings that erode category revenue. This page defines the data contract threshold tuning consumes and produces, the cost-weighted optimization that drives cut-point selection, the configuration surface category managers tune without a deploy, the failure modes you will hit when distributions drift, and the throughput you need to recalibrate a chain on a rolling window.

Concept & Data Contract Jump to heading

Threshold tuning sits between the scoring engines and compliance reporting, so it has two firm boundaries. At the inbound boundary it consumes calibrated per-dimension scores plus the ground-truth audit feedback that anchors them: for every evaluated slot it reads a classification confidence, an Intersection-over-Union (IoU) overlap against the matched detection, a positional offset from the expected slot coordinate, and a facing tally — the same signals produced by the spatial assignment in Position Validation Algorithms for Planograms and the counting stage in Automating Facings vs Actuals Validation. Alongside the scores it reads a per-SKU cost matrix: the dollar cost of a false positive (a needless audit) and a false negative (a missed revenue-losing gap), scoped by velocity and margin tier. At the outbound boundary it produces a threshold-set record: a versioned, per-dimension, per-tier set of cut-points with the validation metrics that justify them, so any compliance score can be traced months later to the exact operating point that produced it.

Compliance decisions operate across four interdependent confidence dimensions, and treating them as a single scalar threshold guarantees suboptimal performance. They must be calibrated independently because optimizing one in isolation degrades the others:

Classification confidence — whether a detected box corresponds to the target SKU. A high cut-point suppresses false positives but drops partially occluded, damaged, or rotated packaging.
IoU overlap — spatial agreement between the predicted and ground-truth box. A loose value merges adjacent products and inflates facings; an overly strict value fragments one unit into several false detections.
Positional tolerance — horizontal and vertical offset against the planogram grid. A rigid band flags ordinary camera parallax, lens distortion, or shelf sag as a violation.
Facing aggregation — the cut-point at which overlapping detections collapse into discrete shelf units, which must respect physical dividers rather than apply a fixed pixel gap.

A threshold-set record looks like this:

{
  "planogram_id": "PG-2026-GROCERY-A14",
  "registry_revision": 184,
  "calibrated_at": "2026-06-28T06:00:00Z",
  "calibration_method": "isotonic",
  "validation_window_days": 90,
  "dimensions": {
    "classification": { "high": 0.62, "medium": 0.55, "low": 0.48 },
    "iou": { "high": 0.55, "medium": 0.50, "low": 0.45 },
    "position_tolerance_mm": { "high": 5.0, "medium": 10.0, "low": 15.0 },
    "facing_merge_gap_mm": 18.0
  },
  "validation_metrics": {
    "expected_cost_per_1000_slots": 412.5,
    "precision": 0.94,
    "recall": 0.91,
    "reliability_brier": 0.041
  }
}

The record references the registry_revision and calibration_method it was derived under, so a planogram reset or a recalibration can never retroactively rewrite the operating point that scored yesterday’s audit. The load-bearing values are the per-tier cut-points, and the entire architecture below exists to make them defensible against a dollar cost rather than chosen by an arbitrary 0.5 default.

Implementation Architecture Jump to heading

The naive implementation — apply a fixed 0.5 cut-point to raw model output — fails immediately, because modern vision architectures emit uncalibrated logits or softmax scores that systematically overstate confidence on clean, well-lit samples and understate it on edge cases. Calibration must precede threshold selection. A correct stage decomposes into two steps: align raw scores to empirical likelihoods, then search for the cost-minimizing operating point under a recall floor.

Step one — probability calibration. Platt scaling (a logistic fit on model outputs) and isotonic regression (a non-parametric monotonic mapping) both transform raw scores into true likelihoods, but they require a validation set that mirrors production: mixed lighting, varying shelf depths, promotional overlays, and partial occlusions. Isotonic is preferred when you have enough audit volume — typically 1000+ labeled slots per SKU tier — because it makes no shape assumption; Platt is the safer choice on sparse tiers where isotonic overfits. The detections feeding this set arrive from the localization stage detailed in Bounding Box Extraction & SKU Localization, so any drift there propagates directly into the calibration curve.

import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


def calibrate_scores(
    raw_scores: np.ndarray,
    labels: np.ndarray,
    method: str = "isotonic",
    min_samples: int = 1000,
) -> np.ndarray:
    """Map raw model scores to empirical likelihoods before thresholding.

    Falls back from isotonic to Platt scaling on sparse tiers where the
    non-parametric fit would overfit a thin validation slice.
    """
    if raw_scores.shape != labels.shape:
        raise ValueError("raw_scores and labels must align 1:1")
    if raw_scores.size == 0:
        raise ValueError("empty validation slice; cannot calibrate")

    if method == "isotonic" and raw_scores.size >= min_samples:
        model = IsotonicRegression(out_of_bounds="clip")
        return model.fit(raw_scores, labels).predict(raw_scores)

    # Platt scaling: logistic regression on a single feature.
    platt = LogisticRegression()
    platt.fit(raw_scores.reshape(-1, 1), labels)
    return platt.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

Step two — cost-weighted operating-point search. Retail compliance never treats all errors equally, so the cut-point is chosen at the intersection of the precision-recall curve and an operational cost function, not at peak F1. A false violation on a promotional endcap triggers a $45 manual audit and possible vendor chargeback; a missed gap on a top-tier beverage costs $120+ in daily lost revenue. The optimizer maps each error type to a dollar value, computes expected cost per candidate threshold, and selects the minimum-cost point subject to a recall floor for high-velocity SKUs:

from sklearn.metrics import precision_recall_curve


def optimize_threshold(
    y_true: np.ndarray,
    y_scores: np.ndarray,
    cost_fp: float,
    cost_fn: float,
    min_recall: float = 0.85,
) -> tuple[float, float]:
    """Cost-minimizing cut-point for one calibrated compliance dimension.

    Returns (threshold, expected_cost). Enforces a recall floor so a
    cheap-audit SKU cannot be tuned into missing genuine out-of-stocks.
    """
    if not 0.0 <= min_recall <= 1.0:
        raise ValueError("min_recall must be in [0, 1]")

    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    thresholds = np.append(thresholds, 1.0)  # boundary candidate

    fp = np.sum((y_scores >= thresholds[:, None]) & (y_true == 0), axis=1)
    fn = np.sum((y_scores < thresholds[:, None]) & (y_true == 1), axis=1)
    total_cost = fp * cost_fp + fn * cost_fn

    valid = recall >= min_recall
    if not valid.any():
        valid = recall >= recall.min()  # degrade to best achievable recall

    idx = np.argmin(total_cost[valid])
    return float(thresholds[valid][idx]), float(total_cost[valid][idx])

For spatial dimensions, y_scores is replaced with the IoU overlap or the inverse positional offset, and the cost function is reshaped to penalize merged facings or fragmented detections. The facing-merge gap is not optimized here — it is a physical fixture constant shared with the counting stage. Running each dimension through the same cost-weighted search, rather than hand-picking a global cut-point, is what keeps a chain-wide score aligned with the actual economics of each error.

Production Configuration & Tuning Jump to heading

The cost matrix and recall floors are what make this stage agree with a human auditor, and they belong in versioned configuration so a category manager can retune them against ground truth without a deploy. A typical configuration scopes cost by velocity tier and overrides per category:

threshold_tuning:
  calibration:
    method: isotonic              # falls back to platt below min_samples
    min_samples_per_tier: 1000
    validation_window_days: 90    # rolling slice; mirrors production conditions
    recalibration_cron: "0 6 * * 1"  # weekly, Monday 06:00
  recall_floor:
    high: 0.92                    # never miss a fast-moving out-of-stock
    medium: 0.85
    low: 0.80
  cost_matrix_usd:                # per error, drives the operating point
    high:   { false_positive: 45.0, false_negative: 120.0 }
    medium: { false_positive: 45.0, false_negative: 60.0 }
    low:    { false_positive: 45.0, false_negative: 30.0 }
  category_overrides:
    tobacco:  { recall_floor: 0.99 }   # regulated; tolerate near-zero misses
    seasonal: { recall_floor: 0.75 }   # transient displays, wider band
  drift_tolerance:
    brier_max: 0.060              # trigger recalibration above this
    expected_cost_delta_pct: 15   # or if cost worsens >15% vs baseline

The recall_floor per tier is the single most consequential setting: it is the guardrail that stops the optimizer from tuning a low-audit-cost SKU into systematically missing genuine gaps to shave audit spend. The cost_matrix_usd asymmetry — a false_negative of 120.0 against a false_positive of 45.0 on high-velocity items — is what pulls the classification cut-point below the naive 0.5 for hero products, favoring recall where a stockout is expensive. Promotional zones are excluded from these bands entirely: endcaps and secondary displays carry campaign-specific cost economics and are routed through Promotional Display Alignment Checks before their results merge into reporting, so a time-bound display never skews the baseline operating point. Wrap the threshold logic in a configuration service that pulls these parameters from the retail data warehouse so cut-points update on the recalibration_cron cadence, never on a code release.

Failure Modes & Debugging Workflow Jump to heading

When compliance scores drift away from auditor reality, the root cause is almost always one of five recurring problems. Work them in this order before touching production cut-points:

Isolate the failing dimension. Symptom: violation volume spikes but the cause is ambiguous. Log per-dimension confidence distributions alongside ground-truth annotations and check which of classification, IoU, positional tolerance, or facing aggregation diverged. Reproduce by replaying a labeled capture batch through each dimension independently. Fix: tune only the dimension whose distribution shifted, never the global score.
Stale or mis-shaped calibration. Symptom: the reliability diagram deviates from the diagonal — confidence reads 0.9 while empirical accuracy is 0.7. Reproduce by plotting the Brier score on a fresh validation slice. Fix: recalibrate with calibrate_scores on current data before adjusting any threshold; a thresholding change on top of broken calibration only masks the problem.
Environmental covariate drift. Symptom: cut-points that held for months degrade after a camera firmware update, a seasonal lighting change, or a new glossy shelf liner. Reproduce by correlating the drift timestamp against the device and store-config changelog. Fix: trigger recalibration scoped to the affected stores rather than retuning the chain; parallax-induced positional error frequently masquerades as a classification failure here.
Corrupt ground-truth set. Symptom: validation metrics look excellent but field auditors still disagree. Root cause is inconsistent manual labeling or an outdated planogram reference contaminating the validation slice. Fix: confirm the IoU ground-truth boxes are tightly fitted and the annotation guidelines match the current shelf standard; resolve identifier drift at the SKU layer per Planogram Sync & SKU Mapping Strategies before trusting any sweep.
Cascading misidentification into facing counts. Symptom: one slot over-faces while its neighbor under-faces by the same amount, traceable to a borderline classification leaking into the wrong slot. Reproduce under overhead glare on visually similar variants. Fix: route the borderline detections back through the multi-frame consensus and embedding-versus-barcode fallback described in Error Handling in Computer Vision Pipelines rather than loosening the facing-merge gap.

Before promoting any new threshold set, run a shadow-mode pass: deploy the candidate cut-points in parallel to production without firing alerts, compare shadow output against live compliance flags for 7–14 days, and only promote once the expected-cost delta is stable. Keep a labeled error repository categorized by these five causes; the dominant bucket each quarter tells you whether to recalibrate, retune the cost matrix, or fix a fixture.

Scaling & Performance Benchmarks Jump to heading

Threshold tuning splits cleanly into a cheap hot path and an expensive cold path, and they scale differently. Applying a calibrated cut-point at scoring time is array arithmetic — a single CPU worker classifies a 40-slot fixture in well under 5 ms, so the operating point is never the bottleneck on the live compliance path. The cold path is recalibration: fitting isotonic or Platt models and running the cost-weighted sweep across every SKU tier over a rolling 90-day window. That job is batch, not interactive, and is best scheduled off-peak on the recalibration_cron so it never competes with morning-reset inference traffic.

Size the recalibration job against audit volume rather than store count: the sweep cost grows with the number of labeled validation slots, and a chain emitting roughly 2M labeled slots per quarter recalibrates all tiers in a few minutes on a single multi-core worker when the precision-recall curves are computed vectorized rather than looped per threshold. Partition the recalibration queue by store_cluster so a regional lighting or fixture change only re-fits the affected segment, and watch the Brier score and expected-cost delta as the scaling signals — a breach of brier_max or the expected_cost_delta_pct ceiling is what should trigger an out-of-band recalibration between scheduled runs. Cost optimization follows the same edge discipline as the rest of the platform: apply cut-points on the store-level server that already hosts inference and sync only the typed threshold-set record upstream, never raw scores, which keeps the tuning tier a rounding error on the analytics bill while still emitting a fully versioned, auditable operating point across thousands of locations.

Frequently Asked Questions Jump to heading

Why calibrate scores before choosing a threshold instead of just sweeping cut-points on the raw output? Because raw logits and softmax scores are not probabilities — they systematically overstate confidence on clean samples and understate it on edge cases, so a sweep on raw output optimizes against a distorted signal. Calibration with isotonic or Platt scaling maps each score to its empirical likelihood, which means the threshold you pick actually corresponds to the accuracy you think it does. Sweeping first and calibrating never works; the order is fixed.

Why use a per-SKU cost matrix instead of maximizing F1 or accuracy? Because retail errors are not symmetric in dollars. A false violation on an endcap costs a $45 audit, while a missed out-of-stock on a hero SKU costs $120+ in daily lost sales, so the cut-point that maximizes F1 is rarely the cut-point that minimizes cost. Selecting the operating point at the intersection of the precision-recall curve and the cost function is what aligns the score with category economics rather than a statistical abstraction.

How do classification, IoU, and positional thresholds interact — can I tune one global value? No. The four dimensions are interdependent and a single scalar guarantees a poor result: tightening classification to kill false positives drops occluded units, while loosening IoU to recover them merges adjacent products into inflated facings. Each dimension is calibrated and cost-optimized independently against its own error economics, then composed, so a gain on one is not silently paid for by a regression on another.

When should I pick isotonic regression over Platt scaling? Use isotonic when you have enough labeled audit volume per tier — roughly 1000+ slots — because it makes no assumption about the shape of the miscalibration and corrects arbitrary curves. Below that, isotonic overfits a thin slice and Platt scaling, which fits a single logistic curve, generalizes better. The configuration falls back automatically below min_samples_per_tier, so sparse categories stay stable.

How often should thresholds be recalibrated in production? On a rolling schedule plus an event trigger. A weekly fit over a 90-day window absorbs gradual drift from packaging updates and sensor aging, while a breach of the Brier or expected-cost drift tolerance forces an out-of-band recalibration after a firmware push, a seasonal reset, or a lighting change. Static cut-points degrade fast in dynamic retail; treat the operating point as a versioned, continuously re-derived artifact, not a deployment constant.

Position Validation Algorithms for Planograms — the spatial scoring that produces the IoU and positional signals this stage calibrates
Automating Facings vs Actuals Validation — consumes the tolerance bands reconciled here to score facing variance
Promotional Display Alignment Checks — campaign-specific cut-points kept separate from the baseline operating point
Error Handling in Computer Vision Pipelines — the consensus and fallback path for borderline detections near a threshold
Planogram Sync & SKU Mapping Strategies — the parent layer that binds detections to authoritative catalog identifiers

Threshold Tuning for Compliance Accuracy

Concept & Data Contract Jump to heading#

Implementation Architecture Jump to heading#

Production Configuration & Tuning Jump to heading#

Failure Modes & Debugging Workflow Jump to heading#

Scaling & Performance Benchmarks Jump to heading#

Frequently Asked Questions Jump to heading#

Related Jump to heading#