Fallback Routing for Offline Store Scenarios

Within the Core Architecture for Shelf Analytics platform, fallback routing is the component that keeps planogram compliance running when the network does not. Retail sites operate under genuinely hostile connectivity: saturated guest Wi-Fi, cellular dead zones behind freezer aisles, and regional ISP brownouts that arrive mid-trading-day. When a store router drops or a cloud inference endpoint degrades, the analytics pipeline cannot simply halt: a category manager still expects the morning compliance sweep, and a stockout on a high-velocity endcap is still costing sales whether or not the uplink is healthy. This page specifies the routing layer that detects connectivity loss, redirects capture traffic to local processing, and reconciles deferred results deterministically on reconnect. It is the offline-first state machine that the broader architecture’s resilience section depends on.

The job of this layer is to make degraded behavior deterministic. A naive pipeline blocks on an HTTP POST and stalls the camera; a slightly-less-naive one retries forever and exhausts edge storage. Neither preserves the property that matters: every captured frame must be accounted for — processed, buffered, or quarantined — and every locally generated compliance verdict must survive reconnection without duplication. Fallback routing sits at the intersection of connectivity probing, disk-backed queueing, edge inference, and idempotent synchronization, and it owns the contract that lets the rest of the system trust offline-origin data once it finally arrives.

Concept & Data Contract

The routing layer consumes two inputs and produces one of three outcomes. The inputs are a capture envelope — the signed metadata wrapper every store photo arrives in, identical to the one validated by Retail Data Ingestion Pipelines for Store Photos — and a connectivity signal sampled from a background health probe. The outcome is a routing decision: dispatch the payload to the cloud ingestion endpoint (CLOUD_ACTIVE), buffer it to a disk-backed queue and run local inference (EDGE_FALLBACK), or shed low-priority captures when storage is under pressure (THROTTLE_CAPTURE). The decision is a pure function of connectivity state, payload priority, and current buffer depth, which is what makes the layer auditable: any routing choice can be replayed from the three values that produced it.

The state itself is small and explicit: the three routing outcomes plus RECONCILIATION, the drain phase the router enters after reconnect. Modeling it as a typed enum rather than a tangle of booleans prevents the classic offline-pipeline bug where the system thinks it is online for uploads but offline for inference:

from __future__ import annotations

from datetime import datetime, timezone
from enum import Enum

from pydantic import BaseModel, Field, field_validator


class RouteState(str, Enum):
    CLOUD_ACTIVE = "CLOUD_ACTIVE"
    EDGE_FALLBACK = "EDGE_FALLBACK"
    THROTTLE_CAPTURE = "THROTTLE_CAPTURE"
    RECONCILIATION = "RECONCILIATION"


class Priority(int, Enum):
    AMBIENT = 0        # heatmaps, secondary facings — droppable under pressure
    STANDARD = 1       # routine audit-window captures
    COMPLIANCE = 2     # endcap/promo/safety violations — never dropped


class RoutedCapture(BaseModel):
    capture_id: str = Field(..., min_length=8)
    store_id: str = Field(..., pattern=r"^[A-Z]{3}\d{4}$")  # e.g. STR0421
    fixture_id: str
    object_key: str                       # pointer into local staging storage
    payload_sha256: str = Field(..., min_length=64, max_length=64)
    priority: Priority = Priority.STANDARD
    capture_timestamp: datetime
    origin_state: RouteState = RouteState.CLOUD_ACTIVE

    @field_validator("capture_timestamp")
    @classmethod
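    def normalize_utc(cls, v: datetime) -> datetime:
        # Clock-skewed gateways are a top source of reconciliation conflicts.
        if v.tzinfo is None:
            # astimezone() would read a naive value as gateway-local time.
            # Treating it as UTC is an assumption; rejecting it is the stricter option.
            v = v.replace(tzinfo=timezone.utc)
        return v.astimezone(timezone.utc)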

When a payload is processed at the edge, the routing layer emits a deferred-sync record — the offline equivalent of the compliance payload the cloud would have produced. It carries the same canonical fields the rest of the platform agrees on (planogram_id, fixture_id, compliance_percentage, out_of_stock_flags, misplaced_sku_list, capture_timestamp), plus the provenance the reconciliation step needs to deduplicate and to flag confidence gaps:

{
  "capture_id": "STR0421-A17-20260628T0714Z-0007",
  "planogram_id": "PLN-2026-Q2-BEV-014",
  "fixture_id": "STR0421-A17",
  "compliance_percentage": 88.6,
  "out_of_stock_flags": ["bev_cola_500ml"],
  "misplaced_sku_list": [],
  "capture_timestamp": "2026-06-28T07:14:22Z",
  "inference_origin": "edge",
  "model_hash": "onnx-int8-yolo-v8n-7c3a91",
  "edge_confidence": 0.81,
  "sequence_id": 4471
}

The inference_origin, model_hash, and edge_confidence fields are not optional bookkeeping — they are the contract that lets the central data lake decide whether an edge verdict can stand or must defer to a later cloud rescore, and they keep the provenance of every offline-generated compliance record auditable.
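
As a rough illustration of how a consumer might enforce this contract, the sketch below checks a deferred-sync record for the three provenance fields named above before accepting it. The helper name missing_provenance and the decision to treat empty strings as missing are assumptions made for the example, not part of the platform.

```python
import json

# Provenance fields the text above describes as mandatory on edge-origin records.
REQUIRED_PROVENANCE = ("inference_origin", "model_hash", "edge_confidence")


def missing_provenance(record: dict) -> list[str]:
    """Return the provenance fields that are absent, null, or empty (hypothetical helper)."""
    return [
        field
        for field in REQUIRED_PROVENANCE
        if record.get(field) is None or record.get(field) == ""
    ]


# A trimmed copy of the example record shown earlier on this page.
record = json.loads(
    '{"capture_id": "STR0421-A17-20260628T0714Z-0007", '
    '"inference_origin": "edge", '
    '"model_hash": "onnx-int8-yolo-v8n-7c3a91", '
    '"edge_confidence": 0.81}'
)
problems = missing_provenance(record)  # empty list when the contract is met
```

A non-empty result would be a reason to quarantine the record rather than merge it, since the data lake could no longer tell whether the verdict may stand.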

Implementation Architecture

The routing decision is driven by a finite state machine evaluated on every capture and on every connectivity probe. Connectivity is sampled asynchronously so that probing never blocks capture: a HEAD request against the ingestion endpoint measures round-trip time and confirms the TLS chain, and a sliding window of recent probe outcomes — not a single ping — decides the transition. A basic ICMP ping is useless here, because store routers route through captive portals and exhibit high jitter that mimics partial connectivity; the multi-signal probe is detailed in the child page on handling network outages in store-level analytics.

The router below uses asyncio for the non-blocking probe loop and a SQLite WAL-mode queue for crash-safe persistence — SQLite is chosen over an in-memory broker precisely because the failure mode being defended against (power cycles, unclean shutdowns) is the one that loses in-memory state:

import asyncio
import logging
import sqlite3
from collections import deque
from contextlib import closing
from pathlib import Path

import httpx

logger = logging.getLogger("fallback_router")


class FallbackRouter:
    def __init__(
        self,
        endpoint: str,
        db_path: str = "/var/lib/retail-edge/route_queue.db",
        rtt_threshold_ms: float = 800.0,
        window: int = 3,
    ) -> None:
        self.endpoint = endpoint
        self.db_path = db_path
        self.rtt_threshold_ms = rtt_threshold_ms
        self.state = RouteState.CLOUD_ACTIVE
        self._probe_history: deque[bool] = deque(maxlen=window)
        self._client = httpx.AsyncClient(timeout=2.0, verify=True)
        self._init_db()

    def _init_db(self) -> None:
        Path(self.db_path).parent.mkdir(parents=True, exist_ok=True)
        with closing(sqlite3.connect(self.db_path)) as conn:
            conn.execute("PRAGMA journal_mode=WAL;")
            conn.execute(
                """
                CREATE TABLE IF NOT EXISTS deferred_sync (
                    capture_id   TEXT PRIMARY KEY,
                    priority     INTEGER NOT NULL,
                    payload      TEXT NOT NULL,
                    sync_status  TEXT NOT NULL DEFAULT 'PENDING',
                    retry_count  INTEGER NOT NULL DEFAULT 0,
                    sequence_id  INTEGER
                )
                """
            )
            conn.commit()

    async def probe(self) -> bool:
        loop = asyncio.get_running_loop()
        start = loop.time()
        try:
            resp = await self._client.head(self.endpoint)
            rtt_ms = (loop.time() - start) * 1000
            healthy = resp.status_code < 500 and rtt_ms <= self.rtt_threshold_ms
        except (httpx.RequestError, httpx.TimeoutException):
            healthy = False
        self._probe_history.append(healthy)
        self._evaluate_transition()
        return healthy

    def _evaluate_transition(self) -> None:
        window_full = len(self._probe_history) == self._probe_history.maxlen
        all_down = window_full and not any(self._probe_history)
        all_up = window_full and all(self._probe_history)
        if self.state == RouteState.CLOUD_ACTIVE and all_down:
            self._set_state(RouteState.EDGE_FALLBACK)
        elif self.state == RouteState.EDGE_FALLBACK and all_up:
            self._set_state(RouteState.RECONCILIATION)

    def _set_state(self, target: RouteState) -> None:
        if target != self.state:
            logger.info("STATE_CHANGE %s -> %s", self.state.value, target.value)
            self.state = target

    def route(self, capture: "RoutedCapture") -> RouteState:
        """Pure routing decision for a single capture."""
        if self.state == RouteState.CLOUD_ACTIVE:
            return RouteState.CLOUD_ACTIVE
        if self.state == RouteState.THROTTLE_CAPTURE and capture.priority < Priority.COMPLIANCE:
            return RouteState.THROTTLE_CAPTURE
        return RouteState.EDGE_FALLBACK
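
The class above exposes probe() but does not show what schedules it. One way to drive it without blocking capture is a background task along the following lines. This is a minimal sketch: run_probe_loop and the stop event are hypothetical names, and the probe is passed in as a plain async callable so the loop does not depend on the router class.

```python
import asyncio
from typing import Awaitable, Callable


async def run_probe_loop(
    probe: Callable[[], Awaitable[bool]],
    interval_s: float,
    stop: asyncio.Event,
) -> int:
    """Call `probe` every `interval_s` seconds until `stop` is set; return the probe count."""
    count = 0
    while not stop.is_set():
        await probe()
        count += 1
        try:
            # Sleep for one interval, but wake immediately if shutdown is requested.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass  # interval elapsed without a stop request: probe again
    return count
```

With the router above, this could be started as asyncio.create_task(run_probe_loop(router.probe, RouterConfig.PROBE_INTERVAL_S, stop)), leaving the capture path free to call route() at any time.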

Once a capture is routed to EDGE_FALLBACK, it passes through a quantized vision model running locally, typically an INT8 or FP16 ONNX graph that holds sub-250ms inference on an ARM gateway. The edge model trades marginal accuracy for operational continuity, and the resulting deferred-sync record is written to the same WAL queue keyed on capture_id. Because capture_id is the table's primary key and is stable for a given frame, re-enqueueing the same frame after a crash is a no-op rather than a duplicate, provided the writer ignores key collisions instead of failing on them. That is the property the reconciliation step relies on.
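
The enqueue step itself is not shown on this page. A minimal sketch of an idempotent writer, assuming the deferred_sync schema created in _init_db, could use SQLite's INSERT OR IGNORE so that a primary-key collision is silently skipped. The function name enqueue and its return convention are illustrative.

```python
import json
import sqlite3


def enqueue(conn: sqlite3.Connection, record: dict, priority: int) -> bool:
    """Insert a deferred-sync record; return False if its capture_id was already queued."""
    cur = conn.execute(
        # INSERT OR IGNORE turns a PRIMARY KEY collision into a no-op.
        "INSERT OR IGNORE INTO deferred_sync (capture_id, priority, payload, sequence_id) "
        "VALUES (?, ?, ?, ?)",
        (record["capture_id"], priority, json.dumps(record), record.get("sequence_id")),
    )
    conn.commit()
    return cur.rowcount == 1


# Demo against an in-memory database with the same columns as the router's table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deferred_sync ("
    "capture_id TEXT PRIMARY KEY, priority INTEGER NOT NULL, payload TEXT NOT NULL, "
    "sync_status TEXT NOT NULL DEFAULT 'PENDING', retry_count INTEGER NOT NULL DEFAULT 0, "
    "sequence_id INTEGER)"
)
record = {"capture_id": "STR0421-A17-20260628T0714Z-0007", "sequence_id": 4471}
first = enqueue(conn, record, priority=2)
second = enqueue(conn, record, priority=2)  # simulated replay after a crash
queued = conn.execute("SELECT COUNT(*) FROM deferred_sync").fetchone()[0]
```

The second call leaves the table unchanged, which is what lets a gateway replay its capture log after an unclean shutdown without double-counting.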

Production Configuration & Tuning

Every threshold in the router is environment-driven, because the right values for a copper-DSL store and a 5G-backhauled flagship are not the same. A representative configuration block, surfaced through environment variables so the same image deploys fleet-wide:

import os


class RouterConfig:
    ENDPOINT = os.environ["INGEST_ENDPOINT"]
    # A probe slower than this round-trip time counts as down.
    RTT_THRESHOLD_MS = float(os.getenv("ROUTE_RTT_THRESHOLD_MS", "800"))
    PROBE_INTERVAL_S = float(os.getenv("ROUTE_PROBE_INTERVAL_S", "5"))
    # Transition CLOUD_ACTIVE -> EDGE_FALLBACK once this many consecutive probes are all down.
    PROBE_WINDOW = int(os.getenv("ROUTE_PROBE_WINDOW", "3"))
    # Disk pressure: enter THROTTLE_CAPTURE above HIGH, leave below LOW (hysteresis).
    DISK_HIGH_WATER = float(os.getenv("ROUTE_DISK_HIGH_WATER", "0.80"))
    DISK_LOW_WATER = float(os.getenv("ROUTE_DISK_LOW_WATER", "0.65"))
    # Reconciliation: cloud verdict supersedes edge only above this confidence.
    CLOUD_OVERRIDE_CONFIDENCE = float(os.getenv("ROUTE_CLOUD_OVERRIDE_CONF", "0.92"))
    SYNC_BATCH_SIZE = int(os.getenv("ROUTE_SYNC_BATCH_SIZE", "16"))

The values worth deliberate calibration are the probe window, the disk water-marks, and the reconciliation override confidence. A probe window of 3 consecutive failures at a 5s interval means the system tolerates roughly 15s of transient loss before flipping to fallback — long enough to ignore a single dropped beacon, short enough that the camera never stalls. The disk water-marks use deliberate hysteresis: entering THROTTLE_CAPTURE at 0.80 utilization but only leaving it below 0.65 prevents the router from oscillating around a single threshold during a long outage. The override confidence of 0.92 is the lever that decides conflict resolution on reconnect — set it too low and noisy cloud rescores overwrite sound edge verdicts; set it too high and genuine edge errors never get corrected. Teams already running the platform’s compliance scoring should align this value with the precision/recall point chosen in Threshold Tuning for Compliance Accuracy, so an offline verdict and an online verdict are judged on the same scale.
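
The water-mark hysteresis described above can be sketched as a small pure function. This is an illustration under stated assumptions rather than the router's actual check: next_throttled and disk_utilization are hypothetical names, the strict comparisons at the boundaries are a choice made here, and the defaults mirror DISK_HIGH_WATER and DISK_LOW_WATER.

```python
import shutil


def disk_utilization(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def next_throttled(
    throttled: bool,
    utilization: float,
    high_water: float = 0.80,
    low_water: float = 0.65,
) -> bool:
    """Return True if the router should be in THROTTLE_CAPTURE after this sample."""
    if not throttled and utilization > high_water:
        return True   # crossed the high mark: start shedding
    if throttled and utilization < low_water:
        return False  # fell below the low mark: resume normal capture
    return throttled  # inside the band: keep whatever state we were in


# Replay an illustrative utilization trace through the function.
trace = [0.70, 0.82, 0.75, 0.66, 0.64, 0.70]
states = []
throttled = False
for sample in trace:
    throttled = next_throttled(throttled, sample)
    states.append(throttled)
```

In this trace the samples at 0.75 and 0.66 keep the router throttled even though they are below the high mark, which is the oscillation guard the two-threshold design is meant to provide.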

When the link returns, the router enters RECONCILIATION and drains the queue idempotently. Uploads are batched by SYNC_BATCH_SIZE and ordered by priority so COMPLIANCE violations transmit before ambient telemetry, and each record carries a monotonic sequence_id and its capture_id so the ingestion endpoint can dedupe against recent windows:

import json


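# Method of FallbackRouter, shown dedented for brevity. The _pending_count()
# helper it calls is not part of this listing.
async def drain(self) -> None: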
    if self.state != RouteState.RECONCILIATION:
        return
    with closing(sqlite3.connect(self.db_path)) as conn:
        rows = conn.execute(
            """
            SELECT capture_id, payload FROM deferred_sync
            WHERE sync_status = 'PENDING'
            ORDER BY priority DESC, sequence_id ASC
            LIMIT ?
            """,
            (RouterConfig.SYNC_BATCH_SIZE,),
        ).fetchall()
        for capture_id, payload in rows:
            try:
                resp = await self._client.post(
                    f"{self.endpoint}/sync",
                    json=json.loads(payload),
                    timeout=10.0,
                )
                resp.raise_for_status()
                conn.execute(
                    "UPDATE deferred_sync SET sync_status='SYNCED' WHERE capture_id=?",
                    (capture_id,),
                )
            except httpx.HTTPStatusError as exc:
                if exc.response.status_code == 409:  # server already has it
                    conn.execute(
                        "UPDATE deferred_sync SET sync_status='SYNCED' WHERE capture_id=?",
                        (capture_id,),
                    )
                else:
                    conn.execute(
                        "UPDATE deferred_sync SET retry_count=retry_count+1 WHERE capture_id=?",
                        (capture_id,),
                    )
            except httpx.RequestError:
                logger.warning("sync transport error for %s; will retry", capture_id)
        conn.commit()
    if not self._pending_count():
        self._set_state(RouteState.CLOUD_ACTIVE)

A 409 Conflict is treated as success, not failure — it means the server already holds that capture_id, so the local record can be purged safely. This is the difference between a reconciliation that converges and one that loops forever re-sending already-accepted data.
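
The drain loop calls a pending-count helper and the text refers to purging acknowledged records, but neither is shown. A minimal sketch of both, written as standalone functions over the same deferred_sync table, might look like this. The names pending_count and purge_synced are hypothetical, and the demo table carries only the columns these helpers touch.

```python
import sqlite3


def pending_count(conn: sqlite3.Connection) -> int:
    """Number of records still waiting to be uploaded."""
    row = conn.execute(
        "SELECT COUNT(*) FROM deferred_sync WHERE sync_status = 'PENDING'"
    ).fetchone()
    return row[0]


def purge_synced(conn: sqlite3.Connection) -> int:
    """Delete records the server has acknowledged; return how many were removed."""
    cur = conn.execute("DELETE FROM deferred_sync WHERE sync_status = 'SYNCED'")
    conn.commit()
    return cur.rowcount


# Demo against an in-memory table (capture ids are placeholders).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deferred_sync ("
    "capture_id TEXT PRIMARY KEY, "
    "sync_status TEXT NOT NULL DEFAULT 'PENDING')"
)
conn.executemany(
    "INSERT INTO deferred_sync (capture_id, sync_status) VALUES (?, ?)",
    [("cap-0001", "SYNCED"), ("cap-0002", "SYNCED"), ("cap-0003", "PENDING")],
)
conn.commit()
pending_before = pending_count(conn)
removed = purge_synced(conn)
remaining = conn.execute("SELECT COUNT(*) FROM deferred_sync").fetchone()[0]
```

One trade-off worth noting: once a SYNCED row is purged, the local primary key no longer blocks a re-enqueue of that capture_id, so deduplication for that record falls back to the server's 409 response.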

Failure Modes & Debugging Workflow

When compliance SLAs degrade across a network transition, work the diagnosis in this order rather than guessing:

1. Confirm the transition was network-driven, not application-driven. Grep edge gateway logs for STATE_CHANGE events and correlate each CLOUD_ACTIVE -> EDGE_FALLBACK with probe failures. If the transition fired without a corresponding window of down probes, the trigger is an application exception (a malformed envelope, a TLS-verify failure) masquerading as connectivity loss: fix the probe, not the network.
2. Audit local queue depth and disk headroom. Query SELECT COUNT(*), sync_status FROM deferred_sync GROUP BY sync_status. A growing PENDING count with the router stuck in EDGE_FALLBACK is expected during an outage; the same growth with healthy probes means the drain loop is wedged. If disk crosses DISK_HIGH_WATER and the router has not entered THROTTLE_CAPTURE, the water-mark check is misconfigured and storage exhaustion is imminent.
3. Validate edge inference latency and model identity. Profile gateway CPU/NPU utilization and confirm P99 inference holds below 250ms. A latency spike usually means the quantized weights fell back to CPU because the accelerator driver failed to load. Also verify that the model_hash in emitted records matches the deployed model, since a silent revert to an older graph shows up here too.
4. Trace reconciliation conflicts. After a drain, compare edge versus cloud verdicts for the same capture_id. A discrepancy rate above 5% between inference_origin: edge and a later cloud rescore points to model drift on the edge or a stale planogram version on the gateway, the same drift signature handled by Error Handling in Computer Vision Pipelines. Records where the cloud confidence never exceeded CLOUD_OVERRIDE_CONFIDENCE should be flagged for manual category-manager review, not silently merged.
5. Confirm provenance survived the round trip. Every offline-processed record must retain immutable metadata: capture_timestamp, store_id, fixture_id, model_hash, and the originating RouteState. If any of these are null after sync, the deferred-sync writer dropped provenance, and the compliance audit trail for that outage window is no longer defensible.

The recurring root cause behind most of these is the same: a transition that fires on the wrong signal. Probe on the actual ingestion endpoint with TLS verification, never on a generic gateway address, so a captive portal returning 200 OK to everything cannot fool the router into thinking the analytics backend is reachable.
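
For the reconciliation-conflict step in the workflow above, the override rule can be written down as a small decision function. The sketch below is an assumption-laden illustration: the Verdict type, the resolve function, and the idea that a cloud rescore reports a single confidence value are all hypothetical, and the default threshold mirrors CLOUD_OVERRIDE_CONFIDENCE.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    capture_id: str
    compliance_percentage: float
    confidence: float
    inference_origin: str  # "edge" or "cloud"


def resolve(
    edge: Verdict,
    cloud: Optional[Verdict],
    override_conf: float = 0.92,
) -> tuple[Verdict, bool]:
    """Return (winning verdict, needs_manual_review) for one capture_id."""
    if cloud is None:
        return edge, False   # no rescore yet: the edge verdict stands
    if cloud.confidence > override_conf:
        return cloud, False  # confident rescore supersedes the edge verdict
    # A rescore exists but never cleared the threshold: keep the edge verdict
    # and flag the record for review instead of merging silently.
    return edge, True


edge = Verdict("STR0421-A17-20260628T0714Z-0007", 88.6, 0.81, "edge")
```

Keeping the rule in one function makes it straightforward to replay a drained batch and count how many records landed in each of the three branches.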

Scaling & Performance Benchmarks

Fallback routing scales on three axes: probe overhead, queue throughput, and reconnect burst. The probe loop is cheap by design — one HEAD request per PROBE_INTERVAL_S per gateway adds negligible load even across a 10,000-store fleet, because probes terminate at the regional ingestion edge, not a central service. The constraint that actually bites is the reconnect thundering herd: when a regional ISP recovers, thousands of gateways enter RECONCILIATION simultaneously and try to drain at once. Bound this with a per-gateway jitter on the first drain (a randomized 0–30s delay) and a server-side concurrency cap, so restored capacity is not immediately saturated by its own backlog.
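
The per-gateway jitter on the first drain might be sketched as follows. The function names are hypothetical, the 30-second ceiling is the upper bound quoted above, and the drain is passed in as an async callable so the example stays independent of the router class. The server-side concurrency cap is a separate control and is not shown.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional


def first_drain_delay(
    max_jitter_s: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Pick a uniformly random delay between 0 and `max_jitter_s` seconds."""
    rng = rng or random.Random()
    return rng.uniform(0.0, max_jitter_s)


async def drain_with_jitter(
    drain: Callable[[], Awaitable[None]],
    max_jitter_s: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Wait a random delay, run the first drain, and return the delay used."""
    delay = first_drain_delay(max_jitter_s, rng)
    await asyncio.sleep(delay)
    await drain()
    return delay
```

Only the first drain after a transition into RECONCILIATION needs the delay; later batches are already spread out by the spacing the first one introduced.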

For queue sizing, treat local storage as the hard ceiling. A gateway buffering STANDARD captures at one frame per fixture per audit window holds a multi-hour outage comfortably, but ambient telemetry will exhaust eMMC within an hour if not shed — which is exactly why THROTTLE_CAPTURE drops Priority.AMBIENT first and preserves Priority.COMPLIANCE to the end. Target a steady-state PENDING depth near zero during CLOUD_ACTIVE, and alert when depth exceeds the count one drain batch can clear in a single PROBE_INTERVAL_S, since sustained growth past that point predicts an unbounded queue.

Latency SLAs differ between the two regimes and should be reported separately. Online, end-to-end capture-to-verdict should hold under the platform’s 30s P95 budget. Offline, the meaningful SLA is reconciliation lag (the wall-clock gap between capture and the verdict reaching the data lake), which is bounded by outage duration plus drain time, not by inference speed. Surfacing both numbers keeps an offline window from looking like a system failure when it is in fact correct degraded behavior.

Cost-wise, the layer is close to free: edge inference runs on hardware already deployed for capture, and deferring uploads to off-peak reconnect windows flattens the egress bill that synchronous streaming would spike. The broader autoscaling and circuit-breaker patterns that this routing layer plugs into, including the secondary-endpoint failover for vision inference, are worked through in Vision Model Routing for Shelf Detection, and the multi-region topology it runs inside is the subject of Designing a Scalable Shelf Analytics Architecture.

Frequently Asked Questions

How is fallback routing different from a normal HTTP retry with backoff? A retry loop assumes the request will eventually succeed and blocks (or queues unboundedly) until it does, which stalls capture and exhausts storage during a real outage. Fallback routing instead changes behavior on a connectivity state transition: it stops dispatching to the cloud, runs inference locally on a quantized model, persists results to a crash-safe queue, and only resumes uploads once a window of probes confirms recovery. The retry is one tactic inside the RECONCILIATION state, not the whole strategy.

Won’t edge inference produce worse compliance numbers than the cloud model? Marginally, yes — quantized edge models trade a few points of accuracy for sub-250ms latency and zero network dependency. The design accounts for this with provenance: every offline verdict carries inference_origin: edge and an edge_confidence, and on reconnect a cloud rescore above CLOUD_OVERRIDE_CONFIDENCE (default 0.92) supersedes it. A slightly noisier number that exists beats a perfect number that never gets captured because the camera stalled.

What stops the local queue from filling the disk during a long outage? Tiered priority plus hysteresis. Captures are tagged AMBIENT, STANDARD, or COMPLIANCE; when disk utilization crosses DISK_HIGH_WATER (0.80) the router enters THROTTLE_CAPTURE and sheds the lowest tier first, only leaving that state once utilization falls below DISK_LOW_WATER (0.65). Compliance-critical frames are never dropped, and a hard purge of synced records keeps headroom available.

How does reconnection avoid creating duplicate compliance records? Every deferred record is keyed on its capture_id, a stable per-frame identifier, and tagged with a monotonic sequence_id. The drain loop posts batches ordered by priority, and the ingestion endpoint dedupes against recent windows: a 409 Conflict response means the server already has the record, so the gateway marks it synced, purges it locally, and moves on. Re-enqueueing the same frame after a crash is a no-op rather than a double-count.

Where does fallback routing fit relative to ingestion validation? It sits immediately after the capture envelope is validated and signed by Retail Data Ingestion Pipelines for Store Photos and before the cloud vision tier. Only structurally sound, signature-verified payloads ever reach the router, so an outage never becomes an excuse to admit malformed frames into the buffer.

Related

Handling Network Outages in Store-Level Analytics — the multi-signal outage probe and reconciliation walkthrough
Retail Data Ingestion Pipelines for Store Photos — the capture envelope and validation gate that feed this router
Designing a Scalable Shelf Analytics Architecture — the multi-region topology and autoscaling this layer runs inside
Vision Model Routing for Shelf Detection — circuit breakers and secondary-endpoint failover for inference
Core Architecture for Shelf Analytics — the platform and resilience layer this component belongs to
