Task 1.5.5 — Snapshot-on-subscribe

Phase: 1.5 — Live broadcast
Status: Not started
Depends on: 1.5.3, 1.4 (Postgres pool)
Wiki refs: docs/wiki/synthesis/processor-ws-contract.md §Server response — subscribed

Goal

When a client subscribes to event:<eventId>, return the latest known position for every device registered to that event as part of the subscribed response. Without it, the SPA opens to a black map and only fills in as devices report — feels broken.

The snapshot is a one-time read at subscribe time. After that, positions stream live via the broadcast consumer (1.5.4). The two paths together give the SPA a "fully populated map immediately, then live updates" experience.

Deliverables

  • src/live/snapshot.ts exporting:
    • createSnapshotProvider(pool, logger, metrics): SnapshotProvider — factory.
    • SnapshotProvider.forEvent(eventId: string): Promise<PositionSnapshotEntry[]> — returns the latest position per device registered to the event. Empty array if no devices or no positions yet.
    • type PositionSnapshotEntry = { deviceId: string; lat: number; lon: number; ts: number; speed?: number; course?: number; accuracy?: number; attributes?: Record<string, unknown> } — same shape as the streaming position message minus the type and topic fields (the envelope wraps them).
  • src/live/registry.ts updated: the subscribe method calls snapshot.forEvent(eventId) after authorization succeeds and includes the result in the subscribed response. Authorization happens before the snapshot query so a forbidden user doesn't pay the snapshot cost.
  • New Prometheus metrics (registration sketched after this list):
    • processor_live_snapshot_query_latency_ms (histogram).
    • processor_live_snapshot_size (histogram) — number of positions in each snapshot.
  • test/live-snapshot.test.ts:
    • With three devices in an event, two of which have positions, returns two snapshot entries.
    • With an event that has no entry_devices rows, returns [].
    • With devices that have positions but faulty=true, those positions are excluded.
    • The query returns the most recent non-faulty position per device (not just the most recent overall — ORDER BY ts DESC with a WHERE faulty = false filter).
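
A possible registration of the two histograms, assuming prom-client and its default registry; the bucket boundaries and export names are assumptions, and in practice these would be bundled into the metrics object handed to createSnapshotProvider and the registry:

import { Histogram } from 'prom-client';

// Latency of the snapshot query, in milliseconds (buckets are a guess; tune against real data).
export const snapshotLatency = new Histogram({
  name: 'processor_live_snapshot_query_latency_ms',
  help: 'Latency of the snapshot-on-subscribe Postgres query in milliseconds',
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000],
});

// Number of entries returned per snapshot.
export const snapshotSize = new Histogram({
  name: 'processor_live_snapshot_size',
  help: 'Number of positions returned in each snapshot-on-subscribe response',
  buckets: [0, 1, 5, 10, 50, 100, 250, 500, 1000],
});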

Specification

The query

The snapshot needs the latest non-faulty position per device, scoped to one event. Postgres-canonical for "latest per group" is DISTINCT ON:

SELECT DISTINCT ON (p.device_id)
  p.device_id,
  p.latitude,
  p.longitude,
  p.ts,
  p.speed,
  p.course,
  p.accuracy,
  p.attributes
FROM positions p
JOIN entry_devices ed ON ed.device_id = p.device_id
JOIN entries e ON e.id = ed.entry_id
WHERE e.event_id = $1
  AND p.faulty = false
ORDER BY p.device_id, p.ts DESC;

Why DISTINCT ON

DISTINCT ON (device_id) ... ORDER BY device_id, ts DESC returns the row with the highest ts per device_id. The alternatives (GROUP BY + MAX(ts) + self-join, or window functions with ROW_NUMBER()) all produce the same result with worse query plans on a TimescaleDB hypertable. DISTINCT ON is Postgres-specific but we're committed to Postgres.
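
A minimal sketch of the provider wrapping this query, assuming node-postgres (pg). The unused logger/metrics parameters, the Number() coercions, and the column handling are illustrative, not settled API; the histograms are observed at the subscribe call site shown further down.

// src/live/snapshot.ts, reduced to the query path (sketch).
import type { Pool } from 'pg';

export type PositionSnapshotEntry = {
  deviceId: string;
  lat: number;
  lon: number;
  ts: number; // epoch ms
  speed?: number;
  course?: number;
  accuracy?: number;
  attributes?: Record<string, unknown>;
};

export interface SnapshotProvider {
  forEvent(eventId: string): Promise<PositionSnapshotEntry[]>;
}

// Same statement as above, inlined for the sketch.
const LATEST_POSITIONS_SQL = `
  SELECT DISTINCT ON (p.device_id)
    p.device_id, p.latitude, p.longitude, p.ts,
    p.speed, p.course, p.accuracy, p.attributes
  FROM positions p
  JOIN entry_devices ed ON ed.device_id = p.device_id
  JOIN entries e ON e.id = ed.entry_id
  WHERE e.event_id = $1
    AND p.faulty = false
  ORDER BY p.device_id, p.ts DESC`;

export function createSnapshotProvider(
  pool: Pool,
  _logger?: unknown,  // pino logger in the real module; unused in this sketch
  _metrics?: unknown, // histograms are observed at the subscribe call site
): SnapshotProvider {
  return {
    async forEvent(eventId) {
      const { rows } = await pool.query(LATEST_POSITIONS_SQL, [eventId]);
      return rows.map((r): PositionSnapshotEntry => ({
        deviceId: r.device_id,
        lat: Number(r.latitude),
        lon: Number(r.longitude),
        // ts may come back as a Date or a bigint string depending on the column
        // type; Number() yields epoch ms either way.
        ts: Number(r.ts),
        // Field-omission convention: never emit null for absent optional values.
        ...(r.speed != null ? { speed: Number(r.speed) } : {}),
        ...(r.course != null ? { course: Number(r.course) } : {}),
        ...(r.accuracy != null ? { accuracy: Number(r.accuracy) } : {}),
        ...(r.attributes != null ? { attributes: r.attributes } : {}),
      }));
    },
  };
}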

Performance

On a TimescaleDB hypertable, the index that makes this fast is (device_id, ts DESC). Phase 1 task 1.4 created the hypertable; verify the index exists. If not, add it as a migration in this task:

CREATE INDEX IF NOT EXISTS positions_device_ts_idx ON positions (device_id, ts DESC);

Without the index, Postgres has to scan and sort every matching positions row before it can pick the top row per device_id. With it, the scan is bounded by the chunk containing the most recent position per device — typically the latest one or two chunks.

For 500 devices in an event, the query should complete in < 50ms on a warm cache.
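
One way to make the missing-index case fail loudly, as the acceptance criteria below require, is a test that asserts the index exists before any latency assertion runs. A sketch, assuming a vitest-style runner and a DATABASE_URL pointing at the test database:

import { describe, expect, it } from 'vitest';
import { Pool } from 'pg';

describe('positions_device_ts_idx', () => {
  it('exists on the positions table', async () => {
    const pool = new Pool({ connectionString: process.env.DATABASE_URL });
    try {
      const { rows } = await pool.query(
        `SELECT indexname FROM pg_indexes
         WHERE tablename = 'positions' AND indexname = 'positions_device_ts_idx'`,
      );
      // Fail loudly if the migration was skipped, so the latency budget is not silently blown.
      expect(rows).toHaveLength(1);
    } finally {
      await pool.end();
    }
  });
});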

Faulty-filter semantics

The faulty column is set post-hoc by operators when a position is unrealistic (directus entity page §"Faulty position handling"). Any read path that surfaces position data to operators must filter faulty = false:

  • Snapshot: filter (this task).
  • Live broadcast: doesn't apply — the broadcast consumer reads from Redis, not from positions. By the time a position is in Redis (and being streamed), no one has had the chance to flag it.
  • Replay (future): filter when implemented.

Where the snapshot is wired into the registry

The subscribed response in 1.5.3 currently sends snapshot: []. Update:

// In registry.ts, inside subscribe() after authorization succeeds:
let snapshot: PositionSnapshotEntry[] = [];
const start = performance.now();
try {
  snapshot = await snapshotProvider.forEvent(parsed.eventId);
  metrics.snapshotSize.observe(snapshot.length);
} catch (err) {
  logger.warn({ err, eventId: parsed.eventId }, 'snapshot query failed');
  // Fall through with empty snapshot — better to subscribe without a snapshot
  // than to fail the subscription entirely.
} finally {
  metrics.snapshotLatency.observe(performance.now() - start);
}

sendOutbound(conn, { type: 'subscribed', topic, id: correlationId, snapshot });

The "fail open" choice on snapshot errors is deliberate: a subscribe that returns subscribed with an empty snapshot is recoverable (live updates still work; the SPA just sees a sparser-than-expected initial state). A subscribe that errors out forces the SPA to retry, which masks the underlying snapshot failure.

What the snapshot does NOT include

  • Position history. Just the latest position per device. Trail rendering on the SPA reads the previous N positions from its own ring buffer as new positions stream in. No bulk-history endpoint in v1.
  • Device metadata (model, IMEI, vehicle assignment). The SPA fetches that separately via Directus REST/SDK and joins on deviceId client-side.
  • Faulty positions. Filtered.
  • Stale positions. A position from 3 days ago is still "the latest" if the device hasn't reported since. The SPA should display "last seen N hours ago" indicators based on the ts field.

Snapshot field shape

Mirror the position streaming message exactly except for the envelope:

const SnapshotEntrySchema = z.object({
  deviceId: z.string().uuid(),
  lat: z.number(),
  lon: z.number(),
  ts: z.number(),  // epoch ms
  speed: z.number().optional(),
  course: z.number().optional(),
  accuracy: z.number().optional(),
  attributes: z.record(z.unknown()).optional(),
});

Same field-omission convention: don't emit null for absent values.
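
A side effect of using .optional() rather than .nullable() is that the convention is machine-checkable: parsing rejects null values outright. A minimal parse helper, assuming the schema above is in scope:

import { z } from 'zod';

// Reuses SnapshotEntrySchema from the block above.
const SnapshotSchema = z.array(SnapshotEntrySchema);

// A contract test (or the SPA) can enforce the omission convention just by parsing:
// an entry carrying e.g. `speed: null` fails validation, since .optional() accepts
// a missing key but not null.
export function parseSnapshot(raw: unknown): z.infer<typeof SnapshotSchema> {
  return SnapshotSchema.parse(raw);
}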

Acceptance criteria

  • pnpm typecheck, pnpm lint, pnpm test clean.
  • Manual: with the seeded Rally Albania 2026 event (3 registered devices, some rows in the positions table), subscribing returns a snapshot with the registered devices' latest positions.
  • Subscribing to an event with no positions returns subscribed with snapshot: [].
  • Manually marking a position faulty=true excludes it from the next snapshot (the snapshot returns the most recent non-faulty position for that device, or omits the device if none exists).
  • Snapshot query latency p95 < 100ms with the index in place; without the index the test should fail loudly so we don't ship without it.
  • Snapshot failure (e.g. simulated Postgres timeout) does not fail the subscription; client receives subscribed with snapshot: [] and the live stream still works.

Risks / open questions

  • Snapshot size on a large event. 500 devices × ~200 bytes per entry = ~100KB JSON payload. Tolerable. If we ever push to 5000 devices on a single event, consider streaming the snapshot in chunks via multiple subscribed frames. Out of scope for now.
  • Positions in the positions table that pre-date the device's registration to the event. The JOIN does not exclude them: if the device is on entry_devices for that event today, its 3-month-old positions still match. Acceptable behaviour; the operator's mental model is "this device has been in the system that long."
  • Trade-off with DISTINCT ON and TimescaleDB chunks. TimescaleDB partitions by time; DISTINCT ON (device_id) ORDER BY device_id, ts DESC may need to scan multiple chunks if the latest position for some devices is older than the most recent chunk. For an active event this is the same chunk for everyone. For a long-tail of stale devices, multiple chunks may be touched. Acceptable.

Done

(Filled in when the task lands.)