# Task 1.5.5 — Snapshot-on-subscribe
**Phase:** 1.5 — Live broadcast
**Status:** ⬜ Not started
**Depends on:** 1.5.3, 1.4 (Postgres pool)
**Wiki refs:** `docs/wiki/synthesis/processor-ws-contract.md` §Server response — subscribed
## Goal
When a client subscribes to `event:<eventId>`, return the **latest known position for every device registered to that event** as part of the `subscribed` response. Without it, the SPA opens to an empty map that only fills in as devices report, which feels broken.
The snapshot is a one-time read at subscribe time. After that, positions stream live via the broadcast consumer (1.5.4). The two paths together give the SPA a "fully populated map immediately, then live updates" experience.
## Deliverables
- `src/live/snapshot.ts` exporting (see the sketch after this list):
  - `createSnapshotProvider(pool, logger, metrics): SnapshotProvider` — factory.
  - `SnapshotProvider.forEvent(eventId: string): Promise<PositionSnapshotEntry[]>` — returns the latest position per device registered to the event. Empty array if no devices or no positions yet.
  - `type PositionSnapshotEntry = { deviceId: string; lat: number; lon: number; ts: number; speed?: number; course?: number; accuracy?: number; attributes?: Record<string, unknown> }` — same shape as the streaming `position` message minus the `type` and `topic` fields (the envelope wraps them).
- `src/live/registry.ts` updated: the `subscribe` method calls `snapshot.forEvent(eventId)` after authorization succeeds and includes the result in the `subscribed` response. Authorization happens *before* the snapshot query so a forbidden user doesn't pay the snapshot cost.
- New Prometheus metrics:
  - `processor_live_snapshot_query_latency_ms` (histogram).
  - `processor_live_snapshot_size` (histogram) — number of positions in each snapshot.
- `test/live-snapshot.test.ts`:
  - With three devices in an event, two of which have positions, returns two snapshot entries.
  - With an event that has no `entry_devices` rows, returns `[]`.
  - With devices that have positions but `faulty=true`, those positions are excluded.
  - The query returns the *most recent non-faulty* position per device (not just the most recent overall — `ORDER BY ts DESC` with a `WHERE faulty = false` filter).
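
To make the deliverable shapes concrete, here is a minimal sketch of `src/live/snapshot.ts`, assuming node-postgres (`pg`) and pino. The `Histogram`/`SnapshotMetrics` shapes and the helper names `SNAPSHOT_SQL` and `toEntry` are illustrative, not prescribed by this task; metric observation stays at the registry call site shown under §Specification.

```ts
// Minimal sketch of src/live/snapshot.ts. Assumes node-postgres and pino;
// helper names and the metrics shape are illustrative.
import type { Pool } from 'pg';
import type { Logger } from 'pino';

export type PositionSnapshotEntry = {
  deviceId: string;
  lat: number;
  lon: number;
  ts: number; // epoch ms
  speed?: number;
  course?: number;
  accuracy?: number;
  attributes?: Record<string, unknown>;
};

export interface SnapshotProvider {
  forEvent(eventId: string): Promise<PositionSnapshotEntry[]>;
}

// Assumed shape of the two histograms; the real metrics module may differ.
type Histogram = { observe(value: number): void };
export type SnapshotMetrics = { snapshotLatency: Histogram; snapshotSize: Histogram };

// The DISTINCT ON query from §Specification below.
const SNAPSHOT_SQL = `
  SELECT DISTINCT ON (p.device_id)
         p.device_id, p.latitude, p.longitude, p.ts,
         p.speed, p.course, p.accuracy, p.attributes
  FROM positions p
  JOIN entry_devices ed ON ed.device_id = p.device_id
  JOIN entries e ON e.id = ed.entry_id
  WHERE e.event_id = $1
    AND p.faulty = false
  ORDER BY p.device_id, p.ts DESC`;

// Map a row to the wire shape, omitting (never null-ing) absent fields.
function toEntry(row: Record<string, unknown>): PositionSnapshotEntry {
  const entry: PositionSnapshotEntry = {
    deviceId: String(row.device_id),
    lat: Number(row.latitude),
    lon: Number(row.longitude),
    // Assumes ts arrives as a Date or an epoch value; normalize to ms.
    ts: row.ts instanceof Date ? row.ts.getTime() : Number(row.ts),
  };
  if (row.speed != null) entry.speed = Number(row.speed);
  if (row.course != null) entry.course = Number(row.course);
  if (row.accuracy != null) entry.accuracy = Number(row.accuracy);
  if (row.attributes != null) entry.attributes = row.attributes as Record<string, unknown>;
  return entry;
}

export function createSnapshotProvider(
  pool: Pool,
  logger: Logger,
  _metrics: SnapshotMetrics, // observed at the registry call site (§Specification)
): SnapshotProvider {
  return {
    async forEvent(eventId) {
      const { rows } = await pool.query(SNAPSHOT_SQL, [eventId]);
      logger.debug({ eventId, count: rows.length }, 'snapshot loaded');
      return rows.map(toEntry);
    },
  };
}
```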
## Specification
### The query
The snapshot needs the latest non-faulty position per device, scoped to one event. Postgres-canonical for "latest per group" is `DISTINCT ON`:
```sql
SELECT DISTINCT ON (p.device_id)
       p.device_id,
       p.latitude,
       p.longitude,
       p.ts,
       p.speed,
       p.course,
       p.accuracy,
       p.attributes
FROM positions p
JOIN entry_devices ed ON ed.device_id = p.device_id
JOIN entries e ON e.id = ed.entry_id
WHERE e.event_id = $1
  AND p.faulty = false
ORDER BY p.device_id, p.ts DESC;
```
### Why `DISTINCT ON`
`DISTINCT ON (device_id) ... ORDER BY device_id, ts DESC` returns the row with the highest `ts` per `device_id`. The alternatives (`GROUP BY` + `MAX(ts)` + self-join, or window functions with `ROW_NUMBER()`) all produce the same result with worse query plans on a TimescaleDB hypertable. `DISTINCT ON` is Postgres-specific but we're committed to Postgres.
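For comparison, a window-function formulation that returns the same rows (a sketch for illustration only; the `DISTINCT ON` form above is what ships):

```sql
-- Same result as DISTINCT ON, but the rank must be computed for every
-- matching row before the rn = 1 filter applies, which typically plans worse.
SELECT device_id, latitude, longitude, ts, speed, course, accuracy, attributes
FROM (
  SELECT p.*,
         ROW_NUMBER() OVER (PARTITION BY p.device_id ORDER BY p.ts DESC) AS rn
  FROM positions p
  JOIN entry_devices ed ON ed.device_id = p.device_id
  JOIN entries e ON e.id = ed.entry_id
  WHERE e.event_id = $1
    AND p.faulty = false
) ranked
WHERE ranked.rn = 1;
```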
### Performance
On a TimescaleDB hypertable, the index that makes this fast is `(device_id, ts DESC)`. Phase 1 task 1.4 created the hypertable; verify the index exists. If not, add it as a migration in this task:
```sql
CREATE INDEX IF NOT EXISTS positions_device_ts_idx ON positions (device_id, ts DESC);
```
Without the index, the planner has to scan and sort every matching position before deduplicating per device. With it, the scan is bounded by the chunks holding each device's most recent position — typically the latest one or two.
For 500 devices in an event, the query should complete in < 50ms on a warm cache.
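
It's worth spot-checking the plan during development. A minimal check (the event UUID is a placeholder):

```sql
-- Expect index usage on (device_id, ts DESC) in the per-chunk scans,
-- not a Seq Scan over positions.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (p.device_id) p.device_id, p.ts
FROM positions p
JOIN entry_devices ed ON ed.device_id = p.device_id
JOIN entries e ON e.id = ed.entry_id
WHERE e.event_id = '00000000-0000-0000-0000-000000000000'
  AND p.faulty = false
ORDER BY p.device_id, p.ts DESC;
```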
### Faulty-filter semantics
The `faulty` column is set post-hoc by operators when a position is unrealistic ([[directus]] entity page §"Faulty position handling"). Any read path that surfaces position data to operators must filter `faulty = false`:
- **Snapshot:** filter (this task).
- **Live broadcast:** doesn't apply — the broadcast consumer reads from Redis, not from `positions`. By the time a position is in Redis (and being streamed), no one has had the chance to flag it.
- **Replay (future):** filter when implemented.
### Where the snapshot is wired into the registry
The `subscribed` response in 1.5.3 currently sends `snapshot: []`. Update:
```ts
// In registry.ts, inside subscribe() after authorization succeeds:
let snapshot: PositionSnapshotEntry[] = [];
const start = performance.now();
try {
  snapshot = await snapshotProvider.forEvent(parsed.eventId);
  metrics.snapshotSize.observe(snapshot.length);
} catch (err) {
  logger.warn({ err, eventId: parsed.eventId }, 'snapshot query failed');
  // Fall through with empty snapshot — better to subscribe without a snapshot
  // than to fail the subscription entirely.
} finally {
  metrics.snapshotLatency.observe(performance.now() - start);
}
sendOutbound(conn, { type: 'subscribed', topic, id: correlationId, snapshot });
```
The "fail open" choice on snapshot errors is deliberate: a subscribe that returned `subscribed` with an empty snapshot is recoverable (live updates still work; the SPA just sees a sparser-than-expected initial state). A subscribe that errors out forces the SPA to retry, which masks the underlying snapshot failure.
### What the snapshot does NOT include
- **Position history.** Just the *latest* position per device. Trail rendering on the SPA reads the previous N positions from its own ring buffer as new positions stream in. No bulk-history endpoint in v1.
- **Device metadata** (model, IMEI, vehicle assignment). The SPA fetches that separately via Directus REST/SDK and joins on `deviceId` client-side.
- **Faulty positions.** Filtered.
- **Stale positions.** A position from 3 days ago is still "the latest" if the device hasn't reported since. The SPA should display "last seen N hours ago" indicators based on the `ts` field.
### Snapshot field shape
Mirror the `position` streaming message exactly except for the envelope:
```ts
const SnapshotEntrySchema = z.object({
  deviceId: z.string().uuid(),
  lat: z.number(),
  lon: z.number(),
  ts: z.number(), // epoch ms
  speed: z.number().optional(),
  course: z.number().optional(),
  accuracy: z.number().optional(),
  attributes: z.record(z.unknown()).optional(),
});
```
Same field-omission convention: don't emit `null` for absent values.
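
A quick illustration of that convention against the schema above (values are made up):

```ts
// Optional fields may be absent entirely:
const ok = SnapshotEntrySchema.safeParse({
  deviceId: '9b2f6a3e-1c44-4d7b-8f2a-5e0c9d1b7a21', // made-up UUID
  lat: 41.3275,
  lon: 19.8187,
  ts: 1767225600000,
});
// ok.success === true

// ...but an explicit null is rejected: z.number().optional() means
// number | undefined, not number | null.
const bad = SnapshotEntrySchema.safeParse({
  deviceId: '9b2f6a3e-1c44-4d7b-8f2a-5e0c9d1b7a21',
  lat: 41.3275,
  lon: 19.8187,
  ts: 1767225600000,
  speed: null,
});
// bad.success === false
```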
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] Manual: with the seeded Rally Albania 2026 event (3 registered devices, some positions in `positions`), subscribing returns a snapshot with the registered devices' latest positions.
- [ ] Subscribing to an event with no positions returns `subscribed` with `snapshot: []`.
- [ ] Manually marking a position `faulty=true` excludes it from the next snapshot (the snapshot returns the most recent non-faulty position for that device, or omits the device if none exists).
- [ ] Snapshot query latency p95 < 100ms with the index in place; without the index the test should fail loudly so we don't ship without it.
- [ ] Snapshot failure (e.g. simulated Postgres timeout) does not fail the subscription; client receives `subscribed` with `snapshot: []` and the live stream still works.
## Risks / open questions
- **Snapshot size on a large event.** 500 devices × ~200 bytes per entry = ~100KB JSON payload. Tolerable. If we ever push to 5000 devices on a single event, consider streaming the snapshot in chunks via multiple `subscribed` frames. Out of scope for now.
- **Positions in the `positions` table that pre-date the device's registration to the event.** The JOIN includes them — if the device is on `entry_devices` for that event today, its 3-month-old positions still match. Acceptable behaviour; the operator's mental model is "this device has been in the system that long."
- **Trade-off with `DISTINCT ON` and TimescaleDB chunks.** TimescaleDB partitions by time; `DISTINCT ON (device_id) ORDER BY device_id, ts DESC` may need to scan multiple chunks if the latest position for some devices is older than the most recent chunk. For an active event this is the same chunk for everyone; for a long tail of stale devices, multiple chunks may be touched. Acceptable.
## Done
(Filled in when the task lands.)