---
title: Live channel architecture
type: concept
created: 2026-05-01
updated: 2026-05-03
sources: []
tags: [architecture, realtime, websocket, telemetry-plane, decision]
---
# Live channel architecture
How live position data reaches the [[react-spa]] without violating [[plane-separation]] or coupling to [[directus]]'s failure domain.
## The question
The SPA needs sub-second updates of device positions for live race views. Three things are non-negotiable:
1. The [[processor]] hot path stays direct-to-database — no API hop, no event-loop pressure on Directus.
2. [[directus]] is not in the telemetry hot path (per [[plane-separation]]).
3. The live channel must be authenticated and authorization-aware — only users with permission to see an event's positions get pushed updates.
The naïve assumption is that [[directus]]'s built-in WebSocket subscriptions cover this. They do not. **Directus's subscription system only fires events for writes that go through its own `ItemsService`** (REST/GraphQL/Admin UI mutations). Direct `INSERT`s from the [[processor]] are invisible to subscribers — verified against Directus's documentation and source. The bridging assumption was wrong.
This page documents how the platform actually delivers live positions.
## Options considered
| Option | Live channel works | Hot path stays fast | Plane separation | Failure domain |
|---|---|---|---|---|
| Route Processor writes through Directus REST | Yes (Directus broadcasts own writes) | Compromised — every write through Directus event loop | Compromised | Coupled — Directus down blocks ingestion |
| Bridge extension inside Directus (Redis → `WebSocketService.broadcast`) | Yes | Compromised — Directus runs the firehose consumer | Compromised | Coupled — Directus crash kills live channel |
| **Processor exposes its own WebSocket endpoint** (chosen) | Yes | Preserved | Preserved | Decoupled — Directus down blocks only new authorizations |
The third option wins because it preserves the architectural invariants that motivated [[plane-separation]] in the first place, while still leaning on [[directus]] for authentication and authorization.
## Chosen design
Two cleanly-separated WebSocket channels, each playing to its strength:
```
┌─ Telemetry plane ─────────────────────────┐   ┌─ Business plane ──────────────────────┐
│                                           │   │                                       │
│  Device → tcp-ingestion → Redis           │   │  SPA admin action                     │
│                             ↓             │   │        ↓                              │
│                         Processor         │   │  Directus REST                        │
│                          ↙     ↘          │   │        ↓                              │
│                 Postgres     Processor's  │   │  Postgres + Directus's WebSocket      │
│                               WebSocket   │   │        ↓                              │
│                                   ↓       │   │  SPA (admin UI,                       │
│                                  SPA      │   │       leaderboard refresh,            │
│                              (live map)   │   │       timing edits)                   │
└───────────────────────────────────────────┘   └───────────────────────────────────────┘
```
- **High-volume telemetry** (positions): the Processor writes directly to Postgres and *also* fans out the same records to subscribed SPA clients over its own WebSocket endpoint. Stays in the telemetry plane end-to-end.
- **Low-volume domain events** (timing records, stage results, manual entries, configuration): written via Directus's REST API; Directus's built-in subscription system broadcasts them through its WebSocket. Stays in the business plane.
Each kind of data takes the path that fits it. No bridges, no extensions inside Directus.
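The telemetry-plane half of this split can be sketched as one handler with two independent sinks. This is a minimal illustration, not the Processor's actual code: `PositionRecord`, `Sinks`, `handlePosition`, and the field names are all hypothetical, and the choice to report each sink's outcome separately (so the caller decides whether to acknowledge the Redis entry) is an assumption.

```typescript
// Hypothetical shape of a decoded position; field names are assumptions.
interface PositionRecord {
  deviceId: string;
  eventId: string;
  lat: number;
  lon: number;
}

// Both sinks are injected so the sketch does not depend on any specific
// Postgres or WebSocket library.
interface Sinks {
  writeToPostgres(record: PositionRecord): void;
  broadcast(topic: string, record: PositionRecord): void;
}

interface HandleResult {
  stored: boolean;
  delivered: boolean;
}

// Handle one record read from Redis. The two steps are isolated from each
// other: a failing database write does not suppress the live fan-out, and
// neither step calls Directus.
function handlePosition(record: PositionRecord, sinks: Sinks): HandleResult {
  let delivered = true;
  try {
    sinks.broadcast(`event:${record.eventId}`, record);
  } catch {
    delivered = false;
  }

  let stored = true;
  try {
    sinks.writeToPostgres(record);
  } catch {
    stored = false; // caller may leave the Redis entry unacknowledged
  }

  return { stored, delivered };
}
```

Keeping the sinks behind an interface also means the same handler shape works whether the fan-out runs inside the Processor or in a separate process.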
## Authorization flow
The Processor's WebSocket endpoint validates connections through Directus, but never asks Directus per record. The handshake is **cookie-based and same-origin** — see [[processor-ws-contract]] §"Auth handshake" for the wire-level spec.
```
1. SPA opens wss://<origin>/ws-live (relative URL; same origin as Directus).
   Browser auto-attaches the httpOnly Directus session cookie.
2. Processor reads the entire Cookie header from the upgrade request and
   forwards it to Directus GET /users/me.
     200     → bind the connection to (id, role).
     401/403 → close the socket with code 4401 (unauthorized).
3. SPA sends {type: 'subscribe', topic: 'event:<uuid>'}.
4. Processor checks the user's organization_users membership against the
   event's organization_id (one cached lookup per event).
     200 → store {client → topic}; reply with the latest-position snapshot.
     403 → reply with {type: 'error', code: 'forbidden'}.
5. For every position arriving on Redis, match against in-memory subscriptions
   and push to matched clients. Zero Directus calls in the hot path.
```
Connection-time auth is amortized over session lifetime. Permission re-checks happen on subscription change, not on every record. The hot path is bounded by `O(positions × subscribed-clients-per-event)` and runs entirely on the Processor's event loop with in-memory state.
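Steps 2 and 5 of the flow above can be sketched as a pure handshake decision plus an in-memory subscription registry. This is an illustrative sketch under stated assumptions: `decideHandshake` and `SubscriptionRegistry` are hypothetical names, the `/users/me` body is assumed to arrive in a `data` envelope, and closing with 1011 on any unexpected status is an assumption, since the flow only specifies 4401 for 401/403.

```typescript
type HandshakeResult =
  | { action: "bind"; userId: string; role: string | null }
  | { action: "close"; code: number };

// Assumed response envelope for GET /users/me.
interface UsersMeBody {
  data?: { id?: string; role?: string | null };
}

// Step 2: map the Directus response to a connection decision.
function decideHandshake(status: number, body: UsersMeBody): HandshakeResult {
  const user = body.data;
  if (status === 200 && user !== undefined && typeof user.id === "string") {
    return { action: "bind", userId: user.id, role: user.role ?? null };
  }
  if (status === 401 || status === 403) {
    return { action: "close", code: 4401 }; // unauthorized, per the flow above
  }
  // Assumption: anything else (5xx, malformed body) closes with 1011.
  return { action: "close", code: 1011 };
}

// Steps 4 and 5: topic -> clients, held entirely in memory.
class SubscriptionRegistry<Client> {
  private readonly byTopic = new Map<string, Set<Client>>();

  subscribe(client: Client, topic: string): void {
    let clients = this.byTopic.get(topic);
    if (clients === undefined) {
      clients = new Set<Client>();
      this.byTopic.set(topic, clients);
    }
    clients.add(client);
  }

  // Call when a socket closes so empty topics do not accumulate.
  unsubscribeAll(client: Client): void {
    this.byTopic.forEach((clients, topic) => {
      clients.delete(client);
      if (clients.size === 0) this.byTopic.delete(topic);
    });
  }

  // Hot path: one map lookup and one send per subscribed client.
  // Returns the number of clients the record was pushed to.
  fanOut(topic: string, send: (client: Client) => void): number {
    const clients = this.byTopic.get(topic);
    if (clients === undefined) return 0;
    clients.forEach((client) => send(client));
    return clients.size;
  }
}
```

Because `fanOut` touches only the map, the per-record cost matches the `O(positions × subscribed-clients-per-event)` bound stated above.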
> Earlier revisions of this page described JWT-in-URL auth. That predated [[react-spa]]'s switch to Directus SDK session-mode auth (see log entry 2026-05-02 "Auth-mode wiki realignment"). The current implementation is cookie-based; tokens never appear in WebSocket URLs (which would land them in proxy logs).
## Failure modes
| Failure | Effect on durable storage | Effect on live channel |
|---|---|---|
| Processor crashes | Records pile up in Redis; Phase 3 [[failure-domains]] resumption picks them up | Live channel dies until recovery |
| Directus crashes | Unaffected (Processor writes direct to DB) | Existing connections keep working with cached permissions; **new subscriptions cannot be authorized** |
| Postgres crashes | Writes block; Redis buffers up to `MAXLEN` | Unaffected — fan-out is independent of DB state |
| Redis crashes | Whole pipeline stops | Stops delivering: fan-out reads its positions from Redis |
The Directus-down case is the architecturally important one. Routing writes through Directus would mean ingestion blocks. The chosen design keeps ingestion alive and only loses the ability to authorize *new* subscriptions — a much gentler failure.
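The Directus-down row can be illustrated with a small permission cache: decisions already made keep working, while a lookup that cannot reach Directus produces a distinct "unavailable" outcome instead of a false denial. The class name, the `Lookup` signature, and the per-user-per-event cache key are assumptions made for this sketch.

```typescript
type Decision = "allowed" | "forbidden" | "unavailable";

// Hypothetical membership lookup; expected to reject when Directus is down.
type Lookup = (userId: string, eventId: string) => Promise<boolean>;

class PermissionCache {
  private readonly decisions = new Map<string, boolean>();
  private readonly lookup: Lookup;

  constructor(lookup: Lookup) {
    this.lookup = lookup;
  }

  async authorize(userId: string, eventId: string): Promise<Decision> {
    const key = `${userId}:${eventId}`; // assumed cache granularity
    const cached = this.decisions.get(key);
    if (cached !== undefined) {
      // Served from memory: works even while Directus is unreachable.
      return cached ? "allowed" : "forbidden";
    }
    try {
      const allowed = await this.lookup(userId, eventId);
      this.decisions.set(key, allowed);
      return allowed ? "allowed" : "forbidden";
    } catch {
      // New subscriptions cannot be authorized right now; nothing is cached
      // so the next attempt retries the lookup.
      return "unavailable";
    }
  }
}
```

Not caching the failure is deliberate in this sketch: once Directus recovers, the next subscribe attempt succeeds without any cache invalidation.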
## Multi-instance Processor
Phase 3 of [[processor]] adds a second instance for HA. Each instance has its own connected SPA clients. A position arriving on instance A wouldn't naturally reach a client connected to instance B unless the broadcast path crosses instances.
The clean shape: each Processor reads the [[redis-streams]] stream on **two consumer groups**:
- `processor` — the durable-write group (work-split: only one instance handles each record for the DB write).
- `live-broadcast-{instance_id}` — a per-instance fan-out group (every instance reads every record for fan-out).
DB writes deduplicate by virtue of the consumer-group split; live broadcast deduplicates by virtue of clients being connected to exactly one instance. The Processor's [[redis-streams]] consumer code structure should anticipate this even at single-instance pilot scale.
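A rough sketch of the two-group read plan, expressed as Redis command argument lists so it stays independent of any client library. `XGROUP CREATE` and `XREADGROUP` are standard Redis commands; the stream key `telemetry:positions`, the consumer naming, and the helper names are hypothetical.

```typescript
// Hypothetical stream key; the real name is not stated on this page.
const STREAM_KEY = "telemetry:positions";

// Shared group: Redis delivers each entry to only one instance (work-split).
const DURABLE_GROUP = "processor";

// Per-instance group: every instance sees every entry (fan-out).
function broadcastGroup(instanceId: string): string {
  return `live-broadcast-${instanceId}`;
}

// XGROUP CREATE <key> <group> $ MKSTREAM
// "$" starts the group at new entries only; MKSTREAM creates the stream
// if it does not exist yet.
function createGroupArgs(group: string): string[] {
  return ["XGROUP", "CREATE", STREAM_KEY, group, "$", "MKSTREAM"];
}

// XREADGROUP GROUP <group> <consumer> COUNT <n> BLOCK <ms> STREAMS <key> >
// ">" asks for entries never delivered to any consumer in this group.
function readGroupArgs(group: string, consumer: string, count: number, blockMs: number): string[] {
  return [
    "XREADGROUP", "GROUP", group, consumer,
    "COUNT", String(count),
    "BLOCK", String(blockMs),
    "STREAMS", STREAM_KEY, ">",
  ];
}

// The two reads one instance issues on every loop iteration.
function readPlan(instanceId: string): string[][] {
  return [
    readGroupArgs(DURABLE_GROUP, instanceId, 100, 1000),
    readGroupArgs(broadcastGroup(instanceId), instanceId, 100, 1000),
  ];
}
```

With one instance the two groups carry identical traffic, which is why the structure can be put in place at pilot scale without changing behavior.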
## Scale considerations
At pilot scale (≤500 devices per event, tens of viewers), the dominant costs are:
- **Connection-time auth round-trips to Directus**: a few hundred per minute at peak (race start). Trivial.
- **In-memory subscription matching**: `O(records × subscribers)`. At 500 records/sec × 20 subscribers per event, fan-out is ~10k messages/sec, which a single Node process can sustain.
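The fan-out figure is a straight multiplication, shown here as a small helper so other load profiles can be plugged in. The function names are hypothetical, and the 10,000 messages/sec review threshold is taken from the list that follows.

```typescript
// Outbound WebSocket messages per second for a single event.
function fanOutRate(recordsPerSec: number, subscribersPerEvent: number): number {
  return recordsPerSec * subscribersPerEvent;
}

// True once the sustained rate is past the point where the broadcast path
// should be reconsidered.
function needsReview(messagesPerSec: number, thresholdPerSec: number = 10000): boolean {
  return messagesPerSec > thresholdPerSec;
}

// Pilot profile from this page: 500 records/sec × 20 subscribers = 10000.
const pilotRate = fanOutRate(500, 20);
```

At the stated pilot profile the rate sits exactly at the threshold rather than above it, so any growth in viewers or device count is the signal to measure.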
When this becomes wrong:
- Sustained > ~10k WebSocket messages/sec total → consider sharding the broadcast path or extracting to a dedicated gateway service.
- Connection-time auth becomes a thundering herd at race start with thousands of viewers → cache the `/users/me` validation result for the connection's lifetime and shorten the Directus permission check via a token-with-scope pattern. Pilot scale doesn't need this; revisit when measured.
- Multi-data-center deployment → revisit the consumer-group fan-out strategy; per-region broadcast may be cleaner than global.
The escape hatch is well-defined: lift the WebSocket endpoint code out of the Processor into a standalone service that reads the stream through a `live-broadcast-*` consumer group of its own. The Redis-stream-in / WebSocket-out contract doesn't change; only the host process does.
## What this means for adjacent components
- [[processor]] grows a public-facing WebSocket endpoint in addition to its existing Redis consumer and Postgres writer.
- [[directus]] keeps its built-in WebSocket subscriptions for tables it writes to. Its real-time delivery section no longer claims to broadcast direct writes from [[processor]] — that's a documented mistake corrected in this revision.
- [[react-spa]] connects to two WebSocket endpoints: Directus at `/ws-business` for admin/business updates, and the Processor at `/ws-live` for the live position firehose. Both use the same-origin httpOnly Directus session cookie, so the live channel needs no separate auth artifact. Consumer-side throughput discipline (rAF coalescing of incoming positions before reducer dispatch) is documented in [[maps-architecture]]; without it, the per-message dispatch pattern observed in [[traccar-maps-architecture]] cascades through selectors and `setData` at every position arrival.
- The deploy stack publishes the Processor's WebSocket port (with TLS termination at a reverse proxy in front).
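The rAF coalescing mentioned for [[react-spa]] can be sketched as a buffer that flushes once per frame. This is an illustration only, not the implementation documented in [[maps-architecture]]: the class and field names are hypothetical, and keeping only the newest position per device within a frame is an assumption about what coalescing should mean for a live map.

```typescript
interface LivePosition {
  deviceId: string;
  lat: number;
  lon: number;
}

class PositionCoalescer {
  private pending = new Map<string, LivePosition>();
  private scheduled = false;
  private readonly schedule: (flush: () => void) => void;
  private readonly dispatch: (batch: LivePosition[]) => void;

  // `schedule` is injected so the sketch runs outside a browser; in the SPA
  // it would wrap requestAnimationFrame.
  constructor(
    schedule: (flush: () => void) => void,
    dispatch: (batch: LivePosition[]) => void,
  ) {
    this.schedule = schedule;
    this.dispatch = dispatch;
  }

  // Called for every incoming WebSocket message.
  push(position: LivePosition): void {
    this.pending.set(position.deviceId, position); // newest wins per device
    if (this.scheduled) return;
    this.scheduled = true;
    this.schedule(() => this.flush());
  }

  // Runs at most once per scheduled frame: one dispatch for the whole batch.
  private flush(): void {
    const batch: LivePosition[] = [];
    this.pending.forEach((position) => batch.push(position));
    this.pending = new Map<string, LivePosition>();
    this.scheduled = false;
    if (batch.length > 0) this.dispatch(batch);
  }
}
```

In a browser the scheduler argument would be something like `(flush) => requestAnimationFrame(flush)`, so the reducer sees at most one batch per rendered frame regardless of message rate.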
## Why not a single WebSocket endpoint
It would be tempting to fold everything into a single SPA-facing WebSocket, hosted by either the Processor or Directus. Both fail:
- **Single Processor WebSocket** would require Processor to broadcast Directus-managed events, meaning Processor needs to subscribe to Directus's writes — which is exactly the problem we're avoiding for positions, in reverse.
- **Single Directus WebSocket** is the bridge-extension option; it loses plane separation.
Two endpoints, each serving the writes its plane manages, is the architecturally honest answer.
## Open questions
- **Auth caching strategy.** Currently every WebSocket connection round-trips to Directus's `/users/me` (~20ms over the internal network) to validate the forwarded session cookie. At pilot scale (≤500 viewers, low reconnect rate) this is trivial. Caching the validation per-connection-lifetime is the cheap optimisation; a stateless verification path (shared signing secret) is the heavier one. Defer until measurements demand it.
- **Subscription model.** Per-event, per-stage, per-organization, or arbitrary filter expressions? The simplest pilot model is "subscribe to one event by ID"; extensions land when SPA UX demands them.
- **Permission staleness.** If a user is removed from an organization mid-session, do their existing subscriptions silently keep delivering until reconnect? Either re-validate periodically, or accept "trust the session" for pilot.
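For the simplest pilot subscription model ("subscribe to one event by ID"), topic validation could look like the sketch below. The `event:<uuid>` shape comes from the authorization flow; `parseTopic`, the result type, and normalizing the ID to lowercase are assumptions, and other topic kinds are left out on purpose since the model is still an open question.

```typescript
// Canonical 8-4-4-4-12 hexadecimal UUID layout, case-insensitive.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

type ParsedTopic =
  | { kind: "event"; eventId: string }
  | { kind: "invalid" };

// Accept only "event:<uuid>"; everything else is rejected up front so the
// permission check never runs on malformed input.
function parseTopic(topic: string): ParsedTopic {
  const prefix = "event:";
  if (!topic.startsWith(prefix)) return { kind: "invalid" };
  const id = topic.slice(prefix.length);
  if (!UUID_PATTERN.test(id)) return { kind: "invalid" };
  return { kind: "event", eventId: id.toLowerCase() };
}
```

Returning a tagged union leaves room for later kinds (per-stage, per-organization) without changing the call sites that already handle `event`.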