docs: update log and wiki entries for Phase 1.5 live broadcast implementation and incident resolution

2026-05-03 19:33:15 +02:00
parent 875327bed7
commit 6ef4e9e9ee
3 changed files with 70 additions and 20 deletions
+18 -12
@@ -2,7 +2,7 @@
title: Live channel architecture
type: concept
created: 2026-05-01
-updated: 2026-05-01
+updated: 2026-05-03
sources: []
tags: [architecture, realtime, websocket, telemetry-plane, decision]
---
@@ -59,22 +59,28 @@ Each kind of data takes the path that fits it. No bridges, no extensions inside
## Authorization flow
-The Processor's WebSocket endpoint validates connections through Directus, but never asks Directus per record.
+The Processor's WebSocket endpoint validates connections through Directus, but never asks Directus per record. The handshake is **cookie-based and same-origin** — see [[processor-ws-contract]] §"Auth handshake" for the wire-level spec.
```
-1. SPA opens wss://processor.../live with a Directus-issued JWT.
-2. Processor validates the JWT (round-trip to Directus's /users/me, or local
-   verification with Directus's signing secret). Failure → close socket.
-3. SPA sends {type: 'subscribe', event_id: 42}.
-4. Processor calls Directus once: GET /items/events/42 with the user's token.
-   200 → allow subscription, store {client → event_id} in memory.
-   403 → reject subscription with a clear error.
+1. SPA opens wss://<origin>/ws-live (relative URL; same origin as Directus).
+   Browser auto-attaches the httpOnly Directus session cookie.
+2. Processor reads the entire Cookie header from the upgrade request and
+   forwards it to Directus GET /users/me.
+   200 → bind the connection to (id, role).
+   401/403 → close the socket with code 4401 (unauthorized).
+3. SPA sends {type: 'subscribe', topic: 'event:<uuid>'}.
+4. Processor checks the user's organization_users membership against the
+   event's organization_id (one cached lookup per event).
+   200 → store {client → topic}; reply with the latest-position snapshot.
+   403 → reply with {type: 'error', code: 'forbidden'}.
+5. For every position arriving on Redis, match against in-memory subscriptions
+   and push to matched clients. Zero Directus calls in the hot path.
```
Connection-time auth is amortized over session lifetime. Permission re-checks happen on subscription change, not on every record. The hot path is bounded by `O(positions × subscribed-clients-per-event)` and runs entirely on the Processor's event loop with in-memory state.
> Earlier revisions of this page described JWT-in-URL auth. That predated [[react-spa]]'s switch to Directus SDK session-mode auth (see log entry 2026-05-02 "Auth-mode wiki realignment"). The current implementation is cookie-based; tokens never appear in WebSocket URLs (which would land them in proxy logs).
## Failure modes
| Failure | Effect on durable storage | Effect on live channel |
@@ -107,7 +113,7 @@ At pilot scale (≤500 devices per event, tens of viewers), the dominant costs a
When this becomes wrong:
- Sustained > ~10k WebSocket messages/sec total → consider sharding the broadcast path or extracting to a dedicated gateway service.
-- Connection-time auth becomes a thundering herd at race start with thousands of viewers → cache JWT verification locally and shorten the Directus permission check via a token-with-scope pattern.
+- Connection-time auth becomes a thundering herd at race start with thousands of viewers → cache the `/users/me` validation result for the connection's lifetime and shorten the Directus permission check via a token-with-scope pattern. Pilot scale doesn't need this; revisit when measured.
- Multi-data-center deployment → revisit the consumer-group fan-out strategy; per-region broadcast may be cleaner than global.
The escape hatch is well-defined: lift the WebSocket endpoint code out of the Processor into a standalone service that subscribes to the same `live-broadcast-*` consumer group. The Redis-stream-in / WebSocket-out contract doesn't change; only the host process does.
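The Redis-stream-in / WebSocket-out shape that survives the escape hatch is essentially one in-memory map from topic to subscriber set. A sketch under assumed names (`SubscriptionRegistry`, `fanOut` — illustrative, not the actual code):

```typescript
type Position = { event_id: string; device_id: string; lat: number; lon: number };

class SubscriptionRegistry<Client> {
  private byTopic = new Map<string, Set<Client>>();

  subscribe(client: Client, topic: string): void {
    let set = this.byTopic.get(topic);
    if (!set) this.byTopic.set(topic, (set = new Set()));
    set.add(client);
  }

  unsubscribe(client: Client, topic: string): void {
    this.byTopic.get(topic)?.delete(client);
  }

  // Hot path: O(subscribed-clients-per-event), zero Directus calls.
  fanOut(position: Position, send: (client: Client, p: Position) => void): number {
    const subs = this.byTopic.get(`event:${position.event_id}`);
    if (!subs) return 0;
    for (const client of subs) send(client, position);
    return subs.size;
  }
}
```

Whether this lives inside the Processor or a standalone gateway, the registry and its complexity bound are unchanged — which is exactly why the extraction is cheap.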
@@ -116,7 +122,7 @@ The escape hatch is well-defined: lift the WebSocket endpoint code out of the Pr
- [[processor]] grows a public-facing WebSocket endpoint in addition to its existing Redis consumer and Postgres writer.
- [[directus]] keeps its built-in WebSocket subscriptions for tables it writes to. Its real-time delivery section no longer claims to broadcast direct writes from [[processor]] — that's a documented mistake corrected in this revision.
-- [[react-spa]] connects to two WebSocket endpoints: Directus for admin/business updates, Processor for live position firehose. Same JWT-based auth on both. Consumer-side throughput discipline (rAF coalescing of incoming positions before reducer dispatch) is documented in [[maps-architecture]] — without it the per-message dispatch pattern observed in [[traccar-maps-architecture]] cascades through selectors and `setData` at every position arrival.
+- [[react-spa]] connects to two WebSocket endpoints: Directus at `/ws-business` for admin/business updates, Processor at `/ws-live` for live position firehose. Same-origin httpOnly Directus session cookie on both — no separate auth artifact for the live channel. Consumer-side throughput discipline (rAF coalescing of incoming positions before reducer dispatch) is documented in [[maps-architecture]] — without it the per-message dispatch pattern observed in [[traccar-maps-architecture]] cascades through selectors and `setData` at every position arrival.
- The deploy stack publishes the Processor's WebSocket port (with TLS termination at a reverse proxy in front).
## Why not a single WebSocket endpoint
@@ -130,6 +136,6 @@ Two endpoints, each serving the writes its plane manages, is the architecturally
## Open questions
-- **JWT validation strategy.** Round-trip to Directus's `/users/me` (no shared secret, ~20ms per connection) vs. local verification with Directus's signing key (no round-trip, but a secret to share). Pilot can start with round-trip; revisit if connection rates climb.
+- **Auth caching strategy.** Currently every WebSocket connection round-trips to Directus's `/users/me` (~20ms over the internal network) to validate the forwarded session cookie. At pilot scale (≤500 viewers, low reconnect rate) this is trivial. Caching the validation per-connection-lifetime is the cheap optimisation; a stateless verification path (shared signing secret) is the heavier one. Defer until measurements demand it.
- **Subscription model.** Per-event, per-stage, per-organization, or arbitrary filter expressions? The simplest pilot model is "subscribe to one event by ID"; extensions land when SPA UX demands them.
- **Permission staleness.** If a user is removed from an organization mid-session, do their existing subscriptions silently keep delivering until reconnect? Either re-validate periodically, or accept "trust the session" for pilot.
+35 -8
@@ -2,7 +2,7 @@
title: Processor WebSocket contract
type: synthesis
created: 2026-05-02
-updated: 2026-05-02
+updated: 2026-05-03
sources: [gps-tracking-architecture, traccar-maps-architecture]
tags: [websocket, protocol, contract, telemetry-plane, decision]
---
@@ -15,25 +15,25 @@ This page is the protocol spec. The architectural rationale lives in [[live-chan
## Implementation status
-**Planned as `processor` Phase 1.5 — Live broadcast.** Six tasks in `trm/processor/.planning/phase-1-5-live-broadcast/`: WS server scaffold + heartbeat, cookie auth handshake, subscription registry & per-event authorization, broadcast consumer group & fan-out, snapshot-on-subscribe, integration test. Status ⬜ Not started; sequenced as 1.5.1 → 1.5.2 → 1.5.3 → (1.5.4 ‖ 1.5.5) → 1.5.6.
+**Shipped as `processor` Phase 1.5 — Live broadcast** (landed 2026-05-02). All six tasks merged: 1.5.1 WS server scaffold + heartbeat, 1.5.2 cookie auth handshake, 1.5.3 subscription registry & per-event authorization, 1.5.4 broadcast consumer group & fan-out, 1.5.5 snapshot-on-subscribe, 1.5.6 integration test. 178/178 unit tests + 6 integration scenarios green.
-The endpoint is hosted *inside* the Processor process (as [[processor]] and [[live-channel-architecture]] specify). Lifting it into a separate `live-gateway` service is the documented escape hatch in [[live-channel-architecture]] §"Scale considerations" if sustained > 10k WS messages/sec demands it — not the starting point.
+The endpoint is hosted *inside* the Processor process (as [[processor]] and [[live-channel-architecture]] specify). Lifting it into a separate `live-gateway` service remains the documented escape hatch in [[live-channel-architecture]] §"Scale considerations" if sustained > 10k WS messages/sec ever demands it — not currently planned.
-This contract is implementation-agnostic in the sense that the wire format wouldn't change if we ever did lift the endpoint out — only the host process would. SPA work can build against the contract independently of the Processor task sequence as long as it doesn't ship to stage before Phase 1.5 lands.
+This contract is implementation-agnostic in the sense that the wire format wouldn't change if we ever did lift the endpoint out — only the host process would.
## Endpoint
```
-wss://<one-public-origin>/processor/ws
+wss://<env>.dev.trmtracking.org/ws-live
```
-Served behind the same reverse proxy that fronts [[directus]] and the [[react-spa]] static bundle. **Single origin is non-negotiable** same-origin is what allows the auth cookie to flow with the WebSocket upgrade request (see Auth handshake below).
+Path **`/ws-live`** (locked 2026-05-03). The companion business-plane channel hosted by [[directus]] is at **`/ws-business`** on the same origin (proxy-rewritten to Directus's native `/websocket`). Both names are read by the SPA from `/config.json` (`liveWsUrl` and `businessWsUrl`).
-The path `/processor/ws` is illustrative; final path determined by the proxy routing rules. Whatever it is, the SPA reaches it as a relative URL, never a cross-origin URL.
+Served behind the same Traefik instance that fronts [[directus]] and the [[react-spa]] static bundle on the Komodo host (per `NEW-HOST-KOMODO-TRAEFIK.md`). **Single origin is non-negotiable** — same-origin is what allows the auth cookie to flow with the WebSocket upgrade request (see Auth handshake below). The SPA reaches the endpoint as a relative URL, never a cross-origin URL.
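Resolving the relative path from `/config.json` into a same-origin socket URL is one small helper on the SPA side. A sketch (`resolveWsUrl` is an illustrative name; the hostnames in the usage notes are placeholders, not real environments):

```typescript
// Turn a relative WS path into an absolute ws(s):// URL on the page's own
// origin. https pages get wss, plain-http dev servers get ws — the SPA
// never constructs a cross-origin WebSocket URL.
function resolveWsUrl(pageOrigin: string, relativePath: string): string {
  const u = new URL(relativePath, pageOrigin);
  u.protocol = u.protocol === 'https:' ? 'wss:' : 'ws:';
  return u.toString();
}
```

In the browser, `pageOrigin` would be `window.location.origin`, so `resolveWsUrl(location.origin, config.liveWsUrl)` yields the upgrade URL without any hardcoded host.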
## Transport
-- **Protocol:** WebSocket (RFC 6455) over TLS at the edge. Internal hop from the proxy to the producer is plain WS on the `trm_default` Compose network.
+- **Protocol:** WebSocket (RFC 6455) over TLS at the edge. Internal hop from Traefik to the producer is plain WS on the deploy stack's default Compose network plus the external `proxy` network shared with Traefik.
- **Subprotocol:** none required. Future versions may add a `Sec-WebSocket-Protocol` of `trm.live.v1` if we need to negotiate versions; for now the path is the version.
- **Frame format:** text frames, JSON-encoded. No binary frames. (If we ever need to ship raw position bytes for a high-frequency optimisation, that's a v2 concern.)
- **Heartbeat:** the producer sends a ping every 30 s; the consumer responds. Consumer-side liveness is enforced by `setInterval` checking time-since-last-message > 60s ⇒ reconnect.
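The consumer-side liveness rule in the heartbeat bullet reduces to tracking the timestamp of the last received message (pings count) and declaring the connection dead once the gap exceeds 60 s. A sketch with injected clock values so the `setInterval` tick is trivially testable; `LivenessMonitor` is an illustrative name:

```typescript
const STALE_AFTER_MS = 60_000; // time-since-last-message threshold from the contract

class LivenessMonitor {
  private lastMessageAt: number;

  constructor(now: number) {
    this.lastMessageAt = now;
  }

  // Call on every inbound frame — data, ping, or control.
  onMessage(now: number): void {
    this.lastMessageAt = now;
  }

  // Checked from the consumer's setInterval tick; true ⇒ tear down and reconnect.
  isStale(now: number): boolean {
    return now - this.lastMessageAt > STALE_AFTER_MS;
  }
}
```

With the producer pinging every 30 s, a healthy connection always refreshes well inside the 60 s window; two missed pings trip the reconnect.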
@@ -229,6 +229,33 @@ Pilot-scale targets (subject to revision after first dogfood):
If a slow consumer can't drain its queue, the server **drops oldest position messages** for that connection (per-device; latest position is always preserved). Position data is always-fresh — backlog isn't valuable. Only `subscribed`/`unsubscribed`/`error` control messages are guaranteed delivery.
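One way to realise that policy — a sketch, not the shipped code — is to coalesce positions to a latest-per-device slot while keeping control messages in a guaranteed FIFO. All names (`OutboundQueue`, `drain`, the message shapes) are illustrative:

```typescript
type Msg =
  | { type: 'position'; device_id: string; payload: unknown }
  | { type: 'subscribed' | 'unsubscribed' | 'error'; payload: unknown };

class OutboundQueue {
  private control: Msg[] = [];                       // guaranteed delivery
  private latestPosition = new Map<string, Msg>();   // at most one (the latest) per device
  private dropped = 0;

  enqueue(msg: Msg): void {
    if (msg.type === 'position') {
      // A newer position supersedes any queued one for the same device —
      // backlog isn't valuable, freshness is.
      if (this.latestPosition.has(msg.device_id)) this.dropped++;
      this.latestPosition.set(msg.device_id, msg);
    } else {
      this.control.push(msg); // subscribed/unsubscribed/error are never dropped
    }
  }

  // Drain when the socket is writable: control first, then one latest
  // position per device.
  drain(): Msg[] {
    const out = [...this.control, ...this.latestPosition.values()];
    this.control = [];
    this.latestPosition.clear();
    return out;
  }

  get droppedCount(): number {
    return this.dropped;
  }
}
```

This keeps the per-connection memory bound at O(subscribed devices + pending control messages) regardless of how slowly the consumer drains.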
## Deployment
+The endpoint terminates inside the [[processor]] container. Public routing is handled by Traefik on the Komodo host via Docker container labels — no nginx, no openresty, no NPM in the deploy repo. See `NEW-HOST-KOMODO-TRAEFIK.md` for the platform-wide infra contract and the per-host path map.
+Concrete shape (placeholder host; replace with the per-environment hostname):
+```yaml
+processor:
+  networks: [default, proxy]
+  labels:
+    - "traefik.enable=true"
+    - "traefik.docker.network=proxy"
+    - "traefik.http.routers.processor-live.rule=Host(`<env>.dev.trmtracking.org`) && PathPrefix(`/ws-live`)"
+    - "traefik.http.routers.processor-live.entrypoints=websecure"
+    - "traefik.http.routers.processor-live.tls=true"
+    - "traefik.http.routers.processor-live.priority=100"
+    - "traefik.http.services.processor-live.loadbalancer.server.port=<PROCESSOR_WS_PORT>"
+```
+Three things this depends on:
+- **Same origin as Directus and the SPA.** All three answer on the same hostname; Traefik routes by path. The cookie auth handshake described above requires this — different origins block the cookie flow on the WebSocket upgrade.
+- **Traefik handles WS upgrade transparently.** No `proxy_http_version` / `Upgrade` / `Connection` header gymnastics required (those were artifacts of the legacy nginx-proxy-manager + openresty setup). Traefik v3 negotiates the upgrade based on the request headers alone.
+- **Cookie header forwarding.** The default Traefik forward strategy preserves cookies across the upgrade. Don't introduce middlewares that strip headers between the SPA and the processor — the producer needs the entire `Cookie` header to forward to Directus's `/users/me`.
+`<PROCESSOR_WS_PORT>` is the port the Phase 1.5 WS server binds; pin it in the processor service's compose definition and reference it consistently.
## Versioning
This is `v1`. Breaking changes (renaming fields, changing semantics) require: