docs: update log and wiki entries for Phase 1.5 live broadcast implementation and incident resolution
@@ -179,3 +179,20 @@ Created in `trm/spa/.planning/`:
Each task file follows the existing Goal / Deliverables / Specification / Acceptance / Risks / Done shape so an implementer agent can pick any one up as a self-contained unit. Phase 1 sequencing: 1.2 → 1.3 → 1.4 → 1.5 → (1.6 ‖ 1.7) → 1.8, with 1.9+1.10 (deploy plumbing) developable in parallel after 1.3 lands.
End state of Phase 1: a deployable empty shell — auth + protected routes + login/logout + CI + compose deploy block. End state of Phase 2: the dogfood-day deliverable. End state of Phase 3: actually fielded for race operators on race day, not just a tech demo.
## [2026-05-03] note | Stage incident — positions dropped + WS unreachable + wiki realignment
Stage incident chain. Processor logged `relation "positions" does not exist` despite Redis consumer batches succeeding. Root cause: the `migrations_applied` guard table retained checksum rows for the three pre-schema migrations but the actual `positions` table had been dropped out-of-band (other 42 user tables intact — looks like a targeted `DROP TABLE` rather than a schema reset). `directus/scripts/apply-db-init.sh` trusts the guard table exclusively (no post-skip schema verification), so the runner logged `skip` for `001/002/003` and never re-applied. Fix: `DELETE FROM migrations_applied WHERE filename IN ('002_positions_hypertable.sql', '003_faulty_column.sql')`, restart Directus; idempotent SQL re-created the table; processor caught up immediately.
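Two catalog queries make this kind of drift visible immediately (illustrative; only the `filename` column of `migrations_applied` is confirmed above, the rest is stock Postgres):

```sql
-- Does the table actually exist? NULL ⇒ it's gone, whatever the guard table claims.
SELECT to_regclass('public.positions');

-- What the guard table believes has been applied.
SELECT filename FROM migrations_applied ORDER BY filename;
```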
With the table back, the live WS still failed handshake (HTTP 400 in the browser). Diagnosis via Playwright + manual `fetch` probes against `/ws-live` and `/ws-business`: `/ws-live` returns openresty's empty 404 (no upstream routed); `/ws-business` returns Directus's "Route /websocket doesn't exist" (proxy is forwarding as plain HTTP, not as a WS upgrade). The migration off Portainer + nginx-proxy-manager onto Komodo + Traefik documented in `NEW-HOST-KOMODO-TRAEFIK.md` is incomplete for the `new.stage.trmtracking.org` hostname — the deploy compose for `processor` + `directus` is missing Traefik label blocks for `/ws-live` and `/ws-business`, and the Directus container needs `WEBSOCKETS_ENABLED=true` (+ `WEBSOCKETS_HEARTBEAT_ENABLED=true`).
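A sketch of the missing compose pieces, extrapolated from the processor-side label block now documented in [[processor-ws-contract]] §Deployment — router/middleware names, the Directus port and the exact rewrite are illustrative, not the actual deploy compose:

```yaml
directus:
  networks: [default, proxy]
  environment:
    WEBSOCKETS_ENABLED: "true"
    WEBSOCKETS_HEARTBEAT_ENABLED: "true"
  labels:
    - "traefik.enable=true"
    - "traefik.docker.network=proxy"
    - "traefik.http.routers.directus-ws.rule=Host(`new.stage.trmtracking.org`) && PathPrefix(`/ws-business`)"
    - "traefik.http.routers.directus-ws.entrypoints=websecure"
    - "traefik.http.routers.directus-ws.tls=true"
    # public path is /ws-business; Directus itself listens on /websocket
    - "traefik.http.middlewares.ws-business-rewrite.replacepathregex.regex=^/ws-business"
    - "traefik.http.middlewares.ws-business-rewrite.replacepathregex.replacement=/websocket"
    - "traefik.http.routers.directus-ws.middlewares=ws-business-rewrite"
    - "traefik.http.services.directus-ws.loadbalancer.server.port=8055"
```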
Wiki realignments landed in this session:
- [[processor-ws-contract]] — Phase 1.5 status flipped ⬜ → ✅ (the implementation landed 2026-05-02 per the "Auth-mode wiki realignment" entry above; the contract page hadn't caught up). Locked in the chosen paths (`/ws-live` for processor, `/ws-business` for Directus) — the page previously called `/processor/ws` "illustrative." Added a "Deployment" section with the Traefik label shape and the three things it depends on (same-origin, transparent upgrade, cookie forwarding). Reworded the Transport note about the internal hop to reference the `proxy` external network instead of the legacy `trm_default`.
- [[live-channel-architecture]] — replaced the JWT-in-URL handshake description with the cookie-based same-origin handshake that has been truth since 2026-05-02. Generalised the "JWT validation strategy" open question to an "Auth caching strategy" question. Updated cross-references that still mentioned JWT-based auth on both endpoints. Added a callout footnote pointing readers at the auth-mode realignment log entry.
- `NEW-HOST-KOMODO-TRAEFIK.md` (parent of `docs/`, infra contract not in wiki/) — added a "Per-host path map" section enumerating `/`, `/api`, `/ws-business`, `/ws-live` so the WebSocket paths are part of the documented infra contract instead of implicit.
**Two architectural notes deferred (not done in this session, captured here so they're not forgotten).**
1. **Runner gap — `apply-db-init.sh` doesn't verify schema state on subsequent boots.** The runner records success in `migrations_applied` and trusts that exclusively; in-file assertion blocks (e.g. the `DO $$ ... RAISE EXCEPTION` block at the bottom of `002_positions_hypertable.sql`) only run during apply, not on skip. Out-of-band drops produce silent drift — exactly today's failure mode. Two cheap mitigations: (a) re-run idempotent files unconditionally (cheap given `IF NOT EXISTS` everywhere), or (b) per-migration `_check.sql` companion files the runner executes even when skipping. Worth a hardening task in directus's planning.
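A minimal sketch of mitigation (b) — a hypothetical `002_positions_hypertable_check.sql` companion the runner would execute even on skip, reusing the assertion style already at the bottom of `002_positions_hypertable.sql`:

```sql
-- Runs unconditionally, even when 002 itself is skipped.
DO $$
BEGIN
  IF to_regclass('public.positions') IS NULL THEN
    RAISE EXCEPTION 'guard table says 002 is applied, but table "positions" is missing';
  END IF;
END
$$;
```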
2. **Positions hypertable as a Directus collection — primary-key blocker.** Discussed the design tension: positions DDL lives in `directus/db-init/` (TimescaleDB-specific, must exist before Directus boots), but Directus refuses to register the table as a collection because `002_positions_hypertable.sql` deliberately omits a PRIMARY KEY (per its divergence note 6, calling unique-index "more idiomatic" for hypertables). Directus introspection requires a PK to expose the table — log evidence: `WARN: Collection "positions" doesn't have a primary key column and will be ignored`. To enable the operator `faulty` workflow described in [[directus-schema-draft]], a future migration `004_positions_primary_key.sql` would `ALTER TABLE positions ADD PRIMARY KEY (device_id, ts)` and `DROP INDEX positions_device_ts` (now redundant). PKs that include the partition column are legal on hypertables; the divergence note's preference for unique-index is a soft style choice, not a correctness constraint. Not done in this session — pending user go-ahead.
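The future migration would be two statements; a sketch under the assumptions above (column names `device_id`/`ts` and index name `positions_device_ts` are as given in the divergence note):

```sql
-- 004_positions_primary_key.sql — pending go-ahead, not applied anywhere yet.
ALTER TABLE positions ADD PRIMARY KEY (device_id, ts);
DROP INDEX IF EXISTS positions_device_ts;  -- superseded by the PK's unique index
```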
@@ -2,7 +2,7 @@
title: Live channel architecture
type: concept
created: 2026-05-01
updated: 2026-05-01
updated: 2026-05-03
sources: []
tags: [architecture, realtime, websocket, telemetry-plane, decision]
---
@@ -59,22 +59,28 @@ Each kind of data takes the path that fits it. No bridges, no extensions inside
## Authorization flow
The Processor's WebSocket endpoint validates connections through Directus, but never asks Directus per record.
The Processor's WebSocket endpoint validates connections through Directus, but never asks Directus per record. The handshake is **cookie-based and same-origin** — see [[processor-ws-contract]] §"Auth handshake" for the wire-level spec.
```
1. SPA opens wss://processor.../live with a Directus-issued JWT.
2. Processor validates the JWT (round-trip to Directus's /users/me, or local
   verification with Directus's signing secret). Failure → close socket.
3. SPA sends {type: 'subscribe', event_id: 42}.
4. Processor calls Directus once: GET /items/events/42 with the user's token.
   200 → allow subscription, store {client → event_id} in memory.
   403 → reject subscription with a clear error.

1. SPA opens wss://<origin>/ws-live (relative URL; same origin as Directus).
   Browser auto-attaches the httpOnly Directus session cookie.
2. Processor reads the entire Cookie header from the upgrade request and
   forwards it to Directus GET /users/me.
   200 → bind the connection to (id, role).
   401/403 → close the socket with code 4401 (unauthorized).
3. SPA sends {type: 'subscribe', topic: 'event:<uuid>'}.
4. Processor checks the user's organization_users membership against the
   event's organization_id (one cached lookup per event).
   200 → store {client → topic}; reply with the latest-position snapshot.
   403 → reply with {type: 'error', code: 'forbidden'}.
5. For every position arriving on Redis, match against in-memory subscriptions
   and push to matched clients. Zero Directus calls in the hot path.
```
Connection-time auth is amortized over session lifetime. Permission re-checks happen on subscription change, not on every record. The hot path is bounded by `O(positions × subscribed-clients-per-event)` and runs entirely on the Processor's event loop with in-memory state.
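A sketch of that connection-time validation, assuming a Node `ws` server inside the Processor — the library choice, env names and port are illustrative; the behaviour (forward the whole Cookie header, bind on 200, close 4401 otherwise) is what the steps above specify:

```ts
import { WebSocketServer } from 'ws';
import type { IncomingMessage } from 'http';

const DIRECTUS_URL = process.env.DIRECTUS_URL ?? 'http://directus:8055'; // internal hop, illustrative

const wss = new WebSocketServer({ port: Number(process.env.PROCESSOR_WS_PORT ?? 8080) });

wss.on('connection', async (socket, req: IncomingMessage) => {
  const cookie = req.headers.cookie;
  if (!cookie) return socket.close(4401, 'unauthorized');

  // Step 2: forward the entire Cookie header; the Processor never parses or verifies it locally.
  const res = await fetch(`${DIRECTUS_URL}/users/me`, { headers: { cookie } });
  if (!res.ok) return socket.close(4401, 'unauthorized');

  const { data } = await res.json();
  const identity = { id: data.id, role: data.role }; // bind the connection to (id, role)

  socket.on('message', (raw) => {
    // Steps 3–4: subscription registry and the cached organization check go here.
  });
});
```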
> Earlier revisions of this page described JWT-in-URL auth. That predated [[react-spa]]'s switch to Directus SDK session-mode auth (see log entry 2026-05-02 "Auth-mode wiki realignment"). The current implementation is cookie-based; tokens never appear in WebSocket URLs (which would land them in proxy logs).
## Failure modes
| Failure | Effect on durable storage | Effect on live channel |
@@ -107,7 +113,7 @@ At pilot scale (≤500 devices per event, tens of viewers), the dominant costs a
When this becomes wrong:
- Sustained > ~10k WebSocket messages/sec total → consider sharding the broadcast path or extracting to a dedicated gateway service.
- Connection-time auth becomes a thundering herd at race start with thousands of viewers → cache JWT verification locally and shorten the Directus permission check via a token-with-scope pattern.
- Connection-time auth becomes a thundering herd at race start with thousands of viewers → cache the `/users/me` validation result for the connection's lifetime and shorten the Directus permission check via a token-with-scope pattern. Pilot scale doesn't need this; revisit when measured.
- Multi-data-center deployment → revisit the consumer-group fan-out strategy; per-region broadcast may be cleaner than global.
The escape hatch is well-defined: lift the WebSocket endpoint code out of the Processor into a standalone service that subscribes to the same `live-broadcast-*` consumer group. The Redis-stream-in / WebSocket-out contract doesn't change; only the host process does.
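For concreteness, the Redis-stream-in / WebSocket-out hot path looks roughly like this (ioredis assumed; the stream name, payload field and consumer-group name are placeholders — only the `live-broadcast-*` group naming comes from this page):

```ts
import Redis from 'ioredis';
import type { WebSocket } from 'ws';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://redis:6379');

// topic ('event:<uuid>') → live sockets; maintained by the subscribe/unsubscribe handlers
const subscriptions = new Map<string, Set<WebSocket>>();

async function broadcastLoop(): Promise<never> {
  for (;;) {
    // Block up to 5 s waiting for new positions on the broadcast consumer group.
    const batch: any = await redis.xreadgroup(
      'GROUP', 'live-broadcast-stage', 'processor-1',
      'BLOCK', 5000, 'COUNT', 100,
      'STREAMS', 'positions-stream', '>',
    );
    if (!batch) continue;
    for (const [, entries] of batch) {
      for (const [id, fields] of entries) {
        const position = JSON.parse(fields[1]); // assumes a single JSON-encoded payload field
        const clients = subscriptions.get(`event:${position.event_id}`);
        if (clients) {
          const frame = JSON.stringify({ type: 'position', data: position });
          for (const ws of clients) ws.send(frame);
        }
        await redis.xack('positions-stream', 'live-broadcast-stage', id);
      }
    }
  }
}
```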
@@ -116,7 +122,7 @@ The escape hatch is well-defined: lift the WebSocket endpoint code out of the Pr
- [[processor]] grows a public-facing WebSocket endpoint in addition to its existing Redis consumer and Postgres writer.
- [[directus]] keeps its built-in WebSocket subscriptions for tables it writes to. Its real-time delivery section no longer claims to broadcast direct writes from [[processor]] — that's a documented mistake corrected in this revision.
- [[react-spa]] connects to two WebSocket endpoints: Directus for admin/business updates, Processor for live position firehose. Same JWT-based auth on both. Consumer-side throughput discipline (rAF coalescing of incoming positions before reducer dispatch) is documented in [[maps-architecture]] — without it the per-message dispatch pattern observed in [[traccar-maps-architecture]] cascades through selectors and `setData` at every position arrival.
- [[react-spa]] connects to two WebSocket endpoints: Directus at `/ws-business` for admin/business updates, Processor at `/ws-live` for live position firehose. Same-origin httpOnly Directus session cookie on both — no separate auth artifact for the live channel. Consumer-side throughput discipline (rAF coalescing of incoming positions before reducer dispatch) is documented in [[maps-architecture]] — without it the per-message dispatch pattern observed in [[traccar-maps-architecture]] cascades through selectors and `setData` at every position arrival.
- The deploy stack publishes the Processor's WebSocket port (with TLS termination at a reverse proxy in front).
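A sketch of the rAF coalescing discipline referenced in the [[react-spa]] bullet above — names and the store shape are illustrative; the point is one reducer dispatch per animation frame instead of one per WebSocket message:

```ts
type Position = { device_id: string; ts: string; lat: number; lon: number }; // illustrative shape
declare const liveSocket: WebSocket;
declare function dispatch(action: { type: string; payload: unknown }): void;

let pending: Position[] = [];
let flushScheduled = false;

liveSocket.addEventListener('message', (ev) => {
  const msg = JSON.parse(ev.data);
  if (msg.type !== 'position') return;
  pending.push(msg.data);
  if (!flushScheduled) {
    flushScheduled = true;
    requestAnimationFrame(() => {
      dispatch({ type: 'positions/batchReceived', payload: pending }); // one dispatch per frame
      pending = [];
      flushScheduled = false;
    });
  }
});
```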
## Why not a single WebSocket endpoint
@@ -130,6 +136,6 @@ Two endpoints, each serving the writes its plane manages, is the architecturally
## Open questions
- **JWT validation strategy.** Round-trip to Directus's `/users/me` (no shared secret, ~20ms per connection) vs. local verification with Directus's signing key (no round-trip, but a secret to share). Pilot can start with round-trip; revisit if connection rates climb.
- **Auth caching strategy.** Currently every WebSocket connection round-trips to Directus's `/users/me` (~20ms over the internal network) to validate the forwarded session cookie. At pilot scale (≤500 viewers, low reconnect rate) this is trivial. Caching the validation per-connection-lifetime is the cheap optimisation; a stateless verification path (shared signing secret) is the heavier one. Defer until measurements demand it.
- **Subscription model.** Per-event, per-stage, per-organization, or arbitrary filter expressions? The simplest pilot model is "subscribe to one event by ID"; extensions land when SPA UX demands them.
- **Permission staleness.** If a user is removed from an organization mid-session, do their existing subscriptions silently keep delivering until reconnect? Either re-validate periodically, or accept "trust the session" for pilot.
@@ -2,7 +2,7 @@
title: Processor WebSocket contract
type: synthesis
created: 2026-05-02
updated: 2026-05-02
updated: 2026-05-03
sources: [gps-tracking-architecture, traccar-maps-architecture]
tags: [websocket, protocol, contract, telemetry-plane, decision]
---
@@ -15,25 +15,25 @@ This page is the protocol spec. The architectural rationale lives in [[live-chan
## Implementation status
**Planned as `processor` Phase 1.5 — Live broadcast.** Six tasks in `trm/processor/.planning/phase-1-5-live-broadcast/`: WS server scaffold + heartbeat, cookie auth handshake, subscription registry & per-event authorization, broadcast consumer group & fan-out, snapshot-on-subscribe, integration test. Status ⬜ Not started; sequenced as 1.5.1 → 1.5.2 → 1.5.3 → (1.5.4 ‖ 1.5.5) → 1.5.6.
**Shipped as `processor` Phase 1.5 — Live broadcast** (landed 2026-05-02). All six tasks merged: 1.5.1 WS server scaffold + heartbeat, 1.5.2 cookie auth handshake, 1.5.3 subscription registry & per-event authorization, 1.5.4 broadcast consumer group & fan-out, 1.5.5 snapshot-on-subscribe, 1.5.6 integration test. 178/178 unit tests + 6 integration scenarios green.
The endpoint is hosted *inside* the Processor process (as [[processor]] and [[live-channel-architecture]] specify). Lifting it into a separate `live-gateway` service is the documented escape hatch in [[live-channel-architecture]] §"Scale considerations" if sustained > 10k WS messages/sec demands it — not the starting point.
The endpoint is hosted *inside* the Processor process (as [[processor]] and [[live-channel-architecture]] specify). Lifting it into a separate `live-gateway` service remains the documented escape hatch in [[live-channel-architecture]] §"Scale considerations" if sustained > 10k WS messages/sec ever demands it — not currently planned.
This contract is implementation-agnostic in the sense that the wire format wouldn't change if we ever did lift the endpoint out — only the host process would. SPA work can build against the contract independently of the Processor task sequence as long as it doesn't ship to stage before Phase 1.5 lands.
This contract is implementation-agnostic in the sense that the wire format wouldn't change if we ever did lift the endpoint out — only the host process would.
## Endpoint
```
wss://<one-public-origin>/processor/ws
wss://<env>.dev.trmtracking.org/ws-live
```
Served behind the same reverse proxy that fronts [[directus]] and the [[react-spa]] static bundle. **Single origin is non-negotiable** — same-origin is what allows the auth cookie to flow with the WebSocket upgrade request (see Auth handshake below).
Path **`/ws-live`** (locked 2026-05-03). The companion business-plane channel hosted by [[directus]] is at **`/ws-business`** on the same origin (proxy-rewritten to Directus's native `/websocket`). Both names are read by the SPA from `/config.json` (`liveWsUrl` and `businessWsUrl`).
The path `/processor/ws` is illustrative; final path determined by the proxy routing rules. Whatever it is, the SPA reaches it as a relative URL, never a cross-origin URL.
Served behind the same Traefik instance that fronts [[directus]] and the [[react-spa]] static bundle on the Komodo host (per `NEW-HOST-KOMODO-TRAEFIK.md`). **Single origin is non-negotiable** — same-origin is what allows the auth cookie to flow with the WebSocket upgrade request (see Auth handshake below). The SPA reaches the endpoint as a relative URL, never a cross-origin URL.
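For illustration, the SPA side of that contract might look like this (the two key names are from this page; everything else is a sketch):

```ts
// Fetch runtime config, then open both sockets against the current origin — never cross-origin.
const cfg: { liveWsUrl: string; businessWsUrl: string } = await (await fetch('/config.json')).json();

const wsBase = `${location.protocol === 'https:' ? 'wss' : 'ws'}://${location.host}`;
const liveSocket = new WebSocket(wsBase + cfg.liveWsUrl);         // wss://<origin>/ws-live
const businessSocket = new WebSocket(wsBase + cfg.businessWsUrl); // wss://<origin>/ws-business
```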
## Transport
- **Protocol:** WebSocket (RFC 6455) over TLS at the edge. Internal hop from the proxy to the producer is plain WS on the `trm_default` Compose network.
- **Protocol:** WebSocket (RFC 6455) over TLS at the edge. Internal hop from Traefik to the producer is plain WS on the deploy stack's default Compose network plus the external `proxy` network shared with Traefik.
- **Subprotocol:** none required. Future versions may add a `Sec-WebSocket-Protocol` of `trm.live.v1` if we need to negotiate versions; for now the path is the version.
- **Frame format:** text frames, JSON-encoded. No binary frames. (If we ever need to ship raw position bytes for a high-frequency optimisation, that's a v2 concern.)
- **Heartbeat:** the producer sends a ping every 30 s; the consumer responds. Consumer-side liveness is enforced by `setInterval` checking time-since-last-message > 60s ⇒ reconnect.
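The consumer-side liveness rule, spelled out (threshold from the bullet above; the reconnect hook is illustrative):

```ts
declare const liveSocket: WebSocket;
declare function reconnect(): void; // application-level backoff + resubscribe, not shown

let lastMessageAt = Date.now();
liveSocket.addEventListener('message', () => { lastMessageAt = Date.now(); });

const liveness = setInterval(() => {
  if (Date.now() - lastMessageAt > 60_000) { // nothing received for 60 s ⇒ assume dead link
    clearInterval(liveness);
    liveSocket.close();
    reconnect();
  }
}, 10_000);
```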
@@ -229,6 +229,33 @@ Pilot-scale targets (subject to revision after first dogfood):
If a slow consumer can't drain its queue, the server **drops oldest position messages** for that connection (per-device; latest position is always preserved). Position data is always-fresh — backlog isn't valuable. Only `subscribed`/`unsubscribed`/`error` control messages are guaranteed delivery.
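One way to realise that rule in the producer (data structure illustrative): queue control messages verbatim, but keep only the newest pending frame per device:

```ts
type Outbound = { control: string[]; latestByDevice: Map<string, string> };

function enqueue(out: Outbound, msg: { type: string; data?: { device_id: string } }, frame: string) {
  if (msg.type === 'position' && msg.data) {
    // A newer position for the same device overwrites the stale one — backlog isn't valuable.
    out.latestByDevice.set(msg.data.device_id, frame);
  } else {
    out.control.push(frame); // subscribed / unsubscribed / error: guaranteed delivery
  }
}

function drain(out: Outbound, send: (frame: string) => void) {
  for (const frame of out.control) send(frame);
  out.control.length = 0;
  for (const frame of out.latestByDevice.values()) send(frame);
  out.latestByDevice.clear();
}
```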
## Deployment
The endpoint terminates inside the [[processor]] container. Public routing is handled by Traefik on the Komodo host via Docker container labels — no nginx, no openresty, no NPM in the deploy repo. See `NEW-HOST-KOMODO-TRAEFIK.md` for the platform-wide infra contract and the per-host path map.
Concrete shape (placeholder host; replace with the per-environment hostname):
```yaml
processor:
  networks: [default, proxy]
  labels:
    - "traefik.enable=true"
    - "traefik.docker.network=proxy"
    - "traefik.http.routers.processor-live.rule=Host(`<env>.dev.trmtracking.org`) && PathPrefix(`/ws-live`)"
    - "traefik.http.routers.processor-live.entrypoints=websecure"
    - "traefik.http.routers.processor-live.tls=true"
    - "traefik.http.routers.processor-live.priority=100"
    - "traefik.http.services.processor-live.loadbalancer.server.port=<PROCESSOR_WS_PORT>"
```
Three things this depends on:
- **Same origin as Directus and the SPA.** All three answer on the same hostname; Traefik routes by path. The cookie auth handshake described above requires this — different origins block the cookie flow on the WebSocket upgrade.
- **Traefik handles WS upgrade transparently.** No `proxy_http_version` / `Upgrade` / `Connection` header gymnastics required (those were artifacts of the legacy nginx-proxy-manager + openresty setup). Traefik v3 negotiates the upgrade based on the request headers alone.
- **Cookie header forwarding.** The default Traefik forward strategy preserves cookies across the upgrade. Don't introduce middlewares that strip headers between the SPA and the processor — the producer needs the entire `Cookie` header to forward to Directus's `/users/me`.
`<PROCESSOR_WS_PORT>` is the port the Phase 1.5 WS server binds; pin it in the processor service's compose definition and reference it consistently.
## Versioning
This is `v1`. Breaking changes (renaming fields, changing semantics) require: