Compare commits


2 Commits

4 changed files with 72 additions and 21 deletions
+2 -1
@@ -3,4 +3,5 @@
 .obsidian/workspace-mobile.json
 .obsidian/cache
 .obsidian/plugins/*/data.json
 .claude/
+.playwright-mcp
+17
@@ -179,3 +179,20 @@ Created in `trm/spa/.planning/`:
 Each task file follows the existing Goal / Deliverables / Specification / Acceptance / Risks / Done shape so an implementer agent can pick one up self-contained. Phase 1 sequencing: 1.2 → 1.3 → 1.4 → 1.5 → (1.6 ‖ 1.7) → 1.8, with 1.9+1.10 (deploy plumbing) developable in parallel after 1.3 lands.
 End state of Phase 1: a deployable empty shell — auth + protected routes + login/logout + CI + compose deploy block. End state of Phase 2: the dogfood-day deliverable. End state of Phase 3: actually fielded for race operators on race day, not just a tech demo.
+## [2026-05-03] note | Stage incident — positions dropped + WS unreachable + wiki realignment
+Stage incident chain. Processor logged `relation "positions" does not exist` despite Redis consumer batches succeeding. Root cause: the `migrations_applied` guard table retained checksum rows for the three pre-schema migrations but the actual `positions` table had been dropped out-of-band (other 42 user tables intact — looks like a targeted `DROP TABLE` rather than a schema reset). `directus/scripts/apply-db-init.sh` trusts the guard table exclusively (no post-skip schema verification), so the runner logged `skip` for `001/002/003` and never re-applied. Fix: `DELETE FROM migrations_applied WHERE filename IN ('002_positions_hypertable.sql', '003_faulty_column.sql')`, restart Directus; idempotent SQL re-created the table; processor caught up immediately.
+With the table back, the live WS still failed handshake (HTTP 400 in the browser). Diagnosis via Playwright + manual `fetch` probes against `/ws-live` and `/ws-business`: `/ws-live` returns openresty's empty 404 (no upstream routed); `/ws-business` returns Directus's "Route /websocket doesn't exist" (proxy is forwarding as plain HTTP, not as a WS upgrade). The migration off Portainer + nginx-proxy-manager onto Komodo + Traefik documented in `NEW-HOST-KOMODO-TRAEFIK.md` is incomplete for the `new.stage.trmtracking.org` hostname — the deploy compose for `processor` + `directus` is missing Traefik label blocks for `/ws-live` and `/ws-business`, and the Directus container needs `WEBSOCKETS_ENABLED=true` (+ `WEBSOCKETS_HEARTBEAT_ENABLED=true`).
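For reference, the probe pattern reduces to a few lines — a sketch, assuming it runs in the browser console on the stage origin (the exact probes weren't captured in the log):
```ts
// Plain GETs are enough to see which layer answers each path — no WS upgrade
// needed to distinguish "no upstream routed" from "forwarded as plain HTTP".
for (const path of ["/ws-live", "/ws-business"]) {
  const res = await fetch(path);
  console.log(path, res.status, (await res.text()).slice(0, 120));
}
// Observed: /ws-live → openresty's empty 404; /ws-business → Directus's
// "Route /websocket doesn't exist".
```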
+Wiki realignments landed in this session:
+- [[processor-ws-contract]] — Phase 1.5 status flipped ⬜ → ✅ (the implementation landed 2026-05-02 per the "Auth-mode wiki realignment" entry above; the contract page hadn't caught up). Locked in the chosen paths (`/ws-live` for processor, `/ws-business` for Directus) — the page previously called `/processor/ws` "illustrative." Added a "Deployment" section with the Traefik label shape and the three things it depends on (same-origin, transparent upgrade, cookie forwarding). Reworded the Transport note about the internal hop to reference the `proxy` external network instead of the legacy `trm_default`.
+- [[live-channel-architecture]] — replaced the JWT-in-URL handshake description with the cookie-based same-origin handshake that has been truth since 2026-05-02. Generalised the "JWT validation strategy" open question to an "Auth caching strategy" question. Updated cross-references that still mentioned JWT-based auth on both endpoints. Added a callout footnote pointing readers at the auth-mode realignment log entry.
+- `NEW-HOST-KOMODO-TRAEFIK.md` (parent of `docs/`, infra contract not in wiki/) — added a "Per-host path map" section enumerating `/`, `/api`, `/ws-business`, `/ws-live` so the WebSocket paths are part of the documented infra contract instead of implicit.
+**Two architectural notes deferred (not done in this session, captured here so they're not forgotten).**
+1. **Runner gap — `apply-db-init.sh` doesn't verify schema state on subsequent boots.** The runner records success in `migrations_applied` and trusts that exclusively; in-file assertion blocks (e.g. the `DO $$ ... RAISE EXCEPTION` block at the bottom of `002_positions_hypertable.sql`) only run during apply, not on skip. Out-of-band drops produce silent drift — exactly today's failure mode. Two cheap mitigations: (a) re-run idempotent files unconditionally (cheap given `IF NOT EXISTS` everywhere), or (b) per-migration `_check.sql` companion files the runner executes even when skipping. Worth a hardening task in directus's planning; a sketch of the missing check follows this list.
+2. **Positions hypertable as a Directus collection — primary-key blocker.** Discussed the design tension: positions DDL lives in `directus/db-init/` (TimescaleDB-specific, must exist before Directus boots), but Directus refuses to register the table as a collection because `002_positions_hypertable.sql` deliberately omits a PRIMARY KEY (per its divergence note 6, calling unique-index "more idiomatic" for hypertables). Directus introspection requires a PK to expose the table — log evidence: `WARN: Collection "positions" doesn't have a primary key column and will be ignored`. To enable the operator `faulty` workflow described in [[directus-schema-draft]], a future migration `004_positions_primary_key.sql` would `ALTER TABLE positions ADD PRIMARY KEY (device_id, ts)` and `DROP INDEX positions_device_ts` (now redundant). PKs that include the partition column are legal on hypertables; the divergence note's preference for unique-index is a soft style choice, not a correctness constraint. Not done in this session — pending user go-ahead.
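The missing check from note 1 could be as small as this — a sketch in TypeScript against `pg` for illustration (the real runner is a shell script; the table and guard names mirror the incident above):
```ts
import { Client } from "pg";

// Post-skip verification: even when migrations_applied says 002 ran, assert
// the object it creates still exists before trusting the skip.
async function verifyPositionsTable(client: Client): Promise<void> {
  const { rows } = await client.query(
    "SELECT to_regclass('public.positions') AS tbl"
  );
  if (rows[0].tbl === null) {
    // Drift: guard table says applied, schema says otherwise — queue re-apply.
    await client.query(
      "DELETE FROM migrations_applied WHERE filename = '002_positions_hypertable.sql'"
    );
    console.warn("positions missing despite applied marker; re-apply queued");
  }
}
```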
+18 -12
@@ -2,7 +2,7 @@
 title: Live channel architecture
 type: concept
 created: 2026-05-01
-updated: 2026-05-01
+updated: 2026-05-03
 sources: []
 tags: [architecture, realtime, websocket, telemetry-plane, decision]
 ---
@@ -59,22 +59,28 @@ Each kind of data takes the path that fits it. No bridges, no extensions inside
 ## Authorization flow
-The Processor's WebSocket endpoint validates connections through Directus, but never asks Directus per record.
+The Processor's WebSocket endpoint validates connections through Directus, but never asks Directus per record. The handshake is **cookie-based and same-origin** — see [[processor-ws-contract]] §"Auth handshake" for the wire-level spec.
 ```
-1. SPA opens wss://processor.../live with a Directus-issued JWT.
-2. Processor validates the JWT (round-trip to Directus's /users/me, or local
-   verification with Directus's signing secret). Failure → close socket.
-3. SPA sends {type: 'subscribe', event_id: 42}.
-4. Processor calls Directus once: GET /items/events/42 with the user's token.
-   200 → allow subscription, store {client → event_id} in memory.
-   403 → reject subscription with a clear error.
+1. SPA opens wss://<origin>/ws-live (relative URL; same origin as Directus).
+   Browser auto-attaches the httpOnly Directus session cookie.
+2. Processor reads the entire Cookie header from the upgrade request and
+   forwards it to Directus GET /users/me.
+   200 → bind the connection to (id, role).
+   401/403 → close the socket with code 4401 (unauthorized).
+3. SPA sends {type: 'subscribe', topic: 'event:<uuid>'}.
+4. Processor checks the user's organization_users membership against the
+   event's organization_id (one cached lookup per event).
+   200 → store {client → topic}; reply with the latest-position snapshot.
+   403 → reply with {type: 'error', code: 'forbidden'}.
 5. For every position arriving on Redis, match against in-memory subscriptions
    and push to matched clients. Zero Directus calls in the hot path.
 ```
 Connection-time auth is amortized over session lifetime. Permission re-checks happen on subscription change, not on every record. The hot path is bounded by `O(positions × subscribed-clients-per-event)` and runs entirely on the Processor's event loop with in-memory state.
+> Earlier revisions of this page described JWT-in-URL auth. That predated [[react-spa]]'s switch to Directus SDK session-mode auth (see log entry 2026-05-02 "Auth-mode wiki realignment"). The current implementation is cookie-based; tokens never appear in WebSocket URLs (which would land them in proxy logs).
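Server-side, steps 1–2 of the flow come down to a handful of lines. A sketch assuming a Node `ws` server — the port, env var name, and session map are illustrative, not the shipped Phase 1.5 code:
```ts
import { WebSocketServer, WebSocket } from "ws";
import type { IncomingMessage } from "http";

const sessions = new WeakMap<WebSocket, { id: string; role: string }>();
const wss = new WebSocketServer({ port: 8081 }); // placeholder port

wss.on("connection", async (socket: WebSocket, req: IncomingMessage) => {
  // Forward the browser's entire Cookie header — the Processor never parses
  // or verifies the Directus session itself.
  const res = await fetch(`${process.env.DIRECTUS_INTERNAL_URL}/users/me`, {
    headers: { cookie: req.headers.cookie ?? "" },
  });
  if (!res.ok) {
    socket.close(4401, "unauthorized"); // 401/403 from Directus
    return;
  }
  const { data } = await res.json();
  sessions.set(socket, { id: data.id, role: data.role }); // bind (id, role)
});
```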
 ## Failure modes
 | Failure | Effect on durable storage | Effect on live channel |
@@ -107,7 +113,7 @@ At pilot scale (≤500 devices per event, tens of viewers), the dominant costs a
 When this becomes wrong:
 - Sustained > ~10k WebSocket messages/sec total → consider sharding the broadcast path or extracting to a dedicated gateway service.
-- Connection-time auth becomes a thundering herd at race start with thousands of viewers → cache JWT verification locally and shorten the Directus permission check via a token-with-scope pattern.
+- Connection-time auth becomes a thundering herd at race start with thousands of viewers → cache the `/users/me` validation result for the connection's lifetime and shorten the Directus permission check via a token-with-scope pattern. Pilot scale doesn't need this; revisit when measured.
 - Multi-data-center deployment → revisit the consumer-group fan-out strategy; per-region broadcast may be cleaner than global.
 The escape hatch is well-defined: lift the WebSocket endpoint code out of the Processor into a standalone service that subscribes to the same `live-broadcast-*` consumer group. The Redis-stream-in / WebSocket-out contract doesn't change; only the host process does.
@@ -116,7 +122,7 @@ The escape hatch is well-defined: lift the WebSocket endpoint code out of the Pr
 - [[processor]] grows a public-facing WebSocket endpoint in addition to its existing Redis consumer and Postgres writer.
 - [[directus]] keeps its built-in WebSocket subscriptions for tables it writes to. Its real-time delivery section no longer claims to broadcast direct writes from [[processor]] — that's a documented mistake corrected in this revision.
-- [[react-spa]] connects to two WebSocket endpoints: Directus for admin/business updates, Processor for live position firehose. Same JWT-based auth on both. Consumer-side throughput discipline (rAF coalescing of incoming positions before reducer dispatch) is documented in [[maps-architecture]] — without it the per-message dispatch pattern observed in [[traccar-maps-architecture]] cascades through selectors and `setData` at every position arrival.
+- [[react-spa]] connects to two WebSocket endpoints: Directus at `/ws-business` for admin/business updates, Processor at `/ws-live` for live position firehose. Same-origin httpOnly Directus session cookie on both — no separate auth artifact for the live channel. Consumer-side throughput discipline (rAF coalescing of incoming positions before reducer dispatch) is documented in [[maps-architecture]] — without it the per-message dispatch pattern observed in [[traccar-maps-architecture]] cascades through selectors and `setData` at every position arrival.
 - The deploy stack publishes the Processor's WebSocket port (with TLS termination at a reverse proxy in front).
 ## Why not a single WebSocket endpoint
@@ -130,6 +136,6 @@ Two endpoints, each serving the writes its plane manages, is the architecturally
 ## Open questions
-- **JWT validation strategy.** Round-trip to Directus's `/users/me` (no shared secret, ~20ms per connection) vs. local verification with Directus's signing key (no round-trip, but a secret to share). Pilot can start with round-trip; revisit if connection rates climb.
+- **Auth caching strategy.** Currently every WebSocket connection round-trips to Directus's `/users/me` (~20ms over the internal network) to validate the forwarded session cookie. At pilot scale (≤500 viewers, low reconnect rate) this is trivial. Caching the validation per-connection-lifetime is the cheap optimisation; a stateless verification path (shared signing secret) is the heavier one. Defer until measurements demand it.
 - **Subscription model.** Per-event, per-stage, per-organization, or arbitrary filter expressions? The simplest pilot model is "subscribe to one event by ID"; extensions land when SPA UX demands them.
 - **Permission staleness.** If a user is removed from an organization mid-session, do their existing subscriptions silently keep delivering until reconnect? Either re-validate periodically, or accept "trust the session" for pilot (one possible shape sketched below).
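One possible shape for the "re-validate periodically" option — a sketch; the interval, registry shape, and env var are assumptions, not shipped code:
```ts
import type { WebSocket } from "ws";

// Connections registered at handshake time, keyed to the cookie they carried.
const liveConnections = new Map<WebSocket, { cookie: string }>();

setInterval(async () => {
  for (const [socket, { cookie }] of liveConnections) {
    const res = await fetch(`${process.env.DIRECTUS_INTERNAL_URL}/users/me`, {
      headers: { cookie },
    });
    if (!res.ok) socket.close(4401, "session revoked"); // removed mid-session → drop
  }
}, 5 * 60_000);
```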
+35 -8
@@ -2,7 +2,7 @@
 title: Processor WebSocket contract
 type: synthesis
 created: 2026-05-02
-updated: 2026-05-02
+updated: 2026-05-03
 sources: [gps-tracking-architecture, traccar-maps-architecture]
 tags: [websocket, protocol, contract, telemetry-plane, decision]
 ---
@@ -15,25 +15,25 @@ This page is the protocol spec. The architectural rationale lives in [[live-chan
 ## Implementation status
-**Planned as `processor` Phase 1.5 — Live broadcast.** Six tasks in `trm/processor/.planning/phase-1-5-live-broadcast/`: WS server scaffold + heartbeat, cookie auth handshake, subscription registry & per-event authorization, broadcast consumer group & fan-out, snapshot-on-subscribe, integration test. Status ⬜ Not started; sequenced as 1.5.1 → 1.5.2 → 1.5.3 → (1.5.4 ‖ 1.5.5) → 1.5.6.
+**Shipped as `processor` Phase 1.5 — Live broadcast** (landed 2026-05-02). All six tasks merged: 1.5.1 WS server scaffold + heartbeat, 1.5.2 cookie auth handshake, 1.5.3 subscription registry & per-event authorization, 1.5.4 broadcast consumer group & fan-out, 1.5.5 snapshot-on-subscribe, 1.5.6 integration test. 178/178 unit tests + 6 integration scenarios green.
-The endpoint is hosted *inside* the Processor process (as [[processor]] and [[live-channel-architecture]] specify). Lifting it into a separate `live-gateway` service is the documented escape hatch in [[live-channel-architecture]] §"Scale considerations" if sustained > 10k WS messages/sec demands it — not the starting point.
+The endpoint is hosted *inside* the Processor process (as [[processor]] and [[live-channel-architecture]] specify). Lifting it into a separate `live-gateway` service remains the documented escape hatch in [[live-channel-architecture]] §"Scale considerations" if sustained > 10k WS messages/sec ever demands it — not currently planned.
-This contract is implementation-agnostic in the sense that the wire format wouldn't change if we ever did lift the endpoint out — only the host process would. SPA work can build against the contract independently of the Processor task sequence as long as it doesn't ship to stage before Phase 1.5 lands.
+This contract is implementation-agnostic in the sense that the wire format wouldn't change if we ever did lift the endpoint out — only the host process would.
 ## Endpoint
 ```
-wss://<one-public-origin>/processor/ws
+wss://<env>.dev.trmtracking.org/ws-live
 ```
-Served behind the same reverse proxy that fronts [[directus]] and the [[react-spa]] static bundle. **Single origin is non-negotiable** — same-origin is what allows the auth cookie to flow with the WebSocket upgrade request (see Auth handshake below).
-The path `/processor/ws` is illustrative; final path determined by the proxy routing rules. Whatever it is, the SPA reaches it as a relative URL, never a cross-origin URL.
+Path **`/ws-live`** (locked 2026-05-03). The companion business-plane channel hosted by [[directus]] is at **`/ws-business`** on the same origin (proxy-rewritten to Directus's native `/websocket`). Both names are read by the SPA from `/config.json` (`liveWsUrl` and `businessWsUrl`).
+Served behind the same Traefik instance that fronts [[directus]] and the [[react-spa]] static bundle on the Komodo host (per `NEW-HOST-KOMODO-TRAEFIK.md`). **Single origin is non-negotiable** — same-origin is what allows the auth cookie to flow with the WebSocket upgrade request (see Auth handshake below). The SPA reaches the endpoint as a relative URL, never a cross-origin URL.
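SPA-side consumption of those two config keys could look like this — a sketch with field names per this contract; error handling elided:
```ts
interface RuntimeConfig {
  liveWsUrl: string;     // e.g. "/ws-live"
  businessWsUrl: string; // e.g. "/ws-business"
}

const config: RuntimeConfig = await (await fetch("/config.json")).json();

// Relative path resolves against the single public origin, then the scheme
// flips to the matching WebSocket scheme.
const wsUrl = new URL(config.liveWsUrl, window.location.href);
wsUrl.protocol = wsUrl.protocol === "https:" ? "wss:" : "ws:";
const live = new WebSocket(wsUrl);
```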
 ## Transport
-- **Protocol:** WebSocket (RFC 6455) over TLS at the edge. Internal hop from the proxy to the producer is plain WS on the `trm_default` Compose network.
+- **Protocol:** WebSocket (RFC 6455) over TLS at the edge. Internal hop from Traefik to the producer is plain WS on the deploy stack's default Compose network plus the external `proxy` network shared with Traefik.
 - **Subprotocol:** none required. Future versions may add a `Sec-WebSocket-Protocol` of `trm.live.v1` if we need to negotiate versions; for now the path is the version.
 - **Frame format:** text frames, JSON-encoded. No binary frames. (If we ever need to ship raw position bytes for a high-frequency optimisation, that's a v2 concern.)
 - **Heartbeat:** the producer sends a ping every 30 s; the consumer responds. Consumer-side liveness is enforced by `setInterval` checking time-since-last-message > 60s ⇒ reconnect (sketched below).
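A sketch of that consumer-side liveness rule (assumes the producer's 30 s heartbeat arrives as a regular message so `onmessage` fires; the 10 s check interval is an assumption):
```ts
let lastMessageAt = Date.now();
let socket = connect();

function connect(): WebSocket {
  const ws = new WebSocket(`wss://${location.host}/ws-live`); // same origin
  ws.onmessage = () => { lastMessageAt = Date.now(); };
  return ws;
}

setInterval(() => {
  if (Date.now() - lastMessageAt > 60_000) { // > 60 s silence ⇒ reconnect
    socket.close();
    socket = connect();
    lastMessageAt = Date.now(); // reset so the checker doesn't re-trip mid-reconnect
  }
}, 10_000);
```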
@@ -229,6 +229,33 @@ Pilot-scale targets (subject to revision after first dogfood):
 If a slow consumer can't drain its queue, the server **drops oldest position messages** for that connection (per-device; latest position is always preserved). Position data is always-fresh — backlog isn't valuable. Only `subscribed`/`unsubscribed`/`error` control messages are guaranteed delivery.
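The drop-oldest bookkeeping amounts to a per-connection latest-wins map — a sketch; the type and class shape are illustrative:
```ts
type Position = { device_id: string; [k: string]: unknown };

// Per slow connection: keep only the newest undelivered position per device.
// Control messages (subscribed/unsubscribed/error) bypass this queue entirely.
class PerConnectionQueue {
  private latest = new Map<string, Position>(); // device_id → newest position

  push(pos: Position): void {
    this.latest.set(pos.device_id, pos); // overwrite = drop the stale backlog
  }

  drain(): Position[] {
    const batch = [...this.latest.values()];
    this.latest.clear();
    return batch;
  }
}
```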
+## Deployment
+The endpoint terminates inside the [[processor]] container. Public routing is handled by Traefik on the Komodo host via Docker container labels — no nginx, no openresty, no NPM in the deploy repo. See `NEW-HOST-KOMODO-TRAEFIK.md` for the platform-wide infra contract and the per-host path map.
+Concrete shape (placeholder host; replace with the per-environment hostname):
+```yaml
+processor:
+  networks: [default, proxy]
+  labels:
+    - "traefik.enable=true"
+    - "traefik.docker.network=proxy"
+    - "traefik.http.routers.processor-live.rule=Host(`<env>.dev.trmtracking.org`) && PathPrefix(`/ws-live`)"
+    - "traefik.http.routers.processor-live.entrypoints=websecure"
+    - "traefik.http.routers.processor-live.tls=true"
+    - "traefik.http.routers.processor-live.priority=100"
+    - "traefik.http.services.processor-live.loadbalancer.server.port=<PROCESSOR_WS_PORT>"
+```
+Three things this depends on:
+- **Same origin as Directus and the SPA.** All three answer on the same hostname; Traefik routes by path. The cookie auth handshake described above requires this — different origins block the cookie flow on the WebSocket upgrade.
+- **Traefik handles WS upgrade transparently.** No `proxy_http_version` / `Upgrade` / `Connection` header gymnastics required (those were artifacts of the legacy nginx-proxy-manager + openresty setup). Traefik v3 negotiates the upgrade based on the request headers alone.
+- **Cookie header forwarding.** The default Traefik forward strategy preserves cookies across the upgrade. Don't introduce middlewares that strip headers between the SPA and the processor — the producer needs the entire `Cookie` header to forward to Directus's `/users/me`.
+`<PROCESSOR_WS_PORT>` is the port the Phase 1.5 WS server binds; pin it in the processor service's compose definition and reference it consistently.
 ## Versioning
 This is `v1`. Breaking changes (renaming fields, changing semantics) require: