From 9acde675d93ed4edebd09c62aa86e3521eb4626c Mon Sep 17 00:00:00 2001 From: Julian Cuni Date: Fri, 1 May 2026 10:40:12 +0200 Subject: [PATCH] Correct live-channel architecture; document dual-WebSocket design MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Researched Directus's WebSocket subscription mechanism via context7 and confirmed it only fires events for writes that go through Directus's own ItemsService. Direct INSERTs from Processor are invisible to subscribers. The previous claim in entities/directus.md that Directus broadcasts Processor's writes was wrong. New: wiki/concepts/live-channel-architecture.md captures the corrected design with three options table, chosen-architecture diagram, authorization flow, failure modes, multi-instance plumbing, scale considerations, and open questions. Chosen path: Processor exposes its own WebSocket endpoint for the high-volume telemetry firehose (authentication via Directus-issued JWT, authorization delegated to Directus once at subscribe time); Directus's built-in WebSocket covers business-plane events. Each WebSocket serves the writes its plane manages — preserves plane-separation and gives the gentlest failure mode (Directus down only blocks new authorizations). Updated: - entities/directus.md — corrected the real-time-delivery section, added pointer to the new concept page. - entities/processor.md — added Live broadcast section in responsibilities and a section explaining the dual-consumer-group plumbing for multi-instance HA. - index.md — listed the new concept. - log.md — synthesis entry for 2026-05-01 documenting the correction. 
--- index.md | 1 + log.md | 8 ++ wiki/concepts/live-channel-architecture.md | 135 +++++++++++++++++++++ wiki/entities/directus.md | 11 +- wiki/entities/processor.md | 13 +- 5 files changed, 164 insertions(+), 4 deletions(-) create mode 100644 wiki/concepts/live-channel-architecture.md diff --git a/index.md b/index.md index d74c468..11b35e4 100644 --- a/index.md +++ b/index.md @@ -24,6 +24,7 @@ Content catalog for the TRM wiki. Maintained by the LLM on every ingest. See [[C - [[codec-dispatch]] — Flat registry keyed on codec ID; the seam that makes Phase 2 additive. - [[failure-domains]] — Independent component failure behavior; database is the only SPOF. - [[io-element-bag]] — The pass-through principle for model-specific telemetry inside AVL records. +- [[live-channel-architecture]] — Dual-WebSocket design for live UX: Processor's endpoint for telemetry firehose, Directus's for business-plane updates. - [[phase-2-commands]] — Deferred design for server-to-device commands via Teltonika codecs 12/14. - [[plane-separation]] — Three-plane architecture (telemetry / business / presentation) split by data velocity and failure domain. - [[position-record]] — Boundary contract between vendor adapters and the rest of the system. diff --git a/log.md b/log.md index aabe89b..5658ef6 100644 --- a/log.md +++ b/log.md @@ -37,3 +37,11 @@ Updates to existing pages (no contradictions; refinements + additions): Cleanup: removed stale duplicate concept files from earlier passes (system-planes.md, protocol-adapter-pattern.md, codec-dispatch-registry.md) — superseded by plane-separation.md, protocol-adapter.md, codec-dispatch.md respectively. Fixed dangling [[protocol-adapter-pattern]] link in [[io-element-bag]]. Open questions surfaced by the canonical doc: Codec 16 Generation Type — promote to typed [[position-record]] field? Codec 8E NX values land as `Buffer` in `attributes`; needs explicit fixture coverage. 
SMS-based protocols (Codec 4 + binary SMS) probably out of scope but worth a deliberate decision. + +## [2026-05-01] synthesis | Live channel architecture (corrects a wiki claim) + +Researched Directus's WebSocket subscription mechanism via context7. Confirmed that subscriptions only fire for writes that go through Directus's `ItemsService` (REST/GraphQL/Admin UI mutations, not direct database INSERTs). The previous claim in [[directus]] — "When Processor writes a row, Directus broadcasts the change to subscribed clients" — was wrong. + +Wrote [[live-channel-architecture]] documenting the corrected design: two WebSocket channels, each in its own plane. Processor exposes its own WebSocket endpoint for high-volume telemetry fan-out (auth via Directus-issued JWT, authorization delegated to Directus once at subscribe time). Directus's built-in WebSocket subscriptions cover business-plane events. Reasoning: preserves [[plane-separation]] and gives the gentlest failure mode (Directus down blocks only new authorizations, not the live firehose). + +Updated [[processor]] (added Live broadcast section, multi-instance consumer-group plumbing note), [[directus]] (corrected the real-time-delivery section), and index.md. diff --git a/wiki/concepts/live-channel-architecture.md b/wiki/concepts/live-channel-architecture.md new file mode 100644 index 0000000..d5b4adb --- /dev/null +++ b/wiki/concepts/live-channel-architecture.md @@ -0,0 +1,135 @@ +--- +title: Live channel architecture +type: concept +created: 2026-05-01 +updated: 2026-05-01 +sources: [] +tags: [architecture, realtime, websocket, telemetry-plane, decision] +--- + +# Live channel architecture + +How live position data reaches the [[react-spa]] without violating [[plane-separation]] or coupling to [[directus]]'s failure domain. + +## The question + +The SPA needs sub-second updates of device positions for live race views. Three things are non-negotiable: + +1. 
The [[processor]] hot path stays direct-to-database — no API hop, no event-loop pressure on Directus. +2. [[directus]] is not in the telemetry hot path (per [[plane-separation]]). +3. The live channel must be authenticated and authorization-aware — only users with permission to see an event's positions get pushed updates. + +The naïve assumption is that [[directus]]'s built-in WebSocket subscriptions cover this. They do not. **Directus's subscription system only fires events for writes that go through its own `ItemsService`** (REST/GraphQL/Admin UI mutations). Direct `INSERT`s from the [[processor]] are invisible to subscribers — verified against Directus's documentation and source. The bridging assumption was wrong. + +This page documents how the platform actually delivers live positions. + +## Options considered + +| Option | Live channel works | Hot path stays fast | Plane separation | Failure domain | +|---|---|---|---|---| +| Route Processor writes through Directus REST | Yes (Directus broadcasts own writes) | Compromised — every write through Directus event loop | Compromised | Coupled — Directus down blocks ingestion | +| Bridge extension inside Directus (Redis → `WebSocketService.broadcast`) | Yes | Compromised — Directus runs the firehose consumer | Compromised | Coupled — Directus crash kills live channel | +| **Processor exposes its own WebSocket endpoint** (chosen) | Yes | Preserved | Preserved | Decoupled — Directus down blocks only new authorizations | + +Option 3 wins because it preserves the architectural invariants that motivated [[plane-separation]] in the first place, while still leaning on [[directus]] for authentication and authorization. 
+ +## Chosen design + +Two cleanly-separated WebSocket channels, each playing to its strength: + +``` +┌─ Telemetry plane ─────────────────────────┐ ┌─ Business plane ──────────────────────┐ +│ │ │ │ +│ Device → tcp-ingestion → Redis │ │ SPA admin action │ +│ ↓ │ │ ↓ │ +│ Processor │ │ Directus REST │ +│ ↙ ↘ │ │ ↓ │ +│ Postgres Processor's │ │ Postgres + Directus's WebSocket │ +│ WebSocket │ │ ↓ │ +│ ↓ │ │ SPA (admin UI, │ +│ SPA │ │ leaderboard refresh, │ +│ (live map) │ │ timing edits) │ +└───────────────────────────────────────────┘ └───────────────────────────────────────┘ +``` + +- **High-volume telemetry** (positions): the Processor writes directly to Postgres and *also* fans out the same records to subscribed SPA clients over its own WebSocket endpoint. Stays in the telemetry plane end-to-end. +- **Low-volume domain events** (timing records, stage results, manual entries, configuration): written via Directus's REST API; Directus's built-in subscription system broadcasts them through its WebSocket. Stays in the business plane. + +Each kind of data takes the path that fits it. No bridges, no extensions inside Directus. + +## Authorization flow + +The Processor's WebSocket endpoint validates connections through Directus, but never asks Directus per record. + +``` +1. SPA opens wss://processor.../live with a Directus-issued JWT. +2. Processor validates the JWT (round-trip to Directus's /users/me, or local + verification with Directus's signing secret). Failure → close socket. +3. SPA sends {type: 'subscribe', event_id: 42}. +4. Processor calls Directus once: GET /items/events/42 with the user's token. + 200 → allow subscription, store {client → event_id} in memory. + 403 → reject subscription with a clear error. +5. For every position arriving on Redis, match against in-memory subscriptions + and push to matched clients. Zero Directus calls in the hot path. +``` + +Connection-time auth is amortized over session lifetime. 
Permission re-checks happen on subscription change, not on every record. The hot path is bounded by `O(positions × subscribed-clients-per-event)` and runs entirely on the Processor's event loop with in-memory state. + +## Failure modes + +| Failure | Effect on durable storage | Effect on live channel | +|---|---|---| +| Processor crashes | Records pile up in Redis; Phase 3 [[failure-domains]] resumption picks them up | Live channel dies until recovery | +| Directus crashes | Unaffected (Processor writes direct to DB) | Existing connections keep working with cached permissions; **new subscriptions cannot be authorized** | +| Postgres crashes | Writes block; Redis buffers up to `MAXLEN` | Unaffected — fan-out is independent of DB state | +| Redis crashes | Whole pipeline stops | — | + +The Directus-down case is the architecturally important one. Routing writes through Directus would mean ingestion blocks. The chosen design keeps ingestion alive and only loses the ability to authorize *new* subscriptions — a much gentler failure. + +## Multi-instance Processor + +Phase 3 of [[processor]] adds a second instance for HA. Each instance has its own connected SPA clients. A position arriving on instance A wouldn't naturally reach a client connected to instance B unless the broadcast path crosses instances. + +The clean shape: each Processor reads the [[redis-streams]] stream on **two consumer groups**: + +- `processor` — the durable-write group (work-split: only one instance handles each record for the DB write). +- `live-broadcast-{instance_id}` — a per-instance fan-out group (every instance reads every record for fan-out). + +DB writes deduplicate by virtue of the consumer-group split; live broadcast deduplicates by virtue of clients being connected to exactly one instance. The Processor's [[redis-streams]] consumer code structure should anticipate this even at single-instance pilot scale. 
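The subscribe/fan-out bookkeeping above can be sketched as a small in-memory registry. This is an illustrative sketch only: the names (`SubscriptionRegistry`, `fanOut`, `Position`) are hypothetical, not the actual Processor code, and sockets are modeled as plain send callbacks so the matching logic stays independent of any particular WebSocket library.

```typescript
// Hypothetical sketch of the in-memory subscription state from step 5 above.
// Entries are added only after the one-time Directus authorization check at
// subscribe time has passed; the hot path never calls Directus.

type Send = (payload: string) => void;

interface Position {
  device_id: string;
  event_id: number;
  lat: number;
  lon: number;
  ts: number;
}

class SubscriptionRegistry {
  // event_id → set of client send callbacks.
  private byEvent = new Map<number, Set<Send>>();

  subscribe(eventId: number, send: Send): void {
    let clients = this.byEvent.get(eventId);
    if (!clients) {
      clients = new Set();
      this.byEvent.set(eventId, clients);
    }
    clients.add(send);
  }

  unsubscribe(eventId: number, send: Send): void {
    const clients = this.byEvent.get(eventId);
    if (clients) {
      clients.delete(send);
      if (clients.size === 0) this.byEvent.delete(eventId);
    }
  }

  // Hot path: one Map lookup per position, then one send per subscriber.
  // Returns the number of clients the record was pushed to.
  fanOut(position: Position): number {
    const clients = this.byEvent.get(position.event_id);
    if (!clients) return 0;
    const payload = JSON.stringify(position);
    for (const send of clients) send(payload);
    return clients.size;
  }
}
```

The hot path is a single `Map` lookup plus one send per subscriber, matching the bound stated above; in a multi-instance deployment each instance would run its own registry over its own per-instance fan-out consumer group.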
+ +## Scale considerations + +At pilot scale (≤500 devices per event, tens of viewers), the dominant costs are: + +- **Connection-time auth round-trips to Directus** — a few hundred per minute peak (race start). Trivial. +- **In-memory subscription matching** — `O(records × subscribers)`; for 500 records/sec × 20 subscribers per event, ~10k messages/sec of fan-out, comfortably sustained by a single Node process. + +When these assumptions break: + +- Sustained total throughput above ~10k WebSocket messages/sec → consider sharding the broadcast path or extracting to a dedicated gateway service. +- Connection-time auth becomes a thundering herd at race start with thousands of viewers → cache JWT verification locally and shorten the Directus permission check via a token-with-scope pattern. +- Multi-data-center deployment → revisit the consumer-group fan-out strategy; per-region broadcast may be cleaner than global. + +The escape hatch is well-defined: lift the WebSocket endpoint code out of the Processor into a standalone service that subscribes to the same `live-broadcast-*` consumer group. The Redis-stream-in / WebSocket-out contract doesn't change; only the host process does. + +## What this means for adjacent components + +- [[processor]] grows a public-facing WebSocket endpoint in addition to its existing Redis consumer and Postgres writer. +- [[directus]] keeps its built-in WebSocket subscriptions for the tables it writes to. Its real-time-delivery section no longer claims to broadcast direct writes from [[processor]] — that's a documented mistake corrected in this revision. +- [[react-spa]] connects to two WebSocket endpoints: Directus for admin/business updates, Processor for the live position firehose. Same JWT-based auth on both. +- The deploy stack publishes the Processor's WebSocket port (with TLS termination at a reverse proxy in front). + +## Why not a single WebSocket endpoint + +It would be tempting to fold everything into a single SPA-facing WebSocket — either Processor's or Directus's.
Both fail: + +- **Single Processor WebSocket** would require Processor to broadcast Directus-managed events, meaning Processor needs to subscribe to Directus's writes — which is exactly the problem we're avoiding for positions, in reverse. +- **Single Directus WebSocket** is the bridge-extension option; it loses plane separation. + +Two endpoints, each serving the writes its plane manages, is the architecturally honest answer. + +## Open questions + +- **JWT validation strategy.** Round-trip to Directus's `/users/me` (no shared secret, ~20ms per connection) vs. local verification with Directus's signing key (no round-trip, but a secret to share). Pilot can start with round-trip; revisit if connection rates climb. +- **Subscription model.** Per-event, per-stage, per-organization, or arbitrary filter expressions? The simplest pilot model is "subscribe to one event by ID"; extensions land when SPA UX demands them. +- **Permission staleness.** If a user is removed from an organization mid-session, do their existing subscriptions silently keep delivering until reconnect? Either re-validate periodically, or accept "trust the session" for pilot. diff --git a/wiki/entities/directus.md b/wiki/entities/directus.md index 7921d1d..3464013 100644 --- a/wiki/entities/directus.md +++ b/wiki/entities/directus.md @@ -2,7 +2,7 @@ title: Directus type: entity created: 2026-04-30 -updated: 2026-04-30 +updated: 2026-05-01 sources: [gps-tracking-architecture, teltonika-ingestion-architecture] tags: [service, business-plane, api] --- @@ -47,7 +47,14 @@ Used for things that genuinely belong in the business layer: ## Real-time delivery -Directus's WebSocket subscriptions push live data to the [[react-spa]]. When [[processor]] writes a row, Directus broadcasts the change to subscribed clients. Sufficient for tens to low hundreds of concurrent subscribers. 
If fan-out becomes a bottleneck, a dedicated WebSocket gateway can read directly from [[redis-streams]] and push to clients, bypassing Directus for the live channel only — REST/GraphQL stays in Directus. +Directus's WebSocket subscriptions push live data to the [[react-spa]] **for writes that go through Directus's own services** (REST, GraphQL, Admin UI, Flows, custom endpoints). The mechanism is action hooks (`action('items.create', ...)`) firing from the `ItemsService`, not Postgres-level change detection. + +This means **direct database writes from [[processor]] are not visible** to Directus's subscription system. The platform handles this with two cleanly-separated WebSocket channels: + +- **[[directus]]'s WebSocket** — broadcasts business-plane events: timing edits, configuration changes, manual entries, anything operators do through the admin UI or via [[directus]]'s API. +- **[[processor]]'s WebSocket** — broadcasts the high-volume telemetry firehose: live position updates fanned out from [[redis-streams]] directly to subscribed [[react-spa]] clients. Authentication uses Directus-issued JWTs; per-subscription authorization delegates to Directus once at subscribe time. + +See [[live-channel-architecture]] for the full design, including why this split is preferable to routing telemetry writes through [[directus]]'s API or running a bridging extension inside [[directus]]. ## Phase 2 role diff --git a/wiki/entities/processor.md b/wiki/entities/processor.md index 750d4b5..74f187b 100644 --- a/wiki/entities/processor.md +++ b/wiki/entities/processor.md @@ -2,20 +2,21 @@ title: Processor type: entity created: 2026-04-30 -updated: 2026-04-30 +updated: 2026-05-01 sources: [gps-tracking-architecture] tags: [service, telemetry-plane, domain-logic] --- # Processor -The service where domain logic lives. 
Consumes normalized telemetry from [[redis-streams]] and is responsible for per-device runtime state, applying domain rules, and writing durable state to [[postgres-timescaledb]]. +The service where domain logic lives. Consumes normalized telemetry from [[redis-streams]] and is responsible for per-device runtime state, applying domain rules, writing durable state to [[postgres-timescaledb]], and broadcasting live position updates over WebSockets to the [[react-spa]]. ## Responsibilities - Maintain **per-device runtime state** — last position, derived metrics, current zone, accumulators. - Apply **domain rules** that turn raw telemetry into meaningful events. - Write **durable state** — both raw position history and any derived events. +- **Broadcast live positions** to subscribed [[react-spa]] clients over a WebSocket endpoint. See [[live-channel-architecture]] for the full design and rationale. - Emit events for downstream consumers (Directus Flows, notification services, dashboards). Where [[tcp-ingestion]] is about throughput and protocol correctness, the Processor is about correctness of meaning. It is the component most likely to evolve as requirements grow, which is why it is isolated from the sockets on one side and the API surface on the other. @@ -34,6 +35,14 @@ The database is the source of truth for replay/analysis; in-memory state is the - For derived business entities (events, violations, alerts), the Processor writes directly to tables [[directus]] also knows about. Schema is owned by Directus; the Processor inserts rows respecting that schema. - This keeps the hot write path off the Directus HTTP stack while still letting Directus expose the data through API and admin UI. +## Live broadcast + +The Processor exposes a WebSocket endpoint that the [[react-spa]] connects to for live position updates. 
The endpoint authenticates connections by validating Directus-issued JWTs and authorizes subscriptions by delegating to Directus's permission system once at subscribe time — never per record. + +This decouples the live channel from [[directus]]'s failure domain (Directus down blocks only new authorizations, not the live firehose) and preserves [[plane-separation]] (telemetry stays in the telemetry plane end-to-end). [[directus]]'s built-in WebSocket subscriptions remain the right channel for changes to the business-plane tables it writes to (timing edits, configuration, manual overrides). See [[live-channel-architecture]] for the full design. + +In multi-instance deployments, each Processor reads the [[redis-streams]] stream on two consumer groups: a shared `processor` group for durable writes (work-split across instances) and a per-instance `live-broadcast-{instance_id}` group for fan-out (every instance reads every record for its own connected clients). + ## IO element interpretation Per-model IO mappings live here, not in the Ingestion layer. Example: `{ "FMB920": { "16": "odometer_km", "240": "movement" } }`. This is the boundary set by the [[teltonika]] adapter — Ingestion produces raw IO maps; the Processor names and interprets them.
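The mapping lookup described above can be sketched as follows. A minimal illustration only, assuming hypothetical names (`IO_MAPPINGS`, `interpretIo`) rather than the actual Processor API; the mapping table copies the `FMB920` example from the text, and unmapped IO IDs pass through under their numeric key, in the spirit of the [[io-element-bag]] pass-through principle.

```typescript
// Hypothetical sketch: per-model IO mappings and the lookup that turns a raw
// IO bag from Ingestion into named fields. Unknown IO IDs are passed through
// unchanged rather than dropped.

type IoMap = Record<string, string>;

// Mirrors the example mapping in the text above.
const IO_MAPPINGS: Record<string, IoMap> = {
  FMB920: { "16": "odometer_km", "240": "movement" },
};

function interpretIo(
  model: string,
  rawIo: Record<string, number>,
): Record<string, number> {
  const mapping = IO_MAPPINGS[model] ?? {};
  const out: Record<string, number> = {};
  for (const [ioId, value] of Object.entries(rawIo)) {
    // Use the model-specific name when one exists; otherwise keep the raw key.
    out[mapping[ioId] ?? ioId] = value;
  }
  return out;
}
```

An unknown model simply yields the raw bag unchanged, which keeps Phase 2 device additions additive: a new model only needs a new entry in the mapping table.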