docs/wiki/concepts/live-channel-architecture.md
julian f92595a62a docs: TRACCAR ingest + processor-ws-contract synthesis + auth-mode realignment
Catches up the wiki with several pieces of work accumulated during this
session.

INGEST: TRACCAR_MAPS_ARCHITECTURE.md
- raw/TRACCAR_MAPS_ARCHITECTURE.md (source doc, read-only).
- wiki/sources/traccar-maps-architecture.md — TL;DR + key claims +
  notable quotes + TRM divergences (PostGIS-native GeoJSON, rAF
  coalescer, Zustand, longer trail, racing sprite set).
- wiki/concepts/maps-architecture.md — distilled patterns for the SPA's
  map subsystem: singleton MapLibre + side-effect-only Map* components +
  two GeoJSON sources + style-swap mapReady gate + sprite preload + WS-
  to-map data flow (with rAF coalescer) + geofence editing + camera
  control trio.
- wiki/entities/react-spa.md — corrected the "talks exclusively to
  Directus" contradiction with [[live-channel-architecture]] (SPA
  connects to two endpoints — Directus + Processor); locked stack (raw
  MapLibre over react-map-gl, Zustand over Redux); added Auth section.
- wiki/concepts/live-channel-architecture.md — single sentence cross-
  referencing [[maps-architecture]] for consumer-side throughput
  discipline.
- index.md — Sources + Concepts entries.

SYNTHESIS: processor-ws-contract
- wiki/synthesis/processor-ws-contract.md — wire-level spec for the
  live-position WebSocket: endpoint, transport, auth handshake,
  subscribe/snapshot/streaming/unsubscribe protocol, reconnect, multi-
  instance behaviour, connection limits, versioning, open questions.
  Implementation-agnostic; the producer is cookie-name-agnostic so the
  spec doesn't pin to a specific Directus auth mode.
- index.md — Synthesis entry.

AUTH-MODE REALIGNMENT (cookie -> session)
- SPA implementation surfaced that Directus SDK 'cookie' mode doesn't
  survive a hard reload cleanly. Switched the SPA to 'session' mode
  (separate commit in trm/spa). Wiki updates here:
- wiki/entities/react-spa.md §Auth pattern — describes session mode
  (single httpOnly session cookie, no separate access token, no
  /auth/refresh dance). Added "Mode choice context" note.
- wiki/synthesis/processor-ws-contract.md §Auth handshake — emphasises
  the producer is cookie-name-agnostic; reframed "Cookie refresh while
  connected" as "Session expiry while connected".

Plus all the chronological log.md entries documenting the above plus
Phase 1.5 planning, SPA Phase 1 planning, and stage verify+seed work
from earlier in the session.

Skipped from this commit: .claude/agent-memory/* (user-local agent
state, not project content); .gitignore (already-modified by user
outside this session's scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:15:09 +02:00


title: Live channel architecture
type: concept
created: 2026-05-01
updated: 2026-05-01
sources:
tags: architecture, realtime, websocket, telemetry-plane, decision

Live channel architecture

How live position data reaches the react-spa without violating plane-separation or coupling to directus's failure domain.

The question

The SPA needs sub-second updates of device positions for live race views. Three things are non-negotiable:

  1. The processor hot path stays direct-to-database — no API hop, no event-loop pressure on Directus.
  2. directus is not in the telemetry hot path (per plane-separation).
  3. The live channel must be authenticated and authorization-aware — only users with permission to see an event's positions get pushed updates.

The naïve assumption is that directus's built-in WebSocket subscriptions cover this. They do not. Directus's subscription system only fires events for writes that go through its own ItemsService (REST/GraphQL/Admin UI mutations). Direct INSERTs from the processor are invisible to subscribers — verified against Directus's documentation and source. The bridging assumption was wrong.

This page documents how the platform actually delivers live positions.

Options considered

| Option | Live channel works | Hot path stays fast | Plane separation | Failure domain |
|---|---|---|---|---|
| Route Processor writes through Directus REST | Yes (Directus broadcasts own writes) | Compromised: every write through Directus event loop | Compromised | Coupled: Directus down blocks ingestion |
| Bridge extension inside Directus (Redis → WebSocketService.broadcast) | Yes | Compromised: Directus runs the firehose consumer | Compromised | Coupled: Directus crash kills live channel |
| Processor exposes its own WebSocket endpoint (chosen) | Yes | Preserved | Preserved | Decoupled: Directus down blocks only new authorizations |

The third option (the Processor exposing its own WebSocket endpoint) wins because it preserves the architectural invariants that motivated plane-separation in the first place, while still leaning on directus for authentication and authorization.

Chosen design

Two cleanly-separated WebSocket channels, each playing to its strength:

┌─ Telemetry plane ─────────────────────────┐    ┌─ Business plane ──────────────────────┐
│                                           │    │                                       │
│  Device → tcp-ingestion → Redis           │    │  SPA admin action                     │
│                              ↓            │    │                ↓                      │
│                          Processor        │    │           Directus REST               │
│                         ↙        ↘        │    │                ↓                      │
│                  Postgres    Processor's  │    │      Postgres + Directus's WebSocket  │
│                              WebSocket    │    │                ↓                      │
│                                  ↓        │    │           SPA (admin UI,              │
│                              SPA          │    │            leaderboard refresh,       │
│                              (live map)   │    │            timing edits)              │
└───────────────────────────────────────────┘    └───────────────────────────────────────┘
  • High-volume telemetry (positions): the Processor writes directly to Postgres and also fans out the same records to subscribed SPA clients over its own WebSocket endpoint. Stays in the telemetry plane end-to-end.
  • Low-volume domain events (timing records, stage results, manual entries, configuration): written via Directus's REST API; Directus's built-in subscription system broadcasts them through its WebSocket. Stays in the business plane.

Each kind of data takes the path that fits it. No bridges, no extensions inside Directus.

Authorization flow

The Processor's WebSocket endpoint validates connections through Directus, but never asks Directus per record.

1. SPA opens wss://processor.../live with a Directus-issued JWT.
2. Processor validates the JWT (round-trip to Directus's /users/me, or local
   verification with Directus's signing secret). Failure → close socket.
3. SPA sends {type: 'subscribe', event_id: 42}.
4. Processor calls Directus once: GET /items/events/42 with the user's token.
   200 → allow subscription, store {client → event_id} in memory.
   403 → reject subscription with a clear error.
5. For every position arriving on Redis, match against in-memory subscriptions
   and push to matched clients. Zero Directus calls in the hot path.

Connection-time auth is amortized over session lifetime. Permission re-checks happen on subscription change, not on every record. The hot path is bounded by O(positions × subscribed-clients-per-event) and runs entirely on the Processor's event loop with in-memory state.
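
Steps 3 to 5 above can be sketched as an in-memory registry. This is a minimal illustration, not the Processor's actual code: the names `SubscriptionRegistry`, `Authorizer`, `Client`, and `PositionRecord`, the record fields, and the shapes of the `error` and `position` messages are all assumptions made for the example. Only the `{type: 'subscribe', event_id}` request comes from the flow above. The authorizer is injected so that the single Directus call per subscription stays outside the hot path.

```typescript
// Hypothetical types: field names are illustrative, not taken from the Processor.
interface PositionRecord {
  event_id: number;
  device_id: string;
  lat: number;
  lon: number;
  ts: number;
}

interface Client {
  id: string;
  token: string;
  send(payload: string): void; // e.g. backed by a WebSocket's send()
}

// Step 4: one permission check per subscribe request. A real implementation
// would call GET /items/events/{id} on Directus with the user's token and
// map 200 to true and 403 to false.
type Authorizer = (token: string, eventId: number) => Promise<boolean>;

class SubscriptionRegistry {
  private readonly byEvent = new Map<number, Set<Client>>();
  private readonly authorize: Authorizer;

  constructor(authorize: Authorizer) {
    this.authorize = authorize;
  }

  // Returns true when the subscription was stored, false when it was rejected.
  async subscribe(client: Client, eventId: number): Promise<boolean> {
    if (!(await this.authorize(client.token, eventId))) {
      // Assumed error shape; the wire format is not specified on this page.
      client.send(JSON.stringify({ type: "error", event_id: eventId, reason: "forbidden" }));
      return false;
    }
    let clients = this.byEvent.get(eventId);
    if (!clients) {
      clients = new Set<Client>();
      this.byEvent.set(eventId, clients);
    }
    clients.add(client);
    return true;
  }

  // Called when a socket closes: drop the client from every event.
  unsubscribeAll(client: Client): void {
    this.byEvent.forEach((clients, eventId) => {
      clients.delete(client);
      if (clients.size === 0) this.byEvent.delete(eventId);
    });
  }

  // Step 5: hot path. Pure in-memory lookup, no Directus call.
  // Returns the number of clients the record was pushed to.
  fanOut(record: PositionRecord): number {
    const clients = this.byEvent.get(record.event_id);
    if (!clients) return 0;
    const payload = JSON.stringify({ type: "position", ...record });
    clients.forEach((c) => c.send(payload));
    return clients.size;
  }
}
```

Serializing the payload once per record, rather than once per client, keeps the per-record cost proportional to the number of subscribed clients for that event.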

Failure modes

| Failure | Effect on durable storage | Effect on live channel |
|---|---|---|
| Processor crashes | Records pile up in Redis; Phase 3 failure-domains resumption picks them up | Live channel dies until recovery |
| Directus crashes | Unaffected (Processor writes direct to DB) | Existing connections keep working with cached permissions; new subscriptions cannot be authorized |
| Postgres crashes | Writes block; Redis buffers up to MAXLEN | Unaffected: fan-out is independent of DB state |
| Redis crashes | Whole pipeline stops | |

The Directus-down case is the architecturally important one. Routing writes through Directus would mean ingestion blocks. The chosen design keeps ingestion alive and only loses the ability to authorize new subscriptions — a much gentler failure.

Multi-instance Processor

Phase 3 of processor adds a second instance for HA. Each instance has its own connected SPA clients. A position arriving on instance A wouldn't naturally reach a client connected to instance B unless the broadcast path crosses instances.

The clean shape is for each Processor to read the redis-streams stream through two consumer groups:

  • processor — the durable-write group (work-split: only one instance handles each record for the DB write).
  • live-broadcast-{instance_id} — a per-instance fan-out group (every instance reads every record for fan-out).

DB writes deduplicate by virtue of the consumer-group split; live broadcast deduplicates by virtue of clients being connected to exactly one instance. The Processor's redis-streams consumer code structure should anticipate this even at single-instance pilot scale.
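
The difference between the two groups can be illustrated with a small in-memory model of consumer-group semantics. This is a sketch for reasoning only, not Redis client code: `StreamModel`, `broadcastGroup`, and `simulate` are hypothetical names, and the model ignores acknowledgements and pending entries. In Redis itself the groups would be created with `XGROUP CREATE` and read with `XREADGROUP`.

```typescript
// Each group keeps its own cursor into the stream. Consumers that share a
// group split the records; consumers in different groups each see all of them.
class StreamModel<T> {
  private readonly entries: T[] = [];
  private readonly cursors = new Map<string, number>(); // group -> next index

  add(entry: T): void {
    this.entries.push(entry);
  }

  createGroup(group: string): void {
    if (!this.cursors.has(group)) this.cursors.set(group, 0);
  }

  // Hands the next undelivered entry of the group to whichever consumer asks.
  readNext(group: string): T | undefined {
    const cursor = this.cursors.get(group);
    if (cursor === undefined) throw new Error(`unknown group: ${group}`);
    if (cursor >= this.entries.length) return undefined;
    this.cursors.set(group, cursor + 1);
    return this.entries[cursor];
  }
}

const DURABLE_GROUP = "processor";
const broadcastGroup = (instanceId: string): string => `live-broadcast-${instanceId}`;

function simulate(instanceIds: string[], recordCount: number) {
  const stream = new StreamModel<number>();
  stream.createGroup(DURABLE_GROUP);
  for (const id of instanceIds) stream.createGroup(broadcastGroup(id));
  for (let i = 0; i < recordCount; i++) stream.add(i);

  const dbWrites = new Map<string, number[]>();
  const broadcasts = new Map<string, number[]>();
  for (const id of instanceIds) {
    dbWrites.set(id, []);
    broadcasts.set(id, []);
  }

  // Durable group: instances take turns reading from the one shared group,
  // so each record is written to the database by exactly one instance.
  let drained = instanceIds.length === 0;
  while (!drained) {
    for (const id of instanceIds) {
      const rec = stream.readNext(DURABLE_GROUP);
      if (rec === undefined) {
        drained = true;
        break;
      }
      dbWrites.get(id)?.push(rec);
    }
  }

  // Broadcast groups: each instance drains its own group and sees every record.
  for (const id of instanceIds) {
    let rec = stream.readNext(broadcastGroup(id));
    while (rec !== undefined) {
      broadcasts.get(id)?.push(rec);
      rec = stream.readNext(broadcastGroup(id));
    }
  }
  return { dbWrites, broadcasts };
}
```

With two instances and four records, the durable group splits the records between the instances while each broadcast group receives all four, which is the property the fan-out path relies on.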

Scale considerations

At pilot scale (≤500 devices per event, tens of viewers), the dominant costs are:

  • Connection-time auth round-trips to Directus — a few hundred per minute peak (race start). Trivial.
  • In-memory subscription matching: O(records × subscribers). At 500 records/sec × 20 subscribers per event, that is ~10k messages/sec of fan-out, which Node can sustain.

When this becomes wrong:

  • Sustained > ~10k WebSocket messages/sec total → consider sharding the broadcast path or extracting to a dedicated gateway service.
  • Connection-time auth becomes a thundering herd at race start with thousands of viewers → cache JWT verification locally and shorten the Directus permission check via a token-with-scope pattern.
  • Multi-data-center deployment → revisit the consumer-group fan-out strategy; per-region broadcast may be cleaner than global.

The escape hatch is well-defined: lift the WebSocket endpoint code out of the Processor into a standalone service that reads the stream through the same live-broadcast-* consumer-group pattern. The Redis-stream-in / WebSocket-out contract doesn't change; only the host process does.

What this means for adjacent components

  • processor grows a public-facing WebSocket endpoint in addition to its existing Redis consumer and Postgres writer.
  • directus keeps its built-in WebSocket subscriptions for tables it writes to. Its real-time delivery section no longer claims to broadcast direct writes from processor — that's a documented mistake corrected in this revision.
  • react-spa connects to two WebSocket endpoints: Directus for admin/business updates, Processor for live position firehose. Same JWT-based auth on both. Consumer-side throughput discipline (rAF coalescing of incoming positions before reducer dispatch) is documented in maps-architecture — without it the per-message dispatch pattern observed in traccar-maps-architecture cascades through selectors and setData at every position arrival.
  • The deploy stack publishes the Processor's WebSocket port (with TLS termination at a reverse proxy in front).
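
The consumer-side coalescing mentioned for react-spa can be sketched as follows. This is a minimal illustration under stated assumptions, not the SPA's code: `createCoalescer`, `Position`, and `Schedule` are hypothetical names, and the scheduler is injected so the sketch runs outside a browser. In the SPA the scheduler would typically be `(flush) => requestAnimationFrame(flush)`.

```typescript
// Hypothetical position shape; the real message format lives in the WS contract.
interface Position {
  device_id: string;
  lat: number;
  lon: number;
  ts: number;
}

// A scheduler runs the given flush callback later, e.g. on the next frame.
type Schedule = (flush: () => void) => void;

// Buffers incoming positions and hands them to `dispatch` as one batch per
// scheduled flush, so N messages within a frame cause one dispatch, not N.
function createCoalescer(
  dispatch: (batch: Position[]) => void,
  schedule: Schedule,
): (p: Position) => void {
  let buffer: Position[] = [];
  let scheduled = false;

  return (p: Position): void => {
    buffer.push(p);
    if (scheduled) return; // a flush is already pending for this frame
    scheduled = true;
    schedule(() => {
      scheduled = false;
      const batch = buffer;
      buffer = []; // start a fresh buffer before dispatching
      dispatch(batch);
    });
  };
}
```

The returned function would be called from the WebSocket `onmessage` handler. Keeping every buffered position, rather than only the latest per device, leaves trail rendering decisions to the reducer.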

Why not a single WebSocket endpoint

It would be tempting to fold everything into a single SPA-facing WebSocket — either Processor or Directus. Both fail:

  • Single Processor WebSocket would require Processor to broadcast Directus-managed events, meaning Processor needs to subscribe to Directus's writes — which is exactly the problem we're avoiding for positions, in reverse.
  • Single Directus WebSocket is the bridge-extension option; it loses plane separation.

Two endpoints, each serving the writes its plane manages, is the architecturally honest answer.

Open questions

  • JWT validation strategy. Round-trip to Directus's /users/me (no shared secret, ~20ms per connection) vs. local verification with Directus's signing key (no round-trip, but a secret to share). Pilot can start with round-trip; revisit if connection rates climb.
  • Subscription model. Per-event, per-stage, per-organization, or arbitrary filter expressions? The simplest pilot model is "subscribe to one event by ID"; extensions land when SPA UX demands them.
  • Permission staleness. If a user is removed from an organization mid-session, do their existing subscriptions silently keep delivering until reconnect? Either re-validate periodically, or accept "trust the session" for pilot.
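
For the JWT validation question, the local-verification branch could look like the sketch below. It assumes the tokens are HS256 JWTs signed with a secret shared between Directus and the Processor; that assumption should be checked against the actual Directus deployment before relying on it. `verifyHs256` and `JwtClaims` are hypothetical names, and the sketch checks only the signature and the `exp` claim.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface JwtClaims {
  exp?: number;
  [claim: string]: unknown;
}

// Decodes one base64url JWT segment into a plain object, or null if malformed.
function decodeSegment(segment: string): Record<string, unknown> | null {
  try {
    const value: unknown = JSON.parse(Buffer.from(segment, "base64url").toString("utf8"));
    return typeof value === "object" && value !== null ? (value as Record<string, unknown>) : null;
  } catch {
    return null;
  }
}

// Returns the claims when the token is a well-formed, unexpired HS256 JWT
// signed with `secret`; returns null in every other case.
function verifyHs256(
  token: string,
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): JwtClaims | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [headerB64, payloadB64, signatureB64] = parts as [string, string, string];

  // Refuse "none" and any algorithm other than the one we expect.
  const header = decodeSegment(headerB64);
  if (!header || header["alg"] !== "HS256") return null;

  // Recompute the signature and compare in constant time.
  const expected = createHmac("sha256", secret).update(`${headerB64}.${payloadB64}`).digest();
  const actual = Buffer.from(signatureB64, "base64url");
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) return null;

  const claims = decodeSegment(payloadB64) as JwtClaims | null;
  if (!claims) return null;
  if (typeof claims.exp === "number" && claims.exp <= nowSeconds) return null;
  return claims;
}
```

Local verification removes the per-connection round-trip but cannot see server-side revocation, which is the same trade-off as the permission-staleness question above.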