docs: update task 1.5.2 done section and ROADMAP status

@@ -69,7 +69,7 @@ These rules govern every task. Any deviation must be discussed and documented as

| # | Task | Status | Landed in |
|---|------|--------|-----------|
| 1.5.1 | [WS server scaffold + heartbeat](./phase-1-5-live-broadcast/01-ws-server-scaffold.md) | 🟩 | `b8ebbd0` |
| 1.5.2 | [Cookie auth handshake](./phase-1-5-live-broadcast/02-cookie-auth-handshake.md) | 🟩 | `190254d` |
| 1.5.3 | [Subscription registry & per-event authorization](./phase-1-5-live-broadcast/03-subscription-registry.md) | ⬜ | — |
| 1.5.4 | [Broadcast consumer group & fan-out](./phase-1-5-live-broadcast/04-broadcast-consumer-group.md) | ⬜ | — |
| 1.5.5 | [Snapshot-on-subscribe](./phase-1-5-live-broadcast/05-snapshot-on-subscribe.md) | ⬜ | — |

@@ -0,0 +1,193 @@

# Task 1.5.2 — Cookie auth handshake

**Phase:** 1.5 — Live broadcast
**Status:** 🟩 Done
**Depends on:** 1.5.1
**Wiki refs:** `docs/wiki/synthesis/processor-ws-contract.md` §Auth handshake; `docs/wiki/entities/directus.md`; `docs/wiki/entities/react-spa.md` §Auth pattern

## Goal

Authenticate WebSocket connections using the Directus-issued cookie attached to the upgrade request. Validate via a single `/users/me` round-trip to Directus; on success, bind the user identity to the `LiveConnection` for its lifetime; on failure, reject the upgrade with HTTP `401` before it completes.

After this task, anonymous connections are rejected — only Directus-authenticated users can hold an open WebSocket.

## Deliverables

- `src/live/auth.ts` exporting:
  - `createAuthClient(config, logger): AuthClient` — factory.
  - `AuthClient` interface: `validate(cookieHeader: string): Promise<AuthenticatedUser | null>`.
  - `type AuthenticatedUser = { id: string; email: string; role: string | null; first_name: string | null; last_name: string | null }` — minimum fields used by the registry (1.5.3) for authorization decisions.
  - `validate` returns `null` on any failure (network, 401, malformed response). Logs at `warn` with the failure reason.
- `src/live/server.ts` updated:
  - `LiveConnection` gains a `user: AuthenticatedUser` field (no longer optional).
  - The `'upgrade'` handler validates the cookie *before* calling `wss.handleUpgrade`. On `null`, write a 401 HTTP response on the raw socket and destroy it (this is how `ws` recommends rejecting upgrades cleanly).
  - On success, pass the validated user through to the `'connection'` handler via `req[USER_KEY]`.
- New config keys (zod; see the sketch after this list):
  - `DIRECTUS_BASE_URL` (default `http://directus:8055`) — where to call `/users/me`.
  - `DIRECTUS_AUTH_TIMEOUT_MS` (default `5_000`).
- New Prometheus metrics:
  - `processor_live_auth_attempts_total{result}` — `success` / `unauthorized` / `error`.
  - `processor_live_auth_latency_ms` (histogram).
- `test/live-auth.test.ts`:
  - With a mocked Directus returning 200 + a user payload, `validate` returns the parsed user.
  - With 401, returns `null` and increments `unauthorized` counter.
  - With a network error, returns `null` and increments `error` counter (does not throw).
  - With a 200 but malformed payload (no `id` field), returns `null` and logs at `warn`.
  - The HTTP timeout is enforced (`AbortController` after `DIRECTUS_AUTH_TIMEOUT_MS`).
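
A minimal sketch of how the two new config keys could be declared with zod, assuming the Processor's existing config module follows the same parse-from-env pattern as Phase 1; the schema and type names here are illustrative, only the two keys and their defaults come from the list above.

```ts
import { z } from 'zod';

// Hypothetical extension of the existing Processor config schema; only the two
// DIRECTUS_* keys and their defaults are specified by this task.
export const liveAuthConfigSchema = z.object({
  DIRECTUS_BASE_URL: z.string().url().default('http://directus:8055'),
  DIRECTUS_AUTH_TIMEOUT_MS: z.coerce.number().int().positive().default(5_000),
});

export type LiveAuthConfig = z.infer<typeof liveAuthConfigSchema>;

// Usage sketch: parse from process.env alongside the rest of the config.
// const config = liveAuthConfigSchema.parse(process.env);
```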

## Specification

### Cookie extraction

The browser attaches whatever cookies were set on the SPA's origin. Directus's refresh cookie default is named `directus_refresh_token`; the actual session is identified server-side via the access token in the `Authorization` header on REST calls — but for WebSocket upgrades there is no Authorization header, so we forward the cookie and let Directus handle session lookup.

```ts
function extractCookieHeader(req: IncomingMessage): string | null {
  return req.headers.cookie ?? null;
}
```

If the header is missing entirely, fail fast — no point calling Directus.

### `/users/me` call

```ts
async function validate(cookieHeader: string): Promise<AuthenticatedUser | null> {
  if (!cookieHeader) return null;

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), config.DIRECTUS_AUTH_TIMEOUT_MS);

  const start = performance.now();
  try {
    const res = await fetch(`${config.DIRECTUS_BASE_URL}/users/me?fields=id,email,role,first_name,last_name`, {
      method: 'GET',
      headers: { cookie: cookieHeader },
      signal: controller.signal,
    });

    if (res.status === 401 || res.status === 403) {
      metrics.authAttempts.inc({ result: 'unauthorized' });
      return null;
    }
    if (!res.ok) {
      logger.warn({ status: res.status }, 'directus auth call returned non-2xx');
      metrics.authAttempts.inc({ result: 'error' });
      return null;
    }

    const body = (await res.json()) as { data?: unknown };
    const user = AuthenticatedUserSchema.safeParse(body.data);
    if (!user.success) {
      logger.warn({ issues: user.error.issues }, 'directus /users/me returned unexpected shape');
      metrics.authAttempts.inc({ result: 'error' });
      return null;
    }

    metrics.authAttempts.inc({ result: 'success' });
    return user.data;
  } catch (err) {
    if ((err as Error).name === 'AbortError') {
      logger.warn('directus auth call timed out');
    } else {
      logger.warn({ err }, 'directus auth call failed');
    }
    metrics.authAttempts.inc({ result: 'error' });
    return null;
  } finally {
    clearTimeout(timer);
    metrics.authLatency.observe(performance.now() - start);
  }
}
```

Notes:

- **Field projection** (`?fields=...`) keeps the response small. The full user record has dozens of fields we don't need.
- **Forward the entire cookie header.** Directus may rotate the refresh cookie on this call (it shouldn't on `/users/me`, but be liberal); we ignore any `Set-Cookie` in the response — it's not our cookie to manage.
- **No retries.** A failed validation immediately closes the upgrade. The SPA will reconnect, which gives a natural retry. Don't add server-side retry logic — it masks bugs and slows down the bad-credential case.

### Rejecting the upgrade

`ws` lets you reject by writing directly to the raw socket before `handleUpgrade`:

```ts
httpServer.on('upgrade', async (req, socket, head) => {
  const cookie = extractCookieHeader(req);
  const user = cookie ? await authClient.validate(cookie) : null;

  if (!user) {
    socket.write(
      'HTTP/1.1 401 Unauthorized\r\n' +
      'Content-Length: 0\r\n' +
      'Connection: close\r\n' +
      '\r\n'
    );
    socket.destroy();
    return;
  }

  // Stash the user on the request object so the connection handler can pick it up.
  (req as IncomingMessage & { user: AuthenticatedUser }).user = user;

  wss.handleUpgrade(req, socket, head, (ws) => {
    wss.emit('connection', ws, req);
  });
});

wss.on('connection', (ws, req: IncomingMessage & { user: AuthenticatedUser }) => {
  const conn: LiveConnection = {
    id: nanoid(),
    ws,
    remoteAddr: req.socket.remoteAddress ?? 'unknown',
    openedAt: new Date(),
    lastSeenAt: new Date(),
    user: req.user,
  };
  // ... rest of connection setup
});
```

### What `AuthenticatedUser` does and doesn't include

Include only fields the registry (1.5.3) and Phase 4 permissions will need:

```ts
const AuthenticatedUserSchema = z.object({
  id: z.string().uuid(),
  email: z.string().email().nullable(),
  role: z.string().uuid().nullable(), // Directus role id, not the `organization_users.role` enum
  first_name: z.string().nullable(),
  last_name: z.string().nullable(),
});
```

Don't pull in `directus_users` extension fields or anything specific to the TRM domain — those are queried per-subscription, not per-connection.

### What we don't do (deferred)

- **No JWT validation locally.** The simplest path is the round-trip; cache only if the round-trip becomes a bottleneck (it won't at pilot scale).
- **No refresh handling.** The cookie's lifetime is the SPA's problem. If it expires mid-connection, server-side state is unaffected; the SPA will reconnect (which re-validates).
- **No revocation re-checks.** A user removed from the database mid-session keeps their WebSocket until they disconnect or the server restarts. Phase 4 hardening can add periodic re-validation if needed.

## Acceptance criteria

- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] Connecting without a cookie returns HTTP 401 (visible in `wscat`'s output as a connection rejection with status code).
- [ ] Connecting with a stale/invalid cookie returns HTTP 401.
- [ ] Connecting with a valid cookie (obtained via Directus's `/auth/login` with `mode: cookie`) succeeds; the connection is logged with the user id.
- [ ] `processor_live_auth_attempts_total{result="success"}` increments on a successful upgrade.
- [ ] Auth latency p95 < 100ms against a stage-realistic Directus (single `/users/me` call against a warm DB).

## Risks / open questions

- **Directus base URL in dev vs stage vs prod.** In dev the SPA might run via Vite proxy at `localhost:5173`, with Directus at `localhost:8055`. The Processor's `DIRECTUS_BASE_URL` should always be the *internal* Compose-network URL (`http://directus:8055`) — that's the path with the lowest latency and no proxy hops. Document this in `.env.example`.
- **Cookie scope.** Directus issues the refresh cookie scoped to the public domain (e.g. `Domain=stage.trmtracking.org`). The Processor receives the same cookie because the upgrade request hits the same origin (proxy fronts both). Verify this works end-to-end during the integration test (1.5.6).
- **What if `/users/me` returns 200 with `data: null`?** Directus does this when the cookie is well-formed but the session is expired. Treat as `null` user (return `null`, log at `warn`).

## Done

Landed in `190254d`. Key deviations from spec:

- Added distinction between `data: null` (unauthorized / expired session) and missing `data` key (error / malformed response) — the task spec only mentioned `data: null` but the missing-key case is equally important.
- `authClient` is an optional parameter to `createLiveServer` (not required) so the existing unit tests that don't need auth work unchanged.
- Used the `satisfies` operator to type the anonymous-user placeholder on the no-auth code path.

@@ -0,0 +1,226 @@

# Task 1.5.3 — Subscription registry & per-event authorization

**Phase:** 1.5 — Live broadcast
**Status:** ⬜ Not started
**Depends on:** 1.5.2
**Wiki refs:** `docs/wiki/synthesis/processor-ws-contract.md` §Subscription model; `docs/wiki/concepts/live-channel-architecture.md` §Authorization flow; `docs/wiki/synthesis/directus-schema-draft.md`

## Goal

Handle `subscribe` / `unsubscribe` messages: validate the topic format, authorize the user against the topic's organization, maintain in-memory bidirectional indexes (`connection → topics`, `topic → connections`), and emit the appropriate `subscribed` / `unsubscribed` / `error` responses. Authorization is a single Directus call per subscription; no per-message auth.

After this task, a connected client can `subscribe` to an event they have permission for, get an immediate `subscribed` response, and the registry knows which connections want updates for which event. The actual fan-out and snapshot land in 1.5.4 and 1.5.5 respectively — this task just owns the bookkeeping.

## Deliverables

- `src/live/registry.ts` exporting:
  - `createSubscriptionRegistry(authzClient, logger, metrics): SubscriptionRegistry` — factory.
  - `SubscriptionRegistry` interface:

    ```ts
    interface SubscriptionRegistry {
      subscribe(conn: LiveConnection, topic: string, correlationId?: string): Promise<void>;
      unsubscribe(conn: LiveConnection, topic: string, correlationId?: string): Promise<void>;
      onConnectionClose(conn: LiveConnection): void; // remove from all topics
      connectionsForTopic(topic: string): Iterable<LiveConnection>; // used by 1.5.4 fan-out
      topicsForConnection(conn: LiveConnection): Iterable<string>;
      stats(): { connections: number; topics: number; subscriptions: number };
    }
    ```

  - Topic format validator: `event:<uuid>` is the only accepted shape in v1; anything else returns `error/unknown-topic`.
- `src/live/authz.ts` exporting:
  - `createAuthzClient(config, logger): AuthzClient` — factory.
  - `AuthzClient.canAccessEvent(user: AuthenticatedUser, eventId: string): Promise<AuthzResult>` — `{ allowed: true } | { allowed: false; reason: 'forbidden' | 'not-found' | 'error' }`.
- `src/live/server.ts` updated: the `onMessage` placeholder from 1.5.1 is replaced with a real router (sketched after this list) that dispatches `subscribe` / `unsubscribe` to the registry and calls `registry.onConnectionClose` in the `'close'` event handler.
- New Prometheus metrics:
  - `processor_live_subscriptions{instance_id}` (gauge) — current total subscriptions.
  - `processor_live_subscribe_attempts_total{result}` — `success` / `forbidden` / `not-found` / `unknown-topic` / `error`.
  - `processor_live_authz_latency_ms` (histogram).
- `test/live-registry.test.ts`:
  - Subscribe to `event:<uuid>` with a permitted user → `subscribed` reply, registry counts go up.
  - Subscribe to `event:<uuid>` with a forbidden user → `error/forbidden` reply, no registry change.
  - Subscribe to `device:<imei>` → `error/unknown-topic`, no registry change.
  - Subscribe twice to the same topic → idempotent (single subscription, single `subscribed` reply on each call).
  - Unsubscribe from a topic the connection isn't subscribed to → `unsubscribed` reply (idempotent), no error.
  - Connection close removes all subscriptions; gauges decrement correctly.
- `test/live-authz.test.ts`:
  - `canAccessEvent` returns `allowed: true` when `/items/events/<id>` returns 200 (Directus enforces RLS via the cookie; if Directus says yes, we say yes).
  - Returns `allowed: false, reason: 'forbidden'` on 403.
  - Returns `allowed: false, reason: 'not-found'` on 404.
  - Returns `allowed: false, reason: 'error'` on network failure or 5xx (does not throw).
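
A minimal sketch of the message router described in the `src/live/server.ts` bullet above. The `InboundMessageSchema` name, the `bad-message` error code, and the exact reply shapes are assumptions; only the dispatch to `subscribe` / `unsubscribe` and the `'close'` hook come from this task.

```ts
// Hypothetical wiring inside the 'connection' handler; names other than
// registry.subscribe / unsubscribe / onConnectionClose are illustrative.
ws.on('message', async (raw) => {
  let msg: unknown;
  try {
    msg = JSON.parse(raw.toString());
  } catch {
    sendOutbound(conn, { type: 'error', code: 'bad-message', message: 'Not valid JSON' });
    return;
  }

  const parsed = InboundMessageSchema.safeParse(msg); // assumed zod schema for { type, topic, id? }
  if (!parsed.success) {
    sendOutbound(conn, { type: 'error', code: 'bad-message', message: 'Unrecognised message shape' });
    return;
  }

  switch (parsed.data.type) {
    case 'subscribe':
      await registry.subscribe(conn, parsed.data.topic, parsed.data.id);
      break;
    case 'unsubscribe':
      await registry.unsubscribe(conn, parsed.data.topic, parsed.data.id);
      break;
  }
});

ws.on('close', () => {
  registry.onConnectionClose(conn);
});
```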

## Specification

### Topic parsing

```ts
const EventTopicRegex = /^event:([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$/i;

function parseTopic(topic: string): { kind: 'event'; eventId: string } | null {
  const m = EventTopicRegex.exec(topic);
  if (m) return { kind: 'event', eventId: m[1] };
  return null; // unknown topic shape
}
```

Future shapes (`device:<imei>`, `entry:<uuid>`, `org:<uuid>`) get added here when they're needed. The unknown-topic path returns a clear error rather than silently failing — clients always know if they typed a topic the server doesn't understand.

### Authorization model

The simplest correct authorization: **delegate to Directus's REST API with the user's cookie**. If `GET /items/events/<eventId>` returns 200, the user has access (Directus's RLS already does the org-membership check). If 403, they don't.

```ts
async function canAccessEvent(user: AuthenticatedUser, eventId: string): Promise<AuthzResult> {
  const start = performance.now();
  try {
    const res = await fetch(`${config.DIRECTUS_BASE_URL}/items/events/${eventId}?fields=id`, {
      method: 'GET',
      headers: { cookie: user.cookieHeader }, // see "Carrying the cookie" below
      signal: AbortSignal.timeout(config.DIRECTUS_AUTHZ_TIMEOUT_MS ?? 5_000),
    });

    if (res.status === 200) return { allowed: true };
    if (res.status === 403) return { allowed: false, reason: 'forbidden' };
    if (res.status === 404) return { allowed: false, reason: 'not-found' };
    return { allowed: false, reason: 'error' };
  } catch {
    return { allowed: false, reason: 'error' };
  } finally {
    metrics.authzLatency.observe(performance.now() - start);
  }
}
```

**Field projection** (`?fields=id`) keeps the response tiny — we don't need the event details, just the access verdict.

### Carrying the cookie

The auth handshake (1.5.2) validated the cookie and discarded it. For per-subscription Directus calls we need the original cookie header. Two options:

**Option A: Stash on the connection.** When 1.5.2 succeeds, save `cookieHeader` on `LiveConnection`. Trade-off: cookie material lives in process memory for the connection's lifetime.

**Option B: Re-fetch via service account.** The Processor has its own credentials; at subscribe time, query as that service account with the user id as a filter. Trade-off: more complex, requires the Processor to have a Directus account with read access to all events.

**Pick Option A.** Simpler, more honest (the user's own permissions are the source of truth for authorization), and the cookie is already on this server — we received it at upgrade. Memory cost is negligible (a cookie header is typically 100–500 bytes). Document that `LiveConnection` holds sensitive material and don't log it.

Update `LiveConnection` in `server.ts`:

```ts
export type LiveConnection = {
  id: string;
  ws: WebSocket;
  remoteAddr: string;
  openedAt: Date;
  lastSeenAt: Date;
  user: AuthenticatedUser;
  cookieHeader: string; // ← added
};
```

And update 1.5.2's upgrade handler to pass the cookie through.

### Registry data structures

```ts
const connectionTopics = new WeakMap<LiveConnection, Set<string>>(); // conn → topics
const topicConnections = new Map<string, Set<LiveConnection>>();     // topic → conns
```

`WeakMap` for `connectionTopics` lets garbage collection clean up if a connection somehow skips the explicit `onConnectionClose` call. `Set` semantics give idempotent subscribe/unsubscribe for free.

### Subscribe flow

```ts
async function subscribe(conn: LiveConnection, topic: string, correlationId?: string) {
  const parsed = parseTopic(topic);
  if (!parsed) {
    sendOutbound(conn, { type: 'error', topic, id: correlationId, code: 'unknown-topic', message: 'Unknown topic format' });
    metrics.subscribeAttempts.inc({ result: 'unknown-topic' });
    return;
  }

  // Idempotent: already subscribed?
  const existing = connectionTopics.get(conn);
  if (existing?.has(topic)) {
    // Re-send subscribed (snapshot will be fetched freshly in 1.5.5).
    sendOutbound(conn, { type: 'subscribed', topic, id: correlationId, snapshot: [] });
    return;
  }

  const verdict = await authzClient.canAccessEvent(conn.user, parsed.eventId);
  if (!verdict.allowed) {
    sendOutbound(conn, { type: 'error', topic, id: correlationId, code: verdict.reason });
    metrics.subscribeAttempts.inc({ result: verdict.reason });
    return;
  }

  // Insert into both indexes.
  if (!existing) connectionTopics.set(conn, new Set());
  connectionTopics.get(conn)!.add(topic);

  if (!topicConnections.has(topic)) topicConnections.set(topic, new Set());
  topicConnections.get(topic)!.add(conn);

  metrics.subscriptions.inc();
  metrics.subscribeAttempts.inc({ result: 'success' });

  // 1.5.5 fills in the snapshot. For now, empty array.
  sendOutbound(conn, { type: 'subscribed', topic, id: correlationId, snapshot: [] });
}
```

### Unsubscribe flow

```ts
async function unsubscribe(conn: LiveConnection, topic: string, correlationId?: string) {
  // Only decrement the gauge if this connection actually held the subscription,
  // so idempotent unsubscribes don't drift it negative.
  const wasSubscribed = connectionTopics.get(conn)?.delete(topic) ?? false;
  const conns = topicConnections.get(topic);
  if (conns) {
    conns.delete(conn);
    if (conns.size === 0) topicConnections.delete(topic);
  }
  if (wasSubscribed) metrics.subscriptions.dec();
  // Always reply, even if not subscribed (idempotent).
  sendOutbound(conn, { type: 'unsubscribed', topic, id: correlationId });
}
```

### `onConnectionClose`

```ts
function onConnectionClose(conn: LiveConnection) {
  const topics = connectionTopics.get(conn);
  if (!topics) return;
  for (const topic of topics) {
    const conns = topicConnections.get(topic);
    if (conns) {
      conns.delete(conn);
      if (conns.size === 0) topicConnections.delete(topic);
    }
    metrics.subscriptions.dec();
  }
  connectionTopics.delete(conn);
}
```

Hooked into the `ws.on('close', ...)` handler in `server.ts`.

## Acceptance criteria

- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] `wscat` flow: connect with a valid cookie → `{"type":"subscribe","topic":"event:<existing-event-id>"}` → `{"type":"subscribed","topic":"event:<id>","snapshot":[]}`.
- [ ] Forbidden flow: same client subscribing to an event in a different org → `{"type":"error","code":"forbidden"}`.
- [ ] Unknown topic flow: `{"type":"subscribe","topic":"foo:bar"}` → `{"type":"error","code":"unknown-topic"}`.
- [ ] Unsubscribe flow: client gets `unsubscribed` reply; gauge `processor_live_subscriptions` decrements.
- [ ] Disconnect cleans up: `processor_live_subscriptions` returns to its pre-connection level after the client disconnects.
- [ ] Idempotency: subscribing twice to the same topic doesn't double-count in `processor_live_subscriptions`.

## Risks / open questions

- **Authz latency budget.** Each subscribe is one Directus call. At race-start with hundreds of viewers subscribing simultaneously, that's a thundering herd. Pilot scale (≤20 viewers per event) is fine. If we ever see a herd: cache `(userId, eventId) → verdict` for 60s with manual invalidation hooks. Defer until measured.
- **What if the user is removed from the org mid-subscription?** Their existing subscriptions keep delivering until they disconnect. Phase 4 hardening can add periodic re-checks. For pilot, "trust the session" is fine.
- **Filter subscriptions to the user's own entries vs all in-event?** Race directors want to see everyone; participants might want to see only their own crew. Current spec is "everyone in the event" — Phase 4 permissions can refine. Document that v1 is open within an event.
- **Wildcard topics.** Not in scope. If we ever need it, the topic parser is the place to add `event:*` → "every event in the user's orgs."

## Done

(Filled in when the task lands.)

@@ -0,0 +1,221 @@

# Task 1.5.4 — Broadcast consumer group & fan-out

**Phase:** 1.5 — Live broadcast
**Status:** ⬜ Not started
**Depends on:** 1.5.3
**Wiki refs:** `docs/wiki/synthesis/processor-ws-contract.md` §Streaming updates, §Multi-instance behaviour; `docs/wiki/concepts/live-channel-architecture.md` §Multi-instance Processor

## Goal

Read the same `telemetry:teltonika` Redis stream as Phase 1's durable-write consumer, but on a **per-instance** consumer group `live-broadcast-{instance_id}`, and fan out each position record to the connections subscribed to its event. The durable-write path is unaffected; this is an additional read with a different group name and a different sink.

After this task, a position published to Redis arrives on the SPA's WebSocket within ~50ms (in-process Redis read + per-event index lookup + JSON serialise + WS send).

## Deliverables

- `src/live/broadcast.ts` exporting:
  - `createBroadcastConsumer(redis, registry, deviceToEvent, config, logger, metrics): BroadcastConsumer` — factory.
  - `BroadcastConsumer` interface: `start(): Promise<void>`, `stop(): Promise<void>`. Same lifecycle shape as Phase 1's `Consumer`.
  - The fan-out loop: read batch via `XREADGROUP`, for each record decode → look up the device's event → fetch subscribers → emit one `position` message per subscriber.
- `src/live/device-event-map.ts` exporting:
  - `createDeviceEventMap(pool, config, logger): DeviceEventMap` — factory.
  - `DeviceEventMap.lookup(deviceId: string): string[]` — returns the event IDs the device is registered to *right now*. Cached in memory; refreshed on a cadence (default every 30s) and on demand (via a Redis Stream invalidation signal — same pattern as Phase 1's `recompute:requests`, but for entry-device assignments). For pilot, the cadence-based refresh is enough; manual invalidation can land later.
  - The query: `SELECT entry_devices.device_id, entries.event_id FROM entry_devices JOIN entries ON entries.id = entry_devices.entry_id`.
- `src/main.ts` updated to wire and start the broadcast consumer alongside the existing throughput consumer; on SIGTERM, shutdown is ordered (live server first, broadcast consumer second, durable-write consumer last); see the sketch after this list.
- New config keys (zod):
  - `LIVE_BROADCAST_GROUP_PREFIX` (default `live-broadcast`) — full group name is `${prefix}-${INSTANCE_ID}`.
  - `LIVE_BROADCAST_BATCH_SIZE` (default `100`).
  - `LIVE_BROADCAST_BATCH_BLOCK_MS` (default `1000`).
  - `LIVE_DEVICE_EVENT_REFRESH_MS` (default `30_000`).
- New Prometheus metrics:
  - `processor_live_broadcast_records_total{instance_id}` (counter).
  - `processor_live_broadcast_fanout_messages_total{instance_id}` (counter) — per outbound `position` frame sent.
  - `processor_live_broadcast_orphan_records_total{instance_id}` (counter) — records for devices not registered to any event.
  - `processor_live_broadcast_lag_ms` (histogram) — time from `XADD` (record's `ts` field) to fan-out send.
- `test/live-broadcast.test.ts`:
  - With a fake stream entry for a device registered to `event:E1` and one subscriber to `event:E1`, fan-out sends one `position` message to that subscriber.
  - Multiple subscribers on the same event each receive the message.
  - A device registered to no event increments `orphan_records_total` and emits no message.
  - Devices registered to multiple events emit one message per subscribing connection per event (i.e. a connection subscribed to both events for the same device receives two messages — they're per-topic).
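
A minimal sketch of the shutdown ordering described in the `src/main.ts` bullet. The `liveServer`, `broadcastConsumer`, and `durableConsumer` names and their `stop()` methods are assumptions based on the lifecycle shapes above; only the ordering is specified by this task.

```ts
// Hypothetical SIGTERM handler in src/main.ts; the variable names are illustrative.
process.on('SIGTERM', async () => {
  logger.info('SIGTERM received; shutting down');

  // 1. Stop serving WebSocket clients first so no new fan-out targets appear.
  await liveServer.stop();

  // 2. Stop the broadcast consumer; nothing is listening any more.
  await broadcastConsumer.stop();

  // 3. Stop the durable-write consumer last so in-flight writes can finish and be ACKed.
  await durableConsumer.stop();

  process.exit(0);
});
```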

## Specification

### Why a separate consumer group

Phase 1's durable-write consumer is on group `processor` (configurable; the default is set in `tcp-ingestion` and matched in the Processor). Two instances share that group and Redis splits records across them — exactly one instance handles each write.

Live broadcast needs the opposite: **every instance must see every record**, because each instance has its own connected clients. The clean way to do that with Redis Streams is one group per instance. Group name `live-broadcast-{instance_id}` ensures uniqueness; each instance's `XREADGROUP` gets the full firehose for that group.

The two reads are independent — the durable-write group's offset and the live-broadcast group's offset are separate. A slow durable write doesn't slow down broadcast and vice versa.
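
A minimal sketch of the per-instance group creation this implies, assuming ioredis. Starting the group at `$` (only new records) and swallowing `BUSYGROUP` on restart are assumptions; Phase 1's group-creation helper may already make these choices.

```ts
const groupName = `${config.LIVE_BROADCAST_GROUP_PREFIX}-${config.INSTANCE_ID}`;

// Create the group at the stream tail so a fresh instance only broadcasts new
// positions; ignore "BUSYGROUP" if the group already exists from a previous run.
try {
  await redis.xgroup('CREATE', config.REDIS_TELEMETRY_STREAM, groupName, '$', 'MKSTREAM');
} catch (err) {
  if (!(err instanceof Error && err.message.includes('BUSYGROUP'))) throw err;
}
```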

### Fan-out shape

```ts
async function runLoop() {
  while (!stopping) {
    let entries: StreamEntry[];
    try {
      entries = await redis.xreadgroup(
        'GROUP', groupName, consumerName,
        'COUNT', config.LIVE_BROADCAST_BATCH_SIZE,
        'BLOCK', config.LIVE_BROADCAST_BATCH_BLOCK_MS,
        'STREAMS', config.REDIS_TELEMETRY_STREAM, '>',
      );
    } catch (err) {
      logger.error({ err }, 'broadcast XREADGROUP failed; backing off');
      await sleep(1000);
      continue;
    }
    if (!entries) continue;

    for (const entry of decodeBatch(entries)) {
      metrics.broadcastRecords.inc();
      await fanOut(entry);
      // ACK immediately — broadcast doesn't need durability semantics.
      await redis.xack(config.REDIS_TELEMETRY_STREAM, groupName, entry.id);
    }
  }
}

async function fanOut(record: ConsumedRecord) {
  const eventIds = deviceToEvent.lookup(record.position.deviceId);
  if (eventIds.length === 0) {
    metrics.broadcastOrphans.inc();
    return;
  }

  const message = toPositionMessage(record.position); // shape per processor-ws-contract

  for (const eventId of eventIds) {
    const topic = `event:${eventId}`;
    const conns = registry.connectionsForTopic(topic);
    for (const conn of conns) {
      sendOutbound(conn, { ...message, topic });
      metrics.broadcastFanout.inc();
    }
  }

  metrics.broadcastLag.observe(Date.now() - record.position.ts.getTime());
}
```

### Why ACK immediately

Phase 1's durable-write consumer ACKs only after Postgres confirms the write — that's the `XACK` discipline that protects against data loss. The broadcast consumer has different durability semantics: **a missed broadcast is acceptable.** If a position fails to fan out (because the connection crashed mid-send, say), the next position is what matters. Don't keep a pending entry just to retry an obsolete record.

ACK-on-consume keeps the broadcast group's PEL empty, prevents pending-entry buildup, and avoids the "send the same position twice on retry" anti-feature. Phase 3 hardening can revisit if we ever need broadcast guarantees.

### `DeviceEventMap` design

The fan-out path needs to answer "which events does this device belong to?" thousands of times per second. The naive answer — query Postgres on each record — is wrong. Two options:

**Option A: In-process cache with periodic refresh.** Load the full `entry_devices` ⨯ `entries` join at startup; refresh every 30s. Stale data window: up to 30s. **Pick this for pilot.**

**Option B: Listen for changes.** Add an `entry-devices:changed` Redis Stream (or use Directus Flows to publish on writes); broadcast invalidates affected entries. Stale data window: ~50ms. Adds protocol surface and a coordination point.

For pilot: Option A. The 30s staleness window is acceptable — operators register devices before the event starts, and "you registered a new device 30s ago and it's not on the map yet" is a tolerable UX. Phase 3+ can promote to Option B if real-time registration matters.

```ts
class DeviceEventMap {
  private map = new Map<string, Set<string>>(); // deviceId → Set<eventId>
  private timer: NodeJS.Timeout | null = null;

  async start() {
    await this.refresh();
    this.timer = setInterval(() => {
      this.refresh().catch(err => logger.warn({ err }, 'device-event map refresh failed'));
    }, config.LIVE_DEVICE_EVENT_REFRESH_MS);
  }

  async stop() {
    if (this.timer) clearInterval(this.timer);
  }

  async refresh() {
    const start = performance.now();
    const result = await pool.query<{ device_id: string; event_id: string }>(
      `SELECT ed.device_id, e.event_id
         FROM entry_devices ed
         JOIN entries e ON e.id = ed.entry_id`
    );
    const next = new Map<string, Set<string>>();
    for (const row of result.rows) {
      if (!next.has(row.device_id)) next.set(row.device_id, new Set());
      next.get(row.device_id)!.add(row.event_id);
    }
    this.map = next;
    metrics.deviceEventRefreshLatency.observe(performance.now() - start);
    metrics.deviceEventEntries.set(next.size);
    logger.debug({ devices: next.size }, 'device-event map refreshed');
  }

  lookup(deviceId: string): string[] {
    return Array.from(this.map.get(deviceId) ?? []);
  }
}
```

### Back-pressure

If a connection's send queue is backing up (slow client, slow network), the WS library queues messages in process memory. At 100 msg/s × 10s of slow consumer = 1000 queued messages × ~200 bytes each = 200KB per slow connection. Tolerable.

If we ever see real back-pressure problems: per-connection bounded queue (e.g. last 100 positions per device, dropping older), with a metric `processor_live_broadcast_dropped_total`. Document but defer.

For now: rely on `ws.bufferedAmount` to detect slow consumers; if it exceeds a threshold (say 1MB), close the connection with code 1008 (policy violation) and log. Client reconnects. Worth implementing as a defensive measure even for pilot — prevents one slow client from eating all the memory.

```ts
function sendOutbound(conn: LiveConnection, msg: OutboundMessage) {
  if (conn.ws.readyState !== WebSocket.OPEN) return;
  if (conn.ws.bufferedAmount > config.LIVE_WS_BACKPRESSURE_THRESHOLD_BYTES) {
    logger.warn({ connId: conn.id, buffered: conn.ws.bufferedAmount }, 'closing slow connection');
    conn.ws.close(1008, 'back-pressure threshold exceeded');
    return;
  }
  conn.ws.send(JSON.stringify(msg));
  metrics.liveMessagesOutbound.inc({ type: msg.type });
}
```

(Update 1.5.1's `sendOutbound` to include this check; add `LIVE_WS_BACKPRESSURE_THRESHOLD_BYTES` config with default `1_048_576`.)

### `toPositionMessage`

```ts
function toPositionMessage(p: Position): Omit<PositionMessage, 'topic'> {
  const msg: any = {
    type: 'position',
    deviceId: p.deviceId,
    lat: p.latitude,
    lon: p.longitude,
    ts: p.ts.getTime(), // epoch ms; contract is number, not ISO string
  };
  if (p.speed != null) msg.speed = p.speed;
  if (p.course != null) msg.course = p.course;
  if (p.accuracy != null) msg.accuracy = p.accuracy;
  if (p.attributes && Object.keys(p.attributes).length > 0) msg.attributes = p.attributes;
  return msg;
}
```

Per the contract: omit fields rather than send `null` for absent values.

## Acceptance criteria

- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] `pnpm dev` boots; logs show both consumer groups joining (`processor` and `live-broadcast-{instance_id}`).
- [ ] With a subscribed `wscat` client and a synthetic position published to `telemetry:teltonika`, the client receives a `{"type":"position",...}` frame within ~100ms.
- [ ] A second `wscat` client subscribed to the same event also receives the message.
- [ ] An orphan position (device not in any `entry_devices` row) increments `processor_live_broadcast_orphan_records_total` and emits no WS message.
- [ ] After modifying `entry_devices` directly in Postgres, a position published once the 30s refresh has run routes correctly to the new event's subscribers.
- [ ] Broadcast lag p50 < 100ms, p95 < 500ms with a small subscriber count (≤20).

## Risks / open questions

- **30s staleness window** is acceptable for pilot but worth surfacing in operator docs. "If you just registered a device, wait 30s before expecting it on the map" is a reasonable line in the dogfood README.
- **Memory cost of `DeviceEventMap`.** For 500 devices × 10 events average, ~5000 entries. Trivial.
- **What about devices registered to *multiple* events at the same time?** Schema allows it (one device on multiple `entry_devices` rows). Fan-out handles it: each event's subscribers get the message. The SPA may want to filter by event on its end if it's showing a single event.
- **Memory leak from `topicConnections` if registry isn't cleaning up.** Defensive: log a warning if `registry.stats().topics` exceeds a sanity threshold (e.g. 1000) to surface a leak before it OOMs.

## Done

(Filled in when the task lands.)

@@ -0,0 +1,145 @@

# Task 1.5.5 — Snapshot-on-subscribe

**Phase:** 1.5 — Live broadcast
**Status:** ⬜ Not started
**Depends on:** 1.5.3, 1.4 (Postgres pool)
**Wiki refs:** `docs/wiki/synthesis/processor-ws-contract.md` §Server response — subscribed

## Goal

When a client subscribes to `event:<eventId>`, return the **latest known position for every device registered to that event** as part of the `subscribed` response. Without it, the SPA opens to a blank map and only fills in as devices report — feels broken.

The snapshot is a one-time read at subscribe time. After that, positions stream live via the broadcast consumer (1.5.4). The two paths together give the SPA a "fully populated map immediately, then live updates" experience.

## Deliverables

- `src/live/snapshot.ts` exporting:
  - `createSnapshotProvider(pool, logger, metrics): SnapshotProvider` — factory.
  - `SnapshotProvider.forEvent(eventId: string): Promise<PositionSnapshotEntry[]>` — returns the latest position per device registered to the event. Empty array if no devices or no positions yet.
  - `type PositionSnapshotEntry = { deviceId: string; lat: number; lon: number; ts: number; speed?: number; course?: number; accuracy?: number; attributes?: Record<string, unknown> }` — same shape as the streaming `position` message minus the `type` and `topic` fields (the envelope wraps them).
- `src/live/registry.ts` updated: the `subscribe` method calls `snapshot.forEvent(eventId)` after authorization succeeds and includes the result in the `subscribed` response. Authorization happens *before* the snapshot query so a forbidden user doesn't pay the snapshot cost.
- New Prometheus metrics:
  - `processor_live_snapshot_query_latency_ms` (histogram).
  - `processor_live_snapshot_size` (histogram) — number of positions in each snapshot.
- `test/live-snapshot.test.ts`:
  - With three devices in an event, two of which have positions, returns two snapshot entries.
  - With an event that has no `entry_devices` rows, returns `[]`.
  - With devices that have positions but `faulty=true`, those positions are excluded.
  - The query returns the *most recent non-faulty* position per device (not just the most recent overall — `ORDER BY ts DESC` with a `WHERE faulty = false` filter).

## Specification

### The query

The snapshot needs the latest non-faulty position per device, scoped to one event. Postgres-canonical for "latest per group" is `DISTINCT ON`:

```sql
SELECT DISTINCT ON (p.device_id)
  p.device_id,
  p.latitude,
  p.longitude,
  p.ts,
  p.speed,
  p.course,
  p.accuracy,
  p.attributes
FROM positions p
JOIN entry_devices ed ON ed.device_id = p.device_id
JOIN entries e ON e.id = ed.entry_id
WHERE e.event_id = $1
  AND p.faulty = false
ORDER BY p.device_id, p.ts DESC;
```

### Why `DISTINCT ON`

`DISTINCT ON (device_id) ... ORDER BY device_id, ts DESC` returns the row with the highest `ts` per `device_id`. The alternatives (`GROUP BY` + `MAX(ts)` + self-join, or window functions with `ROW_NUMBER()`) all produce the same result with worse query plans on a TimescaleDB hypertable. `DISTINCT ON` is Postgres-specific but we're committed to Postgres.
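
For comparison, a sketch of the window-function form this paragraph rules out. It produces the same rows, but the `ROW_NUMBER()` has to be computed for every matching position before the outer filter runs, which is roughly why its plan tends to be worse here.

```sql
-- Equivalent (and rejected) ROW_NUMBER() formulation, shown only for comparison.
SELECT device_id, latitude, longitude, ts, speed, course, accuracy, attributes
FROM (
  SELECT p.*,
         ROW_NUMBER() OVER (PARTITION BY p.device_id ORDER BY p.ts DESC) AS rn
  FROM positions p
  JOIN entry_devices ed ON ed.device_id = p.device_id
  JOIN entries e ON e.id = ed.entry_id
  WHERE e.event_id = $1
    AND p.faulty = false
) latest
WHERE rn = 1;
```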

### Performance

On a TimescaleDB hypertable, the index that makes this fast is `(device_id, ts DESC)`. Phase 1 task 1.4 created the hypertable; verify the index exists. If not, add it as a migration in this task:

```sql
CREATE INDEX IF NOT EXISTS positions_device_ts_idx ON positions (device_id, ts DESC);
```

Without the index, `DISTINCT ON` does a sequential scan per `device_id` group. With it, the scan is bounded by the chunk containing the most recent position per device — typically the latest one or two chunks.

For 500 devices in an event, the query should complete in < 50ms on a warm cache.

### Faulty-filter semantics

The `faulty` column is set post-hoc by operators when a position is unrealistic ([[directus]] entity page §"Faulty position handling"). Any read path that surfaces position data to operators must filter `faulty = false`:

- **Snapshot:** filter (this task).
- **Live broadcast:** doesn't apply — the broadcast consumer reads from Redis, not from `positions`. By the time a position is in Redis (and being streamed), no one has had the chance to flag it.
- **Replay (future):** filter when implemented.

### Where the snapshot is wired into the registry

The `subscribed` response in 1.5.3 currently sends `snapshot: []`. Update:

```ts
// In registry.ts, inside subscribe() after authorization succeeds:
let snapshot: PositionSnapshotEntry[] = [];
const start = performance.now();
try {
  snapshot = await snapshotProvider.forEvent(parsed.eventId);
  metrics.snapshotSize.observe(snapshot.length);
} catch (err) {
  logger.warn({ err, eventId: parsed.eventId }, 'snapshot query failed');
  // Fall through with empty snapshot — better to subscribe without a snapshot
  // than to fail the subscription entirely.
} finally {
  metrics.snapshotLatency.observe(performance.now() - start);
}

sendOutbound(conn, { type: 'subscribed', topic, id: correlationId, snapshot });
```

The "fail open" choice on snapshot errors is deliberate: a subscribe that returns `subscribed` with an empty snapshot is recoverable (live updates still work; the SPA just sees a sparser-than-expected initial state). A subscribe that errors out forces the SPA to retry, which masks the underlying snapshot failure.

### What the snapshot does NOT include

- **Position history.** Just the *latest* position per device. Trail rendering on the SPA reads the previous N positions from its own ring buffer as new positions stream in. No bulk-history endpoint in v1.
- **Device metadata** (model, IMEI, vehicle assignment). The SPA fetches that separately via Directus REST/SDK and joins on `deviceId` client-side.
- **Faulty positions.** Filtered.
- **Stale positions.** A position from 3 days ago is still "the latest" if the device hasn't reported since. The SPA should display "last seen N hours ago" indicators based on the `ts` field.

### Snapshot field shape

Mirror the `position` streaming message exactly except for the envelope:

```ts
const SnapshotEntrySchema = z.object({
  deviceId: z.string().uuid(),
  lat: z.number(),
  lon: z.number(),
  ts: z.number(), // epoch ms
  speed: z.number().optional(),
  course: z.number().optional(),
  accuracy: z.number().optional(),
  attributes: z.record(z.unknown()).optional(),
});
```

Same field-omission convention: don't emit `null` for absent values.

## Acceptance criteria

- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] Manual: with the seeded Rally Albania 2026 event (3 registered devices, some positions in `positions`), subscribing returns a snapshot with the registered devices' latest positions.
- [ ] Subscribing to an event with no positions returns `subscribed` with `snapshot: []`.
- [ ] Manually marking a position `faulty=true` excludes it from the next snapshot (the snapshot returns the most recent non-faulty position for that device, or omits the device if none exists).
- [ ] Snapshot query latency p95 < 100ms with the index in place; without the index the test should fail loudly so we don't ship without it.
- [ ] Snapshot failure (e.g. simulated Postgres timeout) does not fail the subscription; client receives `subscribed` with `snapshot: []` and the live stream still works.

## Risks / open questions

- **Snapshot size on a large event.** 500 devices × ~200 bytes per entry = ~100KB JSON payload. Tolerable. If we ever push to 5000 devices on a single event, consider streaming the snapshot in chunks via multiple `subscribed` frames. Out of scope for now.
- **Positions in the `positions` table that pre-date the device's registration to the event.** The JOIN catches them — if the device is on `entry_devices` for that event today, its 3-month-old positions still match. Acceptable behaviour; the operator's mental model is "this device has been in the system that long."
- **Trade-off with `DISTINCT ON` and TimescaleDB chunks.** TimescaleDB partitions by time; `DISTINCT ON (device_id) ORDER BY device_id, ts DESC` may need to scan multiple chunks if the latest position for some devices is older than the most recent chunk. For an active event this is the same chunk for everyone. For a long tail of stale devices, multiple chunks may be touched. Acceptable.

## Done

(Filled in when the task lands.)

@@ -0,0 +1,192 @@

# Task 1.5.6 — Integration test (testcontainers Redis + Postgres + Directus stub)

**Phase:** 1.5 — Live broadcast
**Status:** ⬜ Not started
**Depends on:** 1.5.4, 1.5.5
**Wiki refs:** —

## Goal

End-to-end pipeline test: spin up Redis 7 + TimescaleDB + a stub HTTP server impersonating Directus's `/users/me` and `/items/events/<id>` endpoints, boot the Processor against them, connect a real WebSocket client with a cookie, subscribe to an event, publish a synthetic position to `telemetry:teltonika`, verify the client receives both the snapshot and the streamed position.

This is the test that proves the live channel composes correctly end-to-end — auth handshake, subscription registry, snapshot, broadcast fan-out all integrated. Mirror Phase 1's `pipeline.integration.test.ts` for structure and skip-on-no-Docker pattern.

## Deliverables

- `test/live.integration.test.ts`:
  - `beforeAll`: start Redis + TimescaleDB containers, start a tiny HTTP server impersonating Directus on a random port (acts as the auth + authz endpoint), seed `entry_devices` + `entries` + a few `positions` rows, boot a Processor instance pointed at all three. Skip cleanly if Docker is unavailable.
  - `afterAll`: stop the Processor, the Directus stub, and both containers.
  - **Test 1 — Happy path:** WS client connects with a valid cookie → subscribes to a seeded event → receives `subscribed` with a non-empty snapshot containing the seeded positions → the test publishes a synthetic position to Redis → the client receives the corresponding `position` frame within 1s.
  - **Test 2 — Auth rejection:** WS client connects without a cookie → upgrade fails with HTTP 401.
  - **Test 3 — Forbidden subscription:** Client with a cookie scoped to user A → subscribes to an event in an organization user A doesn't belong to → receives `error/forbidden` (Directus stub returns 403 for that user-event pair).
  - **Test 4 — Multi-client fan-out:** Two clients subscribed to the same event → publishing one position results in both clients receiving the `position` frame.
  - **Test 5 — Orphan position:** Publishing a position for a device that's not on `entry_devices` increments `processor_live_broadcast_orphan_records_total` and reaches no client.
  - **Test 6 — Faulty-flagged snapshot exclusion:** Mark a seeded position `faulty=true` directly in Postgres, subscribe, verify the snapshot uses the next-most-recent non-faulty position (or omits the device if none exists).
- `test/helpers/directus-stub.ts`:
  - A minimal Express-or-bare-`http.createServer` stub that responds to:
    - `GET /users/me` — returns 200 + a fake user payload if a configured cookie is present, 401 otherwise.
    - `GET /items/events/:id` — returns 200 if the (cookie, eventId) pair is in an allow-list, 403 otherwise.
  - Exposed as `createDirectusStub({ allowedCookieToUser: Map<string, FakeUser>, allowedEvents: Map<userId, Set<eventId>> }): Promise<{ url: string; close: () => Promise<void> }>`.
- `vitest.integration.config.ts` — the Phase 1 config already exists; extend the `testTimeout` if needed (the live test may need ~30s for the first-position round-trip on a cold cache).

## Specification

### Skip-on-no-Docker pattern

Same as Phase 1's `pipeline.integration.test.ts`:

```ts
let dockerAvailable = true;
beforeAll(async () => {
  try {
    redisContainer = await new GenericContainer('redis:7').withExposedPorts(6379).start();
  } catch (err) {
    dockerAvailable = false;
    console.warn('docker unavailable; skipping live integration tests');
    return;
  }
  // ... rest of setup
}, 120_000);

it('happy path', async () => {
  if (!dockerAvailable) return;
  // ... real test
});
```

### Directus stub shape

The stub is intentionally tiny — we're not testing Directus, we're testing the Processor's interaction with whatever Directus returns. Two endpoints, hardcoded responses:

```ts
function createDirectusStub(opts: StubOpts): Promise<{ url: string; close: () => Promise<void> }> {
  const server = http.createServer(async (req, res) => {
    const cookie = req.headers.cookie ?? '';
    const user = opts.allowedCookieToUser.get(cookie);

    // The auth client appends ?fields=..., so match on the path prefix, not strict equality.
    if (req.url?.startsWith('/users/me')) {
      if (!user) {
        res.writeHead(401).end();
        return;
      }
      res.writeHead(200, { 'content-type': 'application/json' });
      res.end(JSON.stringify({ data: user }));
      return;
    }

    const eventMatch = /^\/items\/events\/([0-9a-f-]+)/.exec(req.url ?? '');
    if (eventMatch) {
      if (!user) { res.writeHead(401).end(); return; }
      const eventId = eventMatch[1];
      const allowed = opts.allowedEvents.get(user.id)?.has(eventId);
      if (!allowed) { res.writeHead(403).end(); return; }
      res.writeHead(200, { 'content-type': 'application/json' });
      res.end(JSON.stringify({ data: { id: eventId } }));
      return;
    }

    res.writeHead(404).end();
  });

  return new Promise((resolve) => {
    server.listen(0, () => {
      const addr = server.address() as AddressInfo;
      resolve({
        url: `http://localhost:${addr.port}`,
        close: () => new Promise<void>((res) => server.close(() => res())),
      });
    });
  });
}
```

This is ~40 lines of test infra. Don't pull in Express; bare `http` is enough.

### Seeding data

The integration test needs realistic-ish seed data: at least one organization, one event, two `entries`, four `entry_devices` (so at least one device-to-event mapping per entry), and some positions for some of the devices.

Use a seed helper:

```ts
async function seed(pool: Pool) {
  await pool.query(`INSERT INTO organizations (id, name, slug) VALUES ($1, 'Test Org', 'test-org')`, [TEST_ORG_ID]);
  await pool.query(`INSERT INTO events (id, organization_id, name, slug, discipline, starts_at, ends_at)
                    VALUES ($1, $2, 'Test Event', 'test-event', 'rally', '2026-01-01', '2026-12-31')`, [TEST_EVENT_ID, TEST_ORG_ID]);
  // ... etc.

  await pool.query(`INSERT INTO positions (device_id, ts, latitude, longitude, faulty)
                    VALUES ($1, '2026-05-02T11:00:00Z', 41.327, 19.819, false),
                           ($1, '2026-05-02T11:01:00Z', 41.328, 19.820, false),
                           ($2, '2026-05-02T11:00:30Z', 41.330, 19.825, false)`,
    [TEST_DEVICE_1, TEST_DEVICE_2]);
}
```

Schema-creation in the integration test reuses the same migration runner that production uses. **Don't reach into `db-init/` or Directus's snapshot YAML** from this test — the test is for the Processor, not the schema. Stub the minimum subset of Directus-managed tables in a setup migration that runs only in the test environment, OR (cleaner) point the test's `pool` at a Postgres that already has the schema loaded via a fixture SQL file.

The cleanest option: a single `test/fixtures/test-schema.sql` file that creates the minimum subset (organizations, events, entries, entry_devices, devices, positions) the integration test needs. Run it once in `beforeAll`. The duplication with the real schema is bounded — these collections are stable.
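
A sketch of what that fixture could contain. Column names are taken from the queries and seed helper in this phase; the types, keys, and anything not mentioned elsewhere in these tasks are assumptions to be synced against the real schema.

```sql
-- test/fixtures/test-schema.sql (hypothetical subset; sync with the production schema)
CREATE TABLE organizations (
  id   uuid PRIMARY KEY,
  name text NOT NULL,
  slug text NOT NULL
);

CREATE TABLE events (
  id              uuid PRIMARY KEY,
  organization_id uuid NOT NULL REFERENCES organizations (id),
  name            text NOT NULL,
  slug            text NOT NULL,
  discipline      text NOT NULL,
  starts_at       timestamptz,
  ends_at         timestamptz
);

CREATE TABLE entries (
  id       uuid PRIMARY KEY,
  event_id uuid NOT NULL REFERENCES events (id)
);

CREATE TABLE devices (
  id uuid PRIMARY KEY
);

CREATE TABLE entry_devices (
  entry_id  uuid NOT NULL REFERENCES entries (id),
  device_id uuid NOT NULL REFERENCES devices (id)
);

CREATE TABLE positions (
  device_id  uuid NOT NULL,
  ts         timestamptz NOT NULL,
  latitude   double precision NOT NULL,
  longitude  double precision NOT NULL,
  speed      double precision,
  course     double precision,
  accuracy   double precision,
  attributes jsonb,
  faulty     boolean NOT NULL DEFAULT false
);

CREATE INDEX positions_device_ts_idx ON positions (device_id, ts DESC);
```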

### WebSocket client

Use `ws`'s client mode (`new WebSocket(url, { headers: { cookie: '...' } })`). Set up an `on('message')` listener that pushes to an array; tests read from the array with a `waitForMessage(predicate, timeout)` helper:

```ts
async function waitForMessage<T>(
  ws: WebSocket,
  predicate: (msg: any) => msg is T,
  timeoutMs: number = 5_000
): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('timeout waiting for message')), timeoutMs);
    const handler = (data: WebSocket.Data) => {
      const msg = JSON.parse(data.toString());
      if (predicate(msg)) {
        clearTimeout(timer);
        ws.off('message', handler);
        resolve(msg);
      }
    };
    ws.on('message', handler);
  });
}
```

This pattern is robust across the test suite — every test waits for a specific message shape, with a clear timeout error if the protocol breaks.
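
A sketch of how the happy-path test might use the helper. The message `type` strings follow the contract excerpts above; the `client`, `redis`, `publishPosition`, and `syntheticPosition` names, and vitest's `expect`, are assumed from the surrounding test setup.

```ts
// Hypothetical excerpt from Test 1 (happy path).
client.send(JSON.stringify({ type: 'subscribe', topic: `event:${TEST_EVENT_ID}` }));

const subscribed = await waitForMessage(
  client,
  (msg: any): msg is { type: 'subscribed'; snapshot: unknown[] } => msg.type === 'subscribed',
);
expect(subscribed.snapshot.length).toBeGreaterThan(0);

await publishPosition(redis, syntheticPosition);

const position = await waitForMessage(
  client,
  (msg: any): msg is { type: 'position'; deviceId: string } => msg.type === 'position',
  1_000,
);
expect(position.deviceId).toBe(syntheticPosition.deviceId);
```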

### Synthetic position publishing

Reuse the `XADD` shape from Phase 1's `pipeline.integration.test.ts`. Helper:

```ts
async function publishPosition(redis: Redis, position: Position) {
  await redis.xadd(
    config.REDIS_TELEMETRY_STREAM,
    '*',
    'ts', position.ts.toISOString(),
    'device_id', position.deviceId,
    'codec', '8E',
    'payload', JSON.stringify(serializeForStream(position)),
  );
}
```

The `serializeForStream` helper handles the bigint/Buffer sentinel encoding (already exists in Phase 1; reuse it).

## Acceptance criteria

- [ ] `pnpm test:integration -- live` runs all six scenarios green when Docker is available.
- [ ] Without Docker, the suite logs skip messages and exits 0 (does not fail).
- [ ] First-run total time < 60s including container pulls; subsequent runs < 20s.
- [ ] Each test cleans up after itself — no shared state between tests.
- [ ] Tests don't depend on each other's order.

## Risks / open questions

- **Schema duplication.** `test/fixtures/test-schema.sql` will drift from the real schema unless we keep syncing it deliberately. Mitigation: a comment at the top of the fixture says "this is a subset of the production schema for testing only; sync when production schema changes." Worth documenting in `OPERATIONS.md` (Phase 3) as a maintenance task.
- **Test flakiness from polling.** Same caveat as Phase 1: prefer `waitForMessage` over `await sleep(N)`. The latter is reliably wrong.
- **Image pull times in CI.** The TimescaleDB image is large (~700MB). If integration tests run in CI, pre-pull. Phase 1's CI doesn't run integration tests; this phase doesn't change that — local runs plus a manual stage smoke test are the gate.
## Done
|
||||
|
||||
(Filled in when the task lands.)
|
||||
@@ -0,0 +1,97 @@

# Phase 1.5 — Live broadcast

Implement the WebSocket endpoint that fans live position updates from the Processor to subscribed [[react-spa]] clients. It layers on top of Phase 1's throughput pipeline and sits logically between Phase 1 (throughput) and Phase 2 (domain logic). The wire spec is locked at `docs/wiki/synthesis/processor-ws-contract.md`.

## Why a separate phase

Phase 1's outcome was Redis → Postgres only; the WebSocket fan-out side of [[processor]] was wiki-canonical (`docs/wiki/concepts/live-channel-architecture.md`) but had no implementation task. Phase 2 is gated on Directus schema decisions and is a substantial domain-logic chunk; bundling the WebSocket work into it would couple two unrelated workstreams.

This phase is small, self-contained, and unblocks the [[react-spa]]'s live-map feature for the Rally Albania 2026 dogfood. It does **not** touch domain logic or the Phase 1 throughput path.

## Outcome statement

When Phase 1.5 is done:
- The Processor exposes a WebSocket endpoint (path TBD by the reverse proxy; same origin as [[directus]] and the SPA bundle so the auth cookie flows automatically).
- Connections authenticate via the Directus-issued cookie attached to the WebSocket upgrade request. Validation is a single `/users/me` round-trip to [[directus]] at connect time; the validated user identity is bound to the connection for its lifetime.
- Clients subscribe to `event:<eventId>` topics. Per-event authorization is checked once at subscribe time (does the user belong to the event's organization?). Multiple subscriptions per connection are supported. (A schema sketch for the wire messages follows this list.)
- On `subscribed`, the server returns a snapshot of the latest known position for every device registered to the event (via `entry_devices` → `entries` → `events`). After the snapshot, position records stream as they arrive on Redis.
- A second consumer group `live-broadcast-{instance_id}` reads the same `telemetry:teltonika` stream as the durable-write group (`processor`), but per-instance — every Processor instance reads every record for its own connected clients. The durable-write path is unaffected.
- 30s server-side ping; client-side liveness check on a 60s message gap; backoff reconnect on close.
- All of this is covered by an end-to-end integration test (testcontainers Redis + Postgres + a Directus auth stub).
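The sketch below is a rough guess at how `src/live/protocol.ts` might express the inbound subscribe/unsubscribe messages as strict zod schemas; all field names and the UUID topic format are assumptions, and the wire contract doc is authoritative:

```ts
import { z } from 'zod';

// Strict objects reject unknown fields.
export const subscribeMessage = z
  .object({
    type: z.literal('subscribe'),
    topic: z.string().regex(/^event:[0-9a-f-]{36}$/), // assumes UUID event ids
  })
  .strict();

export const unsubscribeMessage = z
  .object({
    type: z.literal('unsubscribe'),
    topic: z.string(),
  })
  .strict();

export const inboundMessage = z.discriminatedUnion('type', [subscribeMessage, unsubscribeMessage]);
export type InboundMessage = z.infer<typeof inboundMessage>;
```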
## Sequencing

```
1.5.1 WS server scaffold + heartbeat
  └─→ 1.5.2 Cookie auth handshake
        └─→ 1.5.3 Subscription registry & authorization
              ├─→ 1.5.4 Broadcast consumer group & fan-out
              ├─→ 1.5.5 Snapshot-on-subscribe
              └─→ 1.5.6 Integration test (depends on 1.5.4 + 1.5.5)
```

1.5.4 and 1.5.5 can be developed in parallel after 1.5.3 lands.
## Files modified

This phase adds these to the existing `processor/` layout:

```
processor/
├── src/
│   ├── core/
│   │   └── ... (unchanged from Phase 1)
│   ├── live/
│   │   ├── server.ts     # WS server, heartbeat, lifecycle
│   │   ├── auth.ts       # cookie → /users/me → user identity
│   │   ├── registry.ts   # subscriptions: connection→topics, topic→connections
│   │   ├── broadcast.ts  # live-broadcast consumer group + fan-out loop
│   │   ├── snapshot.ts   # latest-position-per-device query
│   │   └── protocol.ts   # zod schemas for the wire format (subscribe/position/etc.)
│   ├── db/
│   │   └── ... (unchanged)
│   └── main.ts           # wires the live server alongside the consumer
└── test/
    ├── live-server.test.ts       # mocked: heartbeat, lifecycle, message routing
    ├── live-auth.test.ts         # mocked Directus client
    ├── live-registry.test.ts     # subscribe/unsubscribe semantics
    ├── live-snapshot.test.ts     # query shape
    └── live.integration.test.ts  # end-to-end with testcontainers
```
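For orientation, the latest-position-per-device query behind `snapshot.ts` will probably be a `DISTINCT ON` over the `entry_devices` → `entries` → `events` join chain named above; this is a hedged sketch: the table names match the test fixture list, the column names and join keys are assumptions.

```ts
// Hypothetical column names; only the table names are taken from this document.
const SNAPSHOT_SQL = `
  SELECT DISTINCT ON (p.device_id)
         p.device_id, p.ts, p.lat, p.lon
    FROM positions p
    JOIN entry_devices ed ON ed.device_id = p.device_id
    JOIN entries e        ON e.id = ed.entry_id
   WHERE e.event_id = $1
   ORDER BY p.device_id, p.ts DESC
`;

// Usage with the shared pool from src/db/pool.ts (sketch):
//   const { rows } = await pool.query(SNAPSHOT_SQL, [eventId]);
```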
## Tech stack additions

- **`ws`** — minimal, mature WebSocket server. Plays naturally with `http.createServer` (already used by Phase 1's metrics/health server).
- **No HTTP client library.** Node 22's global `fetch` is sufficient for the `/users/me` and `/items/events/<id>` calls to Directus (see the sketch below).
- **`zod`** (already a Phase 1 dep) — runtime validation of inbound WS messages. Strict schemas; reject unknown fields.

No new test dependencies. `vitest` + `testcontainers` already cover what's needed.
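As a rough illustration of the fetch-only approach (the response shape, timeout value, and error handling are simplified assumptions; Directus wraps payloads in a `data` envelope):

```ts
// Hedged sketch: validate the upgrade-request cookie with a single /users/me round-trip.
async function fetchCurrentUser(cookieHeader: string, baseUrl: string) {
  const res = await fetch(`${baseUrl}/users/me`, {
    headers: { cookie: cookieHeader },
    signal: AbortSignal.timeout(5_000), // bound the call so a slow Directus can't stall the upgrade
  });
  if (!res.ok) return null; // 401/403 → not authenticated
  const body = (await res.json()) as { data?: { id?: string } };
  return body.data?.id ? body.data : null;
}
```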
## Non-negotiable design rules

These rules govern every task in this phase. Any deviation must be discussed and documented before code lands.

1. **Live work is isolated.** `src/live/` cannot import from `src/core/` and vice versa, with one exception: `src/db/pool.ts` is shared. The Phase 1 throughput pipeline must run unchanged whether or not the live server starts, and vice versa. Enforced by `import/no-restricted-paths` ESLint config.
2. **Authorization is checked once at subscribe time.** Never per record. The hot fan-out path is `O(records × subscribed-clients-per-event)` with zero Directus calls.
3. **Subscription state is in-memory.** No durable subscription store. Reconnect re-subscribes; instance failure means a brief gap and a reconnect.
4. **Always-fresh, not always-deliver.** If a slow consumer can't drain its queue, drop the oldest position messages for that connection — latest-position-per-device is what matters. Control messages (`subscribed`/`unsubscribed`/`error`) are guaranteed. (A coalescing sketch follows this list.)
5. **Single origin.** The endpoint is reachable only at the same origin as Directus and the SPA bundle. Cross-origin won't carry the cookie. The reverse-proxy config is responsible for the routing; the Processor binds to a port and trusts the proxy to forward correctly.
6. **No business logic.** This phase ships the protocol and the fan-out plumbing. Nothing in `src/live/` should know what an `entries.race_number` is or what a `class_id` means. Phase 2 may add domain-aware filtering (e.g. "subscribe to a specific class within an event") — out of scope here.
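One plausible shape for rule 4, coalescing to the latest frame per device while a connection is backed up; all names here are assumptions, not the implementation:

```ts
import type { WebSocket } from 'ws';

// Latest serialized position frame per device for one slow connection.
const pending = new Map<string, string>();

function enqueue(deviceId: string, frame: string) {
  pending.set(deviceId, frame); // overwriting drops the older frame for that device first
}

function flush(ws: WebSocket) {
  if (ws.bufferedAmount > 0) return; // socket still draining; keep coalescing
  for (const frame of pending.values()) ws.send(frame);
  pending.clear();
}
```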
## Key design references (read before starting any task)

- `docs/wiki/synthesis/processor-ws-contract.md` — the wire spec. Authoritative.
- `docs/wiki/concepts/live-channel-architecture.md` — the architectural rationale; explains why this lives in the Processor at all.
- `docs/wiki/entities/processor.md` — the entity-level summary, including the multi-instance consumer-group split.
- `docs/wiki/entities/directus.md` — the auth source; explains how the cookie is issued and what `/users/me` returns.
- `docs/wiki/entities/react-spa.md` — the consumer; the `Auth pattern` and `Real-time rendering` sections describe the SPA-side handshake and the rAF coalescer that shapes our delivery cadence.

## Acceptance for the phase as a whole

- [ ] All six task files done.
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean across the new code.
- [ ] `pnpm test:integration` runs the live-pipeline end-to-end test green.
- [ ] Manual smoke: with stage Directus + stage Processor + a `wscat` client carrying a valid cookie, can connect, subscribe to the Rally Albania 2026 event, see the snapshot, and see streamed positions when synthetic positions are published to Redis.
- [ ] No regressions in Phase 1's throughput tests; the durable-write path is unchanged.
- [ ] The "Implementation status" section of `docs/wiki/synthesis/processor-ws-contract.md` updated to reflect "implemented in Phase 1.5".
@@ -14,7 +14,7 @@ Ideas on radar that may or may not become real tasks. Captured here so they don'

- **Alternate consumer for analytics export.** A second consumer group reading the same stream, writing to a parallel destination (Parquet on object storage, ClickHouse, etc.) for offline analytics. The Phase 1 architecture already supports this — it's a separate process joining the same stream with a different group name. No Processor changes needed; just operational scaffolding. (A minimal sketch appears after this list.)

- **WebSocket gateway for live updates.** If Directus's WebSocket subscriptions hit a fan-out ceiling for spectator-facing live leaderboards, a dedicated gateway reads from Redis and pushes to clients, bypassing Directus for the live channel only. REST/GraphQL stays in Directus. Mentioned in `wiki/entities/directus.md`.

- **Lifting the live-broadcast WebSocket out of the Processor into a standalone gateway service.** Phase 1.5 implements the WS endpoint inside the Processor process per [[live-channel-architecture]]. If sustained throughput exceeds the threshold documented there (~10k WS messages/sec), or if connection-time auth becomes a thundering herd at race start with thousands of viewers, the wiki's documented escape hatch is to extract the WS code into a standalone service that subscribes to the same `live-broadcast-*` consumer group. The Redis-stream-in / WebSocket-out contract doesn't change; only the host process does. Promote this to a numbered phase only when measurements justify it.

- **Per-instance sharding hint.** If consumer-group load distribution turns out to be uneven (one instance handles all the chatty devices), introduce hashing-by-device-id with explicit assignment. Probably overkill — Redis Streams' default round-robin works for most workloads.
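The analytics-export idea above is just "another group on the same stream": a minimal ioredis sketch, with the group and consumer names invented for illustration.

```ts
import Redis from 'ioredis';

const redis = new Redis(); // connection details depend on the deployment

// Create the group once; ignore the BUSYGROUP error if it already exists.
await redis
  .xgroup('CREATE', 'telemetry:teltonika', 'analytics-export', '$', 'MKSTREAM')
  .catch(() => {});

// Read batches independently of the Processor's `processor` and `live-broadcast-*` groups.
const batch = await redis.xreadgroup(
  'GROUP', 'analytics-export', 'exporter-1',
  'COUNT', 100,
  'BLOCK', 5_000,
  'STREAMS', 'telemetry:teltonika', '>',
);
// ...transform `batch` into the analytics destination, then XACK the ids.
```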