Adds end-to-end integration test for the WebSocket live broadcast pipeline:
Redis + TimescaleDB containers + Directus stub → full pipeline boot → real
WS client assertions. Mirrors pipeline.integration.test.ts pattern with the
skip-on-no-Docker guard.
Key additions:
- test/live.integration.test.ts: 6 test scenarios — happy path (subscribe →
snapshot → live position), auth rejection (401), forbidden subscription
(error/forbidden), multi-client fan-out (both receive position), orphan
position (no WS frame), faulty snapshot exclusion (next-best non-faulty)
- test/helpers/directus-stub.ts: bare http.createServer stub for /users/me
and /items/events/:id endpoints with cookie-based user lookup
- test/fixtures/test-schema.sql: minimal schema subset (events, entries,
entry_devices with IMEI-as-device_id for Phase 1 join semantics)
The integration test runs via `pnpm test:integration`, not `pnpm test`.
Docker required; the suite skips cleanly when Docker is unavailable.
Adds snapshot provider that queries the latest non-faulty position per device
registered to an event, returned in the `subscribed` reply so the SPA map is
populated immediately rather than waiting for the first live broadcast batch.
Key changes:
- src/live/snapshot.ts: createSnapshotProvider factory using DISTINCT ON
(device_id) ... ORDER BY device_id, ts DESC with WHERE faulty=false; converts
Date ts to epoch ms; omits speed/course when 0 (matching broadcast convention)
- src/main.ts: injects createSnapshotProvider(pool) into createSubscriptionRegistry
- test/live-snapshot.test.ts: 7 unit tests covering: two-device result, empty
event, faulty exclusion, DISTINCT ON semantics, parameterized query, metrics
observation, and error propagation
The snapshot query requires the positions_device_ts_idx created in migration 0002
(task 1.5.4). Snapshot failures fail open — registry.fetchSnapshot returns [] so
the subscription still succeeds with an empty initial state.
Adds the per-instance Redis Stream consumer group (live-broadcast-{instance_id})
that reads the telemetry stream and fans out each position to subscribed
WebSocket connections without affecting the durable-write consumer path.
Key changes:
- src/shared/codec.ts: moved decodePosition/CodecError out of src/core/ so
src/live/broadcast.ts can decode positions without crossing the enforced
src/core/ ↔ src/live/ boundary; src/core/codec.ts now re-exports from there
- src/shared/types.ts: added Position and AttributeValue (same move, same reason);
src/core/types.ts re-exports both to preserve existing import paths
- src/live/broadcast.ts: createBroadcastConsumer factory — XREADGROUP loop,
immediate ACK semantics, toPositionMessage mapper, fanOut per event/topic
- src/live/device-event-map.ts: createDeviceEventMap factory — in-memory cache
of entry_devices × entries join, refreshed every LIVE_DEVICE_EVENT_REFRESH_MS
- src/db/migrations/0002_positions_faulty.sql: adds faulty boolean column and
positions_device_ts_idx for snapshot-on-subscribe query (task 1.5.5)
- src/main.ts: wired authClient, authzClient, registry, liveServer,
deviceEventMap, broadcastConsumer; shutdown chain: liveServer → deviceEventMap
+ broadcastConsumer → durable-write consumer → metricsServer → Redis → Postgres
- test/live-broadcast.test.ts: 4 unit tests covering single subscriber, multiple
subscribers, orphan device, and multi-event device fan-out
Subscribe/unsubscribe with per-event authorization via Directus delegation:
- src/live/authz.ts: createAuthzClient factory; canAccessEvent(cookieHeader,
eventId) calls GET /items/events/<id>?fields=id, delegates row-level security
to Directus (200=allow, 403=forbidden, 404=not-found, else error).
- src/live/registry.ts: createSubscriptionRegistry with bidirectional indexes
(WeakMap<conn, topics> + Map<topic, conns>); subscribe/unsubscribe/
onConnectionClose/connectionsForTopic/topicsForConnection/stats. Authorization
runs once at subscribe time. Snapshot is stubbed as [] until task 1.5.5.
Includes pluggable SnapshotProvider interface for task 1.5.5 injection.
- src/live/protocol.ts: adds 'error' to ErrorCode union for transient authz
failures.
- src/main.ts: wires createAuthzClient + createSubscriptionRegistry; replaces
the stub message handler with the real subscribe/unsubscribe router; passes
registry.onConnectionClose as the server's onClose callback.
- test/live-authz.test.ts: 6 unit tests for all canAccessEvent outcomes.
- test/live-registry.test.ts: 9 unit tests for subscribe/unsubscribe semantics,
idempotency, gauge correctness, and onConnectionClose cleanup.
Authenticate WebSocket upgrade requests via Directus's /users/me:
- src/live/auth.ts: createAuthClient factory; validate() forwards the raw
Cookie: header to Directus, parses the user with zod, and returns
AuthenticatedUser or null. Handles 401/403 (unauthorized), non-2xx (error),
network failures, AbortError (timeout), null data (expired session), and
missing data key (malformed Directus response).
- src/live/server.ts: upgrade handler now calls authClient.validate() before
completing the WS handshake; on null user, writes HTTP 401 and destroys the
socket. LiveConnection gains user: AuthenticatedUser and cookieHeader: string
(needed for per-subscription authz in task 1.5.3). authClient is an optional
parameter so tests without auth still work.
- src/main.ts: wires createAuthClient and passes it to createLiveServer.
- test/live-auth.test.ts: 11 unit tests covering all validate() code paths
including the empty-cookie fast-path, latency histogram observation, and
distinction between unauthorized (401/expired) and error (malformed) results.
Stage discovered the wrong default at runtime: tcp-ingestion's compiled
default REDIS_TELEMETRY_STREAM is 'telemetry:teltonika' but processor's
was 'telemetry:t', so the two services were talking past each other —
tcp-ingestion publishing to one stream, processor reading another empty
one. The deploy stack now pins both to the same value via a shared env
var, but the processor's compiled default should also match so local
development and the integration test stay aligned with reality.
Changes:
- src/config/load.ts — default changed to 'telemetry:teltonika'
- .env.example — same
- test/config.test.ts — default-value assertion updated
- planning docs (ROADMAP, phase-1 README, tasks 03/08/10, phase-3 README) —
occurrences of 'telemetry:t' replaced with 'telemetry:teltonika'
The deploy stack remains the single source of truth via the shared
REDIS_TELEMETRY_STREAM env var. Compiled defaults are belt-and-braces.
Several Phase 1 metrics were registered in observability/metrics.ts but
either unwired at the call sites or wired with wrong counts. Production
output showed 11 records ingested per logs but only 4 in metrics. The
fixes below align metric values with actual hot-path activity.
Wiring gaps closed (consumer.ts):
- processor_consumer_reads_total{result=ok|empty|error} — was registered
but never inc'd. Now fires on each XREADGROUP outcome.
- processor_consumer_records_total — was registered but never inc'd.
Now fires once per XREADGROUP, with the entry count.
Counts corrected (writer.ts):
- processor_position_writes_total{status} — was inc'd unconditionally
by 1 per chunk for each of inserted/duplicate. Now inc'd by the
actual per-status count, and only when count > 0.
- processor_position_writes_total{status='failed'} — was inc'd by 1
per failed chunk. Now inc'd by chunk.length so every failed record
is counted.
Counts corrected (main.ts):
- processor_acks_total — was inc'd by 1 per non-empty batch. Now
inc'd by ackIds.length so every ACK'd ID is counted.
Wiring gap closed (state.ts):
- processor_device_state_evictions_total — internal `evicted` counter
existed but was never published to metrics. createDeviceStateStore
now accepts a Metrics injection and inc's on each eviction.
Metrics interface extended (types.ts, metrics.ts):
- Metrics.inc gained an optional third `value` parameter (defaults to 1)
for batched increments. dispatchInc passes it through to prom-client's
Counter.inc(labels, value).
Tests updated to reflect the new third arg and the state.ts factory's
new metrics parameter. Total 134 unit tests passing (no count change —
existing tests adjusted, no new tests added; the real verification is
on stage where the metrics are now meaningful again).
Migration 0001_positions.sql now runs CREATE EXTENSION IF NOT EXISTS
postgis alongside timescaledb. PostGIS isn't used in Phase 1 but enabling
it now means Phase 2's geofence engine doesn't need a separate migration
step. The deploy stack uses the timescale/timescaledb-ha:*-all image
which ships both extensions.
Integration test (pipeline.integration.test.ts) updated to use the same
timescale/timescaledb-ha:pg16.6-ts2.17.2-all image as the deploy stack.
Stock POSTGRES_USER/PASSWORD/DB env vars retained — if recent ha-image
revisions don't accept them, the test container will fail clearly on
first run with Docker, and we'll switch to the right env-var scheme.
src/observability/metrics.ts — full prom-client implementation. All 10
Phase 1 metrics registered (processor_consumer_reads_total,
_records_total, _lag, _decode_errors_total, processor_position_writes_total
{status}, _write_duration_seconds, processor_acks_total,
processor_device_state_{size,evictions_total}) plus nodejs_* defaults.
node:http server with /metrics, /healthz, /readyz. /readyz checks
redis.status === 'ready' AND a 5s-cached SELECT 1 Postgres probe.
processor_consumer_lag sampled every 10s via XINFO GROUPS, falling back
to a no-op when the consumer group hasn't been created yet.
src/main.ts — replaces the trace-logging shim with createMetrics() and
startMetricsServer(); shutdown closes the metrics server before
redis.quit() and pool.end().
test/metrics.test.ts — 22 unit tests: exposition format, every metric
type behaviour, all four HTTP endpoint paths including /readyz 503 cases.
test/pipeline.integration.test.ts — testcontainers Redis 7 +
TimescaleDB latest-pg16. Four scenarios: happy path with bigint+Buffer
attribute round-trip, idempotency on (device_id, ts), malformed payload
stays in PEL (decode_errors_total increments), writer failure → retry
(weaker variant per spec: stop Postgres before publish, restart, verify
row appears). Skip-on-no-Docker pattern verified — exits 0 without
Docker.
Dockerfile — multi-stage matching tcp-ingestion. EXPOSE 9090 only,
HEALTHCHECK on /readyz, image-source label points at processor repo.
.gitea/workflows/build.yml — single-job workflow mirroring
tcp-ingestion. Path filters cover src/, test/, build config, Dockerfile.
Portainer webhook step uncommented for :main auto-deploy.
compose.dev.yaml — local-build variant with Redis + TimescaleDB +
processor-dev for verifying Dockerfile changes without the registry
round-trip.
README.md — fleshed out from stub: quick-start, Docker build, deployment
note, env vars, tests (unit vs. integration), CI behavior. Flags the
deploy-side change needed: deploy/compose.yaml needs a TimescaleDB
service and a processor service entry added.
Verification: typecheck, lint clean; 134 unit tests passing across 8
files (+22 from this batch). pnpm test:integration runs cleanly under
the no-Docker skip pattern.
Phase 1 is now complete. Service is pilot-ready.
src/core/consumer.ts — XREADGROUP loop with consumer-group resumption,
ensureConsumerGroup (BUSYGROUP-tolerant), decodeBatch (CodecError → log
+ skip + leave pending; never speculative ACK), partial-ACK semantics,
connectRedis (mirroring tcp-ingestion's retry pattern), clean stop.
src/core/state.ts — LRU Map<device_id, DeviceState> using delete+set
bump trick (no third-party LRU dep); last_seen = max(prev, ts) so
out-of-order replays don't regress the high-water mark; evictedTotal()
counter.
src/core/writer.ts — multi-row INSERT ON CONFLICT (device_id, ts) DO
NOTHING with RETURNING. Duplicate detection by set-difference between
input and RETURNING rows (xmax=0 doesn't work for skipped-conflict
rows, only returned ones — confirmed in the task spec's own Note).
Sequential chunking to WRITE_BATCH_SIZE; bigint→string and Buffer→base64
attribute serialization that handles Buffer.toJSON shape.
src/main.ts — full pipeline: pool → migrate → redis → state → writer →
sink → consumer → graceful-shutdown stub. Sink ordering is
state.update BEFORE writer.write per spec rationale (state stays
consistent with what's been seen even if not yet persisted; redelivery
is idempotent on state). Metrics is still the trace-logging shim from
tcp-ingestion's pre-1.10 pattern; real prom-client lands in task 1.9.
Verification: typecheck, lint clean; 112 unit tests passing across 7
test files (+39 from this batch).
Scaffold mirrors tcp-ingestion conventions: ESM, strict TS, pnpm, vitest
with unit/integration split, ESLint flat config with no-floating-promises
+ no-misused-promises + import/no-restricted-paths (the new src/core/ →
src/domain/ boundary that protects Phase 1 from Phase 2 churn).
Core types in src/core/types.ts (Position, StreamRecord, DeviceState,
Metrics, AttributeValue) — Position is byte-equivalent to tcp-ingestion's
output. Codec in src/core/codec.ts implements sentinel reversal:
{__bigint:"..."} → bigint, {__buffer_b64:"..."} → Buffer, ISO timestamp
string → Date. CodecError surfaces malformed payload reasons with the
failing field named.
Config in src/config/load.ts (zod schema, all 13 env vars with defaults
and bounded numerics). Logger in src/observability/logger.ts matches
tcp-ingestion exactly: ISO timestamps, string level labels, pino-pretty
in development.
Postgres in src/db/: createPool with sane defaults and application_name,
connectWithRetry mirroring the ioredis retry pattern, a 30-line
migration runner using a schema_migrations table, and 0001_positions.sql
with the hypertable + (device_id, ts) unique index + ts DESC index.
Migration runner unit-tested against a mocked pg.Pool; the real
TimescaleDB round-trip is deferred to task 1.10 per spec.
Verification: typecheck, lint, build all clean; 73 unit tests passing
across 4 files. import/no-restricted-paths verified live by temporarily
adding a forbidden src/domain/ import.