Commit Graph

22 Commits

Author SHA1 Message Date
julian 87dec03d3c feat(live): task 1.5.6 — live broadcast integration test
Adds end-to-end integration test for the WebSocket live broadcast pipeline:
Redis + TimescaleDB containers + Directus stub → full pipeline boot → real
WS client assertions. Mirrors pipeline.integration.test.ts pattern with the
skip-on-no-Docker guard.

Key additions:
- test/live.integration.test.ts: 6 test scenarios — happy path (subscribe →
  snapshot → live position), auth rejection (401), forbidden subscription
  (error/forbidden), multi-client fan-out (both receive position), orphan
  position (no WS frame), faulty snapshot exclusion (next-best non-faulty)
- test/helpers/directus-stub.ts: bare http.createServer stub for /users/me
  and /items/events/:id endpoints with cookie-based user lookup
- test/fixtures/test-schema.sql: minimal schema subset (events, entries,
  entry_devices with IMEI-as-device_id for Phase 1 join semantics)

The integration test runs via `pnpm test:integration`, not `pnpm test`.
Docker required; the suite skips cleanly when Docker is unavailable.
2026-05-02 18:38:53 +02:00
julian 3c2c5cf50e docs(planning): file phase-3 task 3.5 — retire processor migration runner
Replaces the original "migration advisory lock" sketch. Once processor
doesn't run DDL, the lock concern delegates to Directus's db-init runner.

Context: positions hypertable + faulty column DDL currently exists in
both processor (src/db/migrations/0001 + 0002) and directus
(db-init/001/002/003). Two sources of truth for the same schema is a
known hazard — adding a column means editing two files in two repos,
and silent drift between them is invisible until runtime.

Fix: directus becomes the sole DDL owner. Processor's migration runner
is retired; only INSERT/SELECT/UPDATE remain.

Task spec covers:
- Pre-flight diff between processor migrations and directus db-init
  (must be byte/semantically equivalent before deletion)
- File-by-file deletion list
- Test infra migration (integration test moves to fixture-based schema
  setup, matching the established Phase 1.5 task 1.5.6 pattern)
- Wiki + ROADMAP updates
- compose.yaml depends_on directus: service_healthy
- Operational notes (existing migrations_applied table is left in place)

Sequence: ideally lands AFTER Phase 1.5 ships so the agent shipping the
WS endpoint isn't pulled into a side quest mid-flight.
2026-05-02 18:37:47 +02:00
julian b3d6410af6 feat(live): task 1.5.5 — snapshot-on-subscribe
Adds snapshot provider that queries the latest non-faulty position per device
registered to an event, returned in the `subscribed` reply so the SPA map is
populated immediately rather than waiting for the first live broadcast batch.

Key changes:
- src/live/snapshot.ts: createSnapshotProvider factory using DISTINCT ON
  (device_id) ... ORDER BY device_id, ts DESC with WHERE faulty=false; converts
  Date ts to epoch ms; omits speed/course when 0 (matching broadcast convention)
- src/main.ts: injects createSnapshotProvider(pool) into createSubscriptionRegistry
- test/live-snapshot.test.ts: 7 unit tests covering: two-device result, empty
  event, faulty exclusion, DISTINCT ON semantics, parameterized query, metrics
  observation, and error propagation

The snapshot query requires the positions_device_ts_idx created in migration 0002
(task 1.5.4).  Snapshot failures fail open — registry.fetchSnapshot returns [] so
the subscription still succeeds with an empty initial state.
2026-05-02 18:37:18 +02:00
julian fbb1f34e9a feat(live): task 1.5.4 — broadcast consumer group and fan-out
Adds the per-instance Redis Stream consumer group (live-broadcast-{instance_id})
that reads the telemetry stream and fans out each position to subscribed
WebSocket connections without affecting the durable-write consumer path.

Key changes:
- src/shared/codec.ts: moved decodePosition/CodecError out of src/core/ so
  src/live/broadcast.ts can decode positions without crossing the enforced
  src/core/ ↔ src/live/ boundary; src/core/codec.ts now re-exports from there
- src/shared/types.ts: added Position and AttributeValue (same move, same reason);
  src/core/types.ts re-exports both to preserve existing import paths
- src/live/broadcast.ts: createBroadcastConsumer factory — XREADGROUP loop,
  immediate ACK semantics, toPositionMessage mapper, fanOut per event/topic
- src/live/device-event-map.ts: createDeviceEventMap factory — in-memory cache
  of entry_devices × entries join, refreshed every LIVE_DEVICE_EVENT_REFRESH_MS
- src/db/migrations/0002_positions_faulty.sql: adds faulty boolean column and
  positions_device_ts_idx for snapshot-on-subscribe query (task 1.5.5)
- src/main.ts: wired authClient, authzClient, registry, liveServer,
  deviceEventMap, broadcastConsumer; shutdown chain: liveServer → deviceEventMap
  + broadcastConsumer → durable-write consumer → metricsServer → Redis → Postgres
- test/live-broadcast.test.ts: 4 unit tests covering single subscriber, multiple
  subscribers, orphan device, and multi-event device fan-out
2026-05-02 18:36:52 +02:00
julian 90605614f6 docs: update task 1.5.3 done section and ROADMAP status 2026-05-02 18:36:23 +02:00
julian bf5c358668 feat(live): task 1.5.3 — subscription registry & per-event authorization
Subscribe/unsubscribe with per-event authorization via Directus delegation:
- src/live/authz.ts: createAuthzClient factory; canAccessEvent(cookieHeader,
  eventId) calls GET /items/events/<id>?fields=id, delegates row-level security
  to Directus (200=allow, 403=forbidden, 404=not-found, else error).
- src/live/registry.ts: createSubscriptionRegistry with bidirectional indexes
  (WeakMap<conn, topics> + Map<topic, conns>); subscribe/unsubscribe/
  onConnectionClose/connectionsForTopic/topicsForConnection/stats. Authorization
  runs once at subscribe time. Snapshot is stubbed as [] until task 1.5.5.
  Includes pluggable SnapshotProvider interface for task 1.5.5 injection.
- src/live/protocol.ts: adds 'error' to ErrorCode union for transient authz
  failures.
- src/main.ts: wires createAuthzClient + createSubscriptionRegistry; replaces
  the stub message handler with the real subscribe/unsubscribe router; passes
  registry.onConnectionClose as the server's onClose callback.
- test/live-authz.test.ts: 6 unit tests for all canAccessEvent outcomes.
- test/live-registry.test.ts: 9 unit tests for subscribe/unsubscribe semantics,
  idempotency, gauge correctness, and onConnectionClose cleanup.
2026-05-02 18:36:00 +02:00
julian 7450cbffaa docs: update task 1.5.2 done section and ROADMAP status 2026-05-02 18:34:40 +02:00
julian 20ebd9b473 feat(live): task 1.5.2 — cookie auth handshake
Authenticate WebSocket upgrade requests via Directus's /users/me:
- src/live/auth.ts: createAuthClient factory; validate() forwards the raw
  Cookie: header to Directus, parses the user with zod, and returns
  AuthenticatedUser or null. Handles 401/403 (unauthorized), non-2xx (error),
  network failures, AbortError (timeout), null data (expired session), and
  missing data key (malformed Directus response).
- src/live/server.ts: upgrade handler now calls authClient.validate() before
  completing the WS handshake; on null user, writes HTTP 401 and destroys the
  socket. LiveConnection gains user: AuthenticatedUser and cookieHeader: string
  (needed for per-subscription authz in task 1.5.3). authClient is an optional
  parameter so tests without auth still work.
- src/main.ts: wires createAuthClient and passes it to createLiveServer.
- test/live-auth.test.ts: 11 unit tests covering all validate() code paths
  including the empty-cookie fast-path, latency histogram observation, and
  distinction between unauthorized (401/expired) and error (malformed) results.
2026-05-02 18:33:54 +02:00
julian 8a78e53e58 docs: update task 1.5.1 done section and ROADMAP status 2026-05-02 18:33:26 +02:00
julian 7154a0a49c feat(live): task 1.5.1 — WS server scaffold + heartbeat
Stand up the WebSocket live-broadcast server inside the Processor process:
- src/live/server.ts: createLiveServer factory with start/stop lifecycle,
  per-connection LiveConnection type, sendOutbound helper with back-pressure
  guard, 30s frame-level heartbeat via ws ping/pong, pluggable onMessage
  handler (stub returns error/not-implemented until 1.5.2/1.5.3).
- src/live/protocol.ts: zod schemas for inbound subscribe/unsubscribe messages,
  all outbound types (subscribed/unsubscribed/position/error), WsCloseCodes.
- src/shared/types.ts: extracted Metrics interface so src/live/ can import it
  without crossing the enforced src/live/ ↔ src/core/ ESLint boundary.
- src/core/types.ts: re-exports Metrics from shared/types to keep Phase 1
  call sites unchanged.
- src/config/load.ts: LIVE_WS_PORT, LIVE_WS_HOST, LIVE_WS_PING_INTERVAL_MS,
  LIVE_WS_DRAIN_TIMEOUT_MS, LIVE_WS_BACKPRESSURE_THRESHOLD_BYTES,
  DIRECTUS_BASE_URL, DIRECTUS_AUTH_TIMEOUT_MS, DIRECTUS_AUTHZ_TIMEOUT_MS,
  LIVE_BROADCAST_GROUP_PREFIX, LIVE_BROADCAST_BATCH_SIZE,
  LIVE_BROADCAST_BATCH_BLOCK_MS, LIVE_DEVICE_EVENT_REFRESH_MS.
- src/observability/metrics.ts: Phase 1.5 metrics inventory (connections,
  inbound/outbound counters, auth/authz histograms, subscription gauge,
  broadcast counters + lag histogram, snapshot histograms, device-event map).
- src/main.ts: wires the live server alongside the durable-write consumer;
  shutdown order: live server → consumer → metrics → Redis → Postgres.
- eslint.config.js: import/no-restricted-paths zones for src/live/ ↔ src/core/.
- test/live-server.test.ts: 7 unit tests covering connect, ping, protocol
  violation, valid message dispatch, connections gauge, and stop() drain.
2026-05-02 18:32:49 +02:00
julian e1c6f59948 Realign processor stream-name default to telemetry:teltonika
Stage discovered the wrong default at runtime: tcp-ingestion's compiled
default REDIS_TELEMETRY_STREAM is 'telemetry:teltonika' but processor's
was 'telemetry:t', so the two services were talking past each other —
tcp-ingestion publishing to one stream, processor reading another empty
one. The deploy stack now pins both to the same value via a shared env
var, but the processor's compiled default should also match so local
development and the integration test stay aligned with reality.

Changes:
- src/config/load.ts — default changed to 'telemetry:teltonika'
- .env.example — same
- test/config.test.ts — default-value assertion updated
- planning docs (ROADMAP, phase-1 README, tasks 03/08/10, phase-3 README) —
  occurrences of 'telemetry:t' replaced with 'telemetry:teltonika'

The deploy stack remains the single source of truth via the shared
REDIS_TELEMETRY_STREAM env var. Compiled defaults are belt-and-braces.
2026-05-01 11:43:31 +02:00
julian d758c211ae Fix metric wiring gaps audited against live processor output
Several Phase 1 metrics were registered in observability/metrics.ts but
either unwired at the call sites or wired with wrong counts. Production
output showed 11 records ingested per logs but only 4 in metrics. The
fixes below align metric values with actual hot-path activity.

Wiring gaps closed (consumer.ts):
- processor_consumer_reads_total{result=ok|empty|error} — was registered
  but never inc'd. Now fires on each XREADGROUP outcome.
- processor_consumer_records_total — was registered but never inc'd.
  Now fires once per XREADGROUP, with the entry count.

Counts corrected (writer.ts):
- processor_position_writes_total{status} — was inc'd unconditionally
  by 1 per chunk for each of inserted/duplicate. Now inc'd by the
  actual per-status count, and only when count > 0.
- processor_position_writes_total{status='failed'} — was inc'd by 1
  per failed chunk. Now inc'd by chunk.length so every failed record
  is counted.

Counts corrected (main.ts):
- processor_acks_total — was inc'd by 1 per non-empty batch. Now
  inc'd by ackIds.length so every ACK'd ID is counted.

Wiring gap closed (state.ts):
- processor_device_state_evictions_total — internal `evicted` counter
  existed but was never published to metrics. createDeviceStateStore
  now accepts a Metrics injection and inc's on each eviction.

Metrics interface extended (types.ts, metrics.ts):
- Metrics.inc gained an optional third `value` parameter (defaults to 1)
  for batched increments. dispatchInc passes it through to prom-client's
  Counter.inc(labels, value).

Tests updated to reflect the new third arg and the state.ts factory's
new metrics parameter. Total 134 unit tests passing (no count change —
existing tests adjusted, no new tests added; the real verification is
on stage where the metrics are now meaningful again).
2026-05-01 11:43:06 +02:00
julian 4a9f55cdf0 Copy SQL migration files into dist/ as part of build
tsc only emits .ts -> .js; non-TypeScript assets like SQL migration
files don't make it into dist/ by default. The migration runner reads
*.sql from dist/db/migrations/ at runtime in production (relative to
the compiled migrate.js), so the missing files surface as a fatal
ENOENT on container startup.

Fix: small node script (scripts/copy-assets.mjs) using fs.cpSync,
invoked after tsc in the build script. Cross-platform, no new
dependencies. The script is in the Docker build context but not
copied into the runtime stage, so it doesn't bloat the final image.

Verified: pnpm build now produces dist/db/migrations/0001_positions.sql.
2026-05-01 10:49:59 +02:00
julian 88cc98f3cc Switch to timescaledb-ha image; enable PostGIS in migration
Migration 0001_positions.sql now runs CREATE EXTENSION IF NOT EXISTS
postgis alongside timescaledb. PostGIS isn't used in Phase 1 but enabling
it now means Phase 2's geofence engine doesn't need a separate migration
step. The deploy stack uses the timescale/timescaledb-ha:*-all image
which ships both extensions.

Integration test (pipeline.integration.test.ts) updated to use the same
timescale/timescaledb-ha:pg16.6-ts2.17.2-all image as the deploy stack.
Stock POSTGRES_USER/PASSWORD/DB env vars retained — if recent ha-image
revisions don't accept them, the test container will fail clearly on
first run with Docker, and we'll switch to the right env-var scheme.
2026-05-01 10:39:27 +02:00
julian 10e20c4038 Fill in commit SHA for Phase 1 tasks 1.9-1.11 in ROADMAP and task files
Phase 1 (throughput pipeline) is now fully landed.
2026-04-30 22:02:21 +02:00
julian be48da9baa Implement Phase 1 tasks 1.9-1.11 (observability + integration test + Dockerfile/CI)
src/observability/metrics.ts — full prom-client implementation. All 10
Phase 1 metrics registered (processor_consumer_reads_total,
_records_total, _lag, _decode_errors_total, processor_position_writes_total
{status}, _write_duration_seconds, processor_acks_total,
processor_device_state_{size,evictions_total}) plus nodejs_* defaults.
node:http server with /metrics, /healthz, /readyz. /readyz checks
redis.status === 'ready' AND a 5s-cached SELECT 1 Postgres probe.
processor_consumer_lag sampled every 10s via XINFO GROUPS, falling back
to a no-op when the consumer group hasn't been created yet.

src/main.ts — replaces the trace-logging shim with createMetrics() and
startMetricsServer(); shutdown closes the metrics server before
redis.quit() and pool.end().

test/metrics.test.ts — 22 unit tests: exposition format, every metric
type behaviour, all four HTTP endpoint paths including /readyz 503 cases.

test/pipeline.integration.test.ts — testcontainers Redis 7 +
TimescaleDB latest-pg16. Four scenarios: happy path with bigint+Buffer
attribute round-trip, idempotency on (device_id, ts), malformed payload
stays in PEL (decode_errors_total increments), writer failure → retry
(weaker variant per spec: stop Postgres before publish, restart, verify
row appears). Skip-on-no-Docker pattern verified — exits 0 without
Docker.

Dockerfile — multi-stage matching tcp-ingestion. EXPOSE 9090 only,
HEALTHCHECK on /readyz, image-source label points at processor repo.

.gitea/workflows/build.yml — single-job workflow mirroring
tcp-ingestion. Path filters cover src/, test/, build config, Dockerfile.
Portainer webhook step uncommented for :main auto-deploy.

compose.dev.yaml — local-build variant with Redis + TimescaleDB +
processor-dev for verifying Dockerfile changes without the registry
round-trip.

README.md — fleshed out from stub: quick-start, Docker build, deployment
note, env vars, tests (unit vs. integration), CI behavior. Flags the
deploy-side change needed: deploy/compose.yaml needs a TimescaleDB
service and a processor service entry added.

Verification: typecheck, lint clean; 134 unit tests passing across 8
files (+22 from this batch). pnpm test:integration runs cleanly under
the no-Docker skip pattern.

Phase 1 is now complete. Service is pilot-ready.
2026-04-30 22:01:55 +02:00
julian 4686a9c391 Fill in commit SHA for Phase 1 tasks 1.5-1.8 in ROADMAP and task files 2026-04-30 21:49:50 +02:00
julian 2a50aaf175 Implement Phase 1 tasks 1.5-1.8 (consumer + state + writer + main wiring)
src/core/consumer.ts — XREADGROUP loop with consumer-group resumption,
ensureConsumerGroup (BUSYGROUP-tolerant), decodeBatch (CodecError → log
+ skip + leave pending; never speculative ACK), partial-ACK semantics,
connectRedis (mirroring tcp-ingestion's retry pattern), clean stop.

src/core/state.ts — LRU Map<device_id, DeviceState> using delete+set
bump trick (no third-party LRU dep); last_seen = max(prev, ts) so
out-of-order replays don't regress the high-water mark; evictedTotal()
counter.

src/core/writer.ts — multi-row INSERT ON CONFLICT (device_id, ts) DO
NOTHING with RETURNING. Duplicate detection by set-difference between
input and RETURNING rows (xmax=0 doesn't work for skipped-conflict
rows, only returned ones — confirmed in the task spec's own Note).
Sequential chunking to WRITE_BATCH_SIZE; bigint→string and Buffer→base64
attribute serialization that handles Buffer.toJSON shape.

src/main.ts — full pipeline: pool → migrate → redis → state → writer →
sink → consumer → graceful-shutdown stub. Sink ordering is
state.update BEFORE writer.write per spec rationale (state stays
consistent with what's been seen even if not yet persisted; redelivery
is idempotent on state). Metrics is still the trace-logging shim from
tcp-ingestion's pre-1.10 pattern; real prom-client lands in task 1.9.

Verification: typecheck, lint clean; 112 unit tests passing across 7
test files (+39 from this batch).
2026-04-30 21:49:29 +02:00
julian 6a14eb1d01 Fill in commit SHA for Phase 1 tasks 1.1-1.4 in ROADMAP and task files 2026-04-30 21:36:23 +02:00
julian 95efc23139 Implement Phase 1 tasks 1.1-1.4 (scaffold + core types + config + Postgres)
Scaffold mirrors tcp-ingestion conventions: ESM, strict TS, pnpm, vitest
with unit/integration split, ESLint flat config with no-floating-promises
+ no-misused-promises + import/no-restricted-paths (the new src/core/ →
src/domain/ boundary that protects Phase 1 from Phase 2 churn).

Core types in src/core/types.ts (Position, StreamRecord, DeviceState,
Metrics, AttributeValue) — Position is byte-equivalent to tcp-ingestion's
output. Codec in src/core/codec.ts implements sentinel reversal:
{__bigint:"..."} → bigint, {__buffer_b64:"..."} → Buffer, ISO timestamp
string → Date. CodecError surfaces malformed payload reasons with the
failing field named.

Config in src/config/load.ts (zod schema, all 13 env vars with defaults
and bounded numerics). Logger in src/observability/logger.ts matches
tcp-ingestion exactly: ISO timestamps, string level labels, pino-pretty
in development.

Postgres in src/db/: createPool with sane defaults and application_name,
connectWithRetry mirroring the ioredis retry pattern, a 30-line
migration runner using a schema_migrations table, and 0001_positions.sql
with the hypertable + (device_id, ts) unique index + ts DESC index.
Migration runner unit-tested against a mocked pg.Pool; the real
TimescaleDB round-trip is deferred to task 1.10 per spec.

Verification: typecheck, lint, build all clean; 73 unit tests passing
across 4 files. import/no-restricted-paths verified live by temporarily
adding a forbidden src/domain/ import.
2026-04-30 21:35:59 +02:00
julian c314ba0902 Add planning documents for Phase 1 (throughput pipeline) and stub Phases 2-4
ROADMAP.md establishes status legend, architectural anchors pointing at the
wiki, and seven non-negotiable design rules — most importantly the
core/domain boundary that protects Phase 1 from Phase 2 churn, the
schema-authority split (positions hypertable owned here; everything else
owned by Directus), and idempotent-writes via (device_id, ts) ON CONFLICT.

Phase 1 (throughput pipeline) is fully detailed across 11 task files:
scaffold, core types + sentinel decoder, config + logging, Postgres
hypertable, Redis Stream consumer, per-device LRU state, batched writer,
main wiring, observability, integration test, Dockerfile + Gitea CI.
Observability is in Phase 1 (not deferred) — lesson learned from
tcp-ingestion task 1.10.

Phases 2-4 are stub READMEs. Phase 2 (domain logic) blocks on Directus
schema decisions and lists those open questions explicitly. Phase 3
(production hardening) and Phase 4 (future) sketch the task shape.
2026-04-30 21:16:59 +02:00
julian 1a4202f4d1 first commit 2026-04-30 20:48:33 +02:00