
Task 1.9 — Observability (Prometheus metrics + /healthz + /readyz)

Phase: 1 — Throughput pipeline
Status: Not started
Depends on: 1.5, 1.6, 1.7, 1.8
Wiki refs: docs/wiki/entities/processor.md, docs/wiki/sources/gps-tracking-architecture.md § 7.4

Goal

Replace the placeholder Metrics shim with a real prom-client implementation. Expose /metrics (Prometheus exposition format), /healthz (liveness), and /readyz (readiness — Redis ready AND Postgres ready) on METRICS_PORT.

This is not a deferral candidate (unlike tcp-ingestion task 1.10). The Processor has no other surface for measuring consumer lag, write throughput, or failure rates — without it, "is the pilot keeping up?" cannot be answered.

Deliverables

  • src/observability/metrics.ts — same shape as tcp-ingestion/src/observability/metrics.ts:
    • createMetrics(): Metrics & { serializeMetrics(): Promise<string> } — wraps prom-client registries; calls collectDefaultMetrics() once for nodejs_* process metrics (sketched after this list).
    • startMetricsServer(port, metrics, deps): http.Server — a node:http server with three endpoints. deps carries readyz health checks: { isRedisReady(): boolean; isPostgresReady(): boolean }.
  • Update src/main.ts to use the real createMetrics() and start the metrics server after Redis + Postgres are connected and the consumer is started. Wire it into graceful shutdown (close it before redis.quit()).
  • Tests: test/metrics.test.ts mirroring the tcp-ingestion test pattern — exposition format, counter/gauge/histogram behaviour, all four HTTP endpoint paths including /readyz 503 cases.
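
A minimal sketch of what createMetrics() could look like, assuming prom-client; only a representative subset of the metric inventory below is defined, and the local Metrics interface is an illustrative stand-in for the existing shim's type rather than its real definition:

```ts
// src/observability/metrics.ts — sketch only; the real Metrics interface is the
// existing shim's type, and this subset of the inventory is illustrative.
import { Registry, Counter, Gauge, Histogram, collectDefaultMetrics } from 'prom-client';

export interface Metrics {
  consumerReads: Counter<'result'>;
  positionWrites: Counter<'status'>;
  positionWriteDuration: Histogram;
  deviceStateSize: Gauge;
}

export function createMetrics(): Metrics & { serializeMetrics(): Promise<string> } {
  const registry = new Registry();
  collectDefaultMetrics({ register: registry }); // nodejs_* process metrics, registered once

  return {
    consumerReads: new Counter<'result'>({
      name: 'processor_consumer_reads_total',
      help: 'XREADGROUP calls by result',
      labelNames: ['result'],
      registers: [registry],
    }),
    positionWrites: new Counter<'status'>({
      name: 'processor_position_writes_total',
      help: 'Per-record write outcomes',
      labelNames: ['status'],
      registers: [registry],
    }),
    positionWriteDuration: new Histogram({
      name: 'processor_position_write_duration_seconds',
      help: 'Per-batch write latency',
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
      registers: [registry],
    }),
    deviceStateSize: new Gauge({
      name: 'processor_device_state_size',
      help: 'Devices currently held in the LRU map',
      registers: [registry],
    }),
    serializeMetrics: () => registry.metrics(), // Prometheus exposition text
  };
}
```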

Specification

Metric inventory

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| processor_consumer_reads_total | counter | result=ok\|empty\|error | XREADGROUP calls; empty = BLOCK timeout, error = client error |
| processor_consumer_records_total | counter | (none) | Total records pulled off the stream |
| processor_consumer_lag | gauge | (none) | XLEN minus the consumer group's last-delivered ID position. Sampled every N seconds. |
| processor_decode_errors_total | counter | (none) | Records that failed to decode (malformed payload, sentinel error) |
| processor_position_writes_total | counter | status=inserted\|duplicate\|failed | Per-record write outcomes |
| processor_position_write_duration_seconds | histogram | (none) | Per-batch write latency. Buckets [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5] |
| processor_acks_total | counter | (none) | Total IDs ACKed |
| processor_device_state_size | gauge | (none) | Current count of devices in the LRU map |
| processor_device_state_evictions_total | counter | (none) | Total LRU evictions since start |
| nodejs_* | various | (none) | Default Node process metrics |
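
For orientation, call sites drive the labelled series roughly as below; the field names follow the createMetrics sketch above and are assumptions rather than the final interface:

```ts
import type { Counter, Histogram } from 'prom-client';

// Illustrative writer-side helper: time one batch, then count per-record outcomes.
async function recordBatchWrite(
  metrics: { positionWrites: Counter<'status'>; positionWriteDuration: Histogram },
  writeBatch: () => Promise<{ inserted: number; duplicate: number }>,
): Promise<void> {
  const stopTimer = metrics.positionWriteDuration.startTimer(); // records elapsed seconds when called
  const { inserted, duplicate } = await writeBatch();
  stopTimer();
  metrics.positionWrites.inc({ status: 'inserted' }, inserted);
  metrics.positionWrites.inc({ status: 'duplicate' }, duplicate);
}

// Consumer side, one increment per XREADGROUP call:
//   metrics.consumerReads.inc({ result: 'ok' });     // entries returned
//   metrics.consumerReads.inc({ result: 'empty' });  // BLOCK timeout
```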

Naming convention

  • processor_* for service-specific metrics. tcp-ingestion uses teltonika_* because that's its adapter; the Processor isn't bound to a vendor, so the service-name prefix fits.
  • No external service label — Prometheus scrape config adds it.

Health and readiness

  • GET /healthz → 200 if the process is alive. Always returns { "status": "ok" }.
  • GET /readyz → 200 if both Redis is ready (redis.status === 'ready') AND Postgres is ready (last successful query within 30s, or a fresh SELECT 1 succeeds quickly). 503 otherwise. One possible wiring is sketched after this list.
  • Both endpoints return tiny JSON bodies for diagnostic value.
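
One possible wiring for those checks, assuming ioredis and pg; createReadinessDeps and the background-probe cadence are illustrative, while the 30-second freshness window comes from the rule above:

```ts
import type Redis from 'ioredis';
import type { Pool } from 'pg';

export interface ReadinessDeps {
  isRedisReady(): boolean;
  isPostgresReady(): boolean;
}

// Sketch: remember the last successful Postgres query and refresh it with a
// cheap SELECT 1 probe, so /readyz can answer synchronously.
export function createReadinessDeps(redis: Redis, pool: Pool, maxStalenessMs = 30_000): ReadinessDeps {
  let lastPostgresOkAt = 0;

  const probe = async () => {
    try {
      await pool.query('SELECT 1');
      lastPostgresOkAt = Date.now();
    } catch {
      // leave the timestamp stale; /readyz flips to 503 once it ages out
    }
  };
  void probe();
  const timer = setInterval(probe, maxStalenessMs / 3);
  timer.unref(); // don't keep the process alive just for probing

  return {
    isRedisReady: () => redis.status === 'ready',
    isPostgresReady: () => Date.now() - lastPostgresOkAt < maxStalenessMs,
  };
}
```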

processor_consumer_lag measurement

Sample every 10s in a separate setInterval (don't compute it on every read — too noisy). Compute as:

lag = XLEN(stream) - position_of(group_last_delivered_id_in_stream)

Use the XINFO GROUPS <stream> lag field (Redis 7.2+). If the field is absent, fall back to XLEN minus 0 (good-enough proxy when up to date; flag as "approximate" in the metric description).

If sampling fails (Redis blip), log at debug and continue. Don't let metrics gather break the consumer.
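
A sketch of that sampler, assuming ioredis; the reply parsing handles the flat field/value arrays XINFO GROUPS returns, and startLagSampler is an illustrative name:

```ts
import type Redis from 'ioredis';
import type { Gauge } from 'prom-client';

// Sample consumer lag on an interval; failures are swallowed (debug-logged in
// the real implementation) so a Redis blip never disturbs the consumer.
export function startLagSampler(redis: Redis, stream: string, group: string, lagGauge: Gauge, intervalMs = 10_000) {
  const sample = async () => {
    try {
      const groups = (await redis.xinfo('GROUPS', stream)) as Array<Array<string | number>>;
      for (const entry of groups) {
        // Each group is a flat [field, value, field, value, ...] array.
        const fields = new Map<string, string | number>();
        for (let i = 0; i < entry.length; i += 2) fields.set(String(entry[i]), entry[i + 1]);
        if (fields.get('name') !== group) continue;
        const lag = fields.get('lag');
        if (lag != null) {
          lagGauge.set(Number(lag));
        } else {
          lagGauge.set(Number(await redis.xlen(stream))); // approximate fallback: plain XLEN
        }
        return;
      }
    } catch {
      // ignore; the next tick will try again
    }
  };
  const timer = setInterval(sample, intervalMs);
  timer.unref();
  return () => clearInterval(timer); // stop hook for graceful shutdown
}
```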

HTTP server — same minimal node:http

No Express. Roughly 30 lines. Match tcp-ingestion's style.
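
A sketch in that style; the routes and JSON bodies follow the spec above, while the exposition content-type string and the ReadinessDeps/serializeMetrics shapes repeat assumptions from the earlier sketches:

```ts
import http from 'node:http';

interface ReadinessDeps {
  isRedisReady(): boolean;
  isPostgresReady(): boolean;
}

export function startMetricsServer(
  port: number,
  metrics: { serializeMetrics(): Promise<string> },
  deps: ReadinessDeps,
): http.Server {
  const server = http.createServer(async (req, res) => {
    if (req.url === '/metrics') {
      try {
        const body = await metrics.serializeMetrics();
        res.writeHead(200, { 'content-type': 'text/plain; version=0.0.4; charset=utf-8' });
        res.end(body);
      } catch {
        res.writeHead(500);
        res.end();
      }
    } else if (req.url === '/healthz') {
      res.writeHead(200, { 'content-type': 'application/json' });
      res.end(JSON.stringify({ status: 'ok' }));
    } else if (req.url === '/readyz') {
      const redis = deps.isRedisReady();
      const postgres = deps.isPostgresReady();
      const ready = redis && postgres;
      res.writeHead(ready ? 200 : 503, { 'content-type': 'application/json' });
      res.end(JSON.stringify({ status: ready ? 'ok' : 'unavailable', redis, postgres }));
    } else {
      res.writeHead(404, { 'content-type': 'application/json' });
      res.end(JSON.stringify({ error: 'not found' }));
    }
  });
  server.listen(port);
  return server;
}
```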

Acceptance criteria

  • pnpm typecheck, pnpm lint, pnpm test clean.
  • curl http://localhost:9090/metrics returns valid exposition format with every metric in the inventory present (some at zero).
  • After processing one record end-to-end, processor_consumer_records_total increments by 1, processor_position_writes_total{status="inserted"} increments by 1, processor_acks_total increments by 1.
  • /readyz returns 503 while Redis is disconnected (simulate by redis.disconnect()), 200 once it reconnects (see the test sketch after this list).
  • /readyz returns 503 while the Pool fails its health probe, 200 when it recovers.
  • nodejs_* default metrics are exposed.
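
The /readyz criteria translate into tests roughly like this; node:test is used only to keep the sketch self-contained (the real test/metrics.test.ts mirrors tcp-ingestion's pattern and runner), and the import path and exports are assumed from the sketches above:

```ts
import { test } from 'node:test';
import assert from 'node:assert/strict';
import { once } from 'node:events';
// Hypothetical path/exports — they match the sketches above, not necessarily the final module.
import { createMetrics, startMetricsServer } from '../src/observability/metrics.js';

test('/readyz flips between 503 and 200 with Redis readiness', async () => {
  const metrics = createMetrics();
  let redisReady = false;
  const server = startMetricsServer(0, metrics, {
    isRedisReady: () => redisReady,
    isPostgresReady: () => true,
  });
  await once(server, 'listening');
  const { port } = server.address() as { port: number };

  let res = await fetch(`http://127.0.0.1:${port}/readyz`);
  assert.equal(res.status, 503);

  redisReady = true; // simulate reconnect
  res = await fetch(`http://127.0.0.1:${port}/readyz`);
  assert.equal(res.status, 200);

  server.closeAllConnections();
  await new Promise((resolve) => server.close(resolve));
});
```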

Risks / open questions

  • Cardinality of label values. None of the Phase 1 metrics use unbounded labels. Phase 2 may want per-stage metrics — be careful: hundreds of stages is fine, hundreds of devices as labels is not. Keep the same rule as tcp-ingestion: per-device labels never go on Prometheus metrics; logs/traces are the right place.
  • processor_consumer_lag sampling cadence. 10s is a guess. If alerts get jittery, lower to 5s or raise to 30s. Tunable later.

Done

(Fill in once complete: commit SHA, brief notes.)