# Task 1.9 — Observability (Prometheus metrics + `/healthz` + `/readyz`)

- **Phase:** 1 — Throughput pipeline
- **Status:** ⬜ Not started
- **Depends on:** 1.5, 1.6, 1.7, 1.8
- **Wiki refs:** `docs/wiki/entities/processor.md`, `docs/wiki/sources/gps-tracking-architecture.md` § 7.4
## Goal

Replace the placeholder `Metrics` shim with a real `prom-client` implementation. Expose `/metrics` (Prometheus exposition format), `/healthz` (liveness), and `/readyz` (readiness — Redis ready AND Postgres ready) on `METRICS_PORT`.

This is not a deferral candidate (unlike tcp-ingestion task 1.10). The Processor has no other surface for measuring consumer lag, write throughput, or failure rates — without it, "is the pilot keeping up?" cannot be answered.
## Deliverables

- `src/observability/metrics.ts` — same shape as `tcp-ingestion/src/observability/metrics.ts` (a sketch follows this list):
  - `createMetrics(): Metrics & { serializeMetrics(): Promise<string> }` — wraps `prom-client` registries; calls `collectDefaultMetrics()` once for `nodejs_*` process metrics.
  - `startMetricsServer(port, metrics, deps): http.Server` — `node:http` server with three endpoints. `deps` carries the `/readyz` health checks: `{ isRedisReady(): boolean; isPostgresReady(): boolean }`.
- Update `src/main.ts` to use the real `createMetrics()` and start the metrics server after Redis + Postgres are connected and the consumer is started. Wire it into graceful shutdown (close it before `redis.quit()`).
- Tests: `test/metrics.test.ts` mirroring the `tcp-ingestion` test pattern — exposition format, counter/gauge/histogram behaviour, all four HTTP endpoint paths including `/readyz` 503 cases.
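A minimal sketch of what the module could look like. The `Metrics` field names here are assumptions for illustration — the real shape should match the `tcp-ingestion` file, which this task file does not reproduce:

```typescript
import { Registry, Counter, Gauge, Histogram, collectDefaultMetrics } from 'prom-client';

// Field names are hypothetical; mirror tcp-ingestion's actual Metrics shape.
export interface Metrics {
  consumerReads: Counter<'result'>;
  positionWrites: Counter<'status'>;
  positionWriteDuration: Histogram<string>;
  deviceStateSize: Gauge<string>;
  // ...remaining inventory metrics follow the same three patterns
}

export function createMetrics(): Metrics & { serializeMetrics(): Promise<string> } {
  const registry = new Registry();
  collectDefaultMetrics({ register: registry }); // nodejs_* process metrics, registered once

  return {
    consumerReads: new Counter({
      name: 'processor_consumer_reads_total',
      help: 'XREADGROUP calls by result',
      labelNames: ['result'],
      registers: [registry],
    }),
    positionWrites: new Counter({
      name: 'processor_position_writes_total',
      help: 'Per-record write outcomes',
      labelNames: ['status'],
      registers: [registry],
    }),
    positionWriteDuration: new Histogram({
      name: 'processor_position_write_duration_seconds',
      help: 'Per-batch write latency',
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
      registers: [registry],
    }),
    deviceStateSize: new Gauge({
      name: 'processor_device_state_size',
      help: 'Current count of devices in the LRU map',
      registers: [registry],
    }),
    serializeMetrics: () => registry.metrics(), // Prometheus exposition-format string
  };
}
```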
## Specification

### Metric inventory

| Metric | Type | Labels | Description |
|---|---|---|---|
| `processor_consumer_reads_total` | counter | `result=ok\|empty\|error` | XREADGROUP calls; `empty` = BLOCK timeout, `error` = client error |
| `processor_consumer_records_total` | counter | — | Total records pulled off the stream |
| `processor_consumer_lag` | gauge | — | XLEN minus the consumer group's last-delivered-ID position. Sampled every N seconds. |
| `processor_decode_errors_total` | counter | — | Records that failed to decode (malformed payload, sentinel error) |
| `processor_position_writes_total` | counter | `status=inserted\|duplicate\|failed` | Per-record write outcomes |
| `processor_position_write_duration_seconds` | histogram | — | Per-batch write latency. Buckets `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]` |
| `processor_acks_total` | counter | — | Total IDs ACKed |
| `processor_device_state_size` | gauge | — | Current count of devices in the LRU map |
| `processor_device_state_evictions_total` | counter | — | Total LRU evictions since start |
| `nodejs_*` | various | — | Default Node process metrics |
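For a feel of how the write metrics get exercised, here is a hedged call-site sketch. `flushBatch` and its return shape are hypothetical stand-ins for the task 1.7 batched writer, not its actual API:

```typescript
import type { Metrics } from './metrics'; // the sketch above

// Hypothetical writer hook; the real batch type and flush function live in task 1.7.
declare function flushBatch(batch: unknown[]): Promise<{ inserted: number; duplicates: number }>;

async function writeBatch(metrics: Metrics, batch: unknown[]): Promise<void> {
  const endTimer = metrics.positionWriteDuration.startTimer(); // per-batch latency
  try {
    const { inserted, duplicates } = await flushBatch(batch);
    // Per-record outcomes, incremented by count rather than one call per record
    metrics.positionWrites.inc({ status: 'inserted' }, inserted);
    metrics.positionWrites.inc({ status: 'duplicate' }, duplicates);
  } catch (err) {
    metrics.positionWrites.inc({ status: 'failed' }, batch.length);
    throw err; // the writer's own retry/shutdown policy decides what happens next
  } finally {
    endTimer(); // observes the elapsed time into the histogram
  }
}
```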
### Naming convention

- `processor_*` for service-specific metrics. `tcp-ingestion` uses `teltonika_*` because that's its adapter; the Processor isn't bound to a vendor, so the service-name prefix fits.
- No external `service` label — the Prometheus scrape config adds it.
### Health and readiness

- `GET /healthz` → 200 if the process is alive. Always returns `{ "status": "ok" }`.
- `GET /readyz` → 200 if Redis is ready (`redis.status === 'ready'`) AND Postgres is ready (last successful query within 30s, or a fresh `SELECT 1` succeeds quickly); 503 otherwise. See the probe sketch after this list.
- Both endpoints return tiny JSON bodies for diagnostic value.
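One way to satisfy the synchronous `isPostgresReady(): boolean` dep under the 30s rule above: track the last successful query and refresh it with a background `SELECT 1` probe. The function names and 10s probe interval here are assumptions:

```typescript
import type { Pool } from 'pg';

let lastPgSuccess = 0;

// Call after every successful query (e.g., from the batched writer).
export function notePgSuccess(): void {
  lastPgSuccess = Date.now();
}

// Background SELECT 1 keeps readiness fresh even when the pipeline is idle.
export function startPgProbe(pool: Pool, intervalMs = 10_000): NodeJS.Timeout {
  const timer = setInterval(async () => {
    try {
      await pool.query('SELECT 1');
      notePgSuccess();
    } catch {
      // Leave lastPgSuccess stale; /readyz flips to 503 once the window expires.
    }
  }, intervalMs);
  timer.unref(); // never keep the process alive just for the probe
  return timer;
}

export const isPostgresReady = (): boolean => Date.now() - lastPgSuccess < 30_000;
```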
### `processor_consumer_lag` measurement

Sample every 10s in a separate `setInterval` (don't compute it on every read — too noisy). Compute as:

```
lag = XLEN(stream) - position_of(group_last_delivered_id_in_stream)
```

Use `XINFO GROUPS <stream>` → `lag` field (Redis 7.2+). If the field is absent, fall back to `XLEN` minus 0, i.e. the raw stream length (a good-enough proxy when the group is caught up; flag it as "approximate" in the metric description).

If sampling fails (Redis blip), log at debug and continue. Don't let metrics gathering break the consumer. A sketch follows.
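A sampler sketch under those rules. It assumes ioredis (the `redis.status` check above suggests it) and a pino-style logger; `startLagSampler` and its parameters are hypothetical names:

```typescript
import type Redis from 'ioredis';
import type { Gauge } from 'prom-client';

export function startLagSampler(
  redis: Redis,
  lagGauge: Gauge<string>,
  stream: string,
  group: string,
  logger: { debug: (obj: unknown, msg: string) => void }, // stand-in for the service logger
  intervalMs = 10_000,
): NodeJS.Timeout {
  const timer = setInterval(async () => {
    try {
      // XINFO GROUPS replies with one flat [field, value, ...] array per group
      const groups = (await redis.xinfo('GROUPS', stream)) as (string | number)[][];
      for (const g of groups) {
        const fields = new Map<string | number, string | number>();
        for (let i = 0; i < g.length; i += 2) fields.set(g[i], g[i + 1]);
        if (fields.get('name') !== group) continue;
        const lag = fields.get('lag'); // present on Redis 7.2+
        // Spec fallback when the field is absent: raw XLEN as an approximate proxy
        lagGauge.set(lag != null ? Number(lag) : await redis.xlen(stream));
      }
    } catch (err) {
      // A Redis blip must never break the consumer; debug-log and move on
      logger.debug({ err }, 'consumer lag sample failed');
    }
  }, intervalMs);
  timer.unref();
  return timer;
}
```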
### HTTP server — same minimal `node:http`

No Express. Roughly 30 lines. Match `tcp-ingestion`'s style.
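A sketch under those constraints, matching the `deps` shape from the deliverables (the exact response bodies are assumptions beyond the `/healthz` one specified above):

```typescript
import http from 'node:http';

export function startMetricsServer(
  port: number,
  metrics: { serializeMetrics(): Promise<string> },
  deps: { isRedisReady(): boolean; isPostgresReady(): boolean },
): http.Server {
  const server = http.createServer(async (req, res) => {
    if (req.url === '/metrics') {
      const body = await metrics.serializeMetrics();
      res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4; charset=utf-8' });
      res.end(body);
    } else if (req.url === '/healthz') {
      // Liveness: the process answering at all is the signal
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end('{"status":"ok"}');
    } else if (req.url === '/readyz') {
      const redisOk = deps.isRedisReady();
      const pgOk = deps.isPostgresReady();
      res.writeHead(redisOk && pgOk ? 200 : 503, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ redis: redisOk, postgres: pgOk })); // tiny diagnostic body
    } else {
      res.writeHead(404, { 'Content-Type': 'application/json' });
      res.end('{"error":"not found"}');
    }
  });
  return server.listen(port);
}
```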
## Acceptance criteria

- `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- `curl http://localhost:9090/metrics` returns valid exposition format with every metric in the inventory present (some at zero).
- After processing one record end-to-end, `processor_consumer_records_total` increments by 1, `processor_position_writes_total{status="inserted"}` increments by 1, and `processor_acks_total` increments by 1.
- `/readyz` returns 503 while Redis is disconnected (simulate via `redis.disconnect()`), 200 once it reconnects (see the test sketch after this list).
- `/readyz` returns 503 while the Pool fails its health probe, 200 when it recovers.
- `nodejs_*` default metrics are exposed.
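A hedged test sketch for the `/readyz` flip. It assumes vitest as the runner (adapt to whatever the `tcp-ingestion` tests actually use) and fakes the readiness checks instead of a real `redis.disconnect()`:

```typescript
import { afterAll, beforeAll, expect, test } from 'vitest';
import { once } from 'node:events';
import type { AddressInfo } from 'node:net';
import type { Server } from 'node:http';
import { createMetrics, startMetricsServer } from '../src/observability/metrics';

let server: Server;
let port: number;
let redisReady = false; // toggled by the test to simulate disconnect/reconnect

beforeAll(async () => {
  // Port 0 asks the OS for a free ephemeral port
  server = startMetricsServer(0, createMetrics(), {
    isRedisReady: () => redisReady,
    isPostgresReady: () => true,
  });
  await once(server, 'listening');
  port = (server.address() as AddressInfo).port;
});

afterAll(() => server.close());

test('/readyz flips 503 -> 200 with Redis readiness', async () => {
  expect((await fetch(`http://127.0.0.1:${port}/readyz`)).status).toBe(503);
  redisReady = true;
  expect((await fetch(`http://127.0.0.1:${port}/readyz`)).status).toBe(200);
});
```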
## Risks / open questions

- Cardinality of label values. None of the Phase 1 metrics use unbounded labels. Phase 2 may want per-stage metrics — be careful: hundreds of stages is fine, hundreds of devices as labels is not. Keep the same rule as `tcp-ingestion`: per-device labels never go on Prometheus metrics; logs/traces are the right place.
- `processor_consumer_lag` sampling cadence. 10s is a guess. If alerts get jittery, lower to 5s or raise to 30s. Tunable later.
## Done

(Fill in once complete: commit SHA, brief notes.)