tcp-ingestion/.planning/phase-1-telemetry/10-observability.md
julian d4a6d8f713 Implement Phase 1 task 1.10 (Prometheus metrics + /healthz + /readyz)
Replaces the placeholder Metrics shim with a prom-client implementation
in src/observability/metrics.ts: all 10 Phase 1 metrics from the wiki
spec, plus nodejs_* defaults. Exposes /metrics, /healthz, /readyz over
node:http on METRICS_PORT (9090); /readyz returns 503 when Redis status
is not 'ready' or the TCP listener isn't bound.

The Metrics interface in src/core/types.ts is unchanged — adapter call
sites continue to use the same inc/observe shape. Only main.ts sees the
extended type that adds serializeMetrics().

Side effects:
- Dockerfile re-enables HEALTHCHECK pointing at /readyz, and EXPOSE 9090.
- frame-ingested log downgraded back to debug now that
  teltonika_records_published_total is scrapeable.
- 19 new unit tests covering exposition format, all metric types, and
  every HTTP endpoint path. Total now 98 passing.

Note: deploy/compose.yaml still does not expose 9090 — separate decision
about how Prometheus reaches the service (host port vs. internal scraper
on the same Docker network).
2026-04-30 20:54:32 +02:00


Task 1.10 — Observability (Prometheus metrics)

Phase: 1 — Inbound telemetry
Status: 🟩
Depends on: 1.2, 1.3
Wiki refs: docs/wiki/sources/teltonika-ingestion-architecture.md § 7. Observability; docs/wiki/sources/gps-tracking-architecture.md § 7.4

Goal

Expose Prometheus metrics over an HTTP endpoint so the platform's observability stack can scrape them. Metrics drive alerting (consumer lag, unknown codecs, CRC failures) and capacity planning (connection counts, frame rates).

Deliverables

  • src/observability/metrics.ts:
    • Exports createMetrics(): Metrics returning a typed wrapper around prom-client registries.
    • All metric definitions in one place, with explicit names/labels matching the wiki spec.
    • A serializeMetrics(): Promise<string> returning the standard Prom exposition format.
    • A startMetricsServer(port, metrics): http.Server that exposes GET /metrics, GET /healthz, and GET /readyz.
  • Wiring updates: every place that should emit a metric (handshake outcome, frame outcome, publish queue depth, etc.) calls into the Metrics object.
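
The inc/observe call shape and the exposition output the deliverables describe can be made concrete without any dependency. The class below is purely illustrative — the real module wraps prom-client registries, which render this format for us, and the actual Metrics interface lives in src/core/types.ts:

```typescript
// Illustrative only: a hand-rolled counter that renders the Prometheus
// text exposition format ("# HELP / # TYPE / samples").
type Labels = Record<string, string>;

class SketchCounter {
  private values = new Map<string, number>();
  name: string;
  help: string;

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Same call shape adapters use: inc(labels, optional increment).
  inc(labels: Labels = {}, value = 1): void {
    const key = JSON.stringify(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + value);
  }

  serialize(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const [key, value] of this.values) {
      const pairs = Object.entries(JSON.parse(key) as Labels)
        .map(([k, v]) => `${k}="${v}"`)
        .join(",");
      lines.push(pairs ? `${this.name}{${pairs}} ${value}` : `${this.name} ${value}`);
    }
    return lines.join("\n");
  }
}

const framesTotal = new SketchCounter("teltonika_frames_total", "Frame-level outcomes.");
framesTotal.inc({ codec: "8", result: "ok" });
console.log(framesTotal.serialize());
// # HELP teltonika_frames_total Frame-level outcomes.
// # TYPE teltonika_frames_total counter
// teltonika_frames_total{codec="8",result="ok"} 1
```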

Specification

Metric inventory (Phase 1)

Per docs/wiki/sources/teltonika-ingestion-architecture.md § 7:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `teltonika_connections_active` | gauge | — | Currently open device sessions. |
| `teltonika_handshake_total` | counter | `result=accepted\|rejected\|malformed`, `known=known\|unknown` | IMEI handshake outcomes. The `known` label distinguishes IMEIs that the configured DeviceAuthority recognizes from those it does not. With the default AllowAllAuthority, `known` is always `known`. |
| `teltonika_device_authority_failures_total` | counter | — | Times a `DeviceAuthority.check` call threw or timed out. A non-zero rate indicates the allow-list refresher (task 1.13) is unhealthy. |
| `teltonika_frames_total` | counter | `codec=8\|8E\|16\|unknown`, `result=ok\|crc_fail\|truncated\|n_mismatch` | Frame-level outcomes. |
| `teltonika_records_published_total` | counter | `codec` | AVL records emitted to Redis. |
| `teltonika_parse_duration_seconds` | histogram | `codec` | Per-frame parse time. Buckets: `[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]` (seconds). |
| `teltonika_unknown_codec_total` | counter | `codec_id` (string of the offending byte) | Canary for codec coverage drift. |
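
Prometheus histogram buckets are cumulative: an observation counts toward every bucket whose upper bound (`le`) is at or above it. A dependency-free sketch of how the bucket array above would bin three hypothetical parse times:

```typescript
// The bucket upper bounds from the spec, in seconds.
const buckets = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1];

// Cumulative bucket counts for a set of observed parse times, the way a
// Prometheus histogram reports them (including the implicit +Inf bucket).
function bucketCounts(observations: number[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const le of [...buckets, Infinity]) {
    const label = le === Infinity ? "+Inf" : String(le);
    counts.set(label, observations.filter((v) => v <= le).length);
  }
  return counts;
}

// Three hypothetical parse times: 0.3 ms, 0.7 ms, 2 ms.
const counts = bucketCounts([0.0003, 0.0007, 0.002]);
// counts.get("0.0005") === 1, counts.get("0.001") === 2, counts.get("+Inf") === 3
```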

Phase 1 also adds publisher-related metrics from task 1.8:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `teltonika_publish_queue_depth` | gauge | — | Current bounded-queue depth. |
| `teltonika_publish_overflow_total` | counter | — | Records dropped because the queue was full. |
| `teltonika_publish_duration_seconds` | histogram | — | XADD latency. |

Plus shell-level:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `nodejs_*` | various | — | Default Node.js process metrics (via prom-client's `collectDefaultMetrics()`). |

Naming convention

  • teltonika_* for adapter-specific metrics.
  • nodejs_* for runtime metrics (default).
  • No service prefix — Prometheus scrape config adds the service and instance labels externally.

Health and readiness

  • GET /healthz: returns 200 OK if the process is alive. (Liveness probe.)
  • GET /readyz: returns 200 OK if the Redis connection is healthy AND the TCP listener is bound. (Readiness probe.) Returns 503 otherwise.
  • Both endpoints return a tiny JSON body { "status": "ok" } for diagnostic value.

HTTP server

Use Node's node:http directly — no Express/Fastify dependency for two endpoints. Keep it minimal, ~30 lines.
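
A sketch along those lines, with the readiness inputs injected (the deps shape here is a guess at what something like ReadyzDeps could look like) and the route logic factored into a pure function so it can be unit-tested without binding a port:

```typescript
import * as http from "node:http";

// Hypothetical dependency shape; the real one in metrics.ts may differ.
interface ReadyzDeps {
  serializeMetrics(): Promise<string>;
  redisReady(): boolean; // e.g. ioredis status === 'ready'
  tcpListenerBound(): boolean;
}

// Pure routing keeps the handler testable without opening a socket.
async function route(
  url: string,
  deps: ReadyzDeps
): Promise<{ status: number; contentType: string; body: string }> {
  switch (url) {
    case "/metrics":
      return {
        status: 200,
        contentType: "text/plain; version=0.0.4; charset=utf-8",
        body: await deps.serializeMetrics(),
      };
    case "/healthz":
      // Liveness: a reachable process is enough.
      return { status: 200, contentType: "application/json", body: '{"status":"ok"}' };
    case "/readyz": {
      // Readiness: Redis healthy AND the TCP listener bound.
      const ready = deps.redisReady() && deps.tcpListenerBound();
      return {
        status: ready ? 200 : 503,
        contentType: "application/json",
        body: ready ? '{"status":"ok"}' : '{"status":"unavailable"}',
      };
    }
    default:
      return { status: 404, contentType: "text/plain", body: "not found" };
  }
}

function startMetricsServer(port: number, deps: ReadyzDeps): http.Server {
  const server = http.createServer((req, res) => {
    route(req.url ?? "/", deps).then(({ status, contentType, body }) => {
      res.writeHead(status, { "Content-Type": contentType });
      res.end(body);
    });
  });
  server.listen(port);
  return server;
}
```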

Acceptance criteria

  • curl http://localhost:9090/metrics returns valid Prometheus exposition format with every metric in the inventory present (some at zero).
  • After processing the canonical Codec 8 fixture, teltonika_records_published_total{codec="8"} increments by 1 and teltonika_frames_total{codec="8",result="ok"} increments by 1.
  • Sending a packet with an unknown codec ID increments teltonika_unknown_codec_total{codec_id="..."}.
  • After a handshake from an IMEI that the configured DeviceAuthority does not recognize, teltonika_handshake_total{result="accepted",known="unknown"} increments by 1.
  • GET /readyz returns 503 while Redis is unreachable, then 200 once it reconnects.
  • Prom-client default metrics are exposed (Node version, GC, event loop lag).

Risks / open questions

  • Cardinality of codec_id label on teltonika_unknown_codec_total: bounded by 256 possible byte values. Acceptable.
  • Cardinality of device_id (IMEI) in metrics: avoid. Per-device metrics belong in logs/traces, not Prometheus, because the cardinality is unbounded. Phase 1 does not add per-IMEI labels anywhere. (This is a watch-out for future tasks.)
  • Histogram buckets for teltonika_parse_duration_seconds: tuned for sub-millisecond expected times. Adjust based on real production data after the first week.

Done

Implemented src/observability/metrics.ts with createMetrics(), startMetricsServer(), and ReadyzDeps. Replaced the placeholder shim in src/main.ts, wired the metrics server into boot and graceful shutdown, downgraded the frame-ingested log to debug, and re-enabled the Dockerfile HEALTHCHECK. Landed in: (pending commit SHA)