tcp-ingestion/.planning/phase-1-telemetry/10-observability.md
julian 90d6a73a60 Sync ROADMAP statuses with landed work; mark 1.10/1.12/1.13 as paused
Tasks 1.1-1.9 marked done with their landing commit SHAs. Tasks 1.10
(observability), 1.12 (production hardening), and 1.13 (device
authority) marked paused with explicit resume triggers — pilot
deployment on real Teltonika hardware takes priority. Task 1.11
remains as next, in slimmed form for the pilot (no /readyz healthcheck
since the metrics endpoint is part of paused 1.10).
2026-04-30 16:49:07 +02:00

Task 1.10 — Observability (Prometheus metrics)

Phase: 1 (Inbound telemetry)
Status: ⏸ Paused. Deferred until after the real-device pilot test. See ROADMAP.md "Deferred" section for resume triggers. The placeholder Metrics interface in src/core/types.ts is what code currently uses; this task replaces it with prom-client and adds the /metrics, /healthz, /readyz HTTP endpoints.
Depends on: 1.2, 1.3
Wiki refs: docs/wiki/sources/teltonika-ingestion-architecture.md § 7. Observability, docs/wiki/sources/gps-tracking-architecture.md § 7.4

Goal

Expose Prometheus metrics over an HTTP endpoint so the platform's observability stack can scrape them. Metrics drive alerting (consumer lag, unknown codecs, CRC failures) and capacity planning (connection counts, frame rates).

Deliverables

  • src/observability/metrics.ts:
    • Exports createMetrics(): Metrics returning a typed wrapper around prom-client registries.
    • All metric definitions in one place, with explicit names/labels matching the wiki spec.
    • A serializeMetrics(): Promise<string> returning the standard Prometheus text exposition format.
    • A startMetricsServer(port, metrics): http.Server that exposes GET /metrics, GET /healthz, and GET /readyz.
  • Wiring updates: every place that should emit a metric (handshake outcome, frame outcome, publish queue depth, etc.) calls into the Metrics object.
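A minimal sketch of what createMetrics() could look like, assuming prom-client is installed. The Metrics shape below is hypothetical (the real interface lives in src/core/types.ts and may differ), and only three metrics from the inventory are shown:

```typescript
import { Counter, Gauge, Histogram, Registry, collectDefaultMetrics } from "prom-client";

// Hypothetical subset of the Metrics interface; the real one is in src/core/types.ts.
interface Metrics {
  readonly registry: Registry;
  connectionOpened(): void;
  connectionClosed(): void;
  frame(codec: string, result: string): void;
  observeParse(codec: string, seconds: number): void;
  serialize(): Promise<string>;
}

function createMetrics(): Metrics {
  // A private registry keeps tests isolated from prom-client's global default registry.
  const registry = new Registry();
  // Registers the default nodejs_* and process_* collectors on this registry.
  collectDefaultMetrics({ register: registry });

  const connections = new Gauge({
    name: "teltonika_connections_active",
    help: "Currently open device sessions.",
    registers: [registry],
  });
  const frames = new Counter({
    name: "teltonika_frames_total",
    help: "Frame-level outcomes.",
    labelNames: ["codec", "result"],
    registers: [registry],
  });
  const parseDuration = new Histogram({
    name: "teltonika_parse_duration_seconds",
    help: "Per-frame parse time.",
    labelNames: ["codec"],
    buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
    registers: [registry],
  });

  return {
    registry,
    connectionOpened: () => connections.inc(),
    connectionClosed: () => connections.dec(),
    frame: (codec, result) => frames.inc({ codec, result }),
    observeParse: (codec, seconds) => parseDuration.observe({ codec }, seconds),
    // Resolves to the Prometheus text exposition format.
    serialize: () => registry.metrics(),
  };
}
```

Callers would hold one Metrics object and call, for example, metrics.frame("8", "ok") after each parsed frame. One likely caveat: prom-client appears to emit no samples for a labelled counter until a label combination has been touched, so the "present at zero" acceptance criterion may require pre-initialising known combinations with an increment of 0.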

Specification

Metric inventory (Phase 1)

Per docs/wiki/sources/teltonika-ingestion-architecture.md § 7:

| Metric | Type | Labels | Description |
|---|---|---|---|
| teltonika_connections_active | gauge | (none) | Currently open device sessions. |
| teltonika_handshake_total | counter | result=accepted\|rejected\|malformed, known=known\|unknown | IMEI handshake outcomes. The known label distinguishes IMEIs that the configured DeviceAuthority recognizes from those it does not. With the default AllowAllAuthority, known is always known. |
| teltonika_device_authority_failures_total | counter | (none) | Times a DeviceAuthority.check call threw or timed out. Non-zero rate indicates the allow-list refresher (task 1.13) is unhealthy. |
| teltonika_frames_total | counter | codec=8\|8E\|16\|unknown, result=ok\|crc_fail\|truncated\|n_mismatch | Frame-level outcomes. |
| teltonika_records_published_total | counter | codec | AVL records emitted to Redis. |
| teltonika_parse_duration_seconds | histogram | codec | Per-frame parse time. Buckets: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1] (seconds). |
| teltonika_unknown_codec_total | counter | codec_id (string of the offending byte) | Canary for codec coverage drift. |
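The label values in the inventory form small closed sets, which TypeScript union types can enforce at compile time. The sketch below is illustrative only: the function names are hypothetical, and the hex rendering of codec_id is an assumption, since the spec only says "string of the offending byte".

```typescript
// Closed label sets taken from the metric inventory.
type CodecLabel = "8" | "8E" | "16" | "unknown";
type FrameResult = "ok" | "crc_fail" | "truncated" | "n_mismatch";

// Teltonika codec IDs: 0x08 (Codec 8), 0x8E (Codec 8 Extended), 0x10 (Codec 16).
function codecLabel(codecId: number): CodecLabel {
  switch (codecId) {
    case 0x08:
      return "8";
    case 0x8e:
      return "8E";
    case 0x10:
      return "16";
    default:
      return "unknown";
  }
}

// Label set for teltonika_frames_total; a typo in result fails to compile.
function frameLabels(codecId: number, result: FrameResult): { codec: CodecLabel; result: FrameResult } {
  return { codec: codecLabel(codecId), result };
}

// Assumption: codec_id is rendered as two-digit hex, e.g. "0x0c".
// Masking to one byte keeps the label space at 256 values at most.
function unknownCodecIdLabel(codecId: number): string {
  return "0x" + (codecId & 0xff).toString(16).padStart(2, "0");
}
```

With this split, teltonika_frames_total only ever sees four codec values, while the raw byte is confined to teltonika_unknown_codec_total.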

Phase 1 also adds publisher-related metrics from task 1.8:

| Metric | Type | Labels | Description |
|---|---|---|---|
| teltonika_publish_queue_depth | gauge | (none) | Current bounded-queue depth. |
| teltonika_publish_overflow_total | counter | (none) | Records dropped because the queue was full. |
| teltonika_publish_duration_seconds | histogram | (none) | XADD latency. |
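To illustrate when the depth gauge and overflow counter would move, here is a sketch of a bounded queue reporting to a small metrics sink. Both BoundedQueue and QueueMetrics are hypothetical names, and rejecting the incoming record on overflow is an assumption; the actual queue and its drop policy belong to task 1.8.

```typescript
// Hypothetical sink; in the real code these calls would go to the Metrics object.
interface QueueMetrics {
  setQueueDepth(depth: number): void; // teltonika_publish_queue_depth
  incOverflow(): void; // teltonika_publish_overflow_total
}

class BoundedQueue<T> {
  private readonly items: T[] = [];
  private readonly capacity: number;
  private readonly metrics: QueueMetrics;

  constructor(capacity: number, metrics: QueueMetrics) {
    this.capacity = capacity;
    this.metrics = metrics;
  }

  // Returns false when the record was dropped because the queue was full.
  push(item: T): boolean {
    if (this.items.length >= this.capacity) {
      this.metrics.incOverflow();
      return false;
    }
    this.items.push(item);
    this.metrics.setQueueDepth(this.items.length);
    return true;
  }

  // Removes the oldest record (if any) and refreshes the depth gauge.
  shift(): T | undefined {
    const item = this.items.shift();
    this.metrics.setQueueDepth(this.items.length);
    return item;
  }
}
```

Updating the gauge on every push and shift keeps it accurate between scrapes without a periodic sampler.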

Plus shell-level:

| Metric | Type | Labels | Description |
|---|---|---|---|
| nodejs_* | various | (none) | Default Node.js process metrics, provided by prom-client's collectDefaultMetrics(). |

Naming convention

  • teltonika_* for adapter-specific metrics.
  • nodejs_* for runtime metrics (default).
  • No service prefix — Prometheus scrape config adds the service and instance labels externally.

Health and readiness

  • GET /healthz: returns 200 OK if the process is alive. (Liveness probe.)
  • GET /readyz: returns 200 OK if the Redis connection is healthy AND the TCP listener is bound. (Readiness probe.) Returns 503 otherwise.
  • On success, both endpoints return a tiny JSON body { "status": "ok" } for diagnostic value.
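The probe semantics above can be sketched as a small state holder that the Redis client and the TCP listener update as their status changes. ReadinessState and its method names are hypothetical, and the 503 body is an assumption, since only the success body is specified.

```typescript
interface ProbeResponse {
  status: 200 | 503;
  body: string;
}

class ReadinessState {
  private redisHealthy = false;
  private listenerBound = false;

  // Intended to be called from Redis connect/disconnect handlers.
  setRedisHealthy(healthy: boolean): void {
    this.redisHealthy = healthy;
  }

  // Intended to be called once the TCP listener is bound, and again on close.
  setListenerBound(bound: boolean): void {
    this.listenerBound = bound;
  }

  // Liveness: answering at all proves the process is alive.
  liveness(): ProbeResponse {
    return { status: 200, body: JSON.stringify({ status: "ok" }) };
  }

  // Readiness: both conditions must hold, otherwise 503.
  readiness(): ProbeResponse {
    if (this.redisHealthy && this.listenerBound) {
      return { status: 200, body: JSON.stringify({ status: "ok" }) };
    }
    // Assumption: the failure body is not specified in this task.
    return { status: 503, body: JSON.stringify({ status: "unavailable" }) };
  }
}
```

Starting with both flags false means /readyz reports 503 until startup has actually completed.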

HTTP server

Use Node's node:http directly; three endpoints (/metrics, /healthz, /readyz) do not justify an Express/Fastify dependency. Keep it minimal, around 30 lines.
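A sketch of such a server using only node:http. It departs from the deliverable's startMetricsServer(port, metrics) signature by taking a hypothetical ServerDeps object, so that the routing stays a pure function that can be tested without opening a socket; the 404, 405 and 500 replies are assumptions as well.

```typescript
import { createServer, Server } from "node:http";

// Hypothetical dependencies, injected so the server knows nothing about Redis or prom-client.
interface ServerDeps {
  serializeMetrics: () => Promise<string>;
  isReady: () => boolean;
}

interface Reply {
  status: number;
  contentType: string;
  body: string;
}

const JSON_TYPE = "application/json";
// Content type of the Prometheus text exposition format.
const PROM_TYPE = "text/plain; version=0.0.4; charset=utf-8";

// Pure routing: maps a request to a reply without touching the socket.
async function route(method: string, url: string, deps: ServerDeps): Promise<Reply> {
  if (method !== "GET") return { status: 405, contentType: "text/plain", body: "method not allowed\n" };
  const path = url.split("?")[0];
  if (path === "/metrics") return { status: 200, contentType: PROM_TYPE, body: await deps.serializeMetrics() };
  if (path === "/healthz") return { status: 200, contentType: JSON_TYPE, body: '{"status":"ok"}' };
  if (path === "/readyz") {
    return deps.isReady()
      ? { status: 200, contentType: JSON_TYPE, body: '{"status":"ok"}' }
      : { status: 503, contentType: JSON_TYPE, body: '{"status":"unavailable"}' };
  }
  return { status: 404, contentType: "text/plain", body: "not found\n" };
}

function startMetricsServer(port: number, deps: ServerDeps): Server {
  const server = createServer((req, res) => {
    route(req.method ?? "GET", req.url ?? "/", deps)
      // A failing metrics collection becomes a 500 instead of a hung request.
      .catch((): Reply => ({ status: 500, contentType: "text/plain", body: "internal error\n" }))
      .then((reply) => {
        res.writeHead(reply.status, { "Content-Type": reply.contentType });
        res.end(reply.body);
      });
  });
  server.listen(port);
  return server;
}
```

Wiring would then be a single call such as startMetricsServer(9090, { serializeMetrics, isReady }), with both callbacks supplied by the composition root.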

Acceptance criteria

  • curl http://localhost:9090/metrics returns valid Prometheus exposition format with every metric in the inventory present (some at zero).
  • After processing the canonical Codec 8 fixture, teltonika_records_published_total{codec="8"} increments by 1 and teltonika_frames_total{codec="8",result="ok"} increments by 1.
  • Sending a packet with an unknown codec ID increments teltonika_unknown_codec_total{codec_id="..."}.
  • After a handshake from an IMEI the configured DeviceAuthority returns 'unknown' for, teltonika_handshake_total{result="accepted",known="unknown"} increments by 1.
  • GET /readyz returns 503 while Redis is unreachable, then 200 once it reconnects.
  • prom-client default metrics are exposed (Node version, GC, event loop lag).
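The "increments by 1" criteria can be checked by scraping /metrics before and after the action and comparing samples. The helpers below are a hypothetical test utility, not part of the deliverables; the parser is deliberately simplified (it ignores timestamps and does not unescape label values).

```typescript
type Labels = { [name: string]: string };

// Returns the value of the sample with this name and exactly this label set, if present.
function sampleValue(exposition: string, name: string, labels: Labels = {}): number | undefined {
  const lineRe = /^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/;
  for (const line of exposition.split("\n")) {
    if (line.startsWith("#") || line.trim() === "") continue; // HELP/TYPE comments, blanks
    const m = lineRe.exec(line);
    if (!m || m[1] !== name) continue;
    const found: Labels = {};
    const labelRe = /([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
    let lm: RegExpExecArray | null;
    while ((lm = labelRe.exec(m[2] ?? "")) !== null) {
      found[lm[1]] = lm[2];
    }
    const wanted = Object.keys(labels);
    if (wanted.length === Object.keys(found).length && wanted.every((k) => found[k] === labels[k])) {
      return Number(m[3]);
    }
  }
  return undefined;
}

// Difference between two scrapes; a sample missing from a scrape counts as 0.
function sampleDelta(before: string, after: string, name: string, labels: Labels = {}): number {
  return (sampleValue(after, name, labels) ?? 0) - (sampleValue(before, name, labels) ?? 0);
}
```

A test for the Codec 8 criterion would then assert that sampleDelta(before, after, "teltonika_frames_total", { codec: "8", result: "ok" }) equals 1.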

Risks / open questions

  • Cardinality of codec_id label on teltonika_unknown_codec_total: bounded by 256 possible byte values. Acceptable.
  • Cardinality of device_id (IMEI) in metrics: avoid. Per-device metrics belong in logs/traces, not Prometheus, because the cardinality is unbounded. Phase 1 does not add per-IMEI labels anywhere. (This is a watch-out for future tasks.)
  • Histogram buckets for teltonika_parse_duration_seconds: tuned for sub-millisecond expected times. Adjust based on real production data after the first week.
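To make the bucket-tuning risk concrete, the sketch below computes the cumulative per-bucket counts a Prometheus histogram would report for a set of observations, using the Phase 1 bucket list. The function names and the sample parse times are illustrative, not measurements.

```typescript
const PARSE_BUCKETS = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1];

// Cumulative counts keyed by upper bound ("le"), as Prometheus histograms report them.
function cumulativeCounts(buckets: number[], observations: number[]): { [le: string]: number } {
  const counts: { [le: string]: number } = {};
  for (const le of buckets) {
    counts[String(le)] = observations.filter((v) => v <= le).length;
  }
  counts["+Inf"] = observations.length;
  return counts;
}

// Share of observations above the largest finite bucket.
// A persistently high value suggests the bucket list needs larger bounds.
function fractionAboveTopBucket(buckets: number[], observations: number[]): number {
  if (observations.length === 0) return 0;
  const top = buckets[buckets.length - 1];
  return observations.filter((v) => v > top).length / observations.length;
}

// Illustrative parse times in seconds: 50 µs, 300 µs, 2 ms, 250 ms.
const exampleCounts = cumulativeCounts(PARSE_BUCKETS, [0.00005, 0.0003, 0.002, 0.25]);
console.log(exampleCounts);
```

In this example the 250 ms observation is counted only under +Inf, so quantile estimates above the 0.1 s bound would be unreliable; that is the signal to revisit the buckets once production data is available.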

Done

(Fill in once complete.)