Tasks 1.1-1.9 marked done with their landing commit SHAs. Tasks 1.10 (observability), 1.12 (production hardening), and 1.13 (device authority) marked paused with explicit resume triggers — pilot deployment on real Teltonika hardware takes priority. Task 1.11 remains as next, in slimmed form for the pilot (no /readyz healthcheck since the metrics endpoint is part of paused 1.10).
Task 1.10 — Observability (Prometheus metrics)
Phase: 1 — Inbound telemetry
Status: ⏸ Paused — deferred until after the real-device pilot test. See ROADMAP.md "Deferred" section for resume triggers. The placeholder Metrics interface in src/core/types.ts is what code currently uses; this task replaces it with prom-client and adds the /metrics, /healthz, /readyz HTTP endpoints.
Depends on: 1.2, 1.3
Wiki refs: docs/wiki/sources/teltonika-ingestion-architecture.md § 7. Observability, docs/wiki/sources/gps-tracking-architecture.md § 7.4
Goal
Expose Prometheus metrics over an HTTP endpoint so the platform's observability stack can scrape them. Metrics drive alerting (consumer lag, unknown codecs, CRC failures) and capacity planning (connection counts, frame rates).
Deliverables
- `src/observability/metrics.ts`:
  - Exports `createMetrics(): Metrics` returning a typed wrapper around `prom-client` registries.
  - All metric definitions in one place, with explicit names/labels matching the wiki spec.
  - A `serializeMetrics(): Promise<string>` returning the standard Prometheus exposition format.
  - A `startMetricsServer(port, metrics): http.Server` that exposes `GET /metrics` and `GET /healthz`.
- Wiring updates: every place that should emit a metric (handshake outcome, frame outcome, publish queue depth, etc.) calls into the `Metrics` object.
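For orientation, the text that `serializeMetrics()` is expected to return looks roughly like the output of the sketch below. In the real task `prom-client` produces this text; `renderCounter` and `escapeLabelValue` are hypothetical names used only to illustrate the format (a `# HELP` line, a `# TYPE` line, then one sample per label set).

```typescript
// Hypothetical illustration of the Prometheus text exposition format.
// prom-client generates this in the real implementation.
type Labels = Record<string, string>;

// Label values escape backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one counter family: HELP line, TYPE line, one line per sample.
// Assumes the help text itself needs no escaping.
function renderCounter(
  name: string,
  help: string,
  samples: Array<{ labels: Labels; value: number }>,
): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.keys(labels)
      .map((key) => `${key}="${escapeLabelValue(labels[key])}"`)
      .join(",");
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join("\n") + "\n";
}
```

Calling `renderCounter("teltonika_frames_total", "Frame-level outcomes.", [{ labels: { codec: "8", result: "ok" }, value: 1 }])` yields the two comment lines followed by `teltonika_frames_total{codec="8",result="ok"} 1`.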
Specification
Metric inventory (Phase 1)
Per docs/wiki/sources/teltonika-ingestion-architecture.md § 7:
| Metric | Type | Labels | Description |
|---|---|---|---|
| `teltonika_connections_active` | gauge | none | Currently open device sessions. |
| `teltonika_handshake_total` | counter | `result=accepted\|rejected\|malformed`, `known=known\|unknown` | IMEI handshake outcomes. The `known` label distinguishes IMEIs that the configured `DeviceAuthority` recognizes from those it does not. With the default `AllowAllAuthority`, `known` is always `known`. |
| `teltonika_device_authority_failures_total` | counter | none | Times a `DeviceAuthority.check` call threw or timed out. Non-zero rate indicates the allow-list refresher (task 1.13) is unhealthy. |
| `teltonika_frames_total` | counter | `codec=8\|8E\|16\|unknown`, `result=ok\|crc_fail\|truncated\|n_mismatch` | Frame-level outcomes. |
| `teltonika_records_published_total` | counter | `codec` | AVL records emitted to Redis. |
| `teltonika_parse_duration_seconds` | histogram | `codec` | Per-frame parse time. Buckets: `[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]` (seconds). |
| `teltonika_unknown_codec_total` | counter | `codec_id` (string of the offending byte) | Canary for codec coverage drift. |
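The `codec` and `codec_id` label values have to be derived from the codec ID byte of each frame. A minimal sketch of that mapping follows, assuming the standard Teltonika codec IDs (0x08 for Codec 8, 0x8E for Codec 8 Extended, 0x10 for Codec 16). The function names are hypothetical, and the hex rendering of `codec_id` is an assumption: the inventory only says "string of the offending byte".

```typescript
// Label values allowed on the `codec` label, per the inventory table.
type CodecLabel = "8" | "8E" | "16" | "unknown";

// Map a codec ID byte to its bounded `codec` label value.
function codecLabel(codecId: number): CodecLabel {
  switch (codecId) {
    case 0x08:
      return "8";
    case 0x8e:
      return "8E";
    case 0x10:
      return "16";
    default:
      return "unknown";
  }
}

// Assumption: render the offending byte as two-digit lowercase hex, e.g. "0x0c".
// Masking to one byte keeps the label set bounded at 256 values.
function codecIdLabel(codecId: number): string {
  const hex = (codecId & 0xff).toString(16);
  return "0x" + (hex.length < 2 ? "0" + hex : hex);
}
```

Keeping both functions total (every input maps to a fixed, small set of strings) is what keeps the label cardinality predictable.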
Phase 1 also adds publisher-related metrics from task 1.8:
| Metric | Type | Labels | Description |
|---|---|---|---|
| `teltonika_publish_queue_depth` | gauge | none | Current bounded-queue depth. |
| `teltonika_publish_overflow_total` | counter | none | Records dropped because the queue was full. |
| `teltonika_publish_duration_seconds` | histogram | none | XADD latency. |
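As an illustration of where the queue-depth gauge and overflow counter would be updated, here is a small sketch of a bounded queue that reports to a metrics sink. Everything in it is hypothetical: the `PublishMetricsSink` interface stands in for the real `Metrics` object, and dropping the incoming record on overflow is an assumption, since the drop policy is defined by task 1.8 rather than here.

```typescript
// Hypothetical stand-in for the parts of the Metrics object the publisher needs.
interface PublishMetricsSink {
  setQueueDepth(depth: number): void; // teltonika_publish_queue_depth
  incOverflow(): void; // teltonika_publish_overflow_total
}

class BoundedQueue<T> {
  private items: T[] = [];
  private readonly capacity: number;
  private readonly sink: PublishMetricsSink;

  constructor(capacity: number, sink: PublishMetricsSink) {
    this.capacity = capacity;
    this.sink = sink;
  }

  // Returns false when the record was dropped.
  // Assumption: the incoming record is the one dropped when the queue is full.
  push(item: T): boolean {
    if (this.items.length >= this.capacity) {
      this.sink.incOverflow();
      return false;
    }
    this.items.push(item);
    this.sink.setQueueDepth(this.items.length);
    return true;
  }

  // Remove the oldest record and refresh the depth gauge.
  shift(): T | undefined {
    const item = this.items.shift();
    this.sink.setQueueDepth(this.items.length);
    return item;
  }
}
```

Updating the gauge at every mutation keeps the scraped value consistent with the actual queue length, at the cost of one call per push or shift.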
Plus shell-level:
| Metric | Type | Labels | Description |
|---|---|---|---|
| `nodejs_*` | various | none | Default Node.js process metrics (prom-client provides `collectDefaultMetrics()`). |
Naming convention
- `teltonika_*` for adapter-specific metrics.
- `nodejs_*` for runtime metrics (default).
- No service prefix: the Prometheus scrape config adds the `service` and `instance` labels externally.
Health and readiness
- `GET /healthz`: returns `200 OK` if the process is alive. (Liveness probe.)
- `GET /readyz`: returns `200 OK` if the Redis connection is healthy AND the TCP listener is bound. (Readiness probe.) Returns `503` otherwise.
- Both endpoints return a tiny JSON body `{ "status": "ok" }` for diagnostic value.
HTTP server
Use Node's node:http directly — no Express/Fastify dependency for two endpoints. Keep it minimal, ~30 lines.
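The probe behaviour described above could be sketched with `node:http` roughly as follows. This is not the task's implementation. `ReadinessProbes`, `probeResponse`, and `startProbeServer` are hypothetical names, the `/metrics` route is omitted, and the `503` body is an assumption because the spec only defines the `ok` body. Paths are matched exactly, so a query string would fall through to 404.

```typescript
import * as http from "node:http";

// Hypothetical probe callbacks; real checks would consult the Redis client
// and the TCP listener.
interface ReadinessProbes {
  redisHealthy(): boolean;
  listenerBound(): boolean;
}

interface ProbeResponse {
  status: number;
  body: string;
}

// Pure routing, so the status logic is testable without opening a socket.
function probeResponse(
  method: string,
  path: string,
  probes: ReadinessProbes,
): ProbeResponse | undefined {
  if (method !== "GET") return undefined;
  if (path === "/healthz") return { status: 200, body: JSON.stringify({ status: "ok" }) };
  if (path === "/readyz") {
    const ready = probes.redisHealthy() && probes.listenerBound();
    return ready
      ? { status: 200, body: JSON.stringify({ status: "ok" }) }
      : { status: 503, body: JSON.stringify({ status: "unavailable" }) }; // assumed body
  }
  return undefined;
}

// Thin node:http wrapper around the routing function.
function startProbeServer(port: number, probes: ReadinessProbes): http.Server {
  const server = http.createServer((req, res) => {
    const result = probeResponse(req.method ?? "", req.url ?? "", probes);
    if (!result) {
      res.statusCode = 404;
      res.end();
      return;
    }
    res.writeHead(result.status, { "Content-Type": "application/json" });
    res.end(result.body);
  });
  server.listen(port);
  return server;
}
```

Separating the routing function from the server keeps the readiness rule (Redis healthy AND listener bound) unit-testable, while the wrapper stays within the ~30-line budget.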
Acceptance criteria
- `curl http://localhost:9090/metrics` returns valid Prometheus exposition format with every metric in the inventory present (some at zero).
- After processing the canonical Codec 8 fixture, `teltonika_records_published_total{codec="8"}` increments by 1 and `teltonika_frames_total{codec="8",result="ok"}` increments by 1.
- Sending a packet with an unknown codec ID increments `teltonika_unknown_codec_total{codec_id="..."}`.
- After a handshake from an IMEI for which the configured `DeviceAuthority` returns `'unknown'`, `teltonika_handshake_total{result="accepted",known="unknown"}` increments by 1.
- `GET /readyz` returns `503` while Redis is unreachable, then `200` once it reconnects.
- Prom-client default metrics are exposed (Node version, GC, event loop lag).
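The "increments by 1" criteria can be checked by scraping `/metrics` before and after the action and comparing one sample. The helper below is a hypothetical test utility, not part of the deliverables. It assumes simple label values (no commas, quotes, or braces inside them) and plain numeric sample values, which holds for every label in the inventory.

```typescript
// Hypothetical test helper: read one sample's value from exposition text.
// Returns undefined when no sample with that name and exact label set exists.
function sampleValue(
  text: string,
  name: string,
  labels: Record<string, string>,
): number | undefined {
  // Normalise the wanted label set to a sorted, comma-joined string.
  const wanted = Object.keys(labels)
    .map((key) => `${key}="${labels[key]}"`)
    .sort()
    .join(",");
  for (const line of text.split("\n")) {
    // Skip blank lines and HELP/TYPE comments.
    if (line === "" || line.charAt(0) === "#") continue;
    const match = /^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/.exec(line);
    if (!match || match[1] !== name) continue;
    const have = (match[2] || "").split(",").sort().join(",");
    if (have === wanted) return Number(match[3]);
  }
  return undefined;
}
```

A test would call it twice, for example `sampleValue(before, "teltonika_frames_total", { codec: "8", result: "ok" })` and the same on the later scrape, treating `undefined` as zero and asserting a difference of 1.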
Risks / open questions
- Cardinality of the `codec_id` label on `teltonika_unknown_codec_total`: bounded by 256 possible byte values. Acceptable.
- Cardinality of `device_id` (IMEI) in metrics: avoid. Per-device metrics belong in logs/traces, not Prometheus, because the cardinality is unbounded. Phase 1 does not add per-IMEI labels anywhere. (This is a watch-out for future tasks.)
- Histogram buckets for `teltonika_parse_duration_seconds`: tuned for sub-millisecond expected times. Adjust based on real production data after the first week.
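To judge whether the parse-duration buckets fit real data, it helps to see how samples land in them. The sketch below computes cumulative bucket counts the way a Prometheus histogram reports them: a sample counts toward every bucket whose upper bound is greater than or equal to it, plus the implicit `+Inf` bucket. The function name and the sample durations in the note are made up for illustration.

```typescript
// Bucket upper bounds from the metric inventory, in seconds.
const PARSE_BUCKETS = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1];

// Cumulative counts per bucket; the final entry is the implicit +Inf bucket.
function cumulativeBucketCounts(samples: number[], bounds: number[]): number[] {
  const counts: number[] = [];
  for (let i = 0; i <= bounds.length; i++) counts.push(0);
  for (const sample of samples) {
    for (let i = 0; i < bounds.length; i++) {
      if (sample <= bounds[i]) counts[i] += 1;
    }
    counts[bounds.length] += 1; // every sample falls in +Inf
  }
  return counts;
}
```

For the illustrative durations `[0.00005, 0.0003, 0.002, 0.2]`, the result is `[1, 2, 2, 3, 3, 3, 3, 4]`. If production data puts nearly all samples in the first or last bucket, that is a sign the bounds need adjusting.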
Done
(Fill in once complete.)