# Task 1.10 — Observability (Prometheus metrics)

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.2, 1.3
**Wiki refs:** `docs/wiki/sources/teltonika-ingestion-architecture.md` § 7. Observability, `docs/wiki/sources/gps-tracking-architecture.md` § 7.4

## Goal

Expose Prometheus metrics over an HTTP endpoint so the platform's observability stack can scrape them. Metrics drive alerting (consumer lag, unknown codecs, CRC failures) and capacity planning (connection counts, frame rates).

## Deliverables

- `src/observability/metrics.ts`:
  - Exports `createMetrics(): Metrics` returning a typed wrapper around `prom-client` registries.
  - All metric definitions in one place, with explicit names/labels matching the wiki spec.
  - A `serializeMetrics(): Promise<string>` returning the standard Prometheus text exposition format.
  - A `startMetricsServer(port, metrics): http.Server` that exposes `GET /metrics`, `GET /healthz`, and `GET /readyz`.
- Wiring updates: every place that should emit a metric (handshake outcome, frame outcome, publish queue depth, etc.) calls into the `Metrics` object.

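One way to keep the wiring updates testable is to have call sites depend on a narrow interface rather than on `prom-client` directly. The sketch below is only an illustration under that assumption: the method names (`frame`, `recordsPublished`, `setPublishQueueDepth`), the `InMemoryMetrics` stand-in, and the `onFramePublished` call site are hypothetical and not part of the spec. A `prom-client`-backed implementation would satisfy the same interface.

```typescript
// Hypothetical label unions mirroring the metric inventory.
type Codec = "8" | "8E" | "16" | "unknown";
type FrameResult = "ok" | "crc_fail" | "truncated" | "n_mismatch";

// Hypothetical interface; the spec names the `Metrics` type but not its methods.
interface Metrics {
  frame(codec: Codec, result: FrameResult): void;
  recordsPublished(codec: Codec, count: number): void;
  setPublishQueueDepth(depth: number): void;
}

// In-memory stand-in for unit tests; map keys imitate Prometheus sample names.
class InMemoryMetrics implements Metrics {
  readonly counters = new Map<string, number>();
  publishQueueDepth = 0;

  private inc(key: string, by = 1): void {
    this.counters.set(key, (this.counters.get(key) ?? 0) + by);
  }

  frame(codec: Codec, result: FrameResult): void {
    this.inc(`teltonika_frames_total{codec="${codec}",result="${result}"}`);
  }

  recordsPublished(codec: Codec, count: number): void {
    this.inc(`teltonika_records_published_total{codec="${codec}"}`, count);
  }

  setPublishQueueDepth(depth: number): void {
    this.publishQueueDepth = depth;
  }
}

// Example call site: a frame parsed cleanly and its records were handed to the publisher.
function onFramePublished(metrics: Metrics, codec: Codec, recordCount: number): void {
  metrics.frame(codec, "ok");
  metrics.recordsPublished(codec, recordCount);
}
```

With this shape, a unit test can pass an `InMemoryMetrics` instance to the code under test and assert on `counters` without starting an HTTP server.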
## Specification

### Metric inventory (Phase 1)

Per `docs/wiki/sources/teltonika-ingestion-architecture.md` § 7:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `teltonika_connections_active` | gauge | — | Currently open device sessions. |
| `teltonika_handshake_total` | counter | `result=accepted\|rejected\|malformed`, `known=known\|unknown` | IMEI handshake outcomes. The `known` label distinguishes IMEIs that the configured `DeviceAuthority` recognizes from those it does not. With the default `AllowAllAuthority`, `known` is always `known`. |
| `teltonika_device_authority_failures_total` | counter | — | Times a `DeviceAuthority.check` call threw or timed out. Non-zero rate indicates the allow-list refresher (task 1.13) is unhealthy. |
| `teltonika_frames_total` | counter | `codec=8\|8E\|16\|unknown`, `result=ok\|crc_fail\|truncated\|n_mismatch` | Frame-level outcomes. |
| `teltonika_records_published_total` | counter | `codec` | AVL records emitted to Redis. |
| `teltonika_parse_duration_seconds` | histogram | `codec` | Per-frame parse time. Buckets: `[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]` (seconds). |
| `teltonika_unknown_codec_total` | counter | `codec_id` (string of the offending byte) | **Canary** for codec coverage drift. |

Phase 1 also adds publisher-related metrics from task 1.8:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `teltonika_publish_queue_depth` | gauge | — | Current bounded-queue depth. |
| `teltonika_publish_overflow_total` | counter | — | Records dropped because the queue was full. |
| `teltonika_publish_duration_seconds` | histogram | — | XADD latency. |

Plus shell-level:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `nodejs_*`, `process_*` | various | — | Default Node.js runtime and process metrics, registered by `prom-client`'s `collectDefaultMetrics()`. |

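The `codec` label on `teltonika_frames_total` and the `codec_id` label on `teltonika_unknown_codec_total` are both derived from the codec ID byte of a frame. A minimal sketch of that mapping follows, using Teltonika's codec IDs for Codec 8 (`0x08`), Codec 8 Extended (`0x8E`), and Codec 16 (`0x10`). The function names and the `0x`-prefixed hex format for `codec_id` are assumptions; the spec only says "string of the offending byte".

```typescript
type CodecLabel = "8" | "8E" | "16" | "unknown";

// Codec ID bytes handled in Phase 1, mapped to the `codec` label values.
const PHASE1_CODECS = new Map<number, CodecLabel>([
  [0x08, "8"],
  [0x8e, "8E"],
  [0x10, "16"],
]);

// Value for the `codec` label; anything outside Phase 1 collapses to "unknown".
function codecLabel(codecId: number): CodecLabel {
  return PHASE1_CODECS.get(codecId) ?? "unknown";
}

// Value for the `codec_id` label. Assumed format: two lowercase hex digits with a 0x prefix.
function codecIdLabel(codecId: number): string {
  return "0x" + (codecId & 0xff).toString(16).padStart(2, "0");
}
```

Under these assumptions a Codec 12 frame (`0x0c`), which is out of scope until Phase 2, would be counted with `codec="unknown"` and `codec_id="0x0c"`.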
### Naming convention

- `teltonika_*` for adapter-specific metrics.
- `nodejs_*` and `process_*` for the default runtime metrics.
- No service prefix: the Prometheus scrape config adds the `service` and `instance` labels externally.

### Health and readiness

- `GET /healthz`: returns `200 OK` if the process is alive. (Liveness probe.)
- `GET /readyz`: returns `200 OK` if the Redis connection is healthy AND the TCP listener is bound. (Readiness probe.) Returns `503` otherwise.
- Both endpoints return a tiny JSON body `{ "status": "ok" }` for diagnostic value.

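The probe rules above reduce to a small pure function, which makes them easy to unit test separately from the HTTP layer. The following is a sketch under assumptions: the type and function names are hypothetical, and the `"unavailable"` body returned with `503` is an illustrative choice, since the spec only defines the success body.

```typescript
// Hypothetical inputs; how each flag is obtained (Redis client state, listener state) is out of scope here.
interface ReadinessInputs {
  redisReady: boolean;
  listenerBound: boolean;
}

interface ProbeResponse {
  statusCode: 200 | 503;
  body: string;
}

// Liveness: if this code runs at all, the process is alive.
function livenessResponse(): ProbeResponse {
  return { statusCode: 200, body: JSON.stringify({ status: "ok" }) };
}

// Readiness: both conditions must hold, otherwise 503.
function readinessResponse(inputs: ReadinessInputs): ProbeResponse {
  const ready = inputs.redisReady && inputs.listenerBound;
  return {
    statusCode: ready ? 200 : 503,
    // Assumption: the failure body is not specified; "unavailable" is illustrative.
    body: JSON.stringify({ status: ready ? "ok" : "unavailable" }),
  };
}
```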
### HTTP server

Use Node's `node:http` directly; three endpoints do not justify an Express/Fastify dependency. Keep it minimal, roughly 30 lines.

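A minimal sketch of such a server is shown below. It departs from the `startMetricsServer(port, metrics)` signature in the deliverables by taking a hypothetical `MetricsServerDeps` object, so that it stays self-contained without `prom-client`; in the real module `serializeMetrics` would presumably delegate to the registry. The `404`, `405`, and `500` responses and their bodies are assumptions.

```typescript
import { createServer, Server } from "node:http";

// Hypothetical dependency bundle; the spec passes a `Metrics` object instead.
interface MetricsServerDeps {
  serializeMetrics: () => Promise<string>;
  isReady: () => boolean;
}

const JSON_TYPE = "application/json";
// Content type of the Prometheus text exposition format, version 0.0.4.
const PROM_TYPE = "text/plain; version=0.0.4; charset=utf-8";

function startMetricsServer(port: number, deps: MetricsServerDeps): Server {
  const server = createServer((req, res) => {
    const send = (status: number, type: string, body: string): void => {
      res.writeHead(status, { "Content-Type": type });
      res.end(body);
    };
    // Ignore any query string when matching the path.
    const path = (req.url ?? "/").split("?")[0];

    if (req.method !== "GET") {
      send(405, JSON_TYPE, '{"status":"method_not_allowed"}');
    } else if (path === "/metrics") {
      deps.serializeMetrics().then(
        (text) => send(200, PROM_TYPE, text),
        () => send(500, JSON_TYPE, '{"status":"error"}'),
      );
    } else if (path === "/healthz") {
      send(200, JSON_TYPE, '{"status":"ok"}');
    } else if (path === "/readyz") {
      const ready = deps.isReady();
      send(ready ? 200 : 503, JSON_TYPE, JSON.stringify({ status: ready ? "ok" : "unavailable" }));
    } else {
      send(404, JSON_TYPE, '{"status":"not_found"}');
    }
  });
  server.listen(port);
  return server;
}
```

Returning the `Server` lets the caller close it during graceful shutdown and lets tests bind to port `0` to get an ephemeral port.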
## Acceptance criteria

- [ ] `curl http://localhost:9090/metrics` returns valid Prometheus exposition format with every metric in the inventory present (some at zero).
- [ ] After processing the canonical Codec 8 fixture, `teltonika_records_published_total{codec="8"}` increments by 1 and `teltonika_frames_total{codec="8",result="ok"}` increments by 1.
- [ ] Sending a packet with an unknown codec ID increments `teltonika_unknown_codec_total{codec_id="..."}`.
- [ ] After a handshake from an IMEI the configured `DeviceAuthority` returns `'unknown'` for, `teltonika_handshake_total{result="accepted",known="unknown"}` increments by 1.
- [ ] `GET /readyz` returns `503` while Redis is unreachable, then `200` once it reconnects.
- [ ] `prom-client` default metrics are exposed (Node version, GC, event loop lag).

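The first criterion can be checked mechanically by scanning the `/metrics` output for `# TYPE` declarations. The helper below is a sketch with hypothetical names; it assumes the registry emits a `# TYPE` line for every registered metric, including labeled metrics that have no samples yet.

```typescript
// Metric names from the Phase 1 inventory tables.
const PHASE1_METRICS: readonly string[] = [
  "teltonika_connections_active",
  "teltonika_handshake_total",
  "teltonika_device_authority_failures_total",
  "teltonika_frames_total",
  "teltonika_records_published_total",
  "teltonika_parse_duration_seconds",
  "teltonika_unknown_codec_total",
  "teltonika_publish_queue_depth",
  "teltonika_publish_overflow_total",
  "teltonika_publish_duration_seconds",
];

// Collects metric names declared via "# TYPE <name> <type>" lines.
function declaredMetricNames(exposition: string): Set<string> {
  const names = new Set<string>();
  for (const line of exposition.split("\n")) {
    const match = /^# TYPE (\S+) (\S+)$/.exec(line.trim());
    if (match) names.add(match[1]);
  }
  return names;
}

// Returns the expected names that the exposition text does not declare.
function missingMetrics(exposition: string, expected: readonly string[]): string[] {
  const declared = declaredMetricNames(exposition);
  return expected.filter((name) => !declared.has(name));
}
```

An integration test could fetch `/metrics` and assert that `missingMetrics(body, PHASE1_METRICS)` is empty.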
## Risks / open questions

- Cardinality of `codec_id` label on `teltonika_unknown_codec_total`: bounded by 256 possible byte values. Acceptable.
- Cardinality of `device_id` (IMEI) in metrics: **avoid**. Per-device metrics belong in logs/traces, not Prometheus, because the cardinality is unbounded. Phase 1 does not add per-IMEI labels anywhere. (This is a watch-out for future tasks.)
- Histogram buckets for `teltonika_parse_duration_seconds`: tuned for sub-millisecond expected times. Adjust based on real production data after the first week.

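To make the bucket choice concrete, the sketch below reproduces how Prometheus histogram buckets count observations: each `le` bucket is cumulative, and an implicit `+Inf` bucket counts everything. The function name and the sample observations are illustrative only and not taken from production data.

```typescript
// Bucket upper bounds (seconds) from the metric inventory.
const PARSE_BUCKETS: readonly number[] = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1];

// Counts observations per cumulative `le` bucket, plus the implicit +Inf bucket.
function cumulativeBucketCounts(
  buckets: readonly number[],
  observations: readonly number[],
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const le of buckets) {
    counts.set(String(le), observations.filter((value) => value <= le).length);
  }
  counts.set("+Inf", observations.length);
  return counts;
}

// Illustrative parse times: 50 µs, 300 µs, 2 ms, and a 250 ms outlier.
const exampleCounts = cumulativeBucketCounts(PARSE_BUCKETS, [0.00005, 0.0003, 0.002, 0.25]);
```

In this example the 250 ms outlier appears only in `+Inf`, so any parse slower than 100 ms cannot be resolved further with the current buckets. That is the kind of signal to look for when revisiting the bounds against production data.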
## Done

(Fill in once complete.)