Implement Phase 1 task 1.10 (Prometheus metrics + /healthz + /readyz)

Replaces the placeholder Metrics shim with a prom-client implementation in src/observability/metrics.ts: all 10 Phase 1 metrics from the wiki spec, plus nodejs_* defaults. Exposes /metrics, /healthz, /readyz over node:http on METRICS_PORT (9090); /readyz returns 503 when Redis status is not 'ready' or the TCP listener isn't bound.

The Metrics interface in src/core/types.ts is unchanged: adapter call sites continue to use the same inc/observe shape. Only main.ts sees the extended type that adds serializeMetrics().

Side effects:
- Dockerfile re-enables HEALTHCHECK pointing at /readyz, and EXPOSE 9090.
- frame-ingested log downgraded back to debug now that teltonika_records_published_total is scrapeable.
- 19 new unit tests covering exposition format, all metric types, and every HTTP endpoint path. Total now 98 passing.

Note: deploy/compose.yaml still does not expose 9090. How Prometheus reaches the service (host port vs. internal scraper on the same Docker network) is a separate decision.
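
The /readyz rule in the message above (503 unless Redis reports 'ready' and the TCP listener is bound) can be sketched as a pure function. This is a minimal illustration, not the code in src/observability/metrics.ts: the fields of `ReadyzDeps`, the `handleProbe` helper, and the response bodies are all assumptions, since the commit names `ReadyzDeps` without showing its shape.

```typescript
// Hypothetical dependency shape; the commit names `ReadyzDeps` but does not
// show its fields, so both accessors here are assumptions.
interface ReadyzDeps {
  redisStatus: () => string;    // status string reported by the Redis client
  tcpListening: () => boolean;  // true once the TCP listener is bound
}

interface ProbeResponse {
  status: 200 | 404 | 503;
  body: string;
}

// Ready only when both conditions from the commit message hold.
function isReady(deps: ReadyzDeps): boolean {
  return deps.redisStatus() === "ready" && deps.tcpListening();
}

// Hypothetical router for the two probe paths; /metrics is left out.
function handleProbe(path: string, deps: ReadyzDeps): ProbeResponse {
  switch (path) {
    case "/healthz":
      // Liveness: the process is up and able to answer HTTP at all.
      return { status: 200, body: "ok\n" };
    case "/readyz":
      return isReady(deps)
        ? { status: 200, body: "ready\n" }
        : { status: 503, body: "not ready\n" };
    default:
      return { status: 404, body: "not found\n" };
  }
}
```

In a node:http request handler, the result would be passed to `res.writeHead(status)` and `res.end(body)`. Keeping the decision separate from the server makes every endpoint path testable without opening a socket.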
2026-04-30 20:52:12 +02:00
parent ff9c8d67a4
commit d4a6d8f713
8 changed files with 720 additions and 27 deletions
@@ -41,7 +41,7 @@ These rules govern every task. Any deviation must be discussed and documented as

 ### Phase 1 — Inbound telemetry (Codec 8, 8E, 16)

-**Status:** 🟨 In progress (core implementation done; observability + hardening + device authority paused for pilot test)
+**Status:** 🟨 In progress (observability landed; hardening + device authority paused for pilot test)
 **Outcome:** A production-ready Node.js TCP server ingesting Teltonika telemetry from any FMB/FMC/FMM/FMU device, publishing normalized `Position` records to Redis Streams, with full observability and CI/CD via Gitea.

 [**See `phase-1-telemetry/README.md`**](./phase-1-telemetry/README.md)
@@ -57,18 +57,17 @@ These rules govern every task. Any deviation must be discussed and documented as
 | 1.7 | [Codec 16 parser (incl. Generation Type)](./phase-1-telemetry/07-codec-16.md) | 🟩 | `381287b` |
 | 1.8 | [Redis Streams publisher & main wiring](./phase-1-telemetry/08-redis-publisher.md) | 🟩 | `af06973` |
 | 1.9 | [Fixture suite & testing strategy](./phase-1-telemetry/09-fixture-suite.md) | 🟩 | `381287b` |
-| 1.10 | [Observability (Prometheus metrics)](./phase-1-telemetry/10-observability.md) | ⏸ | *deferred — see below* |
+| 1.10 | [Observability (Prometheus metrics)](./phase-1-telemetry/10-observability.md) | 🟩 | *(pending commit SHA)* |
 | 1.11 | [Dockerfile & Gitea workflow](./phase-1-telemetry/11-dockerfile-and-ci.md) | 🟩 | `88b742d` (slim pilot variant) |
 | 1.12 | [Production hardening](./phase-1-telemetry/12-production-hardening.md) | ⏸ | *deferred — see below* |
 | 1.13 | [Device authority (Redis allow-list refresher)](./phase-1-telemetry/13-device-authority.md) | ⏸ | *deferred — see below* |

 #### Deferred (resume after the real-device pilot test)

-These three tasks are paused so we can get the service onto real hardware as fast as possible. They are paused, not cancelled — each must be completed before the service is considered production-ready.
+These two tasks are paused so we can get the service onto real hardware as fast as possible. They are paused, not cancelled — each must be completed before the service is considered production-ready.

- **1.10 Observability (Prometheus metrics).** Tracking via the placeholder `Metrics` interface for now. **Resume trigger:** as soon as the pilot is generating real traffic and we want to measure it, *or* before any second instance is deployed (without metrics, "consumer lag" and "unknown codec" alerts cannot fire).
 - **1.12 Production hardening.** Graceful shutdown is a stub today; uncaught-exception handlers are minimal. **Resume trigger:** before the pilot graduates to "always-on" or before any deployment that does rolling restarts. Acceptable for a manual pilot where we can stop/start the process by hand.
- **1.13 Device authority (Redis allow-list refresher).** Default `AllowAllAuthority` accepts every IMEI; observability of `known | unknown` is moot until 1.10 lands. **Resume trigger:** when Directus has a `devices` collection publishing the allow-list to Redis, or when the operational picture demands rejecting unknown IMEIs (`STRICT_DEVICE_AUTH=true`).
+- **1.13 Device authority (Redis allow-list refresher).** Default `AllowAllAuthority` accepts every IMEI. **Resume trigger:** when Directus has a `devices` collection publishing the allow-list to Redis, or when the operational picture demands rejecting unknown IMEIs (`STRICT_DEVICE_AUTH=true`).

 When resuming any of these, change the status from ⏸ back to ⬜ or 🟨 here and in the task file's status badge, and clear the deferral note in the task file.

@@ -1,7 +1,7 @@
 # Task 1.10 — Observability (Prometheus metrics)

 **Phase:** 1 — Inbound telemetry
-**Status:** ⏸ Paused — deferred until after the real-device pilot test. See ROADMAP.md "Deferred" section for resume triggers. The placeholder `Metrics` interface in `src/core/types.ts` is what code currently uses; this task replaces it with `prom-client` and adds the `/metrics`, `/healthz`, `/readyz` HTTP endpoints.
+**Status:** 🟩
 **Depends on:** 1.2, 1.3
 **Wiki refs:** `docs/wiki/sources/teltonika-ingestion-architecture.md` § 7. Observability, `docs/wiki/sources/gps-tracking-architecture.md` § 7.4

@@ -81,4 +81,4 @@ Use Node's `node:http` directly — no Express/Fastify dependency for two endpoi

 ## Done

-(Fill in once complete.)
+Implemented `src/observability/metrics.ts` with `createMetrics()`, `startMetricsServer()`, and `ReadyzDeps`. Replaced the placeholder shim in `src/main.ts`, wired metrics server into boot and graceful shutdown, downgraded `frame ingested` log to debug, and re-enabled the Dockerfile `HEALTHCHECK`. Landed in: *(pending commit SHA)*
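
The split described in this commit, where adapters keep the narrow `Metrics` interface and only `main.ts` sees the type that adds `serializeMetrics()`, might look roughly like the sketch below. It is an illustration under assumptions: the method signatures, the `SerializableMetrics` name, the `codec` label, and the synchronous return type are hypothetical, and a small in-memory map stands in for prom-client so the example runs on its own.

```typescript
type Labels = Record<string, string>;

// Narrow interface used by adapter call sites. The signatures are assumed;
// the commit only says the shape is inc/observe.
interface Metrics {
  inc(name: string, labels?: Labels, value?: number): void;
  observe(name: string, value: number, labels?: Labels): void;
}

// Extended type that only the composition root would see (name is hypothetical).
interface SerializableMetrics extends Metrics {
  serializeMetrics(): string;
}

// In-memory stand-in for the prom-client implementation, for illustration only.
function createInMemoryMetrics(): SerializableMetrics {
  const samples = new Map<string, number>();

  // Builds a sample key such as `name{a="1",b="2"}` with labels sorted by name.
  const key = (name: string, labels: Labels = {}): string => {
    const parts = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`);
    return parts.length > 0 ? `${name}{${parts.join(",")}}` : name;
  };

  const add = (k: string, delta: number): void => {
    samples.set(k, (samples.get(k) ?? 0) + delta);
  };

  return {
    inc(name: string, labels?: Labels, value: number = 1): void {
      add(key(name, labels), value);
    },
    observe(name: string, value: number, labels?: Labels): void {
      // Tracks only a running sum and count, not histogram buckets.
      add(key(`${name}_sum`, labels), value);
      add(key(`${name}_count`, labels), 1);
    },
    serializeMetrics(): string {
      // One `sample value` line per series, in insertion order.
      const lines: string[] = [];
      for (const [k, v] of samples) lines.push(`${k} ${v}`);
      return lines.join("\n") + "\n";
    },
  };
}
```

An adapter would receive the value typed as `Metrics` and call, for example, `metrics.inc("teltonika_records_published_total", { codec: "8" })`, while the /metrics handler would hold the `SerializableMetrics` type and return `serializeMetrics()` as the response body.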