Implement Phase 1 task 1.10 (Prometheus metrics + /healthz + /readyz)

Replaces the placeholder Metrics shim with a prom-client implementation
in src/observability/metrics.ts: all 10 Phase 1 metrics from the wiki
spec, plus nodejs_* defaults. Exposes /metrics, /healthz, /readyz over
node:http on METRICS_PORT (9090); /readyz returns 503 when Redis status
is not 'ready' or the TCP listener isn't bound.

The Metrics interface in src/core/types.ts is unchanged — adapter call
sites continue to use the same inc/observe shape. Only main.ts sees the
extended type that adds serializeMetrics().

Side effects:
- Dockerfile re-enables HEALTHCHECK pointing at /readyz, and EXPOSE 9090.
- frame-ingested log downgraded back to debug now that
  teltonika_records_published_total is scrapeable.
- 19 new unit tests covering exposition format, all metric types, and
  every HTTP endpoint path. Total now 98 passing.

Note: deploy/compose.yaml still does not expose 9090 — separate decision
about how Prometheus reaches the service (host port vs. internal scraper
on the same Docker network).
This commit is contained in:
2026-04-30 20:52:12 +02:00
parent ff9c8d67a4
commit d4a6d8f713
8 changed files with 720 additions and 27 deletions
+4 -5
View File
@@ -41,7 +41,7 @@ These rules govern every task. Any deviation must be discussed and documented as
### Phase 1 — Inbound telemetry (Codec 8, 8E, 16)
**Status:** 🟨 In progress (core implementation done; observability + hardening + device authority paused for pilot test)
**Status:** 🟨 In progress (observability landed; hardening + device authority paused for pilot test)
**Outcome:** A production-ready Node.js TCP server ingesting Teltonika telemetry from any FMB/FMC/FMM/FMU device, publishing normalized `Position` records to Redis Streams, with full observability and CI/CD via Gitea.
[**See `phase-1-telemetry/README.md`**](./phase-1-telemetry/README.md)
@@ -57,18 +57,17 @@ These rules govern every task. Any deviation must be discussed and documented as
| 1.7 | [Codec 16 parser (incl. Generation Type)](./phase-1-telemetry/07-codec-16.md) | 🟩 | `381287b` |
| 1.8 | [Redis Streams publisher & main wiring](./phase-1-telemetry/08-redis-publisher.md) | 🟩 | `af06973` |
| 1.9 | [Fixture suite & testing strategy](./phase-1-telemetry/09-fixture-suite.md) | 🟩 | `381287b` |
| 1.10 | [Observability (Prometheus metrics)](./phase-1-telemetry/10-observability.md) | | *deferred — see below* |
| 1.10 | [Observability (Prometheus metrics)](./phase-1-telemetry/10-observability.md) | 🟩 | *(pending commit SHA)* |
| 1.11 | [Dockerfile & Gitea workflow](./phase-1-telemetry/11-dockerfile-and-ci.md) | 🟩 | `88b742d` (slim pilot variant) |
| 1.12 | [Production hardening](./phase-1-telemetry/12-production-hardening.md) | ⏸ | *deferred — see below* |
| 1.13 | [Device authority (Redis allow-list refresher)](./phase-1-telemetry/13-device-authority.md) | ⏸ | *deferred — see below* |
#### Deferred (resume after the real-device pilot test)
These three tasks are paused so we can get the service onto real hardware as fast as possible. They are paused, not cancelled — each must be completed before the service is considered production-ready.
These two tasks are paused so we can get the service onto real hardware as fast as possible. They are paused, not cancelled — each must be completed before the service is considered production-ready.
- **1.10 Observability (Prometheus metrics).** Tracking via the placeholder `Metrics` interface for now. **Resume trigger:** as soon as the pilot is generating real traffic and we want to measure it, *or* before any second instance is deployed (without metrics, "consumer lag" and "unknown codec" alerts cannot fire).
- **1.12 Production hardening.** Graceful shutdown is a stub today; uncaught-exception handlers are minimal. **Resume trigger:** before the pilot graduates to "always-on" or before any deployment that does rolling restarts. Acceptable for a manual pilot where we can stop/start the process by hand.
- **1.13 Device authority (Redis allow-list refresher).** Default `AllowAllAuthority` accepts every IMEI; observability of `known | unknown` is moot until 1.10 lands. **Resume trigger:** when Directus has a `devices` collection publishing the allow-list to Redis, or when the operational picture demands rejecting unknown IMEIs (`STRICT_DEVICE_AUTH=true`).
- **1.13 Device authority (Redis allow-list refresher).** Default `AllowAllAuthority` accepts every IMEI. **Resume trigger:** when Directus has a `devices` collection publishing the allow-list to Redis, or when the operational picture demands rejecting unknown IMEIs (`STRICT_DEVICE_AUTH=true`).
When resuming any of these, change the status from ⏸ back to ⬜ or 🟨 here and in the task file's status badge, and clear the deferral note in the task file.
@@ -1,7 +1,7 @@
# Task 1.10 — Observability (Prometheus metrics)
**Phase:** 1 — Inbound telemetry
**Status:** ⏸ Paused — deferred until after the real-device pilot test. See ROADMAP.md "Deferred" section for resume triggers. The placeholder `Metrics` interface in `src/core/types.ts` is what code currently uses; this task replaces it with `prom-client` and adds the `/metrics`, `/healthz`, `/readyz` HTTP endpoints.
**Status:** 🟩
**Depends on:** 1.2, 1.3
**Wiki refs:** `docs/wiki/sources/teltonika-ingestion-architecture.md` § 7. Observability, `docs/wiki/sources/gps-tracking-architecture.md` § 7.4
@@ -81,4 +81,4 @@ Use Node's `node:http` directly — no Express/Fastify dependency for two endpoi
## Done
(Fill in once complete.)
Implemented `src/observability/metrics.ts` with `createMetrics()`, `startMetricsServer()`, and `ReadyzDeps`. Replaced the placeholder shim in `src/main.ts`, wired metrics server into boot and graceful shutdown, downgraded `frame ingested` log to debug, and re-enabled the Dockerfile `HEALTHCHECK`. Landed in: *(pending commit SHA)*