# Task 1.9 — Observability (Prometheus metrics + /healthz + /readyz)

**Phase:** 1 — Throughput pipeline

**Status:** 🟩 Done

**Depends on:** 1.5, 1.6, 1.7, 1.8

**Wiki refs:** `docs/wiki/entities/processor.md`, `docs/wiki/sources/gps-tracking-architecture.md` § 7.4

## Goal

Replace the placeholder `Metrics` shim with a real `prom-client` implementation. Expose `/metrics` (Prometheus exposition format), `/healthz` (liveness), and `/readyz` (readiness — Redis ready AND Postgres ready) on `METRICS_PORT`.

This is **not** a deferral candidate (unlike `tcp-ingestion` task 1.10). The Processor has no other surface for measuring consumer lag, write throughput, or failure rates — without it, "is the pilot keeping up?" cannot be answered.

## Deliverables

- `src/observability/metrics.ts` — same shape as `tcp-ingestion/src/observability/metrics.ts`:
  - `createMetrics(): Metrics & { serializeMetrics(): Promise<string> }` — wraps `prom-client` registries; calls `collectDefaultMetrics()` once for `nodejs_*` process metrics.
  - `startMetricsServer(port, metrics, deps): http.Server` — `node:http` server with three endpoints. `deps` carries the readyz health checks: `{ isRedisReady(): boolean; isPostgresReady(): boolean }`.
- Update `src/main.ts` to use the real `createMetrics()` and start the metrics server after Redis + Postgres are connected and the consumer is started. Wire it into graceful shutdown (close it before `redis.quit()`).
- Tests: `test/metrics.test.ts` mirroring the `tcp-ingestion` test pattern — exposition format, counter/gauge/histogram behaviour, all four HTTP endpoint paths including `/readyz` 503 cases.

## Specification

### Metric inventory

| Metric | Type | Labels | Description |
|---|---|---|---|
| `processor_consumer_reads_total` | counter | `result=ok\|empty\|error` | `XREADGROUP` calls; `empty` = BLOCK timeout, `error` = client error |
| `processor_consumer_records_total` | counter | — | Total records pulled off the stream |
| `processor_consumer_lag` | gauge | — | `XLEN` minus the consumer group's last-delivered ID position. Sampled every N seconds. |
| `processor_decode_errors_total` | counter | — | Records that failed to decode (malformed payload, sentinel error) |
| `processor_position_writes_total` | counter | `status=inserted\|duplicate\|failed` | Per-record write outcomes |
| `processor_position_write_duration_seconds` | histogram | — | Per-batch write latency. Buckets `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]` |
| `processor_acks_total` | counter | — | Total IDs ACKed |
| `processor_device_state_size` | gauge | — | Current count of devices in the LRU map |
| `processor_device_state_evictions_total` | counter | — | Total LRU evictions since start |
| `nodejs_*` | various | — | Default Node process metrics |

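As a sanity check on the inventory, a couple of PromQL expressions these names support (illustrative queries, not part of the task):

```promql
# Failed-write ratio over 5 minutes
sum(rate(processor_position_writes_total{status="failed"}[5m]))
  / sum(rate(processor_position_writes_total[5m]))

# p95 per-batch write latency
histogram_quantile(0.95,
  sum(rate(processor_position_write_duration_seconds_bucket[5m])) by (le))
```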
### Naming convention

- `processor_*` for service-specific metrics. `tcp-ingestion` uses `teltonika_*` because that's its adapter; the Processor isn't bound to a vendor, so the service-name prefix fits.
- No external `service` label — the Prometheus scrape config adds it.

### Health and readiness

- `GET /healthz` → 200 if the process is alive. Always returns `{ "status": "ok" }`.
- `GET /readyz` → 200 if Redis is ready (`redis.status === 'ready'`) AND Postgres is ready (last successful query within 30 s, or a fresh `SELECT 1` succeeds quickly); 503 otherwise.
- Both endpoints return tiny JSON bodies for diagnostic value.

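The Postgres side of `/readyz` can be sketched as a cached check: ready if a query succeeded within the last window, with `SELECT 1` probed on a timer. The helper name mirrors the `createPostgresHealthCheck` export mentioned in the Done section, but the exact signature here is an assumption; the clock is injectable so the window logic is testable.

```typescript
// Hypothetical pool shape — only query() is needed for the probe.
interface QueryablePool {
  query(sql: string): Promise<unknown>;
}

export function createPostgresHealthCheck(
  pool: QueryablePool,
  windowMs = 30_000,
  now: () => number = Date.now, // injectable clock for tests
) {
  let lastOkAt = -Infinity; // no successful query yet

  return {
    // Synchronous check used by the /readyz handler.
    isPostgresReady(): boolean {
      return now() - lastOkAt <= windowMs;
    },
    // Called on a timer (and optionally after successful business queries).
    async probe(): Promise<void> {
      try {
        await pool.query("SELECT 1");
        lastOkAt = now();
      } catch {
        // Leave lastOkAt untouched; readiness expires as the window elapses.
      }
    },
  };
}
```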
### `processor_consumer_lag` measurement

Sample every 10 s in a separate `setInterval` (don't compute it on every read — too noisy). Compute as:

```
lag = XLEN(stream) - position_of(group_last_delivered_id_in_stream)
```

Use `XINFO GROUPS <stream>` → the `lag` field (Redis 7.2+). If the field is absent, fall back to `XLEN` alone — a good-enough proxy when the group is caught up; flag it as "approximate" in the metric description.

If sampling fails (Redis blip), log at `debug` and continue. Don't let metrics gathering break the consumer.

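The fallback can be sketched as a pure function over an already-parsed `XINFO GROUPS` reply. The `GroupInfo` object shape is an assumption about what the client returns after parsing (ioredis, for instance, returns a flat array that would need mapping first); only the fallback rule itself comes from the spec.

```typescript
// Hypothetical parsed shape of one XINFO GROUPS entry.
interface GroupInfo {
  name: string;
  lag?: number | null; // Redis may omit it, or report null when untrackable
}

// Pick the lag for one consumer group, falling back to XLEN when the
// `lag` field is absent — an approximate proxy when the group is caught up.
export function consumerLag(
  groups: GroupInfo[],
  groupName: string,
  streamLength: number, // result of XLEN <stream>
): { lag: number; approximate: boolean } {
  const group = groups.find((g) => g.name === groupName);
  if (group && typeof group.lag === "number") {
    return { lag: group.lag, approximate: false };
  }
  return { lag: streamLength, approximate: true };
}
```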
### HTTP server — same minimal node:http

No Express. Roughly 30 lines. Match `tcp-ingestion`'s style.

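The rough shape of those ~30 lines — a sketch, not the landed code. `serializeMetrics()` and the `deps` shape come from the Deliverables section; the handler bodies and response texts are illustrative.

```typescript
import http from "node:http";

interface MetricsLike {
  serializeMetrics(): Promise<string>;
}
interface Deps {
  isRedisReady(): boolean;
  isPostgresReady(): boolean;
}

export function startMetricsServer(port: number, metrics: MetricsLike, deps: Deps): http.Server {
  const server = http.createServer(async (req, res) => {
    if (req.url === "/metrics") {
      const body = await metrics.serializeMetrics();
      res.writeHead(200, { "content-type": "text/plain; version=0.0.4" });
      res.end(body);
    } else if (req.url === "/healthz") {
      res.writeHead(200, { "content-type": "application/json" });
      res.end(JSON.stringify({ status: "ok" }));
    } else if (req.url === "/readyz") {
      const ready = deps.isRedisReady() && deps.isPostgresReady();
      res.writeHead(ready ? 200 : 503, { "content-type": "application/json" });
      res.end(JSON.stringify({ status: ready ? "ok" : "unavailable" }));
    } else {
      res.writeHead(404);
      res.end();
    }
  });
  return server.listen(port);
}
```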
## Acceptance criteria

- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] `curl http://localhost:9090/metrics` returns valid exposition format with every metric in the inventory present (some at zero).
- [ ] After processing one record end-to-end, `processor_consumer_records_total` increments by 1, `processor_position_writes_total{status="inserted"}` increments by 1, and `processor_acks_total` increments by 1.
- [ ] `/readyz` returns 503 while Redis is disconnected (simulated via `redis.disconnect()`), 200 once it reconnects.
- [ ] `/readyz` returns 503 while the Pool fails its health probe, 200 when it recovers.
- [ ] `nodejs_*` default metrics are exposed.

## Risks / open questions

- **Cardinality of label values.** None of the Phase 1 metrics use unbounded labels. Phase 2 may want per-stage metrics — be careful: hundreds of stages is fine, hundreds of devices as labels is not. Keep the same rule as `tcp-ingestion`: per-device labels never go on Prometheus metrics; logs/traces are the right place.
- **`processor_consumer_lag` sampling cadence.** 10 s is a guess. If alerts get jittery, lower to 5 s or raise to 30 s. Tunable later.

## Done

Real `prom-client` implementation replacing the trace-log shim. All 10 Phase 1 metrics registered; `/healthz`, `/readyz` (cached `SELECT 1` Postgres health check, 5 s TTL), and `/metrics` endpoints live. Consumer lag sampled every 10 s via `XINFO GROUPS`. `createPostgresHealthCheck` and `createConsumerLagSampler` exported for graceful-shutdown wiring. 22 new unit tests in `test/metrics.test.ts`. Landed in `9791620`.