julian be48da9baa Implement Phase 1 tasks 1.9-1.11 (observability + integration test + Dockerfile/CI)

src/observability/metrics.ts — full prom-client implementation. All 10
Phase 1 metrics registered (processor_consumer_reads_total,
_records_total, _lag, _decode_errors_total, processor_position_writes_total
{status}, _write_duration_seconds, processor_acks_total,
processor_device_state_{size,evictions_total}) plus nodejs_* defaults.
node:http server with /metrics, /healthz, /readyz. /readyz checks
redis.status === 'ready' AND a 5s-cached SELECT 1 Postgres probe.
processor_consumer_lag sampled every 10s via XINFO GROUPS, falling back
to a no-op when the consumer group hasn't been created yet.

src/main.ts — replaces the trace-logging shim with createMetrics() and
startMetricsServer(); shutdown closes the metrics server before
redis.quit() and pool.end().

test/metrics.test.ts — 22 unit tests: exposition format, every metric
type behaviour, all four HTTP endpoint paths including /readyz 503 cases.

test/pipeline.integration.test.ts — testcontainers Redis 7 +
TimescaleDB latest-pg16. Four scenarios: happy path with bigint+Buffer
attribute round-trip, idempotency on (device_id, ts), malformed payload
stays in PEL (decode_errors_total increments), writer failure → retry
(weaker variant per spec: stop Postgres before publish, restart, verify
row appears). Skip-on-no-Docker pattern verified — exits 0 without
Docker.

Dockerfile — multi-stage matching tcp-ingestion. EXPOSE 9090 only,
HEALTHCHECK on /readyz, image-source label points at processor repo.

.gitea/workflows/build.yml — single-job workflow mirroring
tcp-ingestion. Path filters cover src/, test/, build config, Dockerfile.
Portainer webhook step uncommented for :main auto-deploy.

compose.dev.yaml — local-build variant with Redis + TimescaleDB +
processor-dev for verifying Dockerfile changes without the registry
round-trip.

README.md — fleshed out from stub: quick-start, Docker build, deployment
note, env vars, tests (unit vs. integration), CI behavior. Flags the
deploy-side change needed: deploy/compose.yaml needs a TimescaleDB
service and a processor service entry added.

Verification: typecheck, lint clean; 134 unit tests passing across 8
files (+22 from this batch). pnpm test:integration runs cleanly under
the no-Docker skip pattern.

Phase 1 is now complete. Service is pilot-ready.

2026-04-30 22:01:55 +02:00


Task 1.9 — Observability (Prometheus metrics + /healthz + /readyz)

Phase: 1 (Throughput pipeline)
Status: 🟩 Done
Depends on: 1.5, 1.6, 1.7, 1.8
Wiki refs: docs/wiki/entities/processor.md, docs/wiki/sources/gps-tracking-architecture.md § 7.4

Goal

Replace the placeholder Metrics shim with a real prom-client implementation. Expose /metrics (Prometheus exposition format), /healthz (liveness), and /readyz (readiness — Redis ready AND Postgres ready) on METRICS_PORT.

This is not a deferral candidate (unlike tcp-ingestion task 1.10). The Processor has no other surface for measuring consumer lag, write throughput, or failure rates — without it, "is the pilot keeping up?" cannot be answered.

Deliverables

- src/observability/metrics.ts, with the same shape as tcp-ingestion/src/observability/metrics.ts:
  - createMetrics(): Metrics & { serializeMetrics(): Promise<string> }. Wraps prom-client registries; calls collectDefaultMetrics() once for nodejs_* process metrics.
  - startMetricsServer(port, metrics, deps): http.Server. A node:http server with three endpoints. deps carries the readyz health checks: { isRedisReady(): boolean; isPostgresReady(): boolean }.
- Update src/main.ts to use the real createMetrics() and start the metrics server after Redis + Postgres are connected and the consumer is started. Wire it into graceful shutdown (close it before redis.quit()).
- Tests: test/metrics.test.ts mirroring the tcp-ingestion test pattern, covering exposition format, counter/gauge/histogram behaviour, and all four HTTP endpoint paths including the /readyz 503 cases.

Specification

Metric inventory

| Metric | Type | Labels | Description |
|---|---|---|---|
| `processor_consumer_reads_total` | counter | `result=ok\|empty\|error` | `XREADGROUP` calls; `empty` = BLOCK timeout, `error` = client error |
| `processor_consumer_records_total` | counter | none | Total records pulled off the stream |
| `processor_consumer_lag` | gauge | none | `XLEN` minus the consumer group's last-delivered ID position. Sampled every N seconds. |
| `processor_decode_errors_total` | counter | none | Records that failed to decode (malformed payload, sentinel error) |
| `processor_position_writes_total` | counter | `status=inserted\|duplicate\|failed` | Per-record write outcomes |
| `processor_position_write_duration_seconds` | histogram | none | Per-batch write latency. Buckets `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]` |
| `processor_acks_total` | counter | none | Total IDs ACKed |
| `processor_device_state_size` | gauge | none | Current count of devices in the LRU map |
| `processor_device_state_evictions_total` | counter | none | Total LRU evictions since start |
| `nodejs_*` | various | none | Default Node process metrics |
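
To make the histogram row concrete, the sketch below shows how write latencies would map onto the cumulative buckets listed in the inventory. It deliberately does not use prom-client; `BUCKETS` copies the inventory, while `bucketCounts` is a hypothetical helper that exists only for illustration.

```typescript
// Bucket upper bounds (seconds), copied from the metric inventory.
const BUCKETS: readonly number[] = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5];

// Hypothetical helper: cumulative counts per bucket, Prometheus-style.
// An observation counts toward every bucket whose upper bound is >= the value;
// the implicit +Inf bucket counts every observation.
function bucketCounts(observations: readonly number[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const le of BUCKETS) {
    counts.set(String(le), observations.filter((v) => v <= le).length);
  }
  counts.set("+Inf", observations.length);
  return counts;
}

// Four example batch latencies: 3 ms, 40 ms, 700 ms and a 12 s outlier.
const exampleCounts = bucketCounts([0.003, 0.04, 0.7, 12]);
```

With these bounds, anything slower than 5 s is only visible in the `+Inf` bucket, which may be worth keeping in mind if batch writes can stall for longer than that.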

Naming convention

processor_* for service-specific metrics. tcp-ingestion uses teltonika_* because that's its adapter; the Processor isn't bound to a vendor, so the service-name prefix fits.
No external service label — Prometheus scrape config adds it.

Health and readiness

GET /healthz → 200 if the process is alive. Always returns { "status": "ok" }.
GET /readyz → 200 if both Redis is ready (redis.status === 'ready') AND Postgres is ready (last successful query within 30s, or a fresh SELECT 1 succeeds quickly). 503 otherwise.
Both endpoints return tiny JSON bodies for diagnostic value.
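
The readiness rule above can be sketched as a pure function. This covers only the "last successful query within 30s" branch, not the fresh SELECT 1 probe; `ReadinessInputs`, `ReadinessResult`, `evaluateReadiness` and the JSON body shape are assumptions for illustration, not the names or payload used in the repository.

```typescript
// Hypothetical inputs for the readiness decision; names are illustrative.
interface ReadinessInputs {
  redisStatus: string; // e.g. the Redis client's `status` field
  lastPostgresSuccessMs: number | null; // epoch ms of the last successful query
  nowMs: number;
}

interface ReadinessResult {
  httpStatus: 200 | 503;
  // Assumed body shape: small, but enough to see which dependency is failing.
  body: { status: "ok" | "unavailable"; redis: boolean; postgres: boolean };
}

const POSTGRES_MAX_AGE_MS = 30_000; // "last successful query within 30s"

function evaluateReadiness(input: ReadinessInputs): ReadinessResult {
  const redis = input.redisStatus === "ready";
  const postgres =
    input.lastPostgresSuccessMs !== null &&
    input.nowMs - input.lastPostgresSuccessMs <= POSTGRES_MAX_AGE_MS;
  const ready = redis && postgres;
  return {
    httpStatus: ready ? 200 : 503,
    body: { status: ready ? "ok" : "unavailable", redis, postgres },
  };
}
```

Passing the clock in as `nowMs` keeps the 503 cases easy to unit-test without timers.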

`processor_consumer_lag` measurement

Sample every 10s in a separate setInterval (don't compute it on every read — too noisy). Compute as:

lag = XLEN(stream) - position_of(group_last_delivered_id_in_stream)

Use the lag field from XINFO GROUPS <stream> (available since Redis 7.0). If the field is absent or null, fall back to plain XLEN (XLEN minus 0). That value is an upper bound on the true lag and only a rough proxy, so flag it as "approximate" in the metric description.

If sampling fails (Redis blip), log at debug and continue. Don't let metrics gathering break the consumer.
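
A minimal sketch of the lag selection described above, assuming the Redis client hands back each XINFO GROUPS entry as a flat `[key, value, key, value, ...]` array. `sampleLag`, `LagSample` and the group name used in the usage line are hypothetical.

```typescript
// One group's entry from XINFO GROUPS as a flat key/value array.
// Assumption: the client returns the raw reply shape rather than an object.
type RawGroupInfo = ReadonlyArray<string | number | null>;

interface LagSample {
  lag: number;
  approximate: boolean; // true when the XLEN fallback was used
}

// Hypothetical helper: prefer the group's `lag` field, else fall back to XLEN.
// Returns null when the group is not listed (e.g. not created yet), so the
// caller can skip the gauge update instead of throwing.
function sampleLag(
  groups: ReadonlyArray<RawGroupInfo>,
  groupName: string,
  xlen: number,
): LagSample | null {
  for (const raw of groups) {
    const fields = new Map<string, string | number | null>();
    for (let i = 0; i + 1 < raw.length; i += 2) {
      fields.set(String(raw[i]), raw[i + 1] ?? null);
    }
    if (fields.get("name") !== groupName) continue;
    const lag = fields.get("lag");
    if (typeof lag === "number") return { lag, approximate: false };
    return { lag: xlen, approximate: true };
  }
  return null;
}

// Usage with a made-up reply for a group called "processor".
const sample = sampleLag(
  [["name", "processor", "pending", 0, "last-delivered-id", "1-0", "lag", 3]],
  "processor",
  120,
);
```

In the real sampler this call would sit inside a try/catch within the setInterval callback, so that a failed XINFO only produces a debug log line.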

HTTP server — same minimal node:http

No Express. Roughly 30 lines. Match tcp-ingestion's style.
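
The routing itself can be kept free of node:http so it stays unit-testable; the request listener would then only copy the result onto the response. The sketch below is one possible shape, not the tcp-ingestion code: `route`, `HttpReply`, the 404 body and the readyz body are assumptions, and the `/metrics` content type is the Prometheus text-format value that prom-client reports, hard-coded here for self-containment.

```typescript
interface HttpReply {
  status: number;
  contentType: string;
  body: string;
}

// Mirrors the `deps` object described in the deliverables.
interface RouteDeps {
  isRedisReady(): boolean;
  isPostgresReady(): boolean;
}

const JSON_TYPE = "application/json";
const METRICS_TYPE = "text/plain; version=0.0.4; charset=utf-8";

// `metricsText` stands in for the already-serialized exposition output.
function route(path: string, metricsText: string, deps: RouteDeps): HttpReply {
  switch (path) {
    case "/metrics":
      return { status: 200, contentType: METRICS_TYPE, body: metricsText };
    case "/healthz":
      return { status: 200, contentType: JSON_TYPE, body: JSON.stringify({ status: "ok" }) };
    case "/readyz": {
      const ready = deps.isRedisReady() && deps.isPostgresReady();
      return {
        status: ready ? 200 : 503,
        contentType: JSON_TYPE,
        body: JSON.stringify({ status: ready ? "ok" : "unavailable" }),
      };
    }
    default:
      // Assumed behaviour for any other path.
      return { status: 404, contentType: JSON_TYPE, body: JSON.stringify({ error: "not found" }) };
  }
}
```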

Acceptance criteria

pnpm typecheck, pnpm lint, pnpm test clean.
curl http://localhost:9090/metrics returns valid exposition format with every metric in the inventory present (some at zero).
After processing one record end-to-end, processor_consumer_records_total increments by 1, processor_position_writes_total{status="inserted"} increments by 1, processor_acks_total increments by 1.
/readyz returns 503 while Redis is disconnected (simulate by redis.disconnect()), 200 once it reconnects.
/readyz returns 503 while the Pool fails its health probe, 200 when it recovers.
nodejs_* default metrics are exposed.
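
The "every metric in the inventory present" criterion could be checked mechanically against the /metrics output. The sketch below looks for `# TYPE <name>` declarations rather than sample lines, on the assumption that a labelled counter may have no sample lines until one of its label values has been used. `missingMetrics` is a hypothetical test helper.

```typescript
// Service-specific metric names from the inventory (nodejs_* omitted).
const INVENTORY: readonly string[] = [
  "processor_consumer_reads_total",
  "processor_consumer_records_total",
  "processor_consumer_lag",
  "processor_decode_errors_total",
  "processor_position_writes_total",
  "processor_position_write_duration_seconds",
  "processor_acks_total",
  "processor_device_state_size",
  "processor_device_state_evictions_total",
];

// Hypothetical helper: inventory names with no `# TYPE <name> <type>` line
// in the given exposition text.
function missingMetrics(exposition: string): string[] {
  const declared = new Set<string>();
  for (const line of exposition.split("\n")) {
    const name = /^# TYPE (\S+) \S+/.exec(line)?.[1];
    if (name !== undefined) declared.add(name);
  }
  return INVENTORY.filter((name) => !declared.has(name));
}
```

A test would fetch /metrics, pass the body to `missingMetrics`, and expect an empty array.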

Risks / open questions

Cardinality of label values. None of the Phase 1 metrics use unbounded labels. Phase 2 may want per-stage metrics — be careful: hundreds of stages is fine, hundreds of devices as labels is not. Keep the same rule as tcp-ingestion: per-device labels never go on Prometheus metrics; logs/traces are the right place.
processor_consumer_lag sampling cadence. 10s is a guess. If alerts get jittery, lower to 5s or raise to 30s. Tunable later.

Done

Real prom-client implementation replacing the trace-log shim. All 10 Phase 1 metrics registered; /healthz, /readyz (cached SELECT 1 Postgres health check, 5 s TTL), /metrics endpoints live. Consumer lag sampled every 10 s via XINFO GROUPS. createPostgresHealthCheck and createConsumerLagSampler exported for graceful-shutdown wiring. 22 new unit tests in test/metrics.test.ts. (pending commit SHA)
