e1c6f59948
Stage discovered the wrong default at runtime: tcp-ingestion's compiled default REDIS_TELEMETRY_STREAM is 'telemetry:teltonika' but the processor's was 'telemetry:t', so the two services were talking past each other — tcp-ingestion publishing to one stream, the processor reading another, empty one. The deploy stack now pins both to the same value via a shared env var, but the processor's compiled default should also match so local development and the integration test stay aligned with reality.

Changes:
- src/config/load.ts — default changed to 'telemetry:teltonika'
- .env.example — same
- test/config.test.ts — default-value assertion updated
- planning docs (ROADMAP, phase-1 README, tasks 03/08/10, phase-3 README) — occurrences of 'telemetry:t' replaced with 'telemetry:teltonika'

The deploy stack remains the single source of truth via the shared REDIS_TELEMETRY_STREAM env var. Compiled defaults are belt-and-braces.
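For reference, a minimal sketch of what the aligned default could look like in src/config/load.ts, assuming the loader reads process.env with compiled fallbacks; the `Config` shape and `loadConfig` name are illustrative, not the actual module:

```ts
// Hypothetical excerpt of src/config/load.ts (the real loader's shape may differ).
export interface Config {
  redisTelemetryStream: string;
}

export function loadConfig(env: NodeJS.ProcessEnv = process.env): Config {
  return {
    // The deploy stack's shared REDIS_TELEMETRY_STREAM still wins; the compiled
    // default now matches tcp-ingestion's, so local dev reads the same stream.
    redisTelemetryStream: env.REDIS_TELEMETRY_STREAM ?? 'telemetry:teltonika',
  };
}
```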
Phase 3 — Production hardening
Status: ⬜ Not started
The set of operational features that turn a working pilot into something safe to leave running unattended through deploys, instance failures, and bad data.
Outcome statement
When Phase 3 is done:
- Graceful shutdown with bounded in-flight drain: SIGTERM blocks new reads, awaits in-flight writes, ACKs anything still in PEL whose write succeeded, exits clean.
- State rehydration on restart: on first packet for an unknown device, the Processor queries Postgres for the device's `last_position` and seeds `DeviceState` accordingly. Phase 2 accumulators get the same treatment (e.g. last geofence membership comes from the last `timing_records` row). Sketched after this list.
- `XAUTOCLAIM` for stuck pending entries: at startup and on a cadence, the Processor claims entries that have been pending in another consumer's PEL for longer than `CLAIM_THRESHOLD_MS`. Lets a dead instance's work get picked up by survivors without manual intervention.
- Dead-letter stream for poison records: records that fail to decode N times go to `telemetry:teltonika:dlq` with the original payload + the error. Operators can inspect, fix, replay.
- Multi-instance load split verified: spinning up two Processor instances against the same consumer group splits the work evenly. End-to-end test in CI (or at least a manual playbook).
- Migration safety with multiple instances: Postgres advisory locks around the migration runner so two instances starting simultaneously don't race.
- Uncaught exception / unhandled rejection handlers: log, flush in-memory state to a panic dump file, exit with a code Portainer treats as restart-worthy.
- `OPERATIONS.md` runbook: exact commands for "claim stuck entries from a dead instance," "drain the DLQ," "force-rehydrate a single device," "view consumer lag," etc.
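To make the graceful-shutdown bullet concrete, a minimal sketch of a bounded-drain SIGTERM handler, assuming an ioredis client and a read loop that registers its writes in the structures shown; the names (`inFlight`, `written`, `DRAIN_BUDGET_MS`, the 'processor' group) are illustrative, not existing code:

```ts
import Redis from 'ioredis';

const STREAM = process.env.REDIS_TELEMETRY_STREAM ?? 'telemetry:teltonika';
const GROUP = process.env.REDIS_CONSUMER_GROUP ?? 'processor'; // assumed group name
const DRAIN_BUDGET_MS = Number(process.env.DRAIN_BUDGET_MS ?? 10_000);

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

export let shuttingDown = false;                    // read loop checks this before each XREADGROUP
const inFlight = new Map<string, Promise<void>>();  // entry id -> pending Postgres write
const written = new Set<string>();                  // entry ids written but not yet ACKed

// The read loop (not shown) registers each write roughly like:
//   const p = sink.write(record).then(() => { written.add(id); });
//   inFlight.set(id, p.finally(() => inFlight.delete(id)));

process.on('SIGTERM', async () => {
  shuttingDown = true; // 1. block new reads

  // 2. wait for in-flight writes, bounded by the drain budget
  const deadline = new Promise<void>((resolve) => setTimeout(resolve, DRAIN_BUDGET_MS));
  await Promise.race([Promise.allSettled([...inFlight.values()]), deadline]);

  // 3. ACK anything still in the PEL whose write succeeded; failed or timed-out
  //    writes stay pending so XAUTOCLAIM (task 3.3) can recover them.
  if (written.size > 0) await redis.xack(STREAM, GROUP, ...Array.from(written));

  await redis.quit();
  process.exit(0); // 4. exit clean
});
```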
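Likewise for the rehydration bullet: a sketch of the first-packet lookup, assuming a `pg` pool; the table and column names (`telemetry_records`, `device_id`, `recorded_at`, `lat`, `lon`) are assumptions, and the memo is a plain Map where task 3.2 calls for an LRU so it stays bounded:

```ts
import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from the usual PG* env vars

interface DeviceState {
  lastLat: number | null;
  lastLon: number | null;
  lastFixAt: Date | null;
}

// Devices already rehydrated since startup. A Map for the sketch; the real
// task memoizes with an LRU so a large fleet can't grow this without bound.
const rehydrated = new Map<string, DeviceState>();

// Called on the first packet seen for a device after a restart.
export async function rehydrate(deviceId: string): Promise<DeviceState> {
  const cached = rehydrated.get(deviceId);
  if (cached) return cached;

  // Single SELECT ... LIMIT 1 per cold device, as task 3.2 describes.
  const { rows } = await pool.query(
    `SELECT lat, lon, recorded_at
       FROM telemetry_records
      WHERE device_id = $1
      ORDER BY recorded_at DESC
      LIMIT 1`,
    [deviceId],
  );

  const state: DeviceState = rows.length
    ? { lastLat: rows[0].lat, lastLon: rows[0].lon, lastFixAt: rows[0].recorded_at }
    : { lastLat: null, lastLon: null, lastFixAt: null }; // genuinely new device

  rehydrated.set(deviceId, state);
  return state;
}
```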
Tasks (sketched, not detailed)
| # | Task | Notes |
|---|---|---|
| 3.1 | Graceful shutdown — full | Replaces the Phase 1 stub. Drain budget configurable. Tested end-to-end |
| 3.2 | Per-device state rehydration on first-packet | Single SELECT ... LIMIT 1 per cold device. Memoized by LRU |
| 3.3 | XAUTOCLAIM runner | Periodic + on-startup. Claims entries pending > CLAIM_THRESHOLD_MS. Re-runs the sink |
| 3.4 | Dead-letter stream | After N failed decodes/writes, record goes to telemetry:teltonika:dlq; original ACKed off the main stream |
| 3.5 | Migration advisory lock | pg_advisory_lock(<hash>) around the migrate runner; two instances can start simultaneously without racing |
| 3.6 | Uncaught exception / unhandled rejection handlers | Log, flush, exit. Match tcp-ingestion's eventual Phase 1 task 1.12 work when that lands |
| 3.7 | OPERATIONS.md | The runbook |
| 3.8 | Multi-instance load test | A test (manual or in CI) that proves two instances split the work; document expected lag behaviour during failover |
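A sketch of the XAUTOCLAIM runner (task 3.3), assuming ioredis and using its generic call() so the reply shape stays explicit; the group and consumer names are assumptions:

```ts
import Redis from 'ioredis';

const STREAM = process.env.REDIS_TELEMETRY_STREAM ?? 'telemetry:teltonika';
const GROUP = process.env.REDIS_CONSUMER_GROUP ?? 'processor';   // assumed group name
const CONSUMER = process.env.HOSTNAME ?? 'processor-1';          // this instance's consumer name
const CLAIM_THRESHOLD_MS = Number(process.env.CLAIM_THRESHOLD_MS ?? 60_000);

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Claims entries stuck in any consumer's PEL longer than the threshold and
// feeds them back through the normal sink. Runs once at startup, then on a cadence.
export function startClaimRunner(
  sink: (id: string, fields: string[]) => Promise<void>,
): void {
  const scan = async (): Promise<void> => {
    let cursor = '0-0';
    do {
      // XAUTOCLAIM <key> <group> <consumer> <min-idle-time> <start> COUNT <n>
      const reply = (await redis.call(
        'XAUTOCLAIM', STREAM, GROUP, CONSUMER,
        String(CLAIM_THRESHOLD_MS), cursor, 'COUNT', '100',
      )) as [string, Array<[string, string[] | null]>, ...unknown[]];

      const [nextCursor, entries] = reply;
      for (const [id, fields] of entries ?? []) {
        if (!fields) continue;   // entry was trimmed from the stream; nothing to replay
        await sink(id, fields);  // the sink ACKs on success, same as the live read path
      }
      cursor = nextCursor;
    } while (cursor !== '0-0');  // Redis hands back 0-0 once the PEL scan has wrapped

  };

  void scan();                                                // startup pass
  setInterval(() => void scan(), CLAIM_THRESHOLD_MS).unref(); // periodic pass
}
```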
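One way task 3.4 could look, assuming ioredis and a decode step that throws; the per-instance retry counter, field names, and MAX_DECODE_ATTEMPTS are illustrative:

```ts
import Redis from 'ioredis';

const STREAM = process.env.REDIS_TELEMETRY_STREAM ?? 'telemetry:teltonika';
const DLQ_STREAM = `${STREAM}:dlq`;
const GROUP = process.env.REDIS_CONSUMER_GROUP ?? 'processor';
const MAX_DECODE_ATTEMPTS = Number(process.env.MAX_DECODE_ATTEMPTS ?? 3);

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
const failures = new Map<string, number>(); // entry id -> decode failures seen by this instance

export async function handleEntry(
  id: string,
  fields: string[],
  decode: (fields: string[]) => unknown,
): Promise<void> {
  try {
    decode(fields);
    failures.delete(id);
    // ... normal write path, then XACK ...
  } catch (err) {
    const attempts = (failures.get(id) ?? 0) + 1;
    failures.set(id, attempts);
    if (attempts < MAX_DECODE_ATTEMPTS) throw err; // leave it pending for another attempt

    // Poison record: park the original payload plus the error on the DLQ,
    // then ACK it off the main stream so it stops blocking the group.
    await redis.xadd(
      DLQ_STREAM, '*',
      'source_id', id,
      'payload', JSON.stringify(fields),
      'error', err instanceof Error ? err.message : String(err),
    );
    await redis.xack(STREAM, GROUP, id);
    failures.delete(id);
  }
}
```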
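A sketch of the advisory lock around the migration runner (task 3.5), assuming `pg` and an existing runMigrations function; the lock key is an arbitrary constant both instances simply have to agree on:

```ts
import { Pool, PoolClient } from 'pg';

const pool = new Pool();
// Any stable 64-bit integer works; every instance must use the same one.
const MIGRATION_LOCK_KEY = 724001;

export async function migrateWithLock(
  runMigrations: (client: PoolClient) => Promise<void>,
): Promise<void> {
  // Session-level advisory locks live on a connection, so acquire and release
  // on the same client rather than going through the pool's query shortcut.
  const client = await pool.connect();
  try {
    // Blocks until any other instance's migration run has finished.
    await client.query('SELECT pg_advisory_lock($1)', [MIGRATION_LOCK_KEY]);
    await runMigrations(client); // either performs the migration or sees a fully migrated schema
  } finally {
    await client.query('SELECT pg_advisory_unlock($1)', [MIGRATION_LOCK_KEY]);
    client.release();
  }
}
```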
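And a sketch of the crash handlers (task 3.6), assuming the hot per-device state lives in a Map and that a non-zero exit code is what the restart policy keys on; the dump path and snapshot shape are illustrative:

```ts
import { writeFileSync } from 'node:fs';

// Wherever the processor keeps hot per-device state; a Map stands in here.
const deviceState = new Map<string, unknown>();

function panicDump(kind: string, err: unknown): void {
  try {
    writeFileSync(
      `/tmp/processor-panic-${Date.now()}.json`,
      JSON.stringify(
        {
          kind,
          error: err instanceof Error ? { message: err.message, stack: err.stack } : String(err),
          devices: Object.fromEntries(deviceState),
        },
        null, 2,
      ),
    );
  } catch {
    // Never let the dump itself block the exit.
  }
}

process.on('uncaughtException', (err) => {
  console.error('uncaughtException', err);
  panicDump('uncaughtException', err);
  process.exit(1); // non-zero so Portainer restarts the container
});

process.on('unhandledRejection', (reason) => {
  console.error('unhandledRejection', reason);
  panicDump('unhandledRejection', reason);
  process.exit(1);
});
```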
Why this is a separate phase
Phase 1 + Phase 2 produce a service that works. Phase 3 is what you do before you stop watching it. None of these tasks change correctness — they change operational ergonomics.
Resume triggers
Each Phase 3 task has its own resume trigger. The whole phase doesn't have to land at once:
- 3.1, 3.5, 3.6 before adding a second Processor instance (rolling deploys become safe).
- 3.2 before any Phase 2 task that depends on hot state (geofence membership) — without rehydration, a restart would forget which geofence each device is in until the device crosses a boundary again.
- 3.3, 3.4 before the pilot is "always-on" (operators need a way to handle stuck/poison records without touching production).
- 3.7 can land alongside whichever of the above ships first and gets updated over time.
- 3.8 before the second instance is added.