ROADMAP.md establishes status legend, architectural anchors pointing at the wiki, and seven non-negotiable design rules — most importantly the core/domain boundary that protects Phase 1 from Phase 2 churn, the schema-authority split (positions hypertable owned here; everything else owned by Directus), and idempotent-writes via (device_id, ts) ON CONFLICT. Phase 1 (throughput pipeline) is fully detailed across 11 task files: scaffold, core types + sentinel decoder, config + logging, Postgres hypertable, Redis Stream consumer, per-device LRU state, batched writer, main wiring, observability, integration test, Dockerfile + Gitea CI. Observability is in Phase 1 (not deferred) — lesson learned from tcp-ingestion task 1.10. Phases 2-4 are stub READMEs. Phase 2 (domain logic) blocks on Directus schema decisions and lists those open questions explicitly. Phase 3 (production hardening) and Phase 4 (future) sketch the task shape.
Phase 3 — Production hardening
Status: ⬜ Not started
The set of operational features that turn a working pilot into something safe to leave running unattended through deploys, instance failures, and bad data.
Outcome statement
When Phase 3 is done:
- Graceful shutdown with bounded in-flight drain: SIGTERM blocks new reads, awaits in-flight writes, ACKs anything still in PEL whose write succeeded, exits clean.
- State rehydration on restart: on first packet for an unknown device, the Processor queries Postgres for the device's last_position and seeds DeviceState accordingly. Phase 2 accumulators get the same treatment (e.g. last geofence membership comes from the last timing_records row).
- XAUTOCLAIM for stuck pending entries: at startup and on a cadence, the Processor claims entries that have been pending in another consumer's PEL for longer than CLAIM_THRESHOLD_MS. Lets a dead instance's work get picked up by survivors without manual intervention.
- Dead-letter stream for poison records: records that fail to decode N times go to telemetry:t:dlq with the original payload + the error. Operators can inspect, fix, replay.
- Multi-instance load split verified: spinning up two Processor instances against the same consumer group splits the work evenly. End-to-end test in CI (or at least a manual playbook).
- Migration safety with multiple instances: Postgres advisory locks around the migration runner so two instances starting simultaneously don't race.
- Uncaught exception / unhandled rejection handlers: log, flush in-memory state to a panic dump file, exit with a code Portainer treats as restart-worthy.
- OPERATIONS.md runbook: exact commands for "claim stuck entries from a dead instance," "drain the DLQ," "force-rehydrate a single device," "view consumer lag," etc.
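The bounded drain in the first outcome could take roughly the following shape. This is a minimal sketch, not the Processor's actual code: `InFlightTracker`, `track`, `drain`, and `DrainResult` are hypothetical names, and the ACK step for PEL entries whose write succeeded is deliberately left out.

```typescript
// Hypothetical sketch: all names are illustrative, not taken from the codebase.
type DrainResult = "drained" | "timed_out";

class InFlightTracker {
  private readonly pending = new Set<Promise<unknown>>();
  private accepting = true;

  /** Register an in-flight write. Returns false once shutdown has begun. */
  track(write: Promise<unknown>): boolean {
    if (!this.accepting) return false;
    this.pending.add(write);
    // Forget the write once it settles, whether it succeeded or failed.
    const forget = (): void => {
      this.pending.delete(write);
    };
    write.then(forget, forget);
    return true;
  }

  /** Stop accepting new work, then wait for in-flight writes up to budgetMs. */
  async drain(budgetMs: number): Promise<DrainResult> {
    this.accepting = false;
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<DrainResult>((resolve) => {
      timer = setTimeout(() => resolve("timed_out"), budgetMs);
    });
    // Swallow individual failures: the drain only cares that writes settled.
    const settled = Promise.all(
      Array.from(this.pending).map((p) =>
        p.then(
          () => undefined,
          () => undefined,
        ),
      ),
    ).then((): DrainResult => "drained");
    try {
      return await Promise.race([settled, timeout]);
    } finally {
      if (timer !== undefined) clearTimeout(timer);
    }
  }
}
```

A SIGTERM handler would presumably call `drain` with the configured budget, ACK whatever succeeded, and then exit. A `"timed_out"` result leaves unfinished entries in the PEL, where they can be claimed later.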
Tasks (sketched, not detailed)
| # | Task | Notes |
|---|---|---|
| 3.1 | Graceful shutdown — full | Replaces the Phase 1 stub. Drain budget configurable. Tested end-to-end |
| 3.2 | Per-device state rehydration on first-packet | Single SELECT ... LIMIT 1 per cold device. Memoized by LRU |
| 3.3 | XAUTOCLAIM runner | Periodic + on-startup. Claims entries pending > CLAIM_THRESHOLD_MS. Re-runs the sink |
| 3.4 | Dead-letter stream | After N failed decodes/writes, record goes to telemetry:t:dlq; original ACKed off the main stream |
| 3.5 | Migration advisory lock | pg_advisory_lock(<hash>) around the migrate runner; two instances can start simultaneously |
| 3.6 | Uncaught exception / unhandled rejection handlers | Log, flush, exit. Match tcp-ingestion's eventual Phase 1 task 1.12 work when that lands |
| 3.7 | OPERATIONS.md | The runbook |
| 3.8 | Multi-instance load test | A test (manual or in CI) that proves two instances split the work; document expected lag behaviour during failover |
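As an illustration of task 3.4, the decision between retrying a record and dead-lettering it could be sketched as below. All names (`PoisonTracker`, `StreamEntry`, `DeadLetter`, `Outcome`) are hypothetical, and `maxAttempts` stands in for the "N" that the roadmap leaves unspecified. The sketch keeps failure counts in memory, which is an assumption and not a stated design choice.

```typescript
// Hypothetical sketch: every name here is illustrative, and maxAttempts
// stands in for the "N" that the roadmap does not pin down.
interface StreamEntry {
  id: string; // Redis Stream entry ID
  payload: string; // raw record as read from the stream
}

interface DeadLetter {
  originalId: string;
  payload: string;
  error: string;
  attempts: number;
}

type Outcome =
  | { kind: "retry"; attempts: number }
  | { kind: "dead_letter"; record: DeadLetter };

class PoisonTracker {
  private readonly failures = new Map<string, number>();
  private readonly maxAttempts: number;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  /** Count a failed decode/write and decide what to do with the entry. */
  recordFailure(entry: StreamEntry, err: Error): Outcome {
    const attempts = (this.failures.get(entry.id) ?? 0) + 1;
    if (attempts < this.maxAttempts) {
      this.failures.set(entry.id, attempts);
      return { kind: "retry", attempts };
    }
    // Budget exhausted: hand back a record carrying the payload and the error.
    this.failures.delete(entry.id);
    return {
      kind: "dead_letter",
      record: {
        originalId: entry.id,
        payload: entry.payload,
        error: err.message,
        attempts,
      },
    };
  }

  /** Clear the counter once an entry is finally processed. */
  recordSuccess(entryId: string): void {
    this.failures.delete(entryId);
  }
}
```

On a `dead_letter` outcome the caller would write the record to telemetry:t:dlq and then ACK the original. Doing the write before the ACK means a crash between the two steps leaves a duplicate in the DLQ but does not lose the record. In-memory counters are lost on restart, so the delivery count Redis reports for pending entries may be a more durable source for the attempt number.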
Why this is a separate phase
Phase 1 + Phase 2 produce a service that works. Phase 3 is what you do before you stop watching it. None of these tasks change correctness — they change operational ergonomics.
Resume triggers
Each Phase 3 task has its own resume trigger. The whole phase doesn't have to land at once:
- 3.1, 3.5, 3.6 before adding a second Processor instance (rolling deploys become safe).
- 3.2 before any Phase 2 task that depends on hot state (geofence membership) — without rehydration, a restart would forget which geofence each device is in until the device crosses a boundary again.
- 3.3, 3.4 before the pilot is "always-on" (operators need a way to handle stuck/poison records without touching production).
- 3.7 can land alongside whichever of the above ships first, and is updated as later tasks ship.
- 3.8 before the second instance is added.
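The rehydration that the 3.2 trigger depends on could be sketched as a cache that loads state once per cold device. This is a rough illustration under stated assumptions: `RehydratingStateCache`, `Loader`, and the `Position` fields are hypothetical, and the loader stands in for whatever single-row query the Processor ends up running.

```typescript
// Hypothetical sketch: type and class names are illustrative. The loader is
// assumed to wrap a query along the lines of
//   SELECT ... FROM positions WHERE device_id = $1 ORDER BY ts DESC LIMIT 1
interface Position {
  deviceId: string;
  ts: number;
  lat: number;
  lon: number;
}

interface DeviceState {
  lastPosition: Position | null;
}

type Loader = (deviceId: string) => Promise<Position | null>;

class RehydratingStateCache {
  // Map preserves insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, Promise<DeviceState>>();
  private readonly load: Loader;
  private readonly capacity: number;

  constructor(load: Loader, capacity: number) {
    this.load = load;
    this.capacity = capacity;
  }

  get(deviceId: string): Promise<DeviceState> {
    const hit = this.entries.get(deviceId);
    if (hit !== undefined) {
      // Re-insert to mark the device as most recently used.
      this.entries.delete(deviceId);
      this.entries.set(deviceId, hit);
      return hit;
    }
    // Cold device: cache the promise itself so concurrent packets share one query.
    const state = this.load(deviceId).then(
      (lastPosition): DeviceState => ({ lastPosition }),
    );
    // Do not memoize a failed load; the next packet should try again.
    state.catch(() => {
      if (this.entries.get(deviceId) === state) this.entries.delete(deviceId);
    });
    this.entries.set(deviceId, state);
    if (this.entries.size > this.capacity) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    return state;
  }
}
```

Caching the promise, not the resolved value, means a burst of packets from a cold device triggers a single query. A device evicted from the LRU is treated as cold again on its next packet.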