# Phase 3 — Production hardening

> **Context for task 3.5 (retire the processor migration runner).** Replaces the original "migration advisory lock" sketch: once the processor doesn't run DDL, the lock concern delegates to Directus's db-init runner. Context: the positions hypertable + faulty column DDL currently exists in both the processor (`src/db/migrations/0001` + `0002`) and Directus (`db-init/001/002/003`). Two sources of truth for the same schema is a known hazard — adding a column means editing two files in two repos, and silent drift between them is invisible until runtime. Fix: Directus becomes the sole DDL owner; the processor's migration runner is retired, leaving only INSERT/SELECT/UPDATE. The task spec covers:
>
> - Pre-flight diff between processor migrations and Directus db-init (must be byte- or semantically equivalent before deletion)
> - File-by-file deletion list
> - Test-infra migration (the integration test moves to fixture-based schema setup, matching the established Phase 1.5 task 1.5.6 pattern)
> - Wiki + ROADMAP updates
> - `compose.yaml`: `depends_on` `directus: service_healthy`
> - Operational notes (the existing `migrations_applied` table is left in place)
>
> Sequencing: ideally lands AFTER Phase 1.5 ships, so the agent shipping the WS endpoint isn't pulled into a side quest mid-flight.

**Status:** ⬜ Not started

The set of operational features that turn a working pilot into something safe to leave running unattended through deploys, instance failures, and bad data.

## Outcome statement

When Phase 3 is done:

- **Graceful shutdown** with bounded in-flight drain: SIGTERM blocks new reads, awaits in-flight writes, ACKs anything still in the PEL whose write succeeded, exits clean.
- **State rehydration on restart**: on the first packet for an unknown device, the Processor queries Postgres for the device's `last_position` and seeds `DeviceState` accordingly. Phase 2 accumulators get the same treatment (e.g. last geofence membership comes from the last `timing_records` row).
- **`XAUTOCLAIM` for stuck pending entries**: at startup and on a cadence, the Processor claims entries that have been pending in another consumer's PEL for longer than `CLAIM_THRESHOLD_MS`. Lets a dead instance's work get picked up by survivors without manual intervention.
- **Dead-letter stream for poison records**: records that fail to decode N times go to `telemetry:teltonika:dlq` with the original payload + the error. Operators can inspect, fix, replay.
- **Multi-instance load split verified**: spinning up two Processor instances against the same consumer group splits the work evenly. End-to-end test in CI (or at least a manual playbook).
- **Migration safety with multiple instances**: superseded by task 3.5. With the processor's migration runner retired and Directus's `db-init` as the sole DDL owner, two instances starting simultaneously have no DDL to race on. (The original sketch: Postgres advisory locks around the migration runner.)
- **Uncaught exception / unhandled rejection handlers**: log, flush in-memory state to a panic dump file, exit with a code Portainer treats as restart-worthy.
- **`OPERATIONS.md` runbook**: exact commands for "claim stuck entries from a dead instance," "drain the DLQ," "force-rehydrate a single device," "view consumer lag," etc.
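The bounded-drain step of graceful shutdown can be sketched as follows. This is a minimal illustration, not the service's implementation: `InFlightTracker` and `drain` are assumed names, and the real handler would also stop the `XREADGROUP` loop and ACK successfully written entries before exiting.

```typescript
// Illustrative sketch only: InFlightTracker and drain() are assumed names,
// not the service's real API.

export class InFlightTracker {
  private inflight = new Set<Promise<unknown>>();

  // Wrap a write promise so the tracker knows when it settles.
  track<T>(p: Promise<T>): Promise<T> {
    this.inflight.add(p);
    p.finally(() => this.inflight.delete(p)).catch(() => {
      // Rejections are the caller's problem; this chain only does bookkeeping.
    });
    return p;
  }

  get size(): number {
    return this.inflight.size;
  }

  // Resolve true if every in-flight write settled within the budget,
  // false if the budget expired first.
  async drain(budgetMs: number): Promise<boolean> {
    const all = Promise.allSettled([...this.inflight]).then(() => true);
    const timeout = new Promise<boolean>((resolve) => {
      const timer = setTimeout(() => resolve(false), budgetMs);
      // Don't let the timer keep the process alive after a clean drain.
      void all.then(() => clearTimeout(timer));
    });
    return Promise.race([all, timeout]);
  }
}
```

On SIGTERM the handler would stop reads, `await tracker.drain(budget)`, and derive the exit code from the boolean.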
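The rehydration bullet above is essentially a memoized single-row lookup. A sketch under assumed names (`makeRehydrator`, `Position`; `queryLastPosition` stands in for the real `SELECT ... LIMIT 1` against Postgres, and the real code would seed `DeviceState` with the result):

```typescript
// Hedged sketch: names are illustrative, not the service's real API.
type Position = { lat: number; lon: number; at: number };
type Loader = (deviceId: string) => Promise<Position | null>;

export function makeRehydrator(queryLastPosition: Loader, maxEntries = 1024) {
  // A Map preserves insertion order, so delete + re-set on access gives a
  // minimal LRU without extra bookkeeping. Caching the *promise* also dedups
  // concurrent first packets from the same device.
  const cache = new Map<string, Promise<Position | null>>();

  return function lastKnown(deviceId: string): Promise<Position | null> {
    const hit = cache.get(deviceId);
    if (hit) {
      cache.delete(deviceId); // refresh recency
      cache.set(deviceId, hit);
      return hit;
    }
    const loading = queryLastPosition(deviceId);
    cache.set(deviceId, loading);
    if (cache.size > maxEntries) {
      // Evict the least-recently-used entry (first key in the Map).
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }
    return loading;
  };
}
```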
## Tasks (sketched, not detailed)
| # | Task | Notes |
|---|------|-------|
| 3.1 | Graceful shutdown — full | Replaces the Phase 1 stub. Drain budget configurable. Tested end-to-end |
| 3.2 | Per-device state rehydration on first packet | Single `SELECT ... LIMIT 1` per cold device. Memoized by LRU |
| 3.3 | `XAUTOCLAIM` runner | Periodic + on-startup. Claims entries pending > `CLAIM_THRESHOLD_MS`. Re-runs the sink |
| 3.4 | Dead-letter stream | After N failed decodes/writes, the record goes to `telemetry:teltonika:dlq`; the original is ACKed off the main stream |
| 3.5 | [Retire processor migration runner](./05-retire-migration-runner.md) | Delete `src/db/migrations/*` and the runner; Directus becomes the sole DDL owner via its `db-init/`. Closes the two-sources-of-truth hazard for `positions`. Replaces the original "migration advisory lock" sketch — once the processor doesn't run migrations, the lock concern delegates to Directus |
| 3.6 | Uncaught exception / unhandled rejection handlers | Log, flush, exit. Match `tcp-ingestion`'s eventual Phase 1 task 1.12 work when that lands |
| 3.7 | `OPERATIONS.md` | The runbook |
| 3.8 | Multi-instance load test | A test (manual or in CI) that proves two instances split the work; document expected lag behaviour during failover |
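Task 3.3's claim loop can be sketched against an abstract client. `StreamClient`, `claimStuckEntries`, and the reply shape are illustrative stand-ins for a real Redis client's `XAUTOCLAIM` binding; the cursor semantics (start at `0-0`, stop when the server hands back `0-0`) follow the command's documented behaviour:

```typescript
// Hedged sketch of the XAUTOCLAIM runner; names are illustrative.
type Entry = { id: string; fields: Record<string, string> };

interface StreamClient {
  // Mirrors XAUTOCLAIM key group consumer min-idle-time cursor COUNT n,
  // abstracted to (next cursor + claimed entries).
  xautoclaim(
    key: string, group: string, consumer: string,
    minIdleMs: number, cursor: string, count: number,
  ): Promise<{ nextCursor: string; entries: Entry[] }>;
}

// Walk the PEL once, claiming everything idle longer than the threshold and
// handing each claimed entry to the normal sink. Returns how many were claimed.
export async function claimStuckEntries(
  client: StreamClient,
  opts: { key: string; group: string; consumer: string; claimThresholdMs: number },
  sink: (e: Entry) => Promise<void>,
): Promise<number> {
  let cursor = "0-0";
  let claimed = 0;
  do {
    const res = await client.xautoclaim(
      opts.key, opts.group, opts.consumer,
      opts.claimThresholdMs, cursor, 100,
    );
    for (const entry of res.entries) {
      await sink(entry); // re-run the write path; sink ACKs on success
      claimed++;
    }
    cursor = res.nextCursor;
  } while (cursor !== "0-0"); // "0-0" means the scan wrapped around
  return claimed;
}
```

The same function serves both the on-startup pass and the periodic cadence; only the caller's scheduling differs.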
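Task 3.4's poison-record policy is mostly bookkeeping: count failed decodes per stream entry, and after N strikes hand the record off. A sketch under assumed names (`makePoisonGate` and its callbacks are hypothetical; in the real service `dead` would `XADD` to `telemetry:teltonika:dlq` and `XACK` the original off the main stream):

```typescript
// Hedged sketch of the poison-record gate; names are illustrative.
export function makePoisonGate(maxAttempts: number) {
  const failures = new Map<string, number>(); // stream-entry id -> strike count

  return async function handle(
    id: string,
    payload: Uint8Array,
    decode: (raw: Uint8Array) => unknown,
    dead: (id: string, raw: Uint8Array, err: string) => Promise<void>,
  ): Promise<"ok" | "retry" | "dead"> {
    try {
      decode(payload);
      failures.delete(id); // a success clears the strike count
      return "ok";
    } catch (err) {
      const n = (failures.get(id) ?? 0) + 1;
      if (n < maxAttempts) {
        failures.set(id, n); // leave in the PEL; it will be re-delivered
        return "retry";
      }
      failures.delete(id);
      await dead(id, payload, String(err)); // DLQ with payload + error, then ACK
      return "dead";
    }
  };
}
```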
## Why this is a separate phase
Phase 1 + Phase 2 produce a service that *works*. Phase 3 is what you do *before you stop watching it*. None of these tasks change correctness — they change operational ergonomics.
## Resume triggers
Each Phase 3 task has its own resume trigger. The whole phase doesn't have to land at once:
- **3.1, 3.5, 3.6** before adding a second Processor instance (rolling deploys become safe).
- **3.2** before any Phase 2 task that depends on hot state (geofence membership) — without rehydration, a restart would forget which geofence each device is in until the device crosses a boundary again.
- **3.3, 3.4** before the pilot is "always-on" (operators need a way to handle stuck/poison records without touching production).
- **3.7** can land alongside whichever of the above ships first, then accrete updates over time.
- **3.8** before the second instance is added.