Phase 3 — Production hardening

Status: Not started

The set of operational features that turn a working pilot into something safe to leave running unattended through deploys, instance failures, and bad data.

Outcome statement

When Phase 3 is done:

  • Graceful shutdown with bounded in-flight drain: SIGTERM blocks new reads, awaits in-flight writes, ACKs anything still in the PEL whose write succeeded, exits clean (see the sketch after this list).
  • State rehydration on restart: on the first packet for an unknown device, the Processor queries Postgres for the device's last_position and seeds DeviceState accordingly (also sketched after this list). Phase 2 accumulators get the same treatment (e.g. last geofence membership comes from the last timing_records row).
  • XAUTOCLAIM for stuck pending entries: at startup and on a cadence, the Processor claims entries that have been pending in another consumer's PEL for longer than CLAIM_THRESHOLD_MS. Lets a dead instance's work get picked up by survivors without manual intervention.
  • Dead-letter stream for poison records: records that fail to decode N times go to telemetry:teltonika:dlq with the original payload + the error. Operators can inspect, fix, replay.
  • Multi-instance load split verified: spinning up two Processor instances against the same consumer group splits the work evenly. End-to-end test in CI (or at least a manual playbook).
  • Migration safety with multiple instances: the Processor's migration runner is retired (task 3.5) and Directus's db-init becomes the sole DDL owner, so two instances starting simultaneously have no DDL to race on.
  • Uncaught exception / unhandled rejection handlers: log, flush in-memory state to a panic dump file, exit with a code Portainer treats as restart-worthy.
  • OPERATIONS.md runbook: exact commands for "claim stuck entries from a dead instance," "drain the DLQ," "force-rehydrate a single device," "view consumer lag," etc.
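
Two of these bullets are worth sketching now. First, the shutdown drain sequence. This is a minimal sketch, assuming an ioredis client; DRAIN_BUDGET_MS, inFlight, and accepting are illustrative names, not the processor's real identifiers.

```ts
import type Redis from "ioredis";

const DRAIN_BUDGET_MS = Number(process.env.DRAIN_BUDGET_MS ?? 10_000);

export let accepting = true;               // read loop checks this before each XREADGROUP
const inFlight = new Set<Promise<void>>(); // one entry per batch currently being written

export function registerShutdown(redis: Redis, stream: string, group: string) {
  process.once("SIGTERM", async () => {
    accepting = false; // 1. block new reads

    // 2. await in-flight writes, but never longer than the drain budget
    const drain = Promise.allSettled([...inFlight]);
    const budget = new Promise((resolve) => setTimeout(resolve, DRAIN_BUDGET_MS));
    await Promise.race([drain, budget]);

    // 3. ACK the PEL entries whose writes committed; anything else stays
    //    pending for XAUTOCLAIM (task 3.3) to claim later
    // await redis.xack(stream, group, ...committedIds);

    await redis.quit();
    process.exit(0); // 4. clean exit, so the restart policy doesn't count it as a crash
  });
}
```

The drain budget needs to fit inside the container runtime's stop grace period; anything still running when SIGKILL arrives is lost from memory and falls back to the PEL.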

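Rehydration is similarly small: one indexed lookup per cold device, memoized so repeat packets never touch Postgres. The positions table and column names below are assumptions about the schema, and DeviceState stands in for the processor's own type.

```ts
import { Pool } from "pg";

interface DeviceState {
  lastPosition: { lat: number; lon: number; recordedAt: Date } | null;
}

const states = new Map<string, DeviceState>(); // swap for a real LRU to bound memory

export async function getDeviceState(pool: Pool, deviceId: string): Promise<DeviceState> {
  const cached = states.get(deviceId);
  if (cached) return cached;

  // Single SELECT ... LIMIT 1 per cold device
  const { rows } = await pool.query(
    `SELECT latitude, longitude, recorded_at
       FROM positions
      WHERE device_id = $1
      ORDER BY recorded_at DESC
      LIMIT 1`,
    [deviceId],
  );

  const state: DeviceState = rows.length
    ? {
        lastPosition: {
          lat: rows[0].latitude,
          lon: rows[0].longitude,
          recordedAt: rows[0].recorded_at,
        },
      }
    : { lastPosition: null };

  states.set(deviceId, state);
  return state;
}
```
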
Tasks (sketched, not detailed)

3.1  Graceful shutdown — full
     Replaces the Phase 1 stub. Drain budget configurable. Tested end-to-end

3.2  Per-device state rehydration on first-packet
     Single SELECT ... LIMIT 1 per cold device. Memoized by LRU

3.3  XAUTOCLAIM runner
     Periodic + on-startup. Claims entries pending > CLAIM_THRESHOLD_MS. Re-runs the sink

3.4  Dead-letter stream
     After N failed decodes/writes, record goes to telemetry:teltonika:dlq; original ACKed off the main stream

3.5  Retire processor migration runner
     Delete src/db/migrations/* and the runner; Directus becomes the sole DDL owner via its db-init/. Closes the two-sources-of-truth hazard for positions. Replaces the original "migration advisory lock" sketch — once processor doesn't run migrations, the lock concern delegates to Directus.

3.6  Uncaught exception / unhandled rejection handlers
     Log, flush, exit. Match tcp-ingestion's eventual Phase 1 task 1.12 work when that lands

3.7  OPERATIONS.md
     The runbook

3.8  Multi-instance load test
     A test (manual or in CI) that proves two instances split the work; document expected lag behaviour during failover
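
Two of these also lend themselves to short sketches. First, the XAUTOCLAIM runner from task 3.3, written against ioredis's generic call() so it doesn't assume a typed helper; the stream, group, and consumer names are placeholders, not the processor's real ones.

```ts
import Redis from "ioredis";

const CLAIM_THRESHOLD_MS = Number(process.env.CLAIM_THRESHOLD_MS ?? 60_000);

// Claims entries that have been pending in another consumer's PEL for longer
// than the threshold and re-runs the sink on them; the sink ACKs on success.
export async function claimStuckEntries(
  redis: Redis,
  sink: (id: string, fields: string[]) => Promise<void>,
  stream = "telemetry:teltonika",
  group = "processor",
  consumer = "processor-1",
): Promise<void> {
  let cursor = "0-0";
  do {
    const [next, entries] = (await redis.call(
      "XAUTOCLAIM", stream, group, consumer,
      String(CLAIM_THRESHOLD_MS), cursor, "COUNT", "100",
    )) as [string, Array<[string, string[]] | null>];

    for (const entry of entries) {
      if (!entry) continue;           // some Redis versions return nil for trimmed entries
      await sink(entry[0], entry[1]); // same path as the normal read loop
    }
    cursor = next;                    // "0-0" signals the PEL scan is complete
  } while (cursor !== "0-0");
}

// Call once at startup, then on a cadence:
//   setInterval(() => claimStuckEntries(redis, sink).catch(console.error), CLAIM_THRESHOLD_MS);
```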

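The dead-letter move from task 3.4 is the companion piece: copy the poison entry plus the error onto the DLQ stream, then ACK it off the main stream so it stops clogging the PEL. Again a sketch, assuming ioredis; MAX_ATTEMPTS and the DLQ field names are illustrative.

```ts
import Redis from "ioredis";

const MAX_ATTEMPTS = 5; // the "N" in the task note; tune per deployment

// Moves a record that repeatedly failed to decode/write onto the DLQ stream,
// preserving the original payload and the last error, then ACKs it off the
// main stream so operators can inspect, fix, and replay it later.
export async function deadLetter(
  redis: Redis,
  id: string,
  fields: string[], // flat [field, value, field, value, ...] as read from the stream
  error: Error,
  attempts: number,
  stream = "telemetry:teltonika",
  group = "processor",
): Promise<boolean> {
  if (attempts < MAX_ATTEMPTS) return false; // not poison yet; leave it to the retry path

  await redis.xadd(
    "telemetry:teltonika:dlq", "*",
    "origin_id", id,
    "error", error.message,
    "attempts", String(attempts),
    ...fields, // original payload, untouched
  );
  await redis.xack(stream, group, id); // off the main PEL for good
  return true;
}
```

Replay stays an operator action: read the DLQ entry, fix the payload or the decoder, and XADD it back onto the main stream.
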
Why this is a separate phase

Phase 1 + Phase 2 produce a service that works. Phase 3 is what you do before you stop watching it. None of these tasks change correctness — they change operational ergonomics.

Resume triggers

Each Phase 3 task has its own resume trigger. The whole phase doesn't have to land at once:

  • 3.1, 3.5, 3.6 before adding a second Processor instance (rolling deploys become safe).
  • 3.2 before any Phase 2 task that depends on hot state (geofence membership) — without rehydration, a restart would forget which geofence each device is in until the device crosses a boundary again.
  • 3.3, 3.4 before the pilot is "always-on" (operators need a way to handle stuck/poison records without touching production).
  • 3.7 can land alongside whichever of the above ships first and be updated over time.
  • 3.8 before the second instance is added.