diff --git a/.planning/phase-3-hardening/05-retire-migration-runner.md b/.planning/phase-3-hardening/05-retire-migration-runner.md new file mode 100644 index 0000000..9615ae5 --- /dev/null +++ b/.planning/phase-3-hardening/05-retire-migration-runner.md @@ -0,0 +1,169 @@ +# Task 3.5 — Retire processor migration runner + +**Phase:** 3 — Production hardening +**Status:** ⬜ Not started +**Depends on:** Phase 1.5 ideally landed (avoid mid-flight code churn for the agent shipping the WS endpoint). No hard code dependency. +**Replaces:** the original 3.5 sketch ("Migration advisory lock"). Once the processor doesn't run migrations, the lock concern is delegated to Directus's `db-init` runner — outside this service's surface. +**Wiki refs:** `docs/wiki/entities/processor.md` §"Schema ownership vs. write access" (the line that needs to change), `docs/wiki/entities/directus.md` §"Schema management — snapshot/apply pipeline", `docs/wiki/entities/postgres-timescaledb.md` + +## Goal + +Establish [[directus]] as the **single owner of all DDL** against the shared Postgres database. Retire the processor's migration runner. After this task, the only DDL paths are: + +- `trm/directus/db-init/*.sql` (pre-schema: extensions, hypertables, raw tables Directus's snapshot-yaml format can't express). +- `trm/directus/snapshots/schema.yaml` (Directus-managed user collections). +- `trm/directus/db-init-post/*.sql` (post-schema: composite UNIQUE constraints on Directus-managed tables). + +Processor exclusively does `INSERT` / `SELECT` / `UPDATE`. No `CREATE`, `ALTER`, `CREATE EXTENSION`, or any other DDL. + +## Context — why this exists + +The current state has **both** services creating the `positions` hypertable and the `faulty` column: + +- `trm/processor/src/db/migrations/0001_positions.sql` and `0002_positions_faulty.sql` (processor's runner, from Phase 1 task 1.4 + the recent 1.5.5 prep). +- `trm/directus/db-init/001_extensions.sql`, `002_positions_hypertable.sql`, `003_faulty_column.sql` (directus's runner, added later when the destructive-apply incident showed positions had to exist *before* `directus schema apply` runs or it would get wiped). + +Both runners are idempotent (`IF NOT EXISTS`, etc.) so the runtime collision is benign at the moment, but the architectural risks are real: + +- **Two sources of truth.** Adding a column means editing two files in two repos; either one can drift silently. +- **Schema divergence.** A processor migration that adds a column the directus side doesn't know about is silently invisible to the admin UI. +- **Two `migrations_applied` tables**, which already caused the ghost-collection apply conflict earlier in Phase 1 of directus. +- **Operator confusion.** The wiki says "Directus owns the schema" but processor runs migrations — newcomers can't tell which is canonical. + +The fix is the wiki's stated intent: directus owns DDL. Processor was the historical owner before directus's `db-init` story matured; the legacy runner survived the transition because nobody retired it. + +## Pre-flight (before deleting anything) + +### 1. Confirm directus's `db-init/` covers the full processor schema surface + +Check that `trm/directus/db-init/`'s SQL is byte-equivalent (or semantically equivalent) to processor's migrations. As of writing, directus has: + +- `001_extensions.sql` — `CREATE EXTENSION IF NOT EXISTS timescaledb` + `postgis`. +- `002_positions_hypertable.sql` — `CREATE TABLE positions (...)` + `create_hypertable(...)` + indexes. +- `003_faulty_column.sql` — `ALTER TABLE positions ADD COLUMN IF NOT EXISTS faulty ...` + `positions_device_ts_idx`. + +Processor has: + +- `src/db/migrations/0001_positions.sql` — extensions + table + hypertable + `positions_device_ts` (UNIQUE on `(device_id, ts)`) + `positions_ts` (DESC). +- `src/db/migrations/0002_positions_faulty.sql` — `faulty` column + `positions_device_ts_idx` (`(device_id, ts DESC)`). + +**Diff the two before retiring.** If processor's SQL has an index, column, or constraint directus's `db-init/` doesn't, the deliverable starts with porting that diff into directus's `db-init/` (and snapshotting if applicable). Specific things to verify: + +- All four indexes exist in directus's db-init: `positions_device_ts` (UNIQUE), `positions_ts`, `positions_device_ts_idx`. +- Column types match exactly: `device_id text`, `ts timestamptz`, `ingested_at timestamptz DEFAULT now()`, etc. +- `chunk_time_interval` is `INTERVAL '1 day'` on both sides. +- The `ON CONFLICT (device_id, ts) DO NOTHING` upsert path requires the UNIQUE on `(device_id, ts)` — that's the `positions_device_ts` index, not `positions_device_ts_idx`. Both must exist. + +### 2. Verify the directus `db-init` apply order is fixed + +Per `docs/wiki/entities/directus.md`'s 5-step boot pipeline: + +``` +1. db-init pre-schema → positions hypertable, faulty column, timescaledb extension +2. directus bootstrap → Directus system tables + first admin +3. directus schema apply → Directus-managed user collections +4. db-init post-schema → composite UNIQUE constraints on user collections +5. pm2-runtime start → server up at :8055 +``` + +So when processor boots against a stage stack: + +- `directus` container has run steps 1–4 (positions exists; everything else exists). +- `processor` container can connect and `INSERT` immediately. + +**Compose ordering.** `trm/deploy/compose.yaml`'s `processor` service should `depends_on: directus` with `condition: service_healthy` so processor doesn't try to read `positions` before directus's db-init has run on first deploy. Verify in this task. + +## Deliverables + +### `trm/processor/` + +1. **Delete** `src/db/migrations/0001_positions.sql`. +2. **Delete** `src/db/migrations/0002_positions_faulty.sql`. +3. **Delete** the migrations directory if it's now empty. +4. **Delete** `src/db/migrate.ts` (or whatever the migration-runner module is named — the file that owns the `migrations_applied` table, the file walker, the `pg_advisory_lock` if any). +5. **Update `src/main.ts`** to remove the `await migrate(...)` step from boot. Postgres pool creation stays; migration call goes. +6. **Update tests** that exercise the migration runner — most likely deletes the corresponding test file. Integration tests that previously seeded the schema via `migrate()` either: + - (a) Use directus's `db-init/*.sql` files directly (read them in `beforeAll`, execute against the testcontainer Postgres), or + - (b) Carry a fixture SQL file in `test/fixtures/` (the same approach Phase 1.5 task 1.5.6 already takes for its integration test). + Pick (b) — it's already the established pattern. +7. **Update `Dockerfile`** to drop any migration-running step from the entrypoint (Phase 1's Dockerfile may not have this, but verify; the runtime container shouldn't carry the migrations directory if the runner is gone). +8. **Update `package.json` `dependencies`** — if `pg-migrate` or any migration-runner library was a Phase-1-only dep, remove it. +9. **Update `phase-1-throughput/04-postgres-schema.md`'s Done section** with a note: "Migration runner retired in Phase 3 task 3.5 — see that task for context." +10. **Update `ROADMAP.md`** to reflect the retired runner under Phase 1's "what changed since landing" note. + +### `trm/docs/wiki/` + +1. **Update `wiki/entities/processor.md`** — drop the "Schema ownership vs. write access" caveat that says "the positions hypertable is owned by [[processor]]'s migration runner." Replace with a single sentence: "Processor never runs DDL. Schema is exclusively owned by Directus (snapshot.yaml + `db-init/` for things the snapshot can't express)." +2. **Update `wiki/entities/directus.md`** — confirm the Schema-management section already lists `db-init/` as covering everything (no edits unless current text implies a split). +3. **Update `wiki/entities/postgres-timescaledb.md`** — verify the writer-side documentation; remove any "split schema authority" framing. +4. **Append `docs/log.md`** with a `note` entry recording the retirement. + +### `trm/deploy/` + +1. **Verify** `compose.yaml`'s `processor` service has `depends_on: directus: { condition: service_healthy }`. Add if missing. +2. **Confirm** the deploy README doesn't mention the processor's migration runner anywhere. + +## Specification + +### What stays in `src/db/` + +- `pool.ts` — Postgres `Pool` factory. Untouched. +- Connection helpers, query helpers (if any). Untouched. + +### What goes + +- `migrations/*.sql` — gone. +- `migrate.ts` (the runner). Gone. +- `migrations_applied` table — directus's runner has its own; the processor's becomes orphaned but harmless. **Don't drop it from existing databases.** The retirement is a *runtime* change; the table is just unused. Phase 3 hardening's eventual `OPERATIONS.md` (task 3.7) can document a one-off `DROP TABLE migrations_applied` step for operators who want a clean schema. + +### Boot order on first deploy + +``` +1. postgres container starts → DB available. +2. directus container starts → runs 5-step boot pipeline. + ├─ Step 1 (db-init pre-schema) creates positions hypertable + faulty column + extensions. + ├─ Steps 2-4 set up Directus's own world. + └─ Step 5 marks the container healthy. +3. processor container starts (depends_on: directus: service_healthy) → connects, finds positions, starts consuming. +``` + +If processor races directus on a fresh stack (no `depends_on`), it'll fail to find the `positions` table and crash-loop until directus catches up. `depends_on: service_healthy` makes the order deterministic. + +### Dev workflow + +`compose.dev.yaml` in `trm/processor` (if it exists for processor-side dev) should `depends_on: directus` if running both. For pure-processor dev (no directus), the developer either: + +- Runs directus's `db-init/*.sql` manually against their local Postgres before booting processor. +- Or copies the equivalent SQL into a one-off bootstrap script in `processor/test/fixtures/`. + +Document the chosen path in `processor/README.md`. + +### What this task does NOT do + +- Does not retire directus's snapshot-managed collections. +- Does not change Phase 1 or Phase 1.5 code paths beyond removing the migration runner step. +- Does not introduce a new migration tool. The fix is *fewer* moving parts, not different ones. + +## Acceptance criteria + +- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean. +- [ ] `pnpm test:integration` runs green — the integration test no longer relies on `migrate()`; it loads schema from a fixture SQL file instead. +- [ ] `src/db/migrations/` directory is gone (or empty + gitignored). +- [ ] No `migrate()` call anywhere in the source tree. +- [ ] No `migrations_applied` references in processor source. +- [ ] Stage smoke against a fresh DB: redeploy the stack, watch directus boot through its 5 steps, watch processor connect and start writing positions. No errors. +- [ ] `docs/wiki/entities/processor.md` and `directus.md` agree: directus is the sole DDL owner. +- [ ] `docs/log.md` has a `note` entry recording the retirement. +- [ ] `trm/deploy/compose.yaml`'s `processor` service has `depends_on: directus: service_healthy`. + +## Risks / open questions + +- **Existing prod databases.** If anyone has deployed the processor's migrations on a real DB, the `migrations_applied` table is harmless but stale. Document a one-off cleanup query for operators (in `OPERATIONS.md` when 3.7 lands). +- **Schema drift between processor's old migrations and directus's db-init.** If the diff in pre-flight step 1 surfaces anything, that diff must land in directus's `db-init/` *before* the processor's runner is retired. Order of operations matters: never delete the processor migration before the equivalent SQL is verified live in directus's runner. +- **Test container schema setup.** The integration test fixture has to mirror what directus actually creates. If directus's `db-init/` changes in a way that breaks processor's read paths, the fixture and the read paths both need updating. Mitigation: the fixture file lives in `test/fixtures/` and a comment at its top says "syncs with `trm/directus/db-init/` — update both when schema changes." +- **The original 3.5 ("Migration advisory lock") concern.** Once processor doesn't run migrations, the advisory-lock concern is delegated to directus's runner. That's a directus concern; whether to add an advisory lock to directus's `apply-db-init.sh` is tracked as a follow-up in directus's own roadmap, not here. +- **PostGIS usage in Phase 2.** Processor's `0001_positions.sql` enables PostGIS even though Phase 1 doesn't use it. Directus's `db-init/001_extensions.sql` does the same. Confirm in pre-flight; no change needed if the directus side already has it. + +## Done + +(Filled in when the task lands.) diff --git a/.planning/phase-3-hardening/README.md b/.planning/phase-3-hardening/README.md index 2d434ee..fa66239 100644 --- a/.planning/phase-3-hardening/README.md +++ b/.planning/phase-3-hardening/README.md @@ -25,7 +25,7 @@ When Phase 3 is done: | 3.2 | Per-device state rehydration on first-packet | Single `SELECT ... LIMIT 1` per cold device. Memoized by LRU | | 3.3 | `XAUTOCLAIM` runner | Periodic + on-startup. Claims entries pending > `CLAIM_THRESHOLD_MS`. Re-runs the sink | | 3.4 | Dead-letter stream | After N failed decodes/writes, record goes to `telemetry:teltonika:dlq`; original ACKed off the main stream | -| 3.5 | Migration advisory lock | `pg_advisory_lock()` around the migrate runner; two instances can start simultaneously | +| 3.5 | [Retire processor migration runner](./05-retire-migration-runner.md) | Delete `src/db/migrations/*` and the runner; Directus becomes the sole DDL owner via its `db-init/`. Closes the two-sources-of-truth hazard for `positions`. Replaces the original "migration advisory lock" sketch — once processor doesn't run migrations, the lock concern delegates to Directus. | | 3.6 | Uncaught exception / unhandled rejection handlers | Log, flush, exit. Match `tcp-ingestion`'s eventual Phase 1 task 1.12 work when that lands | | 3.7 | OPERATIONS.md | The runbook | | 3.8 | Multi-instance load test | A test (manual or in CI) that proves two instances split the work; document expected lag behaviour during failover |