docs(planning): file phase-3 task 3.5 — retire processor migration runner

Replaces the original "migration advisory lock" sketch. Once processor
doesn't run DDL, the lock concern delegates to Directus's db-init runner.

Context: the positions hypertable + faulty column DDL currently exist in
both processor (src/db/migrations/0001 + 0002) and directus
(db-init/001/002/003). Two sources of truth for the same schema is a
known hazard — adding a column means editing two files in two repos,
and silent drift between them is invisible until runtime.

Fix: directus becomes the sole DDL owner. Processor's migration runner
is retired; only INSERT/SELECT/UPDATE remain.

Task spec covers:
- Pre-flight diff between processor migrations and directus db-init
  (must be byte/semantically equivalent before deletion)
- File-by-file deletion list
- Test infra migration (integration test moves to fixture-based schema
  setup, matching the established Phase 1.5 task 1.5.6 pattern)
- Wiki + ROADMAP updates
- compose.yaml depends_on directus: service_healthy
- Operational notes (existing migrations_applied table is left in place)

Sequence: ideally lands AFTER Phase 1.5 ships so the agent shipping the
WS endpoint isn't pulled into a side quest mid-flight.
2026-05-02 17:59:27 +02:00
parent b3d6410af6
commit 3c2c5cf50e
2 changed files with 170 additions and 1 deletion
@@ -0,0 +1,169 @@
# Task 3.5 — Retire processor migration runner
**Phase:** 3 — Production hardening
**Status:** ⬜ Not started
**Depends on:** Phase 1.5 ideally landed (avoid mid-flight code churn for the agent shipping the WS endpoint). No hard code dependency.
**Replaces:** the original 3.5 sketch ("Migration advisory lock"). Once the processor doesn't run migrations, the lock concern is delegated to Directus's `db-init` runner — outside this service's surface.
**Wiki refs:** `docs/wiki/entities/processor.md` §"Schema ownership vs. write access" (the line that needs to change), `docs/wiki/entities/directus.md` §"Schema management — snapshot/apply pipeline", `docs/wiki/entities/postgres-timescaledb.md`
## Goal
Establish [[directus]] as the **single owner of all DDL** against the shared Postgres database. Retire the processor's migration runner. After this task, the only DDL paths are:
- `trm/directus/db-init/*.sql` (pre-schema: extensions, hypertables, raw tables Directus's snapshot-yaml format can't express).
- `trm/directus/snapshots/schema.yaml` (Directus-managed user collections).
- `trm/directus/db-init-post/*.sql` (post-schema: composite UNIQUE constraints on Directus-managed tables).
Processor exclusively does `INSERT` / `SELECT` / `UPDATE`. No `CREATE`, `ALTER`, `CREATE EXTENSION`, or any other DDL.
## Context — why this exists
The current state has **both** services creating the `positions` hypertable and the `faulty` column:
- `trm/processor/src/db/migrations/0001_positions.sql` and `0002_positions_faulty.sql` (processor's runner, from Phase 1 task 1.4 + the recent 1.5.5 prep).
- `trm/directus/db-init/001_extensions.sql`, `002_positions_hypertable.sql`, `003_faulty_column.sql` (directus's runner, added later when the destructive-apply incident showed positions had to exist *before* `directus schema apply` runs or it would get wiped).
Both runners are idempotent (`IF NOT EXISTS`, etc.) so the runtime collision is benign at the moment, but the architectural risks are real:
- **Two sources of truth.** Adding a column means editing two files in two repos; either one can drift silently.
- **Schema divergence.** A processor migration that adds a column the directus side doesn't know about is invisible to the admin UI.
- **Two `migrations_applied` tables**, which already caused the ghost-collection apply conflict during directus's Phase 1.
- **Operator confusion.** The wiki says "Directus owns the schema" but processor runs migrations — newcomers can't tell which is canonical.
The fix is the wiki's stated intent: directus owns DDL. Processor was the historical owner before directus's `db-init` story matured; the legacy runner survived the transition because nobody retired it.
## Pre-flight (before deleting anything)
### 1. Confirm directus's `db-init/` covers the full processor schema surface
Check that `trm/directus/db-init/`'s SQL is byte-equivalent (or semantically equivalent) to processor's migrations. As of writing, directus has:
- `001_extensions.sql` — `CREATE EXTENSION IF NOT EXISTS timescaledb` + `postgis`.
- `002_positions_hypertable.sql` — `CREATE TABLE positions (...)` + `create_hypertable(...)` + indexes.
- `003_faulty_column.sql` — `ALTER TABLE positions ADD COLUMN IF NOT EXISTS faulty ...` + `positions_device_ts_idx`.
Processor has:
- `src/db/migrations/0001_positions.sql` — extensions + table + hypertable + `positions_device_ts` (UNIQUE on `(device_id, ts)`) + `positions_ts` (DESC).
- `src/db/migrations/0002_positions_faulty.sql` — `faulty` column + `positions_device_ts_idx` (`(device_id, ts DESC)`).
**Diff the two before retiring.** If processor's SQL has an index, column, or constraint directus's `db-init/` doesn't, the deliverable starts with porting that diff into directus's `db-init/` (and snapshotting if applicable). Specific things to verify:
- All three indexes exist in directus's db-init: `positions_device_ts` (UNIQUE), `positions_ts`, `positions_device_ts_idx`.
- Column types match exactly: `device_id text`, `ts timestamptz`, `ingested_at timestamptz DEFAULT now()`, etc.
- `chunk_time_interval` is `INTERVAL '1 day'` on both sides.
- The `ON CONFLICT (device_id, ts) DO NOTHING` upsert path requires the UNIQUE on `(device_id, ts)` — that's the `positions_device_ts` index, not `positions_device_ts_idx`. Both must exist.
### 2. Verify the directus `db-init` apply order is fixed
Per `docs/wiki/entities/directus.md`'s 5-step boot pipeline:
```
1. db-init pre-schema → positions hypertable, faulty column, timescaledb extension
2. directus bootstrap → Directus system tables + first admin
3. directus schema apply → Directus-managed user collections
4. db-init post-schema → composite UNIQUE constraints on user collections
5. pm2-runtime start → server up at :8055
```
So when processor boots against a stage stack:
- `directus` container has already run steps 1-4, so `positions` and everything else it owns exist.
- `processor` container can connect and `INSERT` immediately.
**Compose ordering.** `trm/deploy/compose.yaml`'s `processor` service should `depends_on: directus` with `condition: service_healthy` so processor doesn't try to read `positions` before directus's db-init has run on first deploy. Verify in this task.
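The ordering clause in `trm/deploy/compose.yaml` would look roughly like this — a sketch, not the actual file; the healthcheck command is an assumption (Directus exposes `/server/health`, but whether `curl` exists in the image must be verified):

```yaml
services:
  directus:
    healthcheck:
      # Assumed probe — confirm curl (or substitute wget/node) is in the image.
      test: ["CMD", "curl", "-fsS", "http://localhost:8055/server/health"]
      interval: 10s
      retries: 12   # first deploy runs db-init + bootstrap; allow for it
  processor:
    depends_on:
      directus:
        condition: service_healthy
```

`condition: service_healthy` (rather than bare `depends_on: [directus]`) is the important part: bare `depends_on` only orders container *start*, not readiness.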
## Deliverables
### `trm/processor/`
1. **Delete** `src/db/migrations/0001_positions.sql`.
2. **Delete** `src/db/migrations/0002_positions_faulty.sql`.
3. **Delete** the migrations directory if it's now empty.
4. **Delete** `src/db/migrate.ts` (or whatever the migration-runner module is named — the file that owns the `migrations_applied` table, the file walker, the `pg_advisory_lock` if any).
5. **Update `src/main.ts`** to remove the `await migrate(...)` step from boot. Postgres pool creation stays; migration call goes.
6. **Update tests** that exercise the migration runner — most likely by deleting the corresponding test file. Integration tests that previously seeded the schema via `migrate()` either:
- (a) Use directus's `db-init/*.sql` files directly (read them in `beforeAll`, execute against the testcontainer Postgres), or
- (b) Carry a fixture SQL file in `test/fixtures/` (the same approach Phase 1.5 task 1.5.6 already takes for its integration test).
Pick (b) — it's already the established pattern.
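The fixture-loading step in the integration test can stay tiny. A sketch, with hypothetical names (`applyFixtureSchema`, the `test/fixtures/schema.sql` path) — the real fixture path should match whatever task 1.5.6 established:

```typescript
import { readFileSync } from "node:fs";

// Minimal query surface — structurally compatible with pg.Pool.
interface Queryable {
  query(sql: string): Promise<unknown>;
}

// Read the fixture and run it as one multi-statement batch; Postgres's
// simple query protocol accepts semicolon-separated statements in one call.
export async function applyFixtureSchema(
  db: Queryable,
  fixturePath = "test/fixtures/schema.sql",
): Promise<void> {
  const sql = readFileSync(fixturePath, "utf8");
  await db.query(sql);
}
```

In the test's `beforeAll`, `await applyFixtureSchema(pool)` against the testcontainer pool replaces the old `await migrate(pool)` call.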
7. **Update `Dockerfile`** to drop any migration-running step from the entrypoint (Phase 1's Dockerfile may not have this, but verify; the runtime container shouldn't carry the migrations directory if the runner is gone).
8. **Update `package.json` `dependencies`** — if `pg-migrate` or any migration-runner library was a Phase-1-only dep, remove it.
9. **Update `phase-1-throughput/04-postgres-schema.md`'s Done section** with a note: "Migration runner retired in Phase 3 task 3.5 — see that task for context."
10. **Update `ROADMAP.md`** to reflect the retired runner under Phase 1's "what changed since landing" note.
### `trm/docs/wiki/`
1. **Update `wiki/entities/processor.md`** — drop the "Schema ownership vs. write access" caveat that says "the positions hypertable is owned by [[processor]]'s migration runner." Replace with a single sentence: "Processor never runs DDL. Schema is exclusively owned by Directus (snapshot.yaml + `db-init/` for things the snapshot can't express)."
2. **Update `wiki/entities/directus.md`** — confirm the Schema-management section already lists `db-init/` as covering everything (no edits unless current text implies a split).
3. **Update `wiki/entities/postgres-timescaledb.md`** — verify the writer-side documentation; remove any "split schema authority" framing.
4. **Append `docs/log.md`** with a `note` entry recording the retirement.
### `trm/deploy/`
1. **Verify** `compose.yaml`'s `processor` service has `depends_on: directus: { condition: service_healthy }`. Add if missing.
2. **Confirm** the deploy README doesn't mention the processor's migration runner anywhere.
## Specification
### What stays in `src/db/`
- `pool.ts` — Postgres `Pool` factory. Untouched.
- Connection helpers, query helpers (if any). Untouched.
### What goes
- `migrations/*.sql` — gone.
- `migrate.ts` (the runner). Gone.
- `migrations_applied` table — directus's runner has its own; the processor's becomes orphaned but harmless. **Don't drop it from existing databases.** The retirement is a *runtime* change; the table is just unused. Phase 3 hardening's eventual `OPERATIONS.md` (task 3.7) can document a one-off `DROP TABLE migrations_applied` step for operators who want a clean schema.
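For operators who do want the clean schema later, the runbook entry can be a single statement — hedged here because the two runners' ledger tables may share a name (see the conflict note in Context), so identifying the processor's table is on the operator:

```sql
-- One-off cleanup on existing databases. Run ONLY after confirming this
-- is the processor's orphaned ledger, not the table directus's runner uses.
DROP TABLE IF EXISTS migrations_applied;
```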
### Boot order on first deploy
```
1. postgres container starts → DB available.
2. directus container starts → runs 5-step boot pipeline.
├─ Step 1 (db-init pre-schema) creates positions hypertable + faulty column + extensions.
├─ Steps 2-4 set up Directus's own world.
└─ Step 5 marks the container healthy.
3. processor container starts (depends_on: directus: service_healthy) → connects, finds positions, starts consuming.
```
If processor races directus on a fresh stack (no `depends_on`), it'll fail to find the `positions` table and crash-loop until directus catches up. `depends_on: service_healthy` makes the order deterministic.
### Dev workflow
`compose.dev.yaml` in `trm/processor` (if it exists for processor-side dev) should `depends_on: directus` if running both. For pure-processor dev (no directus), the developer either:
- Runs directus's `db-init/*.sql` manually against their local Postgres before booting processor.
- Or copies the equivalent SQL into a one-off bootstrap script in `processor/test/fixtures/`.
Document the chosen path in `processor/README.md`.
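The first option can be a few lines of shell — a hypothetical helper, with the `../directus/db-init` path and `POSTGRES_URL` variable as assumptions; `PSQL` is overridable so the loop can be exercised without a real server:

```shell
# Apply every db-init SQL file, in filename order, against a local Postgres.
apply_db_init() {
  dir="$1"
  db_url="$2"
  for f in "$dir"/*.sql; do
    echo "applying $f"
    "${PSQL:-psql}" "$db_url" -v ON_ERROR_STOP=1 -f "$f"
  done
}

# Typical invocation from trm/processor (paths are assumptions):
# apply_db_init ../directus/db-init "$POSTGRES_URL"
```

`ON_ERROR_STOP=1` makes psql exit non-zero on the first failed statement instead of plowing ahead with a half-applied schema.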
### What this task does NOT do
- Does not retire directus's snapshot-managed collections.
- Does not change Phase 1 or Phase 1.5 code paths beyond removing the migration runner step.
- Does not introduce a new migration tool. The fix is *fewer* moving parts, not different ones.
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] `pnpm test:integration` runs green — the integration test no longer relies on `migrate()`; it loads schema from a fixture SQL file instead.
- [ ] `src/db/migrations/` directory is gone (or empty + gitignored).
- [ ] No `migrate()` call anywhere in the source tree.
- [ ] No `migrations_applied` references in processor source.
- [ ] Stage smoke against a fresh DB: redeploy the stack, watch directus boot through its 5 steps, watch processor connect and start writing positions. No errors.
- [ ] `docs/wiki/entities/processor.md` and `directus.md` agree: directus is the sole DDL owner.
- [ ] `docs/log.md` has a `note` entry recording the retirement.
- [ ] `trm/deploy/compose.yaml`'s `processor` service has `depends_on: directus: service_healthy`.
## Risks / open questions
- **Existing prod databases.** If anyone has deployed the processor's migrations on a real DB, the `migrations_applied` table is harmless but stale. Document a one-off cleanup query for operators (in `OPERATIONS.md` when 3.7 lands).
- **Schema drift between processor's old migrations and directus's db-init.** If the diff in pre-flight step 1 surfaces anything, that diff must land in directus's `db-init/` *before* the processor's runner is retired. Order of operations matters: never delete the processor migration before the equivalent SQL is verified live in directus's runner.
- **Test container schema setup.** The integration test fixture has to mirror what directus actually creates. If directus's `db-init/` changes in a way that breaks processor's read paths, the fixture and the read paths both need updating. Mitigation: the fixture file lives in `test/fixtures/` and a comment at its top says "syncs with `trm/directus/db-init/` — update both when schema changes."
- **The original 3.5 ("Migration advisory lock") concern.** Once processor doesn't run migrations, the advisory-lock concern is delegated to directus's runner. That's a directus concern; whether to add an advisory lock to directus's `apply-db-init.sh` is tracked as a follow-up in directus's own roadmap, not here.
- **PostGIS usage in Phase 2.** Processor's `0001_positions.sql` enables PostGIS even though Phase 1 doesn't use it. Directus's `db-init/001_extensions.sql` does the same. Confirm in pre-flight; no change needed if the directus side already has it.
## Done
(Filled in when the task lands.)
@@ -25,7 +25,7 @@ When Phase 3 is done:
 | 3.2 | Per-device state rehydration on first-packet | Single `SELECT ... LIMIT 1` per cold device. Memoized by LRU |
 | 3.3 | `XAUTOCLAIM` runner | Periodic + on-startup. Claims entries pending > `CLAIM_THRESHOLD_MS`. Re-runs the sink |
 | 3.4 | Dead-letter stream | After N failed decodes/writes, record goes to `telemetry:teltonika:dlq`; original ACKed off the main stream |
-| 3.5 | Migration advisory lock | `pg_advisory_lock(<hash>)` around the migrate runner; two instances can start simultaneously |
+| 3.5 | [Retire processor migration runner](./05-retire-migration-runner.md) | Delete `src/db/migrations/*` and the runner; Directus becomes the sole DDL owner via its `db-init/`. Closes the two-sources-of-truth hazard for `positions`. Replaces the original "migration advisory lock" sketch — once processor doesn't run migrations, the lock concern delegates to Directus. |
 | 3.6 | Uncaught exception / unhandled rejection handlers | Log, flush, exit. Match `tcp-ingestion`'s eventual Phase 1 task 1.12 work when that lands |
 | 3.7 | OPERATIONS.md | The runbook |
 | 3.8 | Multi-instance load test | A test (manual or in CI) that proves two instances split the work; document expected lag behaviour during failover |