docs: update log and wiki entries for Phase 1.5 live broadcast implementation and incident resolution
@@ -179,3 +179,20 @@ Created in `trm/spa/.planning/`:
Each task file follows the existing Goal / Deliverables / Specification / Acceptance / Risks / Done shape so an implementer agent can pick one up self-contained. Phase 1 sequencing: 1.2 → 1.3 → 1.4 → 1.5 → (1.6 ‖ 1.7) → 1.8, with 1.9+1.10 (deploy plumbing) developable in parallel after 1.3 lands.
End state of Phase 1: a deployable empty shell — auth + protected routes + login/logout + CI + compose deploy block. End state of Phase 2: the dogfood-day deliverable. End state of Phase 3: actually fielded for race operators on race day, not just a tech demo.
## [2026-05-03] note | Stage incident — positions dropped + WS unreachable + wiki realignment
Stage incident chain. The processor logged `relation "positions" does not exist` even though Redis consumer batches were succeeding. Root cause: the `migrations_applied` guard table still held checksum rows for the three pre-schema migrations, but the actual `positions` table had been dropped out-of-band (the other 42 user tables were intact, which points to a targeted `DROP TABLE` rather than a schema reset). `directus/scripts/apply-db-init.sh` trusts the guard table exclusively, with no post-skip schema verification, so the runner logged `skip` for `001/002/003` and never re-applied them. Fix: run `DELETE FROM migrations_applied WHERE filename IN ('002_positions_hypertable.sql', '003_faulty_column.sql')` and restart Directus. The idempotent SQL re-created the table and the processor caught up immediately.
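The drift described above (guard rows present, table gone) can be detected mechanically. The following is a minimal sketch, not the project's tooling: `find_stale_guard_rows` and the `expected_tables` mapping are hypothetical, and the assumption that `003_faulty_column.sql` depends on `positions` is inferred only from its name. In practice the two inputs would be read from `migrations_applied` and from the database catalog.

```python
from typing import Dict, Iterable, List, Set


def find_stale_guard_rows(
    applied: Iterable[str],
    expected_tables: Dict[str, Set[str]],
    existing_tables: Set[str],
) -> List[str]:
    """Return guard-table filenames whose expected tables no longer exist.

    applied:         filenames recorded in the guard table
    expected_tables: hypothetical map of filename -> tables it should leave behind
    existing_tables: tables actually present in the database
    """
    stale = []
    for filename in sorted(applied):
        missing = expected_tables.get(filename, set()) - existing_tables
        if missing:
            stale.append(filename)
    return stale


# Illustrative inputs only; the mapping below is an assumption.
EXPECTED = {
    "002_positions_hypertable.sql": {"positions"},
    "003_faulty_column.sql": {"positions"},
}

if __name__ == "__main__":
    applied = ["002_positions_hypertable.sql", "003_faulty_column.sql"]
    # "positions" is absent from the set of existing tables, as in the incident.
    print(find_stale_guard_rows(applied, EXPECTED, {"devices"}))
```

The filenames it returns are the rows a `DELETE FROM migrations_applied` like the one above would need to target.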
With the table back, the live WS still failed its handshake (HTTP 400 in the browser). Diagnosis used Playwright plus manual `fetch` probes against `/ws-live` and `/ws-business`: `/ws-live` returns openresty's empty 404 (no upstream routed), and `/ws-business` returns Directus's "Route /websocket doesn't exist" (the proxy is forwarding the request as plain HTTP, not as a WS upgrade). The migration off Portainer + nginx-proxy-manager onto Komodo + Traefik, documented in `NEW-HOST-KOMODO-TRAEFIK.md`, is incomplete for the `new.stage.trmtracking.org` hostname: the deploy compose for `processor` + `directus` is missing Traefik label blocks for `/ws-live` and `/ws-business`, and the Directus container needs `WEBSOCKETS_ENABLED=true` (plus `WEBSOCKETS_HEARTBEAT_ENABLED=true`).
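The two probe signatures above map to two different proxy faults, and that mapping can be written down as a small classifier. This is only a sketch: the function name and the returned labels are hypothetical, and the status code paired with the Directus message in the usage lines is an assumption, since the log records only the message text.

```python
def classify_ws_probe(status: int, body: str) -> str:
    """Map a probe response to the likely proxy fault (labels are hypothetical)."""
    if status == 101:
        # Only reachable from a client that can send Upgrade headers;
        # browser fetch() cannot, so a fetch probe never sees this.
        return "upgraded"
    if status == 404 and body.strip() == "":
        # Empty 404 from the proxy itself: no upstream is routed for the path.
        return "no-upstream-route"
    if "Route /websocket doesn't exist" in body:
        # Directus answered as for a plain HTTP route: the request reached
        # it, but not as a WebSocket upgrade.
        return "forwarded-as-plain-http"
    return "unknown"


if __name__ == "__main__":
    print(classify_ws_probe(404, ""))
    # The 404 status here is an assumed value for illustration.
    print(classify_ws_probe(404, "Route /websocket doesn't exist"))
```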
Wiki realignments landed in this session:
- [[processor-ws-contract]]: Phase 1.5 status flipped ⬜ → ✅ (the implementation landed 2026-05-02 per the "Auth-mode wiki realignment" entry above; the contract page hadn't caught up). Locked in the chosen paths (`/ws-live` for processor, `/ws-business` for Directus); the page previously called `/processor/ws` "illustrative." Added a "Deployment" section with the Traefik label shape and the three things it depends on (same-origin, transparent upgrade, cookie forwarding). Reworded the Transport note about the internal hop to reference the `proxy` external network instead of the legacy `trm_default`.
- [[live-channel-architecture]]: replaced the JWT-in-URL handshake description with the cookie-based same-origin handshake that has been in effect since 2026-05-02. Generalised the "JWT validation strategy" open question to an "Auth caching strategy" question. Updated cross-references that still mentioned JWT-based auth on both endpoints. Added a callout footnote pointing readers at the auth-mode realignment log entry.
- `NEW-HOST-KOMODO-TRAEFIK.md` (in the parent of `docs/`; an infra contract, not part of `wiki/`): added a "Per-host path map" section enumerating `/`, `/api`, `/ws-business`, `/ws-live` so the WebSocket paths are part of the documented infra contract instead of implicit.
**Two architectural notes deferred (not done in this session, captured here so they're not forgotten).**
1. **Runner gap: `apply-db-init.sh` doesn't verify schema state on subsequent boots.** The runner records success in `migrations_applied` and trusts that exclusively; in-file assertion blocks (e.g. the `DO $$ ... RAISE EXCEPTION` block at the bottom of `002_positions_hypertable.sql`) only run during apply, not on skip. Out-of-band drops therefore produce silent drift, which is exactly today's failure mode. Two low-cost mitigations: (a) re-run idempotent files unconditionally (cheap given `IF NOT EXISTS` everywhere), or (b) add per-migration `_check.sql` companion files that the runner executes even when skipping. Worth a hardening task in directus's planning.
2. **Positions hypertable as a Directus collection: primary-key blocker.** Discussed the design tension: positions DDL lives in `directus/db-init/` (TimescaleDB-specific, must exist before Directus boots), but Directus refuses to register the table as a collection because `002_positions_hypertable.sql` deliberately omits a PRIMARY KEY (per its divergence note 6, which calls a unique index "more idiomatic" for hypertables). Directus introspection requires a PK to expose the table; log evidence: `WARN: Collection "positions" doesn't have a primary key column and will be ignored`. To enable the operator `faulty` workflow described in [[directus-schema-draft]], a future migration `004_positions_primary_key.sql` would `ALTER TABLE positions ADD PRIMARY KEY (device_id, ts)` and `DROP INDEX positions_device_ts` (now redundant). PKs that include the partition column are legal on hypertables; the divergence note's preference for a unique index is a soft style choice, not a correctness constraint. Not done in this session; pending user go-ahead.
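Deferred note 1 can be illustrated with a small simulation. This is a sketch under loud assumptions: it uses Python's built-in `sqlite3` as a stand-in for Postgres/TimescaleDB, the `run_migrations` function and its `always_reapply` flag are hypothetical and do not mirror `apply-db-init.sh`, and the table definition is a placeholder rather than the real hypertable DDL. It reproduces the skip-on-guard behaviour and then shows mitigation (a).

```python
import sqlite3
from typing import Dict, List, Tuple

# Placeholder DDL: idempotent thanks to IF NOT EXISTS (not the real hypertable SQL).
MIGRATIONS: Dict[str, str] = {
    "002_positions_hypertable.sql": (
        "CREATE TABLE IF NOT EXISTS positions "
        "(device_id TEXT NOT NULL, ts TEXT NOT NULL);"
    ),
}


def run_migrations(
    conn: sqlite3.Connection,
    migrations: Dict[str, str],
    always_reapply: bool = False,
) -> List[Tuple[str, str]]:
    """Apply migrations, skipping files already in the guard table
    unless always_reapply is set (mitigation (a))."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS migrations_applied (filename TEXT PRIMARY KEY)"
    )
    log = []
    for filename in sorted(migrations):
        seen = conn.execute(
            "SELECT 1 FROM migrations_applied WHERE filename = ?", (filename,)
        ).fetchone() is not None
        if seen and not always_reapply:
            log.append((filename, "skip"))  # guard trusted, schema never checked
            continue
        conn.executescript(migrations[filename])
        conn.execute(
            "INSERT OR IGNORE INTO migrations_applied (filename) VALUES (?)",
            (filename,),
        )
        log.append((filename, "apply"))
    conn.commit()
    return log


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(run_migrations(conn, MIGRATIONS))                       # first boot
    conn.execute("DROP TABLE positions")                          # out-of-band drop
    print(run_migrations(conn, MIGRATIONS), table_exists(conn, "positions"))
    print(run_migrations(conn, MIGRATIONS, always_reapply=True),
          table_exists(conn, "positions"))
```

Mitigation (b) would replace the unconditional re-run with a per-file check executed on the skip path; it is not shown here because the shape of the `_check.sql` files has not been decided.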
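For deferred note 2, the rule behind "PKs that include the partition column are legal" can be stated as a check: TimescaleDB requires a unique index or primary key on a hypertable to include every partitioning column. A minimal sketch follows; the helper name is hypothetical, and treating `ts` as the only partitioning column of `positions` is an assumption based on the proposed key.

```python
from typing import Sequence


def key_covers_partitioning(
    key_columns: Sequence[str], partition_columns: Sequence[str]
) -> bool:
    """True if every partitioning column appears in the proposed key."""
    return set(partition_columns).issubset(key_columns)


if __name__ == "__main__":
    # Proposed key from the note, with `ts` assumed to be the partition column.
    print(key_covers_partitioning(["device_id", "ts"], ["ts"]))
    # A key on device_id alone would not satisfy the rule.
    print(key_covers_partitioning(["device_id"], ["ts"]))
```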