Log

Chronological activity log. Append-only. Entry headers use the format ## [YYYY-MM-DD] <op> | <title> so they can be grepped:

grep "^## \[" log.md | tail -10
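The same header format can be parsed programmatically when a script needs structured fields rather than raw lines. The sketch below is an illustration only: `parseLogHeader` and `lastHeaders` are hypothetical helper names, and it assumes the op is a single whitespace-free token.

```typescript
// Parsed form of an entry header such as
// "## [2026-04-30] note | Wiki bootstrapped".
interface LogHeader {
  date: string; // YYYY-MM-DD
  op: string; // e.g. "note", "ingest", "synthesis"
  title: string;
}

// Same anchor as the grep pattern ("^## \["), plus capture groups for the fields.
const HEADER_RE = /^## \[(\d{4}-\d{2}-\d{2})\] (\S+) \| (.+)$/;

// Hypothetical helper: returns null for any line that is not an entry header.
function parseLogHeader(line: string): LogHeader | null {
  const m = HEADER_RE.exec(line);
  if (m === null) return null;
  return { date: m[1] ?? "", op: m[2] ?? "", title: m[3] ?? "" };
}

// Hypothetical helper: the last `n` headers in file order, like `grep ... | tail -n`.
function lastHeaders(markdown: string, n: number): LogHeader[] {
  const headers: LogHeader[] = [];
  for (const line of markdown.split("\n")) {
    const h = parseLogHeader(line);
    if (h !== null) headers.push(h);
  }
  return n <= 0 ? [] : headers.slice(-n);
}
```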

[2026-04-30] note | Wiki bootstrapped

Created CLAUDE.md (schema + workflows), index.md (empty catalog), and this log. Wiki directory structure (wiki/sources, wiki/entities, wiki/concepts, wiki/synthesis) will be created on first ingest.

[2026-04-30] ingest | gps-tracking-architecture.md + teltonika-ingestion-architecture.md

Ingested both initial architecture docs in one pass. Created:

No contradictions to flag — the two docs are coherent (the Teltonika doc explicitly cites and respects the system architecture). Open follow-ups: TRM business domain not yet captured; per-model IO dictionary location TBD; Phase 2 timing unspecified.

[2026-04-30] ingest | Teltonika Data Sending Protocols (official wiki)

Ingested the canonical Teltonika spec covering all codec families. New additions:

Updates to existing pages (no contradictions; refinements + additions):

  • teltonika — added full codec table with hex IDs, Codec 15 (out of scope), Codec 14 ACK/nACK, packet size limits, UDP support note.
  • codec-dispatch — corrected hex IDs, added directionality table covering codecs 8–15.
  • position-record — concrete priority enum (0/1/2), two's-complement lat/lon note, Speed=0 means GPS invalid, Generation Type and NX section flagged.
  • phase-2-commands — clarified Codec 12 vs 14 selection, added nack status for Codec 14 IMEI-mismatch (Type 0x11); noted 13/15 are not part of the outbound design.
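The two's-complement and Speed=0 notes on position-record can be illustrated with a small decoder. This is a sketch under stated assumptions, not the adapter's actual code: the function names are hypothetical, and the 1e7 scale factor is an assumption taken from the vendor spec rather than from this log.

```typescript
// Reinterpret an unsigned 32-bit wire value as a signed two's-complement integer.
// `| 0` applies JavaScript's ToInt32 conversion, which is exactly that reinterpretation.
function toSignedInt32(raw: number): number {
  return raw | 0;
}

// ASSUMPTION: coordinates arrive as degrees multiplied by 1e7.
const COORD_SCALE = 1e7;

// Hypothetical helper: raw 4-byte coordinate field -> signed decimal degrees.
function decodeCoordinate(raw: number): number {
  return toSignedInt32(raw) / COORD_SCALE;
}

// Hypothetical helper: per the note above, Speed=0 marks the GPS data as invalid.
function isGpsValid(speed: number): boolean {
  return speed !== 0;
}
```

A raw value with the top bit set therefore decodes to a negative coordinate (west or south) instead of a large positive one.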

Cleanup: removed stale duplicate concept files from earlier passes (system-planes.md, protocol-adapter-pattern.md, codec-dispatch-registry.md) — superseded by plane-separation.md, protocol-adapter.md, codec-dispatch.md respectively. Fixed dangling protocol-adapter-pattern link in io-element-bag.

Open questions surfaced by the canonical doc:

  • Codec 16 Generation Type: promote to a typed position-record field?
  • Codec 8E NX values land as Buffer in attributes; this needs explicit fixture coverage.
  • SMS-based protocols (Codec 4 + binary SMS) are probably out of scope but worth a deliberate decision.

[2026-05-01] note | Stream-name canonicalization

Documented the canonical stream/key names in redis-streams — the wiki was previously silent on the actual telemetry:teltonika name, so anyone reading it had no way to find out what stream the services use. Added a "Stream and key naming" table covering the inbound telemetry stream, Phase 2 command streams, and registry/heartbeat keys. Also added the naming convention (telemetry:{vendor}) so future adapters fit predictably. Cross-referenced the actual stream name in processor and tcp-ingestion entities so each entity is self-contained but the convention has one canonical home.

Triggered by a stage-side bug where tcp-ingestion's compiled default (telemetry:teltonika) and processor's compiled default (telemetry:t) had drifted; pipeline ran with both services talking past each other for ~7 hours before symptoms surfaced. Fix landed in deploy stack (shared env var) and processor (default realigned). Wiki update closes the documentation loop.
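A minimal sketch of the naming convention and the shared-variable fix follows. The variable name `TELEMETRY_STREAM` and all function names are hypothetical; the log only records that a shared env var was introduced, not what it is called.

```typescript
// Convention from redis-streams: telemetry:{vendor}, e.g. "telemetry:teltonika".
function telemetryStream(vendor: string): string {
  return `telemetry:${vendor}`;
}

// HYPOTHETICAL variable name, used here for illustration only.
const STREAM_ENV_VAR = "TELEMETRY_STREAM";

// Shared env var wins; the convention is the fallback, so both services
// resolve the same name even if one of them is redeployed without the variable.
function resolveStream(env: Record<string, string | undefined>, vendor: string): string {
  const configured = env[STREAM_ENV_VAR];
  return configured !== undefined && configured !== "" ? configured : telemetryStream(vendor);
}

// Fails fast when producer and consumer would otherwise talk past each other.
function assertSameStream(producer: string, consumer: string): void {
  if (producer !== consumer) {
    throw new Error(`stream drift: producer=${producer} consumer=${consumer}`);
  }
}
```

A startup check of this kind would have turned the roughly seven hours of silent drift into an immediate boot failure.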

[2026-05-01] synthesis | Live channel architecture (corrects a wiki claim)

Researched Directus's WebSocket subscription mechanism via context7. Confirmed that subscriptions only fire for writes that go through Directus's ItemsService (REST/GraphQL/Admin UI mutations, not direct database INSERTs). The previous claim in directus — "When Processor writes a row, Directus broadcasts the change to subscribed clients" — was wrong.

Wrote live-channel-architecture documenting the corrected design: two WebSocket channels, each in its own plane. Processor exposes its own WebSocket endpoint for high-volume telemetry fan-out (auth via Directus-issued JWT, authorization delegated to Directus once at subscribe time). Directus's built-in WebSocket subscriptions cover business-plane events. Reasoning: preserves plane-separation and gives the gentlest failure mode (Directus down blocks only new authorizations, not the live firehose).

Updated processor (added Live broadcast section, multi-instance consumer-group plumbing note), directus (corrected the real-time-delivery section), and index.md.

[2026-05-01] synthesis | Directus schema — working draft

Captured the business-plane schema agreement reached during today's discussion as directus-schema-draft. Marked as a working draft, open for revision.

Shape: pseudo multi-tenant under organizations; users / teams / vehicles / devices are all m2m with orgs (durable catalog); events scoped to a single org; entries is the per-event timing unit with nullable vehicle_id (foot races) and nullable team_id (lone racers); entry_crew and entry_devices are junctions off entries (no separate crews collection — teams already provide durable group identity). Vehicle ownership intentionally soft (owner_user_id?, owner_team_id?), not enforced. Per-event classes. events.discipline drives validation. Per-org-per-user role lives on organization_users.role.

Open: entries.status enum, permission policy definitions per role, stages/timing records (Phase 2 processor), geofences (Phase 2 processor).

[2026-05-01] synthesis | Schema draft — course definition + penalty system

Major expansion of directus-schema-draft. Added course definition (stages → segments → geofences/waypoints/SLZs) and the full penalty system. Vehicle ownership idea dropped (org-level only, no owner FKs). entries.status enum pinned with semantics. Permission policies confirmed as Directus 11 dynamic-filter Policies, one per logical role.

Penalty system landed as: numbers in DB, math in code. A penalty_formulas collection holds all values (bracket multipliers, per-miss penalties); the processor holds one evaluator per type in a registry. Speed limit penalties are progressive slice-by-slice (income-tax math, confirmed against the Tirana 24h rulebook): each bracket contributes only the portion of the peak overspeed within its range — slice × rate summed across all brackets the peak crossed. Worked example with peak=58 included in the doc.
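The slice-by-slice rule can be sketched as a single evaluator. The bracket shape, field names, and the numbers in `exampleBrackets` are made up for illustration; they are not the values stored in penalty_formulas or taken from any rulebook.

```typescript
// One bracket of a progressive table. `upTo` is the upper bound of overspeed
// covered by this bracket; the last bracket uses Infinity.
interface Bracket {
  upTo: number;
  rate: number; // penalty units per unit of overspeed inside this bracket
}

// Sums slice × rate over every bracket the peak overspeed crosses.
// Brackets must be sorted by ascending `upTo`.
function progressivePenalty(peakOverspeed: number, brackets: Bracket[]): number {
  let total = 0;
  let lower = 0; // lower bound of the current bracket
  for (const b of brackets) {
    if (peakOverspeed <= lower) break; // the peak never reached this bracket
    const slice = Math.min(peakOverspeed, b.upTo) - lower;
    total += slice * b.rate;
    lower = b.upTo;
  }
  return total;
}

// HYPOTHETICAL values, for illustration only.
const exampleBrackets: Bracket[] = [
  { upTo: 10, rate: 1 },
  { upTo: 20, rate: 2 },
  { upTo: Infinity, rate: 5 },
];
```

With these made-up brackets, a peak overspeed of 25 yields 10×1 + 10×2 + 5×5 = 55, whereas a flat "highest bracket applies to everything" rule would yield 25×5 = 125.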

Retroactive flag lives on penalty_formulas (default true) and on geofences / speed_limit_zones (default false). Per-edit override at save time. Formula recomputes are cheap (snapshotted inputs on entry_penalties rows). Geometry recomputes are expensive (replay from positions hypertable) and deferred to Phase 2.5 of processor.

Other decisions: checkpoints are typed geofences with manual_verification=true, not a separate collection. Stages are containers; segments (liaison / special-stage / parc-ferme) are the atomic rule unit. SLZs carry an evaluation_window_meters so the 2km rule from real federations is data, not code.

Per-entry timing layer (entry_segment_starts, entry_crossings, entry_penalties) and results layer (stage_results) are the processor Phase 2 write target. Schema is laid out so Phase 1 (positions only) can ship without it.

[2026-05-01] note | Faulty position flagging

Added a faulty boolean DEFAULT false column to the positions hypertable, controlled by track operators through directus (the hypertable is exposed as a Directus collection for read+update). processor filters WHERE faulty = false on every read of position data — peak-speed, crossing detection, replay-based recompute. Flagging triggers a windowed recompute of affected entry_penalties. Updated postgres-timescaledb, position-record (storage shape vs. wire shape), processor (faulty position handling), and directus-schema-draft (cross-plane operator workflow + third recompute kind).
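An in-memory equivalent of the `WHERE faulty = false` filter shows why the flag matters for peak-speed evaluation. The row shape and function names below are assumptions for illustration; the real filtering happens in SQL.

```typescript
// Simplified stand-in for a positions row (field set is an assumption).
interface PositionRow {
  device_id: string;
  ts: number; // epoch milliseconds, assumed representation
  speed: number;
  faulty: boolean;
}

// In-memory equivalent of `WHERE faulty = false`.
function usablePositions(rows: PositionRow[]): PositionRow[] {
  return rows.filter((r) => !r.faulty);
}

// Peak speed over non-faulty rows; null when nothing usable remains.
function peakSpeed(rows: PositionRow[]): number | null {
  const usable = usablePositions(rows);
  if (usable.length === 0) return null;
  return usable.reduce((max, r) => Math.max(max, r.speed), -Infinity);
}
```

A single GPS spike flagged by an operator drops out of the peak, which is why flagging has to trigger a recompute of the affected entry_penalties rows.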

[2026-05-01] synthesis | Schema draft — start-order strategies + secondary observations

Read two real-world rulebooks to pin the start-order question: Tirana 24h 2017 (static every leg) and Rally Albania 2025 (dynamic, several variants). Rally Albania's §5.5–5.10 settle it: start order is per-stage, declarative, and rule-driven. Stage 1 bikes invert the top 20 of the prologue; stages 2 onward seed from previous-stage clean SS time (penalties explicitly excluded); epilogue inverts overall standings; intervals are decided per stage.

Updates to directus-schema-draft:

  • stages gains role (prologue/regular/epilogue), start_interval_seconds, start_order_strategy, start_order_strategy_params, start_order_input_stage_id.
  • New "Start order strategies" subsection enumerating manual / previous_stage_result / previous_stage_clean_result / inverse_top_n_then_natural / inverse_of_overall with real-world mappings. Tirana 24h covered by manual; Rally Albania covered by the other four.
  • entry_segment_starts adds start_position and manual_override (latter for late-arrival reseeding by Race Marshals — both rulebooks leave that operator-driven).
  • Materialization is per-category (categories share grids independently per Rally Albania §2.8 + §5.10).
  • Decisions list grows: stage roles, CP-missing vs CP-late-past-closing as distinct event types sharing a formula row, reverse-stage tiebreaker.
  • Open questions shrink: dropped the start-interval question (now pinned) and the permission-policy-filters question (admin/deployment task, not architectural).
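Three of the strategies named above reduce to small pure functions over a ranked list. This is a sketch under assumptions: the function signatures and the `StageResult` shape are hypothetical, and tie-breaking is left to the stable sort rather than modelled.

```typescript
// inverse_top_n_then_natural: reverse the first n of the input ranking,
// then keep everyone else in ranking order. `ranked` is best-first.
function inverseTopNThenNatural<T>(ranked: T[], n: number): T[] {
  const cut = Math.max(0, Math.min(n, ranked.length));
  const top = ranked.slice(0, cut).reverse();
  return top.concat(ranked.slice(cut));
}

// inverse_of_overall: last in the overall standings starts first.
function inverseOfOverall<T>(overall: T[]): T[] {
  return overall.slice().reverse();
}

// Hypothetical result row; penalties are carried but deliberately ignored below.
interface StageResult {
  entryId: string;
  cleanSeconds: number;
  penaltySeconds: number;
}

// previous_stage_clean_result: seed by clean SS time, penalties excluded.
function previousStageCleanResult(results: StageResult[]): string[] {
  return results
    .slice()
    .sort((a, b) => a.cleanSeconds - b.cleanSeconds)
    .map((r) => r.entryId);
}
```

For the Stage 1 bike rule, `inverseTopNThenNatural(prologueRanking, 20)` would express "invert the top 20 of the prologue"; `prologueRanking` is a placeholder name.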

[2026-05-01] ingest | Rally Albania 2025 — Race Rules and Regulations

Formal ingest of raw/Regulations_2025.pdf (Motorsport Club Albania, October 2024). Created rally-albania-regulations-2025 as the canonical real-world reference for federation rule shapes — classes, start-order rules, penalty taxonomy, tracking requirements, timekeeping, protests. Section numbers preserved as §X.Y so the schema draft and future SPA work can cite precisely.

Wired the source into directus-schema-draft (added to sources: frontmatter; framing note near the top; inline citation at start-order strategies section). Most of the schema-relevant content was already absorbed into the draft during the prior synthesis step — this ingest formalizes the citation chain.

Open follow-ups flagged on the source page: §12.11 SLZ formula lives in the Supplementary Regulations (not the general regs), so we shouldn't hardcode a default; M-7 numbering bug (Veteran and Female driver share the code — likely a typo); neutralization zones (§8.12) not yet modeled in the schema.

Index updated: new source row. No new entity/concept pages created — the doc supports existing pages rather than introducing new domain objects.

[2026-05-02] note | Directus deployment wired; entity page updated

trm/directus Phase 1 shipped its image to the registry and the trm/deploy compose.yaml was extended with a directus service block (sharing the existing postgres service with processor). Updated directus entity page to reflect operational reality:

  • New "Deployment" section: links to the deploy compose, explains the shared-Postgres model with processor, spells out the 5-step boot pipeline (db-init pre-schema → bootstrap → schema apply → db-init post-schema → start), notes first-boot vs warm-boot timing.
  • Schema management section: db-init split into pre-schema (db-init/) and post-schema (db-init-post/) phases. Post-schema landed because the composite UNIQUE constraints target Directus-managed tables that don't exist until schema apply runs.
  • Destructive-apply hazard callout: corrected entrypoint step reference (now step 3/5, not 2/4) after the bootstrap-before-apply reorder.
  • New "Network exposure" subsection inside Deployment: directus is internal-only on stage / prod (expose: 8055 not ports:). A reverse proxy (Traefik / Caddy / nginx) on the host or attached to trm_default terminates TLS and forwards the public domain to http://directus:8055. The asymmetry with tcp-ingestion (which must host-publish for GPS devices) is named, and the dev compose's deliberate divergence is noted.

Three CI iterations on the directus repo's first push exposed three distinct production-breaking bugs (port collision; bootstrap-before-apply ordering + silent ERROR exit; ghost-collection apply conflict). The dry-run gate caught all of them before the image touched stage. The "ghost-collection" stripping is now automated in scripts/schema-snapshot.sh so future captures don't regress.

[2026-05-02] note | Stage deploy verified + Rally Albania 2026 seed landed

Stage Directus is live at api.stage.new.trmtracking.org and matches the local snapshot. Verification done via the directus-stage MCP server:

  • All 12 user collections present (organizations, organization_users, organization_vehicles, organization_devices, vehicles, devices, events, classes, entries, entry_crew, entry_devices + custom fields on directus_users).
  • Field shapes, types, notes, and relations identical to local. migrations_applied + positions (db-init) and schema_migrations (processor migration runner) tables also present, as expected.
  • Composite UNIQUE constraints landed — probed (event_id, code) on classes with a duplicate M-1 insert, got RECORD_NOT_UNIQUE. Confirms db-init-post/001 + 002 ran on stage (the post-schema phase introduced during task 1.8 CI iterations).

Rally Albania 2026 dogfood seed (task 1.9) replayed against stage: 1 org (msc-albania), 1 event (rally-albania-2026, 2026-06-06 → 2026-06-13), 18 classes (M-1..M-8, Q-1..Q-3, C-1/C-2/C-A/C-3, S-1..S-3), 1 vehicle (Toyota Land Cruiser 70), 3 devices (FMB920 chassis + FMB920 dash backup + FMB003 panic). Junction rows (organization_vehicles ×1, organization_devices ×3) wired. UUIDs differ from the local seed; record of stage UUIDs lives in trm/directus/.planning/phase-1-slice-1-schema/09-rally-albania-2026-seed.md Done section if needed.

End-to-end registration walkthrough (organization_users + entries + entry_crew + entry_devices) deferred to manual operator pass through the admin UI — the MCP items tool blocks writes to core collections like directus_users, so the user-attaches-to-entry flow can't be MCP-driven. That manual walkthrough is the actual dogfood acceptance gate for slice-1 schema.

Drift flagged: field notes on events.slug, classes.code, and entries.race_number still reference "db-init/005" — those constraints moved to db-init-post/ during the CI fix. Cosmetic only, no behavior impact; worth a snapshot-side cleanup pass next time someone touches the schema.

[2026-05-02] ingest | TRACCAR_MAPS_ARCHITECTURE.md

Ingested the deep architectural reference for traccar-web's maps subsystem after recognising during SPA-planning discussion that Traccar already fields the exact stack we're converging on (MapLibre GL JS + GeoJSON sources + WebSocket fan-out). Created traccar-maps-architecture (source page, with TRM divergences enumerated) and maps-architecture (concept page distilling the inherited patterns: singleton map, side-effect-only Map* components, two-effect setup/setData split, two-source clustered+selected design, style-swap mapReady gate, sprite preload, rAF coalescer at the WS boundary, geofence editing via @mapbox/mapbox-gl-draw, three-way camera control split).

Updated react-spa heavily: appended the new source; corrected the "talks exclusively to Directus" claim that conflicted with live-channel-architecture (the SPA connects to two endpoints — Directus for business plane, Processor for telemetry firehose); locked in the stack (raw MapLibre over react-map-gl, Zustand over Redux, maplibre-google-maps adapter as optional Google-tiles path); added an Auth section documenting the same-domain cookie + reverse-proxy pattern; rewrote Real-time rendering to point at maps-architecture and headline the rAF coalescer + per-device bounded ring buffers. One sentence + cross-reference added to live-channel-architecture flagging consumer-side throughput discipline.

Headline takeaway: Traccar's frontend architecture is mostly correct — the lag the user experienced isn't the rendering layer (which is WebGL setData and fast) but throughput discipline (per-message Redux dispatch cascading through selectors and rebuilding feature collections at every position arrival). TRM inherits the architecture and adds an rAF coalescer at the WS boundary plus Zustand to neutralise the failure mode. Tile-source decision unblocked: Google Maps via the official Map Tiles API is legitimate through the maplibre-google-maps protocol adapter (bring-your-own-key, runtime-config-gated). Dogfood-day starter set: Esri World Imagery (satellite, free) + OpenTopoMap (free) + OSM raster, with Google Satellite as an optional add when an operator provides a key.
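The coalescer idea can be sketched independently of React or MapLibre. This is an illustration under assumptions, not the SPA's implementation: the class and type names are hypothetical, the scheduler is injected so the logic does not depend on a browser, and only the newest position per device is kept (trails would additionally need the per-device ring buffers).

```typescript
interface LivePosition {
  deviceId: string;
  ts: number;
  lat: number;
  lon: number;
}

// Injected so a browser can pass requestAnimationFrame and a test can pass a fake.
type FrameScheduler = (cb: () => void) => void;

class PositionCoalescer {
  private readonly pending = new Map<string, LivePosition>();
  private scheduled = false;
  private readonly flush: (batch: LivePosition[]) => void;
  private readonly schedule: FrameScheduler;

  constructor(flush: (batch: LivePosition[]) => void, schedule: FrameScheduler) {
    this.flush = flush;
    this.schedule = schedule;
  }

  // Called once per WebSocket message; does no rendering or store work.
  push(p: LivePosition): void {
    this.pending.set(p.deviceId, p); // newest position per device wins
    if (!this.scheduled) {
      this.scheduled = true;
      this.schedule(() => this.drain());
    }
  }

  // Runs at most once per frame and hands over one batch.
  private drain(): void {
    this.scheduled = false;
    const batch = Array.from(this.pending.values());
    this.pending.clear();
    if (batch.length > 0) this.flush(batch);
  }
}
```

In a browser the scheduler would be `(cb) => requestAnimationFrame(() => cb())`, so any number of messages arriving within one frame results in a single store update and a single setData call.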

[2026-05-02] synthesis | Processor WebSocket contract + wiki/planning drift surfaced

Wrote processor-ws-contract as the wire-level spec for the live-position WebSocket: endpoint shape, cookie-based auth handshake, subscribe/snapshot/streaming/unsubscribe protocol, reconnect semantics, multi-instance fan-out behaviour, connection limits, versioning rules. Both the SPA and the producing service will build against this page; changes require coordinated updates on both sides.

Surfaced a real wiki/planning drift while researching: processor entity page lists "Broadcast live positions" as a top-level responsibility and live-channel-architecture specifies the design, but the processor's actual planning roadmap (trm/processor/.planning/) has no task for it. Phase 1 (done) is throughput-only; Phase 2 is geofence/IO/timing; Phase 3 is hardening; Phase 4 only mentions a "WebSocket gateway" as an uncommitted fallback service. The drift happened because live-channel-architecture was synthesised on 2026-05-01, after Phase 1's plan had locked — the wiki absorbed the corrected design, the processor's planning didn't reconcile.

Recommendation pending user decision: Option A, a new processor phase ("Phase 1.5 — Live broadcast") that implements processor-ws-contract inside the processor service. The alternatives are Option B (a separate trm/live-gateway service, aligning with the old Phase 4 framing; it adds a deploy unit and contradicts the wiki) and Option C (defer the live map for the dogfood, which thins the SPA's value-add over Directus admin). The synthesis page is implementation-agnostic, so the contract is locked regardless of which option lands.

[2026-05-02] note | Phase 1.5 planning landed (Option A chosen)

Promoted the Processor's WebSocket broadcast endpoint to a real planning artefact. Created trm/processor/.planning/phase-1-5-live-broadcast/ with a phase README and six task files: 1.5.1 WS server scaffold + heartbeat, 1.5.2 cookie auth handshake, 1.5.3 subscription registry & per-event authorization, 1.5.4 broadcast consumer group & fan-out, 1.5.5 snapshot-on-subscribe, 1.5.6 integration test. Each follows the existing Phase 1 task-file shape (Goal / Deliverables / Specification / Acceptance / Risks / Done) so an implementer can pick one up self-contained.

Updated trm/processor/.planning/ROADMAP.md with a Phase 1.5 section between Phase 1 and Phase 2, including the per-task table. Pruned the stale "WebSocket gateway for live updates" candidate from Phase 4's README and reframed it as the documented live-channel-architecture escape hatch — to be promoted to a numbered phase only when measurements justify lifting the WS endpoint out of the Processor process. Updated processor-ws-contract's Implementation status section to reflect "planned as Phase 1.5" instead of "designed but not scheduled."

Wiki / planning drift surfaced earlier today is now closed: the wiki's processor / live-channel-architecture / processor-ws-contract design and the processor's planning roadmap agree on what gets built, where, and how it's sequenced. Implementation can start on 1.5.1 whenever; SPA work can proceed against processor-ws-contract in parallel as long as it doesn't ship to stage before Phase 1.5 lands.

SPA implementation surfaced that Directus SDK's 'cookie' auth mode doesn't survive a hard reload cleanly — the in-memory access token is gone, and /users/me 401s before autoRefresh can establish a new one. Switched the SPA to 'session' mode (authentication('session', { credentials: 'include' })), where the session itself lives in the httpOnly cookie and the browser sends it on every request including the WebSocket upgrade. Reload survives without any client-side state.

Updated react-spa §"Auth pattern" to describe session mode (single httpOnly session cookie, no separate access token, no /auth/refresh dance). Added a "Mode choice context" note explaining why session mode is the right default for an SPA that needs reload-survives behaviour.

Updated processor-ws-contract §"Auth handshake" to drop the explicit "(mode: cookie)" annotation and emphasise that the producer is cookie-name-agnostic — it forwards the entire Cookie header to /users/me and lets Directus identify the session. The producer's implementation was already cookie-name-agnostic in practice (the 1.5.2 implementation forwards the whole header), so no processor-side code change is needed; the wiki just now matches the implementation. Reframed "Cookie refresh while connected" open question as "Session expiry while connected" with the cleaner session-mode semantics.

Processor Phase 1.5 is fully shipped (c07ea0e 1.5.4, f4b50ca 1.5.5, 2f2cf5c 1.5.6) — six tasks, 178/178 unit tests, 6 integration scenarios. The cookie-mode language in the processor's planning task files (1.5.2 in particular) is left as-is — it's the historical spec the implementation landed against; the implementation itself is mode-agnostic.

[2026-05-02] note | TRM design handoff imported (deferred to SPA Phase 3.8)

User generated a design system via claude.ai/design and dropped the handoff bundle into trm/spa/TRM_Design_System-handoff/. Bloomberg/F1-pit-wall aesthetic — ink-on-paper base, race-flag red #E8412B accent, square-edged everything, sharp printed offset shadows (no blur), mono numerics for changing values, Goldplay (real licensed font, three weights) + JetBrains Mono + Inter. Four surfaces designed: dashboard / leaderboard / mobile / marketing — SPA scope covers the first two.

Adoption deferred to SPA Phase 3.8 ("Visual brand pass") because applying it now would either delay dogfood-blocking Phase 1/2 work or land partial styling that gets reworked. The bundle is committed in-tree (trm/spa/9e6b361) and Phase 3's README spells out the recommended approach: retheme shadcn via CSS-variable overrides + Tailwind 4 @theme block, don't replace primitives. Source-of-truth files for the future implementer: colors_and_type.css (tokens), chats/chat1.md (intent), the bundle's READMEs (specs), ui_kits/ (HTML prototypes per surface).

No wiki updates yet — design system isn't part of the architectural model and the surface-level styling isn't worth a wiki entity. If/when 3.8 lands and the brand becomes a stable fixture, a brief mention in [[react-spa]] is the right home.

[2026-05-02] note | trm/spa planning landed

User created trm/spa repo on Gitea and seeded a minimal Vite 8 + React 19 + TypeScript 6 scaffold (App.tsx returns "SPA"). Wrote the full planning structure mirroring the conventions established by trm/processor and trm/directus.

Created in trm/spa/.planning/:

  • ROADMAP.md — navigation hub with status legend, architectural anchors, eight non-negotiable design rules (singleton MapLibre, side-effect-only Map* components, rAF coalescer, same-origin-everything, in-memory access token, role-aware UI, runtime config, native PostGIS GeoJSON), four phases.
  • phase-1-foundation/ — README + 9 task files: 1.2 stack rounding-out (Tailwind + shadcn/ui + TanStack Router/Query + Zustand + @directus/sdk + zod + react-hook-form + Prettier), 1.3 Vite dev proxy + path aliases + tsconfig hardening, 1.4 runtime config endpoint, 1.5 Directus auth client (cookie mode + refresh + Zustand auth store), 1.6 login page, 1.7 routing skeleton (TanStack Router file-based + role-aware guards), 1.8 logout flow (with cross-tab sync), 1.9 Gitea CI + Dockerfile + nginx static serve, 1.10 compose service block in trm/deploy.
  • phase-2-live-map/README.md — sketched task table for the live-monitoring map; depends on processor Phase 1.5 landing. Nine tasks: MapLibre singleton, tile-source switcher, sprite preload, WS client + rAF coalescer + Zustand store, MapPositions, MapTrails, event picker, camera control trio, connection-status indicators.
  • phase-3-dogfood-readiness/README.md — error boundaries, connection-state UI, mobile-responsive baseline, per-device detail panel, empty/loading-state polish, Vitest setup, production logging, visual brand pass.
  • phase-4-future/README.md — geometry editor (depends on directus Phase 2), replay mode, heatmaps / deck.gl, i18n (Albanian), dark mode, Playwright E2E, leaderboard, spectator-facing public map, notifications, operator chat. None committed.

Each task file follows the existing Goal / Deliverables / Specification / Acceptance / Risks / Done shape so an implementer agent can pick one up self-contained. Phase 1 sequencing: 1.2 → 1.3 → 1.4 → 1.5 → (1.6 ‖ 1.7) → 1.8, with 1.9+1.10 (deploy plumbing) developable in parallel after 1.3 lands.

End state of Phase 1: a deployable empty shell — auth + protected routes + login/logout + CI + compose deploy block. End state of Phase 2: the dogfood-day deliverable. End state of Phase 3: actually fielded for race operators on race day, not just a tech demo.

[2026-05-03] note | Stage incident — positions dropped + WS unreachable + wiki realignment

Stage incident chain. Processor logged relation "positions" does not exist despite Redis consumer batches succeeding. Root cause: the migrations_applied guard table retained checksum rows for the three pre-schema migrations but the actual positions table had been dropped out-of-band (other 42 user tables intact — looks like a targeted DROP TABLE rather than a schema reset). directus/scripts/apply-db-init.sh trusts the guard table exclusively (no post-skip schema verification), so the runner logged skip for 001/002/003 and never re-applied. Fix: DELETE FROM migrations_applied WHERE filename IN ('002_positions_hypertable.sql', '003_faulty_column.sql'), restart Directus; idempotent SQL re-created the table; processor caught up immediately.

With the table back, the live WS still failed handshake (HTTP 400 in the browser). Diagnosis via Playwright + manual fetch probes against /ws-live and /ws-business: /ws-live returns openresty's empty 404 (no upstream routed); /ws-business returns Directus's "Route /websocket doesn't exist" (proxy is forwarding as plain HTTP, not as a WS upgrade). The migration off Portainer + nginx-proxy-manager onto Komodo + Traefik documented in NEW-HOST-KOMODO-TRAEFIK.md is incomplete for the new.stage.trmtracking.org hostname — the deploy compose for processor + directus is missing Traefik label blocks for /ws-live and /ws-business, and the Directus container needs WEBSOCKETS_ENABLED=true (+ WEBSOCKETS_HEARTBEAT_ENABLED=true).

Wiki realignments landed in this session:

  • processor-ws-contract — Phase 1.5 status flipped (the implementation landed 2026-05-02 per the "Auth-mode wiki realignment" entry above; the contract page hadn't caught up). Locked in the chosen paths (/ws-live for processor, /ws-business for Directus) — the page previously called /processor/ws "illustrative." Added a "Deployment" section with the Traefik label shape and the three things it depends on (same-origin, transparent upgrade, cookie forwarding). Reworded the Transport note about the internal hop to reference the proxy external network instead of the legacy trm_default.
  • live-channel-architecture — replaced the JWT-in-URL handshake description with the cookie-based same-origin handshake that has been truth since 2026-05-02. Generalised the "JWT validation strategy" open question to an "Auth caching strategy" question. Updated cross-references that still mentioned JWT-based auth on both endpoints. Added a callout footnote pointing readers at the auth-mode realignment log entry.
  • NEW-HOST-KOMODO-TRAEFIK.md (parent of docs/, infra contract not in wiki/) — added a "Per-host path map" section enumerating /, /api, /ws-business, /ws-live so the WebSocket paths are part of the documented infra contract instead of implicit.

Two architectural notes deferred (not done in this session, captured here so they're not forgotten):

  1. Runner gap — apply-db-init.sh doesn't verify schema state on subsequent boots. The runner records success in migrations_applied and trusts that exclusively; in-file assertion blocks (e.g. the DO $$ ... RAISE EXCEPTION block at the bottom of 002_positions_hypertable.sql) only run during apply, not on skip. Out-of-band drops produce silent drift — exactly today's failure mode. Two cheap mitigations: (a) re-run idempotent files unconditionally (cheap given IF NOT EXISTS everywhere), or (b) per-migration _check.sql companion files the runner executes even when skipping. Worth a hardening task in directus's planning.

  2. Positions hypertable as a Directus collection — primary-key blocker. Discussed the design tension: positions DDL lives in directus/db-init/ (TimescaleDB-specific, must exist before Directus boots), but Directus refuses to register the table as a collection because 002_positions_hypertable.sql deliberately omits a PRIMARY KEY (per its divergence note 6, calling unique-index "more idiomatic" for hypertables). Directus introspection requires a PK to expose the table — log evidence: WARN: Collection "positions" doesn't have a primary key column and will be ignored. To enable the operator faulty workflow described in directus-schema-draft, a future migration 004_positions_primary_key.sql would ALTER TABLE positions ADD PRIMARY KEY (device_id, ts) and DROP INDEX positions_device_ts (now redundant). PKs that include the partition column are legal on hypertables; the divergence note's preference for unique-index is a soft style choice, not a correctness constraint. Not done in this session — pending user go-ahead.
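Mitigation (a) from the runner-gap note, re-running idempotent files unconditionally, can be sketched as a planning function. The real runner is a shell script; this TypeScript version only illustrates the decision logic, and the names `planMigrations`, `PlannedMigration`, and the action labels are assumptions.

```typescript
type MigrationAction = "first-apply" | "re-apply";

interface PlannedMigration {
  filename: string;
  action: MigrationAction;
}

// Every file is executed on every boot. The guard table no longer decides
// whether a file runs; it only decides how the run is labelled and logged.
function planMigrations(files: string[], applied: Set<string>): PlannedMigration[] {
  return files
    .slice()
    .sort() // numeric prefixes (001_, 002_, ...) give the apply order
    .map((filename): PlannedMigration => ({
      filename,
      action: applied.has(filename) ? "re-apply" : "first-apply",
    }));
}
```

Because nothing is ever skipped, an out-of-band DROP TABLE is repaired on the next boot, provided every file stays idempotent.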

[2026-05-03] note | Live broadcast SQL fix — IMEI/UUID translation

Phase 1.5 was effectively dead in production: snapshot.ts and device-event-map.ts joined positions.device_id (text/IMEI) directly against entry_devices.device_id (uuid FK to devices.id). Two coordinated symptoms:

  1. Snapshot crash. Subscribe → subscribed { snapshot: [] } because Postgres rejected the join with operator does not exist: uuid = text (42883); registry caught the error and returned an empty snapshot, masking the failure in surface UX.
  2. Streaming silence. device-event-map cache keyed on entry_devices.device_id (uuid); broadcast.ts:141 looked up by position.device_id (imei). Cache missed every record → all positions classified as orphans → no fan-out, no errors.

broadcast.ts:53–64 had documented the IMEI/UUID divergence ("Phase 1's positions table stores the raw IMEI as device_id") but the join code in snapshot.ts and device-event-map.ts didn't follow through. Fix in processor f2c64a2: both queries hop through the devices table (devices.imei = positions.device_id, devices.id = entry_devices.device_id); device-event-map aliases d.imei AS device_id so cache keys remain IMEI strings and broadcast.ts is unchanged.
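The hop through devices can be shown with an in-memory version of the cache build. This is a simplified sketch, not the code in device-event-map.ts: the row shapes are assumptions, and `event_id` is placed directly on the entry_devices row for brevity even though the real chain may pass through entries.

```typescript
// Business-plane device row: uuid primary key plus the hardware IMEI.
interface DeviceRow {
  id: string;
  imei: string;
}

// Simplified junction row: device_id is devices.id (uuid), NOT the IMEI.
interface EntryDeviceRow {
  device_id: string;
  event_id: string;
}

// Builds IMEI -> event IDs by translating each uuid through devices, so the
// cache is keyed by the same identifier that position records carry.
function buildDeviceEventMap(
  devices: DeviceRow[],
  entryDevices: EntryDeviceRow[],
): Map<string, string[]> {
  const imeiById = new Map<string, string>();
  for (const d of devices) imeiById.set(d.id, d.imei);

  const byImei = new Map<string, string[]>();
  for (const ed of entryDevices) {
    const imei = imeiById.get(ed.device_id);
    if (imei === undefined) continue; // junction row without a device row
    const events = byImei.get(imei) ?? [];
    events.push(ed.event_id);
    byImei.set(imei, events);
  }
  return byImei;
}
```

Keying the map by uuid instead would reproduce the streaming-silence symptom: every lookup by a position's IMEI would miss and the record would be classified as an orphan.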

End-to-end verified via Playwright on app.dev.trmtracking.org/ws-live after Komodo redeploy: snapshot returns the live device's last position; 12 streamed position frames in a 30s window from IMEI 350424064163619 on event 9ddeba93-... (Rally Albania 2026 dev seed scaffolded today via directus-dev MCP — org msc-albania, vehicle, devices, entry+entry_devices).

Durable lesson — integration-test fixture schemas must mirror production join shapes. processor/test/fixtures/test-schema.sql shipped with entry_devices.device_id text to match the broken production join, so 178/178 unit tests + 6 integration scenarios passed against fiction. Production exposed it on first real subscribe. Updated the fixture schema to add a devices table and change entry_devices.device_id to uuid REFERENCES devices(id); integration test seed now inserts devices first.

Wiki updates landed in this session:

  • position-record — added §"device_id is the IMEI — not the business devices.id" capturing the storage shape, the join chain through devices, and why this is true (plane-separation means tcp-ingestion writes positions without a business-plane round-trip, so the only identifier it has is the IMEI).
  • processor-ws-contract — corrected the deviceId field semantics (Phase 1 implementation ships IMEI text on the wire, not devices.id uuid as originally specified — flagged as a documented divergence rather than a silent contract change). Added a "Server-side data resolution" section with the snapshot SQL and device-event-map SQL so any replacement gateway service reproduces the join shape rather than rediscovering it.

Three open architectural debts still outstanding (carried over from the earlier 2026-05-03 entry):

  1. apply-db-init.sh runner gap — schema verification on skip.
  2. positions hypertable PK — 004_positions_primary_key.sql would unblock Directus collection registration for the operator faulty workflow.
  3. Wire-format IMEI/UUID closure — Phase 2 question, not blocking dogfood.

[2026-05-04] note | Positions-as-collection abandoned; faulty-flag UI deferred to custom endpoint

Pushed directus/db-init/004_positions_primary_key.sql (PK (device_id, ts)) plus the runner hardening (apply-db-init.sh always re-applies, treats migrations_applied as a log). Boot logs confirmed the runner behaved as designed: re-apply 001/002/003, apply 004, summary 4 total (1 first-apply, 3 re-apply). But Directus still emits WARN: Collection "positions" doesn't have a primary key column and will be ignored.

Root cause: Directus introspection requires a single-column primary key per collection (confirmed against Directus docs); composite PKs trigger the exact warning above. TimescaleDB requires the partitioning column (ts) to be in every unique index on a hypertable. The two constraints are mutually exclusive on positions — no PK shape satisfies both. Migration 004's premise was wrong.
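The conflict can be made concrete by checking candidate key shapes against both rules. The sketch below is illustrative only: the predicates paraphrase the two constraints stated above, and the surrogate id column and the candidate list are hypothetical. A third check, that the key actually identifies one position row, is included because a PK on ts alone would pass both structural rules while colliding whenever two devices report at the same instant.

```typescript
type PkShape = string[];

// Directus rule as described above: exactly one primary-key column.
const directusAccepts = (pk: PkShape): boolean => pk.length === 1;

// TimescaleDB rule as described above: the partitioning column must be in the key.
const timescaleAccepts = (pk: PkShape): boolean => pk.includes("ts");

// Assumption for this sketch: a row is identified by (device_id, ts),
// or by a hypothetical surrogate "id" column.
const identifiesRow = (pk: PkShape): boolean =>
  pk.includes("id") || (pk.includes("device_id") && pk.includes("ts"));

const candidates: PkShape[] = [["device_id", "ts"], ["id"], ["id", "ts"], ["ts"]];

const viable = candidates.filter(
  (pk) => directusAccepts(pk) && timescaleAccepts(pk) && identifiesRow(pk),
);
console.log(viable.length); // 0
```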

Decision (architectural): the operator faulty-flag UI ships as a custom Directus endpoint extension, not as a Directus-introspected collection. Endpoint exposes flag candidates and accepts PATCH-by-(device_id, ts), emits the recompute webhook, enforces the same dynamic-filter authorization. Operator interface lives in the SPA (or a custom Directus module) consuming the endpoint. Migration 004 is left in place — the PK is harmless (redundant with positions_device_ts unique index) and may have other uses.

Deferred until after Rally Albania 2026-06-06. Per the dogfood event scope ("manual workflows for penalties/results/faulty-flag"), the flag stays manual SQL for the test event.

Wiki update: directus-schema-draft §"Position flagging" now carries the full decision record (the previous "exposed as a Directus collection" claim was the wrong premise) and the corresponding bullet in "Decisions made" was rewritten. position-record delegates to the schema draft and required no edit. Resolves debt #2 from the previous entry — not by implementing it as planned, but by abandoning the approach.

Open debts now:

  1. apply-db-init.sh runner gap — resolved (always-re-apply runner).
  2. positions hypertable PK / Directus collection — abandoned (custom endpoint deferred post-Albania).
  3. Wire-format IMEI/UUID closure — Phase 2 question, not blocking dogfood.
  4. SPA Phase 2 live map UX — markers, trails, rAF coalescer, event picker. Not built. Biggest remaining blocker for the dogfood event (without the map there's nothing user-visible at Rally Albania).

[2026-05-04] note | SPA Phase 2 cold-load truly works — three lifecycle bugs found and fixed

Started the day investigating "Phase 2 done in commits, but /monitor shows only the map canvas — no controls." react-spa's phase-2-live-map/README.md claimed all 9 tasks shipped, ROADMAP.md still said ⬜ Not started (drift). Code archaeology confirmed the README — every task had a feat: commit + a docs: backfill companion. So Phase 2 was committed but never verified against the production environment, and yesterday's processor IMEI/UUID fix was the first time the WS path produced anything for /monitor to consume.

Stood up a backends-only local stack (trm/deploy/compose.local.yaml, new file — Postgres + Redis + Directus + processor + tcp-ingestion, no Traefik, no SPA service; SPA runs via pnpm dev and Vite's dev proxy) so the cold-load could be debugged with source maps + devtools. Seeded the same Rally Albania 2026 shape via the directus-local MCP. Three independent lifecycle bugs in spa/src/map/core/:

  1. Map.loaded() deadlock. <MapView>'s onStyleData waited for _map.loaded() before opening the gate. Map.loaded() checks style.imageManager.isLoaded() internally, and that flag only flips true after MapLibre fetches a sprite URL declared in the style JSON. Our styles in map/core/styles.ts deliberately omit sprite — we manage images ourselves via installSprites(). So the predicate stayed false forever, gate never opened. Confirmed via React fiber dump: MapView.useMapReady() === false while every other precondition was satisfied.

  2. installSprites-styledata cascade. Even after replacing loaded() with isStyleLoaded() && areTilesLoaded(), instrumentation showed 87,494 styledata events in 28 seconds. Cause: installSprites() calls map.addImage() ~32 times; each addImage triggers a styledata event; each event re-entered onStyleData, scheduling another installSprites, ad infinitum. The original code masked this by virtue of loaded() always returning false, so the loop never reached installSprites at all. Fix: stop listening to styledata (too noisy: it fires for every internal style mutation, including image registration and source data updates) and listen to idle instead, scoped per-setStyle via _map.once('idle', …). idle only fires when style + tiles + transitions are all settled, which is exactly the lifecycle signal we need. Also extracted a setBasemap(style) API so style swaps go through one supported entry point that closes the gate, calls setStyle, and schedules the handshake.

  3. BasemapSwitcher bootstrap remount cycle. With (1) and (2) fixed, the map briefly opened then slammed shut every 350ms. Cause: BasemapSwitcher is a child of <MapView>, so it unmounts when the gate closes and remounts when it reopens. Its bootstrap useEffect (which applies the user's saved basemap preference at first mount) re-fired on every remount, calling setBasemap again, which closed the gate again. Fix: a module-level _basemapBootstrapped flag so the bootstrap fires once per page load, not once per mount. This bug had been present in the original code too — it was the deeper cause of the gate never opening; (1) and (2) were proximate causes that masked it.
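Fixes (2) and (3) can be sketched together. This is a simplified model, not the code in map-view.tsx or basemap-switcher.tsx: MapLike is a structural stand-in for the two MapLibre methods involved (once and setStyle), and BasemapGate, FakeMap, and bootstrapBasemapOnce are hypothetical names.

```typescript
// Structural subset of the map API this sketch needs.
interface MapLike {
  once(type: "idle", listener: () => void): unknown;
  setStyle(style: string): unknown;
}

class BasemapGate {
  private ready = false;
  private readonly map: MapLike;
  private readonly installSprites: () => void;

  constructor(map: MapLike, installSprites: () => void) {
    this.map = map;
    this.installSprites = installSprites;
  }

  isReady(): boolean {
    return this.ready;
  }

  // Single entry point for style swaps: close the gate, swap, wait for idle.
  setBasemap(style: string): void {
    this.ready = false;
    this.map.setStyle(style);
    // once() registers a one-shot listener, so this handler runs at most
    // one time per setBasemap call regardless of what installSprites emits.
    this.map.once("idle", () => {
      this.installSprites();
      this.ready = true;
    });
  }
}

// Module-level flag: survives component remounts, resets on page load.
let basemapBootstrapped = false;

function bootstrapBasemapOnce(gate: BasemapGate, savedStyle: string): boolean {
  if (basemapBootstrapped) return false;
  basemapBootstrapped = true;
  gate.setBasemap(savedStyle);
  return true;
}

// Test stand-in that fires "idle" on demand.
class FakeMap implements MapLike {
  styles: string[] = [];
  private idleListeners: Array<() => void> = [];
  once(_type: "idle", listener: () => void): void {
    this.idleListeners.push(listener);
  }
  setStyle(style: string): void {
    this.styles.push(style);
  }
  fireIdle(): void {
    const listeners = this.idleListeners;
    this.idleListeners = [];
    for (const listener of listeners) listener();
  }
}

const fakeMap = new FakeMap();
let spriteInstalls = 0;
const gate = new BasemapGate(fakeMap, () => { spriteInstalls += 1; });

bootstrapBasemapOnce(gate, "topo"); // first mount: applies the saved style
fakeMap.fireIdle();                 // gate opens, sprites installed once
bootstrapBasemapOnce(gate, "topo"); // remount: no-op, gate stays open
console.log(gate.isReady(), spriteInstalls, fakeMap.styles.length); // true 1 1
```

The second bootstrapBasemapOnce call models a remount: because the flag lives at module scope rather than in component state, it does not call setBasemap again and the gate is not closed.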

End-to-end Playwright verification on local: cold reload /monitor → all controls render → EventPicker dropdown opens on click → basemap swap (Esri → Topo) completes with one clean handshake cycle (scheduled → idle-fired @ +316ms → sprites-installed → ready-true). Body text reads "Tiles © OpenTopoMap … Rally Albania 2026 Satellite Topo Street TRAILS None Selected All Follow Live" — every Phase 2 component mounted. The "Live" connection chip means the WS to processor connected via Vite's /ws-live proxy (since both Directus and processor are bound to localhost on the host, same-origin holds via the proxy).

Side-issue surfaced and resolved during the same session: with tcp-ingestion running locally and WAN port 5027 forwarded to the PC, packets weren't reaching the container. Root cause was not Docker: wslrelay.exe (Microsoft's WSL2 port forwarder) listens only on 127.0.0.1 regardless of the 0.0.0.0:5027:5027 binding inside the WSL VM. Bridge it with one elevated netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=5027 connectaddress=127.0.0.1 connectport=5027. Long-term cleanup: WSL2 mirrored networking mode in ~/.wslconfig. Not Phase 2 work, but logged here for future reference if anyone else stumbles into it.

Wiki/planning updates landing in the same push:

  • trm/spa/.planning/ROADMAP.md — Phase 2 row flipped from ⬜ Not started to 🟩 Done with the last commit reference.
  • trm/spa/src/map/core/map-view.tsx — replaces styledata-driven gate with idle-event handshake, exposes setBasemap() as the supported style-swap entry, prepends (not appends) the map container so floating cards paint above. Header docstring on scheduleStyleLoadHandshake explains the idle-vs-styledata rationale so the next person doesn't revert it.
  • trm/spa/src/map/core/basemap-switcher.tsx — uses setBasemap(), gates the bootstrap useEffect with a module-level flag.
  • trm/deploy/compose.local.yaml (new file) — Postgres + Redis + Directus + processor + tcp-ingestion bound to host ports for local debugging via Vite dev proxy. No Traefik, no SPA service. Header comment documents the WEBSOCKETS_REST_PATH divergence from compose.dev.yaml (default /websocket here vs /ws-business on dev, because Vite's proxy rewrites /ws-business/websocket).

Resolves debt #4 from the previous entry — Phase 2 cold-load is verified end-to-end. Open debts:

  1. apply-db-init.sh runner gap — resolved.
  2. positions hypertable PK / Directus collection — abandoned.
  3. Wire-format IMEI/UUID closure — Phase 2 backend question, not blocking dogfood.
  4. SPA Phase 2 cold-load lifecycle — resolved today; pending push to dev.
  5. Phase 3 dogfood readiness — error boundaries, mobile responsive baseline, per-device detail panel, operator-friendly empty/loading states. The next biggest fish for Rally Albania 2026-06-06.

[2026-05-04] note | Phase 3 plan + tasks 3.1 / 3.2 / 3.5 landed on dev

Drafted trm/spa/.planning/phase-3-dogfood-readiness/ task plan: eight per-task .md files numbered to match task IDs (01-08 ↔ 3.1-3.8). Path-1 execution order picked: 3.1 → 3.2 → 3.5 → 3.4 → 3.3 → 3.6 → 3.7 → 3.8 (reliability surface first, then operator polish, then mobile + tests, then observability, then brand).

3.1: error boundaries (commit c1410b0). <ErrorBoundary regionId> class component + error-bus pub/sub + <RegionFallback /> panel. Three boundaries wired: chrome (in _authed/route.tsx), route (in __root.tsx), map (in monitor.tsx wrapping just the layer + camera children). Notable deviation: spec sketched the map boundary inside map-view.tsx wrapping ALL children, but that would also wrap the floating controls (event picker, basemap switcher, connection chip), which would have violated the spec's own acceptance bullet. Moved into monitor.tsx to scope the boundary to layers + camera. <MapView> stays a clean positioning singleton with no opinion on what's protected. Verified via Playwright with a temporary throw-on-query helper.
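The error-bus half of this can be sketched without React. The shape below is an assumption for illustration (the entry names only the pattern): createErrorBus, RegionError, and the listener signature are hypothetical, and the three region ids are taken from the boundaries listed above.

```typescript
type RegionId = "chrome" | "route" | "map";

interface RegionError {
  regionId: RegionId;
  error: Error;
}

type Listener = (event: RegionError) => void;

function createErrorBus() {
  const listeners = new Set<Listener>();
  return {
    // Returns an unsubscribe function so a component can clean up on unmount.
    subscribe(listener: Listener): () => void {
      listeners.add(listener);
      return () => {
        listeners.delete(listener);
      };
    },
    // A boundary would call this from its error handler.
    publish(event: RegionError): void {
      // Copy first so a listener that unsubscribes mid-dispatch is safe.
      for (const listener of Array.from(listeners)) listener(event);
    },
  };
}

const bus = createErrorBus();
const seen: string[] = [];
const unsubscribe = bus.subscribe((e) => {
  seen.push(`${e.regionId}: ${e.error.message}`);
});

bus.publish({ regionId: "map", error: new Error("layer crashed") });
unsubscribe();
bus.publish({ regionId: "route", error: new Error("ignored") });
console.log(seen); // ["map: layer crashed"]
```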

3.2: connection-state UI escalations (commit 8e830f0). <ConnectionBanner /> (top-of-map; muted at 5-30 s of outage, warning style ≥ 30 s or terminal disconnected) and <DeviceListPanel /> (collapsible card; sorted last-seen-ascending). Connection store gains outageStartedAt set on connected → !connected, cleared on the inverse, and explicitly NOT set on cold-load disconnected → connecting so the banner only escalates for mid-session outages.
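The outage bookkeeping reads as a small state transition plus a threshold check. The sketch below is an interpretation, not the store's code: the ConnState member names, both function names, and the choice to hide the banner below 5 s or when no outage is recorded are assumptions.

```typescript
// Hypothetical state names; the entry only names connected, connecting, disconnected.
type ConnState = "connecting" | "connected" | "reconnecting" | "disconnected";
type BannerLevel = "hidden" | "muted" | "warning";

// Set on connected -> !connected, cleared on !connected -> connected,
// otherwise unchanged (so cold-load disconnected -> connecting stays null).
function nextOutageStartedAt(
  prev: ConnState,
  next: ConnState,
  current: number | null,
  nowMs: number,
): number | null {
  if (prev === "connected" && next !== "connected") return nowMs;
  if (prev !== "connected" && next === "connected") return null;
  return current;
}

function bannerLevel(
  state: ConnState,
  outageStartedAt: number | null,
  nowMs: number,
): BannerLevel {
  if (outageStartedAt === null) return "hidden"; // no mid-session outage recorded
  const outageMs = nowMs - outageStartedAt;
  if (state === "disconnected" || outageMs >= 30000) return "warning";
  if (outageMs >= 5000) return "muted";
  return "hidden";
}

// Cold load, then a mid-session drop at t = 1000 ms.
let startedAt = nextOutageStartedAt("disconnected", "connecting", null, 0);
startedAt = nextOutageStartedAt("connecting", "connected", startedAt, 500);
startedAt = nextOutageStartedAt("connected", "reconnecting", startedAt, 1000);

console.log(bannerLevel("reconnecting", startedAt, 3000));  // "hidden"
console.log(bannerLevel("reconnecting", startedAt, 8000));  // "muted"
console.log(bannerLevel("reconnecting", startedAt, 31000)); // "warning"
```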

3.5: empty / loading state polish (commit bc7483f). <EmptyState /> + <LoadingState /> primitives, <MapStatusOverlay /> (soft "Connecting to live feed…" / "No devices reporting yet" cue layered over the map, hides itself once any positions arrive), a distinct "Connecting…" disabled state for <DeviceListPanel />, and a login submit-button spinner.

Open issue surfaced during 3.2 smoke: the device list panel renders 8-char device-id prefixes (e.g. "35042406") instead of IMEI labels because useDevicesById() doesn't find those IDs in Directus's devices table. The Redis stream's deviceId and Directus's devices.id are diverging — the IMEI/UUID closure debt (#3) surfacing in the SPA. Will block 3.4's vehicle / crew join until closed; flagged in 04-per-device-detail-panel.md's risks section.
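The mismatch is easy to reproduce in isolation. In this sketch the lookup is keyed by devices.id while the stream supplies an IMEI; deviceLabel and the row shape are hypothetical stand-ins for whatever the panel does with useDevicesById(), and the uuid value is a placeholder.

```typescript
interface DeviceRow { id: string; imei: string }

function deviceLabel(streamDeviceId: string, devicesById: Map<string, DeviceRow>): string {
  const device = devicesById.get(streamDeviceId);
  // Fallback when the lookup misses: first 8 characters of the id.
  return device ? device.imei : streamDeviceId.slice(0, 8);
}

const devicesById = new Map<string, DeviceRow>([
  ["dev-uuid-1", { id: "dev-uuid-1", imei: "350424064163619" }],
]);

console.log(deviceLabel("350424064163619", devicesById)); // "35042406" (miss: map is keyed by uuid)
console.log(deviceLabel("dev-uuid-1", devicesById));      // "350424064163619"
```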