Add planning documents for Phase 1 (throughput pipeline) and stub Phases 2-4

ROADMAP.md establishes status legend, architectural anchors pointing at the
wiki, and seven non-negotiable design rules — most importantly the
core/domain boundary that protects Phase 1 from Phase 2 churn, the
schema-authority split (positions hypertable owned here; everything else
owned by Directus), and idempotent writes via (device_id, ts) ON CONFLICT.

Phase 1 (throughput pipeline) is fully detailed across 11 task files:
scaffold, core types + sentinel decoder, config + logging, Postgres
hypertable, Redis Stream consumer, per-device LRU state, batched writer,
main wiring, observability, integration test, Dockerfile + Gitea CI.
Observability is in Phase 1 (not deferred) — lesson learned from
tcp-ingestion task 1.10.

Phases 2-4 are stub READMEs. Phase 2 (domain logic) blocks on Directus
schema decisions and lists those open questions explicitly. Phase 3
(production hardening) and Phase 4 (future) sketch the task shape.
2026-04-30 21:16:26 +02:00
parent 1a4202f4d1
commit c314ba0902
17 changed files with 1191 additions and 0 deletions
# Task 1.6 — Per-device in-memory state
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.2
**Wiki refs:** `docs/wiki/entities/processor.md` (§ State management)
## Goal
Maintain a bounded `Map<device_id, DeviceState>` updated on every accepted Position. Phase 1 only stores trivial state — `last_position`, `last_seen`, `position_count_session` — but the structure is built so Phase 2 (geofence accumulators, time-since-last-checkpoint, etc.) can extend it cleanly.
## Deliverables
- `src/core/state.ts` exporting:
- `createDeviceStateStore(config, logger): DeviceStateStore` — factory.
- `DeviceStateStore` interface:
- `update(position: Position): DeviceState` — applies the position, returns the new state. Touches LRU order.
- `get(device_id: string): DeviceState | undefined` — read without touching LRU order. (Used for diagnostics; the hot path uses `update`.)
- `size(): number` — for metrics.
- `evictedTotal(): number` — for metrics.
- `test/state.test.ts` covering:
- First update for a new device creates the entry; subsequent updates increment `position_count_session`.
- LRU eviction: with cap=3, after 4 distinct devices, the least-recently-updated is evicted.
- Eviction increments `evictedTotal()`.
- `last_seen` reflects the position's `timestamp` (the device-reported time), not the wall clock at update time.
- Out-of-order positions (a position with `timestamp` older than `last_seen`) are still applied (we don't drop them) but `last_seen` only advances forward — i.e. `last_seen = max(prev_last_seen, position.timestamp)`. Document the rationale.
## Specification
### LRU implementation
Use a plain `Map<string, DeviceState>`. JavaScript `Map` preserves insertion order, and we exploit it: on every `update`, `delete` then `set` the entry — that bumps it to the most recent position in iteration order. When `size() > cap`, take `keys().next().value` (the oldest) and `delete` it.
This is O(1) per update and avoids a third-party LRU dependency. **Do not** introduce `lru-cache` — the standard `Map` trick is sufficient for Phase 1's needs.
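A minimal sketch of the trick described above. Names here (`makeStore`, the inlined `Position` stand-in) are illustrative only, not the final `src/core/state.ts` API from task 1.2:

```typescript
// Local stand-in for the real Position type from task 1.2.
type Position = { device_id: string; timestamp: Date };

type DeviceState = {
  device_id: string;
  last_position: Position;
  last_seen: Date;                  // = max(prev, position.timestamp)
  position_count_session: number;   // resets on restart
};

// Illustrative factory; the real one takes (config, logger).
function makeStore(cap: number) {
  const states = new Map<string, DeviceState>();
  let evicted = 0;

  function update(position: Position): DeviceState {
    const prev = states.get(position.device_id);
    const next: DeviceState = {
      device_id: position.device_id,
      last_position: position,
      // last_seen only advances forward (see rationale below).
      last_seen:
        prev && prev.last_seen > position.timestamp
          ? prev.last_seen
          : position.timestamp,
      position_count_session: (prev?.position_count_session ?? 0) + 1,
    };
    // delete-then-set bumps the key to the end of Map iteration order.
    states.delete(position.device_id);
    states.set(position.device_id, next);
    if (states.size > cap) {
      // First key in iteration order = least recently updated.
      const oldest = states.keys().next().value as string;
      states.delete(oldest);
      evicted++;
    }
    return next;
  }

  return {
    update,
    get: (id: string) => states.get(id), // Map.get does not reorder entries
    size: () => states.size,
    evictedTotal: () => evicted,
  };
}
```

Note that `get` uses a plain `Map.get`, which does not affect iteration order, so diagnostic reads never perturb eviction order.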
### Why `last_seen = max(...)`, not `last_seen = position.timestamp`
Devices buffer records when offline and replay them in bursts (we observed a 55-record buffer flush on stage). Within a single batch, timestamps may *decrease* between consecutive records if the device sorted them oddly. We want `last_seen` to mean "highest device timestamp seen so far for this device" — that's what downstream consumers want.
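A tiny, hedged model of that rule (the function name is illustrative): every position is applied, but `last_seen` only moves forward.

```typescript
// last_seen = max(prev_last_seen, position.timestamp), with undefined
// meaning "no state yet for this device".
function advanceLastSeen(prev: Date | undefined, deviceTimestamp: Date): Date {
  if (prev === undefined) return deviceTimestamp;
  return prev > deviceTimestamp ? prev : deviceTimestamp;
}

// A replayed buffer flush with a mis-sorted middle record:
const flush = [new Date(100), new Date(300), new Date(200)];
let lastSeen: Date | undefined;
for (const ts of flush) lastSeen = advanceLastSeen(lastSeen, ts);
// The out-of-order Date(200) record was still applied, but last_seen
// stayed at Date(300) rather than regressing.
```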
### What about restart?
On Processor restart, the in-memory state is empty. The first record from any device creates a fresh `DeviceState`. **Phase 1 accepts this** — it's a recovery path, not a hot path, and Phase 1 has no domain logic that would be wrong without rehydrated state.
Phase 3 (production hardening) adds rehydration: on first packet for an unknown device, query `positions WHERE device_id = $1 ORDER BY ts DESC LIMIT 1` to seed `last_position`. That's a Phase 3 task, not Phase 1.
### What state lives here, what doesn't
In Phase 1 the state is intentionally minimal:
```ts
type DeviceState = {
device_id: string;
last_position: Position;
last_seen: Date; // = max(prev, position.timestamp)
position_count_session: number; // resets on restart
};
```
**Not in Phase 1:**
- Geofence membership (Phase 2)
- Distance accumulators (Phase 2)
- Time-in-stage (Phase 2)
- Anything that would be wrong if dropped on restart (Phase 3 + rehydration)
The interface is designed to be extended: Phase 2 may add fields, but the existing fields and method signatures should not change.
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] LRU cap from `DEVICE_STATE_LRU_CAP` config is respected.
- [ ] `evictedTotal()` increments correctly under eviction.
- [ ] `last_seen` does not regress on out-of-order timestamps.
## Risks / open questions
- **Cap sizing.** Default `DEVICE_STATE_LRU_CAP=10000`. At 1KB per state entry, that's 10MB of resident memory — fine. Operators with unusually large fleets can raise it; the bound exists to prevent runaway growth from misbehaving devices flooding novel `device_id` values.
- **No mutex.** State is updated only from the consumer loop, which is single-threaded. If Phase 2 introduces parallel sinks, revisit with proper synchronization.
## Done
(Fill in once complete: commit SHA, brief notes.)