Add planning documents for Phase 1 (throughput pipeline) and stub Phases 2-4

ROADMAP.md establishes status legend, architectural anchors pointing at the
wiki, and seven non-negotiable design rules — most importantly the
core/domain boundary that protects Phase 1 from Phase 2 churn, the
schema-authority split (positions hypertable owned here; everything else
owned by Directus), and idempotent-writes via (device_id, ts) ON CONFLICT.

Phase 1 (throughput pipeline) is fully detailed across 11 task files:
scaffold, core types + sentinel decoder, config + logging, Postgres
hypertable, Redis Stream consumer, per-device LRU state, batched writer,
main wiring, observability, integration test, Dockerfile + Gitea CI.
Observability is in Phase 1 (not deferred) — lesson learned from
tcp-ingestion task 1.10.

Phases 2-4 are stub READMEs. Phase 2 (domain logic) blocks on Directus
schema decisions and lists those open questions explicitly. Phase 3
(production hardening) and Phase 4 (future) sketch the task shape.
2026-04-30 21:16:26 +02:00
parent 1a4202f4d1
commit c314ba0902
17 changed files with 1191 additions and 0 deletions
@@ -0,0 +1,90 @@
# processor — Roadmap
A Node.js worker service that consumes normalized `Position` records from a Redis Stream, maintains per-device runtime state, applies racing-domain rules, and writes durable state to Postgres / TimescaleDB.
This file is the single navigation hub for all implementation planning. Each phase has its own folder with a README and granular task files. Update statuses here as work lands.
## Status legend
| Symbol | Meaning |
|--------|---------|
| ⬜ | Not started |
| 🟦 | Planned (designed, not coded) |
| 🟨 | In progress |
| 🟩 | Done |
| ⏸ | Paused / blocked |
| ❄️ | Frozen / future / optional |
## Architectural anchors
The service is specified by the wiki at `../docs/wiki/`. Implementing agents should read these pages before starting any task:
- **Architecture** — `docs/wiki/sources/gps-tracking-architecture.md`, `docs/wiki/concepts/plane-separation.md`, `docs/wiki/concepts/failure-domains.md`
- **This service** — `docs/wiki/entities/processor.md`
- **Upstream contract (input)** — `docs/wiki/concepts/position-record.md`, `docs/wiki/concepts/io-element-bag.md`, `docs/wiki/entities/redis-streams.md`
- **Downstream contract (output)** — `docs/wiki/entities/postgres-timescaledb.md`, `docs/wiki/entities/directus.md`
## Non-negotiable design rules
These rules govern every task. Any deviation must be discussed and documented as a decision before code lands.
1. **Domain logic is isolated.** `src/core/` (Stream consumer, Postgres writer, in-memory state plumbing) never imports from `src/domain/` (geofence engine, timing logic, IO interpretation). Phase 2 must be a pure addition layered on top of the Phase 1 throughput pipeline.
2. **Schema authority lives in Directus**, with one exception: the `positions` hypertable is bulk-written by this service and its migration is owned here. All other tables Processor writes to (timing_records, stage_results, etc.) are defined in Directus and Processor inserts respecting that schema.
3. **Per-device state is in-memory, not durable.** The DB is the source of truth for replay/analysis; in-memory state is the source of truth for the current decision. On restart, hot state can be rebuilt from the DB — Phase 1 does not implement that rehydration (it lands in Phase 3); restart loses state, which is acceptable until Phase 2 introduces stateful accumulators.
4. **Consumer-group offsets drive work distribution.** No application-level coordination between Processor instances. `XACK` on success; failed batches stay pending and are claimed by surviving instances via `XAUTOCLAIM`.
5. **Idempotent writes.** Records arriving twice (after a claim, replay, or retry) must not produce duplicate rows. The `positions` hypertable uses `(device_id, ts)` as a unique key with `ON CONFLICT DO NOTHING`. Derived tables follow the same pattern, scoped by their natural keys.
6. **Bounded memory.** Per-device state is capped (LRU eviction by last-seen timestamp); replay-from-DB rehydrates an evicted device on next packet. No unbounded `Map<imei, ...>` growth.
7. **Fail loudly.** Schema-incompatible records (e.g. malformed payload, unknown sentinel) are logged at `error` and dead-letter-streamed (Phase 3); they do **not** silently skip.
## Phases
### Phase 1 — Throughput pipeline
**Status:** ⬜ Not started
**Outcome:** A Node.js Processor that joins a Redis Streams consumer group on `telemetry:t`, decodes each `Position` (including `__bigint`/`__buffer_b64` sentinel reversal), upserts it into a TimescaleDB `positions` hypertable, updates per-device in-memory state (last position, last seen), `XACK`s on successful write, and exposes Prometheus metrics + health/readiness HTTP endpoints. End-to-end pilot-quality service; no domain logic yet.
[**See `phase-1-throughput/README.md`**](./phase-1-throughput/README.md)
| # | Task | Status | Landed in |
|---|------|--------|-----------|
| 1.1 | [Project scaffold](./phase-1-throughput/01-project-scaffold.md) | ⬜ | — |
| 1.2 | [Core types & contracts](./phase-1-throughput/02-core-types.md) | ⬜ | — |
| 1.3 | [Configuration & logging](./phase-1-throughput/03-config-and-logging.md) | ⬜ | — |
| 1.4 | [Postgres connection & `positions` hypertable](./phase-1-throughput/04-postgres-schema.md) | ⬜ | — |
| 1.5 | [Redis Stream consumer (XREADGROUP)](./phase-1-throughput/05-stream-consumer.md) | ⬜ | — |
| 1.6 | [Per-device in-memory state](./phase-1-throughput/06-device-state.md) | ⬜ | — |
| 1.7 | [Position writer (batched upsert)](./phase-1-throughput/07-position-writer.md) | ⬜ | — |
| 1.8 | [Main wiring & ACK semantics](./phase-1-throughput/08-main-wiring.md) | ⬜ | — |
| 1.9 | [Observability (Prometheus metrics + /healthz + /readyz)](./phase-1-throughput/09-observability.md) | ⬜ | — |
| 1.10 | [Integration test (testcontainers Redis + Postgres)](./phase-1-throughput/10-integration-test.md) | ⬜ | — |
| 1.11 | [Dockerfile & Gitea workflow](./phase-1-throughput/11-dockerfile-and-ci.md) | ⬜ | — |
### Phase 2 — Domain logic
**Status:** ⬜ Not started — blocks on Directus schema decisions
**Outcome:** Geofence engine that detects entry/checkpoint/finish crossings; per-model Teltonika IO mapping driving derived attributes (`odometer_km`, `ignition`, etc.); timing record writer producing entries in the Directus-owned `timing_records` table; per-stage result aggregator. Layered on top of Phase 1 — no changes to the throughput pipeline.
[**See `phase-2-domain/README.md`**](./phase-2-domain/README.md)
Detailed task breakdown deferred until the Directus schema is finalized (open questions about geofence ownership, IO mapping storage, stage vocabulary). Phase 1 can ship and run on stage without any Phase 2 work.
### Phase 3 — Production hardening
**Status:** ⬜ Not started
**Outcome:** Graceful shutdown with consumer-group commit on SIGTERM; per-device state rehydration from Postgres on startup (only loaded on first packet for a given device); `XAUTOCLAIM` for stuck pending entries from a dead instance; dead-letter stream for poison records; multi-instance load-split verification; `OPERATIONS.md` runbook.
[**See `phase-3-hardening/README.md`**](./phase-3-hardening/README.md)
### Phase 4 — Future / optional
**Status:** ❄️ Not committed
[**See `phase-4-future/README.md`**](./phase-4-future/README.md)
Ideas on radar: Directus Flow trigger emission, replay tooling, derived-metric backfill, alternate consumer for analytics export.
## Operating model
- **Implementation agent contract.** Each task file is self-sufficient: goal, deliverables, specification, acceptance criteria. An agent should be able to complete one task without reading the whole wiki — but should skim the wiki references at the top of the task before starting.
- **Sequence within a phase.** Task numbering reflects intended order. Soft dependencies are explicit in each task's "Depends on" field. Tasks with no dependencies on each other can be done in parallel.
- **Status updates.** When a task is started, change its row in this ROADMAP to 🟨 and the task file's status badge accordingly. When done, 🟩 + a one-line note in the task file's "Done" section pointing at the merging commit/PR.
- **Drift control.** If implementation diverges from a task's spec, update the task file *before* the diverging code lands, with a note explaining why. Do not let plans rot — either fix the plan or fix the code.
@@ -0,0 +1,58 @@
# Task 1.1 — Project scaffold
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** None
**Wiki refs:** `docs/wiki/entities/processor.md`
## Goal
Initialize the Node.js / TypeScript project with the directory layout from the Phase 1 README, install the agreed tooling, and produce a minimal `main.ts` that the rest of Phase 1 builds on. Mirror the `tcp-ingestion` scaffold conventions exactly so the two services feel uniform.
## Deliverables
- `package.json` declaring:
- `"type": "module"` (ESM only).
- `"engines": { "node": ">=22" }`.
- Scripts: `build`, `dev`, `start`, `test`, `test:watch`, `test:integration`, `lint`, `format`, `typecheck`.
- Dependencies: `ioredis`, `pg`, `pino`, `prom-client`, `zod`.
- Dev dependencies: `typescript`, `@types/node`, `@types/pg`, `vitest`, `@vitest/coverage-v8`, `eslint`, `@typescript-eslint/parser`, `@typescript-eslint/eslint-plugin`, `eslint-plugin-import`, `eslint-import-resolver-typescript`, `prettier`, `pino-pretty`, `tsx`, `testcontainers`.
- `tsconfig.json` — same as `tcp-ingestion`: `strict: true`, `target: ES2022`, `module: NodeNext`, `moduleResolution: NodeNext`, `outDir: dist`, `rootDir: src`, `noUncheckedIndexedAccess: true`.
- `eslint.config.js` (flat config) with `@typescript-eslint/recommended-type-checked`, `@typescript-eslint/no-floating-promises`, `@typescript-eslint/no-misused-promises`. Add `import/no-restricted-paths` enforcing **`src/core/` cannot import from `src/domain/`**. (`src/domain/` doesn't exist yet — the rule is preemptive so Phase 2 can't violate the boundary by accident.)
- `.prettierrc` — match `tcp-ingestion` (2 spaces, single quotes, semis).
- `.gitignore` — `node_modules/`, `dist/`, `coverage/`, `.env`, `.env.local`, `*.log`.
- `.dockerignore` — same as `.gitignore` plus `.git/`, `.planning/`, `test/`, `*.md` except `README.md`.
- `vitest.config.ts` — unit-test config that excludes `*.integration.test.ts`.
- `vitest.integration.config.ts` — integration-test config with `hookTimeout: 120_000`, `testTimeout: 60_000`. Include only `*.integration.test.ts`.
- `.env.example` documenting every env var (with descriptions and defaults). Required vars only: `REDIS_URL`, `POSTGRES_URL`. All others have sensible defaults.
- Empty directories with `.gitkeep` files where Phase 1 will fill them in:
- `src/core/`, `src/db/migrations/`, `src/config/`, `src/observability/`
- `test/`
- `src/main.ts` — minimal stub: imports nothing yet, prints `processor starting` to stdout, exits with code 0.
- `README.md` — short description pointing at `.planning/ROADMAP.md` for the work plan, and at `../docs/wiki/entities/processor.md` for the architectural specification.
## Specification
- **Package manager:** pnpm. Commit `pnpm-lock.yaml`. The Dockerfile (task 1.11) will use `pnpm fetch` for layer-cache friendliness.
- **Module style:** ESM throughout. Relative imports use `.js` suffix per Node ESM resolution. No `paths` aliases.
- **No bundler.** Build is `tsc` only. Runtime is plain Node consuming `dist/`.
- **Linting style:** Configure ESLint to enforce no-floating-promises and no-misused-promises — both critical in a stream consumer where unhandled rejection silently loses work.
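The core/domain boundary rule from the deliverables might be wired roughly like this in the flat config — a sketch only; the zone paths and message wording are assumptions:
```js
// eslint.config.js — sketch of the preemptive core/domain boundary rule.
import importPlugin from 'eslint-plugin-import';

export default [
  {
    files: ['src/**/*.ts'],
    plugins: { import: importPlugin },
    rules: {
      'import/no-restricted-paths': [
        'error',
        {
          zones: [
            {
              target: './src/core', // files being restricted...
              from: './src/domain', // ...may not import from here
              message: 'src/core must not import from src/domain (ROADMAP rule 1).',
            },
          ],
        },
      ],
    },
  },
];
```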
## Acceptance criteria
- [ ] `pnpm install` succeeds with no warnings other than peer deps.
- [ ] `pnpm typecheck` succeeds on the empty project.
- [ ] `pnpm lint` succeeds.
- [ ] `pnpm build` produces `dist/main.js`.
- [ ] `pnpm start` runs the compiled output and prints the startup message.
- [ ] `pnpm test` runs (with no tests) and exits successfully.
- [ ] `pnpm dev` runs `main.ts` via `tsx` and prints the startup message.
- [ ] Repository builds reproducibly: deleting `node_modules` and `dist`, then `pnpm install --frozen-lockfile && pnpm build` produces identical output.
## Risks / open questions
- The `import/no-restricted-paths` rule is preemptive and will be silently inert until Phase 2 introduces `src/domain/`. Verify it activates correctly with a temporary `src/domain/foo.ts` during scaffold setup, then remove.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,66 @@
# Task 1.2 — Core types & contracts
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.1
**Wiki refs:** `docs/wiki/concepts/position-record.md`, `docs/wiki/concepts/io-element-bag.md`
## Goal
Define the canonical TypeScript types for the data flowing through the Processor: the `Position` record (input from Redis), the per-device runtime state, the metrics interface, and the codec for reversing the JSON sentinels (`__bigint`, `__buffer_b64`) that `tcp-ingestion` produces.
This task does **not** add behaviour — only types and a sentinel decoder with unit tests. Behaviour is layered on in subsequent tasks.
## Deliverables
- `src/core/types.ts` exporting:
- `Position` — must be byte-equivalent to `tcp-ingestion`'s output. Fields: `device_id: string`, `timestamp: Date`, `latitude: number`, `longitude: number`, `altitude: number`, `angle: number`, `speed: number`, `satellites: number`, `priority: number`, `attributes: Record<string, AttributeValue>`. Where `AttributeValue = number | bigint | Buffer`.
- `StreamRecord` — what `XREADGROUP` actually returns: `{ id: string; ts: string; device_id: string; codec: string; payload: string }`. The `payload` field is the JSON-encoded `Position` (still sentinel-encoded — the consumer decodes it).
- `DeviceState` — `{ device_id: string; last_position: Position; last_seen: Date; position_count_session: number }`.
- `Metrics` interface — same shape as `tcp-ingestion`: `inc(name: string, labels?: Record<string, string>): void`, `observe(name: string, value: number, labels?: Record<string, string>): void`. Don't extend it in Phase 1; task 1.9 may add helpers but the interface stays.
- `src/core/codec.ts` exporting:
- `decodePosition(payload: string): Position` — JSON-parses with a reviver that detects `{ __bigint: "..." }` → `BigInt(...)` and `{ __buffer_b64: "..." }` → `Buffer.from(s, 'base64')`. Throws `CodecError` with a clear message on malformed payloads.
- `class CodecError extends Error` for failure cases.
- `test/codec.test.ts` covering:
- Round-trip a Position with bigint and Buffer attributes through `tcp-ingestion`'s `serializePosition` (copy the helper into the test or inline-encode) → `decodePosition` → assert byte-equal.
- Decode a u64-max bigint sentinel.
- Decode a Buffer sentinel with non-UTF-8 bytes (e.g. `0xde 0xad 0xbe 0xef`).
- Reject malformed payload (non-JSON, missing required fields, invalid sentinel shape).
- `device_id`, `timestamp` (ISO string → Date), and numeric fields all decode correctly.
## Specification
### Sentinel reversal — exact rules
The reviver runs on every JSON value. For each value:
1. If the value is an object with exactly one property `__bigint` whose value is a string of digits → return `BigInt(value.__bigint)`. Validate that the string parses or throw.
2. If the value is an object with exactly one property `__buffer_b64` whose value is a base64 string → return `Buffer.from(value.__buffer_b64, 'base64')`.
3. If the key is `timestamp` and the value is a string → return `new Date(value)` (validate it parsed; reject `Invalid Date`).
4. Otherwise pass through.
**Critical:** the reviver must not match shapes broader than the sentinels. A user-defined attribute `{ __bigint: "..." }` is by definition a sentinel — there is no ambiguity because `tcp-ingestion` only uses these shapes for sentinel encoding. But validate the inner string strictly so a malformed attribute fails fast.
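A minimal sketch of the reviver under these rules (error messages and the field-validation placement are illustrative, not prescriptive):
```ts
import { Buffer } from 'node:buffer';
import type { Position } from './types.js';

export class CodecError extends Error {}

function reviver(key: string, value: unknown): unknown {
  if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
    const keys = Object.keys(value);
    if (keys.length === 1 && keys[0] === '__bigint') {
      const s = (value as { __bigint: unknown }).__bigint;
      if (typeof s !== 'string' || !/^\d+$/.test(s)) throw new CodecError(`invalid __bigint sentinel at "${key}"`);
      return BigInt(s);
    }
    if (keys.length === 1 && keys[0] === '__buffer_b64') {
      const s = (value as { __buffer_b64: unknown }).__buffer_b64;
      if (typeof s !== 'string') throw new CodecError(`invalid __buffer_b64 sentinel at "${key}"`);
      return Buffer.from(s, 'base64');
    }
  }
  if (key === 'timestamp' && typeof value === 'string') {
    const d = new Date(value);
    if (Number.isNaN(d.getTime())) throw new CodecError('timestamp is not a valid ISO date');
    return d;
  }
  return value;
}

export function decodePosition(payload: string): Position {
  try {
    return JSON.parse(payload, reviver) as Position; // required-field presence checks omitted in this sketch
  } catch (err) {
    if (err instanceof CodecError) throw err;
    throw new CodecError(`payload is not valid JSON: ${(err as Error).message}`);
  }
}
```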
### Why `Buffer`, not `Uint8Array`
`tcp-ingestion` uses Node's `Buffer`. We're also Node-only. Using `Buffer` avoids the conversion cost on every record. If the platform later needs to support browser-side decoding (e.g. for a debug tool), introduce a `Uint8Array`-based parallel path then; not now.
### Why `bigint`, not `number`
Some Teltonika IO elements are u64 values that exceed `Number.MAX_SAFE_INTEGER` (2^53 − 1). Silently truncating to `number` would corrupt those values. Phase 1 preserves them as `bigint`; the Position writer (task 1.7) decides how to store them in Postgres (likely `numeric` or stringified — see that task).
## Acceptance criteria
- [ ] `pnpm typecheck` succeeds.
- [ ] `pnpm test` runs `codec.test.ts` and all cases pass.
- [ ] A round-tripped Position with `bigint` and `Buffer` attributes matches the original byte-for-byte (including `Buffer` content equality, not just length).
- [ ] Malformed payloads throw `CodecError` with a descriptive message that names the failing field.
## Risks / open questions
- The reviver runs on the full JSON tree, including the top-level object. Verify that nested attributes (rare, but possible if Teltonika ever sends nested IO bags in Codec 16) decode correctly. The spec doesn't currently allow nesting; treat unexpected nesting as an error.
- TypeScript inference for revivers is awkward (`(key: string, value: any) => any`). Use a typed wrapper to surface the result as `Position` without `any` leakage outside the codec module.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,76 @@
# Task 1.3 — Configuration & logging
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.1
**Wiki refs:** `docs/wiki/entities/processor.md`
## Goal
Validate environment variables on startup with `zod`, build the pino root logger with the same conventions as `tcp-ingestion` (ISO timestamps, string level labels, instance_id base field), and fail fast with a readable error message if config is invalid.
## Deliverables
- `src/config/load.ts` exporting:
- `loadConfig(): Config` — reads `process.env`, runs zod parse, returns a typed `Config`. Throws on invalid input with a multi-line message that names every invalid field.
- `Config` type derived from the zod schema.
- `src/observability/logger.ts` exporting:
- `createLogger({ level, nodeEnv, instanceId }): Logger` — pino root logger with base fields `service: 'processor'`, `instance_id`. ISO timestamps via `pino.stdTimeFunctions.isoTime`. Level formatter that emits `"level":"info"` not `"level":30`. In `nodeEnv === 'development'`, use the pino-pretty transport.
- `type Logger` re-exported from `pino`.
- Wire both into `src/main.ts`: `loadConfig()` → `createLogger()` → `logger.info('processor starting')` → exit 0 (still a stub; consumer wiring lands in 1.8).
## Specification
### Environment variables
| Var | Required | Default | Notes |
|---|---|---|---|
| `NODE_ENV` | no | `production` | `development` enables pino-pretty |
| `INSTANCE_ID` | no | `processor-1` | Used in metrics + log base field |
| `LOG_LEVEL` | no | `info` | `trace` / `debug` / `info` / `warn` / `error` |
| `REDIS_URL` | yes | — | e.g. `redis://redis:6379` |
| `POSTGRES_URL` | yes | — | e.g. `postgres://user:pass@db:5432/trm` |
| `REDIS_TELEMETRY_STREAM` | no | `telemetry:t` | Must match `tcp-ingestion`'s `REDIS_TELEMETRY_STREAM` |
| `REDIS_CONSUMER_GROUP` | no | `processor` | All Processor instances join this group |
| `REDIS_CONSUMER_NAME` | no | `${INSTANCE_ID}` | Unique per instance — defaults to instance id |
| `METRICS_PORT` | no | `9090` | HTTP server port for `/metrics`, `/healthz`, `/readyz` |
| `BATCH_SIZE` | no | `100` | Max records per `XREADGROUP` call |
| `BATCH_BLOCK_MS` | no | `5000` | `BLOCK` timeout on `XREADGROUP` when stream is empty |
| `WRITE_BATCH_SIZE` | no | `50` | Max rows per Postgres `INSERT` |
| `DEVICE_STATE_LRU_CAP` | no | `10000` | Max devices kept in memory; LRU eviction beyond this |
### Validation rules
- All defaults must be expressed in the zod schema with `.default(...)` so the parsed `Config` is fully typed and never has `undefined` for an optional field.
- Numeric env vars must be coerced from string and bounded: `BATCH_SIZE` 1–10000, `BATCH_BLOCK_MS` 0–60000, `WRITE_BATCH_SIZE` 1–1000, `DEVICE_STATE_LRU_CAP` 100–1_000_000.
- `REDIS_URL` and `POSTGRES_URL` must parse as URLs with the expected protocol (`redis:` or `rediss:`; `postgres:` or `postgresql:`).
- `LOG_LEVEL` must be one of pino's accepted levels.
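A sketch of the schema these rules imply (zod v3 API; trimming or renaming fields is the implementer's call):
```ts
import { z } from 'zod';

const schema = z
  .object({
    NODE_ENV: z.enum(['development', 'production', 'test']).default('production'),
    INSTANCE_ID: z.string().min(1).default('processor-1'),
    LOG_LEVEL: z.enum(['trace', 'debug', 'info', 'warn', 'error']).default('info'),
    REDIS_URL: z.string().url().refine((u) => /^rediss?:/.test(u), { message: 'must be redis:// or rediss://' }),
    POSTGRES_URL: z.string().url().refine((u) => /^postgres(ql)?:/.test(u), { message: 'must be postgres:// or postgresql://' }),
    REDIS_TELEMETRY_STREAM: z.string().default('telemetry:t'),
    REDIS_CONSUMER_GROUP: z.string().default('processor'),
    REDIS_CONSUMER_NAME: z.string().optional(), // defaulted to INSTANCE_ID below
    METRICS_PORT: z.coerce.number().int().min(1).max(65535).default(9090),
    BATCH_SIZE: z.coerce.number().int().min(1).max(10_000).default(100),
    BATCH_BLOCK_MS: z.coerce.number().int().min(0).max(60_000).default(5_000),
    WRITE_BATCH_SIZE: z.coerce.number().int().min(1).max(1_000).default(50),
    DEVICE_STATE_LRU_CAP: z.coerce.number().int().min(100).max(1_000_000).default(10_000),
  })
  .transform((c) => ({ ...c, REDIS_CONSUMER_NAME: c.REDIS_CONSUMER_NAME ?? c.INSTANCE_ID }));

export type Config = z.output<typeof schema>;

export function loadConfig(env: NodeJS.ProcessEnv = process.env): Config {
  const parsed = schema.safeParse(env);
  if (!parsed.success) {
    // Multi-line error naming every invalid field, as the deliverable requires.
    const lines = parsed.error.issues.map((i) => `  ${i.path.join('.')}: ${i.message}`);
    throw new Error(`Invalid configuration:\n${lines.join('\n')}`);
  }
  return parsed.data;
}
```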
### Logger conventions
Match `tcp-ingestion/src/observability/logger.ts` line for line where applicable. Future-you grepping across services should see the same shape:
```ts
const formatters = { level: (label: string) => ({ level: label }) };
if (nodeEnv === 'development') {
return pino({ level, base, timestamp: pino.stdTimeFunctions.isoTime, formatters,
transport: { target: 'pino-pretty', options: { colorize: true, translateTime: 'SYS:standard', ignore: 'pid,hostname' } } });
}
return pino({ level, base, timestamp: pino.stdTimeFunctions.isoTime, formatters });
```
## Acceptance criteria
- [ ] `pnpm test` covers config validation: missing required vars throw with the right message; invalid URLs throw; bounded numerics throw on out-of-range values.
- [ ] Running with valid env emits a single `processor starting` info log with `service=processor` and `instance_id=processor-1` base fields.
- [ ] Running with `NODE_ENV=development` produces colorized output via pino-pretty.
- [ ] Running with `NODE_ENV=production` produces JSON output with ISO `time` and string `level`.
## Risks / open questions
- `REDIS_CONSUMER_NAME` defaulting to `INSTANCE_ID` means `INSTANCE_ID` must be unique per instance for safe consumer-group operation. Document this in `.env.example` so operators don't accidentally run two instances with the same `INSTANCE_ID`.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,89 @@
# Task 1.4 — Postgres connection & `positions` hypertable
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.1, 1.3
**Wiki refs:** `docs/wiki/entities/postgres-timescaledb.md`
## Goal
Stand up the Postgres connection (a single `pg.Pool`) and define the `positions` hypertable migration. This is the only table whose schema the Processor owns directly (per the design rule in ROADMAP.md). Every other table is owned by Directus.
## Deliverables
- `src/db/pool.ts` exporting:
- `createPool(url: string): pg.Pool` — single Pool with sane defaults (`max: 10`, `idleTimeoutMillis: 30_000`, `connectionTimeoutMillis: 5_000`). Sets `application_name = 'processor'` so connections are identifiable in `pg_stat_activity`.
- `connectWithRetry(pool, logger): Promise<void>` — runs `SELECT 1` with exponential backoff (3 attempts, up to 5s). Mirrors `tcp-ingestion`'s `connectRedis` pattern. Calls `process.exit(1)` on final failure. See the sketch after this list.
- `src/db/migrations/0001_positions.sql` containing:
- `CREATE EXTENSION IF NOT EXISTS timescaledb;` (no-op if already enabled)
- `CREATE TABLE IF NOT EXISTS positions (...)` per the schema below
- `SELECT create_hypertable('positions', 'ts', if_not_exists => TRUE, chunk_time_interval => INTERVAL '1 day');`
- `CREATE UNIQUE INDEX IF NOT EXISTS positions_device_ts ON positions (device_id, ts);`
- `CREATE INDEX IF NOT EXISTS positions_ts ON positions (ts DESC);`
- `src/db/migrate.ts` — minimal runner that executes pending migration files in order. Tracks applied migrations in a `schema_migrations(version, applied_at)` table. Idempotent. Called from `main.ts` before the consumer starts.
- `test/db/migrate.test.ts` covering: applying a fresh migration; applying twice is a no-op; bad SQL fails loudly.
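A sketch of `connectWithRetry` under those constraints (the exact backoff steps are an assumption within the stated 5s cap):
```ts
import type { Pool } from 'pg';
import type { Logger } from 'pino';

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function connectWithRetry(pool: Pool, logger: Logger): Promise<void> {
  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      await pool.query('SELECT 1');
      logger.info('Postgres connected');
      return;
    } catch (err) {
      const delayMs = Math.min(1_000 * 2 ** (attempt - 1), 5_000); // 1s, 2s, 4s — capped at 5s
      logger.error({ err, attempt }, 'Postgres connection check failed');
      if (attempt < 3) await sleep(delayMs);
    }
  }
  logger.error('Postgres unreachable after 3 attempts; exiting');
  process.exit(1);
}
```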
## Specification
### `positions` table schema
```sql
CREATE TABLE IF NOT EXISTS positions (
device_id text NOT NULL,
ts timestamptz NOT NULL, -- canonical event time from device GPS
ingested_at timestamptz NOT NULL DEFAULT now(), -- when Processor wrote the row
latitude double precision NOT NULL,
longitude double precision NOT NULL,
altitude real NOT NULL,
angle real NOT NULL,
speed real NOT NULL,
satellites smallint NOT NULL,
priority smallint NOT NULL,
codec text NOT NULL, -- '8' | '8E' | '16'
attributes jsonb NOT NULL -- the IO bag, sentinel-decoded
);
```
### Why these column types
- `device_id text` — IMEIs are 15 ASCII digits. Could be `bigint`, but `text` keeps the door open for non-IMEI device identifiers (future vendors) and avoids leading-zero loss.
- `ts timestamptz` — the **device-reported** time, not ingestion time. This is the hypertable partitioning column.
- `ingested_at timestamptz` — diagnostic: helps spot devices with clock skew or buffered records (the 55-record buffer flush we saw on stage). Not part of the natural key.
- `altitude/angle/speed real` — float32 is plenty; saves space on a high-volume table.
- `attributes jsonb` — preserves the IO bag verbatim. Per the design rule, no naming or unit conversion happens here; that's Phase 2 in `src/domain/`.
### bigint and Buffer attributes — JSONB encoding
The codec (task 1.2) decodes `__bigint` to `bigint` and `__buffer_b64` to `Buffer`. Postgres `jsonb` is JSON, so we re-encode for storage:
- `bigint` → could be a JSON number when it fits in `Number.MAX_SAFE_INTEGER` and a JSON string otherwise, but always storing it as a string is simpler and unambiguous; **decision: always string for bigint**.
- `Buffer` → base64 string.
**Re-encoding loses the type tag.** Phase 2 IO interpretation (per-model mapping table) is responsible for knowing that `attributes.io_240` is a u64 stored as a string. Phase 1 doesn't need to query individual attributes — it's pass-through storage.
If this becomes painful later, options to revisit: a separate `attributes_typed` column with structured shape; or store bigints as `numeric` and Buffers as `bytea` in dedicated columns. **Defer** — 80% of attributes are small ints, and the simple string approach unblocks Phase 1.
### Migration runner
Follow the simplest possible pattern. The runner:
1. `CREATE TABLE IF NOT EXISTS schema_migrations (version text PRIMARY KEY, applied_at timestamptz NOT NULL DEFAULT now())`.
2. Lists `*.sql` files in `src/db/migrations/` sorted by filename.
3. For each, `SELECT 1 FROM schema_migrations WHERE version = $1`. If absent, run the SQL inside a transaction and insert the row.
4. Logs each applied or skipped migration at `info`.
Do **not** introduce a heavy framework (Knex, node-pg-migrate). The Processor has one migration file in Phase 1 — a 30-line runner is the right answer.
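A sketch of that runner (directory path per the deliverables; `dist/` path resolution omitted):
```ts
import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';
import type { Pool } from 'pg';
import type { Logger } from 'pino';

export async function migrate(pool: Pool, logger: Logger, dir = 'src/db/migrations'): Promise<void> {
  await pool.query(
    `CREATE TABLE IF NOT EXISTS schema_migrations (
       version text PRIMARY KEY,
       applied_at timestamptz NOT NULL DEFAULT now()
     )`,
  );
  const files = (await readdir(dir)).filter((f) => f.endsWith('.sql')).sort();
  for (const file of files) {
    const { rowCount } = await pool.query('SELECT 1 FROM schema_migrations WHERE version = $1', [file]);
    if ((rowCount ?? 0) > 0) {
      logger.info({ file }, 'migration already applied; skipping');
      continue;
    }
    const sql = await readFile(path.join(dir, file), 'utf8');
    const client = await pool.connect();
    try {
      // Run the migration and record it atomically.
      await client.query('BEGIN');
      await client.query(sql);
      await client.query('INSERT INTO schema_migrations (version) VALUES ($1)', [file]);
      await client.query('COMMIT');
      logger.info({ file }, 'migration applied');
    } catch (err) {
      await client.query('ROLLBACK');
      throw err; // bad SQL fails loudly, per the test plan
    } finally {
      client.release();
    }
  }
}
```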
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] Integration test (testcontainers TimescaleDB): apply migration; insert a row with a bigint-as-string attribute; query it back; verify shape.
- [ ] Re-running the migration on an already-migrated database is a no-op.
- [ ] `connectWithRetry` retries 3 times with exponential backoff, then calls `process.exit(1)`. Verify with a unit test using a fake Pool.
## Risks / open questions
- **TimescaleDB extension availability.** The `deploy/` repo's Postgres container must be the `timescale/timescaledb` image, not stock `postgres`. Document this explicitly in the deploy README when Phase 1 ships. Fall back to a regular table (no hypertable) if the extension is unavailable: `create_hypertable` will error, but the `IF NOT EXISTS` table creation succeeds. Performance falls off a cliff at scale, but it stays functional.
- **Schema authority overlap with Directus.** Directus also speaks Postgres. When Directus connects and introspects the schema, it will see the `positions` table created by Processor. That's fine — Directus can reflect tables it didn't create. But if an operator later modifies `positions` from the Directus admin UI, the migration may break. Document: `positions` is a Processor-owned table; do not edit from Directus.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,93 @@
# Task 1.5 — Redis Stream consumer (XREADGROUP)
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.2, 1.3
**Wiki refs:** `docs/wiki/entities/redis-streams.md`, `docs/wiki/entities/processor.md`
## Goal
Build the Redis Stream consumer: join the consumer group, fetch batches via `XREADGROUP`, decode each entry to a `Position`, hand off to a sink callback, and return successfully-handled IDs to the caller for `XACK`.
This task does **not** wire in the Postgres writer or the in-memory state — those are tasks 1.7 and 1.6, joined to the consumer in 1.8. The consumer accepts a `sink: (records: ConsumedRecord[]) => Promise<string[]>` callback that returns the IDs it wants ACKed. Only those IDs are ACKed; failures stay pending and get claimed on the next loop.
## Deliverables
- `src/core/consumer.ts` exporting:
- `createConsumer(redis, config, logger, metrics, sink): Consumer` — factory.
- `Consumer` interface: `start(): Promise<void>` (returns when the consumer loop starts), `stop(): Promise<void>` (signals the loop to exit, waits for the in-flight batch).
- `ensureConsumerGroup(redis, stream, group)` — issues `XGROUP CREATE ... MKSTREAM`, ignoring `BUSYGROUP` errors. Called once at start.
- `type ConsumedRecord = { id: string; position: Position; codec: string; ts: string }` — what's passed to the sink.
- `test/consumer.test.ts` (mocked `ioredis`):
- Decodes a synthetic stream entry into a `ConsumedRecord` with the right shape.
- Calls `sink` with the decoded batch and ACKs only the IDs the sink returned.
- On `BUSYGROUP` error from `XGROUP CREATE`, swallows the error and continues.
- On a malformed payload, increments `consumer_decode_errors_total`, logs at `error`, and **does not** ACK the bad entry — leaves it pending for inspection.
- On `stop()`, the loop exits cleanly without losing in-flight work.
## Specification
### Consumer loop shape
```ts
async function runLoop() {
while (!stopping) {
let entries: StreamEntry[];
try {
entries = await redis.xreadgroup(
'GROUP', group, consumerName,
'COUNT', batchSize,
'BLOCK', batchBlockMs,
'STREAMS', stream, '>',
);
} catch (err) {
logger.error({ err }, 'XREADGROUP failed; backing off');
await sleep(1000);
continue;
}
if (!entries) continue; // BLOCK timeout
const records = decodeBatch(entries); // <— may emit decode errors
const ackIds = await sink(records); // <— writer + state
if (ackIds.length > 0) {
await redis.xack(stream, group, ...ackIds);
}
}
}
```
### Decode error handling
`decodeBatch` calls `decodePosition` (from task 1.2) on each entry's `payload` field. If a single entry fails to decode:
- Increment `processor_decode_errors_total{stream=...}`.
- Log at `error` with the entry ID and a truncated raw payload (first 256 chars).
- **Skip** the entry — do not pass to sink, do not ACK. It stays in the consumer's PEL (Pending Entries List) and will be re-attempted on next claim. Phase 3 will route truly-poison entries to a dead-letter stream; for Phase 1, leaving them pending and visible in `XPENDING` is enough.
### `XACK` semantics
ACK only what the sink returned. If the sink returns `['id1', 'id3']` from a batch of `[id1, id2, id3]`, then `id2` stays pending. Why a sink might return a partial list: it failed to write some records. The consumer must trust the sink's signal — never ACK speculatively.
### Consumer group setup
On `start()`:
1. `XGROUP CREATE <stream> <group> $ MKSTREAM` — creates the stream if missing, group at "now" so we don't replay history. If the group already exists, the call returns `BUSYGROUP Consumer Group name already exists` — catch and ignore.
2. Log at `info` whether the group was created or already existed.
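A sketch of that setup (ioredis; matching on the error message is the usual way to detect `BUSYGROUP`):
```ts
import type Redis from 'ioredis';
import type { Logger } from 'pino';

export async function ensureConsumerGroup(
  redis: Redis,
  stream: string,
  group: string,
  logger: Logger,
): Promise<void> {
  try {
    // '$' = start at "now"; MKSTREAM creates the stream if it doesn't exist yet.
    await redis.xgroup('CREATE', stream, group, '$', 'MKSTREAM');
    logger.info({ stream, group }, 'consumer group created');
  } catch (err) {
    if (err instanceof Error && err.message.includes('BUSYGROUP')) {
      logger.info({ stream, group }, 'consumer group already exists');
      return;
    }
    throw err;
  }
}
```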
### Why `>` not `0` for the read ID
`>` means "deliver only new entries, not pending ones for this consumer." That's what we want for the steady-state loop. Phase 3 will add an explicit `XAUTOCLAIM` step at startup (and periodically) to pull stuck pending entries from dead consumers; Phase 1 relies on the natural redelivery via consumer-group resumption (when a dead instance restarts with the same name, it sees its old PEL).
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] Unit tests cover: happy path, `BUSYGROUP` swallow, decode error skip, partial-ACK, clean stop.
- [ ] Stop signal causes the loop to exit within one `BATCH_BLOCK_MS` tick.
## Risks / open questions
- **Consumer name uniqueness.** Two instances with the same `REDIS_CONSUMER_NAME` will both read from the same PEL, which is undefined behaviour. Task 1.3 already documents that `INSTANCE_ID` (which `REDIS_CONSUMER_NAME` defaults to) must be unique per instance — surface this again in the operator-facing README later.
- **Long sink calls block the loop.** If the Postgres writer takes 30s, no new records are read. That's fine for Phase 1 (Postgres should be fast); Phase 3 may add a configurable max-in-flight if writer pressure becomes an issue.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,81 @@
# Task 1.6 — Per-device in-memory state
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.2
**Wiki refs:** `docs/wiki/entities/processor.md` (§ State management)
## Goal
Maintain a bounded `Map<device_id, DeviceState>` updated on every accepted Position. Phase 1 only stores trivial state — `last_position`, `last_seen`, `position_count_session` — but the structure is built so Phase 2 (geofence accumulators, time-since-last-checkpoint, etc.) can extend it cleanly.
## Deliverables
- `src/core/state.ts` exporting:
- `createDeviceStateStore(config, logger): DeviceStateStore` — factory.
- `DeviceStateStore` interface:
- `update(position: Position): DeviceState` — applies the position, returns the new state. Touches LRU order.
- `get(device_id: string): DeviceState | undefined` — read without touching LRU order. (Used for diagnostics; the hot path uses `update`.)
- `size(): number` — for metrics.
- `evictedTotal(): number` — for metrics.
- `test/state.test.ts` covering:
- First update for a new device creates the entry; subsequent updates increment `position_count_session`.
- LRU eviction: with cap=3, after 4 distinct devices, the least-recently-updated is evicted.
- Eviction increments `evictedTotal()`.
- `last_seen` reflects the position's `timestamp` (the device-reported time), not the wall clock at update time.
- Out-of-order positions (a position with `timestamp` older than `last_seen`) are still applied (we don't drop them) but `last_seen` only advances forward — i.e. `last_seen = max(prev_last_seen, position.timestamp)`. Document the rationale.
## Specification
### LRU implementation
Use a plain `Map<string, DeviceState>`. JavaScript `Map` preserves insertion order, and we exploit it: on every `update`, `delete` then `set` the entry — that bumps it to the most recent position in iteration order. When `size() > cap`, take `keys().next().value` (the oldest) and `delete` it.
This is O(1) per update and avoids a third-party LRU dependency. **Do not** introduce `lru-cache` — the standard `Map` trick is sufficient for Phase 1's needs.
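A sketch of the store using that trick (types per task 1.2):
```ts
import type { DeviceState, Position } from './types.js';

export function createDeviceStateStore(cap: number) {
  const states = new Map<string, DeviceState>();
  let evicted = 0;
  return {
    update(position: Position): DeviceState {
      const id = position.device_id;
      const prev = states.get(id);
      const next: DeviceState = {
        device_id: id,
        last_position: position,
        // last_seen only advances forward — see the rationale below.
        last_seen: prev && prev.last_seen > position.timestamp ? prev.last_seen : position.timestamp,
        position_count_session: (prev?.position_count_session ?? 0) + 1,
      };
      states.delete(id); // delete-then-set bumps the entry to most-recent in iteration order
      states.set(id, next);
      if (states.size > cap) {
        const oldest = states.keys().next().value as string; // first key = least recently updated
        states.delete(oldest);
        evicted += 1;
      }
      return next;
    },
    get: (id: string) => states.get(id), // read without touching LRU order
    size: () => states.size,
    evictedTotal: () => evicted,
  };
}
```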
### Why `last_seen = max(...)`, not `last_seen = position.timestamp`
Devices buffer records when offline and replay them in bursts (we observed a 55-record buffer flush on stage). Within a single batch, timestamps may *decrease* between consecutive records if the device sorted them oddly. We want `last_seen` to mean "highest device timestamp seen so far for this device" — that's what downstream consumers want.
### What about restart?
On Processor restart, the in-memory state is empty. The first record from any device creates a fresh `DeviceState`. **Phase 1 accepts this** — it's a recovery path, not a hot path, and Phase 1 has no domain logic that would be wrong without rehydrated state.
Phase 3 (production hardening) adds rehydration: on first packet for an unknown device, query `positions WHERE device_id = $1 ORDER BY ts DESC LIMIT 1` to seed `last_position`. That's a Phase 3 task, not Phase 1.
### What state lives here, what doesn't
In Phase 1 the state is intentionally minimal:
```ts
type DeviceState = {
device_id: string;
last_position: Position;
last_seen: Date; // = max(prev, position.timestamp)
position_count_session: number; // resets on restart
};
```
**Not in Phase 1:**
- Geofence membership (Phase 2)
- Distance accumulators (Phase 2)
- Time-in-stage (Phase 2)
- Anything that would be wrong if dropped on restart (Phase 3 + rehydration)
The interface is built to extend: Phase 2 may add fields, but the existing fields and method signatures should not change.
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] LRU cap from `DEVICE_STATE_LRU_CAP` config is respected.
- [ ] `evictedTotal()` increments correctly under eviction.
- [ ] `last_seen` does not regress on out-of-order timestamps.
## Risks / open questions
- **Cap sizing.** Default `DEVICE_STATE_LRU_CAP=10000`. At 1KB per state entry, that's 10MB of resident memory — fine. Operators with unusually large fleets can raise it; the bound exists to prevent runaway growth from misbehaving devices flooding novel `device_id` values.
- **No mutex.** State is updated only from the consumer loop, which is single-threaded. If Phase 2 introduces parallel sinks, revisit with proper synchronization.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,94 @@
# Task 1.7 — Position writer (batched upsert)
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.2, 1.4
**Wiki refs:** `docs/wiki/entities/postgres-timescaledb.md`
## Goal
Write batches of `Position` records into the `positions` hypertable using `INSERT ... ON CONFLICT (device_id, ts) DO NOTHING` for idempotency. Return per-record success/failure so the consumer (task 1.8) can decide what to ACK.
## Deliverables
- `src/core/writer.ts` exporting:
- `createWriter(pool, config, logger, metrics): Writer` — factory.
- `Writer` interface:
- `write(records: ConsumedRecord[]): Promise<WriteResult[]>` — inserts the batch, returns per-record results: `{ id: string; status: 'inserted' | 'duplicate' | 'failed'; error?: Error }`.
- `test/writer.test.ts` (mocked `pg.Pool`):
- Happy path: all records insert.
- Duplicate-key: `ON CONFLICT DO NOTHING` returns `'duplicate'` for those records.
- Mixed: half new, half duplicate.
- Pool error: all records in the batch return `'failed'`.
- Bigint attribute is stringified before serialization.
- Buffer attribute is base64-encoded before serialization.
## Specification
### SQL pattern
Use a single multi-row `INSERT` per batch with `RETURNING (xmax = 0) AS inserted`:
```sql
INSERT INTO positions (device_id, ts, latitude, longitude, altitude, angle, speed, satellites, priority, codec, attributes)
VALUES
($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11),
($12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22),
...
ON CONFLICT (device_id, ts) DO NOTHING
RETURNING device_id, ts, (xmax = 0) AS inserted;
```
`xmax = 0` distinguishes fresh inserts from updated rows under `ON CONFLICT ... DO UPDATE`; with `DO NOTHING` it is effectively always true, because conflicting rows are not returned at all — keep it as a cheap sanity check. The `RETURNING` rows give us a lookup of which `(device_id, ts)` pairs were actually inserted.
**Note:** rows that hit the conflict are NOT returned (Postgres doesn't return them with `ON CONFLICT DO NOTHING`). To classify duplicates, compare the returned rows against the input by `(device_id, ts)`: anything in the input but missing from `RETURNING` is a `'duplicate'`.
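A sketch of that comparison (types per this task's deliverables; `pg` returns `ts` as a `Date`; the import path is hypothetical):
```ts
import type { ConsumedRecord, WriteResult } from './types.js'; // hypothetical location

export function classify(
  batch: ConsumedRecord[],
  returned: Array<{ device_id: string; ts: Date }>,
): WriteResult[] {
  // Key inserted rows by (device_id, ts); anything missing from RETURNING was a conflict.
  const inserted = new Set(returned.map((r) => `${r.device_id}|${r.ts.toISOString()}`));
  return batch.map((r) => ({
    id: r.id,
    status: inserted.has(`${r.position.device_id}|${r.position.timestamp.toISOString()}`)
      ? ('inserted' as const)
      : ('duplicate' as const),
  }));
}
```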
### bigint and Buffer attribute encoding
Per task 1.4, `jsonb` storage:
- `bigint` → JSON string.
- `Buffer` → base64 string.
Use a custom replacer in `JSON.stringify`. Beware: `Buffer#toJSON` runs *before* the replacer, so the replacer never sees a live `Buffer` — it sees `{ type: 'Buffer', data: number[] }` and must match that shape, not `Buffer.isBuffer`:
```ts
JSON.stringify(attributes, (_k, v) =>
  typeof v === 'bigint'
    ? v.toString()
    : v && typeof v === 'object' && v.type === 'Buffer' && Array.isArray(v.data)
      ? Buffer.from(v.data as number[]).toString('base64')
      : v,
);
```
Document this in `wiki/concepts/position-record.md` as a follow-up — the on-disk shape differs slightly from the in-flight shape because JSON can't hold bigints or bytes natively.
### Batching strategy
The consumer (task 1.8) calls `write(batch)` with whatever batch the consumer received from `XREADGROUP`. Phase 1 doesn't internally batch further — the consumer's batch size (`BATCH_SIZE`, default 100) is the writer's batch size.
If `BATCH_SIZE > WRITE_BATCH_SIZE` (default 50), the writer chunks internally: split the input into chunks of `WRITE_BATCH_SIZE`, run them sequentially. Don't parallelize chunks against the same Pool — `pg.Pool` has bounded connections and we don't want to starve other queries (the migration runner, `/readyz` health checks, etc.).
### Per-record status
The consumer (task 1.8) takes the `WriteResult[]` and decides ACK:
- `'inserted'` and `'duplicate'` → ACK (we got the data into Postgres or already had it).
- `'failed'` → do not ACK (let it stay pending for retry).
If a transaction-wide failure occurs (Pool dead, transient network), all records in the chunk get `'failed'`. The consumer treats them all the same.
### Metrics emitted by this module
- `processor_position_writes_total{status="inserted"|"duplicate"|"failed"}` — counter
- `processor_position_write_duration_seconds` — histogram (per-batch latency)
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] Mocked-Pool test verifies SQL parameter ordering and types are correct.
- [ ] Bigint and Buffer attributes serialize as expected via the `JSON.stringify` replacer (including the `Buffer#toJSON` shape).
- [ ] Mixed insert/conflict batch produces correct per-record `WriteResult[]`.
- [ ] Pool error → all records get `'failed'`; metrics reflect this.
## Risks / open questions
- **Parameter limit.** Postgres protocol allows max 65535 parameters per statement. With 11 columns per row, that caps us at ~5957 rows per statement. `WRITE_BATCH_SIZE=50` is well under. If `WRITE_BATCH_SIZE` is ever raised, document the formula.
- **`RETURNING` cost.** On a hypertable with many chunks, `RETURNING` has near-zero overhead. Verify with a benchmark in task 1.10 (integration test).
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,100 @@
# Task 1.8 — Main wiring & ACK semantics
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.5, 1.6, 1.7
**Wiki refs:** `docs/wiki/entities/processor.md`
## Goal
Assemble the throughput pipeline in `src/main.ts`: connect Redis + Postgres → run migrations → build the device-state store → build the writer → build the consumer with a sink that calls `state.update()` then `writer.write()` → start. Establish the rule for what to ACK and when.
## Deliverables
- `src/main.ts` updated to:
1. `loadConfig()` (from task 1.3).
2. `createLogger()` (from task 1.3).
3. `createPool(config.POSTGRES_URL)` and `connectWithRetry()` (from task 1.4).
4. Run migrations via `migrate()` (from task 1.4) before any consumer activity.
5. Connect Redis with `connectRedis(...)` (re-implement the `tcp-ingestion` retry pattern; small enough to copy).
6. Build `state = createDeviceStateStore(config, logger)`.
7. Build `writer = createWriter(pool, config, logger, metrics)`.
8. Build `consumer = createConsumer(redis, config, logger, metrics, sink)` where `sink` is the function defined below.
9. `await consumer.start()`.
10. Install graceful shutdown stub (full Phase 3 hardening later): on SIGTERM/SIGINT, call `consumer.stop()`, await pending writes, close Redis + Pool, exit.
- `src/main.ts` defines the **sink function** (the central decision point):
```ts
async function sink(records: ConsumedRecord[]): Promise<string[]> {
// 1. Update in-memory state for every record (cheap, synchronous, can't fail meaningfully)
for (const r of records) state.update(r.position);
// 2. Write to Postgres
const results = await writer.write(records);
// 3. ACK only the IDs that succeeded or were duplicates
return results
.filter(r => r.status === 'inserted' || r.status === 'duplicate')
.map(r => r.id);
}
```
- A placeholder `metrics` shim — the same trace-logging stub as `tcp-ingestion` originally had (task 1.9 replaces it with prom-client). Use `Metrics` from `src/core/types.ts`.
## Specification
### State update happens before write — by design
The sink updates `state` first, *then* writes. If the write fails:
- The state update has already happened.
- The record is not ACKed, so it stays pending.
- On re-delivery (same instance retries, or another instance claims), the record will be processed again.
- `state.update` is idempotent for a given position (same record applied twice produces the same `last_position`, only `position_count_session` is double-counted — and that's a session counter that resets on restart anyway, so it's a non-issue).
If we wrote *first* and updated state second, a successful write followed by a state-update crash would leave Postgres ahead of state — but state is hot-path, so that's worse. The chosen order keeps state consistent with what's been seen, even if not yet persisted.
### What the sink does NOT do
- **No business logic.** No "is this a finish-line crossing" detection. That's Phase 2's domain.
- **No multi-stream fanout.** No publishing to derived streams (e.g. for the SPA). The Phase 1 model is: positions go into Postgres, Directus reads them and pushes via WebSocket. If that fanout proves insufficient at the SPA layer, Phase 4 considers a dedicated WebSocket gateway reading from Redis directly.
### Graceful shutdown — Phase 1 stub vs. Phase 3 final
Phase 1 stub is enough to not lose data in the common case:
1. Catch SIGTERM/SIGINT.
2. `consumer.stop()` — exits the read loop after the current batch.
3. Await any in-flight `writer.write()`.
4. `redis.quit()` and `pool.end()`.
5. `process.exit(0)`.
6. Force-exit timer at 15s as a backstop.
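A sketch of that stub (component shapes per the wiring steps above):
```ts
import type Redis from 'ioredis';
import type { Pool } from 'pg';
import type { Logger } from 'pino';

interface Deps { consumer: { stop(): Promise<void> }; redis: Redis; pool: Pool; logger: Logger }

export function installShutdown({ consumer, redis, pool, logger }: Deps): void {
  let shuttingDown = false;
  const shutdown = async (signal: string): Promise<void> => {
    if (shuttingDown) return;
    shuttingDown = true;
    logger.info({ signal }, 'shutting down');
    const backstop = setTimeout(() => process.exit(1), 15_000); // force-exit backstop
    backstop.unref();
    await consumer.stop(); // exits the read loop after the current batch; in-flight write completes
    await redis.quit();
    await pool.end();
    process.exit(0);
  };
  process.on('SIGTERM', () => void shutdown('SIGTERM'));
  process.on('SIGINT', () => void shutdown('SIGINT'));
}
```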
What Phase 1 does NOT do (deferred to Phase 3):
- Explicit consumer-group offset commit on SIGTERM (the current model relies on `XACK` after each successful write, which is already the right thing — but Phase 3 documents and tests this rigorously).
- Uncaught exception / unhandled rejection handlers that flush state to logs before crashing.
- Multi-instance coordination on shutdown (drain mode).
### Logger shape
Match `tcp-ingestion`'s convention:
- `info` for lifecycle: `processor starting`, `Postgres connected`, `Redis connected`, `migrations applied`, `consumer started on stream X group Y consumer Z`, `processor ready`.
- `debug` for per-batch: `batch consumed n=42`, `batch written inserted=40 duplicates=2 failed=0`.
- `warn` / `error` for the obvious.
After this task lands you should be able to run `pnpm dev` against a local Redis + Postgres, publish a synthetic `Position` to `telemetry:t`, and watch a row appear in `positions` while seeing the lifecycle logs above.
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] `pnpm dev` (with local Redis + Postgres reachable) shows the lifecycle log sequence and `processor ready`.
- [ ] Manually publishing a `Position` to `telemetry:t` results in a row in `positions` within seconds.
- [ ] SIGTERM during idle exits cleanly (no error, no force-exit warning).
- [ ] SIGTERM with in-flight writes waits for them to complete before exiting.
## Risks / open questions
- **`metrics` placeholder is intentional.** Don't try to wire prom-client here; that's task 1.9. Use the trace-logging shim from `tcp-ingestion`'s pre-1.10 `main.ts` as the model.
- **Migration during deploy.** Phase 1 runs migrations on every startup. With multiple instances, two starting at once both try to migrate — Postgres advisory locks would solve this. **Defer to Phase 3** (it's a Production hardening concern); for the pilot with one instance, this is fine. Document the limitation.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,82 @@
# Task 1.9 — Observability (Prometheus metrics + /healthz + /readyz)
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.5, 1.6, 1.7, 1.8
**Wiki refs:** `docs/wiki/entities/processor.md`, `docs/wiki/sources/gps-tracking-architecture.md` § 7.4
## Goal
Replace the placeholder `Metrics` shim with a real `prom-client` implementation. Expose `/metrics` (Prometheus exposition format), `/healthz` (liveness), and `/readyz` (readiness — Redis ready AND Postgres ready) on `METRICS_PORT`.
This is **not** a deferral candidate (unlike `tcp-ingestion` task 1.10). The Processor has no other surface for measuring consumer lag, write throughput, or failure rates — without it, "is the pilot keeping up?" cannot be answered.
## Deliverables
- `src/observability/metrics.ts` — same shape as `tcp-ingestion/src/observability/metrics.ts`:
- `createMetrics(): Metrics & { serializeMetrics(): Promise<string> }` — wraps `prom-client` registries; calls `collectDefaultMetrics()` once for `nodejs_*` process metrics.
- `startMetricsServer(port, metrics, deps)` — `node:http` server with three endpoints. `deps` carries readyz health checks: `{ isRedisReady(): boolean; isPostgresReady(): boolean }`.
- Update `src/main.ts` to use the real `createMetrics()` and start the metrics server after Redis + Postgres are connected and the consumer is started. Wire it into graceful shutdown (close it before `redis.quit()`).
- Tests: `test/metrics.test.ts` mirroring the `tcp-ingestion` test pattern — exposition format, counter/gauge/histogram behaviour, all four HTTP endpoint paths including `/readyz` 503 cases.
## Specification
### Metric inventory
| Metric | Type | Labels | Description |
|---|---|---|---|
| `processor_consumer_reads_total` | counter | `result=ok\|empty\|error` | `XREADGROUP` calls; `empty` = BLOCK timeout, `error` = client error |
| `processor_consumer_records_total` | counter | — | Total records pulled off the stream |
| `processor_consumer_lag` | gauge | — | `XLEN` minus the consumer group's last-delivered ID position. Sampled every N seconds. |
| `processor_decode_errors_total` | counter | — | Records that failed to decode (malformed payload, sentinel error) |
| `processor_position_writes_total` | counter | `status=inserted\|duplicate\|failed` | Per-record write outcomes |
| `processor_position_write_duration_seconds` | histogram | — | Per-batch write latency. Buckets `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]` |
| `processor_acks_total` | counter | — | Total IDs ACKed |
| `processor_device_state_size` | gauge | — | Current count of devices in the LRU map |
| `processor_device_state_evictions_total` | counter | — | Total LRU evictions since start |
| `nodejs_*` | various | — | Default Node process metrics |
### Naming convention
- `processor_*` for service-specific metrics. `tcp-ingestion` uses `teltonika_*` because that's its adapter; the Processor isn't bound to a vendor, so the service-name prefix fits.
- No external `service` label — Prometheus scrape config adds it.
### Health and readiness
- `GET /healthz` → 200 if the process is alive. Always returns `{ "status": "ok" }`.
- `GET /readyz` → 200 if both Redis is ready (`redis.status === 'ready'`) AND Postgres is ready (last successful query within 30s, or a fresh `SELECT 1` succeeds quickly). 503 otherwise.
- Both endpoints return tiny JSON bodies for diagnostic value.
### `processor_consumer_lag` measurement
Sample every 10s in a separate setInterval (don't compute it on every read — too noisy). Compute as:
```
lag = XLEN(stream) - position_of(group_last_delivered_id_in_stream)
```
Use `XINFO GROUPS <stream>` → the `lag` field (Redis 7.2+). If the field is absent, fall back to plain `XLEN` (a good-enough proxy when the group is nearly caught up; flag it as "approximate" in the metric description).
If sampling fails (Redis blip), log at `debug` and continue. Don't let metrics gathering break the consumer.
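A sketch of the sampler, shown with a raw prom-client `Gauge` for brevity (the `XINFO GROUPS` reply arrives as a flat key/value list per group; the `lag` field is the Redis 7.2+ assumption named above):
```ts
import { Gauge } from 'prom-client';
import type Redis from 'ioredis';
import type { Logger } from 'pino';

const consumerLag = new Gauge({ name: 'processor_consumer_lag', help: 'Entries not yet delivered to the group' });

export function startLagSampler(redis: Redis, stream: string, group: string, logger: Logger): NodeJS.Timeout {
  const timer = setInterval(async () => {
    try {
      const groups = (await redis.xinfo('GROUPS', stream)) as Array<Array<string | number>>;
      for (const entry of groups) {
        // Flatten the [key, value, key, value, ...] reply into a Map.
        const fields = new Map<string, string | number>();
        for (let i = 0; i < entry.length; i += 2) fields.set(String(entry[i]), entry[i + 1] as string | number);
        if (fields.get('name') === group && typeof fields.get('lag') === 'number') {
          consumerLag.set(fields.get('lag') as number);
        }
      }
    } catch (err) {
      logger.debug({ err }, 'lag sampling failed; skipping tick');
    }
  }, 10_000);
  timer.unref(); // never keep the process alive just for metrics
  return timer;
}
```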
### HTTP server — same minimal node:http
No Express. Roughly 30 lines. Match `tcp-ingestion`'s style.
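A sketch of those ~30 lines (handler paths and `deps` shape per the deliverables):
```ts
import http from 'node:http';

export function startMetricsServer(
  port: number,
  metrics: { serializeMetrics(): Promise<string> },
  deps: { isRedisReady(): boolean; isPostgresReady(): boolean },
): http.Server {
  const server = http.createServer((req, res) => {
    void (async () => {
      if (req.url === '/metrics') {
        res.writeHead(200, { 'content-type': 'text/plain; version=0.0.4; charset=utf-8' });
        res.end(await metrics.serializeMetrics());
      } else if (req.url === '/healthz') {
        res.writeHead(200, { 'content-type': 'application/json' });
        res.end('{"status":"ok"}'); // liveness: the process is up
      } else if (req.url === '/readyz') {
        const ready = deps.isRedisReady() && deps.isPostgresReady();
        res.writeHead(ready ? 200 : 503, { 'content-type': 'application/json' });
        res.end(JSON.stringify({ status: ready ? 'ok' : 'unavailable' }));
      } else {
        res.writeHead(404);
        res.end();
      }
    })();
  });
  server.listen(port);
  return server;
}
```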
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] `curl http://localhost:9090/metrics` returns valid exposition format with every metric in the inventory present (some at zero).
- [ ] After processing one record end-to-end, `processor_consumer_records_total` increments by 1, `processor_position_writes_total{status="inserted"}` increments by 1, `processor_acks_total` increments by 1.
- [ ] `/readyz` returns 503 while Redis is disconnected (simulate by `redis.disconnect()`), 200 once it reconnects.
- [ ] `/readyz` returns 503 while the Pool fails its health probe, 200 when it recovers.
- [ ] `nodejs_*` default metrics are exposed.
## Risks / open questions
- **Cardinality of label values.** None of the Phase 1 metrics use unbounded labels. Phase 2 may want per-stage metrics — be careful: hundreds of stages is fine, hundreds of devices as labels is not. Keep the same rule as `tcp-ingestion`: per-device labels never go on Prometheus metrics; logs/traces are the right place.
- **`processor_consumer_lag` sampling cadence.** 10s is a guess. If alerts get jittery, lower to 5s or raise to 30s. Tunable later.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,58 @@
# Task 1.10 — Integration test (testcontainers Redis + Postgres)
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.5, 1.7, 1.8, 1.9
**Wiki refs:**
## Goal
End-to-end pipeline test: spin up Redis 7 and TimescaleDB via testcontainers, boot the Processor against them, publish a synthetic `Position` to `telemetry:t`, verify the row appears in `positions` with byte-equivalent attribute decoding (bigint, Buffer included).
This is the integration test that proves the upstream contract from `tcp-ingestion` flows through end-to-end. Mirror `tcp-ingestion/test/publish.integration.test.ts`'s structure and skip-on-no-Docker pattern.
## Deliverables
- `test/pipeline.integration.test.ts`:
- `beforeAll`: start Redis container, start TimescaleDB container, run migrations, build a Processor instance pointed at both. If Docker is unavailable, log a clear skip message and set a flag so all `it` blocks early-return without failing.
- `afterAll`: stop the Processor, stop containers.
- Test 1: publish a Position with `bigint` and `Buffer` attributes via `XADD`; wait for the row in `positions` (poll, timeout 10s); assert `device_id`, `ts`, GPS fields, and a JSON round-trip of `attributes` matches the original (bigint as string, Buffer as base64).
- Test 2: publish two records with the same `(device_id, ts)`; verify only one row in `positions` (idempotency check).
- Test 3: publish a malformed payload (broken JSON) on the stream; verify `processor_decode_errors_total` increments and the bad entry stays in PEL (not ACKed).
- Test 4: simulate the writer failing once (e.g. by temporarily shutting Postgres mid-test, then bringing it back); verify the record gets retried and eventually lands.
- Use the **TimescaleDB image**, not stock `postgres`. Suggested: `timescale/timescaledb:latest-pg16`. Confirm the migration's `CREATE EXTENSION IF NOT EXISTS timescaledb` no-ops (extension already loaded).
- Use the same Vitest config split as `tcp-ingestion`: `vitest.integration.config.ts` with `hookTimeout: 120_000`, `testTimeout: 60_000`. Default `pnpm test` excludes `*.integration.test.ts`; opt-in via `pnpm test:integration`.
## Specification
### Skip-on-no-Docker pattern
Copy `tcp-ingestion/test/publish.integration.test.ts`'s pattern verbatim:
- Try to start the first container in `beforeAll`. On error, set `dockerAvailable = false`, log a warning, and return.
- Each `it` block early-returns with a `console.warn` if `!dockerAvailable`.
- This pattern was the fix for the CI test failure on the runner without Docker — keep it.
### Synthetic Position publishing
Reuse `serializePosition` from `tcp-ingestion`'s `publish.ts` if it can be imported (likely not — separate repos). Otherwise inline the encoding: a Position object → JSON.stringify with the bigint/Buffer replacer → `XADD telemetry:t * ts <iso> device_id <imei> codec 8E payload <json>`.
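An inline-encoding sketch (stream and field names per task 1.2's `StreamRecord`; attributes are pre-encoded before `JSON.stringify` so `Buffer#toJSON` can't interfere — see task 1.7):
```ts
import { Buffer } from 'node:buffer';
import type Redis from 'ioredis';
import type { Position } from '../src/core/types.js'; // hypothetical path

function sentinelEncode(position: Position): string {
  const attributes = Object.fromEntries(
    Object.entries(position.attributes).map(([k, v]) => [
      k,
      typeof v === 'bigint'
        ? { __bigint: v.toString() }
        : Buffer.isBuffer(v)
          ? { __buffer_b64: v.toString('base64') }
          : v,
    ]),
  );
  return JSON.stringify({ ...position, timestamp: position.timestamp.toISOString(), attributes });
}

export async function publishSynthetic(redis: Redis, position: Position): Promise<string | null> {
  return redis.xadd(
    'telemetry:t', '*',
    'ts', position.timestamp.toISOString(),
    'device_id', position.device_id,
    'codec', '8E',
    'payload', sentinelEncode(position),
  );
}
```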
### Why test 4 (writer failure → retry)
This validates the core ACK semantics: if a write fails, the record stays pending, and re-delivery brings it back. Without this test, we have unit tests showing each piece behaves correctly, but no proof the pieces compose right. Fallback: if simulating a Postgres failure mid-test proves too flaky in testcontainers, weaken the test to: stop Postgres before publishing, publish, restart Postgres, verify the row appears.
## Acceptance criteria
- [ ] `pnpm test:integration` runs all four scenarios green when Docker is available.
- [ ] Without Docker, the suite logs skip messages and exits 0 (does not fail).
- [ ] CI (`pnpm test`, unit only) does not run these — they are opt-in.
- [ ] First-run container pull is reasonable; subsequent runs are fast (testcontainers caches the image).
## Risks / open questions
- **Image pull on first CI run.** The TimescaleDB image is large (~700MB). If we ever wire integration tests into CI (separate job with Docker), pre-pulling may be required. Document but defer.
- **Test flakiness from polling.** Polling for "row appears in `positions`" uses a 10s timeout. If CI is slow, raise it. Don't replace polling with `await sleep(2000)` — that's reliably wrong.
## Done
(Fill in once complete: commit SHA, brief notes.)
# Task 1.11 — Dockerfile & Gitea workflow
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.10
**Wiki refs:**
## Goal
Containerize the service and add the Gitea Actions workflow that builds and publishes `git.dev.microservices.al/trm/processor:main` on every push to `main`. Mirror `tcp-ingestion`'s slim variant — same multi-stage Dockerfile, same single-job workflow with path filters.
## Deliverables
- `Dockerfile` — multi-stage: deps → build → runtime. Match `tcp-ingestion/Dockerfile` line for line, adjusting only:
- `EXPOSE 9090` only (metrics/health port; Processor has no TCP listener).
- `HEALTHCHECK` pointing at `/readyz` on `${METRICS_PORT}`.
- `CMD ["node", "dist/main.js"]`.
- `.gitea/workflows/build.yml` — single-job workflow matching `tcp-ingestion/.gitea/workflows/build.yml`:
- Trigger: `push` to `main` (path filters: `src/`, `test/`, `package.json`, `pnpm-lock.yaml`, `tsconfig.json`, `Dockerfile`, `.gitea/workflows/build.yml`) + `workflow_dispatch`.
- Steps: checkout, setup-node@v4 (Node 22, pnpm), install, typecheck, lint, test (unit only), docker buildx build-push to `git.dev.microservices.al/trm/processor:main`.
- Uses `secrets.REGISTRY_USERNAME` / `secrets.REGISTRY_PASSWORD`.
- Final step: trigger the Portainer webhook on success (uncommented; same as `tcp-ingestion`, whose `:main` -> webhook auto-deploy already works).
- `compose.dev.yaml` — local-build variant with `build: .`, named `processor-dev`, depends on a Redis service and a TimescaleDB service. Useful for verifying Dockerfile changes without the registry round-trip.
- `README.md` (the repo-level one, already a stub) — flesh out with:
- Quick-start (local: `pnpm install && cp .env.example .env && pnpm dev`).
- "Run the Docker build locally" section (`docker compose -f compose.dev.yaml up --build`).
- Production-deployment note: image is pulled by the `deploy/` repo's stack; do not run standalone.
- Pin to a specific commit via `PROCESSOR_TAG=<sha>` in the deploy stack.
- Tests section (unit vs. integration).
- CI behavior summary.
- "Pilot deployment notes" section if anything is paused (Phase 1 has nothing paused — note this and remove the section if so).
## Specification
### Dockerfile parity with `tcp-ingestion`
Open `tcp-ingestion/Dockerfile` and copy structure verbatim. The only diffs from a Phase 1 Processor are:
- No `EXPOSE 5027` — there's no TCP listener.
- `HEALTHCHECK` URL path is `/readyz` (already true for `tcp-ingestion`).
- Image label: `org.opencontainers.image.source` should point to the `processor` repo URL.
This parity matters: when a future engineer needs to debug a build, having two services build the same way reduces cognitive load.
### Workflow parity with `tcp-ingestion`
Same. Open `tcp-ingestion/.gitea/workflows/build.yml`, copy, change image name and (if needed) path filters. The webhook step at the end should be uncommented so `:main` builds auto-deploy through Portainer.
### Stage deploy
Phase 1 ships ready to land in the `deploy/compose.yaml` (`trm/deploy` repo) as a new service. **Do not edit `deploy/compose.yaml` from this task.** Surface it in the final report: "Add `processor` service to `deploy/compose.yaml` with image, env, depends_on Redis + Postgres." That is a deploy-side change, made by the user.
The `deploy/compose.yaml`'s service block will look roughly like:
```yaml
processor:
image: git.dev.microservices.al/trm/processor:${PROCESSOR_TAG:-main}
depends_on:
redis: { condition: service_healthy }
postgres: { condition: service_healthy }
environment:
NODE_ENV: production
INSTANCE_ID: ${PROCESSOR_INSTANCE_ID:-processor-1}
REDIS_URL: redis://redis:6379
POSTGRES_URL: postgres://...
LOG_LEVEL: ${LOG_LEVEL:-info}
restart: unless-stopped
```
Plus a Postgres service (TimescaleDB image) added to the stack — the stack currently only has Redis + tcp-ingestion. That's the user's deploy decision to make.
## Acceptance criteria
- [ ] `docker build .` succeeds locally; resulting image runs and exposes `/healthz` on 9090.
- [ ] `docker compose -f compose.dev.yaml up --build` boots Redis + TimescaleDB + Processor; `/readyz` reports 200 once everything is up.
- [ ] Pushing to `main` (or hitting `workflow_dispatch`) builds the image, runs typecheck/lint/test, and pushes `:main` to the registry.
- [ ] Portainer webhook fires on successful push and the stage stack picks up the new image (assuming the `deploy/` stack is set up).
- [ ] Image size is reasonable (target < 250 MB final stage; the `tcp-ingestion` slim variant lands around there).
## Risks / open questions
- **Re-pull on stack redeploy.** The same Portainer issue we hit with `tcp-ingestion` (stack redeploy doesn't pull new images by default) will apply here. Make sure the same fix is in place ("Re-pull image" toggle, or per-commit-SHA tags) before this lands. Cross-reference the `tcp-ingestion` deploy note in `deploy/README.md`.
- **HEALTHCHECK `wget` availability.** `node:22-alpine` includes `wget`. If we ever switch base image, revisit.
## Done
(Fill in once complete: commit SHA, brief notes.)
# Phase 1 — Throughput pipeline
Implement a Node.js worker that joins a Redis Streams consumer group, decodes `Position` records, upserts them into a TimescaleDB hypertable, maintains per-device in-memory state, and ships with the operational baseline (Prometheus metrics, health/readiness endpoints, integration tests, Dockerfile, Gitea CI/CD pipeline).
## Outcome statement
When Phase 1 is done:
- The Processor connects to Redis and joins consumer group `processor` on stream `telemetry:t` (configurable). On startup it creates the group with `MKSTREAM` if missing.
- Every `Position` record published by `tcp-ingestion` lands as exactly one row in the `positions` hypertable, with `device_id`, `ts`, GPS fields, and the IO `attributes` bag preserved as `JSONB` (sentinel-decoded — bigint values become `numeric`, Buffer values become `bytea` or `text` per the spec in task 1.2). A decode sketch follows this list.
- Per-device in-memory state (`last_position`, `last_seen`, `position_count_session`) is updated on every record and bounded by an LRU cap.
- `XACK` is sent only after the Postgres write succeeds. A crashed instance leaves work pending; on its next start it picks up via consumer-group resumption, and any other instance can claim its pending entries (full `XAUTOCLAIM` polish lives in Phase 3, but the basic resumption works in Phase 1).
- `GET /metrics` returns Prometheus exposition format with consumer lag, throughput, write-latency histogram, error counters. `GET /healthz` and `GET /readyz` cover liveness and readiness (Redis ready + Postgres ready).
- The service builds reproducibly via a Gitea Actions workflow, publishing a Docker image to the Gitea container registry tagged `:main` (and per-commit SHA tags later if needed).
- An integration test spins up Redis + Postgres via testcontainers, publishes a synthetic `Position` to the input stream, and verifies the resulting row in `positions`. End-to-end byte-level round-trip including bigint and Buffer sentinel reversal.
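The sentinel reversal mentioned above is small enough to sketch here, assuming the `__bigint` / `__buffer_b64` key names shown in the Phase 1 file layout below (task 1.2 owns the authoritative spec):

```ts
// src/core/codec.ts sketch: reverse the sentinel encoding applied upstream.
// JSON.parse revives children before parents, so each sentinel object is
// complete when it reaches the reviver. How bigint/Buffer map onto numeric /
// bytea-or-text in Postgres is the writer's concern, per task 1.2.
export function decodeSentinels(payload: string): unknown {
  return JSON.parse(payload, (_key, value) => {
    if (value !== null && typeof value === 'object') {
      if (typeof value.__bigint === 'string') return BigInt(value.__bigint);
      if (typeof value.__buffer_b64 === 'string') {
        return Buffer.from(value.__buffer_b64, 'base64');
      }
    }
    return value;
  });
}
```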
## Sequencing
```
1.1 Project scaffold
├─→ 1.2 Core types & contracts
│ ├─→ 1.3 Configuration & logging
│ ├─→ 1.4 Postgres connection & positions hypertable
│ │ └─→ 1.7 Position writer (batched upsert)
│ └─→ 1.5 Redis Stream consumer
│ ├─→ 1.6 Per-device in-memory state
│ └─→ 1.8 Main wiring & ACK semantics (depends on 1.5, 1.6, 1.7)
│ └─→ 1.9 Observability
│ └─→ 1.10 Integration test
│ └─→ 1.11 Dockerfile & CI
```
Tasks 1.5/1.6/1.7 can be developed in parallel after 1.4 lands. Task 1.10 (integration test) should land *before* 1.11 because the CI workflow needs to know what `pnpm test` and `pnpm test:integration` will do.
## Files modified
Phase 1 produces this layout in `processor/`:
```
processor/
├── .gitea/workflows/build.yml
├── src/
│ ├── core/
│ │ ├── types.ts # Position, DeviceState, Metrics
│ │ ├── consumer.ts # XREADGROUP loop + claim handler
│ │ ├── writer.ts # Postgres batched upsert
│ │ ├── state.ts # in-memory device state with LRU
│ │ └── codec.ts # sentinel decode (__bigint, __buffer_b64)
│ ├── db/
│ │ ├── pool.ts # pg.Pool factory
│ │ └── migrations/
│ │ └── 0001_positions.sql # hypertable creation
│ ├── config/load.ts # zod schema for env
│ ├── observability/
│ │ ├── logger.ts # pino root logger
│ │ └── metrics.ts # prom-client + HTTP server
│ └── main.ts
├── test/
│ ├── codec.test.ts
│ ├── state.test.ts
│ ├── consumer.test.ts # mocked Redis behaviour
│ ├── writer.test.ts # mocked pg behaviour
│ └── pipeline.integration.test.ts # testcontainers Redis + Postgres
├── Dockerfile
├── compose.dev.yaml
├── package.json
├── pnpm-lock.yaml
├── tsconfig.json
├── vitest.config.ts
├── vitest.integration.config.ts
├── .dockerignore
├── .gitignore
├── .prettierrc
├── eslint.config.js
└── README.md
```
## Tech stack (decided)
- **Node.js 22 LTS**, ESM-only.
- **TypeScript 5.x** with `strict: true`, `noUncheckedIndexedAccess: true`.
- **pnpm** for dependency management.
- **vitest** for tests (unit + integration split — same pattern as `tcp-ingestion`).
- **pino** for structured logging (ISO timestamps, string level labels — same config as `tcp-ingestion`).
- **prom-client** for Prometheus metrics.
- **ioredis** for Redis Streams (XREADGROUP, XACK, XAUTOCLAIM).
- **pg** (`pg` package, not `postgres.js`) for Postgres — battle-tested, simple Pool API.
- **zod** for environment-variable validation (a `load.ts` sketch follows this list).
- **testcontainers** for integration tests (Redis 7 + TimescaleDB).
If an implementer wants to deviate, they must update the relevant task file first.
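For the zod item above, a sketch of what `src/config/load.ts` might look like (variable names are illustrative; task 1.3 owns the real list):

```ts
import { z } from 'zod';

const Env = z.object({
  NODE_ENV: z.enum(['development', 'production', 'test']).default('development'),
  INSTANCE_ID: z.string().min(1),
  REDIS_URL: z.string().url(),
  POSTGRES_URL: z.string().url(),
  METRICS_PORT: z.coerce.number().int().positive().default(9090),
  LOG_LEVEL: z.enum(['trace', 'debug', 'info', 'warn', 'error']).default('info'),
});

export type Config = z.infer<typeof Env>;

// Fail fast at boot: an invalid or missing variable throws a readable report.
export function loadConfig(env: NodeJS.ProcessEnv = process.env): Config {
  return Env.parse(env);
}
```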
## Key design decisions inherited from `tcp-ingestion`
- **ESLint `import/no-restricted-paths`** — `src/core/` cannot import from `src/domain/` (the boundary that protects Phase 1 from Phase 2 churn). `src/db/` is shared. A config sketch follows this list.
- **Logger config** — `pino.stdTimeFunctions.isoTime` + level-as-string formatter. Lifecycle events at `info`; high-frequency per-record events at `debug` or `trace`.
- **Slim Dockerfile** — multi-stage with BuildKit cache mounts, `pnpm fetch` + `pnpm install --offline` in the build stage, `pnpm prune --prod` for runtime.
- **CI workflow** — single-job pattern matching `tcp-ingestion/.gitea/workflows/build.yml`. No `services:` block; no separate test container.
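The boundary rule from the first bullet, as a flat-config excerpt (JavaScript, matching `eslint.config.js` in the layout above; assumes `eslint-plugin-import` is installed):

```js
import importPlugin from 'eslint-plugin-import';

export default [
  {
    files: ['src/**/*.ts'],
    plugins: { import: importPlugin },
    rules: {
      'import/no-restricted-paths': ['error', {
        zones: [
          {
            // Files under src/core may not import from src/domain.
            target: './src/core',
            from: './src/domain',
            message: 'core must not depend on domain (Phase 1 / Phase 2 boundary)',
          },
        ],
      }],
    },
  },
];
```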
# Phase 2 — Domain logic
**Status:** ⬜ Not started — blocks on Directus schema decisions
The phase that makes the Processor *racing-aware*. Phase 1 produces a generic position firehose into Postgres; Phase 2 layers the domain rules that turn raw positions into racing events: geofence crossings, timing records, IO interpretation, stage results.
## Outcome statement
When Phase 2 is done:
- Per-model Teltonika IO mappings (e.g. `FMB920 IO 16 → odometer_km`) live in a Directus-managed collection that the Processor reads at startup and refreshes on a known cadence. Decoded attributes are written to a typed shape alongside the raw bag.
- The geofence engine evaluates each incoming Position against the active geofences for the device's current event/stage and emits cross-events (entry/exit) when transitions happen.
- A `timing_records` table is written for each cross-event of interest (start gate, finish gate, intermediate splits), tied to the entry's bib/competitor/stage.
- A `stage_results` rollup is maintained per `(entry, stage)` showing total time, position, and split times. Updated on each new timing record.
## Why this is a separate phase
- **Throughput correctness is independent of domain correctness.** Phase 1 ships a working firehose; Phase 2 layers logic on top without touching the consumer/writer/state plumbing.
- **The Directus schema gates everything in this phase.** Geofences, entries, classes, device_assignments — all live in Directus collections. Until those are designed and migrated, Phase 2 has no schema to write against.
- **Multiple Phase 1 production milestones can pass before Phase 2 starts.** Real-device pilot, second tcp-ingestion instance, Redis high availability — none of those need Phase 2.
## Tasks (sketched, not detailed)
These tasks will get full task files once the Directus schema conversation is settled and we know the exact collection shapes. For now, this is the planned shape:
| # | Task | Notes |
|---|------|-------|
| 2.1 | Directus reflection — read-only client for `geofences`, `device_assignments`, `entries`, `events`, `stages` | Cached in memory, refreshed on a cadence; the boundary that lets the Processor know "what is this device currently racing in" |
| 2.2 | IO mapping table & per-model decoder | `device_models` collection in Directus → in-memory map → `decoded_attributes` JSONB column on `positions` (or a separate table) |
| 2.3 | Geofence engine | Per-position, evaluate active geofences for the device's current entry. Use PostGIS `ST_Contains` for the cross-detection. Emit cross-events (sketched after this table) |
| 2.4 | Timing record writer | Cross-events of interest → rows in `timing_records` (Directus-owned). Idempotent on `(entry_id, geofence_id, ts)` |
| 2.5 | Stage result aggregator | On each new `timing_records` row, recompute `stage_results.{total_time, position}` for the affected entry. Materialized incrementally to avoid full recomputation |
| 2.6 | Per-device runtime state extension | Phase 1's `DeviceState` extended with current entry, current stage, last geofence membership, accumulators. Note: Phase 3 rehydration becomes important once this state has substance |
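For task 2.3's containment check, a heavily hedged sketch: Phase 2 is blocked on the Directus schema, so the table name, geometry column, and SRID here are all assumptions.

```ts
import { Pool } from 'pg';

// Which of the candidate geofences contain the given point? Assumes a
// `geofences.geom` geometry column in SRID 4326; longitude before latitude.
async function geofencesContaining(
  pool: Pool,
  lon: number,
  lat: number,
  candidateIds: string[],
): Promise<Set<string>> {
  const { rows } = await pool.query<{ id: string }>(
    `SELECT id FROM geofences
      WHERE id = ANY($1)
        AND ST_Contains(geom, ST_SetSRID(ST_MakePoint($2, $3), 4326))`,
    [candidateIds, lon, lat],
  );
  return new Set(rows.map((r) => r.id));
}

// Cross-detection is a set diff against the membership held in DeviceState:
// ids newly present are entry events, ids newly absent are exit events.
```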
## Architectural boundary to maintain
`src/core/` from Phase 1 stays untouched. Phase 2 lives in `src/domain/`. The wire-up point is the `sink` function in `src/main.ts`: after `state.update` and `writer.write`, the sink invokes domain handlers. Per the ESLint rule from task 1.1, `src/core/` cannot import from `src/domain/` — only `main.ts` glues them.
## Open questions blocking task-level detail
(These get answered in the Directus schema conversation.)
1. Are `geofences` org-scoped, event-scoped, or both?
2. Is `device_assignments` time-bounded (start_at + end_at) or just event-bounded?
3. Where does the IO mapping table live — Directus collection, hardcoded in Processor, or in a config file?
4. What's the canonical name for the sub-event unit — `stage`, `session`, `run`, `leg`?
5. Is there a live leaderboard requirement, or is timing reviewed post-event?
# Phase 3 — Production hardening
**Status:** ⬜ Not started
The set of operational features that turn a working pilot into something safe to leave running unattended through deploys, instance failures, and bad data.
## Outcome statement
When Phase 3 is done:
- **Graceful shutdown** with bounded in-flight drain: SIGTERM blocks new reads, awaits in-flight writes, ACKs anything still in PEL whose write succeeded, exits clean.
- **State rehydration on restart**: on first packet for an unknown device, the Processor queries Postgres for the device's `last_position` and seeds `DeviceState` accordingly. Phase 2 accumulators get the same treatment (e.g. last geofence membership comes from the last `timing_records` row).
- **`XAUTOCLAIM` for stuck pending entries**: at startup and on a cadence, the Processor claims entries that have been pending in another consumer's PEL for longer than `CLAIM_THRESHOLD_MS`. Lets a dead instance's work get picked up by survivors without manual intervention. A claim-pass sketch follows this list.
- **Dead-letter stream for poison records**: records that fail to decode N times go to `telemetry:t:dlq` with the original payload + the error. Operators can inspect, fix, replay.
- **Multi-instance load split verified**: spinning up two Processor instances against the same consumer group splits the work evenly. End-to-end test in CI (or at least a manual playbook).
- **Migration safety with multiple instances**: Postgres advisory locks around the migration runner so two instances starting simultaneously don't race.
- **Uncaught exception / unhandled rejection handlers**: log, flush in-memory state to a panic dump file, exit with a code Portainer treats as restart-worthy.
- **`OPERATIONS.md` runbook**: exact commands for "claim stuck entries from a dead instance," "drain the DLQ," "force-rehydrate a single device," "view consumer lag," etc.
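A sketch of the claim pass from the `XAUTOCLAIM` bullet (ioredis command form; stream and group names from Phase 1; the threshold and COUNT are illustrative):

```ts
import Redis from 'ioredis';

// One claim pass: take over entries pending in any consumer's PEL for longer
// than thresholdMs, then run them through the normal sink. Redis >= 7 returns
// [nextCursor, entries, deletedIds]; loop until the cursor wraps to 0-0.
async function claimStuckEntries(redis: Redis, consumer: string, thresholdMs: number) {
  let cursor = '0-0';
  do {
    const [next, entries] = (await redis.xautoclaim(
      'telemetry:t', 'processor', consumer, thresholdMs, cursor, 'COUNT', 100,
    )) as [string, Array<[id: string, fields: string[]]>, string[]];
    cursor = next;
    for (const [id, fields] of entries) {
      // ...decode, state.update, writer.write, then XACK, exactly as in main...
      void id;
      void fields;
    }
  } while (cursor !== '0-0');
}
```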
## Tasks (sketched, not detailed)
| # | Task | Notes |
|---|------|-------|
| 3.1 | Graceful shutdown — full | Replaces the Phase 1 stub. Drain budget configurable. Tested end-to-end |
| 3.2 | Per-device state rehydration on first-packet | Single `SELECT ... LIMIT 1` per cold device. Memoized by LRU |
| 3.3 | `XAUTOCLAIM` runner | Periodic + on-startup. Claims entries pending > `CLAIM_THRESHOLD_MS`. Re-runs the sink |
| 3.4 | Dead-letter stream | After N failed decodes/writes, record goes to `telemetry:t:dlq`; original ACKed off the main stream |
| 3.5 | Migration advisory lock | `pg_advisory_lock(<hash>)` around the migrate runner, so two instances can start simultaneously without racing (sketched after this table) |
| 3.6 | Uncaught exception / unhandled rejection handlers | Log, flush, exit. Match `tcp-ingestion`'s eventual Phase 1 task 1.12 work when that lands |
| 3.7 | OPERATIONS.md | The runbook |
| 3.8 | Multi-instance load test | A test (manual or in CI) that proves two instances split the work; document expected lag behaviour during failover |
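And a sketch of task 3.5's advisory lock, assuming node-postgres; the key value is arbitrary but must be identical across instances:

```ts
import { Pool } from 'pg';

const MIGRATION_LOCK_KEY = 724_001; // any fixed bigint shared by all instances

// Advisory locks are session-scoped, so the lock and unlock must run on the
// same client, not on whatever connection the pool hands out next.
async function migrateWithLock(pool: Pool, runMigrations: () => Promise<void>) {
  const client = await pool.connect();
  try {
    // Blocks until free; the second instance waits here, then finds the
    // migrations already applied and no-ops.
    await client.query('SELECT pg_advisory_lock($1)', [MIGRATION_LOCK_KEY]);
    await runMigrations();
  } finally {
    await client.query('SELECT pg_advisory_unlock($1)', [MIGRATION_LOCK_KEY]);
    client.release();
  }
}
```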
## Why this is a separate phase
Phase 1 + Phase 2 produce a service that *works*. Phase 3 is what you do *before you stop watching it*. None of these tasks change correctness — they change operational ergonomics.
## Resume triggers
Each Phase 3 task has its own resume trigger. The whole phase doesn't have to land at once:
- **3.1, 3.5, 3.6** before adding a second Processor instance (rolling deploys become safe).
- **3.2** before any Phase 2 task that depends on hot state (geofence membership) — without rehydration, a restart would forget which geofence each device is in until the device crosses a boundary again.
- **3.3, 3.4** before the pilot is "always-on" (operators need a way to handle stuck/poison records without touching production).
- **3.7** can land alongside whichever of the above ships first, and is updated over time.
- **3.8** before the second instance is added.
# Phase 4 — Future / optional
**Status:** ❄️ Not committed
Ideas on radar that may or may not become real tasks. Captured here so they don't get forgotten and so we have a place to push scope creep that surfaces during Phases 1-3.
## Candidates
- **Directus Flow trigger emission.** When a domain event fires (timing record written, stage result computed, anomaly detected), publish a structured event Directus Flows can subscribe to. Lets Directus orchestrate notifications, integrations, derived workflows without polling the database.
- **Replay tooling.** Read historical positions for a device + time range from Postgres, re-emit them through the domain pipeline (geofence engine, timing logic) without touching `positions`. Useful for: validating a new geofence layout against past races, regenerating timing records after a rule change, demoing.
- **Derived-metric backfill.** When the IO mapping table changes (new model, corrected mapping), backfill `decoded_attributes` for affected devices over a chosen time range without touching `positions`.
- **Alternate consumer for analytics export.** A second consumer group reading the same stream, writing to a parallel destination (Parquet on object storage, ClickHouse, etc.) for offline analytics. The Phase 1 architecture already supports this — it's a separate process joining the same stream with a different group name. No Processor changes needed; just operational scaffolding.
- **WebSocket gateway for live updates.** If Directus's WebSocket subscriptions hit a fan-out ceiling for spectator-facing live leaderboards, a dedicated gateway reads from Redis and pushes to clients, bypassing Directus for the live channel only. REST/GraphQL stays in Directus. Mentioned in `wiki/entities/directus.md`.
- **Per-instance sharding hint.** If consumer-group load distribution turns out to be uneven (one instance handles all the chatty devices), introduce hashing-by-device-id with explicit assignment. Probably overkill — Redis Streams' default round-robin works for most workloads.
None of these are committed. Move them out of Phase 4 and into a numbered phase only when there's a concrete reason to do them.