Add planning documents for Phase 1 (throughput pipeline) and stub Phases 2-4

ROADMAP.md establishes status legend, architectural anchors pointing at the
wiki, and seven non-negotiable design rules — most importantly the
core/domain boundary that protects Phase 1 from Phase 2 churn, the
schema-authority split (positions hypertable owned here; everything else
owned by Directus), and idempotent-writes via (device_id, ts) ON CONFLICT.

Phase 1 (throughput pipeline) is fully detailed across 11 task files:
scaffold, core types + sentinel decoder, config + logging, Postgres
hypertable, Redis Stream consumer, per-device LRU state, batched writer,
main wiring, observability, integration test, Dockerfile + Gitea CI.
Observability is in Phase 1 (not deferred) — lesson learned from
tcp-ingestion task 1.10.

Phases 2-4 are stub READMEs. Phase 2 (domain logic) blocks on Directus
schema decisions and lists those open questions explicitly. Phase 3
(production hardening) and Phase 4 (future) sketch the task shape.
2026-04-30 21:16:26 +02:00
parent 1a4202f4d1
commit c314ba0902
17 changed files with 1191 additions and 0 deletions
@@ -0,0 +1,90 @@
# processor — Roadmap
A Node.js worker service that consumes normalized `Position` records from a Redis Stream, maintains per-device runtime state, applies racing-domain rules, and writes durable state to Postgres / TimescaleDB.
This file is the single navigation hub for all implementation planning. Each phase has its own folder with a README and granular task files. Update statuses here as work lands.
## Status legend
| Symbol | Meaning |
|--------|---------|
| ⬜ | Not started |
| 🟦 | Planned (designed, not coded) |
| 🟨 | In progress |
| 🟩 | Done |
| ⏸ | Paused / blocked |
| ❄️ | Frozen / future / optional |
## Architectural anchors
The service is specified by the wiki at `../docs/wiki/`. Implementing agents should read these pages before starting any task:
- **Architecture** — `docs/wiki/sources/gps-tracking-architecture.md`, `docs/wiki/concepts/plane-separation.md`, `docs/wiki/concepts/failure-domains.md`
- **This service** — `docs/wiki/entities/processor.md`
- **Upstream contract (input)** — `docs/wiki/concepts/position-record.md`, `docs/wiki/concepts/io-element-bag.md`, `docs/wiki/entities/redis-streams.md`
- **Downstream contract (output)** — `docs/wiki/entities/postgres-timescaledb.md`, `docs/wiki/entities/directus.md`
## Non-negotiable design rules
These rules govern every task. Any deviation must be discussed and documented as a decision before code lands.
1. **Domain logic is isolated.** `src/core/` (Stream consumer, Postgres writer, in-memory state plumbing) never imports from `src/domain/` (geofence engine, timing logic, IO interpretation). Phase 2 must be a pure addition layered on top of the Phase 1 throughput pipeline.
2. **Schema authority lives in Directus**, with one exception: the `positions` hypertable is bulk-written by this service and its migration is owned here. All other tables Processor writes to (timing_records, stage_results, etc.) are defined in Directus and Processor inserts respecting that schema.
3. **Per-device state is in-memory, not durable.** The DB is the source of truth for replay/analysis; in-memory state is the source of truth for the current decision. On restart, hot state can be rebuilt from the DB — Phase 1 does not implement that rehydration (it lands in Phase 3); restart loses state, which is acceptable until Phase 2 introduces stateful accumulators.
4. **Consumer-group offsets drive work distribution.** No application-level coordination between Processor instances. `XACK` on success; failed batches stay pending and are claimed by surviving instances via `XAUTOCLAIM`.
5. **Idempotent writes.** Records arriving twice (after a claim, replay, or retry) must not produce duplicate rows. The `positions` hypertable uses `(device_id, ts)` as a unique key with `ON CONFLICT DO NOTHING`. Derived tables follow the same pattern, scoped by their natural keys.
6. **Bounded memory.** Per-device state is capped (LRU eviction by last-seen timestamp); replay-from-DB rehydrates an evicted device on next packet. No unbounded `Map<imei, ...>` growth.
7. **Fail loudly.** Schema-incompatible records (e.g. malformed payload, unknown sentinel) are logged at `error` and dead-letter-streamed (Phase 3); they do **not** silently skip.
## Phases
### Phase 1 — Throughput pipeline
**Status:** ⬜ Not started
**Outcome:** A Node.js Processor that joins a Redis Streams consumer group on `telemetry:t`, decodes each `Position` (including `__bigint`/`__buffer_b64` sentinel reversal), upserts it into a TimescaleDB `positions` hypertable, updates per-device in-memory state (last position, last seen), `XACK`s on successful write, and exposes Prometheus metrics + health/readiness HTTP endpoints. End-to-end pilot-quality service; no domain logic yet.
[**See `phase-1-throughput/README.md`**](./phase-1-throughput/README.md)
| # | Task | Status | Landed in |
|---|------|--------|-----------|
| 1.1 | [Project scaffold](./phase-1-throughput/01-project-scaffold.md) | ⬜ | — |
| 1.2 | [Core types & contracts](./phase-1-throughput/02-core-types.md) | ⬜ | — |
| 1.3 | [Configuration & logging](./phase-1-throughput/03-config-and-logging.md) | ⬜ | — |
| 1.4 | [Postgres connection & `positions` hypertable](./phase-1-throughput/04-postgres-schema.md) | ⬜ | — |
| 1.5 | [Redis Stream consumer (XREADGROUP)](./phase-1-throughput/05-stream-consumer.md) | ⬜ | — |
| 1.6 | [Per-device in-memory state](./phase-1-throughput/06-device-state.md) | ⬜ | — |
| 1.7 | [Position writer (batched upsert)](./phase-1-throughput/07-position-writer.md) | ⬜ | — |
| 1.8 | [Main wiring & ACK semantics](./phase-1-throughput/08-main-wiring.md) | ⬜ | — |
| 1.9 | [Observability (Prometheus metrics + /healthz + /readyz)](./phase-1-throughput/09-observability.md) | ⬜ | — |
| 1.10 | [Integration test (testcontainers Redis + Postgres)](./phase-1-throughput/10-integration-test.md) | ⬜ | — |
| 1.11 | [Dockerfile & Gitea workflow](./phase-1-throughput/11-dockerfile-and-ci.md) | ⬜ | — |
### Phase 2 — Domain logic
**Status:** ⬜ Not started — blocks on Directus schema decisions
**Outcome:** Geofence engine that detects entry/checkpoint/finish crossings; per-model Teltonika IO mapping driving derived attributes (`odometer_km`, `ignition`, etc.); timing record writer producing entries in the Directus-owned `timing_records` table; per-stage result aggregator. Layered on top of Phase 1 — no changes to the throughput pipeline.
[**See `phase-2-domain/README.md`**](./phase-2-domain/README.md)
Detailed task breakdown deferred until the Directus schema is finalized (open questions about geofence ownership, IO mapping storage, stage vocabulary). Phase 1 can ship and run on stage without any Phase 2 work.
### Phase 3 — Production hardening
**Status:** ⬜ Not started
**Outcome:** Graceful shutdown with consumer-group commit on SIGTERM; per-device state rehydration from Postgres on startup (only loaded on first packet for a given device); `XAUTOCLAIM` for stuck pending entries from a dead instance; dead-letter stream for poison records; multi-instance load-split verification; `OPERATIONS.md` runbook.
[**See `phase-3-hardening/README.md`**](./phase-3-hardening/README.md)
### Phase 4 — Future / optional
**Status:** ❄️ Not committed
[**See `phase-4-future/README.md`**](./phase-4-future/README.md)
Ideas on radar: Directus Flow trigger emission, replay tooling, derived-metric backfill, alternate consumer for analytics export.
## Operating model
- **Implementation agent contract.** Each task file is self-sufficient: goal, deliverables, specification, acceptance criteria. An agent should be able to complete one task without reading the whole wiki — but should skim the wiki references at the top of the task before starting.
- **Sequence within a phase.** Task numbering reflects intended order. Soft dependencies are explicit in each task's "Depends on" field. Tasks with no dependencies on each other can be done in parallel.
- **Status updates.** When a task is started, change its row in this ROADMAP to 🟨 and the task file's status badge accordingly. When done, 🟩 + a one-line note in the task file's "Done" section pointing at the merging commit/PR.
- **Drift control.** If implementation diverges from a task's spec, update the task file *before* the diverging code lands, with a note explaining why. Do not let plans rot — either fix the plan or fix the code.
@@ -0,0 +1,58 @@
# Task 1.1 — Project scaffold
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** None
**Wiki refs:** `docs/wiki/entities/processor.md`
## Goal
Initialize the Node.js / TypeScript project with the directory layout from the Phase 1 README, install the agreed tooling, and produce a minimal `main.ts` that the rest of Phase 1 builds on. Mirror the `tcp-ingestion` scaffold conventions exactly so the two services feel uniform.
## Deliverables
- `package.json` declaring:
- `"type": "module"` (ESM only).
- `"engines": { "node": ">=22" }`.
- Scripts: `build`, `dev`, `start`, `test`, `test:watch`, `test:integration`, `lint`, `format`, `typecheck`.
- Dependencies: `ioredis`, `pg`, `pino`, `prom-client`, `zod`.
- Dev dependencies: `typescript`, `@types/node`, `@types/pg`, `vitest`, `@vitest/coverage-v8`, `eslint`, `@typescript-eslint/parser`, `@typescript-eslint/eslint-plugin`, `eslint-plugin-import`, `eslint-import-resolver-typescript`, `prettier`, `pino-pretty`, `tsx`, `testcontainers`.
- `tsconfig.json` — same as `tcp-ingestion`: `strict: true`, `target: ES2022`, `module: NodeNext`, `moduleResolution: NodeNext`, `outDir: dist`, `rootDir: src`, `noUncheckedIndexedAccess: true`.
- `eslint.config.js` (flat config) with `@typescript-eslint/recommended-type-checked`, `@typescript-eslint/no-floating-promises`, `@typescript-eslint/no-misused-promises`. Add `import/no-restricted-paths` enforcing **`src/core/` cannot import from `src/domain/`**. (`src/domain/` doesn't exist yet — the rule is preemptive so Phase 2 can't violate the boundary by accident.)
- `.prettierrc` — match `tcp-ingestion` (2 spaces, single quotes, semis).
- `.gitignore` — `node_modules/`, `dist/`, `coverage/`, `.env`, `.env.local`, `*.log`.
- `.dockerignore` — same as `.gitignore` plus `.git/`, `.planning/`, `test/`, `*.md` except `README.md`.
- `vitest.config.ts` — unit-test config that excludes `*.integration.test.ts`.
- `vitest.integration.config.ts` — integration-test config with `hookTimeout: 120_000`, `testTimeout: 60_000`. Include only `*.integration.test.ts`.
- `.env.example` documenting every env var (with descriptions and defaults). Required vars only: `REDIS_URL`, `POSTGRES_URL`. All others have sensible defaults.
- Empty directories with `.gitkeep` files where Phase 1 will fill them in:
- `src/core/`, `src/db/migrations/`, `src/config/`, `src/observability/`
- `test/`
- `src/main.ts` — minimal stub: imports nothing yet, prints `processor starting` to stdout, exits with code 0.
- `README.md` — short description pointing at `.planning/ROADMAP.md` for the work plan, and at `../docs/wiki/entities/processor.md` for the architectural specification.
## Specification
- **Package manager:** pnpm. Commit `pnpm-lock.yaml`. The Dockerfile (task 1.11) will use `pnpm fetch` for layer-cache friendliness.
- **Module style:** ESM throughout. Relative imports use `.js` suffix per Node ESM resolution. No `paths` aliases.
- **No bundler.** Build is `tsc` only. Runtime is plain Node consuming `dist/`.
- **Linting style:** Configure ESLint to enforce no-floating-promises and no-misused-promises — both critical in a stream consumer where unhandled rejection silently loses work.
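The core/domain boundary rule from the deliverables might be wired roughly like this in the flat config — a sketch only; the zone paths and message wording are assumptions:
```js
// eslint.config.js — sketch of the preemptive core/domain boundary rule.
import importPlugin from 'eslint-plugin-import';

export default [
  {
    files: ['src/**/*.ts'],
    plugins: { import: importPlugin },
    rules: {
      'import/no-restricted-paths': [
        'error',
        {
          zones: [
            {
              target: './src/core', // files being restricted...
              from: './src/domain', // ...may not import from here
              message: 'src/core must not import from src/domain (ROADMAP rule 1).',
            },
          ],
        },
      ],
    },
  },
];
```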
## Acceptance criteria
- [ ] `pnpm install` succeeds with no warnings other than peer deps.
- [ ] `pnpm typecheck` succeeds on the empty project.
- [ ] `pnpm lint` succeeds.
- [ ] `pnpm build` produces `dist/main.js`.
- [ ] `pnpm start` runs the compiled output and prints the startup message.
- [ ] `pnpm test` runs (with no tests) and exits successfully.
- [ ] `pnpm dev` runs `main.ts` via `tsx` and prints the startup message.
- [ ] Repository builds reproducibly: deleting `node_modules` and `dist`, then `pnpm install --frozen-lockfile && pnpm build` produces identical output.
## Risks / open questions
- The `import/no-restricted-paths` rule is preemptive and will be silently inert until Phase 2 introduces `src/domain/`. Verify it activates correctly with a temporary `src/domain/foo.ts` during scaffold setup, then remove.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,66 @@
# Task 1.2 — Core types & contracts
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.1
**Wiki refs:** `docs/wiki/concepts/position-record.md`, `docs/wiki/concepts/io-element-bag.md`
## Goal
Define the canonical TypeScript types for the data flowing through the Processor: the `Position` record (input from Redis), the per-device runtime state, the metrics interface, and the codec for reversing the JSON sentinels (`__bigint`, `__buffer_b64`) that `tcp-ingestion` produces.
This task does **not** add behaviour — only types and a sentinel decoder with unit tests. Behaviour is layered on in subsequent tasks.
## Deliverables
- `src/core/types.ts` exporting:
- `Position` — must be byte-equivalent to `tcp-ingestion`'s output. Fields: `device_id: string`, `timestamp: Date`, `latitude: number`, `longitude: number`, `altitude: number`, `angle: number`, `speed: number`, `satellites: number`, `priority: number`, `attributes: Record<string, AttributeValue>`. Where `AttributeValue = number | bigint | Buffer`.
- `StreamRecord` — what `XREADGROUP` actually returns: `{ id: string; ts: string; device_id: string; codec: string; payload: string }`. The `payload` field is the JSON-encoded `Position` (still sentinel-encoded — the consumer decodes it).
- `DeviceState` — `{ device_id: string; last_position: Position; last_seen: Date; position_count_session: number }`.
- `Metrics` interface — same shape as `tcp-ingestion`: `inc(name: string, labels?: Record<string, string>): void`, `observe(name: string, value: number, labels?: Record<string, string>): void`. Don't extend it in Phase 1; task 1.9 may add helpers but the interface stays.
- `src/core/codec.ts` exporting:
- `decodePosition(payload: string): Position` — JSON-parses with a reviver that detects `{ __bigint: "..." }` → `BigInt(...)` and `{ __buffer_b64: "..." }` → `Buffer.from(s, 'base64')`. Throws `CodecError` with a clear message on malformed payloads.
- `class CodecError extends Error` for failure cases.
- `test/codec.test.ts` covering:
- Round-trip a Position with bigint and Buffer attributes through `tcp-ingestion`'s `serializePosition` (copy the helper into the test or inline-encode) → `decodePosition` → assert byte-equal.
- Decode a u64-max bigint sentinel.
- Decode a Buffer sentinel with non-UTF-8 bytes (e.g. `0xde 0xad 0xbe 0xef`).
- Reject malformed payload (non-JSON, missing required fields, invalid sentinel shape).
- `device_id`, `timestamp` (ISO string → Date), and numeric fields all decode correctly.
## Specification
### Sentinel reversal — exact rules
The reviver runs on every JSON value. For each value:
1. If the value is an object with exactly one property `__bigint` whose value is a string of digits → return `BigInt(value.__bigint)`. Validate that the string parses or throw.
2. If the value is an object with exactly one property `__buffer_b64` whose value is a base64 string → return `Buffer.from(value.__buffer_b64, 'base64')`.
3. If the key is `timestamp` and the value is a string → return `new Date(value)` (validate it parsed; reject `Invalid Date`).
4. Otherwise pass through.
**Critical:** the reviver must not match shapes broader than the sentinels. A user-defined attribute `{ __bigint: "..." }` is by definition a sentinel — there is no ambiguity because `tcp-ingestion` only uses these shapes for sentinel encoding. But validate the inner string strictly so a malformed attribute fails fast.
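A minimal sketch of the reviver under these rules (error messages and the field-validation placement are illustrative, not prescriptive):
```ts
import { Buffer } from 'node:buffer';
import type { Position } from './types.js';

export class CodecError extends Error {}

function reviver(key: string, value: unknown): unknown {
  if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
    const keys = Object.keys(value);
    if (keys.length === 1 && keys[0] === '__bigint') {
      const s = (value as { __bigint: unknown }).__bigint;
      if (typeof s !== 'string' || !/^\d+$/.test(s)) throw new CodecError(`invalid __bigint sentinel at "${key}"`);
      return BigInt(s);
    }
    if (keys.length === 1 && keys[0] === '__buffer_b64') {
      const s = (value as { __buffer_b64: unknown }).__buffer_b64;
      if (typeof s !== 'string') throw new CodecError(`invalid __buffer_b64 sentinel at "${key}"`);
      return Buffer.from(s, 'base64');
    }
  }
  if (key === 'timestamp' && typeof value === 'string') {
    const d = new Date(value);
    if (Number.isNaN(d.getTime())) throw new CodecError('timestamp is not a valid ISO date');
    return d;
  }
  return value;
}

export function decodePosition(payload: string): Position {
  try {
    return JSON.parse(payload, reviver) as Position; // required-field presence checks omitted in this sketch
  } catch (err) {
    if (err instanceof CodecError) throw err;
    throw new CodecError(`payload is not valid JSON: ${(err as Error).message}`);
  }
}
```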
### Why `Buffer`, not `Uint8Array`
`tcp-ingestion` uses Node's `Buffer`. We're also Node-only. Using `Buffer` avoids the conversion cost on every record. If the platform later needs to support browser-side decoding (e.g. for a debug tool), introduce a `Uint8Array`-based parallel path then; not now.
### Why `bigint`, not `number`
Some Teltonika IO elements are u64 values that exceed `Number.MAX_SAFE_INTEGER` (2^53 − 1). Silently truncating to `number` would corrupt those values. Phase 1 preserves them as `bigint`; the Position writer (task 1.7) decides how to store them in Postgres (likely `numeric` or stringified — see that task).
## Acceptance criteria
- [ ] `pnpm typecheck` succeeds.
- [ ] `pnpm test` runs `codec.test.ts` and all cases pass.
- [ ] A round-tripped Position with `bigint` and `Buffer` attributes matches the original byte-for-byte (including `Buffer` content equality, not just length).
- [ ] Malformed payloads throw `CodecError` with a descriptive message that names the failing field.
## Risks / open questions
- The reviver runs on the full JSON tree, including the top-level object. Verify that nested attributes (rare, but possible if Teltonika ever sends nested IO bags in Codec 16) decode correctly. The spec doesn't currently allow nesting; treat unexpected nesting as an error.
- TypeScript inference for revivers is awkward (`(key: string, value: any) => any`). Use a typed wrapper to surface the result as `Position` without `any` leakage outside the codec module.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,76 @@
# Task 1.3 — Configuration & logging
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.1
**Wiki refs:** `docs/wiki/entities/processor.md`
## Goal
Validate environment variables on startup with `zod`, build the pino root logger with the same conventions as `tcp-ingestion` (ISO timestamps, string level labels, instance_id base field), and fail fast with a readable error message if config is invalid.
## Deliverables
- `src/config/load.ts` exporting:
- `loadConfig(): Config` — reads `process.env`, runs zod parse, returns a typed `Config`. Throws on invalid input with a multi-line message that names every invalid field.
- `Config` type derived from the zod schema.
- `src/observability/logger.ts` exporting:
- `createLogger({ level, nodeEnv, instanceId }): Logger` — pino root logger with base fields `service: 'processor'`, `instance_id`. ISO timestamps via `pino.stdTimeFunctions.isoTime`. Level formatter that emits `"level":"info"` not `"level":30`. In `nodeEnv === 'development'`, use the pino-pretty transport.
- `type Logger` re-exported from `pino`.
- Wire both into `src/main.ts`: `loadConfig()` → `createLogger()` → `logger.info('processor starting')` → exit 0 (still a stub; consumer wiring lands in 1.8).
## Specification
### Environment variables
| Var | Required | Default | Notes |
|---|---|---|---|
| `NODE_ENV` | no | `production` | `development` enables pino-pretty |
| `INSTANCE_ID` | no | `processor-1` | Used in metrics + log base field |
| `LOG_LEVEL` | no | `info` | `trace` / `debug` / `info` / `warn` / `error` |
| `REDIS_URL` | yes | — | e.g. `redis://redis:6379` |
| `POSTGRES_URL` | yes | — | e.g. `postgres://user:pass@db:5432/trm` |
| `REDIS_TELEMETRY_STREAM` | no | `telemetry:t` | Must match `tcp-ingestion`'s `REDIS_TELEMETRY_STREAM` |
| `REDIS_CONSUMER_GROUP` | no | `processor` | All Processor instances join this group |
| `REDIS_CONSUMER_NAME` | no | `${INSTANCE_ID}` | Unique per instance — defaults to instance id |
| `METRICS_PORT` | no | `9090` | HTTP server port for `/metrics`, `/healthz`, `/readyz` |
| `BATCH_SIZE` | no | `100` | Max records per `XREADGROUP` call |
| `BATCH_BLOCK_MS` | no | `5000` | `BLOCK` timeout on `XREADGROUP` when stream is empty |
| `WRITE_BATCH_SIZE` | no | `50` | Max rows per Postgres `INSERT` |
| `DEVICE_STATE_LRU_CAP` | no | `10000` | Max devices kept in memory; LRU eviction beyond this |
### Validation rules
- All defaults must be expressed in the zod schema with `.default(...)` so the parsed `Config` is fully typed and never has `undefined` for an optional field.
- Numeric env vars must be coerced from string and bounded: `BATCH_SIZE` 1–10000, `BATCH_BLOCK_MS` 0–60000, `WRITE_BATCH_SIZE` 1–1000, `DEVICE_STATE_LRU_CAP` 100–1_000_000.
- `REDIS_URL` and `POSTGRES_URL` must parse as URLs with the expected protocol (`redis:` or `rediss:`; `postgres:` or `postgresql:`).
- `LOG_LEVEL` must be one of pino's accepted levels.
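A sketch of the schema these rules imply (zod v3 API; trimming or renaming fields is the implementer's call):
```ts
import { z } from 'zod';

const schema = z
  .object({
    NODE_ENV: z.enum(['development', 'production', 'test']).default('production'),
    INSTANCE_ID: z.string().min(1).default('processor-1'),
    LOG_LEVEL: z.enum(['trace', 'debug', 'info', 'warn', 'error']).default('info'),
    REDIS_URL: z.string().url().refine((u) => /^rediss?:/.test(u), { message: 'must be redis:// or rediss://' }),
    POSTGRES_URL: z.string().url().refine((u) => /^postgres(ql)?:/.test(u), { message: 'must be postgres:// or postgresql://' }),
    REDIS_TELEMETRY_STREAM: z.string().default('telemetry:t'),
    REDIS_CONSUMER_GROUP: z.string().default('processor'),
    REDIS_CONSUMER_NAME: z.string().optional(), // defaulted to INSTANCE_ID below
    METRICS_PORT: z.coerce.number().int().min(1).max(65535).default(9090),
    BATCH_SIZE: z.coerce.number().int().min(1).max(10_000).default(100),
    BATCH_BLOCK_MS: z.coerce.number().int().min(0).max(60_000).default(5_000),
    WRITE_BATCH_SIZE: z.coerce.number().int().min(1).max(1_000).default(50),
    DEVICE_STATE_LRU_CAP: z.coerce.number().int().min(100).max(1_000_000).default(10_000),
  })
  .transform((c) => ({ ...c, REDIS_CONSUMER_NAME: c.REDIS_CONSUMER_NAME ?? c.INSTANCE_ID }));

export type Config = z.output<typeof schema>;

export function loadConfig(env: NodeJS.ProcessEnv = process.env): Config {
  const parsed = schema.safeParse(env);
  if (!parsed.success) {
    // Multi-line error naming every invalid field, as the deliverable requires.
    const lines = parsed.error.issues.map((i) => `  ${i.path.join('.')}: ${i.message}`);
    throw new Error(`Invalid configuration:\n${lines.join('\n')}`);
  }
  return parsed.data;
}
```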
### Logger conventions
Match `tcp-ingestion/src/observability/logger.ts` line for line where applicable. Future-you grepping across services should see the same shape:
```ts
const formatters = { level: (label: string) => ({ level: label }) };
if (nodeEnv === 'development') {
return pino({ level, base, timestamp: pino.stdTimeFunctions.isoTime, formatters,
transport: { target: 'pino-pretty', options: { colorize: true, translateTime: 'SYS:standard', ignore: 'pid,hostname' } } });
}
return pino({ level, base, timestamp: pino.stdTimeFunctions.isoTime, formatters });
```
## Acceptance criteria
- [ ] `pnpm test` covers config validation: missing required vars throw with the right message; invalid URLs throw; bounded numerics throw on out-of-range values.
- [ ] Running with valid env emits a single `processor starting` info log with `service=processor` and `instance_id=processor-1` base fields.
- [ ] Running with `NODE_ENV=development` produces colorized output via pino-pretty.
- [ ] Running with `NODE_ENV=production` produces JSON output with ISO `time` and string `level`.
## Risks / open questions
- `REDIS_CONSUMER_NAME` defaulting to `INSTANCE_ID` means `INSTANCE_ID` must be unique per instance for safe consumer-group operation. Document this in `.env.example` so operators don't accidentally run two instances with the same `INSTANCE_ID`.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,89 @@
# Task 1.4 — Postgres connection & `positions` hypertable
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.1, 1.3
**Wiki refs:** `docs/wiki/entities/postgres-timescaledb.md`
## Goal
Stand up the Postgres connection (a single `pg.Pool`) and define the `positions` hypertable migration. This is the only table whose schema the Processor owns directly (per the design rule in ROADMAP.md). Every other table is owned by Directus.
## Deliverables
- `src/db/pool.ts` exporting:
- `createPool(url: string): pg.Pool` — single Pool with sane defaults (`max: 10`, `idleTimeoutMillis: 30_000`, `connectionTimeoutMillis: 5_000`). Sets `application_name = 'processor'` so connections are identifiable in `pg_stat_activity`.
- `connectWithRetry(pool, logger): Promise<void>` — runs `SELECT 1` with exponential backoff (3 attempts, up to 5s). Mirrors `tcp-ingestion`'s `connectRedis` pattern. Calls `process.exit(1)` on final failure. See the sketch after this list.
- `src/db/migrations/0001_positions.sql` containing:
- `CREATE EXTENSION IF NOT EXISTS timescaledb;` (no-op if already enabled)
- `CREATE TABLE IF NOT EXISTS positions (...)` per the schema below
- `SELECT create_hypertable('positions', 'ts', if_not_exists => TRUE, chunk_time_interval => INTERVAL '1 day');`
- `CREATE UNIQUE INDEX IF NOT EXISTS positions_device_ts ON positions (device_id, ts);`
- `CREATE INDEX IF NOT EXISTS positions_ts ON positions (ts DESC);`
- `src/db/migrate.ts` — minimal runner that executes pending migration files in order. Tracks applied migrations in a `schema_migrations(version, applied_at)` table. Idempotent. Called from `main.ts` before the consumer starts.
- `test/db/migrate.test.ts` covering: applying a fresh migration; applying twice is a no-op; bad SQL fails loudly.
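A sketch of `connectWithRetry` under those constraints (the exact backoff steps are an assumption within the stated 5s cap):
```ts
import type { Pool } from 'pg';
import type { Logger } from 'pino';

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function connectWithRetry(pool: Pool, logger: Logger): Promise<void> {
  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      await pool.query('SELECT 1');
      logger.info('Postgres connected');
      return;
    } catch (err) {
      const delayMs = Math.min(1_000 * 2 ** (attempt - 1), 5_000); // 1s, 2s, 4s — capped at 5s
      logger.error({ err, attempt }, 'Postgres connection check failed');
      if (attempt < 3) await sleep(delayMs);
    }
  }
  logger.error('Postgres unreachable after 3 attempts; exiting');
  process.exit(1);
}
```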
## Specification
### `positions` table schema
```sql
CREATE TABLE IF NOT EXISTS positions (
device_id text NOT NULL,
ts timestamptz NOT NULL, -- canonical event time from device GPS
ingested_at timestamptz NOT NULL DEFAULT now(), -- when Processor wrote the row
latitude double precision NOT NULL,
longitude double precision NOT NULL,
altitude real NOT NULL,
angle real NOT NULL,
speed real NOT NULL,
satellites smallint NOT NULL,
priority smallint NOT NULL,
codec text NOT NULL, -- '8' | '8E' | '16'
attributes jsonb NOT NULL -- the IO bag, sentinel-decoded
);
```
### Why these column types
- `device_id text` — IMEIs are 15 ASCII digits. Could be `bigint`, but `text` keeps the door open for non-IMEI device identifiers (future vendors) and avoids leading-zero loss.
- `ts timestamptz` — the **device-reported** time, not ingestion time. This is the hypertable partitioning column.
- `ingested_at timestamptz` — diagnostic: helps spot devices with clock skew or buffered records (the 55-record buffer flush we saw on stage). Not part of the natural key.
- `altitude/angle/speed real` — float32 is plenty; saves space on a high-volume table.
- `attributes jsonb` — preserves the IO bag verbatim. Per the design rule, no naming or unit conversion happens here; that's Phase 2 in `src/domain/`.
### bigint and Buffer attributes — JSONB encoding
The codec (task 1.2) decodes `__bigint` to `bigint` and `__buffer_b64` to `Buffer`. Postgres `jsonb` is JSON, so we re-encode for storage:
- `bigint` → could be a JSON number when it fits in `Number.MAX_SAFE_INTEGER` and a JSON string otherwise, but always storing it as a string is simpler and unambiguous; **decision: always string for bigint**.
- `Buffer` → base64 string.
**Re-encoding loses the type tag.** Phase 2 IO interpretation (per-model mapping table) is responsible for knowing that `attributes.io_240` is a u64 stored as a string. Phase 1 doesn't need to query individual attributes — it's pass-through storage.
If this becomes painful later, options to revisit: a separate `attributes_typed` column with structured shape; or store bigints as `numeric` and Buffers as `bytea` in dedicated columns. **Defer** — 80% of attributes are small ints, and the simple string approach unblocks Phase 1.
### Migration runner
Follow the simplest possible pattern. The runner:
1. `CREATE TABLE IF NOT EXISTS schema_migrations (version text PRIMARY KEY, applied_at timestamptz NOT NULL DEFAULT now())`.
2. Lists `*.sql` files in `src/db/migrations/` sorted by filename.
3. For each, `SELECT 1 FROM schema_migrations WHERE version = $1`. If absent, run the SQL inside a transaction and insert the row.
4. Logs each applied or skipped migration at `info`.
Do **not** introduce a heavy framework (Knex, node-pg-migrate). The Processor has one migration file in Phase 1 — a 30-line runner is the right answer.
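A sketch of that runner (directory path per the deliverables; `dist/` path resolution omitted):
```ts
import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';
import type { Pool } from 'pg';
import type { Logger } from 'pino';

export async function migrate(pool: Pool, logger: Logger, dir = 'src/db/migrations'): Promise<void> {
  await pool.query(
    `CREATE TABLE IF NOT EXISTS schema_migrations (
       version text PRIMARY KEY,
       applied_at timestamptz NOT NULL DEFAULT now()
     )`,
  );
  const files = (await readdir(dir)).filter((f) => f.endsWith('.sql')).sort();
  for (const file of files) {
    const { rowCount } = await pool.query('SELECT 1 FROM schema_migrations WHERE version = $1', [file]);
    if ((rowCount ?? 0) > 0) {
      logger.info({ file }, 'migration already applied; skipping');
      continue;
    }
    const sql = await readFile(path.join(dir, file), 'utf8');
    const client = await pool.connect();
    try {
      // Run the migration and record it atomically.
      await client.query('BEGIN');
      await client.query(sql);
      await client.query('INSERT INTO schema_migrations (version) VALUES ($1)', [file]);
      await client.query('COMMIT');
      logger.info({ file }, 'migration applied');
    } catch (err) {
      await client.query('ROLLBACK');
      throw err; // bad SQL fails loudly, per the test plan
    } finally {
      client.release();
    }
  }
}
```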
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] Integration test (testcontainers TimescaleDB): apply migration; insert a row with a bigint-as-string attribute; query it back; verify shape.
- [ ] Re-running the migration on an already-migrated database is a no-op.
- [ ] `connectWithRetry` retries 3 times with exponential backoff, then calls `process.exit(1)`. Verify with a unit test using a fake Pool.
## Risks / open questions
- **TimescaleDB extension availability.** The `deploy/` repo's Postgres container must be the `timescale/timescaledb` image, not stock `postgres`. Document this explicitly in the deploy README when Phase 1 ships. Fall back to a regular table (no hypertable) if the extension is unavailable: `create_hypertable` will error, but the `IF NOT EXISTS` table creation succeeds. Performance falls off a cliff at scale, but it stays functional.
- **Schema authority overlap with Directus.** Directus also speaks Postgres. When Directus connects and introspects the schema, it will see the `positions` table created by Processor. That's fine — Directus can reflect tables it didn't create. But if an operator later modifies `positions` from the Directus admin UI, the migration may break. Document: `positions` is a Processor-owned table; do not edit from Directus.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,93 @@
# Task 1.5 — Redis Stream consumer (XREADGROUP)
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.2, 1.3
**Wiki refs:** `docs/wiki/entities/redis-streams.md`, `docs/wiki/entities/processor.md`
## Goal
Build the Redis Stream consumer: join the consumer group, fetch batches via `XREADGROUP`, decode each entry to a `Position`, hand off to a sink callback, and return successfully-handled IDs to the caller for `XACK`.
This task does **not** wire in the Postgres writer or the in-memory state — those are tasks 1.7 and 1.6, joined to the consumer in 1.8. The consumer accepts a `sink: (records: ConsumedRecord[]) => Promise<string[]>` callback that returns the IDs it wants ACKed. Only those IDs are ACKed; failures stay pending and get claimed on the next loop.
## Deliverables
- `src/core/consumer.ts` exporting:
- `createConsumer(redis, config, logger, metrics, sink): Consumer` — factory.
- `Consumer` interface: `start(): Promise<void>` (returns when the consumer loop starts), `stop(): Promise<void>` (signals the loop to exit, waits for the in-flight batch).
- `ensureConsumerGroup(redis, stream, group)` — issues `XGROUP CREATE ... MKSTREAM`, ignoring `BUSYGROUP` errors. Called once at start.
- `type ConsumedRecord = { id: string; position: Position; codec: string; ts: string }` — what's passed to the sink.
- `test/consumer.test.ts` (mocked `ioredis`):
- Decodes a synthetic stream entry into a `ConsumedRecord` with the right shape.
- Calls `sink` with the decoded batch and ACKs only the IDs the sink returned.
- On `BUSYGROUP` error from `XGROUP CREATE`, swallows the error and continues.
- On a malformed payload, increments `consumer_decode_errors_total`, logs at `error`, and **does not** ACK the bad entry — leaves it pending for inspection.
- On `stop()`, the loop exits cleanly without losing in-flight work.
## Specification
### Consumer loop shape
```ts
async function runLoop() {
while (!stopping) {
let entries: StreamEntry[];
try {
entries = await redis.xreadgroup(
'GROUP', group, consumerName,
'COUNT', batchSize,
'BLOCK', batchBlockMs,
'STREAMS', stream, '>',
);
} catch (err) {
logger.error({ err }, 'XREADGROUP failed; backing off');
await sleep(1000);
continue;
}
if (!entries) continue; // BLOCK timeout
const records = decodeBatch(entries); // <— may emit decode errors
const ackIds = await sink(records); // <— writer + state
if (ackIds.length > 0) {
await redis.xack(stream, group, ...ackIds);
}
}
}
```
### Decode error handling
`decodeBatch` calls `decodePosition` (from task 1.2) on each entry's `payload` field. If a single entry fails to decode:
- Increment `processor_decode_errors_total{stream=...}`.
- Log at `error` with the entry ID and a truncated raw payload (first 256 chars).
- **Skip** the entry — do not pass to sink, do not ACK. It stays in the consumer's PEL (Pending Entries List) and will be re-attempted on next claim. Phase 3 will route truly-poison entries to a dead-letter stream; for Phase 1, leaving them pending and visible in `XPENDING` is enough.
### `XACK` semantics
ACK only what the sink returned. If the sink returns `['id1', 'id3']` from a batch of `[id1, id2, id3]`, then `id2` stays pending. Why a sink might return a partial list: it failed to write some records. The consumer must trust the sink's signal — never ACK speculatively.
### Consumer group setup
On `start()`:
1. `XGROUP CREATE <stream> <group> $ MKSTREAM` — creates the stream if missing, group at "now" so we don't replay history. If the group already exists, the call returns `BUSYGROUP Consumer Group name already exists` — catch and ignore.
2. Log at `info` whether the group was created or already existed.
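A sketch of that setup (ioredis; matching on the error message is the usual way to detect `BUSYGROUP`):
```ts
import type Redis from 'ioredis';
import type { Logger } from 'pino';

export async function ensureConsumerGroup(
  redis: Redis,
  stream: string,
  group: string,
  logger: Logger,
): Promise<void> {
  try {
    // '$' = start at "now"; MKSTREAM creates the stream if it doesn't exist yet.
    await redis.xgroup('CREATE', stream, group, '$', 'MKSTREAM');
    logger.info({ stream, group }, 'consumer group created');
  } catch (err) {
    if (err instanceof Error && err.message.includes('BUSYGROUP')) {
      logger.info({ stream, group }, 'consumer group already exists');
      return;
    }
    throw err;
  }
}
```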
### Why `>` not `0` for the read ID
`>` means "deliver only new entries, not pending ones for this consumer." That's what we want for the steady-state loop. Phase 3 will add an explicit `XAUTOCLAIM` step at startup (and periodically) to pull stuck pending entries from dead consumers; Phase 1 relies on the natural redelivery via consumer-group resumption (when a dead instance restarts with the same name, it sees its old PEL).
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] Unit tests cover: happy path, `BUSYGROUP` swallow, decode error skip, partial-ACK, clean stop.
- [ ] Stop signal causes the loop to exit within one `BATCH_BLOCK_MS` tick.
## Risks / open questions
- **Consumer name uniqueness.** Two instances with the same `REDIS_CONSUMER_NAME` will both read from the same PEL, which is undefined behaviour. Task 1.3 already documents that `INSTANCE_ID` (which `REDIS_CONSUMER_NAME` defaults to) must be unique per instance — surface this again in the operator-facing README later.
- **Long sink calls block the loop.** If the Postgres writer takes 30s, no new records are read. That's fine for Phase 1 (Postgres should be fast); Phase 3 may add a configurable max-in-flight if writer pressure becomes an issue.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,81 @@
# Task 1.6 — Per-device in-memory state
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.2
**Wiki refs:** `docs/wiki/entities/processor.md` (§ State management)
## Goal
Maintain a bounded `Map<device_id, DeviceState>` updated on every accepted Position. Phase 1 only stores trivial state — `last_position`, `last_seen`, `position_count_session` — but the structure is built so Phase 2 (geofence accumulators, time-since-last-checkpoint, etc.) can extend it cleanly.
## Deliverables
- `src/core/state.ts` exporting:
- `createDeviceStateStore(config, logger): DeviceStateStore` — factory.
- `DeviceStateStore` interface:
- `update(position: Position): DeviceState` — applies the position, returns the new state. Touches LRU order.
- `get(device_id: string): DeviceState | undefined` — read without touching LRU order. (Used for diagnostics; the hot path uses `update`.)
- `size(): number` — for metrics.
- `evictedTotal(): number` — for metrics.
- `test/state.test.ts` covering:
- First update for a new device creates the entry; subsequent updates increment `position_count_session`.
- LRU eviction: with cap=3, after 4 distinct devices, the least-recently-updated is evicted.
- Eviction increments `evictedTotal()`.
- `last_seen` reflects the position's `timestamp` (the device-reported time), not the wall clock at update time.
- Out-of-order positions (a position with `timestamp` older than `last_seen`) are still applied (we don't drop them) but `last_seen` only advances forward — i.e. `last_seen = max(prev_last_seen, position.timestamp)`. Document the rationale.
## Specification
### LRU implementation
Use a plain `Map<string, DeviceState>`. JavaScript `Map` preserves insertion order, and we exploit it: on every `update`, `delete` then `set` the entry — that bumps it to the most recent position in iteration order. When `size() > cap`, take `keys().next().value` (the oldest) and `delete` it.
This is O(1) per update and avoids a third-party LRU dependency. **Do not** introduce `lru-cache` — the standard `Map` trick is sufficient for Phase 1's needs.
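A sketch of the store using that trick (types per task 1.2):
```ts
import type { DeviceState, Position } from './types.js';

export function createDeviceStateStore(cap: number) {
  const states = new Map<string, DeviceState>();
  let evicted = 0;
  return {
    update(position: Position): DeviceState {
      const id = position.device_id;
      const prev = states.get(id);
      const next: DeviceState = {
        device_id: id,
        last_position: position,
        // last_seen only advances forward — see the rationale below.
        last_seen: prev && prev.last_seen > position.timestamp ? prev.last_seen : position.timestamp,
        position_count_session: (prev?.position_count_session ?? 0) + 1,
      };
      states.delete(id); // delete-then-set bumps the entry to most-recent in iteration order
      states.set(id, next);
      if (states.size > cap) {
        const oldest = states.keys().next().value as string; // first key = least recently updated
        states.delete(oldest);
        evicted += 1;
      }
      return next;
    },
    get: (id: string) => states.get(id), // read without touching LRU order
    size: () => states.size,
    evictedTotal: () => evicted,
  };
}
```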
### Why `last_seen = max(...)`, not `last_seen = position.timestamp`
Devices buffer records when offline and replay them in bursts (we observed a 55-record buffer flush on stage). Within a single batch, timestamps may *decrease* between consecutive records if the device sorted them oddly. We want `last_seen` to mean "highest device timestamp seen so far for this device" — that's what downstream consumers want.
### What about restart?
On Processor restart, the in-memory state is empty. The first record from any device creates a fresh `DeviceState`. **Phase 1 accepts this** — it's a recovery path, not a hot path, and Phase 1 has no domain logic that would be wrong without rehydrated state.
Phase 3 (production hardening) adds rehydration: on first packet for an unknown device, query `positions WHERE device_id = $1 ORDER BY ts DESC LIMIT 1` to seed `last_position`. That's a Phase 3 task, not Phase 1.
### What state lives here, what doesn't
In Phase 1 the state is intentionally minimal:
```ts
type DeviceState = {
device_id: string;
last_position: Position;
last_seen: Date; // = max(prev, position.timestamp)
position_count_session: number; // resets on restart
};
```
**Not in Phase 1:**
- Geofence membership (Phase 2)
- Distance accumulators (Phase 2)
- Time-in-stage (Phase 2)
- Anything that would be wrong if dropped on restart (Phase 3 + rehydration)
The interface is built to extend: Phase 2 may add fields, but the existing fields and method signatures should not change.
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] LRU cap from `DEVICE_STATE_LRU_CAP` config is respected.
- [ ] `evictedTotal()` increments correctly under eviction.
- [ ] `last_seen` does not regress on out-of-order timestamps.
## Risks / open questions
- **Cap sizing.** Default `DEVICE_STATE_LRU_CAP=10000`. At 1KB per state entry, that's 10MB of resident memory — fine. Operators with unusually large fleets can raise it; the bound exists to prevent runaway growth from misbehaving devices flooding novel `device_id` values.
- **No mutex.** State is updated only from the consumer loop, which is single-threaded. If Phase 2 introduces parallel sinks, revisit with proper synchronization.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,94 @@
# Task 1.7 — Position writer (batched upsert)
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.2, 1.4
**Wiki refs:** `docs/wiki/entities/postgres-timescaledb.md`
## Goal
Write batches of `Position` records into the `positions` hypertable using `INSERT ... ON CONFLICT (device_id, ts) DO NOTHING` for idempotency. Return per-record success/failure so the consumer (task 1.8) can decide what to ACK.
## Deliverables
- `src/core/writer.ts` exporting:
- `createWriter(pool, config, logger, metrics): Writer` — factory.
- `Writer` interface:
- `write(records: ConsumedRecord[]): Promise<WriteResult[]>` — inserts the batch, returns per-record results: `{ id: string; status: 'inserted' | 'duplicate' | 'failed'; error?: Error }`.
- `test/writer.test.ts` (mocked `pg.Pool`):
- Happy path: all records insert.
- Duplicate-key: `ON CONFLICT DO NOTHING` returns `'duplicate'` for those records.
- Mixed: half new, half duplicate.
- Pool error: all records in the batch return `'failed'`.
- Bigint attribute is stringified before serialization.
- Buffer attribute is base64-encoded before serialization.
## Specification
### SQL pattern
Use a single multi-row `INSERT` per batch with `RETURNING (xmax = 0) AS inserted`:
```sql
INSERT INTO positions (device_id, ts, latitude, longitude, altitude, angle, speed, satellites, priority, codec, attributes)
VALUES
($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11),
($12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22),
...
ON CONFLICT (device_id, ts) DO NOTHING
RETURNING device_id, ts, (xmax = 0) AS inserted;
```
`xmax = 0` distinguishes fresh inserts from updated rows under `ON CONFLICT ... DO UPDATE`; with `DO NOTHING` it is effectively always true, because conflicting rows are not returned at all — keep it as a cheap sanity check. The `RETURNING` rows give us a lookup of which `(device_id, ts)` pairs were actually inserted.
**Note:** rows that hit the conflict are NOT returned (Postgres doesn't return them with `ON CONFLICT DO NOTHING`). To classify duplicates, compare the returned rows against the input by `(device_id, ts)`: anything in the input but missing from `RETURNING` is a `'duplicate'`.
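A sketch of that comparison (types per this task's deliverables; `pg` returns `ts` as a `Date`; the import path is hypothetical):
```ts
import type { ConsumedRecord, WriteResult } from './types.js'; // hypothetical location

export function classify(
  batch: ConsumedRecord[],
  returned: Array<{ device_id: string; ts: Date }>,
): WriteResult[] {
  // Key inserted rows by (device_id, ts); anything missing from RETURNING was a conflict.
  const inserted = new Set(returned.map((r) => `${r.device_id}|${r.ts.toISOString()}`));
  return batch.map((r) => ({
    id: r.id,
    status: inserted.has(`${r.position.device_id}|${r.position.timestamp.toISOString()}`)
      ? ('inserted' as const)
      : ('duplicate' as const),
  }));
}
```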
### bigint and Buffer attribute encoding
Per task 1.4, `jsonb` storage:
- `bigint` → JSON string.
- `Buffer` → base64 string.
Use a custom replacer in `JSON.stringify`. Beware: `Buffer#toJSON` runs *before* the replacer, so the replacer never sees a live `Buffer` — it sees `{ type: 'Buffer', data: number[] }` and must match that shape, not `Buffer.isBuffer`:
```ts
JSON.stringify(attributes, (_k, v) =>
  typeof v === 'bigint'
    ? v.toString()
    : v && typeof v === 'object' && v.type === 'Buffer' && Array.isArray(v.data)
      ? Buffer.from(v.data as number[]).toString('base64')
      : v,
);
```
Document this in `wiki/concepts/position-record.md` as a follow-up — the on-disk shape differs slightly from the in-flight shape because JSON can't hold bigints or bytes natively.
### Batching strategy
The consumer (task 1.8) calls `write(batch)` with whatever batch the consumer received from `XREADGROUP`. Phase 1 doesn't internally batch further — the consumer's batch size (`BATCH_SIZE`, default 100) is the writer's batch size.
If `BATCH_SIZE > WRITE_BATCH_SIZE` (default 50), the writer chunks internally: split the input into chunks of `WRITE_BATCH_SIZE`, run them sequentially. Don't parallelize chunks against the same Pool — `pg.Pool` has bounded connections and we don't want to starve other queries (the migration runner, `/readyz` health checks, etc.).
### Per-record status
The consumer (task 1.8) takes the `WriteResult[]` and decides ACK:
- `'inserted'` and `'duplicate'` → ACK (we got the data into Postgres or already had it).
- `'failed'` → do not ACK (let it stay pending for retry).
If a transaction-wide failure occurs (Pool dead, transient network), all records in the chunk get `'failed'`. The consumer treats them all the same.
### Metrics emitted by this module
- `processor_position_writes_total{status="inserted"|"duplicate"|"failed"}` — counter
- `processor_position_write_duration_seconds` — histogram (per-batch latency)
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] Mocked-Pool test verifies SQL parameter ordering and types are correct.
- [ ] Bigint and Buffer attributes serialize as expected via the `JSON.stringify` replacer (including the `Buffer#toJSON` shape).
- [ ] Mixed insert/conflict batch produces correct per-record `WriteResult[]`.
- [ ] Pool error → all records get `'failed'`; metrics reflect this.
## Risks / open questions
- **Parameter limit.** Postgres protocol allows max 65535 parameters per statement. With 11 columns per row, that caps us at ~5957 rows per statement. `WRITE_BATCH_SIZE=50` is well under. If `WRITE_BATCH_SIZE` is ever raised, document the formula.
- **`RETURNING` cost.** On a hypertable with many chunks, `RETURNING` has near-zero overhead. Verify with a benchmark in task 1.10 (integration test).
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,100 @@
# Task 1.8 — Main wiring & ACK semantics
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.5, 1.6, 1.7
**Wiki refs:** `docs/wiki/entities/processor.md`
## Goal
Assemble the throughput pipeline in `src/main.ts`: connect Redis + Postgres → run migrations → build the device-state store → build the writer → build the consumer with a sink that calls `state.update()` then `writer.write()` → start. Establish the rule for what to ACK and when.
## Deliverables
- `src/main.ts` updated to:
1. `loadConfig()` (from task 1.3).
2. `createLogger()` (from task 1.3).
3. `createPool(config.POSTGRES_URL)` and `connectWithRetry()` (from task 1.4).
4. Run migrations via `migrate()` (from task 1.4) before any consumer activity.
5. Connect Redis with `connectRedis(...)` (re-implement the `tcp-ingestion` retry pattern; small enough to copy).
6. Build `state = createDeviceStateStore(config, logger)`.
7. Build `writer = createWriter(pool, config, logger, metrics)`.
8. Build `consumer = createConsumer(redis, config, logger, metrics, sink)` where `sink` is the function defined below.
9. `await consumer.start()`.
10. Install graceful shutdown stub (full Phase 3 hardening later): on SIGTERM/SIGINT, call `consumer.stop()`, await pending writes, close Redis + Pool, exit.
- `src/main.ts` defines the **sink function** (the central decision point):
```ts
async function sink(records: ConsumedRecord[]): Promise<string[]> {
// 1. Update in-memory state for every record (cheap, synchronous, can't fail meaningfully)
for (const r of records) state.update(r.position);
// 2. Write to Postgres
const results = await writer.write(records);
// 3. ACK only the IDs that succeeded or were duplicates
return results
.filter(r => r.status === 'inserted' || r.status === 'duplicate')
.map(r => r.id);
}
```
- A placeholder `metrics` shim — the same trace-logging stub as `tcp-ingestion` originally had (task 1.9 replaces it with prom-client). Use `Metrics` from `src/core/types.ts`.
## Specification
### State update happens before write — by design
The sink updates `state` first, *then* writes. If the write fails:
- The state update has already happened.
- The record is not ACKed, so it stays pending.
- On re-delivery (same instance retries, or another instance claims), the record will be processed again.
- `state.update` is idempotent for a given position (same record applied twice produces the same `last_position`, only `position_count_session` is double-counted — and that's a session counter that resets on restart anyway, so it's a non-issue).
If we wrote *first* and updated state second, a successful write followed by a state-update crash would leave Postgres ahead of state — but state is hot-path, so that's worse. The chosen order keeps state consistent with what's been seen, even if not yet persisted.
### What the sink does NOT do
- **No business logic.** No "is this a finish-line crossing" detection. That's Phase 2's domain.
- **No multi-stream fanout.** No publishing to derived streams (e.g. for the SPA). The Phase 1 model is: positions go into Postgres, Directus reads them and pushes via WebSocket. If that fanout proves insufficient at the SPA layer, Phase 4 considers a dedicated WebSocket gateway reading from Redis directly.
### Graceful shutdown — Phase 1 stub vs. Phase 3 final
Phase 1 stub is enough to not lose data in the common case:
1. Catch SIGTERM/SIGINT.
2. `consumer.stop()` — exits the read loop after the current batch.
3. Await any in-flight `writer.write()`.
4. `redis.quit()` and `pool.end()`.
5. `process.exit(0)`.
6. Force-exit timer at 15s as a backstop.
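A sketch of that stub (component shapes per the wiring steps above):
```ts
import type Redis from 'ioredis';
import type { Pool } from 'pg';
import type { Logger } from 'pino';

interface Deps { consumer: { stop(): Promise<void> }; redis: Redis; pool: Pool; logger: Logger }

export function installShutdown({ consumer, redis, pool, logger }: Deps): void {
  let shuttingDown = false;
  const shutdown = async (signal: string): Promise<void> => {
    if (shuttingDown) return;
    shuttingDown = true;
    logger.info({ signal }, 'shutting down');
    const backstop = setTimeout(() => process.exit(1), 15_000); // force-exit backstop
    backstop.unref();
    await consumer.stop(); // exits the read loop after the current batch; in-flight write completes
    await redis.quit();
    await pool.end();
    process.exit(0);
  };
  process.on('SIGTERM', () => void shutdown('SIGTERM'));
  process.on('SIGINT', () => void shutdown('SIGINT'));
}
```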
What Phase 1 does NOT do (deferred to Phase 3):
- Explicit consumer-group offset commit on SIGTERM (the current model relies on `XACK` after each successful write, which is already the right thing — but Phase 3 documents and tests this rigorously).
- Uncaught exception / unhandled rejection handlers that flush state to logs before crashing.
- Multi-instance coordination on shutdown (drain mode).
### Logger shape
Match `tcp-ingestion`'s convention:
- `info` for lifecycle: `processor starting`, `Postgres connected`, `Redis connected`, `migrations applied`, `consumer started on stream X group Y consumer Z`, `processor ready`.
- `debug` for per-batch: `batch consumed n=42`, `batch written inserted=40 duplicates=2 failed=0`.
- `warn` / `error` for the obvious.
After this task lands you should be able to run `pnpm dev` against a local Redis + Postgres, publish a synthetic `Position` to `telemetry:t`, and watch a row appear in `positions` while seeing the lifecycle logs above.
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] `pnpm dev` (with local Redis + Postgres reachable) shows the lifecycle log sequence and `processor ready`.
- [ ] Manually publishing a `Position` to `telemetry:t` results in a row in `positions` within seconds.
- [ ] SIGTERM during idle exits cleanly (no error, no force-exit warning).
- [ ] SIGTERM with in-flight writes waits for them to complete before exiting.
## Risks / open questions
- **`metrics` placeholder is intentional.** Don't try to wire prom-client here; that's task 1.9. Use the trace-logging shim from `tcp-ingestion`'s pre-1.10 `main.ts` as the model.
- **Migration during deploy.** Phase 1 runs migrations on every startup. With multiple instances, two starting at once both try to migrate — Postgres advisory locks would solve this. **Defer to Phase 3** (it's a Production hardening concern); for the pilot with one instance, this is fine. Document the limitation.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,82 @@
# Task 1.9 — Observability (Prometheus metrics + /healthz + /readyz)
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.5, 1.6, 1.7, 1.8
**Wiki refs:** `docs/wiki/entities/processor.md`, `docs/wiki/sources/gps-tracking-architecture.md` § 7.4
## Goal
Replace the placeholder `Metrics` shim with a real `prom-client` implementation. Expose `/metrics` (Prometheus exposition format), `/healthz` (liveness), and `/readyz` (readiness — Redis ready AND Postgres ready) on `METRICS_PORT`.
This is **not** a deferral candidate (unlike `tcp-ingestion` task 1.10). The Processor has no other surface for measuring consumer lag, write throughput, or failure rates — without it, "is the pilot keeping up?" cannot be answered.
## Deliverables
- `src/observability/metrics.ts` — same shape as `tcp-ingestion/src/observability/metrics.ts`:
- `createMetrics(): Metrics & { serializeMetrics(): Promise<string> }` — wraps `prom-client` registries; calls `collectDefaultMetrics()` once for `nodejs_*` process metrics.
- `startMetricsServer(port, metrics, deps)` — `node:http` server with three endpoints. `deps` carries readyz health checks: `{ isRedisReady(): boolean; isPostgresReady(): boolean }`.
- Update `src/main.ts` to use the real `createMetrics()` and start the metrics server after Redis + Postgres are connected and the consumer is started. Wire it into graceful shutdown (close it before `redis.quit()`).
- Tests: `test/metrics.test.ts` mirroring the `tcp-ingestion` test pattern — exposition format, counter/gauge/histogram behaviour, all four HTTP endpoint paths including `/readyz` 503 cases.
## Specification
### Metric inventory
| Metric | Type | Labels | Description |
|---|---|---|---|
| `processor_consumer_reads_total` | counter | `result=ok\|empty\|error` | `XREADGROUP` calls; `empty` = BLOCK timeout, `error` = client error |
| `processor_consumer_records_total` | counter | — | Total records pulled off the stream |
| `processor_consumer_lag` | gauge | — | `XLEN` minus the consumer group's last-delivered ID position. Sampled every N seconds. |
| `processor_decode_errors_total` | counter | — | Records that failed to decode (malformed payload, sentinel error) |
| `processor_position_writes_total` | counter | `status=inserted\|duplicate\|failed` | Per-record write outcomes |
| `processor_position_write_duration_seconds` | histogram | — | Per-batch write latency. Buckets `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]` |
| `processor_acks_total` | counter | — | Total IDs ACKed |
| `processor_device_state_size` | gauge | — | Current count of devices in the LRU map |
| `processor_device_state_evictions_total` | counter | — | Total LRU evictions since start |
| `nodejs_*` | various | — | Default Node process metrics |
### Naming convention
- `processor_*` for service-specific metrics. `tcp-ingestion` uses `teltonika_*` because that's its adapter; the Processor isn't bound to a vendor, so the service-name prefix fits.
- No external `service` label — Prometheus scrape config adds it.
### Health and readiness
- `GET /healthz` → 200 if the process is alive. Always returns `{ "status": "ok" }`.
- `GET /readyz` → 200 if both Redis is ready (`redis.status === 'ready'`) AND Postgres is ready (last successful query within 30s, or a fresh `SELECT 1` succeeds quickly). 503 otherwise.
- Both endpoints return tiny JSON bodies for diagnostic value.
### `processor_consumer_lag` measurement
Sample every 10s in a separate setInterval (don't compute it on every read — too noisy). Compute as:
```
lag = XLEN(stream) - position_of(group_last_delivered_id_in_stream)
```
Use `XINFO GROUPS <stream>` → the `lag` field (Redis 7.2+). If the field is absent, fall back to plain `XLEN` (a good-enough proxy when the group is nearly caught up; flag it as "approximate" in the metric description).
If sampling fails (Redis blip), log at `debug` and continue. Don't let metrics gathering break the consumer.
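A sketch of the sampler, shown with a raw prom-client `Gauge` for brevity (the `XINFO GROUPS` reply arrives as a flat key/value list per group; the `lag` field is the Redis 7.2+ assumption named above):
```ts
import { Gauge } from 'prom-client';
import type Redis from 'ioredis';
import type { Logger } from 'pino';

const consumerLag = new Gauge({ name: 'processor_consumer_lag', help: 'Entries not yet delivered to the group' });

export function startLagSampler(redis: Redis, stream: string, group: string, logger: Logger): NodeJS.Timeout {
  const timer = setInterval(async () => {
    try {
      const groups = (await redis.xinfo('GROUPS', stream)) as Array<Array<string | number>>;
      for (const entry of groups) {
        // Flatten the [key, value, key, value, ...] reply into a Map.
        const fields = new Map<string, string | number>();
        for (let i = 0; i < entry.length; i += 2) fields.set(String(entry[i]), entry[i + 1] as string | number);
        if (fields.get('name') === group && typeof fields.get('lag') === 'number') {
          consumerLag.set(fields.get('lag') as number);
        }
      }
    } catch (err) {
      logger.debug({ err }, 'lag sampling failed; skipping tick');
    }
  }, 10_000);
  timer.unref(); // never keep the process alive just for metrics
  return timer;
}
```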
### HTTP server — same minimal node:http
No Express. Roughly 30 lines. Match `tcp-ingestion`'s style.
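A sketch of those ~30 lines (handler paths and `deps` shape per the deliverables):
```ts
import http from 'node:http';

export function startMetricsServer(
  port: number,
  metrics: { serializeMetrics(): Promise<string> },
  deps: { isRedisReady(): boolean; isPostgresReady(): boolean },
): http.Server {
  const server = http.createServer((req, res) => {
    void (async () => {
      if (req.url === '/metrics') {
        res.writeHead(200, { 'content-type': 'text/plain; version=0.0.4; charset=utf-8' });
        res.end(await metrics.serializeMetrics());
      } else if (req.url === '/healthz') {
        res.writeHead(200, { 'content-type': 'application/json' });
        res.end('{"status":"ok"}'); // liveness: the process is up
      } else if (req.url === '/readyz') {
        const ready = deps.isRedisReady() && deps.isPostgresReady();
        res.writeHead(ready ? 200 : 503, { 'content-type': 'application/json' });
        res.end(JSON.stringify({ status: ready ? 'ok' : 'unavailable' }));
      } else {
        res.writeHead(404);
        res.end();
      }
    })();
  });
  server.listen(port);
  return server;
}
```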
## Acceptance criteria
- [ ] `pnpm typecheck`, `pnpm lint`, `pnpm test` clean.
- [ ] `curl http://localhost:9090/metrics` returns valid exposition format with every metric in the inventory present (some at zero).
- [ ] After processing one record end-to-end, `processor_consumer_records_total` increments by 1, `processor_position_writes_total{status="inserted"}` increments by 1, `processor_acks_total` increments by 1.
- [ ] `/readyz` returns 503 while Redis is disconnected (simulate by `redis.disconnect()`), 200 once it reconnects.
- [ ] `/readyz` returns 503 while the Pool fails its health probe, 200 when it recovers.
- [ ] `nodejs_*` default metrics are exposed.
## Risks / open questions
- **Cardinality of label values.** None of the Phase 1 metrics use unbounded labels. Phase 2 may want per-stage metrics — be careful: hundreds of stages is fine, hundreds of devices as labels is not. Keep the same rule as `tcp-ingestion`: per-device labels never go on Prometheus metrics; logs/traces are the right place.
- **`processor_consumer_lag` sampling cadence.** 10s is a guess. If alerts get jittery, lower to 5s or raise to 30s. Tunable later.
## Done
(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,58 @@
# Task 1.10 — Integration test (testcontainers Redis + Postgres)
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.5, 1.7, 1.8, 1.9
**Wiki refs:**
## Goal
End-to-end pipeline test: spin up Redis 7 and TimescaleDB via testcontainers, boot the Processor against them, publish a synthetic `Position` to `telemetry:t`, verify the row appears in `positions` with byte-equivalent attribute decoding (bigint, Buffer included).
This is the integration test that proves the upstream contract from `tcp-ingestion` flows through end-to-end. Mirror `tcp-ingestion/test/publish.integration.test.ts`'s structure and skip-on-no-Docker pattern.
## Deliverables
- `test/pipeline.integration.test.ts`:
- `beforeAll`: start Redis container, start TimescaleDB container, run migrations, build a Processor instance pointed at both. If Docker is unavailable, log a clear skip message and set a flag so all `it` blocks early-return without failing.
- `afterAll`: stop the Processor, stop containers.
- Test 1: publish a Position with `bigint` and `Buffer` attributes via `XADD`; wait for the row in `positions` (poll, timeout 10s); assert `device_id`, `ts`, GPS fields, and a JSON round-trip of `attributes` matches the original (bigint as string, Buffer as base64).
- Test 2: publish two records with the same `(device_id, ts)`; verify only one row in `positions` (idempotency check).
- Test 3: publish a malformed payload (broken JSON) on the stream; verify `processor_decode_errors_total` increments and the bad entry stays in PEL (not ACKed).
- Test 4: simulate the writer failing once (e.g. by temporarily shutting Postgres mid-test, then bringing it back); verify the record gets retried and eventually lands.
- Use the **TimescaleDB image**, not stock `postgres`. Suggested: `timescale/timescaledb:latest-pg16`. Confirm the migration's `CREATE EXTENSION IF NOT EXISTS timescaledb` no-ops (extension already loaded).
- Use the same Vitest config split as `tcp-ingestion`: `vitest.integration.config.ts` with `hookTimeout: 120_000`, `testTimeout: 60_000`. Default `pnpm test` excludes `*.integration.test.ts`; opt-in via `pnpm test:integration`.
## Specification
### Skip-on-no-Docker pattern
Copy `tcp-ingestion/test/publish.integration.test.ts`'s pattern verbatim:
- Try to start the first container in `beforeAll`. On error, set `dockerAvailable = false`, log a warning, and return.
- Each `it` block early-returns with a `console.warn` if `!dockerAvailable`.
- This pattern was the fix for the CI test failure on the runner without Docker — keep it.
### Synthetic Position publishing
Reuse `serializePosition` from `tcp-ingestion`'s `publish.ts` if it can be imported (likely not — separate repos). Otherwise inline the encoding: a Position object → JSON.stringify with the bigint/Buffer replacer → `XADD telemetry:t * ts <iso> device_id <imei> codec 8E payload <json>`.
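An inline-encoding sketch (stream and field names per task 1.2's `StreamRecord`; attributes are pre-encoded before `JSON.stringify` so `Buffer#toJSON` can't interfere — see task 1.7):
```ts
import { Buffer } from 'node:buffer';
import type Redis from 'ioredis';
import type { Position } from '../src/core/types.js'; // hypothetical path

function sentinelEncode(position: Position): string {
  const attributes = Object.fromEntries(
    Object.entries(position.attributes).map(([k, v]) => [
      k,
      typeof v === 'bigint'
        ? { __bigint: v.toString() }
        : Buffer.isBuffer(v)
          ? { __buffer_b64: v.toString('base64') }
          : v,
    ]),
  );
  return JSON.stringify({ ...position, timestamp: position.timestamp.toISOString(), attributes });
}

export async function publishSynthetic(redis: Redis, position: Position): Promise<string | null> {
  return redis.xadd(
    'telemetry:t', '*',
    'ts', position.timestamp.toISOString(),
    'device_id', position.device_id,
    'codec', '8E',
    'payload', sentinelEncode(position),
  );
}
```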
### Why test 4 (writer failure → retry)
This validates the core ACK semantics: if a write fails, the record stays pending, and re-delivery brings it back. Without this test, we have unit tests showing each piece behaves correctly, but no proof the pieces compose right. Fallback: if simulating a Postgres failure mid-test proves too flaky in testcontainers, weaken the test to: stop Postgres before publishing, publish, restart Postgres, verify the row appears.
## Acceptance criteria
- [ ] `pnpm test:integration` runs all four scenarios green when Docker is available.
- [ ] Without Docker, the suite logs skip messages and exits 0 (does not fail).
- [ ] CI (`pnpm test`, unit only) does not run these — they are opt-in.
- [ ] First-run container pull is reasonable; subsequent runs are fast (testcontainers caches the image).
## Risks / open questions
- **Image pull on first CI run.** The TimescaleDB image is large (~700MB). If we ever wire integration tests into CI (separate job with Docker), pre-pulling may be required. Document but defer.
- **Test flakiness from polling.** Polling for "row appears in `positions`" uses a 10s timeout. If CI is slow, raise it. Don't replace polling with `await sleep(2000)` — that's reliably wrong.
## Done
(Fill in once complete: commit SHA, brief notes.)
# Task 1.11 — Dockerfile & Gitea workflow
**Phase:** 1 — Throughput pipeline
**Status:** ⬜ Not started
**Depends on:** 1.10
**Wiki refs:**
## Goal
Containerize the service and add the Gitea Actions workflow that builds and publishes `git.dev.microservices.al/trm/processor:main` on every push to `main`. Mirror `tcp-ingestion`'s slim variant — same multi-stage Dockerfile, same single-job workflow with path filters.
## Deliverables
- `Dockerfile` — multi-stage: deps → build → runtime. Match `tcp-ingestion/Dockerfile` line for line, adjusting only:
- `EXPOSE 9090` only (metrics/health port; Processor has no TCP listener).
- `HEALTHCHECK` pointing at `/readyz` on `${METRICS_PORT}`.
- `CMD ["node", "dist/main.js"]`.
- `.gitea/workflows/build.yml` — single-job workflow matching `tcp-ingestion/.gitea/workflows/build.yml`:
- Trigger: `push` to `main` (path filters: `src/`, `test/`, `package.json`, `pnpm-lock.yaml`, `tsconfig.json`, `Dockerfile`, `.gitea/workflows/build.yml`) + `workflow_dispatch`.
- Steps: checkout, setup-node@v4 (Node 22, pnpm), install, typecheck, lint, test (unit only), docker buildx build-push to `git.dev.microservices.al/trm/processor:main`.
- Uses `secrets.REGISTRY_USERNAME` / `secrets.REGISTRY_PASSWORD`.
- Final step: trigger the Portainer webhook on success (uncommented; same as `tcp-ingestion`, whose `:main` -> webhook auto-deploy already works).
- `compose.dev.yaml` — local-build variant with `build: .`, named `processor-dev`, depends on a Redis service and a TimescaleDB service. Useful for verifying Dockerfile changes without the registry round-trip.
- `README.md` (the repo-level one, already a stub) — flesh out with:
- Quick-start (local: `pnpm install && cp .env.example .env && pnpm dev`).
- "Run the Docker build locally" section (`docker compose -f compose.dev.yaml up --build`).
- Production-deployment note: image is pulled by the `deploy/` repo's stack; do not run standalone.
- Pin to a specific commit via `PROCESSOR_TAG=<sha>` in the deploy stack.
- Tests section (unit vs. integration).
- CI behavior summary.
- "Pilot deployment notes" section if anything is paused (Phase 1 has nothing paused — note this and remove the section if so).
## Specification
### Dockerfile parity with `tcp-ingestion`
Open `tcp-ingestion/Dockerfile` and copy structure verbatim. The only diffs from a Phase 1 Processor are:
- No `EXPOSE 5027` — there's no TCP listener.
- `HEALTHCHECK` URL path is `/readyz` (already true for `tcp-ingestion`).
- Image label: `org.opencontainers.image.source` should point to the `processor` repo URL.
This parity matters: when a future engineer needs to debug a build, having two services build the same way reduces cognitive load.
### Workflow parity with `tcp-ingestion`
Same. Open `tcp-ingestion/.gitea/workflows/build.yml`, copy, change image name and (if needed) path filters. The webhook step at the end should be uncommented so `:main` builds auto-deploy through Portainer.
### Stage deploy
Phase 1 ships ready to land in the `deploy/compose.yaml` (`trm/deploy` repo) as a new service. **Do not edit `deploy/compose.yaml` from this task.** Surface it in the final report: "Add `processor` service to `deploy/compose.yaml` with image, env, depends_on Redis + Postgres." That is a deploy-side change, made by the user.
The `deploy/compose.yaml`'s service block will look roughly like:
```yaml
processor:
image: git.dev.microservices.al/trm/processor:${PROCESSOR_TAG:-main}
depends_on:
redis: { condition: service_healthy }
postgres: { condition: service_healthy }
environment:
NODE_ENV: production
INSTANCE_ID: ${PROCESSOR_INSTANCE_ID:-processor-1}
REDIS_URL: redis://redis:6379
POSTGRES_URL: postgres://...
LOG_LEVEL: ${LOG_LEVEL:-info}
restart: unless-stopped
```
Plus a Postgres service (TimescaleDB image) added to the stack — the stack currently only has Redis + tcp-ingestion. That's the user's deploy decision to make.
## Acceptance criteria
- [ ] `docker build .` succeeds locally; resulting image runs and exposes `/healthz` on 9090.
- [ ] `docker compose -f compose.dev.yaml up --build` boots Redis + TimescaleDB + Processor; `/readyz` reports 200 once everything is up.
- [ ] Pushing to `main` (or hitting `workflow_dispatch`) builds the image, runs typecheck/lint/test, and pushes `:main` to the registry.
- [ ] Portainer webhook fires on successful push and the stage stack picks up the new image (assuming the `deploy/` stack is set up).
- [ ] Image size is reasonable (target < 250 MB final stage; the `tcp-ingestion` slim variant lands around there).
## Risks / open questions
- **Re-pull on stack redeploy.** The same Portainer issue we hit with `tcp-ingestion` (stack redeploy doesn't pull new images by default) will apply here. Make sure the same fix is in place ("Re-pull image" toggle, or per-commit-SHA tags) before this lands. Cross-reference the `tcp-ingestion` deploy note in `deploy/README.md`.
- **HEALTHCHECK `wget` availability.** `node:22-alpine` includes `wget`. If we ever switch base image, revisit.
## Done
(Fill in once complete: commit SHA, brief notes.)
# Phase 1 — Throughput pipeline
Implement a Node.js worker that joins a Redis Streams consumer group, decodes `Position` records, upserts them into a TimescaleDB hypertable, maintains per-device in-memory state, and ships with the operational baseline (Prometheus metrics, health/readiness endpoints, integration tests, Dockerfile, Gitea CI/CD pipeline).
## Outcome statement
When Phase 1 is done:
- The Processor connects to Redis and joins consumer group `processor` on stream `telemetry:t` (configurable). On startup it creates the group with `MKSTREAM` if missing.
- Every `Position` record published by `tcp-ingestion` lands as exactly one row in the `positions` hypertable, with `device_id`, `ts`, GPS fields, and the IO `attributes` bag preserved as `JSONB` (sentinel-decoded — bigint values become `numeric`, Buffer values become `bytea` or `text` per the spec in task 1.2). A decode sketch follows this list.
- Per-device in-memory state (`last_position`, `last_seen`, `position_count_session`) is updated on every record and bounded by an LRU cap.
- `XACK` is sent only after the Postgres write succeeds. A crashed instance leaves work pending; on its next start it picks up via consumer-group resumption, and any other instance can claim its pending entries (full `XAUTOCLAIM` polish lives in Phase 3, but the basic resumption works in Phase 1).
- `GET /metrics` returns Prometheus exposition format with consumer lag, throughput, write-latency histogram, error counters. `GET /healthz` and `GET /readyz` cover liveness and readiness (Redis ready + Postgres ready).
- The service builds reproducibly via a Gitea Actions workflow, publishing a Docker image to the Gitea container registry tagged `:main` (and per-commit SHA tags later if needed).
- An integration test spins up Redis + Postgres via testcontainers, publishes a synthetic `Position` to the input stream, and verifies the resulting row in `positions`. End-to-end byte-level round-trip including bigint and Buffer sentinel reversal.
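The sentinel reversal mentioned above is small enough to sketch here, assuming the `__bigint` / `__buffer_b64` key names shown in the Phase 1 file layout below (task 1.2 owns the authoritative spec):

```ts
// src/core/codec.ts sketch: reverse the sentinel encoding applied upstream.
// JSON.parse revives children before parents, so each sentinel object is
// complete when it reaches the reviver. How bigint/Buffer map onto numeric /
// bytea-or-text in Postgres is the writer's concern, per task 1.2.
export function decodeSentinels(payload: string): unknown {
  return JSON.parse(payload, (_key, value) => {
    if (value !== null && typeof value === 'object') {
      if (typeof value.__bigint === 'string') return BigInt(value.__bigint);
      if (typeof value.__buffer_b64 === 'string') {
        return Buffer.from(value.__buffer_b64, 'base64');
      }
    }
    return value;
  });
}
```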
## Sequencing
```
1.1 Project scaffold
├─→ 1.2 Core types & contracts
│ ├─→ 1.3 Configuration & logging
│ ├─→ 1.4 Postgres connection & positions hypertable
│ │ └─→ 1.7 Position writer (batched upsert)
│ └─→ 1.5 Redis Stream consumer
│ ├─→ 1.6 Per-device in-memory state
│ └─→ 1.8 Main wiring & ACK semantics (depends on 1.5, 1.6, 1.7)
│ └─→ 1.9 Observability
│ └─→ 1.10 Integration test
│ └─→ 1.11 Dockerfile & CI
```
Tasks 1.5/1.6/1.7 can be developed in parallel after 1.4 lands. Task 1.10 (integration test) should land *before* 1.11 because the CI workflow needs to know what `pnpm test` and `pnpm test:integration` will do.
## Files modified
Phase 1 produces this layout in `processor/`:
```
processor/
├── .gitea/workflows/build.yml
├── src/
│ ├── core/
│ │ ├── types.ts # Position, DeviceState, Metrics
│ │ ├── consumer.ts # XREADGROUP loop + claim handler
│ │ ├── writer.ts # Postgres batched upsert
│ │ ├── state.ts # in-memory device state with LRU
│ │ └── codec.ts # sentinel decode (__bigint, __buffer_b64)
│ ├── db/
│ │ ├── pool.ts # pg.Pool factory
│ │ └── migrations/
│ │ └── 0001_positions.sql # hypertable creation
│ ├── config/load.ts # zod schema for env
│ ├── observability/
│ │ ├── logger.ts # pino root logger
│ │ └── metrics.ts # prom-client + HTTP server
│ └── main.ts
├── test/
│ ├── codec.test.ts
│ ├── state.test.ts
│ ├── consumer.test.ts # mocked Redis behaviour
│ ├── writer.test.ts # mocked pg behaviour
│ └── pipeline.integration.test.ts # testcontainers Redis + Postgres
├── Dockerfile
├── compose.dev.yaml
├── package.json
├── pnpm-lock.yaml
├── tsconfig.json
├── vitest.config.ts
├── vitest.integration.config.ts
├── .dockerignore
├── .gitignore
├── .prettierrc
├── eslint.config.js
└── README.md
```
## Tech stack (decided)
- **Node.js 22 LTS**, ESM-only.
- **TypeScript 5.x** with `strict: true`, `noUncheckedIndexedAccess: true`.
- **pnpm** for dependency management.
- **vitest** for tests (unit + integration split — same pattern as `tcp-ingestion`).
- **pino** for structured logging (ISO timestamps, string level labels — same config as `tcp-ingestion`).
- **prom-client** for Prometheus metrics.
- **ioredis** for Redis Streams (XREADGROUP, XACK, XAUTOCLAIM).
- **pg** (`pg` package, not `postgres.js`) for Postgres — battle-tested, simple Pool API.
- **zod** for environment-variable validation (a `load.ts` sketch follows this list).
- **testcontainers** for integration tests (Redis 7 + TimescaleDB).
If an implementer wants to deviate, they must update the relevant task file first.
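For the zod item above, a sketch of what `src/config/load.ts` might look like (variable names are illustrative; task 1.3 owns the real list):

```ts
import { z } from 'zod';

const Env = z.object({
  NODE_ENV: z.enum(['development', 'production', 'test']).default('development'),
  INSTANCE_ID: z.string().min(1),
  REDIS_URL: z.string().url(),
  POSTGRES_URL: z.string().url(),
  METRICS_PORT: z.coerce.number().int().positive().default(9090),
  LOG_LEVEL: z.enum(['trace', 'debug', 'info', 'warn', 'error']).default('info'),
});

export type Config = z.infer<typeof Env>;

// Fail fast at boot: an invalid or missing variable throws a readable report.
export function loadConfig(env: NodeJS.ProcessEnv = process.env): Config {
  return Env.parse(env);
}
```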
## Key design decisions inherited from `tcp-ingestion`
- **ESLint `import/no-restricted-paths`** — `src/core/` cannot import from `src/domain/` (the boundary that protects Phase 1 from Phase 2 churn). `src/db/` is shared. A config sketch follows this list.
- **Logger config** — `pino.stdTimeFunctions.isoTime` + level-as-string formatter. Lifecycle events at `info`; high-frequency per-record events at `debug` or `trace`.
- **Slim Dockerfile** — multi-stage with BuildKit cache mounts, `pnpm fetch` + `pnpm install --offline` in the build stage, `pnpm prune --prod` for runtime.
- **CI workflow** — single-job pattern matching `tcp-ingestion/.gitea/workflows/build.yml`. No `services:` block; no separate test container.
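The boundary rule from the first bullet, as a flat-config excerpt (JavaScript, matching `eslint.config.js` in the layout above; assumes `eslint-plugin-import` is installed):

```js
import importPlugin from 'eslint-plugin-import';

export default [
  {
    files: ['src/**/*.ts'],
    plugins: { import: importPlugin },
    rules: {
      'import/no-restricted-paths': ['error', {
        zones: [
          {
            // Files under src/core may not import from src/domain.
            target: './src/core',
            from: './src/domain',
            message: 'core must not depend on domain (Phase 1 / Phase 2 boundary)',
          },
        ],
      }],
    },
  },
];
```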
# Phase 2 — Domain logic
**Status:** ⬜ Not started — blocks on Directus schema decisions
The phase that makes the Processor *racing-aware*. Phase 1 produces a generic position firehose into Postgres; Phase 2 layers the domain rules that turn raw positions into racing events: geofence crossings, timing records, IO interpretation, stage results.
## Outcome statement
When Phase 2 is done:
- Per-model Teltonika IO mappings (e.g. `FMB920 IO 16 → odometer_km`) live in a Directus-managed collection that the Processor reads at startup and refreshes on a known cadence. Decoded attributes are written to a typed shape alongside the raw bag.
- The geofence engine evaluates each incoming Position against the active geofences for the device's current event/stage and emits cross-events (entry/exit) when transitions happen.
- A `timing_records` table is written for each cross-event of interest (start gate, finish gate, intermediate splits), tied to the entry's bib/competitor/stage.
- A `stage_results` rollup is maintained per `(entry, stage)` showing total time, position, and split times. Updated on each new timing record.
## Why this is a separate phase
- **Throughput correctness is independent of domain correctness.** Phase 1 ships a working firehose; Phase 2 layers logic on top without touching the consumer/writer/state plumbing.
- **The Directus schema gates everything in this phase.** Geofences, entries, classes, device_assignments — all live in Directus collections. Until those are designed and migrated, Phase 2 has no schema to write against.
- **Multiple Phase 1 production milestones can pass before Phase 2 starts.** Real-device pilot, second tcp-ingestion instance, Redis high availability — none of those need Phase 2.
## Tasks (sketched, not detailed)
These tasks will get full task files once the Directus schema conversation is settled and we know the exact collection shapes. For now, this is the planned shape:
| # | Task | Notes |
|---|------|-------|
| 2.1 | Directus reflection — read-only client for `geofences`, `device_assignments`, `entries`, `events`, `stages` | Cached in memory, refreshed on a cadence; the boundary that lets the Processor know "what is this device currently racing in" |
| 2.2 | IO mapping table & per-model decoder | `device_models` collection in Directus → in-memory map → `decoded_attributes` JSONB column on `positions` (or a separate table) |
| 2.3 | Geofence engine | Per-position, evaluate active geofences for the device's current entry. Use PostGIS `ST_Contains` for the cross-detection. Emit cross-events (sketched after this table) |
| 2.4 | Timing record writer | Cross-events of interest → rows in `timing_records` (Directus-owned). Idempotent on `(entry_id, geofence_id, ts)` |
| 2.5 | Stage result aggregator | On each new `timing_records` row, recompute `stage_results.{total_time, position}` for the affected entry. Materialized incrementally to avoid full recomputation |
| 2.6 | Per-device runtime state extension | Phase 1's `DeviceState` extended with current entry, current stage, last geofence membership, accumulators. Note: Phase 3 rehydration becomes important once this state has substance |
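For task 2.3's containment check, a heavily hedged sketch: Phase 2 is blocked on the Directus schema, so the table name, geometry column, and SRID here are all assumptions.

```ts
import { Pool } from 'pg';

// Which of the candidate geofences contain the given point? Assumes a
// `geofences.geom` geometry column in SRID 4326; longitude before latitude.
async function geofencesContaining(
  pool: Pool,
  lon: number,
  lat: number,
  candidateIds: string[],
): Promise<Set<string>> {
  const { rows } = await pool.query<{ id: string }>(
    `SELECT id FROM geofences
      WHERE id = ANY($1)
        AND ST_Contains(geom, ST_SetSRID(ST_MakePoint($2, $3), 4326))`,
    [candidateIds, lon, lat],
  );
  return new Set(rows.map((r) => r.id));
}

// Cross-detection is a set diff against the membership held in DeviceState:
// ids newly present are entry events, ids newly absent are exit events.
```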
## Architectural boundary to maintain
`src/core/` from Phase 1 stays untouched. Phase 2 lives in `src/domain/`. The wire-up point is the `sink` function in `src/main.ts`: after `state.update` and `writer.write`, the sink invokes domain handlers. Per the ESLint rule from task 1.1, `src/core/` cannot import from `src/domain/` — only `main.ts` glues them.
## Open questions blocking task-level detail
(These get answered in the Directus schema conversation.)
1. Are `geofences` org-scoped, event-scoped, or both?
2. Is `device_assignments` time-bounded (start_at + end_at) or just event-bounded?
3. Where does the IO mapping table live — Directus collection, hardcoded in Processor, or in a config file?
4. What's the canonical name for the sub-event unit — `stage`, `session`, `run`, `leg`?
5. Is there a live leaderboard requirement, or is timing reviewed post-event?
# Phase 3 — Production hardening
**Status:** ⬜ Not started
The set of operational features that turn a working pilot into something safe to leave running unattended through deploys, instance failures, and bad data.
## Outcome statement
When Phase 3 is done:
- **Graceful shutdown** with bounded in-flight drain: SIGTERM blocks new reads, awaits in-flight writes, ACKs anything still in PEL whose write succeeded, exits clean.
- **State rehydration on restart**: on first packet for an unknown device, the Processor queries Postgres for the device's `last_position` and seeds `DeviceState` accordingly. Phase 2 accumulators get the same treatment (e.g. last geofence membership comes from the last `timing_records` row).
- **`XAUTOCLAIM` for stuck pending entries**: at startup and on a cadence, the Processor claims entries that have been pending in another consumer's PEL for longer than `CLAIM_THRESHOLD_MS`. Lets a dead instance's work get picked up by survivors without manual intervention. A claim-pass sketch follows this list.
- **Dead-letter stream for poison records**: records that fail to decode N times go to `telemetry:t:dlq` with the original payload + the error. Operators can inspect, fix, replay.
- **Multi-instance load split verified**: spinning up two Processor instances against the same consumer group splits the work evenly. End-to-end test in CI (or at least a manual playbook).
- **Migration safety with multiple instances**: Postgres advisory locks around the migration runner so two instances starting simultaneously don't race.
- **Uncaught exception / unhandled rejection handlers**: log, flush in-memory state to a panic dump file, exit with a code Portainer treats as restart-worthy.
- **`OPERATIONS.md` runbook**: exact commands for "claim stuck entries from a dead instance," "drain the DLQ," "force-rehydrate a single device," "view consumer lag," etc.
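A sketch of the claim pass from the `XAUTOCLAIM` bullet (ioredis command form; stream and group names from Phase 1; the threshold and COUNT are illustrative):

```ts
import Redis from 'ioredis';

// One claim pass: take over entries pending in any consumer's PEL for longer
// than thresholdMs, then run them through the normal sink. Redis >= 7 returns
// [nextCursor, entries, deletedIds]; loop until the cursor wraps to 0-0.
async function claimStuckEntries(redis: Redis, consumer: string, thresholdMs: number) {
  let cursor = '0-0';
  do {
    const [next, entries] = (await redis.xautoclaim(
      'telemetry:t', 'processor', consumer, thresholdMs, cursor, 'COUNT', 100,
    )) as [string, Array<[id: string, fields: string[]]>, string[]];
    cursor = next;
    for (const [id, fields] of entries) {
      // ...decode, state.update, writer.write, then XACK, exactly as in main...
      void id;
      void fields;
    }
  } while (cursor !== '0-0');
}
```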
## Tasks (sketched, not detailed)
| # | Task | Notes |
|---|------|-------|
| 3.1 | Graceful shutdown — full | Replaces the Phase 1 stub. Drain budget configurable. Tested end-to-end |
| 3.2 | Per-device state rehydration on first-packet | Single `SELECT ... LIMIT 1` per cold device. Memoized by LRU |
| 3.3 | `XAUTOCLAIM` runner | Periodic + on-startup. Claims entries pending > `CLAIM_THRESHOLD_MS`. Re-runs the sink |
| 3.4 | Dead-letter stream | After N failed decodes/writes, record goes to `telemetry:t:dlq`; original ACKed off the main stream |
| 3.5 | Migration advisory lock | `pg_advisory_lock(<hash>)` around the migrate runner, so two instances can start simultaneously without racing (sketched after this table) |
| 3.6 | Uncaught exception / unhandled rejection handlers | Log, flush, exit. Match `tcp-ingestion`'s eventual Phase 1 task 1.12 work when that lands |
| 3.7 | OPERATIONS.md | The runbook |
| 3.8 | Multi-instance load test | A test (manual or in CI) that proves two instances split the work; document expected lag behaviour during failover |
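And a sketch of task 3.5's advisory lock, assuming node-postgres; the key value is arbitrary but must be identical across instances:

```ts
import { Pool } from 'pg';

const MIGRATION_LOCK_KEY = 724_001; // any fixed bigint shared by all instances

// Advisory locks are session-scoped, so the lock and unlock must run on the
// same client, not on whatever connection the pool hands out next.
async function migrateWithLock(pool: Pool, runMigrations: () => Promise<void>) {
  const client = await pool.connect();
  try {
    // Blocks until free; the second instance waits here, then finds the
    // migrations already applied and no-ops.
    await client.query('SELECT pg_advisory_lock($1)', [MIGRATION_LOCK_KEY]);
    await runMigrations();
  } finally {
    await client.query('SELECT pg_advisory_unlock($1)', [MIGRATION_LOCK_KEY]);
    client.release();
  }
}
```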
## Why this is a separate phase
Phase 1 + Phase 2 produce a service that *works*. Phase 3 is what you do *before you stop watching it*. None of these tasks change correctness — they change operational ergonomics.
## Resume triggers
Each Phase 3 task has its own resume trigger. The whole phase doesn't have to land at once:
- **3.1, 3.5, 3.6** before adding a second Processor instance (rolling deploys become safe).
- **3.2** before any Phase 2 task that depends on hot state (geofence membership) — without rehydration, a restart would forget which geofence each device is in until the device crosses a boundary again.
- **3.3, 3.4** before the pilot is "always-on" (operators need a way to handle stuck/poison records without touching production).
- **3.7** can land alongside whichever of the above ships first, and is updated over time.
- **3.8** before the second instance is added.
# Phase 4 — Future / optional
**Status:** ❄️ Not committed
Ideas on radar that may or may not become real tasks. Captured here so they don't get forgotten and so we have a place to push scope creep that surfaces during Phases 1-3.
## Candidates
- **Directus Flow trigger emission.** When a domain event fires (timing record written, stage result computed, anomaly detected), publish a structured event Directus Flows can subscribe to. Lets Directus orchestrate notifications, integrations, derived workflows without polling the database.
- **Replay tooling.** Read historical positions for a device + time range from Postgres, re-emit them through the domain pipeline (geofence engine, timing logic) without touching `positions`. Useful for: validating a new geofence layout against past races, regenerating timing records after a rule change, demoing.
- **Derived-metric backfill.** When the IO mapping table changes (new model, corrected mapping), backfill `decoded_attributes` for affected devices over a chosen time range without touching `positions`.
- **Alternate consumer for analytics export.** A second consumer group reading the same stream, writing to a parallel destination (Parquet on object storage, ClickHouse, etc.) for offline analytics. The Phase 1 architecture already supports this — it's a separate process joining the same stream with a different group name. No Processor changes needed; just operational scaffolding.
- **WebSocket gateway for live updates.** If Directus's WebSocket subscriptions hit a fan-out ceiling for spectator-facing live leaderboards, a dedicated gateway reads from Redis and pushes to clients, bypassing Directus for the live channel only. REST/GraphQL stays in Directus. Mentioned in `wiki/entities/directus.md`.
- **Per-instance sharding hint.** If consumer-group load distribution turns out to be uneven (one instance handles all the chatty devices), introduce hashing-by-device-id with explicit assignment. Probably overkill — Redis Streams' default round-robin works for most workloads.
None of these are committed. Move them out of Phase 4 and into a numbered phase only when there's a concrete reason to do them.