Add Phase 1 and Phase 2 planning documents
ROADMAP plus granular task files per phase. Phase 1 (12 tasks + 1.13 device authority) covers Codec 8/8E/16 telemetry ingestion; Phase 2 (6 tasks) covers Codec 12/14 outbound commands; Phase 3 enumerates deferred items.
@@ -0,0 +1,96 @@
# tcp-ingestion — Roadmap

A Node.js TCP server that accepts persistent connections from GPS hardware, parses vendor binary protocols, and emits normalized `Position` records onto Redis Streams for the Processor to consume.

This file is the single navigation hub for all implementation planning. Each phase has its own folder with a README and granular task files. Update statuses here as work lands.

## Status legend

| Symbol | Meaning |
|--------|---------|
| ⬜ | Not started |
| 🟦 | Planned (designed, not coded) |
| 🟨 | In progress |
| 🟩 | Done |
| ⏸ | Paused / blocked |
| ❄️ | Frozen / future / optional |

## Architectural anchors

The service is specified by the wiki at `../docs/wiki/`. Implementing agents should read these pages before starting any task:

- **Architecture** — `docs/wiki/sources/gps-tracking-architecture.md`, `docs/wiki/concepts/plane-separation.md`, `docs/wiki/concepts/failure-domains.md`
- **Adapter design (this service)** — `docs/wiki/sources/teltonika-ingestion-architecture.md`, `docs/wiki/concepts/protocol-adapter.md`
- **Teltonika protocol** — `docs/wiki/entities/teltonika.md`, `docs/wiki/concepts/avl-data-format.md`, `docs/wiki/concepts/codec-dispatch.md`, `docs/wiki/concepts/position-record.md`, `docs/wiki/concepts/io-element-bag.md`
- **Phase 2 design** — `docs/wiki/concepts/phase-2-commands.md`
- **Canonical Teltonika spec** — `docs/wiki/sources/teltonika-data-sending-protocols.md`

## Non-negotiable design rules

These rules govern every task. Any deviation must be discussed and documented as a decision before code lands.

1. **Vendor-agnostic shell.** `src/core/` never imports from `src/adapters/`. Adapters never import from each other. Each adapter folder is self-contained.
2. **TCP handler never blocks on downstream work.** Redis pushes are queued/awaited, but never inside a path that would back-pressure a device socket beyond the TCP buffer.
3. **IO element bag passes through unchanged.** No naming, no unit conversion, no model lookup in the parser. `attributes` is keyed by numeric IO ID as string.
4. **Loud failure on unknown codec.** Drop the connection; do not skip ahead, do not partial-parse.
5. **NACK on CRC mismatch.** A NACK is simply the absence of an ACK; the device retransmits.
6. **Codec dispatch is a flat registry**, not a switch or inheritance hierarchy. Phase 2 must be a pure addition.
7. **Fixture-based testing is mandatory** for every codec parser. Hex captures paired with expected `Position[]` outputs run on every CI build.

## Phases

### Phase 1 — Inbound telemetry (Codec 8, 8E, 16)

**Status:** ⬜ Not started
**Outcome:** A production-ready Node.js TCP server ingesting Teltonika telemetry from any FMB/FMC/FMM/FMU device, publishing normalized `Position` records to Redis Streams, with full observability and CI/CD via Gitea.

[**See `phase-1-telemetry/README.md`**](./phase-1-telemetry/README.md)

| # | Task | Status |
|---|------|--------|
| 1.1 | [Project scaffold](./phase-1-telemetry/01-project-scaffold.md) | ⬜ |
| 1.2 | [Core shell & framing types](./phase-1-telemetry/02-core-shell.md) | ⬜ |
| 1.3 | [Configuration & logging](./phase-1-telemetry/03-config-and-logging.md) | ⬜ |
| 1.4 | [Teltonika framing layer (envelope, CRC, handshake)](./phase-1-telemetry/04-teltonika-framing.md) | ⬜ |
| 1.5 | [Codec 8 parser](./phase-1-telemetry/05-codec-8.md) | ⬜ |
| 1.6 | [Codec 8 Extended parser (incl. NX)](./phase-1-telemetry/06-codec-8-extended.md) | ⬜ |
| 1.7 | [Codec 16 parser (incl. Generation Type)](./phase-1-telemetry/07-codec-16.md) | ⬜ |
| 1.8 | [Redis Streams publisher & main wiring](./phase-1-telemetry/08-redis-publisher.md) | ⬜ |
| 1.9 | [Fixture suite & testing strategy](./phase-1-telemetry/09-fixture-suite.md) | ⬜ |
| 1.10 | [Observability (Prometheus metrics)](./phase-1-telemetry/10-observability.md) | ⬜ |
| 1.11 | [Dockerfile & Gitea workflow](./phase-1-telemetry/11-dockerfile-and-ci.md) | ⬜ |
| 1.12 | [Production hardening](./phase-1-telemetry/12-production-hardening.md) | ⬜ |
| 1.13 | [Device authority (Redis allow-list refresher)](./phase-1-telemetry/13-device-authority.md) | ⬜ |

### Phase 2 — Outbound commands (Codec 12, 14)

**Status:** ⬜ Not started (depends on Phase 1)
**Outcome:** Each Ingestion instance runs a parallel command consumer that reads from `commands:outbound:{instance_id}`, encodes Codec 12 or 14 frames, writes them to the appropriate device socket via a per-socket write queue, and publishes responses back to `commands:responses`. Includes the IMEI→instance connection registry and heartbeat.

[**See `phase-2-commands/README.md`**](./phase-2-commands/README.md)

| # | Task | Status |
|---|------|--------|
| 2.1 | [Connection registry & heartbeat](./phase-2-commands/01-connection-registry.md) | ⬜ |
| 2.2 | [Registry janitor (stale entry cleanup)](./phase-2-commands/02-registry-janitor.md) | ⬜ |
| 2.3 | [Per-socket write queue & outstanding-command tracker](./phase-2-commands/03-write-queue.md) | ⬜ |
| 2.4 | [Command consumer (stream reader)](./phase-2-commands/04-command-consumer.md) | ⬜ |
| 2.5 | [Codec 12 encoder + handler](./phase-2-commands/05-codec-12.md) | ⬜ |
| 2.6 | [Codec 14 encoder + ACK/NACK handler](./phase-2-commands/06-codec-14.md) | ⬜ |

### Phase 3 — Future / optional

**Status:** ❄️ Not committed
[**See `phase-3-future/README.md`**](./phase-3-future/README.md) for ideas on the radar:

- Additional vendor adapters (Queclink, Concox).
- UDP transport for Codecs 8/8E/16.
- Codec 15 (FMX6 RS232 modes) if such a fleet onboards.
- SMS-based protocols (Codec 4 24-position, binary SMS) if SMS fallback connectivity is needed.

## Operating model

- **Implementation agent contract.** Each task file is self-sufficient: goal, deliverables, specification, acceptance criteria. An agent should be able to complete one task without reading the whole wiki — but should skim the wiki references at the top of the task before starting.
- **Sequence within a phase.** Task numbering reflects intended order. Soft dependencies are explicit in each task's "Depends on" field. Tasks with no dependencies on each other can be done in parallel.
- **Status updates.** When a task is started, change its row in this ROADMAP to 🟨 and the task file's status badge accordingly. When done, mark it 🟩 and add a one-line note in the task file's "Done" section pointing at the merging commit/PR.
- **Drift control.** If implementation diverges from a task's spec, update the task file *before* the diverging code lands, with a note explaining why. Do not let plans rot — either fix the plan or fix the code.
@@ -0,0 +1,57 @@
# Task 1.1 — Project scaffold

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** None
**Wiki refs:** `docs/wiki/sources/teltonika-ingestion-architecture.md` § Project location and layout

## Goal

Initialize the Node.js / TypeScript project with the directory layout from the wiki, install the agreed tooling, and produce a "hello world" `main.ts` that the rest of Phase 1 builds on.

## Deliverables

- `package.json` declaring:
  - Type: `"module"` (ESM only).
  - Engines: `"node": ">=22"`.
  - Scripts: `build`, `dev`, `start`, `test`, `test:watch`, `lint`, `format`, `typecheck`.
  - Dependencies (production): `ioredis`, `pino`, `prom-client`, `zod`.
  - Dev dependencies: `typescript`, `@types/node`, `vitest`, `@vitest/coverage-v8`, `eslint`, `@typescript-eslint/parser`, `@typescript-eslint/eslint-plugin`, `prettier`, `tsx` (for `dev` watch), and `pino-pretty` (development-only; lazy-loaded behind a `NODE_ENV` check, see task 1.3).
- `tsconfig.json` with `strict: true`, `target: ES2022`, `module: NodeNext`, `moduleResolution: NodeNext`, `outDir: dist`, `rootDir: src`, `declaration: false`, `noUncheckedIndexedAccess: true`.
- `eslint.config.js` (flat config) with `@typescript-eslint/recommended-type-checked` plus a small project-specific allow-list.
- `.prettierrc` — 2 spaces, single quotes. Keep or drop semicolons, but pick one and stay consistent; keeping them matches Node convention and is the recommendation.
- `.gitignore` — `node_modules/`, `dist/`, `coverage/`, `.env`, `.env.local`, `*.log`.
- `.dockerignore` — same as `.gitignore` plus `.git/`, `.planning/`, `test/`, `*.md` except `README.md`.
- Empty directories with `.gitkeep` files where Phase 1 will fill them in:
  - `src/core/`, `src/adapters/teltonika/codec/data/`, `src/adapters/teltonika/codec/command/`, `src/config/`, `src/observability/`
  - `test/fixtures/teltonika/codec8/`, `test/fixtures/teltonika/codec8e/`, `test/fixtures/teltonika/codec16/`
- `src/main.ts` — minimal stub: imports a logger (placeholder until task 1.3), prints "tcp-ingestion starting" and exits with code 0.
- `README.md` — short description pointing at `.planning/ROADMAP.md` for the work plan, and at `../docs/wiki/` for the architectural specification.
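The compiler options enumerated above can be collected into a single `tsconfig.json`; a minimal sketch (the `include` field is an assumption, everything else is straight from the bullet list):

```json
{
  "compilerOptions": {
    "strict": true,
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "outDir": "dist",
    "rootDir": "src",
    "declaration": false,
    "noUncheckedIndexedAccess": true
  },
  "include": ["src"]
}
```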
## Specification

- **Package manager:** pnpm. Commit `pnpm-lock.yaml`. The Dockerfile in task 1.11 will use `pnpm fetch` for layer-cache friendliness.
- **Module style:** ESM throughout. No CJS interop hacks. All files use `import`/`export` and a `.js` suffix on relative imports, per Node ESM resolution rules.
- **TypeScript path style:** relative imports for now. No `paths` aliases — they add a bundler dependency at runtime that we don't want.
- **No bundler.** The build is `tsc` only. Runtime is plain Node consuming `dist/`. The Dockerfile will copy `dist/` and `node_modules/`.
- **Linting style:** configure ESLint to enforce `@typescript-eslint/no-floating-promises` and `@typescript-eslint/no-misused-promises` — both are critical in a TCP server, where unhandled promise rejections silently lose work.
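A sketch of a flat config enabling the two promise rules above, assuming the bundled `typescript-eslint` package and its v8 flat-config helpers (verify names against the installed version):

```js
// eslint.config.js — illustrative sketch, not the final allow-list
import tseslint from 'typescript-eslint';

export default tseslint.config(
  ...tseslint.configs.recommendedTypeChecked,
  {
    languageOptions: { parserOptions: { projectService: true } },
    rules: {
      '@typescript-eslint/no-floating-promises': 'error',
      '@typescript-eslint/no-misused-promises': 'error',
    },
  },
);
```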
## Acceptance criteria

- [ ] `pnpm install` succeeds with no warnings other than peer deps.
- [ ] `pnpm typecheck` succeeds on the empty project.
- [ ] `pnpm lint` succeeds.
- [ ] `pnpm build` produces `dist/main.js`.
- [ ] `pnpm start` runs the compiled output and prints the startup message.
- [ ] `pnpm test` runs (with no tests) and exits successfully.
- [ ] `pnpm dev` runs `main.ts` via `tsx` and prints the startup message.
- [ ] Repository builds reproducibly: deleting `node_modules` and `dist`, then running `pnpm install --frozen-lockfile && pnpm build`, produces identical output.

## Risks / open questions

- Pinning Node 22 LTS vs 20 LTS: 22 is the current LTS in 2026 and has stable native fetch plus better worker-thread performance. Stay with 22 unless deployment infra forces 20.
- ESLint v9 flat config: ensure the version is compatible with `@typescript-eslint/*` v8+. If issues arise, fall back to legacy `.eslintrc.json` until upstream catches up.

## Done

(Fill in once complete: commit SHA, brief notes.)
@@ -0,0 +1,89 @@
# Task 1.2 — Core shell & framing types

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.1
**Wiki refs:** `docs/wiki/concepts/protocol-adapter.md`, `docs/wiki/concepts/codec-dispatch.md`, `docs/wiki/concepts/position-record.md`

## Goal

Build the vendor-agnostic shell: TCP server bootstrap, per-socket session loop, and the type/registry definitions that adapters plug into. **No Teltonika-specific code in this task.**

## Deliverables

- `src/core/types.ts`:
  - `Position` type matching the [[position-record]] shape exactly.
  - `Adapter` interface: `{ name: string; ports: number[]; handleSession(socket: net.Socket, ctx: AdapterContext): Promise<void> }`.
  - `AdapterContext` interface: `{ publish: (p: Position) => Promise<void>; logger: Logger; metrics: Metrics }` — a narrow contract giving adapters what they need without leaking shell internals.
- `src/core/registry.ts`:
  - `AdapterRegistry` class (or simple module) holding a `Map<port, Adapter>`. Methods: `register(adapter)`, `get(port)`.
  - This is the *adapter* registry. The *codec* registry (per-vendor, in Teltonika's case) is internal to the adapter — it lives in `src/adapters/teltonika/` (task 1.4).
- `src/core/session.ts`:
  - `runSession(socket, adapter, ctx)` that wraps `adapter.handleSession` with:
    - Initial socket configuration (`setNoDelay`, `setKeepAlive` with a sane delay, e.g. 60s).
    - Standard error handling: `error`, `close`, `end` events all logged at `debug` level with the connection's remote address.
    - A `try { await handleSession() } catch (e) { logger.warn(e) } finally { socket.destroy() }` that ensures the socket is destroyed even if the handler throws.
  - Crucially, `runSession` does *not* know about IMEI, framing, or codecs — those are entirely the adapter's business.
- `src/core/server.ts`:
  - `startServer(port, adapter, ctx)` returning a closable handle. Uses `net.createServer((socket) => runSession(socket, adapter, ctx))`.
  - Logs server bind, accept, and close events.
- `src/core/publish.ts`:
  - Stub `publishPosition(position)` returning `Promise<void>`. The real implementation lands in task 1.8; for now it accepts a `Position` and logs at debug. The shape should already match what task 1.8 will produce so the `Adapter` types stabilize early.
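The `AdapterRegistry` deliverable is small enough to sketch in full. This is illustrative: the duplicate-port guard and the inlined `Adapter` shape are our assumptions, not requirements from the task text.

```typescript
import type * as net from 'node:net';

// Minimal sketch of src/core/registry.ts. Adapter mirrors the interface from
// the deliverables; only name/ports matter to the registry itself.
type Adapter = {
  name: string;
  ports: number[];
  handleSession: (socket: net.Socket, ctx: unknown) => Promise<void>;
};

export class AdapterRegistry {
  private byPort = new Map<number, Adapter>();

  register(adapter: Adapter): void {
    for (const port of adapter.ports) {
      // Assumption: two adapters claiming one port is a wiring bug worth failing on.
      if (this.byPort.has(port)) {
        throw new Error(`port ${port} already claimed by ${this.byPort.get(port)!.name}`);
      }
      this.byPort.set(port, adapter);
    }
  }

  get(port: number): Adapter | undefined {
    return this.byPort.get(port);
  }
}
```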
## Specification

### Vendor-agnostic discipline (re-stated)

**`src/core/` must not import from `src/adapters/` — ever.** This is enforced by ESLint with `eslint-plugin-import`'s `no-restricted-paths` rule. Add the rule in this task; a violation should be a CI error.

```js
// in eslint.config.js
'import/no-restricted-paths': ['error', {
  zones: [{ target: './src/core', from: './src/adapters' }]
}]
```

Adapters can import from `core/`; the reverse is forbidden.

### TCP socket settings

- `socket.setNoDelay(true)` — disable Nagle so ACKs are not batched. We're sending small ACKs; latency matters more than packet count.
- `socket.setKeepAlive(true, 60_000)` — TCP keepalive with a 60s probe delay. Defends against idle NAT timeouts; safe because devices already retransmit on disconnect.
- No `socket.setTimeout()` at the shell level. The protocol does not specify per-frame timing; idle sockets are fine. Adapters can impose timeouts if their protocol demands them.
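Putting the socket settings together with the error-handling contract from the deliverables, `runSession` could look like this. A sketch only: the narrowed `MinimalLogger`/`SessionAdapter` shapes are ours, and real code would take the logger from `ctx`.

```typescript
import type * as net from 'node:net';

// Hypothetical narrowed shapes for illustration; the real types live in src/core/types.ts.
type MinimalLogger = {
  debug: (obj: object, msg: string) => void;
  warn: (obj: object, msg: string) => void;
};
type SessionAdapter = {
  handleSession: (socket: net.Socket, ctx: unknown) => Promise<void>;
};

export async function runSession(
  socket: net.Socket,
  adapter: SessionAdapter,
  ctx: unknown,
  logger: MinimalLogger,
): Promise<void> {
  socket.setNoDelay(true); // small ACKs: latency matters more than packet count
  socket.setKeepAlive(true, 60_000); // defend against idle NAT timeouts

  const remote = socket.remoteAddress;
  socket.on('error', (err) => logger.debug({ remote, err }, 'socket error'));
  socket.on('close', () => logger.debug({ remote }, 'socket closed'));
  socket.on('end', () => logger.debug({ remote }, 'socket ended'));

  try {
    await adapter.handleSession(socket, ctx);
  } catch (e) {
    logger.warn({ remote, err: e }, 'session handler threw');
  } finally {
    socket.destroy(); // no leaked sockets regardless of how the handler exits
  }
}
```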
### Position type

Mirror [[position-record]] precisely:

```ts
export type Position = {
  device_id: string;
  timestamp: Date;
  latitude: number;
  longitude: number;
  altitude: number;
  angle: number; // 0–360
  speed: number; // km/h; 0 may mean "GPS invalid" — caller preserves verbatim
  satellites: number;
  priority: 0 | 1 | 2; // Low | High | Panic
  attributes: Record<string, number | bigint | Buffer>;
};
```

Use `Date`, not `number`, for `timestamp` — the value is a `Date` from the moment it leaves Ingestion; downstream is responsible for the serialization choice.

## Acceptance criteria

- [ ] `pnpm typecheck` and `pnpm lint` pass.
- [ ] `src/core/server.ts` can be imported, and `startServer` returns a `net.Server` that listens on a configurable port.
- [ ] A trivial test (in `test/core/server.test.ts`) starts a server with a stub adapter, opens a TCP client, and verifies the adapter's `handleSession` is invoked with a real socket.
- [ ] ESLint enforces the `no-restricted-paths` rule — verified by adding a temporary import-from-adapter into `src/core/server.ts`, confirming the lint error, then removing it.

## Risks / open questions

- The `AdapterContext` metrics interface is sketched but not fully specified until task 1.10. Make a minimal placeholder (`{ inc: (name, labels?) => void; observe: ... }`) and tighten it in 1.10.
- `Buffer` in `Position.attributes` requires JSON serialization handling at the publish boundary (task 1.8). Decide there: base64-encode buffers, or serialize via msgpack. Recommendation: base64 with a sentinel in the JSON, e.g. `{ "_b64": "..." }`. Defer the decision to task 1.8 and revisit if simpler options surface.
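If the base64-sentinel recommendation is adopted in task 1.8, the publish-boundary encoding could look like the following. Purely illustrative: the `encodeAttributes` name, the `_b64` key, and the bigint-to-string choice are assumptions to be settled in task 1.8.

```typescript
// Convert a Position's attribute bag into a JSON-safe shape:
// Buffers become { _b64: "..." } sentinels, bigints become decimal strings
// (JSON.stringify throws on bigint), numbers pass through untouched.
export function encodeAttributes(
  attributes: Record<string, number | bigint | Buffer>,
): Record<string, number | string | { _b64: string }> {
  const out: Record<string, number | string | { _b64: string }> = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (Buffer.isBuffer(value)) {
      out[key] = { _b64: value.toString('base64') };
    } else if (typeof value === 'bigint') {
      out[key] = value.toString();
    } else {
      out[key] = value;
    }
  }
  return out;
}
```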
## Done

(Fill in once complete.)

@@ -0,0 +1,78 @@
# Task 1.3 — Configuration & logging

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.1
**Wiki refs:** `docs/wiki/sources/gps-tracking-architecture.md` § Deployment topology, § Observability

## Goal

Provide a single source of truth for runtime configuration (env-var-driven, validated at startup, fail-fast on misconfiguration) and a structured JSON logger.

## Deliverables

- `src/config/load.ts`:
  - Exports `loadConfig(): Config`, which parses `process.env` through a zod schema and returns a typed `Config` object. Throws with a clear error message on missing/malformed values.
  - All env vars are optional in dev (with sensible defaults) and required in production-like deployments. Use `NODE_ENV` to gate.
- `src/observability/logger.ts`:
  - Exports a configured `pino` logger. JSON output by default; pretty-printed via `pino-pretty` only when `NODE_ENV === 'development'` (lazy-loaded so it's not in the prod bundle).
  - Log level controlled by the `LOG_LEVEL` env var (default `info` in production, `debug` in development).
  - Adds `service: 'tcp-ingestion'` and `instance_id` (from the `INSTANCE_ID` env var, or a short UUID generated at startup) to every log line.

## Specification

### Config schema (zod)

```ts
// randomUUID comes from 'node:crypto'.
const ConfigSchema = z.object({
  NODE_ENV: z.enum(['development', 'test', 'production']).default('development'),
  INSTANCE_ID: z.string().min(1).default(() => `local-${randomUUID().slice(0, 8)}`),
  LOG_LEVEL: z.enum(['fatal', 'error', 'warn', 'info', 'debug', 'trace']).default('info'),

  // Vendor port bindings — extend as adapters are added.
  TELTONIKA_PORT: z.coerce.number().int().min(1).max(65535).default(5027),

  // Redis
  REDIS_URL: z.string().url(),
  REDIS_TELEMETRY_STREAM: z.string().min(1).default('telemetry:teltonika'),
  REDIS_STREAM_MAXLEN: z.coerce.number().int().min(0).default(1_000_000), // approximate cap

  // Observability
  METRICS_PORT: z.coerce.number().int().min(0).max(65535).default(9090),

  // Phase 2 (planned, not used in Phase 1)
  // COMMANDS_OUTBOUND_STREAM_PREFIX: z.string().default('commands:outbound'),
});

export type Config = z.infer<typeof ConfigSchema>;
```

The Phase 2 fields are commented out so they do not become runtime requirements before Phase 2 ships. Add them when Phase 2 is in flight.

### Logger conventions

- Always emit JSON in production (pino's default).
- Always include: `time`, `level`, `service`, `instance_id`, `msg`.
- Adapter log lines include `imei` when known; framing log lines include `codec_id` when applicable; CRC failures include `expected_crc`, `computed_crc`, `frame_length`.
- Use `logger.child({ imei })` to scope a logger per session, so subsequent log lines auto-include the IMEI.
- Never log raw frame payloads at info or above — they're large and may contain sensitive telemetry. At debug, truncate to the first/last 16 bytes.
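The truncation rule can live in a small helper. A sketch; the `truncatedHex` name and the elision format are ours, not from the wiki:

```typescript
// Render a frame payload for debug logs: full hex when short, otherwise the
// first/last `edge` bytes with an elision marker carrying the true length.
export function truncatedHex(buf: Buffer, edge = 16): string {
  if (buf.length <= edge * 2) return buf.toString('hex');
  const head = buf.subarray(0, edge).toString('hex');
  const tail = buf.subarray(buf.length - edge).toString('hex');
  return `${head}..(${buf.length} bytes)..${tail}`;
}
```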
### Failure mode

`loadConfig()` is called once in `main.ts`. If it throws, the process exits with a non-zero code and a single human-readable line listing the missing/invalid keys. **Do not fall back to silent defaults for required keys** — the operational habit we want is "missing config = process refuses to start," not "process starts and behaves weirdly later."
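A dependency-free sketch of that fail-fast shape (the real implementation reports zod's issues; this only illustrates "collect every problem, report once, refuse to start" — the `checkRequiredEnv` helper is hypothetical):

```typescript
// Validate required keys up front, collecting *all* problems so the operator
// sees one complete report instead of fixing keys one restart at a time.
export function checkRequiredEnv(
  env: Record<string, string | undefined>,
  required: string[],
): void {
  const missing = required.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === '';
  });
  if (missing.length > 0) {
    throw new Error(`invalid configuration - missing/empty: ${missing.join(', ')}`);
  }
}

// In main.ts the caller turns the throw into a clean non-zero exit:
// try { checkRequiredEnv(process.env, ['REDIS_URL']); } catch (e) {
//   console.error((e as Error).message);
//   process.exit(1);
// }
```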
## Acceptance criteria

- [ ] Calling `loadConfig()` with `REDIS_URL` unset throws, and the error names `REDIS_URL` specifically.
- [ ] Calling `loadConfig()` with `NODE_ENV=development` and only `REDIS_URL` set returns a fully valid `Config` with sensible defaults for everything else.
- [ ] The logger emits JSON when `NODE_ENV=production` and pretty-printed text when `NODE_ENV=development`.
- [ ] `logger.child({ imei: '...' })` produces lines with `imei` included.

## Risks / open questions

- The `INSTANCE_ID` default is a random UUID per process start — fine for dev, but in production K8s/compose deployments, set it explicitly to a stable identifier (pod name, hostname, etc.). The Phase 2 connection registry depends on `INSTANCE_ID` being stable across the lifetime of the process; document this in the deployment notes (task 1.11).
- Log volume could be high under load. Pino is fast (~100k+ lines/sec on modern hardware), but consider `useOnlyCustomLevels` or sampling for the busiest events (e.g. per-frame debug logs).

## Done

(Fill in once complete.)

@@ -0,0 +1,183 @@
# Task 1.4 — Teltonika framing layer

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.2
**Wiki refs:** `docs/wiki/concepts/avl-data-format.md` (envelope, IMEI handshake), `docs/wiki/concepts/codec-dispatch.md`, `docs/wiki/sources/teltonika-data-sending-protocols.md`

## Goal

Implement the Teltonika **adapter shell**: IMEI handshake, AVL frame envelope read loop, CRC validation, and the codec dispatch registry. The codec parsers themselves are tasks 1.5–1.7; this task lays the framing rails they slot into.

## Deliverables

- `src/adapters/teltonika/index.ts` — the `Adapter` export consumed by `src/core/`. Wires the codec registry; exports `{ name: 'teltonika', ports: [config.TELTONIKA_PORT], handleSession }`.
- `src/adapters/teltonika/handshake.ts` — `readImeiHandshake(socket): Promise<string>` performs the 2-byte-length + ASCII-IMEI read and returns the IMEI string. **It does not write the accept/reject byte itself** — that decision is made by the session loop after consulting `DeviceAuthority` (see "`DeviceAuthority` seam" below). On malformed input, throws a typed `HandshakeError`.
- `src/adapters/teltonika/device-authority.ts` — defines the `DeviceAuthority` interface and ships an `AllowAllAuthority` default implementation. The opt-in Redis-backed authority lives in task 1.13.
- `src/adapters/teltonika/frame.ts` — `readNextFrame(socket, buffer): Promise<{ codecId: number; payload: Buffer; crcValid: boolean }>`, plus a small `BufferedReader` class that handles partial-read accumulation across `socket.on('data')` events.
- `src/adapters/teltonika/crc.ts` — pure function `crc16Ibm(buf: Buffer): number`. Implements CRC-16/IBM (reversed polynomial `0xA001`, init `0x0000`, no final XOR).
- `src/adapters/teltonika/codec/registry.ts` — the adapter-internal codec registry: `Map<codecId, CodecDataHandler>`. Phase 1 registers handlers from `codec/data/`; Phase 2 will register from `codec/command/`.

## Specification

### IMEI handshake

```
Device → Server: [length 2B big-endian][IMEI bytes (ASCII, length B)]
Server → Device: 0x01 (accept) | 0x00 (reject)
```

Phase 1 default: **accept all syntactically valid IMEIs.** Authorization (whether a given IMEI is *expected* to be in the fleet) is a soft observability concern, not a hard gate, until task 1.13 adds the opt-in allow-list refresher. The handshake consults a `DeviceAuthority` interface for a `known | unknown` label that flows into metrics and logs but does **not** block the handshake by default.

Parse rules:
- Length must be ≤ 32 (Teltonika IMEIs are 15 ASCII digits; we allow some headroom).
- The IMEI body must match `/^\d{14,16}$/` after ASCII decode.
- Anything malformed: throw `HandshakeError`, log at `warn` with the offending bytes (truncated), destroy the socket, never write `0x01`.
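The socket read is plumbing, but the parse rules above are a pure function, which keeps them trivially testable. A sketch; the split into a `parseImeiHandshake` validator is our structuring choice (the `HandshakeError` name comes from the deliverables):

```typescript
export class HandshakeError extends Error {}

// Validate a complete handshake buffer: [length 2B big-endian][ASCII IMEI].
// Returns the IMEI string, or throws HandshakeError per the parse rules above.
export function parseImeiHandshake(buf: Buffer): string {
  if (buf.length < 2) throw new HandshakeError('handshake shorter than length prefix');
  const length = buf.readUInt16BE(0);
  if (length > 32) throw new HandshakeError(`IMEI length ${length} exceeds cap of 32`);
  if (buf.length < 2 + length) throw new HandshakeError('handshake truncated');
  const imei = buf.subarray(2, 2 + length).toString('ascii');
  if (!/^\d{14,16}$/.test(imei)) throw new HandshakeError('IMEI body is not 14-16 ASCII digits');
  return imei;
}
```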
### `DeviceAuthority` seam

```ts
export interface DeviceAuthority {
  check(imei: string): Promise<'known' | 'unknown'>;
}

export class AllowAllAuthority implements DeviceAuthority {
  async check(): Promise<'known'> { return 'known'; }
}
```

Wire `DeviceAuthority` into the Teltonika adapter context. The default binding in `main.ts` is `new AllowAllAuthority()` — every IMEI is reported as `known` until a real authority is configured.

Behavior of the handshake:

```ts
const imei = await readImeiHandshake(socket);
const knownLabel = await ctx.deviceAuthority.check(imei).catch((err) => {
  ctx.logger.warn({ err, imei }, 'device authority check failed; defaulting to unknown');
  return 'unknown' as const;
});
if (knownLabel === 'unknown' && config.STRICT_DEVICE_AUTH) {
  // Reject (rare; off by default)
  ctx.metrics.handshake.inc({ result: 'rejected', known: knownLabel });
  socket.write(Buffer.from([0x00]));
  ctx.logger.warn({ imei }, 'rejected unknown device under STRICT_DEVICE_AUTH');
  socket.destroy();
  return;
}
ctx.metrics.handshake.inc({ result: 'accepted', known: knownLabel });
socket.write(Buffer.from([0x01]));
```

Three properties:

1. **Default behavior is unchanged from accept-all.** No business-plane dependency.
2. **Unknown devices are *visible*** via the `known` label on `teltonika_handshake_total` (see task 1.10).
3. **`STRICT_DEVICE_AUTH=true`** flips the policy to reject-unknowns. Off by default. When operators want this, they enable it; the code path is already there.

The real implementation of `DeviceAuthority` (Redis-backed, refreshed from a Directus-published allow-list) is task 1.13. Task 1.4 ships only the interface and the `AllowAllAuthority` default.

### AVL frame envelope

Per [[avl-data-format]]:

```
[Preamble 4B = 0x00000000]
[DataFieldLength 4B big-endian]
[CodecID 1B]
[N1 1B]
[AVL records — DataFieldLength minus 2 bytes for CodecID and N1, minus 1 byte for N2]
[N2 1B]
[CRC 4B]
```

Important framing rules:

- **DataFieldLength is NOT the size of the AVL records section** — it is the size from `CodecID` through `N2` inclusive. So the bytes to read after the length field = `DataFieldLength + 4` (the CRC).
- **CRC is computed from `CodecID` through `N2`** (the same span as `DataFieldLength`).
- **N1 must equal N2.** A mismatch is structural malformation, not a transient error: log, do not ACK, and **drop the connection** (unlike a CRC failure, which keeps the connection open).
- **The CRC field is 4 bytes**, but only the lower 2 contain the value; the upper 2 are zero. Read all 4; validate the lower 16 bits.

Pseudocode for the read loop:

```ts
const reader = new BufferedReader(socket);
while (!socket.destroyed) {
  const preamble = await reader.readExact(4);
  if (preamble.readUInt32BE() !== 0) {
    logger.warn({ imei }, 'invalid preamble; dropping connection');
    socket.destroy();
    return;
  }
  const length = (await reader.readExact(4)).readUInt32BE();
  if (length < 8 || length > MAX_AVL_PACKET_SIZE) {
    logger.warn({ imei, length }, 'implausible DataFieldLength; dropping connection');
    socket.destroy();
    return;
  }
  const body = await reader.readExact(length); // CodecID + N1 + records + N2
  const crcField = await reader.readExact(4);
  const expectedCrc = crcField.readUInt16BE(2); // lower 2 of 4
  const computedCrc = crc16Ibm(body);
  if (expectedCrc !== computedCrc) {
    metrics.frames.inc({ codec: codecLabel(body[0]), result: 'crc_fail' });
    logger.warn({ imei, expected_crc: expectedCrc, computed_crc: computedCrc }, 'CRC mismatch');
    continue; // do NOT ack; the device retransmits
  }
  const codecId = body[0];
  const handler = codecRegistry.get(codecId);
  if (!handler) {
    metrics.unknownCodec.inc({ codec_id: String(codecId) });
    logger.warn({ imei, codec_id: codecId, header: body.subarray(0, 16).toString('hex') }, 'unknown codec; dropping connection');
    socket.destroy();
    return;
  }
  const result = await handler.handle(body, ctx);
  // ACK: 4-byte big-endian count of records accepted
  const ack = Buffer.alloc(4);
  ack.writeUInt32BE(result.recordCount, 0);
  socket.write(ack);
}
```

### CRC-16/IBM

Polynomial `0xA001` (the reflected form of `0x8005`), initial value `0x0000`, no final XOR. The implementation should be a tight loop with a precomputed lookup table — protocol parsing is on the hot path. The Teltonika doc references CRC-16/IBM; it is the same algorithm as CRC-16/ARC.

Test against the canonical doc's worked example:
- Frame body `08010000016B40D8EA30010000000000000000000000000000000105021503010101425E0F01F10000601A014E000000000000000001` (Codec 8, see `docs/raw/...`), expected CRC = `0x0000C7CF` (lower 16 bits = `0xC7CF`).
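A table-driven sketch of `crc.ts` following the parameters above. Checked here against the well-known CRC-16/ARC check value for the ASCII string `123456789` (`0xBB3D`) rather than the frame example:

```typescript
// CRC-16/IBM (a.k.a. CRC-16/ARC): reversed poly 0xA001, init 0x0000, no final XOR.
// The 256-entry table is built once at module load; the per-byte step is then
// one shift, one XOR, and one lookup - cheap enough for the framing hot path.
const CRC16_TABLE = new Uint16Array(256);
for (let i = 0; i < 256; i++) {
  let crc = i;
  for (let bit = 0; bit < 8; bit++) {
    crc = crc & 1 ? (crc >>> 1) ^ 0xa001 : crc >>> 1;
  }
  CRC16_TABLE[i] = crc;
}

export function crc16Ibm(buf: Buffer): number {
  let crc = 0x0000;
  for (const byte of buf) {
    crc = (crc >>> 8) ^ CRC16_TABLE[(crc ^ byte) & 0xff]!;
  }
  return crc;
}
```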
### MAX_AVL_PACKET_SIZE

Constant for the sanity check on `DataFieldLength`. Use `1300` to cover both fleet caps (512 B for the FMB640 family, 1280 B for others) with small headroom. Larger frames are malformed, and we drop the connection.

### Codec registry structure

```ts
export interface CodecDataHandler {
  codec_id: number;
  handle(
    body: Buffer, // CodecID + N1 + records + N2
    ctx: { imei: string; publish: (p: Position) => Promise<void>; logger: Logger; metrics: Metrics }
  ): Promise<{ recordCount: number }>;
}
```

`handle` receives a body whose framing-level concerns (envelope, CRC, codec dispatch) have already been dealt with above it. Each codec parser is responsible for parsing N1/N2 and the records themselves, and for producing `Position` records via `ctx.publish`.

## Acceptance criteria

- [ ] CRC-16/IBM matches the canonical Teltonika example byte-for-byte.
- [ ] `readImeiHandshake` returns a parsed IMEI for well-formed input without writing to the socket.
- [ ] `readImeiHandshake` rejects malformed input by throwing, without writing anything.
- [ ] The session loop, after a successful handshake, consults `DeviceAuthority.check`, increments `teltonika_handshake_total{result, known}`, and writes `0x01` (or `0x00` under `STRICT_DEVICE_AUTH`).
- [ ] `AllowAllAuthority` always returns `'known'`; verified by a unit test.
- [ ] `STRICT_DEVICE_AUTH=true` causes an `unknown` device to receive `0x00` and have its socket destroyed; verified by an integration test with a stub authority.
- [ ] `BufferedReader.readExact(n)` correctly handles bytes arriving across multiple `data` events.
- [ ] `readNextFrame` correctly identifies a CRC mismatch without dropping the connection.
- [ ] `readNextFrame` drops the connection on an unknown codec ID and logs the structured warn line.
- [ ] All paths that write to the socket use a single point of ACK emission, so Phase 2 can later interpose a write queue without rewriting framing code.

## Risks / open questions

- `BufferedReader` correctness is critical. Use a battle-tested approach — a queue of pending reads with a backing `Buffer.concat` accumulator. Alternatively, use Node's async iteration over the socket (`for await (const chunk of socket)`) if the ergonomics fit.
- The `await ctx.publish(p)` inside the handler is the boundary where Phase 1's "TCP handler never blocks on downstream" property is enforced. The publish must use a non-blocking strategy (fire-and-forget into a bounded queue, or a guarantee that the Redis publish is fast enough). Task 1.8 specifies the publish strategy; this task only needs to make the `await` semantically correct.
## Done
|
||||
|
||||
(Fill in once complete.)
|
||||
@@ -0,0 +1,98 @@

# Task 1.5 — Codec 8 parser

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.4, 1.9 (fixture infra)
**Wiki refs:** `docs/wiki/concepts/avl-data-format.md` § Codec 8, `docs/wiki/sources/teltonika-data-sending-protocols.md` § Codec 8

## Goal

Parse Codec 8 (`0x08`) AVL data bodies into `Position` records and publish them via `ctx.publish`.

## Deliverables

- `src/adapters/teltonika/codec/data/codec8.ts` exporting `codec8Handler: CodecDataHandler` with `codec_id: 0x08`.
- Helper functions in the same file (or in a sibling `gps-element.ts` if shared with codecs 8E and 16):
  - `parseGpsElement(buf, offset): { value: GpsElement; nextOffset: number }`
  - `parseTimestamp(buf, offset): { value: Date; nextOffset: number }`
- Test file `test/codec8.test.ts` with at least the three fixtures from the canonical Teltonika example doc plus a synthetic empty-IO fixture and a multi-record fixture.

## Specification

### AVL record layout (Codec 8)

```
[Timestamp 8B] [Priority 1B] [GPS Element 15B] [IO Element ...]
```

#### Timestamp

8-byte big-endian unsigned integer: milliseconds since the UNIX epoch, UTC. Read with `buf.readBigUInt64BE(offset)` and convert via `new Date(Number(ms))`. Keep the arithmetic in `BigInt` until the final conversion to sidestep `Number` precision concerns; real values stay well within the safe-integer range until roughly year 285,000, but be explicit about the assumption.
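
A sketch of the helper, assuming the `{ value, nextOffset }` return shape from the deliverables:

```typescript
// Timestamp: 8-byte big-endian unsigned integer, milliseconds since UNIX epoch (UTC).
export function parseTimestamp(
  buf: Buffer,
  offset: number,
): { value: Date; nextOffset: number } {
  const ms = buf.readBigUInt64BE(offset);
  // Guard the BigInt → Number conversion explicitly, even though real
  // device timestamps are nowhere near the safe-integer limit.
  if (ms > BigInt(Number.MAX_SAFE_INTEGER)) {
    throw new RangeError(`timestamp out of safe range at offset ${offset}: ${ms}`);
  }
  return { value: new Date(Number(ms)), nextOffset: offset + 8 };
}
```

Parsing the canonical example's first record timestamp, `0x0000016B40D8EA30`, yields `2019-06-10T10:04:46.000Z`, matching the doc's GMT rendering.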

#### Priority

1-byte enum: `0` = Low, `1` = High, `2` = Panic. Should unexpected values be rejected? Decision: **accept any value 0–255 rather than failing the parse** — the Teltonika spec lists 0–2, but treating an unexpected priority as a parser failure would be hostile. Log a `debug` line if the value is > 2 and proceed.

Type-narrow to `0 | 1 | 2` only when the value is in range; otherwise record it as `2` (Panic, the most conservative choice) and emit a metric `teltonika_priority_out_of_range_total`. Open question: confirm the right fallback with operations.

#### GPS Element (15 bytes)

```
[Longitude 4B][Latitude 4B][Altitude 2B][Angle 2B][Satellites 1B][Speed 2B]
```

- **Longitude / Latitude**: signed 32-bit big-endian integer (two's complement), divided by `1e7` to get decimal degrees. Negative values need no special handling: `buf.readInt32BE(offset) / 1e7` does the right thing because `readInt32BE` interprets the value as signed.
- **Altitude**: 2-byte signed big-endian, meters above sea level.
- **Angle**: 2-byte unsigned big-endian, degrees clockwise from north (0–360).
- **Satellites**: 1-byte unsigned.
- **Speed**: 2-byte unsigned, km/h. **Pass through verbatim** — `0x0000` may mean "GPS invalid", but that semantic decision belongs to the Processor.
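
The field table above maps directly to code; the `GpsElement` shape is an assumption mirroring the deliverables:

```typescript
export interface GpsElement {
  longitude: number;
  latitude: number;
  altitude: number;
  angle: number;
  satellites: number;
  speed: number;
}

// 15-byte GPS Element: [Lon 4B][Lat 4B][Alt 2B][Angle 2B][Sats 1B][Speed 2B], all big-endian.
export function parseGpsElement(
  buf: Buffer,
  offset: number,
): { value: GpsElement; nextOffset: number } {
  return {
    value: {
      longitude: buf.readInt32BE(offset) / 1e7,    // signed → decimal degrees
      latitude: buf.readInt32BE(offset + 4) / 1e7, // signed → decimal degrees
      altitude: buf.readInt16BE(offset + 8),       // signed, meters
      angle: buf.readUInt16BE(offset + 10),        // unsigned, degrees
      satellites: buf.readUInt8(offset + 12),
      speed: buf.readUInt16BE(offset + 13),        // unsigned, km/h, verbatim
    },
    nextOffset: offset + 15,
  };
}
```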

#### IO Element (Codec 8 layout)

```
[Event IO ID 1B]
[N total 1B]
[N1 1B] then N1 × ([IO ID 1B][Value 1B])
[N2 1B] then N2 × ([IO ID 1B][Value 2B BE unsigned])
[N4 1B] then N4 × ([IO ID 1B][Value 4B BE unsigned])
[N8 1B] then N8 × ([IO ID 1B][Value 8B BE — store as bigint])
```

Iterate each section and write into `position.attributes`:

```ts
attributes[String(ioId)] = value;
```

Values:
- 1-byte → `number` (read with `readUInt8`)
- 2-byte → `number` (read with `readUInt16BE`)
- 4-byte → `number` (read with `readUInt32BE`)
- 8-byte → `bigint` (read with `readBigUInt64BE`)
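
Putting the section walk together, a sketch with illustrative names; the event IO ID is returned separately so the caller can apply whatever key convention is decided below:

```typescript
type IoValue = number | bigint;

// Walk the four fixed-width sections of a Codec 8 IO element.
export function parseIoElementCodec8(
  buf: Buffer,
  offset: number,
): { eventIoId: number; attributes: Record<string, IoValue>; nextOffset: number } {
  const attributes: Record<string, IoValue> = {};
  const eventIoId = buf.readUInt8(offset); offset += 1;
  offset += 1; // N total — redundant with the per-width counts below
  for (const width of [1, 2, 4, 8] as const) {
    const count = buf.readUInt8(offset); offset += 1;
    for (let i = 0; i < count; i++) {
      const ioId = buf.readUInt8(offset); offset += 1;
      const value: IoValue =
        width === 1 ? buf.readUInt8(offset) :
        width === 2 ? buf.readUInt16BE(offset) :
        width === 4 ? buf.readUInt32BE(offset) :
        buf.readBigUInt64BE(offset);
      attributes[String(ioId)] = value;
      offset += width;
    }
  }
  return { eventIoId, attributes, nextOffset: offset };
}
```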

**Do not decode signedness** for IO values. The spec is silent on per-IO signedness; downstream model-aware code in the Processor handles that. If a downstream interpretation needs a signed reading, it can compute `(unsigned > 0x7FFFFFFF) ? unsigned - 0x100000000 : unsigned` itself.

The `Event IO ID` value is captured into a separate attribute. Proposal: store it as `attributes['__event']` rather than as a typed sibling field on the `Position`, to keep the `Position` shape stable and avoid adding a Codec-8-specific field.

> **Open question:** is `__event` the right key? Alternatives: `'_event_io_id'`, or `'0'` (it is IO ID 0 in some interpretations, but that is a different "0"). Decide before merging task 1.5.

### Record loop

After parsing N1 (the record count from the framing layer), loop N1 times, producing one `Position` per record. Validate that the cursor ends exactly at the trailing N2 byte; a mismatch is a parser bug → throw a structured error that includes the offset.

## Acceptance criteria

- [ ] All three canonical doc examples (single record with all IO widths; single record with reduced IO; two records) parse to the expected `Position[]` byte-for-byte (verified via the fixture suite from task 1.9).
- [ ] CRC validation already happened upstream (task 1.4); this task does not re-check.
- [ ] The cursor-ends-at-N2 invariant holds for every fixture.
- [ ] `Position.timestamp` round-trips: `new Date(...).toISOString()` matches the doc example's `GMT: Monday, June 10, 2019, 10:04:46 AM` for the first fixture.
- [ ] All IO IDs from the fixture appear in `attributes` with correct numeric/bigint types.

## Risks / open questions

- The `Event IO ID` field semantics. Storing under `'__event'` keeps things flexible but adds a magic key. Discuss with the Processor implementer before settling.
- 8-byte values as `bigint` complicate JSON serialization. Task 1.8 (the publisher) must handle this — recommend serializing as a string with a sentinel, e.g. `"123n"` or `{ "_bigint": "123" }`. Keep the parser side clean (real `bigint`); push encoding to the publish boundary.

## Done

(Fill in once complete.)
@@ -0,0 +1,81 @@

# Task 1.6 — Codec 8 Extended parser

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.4, 1.5 (shared GPS Element / timestamp helpers), 1.9
**Wiki refs:** `docs/wiki/concepts/avl-data-format.md` § Codec 8 Extended, `docs/wiki/sources/teltonika-data-sending-protocols.md` § Codec 8 Extended

## Goal

Parse Codec 8 Extended (`0x8E`) AVL data bodies into `Position` records, including the **NX variable-length IO section** that does not exist in Codecs 8 or 16.

## Deliverables

- `src/adapters/teltonika/codec/data/codec8e.ts` exporting `codec8eHandler: CodecDataHandler` with `codec_id: 0x8E`.
- Test file `test/codec8e.test.ts` with the canonical doc example plus at least two synthetic fixtures: one with NX entries, one with mixed N1/N2/N4/N8/NX.

## Specification

### Differences from Codec 8

| Field | Codec 8 | Codec 8 Extended |
|-------|---------|------------------|
| Codec ID | `0x08` | `0x8E` |
| Event IO ID width | 1B | 2B |
| N total / N* counts | 1B | **2B** |
| IO ID width | 1B | 2B |
| Value widths | 1/2/4/8B | 1/2/4/8B (same) |
| Variable-length IO (NX) | — | **Yes** |

The fixed AVL fields (timestamp, priority, GPS element 15B) are identical to Codec 8.

### IO Element layout (Codec 8E)

```
[Event IO ID 2B]
[N total 2B]
[N1 2B] then N1 × ([IO ID 2B][Value 1B])
[N2 2B] then N2 × ([IO ID 2B][Value 2B])
[N4 2B] then N4 × ([IO ID 2B][Value 4B])
[N8 2B] then N8 × ([IO ID 2B][Value 8B])
[NX 2B] then NX × ([IO ID 2B][Length 2B][Value <Length> bytes]) ← unique to 8E
```

### NX section — the load-bearing complication

The NX section is the most error-prone part of Codec 8E. Each entry self-describes:

- 2 bytes IO ID.
- 2 bytes length (unsigned big-endian).
- `length` bytes of raw value.

Store NX values as **`Buffer`** (not number/bigint) — they may be ICCID-class data, BLE sensor payloads, or similar binary content. The Processor decodes them per model.

```ts
attributes[String(ioId)] = buf.subarray(offset, offset + length); // zero-copy view, not a copy
```

**Common bug:** off-by-one in the length field's endianness or width. Verify with a fixture that has at least one NX entry whose length value spans both bytes (i.e. 256+ bytes).

**Common bug 2:** mishandling an NX length of 0. It is permitted by the spec; treat it as a 0-byte Buffer.
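
A sketch of the NX walk under the rules above; zero-length entries fall out naturally from `subarray`, and the function name is illustrative:

```typescript
// Parse the Codec 8E NX (variable-length) section, writing values into `attributes`.
// Returns the cursor position after the last entry.
export function parseNxSection(
  buf: Buffer,
  offset: number,
  attributes: Record<string, Buffer>,
): number {
  const count = buf.readUInt16BE(offset); offset += 2;
  for (let i = 0; i < count; i++) {
    const ioId = buf.readUInt16BE(offset); offset += 2;
    const length = buf.readUInt16BE(offset); offset += 2; // full 16-bit length
    if (offset + length > buf.length) {
      throw new RangeError(`NX entry ${ioId} overruns body at offset ${offset}`);
    }
    attributes[String(ioId)] = buf.subarray(offset, offset + length); // length 0 → empty Buffer
    offset += length;
  }
  return offset;
}
```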

### Cursor invariant

Same as Codec 8: after parsing all N records, the cursor must sit exactly at the trailing record-count byte, which is the body's last byte. A mismatch is a parser bug; throw with offset details.

## Acceptance criteria

- [ ] The canonical doc example (one record with N1=1, N2=1, N4=1, N8=2, NX=0) parses correctly. Note: the doc's NX section count is `00 00`, so this fixture covers the "NX present but empty" path.
- [ ] At least one synthetic fixture has NX > 0 with mixed lengths (e.g. 1B, 8B, 64B values).
- [ ] At least one synthetic fixture has an NX entry with `length = 0`.
- [ ] At least one synthetic fixture has an NX entry whose `length` requires the full 16 bits (≥ 256B).
- [ ] All NX values land in `attributes` as `Buffer` instances; non-NX values land as `number` or `bigint` per width.

## Risks / open questions

- The maximum total record size remains 255 bytes per the spec. NX with large values can push against this — verify the per-record size guard.
- Memory pressure: storing many `Buffer` instances per record could add up. Use `Buffer.subarray` (a zero-copy view) rather than `Buffer.from(slice)` (a copy). Confirm that downstream consumers (the publisher, task 1.8) handle the view semantics correctly — they should be safe because the underlying frame buffer is held until publish completes.

## Done

(Fill in once complete.)
@@ -0,0 +1,84 @@

# Task 1.7 — Codec 16 parser

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.4, 1.5 (shared helpers), 1.9
**Wiki refs:** `docs/wiki/concepts/avl-data-format.md` § Codec 16, `docs/wiki/sources/teltonika-data-sending-protocols.md` § Codec 16

## Goal

Parse Codec 16 (`0x10`) AVL data bodies into `Position` records, including the per-record **Generation Type** byte.

## Deliverables

- `src/adapters/teltonika/codec/data/codec16.ts` exporting `codec16Handler: CodecDataHandler` with `codec_id: 0x10`.
- Test file `test/codec16.test.ts` with the canonical doc example (multi-record) plus at least one synthetic fixture covering each Generation Type value.

## Specification

### Differences from Codec 8 / Codec 8E

| Field | Codec 8 | Codec 16 | Codec 8E (for contrast) |
|-------|---------|----------|-------------------------|
| Codec ID | `0x08` | `0x10` | `0x8E` |
| Event IO ID width | 1B | **2B** | 2B |
| Generation Type | — | **1B** | — |
| N total / N* counts | 1B | **1B** | 2B |
| IO ID width | 1B | **2B** | 2B |
| Value widths | 1/2/4/8B | 1/2/4/8B | 1/2/4/8B |
| Variable-length IO (NX) | — | — | Yes |

Codec 16 is a "mixed" layout: 2-byte IO IDs (like 8E) but 1-byte counts (like 8), plus the new Generation Type field. This is the trap — implementers who copy from Codec 8E will get the count widths wrong; implementers who copy from Codec 8 will get the IO ID widths wrong. Read the spec carefully and write fixture-driven tests first.

### IO Element layout (Codec 16)

```
[Event IO ID 2B]
[Generation Type 1B] ← unique to Codec 16
[N total 1B]
[N1 1B] then N1 × ([IO ID 2B][Value 1B])
[N2 1B] then N2 × ([IO ID 2B][Value 2B])
[N4 1B] then N4 × ([IO ID 2B][Value 4B])
[N8 1B] then N8 × ([IO ID 2B][Value 8B])
```

No NX section.

### Generation Type

1-byte enum:

| Value | Meaning |
|-------|---------|
| 0 | On Exit |
| 1 | On Entrance |
| 2 | On Both |
| 3 | Reserved |
| 4 | Hysteresis |
| 5 | On Change |
| 6 | Eventual |
| 7 | Periodical |

Storage decision: **store it as `attributes['__generation_type']`** (consistent with the `__event` convention from task 1.5). Codec 8 and 8E omit this key entirely; downstream code can pattern-match on its presence.

> **Open question (carried from task 1.5):** if we promote Generation Type to a typed `Position` field, then `__event` should also become typed. Recommendation: keep both in `attributes` for Phase 1; revisit when Processor-side modeling firms up. Flagged in the [[position-record]] open questions.

### AVL ID range

Codec 16 (and 8E) supports IO IDs > 255. The parser treats this transparently — IO IDs are read as 2-byte unsigned values, so nothing prevents `ioId = 1234`. Just confirm no fixture has an off-by-one assumption that breaks for IDs > 255.
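
The mixed-width trap can be pinned down in code; a sketch with illustrative names, where the `__event` / `__generation_type` keys follow the storage decisions above:

```typescript
type IoValue = number | bigint;

// Codec 16 IO element: 2-byte event IO ID, 1-byte generation type,
// 1-byte counts (like Codec 8) but 2-byte IO IDs (like Codec 8E).
export function parseIoElementCodec16(
  buf: Buffer,
  offset: number,
): { attributes: Record<string, IoValue>; nextOffset: number } {
  const attributes: Record<string, IoValue> = {};
  attributes['__event'] = buf.readUInt16BE(offset); offset += 2;
  attributes['__generation_type'] = buf.readUInt8(offset); offset += 1;
  offset += 1; // N total
  for (const width of [1, 2, 4, 8] as const) {
    const count = buf.readUInt8(offset); offset += 1; // counts stay 1B
    for (let i = 0; i < count; i++) {
      const ioId = buf.readUInt16BE(offset); offset += 2; // IO IDs are 2B
      attributes[String(ioId)] =
        width === 1 ? buf.readUInt8(offset) :
        width === 2 ? buf.readUInt16BE(offset) :
        width === 4 ? buf.readUInt32BE(offset) :
        buf.readBigUInt64BE(offset);
      offset += width;
    }
  }
  return { attributes, nextOffset: offset };
}
```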

## Acceptance criteria

- [ ] The canonical doc example (two records, N1=2, N2=2, codec ID `0x10`, generation type `0x05`) parses correctly with both records' attributes populated.
- [ ] A synthetic fixture exists for each Generation Type 0–7 (eight fixtures total, or one fixture with eight records varying the field).
- [ ] At least one synthetic fixture has an IO ID > 255 to verify the 2-byte read.
- [ ] `attributes['__generation_type']` is set on every Codec 16 position and absent on Codec 8 / 8E positions.

## Risks / open questions

- The "mixed widths" trap is real. Pair-review (or have a second LLM agent review) the field-width table before declaring done. Fixture tests catch this if they are built carefully.
- Reserved value `3` for Generation Type: the spec says reserved. Decision: log a `debug` line if observed; do not reject. We do not police reserved values that don't break parsing.

## Done

(Fill in once complete.)
@@ -0,0 +1,114 @@

# Task 1.8 — Redis Streams publisher & main wiring

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.2, 1.3, 1.4, 1.5, 1.6, 1.7
**Wiki refs:** `docs/wiki/entities/redis-streams.md`, `docs/wiki/concepts/position-record.md`

## Goal

Implement the real `publishPosition` that writes `Position` records to a Redis Stream, then wire the entire Phase 1 pipeline together in `src/main.ts`.

## Deliverables

- `src/core/publish.ts` (replacing the stub from task 1.2):
  - `createPublisher(redis: Redis, config: Config, logger: Logger, metrics: Metrics): Publisher` factory.
  - `Publisher.publish(p: Position): Promise<void>` that serializes and `XADD`s.
  - Internal serialization helper `serializePosition(p: Position): Record<string, string>` returning the field-value pairs Redis expects.
- `src/main.ts` updated to:
  1. Load config (task 1.3).
  2. Build the logger and metrics (tasks 1.3, 1.10).
  3. Connect to Redis with retry-on-startup logic.
  4. Build the publisher.
  5. Build the Teltonika adapter and register codec handlers.
  6. Start the TCP server.
  7. Start the metrics HTTP server (task 1.10).
  8. Install graceful shutdown (task 1.12 finalizes; stub here).

## Specification

### Stream record shape

`XADD telemetry:teltonika MAXLEN ~ <maxlen> * <fields>`, where fields are flat key → string pairs (Redis Stream entries do not nest). Use a JSON-encoded `payload` field for simplicity:

```
1) ts        → ISO 8601 string (timestamp from the Position)
2) device_id → IMEI string
3) codec     → "8" | "8E" | "16" (the codec that produced this record — useful for downstream filtering)
4) payload   → JSON string of the full Position
```

Duplicating `device_id` and `ts` at the top level lets downstream tools filter without parsing the JSON; `payload` is the source of truth.
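
A sketch of the flat-field mapping; `PositionLike` and the plain `JSON.stringify` call are simplifications (the real serializer passes the bigint/Buffer-aware replacer from the next section):

```typescript
interface PositionLike {
  device_id: string;
  timestamp: Date;
  codec: '8' | '8E' | '16';
  [key: string]: unknown;
}

// Flatten a Position into the key → string pairs XADD expects.
export function serializePosition(p: PositionLike): Record<string, string> {
  return {
    ts: p.timestamp.toISOString(),
    device_id: p.device_id,
    codec: p.codec,
    payload: JSON.stringify(p), // real impl: pass the custom replacer here
  };
}

// With ioredis, the variadic call then looks like:
//   await redis.xadd('telemetry:teltonika', 'MAXLEN', '~', String(maxlen), '*',
//                    ...Object.entries(serializePosition(p)).flat());
```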

### JSON serialization

`Position.attributes` contains `number | bigint | Buffer`. Out of the box, `JSON.stringify` handles `number`, throws on `bigint`, and serializes `Buffer` via its built-in `toJSON` into a bulky `{ type: 'Buffer', data: [...] }` object. Implement a custom replacer. One subtlety: `JSON.stringify` applies `toJSON` *before* calling the replacer, so the replacer never sees a live `Buffer` (or `Date`) instance — it sees the `toJSON` output. Handle the Buffer case by matching that shape:

```ts
function replacer(_key: string, value: unknown): unknown {
  if (typeof value === 'bigint') return { __bigint: value.toString() };
  // Buffer.prototype.toJSON has already run, so Buffers arrive here
  // as { type: 'Buffer', data: number[] }, not as Buffer instances.
  if (
    typeof value === 'object' && value !== null &&
    (value as { type?: unknown }).type === 'Buffer' &&
    Array.isArray((value as { data?: unknown }).data)
  ) {
    return { __buffer_b64: Buffer.from((value as { data: number[] }).data).toString('base64') };
  }
  // Date.prototype.toJSON has likewise already produced an ISO-8601 string.
  return value;
}
```

The `__bigint` and `__buffer_b64` sentinels are decoded by the Processor (and any other consumer). Document this contract in the [[position-record]] page once landed.
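
A consumer-side decode sketch for the sentinel contract (the Processor's actual implementation may differ):

```typescript
// Revive the { __bigint } and { __buffer_b64 } sentinels produced by the publisher.
export function sentinelReviver(_key: string, value: unknown): unknown {
  if (typeof value === 'object' && value !== null) {
    const o = value as { __bigint?: unknown; __buffer_b64?: unknown };
    if (typeof o.__bigint === 'string') return BigInt(o.__bigint);
    if (typeof o.__buffer_b64 === 'string') return Buffer.from(o.__buffer_b64, 'base64');
  }
  return value;
}
```

`JSON.parse(payload, sentinelReviver)` restores real `bigint` and `Buffer` values inside `attributes`.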

### `XADD` options

- `MAXLEN ~ <REDIS_STREAM_MAXLEN>` — approximate trimming, much cheaper than exact.
- `*` for an auto-generated message ID.
- Use a single connection (no pooling — `ioredis` pipelines commands on one connection automatically).

### Backpressure / non-blocking property

The TCP handler `await`s `ctx.publish(p)`. Two strategies:

**Option A: direct `XADD` per record.** Simplest. Latency per publish is sub-millisecond on a healthy Redis. The risk: if Redis hangs, the TCP handler blocks → device sockets back up → Phase 1's "TCP handler never blocks" property is violated.

**Option B: bounded in-memory queue + worker drain.** A `Promise`-based bounded queue (e.g. `p-queue` or hand-rolled). `publish()` resolves once the record is enqueued; a worker drains via `XADD`. If the queue is full, the worker has fallen behind catastrophically — at that point we have to choose: drop oldest, drop newest, or throw. Recommendation: drop newest with a structured error log + metric, because the device will retransmit (we won't ACK).

**Decision: Option B.** Specification:

- Queue capacity: 10,000 records (configurable via `PUBLISH_QUEUE_CAPACITY`).
- On overflow: do **not** publish; throw a typed `PublishOverflowError`. The framing layer (task 1.4) catches this and skips the ACK so the device retransmits.
- Worker concurrency: 1 (Redis executes commands serially, so extra concurrency on one connection only adds overhead).
- Metrics: `teltonika_publish_queue_depth` gauge, `teltonika_publish_overflow_total` counter.

The worker uses `XADD` with a per-call timeout (e.g. 2 s) and exits the process on prolonged Redis unavailability; the orchestrator then restarts the process.
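
A hand-rolled sketch of the Option B queue; `drainOne` stands in for the timed `XADD` call, the capacity is illustrative, and `publish` is synchronous here for brevity (the real `Publisher.publish` returns a `Promise`):

```typescript
export class PublishOverflowError extends Error {
  constructor() {
    super('publish queue full');
    this.name = 'PublishOverflowError';
  }
}

// Bounded FIFO drained by a single worker; overflow throws so the
// framing layer can skip the ACK and let the device retransmit.
export class BoundedPublishQueue<T> {
  private queue: T[] = [];
  private draining = false;

  constructor(
    private readonly capacity: number,
    private readonly drainOne: (item: T) => Promise<void>,
  ) {}

  publish(item: T): void {
    if (this.queue.length >= this.capacity) throw new PublishOverflowError();
    this.queue.push(item);
    void this.drain();
  }

  private async drain(): Promise<void> {
    if (this.draining) return; // worker concurrency: 1
    this.draining = true;
    try {
      while (this.queue.length > 0) {
        await this.drainOne(this.queue[0]);
        this.queue.shift(); // remove only after a successful XADD
      }
    } finally {
      this.draining = false;
    }
  }

  get depth(): number { return this.queue.length; } // feeds the queue-depth gauge
}
```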

### `main.ts` skeleton

```ts
async function main() {
  const config = loadConfig();
  const logger = createLogger(config);
  const metrics = createMetrics();
  const redis = await connectRedis(config, logger);
  const publisher = createPublisher(redis, config, logger, metrics);
  const adapter = createTeltonikaAdapter({ publisher, logger, metrics });
  const server = startServer(config.TELTONIKA_PORT, adapter, { publish: publisher.publish, logger, metrics });
  const metricsServer = startMetricsServer(config.METRICS_PORT, metrics);
  installGracefulShutdown({ server, metricsServer, redis, publisher, logger });
  logger.info({ port: config.TELTONIKA_PORT }, 'tcp-ingestion ready');
}

main().catch((err) => { console.error(err); process.exit(1); });
```

One wiring hazard worth noting: passing `publisher.publish` as a bare function detaches it from `publisher`, so `publish` must be defined as an arrow-function property (or pre-bound) to keep `this` intact.

## Acceptance criteria

- [ ] Integration test: spin up a Redis (testcontainers), publish a known `Position`, `XREAD` it back, parse the JSON, and assert it equals the input (with `bigint` and `Buffer` round-tripped through the sentinel encoding).
- [ ] Overflow test: artificially block the worker, fill the queue, verify the next `publish()` rejects with `PublishOverflowError`, and verify the metrics increment.
- [ ] Startup test: with a wrong `REDIS_URL`, the process logs a clear error and exits non-zero.
- [ ] End-to-end test: open a TCP client to the running server, send the canonical Codec 8 fixture, verify a Position lands on the Stream and the ACK comes back as `00 00 00 01`.

## Risks / open questions

- `redis-mock` does not implement Streams. Use testcontainers + a real Redis for integration tests.
- The bounded queue could cause backpressure concerns — discuss with the Processor team whether they prefer the device-retransmit path (overflow throw) or a soft drop with logging. Defaulting to retransmit because it is the safer correctness choice.

## Done

(Fill in once complete.)
@@ -0,0 +1,156 @@

# Task 1.9 — Fixture suite & testing strategy

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.1
**Wiki refs:** `docs/wiki/sources/teltonika-ingestion-architecture.md` § 5.6, `docs/wiki/sources/teltonika-data-sending-protocols.md`

## Goal

Establish the fixture-based testing infrastructure and seed it with the canonical hex captures from the Teltonika documentation. **This is the only place where the parser's correctness is actually verified.** Bugs in binary protocol parsers are silent; tests are the defense.

## Deliverables

- `test/fixtures/teltonika/codec8/`, `test/fixtures/teltonika/codec8e/`, `test/fixtures/teltonika/codec16/` populated with at least:
  - 3 captures from the canonical Teltonika doc (one per codec, with full parsed expectations).
  - 1 synthetic edge case per codec (empty IO bag, max-size IO values, multi-record).
- Each fixture is a pair: `<name>.hex` (the raw frame, hex-encoded, whitespace-insensitive) and `<name>.expected.json` (the expected `Position[]` after parsing).
- `test/fixtures/_loader.ts` — helpers:
  - `loadFixture(path): { hex: Buffer; expected: Position[] }`
  - `compareToExpected(actual: Position[], expected: Position[]): void` (deep-equals with a `bigint`/`Buffer`-aware comparator).
- A vitest test pattern that automatically picks up every fixture pair in a directory and generates a test per pair, so adding a new fixture file = a new test, no boilerplate.
- `test/fixtures/teltonika/README.md` documenting the format and how to add new captures.

## Specification

### Fixture format

`fixture-name.hex`:
```
000000000000003608010000016B40D8EA30
01000000000000000000000000000000010
5021503010101425E0F01F10000601A014E
0000000000000000010000C7CF
```

Whitespace and newlines are ignored. The implementer strips `[^0-9a-fA-F]` and parses with `Buffer.from(hex, 'hex')`.
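
A loader sketch for the hex half of the pair; note that `Buffer.from(hex, 'hex')` silently drops a trailing odd nibble, so validating the length first is worth two extra lines:

```typescript
// Load a .hex fixture: strip everything that is not a hex digit, then decode.
export function parseHexFixture(text: string): Buffer {
  const hex = text.replace(/[^0-9a-fA-F]/g, '');
  // Buffer.from(..., 'hex') would silently ignore a trailing odd nibble; fail loudly instead.
  if (hex.length % 2 !== 0) {
    throw new Error(`fixture has an odd number of hex digits (${hex.length})`);
  }
  return Buffer.from(hex, 'hex');
}
```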

`fixture-name.expected.json`:
```json
{
  "positions": [
    {
      "device_id": "FIXTURE",
      "timestamp": "2019-06-10T10:04:46.000Z",
      "latitude": 0,
      "longitude": 0,
      "altitude": 0,
      "angle": 0,
      "speed": 0,
      "satellites": 0,
      "priority": 1,
      "attributes": {
        "21": 3,
        "1": 1,
        "66": 24079,
        "241": 24602,
        "78": "__bigint:0",
        "__event": 1
      }
    }
  ],
  "ack_record_count": 1
}
```

The `__bigint:` and `__buffer_b64:` prefixes are how the JSON file represents the special types. The loader decodes them into real `bigint` / `Buffer` instances before comparison.

`device_id` in fixtures is a placeholder (`"FIXTURE"`) because the captures don't include the IMEI — the codec parsers receive the IMEI from the framing layer's session context, not from the body itself.
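
The loader's sentinel decoding can be a small recursive walk; a sketch with illustrative names:

```typescript
// Decode the fixture JSON's string sentinels ("__bigint:..." / "__buffer_b64:...")
// into real bigint / Buffer values, recursing through arrays and objects.
export function decodeExpectedValue(v: unknown): unknown {
  if (typeof v === 'string') {
    if (v.startsWith('__bigint:')) return BigInt(v.slice('__bigint:'.length));
    if (v.startsWith('__buffer_b64:')) return Buffer.from(v.slice('__buffer_b64:'.length), 'base64');
    return v;
  }
  if (Array.isArray(v)) return v.map(decodeExpectedValue);
  if (typeof v === 'object' && v !== null) {
    return Object.fromEntries(
      Object.entries(v).map(([k, val]) => [k, decodeExpectedValue(val)]),
    );
  }
  return v;
}
```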

### Bootstrap fixtures (must be present at end of this task)

From the canonical Teltonika doc (`docs/raw/Teltonika Data Sending Protocols - Teltonika Telematics Wiki.md`):

#### Codec 8
- `01-single-record-all-widths.hex`: 1st example — one record with N1=2, N2=1, N4=1, N8=1.
- `02-single-record-reduced.hex`: 2nd example — one record with N1=2, N2=1, N4=0, N8=0.
- `03-two-records.hex`: 3rd example — two records with minimal IO.

#### Codec 8 Extended
- `01-canonical.hex`: doc example — one record, N1=1, N2=1, N4=1, N8=2, NX=0.

#### Codec 16
- `01-canonical.hex`: doc example — two records with Generation Type `0x05`.

### Synthetic fixtures (must be present)

#### Codec 8
- `04-empty-io-bag.hex`: one record, N=0 (no IO elements). The smallest valid record.
- `05-multi-record-large.hex`: 10 records, exercising the loop and the N1 == N2 invariant.

#### Codec 8 Extended
- `02-nx-mixed.hex`: one record with NX=3, lengths 1, 8, 64.
- `03-nx-zero-length.hex`: one record with one NX entry of length 0.
- `04-nx-large-length.hex`: one record with one NX entry of length 300+ (verifies the 16-bit length read).

#### Codec 16
- `02-each-generation-type.hex`: 8 records, one per Generation Type 0–7.
- `03-large-io-id.hex`: one record with an IO ID > 255 (e.g. `0x0400`).
### Test runner pattern
|
||||
|
||||
In `test/codec8.test.ts`, `codec8e.test.ts`, `codec16.test.ts`:
|
||||
|
||||
```ts
|
||||
import { describe, it, expect } from 'vitest';
|
||||
import { loadFixturesFromDir } from './fixtures/_loader';
|
||||
import { codec8Handler } from '../src/adapters/teltonika/codec/data/codec8';
|
||||
|
||||
describe('Codec 8 parser', () => {
|
||||
for (const fixture of loadFixturesFromDir('test/fixtures/teltonika/codec8')) {
|
||||
it(`parses ${fixture.name}`, async () => {
|
||||
const positions: Position[] = [];
|
||||
const ctx = makeTestCtx(positions);
|
||||
const result = await codec8Handler.handle(fixture.body, ctx);
|
||||
expect(positions).toEqual(fixture.expected.positions);
|
||||
expect(result.recordCount).toBe(fixture.expected.ack_record_count);
|
||||
});
|
||||
}
|
||||
});
|
||||
```
|
||||
|
||||
This pattern means **adding a new fixture file = a new test, automatically.** No editing the test file.
|
||||
|
||||
### CRC tests
|
||||
|
||||
A separate `test/crc.test.ts` covers `crc16Ibm` against:
|
||||
- The canonical doc CRCs (each fixture's CRC computed over the body should match the trailing CRC bytes).
|
||||
- A few hand-computed reference values (from online CRC-16/IBM calculators, recorded in the test).
|
||||
- An empty buffer (`crc16Ibm(Buffer.alloc(0))` should return `0x0000`).
|
||||
|
||||
### Frame tests
|
||||
|
||||
`test/frame.test.ts`:
|
||||
- IMEI handshake happy path.
|
||||
- IMEI handshake malformed length.
|
||||
- Frame envelope: bytes split across multiple `data` events.
|
||||
- Frame envelope: CRC mismatch path returns the right outcome (no ACK, connection stays open).
|
||||
- Frame envelope: unknown codec ID drops the connection.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- [ ] All bootstrap and synthetic fixtures listed above are present.
|
||||
- [ ] `pnpm test` runs all fixture tests and they pass.
|
||||
- [ ] `pnpm test --coverage` reports ≥ 90% line coverage for `src/adapters/teltonika/codec/`.
|
||||
- [ ] Adding a new fixture pair to a codec's fixtures directory automatically produces a new test (verified manually by adding a temp fixture).
|
||||
- [ ] The fixture README documents the format clearly enough that a new contributor can add a capture without reading the test code.
|
||||
|
||||
## Risks / open questions
|
||||
|
||||
- Where do real production captures come from? Until devices are streaming to a staging environment, we only have doc captures. Plan: record the first day of staging traffic into `tcpdump`-style captures, extract a few representative frames per device model, contribute them as fixtures with the model name in the filename. This step is a follow-up after staging deployment, not a Phase 1 blocker.
|
||||
- Hex format vs binary `.bin` files: hex is reviewable in PRs and documented in the doc. Stick with hex.
|
||||
- Confirming expected outputs: bootstrap fixtures' expected outputs come directly from the canonical doc's parsed tables. Synthetic fixture expectations are computed by hand and double-checked against the parser output once the parser is believed correct — circular if the parser is buggy. Mitigation: cross-check at least one synthetic fixture against an external Teltonika parser (e.g. the open-source [Traccar](https://github.com/traccar/traccar) project's Teltonika decoder) before declaring done.
|
||||
|
||||
## Done
|
||||
|
||||
(Fill in once complete.)
|
||||
@@ -0,0 +1,84 @@
# Task 1.10 — Observability (Prometheus metrics)

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.2, 1.3
**Wiki refs:** `docs/wiki/sources/teltonika-ingestion-architecture.md` § 7. Observability, `docs/wiki/sources/gps-tracking-architecture.md` § 7.4

## Goal

Expose Prometheus metrics over an HTTP endpoint so the platform's observability stack can scrape them. Metrics drive alerting (consumer lag, unknown codecs, CRC failures) and capacity planning (connection counts, frame rates).

## Deliverables

- `src/observability/metrics.ts`:
  - Exports `createMetrics(): Metrics` returning a typed wrapper around `prom-client` registries.
  - All metric definitions in one place, with explicit names/labels matching the wiki spec.
  - A `serializeMetrics(): Promise<string>` returning the standard Prom exposition format.
  - A `startMetricsServer(port, metrics): http.Server` that exposes `GET /metrics` and `GET /healthz`.
- Wiring updates: every place that should emit a metric (handshake outcome, frame outcome, publish queue depth, etc.) calls into the `Metrics` object.

## Specification

### Metric inventory (Phase 1)

Per `docs/wiki/sources/teltonika-ingestion-architecture.md` § 7:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `teltonika_connections_active` | gauge | — | Currently open device sessions. |
| `teltonika_handshake_total` | counter | `result=accepted\|rejected\|malformed`, `known=known\|unknown` | IMEI handshake outcomes. The `known` label distinguishes IMEIs that the configured `DeviceAuthority` recognizes from those it does not. With the default `AllowAllAuthority`, `known` is always `known`. |
| `teltonika_device_authority_failures_total` | counter | — | Times a `DeviceAuthority.check` call threw or timed out. Non-zero rate indicates the allow-list refresher (task 1.13) is unhealthy. |
| `teltonika_frames_total` | counter | `codec=8\|8E\|16\|unknown`, `result=ok\|crc_fail\|truncated\|n_mismatch` | Frame-level outcomes. |
| `teltonika_records_published_total` | counter | `codec` | AVL records emitted to Redis. |
| `teltonika_parse_duration_seconds` | histogram | `codec` | Per-frame parse time. Buckets: `[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]` (seconds). |
| `teltonika_unknown_codec_total` | counter | `codec_id` (string of the offending byte) | **Canary** for codec coverage drift. |

Phase 1 also adds publisher-related metrics from task 1.8:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `teltonika_publish_queue_depth` | gauge | — | Current bounded-queue depth. |
| `teltonika_publish_overflow_total` | counter | — | Records dropped because the queue was full. |
| `teltonika_publish_duration_seconds` | histogram | — | XADD latency. |

Plus shell-level:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `nodejs_*` | various | — | Default Node.js process metrics (collected via `prom-client`'s `collectDefaultMetrics()`). |
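The exposition format `serializeMetrics()` must produce is plain line-oriented text. As a concrete illustration (a hand-rolled sketch for one counter; in the real service `prom-client` generates this, and `renderCounter` itself is not part of the spec):

```typescript
// Sketch of the Prometheus text exposition format that serializeMetrics()
// ultimately returns. The metric name mirrors the inventory above.
type Sample = { labels: Record<string, string>; value: number };

function renderCounter(name: string, help: string, samples: Sample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const s of samples) {
    const labelStr = Object.entries(s.labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(',');
    lines.push(labelStr ? `${name}{${labelStr}} ${s.value}` : `${name} ${s.value}`);
  }
  return lines.join('\n') + '\n';
}

console.log(
  renderCounter('teltonika_frames_total', 'Frame-level outcomes.', [
    { labels: { codec: '8', result: 'ok' }, value: 42 },
  ]),
);
// # HELP teltonika_frames_total Frame-level outcomes.
// # TYPE teltonika_frames_total counter
// teltonika_frames_total{codec="8",result="ok"} 42
```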

### Naming convention

- `teltonika_*` for adapter-specific metrics.
- `nodejs_*` for runtime metrics (default).
- No service prefix — Prometheus scrape config adds the `service` and `instance` labels externally.

### Health and readiness

- `GET /healthz`: returns `200 OK` if the process is alive. (Liveness probe.)
- `GET /readyz`: returns `200 OK` if the Redis connection is healthy AND the TCP listener is bound. (Readiness probe.) Returns `503` otherwise.
- Both endpoints return a tiny JSON body `{ "status": "ok" }` for diagnostic value.

### HTTP server

Use Node's `node:http` directly — no Express/Fastify dependency for two endpoints. Keep it minimal, ~30 lines.
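A minimal sketch of that server, assuming the `serializeMetrics` deliverable and the readiness probes described above (the wiring is illustrative, not the final implementation):

```typescript
import http from 'node:http';

// Pure routing decision, kept separate from the server so it is trivially
// testable. /readyz reflects Redis health AND a bound TCP listener.
function routeStatus(url: string, redisHealthy: boolean, listenerBound: boolean): number {
  if (url === '/healthz') return 200;
  if (url === '/readyz') return redisHealthy && listenerBound ? 200 : 503;
  if (url === '/metrics') return 200;
  return 404;
}

function startMetricsServer(
  port: number,
  serializeMetrics: () => Promise<string>,
  probes: { redisHealthy: () => boolean; listenerBound: () => boolean },
): http.Server {
  const server = http.createServer(async (req, res) => {
    const status = routeStatus(req.url ?? '', probes.redisHealthy(), probes.listenerBound());
    if (req.url === '/metrics' && status === 200) {
      res.writeHead(200, { 'content-type': 'text/plain; version=0.0.4' });
      res.end(await serializeMetrics());
      return;
    }
    res.writeHead(status, { 'content-type': 'application/json' });
    res.end(JSON.stringify({ status: status === 200 ? 'ok' : 'unavailable' }));
  });
  return server.listen(port);
}
```

Keeping the status decision pure means the `/readyz` transitions in the acceptance criteria can be unit-tested without opening a socket.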

## Acceptance criteria

- [ ] `curl http://localhost:9090/metrics` returns valid Prometheus exposition format with every metric in the inventory present (some at zero).
- [ ] After processing the canonical Codec 8 fixture, `teltonika_records_published_total{codec="8"}` increments by 1 and `teltonika_frames_total{codec="8",result="ok"}` increments by 1.
- [ ] Sending a packet with an unknown codec ID increments `teltonika_unknown_codec_total{codec_id="..."}`.
- [ ] After a handshake from an IMEI the configured `DeviceAuthority` returns `'unknown'` for, `teltonika_handshake_total{result="accepted",known="unknown"}` increments by 1.
- [ ] `GET /readyz` returns `503` while Redis is unreachable, then `200` once it reconnects.
- [ ] Prom-client default metrics are exposed (Node version, GC, event loop lag).

## Risks / open questions

- Cardinality of `codec_id` label on `teltonika_unknown_codec_total`: bounded by 256 possible byte values. Acceptable.
- Cardinality of `device_id` (IMEI) in metrics: **avoid**. Per-device metrics belong in logs/traces, not Prometheus, because the cardinality is unbounded. Phase 1 does not add per-IMEI labels anywhere. (This is a watch-out for future tasks.)
- Histogram buckets for `teltonika_parse_duration_seconds`: tuned for sub-millisecond expected times. Adjust based on real production data after the first week.

## Done

(Fill in once complete.)
@@ -0,0 +1,175 @@
# Task 1.11 — Dockerfile & Gitea workflow

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.8 (so the service actually does something), 1.10 (metrics endpoint for healthcheck)
**Wiki refs:** `docs/wiki/sources/gps-tracking-architecture.md` § 7.3 Deployment topology

## Goal

Produce a multi-stage Docker image and a Gitea Actions workflow that builds and pushes the image to the project's Gitea Container Registry on every push to `main` and every tag.

## Deliverables

- `Dockerfile` — multi-stage build (deps → build → runtime).
- `.dockerignore` — already created in task 1.1; verify it excludes `.planning/`, `test/`, `dist/` (rebuilt in image).
- `.gitea/workflows/build.yml` — Gitea Actions workflow.
- `compose.yaml` (alongside Dockerfile) — example local stack with Redis for `pnpm docker:dev`. Useful for local testing of the full pipeline.
- Documentation updates in `README.md` covering: build, run locally, run via compose, CI behavior, image registry path.

## Specification

### Dockerfile

```dockerfile
# syntax=docker/dockerfile:1.7

# ---- deps stage: install with cache-friendly pnpm fetch ----
FROM node:22-alpine AS deps
WORKDIR /app
RUN corepack enable && corepack prepare pnpm@latest-9 --activate
COPY package.json pnpm-lock.yaml ./
RUN --mount=type=cache,id=pnpm-store,target=/root/.local/share/pnpm/store \
    pnpm fetch

# ---- build stage: compile TypeScript ----
FROM deps AS build
COPY . .
RUN --mount=type=cache,id=pnpm-store,target=/root/.local/share/pnpm/store \
    pnpm install --frozen-lockfile --offline
RUN pnpm build
RUN pnpm prune --prod

# ---- runtime: slim, non-root ----
FROM node:22-alpine AS runtime
WORKDIR /app
RUN addgroup -S app && adduser -S -G app app
COPY --from=build --chown=app:app /app/node_modules ./node_modules
COPY --from=build --chown=app:app /app/dist ./dist
COPY --from=build --chown=app:app /app/package.json ./package.json
USER app
EXPOSE 5027 9090
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD wget -qO- http://localhost:9090/readyz || exit 1
CMD ["node", "dist/main.js"]
```

Notes:
- `node:22-alpine` is small (~100MB final image). If musl-related issues arise (rare with pure JS), fall back to `node:22-slim`.
- BuildKit cache mounts (`--mount=type=cache`) speed up rebuilds significantly; the Gitea runner must support BuildKit (it does by default with modern docker).
- `pnpm prune --prod` strips dev dependencies before the runtime copy.
- Healthcheck hits `/readyz` so the container reports unhealthy if Redis is unreachable.

### Gitea workflow

`.gitea/workflows/build.yml`:

```yaml
name: build

on:
  push:
    branches: [main]
    tags: ['v*']
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    container: node:22-alpine
    services:
      redis:
        image: redis:7-alpine
    steps:
      - uses: actions/checkout@v4
      - run: corepack enable && corepack prepare pnpm@latest-9 --activate
      - run: pnpm install --frozen-lockfile
      - run: pnpm typecheck
      - run: pnpm lint
      - run: pnpm test --coverage
        env:
          REDIS_URL: redis://redis:6379

  build-and-push:
    needs: test
    if: gitea.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: git.dev.microservices.al
          username: ${{ gitea.actor }}
          password: ${{ secrets.GITEA_TOKEN }}
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: git.dev.microservices.al/trm/tcp-ingestion
          tags: |
            type=ref,event=branch
            type=sha,prefix=,format=short
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=raw,value=latest,enable={{is_default_branch}}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=registry,ref=git.dev.microservices.al/trm/tcp-ingestion:buildcache
          cache-to: type=registry,ref=git.dev.microservices.al/trm/tcp-ingestion:buildcache,mode=max
```

Tags produced:
- On push to `main`: `main`, `<short-sha>`, `latest`.
- On tag `v1.2.3`: `1.2.3`, `1.2`, `<short-sha>`, and possibly `latest` (whether `is_default_branch` evaluates true for a tag push depends on your Gitea Actions runner's semantics; verify and adjust if necessary).
- On PR: tests run, no push.

`GITEA_TOKEN` is provided by Gitea Actions automatically (similar to `GITHUB_TOKEN` in GitHub Actions). It must have package-write scope; configure once in repo settings if the default scope is read-only.

### compose.yaml (local dev)

```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ['6379:6379']
  ingestion:
    build: .
    depends_on: [redis]
    ports:
      - '5027:5027' # Teltonika TCP
      - '9090:9090' # metrics
    environment:
      NODE_ENV: production
      INSTANCE_ID: local-1
      REDIS_URL: redis://redis:6379
      LOG_LEVEL: debug
    restart: unless-stopped
```

### Deployment

Out of scope for this task: how the image is consumed in production (compose pull + restart? K8s? Watchtower?). Recommend a follow-up task once Phase 1 is functional, since the deployment substrate may not be fully decided yet. For now, the image is built and published; humans pull and run it manually.

## Acceptance criteria

- [ ] `docker build .` succeeds locally and produces an image under 200MB.
- [ ] `docker compose up` starts both Redis and the ingestion service; the service's `/healthz` and `/readyz` return 200.
- [ ] On push to `main`, the Gitea workflow runs tests, builds the image, and publishes it to the registry. The image is visible in the Gitea Packages UI.
- [ ] On a tag push, the image is also tagged with the version.
- [ ] On a PR, only the test job runs (no push).
- [ ] BuildKit cache reduces a rebuild-with-no-changes to under 30 seconds.

## Risks / open questions

- The exact Gitea Actions feature parity with GitHub Actions varies by runner version. If `docker/metadata-action@v5` doesn't work as expected, fall back to a hand-rolled tag generator using `git rev-parse --short HEAD`.
- `GITEA_TOKEN` permissions: confirm the default token can push to the registry. If not, switch to a dedicated `secrets.REGISTRY_TOKEN`.
- Architecture: build only `linux/amd64` for now. Multi-arch (`linux/arm64`) is a follow-up if anyone needs it for Apple Silicon dev.
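That hand-rolled fallback could be as simple as deriving tags from the ref the runner exposes (a sketch; the `v`-prefix convention matches this repo's `v*` tag filter, but the exact ref variable name on your runner is an assumption to verify):

```shell
#!/bin/sh
# Fallback tag derivation if docker/metadata-action misbehaves on the runner.
# $1: the push ref (e.g. refs/tags/v1.2.3 or refs/heads/main)
# $2: the short commit sha (git rev-parse --short HEAD)
derive_tags() {
  ref="$1"
  sha="$2"
  case "$ref" in
    refs/tags/v*)
      version="${ref#refs/tags/v}"    # v1.2.3 -> 1.2.3
      major_minor="${version%.*}"     # 1.2.3  -> 1.2
      echo "$version $major_minor $sha"
      ;;
    refs/heads/*)
      echo "${ref#refs/heads/} $sha"  # branch name + short sha
      ;;
  esac
}

derive_tags "refs/tags/v1.2.3" "abc1234"
# -> 1.2.3 1.2 abc1234
```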

## Done

(Fill in once complete.)
@@ -0,0 +1,150 @@
# Task 1.12 — Production hardening

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started
**Depends on:** 1.8, 1.10, 1.11
**Wiki refs:** `docs/wiki/concepts/failure-domains.md`

## Goal

Make the service safe for unattended production operation: graceful shutdown, robust error handling, structured logging discipline, sane defaults for resource limits, and operational documentation.

## Deliverables

- `src/core/lifecycle.ts` — `installGracefulShutdown({ ... })` that wires SIGTERM/SIGINT/SIGHUP to a coordinated shutdown.
- `src/core/errors.ts` — typed error classes (`HandshakeError`, `FrameError`, `PublishOverflowError`, `RedisUnavailableError`).
- Updates to `src/main.ts` to install error handlers and shutdown.
- `OPERATIONS.md` (or section in `README.md`) covering: env var reference, signals, log fields, metric meanings, common alert rules, troubleshooting.
- (Optional) `docs/runbook.md` for on-call: "what to do when X alert fires."

## Specification

### Graceful shutdown

On SIGTERM (deployment rolling update) or SIGINT (Ctrl-C):

1. **Stop accepting new connections.** `server.close()` — existing sockets continue.
2. **Drain the publish queue.** Stop accepting new `publish()` calls; wait for the worker to flush queued records to Redis (with a timeout, e.g. 10s).
3. **Send a final goodbye on each open socket.** Optional: skip the goodbye and just let the sockets close naturally (TCP FIN); devices will reconnect to a new instance.
4. **Close Redis connection.**
5. **Exit cleanly with code 0.**

If shutdown takes longer than `SHUTDOWN_TIMEOUT_MS` (default 30s), log and exit with code 1 — the orchestrator will SIGKILL anyway, but exiting deliberately gives a cleaner signal.

```ts
export function installGracefulShutdown(handles: ShutdownHandles) {
  let shuttingDown = false;
  const shutdown = async (signal: string) => {
    if (shuttingDown) return;
    shuttingDown = true;
    handles.logger.info({ signal }, 'shutdown: starting');
    const deadline = setTimeout(() => {
      handles.logger.error({}, 'shutdown: timed out, forcing exit');
      process.exit(1);
    }, handles.timeoutMs ?? 30_000);
    try {
      await new Promise<void>((res) => handles.server.close(() => res()));
      await handles.publisher.drain(10_000);
      await handles.redis.quit();
      handles.metricsServer.close();
      clearTimeout(deadline);
      handles.logger.info({}, 'shutdown: clean exit');
      process.exit(0);
    } catch (err) {
      handles.logger.error({ err }, 'shutdown: error during drain');
      clearTimeout(deadline);
      process.exit(1);
    }
  };
  process.on('SIGTERM', () => shutdown('SIGTERM'));
  process.on('SIGINT', () => shutdown('SIGINT'));
}
```

### Unhandled promise / uncaught exception

```ts
process.on('unhandledRejection', (reason) => {
  logger.fatal({ reason }, 'unhandledRejection');
  process.exit(1);
});
process.on('uncaughtException', (err) => {
  logger.fatal({ err }, 'uncaughtException');
  process.exit(1);
});
```

Crashing the process on either is the right move — the orchestrator restarts, devices reconnect, no harm done. The wrong move is to log and continue; that hides real bugs.

ESLint's `no-floating-promises` (added in task 1.1) is the first line of defense; these handlers are the safety net.

### Per-socket error handling

In the session loop:

- Errors from `BufferedReader` / `frame.ts` / codec parsers: log at `warn` with `imei`, drop the socket.
- Errors from `ctx.publish` (specifically `PublishOverflowError`): skip the ACK, continue reading. Device retransmits.
- Errors from `ctx.publish` (other, unexpected): log at `error`, drop the socket. Open question: should we crash the process? Recommendation: drop the socket only; let the publisher's own logic decide whether the underlying issue (e.g. Redis hang) warrants process exit.
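The dispatch above can be sketched as a pure classification step (the error classes come from this task's `src/core/errors.ts` deliverable; the `classifySessionError` helper itself is illustrative, not part of the spec):

```typescript
// Minimal stand-ins for the typed error classes deliverable.
class FrameError extends Error {}
class PublishOverflowError extends Error {}

type SessionAction = 'skip_ack' | 'drop_socket';

// Decide how the session loop reacts to an error, per the rules above:
// overflow -> withhold the ACK so the device retransmits; everything
// else (framing, parse, unexpected publish failures) -> drop the socket.
function classifySessionError(err: unknown): SessionAction {
  if (err instanceof PublishOverflowError) return 'skip_ack';
  return 'drop_socket';
}

console.assert(classifySessionError(new PublishOverflowError('queue full')) === 'skip_ack');
console.assert(classifySessionError(new FrameError('bad CRC')) === 'drop_socket');
```

Keeping the decision out of the socket handler makes the "skip ACK vs drop" rules directly unit-testable.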

### Resource limits

- **Max concurrent connections per instance:** soft cap via gauge alert (`teltonika_connections_active > 5000`). No hard cap in code — let the OS-level fd limit be the real ceiling.
- **Per-connection memory:** the `BufferedReader` buffer is bounded by `MAX_AVL_PACKET_SIZE` (~1.3KB) per session. With 5,000 connections, ~6.5MB of buffer state — fine.
- **Node heap:** set via `NODE_OPTIONS=--max-old-space-size=512` in the Dockerfile or compose. 512MB is plenty for this workload.

### Logging discipline (audit pass)

Before declaring this task done, walk through every `logger.*` call site and confirm:

- `info`: lifecycle events (startup, shutdown, server bound).
- `warn`: recoverable per-frame issues (CRC fail, malformed handshake), per-connection drops.
- `error`: per-publish failures, unexpected per-session errors.
- `fatal`: process-killing conditions (Redis unreachable for >X seconds, `unhandledRejection`).
- `debug`: per-frame parse details, per-record publish details.
- No `console.log` anywhere in production paths. If there are any, replace.

### OPERATIONS.md outline

```
# tcp-ingestion — Operations

## Configuration
[table of env vars from task 1.3]

## Signals
| Signal | Effect |
|--------|--------|
| SIGTERM | Graceful shutdown (drain publish queue, close connections, exit 0) |
| SIGINT | Same as SIGTERM |

## Metrics
[table of metrics from task 1.10]

## Alerts (recommended)
- `teltonika_unknown_codec_total > 0` for 5 min: investigate codec coverage drift.
- `teltonika_publish_overflow_total > 0` for 1 min: Redis or downstream backed up.
- `rate(teltonika_frames_total{result="crc_fail"}[5m]) / rate(teltonika_frames_total[5m]) > 0.01`: high CRC error rate, suspect device firmware or line quality.
- `teltonika_connections_active{instance=...} == 0` for 10 min while peer instances have traffic: instance is silently broken; investigate.

## Troubleshooting
- "Devices not connecting" → check TCP_PORT firewall, /readyz response, Redis connectivity.
- "Records not appearing in Redis" → check publish queue depth metric, then Redis connectivity.
- "High CRC failures from one IMEI" → likely a firmware bug or bad cellular link; coordinate with device fleet ops.
```

## Acceptance criteria

- [ ] SIGTERM during steady-state traffic results in a clean exit with no data loss (verified by killing the process and confirming the publish queue drained, no `PublishOverflowError` in the last second of logs).
- [ ] SIGTERM under publish-queue-overflow conditions still exits within `SHUTDOWN_TIMEOUT_MS`.
- [ ] An `unhandledRejection` (intentionally injected via test) logs at fatal and exits non-zero.
- [ ] OPERATIONS.md is populated and accurate; an on-call engineer could read it cold and find the answer to "what does this metric mean."
- [ ] All log calls audited; no `console.log` in production paths.

## Risks / open questions

- The "drain publish queue with timeout" balance: too long blocks deployments; too short loses records on shutdown. Default 10s is a reasonable starting point; tune after real production data.
- Crashing on `unhandledRejection` is opinionated. Some teams prefer to log and continue. We choose crash because the alternative hides bugs and we have a fast restart path. Document the choice.

## Done

(Fill in once complete.)
@@ -0,0 +1,153 @@
# Task 1.13 — Device authority (Redis allow-list refresher)

**Phase:** 1 — Inbound telemetry
**Status:** ⬜ Not started (deferrable — can ship after the rest of Phase 1)
**Depends on:** 1.4 (DeviceAuthority seam), 1.10 (metrics)
**Wiki refs:** `docs/wiki/concepts/plane-separation.md`, `docs/wiki/entities/directus.md`, `docs/wiki/entities/redis-streams.md`

## Goal

Provide a real `DeviceAuthority` implementation that classifies an IMEI as `known` or `unknown` by consulting an allow-list **published from Directus into Redis** and cached in-memory in each Ingestion instance. This is the operational link between the business plane (where the source-of-truth `devices` collection lives) and the telemetry plane (where Ingestion makes its handshake decisions).

## Non-goals

- Not a security boundary. Real device security is network-level + downstream filtering. This list is a **soft signal** for observability and (optionally) a hard reject under `STRICT_DEVICE_AUTH`.
- Not a real-time check. The list is cached locally with periodic refresh; new device provisioning takes effect within the refresh interval.

## Deliverables

- `src/adapters/teltonika/redis-allow-list-authority.ts`:
  - `RedisAllowListAuthority` implementing `DeviceAuthority`.
  - In-memory `Set<string>` of allowed IMEIs.
  - Refresh worker that pulls from Redis on a configurable cadence.
  - `start()` awaits an initial fetch (so the cache is warm before the TCP listener accepts) and then starts the periodic refresh.
  - `stop()` halts the refresh ticker.
- `src/main.ts` updated:
  - Read `DEVICE_AUTHORITY_MODE` env var (`allow_all` | `redis_allow_list`, default `allow_all`).
  - Construct the appropriate authority and pass it into the adapter context.
- Documentation in `OPERATIONS.md` (task 1.12) — section "Device authority" describing the env vars, refresh cadence, and Directus contract.

## Specification

### Redis contract

The Ingestion side reads from a single Redis key. Two viable shapes; pick one and stick with it.

**Option 1: Redis Set.** Simple, idiomatic for membership checks.

```
SADD devices:allowed <imei1> <imei2> ...
SMEMBERS devices:allowed          # what the refresher reads
SISMEMBER devices:allowed <imei>  # what an on-demand check would do (we do not use this; we cache)
```

**Option 2: Redis Hash with metadata per device.** Useful if downstream wants more than membership (e.g. device model, firmware version, owner).

```
HSET devices:allowed <imei> '{"model":"FMB920","fw":"03.27"}'
HGETALL devices:allowed
```

**Recommendation: Option 1 (Set).** Membership is the only signal Ingestion uses; metadata belongs in Directus where it's queryable. If a future task needs metadata in Ingestion, switch to Option 2.

### Directus → Redis sync (out of scope for this task)

This task implements the **Ingestion-side reader**. The Directus-side publisher is a separate piece of work in the Directus repo:

- A `devices` collection in Directus with at least `imei`, `active` fields.
- A Directus Flow or hook that, on `items.create | items.update | items.delete` of `devices`, updates the Redis Set:
  - Active inserted/updated → `SADD devices:allowed <imei>`.
  - Deleted or `active=false` → `SREM devices:allowed <imei>`.
- A periodic full-resync (e.g. nightly cron) that snapshots the collection into Redis to recover from any drift: `DEL devices:allowed && SADD devices:allowed <imei1> ... <imeiN>`.

Document this contract in the Ingestion repo's `OPERATIONS.md` so on-call understands the dependency, but the implementation lives in Directus.

### Refresh strategy

```ts
class RedisAllowListAuthority implements DeviceAuthority {
  private cache = new Set<string>();
  private timer?: NodeJS.Timeout;

  constructor(
    private redis: Redis,
    private key: string = 'devices:allowed',
    private intervalMs: number = 30_000,
    private logger: Logger,
    private metrics: Metrics,
  ) {}

  async start(): Promise<void> {
    await this.refresh(); // initial load completes before the TCP listener is up
    this.timer = setInterval(() => {
      this.refresh().catch((err) => this.logger.warn({ err }, 'allow-list refresh failed'));
    }, this.intervalMs);
  }

  stop(): void { if (this.timer) clearInterval(this.timer); }

  async check(imei: string): Promise<'known' | 'unknown'> {
    return this.cache.has(imei) ? 'known' : 'unknown';
  }

  private async refresh(): Promise<void> {
    const start = process.hrtime.bigint();
    const members = await this.redis.smembers(this.key);
    this.cache = new Set(members);
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    this.metrics.allowListRefresh.observe(ms / 1000);
    this.metrics.allowListSize.set(this.cache.size);
    this.logger.debug({ size: this.cache.size, took_ms: ms }, 'allow-list refreshed');
  }
}
```

### Failure modes

- **Redis unavailable at startup.** `start()` throws → process exits non-zero → orchestrator restarts. Loud failure, easy to alert. Operators may opt to fall back to `allow_all` via env var change.
- **Redis unavailable mid-flight.** `refresh` fails; the cache stays at last-known-good. `check` keeps working off the stale cache. Log warn; metric for refresh failures. Eventually the cache is "stale forever" if Redis never recovers — that's fine because telemetry is still flowing.
- **Empty allow-list.** A bug or misconfiguration in Directus could publish an empty Set. The Ingestion side will then mark every device as `unknown`. With `STRICT_DEVICE_AUTH=false` (default), this is a visibility problem (alert-worthy) but not a service outage. With `STRICT_DEVICE_AUTH=true`, the entire fleet would be rejected — bad. Add a safety: refuse to apply a refresh result of size 0 unless `ALLOW_EMPTY_ALLOW_LIST=true` is set explicitly. Log error; keep the previous cache.
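The empty-list safety can be isolated as a pure function (a sketch; the name `applyRefreshResult` is not part of the spec):

```typescript
// Decide whether a refresh result may replace the current cache.
// A non-empty cache is never replaced by an empty result unless explicitly
// allowed, implementing the ALLOW_EMPTY_ALLOW_LIST safety described above.
function applyRefreshResult(
  prev: Set<string>,
  members: string[],
  allowEmpty: boolean,
): { cache: Set<string>; rejected: boolean } {
  if (members.length === 0 && !allowEmpty && prev.size > 0) {
    // caller logs an error and bumps refresh_failures{reason="empty_rejected"}
    return { cache: prev, rejected: true };
  }
  return { cache: new Set(members), rejected: false };
}

const prev = new Set(['356307042441013']);
console.assert(applyRefreshResult(prev, [], false).cache === prev);  // previous cache kept
console.assert(applyRefreshResult(prev, [], true).cache.size === 0); // explicitly allowed
```

Note the `prev.size > 0` guard: an empty result at first startup (empty previous cache) is accepted, so a genuinely empty fleet does not wedge `start()`.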

### Configuration

Add to the env schema (task 1.3):

```ts
DEVICE_AUTHORITY_MODE: z.enum(['allow_all', 'redis_allow_list']).default('allow_all'),
DEVICE_ALLOW_LIST_KEY: z.string().default('devices:allowed'),
DEVICE_ALLOW_LIST_REFRESH_MS: z.coerce.number().int().min(1000).default(30_000),
STRICT_DEVICE_AUTH: z.coerce.boolean().default(false),
ALLOW_EMPTY_ALLOW_LIST: z.coerce.boolean().default(false),
```

### Metrics

Add to task 1.10's inventory:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `teltonika_allow_list_size` | gauge | — | Number of IMEIs in the local cache. Sudden drops are alert-worthy. |
| `teltonika_allow_list_refresh_duration_seconds` | histogram | — | Time to refresh from Redis. |
| `teltonika_allow_list_refresh_failures_total` | counter | `reason` | Refresh attempts that failed (network, empty-rejected, etc.). |

## Acceptance criteria

- [ ] With `DEVICE_AUTHORITY_MODE=allow_all`, behavior is identical to Phase 1 default — every IMEI is `known`.
- [ ] With `DEVICE_AUTHORITY_MODE=redis_allow_list` and a populated Redis Set, `check(imei)` returns `'known'` for members and `'unknown'` for non-members.
- [ ] Initial load happens before the TCP listener accepts connections.
- [ ] Refresh runs every `DEVICE_ALLOW_LIST_REFRESH_MS` and updates the cache.
- [ ] Empty allow-list refresh is rejected (cache preserved) unless `ALLOW_EMPTY_ALLOW_LIST=true`; metric increments with `reason=empty_rejected`.
- [ ] Mid-flight Redis outage does not crash the service; subsequent successful refresh restores the cache.
- [ ] `teltonika_allow_list_size` and `teltonika_allow_list_refresh_duration_seconds` appear in `/metrics`.
- [ ] `STRICT_DEVICE_AUTH=true` combined with `redis_allow_list` causes `0x00` rejection of unknown IMEIs (verified by integration test).

## Risks / open questions

- **Provisioning lag.** A newly added device waits up to `DEVICE_ALLOW_LIST_REFRESH_MS` before being recognized. Default 30s is fine for most ops; tune down to 5s if the team has a workflow where they provision and immediately expect the device to be `known`.
- **Cache size.** A Set of 100k IMEIs is ~6MB in memory — fine. At 1M+ devices, consider a Bloom filter + Redis fallback for misses, or split into shards. Not a near-term concern.
- **Drift between Directus and Redis.** Hooks-based sync can miss updates if Directus has an issue mid-write. The nightly full-resync cron mitigates. Discussed in the Directus-side task (out of repo scope here).
- **Should `STRICT_DEVICE_AUTH` be observable?** Yes — log at info on startup which mode the authority is in, so operators can verify config without reading env vars.

## Done

(Fill in once complete.)
@@ -0,0 +1,102 @@
|
||||
# Phase 1 — Inbound telemetry
|
||||
|
||||
Implement a Node.js TCP server that ingests Teltonika telemetry over codecs 8, 8E, and 16; publishes normalized `Position` records to a Redis Stream; and ships with the operational baseline (Prometheus metrics, fixture-based tests, Dockerfile, Gitea CI/CD pipeline).
|
||||
|
||||
## Outcome statement
|
||||
|
||||
When Phase 1 is done:
|
||||
|
||||
- Devices in the deployed FMB/FMC/FMM/FMU fleet connect to a known TCP port, complete the IMEI handshake, and stream AVL frames.
|
||||
- Every well-formed AVL record produces exactly one `Position` JSON entry on the `telemetry:teltonika` Redis Stream, with all GPS fields and IO element bag intact.
|
||||
- CRC-mismatched frames are dropped (no ACK) so devices retransmit.
|
||||
- Unknown-codec frames cause the connection to close with a structured `WARN` log entry; a Prometheus counter increments.
|
||||
- **Device authority is observable but permissive by default** — every handshake is labeled `known` or `unknown` based on a configurable `DeviceAuthority`; the Phase 1 default `AllowAllAuthority` accepts everything, and an opt-in `RedisAllowListAuthority` (task 1.13) reads a Directus-published allow-list from Redis. Strict reject-on-unknown is gated behind a `STRICT_DEVICE_AUTH` flag.
|
||||
- The service builds reproducibly via a Gitea Actions workflow, publishing a Docker image to the project's Gitea Container Registry, tagged by branch + git SHA.
|
||||
- Tests cover every codec parser using hex captures sourced from the canonical Teltonika doc, with at least one synthetic edge-case fixture per codec.
|
||||
|
||||
## Sequencing
|
||||
|
||||
```
|
||||
1.1 Project scaffold
|
||||
├─→ 1.2 Core shell & framing types
|
||||
│ ├─→ 1.3 Configuration & logging
|
||||
│ ├─→ 1.4 Teltonika framing layer (incl. DeviceAuthority seam)
|
||||
│ │ ├─→ 1.5 Codec 8 parser
|
||||
│ │ ├─→ 1.6 Codec 8 Extended parser
|
||||
│ │ └─→ 1.7 Codec 16 parser
|
||||
│ └─→ 1.8 Redis publisher & main wiring
|
||||
│ └─→ 1.10 Observability
|
||||
│ ├─→ 1.11 Dockerfile & CI
|
||||
│ │ └─→ 1.12 Production hardening
|
||||
│ └─→ 1.13 Device authority (opt-in, deferrable)
|
||||
└─→ 1.9 Fixture suite (cross-cutting; established alongside 1.5)
|
||||
```
|
||||
|
||||
Tasks 1.5, 1.6, 1.7 can be done in parallel after 1.4 lands. Task 1.9 (fixture infrastructure) should land *with or before* 1.5 — it's the framework the codec tasks add to. Task 1.13 is the only Phase 1 task that can ship *after* the rest of Phase 1 is in production — `AllowAllAuthority` is functional from day one; the Redis allow-list lights up once the Directus-side publisher exists.
|
||||
|
||||
## Files modified
|
||||
|
||||
Phase 1 produces this layout in `tcp-ingestion/`:
|
||||
|
||||
```
tcp-ingestion/
├── .gitea/workflows/build.yml
├── src/
│   ├── core/
│   │   ├── types.ts
│   │   ├── publish.ts
│   │   ├── registry.ts
│   │   ├── session.ts
│   │   └── server.ts
│   ├── adapters/
│   │   └── teltonika/
│   │       ├── index.ts
│   │       ├── handshake.ts
│   │       ├── frame.ts
│   │       ├── crc.ts
│   │       ├── device-authority.ts             (interface + AllowAllAuthority)
│   │       ├── redis-allow-list-authority.ts   (task 1.13, opt-in)
│   │       └── codec/
│   │           ├── data/
│   │           │   ├── codec8.ts
│   │           │   ├── codec8e.ts
│   │           │   └── codec16.ts
│   │           └── command/                    (empty in Phase 1)
│   ├── config/load.ts
│   ├── observability/
│   │   ├── logger.ts
│   │   └── metrics.ts
│   └── main.ts
├── test/
│   ├── fixtures/teltonika/
│   │   ├── codec8/
│   │   ├── codec8e/
│   │   └── codec16/
│   ├── codec8.test.ts
│   ├── codec8e.test.ts
│   ├── codec16.test.ts
│   ├── crc.test.ts
│   └── frame.test.ts
├── Dockerfile
├── package.json
├── pnpm-lock.yaml
├── tsconfig.json
├── .dockerignore
├── .gitignore
├── .prettierrc
├── eslint.config.js
└── README.md
```

## Tech stack (decided)

- **Node.js 22 LTS**, ESM-only.
- **TypeScript 5.x** with `strict: true`.
- **pnpm** for dependency management (deterministic, fast, easy to add workspaces later if needed).
- **vitest** for tests.
- **pino** for structured logging.
- **prom-client** for Prometheus metrics.
- **ioredis** for Redis Streams.
- **zod** for environment-variable validation.

If an implementer wants to deviate, they must update the relevant task file first.

@@ -0,0 +1,118 @@

# Task 2.1 — Connection registry & heartbeat

**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** 2.1
**Wiki refs:** `docs/wiki/concepts/phase-2-commands.md` § 9.3

## Goal

Maintain a Redis-backed registry mapping device IMEI → Ingestion instance ID, so Directus can route outbound commands to the instance currently holding the device's TCP socket.

## Deliverables

- `src/core/connection-registry.ts`:
  - `ConnectionRegistry` class with methods `register(imei)`, `unregister(imei)`, `unregisterAll()`, `heartbeat()`.
  - Internal state: `Set<string>` of held IMEIs for graceful-shutdown bulk cleanup.
- Hook into the Teltonika session lifecycle (in `src/adapters/teltonika/index.ts`):
  - After the IMEI handshake succeeds: `registry.register(imei)`.
  - On socket close (any cause): `registry.unregister(imei)`.
- Heartbeat ticker started in `src/main.ts`, runs every 30 seconds.
- Graceful shutdown calls `registry.unregisterAll()` (task 1.12 hook updated to include this).

## Specification

### Redis layout

- **Hash** `connections:registry`: field = `imei`, value = `instance_id`. A single hash shared by all instances. Redis hashes don't support per-field TTL — that's why the separate heartbeat key exists.
- **Key** `instance:heartbeat:{instance_id}`: written every 30 s with `EX 90`. Its existence proves the instance is alive.
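
For the consumer side, the two keys compose into one lookup. A sketch of how the Directus-side router might resolve a live instance — the `resolveInstance` name and the minimal `KV` interface are illustrative, not part of this task's deliverables:

```typescript
// Minimal client surface the sketch needs (ioredis satisfies this shape).
interface KV {
  hget(key: string, field: string): Promise<string | null>;
  exists(key: string): Promise<number>;
}

// Resolve the live instance holding a device's socket, or null when the
// entry is missing or its owner's heartbeat has expired (janitor lag).
async function resolveInstance(redis: KV, imei: string): Promise<string | null> {
  const instanceId = await redis.hget('connections:registry', imei);
  if (instanceId === null) return null; // device not connected anywhere
  const alive = await redis.exists(`instance:heartbeat:${instanceId}`);
  return alive ? instanceId : null; // stale entry: owner died, janitor hasn't run yet
}
```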

### Operations

```ts
import { Redis } from 'ioredis';

class ConnectionRegistry {
  private held = new Set<string>();
  constructor(private redis: Redis, private instanceId: string) {}

  async register(imei: string): Promise<void> {
    await this.redis.hset('connections:registry', imei, this.instanceId);
    this.held.add(imei);
  }

  async unregister(imei: string): Promise<void> {
    // Only delete if the entry still points at us.
    // (Race: a device reconnected to a different instance between
    // our session ending and this delete.)
    const current = await this.redis.hget('connections:registry', imei);
    if (current === this.instanceId) {
      await this.redis.hdel('connections:registry', imei);
    }
    this.held.delete(imei);
  }

  async unregisterAll(): Promise<void> {
    if (this.held.size === 0) return;
    const pipeline = this.redis.pipeline();
    for (const imei of this.held) {
      pipeline.hdel('connections:registry', imei);
    }
    await pipeline.exec();
    this.held.clear();
  }

  async heartbeat(): Promise<void> {
    await this.redis.set(
      `instance:heartbeat:${this.instanceId}`,
      Date.now().toString(),
      'EX',
      90,
    );
  }
}
```

### Heartbeat ticker

In `main.ts`:

```ts
const heartbeatInterval = setInterval(() => {
  registry.heartbeat().catch((err) => logger.error({ err }, 'heartbeat failed'));
}, 30_000);
// ensure cleared on shutdown
```
Run an initial `heartbeat()` immediately at startup so the instance is "alive" before the first 30s tick.
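
A sketch of that startup wiring — the `startHeartbeat` helper is an illustrative name, not a deliverable:

```typescript
// Fire one heartbeat immediately (alive before the first tick), then every
// `intervalMs`; return a stop function for the graceful-shutdown path.
function startHeartbeat(
  beat: () => Promise<void>,
  onError: (err: unknown) => void,
  intervalMs = 30_000,
): () => void {
  beat().catch(onError); // initial beat at startup
  const timer = setInterval(() => beat().catch(onError), intervalMs);
  return () => clearInterval(timer);
}
```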
### Race conditions to handle

- **Same IMEI on two instances at once.** Possible when a device reconnects faster than we can detect the close. The new instance's `register` overwrites the old entry; the old instance's `unregister` checks `if (current === this.instanceId)` and skips the delete when it no longer owns the entry.
- **Heartbeat key expires while the instance is alive.** A network glitch caused a write to fail. The janitor (task 2.2) will clear the registry entries; devices reconnect and new entries get written. Acceptable — temporary loss of routability for affected devices, recoverable in seconds.
- **Hash entry without a heartbeat.** The instance died without graceful cleanup. The janitor handles this.

### Phase 1 impact

Phase 1 code in `src/adapters/teltonika/index.ts` needs three hook points:

1. After successful handshake.
2. On `socket.on('close')`.
3. On graceful shutdown (already wired in task 1.12).

These are additive — no Phase 1 logic changes, only new calls to the registry.

## Acceptance criteria

- [ ] After a device handshake completes, `HGET connections:registry <imei>` returns the local instance ID.
- [ ] After the socket closes, `HGET connections:registry <imei>` returns nil.
- [ ] If two simulated instances "race" on the same IMEI, the registry ends up pointing at whichever instance most recently registered, and the loser's `unregister` does not delete the winner's entry.
- [ ] The heartbeat key has `EX 90` and is refreshed every 30 s.
- [ ] On SIGTERM, all held IMEIs are unregistered before exit.
- [ ] Registry operations are non-blocking on the TCP read path — register/unregister `await` inside session lifecycle callbacks, not on the per-frame hot path.

## Risks / open questions

- What if Redis is unavailable at registration time? Options: (A) fail the handshake, (B) accept the device but log + alert. **Recommendation: B.** Phase 1's "telemetry continues even if the business plane is degraded" property must be preserved; command routing is a Phase 2 nice-to-have. Track via `teltonika_registry_failures_total`.
- Heartbeat write failures: log at warn, retry on next tick. Don't crash.

## Done

(Fill in once complete.)
@@ -0,0 +1,83 @@

# Task 2.2 — Registry janitor

**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** 2.1
**Wiki refs:** `docs/wiki/concepts/phase-2-commands.md` § 9.3

## Goal

Periodically clear stale entries from `connections:registry` whose owning instance has died (heartbeat expired) without graceful cleanup.

## Deliverables

- `src/core/janitor.ts` — `Janitor` class with a `run()` method that performs one cleanup pass.
- A choice: run the janitor in-process (every Ingestion instance runs it, with leader election or with idempotent cleanup) or as a separate small process. **Recommendation: in-process, idempotent.** Simpler ops, no leader election; the cost is N instances each doing the work, but a registry pass is O(N_devices) and fast.
- Wired into `src/main.ts` as a 60-second ticker.
- Metric: `teltonika_registry_janitor_evicted_total{instance_id=...}` counter.

## Specification

### Algorithm (per pass)

```
1. entries = HGETALL connections:registry
2. unique_instance_ids = unique values from entries
3. For each instance_id in unique_instance_ids:
     alive = EXISTS instance:heartbeat:{instance_id}
     If !alive:
       For each (imei, owner) in entries where owner == instance_id:
         HDEL connections:registry imei
         metrics.evicted.inc({ instance_id })
```

Use `HSCAN` instead of `HGETALL` if the registry is large (>10k entries) to avoid blocking Redis. For Phase 2's expected scale, `HGETALL` is fine.
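
A sketch of the `HSCAN` variant, against a minimal `hscan`-shaped function (cursor in, `[nextCursor, flatFields]` out — the shape ioredis returns):

```typescript
// Iterate the registry hash in pages so a large hash never blocks Redis.
// Redis signals scan completion by returning cursor "0".
async function scanRegistry(
  hscan: (key: string, cursor: string) => Promise<[string, string[]]>,
): Promise<Map<string, string>> {
  const entries = new Map<string, string>();
  let cursor = '0';
  do {
    const [next, flat] = await hscan('connections:registry', cursor);
    // flat is [field, value, field, value, ...]
    for (let i = 0; i < flat.length; i += 2) entries.set(flat[i], flat[i + 1]);
    cursor = next;
  } while (cursor !== '0');
  return entries;
}
```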

### Idempotence

Multiple instances running the janitor in parallel may both attempt to delete the same stale entry. `HDEL` is idempotent — the second call returns 0 and is harmless. Just ensure logging doesn't double-count: only log on actual deletes (HDEL > 0 result).

### Race with re-registration

Sequence to consider:

1. Instance A dies; its heartbeat expires.
2. The janitor on Instance B starts a pass: it sees A's entries and that A's heartbeat is gone.
3. A device that was on A reconnects to Instance C.
4. Instance C calls `HSET connections:registry <imei> C`.
5. The janitor on B, mid-pass, calls `HDEL connections:registry <imei>`.

Result: the device's entry is deleted moments after C registered it. Device routing is broken until the next reconnect or registration.

**Mitigation:** the janitor must check the entry value at delete time, not just at scan time:

```ts
for (const imei of evictTargets) {
  // Re-read the value; only delete if still pointing at the dead instance.
  const current = await redis.hget('connections:registry', imei);
  if (current === deadInstanceId) {
    await redis.hdel('connections:registry', imei);
  }
}
```

This is "check-and-delete" — not atomic but the window is small. For full atomicity, use a Lua script. **Recommendation: ship the non-atomic version first; upgrade to Lua if the race causes operational issues.**

### Pace

Run every 60 seconds (configurable via `JANITOR_INTERVAL_MS`). One pass costs at most one `HGETALL` + N `EXISTS` + (rarely) M `HDEL`. Negligible Redis load.

## Acceptance criteria

- [ ] Killing an Ingestion instance without graceful shutdown: within ~2 minutes (heartbeat TTL of 90 s + one janitor pass), all of that instance's registry entries are gone.
- [ ] If the dying instance restarts and re-registers a device before the janitor evicts it, the new (live) entry is preserved (verified by the check-and-delete logic).
- [ ] Two janitors running in parallel: total deletes are correct, no double-counting in metrics.
- [ ] `teltonika_registry_janitor_evicted_total` increments by the right amount per pass.

## Risks / open questions

- The check-and-delete race window: small but real. If observed operationally, upgrade to a Lua script. Document the trade-off in `OPERATIONS.md`.
- Should the janitor be a separate process? Pros: cleaner separation; can be sized differently. Cons: another deployable, another monitoring target. **Defer to operational feedback.**

## Done

(Fill in once complete.)
@@ -0,0 +1,112 @@

# Task 2.3 — Per-socket write queue & outstanding-command tracker

**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** Phase 1 complete (specifically the session loop in 1.4)
**Wiki refs:** `docs/wiki/concepts/phase-2-commands.md` § 9.6, § 9.8

## Goal

Provide a per-socket serialization layer so that:

1. Outbound command frames do not interleave with codec ACK writes (which would corrupt the byte stream).
2. Only **one command is outstanding per socket at a time** (Teltonika's command codecs assume serial dispatch — there's no correlation ID in the protocol).

## Deliverables

- `src/core/write-queue.ts`:
  - `SocketWriteQueue` class wrapping a `net.Socket` with an internal queue.
  - Methods: `writeAck(buf: Buffer): Promise<void>`, `writeCommand(commandId: string, buf: Buffer, timeoutMs?: number): Promise<Buffer>`.
  - Per-socket state: `outstandingCommand: PendingCommand | null` with `commandId`, `timeout`, `resolve`, `reject` functions.
  - `awaitResponse(commandId, timeoutMs): Promise<Buffer>` — registers the in-flight command and waits for a response delivered via a separate `notifyResponse(buf)` method.
- Update the `src/adapters/teltonika/index.ts` session struct to hold a `SocketWriteQueue` per session.
- Update Phase 1's framing layer (task 1.4 deliverable) to write ACKs through `queue.writeAck` instead of directly to the socket.

## Specification

### Why ACKs go through the queue too

Phase 1 wrote ACKs directly to the socket. Phase 2 must serialize ACKs with command writes, otherwise:

```
Time T+0: codec parser writes ACK = [00 00 00 01]
Time T+0: command consumer writes Codec 12 frame
```

Without serialization, the bytes interleave at the socket level, producing garbage on the wire. The fix is mandatory — every socket write goes through one queue.

### Queue semantics

```ts
import net from 'node:net';

class SocketWriteQueue {
  private chain: Promise<void> = Promise.resolve();
  private outstanding: PendingCommand | null = null;

  constructor(private socket: net.Socket) {}

  async writeAck(buf: Buffer): Promise<void> {
    const next = this.chain.then(() => this.writeRaw(buf));
    // Keep the chain alive even if this write fails, so one socket error
    // doesn't permanently poison every later write on this queue.
    this.chain = next.catch(() => {});
    return next;
  }

  async writeCommand(commandId: string, buf: Buffer, timeoutMs = 30_000): Promise<Buffer> {
    if (this.outstanding) {
      // Wait for the previous command to resolve/reject before queueing this one.
      try { await this.outstanding.promise; } catch { /* prior command failed; we still proceed */ }
    }
    const pending: PendingCommand = makePending(commandId, timeoutMs);
    this.outstanding = pending;
    const next = this.chain.then(() => this.writeRaw(buf));
    this.chain = next.catch(() => {});
    await next; // bytes are on the wire
    return pending.promise; // resolves when notifyResponse is called, rejects on timeout
  }

  notifyResponse(buf: Buffer): void {
    if (!this.outstanding) {
      // Unsolicited response. Log warn and ignore.
      return;
    }
    this.outstanding.resolveWith(buf);
    this.outstanding = null;
  }

  private writeRaw(buf: Buffer): Promise<void> {
    return new Promise((resolve, reject) => {
      this.socket.write(buf, (err) => (err ? reject(err) : resolve()));
    });
  }
}
```

`PendingCommand` exposes a promise that resolves when `resolveWith` is called and rejects when its `setTimeout` fires.
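
One plausible shape for `makePending` — a sketch; field names beyond the deliverables list are illustrative:

```typescript
// A deferred promise plus a timeout that rejects it. resolveWith/rejectWith
// both clear the timer so a late timeout can't fire after settlement.
interface PendingCommand {
  commandId: string;
  promise: Promise<Buffer>;
  resolveWith: (buf: Buffer) => void;
  rejectWith: (err: Error) => void;
}

function makePending(commandId: string, timeoutMs: number): PendingCommand {
  let resolveWith!: (buf: Buffer) => void;
  let rejectWith!: (err: Error) => void;
  const promise = new Promise<Buffer>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`command ${commandId} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    resolveWith = (buf) => { clearTimeout(timer); resolve(buf); };
    rejectWith = (err) => { clearTimeout(timer); reject(err); };
  });
  return { commandId, promise, resolveWith, rejectWith };
}
```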

### Backpressure on queued commands

A device with many queued commands could grow the queue unboundedly. Cap per-socket queue depth:

- Soft: log a warning at 5 queued commands.
- Hard: reject `writeCommand` with `WriteQueueFullError` at 20 queued commands. The command consumer publishes a failure to `commands:responses`.

### Timeout default

30 seconds per command, overridable via `commandTimeoutMs` on the `commands` row. (The Phase 2 design also has `expires_at`; that is a clock-time deadline at the Directus level, whereas the per-write timeout is the protocol-level "device didn't respond" deadline.)

When the timeout fires, the queue rejects the outstanding promise (`CommandTimeoutError`). The next queued command becomes the outstanding one and proceeds.

## Acceptance criteria

- [ ] Two concurrent calls to `writeAck(buf1)` and `writeCommand(id, buf2)` produce bytes on the wire in submission order, no interleaving (verified with a TCP-level recording test).
- [ ] `writeCommand` blocks subsequent `writeCommand` calls until the first resolves or times out.
- [ ] `notifyResponse` correctly resolves the outstanding command's promise.
- [ ] A timeout firing rejects the outstanding promise; the next queued command starts.
- [ ] Queue depth is exposed as `teltonika_command_queue_depth_total` (gauge summed across sockets) — per-IMEI labels are forbidden by task 1.10's cardinality rule, so per-IMEI detail goes in warn logs instead.
- [ ] On socket close, all pending command promises reject with `SocketClosedError`.

## Risks / open questions

- The "outstanding command" model assumes the device responds to commands in order, which Teltonika's protocol guarantees (one outstanding per socket). If we discover devices that don't, we'd need correlation IDs — but the protocol doesn't carry them, so the only safe fallback is a queue depth limit of zero (never queue a second command; fail fast).
- ACK write order vs response delivery: when a device sends an AVL frame while we're mid-command, the AVL frame's ACK queues behind the command bytes. Worst case: the device receives the ACK for the AVL frame slightly later. Acceptable.

## Done

(Fill in once complete.)
@@ -0,0 +1,141 @@

# Task 2.4 — Command consumer (Redis stream reader)

**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** 2.1, 2.3
**Wiki refs:** `docs/wiki/concepts/phase-2-commands.md` § 9.6

## Goal

Each Ingestion instance runs a worker that consumes commands from `commands:outbound:{instance_id}`, looks up the local socket for the target IMEI, and dispatches the command to the appropriate codec encoder + write queue.

## Deliverables

- `src/adapters/teltonika/command-consumer.ts`:
  - `CommandConsumer` class with `start()` and `stop()` methods.
  - Internal: a registry of `imei → SocketWriteQueue` for sessions held by this instance.
  - Methods exposed to the session lifecycle: `attach(imei, queue)`, `detach(imei)`.
  - Reads commands via `XREADGROUP GROUP ingest {instance_id} COUNT 16 BLOCK 1000 STREAMS commands:outbound:{instance_id} >`.
  - Calls the codec-specific encoder/handler based on the command's `codec` field.
  - On a terminal outcome (delivered, responded, failed), publishes to `commands:responses`.
- `src/adapters/teltonika/responses.ts`:
  - `publishResponse({ commandId, status, response?, failureReason? })` writes to `commands:responses` via `XADD`.

## Specification

### Stream consumption

```ts
async start(): Promise<void> {
  // Ensure the consumer group exists. MKSTREAM creates the stream if absent.
  try {
    await this.redis.xgroup('CREATE', this.streamKey, 'ingest', '$', 'MKSTREAM');
  } catch (err: any) {
    if (!err.message?.includes('BUSYGROUP')) throw err;
  }

  while (!this.stopping) {
    const messages = await this.redis.xreadgroup(
      'GROUP', 'ingest', this.instanceId,
      'COUNT', 16, 'BLOCK', 1000,
      'STREAMS', this.streamKey, '>',
    );
    if (!messages) continue;
    for (const [, entries] of messages) {
      for (const [id, fields] of entries) {
        await this.handleCommand(id, fieldsToObject(fields));
      }
    }
  }
}
```

`fieldsToObject` converts Redis's flat `[key, value, key, value, ...]` array to a plain object.
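
A minimal implementation of that conversion:

```typescript
// Redis stream entries arrive as [key, value, key, value, ...];
// fold them into a plain string-keyed object.
function fieldsToObject(flat: string[]): Record<string, string> {
  const obj: Record<string, string> = {};
  for (let i = 0; i + 1 < flat.length; i += 2) {
    obj[flat[i]] = flat[i + 1];
  }
  return obj;
}
```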

### Command field shape

Per the Phase 2 design, Directus's Flow publishes:

```
XADD commands:outbound:{instance_id} *
  command_id   <uuid>
  target_imei  <string>
  codec        12 | 14
  payload      <ASCII command text>
  expires_at   <unix-seconds>
```
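
Before dispatch, those raw string fields should be coerced into a typed message. A hedged sketch (the stack's zod could do this too; a plain guard keeps it dependency-free, and `toCommandMessage` is an illustrative name, not a deliverable):

```typescript
interface CommandMessage {
  command_id: string;
  target_imei: string;
  codec: 12 | 14;
  payload: string;
  expires_at: number; // unix seconds
}

// Returns null for malformed messages; the caller would publish a terminal
// failure and XACK rather than crash the read loop.
function toCommandMessage(fields: Record<string, string>): CommandMessage | null {
  const codec = Number(fields.codec);
  const expiresAt = Number(fields.expires_at);
  if (codec !== 12 && codec !== 14) return null;
  if (!fields.command_id || !fields.target_imei || !Number.isFinite(expiresAt)) return null;
  return {
    command_id: fields.command_id,
    target_imei: fields.target_imei,
    codec,
    payload: fields.payload ?? '',
    expires_at: expiresAt,
  };
}
```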

### Dispatch

```ts
async handleCommand(streamId: string, cmd: CommandMessage): Promise<void> {
  const queue = this.sessions.get(cmd.target_imei);
  if (!queue) {
    await this.publishResponse({ commandId: cmd.command_id, status: 'failed', failureReason: 'socket_closed' });
    await this.redis.xack(this.streamKey, 'ingest', streamId);
    return;
  }
  // Stream fields arrive as strings; coerce before comparing.
  if (Date.now() / 1000 > Number(cmd.expires_at)) {
    await this.publishResponse({ commandId: cmd.command_id, status: 'failed', failureReason: 'expired_before_delivery' });
    await this.redis.xack(this.streamKey, 'ingest', streamId);
    return;
  }
  try {
    const frame = encodeCommand(cmd.codec, cmd.command_id, cmd.payload);
    const responseBuf = await queue.writeCommand(cmd.command_id, frame, /* timeoutMs */ 30_000);
    const parsed = parseCommandResponse(cmd.codec, responseBuf);
    await this.publishResponse({
      commandId: cmd.command_id,
      status: parsed.kind === 'ack' ? 'responded' : 'failed',
      response: parsed.text,
      failureReason: parsed.kind === 'nack' ? 'imei_mismatch' : undefined,
    });
  } catch (err) {
    await this.publishResponse({ commandId: cmd.command_id, status: 'failed', failureReason: errToReason(err) });
  } finally {
    await this.redis.xack(this.streamKey, 'ingest', streamId);
  }
}
```

`encodeCommand` and `parseCommandResponse` come from tasks 2.5 (Codec 12) and 2.6 (Codec 14).

### `commands:responses` shape

```
XADD commands:responses *
  command_id     <uuid>
  status         delivered | responded | failed
  response       <ASCII response text, optional>
  failure_reason socket_closed | expired_before_delivery | imei_mismatch | timeout | write_queue_full | ...
  responded_at   <ms>
```

### Lifecycle hooks

In the Teltonika session:

- After a successful handshake: `commandConsumer.attach(imei, writeQueue)`.
- On socket close: `commandConsumer.detach(imei)`.
- The consumer must reject any in-flight command for a detached IMEI with `socket_closed`.

### Concurrency

The consumer reads up to 16 messages per `XREADGROUP` call and processes them sequentially within the call. Commands targeting different IMEIs still complete in parallel naturally, because each goes to a different `SocketWriteQueue`; within a single IMEI, the queue serializes them.

## Acceptance criteria

- [ ] Publishing a command via `XADD commands:outbound:{instance_id}` causes the consumer to call `writeCommand` on the right session.
- [ ] If the IMEI is not held by this instance, the consumer publishes `failed` with `socket_closed` to `commands:responses` and ACKs the stream entry.
- [ ] If `expires_at` has passed, the consumer publishes `failed` with `expired_before_delivery` and ACKs.
- [ ] On `stop()`, the consumer drains the in-flight message and exits the read loop cleanly.
- [ ] `XACK` happens only after the response is published (or a terminal failure is recorded), so a crash mid-handler causes the command to be redelivered.

## Risks / open questions

- Crash mid-handler: the command was sent on the wire but we crashed before `XACK`. After restart, the consumer redelivers; the new instance won't have the device, so it publishes `socket_closed`. Result: the command was delivered to the device but Directus thinks it failed, and the operator re-issues it. Acceptable for v1; flagged in [[phase-2-commands]] as a sweeper concern. Idempotent device commands mitigate this.
- Duplicate delivery via the Pending Entries List: not handled explicitly in v1. If a consumer crashed and another consumer in the same group claimed its pending entries, commands could be delivered to an instance that doesn't hold the device. **Decision:** each stream is per-instance and each instance is the only consumer in its own group (group name = `ingest`, consumer name = `instance_id`), so there is no cross-claiming risk. Verify this matches the Directus-side publishing logic.

## Done

(Fill in once complete.)
@@ -0,0 +1,117 @@

# Task 2.5 — Codec 12 encoder + handler

**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** 2.3, 2.4
**Wiki refs:** `docs/wiki/sources/teltonika-data-sending-protocols.md` § Codec 12, `docs/wiki/concepts/phase-2-commands.md`

## Goal

Encode Codec 12 (`0x0C`) command frames for outbound delivery; parse Codec 12 response frames coming back from devices.

## Deliverables

- `src/adapters/teltonika/codec/command/codec12.ts`:
  - `encodeCodec12Command(payload: string): Buffer` produces the on-the-wire byte sequence.
  - `parseCodec12Response(buf: Buffer): { kind: 'ack'; text: string } | { kind: 'unexpected'; reason: string }` parses an inbound response frame.
  - A `codec12CommandHandler: CodecDataHandler` that the **inbound** framing layer (task 1.4) registers for codec ID `0x0C`. This handler does not produce `Position` records; it routes the response payload to the per-socket write queue's `notifyResponse`.
- Test file `test/codec12.test.ts` with at least:
  - The two canonical doc examples (`getinfo` request + response, `getio` request + response).
  - One synthetic command with non-ASCII bytes in the payload to verify hex handling.

## Specification

### Frame structure (server → device)

```
[Preamble 4B = 0x00000000]
[DataSize 4B BE]      ← from CodecID through CmdQty2 inclusive
[CodecID  1B = 0x0C]
[CmdQty1  1B = 0x01]
[Type     1B = 0x05]  ← 0x05 = command from server
[CmdSize  4B BE]      ← length of command payload bytes
[Command  X B]        ← ASCII command, encoded as raw bytes (NOT hex-encoded)
[CmdQty2  1B = 0x01]
[CRC      4B BE]      ← CRC-16/IBM, lower 2 bytes; computed over CodecID..CmdQty2
```

Encoder pseudocode:

```ts
export function encodeCodec12Command(payload: string): Buffer {
  const cmd = Buffer.from(payload, 'ascii');
  const cmdSize = cmd.length;
  const dataSize = 1 + 1 + 1 + 4 + cmdSize + 1; // CodecID + CmdQty1 + Type + CmdSize + Command + CmdQty2
  const out = Buffer.alloc(4 + 4 + dataSize + 4); // Preamble + DataSize + body + CRC
  let off = 0;
  out.writeUInt32BE(0, off); off += 4;
  out.writeUInt32BE(dataSize, off); off += 4;
  out.writeUInt8(0x0C, off); off += 1;
  out.writeUInt8(0x01, off); off += 1;
  out.writeUInt8(0x05, off); off += 1;
  out.writeUInt32BE(cmdSize, off); off += 4;
  cmd.copy(out, off); off += cmdSize;
  out.writeUInt8(0x01, off); off += 1;
  const body = out.subarray(8, 8 + dataSize); // CodecID through CmdQty2
  const crc = crc16Ibm(body);
  out.writeUInt32BE(crc, off);
  return out;
}
```

Verify against the canonical doc's `getinfo` example: input `getinfo` → output `000000000000000F0C010500000007676574696E666F0100004312`.
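
The structural half of that check (everything except the CRC value) is deterministic and can be asserted without a device capture. A condensed, runnable sketch — the `crc16Ibm` here is the standard reflected-`0xA001` CRC-16/IBM, and whether it reproduces the doc's `0x4312` must still be confirmed against the canonical capture:

```typescript
// Standard CRC-16/IBM (a.k.a. CRC-16/ARC): init 0x0000, reflected poly 0xA001.
function crc16Ibm(buf: Buffer): number {
  let crc = 0;
  for (const byte of buf) {
    crc ^= byte;
    for (let i = 0; i < 8; i++) {
      crc = crc & 1 ? (crc >>> 1) ^ 0xa001 : crc >>> 1;
    }
  }
  return crc;
}

// Condensed encoder, same layout as the pseudocode above.
function encodeCodec12Command(payload: string): Buffer {
  const cmd = Buffer.from(payload, 'ascii');
  const dataSize = 8 + cmd.length; // CodecID + CmdQty1 + Type + CmdSize(4) + Command + CmdQty2
  const out = Buffer.alloc(12 + dataSize); // preamble stays zero from alloc
  out.writeUInt32BE(dataSize, 4);
  out.writeUInt8(0x0c, 8);
  out.writeUInt8(0x01, 9);
  out.writeUInt8(0x05, 10);
  out.writeUInt32BE(cmd.length, 11);
  cmd.copy(out, 15);
  out.writeUInt8(0x01, 15 + cmd.length);
  out.writeUInt32BE(crc16Ibm(out.subarray(8, 8 + dataSize)), 16 + cmd.length);
  return out;
}
```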

### Response structure (device → server)

Identical frame shape, but `Type = 0x06`:

```
[Preamble 4B][DataSize 4B][CodecID 0x0C][RspQty1 1B][Type 0x06][RspSize 4B][Response X B][RspQty2 1B][CRC 4B]
```

The response field is ASCII text, e.g. `INI:2019/7/22 7:22 RTC:...`.

Parser:

```ts
export function parseCodec12Response(body: Buffer): { kind: 'ack'; text: string } | { kind: 'unexpected'; reason: string } {
  // body is post-framing-layer: starts at CodecID
  const codecId = body[0];
  if (codecId !== 0x0C) return { kind: 'unexpected', reason: `wrong codec 0x${codecId.toString(16)}` };
  const rspQty1 = body[1];
  const type = body[2];
  if (type !== 0x06) return { kind: 'unexpected', reason: `expected response type 0x06, got 0x${type.toString(16)}` };
  const rspSize = body.readUInt32BE(3);
  const text = body.subarray(7, 7 + rspSize).toString('ascii');
  // body[7 + rspSize] is RspQty2; the CRC was already validated upstream.
  return { kind: 'ack', text };
}
```

### Routing inbound responses to the right command

The inbound framing layer (task 1.4) sees a frame with codec `0x0C` and dispatches to `codec12CommandHandler`. That handler retrieves the session's `SocketWriteQueue` (from the session context) and calls `queue.notifyResponse(rawBody)`. The write queue's `awaitResponse` promise resolves with the body; the command consumer (task 2.4) then calls `parseCodec12Response` to extract the text.

This is the seam where Phase 2 plugs into Phase 1's framing layer. Phase 1 already supports it because:

1. The codec dispatch is a registry — Phase 2 just registers a new handler.
2. Phase 1's handler interface returns `{ recordCount: number }` for the ACK count. For Codec 12, **the device does not expect a record-count ACK** — responses are inherently their own ACK. The handler returns `{ recordCount: 0 }` and the framing layer's ACK send path skips the write when `recordCount` is 0. **Update task 1.4 to honor this** if it doesn't already.

> **Open question:** is `recordCount: 0` the right signal to skip ACK? Or should the handler interface return `{ ack: Buffer | null }` instead? The latter is cleaner. **Recommendation:** add an explicit `ack` return slot to `CodecDataHandler` in this task and update the data codec handlers to return `{ ack: makeRecordCountAck(n) }`. Phase 2's command handlers return `{ ack: null }`.
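
A sketch of the proposed return shape — names beyond `CodecDataHandler` are illustrative, and this is the recommended variant, not the shipped Phase 1 interface:

```typescript
// Explicit ack slot instead of inferring "skip the write" from recordCount 0.
interface HandlerResult {
  recordCount: number;
  ack: Buffer | null; // null = send nothing (command codecs)
}

// Data codecs ack with the 4-byte big-endian record count.
function makeRecordCountAck(n: number): Buffer {
  const buf = Buffer.alloc(4);
  buf.writeUInt32BE(n, 0);
  return buf;
}

const dataResult: HandlerResult = { recordCount: 3, ack: makeRecordCountAck(3) };
const commandResult: HandlerResult = { recordCount: 0, ack: null };
```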
## Acceptance criteria

- [ ] `encodeCodec12Command('getinfo')` produces the canonical doc bytes exactly (compare hex strings).
- [ ] `parseCodec12Response` correctly decodes the doc's `getinfo` response into the `INI:2019/7/22...` ASCII string.
- [ ] An end-to-end test: simulate a device that responds to a Codec 12 command; verify the round trip command_id → encoded frame → device response → parsed text → published to `commands:responses`.
- [ ] The CRC of every encoded frame validates against `crc16Ibm`.
- [ ] An incoming Codec 12 frame with `Type != 0x06` is logged at warn (unexpected protocol direction) and not surfaced to the command consumer.

## Risks / open questions

- The interface change (returning `{ ack }` instead of `{ recordCount }`) is a Phase 1 retrofit. Cost: minor — three Phase 1 codec handlers update their return shape. Benefit: a cleaner Phase 2 plug-in.
- The `getinfo` canonical CRC in the doc is `0x00004312`. Verify the encoder matches before declaring done.

## Done

(Fill in once complete.)
@@ -0,0 +1,118 @@
|
||||
# Task 2.6 — Codec 14 encoder + ACK/nACK handler
|
||||
|
||||
**Phase:** 2 — Outbound commands
|
||||
**Status:** ⬜ Not started
|
||||
**Depends on:** 2.5 (shares utility code), 2.3, 2.4
|
||||
**Wiki refs:** `docs/wiki/sources/teltonika-data-sending-protocols.md` § Codec 14, `docs/wiki/concepts/phase-2-commands.md`
|
||||
|
||||
## Goal
|
||||
|
||||
Encode Codec 14 (`0x0E`) command frames with embedded IMEI; parse responses with both ACK (`0x06`) and **nACK (`0x11`)** types.

## Deliverables

- `src/adapters/teltonika/codec/command/codec14.ts`:
  - `encodeCodec14Command(imei: string, payload: string): Buffer`.
  - `parseCodec14Response(buf: Buffer): { kind: 'ack'; imei: string; text: string } | { kind: 'nack'; imei: string } | { kind: 'unexpected'; reason: string }`.
- `codec14CommandHandler: CodecDataHandler` registered for codec ID `0x0E`.
- Test file `test/codec14.test.ts` covering the doc's canonical example (`getver` round trip with both ACK and nACK responses).

## Specification

### Frame structure (server → device)

```
[Preamble 4B]
[DataSize 4B]   ← from CodecID through CmdQty2
[CodecID 1B = 0x0E]
[CmdQty1 1B = 0x01]
[Type 1B = 0x05]
[CmdSize 4B]    ← command bytes + 8 (IMEI size)
[IMEI 8B HEX]   ← e.g. IMEI 123456789123456 → 0x0123456789123456
[Command X B]   ← ASCII command bytes
[CmdQty2 1B = 0x01]
[CRC 4B]
```

**IMEI encoding rule:** the device IMEI is encoded as 8 bytes in HEX. For a 15-digit IMEI like `352093081452251`, prepend a leading zero (`0352093081452251`) and parse as a 16-hex-char value → 8 bytes: `0x03 52 09 30 81 45 22 51`. **Not** ASCII like the handshake.

```ts
function imeiToHex(imei: string): Buffer {
  // 15 digits → prepend "0" → 16 hex chars → 8 bytes
  const padded = imei.padStart(16, '0');
  if (!/^\d{16}$/.test(padded)) throw new Error(`bad IMEI: ${imei}`);
  return Buffer.from(padded, 'hex');
}
```
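A minimal sketch of the encoder, assembling the frame exactly as laid out above. The inlined `crc16Ibm` (CRC-16/IBM: reflected polynomial 0xA001, init 0x0000) is a local stand-in for the shared Phase 1 utility; the canonical bytes in the acceptance criteria below are the ground truth to verify the CRC output against:

```typescript
// Stand-in for the shared Phase 1 crc16Ibm utility (CRC-16/IBM,
// reflected poly 0xA001, init 0x0000, no final XOR).
function crc16Ibm(data: Buffer): number {
  let crc = 0;
  for (const byte of data) {
    crc ^= byte;
    for (let i = 0; i < 8; i++) {
      crc = crc & 1 ? (crc >>> 1) ^ 0xa001 : crc >>> 1;
    }
  }
  return crc;
}

function imeiToHex(imei: string): Buffer {
  const padded = imei.padStart(16, '0');
  if (!/^\d{16}$/.test(padded)) throw new Error(`bad IMEI: ${imei}`);
  return Buffer.from(padded, 'hex');
}

export function encodeCodec14Command(imei: string, payload: string): Buffer {
  const cmd = Buffer.from(payload, 'ascii');
  // Data region: CodecID + CmdQty1 + Type + CmdSize(4) + IMEI(8) + command + CmdQty2
  const data = Buffer.alloc(1 + 1 + 1 + 4 + 8 + cmd.length + 1);
  let o = 0;
  data.writeUInt8(0x0e, o); o += 1;              // CodecID
  data.writeUInt8(0x01, o); o += 1;              // CmdQty1
  data.writeUInt8(0x05, o); o += 1;              // Type (command)
  data.writeUInt32BE(cmd.length + 8, o); o += 4; // CmdSize includes the 8-byte IMEI
  imeiToHex(imei).copy(data, o); o += 8;
  cmd.copy(data, o); o += cmd.length;
  data.writeUInt8(0x01, o);                      // CmdQty2
  const frame = Buffer.alloc(4 + 4 + data.length + 4);
  frame.writeUInt32BE(0, 0);                     // Preamble
  frame.writeUInt32BE(data.length, 4);           // DataSize
  data.copy(frame, 8);
  frame.writeUInt32BE(crc16Ibm(data), 8 + data.length); // CRC over data region only
  return frame;
}
```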

### Response structure (device → server)

Two cases:

**ACK** (`Type = 0x06`): IMEI matched; command executed.

```
[Preamble][DataSize][CodecID 0x0E][RspQty1][Type 0x06][RspSize][IMEI 8B][Response X B][RspQty2][CRC]
```

**nACK** (`Type = 0x11`): IMEI did not match; command not executed.

```
[Preamble][DataSize][CodecID 0x0E][RspQty1][Type 0x11][RspSize=0x08][IMEI 8B][RspQty2][CRC]
```

Note: nACK has `RspSize = 8` (the IMEI itself counts), no Response bytes.

### Parser

```ts
export function parseCodec14Response(body: Buffer):
  | { kind: 'ack'; imei: string; text: string }
  | { kind: 'nack'; imei: string }
  | { kind: 'unexpected'; reason: string }
{
  // Fail safe on truncated frames rather than reading past buffer bounds.
  if (body.length < 15) return { kind: 'unexpected', reason: `short body (${body.length}B)` };
  const codecId = body[0];
  if (codecId !== 0x0E) return { kind: 'unexpected', reason: `wrong codec 0x${codecId.toString(16)}` };
  const type = body[2];
  const rspSize = body.readUInt32BE(3);
  const imeiHex = body.subarray(7, 15).toString('hex');
  const imei = imeiHex.replace(/^0+/, ''); // strip leading zero used for padding
  if (type === 0x06) {
    // RspSize includes the 8-byte IMEI; reject sizes the buffer cannot hold.
    if (rspSize < 8 || 15 + rspSize - 8 > body.length) {
      return { kind: 'unexpected', reason: `bad RspSize ${rspSize}` };
    }
    const text = body.subarray(15, 15 + rspSize - 8).toString('ascii');
    return { kind: 'ack', imei, text };
  }
  if (type === 0x11) {
    return { kind: 'nack', imei };
  }
  return { kind: 'unexpected', reason: `unknown response type 0x${type.toString(16)}` };
}
```

### Mapping to `commands:responses`

The command consumer (task 2.4) handles all three outcomes:

- `ack` → `status = 'responded'`, `response = text`.
- `nack` → `status = 'failed'`, `failure_reason = 'imei_mismatch'`. The command was *delivered* but rejected — important nuance for operator dashboards.
- `unexpected` → `status = 'failed'`, `failure_reason = 'protocol_error'`.
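That mapping can be a small pure function. The `commands:responses` field names below follow this task's wording; the exact stream schema is assumed, not specified here:

```typescript
// Sketch: translate a parsed Codec 14 outcome into a commands:responses
// payload. Field names (command_id, status, failure_reason) follow the
// task text and are assumptions about the stream schema.
type Codec14Result =
  | { kind: 'ack'; imei: string; text: string }
  | { kind: 'nack'; imei: string }
  | { kind: 'unexpected'; reason: string };

interface CommandResponse {
  command_id: string;
  status: 'responded' | 'failed';
  response?: string;
  failure_reason?: string;
}

function toCommandResponse(commandId: string, r: Codec14Result): CommandResponse {
  switch (r.kind) {
    case 'ack':
      return { command_id: commandId, status: 'responded', response: r.text };
    case 'nack':
      // Delivered to the device but rejected: IMEI mismatch.
      return { command_id: commandId, status: 'failed', failure_reason: 'imei_mismatch' };
    case 'unexpected':
      return { command_id: commandId, status: 'failed', failure_reason: 'protocol_error' };
  }
}
```

Keeping this pure makes the three terminal statuses trivially unit-testable without a socket or Redis.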

### Firmware version requirement

Codec 14 requires FMB.Ver.03.25.04.Rev.00 or newer. Older firmware will not understand the codec ID and may close the connection. The Phase 2 design relies on Directus knowing which devices support which codecs (potentially a `firmware_version` column on a `devices` collection). The Ingestion service does not enforce this; it just sends what it's told.

> **Open question:** should we expose a metric `teltonika_codec14_unexpected_total` to detect cases where Codec 14 was sent but the device closed the connection (suggesting outdated firmware)? Probably yes; add to task 2.6 deliverables and the metrics inventory.

## Acceptance criteria

- [ ] `encodeCodec14Command('352093081452251', 'getver')` produces the canonical doc bytes exactly: `00000000000000160E01050000000E0352093081452251676574766572010000D2C1`.
- [ ] `parseCodec14Response` correctly decodes the doc's ACK response (IMEI + version string).
- [ ] `parseCodec14Response` correctly decodes the doc's nACK response (IMEI mismatch case).
- [ ] An end-to-end test simulates one device that ACKs Codec 14 and one that nACKs; verify both terminal statuses land in `commands:responses` correctly.
- [ ] IMEI HEX encoding round-trips through `imeiToHex` and the response parser.
- [ ] The IMEI in a nACK response is compared to the connection's handshake IMEI; a mismatch is logged at error.

## Risks / open questions

- A nACK with `RspSize` not equal to 8 is malformed; fail safe (treat it as `unexpected`) rather than read past buffer bounds.
- Should the Ingestion service also log the IMEI from the nACK response (which is the *server's claim*) and compare it to the *actual* IMEI of the connection (from the handshake)? If they differ, something is seriously wrong. **Yes — log at error if they differ.** Added to the acceptance criteria above.

## Done

(Fill in once complete.)

# Phase 2 — Outbound commands

Add server-to-device command delivery using Teltonika codecs 12 (`0x0C`) and 14 (`0x0E`). Codec 13 is one-way device→server (not in scope for outbound); codec 15 is FMX6-only (out of scope entirely).

## Prerequisite

Phase 1 must be complete and stable in production. Phase 2 adds code *alongside* Phase 1, never in the inbound parsing path.

## Outcome statement

When Phase 2 is done:

- Each Ingestion instance maintains its IMEI→instance mapping in `connections:registry` (a Redis hash) and a heartbeat key.
- A Directus Flow on `commands` table inserts can publish a command to `commands:outbound:{instance_id}` after looking up the routing.
- Each Ingestion instance runs a command consumer in parallel with the TCP listener; consumed commands are dispatched to the right per-socket write queue, encoded as Codec 12 or 14, and written to the device.
- Device responses (Codec 12 Type `0x06`, or Codec 14 Type `0x06`/`0x11`) are correlated to the in-flight command and published to `commands:responses` for Directus to update the row.
- The TCP read path is never blocked by outbound work.
- Phase 1 code is unchanged.
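The registry and heartbeat from the outcome statement could be as small as this sketch. Key names follow the bullets above; `RegistryClient` is a hypothetical narrow slice of an ioredis-style API so the class is testable without a live Redis:

```typescript
// Sketch of the Phase 2 connection registry. Key names (connections:registry,
// heartbeat:{instance_id}) follow the planning text; the client interface is
// an assumed subset of ioredis.
interface RegistryClient {
  hset(key: string, field: string, value: string): Promise<unknown>;
  hdel(key: string, field: string): Promise<unknown>;
  set(key: string, value: string, mode: 'EX', seconds: number): Promise<unknown>;
}

class ConnectionRegistry {
  constructor(private redis: RegistryClient, private instanceId: string) {}

  // Called after a successful IMEI handshake: claim the route.
  async register(imei: string): Promise<void> {
    await this.redis.hset('connections:registry', imei, this.instanceId);
  }

  // Called on socket close so Directus stops routing commands here.
  async unregister(imei: string): Promise<void> {
    await this.redis.hdel('connections:registry', imei);
  }

  // TTL-based liveness: the janitor treats a missing key as a dead instance.
  async heartbeat(ttlSeconds = 15): Promise<void> {
    await this.redis.set(`heartbeat:${this.instanceId}`, String(Date.now()), 'EX', ttlSeconds);
  }
}
```

The TTL value is illustrative; the real interval belongs to task 2.1 and the janitor's sweep period to task 2.2.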

## Architectural anchors

`docs/wiki/concepts/phase-2-commands.md` is the design source of truth. Read it before starting any Phase 2 task.

Key invariants:

1. **Ingestion exposes no user-facing HTTP** — never. All command authorization happens in Directus.
2. **Commands are data before transport.** Every command has a row in Directus's `commands` table before it ever reaches Redis.
3. **One outstanding command per device socket.** Teltonika command codecs have no correlation ID; the protocol assumes serialization. Subsequent commands queue on the per-socket write queue.
4. **Per-instance routing.** Only the Ingestion instance currently holding a device's socket can deliver commands to it. The connection registry exists so Directus knows which instance to publish to.

## Sequencing

```
2.1 Connection registry & heartbeat ─┐
2.2 Registry janitor                 ├─→ 2.4 Command consumer ─┐
2.3 Per-socket write queue ──────────┘                         ├─→ 2.5 Codec 12 handler
                                                               └─→ 2.6 Codec 14 handler
```

Tasks 2.1, 2.2, and 2.3 can be done in parallel; they are independent infrastructure pieces. 2.5 and 2.6 can be parallelized once 2.4 lands.

## Files added

Phase 2 introduces these new files (no Phase 1 file is modified except `src/main.ts`, which is updated to wire in the command consumer):

```
src/
├── adapters/teltonika/
│   ├── codec/command/
│   │   ├── codec12.ts          ← NEW (encoder + response parser)
│   │   └── codec14.ts          ← NEW (encoder + ACK/nACK parser)
│   └── command-consumer.ts     ← NEW (stream reader, dispatch)
├── core/
│   ├── connection-registry.ts  ← NEW
│   ├── write-queue.ts          ← NEW
│   └── janitor.ts              ← NEW (separate small process or in-process worker)
└── main.ts                     ← updated to start consumer + registry
```

`src/adapters/teltonika/codec/command/` already exists from Phase 1 (as an empty placeholder); Phase 2 fills it.

## Out of scope for this phase

- The Directus side of the system (`commands` table, Flows, sweeper) is owned by the Directus repo, not this one. Phase 2 in this repo only handles the Ingestion-side consumer and writer behavior.
- The pending-command sweeper runs in Directus, not Ingestion. Ingestion publishes terminal status (`delivered`, `responded`, or `failed` with reasons) and Directus updates the row.

# Phase 3 — Future / optional

Items that are designed for but not committed. Each could become its own phase if the platform's needs grow in that direction. None of these block Phase 1 or Phase 2 — they are clean additions thanks to the layout rules in `docs/wiki/concepts/protocol-adapter.md`.

## Candidates

### 3.A — Additional vendor adapters

**Trigger:** a customer onboards Queclink, Concox, or other GPS hardware.

Each new vendor is a new sibling under `src/adapters/`. The shell stays unchanged. Phase 1's [[protocol-adapter]] discipline is the load-bearing property here — `core/` does not import from any specific adapter, so a new one slots in without touching existing code.

Estimated scope per vendor: roughly Phase 1 tasks 1.4–1.9 again (framing + codec parsers + fixture suite + tests), but smaller because the shell already exists.

### 3.B — UDP transport for Teltonika

**Trigger:** devices configured for UDP, or operational reasons to prefer UDP (lower keepalive cost on cellular).
Adds a parallel `udp-server.ts` in `src/core/` that handles the Teltonika UDP envelope (`Length 2B + Packet ID 2B + Not Usable Byte 1B + AVL Packet Header + AVL Data Array`). The codec parsers themselves are unchanged — UDP only changes the framing/transport. ACK format differs (5-byte UDP ack vs. 4-byte TCP record count).
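A sketch of the envelope split that `udp-server.ts` would perform before handing off to the unchanged codec parsers. It decodes only the fields named above and leaves the AVL packet header opaque; the assumption that `Length` counts all bytes after the Length field is mine and must be verified against the wiki's UDP envelope section:

```typescript
// Hypothetical UDP envelope split. Field semantics are assumptions to be
// checked against docs/wiki/concepts/avl-data-format.md § "UDP envelope".
interface UdpEnvelope {
  length: number;   // assumed: size of everything after the Length field
  packetId: number; // echoed back in the 5-byte UDP ACK
  rest: Buffer;     // Not Usable Byte + AVL Packet Header + AVL Data Array
}

function splitUdpEnvelope(datagram: Buffer): UdpEnvelope | null {
  if (datagram.length < 5) return null; // too short for the fixed fields
  const length = datagram.readUInt16BE(0);
  const packetId = datagram.readUInt16BE(2);
  if (datagram.length < 2 + length) return null; // truncated datagram: drop, never over-read
  return { length, packetId, rest: datagram.subarray(4, 2 + length) };
}
```

Returning `null` (drop the datagram) on malformed input mirrors the fail-safe posture of the TCP parsers.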

Reference: `docs/wiki/concepts/avl-data-format.md` § "UDP envelope".

Risk: UDP loses TCP's natural per-session stickiness, so a future Phase 2 over UDP would need a different routing model than the connection registry.

### 3.C — Codec 15 (`0x0F`)

**Trigger:** an FMX6 professional fleet onboards.

Codec 15 is one-way device→server with both a timestamp and the IMEI in every frame, used only on FMX6 devices in RS232 modes. Out of scope for the deployed FMB/FMC/FMM/FMU fleet.

Implementation is a new handler in `src/adapters/teltonika/codec/data/codec15.ts` plus a registration in `index.ts`. The handshake skips the IMEI exchange because the IMEI is in every frame.

### 3.D — SMS protocols (Codec 4 + binary SMS)

**Trigger:** SMS fallback connectivity becomes a requirement (e.g. devices in cellular-fringe areas).

Codec 4 carries 24 GPS positions in one SMS (bit-field compressed); a separate binary-SMS protocol carries an AVL data array + IMEI. Both require an SMS gateway integration outside this service.

This is large enough to warrant its own service rather than living in `tcp-ingestion/`. Open question for that point: should the SMS gateway publish to the same `telemetry:teltonika` Redis Stream, or a separate one? Probably the same — the [[position-record]] contract is transport-agnostic.

### 3.E — Multi-region / NATS or Kafka migration

**Trigger:** multi-region deployment, or single-Redis throughput becomes a real bottleneck.
Replace `ioredis` with `@nats-io/nats` or `kafkajs`. The publish interface in `src/core/publish.ts` should be small enough that this is a swap, not a rewrite — design Phase 1's publisher with this in mind.
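One way to keep that seam small is to have callers depend on a one-method interface and make the broker an implementation detail. Names below are illustrative, not the actual `publish.ts` contract:

```typescript
// Sketch of a broker-agnostic publish seam. PositionPublisher is the only
// type callers see; swapping Redis Streams for NATS/Kafka means writing a
// new implementation, not touching callers. Names are assumptions.
interface PositionPublisher {
  publish(stream: string, record: Record<string, string>): Promise<string>; // returns entry id
}

// Assumed subset of an ioredis-shaped client.
interface StreamClient {
  xadd(key: string, id: string, ...fieldValues: string[]): Promise<string>;
}

class RedisStreamPublisher implements PositionPublisher {
  constructor(private redis: StreamClient) {}

  async publish(stream: string, record: Record<string, string>): Promise<string> {
    // Flatten { field: value } pairs into XADD's field/value argument list.
    const flat = Object.entries(record).flat();
    return this.redis.xadd(stream, '*', ...flat);
  }
}
```

A NATS or Kafka implementation would serialize `record` into a message body instead of XADD field pairs, behind the same `publish` signature.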

## How items move from here to a real phase

When one of these triggers fires:

1. Create a new `phase-N-<name>/` folder under `.planning/`.
2. Promote the relevant section above into a full phase plan with the same task structure as Phase 1/2.
3. Update the ROADMAP with the new phase row.
4. Remove the section from this README (or mark it as promoted).