Add Phase 1 and Phase 2 planning documents
ROADMAP plus granular task files per phase. Phase 1 (12 tasks + 1.13 device authority) covers Codec 8/8E/16 telemetry ingestion; Phase 2 (6 tasks) covers Codec 12/14 outbound commands; Phase 3 enumerates deferred items.
@@ -0,0 +1,118 @@
# Task 2.1 — Connection registry & heartbeat
**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** Phase 1 complete
**Wiki refs:** `docs/wiki/concepts/phase-2-commands.md` § 9.3

## Goal

Maintain a Redis-backed registry mapping device IMEI → Ingestion instance ID, so Directus can route outbound commands to the instance currently holding the device's TCP socket.

## Deliverables

- `src/core/connection-registry.ts`:
  - `ConnectionRegistry` class with methods `register(imei)`, `unregister(imei)`, `unregisterAll()`, `heartbeat()`.
  - Internal state: `Set<string>` of held IMEIs for graceful-shutdown bulk cleanup.
- Hook into the Teltonika session lifecycle (in `src/adapters/teltonika/index.ts`):
  - After IMEI handshake succeeds: `registry.register(imei)`.
  - On socket close (any cause): `registry.unregister(imei)`.
- Heartbeat ticker started in `src/main.ts`, runs every 30 seconds.
- Graceful shutdown calls `registry.unregisterAll()` (task 1.12 hook updated to include this).

## Specification

### Redis layout

- **Hash** `connections:registry`: field = `imei`, value = `instance_id`. Single hash, all instances share it. Per-field TTL on hashes is only available from Redis 7.4 (`HEXPIRE`) and this design does not rely on it; that's why the heartbeat key exists.
- **Key** `instance:heartbeat:{instance_id}`: written every 30s with `EX 90`. Existence proves the instance is alive.

### Operations
```ts
import type { Redis } from 'ioredis'; // assumption: the client is ioredis (hset/hdel/pipeline API)

class ConnectionRegistry {
  // IMEIs whose sockets this instance currently holds; used for bulk cleanup on shutdown.
  private held = new Set<string>();

  constructor(private redis: Redis, private instanceId: string) {}

  async register(imei: string): Promise<void> {
    await this.redis.hset('connections:registry', imei, this.instanceId);
    this.held.add(imei);
  }

  async unregister(imei: string): Promise<void> {
    // Only delete if the entry still points at us.
    // (Race: a device reconnected to a different instance between
    // our session ending and this delete.)
    const current = await this.redis.hget('connections:registry', imei);
    if (current === this.instanceId) {
      await this.redis.hdel('connections:registry', imei);
    }
    this.held.delete(imei);
  }

  async unregisterAll(): Promise<void> {
    if (this.held.size === 0) return;
    // Note: unlike unregister(), this deletes without re-checking ownership.
    const pipeline = this.redis.pipeline();
    for (const imei of this.held) {
      pipeline.hdel('connections:registry', imei);
    }
    await pipeline.exec();
    this.held.clear();
  }

  async heartbeat(): Promise<void> {
    await this.redis.set(
      `instance:heartbeat:${this.instanceId}`,
      Date.now().toString(),
      'EX',
      90,
    );
  }
}
```
### Heartbeat ticker

In `main.ts`:

```ts
const heartbeatInterval = setInterval(() => {
  // Log at warn and retry on the next tick; a failed heartbeat must not crash the process.
  registry.heartbeat().catch((err) => logger.warn({ err }, 'heartbeat failed'));
}, 30_000);
// On shutdown: clearInterval(heartbeatInterval);
```

Run an initial `heartbeat()` immediately at startup so the instance is "alive" before the first 30s tick.
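The ticker and the initial heartbeat can be wrapped in one helper. The following is a minimal sketch, not the planned implementation: `startHeartbeat`, `HeartbeatSource`, and the `onError` callback are hypothetical names introduced here for illustration.

```ts
// Hypothetical minimal shape; the real ConnectionRegistry has more methods.
interface HeartbeatSource {
  heartbeat(): Promise<void>;
}

// Sends one heartbeat immediately, then one per interval.
// Returns a function that stops the ticker (call it during shutdown).
function startHeartbeat(
  source: HeartbeatSource,
  intervalMs = 30_000,
  onError: (err: unknown) => void = (err) => console.warn('heartbeat failed', err),
): () => void {
  const beat = (): void => {
    // Failures are reported and retried on the next tick; they never throw.
    source.heartbeat().catch(onError);
  };
  beat(); // initial heartbeat, before the first interval elapses
  const timer = setInterval(beat, intervalMs);
  return () => clearInterval(timer);
}
```

In `main.ts` this would presumably be called once after the registry is constructed, with the returned stop function invoked from the task 1.12 shutdown hook before `unregisterAll()`.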
### Race conditions to handle

- **Same IMEI on two instances at once.** Possible when a device reconnects faster than we can detect close. The new instance's `register` overwrites the old instance's entry. The old instance's `unregister` checks `if (current === this.instanceId)` and skips the delete when the entry no longer points at it.
- **Heartbeat key expires while instance is alive.** Heartbeat writes failed for long enough that the 90s TTL lapsed (roughly three consecutive 30s ticks), for example during a network glitch. The janitor (task 2.2) will clear the registry entries; devices reconnect and the new entries get written. Acceptable: temporary loss of routability for affected devices, recoverable in seconds.
- **Hash entry without heartbeat.** The instance died without graceful cleanup. Janitor handles this.

### Phase 1 impact

Phase 1 code in `src/adapters/teltonika/index.ts` needs three hook points:
1. After successful handshake.
2. On `socket.on('close')`.
3. On graceful shutdown (already wired in task 1.12).

These are additive: no Phase 1 logic changes, only new calls to the registry.

## Acceptance criteria

- [ ] After a device handshake completes, `HGET connections:registry <imei>` returns the local instance ID.
- [ ] After the socket closes, `HGET connections:registry <imei>` returns nil.
- [ ] If two simulated instances "race" on the same IMEI, the registry ends up pointing at whichever instance most recently registered, and the loser's `unregister` does not delete the winner's entry.
- [ ] Heartbeat key has `EX 90` and is refreshed every 30s.
- [ ] On SIGTERM, all held IMEIs are unregistered before exit.
- [ ] Registry operations stay off the TCP read hot path: `register`/`unregister` are awaited inside session lifecycle callbacks, not in per-frame processing.

## Risks / open questions

- What if Redis is unavailable at registration time? Options: (A) fail the handshake, (B) accept the device but log + alert. **Recommendation: B.** Phase 1's "telemetry continues even if business plane is degraded" property must be preserved; command routing is a Phase 2 nice-to-have. Track via `teltonika_registry_failures_total`.
- Heartbeat write failures: log at warn, retry on next tick. Don't crash.

## Done

(Fill in once complete.)
@@ -0,0 +1,83 @@
# Task 2.2 — Registry janitor
**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** 2.1
**Wiki refs:** `docs/wiki/concepts/phase-2-commands.md` § 9.3

## Goal

Periodically clear stale entries from `connections:registry` whose owning instance has died (heartbeat expired) without graceful cleanup.

## Deliverables

- `src/core/janitor.ts`: `Janitor` class with a `run()` method that performs one cleanup pass.
- A choice: run the janitor in-process (every Ingestion instance runs it, with leader election or with idempotent cleanup) or as a separate small process. **Recommendation: in-process, idempotent.** Simpler ops, no leader election; the cost is that every instance repeats the same work, but a registry pass is linear in the number of registered devices and fast.
- Wired into `src/main.ts` as a 60-second ticker.
- Metric: `teltonika_registry_janitor_evicted_total{instance_id=...}` counter.

## Specification

### Algorithm (per pass)

```
1. entries = HGETALL connections:registry
2. unique_instance_ids = unique values from entries
3. For each instance_id in unique_instance_ids:
     alive = EXISTS instance:heartbeat:{instance_id}
     If !alive:
       For each (imei, owner) in entries where owner == instance_id:
         deleted = HDEL connections:registry imei
         If deleted > 0: metrics.evicted.inc({ instance_id })
```

Use `HSCAN` instead of `HGETALL` if the registry is large (>10k entries) to avoid blocking Redis. For Phase 2's expected scale, `HGETALL` is fine.
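The selection part of the pass (steps 2 and 3, before any `HDEL`) is pure and can be tested without Redis. A minimal sketch, assuming the caller has already fetched the hash and checked each heartbeat key; `planEvictions` is a hypothetical helper name, not part of the deliverables:

```ts
// Pure planning step: which IMEIs should be evicted, grouped by dead owner.
// `entries` mirrors HGETALL connections:registry (imei -> instance_id);
// `aliveInstanceIds` holds the owners whose heartbeat key EXISTS.
function planEvictions(
  entries: Record<string, string>,
  aliveInstanceIds: ReadonlySet<string>,
): Map<string, string[]> {
  const plan = new Map<string, string[]>();
  for (const [imei, owner] of Object.entries(entries)) {
    if (aliveInstanceIds.has(owner)) continue; // owner is alive: keep the entry
    const imeis = plan.get(owner) ?? [];
    imeis.push(imei);
    plan.set(owner, imeis);
  }
  return plan;
}
```

Keeping the plan separate from the deletes means the Redis-facing code only has to iterate the map, delete, and count the deletes that actually removed an entry.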
### Idempotence

Multiple instances running the janitor in parallel may both attempt to delete the same stale entry. `HDEL` is idempotent: the second call returns 0 and is harmless. Just ensure logging and metrics don't double-count: only count a delete when `HDEL` returns a value greater than 0.

### Race with re-registration

Sequence to consider:
1. Instance A dies; heartbeat expires.
2. Janitor on Instance B starts a pass. Sees A's entries, A's heartbeat is gone.
3. Device that was on A reconnects to Instance C.
4. Instance C calls `HSET connections:registry <imei> C`.
5. Janitor on B is mid-pass and calls `HDEL connections:registry <imei>`.

Result: device entry deleted moments after C registered it. Commands for that device cannot be routed until it reconnects and registers again.

**Mitigation:** the janitor must check the entry value at delete time, not just at scan time:

```ts
for (const imei of evictTargets) {
  // Re-read the value; only delete if still pointing at the dead instance.
  const current = await redis.hget('connections:registry', imei);
  if (current === deadInstanceId) {
    await redis.hdel('connections:registry', imei);
  }
}
```

This is "check-and-delete": not atomic, but the window is small. For full atomicity, use a Lua script. **Recommendation: ship the non-atomic version first; upgrade to Lua if the race causes operational issues.**
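If the Lua upgrade is ever needed, it could look roughly like the sketch below. This is an illustration only: `COMPARE_AND_DELETE`, `EvalClient`, and `deleteIfOwnedBy` are hypothetical names, and the `eval(script, numKeys, ...args)` call shape is assumed to match the Redis client in use (ioredis exposes an `eval` of this form).

```ts
// Lua runs atomically inside Redis, so the read and the delete cannot be
// interleaved with another client's HSET.
const COMPARE_AND_DELETE = `
if redis.call('HGET', KEYS[1], ARGV[1]) == ARGV[2] then
  return redis.call('HDEL', KEYS[1], ARGV[1])
end
return 0
`;

// Hypothetical minimal client shape, so the helper can be tested with a fake.
interface EvalClient {
  eval(script: string, numKeys: number, ...args: string[]): Promise<unknown>;
}

// Resolves to true when the entry was deleted, false when it pointed elsewhere or was gone.
async function deleteIfOwnedBy(
  redis: EvalClient,
  imei: string,
  expectedOwner: string,
): Promise<boolean> {
  const deleted = await redis.eval(COMPARE_AND_DELETE, 1, 'connections:registry', imei, expectedOwner);
  return deleted === 1;
}
```

The boolean result doubles as the "actual delete" signal for the eviction counter, so the metric cannot double-count when two janitors race.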
### Pace

Run every 60 seconds (configurable via `JANITOR_INTERVAL_MS`). One pass costs at most one `HGETALL` + N `EXISTS` + (rare) M `HDEL`, where N is the number of distinct instance IDs and M the number of stale entries. Negligible Redis load.

## Acceptance criteria

- [ ] Killing an Ingestion instance without graceful shutdown: within about 2.5 minutes (up to 90s for the heartbeat TTL to lapse, plus up to one 60s janitor interval), all of that instance's registry entries are gone.
- [ ] If the dying instance restarts and re-registers a device before the janitor evicts it, the new (live) entry is preserved (verified by the check-and-delete logic).
- [ ] Two janitors running in parallel: total deletes are correct, no double-counting in metrics.
- [ ] `teltonika_registry_janitor_evicted_total` increments by the right amount per pass.

## Risks / open questions

- The check-and-delete race window: small but real. If operationally observed, upgrade to Lua. Document the trade-off in `OPERATIONS.md`.
- Should the janitor be a separate process? Pros: cleaner separation; can be sized differently. Cons: another deployable, another monitoring target. **Defer to operational feedback.**

## Done

(Fill in once complete.)
@@ -0,0 +1,112 @@
# Task 2.3 — Per-socket write queue & outstanding-command tracker
**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** Phase 1 complete (specifically the session loop in 1.4)
**Wiki refs:** `docs/wiki/concepts/phase-2-commands.md` § 9.6, § 9.8

## Goal

Provide a per-socket serialization layer so:

1. Outbound command frames do not interleave with codec ACK writes (which would corrupt the byte stream).
2. Only **one command is outstanding per socket at a time** (Teltonika's command codecs assume serial dispatch; there's no correlation ID in the protocol).

## Deliverables

- `src/core/write-queue.ts`:
  - `SocketWriteQueue` class wrapping a `net.Socket` with an internal queue.
  - Methods: `writeAck(buf: Buffer): Promise<void>`, `writeCommand(commandId: string, buf: Buffer, timeoutMs?: number): Promise<Buffer>`.
  - Per-socket state: `outstandingCommand: PendingCommand | null` with `commandId`, `timeout`, `resolve`, `reject` functions.
  - `awaitResponse(commandId, timeoutMs): Promise<Buffer>`: registers the in-flight command and waits for a response delivered via a separate `notifyResponse(buf)` method.
- Update `src/adapters/teltonika/index.ts` session struct to hold a `SocketWriteQueue` per session.
- Update Phase 1's framing layer (task 1.4 deliverable) to write ACKs through `queue.writeAck` instead of directly to the socket.

## Specification

### Why ACKs go through the queue too

Phase 1 wrote ACKs directly to the socket. Phase 2 must serialize ACKs with command writes, otherwise:

```
Time T+0: codec parser writes ACK = [00 00 00 01]
Time T+0: command consumer writes Codec 12 frame
```

Without serialization, the bytes interleave at the socket level, producing garbage on the wire. The fix is mandatory: every socket write goes through one queue.

### Queue semantics
```ts
import * as net from 'node:net';

class SocketWriteQueue {
  // Serializes every raw socket write (ACKs and command frames).
  private chain: Promise<void> = Promise.resolve();
  // Serializes commands: the next one starts only after the previous one settles.
  private commandChain: Promise<unknown> = Promise.resolve();
  private outstanding: PendingCommand | null = null;

  constructor(private socket: net.Socket) {}

  writeAck(buf: Buffer): Promise<void> {
    return this.enqueueWrite(buf);
  }

  writeCommand(commandId: string, buf: Buffer, timeoutMs = 30_000): Promise<Buffer> {
    const run = async (): Promise<Buffer> => {
      const pending: PendingCommand = makePending(commandId, timeoutMs);
      this.outstanding = pending;
      try {
        // Rejects if the write fails or the timeout fires; resolves when notifyResponse is called.
        const [, response] = await Promise.all([this.enqueueWrite(buf), pending.promise]);
        return response;
      } finally {
        if (this.outstanding === pending) this.outstanding = null;
      }
    };
    // Run after the previous command settles, whether it succeeded or failed.
    const result = this.commandChain.then(run, run);
    this.commandChain = result.catch(() => undefined);
    return result;
  }

  notifyResponse(buf: Buffer): void {
    if (!this.outstanding) {
      // Unsolicited response. Log warn and ignore.
      return;
    }
    this.outstanding.resolveWith(buf);
    this.outstanding = null;
  }

  private enqueueWrite(buf: Buffer): Promise<void> {
    const result = this.chain.then(() => this.writeRaw(buf));
    // A failed write must not poison the chain for later writes.
    this.chain = result.catch(() => undefined);
    return result;
  }

  private writeRaw(buf: Buffer): Promise<void> {
    return new Promise((resolve, reject) => {
      this.socket.write(buf, (err) => err ? reject(err) : resolve());
    });
  }
}
```

`PendingCommand` exposes a promise that resolves when `resolveWith` is called and rejects when its `setTimeout` fires.
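The spec leaves `PendingCommand` and `makePending` undefined. One possible shape is sketched below, under assumptions: the `rejectWith` method is an addition for the socket-close case, and `CommandTimeoutError` here is a stand-in for the error named in the timeout section.

```ts
// Stand-in for the CommandTimeoutError named in this task.
class CommandTimeoutError extends Error {
  readonly commandId: string;
  constructor(commandId: string, timeoutMs: number) {
    super(`command ${commandId} timed out after ${timeoutMs} ms`);
    this.name = 'CommandTimeoutError';
    this.commandId = commandId;
  }
}

interface PendingCommand {
  commandId: string;
  promise: Promise<Buffer>;
  resolveWith(buf: Buffer): void;
  rejectWith(err: Error): void; // assumption: used when the socket closes
}

function makePending(commandId: string, timeoutMs: number): PendingCommand {
  let resolve!: (buf: Buffer) => void;
  let reject!: (err: Error) => void;
  const promise = new Promise<Buffer>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  // The timer starts when the pending command is created.
  const timer = setTimeout(
    () => reject(new CommandTimeoutError(commandId, timeoutMs)),
    timeoutMs,
  );
  return {
    commandId,
    promise,
    // Settling clears the timer so it cannot fire later or keep the process alive.
    resolveWith: (buf) => { clearTimeout(timer); resolve(buf); },
    rejectWith: (err) => { clearTimeout(timer); reject(err); },
  };
}
```

A promise settles only once, so a late `resolveWith` after a timeout is a harmless no-op; the queue can still log it as an unsolicited response.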
### Backpressure on queued commands

A device with many queued commands could grow the queue unboundedly. Cap per-socket queue depth:

- Soft: log a warning at 5 queued commands.
- Hard: reject `writeCommand` with `WriteQueueFullError` at 20 queued commands. The command consumer publishes a failure to `commands:responses`.
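The two limits can be enforced by a small counter that `writeCommand` consults before queueing. A minimal sketch, assuming the queue calls `enter()` on submission and `leave()` when a command settles; `CommandDepthGuard` is a hypothetical name, and `WriteQueueFullError` is a stand-in for the error named above.

```ts
// Stand-in for the WriteQueueFullError named in this task.
class WriteQueueFullError extends Error {
  constructor(depth: number) {
    super(`write queue full at depth ${depth}`);
    this.name = 'WriteQueueFullError';
  }
}

// Tracks how many commands are queued for one socket and enforces both limits.
class CommandDepthGuard {
  private depth = 0;
  private readonly softLimit: number;
  private readonly hardLimit: number;
  private readonly warn: (depth: number) => void;

  constructor(softLimit = 5, hardLimit = 20, warn: (depth: number) => void = () => {}) {
    this.softLimit = softLimit;
    this.hardLimit = hardLimit;
    this.warn = warn;
  }

  // Call before queueing a command; throws once the hard limit is reached.
  enter(): void {
    if (this.depth >= this.hardLimit) throw new WriteQueueFullError(this.depth);
    this.depth += 1;
    if (this.depth >= this.softLimit) this.warn(this.depth);
  }

  // Call when a command settles (response, timeout, or socket close).
  leave(): void {
    if (this.depth > 0) this.depth -= 1;
  }

  get size(): number {
    return this.depth;
  }
}
```

The `warn` callback is where the per-IMEI log line would go, which keeps IMEIs out of metric labels.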
### Timeout default

30 seconds per command. Override via `commandTimeoutMs` in the `commands` row. This is separate from the Phase 2 design's `expires_at`, which is a clock-time deadline at the Directus level; the per-write timeout is the protocol-level "device didn't respond" deadline.

When the timeout fires, the queue rejects the outstanding promise with `CommandTimeoutError`. The next queued command becomes the outstanding one and proceeds.

## Acceptance criteria

- [ ] Two concurrent calls to `writeAck(buf1)` and `writeCommand(id, buf2)` produce bytes on the wire in submission order, no interleaving (verified with a TCP-level recording test).
- [ ] `writeCommand` blocks subsequent `writeCommand` calls until the first resolves or times out.
- [ ] `notifyResponse` correctly resolves the outstanding command's promise.
- [ ] Timeout firing rejects the outstanding promise; the next queued command starts.
- [ ] Queue depth is exported as `teltonika_command_queue_depth_total` (a gauge summed across sockets). Per-IMEI labels such as `teltonika_command_queue_depth{imei=...}` are forbidden by task 1.10's cardinality rule, so per-IMEI depth appears only in warn logs.
- [ ] On socket close, all pending command promises reject with `SocketClosedError`.

## Risks / open questions

- The "outstanding command" model assumes the device responds to commands in order, which Teltonika's protocol does (one outstanding per socket). If we discover devices that don't, we'd need correlation IDs, but the protocol doesn't carry them. The fallback would be to stop queueing altogether: allow one outstanding command and fail any further `writeCommand` fast.
- ACK write order vs response delivery: when a device sends an AVL frame and we're mid-command, the AVL frame's ACK queues behind the command bytes. Worst case: device receives ACK for AVL frame slightly later. Acceptable.

## Done

(Fill in once complete.)
@@ -0,0 +1,141 @@
# Task 2.4 — Command consumer (Redis stream reader)
**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** 2.1, 2.3
**Wiki refs:** `docs/wiki/concepts/phase-2-commands.md` § 9.6

## Goal

Each Ingestion instance runs a worker that consumes commands from `commands:outbound:{instance_id}`, looks up the local socket for the target IMEI, and dispatches the command to the appropriate codec encoder + write queue.

## Deliverables

- `src/adapters/teltonika/command-consumer.ts`:
  - `CommandConsumer` class with `start()` and `stop()` methods.
  - Internal: a registry of `imei → SocketWriteQueue` for sessions held by this instance.
  - Methods exposed to the session lifecycle: `attach(imei, queue)`, `detach(imei)`.
  - Reads commands via `XREADGROUP GROUP ingest {instance_id} COUNT 16 BLOCK 1000 STREAMS commands:outbound:{instance_id} >`.
  - Calls codec-specific encoder/handler based on the command's `codec` field.
  - On terminal outcome (delivered, responded, failed), publishes to `commands:responses`.
- `src/adapters/teltonika/responses.ts`:
  - `publishResponse({ commandId, status, response?, failureReason? })` writes to `commands:responses` via `XADD`.

## Specification

### Stream consumption
```ts
async start(): Promise<void> {
  // Ensure the consumer group exists. MKSTREAM creates the stream if absent.
  try {
    await this.redis.xgroup('CREATE', this.streamKey, 'ingest', '$', 'MKSTREAM');
  } catch (err: any) {
    // BUSYGROUP means the group already exists, which is fine.
    if (!err.message?.includes('BUSYGROUP')) throw err;
  }

  while (!this.stopping) {
    // '>' requests only entries never delivered to any consumer in the group.
    const messages = await this.redis.xreadgroup(
      'GROUP', 'ingest', this.instanceId,
      'COUNT', 16, 'BLOCK', 1000,
      'STREAMS', this.streamKey, '>',
    );
    if (!messages) continue; // BLOCK timed out with nothing to read
    for (const [, entries] of messages) {
      for (const [id, fields] of entries) {
        await this.handleCommand(id, fieldsToObject(fields));
      }
    }
  }
}
```

`fieldsToObject` converts Redis's flat `[key, value, key, value, ...]` array to a plain object.
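A minimal sketch of `fieldsToObject`, assuming the client returns stream fields as a flat array of strings. The even-length check is an added safeguard, not something the spec requires.

```ts
// Converts ['k1', 'v1', 'k2', 'v2'] into { k1: 'v1', k2: 'v2' }.
function fieldsToObject(fields: readonly string[]): Record<string, string> {
  if (fields.length % 2 !== 0) {
    throw new Error(`expected an even number of elements, got ${fields.length}`);
  }
  const out: Record<string, string> = {};
  for (let i = 0; i < fields.length; i += 2) {
    out[fields[i]] = fields[i + 1];
  }
  return out;
}
```

Every value comes back as a string, so numeric fields such as `codec` and `expires_at` need an explicit `Number(...)` conversion before they are compared or switched on.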
### Command field shape

Per the Phase 2 design, Directus's Flow publishes:

```
XADD commands:outbound:{instance_id} *
  command_id     <uuid>
  target_imei    <string>
  codec          12 | 14
  payload        <ASCII command text>
  expires_at     <unix-seconds>
```

### Dispatch
```ts
async handleCommand(streamId: string, cmd: CommandMessage): Promise<void> {
  const queue = this.sessions.get(cmd.target_imei);
  if (!queue) {
    await this.publishResponse({ commandId: cmd.command_id, status: 'failed', failureReason: 'socket_closed' });
    await this.redis.xack(this.streamKey, 'ingest', streamId);
    return;
  }
  // Stream field values arrive as strings; expires_at is in unix seconds.
  if (Date.now() / 1000 > Number(cmd.expires_at)) {
    await this.publishResponse({ commandId: cmd.command_id, status: 'failed', failureReason: 'expired_before_delivery' });
    await this.redis.xack(this.streamKey, 'ingest', streamId);
    return;
  }
  try {
    const frame = encodeCommand(cmd.codec, cmd.command_id, cmd.payload);
    const responseBuf = await queue.writeCommand(cmd.command_id, frame, /* timeoutMs */ 30_000);
    const parsed = parseCommandResponse(cmd.codec, responseBuf);
    await this.publishResponse({
      commandId: cmd.command_id,
      status: parsed.kind === 'ack' ? 'responded' : 'failed',
      response: parsed.kind === 'ack' ? parsed.text : undefined, // only ACKs carry text
      failureReason: parsed.kind === 'nack' ? 'imei_mismatch' : undefined,
    });
  } catch (err) {
    await this.publishResponse({ commandId: cmd.command_id, status: 'failed', failureReason: errToReason(err) });
  }
  // XACK only after a response was published; if publishing throws, the entry stays pending.
  await this.redis.xack(this.streamKey, 'ingest', streamId);
}
```

`encodeCommand` and `parseCommandResponse` come from tasks 2.5 (Codec 12) and 2.6 (Codec 14).
### `commands:responses` shape

```
XADD commands:responses *
  command_id      <uuid>
  status          delivered | responded | failed
  response        <ASCII response text, optional>
  failure_reason  socket_closed | expired_before_delivery | imei_mismatch | timeout | write_queue_full | ...
  responded_at    <ms>
```

### Lifecycle hooks

In the Teltonika session:

- After successful handshake: `commandConsumer.attach(imei, writeQueue)`.
- On socket close: `commandConsumer.detach(imei)`.
- The consumer must reject any in-flight command for a detached IMEI with `socket_closed`.

### Concurrency

The consumer reads up to 16 messages per `XREADGROUP` call and, as written in `start()`, handles them one at a time (a plain `for...of` loop that awaits each `handleCommand`). Because `handleCommand` awaits the device response, a slow device delays the commands behind it in the batch, even those for other IMEIs. If commands for different IMEIs should complete in parallel, start each `handleCommand` without awaiting it inside the loop; each IMEI has its own `SocketWriteQueue`, which still serializes commands within that IMEI.

## Acceptance criteria

- [ ] Publishing a command via `XADD commands:outbound:{instance_id}` causes the consumer to call `writeCommand` on the right session.
- [ ] If the IMEI is not held by this instance, the consumer publishes `failed` with `socket_closed` to `commands:responses` and ACKs the stream entry.
- [ ] If `expires_at` has passed, the consumer publishes `failed` with `expired_before_delivery` and ACKs.
- [ ] On `stop()`, the consumer drains the in-flight message and exits the read loop cleanly.
- [ ] `XACK` happens only after the response is published (or terminal failure recorded), so a crash mid-handler causes the command to be redelivered.

## Risks / open questions

- Crash mid-handler: the command was sent on the wire but we crashed before `XACK`. The entry stays in the group's Pending Entries List (PEL). Note that `XREADGROUP` with `>` returns only never-delivered entries, so after restart the consumer must first read its own pending entries (ID `0`) to see the command again. On redelivery the instance no longer has the device, so it publishes `socket_closed`. The result: command was delivered to the device but Directus thinks it failed. Operator re-issues. Acceptable v1; flagged in [[phase-2-commands]] as a sweeper concern. Idempotent device commands mitigate.
- Duplicate delivery via `XPENDING`: not handling the Pending Entries List explicitly in v1. Pending entries do not expire on their own; after a consumer crash they stay pending until acknowledged or claimed by another consumer in the group (`XCLAIM`/`XAUTOCLAIM`). Because we use `instance_id` as the consumer name, cross-instance claiming would deliver commands to an instance that does not hold the device. **Decision:** each instance is the only consumer in its own consumer group (group name = `ingest`, consumer name = `instance_id`, but stream is per-instance so no cross-claiming risk). Verify this matches the Directus-side publishing logic.

## Done

(Fill in once complete.)
@@ -0,0 +1,117 @@
# Task 2.5 — Codec 12 encoder + handler
**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** 2.3, 2.4
**Wiki refs:** `docs/wiki/sources/teltonika-data-sending-protocols.md` § Codec 12, `docs/wiki/concepts/phase-2-commands.md`

## Goal

Encode Codec 12 (`0x0C`) command frames for outbound delivery; parse Codec 12 response frames coming back from devices.

## Deliverables

- `src/adapters/teltonika/codec/command/codec12.ts`:
  - `encodeCodec12Command(payload: string): Buffer` produces the on-the-wire byte sequence.
  - `parseCodec12Response(buf: Buffer): { kind: 'ack'; text: string } | { kind: 'unexpected'; reason: string }` parses an inbound response frame.
  - A `codec12CommandHandler: CodecDataHandler` that the **inbound** framing layer (task 1.4) registers for codec ID `0x0C`. This handler does not produce `Position` records; it routes the response payload to the per-socket write queue's `notifyResponse`.
- Test file `test/codec12.test.ts` with at least:
  - The two canonical doc examples (`getinfo` request + response, `getio` request + response).
  - One synthetic command with non-ASCII bytes in the payload to verify how such bytes are encoded (the frame carries raw command bytes, not a hex string).

## Specification

### Frame structure (server → device)

```
[Preamble 4B = 0x00000000]
[DataSize 4B BE]            ← from CodecID through CmdQty2 inclusive
[CodecID 1B = 0x0C]
[CmdQty1 1B = 0x01]
[Type 1B = 0x05]            ← 0x05 = command from server
[CmdSize 4B BE]             ← length of command payload bytes
[Command X B]               ← ASCII command, encoded as raw bytes (NOT hex-encoded)
[CmdQty2 1B = 0x01]
[CRC 4B BE]                 ← CRC-16/IBM in the lower 2 bytes (upper 2 bytes zero); computed over CodecID..CmdQty2
```

Encoder pseudocode:
```ts
export function encodeCodec12Command(payload: string): Buffer {
  // Node encodes 'ascii' strings like 'latin1': one byte per character.
  const cmd = Buffer.from(payload, 'ascii');
  const cmdSize = cmd.length;
  const dataSize = 1 + 1 + 1 + 4 + cmdSize + 1; // CodecID + CmdQty1 + Type + CmdSize + Command + CmdQty2
  const out = Buffer.alloc(4 + 4 + dataSize + 4); // Preamble + DataSize + body + CRC
  let off = 0;
  out.writeUInt32BE(0, off); off += 4;          // Preamble
  out.writeUInt32BE(dataSize, off); off += 4;   // DataSize
  out.writeUInt8(0x0C, off); off += 1;          // CodecID
  out.writeUInt8(0x01, off); off += 1;          // CmdQty1
  out.writeUInt8(0x05, off); off += 1;          // Type: command from server
  out.writeUInt32BE(cmdSize, off); off += 4;    // CmdSize
  cmd.copy(out, off); off += cmdSize;           // Command bytes
  out.writeUInt8(0x01, off); off += 1;          // CmdQty2
  const body = out.subarray(8, 8 + dataSize); // CodecID through CmdQty2
  const crc = crc16Ibm(body); // 16-bit value; written as 4 bytes, so the upper 2 bytes are zero
  out.writeUInt32BE(crc, off);
  return out;
}
```

Verify against the canonical doc's `getinfo` example: input `getinfo` → output `000000000000000F0C010500000007676574696E666F0100004312`.
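The encoder relies on a `crc16Ibm` helper that this task does not spell out (it may already exist from Phase 1). A minimal bitwise sketch, assuming the usual CRC-16/IBM parameters (also catalogued as CRC-16/ARC): reflected polynomial `0xA001`, initial value `0x0000`, no final XOR.

```ts
// CRC-16/IBM (CRC-16/ARC): processes each byte least-significant bit first.
function crc16Ibm(data: Uint8Array): number {
  let crc = 0x0000;
  for (let i = 0; i < data.length; i += 1) {
    crc ^= data[i];
    for (let bit = 0; bit < 8; bit += 1) {
      // Shift right; when the bit shifted out is 1, fold in the polynomial.
      crc = (crc & 1) !== 0 ? (crc >>> 1) ^ 0xa001 : crc >>> 1;
    }
  }
  return crc & 0xffff;
}
```

The standard check input for this CRC variant is the ASCII string `123456789`, which yields `0xBB3D`; that makes a cheap sanity test alongside the `getinfo` frame comparison. A table-driven version would be faster but is unlikely to matter for command frames of this size.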
### Response structure (device → server)

Identical frame shape, but `Type = 0x06`:

```
[Preamble 4B][DataSize 4B][CodecID 0x0C][RspQty1 1B][Type 0x06][RspSize 4B][Response X B][RspQty2 1B][CRC 4B]
```

The response field is ASCII text, e.g. `INI:2019/7/22 7:22 RTC:...`.

Parser:
```ts
export function parseCodec12Response(body: Buffer): { kind: 'ack'; text: string } | { kind: 'unexpected'; reason: string } {
  // body is post-framing-layer: starts at CodecID
  // Minimum length: CodecID (1) + RspQty1 (1) + Type (1) + RspSize (4) + RspQty2 (1).
  if (body.length < 8) return { kind: 'unexpected', reason: `frame too short: ${body.length} bytes` };
  const codecId = body[0];
  if (codecId !== 0x0C) return { kind: 'unexpected', reason: `wrong codec 0x${codecId.toString(16)}` };
  const type = body[2]; // body[1] is RspQty1
  if (type !== 0x06) return { kind: 'unexpected', reason: `expected response type 0x06, got 0x${type.toString(16)}` };
  const rspSize = body.readUInt32BE(3);
  if (7 + rspSize + 1 > body.length) return { kind: 'unexpected', reason: `RspSize ${rspSize} exceeds frame` };
  const text = body.subarray(7, 7 + rspSize).toString('ascii');
  // body[7 + rspSize] is RspQty2; CRC was already validated upstream.
  return { kind: 'ack', text };
}
```
### Routing inbound responses to the right command

The inbound framing layer (task 1.4) sees a frame with codec `0x0C` and dispatches to `codec12CommandHandler`. That handler retrieves the session's `SocketWriteQueue` (from the session context) and calls `queue.notifyResponse(rawBody)`. The write queue's pending-command promise (the one returned by `writeCommand`) resolves with the body; the command consumer (task 2.4) then calls `parseCodec12Response` to extract the text.

This is the seam where Phase 2 plugs into Phase 1's framing layer. Phase 1 already supports it because:

1. The codec dispatch is a registry, so Phase 2 just registers a new handler.
2. Phase 1's handler interface returns `{ recordCount: number }` for ACK count. For Codec 12, **the device does not expect a record-count ACK**; responses are inherently their own ACK. The handler returns `{ recordCount: 0 }` and the framing layer's ACK send path skips the write when `recordCount` is 0. **Update task 1.4 to honor this** if not already.

> **Open question:** is `recordCount: 0` the right signal to skip ACK? Or should the handler interface return `{ ack: Buffer | null }` instead? The latter is cleaner. **Recommendation:** add an explicit `ack` return slot to `CodecDataHandler` in this task and update the data codec handlers to return `{ ack: makeRecordCountAck(n) }`. Phase 2's command handlers return `{ ack: null }`.

## Acceptance criteria

- [ ] `encodeCodec12Command('getinfo')` produces the canonical doc bytes exactly (compare hex strings).
- [ ] `parseCodec12Response` correctly decodes the doc's `getinfo` response into the `INI:2019/7/22...` ASCII string.
- [ ] An end-to-end test: simulate a device that responds to a Codec 12 command, verify the round-trip command_id → encoded frame → device response → parsed text → published to `commands:responses`.
- [ ] CRC of every encoded frame validates against `crc16Ibm`.
- [ ] An incoming Codec 12 frame with `Type != 0x06` is logged at warn (unexpected protocol direction) and not surfaced to the command consumer.

## Risks / open questions

- The interface change (returning `{ ack }` instead of `{ recordCount }`) is a Phase 1 retrofit. Cost: minor (three Phase 1 codec handlers update their return shape). Benefit: cleaner Phase 2 plug-in.
- The `getinfo` canonical CRC in the doc is `0x00004312`. Verify the encoder matches before declaring done.

## Done

(Fill in once complete.)
@@ -0,0 +1,118 @@
# Task 2.6 — Codec 14 encoder + ACK/nACK handler
**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** 2.5 (shares utility code), 2.3, 2.4
**Wiki refs:** `docs/wiki/sources/teltonika-data-sending-protocols.md` § Codec 14, `docs/wiki/concepts/phase-2-commands.md`

## Goal

Encode Codec 14 (`0x0E`) command frames with embedded IMEI; parse responses with both ACK (`0x06`) and **nACK (`0x11`)** types.

## Deliverables

- `src/adapters/teltonika/codec/command/codec14.ts`:
  - `encodeCodec14Command(imei: string, payload: string): Buffer`.
  - `parseCodec14Response(buf: Buffer): { kind: 'ack'; imei: string; text: string } | { kind: 'nack'; imei: string } | { kind: 'unexpected'; reason: string }`.
  - `codec14CommandHandler: CodecDataHandler` registered for codec ID `0x0E`.
- Test file `test/codec14.test.ts` covering: doc canonical example (`getver` round trip with both ACK and nACK responses).

## Specification

### Frame structure (server → device)

```
[Preamble 4B]
[DataSize 4B]               ← from CodecID through CmdQty2
[CodecID 1B = 0x0E]
[CmdQty1 1B = 0x01]
[Type 1B = 0x05]
[CmdSize 4B]                ← command bytes + 8 (IMEI size)
[IMEI 8B HEX]               ← e.g. IMEI 123456789123456 → 0x0123456789123456
[Command X B]               ← ASCII command bytes
[CmdQty2 1B = 0x01]
[CRC 4B]
```

**IMEI encoding rule:** the device IMEI is packed into 8 bytes, two decimal digits per byte. For a 15-digit IMEI like `352093081452251`, prepend a leading zero (`0352093081452251`) and parse as a 16-hex-char value → 8 bytes: `0x03 52 09 30 81 45 22 51`. **Not** ASCII like the handshake.
```ts
function imeiToHex(imei: string): Buffer {
  // Validate before padding so short or non-numeric input is rejected
  // instead of being silently zero-filled.
  if (!/^\d{15}$/.test(imei)) throw new Error(`bad IMEI: ${imei}`);
  // 15 digits → prepend "0" → 16 hex chars → 8 bytes
  return Buffer.from(`0${imei}`, 'hex');
}
```
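
The frame layout and IMEI rule above can be put together in a minimal encoder sketch. It assumes the trailing CRC is CRC-16/IBM (reflected polynomial `0xA001`, initial value 0) computed over CodecID through CmdQty2 and stored in a 4-byte field, as the Teltonika protocol documentation describes. The helper name `crc16Ibm` is illustrative and not part of the planned module.

```ts
import { Buffer } from 'node:buffer';

// CRC-16/IBM (also known as CRC-16/ARC): reflected polynomial 0xA001, init 0.
// Hypothetical helper name; task 2.5 is expected to provide the shared utility.
function crc16Ibm(data: Uint8Array): number {
  let crc = 0;
  for (let i = 0; i < data.length; i++) {
    crc ^= data[i];
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc & 1) !== 0 ? (crc >>> 1) ^ 0xa001 : crc >>> 1;
    }
  }
  return crc;
}

function encodeCodec14Command(imei: string, payload: string): Buffer {
  if (!/^\d{15}$/.test(imei)) throw new Error(`bad IMEI: ${imei}`);
  const imeiBytes = Buffer.from(`0${imei}`, 'hex'); // 8 bytes, leading zero nibble
  const command = Buffer.from(payload, 'ascii');

  // Data field: CodecID(1) CmdQty1(1) Type(1) CmdSize(4) IMEI(8) Command(X) CmdQty2(1)
  const data = Buffer.alloc(7 + imeiBytes.length + command.length + 1);
  let offset = 0;
  offset = data.writeUInt8(0x0e, offset); // CodecID
  offset = data.writeUInt8(0x01, offset); // CmdQty1
  offset = data.writeUInt8(0x05, offset); // Type: command
  offset = data.writeUInt32BE(imeiBytes.length + command.length, offset); // CmdSize
  offset += imeiBytes.copy(data, offset);
  offset += command.copy(data, offset);
  data.writeUInt8(0x01, offset); // CmdQty2

  // Full frame: Preamble(4, zeros) DataSize(4) Data CRC(4)
  const frame = Buffer.alloc(8 + data.length + 4);
  frame.writeUInt32BE(data.length, 4);
  data.copy(frame, 8);
  frame.writeUInt32BE(crc16Ibm(data), 8 + data.length);
  return frame;
}
```

For `encodeCodec14Command('352093081452251', 'getver')` the result is 34 bytes with `DataSize = 0x16` and `CmdSize = 0x0E`; the acceptance criteria give the full frame the real test should compare against.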

### Response structure (device → server)

Two cases:

**ACK** (`Type = 0x06`): IMEI matched; command executed.

```
[Preamble][DataSize][CodecID 0x0E][RspQty1][Type 0x06][RspSize][IMEI 8B][Response X B][RspQty2][CRC]
```

**nACK** (`Type = 0x11`): IMEI did not match; command not executed.

```
[Preamble][DataSize][CodecID 0x0E][RspQty1][Type 0x11][RspSize=0x08][IMEI 8B][RspQty2][CRC]
```

Note: nACK has `RspSize = 8` (the IMEI itself counts), no Response bytes.

### Parser

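```ts
export function parseCodec14Response(body: Buffer):
  | { kind: 'ack'; imei: string; text: string }
  | { kind: 'nack'; imei: string }
  | { kind: 'unexpected'; reason: string }
{
  // Offsets assume `body` starts at CodecID.
  // Minimum: CodecID(1) + RspQty1(1) + Type(1) + RspSize(4) + IMEI(8) + RspQty2(1) = 16 bytes.
  if (body.length < 16) return { kind: 'unexpected', reason: `short body: ${body.length} bytes` };
  const codecId = body[0];
  if (codecId !== 0x0E) return { kind: 'unexpected', reason: `wrong codec 0x${codecId.toString(16)}` };
  const type = body[2];
  const rspSize = body.readUInt32BE(3);
  // RspSize covers the 8 IMEI bytes plus the response text and must leave room for RspQty2.
  if (rspSize < 8 || 7 + rspSize + 1 > body.length) {
    return { kind: 'unexpected', reason: `bad RspSize ${rspSize}` };
  }
  const imeiHex = body.subarray(7, 15).toString('hex');
  // Strip only the single padding zero; /^0+/ would also eat zeros that belong to the IMEI.
  const imei = imeiHex.replace(/^0/, '');
  if (type === 0x06) {
    const text = body.subarray(15, 7 + rspSize).toString('ascii');
    return { kind: 'ack', imei, text };
  }
  if (type === 0x11) {
    // A nACK carries the IMEI only; any other size is malformed.
    if (rspSize !== 8) return { kind: 'unexpected', reason: `nACK RspSize ${rspSize}, expected 8` };
    return { kind: 'nack', imei };
  }
  return { kind: 'unexpected', reason: `unknown response type 0x${type.toString(16)}` };
}
```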

### Mapping to `commands:responses`

The command consumer (task 2.4) handles all three outcomes:

- `ack` → `status = 'responded'`, `response = text`.
- `nack` → `status = 'failed'`, `failure_reason = 'imei_mismatch'`. The command was *delivered* but rejected, an important nuance for operator dashboards.
- `unexpected` → `status = 'failed'`, `failure_reason = 'protocol_error'`.

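
The three-way mapping above could be expressed as one exhaustive switch. This is a sketch only: the `CommandResponseRecord` shape and the `toResponseRecord` name are hypothetical, and the real record published to `commands:responses` is defined by task 2.4.

```ts
type Codec14Result =
  | { kind: 'ack'; imei: string; text: string }
  | { kind: 'nack'; imei: string }
  | { kind: 'unexpected'; reason: string };

// Hypothetical record shape, limited to the fields named in the mapping.
interface CommandResponseRecord {
  status: 'responded' | 'failed';
  response?: string;
  failure_reason?: 'imei_mismatch' | 'protocol_error';
}

function toResponseRecord(result: Codec14Result): CommandResponseRecord {
  switch (result.kind) {
    case 'ack':
      return { status: 'responded', response: result.text };
    case 'nack':
      // Delivered to the device but rejected because the IMEI did not match.
      return { status: 'failed', failure_reason: 'imei_mismatch' };
    case 'unexpected':
      return { status: 'failed', failure_reason: 'protocol_error' };
  }
}
```

Keeping the mapping in a pure function lets the unit tests cover all three outcomes without a Redis connection.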

### Firmware version requirement

Codec 14 requires FMB.Ver.03.25.04.Rev.00 or newer. Older firmware will not understand the codec ID and may close the connection. The Phase 2 design relies on Directus knowing which devices support which codecs (potentially a `firmware_version` column on a `devices` collection). The Ingestion service does not enforce this; it just sends what it's told.

> **Open question:** should we expose a metric `teltonika_codec14_unexpected_total` to detect cases where Codec 14 was sent but the device closed the connection (suggesting outdated firmware)? Probably yes; add to task 2.6 deliverables and the metrics inventory.

## Acceptance criteria

- [ ] `encodeCodec14Command('352093081452251', 'getver')` produces the canonical doc bytes exactly: `00000000000000160E01050000000E0352093081452251676574766572010000D2C1`.
- [ ] `parseCodec14Response` correctly decodes the doc's ACK response (IMEI + version string).
- [ ] `parseCodec14Response` correctly decodes the doc's nACK response (IMEI mismatch case).
- [ ] An end-to-end test simulates a device that ACKs Codec 14 and a device that nACKs; verify both terminal statuses land in `commands:responses` correctly.
- [ ] IMEI HEX encoding round-trips through `imeiToHex` and the response parser.

## Risks / open questions

- A nACK whose `RspSize` is not 8 is malformed; fail safe (treat it as `unexpected`) rather than read past buffer bounds.
- Should the Ingestion service also log the IMEI from the nACK response (which is the *server's claim*) and compare to the *actual* IMEI of the connection (from handshake)? If they differ, something is seriously wrong. **Yes: log at error level if they differ.** Add to acceptance criteria.

## Done

(Fill in once complete.)

@@ -0,0 +1,65 @@
# Phase 2 — Outbound commands

Add server-to-device command delivery using Teltonika codecs 12 (`0x0C`) and 14 (`0x0E`). Codec 13 is one-way device→server (not in scope for outbound); codec 15 is FMX6-only (out of scope entirely).

## Prerequisite

Phase 1 must be complete and stable in production. Phase 2 adds code *alongside* Phase 1, never in the inbound parsing path.

## Outcome statement

When Phase 2 is done:

- Each Ingestion instance maintains its IMEI→instance mapping in `connections:registry` (Redis hash) and a heartbeat key.
- A Directus Flow triggered by inserts into the `commands` table can publish a command to `commands:outbound:{instance_id}` after looking up the routing.
- Each Ingestion instance runs a command consumer in parallel with the TCP listener; consumed commands are dispatched to the right per-socket write queue, encoded as Codec 12 or 14, and written to the device.
- Device responses (Codec 12 Type `0x06` or Codec 14 Type `0x06`/`0x11`) are correlated to the in-flight command and published to `commands:responses` for Directus to update the row.
- The TCP read path is never blocked by outbound work.
- Phase 1 code is unchanged.

## Architectural anchors

`docs/wiki/concepts/phase-2-commands.md` is the design source of truth. Read it before starting any Phase 2 task.

Key invariants:

1. **Ingestion never exposes user-facing HTTP.** All command authorization happens in Directus.
2. **Commands are data before transport.** Every command has a row in Directus's `commands` table before it ever reaches Redis.
3. **One outstanding command per device socket.** Teltonika command codecs have no correlation ID; the protocol assumes serialization. Subsequent commands queue on the per-socket write queue.
4. **Per-instance routing.** Only the Ingestion instance currently holding a device's socket can deliver commands to it. The connection registry exists so Directus knows which instance to publish to.

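
Invariant 3 is the one that shapes the most code, so here is a minimal sketch of how a per-socket queue might enforce it. The class and method names are hypothetical (the real design belongs to tasks 2.3 and 2.4), and response timeouts and socket-close handling are deliberately left out.

```ts
interface QueuedCommand {
  id: string;        // Directus command row ID (assumed identifier)
  frame: Uint8Array; // already-encoded Codec 12/14 frame
}

// Hypothetical sketch: at most one command is in flight per socket; the rest wait.
class SerializedCommandQueue {
  private readonly pending: QueuedCommand[] = [];
  private inFlight: QueuedCommand | null = null;
  private readonly write: (frame: Uint8Array) => void;

  constructor(write: (frame: Uint8Array) => void) {
    this.write = write;
  }

  enqueue(cmd: QueuedCommand): void {
    this.pending.push(cmd);
    this.pump();
  }

  // Call when a device response arrives. Because the protocol has no
  // correlation ID, the response is attributed to the in-flight command.
  onResponse(): QueuedCommand | null {
    const answered = this.inFlight;
    this.inFlight = null;
    this.pump();
    return answered;
  }

  // Write the next queued frame only if nothing is awaiting a response.
  private pump(): void {
    if (this.inFlight !== null) return;
    const next = this.pending.shift();
    if (next === undefined) return;
    this.inFlight = next;
    this.write(next.frame);
  }
}
```

A second `enqueue` while a command is in flight writes nothing; the queued frame goes out only after `onResponse` clears the slot, which is what makes positional correlation safe.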

## Sequencing

```
2.1 Connection registry & heartbeat ─┐
2.2 Registry janitor                 ├─→ 2.4 Command consumer ─┐
2.3 Per-socket write queue ──────────┘                         ├─→ 2.5 Codec 12 handler
                                                               └─→ 2.6 Codec 14 handler
```

Tasks 2.1, 2.2, and 2.3 can be done in parallel; they are independent infrastructure pieces. 2.5 and 2.6 can be parallelized once 2.4 lands.

## Files added

Phase 2 introduces these new files (no Phase 1 file is modified except `src/main.ts` to wire in the command consumer):

```
src/
├── adapters/teltonika/
│   ├── codec/command/
│   │   ├── codec12.ts          ← NEW (encoder + response parser)
│   │   └── codec14.ts          ← NEW (encoder + ACK/nACK parser)
│   └── command-consumer.ts     ← NEW (stream reader, dispatch)
├── core/
│   ├── connection-registry.ts  ← NEW
│   ├── write-queue.ts          ← NEW
│   └── janitor.ts              ← NEW (separate small process or in-process worker)
└── main.ts                     ← updated to start consumer + registry
```

`src/adapters/teltonika/codec/command/` already exists from Phase 1 (empty placeholder); Phase 2 fills it.

## Out of scope for this phase

- The Directus side of the system (`commands` table, Flows, sweeper) is owned by the Directus repo, not this one. Phase 2 in this repo only handles the Ingestion-side consumer and writer behavior.
- The pending-command sweeper runs in Directus, not Ingestion. Ingestion publishes the terminal status (`delivered`, `responded`, or `failed` with a reason) and Directus updates the row.