julian c8a5f4cd68 Add Phase 1 and Phase 2 planning documents
ROADMAP plus granular task files per phase. Phase 1 (12 tasks + 1.13
device authority) covers Codec 8/8E/16 telemetry ingestion; Phase 2
(6 tasks) covers Codec 12/14 outbound commands; Phase 3 enumerates
deferred items.
2026-04-30 15:50:49 +02:00

Task 2.2 — Registry janitor

Phase: 2 (Outbound commands)
Status: Not started
Depends on: 2.1
Wiki refs: docs/wiki/concepts/phase-2-commands.md § 9.3

Goal

Periodically clear stale entries from connections:registry whose owning instance has died (heartbeat expired) without graceful cleanup.

Deliverables

  • src/core/janitor.ts: Janitor class with a run() method that performs one cleanup pass.
  • A decision on where the janitor runs: in-process (every Ingestion instance runs it, either behind leader election or with idempotent cleanup) or as a separate small process. Recommendation: in-process, idempotent. This keeps ops simple and needs no leader election. The cost is that each of the N instances repeats the same work, but a registry pass is O(N_devices) and fast.
  • Wired into src/main.ts as a 60-second ticker.
  • Metric: teltonika_registry_janitor_evicted_total{instance_id=...} counter.

Specification

Algorithm (per pass)

1. entries = HGETALL connections:registry
2. unique_instance_ids = unique values from entries
3. For each instance_id in unique_instance_ids:
     alive = EXISTS instance:heartbeat:{instance_id}
     If !alive:
       For each (imei, owner) in entries where owner == instance_id:
         removed = HDEL connections:registry imei
         If removed == 1:
           metrics.evicted.inc({ instance_id })
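
The steps above can be sketched in TypeScript roughly as follows. This is a minimal sketch, not the final Janitor class: the RegistryStore interface and the runJanitorPass name are hypothetical, chosen so the example stays self-contained (a client such as ioredis exposes hgetall, exists and hdel with compatible shapes). The function returns per-instance counts of actual deletes, which the caller could feed into the eviction counter.

```typescript
// Hypothetical minimal slice of a Redis client; only what one pass needs.
interface RegistryStore {
  hgetall(key: string): Promise<Record<string, string>>;
  exists(key: string): Promise<number>;
  hdel(key: string, field: string): Promise<number>;
}

const REGISTRY_KEY = "connections:registry";

// Runs one cleanup pass. Returns, per dead instance, how many entries this
// call actually removed (entries already removed by another janitor count 0).
async function runJanitorPass(store: RegistryStore): Promise<Map<string, number>> {
  const entries = await store.hgetall(REGISTRY_KEY);

  // Group IMEIs by owning instance so each heartbeat is checked only once.
  const byOwner = new Map<string, string[]>();
  for (const [imei, owner] of Object.entries(entries)) {
    const imeis = byOwner.get(owner) ?? [];
    imeis.push(imei);
    byOwner.set(owner, imeis);
  }

  const evicted = new Map<string, number>();
  for (const [instanceId, imeis] of byOwner) {
    const alive = (await store.exists(`instance:heartbeat:${instanceId}`)) === 1;
    if (alive) continue;

    let removed = 0;
    for (const imei of imeis) {
      // HDEL returns the number of fields removed (0 or 1 here).
      removed += await store.hdel(REGISTRY_KEY, imei);
    }
    if (removed > 0) evicted.set(instanceId, removed);
  }
  return evicted;
}
```

Returning counts instead of touching a metrics object directly keeps the pass easy to test against an in-memory fake.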

Use HSCAN instead of HGETALL if the registry is large (>10k entries) to avoid blocking Redis. For Phase 2's expected scale, HGETALL is fine.
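
If the registry does outgrow HGETALL, the read step could be swapped for a cursor loop along these lines. This is a sketch under assumptions: the HashScanner interface and the readRegistryIncrementally name are hypothetical, the hscan signature mirrors the [nextCursor, flatFieldValueList] reply shape that ioredis returns, and the batch size of 500 is an arbitrary illustrative value. HSCAN may return the same field more than once, so results are collected into a Map, which deduplicates them.

```typescript
// Hypothetical minimal interface for cursor-based hash iteration.
interface HashScanner {
  hscan(
    key: string,
    cursor: string,
    countToken: "COUNT",
    count: number,
  ): Promise<[string, string[]]>;
}

// Reads the whole hash in batches instead of one blocking HGETALL.
// COUNT is only a hint to Redis; batches may be larger or smaller.
async function readRegistryIncrementally(
  store: HashScanner,
  key: string = "connections:registry",
  batch: number = 500, // illustrative value, not from the spec
): Promise<Map<string, string>> {
  const entries = new Map<string, string>();
  let cursor = "0";
  do {
    const [next, flat] = await store.hscan(key, cursor, "COUNT", batch);
    // The reply alternates field, value, field, value, ...
    for (let i = 0; i + 1 < flat.length; i += 2) {
      entries.set(flat[i], flat[i + 1]);
    }
    cursor = next;
  } while (cursor !== "0"); // Redis signals completion with cursor "0".
  return entries;
}
```

Because the scan is not a point-in-time snapshot, entries added or removed mid-scan may or may not appear, which makes the delete-time check described under "Race with re-registration" more important, not less.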

Idempotence

Multiple instances running the janitor in parallel may each attempt to delete the same stale entry. HDEL is idempotent: the second call returns 0 and is harmless. To avoid double-counting, log the eviction and increment the metric only when HDEL actually removed the field (return value 1).

Race with re-registration

Sequence to consider:

  1. Instance A dies; heartbeat expires.
  2. Janitor on Instance B starts a pass. Sees A's entries, A's heartbeat is gone.
  3. Device that was on A reconnects to Instance C.
  4. Instance C calls HSET connections:registry <imei> C.
  5. Janitor on B is mid-pass and calls HDEL connections:registry <imei>.

Result: the janitor deletes the device's entry moments after C registered it. Routing for that device stays broken until it reconnects or is registered again.

Mitigation: the janitor must check the entry value at delete time, not just at scan time:

for (const imei of evictTargets) {
  // Re-read the value; only delete if still pointing at the dead instance.
  const current = await redis.hget('connections:registry', imei);
  if (current === deadInstanceId) {
    await redis.hdel('connections:registry', imei);
  }
}

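This "check-and-delete" is not atomic: a re-registration can still land between the HGET and the HDEL, but the window shrinks from the length of a whole pass to a single Redis round trip. For full atomicity, use a Lua script. Recommendation: ship the non-atomic version first; upgrade to Lua if the race causes operational issues.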
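
For reference, the Lua upgrade could look roughly like the sketch below. Redis executes a script atomically, so the owner comparison and the delete happen as one step. The ScriptRunner interface and the evictIfOwnedBy name are hypothetical; the eval call shape (script, number of keys, then keys and arguments) is assumed to match the client in use, as it does for ioredis.

```typescript
// Compare the current owner and delete the field in a single atomic step.
// KEYS[1] = registry hash, ARGV[1] = IMEI, ARGV[2] = dead instance id.
const EVICT_IF_OWNED_BY = `
if redis.call('HGET', KEYS[1], ARGV[1]) == ARGV[2] then
  return redis.call('HDEL', KEYS[1], ARGV[1])
end
return 0
`;

// Hypothetical minimal interface for a client that can run Lua scripts.
interface ScriptRunner {
  eval(script: string, numKeys: number, ...args: string[]): Promise<unknown>;
}

// Returns true only if this call removed the entry, so the caller can
// log and count the eviction exactly once.
async function evictIfOwnedBy(
  redis: ScriptRunner,
  imei: string,
  deadInstanceId: string,
): Promise<boolean> {
  const removed = await redis.eval(
    EVICT_IF_OWNED_BY,
    1,
    "connections:registry",
    imei,
    deadInstanceId,
  );
  return removed === 1;
}
```

If the device has already re-registered on another instance, the comparison fails inside Redis and the script returns 0 without touching the entry.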

Pace

Run every 60 seconds (configurable via JANITOR_INTERVAL_MS). One pass costs at most one HGETALL, one EXISTS per distinct instance ID in the registry, and (rarely) one HGET plus one HDEL per stale entry. Negligible Redis load.
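
One way the ticker might be wired is sketched below. The helper names (parseIntervalMs, makeTick, startJanitor) are hypothetical. Skipping a tick while the previous pass is still running is an assumption of this sketch rather than something the spec requires, as is falling back to 60 seconds when JANITOR_INTERVAL_MS is missing or invalid.

```typescript
const DEFAULT_JANITOR_INTERVAL_MS = 60000;

// Parses JANITOR_INTERVAL_MS; missing, non-numeric, or non-positive values
// fall back to the 60-second default.
function parseIntervalMs(raw: string | undefined): number {
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0
    ? Math.floor(parsed)
    : DEFAULT_JANITOR_INTERVAL_MS;
}

// Wraps run() so that a tick is skipped while a pass is still in flight.
// Resolves to true if a pass was started, false if the tick was skipped.
function makeTick(
  run: () => Promise<void>,
  onError: (err: unknown) => void,
): () => Promise<boolean> {
  let running = false;
  return async () => {
    if (running) return false;
    running = true;
    try {
      await run();
    } catch (err) {
      onError(err); // A failed pass must not stop future ticks.
    } finally {
      running = false;
    }
    return true;
  };
}

// Starts the periodic janitor. In src/main.ts this might be called as
// startJanitor(() => janitor.run(), parseIntervalMs(process.env.JANITOR_INTERVAL_MS), logError),
// where janitor and logError are placeholders.
function startJanitor(
  run: () => Promise<void>,
  intervalMs: number,
  onError: (err: unknown) => void,
): ReturnType<typeof setInterval> {
  const tick = makeTick(run, onError);
  return setInterval(() => { void tick(); }, intervalMs);
}
```

The handle returned by startJanitor can be passed to clearInterval during graceful shutdown.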

Acceptance criteria

  • Killing an Ingestion instance without graceful shutdown: within about 2.5 minutes in the worst case (heartbeat TTL of 90s plus up to one 60s janitor interval), all of that instance's registry entries are gone.
  • If the dying instance restarts and re-registers a device before the janitor evicts it, the new (live) entry is preserved (verified by the check-and-delete logic).
  • Two janitors running in parallel: total deletes are correct, no double-counting in metrics.
  • teltonika_registry_janitor_evicted_total increments by the right amount per pass.

Risks / open questions

  • The check-and-delete race window: small but real. If operationally observed, upgrade to Lua. Document the trade-off in OPERATIONS.md.
  • Should the janitor be a separate process? Pros: cleaner separation; can be sized differently. Cons: another deployable, another monitoring target. Defer to operational feedback.

Done

(Fill in once complete.)