Add Phase 1 and Phase 2 planning documents
ROADMAP plus granular task files per phase. Phase 1 (12 tasks + 1.13 device authority) covers Codec 8/8E/16 telemetry ingestion; Phase 2 (6 tasks) covers Codec 12/14 outbound commands; Phase 3 enumerates deferred items.
# Task 2.2 — Registry janitor

**Phase:** 2 — Outbound commands
**Status:** ⬜ Not started
**Depends on:** 2.1
**Wiki refs:** `docs/wiki/concepts/phase-2-commands.md` § 9.3

## Goal

Periodically clear stale entries from `connections:registry` whose owning instance has died (heartbeat expired) without graceful cleanup.

## Deliverables

- `src/core/janitor.ts` — `Janitor` class with a `run()` method that performs one cleanup pass.
- A choice: run the janitor in-process (every Ingestion instance runs it, with leader election or with idempotent cleanup) or as a separate small process. **Recommendation: in-process, idempotent.** Simpler ops, no leader election; the cost is N instances each doing the work, but a registry pass is O(N_devices) and fast.
- Wired into `src/main.ts` as a 60-second ticker.
- Metric: `teltonika_registry_janitor_evicted_total{instance_id=...}` counter (a possible shape is sketched after this list).

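
A possible shape for that counter, assuming `prom-client` as the metrics library (the task does not name one); the export name is illustrative:

```ts
import { Counter } from 'prom-client';

// Hypothetical counter matching the metric named above; labelled by the dead
// instance whose registry entries were evicted.
export const registryJanitorEvicted = new Counter({
  name: 'teltonika_registry_janitor_evicted_total',
  help: 'Registry entries evicted because the owning instance heartbeat expired',
  labelNames: ['instance_id'] as const,
});
```
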
## Specification

### Algorithm (per pass)

```
1. entries = HGETALL connections:registry
2. unique_instance_ids = unique values from entries
3. For each instance_id in unique_instance_ids:
     alive = EXISTS instance:heartbeat:{instance_id}
     If !alive:
       For each (imei, owner) in entries where owner == instance_id:
         HDEL connections:registry imei
         metrics.evicted.inc({ instance_id })
```

Use `HSCAN` instead of `HGETALL` if the registry is large (>10k entries) to avoid blocking Redis. For Phase 2's expected scale, `HGETALL` is fine.

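
A minimal TypeScript sketch of one pass, assuming an ioredis-style Redis client (matching the `redis.hget`/`redis.hdel` calls used later in this task); the class shape is illustrative, not the final `src/core/janitor.ts` API:

```ts
import type { Redis } from 'ioredis'; // assumption: an ioredis-style client

// One cleanup pass over connections:registry, following the algorithm above.
// Metrics and the check-and-delete refinement are covered in the sections below.
export class Janitor {
  constructor(private readonly redis: Redis) {}

  async run(): Promise<number> {
    // 1. Snapshot the registry: imei -> owning instance_id.
    const entries = await this.redis.hgetall('connections:registry');

    // 2. Distinct instance ids referenced by the registry.
    const instanceIds = new Set(Object.values(entries));

    let evicted = 0;
    for (const instanceId of instanceIds) {
      // 3. An owner is alive iff its heartbeat key still exists.
      const alive = await this.redis.exists(`instance:heartbeat:${instanceId}`);
      if (alive) continue;

      // Dead owner: delete every registry entry still pointing at it.
      for (const [imei, owner] of Object.entries(entries)) {
        if (owner !== instanceId) continue;
        // HDEL returns the number of fields actually removed (0 or 1).
        evicted += await this.redis.hdel('connections:registry', imei);
      }
    }
    return evicted;
  }
}
```
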
### Idempotence

Multiple instances running the janitor in parallel may both attempt to delete the same stale entry. `HDEL` is idempotent — the second call returns 0 and is harmless. Just make sure metrics and logs don't double-count: only count an eviction on an actual delete (`HDEL` returned > 0).

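
For example (variable names borrowed from the snippets in this task; the `metrics` object is assumed), increment only when `HDEL` reports a removal:

```ts
const removed = await redis.hdel('connections:registry', imei);
if (removed === 1) {
  // Only the janitor whose HDEL actually removed the field counts it,
  // so parallel passes never double-count.
  metrics.evicted.inc({ instance_id: deadInstanceId });
}
```
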
### Race with re-registration

Sequence to consider:

1. Instance A dies; heartbeat expires.
2. Janitor on Instance B starts a pass. It sees A's entries; A's heartbeat is gone.
3. Device that was on A reconnects to Instance C.
4. Instance C calls `HSET connections:registry <imei> C`.
5. Janitor on B is mid-pass and calls `HDEL connections:registry <imei>`.

Result: the device's entry is deleted moments after C registered it. Device routing is broken until the next reconnect or registration.

**Mitigation:** the janitor must check the entry value at delete time, not just at scan time:

```ts
for (const imei of evictTargets) {
  // Re-read the value; only delete if still pointing at the dead instance.
  const current = await redis.hget('connections:registry', imei);
  if (current === deadInstanceId) {
    await redis.hdel('connections:registry', imei);
  }
}
```

This is "check-and-delete" — not atomic but the window is small. For full atomicity, use a Lua script. **Recommendation: ship the non-atomic version first; upgrade to Lua if the race causes operational issues.**
|
||||
|
||||
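
If that upgrade is ever needed, a compare-and-delete script could look roughly like this (sketch; assumes the same ioredis-style client and its `eval(script, numKeys, ...keysAndArgs)` call form):

```ts
// Atomic compare-and-delete: remove the field only if it still maps to the dead instance.
const COMPARE_AND_DELETE = `
if redis.call('HGET', KEYS[1], ARGV[1]) == ARGV[2] then
  return redis.call('HDEL', KEYS[1], ARGV[1])
end
return 0
`;

for (const imei of evictTargets) {
  const removed = (await redis.eval(
    COMPARE_AND_DELETE,
    1,                       // one key follows
    'connections:registry',  // KEYS[1]
    imei,                    // ARGV[1]
    deadInstanceId,          // ARGV[2]
  )) as number;
  if (removed === 1) {
    metrics.evicted.inc({ instance_id: deadInstanceId });
  }
}
```
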
### Pace

Run every 60 seconds (configurable via `JANITOR_INTERVAL_MS`). One pass costs at most one `HGETALL`, one `EXISTS` per unique instance id, and (rarely) one `HDEL` per evicted entry. Negligible Redis load.

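
One possible wiring in `src/main.ts` (sketch; `janitor` and the error handling are placeholders for whatever the service already uses):

```ts
// 60 s default, overridable via JANITOR_INTERVAL_MS.
const intervalMs = Number(process.env.JANITOR_INTERVAL_MS ?? 60_000);

const ticker = setInterval(() => {
  // Passes are idempotent, so a failed or overlapping pass is safe to skip and retry next tick.
  janitor.run().catch((err) => console.error('registry janitor pass failed', err));
}, intervalMs);

// Don't let the janitor alone keep the process alive during shutdown.
ticker.unref();
```
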
## Acceptance criteria

- [ ] Killing an Ingestion instance without graceful shutdown: within ~2.5 minutes (90 s heartbeat TTL + up to one 60 s janitor interval), all of that instance's registry entries are gone.
- [ ] If the dying instance restarts and re-registers a device before the janitor evicts it, the new (live) entry is preserved (verified by the check-and-delete logic).
- [ ] Two janitors running in parallel: total deletes are correct, no double-counting in metrics.
- [ ] `teltonika_registry_janitor_evicted_total` increments by the right amount per pass.

## Risks / open questions

- The check-and-delete race window: small but real. If operationally observed, upgrade to Lua. Document the trade-off in `OPERATIONS.md`.
- Should the janitor be a separate process? Pros: cleaner separation; can be sized differently. Cons: another deployable, another monitoring target. **Defer to operational feedback.**

## Done

(Fill in once complete.)