# Task 1.13 — Device authority (Redis allow-list refresher)
**Phase:** 1 — Inbound telemetry
**Status:** ⏸ Paused — deferred until after the real-device pilot test, AND until Directus has a `devices` collection publishing the allow-list to Redis. See ROADMAP.md "Deferred" section. The `DeviceAuthority` seam exists with `AllowAllAuthority` (default, in `src/adapters/teltonika/device-authority.ts`); this task adds `RedisAllowListAuthority`.
**Depends on:** 1.4 (DeviceAuthority seam), 1.10 (metrics)
**Wiki refs:** `docs/wiki/concepts/plane-separation.md`, `docs/wiki/entities/directus.md`, `docs/wiki/entities/redis-streams.md`
## Goal

Provide a real `DeviceAuthority` implementation that classifies an IMEI as `known` or `unknown` by consulting an allow-list **published from Directus into Redis** and cached in memory in each Ingestion instance. This is the operational link between the business plane (where the source-of-truth `devices` collection lives) and the telemetry plane (where Ingestion makes its handshake decisions).

## Non-goals

- Not a security boundary. Real device security is enforced at the network level and by downstream filtering. This list is a **soft signal** for observability and, optionally, a hard reject under `STRICT_DEVICE_AUTH`.
- Not a real-time check. The list is cached locally and refreshed periodically, so a newly provisioned device takes effect within one refresh interval.

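The soft-signal versus hard-reject distinction can be sketched as a small pure function. This is only an illustration: `decideHandshake` and `HandshakeDecision` are hypothetical names, and the sketch assumes the Teltonika IMEI handshake convention in which the server replies `0x01` to accept and `0x00` to reject.

```typescript
// Hypothetical helper (not part of the existing codebase): maps an authority
// verdict plus the STRICT_DEVICE_AUTH flag to a handshake reply.
type Verdict = 'known' | 'unknown';

interface HandshakeDecision {
  ackByte: 0 | 1;       // assumed convention: 0x01 accepts the IMEI, 0x00 rejects it
  flagUnknown: boolean; // true when the device should be surfaced in logs/metrics
}

function decideHandshake(verdict: Verdict, strict: boolean): HandshakeDecision {
  if (verdict === 'known') {
    return { ackByte: 1, flagUnknown: false };
  }
  // Unknown device: always flagged, but only rejected when strict mode is on.
  return { ackByte: strict ? 0 : 1, flagUnknown: true };
}
```

With `strict` left at `false`, an unknown device is still accepted and only shows up as a signal; the reject path is opt-in.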
## Deliverables

- `src/adapters/teltonika/redis-allow-list-authority.ts`:
  - `RedisAllowListAuthority` implementing `DeviceAuthority`.
  - In-memory `Set<string>` of allowed IMEIs.
  - Refresh worker that pulls from Redis on a configurable cadence.
  - `start()` awaits an initial fetch (so the cache is warm before the TCP listener accepts connections) and then starts the periodic refresh.
  - `stop()` halts the refresh ticker.
- `src/main.ts` updated:
  - Read the `DEVICE_AUTHORITY_MODE` env var (`allow_all` | `redis_allow_list`, default `allow_all`).
  - Construct the appropriate authority and pass it into the adapter context.
- Documentation in `OPERATIONS.md` (task 1.12): a "Device authority" section describing the env vars, refresh cadence, and Directus contract.

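The `src/main.ts` wiring could look roughly like the sketch below. It is self-contained for illustration only: the `DeviceAuthority` and `AllowAllAuthority` definitions here are minimal stand-ins for the real ones in `device-authority.ts`, and `createAuthority`, `Startable`, and the injected `makeRedisAuthority` callback are hypothetical names.

```typescript
// Stand-ins so the sketch compiles on its own; the real types live in the repo.
type Verdict = 'known' | 'unknown';
interface DeviceAuthority {
  check(imei: string): Promise<Verdict>;
}
interface Startable {
  start(): Promise<void>;
  stop(): void;
}

class AllowAllAuthority implements DeviceAuthority {
  async check(_imei: string): Promise<Verdict> {
    return 'known';
  }
}

type AuthorityMode = 'allow_all' | 'redis_allow_list';

// Hypothetical factory. The Redis-backed authority is built lazily through
// an injected callback so that allow_all mode never needs a Redis client.
function createAuthority(
  mode: AuthorityMode,
  makeRedisAuthority: () => DeviceAuthority & Startable,
): { authority: DeviceAuthority; lifecycle?: Startable } {
  if (mode === 'redis_allow_list') {
    const redisAuthority = makeRedisAuthority();
    return { authority: redisAuthority, lifecycle: redisAuthority };
  }
  return { authority: new AllowAllAuthority() };
}
```

In this arrangement `main.ts` would `await lifecycle?.start()` before opening the TCP listener and call `lifecycle?.stop()` during shutdown.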
## Specification

### Redis contract

The Ingestion side reads from a single Redis key. Two shapes are viable; pick one and stick with it.

**Option 1: Redis Set.** Simple, idiomatic for membership checks.

```
SADD devices:allowed <imei1> <imei2> ...
SMEMBERS devices:allowed            # what the refresher reads
SISMEMBER devices:allowed <imei>    # what an on-demand check would do (we do not use this; we cache)
```

**Option 2: Redis Hash with metadata per device.** Useful if downstream wants more than membership (e.g. device model, firmware version, owner).

```
HSET devices:allowed <imei> '{"model":"FMB920","fw":"03.27"}'
HGETALL devices:allowed
```

**Recommendation: Option 1 (Set).** Membership is the only signal Ingestion uses; metadata belongs in Directus where it's queryable. If a future task needs metadata in Ingestion, switch to Option 2.

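If Option 2 is ever adopted, the reader side would need to turn the `HGETALL` reply into a lookup structure. The sketch below assumes the Redis client returns the reply as a plain field-to-value object; `DeviceMeta` and `parseAllowListHash` are hypothetical names, and the metadata fields simply mirror the `HSET` example.

```typescript
// Hypothetical metadata shape, mirroring the HSET example.
interface DeviceMeta {
  model?: string;
  fw?: string;
}

// Converts an HGETALL-style reply ({ imei: jsonString }) into a Map.
function parseAllowListHash(reply: Record<string, string>): Map<string, DeviceMeta> {
  const out = new Map<string, DeviceMeta>();
  for (const [imei, raw] of Object.entries(reply)) {
    try {
      const parsed: unknown = JSON.parse(raw);
      const isObject = typeof parsed === 'object' && parsed !== null;
      out.set(imei, isObject ? (parsed as DeviceMeta) : {});
    } catch {
      // Malformed metadata must not drop the device: membership is what matters.
      out.set(imei, {});
    }
  }
  return out;
}
```

Membership then becomes `map.has(imei)`, so the `check` path stays the same shape as with the Set.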
### Directus → Redis sync (out of scope for this task)

This task implements the **Ingestion-side reader**. The Directus-side publisher is a separate piece of work in the Directus repo:

- A `devices` collection in Directus with at least `imei` and `active` fields.
- A Directus Flow or hook that, on `items.create | items.update | items.delete` of `devices`, updates the Redis Set:
  - Active inserted/updated → `SADD devices:allowed <imei>`.
  - Deleted or `active=false` → `SREM devices:allowed <imei>`.
- A periodic full resync (e.g. nightly cron) that snapshots the collection into Redis to recover from any drift: `DEL devices:allowed`, then `SADD devices:allowed <imei1> ... <imeiN>`. The two steps should run atomically (for example inside `MULTI`/`EXEC`, or by building a temporary key and `RENAME`-ing it over `devices:allowed`); otherwise a refresh that lands between them reads an empty Set.

Document this contract in the Ingestion repo's `OPERATIONS.md` so on-call understands the dependency, but the implementation lives in Directus.

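The publisher contract can be made concrete by listing the Redis commands it would issue. The sketch below only builds command arrays and talks to nothing; `commandForEvent`, `buildResyncCommands`, `DeviceRecord`, and the `:staging` key suffix are hypothetical, and a single-node Redis is assumed (in Redis Cluster, `RENAME` requires both keys to hash to the same slot).

```typescript
type RedisCommand = string[];

interface DeviceRecord {
  imei: string;
  active: boolean;
}

// Incremental path: one command per Directus item event.
function commandForEvent(
  event: 'create' | 'update' | 'delete',
  device: DeviceRecord,
  key = 'devices:allowed',
): RedisCommand {
  const allowed = event !== 'delete' && device.active;
  return [allowed ? 'SADD' : 'SREM', key, device.imei];
}

// Full resync: stage into a temporary key, then RENAME it over the live key
// so readers never observe a half-built or empty Set.
function buildResyncCommands(devices: DeviceRecord[], key = 'devices:allowed'): RedisCommand[] {
  const imeis = devices.filter((d) => d.active).map((d) => d.imei);
  if (imeis.length === 0) {
    // RENAME fails on a missing source key, and an empty list is suspicious anyway.
    throw new Error('refusing to publish an empty allow-list');
  }
  const staging = `${key}:staging`;
  return [
    ['DEL', staging],
    ['SADD', staging, ...imeis],
    ['RENAME', staging, key],
  ];
}
```

Because `RENAME` replaces the destination key in one step, an Ingestion refresh sees either the old Set or the new one, never the gap in between.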
### Refresh strategy

```ts
import type { Redis } from 'ioredis'; // assumed client; `smembers` below matches its API
import type { DeviceAuthority } from './device-authority';
// `Logger` and `Metrics` are the project's own types; import them from wherever they are defined.

class RedisAllowListAuthority implements DeviceAuthority {
  private cache = new Set<string>();
  private timer?: NodeJS.Timeout;

  constructor(
    private readonly redis: Redis,
    private readonly logger: Logger,
    private readonly metrics: Metrics,
    // Defaulted parameters go last so callers can omit them.
    private readonly key: string = 'devices:allowed',
    private readonly intervalMs: number = 30_000,
  ) {}

  async start(): Promise<void> {
    // Awaited initial load: the caller must not open the TCP listener until this resolves.
    await this.refresh();
    this.timer = setInterval(() => {
      // A failed refresh leaves the last-known-good cache in place.
      this.refresh().catch((err) => this.logger.warn({ err }, 'allow-list refresh failed'));
    }, this.intervalMs);
  }

  stop(): void {
    if (this.timer) clearInterval(this.timer);
  }

  async check(imei: string): Promise<'known' | 'unknown'> {
    // Pure in-memory lookup; the handshake path never touches Redis.
    return this.cache.has(imei) ? 'known' : 'unknown';
  }

  private async refresh(): Promise<void> {
    const start = process.hrtime.bigint();
    const members = await this.redis.smembers(this.key);
    // Swap in a fresh Set so `check` never sees a partially built cache.
    this.cache = new Set(members);
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    this.metrics.allowListRefresh.observe(ms / 1000); // histogram is in seconds
    this.metrics.allowListSize.set(this.cache.size);
    this.logger.debug({ size: this.cache.size, took_ms: ms }, 'allow-list refreshed');
  }
}
```

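One detail worth considering with a `setInterval`-driven refresher: if a single `SMEMBERS` call ever takes longer than the interval, two refreshes can be in flight at once and complete out of order. A minimal guard, sketched here under the hypothetical name `singleFlight`, skips a tick while the previous run is still pending. This is one possible approach, not something the task prescribes.

```typescript
// Wraps an async task so that a new run is skipped while the previous one
// is still pending. Resolves to true if the task ran, false if it was skipped.
function singleFlight(task: () => Promise<void>): () => Promise<boolean> {
  let inFlight = false;
  return async () => {
    if (inFlight) return false; // previous refresh still running: skip this tick
    inFlight = true;
    try {
      await task();
      return true;
    } finally {
      inFlight = false; // released on success and on failure alike
    }
  };
}
```

The periodic timer would then call the guarded function instead of `refresh()` directly, keeping the same `.catch(...)` logging on the returned promise.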
### Failure modes

- **Redis unavailable at startup.** `start()` throws → process exits non-zero → orchestrator restarts. This is a loud failure that is easy to alert on. Operators may opt to fall back to `allow_all` by changing `DEVICE_AUTHORITY_MODE`.
- **Redis unavailable mid-flight.** `refresh` fails and the cache stays at last-known-good; `check` keeps working off the stale cache. Log a warning and increment the refresh-failure metric. If Redis never recovers, the cache stays stale indefinitely. That is acceptable because telemetry keeps flowing, at the cost of newly provisioned devices remaining `unknown`.
- **Empty allow-list.** A bug or misconfiguration in Directus could publish an empty Set. The Ingestion side would then mark every device as `unknown`. With `STRICT_DEVICE_AUTH=false` (default), this is a visibility problem (alert-worthy) but not a service outage. With `STRICT_DEVICE_AUTH=true`, the entire fleet would be rejected. As a safety measure, refuse to apply a refresh result of size 0 unless `ALLOW_EMPTY_ALLOW_LIST=true` is set explicitly: log an error and keep the previous cache.

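The empty allow-list safety described above can be expressed as a pure function that decides whether a refresh result replaces the cache. `applyRefresh` and `RefreshOutcome` are hypothetical names; the `empty_rejected` reason matches the label value used in the acceptance criteria.

```typescript
type RefreshOutcome =
  | { applied: true; cache: Set<string> }
  | { applied: false; cache: Set<string>; reason: 'empty_rejected' };

// Decides whether a freshly fetched member list should replace the cache.
function applyRefresh(
  previous: Set<string>,
  members: string[],
  allowEmpty: boolean,
): RefreshOutcome {
  if (members.length === 0 && !allowEmpty) {
    // Keep the last-known-good cache; the caller logs an error and bumps the failure counter.
    return { applied: false, cache: previous, reason: 'empty_rejected' };
  }
  return { applied: true, cache: new Set(members) };
}
```

Note that on the very first load the previous cache is itself empty, so a rejected refresh leaves nothing `known`; whether `start()` should fail in that case is a decision this sketch does not make.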
### Configuration

Add to the env schema (task 1.3):

```ts
DEVICE_AUTHORITY_MODE: z.enum(['allow_all', 'redis_allow_list']).default('allow_all'),
DEVICE_ALLOW_LIST_KEY: z.string().default('devices:allowed'),
DEVICE_ALLOW_LIST_REFRESH_MS: z.coerce.number().int().min(1000).default(30_000),
// Not z.coerce.boolean(): it applies Boolean(value), so the string "false" would parse as true.
STRICT_DEVICE_AUTH: z.enum(['true', 'false']).default('false').transform((v) => v === 'true'),
ALLOW_EMPTY_ALLOW_LIST: z.enum(['true', 'false']).default('false').transform((v) => v === 'true'),
```

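Environment variables always arrive as strings, and in JavaScript any non-empty string is truthy, so `Boolean('false')` is `true`. Flags such as `STRICT_DEVICE_AUTH` and `ALLOW_EMPTY_ALLOW_LIST` therefore need explicit string parsing. The dependency-free sketch below shows the idea; `parseBoolEnv` is a hypothetical helper, and accepting `1`/`0` is an assumption rather than a stated requirement.

```typescript
// Hypothetical helper: strict parsing of a boolean environment variable.
function parseBoolEnv(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw === '') return fallback;
  const value = raw.trim().toLowerCase();
  if (value === 'true' || value === '1') return true;
  if (value === 'false' || value === '0') return false;
  // Fail loudly on typos instead of silently enabling or disabling a flag.
  throw new Error(`expected "true" or "false", got "${raw}"`);
}

// Why plain coercion is unsafe for env strings:
const naive = Boolean('false'); // true, because the string is non-empty
const strict = parseBoolEnv('false', false); // false
```

Failing on unrecognized values matters most for `STRICT_DEVICE_AUTH`, where a mistyped value silently coerced to `true` would reject the whole fleet of unknown devices.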
### Metrics

Add to task 1.10's inventory:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `teltonika_allow_list_size` | gauge | none | Number of IMEIs in the local cache. Sudden drops are alert-worthy. |
| `teltonika_allow_list_refresh_duration_seconds` | histogram | none | Time to refresh from Redis. |
| `teltonika_allow_list_refresh_failures_total` | counter | `reason` | Refresh attempts that failed (network, empty-rejected, etc.). |

## Acceptance criteria

- [ ] With `DEVICE_AUTHORITY_MODE=allow_all`, behavior is identical to the Phase 1 default: every IMEI is `known`.
- [ ] With `DEVICE_AUTHORITY_MODE=redis_allow_list` and a populated Redis Set, `check(imei)` returns `'known'` for members and `'unknown'` for non-members.
- [ ] Initial load happens before the TCP listener accepts connections.
- [ ] Refresh runs every `DEVICE_ALLOW_LIST_REFRESH_MS` and updates the cache.
- [ ] Empty allow-list refresh is rejected (cache preserved) unless `ALLOW_EMPTY_ALLOW_LIST=true`; the failure metric increments with `reason=empty_rejected`.
- [ ] Mid-flight Redis outage does not crash the service; a subsequent successful refresh restores the cache.
- [ ] `teltonika_allow_list_size` and `teltonika_allow_list_refresh_duration_seconds` appear in `/metrics`.
- [ ] `STRICT_DEVICE_AUTH=true` combined with `redis_allow_list` rejects unknown IMEIs with a `0x00` handshake reply (verified by integration test).

## Risks / open questions

- **Provisioning lag.** A newly added device waits up to `DEVICE_ALLOW_LIST_REFRESH_MS` before being recognized. The 30s default is fine for most operations; tune down to 5s if the team has a workflow where they provision a device and immediately expect it to be `known`.
- **Cache size.** A Set of 100k IMEIs is roughly 6 MB in memory, which is fine. At 1M+ devices, consider a Bloom filter with a Redis fallback for misses, or split into shards. Not a near-term concern.
- **Drift between Directus and Redis.** Hooks-based sync can miss updates if Directus has an issue mid-write. The nightly full-resync cron mitigates this. Discussed in the Directus-side task (out of repo scope here).
- **Should `STRICT_DEVICE_AUTH` be observable?** Yes: log at info on startup which authority mode is active and whether strict rejection is enabled, so operators can verify config without reading env vars.

## Done

(Fill in once complete.)