julian 90d6a73a60 Sync ROADMAP statuses with landed work; mark 1.10/1.12/1.13 as paused

Tasks 1.1-1.9 marked done with their landing commit SHAs. Tasks 1.10
(observability), 1.12 (production hardening), and 1.13 (device
authority) marked paused with explicit resume triggers — pilot
deployment on real Teltonika hardware takes priority. Task 1.11
remains as next, in slimmed form for the pilot (no /readyz healthcheck
since the metrics endpoint is part of paused 1.10).

2026-04-30 16:49:07 +02:00


Task 1.13 — Device authority (Redis allow-list refresher)

Phase: 1 (Inbound telemetry)
Status: ⏸ Paused. Deferred until after the real-device pilot test, and until Directus has a devices collection publishing the allow-list to Redis. See the "Deferred" section of ROADMAP.md. The DeviceAuthority seam already exists with AllowAllAuthority (the default, in src/adapters/teltonika/device-authority.ts); this task adds RedisAllowListAuthority.
Depends on: 1.4 (DeviceAuthority seam), 1.10 (metrics)
Wiki refs: docs/wiki/concepts/plane-separation.md, docs/wiki/entities/directus.md, docs/wiki/entities/redis-streams.md

Goal

Provide a real DeviceAuthority implementation that classifies an IMEI as known or unknown by consulting an allow-list published from Directus into Redis and cached in-memory in each Ingestion instance. This is the operational link between the business plane (where the source-of-truth devices collection lives) and the telemetry plane (where Ingestion makes its handshake decisions).

Non-goals

Not a security boundary. Real device security is network-level + downstream filtering. This list is a soft signal for observability and (optionally) a hard reject under STRICT_DEVICE_AUTH.
Not a real-time check. The list is cached locally with periodic refresh; new device provisioning takes effect within the refresh interval.

Deliverables

src/adapters/teltonika/redis-allow-list-authority.ts:
- RedisAllowListAuthority implementing DeviceAuthority.
- In-memory Set<string> of allowed IMEIs.
- Refresh worker that pulls from Redis on a configurable cadence.
- start() awaits an initial fetch (so the cache is warm before the TCP listener accepts connections) and then starts the periodic refresh.
- stop() halts the refresh ticker.
src/main.ts updated:
- Read DEVICE_AUTHORITY_MODE env var (allow_all | redis_allow_list, default allow_all).
- Construct the appropriate authority and pass it into the adapter context.
Documentation in OPERATIONS.md (task 1.12) — section "Device authority" describing the env vars, refresh cadence, and Directus contract.
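
The mode switch in src/main.ts could look roughly like the sketch below. The DeviceAuthority shape follows the check() signature used in this task; buildAuthority, AuthorityMode, and StaticListAuthority (a stand-in for RedisAllowListAuthority so the snippet runs without Redis) are hypothetical names chosen for illustration.

```typescript
// Sketch only: buildAuthority, AuthorityMode and StaticListAuthority are hypothetical names.
type Verdict = 'known' | 'unknown';
type AuthorityMode = 'allow_all' | 'redis_allow_list';

interface DeviceAuthority {
  check(imei: string): Promise<Verdict>;
}

// Default authority: every IMEI is treated as known.
class AllowAllAuthority implements DeviceAuthority {
  async check(_imei: string): Promise<Verdict> {
    return 'known';
  }
}

// Stand-in for RedisAllowListAuthority so the sketch runs without Redis.
class StaticListAuthority implements DeviceAuthority {
  private readonly allowed: Set<string>;

  constructor(imeis: Iterable<string>) {
    this.allowed = new Set(imeis);
  }

  async check(imei: string): Promise<Verdict> {
    return this.allowed.has(imei) ? 'known' : 'unknown';
  }
}

// Chooses the implementation from the value of DEVICE_AUTHORITY_MODE.
function buildAuthority(mode: AuthorityMode, imeis: Iterable<string> = []): DeviceAuthority {
  switch (mode) {
    case 'allow_all':
      return new AllowAllAuthority();
    case 'redis_allow_list':
      return new StaticListAuthority(imeis);
  }
}
```

In the real wiring the redis_allow_list branch would construct RedisAllowListAuthority and await its start() before the TCP listener is opened.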

Specification

Redis contract

The Ingestion side reads from a single Redis key. Two viable shapes; pick one and stick with it.

Option 1: Redis Set. Simple, idiomatic for membership checks.

SADD devices:allowed <imei1> <imei2> ...
SMEMBERS devices:allowed       # what the refresher reads
SISMEMBER devices:allowed <imei>   # what an on-demand check would do (we do not use this; we cache)

Option 2: Redis Hash with metadata per device. Useful if downstream wants more than membership (e.g. device model, firmware version, owner).

HSET devices:allowed <imei> '{"model":"FMB920","fw":"03.27"}'
HGETALL devices:allowed

Recommendation: Option 1 (Set). Membership is the only signal Ingestion uses; metadata belongs in Directus where it's queryable. If a future task needs metadata in Ingestion, switch to Option 2.

Directus → Redis sync (out of scope for this task)

This task implements the Ingestion-side reader. The Directus-side publisher is a separate piece of work in the Directus repo:

A devices collection in Directus with at least imei, active fields.
A Directus Flow or hook that, on items.create | items.update | items.delete of devices, updates the Redis Set:
- Active inserted/updated → SADD devices:allowed <imei>.
- Deleted or active=false → SREM devices:allowed <imei>.
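A periodic full resync (e.g. nightly cron) that snapshots the collection into Redis to recover from any drift: DEL devices:allowed followed by SADD devices:allowed <imei1> ... <imeiN>, run atomically (inside MULTI/EXEC, or by filling a temporary key and RENAMEing it over devices:allowed) so that a concurrent refresh never reads an empty or partial Set.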

Document this contract in the Ingestion repo's OPERATIONS.md so on-call understands the dependency, but the implementation lives in Directus.
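
To make the publisher contract concrete, here is a minimal sketch of the event-to-command mapping described above. The DeviceEvent shape and the commandFor helper are hypothetical; real Directus hook payloads look different and would have to be translated into this form on the Directus side.

```typescript
// Hypothetical event shape; not the actual Directus hook payload.
type DeviceEvent =
  | { kind: 'upsert'; imei: string; active: boolean }
  | { kind: 'delete'; imei: string };

// A Redis command expressed as [name, key, member].
type RedisCommand = ['SADD' | 'SREM', string, string];

// Maps one change in the devices collection to the Set mutation from the contract.
function commandFor(event: DeviceEvent, key: string = 'devices:allowed'): RedisCommand {
  if (event.kind === 'upsert' && event.active) {
    // Active device created or updated: it belongs on the allow-list.
    return ['SADD', key, event.imei];
  }
  // Deleted, or active=false: the IMEI must leave the allow-list.
  return ['SREM', key, event.imei];
}
```

Both commands are idempotent, so replaying an event after a partial failure leaves the Set in the same state.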

Refresh strategy

class RedisAllowListAuthority implements DeviceAuthority {
  private cache = new Set<string>();
  private timer?: NodeJS.Timeout;

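  constructor(
    private readonly redis: Redis,
    private readonly logger: Logger,
    private readonly metrics: Metrics,
    // Defaulted parameters come last so callers can actually omit them.
    private readonly key: string = 'devices:allowed',
    private readonly intervalMs: number = 30_000,
  ) {}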

  async start(): Promise<void> {
    await this.refresh(); // initial load is awaited, so the cache is warm before the caller opens the TCP listener
    this.timer = setInterval(() => {
      this.refresh().catch((err) => this.logger.warn({ err }, 'allow-list refresh failed'));
    }, this.intervalMs);
  }

  stop(): void { if (this.timer) clearInterval(this.timer); }

  async check(imei: string): Promise<'known' | 'unknown'> {
    return this.cache.has(imei) ? 'known' : 'unknown';
  }

  private async refresh(): Promise<void> {
    const start = process.hrtime.bigint();
    const members = await this.redis.smembers(this.key);
    this.cache = new Set(members);
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    this.metrics.allowListRefresh.observe(ms / 1000);
    this.metrics.allowListSize.set(this.cache.size);
    this.logger.debug({ size: this.cache.size, took_ms: ms }, 'allow-list refreshed');
  }
}

Failure modes

Redis unavailable at startup. The initial refresh fails, so start() rejects → process exits non-zero → orchestrator restarts. Loud failure, easy to alert. Operators may opt to fall back to allow_all by changing DEVICE_AUTHORITY_MODE.
Redis unavailable mid-flight. refresh fails; the cache stays at last-known-good and check keeps working off the stale cache. Log at warn and increment teltonika_allow_list_refresh_failures_total. If Redis never recovers, the cache stays stale indefinitely; that is acceptable because telemetry keeps flowing, though devices provisioned or deactivated in the meantime are not reflected.
Empty allow-list. A bug or misconfiguration in Directus could publish an empty Set. The Ingestion side will then mark every device as unknown. With STRICT_DEVICE_AUTH=false (default), this is a visibility problem (alert-worthy) but not a service outage. With STRICT_DEVICE_AUTH=true, the entire fleet would be rejected, which is an outage. Add a safety guard: refuse to apply a refresh result of size 0 unless ALLOW_EMPTY_ALLOW_LIST=true is set explicitly. Log at error, increment the failure counter with reason=empty_rejected, and keep the previous cache.
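
The empty-list guard can be isolated as a pure decision, which keeps it easy to unit-test. This is a sketch under the assumption that refresh hands the fetched members to a helper before swapping the cache; applyRefresh and RefreshOutcome are hypothetical names, while the empty_rejected reason comes from the acceptance criteria.

```typescript
// Result of one refresh attempt: either the new cache, or the untouched previous one.
type RefreshOutcome =
  | { applied: true; cache: Set<string> }
  | { applied: false; cache: Set<string>; reason: 'empty_rejected' };

// Decides whether a freshly fetched member list may replace the current cache.
function applyRefresh(
  previous: Set<string>,
  members: readonly string[],
  allowEmpty: boolean,
): RefreshOutcome {
  if (members.length === 0 && !allowEmpty) {
    // Keep last-known-good; the caller logs at error and increments the failure counter.
    return { applied: false, cache: previous, reason: 'empty_rejected' };
  }
  return { applied: true, cache: new Set(members) };
}
```

The caller assigns outcome.cache unconditionally and uses outcome.applied only to decide what to log and which metric to bump.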

Configuration

Add to the env schema (task 1.3):

DEVICE_AUTHORITY_MODE: z.enum(['allow_all', 'redis_allow_list']).default('allow_all'),
DEVICE_ALLOW_LIST_KEY: z.string().default('devices:allowed'),
DEVICE_ALLOW_LIST_REFRESH_MS: z.coerce.number().int().min(1000).default(30_000),
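// z.coerce.boolean() applies Boolean(), so the string "false" would parse as true; match the literal strings instead.
STRICT_DEVICE_AUTH: z.enum(['true', 'false']).default('false').transform((v) => v === 'true'),
ALLOW_EMPTY_ALLOW_LIST: z.enum(['true', 'false']).default('false').transform((v) => v === 'true'),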
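
Environment variables always arrive as strings, and in JavaScript Boolean('false') is true, so the two boolean flags deserve strict parsing. Below is a dependency-free sketch of that rule, assuming only the literal strings true and false are accepted; parseEnvBool is a hypothetical helper, not part of the env schema from task 1.3.

```typescript
// Strict boolean parsing for env vars: only the literal strings "true" and "false" are accepted.
// Plain truthiness coercion is unsafe here because Boolean('false') evaluates to true.
function parseEnvBool(name: string, raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw === '') {
    // Unset (or empty) variable: use the documented default.
    return fallback;
  }
  if (raw === 'true') return true;
  if (raw === 'false') return false;
  // Anything else is a configuration error and should fail fast at startup.
  throw new Error(`${name} must be "true" or "false", got "${raw}"`);
}
```

Failing fast on unexpected values matters most for STRICT_DEVICE_AUTH, where a silently misparsed flag could reject the whole fleet.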

Metrics

Add to task 1.10's inventory:

Metric	Type	Labels	Description
`teltonika_allow_list_size`	gauge	—	Number of IMEIs in the local cache. Sudden drops are alert-worthy.
`teltonika_allow_list_refresh_duration_seconds`	histogram	—	Time to refresh from Redis.
`teltonika_allow_list_refresh_failures_total`	counter	`reason`	Refresh attempts that failed (network, empty-rejected, etc.).

Acceptance criteria

With DEVICE_AUTHORITY_MODE=allow_all, behavior is identical to Phase 1 default — every IMEI is known.
With DEVICE_AUTHORITY_MODE=redis_allow_list and a populated Redis Set, check(imei) returns 'known' for members and 'unknown' for non-members.
Initial load happens before the TCP listener accepts connections.
Refresh runs every DEVICE_ALLOW_LIST_REFRESH_MS and updates the cache.
Empty allow-list refresh is rejected (cache preserved) unless ALLOW_EMPTY_ALLOW_LIST=true; metric increments with reason=empty_rejected.
Mid-flight Redis outage does not crash the service; subsequent successful refresh restores the cache.
teltonika_allow_list_size and teltonika_allow_list_refresh_duration_seconds appear in /metrics.
STRICT_DEVICE_AUTH=true combined with redis_allow_list causes 0x00 rejection of unknown IMEIs (verified by integration test).
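
The last criterion reduces to a small decision that an integration test can pin down. A minimal sketch follows, assuming the handshake code receives the verdict from check() and the STRICT_DEVICE_AUTH flag; handshakeReply is a hypothetical name, and the 0x01 accept byte is taken from Teltonika's published handshake convention rather than from this task, which only names the 0x00 rejection.

```typescript
type Verdict = 'known' | 'unknown';

const ACCEPT = 0x01; // assumption: accept byte per Teltonika's handshake convention
const REJECT = 0x00; // rejection byte named in the acceptance criteria

// Unknown IMEIs are rejected only in strict mode; otherwise they are accepted and merely flagged.
function handshakeReply(verdict: Verdict, strict: boolean): number {
  return verdict === 'unknown' && strict ? REJECT : ACCEPT;
}
```

Known devices are accepted in both modes, so enabling strict mode only changes behavior for IMEIs missing from the cache.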

Risks / open questions

Provisioning lag. A newly added device waits up to DEVICE_ALLOW_LIST_REFRESH_MS before being recognized. Default 30s is fine for most ops; tune down to 5s if the team has a workflow where they provision and immediately expect the device to be known.
Cache size. A Set of 100k IMEIs is ~6MB in memory — fine. At 1M+ devices, consider a Bloom filter + Redis fallback for misses, or split into shards. Not a near-term concern.
Drift between Directus and Redis. Hooks-based sync can miss updates if Directus has an issue mid-write. The nightly full-resync cron mitigates. Discussed in the Directus-side task (out of repo scope here).
Should STRICT_DEVICE_AUTH be observable? Yes — log at info on startup which mode the authority is in, so operators can verify config without reading env vars.

Done

(Fill in once complete.)
