
julian 22b1b069df Bootstrap LLM-maintained wiki with TRM architecture knowledge

Initialize CLAUDE.md schema, index, and log; ingest three architecture
sources (system overview, Teltonika ingestion design, official Teltonika
data-sending protocols) into 7 entity pages, 8 concept pages, and 3
source pages with wikilink cross-references.

2026-04-30 13:20:17 +02:00

1.9 KiB


Failure Domains

Each component of the platform fails independently. The architecture deliberately concentrates operational risk in one place — the database — and keeps everything else restartable, replaceable, or naturally redundant.

Per-component failure behavior

Component	Crash behavior	Data loss
tcp-ingestion	Devices reconnect; in-flight frames retransmitted by the device per protocol	None beyond unacknowledged frames
redis-streams	Streams are persisted; restart resumes from disk	Recoverable from device retransmits + Processor checkpointing
processor	Consumer-group offsets ensure next instance picks up; in-memory state rehydrated from DB	None
directus	Telemetry continues to flow into DB; admin UI/SPA unavailable	None
postgres-timescaledb	System stops accepting writes	The single point of failure
react-spa	UI unavailable	N/A — no state owned
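The processor row depends on consumer-group delivery semantics: an entry that was delivered but never acknowledged stays pending, so a replacement instance can take it over. The sketch below is a toy in-memory model of that behavior, not the platform's code and not the Redis API. `ToyStream` and its methods are hypothetical names chosen for illustration; in Redis the corresponding commands would be along the lines of `XADD`, `XREADGROUP`, `XACK`, and `XAUTOCLAIM`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStream:
    """In-memory stand-in for one stream with a single consumer group."""
    entries: list = field(default_factory=list)   # append-only log of frames
    next_index: int = 0                           # group cursor: next never-delivered entry
    pending: dict = field(default_factory=dict)   # index -> consumer, delivered but not acked

    def append(self, frame):
        """Add a frame to the end of the log and return its index."""
        self.entries.append(frame)
        return len(self.entries) - 1

    def read_new(self, consumer):
        """Deliver the next never-delivered entry and mark it pending."""
        if self.next_index >= len(self.entries):
            return None
        idx = self.next_index
        self.next_index += 1
        self.pending[idx] = consumer
        return idx, self.entries[idx]

    def ack(self, idx):
        """Mark an entry as fully processed."""
        self.pending.pop(idx, None)

    def claim_pending(self, new_consumer):
        """Hand every unacknowledged entry to a replacement consumer."""
        claimed = []
        for idx in sorted(self.pending):
            self.pending[idx] = new_consumer
            claimed.append((idx, self.entries[idx]))
        return claimed


stream = ToyStream()
for frame in ("f0", "f1", "f2"):
    stream.append(frame)

idx, _ = stream.read_new("processor-1")
stream.ack(idx)                  # f0 is fully processed
stream.read_new("processor-1")   # f1 is delivered, then processor-1 crashes before ack

recovered = stream.claim_pending("processor-2")   # [(1, "f1")]
for i, _ in recovered:
    stream.ack(i)
nxt = stream.read_new("processor-2")              # (2, "f2")
```

Nothing is skipped and nothing acknowledged is replayed: the crashed instance's unacknowledged frame is reprocessed once, which is why downstream writes in such a design are usually made idempotent.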

The discipline behind this

No component reaches across two plane boundaries — see plane-separation. A failure in one plane cannot cascade through another.
The TCP handler never blocks on downstream work. Slow Processor or DB pressure is absorbed by redis-streams, not by device sockets.
Per-device session state lives only on the open socket — Ingestion is trivially restartable.
The Processor's hot state can always be rehydrated from the DB.
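The second rule above, that the device-facing handler never waits on downstream work, can be illustrated with a small asyncio model. This is a sketch under stated assumptions: a plain Python list stands in for the Redis stream, and `ingest`, `process`, and `demo` are hypothetical names, not functions from the platform.

```python
import asyncio


async def ingest(frames, stream, acked):
    """Device-facing side: append each frame to the buffer, then ack."""
    for frame in frames:
        stream.append(frame)      # stand-in for appending to a Redis stream
        acked.append(frame)       # stand-in for the protocol-level ack to the device
        await asyncio.sleep(0)    # yield to the event loop; never await the Processor


async def process(stream, processed, total, delay):
    """Downstream side: deliberately slow, reads at its own pace."""
    cursor = 0
    while cursor < total:
        if cursor < len(stream):
            await asyncio.sleep(delay)        # simulated Processor or DB pressure
            processed.append(stream[cursor])
            cursor += 1
        else:
            await asyncio.sleep(0)            # nothing to read yet


async def demo(frames, delay=0.05):
    stream, acked, processed = [], [], []
    worker = asyncio.create_task(process(stream, processed, len(frames), delay))
    await ingest(frames, stream, acked)
    # Snapshot taken when ingestion is done: frames buffered but not yet processed.
    backlog_when_ingest_done = len(stream) - len(processed)
    await worker
    return acked, backlog_when_ingest_done, processed


acked, backlog, processed = asyncio.run(demo(["f0", "f1", "f2"]))
```

In this model every frame is acknowledged while the slow consumer is still behind, so `backlog` is greater than zero when ingestion finishes, and `processed` later catches up in order. The backlog lives in the buffer, not on the device socket.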

Operational consequence

The database gets careful operational attention: replication, backups, and point-in-time recovery (a PostgreSQL capability, based on WAL archiving, that TimescaleDB inherits). Everything else can be restarted, redeployed, or scaled without ceremony.

Canary metric

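The canary metric is Redis Streams consumer lag. Because redis-streams absorbs slowness in the Processor or the database, a growing lag signals trouble downstream of ingestion, so one number reflects the health of the entire telemetry pipeline.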
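As a rough illustration of turning that number into an alert level, the sketch below classifies health from consumer-group lag. It assumes input shaped like the reply to Redis `XINFO GROUPS` (for example, the list of dicts returned by redis-py's `xinfo_groups`), which includes a `lag` field on Redis 7.0 and later. The function name and both thresholds are hypothetical and would need tuning against real traffic.

```python
def pipeline_health(groups, warn_at=1_000, critical_at=50_000):
    """Classify pipeline health from the worst consumer-group lag.

    `groups` is a list of dicts, each with a "lag" key, shaped like the
    reply to XINFO GROUPS. Thresholds are illustrative assumptions.
    """
    worst = 0
    for group in groups:
        lag = group.get("lag")
        if lag is None:           # Redis reports a null lag when it cannot compute one
            return "unknown"
        worst = max(worst, lag)
    if worst >= critical_at:
        return "critical"
    if worst >= warn_at:
        return "warning"
    return "ok"


# Hand-written sample data; the group name "processor" is hypothetical.
print(pipeline_health([{"name": "processor", "lag": 12}]))       # ok
print(pipeline_health([{"name": "processor", "lag": 4_200}]))    # warning
print(pipeline_health([{"name": "processor", "lag": None}]))     # unknown
```

Treating a missing lag as "unknown" rather than "ok" keeps the canary from reporting health it cannot actually measure.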
