julian 22b1b069df Bootstrap LLM-maintained wiki with TRM architecture knowledge
Initialize CLAUDE.md schema, index, and log; ingest three architecture
sources (system overview, Teltonika ingestion design, official Teltonika
data-sending protocols) into 7 entity pages, 8 concept pages, and 3
source pages with wikilink cross-references.
2026-04-30 13:20:17 +02:00


| title | type | created | updated | sources | tags |
|---|---|---|---|---|---|
| Failure Domains | concept | 2026-04-30 | 2026-04-30 | gps-tracking-architecture | architecture, reliability |

Failure Domains

Each component of the platform fails independently. The architecture deliberately concentrates operational risk in one place — the database — and keeps everything else restartable, replaceable, or naturally redundant.

Per-component failure behavior

| Component | Crash behavior | Data loss |
|---|---|---|
| tcp-ingestion | Devices reconnect; in-flight frames retransmitted by the device per protocol | None beyond unacknowledged frames |
| redis-streams | Streams are persisted; restart resumes from disk | Recoverable from device retransmits + Processor checkpointing |
| processor | Consumer-group offsets ensure next instance picks up; in-memory state rehydrated from DB | None |
| directus | Telemetry continues to flow into DB; admin UI/SPA unavailable | None |
| postgres-timescaledb | System stops accepting writes | The single point of failure |
| react-spa | UI unavailable | N/A (no state owned) |
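To make the tcp-ingestion row concrete, here is a minimal simulation of why an Ingestion crash loses nothing beyond unacknowledged frames. It assumes the device keeps each frame until the server acknowledges it and that Ingestion acknowledges only after appending to the stream. `Device`, `make_ingest`, `deliver_all`, and the plain list standing in for the Redis stream are hypothetical names for illustration, not the platform's actual code.

```python
from dataclasses import dataclass, field


class IngestionCrash(Exception):
    """Simulated crash of the ingestion process while handling a frame."""


@dataclass
class Device:
    """Hypothetical tracker that keeps every frame until the server ACKs it."""
    unacked: list = field(default_factory=list)

    def send_next(self, ingest):
        frame = self.unacked[0]
        if ingest(frame):          # True stands in for the server's ACK
            self.unacked.pop(0)    # only now may the device forget the frame


def make_ingest(stream, crash_on=None):
    """Build an ingest callable: append to the stream first, then ACK.

    crash_on is a set of frames for which the handler crashes once,
    before the append (assumption: simulates a process restart).
    """
    remaining = set(crash_on or ())

    def ingest(frame):
        if frame in remaining:
            remaining.discard(frame)
            raise IngestionCrash(frame)
        stream.append(frame)       # stand-in for appending to the Redis stream
        return True                # ACK sent only after the append

    return ingest


def deliver_all(device, ingest):
    """Drive the device until it has nothing left to send; count crashes."""
    crashes = 0
    while device.unacked:
        try:
            device.send_next(ingest)
        except IngestionCrash:
            # Device reconnects; the frame is still unacked, so it is retransmitted.
            crashes += 1
    return crashes
```

In this sketch the crash happens before the append. A crash after the append but before the ACK would make the device retransmit a frame the stream already holds, which the sketch does not deduplicate.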

The discipline behind this

  • No component reaches across two plane boundaries — see plane-separation. A failure in one plane cannot cascade through another.
  • The TCP handler never blocks on downstream work. Slow Processor or DB pressure is absorbed by redis-streams, not by device sockets.
  • Per-device session state lives only on the open socket — Ingestion is trivially restartable.
  • The Processor's hot state can always be rehydrated from the DB.
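As a rough illustration of the last point, the sketch below models a Processor that crashes mid-stream and a replacement instance that picks up where it left off. It is an in-memory simulation under stated assumptions: `Stream` imitates a consumer group's delivered-but-unacknowledged (pending) entries, a plain dict stands in for the database, and the "hot state" is assumed to be the latest value per device. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """In-memory stand-in for one stream with a single consumer group."""
    entries: list = field(default_factory=list)   # (entry_id, device, value)
    delivered: int = 0                            # how many entries were handed out
    pending: dict = field(default_factory=dict)   # delivered but not yet acked

    def add(self, device, value):
        self.entries.append((len(self.entries) + 1, device, value))

    def read_new(self):
        if self.delivered >= len(self.entries):
            return None
        entry = self.entries[self.delivered]
        self.delivered += 1
        self.pending[entry[0]] = entry            # stays pending until acked
        return entry

    def ack(self, entry_id):
        self.pending.pop(entry_id, None)


class Processor:
    def __init__(self, stream, db):
        self.stream = stream
        self.db = db
        self.hot = dict(db)        # rehydrate in-memory state from the DB

    def handle(self, entry):
        entry_id, device, value = entry
        self.db[device] = value    # durable write first
        self.hot[device] = value
        self.stream.ack(entry_id)  # ack only after the write succeeded

    def recover_pending(self):
        # Re-process entries a crashed instance read but never acknowledged.
        for entry in sorted(self.stream.pending.values()):
            self.handle(entry)

    def run(self, crash_after=None):
        handled = 0
        while (entry := self.stream.read_new()) is not None:
            if crash_after is not None and handled == crash_after:
                raise RuntimeError("simulated processor crash")
            self.handle(entry)
            handled += 1
```

A replacement instance would call `recover_pending()` before `run()`. Because acknowledgement follows the write, an entry that was read but not written stays pending and is not skipped.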

Operational consequence

The database gets careful operational attention: replication, backups, and point-in-time recovery for the PostgreSQL/TimescaleDB instance. Everything else can be restarted, redeployed, or scaled without ceremony.

Canary metric

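The canary metric is Redis Streams consumer lag: the backlog of entries written to the stream but not yet consumed by the Processor. A slow or stopped Processor, or DB pressure behind it, shows up as growing lag, so this one number reflects the health of the entire telemetry pipeline.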
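One way to turn consumer lag into an alert is sketched below. It assumes the input is a list of per-group dicts shaped like the reply of Redis's `XINFO GROUPS` command (which reports a `lag` field in Redis 7 and later, and may report it as null when it cannot be computed). The function names and the threshold values are hypothetical and would need tuning against real traffic.

```python
from typing import List, Optional


def worst_lag(groups: List[dict]) -> Optional[int]:
    """Return the largest known lag across consumer groups, or None if unknown."""
    known = [g["lag"] for g in groups if g.get("lag") is not None]
    return max(known) if known else None


def classify_lag(lag: Optional[int], warn: int = 1_000, crit: int = 10_000) -> str:
    """Map a lag value to a coarse health status (thresholds are placeholders)."""
    if lag is None:
        return "unknown"   # Redis could not compute lag; treat as its own state
    if lag >= crit:
        return "critical"
    if lag >= warn:
        return "warning"
    return "ok"
```

Keeping "unknown" separate from "ok" avoids silently treating a missing measurement as a healthy pipeline.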