---
title: Failure Domains
type: concept
created: 2026-04-30
updated: 2026-04-30
sources: [gps-tracking-architecture]
tags: [architecture, reliability]
---

# Failure Domains

Each component of the platform fails independently. The architecture deliberately concentrates operational risk in one place — the database — and keeps everything else restartable, replaceable, or naturally redundant.

## Per-component failure behavior

| Component | Crash behavior | Data loss |
|-----------|----------------|-----------|
| [[tcp-ingestion]] | Devices reconnect; in-flight frames retransmitted by the device per protocol | None beyond unacknowledged frames |
| [[redis-streams]] | Streams are persisted; restart resumes from disk | Recoverable from device retransmits + Processor checkpointing |
| [[processor]] | Consumer-group offsets ensure next instance picks up; in-memory state rehydrated from DB | None |
| [[directus]] | Telemetry continues to flow into DB; admin UI/SPA unavailable | None |
| [[postgres-timescaledb]] | System stops accepting writes | The single point of failure |
| [[react-spa]] | UI unavailable | N/A — no state owned |

## The discipline behind this

- **No component reaches across two plane boundaries** — see [[plane-separation]]. A failure in one plane cannot cascade through another.
- **The TCP handler never blocks on downstream work.** Slow Processor or DB pressure is absorbed by [[redis-streams]], not by device sockets.
- **Per-device session state lives only on the open socket** — Ingestion is trivially restartable.
- **The Processor's hot state can always be rehydrated** from the DB.

## Operational consequence

The database gets careful operational attention — replication, backups, point-in-time recovery via TimescaleDB. Everything else can be restarted, redeployed, or scaled without ceremony.

## Canary metric

**Redis Streams consumer lag.** It reflects the health of the entire telemetry pipeline in one number.
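
## Sketch: Processor crash recovery

The Processor's "next instance picks up" behavior rests on the standard Redis consumer-group pattern: on startup, first replay entries that were delivered but never acknowledged before the crash (start id `"0"`), then resume new entries from the group's offset (`">"`). A minimal sketch, assuming a redis-py client and illustrative stream, group, and field names not taken from the source:

```python
# Crash-recovery loop for a stream consumer. `r` is a redis-py client
# (assumed); stream/group/consumer names and the field schema are
# illustrative, not the platform's actual names.

STREAM, GROUP, CONSUMER = "telemetry", "processor", "processor-1"

def to_point(fields: dict) -> tuple:
    # Illustrative decode step — the real telemetry schema is not shown here.
    return fields[b"device"].decode(), float(fields[b"lat"]), float(fields[b"lon"])

def recover_and_consume(r, handle) -> None:
    # Pass 1: start id "0" replays entries delivered to this consumer but
    # never XACKed before the crash. Pass 2: ">" reads new entries from
    # the group's last-delivered offset onward.
    for start_id in ("0", ">"):
        while True:
            resp = r.xreadgroup(GROUP, CONSUMER, {STREAM: start_id}, count=100)
            entries = resp[0][1] if resp else []
            if not entries:
                break
            for entry_id, fields in entries:
                handle(to_point(fields))       # handler must be idempotent
                r.xack(STREAM, GROUP, entry_id)
            if start_id == ">":
                break  # single non-blocking pass in this sketch

if __name__ == "__main__":
    import redis
    recover_and_consume(redis.Redis(), print)
```

Because acknowledgment happens only after a successful write, a crash between `handle` and `xack` means the entry is replayed on restart — which is why the handler must be idempotent.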
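
## Sketch: polling the canary metric

The canary metric can be polled directly: since Redis 7, `XINFO GROUPS` reports a per-group `lag` (entries added to the stream but not yet delivered to the group), alongside `pending` (delivered but unacknowledged). A hedged health-probe sketch, assuming a redis-py client; the stream/group names and threshold are illustrative:

```python
# Health probe for the pipeline canary metric. `r` is a redis-py client
# (assumed); the threshold and names are illustrative, not from the source.

MAX_BACKLOG = 1_000  # assumed alert threshold, in stream entries

def backlog(lag, pending) -> int:
    # Total unprocessed work: undelivered entries plus delivered-but-unacked.
    # Redis may report lag as nil after certain stream trims; treat as 0 here.
    return (lag or 0) + (pending or 0)

def pipeline_healthy(r, stream="telemetry", group="processor") -> bool:
    for g in r.xinfo_groups(stream):  # requires Redis >= 7 for the "lag" field
        if g["name"] in (group, group.encode()):
            return backlog(g.get("lag"), g.get("pending")) <= MAX_BACKLOG
    return False  # group missing entirely — nothing is consuming
```

A rising backlog with a healthy database points at the Processor; a rising backlog with a failing database points at the single point of failure — the one number separates the two cases quickly.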