docs/wiki/concepts/failure-domains.md

---
title: Failure Domains
type: concept
created: 2026-04-30
updated: 2026-04-30
sources: [gps-tracking-architecture]
tags: [architecture, reliability]
---

# Failure Domains

Each component of the platform fails independently. The architecture deliberately concentrates operational risk in one place — the database — and keeps everything else restartable, replaceable, or naturally redundant.

## Per-component failure behavior

| Component | Crash behavior | Data loss |
|-----------|---------------|-----------|
| [[tcp-ingestion]] | Devices reconnect; in-flight frames retransmitted by the device per protocol | None beyond unacknowledged frames |
| [[redis-streams]] | Streams are persisted; a restart resumes from disk | Recoverable via device retransmits and Processor checkpointing |
| [[processor]] | Consumer-group offsets let the next instance pick up where the crashed one stopped; in-memory state is rehydrated from the DB | None |
| [[directus]] | Telemetry continues to flow into DB; admin UI/SPA unavailable | None |
| [[postgres-timescaledb]] | System stops accepting writes; this is the single point of failure | Bounded by replication, backups, and point-in-time recovery |
| [[react-spa]] | UI unavailable | N/A — no state owned |
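
The [[processor]] row relies on consumer-group semantics: an entry that was delivered but never acknowledged stays pending, so a replacement instance can take it over. The sketch below models that behavior in memory only. `ToyGroup` and the consumer names are hypothetical, and the methods merely mirror what Redis offers through `XREADGROUP`, `XACK`, and `XAUTOCLAIM`; it is not the platform's Processor code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyGroup:
    """In-memory stand-in for one stream plus one consumer group (not Redis)."""

    entries: list[str] = field(default_factory=list)        # appended frames, in order
    next_index: int = 0                                      # first entry never delivered
    pending: dict[int, str] = field(default_factory=dict)   # delivered, unacked -> owner

    def add(self, frame: str) -> int:
        self.entries.append(frame)
        return len(self.entries) - 1

    def read_new(self, consumer: str) -> tuple[int, str] | None:
        """Deliver the next never-delivered entry and mark it pending."""
        if self.next_index >= len(self.entries):
            return None
        idx = self.next_index
        self.next_index += 1
        self.pending[idx] = consumer
        return idx, self.entries[idx]

    def ack(self, idx: int) -> None:
        """Acknowledge an entry so it is no longer pending."""
        self.pending.pop(idx, None)

    def claim_pending(self, new_owner: str) -> list[tuple[int, str]]:
        """Hand every unacknowledged entry to new_owner, oldest first."""
        claimed = []
        for idx in sorted(self.pending):
            self.pending[idx] = new_owner
            claimed.append((idx, self.entries[idx]))
        return claimed


def demo_crash_and_takeover() -> list[str]:
    group = ToyGroup()
    for frame in ("f0", "f1", "f2"):
        group.add(frame)

    # processor-a fully handles f0, then crashes after reading f1 but before acking it.
    idx, _ = group.read_new("processor-a")
    group.ack(idx)
    group.read_new("processor-a")

    # processor-b starts: reclaim unacknowledged work first, then read new entries.
    handled = []
    for idx, frame in group.claim_pending("processor-b"):
        handled.append(frame)
        group.ack(idx)
    while (item := group.read_new("processor-b")) is not None:
        idx, frame = item
        handled.append(frame)
        group.ack(idx)
    return handled
```

In this model the frame that was in flight during the crash is processed again by the replacement, which is why the handler for each entry should tolerate being run more than once.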

## The discipline behind this

- **No component reaches across two plane boundaries** — see [[plane-separation]]. A failure in one plane cannot cascade through another.
- **The TCP handler never blocks on downstream work.** Slow Processor or DB pressure is absorbed by [[redis-streams]], not by device sockets.
- **Per-device session state lives only on the open socket**, so Ingestion is trivially restartable: a restart drops the sockets and devices reconnect.
- **The Processor's hot state can always be rehydrated** from the DB.
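
The second rule, that the handler hands frames off and never waits for downstream work, can be illustrated with a small asyncio sketch. Everything here is an assumption made for illustration: an unbounded `asyncio.Queue` stands in for [[redis-streams]], and `handle_frames`, `slow_processor`, and the delay value are hypothetical names and numbers rather than the platform's code.

```python
import asyncio


async def handle_frames(frames, stream: asyncio.Queue) -> int:
    """Stand-in for the TCP handler: append each frame and move on."""
    accepted = 0
    for frame in frames:
        stream.put_nowait(frame)   # unbounded queue: never waits on the consumer
        accepted += 1
        await asyncio.sleep(0)     # yield to the event loop, as a socket read would
    return accepted


async def slow_processor(stream: asyncio.Queue, processed: list, delay: float) -> None:
    """Stand-in for a Processor under pressure: each frame takes `delay` seconds."""
    while True:
        frame = await stream.get()
        await asyncio.sleep(delay)
        processed.append(frame)
        stream.task_done()


async def demo() -> tuple[int, int, int]:
    stream: asyncio.Queue = asyncio.Queue()
    processed: list = []
    worker = asyncio.create_task(slow_processor(stream, processed, delay=0.02))

    accepted = await handle_frames([f"frame-{i}" for i in range(10)], stream)
    backlog_at_handoff = stream.qsize()   # frames buffered while the consumer lags

    await stream.join()                   # let the slow consumer drain the backlog
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return accepted, backlog_at_handoff, len(processed)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The handler finishes accepting all frames while most of them are still buffered; the backlog, not the device socket, carries the cost of the slow consumer.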

## Operational consequence

The database gets careful operational attention: replication, backups, and point-in-time recovery (a PostgreSQL capability that TimescaleDB inherits). Everything else can be restarted, redeployed, or scaled without ceremony.

## Canary metric

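**Redis Streams consumer lag.** It reflects the health of the entire telemetry pipeline in one number: lag rises whenever the Processor is down, slow, or stalled on database writes, because [[redis-streams]] absorbs that pressure.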
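
One way to read this metric, sketched under assumptions: Redis 7.0 and later report a `lag` field per consumer group in the reply to `XINFO GROUPS`, which redis-py exposes as `xinfo_groups()`. The helper names, the stream key, and the threshold below are hypothetical, and the code assumes a client created with `decode_responses=True` so that names come back as `str`.

```python
from __future__ import annotations

from typing import Any, Iterable, Mapping


def group_lags(groups: Iterable[Mapping[str, Any]]) -> dict[str, int | None]:
    """Map each consumer group name to its reported lag.

    `groups` is the list of per-group dicts as returned by XINFO GROUPS.
    Lag is None when the server cannot determine it or does not report it.
    """
    return {g["name"]: g.get("lag") for g in groups}


def pipeline_healthy(lags: Mapping[str, int | None], threshold: int) -> bool:
    """Treat unknown lag as unhealthy so a missing reading is never ignored."""
    return all(lag is not None and lag <= threshold for lag in lags.values())


def fetch_groups(client: Any, stream: str) -> list[Mapping[str, Any]]:
    """Thin wrapper around a redis-py client; `stream` is a hypothetical key name."""
    return client.xinfo_groups(stream)


# Example with a hand-written reply shaped like XINFO GROUPS output.
sample = [
    {"name": "processor", "consumers": 2, "pending": 3, "lag": 120},
    {"name": "audit", "consumers": 1, "pending": 0, "lag": None},
]
lags = group_lags(sample)
```

With a live server, `group_lags(fetch_groups(client, "telemetry"))` would produce the same mapping; the alert threshold is a tuning decision this page does not specify.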