docs/wiki/concepts/failure-domains.md
julian 22b1b069df Bootstrap LLM-maintained wiki with TRM architecture knowledge
Initialize CLAUDE.md schema, index, and log; ingest three architecture
sources (system overview, Teltonika ingestion design, official Teltonika
data-sending protocols) into 7 entity pages, 8 concept pages, and 3
source pages with wikilink cross-references.
2026-04-30 13:20:17 +02:00

---
title: Failure Domains
type: concept
created: 2026-04-30
updated: 2026-04-30
sources: [gps-tracking-architecture]
tags: [architecture, reliability]
---
# Failure Domains
Each component of the platform fails independently. The architecture deliberately concentrates operational risk in one place — the database — and keeps everything else restartable, replaceable, or naturally redundant.
## Per-component failure behavior
| Component | Crash behavior | Data loss |
|-----------|---------------|-----------|
| [[tcp-ingestion]] | Devices reconnect; in-flight frames retransmitted by the device per protocol | None beyond unacknowledged frames |
| [[redis-streams]] | Streams are persisted; restart resumes from disk | Recoverable through device retransmits and Processor checkpointing |
| [[processor]] | Consumer-group state (last-delivered ID and pending entries) lets the next instance pick up where the crashed one stopped; in-memory state rehydrated from DB | None |
| [[directus]] | Telemetry continues to flow into DB; admin UI/SPA unavailable | None |
| [[postgres-timescaledb]] | System stops accepting writes | Bounded by replication and backups; this is the single point of failure |
| [[react-spa]] | UI unavailable | N/A — no state owned |
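
The "in-memory state rehydrated from DB" entry for the Processor can be illustrated with a small sketch. The source does not say what the hot state contains, so this example assumes it is a latest-position-per-device cache; `Position` and `rehydrate_last_positions` are hypothetical names, and the rows stand in for whatever a real query against the database would return.

```python
from typing import Dict, Iterable, NamedTuple


class Position(NamedTuple):
    """Hypothetical shape of one persisted telemetry row."""
    device_id: str
    ts: int      # epoch seconds
    lat: float
    lon: float


def rehydrate_last_positions(rows: Iterable[Position]) -> Dict[str, Position]:
    """Rebuild the per-device cache from persisted rows.

    Assumption: the hot state is "newest row per device". Rows may arrive
    in any order, so the newest timestamp wins.
    """
    latest: Dict[str, Position] = {}
    for row in rows:
        current = latest.get(row.device_id)
        if current is None or row.ts > current.ts:
            latest[row.device_id] = row
    return latest


# Example rows as they might come back from the database after a restart.
rows = [
    Position("dev-a", 100, 46.00, 14.50),
    Position("dev-b", 90, 45.80, 15.90),
    Position("dev-a", 120, 46.10, 14.60),
]
cache = rehydrate_last_positions(rows)
```

Because the cache is derived entirely from persisted rows, a fresh Processor instance needs no state handed over from the one that crashed.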
## The discipline behind this
- **No component reaches across two plane boundaries** — see [[plane-separation]]. A failure in one plane cannot cascade through another.
- **The TCP handler never blocks on downstream work.** Slow Processor or DB pressure is absorbed by [[redis-streams]], not by device sockets.
- **Per-device session state lives only on the open socket** — Ingestion is trivially restartable.
- **The Processor's hot state can always be rehydrated** from the DB.
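
The decoupling described in the list above can be sketched with an in-memory stand-in for the stream. This is an illustration under stated assumptions, not the platform's code: `StreamBuffer`, `handle_frame`, and `drain` are hypothetical names, and the `b"ACK"` reply is a placeholder rather than the actual Teltonika acknowledgement format.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Tuple


@dataclass
class StreamBuffer:
    """In-memory stand-in for a Redis stream (hypothetical, for illustration)."""
    entries: Deque[Tuple[int, bytes]] = field(default_factory=deque)
    next_id: int = 1

    def append(self, payload: bytes) -> int:
        entry_id = self.next_id
        self.entries.append((entry_id, payload))
        self.next_id += 1
        return entry_id


def handle_frame(buffer: StreamBuffer, frame: bytes) -> bytes:
    """Ingestion side: append the frame, then acknowledge.

    No database or Processor call happens here, so a slow downstream
    cannot stall the device socket.
    """
    buffer.append(frame)
    return b"ACK"  # placeholder reply, not the real protocol response


def drain(buffer: StreamBuffer, write_to_db: Callable[[bytes], None],
          max_entries: int) -> int:
    """Processor side: consume at its own pace; returns entries processed."""
    processed = 0
    while buffer.entries and processed < max_entries:
        _entry_id, payload = buffer.entries.popleft()
        write_to_db(payload)
        processed += 1
    return processed


# Five frames arrive while the downstream only keeps up with two.
buffer = StreamBuffer()
acks = [handle_frame(buffer, f"frame-{i}".encode()) for i in range(5)]
db_rows: list = []
processed = drain(buffer, db_rows.append, max_entries=2)
backlog = len(buffer.entries)
```

Every device gets its acknowledgement regardless of how far the consumer has progressed; the unprocessed frames simply remain in the buffer as backlog.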
## Operational consequence
The database gets careful operational attention: replication, backups, and point-in-time recovery (a PostgreSQL capability that also covers TimescaleDB data). Everything else can be restarted, redeployed, or scaled without ceremony.
## Canary metric
**Redis Streams consumer lag.** It summarizes the health of the entire telemetry pipeline in one number: a slow Processor or a struggling database shows up as growing lag on [[redis-streams]].
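
One way to turn this metric into an alert is sketched below. It assumes Redis 7.0 or later, where `XINFO GROUPS` reports a `lag` field per consumer group (null when Redis cannot determine it). The function names, group names, and thresholds are hypothetical, and the input is a plain list of dicts so the logic stays independent of any client library.

```python
from typing import Iterable, Mapping, Optional


def worst_consumer_lag(groups: Iterable[Mapping[str, object]]) -> Optional[int]:
    """Return the largest known lag across consumer groups.

    Each mapping mirrors one XINFO GROUPS entry. Groups whose lag is
    missing or None are skipped; None is returned if no lag is known.
    """
    lags = [g["lag"] for g in groups if g.get("lag") is not None]
    return max(lags) if lags else None


def classify(lag: Optional[int], warn_at: int = 1_000,
             page_at: int = 10_000) -> str:
    """Map a lag value to an alert level (thresholds are illustrative)."""
    if lag is None:
        return "unknown"
    if lag >= page_at:
        return "page"
    if lag >= warn_at:
        return "warn"
    return "ok"


# Sample data shaped like XINFO GROUPS output (values invented for the example).
sample_groups = [
    {"name": "processor", "pending": 3, "lag": 42},
    {"name": "audit", "pending": 0, "lag": None},
]
worst = worst_consumer_lag(sample_groups)
level = classify(worst)
```

Treating an undeterminable lag as `"unknown"` rather than `"ok"` keeps a blind spot from being reported as a healthy pipeline.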