---
title: Failure Domains
type: concept
created: 2026-04-30
updated: 2026-04-30
sources: [gps-tracking-architecture]
tags: [architecture, reliability]
---

# Failure Domains

Each component of the platform fails independently. The architecture deliberately concentrates operational risk in one place, the database, and keeps everything else restartable, replaceable, or naturally redundant.

## Per-component failure behavior

| Component | Crash behavior | Data loss |
|-----------|---------------|-----------|
| [[tcp-ingestion]] | Devices reconnect; in-flight frames retransmitted by the device per protocol | None beyond unacknowledged frames |
| [[redis-streams]] | Streams are persisted; restart resumes from disk | Recoverable from device retransmits + Processor checkpointing |
| [[processor]] | Consumer-group offsets ensure next instance picks up; in-memory state rehydrated from DB | None |
| [[directus]] | Telemetry continues to flow into DB; admin UI/SPA unavailable | None |
| [[postgres-timescaledb]] | System stops accepting writes | The single point of failure |
| [[react-spa]] | UI unavailable | N/A (no state owned) |

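
The [[processor]] row depends on consumer-group bookkeeping: an entry that was delivered but never acknowledged stays pending, so a replacement instance can pick it up. The sketch below is a toy in-memory model of that idea, not the Redis API. `ToyConsumerGroup`, `claim_pending`, and the consumer names are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyConsumerGroup:
    """In-memory stand-in for a stream consumer group (hypothetical, not Redis)."""

    entries: list[str]  # stream contents, oldest first
    next_index: int = 0  # first entry that has never been delivered
    pending: dict[int, str] = field(default_factory=dict)  # index -> consumer holding it

    def read(self, consumer: str) -> tuple[int, str] | None:
        """Deliver the next new entry and record it as unacknowledged."""
        if self.next_index >= len(self.entries):
            return None
        index = self.next_index
        self.next_index += 1
        self.pending[index] = consumer
        return index, self.entries[index]

    def ack(self, index: int) -> None:
        """Mark an entry as fully processed."""
        self.pending.pop(index, None)

    def claim_pending(self, new_consumer: str) -> list[tuple[int, str]]:
        """Hand every unacknowledged entry to a replacement consumer."""
        claimed = []
        for index in sorted(self.pending):
            self.pending[index] = new_consumer
            claimed.append((index, self.entries[index]))
        return claimed


group = ToyConsumerGroup(entries=["frame-1", "frame-2", "frame-3"])

first = group.read("processor-a")
group.ack(first[0])        # frame-1 processed and acknowledged
group.read("processor-a")  # frame-2 delivered, then processor-a crashes before ack

# The replacement instance recovers the unacknowledged entry.
recovered = group.claim_pending("processor-b")  # [(1, "frame-2")]
```

In this model nothing is lost on a crash because delivery and acknowledgement are separate steps. The cost is that an entry can be processed twice, so the processing itself would need to tolerate duplicates.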
## The discipline behind this

- **No component reaches across two plane boundaries** (see [[plane-separation]]). A failure in one plane cannot cascade through another.
- **The TCP handler never blocks on downstream work.** Slow Processor or DB pressure is absorbed by [[redis-streams]], not by device sockets.
- **Per-device session state lives only on the open socket**, so Ingestion is trivially restartable.
- **The Processor's hot state can always be rehydrated** from the DB.

## Operational consequence

The database gets careful operational attention: replication, backups, and point-in-time recovery for the PostgreSQL/TimescaleDB instance. Everything else can be restarted, redeployed, or scaled without ceremony.

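## Canary metric

**Redis Streams consumer lag.** It reflects the health of the entire telemetry pipeline in one number: slow Processor or DB pressure is absorbed by [[redis-streams]], so trouble downstream of ingestion shows up as growing lag.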
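
As a rough illustration of how such a metric might be evaluated, the sketch below derives lag from two counters and maps it to a status. The function names, the counter inputs, and the thresholds (1,000 and 10,000 entries) are all hypothetical; this page does not define alerting levels, and how the counters are collected is left open.

```python
from __future__ import annotations


def consumer_lag(entries_added: int, entries_read: int) -> int:
    """Entries appended to the stream but not yet delivered to the group."""
    return max(entries_added - entries_read, 0)


def classify_lag(lag: int, warn_at: int = 1_000, page_at: int = 10_000) -> str:
    """Map a lag value to a status using hypothetical thresholds."""
    if lag >= page_at:
        return "critical"
    if lag >= warn_at:
        return "warning"
    return "ok"


def is_growing(samples: list[int]) -> bool:
    """True when every sample is larger than the one before it."""
    return len(samples) > 1 and all(a < b for a, b in zip(samples, samples[1:]))


# Three successive (entries_added, entries_read) readings, invented for the example.
readings = [(5_000, 4_990), (9_000, 7_500), (30_000, 12_000)]
samples = [consumer_lag(added, read) for added, read in readings]
statuses = [classify_lag(lag) for lag in samples]

# samples == [10, 1500, 18000]
# statuses == ["ok", "warning", "critical"]
# is_growing(samples) is True
```

A sustained upward trend is often more informative than any single value, since a brief spike can simply be a burst of reconnecting devices.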