Bootstrap LLM-maintained wiki with TRM architecture knowledge
Initialize CLAUDE.md schema, index, and log; ingest three architecture sources (system overview, Teltonika ingestion design, official Teltonika data-sending protocols) into 7 entity pages, 8 concept pages, and 3 source pages with wikilink cross-references.
@@ -0,0 +1,38 @@
---
title: Failure Domains
type: concept
created: 2026-04-30
updated: 2026-04-30
sources: [gps-tracking-architecture]
tags: [architecture, reliability]
---

# Failure Domains

Each component of the platform fails independently. The architecture deliberately concentrates operational risk in one place, the database, and keeps everything else restartable, replaceable, or naturally redundant.

## Per-component failure behavior

| Component | Crash behavior | Data loss |
|-----------|----------------|-----------|
| [[tcp-ingestion]] | Devices reconnect; in-flight frames retransmitted by the device per protocol | None beyond unacknowledged frames |
| [[redis-streams]] | Streams are persisted; restart resumes from disk | Recoverable from device retransmits + Processor checkpointing |
| [[processor]] | Consumer-group offsets ensure next instance picks up; in-memory state rehydrated from DB | None |
| [[directus]] | Telemetry continues to flow into DB; admin UI/SPA unavailable | None |
| [[postgres-timescaledb]] | System stops accepting writes | Possible, bounded by replication and backups; this is the single point of failure |
| [[react-spa]] | UI unavailable | N/A (no state owned) |

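
The [[processor]] row relies on consumer-group bookkeeping: an entry that was delivered but never acknowledged stays pending, so a replacement instance can pick it up. The sketch below is a toy in-memory model of that idea, not the Redis API; the class `ConsumerGroup`, its methods, and the consumer names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ConsumerGroup:
    """Toy model of a stream consumer group (illustrative, not the Redis API)."""

    stream: List[str]
    next_index: int = 0  # position of the next never-delivered entry
    # Delivered but not yet acknowledged: entry index -> owning consumer.
    pending: Dict[int, str] = field(default_factory=dict)

    def read(self, consumer: str) -> Optional[Tuple[int, str]]:
        """Deliver the next new entry to `consumer` and mark it pending."""
        if self.next_index >= len(self.stream):
            return None
        idx = self.next_index
        self.next_index += 1
        self.pending[idx] = consumer
        return idx, self.stream[idx]

    def ack(self, idx: int) -> None:
        """Acknowledge an entry once its effects are durably stored."""
        self.pending.pop(idx, None)

    def claim_pending(self, new_consumer: str) -> List[Tuple[int, str]]:
        """Hand every unacknowledged entry to a replacement consumer."""
        claimed = []
        for idx in sorted(self.pending):
            self.pending[idx] = new_consumer
            claimed.append((idx, self.stream[idx]))
        return claimed


group = ConsumerGroup(stream=["frame-1", "frame-2", "frame-3"])
idx, _ = group.read("processor-a")
group.ack(idx)                 # frame-1 fully processed
group.read("processor-a")      # frame-2 delivered, then processor-a crashes
recovered = group.claim_pending("processor-b")  # frame-2 is not lost
```

Because acknowledgement happens only after processing, a crash between delivery and acknowledgement leads to redelivery rather than loss. This assumes the Processor's writes tolerate seeing the same frame twice.
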
## The discipline behind this

- **No component reaches across two plane boundaries** (see [[plane-separation]]). A failure in one plane cannot cascade through another.
- **The TCP handler never blocks on downstream work.** A slow Processor or DB pressure is absorbed by [[redis-streams]], not by device sockets.
- **Per-device session state lives only on the open socket**, so Ingestion is trivially restartable.
- **The Processor's hot state can always be rehydrated** from the DB.

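
The second rule above can be made concrete with a small simulation. It is a sketch under simplifying assumptions: time advances in discrete ticks, ingestion appends a fixed number of frames per tick and never waits, and the processor drains at most its capacity for that tick. The function name `simulate_backlog` and its parameters are hypothetical.

```python
from collections import deque
from typing import List


def simulate_backlog(ingest_per_tick: int, capacity_by_tick: List[int]) -> List[int]:
    """Return the buffer depth after each tick.

    Ingestion always appends and returns immediately; only the buffer
    depth reacts to a slow consumer.
    """
    buffer = deque()
    depth_history = []
    seq = 0
    for capacity in capacity_by_tick:
        # Ingestion side: append every frame without waiting on the consumer.
        for _ in range(ingest_per_tick):
            buffer.append(seq)
            seq += 1
        # Processor side: drain at most `capacity` frames this tick.
        for _ in range(min(capacity, len(buffer))):
            buffer.popleft()
        depth_history.append(len(buffer))
    return depth_history


# Processor is slow for two ticks, then catches up.
print(simulate_backlog(ingest_per_tick=3, capacity_by_tick=[1, 1, 5, 5]))
# → [2, 4, 2, 0]
```

The ingestion loop runs identically in every tick regardless of processor capacity. The slowdown shows up only as buffer depth, which is the quantity the canary metric tracks.
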
## Operational consequence

The database gets careful operational attention: replication, backups, and point-in-time recovery (a PostgreSQL capability that TimescaleDB inherits). Everything else can be restarted, redeployed, or scaled without ceremony.

## Canary metric

**Redis Streams consumer lag.** Lag grows whenever the Processor or the database falls behind the rate at which Ingestion appends frames, so it reflects the health of the entire telemetry pipeline in one number.

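
A minimal sketch of how the canary could be evaluated, assuming Redis 7.0 or later, where `XINFO GROUPS` reports a per-group `lag` field (which can be null when Redis cannot compute it). The functions below operate on plain dictionaries shaped like that reply, so no Redis connection is needed; the function names and both thresholds are hypothetical and would need tuning against real traffic.

```python
from typing import Iterable, Mapping, Optional


def worst_lag(groups: Iterable[Mapping[str, object]]) -> Optional[int]:
    """Largest reported lag across consumer groups, or None if none is known."""
    lags = [g["lag"] for g in groups if g.get("lag") is not None]
    return max(lags) if lags else None


def classify(lag: Optional[int], warn: int = 1_000, crit: int = 10_000) -> str:
    """Map a lag value to a coarse health state (thresholds are placeholders)."""
    if lag is None:
        return "unknown"
    if lag >= crit:
        return "critical"
    if lag >= warn:
        return "warning"
    return "ok"


# Example input shaped like an XINFO GROUPS reply (values are made up).
sample = [
    {"name": "processor", "pending": 3, "lag": 2_500},
    {"name": "audit", "pending": 0, "lag": 12},
]
print(classify(worst_lag(sample)))  # → warning
```

Treating a missing lag as `unknown` rather than `ok` keeps the canary from reporting health it cannot actually observe.
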