Add Phase 1 and Phase 2 planning documents
ROADMAP plus granular task files per phase. Phase 1 (12 tasks + 1.13 device authority) covers Codec 8/8E/16 telemetry ingestion; Phase 2 (6 tasks) covers Codec 12/14 outbound commands; Phase 3 enumerates deferred items.
# Task 1.12 — Production hardening

**Phase:** 1 — Inbound telemetry

**Status:** ⬜ Not started

**Depends on:** 1.8, 1.10, 1.11

**Wiki refs:** `docs/wiki/concepts/failure-domains.md`
## Goal

Make the service safe for unattended production operation: graceful shutdown, robust error handling, structured logging discipline, sane defaults for resource limits, and operational documentation.
## Deliverables

- `src/core/lifecycle.ts` — `installGracefulShutdown({ ... })` that wires SIGTERM/SIGINT/SIGHUP to a coordinated shutdown.
- `src/core/errors.ts` — typed error classes (`HandshakeError`, `FrameError`, `PublishOverflowError`, `RedisUnavailableError`).
- Updates to `src/main.ts` to install error handlers and shutdown.
- `OPERATIONS.md` (or section in `README.md`) covering: env var reference, signals, log fields, metric meanings, common alert rules, troubleshooting.
- (Optional) `docs/runbook.md` for on-call: "what to do when X alert fires."
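Since `src/core/errors.ts` is a deliverable, a minimal sketch of its shape may help. The four class names come from the list above; the fields and constructor signatures are assumptions, not a fixed API:

```typescript
// Hedged sketch of src/core/errors.ts. Class names are from the deliverables
// list; fields and constructor shapes are assumptions.
export class HandshakeError extends Error {
  constructor(message: string, readonly imei?: string) {
    super(message);
    this.name = 'HandshakeError';
  }
}

export class FrameError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'FrameError';
  }
}

export class PublishOverflowError extends Error {
  constructor(readonly queueDepth: number) {
    super(`publish queue overflow (depth=${queueDepth})`);
    this.name = 'PublishOverflowError';
  }
}

export class RedisUnavailableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'RedisUnavailableError';
  }
}
```

Setting `name` explicitly keeps structured log output readable when the error object is serialized.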
## Specification

### Graceful shutdown

On SIGTERM (deployment rolling update) or SIGINT (Ctrl-C):
1. **Stop accepting new connections.** `server.close()` — existing sockets continue.
2. **Drain the publish queue.** Stop accepting new `publish()` calls; wait for the worker to flush queued records to Redis (with a timeout, e.g. 10s).
3. **Send a final goodbye on each open socket.** Optional: it is simpler to let the TCP FIN happen naturally; devices will reconnect to a new instance.
4. **Close the Redis connection.**
5. **Exit cleanly with code 0.**

If shutdown takes longer than `SHUTDOWN_TIMEOUT_MS` (default 30s), log and exit with code 1 — the orchestrator will SIGKILL anyway, but exiting deliberately gives a cleaner signal.
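Step 2's bounded drain can be sketched as a race between the flush and a deadline. `drainWithTimeout` and the `flush` callback are hypothetical names for illustration, not this task's actual API:

```typescript
// Hedged sketch of a bounded drain: resolve when the flush completes or the
// deadline elapses, whichever comes first. `flush` stands in for the publish
// worker's flush-to-Redis; the name and signature are assumptions.
async function drainWithTimeout(
  flush: () => Promise<void>,
  timeoutMs: number,
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  const drained = flush().then(() => true);
  const result = await Promise.race([drained, deadline]);
  clearTimeout(timer); // don't keep the event loop alive after a clean drain
  return result;       // true = fully drained, false = gave up at the deadline
}
```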
```ts
export function installGracefulShutdown(handles: ShutdownHandles) {
  let shuttingDown = false;

  const shutdown = async (signal: string) => {
    if (shuttingDown) return;
    shuttingDown = true;
    handles.logger.info({ signal }, 'shutdown: starting');

    const deadline = setTimeout(() => {
      handles.logger.error({}, 'shutdown: timed out, forcing exit');
      process.exit(1);
    }, handles.timeoutMs ?? 30_000);

    try {
      await new Promise<void>((res) => handles.server.close(() => res()));
      await handles.publisher.drain(10_000);
      await handles.redis.quit();
      handles.metricsServer.close();
      clearTimeout(deadline);
      handles.logger.info({}, 'shutdown: clean exit');
      process.exit(0);
    } catch (err) {
      handles.logger.error({ err }, 'shutdown: error during drain');
      clearTimeout(deadline);
      process.exit(1);
    }
  };

  process.on('SIGTERM', () => shutdown('SIGTERM'));
  process.on('SIGINT', () => shutdown('SIGINT'));
}
```
### Unhandled promise / uncaught exception
```ts
process.on('unhandledRejection', (reason) => {
  logger.fatal({ reason }, 'unhandledRejection');
  process.exit(1);
});

process.on('uncaughtException', (err) => {
  logger.fatal({ err }, 'uncaughtException');
  process.exit(1);
});
```
Crashing the process on either is the right move — the orchestrator restarts, devices reconnect, no harm done. The wrong move is to log and continue; that hides real bugs.

The `@typescript-eslint/no-floating-promises` rule (added in task 1.1) is the first line of defense; these handlers are the safety net.
### Per-socket error handling

In the session loop:
- Errors from `BufferedReader` / `frame.ts` / codec parsers: log at `warn` with `imei`, drop the socket.
- Errors from `ctx.publish` (specifically `PublishOverflowError`): skip the ACK and continue reading; the device retransmits.
- Errors from `ctx.publish` (other, unexpected): log at `error`, drop the socket. Open question: should we crash the process? Recommendation: drop the socket only; let the publisher's own logic decide whether the underlying issue (e.g. a Redis hang) warrants process exit.
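The policy above can be restated as a pure decision function, which also makes it trivial to unit-test. Apart from `PublishOverflowError` (from the deliverables), every name here is hypothetical:

```typescript
// Hedged restatement of the per-socket error policy as a pure function.
// PublishOverflowError matches the deliverables list; the phase/action
// vocabulary is an assumption for illustration.
class PublishOverflowError extends Error {}

type ErrorPhase = 'parse' | 'publish';
type Action = 'warn_and_drop' | 'skip_ack' | 'error_and_drop';

function sessionErrorAction(phase: ErrorPhase, err: unknown): Action {
  if (phase === 'parse') return 'warn_and_drop';              // reader/frame/codec errors
  if (err instanceof PublishOverflowError) return 'skip_ack'; // device retransmits
  return 'error_and_drop';                                    // unexpected publish error
}
```

Keeping the decision separate from the socket plumbing means the open question above (crash vs. drop) can later be changed in one place.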
### Resource limits
- **Max concurrent connections per instance:** soft cap via gauge alert (`teltonika_connections_active > 5000`). No hard cap in code — let the OS-level fd limit be the real ceiling.
- **Per-connection memory:** the `BufferedReader` buffer is bounded by `MAX_AVL_PACKET_SIZE` (~1.3 KB) per session. With 5,000 connections, that is ~6.5 MB of buffer state — fine.
- **Node heap:** set via `NODE_OPTIONS=--max-old-space-size=512` in the Dockerfile or compose file. 512 MB is plenty for this workload.
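The two knobs above might be wired together in one place; a hedged compose fragment, where the service name and fd limit value are assumptions:

```yaml
# Hypothetical docker-compose fragment; service name and nofile value assumed.
services:
  tcp-ingestion:
    environment:
      - NODE_OPTIONS=--max-old-space-size=512  # Node heap cap
    ulimits:
      nofile: 65536  # the OS fd limit is the real connection ceiling
```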
### Logging discipline (audit pass)

Before declaring this task done, walk through every `logger.*` call site and confirm:
- `info`: lifecycle events (startup, shutdown, server bound).
- `warn`: recoverable per-frame issues (CRC fail, malformed handshake), per-connection drops.
- `error`: per-publish failures, unexpected per-session errors.
- `fatal`: process-killing conditions (Redis unreachable for >X seconds, `unhandledRejection`).
- `debug`: per-frame parse details, per-record publish details.
- No `console.log` anywhere in production paths. If any exist, replace them.
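The `console.log` rule can also be enforced mechanically rather than by manual audit; a sketch assuming a flat ESLint config, using the standard `no-console` rule (the file glob is an assumption):

```js
// Hypothetical addition to eslint.config.js: make the console audit mechanical.
export default [
  {
    files: ['src/**/*.ts'],
    rules: {
      'no-console': 'error', // no console.log in production paths
    },
  },
];
```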
### OPERATIONS.md outline
```
# tcp-ingestion — Operations

## Configuration
[table of env vars from task 1.3]

## Signals
| Signal | Effect |
|--------|--------|
| SIGTERM | Graceful shutdown (drain publish queue, close connections, exit 0) |
| SIGINT | Same as SIGTERM |

## Metrics
[table of metrics from task 1.10]

## Alerts (recommended)
- `teltonika_unknown_codec_total > 0` for 5 min: investigate codec coverage drift.
- `teltonika_publish_overflow_total > 0` for 1 min: Redis or downstream backed up.
- `rate(teltonika_frames_total{result="crc_fail"}[5m]) / rate(teltonika_frames_total[5m]) > 0.01`: high CRC error rate; suspect device firmware or line quality.
- `teltonika_connections_active{instance=...} == 0` for 10 min while peer instances have traffic: instance is silently broken; investigate.

## Troubleshooting
- "Devices not connecting" → check TCP_PORT firewall rules, the /readyz response, Redis connectivity.
- "Records not appearing in Redis" → check the publish queue depth metric, then Redis connectivity.
- "High CRC failures from one IMEI" → likely a firmware bug or bad cellular link; coordinate with device fleet ops.
```
## Acceptance criteria
- [ ] SIGTERM during steady-state traffic results in a clean exit with no data loss (verified by killing the process and confirming the publish queue drained, with no `PublishOverflowError` in the last second of logs).
- [ ] SIGTERM under publish-queue-overflow conditions still exits within `SHUTDOWN_TIMEOUT_MS`.
- [ ] An `unhandledRejection` (intentionally injected via a test) logs at `fatal` and exits non-zero.
- [ ] `OPERATIONS.md` is populated and accurate; an on-caller could read it cold and find the answer to "what does this metric mean?"
- [ ] All log calls audited; no `console.log` in production paths.
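The `unhandledRejection` criterion can be checked by spawning a child Node process. This is a hedged sketch with the logger simplified to `console.error`, not the project's actual test harness:

```typescript
// Hedged sketch of the unhandledRejection acceptance check: a child process
// installs the fatal handler, a rejection is injected, and we assert a
// non-zero exit. The structured-log shape here is simplified.
import { spawnSync } from 'node:child_process';

const child = spawnSync(
  process.execPath,
  [
    '-e',
    `
    process.on('unhandledRejection', (reason) => {
      console.error(JSON.stringify({ level: 'fatal', msg: 'unhandledRejection' }));
      process.exit(1);
    });
    Promise.reject(new Error('injected'));
    `,
  ],
  { encoding: 'utf8' },
);

console.log('exit status:', child.status);   // expect 1
console.log('stderr:', child.stderr.trim()); // expect the fatal log line
```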
## Risks / open questions
- The "drain publish queue with timeout" balance: too long blocks deployments; too short loses records on shutdown. The default of 10s is a reasonable starting point; tune it after real production data.
- Crashing on `unhandledRejection` is opinionated; some teams prefer to log and continue. We choose to crash because the alternative hides bugs and we have a fast restart path. Document the choice.
## Done

(Fill in once complete.)