Task 1.12 — Production hardening

Phase: 1 — Inbound telemetry
Status: ⏸ Paused — deferred until after the real-device pilot test. See ROADMAP.md "Deferred" section for resume triggers. installGracefulShutdown exists as a stub from task 1.8; this task fully implements signal handling, drain timeouts, unhandled-rejection handlers, and writes OPERATIONS.md. Resume before any always-on deployment or rolling-restart workflow.
Depends on: 1.8, 1.10, 1.11
Wiki refs: docs/wiki/concepts/failure-domains.md

Goal

Make the service safe for unattended production operation: graceful shutdown, robust error handling, structured logging discipline, sane defaults for resource limits, and operational documentation.

Deliverables

  • src/core/lifecycle.ts — installGracefulShutdown({ ... }) that wires SIGTERM/SIGINT/SIGHUP to a coordinated shutdown.
  • src/core/errors.ts — typed error classes (HandshakeError, FrameError, PublishOverflowError, RedisUnavailableError); see the sketch after this list.
  • Updates to src/main.ts to install error handlers and shutdown.
  • OPERATIONS.md (or section in README.md) covering: env var reference, signals, log fields, metric meanings, common alert rules, troubleshooting.
  • (Optional) docs/runbook.md for on-call: "what to do when X alert fires."
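
One possible shape for src/core/errors.ts, kept deliberately small (the shared IngestError base class and the queueDepth field are illustrative assumptions; only the four class names come from this spec):

// One possible src/core/errors.ts. The base class and queueDepth are illustrative
// assumptions; only the four exported class names come from the deliverables above.
class IngestError extends Error {
  constructor(message: string) {
    super(message);
    this.name = this.constructor.name; // keep the concrete subclass name in log output
  }
}

export class HandshakeError extends IngestError {}
export class FrameError extends IngestError {}
export class RedisUnavailableError extends IngestError {}
export class PublishOverflowError extends IngestError {
  constructor(readonly queueDepth: number) {
    super(`publish queue overflow (depth=${queueDepth})`);
  }
}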

Specification

Graceful shutdown

On SIGTERM (deployment rolling update) or SIGINT (Ctrl-C):

  1. Stop accepting new connections. server.close() — existing sockets continue.
  2. Drain the publish queue. Stop accepting new publish() calls; wait for the worker to flush queued records to Redis (with a timeout, e.g. 10s). A sketch of this drain step follows the list.
  3. Send a final goodbye on each open socket. Optional: just let the TCP FIN happen naturally; devices will reconnect to a new instance.
  4. Close Redis connection.
  5. Exit cleanly with code 0.
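
A sketch of how the step-2 drain could work inside the publisher from task 1.8 (the internal names accepting, queue and flushBatch are assumptions for illustration; only the behavior described in step 2 is from the spec):

// Illustrative only: internal names (accepting, queue, flushBatch) are assumptions.
export class Publisher {
  private accepting = true;
  private queue: unknown[] = [];

  publish(record: unknown): void {
    if (!this.accepting) throw new Error('publisher is draining');
    this.queue.push(record);
  }

  async drain(timeoutMs: number): Promise<void> {
    this.accepting = false;                       // stop accepting new publish() calls
    const deadline = Date.now() + timeoutMs;
    while (this.queue.length > 0 && Date.now() < deadline) {
      await this.flushBatch();                    // let the worker push the next batch to Redis
    }
    if (this.queue.length > 0) {
      // installGracefulShutdown catches this, logs it, and exits non-zero.
      throw new Error(`drain timed out with ${this.queue.length} records still queued`);
    }
  }

  private async flushBatch(): Promise<void> {
    // XADD the next batch to Redis; omitted here.
  }
}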

If shutdown takes longer than SHUTDOWN_TIMEOUT_MS (default 30s), log and exit with code 1 — the orchestrator will SIGKILL anyway, but exiting deliberately gives a cleaner signal.

// Handle shapes below are structural assumptions that match how the function uses them;
// the concrete objects come from earlier tasks (TCP server and publisher from 1.8, metrics server from 1.10).
export interface ShutdownHandles {
  server: { close(cb?: (err?: Error) => void): unknown };
  metricsServer: { close(cb?: (err?: Error) => void): unknown };
  publisher: { drain(timeoutMs: number): Promise<void> };
  redis: { quit(): Promise<unknown> };
  logger: { info(o: object, msg: string): void; error(o: object, msg: string): void };
  timeoutMs?: number; // SHUTDOWN_TIMEOUT_MS
}

export function installGracefulShutdown(handles: ShutdownHandles) {
  let shuttingDown = false;
  const shutdown = async (signal: string) => {
    if (shuttingDown) return; // a second signal must not start a second shutdown
    shuttingDown = true;
    handles.logger.info({ signal }, 'shutdown: starting');
    // Hard deadline: exit 1 deliberately rather than waiting for the orchestrator's SIGKILL.
    const deadline = setTimeout(() => {
      handles.logger.error({}, 'shutdown: timed out, forcing exit');
      process.exit(1);
    }, handles.timeoutMs ?? 30_000);
    try {
      // Step 1: stop accepting new connections; existing sockets continue.
      await new Promise<void>((res) => handles.server.close(() => res()));
      // Step 2: drain the publish queue (bounded to 10s).
      await handles.publisher.drain(10_000);
      // Step 4: close Redis and the metrics listener (step 3's per-socket goodbye is optional and skipped).
      await handles.redis.quit();
      handles.metricsServer.close();
      clearTimeout(deadline);
      handles.logger.info({}, 'shutdown: clean exit');
      // Step 5: exit cleanly.
      process.exit(0);
    } catch (err) {
      handles.logger.error({ err }, 'shutdown: error during drain');
      clearTimeout(deadline);
      process.exit(1);
    }
  };
  process.on('SIGTERM', () => shutdown('SIGTERM'));
  process.on('SIGINT', () => shutdown('SIGINT'));
  process.on('SIGHUP', () => shutdown('SIGHUP')); // listed in the deliverables; treated the same as SIGTERM
}
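
The corresponding wiring in src/main.ts might look like this (a sketch; the declare lines stand in for objects constructed earlier in main.ts, and reading SHUTDOWN_TIMEOUT_MS from the environment assumes it appears in the task 1.3 config table):

import { installGracefulShutdown } from './core/lifecycle';

// Placeholders for objects built earlier in main.ts (tasks 1.3, 1.8, 1.10).
declare const server: import('node:net').Server;
declare const metricsServer: import('node:http').Server;
declare const publisher: { drain(timeoutMs: number): Promise<void> };
declare const redis: { quit(): Promise<unknown> };
declare const logger: { info(o: object, m: string): void; error(o: object, m: string): void };

installGracefulShutdown({
  server,
  metricsServer,
  publisher,
  redis,
  logger,
  timeoutMs: Number(process.env.SHUTDOWN_TIMEOUT_MS ?? 30_000),
});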

Unhandled promise / uncaught exception

process.on('unhandledRejection', (reason) => {
  logger.fatal({ reason }, 'unhandledRejection');
  process.exit(1);
});
process.on('uncaughtException', (err) => {
  logger.fatal({ err }, 'uncaughtException');
  process.exit(1);
});

Crashing the process on either is the right move — the orchestrator restarts, devices reconnect, no harm done. The wrong move is to log and continue; that hides real bugs.

The @typescript-eslint/no-floating-promises rule (added in task 1.1) is the first line of defense; these handlers are the safety net.

Per-socket error handling

In the session loop, handle errors as follows (a sketch follows this list):

  • Errors from BufferedReader / frame.ts / codec parsers: log at warn with imei, drop the socket.
  • Errors from ctx.publish (specifically PublishOverflowError): skip the ACK, continue reading. Device retransmits.
  • Errors from ctx.publish (other, unexpected): log at error, drop the socket. Open question: should we crash the process? Recommendation: drop the socket only; let the publisher's own logic decide whether the underlying issue (e.g. Redis hang) warrants process exit.
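
A sketch of that classification in a session loop. The dependency names (readFrame, decodeRecords, sendAck, dropSocket) and the import path are hypothetical; only the warn/skip-ACK/error decisions mirror the bullets above:

import { PublishOverflowError } from '../core/errors'; // path assumed; adjust to where errors.ts lives

export async function sessionLoop(deps: {
  readFrame(): Promise<unknown>;              // BufferedReader + frame.ts
  decodeRecords(frame: unknown): unknown[];   // codec parsers
  publish(records: unknown[]): Promise<void>; // ctx.publish
  sendAck(recordCount: number): Promise<void>;
  dropSocket(): void;
  logger: { warn(o: object, m: string): void; error(o: object, m: string): void };
  imei: string;
}): Promise<void> {
  for (;;) {
    let records: unknown[];
    try {
      records = deps.decodeRecords(await deps.readFrame());
    } catch (err) {
      // Parse-level failure: recoverable for the fleet, not for this connection.
      deps.logger.warn({ err, imei: deps.imei }, 'session: frame error, dropping connection');
      deps.dropSocket();
      return;
    }
    try {
      await deps.publish(records);
    } catch (err) {
      if (err instanceof PublishOverflowError) {
        continue; // skip the ACK and keep reading; the device retransmits this batch
      }
      deps.logger.error({ err, imei: deps.imei }, 'session: unexpected publish error, dropping connection');
      deps.dropSocket();
      return;
    }
    await deps.sendAck(records.length);
  }
}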

Resource limits

  • Max concurrent connections per instance: soft cap via gauge alert (teltonika_connections_active > 5000). No hard cap in code — let the OS-level fd limit be the real ceiling.
  • Per-connection memory: the BufferedReader buffer is bounded by MAX_AVL_PACKET_SIZE (~1.3KB) per session. With 5,000 connections, ~6.5MB of buffer state — fine.
  • Node heap: set via NODE_OPTIONS=--max-old-space-size=512 in the Dockerfile or compose. 512MB is plenty for this workload.

Logging discipline (audit pass)

Before declaring this task done, walk through every logger.* call site and confirm:

  • info: lifecycle events (startup, shutdown, server bound).
  • warn: recoverable per-frame issues (CRC fail, malformed handshake), per-connection drops.
  • error: per-publish failures, unexpected per-session errors.
  • fatal: process-killing conditions (Redis unreachable for >X seconds, unhandledRejection).
  • debug: per-frame parse details, per-record publish details.
  • No console.log anywhere in production paths. If there are any, replace.

OPERATIONS.md outline

# tcp-ingestion — Operations

## Configuration
[table of env vars from task 1.3]

## Signals
| Signal | Effect |
|--------|--------|
| SIGTERM | Graceful shutdown (drain publish queue, close connections, exit 0) |
| SIGINT | Same as SIGTERM |

## Metrics
[table of metrics from task 1.10]

## Alerts (recommended)
- `teltonika_unknown_codec_total > 0` for 5 min: investigate codec coverage drift.
- `teltonika_publish_overflow_total > 0` for 1 min: Redis or downstream backed up.
- `rate(teltonika_frames_total{result="crc_fail"}[5m]) / rate(teltonika_frames_total[5m]) > 0.01`: high CRC error rate, suspect device firmware or line quality.
- `teltonika_connections_active{instance=...} == 0` for 10 min while peer instances have traffic: instance is silently broken; investigate.

## Troubleshooting
- "Devices not connecting" → check TCP_PORT firewall, /readyz response, Redis connectivity.
- "Records not appearing in Redis" → check publish queue depth metric, then Redis connectivity.
- "High CRC failures from one IMEI" → likely a firmware bug or bad cellular link; coordinate with device fleet ops.

Acceptance criteria

  • SIGTERM during steady-state traffic results in a clean exit with no data loss (verified by killing the process and confirming the publish queue drained, no PublishOverflowError in the last second of logs).
  • SIGTERM under publish-queue-overflow conditions still exits within SHUTDOWN_TIMEOUT_MS.
  • An unhandledRejection (intentionally injected via a test; see the sketch after this list) logs at fatal and exits non-zero.
  • OPERATIONS.md is populated and accurate; an on-caller could read it cold and find the answer to "what does this metric mean."
  • All log calls audited; no console.log in production paths.
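
One way the injected-rejection check could be scripted (a sketch using node:test; INJECT_UNHANDLED_REJECTION and dist/main.js are hypothetical, the real test hook and entrypoint may differ):

import { test } from 'node:test';
import assert from 'node:assert/strict';
import { spawnSync } from 'node:child_process';

// INJECT_UNHANDLED_REJECTION is a hypothetical test-only flag the entrypoint would
// read to fire Promise.reject(new Error('injected')) shortly after startup.
test('unhandledRejection logs at fatal and exits non-zero', () => {
  const result = spawnSync(process.execPath, ['dist/main.js'], {
    env: { ...process.env, INJECT_UNHANDLED_REJECTION: '1' },
    encoding: 'utf8',
    timeout: 10_000,
  });
  assert.notEqual(result.status, 0);
  assert.match(result.stdout + result.stderr, /unhandledRejection/);
});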

Risks / open questions

  • The "drain publish queue with timeout" balance: too long blocks deployments; too short loses records on shutdown. Default 10s is a reasonable starting point; tune after real production data.
  • Crashing on unhandledRejection is opinionated. Some teams prefer to log and continue. We choose crash because the alternative hides bugs and we have a fast restart path. Document the choice.

Done

(Fill in once complete.)