## Context & Architectural Decision
- System Profile: Complex B2B SaaS (ERP, CRM, PIM, eCommerce) with a strict market size cap of 4,000 tenants (operating baseline ~300).
- Stack Selection: Grafana LGTM Stack (Loki, Grafana, Tempo, Mimir) utilizing OpenTelemetry (OTel).
- Rationale: Provides a "single pane of glass" for application telemetry and external business metrics (Lago, Mollie) while maintaining cost control and avoiding the unpredictable billing spikes associated with managed SaaS observability platforms.
## Telemetry & Ingestion Architecture
- Standard: Strict adherence to OpenTelemetry (OTel) and W3C Trace Context to decouple applications from the storage backend.
- High Availability (HA) Ingestion: The OpenTelemetry Collector operates as the central ingestion gateway, deployed via Docker Swarm.
  - Intent: Prevent a single point of failure (SPOF) and avoid tightly coupling applications to the storage backend.
  - Implementation: Deployed as a Swarm service with `mode: replicated` (2 replicas) behind the Traefik load balancer. Swarm's routing mesh handles load distribution across the stateless OTel replicas.
- Placement: All LGTM services (OTel Collector, Loki, Tempo, Mimir, Grafana, cAdvisor, `prometheus-postgres-exporter`) are placement-constrained to `node.labels.tier == observability` on the dedicated `hcloud_server.observability_worker` node (cpx21+). The 2-node Swarm cannot absorb the LGTM footprint alongside application services; this node parallels the `tier=stateful` pattern from `infrastructure.md` §1. Sequencing: this requires the quorum and node-labelling infrastructure from the Swarm quorum expansion (see `adr-0xx-swarm-quorum-topology.md`).
- HA on Swarm without Consul: The Loki/Tempo/Mimir ingesters require a single-writer ring, and plain Swarm has no Consul or KV primitive for ring election. We run each ingester as `replicas: 1`, pinned to the observability-labelled node, and rely on Swarm self-heal for failover. The S3 chunk store is MinIO (already in-cluster). Trade-off: a single-instance ingester briefly drops writes during a task reschedule (seconds). The alternative, a Consul-backed ring, adds a stateful Raft cluster and is deferred.
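The replica and placement rules above can be sketched as a compose fragment. This is a minimal illustration, not the production file: image tags and service names are assumptions.

```yaml
# Sketch of docker-compose.observability.yml deploy rules (image tags illustrative)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.98.0
    deploy:
      mode: replicated
      replicas: 2                      # stateless; Traefik + routing mesh balance across both
      placement:
        constraints:
          - node.labels.tier == observability
  loki:
    image: grafana/loki:2.9.6
    deploy:
      replicas: 1                      # single-writer ingester ring; Swarm self-heal is the failover
      placement:
        constraints:
          - node.labels.tier == observability
```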
## Service Instrumentation & Correlation
- Frontend (Grafana Faro Web SDK): Captures Core Web Vitals, performance, and exceptions. Automatically injects W3C `traceparent` headers into `fetch`/XHR requests.
- CORS: Traefik and NestJS are explicitly configured to expose and accept the `traceparent` and `tracestate` headers.
- Backend Distributed Tracing:
  - Traefik: Extracts the incoming `traceparent` at the edge and forwards it downstream.
  - NestJS: Utilizes `@opentelemetry/auto-instrumentations-node` for automatic span generation.
  - Hasura (Community Edition): Lacks native OTel. We maintain tracing via a composite strategy:
    - Edge Tracing: Traefik captures overarching request duration and status.
    - Context Propagation: We configure Hasura to forward the `traceparent` header to NestJS for Actions/Event Triggers.
    - Log Analysis: The system ships Hasura structured JSON logs (`startup`, `http-log`, `webhook-log`, `query-log`) to Loki for query-latency analysis.
    - DB Metrics: `prometheus-postgres-exporter` connects directly to Postgres and exposes database metrics for scraping into Mimir.
- Log/Trace Correlation: Backend loggers (Pino, Winston) inject the active `trace_id` into JSON log payloads. Loki derived fields map the regex-extracted `trace_id` directly to Tempo for UI pivoting.
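The Loki → Tempo pivot above is wired in Grafana datasource provisioning. A minimal sketch, assuming the Tempo datasource UID is `tempo` and logs carry a `"trace_id":"…"` JSON field (both assumptions):

```yaml
# Grafana datasource provisioning sketch (UIDs and regex illustrative)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: trace_id
          matcherRegex: '"trace_id":"(\w+)"'   # extracts the id from the JSON payload
          url: '$${__value.raw}'               # $$ escapes provisioning interpolation
          datasourceUid: tempo                 # target Tempo datasource for the pivot
```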
## Metrics & High-Cardinality Management
- Business & 3rd-Party Metrics:
  - The system fetches real-time API state (e.g., Mollie balances) at dashboard load via Grafana plugins (Infinity/JSON).
  - Prometheus exporters poll APIs to record historical time series in Mimir.
  - Asynchronous webhook events are logged to Loki; LogQL queries derive dynamic metrics from them.
- LLM Usage Tracking:
  - Intent: Track costs without causing Mimir cardinality explosions.
  - Implementation: Utilize standard counter metrics (e.g., `llm_tokens_total`) where the token count is the increment value. The `tenant_id` and `model_version` labels are safely retained: the hard cap of 4,000 tenants keeps cardinality well within Mimir's operational thresholds.
- Global cardinality budget: ≤80,000 active series per metric (4,000 tenants × 20 series). Mimir enforces this at the ingester level; exceeding it causes ingestion rejection.
- Per-metric budget contract (see ADR-027): every custom metric declared in code MUST carry a `cardinality_budget: "<n>"` annotation on its associated Prometheus alert rule. A CI linter step (`promtool check rules` + `check_metric_cardinality.py`) verifies that (a) the annotation exists and parses as an integer, (b) the declared budget is ≤80,000, and (c) the actual cardinality at probe time is ≤ the declared budget. Rogue metrics with unbounded labels are rejected at PR time, not at production ingestion.
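An annotated rule under this contract could look like the following sketch; the alert name, threshold, and budget value are illustrative, only the `cardinality_budget` annotation key comes from the contract above:

```yaml
# Sketch of a budget-annotated rule (thresholds and names illustrative)
groups:
  - name: llm-usage
    rules:
      - alert: LlmTokenSpendHigh
        expr: sum by (tenant_id) (rate(llm_tokens_total[1h])) > 1000000
        for: 30m
        labels: { severity: warning }
        annotations:
          cardinality_budget: "8000"   # 4,000 tenants x ~2 model versions; checked by CI linter
          summary: "Tenant {{ $labels.tenant_id }} LLM token burn is unusually high"
```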
## PII Redaction & Data Privacy
- Production Catch-All: Promtail regex rules redact PII patterns (credit cards, emails) from logs before ingestion into Loki. See `promtail-config.yaml`.
- Client-Side Filtering: The Faro SDK sanitizes URLs and excludes sensitive DOM elements before data transmission.
- False-Positive Budget: ≤5 false positives per day per PII detection rule. Each LogQL rule carries an exclusion list maintained in `pii-exclusions.yaml`; new exclusions require PR approval. See `observability-alerts.md`.
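The Promtail redaction can be sketched with `replace` pipeline stages; the regexes below are illustrative placeholders, not the production rules in `promtail-config.yaml`:

```yaml
# promtail-config.yaml pipeline sketch (patterns illustrative)
scrape_configs:
  - job_name: docker
    pipeline_stages:
      - replace:
          expression: '([\w.-]+@[\w.-]+\.\w{2,})'    # email-like tokens
          replace: '<redacted-email>'
      - replace:
          expression: '\b(?:\d[ -]?){13,16}\b'       # card-number-like digit runs
          replace: '<redacted-pan>'
```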
> **NOTE (Future):** CI/CD static-analysis enforcement (Semgrep/SonarQube) to block deployments containing hardcoded PII logging variables; LogQL alerting rules for runtime PII scanning.
## Log Volume & Retention
- TTL: 14 days (matches the GDPR §4 immutability window).
- Retention mechanism: Loki compactor only; do not rely on MinIO bucket lifecycle rules to delete chunks. MinIO-lifecycle deletion produces "chunk not found" errors at query time because Loki's index still references the chunks until the compactor sweeps. Required `loki.yaml`:

```yaml
compactor:
  working_directory: /data/loki/compactor
  shared_store: s3
  retention_enabled: true
  delete_request_store: s3
limits_config:
  retention_period: 336h # 14d
```

- Hard safety cap: a 500 GB MinIO bucket quota acts as an emergency backstop (NOT the primary retention mechanism). A Grafana alert fires at >80% bucket utilisation so operators can investigate volume spikes before the quota is hit.
- Per-service log volume: every application service in `docker-compose.production.yml` MUST carry a logging block:

```yaml
logging:
  driver: loki
  options:
    loki-url: "http://otel-collector.observability:4318/loki/api/v1/push"
    loki-batch-size: "400"
    loki-external-labels: "service={{.Name}},environment=production"
```

- Exporter DSN gate: `prometheus-postgres-exporter` is declared in `docker-compose.observability.yml` with `deploy: { replicas: 0 }` until `.github/workflows/secrets-sync.yml` lands the `postgres_exporter_dsn` Swarm Secret; the replica count is flipped to `1` in a follow-up commit. This keeps L3-5.1.3 mergeable standalone.
## Alerting & Configuration Management
- Alerting Strategy: We deploy Grafana Alertmanager alongside the LGTM stack for deduplication, grouping, and silencing. Alert trees route infrastructure severity to PagerDuty/Slack, and business logic alerts to operational teams. We manage alerts as code.
- Docker Configs: We decouple all configurations (`grafana.ini`, OTel YAML, Prometheus rules) from container images and manage them via immutable Docker Configs, leveraging Swarm's Raft log for secure distribution. Updates use config rotation for zero-downtime deployments.
- Secrets: Infrastructure API keys and DB credentials rely on encrypted Docker Secrets (`/run/secrets/`), while sensitive tenant credentials are encrypted at the application boundary via `pgsodium`.
> **NOTE (Future): Multi-Tenant Embedded Dashboards**
>
> When tenants need self-service analytics, Grafana panels will be embedded in the workspace app via iframes. This requires:
> - An auth proxy (Nginx/Envoy) that validates the SaaS session and injects `X-WEBAUTH-USER` + `X-Scope-OrgID` (`tenant_id`) headers
> - Grafana cookie/SSO hardening (`allow_embedding = true`, `COOKIE_DOMAIN=.busflow.app`, `COOKIE_SAMESITE=none`, `COOKIE_SECURE=true`)
> - Row-level security via the `X-Scope-OrgID` header (native Loki/Mimir support)
> - See `observability-alerts.md` §A7 (SSO fallthrough) for triage
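If that future lands, the auth-proxy piece could look roughly like the Nginx sketch below. The `/validate-session` subrequest endpoint, location path, and upstream name are assumptions; only the two injected header names come from the note above.

```nginx
# Auth-proxy sketch for embedded Grafana (endpoint and upstream names illustrative)
location /grafana/ {
    auth_request /validate-session;                      # internal SaaS session check (not shown)
    auth_request_set $user   $upstream_http_x_user;      # identity returned by the validator
    auth_request_set $tenant $upstream_http_x_tenant;    # tenant_id returned by the validator
    proxy_set_header X-WEBAUTH-USER $user;               # Grafana auth-proxy login
    proxy_set_header X-Scope-OrgID  $tenant;             # Loki/Mimir row-level tenancy
    proxy_pass http://grafana:3000/;
}
```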
> **NOTE (Future): Faro SDK Load Policy (B2C vs. Workspace)**
>
> When the B2C passenger app launches, evaluate Faro's ~50 KB impact on checkout LCP.
> - Workspace app (B2B): load Faro synchronously
> - Passenger app (B2C checkout): lazy-load after `window.load` behind a feature flag
> - Run an A/B test for ≥1 week on real 3G/4G sessions before deciding the global default
## Architectural Decisions (Resolved Questions)
- Asynchronous Trace Context: When calling third-party APIs (Mollie, WhatsApp), the system injects the active `traceparent` as a custom metadata field in the webhook request payload. When the webhook fires back, NestJS unmarshals the metadata and links the span.
- Stateful Storage High Availability: Loki, Tempo, and Mimir use MinIO (S3-compatible) as their primary chunk store. The services themselves hold only local volumes for ephemeral caching and Write-Ahead Logs (WAL), making them easily replaceable Swarm tasks.
- Untrusted Client Trace Context: The edge Traefik proxy enforces aggressive rate-limiting on the telemetry ingestion port and drops arbitrarily large `tracestate` headers to prevent client-side cardinality-explosion attacks.
- Immutability vs. GDPR Data Deletion: Raw operational telemetry (Loki/Tempo) enforces a strict 14-day retention TTL. Instead of selectively scrubbing immutable chunks, the system guarantees automatic total wipeout of potentially leaked PII within the globally compliant 30-day deletion timeframe by simply aging it out.
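The webhook round-trip hinges on serialising and re-parsing a W3C `traceparent` string. A self-contained sketch of that marshalling step (the helper names are illustrative, not busflow code; in production the OTel propagation API would do this):

```typescript
// Sketch: carry a W3C traceparent through third-party webhook metadata.
// Helper names are illustrative; real code would use OTel's propagation API.
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars
  spanId: string;   // 16 lowercase hex chars
  sampled: boolean;
}

// Serialise the active context into the metadata field sent to Mollie/WhatsApp.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

// On webhook receipt, recover the context so the new span can be linked.
function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null; // reject malformed or unsupported-version headers
  return { traceId: m[1], spanId: m[2], sampled: m[3] === '01' };
}
```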
## Volume Monitoring & Auto-Scaling
Postgres and MinIO data live on a Hetzner Cloud Volume (`hcloud_volume.data`, default 20 GB) bind-mounted at `/mnt/data/`. Without monitoring, disk exhaustion silently crashes both services.
### Metrics
cAdvisor (already in the LGTM stack) and node_exporter expose filesystem metrics for the volume mount:
| Metric | Source | Meaning |
|---|---|---|
| `node_filesystem_avail_bytes{mountpoint="/mnt/HC_Volume_*"}` | node_exporter | Free bytes on the Hetzner Volume |
| `node_filesystem_size_bytes{mountpoint="/mnt/HC_Volume_*"}` | node_exporter | Total volume size |
| `(1 - avail/size) * 100` | Grafana recording rule | Usage percentage |
### Alert Rules
Defined as code in `docker/observability/prometheus-rules/volume-alerts.yml`:

```yaml
groups:
  - name: volume
    rules:
      - alert: VolumeUsageWarning
        expr: (1 - node_filesystem_avail_bytes{mountpoint=~"/mnt/HC_Volume.*"} / node_filesystem_size_bytes{mountpoint=~"/mnt/HC_Volume.*"}) * 100 > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Hetzner Volume usage > 80%"
          runbook_url: "https://docs.busflow.app/protocols/volume-resize"
      - alert: VolumeUsageCritical
        expr: (1 - node_filesystem_avail_bytes{mountpoint=~"/mnt/HC_Volume.*"} / node_filesystem_size_bytes{mountpoint=~"/mnt/HC_Volume.*"}) * 100 > 90
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Hetzner Volume usage > 90% (auto-resize triggered)"
```
> **NOTE (Future): Auto-Resize Pipeline**
>
> Hetzner does not offer native volume auto-scaling. The planned two-stage resize:
> 1. An Alertmanager webhook fires `VolumeUsageCritical` to a GitHub Actions `workflow_dispatch` endpoint
> 2. `.github/workflows/volume-resize.yml` calls `hcloud volume resize` + `resize2fs` (online, no downtime)
>
> Safety: max 100 GB cap, concurrency-group cooldown, Terraform drift handling.
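A minimal sketch of that workflow, assuming an `HCLOUD_TOKEN` repo secret, a volume named `data`, an SSH-reachable host, and a pre-installed `hcloud` CLI (all assumptions):

```yaml
# .github/workflows/volume-resize.yml sketch (names, sizes, and SSH target illustrative)
on:
  workflow_dispatch:
    inputs:
      new_size_gb:
        description: "Target volume size in GB (hard cap 100)"
        required: true
concurrency:
  group: volume-resize        # serialises resizes; doubles as a cooldown
  cancel-in-progress: false
jobs:
  resize:
    runs-on: ubuntu-latest
    steps:
      - name: Enforce safety cap and resize volume (online)
        env:
          HCLOUD_TOKEN: ${{ secrets.HCLOUD_TOKEN }}
        run: |
          test "${{ inputs.new_size_gb }}" -le 100       # 100 GB hard cap
          hcloud volume resize data --size "${{ inputs.new_size_gb }}"
      - name: Grow the filesystem without downtime
        run: ssh ops@observability "sudo resize2fs /dev/disk/by-id/scsi-0HC_Volume_data"
```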
## TLS Certificate Monitoring
Traefik manages ACME certificates (Let's Encrypt) for 6 subdomains: api.busflow.app, hasura.busflow.app, auth.busflow.app, storage.busflow.app, n8n.busflow.app, and busflow.app. A silent ACME renewal failure causes hard downtime after the 90-day cert lifetime expires.
- Metric: Traefik v3 exposes `traefik_tls_certs_not_after` as a Prometheus metric (epoch timestamp per certificate). Mimir scrapes this via the OTel Collector.
- Alert: Grafana fires when any certificate has ≤14 days remaining. Defined in `docker/observability/prometheus-rules/tls-alerts.yml`:

```yaml
groups:
  - name: tls
    rules:
      - alert: TlsCertExpiringSoon
        expr: (traefik_tls_certs_not_after - time()) / 86400 < 14
        for: 1h
        labels: { severity: warning }
        annotations:
          summary: "TLS cert for {{ $labels.cn }} expires in {{ $value | humanize }} days"
          runbook_url: "https://docs.busflow.app/protocols/observability-alerts#a9"
```

- Triage: See `observability-alerts.md` §A9.
- Accepted risk: A Cloudflare DNS API outage does not break existing traffic but silently prevents ACME DNS-challenge renewals. The 14-day alert window provides sufficient buffer to detect and resolve this.
> **NOTE (Future): External Dependency Health Probes**
>
> Scheduled probes (NestJS cron or standalone) checking external service health every 5 min:
> - AWS SES, Hetzner Object Storage, LLM providers (OpenAI/Anthropic)
> - Results surface as OTel gauges; Grafana alerts on ≥3 consecutive failures → Slack `#ops`
> - See `observability-alerts.md` §A10
> **NOTE (Future): Synthetic Uptime Monitoring**
>
> External probes (outside the Swarm) hitting public URLs through the full CDN → Traefik → service chain:
> - `api.busflow.app/api/health`, `busflow.app`, `hasura.busflow.app/healthz`, `auth.busflow.app/healthz`
> - Implementation: Grafana Cloud Synthetic Monitoring (free tier), BetterStack, or GitHub Actions cron
> - Alert: ≥2 consecutive failures → Slack `#ops`
## Developer Instrumentation Guide
### Backend (NestJS)
- Auto-instrumentation: `@opentelemetry/auto-instrumentations-node` automatically generates spans for HTTP, database, and Redis operations. No manual setup is required for standard CRUD paths.
- Custom spans: For business-critical operations (e.g., PDF parsing, pricing calculation), create explicit spans:

```typescript
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('busflow-api');

async function parsePdf(file: Buffer) {
  return tracer.startActiveSpan('pdf-parser.extract', async (span) => {
    span.setAttribute('pdf.size_bytes', file.length);
    try {
      // extractStructuredData is the domain parser, defined elsewhere
      const result = await extractStructuredData(file);
      span.setAttribute('pdf.confidence', result.confidence);
      return result;
    } catch (err) {
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
```

- Naming conventions: `<service>.<operation>` (e.g., `pricing.calculate`, `wallet.generate-pass`, `compliance.check-rest-time`).
- Attributes: Use semantic conventions (`http.method`, `db.statement`) where applicable. Custom attributes use the `busflow.` prefix (e.g., `busflow.tenant_id`, `busflow.tour_id`).
### Frontend (Grafana Faro)
- Automatic: The Faro SDK captures Core Web Vitals, `fetch`/XHR durations, and unhandled exceptions out of the box.
- User Paths (Page Views): We hook into the Vue Router `afterEach` guard in the app boot file (e.g., `src/boot/faro.ts`) to track `page_view` events. This allows building funnel charts and tracking user journeys.
- User Interactions (Button Clicks): We track critical business UI actions (e.g., clicking "Save Vehicle") using the custom `v-track` directive (e.g., `v-track="'save_vehicle_btn'"`) or a manual `faro.api.pushEvent('button_click', { action: '...' })`.
- Custom Events: Use `faro.api.pushEvent()` for complex business-critical flows (e.g., `booking.completed`, `magic-upload.started`).
- Error Boundaries: Vue error boundaries should call `faro.api.pushError()` with structured context (component name, route, user role).
### Local Development
Docker Compose provides a minimal observability stack for local debugging:
```sh
# Start local Grafana + OTel Collector + Loki + Tempo
docker compose -f docker-compose.observability.yml up -d
```

| Service | Local Port | Purpose |
|---|---|---|
| Grafana | `localhost:3000` | Dashboards, trace viewer, log explorer |
| OTel Collector | `localhost:4317` (gRPC), `localhost:4318` (HTTP) | Telemetry ingestion |
| Loki | `localhost:3100` | Log storage |
| Tempo | `localhost:3200` | Trace storage |

NestJS dev servers emit traces to `localhost:4317` when `OTEL_EXPORTER_OTLP_ENDPOINT` is set. The Faro SDK points to `localhost:4318` in development mode.
## Preview Environments
Busflow does not spin up dedicated LGTM stacks per preview environment (e.g., per pull request). This would be resource-prohibitive and difficult to manage. Instead, preview environments leverage the shared central observability stack on the Docker Swarm via environment tagging:
- Global Log Collection (Zero-Config): The `promtail` service in the observability stack is deployed in `mode: global` across all Swarm nodes. It reads the Docker socket and automatically ships logs for all ephemeral preview containers (`busflow-pr-<ID>`) to the central Loki instance.
- Central Telemetry Routing: Preview backend services (like the NestJS API) push their OTel traces and metrics to the shared OTel Collector on the Swarm's internal network (`http://obs_otel_collector:4318`).
- Environment Tagging: Every trace and metric emitted by a preview stack is tagged with the specific preview environment (e.g., `deployment.environment="preview"`, `service.namespace="busflow-pr-123"`).
- Grafana Filtering: Operators use a single set of Grafana dashboards. Selecting a specific PR namespace from the `$environment` dashboard variable filters all Loki logs, Mimir metrics, and Tempo traces down to that ephemeral stack.
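The tagging step typically needs no code changes: the standard OTel SDK environment variables carry the resource attributes. A compose fragment sketch for a preview stack (service name and PR id are placeholders):

```yaml
# Preview-stack compose sketch (service name and PR id illustrative)
services:
  api:
    image: busflow-api:pr-123
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://obs_otel_collector:4318"
      OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=preview,service.namespace=busflow-pr-123"
```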
## Runbooks & Dashboards
### Core Dashboards
| Dashboard | Scope | Key Panels |
|---|---|---|
| Service Overview | All services | Request rate, error rate, p50/p95/p99 latency per service |
| Per-Tenant Health | Individual tenant | Request volume, error breakdown, active sessions |
| LLM Cost Tracker | AI services | `llm_tokens_total` by model and tenant, daily/weekly cost projections |
| Infrastructure | Docker Swarm | Node CPU/memory, container restarts, Swarm service health |
| Booking Funnel | B2C commerce | Conversion steps, drop-off points, payment success rate |
| Deploy Timeline | CI/CD | Deploy markers overlaid on error rate + latency panels |
### Alerting Runbooks
All alerts are defined as code in the Grafana provisioning directory. Each alert has a runbook link in its annotation:
- High Error Rate (>1% 5xx for 5 min): Check Tempo for failing traces → identify the root service → check recent deploys.
- Latency SLO Breach (p99 > 500 ms for 10 min): Check Mimir for resource saturation → scale Swarm replicas or investigate slow queries in Loki.
- PII Detection Alert: Loki fires on a LogQL pattern match → identify the offending service → hotfix the PII leak → verify with Semgrep.
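The PII detection alert can be sketched as a Loki ruler rule; the regex is an illustrative placeholder (the real patterns and their exclusion lists live in `promtail-config.yaml` / `pii-exclusions.yaml`):

```yaml
# Loki ruler sketch for runtime PII detection (pattern illustrative)
groups:
  - name: pii
    rules:
      - alert: PiiEmailInLogs
        expr: sum by (service) (count_over_time({environment="production"} |~ `[\w.-]+@[\w.-]+\.[a-z]{2,}` [5m])) > 0
        labels: { severity: critical }
        annotations:
          summary: "Possible email address logged by {{ $labels.service }}"
```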
## SLI / SLO Definitions
Service Level Indicators (SLIs) and Objectives (SLOs) for core services. Monitored via Mimir metrics and Grafana alerting.
| Service | SLI | SLO Target | Measurement |
|---|---|---|---|
| API (NestJS) | Request latency (p99) | < 300 ms | OTel histogram `http.server.duration` |
| API (NestJS) | Error rate (5xx) | < 0.5% | OTel counter `http.server.response.status_code` |
| Booking Flow | End-to-end success rate | > 99.5% | Custom counter `busflow.booking.completed` / `busflow.booking.started` |
| Hasura (GraphQL) | Query latency (p95) | < 200 ms | Hasura `http-log` in Loki |
| PDF Parsing | PDF parse success rate | > 95% | Custom counter `busflow.pdf_parser.success` / `busflow.pdf_parser.total` |
| Frontend (Faro) | LCP (Largest Contentful Paint) | < 2.5 s | Faro Core Web Vitals |
| Frontend (Faro) | Unhandled JS exceptions | < 0.1% of sessions | Faro error events / session count |
Error Budget Policy: When a service exhausts its monthly error budget (calculated from SLO), feature development pauses and the team prioritizes reliability work until the budget recovers.
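The budget arithmetic behind this policy is simple enough to sketch; the helper below is illustrative (not busflow code) and shows how an SLO target converts into an allowed-failure count and a remaining-budget fraction:

```typescript
// Sketch: monthly error-budget arithmetic (helper name illustrative).
// sloTarget is the success-rate objective, e.g. 0.995 for the Booking Flow SLO.
function errorBudgetRemaining(
  sloTarget: number,
  totalRequests: number,
  failedRequests: number,
): number {
  const allowedFailures = totalRequests * (1 - sloTarget); // budget, in requests
  return (allowedFailures - failedRequests) / allowedFailures; // fraction left (<0 = exhausted)
}

// e.g. 99.5% SLO over 1,000,000 requests allows 5,000 failures;
// 2,500 observed failures leaves roughly half the budget.
```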