## Context & Architectural Decision
- System Profile: Complex B2B SaaS (ERP, CRM, PIM, eCommerce) with a strict market size cap of 4,000 tenants (operating baseline ~300).
- Stack Selection: Grafana LGTM Stack (Loki, Grafana, Tempo, Mimir) utilizing OpenTelemetry (OTel).
- Rationale: Provides a "single pane of glass" for application telemetry and external business metrics (Lago, Mollie) while maintaining cost control and avoiding the unpredictable billing spikes associated with managed SaaS observability platforms.
## Telemetry & Ingestion Architecture
- Standard: Strict adherence to OpenTelemetry (OTel) and W3C Trace Context to decouple applications from the storage backend.
- High Availability (HA) Ingestion: The OpenTelemetry Collector operates as the central ingestion gateway, deployed via Docker Swarm.
  - Intent: Prevent a single point of failure (SPOF) and avoid tightly coupling applications to the storage backend.
  - Implementation: Deployed as a Swarm service with `mode: replicated` (2 replicas) behind the Traefik load balancer. Swarm's routing mesh handles load distribution across the stateless OTel replicas.
- Placement: All LGTM services (OTel Collector, Loki, Tempo, Mimir, Grafana, cAdvisor, `prometheus-postgres-exporter`) are placement-constrained to `node.labels.tier == observability` on the dedicated `hcloud_server.observability_worker` node (cpx21+). The 2-node Swarm cannot absorb the LGTM footprint alongside application services; this node parallels the `tier=stateful` pattern from `infrastructure.md` §1. Sequencing: this requires the quorum and node-labelling infrastructure from the Swarm quorum expansion (see `adr-0xx-swarm-quorum-topology.md`).
- HA on Swarm without Consul: The Loki/Tempo/Mimir ingesters require a single-writer ring, and plain Swarm has no Consul or KV primitive for ring election. We run each ingester as `replicas: 1`, pinned to the observability-labelled node, and rely on Swarm self-heal for failover. The S3 chunk store is MinIO (already in-cluster). Trade-off: a single-instance ingester briefly drops writes during a task reschedule (seconds). The alternative, a Consul-backed ring, adds a stateful Raft cluster and is deferred.
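The replica and placement rules above can be sketched as a compose fragment. This is a minimal illustration, not the production file: image tags and service names are assumptions.

```yaml
# Sketch of docker-compose.observability.yml deploy rules (image tags illustrative)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.98.0
    deploy:
      mode: replicated
      replicas: 2                      # stateless; Traefik + routing mesh balance across both
      placement:
        constraints:
          - node.labels.tier == observability
  loki:
    image: grafana/loki:2.9.6
    deploy:
      replicas: 1                      # single-writer ingester ring; Swarm self-heal is the failover
      placement:
        constraints:
          - node.labels.tier == observability
```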
## Service Instrumentation & Correlation
- Frontend (Grafana Faro Web SDK): Captures Core Web Vitals, performance, and exceptions. Automatically injects W3C `traceparent` headers into `fetch`/XHR requests.
- CORS: Traefik and NestJS are explicitly configured to expose and accept the `traceparent` and `tracestate` headers.
- Backend Distributed Tracing:
  - Traefik: Extracts the incoming `traceparent` at the edge and forwards it downstream.
  - NestJS: Utilizes `@opentelemetry/auto-instrumentations-node` for automatic span generation.
  - Hasura (Community Edition): Lacks native OTel. We maintain tracing via a composite strategy:
    - Edge Tracing: Traefik captures overarching request duration and status.
    - Context Propagation: We configure Hasura to forward the `traceparent` header to NestJS for Actions/Event Triggers.
    - Log Analysis: The system ships Hasura structured JSON logs (`startup`, `http-log`, `webhook-log`, `query-log`) to Loki for query-latency analysis.
    - DB Metrics: `prometheus-postgres-exporter` connects directly to Postgres and exposes database metrics for scraping into Mimir.
- Log/Trace Correlation: Backend loggers (Pino, Winston) inject the active `trace_id` into JSON log payloads. Loki derived fields map the regex-extracted `trace_id` directly to Tempo for UI pivoting.
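The Loki → Tempo pivot above is wired in Grafana datasource provisioning. A minimal sketch, assuming the Tempo datasource UID is `tempo` and logs carry a `"trace_id":"…"` JSON field (both assumptions):

```yaml
# Grafana datasource provisioning sketch (UIDs and regex illustrative)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: trace_id
          matcherRegex: '"trace_id":"(\w+)"'   # extracts the id from the JSON payload
          url: '$${__value.raw}'               # $$ escapes provisioning interpolation
          datasourceUid: tempo                 # target Tempo datasource for the pivot
```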
## Metrics & High-Cardinality Management
- Business & 3rd-Party Metrics:
  - The system fetches real-time API state (e.g., Mollie balances) at dashboard load via Grafana plugins (Infinity/JSON).
  - Prometheus exporters poll APIs to record historical time series in Mimir.
  - Asynchronous webhook events are logged to Loki; LogQL queries derive dynamic metrics from them.
- LLM Usage Tracking:
  - Intent: Track costs without causing Mimir cardinality explosions.
  - Implementation: Utilize standard counter metrics (e.g., `llm_tokens_total`) where the token count is the increment value. The `tenant_id` and `model_version` labels are safely retained: the hard cap of 4,000 tenants keeps cardinality well within Mimir's operational thresholds.
- Global cardinality budget: ≤80,000 active series per metric (4,000 tenants × 20 series). Mimir enforces this at the ingester level; exceeding it causes ingestion rejection.
- Per-metric budget contract (see ADR-027): every custom metric declared in code MUST carry a `cardinality_budget: "<n>"` annotation on its associated Prometheus alert rule. A CI linter step (`promtool check rules` + `check_metric_cardinality.py`) verifies that (a) the annotation exists and parses as an integer, (b) the declared budget is ≤80,000, and (c) the actual cardinality at probe time is ≤ the declared budget. Rogue metrics with unbounded labels are rejected at PR time, not at production ingestion.
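An annotated rule under this contract could look like the following sketch; the alert name, threshold, and budget value are illustrative, only the `cardinality_budget` annotation key comes from the contract above:

```yaml
# Sketch of a budget-annotated rule (thresholds and names illustrative)
groups:
  - name: llm-usage
    rules:
      - alert: LlmTokenSpendHigh
        expr: sum by (tenant_id) (rate(llm_tokens_total[1h])) > 1000000
        for: 30m
        labels: { severity: warning }
        annotations:
          cardinality_budget: "8000"   # 4,000 tenants x ~2 model versions; checked by CI linter
          summary: "Tenant {{ $labels.tenant_id }} LLM token burn is unusually high"
```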
## PII Redaction & Data Privacy
- Production Catch-All: Promtail regex rules redact PII patterns (credit cards, emails) from logs before ingestion into Loki. See `promtail-config.yaml`.
- Client-Side Filtering: The Faro SDK sanitizes URLs and excludes sensitive DOM elements before data transmission.
- False-Positive Budget: ≤5 false positives per day per PII detection rule. Each LogQL rule carries an exclusion list maintained in `pii-exclusions.yaml`; new exclusions require PR approval. See `observability-alerts.md`.
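The Promtail redaction can be sketched with `replace` pipeline stages; the regexes below are illustrative placeholders, not the production rules in `promtail-config.yaml`:

```yaml
# promtail-config.yaml pipeline sketch (patterns illustrative)
scrape_configs:
  - job_name: docker
    pipeline_stages:
      - replace:
          expression: '([\w.-]+@[\w.-]+\.\w{2,})'    # email-like tokens
          replace: '<redacted-email>'
      - replace:
          expression: '\b(?:\d[ -]?){13,16}\b'       # card-number-like digit runs
          replace: '<redacted-pan>'
```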
> **NOTE (Future):** CI/CD static-analysis enforcement (Semgrep/SonarQube) to block deployments containing hardcoded PII logging variables; LogQL alerting rules for runtime PII scanning.
## Log Volume & Retention
- TTL: 14 days (matches the GDPR §4 immutability window).
- Retention mechanism: Loki compactor only; do not rely on MinIO bucket lifecycle rules to delete chunks. MinIO-lifecycle deletion produces "chunk not found" errors at query time because Loki's index still references the chunks until the compactor sweeps. Required `loki.yaml`:

```yaml
compactor:
  working_directory: /data/loki/compactor
  shared_store: s3
  retention_enabled: true
  delete_request_store: s3
limits_config:
  retention_period: 336h # 14d
```

- Hard safety cap: a 500 GB MinIO bucket quota acts as an emergency backstop (NOT the primary retention mechanism). A Grafana alert fires at >80% bucket utilisation so operators can investigate volume spikes before the quota is hit.
- Per-service log volume: every application service in `docker-compose.production.yml` MUST carry a logging block:

```yaml
logging:
  driver: loki
  options:
    loki-url: "http://otel-collector.observability:4318/loki/api/v1/push"
    loki-batch-size: "400"
    loki-external-labels: "service={{.Name}},environment=production"
```

- Exporter DSN gate: `prometheus-postgres-exporter` is declared in `docker-compose.observability.yml` with `deploy: { replicas: 0 }` until `.github/workflows/secrets-sync.yml` lands the `postgres_exporter_dsn` Swarm Secret; the replica count is flipped to `1` in a follow-up commit. This keeps L3-5.1.3 mergeable standalone.
## Alerting & Configuration Management
- Alerting Strategy: We deploy Grafana Alertmanager alongside the LGTM stack for deduplication, grouping, and silencing. Alert trees route infrastructure severity to PagerDuty/Slack, and business logic alerts to operational teams. We manage alerts as code.
- Docker Configs: We decouple all configurations (`grafana.ini`, OTel YAML, Prometheus rules) from container images and manage them via immutable Docker Configs, leveraging Swarm's Raft log for secure distribution. Updates use config rotation for zero-downtime deployments.
- Secrets: Infrastructure API keys and DB credentials rely on encrypted Docker Secrets (`/run/secrets/`), while sensitive tenant credentials are encrypted at the application boundary via `pgsodium`.
> **NOTE (Future): Multi-Tenant Embedded Dashboards**
>
> When tenants need self-service analytics, Grafana panels will be embedded in the workspace app via iframes. This requires:
> - An auth proxy (Nginx/Envoy) that validates the SaaS session and injects `X-WEBAUTH-USER` + `X-Scope-OrgID` (`tenant_id`) headers
> - Grafana cookie/SSO hardening (`allow_embedding = true`, `COOKIE_DOMAIN=.busflow.app`, `COOKIE_SAMESITE=none`, `COOKIE_SECURE=true`)
> - Row-level security via the `X-Scope-OrgID` header (native Loki/Mimir support)
> - See `observability-alerts.md` §A7 (SSO fallthrough) for triage
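If that future lands, the auth-proxy piece could look roughly like the Nginx sketch below. The `/validate-session` subrequest endpoint, location path, and upstream name are assumptions; only the two injected header names come from the note above.

```nginx
# Auth-proxy sketch for embedded Grafana (endpoint and upstream names illustrative)
location /grafana/ {
    auth_request /validate-session;                      # internal SaaS session check (not shown)
    auth_request_set $user   $upstream_http_x_user;      # identity returned by the validator
    auth_request_set $tenant $upstream_http_x_tenant;    # tenant_id returned by the validator
    proxy_set_header X-WEBAUTH-USER $user;               # Grafana auth-proxy login
    proxy_set_header X-Scope-OrgID  $tenant;             # Loki/Mimir row-level tenancy
    proxy_pass http://grafana:3000/;
}
```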
> **NOTE (Future): Faro SDK Load Policy (B2C vs. Workspace)**
>
> When the B2C passenger app launches, evaluate Faro's ~50 KB impact on checkout LCP.
> - Workspace app (B2B): load Faro synchronously
> - Passenger app (B2C checkout): lazy-load after `window.load` behind a feature flag
> - Run an A/B test for ≥1 week on real 3G/4G sessions before deciding the global default
## Architectural Decisions (Resolved Questions)
- Asynchronous Trace Context: When calling third-party APIs (Mollie, WhatsApp), the system injects the active `traceparent` as a custom metadata field in the webhook request payload. When the webhook fires back, NestJS unmarshals the metadata and links the span.
- Stateful Storage High Availability: Loki, Tempo, and Mimir use MinIO (S3-compatible) as their primary chunk store. The services themselves hold only local volumes for ephemeral caching and Write-Ahead Logs (WAL), making them easily replaceable Swarm tasks.
- Untrusted Client Trace Context: The edge Traefik proxy enforces aggressive rate-limiting on the telemetry ingestion port and drops arbitrarily large `tracestate` headers to prevent client-side cardinality-explosion attacks.
- Immutability vs. GDPR Data Deletion: Raw operational telemetry (Loki/Tempo) enforces a strict 14-day retention TTL. Instead of selectively scrubbing immutable chunks, the system guarantees automatic total wipeout of potentially leaked PII within the globally compliant 30-day deletion timeframe by simply aging it out.
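The webhook round-trip hinges on serialising and re-parsing a W3C `traceparent` string. A self-contained sketch of that marshalling step (the helper names are illustrative, not busflow code; in production the OTel propagation API would do this):

```typescript
// Sketch: carry a W3C traceparent through third-party webhook metadata.
// Helper names are illustrative; real code would use OTel's propagation API.
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars
  spanId: string;   // 16 lowercase hex chars
  sampled: boolean;
}

// Serialise the active context into the metadata field sent to Mollie/WhatsApp.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

// On webhook receipt, recover the context so the new span can be linked.
function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null; // reject malformed or unsupported-version headers
  return { traceId: m[1], spanId: m[2], sampled: m[3] === '01' };
}
```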
## Volume Monitoring & Auto-Scaling
Postgres and MinIO data live on a Hetzner Cloud Volume (`hcloud_volume.data`, default 20 GB) bind-mounted at `/mnt/data/`. Without monitoring, disk exhaustion silently crashes both services.
### Metrics
cAdvisor (already in the LGTM stack) and node_exporter expose filesystem metrics for the volume mount:
| Metric | Source | Meaning |
|---|---|---|
| `node_filesystem_avail_bytes{mountpoint="/mnt/HC_Volume_*"}` | node_exporter | Free bytes on the Hetzner Volume |
| `node_filesystem_size_bytes{mountpoint="/mnt/HC_Volume_*"}` | node_exporter | Total volume size |
| `(1 - avail/size) * 100` | Grafana recording rule | Usage percentage |
### Alert Rules
Defined as code in `docker/observability/prometheus-rules/volume-alerts.yml`:

```yaml
groups:
  - name: volume
    rules:
      - alert: VolumeUsageWarning
        expr: (1 - node_filesystem_avail_bytes{mountpoint=~"/mnt/HC_Volume.*"} / node_filesystem_size_bytes{mountpoint=~"/mnt/HC_Volume.*"}) * 100 > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Hetzner Volume usage > 80%"
          runbook_url: "https://docs.busflow.app/protocols/volume-resize"
      - alert: VolumeUsageCritical
        expr: (1 - node_filesystem_avail_bytes{mountpoint=~"/mnt/HC_Volume.*"} / node_filesystem_size_bytes{mountpoint=~"/mnt/HC_Volume.*"}) * 100 > 90
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Hetzner Volume usage > 90% (auto-resize triggered)"
```
> **NOTE (Future): Auto-Resize Pipeline**
>
> Hetzner does not offer native volume auto-scaling. The planned two-stage resize:
> 1. An Alertmanager webhook fires `VolumeUsageCritical` to a GitHub Actions `workflow_dispatch` endpoint
> 2. `.github/workflows/volume-resize.yml` calls `hcloud volume resize` + `resize2fs` (online, no downtime)
>
> Safety: max 100 GB cap, concurrency-group cooldown, Terraform drift handling.
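A minimal sketch of that workflow, assuming an `HCLOUD_TOKEN` repo secret, a volume named `data`, an SSH-reachable host, and a pre-installed `hcloud` CLI (all assumptions):

```yaml
# .github/workflows/volume-resize.yml sketch (names, sizes, and SSH target illustrative)
on:
  workflow_dispatch:
    inputs:
      new_size_gb:
        description: "Target volume size in GB (hard cap 100)"
        required: true
concurrency:
  group: volume-resize        # serialises resizes; doubles as a cooldown
  cancel-in-progress: false
jobs:
  resize:
    runs-on: ubuntu-latest
    steps:
      - name: Enforce safety cap and resize volume (online)
        env:
          HCLOUD_TOKEN: ${{ secrets.HCLOUD_TOKEN }}
        run: |
          test "${{ inputs.new_size_gb }}" -le 100       # 100 GB hard cap
          hcloud volume resize data --size "${{ inputs.new_size_gb }}"
      - name: Grow the filesystem without downtime
        run: ssh ops@observability "sudo resize2fs /dev/disk/by-id/scsi-0HC_Volume_data"
```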
## TLS Certificate Monitoring
Traefik manages ACME certificates (Let's Encrypt) for 6 subdomains: api.busflow.app, hasura.busflow.app, auth.busflow.app, storage.busflow.app, n8n.busflow.app, and busflow.app. A silent ACME renewal failure causes hard downtime after the 90-day cert lifetime expires.
- Metric: Traefik v3 exposes `traefik_tls_certs_not_after` as a Prometheus metric (epoch timestamp per certificate). Mimir scrapes this via the OTel Collector.
- Alert: Grafana fires when any certificate has ≤14 days remaining. Defined in `docker/observability/prometheus-rules/tls-alerts.yml`:

```yaml
groups:
  - name: tls
    rules:
      - alert: TlsCertExpiringSoon
        expr: (traefik_tls_certs_not_after - time()) / 86400 < 14
        for: 1h
        labels: { severity: warning }
        annotations:
          summary: "TLS cert for {{ $labels.cn }} expires in {{ $value | humanize }} days"
          runbook_url: "https://docs.busflow.app/protocols/observability-alerts#a9"
```

- Triage: See `observability-alerts.md` §A9.
- Accepted risk: A Cloudflare DNS API outage does not break existing traffic but silently prevents ACME DNS-challenge renewals. The 14-day alert window provides sufficient buffer to detect and resolve this.
> **NOTE (Future): External Dependency Health Probes**
>
> Scheduled probes (NestJS cron or standalone) checking external service health every 5 min:
> - AWS SES, Hetzner Object Storage, LLM providers (OpenAI/Anthropic)
> - Results surface as OTel gauges; Grafana alerts on ≥3 consecutive failures → Slack `#ops`
> - See `observability-alerts.md` §A10
> **NOTE (Future): Synthetic Uptime Monitoring**
>
> External probes (outside the Swarm) hitting public URLs through the full CDN → Traefik → service chain:
> - `api.busflow.app/api/health`, `busflow.app`, `hasura.busflow.app/healthz`, `auth.busflow.app/healthz`
> - Implementation: Grafana Cloud Synthetic Monitoring (free tier), BetterStack, or GitHub Actions cron
> - Alert: ≥2 consecutive failures → Slack `#ops`
## Developer Instrumentation Guide
### Backend (NestJS)
- Auto-instrumentation: `@opentelemetry/auto-instrumentations-node` automatically generates spans for HTTP, database, and Redis operations. No manual setup is required for standard CRUD paths.
- Custom spans: For business-critical operations (e.g., PDF parsing, pricing calculation), create explicit spans:

```typescript
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('busflow-api');

async function parsePdf(file: Buffer) {
  return tracer.startActiveSpan('pdf-parser.extract', async (span) => {
    span.setAttribute('pdf.size_bytes', file.length);
    try {
      // extractStructuredData is the domain parser, defined elsewhere
      const result = await extractStructuredData(file);
      span.setAttribute('pdf.confidence', result.confidence);
      return result;
    } catch (err) {
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
```

- Naming conventions: `<service>.<operation>` (e.g., `pricing.calculate`, `wallet.generate-pass`, `compliance.check-rest-time`).
- Attributes: Use semantic conventions (`http.method`, `db.statement`) where applicable. Custom attributes use the `busflow.` prefix (e.g., `busflow.tenant_id`, `busflow.tour_id`).
### Frontend (Grafana Faro)
- Automatic: The Faro SDK captures Core Web Vitals, `fetch`/XHR durations, and unhandled exceptions out of the box.
- User Paths (Page Views): We hook into the Vue Router `afterEach` guard in the app boot file (e.g., `src/boot/faro.ts`) to track `page_view` events. This allows building funnel charts and tracking user journeys.
- User Interactions (Button Clicks): We track critical business UI actions (e.g., clicking "Save Vehicle") using the custom `v-track` directive (e.g., `v-track="'save_vehicle_btn'"`) or a manual `faro.api.pushEvent('button_click', { action: '...' })`.
- Custom Events: Use `faro.api.pushEvent()` for complex business-critical flows (e.g., `booking.completed`, `magic-upload.started`).
- Error Boundaries: Vue error boundaries should call `faro.api.pushError()` with structured context (component name, route, user role).
### Local Development
Docker Compose provides a minimal observability stack for local debugging:
```sh
# Start local Grafana + OTel Collector + Loki + Tempo
docker compose -f docker-compose.observability.yml up -d
```

| Service | Local Port | Purpose |
|---|---|---|
| Grafana | `localhost:3000` | Dashboards, trace viewer, log explorer |
| OTel Collector | `localhost:4317` (gRPC), `localhost:4318` (HTTP) | Telemetry ingestion |
| Loki | `localhost:3100` | Log storage |
| Tempo | `localhost:3200` | Trace storage |

NestJS dev servers emit traces to `localhost:4317` when `OTEL_EXPORTER_OTLP_ENDPOINT` is set. The Faro SDK points to `localhost:4318` in development mode.
## Preview Environments
Busflow does not spin up dedicated LGTM stacks per preview environment (e.g., per pull request). This would be resource-prohibitive and difficult to manage. Instead, preview environments leverage the shared central observability stack on the Docker Swarm via environment tagging:
- Global Log Collection (Zero-Config): The `promtail` service in the observability stack is deployed in `mode: global` across all Swarm nodes. It reads the Docker socket and automatically ships logs for all ephemeral preview containers (`busflow-pr-<ID>`) to the central Loki instance.
- Central Telemetry Routing: Preview backend services (like the NestJS API) push their OTel traces and metrics to the shared OTel Collector on the Swarm's internal network (`http://obs_otel_collector:4318`).
- Environment Tagging: Every trace and metric emitted by a preview stack is tagged with the specific preview environment (e.g., `deployment.environment="preview"`, `service.namespace="busflow-pr-123"`).
- Grafana Filtering: Operators use a single set of Grafana dashboards. Selecting a specific PR namespace from the `$environment` dashboard variable filters all Loki logs, Mimir metrics, and Tempo traces down to that ephemeral stack.
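The tagging step typically needs no code changes: the standard OTel SDK environment variables carry the resource attributes. A compose fragment sketch for a preview stack (service name and PR id are placeholders):

```yaml
# Preview-stack compose sketch (service name and PR id illustrative)
services:
  api:
    image: busflow-api:pr-123
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://obs_otel_collector:4318"
      OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=preview,service.namespace=busflow-pr-123"
```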
## Runbooks & Dashboards
### Core Dashboards
| Dashboard | Scope | Key Panels |
|---|---|---|
| Service Overview | All services | Request rate, error rate, p50/p95/p99 latency per service |
| Per-Tenant Health | Individual tenant | Request volume, error breakdown, active sessions |
| LLM Cost Tracker | AI services | `llm_tokens_total` by model and tenant, daily/weekly cost projections |
| Infrastructure | Docker Swarm | Node CPU/memory, container restarts, Swarm service health |
| Booking Funnel | B2C commerce | Conversion steps, drop-off points, payment success rate |
| Deploy Timeline | CI/CD | Deploy markers overlaid on error rate + latency panels |
### Alerting Runbooks
All alerts are defined as code in the Grafana provisioning directory. Each alert has a runbook link in its annotation:
- High Error Rate (>1% 5xx for 5 min): Check Tempo for failing traces → identify the root service → check recent deploys.
- Latency SLO Breach (p99 > 500 ms for 10 min): Check Mimir for resource saturation → scale Swarm replicas or investigate slow queries in Loki.
- PII Detection Alert: Loki fires on a LogQL pattern match → identify the offending service → hotfix the PII leak → verify with Semgrep.
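The PII detection alert can be sketched as a Loki ruler rule; the regex is an illustrative placeholder (the real patterns and their exclusion lists live in `promtail-config.yaml` / `pii-exclusions.yaml`):

```yaml
# Loki ruler sketch for runtime PII detection (pattern illustrative)
groups:
  - name: pii
    rules:
      - alert: PiiEmailInLogs
        expr: sum by (service) (count_over_time({environment="production"} |~ `[\w.-]+@[\w.-]+\.[a-z]{2,}` [5m])) > 0
        labels: { severity: critical }
        annotations:
          summary: "Possible email address logged by {{ $labels.service }}"
```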
## SLI / SLO Definitions
Service Level Indicators (SLIs) and Objectives (SLOs) for core services. Monitored via Mimir metrics and Grafana alerting.
| Service | SLI | SLO Target | Measurement |
|---|---|---|---|
| API (NestJS) | Request latency (p99) | < 300 ms | OTel histogram `http.server.duration` |
| API (NestJS) | Error rate (5xx) | < 0.5% | OTel counter `http.server.response.status_code` |
| Booking Flow | End-to-end success rate | > 99.5% | Custom counter `busflow.booking.completed` / `busflow.booking.started` |
| Hasura (GraphQL) | Query latency (p95) | < 200 ms | Hasura `http-log` in Loki |
| PDF Parsing | PDF parse success rate | > 95% | Custom counter `busflow.pdf_parser.success` / `busflow.pdf_parser.total` |
| Frontend (Faro) | LCP (Largest Contentful Paint) | < 2.5 s | Faro Core Web Vitals |
| Frontend (Faro) | Unhandled JS exceptions | < 0.1% of sessions | Faro error events / session count |
Error Budget Policy: When a service exhausts its monthly error budget (calculated from SLO), feature development pauses and the team prioritizes reliability work until the budget recovers.
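The budget arithmetic behind this policy is simple enough to sketch; the helper below is illustrative (not busflow code) and shows how an SLO target converts into an allowed-failure count and a remaining-budget fraction:

```typescript
// Sketch: monthly error-budget arithmetic (helper name illustrative).
// sloTarget is the success-rate objective, e.g. 0.995 for the Booking Flow SLO.
function errorBudgetRemaining(
  sloTarget: number,
  totalRequests: number,
  failedRequests: number,
): number {
  const allowedFailures = totalRequests * (1 - sloTarget); // budget, in requests
  return (allowedFailures - failedRequests) / allowedFailures; // fraction left (<0 = exhausted)
}

// e.g. 99.5% SLO over 1,000,000 requests allows 5,000 failures;
// 2,500 observed failures leaves roughly half the budget.
```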