Busflow Docs

Internal documentation portal


Context & Architectural Decision

  • System Profile: Complex B2B SaaS (ERP, CRM, PIM, eCommerce) with a strict market size cap of 4,000 tenants (operating baseline ~300).
  • Stack Selection: Grafana LGTM Stack (Loki, Grafana, Tempo, Mimir) utilizing OpenTelemetry (OTel).
  • Rationale: Provides a "single pane of glass" for application telemetry and external business metrics (Lago, Mollie) while maintaining cost control and avoiding the unpredictable billing spikes associated with managed SaaS observability platforms.

Telemetry & Ingestion Architecture

  • Standard: Strict adherence to OpenTelemetry (OTel) and W3C Trace Context to decouple applications from the storage backend.
  • High Availability (HA) Ingestion: OpenTelemetry Collector operates as the central ingestion gateway, deployed via Docker Swarm.
    • Intent: Prevent a single point of failure (SPOF) and avoid tightly coupled applications.
    • Implementation: Deployed using Swarm mode: replicated (2 replicas) behind the Traefik load balancer. Swarm's routing mesh handles load distribution across stateless OTel replicas.
  • Placement: All LGTM services (OTel Collector, Loki, Tempo, Mimir, Grafana, cAdvisor, prometheus-postgres-exporter) are placement-constrained to node.labels.tier == observability on the dedicated hcloud_server.observability_worker node (cpx21+). The 2-node Swarm cannot absorb the LGTM footprint alongside application services; this node parallels the tier=stateful pattern from infrastructure.md §1. Sequencing: this requires the quorum and node-labelling infrastructure from the Swarm quorum expansion (see adr-0xx-swarm-quorum-topology.md).
  • HA on Swarm without Consul: Loki / Tempo / Mimir ingesters require a single-writer ring; plain Swarm has no Consul or KV primitive for ring election. We run each ingester as replicas: 1, pinned to the observability-labelled node, and rely on Swarm self-heal for failover. The S3 chunk store is MinIO (already in-cluster). Trade-off: a single-instance ingester briefly drops writes during task reschedule (seconds). The alternative, a Consul-backed ring, adds a stateful Raft cluster and is deferred.
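
A minimal sketch of the corresponding Swarm stack definition, assuming the collector is declared in docker-compose.observability.yml (the image tag is illustrative):

yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.104.0   # pin an exact tag in the real stack file
    deploy:
      mode: replicated
      replicas: 2                        # stateless HA pair; Swarm's routing mesh balances OTLP traffic
      placement:
        constraints:
          - node.labels.tier == observability
      update_config:
        order: start-first               # keep one replica serving during rolling updates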

Service Instrumentation & Correlation

  • Frontend (Grafana Faro Web SDK): Captures Core Web Vitals, performance, and exceptions. Automatically injects W3C traceparent headers into fetch/XHR requests.
  • CORS: Traefik and NestJS are explicitly configured to expose and accept traceparent and tracestate headers.
  • Backend Distributed Tracing:
    • Traefik: Extracts incoming traceparent at the edge and forwards it downstream.
    • NestJS: Utilizes @opentelemetry/auto-instrumentations-node for automatic span generation.
    • Hasura (Community Edition): Lacks native OTel. We maintain tracing via a composite strategy:
      • Edge Tracing: Traefik captures overarching request duration and status.
      • Context Propagation: We configure Hasura to forward the traceparent header to NestJS for Actions/Event Triggers.
      • Log Analysis: The system ships Hasura structured JSON logs (startup, http-log, webhook-log, query-log) to Loki for query latency analysis.
      • DB Metrics: prometheus-postgres-exporter connects directly to Postgres and exposes database metrics, which the OTel Collector scrapes and writes to Mimir.
  • Log/Trace Correlation: Backend loggers (Pino, Winston) inject the active trace_id into JSON log payloads. Loki utilizes derived fields to map the regex-extracted trace_id directly to Tempo for UI pivoting.
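
The derived field lives in the Grafana datasource provisioning; a hedged sketch (the Tempo datasource UID, the internal Loki URL, and the exact trace_id JSON key are assumptions to match the real log schema):

yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'   # regex-extract the trace_id from JSON log lines
          url: '${__value.raw}'                # pivot straight into the matching Tempo trace
          datasourceUid: tempo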

Metrics & High-Cardinality Management

  • Business & 3rd-Party Metrics:
    • The system fetches real-time API state (e.g., Mollie balances) at dashboard load via Grafana plugins (Infinity/JSON).
    • Prometheus Exporters poll APIs to push historical time-series data to Mimir.
    • The system logs asynchronous webhook events to Loki; LogQL generates dynamic metrics.
  • LLM Usage Tracking:
    • Intent: Track costs without causing Mimir cardinality explosions.
    • Implementation: Utilize standard counter metrics (e.g., llm_tokens_total) where the token amount is the value. The tenant_id and model_version are safely retained as labels, as the hard cap of 4,000 tenants ensures cardinality remains well within Mimir's operational thresholds (see the counter sketch after this list).
  • Global cardinality budget: ≤ 80,000 active series per metric (4,000 tenants × 20 series). Mimir enforces this at the ingester level; exceeding it causes ingestion rejection.
  • Per-metric budget contract (see ADR-027): every custom metric declared in code MUST carry a cardinality_budget: "<n>" annotation on its associated Prometheus alert rule. A CI linter step (promtool check rules + check_metric_cardinality.py) verifies (a) the annotation exists and parses as an integer, (b) the declared budget is ≤ 80,000, (c) the actual cardinality at the time of the probe is ≤ the declared budget. Rogue metrics with unbounded labels are rejected at PR time, not at production ingestion.
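
A minimal TypeScript sketch of the LLM counter referenced above (the meter name and helper function are illustrative; the metric and label names follow the Implementation bullet):

typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('busflow-llm');

// Counter value carries the token amount; only bounded labels are attached,
// so the series count stays under the 80,000 global budget.
const llmTokens = meter.createCounter('llm_tokens_total', {
  description: 'LLM tokens consumed, attributed per tenant and model',
});

export function recordLlmUsage(tenantId: string, modelVersion: string, tokens: number) {
  llmTokens.add(tokens, {
    tenant_id: tenantId,          // bounded: hard cap of 4,000 tenants
    model_version: modelVersion,  // bounded: small, known set of models
    // Never add request IDs or prompt hashes here; unbounded labels blow the budget.
  });
}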

PII Redaction & Data Privacy

  • Production Catch-All: Promtail regex rules redact PII patterns (credit cards, emails) from logs before ingestion into Loki. See promtail-config.yaml (excerpt after this list).
  • Client-Side Filtering: The Faro SDK sanitizes URLs and excludes sensitive DOM elements before data transmission.
  • False-Positive budget: ≤ 5 FP/day per PII detection rule. Each LogQL rule carries an exclusion list maintained in pii-exclusions.yaml; new exclusions require PR approval. See observability-alerts.md.
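
A hedged excerpt of what those rules look like in promtail-config.yaml (the job name and regexes are illustrative; the authoritative patterns live in the actual config):

yaml
scrape_configs:
  - job_name: docker
    # ...service discovery / relabel config elided...
    pipeline_stages:
      # Redact e-mail addresses before the line ever reaches Loki
      - replace:
          expression: '(?P<email>[\w.+-]+@[\w-]+\.[\w.-]+)'
          replace: '<redacted-email>'
      # Redact 13-19 digit card-like numbers (deliberately broad)
      - replace:
          expression: '(?P<pan>\b(?:\d[ -]?){13,19}\b)'
          replace: '<redacted-pan>'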

NOTE

Future: CI/CD static analysis enforcement (Semgrep/SonarQube) to block deployments containing hardcoded PII logging variables. LogQL alerting rules for runtime PII scanning.

Log Volume & Retention

  • TTL: 14 days (matches the GDPR §4 immutability window).
  • Retention mechanism: Loki compactor only; do not rely on MinIO bucket lifecycle rules to delete chunks. MinIO-lifecycle deletion produces "chunk not found" errors at query time because Loki's index still references them until the compactor sweeps. Required loki.yaml:
    yaml
    compactor:
      working_directory: /data/loki/compactor
      shared_store: s3
      retention_enabled: true
      delete_request_store: s3
    limits_config:
      retention_period: 336h  # 14d
  • Hard safety cap: a 500 GB MinIO bucket quota acts as an emergency backstop (NOT the primary retention mechanism). A Grafana alert fires at >80 % bucket utilisation so operators can investigate volume spikes before the quota is hit.
  • Per-service log volume: every application service in docker-compose.production.yml MUST carry a logging block:
    yaml
    logging:
      driver: loki
      options:
        loki-url: "http://otel-collector.observability:4318/loki/api/v1/push"
        loki-batch-size: "400"
        loki-external-labels: "service={{.Name}},environment=production"
  • Exporter DSN gate: prometheus-postgres-exporter is declared in docker-compose.observability.yml with deploy: { replicas: 0 } until .github/workflows/secrets-sync.yml lands the postgres_exporter_dsn Swarm Secret; the replica count is flipped to 1 in a follow-up commit. This keeps L3-5.1.3 mergeable standalone.

Alerting & Configuration Management

  • Alerting Strategy: We deploy Grafana Alertmanager alongside the LGTM stack for deduplication, grouping, and silencing. Alert trees route infrastructure severity to PagerDuty/Slack, and business logic alerts to operational teams. We manage alerts as code.
  • Docker Configs: We decouple all configurations (grafana.ini, OTel YAML, Prometheus rules) from container images and manage them via immutable Docker Configs, leveraging Swarm's Raft log for secure distribution. Updates use config rotation for zero-downtime deployments (see the sketch after this list).
  • Secrets: Infrastructure API keys and DB credentials rely on encrypted Docker Secrets (/run/secrets/), while sensitive tenant credentials are encrypted at the application boundary via pgsodium.
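
Because Docker Configs are immutable, "rotation" means publishing the change under a new versioned name and re-deploying the stack; a sketch (file paths and the version suffix are illustrative):

yaml
configs:
  grafana_ini_v2:                  # bump the suffix on every change; Swarm swaps it in on redeploy
    file: ./grafana/grafana.ini

services:
  grafana:
    configs:
      - source: grafana_ini_v2
        target: /etc/grafana/grafana.ini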

NOTE

Future: Multi-Tenant Embedded Dashboards

When tenants need self-service analytics, Grafana panels will be embedded in the workspace app via iframes. This requires:

  • An Auth Proxy (Nginx/Envoy) that validates the SaaS session and injects X-WEBAUTH-USER + X-Scope-OrgID (tenant_id) headers
  • Grafana cookie/SSO hardening (allow_embedding = true, COOKIE_DOMAIN=.busflow.app, COOKIE_SAMESITE=none, COOKIE_SECURE=true)
  • Row-level security via the X-Scope-OrgID header (native Loki/Mimir support)
  • See observability-alerts.md §A7 (SSO fallthrough) for triage
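
If this lands, the Grafana settings listed above map onto environment variables in the stack file roughly as follows (a hedged sketch; the auth-proxy wiring itself is not shown):

yaml
services:
  grafana:
    environment:
      GF_SECURITY_ALLOW_EMBEDDING: "true"
      GF_SECURITY_COOKIE_SAMESITE: "none"
      GF_SECURITY_COOKIE_SECURE: "true"
      GF_AUTH_PROXY_ENABLED: "true"
      GF_AUTH_PROXY_HEADER_NAME: "X-WEBAUTH-USER"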

NOTE

Future: Faro SDK Load Policy (B2C vs. Workspace)

When the B2C passenger app launches, evaluate Faro's ~50 KB impact on checkout LCP.

  • Workspace app (B2B): load Faro synchronously
  • Passenger app (B2C checkout): lazy-load after window.load behind a feature flag
  • Run an A/B test for ≥1 week on real 3G/4G sessions before deciding the global default

Architectural Decisions (Resolved Questions)

  • Asynchronous Trace Context: When calling third-party APIs (Mollie, WhatsApp), the system injects the active traceparent as a custom metadata field in the webhook request payload. When the webhook fires back, NestJS unmarshals the metadata and links the new span to the originating trace (see the propagation sketch after this list).
  • Stateful Storage High Availability: Loki, Tempo, and Mimir use MinIO (S3-compatible) as their primary chunk store. The services themselves only hold local volumes for ephemeral caching and Write-Ahead Logs (WAL), making them easily replaceable Swarm tasks.
  • Untrusted Client Trace Context: The edge Traefik proxy enforces aggressive rate-limiting on the telemetry ingestion port and drops arbitrarily large tracestate headers to prevent cardinality explosion attacks from the client side.
  • Immutability vs. GDPR Data Deletion: Raw operational telemetry (Loki/Tempo) enforces a strict 14-day retention TTL. Instead of selectively scrubbing immutable chunks, the system guarantees automatic total wipeout of potentially leaked PII within the globally compliant 30-day deletion timeframe by simply aging it out.
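
A hedged TypeScript sketch of both directions (function and span names are illustrative; it assumes the W3C propagator registered by the Node auto-instrumentation):

typescript
import { context, propagation, trace, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('busflow-api');

// Outbound: serialise the active trace context into the payload handed to the provider.
function withTraceMetadata<T extends object>(payload: T) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier);           // writes traceparent / tracestate
  return { ...payload, metadata: { traceparent: carrier.traceparent } };
}

// Inbound webhook: rebuild the context and start a span that joins the originating trace.
function handleWebhook(body: { metadata?: { traceparent?: string } }) {
  const parentCtx = propagation.extract(context.active(), {
    traceparent: body.metadata?.traceparent ?? '',
  });
  const span = tracer.startSpan('webhook.process', { kind: SpanKind.CONSUMER }, parentCtx);
  try {
    // ...process the webhook payload...
  } finally {
    span.end();
  }
}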

Volume Monitoring & Auto-Scaling

Postgres and MinIO data live on a Hetzner Cloud Volume (hcloud_volume.data, default 20 GB) bind-mounted via /mnt/data/. Without monitoring, disk exhaustion crashes both services silently.

Metrics

cAdvisor (already in the LGTM stack) and node_exporter expose filesystem metrics for the volume mount:

| Metric | Source | Meaning |
| --- | --- | --- |
| node_filesystem_avail_bytes{mountpoint="/mnt/HC_Volume_*"} | node_exporter | Free bytes on the Hetzner Volume |
| node_filesystem_size_bytes{mountpoint="/mnt/HC_Volume_*"} | node_exporter | Total volume size |
| (1 - avail/size) * 100 | Grafana recording rule | Usage percentage |

Alert Rules

Defined as code in docker/observability/prometheus-rules/volume-alerts.yml:

yaml
groups:
  - name: volume
    rules:
      - alert: VolumeUsageWarning
        expr: (1 - node_filesystem_avail_bytes{mountpoint=~"/mnt/HC_Volume.*"} / node_filesystem_size_bytes{mountpoint=~"/mnt/HC_Volume.*"}) * 100 > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Hetzner Volume usage > 80%"
          runbook_url: "https://docs.busflow.app/protocols/volume-resize"
      - alert: VolumeUsageCritical
        expr: (1 - node_filesystem_avail_bytes{mountpoint=~"/mnt/HC_Volume.*"} / node_filesystem_size_bytes{mountpoint=~"/mnt/HC_Volume.*"}) * 100 > 90
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Hetzner Volume usage > 90% โ€” auto-resize triggered"

NOTE

Future: Auto-Resize Pipeline

Hetzner does not offer native volume auto-scaling. The planned two-stage resize:

  1. Alertmanager webhook fires VolumeUsageCritical to a GitHub Actions workflow_dispatch endpoint
  2. .github/workflows/volume-resize.yml calls hcloud volume resize + resize2fs (online, no downtime)
  3. Safety: max 100 GB cap, concurrency group cooldown, Terraform drift handling

TLS Certificate Monitoring

Traefik manages ACME certificates (Let's Encrypt) for 6 hostnames: api.busflow.app, hasura.busflow.app, auth.busflow.app, storage.busflow.app, n8n.busflow.app, and busflow.app. A silent ACME renewal failure causes hard downtime once the 90-day certificate lifetime expires.

  • Metric: Traefik v3 exposes traefik_tls_certs_not_after as a Prometheus metric (epoch timestamp per certificate). Mimir scrapes this via the OTel Collector.
  • Alert: Grafana fires when any certificate has โ‰ค14 days remaining. Defined in docker/observability/prometheus-rules/tls-alerts.yml:
    yaml
    groups:
      - name: tls
        rules:
          - alert: TlsCertExpiringSoon
            expr: (traefik_tls_certs_not_after - time()) / 86400 < 14
            for: 1h
            labels: { severity: warning }
            annotations:
              summary: "TLS cert for {{ $labels.cn }} expires in {{ $value | humanizeDuration }}"
              runbook_url: "https://docs.busflow.app/protocols/observability-alerts#a9"
  • Triage: See observability-alerts.md §A9.
  • Accepted risk: A Cloudflare DNS API outage does not break existing traffic but silently prevents ACME DNS-challenge renewals. The 14-day alert window provides sufficient buffer to detect and resolve this.

NOTE

Future: External Dependency Health Probes

Scheduled probes (NestJS cron or standalone) checking external service health every 5 min:

  • AWS SES, Hetzner Object Storage, LLM Providers (OpenAI/Anthropic)
  • Results surface as OTel gauges; Grafana alerts on ≥3 consecutive failures → Slack #ops
  • See observability-alerts.md §A10
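
A hedged NestJS sketch of such a probe (service name, probe URLs, and the gauge name are illustrative, not an agreed contract):

typescript
import { Injectable } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('busflow-health-probes');
const lastStatus = new Map<string, number>();               // 1 = up, 0 = down

// Observable gauge reports the most recent probe result per dependency.
const dependencyUp = meter.createObservableGauge('busflow.dependency.up');
dependencyUp.addCallback((result) => {
  for (const [dependency, up] of lastStatus) {
    result.observe(up, { dependency });
  }
});

@Injectable()
export class DependencyProbeService {
  @Cron(CronExpression.EVERY_5_MINUTES)
  async probe() {
    const targets: Record<string, string> = {
      // Replace with the real provider health/status endpoints.
      aws_ses: 'https://example.invalid/ses-health',
      openai: 'https://example.invalid/openai-health',
    };
    for (const [dependency, url] of Object.entries(targets)) {
      try {
        const res = await fetch(url, { signal: AbortSignal.timeout(5_000) });
        lastStatus.set(dependency, res.ok ? 1 : 0);
      } catch {
        lastStatus.set(dependency, 0);
      }
    }
  }
}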

NOTE

Future: Synthetic Uptime Monitoring

External probes (outside the Swarm) hitting public URLs through the full CDN → Traefik → service chain:

  • api.busflow.app/api/health, busflow.app, hasura.busflow.app/healthz, auth.busflow.app/healthz
  • Implementation: Grafana Cloud Synthetic Monitoring (free tier), BetterStack, or GitHub Actions cron
  • Alert: ≥2 consecutive failures → Slack #ops

Developer Instrumentation Guide

Backend (NestJS)

  • Auto-instrumentation: @opentelemetry/auto-instrumentations-node automatically generates spans for HTTP, database, and Redis operations. No manual setup required for standard CRUD paths.
  • Custom spans: For business-critical operations (e.g., PDF parsing, pricing calculation), create explicit spans:
typescript
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('busflow-api');

async function parsePdf(file: Buffer) {
  return tracer.startActiveSpan('pdf-parser.extract', async (span) => {
    span.setAttribute('pdf.size_bytes', file.length);
    try {
      const result = await extractStructuredData(file);
      span.setAttribute('pdf.confidence', result.confidence);
      return result;
    } catch (err) {
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}
  • Naming conventions: <service>.<operation> (e.g., pricing.calculate, wallet.generate-pass, compliance.check-rest-time).
  • Attributes: Use semantic conventions (http.method, db.statement) where applicable. Custom attributes use the busflow. prefix (e.g., busflow.tenant_id, busflow.tour_id).

Frontend (Grafana Faro)

  • Automatic: The Faro SDK captures Core Web Vitals, fetch/XHR durations, and unhandled exceptions out of the box.
  • User Paths (Page Views): We hook into the Vue Router afterEach guard in the app boot file (e.g., src/boot/faro.ts) to track page_view events, which enables funnel charts and user-journey tracking (see the sketch after this list).
  • User Interactions (Button Clicks): We track critical business UI actions (e.g., clicking "Save Vehicle") using the custom v-track directive (e.g. v-track="'save_vehicle_btn'") or manual faro.api.pushEvent('button_click', { action: '...' }).
  • Custom events: Use faro.api.pushEvent() for complex business-critical flows (e.g., booking.completed, magic-upload.started).
  • Error boundaries: Vue error boundaries should call faro.api.pushError() with structured context (component name, route, user role).
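
A hedged sketch of the boot-file wiring (the collector URL and app metadata are assumptions):

typescript
// src/boot/faro.ts
import { initializeFaro } from '@grafana/faro-web-sdk';
import type { Router } from 'vue-router';

export const faro = initializeFaro({
  url: 'https://faro.busflow.app/collect',      // Faro receiver endpoint (assumption)
  app: { name: 'busflow-workspace', version: '1.0.0' },
});

// Page views: hook the router so every navigation becomes a page_view event.
export function trackPageViews(router: Router) {
  router.afterEach((to) => {
    faro.api.pushEvent('page_view', {
      route: String(to.name ?? to.path),
      path: to.fullPath,
    });
  });
}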

Local Development

Docker Compose provides a minimal observability stack for local debugging:

bash
# Start local Grafana + OTel Collector + Loki + Tempo
docker compose -f docker-compose.observability.yml up -d

| Service | Local Port | Purpose |
| --- | --- | --- |
| Grafana | localhost:3000 | Dashboards, trace viewer, log explorer |
| OTel Collector | localhost:4317 (gRPC), localhost:4318 (HTTP) | Telemetry ingestion |
| Loki | localhost:3100 | Log storage |
| Tempo | localhost:3200 | Trace storage |

NestJS dev servers emit traces to localhost:4317 when OTEL_EXPORTER_OTLP_ENDPOINT points at the local collector; the Faro SDK points to localhost:4318 in development mode.


Preview Environments

Busflow does not spin up dedicated LGTM stacks per preview environment (e.g., per pull request). This would be resource-prohibitive and difficult to manage. Instead, preview environments leverage the shared central observability stack on the Docker Swarm via environment tagging:

  1. Global Log Collection (Zero-Config): The promtail service in the observability stack is deployed in mode: global across all Swarm nodes. It automatically reads the Docker socket and ships logs for all ephemeral preview containers (busflow-pr-<ID>) to the central Loki instance.
  2. Central Telemetry Routing: Preview backend services (like the NestJS API) push their OTel traces and metrics to the shared OTel Collector on the Swarm's internal network (http://obs_otel_collector:4318).
  3. Environment Tagging: Every trace and metric emitted by a preview stack is tagged with the specific preview environment (e.g., deployment.environment="preview", service.namespace="busflow-pr-123"); see the environment snippet after this list.
  4. Grafana Filtering: Operators use a single set of Grafana dashboards. By selecting a specific PR namespace from the $environment dashboard variable, all Loki logs, Mimir metrics, and Tempo traces are filtered to show only data for that ephemeral stack.
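
In practice this tagging is just two environment variables on each preview service; a sketch (the service name is illustrative, and the PR-specific namespace value is substituted at deploy time):

yaml
services:
  api:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://obs_otel_collector:4318"
      OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=preview,service.namespace=busflow-pr-123"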

Runbooks & Dashboards

Core Dashboards

| Dashboard | Scope | Key Panels |
| --- | --- | --- |
| Service Overview | All services | Request rate, error rate, p50/p95/p99 latency per service |
| Per-Tenant Health | Individual tenant | Request volume, error breakdown, active sessions |
| LLM Cost Tracker | AI services | llm_tokens_total by model and tenant, daily/weekly cost projections |
| Infrastructure | Docker Swarm | Node CPU/memory, container restarts, Swarm service health |
| Booking Funnel | B2C commerce | Conversion steps, drop-off points, payment success rate |
| Deploy Timeline | CI/CD | Deploy markers overlaid on error rate + latency panels |

Alerting Runbooks

All alerts are defined as code in the Grafana provisioning directory. Each alert has a runbook link in its annotation:

  • High Error Rate (>1% 5xx for 5 min): Check Tempo for failing traces → identify root service → check recent deploys.
  • Latency SLO Breach (p99 > 500ms for 10 min): Check Mimir for resource saturation → scale Swarm replicas or investigate slow queries in Loki.
  • PII Detection Alert: Loki uses a LogQL pattern match → identify offending service → hotfix PII leak → verify with Semgrep.
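
For reference, a hedged sketch of the first runbook's alert as code (the exact metric and label names depend on how the collector translates the OTel http.server.duration histogram into Prometheus series, so treat them as assumptions to verify against the real series names):

yaml
groups:
  - name: service-slo
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_server_duration_count{http_status_code=~"5.."}[5m]))
            /
          sum by (service) (rate(http_server_duration_count[5m])) > 0.01
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "5xx error rate above 1% for {{ $labels.service }}"
          runbook_url: "https://docs.busflow.app/protocols/observability-alerts"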

SLI / SLO Definitions

Service Level Indicators (SLIs) and Objectives (SLOs) for core services. Monitored via Mimir metrics and Grafana alerting.

| Service | SLI | SLO Target | Measurement |
| --- | --- | --- | --- |
| API (NestJS) | Request latency (p99) | < 300ms | OTel histogram http.server.duration |
| API (NestJS) | Error rate (5xx) | < 0.5% | OTel counter http.server.response.status_code |
| Booking Flow | End-to-end success rate | > 99.5% | Custom counter busflow.booking.completed / busflow.booking.started |
| Hasura (GraphQL) | Query latency (p95) | < 200ms | Hasura http-log in Loki |
| PDF Parsing | PDF parse success rate | > 95% | Custom counter busflow.pdf_parser.success / busflow.pdf_parser.total |
| Frontend (Faro) | LCP (Largest Contentful Paint) | < 2.5s | Faro Core Web Vitals |
| Frontend (Faro) | Unhandled JS exceptions | < 0.1% of sessions | Faro error events / session count |

Error Budget Policy: When a service exhausts its monthly error budget (calculated from SLO), feature development pauses and the team prioritizes reliability work until the budget recovers.
