Busflow Docs


Infrastructure & Architecture

Deployment Architecture

1. Infrastructure Layer (Terraform)

  • Compute (MVP): 2× Hetzner Cloud VPS – 1 Swarm Manager + 1 Worker – for HA and self-healing. Accepted MVP Risk: A single manager node is a control-plane SPOF. Swarm workloads self-heal on the worker, but cluster management (deploy, scale, inspect) is unavailable if the manager goes down.
  • Compute (Quorum Expansion – tenant count > 50 OR SLA > 99.5%): grow the swarm Terraform module to a 3-node Raft quorum + dedicated stateful_worker + dedicated observability_worker. See ADR-023. Key contract points (HCL sketches follow this list):
    • var.manager_count validation: odd (manager_count % 2 == 1) and pinned to fsn1 (contains(["fsn1"], var.hetzner_location)) to keep Raft consensus latency low.
    • hcloud_server.manager[0] initialises the swarm; null_resource.manager_join_token retrieves the manager join token over SSH; null_resource.swarm_join_manager iterates manager_count - 1 joining managers 2 and 3 using the manager token + manager[0]'s private IP (not public) on port 2377.
    • user_data is split: a shared local.docker_install_script value is used by every node; only manager[0] appends docker swarm init. Workers and additional managers use the install script alone. The null_resource join provisioners remain the exclusive mechanism for Swarm membership.
    • SSH fingerprint policy: host keys are unique per manager (tls_private_key.host_key_manager[count.index]); all fingerprints are published as a Terraform output manager_host_keys = { manager-1 = "…", … }. CI accepts any published fingerprint via a TOFU window. StrictHostKeyChecking=no is removed from every provisioner once fingerprint management is in place – see ADR-025.
    • Stateful worker: a separate hcloud_server.stateful_worker (count = 1, label tier=stateful) holds the persistent Hetzner Cloud Volume for containerized Postgres / Redis (sketched after this list). Managers never carry stateful services. Pre-launch requirement: deploy a Docker volume plugin (e.g. costela/docker-volume-hetzner) so Swarm can automate volume detach/re-attach on node failure – Swarm does not use Kubernetes CSI. Without the plugin, volume recovery requires manual Hetzner API or Terraform intervention.
    • Observability worker: a separate hcloud_server.observability_worker (count = 1, label tier=observability, cpx21+) holds the LGTM stack. The 2-node Swarm cannot absorb 7+ observability containers alongside the application tier; this node parallels the tier=stateful pattern.
    • Split-brain & runbooks: manager join provisioners set on_failure = fail and tail journalctl -u docker for triage; shrinking the quorum requires a manual docker node demote + docker swarm leave --force before re-applying Terraform with a smaller manager_count.
  • Edge Routing: CDN (Cloudflare) → Hetzner Managed Load Balancer → Swarm nodes. At quorum expansion time the Managed LB replaces any DNS round-robin: hcloud_load_balancer.ingress with services 80/443, health-check HTTP /healthz (15 s interval, 10 s timeout, 3 retries), and label_selector = "busflow-swarm=true" so new managers auto-attach (sketched after this list). See ADR-024.
  • Database (MVP): Containerized PostgreSQL (with pgvector) inside the Docker Swarm, backed by a Hetzner Cloud Volume for persistence. Nightly pg_dump backups to the in-Swarm Minio instance. See §6 Backup & Disaster Recovery.
  • Database (Production): Ubicloud Managed PostgreSQL (standard-2, HA async) on Hetzner bare metal in Falkenstein (eu-central-h1). Managed via the ubicloud/ubicloud Terraform provider. Automated PITR, 7-day backup retention.
  • Storage (MVP): Minio (S3-compatible) inside the Docker Swarm for file uploads, images, and PDFs.
  • Storage (Production): Hetzner Object Storage (S3-compatible, €6.49/mo for 1TB storage + 1TB egress). Nhost Storage connects directly via S3 API. Managed via the aminueza/minio Terraform provider.
  • Storage (Local Dev): Minio container in docker-compose.local.yml.
  • Terraform State Management (MVP): Local backend, state file committed to the repository. Phase 2: Migrate to Hetzner Object Storage (S3-compatible backend) for remote state with CI/CD support (see the backend sketch after this list).
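
A minimal HCL sketch of the contract points above, using the resource addresses named in this section. server_type, image, var.ssh_key_path, var.swarm_subnet_id, local.docker_install_script, and the choice of local-exec (rather than remote-exec) provisioners for the join are illustrative assumptions, not settled values.

hcl
variable "manager_count" {
  type    = number
  default = 3
  validation {
    condition     = var.manager_count % 2 == 1
    error_message = "manager_count must be odd to preserve a Raft quorum."
  }
}

variable "hetzner_location" {
  type    = string
  default = "fsn1"
  validation {
    condition     = contains(["fsn1"], var.hetzner_location)
    error_message = "Managers are pinned to fsn1 to keep Raft consensus latency low."
  }
}

resource "hcloud_server" "manager" {
  count       = var.manager_count
  name        = "manager-${count.index + 1}"
  server_type = "cpx21"        # assumption - final sizing needs load profiling
  image       = "ubuntu-24.04" # assumption
  location    = var.hetzner_location
  labels      = { "busflow-swarm" = "true" }
  # Shared install script on every node; only manager[0] appends `docker swarm init`.
  user_data = count.index == 0 ? "${local.docker_install_script}\ndocker swarm init" : local.docker_install_script
}

resource "hcloud_server_network" "manager" {
  count     = var.manager_count
  server_id = hcloud_server.manager[count.index].id
  subnet_id = var.swarm_subnet_id # assumption - exposed by the network module
}

# Fetch the manager join token from manager[0] over SSH at apply time.
resource "null_resource" "manager_join_token" {
  depends_on = [hcloud_server.manager]
  provisioner "local-exec" {
    command = "ssh -i ${var.ssh_key_path} root@${hcloud_server.manager[0].ipv4_address} 'docker swarm join-token -q manager' > ${path.module}/.manager_token"
  }
}

# Join managers 2..n against manager[0]'s private IP on port 2377.
resource "null_resource" "swarm_join_manager" {
  count      = var.manager_count - 1
  depends_on = [null_resource.manager_join_token, hcloud_server_network.manager]
  provisioner "local-exec" {
    command = <<-EOT
      TOKEN=$(cat ${path.module}/.manager_token)
      ssh -i ${var.ssh_key_path} root@${hcloud_server.manager[count.index + 1].ipv4_address} \
        "docker swarm join --token $TOKEN ${hcloud_server_network.manager[0].ip}:2377"
    EOT
  }
}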
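
A similar sketch for the tier=stateful node and its persistent volume. The server type and volume size are placeholders, and deploying the docker-volume-hetzner plugin itself happens at stack-deploy time, not in Terraform.

hcl
resource "hcloud_server" "stateful_worker" {
  count       = 1
  name        = "stateful-worker-1"
  server_type = "cpx31"        # placeholder - size via load profiling
  image       = "ubuntu-24.04" # placeholder
  location    = var.hetzner_location
  labels      = { tier = "stateful", "busflow-swarm" = "true" }
  user_data   = local.docker_install_script
}

resource "hcloud_volume" "postgres_data" {
  name     = "postgres-data"
  size     = 50 # GB, placeholder
  location = var.hetzner_location
  format   = "ext4"
}

resource "hcloud_volume_attachment" "postgres_data" {
  volume_id = hcloud_volume.postgres_data.id
  server_id = hcloud_server.stateful_worker[0].id
  automount = true
}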
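
The quorum-time ingress Load Balancer, expressed against the hcloud provider with the health-check values quoted above. The load_balancer_type and the decision to probe /healthz on the plain-HTTP entrypoint are assumptions to confirm in ADR-024.

hcl
resource "hcloud_load_balancer" "ingress" {
  name               = "busflow-ingress"
  load_balancer_type = "lb11" # assumption
  location           = var.hetzner_location
}

# New nodes auto-attach through the shared label; private-IP targeting assumes
# the LB is also attached to the Swarm's private network (not shown here).
resource "hcloud_load_balancer_target" "swarm_nodes" {
  type             = "label_selector"
  load_balancer_id = hcloud_load_balancer.ingress.id
  label_selector   = "busflow-swarm=true"
  use_private_ip   = true
}

resource "hcloud_load_balancer_service" "edge" {
  for_each         = { http = 80, https = 443 }
  load_balancer_id = hcloud_load_balancer.ingress.id
  protocol         = "tcp" # TLS terminates at Traefik, not at the LB
  listen_port      = each.value
  destination_port = each.value

  health_check {
    protocol = "http"
    port     = 80 # assumption - Traefik answers /healthz on its HTTP entrypoint
    interval = 15
    timeout  = 10
    retries  = 3
    http {
      path = "/healthz"
    }
  }
}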
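
Finally, a sketch of the Phase-2 remote state backend on Hetzner Object Storage. The bucket name and endpoint are placeholders, and the skip_* flags are the usual ones required when the S3 backend talks to a non-AWS, S3-compatible API.

hcl
terraform {
  backend "s3" {
    bucket = "busflow-terraform-state" # placeholder
    key    = "production/terraform.tfstate"
    region = "eu-central"              # ignored by Hetzner, required by the backend
    endpoints = {
      s3 = "https://fsn1.your-objectstorage.com" # placeholder endpoint
    }
    skip_credentials_validation = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
    skip_metadata_api_check     = true
    skip_s3_checksum            = true
    use_path_style              = true
  }
}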

2. Application Layer (Docker Swarm)

  • Traefik: Ingress proxy, automated SSL, long-read timeouts for AI streaming, request routing.
  • Frontend (Vue.js): Static files served by replicated Nginx containers.
  • Hasura: GraphQL engine for CRUD, RBAC, and real-time subscriptions (WebSockets).
  • Nest.js API: Business logic, custom mutations, external integrations, SSE for AI chat streams.
  • n8n: Visual automation engine for external API integrations and agent pipelines. See workflow-orchestration.md.
  • Nhost Auth/Storage: JWT generation, OAuth, file uploads (S3 bridge).

3. AI & Background Processing

  • Redis: In-Swarm message broker for BullMQ job queue.
  • Nest.js Workers: Same codebase with the HTTP server disabled. They listen to Redis for heavy AI tasks, PDF generation, and event triggers.

4. External Integrations

  • LLM Providers: OpenAI, Anthropic (external APIs).
  • Messaging: AWS SES (email), AWS SNS (SMS), Meta Cloud API (WhatsApp Business). See communications.md §CPaaS for the full provider strategy.

5. Observability & CI/CD

  • Monitoring: Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) + OpenTelemetry Collector + Faro Web SDK + prometheus-postgres-exporter. See observability.md – it is the authoritative source of truth for the stack, cardinality policy, retention, cookie/SSO settings, and alerting runbooks.
  • Deployment: GitHub Actions → isolated PR preview deployments → rolling docker stack deploy on merge.
  • Transitional: Frontend deployments use Vercel during early development. We will retire it once the Swarm-based preview deployment pipeline (Traefik + docker stack deploy) becomes fully operational. We plan no deeper Vercel integration.
  • Deployment DX (Future): Consider Portainer for a Swarm dashboard (visibility without replacing orchestration), GH Actions auto-rollback on failed health checks, and canary deploys via Traefik weighted routing. These additions improve operator confidence without adding a PaaS layer (e.g., Coolify) that duplicates existing Terraform + Swarm + Traefik responsibilities.
  • Container Resource Governance: every service in docker-compose.production.yml MUST declare deploy.resources.limits.memory and deploy.resources.reservations.cpus. Without limits, a runaway container triggers the kernel OOM-killer, which selects the largest memory consumer – potentially Postgres. Actual values require load profiling before setting; do not guess.
  • Docker System Maintenance: a weekly scheduled task prunes unused images, containers, and build cache on each Swarm node (docker system prune --all --force --filter "until=168h"). The mechanism (Swarm global service with an entrypoint cron, or a systemd timer via Terraform user_data) is deferred to implementation time; a user_data sketch follows this list. Grafana alerts on root disk usage (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.20) – distinct from the Hetzner Volume alerts in observability.md §Volume Monitoring.
  • Terraform State Drift Detection: a weekly GitHub Actions cron runs terraform plan -detailed-exitcode against production. Exit code 2 (drift detected) opens a GitHub Issue tagged ops, drift and posts to Slack. The workflow does not auto-apply; an operator reviews the diff. The current terraform.yml only plans on PRs touching terraform/** – manual SSH changes to servers drift silently without this scheduled check.
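
If the systemd-timer option is chosen, a minimal sketch of how it could be delivered through the shared Terraform user_data. The unit names, schedule, and the way the snippet is appended to local.docker_install_script are assumptions, not decisions.

hcl
locals {
  # Appended to the shared install script so every node prunes weekly.
  docker_prune_setup = <<-EOT
    cat > /etc/systemd/system/docker-prune.service <<'UNIT'
    [Unit]
    Description=Prune unused Docker images, containers and build cache
    [Service]
    Type=oneshot
    ExecStart=/usr/bin/docker system prune --all --force --filter "until=168h"
    UNIT
    cat > /etc/systemd/system/docker-prune.timer <<'UNIT'
    [Unit]
    Description=Weekly Docker prune
    [Timer]
    OnCalendar=Sun *-*-* 03:30:00
    Persistent=true
    [Install]
    WantedBy=timers.target
    UNIT
    systemctl daemon-reload
    systemctl enable --now docker-prune.timer
  EOT

  node_user_data = "${local.docker_install_script}\n${local.docker_prune_setup}"
}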

6. Backup & Disaster Recovery

Database Backups (MVP: Containerized Postgres)

  • Strategy: A cron job on the Swarm manager node runs pg_dump --format=custom nightly and uploads the dump to a dedicated backup bucket in the in-Swarm Minio instance via rclone or aws-cli.
  • Retention: 14 daily snapshots + 4 weekly snapshots. Older dumps are pruned automatically.
  • Restore: pg_restore --dbname=<target> from any snapshot into a fresh container.

Database Backups (Production: Ubicloud Managed)

  • Ubicloud provides automated Point-in-Time Recovery (PITR) with 7-day retention.
  • Supplement with a weekly pg_dump to Hetzner Object Storage for off-provider safety.

Automated Backup Verification

See also: backup-verify-runbook.md for operator triage steps.

  • Cadence: A daily GitHub Actions cron job (03:00 UTC) pulls the latest pg_dump and runs a full verify. Weekly is too coarse once Ubicloud PITR exists – the off-site dump is a disaster-recovery safety net and must be proven healthy against every produced artefact.
  • Source parameterisation: An environment: production workflow input resolves to vars.BACKUP_SOURCE – either minio-inswarm (MVP path) or hetzner-s3 (post-cutover). Both rclone remote blocks live in the workflow; only the source switches.
  • Naming convention: dumps must be emitted as pg_dump_YYYYMMDD_HHMM.dump. The producer (currently easypi/postgres-backup-s3; post-cutover, the Ubicloud-triggered job) enforces this via docker/backup/producer.env. Files that do not match the pattern are rejected by the verify workflow rather than consumed.
  • Size envelope: ubuntu-latest has ~14 GB of free disk and a 6 h job timeout. A size guard aborts above a configurable ceiling (default 10 GB) and signals that the verify should be promoted to a self-hosted runner lane. In the common path the verify job streams the dump directly into pg_restore via rclone cat … | pg_restore, skipping a full-disk staging copy.
  • Format drift tolerance (MVP → production): the magic-number check accepts either PGDMP (custom format) or a plain-text header (-- PostgreSQL database dump). Plain-text dumps produce a warning – this nudges the MVP producer (easypi) toward --format=custom without failing the verify.
  • Restore pipeline (split-section): the ephemeral postgres:16-alpine target runs pg_restore --section=pre-data → manual CREATE EXTENSION IF NOT EXISTS pg_cron, pgsodium, pgvector inside the verify DB → --section=data → --section=post-data. This avoids false-reds on extension-privilege errors unrelated to data integrity.
  • Integrity probes:
    • pg_restore exits with code 0.
    • All expected schemas exist (backoffice, commerce, operations, communications).
    • information_schema.tables returns a non-zero row count per schema.
    • No orphaned foreign-key violations against the allow-list at config/backup-verify/soft-fk-allowlist.yaml. The Modulith deliberately uses soft FKs across bounded contexts; the probe reads the allow-list before raising alerts.
  • TLS: insecure_skip_verify = false is explicit in the generated rclone.conf; http:// endpoints are rejected.
  • Timeouts: rclone --timeout 15m --low-level-retries 3; job-level timeout-minutes: 60; fall back to in-Swarm MinIO if the offsite bucket is unreachable for >10 min.
  • Failure routing: Slack webhook errors are classified – 4xx (e.g., a rotated webhook) opens a deduped GitHub issue and sets the workflow status to warning; 5xx/network errors retry 3× with backoff before falling back to the GH-issue path. Issues dedup on a (run-pattern, week) key via a gh issue list lookup so the orphan-FK probe cannot spam issues.
  • Producer liveness: the verify pipeline implicitly covers producer liveness – a missing or zero-byte dump causes the verify job itself to fail. The ≤1 hour gap between dump production (02:00 UTC) and verification (03:00 UTC) is an accepted risk.

Object Storage

  • Hetzner Object Storage replicates data across drives (RAID). No cross-region replication at MVP.

Disaster Recovery Drill

Backup verification (see §Automated Backup Verification above) tests restorability of individual dumps. A DR drill validates the full rebuild path that backup verification alone cannot test: environment variables, Docker registry auth, DNS propagation, and Terraform state completeness.

  • Procedure: provision a parallel Hetzner environment via Terraform, restore the latest backup, deploy the full stack, and run smoke tests against all service health endpoints.
  • Cadence (tied to SLA tier):
    • Pre-SLA (< 50 tenants, no contractual uptime): annual drill.
    • Post-SLA (≥ 50 tenants or contractual ≥ 99.5% uptime): quarterly drill.
    • After any major infrastructure change (Ubicloud cutover, quorum expansion): ad-hoc drill within 2 weeks.
  • Results: logged in docs/protocols/dr-drill-results/YYYY-MM.md with: time-to-recovery, issues found, and remediation actions.
  • First drill: scheduled after the Ubicloud cutover completes.

6.2 Ubicloud PostgreSQL Cutover (Production DB)

Purpose: migrate from the containerized pgvector Postgres on the Swarm manager to managed Ubicloud Postgres on Hetzner bare metal once tenant count > 50 or a 99.5 % SLA becomes contractual.

Terraform module: terraform/modules/postgres-ubicloud

Required variables (variables.tf; sketched in HCL below the list):

  • environment (string) – required
  • ubicloud_project_id (string) – from the Ubicloud console
  • hetzner_location (string) – validation contains(["fsn1"], value); must match the Swarm location for latency + GDPR residency
  • postgres_tier (string) – default "standard-2"
  • postgres_version (number) – default 16
  • ha_mode (string) – "async" | "sync", default "async"
  • pitr_retention_days (number) – default 7
  • allowed_cidrs (list(string)) – fed from module.network.swarm_cidrs (see Finding 6 of the architect loop)
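
The same contract expressed as a variables.tf sketch. Only the names, types, defaults, and the fsn1 validation listed above are asserted here; the ha_mode validation block is an illustrative addition.

hcl
variable "environment"         { type = string }
variable "ubicloud_project_id" { type = string }

variable "hetzner_location" {
  type = string
  validation {
    condition     = contains(["fsn1"], var.hetzner_location)
    error_message = "Postgres must sit in fsn1 next to the Swarm (latency + GDPR residency)."
  }
}

variable "postgres_tier" {
  type    = string
  default = "standard-2"
}

variable "postgres_version" {
  type    = number
  default = 16
}

variable "ha_mode" {
  type    = string
  default = "async"
  validation {
    condition     = contains(["async", "sync"], var.ha_mode)
    error_message = "ha_mode must be \"async\" or \"sync\"."
  }
}

variable "pitr_retention_days" {
  type    = number
  default = 7
}

variable "allowed_cidrs" {
  type = list(string) # fed from module.network.swarm_cidrs
}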

Required resources inside the module (a combined sketch follows the outputs list below):

  • ubicloud_postgres.primary – the HA instance.
  • ubicloud_postgres_firewall_rule.swarm – whitelists each var.allowed_cidrs entry on port 5432.
  • ubicloud_postgres_metric_destination.grafana – ships metrics to the Mimir endpoint defined in observability.md.

Required outputs (fed to Docker Swarm Secrets at deploy time via secrets-sync.yml):

  • connection_uri_writer (sensitive)
  • connection_uri_reader (sensitive)
  • ca_certificate – mounted as a Docker Config, not a Docker Secret (the CA certificate itself is non-sensitive; this lets Mimir scrape over TLS without adding a rotation surface).
  • instance_fqdn
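
A sketch of the resource and output wiring. The three resource addresses and four output names are the ones required above, but every argument and attribute name (location, size, connection_string, hostname, the metric-destination URL) is a placeholder to verify against the ubicloud/ubicloud provider documentation.

hcl
resource "ubicloud_postgres" "primary" {
  project_id = var.ubicloud_project_id
  name       = "busflow-${var.environment}"
  location   = var.hetzner_location # placeholder argument names throughout this block
  size       = var.postgres_tier
  pg_version = var.postgres_version
  ha_type    = var.ha_mode
}

# One whitelist entry per Swarm CIDR on port 5432.
resource "ubicloud_postgres_firewall_rule" "swarm" {
  for_each    = toset(var.allowed_cidrs)
  postgres_id = ubicloud_postgres.primary.id
  cidr        = each.value
}

# Ships instance metrics to the Mimir endpoint defined in observability.md.
resource "ubicloud_postgres_metric_destination" "grafana" {
  postgres_id = ubicloud_postgres.primary.id
  url         = var.mimir_remote_write_url # placeholder variable
}

output "connection_uri_writer" {
  value     = ubicloud_postgres.primary.connection_string # placeholder attribute
  sensitive = true
}

output "connection_uri_reader" {
  value     = ubicloud_postgres.primary.read_replica_connection_string # placeholder attribute
  sensitive = true
}

output "ca_certificate" {
  value = ubicloud_postgres.primary.ca_certificate # placeholder attribute; distributed as a Docker Config
}

output "instance_fqdn" {
  value = ubicloud_postgres.primary.hostname # placeholder attribute
}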

Provider contract – in terraform/environments/production/providers.tf:

hcl
terraform {
  required_providers {
    ubicloud = { source = "ubicloud/ubicloud", version = "~> 0.3" }
  }
}
provider "ubicloud" {
  project_id = var.ubicloud_project_id
}

Cutover runbook: docs/protocols/postgres-cutover.md.

Out of scope for cutover: Redis (BullMQ broker), MinIO/Object Storage, and the Hasura JWT secret are NOT rotated. Only busflow_db_writer and busflow_db_reader flip.

Edge states that MUST be covered in the cutover runbook:

  • Atomic reader/writer flip: both secrets update in a single docker service update --secret-rm … --secret-add … call so readers never briefly point at the old writer and return phantom rows.
  • Hydration failure: if replication lag does not converge within 60 min, abort, retain Swarm DB as authoritative, and re-run from the base-backup step.
  • PITR boundary: Ubicloud PITR only covers time after cutover. Retain the final pre-cutover dump in Hetzner Object Storage for 30 days (not the usual 14) and document the recovery decision-tree in the runbook.
  • Connection retries with jitter: Hasura HASURA_GRAPHQL_READ_REPLICA_URLS and NestJS pg pools set connectionTimeoutMillis=5000, max=20, and wrap every transaction in retry.operation({retries: 3, factor: 2, minTimeout: 200, randomize: true, jitter: 0.1}) to avoid a thundering herd on the first post-cutover second.
  • TLS required: sslmode=verify-full using the ca_certificate output.
  • Terraform state-lock guard: production applies run via workflow_dispatch with concurrency: { group: terraform-prod, cancel-in-progress: false } to prevent concurrent CI pipelines from corrupting the remote state.
  • Rollback criterion: if p99 API latency exceeds 2× baseline for > 10 min within the 72 h window, trigger the rollback runbook that flips the secrets back to the containerized Postgres endpoint.
  • Cost envelope: the 72 h parallel-run is a Finance line item; the ADR carries the pro-rated cost (monthly Ubicloud tier price × 3/30 for the three parallel-run days) and a Finance sign-off checkbox.
  • GDPR residency: the Ubicloud EU-residency SLA clause is cited in gdpr-strategy.md §1 and in ADR-022.

The Modular Monolith

The backend follows a Modular Monolith (Modulith) pattern: isolated modules compiled and deployed as a single application. This combines the clean architecture of microservices with the operational simplicity of a monolith.

How It Works

  • /apps/api – Entry point. Imports microbackends, exposes them via HTTP/GraphQL. Minimal business logic.
  • /packages/<domain> – Each package is a bounded context (billing, auth, etc.) with its own logic, schemas, and services.
  • Single Docker image built from /apps/api + all /packages/*. Deployed as one scaled Swarm service.
  • Evolutionary: If a module needs independent scaling, extract it to a new app (e.g., /apps/pdf-worker).

Key Benefits

  • AI-Optimized: Small, isolated packages = focused context windows for LLMs.
  • Operationally Simple: No distributed monolith anti-patterns, no inter-service networking.
  • Flexible: Leaves the door open for microservices without upfront DevOps tax.

Risks to Manage

  • Boundary Enforcement: Use ESLint rules to forbid cross-package coupling.
  • Database Isolation: Each microbackend manages its own tables. Use Hasura or PostgreSQL for cross-domain joins.
  • Build Optimization: Docker builds use Turborepo prune to isolate each app's dependency tree into a minimal workspace, enabling aggressive Docker layer caching. See monorepo.md for the full multi-stage Dockerfile strategy and CI caching approach.
