Busflow Docs


Infrastructure & Architecture

Deployment Architecture

1. Infrastructure Layer (Terraform)

  • Compute (MVP): 2× Hetzner Cloud VPS – 1 Swarm Manager + 1 Worker – for HA and self-healing. Accepted MVP Risk: A single manager node is a control-plane SPOF. Swarm workloads self-heal on the worker, but cluster management (deploy, scale, inspect) is unavailable if the manager goes down.
  • Compute (Quorum Expansion – tenant count > 50 OR SLA > 99.5%): grow the swarm Terraform module to a 3-node Raft quorum + dedicated stateful_worker + dedicated observability_worker. See ADR-023. Key contract points (HCL sketches follow this list):
    • var.manager_count validation: odd (manager_count % 2 == 1) and pinned to fsn1 (contains(["fsn1"], var.hetzner_location)) to keep Raft consensus latency low.
    • hcloud_server.manager[0] initialises the swarm; null_resource.manager_join_token retrieves the manager join token over SSH; null_resource.swarm_join_manager iterates manager_count - 1 joining managers 2 and 3 using the manager token + manager[0]'s private IP (not public) on port 2377.
    • user_data is split: a shared local.docker_install_script value is used by every node; only manager[0] appends docker swarm init. Workers and additional managers use the install script alone. The null_resource join provisioners remain the exclusive mechanism for Swarm membership.
    • SSH fingerprint policy: host keys are unique per manager (tls_private_key.host_key_manager[count.index]); all fingerprints are published as a Terraform output manager_host_keys = { manager-1 = "…", … }. CI accepts any published fingerprint via a TOFU window. StrictHostKeyChecking=no is removed from every provisioner once fingerprint management is in place – see ADR-025.
    • Stateful worker: a separate hcloud_server.stateful_worker (count = 1, label tier=stateful) holds the persistent Hetzner Cloud Volume for containerized Postgres / Redis (sketched after this list). Managers never carry stateful services. Pre-launch requirement: deploy a Docker volume plugin (e.g. costela/docker-volume-hetzner) so Swarm can automate volume detach/re-attach on node failure – Swarm does not use Kubernetes CSI. Without the plugin, volume recovery requires manual Hetzner API or Terraform intervention.
    • Observability worker: a separate hcloud_server.observability_worker (count = 1, label tier=observability, cpx21+) holds the LGTM stack. The 2-node Swarm cannot absorb 7+ observability containers alongside the application tier; this node parallels the tier=stateful pattern.
    • Split-brain & runbooks: manager join provisioners set on_failure = fail and tail journalctl -u docker for triage; shrinking the quorum requires a manual docker node demote + docker swarm leave --force before re-applying Terraform with a smaller manager_count.
  • Edge Routing: CDN (Cloudflare) → Hetzner Managed Load Balancer → Swarm nodes. At quorum expansion time the Managed LB replaces any DNS round-robin: hcloud_load_balancer.ingress with services 80/443, health-check HTTP /healthz (15 s interval, 10 s timeout, 3 retries), and label_selector = "busflow-swarm=true" so new managers auto-attach (sketched after this list). See ADR-024.
  • Database (MVP): Containerized PostgreSQL (with pgvector) inside the Docker Swarm, backed by a Hetzner Cloud Volume for persistence. Nightly pg_dump backups to the in-Swarm Minio instance. See §6 Backup & Disaster Recovery.
  • Database (Production): Ubicloud Managed PostgreSQL (standard-2, HA async) on Hetzner bare metal in Falkenstein (eu-central-h1). Managed via the ubicloud/ubicloud Terraform provider. Automated PITR, 7-day backup retention.
  • Storage (MVP): Minio (S3-compatible) inside the Docker Swarm for file uploads, images, and PDFs.
  • Storage (Production): Hetzner Object Storage (S3-compatible, €6.49/mo for 1TB storage + 1TB egress). Nhost Storage connects directly via S3 API. Managed via the aminueza/minio Terraform provider.
  • Storage (Local Dev): Minio container in docker-compose.local.yml.
  • Terraform State Management (MVP): Local backend, state file committed to the repository. Phase 2: Migrate to Hetzner Object Storage (S3-compatible backend) for remote state with CI/CD support (see the backend sketch after this list).
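
A minimal HCL sketch of the contract points above, using the resource addresses named in this section. server_type, image, var.ssh_key_path, var.swarm_subnet_id, local.docker_install_script, and the choice of local-exec (rather than remote-exec) provisioners for the join are illustrative assumptions, not settled values.

hcl
variable "manager_count" {
  type    = number
  default = 3
  validation {
    condition     = var.manager_count % 2 == 1
    error_message = "manager_count must be odd to preserve a Raft quorum."
  }
}

variable "hetzner_location" {
  type    = string
  default = "fsn1"
  validation {
    condition     = contains(["fsn1"], var.hetzner_location)
    error_message = "Managers are pinned to fsn1 to keep Raft consensus latency low."
  }
}

resource "hcloud_server" "manager" {
  count       = var.manager_count
  name        = "manager-${count.index + 1}"
  server_type = "cpx21"        # assumption - final sizing needs load profiling
  image       = "ubuntu-24.04" # assumption
  location    = var.hetzner_location
  labels      = { "busflow-swarm" = "true" }
  # Shared install script on every node; only manager[0] appends `docker swarm init`.
  user_data = count.index == 0 ? "${local.docker_install_script}\ndocker swarm init" : local.docker_install_script
}

resource "hcloud_server_network" "manager" {
  count     = var.manager_count
  server_id = hcloud_server.manager[count.index].id
  subnet_id = var.swarm_subnet_id # assumption - exposed by the network module
}

# Fetch the manager join token from manager[0] over SSH at apply time.
resource "null_resource" "manager_join_token" {
  depends_on = [hcloud_server.manager]
  provisioner "local-exec" {
    command = "ssh -i ${var.ssh_key_path} root@${hcloud_server.manager[0].ipv4_address} 'docker swarm join-token -q manager' > ${path.module}/.manager_token"
  }
}

# Join managers 2..n against manager[0]'s private IP on port 2377.
resource "null_resource" "swarm_join_manager" {
  count      = var.manager_count - 1
  depends_on = [null_resource.manager_join_token, hcloud_server_network.manager]
  provisioner "local-exec" {
    command = <<-EOT
      TOKEN=$(cat ${path.module}/.manager_token)
      ssh -i ${var.ssh_key_path} root@${hcloud_server.manager[count.index + 1].ipv4_address} \
        "docker swarm join --token $TOKEN ${hcloud_server_network.manager[0].ip}:2377"
    EOT
  }
}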
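
A similar sketch for the tier=stateful node and its persistent volume. The server type and volume size are placeholders, and deploying the docker-volume-hetzner plugin itself happens at stack-deploy time, not in Terraform.

hcl
resource "hcloud_server" "stateful_worker" {
  count       = 1
  name        = "stateful-worker-1"
  server_type = "cpx31"        # placeholder - size via load profiling
  image       = "ubuntu-24.04" # placeholder
  location    = var.hetzner_location
  labels      = { tier = "stateful", "busflow-swarm" = "true" }
  user_data   = local.docker_install_script
}

resource "hcloud_volume" "postgres_data" {
  name     = "postgres-data"
  size     = 50 # GB, placeholder
  location = var.hetzner_location
  format   = "ext4"
}

resource "hcloud_volume_attachment" "postgres_data" {
  volume_id = hcloud_volume.postgres_data.id
  server_id = hcloud_server.stateful_worker[0].id
  automount = true
}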
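
The quorum-time ingress Load Balancer, expressed against the hcloud provider with the health-check values quoted above. The load_balancer_type and the decision to probe /healthz on the plain-HTTP entrypoint are assumptions to confirm in ADR-024.

hcl
resource "hcloud_load_balancer" "ingress" {
  name               = "busflow-ingress"
  load_balancer_type = "lb11" # assumption
  location           = var.hetzner_location
}

# New nodes auto-attach through the shared label; private-IP targeting assumes
# the LB is also attached to the Swarm's private network (not shown here).
resource "hcloud_load_balancer_target" "swarm_nodes" {
  type             = "label_selector"
  load_balancer_id = hcloud_load_balancer.ingress.id
  label_selector   = "busflow-swarm=true"
  use_private_ip   = true
}

resource "hcloud_load_balancer_service" "edge" {
  for_each         = { http = 80, https = 443 }
  load_balancer_id = hcloud_load_balancer.ingress.id
  protocol         = "tcp" # TLS terminates at Traefik, not at the LB
  listen_port      = each.value
  destination_port = each.value

  health_check {
    protocol = "http"
    port     = 80 # assumption - Traefik answers /healthz on its HTTP entrypoint
    interval = 15
    timeout  = 10
    retries  = 3
    http {
      path = "/healthz"
    }
  }
}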
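
Finally, a sketch of the Phase-2 remote state backend on Hetzner Object Storage. The bucket name and endpoint are placeholders, and the skip_* flags are the usual ones required when the S3 backend talks to a non-AWS, S3-compatible API.

hcl
terraform {
  backend "s3" {
    bucket = "busflow-terraform-state" # placeholder
    key    = "production/terraform.tfstate"
    region = "eu-central"              # ignored by Hetzner, required by the backend
    endpoints = {
      s3 = "https://fsn1.your-objectstorage.com" # placeholder endpoint
    }
    skip_credentials_validation = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
    skip_metadata_api_check     = true
    skip_s3_checksum            = true
    use_path_style              = true
  }
}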

2. Application Layer (Docker Swarm)

  • Traefik: Ingress proxy, automated SSL, long-read timeouts for AI streaming, request routing.
  • Frontend (Vue.js): Static files served by replicated Nginx containers.
  • Hasura: GraphQL engine for CRUD, RBAC, and real-time subscriptions (WebSockets).
  • Nest.js API: Business logic, custom mutations, external integrations, SSE for AI chat streams.
  • n8n: Visual automation engine for external API integrations and agent pipelines. See workflow-orchestration.md.
  • Nhost Auth/Storage: JWT generation, OAuth, file uploads (S3 bridge).

3. AI & Background Processing

  • Redis: In-Swarm message broker for BullMQ job queue.
  • Nest.js Workers: Same codebase with the HTTP server disabled. They listen to Redis for heavy AI tasks, PDF generation, and event triggers.

4. External Integrations

  • LLM Providers: OpenAI, Anthropic (external APIs).
  • Messaging: AWS SES (email), AWS SNS (SMS), Meta Cloud API (WhatsApp Business). See communications.md §CPaaS for the full provider strategy.

5. Observability & CI/CD

  • Monitoring: Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) + OpenTelemetry Collector + Faro Web SDK + prometheus-postgres-exporter. See observability.md – it is the authoritative source of truth for the stack, cardinality policy, retention, cookie/SSO settings, and alerting runbooks.
  • Deployment: GitHub Actions → isolated PR preview deployments → rolling docker stack deploy on merge.
  • Transitional: Frontend deployments use Vercel during early development. We will retire it once the Swarm-based preview deployment pipeline (Traefik + docker stack deploy) becomes fully operational. We plan no deeper Vercel integration.
  • Deployment DX (Future): Consider Portainer for a Swarm dashboard (visibility without replacing orchestration), GH Actions auto-rollback on failed health checks, and canary deploys via Traefik weighted routing. These additions improve operator confidence without adding a PaaS layer (e.g., Coolify) that duplicates existing Terraform + Swarm + Traefik responsibilities.
  • Container Resource Governance: every service in docker-compose.production.yml MUST declare deploy.resources.limits.memory and deploy.resources.reservations.cpus. Without limits, a runaway container triggers the kernel OOM-killer, which selects the largest memory consumer – potentially Postgres. Actual values require load profiling before setting; do not guess.
  • Docker System Maintenance: a weekly scheduled task prunes unused images, containers, and build cache on each Swarm node (docker system prune --all --force --filter "until=168h"). The mechanism (Swarm global service with an entrypoint cron, or a systemd timer via Terraform user_data) is deferred to implementation time; a user_data sketch follows this list. Grafana alerts on root disk usage (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.20) – distinct from the Hetzner Volume alerts in observability.md §Volume Monitoring.
  • Terraform State Drift Detection: a weekly GitHub Actions cron runs terraform plan -detailed-exitcode against production. Exit code 2 (drift detected) opens a GitHub Issue tagged ops, drift and posts to Slack. The workflow does not auto-apply; an operator reviews the diff. The current terraform.yml only plans on PRs touching terraform/** – manual SSH changes to servers drift silently without this scheduled check.
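
If the systemd-timer option is chosen, a minimal sketch of how it could be delivered through the shared Terraform user_data. The unit names, schedule, and the way the snippet is appended to local.docker_install_script are assumptions, not decisions.

hcl
locals {
  # Appended to the shared install script so every node prunes weekly.
  docker_prune_setup = <<-EOT
    cat > /etc/systemd/system/docker-prune.service <<'UNIT'
    [Unit]
    Description=Prune unused Docker images, containers and build cache
    [Service]
    Type=oneshot
    ExecStart=/usr/bin/docker system prune --all --force --filter "until=168h"
    UNIT
    cat > /etc/systemd/system/docker-prune.timer <<'UNIT'
    [Unit]
    Description=Weekly Docker prune
    [Timer]
    OnCalendar=Sun *-*-* 03:30:00
    Persistent=true
    [Install]
    WantedBy=timers.target
    UNIT
    systemctl daemon-reload
    systemctl enable --now docker-prune.timer
  EOT

  node_user_data = "${local.docker_install_script}\n${local.docker_prune_setup}"
}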

6. Backup & Disaster Recovery

Database Backups (MVP: Containerized Postgres)

  • Strategy: A cron job on the Swarm manager node runs pg_dump --format=custom nightly and uploads the dump to a dedicated backup bucket in the in-Swarm Minio instance via rclone or aws-cli.
  • Retention: 14 daily snapshots + 4 weekly snapshots. Older dumps are pruned automatically.
  • Restore: pg_restore --dbname=<target> from any snapshot into a fresh container.

Database Backups (Production: Ubicloud Managed)

  • Ubicloud provides automated Point-in-Time Recovery (PITR) with 7-day retention.
  • Supplement with a weekly pg_dump to Hetzner Object Storage for off-provider safety.

Automated Backup Verification

See also: backup-verify-runbook.md for operator triage steps.

  • Cadence: A daily GitHub Actions cron job (03:00 UTC) pulls the latest pg_dump and runs a full verify. Weekly is too coarse once Ubicloud PITR exists – the off-site dump is a disaster-recovery safety net and must be proven healthy against every produced artefact.
  • Source parameterisation: An environment: production workflow input resolves to vars.BACKUP_SOURCE – either minio-inswarm (MVP path) or hetzner-s3 (post-cutover). Both rclone remote blocks live in the workflow; only the source switches.
  • Naming convention: dumps must be emitted as pg_dump_YYYYMMDD_HHMM.dump. The producer (currently easypi/postgres-backup-s3; post-cutover, the Ubicloud-triggered job) enforces this via docker/backup/producer.env. Files that do not match the pattern are rejected by the verify workflow rather than consumed.
  • Size envelope: ubuntu-latest has ~14 GB of free disk and a 6 h job timeout. A size guard aborts above a configurable ceiling (default 10 GB) and signals that the verify should be promoted to a self-hosted runner lane. In the common path the verify job streams the dump directly into pg_restore via rclone cat … | pg_restore, skipping a full-disk staging copy.
  • Format drift tolerance (MVP → production): the magic-number check accepts either PGDMP (custom format) or a plain-text header (-- PostgreSQL database dump). Plain-text dumps produce a warning – this nudges the MVP producer (easypi) toward --format=custom without failing the verify.
  • Restore pipeline (split-section): the ephemeral postgres:16-alpine target runs pg_restore --section=pre-data → manual CREATE EXTENSION IF NOT EXISTS pg_cron, pgsodium, pgvector inside the verify DB → --section=data → --section=post-data. This avoids false-reds on extension-privilege errors unrelated to data integrity.
  • Integrity probes:
    • pg_restore exits with code 0.
    • All expected schemas exist (backoffice, commerce, operations, communications).
    • information_schema.tables returns a non-zero row count per schema.
    • No orphaned foreign-key violations against the allow-list at config/backup-verify/soft-fk-allowlist.yaml. The Modulith deliberately uses soft FKs across bounded contexts; the probe reads the allow-list before raising alerts.
  • TLS: insecure_skip_verify = false is explicit in the generated rclone.conf; http:// endpoints are rejected.
  • Timeouts: rclone --timeout 15m --low-level-retries 3; job-level timeout-minutes: 60; fall back to in-Swarm MinIO if the offsite bucket is unreachable for >10 min.
  • Failure routing: Slack webhook errors are classified – 4xx (e.g., a rotated webhook) opens a deduped GitHub issue and sets the workflow status to warning; 5xx/network errors retry 3× with backoff before falling back to the GH-issue path. Issues dedup on a (run-pattern, week) key via a gh issue list lookup so the orphan-FK probe cannot spam issues.
  • Producer liveness: the verify pipeline implicitly covers producer liveness – a missing or zero-byte dump causes the verify job itself to fail. The ≤1 hour gap between dump production (02:00 UTC) and verification (03:00 UTC) is an accepted risk.

Object Storage

  • Hetzner Object Storage replicates data across drives (RAID). No cross-region replication at MVP.

Disaster Recovery Drill

Backup verification (see §Automated Backup Verification above) tests restorability of individual dumps. A DR drill validates the full rebuild path that backup verification alone cannot test: environment variables, Docker registry auth, DNS propagation, and Terraform state completeness.

  • Procedure: provision a parallel Hetzner environment via Terraform, restore the latest backup, deploy the full stack, and run smoke tests against all service health endpoints.
  • Cadence (tied to SLA tier):
    • Pre-SLA (< 50 tenants, no contractual uptime): annual drill.
    • Post-SLA (≥ 50 tenants or contractual ≥ 99.5% uptime): quarterly drill.
    • After any major infrastructure change (Ubicloud cutover, quorum expansion): ad-hoc drill within 2 weeks.
  • Results: logged in docs/protocols/dr-drill-results/YYYY-MM.md with: time-to-recovery, issues found, and remediation actions.
  • First drill: scheduled after the Ubicloud cutover completes.

6.2 Ubicloud PostgreSQL Cutover (Production DB)

Purpose: migrate from the containerized pgvector Postgres on the Swarm manager to managed Ubicloud Postgres on Hetzner bare metal once tenant count > 50 or a 99.5 % SLA becomes contractual.

Terraform module: terraform/modules/postgres-ubicloud

Required variables (variables.tf; sketched in HCL below the list):

  • environment (string) – required
  • ubicloud_project_id (string) – from the Ubicloud console
  • hetzner_location (string) – validation contains(["fsn1"], value); must match the Swarm location for latency + GDPR residency
  • postgres_tier (string) – default "standard-2"
  • postgres_version (number) – default 16
  • ha_mode (string) – "async" | "sync", default "async"
  • pitr_retention_days (number) – default 7
  • allowed_cidrs (list(string)) – fed from module.network.swarm_cidrs (see Finding 6 of the architect loop)
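
The same contract expressed as a variables.tf sketch. Only the names, types, defaults, and the fsn1 validation listed above are asserted here; the ha_mode validation block is an illustrative addition.

hcl
variable "environment"         { type = string }
variable "ubicloud_project_id" { type = string }

variable "hetzner_location" {
  type = string
  validation {
    condition     = contains(["fsn1"], var.hetzner_location)
    error_message = "Postgres must sit in fsn1 next to the Swarm (latency + GDPR residency)."
  }
}

variable "postgres_tier" {
  type    = string
  default = "standard-2"
}

variable "postgres_version" {
  type    = number
  default = 16
}

variable "ha_mode" {
  type    = string
  default = "async"
  validation {
    condition     = contains(["async", "sync"], var.ha_mode)
    error_message = "ha_mode must be \"async\" or \"sync\"."
  }
}

variable "pitr_retention_days" {
  type    = number
  default = 7
}

variable "allowed_cidrs" {
  type = list(string) # fed from module.network.swarm_cidrs
}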

Required resources inside the module (a combined sketch follows the outputs list below):

  • ubicloud_postgres.primary – the HA instance.
  • ubicloud_postgres_firewall_rule.swarm – whitelists each var.allowed_cidrs entry on port 5432.
  • ubicloud_postgres_metric_destination.grafana – ships metrics to the Mimir endpoint defined in observability.md.

Required outputs (fed to Docker Swarm Secrets at deploy time via secrets-sync.yml):

  • connection_uri_writer (sensitive)
  • connection_uri_reader (sensitive)
  • ca_certificate – mounted as a Docker Config, not a Docker Secret (the CA certificate itself is non-sensitive; this lets Mimir scrape over TLS without adding a rotation surface).
  • instance_fqdn
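
A sketch of the resource and output wiring. The three resource addresses and four output names are the ones required above, but every argument and attribute name (location, size, connection_string, hostname, the metric-destination URL) is a placeholder to verify against the ubicloud/ubicloud provider documentation.

hcl
resource "ubicloud_postgres" "primary" {
  project_id = var.ubicloud_project_id
  name       = "busflow-${var.environment}"
  location   = var.hetzner_location # placeholder argument names throughout this block
  size       = var.postgres_tier
  pg_version = var.postgres_version
  ha_type    = var.ha_mode
}

# One whitelist entry per Swarm CIDR on port 5432.
resource "ubicloud_postgres_firewall_rule" "swarm" {
  for_each    = toset(var.allowed_cidrs)
  postgres_id = ubicloud_postgres.primary.id
  cidr        = each.value
}

# Ships instance metrics to the Mimir endpoint defined in observability.md.
resource "ubicloud_postgres_metric_destination" "grafana" {
  postgres_id = ubicloud_postgres.primary.id
  url         = var.mimir_remote_write_url # placeholder variable
}

output "connection_uri_writer" {
  value     = ubicloud_postgres.primary.connection_string # placeholder attribute
  sensitive = true
}

output "connection_uri_reader" {
  value     = ubicloud_postgres.primary.read_replica_connection_string # placeholder attribute
  sensitive = true
}

output "ca_certificate" {
  value = ubicloud_postgres.primary.ca_certificate # placeholder attribute; distributed as a Docker Config
}

output "instance_fqdn" {
  value = ubicloud_postgres.primary.hostname # placeholder attribute
}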

Provider contract – in terraform/environments/production/providers.tf:

hcl
terraform {
  required_providers {
    ubicloud = { source = "ubicloud/ubicloud", version = "~> 0.3" }
  }
}
provider "ubicloud" {
  project_id = var.ubicloud_project_id
}

Cutover runbook: docs/protocols/postgres-cutover.md.

Out of scope for cutover: Redis (BullMQ broker), MinIO/Object Storage, and the Hasura JWT secret are NOT rotated. Only busflow_db_writer and busflow_db_reader flip.

Edge states that MUST be covered in the cutover runbook:

  • Atomic reader/writer flip: both secrets update in a single docker service update --secret-rm … --secret-add … call so readers never briefly point at the old writer and return phantom rows.
  • Hydration failure: if replication lag does not converge within 60 min, abort, retain Swarm DB as authoritative, and re-run from the base-backup step.
  • PITR boundary: Ubicloud PITR only covers time after cutover. Retain the final pre-cutover dump in Hetzner Object Storage for 30 days (not the usual 14) and document the recovery decision-tree in the runbook.
  • Connection retries with jitter: Hasura HASURA_GRAPHQL_READ_REPLICA_URLS and NestJS pg pools set connectionTimeoutMillis=5000, max=20, and wrap every transaction in retry.operation({retries: 3, factor: 2, minTimeout: 200, randomize: true, jitter: 0.1}) to avoid a thundering herd on the first post-cutover second.
  • TLS required: sslmode=verify-full using the ca_certificate output.
  • Terraform state-lock guard: production applies run via workflow_dispatch with concurrency: { group: terraform-prod, cancel-in-progress: false } to prevent concurrent CI pipelines from corrupting the remote state.
  • Rollback criterion: if p99 API latency exceeds 2× baseline for > 10 min within the 72 h window, trigger the rollback runbook that flips the secrets back to the containerized Postgres endpoint.
  • Cost envelope: the 72 h parallel-run is a Finance line item; the ADR carries the pro-rated cost (monthly Ubicloud tier price × 3/30 for the three parallel-run days) and a Finance sign-off checkbox.
  • GDPR residency: the Ubicloud EU-residency SLA clause is cited in gdpr-strategy.md §1 and in ADR-022.

The Modular Monolith

The backend follows a Modular Monolith (Modulith) pattern: isolated modules compiled and deployed as a single application. This combines the clean architecture of microservices with the operational simplicity of a monolith.

How It Works

  • /apps/api – Entry point. Imports microbackends, exposes them via HTTP/GraphQL. Minimal business logic.
  • /packages/<domain> – Each package is a bounded context (billing, auth, etc.) with its own logic, schemas, and services.
  • Single Docker image built from /apps/api + all /packages/*. Deployed as one scaled Swarm service.
  • Evolutionary: If a module needs independent scaling, extract it to a new app (e.g., /apps/pdf-worker).

Key Benefits

  • AI-Optimized: Small, isolated packages = focused context windows for LLMs.
  • Operationally Simple: No distributed monolith anti-patterns, no inter-service networking.
  • Flexible: Leaves the door open for microservices without upfront DevOps tax.

Risks to Manage

  • Boundary Enforcement: Use ESLint rules to forbid cross-package coupling.
  • Database Isolation: Each microbackend manages its own tables. Use Hasura or PostgreSQL for cross-domain joins.
  • Build Optimization: Docker builds use Turborepo prune to isolate each app's dependency tree into a minimal workspace, enabling aggressive Docker layer caching. See monorepo.md for the full multi-stage Dockerfile strategy and CI caching approach.
