Infrastructure & Architecture

Deployment Architecture

1. Infrastructure Layer (Terraform)
- Compute (MVP): 2× Hetzner Cloud VPS (1 Swarm Manager + 1 Worker) for HA and self-healing. Accepted MVP Risk: a single manager node is a control-plane SPOF. Swarm workloads self-heal on the worker, but cluster management (deploy, scale, inspect) is unavailable if the manager goes down.
- Compute (Quorum Expansion, tenant count > 50 OR SLA > 99.5%): grow the `swarm` Terraform module to a 3-node Raft quorum + dedicated `stateful_worker` + dedicated `observability_worker`. See ADR-023. Key contract points (a Terraform sketch of the join mechanism follows this list):
  - `var.manager_count` validation: odd (`manager_count % 2 == 1`) and pinned to `fsn1` (`contains(["fsn1"], var.hetzner_location)`) to preserve Raft consensus latency.
  - `hcloud_server.manager[0]` initialises the swarm; `null_resource.manager_join_token` retrieves the manager join token over SSH; `null_resource.swarm_join_manager` iterates `manager_count - 1` times, joining managers 2 and 3 using the manager token + manager[0]'s private IP (not the public one) on port 2377.
  - `user_data` is split: a shared `local.docker_install_script` local is used by every node; only `manager[0]` appends `docker swarm init`. Workers and additional managers use the install script alone. The `null_resource` join provisioners remain the exclusive mechanism for Swarm membership.
- SSH fingerprint policy: host keys are unique per manager (`tls_private_key.host_key_manager[count.index]`); all fingerprints are published as a Terraform output `manager_host_keys = { manager-1 = "…", … }`. CI accepts any published fingerprint via a TOFU window. `StrictHostKeyChecking=no` is removed from every provisioner once fingerprint management is in place; see ADR-025.
- Stateful worker: a separate `hcloud_server.stateful_worker` (count = 1, label `tier=stateful`) holds the persistent Hetzner Cloud Volume for containerized Postgres / Redis. Managers never carry stateful services. Pre-launch requirement: deploy a Docker volume plugin (e.g. `costela/docker-volume-hetzner`) so Swarm can automate volume detach/re-attach on node failure; Swarm does not use Kubernetes CSI. Without the plugin, volume recovery requires manual Hetzner API or Terraform intervention.
- Observability worker: a separate `hcloud_server.observability_worker` (count = 1, label `tier=observability`, `cpx21` or larger) holds the LGTM stack. The 2-node Swarm cannot absorb 7+ observability containers alongside the application tier; this node parallels the `tier=stateful` pattern.
- Split-brain & runbooks: manager join failures run with `on_failure = fail` plus a `journalctl -u docker` tail for triage; shrinking the quorum requires a manual `docker node demote` + `docker swarm leave --force` before re-applying Terraform with a smaller `manager_count`.
- Edge Routing: CDN (Cloudflare) → Hetzner Managed Load Balancer → Swarm nodes. At quorum expansion time the Managed LB replaces any DNS round-robin: `hcloud_load_balancer.ingress` with services 80/443, health check `HTTP /healthz` (15 s interval, 10 s timeout, 3 retries), and `label_selector = "busflow-swarm=true"` so new managers auto-attach. See ADR-024 and the load-balancer sketch after this list.
- Database (MVP): Containerized PostgreSQL (with `pgvector`) inside the Docker Swarm, backed by a Hetzner Cloud Volume for persistence. Nightly `pg_dump` backups to the in-Swarm Minio instance. See §6 Backup & Disaster Recovery.
- Database (Production): Ubicloud Managed PostgreSQL (`standard-2`, HA async) on Hetzner bare metal in Falkenstein (eu-central-h1). Managed via the `ubicloud/ubicloud` Terraform provider. Automated PITR, 7-day backup retention.
- Storage (MVP): Minio (S3-compatible) inside the Docker Swarm for file uploads, images, and PDFs.
- Storage (Production): Hetzner Object Storage (S3-compatible, €6.49/mo for 1 TB storage + 1 TB egress). Nhost Storage connects directly via the S3 API. Managed via the `aminueza/minio` Terraform provider.
- Storage (Local Dev): Minio container in `docker-compose.local.yml`.
- Terraform State Management (MVP): Local backend, state file committed to the repository. Phase 2: migrate to Hetzner Object Storage (S3-compatible backend) for remote state with CI/CD support (a backend sketch follows this list).
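A minimal Terraform sketch of the quorum-expansion join contract described above. Resource and local names follow the contract points; the SSH user, server type, install-script path, and the private-network wiring are assumptions about the rest of the `swarm` module, not the final implementation.

```hcl
# Sketch only: assumes the hcloud provider is configured elsewhere.

variable "manager_count" {
  type    = number
  default = 3
  validation {
    condition     = var.manager_count % 2 == 1
    error_message = "manager_count must be odd to preserve Raft quorum."
  }
}

variable "swarm_network_id" {
  type        = string
  description = "Hypothetical: ID of the private network shared by all Swarm nodes."
}

locals {
  # Shared by every node; only manager[0] appends `docker swarm init`.
  docker_install_script = file("${path.module}/scripts/install-docker.sh") # assumed path
}

resource "hcloud_server" "manager" {
  count       = var.manager_count
  name        = "manager-${count.index + 1}"
  server_type = "cpx21"        # assumption
  image       = "ubuntu-24.04" # assumption
  location    = "fsn1"         # pinned, per the var.hetzner_location validation

  network {
    network_id = var.swarm_network_id
  }

  user_data = count.index == 0 ? join("\n", [
    local.docker_install_script,
    "docker swarm init --advertise-addr $(hostname -I | awk '{print $2}')",
  ]) : local.docker_install_script
}

# Retrieve the manager join token from manager[0] over SSH (host keys are
# verified against the published manager_host_keys output, per ADR-025).
resource "null_resource" "manager_join_token" {
  depends_on = [hcloud_server.manager]

  provisioner "local-exec" {
    command = "ssh root@${hcloud_server.manager[0].ipv4_address} 'docker swarm join-token -q manager' > ${path.module}/.manager-token"
  }
}

# Join managers 2..N against manager[0]'s *private* IP on port 2377.
resource "null_resource" "swarm_join_manager" {
  count      = var.manager_count - 1
  depends_on = [null_resource.manager_join_token]

  connection {
    type = "ssh"
    user = "root" # assumption: key supplied via ssh-agent in CI
    host = hcloud_server.manager[count.index + 1].ipv4_address
  }

  provisioner "file" {
    source      = "${path.module}/.manager-token"
    destination = "/tmp/manager-token"
  }

  provisioner "remote-exec" {
    on_failure = fail
    inline = [
      "docker swarm join --token $(cat /tmp/manager-token) ${one(hcloud_server.manager[0].network).ip}:2377",
    ]
  }
}
```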
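A hedged sketch of the `hcloud_load_balancer.ingress` contract from the Edge Routing bullet (ADR-024). The health-check values and label selector mirror the text; the LB type and the choice to keep TLS termination on Traefik are assumptions.

```hcl
resource "hcloud_load_balancer" "ingress" {
  name               = "busflow-ingress"
  load_balancer_type = "lb11" # assumption
  location           = "fsn1"
}

# Every node labelled busflow-swarm=true is a target, so newly created
# managers and workers attach automatically after quorum expansion.
resource "hcloud_load_balancer_target" "swarm_nodes" {
  type             = "label_selector"
  load_balancer_id = hcloud_load_balancer.ingress.id
  label_selector   = "busflow-swarm=true"
  use_private_ip   = true
}

# Port 443 service; an equivalent service for port 80 is omitted for brevity.
resource "hcloud_load_balancer_service" "https" {
  load_balancer_id = hcloud_load_balancer.ingress.id
  protocol         = "tcp" # TLS terminates at Traefik on the nodes (assumption)
  listen_port      = 443
  destination_port = 443

  health_check {
    protocol = "http"
    port     = 80
    interval = 15
    timeout  = 10
    retries  = 3
    http {
      path = "/healthz"
    }
  }
}
```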
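And a sketch of the Phase-2 remote-state backend on Hetzner Object Storage. The bucket name and endpoint are placeholders; the `skip_*` flags are the usual set for S3-compatible, non-AWS endpoints, and credentials are expected via the standard `AWS_*` environment variables in CI.

```hcl
terraform {
  backend "s3" {
    bucket = "busflow-terraform-state"    # placeholder
    key    = "production/terraform.tfstate"
    region = "eu-central-1"               # required by the backend, ignored by Hetzner

    endpoints = { s3 = "https://fsn1.your-objectstorage.com" } # placeholder endpoint

    skip_credentials_validation = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
    skip_metadata_api_check     = true
    skip_s3_checksum            = true
    use_path_style              = true
  }
}
```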
2. Application Layer (Docker Swarm)
- Traefik: Ingress proxy, automated SSL, long-read timeouts for AI streaming, request routing.
- Frontend (Vue.js): Static files served by replicated Nginx containers.
- Hasura: GraphQL engine for CRUD, RBAC, and real-time subscriptions (WebSockets).
- Nest.js API: Business logic, custom mutations, external integrations, SSE for AI chat streams.
- n8n: Visual automation engine for external API integrations and agent pipelines. See workflow-orchestration.md.
- Nhost Auth/Storage: JWT generation, OAuth, file uploads (S3 bridge).
3. AI & Background Processing
- Redis: In-Swarm message broker for BullMQ job queue.
- Nest.js Workers: same codebase with the HTTP listener disabled; they consume heavy AI tasks, PDF generation, and event triggers from Redis.
4. External Integrations
- LLM Providers: OpenAI, Anthropic (external APIs).
- Messaging: AWS SES (email), AWS SNS (SMS), Meta Cloud API (WhatsApp Business). See communications.md §CPaaS for the full provider strategy.
5. Observability & CI/CD
- Monitoring: Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) + OpenTelemetry Collector + Faro Web SDK + `prometheus-postgres-exporter`. See observability.md; it is the authoritative source of truth for the stack, cardinality policy, retention, cookie/SSO settings, and alerting runbooks.
- Deployment: GitHub Actions → isolated PR preview deployments → rolling `docker stack deploy` on merge.
- Transitional: Frontend deployments use Vercel during early development. We will retire it once the Swarm-based preview deployment pipeline (Traefik + `docker stack deploy`) becomes fully operational. We plan no deeper Vercel integration.
- Deployment DX (Future): Consider Portainer for a Swarm dashboard (visibility without replacing orchestration), GH Actions auto-rollback on failed health checks, and canary deploys via Traefik weighted routing. These additions improve operator confidence without adding a PaaS layer (e.g., Coolify) that duplicates existing Terraform + Swarm + Traefik responsibilities.
- Container Resource Governance: every service in `docker-compose.production.yml` MUST declare `deploy.resources.limits.memory` and `deploy.resources.reservations.cpus`. Without limits, a runaway container triggers the kernel OOM-killer, which selects the largest memory consumer, potentially Postgres. Actual values require load profiling before setting; do not guess.
- Docker System Maintenance: a weekly scheduled task prunes unused images, containers, and build cache on each Swarm node (`docker system prune --all --force --filter "until=168h"`). The mechanism (Swarm global service with entrypoint cron, or a systemd timer via Terraform `user_data`) is deferred to implementation time; a hedged sketch of the systemd-timer option follows this list. Grafana alerts on root disk usage (`node_filesystem_avail_bytes{mountpoint="/"}` < 20%), distinct from the Hetzner Volume alerts in observability.md §Volume Monitoring.
- Terraform State Drift Detection: a weekly GitHub Actions cron runs `terraform plan -detailed-exitcode` against production. Exit code 2 (drift detected) opens a GitHub Issue tagged `ops, drift` and posts to Slack. The workflow does not auto-apply; an operator reviews the diff. The current `terraform.yml` only plans on PRs touching `terraform/**`, so manual SSH changes to servers drift silently without this scheduled check.
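A hedged sketch of the systemd-timer option mentioned above, kept as a cloud-init fragment that the node `user_data` could include. File paths, unit names, and the schedule are illustrative; the mechanism itself remains deferred.

```hcl
locals {
  # Cloud-init fragment: installs a weekly docker-prune timer on each node.
  docker_prune_cloudinit = <<-EOT
    #cloud-config
    write_files:
      - path: /etc/systemd/system/docker-prune.service
        content: |
          [Unit]
          Description=Prune Docker data older than one week
          [Service]
          Type=oneshot
          ExecStart=/usr/bin/docker system prune --all --force --filter "until=168h"
      - path: /etc/systemd/system/docker-prune.timer
        content: |
          [Unit]
          Description=Run docker-prune.service weekly
          [Timer]
          OnCalendar=Sun *-*-* 03:30:00
          Persistent=true
          [Install]
          WantedBy=timers.target
    runcmd:
      - systemctl daemon-reload
      - systemctl enable --now docker-prune.timer
  EOT
}
```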
6. Backup & Disaster Recovery

Database Backups (MVP: Containerized Postgres)
- Strategy: a cron job on the Swarm manager node runs `pg_dump --format=custom` nightly and uploads the dump to a dedicated backup bucket in the in-Swarm Minio instance via `rclone` or `aws-cli`.
- Retention: 14 daily snapshots + 4 weekly snapshots. Older dumps are pruned automatically.
- Restore: `pg_restore --dbname=<target>` from any snapshot into a fresh container.
Database Backups (Production: Ubicloud Managed)
- Ubicloud provides automated Point-in-Time Recovery (PITR) with 7-day retention.
- Supplement with a weekly `pg_dump` to Hetzner Object Storage for off-provider safety.
Automated Backup Verification
See also: backup-verify-runbook.md for operator triage steps.
- Cadence: a daily GitHub Actions cron job (03:00 UTC) pulls the latest `pg_dump` and runs a full verify. Weekly is too coarse once Ubicloud PITR exists: the off-site dump is a disaster-recovery safety net and must be proven healthy against every produced artefact.
- Source parameterisation: an `environment: production` workflow input resolves to `vars.BACKUP_SOURCE`, either `minio-inswarm` (MVP path) or `hetzner-s3` (post-cutover). Both rclone remote blocks live in the workflow; only the source switches.
- Naming convention: dumps must be emitted as `pg_dump_YYYYMMDD_HHMM.dump`. The producer (currently `easypi/postgres-backup-s3`, or the post-cutover Ubicloud-triggered job) enforces this via `docker/backup/producer.env`. Files that do not match this pattern are rejected by the verify workflow rather than consumed.
- Size envelope: `ubuntu-latest` has ~14 GB of free disk and a 6 h timeout. A size guard aborts above a configurable ceiling (default 10 GB) and signals to promote to a self-hosted runner lane. In the common path the verify job streams the dump directly into `pg_restore` via `rclone cat … | pg_restore` to skip a full-disk staging copy.
- Format drift tolerance (MVP → production): the magic-number check accepts either `PGDMP` (custom format) or a plain-text header (`-- PostgreSQL database dump`). Plain-text dumps produce a warning; this nudges the MVP producer (easypi) toward `--format=custom` without failing the verify.
- Restore pipeline (split-section): the ephemeral `postgres:16-alpine` target runs `pg_restore --section=pre-data` → manual `CREATE EXTENSION IF NOT EXISTS pg_cron, pgsodium, pgvector` inside the `verify` DB → `--section=data` → `--section=post-data`. This avoids false reds on extension-privilege errors unrelated to data integrity.
- Integrity probes:
  - `pg_restore` exits with code 0.
  - All expected schemas exist (`backoffice`, `commerce`, `operations`, `communications`).
  - `information_schema.tables` returns a non-zero row count per schema.
  - No orphaned foreign-key violations against the allow-list at `config/backup-verify/soft-fk-allowlist.yaml`. The Modulith deliberately uses soft FKs across bounded contexts; the probe reads the allow-list before raising alerts.
- TLS: `insecure_skip_verify = false` is explicit in the generated `rclone.conf`; `http://` endpoints are rejected.
- Timeouts: `rclone --timeout 15m --low-level-retries 3`; job-level `timeout-minutes: 60`; fall back to in-Swarm MinIO if the offsite bucket is unreachable for > 10 min.
- Failure routing: Slack webhook errors are classified: 4xx (e.g., a rotated webhook) opens a deduped GitHub issue and sets the workflow status to `warning`; 5xx/network errors retry 3× with backoff before falling back to the GH-issue path. Issues dedup by a `(run-pattern, week)` key via a `gh issue list` lookup so the orphan-FK probe cannot issue-spam.
- Producer liveness: the verify pipeline implicitly covers producer liveness; a missing or zero-byte dump causes the verify job itself to fail. The ≤ 1 hour gap between production (02:00 UTC) and verification (03:00 UTC) is an accepted risk.
Object Storage
- Hetzner Object Storage replicates data across drives (RAID). No cross-region replication at MVP.
Disaster Recovery Drill
Backup verification (see §Automated Backup Verification above) tests restorability of individual dumps. A DR drill validates the full rebuild path that backup verification alone cannot test: environment variables, Docker registry auth, DNS propagation, and Terraform state completeness.
- Procedure: provision a parallel Hetzner environment via Terraform, restore the latest backup, deploy the full stack, and run smoke tests against all service health endpoints.
- Cadence (tied to SLA tier):
- Pre-SLA (< 50 tenants, no contractual uptime): annual drill.
- Post-SLA (≥ 50 tenants or contractual ≥ 99.5% uptime): quarterly drill.
- After any major infrastructure change (Ubicloud cutover, quorum expansion): ad-hoc drill within 2 weeks.
- Results: logged in `docs/protocols/dr-drill-results/YYYY-MM.md` with time-to-recovery, issues found, and remediation actions.
- First drill: scheduled after the Ubicloud cutover completes.
6.2 Ubicloud PostgreSQL Cutover (Production DB)
Purpose: migrate from the containerized pgvector Postgres in the Swarm to managed Ubicloud Postgres on Hetzner bare metal once tenant count > 50 or a 99.5% SLA becomes contractual.
Terraform module: terraform/modules/postgres-ubicloud
Required variables (variables.tf):
| Variable | Type | Notes |
|---|---|---|
| `environment` | string | required |
| `ubicloud_project_id` | string | from Ubicloud console |
| `hetzner_location` | string | validation `contains(["fsn1"], value)`; must match the Swarm location for latency + GDPR residency |
| `postgres_tier` | string | default `"standard-2"` |
| `postgres_version` | number | default `16` |
| `ha_mode` | string | `"async"` or `"sync"`, default `"async"` |
| `pitr_retention_days` | number | default `7` |
| `allowed_cidrs` | list(string) | fed from `module.network.swarm_cidrs` (see Finding 6 of the architect loop) |
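A sketch of the module's `variables.tf` derived from the table above; defaults follow the table, while validation messages and descriptions are illustrative.

```hcl
variable "environment" {
  type = string
}

variable "ubicloud_project_id" {
  type = string
}

variable "hetzner_location" {
  type = string
  validation {
    condition     = contains(["fsn1"], var.hetzner_location)
    error_message = "Postgres must live in fsn1 to match the Swarm (latency + GDPR residency)."
  }
}

variable "postgres_tier" {
  type    = string
  default = "standard-2"
}

variable "postgres_version" {
  type    = number
  default = 16
}

variable "ha_mode" {
  type    = string
  default = "async"
  validation {
    condition     = contains(["async", "sync"], var.ha_mode)
    error_message = "ha_mode must be \"async\" or \"sync\"."
  }
}

variable "pitr_retention_days" {
  type    = number
  default = 7
}

variable "allowed_cidrs" {
  type        = list(string)
  description = "Swarm CIDRs allowed to reach port 5432; fed from module.network.swarm_cidrs."
}
```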
Required resources inside the module:
- `ubicloud_postgres.primary`: the HA instance.
- `ubicloud_postgres_firewall_rule.swarm`: whitelists each `var.allowed_cidrs` entry on port 5432.
- `ubicloud_postgres_metric_destination.grafana`: ships metrics to the Mimir endpoint defined in observability.md.
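A hedged sketch of these resources. The resource types come from the list above, but every attribute name is an assumption about the `ubicloud/ubicloud` provider schema and must be checked against the provider documentation; the `mimir_*` variables are hypothetical.

```hcl
resource "ubicloud_postgres" "primary" {
  project_id       = var.ubicloud_project_id
  location         = var.hetzner_location
  name             = "busflow-${var.environment}"
  target_vm_size   = var.postgres_tier    # assumption: tier attribute name
  postgres_version = var.postgres_version # assumption
  ha_type          = var.ha_mode          # assumption
  # PITR retention (var.pitr_retention_days) is assumed to be configurable here.
}

# One firewall rule per Swarm CIDR, port 5432 only.
resource "ubicloud_postgres_firewall_rule" "swarm" {
  for_each    = toset(var.allowed_cidrs)
  postgres_id = ubicloud_postgres.primary.id # assumption
  cidr        = each.value
}

# Ship Postgres metrics to the Mimir endpoint defined in observability.md.
resource "ubicloud_postgres_metric_destination" "grafana" {
  postgres_id = ubicloud_postgres.primary.id # assumption
  url         = var.mimir_remote_write_url   # hypothetical variable
  username    = var.mimir_username           # hypothetical variable
  password    = var.mimir_password           # hypothetical variable
}
```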
Required outputs (fed to Docker Swarm Secrets at deploy time via secrets-sync.yml):
- `connection_uri_writer` (sensitive)
- `connection_uri_reader` (sensitive)
- `ca_certificate`: mounted as a Docker Config, not a Secret (non-sensitive; lets Mimir scrape over TLS without an extra rotation surface).
- `instance_fqdn`
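A sketch of the matching `outputs.tf`; the attribute names read from `ubicloud_postgres.primary` are assumptions about the provider schema.

```hcl
output "connection_uri_writer" {
  value     = ubicloud_postgres.primary.connection_string # assumption
  sensitive = true
}

output "connection_uri_reader" {
  value     = ubicloud_postgres.primary.read_replica_connection_string # assumption
  sensitive = true
}

# Non-sensitive by design: deployed as a Docker Config so Mimir can scrape
# over TLS without an extra secret-rotation surface.
output "ca_certificate" {
  value = ubicloud_postgres.primary.ca_certificate # assumption
}

output "instance_fqdn" {
  value = ubicloud_postgres.primary.fqdn # assumption
}
```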
Provider contract (in `terraform/environments/production/providers.tf`):

```hcl
terraform {
  required_providers {
    ubicloud = { source = "ubicloud/ubicloud", version = "~> 0.3" }
  }
}

provider "ubicloud" {
  project_id = var.ubicloud_project_id
}
```

Cutover runbook: docs/protocols/postgres-cutover.md.
Out of scope for cutover: Redis (BullMQ broker), MinIO/Object Storage, and the Hasura JWT secret are NOT rotated. Only busflow_db_writer and busflow_db_reader flip.
Edge states that MUST be covered in the cutover runbook:
- Atomic reader/writer flip: both secrets update in a single `docker service update --secret-rm … --secret-add …` call so readers never briefly point at the old writer and return phantom rows.
- Hydration failure: if replication lag does not converge within 60 min, abort, retain the Swarm DB as authoritative, and re-run from the base-backup step.
- PITR boundary: Ubicloud PITR only covers time after cutover. Retain the final pre-cutover dump in Hetzner Object Storage for 30 days (not the usual 14) and document the recovery decision-tree in the runbook.
- Connection retries with jitter: Hasura `HASURA_GRAPHQL_READ_REPLICA_URLS` and NestJS `pg` pools set `connectionTimeoutMillis=5000`, `max=20`, and wrap every transaction in `retry.operation({retries: 3, factor: 2, minTimeout: 200, randomize: true, jitter: 0.1})` to avoid a thundering herd in the first post-cutover second.
- TLS required: `sslmode=verify-full` using the `ca_certificate` output.
- Terraform state-lock guard: production applies run via `workflow_dispatch` with `concurrency: { group: terraform-prod, cancel-in-progress: false }` to prevent concurrent CI pipelines from corrupting the remote state.
- Rollback criterion: if p99 API latency exceeds 2× baseline for > 10 min within the 72 h window, trigger the rollback runbook that flips the secrets back to the containerized Postgres endpoint.
- Cost envelope: the 72 h parallel run is a Finance line item; the ADR carries a monthly Ubicloud tier price × 3/30 days figure and a Finance sign-off checkbox.
- GDPR residency: the Ubicloud EU-residency SLA clause is cited in gdpr-strategy.md §1 and in ADR-022.
The Modular Monolith
The backend follows a Modular Monolith (Modulith) pattern: isolated modules compiled and deployed as a single application. This combines the clear module boundaries of microservices with the operational simplicity of a monolith.
How It Works
- `/apps/api`: entry point. Imports microbackends, exposes them via HTTP/GraphQL. Minimal business logic.
- `/packages/<domain>`: each package is a bounded context (billing, auth, etc.) with its own logic, schemas, and services.
- Single Docker image built from `/apps/api` + all `/packages/*`. Deployed as one scaled Swarm service.
- Evolutionary: if a module needs independent scaling, extract it to a new app (e.g., `/apps/pdf-worker`).
Key Benefits
- AI-Optimized: Small, isolated packages = focused context windows for LLMs.
- Operationally Simple: No distributed monolith anti-patterns, no inter-service networking.
- Flexible: Leaves the door open for microservices without upfront DevOps tax.
Risks to Manage
- Boundary Enforcement: Use ESLint rules to forbid cross-package coupling.
- Database Isolation: Each microbackend manages its own tables. Use Hasura or PostgreSQL for cross-domain joins.
- Build Optimization: Docker builds use Turborepo `prune` to isolate each app's dependency tree into a minimal workspace, enabling aggressive Docker layer caching. See monorepo.md for the full multi-stage Dockerfile strategy and CI caching approach.