Busflow Docs

Internal documentation portal

ADR-022: Ubicloud Managed PostgreSQL Cutover

Status: 🟡 Proposed, pending Finance sign-off (cost envelope) and Legal/DPO sign-off (EU residency clause)

Impacts: infrastructure.md §1, §6.2; gdpr-strategy.md §1; terraform/modules/postgres-ubicloud/; terraform/environments/production/; docs/protocols/postgres-cutover.md; deploy.yml + secrets-sync.yml


Context

The MVP runs PostgreSQL (with pgvector) as a containerized service on the Swarm manager node, backed by a Hetzner Cloud Volume. This is operationally simple but carries two unbounded risks at scale:

  1. Single-manager SPOF for the stateful tier. Volume-backed containers on the manager node cannot be rescheduled without manual Hetzner API intervention; a failed manager takes Postgres with it.
  2. Operator fatigue on backups/PITR. Nightly pg_dump plus 14 daily + 4 weekly rotations inside the Swarm works at low tenant count but does not scale to a 99.5 % uptime SLA with a tenant-hour recovery objective.

Ubicloud Managed PostgreSQL runs on Hetzner bare metal in eu-central-h1 (Falkenstein), offers standard-2 HA-async, automated PITR (7-day retention), and exposes a Terraform provider (ubicloud/ubicloud ~> 0.3). It is the smallest operational footprint that lifts the stateful tier off the Swarm without introducing a US-or-hyperscaler dependency that would break GDPR residency.

The cutover cost is paid once either trigger threshold is crossed: tenant count > 50 OR an SLA contract requires > 99.5 % uptime, whichever comes first.
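As a minimal sketch of the trigger rule above, the check can be written as a single predicate. The function name and its two parameters are illustrative assumptions; only the two thresholds (50 tenants, 99.5 %) come from this ADR.

```python
# Thresholds taken from this ADR; everything else is a hypothetical helper.
TENANT_THRESHOLD = 50       # cutover when tenant count is strictly above this
SLA_THRESHOLD_PCT = 99.5    # cutover when any contract requires strictly more


def cutover_triggered(tenant_count: int, max_contracted_sla_pct: float) -> bool:
    """Return True as soon as either condition holds ("whichever comes first")."""
    return (
        tenant_count > TENANT_THRESHOLD
        or max_contracted_sla_pct > SLA_THRESHOLD_PCT
    )


if __name__ == "__main__":
    # Example inputs are made up for illustration only.
    print(cutover_triggered(tenant_count=12, max_contracted_sla_pct=99.9))
```

Both comparisons are strict, so exactly 50 tenants on contracts at exactly 99.5 % does not yet trigger the cutover.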

Options Evaluated

| # | Option | Pros | Cons |
|---|--------|------|------|
| 1 | Stay on containerized Postgres indefinitely | Zero migration cost | Unbounded operational risk past ~50 tenants; SLA ceiling ~99 % |
| 2 | Self-managed Postgres on a dedicated Hetzner VM with streaming replication | Full control, no vendor lock-in | We become the DBA team; PITR + failover drift into our on-call surface |
| 3 | Ubicloud Managed Postgres on Hetzner bare metal (chosen) | EU residency preserved; PITR + HA handled by provider; Terraform provider available; pg_cron/pgsodium/pgvector supported on standard-2 | New Terraform provider dependency (~v0.3); 3-day parallel-run cost during cutover window |
| 4 | AWS RDS Postgres | Mature, battle-tested | Data leaves the EU unless explicitly regioned, and adds a hyperscaler bill we cannot justify pre-Series A |

Decision

Option 3: adopt Ubicloud Managed Postgres via the ubicloud/ubicloud Terraform provider, co-located in fsn1/eu-central-h1.

  • A new Terraform module terraform/modules/postgres-ubicloud wraps ubicloud_postgres.primary, ubicloud_postgres_firewall_rule.swarm, and ubicloud_postgres_metric_destination.grafana.
  • The swarm module already exports swarm_cidrs (from module.network). These feed var.allowed_cidrs, rather than a hard-coded 10.0.0.0/16.
  • The module is instantiated behind var.enable_ubicloud_postgres = false in terraform/environments/production/main.tf. Flipping the flag provisions the instance; Commit 5 of the rollout wires the outputs (connection_uri_writer, connection_uri_reader, ca_certificate, instance_fqdn) into Swarm Secrets via secrets-sync.yml.
  • The cutover runbook (docs/protocols/postgres-cutover.md) is the single operational authority for the flip. Writer and reader Swarm Secrets are updated in one docker service update --secret-rm … --secret-add … call to prevent phantom-row reads during the transition.
  • pg_cron is confirmed available on standard-2 out of the box, so no pre-flight support ticket is needed.
  • Scope exclusions during cutover: Redis (BullMQ broker), MinIO/Object Storage, and the Hasura JWT secret are NOT rotated.
  • 30-day pre-cutover dump retention: the final pre-cutover pg_dump lives in Hetzner Object Storage for 30 days (not the usual 14) to cover the Ubicloud PITR dark-zone for any transaction that landed immediately before the secret flip.
  • Terraform concurrency guard: production applies run via workflow_dispatch with concurrency: { group: terraform-prod, cancel-in-progress: false }.
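The single-call secret swap described in the runbook bullet above could be assembled roughly as follows. This is a sketch under assumptions, not the runbook's actual command: the helper name, the service name, and every secret name are hypothetical placeholders, and the code only builds the argument list rather than running it. The --secret-rm and --secret-add flags are standard docker service update options.

```python
import shlex
from typing import List


def build_secret_swap(service: str, old_secrets: List[str], new_secrets: List[str]) -> List[str]:
    """Build ONE `docker service update` argv that removes the old secrets and
    adds the new ones together, so writer and reader change in the same update."""
    argv = ["docker", "service", "update"]
    for name in old_secrets:
        argv += ["--secret-rm", name]
    for name in new_secrets:
        argv += ["--secret-add", name]
    argv.append(service)  # the service name goes last
    return argv


if __name__ == "__main__":
    # All names below are hypothetical; the real ones live in the runbook.
    argv = build_secret_swap(
        service="example_api",
        old_secrets=["example_pg_writer_old", "example_pg_reader_old"],
        new_secrets=["example_pg_writer_new", "example_pg_reader_new"],
    )
    print(shlex.join(argv))  # review the command before running it by hand
```

Keeping both endpoints in one invocation means there is never a service revision in which the writer points at one database and the reader at the other.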

Consequences

Positive:

  • Stateful tier moves off the Swarm: the manager becomes a stateless control plane again.
  • PITR + HA replication are handled by Ubicloud; our on-call burden decreases.
  • EU residency is cited by clause; DPIA audit trail is cleaner.
  • pg_cron / pgsodium / pgvector remain available for GDPR scrub functions (ADR-028) and LLM workloads.

Negative:

  • New Terraform provider at ~> 0.3: minor-version churn risk. Mitigated by pinning and by reading release notes before any terraform apply.
  • 72-hour parallel-run during cutover adds a Finance line item (monthly Ubicloud tier × 3/30); requires a Finance sign-off checkbox in this ADR.
  • fsn1/eu-central-h1 co-location coupling: a Falkenstein regional outage now takes both the Swarm and the DB offline. Accepted; cross-region expansion is deferred past Series A.
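The parallel-run line item above (monthly tier × 3/30) is simple proration. A minimal sketch follows; the function name is an assumption, and the monthly price in the example is a placeholder, since the real figure is still to be confirmed with Ubicloud sales.

```python
from decimal import Decimal, ROUND_HALF_UP


def parallel_run_cost(monthly_price_eur: Decimal, days: int = 3, days_in_month: int = 30) -> Decimal:
    """Prorate the monthly tier price over the parallel-run window.

    The defaults mirror the ADR's formula: monthly x 3/30 for the 72 h window.
    """
    cost = monthly_price_eur * days / days_in_month
    # Round to whole cents for the Finance line item.
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # 100 EUR/month is a placeholder, NOT a quoted Ubicloud price.
    print(parallel_run_cost(Decimal("100")))
```

Decimal is used instead of float so the figure entered in the sign-off box is exact to the cent.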

Neutral:

  • Rollback criterion (p99 API latency > 2× baseline for > 10 min in the 72 h window) is straightforward; the runbook flips the Secrets back to the containerized endpoint.
  • Observability integration: ubicloud_postgres_metric_destination.grafana ships directly to the Mimir endpoint from observability.md, so no pgbouncer-to-exporter bridge is required unless Ubicloud restricts direct prometheus-postgres-exporter connections (pending verification).
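The rollback criterion above can be expressed as a check over a series of p99 latency samples. The sketch below is one possible reading, assuming samples arrive in time order as (timestamp in seconds, p99 in milliseconds) pairs and that consecutive breaching samples count as a continuous breach; the function name and sample format are hypothetical and not part of the runbook.

```python
from typing import Iterable, Optional, Tuple


def should_roll_back(
    samples: Iterable[Tuple[float, float]],  # (timestamp_s, p99_ms), time-ordered
    baseline_p99_ms: float,
    factor: float = 2.0,          # "> 2x baseline" from the ADR
    min_breach_s: float = 600.0,  # "> 10 min" from the ADR
) -> bool:
    """Return True if p99 stays above factor * baseline for longer than min_breach_s."""
    breach_start: Optional[float] = None
    for ts, p99 in samples:
        if p99 > factor * baseline_p99_ms:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach streak
            if ts - breach_start > min_breach_s:
                return True
        else:
            breach_start = None  # any healthy sample resets the streak
    return False


if __name__ == "__main__":
    # Synthetic one-minute samples, for illustration only.
    samples = [(minute * 60.0, 250.0) for minute in range(13)]
    print(should_roll_back(samples, baseline_p99_ms=100.0))
```

Restricting the input to the 72 h window after the secret flip is left to the caller.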

Finance sign-off

[ ] Monthly Ubicloud tier price for standard-2 HA: €/mo (to be confirmed with Ubicloud sales before any production apply)
[ ] Approved 72 h parallel-run cost (= monthly × 3/30): €
[ ] Signed: __________________________ Date: ____________

[ ] Ubicloud EU-residency SLA §(EU residency) clause attached as appendix: _____________
[ ] gdpr-strategy.md §1 cites the clause: ____________
[ ] Signed: __________________________ Date: ____________
