Busflow Docs

Internal documentation portal

Skip to content

Runbook โ€” Ubicloud PostgreSQL Cutover โ€‹

ADR: ADR-022Architecture: infrastructure.md ยง6.2 Duration (expected): 2โ€“4 h live cutover; 72 h parallel-run window. Reversible window: 72 h.


Pre-conditions โ€‹

  • [ ] ADR-022 has Finance sign-off (monthly Ubicloud tier price ร— 3/30 recorded).
  • [ ] ADR-022 has Legal/DPO sign-off (Ubicloud EU-residency SLA clause attached).
  • [ ] terraform/modules/postgres-ubicloud exists and terraform plan is clean on staging.
  • [ ] A staging cutover dry-run has completed and the observed cutover-duration + rollback-cost metrics are recorded below.
  • [ ] On-call rotation is informed and #ops Slack is pinned with the runbook link.
  • [ ] The current pre-cutover pg_dump is < 24 h old and passes the daily backup-verify job.

Step-by-step โ€‹

1. Provision Ubicloud instance โ€‹

bash
# From a developer laptop with AWS_* unset and UBICLOUD_PROJECT_ID set.
cd terraform/environments/production
terraform apply -target=module.postgres_ubicloud -var="enable_ubicloud_postgres=true"

Capture outputs:

  • connection_uri_writer โ†’ save to a secure scratch file
  • connection_uri_reader โ†’ save
  • ca_certificate โ†’ save as ubicloud-ca.crt
  • instance_fqdn โ†’ note

2. Stage an async logical replica โ€‹

  1. Enable logical replication on the source Postgres:
    sql
    ALTER SYSTEM SET wal_level = logical;
    SELECT pg_reload_conf();
  2. Issue a pg_basebackup against the Ubicloud primary:
    bash
    pg_basebackup -h <source-host> -U replicator -D /tmp/pgdata -X stream -P
  3. Create a publication + subscription for each schema (backoffice, commerce, operations, communications).
  4. Monitor lag: SELECT application_name, replay_lag FROM pg_stat_replication;

3. Read-only maintenance mode โ€‹

Once replication lag < 1 s:

  • Add a Traefik middleware that returns HTTP 503 for any POST /api/bookings* or POST /api/payments*.
  • Verify through a health-probe URL from the operator workstation.

4. Final VACUUM ANALYZE + post-data dump โ€‹

bash
psql -h <source-host> -c "VACUUM ANALYZE;"
pg_dump --section=post-data --format=custom --file=/tmp/pre-cutover-post-data.dump

Upload pre-cutover-post-data.dump to Hetzner Object Storage bucket busflow-backups-pre-cutover/ with a 30-day lifecycle rule (not 14 โ€” see the PITR dark-zone note in ADR-022).

5. Atomic secret flip โ€‹

This is the only step that is not reversible in <30 s without a full rollback. Both busflow_db_writer and busflow_db_reader MUST update in a single docker service update call:

bash
# Prepare new secrets
docker secret create busflow_db_writer_v2 <(echo "$WRITER_URI")
docker secret create busflow_db_reader_v2 <(echo "$READER_URI")

# Flip both at once for every dependent service
for svc in api hasura workers; do
  docker service update \
    --secret-rm busflow_db_writer --secret-rm busflow_db_reader \
    --secret-add source=busflow_db_writer_v2,target=busflow_db_writer \
    --secret-add source=busflow_db_reader_v2,target=busflow_db_reader \
    --force \
    busflow_${svc}
done

6. Verify โ€‹

  • SELECT * FROM pg_stat_replication; โ€” lag is 0 on the new primary.
  • Smoke-test: booking read/write via curl against the API; payment webhook replay against a test event.
  • Grafana dashboard "Cutover Watch" shows p50/p95/p99 on the new instance.

7. Disable read-only mode โ€‹

  • Remove the Traefik 503 middleware.
  • Post "Cutover complete" in #ops and tag the on-call.

8. Warm-rollback window (72 h) โ€‹

  • Keep docker stack busflow_postgres_legacy up and idle.
  • Grafana alert: p99 API latency > 2ร— baseline for > 10 min โ†’ page on-call.

9. Cleanup (72 h + 1) โ€‹

bash
docker stack rm busflow_postgres_legacy
# Delete the pre-cutover dump only after the 30-day PITR-dark-zone window expires.

Rollback procedure โ€‹

Trigger: p99 API latency > 2ร— baseline for > 10 min within the 72 h window, OR any data-integrity concern flagged by Grafana or on-call.

  1. Re-enable the Traefik read-only middleware.
  2. Flip the Swarm Secrets back to the legacy busflow_db_writer_v1 / busflow_db_reader_v1 in the same atomic docker service update pattern.
  3. Post "Cutover rolled back" in #ops; open a blameless incident doc.
  4. Do not rerun the cutover without a new staging dry-run.

Observed metrics (fill in after staging dry-run) โ€‹

MetricObservedTarget
Read-only window duration_____ min< 10 min
Replication lag convergence_____ min< 30 min
Post-flip smoke-test pass___ / ___all pass
Rollback-cost (if needed)_____ min< 5 min

Internal documentation โ€” Busflow