Runbook — Ubicloud PostgreSQL Cutover

ADR: ADR-022
Architecture: infrastructure.md §6.2
Duration (expected): 2–4 h live cutover; 72 h parallel-run window.
Reversible window: 72 h.

Pre-conditions

[ ] ADR-022 has Finance sign-off (cost of the 72 h parallel run recorded as monthly Ubicloud tier price × 3/30).
[ ] ADR-022 has Legal/DPO sign-off (Ubicloud EU-residency SLA clause attached).
[ ] terraform/modules/postgres-ubicloud exists and terraform plan is clean on staging.
[ ] A staging cutover dry-run has completed and the observed cutover-duration + rollback-cost metrics are recorded below.
[ ] On-call rotation is informed and #ops Slack is pinned with the runbook link.
[ ] The current pre-cutover pg_dump is < 24 h old and passes the daily backup-verify job.
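
The dump-age pre-condition can be checked mechanically rather than by eye. A minimal sketch, assuming the dump is a local file; the function name dump_is_fresh is hypothetical, and the check covers file age only, not the result of the backup-verify job:

```bash
# Sketch: succeed only if the dump file exists and was modified within the
# last max_age_min minutes (default 1440 = 24 h).
dump_is_fresh() {
  local file="$1" max_age_min="${2:-1440}"
  [ -f "$file" ] || return 1
  # find prints the path only when its mtime is newer than the cutoff.
  [ -n "$(find "$file" -mmin "-${max_age_min}" -print)" ]
}
```

For example, `dump_is_fresh /path/to/latest.dump || echo "dump too old"`, where the path is a placeholder.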

Step-by-step

1. Provision Ubicloud instance

bash

# From a developer laptop with AWS_* unset and UBICLOUD_PROJECT_ID set.
cd terraform/environments/production
terraform apply -target=module.postgres_ubicloud -var="enable_ubicloud_postgres=true"

Capture outputs:

connection_uri_writer → save to a secure scratch file
connection_uri_reader → save
ca_certificate → save as ubicloud-ca.crt
instance_fqdn → note
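
One way to capture these outputs without leaving them world-readable is sketched below. It assumes the four outputs are exposed at the root of the production environment (terraform output only lists root-module outputs); the function name capture_outputs and the target directory are hypothetical.

```bash
# Sketch: write each Terraform output to its own file in a private directory.
capture_outputs() {
  local dir="$1" name
  (
    # Restrict permissions for everything created inside this subshell.
    umask 077
    mkdir -p "$dir" || exit 1
    for name in connection_uri_writer connection_uri_reader ca_certificate instance_fqdn; do
      terraform output -raw "$name" > "$dir/$name" || exit 1
    done
    # Store the CA certificate under the file name the runbook expects.
    mv "$dir/ca_certificate" "$dir/ubicloud-ca.crt"
  )
}
```

Run it from terraform/environments/production after the apply, for example `capture_outputs "$HOME/.cutover-scratch"`, and remove the directory once the secrets have been created in step 5.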

2. Stage an async logical replica

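Enable logical replication on the source Postgres. wal_level can only be changed at server start, so a configuration reload is not enough; restart the source afterwards and confirm the setting:

sql

ALTER SYSTEM SET wal_level = logical;
-- Restart the source Postgres here (pg_reload_conf() does not apply wal_level).
SHOW wal_level;  -- expect: logical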

Take a pg_basebackup of the source Postgres (the command below connects to the source host, not to the Ubicloud primary):

bash

pg_basebackup -h <source-host> -U replicator -D /tmp/pgdata -X stream -P

Create a publication on the source and a matching subscription on the Ubicloud primary for each schema (backoffice, commerce, operations, communications).
Monitor lag on the source: SELECT application_name, replay_lag FROM pg_stat_replication;
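
The per-schema statements and the lag gate can be sketched as small helpers. This assumes PostgreSQL 15 or later on the source (FOR TABLES IN SCHEMA is not available before that) and that the table definitions already exist on the Ubicloud side, since logical replication does not copy DDL. The pub_/sub_ names and all function names are hypothetical.

```bash
# Sketch: print the SQL for one schema's publication (run on the source).
publication_sql() {
  printf 'CREATE PUBLICATION pub_%s FOR TABLES IN SCHEMA %s;\n' "$1" "$1"
}

# Sketch: print the SQL for the matching subscription (run on Ubicloud).
subscription_sql() {
  local schema="$1" source_conninfo="$2"
  printf "CREATE SUBSCRIPTION sub_%s CONNECTION '%s' PUBLICATION pub_%s;\n" \
    "$schema" "$source_conninfo" "$schema"
}

# Succeeds when a lag value in seconds is below the threshold (default 1 s).
# An empty value (no row, or NULL lag) is treated as "not ready".
lag_below() {
  [ -n "$1" ] || return 1
  awk -v lag="$1" -v max="${2:-1}" 'BEGIN { exit !(lag + 0 < max + 0) }'
}
```

The printed statements can be piped into psql against the respective host. A lag value in seconds can be obtained with `SELECT EXTRACT(EPOCH FROM replay_lag) FROM pg_stat_replication;` on the source.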

3. Read-only maintenance mode

Once replication lag < 1 s:

Add a Traefik middleware that returns HTTP 503 for any POST /api/bookings* or POST /api/payments*.
Verify through a health-probe URL from the operator workstation.
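
The verification from the operator workstation could look like the sketch below. It assumes curl is installed and that an empty POST is enough to reach the middleware; the function name write_is_blocked is hypothetical.

```bash
# Sketch: succeed only if a POST to the given URL is answered with HTTP 503.
write_is_blocked() {
  local url="$1" status
  # -o /dev/null discards the body; -w prints only the status code.
  status="$(curl -s -o /dev/null -w '%{http_code}' -X POST "$url")"
  [ "$status" = "503" ]
}
```

For example, `write_is_blocked "https://<api-host>/api/bookings" && echo "read-only mode active"`, repeated for the payments path.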

4. Final `VACUUM ANALYZE` + post-data dump

bash

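# Refresh planner statistics on the source before the final dump.
psql -h <source-host> -c "VACUUM ANALYZE;"
# Dump the post-data section (indexes, constraints, triggers) from the source.
pg_dump -h <source-host> --section=post-data --format=custom --file=/tmp/pre-cutover-post-data.dump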

Upload pre-cutover-post-data.dump to the Hetzner Object Storage bucket busflow-backups-pre-cutover/ with a 30-day lifecycle rule (not 14 days; see the PITR dark-zone note in ADR-022).
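
The 30-day rule can be written as an S3-style lifecycle document. The sketch below only generates the JSON; whether the bucket accepts it through a given S3-compatible client is an assumption to confirm on staging. The rule ID and the function name lifecycle_json are hypothetical.

```bash
# Sketch: print an S3-style lifecycle configuration that expires objects
# under a prefix after a number of days (30 by default, per the runbook).
lifecycle_json() {
  local days="${1:-30}" prefix="${2:-}"
  cat <<EOF
{
  "Rules": [
    {
      "ID": "expire-pre-cutover-dump",
      "Status": "Enabled",
      "Filter": { "Prefix": "${prefix}" },
      "Expiration": { "Days": ${days} }
    }
  ]
}
EOF
}
```

Writing the result to a file with `lifecycle_json 30 > lifecycle.json` keeps the retention period reviewable alongside the runbook before it is applied to the bucket.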

5. Atomic secret flip

This is the only step that cannot be reversed in under 30 s without a full rollback. For each dependent service, busflow_db_writer and busflow_db_reader MUST be swapped in a single docker service update call, so that no task ever runs with one old and one new secret:

bash

# Prepare new secrets (printf avoids the trailing newline that echo would store in the secret)
printf '%s' "$WRITER_URI" | docker secret create busflow_db_writer_v2 -
printf '%s' "$READER_URI" | docker secret create busflow_db_reader_v2 -

# Flip both at once for every dependent service
for svc in api hasura workers; do
  docker service update \
    --secret-rm busflow_db_writer --secret-rm busflow_db_reader \
    --secret-add source=busflow_db_writer_v2,target=busflow_db_writer \
    --secret-add source=busflow_db_reader_v2,target=busflow_db_reader \
    --force \
    busflow_${svc}
done
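
Because the flip is the one step that is slow to undo, it may help to confirm that both new secrets exist and to print the exact command for review before running the loop above. A minimal sketch; the function names are hypothetical, and the secret and service names are the ones used in this runbook:

```bash
# Sketch: succeed only if every named Swarm secret already exists.
secrets_ready() {
  local name
  for name in "$@"; do
    docker secret inspect "$name" >/dev/null 2>&1 || return 1
  done
}

# Sketch: print (do not run) the flip command for one service.
# The suffix selects the secret generation: v2 for cutover, v1 for rollback.
flip_command() {
  local svc="$1" suffix="${2:-v2}"
  echo "docker service update" \
    "--secret-rm busflow_db_writer --secret-rm busflow_db_reader" \
    "--secret-add source=busflow_db_writer_${suffix},target=busflow_db_writer" \
    "--secret-add source=busflow_db_reader_${suffix},target=busflow_db_reader" \
    "--force busflow_${svc}"
}
```

For example, `secrets_ready busflow_db_writer_v2 busflow_db_reader_v2 && flip_command api` prints the command for the api service only if both secrets are present.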

6. Verify

SELECT * FROM pg_stat_replication; on the new primary reports a lag of 0.
Smoke-test: booking read/write via curl against the API; payment webhook replay against a test event.
Grafana dashboard "Cutover Watch" shows p50/p95/p99 on the new instance.
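
The smoke tests can be tallied in the "passed / total" form that the observed-metrics table asks for. A minimal sketch, assuming each check is wrapped in a shell function or command that exits 0 on success; run_checks and the check names in the usage line are hypothetical:

```bash
# Sketch: run each check command and report "passed / total".
# The return status is 0 only when every check passed.
run_checks() {
  local passed=0 total=0 check
  for check in "$@"; do
    total=$((total + 1))
    if "$check" >/dev/null 2>&1; then
      passed=$((passed + 1))
    fi
  done
  echo "${passed} / ${total}"
  [ "$passed" -eq "$total" ]
}
```

For example, `run_checks check_booking_read check_booking_write check_payment_webhook`, where each name is a function that wraps the corresponding curl call.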

7. Disable read-only mode

Remove the Traefik 503 middleware.
Post "Cutover complete" in #ops and tag the on-call.

8. Warm-rollback window (72 h)

Keep docker stack busflow_postgres_legacy up and idle.
Grafana alert: p99 API latency > 2× baseline for > 10 min → page on-call.
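
The alert condition can be stated precisely as a small predicate, which is useful when checking the Grafana rule against recorded data. The sketch below is an illustration of the rule, not the Grafana configuration itself. It assumes one p99 sample per minute in integer milliseconds and reads "more than 10 min" as more than 10 consecutive breaching samples; should_page is a hypothetical name.

```bash
# Sketch: succeed (page) when more than `window` consecutive per-minute p99
# samples exceed 2x the baseline. All values are integer milliseconds.
should_page() {
  local baseline="$1" window="$2" streak=0 sample
  shift 2
  for sample in "$@"; do
    if [ "$sample" -gt $((baseline * 2)) ]; then
      streak=$((streak + 1))
      if [ "$streak" -gt "$window" ]; then
        return 0
      fi
    else
      # Any sample at or below the threshold resets the streak.
      streak=0
    fi
  done
  return 1
}
```

With a baseline of 100 ms and a window of 10, eleven consecutive samples of 250 ms trigger a page, while ten do not.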

9. Cleanup (72 h + 1)

bash

docker stack rm busflow_postgres_legacy
# Delete the pre-cutover dump only after the 30-day PITR-dark-zone window expires.

Rollback procedure

Trigger: p99 API latency > 2× baseline for > 10 min within the 72 h window, OR any data-integrity concern flagged by Grafana or on-call.

Re-enable the Traefik read-only middleware.
Flip the Swarm secrets back to the legacy busflow_db_writer_v1 / busflow_db_reader_v1 using the same docker service update pattern as step 5 (both secrets in one call per service).
Post "Cutover rolled back" in #ops; open a blameless incident doc.
Do not rerun the cutover without a new staging dry-run.

Observed metrics (fill in after staging dry-run)

Metric	Observed	Target
Read-only window duration	_____ min	< 10 min
Replication lag convergence	_____ min	< 30 min
Post-flip smoke-test pass	___ / ___	all pass
Rollback-cost (if needed)	_____ min	< 5 min
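
To fill in the duration rows consistently, timestamps taken with date +%s at the start and end of each phase can be converted to minutes and compared with the target. A minimal sketch; the function names are hypothetical, and rounding up is a deliberately conservative choice:

```bash
# Sketch: convert two epoch timestamps (seconds) into whole minutes, rounded up.
elapsed_min() {
  local start="$1" end="$2"
  echo $(( (end - start + 59) / 60 ))
}

# Succeeds when the observed minutes are strictly below the target.
within_target() {
  [ "$1" -lt "$2" ]
}
```

For example, a read-only window that lasts 125 s is recorded as 3 min, which is within the 10 min target.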
