Runbook โ Ubicloud PostgreSQL Cutover โ
ADR: ADR-022Architecture:
infrastructure.mdยง6.2 Duration (expected): 2โ4 h live cutover; 72 h parallel-run window. Reversible window: 72 h.
Pre-conditions โ
- [ ] ADR-022 has Finance sign-off (monthly Ubicloud tier price ร 3/30 recorded).
- [ ] ADR-022 has Legal/DPO sign-off (Ubicloud EU-residency SLA clause attached).
- [ ]
terraform/modules/postgres-ubicloudexists andterraform planis clean on staging. - [ ] A staging cutover dry-run has completed and the observed
cutover-duration+rollback-costmetrics are recorded below. - [ ] On-call rotation is informed and
#opsSlack is pinned with the runbook link. - [ ] The current pre-cutover
pg_dumpis < 24 h old and passes the daily backup-verify job.
Step-by-step โ
1. Provision Ubicloud instance โ
bash
# From a developer laptop with AWS_* unset and UBICLOUD_PROJECT_ID set.
cd terraform/environments/production
terraform apply -target=module.postgres_ubicloud -var="enable_ubicloud_postgres=true"Capture outputs:
connection_uri_writerโ save to a secure scratch fileconnection_uri_readerโ saveca_certificateโ save asubicloud-ca.crtinstance_fqdnโ note
2. Stage an async logical replica โ
- Enable logical replication on the source Postgres:sql
ALTER SYSTEM SET wal_level = logical; SELECT pg_reload_conf(); - Issue a
pg_basebackupagainst the Ubicloud primary:bashpg_basebackup -h <source-host> -U replicator -D /tmp/pgdata -X stream -P - Create a publication + subscription for each schema (
backoffice,commerce,operations,communications). - Monitor lag:
SELECT application_name, replay_lag FROM pg_stat_replication;
3. Read-only maintenance mode โ
Once replication lag < 1 s:
- Add a Traefik middleware that returns HTTP 503 for any
POST /api/bookings*orPOST /api/payments*. - Verify through a health-probe URL from the operator workstation.
4. Final VACUUM ANALYZE + post-data dump โ
bash
psql -h <source-host> -c "VACUUM ANALYZE;"
pg_dump --section=post-data --format=custom --file=/tmp/pre-cutover-post-data.dumpUpload pre-cutover-post-data.dump to Hetzner Object Storage bucket busflow-backups-pre-cutover/ with a 30-day lifecycle rule (not 14 โ see the PITR dark-zone note in ADR-022).
5. Atomic secret flip โ
This is the only step that is not reversible in <30 s without a full rollback. Both busflow_db_writer and busflow_db_reader MUST update in a single docker service update call:
bash
# Prepare new secrets
docker secret create busflow_db_writer_v2 <(echo "$WRITER_URI")
docker secret create busflow_db_reader_v2 <(echo "$READER_URI")
# Flip both at once for every dependent service
for svc in api hasura workers; do
docker service update \
--secret-rm busflow_db_writer --secret-rm busflow_db_reader \
--secret-add source=busflow_db_writer_v2,target=busflow_db_writer \
--secret-add source=busflow_db_reader_v2,target=busflow_db_reader \
--force \
busflow_${svc}
done6. Verify โ
SELECT * FROM pg_stat_replication;โ lag is 0 on the new primary.- Smoke-test: booking read/write via
curlagainst the API; payment webhook replay against a test event. - Grafana dashboard "Cutover Watch" shows p50/p95/p99 on the new instance.
7. Disable read-only mode โ
- Remove the Traefik 503 middleware.
- Post "Cutover complete" in
#opsand tag the on-call.
8. Warm-rollback window (72 h) โ
- Keep
docker stack busflow_postgres_legacyup and idle. - Grafana alert: p99 API latency > 2ร baseline for > 10 min โ page on-call.
9. Cleanup (72 h + 1) โ
bash
docker stack rm busflow_postgres_legacy
# Delete the pre-cutover dump only after the 30-day PITR-dark-zone window expires.Rollback procedure โ
Trigger: p99 API latency > 2ร baseline for > 10 min within the 72 h window, OR any data-integrity concern flagged by Grafana or on-call.
- Re-enable the Traefik read-only middleware.
- Flip the Swarm Secrets back to the legacy
busflow_db_writer_v1/busflow_db_reader_v1in the same atomicdocker service updatepattern. - Post "Cutover rolled back" in
#ops; open a blameless incident doc. - Do not rerun the cutover without a new staging dry-run.
Observed metrics (fill in after staging dry-run) โ
| Metric | Observed | Target |
|---|---|---|
| Read-only window duration | _____ min | < 10 min |
| Replication lag convergence | _____ min | < 30 min |
| Post-flip smoke-test pass | ___ / ___ | all pass |
| Rollback-cost (if needed) | _____ min | < 5 min |