Runbook — Daily Backup Verify Triage

Workflow: .github/workflows/backup-verify.ymlArchitecture: infrastructure.md §6 Alert channel: #opsTrigger cadence: daily at 03:00 UTC

The backup-verify job does five things in order: (1) stream the latest pg_dump_YYYYMMDD_HHMM.dump from Hetzner Object Storage via rclone; (2) size-guard (must be > 10 GB and < 2× the 7-day trailing median); (3) restore into a disposable Postgres container; (4) run VACUUM ANALYZE and assert row-count deltas are within tolerance; (5) run the soft-foreign-key sweep against config/backup-verify/soft-fk-allowlist.yaml.

Any failure below pages #ops. This runbook triages each failure mode.

F1 — `SIZE_EXCEEDED` or `SIZE_SHRUNK`

Symptom: Dump is > 2× trailing median (possible runaway write) or < 10 GB (possible truncation).

Compare row counts on the source vs. last night's verified dump for the largest tables (commerce.bookings, communications.messages, operations.service_legs).
If rows are proportional to dump size: raise the median tolerance only with evidence in the PR (e.g., tenant onboarding expected to 3× volume). Do not silently widen.
If rows are disproportionate: the dump is corrupt. Re-run pg_dump from the source before the next pg_cron scrub cycle; if the new dump also fails size-guard, open a P1 incident.
Do not promote a failing dump to the verified-latest pointer. That pointer is what the cutover runbook relies on.

F2 — `RESTORE_FAILED`

Symptom: pg_restore errored mid-stream.

Check format: the verify job tolerates both PGDMP custom format and plain-text. If neither header matches: the file is not a Postgres dump — check the S3/MinIO object for an HTML error page (auth failure), common after a credential rotation.
If format is valid but restore fails on a CREATE EXTENSION line: the disposable container is missing the extension. Add it to docker/backup/producer.env (EXTENSIONS="pgcrypto,pgvector,pgsodium,pg_cron") and re-run.
If restore fails on a COPY line: capture the failing table + line number; this is likely a source-side corruption. Snapshot the source immediately (PITR snapshot via Ubicloud) before touching anything else.
Split-section restore (pre-data → data → post-data) is the canonical path. If someone reverts to a single-stream restore: reject the PR.

F3 — `ORPHAN_FK_SURGE`

Symptom: Soft-FK sweep found orphans exceeding the allowlist tolerance.

Open config/backup-verify/soft-fk-allowlist.yaml. The file declares expected orphan counts per soft-FK (e.g., communications.messages.booking_id → commerce.bookings.id: max_orphans: 200).
Each sweep failure must be triaged to one of:
- Expected: a scheduled scrub tombstoned rows — raise the allowlist only with the scrub-log cross-reference (tenant_scrub_logs UUIDs).
- Unexpected: an application bug deleted referenced rows — open a P2 issue and pin the FK.
Never blanket-raise a soft-FK tolerance without the scrub-log justification. That is the single most common way GDPR tombstone contracts drift.

F4 — `SLACK_NOTIFICATION_FAILURE`

Symptom: The job's final status could not be posted to #ops.

Check the job's stderr for the HTTP code:
- 4xx → webhook secret is stale or revoked. Rotate via the secrets-rotation runbook (SLACK_OPS_WEBHOOK). Re-post manually to #ops with the job result so on-call isn't blind.
- 5xx → Slack outage; the job retried with exponential backoff. If all retries failed, post manually; no rotation needed.
A 4xx vs. 5xx distinction matters — the backup job itself succeeded in both cases; this alert is purely about the notification channel.

F5 — `OFFSITE_ENDPOINT_DOWN` → MinIO fallback

Symptom: rclone cannot reach Hetzner Object Storage.

The verify job automatically retries against the on-site MinIO mirror (BACKUP_SOURCE=minio-fallback). If that succeeded: the alert is informational; log the Hetzner outage and continue.
If both failed: the dump is inaccessible. Do not declare the day "unverified" — page @infra-oncall and halt any pending cutover work until a verified dump is produced.
The fallback's existence does not relax the primary SLA — a sustained Hetzner Object Storage outage (> 24 h) requires a DPIA note because offsite residency is part of the audit contract.

Appendix — Escalation matrix

Failure	Default severity	Escalate to
F1 (size)	P2	`@data-oncall`
F2 (restore)	P1 if corruption suspected	`@infra-oncall` + `@data-oncall`
F3 (orphan-FK surge)	P2, P1 if tenant-spanning	`@data-oncall` + `@security-oncall`
F4 (Slack)	P3	`@platform-oncall`
F5 (offsite)	P1 if both endpoints down	`@infra-oncall`

Busflow Docs

Runbook — Daily Backup Verify Triage ​

F1 — SIZE_EXCEEDED or SIZE_SHRUNK ​

F2 — RESTORE_FAILED ​

F3 — ORPHAN_FK_SURGE ​

F4 — SLACK_NOTIFICATION_FAILURE ​

F5 — OFFSITE_ENDPOINT_DOWN → MinIO fallback ​

Appendix — Escalation matrix ​