Runbook: Daily Backup Verify Triage
- Workflow: `.github/workflows/backup-verify.yml`
- Architecture: `infrastructure.md` §6
- Alert channel: `#ops`
- Trigger cadence: daily at 03:00 UTC
The backup-verify job does five things in order: (1) stream the latest pg_dump_YYYYMMDD_HHMM.dump from Hetzner Object Storage via rclone; (2) size-guard (must be > 10 GB and < 2× the 7-day trailing median); (3) restore into a disposable Postgres container; (4) run VACUUM ANALYZE and assert row-count deltas are within tolerance; (5) run the soft-foreign-key sweep against config/backup-verify/soft-fk-allowlist.yaml.
Any failure below pages #ops. This runbook triages each failure mode.
F1: SIZE_EXCEEDED or SIZE_SHRUNK
Symptom: Dump is > 2× trailing median (possible runaway write) or < 10 GB (possible truncation).
- Compare row counts on the source vs. last night's verified dump for the largest tables (`commerce.bookings`, `communications.messages`, `operations.service_legs`).
- If rows are proportional to dump size: raise the median tolerance only with evidence in the PR (e.g., tenant onboarding expected to 3× volume). Do not silently widen.
- If rows are disproportionate: the dump is corrupt. Re-run `pg_dump` from the source before the next `pg_cron` scrub cycle; if the new dump also fails size-guard, open a P1 incident.
- Do not promote a failing dump to the verified-latest pointer. That pointer is what the cutover runbook relies on.
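
The size-guard rule stated at the top of this runbook (> 10 GB and < 2× the 7-day trailing median) can be sketched as below. This is a minimal illustration, not the job's actual code: the function name `size_guard`, the byte-based inputs, and the reading of "10 GB" as decimal gigabytes are all assumptions.

```python
from statistics import median

MIN_BYTES = 10 * 10**9   # assumption: "10 GB" read as decimal gigabytes
MAX_MEDIAN_FACTOR = 2    # dump must stay below 2x the trailing median


def size_guard(dump_bytes: int, trailing_sizes: list[int]) -> str:
    """Return "OK", "SIZE_SHRUNK", or "SIZE_EXCEEDED" for one dump.

    trailing_sizes holds the sizes (in bytes) of the dumps from the
    trailing 7-day window; how they are collected is out of scope here.
    """
    if not trailing_sizes:
        raise ValueError("need at least one trailing size to compute a median")
    if dump_bytes <= MIN_BYTES:
        return "SIZE_SHRUNK"
    if dump_bytes >= MAX_MEDIAN_FACTOR * median(trailing_sizes):
        return "SIZE_EXCEEDED"
    return "OK"
```

Keeping the two thresholds as named constants makes any tolerance change visible in a PR diff, which fits the "do not silently widen" rule.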
F2: RESTORE_FAILED
Symptom: pg_restore errored mid-stream.
- Check format: the verify job tolerates both `PGDMP` custom format and plain-text. If neither header matches: the file is not a Postgres dump. Check the S3/MinIO object for an HTML error page (auth failure), common after a credential rotation.
- If format is valid but restore fails on a `CREATE EXTENSION` line: the disposable container is missing the extension. Add it to `docker/backup/producer.env` (`EXTENSIONS="pgcrypto,pgvector,pgsodium,pg_cron"`) and re-run.
- If restore fails on a `COPY` line: capture the failing table + line number; this is likely a source-side corruption. Snapshot the source immediately (PITR snapshot via Ubicloud) before touching anything else.
- Split-section restore (pre-data → data → post-data) is the canonical path. If someone reverts to a single-stream restore: reject the PR.
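
The format check in the first bullet can be approximated by inspecting the first bytes of the downloaded object. The sketch below is illustrative only: `sniff_dump_format` is a hypothetical helper, and the plain-text test is a heuristic based on the comment banner that `pg_dump` writes at the top of SQL-script output.

```python
def sniff_dump_format(head: bytes) -> str:
    """Classify the first bytes of a downloaded object.

    Returns "custom", "plain", "error-page", or "unknown".
    """
    if head.startswith(b"PGDMP"):
        # pg_dump custom-format archives begin with this magic string.
        return "custom"
    stripped = head.lstrip()
    if stripped.startswith(b"<"):
        # Markup body (HTML or XML), typical of an auth error response.
        return "error-page"
    if stripped.startswith(b"--") or b"PostgreSQL database dump" in head:
        # Heuristic: plain-text dumps open with an SQL comment banner.
        return "plain"
    return "unknown"
```

Passing only the first few hundred bytes is enough, so the check can run before the full stream is restored.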
F3: ORPHAN_FK_SURGE
Symptom: Soft-FK sweep found orphans exceeding the allowlist tolerance.
- Open `config/backup-verify/soft-fk-allowlist.yaml`. The file declares expected orphan counts per soft-FK (e.g., `communications.messages.booking_id → commerce.bookings.id: max_orphans: 200`).
- Each sweep failure must be triaged to one of:
  - Expected: a scheduled scrub tombstoned rows. Raise the allowlist only with the scrub-log cross-reference (`tenant_scrub_logs` UUIDs).
  - Unexpected: an application bug deleted referenced rows. Open a P2 issue and pin the FK.
- Never blanket-raise a soft-FK tolerance without the scrub-log justification. That is the single most common way GDPR tombstone contracts drift.
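
The comparison the sweep performs against the allowlist can be sketched as follows. The real YAML schema is not shown in this runbook, so the sketch assumes the file has already been loaded into a plain mapping of soft-FK name to `max_orphans`; `find_surges` is a hypothetical name.

```python
def find_surges(
    orphans: dict[str, int], allowlist: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return {soft_fk: (observed, max_orphans)} for every soft-FK over tolerance.

    A soft-FK missing from the allowlist is treated as max_orphans = 0,
    which is an assumption of this sketch, not documented behaviour.
    """
    surges = {}
    for fk, observed in orphans.items():
        limit = allowlist.get(fk, 0)
        if observed > limit:
            surges[fk] = (observed, limit)
    return surges
```

Reporting the observed count next to the limit gives the triager the size of the gap, which helps when matching it against scrub-log entries.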
F4: SLACK_NOTIFICATION_FAILURE
Symptom: The job's final status could not be posted to #ops.
- Check the job's stderr for the HTTP code:
  - `4xx`: webhook secret is stale or revoked. Rotate via the secrets-rotation runbook (`SLACK_OPS_WEBHOOK`). Re-post manually to `#ops` with the job result so on-call isn't blind.
  - `5xx`: Slack outage; the job retried with exponential backoff. If all retries failed, post manually; no rotation needed.
- The 4xx vs. 5xx distinction decides the fix (rotate vs. wait). In both cases the backup job itself succeeded; this alert is purely about the notification channel.
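
The 4xx/5xx triage and the backoff schedule can be illustrated with two small helpers. The action labels, retry count, base delay, and cap below are assumptions for illustration; the runbook only states that retries use exponential backoff.

```python
def triage_webhook_status(status: int) -> str:
    """Map the HTTP status from the webhook call to a triage action."""
    if 400 <= status < 500:
        return "rotate-secret-and-post-manually"
    if 500 <= status < 600:
        return "retry-then-post-manually"
    return "no-action"


def backoff_delays(
    attempts: int, base_seconds: float = 1.0, cap_seconds: float = 60.0
) -> list[float]:
    """Delay before each retry: base * 2**i, capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * 2**i) for i in range(attempts)]
```

For example, `backoff_delays(4)` yields delays of 1, 2, 4, and 8 seconds under the assumed defaults.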
F5: OFFSITE_ENDPOINT_DOWN → MinIO fallback
Symptom: rclone cannot reach Hetzner Object Storage.
- The verify job automatically retries against the on-site MinIO mirror (`BACKUP_SOURCE=minio-fallback`). If that succeeded: the alert is informational; log the Hetzner outage and continue.
- If both failed: the dump is inaccessible. Do not declare the day "unverified"; page `@infra-oncall` and halt any pending cutover work until a verified dump is produced.
- The fallback's existence does not relax the primary SLA: a sustained Hetzner Object Storage outage (> 24 h) requires a DPIA note because offsite residency is part of the audit contract.
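
The primary-then-fallback decision can be sketched as below. The two fetch callables stand in for whatever the job uses to stream from Hetzner Object Storage and from the MinIO mirror; `resolve_source` and its return labels are hypothetical.

```python
from typing import Callable


def resolve_source(
    fetch_primary: Callable[[], bool], fetch_fallback: Callable[[], bool]
) -> str:
    """Try the offsite store first, then the on-site mirror.

    Each callable returns True when the dump was fetched successfully.
    """
    if fetch_primary():
        return "primary"
    if fetch_fallback():
        # Informational alert: log the offsite outage and continue.
        return "minio-fallback"
    # Both endpoints down: page on-call and halt pending cutover work.
    return "page-infra-oncall"
```

The fallback is only attempted after the primary fails, so a healthy offsite endpoint never touches the mirror.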
Appendix: Escalation matrix
| Failure | Default severity | Escalate to |
|---|---|---|
| F1 (size) | P2 | @data-oncall |
| F2 (restore) | P1 if corruption suspected | @infra-oncall + @data-oncall |
| F3 (orphan-FK surge) | P2, P1 if tenant-spanning | @data-oncall + @security-oncall |
| F4 (Slack) | P3 | @platform-oncall |
| F5 (offsite) | P1 if both endpoints down | @infra-oncall |