Busflow Docs

Internal documentation portal


Runbook: Daily Backup Verify Triage

Workflow: .github/workflows/backup-verify.yml
Architecture: infrastructure.md §6
Alert channel: #ops
Trigger cadence: daily at 03:00 UTC

The backup-verify job does five things in order: (1) stream the latest pg_dump_YYYYMMDD_HHMM.dump from Hetzner Object Storage via rclone; (2) size-guard (must be > 10 GB and < 2× the 7-day trailing median); (3) restore into a disposable Postgres container; (4) run VACUUM ANALYZE and assert row-count deltas are within tolerance; (5) run the soft-foreign-key sweep against config/backup-verify/soft-fk-allowlist.yaml.

Any failure below pages #ops. This runbook triages each failure mode.
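The size-guard in step (2) can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the function name, the constants, and the reading of "10 GB" as decimal gigabytes are assumptions.

```python
from statistics import median

MIN_DUMP_BYTES = 10 * 10**9   # "10 GB", read here as decimal gigabytes (assumption)
MAX_MEDIAN_FACTOR = 2.0       # upper bound: 2x the 7-day trailing median


def size_guard(dump_bytes: int, trailing_sizes: list[int]) -> str:
    """Classify a dump size as OK, SIZE_SHRUNK, or SIZE_EXCEEDED.

    trailing_sizes holds the sizes (in bytes) of the dumps from the
    previous seven days. The dump passes only if it is strictly larger
    than the floor and strictly smaller than 2x the trailing median.
    """
    if dump_bytes <= MIN_DUMP_BYTES:
        return "SIZE_SHRUNK"
    if trailing_sizes and dump_bytes >= MAX_MEDIAN_FACTOR * median(trailing_sizes):
        return "SIZE_EXCEEDED"
    return "OK"
```

With a trailing history of seven 40 GB dumps, a 45 GB dump passes, a 9 GB dump is flagged as shrunk, and an 80 GB dump is flagged as exceeded.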


F1: SIZE_EXCEEDED or SIZE_SHRUNK

Symptom: Dump is > 2× trailing median (possible runaway write) or < 10 GB (possible truncation).

  1. Compare row counts on the source vs. last night's verified dump for the largest tables (commerce.bookings, communications.messages, operations.service_legs).
  2. If rows are proportional to dump size: raise the median tolerance only with evidence in the PR (e.g., tenant onboarding expected to 3× volume). Do not silently widen.
  3. If rows are disproportionate: the dump is corrupt. Re-run pg_dump from the source before the next pg_cron scrub cycle; if the new dump also fails size-guard, open a P1 incident.
  4. Do not promote a failing dump to the verified-latest pointer. That pointer is what the cutover runbook relies on.
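The proportionality check in steps 2 and 3 could look roughly like this. The runbook does not define "proportional", so the 25% relative margin and the function name below are hypothetical.

```python
def growth_is_proportional(
    prev_rows: int,
    curr_rows: int,
    prev_bytes: int,
    curr_bytes: int,
    tolerance: float = 0.25,  # hypothetical margin; the runbook names none
) -> bool:
    """Return True when dump size changed roughly in step with row count.

    Compares the size ratio to the row ratio and accepts a relative
    difference of up to `tolerance`.
    """
    if prev_rows <= 0 or prev_bytes <= 0:
        raise ValueError("previous row count and dump size must be positive")
    row_ratio = curr_rows / prev_rows
    size_ratio = curr_bytes / prev_bytes
    return abs(size_ratio - row_ratio) <= tolerance * row_ratio
```

A True result points to step 2 (real growth that needs evidence in the PR); a False result points to step 3 (suspected corruption).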

F2: RESTORE_FAILED

Symptom: pg_restore errored mid-stream.

  1. Check format: the verify job tolerates both PGDMP custom format and plain-text. If neither header matches: the file is not a Postgres dump. Check the S3/MinIO object for an HTML error page (auth failure), which is common after a credential rotation.
  2. If format is valid but restore fails on a CREATE EXTENSION line: the disposable container is missing the extension. Add it to docker/backup/producer.env (EXTENSIONS="pgcrypto,pgvector,pgsodium,pg_cron") and re-run.
  3. If restore fails on a COPY line: capture the failing table + line number; this is likely a source-side corruption. Snapshot the source immediately (PITR snapshot via Ubicloud) before touching anything else.
  4. Split-section restore (pre-data → data → post-data) is the canonical path. If someone reverts to a single-stream restore: reject the PR.
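The format check in step 1 and the split-section order in step 4 can be sketched as below. The function names and return labels are illustrative assumptions. The sketch relies on two things: custom-format archives begin with the bytes PGDMP, and plain-text dumps carry a "PostgreSQL database dump" banner comment near the top. The --section and --dbname options are standard pg_restore flags, and pg_restore applies to custom-format archives only.

```python
def classify_dump_header(head: bytes) -> str:
    """Guess what a downloaded object is from its first bytes."""
    if head.startswith(b"PGDMP"):
        return "custom"  # magic string of pg_dump custom-format archives
    text = head.lstrip().lower()
    if text.startswith((b"<!doctype", b"<html", b"<?xml")):
        return "error-page"  # likely an HTML/XML error body from the object store
    if b"postgresql database dump" in text:
        return "plain"  # banner comment at the top of plain-text SQL dumps
    return "unknown"


def split_restore_commands(dump_path: str, dbname: str) -> list[list[str]]:
    """Build one pg_restore invocation per section, in canonical order."""
    return [
        ["pg_restore", f"--section={section}", f"--dbname={dbname}", dump_path]
        for section in ("pre-data", "data", "post-data")
    ]
```

Reading the first few hundred bytes of the object is enough input for classify_dump_header. Each command list can be passed to subprocess.run with check=True so a failing section stops the sequence.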

F3: ORPHAN_FK_SURGE

Symptom: Soft-FK sweep found orphans exceeding the allowlist tolerance.

  1. Open config/backup-verify/soft-fk-allowlist.yaml. The file declares expected orphan counts per soft-FK (e.g., communications.messages.booking_id → commerce.bookings.id: max_orphans: 200).
  2. Each sweep failure must be triaged to one of:
    • Expected: a scheduled scrub tombstoned rows. Raise the allowlist only with the scrub-log cross-reference (tenant_scrub_logs UUIDs).
    • Unexpected: an application bug deleted referenced rows. Open a P2 issue and pin the FK.
  3. Never blanket-raise a soft-FK tolerance without the scrub-log justification. That is the single most common way GDPR tombstone contracts drift.
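The sweep's comparison against the allowlist can be illustrated as follows. This is a sketch under assumptions: the allowlist is shown as an already-parsed mapping from soft-FK name to max_orphans, the function name is hypothetical, and treating unlisted soft-FKs as tolerance 0 is a guess at the intended behaviour.

```python
def find_orphan_surges(
    observed: dict[str, int],
    allowlist: dict[str, int],
) -> dict[str, int]:
    """Return each soft-FK whose orphan count exceeds max_orphans.

    The value is the excess over the allowed count. Soft-FKs missing
    from the allowlist are treated as tolerance 0 (assumption).
    """
    surges = {}
    for fk, count in observed.items():
        limit = allowlist.get(fk, 0)
        if count > limit:
            surges[fk] = count - limit
    return surges
```

Every key in the result needs its own triage decision (expected scrub vs. application bug) before any tolerance is changed.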

F4: SLACK_NOTIFICATION_FAILURE

Symptom: The job's final status could not be posted to #ops.

  1. Check the job's stderr for the HTTP code:
    • 4xx → webhook secret is stale or revoked. Rotate via the secrets-rotation runbook (SLACK_OPS_WEBHOOK). Re-post manually to #ops with the job result so on-call isn't blind.
    • 5xx → Slack outage; the job retried with exponential backoff. If all retries failed, post manually; no rotation needed.
  2. The 4xx vs. 5xx distinction determines the fix, not the severity: the backup job itself succeeded in both cases, and this alert is purely about the notification channel.
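The status-code split above maps to a small decision helper. The action labels, the one-second base delay, and the function names are assumptions for illustration; the runbook only states that retries use exponential backoff.

```python
def triage_slack_failure(status_code: int) -> str:
    """Map the webhook's HTTP status to the runbook's remediation."""
    if 400 <= status_code < 500:
        # Stale or revoked webhook secret: rotate, then post by hand.
        return "rotate-webhook-and-post-manually"
    if 500 <= status_code < 600:
        # Slack-side outage: nothing to rotate, just post by hand.
        return "post-manually-no-rotation"
    return "not-covered-by-runbook"


def backoff_delays(retries: int, base_seconds: float = 1.0) -> list[float]:
    """Delays for exponential backoff: base, 2x base, 4x base, ..."""
    return [base_seconds * 2**attempt for attempt in range(retries)]
```

For example, backoff_delays(4) yields waits of 1, 2, 4, and 8 seconds before the job gives up and a manual post is needed.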

F5: OFFSITE_ENDPOINT_DOWN → MinIO fallback

Symptom: rclone cannot reach Hetzner Object Storage.

  1. The verify job automatically retries against the on-site MinIO mirror (BACKUP_SOURCE=minio-fallback). If that succeeded: the alert is informational; log the Hetzner outage and continue.
  2. If both failed: the dump is inaccessible. Do not declare the day "unverified"; page @infra-oncall and halt any pending cutover work until a verified dump is produced.
  3. The fallback's existence does not relax the primary SLA: a sustained Hetzner Object Storage outage (> 24 h) requires a DPIA note because offsite residency is part of the audit contract.
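The three steps above combine into one decision, sketched here. The function name and action labels are hypothetical; only the 24-hour threshold and the actions themselves come from the runbook.

```python
from datetime import timedelta

DPIA_THRESHOLD = timedelta(hours=24)  # sustained-outage limit from step 3


def triage_offsite(
    primary_ok: bool,
    fallback_ok: bool,
    primary_outage: timedelta,
) -> list[str]:
    """Return the follow-up actions for an OFFSITE_ENDPOINT_DOWN alert."""
    if primary_ok:
        return []  # Hetzner reachable: nothing to triage
    actions = []
    if fallback_ok:
        # MinIO mirror served the dump: the alert is informational.
        actions.append("log-hetzner-outage")
    else:
        # Neither endpoint reachable: the dump is inaccessible.
        actions += ["page-infra-oncall", "halt-cutover"]
    if primary_outage > DPIA_THRESHOLD:
        # A working fallback does not waive the offsite-residency contract.
        actions.append("file-dpia-note")
    return actions
```

The DPIA check is evaluated independently of the fallback result, which mirrors step 3.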

Appendix: Escalation matrix

| Failure              | Default severity           | Escalate to                     |
|----------------------|----------------------------|---------------------------------|
| F1 (size)            | P2                         | @data-oncall                    |
| F2 (restore)         | P1 if corruption suspected | @infra-oncall + @data-oncall    |
| F3 (orphan-FK surge) | P2, P1 if tenant-spanning  | @data-oncall + @security-oncall |
| F4 (Slack)           | P3                         | @platform-oncall                |
| F5 (offsite)         | P1 if both endpoints down  | @infra-oncall                   |
