Runbook: Secrets Rotation
ADR: ADR-029
Architecture: gdpr-strategy.md §5; infrastructure.md §2 (Secrets)
Workflow: .github/workflows/secrets-sync.yml
All infrastructure credentials are Docker Swarm secrets; all tenant-scoped credentials are stored encrypted via pgsodium (deterministic AEAD for equality-searchable fields, standard AEAD otherwise). .env.production has been retired and must never be reintroduced. This runbook covers the quarterly rotation, the incident-driven rotation, and the drift check.
Rotation cadence
| Secret class | Cadence | Owner |
|---|---|---|
| Swarm infra secrets (busflow_db_*, SLACK_*_WEBHOOK, HCLOUD_TOKEN, UBICLOUD_TOKEN) | Quarterly | Platform |
| Tenant API configs (encrypted via pgsodium) | Annually, or on tenant-request | Tenant CSM + DPO |
| Grafana SSO shared secrets | Quarterly | Observability |
| GitHub Actions deploy tokens | Quarterly | Platform |
| Manager host keys | On Terraform apply (see manager-failover-runbook.md) | Platform |
Quarterly rotation
Run on the first Monday of the quarter. Schedule via the Schedule skill or cron.
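Cron cannot express "first Monday of the quarter" directly, because a restricted day-of-month field and a restricted day-of-week field are ORed rather than ANDed. One way around this, sketched below, is to fire the job on days 1-7 of the first month of each quarter and let a guard decide. The sketch assumes calendar quarters starting in January, April, July, and October and a date(1) that supports -d (e.g. GNU coreutils); the function name and the scripts in the crontab comment are hypothetical.

```shell
# Succeeds (exit 0) only if the given YYYY-MM-DD date is the first Monday
# of a calendar quarter. Assumes `date -d` is available.
is_first_monday_of_quarter() {
  d="$1"
  month=$(date -d "$d" +%m) || return 2
  day=$(date -d "$d" +%d) || return 2
  dow=$(date -d "$d" +%u) || return 2   # 1 = Monday ... 7 = Sunday

  # Strip a leading zero so "08" and "09" are not read as octal.
  month=${month#0}
  day=${day#0}

  case "$month" in
    1|4|7|10) ;;        # first month of a quarter
    *) return 1 ;;
  esac
  [ "$dow" -eq 1 ] && [ "$day" -le 7 ]
}

# Hypothetical crontab entry: run daily during the first week of each quarter;
# guard.sh (hypothetical) calls the function above with today's date.
#   0 9 1-7 1,4,7,10 * ./guard.sh && ./rotate-secrets.sh
```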
1. Generate
For each Swarm infra secret, generate a new value via the appropriate provider:
- Postgres writer/reader URIs: ubicloud postgres rotate-password --database busflow --role <role>.
- Slack webhooks: in Slack Manager, rotate for the target channel.
- HCLOUD_TOKEN/UBICLOUD_TOKEN: provider console; scope tokens to what they need (avoid root project scope).
2. Seed via secrets-sync.yml
The workflow writes the new value as busflow_<name>_v<n+1>. It must never overwrite _v<n>: rotation is by new versioned secret, not by value swap, because Docker Swarm secrets are immutable after creation.
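The versioning rule can be illustrated with a small helper. This is a sketch of the naming convention only, not the contents of secrets-sync.yml: both function names are hypothetical, and the example secret name busflow_db_writer_v3 is an assumed instance of the busflow_<name>_v<n> pattern.

```shell
# Derive the next versioned name: busflow_db_writer_v3 -> busflow_db_writer_v4.
next_secret_name() {
  current="$1"
  base=${current%_v*}      # everything before the final "_v<n>"
  n=${current##*_v}        # the digits after the final "_v"
  case "$current" in
    busflow_*_v*) ;;
    *) echo "not a versioned busflow secret: $current" >&2; return 1 ;;
  esac
  case "$n" in
    ''|*[!0-9]*|0?*) echo "bad version suffix in: $current" >&2; return 1 ;;
  esac
  echo "${base}_v$((n + 1))"
}

# Create the next version from stdin, refusing to touch an existing name.
seed_next_version() {
  new=$(next_secret_name "$1") || return 1
  if docker secret inspect "$new" >/dev/null 2>&1; then
    echo "refusing to overwrite existing secret: $new" >&2
    return 1
  fi
  docker secret create "$new" -    # "-" reads the secret value from stdin
}
```

A call might look like `printf '%s' "$NEW_VALUE" | seed_next_version busflow_db_writer_v3`, where NEW_VALUE holds the value generated in step 1.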
3. Atomic flip per service
For each dependent service:

    docker service update \
      --secret-rm busflow_<name>_v<n> \
      --secret-add source=busflow_<name>_v<n+1>,target=busflow_<name> \
      --force \
      busflow_<svc>

Pay attention to services that consume multiple related secrets (e.g., api uses both writer and reader URIs): flip them in a single docker service update call so the service never sees a mismatched pair. This mirrors the cutover runbook's atomic-flip rule.
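For a service with several related secrets, the single-call rule can be sketched as a helper that builds all --secret-rm/--secret-add pairs first and then issues one update. The helper names and the db_writer/db_reader secret names are assumptions for illustration; each argument is "<name>:<current version>".

```shell
# Print the `docker service update` flags that move every "<name>:<n>"
# argument from version n to n+1, one token per line.
flip_flags() {
  for pair in "$@"; do
    name=${pair%%:*}
    n=${pair##*:}
    printf '%s\n' \
      "--secret-rm" "busflow_${name}_v${n}" \
      "--secret-add" "source=busflow_${name}_v$((n + 1)),target=busflow_${name}"
  done
}

# Flip all listed secrets of one service in a single update call.
flip_service() {
  svc="$1"; shift
  # Word splitting of the flag list is intentional: the tokens contain no spaces.
  # shellcheck disable=SC2046
  docker service update $(flip_flags "$@") --force "$svc"
}
```

For example, `flip_service busflow_api db_writer:3 db_reader:3` would swap both URIs in one rollout, so no task starts with a v4 writer and a v3 reader.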
4. Verify
- Health probe on /healthz returns 200 across all replicas.
- #ops receives a "rotation complete" message from secrets-sync.yml.
- Grafana panel "Swarm Secret Versions" shows only _v<n+1> referenced (no service stuck on _v<n>).
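The "no service stuck on _v<n>" check can also be approximated from the command line, independent of the Grafana panel. The sketch below reads each service's spec with docker service inspect and filters for the old secret name; the function names and the example secret name are hypothetical.

```shell
# Emit "<service> <secret>" for every secret attached to every service.
list_service_secrets() {
  for svc in $(docker service ls --format '{{.Name}}'); do
    docker service inspect "$svc" \
      --format '{{range .Spec.TaskTemplate.ContainerSpec.Secrets}}{{.SecretName}}{{"\n"}}{{end}}' |
      awk -v svc="$svc" 'NF { print svc, $1 }'
  done
}

# Filter "<service> <secret>" lines on stdin down to the services that
# still reference the given (old) secret name.
services_on_secret() {
  awk -v old="$1" '$2 == old { print $1 }'
}
```

Running `list_service_secrets | services_on_secret busflow_db_writer_v3` should print nothing once every dependent service has been flipped.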
5. Cleanup
After 72 h and successful verification:

    docker secret rm busflow_<name>_v<n>

Never remove the old secret in the same maintenance window; it is the rollback path.
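A guard around the removal makes the 72 h rule harder to skip by accident. This is a minimal sketch that assumes the operator recorded the flip time as epoch seconds somewhere; how that timestamp is stored is not specified by this runbook, and the function names are hypothetical.

```shell
ROLLBACK_WINDOW_SECONDS=$((72 * 3600))

# Succeeds once at least 72 h separate the flip from "now" (both epoch seconds).
# The second argument defaults to the current time.
rollback_window_elapsed() {
  flipped_at="$1"
  now="${2:-$(date +%s)}"
  [ $((now - flipped_at)) -ge "$ROLLBACK_WINDOW_SECONDS" ]
}

# Remove the old version only after the window has passed.
cleanup_old_secret() {
  old="$1"; flipped_at="$2"
  if rollback_window_elapsed "$flipped_at"; then
    docker secret rm "$old"
  else
    echo "rollback window still open, keeping $old" >&2
    return 1
  fi
}
```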
Incident-driven rotation
Triggered by: leaked token in logs, compromised developer laptop, suspected exfiltration, vendor breach notification.
- Within 1 hour of discovery: revoke at the provider (token invalidation at the source is non-negotiable and must happen even before Swarm update).
- Run the quarterly rotation flow above in emergency mode: flip the new secret without the 72 h cleanup delay. The old secret is removed as soon as all services report healthy.
- File a security incident and notify @security-oncall via #security-alerts.
- If the leaked secret is a Postgres credential: additionally rotate the underlying DB role password and force a reconnect storm by bumping every dependent service via --force.
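In emergency mode the old secret goes as soon as services are healthy, so it helps to make "healthy" a mechanical check. The sketch below uses replica convergence from docker service ls as a proxy; it is an assumption that this is an acceptable stand-in, and it does not replace the /healthz probe. Function names are hypothetical.

```shell
# Reads "<service> <running>/<desired>" lines on stdin, prints every service
# whose running count differs from the desired count, and fails if any exist.
unconverged_services() {
  awk '{
    split($2, r, "/")
    if (r[1] + 0 != r[2] + 0) { print $1; bad = 1 }
  } END { exit bad + 0 }'
}

# Remove the old secret only once every service has converged.
emergency_cleanup() {
  old="$1"
  if docker service ls --format '{{.Name}} {{.Replicas}}' | unconverged_services; then
    docker secret rm "$old"
  else
    echo "services listed above are not converged, keeping $old" >&2
    return 1
  fi
}
```

Docker also refuses to remove a secret that a service still references, which gives a second line of defence if a flip was missed.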
Drift detection
Weekly check (Sunday 01:00 UTC):

    # Diff: secrets declared in compose vs. secrets present on the Swarm
    for secret in $(docker secret ls --format '{{.Name}}'); do
      # -F matches the name literally; -w matches whole words, so _v1 does not match _v10
      grep -qFw -- "$secret" docker/compose.*.yml || echo "ORPHAN: $secret"
    done
    for declared in $(grep -hoE 'busflow_[a-z_]+_v[0-9]+' docker/compose.*.yml | sort -u); do
      docker secret inspect "$declared" >/dev/null 2>&1 || echo "MISSING: $declared"
    done

An ORPHAN is usually a leftover from incomplete cleanup and is safe to delete, provided it is not a previous version still inside the 72 h rollback window. A MISSING is always a deploy blocker; investigate before the next CI run.
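Because a MISSING entry blocks deploys while an ORPHAN does not, the report can be turned into an exit code that a scheduler or CI job acts on. A minimal sketch, assuming the drift check above is saved as a script (the name drift-check.sh is hypothetical):

```shell
# Reads the drift report on stdin, echoes it unchanged, and fails if any
# MISSING line is present. ORPHAN lines are reported but do not fail the check.
gate_on_drift() {
  awk '{ print } /^MISSING: / { missing = 1 } END { exit missing + 0 }'
}

# Hypothetical weekly schedule (Sunday 01:00, on a host whose clock is UTC):
#   0 1 * * 0 ./drift-check.sh | ./gate-on-drift.sh
```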
Determinism note (tenant-scoped encryption)
pgsodium deterministic AEAD is used only on columns that must support equality lookups (e.g., tenant_api_keys.key_fingerprint). Determinism has been signed off by Legal/DPO precisely because the attack surface is scoped and documented in ADR-029. Do not extend deterministic AEAD to columns without re-engaging Legal/DPO.
Notes
- .env.production is retired and must never return. Any PR that re-introduces plaintext secrets at rest is rejected by CI (checked via check_no_env_production.sh).
- Hasura masking view (****$SUFFIX) is a defence-in-depth, not a substitute for rotation. Do not skip rotation because "Hasura masks it anyway".
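The contents of check_no_env_production.sh are not shown in this runbook. As an illustration only, a guard of this kind could be as small as the following sketch, which scans a list of tracked paths for a file named .env.production at any depth; the function name is hypothetical and the real script may check more than this.

```shell
# Reads file paths on stdin (one per line) and fails if any of them is a
# .env.production file at any directory depth.
reject_env_production() {
  awk -F/ '$NF == ".env.production" { print "forbidden file: " $0; bad = 1 }
           END { exit bad + 0 }'
}
```

In a repository it could be fed with `git ls-files | reject_env_production`.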