Busflow Docs

Internal documentation portal


Runbook: Secrets Rotation

ADR: ADR-029
Architecture: gdpr-strategy.md §5; infrastructure.md §2 (Secrets)
Workflow: .github/workflows/secrets-sync.yml

All infrastructure credentials are Docker Swarm secrets; all tenant-scoped credentials are stored encrypted via pgsodium (deterministic AEAD for equality-searchable fields, standard AEAD otherwise). .env.production has been retired and must never be reintroduced. This runbook covers the quarterly rotation, the incident-driven rotation, and the drift check.


Rotation cadence

| Secret class | Cadence | Owner |
| --- | --- | --- |
| Swarm infra secrets (busflow_db_*, SLACK_*_WEBHOOK, HCLOUD_TOKEN, UBICLOUD_TOKEN) | Quarterly | Platform |
| Tenant API configs (encrypted via pgsodium) | Annually, or on tenant request | Tenant CSM + DPO |
| Grafana SSO shared secrets | Quarterly | Observability |
| GitHub Actions deploy tokens | Quarterly | Platform |
| Manager host keys | On Terraform apply (see manager-failover-runbook.md) | Platform |

Quarterly rotation

Run on the first Monday of the quarter. Schedule via the Schedule skill or cron.
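Cron has no native expression for "first Monday of the quarter", so a common pattern is to fire the job on days 1-7 of each quarter's first month and let a small guard decide. The sketch below assumes calendar quarters (January, April, July, October); the function name is hypothetical and not part of the existing tooling.

```bash
# Hypothetical guard: succeed only on the first Monday of a calendar quarter.
# Arguments default to today: month (1-12), day of month, ISO weekday (1 = Monday).
is_first_monday_of_quarter() {
  local month="${1:-$(date +%m)}" day="${2:-$(date +%d)}" dow="${3:-$(date +%u)}"
  # Strip one leading zero so values such as "08" are not parsed as octal.
  month="${month#0}"
  day="${day#0}"
  # Assumed quarter starts: January, April, July, October.
  [ $(( (month - 1) % 3 )) -eq 0 ] || return 1
  # The first Monday of a month always falls within days 1-7.
  [ "$day" -le 7 ] || return 1
  [ "$dow" -eq 1 ]
}
```

The weekday test lives in the script rather than in the crontab entry because most cron implementations run a job when either the day-of-month or the day-of-week field matches, not only when both do.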

1. Generate

For each Swarm infra secret, generate a new value via the appropriate provider:

  • Postgres writer/reader URIs: ubicloud postgres rotate-password --database busflow --role <role>.
  • Slack webhooks: Slack Manager → rotate for the target channel.
  • HCLOUD_TOKEN / UBICLOUD_TOKEN: provider console; scope tokens to what they need (avoid root project scope).

2. Seed via secrets-sync.yml

The workflow writes the new value as busflow_<name>_v<n+1>. It must never overwrite _v<n>: rotation is by new versioned secret, not by value swap (Docker Swarm secrets are immutable after creation).
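To illustrate the versioning rule, here is a minimal sketch of deriving the _v<n+1> name and creating it alongside the old one. It is not the content of secrets-sync.yml; both function names are hypothetical, and the only Docker call used is `docker secret create <name> -`, which reads the value from stdin.

```bash
# Derive the next versioned name, e.g. busflow_db_writer_v3 -> busflow_db_writer_v4.
next_secret_name() {
  local current="$1"
  local base="${current%_v*}"      # everything before the final _v<n>
  local version="${current##*_v}"  # the numeric suffix
  if [ "$base" = "$current" ]; then
    echo "not a versioned secret name: $current" >&2
    return 1
  fi
  case "$version" in
    ''|*[!0-9]*)
      echo "not a versioned secret name: $current" >&2
      return 1
      ;;
  esac
  echo "${base}_v$((version + 1))"
}

# Create the next version from stdin; the existing _v<n> is left untouched.
create_next_secret() {
  local next
  next="$(next_secret_name "$1")" || return 1
  docker secret create "$next" -
}
```

Reading the value from stdin keeps it out of the shell history and the process list.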

3. Atomic flip per service

For each dependent service:

bash
docker service update \
  --secret-rm busflow_<name>_v<n> \
  --secret-add source=busflow_<name>_v<n+1>,target=busflow_<name> \
  --force \
  busflow_<svc>

Pay attention to services that consume multiple related secrets (e.g., api uses both writer and reader URIs): flip them in a single docker service update call so the service never sees a mismatched pair. This mirrors the cutover runbook's atomic-flip rule.
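For a service with several related secrets, the single-call flip can be assembled mechanically. The helper below is a hypothetical sketch that only prints the command for review; the secret names db_writer and db_reader in the usage note are illustrative assumptions, not confirmed names.

```bash
# Print (do not run) one `docker service update` that flips every listed
# secret from _v<old> to _v<new> in the same call.
# Usage: flip_command <svc> <old_n> <new_n> <name>...
flip_command() {
  if [ "$#" -lt 4 ]; then
    echo "usage: flip_command <svc> <old_n> <new_n> <name>..." >&2
    return 1
  fi
  local svc="$1" old="$2" new="$3" name args=""
  shift 3
  for name in "$@"; do
    args="$args --secret-rm busflow_${name}_v${old}"
    args="$args --secret-add source=busflow_${name}_v${new},target=busflow_${name}"
  done
  echo "docker service update${args} --force busflow_${svc}"
}
```

Calling `flip_command api 3 4 db_writer db_reader` prints a single command containing both --secret-rm/--secret-add pairs, so the writer and reader URIs change in the same service update.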

4. Verify

  • Health probe on /healthz returns 200 across all replicas.
  • #ops receives a "rotation complete" message from secrets-sync.yml.
  • Grafana panel "Swarm Secret Versions" shows only _v<n+1> referenced (no service stuck on _v<n>).
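The last check can also be done from the command line when the Grafana panel is unavailable. This is a sketch under the assumption that secret names follow the busflow_<name>_v<n> pattern; the function name is hypothetical.

```bash
# Read secret names on stdin and print those still pinned to _v<old>.
# Exit status: 0 when nothing is stale, 1 when at least one name is.
stale_secrets() {
  local old="$1" stale
  stale="$(grep -E "_v${old}\$" || true)"
  if [ -n "$stale" ]; then
    printf '%s\n' "$stale"
    return 1
  fi
}

# Example (service name and version are placeholders):
# docker service inspect \
#   --format '{{range .Spec.TaskTemplate.ContainerSpec.Secrets}}{{println .SecretName}}{{end}}' \
#   busflow_api | stale_secrets 3
```

The end-of-line anchor keeps _v3 from matching _v13 or _v30.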

5. Cleanup

After 72 h and successful verification:

bash
docker secret rm busflow_<name>_v<n>

Never remove the old secret in the same maintenance window; it is the rollback path.
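The 72 h hold can be enforced mechanically instead of relying on memory. The sketch below assumes the operator recorded the flip time as Unix epoch seconds; how that timestamp is stored is left open, and the function name is hypothetical.

```bash
# Succeed only if at least 72 h have elapsed since the flip.
# Both arguments are Unix epoch seconds; the second defaults to now.
cleanup_due() {
  local flipped_at="$1" now="${2:-$(date +%s)}"
  local min_age=$((72 * 3600))
  [ $((now - flipped_at)) -ge "$min_age" ]
}

# Example (timestamp and secret name are placeholders):
# cleanup_due 1700000000 && docker secret rm busflow_<name>_v<n>
```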

Incident-driven rotation

Triggered by: leaked token in logs, compromised developer laptop, suspected exfiltration, vendor breach notification.

  1. Within 1 hour of discovery: revoke at the provider (token invalidation at the source is non-negotiable and must happen even before the Swarm update).
  2. Run the quarterly rotation flow above in emergency mode: flip the new secret without the 72 h cleanup delay. The old secret is removed as soon as all services report healthy.
  3. File a security incident and notify @security-oncall via #security-alerts.
  4. If the leaked secret is a Postgres credential: additionally rotate the underlying DB role password and force a reconnect storm by bumping every dependent service via --force.
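The reconnect storm in step 4 amounts to a forced update of each dependent service. A minimal sketch follows, assuming the list of dependent services is known and passed in by the operator; the function name and the service names in the usage note are hypothetical.

```bash
# Force a rolling restart of each named service so that pooled database
# connections are re-established with the rotated credential.
# Stops at the first service that fails to update.
force_reconnect() {
  local svc
  for svc in "$@"; do
    echo "bumping ${svc}"
    docker service update --force "$svc" || return 1
  done
}
```

For example, `force_reconnect busflow_api busflow_worker` restarts both services in order. Stopping at the first failure is a design choice here: it leaves the remaining services on their current tasks while the failure is investigated.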

Drift detection

Weekly check (Sunday 01:00 UTC):

bash
# Diff: secrets declared in compose vs. secrets present on the Swarm.
# -F matches the name literally; -w stops _v1 from matching inside _v12.
for secret in $(docker secret ls --format '{{.Name}}'); do
  grep -qwF -- "$secret" docker/compose.*.yml || echo "ORPHAN: $secret"
done
# Every versioned secret referenced in compose must exist on the Swarm.
for declared in $(grep -hoE 'busflow_[a-z_]+_v[0-9]+' docker/compose.*.yml | sort -u); do
  docker secret inspect "$declared" >/dev/null 2>&1 || echo "MISSING: $declared"
done

An ORPHAN is usually a leftover from incomplete cleanup and is safe to delete. A MISSING is always a deploy blocker; investigate before the next CI run.

Determinism note (tenant-scoped encryption)

pgsodium deterministic AEAD is used only on columns that must support equality lookups (e.g., tenant_api_keys.key_fingerprint). Determinism has been signed off by Legal/DPO precisely because the attack surface is scoped and documented in ADR-029. Do not extend deterministic AEAD to columns without re-engaging Legal/DPO.

Notes

  • .env.production is retired and must never return. Any PR that re-introduces plaintext secrets at rest is rejected by CI (checked via check_no_env_production.sh).
  • The Hasura masking view (****$SUFFIX) is a defence-in-depth measure, not a substitute for rotation. Do not skip rotation because "Hasura masks it anyway".
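As an illustration of the CI guard mentioned in the first note, a file-name check could look roughly like the sketch below. This is not the actual content of check_no_env_production.sh, which is not shown here; it only detects a tracked file named .env.production and does not scan for plaintext secrets in other files. The function name is hypothetical.

```bash
# Read file paths on stdin; report any file named exactly .env.production.
# Exit status: 0 when none is found, 1 otherwise.
find_env_production() {
  local hits
  hits="$(grep -E '(^|/)\.env\.production$' || true)"
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits" | sed 's/^/FORBIDDEN: /'
    return 1
  fi
}

# Example: git ls-files | find_env_production
```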
