
# Runbook – Manager Failover & Host Key Rotation

ADR: ADR-023, ADR-025 · Architecture: infrastructure.md §1 · Workflow: `.github/workflows/deploy.yml` (consumes `PRODUCTION_MANAGER_HOST_KEYS`)

Swarm runs three managers in fsn1 with an odd quorum. Each manager has a unique SSH host key provisioned by Terraform (`tls_private_key.host_key_manager[count.index]`). There is no `StrictHostKeyChecking=no` in CI – the `known_hosts` list is explicit and rotates with `terraform apply`. This runbook covers planned replacement and emergency failover.
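How CI consumes the rotated keys, as a minimal sketch (the authoritative step lives in `.github/workflows/deploy.yml`; the exact shell here is an assumption):

```bash
# Materialise the secret into known_hosts; SSH then fails closed on any
# host key that Terraform has not published.
mkdir -p ~/.ssh
printf '%s\n' "$PRODUCTION_MANAGER_HOST_KEYS" > ~/.ssh/known_hosts
chmod 600 ~/.ssh/known_hosts
```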


## Pre-conditions (planned replacement)

- [ ] Quorum is healthy: `docker node ls` shows 3 Reachable managers (see the check below).
- [ ] A recent `pg_dump` has been verified by the nightly backup-verify job.
- [ ] No deploy or cutover is in progress.
- [ ] Maintenance window posted in #ops with expected duration (≤ 30 min).
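A quick way to eyeball manager health (worker rows have an empty MANAGER STATUS column and are filtered out):

```bash
# Expect 3 rows: one Leader, two Reachable.
docker node ls --format '{{.Hostname}} {{.ManagerStatus}}' | awk 'NF > 1'
```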

## Planned replacement

### 1. Drain

```bash
docker node update --availability drain <manager-to-replace>
```

Wait for `docker node ps <manager-to-replace>` to show zero Running tasks. Do not force-drain: the stateful_worker tier can tolerate a task move, but an abrupt drain paired with a rolling restart can collapse quorum.
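A hypothetical polling loop for the wait (the placeholder hostname matches the command above):

```bash
# Exit once no task on the node reports a Running current state.
while docker node ps <manager-to-replace> --format '{{.CurrentState}}' | grep -q '^Running'; do
  sleep 5
done
```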

### 2. Demote

```bash
docker node demote <manager-to-replace>
```

Confirm the remaining two managers still show Leader and Reachable status before proceeding (see the check below).
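Two quick reads on the survivors:

```bash
# Expect "managers: 2", and one of the two rows showing Leader.
docker info --format 'managers: {{.Swarm.Managers}}'
docker node ls --format '{{.Hostname}} {{.ManagerStatus}}' | awk 'NF > 1'
```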

### 3. Remove

```bash
docker node rm <manager-to-replace>
```

### 4. Terraform apply – rotate fingerprint

```bash
cd terraform/environments/production
terraform apply \
  -target=module.swarm.random_id.manager_host_key_nonce \
  -target=module.swarm.hcloud_server.manager
```

This regenerates `tls_private_key.host_key_manager[count.index]` for the replaced index and updates the `manager_host_keys` output map. The output is consumed by CI via `PRODUCTION_MANAGER_HOST_KEYS`.
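For orientation, the output maps hostname to public host key line; the shape below is illustrative only (real hostnames and key material come from state):

```bash
terraform output -json manager_host_keys
# Illustrative:
# {
#   "manager-0": "ssh-ed25519 AAAAC3...",
#   "manager-1": "ssh-ed25519 AAAAC3...",
#   "manager-2": "ssh-ed25519 AAAAC3..."
# }
```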

### 5. Push the new fingerprint to GitHub

```bash
terraform output -json manager_host_keys \
  | jq -r '. | to_entries | map("\(.key) \(.value)") | .[]' \
  > /tmp/manager_host_keys.known

gh secret set PRODUCTION_MANAGER_HOST_KEYS \
  --repo busflow/busflow \
  --body "$(cat /tmp/manager_host_keys.known)"
```

Then delete `/tmp/manager_host_keys.known` (`shred`, not `rm`; see below). The secret is the sole source of truth for CI's `known_hosts` – no manual edits elsewhere.
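The overwrite-then-unlink form:

```bash
# -u unlinks the file after overwriting, so the key list never lingers on disk.
shred -u /tmp/manager_host_keys.known
```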

### 6. Re-join and promote

Terraform's cloud-init on the new instance runs the Swarm join script via `local.docker_install_script` (the `user_data` split from ADR-023). Once `docker node ls` shows the new host as Reachable:

```bash
docker node promote <new-manager-hostname>
```
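To confirm the promotion took (same hostname placeholder as above):

```bash
# Should print "reachable" once the node participates in Raft as a manager.
docker node inspect <new-manager-hostname> --format '{{.ManagerStatus.Reachability}}'
```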

### 7. Verify

- `docker node ls` → 3 Reachable managers, 1 Leader.
- Re-run the latest deploy workflow with a no-op change and confirm CI authenticates against all 3 managers (commands below).
- Grafana panel "Swarm Raft Health" is green.
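The no-op change, assuming the deploy workflow triggers on push to the branch's upstream:

```bash
git commit --allow-empty -m "chore: verify manager SSH"
git push   # the deploy workflow should now SSH to all 3 managers
```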

### 8. TOFU window

Close the "trust-on-first-use" window: any CI run between steps 5 and 7 should have failed closed, because the new host key was not yet authorised. If any CI run succeeded with an unknown host key, investigate: `StrictHostKeyChecking=no` has sneaked in somewhere.
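One way to audit the window, assuming the workflow file name referenced above:

```bash
# Runs that started between steps 5 and 7 should show failure (failed closed);
# a success in that window means an unknown host key was accepted.
gh run list --workflow deploy.yml --limit 20
```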


## Emergency failover (Leader lost)

Triggered by: Leader unreachable > 2 min; two managers simultaneously Unreachable; Raft split detected.

1. On any remaining Reachable manager, confirm quorum loss: `docker node ls` reports only 1 Reachable.
2. Do not force a new election manually. Swarm elects automatically once ≥ 2 managers are Reachable again.
3. Restart the unreachable managers one at a time (prefer the oldest survivor first to preserve Raft log continuity).
4. If ≥ 2 managers are lost irrecoverably, this is a DR scenario, not a failover. Follow the DR plan (recover from Hetzner snapshot; re-initialise via `docker swarm init --force-new-cluster` on the survivor, as sketched below; re-provision the lost managers). The Raft log lives only on managers; the data-loss window ≈ the last Raft snapshot interval (default 10k entries).
5. After recovery, rotate all three managers' host keys via the Planned replacement flow above; the DR event is a trust event.
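A sketch of the survivor-side step from item 4 only (the DR plan is authoritative):

```bash
# On the sole surviving manager: rebuild a one-manager cluster that keeps
# existing Raft state, services, and secrets.
docker swarm init --force-new-cluster
# The survivor is now sole Leader; re-provision the lost managers via
# Terraform and promote them once Reachable.
docker node ls
```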

## Notes

- The `StrictHostKeyChecking=no` workaround is permanently banned. Any PR reintroducing it is rejected by CI (`check_deploy_yml.sh`; see the sketch after this list).
- The three managers stay pinned to fsn1 per ADR-023. Do not spread across fsn1 + nbg1 + hel1 chasing availability: Raft latency across those distances collapses election windows.
- The volume plugin (costela/docker-volume-hetzner) must be healthy on any replacement manager before promotion; the stateful_worker tier depends on it. Check `docker plugin ls`.
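The guard's likely shape, as a minimal sketch (the real logic lives in `check_deploy_yml.sh`):

```bash
# Fail the build if the banned option reappears in the deploy workflow.
if grep -n 'StrictHostKeyChecking=no' .github/workflows/deploy.yml; then
  echo "ERROR: StrictHostKeyChecking=no is banned (see runbook notes)" >&2
  exit 1
fi
```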
