# Runbook: Manager Failover & Host Key Rotation
- ADR: ADR-023, ADR-025
- Architecture: `infrastructure.md` §1
- Workflow: `.github/workflows/deploy.yml` (consumes `PRODUCTION_MANAGER_HOST_KEYS`)
Swarm runs three managers in fsn1 with an odd quorum. Each manager has a unique SSH host key provisioned by Terraform (`tls_private_key.host_key_manager[count.index]`). There is no `StrictHostKeyChecking=no` in CI: the `known_hosts` list is explicit and rotates with each `terraform apply`. This runbook covers planned replacement and emergency failover.
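For context, the CI side of this contract could look like the sketch below. This is an assumption about the shape of the step, not the real contents of `deploy.yml`: the step name, file name `known_hosts.ci`, and sample keys are all placeholders.

```shell
# Hypothetical CI step (names are placeholders; the real wiring lives in
# .github/workflows/deploy.yml): write the secret verbatim to a dedicated
# known_hosts file and point ssh at it with strict checking ON.
MANAGER_HOST_KEYS='manager-0 ssh-ed25519 AAAAC3Example0
manager-1 ssh-ed25519 AAAAC3Example1
manager-2 ssh-ed25519 AAAAC3Example2'   # stand-in for secrets.PRODUCTION_MANAGER_HOST_KEYS

printf '%s\n' "$MANAGER_HOST_KEYS" > known_hosts.ci
chmod 600 known_hosts.ci

# Deploy commands would then run as, e.g.:
#   ssh -o UserKnownHostsFile=known_hosts.ci -o StrictHostKeyChecking=yes deploy@manager-0 ...
```

Because the file is written from the secret on every run and nothing else is trusted, a rotated fingerprint takes effect the moment the secret is updated.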
## Pre-conditions (planned replacement)
- [ ] Quorum is healthy: `docker node ls` shows 3 Reachable managers.
- [ ] A recent `pg_dump` has been verified by the nightly backup-verify job.
- [ ] No in-progress deploy or cutover.
- [ ] Maintenance window posted in #ops with expected duration (≤ 30 min).
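The quorum check above can be scripted. The helper below is a sketch, not something that exists in the repo; it counts manager rows (Leader counts as reachable for quorum purposes) from `docker node ls` output on stdin.

```shell
# Sketch (helper not in the repo): count manager rows whose MANAGER STATUS
# is Leader or Reachable, reading `docker node ls` output from stdin.
count_reachable_managers() {
  grep -Ec 'Leader|Reachable'
}

# Pre-flight usage (fails the check unless all 3 managers are healthy):
#   [ "$(docker node ls | count_reachable_managers)" -eq 3 ] || exit 1
```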
## Planned replacement
### 1. Drain

```shell
docker node update --availability drain <manager-to-replace>
```

Wait for `docker node ps <manager-to-replace>` to show zero Running tasks. Do not force-drain: the `stateful_worker` tier can tolerate a task move, but an abrupt drain can collapse quorum if paired with a rolling restart.
### 2. Demote

```shell
docker node demote <manager-to-replace>
```

Confirm the two remaining managers still report Leader and Reachable status before proceeding.
### 3. Remove

```shell
docker node rm <manager-to-replace>
```

### 4. Terraform apply (rotate fingerprint)

```shell
cd terraform/environments/production
terraform apply \
  -target=module.swarm.random_id.manager_host_key_nonce \
  -target=module.swarm.hcloud_server.manager
```

This regenerates `tls_private_key.host_key_manager[count.index]` for the replaced index and updates the `manager_host_keys` output map. CI consumes that output via the `PRODUCTION_MANAGER_HOST_KEYS` secret.
### 5. Push the new fingerprint to GitHub

```shell
terraform output -json manager_host_keys \
  | jq -r 'to_entries | map("\(.key) \(.value)") | .[]' \
  > /tmp/manager_host_keys.known

gh secret set PRODUCTION_MANAGER_HOST_KEYS \
  --repo busflow/busflow \
  --body "$(cat /tmp/manager_host_keys.known)"
```

Then delete `/tmp/manager_host_keys.known` (`shred -u`, not `rm`). The secret is the sole source of truth for CI's `known_hosts`; no manual edits elsewhere.
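A malformed line in that file silently breaks CI's host verification, so it is worth sanity-checking before `gh secret set`. The validator below is a sketch, not part of the workflow; the accepted key types are an assumption.

```shell
# Sketch: refuse to push unless every line is "<host> <key-type> <base64-key>".
validate_known_hosts() {
  [ -s "$1" ] || return 1   # an empty file is never valid key material
  # Fail if any line does NOT match the expected known_hosts shape.
  ! grep -Evq '^[A-Za-z0-9._-]+ (ssh-ed25519|ecdsa-sha2-nistp256|ssh-rsa) [A-Za-z0-9+/=]+$' "$1"
}

# validate_known_hosts /tmp/manager_host_keys.known || { echo "bad key material" >&2; exit 1; }
```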
### 6. Re-join and promote

Terraform's cloud-init on the new instance runs the Swarm join script via `local.docker_install_script` (the `user_data` split from ADR-023). Once `docker node ls` shows the new host as Reachable:

```shell
docker node promote <new-manager-hostname>
```

### 7. Verify
- `docker node ls` shows 3 Reachable managers and 1 Leader.
- Re-run the latest deploy workflow with a no-op change (`git commit --allow-empty -m "chore: verify manager SSH"`) and confirm CI authenticates against all 3 managers.
- Grafana panel "Swarm Raft Health" is green.
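As an extra verification step, the live host keys can be scanned and diffed against the authorised list. The helper below is a sketch (not in the workflow today); in production the scanned file would come from `ssh-keyscan -t ed25519 <manager hosts> > scanned.keys`.

```shell
# Sketch: compare freshly scanned host keys against the authorised list,
# ignoring line order. Any diff means a manager is serving a key that was
# never pushed to PRODUCTION_MANAGER_HOST_KEYS.
host_keys_match() {
  # $1 = scanned keys, $2 = authorised known_hosts material
  sort "$1" > "$1.sorted"
  sort "$2" > "$2.sorted"
  diff "$1.sorted" "$2.sorted" > /dev/null
}

# ssh-keyscan -t ed25519 manager-0 manager-1 manager-2 > scanned.keys
# host_keys_match scanned.keys /tmp/manager_host_keys.known || echo "key drift!" >&2
```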
### 8. TOFU window

Close the trust-on-first-use window: any CI run between steps 5 and 7 should have failed closed, because the new host key was not yet authorised. If any CI run succeeded against an unknown host key, investigate: `StrictHostKeyChecking=no` has sneaked in somewhere.
## Emergency failover (Leader lost)
Triggered by: Leader unreachable > 2 min; two managers simultaneously Unreachable; Raft split detected.
- On any remaining Reachable manager, confirm quorum loss: `docker node ls` reports only 1 Reachable.
- Do not force a new election manually. Swarm elects automatically once ≥ 2 managers are Reachable again.
- Restart the unreachable managers one at a time (prefer the oldest survivor first, to preserve Raft log continuity).
- If ≥ 2 managers are lost irrecoverably, this is a DR scenario, not a failover. Follow the DR plan: recover from the Hetzner snapshot; run `docker swarm init --force-new-cluster` on the survivor; re-provision the lost managers. The Raft log lives only on managers, so the data-loss window ≈ the last Raft snapshot interval (default 10k entries).
- After recovery, rotate all three managers' host keys via the Planned Replacement flow above: the DR event is a trust event.
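The thresholds above fall out of Raft's quorum arithmetic, shown here as a quick worked example:

```shell
# Raft quorum arithmetic: n managers need floor(n/2)+1 reachable to elect
# a leader, and tolerate floor((n-1)/2) failures. With n=3 that is a quorum
# of 2 and exactly one tolerated failure; losing two managers is DR, not
# failover.
n=3
quorum=$(( n / 2 + 1 ))
tolerated=$(( (n - 1) / 2 ))
echo "quorum=$quorum tolerated_failures=$tolerated"
```

This is also why the managers run as an odd-sized set: going from 3 to 4 managers raises the quorum to 3 without tolerating any additional failures.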
## Notes
- The `StrictHostKeyChecking=no` workaround is permanently banned. Any PR reintroducing it is rejected by CI (`check_deploy_yml.sh`).
- The three managers stay pinned to `fsn1` per ADR-023. Do not spread across `fsn1` + `nbg1` + `hel1` chasing availability: Raft latency across those distances collapses election windows.
- The volume plugin (`costela/docker-volume-hetzner`) must be healthy on any replacement manager before promotion; the `stateful_worker` tier depends on it. Check `docker plugin ls`.
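The core idea behind the CI guard mentioned above can be sketched as below. This is not the real contents of `check_deploy_yml.sh`, only the essential check, with the function name as a placeholder.

```shell
# Sketch of the guard's core: fail if the banned option appears anywhere
# in the given workflow file.
check_no_tofu_bypass() {
  # $1 = workflow file; non-zero return means the banned option is present
  ! grep -q 'StrictHostKeyChecking=no' "$1"
}

# In CI:
#   check_no_tofu_bypass .github/workflows/deploy.yml \
#     || { echo "StrictHostKeyChecking=no is banned" >&2; exit 1; }
```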