# Runbook: Manager Failover & Host Key Rotation
- ADR: ADR-023, ADR-025
- Architecture: `infrastructure.md` §1
- Workflow: `.github/workflows/deploy.yml` (consumes `PRODUCTION_MANAGER_HOST_KEYS`)
Swarm runs three managers in fsn1 with an odd quorum. Each manager has a unique SSH host key provisioned by Terraform (`tls_private_key.host_key_manager[count.index]`). There is no `StrictHostKeyChecking=no` in CI: the `known_hosts` list is explicit and rotates with each `terraform apply`. This runbook covers planned replacement and emergency failover.
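For context, the CI side of this contract could look like the sketch below. This is an assumption about the shape of the step, not the real contents of `deploy.yml`: the step name, file name `known_hosts.ci`, and sample keys are all placeholders.

```shell
# Hypothetical CI step (names are placeholders; the real wiring lives in
# .github/workflows/deploy.yml): write the secret verbatim to a dedicated
# known_hosts file and point ssh at it with strict checking ON.
MANAGER_HOST_KEYS='manager-0 ssh-ed25519 AAAAC3Example0
manager-1 ssh-ed25519 AAAAC3Example1
manager-2 ssh-ed25519 AAAAC3Example2'   # stand-in for secrets.PRODUCTION_MANAGER_HOST_KEYS

printf '%s\n' "$MANAGER_HOST_KEYS" > known_hosts.ci
chmod 600 known_hosts.ci

# Deploy commands would then run as, e.g.:
#   ssh -o UserKnownHostsFile=known_hosts.ci -o StrictHostKeyChecking=yes deploy@manager-0 ...
```

Because the file is written from the secret on every run and nothing else is trusted, a rotated fingerprint takes effect the moment the secret is updated.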
## Pre-conditions (planned replacement)
- [ ] Quorum is healthy: `docker node ls` shows 3 Reachable managers.
- [ ] A recent `pg_dump` has been verified by the nightly backup-verify job.
- [ ] No in-progress deploy or cutover.
- [ ] Maintenance window posted in #ops with expected duration (≤ 30 min).
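The quorum check above can be scripted. The helper below is a sketch, not something that exists in the repo; it counts manager rows (Leader counts as reachable for quorum purposes) from `docker node ls` output on stdin.

```shell
# Sketch (helper not in the repo): count manager rows whose MANAGER STATUS
# is Leader or Reachable, reading `docker node ls` output from stdin.
count_reachable_managers() {
  grep -Ec 'Leader|Reachable'
}

# Pre-flight usage (fails the check unless all 3 managers are healthy):
#   [ "$(docker node ls | count_reachable_managers)" -eq 3 ] || exit 1
```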
## Planned replacement
### 1. Drain

```shell
docker node update --availability drain <manager-to-replace>
```

Wait for `docker node ps <manager-to-replace>` to show zero Running tasks. Do not force-drain: the `stateful_worker` tier can tolerate a task move, but an abrupt drain can collapse quorum if paired with a rolling restart.
### 2. Demote

```shell
docker node demote <manager-to-replace>
```

Confirm the two remaining managers still report Leader and Reachable status before proceeding.
### 3. Remove

```shell
docker node rm <manager-to-replace>
```

### 4. Terraform apply (rotate fingerprint)

```shell
cd terraform/environments/production
terraform apply \
  -target=module.swarm.random_id.manager_host_key_nonce \
  -target=module.swarm.hcloud_server.manager
```

This regenerates `tls_private_key.host_key_manager[count.index]` for the replaced index and updates the `manager_host_keys` output map. CI consumes that output via the `PRODUCTION_MANAGER_HOST_KEYS` secret.
### 5. Push the new fingerprint to GitHub

```shell
terraform output -json manager_host_keys \
  | jq -r 'to_entries | map("\(.key) \(.value)") | .[]' \
  > /tmp/manager_host_keys.known

gh secret set PRODUCTION_MANAGER_HOST_KEYS \
  --repo busflow/busflow \
  --body "$(cat /tmp/manager_host_keys.known)"
```

Then delete `/tmp/manager_host_keys.known` (`shred -u`, not `rm`). The secret is the sole source of truth for CI's `known_hosts`; no manual edits elsewhere.
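A malformed line in that file silently breaks CI's host verification, so it is worth sanity-checking before `gh secret set`. The validator below is a sketch, not part of the workflow; the accepted key types are an assumption.

```shell
# Sketch: refuse to push unless every line is "<host> <key-type> <base64-key>".
validate_known_hosts() {
  [ -s "$1" ] || return 1   # an empty file is never valid key material
  # Fail if any line does NOT match the expected known_hosts shape.
  ! grep -Evq '^[A-Za-z0-9._-]+ (ssh-ed25519|ecdsa-sha2-nistp256|ssh-rsa) [A-Za-z0-9+/=]+$' "$1"
}

# validate_known_hosts /tmp/manager_host_keys.known || { echo "bad key material" >&2; exit 1; }
```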
### 6. Re-join and promote

Terraform's cloud-init on the new instance runs the Swarm join script via `local.docker_install_script` (the `user_data` split from ADR-023). Once `docker node ls` shows the new host as Reachable:

```shell
docker node promote <new-manager-hostname>
```

### 7. Verify
- `docker node ls` shows 3 Reachable managers and 1 Leader.
- Re-run the latest deploy workflow with a no-op change (`git commit --allow-empty -m "chore: verify manager SSH"`) and confirm CI authenticates against all 3 managers.
- Grafana panel "Swarm Raft Health" is green.
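As an extra verification step, the live host keys can be scanned and diffed against the authorised list. The helper below is a sketch (not in the workflow today); in production the scanned file would come from `ssh-keyscan -t ed25519 <manager hosts> > scanned.keys`.

```shell
# Sketch: compare freshly scanned host keys against the authorised list,
# ignoring line order. Any diff means a manager is serving a key that was
# never pushed to PRODUCTION_MANAGER_HOST_KEYS.
host_keys_match() {
  # $1 = scanned keys, $2 = authorised known_hosts material
  sort "$1" > "$1.sorted"
  sort "$2" > "$2.sorted"
  diff "$1.sorted" "$2.sorted" > /dev/null
}

# ssh-keyscan -t ed25519 manager-0 manager-1 manager-2 > scanned.keys
# host_keys_match scanned.keys /tmp/manager_host_keys.known || echo "key drift!" >&2
```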
### 8. TOFU window

Close the trust-on-first-use window: any CI run between steps 5 and 7 should have failed closed, because the new host key was not yet authorised. If any CI run succeeded against an unknown host key, investigate: `StrictHostKeyChecking=no` has sneaked in somewhere.
## Emergency failover (Leader lost)
Triggered by: Leader unreachable > 2 min; two managers simultaneously Unreachable; Raft split detected.
- On any remaining Reachable manager, confirm quorum loss: `docker node ls` reports only 1 Reachable.
- Do not force a new election manually. Swarm elects automatically once ≥ 2 managers are Reachable again.
- Restart the unreachable managers one at a time (prefer the oldest survivor first, to preserve Raft log continuity).
- If ≥ 2 managers are lost irrecoverably, this is a DR scenario, not a failover. Follow the DR plan: recover from the Hetzner snapshot; run `docker swarm init --force-new-cluster` on the survivor; re-provision the lost managers. The Raft log lives only on managers, so the data-loss window ≈ the last Raft snapshot interval (default 10k entries).
- After recovery, rotate all three managers' host keys via the Planned Replacement flow above: the DR event is a trust event.
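The thresholds above fall out of Raft's quorum arithmetic, shown here as a quick worked example:

```shell
# Raft quorum arithmetic: n managers need floor(n/2)+1 reachable to elect
# a leader, and tolerate floor((n-1)/2) failures. With n=3 that is a quorum
# of 2 and exactly one tolerated failure; losing two managers is DR, not
# failover.
n=3
quorum=$(( n / 2 + 1 ))
tolerated=$(( (n - 1) / 2 ))
echo "quorum=$quorum tolerated_failures=$tolerated"
```

This is also why the managers run as an odd-sized set: going from 3 to 4 managers raises the quorum to 3 without tolerating any additional failures.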
## Notes
- The `StrictHostKeyChecking=no` workaround is permanently banned. Any PR reintroducing it is rejected by CI (`check_deploy_yml.sh`).
- The three managers stay pinned to `fsn1` per ADR-023. Do not spread across `fsn1` + `nbg1` + `hel1` chasing availability: Raft latency across those distances collapses election windows.
- The volume plugin (`costela/docker-volume-hetzner`) must be healthy on any replacement manager before promotion; the `stateful_worker` tier depends on it. Check `docker plugin ls`.
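The core idea behind the CI guard mentioned above can be sketched as below. This is not the real contents of `check_deploy_yml.sh`, only the essential check, with the function name as a placeholder.

```shell
# Sketch of the guard's core: fail if the banned option appears anywhere
# in the given workflow file.
check_no_tofu_bypass() {
  # $1 = workflow file; non-zero return means the banned option is present
  ! grep -q 'StrictHostKeyChecking=no' "$1"
}

# In CI:
#   check_no_tofu_bypass .github/workflows/deploy.yml \
#     || { echo "StrictHostKeyChecking=no is banned" >&2; exit 1; }
```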