Busflow Docs

Internal documentation portal


ADR-025: Swarm Manager Failover — SSH fingerprint rotation & TOFU policy

Status: 🟡 Proposed (pending architect approval)

Impacts: terraform/modules/swarm/, .github/workflows/deploy.yml, .github/workflows/terraform.yml, new runbook docs/protocols/manager-failover-runbook.md


Context

terraform/modules/swarm/main.tf currently uses StrictHostKeyChecking=no in its SSH provisioner blocks (lines 156, 159) and generates a single shared host key for all managers. This is a documented security anti-pattern: it accepts any key on first contact, permits silent man-in-the-middle on subsequent connections, and blurs the audit trail.

With multi-manager quorum (ADR-023) arriving, we have to address two questions:

  1. How do Terraform, GitHub Actions, and the on-call operator trust a new manager the first time it comes up?
  2. How do those same systems recognise that an existing manager has been replaced after a node loss?

Options Evaluated

| # | Option | Pros | Cons |
|---|--------|------|------|
| 1 | Keep `StrictHostKeyChecking=no` | No work required | Security audit finding; MITM possible in principle; no fingerprint accountability |
| 2 | Pre-generate a shared host key for all managers | Simple | All managers present the same fingerprint, so one key compromise exposes every node; no per-node accountability |
| 3 | Unique per-manager host keys + Terraform output map + CI TOFU window | Per-node accountability; rotating one manager only invalidates one fingerprint | Slightly more Terraform surface; CI secret grows by N fingerprints |

Decision

Option 3 — unique per-manager host keys, a Terraform output map, and a TOFU window in CI.

  • Generate per-node keys: tls_private_key.host_key_manager[count.index].
  • Install each key into the corresponding hcloud_server.manager[count.index] via cloud-init so the host presents it on first SSH.
  • Export all fingerprints as a Terraform output JSON map:

    ```hcl
    output "manager_host_keys" {
      value = { for i, k in tls_private_key.host_key_manager :
                "manager-${i}" => trimspace(k.public_key_openssh) }
      sensitive = false # fingerprints are not secret
    }
    ```
  • Update the PRODUCTION_MANAGER_HOST_KEYS GitHub secret after every manager add/replace. CI accepts any published fingerprint as valid.
  • Remove StrictHostKeyChecking=no from every provisioner block once the fingerprint map is in place. The TOFU window lives in CI, not in the Terraform code.
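
The CI-side check behind this decision can be sketched as a small POSIX shell gate: accept an observed host key only if it appears verbatim in the published fingerprint map. The key material and hostnames below are fabricated placeholders, and in the real workflow PUBLISHED_KEYS would come from the PRODUCTION_MANAGER_HOST_KEYS secret:

```shell
#!/bin/sh
# Published fingerprints, one per line (placeholder values; the real
# workflow would derive these from the PRODUCTION_MANAGER_HOST_KEYS secret).
PUBLISHED_KEYS="ssh-ed25519 AAAAC3Example0 root@manager-0
ssh-ed25519 AAAAC3Example1 root@manager-1"

# Host key observed during this CI run (normally captured via ssh-keyscan).
OBSERVED="ssh-ed25519 AAAAC3Example1 root@manager-1"

# Exact, whole-line match only: any drift blocks the job and forces
# a human decision, which is the behaviour this ADR wants.
if printf '%s\n' "$PUBLISHED_KEYS" | grep -qxF "$OBSERVED"; then
  echo "host key accepted"
else
  echo "host key NOT in published map - failing the job" >&2
  exit 1
fi
```

Because the match is exact (`grep -qxF`), a replaced manager presenting an unpublished key fails closed until the secret is updated.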

Consequences

Positive:

  • Audit-clean SSH posture. A fingerprint change on an existing manager now blocks CI and surfaces a human decision.
  • Manager replacement is a single known operation: regenerate that node's key, apply, push the new fingerprint to the GitHub secret.
  • Per-node accountability in the ~/.ssh/known_hosts audit trail.

Negative:

  • The PRODUCTION_MANAGER_HOST_KEYS GitHub secret grows with each manager. At 3 managers this is 3 fingerprints, which is acceptable and still manageable in the GitHub UI.

Neutral:

  • Runbook docs/protocols/manager-failover-runbook.md codifies the rotation steps: Terraform plan → apply → read new fingerprint from output → gh secret set PRODUCTION_MANAGER_HOST_KEYS → re-run the deploy workflow.
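
The rotation steps above translate to roughly the following command sequence, run from the root Terraform configuration. This is a sketch for the runbook, not the final text: the workflow filename and the assumption that `gh secret set` reads the value from stdin should be verified against the actual repository setup.

```shell
# 1-2. Plan and apply the replacement manager (regenerated host key included).
terraform plan
terraform apply

# 3-4. Read the refreshed fingerprint map from the Terraform output and
# push it into the GitHub secret in one step (gh reads the value from stdin).
terraform output -json manager_host_keys \
  | gh secret set PRODUCTION_MANAGER_HOST_KEYS

# 5. Re-run the deploy workflow against the updated secret
# (workflow filename assumed from this ADR's Impacts list).
gh workflow run deploy.yml
```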
