# ADR-025: Swarm Manager Failover — SSH fingerprint rotation & TOFU policy
Status: 🟡 Proposed — pending architect approval

Impacts: `terraform/modules/swarm/`, `.github/workflows/deploy.yml`, `.github/workflows/terraform.yml`, new runbook `docs/protocols/manager-failover-runbook.md`
## Context
`terraform/modules/swarm/main.tf` currently uses `StrictHostKeyChecking=no` in its SSH provisioner blocks (lines 156 and 159) and generates a single shared host key for all managers. This is a documented security anti-pattern: it accepts any key on first contact, permits silent man-in-the-middle attacks on subsequent connections, and blurs the audit trail.
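For orientation, the anti-pattern looks roughly like this (an illustrative sketch, not the actual `main.tf` contents; the `null_resource` wrapper, `var.manager_count`, and the remote command are placeholders):

```hcl
resource "null_resource" "provision_manager" {
  count = var.manager_count # hypothetical variable, for illustration

  # With StrictHostKeyChecking=no, ssh accepts whatever host key the
  # server presents and never flags a mismatch on later connects --
  # the silent-MITM window this ADR closes.
  provisioner "local-exec" {
    command = "ssh -o StrictHostKeyChecking=no root@${hcloud_server.manager[count.index].ipv4_address} 'docker swarm init'"
  }
}
```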
With multi-manager quorum (ADR-023) arriving, we have to address two questions:
- How do Terraform, GitHub Actions, and the on-call operator trust a new manager the first time it comes up?
- How does the same code recognise the replacement of an existing manager after a node loss?
## Options Evaluated
| # | Option | Pros | Cons |
|---|---|---|---|
| 1 | Keep `StrictHostKeyChecking=no` | No work required | Security audit finding; MITM possible in principle; no fingerprint accountability |
| 2 | Pre-generate a shared host key for all managers | Simple | Fingerprint collision exposes the same private material; no per-node accountability |
| 3 | Unique per-manager host keys + Terraform output map + CI TOFU window | Per-node accountability; rotating one manager only invalidates one fingerprint | Slightly more Terraform surface; CI secret grows by N fingerprints |
## Decision
Option 3 — unique per-manager host keys, a Terraform output map, and a TOFU window in CI.
- Generate per-node keys: `tls_private_key.host_key_manager[count.index]`.
- Install each key into the corresponding `hcloud_server.manager[count.index]` via cloud-init so the host presents it on first SSH.
- Export all fingerprints as a Terraform output JSON map:

  ```hcl
  output "manager_host_keys" {
    value = {
      for i, k in tls_private_key.host_key_manager :
      "manager-${i}" => trimspace(k.public_key_openssh)
    }
    sensitive = false # fingerprints are not secret
  }
  ```

- Update the `PRODUCTION_MANAGER_HOST_KEYS` GitHub secret after every manager add/replace. CI accepts any published fingerprint as valid.
- Remove `StrictHostKeyChecking=no` from every provisioner block once the fingerprint map is in place. The TOFU window lives in CI, not in the Terraform code.
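On the CI side, accepting "any published fingerprint" amounts to building a `known_hosts` file from the secret before any SSH. A sketch of that workflow step, assuming the secret holds the JSON map exactly as the Terraform output emits it (the `jq` filter, file name, and inlined sample data are illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# PRODUCTION_MANAGER_HOST_KEYS holds the Terraform output JSON map:
#   {"manager-0": "ssh-ed25519 AAAA...", "manager-1": "..."}
# In the real workflow this would come from the GitHub secret; inlined
# here with dummy keys so the sketch is self-contained.
HOST_KEYS_JSON='{"manager-0": "ssh-ed25519 AAAAexamplekey0", "manager-1": "ssh-ed25519 AAAAexamplekey1"}'

# Build one "<host> <keytype> <base64>" line per manager. Any key
# published in the map is trusted; anything else fails the connection
# once StrictHostKeyChecking=yes is in force.
jq -r 'to_entries[] | "\(.key) \(.value)"' <<<"$HOST_KEYS_JSON" > known_hosts.ci

cat known_hosts.ci
# ssh -o StrictHostKeyChecking=yes -o UserKnownHostsFile=known_hosts.ci core@manager-0 ...
```

This keeps the TOFU decision at secret-update time: the only way a new fingerprint becomes trusted is a human running `gh secret set` after an apply.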
## Consequences
Positive:
- Audit-clean SSH posture. A fingerprint change on an existing manager now blocks CI and surfaces a human decision.
- Manager replacement is a single known operation: regenerate that node's key, apply, push the new fingerprint to the GitHub secret.
- Per-node accountability in the `~/.ssh/known_hosts` audit trail.
Negative:
- The `PRODUCTION_MANAGER_HOST_KEYS` GitHub secret grows with each manager. At 3 managers this is 3 fingerprints. Acceptable — manageable in the UI.
Neutral:
- Runbook `docs/protocols/manager-failover-runbook.md` codifies the rotation steps: Terraform plan → apply → read the new fingerprint from the output → `gh secret set PRODUCTION_MANAGER_HOST_KEYS` → re-run the deploy workflow.
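The rotation steps above could be scripted roughly as follows. This is a sketch, not the runbook itself: the secret name, output name, and workflow file come from this ADR, but the taint address, plan-file name, node index, and the `run` dry-run wrapper are assumptions to verify against the actual module.

```shell
#!/usr/bin/env bash
set -euo pipefail

NODE_INDEX=1 # index of the manager being replaced (illustrative)

# Each step is echoed rather than executed so the sequence can be read
# without touching real infrastructure; drop run() to go live.
run() { echo "+ $*"; }

# 1. Force regeneration of the lost node's host key on the next apply.
run terraform taint "tls_private_key.host_key_manager[${NODE_INDEX}]"

# 2. Plan and apply the replacement server with its fresh key.
run terraform plan -out=failover.tfplan
run terraform apply failover.tfplan

# 3. Read the refreshed fingerprint map from the Terraform output.
run terraform output -json manager_host_keys

# 4. Push the map into the secret CI trusts.
run gh secret set PRODUCTION_MANAGER_HOST_KEYS

# 5. Re-run the deploy workflow so CI reconnects against the new key.
run gh workflow run deploy.yml
```

Because only the tainted node's key is regenerated, the other managers' fingerprints in the secret stay valid throughout the rotation.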