Busflow Docs

Internal documentation portal


ADR-025: Swarm Manager Failover — SSH fingerprint rotation & TOFU policy

Status: 🟡 Proposed (pending architect approval)

Impacts: terraform/modules/swarm/, .github/workflows/deploy.yml, .github/workflows/terraform.yml, new runbook docs/protocols/manager-failover-runbook.md


Context

terraform/modules/swarm/main.tf currently uses StrictHostKeyChecking=no in its SSH provisioner blocks (lines 156, 159) and generates a single shared host key for all managers. This is a documented security anti-pattern: it accepts any key on first contact, permits silent man-in-the-middle on subsequent connections, and blurs the audit trail.

With multi-manager quorum (ADR-023) arriving, we have to address two questions:

  1. How do Terraform, GitHub Actions, and the on-call operator trust a new manager the first time it comes up?
  2. How do those same systems recognise that an existing manager has been replaced after a node loss?

Options Evaluated

| # | Option | Pros | Cons |
|---|--------|------|------|
| 1 | Keep `StrictHostKeyChecking=no` | No work required | Security audit finding; MITM possible in principle; no fingerprint accountability |
| 2 | Pre-generate a shared host key for all managers | Simple | All managers present the same fingerprint, so one key compromise exposes every node; no per-node accountability |
| 3 | Unique per-manager host keys + Terraform output map + CI TOFU window | Per-node accountability; rotating one manager only invalidates one fingerprint | Slightly more Terraform surface; CI secret grows by N fingerprints |

Decision

Option 3 — unique per-manager host keys, a Terraform output map, and a TOFU window in CI.

  • Generate per-node keys: tls_private_key.host_key_manager[count.index].
  • Install each key into the corresponding hcloud_server.manager[count.index] via cloud-init so the host presents it on first SSH.
  • Export all fingerprints as a Terraform output JSON map:

    ```hcl
    output "manager_host_keys" {
      value = { for i, k in tls_private_key.host_key_manager :
                "manager-${i}" => trimspace(k.public_key_openssh) }
      sensitive = false # fingerprints are not secret
    }
    ```
  • Update the PRODUCTION_MANAGER_HOST_KEYS GitHub secret after every manager add/replace. CI accepts any published fingerprint as valid.
  • Remove StrictHostKeyChecking=no from every provisioner block once the fingerprint map is in place. The TOFU window lives in CI, not in the Terraform code.
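
The CI-side check behind this decision can be sketched as a small POSIX shell gate: accept an observed host key only if it appears verbatim in the published fingerprint map. The key material and hostnames below are fabricated placeholders, and in the real workflow PUBLISHED_KEYS would come from the PRODUCTION_MANAGER_HOST_KEYS secret:

```shell
#!/bin/sh
# Published fingerprints, one per line (placeholder values; the real
# workflow would derive these from the PRODUCTION_MANAGER_HOST_KEYS secret).
PUBLISHED_KEYS="ssh-ed25519 AAAAC3Example0 root@manager-0
ssh-ed25519 AAAAC3Example1 root@manager-1"

# Host key observed during this CI run (normally captured via ssh-keyscan).
OBSERVED="ssh-ed25519 AAAAC3Example1 root@manager-1"

# Exact, whole-line match only: any drift blocks the job and forces
# a human decision, which is the behaviour this ADR wants.
if printf '%s\n' "$PUBLISHED_KEYS" | grep -qxF "$OBSERVED"; then
  echo "host key accepted"
else
  echo "host key NOT in published map - failing the job" >&2
  exit 1
fi
```

Because the match is exact (`grep -qxF`), a replaced manager presenting an unpublished key fails closed until the secret is updated.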

Consequences

Positive:

  • Audit-clean SSH posture. A fingerprint change on an existing manager now blocks CI and surfaces a human decision.
  • Manager replacement is a single known operation: regenerate that node's key, apply, push the new fingerprint to the GitHub secret.
  • Per-node accountability in the ~/.ssh/known_hosts audit trail.

Negative:

  • The PRODUCTION_MANAGER_HOST_KEYS GitHub secret grows with each manager. At 3 managers this is 3 fingerprints, which is acceptable and still manageable in the GitHub UI.

Neutral:

  • Runbook docs/protocols/manager-failover-runbook.md codifies the rotation steps: Terraform plan → apply → read new fingerprint from output → gh secret set PRODUCTION_MANAGER_HOST_KEYS → re-run the deploy workflow.
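
The rotation steps above translate to roughly the following command sequence, run from the root Terraform configuration. This is a sketch for the runbook, not the final text: the workflow filename and the assumption that `gh secret set` reads the value from stdin should be verified against the actual repository setup.

```shell
# 1-2. Plan and apply the replacement manager (regenerated host key included).
terraform plan
terraform apply

# 3-4. Read the refreshed fingerprint map from the Terraform output and
# push it into the GitHub secret in one step (gh reads the value from stdin).
terraform output -json manager_host_keys \
  | gh secret set PRODUCTION_MANAGER_HOST_KEYS

# 5. Re-run the deploy workflow against the updated secret
# (workflow filename assumed from this ADR's Impacts list).
gh workflow run deploy.yml
```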
