ADR-023: Swarm Quorum Topology (fsn1 pin + odd manager count + dedicated workers)

Status: 🟡 Proposed (pending architect approval)
Impacts: terraform/modules/swarm/, terraform/environments/production/, infrastructure.md §1, docs/protocols/ (no existing runbook; to be added)


Context

The current terraform/modules/swarm/main.tf treats the cluster as single-manager only (condensed sketch after this list):

  • null_resource.swarm_init dereferences hcloud_server.manager[0]
  • hcloud_volume_attachment.data is bound to manager[0]
  • null_resource.swarm_join_worker pulls only the worker join token; no manager-join path exists
  • variables.tf exposes manager_count, but nothing downstream honours a value > 1
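
For orientation, a condensed sketch of that wiring. Resource names follow the list above; the provisioner bodies, connection details, and the hcloud_volume.data name are illustrative, not verbatim module code:

```hcl
# Current single-manager wiring: every reference is hard-coded to manager[0],
# so a manager_count > 1 changes nothing downstream.
resource "null_resource" "swarm_init" {
  connection {
    host = hcloud_server.manager[0].ipv4_address
  }
  provisioner "remote-exec" {
    inline = ["docker swarm init --advertise-addr ${hcloud_server.manager[0].ipv4_address}"]
  }
}

resource "hcloud_volume_attachment" "data" {
  volume_id = hcloud_volume.data.id        # illustrative volume name
  server_id = hcloud_server.manager[0].id  # the data tier rides the control plane
}
```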

This is acceptable for the MVP (see infrastructure.md §1 "Accepted MVP Risk") but becomes a blocker at the first 99.5% SLA commitment and at any tenant count where a control-plane outage costs more than the dedicated-node bill.

Three quieter risks surfaced during the same review:

  1. Stateful containers (Postgres / Redis) sit on the same volume-bound manager node: a manager failure is a data-tier failure.
  2. The LGTM observability stack will not fit on the existing cax21 workers (8 GB RAM) alongside application services.
  3. The current SSH provisioner block uses StrictHostKeyChecking=no, a security anti-pattern once the cluster goes multi-manager.

Options Evaluated

| # | Option | Pros | Cons |
|---|--------|------|------|
| 1 | 3 managers co-located in fsn1, no dedicated worker tiers | Minimal change | Stateful services still on managers; LGTM stack OOMs |
| 2 | 3 managers + dedicated tier=stateful worker (Postgres/Redis here) + dedicated tier=observability worker | Each failure domain is independent | One extra cpx21 (observability) + one extra worker (stateful) per env |
| 3 | 3 managers in fsn1, workers allowed in nbg1 | Regional failover possible | Raft consensus latency spikes; split-brain risk under network partitions |

Decision

Option 2: three managers pinned to fsn1 + dedicated stateful_worker (1×) + dedicated observability_worker (1×, cpx21+).

Activated when manager_count >= 3 (the full set of changes is sketched in HCL after this list):

  1. var.manager_count validation: odd (manager_count % 2 == 1). Raft tolerates floor((N-1)/2) failures, so an even count adds a node without adding fault tolerance: 3 managers survive 1 loss, while 4 still survive only 1.
  2. var.hetzner_location validation: contains(["fsn1"], value). Managers never straddle regions; Raft consensus latency stays under ~1 ms.
  3. Manager initialisation (existing null_resource.swarm_init against manager[0]) is unchanged.
  4. A new null_resource.manager_join_token runs after swarm_init, SSHes to manager[0], and captures the manager-scoped join token.
  5. A new null_resource.swarm_join_manager iterates count = manager_count - 1, joining manager[count.index + 1] using the manager token and manager[0]'s private IP on port 2377 (never the public IP).
  6. user_data is split into a shared local.docker_install_script; only manager[0] appends docker swarm init to it. Workers and additional managers receive the install script alone, so the null_resource join provisioners remain the exclusive mechanism for Swarm membership.
  7. A new hcloud_server.stateful_worker (count = 1, label tier=stateful) holds the persistent Hetzner Cloud Volume. Postgres and Redis are placement-constrained to node.labels.tier == stateful.
  8. A new hcloud_server.observability_worker (count = 1, label tier=observability, cpx21+) holds the LGTM stack (see observability.md).
  9. The stateful worker remains an acknowledged SPOF until ADR-022 (Ubicloud) lands; once Ubicloud is live, it retains only Redis.
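
A minimal sketch of the module changes above, assuming the existing hcloud_server.manager resources, an hcloud_server_network.manager attachment exposing each node's private IP, and the local.docker_install_script from item 6. SSH wiring, image names, and the .manager_token path are illustrative, not the final module:

```hcl
# Items 1-2: validations added to the existing variables.
variable "manager_count" {
  type    = number
  default = 3
  validation {
    condition     = var.manager_count % 2 == 1
    error_message = "manager_count must be odd: Raft tolerates floor((N-1)/2) failures, so even counts add cost without fault tolerance."
  }
}

variable "hetzner_location" {
  type    = string
  default = "fsn1"
  validation {
    condition     = contains(["fsn1"], var.hetzner_location)
    error_message = "Managers are pinned to fsn1 (ADR-023): Raft consensus must stay within one region."
  }
}

# Item 4: capture the manager-scoped join token after swarm_init.
# Assumption: the operator's SSH key is authorised on manager[0].
resource "null_resource" "manager_join_token" {
  depends_on = [null_resource.swarm_init]

  provisioner "local-exec" {
    command = "ssh root@${hcloud_server.manager[0].ipv4_address} 'docker swarm join-token -q manager' > ${path.module}/.manager_token"
  }
}

# Item 5: join the remaining managers over the private network on 2377.
resource "null_resource" "swarm_join_manager" {
  count      = var.manager_count - 1
  depends_on = [null_resource.manager_join_token]

  connection {
    type = "ssh"
    user = "root"
    host = hcloud_server.manager[count.index + 1].ipv4_address
  }

  provisioner "file" {
    source      = "${path.module}/.manager_token"
    destination = "/tmp/manager_token"
  }

  provisioner "remote-exec" {
    inline = [
      # Private IP only; 2377 is the Swarm management port.
      "docker swarm join --token $(cat /tmp/manager_token) ${hcloud_server_network.manager[0].ip}:2377",
    ]
  }
}

# Items 7-8: dedicated workers, one per tier. The hcloud labels document
# intent; the Docker *node* labels that placement constraints match against
# still have to be applied on a manager, e.g.
#   docker node update --label-add tier=stateful <node>
resource "hcloud_server" "stateful_worker" {
  count       = 1
  name        = "stateful-worker-${count.index}"
  server_type = "cax21"                      # assumption: matches existing workers
  location    = var.hetzner_location
  image       = "ubuntu-24.04"               # illustrative
  user_data   = local.docker_install_script  # item 6: install script only
  labels      = { tier = "stateful" }
}

resource "hcloud_server" "observability_worker" {
  count       = 1
  name        = "observability-worker-${count.index}"
  server_type = "cpx21"                      # ADR floor: cpx21 or larger
  location    = var.hetzner_location
  image       = "ubuntu-24.04"
  user_data   = local.docker_install_script
  labels      = { tier = "observability" }
}
```

In the stack files, Postgres and Redis then carry the constraint node.labels.tier == stateful and the LGTM services the observability equivalent; the docker node update step in the comment above is required because Hetzner labels and Docker node labels are separate namespaces.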

Consequences

Positive:

  • Control plane survives a single manager loss deterministically.
  • Stateful services, observability, and application services each have an independent failure domain.
  • fsn1 pin preserves GDPR residency guarantees (see ADR-022).

Negative:

  • Two extra nodes (stateful_worker + observability_worker) per production environment. Accepted: the bill is less than the cost of a single multi-hour SLA breach.
  • Shrinking the quorum requires a manual docker node demote plus docker swarm leave --force before re-applying Terraform with a smaller manager_count. To be documented in the new docs/protocols/ runbook.
  • Manager join failure runs with on_failure = fail and dumps the tail of the remote journalctl -u docker; Terraform retries are opt-in via an until loop (max 3 × 30 s; fragment below).
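
A sketch of that failure handling, as a fragment replacing the plain join command in the null_resource.swarm_join_manager sketch above; the 3 × 30 s budget comes from this ADR, the script body is an assumption:

```hcl
provisioner "remote-exec" {
  # fail is Terraform's default; stated explicitly so the hard stop is deliberate.
  on_failure = fail
  inline = [
    <<-EOT
      n=0
      # Opt-in retry: up to 3 attempts, 30 s apart; dump the Docker journal
      # tail before giving up so the failure is diagnosable from CI logs.
      until docker swarm join --token $(cat /tmp/manager_token) ${hcloud_server_network.manager[0].ip}:2377; do
        n=$((n + 1))
        if [ "$n" -ge 3 ]; then
          journalctl -u docker --no-pager -n 50
          exit 1
        fi
        sleep 30
      done
    EOT
  ]
}
```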

Neutral:

  • Volume recovery on the stateful worker requires a Docker volume plugin (e.g. costela/docker-volume-hetzner); this is a separate pre-launch requirement, as Swarm does not use Kubernetes CSI.
  • LB strategy (managed LB vs. DNS round-robin) is carved out into ADR-024.
  • SSH fingerprint policy is carved out into ADR-025.
