# ADR-023: Swarm Quorum Topology (fsn1 pin + odd manager count + dedicated workers)
Status: 🟡 Proposed (pending architect approval)
Impacts: `terraform/modules/swarm/`, `terraform/environments/production/`, `infrastructure.md` §1, `docs/protocols/` (no existing runbook; to be added)
## Context
The current `terraform/modules/swarm/main.tf` treats the cluster as single-manager only (condensed sketch after this list):

- `null_resource.swarm_init` dereferences `hcloud_server.manager[0]`
- `hcloud_volume_attachment.data` is bound to `manager[0]`
- `null_resource.swarm_join_worker` pulls the worker token
- `variables.tf` has `manager_count`, but nothing downstream honours a value > 1
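
For orientation, a condensed sketch of that wiring; attribute values are illustrative, not copied from the module:

```hcl
# Condensed, illustrative sketch of the current single-manager wiring.
resource "hcloud_server" "manager" {
  count       = var.manager_count   # declared, but nothing below honours > 1
  name        = "manager-${count.index}"
  server_type = "cax21"
  location    = var.hetzner_location
}

# The data volume and swarm init are both hard-wired to manager[0].
resource "hcloud_volume_attachment" "data" {
  volume_id = hcloud_volume.data.id          # volume declared elsewhere in the module
  server_id = hcloud_server.manager[0].id
}

resource "null_resource" "swarm_init" {
  # connection block omitted for brevity
  provisioner "remote-exec" {
    inline = ["docker swarm init"]
  }
}
```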
This is acceptable for the MVP (see infrastructure.md §1, "Accepted MVP Risk"), but it becomes a blocker at the first 99.5 % SLA commitment and at any tenant count where a control-plane outage costs more than the dedicated-node bill.
Three quieter risks surfaced during the same review:

- Stateful containers (Postgres / Redis) sit on the same volume-bound manager node, so a manager failure is also a data-tier failure.
- The LGTM observability stack will not fit on the existing `cax21` workers (8 GB RAM) alongside application services.
- The current SSH provisioner block uses `StrictHostKeyChecking=no`, a security anti-pattern once the cluster goes multi-manager.
## Options Evaluated
| # | Option | Pros | Cons |
|---|---|---|---|
| 1 | 3 managers co-located in fsn1, no dedicated worker tiers | Minimal change | Stateful services still on managers; LGTM stack OOMs |
| 2 | 3 managers + dedicated `tier=stateful` worker (Postgres/Redis here) + dedicated `tier=observability` worker | Each failure domain is independent | One extra `cpx21` (observability) + one extra worker (stateful) per env |
| 3 | 3 managers in fsn1, workers allowed in nbg1 | Regional failover possible | Raft consensus latency spikes; split-brain risk under network partitions |
## Decision
Option 2: three managers pinned to `fsn1`, plus a dedicated `stateful_worker` (1×) and a dedicated `observability_worker` (1×, `cpx21` or larger).
Activated when `manager_count >= 3`:

- `var.manager_count` validation: must be odd (`manager_count % 2 == 1`). Raft tolerates failures deterministically only with odd-count quorums (validation sketch after this list).
- `var.hetzner_location` validation: `contains(["fsn1"], value)`. Managers never straddle regions, so Raft consensus latency stays under ~1 ms.
- Manager initialisation (the existing `null_resource.swarm_init` against `manager[0]`) is unchanged.
- A new `null_resource.manager_join_token` runs after `swarm_init`, SSHes to `manager[0]`, and captures the manager-scoped join token.
- A new `null_resource.swarm_join_manager` iterates `count = manager_count - 1`, joining `manager[count.index + 1]` with the manager token and `manager[0]`'s private IP on port 2377, never the public IP (join sketch after this list).
- `user_data` is split into a shared `local.docker_install_script`. Only `manager[0]` appends `docker swarm init`; workers and additional managers use the install script alone. The `null_resource` join provisioners remain the exclusive mechanism for Swarm membership.
- A new `hcloud_server.stateful_worker` (`count = 1`, label `tier=stateful`) holds the persistent Hetzner Cloud Volume. Postgres / Redis are placement-constrained to `node.labels.tier == stateful` (worker sketch after this list).
- A new `hcloud_server.observability_worker` (`count = 1`, label `tier=observability`, `cpx21` or larger) holds the LGTM stack (see observability.md).
- The stateful worker remains an acknowledged SPOF until ADR-022 (Ubicloud) lands; once Ubicloud is live, it retains only Redis.
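
A minimal sketch of the two variable guardrails, assuming they live in `variables.tf` alongside the existing definitions; the error messages are illustrative:

```hcl
variable "manager_count" {
  type    = number
  default = 3

  validation {
    # Raft only tolerates failures deterministically with an odd quorum.
    condition     = var.manager_count % 2 == 1
    error_message = "manager_count must be odd (1, 3, 5, ...) so the Raft quorum is unambiguous."
  }
}

variable "hetzner_location" {
  type    = string
  default = "fsn1"

  validation {
    # Managers never straddle regions; keeps Raft consensus latency ~sub-millisecond.
    condition     = contains(["fsn1"], var.hetzner_location)
    error_message = "Managers are pinned to fsn1 (GDPR residency and Raft latency, see ADR-022)."
  }
}
```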
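
A hedged sketch of the manager-join path. Resource and local names follow the list above, but the token plumbing, SSH details, and private-IP lookup are placeholders rather than the module's actual code:

```hcl
locals {
  # Shared cloud-init fragment: Docker install only. manager[0] appends its own
  # `docker swarm init` line; every other node joins via the null_resources below.
  docker_install_script = file("${path.module}/scripts/install-docker.sh") # illustrative path

  # Placeholder: resolve manager[0]'s private address however the module actually
  # attaches the private network (inline network block or hcloud_server_network).
  manager0_private_ip = tolist(hcloud_server.manager[0].network)[0].ip
}

# Capture the manager-scoped join token once swarm_init has run on manager[0].
resource "null_resource" "manager_join_token" {
  depends_on = [null_resource.swarm_init]

  provisioner "local-exec" {
    command = "ssh root@${hcloud_server.manager[0].ipv4_address} 'docker swarm join-token -q manager' > manager-token.txt"
  }
}

# Join managers 1..N-1 over the private network on the Swarm management port.
resource "null_resource" "swarm_join_manager" {
  count      = var.manager_count - 1
  depends_on = [null_resource.manager_join_token]

  connection {
    type = "ssh"
    host = hcloud_server.manager[count.index + 1].ipv4_address
    user = "root"
  }

  provisioner "file" {
    source      = "manager-token.txt"
    destination = "/tmp/manager-token"
  }

  provisioner "remote-exec" {
    inline = [
      # Always manager[0]'s private IP on 2377, never the public IP.
      "docker swarm join --token $(cat /tmp/manager-token) ${local.manager0_private_ip}:2377",
    ]
  }
}
```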
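
A sketch of the two dedicated workers; server sizes are illustrative apart from the `cpx21`-or-larger floor for observability. Hetzner Cloud labels are cloud metadata only, so each node still needs a one-off `docker node update --label-add` before the placement constraint can match (noted in the trailing comment):

```hcl
resource "hcloud_server" "stateful_worker" {
  count       = 1
  name        = "stateful-worker-${count.index}"
  server_type = "cax21"                       # illustrative size
  location    = var.hetzner_location
  labels      = { tier = "stateful" }
  user_data   = local.docker_install_script   # install script only; joins via null_resource
}

resource "hcloud_server" "observability_worker" {
  count       = 1
  name        = "observability-worker-${count.index}"
  server_type = "cpx21"                       # cpx21 or larger for the LGTM stack
  location    = var.hetzner_location
  labels      = { tier = "observability" }
  user_data   = local.docker_install_script
}

# Hetzner labels are not Swarm node labels. After each node joins, label it once
# so stack files can constrain placement:
#   docker node update --label-add tier=stateful <stateful-worker-node>
# and in the stack file:
#   placement: { constraints: ["node.labels.tier == stateful"] }
```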
## Consequences
Positive:
- Control plane survives a single manager loss deterministically.
- Stateful services, observability, and application services each have an independent failure domain.
- The `fsn1` pin preserves GDPR residency guarantees (see ADR-022).
Negative:
- Two extra nodes (`stateful_worker` + `observability_worker`) per production environment. Accepted: the extra bill is less than the cost of a single multi-hour SLA breach.
- Shrinking the quorum requires a manual `docker node demote` + `docker swarm leave --force` before re-applying Terraform with a smaller `manager_count`. Documented in the runbook.
- A manager join failure runs with `on_failure = fail` and dumps the tail of the remote `journalctl -u docker`; Terraform retries are opt-in via an `until` loop (max 3 × 30 s, retry sketch after this list).
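
A sketch of what the opt-in retry could look like inside the join provisioner from the Decision section; the script body and log-tail length are illustrative:

```hcl
# Variant of null_resource.swarm_join_manager from the Decision sketch;
# connection and file provisioners are unchanged and elided here.
resource "null_resource" "swarm_join_manager" {
  count      = var.manager_count - 1
  depends_on = [null_resource.manager_join_token]

  provisioner "remote-exec" {
    on_failure = fail   # abort the apply rather than leave a half-joined quorum

    inline = [<<-EOT
      n=0
      # Opt-in retry: up to 3 attempts, 30 s apart.
      until docker swarm join --token "$(cat /tmp/manager-token)" ${local.manager0_private_ip}:2377; do
        n=$((n + 1))
        if [ "$n" -ge 3 ]; then
          echo "swarm join failed after 3 attempts; dumping docker logs" >&2
          journalctl -u docker --no-pager | tail -n 100 >&2
          exit 1
        fi
        sleep 30
      done
    EOT
    ]
  }
}
```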
Neutral:
- Volume recovery on the stateful worker requires a Docker volume plugin (e.g. `costela/docker-volume-hetzner`); this is a separate pre-launch requirement, since Swarm does not use Kubernetes CSI.
- The LB strategy (managed LB vs. DNS round-robin) is carved out into ADR-024.
- SSH fingerprint policy is carved out into ADR-025.