ADR-023: Swarm Quorum Topology (fsn1 pin + odd manager count + dedicated workers)

Status: 🟡 Proposed (pending architect approval)
Impacts: terraform/modules/swarm/, terraform/environments/production/, infrastructure.md §1, docs/protocols/ (no existing runbook; to be added)


Context

The current terraform/modules/swarm/main.tf treats the cluster as single-manager only (condensed sketch after this list):

  • null_resource.swarm_init dereferences hcloud_server.manager[0]
  • hcloud_volume_attachment.data is bound to manager[0]
  • null_resource.swarm_join_worker pulls only the worker join token; no manager-join path exists
  • variables.tf exposes manager_count, but nothing downstream honours a value > 1
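
For orientation, a condensed sketch of that wiring. Resource names follow the list above; the provisioner bodies, connection details, and the hcloud_volume.data name are illustrative, not verbatim module code:

```hcl
# Current single-manager wiring: every reference is hard-coded to manager[0],
# so a manager_count > 1 changes nothing downstream.
resource "null_resource" "swarm_init" {
  connection {
    host = hcloud_server.manager[0].ipv4_address
  }
  provisioner "remote-exec" {
    inline = ["docker swarm init --advertise-addr ${hcloud_server.manager[0].ipv4_address}"]
  }
}

resource "hcloud_volume_attachment" "data" {
  volume_id = hcloud_volume.data.id        # illustrative volume name
  server_id = hcloud_server.manager[0].id  # the data tier rides the control plane
}
```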

This is acceptable for the MVP (see infrastructure.md §1 "Accepted MVP Risk") but becomes a blocker at the first 99.5% SLA commitment and at any tenant count where a control-plane outage costs more than the dedicated-node bill.

Three quieter risks surfaced during the same review:

  1. Stateful containers (Postgres / Redis) sit on the same volume-bound manager node: a manager failure is a data-tier failure.
  2. The LGTM observability stack will not fit on the existing cax21 workers (8 GB RAM) alongside application services.
  3. The current SSH provisioner block uses StrictHostKeyChecking=no, a security anti-pattern once the cluster goes multi-manager.

Options Evaluated

| # | Option | Pros | Cons |
|---|--------|------|------|
| 1 | 3 managers co-located in fsn1, no dedicated worker tiers | Minimal change | Stateful services still on managers; LGTM stack OOMs |
| 2 | 3 managers + dedicated tier=stateful worker (Postgres/Redis here) + dedicated tier=observability worker | Each failure domain is independent | One extra cpx21 (observability) + one extra worker (stateful) per env |
| 3 | 3 managers in fsn1, workers allowed in nbg1 | Regional failover possible | Raft consensus latency spikes; split-brain risk under network partitions |

Decision

Option 2: three managers pinned to fsn1 + dedicated stateful_worker (1×) + dedicated observability_worker (1×, cpx21+).

Activated when manager_count >= 3 (the full set of changes is sketched in HCL after this list):

  1. var.manager_count validation: odd (manager_count % 2 == 1). Raft tolerates floor((N-1)/2) failures, so an even count adds a node without adding fault tolerance: 3 managers survive 1 loss, while 4 still survive only 1.
  2. var.hetzner_location validation: contains(["fsn1"], value). Managers never straddle regions; Raft consensus latency stays under ~1 ms.
  3. Manager initialisation (existing null_resource.swarm_init against manager[0]) is unchanged.
  4. A new null_resource.manager_join_token runs after swarm_init, SSHes to manager[0], and captures the manager-scoped join token.
  5. A new null_resource.swarm_join_manager iterates count = manager_count - 1, joining manager[count.index + 1] using the manager token and manager[0]'s private IP on port 2377 (never the public IP).
  6. user_data is split into a shared local.docker_install_script; only manager[0] appends docker swarm init to it. Workers and additional managers receive the install script alone, so the null_resource join provisioners remain the exclusive mechanism for Swarm membership.
  7. A new hcloud_server.stateful_worker (count = 1, label tier=stateful) holds the persistent Hetzner Cloud Volume. Postgres and Redis are placement-constrained to node.labels.tier == stateful.
  8. A new hcloud_server.observability_worker (count = 1, label tier=observability, cpx21+) holds the LGTM stack (see observability.md).
  9. The stateful worker remains an acknowledged SPOF until ADR-022 (Ubicloud) lands; once Ubicloud is live, it retains only Redis.
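
A minimal sketch of the module changes above, assuming the existing hcloud_server.manager resources, an hcloud_server_network.manager attachment exposing each node's private IP, and the local.docker_install_script from item 6. SSH wiring, image names, and the .manager_token path are illustrative, not the final module:

```hcl
# Items 1-2: validations added to the existing variables.
variable "manager_count" {
  type    = number
  default = 3
  validation {
    condition     = var.manager_count % 2 == 1
    error_message = "manager_count must be odd: Raft tolerates floor((N-1)/2) failures, so even counts add cost without fault tolerance."
  }
}

variable "hetzner_location" {
  type    = string
  default = "fsn1"
  validation {
    condition     = contains(["fsn1"], var.hetzner_location)
    error_message = "Managers are pinned to fsn1 (ADR-023): Raft consensus must stay within one region."
  }
}

# Item 4: capture the manager-scoped join token after swarm_init.
# Assumption: the operator's SSH key is authorised on manager[0].
resource "null_resource" "manager_join_token" {
  depends_on = [null_resource.swarm_init]

  provisioner "local-exec" {
    command = "ssh root@${hcloud_server.manager[0].ipv4_address} 'docker swarm join-token -q manager' > ${path.module}/.manager_token"
  }
}

# Item 5: join the remaining managers over the private network on 2377.
resource "null_resource" "swarm_join_manager" {
  count      = var.manager_count - 1
  depends_on = [null_resource.manager_join_token]

  connection {
    type = "ssh"
    user = "root"
    host = hcloud_server.manager[count.index + 1].ipv4_address
  }

  provisioner "file" {
    source      = "${path.module}/.manager_token"
    destination = "/tmp/manager_token"
  }

  provisioner "remote-exec" {
    inline = [
      # Private IP only; 2377 is the Swarm management port.
      "docker swarm join --token $(cat /tmp/manager_token) ${hcloud_server_network.manager[0].ip}:2377",
    ]
  }
}

# Items 7-8: dedicated workers, one per tier. The hcloud labels document
# intent; the Docker *node* labels that placement constraints match against
# still have to be applied on a manager, e.g.
#   docker node update --label-add tier=stateful <node>
resource "hcloud_server" "stateful_worker" {
  count       = 1
  name        = "stateful-worker-${count.index}"
  server_type = "cax21"                      # assumption: matches existing workers
  location    = var.hetzner_location
  image       = "ubuntu-24.04"               # illustrative
  user_data   = local.docker_install_script  # item 6: install script only
  labels      = { tier = "stateful" }
}

resource "hcloud_server" "observability_worker" {
  count       = 1
  name        = "observability-worker-${count.index}"
  server_type = "cpx21"                      # ADR floor: cpx21 or larger
  location    = var.hetzner_location
  image       = "ubuntu-24.04"
  user_data   = local.docker_install_script
  labels      = { tier = "observability" }
}
```

In the stack files, Postgres and Redis then carry the constraint node.labels.tier == stateful and the LGTM services the observability equivalent; the docker node update step in the comment above is required because Hetzner labels and Docker node labels are separate namespaces.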

Consequences

Positive:

  • Control plane survives a single manager loss deterministically.
  • Stateful services, observability, and application services each have an independent failure domain.
  • fsn1 pin preserves GDPR residency guarantees (see ADR-022).

Negative:

  • Two extra nodes (stateful_worker + observability_worker) per production environment. Accepted: the bill is less than the cost of a single multi-hour SLA breach.
  • Shrinking the quorum requires a manual docker node demote plus docker swarm leave --force before re-applying Terraform with a smaller manager_count. To be documented in the new docs/protocols/ runbook.
  • Manager join failure runs with on_failure = fail and dumps the tail of the remote journalctl -u docker; Terraform retries are opt-in via an until loop (max 3 × 30 s; fragment below).
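
A sketch of that failure handling, as a fragment replacing the plain join command in the null_resource.swarm_join_manager sketch above; the 3 × 30 s budget comes from this ADR, the script body is an assumption:

```hcl
provisioner "remote-exec" {
  # fail is Terraform's default; stated explicitly so the hard stop is deliberate.
  on_failure = fail
  inline = [
    <<-EOT
      n=0
      # Opt-in retry: up to 3 attempts, 30 s apart; dump the Docker journal
      # tail before giving up so the failure is diagnosable from CI logs.
      until docker swarm join --token $(cat /tmp/manager_token) ${hcloud_server_network.manager[0].ip}:2377; do
        n=$((n + 1))
        if [ "$n" -ge 3 ]; then
          journalctl -u docker --no-pager -n 50
          exit 1
        fi
        sleep 30
      done
    EOT
  ]
}
```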

Neutral:

  • Volume recovery on the stateful worker requires a Docker volume plugin (e.g. costela/docker-volume-hetzner); this is a separate pre-launch requirement, as Swarm does not use Kubernetes CSI.
  • LB strategy (managed LB vs. DNS round-robin) is carved out into ADR-024.
  • SSH fingerprint policy is carved out into ADR-025.
