# ADR-024: Swarm Ingress – Hetzner Managed Load Balancer over DNS round-robin
Status: 🟡 Proposed (pending architect approval)
Impacts: `terraform/modules/swarm/`, `infrastructure.md` §1, Cloudflare DNS records
## Context
With the manager quorum expanding (ADR-023), Cloudflare can no longer point at a single manager IP. There are two common strategies for fanning out to multiple Swarm managers:
- DNS round-robin: multiple A/AAAA records; the client picks one.
- Managed load balancer: a single anycast IP on the provider, which health-checks each manager.
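For contrast, option 1 would amount to Cloudflare records alone. A minimal sketch, assuming the Cloudflare Terraform provider; the variable names, hostname, and example IPs are hypothetical and not taken from our modules:

```hcl
# Hypothetical DNS round-robin (option 1): one A record per Swarm manager.
# Cloudflare returns all of them and the client picks one; there is no health
# check, so a dead manager stays in rotation until its record is removed and
# client caches expire.
resource "cloudflare_record" "managers" {
  for_each = toset(var.manager_ipv4s) # e.g. ["203.0.113.10", "203.0.113.11", "203.0.113.12"]

  zone_id = var.cloudflare_zone_id
  name    = "app" # illustrative hostname
  type    = "A"
  value   = each.value
  ttl     = 60 # a low TTL limits, but does not eliminate, caching of dead IPs
}
```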
## Options Evaluated
| # | Option | Pros | Cons |
|---|---|---|---|
| 1 | DNS round-robin via Cloudflare | Zero extra infra; "free" | Clients cache failed IPs for up to the TTL; no health checks; manager replacement requires DNS churn |
| 2 | Hetzner `hcloud_load_balancer.ingress` (chosen) | Health-checked single IP; auto-attaches new managers via label selector; in-region (fsn1) | €5.83/mo/env; extra Terraform surface |
| 3 | Self-hosted HAProxy on a 4th node | Full control | We become the LB team; another SPOF unless we run two; not worth the cost |
## Decision
Option 2: declare `hcloud_load_balancer.ingress` in the swarm Terraform module with label-selector targeting.
```hcl
resource "hcloud_load_balancer" "ingress" {
  name               = "busflow-${var.environment}-ingress"
  load_balancer_type = "lb11"
  location           = var.hetzner_location # "fsn1"
}

resource "hcloud_load_balancer_target" "managers" {
  type             = "label_selector"
  load_balancer_id = hcloud_load_balancer.ingress.id
  label_selector   = "busflow-swarm=true"
}

resource "hcloud_load_balancer_service" "https" {
  load_balancer_id = hcloud_load_balancer.ingress.id
  protocol         = "tcp"
  listen_port      = 443
  destination_port = 443

  health_check {
    protocol = "http"
    port     = 80
    interval = 15
    timeout  = 10
    retries  = 3

    http {
      path = "/healthz"
    }
  }
}
# + equivalent for port 80 (HTTP→HTTPS redirect handled by Traefik)
```

- New managers auto-attach via the `busflow-swarm=true` label; no Terraform re-apply is required after a manager replacement.
- Cloudflare DNS points a single A/AAAA record at the LB's public IP (see the Terraform sketch after this list).
- The health check (`interval = 15`, `timeout = 10`, `retries = 3`, in seconds) matches the 2-minute deploy health-gate in `deploy.yml`.
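A minimal sketch of the DNS side, assuming the Cloudflare Terraform provider is already configured; the record name and zone variable are illustrative and not taken from our modules:

```hcl
# Hypothetical record pointing the public hostname at the load balancer.
# hcloud_load_balancer exposes its public address as the ipv4 attribute.
resource "cloudflare_record" "ingress" {
  zone_id = var.cloudflare_zone_id
  name    = "app" # illustrative hostname
  type    = "A"
  value   = hcloud_load_balancer.ingress.ipv4
  proxied = true
  # An AAAA record against hcloud_load_balancer.ingress.ipv6 would mirror this.
}
```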
## Consequences
Positive:
- Clients no longer cache a failed manager's IP; a bad manager is pulled from rotation in ≤45 s (3 failed checks at a 15 s interval).
- Manager replacement becomes a no-touch operation for DNS.
- TLS termination can migrate from per-manager Traefik to the LB if we later decide to offload SSL (see the sketch after this list).
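If we do offload later, the change stays inside the service resource. A hedged sketch only; nothing below exists in the module today, and the domain name is a placeholder. It leans on the provider's `hcloud_managed_certificate` resource and the `https` service protocol:

```hcl
# Not implemented; illustrative only. TLS terminates at the LB and plain HTTP
# is forwarded to the managers, so per-manager Traefik stops handling certs.
resource "hcloud_managed_certificate" "ingress" {
  name         = "busflow-${var.environment}-ingress"
  domain_names = ["example.busflow.dev"] # placeholder domain
}

resource "hcloud_load_balancer_service" "https_offload" {
  load_balancer_id = hcloud_load_balancer.ingress.id
  protocol         = "https"
  listen_port      = 443
  destination_port = 80 # backends receive plain HTTP once TLS ends at the LB

  http {
    certificates  = [hcloud_managed_certificate.ingress.id]
    redirect_http = true # LB handles the HTTP-to-HTTPS redirect instead of Traefik
  }
}
```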
Negative:
- €5.83/mo/env. Accepted.
- One more provider surface to monitor; we add a Grafana panel for LB target-health transitions.
Neutral:
- We do not preclude DNS round-robin as a backup; Cloudflare keeps a secondary record disabled that can be enabled if the LB itself fails.