# ADR-024: Swarm Ingress – Hetzner Managed Load Balancer over DNS round-robin
Status: 🟡 Proposed (pending architect approval)
Impacts: `terraform/modules/swarm/`, `infrastructure.md` §1, Cloudflare DNS records
## Context
With the manager quorum expanding (ADR-023), Cloudflare can no longer point at a single manager IP. There are two common strategies for fanning out to multiple Swarm managers:
- DNS round-robin: multiple A/AAAA records; the client picks one.
- Managed load balancer: a single anycast IP on the provider, which health-checks each manager.
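For contrast, option 1 would amount to Cloudflare records alone. A minimal sketch, assuming the Cloudflare Terraform provider; the variable names, hostname, and example IPs are hypothetical and not taken from our modules:

```hcl
# Hypothetical DNS round-robin (option 1): one A record per Swarm manager.
# Cloudflare returns all of them and the client picks one; there is no health
# check, so a dead manager stays in rotation until its record is removed and
# client caches expire.
resource "cloudflare_record" "managers" {
  for_each = toset(var.manager_ipv4s) # e.g. ["203.0.113.10", "203.0.113.11", "203.0.113.12"]

  zone_id = var.cloudflare_zone_id
  name    = "app" # illustrative hostname
  type    = "A"
  value   = each.value
  ttl     = 60 # a low TTL limits, but does not eliminate, caching of dead IPs
}
```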
## Options Evaluated
| # | Option | Pros | Cons |
|---|---|---|---|
| 1 | DNS round-robin via Cloudflare | Zero extra infra; "free" | Clients cache failed IPs for up to the TTL; no health checks; manager replacement requires DNS churn |
| 2 | Hetzner `hcloud_load_balancer.ingress` (chosen) | Health-checked single IP; auto-attaches new managers via label selector; in-region (fsn1) | €5.83/mo/env; extra Terraform surface |
| 3 | Self-hosted HAProxy on a 4th node | Full control | We become the LB team; another SPOF unless we run two; not worth the cost |
## Decision
Option 2: declare `hcloud_load_balancer.ingress` in the swarm Terraform module with label-selector targeting.
```hcl
resource "hcloud_load_balancer" "ingress" {
  name               = "busflow-${var.environment}-ingress"
  load_balancer_type = "lb11"
  location           = var.hetzner_location # "fsn1"
}

resource "hcloud_load_balancer_target" "managers" {
  type             = "label_selector"
  load_balancer_id = hcloud_load_balancer.ingress.id
  label_selector   = "busflow-swarm=true"
}

resource "hcloud_load_balancer_service" "https" {
  load_balancer_id = hcloud_load_balancer.ingress.id
  protocol         = "tcp"
  listen_port      = 443
  destination_port = 443

  health_check {
    protocol = "http"
    port     = 80
    interval = 15
    timeout  = 10
    retries  = 3

    http {
      path = "/healthz"
    }
  }
}
# + equivalent for port 80 (HTTP→HTTPS redirect handled by Traefik)
```

- New managers auto-attach via the `busflow-swarm=true` label; no Terraform re-apply is required after a manager replacement.
- Cloudflare DNS points a single A/AAAA record at the LB's public IP (see the Terraform sketch after this list).
- The health check (`interval = 15`, `timeout = 10`, `retries = 3`, in seconds) matches the 2-minute deploy health-gate in `deploy.yml`.
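A minimal sketch of the DNS side, assuming the Cloudflare Terraform provider is already configured; the record name and zone variable are illustrative and not taken from our modules:

```hcl
# Hypothetical record pointing the public hostname at the load balancer.
# hcloud_load_balancer exposes its public address as the ipv4 attribute.
resource "cloudflare_record" "ingress" {
  zone_id = var.cloudflare_zone_id
  name    = "app" # illustrative hostname
  type    = "A"
  value   = hcloud_load_balancer.ingress.ipv4
  proxied = true
  # An AAAA record against hcloud_load_balancer.ingress.ipv6 would mirror this.
}
```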
## Consequences
Positive:
- Clients no longer cache a failed manager's IP; a bad manager is pulled from rotation in ≤45 s (3 failed checks at a 15 s interval).
- Manager replacement becomes a no-touch operation for DNS.
- TLS termination can migrate from per-manager Traefik to the LB if we later decide to offload SSL (see the sketch after this list).
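If we do offload later, the change stays inside the service resource. A hedged sketch only; nothing below exists in the module today, and the domain name is a placeholder. It leans on the provider's `hcloud_managed_certificate` resource and the `https` service protocol:

```hcl
# Not implemented; illustrative only. TLS terminates at the LB and plain HTTP
# is forwarded to the managers, so per-manager Traefik stops handling certs.
resource "hcloud_managed_certificate" "ingress" {
  name         = "busflow-${var.environment}-ingress"
  domain_names = ["example.busflow.dev"] # placeholder domain
}

resource "hcloud_load_balancer_service" "https_offload" {
  load_balancer_id = hcloud_load_balancer.ingress.id
  protocol         = "https"
  listen_port      = 443
  destination_port = 80 # backends receive plain HTTP once TLS ends at the LB

  http {
    certificates  = [hcloud_managed_certificate.ingress.id]
    redirect_http = true # LB handles the HTTP-to-HTTPS redirect instead of Traefik
  }
}
```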
Negative:
- €5.83/mo/env. Accepted.
- One more provider surface to monitor; we add a Grafana panel for LB target-health transitions.
Neutral:
- We do not preclude DNS round-robin as a backup; Cloudflare keeps a secondary record disabled that can be enabled if the LB itself fails.