Busflow Docs

Internal documentation portal


ADR-024: Swarm Ingress - Hetzner Managed Load Balancer over DNS round-robin

Status: 🟡 Proposed, pending architect approval
Impacts: terraform/modules/swarm/, infrastructure.md §1, Cloudflare DNS records


Context

At quorum expansion time (ADR-023), Cloudflare can no longer point at a single manager IP. There are two common strategies for fanning out to multiple Swarm managers:

  1. DNS round-robin: multiple A/AAAA records; the client picks one.
  2. Managed load balancer: a single provider-hosted public IP that health-checks each manager and forwards only to healthy ones.

Options Evaluated

| # | Option | Pros | Cons |
|---|--------|------|------|
| 1 | DNS round-robin via Cloudflare | Zero extra infra; "free" | Clients cache failed IPs up to TTL; no health-check; manager replacement requires DNS churn |
| 2 | Hetzner hcloud_load_balancer.ingress (chosen) | Health-checked, single IP, auto-attaches new managers via label selector, in-region (fsn1) | €5.83/mo/env; extra Terraform surface |
| 3 | Self-hosted HAProxy on a 4th node | Full control | We become the LB team; another SPOF unless we run two; not worth the cost |

Decision

Option 2: declare hcloud_load_balancer.ingress in the swarm Terraform module with label-selector targeting.

```hcl
resource "hcloud_load_balancer" "ingress" {
  name               = "busflow-${var.environment}-ingress"
  load_balancer_type = "lb11"
  location           = var.hetzner_location  # "fsn1"
}

resource "hcloud_load_balancer_target" "managers" {
  type             = "label_selector"
  load_balancer_id = hcloud_load_balancer.ingress.id
  label_selector   = "busflow-swarm=true"
}

resource "hcloud_load_balancer_service" "https" {
  load_balancer_id = hcloud_load_balancer.ingress.id
  protocol         = "tcp"
  listen_port      = 443
  destination_port = 443
  health_check {
    protocol = "http"
    port     = 80
    interval = 15
    timeout  = 10
    retries  = 3
    http { path = "/healthz" }
  }
}
# + equivalent for port 80 (HTTP→HTTPS redirect handled by Traefik)
```

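
The port 80 counterpart mentioned in the closing comment could look roughly like this. It is a minimal sketch that assumes the listener simply mirrors the 443 service and reuses the same /healthz check; the resource name `http` is illustrative and not taken from the module.

```hcl
# Sketch only: plain TCP passthrough on port 80 so that Traefik can issue
# the HTTP-to-HTTPS redirect itself. The resource name "http" is an assumption.
resource "hcloud_load_balancer_service" "http" {
  load_balancer_id = hcloud_load_balancer.ingress.id
  protocol         = "tcp"
  listen_port      = 80
  destination_port = 80

  # Assumed to match the health check on the 443 service.
  health_check {
    protocol = "http"
    port     = 80
    interval = 15
    timeout  = 10
    retries  = 3
    http {
      path = "/healthz"
    }
  }
}
```
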
  • New managers auto-attach via the busflow-swarm=true label, so no Terraform re-apply is required after a manager replacement.
  • Cloudflare DNS points a single A/AAAA record at the LB's public IP.
  • The health check (interval = 15 s, timeout = 10 s, retries = 3) ejects a failing manager after roughly 3 × 15 s = 45 s, well inside the 2-minute deploy health-gate in deploy.yml.
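
The single-record setup in the second bullet might be wired up as follows. This is a sketch under several assumptions: the Cloudflare Terraform provider v4 (`cloudflare_record` with the `content` argument; older v4 releases name it `value`, and v5 renames the resource to `cloudflare_dns_record`), a proxied record, and two hypothetical variables for the zone and hostname. Only `hcloud_load_balancer.ingress` comes from the decision above.

```hcl
# Hypothetical inputs; declare once per module.
variable "cloudflare_zone_id" {
  type = string
}

variable "ingress_hostname" {
  type = string # e.g. "app"
}

# A record pointing at the load balancer's public IPv4 address.
resource "cloudflare_record" "ingress_a" {
  zone_id = var.cloudflare_zone_id
  name    = var.ingress_hostname
  type    = "A"
  content = hcloud_load_balancer.ingress.ipv4
  ttl     = 1    # 1 means "automatic"; required when proxied
  proxied = true # assumption: traffic goes through the Cloudflare proxy
}

# AAAA record pointing at the load balancer's public IPv6 address.
resource "cloudflare_record" "ingress_aaaa" {
  zone_id = var.cloudflare_zone_id
  name    = var.ingress_hostname
  type    = "AAAA"
  content = hcloud_load_balancer.ingress.ipv6
  ttl     = 1
  proxied = true
}
```

Because both records read the address from the load balancer resource, replacing a manager never touches DNS; only recreating the load balancer itself would.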

Consequences

Positive:

  • Clients no longer hold on to a failed manager's IP for the DNS TTL; a bad manager is pulled from rotation in ≤45 s.
  • Manager replacement becomes a no-touch operation for DNS.
  • TLS termination can migrate from per-manager Traefik to the LB if we later decide to offload SSL.
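
If TLS offload is adopted later, the 443 service would change from TCP passthrough to an HTTPS listener. The following is only a sketch of what that could look like; it would replace the existing `https` service rather than sit beside it. It assumes an uploaded certificate (the two PEM variables are hypothetical) and that Traefik would then accept plain HTTP on port 80.

```hcl
# Hypothetical inputs holding the PEM-encoded certificate chain and key.
variable "tls_certificate_pem" {
  type = string
}

variable "tls_private_key_pem" {
  type      = string
  sensitive = true
}

resource "hcloud_uploaded_certificate" "ingress" {
  name        = "busflow-${var.environment}-ingress"
  certificate = var.tls_certificate_pem
  private_key = var.tls_private_key_pem
}

# Replaces the TCP passthrough service: the LB terminates TLS on 443 and
# forwards plain HTTP to the managers (destination port 80 is an assumption).
resource "hcloud_load_balancer_service" "https" {
  load_balancer_id = hcloud_load_balancer.ingress.id
  protocol         = "https"
  listen_port      = 443
  destination_port = 80

  http {
    certificates = [hcloud_uploaded_certificate.ingress.id]
  }

  health_check {
    protocol = "http"
    port     = 80
    interval = 15
    timeout  = 10
    retries  = 3
    http {
      path = "/healthz"
    }
  }
}
```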

Negative:

  • €5.83/mo/env. Accepted.
  • One more provider-surface to monitor; we add a Grafana panel for LB target-health transitions.

Neutral:

  • DNS round-robin remains available as a backup: Cloudflare keeps a secondary record disabled that can be enabled if the LB itself fails.
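
One way to model that disabled fallback in Terraform is a set of round-robin records gated behind a flag. This is a sketch, not the module's actual mechanism: every variable here is hypothetical, and it assumes the Cloudflare provider v4 `cloudflare_record` resource with the `content` argument.

```hcl
# Hypothetical inputs; declare once per module.
variable "cloudflare_zone_id" {
  type = string
}

variable "ingress_hostname" {
  type = string
}

variable "enable_dns_fallback" {
  type    = bool
  default = false # flip to true only if the load balancer itself fails
}

variable "manager_ipv4_addresses" {
  type    = list(string)
  default = []
}

# One A record per manager; no records exist while the flag is false.
resource "cloudflare_record" "manager_fallback" {
  for_each = toset(var.enable_dns_fallback ? var.manager_ipv4_addresses : [])

  zone_id = var.cloudflare_zone_id
  name    = var.ingress_hostname
  type    = "A"
  content = each.value
  ttl     = 1    # 1 means "automatic"; required when proxied
  proxied = true # assumption: same proxy setting as the primary record
}
```

With this shape, enabling the fallback is a one-variable change, but it brings back the drawbacks listed for option 1, so the LB-backed record would normally be removed or left in place only while the LB is confirmed down.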
