
Runbook — Observability Alerts

ADR: ADR-027 · Architecture: observability.md · On-call: @ops-oncall in #ops

This runbook is the single landing page for every Grafana alert that pages on-call. Each alert has an ID (matching the alertname label), a one-line intent, and a terse triage procedure. Keep this file flat and greppable — do not add deep hierarchy.


A1 — HighApiErrorRate

Intent: p99 HTTP 5xx rate on busflow_api > 2% for 5 min.

  1. Open Grafana → dashboard "API Errors" → filter by tenant_id.
  2. If concentrated on one tenant: check communications.messages and operations.incidents for backpressure; page the tenant's CSM only if sustained > 30 min.
  3. If broad: check pg_stat_activity for lock waits; check Hasura metadata_status; check traefik_service_requests_total{code=~"5.."} for edge-level symptoms.
  4. If Postgres-rooted: follow postgres-cutover.md §Rollback only if within the 72 h warm-rollback window; otherwise open a P1 incident.
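
For step 3, the lock-wait check can be run straight from psql. A minimal sketch, assuming a reachable connection string in $PG_DSN; pg_stat_activity and pg_blocking_pids() are standard Postgres.

```bash
# A1 step 3: sessions that are currently blocked, and what they wait on.
# $PG_DSN is a placeholder for your connection string.
psql "$PG_DSN" -c "
  SELECT pid, wait_event_type, wait_event, state, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0;"
```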

A2 — LatencySloBreach

Intent: p95 latency on /api/bookings > 800 ms for 10 min (SLO: ≤ 500 ms).

  1. Confirm the symptom in Tempo: sample 10 slow traces, check db.statement spans.
  2. If pg_stat_statements shows a regressed query: capture the query_id, open a fix-forward issue, notify the feature team's lead.
  3. If symptom is load-driven and Swarm has spare capacity: scale busflow_api replicas (docker service scale busflow_api=<n+2>).
  4. Do not scale above stateful_worker.replicas × 1.5 — the Postgres connection pool is the ceiling. Escalate to infra-oncall if more capacity is genuinely needed.
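
For step 2, pg_stat_statements can surface the regressed query directly. A sketch, assuming the extension is installed and $PG_DSN points at the primary; column names follow recent pg_stat_statements versions (older ones call it mean_time).

```bash
# A2 step 2: top statements by mean execution time; note the queryid for the issue.
psql "$PG_DSN" -c "
  SELECT queryid, calls, round(mean_exec_time::numeric, 1) AS mean_ms,
         left(query, 80) AS query
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC
  LIMIT 10;"

# A2 step 3: load-driven case. The replica count is illustrative; mind step 4's ceiling.
docker service scale busflow_api=4
```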

A3 — PiiLeakDetected

Intent: PII redaction rule matched > 5 events in 24 h (the daily false-positive budget; see observability.md §PII Redaction).

  1. Open the Loki query: {job="faro"} | json | pii_match_rule!="".
  2. If all matches share one rule_id: evaluate whether the rule is a true leak or a false positive.
    • False positive: add an exception to docker/observability/pii-exclusions.yaml with the rule_id + sample excerpt + reviewer. Open a PR; do not hand-edit on the box.
    • True leak: treat as a security incident. Page @security-oncall. Follow breach protocol in gdpr-strategy.md §5 (72-h notification mandate).
  3. If matches span multiple rules: page @security-oncall immediately without further triage — scatter patterns are almost never false positives.
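
The step 1 query can also be run from a workstation with logcli, which makes the one-rule vs. scatter call quick. A sketch, assuming logcli is installed and LOKI_ADDR is exported; the jq pipeline is illustrative and assumes each matched line is itself JSON carrying pii_match_rule.

```bash
# A3: count matches per rule over the 24 h budget window.
logcli query --since=24h -o jsonl '{job="faro"} | json | pii_match_rule!=""' \
  | jq -r '.line | fromjson | .pii_match_rule' | sort | uniq -c | sort -rn
# One dominant rule: evaluate it (step 2). Many rules: page @security-oncall (step 3).
```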

A4 — PgCronJobFailed

Intent: A pg_cron job (scrub pipeline, snapshot, analyze) errored twice in a row.

  1. Identify the job: SELECT j.jobname, d.start_time, d.end_time, d.status FROM cron.job_run_details d JOIN cron.job j USING (jobid) ORDER BY d.start_time DESC LIMIT 20; (run history is keyed by jobid; jobname lives in cron.job).
  2. If tenant_scrub_*: follow legal-hold-runbook.md §Scrub Fault — do not disable the job; a disabled GDPR scrub is a compliance finding.
  3. If refresh_* (materialised view): safe to retry manually; log the retry in the on-call journal.
  4. If the failure is a hot-lock: capture the blocking query via pg_blocking_pids() before retrying — the architect-loop requires a blocker capture for every scrub lock event.
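
For step 4, the blocker capture fits in one query run before the retry. A sketch using standard Postgres catalog views; $PG_DSN is a placeholder.

```bash
# A4 step 4: record who blocks whom, plus the blocking statement, before retrying.
psql "$PG_DSN" -c "
  SELECT blocked.pid  AS blocked_pid,
         blocking.pid AS blocking_pid,
         left(blocking.query, 120) AS blocking_query
  FROM pg_stat_activity blocked
  JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
  JOIN pg_stat_activity blocking ON blocking.pid = b.pid;"
```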

A5 — CardinalityExplosion

Intent: A metric's active series count exceeded its declared cardinality_budget by ≥ 2×.

  1. Find the offender: topk(10, count by (__name__, tenant_id)({job="mimir"})) in Grafana Explore.
  2. Check the offending metric's declaration (docker/observability/metrics-budgets.yaml) — does the budget match the CI linter output?
  3. Common root causes: a new label added without a budget update, or a tenant migration spraying unbounded values into a label like route_id.
  4. Do not delete series manually — Mimir's compactor handles eviction. Instead:
    • File a ticket on the owning service.
    • Raise the budget only after confirming the new cardinality is bounded (the PR must update metrics-budgets.yaml and the CI linter contract).
    • If the explosion is ongoing and threatens Mimir ingestion: set a temporary relabel-drop at the OTel Collector (never at Mimir) and time-box it to ≤ 48 h.
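
Step 1's Explore query can equally be issued against the Prometheus-compatible HTTP API, useful when Grafana itself is degraded. A sketch; $MIMIR_URL and the /prometheus path prefix are assumptions about the Mimir deployment, and multi-tenant setups may also need an X-Scope-OrgID header.

```bash
# A5 step 1: the same topk query via the HTTP API.
curl -sG "$MIMIR_URL/prometheus/api/v1/query" \
  --data-urlencode 'query=topk(10, count by (__name__, tenant_id)({job="mimir"}))' \
  | jq '.data.result[] | {metric: .metric, series: .value[1]}'
```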

A6 — LokiCompactorBehind

Intent: Compactor backlog > 3 h OR MinIO bucket at ≥ 85% of 500 GB quota.

  1. Check compactor health: loki_compactor_last_successful_run_timestamp_seconds.
  2. If the compactor is unhealthy: restart loki_compactor (single-replica service). Do not scale up — dual compactors corrupt the shared store.
  3. If the compactor is healthy but the backlog is from ingest volume: the retention_period contract (336 h / 14 d) is working — treat as capacity planning, not an incident. Open a ticket to revisit the quota.
  4. Never solve this via MinIO bucket-lifecycle rules. That path causes "chunk not found" errors at query time (see observability.md §Log Volume & Retention).
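
Both trigger conditions can be checked from a workstation, as sketched below. The mc alias, bucket name, query URL, and the busflow_ prefix on the compactor service are assumptions; substitute the deployed names.

```bash
# A6: bucket usage against the 500 GB quota (alias/bucket are placeholders).
mc du minio/loki-chunks
# Compactor staleness in seconds:
curl -sG "$MIMIR_URL/prometheus/api/v1/query" \
  --data-urlencode 'query=time() - loki_compactor_last_successful_run_timestamp_seconds'
# Unhealthy compactor: force-restart the single replica; never scale it up.
docker service update --force busflow_loki_compactor
```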

A7 — SsoFallthrough

Intent: A tenant session reached Grafana without a valid SSO claim.

  1. Check GF_SECURITY_COOKIE_DOMAIN, GF_SECURITY_COOKIE_SAMESITE=none, GF_SECURITY_COOKIE_SECURE=true, and GF_SERVER_ROOT_URL on the active Grafana replica: docker service inspect busflow_grafana --format '{{json .Spec.TaskTemplate.ContainerSpec.Env}}'.
  2. If any env differs from the contract in observability.md §Multi-Tenant: revert via docker service update --env-add. Do not hotpatch the live container.
  3. If the envs are correct but fallthrough persists: the tenant's IdP config is the likely cause — verify tenants.sso_issuer matches what the IdP presents; rotate the tenant's shared JWT secret via the secrets-rotation runbook.
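
A sketch of the step 2 revert. The cookie settings are the contract quoted in step 1; the domain and root URL values are placeholders for the tenant-facing hostname.

```bash
# A7 step 2: restore the contract envs on the service, never on the live container.
docker service update \
  --env-add GF_SECURITY_COOKIE_SAMESITE=none \
  --env-add GF_SECURITY_COOKIE_SECURE=true \
  --env-add GF_SECURITY_COOKIE_DOMAIN=grafana.busflow.example \
  --env-add GF_SERVER_ROOT_URL=https://grafana.busflow.example/ \
  busflow_grafana
```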

A8 — PostgresExporterDsnGateOpen

Intent: The postgres-exporter service has replicas > 0 but the pgexporter_dsn Swarm secret is absent or a stub.

  1. Confirm: docker service inspect busflow_postgres_exporter --format '{{.Spec.TaskTemplate.ContainerSpec.Secrets}}'.
  2. If the secret is missing: this is the bootstrap gap documented in observability.md. Either complete .github/workflows/secrets-sync.yml and populate the secret, or scale exporter to 0.
  3. Never add the DSN as a plaintext env var. If someone did: rotate the Postgres credential immediately via the secrets-rotation runbook.
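
The two safe exits from the gate, sketched below. The temp-file path is illustrative; the DSN should come from the credential store, never from an env var or shell history.

```bash
# A8 option 1: populate the secret from a file, then attach it to the exporter.
docker secret create pgexporter_dsn - < /tmp/pgexporter_dsn.txt && shred -u /tmp/pgexporter_dsn.txt
docker service update --secret-add pgexporter_dsn busflow_postgres_exporter
# A8 option 2: close the gate until the secret exists.
docker service scale busflow_postgres_exporter=0
```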

A9 — TlsCertExpiringSoon

Intent: A Traefik-managed ACME certificate has ≤ 14 days until expiry.

  1. Check which subdomain is affected: the cn label on traefik_tls_certs_not_after identifies the cert.
  2. Verify the ACME challenge method: HTTP challenge (letsencrypt-http) or DNS challenge (letsencrypt-dns). Check Traefik logs for recent renewal errors: docker service logs busflow_traefik --tail 100 | grep -i acme.
  3. If the DNS challenge failed: verify the Cloudflare DNS API token (CF_DNS_API_TOKEN) is valid — tokens expire or get revoked without notification. Test via curl -H "Authorization: Bearer $TOKEN" https://api.cloudflare.com/client/v4/user/tokens/verify.
  4. Force a renewal attempt: restart the Traefik service (docker service update --force busflow_traefik). Traefik re-checks all certs on startup.
  5. If renewal still fails: check Let's Encrypt rate limits at https://tools.letsdebug.net/cert-search/<domain> — a burst of failed attempts may have triggered a 1-hour backoff.
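
Expiry can be confirmed from outside the stack before touching Traefik. A sketch with stock openssl; the subdomain is a placeholder.

```bash
# A9 step 1: read the certificate actually served on :443.
DOMAIN=tenant.busflow.example
echo | openssl s_client -connect "$DOMAIN:443" -servername "$DOMAIN" 2>/dev/null \
  | openssl x509 -noout -subject -enddate
```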

A10 — ExternalDependencyDown

Intent: An external dependency health probe returned unhealthy for ≥ 3 consecutive checks (15 min).

Per-provider triage:

| Provider | First step | Escalation |
| --- | --- | --- |
| AWS SES | Check the AWS status page for eu-west-1 SES. Verify SMTP credentials in the busflow_smtp_* Swarm Secrets are not expired. | If SES is globally down: no action possible; document in the on-call journal. If credentials expired: rotate via secrets-rotation-runbook.md. |
| Hetzner Object Storage | Run rclone ls offsite:<bucket> --max-depth 1 from a local machine. If auth fails: verify OFFSITE_S3_ACCESS_KEY / OFFSITE_S3_SECRET_KEY are current. | If Hetzner S3 is down: backups cannot upload. Not an emergency unless it persists > 24 h (the daily verification pipeline will catch it). |
| LLM Provider | Check OpenAI status at status.openai.com or Anthropic at status.anthropic.com. Verify the API key is valid and has remaining quota. | If the provider is down: AI features degrade gracefully (copilot and magic upload return errors). Not a P1 — document and wait for recovery. |
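
Quick probes for the table above, sketched below. The /api/v2/status.json paths assume both providers expose the common Statuspage JSON API; the offsite: remote name comes from the table.

```bash
# A10: LLM provider status (paths assume the Statuspage API layout).
curl -s https://status.openai.com/api/v2/status.json | jq -r '.status.description'
curl -s https://status.anthropic.com/api/v2/status.json | jq -r '.status.description'
# A10: Hetzner object storage reachability and auth in one call.
rclone lsd offsite: --max-depth 1
```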

A11 — StudioServiceDown

Intent: A studio stack service (context-engine, docs-hub, landing, qdrant) failed its Uptime Kuma health check for ≥ 3 consecutive intervals (30 s).

Monitoring: Uptime Kuma runs inside the studio Swarm stack and probes all services via their Docker-internal healthcheck endpoints.

  1. SSH into the studio server: ./infrastructure/scripts/ssh-connect.sh studio "docker service ls".
  2. Check the failing service: docker service ps busflow-studio_<name> --no-trunc.
  3. Check logs: docker service logs busflow-studio_<name> --tail 50.
  4. Common root causes:
    • 504 Gateway Timeout from Traefik: the service uses multiple overlay networks and Traefik resolved the backend IP on the wrong network. Fix: add traefik.swarm.network=busflow-studio_busflow-traefik to the service's deploy labels (see A12).
    • Crash-loop: missing env vars (GEMINI_API_KEY, ANTHROPIC_API_KEY, etc.). Check GitHub Actions environment secrets for the studio environment.
    • Qdrant unreachable: context-engine's /api/health/ready returns 503. Check docker service ps busflow-studio_qdrant.
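
Steps 1-3 collapse into a single sweep, sketched below. The script path and stack prefix come from this runbook; the loop and output format are illustrative.

```bash
# A11: current state plus last error for every studio service in one pass.
./infrastructure/scripts/ssh-connect.sh studio '
  for svc in $(docker service ls --filter name=busflow-studio --format "{{.Name}}"); do
    echo "== $svc"
    docker service ps "$svc" --no-trunc --format "{{.CurrentState}} {{.Error}}" | head -3
  done'
```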

A12 — StudioTraefikMultiNetworkRouting

Intent: Traefik routed traffic to a backend IP on the wrong overlay network, causing a 504 Gateway Timeout.

Background: When a Swarm service joins multiple overlay networks (e.g., busflow-traefik + busflow-qdrant), Docker assigns one IP per network. Without traefik.swarm.network, Traefik can resolve the service's VIP to any of those IPs — including one on a network Traefik itself is not attached to.

  1. Verify the Traefik access log: docker exec $(docker ps -q -f name=busflow-studio_traefik) tail -20 /var/log/traefik/access.log | grep 504.
  2. If the backend IP in the log does not match the busflow-traefik subnet (10.0.1.x): the service is missing the traefik.swarm.network label.
  3. Fix: docker service update --label-add 'traefik.swarm.network=busflow-studio_busflow-traefik' busflow-studio_<name>.
  4. Permanent fix: add the label in docker-compose.studio.yml and redeploy.
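
To see the per-network IPs behind step 2, inspect the service's VIPs and map them to subnets. A sketch; context-engine stands in for whichever busflow-studio service is failing.

```bash
# A12 step 2: which VIP lives on which network for the suspect service?
docker service inspect busflow-studio_context-engine \
  --format '{{json .Endpoint.VirtualIPs}}' | jq .
# Resolve the Traefik network's subnet to compare against the access-log backend IP:
docker network inspect busflow-studio_busflow-traefik \
  --format '{{.Name}} {{(index .IPAM.Config 0).Subnet}}'
```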

Appendix — Closing an alert

Every acknowledged alert must land in the on-call journal with:

  • Alert ID (e.g., A3).
  • Root-cause one-liner.
  • Remediation (PR, config change, or "no-op, transient").
  • Whether the contract (budget, SLO, exclusion list) changed — if yes, link the PR.
