
Runbook — Observability Alerts

ADR: ADR-027 · Architecture: observability.md · On-call: @ops-oncall in #ops

This runbook is the single landing page for every Grafana alert that pages on-call. Each alert has an ID (matching the alertname label), a one-line intent, and a terse triage procedure. Keep this file flat and greppable — do not add deep hierarchy.


A1 — HighApiErrorRate

Intent: p99 HTTP 5xx rate on busflow_api > 2% for 5 min.

  1. Open Grafana → dashboard "API Errors" → filter by tenant_id.
  2. If concentrated on one tenant: check communications.messages and operations.incidents for backpressure; page the tenant's CSM only if sustained > 30 min.
  3. If broad: check pg_stat_activity for lock waits; check Hasura metadata_status; check traefik_service_requests_total{code=~"5.."} for edge-level symptoms.
  4. If Postgres-rooted: follow postgres-cutover.md §Rollback only if within the 72 h warm-rollback window; otherwise open a P1 incident.
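
For step 3, the lock-wait check can be run straight from psql. A minimal sketch, assuming a reachable connection string in $PG_DSN; pg_stat_activity and pg_blocking_pids() are standard Postgres.

```bash
# A1 step 3: sessions that are currently blocked, and what they wait on.
# $PG_DSN is a placeholder for your connection string.
psql "$PG_DSN" -c "
  SELECT pid, wait_event_type, wait_event, state, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0;"
```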

A2 — LatencySloBreach

Intent: p95 latency on /api/bookings > 800 ms for 10 min (SLO: ≤ 500 ms).

  1. Confirm the symptom in Tempo: sample 10 slow traces, check db.statement spans.
  2. If pg_stat_statements shows a regressed query: capture the query_id, open a fix-forward issue, notify the feature team's lead.
  3. If symptom is load-driven and Swarm has spare capacity: scale busflow_api replicas (docker service scale busflow_api=<n+2>).
  4. Do not scale above stateful_worker.replicas × 1.5 — the Postgres connection pool is the ceiling. Escalate to infra-oncall if more capacity is genuinely needed.
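
For step 2, pg_stat_statements can surface the regressed query directly. A sketch, assuming the extension is installed and $PG_DSN points at the primary; column names follow recent pg_stat_statements versions (older ones call it mean_time).

```bash
# A2 step 2: top statements by mean execution time; note the queryid for the issue.
psql "$PG_DSN" -c "
  SELECT queryid, calls, round(mean_exec_time::numeric, 1) AS mean_ms,
         left(query, 80) AS query
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC
  LIMIT 10;"

# A2 step 3: load-driven case. The replica count is illustrative; mind step 4's ceiling.
docker service scale busflow_api=4
```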

A3 — PiiLeakDetected

Intent: PII redaction rule matched > 5 events in 24 h (the daily false-positive budget; see observability.md §PII Redaction).

  1. Open the Loki query: {job="faro"} | json | pii_match_rule!="".
  2. If all matches share one rule_id: evaluate whether the rule is a true leak or a false positive.
    • False positive: add an exception to docker/observability/pii-exclusions.yaml with the rule_id + sample excerpt + reviewer. Open a PR; do not hand-edit on the box.
    • True leak: treat as a security incident. Page @security-oncall. Follow breach protocol in gdpr-strategy.md §5 (72-h notification mandate).
  3. If matches span multiple rules: page @security-oncall immediately without further triage — scatter patterns are almost never false positives.
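
The step 1 query can also be run from a workstation with logcli, which makes the one-rule vs. scatter call quick. A sketch, assuming logcli is installed and LOKI_ADDR is exported; the jq pipeline is illustrative and assumes each matched line is itself JSON carrying pii_match_rule.

```bash
# A3: count matches per rule over the 24 h budget window.
logcli query --since=24h -o jsonl '{job="faro"} | json | pii_match_rule!=""' \
  | jq -r '.line | fromjson | .pii_match_rule' | sort | uniq -c | sort -rn
# One dominant rule: evaluate it (step 2). Many rules: page @security-oncall (step 3).
```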

A4 — PgCronJobFailed

Intent: A pg_cron job (scrub pipeline, snapshot, analyze) errored twice in a row.

  1. Identify the job: SELECT j.jobname, d.start_time, d.end_time, d.status FROM cron.job_run_details d JOIN cron.job j USING (jobid) ORDER BY d.start_time DESC LIMIT 20; (run history is keyed by jobid; jobname lives in cron.job).
  2. If tenant_scrub_*: follow legal-hold-runbook.md §Scrub Fault — do not disable the job; a disabled GDPR scrub is a compliance finding.
  3. If refresh_* (materialised view): safe to retry manually; log the retry in the on-call journal.
  4. If the failure is a hot-lock: capture the blocking query via pg_blocking_pids() before retrying — the architect-loop requires a blocker capture for every scrub lock event.
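
For step 4, the blocker capture fits in one query run before the retry. A sketch using standard Postgres catalog views; $PG_DSN is a placeholder.

```bash
# A4 step 4: record who blocks whom, plus the blocking statement, before retrying.
psql "$PG_DSN" -c "
  SELECT blocked.pid  AS blocked_pid,
         blocking.pid AS blocking_pid,
         left(blocking.query, 120) AS blocking_query
  FROM pg_stat_activity blocked
  JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
  JOIN pg_stat_activity blocking ON blocking.pid = b.pid;"
```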

A5 — CardinalityExplosion

Intent: A metric's active series count exceeded its declared cardinality_budget by ≥ 2×.

  1. Find the offender: topk(10, count by (__name__, tenant_id)({job="mimir"})) in Grafana Explore.
  2. Check the offending metric's declaration (docker/observability/metrics-budgets.yaml) — does the budget match the CI linter output?
  3. Common root causes: a new label added without a budget update, or a tenant migration spraying unbounded values into a label like route_id.
  4. Do not delete series manually — Mimir's compactor handles eviction. Instead:
    • File a ticket on the owning service.
    • Raise the budget only after confirming the new cardinality is bounded (the PR must update metrics-budgets.yaml and the CI linter contract).
    • If the explosion is ongoing and threatens Mimir ingestion: set a temporary relabel-drop at the OTel Collector (never at Mimir) and time-box it to ≤ 48 h.
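
Step 1's Explore query can equally be issued against the Prometheus-compatible HTTP API, useful when Grafana itself is degraded. A sketch; $MIMIR_URL and the /prometheus path prefix are assumptions about the Mimir deployment, and multi-tenant setups may also need an X-Scope-OrgID header.

```bash
# A5 step 1: the same topk query via the HTTP API.
curl -sG "$MIMIR_URL/prometheus/api/v1/query" \
  --data-urlencode 'query=topk(10, count by (__name__, tenant_id)({job="mimir"}))' \
  | jq '.data.result[] | {metric: .metric, series: .value[1]}'
```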

A6 — LokiCompactorBehind

Intent: Compactor backlog > 3 h OR MinIO bucket at ≥ 85% of 500 GB quota.

  1. Check compactor health: loki_compactor_last_successful_run_timestamp_seconds.
  2. If the compactor is unhealthy: restart loki_compactor (single-replica service). Do not scale up — dual compactors corrupt the shared store.
  3. If the compactor is healthy but the backlog is from ingest volume: the retention_period contract (336 h / 14 d) is working — treat as capacity planning, not an incident. Open a ticket to revisit the quota.
  4. Never solve this via MinIO bucket-lifecycle rules. That path causes "chunk not found" errors at query time (see observability.md §Log Volume & Retention).
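
Both trigger conditions can be checked from a workstation, as sketched below. The mc alias, bucket name, query URL, and the busflow_ prefix on the compactor service are assumptions; substitute the deployed names.

```bash
# A6: bucket usage against the 500 GB quota (alias/bucket are placeholders).
mc du minio/loki-chunks
# Compactor staleness in seconds:
curl -sG "$MIMIR_URL/prometheus/api/v1/query" \
  --data-urlencode 'query=time() - loki_compactor_last_successful_run_timestamp_seconds'
# Unhealthy compactor: force-restart the single replica; never scale it up.
docker service update --force busflow_loki_compactor
```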

A7 — SsoFallthrough

Intent: A tenant session reached Grafana without a valid SSO claim.

  1. Check GF_SECURITY_COOKIE_DOMAIN, GF_SECURITY_COOKIE_SAMESITE=none, GF_SECURITY_COOKIE_SECURE=true, and GF_SERVER_ROOT_URL on the active Grafana replica: docker service inspect busflow_grafana --format '{{json .Spec.TaskTemplate.ContainerSpec.Env}}'.
  2. If any env differs from the contract in observability.md §Multi-Tenant: revert via docker service update --env-add. Do not hotpatch the live container.
  3. If the envs are correct but fallthrough persists: the tenant's IdP config is the likely cause — verify tenants.sso_issuer matches what the IdP presents; rotate the tenant's shared JWT secret via the secrets-rotation runbook.
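
A sketch of the step 2 revert. The cookie settings are the contract quoted in step 1; the domain and root URL values are placeholders for the tenant-facing hostname.

```bash
# A7 step 2: restore the contract envs on the service, never on the live container.
docker service update \
  --env-add GF_SECURITY_COOKIE_SAMESITE=none \
  --env-add GF_SECURITY_COOKIE_SECURE=true \
  --env-add GF_SECURITY_COOKIE_DOMAIN=grafana.busflow.example \
  --env-add GF_SERVER_ROOT_URL=https://grafana.busflow.example/ \
  busflow_grafana
```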

A8 — PostgresExporterDsnGateOpen

Intent: The postgres-exporter service has replicas > 0 but the pgexporter_dsn Swarm secret is absent or a stub.

  1. Confirm: docker service inspect busflow_postgres_exporter --format '{{.Spec.TaskTemplate.ContainerSpec.Secrets}}'.
  2. If the secret is missing: this is the bootstrap gap documented in observability.md. Either complete .github/workflows/secrets-sync.yml and populate the secret, or scale exporter to 0.
  3. Never add the DSN as a plaintext env var. If someone did: rotate the Postgres credential immediately via the secrets-rotation runbook.
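
The two safe exits from the gate, sketched below. The temp-file path is illustrative; the DSN should come from the credential store, never from an env var or shell history.

```bash
# A8 option 1: populate the secret from a file, then attach it to the exporter.
docker secret create pgexporter_dsn - < /tmp/pgexporter_dsn.txt && shred -u /tmp/pgexporter_dsn.txt
docker service update --secret-add pgexporter_dsn busflow_postgres_exporter
# A8 option 2: close the gate until the secret exists.
docker service scale busflow_postgres_exporter=0
```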

A9 — TlsCertExpiringSoon

Intent: A Traefik-managed ACME certificate has ≤ 14 days until expiry.

  1. Check which subdomain is affected: the cn label on traefik_tls_certs_not_after identifies the cert.
  2. Verify the ACME challenge method: HTTP challenge (letsencrypt-http) or DNS challenge (letsencrypt-dns). Check Traefik logs for recent renewal errors: docker service logs busflow_traefik --tail 100 | grep -i acme.
  3. If the DNS challenge failed: verify the Cloudflare DNS API token (CF_DNS_API_TOKEN) is valid — tokens expire or get revoked without notification. Test via curl -H "Authorization: Bearer $TOKEN" https://api.cloudflare.com/client/v4/user/tokens/verify.
  4. Force a renewal attempt: restart the Traefik service (docker service update --force busflow_traefik). Traefik re-checks all certs on startup.
  5. If renewal still fails: check Let's Encrypt rate limits at https://tools.letsdebug.net/cert-search/<domain> — a burst of failed attempts may have triggered a 1-hour backoff.
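
Expiry can be confirmed from outside the stack before touching Traefik. A sketch with stock openssl; the subdomain is a placeholder.

```bash
# A9 step 1: read the certificate actually served on :443.
DOMAIN=tenant.busflow.example
echo | openssl s_client -connect "$DOMAIN:443" -servername "$DOMAIN" 2>/dev/null \
  | openssl x509 -noout -subject -enddate
```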

A10 — ExternalDependencyDown

Intent: An external dependency health probe returned unhealthy for ≥ 3 consecutive checks (15 min).

Per-provider triage:

| Provider | First step | Escalation |
| --- | --- | --- |
| AWS SES | Check the AWS status page for eu-west-1 SES. Verify SMTP credentials in the busflow_smtp_* Swarm Secrets are not expired. | If SES is globally down: no action possible; document in the on-call journal. If credentials expired: rotate via secrets-rotation-runbook.md. |
| Hetzner Object Storage | Run rclone ls offsite:<bucket> --max-depth 1 from a local machine. If auth fails: verify OFFSITE_S3_ACCESS_KEY / OFFSITE_S3_SECRET_KEY are current. | If Hetzner S3 is down: backups cannot upload. Not an emergency unless it persists > 24 h (the daily verification pipeline will catch it). |
| LLM Provider | Check OpenAI status at status.openai.com or Anthropic at status.anthropic.com. Verify the API key is valid and has remaining quota. | If the provider is down: AI features degrade gracefully (copilot and magic upload return errors). Not a P1 — document and wait for recovery. |
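
Quick probes for the table above, sketched below. The /api/v2/status.json paths assume both providers expose the common Statuspage JSON API; the offsite: remote name comes from the table.

```bash
# A10: LLM provider status (paths assume the Statuspage API layout).
curl -s https://status.openai.com/api/v2/status.json | jq -r '.status.description'
curl -s https://status.anthropic.com/api/v2/status.json | jq -r '.status.description'
# A10: Hetzner object storage reachability and auth in one call.
rclone lsd offsite: --max-depth 1
```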

A11 — StudioServiceDown

Intent: A studio stack service (context-engine, docs-hub, landing, qdrant) failed its Uptime Kuma health check for ≥ 3 consecutive intervals (30 s).

Monitoring: Uptime Kuma runs inside the studio Swarm stack and probes all services via their Docker-internal healthcheck endpoints.

  1. SSH into the studio server: ./infrastructure/scripts/ssh-connect.sh studio "docker service ls".
  2. Check the failing service: docker service ps busflow-studio_<name> --no-trunc.
  3. Check logs: docker service logs busflow-studio_<name> --tail 50.
  4. Common root causes:
    • 504 Gateway Timeout from Traefik: the service uses multiple overlay networks and Traefik resolved the backend IP on the wrong network. Fix: add traefik.swarm.network=busflow-studio_busflow-traefik to the service's deploy labels (see A12).
    • Crash-loop: missing env vars (GEMINI_API_KEY, ANTHROPIC_API_KEY, etc.). Check GitHub Actions environment secrets for the studio environment.
    • Qdrant unreachable: context-engine's /api/health/ready returns 503. Check docker service ps busflow-studio_qdrant.
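
Steps 1-3 collapse into a single sweep, sketched below. The script path and stack prefix come from this runbook; the loop and output format are illustrative.

```bash
# A11: current state plus last error for every studio service in one pass.
./infrastructure/scripts/ssh-connect.sh studio '
  for svc in $(docker service ls --filter name=busflow-studio --format "{{.Name}}"); do
    echo "== $svc"
    docker service ps "$svc" --no-trunc --format "{{.CurrentState}} {{.Error}}" | head -3
  done'
```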

A12 — StudioTraefikMultiNetworkRouting

Intent: Traefik routed traffic to a backend IP on the wrong overlay network, causing a 504 Gateway Timeout.

Background: When a Swarm service joins multiple overlay networks (e.g., busflow-traefik + busflow-qdrant), Docker assigns one IP per network. Without traefik.swarm.network, Traefik can resolve the service's VIP to any of those IPs — including one on a network Traefik itself is not attached to.

  1. Verify the Traefik access log: docker exec $(docker ps -q -f name=busflow-studio_traefik) tail -20 /var/log/traefik/access.log | grep 504.
  2. If the backend IP in the log does not match the busflow-traefik subnet (10.0.1.x): the service is missing the traefik.swarm.network label.
  3. Fix: docker service update --label-add 'traefik.swarm.network=busflow-studio_busflow-traefik' busflow-studio_<name>.
  4. Permanent fix: add the label in docker-compose.studio.yml and redeploy.
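
To see the per-network IPs behind step 2, inspect the service's VIPs and map them to subnets. A sketch; context-engine stands in for whichever busflow-studio service is failing.

```bash
# A12 step 2: which VIP lives on which network for the suspect service?
docker service inspect busflow-studio_context-engine \
  --format '{{json .Endpoint.VirtualIPs}}' | jq .
# Resolve the Traefik network's subnet to compare against the access-log backend IP:
docker network inspect busflow-studio_busflow-traefik \
  --format '{{.Name}} {{(index .IPAM.Config 0).Subnet}}'
```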

Appendix — Closing an alert

Every acknowledged alert must land in the on-call journal with:

  • Alert ID (e.g., A3).
  • Root-cause one-liner.
  • Remediation (PR, config change, or "no-op, transient").
  • Whether the contract (budget, SLO, exclusion list) changed — if yes, link the PR.
