Runbook – Observability Alerts
ADR: ADR-027 · Architecture: observability.md · On-call: @ops-oncall in #ops
This runbook is the single landing page for every Grafana alert that pages on-call. Each alert has an ID (matching the `alertname` label), a one-line intent, and a terse triage procedure. Keep this file flat and greppable; do not add deep hierarchy.
A1 – HighApiErrorRate

Intent: p99 HTTP 5xx rate on `busflow_api` > 2% for 5 min.

- Open Grafana → dashboard "API Errors" → filter by `tenant_id`.
- If concentrated on one tenant: check `communications.messages` and `operations.incidents` for backpressure; page the tenant's CSM only if sustained > 30 min.
- If broad: check `pg_stat_activity` for lock waits (sketch below); check Hasura `metadata_status`; check `traefik_service_requests_total{code=~"5.."}` for edge-level symptoms.
- If Postgres-rooted: follow postgres-cutover.md §Rollback only if within the 72 h warm-rollback window; otherwise open a P1 incident.
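A minimal lock-wait check for the "broad" branch, assuming `psql` can reach the primary (the `$DATABASE_URL` connection string is a placeholder):

```sh
# List sessions currently waiting on a lock, with the pids blocking them.
psql "$DATABASE_URL" -c "
  SELECT pid,
         pg_blocking_pids(pid) AS blocked_by,
         state,
         left(query, 80)       AS query
  FROM   pg_stat_activity
  WHERE  wait_event_type = 'Lock';"
```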
A2 – LatencySloBreach

Intent: p95 latency on `/api/bookings` > 800 ms for 10 min (SLO: ≤ 500 ms).

- Confirm the symptom in Tempo: sample 10 slow traces, check `db.statement` spans.
- If `pg_stat_statements` shows a regressed query: capture the `query_id`, open a fix-forward issue, notify the feature team's lead.
- If the symptom is load-driven and Swarm has spare capacity: scale `busflow_api` replicas (`docker service scale busflow_api=<n+2>`).
- Do not scale above `stateful_worker.replicas × 1.5`; the Postgres connection pool is the ceiling (see the sketch below). Escalate to infra-oncall if more capacity is genuinely needed.
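A ceiling-checked scale-up sketch, assuming both services run in replicated mode; the +2 step mirrors the `<n+2>` guidance above:

```sh
# Scale busflow_api, refusing to cross the stateful_worker × 1.5 ceiling.
workers=$(docker service inspect stateful_worker \
  --format '{{.Spec.Mode.Replicated.Replicas}}')
current=$(docker service inspect busflow_api \
  --format '{{.Spec.Mode.Replicated.Replicas}}')
ceiling=$(( workers * 3 / 2 ))   # × 1.5, integer-truncated
target=$(( current + 2 ))
if [ "$target" -le "$ceiling" ]; then
  docker service scale busflow_api="$target"
else
  echo "target $target exceeds ceiling $ceiling; escalate to infra-oncall" >&2
fi
```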
A3 – PiiLeakDetected

Intent: A PII redaction rule matched > 5 events in 24 h (the daily false-positive budget; see observability.md §PII Redaction).

- Open the Loki query: `{job="faro"} | json | pii_match_rule != ""` (a CLI variant follows this list).
- If all matches share one `rule_id`: evaluate whether the rule is a true leak or a false positive.
  - False positive: add an exception to `docker/observability/pii-exclusions.yaml` with the `rule_id` + sample excerpt + reviewer. Open a PR; do not hand-edit on the box.
  - True leak: treat as a security incident. Page @security-oncall. Follow the breach protocol in gdpr-strategy.md §5 (72-h notification mandate).
- If matches span multiple rules: page @security-oncall immediately without further triage; scatter patterns are almost never false positives.
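The same query from a terminal, assuming `logcli` (Loki's query CLI) is installed and `LOKI_ADDR` points at the stack's Loki instance (the address below is a placeholder):

```sh
export LOKI_ADDR=http://loki:3100   # placeholder; use the stack's Loki endpoint
logcli query --since=24h --limit=200 \
  '{job="faro"} | json | pii_match_rule != ""'
```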
A4 – PgCronJobFailed

Intent: A pg_cron job (scrub pipeline, snapshot, analyze) errored twice in a row.

- Identify the job (run status lives in `cron.job_run_details`; `jobname` lives in `cron.job`): `SELECT j.jobname, d.start_time, d.end_time, d.status FROM cron.job_run_details d JOIN cron.job j USING (jobid) ORDER BY d.start_time DESC LIMIT 20;`
- If `tenant_scrub_*`: follow legal-hold-runbook.md §Scrub Fault. Do not disable the job; a disabled GDPR scrub is a compliance finding.
- If `refresh_*` (materialised view): safe to retry manually; log the retry in the on-call journal.
- If the failure is a hot-lock: capture the blocking query via `pg_blocking_pids()` before retrying (sketch below); the architect-loop requires a blocker capture for every scrub lock event.
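A blocker-capture sketch, assuming direct `psql` access (connection string is a placeholder); it pairs each blocked session with the text of the query blocking it:

```sh
psql "$DATABASE_URL" -c "
  SELECT blocked.pid              AS blocked_pid,
         blocker.pid              AS blocking_pid,
         left(blocker.query, 120) AS blocking_query
  FROM   pg_stat_activity blocked
  JOIN   pg_stat_activity blocker
         ON blocker.pid = ANY (pg_blocking_pids(blocked.pid));"
```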
A5 – CardinalityExplosion

Intent: A metric's active series count exceeded its declared `cardinality_budget` by ≥ 2×.

- Find the offender: `topk(10, count by (__name__, tenant_id) ({job="mimir"}))` in Grafana Explore (or via the API; sketch below).
- Check the offending metric's declaration (`docker/observability/metrics-budgets.yaml`): does the budget match the CI linter output?
- Common root causes: a new label added without a budget update, or a tenant migration spraying unbounded values into a label like `route_id`.
- Do not delete series manually; Mimir's compactor handles eviction. Instead:
  - File a ticket on the owning service.
  - Raise the budget only after confirming the new cardinality is bounded (the PR must update `metrics-budgets.yaml` and the CI linter contract).
  - If the explosion is ongoing and threatens Mimir ingestion: set a temporary relabel-drop at the OTel Collector (never at Mimir) and time-box it to ≤ 48 h.
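When Grafana Explore is slow or unavailable, the same offender query can be run against the Prometheus-compatible query API; the Mimir URL and port below are placeholders for this stack:

```sh
curl -sG 'http://mimir:9009/prometheus/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__, tenant_id) ({job="mimir"}))' \
  | jq -r '.data.result[] | "\(.metric.__name__)\t\(.metric.tenant_id)\t\(.value[1])"'
```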
A6 – LokiCompactorBehind

Intent: Compactor backlog > 3 h OR MinIO bucket at ≥ 85% of the 500 GB quota.

- Check compactor health: `loki_compactor_last_successful_run_timestamp_seconds`.
- If the compactor is unhealthy: restart `loki_compactor` (single-replica service; sketch below). Do not scale up; dual compactors corrupt the shared store.
- If the compactor is healthy but the backlog is from ingest volume: the `retention_period` contract (336 h / 14 d) is working. Treat as capacity planning, not an incident, and open a ticket to revisit the quota.
- Never solve this via MinIO bucket-lifecycle rules. That path causes "chunk not found" errors at query time (see observability.md §Log Volume & Retention).
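A sketch that turns the health metric into a backlog age and restarts the compactor only when the 3 h threshold is breached; it assumes the metric is scraped into the stack's Prometheus-compatible store, whose URL is a placeholder:

```sh
last=$(curl -sG 'http://mimir:9009/prometheus/api/v1/query' \
  --data-urlencode 'query=loki_compactor_last_successful_run_timestamp_seconds' \
  | jq -r '.data.result[0].value[1]')
age=$(( $(date +%s) - ${last%.*} ))
echo "compactor last succeeded ${age}s ago"
# Force a restart of the single-replica service only if clearly stuck (> 3 h).
[ "$age" -gt 10800 ] && docker service update --force loki_compactor
```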
A7 – SsoFallthrough

Intent: A tenant session reached Grafana without a valid SSO claim.

- Check `GF_SECURITY_COOKIE_DOMAIN`, `GF_SECURITY_COOKIE_SAMESITE=none`, `GF_SECURITY_COOKIE_SECURE=true`, and `GF_SERVER_ROOT_URL` on the active Grafana replica: `docker service inspect busflow_grafana --format '{{json .Spec.TaskTemplate.ContainerSpec.Env}}'`.
- If any env differs from the contract in observability.md §Multi-Tenant: revert via `docker service update --env-add`. Do not hotpatch the live container.
- If the envs are correct but the fallthrough persists: the tenant's IdP config is the likely cause. Verify `tenants.sso_issuer` matches what the IdP presents (sketch below); rotate the tenant's shared JWT secret via the secrets-rotation runbook.
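An issuer-comparison sketch, assuming the tenant's IdP exposes OIDC discovery and `psql` can reach the app database; the IdP URL and tenant id are placeholders:

```sh
idp_issuer=$(curl -s https://idp.example.com/.well-known/openid-configuration \
  | jq -r .issuer)
db_issuer=$(psql "$DATABASE_URL" -At \
  -c "SELECT sso_issuer FROM tenants WHERE id = '<tenant_id>';")
[ "$idp_issuer" = "$db_issuer" ] \
  || echo "issuer mismatch: IdP=$idp_issuer DB=$db_issuer"
```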
A8 – PostgresExporterDsnGateOpen

Intent: The postgres-exporter service has replicas > 0 but the `pgexporter_dsn` Swarm secret is absent or a stub.

- Confirm: `docker service inspect busflow_postgres_exporter --format '{{.Spec.TaskTemplate.ContainerSpec.Secrets}}'`.
- If the secret is missing: this is the bootstrap gap documented in observability.md. Either complete `.github/workflows/secrets-sync.yml` and populate the secret, or scale the exporter to 0 (sketch below).
- Never add the DSN as a plaintext env var. If someone did: rotate the Postgres credential immediately via the secrets-rotation runbook.
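A sketch that closes the gate when the secret is absent; the service and secret names are taken from this alert:

```sh
docker secret inspect pgexporter_dsn >/dev/null 2>&1 || {
  echo "pgexporter_dsn missing; scaling exporter to 0 until it is populated"
  docker service scale busflow_postgres_exporter=0
}
```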
A9 – TlsCertExpiringSoon

Intent: A Traefik-managed ACME certificate has ≤ 14 days until expiry.

- Check which subdomain is affected: the `traefik_tls_certs_not_after` label `cn` identifies the cert.
- Verify the ACME challenge method: HTTP challenge (`letsencrypt-http`) or DNS challenge (`letsencrypt-dns`). Check Traefik logs for recent renewal errors: `docker service logs busflow_traefik --tail 100 | grep -i acme`.
- If it is a DNS challenge failure: verify the Cloudflare DNS API token (`CF_DNS_API_TOKEN`) is valid; tokens expire or get revoked without notification. Test via `curl -H "Authorization: Bearer $TOKEN" https://api.cloudflare.com/client/v4/user/tokens/verify`.
- Force a renewal attempt: restart the Traefik service (`docker service update --force busflow_traefik`). Traefik re-checks all certs on startup.
- If renewal still fails: check Let's Encrypt rate limits at https://tools.letsdebug.net/cert-search/<domain>; a burst of failed attempts may have triggered a 1-hour backoff. A direct expiry check, independent of the Traefik metric, follows.
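A sketch that reads the served certificate's expiry straight off the wire; the hostname is a placeholder for the affected subdomain:

```sh
host=sub.example.com
echo | openssl s_client -servername "$host" -connect "$host":443 2>/dev/null \
  | openssl x509 -noout -subject -enddate
```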
A10 – ExternalDependencyDown

Intent: An external dependency health probe returned unhealthy for ≥ 3 consecutive checks (15 min).

Per-provider triage (manual probes are sketched after the table):

| Provider | First step | Escalation |
|---|---|---|
| AWS SES | Check the AWS status page for eu-west-1 SES. Verify the SMTP credentials in the `busflow_smtp_*` Swarm secrets are not expired. | If SES is globally down: no action possible; document in the on-call journal. If credentials expired: rotate via secrets-rotation-runbook.md. |
| Hetzner Object Storage | Run `rclone ls offsite:<bucket> --max-depth 1` from a local machine. If auth fails: verify `OFFSITE_S3_ACCESS_KEY` / `OFFSITE_S3_SECRET_KEY` are current. | If Hetzner S3 is down: backups cannot upload. Not an emergency unless it persists > 24 h (the daily verification pipeline will catch it). |
| LLM Provider | Check OpenAI status at status.openai.com or Anthropic at status.anthropic.com. Verify the API key is valid and has remaining quota. | If the provider is down: AI features degrade gracefully (copilot and magic upload return errors). Not a P1; document and wait for recovery. |
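Two quick probes runnable from any machine. The `offsite` rclone remote comes from the table; the SES SMTP endpoint is the standard eu-west-1 one and is an assumption for this stack:

```sh
# Hetzner object storage: auth + reachability (lists top-level buckets).
rclone lsd offsite: --max-depth 1
# AWS SES: TLS handshake against the regional SMTP endpoint (assumed endpoint).
openssl s_client -starttls smtp \
  -connect email-smtp.eu-west-1.amazonaws.com:587 </dev/null 2>/dev/null | head -1
```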
A11 – StudioServiceDown

Intent: A studio stack service (context-engine, docs-hub, landing, qdrant) failed its Uptime Kuma health check for ≥ 3 consecutive intervals (30 s).

Monitoring: Uptime Kuma runs inside the studio Swarm stack and probes all services via their Docker-internal healthcheck endpoints.

- SSH into the studio server: `./infrastructure/scripts/ssh-connect.sh studio "docker service ls"` (a one-pass summary is sketched below).
- Check the failing service: `docker service ps busflow-studio_<name> --no-trunc`.
- Check logs: `docker service logs busflow-studio_<name> --tail 50`.
- Common root causes:
  - 504 Gateway Timeout from Traefik: the service uses multiple overlay networks and Traefik resolved the backend IP on the wrong network. Fix: add `traefik.swarm.network=busflow-studio_busflow-traefik` to the service's deploy labels.
  - Crash-loop: missing env vars (`GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, etc.). Check the GitHub Actions environment secrets for the `studio` environment.
  - Qdrant unreachable: context-engine's `/api/health/ready` returns 503. Check `docker service ps busflow-studio_qdrant`.
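A one-pass health summary for the whole studio stack, assuming `ssh-connect.sh` accepts a remote command as its second argument (as in the first step above):

```sh
./infrastructure/scripts/ssh-connect.sh studio '
  docker service ls --format "{{.Name}}\t{{.Replicas}}"
  for s in $(docker service ls -q); do
    docker service ps "$s" --no-trunc \
      --format "{{.Name}}\t{{.CurrentState}}\t{{.Error}}" | head -3
  done'
```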
A12 – StudioTraefikMultiNetworkRouting

Intent: Traefik routed traffic to a backend IP on the wrong overlay network, causing a 504 Gateway Timeout.

Background: When a Swarm service joins multiple overlay networks (e.g., `busflow-traefik` + `busflow-qdrant`), Docker assigns one IP per network. Without `traefik.swarm.network`, Traefik can resolve the service's VIP to any of those IPs, including one on a network Traefik itself is not attached to.

- Verify the Traefik access log: `docker exec $(docker ps -q -f name=busflow-studio_traefik) tail -20 /var/log/traefik/access.log | grep 504`.
- If the backend IP in the log does not match the `busflow-traefik` subnet (10.0.1.x): the service is missing the `traefik.swarm.network` label (a per-network VIP listing is sketched below).
- Fix: `docker service update --label-add 'traefik.swarm.network=busflow-studio_busflow-traefik' busflow-studio_<name>`.
- Permanent fix: add the label in `docker-compose.studio.yml` and redeploy.
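A sketch that lists every VIP the service holds, one per overlay network, so the IP from the access log can be matched to a network by name:

```sh
docker service inspect busflow-studio_<name> \
  --format '{{range .Endpoint.VirtualIPs}}{{printf "%s %s\n" .NetworkID .Addr}}{{end}}' \
  | while read -r net addr; do
      echo "$(docker network inspect "$net" --format '{{.Name}}') -> $addr"
    done
```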
Appendix – Closing an alert
Every acknowledged alert must land in the on-call journal with:
- Alert ID (e.g., A3).
- Root-cause one-liner.
- Remediation (PR, config change, or "no-op, transient").
- Whether the contract (budget, SLO, exclusion list) changed β if yes, link the PR.