Busflow Docs

Internal documentation portal


ADR-027: Per-metric cardinality-budget annotation contract

Status: ⏳ Deferred; activate when custom metric count exceeds 5

Impacts: observability.md §Metrics & High-Cardinality Management; .github/workflows/ci.yml (linter step); every new PrometheusRule / alert definition


Context

observability.md states a global cardinality ceiling (≤ 80 000 active series = 4 000 tenants × 20 series). In practice, a single rogue custom metric with an unbounded label (e.g. a route label that embeds booking IDs, or a raw user_agent) can blow the whole budget on its own and drive Mimir's ingesters out of memory (OOM).
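The arithmetic behind that ceiling, and how quickly one unbounded label overshoots it, can be sketched as below. The tenant and per-tenant figures come from the text above; the `route` cardinality of 500 and the helper name `worst_case_series` are illustrative assumptions, not measured values or existing code.

```python
from math import prod

TENANTS = 4_000
SERIES_PER_TENANT = 20
GLOBAL_CEILING = TENANTS * SERIES_PER_TENANT  # 80_000 active series


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# A bounded metric: one series per tenant and status value.
bounded = worst_case_series({"tenant": TENANTS, "status": 2})

# Hypothetical rogue metric: `route` embeds booking IDs, so its value set
# grows with traffic. 500 distinct routes is an assumed figure.
rogue = worst_case_series({"tenant": TENANTS, "route": 500})

print(GLOBAL_CEILING, bounded, rogue)  # 80000 8000 2000000
```

Under these assumptions the single rogue metric alone is 25 times the global ceiling, while the bounded metric uses a tenth of it.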

Global ceilings do not catch per-metric blow-ups. We need a per-metric contract that:

  1. Declares an expected cardinality at PR time.
  2. Is verifiable in CI against the actual series produced by promtool.
  3. Fails the build when a metric's label-set expands beyond its declared budget.
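A minimal sketch of the third requirement, assuming the actual series arrive as a list of label sets (one dict per sample). The function names `count_series` and `check_budget` and the sample data are hypothetical; the ADR does not prescribe how series are obtained or represented.

```python
def count_series(samples: list[dict[str, str]]) -> int:
    """Count distinct label sets; each distinct set is one active series."""
    return len({frozenset(labels.items()) for labels in samples})


def check_budget(metric: str, declared: int, samples: list[dict[str, str]]) -> list[str]:
    """Return violation messages; an empty list means the metric is within budget."""
    actual = count_series(samples)
    if actual > declared:
        return [f"{metric}: {actual} series exceeds declared budget {declared}"]
    return []


# Illustrative samples for one metric (values are made up).
samples = [
    {"tenant": "t1", "status": "timeout"},
    {"tenant": "t1", "status": "declined"},
    {"tenant": "t2", "status": "timeout"},
    {"tenant": "t1", "status": "timeout"},  # duplicate label set, same series
]

print(count_series(samples))  # 3
print(check_budget("busflow_booking_failed_total", 2, samples))
print(check_budget("busflow_booking_failed_total", 8000, samples))  # []
```

A CI step built on this would exit non-zero whenever any metric returns a non-empty list.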

Decision

Every custom metric that ships with a Prometheus alert rule MUST carry a cardinality_budget: "<n>" annotation.

Example:

yaml
- alert: HighBookingFailureRate
  expr: rate(busflow_booking_failed_total[5m]) > 0.05
  annotations:
    summary: "Booking failure rate > 5 %"
    cardinality_budget: "8000"   # 4000 tenants × 2 status labels
    runbook: "https://busflow.app/runbooks/booking-failures"

CI enforcement in .github/workflows/ci.yml:

  1. promtool check rules validates rule syntax (pre-existing).
  2. A new check_metric_cardinality.py probe:
    • Loads each rule file.
    • Parses the annotation; fails if missing, non-integer, or > 80 000.
    • If a metrics sample is available (pre-deploy against staging), counts the actual active series for the metric and fails if it exceeds the declared budget.
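The annotation checks in the probe might look roughly like the sketch below. It assumes each rule has already been loaded into a plain dict (for example by a YAML parser), and `parse_budget` is a hypothetical name; the actual check_metric_cardinality.py is not shown in this ADR. Accepting only plain decimal digits is also an assumption, since the ADR says only "non-integer".

```python
import re

GLOBAL_CEILING = 80_000  # global ceiling stated in observability.md


def parse_budget(rule: dict) -> int:
    """Validate a rule's cardinality_budget annotation and return it as an int."""
    name = rule.get("alert", "<unnamed>")
    raw = (rule.get("annotations") or {}).get("cardinality_budget")
    if raw is None:
        raise ValueError(f"{name}: missing cardinality_budget annotation")
    text = str(raw).strip()
    # Assumption: only plain decimal digits count as an integer here.
    if not re.fullmatch(r"[0-9]+", text):
        raise ValueError(f"{name}: cardinality_budget {raw!r} is not an integer")
    budget = int(text)
    if budget > GLOBAL_CEILING:
        raise ValueError(f"{name}: budget {budget} exceeds global ceiling {GLOBAL_CEILING}")
    return budget


rule = {
    "alert": "HighBookingFailureRate",
    "annotations": {"cardinality_budget": "8000"},
}
print(parse_budget(rule))  # 8000
```

For the live comparison against staging, one option is a PromQL query such as `count({__name__="busflow_booking_failed_total"})`, with the result checked against the parsed budget; how the probe obtains its metrics sample is left open here.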

Consequences

Positive:

  • Cardinality explosions are caught at PR review, not at production ingestion.
  • Every metric has a declared "expected shape" documented alongside it, which makes retention cost easier to reason about.

Negative:

  • Adds a per-metric annotation requirement; slight author friction. Mitigated by a template snippet in docs/architecture/observability.md §Metrics.

Neutral:

  • Applies to application metrics only; system metrics from node-exporter / cAdvisor are exempt (their shape is fixed by the exporter).
