ADR-027: Per-metric cardinality-budget annotation contract
Status: ⏳ Deferred; activate when custom metric count exceeds 5
Impacts: observability.md §Metrics & High-Cardinality Management; .github/workflows/ci.yml (linter step); every new PrometheusRule / alert definition
Context
observability.md states a global cardinality ceiling (≤ 80 000 active series = 4 000 tenants × 20 series). In practice, a single rogue custom metric with an unbounded label (e.g. a route label that includes booking IDs, or a raw user_agent) can blow the budget on its own and push Mimir's ingesters into OOM.
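To make the arithmetic concrete, the sketch below estimates a metric's worst-case series count as the product of distinct values per label. The label names and value counts are illustrative assumptions, not measurements from the system.

```python
from math import prod

GLOBAL_CEILING = 80_000  # 4 000 tenants x 20 series, per observability.md


def worst_case_series(label_value_counts: dict) -> int:
    """Upper bound on active series: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical bounded metric: one series per tenant and status.
bounded = worst_case_series({"tenant": 4_000, "status": 2})

# Hypothetical rogue metric: a route label that embeds booking IDs.
# 50 distinct route values per tenant is an assumed figure for illustration.
rogue = worst_case_series({"tenant": 4_000, "route": 50})

print(bounded, bounded <= GLOBAL_CEILING)  # 8000 True
print(rogue, rogue <= GLOBAL_CEILING)      # 200000 False
```

Under these assumed numbers, the rogue metric alone would need 2.5 times the global ceiling, while the bounded metric uses a tenth of it.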
Global ceilings do not catch per-metric blow-ups. We need a per-metric contract that:
- Declares an expected cardinality at PR time.
- Is verifiable in CI against the actual series produced by promtool.
- Fails the build when a metric's label set expands beyond its declared budget.
Decision
Every custom metric that ships with a Prometheus alert rule MUST carry a cardinality_budget: "<n>" annotation.
Example:
- alert: HighBookingFailureRate
  expr: rate(busflow_booking_failed_total[5m]) > 0.05
  annotations:
    summary: "Booking failure rate > 5 %"
    cardinality_budget: "8000" # 4000 tenants × 2 status labels
    runbook: "https://busflow.app/runbooks/booking-failures"

CI enforcement in .github/workflows/ci.yml:
- promtool check rules validates rule syntax (pre-existing).
- A new check_metric_cardinality.py probe:
  - Loads each rule file.
  - Parses the annotation; fails if it is missing, non-integer, or > 80 000.
  - If a metrics sample is available (pre-deploy against staging), counts the actual active series for the metric and fails if the count exceeds the declared budget.
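The probe's checks could be sketched roughly as follows. This is a minimal illustration, not the actual check_metric_cardinality.py: it assumes rules are already parsed into dicts, that the metric name is passed in explicitly (the ADR does not say how a rule maps to its metric), and that the staging sample is a list of label dicts in the shape returned by the Prometheus HTTP API series endpoint. The names BudgetError, parse_budget, count_active_series, and check_rule are hypothetical.

```python
from typing import Optional

GLOBAL_CEILING = 80_000  # global active-series ceiling from observability.md


class BudgetError(ValueError):
    """A rule's cardinality budget is missing, invalid, or exceeded."""


def parse_budget(rule: dict) -> int:
    """Return the declared budget of one alert rule, or raise BudgetError."""
    name = rule.get("alert", "<unnamed>")
    raw = (rule.get("annotations") or {}).get("cardinality_budget")
    if raw is None:
        raise BudgetError(f"{name}: cardinality_budget annotation is missing")
    try:
        budget = int(str(raw).strip())
    except ValueError:
        raise BudgetError(f"{name}: cardinality_budget {raw!r} is not an integer") from None
    # Rejecting non-positive budgets is an added assumption; the ADR only
    # names the upper bound.
    if budget <= 0 or budget > GLOBAL_CEILING:
        raise BudgetError(f"{name}: cardinality_budget {budget} is outside 1..{GLOBAL_CEILING}")
    return budget


def count_active_series(series: list, metric: str) -> int:
    """Count distinct label sets for `metric` in a sample of label dicts."""
    label_sets = {
        frozenset(labels.items())
        for labels in series
        if labels.get("__name__") == metric
    }
    return len(label_sets)


def check_rule(rule: dict, metric: str, series: Optional[list] = None) -> int:
    """Validate the annotation; if a sample is given, compare it to the budget."""
    budget = parse_budget(rule)
    if series is not None:
        actual = count_active_series(series, metric)
        if actual > budget:
            raise BudgetError(f"{metric}: {actual} active series exceeds budget {budget}")
    return budget


# Usage with the example rule above and a tiny, made-up staging sample.
rule = {
    "alert": "HighBookingFailureRate",
    "expr": "rate(busflow_booking_failed_total[5m]) > 0.05",
    "annotations": {"cardinality_budget": "8000"},
}
sample = [
    {"__name__": "busflow_booking_failed_total", "tenant": "t1", "status": "declined"},
    {"__name__": "busflow_booking_failed_total", "tenant": "t1", "status": "timeout"},
    {"__name__": "busflow_booking_failed_total", "tenant": "t2", "status": "declined"},
]
print(check_rule(rule, "busflow_booking_failed_total", sample))  # 8000
```

Keeping the static annotation check separate from the sample-based check lets the first run on every PR, while the second runs only when a staging sample exists.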
Consequences
Positive:
- Cardinality explosions are caught at PR review, not at production ingestion.
- Every metric has a declared "expected shape" documented alongside it, which makes retention cost easier to reason about.
Negative:
- Adds a per-metric annotation requirement; slight author friction. Mitigated by a template snippet in docs/architecture/observability.md §Metrics.
Neutral:
- Applies to application metrics only; system metrics from node-exporter / cAdvisor are exempt (their shape is fixed by the exporter).
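One way the probe could apply this exemption is to enforce budgets only on metrics that carry the application's name prefix. The sketch below assumes a busflow_ prefix, inferred from the example rule; both the prefix convention and the requires_budget helper are hypothetical.

```python
APP_METRIC_PREFIX = "busflow_"  # assumed naming convention, inferred from the example rule


def requires_budget(metric_name: str, app_prefix: str = APP_METRIC_PREFIX) -> bool:
    """True for application metrics; exporter metrics fall outside the contract."""
    return metric_name.startswith(app_prefix)


print(requires_budget("busflow_booking_failed_total"))  # True
print(requires_budget("node_cpu_seconds_total"))        # False (node-exporter)
print(requires_budget("container_memory_usage_bytes"))  # False (cAdvisor)
```

Matching on the application prefix, rather than listing exporter prefixes to skip, means a newly added exporter is exempt by default.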