ADR-0304: Custom-resource metrics and richer aggregation
KSM-style domain metrics from the collection engine, plus cross-target/cross-cluster aggregation.
Theme: 03 · Collection & extraction · Status: Exploring (Phase 4 spike landed; config + engine wiring ongoing)
Context
Phases 1–3 shipped operator Prometheus metrics on /metrics (ADR-0602,
ADR-0601) and inventory aggregation via KollectInventory /
KollectClusterInventory with hub merge (ADR-0501). Stakeholder
export uses Git, Postgres, Kafka, and object-store sinks, not a Prometheus export sink.
kube-state-metrics (KSM) exposes
CustomResourceStateMetrics: config-driven GVK → Prometheus series from informer cache paths.
That pattern complements Kollect's existing operator metrics and is the primary Phase 4 deliverable
per prior art and ROADMAP.
Phase 4 must also define richer cross-target / cross-cluster aggregation without duplicating full inventory payloads in etcd (ADR-0103) or exploding label cardinality.
Decision
1. KSM-style custom-resource metrics
- Emit from the existing collection engine (shared dynamic informers per GVK; see ADR-0301), not a second watch loop.
- Config surface: `KollectProfile.spec.metrics` (and `KollectClusterProfile.spec.metrics`). A companion `KollectMetricsProfile` CR is deferred until cross-profile reuse is required.
- Spike shape: `MetricSpec { name, path, labels? }` where `path` references an attribute name from `spec.attributes`; admission validates bounded label keys (max 5). The engine emits `kollect_custom_resource_series{profile,gvk,series}` and, when labels are configured, `kollect_custom_resource_labeled_series{profile,gvk,series,<attribute labels>}`; auto-sum of all numeric attributes remains the fallback when `spec.metrics` is empty.
- Cardinality rules: bounded label sets; no unbounded `name`/`namespace` labels unless explicitly opted in per metric; document max series per profile in PERFORMANCE.md.
- Serve on the operator `/metrics` endpoint alongside the existing `kollect_*` counters; there is no `KollectSink.type: prometheus` (ADR-0601).
- `metricsScope`: ships on the profile (ADR-0604) and selects `profile` (default) or `target` emission keys for sum series.
- Future scalar export: exporting individual numeric attribute values as gauges/counters is RFC-only (RFC: Prometheus attribute metrics); it complements sum series and is not in the first ADR-0604 ship window.
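The admission rules in this list can be sketched in Go as follows. This is a minimal illustration, not the operator's actual webhook: the field types on `MetricSpec`, the `AllowIdentityLabels` opt-in flag, the `ValidateMetricSpec` function, and the rule that label keys must themselves be declared attributes are all assumptions made for the example. Only the `{ name, path, labels? }` shape, the five-key bound, and the `name`/`namespace` opt-in requirement come from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

// maxLabelKeys is the admission bound on label keys stated in this ADR.
const maxLabelKeys = 5

// MetricSpec mirrors the spike shape { name, path, labels? }.
// Field types and AllowIdentityLabels are assumptions for this sketch.
type MetricSpec struct {
	Name                string
	Path                string   // attribute name from spec.attributes
	Labels              []string // attribute names used as label keys (optional)
	AllowIdentityLabels bool     // hypothetical opt-in for name/namespace labels
}

// ValidateMetricSpec checks one spec against the set of declared attribute names.
func ValidateMetricSpec(m MetricSpec, attributes map[string]bool) error {
	if m.Name == "" {
		return errors.New("metric name is required")
	}
	if !attributes[m.Path] {
		return fmt.Errorf("metric %q: path %q is not a declared attribute", m.Name, m.Path)
	}
	if len(m.Labels) > maxLabelKeys {
		return fmt.Errorf("metric %q: %d label keys exceeds the maximum of %d",
			m.Name, len(m.Labels), maxLabelKeys)
	}
	seen := make(map[string]bool, len(m.Labels))
	for _, key := range m.Labels {
		if seen[key] {
			return fmt.Errorf("metric %q: duplicate label key %q", m.Name, key)
		}
		seen[key] = true
		// Assumption: label keys must also be declared attributes.
		if !attributes[key] {
			return fmt.Errorf("metric %q: label %q is not a declared attribute", m.Name, key)
		}
		// Unbounded identity labels are rejected unless the metric opts in.
		if (key == "name" || key == "namespace") && !m.AllowIdentityLabels {
			return fmt.Errorf("metric %q: label %q requires an explicit opt-in", m.Name, key)
		}
	}
	return nil
}

func main() {
	attrs := map[string]bool{"replicas": true, "phase": true, "namespace": true}
	specs := []MetricSpec{
		{Name: "replicas_total", Path: "replicas", Labels: []string{"phase"}},
		{Name: "replicas_by_ns", Path: "replicas", Labels: []string{"namespace"}},
		{Name: "unknown_path", Path: "missing"},
	}
	for _, s := range specs {
		if err := ValidateMetricSpec(s, attrs); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", s.Name)
	}
}
```

In this sketch the first spec is accepted, the second is rejected because `namespace` is used without the opt-in, and the third is rejected because its path does not name a declared attribute.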
2. Richer aggregation
- Spoke: `KollectInventory` remains the per-namespace rollup contract (debounced export via `spec.exportMinInterval`).
- Hub: `KollectClusterInventory` merges spoke summaries; aggregation rules stay O(total rows).
- Cross-target rollups (spike): optional `KollectTargetSet`-style grouping is deferred; the Phase 4 spike in `internal/aggregate/` documents row identity (`RowIdentity`), the `DedupeByResourceUID` merge mode, and `ExportCoalesce` checksum/generation skip rules for multi-target inventories sharing one sink.
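One way to picture the `DedupeByResourceUID` merge mode while keeping the merge O(total rows) is a single pass with a set of seen UIDs. The sketch below is illustrative only: the `Row` fields, the `MergeMode` type, the `KeepAll` mode, the `MergeRows` signature, and the first-row-wins policy are assumptions, not the code in `internal/aggregate/`.

```go
package main

import "fmt"

// Row is a hypothetical inventory row; the real type lives in internal/aggregate/.
type Row struct {
	Target string // which target produced the row
	UID    string // Kubernetes resource UID
	Kind   string
	Name   string
}

// MergeMode selects how rows from several targets are combined.
type MergeMode int

const (
	KeepAll             MergeMode = iota // hypothetical mode: plain concatenation
	DedupeByResourceUID                  // collapse rows that share a resource UID
)

// MergeRows concatenates the per-target slices in order. In DedupeByResourceUID
// mode the first row seen for each UID wins (an assumed policy) and rows without
// a UID are always kept. The work is one pass over all rows.
func MergeRows(mode MergeMode, targets ...[]Row) []Row {
	var out []Row
	seen := make(map[string]struct{})
	for _, rows := range targets {
		for _, r := range rows {
			if mode == DedupeByResourceUID && r.UID != "" {
				if _, dup := seen[r.UID]; dup {
					continue
				}
				seen[r.UID] = struct{}{}
			}
			out = append(out, r)
		}
	}
	return out
}

func main() {
	teamA := []Row{
		{Target: "team-a", UID: "uid-1", Kind: "Certificate", Name: "web"},
		{Target: "team-a", UID: "uid-2", Kind: "Certificate", Name: "api"},
	}
	teamB := []Row{
		{Target: "team-b", UID: "uid-1", Kind: "Certificate", Name: "web"}, // overlaps team-a
		{Target: "team-b", UID: "uid-3", Kind: "Certificate", Name: "db"},
	}
	for _, r := range MergeRows(DedupeByResourceUID, teamA, teamB) {
		fmt.Println(r.Target, r.UID, r.Name)
	}
}
```

With the sample data, the overlapping `uid-1` row from `team-b` is dropped and three rows remain.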
3. Relationship to existing metrics
| Layer | Examples | Phase |
|---|---|---|
| Operator health | `kollect_reconcile_*`, `kollect_workqueue_depth`, `kollect_sink_errors_total` | 0–1 ✅ |
| Collection / export | `kollect_collected_objects`, `kollect_export_duration_seconds` | 1 ✅ |
| Domain series from CR fields | KSM-style gauges per `spec.metrics` path | 4 🚧 (config + engine wire) |
Consequences
Positive
- Platform teams can alert on domain signals (e.g. cert expiry, Argo sync status) without scraping inventory export sinks.
- Reuses proven KSM config patterns; testable with table-driven metric assertions like Phase 1.
Negative
- CRD/schema design for metrics config adds API surface and webhook validation work.
- Misconfigured high-cardinality paths can overwhelm Prometheus — needs guardrails in admission.
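Admission can bound the number of label keys, but not the number of distinct values those labels take at runtime. A complementary runtime guardrail, which this ADR does not specify and is shown here purely as an assumption, is a per-metric series cap: known label-value combinations keep updating, and new ones are refused once the cap is reached. `SeriesLimiter`, `NewSeriesLimiter`, and `Admit` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// SeriesLimiter admits at most max distinct label-value combinations.
// It is a hypothetical guardrail, not part of the operator described here.
type SeriesLimiter struct {
	max  int
	seen map[string]struct{}
}

// NewSeriesLimiter returns a limiter that allows up to max distinct series.
func NewSeriesLimiter(max int) *SeriesLimiter {
	return &SeriesLimiter{max: max, seen: make(map[string]struct{})}
}

// Admit reports whether a sample with these label values may be emitted.
// Combinations already admitted always pass; new ones pass only under the cap.
func (l *SeriesLimiter) Admit(labelValues ...string) bool {
	// NUL cannot appear in typical label values, so it is a safe separator here.
	key := strings.Join(labelValues, "\x00")
	if _, ok := l.seen[key]; ok {
		return true
	}
	if len(l.seen) >= l.max {
		return false
	}
	l.seen[key] = struct{}{}
	return true
}

func main() {
	limiter := NewSeriesLimiter(2)
	fmt.Println(limiter.Admit("profile-a", "Ready"))   // true: first series
	fmt.Println(limiter.Admit("profile-a", "Pending")) // true: second series
	fmt.Println(limiter.Admit("profile-a", "Failed"))  // false: cap reached
	fmt.Println(limiter.Admit("profile-a", "Ready"))   // true: already admitted
}
```

A refused sample would typically be counted in an operator-health counter so that the misconfiguration is visible, though the choice of counter is left open here.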
Open questions
- Companion CR: revisit `KollectMetricsProfile` when platform teams need one metrics schema across many profiles.
- Per-metric labels: ✅ `kollect_custom_resource_labeled_series` emits attribute label values from `spec.metrics[].labels`.
- Hub domain series: `kollect_hub_merged_items_total` wired; scrape at spoke for domain metrics per ADR-0604; optional hub-only federated gauges deferred.
- Dedupe: ✅ spike. `ExportCoalesce` uses content-hash skip with generation bypass; `MergeRows` supports `DedupeByResourceUID` for cross-target collapse (ROADMAP).
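The content-hash skip with generation bypass can be read as: skip an export when the payload checksum matches the last export to the same sink, unless the source generation has changed, in which case export anyway. That reading is an assumption, and the sketch below follows it; `Coalescer`, `NewCoalescer`, `ShouldExport`, the per-sink keying, and the use of SHA-256 are illustrative choices rather than the actual `ExportCoalesce` implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// exportState records what was last written to a sink.
type exportState struct {
	checksum   string
	generation int64
}

// Coalescer tracks the last export per sink (hypothetical type for this sketch).
type Coalescer struct {
	last map[string]exportState
}

// NewCoalescer returns an empty Coalescer.
func NewCoalescer() *Coalescer {
	return &Coalescer{last: make(map[string]exportState)}
}

// ShouldExport reports whether payload should be written to sink and, if so,
// records it. An export is skipped only when both the content hash and the
// generation match the previous export; a new generation bypasses the skip.
func (c *Coalescer) ShouldExport(sink string, generation int64, payload []byte) bool {
	sum := sha256.Sum256(payload)
	checksum := hex.EncodeToString(sum[:])
	prev, ok := c.last[sink]
	if ok && prev.generation == generation && prev.checksum == checksum {
		return false
	}
	c.last[sink] = exportState{checksum: checksum, generation: generation}
	return true
}

func main() {
	c := NewCoalescer()
	payload := []byte(`{"rows":3}`)
	fmt.Println(c.ShouldExport("git", 1, payload))                // true: first export
	fmt.Println(c.ShouldExport("git", 1, payload))                // false: same hash, same generation
	fmt.Println(c.ShouldExport("git", 2, payload))                // true: generation bypass
	fmt.Println(c.ShouldExport("git", 2, []byte(`{"rows":4}`)))   // true: content changed
}
```

Keying the state by sink means two targets that share a sink and produce identical payloads at the same generation result in a single write in this sketch.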