Performance and scalability

Kollect is designed for large single clusters (thousands of nodes, 10k+ watched resources baseline) and multi-cluster fleets: N independent single-mode operators exporting to a shared sink partitioned by spec.cluster. There is no hub/spoke runtime tier (ADR-0501). This guide summarizes tuning knobs from ADR-0603.

Scale tiers

| Tier | Collected rows | Clusters | How to validate |
|---|---|---|---|
| Dev / CI default | ≤500 synthetic | 1 | task test |
| Opt-in load | ≤2,000 synthetic | 1 | KOLECT_LOAD_TEST=1 task load-test |
| Nightly load | 10,000 synthetic | 1 | task load-test:10k on 8-core runners |
| Baseline production | 10,000+ (validated) | 1 | Metrics + pprof; manual load |
| Design target | 100,000 | 1 | Manual / perf-report; needs export sharding + Postgres bulk upsert (claim gate v0.5+) |
| Fleet | 10k–100k × N | many | One ServiceMonitor per cluster release; correlate by spec.cluster |

The 100,000-row design target requires export sharding (one KollectInventory per workload namespace, or smaller groups, so that each export stays below ~2,000 rows, or ~1.5 MiB) plus resourcesProfile: large (request ≥2 GiB, limit ≥4 GiB). Fleet scale fans out across operators, so there is no central merge process to bottleneck (ADR-0603).
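The sharding requirement is mostly arithmetic, so a small sketch can help with planning. The function name and the even-split assumption are illustrative only; real namespaces are rarely balanced, so treat the result as a lower bound on the number of KollectInventory objects.

```go
package main

import "fmt"

// maxRowsPerExport mirrors the ~2,000-row per-export ceiling quoted above.
const maxRowsPerExport = 2000

// minInventories returns the smallest number of inventories needed so that
// an even split keeps every export at or below maxRows rows.
// Hypothetical helper for capacity planning; not part of Kollect.
func minInventories(totalRows, maxRows int) int {
	if totalRows <= 0 || maxRows <= 0 {
		return 0
	}
	return (totalRows + maxRows - 1) / maxRows // ceiling division
}

func main() {
	// The 100,000-row design target needs at least 50 shards.
	fmt.Println(minInventories(100_000, maxRowsPerExport)) // 50
}
```

At ~1.5 MiB per 2,000 rows, this budget works out to roughly 790 bytes per row, which is a useful sanity check when attribute counts grow.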

Controller parallelism

| Flag | Default | Controller |
|---|---|---|
| --max-concurrent-reconciles-target | 5 | KollectTarget |
| --max-concurrent-reconciles-inventory | 3 | KollectInventory |
| --max-concurrent-reconciles-cluster-target | 2 | KollectClusterTarget |
| --max-concurrent-reconciles-cluster-inventory | 2 | KollectClusterInventory |
| --collect-dispatch-workers | 4 | Collection informer dispatch pool |
| --collect-dispatch-queue-size | 512 | Dispatch queue depth before sync fallback |

Raise concurrency when reconcile latency grows while CPU is underutilized. Lower it when API server throttling or etcd watch pressure appears.

Workqueue rate limiting

When --reconcile-rate-limit is unset (0), controller-runtime uses its default exponential failure rate limiter (5ms base, 1000s cap). Set a positive duration (for example 100ms) to slow retries on terminal errors without changing success-path throughput.

kollect_workqueue_depth approximates queue pressure as in-flight reconciles per controller (not the internal client-go queue length).

Export debouncing

KollectInventory.spec.exportMinInterval (default 30s) coalesces exports to external sinks per inventory. Material payload changes (generation/checksum bump) may export immediately inside the min interval (ADR-0201). Lower the interval for fresher Postgres/Kafka exports; raise it to reduce sink API load. Across a large fleet writing to a shared sink, debouncing is mandatory to avoid export storms against the backend.

Collection engine

  • In-memory store: O(n) memory in collected object count; one RWMutex guards nested maps. Target ≤512 MiB RSS at 10k typical Deployment/Service rows (verify with pprof).
  • Informers: When all active targets for a GVK resolve to one namespace via spec.namespaceSelector, the dynamic informer is scoped to that namespace. Otherwise the informer watches all namespaces and filters events at dispatch time (correctness over cache size).
  • Resync: 12h informer resync is a correctness backstop, not a freshness driver.
  • Fleet exports: Each operator writes its own debounced inventory snapshot to the shared sink, partitioned by spec.cluster — no cross-cluster merge on the hot path (ADR-0501).
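The informer scoping rule in the list above can be sketched as a small decision function. The `target` struct and its fields are simplified, hypothetical stand-ins for resolved targets, and GVKs are reduced to strings for brevity.

```go
package main

import "fmt"

// target is a hypothetical, simplified view of an active target: the GVK it
// collects and the namespaces its selector resolves to.
type target struct {
	gvk        string
	namespaces []string
}

// informerScope returns (namespace, true) when every target for gvk resolves
// to the same single namespace. It returns ("", false) otherwise, meaning a
// cluster-wide watch with filtering at dispatch time (or no targets at all).
func informerScope(targets []target, gvk string) (string, bool) {
	scope, found := "", false
	for _, t := range targets {
		if t.gvk != gvk {
			continue
		}
		if len(t.namespaces) != 1 {
			return "", false // zero or several namespaces: cannot scope
		}
		if found && t.namespaces[0] != scope {
			return "", false // targets disagree on the namespace
		}
		scope, found = t.namespaces[0], true
	}
	return scope, found
}

func main() {
	targets := []target{
		{gvk: "apps/v1/Deployment", namespaces: []string{"prod"}},
		{gvk: "apps/v1/Deployment", namespaces: []string{"prod"}},
		{gvk: "v1/Service", namespaces: []string{"prod", "dev"}},
	}
	fmt.Println(informerScope(targets, "apps/v1/Deployment")) // prod true
	fmt.Println(informerScope(targets, "v1/Service"))         // (empty namespace) false
}
```

The practical consequence is that a single extra target selecting a second namespace for the same GVK flips the informer to cluster-wide scope, which is worth remembering when RSS jumps after a seemingly small config change.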

Metrics catalog

Full scrape setup, default PrometheusRule alerts, and the complete kollect_* metric reference live in Operator metrics. The table below highlights performance and scalability signals — use it with the bottleneck checklist below.

| Metric | Type | Labels | PromQL hint | What rising values imply |
|---|---|---|---|---|
| kollect_inventory_items_total | Gauge | | kollect_inventory_items_total | Stale while store grows → inventory reconcile or export lag |
| kollect_collect_items_total | Gauge | | kollect_collect_items_total | RSS scales with store size at 10k+ objects |
| kollect_collected_objects | Gauge | profile, gvk | sum by (profile, gvk) (kollect_collected_objects) | Per-target cardinality; split profiles when labels explode |
| kollect_reconcile_total | Counter | controller, result | sum(rate(kollect_reconcile_total[5m])) by (controller, result) | Rising failure ratio → check error-class counters |
| kollect_reconcile_errors_total | Counter | kind, error_class | sum(rate(kollect_reconcile_errors_total[5m])) by (error_class) | See ADR-0602 error classes |
| kollect_sink_errors_total | Counter | reason | sum(rate(kollect_sink_errors_total[5m])) by (reason) | Export failures, separate from reconcile errors (ADR-0602) |
| kollect_sink_connection_test_total | Counter | type, result | sum(rate(kollect_sink_connection_test_total[5m])) by (type, result) | Spikes on sink CR churn; sustained failure → creds/network |
| kollect_workqueue_depth | Gauge | controller | max_over_time(kollect_workqueue_depth[5m]) | Sustained high values → raise --max-concurrent-reconciles-* or reduce reconcile work |
| kollect_reconcile_duration_seconds | Histogram | controller | histogram_quantile(0.99, sum(rate(kollect_reconcile_duration_seconds_bucket[5m])) by (le, controller)) | p99 rising while depth low → slow API/sink; p99 rising with depth → under-provisioned workers |
| kollect_informer_objects | Gauge | group, version, resource | sum by (group, version, resource) (kollect_informer_objects) | Unexpected growth → extra GVR watches or cluster-wide scope; check namespace scoping |
| kollect_export_bytes_total | Counter | sink_type | rate(kollect_export_bytes_total[5m]) | Spike → debounce too low or inventory churn; flat while stale → export path stuck |
| kollect_export_duration_seconds | Histogram | sink_type | histogram_quantile(0.95, sum(rate(kollect_export_duration_seconds_bucket[5m])) by (le, sink_type)) | Sink slowness (Git/Postgres/Kafka), not collection |
| kollect_export_debounced_total | Counter | controller | sum(rate(kollect_export_debounced_total[5m])) by (controller) | Exports skipped by min interval; expected when debounce is tight |
| kollect_collect_dispatch_duration_seconds | Histogram | | histogram_quantile(0.95, sum(rate(kollect_collect_dispatch_duration_seconds_bucket[5m])) by (le)) | Collection extract/upsert latency |
| kollect_collect_dispatch_queue_depth | Gauge | | max_over_time(kollect_collect_dispatch_queue_depth[5m]) | Sustained high → raise dispatch workers/queue |
| kollect_collect_dispatch_sync_fallback_total | Counter | | increase(kollect_collect_dispatch_sync_fallback_total[15m]) | Queue overflow: dispatch pool undersized |
| kollect_informer_resync_dispatches_total | Counter | group, version, resource | sum(increase(kollect_informer_resync_dispatches_total[1h])) by (group, version, resource) | Resync-driven dispatch volume |
| kollect_informer_cluster_wide_scope | Gauge | group, version, resource | max by (group, version, resource) (kollect_informer_cluster_wide_scope) | 1 = cluster-wide watch (RSS risk at scale) |

Additional runtime signals: Go memstats via pprof (--enable-pprof) and API server 429 (Too Many Requests) responses in operator logs. See Operator metrics for profile-derived series, connection-test counters, and example PromQL for alerting.

Profiling

--enable-pprof serves standard Go profiles on --pprof-bind-address (default :6060), separate from Prometheus metrics (:8080 / :8443). Helm sets pprof.enabled: false by default; enable in dev overlays only.

Benchmarks and load tests

task bench                    # writes artifacts/bench/*.txt
KOLECT_LOAD_TEST=1 task load-test

For local perf summaries (task perf-report), see DEVELOPMENT.md.

Default go test ./... excludes load-tagged tests.

Early bottleneck checklist

| Symptom | Likely cause | First action |
|---|---|---|
| High kollect_informer_objects, high RSS | Cluster-wide informer for multi-namespace targets | Namespace-scope targets; split profiles |
| High kollect_workqueue_depth on inventory | Export or aggregation on hot path | Raise inventory workers; increase spec.exportMinInterval |
| High export bytes rate, low object churn | Missing payload dedupe | Verify debounce + content-hash skip |
| Bench regression in BenchmarkExtract | CEL/JSONPath hot path | Profile extractor; check attribute count |
| High RSS on large clusters | Full in-memory collect store | Namespace-scoped targets; raise export interval (ADR-0603) |