Performance and scalability¶
Kollect is designed for large single clusters (1000s of nodes, 10k+ watched resources
baseline) and multi-cluster fleets — N independent single-mode operators exporting to a
shared sink partitioned by spec.cluster. There is no hub/spoke runtime tier
(ADR-0501). This guide summarizes tuning knobs from
ADR-0603.
Scale tiers¶
| Tier | Collected rows | Clusters | How to validate |
|---|---|---|---|
| Dev / CI default | ≤500 synthetic | 1 | task test |
| Opt-in load | ≤2,000 synthetic | 1 | KOLECT_LOAD_TEST=1 task load-test |
| Nightly load | 10,000 synthetic | 1 | task load-test:10k on 8-core runners |
| Baseline production | 10,000+ (validated) | 1 | Metrics + pprof; manual load |
| Design target | 100,000 | 1 | Manual / perf-report; needs export sharding + Postgres bulk upsert (claim gate v0.5+) |
| Fleet | 10k–100k × N | many | One ServiceMonitor per cluster release; correlate by spec.cluster |
The 100,000-row design target requires mandatory export sharding — one KollectInventory
per workload namespace (or smaller groups) so each export stays below ~2,000 rows (~1.5 MiB) —
plus resourcesProfile: large (request ≥2 GiB, limit ≥4 GiB). Fleet scale fans out across
operators; there is no central merge process to bottleneck (ADR-0603).
Controller parallelism¶
| Flag | Default | Controller |
|---|---|---|
--max-concurrent-reconciles-target |
5 |
KollectTarget |
--max-concurrent-reconciles-inventory |
3 |
KollectInventory |
--max-concurrent-reconciles-cluster-target |
2 |
KollectClusterTarget |
--max-concurrent-reconciles-cluster-inventory |
2 |
KollectClusterInventory |
--collect-dispatch-workers |
4 |
Collection informer dispatch pool |
--collect-dispatch-queue-size |
512 |
Dispatch queue depth before sync fallback |
Raise concurrency when reconcile latency grows while CPU is underutilized. Lower it when API server throttling or etcd watch pressure appears.
Workqueue rate limiting¶
When --reconcile-rate-limit is unset (0), controller-runtime uses its default
exponential failure rate limiter (5ms base, 1000s cap). Set a positive duration (for example
100ms) to slow retries on terminal errors without changing success-path throughput.
kollect_workqueue_depth approximates queue pressure as in-flight reconciles per controller
(not the internal client-go queue length).
Export debouncing¶
KollectInventory.spec.exportMinInterval (default 30s) coalesces export to external sinks
per inventory. Material payload changes (generation/checksum bump) may export immediately inside the
min interval (ADR-0201).
Lower the interval for fresher Postgres/Kafka exports; raise to reduce sink API load. Across a large
fleet writing to a shared sink, debouncing is mandatory to avoid export storms against the
backend.
Collection engine¶
- In-memory store: O(n) memory in collected object count; one
RWMutexguards nested maps. Target ≤512 MiB RSS at 10k typical Deployment/Service rows (verify with pprof). - Informers: When all active targets for a GVK resolve to one namespace via
spec.namespaceSelector, the dynamic informer is scoped to that namespace. Otherwise the informer watches all namespaces and filters events at dispatch time (correctness over cache size). - Resync: 12h informer resync is a correctness backstop, not a freshness driver.
- Fleet exports: Each operator writes its own debounced inventory snapshot to the shared sink,
partitioned by
spec.cluster— no cross-cluster merge on the hot path (ADR-0501).
Metrics catalog¶
Full scrape setup, default PrometheusRule alerts, and the complete kollect_* metric reference live in
Operator metrics. The table below highlights performance and
scalability signals — use it with the bottleneck checklist below.
| Metric | Type | Labels | PromQL hint | What rising values imply |
|---|---|---|---|---|
kollect_inventory_items_total |
Gauge | — | kollect_inventory_items_total |
Stale while store grows → inventory reconcile or export lag |
kollect_collect_items_total |
Gauge | — | kollect_collect_items_total |
RSS scales with store size at 10k+ objects |
kollect_collected_objects |
Gauge | profile, gvk |
sum by (profile, gvk) (kollect_collected_objects) |
Per-target cardinality; split profiles when labels explode |
kollect_reconcile_total |
Counter | controller, result |
sum(rate(kollect_reconcile_total[5m])) by (controller, result) |
Rising failure ratio → check error-class counters |
kollect_reconcile_errors_total |
Counter | kind, error_class |
sum(rate(kollect_reconcile_errors_total[5m])) by (error_class) |
See ADR-0602 error classes |
kollect_sink_errors_total |
Counter | reason |
sum(rate(kollect_sink_errors_total[5m])) by (reason) |
Export failures — separate from reconcile errors (ADR-0602) |
kollect_sink_connection_test_total |
Counter | type, result |
sum(rate(kollect_sink_connection_test_total[5m])) by (type, result) |
Spikes on sink CR churn; sustained failure → creds/network |
kollect_workqueue_depth |
Gauge | controller |
max_over_time(kollect_workqueue_depth[5m]) |
Sustained high values → raise --max-concurrent-reconciles-* or reduce reconcile work |
kollect_reconcile_duration_seconds |
Histogram | controller |
histogram_quantile(0.99, sum(rate(kollect_reconcile_duration_seconds_bucket[5m])) by (le, controller)) |
p99 rising while depth low → slow API/sink; p99 rising with depth → under-provisioned workers |
kollect_informer_objects |
Gauge | group, version, resource |
sum by (group, version, resource) (kollect_informer_objects) |
Unexpected growth → extra GVR watches or cluster-wide scope; check namespace scoping |
kollect_export_bytes_total |
Counter | sink_type |
rate(kollect_export_bytes_total[5m]) |
Spike → debounce too low or inventory churn; flat while stale → export path stuck |
kollect_export_duration_seconds |
Histogram | sink_type |
histogram_quantile(0.95, sum(rate(kollect_export_duration_seconds_bucket[5m])) by (le, sink_type)) |
Sink slowness (Git/Postgres/Kafka) — not collection |
kollect_export_debounced_total |
Counter | controller |
sum(rate(kollect_export_debounced_total[5m])) by (controller) |
Exports skipped by min interval — expected when debounce is tight |
kollect_collect_dispatch_duration_seconds |
Histogram | — | histogram_quantile(0.95, sum(rate(kollect_collect_dispatch_duration_seconds_bucket[5m])) by (le)) |
Collection extract/upsert latency |
kollect_collect_dispatch_queue_depth |
Gauge | — | max_over_time(kollect_collect_dispatch_queue_depth[5m]) |
Sustained high → raise dispatch workers/queue |
kollect_collect_dispatch_sync_fallback_total |
Counter | — | increase(kollect_collect_dispatch_sync_fallback_total[15m]) |
Queue overflow — dispatch pool undersized |
kollect_informer_resync_dispatches_total |
Counter | group, version, resource |
sum(increase(kollect_informer_resync_dispatches_total[1h])) by (group, version, resource) |
Resync-driven dispatch volume |
kollect_informer_cluster_wide_scope |
Gauge | group, version, resource |
max by (group, version, resource) (kollect_informer_cluster_wide_scope) |
1 = cluster-wide watch (RSS risk at scale) |
Additional runtime signals: Go memstats via pprof (--enable-pprof), API server 429 in operator logs.
See Operator metrics for profile-derived series, connection-test counters,
and example PromQL for alerting.
Profiling¶
--enable-pprof serves standard Go profiles on --pprof-bind-address (default :6060),
separate from Prometheus metrics (:8080 / :8443). Helm sets pprof.enabled: false by default;
enable in dev overlays only.
Benchmarks and load tests¶
task bench # writes artifacts/bench/*.txt
KOLECT_LOAD_TEST=1 task load-test
For local perf summaries (task perf-report), see DEVELOPMENT.md.
Default go test ./... excludes load-tagged tests.
Early bottleneck checklist¶
| Symptom | Likely cause | First action |
|---|---|---|
High kollect_informer_objects, high RSS |
Cluster-wide informer for multi-namespace targets | Namespace-scope targets; split profiles |
High kollect_workqueue_depth on inventory |
Export or aggregation on hot path | Raise inventory workers; increase spec.exportMinInterval |
| High export bytes rate, low object churn | Missing payload dedupe | Verify debounce + content-hash skip |
Bench regression in BenchmarkExtract |
CEL/JSONPath hot path | Profile extractor; check attribute count |
| High RSS on large clusters | Full in-memory collect store | Namespace-scoped targets; raise export interval (ADR-0603) |