Load test runbook (100k design proof)¶
Status: PLANNED — not yet executed. Maintainer validation on public cloud (GKE target) is the gate for an honest 100k collected rows/cluster claim. Do not run 100k on
ubuntu-latestGitHub Actions runners.
Scope¶
| Tier | Objects | Where | Status |
|---|---|---|---|
| CI default | 500 | envtest (task test) |
✅ Active |
| CI extended | 2,000 | KOLECT_LOAD_TEST=1 task load-test |
✅ Opt-in |
| Nightly | 10,000 | ubuntu-latest-8-cores workflow |
✅ Active |
| Design proof | 100,000 | 2× public cloud clusters | ⬜ Planned (GKE) |
10k nightly on GH runners is OK. 100k = manual cloud gate only.
Prerequisites¶
- Two Kubernetes clusters (GKE/EKS/AKS — maintainer plan: GKE).
- Shared sink endpoints reachable from both clusters:
- Postgres (primary query path)
- Git snapshot @ 1h cadence
- S3 or GCS object store (optional spill path)
- Optional NATS/Kafka event sink
- Clone kollect @ release SHA; set
KOLLECT_SRCto repo root. - Helm install with
resourcesProfile: largeon each cluster. - Namespace-scoped targets + sharded inventories (one per workload namespace).
Manifest bundle¶
Generator lives under hack/loadtest/100k/:
cd hack/loadtest/100k
./generate.sh --namespaces 50 --deployments-per-ns 2000
# → ~100k Deployments across 50 namespaces (adjust flags to hit collect-store row target)
kubectl apply -k manifests/
Tune --namespaces and --deployments-per-ns so kollect_collect_items_total approaches 100k
after filters, not merely raw API object count.
Step-by-step (per cluster)¶
1. Install operator¶
helm upgrade --install kollect oci://ghcr.io/konih/kollect/charts/kollect \
--namespace kollect-system --create-namespace \
-f hack/loadtest/100k/values-large.yaml
2. Apply sinks + sharded inventories¶
kubectl apply -f hack/loadtest/100k/sinks.yaml
kubectl apply -k hack/loadtest/100k/manifests/
Ensure each workload namespace has a KollectInventory with <2k rows per shard.
3. Soak¶
Run ≥4 hours steady state. Record:
kollect_collect_items_totalkollect_reconcile_duration_secondsp95kollect_export_duration_secondsby sink type- Pod RSS / CPU throttling
4. Shared sink verification¶
| Sink | Check |
|---|---|
| Postgres | SELECT cluster, count(*) FROM … GROUP BY 1 — both clusters present |
| Git | Commit cadence ~1h; fingerprint skip reduces empty pushes |
| S3/GCS | Object count matches export generations |
Diagnosis¶
When the load test shows bottlenecks, use this matrix:
| Symptom | Metrics / tools | Likely cause | Next action |
|---|---|---|---|
| CPU throttle | pprof, kollect_reconcile_duration_seconds, pod limits |
dispatch pool, marshal | Raise collect.dispatchWorkers; add inventory sharding |
| OOM | RSS, kollect_collect_items_total, kollect_informer_objects |
cluster-wide informers | Namespace-scope targets; resourcesProfile: large |
| Export timeout | kollect_export_duration_seconds, kollect_sink_errors_total |
Postgres row loop / git clone | PERF-09 bulk; PERF-10 mirror + fingerprint skip |
| PayloadTooLarge | inventory Degraded, kollect_export_spill_warn_total |
monolithic inventory | Multi-namespace inventories |
| etcd/API slow | apiserver metrics, status update rate | status churn | Longer exportMinInterval; debounce already shipped |
PromQL snippets¶
# Collect store size
kollect_collect_items_total
# Reconcile latency p95
histogram_quantile(0.95, sum(rate(kollect_reconcile_duration_seconds_bucket[5m])) by (le, controller))
# Export latency by sink
histogram_quantile(0.95, sum(rate(kollect_export_duration_seconds_bucket[5m])) by (le, sink_type))
# Sharding warning
increase(kollect_export_shard_warn_total[1h])
pprof¶
kubectl port-forward -n kollect-system deploy/kollect-controller-manager 6060:6060
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
go tool pprof http://localhost:6060/debug/pprof/heap
Local perf snapshot¶
task perf-report # local → agent-context/PERF-SNAPSHOT.md; CI → artifacts/perf-snapshot.md (artifact)
Log keys¶
Search operator logs for: export failed, PayloadTooLarge, debounced, dispatch sync fallback,
export payload exceeds spill warn.
Explicit non-goals¶
- No 100k job in
.github/workflows/onubuntu-latest - No GKE execution in CI — maintainer runs manually when ready