Scaling and fleet operations

Guidance for operators running Kollect at large cluster scale (design target: 100,000 collected rows per cluster operator) and multi-cluster fleets sharing Postgres, Git, or object stores.

Honest scale claim

100k rows per cluster is a design target, not a blanket product guarantee. Proving it requires a manual cloud load test (see the load test runbook); GitHub Actions runners are not sufficient for this. Until that gate passes, treat 100k as architecture guidance with mandatory export sharding.

Collected rows vs export shards

Term	Meaning
Collected rows	Items in the operator collect store (`kollect_collect_items_total`)
Export shard	One `KollectInventory` namespace aggregate — keep <~2,000 rows per shard

Monolithic namespace inventories hit PayloadTooLarge above ~2,500 rows. Spread workloads across many namespaces, each with its own KollectInventory — see config/samples/kollect_v1alpha1_kollectinventory_sharded.yaml.

The operator sets status.conditions[ExportShardWarning] and increments kollect_export_shard_warn_total when a namespace aggregate reaches ~1,800 rows.
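The thresholds above can be turned into a quick planning check. This is a minimal sketch, not part of Kollect: the function names are hypothetical, and the constants simply restate the approximate figures from this section (~1,800 warn, ~2,000 target, ~2,500 failure), which are not exact limits. It also assumes rows spread evenly across namespaces.

```python
import math

# Approximate figures quoted on this page; treat them as planning values only.
SHARD_TARGET_ROWS = 2_000  # keep each KollectInventory below roughly this
SHARD_WARN_ROWS = 1_800    # ExportShardWarning is raised around here
SHARD_FAIL_ROWS = 2_500    # PayloadTooLarge appears above roughly this


def min_shards(collected_rows: int, rows_per_shard: int = SHARD_WARN_ROWS) -> int:
    """Smallest shard count that keeps every shard at or below rows_per_shard,
    assuming rows are spread evenly across namespaces."""
    if collected_rows < 0:
        raise ValueError("collected_rows must be non-negative")
    return max(1, math.ceil(collected_rows / rows_per_shard))


def shard_status(rows: int) -> str:
    """Classify one namespace aggregate against the approximate thresholds."""
    if rows > SHARD_FAIL_ROWS:
        return "too-large"
    if rows >= SHARD_WARN_ROWS:
        return "warn"
    return "ok"
```

Planning against the warning threshold, `min_shards(100_000)` gives 56 namespaces for the 100k design target; planning against the 2,000-row target gives 50.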

Helm resource profiles

For large clusters, use the chart resourcesProfile: large preset (≥2 GiB request, ≥4 GiB limit). Tune dispatch and reconcile flags per PERFORMANCE.md.

resourcesProfile: large
collect:
  dispatchWorkers: 8
  dispatchQueueSize: 1024

Git audit @ 1h

Git snapshot sinks are meant for audit cadence (typically a 1h exportMinInterval), not for portal queries. At scale:

Shard exports (<2k rows/inventory).
Set 1h (or longer) per-ref interval on snapshotSinkRefs.
Use pathTemplate: clusters/{cluster}/… on snapshot sinks for fleet repos.
The operator's persistent mirror and checksum fingerprint skip (PERF-10) avoid a clone/push when the payload is unchanged (env: KOLLECT_GIT_MIRROR_DIR).
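The idea behind a checksum fingerprint skip can be illustrated in a few lines. This is a sketch of the general technique only, not the operator's actual implementation: the function names, the choice of SHA-256, and the canonical JSON encoding are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
import json


def payload_fingerprint(rows: list[dict]) -> str:
    """SHA-256 over a canonical JSON encoding, so dict key order does not matter."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_push(rows: list[dict], last_fingerprint: str | None) -> tuple[bool, str]:
    """Return (push_needed, new_fingerprint).

    A push is needed only when the fingerprint differs from the one recorded
    after the previous successful export.
    """
    fingerprint = payload_fingerprint(rows)
    return fingerprint != last_fingerprint, fingerprint
```

A caller would store the returned fingerprint after each successful push and pass it back on the next export cycle. When nothing changed, the clone/push step can be skipped entirely.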

Shared Postgres fleet

Multiple cluster operators can upsert into one Postgres sink (ADR-0501). Each operator sets spec.cluster on database sinks; the backend primary key is (cluster, namespace, name, uid).
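To see why including the cluster in the key lets several operators share one table, the sketch below replays the upsert pattern against an in-memory SQLite database. SQLite is only a stand-in for Postgres here (both accept this `ON CONFLICT ... DO UPDATE` form), and the table name `inventory_items`, the `attributes` column, and the cluster names are hypothetical, not the real sink schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE inventory_items (
        cluster    TEXT NOT NULL,
        namespace  TEXT NOT NULL,
        name       TEXT NOT NULL,
        uid        TEXT NOT NULL,
        attributes TEXT,
        PRIMARY KEY (cluster, namespace, name, uid)
    )
    """
)

UPSERT = """
    INSERT INTO inventory_items (cluster, namespace, name, uid, attributes)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (cluster, namespace, name, uid)
    DO UPDATE SET attributes = excluded.attributes
"""


def upsert(cluster, namespace, name, uid, attributes):
    """Insert a row, or overwrite its attributes if the full key already exists."""
    conn.execute(UPSERT, (cluster, namespace, name, uid, attributes))


upsert("prod-eu", "team-a", "web", "uid-1", '{"replicas": 2}')
upsert("prod-us", "team-a", "web", "uid-1", '{"replicas": 3}')  # other cluster: new row
upsert("prod-eu", "team-a", "web", "uid-1", '{"replicas": 5}')  # same key: updated in place

row_count = conn.execute("SELECT COUNT(*) FROM inventory_items").fetchone()[0]
```

Two rows remain: one per cluster. Rows from different clusters never collide even when namespace, name, and uid match, which is why each operator must set a distinct spec.cluster.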

Row growth

total_rows ≈ Σ over clusters (collected_rows_per_cluster) ≈ clusters × average collected_rows_per_cluster

Example: 200 clusters × 50k rows ≈ 10M rows — plan DBA review before sustained growth.
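The estimate is simple enough to script when fleet sizes change often. The following is a minimal sketch with hypothetical function names. The 10M-row threshold is the approximate review signal used on this page, not a hard limit.

```python
from __future__ import annotations

# Approximate review signal from this page; adjust to your own DBA guidance.
DBA_REVIEW_ROWS = 10_000_000


def fleet_rows(rows_per_cluster: dict[str, int]) -> int:
    """Sum collected rows over every cluster writing to the shared table."""
    return sum(rows_per_cluster.values())


def needs_partition_review(total_rows: int, threshold: int = DBA_REVIEW_ROWS) -> bool:
    """True once the estimated table size reaches the review threshold."""
    return total_rows >= threshold


# The example from the text: 200 clusters at 50k collected rows each.
fleet = {f"cluster-{i}": 50_000 for i in range(200)}
total = fleet_rows(fleet)
```

With these inputs `total` is 10,000,000, which already reaches the review threshold.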

When to partition (DBA)

Signal	Action
Table >~10M rows or >~100 GiB	Partitioning review
Slow exports / vacuum pressure	Partition by `cluster` or `exported_at` month
Retention policy	Drop/archive old monthly partitions
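For the monthly `exported_at` option, a DBA might generate partition DDL with a small helper like the one below. This is a sketch under stated assumptions: the parent table name `kollect_inventory` is hypothetical, the parent is assumed to be declared with `PARTITION BY RANGE (exported_at)`, and the statements use standard Postgres declarative partitioning syntax. Kollect does not run any of this itself.

```python
from datetime import date


def next_month(d: date) -> date:
    """First day of the month after d."""
    return date(d.year + (1 if d.month == 12 else 0), d.month % 12 + 1, 1)


def monthly_partition_ddl(parent: str, year: int, month: int) -> str:
    """CREATE statement for one monthly range partition of `parent`.

    The upper bound is exclusive, matching Postgres range partition semantics.
    """
    start = date(year, month, 1)
    end = next_month(start)
    return (
        f"CREATE TABLE {parent}_{start:%Y_%m} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


def drop_partition_ddl(parent: str, year: int, month: int) -> str:
    """DROP statement for retention: removes one month of rows in a single step."""
    return f"DROP TABLE IF EXISTS {parent}_{date(year, month, 1):%Y_%m};"
```

Dropping a whole partition is usually much cheaper than deleting the same rows one by one, which is the main reason monthly partitions pair well with a retention policy.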

Responsibilities

Role	Owns
Kollect operator	Upsert semantics, `spec.cluster`, export debounce, row identity
Platform / DBA	Partition DDL, indexes, retention, connection pooling, backups

The operator does not create Postgres partitions. Document expected table shape in your runbook; use kollect_export_duration_seconds and sink error metrics for early warning.

Index hints (DBA)

Composite unique index aligned with upsert PK: (cluster, namespace, name, uid)
Optional BRIN on exported_at for time-range portal queries
Avoid unbounded JSONB bloat — keep attribute profiles lean (REQUIREMENTS.md)