Scaling and fleet operations¶
Guidance for operators running Kollect at large cluster scale (design target: 100,000 collected rows per cluster operator) and multi-cluster fleets sharing Postgres, Git, or object stores.
Honest scale claim
100k/cluster is a design target, not a blanket product guarantee. Proof requires a manual cloud load test (load test runbook) — not GitHub Actions runners. Until that gate passes, treat 100k as architecture guidance with mandatory export sharding.
Collected rows vs export shards¶
| Term | Meaning |
|---|---|
| Collected rows | Items in the operator collect store (kollect_collect_items_total) |
| Export shard | One KollectInventory namespace aggregate — keep <~2,000 rows per shard |
Monolithic namespace inventories hit PayloadTooLarge above ~2,500 rows. Spread workloads across
many namespaces, each with its own KollectInventory — see
config/samples/kollect_v1alpha1_kollectinventory_sharded.yaml.
The operator sets status.conditions[ExportShardWarning] and increments
kollect_export_shard_warn_total when a namespace aggregate reaches ~1,800 rows.
Helm resource profiles¶
For large clusters, use the chart resourcesProfile: large preset (≥2 GiB request, ≥4 GiB
limit). Tune dispatch and reconcile flags per PERFORMANCE.md.
resourcesProfile: large
collect:
dispatchWorkers: 8
dispatchQueueSize: 1024
Git audit @ 1h¶
Git snapshot sinks are for audit cadence (typically 1h exportMinInterval), not portal
query. At scale:
- Shard exports (<2k rows/inventory).
- Set 1h (or longer) per-ref interval on
snapshotSinkRefs. - Use
pathTemplate: clusters/{cluster}/…on snapshot sinks for fleet repos. - Operator PERF-10 persistent mirror + checksum fingerprint skip avoid clone/push when
payload is unchanged (env:
KOLLECT_GIT_MIRROR_DIR).
Shared Postgres fleet¶
Multiple cluster operators can upsert into one Postgres sink (ADR-0501).
Each operator sets spec.cluster on database sinks; the backend primary key is
(cluster, namespace, name, uid).
Row growth¶
total_rows ≈ Σ (clusters × collected_rows_per_cluster)
Example: 200 clusters × 50k rows ≈ 10M rows — plan DBA review before sustained growth.
When to partition (DBA)¶
| Signal | Action |
|---|---|
| Table >~10M rows or >~100 GiB | Partitioning review |
| Slow exports / vacuum pressure | Partition by cluster or exported_at month |
| Retention policy | Drop/archive old monthly partitions |
Responsibilities¶
| Role | Owns |
|---|---|
| Kollect operator | Upsert semantics, spec.cluster, export debounce, row identity |
| Platform / DBA | Partition DDL, indexes, retention, connection pooling, backups |
The operator does not create Postgres partitions. Document expected table shape in your runbook;
use kollect_export_duration_seconds and sink error metrics for early warning.
Index hints (DBA)¶
- Composite unique index aligned with upsert PK:
(cluster, namespace, name, uid) - Optional BRIN on
exported_atfor time-range portal queries - Avoid unbounded JSONB bloat — keep attribute profiles lean (REQUIREMENTS.md)