Requirements & assumptions
First-principles requirements for Kollect — what the operator must do and how well, independent
of any specific design. ADRs record how; this document records what and why. When an ADR and
this document disagree, reconcile or flag the drift.
Status: living · Audience: architects, contributors, and reviewers evaluating or extending Kollect ·
Not a tutorial — start with Understand the basics and
Architecture; locked design choices live in Platform decisions.
Build order, not a release train — see PLATFORM-DECISIONS.md.
1. Problem statement
Platform and application teams need versioned, stakeholder-facing inventory of what runs in their
Kubernetes clusters — which resources exist, with which attributes (chart versions, images, sync
status, certificate expiry, …), aggregated into a queryable, auditable form.
From first principles, the existing options each fail one requirement:
| Option |
Why it is insufficient |
Query kube-apiserver directly |
Unbounded list/watch load; no history; couples portals to cluster availability and RBAC |
Store inventory in CRD .status |
etcd object-size limit (~1.5 MB); destabilizes apiserver at scale |
| Hardcoded collector schemas |
Break whenever a new CRD/attribute is needed; no per-team extensibility |
| kube-state-metrics only |
Metrics are observability, not diffable stakeholder inventory |
| Per-cluster Git commits |
O(N) commit/export storms across a fleet; noise without aggregation |
Kollect's thesis: watch user-defined GVKs via shared informers → extract attributes via
CEL/JSONPath → aggregate in memory → debounce → export to pluggable durable sinks, so
consumers read export data, never the live API at scale.
2. Users and assumptions
| Persona |
Need |
| Application team |
Inventory of their own namespace's workloads, owned in their namespace |
| Platform team |
Cross-namespace / cross-cluster rollup with tenancy guardrails |
| Portal / automation |
A stable, queryable, durable read surface (SQL, object store, or stream) |
| Auditor |
Diffable, point-in-time history of what was deployed |
Operating assumptions (binding unless revisited):
- A1 — No external adopters on
v1alpha1. Breaking API changes are acceptable pre-beta.
- A2 — Event-driven, not polling. Collection reacts to informer events (ADR-0301).
- A3 — Status is a summary, never a payload store (ADR-0103).
- A4 — Single responsibility. Kollect collects and exports; it does not render or publish docs/CMS (ADR-0702).
- A5 — The in-memory snapshot per inventory is canonical; every sink is a projection of it (ADR-0401).
- A6 — Internal/self-signed CAs are normal; TLS trust is a first-class sink concern, not a bolt-on.
3. Functional requirements
IDs are stable handles for discussion (FR-<area>-<n>).
3.1 Configuration & API (FR-API)
| ID |
Requirement |
Reference |
| FR-API-1 |
Inventory is configured by CRDs, not code: extraction schema (KollectProfile), resource selection (KollectTarget), aggregation/export (KollectInventory), backends (KollectSnapshotSink / KollectDatabaseSink / KollectEventSink), tenancy (KollectScope) |
ADR-0201, ADR-0414 |
| FR-API-2 |
Namespaced-by-default tenancy; cluster-scoped variants for platform-wide use |
ADR-0201, ADR-0203 |
| FR-API-3 |
Config kinds (Profile, Scope) are static (no controller); work kinds are reconciled |
ADR-0202 |
| FR-API-4 |
Invalid CEL/JSONPath and unknown sink types are rejected at admission, not at runtime |
ADR-0201, ADR-0302 |
| FR-API-5 |
Every reconciled kind supports spec.suspend and a manual-trigger annotation |
ADR-0201 |
| ID |
Requirement |
Reference |
| FR-COL-1 |
Watch arbitrary GVKs declared by profiles via one shared dynamic informer per GVK |
ADR-0301 |
| FR-COL-2 |
Extract named attributes via JSONPath (incl. [*] array wildcard) and cel:-prefixed CEL |
ADR-0302 |
| FR-COL-3 |
Scope watches by namespace/label selector and name lists to bound memory |
ADR-0301 |
| FR-COL-4 |
Watch opt-in/opt-out via kollect.dev/watch labels and watchMode: All\|OptIn |
ADR-0205 |
| FR-COL-5 |
Never extract Secret.data (incl. Helm data.release) without explicit opt-in; redact sensitive keys |
ADR-0303 |
| FR-COL-6 |
Ship tested sample profiles + contract tests (Deployment, Argo Application, cert-manager Certificate, …) |
ADR-0301, ADR-0303 |
| FR-COL-7 |
Optional full-resource export (export.mode: Resource): embed a pruned target object via Argo-style RFC 6901 jsonPointers, JSONPath, built-in defaults, and merged scrubKeys; Secret/sensitive kinds require kollect.dev/allow-full-resource-export opt-in and stay within size governance |
ADR-0306, ADR-0303, ADR-0405 |
3.3 Aggregation & export (FR-EXP)
| ID |
Requirement |
Reference |
| FR-EXP-1 |
Aggregate target rows into a per-namespace KollectInventory snapshot |
ADR-0201 |
| FR-EXP-2 |
Coalesce identical exports via spec.exportMinInterval (default 30s); material changes bypass |
ADR-0201, ADR-0603 |
| FR-EXP-3 |
Deterministic, stable-ordered serialization (diffable Git, golden tests) |
ADR-0103 |
| FR-EXP-4 |
Pluggable sinks by role: snapshot store (Git, GitLab, S3, GCS), relational/analytics SoR (Postgres; BigQuery planned), document store (MongoDB), event emitter (Kafka, NATS) |
ADR-0401, ADR-0402 |
| FR-EXP-5 |
Resource deletions are reflected in sinks (snapshot stores free; Postgres/MongoDB/event sinks via reconcile) |
ADR-0401 |
| FR-EXP-9 |
A single KollectInventory exports to all referenced sinks in parallel in one debounced pass — each sink with its own exportMinInterval and circuit breaker; partial failure degrades to PartiallySynced, not full failure |
ADR-0401, ADR-0413 |
| FR-EXP-6 |
First-class sink connectivity testing (KollectConnectionTest CR + sink probe) |
ADR-0403 |
| FR-EXP-7 |
Custom CA / self-signed TLS trust for Git/GitLab/Postgres sinks (caSecretRef / caBundle) |
ADR-0201 |
| FR-EXP-8 |
Git/GitLab sinks default to human-readable YAML with zero config; optional spec.layout (document/perResource/split), path templates, and auto-prune for per-resource trees |
ADR-0419, ADR-0416, ADR-0306 |
3.4 Read path (FR-READ)
| ID |
Requirement |
Reference |
| FR-READ-1 |
Primary scalable read = sink export (SQL/object store/stream), not the live API |
ADR-0103 |
| FR-READ-2 |
Optional read-only HTTP inventory API, feature-gated off by default, for debug/small installs |
ADR-0103 |
| FR-READ-3 |
When HTTP is enabled, authenticate via Kubernetes TokenReview + SubjectAccessReview |
ADR-0404 |
| FR-READ-4 |
Fleet read plane (frozen — design only): a standalone read-only console may materialize a fleet-wide read model from the shared event stream and serve the Read API extended with a cluster dimension — never a hub, never kube-apiserver writes, no bus/DB creds in the browser. The UI program is frozen and the Read API freeze is deferred; see ROADMAP § Read API + UI console (frozen) |
ADR-0418, ADR-0501 |
3.5 Multi-cluster (FR-MC)
| ID |
Requirement |
Reference |
| FR-MC-1 |
Default multi-cluster = direct shared-sink fan-in (spec.cluster); backend key/PK merges |
ADR-0401, ADR-0501 |
| FR-MC-2 |
Each cluster runs an independent operator (mode: single); no hub tier |
ADR-0501 |
| FR-MC-3 |
Operators stay lightweight: debounced export, bounded in-memory collect store |
ADR-0501, ADR-0603 |
3.6 Observability (FR-OBS)
| ID |
Requirement |
Reference |
| FR-OBS-1 |
Operator Prometheus metrics on /metrics (reconcile, export, sink errors, collection counts) |
ADR-0601, ADR-0602 |
| FR-OBS-2 |
Typed error taxonomy drives requeue behavior and conditions (Ready/Synced/Degraded) |
ADR-0602 |
| FR-OBS-3 |
Operators can tell why export failed from conditions, events, and metrics |
ADR-0602, ADR-0403 |
| FR-OBS-4 |
prometheus is not a sink type; domain (KSM-style) metrics emit from the collection engine |
ADR-0601, ADR-0304 |
4. Non-functional requirements
| ID |
Target |
| NFR-PERF-1 |
Design target: 100,000 collected rows/cluster (sharded exports); 10,000+ validated in CI tiers; store ≤512 MiB @ 10k, operator RSS 2–4 GiB @ 100k |
| NFR-PERF-2 |
Giant cluster: 1000+ nodes — namespace-scoped informers + paginated list mandatory |
| NFR-PERF-3 |
Fleet: 100–500+ clusters via shared sink (ADR-0501); no hub merge tier |
| NFR-PERF-4 |
One shared informer per GVK; memory scales with objects × GVKs, not with target count |
| NFR-PERF-5 |
Export load bounded by debounce; spill oversized payloads to object store; ≤~2k rows/inventory at default maxExportBytes |
| NFR-PERF-6 |
Tunable MaxConcurrentReconciles, dispatch pool, resync period; observable queue depth |
4.2 Reliability & correctness (NFR-REL)
| ID |
Requirement |
| NFR-REL-1 |
Idempotent, level-based reconcile; safe to repeat |
| NFR-REL-2 |
At-least-once export; sinks idempotent on (cluster, ns, name, uid) |
| NFR-REL-3 |
No reconcile spin on terminal errors; exponential backoff + jitter on transient |
| NFR-REL-4 |
Degrade (not crash) under partial RBAC; record skipped:forbidden |
| NFR-REL-5 |
Pod restart loses only in-memory cache; rebuilt from informer resync |
Enforcement patterns: guidelines § 1–2.
4.3 Security (NFR-SEC)
| ID |
Requirement |
| NFR-SEC-1 |
Credentials only via secretRef; never in spec/status/logs |
| NFR-SEC-2 |
Default verify TLS; insecureSkipVerify opt-in and surfaced in status |
| NFR-SEC-3 |
Tenancy enforced by KollectScope (hard degrade) + SAR; least-privilege RBAC |
| NFR-SEC-4 |
Sensitive-key redaction before export; no secret material in inventory |
| NFR-SEC-5 |
Distroless nonroot image; minimal attack surface |
Enforcement: guidelines § 3,
coding-standards.md § Security.
4.4 Operability (NFR-OPS)
| ID |
Requirement |
| NFR-OPS-1 |
Helm chart day one; tenantMode + watchNamespaces default per-team install |
| NFR-OPS-2 |
Feature gates default to safe values (HTTP off, profiling off, connectionTest off in prod) |
| NFR-OPS-3 |
Clear, sanitized, actionable condition/error messages |
| NFR-OPS-4 |
No hard dependency on Kafka/NATS/Postgres for install or CI (inprocess defaults) |
4.5 Extensibility & compatibility (NFR-EXT)
| ID |
Requirement |
| NFR-EXT-1 |
New sink backends register via a factory; no vendor SDK in reconcilers |
| NFR-EXT-2 |
New GVKs need no codegen — profile-driven |
| NFR-EXT-3 |
A sink backend ships only when integration/e2e-testable (testcontainers or kind sidecar) |
| NFR-EXT-4 |
CRD enums/conditions evolve via OpenAPI; pre-beta breaking changes allowed (A1) |
4.6 Testability (NFR-TEST)
| ID |
Requirement |
| NFR-TEST-1 |
Extraction + error classes covered by table-driven unit tests (no cluster) |
| NFR-TEST-2 |
Samples double as contract/regression tests; breaking extraction fails CI |
| NFR-TEST-3 |
Scheduled full-path e2e (install → apply samples → assert conditions/export) |
| NFR-TEST-4 |
Codegen drift gate (task verify) green at every commit |
Enforcement: guidelines § 4,
testing.md, coding-standards.md § Testing.
5. Explicit non-goals
| Non-goal |
Rationale |
| In-operator doc/CMS rendering (Confluence, wiki, templating) |
Single responsibility — external CI consumes exports (ADR-0702) |
prometheus as a sink type |
Operator metrics use /metrics; avoids scrape/sink confusion (ADR-0601) |
KollectHub CRD |
Never shipped — hub tier removed (ADR-0501) |
| Full inventory payload in CRD status |
etcd limit (ADR-0103) |
| Pairwise agent mesh beyond ~20 peers |
Does not scale; use shared sink (ADR-0501) |
| In-place ACID lakehouse updates (Iceberg/DuckLake) |
Kollect overwrites whole snapshots; no catalog/metadata DB needed (ADR-0401) |
6. Resolved requirement questions (2026-06-05)
- Payload spill: object-store spill is mandatory above 1 MiB (warn at 1 MiB; hard cap
~1.5 MiB
maxExportBytes) (ADR-0103).
- Delivery semantics: at-least-once + idempotent (effectively-once for state); exactly-once is a
non-goal.
- Parquet schema: hybrid — typed identity columns + JSON
attributes + a promoted hot-attribute
allowlist (ADR-0401).
- Cluster-scoped under OptIn: honor a target-level default opt-in, per-object opt-out wins
(ADR-0205).
See also