Requirements & assumptions

First-principles requirements for Kollect — what the operator must do and how well, independent of any specific design. ADRs record how; this document records what and why. When an ADR and this document disagree, reconcile or flag the drift.

Status: living · Audience: architects, contributors, and reviewers evaluating or extending Kollect · Not a tutorial — start with Understand the basics and Architecture; locked design choices live in Platform decisions.

Build order, not a release train — see PLATFORM-DECISIONS.md.

1. Problem statement

Platform and application teams need a versioned, stakeholder-facing inventory of what runs in their Kubernetes clusters: which resources exist, with which attributes (chart versions, images, sync status, certificate expiry, …), aggregated into a queryable, auditable form.

From first principles, each existing option falls short on at least one requirement:

Option	Why it is insufficient
Query `kube-apiserver` directly	Unbounded list/watch load; no history; couples portals to cluster availability and RBAC
Store inventory in CRD `.status`	etcd object-size limit (~1.5 MiB); destabilizes the apiserver at scale
Hardcoded collector schemas	Break whenever a new CRD/attribute is needed; no per-team extensibility
kube-state-metrics only	Metrics are observability, not diffable stakeholder inventory
Per-cluster Git commits	O(N) commit/export storms across a fleet; noise without aggregation

Kollect's thesis: watch user-defined GVKs via shared informers → extract attributes via CEL/JSONPath → aggregate in memory → debounce → export to pluggable durable sinks, so consumers read export data, never the live API at scale.

2. Users and assumptions

Persona	Need
Application team	An inventory of the workloads in their own namespace, with the inventory itself owned in that namespace
Platform team	Cross-namespace / cross-cluster rollup with tenancy guardrails
Portal / automation	A stable, queryable, durable read surface (SQL, object store, or stream)
Auditor	Diffable, point-in-time history of what was deployed

Operating assumptions (binding unless revisited):

A1 — No external adopters on v1alpha1. Breaking API changes are acceptable pre-beta.
A2 — Event-driven, not polling. Collection reacts to informer events (ADR-0301).
A3 — Status is a summary, never a payload store (ADR-0103).
A4 — Single responsibility. Kollect collects and exports; it does not render or publish docs/CMS (ADR-0702).
A5 — The in-memory snapshot per inventory is canonical; every sink is a projection of it (ADR-0401).
A6 — Internal/self-signed CAs are normal; TLS trust is a first-class sink concern, not a bolt-on.
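
A5, together with FR-EXP-3, implies that every sink consumes one canonical, deterministically ordered snapshot. The following is a minimal Go sketch that assumes a hypothetical `Row` shape and uses JSON as the canonical encoding. Sorting rows by identity makes equal snapshots byte-identical, because `encoding/json` already writes map keys in sorted order. A content hash can then detect "nothing changed".

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// Row is a hypothetical collected row: identity plus extracted attributes.
type Row struct {
	Namespace  string            `json:"namespace"`
	Kind       string            `json:"kind"`
	Name       string            `json:"name"`
	Attributes map[string]string `json:"attributes"`
}

// canonicalJSON sorts rows by identity before marshaling. encoding/json
// writes map keys in sorted order, so equal snapshots yield identical bytes.
func canonicalJSON(rows []Row) ([]byte, error) {
	sorted := append([]Row(nil), rows...) // copy: leave the caller's slice untouched
	sort.Slice(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		if a.Kind != b.Kind {
			return a.Kind < b.Kind
		}
		return a.Name < b.Name
	})
	return json.MarshalIndent(sorted, "", "  ")
}

// snapshotHash fingerprints the canonical form; unchanged snapshots hash identically.
func snapshotHash(rows []Row) (string, error) {
	b, err := canonicalJSON(rows)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	web := Row{"shop", "Deployment", "web", map[string]string{"image": "web:1.4", "replicas": "3"}}
	db := Row{"shop", "StatefulSet", "db", map[string]string{"image": "postgres:16"}}

	h1, _ := snapshotHash([]Row{web, db})
	h2, _ := snapshotHash([]Row{db, web}) // same rows, different informer order
	fmt.Println(h1 == h2)                 // true
}
```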

3. Functional requirements

IDs are stable handles for discussion (FR-<area>-<n>).

3.1 Configuration & API (FR-API)

ID	Requirement	Reference
FR-API-1	Inventory is configured by CRDs, not code: extraction schema (`KollectProfile`), resource selection (`KollectTarget`), aggregation/export (`KollectInventory`), backends (`KollectSnapshotSink` / `KollectDatabaseSink` / `KollectEventSink`), tenancy (`KollectScope`)	ADR-0201, ADR-0414
FR-API-2	Namespaced-by-default tenancy; cluster-scoped variants for platform-wide use	ADR-0201, ADR-0203
FR-API-3	Config kinds (`Profile`, `Scope`) are static (no controller); work kinds are reconciled	ADR-0202
FR-API-4	Invalid CEL/JSONPath and unknown sink types are rejected at admission, not at runtime	ADR-0201, ADR-0302
FR-API-5	Every reconciled kind supports `spec.suspend` and a manual-trigger annotation	ADR-0201

3.2 Collection & extraction (FR-COL)

ID	Requirement	Reference
FR-COL-1	Watch arbitrary GVKs declared by profiles via one shared dynamic informer per GVK	ADR-0301
FR-COL-2	Extract named attributes via JSONPath (incl. `[*]` array wildcard) and `cel:`-prefixed CEL	ADR-0302
FR-COL-3	Scope watches by namespace/label selector and name lists to bound memory	ADR-0301
FR-COL-4	Watch opt-in/opt-out via `kollect.dev/watch` labels and `watchMode` (`All` or `OptIn`)	ADR-0205
FR-COL-5	Never extract `Secret.data` (incl. Helm `data.release`) without explicit opt-in; redact sensitive keys	ADR-0303
FR-COL-6	Ship tested sample profiles + contract tests (Deployment, Argo `Application`, cert-manager `Certificate`, …)	ADR-0301, ADR-0303
FR-COL-7	Optional full-resource export (`export.mode: Resource`): embed a pruned target object via Argo-style RFC 6901 `jsonPointers`, JSONPath, built-in defaults, and merged `scrubKeys`; Secret/sensitive kinds require `kollect.dev/allow-full-resource-export` opt-in and stay within size governance	ADR-0306, ADR-0303, ADR-0405

3.3 Aggregation & export (FR-EXP)

ID	Requirement	Reference
FR-EXP-1	Aggregate target rows into a per-namespace `KollectInventory` snapshot	ADR-0201
FR-EXP-2	Coalesce identical exports via `spec.exportMinInterval` (default 30s); material changes bypass	ADR-0201, ADR-0603
FR-EXP-3	Deterministic, stable-ordered serialization (diffable Git, golden tests)	ADR-0103
FR-EXP-4	Pluggable sinks by role: snapshot store (Git, GitLab, S3, GCS), relational/analytics system of record (Postgres; BigQuery planned), document store (MongoDB), event emitter (Kafka, NATS)	ADR-0401, ADR-0402
FR-EXP-5	Resource deletions are reflected in sinks (free for snapshot stores, which overwrite the whole snapshot; Postgres/MongoDB/event sinks via reconciliation)	ADR-0401
FR-EXP-6	First-class sink connectivity testing (`KollectConnectionTest` CR + sink probe)	ADR-0403
FR-EXP-7	Custom CA / self-signed TLS trust for Git/GitLab/Postgres sinks (`caSecretRef` / `caBundle`)	ADR-0201
FR-EXP-8	Git/GitLab sinks default to human-readable YAML with zero config; optional `spec.layout` (`document`/`perResource`/`split`), path templates, and auto-prune for per-resource trees	ADR-0419, ADR-0416, ADR-0306
FR-EXP-9	A single `KollectInventory` exports to all referenced sinks in parallel in one debounced pass, each sink with its own `exportMinInterval` and circuit breaker; partial failure degrades to `PartiallySynced`, not full failure	ADR-0401, ADR-0413

3.4 Read path (FR-READ)

ID	Requirement	Reference
FR-READ-1	Primary scalable read = sink export (SQL/object store/stream), not the live API	ADR-0103
FR-READ-2	Optional read-only HTTP inventory API, feature-gated off by default, for debug/small installs	ADR-0103
FR-READ-3	When HTTP is enabled, authenticate via Kubernetes TokenReview + SubjectAccessReview	ADR-0404
FR-READ-4	Fleet read plane (frozen — design only): a standalone read-only console may materialize a fleet-wide read model from the shared event stream and serve the Read API extended with a `cluster` dimension — never a hub, never kube-apiserver writes, no bus/DB creds in the browser. The UI program is frozen and the Read API freeze is deferred; see ROADMAP § Read API + UI console (frozen)	ADR-0418, ADR-0501

3.5 Multi-cluster (FR-MC)

ID	Requirement	Reference
FR-MC-1	Default multi-cluster mode is direct fan-in to a shared sink, with rows distinguished by `spec.cluster` and merged on the backend key / primary key	ADR-0401, ADR-0501
FR-MC-2	Each cluster runs an independent operator (`mode: single`); no hub tier	ADR-0501
FR-MC-3	Operators stay lightweight: debounced export, bounded in-memory collect store	ADR-0501, ADR-0603
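
Shared-sink fan-in works because every row carries its cluster in the key. The following is a minimal sketch in which an in-memory map stands in for a backend table keyed on NFR-REL-2's `(cluster, ns, name, uid)`. `sharedTable` and `RowKey` are hypothetical names. In Postgres, the analogous operation would be an `INSERT ... ON CONFLICT ... DO UPDATE` upsert.

```go
package main

import "fmt"

// RowKey mirrors the idempotency key in NFR-REL-2: (cluster, ns, name, uid).
type RowKey struct {
	Cluster, Namespace, Name, UID string
}

// sharedTable stands in for a shared backend table whose primary key is RowKey.
// Hypothetical: real sinks persist rows, this only models the key semantics.
type sharedTable struct {
	rows map[RowKey]map[string]string
}

func newSharedTable() *sharedTable {
	return &sharedTable{rows: map[RowKey]map[string]string{}}
}

// Upsert is idempotent: redelivering the same row overwrites it in place.
func (t *sharedTable) Upsert(k RowKey, attrs map[string]string) {
	t.rows[k] = attrs
}

// Delete applies a deletion reported by one cluster without touching others.
func (t *sharedTable) Delete(k RowKey) {
	delete(t.rows, k)
}

func main() {
	table := newSharedTable()
	east := RowKey{"east", "shop", "web", "uid-1"}
	west := RowKey{"west", "shop", "web", "uid-9"} // same ns/name, other cluster

	table.Upsert(east, map[string]string{"image": "web:1.4"})
	table.Upsert(east, map[string]string{"image": "web:1.4"}) // at-least-once redelivery
	table.Upsert(west, map[string]string{"image": "web:1.3"})
	fmt.Println(len(table.rows)) // 2: no duplicate from redelivery, no cross-cluster collision

	table.Delete(west)
	fmt.Println(len(table.rows)) // 1
}
```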

3.6 Observability (FR-OBS)

ID	Requirement	Reference
FR-OBS-1	Operator Prometheus metrics on `/metrics` (reconcile, export, sink errors, collection counts)	ADR-0601, ADR-0602
FR-OBS-2	Typed error taxonomy drives requeue behavior and conditions (`Ready`/`Synced`/`Degraded`)	ADR-0602
FR-OBS-3	Operators can tell why export failed from conditions, events, and metrics	ADR-0602, ADR-0403
FR-OBS-4	`prometheus` is not a sink type; domain (KSM-style) metrics are emitted by the collection engine	ADR-0601, ADR-0304

4. Non-functional requirements

4.1 Performance & scale (NFR-PERF; ADR-0603)

ID	Target
NFR-PERF-1	Design target: 100,000 collected rows/cluster (sharded exports); 10,000+ validated in CI tiers; store ≤512 MiB @ 10k, operator RSS 2–4 GiB @ 100k
NFR-PERF-2	Giant cluster: 1000+ nodes — namespace-scoped informers + paginated list mandatory
NFR-PERF-3	Fleet: 100–500+ clusters via shared sink (ADR-0501); no hub merge tier
NFR-PERF-4	One shared informer per GVK; memory scales with objects × GVKs, not with target count
NFR-PERF-5	Export load bounded by debounce; spill oversized payloads to object store; ≤~2k rows/inventory at default `maxExportBytes`
NFR-PERF-6	Tunable `MaxConcurrentReconciles`, dispatch pool, resync period; observable queue depth
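
NFR-PERF-4 claims that memory follows GVKs rather than targets, and that claim rests on deduplicating watches. The sketch below models the deduplication with a reference-counted registry. The counters stand in for client-go shared informers, and `informerRegistry` and `GVK` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// GVK identifies one watched kind.
type GVK struct{ Group, Version, Kind string }

// informerRegistry is a hypothetical sketch: one shared watch per GVK,
// reference-counted by the targets that need it. A real operator would hold
// client-go informers in place of the bare counters used here.
type informerRegistry struct {
	mu     sync.Mutex
	refs   map[GVK]int
	starts int // how many watches were actually opened
}

func newRegistry() *informerRegistry {
	return &informerRegistry{refs: map[GVK]int{}}
}

// Acquire records one target's interest; only the first caller opens a watch.
func (r *informerRegistry) Acquire(g GVK) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs[g] == 0 {
		r.starts++
	}
	r.refs[g]++
}

// Release drops one target's interest and reports whether the watch stopped.
func (r *informerRegistry) Release(g GVK) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs[g] == 0 {
		return false
	}
	r.refs[g]--
	if r.refs[g] == 0 {
		delete(r.refs, g)
		return true
	}
	return false
}

// Active returns the number of open watches (one per GVK in use).
func (r *informerRegistry) Active() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.refs)
}

func main() {
	reg := newRegistry()
	deploy := GVK{"apps", "v1", "Deployment"}
	cert := GVK{"cert-manager.io", "v1", "Certificate"}
	for i := 0; i < 50; i++ {
		reg.Acquire(deploy) // 50 targets selecting Deployments
	}
	reg.Acquire(cert)
	fmt.Println(reg.Active(), reg.starts) // 2 2: watch count follows GVKs, not targets
}
```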

4.2 Reliability & correctness (NFR-REL)

ID	Requirement
NFR-REL-1	Idempotent, level-based reconcile; safe to repeat
NFR-REL-2	At-least-once export; sinks idempotent on `(cluster, ns, name, uid)`
NFR-REL-3	No reconcile spin on terminal errors; exponential backoff + jitter on transient
NFR-REL-4	Degrade (not crash) under partial RBAC; record `skipped:forbidden`
NFR-REL-5	Pod restart loses only in-memory cache; rebuilt from informer resync

Enforcement patterns: guidelines § 1–2.

4.3 Security (NFR-SEC)

ID	Requirement
NFR-SEC-1	Credentials only via `secretRef`; never in spec/status/logs
NFR-SEC-2	Default verify TLS; `insecureSkipVerify` opt-in and surfaced in status
NFR-SEC-3	Tenancy enforced by `KollectScope` (hard degrade) + SAR; least-privilege RBAC
NFR-SEC-4	Sensitive-key redaction before export; no secret material in inventory
NFR-SEC-5	Distroless nonroot image; minimal attack surface

Enforcement: guidelines § 3, coding-standards.md § Security.
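
NFR-SEC-4 and FR-COL-5 require sensitive values to be redacted before export. The following is a minimal Go sketch that assumes a hypothetical list of key fragments. It matches them by substring, which errs toward over-redaction (a key such as `secretRef` would also be masked). The actual sensitive-key set is governed by ADR-0303.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sensitiveFragments is a hypothetical default list, not the ADR-0303 set.
var sensitiveFragments = []string{"password", "passwd", "token", "secret", "apikey", "privatekey"}

// isSensitive matches case-insensitively after dropping '-' and '_', and by
// substring, so it fails closed ("db_password", "apiKey", "x-auth-token").
func isSensitive(key string) bool {
	k := strings.ToLower(strings.NewReplacer("-", "", "_", "").Replace(key))
	for _, frag := range sensitiveFragments {
		if strings.Contains(k, frag) {
			return true
		}
	}
	return false
}

// redact returns a deep copy of v with every value under a sensitive key
// replaced, so the source object is never mutated.
func redact(v any) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			if isSensitive(k) {
				out[k] = "[REDACTED]"
				continue
			}
			out[k] = redact(val)
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = redact(val)
		}
		return out
	default:
		return v
	}
}

func main() {
	var obj any
	raw := `{"name":"db","config":{"host":"pg","db_password":"hunter2"},"args":[{"apiKey":"abc"}]}`
	if err := json.Unmarshal([]byte(raw), &obj); err != nil {
		panic(err)
	}
	out, _ := json.Marshal(redact(obj))
	fmt.Println(string(out))
	// {"args":[{"apiKey":"[REDACTED]"}],"config":{"db_password":"[REDACTED]","host":"pg"},"name":"db"}
}
```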

4.4 Operability (NFR-OPS)

ID	Requirement
NFR-OPS-1	Helm chart from day one; `tenantMode` + `watchNamespaces` make a per-team install the default
NFR-OPS-2	Feature gates default to safe values (HTTP off, profiling off, `connectionTest` off in prod)
NFR-OPS-3	Clear, sanitized, actionable condition/error messages
NFR-OPS-4	No hard dependency on Kafka/NATS/Postgres for install or CI (`inprocess` defaults)

4.5 Extensibility & compatibility (NFR-EXT)

ID	Requirement
NFR-EXT-1	New sink backends register via a factory; no vendor SDK in reconcilers
NFR-EXT-2	New GVKs need no codegen — profile-driven
NFR-EXT-3	A sink backend ships only when integration/e2e-testable (testcontainers or kind sidecar)
NFR-EXT-4	CRD enums/conditions evolve via OpenAPI; pre-beta breaking changes allowed (A1)

4.6 Testability (NFR-TEST)

ID	Requirement
NFR-TEST-1	Extraction + error classes covered by table-driven unit tests (no cluster)
NFR-TEST-2	Samples double as contract/regression tests; breaking extraction fails CI
NFR-TEST-3	Scheduled full-path e2e (install → apply samples → assert conditions/export)
NFR-TEST-4	Codegen drift gate (`task verify`) green at every commit

Enforcement: guidelines § 4, testing.md, coding-standards.md § Testing.

5. Explicit non-goals

Non-goal	Rationale
In-operator doc/CMS rendering (Confluence, wiki, templating)	Single responsibility — external CI consumes exports (ADR-0702)
`prometheus` as a sink type	Operator metrics use `/metrics`; avoids scrape/sink confusion (ADR-0601)
`KollectHub` CRD	Never shipped — hub tier removed (ADR-0501)
Full inventory payload in CRD status	etcd limit (ADR-0103)
Pairwise agent mesh beyond ~20 peers	Does not scale; use shared sink (ADR-0501)
In-place ACID lakehouse updates (Iceberg/DuckLake)	Kollect overwrites whole snapshots; no catalog/metadata DB needed (ADR-0401)

6. Resolved requirement questions (2026-06-05)

Payload spill: object-store spill is mandatory above 1 MiB (a warning is raised at 1 MiB; the `maxExportBytes` hard cap is ~1.5 MiB) (ADR-0103).
Delivery semantics: at-least-once + idempotent (effectively-once for state); exactly-once is a non-goal.
Parquet schema: hybrid — typed identity columns + JSON attributes + a promoted hot-attribute allowlist (ADR-0401).
Cluster-scoped under OptIn: honor a target-level default opt-in; a per-object opt-out wins (ADR-0205).
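
The `OptIn` resolution, combined with FR-COL-4, reduces to a small decision function. This sketch makes two assumptions that should be checked against ADR-0205: the `kollect.dev/watch` label takes the values `"true"` and `"false"`, and `All` mode watches every object that has not explicitly opted out.

```go
package main

import "fmt"

// watchLabel comes from FR-COL-4; the values "true"/"false" are assumptions.
const watchLabel = "kollect.dev/watch"

// shouldWatch decides whether one object is collected. mode mirrors
// watchMode (All or OptIn); targetDefaultOptIn is the target-level default
// opt-in that the resolution applies to cluster-scoped objects.
func shouldWatch(mode string, targetDefaultOptIn bool, labels map[string]string) bool {
	v, labeled := labels[watchLabel]
	if labeled && v == "false" {
		return false // a per-object opt-out always wins
	}
	switch mode {
	case "All":
		return true
	case "OptIn":
		return (labeled && v == "true") || targetDefaultOptIn
	default:
		return false // unknown mode: fail closed (admission should already reject it)
	}
}

func main() {
	optOut := map[string]string{watchLabel: "false"}
	fmt.Println(shouldWatch("OptIn", true, nil))    // true: target-level default opt-in
	fmt.Println(shouldWatch("OptIn", true, optOut)) // false: object opt-out wins
	fmt.Println(shouldWatch("OptIn", false, nil))   // false: nothing opted in
}
```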