Common errors — operator guide¶

Symptom-oriented catalog for production failures in Kollect reconcilers, collection, and export. For error-class semantics and reconcile behavior, see ADR-0602: Error taxonomy — this page focuses on what operators see and what to do.

No hub/spoke runtime

Hub/spoke ingest was removed in v0.3. Multi-cluster uses N single-mode operators exporting to a shared sink with spec.cluster. There are no hub-specific conditions, metrics, or failure modes in current releases.

1. How errors surface¶

Kollect reports failures through four channels. Start with conditions on the CR that owns the pipeline; use metrics and logs when status is stale or ambiguous.

Conditions¶

Condition	Typical objects	Meaning
`Ready`	`KollectTarget`, `KollectInventory`, `KollectClusterInventory`	Pipeline healthy enough to collect or export
`Synced`	Inventory / cluster inventory	Last export cycle outcome (aggregate across sinks)
`PartiallySynced`	Inventory (`Ready` or `Synced` reason)	Some sinks exported; others debounced or failed
`Degraded`	Target, inventory, sink family CRs	Terminal misconfig or export gate — spec change required
`ExportShardWarning`	`KollectInventory`	Namespace aggregate ≥ ~1,800 rows — split before cap
`SinkReachable`	Inventory / target	Sink ref resolves and backend reachable
`ConnectionVerified`	`KollectSnapshotSink`, `KollectDatabaseSink`, `KollectEventSink`	Last connectivity probe succeeded

Per-sink detail lives in status.sinkExports[] — each entry has its own Synced condition and lastExportTime.

kubectl describe kollectinventory <name> -n <ns>
kubectl describe kollecttarget <name> -n <ns>
kubectl describe kollectsnapshotsink <name> -n <ns>   # or databasesink / eventsink
kubectl get events -n <ns> --field-selector involvedObject.name=<name>

Events¶

Warning events carry stable reason enums (not free-form types). Common reasons: ScopeGVKDenied, PayloadTooLarge, ExportFailed, Progressing, ConnectionTestFailed, ReconcilePanic.

Metrics¶

Metric	Labels	Use
`kollect_reconcile_errors_total`	`kind`, `error_class`	Reconcile failures: `transient`, `terminal`, `forbidden`
`kollect_sink_errors_total`	`reason`	Export failures — separate from reconcile errors
`kollect_sink_connection_test_total`	`type`, `result`	Probe outcomes per sink family
`kollect_export_duration_seconds`	`sink_type`	Slow exports (Git clone, Postgres bulk, etc.)
`kollect_workqueue_depth`	`controller`	Reconcile backlog / conflict storms

Sink error reason values include: transient, terminal, forbidden, payload_too_large, spill_required, unknown.

Full catalog: Operator metrics.

Inventory `status.sinkExports[]`¶

Each bound sink (snapshotSinkRefs, databaseSinkRefs, eventSinkRefs) gets a status slice entry:

Per-sink `Synced`	Operator read
`True`, reason `Exported`	Last attempt succeeded
`False`, reason `Debounced`	Not a failure — cadence/coalesce skipped write
`False`, reason `ExportFailed`	Export attempt failed — read `message`

Read API mirrors this as debounced vs degraded per sink (metrics note).

2. Error classes (ADR-0602)¶

Class	Meaning	Reconcile	Self-heal?
Transient	Network blip, 429, API conflict, sink timeout, circuit breaker open	Requeue with backoff; `Synced=False`, reason `Progressing`	Yes — when root cause clears
Terminal	Bad config, invalid extraction path, auth permanently wrong, payload over cap	No requeue; `Degraded=True` + Warning event	No — fix spec/credentials, then observe new generation
Forbidden	SAR/RBAC denied for list/watch on a namespace/GVK	Degrade scope; partial collection; metric `error_class=forbidden`	Partial — grant RBAC or narrow selectors

Details, examples, and circuit-breaker rules: ADR-0602.

3. Catalog by symptom¶

Each row: what you see → likely cause → how to confirm → fix → escalate when stuck.

Scope and tenancy¶

Symptom	Likely causes	Identify	Handle	Escalate
Target `Degraded`, reason `ScopeGVKDenied`	`KollectScope` allow-list excludes profile GVK	Event reason; scope CR in same namespace	Add GVK to scope or remove scope binding	Platform team owns scope policy
Inventory `Degraded`, reason `ScopeSinkDenied`	Scope disallows snapshot/database/event family ref	`kubectl describe kollectinventory`; scope spec	Use allowed sink family/type or widen scope	Same
Target `Degraded`, reason `ScopeNamespaceDenied`	Target/intent namespaces outside scope	Event + target spec `namespaceSelector`	Fix selectors or scope `allowedNamespaces`	Same
Target `Degraded`, reason `Forbidden`	SAR denied for list in workload namespace	`kollect_reconcile_errors_total{error_class="forbidden"}`; target message cites namespace/GVK	Grant operator Role/ClusterRole list/watch on GVR; or narrow target to permitted namespaces	RBAC audit / cluster admin
Namespace empty in inventory but target exists	`namespaceSelector` mismatch, watch opt-out, or scope deny	Target `Ready`; `status.itemCount=0`; labels `kollect.dev/watch`	Align selector with workloads; check watch labels	If SAR OK but still empty — profile/GVK issue

Collection¶

Symptom	Likely causes	Identify	Handle	Escalate
Target `Degraded`, `ProfileNotFound`	Wrong `profileRef` or cross-namespace ref	`kubectl get kollectprofile -n <target-ns>`	Create profile in same namespace as target	—
Target `Degraded`, `InformerRegistrationFailed`	Unknown/uninstalled GVK, CRD missing	Target message; apiserver discovery	Install CRD/API; fix `KollectProfile.spec.targetGVK`	Vendor CRD not on cluster
Target `Degraded`, `AccessCheckFailed`	SAR API error (not denial) during list pre-check	Logs: `access check failed`; transient error metric	Fix apiserver connectivity; check operator pod network	Sustained apiserver errors
Target `Degraded`, `Forbidden` (collection)	List denied for namespace	`error_class=forbidden`; engine marks forbidden scope	Fix RBAC or reduce target scope	—
Partial/empty attributes	CEL/JSONPath eval error on object	Logs: `extract attributes` (no secret values logged)	Fix attribute paths in profile; test with `kubectl explain` sample	Webhook should catch most invalid paths at admission
Growing `kollect_watch_map_list_errors_total`	List failure registering watch map handler	PromQL increase; controller logs on inventory/target	Fix RBAC for mapped GVR; check apiserver load	API server degradation

Export — payload size¶

Symptom	Likely causes	Identify	Handle	Escalate
Inventory `Degraded`, `PayloadTooLarge`	Monolithic export > ~1.5 MiB (`maxExportBytes`)	Condition message with byte counts; `kollect_sink_errors_total{reason="payload_too_large"}`	Shard: multiple `KollectInventory` per namespace (<~2k rows each)	Architecture review for 10k+ row namespaces
Inventory `Degraded`, `SpillRequired`	Large payload needs object-store spill, none configured	Reason `SpillRequired`; `spill_required` metric	Add `KollectSnapshotSink` type `s3` or `gcs` to inventory refs	—
`ExportShardWarning=True`	≥ ~1,800 rows in one namespace aggregate	Condition + `increase(kollect_export_shard_warn_total[1h])`	Split inventories before hard cap	See scaling and fleet
`kollect_export_spill_warn_total` increasing	Payload ≥ 1 MiB warn threshold	Metric + log `export payload exceeds spill warn threshold`	Shard or tune `spec.maxExportBytes` (within global cap)	—

Export — sink backends¶

Family CRDs: KollectSnapshotSink (git/gitlab/s3/gcs), KollectDatabaseSink (postgres), KollectEventSink (kafka/nats). Inventory refs must use the matching family field (snapshotSinkRefs, etc.) in the same namespace.

Symptom	Likely causes	Identify	Handle	Escalate
Git push failures (NFF)	Remote ahead of operator; concurrent writers	Logs `git push`; `terminal` or `transient` on sink errors	`pushPolicy: Commit` retries merge+push; ensure single writer per branch/path	Protected branch / hook rejects — platform Git admin
Git auth `terminal`	Bad token, expired credential, 401/403	Sink `ConnectionVerified=False`; `ConnectionTestFailed` event	Rotate `secretRef`; re-annotate `kollect.dev/test-connection=true`	IdP / Git provider outage
Postgres connection failures	DSN, network policy, TLS, pool timeout	Database sink conditions; `transient` sink errors; connection test metric	Verify Secret keys, egress NetworkPolicy, server reachable	DBA for server-side limits
S3/GCS 403	IAM, wrong bucket, signature	Export logs; `terminal`/`forbidden`	Fix credentials and bucket policy	Cloud IAM review
`sink circuit breaker open` in logs	5 consecutive transient failures per sink key	`transient` errors then silence ~30s	Fix backend; breaker self-closes after timeout	Backend SLA breach
`SinkReachable=False`, `SinkNotFound`	Wrong sink name or cross-namespace ref	Inventory message; `kubectl get kollect*sink -n <inv-ns>`	Fix ref name; create sink in inventory namespace	—
`SinkReachable=False`, `SinkUnreachable`	Backend down despite CR present	Sink `ConnectionVerified`; probe annotation	Fix network/credentials first	—

Export — debounce (not a failure)¶

Symptom	Likely causes	Identify	Handle	Escalate
`Ready=True`, reason `PartiallySynced`	Per-sink `exportMinInterval`; unchanged payload checksum	`status.sinkExports[].conditions` reason `Debounced`; no `kollect_sink_errors_total` spike	Expected — wait for interval; tighten interval only if SLA requires fresher data	Mistaking debounce for outage
`Synced=False`, reason `PartiallySynced`, all sinks debounced	All sinks within cadence window	All per-sink `Debounced`; `kollect_export_debounced_total` up	Normal for dual-cadence (e.g. Postgres 30s + Git 1h)	—
Stale data in Postgres but `Synced=True` on Git ref	Different intervals per ref	Compare `lastExportTime` per `sinkExports` entry	Set ref-level `exportMinInterval` intentionally	—

Connection test¶

Symptom	Likely causes	Identify	Handle	Escalate
Sink `ConnectionVerified=False`, `ConnectionTestFailed`	TLS verify fail, bad Secret, wrong endpoint	`kubectl describe` sink; `kollect_sink_connection_test_total{result="failure"}`	Fix `secretRef`, CA bundle, URL; one-shot: annotation `kollect.dev/test-connection=true`	Corporate TLS inspection
`KollectConnectionTest` stuck false	One-shot CR probe failed	`kubectl describe kollectconnectiontest`	Same as sink probe; check `spec.sinkRef` family field	—
`TLSInsecure=True` on sink	Explicit insecure TLS (non-default)	Condition on sink	Prefer proper CA; document exception per security policy	Security review

Production sinks should keep spec.connectionTest: false and use annotation for ad-hoc probes (ADR-0403).

Reconcile and workqueue¶

Symptom	Likely causes	Identify	Handle	Escalate
Slow inventory updates, no `Degraded`	Optimistic-lock conflicts, high churn	`kollect_workqueue_depth` sustained; conflict requeues in logs	Raise `maxConcurrentReconciles`; increase `exportMinInterval`; shard inventories	etcd/apiserver slow
Event `ReconcilePanic`	Unexpected panic (should not crash pod)	Event reason; log `reconcile panic recovered`	Upgrade to fixed release; file bug with stack from logs	Repeat panics on same controller
`kollect_collect_dispatch_sync_fallback_total` rising	Dispatch queue saturated	Metric + dispatch queue depth	Increase `collect.dispatchWorkers` / `dispatchQueueSize`	CPU throttle on controller
Status update lag	Many inventories, frequent export	Reconcile duration p95; etcd metrics	Debounce, sharding, fewer sinks per inventory	Load test runbook

Webhook vs runtime validation¶

Symptom	Likely causes	Identify	Handle	Escalate
`kubectl apply` rejected	CEL validation on CRD, validating webhook	Admission error message (no object created)	Fix spec before create	—
CR accepted but `Degraded` at runtime	GVK/CRD absent on cluster, scope enforced only at reconcile, SAR not checked at admission	Compare admission vs `kubectl describe` conditions	Install CRDs; fix runtime-only constraints	Gap between webhook and runtime — upstream issue
Scope ceiling on cluster targets	`KollectClusterScope` webhook deny	Forbidden on apply	Adjust cluster target GVKs to allowed set	—

Multi-sink partial success¶

Symptom	Likely causes	Identify	Handle	Escalate
`Ready=True`, `PartiallySynced`; Postgres OK, Git failed	Independent per-sink export	`status.sinkExports[]` — mixed `Exported` / `ExportFailed`	Fix failing sink only; successful sinks stay current	—
`Synced=False`, `PartiallySynced`; some failed	One backend terminal while others OK	Failed count in condition message	Terminal sink needs spec/cred fix; others self-heal	—
Aggregate `Synced=False`, all per-sink failed	Shared payload gate (spill) before export	Inventory-level `Degraded` + spill reasons	Fix size/sharding first	—

Resources (brief)¶

Symptom	Likely causes	Identify	Handle	Escalate
OOMKilled controller	Large collect store, cluster-wide informers	Pod status; `kollect_collect_items_total`; RSS	`resourcesProfile: large`; namespace-scope targets; shard inventories	Load test runbook
CPU throttle	High dispatch/reconcile load	pprof; `kollect_reconcile_duration_seconds` p95	Raise limits; tune workers; reduce churn	—
etcd / API slow	Status write rate, large fleets	Apiserver metrics; many inventories	Longer export intervals; fewer status transitions	Platform cluster health

4. PromQL cheat sheet¶

Tie queries to symptoms (adjust namespace/job labels for your scrape config):

# Sustained reconcile failures by class
sum(rate(kollect_reconcile_errors_total[5m])) by (error_class)

# Inventory export errors only
sum(rate(kollect_reconcile_errors_total{kind="KollectInventory"}[5m])) by (error_class)

# Export failure reasons (auth, size, transient, …)
sum(increase(kollect_sink_errors_total[15m])) by (reason)

# Slow exports — Git vs Postgres vs event
histogram_quantile(0.95, sum(rate(kollect_export_duration_seconds_bucket[5m])) by (le, sink_type))

# Debounce (expected) vs failure — debounce should NOT correlate with sink_errors
sum(rate(kollect_export_debounced_total[5m])) by (controller)

# Workqueue backlog — conflict storms or under-provisioned workers
max_over_time(kollect_workqueue_depth[10m])

# Collect store growth — OOM/sharding signal
kollect_collect_items_total

# Approaching export shard cap
increase(kollect_export_shard_warn_total[1h])

Default alert rules: metrics.md — Default alerts.

5. Log patterns¶

Structured controller logs (logr). Grep operator pod logs (namespace typically kollect-system):

Key / message fragment	Indicates
`error_class`	`transient` / `terminal` / `forbidden` on wrapped errors
`reason`	Spill gate, export failure, scope denial (stable enum)
`inventory`, `target`	Which CR pipeline
`sink`	Backend key during export
`access check failed`	SAR API error → target `AccessCheckFailed`
`extract attributes`	CEL/JSONPath failure on a resource
`export failed`	Sink export path
`export payload exceeds spill warn threshold`	Approaching 1 MiB — shard soon
`debounced`	Export skipped by interval/coalesce
`dispatch sync fallback`	Dispatch queue full — performance tuning
`sink circuit breaker open`	Repeated transient sink failures
`reconcile panic recovered`	Panic converted to requeue (EC-P2-01)
`git push` / `git auth failed`	Snapshot sink transport

Never expect secrets, tokens, or full payloads in logs.

kubectl logs -n kollect-system deploy/kollect-controller-manager --tail=500 \
  | rg 'export failed|PayloadTooLarge|debounced|access check failed|circuit breaker'

6. See also¶

ADR-0602: Error taxonomy — class definitions and reconcile rules
Operator metrics — full metric catalog and Prometheus Operator setup
FAQ — installation, same-namespace refs, connection conditions
Load test runbook — scale diagnosis matrix and pprof
Scaling and fleet — export sharding and multi-cluster shared sinks
Deployment inventory troubleshooting — first-check table for namespaced pipelines