Skip to content

Common errors — operator guide

Symptom-oriented catalog for production failures in Kollect reconcilers, collection, and export. For error-class semantics and reconcile behavior, see ADR-0602: Error taxonomy — this page focuses on what operators see and what to do.

No hub/spoke runtime

Hub/spoke ingest was removed in v0.3. Multi-cluster uses N single-mode operators exporting to a shared sink with spec.cluster. There are no hub-specific conditions, metrics, or failure modes in current releases.

1. How errors surface

Kollect reports failures through four channels. Start with conditions on the CR that owns the pipeline; use metrics and logs when status is stale or ambiguous.

Conditions

Condition Typical objects Meaning
Ready KollectTarget, KollectInventory, KollectClusterInventory Pipeline healthy enough to collect or export
Synced Inventory / cluster inventory Last export cycle outcome (aggregate across sinks)
PartiallySynced Inventory (Ready or Synced reason) Some sinks exported; others debounced or failed
Degraded Target, inventory, sink family CRs Terminal misconfig or export gate — spec change required
ExportShardWarning KollectInventory Namespace aggregate ≥ ~1,800 rows — split before cap
SinkReachable Inventory / target Sink ref resolves and backend reachable
ConnectionVerified KollectSnapshotSink, KollectDatabaseSink, KollectEventSink Last connectivity probe succeeded

Per-sink detail lives in status.sinkExports[] — each entry has its own Synced condition and lastExportTime.

kubectl describe kollectinventory <name> -n <ns>
kubectl describe kollecttarget <name> -n <ns>
kubectl describe kollectsnapshotsink <name> -n <ns>   # or databasesink / eventsink
kubectl get events -n <ns> --field-selector involvedObject.name=<name>

Events

Warning events carry stable reason enums (not free-form types). Common reasons: ScopeGVKDenied, PayloadTooLarge, ExportFailed, Progressing, ConnectionTestFailed, ReconcilePanic.

Metrics

Metric Labels Use
kollect_reconcile_errors_total kind, error_class Reconcile failures: transient, terminal, forbidden
kollect_sink_errors_total reason Export failures — separate from reconcile errors
kollect_sink_connection_test_total type, result Probe outcomes per sink family
kollect_export_duration_seconds sink_type Slow exports (Git clone, Postgres bulk, etc.)
kollect_workqueue_depth controller Reconcile backlog / conflict storms

Sink error reason values include: transient, terminal, forbidden, payload_too_large, spill_required, unknown.

Full catalog: Operator metrics.

Inventory status.sinkExports[]

Each bound sink (snapshotSinkRefs, databaseSinkRefs, eventSinkRefs) gets a status slice entry:

Per-sink Synced Operator read
True, reason Exported Last attempt succeeded
False, reason Debounced Not a failure — cadence/coalesce skipped write
False, reason ExportFailed Export attempt failed — read message

Read API mirrors this as debounced vs degraded per sink (metrics note).


2. Error classes (ADR-0602)

Class Meaning Reconcile Self-heal?
Transient Network blip, 429, API conflict, sink timeout, circuit breaker open Requeue with backoff; Synced=False, reason Progressing Yes — when root cause clears
Terminal Bad config, invalid extraction path, auth permanently wrong, payload over cap No requeue; Degraded=True + Warning event No — fix spec/credentials, then observe new generation
Forbidden SAR/RBAC denied for list/watch on a namespace/GVK Degrade scope; partial collection; metric error_class=forbidden Partial — grant RBAC or narrow selectors

Details, examples, and circuit-breaker rules: ADR-0602.


3. Catalog by symptom

Each row: what you see → likely cause → how to confirm → fix → escalate when stuck.

Scope and tenancy

Symptom Likely causes Identify Handle Escalate
Target Degraded, reason ScopeGVKDenied KollectScope allow-list excludes profile GVK Event reason; scope CR in same namespace Add GVK to scope or remove scope binding Platform team owns scope policy
Inventory Degraded, reason ScopeSinkDenied Scope disallows snapshot/database/event family ref kubectl describe kollectinventory; scope spec Use allowed sink family/type or widen scope Same
Target Degraded, reason ScopeNamespaceDenied Target/intent namespaces outside scope Event + target spec namespaceSelector Fix selectors or scope allowedNamespaces Same
Target Degraded, reason Forbidden SAR denied for list in workload namespace kollect_reconcile_errors_total{error_class="forbidden"}; target message cites namespace/GVK Grant operator Role/ClusterRole list/watch on GVR; or narrow target to permitted namespaces RBAC audit / cluster admin
Namespace empty in inventory but target exists namespaceSelector mismatch, watch opt-out, or scope deny Target Ready; status.itemCount=0; labels kollect.dev/watch Align selector with workloads; check watch labels If SAR OK but still empty — profile/GVK issue

Collection

Symptom Likely causes Identify Handle Escalate
Target Degraded, ProfileNotFound Wrong profileRef or cross-namespace ref kubectl get kollectprofile -n <target-ns> Create profile in same namespace as target
Target Degraded, InformerRegistrationFailed Unknown/uninstalled GVK, CRD missing Target message; apiserver discovery Install CRD/API; fix KollectProfile.spec.targetGVK Vendor CRD not on cluster
Target Degraded, AccessCheckFailed SAR API error (not denial) during list pre-check Logs: access check failed; transient error metric Fix apiserver connectivity; check operator pod network Sustained apiserver errors
Target Degraded, Forbidden (collection) List denied for namespace error_class=forbidden; engine marks forbidden scope Fix RBAC or reduce target scope
Partial/empty attributes CEL/JSONPath eval error on object Logs: extract attributes (no secret values logged) Fix attribute paths in profile; test with kubectl explain sample Webhook should catch most invalid paths at admission
Growing kollect_watch_map_list_errors_total List failure registering watch map handler PromQL increase; controller logs on inventory/target Fix RBAC for mapped GVR; check apiserver load API server degradation

Export — payload size

Symptom Likely causes Identify Handle Escalate
Inventory Degraded, PayloadTooLarge Monolithic export > ~1.5 MiB (maxExportBytes) Condition message with byte counts; kollect_sink_errors_total{reason="payload_too_large"} Shard: multiple KollectInventory per namespace (<~2k rows each) Architecture review for 10k+ row namespaces
Inventory Degraded, SpillRequired Large payload needs object-store spill, none configured Reason SpillRequired; spill_required metric Add KollectSnapshotSink type s3 or gcs to inventory refs
ExportShardWarning=True ≥ ~1,800 rows in one namespace aggregate Condition + increase(kollect_export_shard_warn_total[1h]) Split inventories before hard cap See scaling and fleet
kollect_export_spill_warn_total increasing Payload ≥ 1 MiB warn threshold Metric + log export payload exceeds spill warn threshold Shard or tune spec.maxExportBytes (within global cap)

Export — sink backends

Family CRDs: KollectSnapshotSink (git/gitlab/s3/gcs), KollectDatabaseSink (postgres), KollectEventSink (kafka/nats). Inventory refs must use the matching family field (snapshotSinkRefs, etc.) in the same namespace.

Symptom Likely causes Identify Handle Escalate
Git push failures (NFF) Remote ahead of operator; concurrent writers Logs git push; terminal or transient on sink errors pushPolicy: Commit retries merge+push; ensure single writer per branch/path Protected branch / hook rejects — platform Git admin
Git auth terminal Bad token, expired credential, 401/403 Sink ConnectionVerified=False; ConnectionTestFailed event Rotate secretRef; re-annotate kollect.dev/test-connection=true IdP / Git provider outage
Postgres connection failures DSN, network policy, TLS, pool timeout Database sink conditions; transient sink errors; connection test metric Verify Secret keys, egress NetworkPolicy, server reachable DBA for server-side limits
S3/GCS 403 IAM, wrong bucket, signature Export logs; terminal/forbidden Fix credentials and bucket policy Cloud IAM review
sink circuit breaker open in logs 5 consecutive transient failures per sink key transient errors then silence ~30s Fix backend; breaker self-closes after timeout Backend SLA breach
SinkReachable=False, SinkNotFound Wrong sink name or cross-namespace ref Inventory message; kubectl get kollect*sink -n <inv-ns> Fix ref name; create sink in inventory namespace
SinkReachable=False, SinkUnreachable Backend down despite CR present Sink ConnectionVerified; probe annotation Fix network/credentials first

Export — debounce (not a failure)

Symptom Likely causes Identify Handle Escalate
Ready=True, reason PartiallySynced Per-sink exportMinInterval; unchanged payload checksum status.sinkExports[].conditions reason Debounced; no kollect_sink_errors_total spike Expected — wait for interval; tighten interval only if SLA requires fresher data Mistaking debounce for outage
Synced=False, reason PartiallySynced, all sinks debounced All sinks within cadence window All per-sink Debounced; kollect_export_debounced_total up Normal for dual-cadence (e.g. Postgres 30s + Git 1h)
Stale data in Postgres but Synced=True on Git ref Different intervals per ref Compare lastExportTime per sinkExports entry Set ref-level exportMinInterval intentionally

Connection test

Symptom Likely causes Identify Handle Escalate
Sink ConnectionVerified=False, ConnectionTestFailed TLS verify fail, bad Secret, wrong endpoint kubectl describe sink; kollect_sink_connection_test_total{result="failure"} Fix secretRef, CA bundle, URL; one-shot: annotation kollect.dev/test-connection=true Corporate TLS inspection
KollectConnectionTest stuck false One-shot CR probe failed kubectl describe kollectconnectiontest Same as sink probe; check spec.sinkRef family field
TLSInsecure=True on sink Explicit insecure TLS (non-default) Condition on sink Prefer proper CA; document exception per security policy Security review

Production sinks should keep spec.connectionTest: false and use annotation for ad-hoc probes (ADR-0403).

Reconcile and workqueue

Symptom Likely causes Identify Handle Escalate
Slow inventory updates, no Degraded Optimistic-lock conflicts, high churn kollect_workqueue_depth sustained; conflict requeues in logs Raise maxConcurrentReconciles; increase exportMinInterval; shard inventories etcd/apiserver slow
Event ReconcilePanic Unexpected panic (should not crash pod) Event reason; log reconcile panic recovered Upgrade to fixed release; file bug with stack from logs Repeat panics on same controller
kollect_collect_dispatch_sync_fallback_total rising Dispatch queue saturated Metric + dispatch queue depth Increase collect.dispatchWorkers / dispatchQueueSize CPU throttle on controller
Status update lag Many inventories, frequent export Reconcile duration p95; etcd metrics Debounce, sharding, fewer sinks per inventory Load test runbook

Webhook vs runtime validation

Symptom Likely causes Identify Handle Escalate
kubectl apply rejected CEL validation on CRD, validating webhook Admission error message (no object created) Fix spec before create
CR accepted but Degraded at runtime GVK/CRD absent on cluster, scope enforced only at reconcile, SAR not checked at admission Compare admission vs kubectl describe conditions Install CRDs; fix runtime-only constraints Gap between webhook and runtime — upstream issue
Scope ceiling on cluster targets KollectClusterScope webhook deny Forbidden on apply Adjust cluster target GVKs to allowed set

Multi-sink partial success

Symptom Likely causes Identify Handle Escalate
Ready=True, PartiallySynced; Postgres OK, Git failed Independent per-sink export status.sinkExports[] — mixed Exported / ExportFailed Fix failing sink only; successful sinks stay current
Synced=False, PartiallySynced; some failed One backend terminal while others OK Failed count in condition message Terminal sink needs spec/cred fix; others self-heal
Aggregate Synced=False, all per-sink failed Shared payload gate (spill) before export Inventory-level Degraded + spill reasons Fix size/sharding first

Resources (brief)

Symptom Likely causes Identify Handle Escalate
OOMKilled controller Large collect store, cluster-wide informers Pod status; kollect_collect_items_total; RSS resourcesProfile: large; namespace-scope targets; shard inventories Load test runbook
CPU throttle High dispatch/reconcile load pprof; kollect_reconcile_duration_seconds p95 Raise limits; tune workers; reduce churn
etcd / API slow Status write rate, large fleets Apiserver metrics; many inventories Longer export intervals; fewer status transitions Platform cluster health

4. PromQL cheat sheet

Tie queries to symptoms (adjust namespace/job labels for your scrape config):

# Sustained reconcile failures by class
sum(rate(kollect_reconcile_errors_total[5m])) by (error_class)

# Inventory export errors only
sum(rate(kollect_reconcile_errors_total{kind="KollectInventory"}[5m])) by (error_class)

# Export failure reasons (auth, size, transient, …)
sum(increase(kollect_sink_errors_total[15m])) by (reason)

# Slow exports — Git vs Postgres vs event
histogram_quantile(0.95, sum(rate(kollect_export_duration_seconds_bucket[5m])) by (le, sink_type))

# Debounce (expected) vs failure — debounce should NOT correlate with sink_errors
sum(rate(kollect_export_debounced_total[5m])) by (controller)

# Workqueue backlog — conflict storms or under-provisioned workers
max_over_time(kollect_workqueue_depth[10m])

# Collect store growth — OOM/sharding signal
kollect_collect_items_total

# Approaching export shard cap
increase(kollect_export_shard_warn_total[1h])

Default alert rules: metrics.md — Default alerts.


5. Log patterns

Structured controller logs (logr). Grep operator pod logs (namespace typically kollect-system):

Key / message fragment Indicates
error_class transient / terminal / forbidden on wrapped errors
reason Spill gate, export failure, scope denial (stable enum)
inventory, target Which CR pipeline
sink Backend key during export
access check failed SAR API error → target AccessCheckFailed
extract attributes CEL/JSONPath failure on a resource
export failed Sink export path
export payload exceeds spill warn threshold Approaching 1 MiB — shard soon
debounced Export skipped by interval/coalesce
dispatch sync fallback Dispatch queue full — performance tuning
sink circuit breaker open Repeated transient sink failures
reconcile panic recovered Panic converted to requeue (EC-P2-01)
git push / git auth failed Snapshot sink transport

Never expect secrets, tokens, or full payloads in logs.

kubectl logs -n kollect-system deploy/kollect-controller-manager --tail=500 \
  | rg 'export failed|PayloadTooLarge|debounced|access check failed|circuit breaker'

6. See also