
Alert Reference

The bundled PrometheusRule includes alerts organized by component. Alerts that reference configurable thresholds can be tuned through monitoring.alerts.* Helm values (see Monitoring Parameters in the Helm values reference).

Every alert includes a runbook_url annotation that links directly to the relevant section below. When an alert fires, the runbook link is available in Alertmanager, PagerDuty, Slack, and other notification integrations.

Platform Controller - Hardware

These alerts fire when the platform controller process exceeds resource thresholds, which may indicate a memory leak, excessive concurrency, or resource exhaustion.

Each fires at warning severity when the named resource exceeds its monitoring.alerts.platformController.*Threshold for 1h:

  • AkuityPlatformControllerTooManyGoroutines
  • AkuityPlatformControllerTooManyThreads
  • AkuityPlatformControllerHighMemory
  • AkuityPlatformControllerTooManyHeapObjects
  • AkuityPlatformControllerTooManyFileDescriptors
Tip: The file descriptor alerts use the container_file_descriptors metric from cAdvisor/kubelet. This metric may not be available on all Kubernetes providers. If the metric is absent, these alerts simply remain inactive; they will not cause errors.
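If the defaults do not fit your environment, the thresholds can be overridden in Helm values. A minimal sketch; the key names shown here (goroutinesThreshold, memoryBytesThreshold) are hypothetical examples of the monitoring.alerts.platformController.*Threshold pattern, and the values are illustrative, not defaults:

```yaml
monitoring:
  alerts:
    platformController:
      # Hypothetical key names following the documented *Threshold pattern;
      # confirm the exact names in the Helm values reference.
      goroutinesThreshold: 10000
      memoryBytesThreshold: 2147483648   # 2 GiB, illustrative
```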

Remediation:

  • Check platform controller logs for goroutine leaks or unexpected activity
  • Review recent configuration changes that may have increased load
  • Consider increasing resource limits on the platform controller deployment
  • Restart the platform controller pod if values remain elevated after the underlying cause is resolved

Platform Controller - Argo CD Instances

These alerts monitor the health and reconciliation status of managed Argo CD instances.

Critical unless noted:

  • AkuityUnhealthyArgoInstances
  • AkuityUnreconciledArgoInstances
  • AkuityDegradedArgoInstances
  • AkuityArgoInstanceEventsSyncFailed
  • AkuityUnreconciledArgoClusters
  • AkuityDegradedArgoClusters
  • AkuityHighDisconnectedArgoClusters (warning) / AkuityHighDisconnectedArgoClustersCritical
  • AkuityHighUnhealthyArgoClusters (warning) / AkuityHighUnhealthyArgoClustersCritical

The disconnected and unhealthy ratio thresholds are configurable via monitoring.alerts.argoCD.disconnectedWarningRatio, monitoring.alerts.argoCD.disconnectedCriticalRatio, monitoring.alerts.argoCD.unhealthyWarningRatio, and monitoring.alerts.argoCD.unhealthyCriticalRatio. Small fleets where a single disconnected cluster represents a large percentage may want to raise the warning threshold.
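For example, in a four-cluster fleet where one disconnected cluster is already 25%, the warning ratios can be raised above that point. The keys are the ones listed above; the values are illustrative, not defaults:

```yaml
monitoring:
  alerts:
    argoCD:
      disconnectedWarningRatio: 0.3    # illustrative: warn only above 30% disconnected
      disconnectedCriticalRatio: 0.5
      unhealthyWarningRatio: 0.3
      unhealthyCriticalRatio: 0.5
```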

Remediation:

  • Check the platform controller logs for reconciliation errors
  • Verify the database is healthy and accessible (see Database Operations)
  • For disconnected clusters, verify network connectivity between the agent and the platform
  • For degraded instances, check the Argo CD instance namespace for pod status and events
  • Review recent changes to instance configuration

Platform Controller - Kargo Instances

These alerts monitor Kargo instances and agents using the same pattern as the Argo CD alerts above, with two intentional differences:

  • Unhealthy timeout is 30m (vs 40m for Argo CD). Kargo's reconciliation loop is faster, so a 30-minute window gives the same margin with less lag.
  • Disconnected agent warning threshold is 60% (vs 50% for Argo CD clusters). Kargo agents have a higher expected transient-disconnection rate during warehouse syncs, so the warning fires at a higher ratio to reduce noise.

Critical unless noted:

  • AkuityUnhealthyKargoInstances
  • AkuityUnreconciledKargoInstances
  • AkuityDegradedKargoInstances
  • AkuityKargoInstanceEventsSyncFailed
  • AkuityUnreconciledKargoAgents
  • AkuityDegradedKargoAgents
  • AkuityHighDisconnectedKargoAgents (warning) / AkuityHighDisconnectedKargoAgentsCritical
  • AkuityHighUnhealthyKargoAgents (warning) / AkuityHighUnhealthyKargoAgentsCritical

The ratio thresholds are configurable via monitoring.alerts.kargo.disconnectedWarningRatio, monitoring.alerts.kargo.disconnectedCriticalRatio, monitoring.alerts.kargo.unhealthyWarningRatio, and monitoring.alerts.kargo.unhealthyCriticalRatio.
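As with the Argo CD ratios, these can be overridden in Helm values. The keys are the ones listed above; the values are illustrative, chosen here to sit above the 60% disconnected-warning default described earlier:

```yaml
monitoring:
  alerts:
    kargo:
      disconnectedWarningRatio: 0.7    # illustrative: raise above the 60% default
      disconnectedCriticalRatio: 0.85
      unhealthyWarningRatio: 0.4
      unhealthyCriticalRatio: 0.6
```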

Remediation: Same approach as Argo CD instance alerts above. Check platform controller logs, database health, and network connectivity.

Platform Controller - Reconciler

These alerts detect operational issues with the platform controller's internal reconciliation loops.

  • AkuityMetricsCollectorRefreshErrors (critical)
  • AkuitySlowMetricsCollectorRefresh (warning)
  • AkuityPlatformControllerReconcilerErrors (warning)
  • AkuityMissingDBMetricsCollectorHeartbeat (critical)
  • AkuitySlowReconciliationLoop (warning): configurable at monitoring.alerts.reconciliation.slowReconcileP90Seconds
  • AkuitySlowEnqueueLoop (warning): configurable at monitoring.alerts.reconciliation.slowEnqueueP90Seconds
  • AkuityMissing*Heartbeat: 12 critical heartbeat alerts, one per reconciliation loop
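The two slow-loop thresholds can be raised in Helm values if reconciliation is legitimately slower in your environment (for example, very large fleets). The keys are the ones listed above; the values are illustrative:

```yaml
monitoring:
  alerts:
    reconciliation:
      slowReconcileP90Seconds: 30   # illustrative: warn when p90 reconcile time exceeds 30s
      slowEnqueueP90Seconds: 10
```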

Remediation:

  • Missing heartbeat alerts typically indicate the platform controller is not running or has crashed. Check pod status and logs.
  • Slow reconciliation may indicate database performance issues or high load. Check database metrics and connection pool health (see Database Operations).
  • Metrics collector errors suggest the platform controller cannot query the database for status metrics. Verify database connectivity.

Portal Server - Hardware

Each fires at warning severity when the named resource exceeds its monitoring.alerts.portalServer.*Threshold:

  • AkuityPortalServerTooManyGoroutines
  • AkuityPortalServerTooManyThreads
  • AkuityPortalServerHighMemory
  • AkuityPortalServerTooManyHeapObjects
  • AkuityPortalServerTooManyFileDescriptors
  • AkuityPortalServerTooManyTransmittedBytes

Remediation: Same approach as platform controller hardware alerts. The portal server handles API traffic, so elevated values may correlate with high request volume. The transmitted bytes alert detects runaway network egress, which may indicate a streaming loop or unexpectedly large API responses.

Portal Server - Application

  • AkuityPortalServerErrors (critical): any HTTP 5xx from the portal API.
  • AkuityPortalServerSlowResponses (warning): average response time above monitoring.alerts.portalServer.slowResponseAvgSeconds.
  • AkuityPortalServerSlowResponsesP90 (warning): p90 response time above monitoring.alerts.portalServer.slowResponseP90Seconds.

Self-hosted installations on slower hardware or with higher-latency database connections may want to raise these thresholds.
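For example, to loosen both response-time thresholds in Helm values (values illustrative, not defaults):

```yaml
monitoring:
  alerts:
    portalServer:
      slowResponseAvgSeconds: 2    # illustrative: warn when the average exceeds 2s
      slowResponseP90Seconds: 5
```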

Remediation:

  • Check portal server logs for error details
  • Review database query performance
  • Check for upstream service degradation (SSO provider, etc.)
  • Review recent deployments for regressions

Addon Controller

Only deployed when addonController.enabled: true.

  • AkuityAddonControllerErrors (warning)
  • AkuitySlowAddonControllerReconciliation (warning): configurable at monitoring.alerts.addonController.slowReconcileP90Seconds
  • AkuitySlowAddonControllerEnqueue (warning): configurable at monitoring.alerts.addonController.slowEnqueueP90Seconds
  • AkuityMissingAddonController*Heartbeat: 3 critical heartbeat alerts
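Assuming the addon controller is enabled, the two slow-path thresholds can be tuned alongside it. The keys are the ones listed above; the values are illustrative:

```yaml
addonController:
  enabled: true
monitoring:
  alerts:
    addonController:
      slowReconcileP90Seconds: 30   # illustrative
      slowEnqueueP90Seconds: 10
```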

Remediation:

  • Check addon controller logs for error details
  • Verify the addon controller pod is running and not crash-looping
  • Review recent changes to addon definitions or cluster addon configurations
  • For heartbeat alerts, restart the addon controller pod if it appears stuck

Notification Controller

Only deployed when notificationController.enabled: true.

  • AkuityNotificationControllerErrors (warning)
  • AkuitySlowNotificationControllerReconciliation (warning): configurable at monitoring.alerts.notificationController.slowReconcileP90Seconds
  • AkuitySlowNotificationControllerEnqueue (warning): configurable at monitoring.alerts.notificationController.slowEnqueueP90Seconds
  • AkuityMissingNotificationController*Heartbeat: 3 critical heartbeat alerts
  • AkuityHighNotificationDeliveryFailureRate (warning): configurable at monitoring.alerts.notificationController.deliveryFailureRateThreshold
  • AkuityHighNotificationPendingRate (warning): configurable at monitoring.alerts.notificationController.pendingRateThreshold

Both alerts exclude the web delivery method and require at least 10 notifications in the window to avoid noisy alerts on low-volume deployments.
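A sketch of overriding both rates in Helm values; the keys are the ones listed above, while the values, and the assumption that they are expressed as 0-1 ratios, are illustrative:

```yaml
monitoring:
  alerts:
    notificationController:
      deliveryFailureRateThreshold: 0.1   # illustrative; assumes a 0-1 ratio
      pendingRateThreshold: 0.2
```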

Remediation:

  • Check notification controller logs for error details
  • Verify SMTP configuration if using email delivery
  • Review notification target availability (webhooks, Slack, etc.)