
Alert Reference

The bundled PrometheusRule includes alerts organized by component. Alerts that reference configurable thresholds can be tuned through monitoring.alerts.* Helm values (see Monitoring Parameters in the Helm values reference).

Every alert includes a runbook_url annotation that links directly to the relevant section below. When an alert fires, the runbook link is available in Alertmanager, PagerDuty, Slack, and other notification integrations.

Platform Controller - Hardware

These alerts fire when the platform controller process exceeds resource thresholds, which may indicate a memory leak, excessive concurrency, or resource exhaustion.

Each fires at warning severity when the named resource exceeds its monitoring.alerts.platformController.*Threshold for 1h:

  • AkuityPlatformControllerTooManyGoroutines
  • AkuityPlatformControllerTooManyThreads
  • AkuityPlatformControllerHighMemory
  • AkuityPlatformControllerTooManyHeapObjects
  • AkuityPlatformControllerTooManyFileDescriptors
Tip: The file descriptor alerts use the container_file_descriptors metric from cAdvisor/kubelet. This metric may not be available on all Kubernetes providers. If the metric is absent, these alerts simply remain inactive; they will not cause errors.
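If the defaults do not fit your environment, the thresholds can be overridden in Helm values. A minimal sketch; the key names shown here (goroutinesThreshold, memoryBytesThreshold) are hypothetical examples of the monitoring.alerts.platformController.*Threshold pattern, and the values are illustrative, not defaults:

```yaml
monitoring:
  alerts:
    platformController:
      # Hypothetical key names following the documented *Threshold pattern;
      # confirm the exact names in the Helm values reference.
      goroutinesThreshold: 10000
      memoryBytesThreshold: 2147483648   # 2 GiB, illustrative
```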

Remediation:

  • Check platform controller logs for goroutine leaks or unexpected activity
  • Review recent configuration changes that may have increased load
  • Consider increasing resource limits on the platform controller deployment
  • Restart the platform controller pod if values remain elevated after the underlying cause is resolved

Platform Controller - Argo CD Instances

These alerts monitor the health and reconciliation status of managed Argo CD instances.

Critical unless noted:

  • AkuityUnhealthyArgoInstances
  • AkuityUnreconciledArgoInstances
  • AkuityDegradedArgoInstances
  • AkuityArgoInstanceEventsSyncFailed
  • AkuityUnreconciledArgoClusters
  • AkuityDegradedArgoClusters
  • AkuityHighDisconnectedArgoClusters (warning) / AkuityHighDisconnectedArgoClustersCritical
  • AkuityHighUnhealthyArgoClusters (warning) / AkuityHighUnhealthyArgoClustersCritical

The disconnected and unhealthy ratio thresholds are configurable via monitoring.alerts.argoCD.disconnectedWarningRatio, monitoring.alerts.argoCD.disconnectedCriticalRatio, monitoring.alerts.argoCD.unhealthyWarningRatio, and monitoring.alerts.argoCD.unhealthyCriticalRatio. Small fleets where a single disconnected cluster represents a large percentage may want to raise the warning threshold.
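For example, in a four-cluster fleet where one disconnected cluster is already 25%, the warning ratios can be raised above that point. The keys are the ones listed above; the values are illustrative, not defaults:

```yaml
monitoring:
  alerts:
    argoCD:
      disconnectedWarningRatio: 0.3    # illustrative: warn only above 30% disconnected
      disconnectedCriticalRatio: 0.5
      unhealthyWarningRatio: 0.3
      unhealthyCriticalRatio: 0.5
```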

Remediation:

  • Check the platform controller logs for reconciliation errors
  • Verify the database is healthy and accessible (see Database Operations)
  • For disconnected clusters, verify network connectivity between the agent and the platform
  • For degraded instances, check the Argo CD instance namespace for pod status and events
  • Review recent changes to instance configuration

Platform Controller - Kargo Instances

These alerts monitor Kargo instances and agents using the same pattern as the Argo CD alerts above, with two intentional differences:

  • Unhealthy timeout is 30m (vs 40m for Argo CD). Kargo's reconciliation loop is faster, so a 30-minute window gives the same margin with less lag.
  • Disconnected agent warning threshold is 60% (vs 50% for Argo CD clusters). Kargo agents have a higher expected transient-disconnection rate during warehouse syncs, so the warning fires at a higher ratio to reduce noise.

Critical unless noted:

  • AkuityUnhealthyKargoInstances
  • AkuityUnreconciledKargoInstances
  • AkuityDegradedKargoInstances
  • AkuityKargoInstanceEventsSyncFailed
  • AkuityUnreconciledKargoAgents
  • AkuityDegradedKargoAgents
  • AkuityHighDisconnectedKargoAgents (warning) / AkuityHighDisconnectedKargoAgentsCritical
  • AkuityHighUnhealthyKargoAgents (warning) / AkuityHighUnhealthyKargoAgentsCritical

The ratio thresholds are configurable via monitoring.alerts.kargo.disconnectedWarningRatio, monitoring.alerts.kargo.disconnectedCriticalRatio, monitoring.alerts.kargo.unhealthyWarningRatio, and monitoring.alerts.kargo.unhealthyCriticalRatio.
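As with the Argo CD ratios, these can be overridden in Helm values. The keys are the ones listed above; the values are illustrative, chosen here to sit above the 60% disconnected-warning default described earlier:

```yaml
monitoring:
  alerts:
    kargo:
      disconnectedWarningRatio: 0.7    # illustrative: raise above the 60% default
      disconnectedCriticalRatio: 0.85
      unhealthyWarningRatio: 0.4
      unhealthyCriticalRatio: 0.6
```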

Remediation: Same approach as Argo CD instance alerts above. Check platform controller logs, database health, and network connectivity.

Platform Controller - Reconciler

These alerts detect operational issues with the platform controller's internal reconciliation loops.

  • AkuityMetricsCollectorRefreshErrors (critical)
  • AkuitySlowMetricsCollectorRefresh (warning)
  • AkuityPlatformControllerReconcilerErrors (warning)
  • AkuityMissingDBMetricsCollectorHeartbeat (critical)
  • AkuitySlowReconciliationLoop (warning): configurable at monitoring.alerts.reconciliation.slowReconcileP90Seconds
  • AkuitySlowEnqueueLoop (warning): configurable at monitoring.alerts.reconciliation.slowEnqueueP90Seconds
  • AkuityMissing*Heartbeat: 12 critical heartbeat alerts, one per reconciliation loop
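The two slow-loop thresholds can be raised in Helm values if reconciliation is legitimately slower in your environment (for example, very large fleets). The keys are the ones listed above; the values are illustrative:

```yaml
monitoring:
  alerts:
    reconciliation:
      slowReconcileP90Seconds: 30   # illustrative: warn when p90 reconcile time exceeds 30s
      slowEnqueueP90Seconds: 10
```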

Remediation:

  • Missing heartbeat alerts typically indicate the platform controller is not running or has crashed. Check pod status and logs.
  • Slow reconciliation may indicate database performance issues or high load. Check database metrics and connection pool health (see Database Operations).
  • Metrics collector errors suggest the platform controller cannot query the database for status metrics. Verify database connectivity.

Portal Server - Hardware

Each fires at warning severity when the named resource exceeds its monitoring.alerts.portalServer.*Threshold:

  • AkuityPortalServerTooManyGoroutines
  • AkuityPortalServerTooManyThreads
  • AkuityPortalServerHighMemory
  • AkuityPortalServerTooManyHeapObjects
  • AkuityPortalServerTooManyFileDescriptors
  • AkuityPortalServerTooManyTransmittedBytes

Remediation: Same approach as platform controller hardware alerts. The portal server handles API traffic, so elevated values may correlate with high request volume. The transmitted bytes alert detects runaway network egress, which may indicate a streaming loop or unexpectedly large API responses.

Portal Server - Application

  • AkuityPortalServerErrors (critical): any HTTP 5xx from the portal API.
  • AkuityPortalServerSlowResponses (warning): average response time above monitoring.alerts.portalServer.slowResponseAvgSeconds.
  • AkuityPortalServerSlowResponsesP90 (warning): p90 response time above monitoring.alerts.portalServer.slowResponseP90Seconds.

Self-hosted installations on slower hardware or with higher-latency database connections may want to raise these thresholds.
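For example, to loosen both response-time thresholds in Helm values (values illustrative, not defaults):

```yaml
monitoring:
  alerts:
    portalServer:
      slowResponseAvgSeconds: 2    # illustrative: warn when the average exceeds 2s
      slowResponseP90Seconds: 5
```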

Remediation:

  • Check portal server logs for error details
  • Review database query performance
  • Check for upstream service degradation (SSO provider, etc.)
  • Review recent deployments for regressions

Addon Controller

Only deployed when addonController.enabled: true.

  • AkuityAddonControllerErrors (warning)
  • AkuitySlowAddonControllerReconciliation (warning): configurable at monitoring.alerts.addonController.slowReconcileP90Seconds
  • AkuitySlowAddonControllerEnqueue (warning): configurable at monitoring.alerts.addonController.slowEnqueueP90Seconds
  • AkuityMissingAddonController*Heartbeat: 3 critical heartbeat alerts
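Assuming the addon controller is enabled, the two slow-path thresholds can be tuned alongside it. The keys are the ones listed above; the values are illustrative:

```yaml
addonController:
  enabled: true
monitoring:
  alerts:
    addonController:
      slowReconcileP90Seconds: 30   # illustrative
      slowEnqueueP90Seconds: 10
```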

Remediation:

  • Check addon controller logs for error details
  • Verify the addon controller pod is running and not crash-looping
  • Review recent changes to addon definitions or cluster addon configurations
  • For heartbeat alerts, restart the addon controller pod if it appears stuck

Notification Controller

Only deployed when notificationController.enabled: true.

  • AkuityNotificationControllerErrors (warning)
  • AkuitySlowNotificationControllerReconciliation (warning): configurable at monitoring.alerts.notificationController.slowReconcileP90Seconds
  • AkuitySlowNotificationControllerEnqueue (warning): configurable at monitoring.alerts.notificationController.slowEnqueueP90Seconds
  • AkuityMissingNotificationController*Heartbeat: 3 critical heartbeat alerts
  • AkuityHighNotificationDeliveryFailureRate (warning): configurable at monitoring.alerts.notificationController.deliveryFailureRateThreshold
  • AkuityHighNotificationPendingRate (warning): configurable at monitoring.alerts.notificationController.pendingRateThreshold

Both alerts exclude the web delivery method and require at least 10 notifications in the window to avoid noisy alerts on low-volume deployments.
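A sketch of overriding both rates in Helm values; the keys are the ones listed above, while the values, and the assumption that they are expressed as 0-1 ratios, are illustrative:

```yaml
monitoring:
  alerts:
    notificationController:
      deliveryFailureRateThreshold: 0.1   # illustrative; assumes a 0-1 ratio
      pendingRateThreshold: 0.2
```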

Remediation:

  • Check notification controller logs for error details
  • Verify SMTP configuration if using email delivery
  • Review notification target availability (webhooks, Slack, etc.)