Alert Reference
The bundled PrometheusRule includes alerts organized by component. Alerts that
reference configurable thresholds can be tuned through monitoring.alerts.*
Helm values (see
Monitoring Parameters
in the Helm values reference).
Every alert includes a runbook_url annotation that links directly to the
relevant section below. When an alert fires, the runbook link is available in
Alertmanager, PagerDuty, Slack, and other notification integrations.
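For orientation, the sketch below shows where the runbook_url annotation sits on a rule in a standard PrometheusRule resource. It is illustrative only: the alert name, expression, threshold, and URL are placeholders, not the content of the bundled rule.

```yaml
# Illustrative only: shows where the runbook_url annotation lives on a rule.
# The alert name, expression, duration, and URL below are placeholders, not
# the exact content shipped in the bundled PrometheusRule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-akuity-alerts
spec:
  groups:
    - name: example.rules
      rules:
        - alert: ExampleAkuityAlert
          expr: vector(1)          # placeholder expression
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Example alert"
            runbook_url: "https://example.com/runbooks#example-akuity-alert"
```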
Platform Controller - Hardware
These alerts fire when the platform controller process exceeds resource thresholds, which may indicate a memory leak, excessive concurrency, or resource exhaustion.
Each fires at warning severity when the named resource exceeds its
monitoring.alerts.platformController.*Threshold for 1h:
- AkuityPlatformControllerTooManyGoroutines
- AkuityPlatformControllerTooManyThreads
- AkuityPlatformControllerHighMemory
- AkuityPlatformControllerTooManyHeapObjects
- AkuityPlatformControllerTooManyFileDescriptors
The file descriptor alerts use the container_file_descriptors metric from
cAdvisor/kubelet. This metric may not be available on all Kubernetes providers.
If the metric is absent, these alerts simply remain inactive; they will not
cause errors.
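If a threshold needs tuning, it can be overridden in the chart values. The sketch below only illustrates the layout: the specific key names and values under platformController are hypothetical, so confirm the exact *Threshold names and defaults in the Monitoring Parameters reference.

```yaml
# values.yaml (sketch): raising platform controller hardware thresholds.
# The key names and values below are hypothetical; check the Monitoring
# Parameters reference for the real *Threshold keys, units, and defaults.
monitoring:
  alerts:
    platformController:
      goroutinesThreshold: 5000   # hypothetical key and value
      memoryThreshold: 2000000000 # hypothetical key and value (bytes assumed)
```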
Remediation:
- Check platform controller logs for goroutine leaks or unexpected activity
- Review recent configuration changes that may have increased load
- Consider increasing resource limits on the platform controller deployment
- Restart the platform controller pod if values remain elevated after the underlying cause is resolved
Platform Controller - Argo CD Instances
These alerts monitor the health and reconciliation status of managed Argo CD instances.
Critical unless noted:
- AkuityUnhealthyArgoInstances
- AkuityUnreconciledArgoInstances
- AkuityDegradedArgoInstances
- AkuityArgoInstanceEventsSyncFailed
- AkuityUnreconciledArgoClusters
- AkuityDegradedArgoClusters
- AkuityHighDisconnectedArgoClusters (warning) / AkuityHighDisconnectedArgoClustersCritical
- AkuityHighUnhealthyArgoClusters (warning) / AkuityHighUnhealthyArgoClustersCritical
The disconnected and unhealthy ratio thresholds are configurable via
monitoring.alerts.argoCD.disconnectedWarningRatio,
monitoring.alerts.argoCD.disconnectedCriticalRatio,
monitoring.alerts.argoCD.unhealthyWarningRatio, and
monitoring.alerts.argoCD.unhealthyCriticalRatio. Small fleets where a single
disconnected cluster represents a large percentage may want to raise the warning
threshold.
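For example, a three-cluster fleet where a single disconnected cluster already exceeds the default warning ratio could raise it in values. The sketch assumes the ratio keys take fractional values (0-1); the numbers are examples, not recommendations.

```yaml
# values.yaml (sketch): relax the disconnected-cluster warning for a small
# fleet. Assumes the ratio values are fractions (0-1); example numbers only.
monitoring:
  alerts:
    argoCD:
      disconnectedWarningRatio: 0.75   # warn only above 75% disconnected
      disconnectedCriticalRatio: 0.9
```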
Remediation:
- Check the platform controller logs for reconciliation errors
- Verify the database is healthy and accessible (see Database Operations)
- For disconnected clusters, verify network connectivity between the agent and the platform
- For degraded instances, check the Argo CD instance namespace for pod status and events
- Review recent changes to instance configuration
Platform Controller - Kargo Instances
These alerts monitor Kargo instances and agents using the same pattern as the Argo CD alerts above, with two intentional differences:
- Unhealthy timeout is 30m (vs 40m for Argo CD). Kargo's reconciliation loop is faster, so a 30-minute window gives the same margin with less lag.
- Disconnected agent warning threshold is 60% (vs 50% for Argo CD clusters). Kargo agents have a higher expected transient-disconnection rate during warehouse syncs, so the warning fires at a higher ratio to reduce noise.
Critical unless noted:
- AkuityUnhealthyKargoInstances
- AkuityUnreconciledKargoInstances
- AkuityDegradedKargoInstances
- AkuityKargoInstanceEventsSyncFailed
- AkuityUnreconciledKargoAgents
- AkuityDegradedKargoAgents
- AkuityHighDisconnectedKargoAgents (warning) / AkuityHighDisconnectedKargoAgentsCritical
- AkuityHighUnhealthyKargoAgents (warning) / AkuityHighUnhealthyKargoAgentsCritical
The ratio thresholds are configurable via
monitoring.alerts.kargo.disconnectedWarningRatio,
monitoring.alerts.kargo.disconnectedCriticalRatio,
monitoring.alerts.kargo.unhealthyWarningRatio, and
monitoring.alerts.kargo.unhealthyCriticalRatio.
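The Kargo keys follow the same shape as the Argo CD example above; a minimal sketch, again assuming fractional values (the numbers are examples, read against the 60% default warning ratio for disconnected agents):

```yaml
# values.yaml (sketch): Kargo ratio thresholds mirror the Argo CD ones.
# Assumes fractional (0-1) values; example numbers only, not defaults.
monitoring:
  alerts:
    kargo:
      disconnectedWarningRatio: 0.75
      unhealthyWarningRatio: 0.5
```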
Remediation: Same approach as Argo CD instance alerts above. Check platform controller logs, database health, and network connectivity.
Platform Controller - Reconciler
These alerts detect operational issues with the platform controller's internal reconciliation loops.
- AkuityMetricsCollectorRefreshErrors (critical)
- AkuitySlowMetricsCollectorRefresh (warning)
- AkuityPlatformControllerReconcilerErrors (warning)
- AkuityMissingDBMetricsCollectorHeartbeat (critical)
- AkuitySlowReconciliationLoop (warning): configurable at monitoring.alerts.reconciliation.slowReconcileP90Seconds
- AkuitySlowEnqueueLoop (warning): configurable at monitoring.alerts.reconciliation.slowEnqueueP90Seconds
- AkuityMissing*Heartbeat: 12 critical heartbeat alerts, one per reconciliation loop
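The two slow-loop thresholds can be tuned together in values. The sketch below assumes both keys take a number of seconds; the values are illustrative, not defaults.

```yaml
# values.yaml (sketch): relax the reconciliation latency alerts.
# Assumes both keys are in seconds; example values only.
monitoring:
  alerts:
    reconciliation:
      slowReconcileP90Seconds: 60
      slowEnqueueP90Seconds: 30
```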
Remediation:
- Missing heartbeat alerts typically indicate the platform controller is not running or has crashed. Check pod status and logs.
- Slow reconciliation may indicate database performance issues or high load. Check database metrics and connection pool health (see Database Operations).
- Metrics collector errors suggest the platform controller cannot query the database for status metrics. Verify database connectivity.
Portal Server - Hardware
Each fires at warning severity when the named resource exceeds its
monitoring.alerts.portalServer.*Threshold:
- AkuityPortalServerTooManyGoroutines
- AkuityPortalServerTooManyThreads
- AkuityPortalServerHighMemory
- AkuityPortalServerTooManyHeapObjects
- AkuityPortalServerTooManyFileDescriptors
- AkuityPortalServerTooManyTransmittedBytes
Remediation: Same approach as platform controller hardware alerts. The portal server handles API traffic, so elevated values may correlate with high request volume. The transmitted bytes alert detects runaway network egress, which may indicate a streaming loop or unexpectedly large API responses.
Portal Server - Application
- AkuityPortalServerErrors (critical): any HTTP 5xx from the portal API.
- AkuityPortalServerSlowResponses (warning): average response time above monitoring.alerts.portalServer.slowResponseAvgSeconds.
- AkuityPortalServerSlowResponsesP90 (warning): p90 response time above monitoring.alerts.portalServer.slowResponseP90Seconds.
The response time thresholds are configurable via
monitoring.alerts.portalServer.slowResponseAvgSeconds and
monitoring.alerts.portalServer.slowResponseP90Seconds. Self-hosted
installations on slower hardware or with higher-latency database connections may
want to raise these thresholds.
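For instance, an installation with a higher-latency database link might raise both thresholds in values. The sketch assumes both keys take seconds; the numbers are examples, not recommendations.

```yaml
# values.yaml (sketch): raise the portal server response-time thresholds.
# Assumes both keys are in seconds; example values only.
monitoring:
  alerts:
    portalServer:
      slowResponseAvgSeconds: 2
      slowResponseP90Seconds: 5
```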
Remediation:
- Check portal server logs for error details
- Review database query performance
- Check for upstream service degradation (SSO provider, etc.)
- Review recent deployments for regressions
Addon Controller
Only deployed when addonController.enabled: true.
- AkuityAddonControllerErrors (warning)
- AkuitySlowAddonControllerReconciliation (warning): configurable at monitoring.alerts.addonController.slowReconcileP90Seconds
- AkuitySlowAddonControllerEnqueue (warning): configurable at monitoring.alerts.addonController.slowEnqueueP90Seconds
- AkuityMissingAddonController*Heartbeat: 3 critical heartbeat alerts
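A deployment that enables the addon controller and wants looser latency alerts might combine both settings in values. addonController.enabled is the documented switch; the threshold values are illustrative and assumed to be in seconds.

```yaml
# values.yaml (sketch): enable the addon controller and relax its latency
# alerts. Threshold values are examples and assumed to be in seconds.
addonController:
  enabled: true
monitoring:
  alerts:
    addonController:
      slowReconcileP90Seconds: 60
      slowEnqueueP90Seconds: 30
```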
Remediation:
- Check addon controller logs for error details
- Verify the addon controller pod is running and not crash-looping
- Review recent changes to addon definitions or cluster addon configurations
- For heartbeat alerts, restart the addon controller pod if it appears stuck
Notification Controller
Only deployed when notificationController.enabled: true.
- AkuityNotificationControllerErrors (warning)
- AkuitySlowNotificationControllerReconciliation (warning): configurable at monitoring.alerts.notificationController.slowReconcileP90Seconds
- AkuitySlowNotificationControllerEnqueue (warning): configurable at monitoring.alerts.notificationController.slowEnqueueP90Seconds
- AkuityMissingNotificationController*Heartbeat: 3 critical heartbeat alerts
- AkuityHighNotificationDeliveryFailureRate (warning): configurable at monitoring.alerts.notificationController.deliveryFailureRateThreshold
- AkuityHighNotificationPendingRate (warning): configurable at monitoring.alerts.notificationController.pendingRateThreshold
The delivery failure and pending rate thresholds are configurable via
monitoring.alerts.notificationController.deliveryFailureRateThreshold and
monitoring.alerts.notificationController.pendingRateThreshold. Both alerts
exclude the web delivery method and require at least 10 notifications in the
window to avoid noisy alerts on low-volume deployments.
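For example, a deployment that tolerates occasional webhook failures might loosen the failure-rate threshold in values. The sketch assumes the rate thresholds are fractions (0-1) of notifications in the window; the numbers are examples, not defaults.

```yaml
# values.yaml (sketch): loosen the notification delivery alerts.
# Assumes the rate thresholds are fractions (0-1); example values only.
notificationController:
  enabled: true
monitoring:
  alerts:
    notificationController:
      deliveryFailureRateThreshold: 0.2
      pendingRateThreshold: 0.3
```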
Remediation:
- Check notification controller logs for error details
- Verify SMTP configuration if using email delivery
- Review notification target availability (webhooks, Slack, etc.)