# Operations

## Alertmanager Routing
All bundled alerts use names prefixed with Akuity, making it straightforward
to route them in Alertmanager. Below are example configurations for common
notification targets.
### Slack
```yaml
# alertmanager.yaml
route:
  routes:
    - matchers:
        - alertname=~"Akuity.*"
      receiver: akuity-slack
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h

receivers:
  - name: akuity-slack
    slack_configs:
      - channel: '#akuity-platform-alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          Runbook: {{ .Annotations.runbook_url }}
          {{ end }}
```
### PagerDuty
```yaml
route:
  routes:
    - matchers:
        - alertname=~"Akuity.*"
        - severity=~"critical|warning"
      receiver: akuity-pagerduty

receivers:
  - name: akuity-pagerduty
    pagerduty_configs:
      - routing_key: '<your-pagerduty-integration-key>'
        description: '{{ .GroupLabels.alertname }}: {{ (index .Alerts 0).Annotations.summary }}'
        links:
          - href: '{{ (index .Alerts 0).Annotations.runbook_url }}'
            text: 'Runbook'
```
### Severity-based routing

To page on critical alerts and send warnings to a lower-priority channel, reusing the receivers defined above:
```yaml
route:
  routes:
    - matchers:
        - alertname=~"Akuity.*"
        - severity="critical"
      receiver: akuity-pagerduty
    - matchers:
        - alertname=~"Akuity.*"
        - severity="warning"
      receiver: akuity-slack
```
## Tuning Alert Thresholds
The bundled alert thresholds are designed for typical production deployments. Depending on your scale and workload patterns, you may want to adjust them.
### When to tune
- Small deployments (fewer than 10 Argo CD instances): Hardware thresholds like goroutine count and memory usage may be set higher than your platform controller will ever reach under normal conditions. This is fine: the alerts simply won't fire. No action is needed unless you want tighter bounds for your environment.
- Small fleets (fewer than ~10 clusters or agents): The ratio-based disconnected/unhealthy alerts (`monitoring.alerts.argoCD.*`, `monitoring.alerts.kargo.*`) can be noisy when a single disconnected cluster represents a large fraction of the fleet. Consider raising the warning ratio.
- Large deployments (hundreds of instances or clusters): Reconciliation loop thresholds (`slowReconcileP90Seconds`, `slowEnqueueP90Seconds`) may need to be relaxed, as larger instance counts naturally increase reconciliation time.
- Resource-constrained environments: If you run the platform controller or portal server with tight memory limits, consider lowering the memory threshold to alert before OOM kills occur.
- Higher-latency infrastructure: If the portal server connects to a remote database or runs on slower hardware, consider raising `monitoring.alerts.portalServer.slowResponseAvgSeconds` and `slowResponseP90Seconds`.
### How to tune

Override any threshold through the `monitoring.alerts.*` values. See the Monitoring Parameters section of the Helm values reference for the full list, defaults, and descriptions.
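For example, a minimal sketch raising the portal server response-time thresholds mentioned above (the numeric values are hypothetical; check the values reference for the actual defaults):

```yaml
monitoring:
  alerts:
    portalServer:
      slowResponseAvgSeconds: 2   # hypothetical value; raise for higher-latency infrastructure
      slowResponseP90Seconds: 5   # hypothetical value
```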
After adjusting thresholds, verify the updated values in the Prometheus UI under Status > Rules. Search for the alert name to confirm the new threshold is active.
## Disabling Monitoring

Monitoring resources can be safely removed by setting `monitoring.enabled` back to `false` and re-applying the chart:

```yaml
monitoring:
  enabled: false
```
This cleanly removes all monitoring resources (ServiceMonitors, PrometheusRule, Grafana dashboard ConfigMap, and Grafana datasource Secret) regardless of which namespace they were deployed to. The platform components themselves are unaffected: disabling monitoring does not restart or modify any running pods.
Prometheus will stop scraping the targets once the ServiceMonitors are deleted, and Grafana's sidecar will remove the dashboard and datasource on its next reconciliation cycle.
You can also selectively disable individual monitoring sub-resources without turning off monitoring entirely. For example, to keep Prometheus scraping and alerts but remove the Grafana integration:
```yaml
monitoring:
  enabled: true
  grafanaDashboard:
    enabled: false
  grafanaDatasource:
    enabled: false
```
## Grafana Datasource and External Secrets

The auto-provisioned Grafana datasource reads database credentials from the same `database.*` Helm values that the platform components use (`database.host`, `database.user`, `database.password`, etc.). For most deployments this works out of the box with no additional configuration, because the platform already requires these values to connect to PostgreSQL.
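For reference, these are the values in question (the hostname and names below are placeholders, not defaults):

```yaml
# Illustrative only; these are the same values the platform components consume.
database:
  host: postgres.example.internal   # hypothetical hostname
  user: akuity_portal               # hypothetical role
  password: example-password        # or sourced externally; see below
```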
### External secrets management

If you manage database credentials through an external secrets solution (e.g., External Secrets Operator, OpenBao, or cloud-native IAM authentication), the `database.password` value may not be set in your Helm values. In this case, disable the auto-provisioned datasource and manage it through your existing secrets workflow:
```yaml
monitoring:
  enabled: true
  grafanaDatasource:
    enabled: false
```
Then provision an equivalent PostgreSQL datasource in Grafana with matching credentials. To ensure the bundled dashboard panels work, either:

- Set the datasource UID to match `<release>-portal-db` (e.g., `akuity-platform-portal-db`), or
- Update the `DS_PORTAL_DB` template variable in the dashboard to point at your datasource after import (see the sketch below)
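A minimal provisioning sketch for such a datasource, assuming Grafana's file-based datasource provisioning (all names, hosts, and credentials below are placeholders):

```yaml
# provisioning/datasources/akuity-portal-db.yaml
apiVersion: 1
datasources:
  - name: Akuity Portal DB            # hypothetical display name
    uid: akuity-platform-portal-db    # matches <release>-portal-db for a release named "akuity-platform"
    type: postgres
    url: postgres.example.internal:5432   # hypothetical host
    user: akuity_portal                   # hypothetical role
    jsonData:
      database: akuity_portal             # hypothetical database name
      sslmode: require
    secureJsonData:
      password: $POSTGRES_PASSWORD        # expanded from the environment by your secrets workflow
```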
### Non-default database schema

If your installation uses a custom `database.schemaname` (anything other than `public`), the auto-provisioned Grafana datasource will connect successfully, but dashboard SQL panels that reference unqualified table names (e.g. `organization`, `argo_cd_instance`) may fail, because the Grafana PostgreSQL plugin does not support setting `search_path` through its provisioning configuration.
To work around this, set `search_path` on the PostgreSQL role that Grafana uses to connect:

```sql
ALTER ROLE <database.user> SET search_path TO '<database.schemaname>', 'public';
```
This takes effect on every new connection from that role and requires no Grafana configuration changes.
Alternatively, disable the auto-provisioned datasource and manage your own with schema-qualified table names or a role-level `search_path`:

```yaml
monitoring:
  enabled: true
  grafanaDatasource:
    enabled: false
```
### IAM-based database authentication
For databases using IAM authentication (e.g., AWS RDS IAM, GCP Cloud SQL IAM), the platform components typically receive credentials through pod-level service account bindings rather than a static password in Helm values. The auto-provisioned Grafana datasource does not support this pattern. Disable it and configure a Grafana datasource that uses your IAM-compatible authentication method instead.
## Custom Alerts

You can add your own alerting or recording rules alongside the bundled ones using `monitoring.prometheusRules.additionalRules`:
```yaml
monitoring:
  enabled: true
  prometheusRules:
    additionalRules:
      - alert: AkuityPodRestartLoop
        expr: |
          increase(kube_pod_container_status_restarts_total{namespace="akuity"}[1h]) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting frequently"
          description: >-
            Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has
            restarted {{ $value }} times in the last hour.
```
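The same list accepts recording rules as well. A brief sketch (the rule name and expression are illustrative, not part of the bundled rules):

```yaml
monitoring:
  prometheusRules:
    additionalRules:
      - record: akuity:container_restarts:increase1h   # hypothetical rule name
        expr: |
          sum by (namespace, pod) (
            increase(kube_pod_container_status_restarts_total{namespace="akuity"}[1h])
          )
```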
## Further Reading

- values.yaml: full list of configurable `monitoring.*` parameters (see the `monitoring` section)
- Database Operations: database-specific monitoring and Kine metrics