Operations

Alertmanager Routing

All bundled alerts use names prefixed with Akuity, making it straightforward to route them in Alertmanager. Below are example configurations for common notification targets.

Slack

# alertmanager.yaml
route:
  routes:
    - matchers:
        - alertname=~"Akuity.*"
      receiver: akuity-slack
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h

receivers:
  - name: akuity-slack
    slack_configs:
      - channel: '#akuity-platform-alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          Runbook: {{ .Annotations.runbook_url }}
          {{ end }}

PagerDuty

route:
  routes:
    - matchers:
        - alertname=~"Akuity.*"
        - severity=~"critical|warning"
      receiver: akuity-pagerduty

receivers:
  - name: akuity-pagerduty
    pagerduty_configs:
      - routing_key: '<your-pagerduty-integration-key>'
        description: '{{ .GroupLabels.alertname }}: {{ (index .Alerts 0).Annotations.summary }}'
        links:
          - href: '{{ (index .Alerts 0).Annotations.runbook_url }}'
            text: 'Runbook'

Severity-based routing

To page on critical alerts and send warnings to a lower-priority channel:

route:
  routes:
    - matchers:
        - alertname=~"Akuity.*"
        - severity="critical"
      receiver: akuity-pagerduty
    - matchers:
        - alertname=~"Akuity.*"
        - severity="warning"
      receiver: akuity-slack

Tuning Alert Thresholds

The bundled alert thresholds are designed for typical production deployments. Depending on your scale and workload patterns, you may want to adjust them.

When to tune

  • Small deployments (fewer than 10 Argo CD instances): Resource thresholds such as goroutine count and memory usage may be set well above anything your platform controller reaches under normal conditions. This is fine: the alerts simply won't fire. No action is needed unless you want tighter bounds for your environment.
  • Small fleets (fewer than ~10 clusters or agents): The ratio-based disconnected/unhealthy alerts (monitoring.alerts.argoCD.*, monitoring.alerts.kargo.*) can be noisy when a single disconnected cluster represents a large fraction of the fleet. Consider raising the warning ratio.
  • Large deployments (hundreds of instances or clusters): Reconciliation loop thresholds (slowReconcileP90Seconds, slowEnqueueP90Seconds) may need to be relaxed, as larger instance counts naturally increase reconciliation time.
  • Resource-constrained environments: If you run the platform controller or portal server with tight memory limits, consider lowering the memory threshold to alert before OOM kills occur.
  • Higher-latency infrastructure: If the portal server connects to a remote database or runs on slower hardware, consider raising monitoring.alerts.portalServer.slowResponseAvgSeconds and slowResponseP90Seconds.

How to tune

Override any threshold through the monitoring.alerts.* values. See the Monitoring Parameters section of the Helm values reference for the full list, defaults, and descriptions.
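For example, to relax the portal server latency thresholds on higher-latency infrastructure (the numbers below are illustrative, not recommendations; confirm the exact key paths and defaults in the Monitoring Parameters section of the values reference):

```yaml
# values override -- illustrative numbers only
monitoring:
  alerts:
    portalServer:
      slowResponseAvgSeconds: 2
      slowResponseP90Seconds: 5
```

Apply the override with a normal helm upgrade; only the generated PrometheusRule changes, so running pods are unaffected.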

tip

After adjusting thresholds, verify the updated values in the Prometheus UI under Status > Rules. Search for the alert name to confirm the new threshold is active.

Disabling Monitoring

Monitoring resources can be safely removed by setting monitoring.enabled back to false and re-applying the chart:

monitoring:
  enabled: false

This cleanly removes all monitoring resources (ServiceMonitors, PrometheusRule, Grafana dashboard ConfigMap, and Grafana datasource Secret) regardless of which namespace they were deployed to. The platform components themselves are unaffected: disabling monitoring does not restart or modify any running pods.

Prometheus will stop scraping the targets once the ServiceMonitors are deleted, and Grafana's sidecar will remove the dashboard and datasource on its next reconciliation cycle.

You can also selectively disable individual monitoring sub-resources without turning off monitoring entirely. For example, to keep Prometheus scraping and alerts but remove the Grafana integration:

monitoring:
  enabled: true
  grafanaDashboard:
    enabled: false
  grafanaDatasource:
    enabled: false

Grafana Datasource and External Secrets

The auto-provisioned Grafana datasource reads database credentials from the same database.* Helm values that the platform components use (database.host, database.user, database.password, etc.). For most deployments this works out of the box with no additional configuration because the platform already requires these values to connect to PostgreSQL.
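As a sketch, the shared values look like this (database.host, database.user, and database.password are the keys named above; the values shown are placeholders, and your deployment already sets real ones):

```yaml
# Already required by the platform components; the Grafana
# datasource provisioning reuses these same keys.
database:
  host: postgres.example.internal  # placeholder
  user: akuity_platform            # placeholder
  password: change-me              # placeholder; often sourced externally instead
```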

External secrets management

If you manage database credentials through an external secrets solution (e.g., External Secrets Operator, OpenBao, or cloud-native IAM authentication), the database.password value may not be set in your Helm values. In this case, disable the auto-provisioned datasource and manage it through your existing secrets workflow:

monitoring:
  enabled: true
  grafanaDatasource:
    enabled: false

Then provision an equivalent PostgreSQL datasource in Grafana with matching credentials. To ensure the bundled dashboard panels work, either:

  • Set the datasource UID to match <release>-portal-db (e.g., akuity-platform-portal-db), or
  • Update the DS_PORTAL_DB template variable in the dashboard to point at your datasource after import
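A manually managed equivalent might look like the following Grafana datasource provisioning file. Everything here is a placeholder except the type and the UID convention described above; adapt it to however your Grafana instance consumes provisioning config:

```yaml
# Grafana datasource provisioning file (placeholder values)
apiVersion: 1
datasources:
  - name: Akuity Portal DB
    type: postgres
    # Match <release>-portal-db so the bundled dashboard's
    # DS_PORTAL_DB variable resolves without edits after import.
    uid: akuity-platform-portal-db
    url: postgres.example.internal:5432  # placeholder
    user: akuity_platform                # placeholder
    secureJsonData:
      password: $POSTGRES_PASSWORD       # expanded by Grafana at provisioning time
    jsonData:
      database: akuity                   # placeholder
      sslmode: require
```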

Non-default database schema

If your installation uses a custom database.schemaname (anything other than public), the auto-provisioned Grafana datasource will connect successfully, but dashboard SQL panels that reference unqualified table names (e.g. organization, argo_cd_instance) may fail: the Grafana PostgreSQL plugin does not support setting search_path through its provisioning configuration.

To work around this, set search_path on the PostgreSQL role that Grafana uses to connect:

ALTER ROLE <database.user> SET search_path TO '<database.schemaname>', 'public';

This takes effect on every new connection from that role and requires no Grafana configuration changes.

Alternatively, disable the auto-provisioned datasource and manage your own with schema-qualified table names or a role-level search_path:

monitoring:
  enabled: true
  grafanaDatasource:
    enabled: false

IAM-based database authentication

For databases using IAM authentication (e.g., AWS RDS IAM, GCP Cloud SQL IAM), the platform components typically receive credentials through pod-level service account bindings rather than a static password in Helm values. The auto-provisioned Grafana datasource does not support this pattern. Disable it and configure a Grafana datasource that uses your IAM-compatible authentication method instead.

Custom Alerts

You can add your own alert or recording rules alongside the bundled ones using monitoring.prometheusRules.additionalRules:

monitoring:
  enabled: true
  prometheusRules:
    additionalRules:
      - alert: AkuityPodRestartLoop
        expr: |
          increase(kube_pod_container_status_restarts_total{namespace="akuity"}[1h]) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting frequently"
          description: >-
            Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has
            restarted {{ $value }} times in the last hour.

Further Reading

  • values.yaml: full list of configurable monitoring.* parameters (see the monitoring section)
  • Database Operations: database-specific monitoring and Kine metrics