On-Call Agent
The On-Call Agent automates troubleshooting and remediation for your degraded Argo CD applications and Kubernetes namespaces by executing predefined runbooks. An incident is an investigation of a namespace or app in a degraded state. Incidents are either kicked off automatically by the On-Call Agent upon a detected degraded state or manually when a user converts a conversation to an incident. Once the incident is created, the On-Call Agent will begin troubleshooting and triage. An incident follows the same pattern as a conversation with the Deployment Advisor with a couple of key differences:
- Each incident has a status: either resolved or active.
- Each incident is associated with a particular resource (e.g., an Argo CD application or a Kubernetes namespace).
Enable Incident Auto-Creation
From the Intelligence settings page, you can configure the conditions under which an incident is automatically created. This is managed through two types of triggers: Resource Degradation and Webhooks.
Resource Degradation Triggers allow you to automatically create incidents when your Argo CD applications or Kubernetes resources enter a degraded state.
To create a trigger:
- Click Add New under the Resource Degradation Triggers section.
- Fill in the following fields in the "New Trigger" dialog:
- Argo CD Applications: Select which specific Argo CD applications to monitor.
- K8S Namespaces: Select which Kubernetes namespaces to monitor.
- Clusters: Choose the cluster(s) this trigger will apply to.
- Trigger After: Specify a delay (e.g., 5m, 15m, 1h30m) before creating an incident. This prevents alerts for brief, transient issues.
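The Trigger After examples (5m, 15m, 1h30m) suggest a compact duration notation. As a rough illustration of how such values map to an actual delay, here is a minimal Python sketch. It assumes only hour, minute, and second units are accepted, which the source does not confirm, and `parse_trigger_after` is a hypothetical helper, not part of the product.

```python
import re
from datetime import timedelta

# Assumption: only h, m, and s units are valid, based on the examples above.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_trigger_after(value: str) -> timedelta:
    """Convert a delay such as '1h30m' into a timedelta (hypothetical helper)."""
    pos = 0
    total_seconds = 0
    for match in _TOKEN.finditer(value):
        # Reject any characters that sit between two valid tokens.
        if match.start() != pos:
            raise ValueError(f"invalid delay: {value!r}")
        total_seconds += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    # Reject empty input and trailing characters after the last token.
    if pos == 0 or pos != len(value):
        raise ValueError(f"invalid delay: {value!r}")
    return timedelta(seconds=total_seconds)


print(parse_trigger_after("1h30m"))
```

With a delay of 1h30m, an incident would only be created if the resource is still degraded an hour and a half after the degradation was first detected.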
Webhook Triggers allow you to create incidents from alerts sent by external monitoring systems (such as Prometheus Alertmanager).
To configure a webhook:
- Click Add New under the Webhook Triggers section.
- Fill in the following fields in the New Webhook Config dialog:
- Name: A unique identifier for your webhook configuration (e.g., alert-manager).
- Description, Cluster, K8s Namespace, Argo CD Application Name: Map data from the incoming webhook's JSON payload to the relevant fields in Akuity. You must provide the path to the data using JSON path syntax (e.g., {.body.alerts[0].labels.namespace}).
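To make the field mapping more concrete, the following Python sketch resolves the example path against a sample alert. The payload is hypothetical and heavily trimmed, the `body` wrapper is inferred from the example path rather than from a documented schema, and `resolve` is an illustrative helper that handles only dotted keys and numeric indexes, not the full JSON path syntax.

```python
import re

# Hypothetical, trimmed request. The "body" wrapper is an assumption inferred
# from the example path; real payloads carry more fields.
sample_request = {
    "body": {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {
                    "alertname": "PodOOMKilled",
                    "namespace": "guestbook",
                },
            }
        ],
    }
}

# One step is either ".key" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][\w-]*)|\[(\d+)\]")


def resolve(path: str, data):
    """Walk a simple path such as '{.body.alerts[0].labels.namespace}'."""
    inner = path.strip()
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    pos = 0
    current = data
    while pos < len(inner):
        step = _STEP.match(inner, pos)
        if step is None:
            raise ValueError(f"unsupported path: {path!r}")
        key, index = step.groups()
        current = current[key] if key is not None else current[int(index)]
        pos = step.end()
    return current


print(resolve("{.body.alerts[0].labels.namespace}", sample_request))  # prints "guestbook"
```

In this sample, the K8s Namespace field would be filled with guestbook, taken from the labels of the first alert in the payload.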
Create Runbooks
Runbooks are the instruction sets that the On-Call Agent follows when responding to an active incident or scenario. They are written and stored in Markdown, which makes them easy for both humans and Akuity Intelligence to read.
There is no preset schema or format that a runbook must follow. The On-Call Agent interprets whatever you have written and determines which actions to take when a corresponding incident occurs.
For example, you may have an incident in which an application has hit its memory limit and entered an out-of-memory (OOM) state. Based on the runbooks you have in the environment, the On-Call Agent can perform the following:
- Identify which runbook is most applicable to the incident
- Report the status and any custom details to a Slack channel
- Suggest a change and wait for approval from a user
Example:
## General
- First, do the initial triage and collect the basic information to understand the incident.
- Next, send a Slack notification with the link to the conversation to the channel “on-call” with basic details.
- Next, work on the incident according to the runbook. Don't take any action automatically, ask for approval.
- If the app is stable, check again 30 seconds later, then you can close the incident automatically. Please post all the details to Slack in concise messages.
- If you get stuck, send a Slack message again and mention that you need help.
- Please ensure you send a Slack message with the link to the conversation, so an engineer can work with you if needed.
## Out of memory
**Symptoms**: Pod unexpectedly dies with `OOMKilled` status.
**Root cause**: The pod is consuming more memory than the available memory.
**Solution**:
* Temporarily increase the memory limit of the pod automatically.
* Increase the memory limit in 50 MB increments until the pod is stable.
Users can manage their runbooks under the Runbooks tab of the Incidents dashboard.
Apply Runbooks to your Resources
After creating a runbook, you must apply it to your Argo CD applications or Kubernetes namespaces. When a linked resource becomes degraded, the On-Call Agent will automatically execute the steps defined in the runbook to resolve the incident.
There are two ways to apply a runbook to your Argo CD applications or Kubernetes namespaces:
- From the Runbook Settings UI. You can directly configure which resources a runbook applies to from the runbook editor.
- Navigate to Intelligence (Beta) in the left-hand sidebar.
- Select the Runbooks tab.
- Click + Create to make a new runbook, or select an existing runbook and click Edit.
- In the Applied To section, specify the Argo CD Apps, K8S Namespaces, or Clusters that this runbook should monitor.
- In the example, the runbook named oom is applied to the Argo CD Application guestbook-prod-oom.
- Using Kubernetes Annotations. You can also apply a runbook by adding an annotation directly to the manifest of your Argo CD Application or Kubernetes Namespace.
- Add the annotation akuity.io/runbooks: "<runbook_name>" to the resource's metadata.
- For the example shown, you would add the following annotation to the guestbook-prod-oom Application resource: akuity.io/runbooks: "oom"
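If you prefer to set the annotation programmatically rather than editing the manifest by hand, the change can be expressed as a JSON merge patch. The Python sketch below only builds the patch document. `runbook_patch` is a hypothetical helper, and how you apply the patch (and in which namespace the resource lives) depends on your own tooling.

```python
import json

# Annotation key taken from the documentation above.
RUNBOOK_ANNOTATION = "akuity.io/runbooks"


def runbook_patch(runbook_name: str) -> dict:
    """Build a JSON merge patch that sets the runbook annotation (hypothetical helper)."""
    return {"metadata": {"annotations": {RUNBOOK_ANNOTATION: runbook_name}}}


# Patch for the guestbook-prod-oom example, using the runbook named "oom".
print(json.dumps(runbook_patch("oom")))
```

The printed JSON could then be handed to a tool that accepts merge patches, for example kubectl patch with --type merge, targeting the Application or Namespace resource.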
Managing Incidents
When an Argo CD application or Kubernetes namespace becomes degraded, the On-Call Agent will automatically create an incident. You can view and manage these incidents from the Incidents dashboard. The Incidents tab will display a list of all incidents. You can filter this list by Status (e.g., Unresolved, All), Application, or Namespace.
Click on an incident from the list to view its details. This view provides a live, step-by-step account of the troubleshooting process:
- Summary: A high-level overview of the issue, including the affected application and the runbook being used.
- Root Cause: An analysis of the likely cause of the incident (e.g., an OOMKilled event due to memory limits).
- Live Troubleshooting Log: A timeline showing every action taken by the On-Call Agent, from initial detection ("Incident Occurred") to diagnostic steps such as fetching the application tree and inspecting Kubernetes resources.
The On-Call Agent will attempt to resolve the incident automatically using the applied runbook.
- If the incident is successfully resolved, its status will be updated to Resolved.
- If the issue persists, the incident will remain Unresolved. You can then take manual action:
- Mark as Resolved: If you have fixed the issue outside of the system, you can manually close the incident.
- Open in Akuity Intelligence: Click this button to open the incident in the chat interface. Here, you can work directly with the On-Call Agent, provide more instructions, or ask it to "resolve it" to continue the troubleshooting process interactively.
Slack Notifications
You can configure Akuity Intelligence to send real-time notifications about incidents directly to your Slack workspace. This allows your team to stay informed and respond quickly when issues are detected.
Configure a Slack Service
Before you can enable Slack notifications for Akuity Intelligence, you must first have a Slack service configured in the main Argo CD notification settings.
- Navigate to Settings → Notifications.
- Under the Services tab, check if a Slack service is already configured. If not, click Add New and configure the connection to your Slack workspace by providing the necessary details (e.g., Slack webhook URL).
Link Slack to Akuity Intelligence
Once you have a Slack service available, you can link it to Akuity Intelligence.
- Navigate to Settings → Intelligence.
- Click on the Integrations tab.
- In the Notifications section, locate the Slack option.
- Click the dropdown menu and select the Slack service you configured in the previous step.
- Click Save in the top-right corner to apply the changes.
After completing this setup, the On-Call Agent will automatically send messages and updates for new and ongoing incidents to your selected Slack channel.