On-Call Agent

The On-Call Agent automates troubleshooting and remediation for degraded Argo CD applications and Kubernetes namespaces by executing predefined runbooks. An incident is an investigation of a namespace or application in a degraded state. Incidents are created either automatically, when the On-Call Agent detects a degraded state, or manually, when a user converts a conversation to an incident. Once an incident is created, the On-Call Agent begins troubleshooting and triage. An incident follows the same pattern as a conversation with the Deployment Advisor, with two key differences:

  • Each incident has a status: resolved or active.
  • Each incident is associated with a particular resource (e.g., an Argo CD application or a Kubernetes namespace).

Enable Incident Auto-Creation

From the Intelligence settings page, you can configure the conditions under which an incident is automatically created. This is managed through two types of triggers: Resource Degradation and Webhooks.

Resource Degradation Triggers allow you to automatically create incidents when your Argo CD applications or Kubernetes resources enter a degraded state.

To create a trigger:

  • Click Add New under the Resource Degradation Triggers section.
  • Fill in the following fields in the "New Trigger" dialog:
    • Argo CD Applications: Select which specific Argo CD applications to monitor.
    • K8S Namespaces: Select which Kubernetes namespaces to monitor.
    • Clusters: Choose the cluster(s) this trigger will apply to.
    • Trigger After: Specify a delay (e.g., 5m, 15m, 1h30m) before creating an incident. This prevents alerts for brief, transient issues.
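
The Trigger After examples (5m, 15m, 1h30m) look like compact hour and minute duration strings. As a rough illustration of how long such a delay is, the Python sketch below converts a value of that shape to seconds. It assumes only the h, m, and s units are used; the exact grammar the product accepts is not stated here, and `parse_delay` is a hypothetical helper rather than part of Akuity.

```python
import re

# Hypothetical helper: converts strings such as "5m" or "1h30m" to seconds.
# Assumption: only h, m, and s units appear, each at most once, in that order.
_DELAY = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def parse_delay(text: str) -> int:
    """Return the delay in seconds, or raise ValueError for other shapes."""
    cleaned = text.strip()
    match = _DELAY.fullmatch(cleaned)
    # An empty string also matches the all-optional pattern, so reject it here.
    if not cleaned or match is None:
        raise ValueError(f"unsupported delay: {text!r}")
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    for value in ("5m", "15m", "1h30m"):
        print(value, "->", parse_delay(value), "seconds")
```

A helper like this can be handy when reasoning about how long a degraded state must persist before an incident is opened.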

Webhook Triggers allow you to create incidents from alerts sent by external monitoring systems (like Prometheus Alertmanager).

To configure a webhook:

  • Click Add New under the Webhook Triggers section.
  • Fill in the following fields in the New Webhook Config dialog:
    • Name: A unique identifier for your webhook configuration (e.g., alert-manager).
    • Description, Cluster, K8s Namespace, Argo CD Application Name: Map data from the incoming webhook's JSON payload to the relevant fields in Akuity. You must provide the path to the data using JSON path syntax (e.g., {.body.alerts[0].labels.namespace}).
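
To make the path mapping concrete, the Python sketch below resolves a path of the same shape as the example against a made-up payload. Everything here is an assumption for illustration: `SAMPLE_REQUEST` is a hypothetical payload, `resolve` is a hypothetical helper that handles only dotted keys and numeric indexes, and the platform's real evaluator may support more syntax.

```python
import re
from typing import Any

# Hypothetical sample of what the webhook receiver might see. The "body"
# wrapper mirrors the example path in the text; the alert fields are
# illustrative values, not output captured from a real system.
SAMPLE_REQUEST = {
    "body": {
        "alerts": [
            {
                "labels": {"namespace": "guestbook", "alertname": "HighMemory"},
                "annotations": {"description": "Memory usage is above 90%"},
            }
        ]
    }
}

# One path segment: either ".key" or "[index]".
_SEGMENT = re.compile(r"\.([A-Za-z_][\w-]*)|\[(\d+)\]")


def resolve(path: str, data: Any) -> Any:
    """Resolve a simple path such as {.body.alerts[0].labels.namespace}."""
    inner = path.strip()
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    position = 0
    current = data
    while position < len(inner):
        match = _SEGMENT.match(inner, position)
        if match is None:
            raise ValueError(f"unsupported segment at offset {position}: {path!r}")
        key, index = match.groups()
        current = current[key] if key is not None else current[int(index)]
        position = match.end()
    return current


if __name__ == "__main__":
    print(resolve("{.body.alerts[0].labels.namespace}", SAMPLE_REQUEST))
```

Trying a path against a captured payload in this way, before saving the webhook configuration, can help catch a wrong index or a misspelled label early.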

Create Runbooks

At a high level, runbooks are the instruction sets that the On-Call Agent uses when responding to an active incident or scenario. Runbooks are written and stored in Markdown, which makes them easy for both humans and Akuity Intelligence to read.

There is no preset schema or format that a runbook must follow. The On-Call Agent interprets whatever you have written and evaluates it for actions when a corresponding incident occurs.

For example, you may have an incident in which an application has hit its memory limit and entered an out-of-memory (OOM) state. Based on the runbooks you have in the environment, the On-Call Agent can do the following:

  • Identify which runbook is most applicable to the incident
  • Report status and custom details to a Slack channel
  • Suggest a change and wait for approval from a user

Example:

## General

- First, do the initial triage and collect the basic information to understand the incident.
- Next, send a Slack notification with the link to the conversation to the channel "on-call" with basic details.
- Next, work on the incident according to the runbook. Don't take any action automatically; ask for approval.
- If the app is stable, check again 30 seconds later, then you can close the incident automatically. Please post all the details to Slack in concise messages.
- If you get stuck, send a Slack message again and mention that you need help.
- Please ensure you send a Slack message with the link to the conversation, so an engineer can work together with you if needed.

## Out of memory

**Symptoms**: Pod unexpectedly dies with `OOMKilled` status.

**Root cause**: The pod is consuming more memory than the available memory.

**Solution**:

* Temporarily increase the memory limit of the pod automatically.
* Increase the memory limit in 50 MB increments until the pod is stable.

Users can manage their runbooks under the Runbooks tab of the Incidents dashboard.

Apply Runbooks to your Resources

After creating a runbook, you must apply it to your Argo CD applications or Kubernetes namespaces. When a linked resource becomes degraded, the On-Call Agent will automatically execute the steps defined in the runbook to resolve the incident.

There are two ways to apply a runbook to your Argo CD applications or Kubernetes namespaces:

  • From the Runbook Settings UI. You can directly configure which resources a runbook applies to from the runbook editor.
    • Navigate to Intelligence (Beta) in the left-hand sidebar.
    • Select the Runbooks tab.
    • Click + Create to make a new runbook, or select an existing runbook and click Edit.
    • In the Applied To section, specify the Argo CD Apps, K8S Namespaces, or Clusters that this runbook should monitor.
    • In the example, the runbook named oom is applied to the Argo CD Application guestbook-prod-oom.
  • Using Kubernetes Annotations. You can also apply a runbook by adding an annotation directly to the manifest of your Argo CD Application or Kubernetes Namespace.
    • Add the annotation akuity.io/runbooks: "<runbook_name>" to the resource's metadata.
    • For the example shown, you would add the following annotation to the guestbook-prod-oom Application resource: akuity.io/runbooks: "oom"
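
As a minimal sketch of the annotation approach, the Python snippet below adds the `akuity.io/runbooks` annotation to a trimmed Argo CD Application manifest represented as a dictionary. The `with_runbook` helper is hypothetical, the `argocd` namespace is an assumption, and the `spec` section is omitted for brevity; in practice you would edit the manifest in Git or with your usual tooling.

```python
import json

RUNBOOK_ANNOTATION = "akuity.io/runbooks"


def with_runbook(manifest: dict, runbook_name: str) -> dict:
    """Return a copy of the manifest with the runbook annotation set."""
    metadata = dict(manifest.get("metadata", {}))
    annotations = dict(metadata.get("annotations") or {})
    annotations[RUNBOOK_ANNOTATION] = runbook_name
    metadata["annotations"] = annotations
    return {**manifest, "metadata": metadata}


# Trimmed Application manifest. The namespace is an assumed value and the
# spec is left out because it does not affect the annotation.
application = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "guestbook-prod-oom", "namespace": "argocd"},
}

if __name__ == "__main__":
    print(json.dumps(with_runbook(application, "oom"), indent=2))
```

The helper copies the metadata instead of mutating it, so the original manifest stays unchanged and existing annotations are preserved.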

Managing Incidents

When an Argo CD application or Kubernetes namespace becomes degraded, the On-Call Agent will automatically create an incident. You can view and manage these incidents from the Incidents dashboard. The Incidents tab will display a list of all incidents. You can filter this list by Status (e.g., Unresolved, All), Application, or Namespace.

Click on an incident from the list to view its details. This view provides a live, step-by-step account of the troubleshooting process:

  • Summary: A high-level overview of the issue, including the affected application and the runbook being used.
  • Root Cause: An analysis of the likely cause of the incident (e.g., an OOMKilled event due to memory limits).
  • Live Troubleshooting Log: A timeline showing every action taken by the intelligence agent, from initial detection ("Incident Occurred") to diagnostic steps like fetching the application tree and inspecting Kubernetes resources.

The On-Call Agent will attempt to resolve the incident automatically using the applied runbook.

  • If the incident is successfully resolved, its status will be updated to Resolved.
  • If the issue persists, the incident will remain Unresolved. You can then take manual action:
    • Mark as Resolved: If you have fixed the issue outside of the system, you can manually close the incident.
    • Open in Akuity Intelligence: Click this button to open the incident in the chat interface. Here, you can work directly with the On-Call Agent, provide more instructions, or ask it to "resolve it" to continue the troubleshooting process interactively.

Slack Integration

Akuity Intelligence can notify Slack about incidents and, if you enable the full Slack Integration, keep conversations in sync. Use the configuration level that matches what you need:

  • Incident notifications: Send incident updates to Slack using Argo CD notifications. This is one-way: alerts flow from Akuity Intelligence to Slack.
  • Slack Integration: Add bi-directional conversation sync, Share to Slack, and thread-first workflows. This level builds on the alerting setup and requires additional Slack app permissions.

Configure the Slack Service

Both experiences rely on an Argo CD Slack service. Create or update it under Settings → Notifications → Services:

  1. Click Add New → Slack (or edit an existing service).
  2. Provide a Name (e.g., slack-main).
  3. Enter your Slack Bot Token (xoxb-…). This is required for incident notifications.
  4. (Slack Integration only) Enter your App-Level Token (xapp-…) so Socket Mode can power live conversation sync.
  5. Optionally set a posting Username and Icon.
  6. Save the service.

If you rotate either token later, update the Slack service to avoid delivery failures.
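
Because the two tokens are easy to swap by accident, a quick prefix check before saving can help. The Python sketch below relies only on the prefixes named in the steps above (xoxb- for the bot token, xapp- for the app-level token); `check_slack_tokens` is a hypothetical helper, and it does not contact Slack or prove that a token is still valid.

```python
def check_slack_tokens(
    bot_token: str, app_token: str = "", full_integration: bool = False
) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not bot_token.startswith("xoxb-"):
        problems.append("bot token should start with xoxb-")
    # The app-level token only matters when the full Slack Integration
    # (Socket Mode, conversation sync) is being configured.
    if full_integration and not app_token.startswith("xapp-"):
        problems.append("app-level token should start with xapp-")
    return problems


if __name__ == "__main__":
    # Placeholder values only; never hard-code real tokens.
    print(check_slack_tokens("xoxb-placeholder"))
    print(check_slack_tokens("xapp-placeholder", "xoxb-placeholder", full_integration=True))
```

The second call shows the swapped-token case, which reports a problem for each field.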

Link Slack

After the Slack service exists, connect it to Intelligence so incident updates flow to Slack:

  • Click on the Integrations tab.
  • In the Notifications section, locate the Slack option.
  • Click the dropdown menu and select the Slack service you configured in the previous step.
  • Click Save in the top-right corner to apply the changes.

With this in place, the On-Call Agent posts new and ongoing incident activity to the channels configured for that Slack service.

Additional Setup for Slack Integration

If you plan to collaborate directly from Slack threads, configure your Slack app with the permissions and subscriptions below. These steps unlock Share to Slack, bi-directional chat, and conversation sync without changing how the On-Call Agent posts incident alerts.

Generate an App-Level Token

App-level tokens let Akuity connect to Slack platform features such as Socket Mode.

  1. Navigate to Settings → Basic Information → App-Level Tokens.
  2. Click Generate an app-level token.
  3. Add the scopes: connections:write, authorizations:read, and app_configurations:write.

Enable Socket Mode

Socket Mode keeps Slack traffic behind WebSockets so you do not need to expose a public endpoint.

  1. Open Settings → Socket Mode.
  2. Turn on Connect using Socket Mode.

Configure Event Subscriptions

Event subscriptions allow Akuity to receive messages and mentions from the channels you monitor.

  1. Go to Features → Event Subscriptions.
  2. Enable Events.
  3. Under Subscribe to Bot Events, add: app_mention, message.channels, message.groups, message.im, and message.mpim.

Configure OAuth & Permissions

Bot token scopes define what your Slack app can read and write when syncing conversations.

  1. Open Features → OAuth & Permissions.
  2. Under Scopes, add: app_mentions:read, channels:history, channels:read, chat:write, groups:history, groups:read, im:history, mpim:history, users:read, and users:read.email.
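
With ten scopes to add, it is easy to miss one. The Python sketch below compares a comma-separated list of granted scopes against the list above and reports what is missing. `missing_scopes` is a hypothetical helper, and how you obtain the granted list is up to you; one option, stated here as an assumption about your setup, is to copy it from the Slack app's OAuth & Permissions page.

```python
# Bot token scopes listed in the step above.
REQUIRED_BOT_SCOPES = (
    "app_mentions:read",
    "channels:history",
    "channels:read",
    "chat:write",
    "groups:history",
    "groups:read",
    "im:history",
    "mpim:history",
    "users:read",
    "users:read.email",
)


def missing_scopes(granted: str) -> list[str]:
    """Return required scopes absent from a comma-separated granted list."""
    have = {scope.strip() for scope in granted.split(",") if scope.strip()}
    return [scope for scope in REQUIRED_BOT_SCOPES if scope not in have]


if __name__ == "__main__":
    granted = "chat:write, channels:read, groups:read, users:read"
    for scope in missing_scopes(granted):
        print("missing:", scope)
```

Results come back in the same order as the required list, which makes them easy to tick off in the Slack app settings.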

Allowlist Slack Channels for Share to Slack

  1. Return to the target Argo CD instance and open Settings → Intelligence → Integrations.
  2. With your Slack service selected, find Slack Channels.
  3. List the channels (without #) that should appear in the Share to Slack dialog. This acts as an allowlist.
  4. Save the settings and invite the Slack app (bot user) to each listed channel so first posts succeed.
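
Since the allowlist expects channel names without the leading #, a small cleanup step can prevent mismatches when the names are pasted from Slack. The Python sketch below is one way to do that; `normalize_channels` is a hypothetical helper, not something the product provides.

```python
from typing import Iterable


def normalize_channels(names: Iterable[str]) -> list[str]:
    """Strip whitespace and a leading '#', drop blanks, and remove duplicates."""
    cleaned: list[str] = []
    for raw in names:
        name = raw.strip().lstrip("#").strip()
        if name and name not in cleaned:
            cleaned.append(name)
    return cleaned


if __name__ == "__main__":
    print(normalize_channels(["#on-call", " incidents ", "on-call", ""]))
```

The first occurrence of each name wins, so the list keeps the order in which you entered the channels.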

When these optional steps are complete, the Slack Integration experience—thread sync, share-to-Slack workflows, and responding to incidents inside Slack—is available alongside the standard On-Call Agent notifications.

Using Slack Integration

Once configuration is complete, you can share conversations to Slack, collaborate inside threads, and review synced incidents without leaving Slack.

Sharing Conversations to Slack

After services and channels are configured, conversation owners can use the Share to Slack action in the AI Conversation UI:

  1. Open a conversation and click Share to Slack. The modal lists the allowlisted channels you configured earlier.
  2. Pick a destination. If the conversation is already shared, the modal shows the existing permalink and an Unshare option.
  3. Confirm Share. The backend will make the conversation public if it was private, capture the title, generate a permalink, and post to the selected Slack channel with both pieces of context.

Re-sharing to the same channel refreshes the thread and permalink. Unsharing clears the stored Slack metadata and returns the conversation to private visibility.

Interacting in Slack

  • Outbound (AKP → Slack): New messages in the AI conversation automatically post to the Slack thread.
  • Inbound (Slack → AKP): When a human replies in the linked Slack thread or mentions the bot, the message syncs back to the AKP conversation.

Automatic Incident Synchronization

Akuity Intelligence can detect incidents (for example, degraded applications) and notify your team via Slack. The automation is driven by the runbooks attached to your resources.

Interacting with Incidents in Slack

  1. View Details: Incident messages include a summary, ID (for example, INC-123), and a link back to the AKP console.
  2. Reply in Thread: Collaborate directly in the Slack thread.
  3. AI Response: The AI monitors the thread and responds to questions or commands (for example, “Get the logs for the test-service pod”).
  4. Two-Way Sync: Every message in the Slack thread is mirrored in the incident conversation inside the AKP console.

Note: The AI only posts to channels you explicitly allow in Slack Channel Configuration. Ensure your runbook references one of those channels.

Troubleshooting

If Slack synchronization is not working as expected, verify the following:

  1. Permissions: Confirm the Slack app includes the channels:read, chat:write, and groups:read scopes.
  2. Tokens: In Settings → Notifications, ensure both Slack tokens are valid.
  3. Channel Membership: Invite the Slack app (bot) to the target channel (/invite @YourApp).
  4. Socket Mode: When using Socket Mode, confirm the app-level token is configured and Socket Mode is enabled.
  5. Runbook Instructions: For incident sync, make sure your runbook tells the AI to post to Slack and names the correct channel.
  6. Service not listed: Verify that the name of the Slack service you created under Notifications starts with service.slack., then save the Intelligence settings again.
  7. Share fails immediately: The bot may lack access to the channel or tokens may be stale. Reinvite the bot, rotate the secrets, and click Save in Notifications.
  8. Slack replies do not sync: Check platform logs for socket pool warnings. If the instance was unregistered, confirm both bot and app tokens resolve correctly from the secret.