Database Operations

The Akuity Platform uses K3S as a lightweight Kubernetes control plane for each Argo CD and Kargo instance. K3S stores its data in PostgreSQL via Kine, which translates etcd API calls into SQL operations.

A healthy, well-provisioned database is the most important factor in K3S stability. Every Kubernetes write operation (create, update, delete) inserts a row into the database. A background compaction process periodically removes old rows to keep the table from growing indefinitely. When the database cannot keep up, compaction falls behind, the table bloats, and the Argo CD or Kargo instance can become unstable.
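The write-then-compact cycle described above can be illustrated with a toy model (class and variable names are invented for illustration; the real Kine schema and SQL are more involved):

```python
# Toy model of Kine's append-only revision table and periodic compaction.
# Every write appends a row tagged with a monotonically increasing revision;
# compaction deletes rows older than (current_rev - retention). The retention
# value of 1000 mirrors the min_retain figure used elsewhere on this page.

RETENTION = 1000  # revisions kept behind the head (min_retain)

class ToyKine:
    def __init__(self):
        self.current_rev = 0
        self.rows = []  # (revision, key) pairs; stands in for the SQL table

    def write(self, key):
        """Every create/update/delete appends a new revision row."""
        self.current_rev += 1
        self.rows.append((self.current_rev, key))

    def compact(self):
        """Drop rows older than the retention window; returns rows removed."""
        target_rev = self.current_rev - RETENTION
        before = len(self.rows)
        self.rows = [(rev, k) for rev, k in self.rows if rev > target_rev]
        return before - len(self.rows)

kine = ToyKine()
for i in range(2500):
    kine.write(f"pod-{i % 50}")
removed = kine.compact()
print(removed, len(kine.rows))  # 1500 rows compacted, 1000 retained
```

When compaction cannot run (or cannot keep up), the `rows` list in this model simply grows without bound, which is exactly the table bloat described above.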

Self-hosted operators are responsible for database sizing, performance monitoring, and capacity planning. This page covers what to watch, how to interpret signs of trouble, and how to remediate issues.

Database Health and Sizing

A well-provisioned database prevents the majority of K3S stability issues. Monitor your database as you would any production workload:

  • CPU: Sustained saturation slows all query processing, including compaction.
  • Memory: Insufficient buffer pool causes excessive disk reads and slows queries.
  • I/O: Compaction is write-heavy. IOPS limits directly throttle how fast it can run.
  • Connections: An exhausted connection pool causes compaction transactions to queue or fail.
  • Storage: A database that runs out of storage will cause K3S to fail. Monitor headroom and set alerts well before limits are reached.
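As a simple illustration of the storage-headroom point, time-to-full can be estimated from recent growth. This is a rough linear projection for alert planning, not a platform feature:

```python
def days_until_full(used_bytes: int, capacity_bytes: int,
                    daily_growth_bytes: float) -> float:
    """Linear projection of when storage runs out; inf if not growing."""
    if daily_growth_bytes <= 0:
        return float("inf")
    return (capacity_bytes - used_bytes) / daily_growth_bytes

# e.g. 80 GiB used of 100 GiB capacity, growing 2 GiB/day -> 10 days left
gib = 1024 ** 3
print(days_until_full(80 * gib, 100 * gib, 2 * gib))  # 10.0
```

Alert thresholds are environment-specific; the point is to alarm on the trend well before the hard limit, since a full disk takes K3S down outright.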

For sizing guidance, see PostgreSQL Database in the Getting Started guide. If you are consistently seeing K3S or compaction issues and suspect the database is undersized, scaling up CPU, memory, or IOPS is the right first step before investigating anything else.

Monitoring

The platform controller exposes two Prometheus metrics on the /metrics endpoint (port 9500) that can help diagnose compaction issues when K3S is behaving unexpectedly:

Metric            Labels                        Description
kine_lag_ratio    instance_id, instance_type    Compaction lag as a multiple of the retention window
kine_actual_lag   instance_id, instance_type    Raw revision count that compaction is behind

These metrics are diagnostic tools, not primary health indicators. An elevated kine_lag_ratio is a symptom, not a root cause. In most cases it reflects an underlying database constraint. Use these metrics to understand why something is wrong after your database metrics have already flagged a problem, not as the first signal to alert on.
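To pull these two series out of the controller's /metrics output during an investigation, a minimal parser over the Prometheus text exposition format might look like this (the sample payload is fabricated; only the metric names and labels come from the table above):

```python
import re

def parse_kine_metrics(payload: str) -> list[dict]:
    """Extract kine_lag_ratio / kine_actual_lag samples from
    Prometheus text exposition format."""
    pattern = re.compile(
        r'^(kine_lag_ratio|kine_actual_lag)\{([^}]*)\}\s+([0-9.eE+-]+)$'
    )
    samples = []
    for line in payload.splitlines():
        m = pattern.match(line.strip())
        if not m:
            continue
        labels = dict(
            kv.split("=", 1) for kv in m.group(2).replace('"', "").split(",")
        )
        samples.append({"metric": m.group(1), **labels,
                        "value": float(m.group(3))})
    return samples

# Fabricated sample scrape of the port-9500 endpoint:
payload = '''
kine_lag_ratio{instance_id="abc123",instance_type="argocd"} 1.5
kine_actual_lag{instance_id="abc123",instance_type="argocd"} 1500
'''
for s in parse_kine_metrics(payload):
    print(s["metric"], s["instance_id"], s["value"])
```

In practice you would point a Prometheus scrape job at port 9500 rather than parse by hand; this sketch is only for ad-hoc inspection.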

Understanding kine_lag_ratio

The lag ratio expresses how far behind compaction is, normalized across instance sizes:

target_rev = current_rev - min_retain   (min_retain = 1000)
actual_lag = target_rev - compact_rev
lag_ratio  = actual_lag / min_retain

What constitutes a concerning value depends entirely on your environment. A stable ratio on a well-provisioned database with ample headroom may require no action at all. The same ratio trending upward on a database that is already constrained is a different situation. Always interpret it alongside your database metrics, not in isolation.
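The formula above translates directly into code (min_retain of 1000 is taken from the definition; the revision numbers are illustrative):

```python
MIN_RETAIN = 1000  # revisions always retained behind the head

def lag_ratio(current_rev: int, compact_rev: int) -> float:
    """Compaction lag as a multiple of the retention window."""
    target_rev = current_rev - MIN_RETAIN
    actual_lag = target_rev - compact_rev
    return actual_lag / MIN_RETAIN

# Fully caught up: compaction has reached the target revision.
print(lag_ratio(current_rev=50_000, compact_rev=49_000))  # 0.0
# 3000 revisions behind the target -> ratio of 3.0
print(lag_ratio(current_rev=50_000, compact_rev=46_000))  # 3.0
```

Because the lag is divided by the retention window, a ratio of 3.0 means the same thing on a small instance as on a large one: compaction is three retention windows behind.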

Alerting

Alert on your database first. Storage headroom, CPU utilization, I/O saturation, and connection pool exhaustion are the signals that indicate real risk.

The Kine metrics can be useful as supporting context if you are already investigating K3S instability, but setting up aggressive alerts on kine_lag_ratio in isolation is likely to generate noise without actionable signal. A lag ratio that is elevated but stable on a healthy database generally does not require intervention.
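One way to encode this guidance (database first, lag ratio only as supporting context) is a small triage helper. The thresholds below are placeholders to tune for your environment, not recommended values:

```python
def lag_ratio_actionable(ratios: list[float], db_constrained: bool,
                         trend_threshold: float = 0.1) -> bool:
    """Treat an elevated lag ratio as actionable only when it is trending
    upward AND the database itself shows constraint. A stable ratio on a
    healthy database generally needs no intervention."""
    if len(ratios) < 2:
        return False
    trending_up = (ratios[-1] - ratios[0]) / len(ratios) > trend_threshold
    return trending_up and db_constrained

# Elevated but flat on a healthy database: no action.
print(lag_ratio_actionable([2.0, 2.1, 2.0, 2.05], db_constrained=False))  # False
# Climbing steadily on a constrained database: investigate.
print(lag_ratio_actionable([1.0, 2.5, 4.0, 6.0], db_constrained=True))    # True
```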

Troubleshooting K3S or Compaction Issues

If K3S is behaving unexpectedly or you are seeing signs of compaction falling behind, work through the following in order. The database is almost always the root cause.

1. Check Database Health

Start here before anything else. Check:

  • CPU: Is the database CPU saturated?
  • Memory: Is the database swapping or running low on buffer pool?
  • I/O: Are disk IOPS maxed out?
  • Connections: Is the connection pool exhausted?
  • Storage: How much headroom remains?

If any of these are constrained, address them before proceeding.

2. Review Instance Load

If the database is healthy but issues persist:

  • How many applications does the instance manage?
  • What is the sync frequency? Aggressive sync intervals generate more writes.
  • Are there runaway controllers or reconciliation loops?

3. Check K3S Compaction Logs

Check K3S logs for compact failed messages. These indicate whether compaction transactions are timing out, hitting lock contention, or failing due to connection issues, and help distinguish between a database bottleneck and a compaction configuration issue. If you are seeing frequent failures and are unsure how to interpret them, contact Akuity Support.
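A quick way to surface these messages is to scan the K3S logs for the "compact failed" marker and tally the failure reasons. The log lines below are fabricated examples; actual wording varies by K3S and Kine version:

```python
from collections import Counter

def summarize_compact_failures(log_lines):
    """Count log lines mentioning 'compact failed', grouped by the tail of
    the message, to spot recurring causes (timeouts, lock contention, ...)."""
    reasons = Counter()
    for line in log_lines:
        if "compact failed" in line:
            reason = line.split("compact failed", 1)[1].strip('": ')
            reasons[reason] += 1
    return reasons

# Fabricated examples; real messages differ by K3S/Kine version.
logs = [
    'time="..." level=error msg="compact failed: context deadline exceeded"',
    'time="..." level=error msg="compact failed: context deadline exceeded"',
    'time="..." level=info msg="compaction complete"',
]
print(summarize_compact_failures(logs))
```

A cluster of identical reasons (for example, repeated deadline errors) usually points back at the database rather than at compaction itself.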

4. Tune Compaction Settings

Only after confirming that the database is healthy and instance load is reasonable should you consider adjusting the K3S compaction parameters in the Argo CD Instance Parameters or Kargo Instance Parameters sections of the Helm Values Reference.

If you are unsure which settings to adjust or want guidance based on your specific environment, contact Akuity Support before making changes.

caution

These settings compensate for environmental constraints. They do not fix underlying database performance problems. Adjusting them on a database that is already under heavy load can make things worse.

5. Emergency Manual Compaction

If compaction is critically behind and the instance is impaired, contact Akuity Support for assistance. This operation requires instance downtime and should only be performed after the underlying database issue has been identified and addressed.

Further Reading