Executive Summary
Our production Kubernetes cluster experienced a critical performance crisis in which all three master nodes were crashing cyclically. The issue manifested as 100% CPU usage, memory exhaustion (32GB), and extreme disk I/O spikes (3-4GB reads), causing the Linux OOM killer to terminate critical Kubernetes components, including the API server, scheduler, and controller manager. In the following guide I will show how etcd bloat took down our Kubernetes control plane and what we did to resolve it.
- Root Cause: ETCD database bloat from accumulated Kyverno policy reports (181,000+ entries)
- Impact: Cluster instability, pod restarts, service disruptions
- Resolution: Systematic cleanup of ephemeral ETCD entries and implementation of monitoring
What is etcd in Kubernetes and why its performance matters
etcd is Kubernetes’ distributed key-value database and the single source of truth for the entire cluster state. Every critical object like nodes, pods, deployments, secrets, configmaps, RBAC, and more is stored in etcd, and the API server relies on it. When etcd becomes slow due to high CPU, memory pressure, or disk I/O saturation, Kubernetes control plane components can start timing out or fail. This leads to cascading impact such as API unresponsiveness, controller failures, scheduler delays, and eventually control plane instability.
In short: when etcd performance degrades, the whole cluster behaves like it’s going down.
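To make this concrete, here is a minimal sketch showing how every API object maps to a key under the /registry prefix in etcd. The endpoint and TLS flags are the ones used later in this post; adjust them for your own cluster.

# List the etcd keys backing the pods of a single namespace (kube-system here)
$ ETCDCTL_API=3 etcdctl get /registry/pods/kube-system --prefix --keys-only \
  --endpoints=https://10.x.x.24:2379 \
  --cert=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx.pem \
  --key=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx-key.pem \
  --cacert=/etc/ssl/etcd/ssl/ca.pem
# Each key is one API object, e.g. /registry/pods/kube-system/kube-apiserver-...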
Infrastructure Overview
- Platform: 3 Kubernetes master nodes running as VMs on Proxmox cluster
- Storage: Ceph distributed storage with NVMe OSDs
- Network: Masters and workers in same rack, connected to shared switches via 25Gbps interfaces
- ETCD Version: 3.5.12
- Database Size: 2.4GB per node
- Installation: Done using Kubespray, as described in this guide: https://k8s.co.il/kubernetes/how-to-deploy-kubernetes-using-kubespray/
Problem Manifestation
System Symptoms
The crisis presented with a consistent pattern every 10-15 minutes:

Figure 1: Proxmox monitoring showing 100% CPU utilization and memory exhaustion on master node during the crisis

Figure 2: Extreme disk I/O activity with read spikes reaching 3-4GB during ETCD operations

Kubernetes Component Failures
The resource exhaustion triggered cascading failures:
root@k8s-bastion-prod-xxx:/home/master/prod# k get pod -n kube-system | grep sch
kube-scheduler-k8s-master-prod-01-xxx 0/1 CrashLoopBackOff 98 (2m2s ago) 395d
kube-scheduler-k8s-master-prod-02-xxx 0/1 Error 132 (9m10s ago) 394d
kube-scheduler-k8s-master-prod-03-xxx 0/1 Running 132 (3m23s ago) 394d
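A quick way to check whether such restarts were driven by the OOM killer rather than by a configuration problem is to read the last termination reason from the pod status. This is a sketch; the pod name is taken from the output above, and the field is empty if the container has not restarted:

# Show the reason for the container's last restart (e.g. OOMKilled or Error)
kubectl -n kube-system get pod kube-scheduler-k8s-master-prod-01-xxx \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'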
ETCD Health Degradation
ETCD endpoints became unresponsive with timeout errors:
root@k8s-master-prod-02-xxx:~# sudo etcdctl endpoint health
{"level":"warn","ts":"2025-08-10T15:45:23.207618+0300","logger":"client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00015ca80/10.x.x.25:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
+--------------------------+--------+--------------+---------------------------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+--------------------------+--------+--------------+---------------------------+
| https://10.x.x.25:2379 | false | 5.002562165s | context deadline exceeded |
| https://10.x.x.24:2379 | false | 5.002525518s | context deadline exceeded |
+--------------------------+--------+--------------+---------------------------+
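Beyond endpoint health, etcd's own Prometheus metrics give a quick read on disk latency and database size. This sketch assumes metrics are served on the client port (the etcd default) and reuses the TLS files from the other commands in this post:

# Disk commit latency and on-disk DB size straight from one etcd member
curl -s --cacert /etc/ssl/etcd/ssl/ca.pem \
  --cert /etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx.pem \
  --key /etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx-key.pem \
  https://10.x.x.25:2379/metrics | \
  grep -E 'etcd_disk_backend_commit_duration_seconds_(sum|count)|etcd_mvcc_db_total_size_in_bytes'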
Troubleshooting Methodology
Proxmox Infrastructure Analysis
All infrastructure-level diagnostics returned healthy results, confirming the issue was not at the hardware or virtualization layer (a consolidated triage sketch follows the checklist below):
1. Ceph Storage Health (All Clear):
- ceph status – Cluster healthy
- ceph osd status – All OSDs operational
- ceph pg stat – Placement groups healthy
- ceph osd perf – Performance within normal ranges
2. Network Connectivity (All Clear):
- ping tests between all nodes – Normal latency
- mtr network trace reports – No packet loss
- 25Gbps interface utilization – Within capacity
- Switch port statistics – No errors or drops
3. VM Resource Status (All Clear):
- Proxmox resource allocation – Sufficient headroom
- Ceph mount points – Properly mounted and responsive
- No resource contention between VMs detected
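For repeatability, the checks above can be bundled into a small triage script. This is only a sketch of what was run by hand during the incident; the node list is the master IPs from this post, and it assumes Ceph admin access and mtr installed on the host running it.

#!/usr/bin/env bash
# Quick infrastructure triage: storage health plus network path to each master
NODES="10.x.x.24 10.x.x.25 10.x.x.26"

ceph status          # overall Ceph cluster health
ceph osd perf        # per-OSD commit/apply latency

for n in $NODES; do
  echo "--- $n ---"
  ping -c 5 -q "$n"  # baseline latency and packet loss
  mtr -rwc 20 "$n"   # hop-by-hop path report
done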
Kubernetes-Level Diagnostics
With infrastructure confirmed healthy, the investigation focused on Kubernetes-level components and their resource consumption patterns.
1. ETCD Database Deep Dive Analysis:
The systematic approach to ETCD analysis revealed the smoking gun:
# Step 1: Export complete ETCD keyspace
$ ETCDCTL_API=3 etcdctl get "" --keys-only --prefix \
--endpoints=https://10.x.x.24:2379,https://10.x.x.25:2379,https://10.x.x.26:2379 \
--cert=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx.pem \
--key=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx-key.pem \
--cacert=/etc/ssl/etcd/ssl/ca.pem \
--command-timeout=30s > /tmp/etcd_keys.dump
# Step 2: Analyze top-level resource distribution
$ cat /tmp/etcd_keys.dump | awk -F "/" '{print $3}' | sort | uniq -c | sort -rn
The results revealed a catastrophic imbalance:
193513 [blank separator lines produced by --keys-only]
181193 reports.kyverno.io ################# ROOT CAUSE
5631 events
549 pods
542 services
417 crd.projectcalico.org
330 configmaps
269 controllerrevisions
260 replicasets
Step 3: Drill down into the problematic resource:
# Investigate Kyverno reports distribution
$ cat /tmp/etcd_keys.dump | grep reports.kyverno.io | awk -F "/" '{print $5}' | sort | uniq -c
130078 rook-ceph
51115 velero
1 v1.reports.kyverno.io
1 ephemeralreports.reports.kyverno.io
Step 4: Sample actual content to understand the nature of the data:
# Examine sample Kyverno report content
SAMPLE_KEY=$(cat /tmp/etcd_keys.dump | grep "reports.kyverno.io/ephemeralreports" | head -1)
ETCDCTL_API=3 etcdctl get "$SAMPLE_KEY" --print-value-only | head -20
The content analysis confirmed these were ephemeral policy violation reports – temporary logs that should have been cleaned up automatically but had accumulated over months.
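When the API server is responsive, the same objects can also be inspected and removed through the Kubernetes API instead of etcd. The cleanup in this incident was done directly in etcd (next section), but as a sketch, using the resource and namespace names from the dump above:

# Count Kyverno ephemeral reports per namespace through the API
kubectl get ephemeralreports.reports.kyverno.io -A --no-headers | \
  awk '{print $1}' | sort | uniq -c | sort -rn

# API-level cleanup of the two heaviest namespaces (a gentler route when the
# API server is healthy; not the route taken during this incident)
kubectl delete ephemeralreports.reports.kyverno.io --all -n rook-ceph
kubectl delete ephemeralreports.reports.kyverno.io --all -n velero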
2. Resource Consumption Investigation:
System-level diagnostics confirmed ETCD as the performance bottleneck:
CPU and Memory Analysis:
- kubectl top nodes – Master nodes showing sustained high resource usage
- htop on master nodes – etcd process consuming 90-100% CPU during query operations
- free -h – Memory exhaustion (31-32GB used of 32GB) due to ETCD cache pressure
Disk I/O Analysis:
- iostat -x 1 – Extreme read operations (3-4GB) correlating with ETCD startup and compaction
- NVMe disk utilization – 100% during ETCD database operations
- Ceph OSD metrics – Healthy performance, bottleneck was at the ETCD application layer
Kubernetes Events Analysis:
- kubectl get events --all-namespaces – Excessive scheduling failures due to API server unavailability
- Event logs showing recurring OOMKilled events for system pods
- API server logs indicating ETCD timeout errors during high-load periods
Pod Restart Pattern Analysis:
- kubectl get pods -n kube-system – Consistent CrashLoopBackOff patterns
- System pods failing health checks due to resource starvation
- Rolling failures across all three master nodes in sequence
Root Cause Analysis
The Critical Discovery: 181,193 Kyverno ephemeral policy reports represented 94% of all ETCD keys, creating massive overhead for:
- Database queries and indexing
- Memory consumption for caching
- Disk I/O during compaction operations
- Network serialization between ETCD nodes
Timeline Analysis: The reports had accumulated over approximately 6 months, with the heaviest concentration from:
- Rook-Ceph operations (130,078 reports)
- Velero backup processes (51,115 reports)
Performance Impact: Each ETCD operation was processing nearly 200,000 keys instead of the typical 10,000-15,000, causing severe performance degradation.
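That figure can be reproduced from the Step 1 dump by counting non-blank lines (the blank lines are just the separators etcdctl prints with --keys-only):

# Total number of keys currently stored in etcd, from the earlier dump
awk 'NF' /tmp/etcd_keys.dump | wc -l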
Solution Implementation
Immediate Emergency Resolution
Phase 1: Emergency Cleanup
# Remove Kyverno ephemeral reports from rook-ceph namespace
ETCDCTL_API=3 etcdctl del --prefix /registry/reports.kyverno.io/ephemeralreports/rook-ceph \
--endpoints=https://10.x.x.24:2379,https://10.x.x.25:2379,https://10.x.x.26:2379 \
--cert=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx.pem \
--key=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx-key.pem \
--cacert=/etc/ssl/etcd/ssl/ca.pem
# Result: 95,698 entries deleted
# Remove Kyverno ephemeral reports from velero namespace
ETCDCTL_API=3 etcdctl del --prefix /registry/reports.kyverno.io/ephemeralreports/velero \
--endpoints=https://10.x.x.24:2379,https://10.x.x.25:2379,https://10.x.x.26:2379 \
--cert=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx.pem \
--key=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx-key.pem \
--cacert=/etc/ssl/etcd/ssl/ca.pem
# Result: 23,226 entries deleted
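One addition worth making to this runbook (not part of the original commands above) is a pre-flight count of exactly what a prefix matches before deleting it:

# Count the keys a prefix matches before deleting it (same flags as above)
ETCDCTL_API=3 etcdctl get /registry/reports.kyverno.io/ephemeralreports/rook-ceph \
  --prefix --count-only --write-out=fields \
  --endpoints=https://10.x.x.24:2379,https://10.x.x.25:2379,https://10.x.x.26:2379 \
  --cert=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx.pem \
  --key=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx-key.pem \
  --cacert=/etc/ssl/etcd/ssl/ca.pem | grep '"Count"'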
Immediate Impact:
- Total cleanup: 118,924 problematic entries removed (65% reduction)
- CPU usage: Dropped from 100% to normal levels within minutes
- Memory pressure: Reduced from 32GB to 18-20GB usage
- Disk I/O: Eliminated 3-4GB read spikes
- Pod stability: System pods stopped crashing, CrashLoopBackOff resolved
Phase 2: Verification
# Confirm cluster health restoration
$ ETCDCTL_API=3 etcdctl endpoint status --write-out=table
kubectl get nodes
kubectl get pods -n kube-system
Now that we know how etcd bloat took down our Kubernetes control plane, we can start mitigating the issue.
Post-Resolution Status
After cleanup, the cluster achieved stable operation:
root@k8s-master-prod-01-xxx:~# ETCDCTL_API=3 etcdctl endpoint status
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.x.x.24:2379 | f3b25ccf4d71f345 | 3.5.12 | 2.4 GB | false | false | 117 | 345544847 | 345544847 | |
| https://10.x.x.25:2379 | a5f091fbbaaac69a | 3.5.12 | 2.4 GB | true | false | 117 | 345544847 | 345544847 | |
| https://10.x.x.26:2379 | 5715e7fd6dacb34c | 3.5.12 | 2.4 GB | false | false | 117 | 345544847 | 345544847 | |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Current healthy metrics:
- Total ETCD keys: 13,926 (down from 181K+)
- Largest resource group: velero.io (2,466 entries – manageable)
- System stability: No more crashes or resource exhaustion
- Response times: ETCD queries returning in milliseconds instead of seconds
Long-term Prevention Solution
Automated Monitoring System
Created a comprehensive monitoring script (etcd-daily-monitor.sh, sketched after the feature list below) that:
- Daily exports and analyzes ETCD key distribution
- Categorizes resources by deletion safety level
- Implements threshold-based alerting
- Tracks growth patterns over time
- Stores historical data for trend analysis
Key Features:
- Critical Alert: >50,000 keys for single resource type
- Warning Alert: >10,000 keys for single resource type
- Safety Classification: Resources categorized as SAFE_DELETE, CONDITIONAL_DELETE, or NEVER_DELETE
- Growth Tracking: Daily comparison to detect rapid accumulation
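The full script is tailored to our environment, but a minimal sketch of the daily check, reusing the etcdctl flags from earlier in this post and leaving alert delivery as a stub, looks roughly like this:

#!/usr/bin/env bash
# etcd-daily-monitor.sh (sketch): dump all keys, count per resource type,
# and flag anything crossing the warning/critical thresholds described above
set -euo pipefail

DUMP="/var/log/etcd-monitor/keys-$(date +%F).dump"
WARN=10000
CRIT=50000
mkdir -p "$(dirname "$DUMP")"

ETCDCTL_API=3 etcdctl get "" --keys-only --prefix \
  --endpoints=https://10.x.x.24:2379,https://10.x.x.25:2379,https://10.x.x.26:2379 \
  --cert=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx.pem \
  --key=/etc/ssl/etcd/ssl/node-k8s-master-prod-01-xxx-key.pem \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --command-timeout=60s > "$DUMP"

# Per-resource key counts, highest first (blank separator lines are skipped)
awk -F/ 'NF>2 {print $3}' "$DUMP" | sort | uniq -c | sort -rn | \
while read -r count resource; do
  if   [ "$count" -ge "$CRIT" ]; then echo "CRITICAL: $resource has $count keys"
  elif [ "$count" -ge "$WARN" ]; then echo "WARNING: $resource has $count keys"
  fi
done  # pipe this output into your alerting channel of choice

Scheduled from cron (for example 0 6 * * * /usr/local/bin/etcd-daily-monitor.sh), the dated dumps double as the historical data for growth tracking and trend analysis; the safety classification and any automated cleanup hooks build on the same output.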
Prevention Strategy
1. Proactive Monitoring:
- Daily ETCD health checks via cron
- Automated alerting for key count thresholds
- Historical trend analysis
2. Resource Cleanup Policies:
- Kyverno ephemeral report retention policies
- Regular cleanup of expired backup metadata
- Event log rotation
3. Infrastructure Monitoring:
- Proxmox resource utilization dashboards
- Ceph storage performance monitoring
- Network latency tracking between nodes
Key Learnings
- ETCD Bloat is Silent but Deadly: Performance degradation can occur gradually before becoming critical
- Ephemeral Resources Need Management: Policy engines and backup tools can generate significant metadata
- Infrastructure Can Be a Red Herring: Despite healthy Ceph and 25Gbps networking, the bottleneck was application-level
- Systematic Analysis is Crucial: The root cause was only discoverable through methodical ETCD key analysis
- Monitoring is Essential: Proactive monitoring prevents crisis situations
Conclusion
This incident demonstrates the critical importance of ETCD health monitoring and shows how etcd bloat can take down a Kubernetes control plane. What appeared to be a complex infrastructure problem with multiple failing components was ultimately resolved by identifying and cleaning up accumulated ephemeral policy reports.
Despite having robust infrastructure (25Gbps networking, NVMe-backed Ceph storage, and sufficient VM resources), the crisis originated from application-level resource accumulation within ETCD. The systematic approach to troubleshooting – starting with infrastructure verification and progressing to application-level analysis – proved essential in identifying the true root cause.
The implementation of automated monitoring and cleanup procedures ensures this type of crisis is detected and resolved before impacting cluster stability.
Recommendation: All production Kubernetes clusters should implement similar ETCD monitoring and cleanup automation to prevent resource exhaustion crises, regardless of underlying infrastructure robustness. Regular ETCD key analysis should be part of standard operational procedures.
To learn more about the internals of etcd, you can read the etcd tutorials on the project's website.
