
Kubernetes Monitoring Metrics That Improve Cluster Reliability

Understand Kubernetes monitoring metrics that help detect issues early, improve reliability, and keep your cluster performing at its best.

Sep 5th, 2025

A Kubernetes cluster can generate more than 1,400 metrics out of the box. That’s a lot of numbers to sift through, especially when you’re troubleshooting a production slowdown in the middle of the night.

The key is knowing which metrics tell you the most, with the least noise. These are the signals worth paying attention to when you need answers fast.

Key Metrics for Identifying Problems Early

Container CPU Throttling

Standard CPU utilization graphs can look fine, even when your application is struggling. That’s because Kubernetes uses the Linux Completely Fair Scheduler (CFS), which can throttle containers when they hit their CPU limits, even if the node still has idle capacity.

How CFS Periods Work

CFS operates in 100-ms scheduling periods. If a container reaches its CPU quota during a period, it gets paused until the next one starts. This pause can introduce noticeable latency spikes that average CPU metrics completely smooth over.

A useful PromQL query to see the percentage of time a container is throttled is:

rate(container_cpu_cfs_throttled_seconds_total[5m]) /
rate(container_cpu_cfs_periods_total[5m]) * 100

For example, Java applications make this behavior easy to spot. CPU throttling can cause stop-the-world garbage collection pauses. A container throttled for just 15% of the time might look fine in average CPU usage, but still produce 200 ms+ latency spikes.

💡
Understanding CPU throttling is key to interpreting Kubernetes monitoring metrics—here’s how it works and what to track.

CPU Demand vs. Supply

To understand when an application wants more CPU than it’s allowed, track throttled periods directly:

rate(container_cpu_cfs_throttled_periods_total[5m]) /
rate(container_cpu_cfs_periods_total[5m]) * 100

This ratio shows the gap between CPU demand and allocation. Applications that frequently burst over their CPU limits tend to have unpredictable response times, even if overall usage appears normal.

💡
When checking Kubernetes monitoring metrics, it’s useful to know the basics—here’s how to monitor resource usage with kubectl top.

Memory Pressure

container_memory_working_set_bytes is a useful metric, but you don’t have to wait until it spikes to spot trouble. There are earlier indicators that can reveal memory pressure before it becomes a production incident.
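
As a baseline, it helps to chart the working set against the configured limit. A minimal sketch, assuming cAdvisor metrics are scraped and the container has a memory limit set:

# Working set as a percentage of the memory limit (assumes a limit is set)
container_memory_working_set_bytes{container!=""} /
container_spec_memory_limit_bytes{container!=""} * 100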

Kernel Memory Management Layers

The Linux kernel tracks memory usage in multiple layers, each influencing container behavior differently:

  • container_memory_rss – Physical memory actively used by processes in the container.
  • container_memory_cache – Filesystem cache attributed to the container.
  • container_memory_swap – Memory swapped to disk (if swap is enabled).
  • container_memory_mapped_file – Memory-mapped files, such as shared libraries.

The working set calculation follows:

working_set = rss + cache - inactive_cache

When a container approaches its memory limit, the kernel reclaims inactive cache first. Applications with little inactive cache (e.g., high active memory workloads) can hit OOM conditions sooner.

To measure non-reclaimable memory as a percentage of the limit:

container_memory_rss / container_spec_memory_limit_bytes * 100

Page Cache Behavior Under Pressure

Aggressive page cache reclaim can impact I/O performance before an OOM kill occurs:

rate(container_memory_cache[5m])

A sustained negative rate under a steady workload means the cache is shrinking. Even if the container avoids OOM, reduced caching leads to more disk reads, which can increase latency.

Early Memory Allocation Signals

container_memory_failcnt increments whenever the kernel denies a memory allocation request—this can happen well before an OOM event:

rate(container_memory_failcnt[1m])

Non-zero values here mean the application tried to allocate memory beyond its limit. How this impacts behavior depends on the application: some handle it gracefully, others may fail on the next request.

Node Memory

The kubelet uses more than just raw memory utilization when deciding to evict pods. Its eviction logic factors in multiple signals to determine when a node is under pressure.
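
Before digging into the eviction formula, a quick way to see which nodes already report pressure is the node condition series; this sketch assumes kube-state-metrics is running:

# Nodes currently reporting the MemoryPressure condition
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1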

Eviction Threshold Calculation

The kubelet defines “available” memory differently from common system utilities. It uses this formula:

memory.available = memory.capacity - memory.workingSet - memory.kernel - memory.buffer

This is not the same as node_memory_MemAvailable_bytes, which includes reclaimable cache. The kubelet excludes kernel memory that cannot be reclaimed.

PromQL example for memory available for scheduling new pods:

node_memory_MemTotal_bytes -
(node_memory_Active_bytes + node_memory_Inactive_bytes - node_memory_Cached_bytes)

Pressure Stall Information (PSI) Patterns

Modern Linux kernels expose PSI metrics that track how much time processes spend waiting for memory allocations:

rate(node_pressure_memory_waiting_seconds_total[1m]) * 1000000

Values above 100,000 (10% wait time) indicate heavy memory pressure. This signal often appears before eviction starts, but after application performance has already degraded.

💡
If you're comparing memory pressure signals in this guide, our post on how to track pod memory usage in real environments is a useful companion.

Control Plane Metrics That Affect Cluster Performance

API Server

API server latency metrics combine several internal processing phases. Each phase has its own performance characteristics and potential failure modes.

API Request Processing Pipeline

Every API request passes through a series of stages:

  1. Authentication – Verifies the client identity.
  2. Authorization – Confirms the client has the required permissions.
  3. Admission control – Validates or mutates the request.
  4. etcd write – Persists changes to the data store.
  5. Response serialization – Formats the response for the client.

While total request duration covers all phases, tracking them separately helps pinpoint where latency is introduced. For example, to measure 95th percentile authentication latency:

histogram_quantile(0.95,
  rate(apiserver_authentication_duration_seconds_bucket[5m])
)

Spikes here often indicate issues with external identity providers or certificate verification.
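
For context, it's worth comparing authentication latency against overall request latency by verb and resource; a sketch built on the standard request-duration histogram:

# 95th percentile API request latency by verb and resource
histogram_quantile(0.95,
  sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb, resource)
)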

Watch Stream Behavior

Controllers use watch streams to receive real-time updates from the API server. Watch performance directly affects controller responsiveness.

apiserver_registered_watchers

A large number of active watchers (1,000+ per resource type) can increase CPU and memory load on the API server due to filtering and serialization work.
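
To see where that watch load concentrates, the same gauge can be grouped by resource; treat the label names as a sketch, since they can vary slightly across Kubernetes versions:

# Active watchers per resource type
sum(apiserver_registered_watchers) by (group, kind)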

Admission Controller Effect on Latency

Admission controllers can slow down specific resource operations, especially when validating webhooks are involved.

histogram_quantile(0.99,
  rate(apiserver_admission_controller_admission_duration_seconds_bucket[5m])
)

Slow webhook responses or network timeouts to webhook services can cause delays of up to 30 seconds, stalling API requests.

etcd Performance

etcd sits at the core of Kubernetes control plane operations, and its performance directly influences cluster responsiveness. Multiple metrics help track different aspects of its health.

Raft Consensus Overhead

etcd uses the Raft consensus algorithm, which requires a majority of nodes to agree on every write. In a 3-node cluster, losing one node increases write latency because every commit now depends on the remaining two nodes.

histogram_quantile(0.99, 
  rate(etcd_network_peer_sent_duration_seconds_bucket{type="send_snapshot"}[5m])
)

Values above 1 second indicate possible network issues between etcd peers. These delays can cascade into API server request timeouts.
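
Frequent leader elections are another sign of peer-level trouble; etcd exposes a counter for this, and anything beyond an occasional change per hour is worth investigating:

# Leader changes observed by this member over the last hour
increase(etcd_server_leader_changes_seen_total[1h])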

Backend Database Growth

etcd stores all data in BoltDB files. Without regular compaction, database growth can degrade performance.

# Database growth over the last hour (the size metric is a gauge, so use delta)
delta(etcd_mvcc_db_total_size_in_bytes[1h])

A sustained growth rate above 100 MB/hour suggests compaction is not keeping up. Large database files (8 GB or more) increase startup times and memory consumption.
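
It also helps to compare the current database size against the configured backend quota, since etcd refuses writes once the quota is exceeded; a sketch using etcd's own server metrics:

# Backend database size as a percentage of the quota
etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes * 100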

Write-Ahead Log (WAL) Patterns

etcd writes changes to a write-ahead log before committing to the database. Monitoring WAL sync frequency and latency is key for spotting I/O bottlenecks:

rate(etcd_disk_wal_fsync_duration_seconds_count[1m])
rate(etcd_disk_wal_fsync_duration_seconds_sum[1m]) / 
rate(etcd_disk_wal_fsync_duration_seconds_count[1m])

Disk latency above 50 ms for WAL sync operations can introduce cluster-wide performance degradation.

Scheduler Performance

The Kubernetes scheduler matches pod requirements with available cluster resources while also optimizing for performance, locality, and policy rules. Each scheduling cycle goes through several phases, each with its own performance characteristics.

Scheduling Algorithm Phases

Every pod placement decision follows three main steps:

  1. Filtering – Removes nodes that cannot meet the pod’s requirements.
  2. Scoring – Ranks the remaining nodes based on configured priorities.
  3. Binding – Writes the decision back to the API server.

To measure latency in each phase:

histogram_quantile(0.95,
  rate(scheduler_framework_extension_point_duration_seconds_bucket[5m])
)

Filtering is typically sub-millisecond per node, while scoring can take longer—especially with custom scheduler plugins or in large clusters.

Queue Management Patterns

The scheduler maintains multiple queues, each with specific retry logic:

scheduler_pending_pods{queue="unschedulable"}
scheduler_pending_pods{queue="backoff"}
  • Unschedulable – Pods that failed placement and are waiting for cluster state changes.
  • Backoff – Pods rate-limited after multiple failed scheduling attempts.
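
A summed view of the same gauge makes it easier to alert on sustained backlog rather than momentary spikes:

# Pending pods per scheduler queue
sum(scheduler_pending_pods) by (queue)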

Affinity Rule Impact on Performance

Pod affinity and anti-affinity rules influence scheduling latency. To track the ratio of failed to total scheduling attempts:

rate(scheduler_pod_scheduling_attempts_total{result="error"}[5m]) /
rate(scheduler_pod_scheduling_attempts_total[5m])

Inter-pod anti-affinity in large clusters is particularly costly since it requires checking every potential placement against all existing pods that match the rule.

💡
If you're exploring deeper Kubernetes monitoring metrics, it's helpful to understand the role of the metrics server—this post explains how it collects and exports cluster-level resource data.

How CNI Choice Affects Network Policy Performance

Different CNI implementations enforce network policies with varying efficiency and scaling behavior. Understanding these differences is key to diagnosing latency or packet drop issues in large clusters.

eBPF vs iptables Performance

Cilium enforces policies with eBPF, while Calico's default dataplane relies on iptables. Each approach has distinct scaling patterns:

# Cilium eBPF program execution time
histogram_quantile(0.95, 
  rate(cilium_bpf_map_ops_duration_seconds_bucket[5m])
)

# Calico iptables rule processing time
rate(felix_int_dataplane_apply_time_seconds_sum[5m]) /
rate(felix_int_dataplane_apply_time_seconds_count[5m])
  • eBPF – Offers O(1) lookup times regardless of the number of policies.
  • iptables – Processes rules sequentially, so latency increases as the number of rules grows.

Connection Tracking Capacity

Both eBPF- and iptables-based CNIs rely on connection tracking to maintain state for active flows:

node_nf_conntrack_entries / node_nf_conntrack_entries_limit * 100

When utilization exceeds 80%, new connections risk being dropped, leading to intermittent connectivity problems that can be hard to trace.

Service Mesh and Storage Performance Metrics

Beyond network policy enforcement, two other parts of the Kubernetes data plane are worth tracking: service mesh sidecars and persistent volumes. Both can add measurable overhead to workloads and influence latency in ways that standard CPU and memory metrics won't capture.

Envoy Proxy Resource Use

Istio sidecars running Envoy typically have predictable resource footprints:

# Envoy memory usage across all sidecars
sum(container_memory_working_set_bytes{container="istio-proxy"})

# Envoy CPU usage per pod
avg(rate(container_cpu_usage_seconds_total{container="istio-proxy"}[5m])) 
by (namespace, pod)

Memory usage grows with the number of configured routes and clusters. In large meshes with hundreds of services, sidecar memory can exceed 100 MB per pod.
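
To find the heaviest sidecars, a quick topk over the same working-set metric narrows things down; a sketch, assuming the default istio-proxy container name:

# Ten largest Envoy sidecars by working set
topk(10, container_memory_working_set_bytes{container="istio-proxy"})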

Connection Pool and Circuit Breaker Metrics

Envoy tracks connection-level statistics that help identify capacity or reliability issues:

# Connection pool overflow (circuit breaker activation)
rate(envoy_cluster_upstream_cx_overflow_total[1m])

# Request timeout rate
rate(envoy_cluster_upstream_rq_timeout_total[1m]) /
rate(envoy_cluster_upstream_rq_total[1m]) * 100

Overflow events point to backend connection limits. Timeout rates above 1% often indicate network or backend performance problems.

Persistent Volume Performance Metrics

Storage constraints impact stateful workloads in different ways from CPU or memory limits.

CSI Driver Operation Latency

Volume attach and detach times depend on the CSI driver's responsiveness:

# Volume attach/detach latency (95th percentile)
histogram_quantile(0.95,
  rate(csi_operations_seconds_bucket[5m])
)

AWS EBS volumes usually attach in 10–30 seconds. Longer attach times can suggest EC2 API throttling or EBS service degradation.

Filesystem Performance Indicators

Disk-level metrics reveal early signs of I/O bottlenecks:

# Disk utilization percentage
rate(node_disk_io_time_seconds_total[1m]) * 100

Values close to 100% indicate disk saturation, which leads to I/O queuing and higher application latency.
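
Queueing usually shows up before full saturation. node-exporter's weighted I/O time approximates the average number of requests in flight; thresholds are workload-dependent, so treat this as a rough signal:

# Approximate average I/O queue depth per device
rate(node_disk_io_time_weighted_seconds_total[1m])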

Inode Utilization

Inode exhaustion can block writes even when disk space is available:

# Inode utilization percentage
(kubelet_volume_stats_inodes - kubelet_volume_stats_inodes_free) /
kubelet_volume_stats_inodes * 100

Workloads with many small files (e.g., node_modules, Python packages) can hit inode limits long before running out of storage capacity.

💡
For monitoring Kubernetes workloads with sidecars, understanding their CPU and memory footprint is essential—here’s a detailed look at sidecar containers in Kubernetes.

Metrics That Show Controller and Operator Health

Beyond core components like the API server, scheduler, and service mesh, Kubernetes operators and custom controllers also influence cluster responsiveness. Well-instrumented controllers surface metrics that help track reconciliation speed, error patterns, and the ability to keep up with workload changes.

Work Queue Depth Patterns

Controllers process events through internal work queues:

# Items waiting for processing per controller
workqueue_depth{name=~".*controller.*"}

A consistently high queue depth means the controller is processing events slower than they arrive. This delay can push back reconciliation for custom resources, potentially affecting application availability.
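
Depth alone doesn't show how long items wait. If the controller exposes the standard client-go workqueue histograms, queue latency is a useful companion; a sketch:

# 95th percentile time items spend waiting in the work queue
histogram_quantile(0.95,
  rate(workqueue_queue_duration_seconds_bucket{name=~".*controller.*"}[5m])
)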

Reconciliation Result Metrics

Breaking down reconciliation outcomes by result type helps identify root causes:

# Reconciliation outcomes by controller and result type
sum(rate(controller_runtime_reconcile_total{result!="success"}[5m])) 
by (controller, result)
  • Requeue – Often triggered when a dependency isn’t ready, requiring the controller to retry later.
  • Error – Suggests a configuration problem, missing dependency, or a conflict with another resource.
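
For controllers built on controller-runtime, reconcile duration rounds out the picture; labels here assume the default controller-runtime instrumentation:

# 95th percentile reconcile latency per controller
histogram_quantile(0.95,
  sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le, controller)
)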

How to Collect, Store, and Correlate High-Resolution Metrics

Once you know which Kubernetes components, controllers, and workloads to monitor, the next step is configuring how you collect, store, and connect those signals. The right collection strategy ensures you capture enough detail to diagnose problems quickly—without overwhelming your storage or query performance.

Scrape Interval Selection by Metric Type

Not all metrics change at the same rate. Matching scrape intervals to metric behavior improves visibility where it matters most:

scrape_configs:
- job_name: 'kubernetes-nodes'
  scrape_interval: 30s  # Node metrics change gradually
- job_name: 'kubernetes-pods'  
  scrape_interval: 15s  # Pod metrics need higher resolution
- job_name: 'kubernetes-apiserver'
  scrape_interval: 10s  # API server latency changes rapidly

Control Cardinality with Relabeling

High-cardinality labels can flood your storage backend with low-value series. Relabeling helps normalize noisy identifiers:

metric_relabel_configs:
# Keep only container CPU and memory series
- source_labels: [__name__]
  regex: 'container_(cpu|memory).*'
  action: keep
# Capture the base name so replica set hash suffixes are normalized
- source_labels: [pod_name]
  regex: '(.*)-[0-9a-f]{8,10}-.*'
  target_label: pod_name
  replacement: '${1}'

Pre-Compute with Recording Rules

Complex queries are expensive at runtime. Recording rules let you compute them ahead of time for faster dashboards:

groups:
- name: kubernetes-resources
  interval: 30s
  rules:
  - record: cluster:cpu_throttling:rate5m
    expr: |
      sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (namespace, pod, container) /
      sum(rate(container_cpu_cfs_periods_total[5m])) by (namespace, pod, container)

Apply Storage and Retention Strategy

Large clusters generate huge amounts of data. A retention plan balances cost, resolution, and query performance:

  • Infrastructure metrics – 30-day full resolution retention
  • Application metrics – 7-day full resolution, 90-day downsampled
  • Debug metrics – 24-hour retention

Last9 handles high-cardinality telemetry efficiently, integrating with OpenTelemetry and Prometheus to connect metrics, logs, and traces while keeping storage costs predictable. Teams at Probo, CleverTap, and Replit rely on this model for complex observability scenarios.

Correlate Metrics Across System Layers

Correlation transforms raw metrics into actionable insights. Linking signals from different layers reveals root causes that isolated graphs can miss.

Resource Pressure Correlation

Correlating pod restarts with node memory pressure helps separate node-level saturation from application-level faults:

(
  increase(kube_pod_container_status_restarts_total[10m]) > 0
  and on(node) 
  (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
)

Control Plane and Controller Correlation

Slow API server responses can cascade into controller backlogs:

(
  histogram_quantile(0.95, 
    rate(apiserver_request_duration_seconds_bucket[5m])) > 0.5
  and
  delta(workqueue_depth[5m]) > 0
)

Final Thoughts

With tuned scrape intervals, controlled cardinality, targeted retention, and multi-layer correlation, you can spot early warning signs before they affect users. CPU throttling rates, memory pressure signals, and API server latency percentiles often show up first, especially when tied directly to workload performance.

At Last9, we’ve built the platform to make this practical at scale. For the Kubernetes metrics we’ve covered here, we help you:

  • Track high-cardinality metrics without compromise – Pod-level CPU throttling, per-container memory pressure, and controller queue depth without dropping labels.
  • Run real-time streaming aggregations – Pre-compute metrics like throttling ratios or backlog growth so dashboards and alerts stay fast.
  • Set retention that matches the value of your data – 30 days for infra metrics, 90 days downsampled for application metrics, 24 hours for debug metrics, without fighting Prometheus configs.
  • Correlate metrics, logs, and traces in one place – See how infrastructure issues impact application performance without switching tools.

We’re OpenTelemetry- and Prometheus-native, so you can plug in kube-state-metrics, node-exporter, and OTel collectors with zero re-instrumentation.

Get started for free today and see how it works with your existing setup.

FAQs

How to see Kubernetes metrics?
Use kubectl top nodes and kubectl top pods for basic resource usage. For detailed metrics, deploy Prometheus with kube-state-metrics and node-exporter, or use managed solutions like Last9. Access metrics through the kubelet’s /metrics endpoint on port 10250 or the metrics-server API.

What is the best tool to monitor a Kubernetes cluster?
Prometheus remains the standard for Kubernetes monitoring due to its native integration and extensive ecosystem. Managed solutions like Last9 provide similar capabilities without operational overhead. Grafana pairs well with Prometheus for visualization and alerting.

What are the security metrics of Kubernetes?
Key security metrics include RBAC policy violations (apiserver_audit_total), failed authentication attempts (apiserver_authentication_total{result="failure"}), admission controller rejections, and privileged container usage. Monitor certificate expiration times and API server access patterns for anomalies.

What are the 4 C’s of Kubernetes security?
The 4 C’s are Cloud (infrastructure security), Cluster (Kubernetes configuration), Container (image and runtime security), and Code (application security). Each layer has specific metrics—cloud provider APIs, cluster RBAC events, container vulnerability scans, and application authentication logs.

What Are the Advantages of Kubernetes Monitoring?
Kubernetes monitoring provides visibility into resource utilization, application performance, and cluster health. It enables proactive scaling decisions, faster incident response, and capacity planning. Multi-layer observability helps correlate infrastructure issues with application behavior.

Why use Prometheus for Kubernetes monitoring?
Prometheus integrates natively with Kubernetes service discovery, uses pull-based collection that scales well, and provides powerful query capabilities with PromQL. The ecosystem includes purpose-built exporters for Kubernetes components and extensive community support.

Which cloud provider has the best Kubernetes experience?
Each provider offers different strengths: GKE provides the most mature managed experience, EKS integrates deeply with AWS services, and AKS offers good hybrid cloud capabilities. The “best” depends on your existing infrastructure, compliance requirements, and team expertise.

What Kubernetes Metrics Should You Measure?
Focus on CPU throttling rates, memory pressure indicators, API server latency, pod restart patterns, and node resource availability. Application-specific metrics like request rates and error percentages provide business context. Network and storage performance metrics complete the observability picture.

What is kube-prometheus-stack?
kube-prometheus-stack is a Helm chart that deploys a complete monitoring stack including Prometheus, Grafana, Alertmanager, and various exporters. It provides pre-configured dashboards, alerting rules, and service discovery for Kubernetes components out of the box.

What are some common dashboards?
Popular Kubernetes dashboards include cluster resource overview, node-level performance, pod resource usage by namespace, and API server performance. Grafana’s official Kubernetes dashboards (IDs 8588, 6417, 7249) provide good starting points. Application-specific dashboards complement infrastructure monitoring.

How can I set up alerts for Kubernetes monitoring metrics?
Configure Prometheus alerting rules in YAML format, then route notifications through Alertmanager. Alert on resource exhaustion (CPU throttling >25%, memory usage >80%), control plane issues (API server latency >500ms), and pod health (restart rates, scheduling failures). Use label-based routing for different notification channels.

Authors
Anjali Udasi

Helping to make the tech a little less intimidating.
