

Pod Memory Usage: Tracking, Commands & Troubleshooting

Learn how to track pod memory usage, run key kubectl commands, and troubleshoot spikes before they crash your Kubernetes apps.


Your containers are running, and your clusters seem fine, but then you get that dreaded alert – memory pressure.

Whether you're scaling up your infrastructure or just trying to keep things running smoothly, understanding pod memory usage isn't just nice to have – it's essential knowledge for any DevOps engineer worth their salt.

Let's cut through the noise and get straight to what matters: practical ways to track, analyze, and fix memory issues in your Kubernetes pods.

TL;DR

  • Track memory usage with kubectl top pods, metrics-server, and Prometheus
  • Key metrics to monitor: Working Set Memory, RSS, Cache Memory, and Page Faults
  • Common issues: OOMKilled pods, memory leaks, and resource contention
  • Quick fixes: Increase limits (short-term), optimize application code (long-term), implement caching strategies (smart solution)
  • Best practices: Set appropriate requests/limits, implement memory-aware autoscaling, and establish a continuous memory monitoring workflow

Essential commands:

kubectl top pods -n namespace
kubectl describe pod pod-name
kubectl get events --field-selector involvedObject.name=pod-name

Pod Memory Fundamentals

Memory in Kubernetes isn't just about RAM allocation – it's about resource efficiency and application stability. Pods consume memory in various ways, and knowing the difference between requested memory, limits, and actual usage is your first step toward mastery.

Memory Resource Types in Kubernetes

A pod's memory footprint includes:

  • Application memory: What your code actually needs to run, including heap allocations, stack memory, and any other data structures
  • Runtime overhead: The memory tax paid by your container runtime (Docker, containerd, CRI-O) – typically 10-20MB per container
  • Kernel memory: System resources your container borrows from the host, including page tables, socket buffers, and kernel modules
  • Shared memory: Memory segments shared between processes within the container
  • Container image: Memory used to store the container's layers and filesystem
💡
For a closer look at how Kubernetes handles networking behind the scenes, check out this piece on what really happens at a ContainerPort.

Key Memory Metrics Explained

Before diving into commands, it's crucial to understand that memory metrics in Kubernetes come in different flavors:

  • Working Set Memory: The subset of memory that can't be reclaimed without application impact – the most important metric for pod health
  • RSS (Resident Set Size): The portion of memory occupied in RAM (not swapped out)
  • Cache memory: File-backed pages that can be reclaimed under memory pressure
  • Anonymous memory: Memory that isn't file-backed and must be written to swap if reclaimed
  • Page faults: Minor (resolved from pages already in memory, such as the page cache) vs. major (requires reading from disk) – major faults hurt performance
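
These metrics ultimately come from the kernel's cgroup accounting, so you can inspect the raw counters behind them from inside a container. A minimal sketch, assuming a cgroup v2 node (on cgroup v1 the file lives at /sys/fs/cgroup/memory/memory.stat); the pod and namespace names are placeholders:

# Raw cgroup counters behind the working set, RSS, cache, and page fault numbers (cgroup v2)
kubectl exec your-pod-name -n your-namespace -- cat /sys/fs/cgroup/memory.stat | grep -E "^(anon|file|kernel|pgfault|pgmajfault)"

# Current total usage as the kernel sees it
kubectl exec your-pod-name -n your-namespace -- cat /sys/fs/cgroup/memory.current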

Memory Requests vs. Limits

Understanding the difference is crucial:

  • Memory requests: The guaranteed minimum amount of memory allocated to a pod (used for scheduling)
  • Memory limits: The maximum memory a pod can use before being terminated with OOMKilled

The ratio between these values creates different Quality of Service (QoS) classes:

  • Guaranteed: Requests equal limits (highest priority)
  • Burstable: Requests less than limits (medium priority)
  • BestEffort: No requests or limits specified (lowest priority, first to be evicted)
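
To see how this plays out, here's a minimal sketch of a container spec whose requests equal its limits for both memory and CPU, which lands the pod in the Guaranteed class (names and values are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "256Mi"   # equal to the request, so the pod is Guaranteed
        cpu: "250m"

Once it's running, confirm the class Kubernetes assigned:

kubectl get pod qos-demo -o jsonpath='{.status.qosClass}'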
💡
If you're frequently jumping into containers to debug memory issues, this guide on using kubectl exec might come in handy.

Essential Commands for Tracking Pod Memory Usage

When it comes to keeping tabs on memory, these commands are your best friends:

Using kubectl to check memory metrics

# Get memory usage for all pods in a namespace
kubectl top pods -n your-namespace

# Get detailed memory stats for a specific pod
kubectl describe pod your-pod-name -n your-namespace

# Get memory usage for containers within a pod
kubectl top pods your-pod-name --containers -n your-namespace

# Get resource usage across all namespaces
kubectl top pods --all-namespaces

# Watch memory changes in real-time (updates every 2 seconds)
kubectl top pod your-pod-name --watch -n your-namespace

The kubectl top command gives you a quick snapshot of current memory consumption, while describe shows you the memory requests and limits configured for your containers.

Accessing detailed container stats with crictl

For deeper insights at the container runtime level:

# Get container stats (requires SSH access to node)
crictl stats

# Get detailed stats for a specific container
crictl stats --id <container-id> --output json

Leveraging metrics-server for real-time data

If you want more granular data, metrics-server is your go-to:

# First, ensure metrics-server is installed
kubectl get deployment metrics-server -n kube-system

# Then you can get detailed metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/your-namespace/pods/"

# Get node-level memory metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes/"

# Filter for specific pods with jq (if installed)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/your-namespace/pods/" | jq '.items[] | select(.metadata.name | startswith("your-prefix"))'

Using Prometheus for long-term memory analysis

For those who prefer a dashboard view, Prometheus and Grafana make a powerful combo:

# Sample PromQL queries for memory tracking

# Total working set memory by pod
sum(container_memory_working_set_bytes{namespace="your-namespace", pod=~"your-pod-name-.*"}) by (pod)

# Memory usage rate of change (useful for detecting leaks)
rate(container_memory_working_set_bytes{namespace="your-namespace", pod=~"your-pod-name-.*"}[5m])

# RSS memory by container
sum(container_memory_rss{namespace="your-namespace", pod=~"your-pod-name-.*"}) by (container)

# Memory usage vs request ratio (efficiency metric)
sum(container_memory_working_set_bytes{namespace="your-namespace"}) by (pod) / 
sum(kube_pod_container_resource_requests{namespace="your-namespace", resource="memory"}) by (pod)

Direct /proc examination for extreme cases

When you need to go deeper, SSH into the node and examine the process directly:

# Find container process IDs
ps aux | grep [your-container-process-name]

# Examine detailed memory maps
cat /proc/<pid>/smaps

# Check overall memory status
cat /proc/<pid>/status | grep -i mem
💡
Still unclear how pods fit into the bigger picture? This breakdown of Kubernetes Pods vs Nodes clears up the confusion.

How to Read Memory Usage Output

When you run these commands, you'll see numbers – but what do they mean? Let's break it down:

| Metric | Description | Normal Range | When to Worry | Action Items |
|---|---|---|---|---|
| Working Set | Memory currently in active use | 60-80% of limit | >90% of limit or increasing over time | Increase limits or optimize code |
| RSS | Actual RAM consumption | Depends on app | Consistently >80% of working set | Check for memory-intensive processes |
| Cache | Disk cache memory (reclaimable) | 10-30% of total | Not usually concerning | Can be safely ignored in most cases |
| Page Faults | Accesses that force the kernel to map or load a page | <10/s minor, 0 major | >100/s minor, any major faults | Check disk I/O, optimize memory access patterns |
| Memory Request Utilization | Usage/request ratio | 70-90% | <50% (waste) or >100% (risk) | Right-size your memory requests |
| OOM Score | Likelihood of termination | <500 | >900 (at risk of termination) | Increase limits or reduce memory usage |
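
The OOM score in the last row isn't surfaced by kubectl; the kernel tracks it per process. If you have SSH access to the node, a rough sketch for checking it (the process name is a placeholder):

# Find the container's main process on the node
PID=$(pgrep -f your-container-process-name | head -n 1)

# Higher oom_score means the kernel will pick this process sooner under memory pressure
cat /proc/$PID/oom_score
cat /proc/$PID/oom_score_adj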

Interpreting kubectl top output

When you run kubectl top pods, you'll see output like:

NAME                       CPU(cores)   MEMORY(bytes)
nginx-6799fc88d8-bnrwl     1m           9Mi

Here's what the memory number really tells you:

  • It represents the working set memory
  • It's an instantaneous value that can fluctuate
  • It doesn't include all types of memory usage (like kernel memory)

The key isn't just collecting these metrics – it's understanding what they tell you about your application's behavior. A sudden spike in working set memory might indicate a memory leak, while high RSS with low working set could point to inefficient memory management.

💡
Fix pod memory issues in production—right from your IDE, with AI and Last9 MCP. Set up Last9 MCP → Watch demo

Decoding Memory Patterns

Different memory usage patterns indicate different issues:

| Pattern | Likely Cause | Investigation Approach |
|---|---|---|
| Steady increase over time | Memory leak | Heap dumps, profiling tools |
| Cyclical peaks and valleys | Normal garbage collection | Adjust GC parameters if valleys don't return to baseline |
| Sudden spikes | Batch processing or backpressure | Check upstream services and incoming request volume |
| Plateaus at limit | Constrained by limits | Determine if application is being throttled |
| Saw-tooth pattern | Inefficient memory reuse | Look for object churn and allocation patterns |

Container vs. Pod vs. Node Memory

Understanding the hierarchy helps with troubleshooting:

  • Container memory: Isolated to a single container process
  • Pod memory: Sum of all containers plus inter-process shared memory
  • Node memory: Physical host resource that pods compete for

When a node runs low on memory, the kubelet will start evicting pods based on QoS class and memory pressure thresholds.
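
To see how much headroom a node actually has, compare its capacity with what the kubelet reports as allocatable, and, if your cluster exposes the kubelet's configz debug endpoint, read the eviction thresholds directly; a sketch with a placeholder node name (endpoint availability varies by distribution and may be restricted on managed clusters):

# Capacity vs. allocatable memory for a node
kubectl get node your-node-name -o jsonpath='{.status.capacity.memory}{"\n"}{.status.allocatable.memory}{"\n"}'

# Kubelet eviction thresholds via the API server proxy (requires jq)
kubectl get --raw "/api/v1/nodes/your-node-name/proxy/configz" | jq '.kubeletconfig.evictionHard'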

Common Pod Memory Issues and How to Fix Them

Now for the part you've been waiting for – troubleshooting. Here are the memory issues you're likely to encounter and how to tackle them:

OOMKilled Pods: The Memory Assassin

When Kubernetes reports OOMKilled, it means your pod exceeded its memory limit and got terminated. The fix depends on the cause:

# Check for OOMKilled events
kubectl get events --field-selector involvedObject.name=your-pod-name -n your-namespace

# Look for specific OOM messages in logs
kubectl logs your-pod-name -n your-namespace | grep -i "out of memory"

# Check the last state of the container for OOM termination
kubectl describe pod your-pod-name -n your-namespace | grep -A 10 "Last State"

Diagnosing OOMKilled Events

Look for patterns in when OOMs occur:

  • During startup: Configuration issue or initialization memory spike
  • Under heavy load: Insufficient limits for peak traffic
  • After running for days: Likely memory leak
  • Random times: Possible memory fragmentation or noisy neighbors
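
To spot these timing patterns across a whole namespace, pull the last termination reason and timestamp for every container in one pass; a minimal sketch using the pod status fields (requires jq):

# List containers whose last termination was an OOM kill, with when it happened
kubectl get pods -n your-namespace -o json | jq -r '
  .items[]
  | .metadata.name as $pod
  | .status.containerStatuses[]?
  | select(.lastState.terminated.reason == "OOMKilled")
  | "\($pod)\t\(.name)\t\(.lastState.terminated.finishedAt)"'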

OOMKilled Resolution Strategies

If you see OOMKilled events, your options include:

  • Short-term fixes:
    • Increase memory limits (the quick fix)
    • Add more nodes to your cluster to reduce resource competition
    • Restart affected pods on a schedule to mitigate leaks temporarily
  • Medium-term fixes:
    • Set appropriate init container resources (they often have different requirements)
    • Implement memory caching strategies with proper TTL settings
    • Switch to more memory-efficient libraries or data structures
  • Long-term fixes:
    • Optimize your application code (the right fix)
    • Implement circuit breakers to prevent resource exhaustion
    • Consider breaking monolithic apps into smaller microservices
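
For the scheduled-restart mitigation above, a CronJob that triggers a rolling restart keeps it hands-off until the real fix lands. A sketch, assuming a Deployment named your-deployment and a ServiceAccount (restart-bot here, a placeholder) with RBAC permission to patch it:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-restart
spec:
  schedule: "0 3 * * *"   # 3 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: restart-bot   # needs RBAC to patch deployments
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - rollout
            - restart
            - deployment/your-deployment
            - -n
            - your-namespace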
💡
If you're chasing memory issues, knowing how to read logs helps—this kubectl logs guide shows you how.

Memory Leaks: The Silent Resource Drain

Memory leaks can be trickier to spot. Look for a pattern of gradually increasing memory usage that never decreases, even during low traffic periods:

# Monitor memory over time
kubectl top pod your-pod-name -n your-namespace --containers --watch

# Use Prometheus to track memory growth over longer periods
rate(container_memory_working_set_bytes{pod="your-pod-name"}[6h]) > 0

# For Java applications, trigger a heap dump
kubectl exec your-pod-name -n your-namespace -- jmap -dump:format=b,file=/tmp/heap.bin 1

Language-Specific Memory Profiling

For deeper investigation, consider language-specific profiling:

Python applications:

# Using memory_profiler
python -m memory_profiler your-script.py

Java applications:

# Using JMX to monitor memory
java -Dcom.sun.management.jmxremote -jar your-app.jar
# Then connect using tools like VisualVM

Node.js applications:

# Using Node.js built-in profiler
node --inspect your-app.js
# Then connect Chrome DevTools to analyze memory

Go applications:

# Enable pprof endpoint and capture memory profile
curl http://your-service:port/debug/pprof/heap > heap.pprof
go tool pprof -http=:8080 heap.pprof

Resource Contention: When Pods Compete

Sometimes the issue isn't with a single pod, but with resource allocation across your cluster:

# Check node resource usage
kubectl describe node your-node-name | grep -A 5 "Allocated resources"

# Get detailed node metrics
kubectl top nodes

# Check memory pressure conditions
kubectl describe node your-node-name | grep -A 5 "Conditions"

# Examine eviction thresholds
kubectl get cm -n kube-system kubelet-config -o yaml | grep eviction

Node-Level Memory Pressure Indicators

  • MemoryPressure condition: True indicates active memory pressure
  • Eviction events: Pods being terminated due to node memory constraints
  • System OOMs: Check node logs for kernel OOM killer activity with journalctl -k | grep -i "Out of memory"

If you're seeing high memory pressure across nodes, consider:

  • Adjusting QoS classes for critical pods (set identical requests and limits)
  • Implementing pod anti-affinity to spread memory-intensive workloads
  • Using vertical pod autoscaler to right-size your resource requests
  • Setting appropriate node taints and tolerations to isolate memory-hungry workloads
  • Configuring memory limits at the namespace level with ResourceQuotas
  • Implementing cluster autoscaling to automatically add nodes during pressure
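
For the namespace-level option in that list, a ResourceQuota caps aggregate memory while a LimitRange supplies defaults for pods that don't declare their own; a minimal sketch with placeholder values:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: memory-quota
  namespace: your-namespace
spec:
  hard:
    requests.memory: 16Gi   # total memory requests allowed in the namespace
    limits.memory: 24Gi     # total memory limits allowed in the namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: memory-defaults
  namespace: your-namespace
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 128Mi   # applied when a container omits requests
    default:
      memory: 256Mi   # applied when a container omits limits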

Fragmentation Issues: The Hidden Memory Tax

Memory fragmentation occurs when free memory exists but isn't contiguous enough to satisfy allocation requests:

# Check memory fragmentation on the node
cat /proc/buddyinfo  # Shows free memory blocks by size

# Check for large page support
grep Huge /proc/meminfo

If fragmentation is an issue, consider:

  • Using huge pages for large memory allocations
  • Setting appropriate ulimits for your containers
  • Restarting nodes periodically during maintenance windows
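
If you do reach for huge pages, Kubernetes treats them as a schedulable resource once the node has pre-allocated them; a sketch of the pod-side request, assuming 2Mi huge pages are configured on the node (the image name is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: hugepage-demo
spec:
  containers:
  - name: app
    image: your-image:latest
    resources:
      requests:
        memory: "256Mi"
        hugepages-2Mi: "128Mi"   # must be paired with a matching limit
      limits:
        memory: "256Mi"
        hugepages-2Mi: "128Mi"
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages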

Advanced Memory Tracking Techniques

Ready to level up your memory management game? These techniques separate the pros from the rookies:

Using cAdvisor for Container-Level Insights

cAdvisor runs as part of Kubelet and provides detailed container stats:

# Access cAdvisor metrics directly (if kubelet secure port is enabled)
curl -k https://node-ip:10250/metrics/cadvisor

# Or on some clusters, via the read-only port
curl http://node-ip:10255/metrics/cadvisor

# Filter for specific memory metrics
curl -k https://node-ip:10250/metrics/cadvisor | grep container_memory

# For Docker Desktop or Minikube
curl http://localhost:4194/metrics

cAdvisor metrics provide more granular memory data than standard kubectl commands, including:

  • container_memory_cache: Page cache memory
  • container_memory_rss: Anonymous and swap cache memory
  • container_memory_swap: Swap usage
  • container_memory_mapped_file: Memory-mapped files
  • container_memory_usage_bytes: Total current memory usage
💡
If your pods keep getting killed mysteriously, this explainer on OOM (Out of Memory) errors connects the dots.

eBPF for Deep Memory Insights

For hardcore memory debugging, eBPF tools provide kernel-level insights:

# Using bpftrace to track memory allocations (requires node access)
bpftrace -e 'tracepoint:kmem:mm_page_alloc { @pages[args->order] = count(); }'

# Using BCC tools to track memory allocations
/usr/share/bcc/tools/memleak -p $(pidof your-process)

Custom Memory Dashboards with Prometheus and Grafana

Create custom dashboards that show exactly what matters to your workloads:

# Sample Grafana dashboard JSON for pod memory
{
  "title": "Pod Memory Dashboard",
  "panels": [
    {
      "title": "Working Set Memory by Pod",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(container_memory_working_set_bytes{namespace=\"$namespace\"}) by (pod)"
        }
      ]
    },
    {
      "title": "Memory Usage vs. Requests",
      "type": "gauge",
      "targets": [
        {
          "expr": "sum(container_memory_working_set_bytes{namespace=\"$namespace\"}) by (pod) / sum(kube_pod_container_resource_requests{namespace=\"$namespace\", resource=\"memory\"}) by (pod) * 100"
        }
      ],
      "thresholds": [
        {"value": 0, "color": "green"},
        {"value": 70, "color": "yellow"},
        {"value": 90, "color": "red"}
      ]
    },
    {
      "title": "Memory Change Rate (Possible Leaks)",
      "type": "heatmap",
      "targets": [
        {
          "expr": "rate(container_memory_working_set_bytes{namespace=\"$namespace\"}[30m])"
        }
      ]
    },
    {
      "title": "OOMKilled Events",
      "type": "table",
      "targets": [
        {
          "expr": "kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\", namespace=\"$namespace\"}"
        }
      ]
    }
  ],
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)"
      },
      {
        "name": "pod",
        "type": "query",
        "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)"
      }
    ]
  }
}

Memory Anomaly Detection

Set up automated anomaly detection with Prometheus alerting rules, routed through Alertmanager:

groups:
- name: memory-alerts
  rules:
  - alert: PodMemoryLeakSuspected
    expr: deriv(container_memory_working_set_bytes{namespace="production"}[1h]) > 1024 * 1024
    for: 2h
    annotations:
      summary: "Possible memory leak in {{ $labels.pod }}"
      description: "Pod {{ $labels.pod }} shows consistently increasing memory over 2 hours"
  
  - alert: HighMemoryUtilization
    expr: sum(container_memory_working_set_bytes) by (pod) / sum(kube_pod_container_resource_requests{resource="memory"}) by (pod) > 0.9
    for: 15m
    annotations:
      summary: "High memory utilization in {{ $labels.pod }}"
      description: "Pod {{ $labels.pod }} is using >90% of its requested memory for over 15 minutes"

Memory Efficiency Scoring

Not all memory usage is equal. Create a scoring system based on:

| Metric | Weight | Calculation | Rationale |
|---|---|---|---|
| Memory efficiency | 40% | (memory_requests - memory_usage) / memory_requests * 100 | Shows resource efficiency |
| Memory stability | 25% | 1 - stddev(memory_usage[24h]) / avg(memory_usage[24h]) | Indicates predictable behavior |
| OOMKilled frequency | 20% | 1 - (oom_events[30d] / 30) | Reflects stability |
| Memory fragmentation | 15% | 1 - (memory_working_set / total_allocated) | Measures allocation efficiency |

Implementation with Prometheus:

# Memory Efficiency Score
(
  (0.4 * (1 - abs(
    sum(container_memory_working_set_bytes{namespace="production"}) by (pod) / 
    sum(kube_pod_container_resource_requests{namespace="production", resource="memory"}) by (pod) - 0.7
  ) / 0.7)) +
  (0.25 * (1 - stddev_over_time(container_memory_working_set_bytes{namespace="production"}[24h]) / 
   avg_over_time(container_memory_working_set_bytes{namespace="production"}[24h]))) +
  (0.2 * (1 - count_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled", namespace="production"}[30d]) / 30)) +
  (0.15 * (1 - (
    sum(container_memory_working_set_bytes{namespace="production"}) by (pod) / 
    sum(container_memory_usage_bytes{namespace="production"}) by (pod)
  )))
) * 100

This approach helps prioritize which pods need memory optimization first, and can be added to your cluster dashboards to provide at-a-glance health metrics for your applications.

💡
If memory isn’t the only thing spiking, this guide on monitoring container CPU usage is worth a read.

Memory Optimization Strategies That Work

Once you've identified memory issues, here's how to fix them for good:

Right-sizing Pod Memory Requests and Limits

The art of setting memory constraints is finding the sweet spot:

resources:
  requests:
    memory: "256Mi"  # Guaranteed minimum
  limits:
    memory: "512Mi"  # Maximum before OOMKilled

Too tight, and your pods get killed; too loose, and you waste resources. Here's a systematic approach:

  1. Measure baseline usage: Monitor memory for at least 1 week capturing various traffic patterns
  2. Calculate appropriate values:
    • Set requests at P50 (median) + 10-15% buffer
    • Set limits at P99 (99th percentile) + 20% buffer
  3. Consider QoS requirements:
    • Critical services: Set equal requests and limits for Guaranteed QoS
    • Background services: Allow larger gaps between requests and limits for Burstable QoS
  4. Account for JVM-based applications: Add headroom for garbage collection spikes
  5. Test under load: Verify settings handle peak traffic without OOMKilled events
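
To get the P50 and P99 figures for step 2, quantile_over_time against your baseline window does the job; a sketch of the PromQL, assuming at least a week of retention (the namespace and window are placeholders):

# Candidate request: median working set over the baseline window (add a 10-15% buffer)
quantile_over_time(0.50, container_memory_working_set_bytes{namespace="your-namespace", container!=""}[7d])

# Candidate limit: 99th percentile over the same window (add a ~20% buffer)
quantile_over_time(0.99, container_memory_working_set_bytes{namespace="your-namespace", container!=""}[7d])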

Advanced Request/Limit Strategies

For multi-container pods, consider these strategies:

# Memory-optimized sidecar configuration
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
spec:
  containers:
  - name: app
    image: main-application:v1
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "768Mi"
  - name: sidecar
    image: sidecar:v1
    resources:
      requests:
        memory: "64Mi"
      limits:
        memory: "128Mi"
  # Memory-sensitive init container
  initContainers:
  - name: init-db
    image: db-setup:v1
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "256Mi"  # Equal for Guaranteed QoS during initialization

Implementing Memory-Aware Autoscaling

Horizontal Pod Autoscaler (HPA) can scale based on memory usage:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memory-based-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 20
        periodSeconds: 120

This approach makes your system resilient to memory pressure without manual intervention. Key considerations for memory-based autoscaling:

  1. Set appropriate thresholds: 80% is typically a good target for memory utilization
  2. Configure sensible scaling behavior:
    • Scale up quickly (short stabilization window)
    • Scale down slowly (longer stabilization window)
  3. Use multiple metrics: Combine memory and CPU to avoid scaling ping-pong
  4. Consider custom metrics: For memory-intensive apps, add application-specific metrics like queue length

For even more precise control, combine HPA with Vertical Pod Autoscaler (VPA) in recommendation mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: memory-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: your-deployment
  updatePolicy:
    updateMode: "Off"  # Recommendation mode only
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        memory: "128Mi"
      maxAllowed:
        memory: "1Gi"
💡
If you're wondering why some pods get more resources than others, this breakdown of Kubernetes QoS explains the logic.

Putting It All Together: A Memory Monitoring Workflow

Here's a practical workflow you can implement today:

1. Establish Baselines

Begin by collecting baseline memory metrics for at least 1-2 weeks:

#!/bin/bash
# memory-baseline.sh: collect hourly memory snapshots
NAMESPACE="your-namespace"
OUTPUT_DIR="memory-baselines"
mkdir -p "$OUTPUT_DIR"

# Collect hourly snapshots for a week
for i in {1..168}; do
  TIMESTAMP=$(date +%Y%m%d%H%M)
  kubectl top pods -n $NAMESPACE > "$OUTPUT_DIR/memory-$TIMESTAMP.txt"
  sleep 3600
done

Analyze this data to understand normal patterns:

  • Daily/weekly usage cycles
  • Traffic-correlated spikes
  • Baseline memory after garbage collection
  • Variance between pods of the same workload
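
A quick way to turn those snapshots into numbers: the sketch below extracts the peak working set per pod from the collected files (it assumes kubectl top reported values in Mi; adjust if you see Gi):

# Peak memory per pod across all snapshots, highest first (assumes Mi units)
grep -h -v "^NAME" memory-baselines/memory-*.txt \
  | awk '{gsub(/Mi$/, "", $3); if ($3+0 > max[$1]) max[$1] = $3+0} END {for (p in max) printf "%s\t%sMi\n", p, max[p]}' \
  | sort -t$'\t' -k2 -rn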

2. Implement Multi-Layer Monitoring

Set up a comprehensive monitoring stack:

# Example Prometheus memory recording rules
groups:
- name: memory-metrics
  interval: 1m
  rules:
  - record: memory:usage:ratio
    expr: sum(container_memory_working_set_bytes{namespace="production"}) by (pod) / sum(kube_pod_container_resource_requests{namespace="production", resource="memory"}) by (pod)
  
  - record: memory:usage:rate1h
    expr: rate(container_memory_working_set_bytes{namespace="production"}[1h])
  
  - record: memory:oom:count
    expr: sum(increase(kube_pod_container_status_last_terminated_reason{reason="OOMKilled", namespace="production"}[24h])) by (pod)

Set up dashboards and alerting with multiple thresholds:

  • Warning alerts at 80% memory utilization
  • Critical alerts at 90% memory utilization
  • Trend-based alerts for steady increases
  • OOMKilled event alerts

3. Implement a Diagnostic Runbook

When a memory issue occurs, follow a systematic approach:

Memory Issue Diagnostic Checklist

Initial Assessment

# What's the current memory usage?
kubectl top pod $POD_NAME -n $NAMESPACE

# Are there any recent OOM events?
kubectl get events --field-selector involvedObject.name=$POD_NAME -n $NAMESPACE | grep -iE "kill|memory|oom"

# What are the configured requests/limits?
kubectl describe pod $POD_NAME -n $NAMESPACE | grep -A 3 "Limits:"

Root Cause Analysis

# Check recent traffic patterns (if you have Prometheus)
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{pod=~\"$POD_NAME.*\"}[5m]))"

# Check memory growth rate
curl -s "http://prometheus:9090/api/v1/query?query=deriv(container_memory_working_set_bytes{pod=\"$POD_NAME\"}[30m])"

# Check logs for clues
kubectl logs $POD_NAME -n $NAMESPACE --tail=200 | grep -iE "memory|heap|garbage|oom"

Advanced Diagnostics

  • Use language-specific profiling tools
  • Examine heap dumps or memory profiles
  • Trigger garbage collection and observe recovery
  • Test with controlled traffic increase

4. Optimize Based on Root Cause

Implement the appropriate fix based on the findings:

| Root Cause | Short-term Fix | Long-term Fix |
|---|---|---|
| Insufficient limits | Increase limits by 20-30% | Right-size based on actual usage patterns |
| Traffic spikes | Implement circuit breakers | Add HPA based on memory utilization |
| Memory leaks | Restart pods on schedule | Fix application code, add leak detection |
| Inefficient algorithms | Tune GC and buffers | Redesign data processing approach |
| Resource contention | Anti-affinity rules | Implement dedicated node pools |

5. Verify and Iterate

After implementing fixes:

  1. Update baselines and documentation
    • Record new expected memory patterns
    • Document the issue and resolution
    • Update runbooks with new findings
  2. Implement regression testing
    • Create load tests that verify memory usage
    • Add memory utilization to your CI/CD pipelines
    • Set up canary deployments to catch memory issues early

Compare metrics pre and post-fix

# Using Prometheus for before/after comparison
curl -s "http://prometheus:9090/api/v1/query_range?query=container_memory_working_set_bytes{pod=\"$POD_NAME\"}&start=$START_TIME&end=$END_TIME&step=5m"

Monitor closely for 24-48 hours

# Watch memory usage in real-time
kubectl top pod $POD_NAME -n $NAMESPACE --watch

This systematic approach creates a continuous improvement loop for memory management. Over time, your detection and resolution process becomes faster and more efficient, resulting in more stable and cost-effective Kubernetes workloads.

💡
If your app feels slower under load, this piece on Kubernetes CPU throttling might explain why.

Conclusion

The approach to managing pod memory should be both proactive and reactive:

Key Takeaways

  1. Memory metrics matter: Understand the difference between working set, RSS, and cache memory
  2. Commands are your tools: Master the kubectl, Prometheus, and cAdvisor commands for memory analysis
  3. Context is crucial: Memory issues often manifest differently under various conditions
  4. Layered approach works best: Implement fixes at multiple levels - infrastructure, Kubernetes configuration, and application code
  5. Continuous improvement: Treat memory management as an ongoing cycle of measurement, analysis, and optimization

The tools and techniques we've covered give you a solid foundation for keeping your Kubernetes environments healthy and cost-effective.

💡
What memory issues are you tackling in your Kubernetes environment? Join our Discord community, where we talk about everything with other DevOps folks.


Authors
Anjali Udasi


Helping to make tech a little less intimidating. I love breaking down complex concepts into easy-to-understand terms.