A Detailed Guide to Azure Kubernetes Service Monitoring

Track the right AKS metrics, integrate with Azure Monitor, and optimize dashboards for reliable, cost-efficient Kubernetes operations.

Aug 20th, '25

Azure Kubernetes Service (AKS) continuously generates a high volume of telemetry, ranging from node-level CPU and memory usage to request latencies and error rates within individual pods and services. Without a structured monitoring strategy, this flood of metrics can easily become noise, leaving teams blind to early warning signs.

Effective monitoring in AKS is about identifying the right signals, correlating them across layers, and acting before they impact application performance or cluster stability.

In this blog, we’ll look at core metrics, practical strategies for monitoring at scale, and tooling options designed to manage the operational complexity of Kubernetes environments.

Where AKS Fits in Your Monitoring Setup

AKS takes care of running the Kubernetes control plane — etcd, the API server, and the scheduler — but monitoring workloads, pods, and node resources is still up to your team.

It connects directly with Azure’s monitoring tools, so you can send metrics to Azure Monitor or Log Analytics without extra networking setup. You can also stream data to third-party observability platforms. This makes it easy to see Kubernetes performance next to related Azure resources such as databases, storage accounts, or virtual networks.

Key integration features:

  • Automatically detects new pods and nodes.
  • Uses Azure AD for secure access to monitoring data.
  • Lets you control retention and collection policies with cost in mind.

You can use AKS’s built-in integrations to skip the overhead of running a separate monitoring stack while still getting a unified view of your cluster and the Azure infrastructure that supports it.

💡
To see how AKS metrics fit into the bigger picture of tracking and troubleshooting across Azure services, read how Azure observability works.

The Key AKS Metrics

In AKS, certain metrics give you an early warning that something in the cluster is heading toward trouble. Grouping them into three main areas — nodes, workloads, and the control plane — makes it easier to identify most issues before they escalate.

Node-Level Resource Metrics

Node metrics reflect the actual capacity of your cluster. If these values stay high, it’s only a matter of time before workloads start failing to schedule or degrade in performance.

  • CPU usage – Sustained usage above ~80% can cause the scheduler to reject new pods and limit CPU time for existing workloads. Track both average usage for capacity planning and short spikes that could signal bursty workloads.
  • Memory pressure – Memory exhaustion leads to immediate pod evictions by the kubelet. Monitor both percentage usage and available memory in bytes. Compare requests vs. limits to see how much usable headroom remains.
  • Disk space – If /var/lib/docker or /var/log fills up, containers may fail to start or stop writing logs. In AKS, check both the OS disk and temporary disk usage — each is tracked separately.
# Show current CPU and memory usage per node
kubectl top nodes

# Inspect a node's conditions, allocatable resources, and recent events
kubectl describe node <node-name>

Pod and Container Metrics

Pod-level metrics link infrastructure health to application behavior, helping you identify whether issues are resource-related or code-related.

  • CPU throttling – When containers hit their CPU limit, throttling occurs. Rates above ~10% suggest limits are too low or the workload needs tuning. Metric: container_cpu_cfs_throttled_periods_total.
  • Memory working set – Tracks active memory use without cached data. This is a better indicator of real pressure than total memory. Compare to memory requests to find containers consistently going over the allocation.
  • Restart counts – High restart counts can point to memory leaks, failing health probes, or configuration errors. Look at both total restarts and restart frequency to gauge severity.
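
A quick way to read these signals from the command line, using the metrics server that AKS enables by default and standard kubectl sort flags:

# Per-pod CPU and memory usage, sorted by memory consumption
kubectl top pods --all-namespaces --sort-by=memory

# Surface crash-looping workloads by sorting on the first container's restart count
kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount'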

Control Plane Metrics

The control plane is the brain of the cluster. If it becomes slow or unresponsive, deployments, scaling, and even monitoring can be affected.

  • API server latency – All kubectl calls and controller loops go through the API server. Latency over ~1s for basic operations means it’s under load. Monitor both the 95th percentile and maximum latency.
  • etcd performance – etcd stores the cluster state. Sustained disk write latency above ~10ms or long commit durations can stall scheduling and updates.
  • Scheduler throughput – Measures how quickly pods are assigned to nodes. Low throughput alongside growing pending pod counts usually points to capacity shortages or scheduling constraints.
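
Even though the control plane is managed, you can still sample its metrics endpoint and check for scheduling backlog directly, assuming your user has permission to read the raw metrics endpoint:

# Sample API server request-duration histograms exposed on the /metrics endpoint
kubectl get --raw /metrics | grep apiserver_request_duration_seconds_bucket | head -n 20

# Pods stuck in Pending are a quick proxy for scheduler or capacity problems
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
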
💡
If you want a proven framework for deciding which AKS metrics deserve a place on your dashboards, see our guide on Golden Signals for Monitoring.

Azure Monitor Integration for Kubernetes

Azure Monitor Container Insights offers built-in monitoring for AKS clusters, so you can start collecting metrics, logs, and performance data as soon as it’s enabled — no separate installations or sidecar setups required.

Azure Monitor Agent and Log Analytics

The Azure Monitor Agent runs as a DaemonSet on each node. It automatically discovers running containers and collects telemetry without manual configuration or noticeable performance impact.

All collected data flows into a Log Analytics workspace, which serves as the central store for metrics and logs. The workspace’s region matters — placing it in a different region than your AKS cluster can lead to data egress charges. It also controls retention policies and access permissions.

# Create Log Analytics workspace
az monitor log-analytics workspace create \
  --resource-group myResourceGroup \
  --workspace-name myAKSWorkspace \
  --location eastus

# Link workspace to an AKS cluster
az aks enable-addons \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --addons monitoring \
  --workspace-resource-id /subscriptions/{subscription-id}/resourceGroups/myResourceGroup/providers/Microsoft.OperationalInsights/workspaces/myAKSWorkspace

Real-Time Metric Collection in AKS

Azure Monitor collects most AKS metrics at 1-minute granularity, which is enough for routine operational checks. Higher frequency can surface issues faster, but it also means more data to store and query.

With Last9, you can keep that higher-resolution data without worrying about slow queries or ballooning storage costs — even when metrics come with high-cardinality labels.

The Live Metrics view in the Azure Portal shows the cluster’s current state without processing delays — useful during deployments, scaling events, or incident response when you need immediate feedback.

# Create an AKS cluster with monitoring enabled
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-addons monitoring

# Enable monitoring on an existing cluster
az aks enable-addons \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --addons monitoring

Container Insights for Deep Analysis

Container Insights extends beyond simple metric collection by tying container performance data to cluster events. This correlation helps you pinpoint the cause of issues faster — for example, matching a sudden CPU spike to a specific deployment rollout.

Container Log Collection and Analysis

Container Insights automatically gathers stdout and stderr logs from every container in the cluster, indexing them for search without manual forwarding rules in each application.

If your applications emit structured JSON logs, Container Insights automatically parses the fields, making them filterable in queries. For example:

{
  "timestamp": "2025-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "request_id": "req-123456",
  "error": "connection timeout to database",
  "duration_ms": 5000
}

With structured logs, you can quickly filter by fields like service or level instead of searching raw text.

A KQL example to match error logs with pod inventory for faster root-cause tracing:

ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "ERROR"
| join kind=inner (
    KubePodInventory
    | where TimeGenerated > ago(1h)
) on ContainerName
| project TimeGenerated, LogEntry, PodName, Namespace
| order by TimeGenerated desc

Diagnostic Settings and Workbooks

Diagnostic settings determine which logs and metrics leave your AKS cluster and where they go. For example, you can send API server audit logs and controller manager events directly to a Log Analytics workspace:

az monitor diagnostic-settings create \
  --name aks-diagnostics \
  --resource /subscriptions/{subscription-id}/resourceGroups/myResourceGroup/providers/Microsoft.ContainerService/managedClusters/myAKSCluster \
  --workspace /subscriptions/{subscription-id}/resourceGroups/myResourceGroup/providers/Microsoft.OperationalInsights/workspaces/myAKSWorkspace \
  --logs '[{"category":"kube-apiserver","enabled":true},{"category":"kube-controller-manager","enabled":true}]'

Azure Workbooks give you ready-to-use dashboards for cluster health, resource usage trends, and application performance. They can be duplicated and customized for team-specific views — for instance, tracking only production namespaces.

Grafana Integration with Azure Monitor

While Azure Portal covers most visualization needs, teams often prefer Grafana’s dashboarding flexibility. Azure Monitor works with Grafana through its Azure Monitor data source plugin.

# Install Grafana via Helm
# (adminPassword below is a placeholder; use a Kubernetes secret in production)
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana \
  --set adminPassword=admin123 \
  --set service.type=LoadBalancer

After installation, configure the Azure Monitor data source in Grafana using a service principal with Monitoring Reader permissions. This way, you can create custom dashboards in Grafana while keeping Azure Monitor as the single source of truth for data.
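
Creating that service principal is a one-liner. The name below is just an example, and the scope can be narrowed from the resource group to a single resource if you prefer:

# Create a service principal for Grafana with read-only access to monitoring data
# (name and scope are examples; replace {subscription-id} with your own)
az ad sp create-for-rbac \
  --name grafana-azure-monitor-reader \
  --role "Monitoring Reader" \
  --scopes /subscriptions/{subscription-id}/resourceGroups/myResourceGroup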

💡
For a look at how security data from Azure WAF can complement your AKS monitoring setup, check out our guide to Azure WAF.

Advanced KQL Queries for AKS Metrics

Kusto Query Language (KQL) gives you the ability to run targeted searches against your logs and metrics, providing deep insights into cluster behavior that dashboards often miss.

These queries are particularly useful during incident response when you need quick answers without combing through multiple dashboards.

Memory and Resource Analysis Queries

Find pods with high memory usage across the cluster:

// Find pods with high memory usage
KubePodInventory
| where TimeGenerated > ago(1h)
| join kind=inner (
    Perf
    | where ObjectName == "K8SContainer" 
    | where CounterName == "memoryWorkingSetBytes"
    | where TimeGenerated > ago(1h)
) on Computer
| summarize AvgMemory = avg(CounterValue) by Name, Computer, Namespace
| extend MemoryMB = AvgMemory / 1024 / 1024
| where MemoryMB > 500  // Filter for pods using more than 500MB
| order by MemoryMB desc
| take 20

This query joins pod inventory data with performance counters to identify memory-hungry containers. The extend operator converts bytes to megabytes for easier reading, and you can adjust the threshold based on your cluster's typical resource usage.

Track CPU throttling across namespaces:

Perf
| where TimeGenerated > ago(2h)
| where ObjectName == "K8SContainer"
| where CounterName == "cpuThrottledTime"
| summarize ThrottledTime = sum(CounterValue) by bin(TimeGenerated, 5m), Namespace = extract(@"namespace_name:([^,]+)", 1, InstanceName)
| where Namespace != ""
| order by TimeGenerated desc

CPU throttling indicates containers hitting their CPU limits. High throttling values suggest either undersized limits or workloads that need optimization.

Pod Health and Stability Monitoring

Identify frequently restarting pods with context:

// Identify frequently restarting pods  
KubePodInventory
| where TimeGenerated > ago(24h)
| where RestartCount > 5
| join kind=leftouter (
    KubeEvents
    | where TimeGenerated > ago(24h)
    | where Reason in ("Failed", "FailedScheduling", "Unhealthy")
    | summarize EventCount = count() by Name
) on Name
| project TimeGenerated, Name, Namespace, RestartCount, Computer, EventCount
| order by RestartCount desc

This enhanced version correlates restart counts with Kubernetes events to understand why pods are restarting. Look for patterns in the Reason field to identify common failure modes.

Find pods stuck in the Pending state:

KubePodInventory
| where TimeGenerated > ago(30m)
| where PodStatus == "Pending"
| join kind=inner (
    KubeEvents
    | where TimeGenerated > ago(30m)
    | where Reason in ("FailedScheduling", "InsufficientMemory", "InsufficientCPU")
) on Name
| project TimeGenerated, Name, Namespace, Reason, Message
| order by TimeGenerated desc

Pending pods often indicate resource constraints or scheduling issues. This query helps identify the root cause quickly.

Node and Cluster Health Analysis

Monitor node resource pressure trends:

Perf
| where TimeGenerated > ago(4h)
| where ObjectName == "K8SNode"
| where CounterName in ("memoryWorkingSetBytes", "cpuUsageNanoCores")
| extend MetricType = case(
    CounterName == "memoryWorkingSetBytes", "Memory",
    CounterName == "cpuUsageNanoCores", "CPU",
    "Unknown"
)
| summarize AvgValue = avg(CounterValue), MaxValue = max(CounterValue) by bin(TimeGenerated, 10m), Computer, MetricType
| extend NormalizedUsage = case(
    MetricType == "Memory", (AvgValue / 1024 / 1024 / 1024),  // Memory in GB
    MetricType == "CPU", (AvgValue / 1000000000),  // CPU in cores
    0.0
)
| order by TimeGenerated desc

Track resource trends across nodes to identify capacity planning needs and potential bottlenecks before they impact workloads.

Analyze API server performance:

AzureDiagnostics
| where Category == "kube-apiserver"
| where TimeGenerated > ago(1h)
| extend RequestDuration = extract(@"audit_request_duration:(\d+\.?\d*)", 1, log_s)
| extend Verb = extract(@"verb:(\w+)", 1, log_s)
| extend StatusCode = extract(@"response_status:(\d+)", 1, log_s)
| where RequestDuration != ""
| summarize 
    AvgDuration = avg(toreal(RequestDuration)),
    P95Duration = percentile(toreal(RequestDuration), 95),
    RequestCount = count()
    by bin(TimeGenerated, 5m), Verb
| order by TimeGenerated desc

API server latency directly impacts cluster responsiveness. This query tracks request duration by operation type to identify performance bottlenecks.

Application-Level Monitoring Queries

Correlate application errors with container restarts:

ContainerLog
| where TimeGenerated > ago(2h)
| where LogEntry contains "ERROR" or LogEntry contains "FATAL"
| join kind=inner (
    KubePodInventory
    | where TimeGenerated > ago(2h)
    | where RestartCount > 0
) on ContainerName
| extend ErrorType = case(
    LogEntry contains "OutOfMemory", "Memory",
    LogEntry contains "Connection", "Network",
    LogEntry contains "Timeout", "Timeout",
    "Other"
)
| summarize ErrorCount = count() by Name, Namespace, ErrorType, RestartCount
| order by ErrorCount desc

This query helps identify whether application errors are causing container restarts, providing insights into application stability issues.

Track service mesh performance (if using Istio/Linkerd):

ContainerLog
| where TimeGenerated > ago(1h)
| where Image contains "istio-proxy" or Image contains "linkerd-proxy"
| where LogEntry contains "response_code"
| extend ResponseCode = extract(@"response_code=(\d+)", 1, LogEntry)
| extend RequestDuration = extract(@"duration=(\d+)", 1, LogEntry)
| where ResponseCode != "" and RequestDuration != ""
| summarize 
    TotalRequests = count(),
    SuccessRate = (todouble(countif(ResponseCode startswith "2")) / count()) * 100,
    AvgDuration = avg(toreal(RequestDuration))
    by bin(TimeGenerated, 5m), PodName
| order by TimeGenerated desc

Service mesh metrics provide application-level insights into request success rates and latency patterns across your microservices.

Cost and Resource Optimization Queries

Identify over-provisioned pods:

let cpu_usage = Perf
| where TimeGenerated > ago(24h)
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| summarize AvgCPU = avg(CounterValue) by PodName = extract(@"pod_name:([^,]+)", 1, InstanceName);
let memory_usage = Perf
| where TimeGenerated > ago(24h)
| where ObjectName == "K8SContainer"
| where CounterName == "memoryWorkingSetBytes"
| summarize AvgMemory = avg(CounterValue) by PodName = extract(@"pod_name:([^,]+)", 1, InstanceName);
KubePodInventory
| where TimeGenerated > ago(1h)
| join kind=inner cpu_usage on $left.Name == $right.PodName
| join kind=inner memory_usage on $left.Name == $right.PodName
| extend CPURequestMilli = toreal(extract(@"(\d+)m", 1, PodCpuRequest)) 
| extend MemoryRequestMB = toreal(extract(@"(\d+)Mi", 1, PodMemoryRequest))
| extend CPUUtilization = (AvgCPU / 1000000) / CPURequestMilli * 100
| extend MemoryUtilization = (AvgMemory / 1024 / 1024) / MemoryRequestMB * 100
| where CPUUtilization < 20 or MemoryUtilization < 30  // Low utilization thresholds
| project Name, Namespace, CPUUtilization, MemoryUtilization, CPURequestMilli, MemoryRequestMB
| order by CPUUtilization asc

This query identifies pods that consistently use much less than their requested resources, helping optimize cluster costs and resource allocation.

💡
If your AKS workloads also serve static assets or APIs to a global audience, you can see how Azure CDN fits into that setup in our post on Azure CDN for Static Assets, APIs, and Front Door.

Prometheus and Grafana on AKS

Prometheus is widely used for metrics collection, and Grafana remains the go-to tool for custom dashboards. In AKS, you can either use the Azure Monitor managed service for Prometheus or run your own Prometheus stack inside the cluster for full control over configuration and retention.
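
If you go the managed route, the metrics add-on can be switched on with a single CLI call. The flag name below follows the current az CLI; verify it against your CLI version:

# Enable Azure Monitor managed service for Prometheus on an existing cluster
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-azure-monitor-metrics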

Prometheus Operator Deployment

The Prometheus Operator handles the lifecycle of Prometheus instances, Alertmanager, and related resources. It also makes it easier to define scraping rules using Kubernetes-native objects like ServiceMonitor.

Example deployment (simplified; a full install also needs a ServiceAccount and RBAC rules, which the official manifests or Helm chart provide):

# prometheus-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-operator
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-operator
  template:
    metadata:
      labels:
        app: prometheus-operator
    spec:
      containers:
      - name: prometheus-operator
        # Pin a specific operator release in production instead of :latest
        image: quay.io/prometheus-operator/prometheus-operator:latest
        ports:
        - containerPort: 8080

Apply the configuration:

kubectl apply -f prometheus-operator.yaml

ServiceMonitor Configuration

ServiceMonitor resources tell Prometheus which endpoints to scrape and how often.

Example configuration for scraping application metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-application
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

This setup collects metrics from services labeled app: my-application every 30 seconds.
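
For the scrape to work, the target Service must carry the app: my-application label and expose a port named metrics. A quick way to confirm both, assuming the same names as above:

# List matching Services and their named ports to confirm the selector and port line up
kubectl get svc -l app=my-application \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ports[*].name}{"\n"}{end}'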

💡
Last9’s alerting lets you create precise, noise-controlled rules on any metric and route them to the right team—get started here.

Alerting Strategies That Help You

Good alerts focus on symptoms rather than causes. Instead of alerting on every metric threshold, design alerts around user-facing issues and clear action items.

Memory and CPU Alerts

Set up alerts for resource pressure before it becomes critical:

# PrometheusRule for node memory alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-alerts
spec:
  groups:
  - name: node.memory
    rules:
    - alert: NodeMemoryHigh
      expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node memory usage is high"
        description: "Node {{ $labels.instance }} has less than 10% memory available"

This alert fires when available memory drops below 10% for more than 5 minutes, giving you time to investigate before pods start getting evicted.

Application-Level SLI Alerts

Monitor Service Level Indicators (SLIs) like request success rate and response time. These metrics directly correlate with user experience:

- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is above 5% for more than 2 minutes"

Troubleshooting Common AKS Monitoring Issues

A Kubernetes monitoring setup isn’t static — AKS upgrades, workload churn, and telemetry growth can all cause gaps or slowdowns over time.
Below are common issues teams encounter in production, along with checks and remediation steps that keep your monitoring system healthy.

1. Missing Metrics After Cluster Updates

Why it happens:
When you upgrade an AKS cluster, the process can reset certain add-on configurations or overwrite namespace resources. In some cases, monitoring agents are redeployed with defaults, and custom scraping configurations are lost.

What to check:

  • Ensure the Container Insights add-on is still enabled for the cluster.
  • Confirm your custom ServiceMonitors and PodMonitor resources still exist in the monitoring namespace.
  • Check that node-level monitoring pods (like omsagent) are running on every node.

How to check:

# Check Container Insights status in AKS
az aks show \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --query addonProfiles.omsagent

# Verify monitoring pods are running on each node
kubectl get pods -n kube-system | grep omsagent

Remediation tips:

  • Reapply missing ServiceMonitor manifests from version control.
  • For missing Container Insights, re-enable the add-on via az aks enable-addons.
  • If agents are stuck in CrashLoopBackOff, review their logs for networking or permissions errors.
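
To dig into a crash-looping agent, pull the logs from the previous (failed) container run rather than the current one:

# Find the monitoring agent pods, then read logs from the last failed run
kubectl get pods -n kube-system | grep omsagent
kubectl logs -n kube-system <omsagent-pod-name> --previous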

2. High Cardinality Affecting Query or Storage Performance

Why it happens:
Kubernetes metrics naturally include labels like pod_name, container_id, and namespace. Each unique label combination forms a separate time series. At scale, this leads to tens or hundreds of thousands of series, which increases both storage usage and query time.

What to watch for:

  • Dashboards timing out on high-label queries.
  • Prometheus or Azure Monitor showing elevated ingestion rates.
  • Growing memory consumption in Prometheus instances.

Mitigation strategies:

  • Reduce retention for high-cardinality datasets (e.g., detailed pod metrics) to days instead of weeks.
  • Aggregate early — use Prometheus recording rules to store sum or avg versions of metrics at coarser dimensions (a sketch follows this list).
  • Sample selectively — for debug-level metrics, capture at a lower frequency during normal operation, and raise it only during investigations.
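
A minimal sketch of the "aggregate early" idea, assuming the Prometheus Operator setup from earlier; the rule and metric names are illustrative. It rolls pod-level CPU usage up to one series per namespace:

# Recording rule applied through the Prometheus Operator's PrometheusRule CRD
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: namespace-cpu-rollup
  namespace: monitoring
spec:
  groups:
  - name: rollups.rules
    rules:
    # One pre-aggregated series per namespace instead of one per container
    - record: namespace:container_cpu_usage_seconds:sum_rate5m
      expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
EOF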

3. Storage and Cost Optimization

Why it matters:
Telemetry isn’t free. In Azure Monitor, cost grows with ingestion and retention. In Prometheus, a growing series count means more memory usage and longer compaction times. Monitoring these aspects helps you avoid surprises.

Metrics to track in Prometheus:

# Growth rate of the TSDB symbol table in bytes per second (deriv() suits gauges)
deriv(prometheus_tsdb_symbol_table_size_bytes[5m])

# Series creation rate in the head block
rate(prometheus_tsdb_head_series_created_total[5m])

Azure-specific checks:

  • Use Azure Monitor’s Cost Analysis to view metric ingestion trends.
  • Review Log Analytics workspace retention settings to ensure long-term retention is applied only where needed.
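
To act on the retention point above, the workspace's default retention can be adjusted from the CLI (values are examples; pick what your compliance requirements allow):

# Reduce default retention on the workspace to 30 days
az monitor log-analytics workspace update \
  --resource-group myResourceGroup \
  --workspace-name myAKSWorkspace \
  --retention-time 30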

Optimization examples:

  • Use longer scrape intervals (scrape less often) for low-priority metrics.
  • Limit log collection from non-critical namespaces.
  • In Prometheus, shard workloads or use remote write to long-term storage for historical data.
💡
Monitor your AKS clusters with Last9 MCP and fix issues faster by bringing real-time context (metrics, logs, and traces) into your workflow for immediate action.

Designing Dashboards for AKS

A dashboard is only useful if it answers the right questions quickly. Instead of trying to fit every possible metric on one screen, design dashboards around specific workflows and audiences.

Audience-Specific Dashboards

Different teams care about different parts of the system:

  • Application developers focus on service-level behavior — error rates, request latency, throughput, and how recent deployments affect these metrics.
  • Platform or SRE teams focus on cluster health — node resource usage, control plane stability, and infrastructure trends that could impact workloads.

By splitting these into separate views, each team can act faster without sorting through unrelated data.

Use the Golden Signals for Services

For application-facing dashboards, the Golden Signals are a proven starting point:

  1. Latency – How long requests take to complete. Track both average and high-percentile (p95/p99) latencies.
  2. Traffic – Request rate, connections, or messages processed per second.
  3. Errors – Count of failed requests or error responses (4xx/5xx).
  4. Saturation – Resource usage (CPU, memory, queue depth) relative to available capacity.

These four metrics, shown together, give developers a clear picture of whether the service is healthy.

Grouping Metrics for Infrastructure Dashboards

For cluster and node-level dashboards, group related metrics so trends are easier to spot:

  • Capacity Planning: CPU usage, memory consumption, and disk utilization per node.
  • Performance Analysis: API server latency, scheduler throughput, and request processing rates.
  • Storage and Networking: Persistent volume usage, I/O latency, and network throughput.

Logical grouping makes it easier to see whether an issue is isolated (e.g., one node under memory pressure) or systemic.

Managing Dashboard Complexity

Dashboards overloaded with panels are hard to interpret, especially during an incident. A good starting point is 6–8 key metrics per dashboard, with links to more detailed dashboards for deep dives.

For example:

  • The main cluster health dashboard can show CPU, memory, disk usage, API latency, and scheduler status.
  • Clicking on CPU usage could lead to a more detailed view with per-node breakdowns, pod-level consumption, and historical trends.

Keeping the main dashboard lean makes it faster to read and easier to maintain.

Final Thoughts

Once an AKS cluster grows beyond a few nodes, the problem isn’t “how to get metrics” — it’s how to keep them queryable, useful, and cost-controlled. High-cardinality labels from pods, containers, and services can slow down traditional backends and inflate storage bills.

Last9 is built to solve that. For AKS users, that means:

  • Handles high-cardinality data without query lag — millions of active time series per metric, per day.
  • Metrics, logs, and traces in one place — makes cross-referencing faster during incidents.
  • Native support for OpenTelemetry and Prometheus — no need to replace what you’ve already instrumented.
  • Cardinality and ingestion analytics — see exactly where metric volume comes from and how it trends.
  • Capacity for traffic spikes — no dropped data when deployments or incidents generate bursts.

If you need request-level detail, Last9 works with Jaeger and OpenTelemetry collectors to push traces to any backend — keeping the setup flexible and portable.

Try Last9 with your AKS setup, or connect with us for a deep dive into the platform.

FAQs

How do I monitor the health of Azure Kubernetes?

Enable Azure Monitor Container Insights on your AKS cluster to get automatic health monitoring. This tracks node status, pod health, and resource utilization. You can also use kubectl commands like kubectl get nodes and kubectl top pods for quick health checks, or set up custom Prometheus monitoring for more detailed visibility.

How do you monitor Kubernetes?

Kubernetes monitoring involves tracking metrics at three levels: cluster infrastructure (nodes, API server), workloads (pods, containers), and applications (custom metrics). Use tools like Azure Monitor, Prometheus, or specialized platforms like Last9. Start with basic resource metrics, then add application-specific monitoring and alerting rules.

What is the best monitoring for Azure?

For Azure-native environments, Azure Monitor provides seamless integration with AKS and other Azure services. For more flexibility, Prometheus with Grafana offers extensive customization. Last9 provides cost-effective, managed observability that handles high-cardinality data well. The best choice depends on your team's expertise and specific monitoring requirements.

Which tools does Kubernetes use to do container monitoring?

Kubernetes uses several built-in components: kubelet exposes metrics via cAdvisor, the metrics server provides resource usage data, and kube-state-metrics exposes cluster state information. External tools like Prometheus scrape these endpoints, while Azure Monitor agents collect and forward metrics to managed services.

How to check if the Azure Kubernetes Service cluster is performing well or not?

Monitor key performance indicators: node CPU/memory utilization (should stay below 80%), pod restart rates (low is better), API server response times (under 1 second), and application-specific metrics like request success rates. Set up alerts for resource pressure and track trends over time rather than just point-in-time values.

What Kubernetes metrics can be measured?

Kubernetes exposes hundreds of metrics, including resource utilization (CPU, memory, disk), cluster state (pod counts, node status), performance data (request latency, throughput), and application metrics (custom business logic indicators). Focus on metrics that directly impact user experience and operational health rather than collecting everything.

What is the difference between the Pod resource and the AKS Node resource?

Both pod and node resources matter for different reasons. Node resources show infrastructure capacity and help with cluster scaling decisions. Pod resources reveal application behavior and help optimize workload placement. Monitor both levels—nodes for capacity planning, pods for application performance, and resource allocation efficiency.

How to monitor the Azure Kubernetes Cluster resource status in Azure Portal?

Navigate to your AKS cluster in the Azure Portal, then click "Insights" under the Monitoring section. This shows cluster performance, node status, controller logs, and workload metrics. You can also use the "Metrics" section to create custom charts and set up alerts based on specific thresholds.

Is it possible for a single AKS cluster to use multiple Log Analytics workspaces in Container Insights?

No, each AKS cluster can only send Container Insights data to one Log Analytics workspace at a time. However, you can configure additional monitoring solutions (like Prometheus) to send data to different destinations, or use Azure Monitor's cross-workspace queries to analyze data from multiple workspaces together.

What are the best practices for setting up Azure Kubernetes monitoring?

Start with Container Insights for basic visibility, then add custom metrics for applications. Use structured logging with consistent labels, set up alerts for user-facing issues rather than every metric threshold, and implement resource quotas to prevent monitoring costs from spiraling. Keep monitoring configurations in version control and test them in staging environments.

What are the best practices for monitoring Azure Kubernetes Service (AKS) clusters?

Focus on golden signals: latency, traffic, errors, and saturation. Monitor at multiple levels (infrastructure, platform, application), use meaningful alert thresholds based on SLIs, and maintain monitoring during cluster upgrades. Implement log aggregation, use high-cardinality observability tools for complex environments, and regularly review monitoring coverage for gaps.
