May 2nd, 2026

Kubernetes Monitoring Tools: What Actually Works at Scale

What actually works for Kubernetes monitoring at scale — not what looks good in a vendor demo with a five-pod cluster.



Here's a story I keep hearing: a team migrates to Kubernetes, points their existing monitoring at it, and everything looks fine. Dashboards are green. Alerts are quiet. Then a pod starts OOMKilling every twelve minutes, silently, on a node nobody's watching, and the only reason anyone notices is because a customer reports that their requests intermittently fail.

The monitoring was working. It was monitoring the wrong things.

Kubernetes monitoring is a different beast from VM monitoring, and most Kubernetes monitoring tools weren't built for this level of dynamism. Pods are ephemeral. IP addresses are recycled every few minutes. A single deployment can scatter across dozens of nodes. The very act of scaling up generates an avalanche of new time series that your monitoring system wasn't expecting. A deployment that looked perfectly healthy at 10 replicas starts behaving in weird, non-obvious ways at 200.

I've spent more time than I'd like to admit debugging monitoring systems that were supposed to be monitoring my actual systems. Which is, if you think about it, a profoundly depressing way to spend a Tuesday.

Here's what actually works. Not what looks good in a vendor demo with a five-pod cluster, but what holds up when you're running real workloads at real scale.

What are the best Kubernetes monitoring tools? Prometheus + Grafana is the de facto standard for Kubernetes monitoring, offering free, well-documented metric collection with kube-state-metrics and node-exporter. Datadog provides full auto-discovery but per-host pricing gets expensive with autoscaling clusters. Grafana Cloud offers managed Prometheus without the operational overhead. Last9 handles the cardinality explosion that Kubernetes creates. Pixie uses eBPF for zero-instrumentation network visibility. Most production setups combine 2-3 of these for different layers: cluster metrics, application traces, and cost visibility.


What to Monitor in Kubernetes

Before picking a tool, it helps to understand what you're actually trying to see. Kubernetes monitoring has distinct layers, and most teams get in trouble by focusing on one layer and ignoring the others.

Cluster-Level Metrics

This is the foundation. If your nodes are unhealthy, nothing else matters.

You need visibility into node health and resource utilization: CPU, memory, disk, and network across every node in the cluster. You need to know about scheduling pressure: are pods stuck in Pending because there's nowhere to put them? Are nodes being drained and not coming back?

Then there's the control plane. API server latency is the canary in the coal mine for cluster-wide issues. If the API server is slow, kubectl is slow, deployments are slow, everything is slow. Etcd health matters too. It's the source of truth for your entire cluster state, and when it gets unhappy, you get surprises like pods that vanish from the API but keep running on nodes, or config changes that don't propagate.

Most teams set up node monitoring and call it done. The control plane gets ignored until there's a cluster-wide outage and someone asks "wait, what happened to etcd?"

Pod and Container Metrics

This is where Kubernetes monitoring gets interesting and also where it starts to get expensive.

The metrics that matter here: CPU and memory requests versus limits versus actual usage. If you're not tracking all three, you're missing the picture. A pod that requests 100m CPU but is actually using 900m is stealing resources from its neighbors. A pod that has a 512Mi memory limit but routinely hits 490Mi is a ticking time bomb. One small spike and the OOM killer comes knocking. The pod memory usage guide walks through the kubectl and PromQL queries that surface this kind of headroom problem before it bites.
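
If you want to eyeball that headroom yourself, here are two PromQL sketches, assuming the standard cAdvisor and kube-state-metrics metrics are being scraped:

```promql
# How close each container sits to its memory limit (values near 1.0 are OOMKill territory)
max by (namespace, pod, container) (container_memory_working_set_bytes{container!="", container!="POD"})
  /
max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})

# Actual CPU usage versus what was requested (well above 1.0 means a noisy neighbor)
sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!="", container!="POD"}[5m]))
  /
max by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"})
```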

Restart counts are your early warning system. A pod that restarts once is a blip. A pod that's restarted fourteen times in the last hour is screaming for attention. OOMKills specifically deserve their own alert, because they indicate a fundamental mismatch between what your application needs and what you've told Kubernetes to give it.
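
Translating that into Prometheus alert rules is short. A sketch, assuming kube-state-metrics is being scraped; tune the thresholds to your own workloads:

```yaml
groups:
  - name: pod-health
    rules:
      - alert: PodRestartLoop
        # more than 5 restarts of the same container in 10 minutes
        expr: increase(kube_pod_container_status_restarts_total[10m]) > 5
        labels:
          severity: warning
      - alert: PodOOMKilled
        # a restart happened recently AND the last termination reason was OOMKilled
        expr: |
          (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
          and on (namespace, pod, container)
          (increase(kube_pod_container_status_restarts_total[15m]) > 0)
        labels:
          severity: critical
```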

Pod lifecycle events (creation, scheduling, pulling images, running, terminating) tell you whether your deployment pipeline is healthy. If pods are spending three minutes in ContainerCreating because image pulls are slow, that's not a Kubernetes problem. That's a registry problem wearing a Kubernetes mask.

Application-Level Metrics

This is what your users actually care about. Request latency, error rates, throughput. The golden signals that tell you whether your service is doing its job.

In a Kubernetes environment, this also means understanding service-to-service communication. When Service A calls Service B and gets a timeout, is the problem in A, in B, in the network policy between them, or in the fact that B just got rescheduled to a node in a different availability zone? Distributed tracing helps here, but only if you've instrumented properly, and in Kubernetes, "properly" is a moving target because the topology changes constantly.

The Cardinality Problem

Nobody warns you about this upfront. Kubernetes generates somewhere between 10x and 100x more time series than the equivalent VM-based deployment. This isn't a bug. It's a consequence of how Kubernetes works.

Every metric gets labeled with namespace, deployment, replicaset, pod, container, and node. That's six dimensions before you've added a single application-level label. If you have 10 namespaces, 50 deployments, 500 pods across 50 nodes, the combinatorial explosion is real. Add in custom labels for environment, region, team, and version, and you're looking at millions of active time series.
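
When you need to find out where your own series are coming from, Prometheus will tell you. The topk query below is expensive, so run it sparingly:

```promql
# Total active series in the local TSDB head block
prometheus_tsdb_head_series

# Top 10 metric names by active series count
topk(10, count by (__name__) ({__name__=~".+"}))
```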

This is the thing that breaks most monitoring setups. Not the complexity of Kubernetes itself, but the sheer volume of data it produces. Your Prometheus server that happily handled 100K time series on VMs starts falling over at 2M time series in Kubernetes, and suddenly you're spending more time fixing your monitoring than your actual applications.

If this sounds familiar, the high cardinality guides are worth a read. Understanding cardinality isn't optional in Kubernetes. It's survival.


Kubernetes Monitoring Tools Compared

Every tool has trade-offs. Anyone who tells you their tool has no downsides is either lying or hasn't used it at scale. If you want a broader catalogue beyond the six tools below, the 10 Kubernetes monitoring tools roundup covers the wider field including service-mesh observability and CNI-level options.

Prometheus + Grafana

The de facto standard, and for good reason. Prometheus with kube-state-metrics, node-exporter, and Grafana dashboards gives you a monitoring stack that's free, well-documented, and understood by practically every SRE on the planet.

Strengths: The community around Prometheus is enormous. If you have a question, someone's already answered it on GitHub or Stack Overflow. PromQL is powerful once you get past the learning curve — it lets you express queries that would be painful or impossible in other query languages. And kube-state-metrics gives you Kubernetes-native metrics (deployment status, pod phases, resource quotas) that you won't get from generic infrastructure monitoring. If you want to get serious with PromQL, the PromQL guide is a good place to sharpen your skills.
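
To make that concrete, here's the kind of one-liner kube-state-metrics enables (metric names are the KSM defaults):

```promql
# Deployments running fewer available replicas than they asked for
kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0
```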

Considerations: Here's where it gets real. Prometheus was designed as a single-node system. When you outgrow one Prometheus server — and in Kubernetes, you will — you need to start thinking about federation, remote write, or one of the long-term storage solutions. High availability means running multiple Prometheus replicas with deduplication. Multi-cluster monitoring means either federation (which has its own problems) or a centralized remote write target. The operational overhead of running Prometheus at scale is real, and I've seen teams where the Prometheus infrastructure becomes its own project with its own on-call rotation.

Datadog

Full Kubernetes integration with auto-discovery, live container views, network performance monitoring, and the network map that makes infrastructure feel like a video game.

Strengths: The Datadog agent does auto-discovery out of the box. Deploy it as a DaemonSet and it starts finding your services, pulling metrics, collecting logs, and correlating everything together. The unified platform (metrics, traces, logs, and now security) means you're not context-switching between six different tools during an incident. The live containers view is useful — actually useful, not demo-useful — for seeing what's happening right now across your cluster.

Considerations: Pricing. Kubernetes is the worst-case scenario for Datadog's per-host and per-container pricing model. Your cluster autoscales from 20 nodes to 80 nodes during peak traffic? You're paying for 80 nodes. You have 2,000 pods across those nodes? The container monitoring costs add up in ways that surprise people at invoice time. I've seen teams where Datadog costs more than the infrastructure it's monitoring. The Datadog pricing breakdown lays out the math if you want to run the numbers before committing.

Last9

OpenTelemetry-native monitoring with streaming aggregation designed specifically for the cardinality problem that Kubernetes creates.

Strengths: The metrics pipeline handles high-cardinality Kubernetes labels without requiring you to pre-aggregate or drop dimensions. This matters because the labels you need during an incident — the specific pod, the specific node, the specific container — are exactly the ones that cause cardinality explosions. The control plane gives you visibility into what your monitoring is costing you and lets you make informed decisions about what to keep and what to aggregate, rather than discovering you've blown your budget at the end of the month. Being OpenTelemetry-native means you're not locked into a proprietary agent — you can use the OTel Collector you're probably already running, or wire it in via the OpenTelemetry Operator on Kubernetes for CRD-driven rollouts. See the integration docs for Kubernetes-specific setup.
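
If you're already running the Collector, the wiring looks roughly like this. This is a minimal sketch, not Last9's literal configuration: the OTLP endpoint, auth header, and kube-state-metrics address are all placeholders.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-state-metrics
          static_configs:
            - targets: ["kube-state-metrics.kube-system.svc:8080"]   # adjust to your install
processors:
  k8sattributes: {}   # enriches telemetry with pod, namespace, and node metadata
  batch: {}
exporters:
  otlphttp:
    endpoint: https://otlp.example.com   # placeholder: your backend's OTLP endpoint
    headers:
      authorization: "Bearer <token>"    # placeholder credential
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]
```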

Considerations: Last9 is newer than some of the others on this list. The integration library is growing but isn't as extensive as what you'll find with Datadog or Grafana Cloud. If you need a specific integration for a niche tool, check first.

Grafana Cloud (Mimir + Loki + Tempo)

The managed version of what a lot of teams are already running self-hosted. Grafana Cloud gives you Mimir for metrics (Prometheus-compatible), Loki for logs, and Tempo for traces, all behind the Grafana UI that your team already knows.

Strengths: If your team already lives in Grafana (and many do), this is the lowest-friction path to managed monitoring. You keep your existing dashboards, your existing alerts, and your existing PromQL queries. You just stop managing the infrastructure underneath. The Grafana UI is best-in-class for dashboarding, and the correlation between metrics, logs, and traces is getting better with every release.

Considerations: Cost scales with ingest volume, and Kubernetes generates a lot of ingest volume (see the cardinality section above). Teams that move from self-hosted Prometheus to Grafana Cloud sometimes get sticker shock when they realize how much data they're actually producing. The free tier is generous for experimentation, but production Kubernetes clusters will blow past it quickly.

Pixie (by New Relic)

The eBPF-based approach. Pixie uses eBPF to capture application traffic at the kernel level, meaning zero instrumentation, zero code changes, and immediate visibility.

Strengths: The "deploy and immediately see everything" experience is impressive. You get HTTP request/response pairs, DNS queries, database calls, and more — all without touching your application code. For teams that need visibility into services they can't easily instrument (third-party software, legacy code, sidecar meshes), this is powerful. The in-cluster data processing means sensitive data doesn't leave your environment.

Considerations: eBPF can see network traffic, syscalls, and kernel events. It can't see application-internal metrics, business logic, or anything that doesn't cross a system boundary. Data is retained in-cluster by default, which is great for security but means you don't get long-term historical data unless you export it somewhere. And "in-cluster" means if the cluster goes down, your monitoring data goes with it.

Kubecost

A different angle entirely. Kubecost focuses on the cost dimension of Kubernetes monitoring.

Strengths: Per-namespace, per-deployment, and per-pod cost allocation. Right-sizing recommendations based on actual usage versus requested resources. Showback and chargeback reports for organizations where multiple teams share clusters. If "how much does this microservice actually cost to run?" is a question you need to answer, Kubecost answers it.

Considerations: Kubecost is a cost management tool that happens to show some monitoring data, not a monitoring tool that happens to show costs. You'll still need a separate solution for alerting, incident response, and deep operational visibility.


The Prometheus Question: Self-Host or Go Managed?

This deserves its own section because it's the decision that causes the most grief.

When self-hosted Prometheus works: You have a single cluster. Your retention needs are modest (two weeks or less). Your cardinality is under control. You have someone on the team who enjoys YAML and doesn't mind the occasional 3 AM page about Prometheus itself running out of memory. In this scenario, Prometheus is fantastic. It's free, it's fast, and it does exactly what it says on the tin.

When it breaks: You add a second cluster. Now you need federation or remote write. You need more than two weeks of historical data for capacity planning. Now you need long-term storage — Thanos, Cortex, or Mimir. Your cardinality grows because someone added a request_id label to a metric and now you have 50 million time series. Your Prometheus server needs 128GB of RAM and it's still OOM-killing itself during compaction.

I've watched this progression happen at multiple companies. It always starts with "Prometheus is great, we'll just add another server," and ends with a team of three people managing a distributed Prometheus infrastructure that's more complex than the application it monitors.

The options for scaling beyond single-instance Prometheus: Thanos adds a sidecar for long-term storage and a query layer for multi-cluster views. Cortex and its successor Mimir provide a horizontally scalable, multi-tenant Prometheus backend — the Thanos vs Cortex comparison covers when to pick which. Or you skip the infrastructure entirely and use a managed backend that accepts Prometheus remote write. The Prometheus monitoring guide digs deeper into these trade-offs.
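
The remote write path is less exotic than it sounds; it's a few lines of prometheus.yml. The URL below is a placeholder, so point it at Thanos Receive, Mimir, or whichever managed backend you pick:

```yaml
remote_write:
  - url: https://metrics.example.com/api/v1/write   # placeholder long-term storage endpoint
    queue_config:
      max_samples_per_send: 5000   # tune alongside max_shards if the send queue falls behind
```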

My honest take: if you're running fewer than three clusters and your team has strong Kubernetes operational skills, self-hosted Prometheus with Thanos is a solid choice. Beyond that, the operational cost of managing your own metrics infrastructure starts exceeding the cost of paying someone else to do it.


Getting Started: A Practical Checklist

If you're setting up Kubernetes monitoring from scratch, or re-evaluating a setup that's not working, here's the order I'd do things:

1. Install kube-state-metrics and node-exporter. These are non-negotiable. kube-state-metrics gives you Kubernetes object state (deployments, pods, nodes, jobs, cronjobs). node-exporter gives you machine-level metrics (CPU, memory, disk, network). Without these two, you're flying blind.

2. Get comfortable with kubectl for live debugging. Monitoring dashboards are great for trends and alerts, but when you're in the middle of an incident, kubectl is your best friend. kubectl top for live resource usage, kubectl describe pod, kubectl logs. These are the basics. The kubectl commands cheatsheet is worth bookmarking for the commands you don't use often enough to remember.

3. Configure alerts for the things that actually matter. Not everything deserves an alert. But these do: OOMKill events, pod restart loops (more than 5 restarts in 10 minutes), node NotReady conditions, persistent volume usage above 85%, and API server latency above 1 second. Start there. Resist the temptation to alert on everything — alert fatigue is the enemy of reliable operations. A starter rule file for these is sketched after this checklist.

4. Set resource requests and limits on all workloads. This isn't strictly monitoring, but without requests and limits, your monitoring data is meaningless. "CPU usage is 80%" means nothing if you don't know what 100% is. Requests and limits give your metrics context. There's a pod spec excerpt after the checklist showing the shape.

5. Decide on retention. How long do you need historical data? Two weeks is enough for incident response. Three months is useful for capacity planning. A year or more is needed for compliance or long-term trend analysis. Your retention requirement drives your architecture — two weeks fits in a single Prometheus server; a year requires a long-term storage solution.
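
Here's step 3 as a Prometheus rule file. It's a starting-point sketch that assumes kube-state-metrics, the kubelets, and the API server are being scraped (kube-prometheus-stack's defaults typically cover this); the pod-level OOMKill and restart-loop rules were sketched earlier, and every threshold here should be tuned to your workloads.

```yaml
groups:
  - name: cluster-critical
    rules:
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
      - alert: PersistentVolumeFillingUp
        # kubelet-reported PVC usage above 85%
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
        for: 15m
        labels:
          severity: warning
      - alert: ApiServerSlowRequests
        # p99 API server request latency above 1s, excluding long-lived verbs
        expr: |
          histogram_quantile(0.99,
            sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
```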
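
And step 4 as a pod spec excerpt. The numbers are illustrative, not recommendations; the point is that requests, limits, and actual usage all exist, so your metrics have something to be compared against:

```yaml
containers:
  - name: api                              # hypothetical container
    image: registry.example.com/api:1.4.2  # placeholder image
    resources:
      requests:       # what the scheduler reserves on the node
        cpu: 250m
        memory: 256Mi
      limits:         # where CPU throttling and the OOM killer kick in
        cpu: "1"
        memory: 512Mi
```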


Kubernetes Monitoring Tools FAQ

How many metrics does a typical Kubernetes cluster generate?

It depends on your workload density, but here's a rough baseline: a 50-node cluster running 500 pods with standard kube-state-metrics and node-exporter will generate around 500,000 active time series. That's before you add application metrics. With application-level instrumentation — especially if you're using OpenTelemetry or Prometheus client libraries with labels for endpoints, methods, and status codes — you can easily hit 2-5 million active time series. If someone on your team is adding high-cardinality labels (user IDs, request IDs, trace IDs) to metrics, the number goes vertical. I've seen clusters producing north of 20 million time series, most of which were never queried by anyone.

Should I use Prometheus Operator or Helm charts?

Prometheus Operator if you want full CRD-based management — ServiceMonitor and PodMonitor resources make it easy to define what gets scraped alongside your application manifests. This is the "Kubernetes-native" approach and works well for teams that are already comfortable with CRDs and operators. Helm charts (like kube-prometheus-stack) for simpler setups where you want sane defaults without writing a lot of custom resources. Honestly, kube-prometheus-stack uses the Operator under the hood anyway, so the question is really about how much control you want. Pick based on your team's Kubernetes maturity and how much time you want to spend on monitoring infrastructure versus your actual product.
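
If you go the Operator route, the artifact you'll write most often is a ServiceMonitor. A rough sketch, where the namespace, label selector, and port name are assumptions about your own Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments             # hypothetical service
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payments          # assumed label on the target Service
  endpoints:
    - port: metrics          # named Service port that exposes /metrics
      interval: 30s
```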

What's the minimum monitoring I should set up on a new cluster?

Start with three components: node-exporter for machine metrics, kube-state-metrics for Kubernetes object state, and a Prometheus instance (or managed equivalent) to scrape and store both. Then set up exactly four alerts: node NotReady for more than 5 minutes, pod OOMKilled, persistent volume usage above 85%, and API server request latency p99 above 1 second. This gives you coverage for the failure modes that actually cause outages: infrastructure failure, application memory issues, storage exhaustion, and control plane degradation. You can refine and expand from there, but these four alerts will catch the majority of problems that wake people up at night.

How do I handle monitoring across multiple Kubernetes clusters?

This is where most monitoring setups start to creak. The options, from simplest to most sophisticated: run independent Prometheus instances per cluster and use Grafana with multiple data sources to query across them (simple but limited). Deploy Thanos with a sidecar on each cluster's Prometheus and use Thanos Query for a unified view (solid but operationally complex). Use Prometheus remote write to send metrics from every cluster to a centralized backend — Cortex, Mimir, or a managed service (cleanest long-term architecture, but requires network connectivity and adds ingestion latency). Whatever you choose, use consistent label schemas across clusters. Add a cluster label to every metric — future you will be grateful.
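
The cluster label is least painful as an external label, so it rides along on everything that leaves each Prometheus. A sketch with a placeholder cluster name:

```yaml
global:
  external_labels:
    cluster: prod-us-east-1   # placeholder; keep the naming scheme identical across clusters
```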

Is eBPF monitoring a replacement for traditional instrumentation?

No, and anyone telling you otherwise is overselling it. eBPF monitoring (like Pixie) is phenomenal for network-level visibility — HTTP requests, DNS queries, database calls — without any code changes. But it can't see inside your application. It doesn't know about your business metrics, your queue depths, your cache hit rates, or your feature flag evaluations. Think of eBPF as a complement to traditional instrumentation, not a replacement. The ideal setup uses both: eBPF for the network-level view and the "I didn't know I needed to see that" moments, and explicit instrumentation for the metrics that are specific to your application's domain.


Send your Kubernetes telemetry to Last9

If your Prometheus is OOM-killing itself, your Datadog bill is climbing faster than your traffic, or you're tired of dropping labels to keep cardinality under control — Last9 is built for exactly this. OpenTelemetry-native ingest, streaming aggregation that handles Kubernetes-scale cardinality without forcing you to pre-aggregate, and a control plane that shows you what each label is costing before the invoice arrives.

Start sending Kubernetes metrics to Last9 →


