
Build Your Kubernetes Monitoring Foundation with kube-prometheus-stack

Set up production-grade Kubernetes monitoring with kube-prometheus-stack using Prometheus, Grafana, and Alertmanager.

Nov 13th, ‘25

When you run Kubernetes at scale, one of the first challenges is understanding what the cluster is actually doing. Workloads shift around, pods restart for normal reasons, and traffic doesn't always follow the patterns you expect. Having clear signals makes day-to-day operations much easier.

That's where kube-prometheus-stack helps. It brings Prometheus, Grafana, Alertmanager, and supporting components together as a single package. Instead of wiring each tool by hand, the stack gives you a setup that already works well for most clusters.

So what exactly does kube-prometheus-stack include, and how does it fit into your observability setup? Let's unpack it.

What kube-prometheus-stack Does for You

kube-prometheus-stack is a bundled setup that brings together the key tools you need to monitor a Kubernetes cluster. Instead of installing and configuring each component yourself, you get a ready-to-use toolkit built around Prometheus and its ecosystem.

Here's what it includes:

  • Prometheus Operator — manages the entire monitoring setup using Kubernetes custom resources
  • Prometheus — collects and stores your time-series metrics
  • Grafana — turns metrics into visual dashboards
  • Alertmanager — handles alert routing and notifications
  • Node Exporter — runs on every node to expose host-level metrics
  • kube-state-metrics — reports on the state of Kubernetes objects
  • Prometheus Operator CRDs — resources like ServiceMonitor, PodMonitor, and PrometheusRule that define what Prometheus scrapes and which alerting and recording rules it evaluates

The big advantage of this stack is that it fits naturally with GitOps workflows. You describe your monitoring setup in Kubernetes YAML, and the operator translates it into Prometheus configuration, RBAC, and service discovery for you, so there's no manual wiring to keep in sync.
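
If you're starting from scratch, the whole stack installs as a single Helm chart. A minimal install might look like this (the release name and namespace are arbitrary choices):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# "monitoring" here is just an example release name and namespace
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace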

The Monitoring Architecture

Once you deploy kube-prometheus-stack, you start relying on it for daily visibility. That's when the questions usually appear:

  • Which component is actually scraping your metrics?
  • Where do alerts originate?
  • Why does a dashboard feel slightly out of sync?

Understanding how the pieces work together helps you answer these questions and gives you a clearer sense of how the stack behaves during incidents and scales with your cluster.

The Metric Collection Pipeline

At the core of kube-prometheus-stack are a few components that collect and expose metrics for Prometheus to scrape. This is where the actual data flow begins — from your nodes, through the Kubernetes API, into Prometheus.

Node Exporter runs as a DaemonSet on every node and exposes host-level data — CPU usage, memory pressure, disk activity, and network traffic. This gives you visibility into the machines that power your cluster.

kube-state-metrics looks at the cluster from another angle. Instead of reading system files, it queries the Kubernetes API and emits metrics about object states. It shows how many pods are running, whether a deployment is progressing, or if a StatefulSet is waiting for replicas.

While Node Exporter shows how resources are used, kube-state-metrics helps you see how workloads behave. Together, they create a complete view of infrastructure and orchestration.

Prometheus then scrapes these sources, plus any additional exporters you add, on a schedule you configure (typically every 15 to 30 seconds). It stores everything in its time-series database, ready for queries, alerts, and dashboards.
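
A quick way to confirm this pipeline is healthy is the up metric, which Prometheus sets to 1 for every target it scrapes successfully and 0 when a scrape fails:

# Targets that failed their most recent scrape
up == 0

# How many targets each scrape job currently has
count by (job) (up)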

How Prometheus Operator Handles Configuration

Once you understand how metrics move through the stack, the next question is how Prometheus knows what to scrape and when. That's handled by the Prometheus Operator.

The operator watches for custom resources in your cluster and updates Prometheus automatically when something changes. When you create a ServiceMonitor, it adds new scrape targets right away — no restarts or manual reloads.

  • A ServiceMonitor defines which services to watch, which ports to scrape, and how often to do it.
  • A PodMonitor applies the same idea to pods.
  • A PrometheusRule contains alert and recording rule definitions.

Because this setup is declarative, you can store your monitoring configuration in Git alongside your application manifests. It becomes part of the same workflow you already use to manage Kubernetes.

What Metrics You Actually Have Access To

Before you build dashboards or alerts, you should get a clear view of what metrics your stack already exposes. kube-prometheus-stack pulls data from a few key sources that cover different parts of the system.

Container-level metrics come from cAdvisor, which is built into the kubelet and exposes per-container resource usage. You'll typically see:

  • container_cpu_usage_seconds_total – CPU usage
  • container_memory_working_set_bytes – memory in use
  • container_fs_usage_bytes – filesystem usage

Node-level metrics come from Node Exporter and usually start with the node_ prefix — for example:

  • node_cpu_seconds_total
  • node_memory_MemAvailable_bytes
  • node_filesystem_size_bytes

These show what's happening at the host level.

Kubernetes resource metrics come from kube-state-metrics and use the kube_ prefix — such as:

  • kube_pod_status_phase
  • kube_deployment_status_replicas

These describe how Kubernetes objects behave — deployments, pods, and nodes.

Some metrics generate a huge number of label combinations — what Prometheus calls high cardinality. Labels like user ID, pod name, or request ID can quickly multiply the number of active series. That can slow queries and use more memory.

To check which metrics have the most series, run this in Prometheus:

topk(10, count by (__name__, job)({__name__=~".+"}))

If you see a metric with an unusually high count, look closer at the labels that might be expanding it.
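
Once a metric stands out, count how many distinct values a suspect label contributes. The metric and label below are just examples; swap in whatever your topk result surfaced:

count(count by (pod) (container_cpu_usage_seconds_total))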

Configure What Prometheus Scrapes

Once you've figured out what metrics exist in your cluster, you'll need a way to tell Prometheus where to collect them from. kube-prometheus-stack does this through two custom resources — ServiceMonitor and PodMonitor. They both define scrape targets, just at different levels.

ServiceMonitor — scraping through Services

A ServiceMonitor is the most common approach. It works through Kubernetes Services, which makes it reliable even when pods scale or move around.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitoring
spec:
  selector:
    matchLabels:
      monitoring: enabled
  endpoints:
  - port: metrics
    interval: 30s

Here's how this behaves in practice:

  • The Prometheus Operator sees the new ServiceMonitor and updates the Prometheus configuration automatically.
  • Any Service with the label monitoring: enabled gets scraped from its metrics port every 30 seconds.
  • Because it's service-based, you don't have to adjust anything when pods restart or scale — the Service abstraction keeps things stable.

This setup works well for most workloads because it follows how Kubernetes already manages traffic between pods.
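
For the ServiceMonitor above to match anything, the Service itself needs the monitoring: enabled label and a port named metrics. A minimal sketch, where the app selector and port number are assumptions about your workload:

apiVersion: v1
kind: Service
metadata:
  name: app
  labels:
    monitoring: enabled   # matches the ServiceMonitor selector
spec:
  selector:
    app: my-app           # assumed pod label
  ports:
  - name: metrics         # referenced by the ServiceMonitor's port field
    port: 9090
    targetPort: 9090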

PodMonitor — scraping pods directly

A PodMonitor works similarly but skips the Service layer. You'd use it when:

  • your application exposes metrics on a non-standard port,
  • there's no Service in front of the pods, or
  • you need per-pod metrics for debugging, testing, or analysis.

This approach gives you more control, but it can create more scrape targets to manage. It's useful for workloads like Jobs, DaemonSets, or custom operators that don't always have a Service.
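
A minimal PodMonitor looks much like a ServiceMonitor, except it selects pods directly and uses podMetricsEndpoints. The labels and port name here are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
spec:
  selector:
    matchLabels:
      monitoring: enabled
  podMetricsEndpoints:
  - port: metrics
    interval: 30s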

In most cases, ServiceMonitor covers what you need. You can mix both in the same cluster — ServiceMonitor for stable applications, and PodMonitor for components that run outside the usual service flow.

Alerts Drive Action in Prometheus

Once you've got metrics flowing, the next step is deciding which ones should raise a flag when something goes wrong. Prometheus lets you define alert rules in PromQL — simple expressions that evaluate conditions and trigger when those conditions hold for a set time.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
spec:
  groups:
  - name: app
    rules:
    - alert: HighCPUUsage
      expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
      for: 5m
      annotations:
        summary: "Pod {{ $labels.pod }} CPU usage is above 80%"

In this case, the alert fires if CPU usage stays above 80% for five minutes. The for field keeps short-lived spikes from creating noise.

Once a rule triggers, Alertmanager takes over. It handles grouping, routing, and delivery — sending alerts to Slack, PagerDuty, or any webhook you've configured. You can also define silences and inhibition rules to keep related alerts from firing together.
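
Routing lives in the Alertmanager configuration. Here's a sketch of what that might look like, assuming a Slack webhook and a PagerDuty routing key you supply yourself:

route:
  receiver: slack-default
  group_by: ['alertname', 'namespace']
  routes:
  - receiver: pagerduty-oncall
    matchers:
    - severity = critical
receivers:
- name: slack-default
  slack_configs:
  - api_url: <your-slack-webhook-url>
    channel: '#alerts'
- name: pagerduty-oncall
  pagerduty_configs:
  - routing_key: <your-pagerduty-routing-key>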

The goal isn't to have more alerts — it's to have the right ones. Focus on signals that call for real attention rather than those that just add background noise.

Where All That Metric Data Lives

After setting up alerts, the next question is usually about data — how Prometheus stores it and how long you can keep it before storage starts to matter.

Prometheus saves everything as time-series data: a metric name, labels, a timestamp, and a value. Each unique label combination creates a separate series.

container_memory_usage_bytes{pod="api-server", namespace="production"}
container_memory_usage_bytes{pod="api-server", namespace="staging"}

Even though these share a metric name, they're stored as two distinct series. That's important because retention applies to the entire dataset, not individual metrics.

By default, Prometheus keeps 15 days of data. That's fine for development, but production environments often need 30–60 days for trend analysis, capacity planning, and debugging.

You can estimate how much space you'll need with this formula:

Disk Space = Retention (seconds) × Ingestion Rate (samples/sec) × Bytes per Sample (~2 bytes compressed)

If you're collecting about 50,000 samples per second and want 30 days of retention, that's roughly 259 GB of storage.

As your cluster grows, it helps to use a tiered approach — detailed data for a week, aggregated data for a month, and long-term archives in external storage.

Thanos handles this well. It runs alongside Prometheus, uploads metric blocks to object storage (S3, GCS, or Azure Blob), and lets you query across local and archived data from one place.
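
If you run the stack through its Helm chart, enabling the Thanos sidecar is mostly a values change. A sketch, assuming you've already created a secret holding your object storage configuration (exact field names can vary between chart versions):

prometheus:
  prometheusSpec:
    thanos:
      objectStorageConfig:
        name: thanos-objstore-config   # assumed secret name
        key: objstore.yml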

Write PromQL Queries You Use Every Day

Once your metrics start coming in, PromQL becomes the tool you reach for most often. It helps you spot trends, track usage, and figure out what's really happening inside the cluster.

Pod restarts in the last hour

increase(kube_pod_container_status_restarts_total[1h])

Shows how many times containers restarted within the past hour — a quick way to catch crashing pods.

Nodes under memory pressure

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9

Returns nodes where more than 90% of memory is in use (less than 10% available), a useful signal before workloads start getting evicted.

CPU utilization by namespace

sum by (namespace) (rate(container_cpu_usage_seconds_total[5m])) /
sum by (namespace) (kube_pod_container_resource_limits{resource="cpu"})

Compares actual CPU use against the configured CPU limits, giving you a sense of how efficiently each namespace uses what it's been allowed.

Deployments missing replicas

kube_deployment_status_replicas_unavailable > 0

Highlights deployments that aren't fully available — handy for spotting stalled rollouts or missing resources.

Most useful queries combine aggregations (sum, avg), label filters (by, without), and rate functions (rate, increase) to turn raw metrics into signals you can reason about.

Once you've got queries that answer the right questions, the next move is to make that data easy to see and share.

How Grafana Fits Into the Stack

Once your PromQL queries start giving you useful data, the next question is — how do you make sense of all those numbers at scale? Grafana fills that gap. It turns Prometheus metrics into something visual, helping you spot patterns, changes, and failures without sifting through raw data.

kube-prometheus-stack ships with several pre-built Grafana dashboards that cover the basics: node health, pod performance, cluster state, and resource usage. They're great for quick visibility, but you'll eventually want your own dashboards tuned to how your systems behave.

When setting one up, here's what usually works:

  1. Set Prometheus as your data source. That's where all your metrics already live.
  2. Write queries you trust. The same PromQL you'd run in the Prometheus console can power every panel in Grafana.
  3. Choose visuals that tell the story clearly. Time series graphs are best for trends, gauges for real-time status, and tables when you need detail.
  4. Add thresholds or color rules. They make it easier to spot anomalies or approaching limits at a glance.

A useful dashboard for pods might include:

  • CPU usage over time
  • Current memory utilization
  • Recent pod restarts
  • Node availability

Keep dashboards lean. The best ones highlight what you check daily — the signals that actually help you act faster during an incident or verify a deployment. Too many charts make it harder to see what's changed; a few focused views tell you everything you need.
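
If you deploy through the Helm chart, dashboards can be managed declaratively too. The bundled Grafana runs a sidecar that loads any ConfigMap carrying the dashboard label; the label below is the chart's default, and the JSON is a bare placeholder:

apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-overview-dashboard
  labels:
    grafana_dashboard: "1"   # picked up by the Grafana dashboard sidecar
data:
  pod-overview.json: |
    { "title": "Pod Overview", "panels": [] }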

Run kube-prometheus-stack in Production

Once you move beyond development, a few new factors start to matter — storage, redundancy, and scale. kube-prometheus-stack can handle production workloads well if these are planned early.

Persistent storage

Prometheus needs durable storage so data isn't lost when pods restart. Run it as a StatefulSet with a PersistentVolumeClaim. Align data retention with your operational needs — enough for incident analysis and reporting, but not so much that storage becomes a concern.
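
With the Helm chart, retention and storage both live under prometheusSpec. A sketch with 30-day retention and a 300 GB volume; the storage class and size are assumptions for your environment:

prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd   # assumed storage class
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 300Gi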

High availability

Running multiple Prometheus replicas with the same configuration improves reliability. Each replica scrapes targets independently, and Alertmanager deduplicates alerts before sending notifications. This ensures monitoring continues smoothly even if one replica restarts or fails.
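
In the chart, this is again a values change; a minimal sketch:

prometheus:
  prometheusSpec:
    replicas: 2        # each replica scrapes all targets independently
alertmanager:
  alertmanagerSpec:
    replicas: 3        # instances cluster together and deduplicate alerts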

Resource tuning

Memory and CPU usage depend on how many time series you track and how often you scrape them. As a general rule, 100,000 active series need around 2–4 GB of RAM. Optimize scrape intervals and dashboard queries to strike a balance between visibility and resource utilization.

Multi-cluster visibility

For a few clusters, federation works well. At larger scales, Thanos helps by providing unified querying and long-term storage without overloading any single Prometheus instance.

Manage High-Cardinality Metrics

High-cardinality metrics can increase memory use and slow queries, especially as your environment grows. They usually appear when metrics include highly variable labels such as user_id, request_id, or pod names that change frequently.

You can spot them by checking series counts in the Prometheus UI or by querying for the metrics with the most unique label combinations. Once identified, a few techniques help keep things efficient:

Relabeling

Drop metrics (or rewrite labels) that aren't useful for analysis before they're written to storage. In raw Prometheus configuration that looks like this; with the operator, the same rules go under metricRelabelings in a ServiceMonitor or PodMonitor endpoint:

metric_relabel_configs:
- source_labels: [__name__]
  regex: user_action_.*
  action: drop

Aggregation

Use recording rules to store summarized data and avoid keeping every raw sample:

- record: app:requests:rate5m
  expr: rate(http_requests_total[5m])

Bucketing

Histograms group continuous values, such as request durations, into a fixed set of buckets, so you store a bounded number of series instead of one per distinct value.

Scrape filtering

If certain exporters or targets emit metrics you never use, exclude them from the scrape configuration to keep data volume manageable.

Final Thoughts

kube-prometheus-stack gives you a solid foundation for cluster visibility. Prometheus stores the metrics, Grafana adds context, and Alertmanager ensures alerts reach you on time. But as clusters grow, Prometheus alone can start to feel heavy — limited retention, growing storage needs, and high-cardinality data add up fast.

Last9 extends this setup without changing how you work. It's built on the same open standards — Prometheus and OpenTelemetry — but removes the pain of scaling and managing metric data.

With Last9, you get:

  • Durable metric storage without running or tuning TSDB retention.
  • Streaming aggregation to handle high-cardinality metrics efficiently.
  • Cross-cluster visibility through a single query layer — no federation needed.
  • Cost transparency with insights into ingestion, storage, and query usage.

You can connect your existing Prometheus or Thanos setup using remote write or the OpenTelemetry Collector—no code changes, no new agents, just more visibility and faster queries across environments.
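
As a rough sketch, pointing an existing kube-prometheus-stack install at an external backend is one remoteWrite entry in your values; the endpoint and secret names below are placeholders:

prometheus:
  prometheusSpec:
    remoteWrite:
    - url: <your-remote-write-endpoint>
      basicAuth:
        username:
          name: remote-write-credentials   # assumed secret name
          key: username
        password:
          name: remote-write-credentials
          key: password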

If you already use kube-prometheus-stack, you're halfway there. Last9 builds on that foundation with long-term reliability, predictable costs, and observability that scales with your systems.

Getting started just takes a few minutes, and if you're stuck at any point, book some time with us; our team would be happy to help!

Authors
Anjali Udasi

Helping to make the tech a little less intimidating.
