GPU at 80%.
But which pod?
DCGM tells you the device is busy. It doesn't tell you which Kubernetes pod, namespace, or Slurm job is responsible. That's the gap. l9gpu fills it.
- 1 DaemonSet per node, no sidecars
- Vendor-neutral: NVIDIA · AMD · Gaudi
- Kubernetes + Slurm, both supported
- OTLP out, any backend
The problem
You have Prometheus.
You have Grafana.
You have DCGM.
You still can't answer this question.
"Complete black box. Zero visibility into which pod is actually eating up the VRAM and compute utilization on those slices."
"All pods show identical values with GPU time-slicing. namespace/pod/container labels missing on MIG GPU."
"DCGM + Prometheus + Grafana — 4 moving parts solving what should be one question."
"We have observability for CPU and memory and APM for code — but nothing for the GPU and inferencing part."
"We have A100s reserved through 2026 that barely hit 20% utilization. Finance treats them like insurance, not infrastructure."
The attribution layer — the join between GPU hardware metrics and Kubernetes or Slurm workload identity — is what's missing. l9gpu is that layer.
How it works
One agent. Attribution at collection time.
l9gpu runs as a DaemonSet on every GPU node. It reads directly from NVML and DCGM, enriches each metric with the workload consuming that device, and ships OTLP to whatever backend you already have. No PromQL joins. No brittle label pipelines.
- Hardware source: NVML / DCGM / amdsmi / hl-smi
- l9gpu node agent (DaemonSet): the attribution layer, OTLP source
- k8sprocessor / slurmprocessor: pod · namespace · job enrichment
- OTLP export: Prometheus · Grafana · any backend
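The mechanics of that join are worth seeing. Below is a minimal sketch of collection-time attribution on an NVIDIA node, using the pynvml bindings; the cgroup regex and the printed sample shape are illustrative assumptions, not l9gpu's actual internals:

import re
from pathlib import Path

import pynvml  # nvidia-ml-py bindings

# Illustrative pattern for the pod UID embedded in kubepods cgroup paths;
# real paths vary by cgroup driver (systemd vs cgroupfs).
POD_UID_RE = re.compile(r"pod([0-9a-f_-]{36})")


def pod_uid_for_pid(pid: int) -> str | None:
    """Map a GPU-bound PID to its Kubernetes pod UID via /proc/<pid>/cgroup."""
    try:
        cgroup = Path(f"/proc/{pid}/cgroup").read_text()
    except OSError:
        return None  # process exited between samples
    match = POD_UID_RE.search(cgroup)
    return match.group(1).replace("_", "-") if match else None


pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu

# One enriched sample per process holding a compute context on GPU 0.
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    print({
        "gpu": 0,
        "gpu_utilization": util,
        "pid": proc.pid,
        "used_memory_bytes": proc.usedGpuMemory,
        "pod_uid": pod_uid_for_pid(proc.pid),
    })
pynvml.nvmlShutdown()

The pod UID recovered from the cgroup path is then resolved to pod, namespace, and deployment through the Kubernetes API. The point: the join happens on the node at sample time, not afterwards in PromQL.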
Before, from DCGM alone:

DCGM_FI_DEV_GPU_UTIL{
  gpu="0",
  device="nvidia0",
  modelName="A100"
} 83

After, from l9gpu:

gpu_utilization{
  gpu="0",
  pod="inference-api-7f9d",
  namespace="production",
  deployment="inference-api",
  node="gpu-node-03",
  cluster="ml-cluster-us"
} 83

Platform support
Works wherever your GPUs are.
GPU Hardware
- NVIDIA: NVML + DCGM · A100, H100/H200, B200/GB200, T4, A10, L4
- AMD: amdsmi · MI300X, MI325X
- Intel Gaudi: hl-smi · Gaudi 2, Gaudi 3

Workload Orchestration
- Kubernetes: pod · namespace · deployment · node · cluster · cloud metadata
- Slurm: job ID · user · account · partition · QoS (attribution sketch below)
- Bare metal: process-level attribution, systemd service
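On Slurm and bare metal there is no kubepods cgroup to parse, but the GPU-bound PID still carries its identity in its environment. A minimal sketch of the idea, again an assumption about the approach rather than a description of the slurmprocessor internals:

from pathlib import Path


def slurm_job_for_pid(pid: int) -> str | None:
    """Best-effort read of SLURM_JOB_ID from a process's environment."""
    try:
        environ = Path(f"/proc/{pid}/environ").read_bytes()
    except OSError:
        return None  # process gone, or insufficient privileges
    for entry in environ.split(b"\x00"):
        if entry.startswith(b"SLURM_JOB_ID="):
            return entry.split(b"=", 1)[1].decode()
    return None

Slurm exports SLURM_JOB_ID (and related variables for account, partition, and user) into every task it launches, so the PIDs NVML reports per device are enough to recover the job.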
Inference Engines
Per-engine GPU metrics, not just per-device aggregates
What's included
Useful on day one.
Not after three weeks of setup.
- 17 pre-built alert rules
  Across 3 PrometheusRule CRDs: GPU temperature, throttling, ECC errors, XID events, and idle utilization. Pod and namespace appear on every fired alert via k8sprocessor enrichment.

- Grafana dashboards
  Multi-cluster fleet view, per-pod workload attribution, DCGM profiling, inference-engine breakdown, and health/reliability panels.

- XID + ECC at job level
  When XID errors increase on gpu03, you know which Slurm job or Kubernetes pod was running, not just which node.

- GPU chargeback, ready to query
  Team A consumed 340 GPU-hours in April; Team B consumed 60. Every metric is labeled with namespace and deployment, so cost queries are trivial (see the query sketch after this list).

- Works with your existing stack
  OTLP out: Prometheus, Grafana Cloud, Datadog, any OTLP-compatible backend. Nothing proprietary, no lock-in, and a one-config path to Last9 if you want it.

- Derived from Meta's GCM
  Built on the same foundation Meta uses to monitor hundreds of thousands of GPUs in production AI research clusters.
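Because namespace and deployment ride on every sample, chargeback is an ordinary Prometheus query. A hedged sketch against the Prometheus HTTP API, assuming the gpu_utilization metric shown earlier and a Prometheus reachable at PROM_URL; the exact metric names l9gpu exports may differ:

import requests

PROM_URL = "http://prometheus.monitoring:9090"  # adjust for your setup

# Rough GPU-hours per namespace over a 30-day window: each gpu_utilization
# series is one GPU/pod pair reporting 0-100, so mean utilization / 100
# times 720 hours approximates utilized GPU-hours for that series.
# (Rough because avg_over_time ignores gaps, overweighting short-lived pods.)
QUERY = "sum by (namespace) (avg_over_time(gpu_utilization[30d]) / 100 * 720)"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    ns = sample["metric"].get("namespace", "<none>")
    print(f"{ns}: {float(sample['value'][1]):.1f} GPU-hours")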
Install
Running in 60 seconds.
One Helm install for Kubernetes. One pip install for bare metal. Metrics start flowing to your OTLP endpoint immediately.
View full docs on GitHub.

Sending GPU metrics to Last9? Follow the GPU Telemetry integration guide for the OTLP endpoint, auth headers, and dashboard import.
Kubernetes:

kubectl create secret generic l9gpu-otlp \
  -n monitoring \
  --from-literal=OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend/v1/metrics

helm repo add l9gpu https://last9.github.io/gpu-telemetry
helm install l9gpu l9gpu/l9gpu \
  -n monitoring \
  --create-namespace \
  --set otlpSecretName=l9gpu-otlp

Bare metal:

pip install l9gpu
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend/v1/metrics
l9gpu nvml_monitor \
  --sink otel \
  --cluster my-cluster

After install you get:
- Per-pod GPU utilization in Prometheus, labeled immediately
- 17 alert rules active — no PromQL to write
- Grafana dashboard importable with one click
- Slurm job → GPU attribution visible within one collection cycle
What you can finally answer
Questions finance and engineering leads actually ask.
- Which team consumed the most GPU hours this month?
- Which pod caused that utilization spike at 2am?
- What was running on gpu03 when XID errors fired?
- Which vLLM instance is burning GPU without serving requests?
- Is our H100 utilization 5% because of idle GPUs or bad workload scheduling?
- Which Slurm job account should we bill for this training run?
Stop guessing which pod is burning your GPU budget.
MIT licensed. One DaemonSet. Metrics with workload identity in under 60 seconds.
Start observing for free. No lock-in.