Open source · MIT licensed

GPU at 80%.
But which pod?

DCGM tells you the device is busy. It doesn't tell you which Kubernetes pod, namespace, or Slurm job is responsible. That's the gap. l9gpu fills it.

  • 1 DaemonSet per node, no sidecars
  • NVIDIA · AMD · Gaudi, vendor-neutral
  • Kubernetes + Slurm, both supported
  • OTLP out, to any backend

The problem

You have Prometheus.
You have Grafana.
You have DCGM.
You still can't answer this question.

"Complete black box. Zero visibility into which pod is actually eating up the VRAM and compute utilization on those slices."
Platform engineer, multi-tenant A10 cluster
"All pods show identical values with GPU time-slicing. namespace/pod/container labels missing on MIG GPU."
DCGM-exporter GitHub, issues #577 and #582
"DCGM + Prometheus + Grafana — 4 moving parts solving what should be one question."
r/kubernetes, KEDA GPU Scaler thread
"We have observability for CPU and memory and APM for code — but nothing for the GPU and inferencing part."
r/devops, GPU observability thread
"We have A100s reserved through 2026 that barely hit 20% utilization. Finance treats them like insurance, not infrastructure."
r/kubernetes, 95% idle GPU thread

The attribution layer — the join between GPU hardware metrics and Kubernetes or Slurm workload identity — is what's missing. l9gpu is that layer.

How it works

One agent. Attribution at collection time.

l9gpu runs as a DaemonSet on every GPU node. It reads directly from NVML and DCGM, enriches each metric with the workload consuming that device, and ships OTLP to whatever backend you already have. No PromQL joins. No brittle label pipelines.

  1. Hardware source: NVML / DCGM / amdsmi / hl-smi
  2. l9gpu — attribution layer
    1. node agent DaemonSet · OTLP source
    2. k8sprocessor / slurmprocessor · pod, namespace, and job enrichment
  3. OTLP export: Prometheus, Grafana, or any OTLP backend
metric labels before → after l9gpu
Before — DCGM raw
DCGM_FI_DEV_GPU_UTIL{
  gpu="0",
  device="nvidia0",
  modelName="A100"
} 83
After — l9gpu enriched
gpu_utilization{
  gpu="0",
  pod="inference-api-7f9d",
  namespace="production",
  deployment="inference-api",
  node="gpu-node-03",
  cluster="ml-cluster-us"
} 83
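
With pod and namespace on every series, the join happens at collection time, so the queries stay flat. A sketch of what that enables in PromQL, using the enriched metric name from the example above (exact names may differ by release):

# average GPU utilization per namespace
avg by (namespace) (gpu_utilization)

# top 5 pods by GPU utilization on a single node
topk(5, gpu_utilization{node="gpu-node-03"})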

Platform support

Works wherever your GPUs are.

GPU Hardware

  • NVIDIA: NVML + DCGM · A100, H100/H200, B200/GB200, T4, A10, L4
  • AMD: amdsmi · MI300X, MI325X
  • Intel Gaudi: hl-smi · Gaudi 2, Gaudi 3

Workload Orchestration

  • Kubernetes: pod · namespace · deployment · node · cluster · cloud metadata
  • Slurm: job ID · user · account · partition · QoS
  • Bare metal: process-level attribution · systemd service

Inference Engines

vLLM · SGLang · TGI · Triton · NVIDIA NIM

Per-engine GPU metrics, not just per-device aggregates

What's included

Useful on day one.
Not after three weeks of setup.

  • 17 pre-built alert rules

    Across 3 PrometheusRule CRDs — GPU temperature, throttling, ECC errors, XID events, and idle utilization. Pod and namespace appear on every fired alert via k8sprocessor enrichment.

  • Grafana dashboards

    Multi-cluster fleet view, per-pod workload attribution, DCGM profiling, inference engine breakdown, and health/reliability panels.

  • XID + ECC at job level

    When XID errors increase on gpu03, you know which Slurm job or Kubernetes pod was running. Not just which node.

  • GPU chargeback, ready to query

    Team A consumed 340 GPU-hours in April. Team B consumed 60. Every metric is labeled with namespace and deployment, so cost queries are trivial; a query sketch follows this list.

  • Works with your existing stack

    OTLP out. Prometheus, Grafana Cloud, Datadog, any OTLP-compatible backend. Nothing proprietary, no lock-in — and a one-config path to Last9 if you want it.

    Send to Last9 →
  • Derived from Meta's GCM

    Built on the same foundation Meta uses for monitoring hundreds of thousands of GPUs in production AI research clusters.
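
The chargeback numbers above come straight out of a query. A sketch in PromQL against the enriched metric from the earlier example (metric name assumed to match; a GPU held at 100% for an hour counts as one GPU-hour, and short-lived pods would need a recording rule for exact accounting):

# utilization-weighted GPU-hours per namespace over the last 30 days
sum by (namespace) (
  avg_over_time(gpu_utilization[30d]) / 100
) * 24 * 30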

Install

Running in 60 seconds.

One Helm install for Kubernetes. One pip install for bare metal. Metrics start flowing to your OTLP endpoint immediately.

View full docs on GitHub

Sending GPU metrics to Last9? Follow the GPU Telemetry integration guide for the OTLP endpoint, auth headers, and dashboard import.

chart-v0.2.1
kubectl create secret generic l9gpu-otlp \
  -n monitoring \
  --from-literal=OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend/v1/metrics

helm repo add l9gpu https://last9.github.io/gpu-telemetry

helm install l9gpu l9gpu/l9gpu \
  -n monitoring \
  --create-namespace \
  --set otlpSecretName=l9gpu-otlp

After install you get:

  • Per-pod GPU utilization in Prometheus, labeled immediately
  • 17 alert rules active — no PromQL to write
  • Grafana dashboard importable with one click
  • Slurm job → GPU attribution visible in 1 collection cycle
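
A quick sanity check once the agent is running (a sketch, assuming the enriched metric name from the example above); both queries should return non-empty results within one collection cycle:

# enriched series exist with workload labels populated
count(gpu_utilization{namespace!=""})

# one entry per GPU node that is reporting
count by (node) (gpu_utilization)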

What you can finally answer

Questions finance and engineering leads actually ask.

  • Which team consumed the most GPU hours this month?
  • Which pod caused that utilization spike at 2am?
  • What was running on gpu03 when XID errors fired?
  • Which vLLM instance is burning GPU without serving requests?
  • Is our H100 utilization 5% because of idle GPUs or bad workload scheduling?
  • Which Slurm job account should we bill for this training run?
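
A couple of these, sketched as PromQL against the enriched metric from the example above (the metric name, and the empty-pod convention for unattributed GPUs, are assumptions):

# the 2am spike: each pod's peak utilization over the last 24h; narrow the window to the incident
topk(3, max_over_time(gpu_utilization[24h]))

# idle GPUs vs. badly scheduled ones
count(gpu_utilization{pod=""} < 5)    # no workload attached
count(gpu_utilization{pod!=""} < 5)   # assigned to a pod but barely used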

Stop guessing which pod is burning your GPU budget.

MIT licensed. One DaemonSet. Metrics with workload identity in under 60 seconds.


Start observing for free. No lock-in.

OPENTELEMETRY • PROMETHEUS

Just update your config. Start seeing data on Last9 in seconds.

DATADOG • NEW RELIC • OTHERS

We've got you covered. Bring over your dashboards & alerts in one click.

BUILT ON OPEN STANDARDS

100+ integrations. OTel native, works with your existing stack.

Gartner Cool Vendor 2025 · High Performer · Best Usability · Highest User Adoption