
GPU Telemetry (l9gpu)

Vendor-agnostic GPU monitoring for AI/ML and HPC clusters with workload attribution exported to Last9 via OpenTelemetry

Monitor NVIDIA, AMD, and Intel Gaudi GPUs with a single agent that emits OpenTelemetry to Last9. Get per-GPU utilization, memory, temperature, and power — plus workload attribution that ties every GPU metric to the Kubernetes pod or Slurm job that owns it.

What is l9gpu?

l9gpu is an open-source (MIT) GPU telemetry agent that normalizes hardware counters across GPU vendors into the OpenTelemetry gpu.* namespace. One DaemonSet per cluster (or one systemd unit per HPC node) covers the whole fleet.

Key capabilities:

  • Vendor-agnostic — NVIDIA (NVML, DCGM), AMD (amdsmi), Intel Gaudi (hl-smi)
  • Workload attribution — k8s.pod.name, k8s.namespace.name, k8s.deployment.name, slurm.job.id, slurm.user, slurm.partition on every GPU metric
  • Fleet health — XID errors, ECC trends, NCCL errors, thermal throttling
  • Cost attribution — $/token, tokens/watt, idle-GPU cost per team
  • OTLP-native — no Prometheus scrape config to maintain

Why not DCGM Exporter?

| Capability                     | l9gpu  | DCGM Exporter | NVIDIA GPU Operator | Datadog GPU |
| ------------------------------ | ------ | ------------- | ------------------- | ----------- |
| NVIDIA support                 | ✅     | ✅            | ✅                  | ✅          |
| AMD support                    | ✅     | ❌            | ❌                  | —           |
| Intel Gaudi support            | ✅     | ❌            | ❌                  | —           |
| Workload attribution (pod/job) | ✅     | ❌            | —                   | —           |
| Slurm HPC attribution          | ✅     | ❌            | ❌                  | —           |
| OTLP-native                    | ✅     | ❌            | ❌                  | —           |
| Open source                    | ✅ MIT | ✅            | ✅                  | ❌          |

DCGM Exporter is great at reading NVIDIA hardware but leaves workload attribution and multi-vendor support to you. l9gpu bundles both.

Prerequisites

Before you start:

  1. Last9 Account — Sign up at app.last9.io.
  2. OTLP credentials — From Last9, go to Integrations → OpenTelemetry and copy the OTLP endpoint and Authorization header.
  3. Kubernetes 1.24+ (or Slurm 22.05+) with GPU nodes.
  4. Helm 3.14+.

Current release: chart 0.2.0, Python agent 0.2.0, built on OpenTelemetry Collector v0.150.

Install on Kubernetes

  1. Add the Helm repository.

    helm repo add l9gpu https://last9.github.io/gpu-telemetry
    helm repo update
  2. Create a Secret with your Last9 OTLP credentials.

    kubectl create namespace l9gpu
    kubectl -n l9gpu create secret generic l9gpu-otlp \
    --from-literal=OTEL_EXPORTER_OTLP_ENDPOINT="otlp.last9.io:443" \
    --from-literal=OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <YOUR_AUTH_HEADER>"
  3. Install the chart.

    helm install l9gpu l9gpu/l9gpu \
    --version 0.2.0 \
    --namespace l9gpu \
    --set otlpSecretName=l9gpu-otlp \
    --set monitoring.sink=otel \
    --set monitoring.cluster=prod-gpu-us-east \
    --set collectors.nvidia=true \
    --set monitoring.nodeSelector."nvidia\.com/gpu\.present"=true

    For AMD or Gaudi nodes, set collectors.amd=true or collectors.gaudi=true and adjust the node selector.

  4. Verify the DaemonSet is running.

    kubectl -n l9gpu get pods -o wide

    You should see one l9gpu-monitoring-* pod per GPU node.
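If you would rather keep the configuration in version control than on the command line, the `--set` flags from step 3 can be expressed as a values file. A minimal sketch, using only the keys shown in the install command above (anything not listed falls back to chart defaults):

```yaml
# values.yaml — equivalent of the --set flags in step 3
otlpSecretName: l9gpu-otlp
monitoring:
  sink: otel
  cluster: prod-gpu-us-east
  nodeSelector:
    nvidia.com/gpu.present: "true"
collectors:
  nvidia: true
  # amd: true    # enable instead on AMD nodes (adjust nodeSelector)
  # gaudi: true  # enable instead on Gaudi nodes (adjust nodeSelector)
```

Then install with `helm install l9gpu l9gpu/l9gpu --version 0.2.0 --namespace l9gpu -f values.yaml`.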

Install on Slurm / bare metal

  1. Install the Python agent.

    pip install 'l9gpu==0.2.0'
  2. Copy the systemd units.

    sudo cp /usr/local/share/l9gpu/systemd/*.service /etc/systemd/system/
    sudo cp /usr/local/share/l9gpu/systemd/*.slice /etc/systemd/system/
  3. Configure OTLP credentials in /etc/default/l9gpu.

    OTEL_EXPORTER_OTLP_ENDPOINT=otlp.last9.io:443
    OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <YOUR_AUTH_HEADER>"
  4. Enable and start.

    sudo systemctl daemon-reload
    sudo systemctl enable --now l9gpu_nvml_monitor slurm_monitor

Verify data in Last9

After ~1 minute, open the Metrics Explorer and try:

# Per-pod GPU utilization
avg by (k8s_pod_name, gpu_uuid) (gpu_utilization)
# Cluster-wide idle GPUs (utilization < 5% for 15 min)
count(avg_over_time(gpu_utilization[15m]) < 5)
# Power draw per namespace
sum by (k8s_namespace_name) (gpu_power_watts)
# Slurm jobs consuming most GPU memory
topk(10, sum by (slurm_job_id, slurm_user) (gpu_memory_used_bytes))

Pre-built dashboards and alerts

l9gpu ships a set of Grafana dashboards and Prometheus alert rules you can import directly from the repo:

  • Dashboards — dashboards/grafana/ (Fleet Overview, Workload, DCGM, Single-GPU, Insights)
  • Alerts — alerts/ (Grafana and Prometheus formats)
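If you want to write a rule of your own rather than import the shipped ones, the idle-GPU query from the previous section translates directly into a Prometheus-style alerting rule. A sketch (the threshold, severity label, and rule names are illustrative; treat the rules in alerts/ as the reference):

```yaml
# Example rule built from: avg_over_time(gpu_utilization[15m]) < 5
groups:
  - name: l9gpu-examples
    rules:
      - alert: IdleGPU
        expr: avg_over_time(gpu_utilization[15m]) < 5
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu_uuid }} below 5% utilization for 15m"
```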

Metrics reference

| Metric                                      | Description                     |
| ------------------------------------------- | ------------------------------- |
| gpu.utilization                             | SM utilization %                |
| gpu.memory.used.bytes                       | VRAM in use                     |
| gpu.memory.total.bytes                      | VRAM capacity                   |
| gpu.temperature.celsius                     | GPU die temperature             |
| gpu.power.watts                             | Instantaneous power draw        |
| gpu.sm.clock.hertz / gpu.memory.clock.hertz | SM and memory clocks            |
| gpu.errors.xid.total                        | XID error count (NVIDIA)        |
| gpu.errors.ecc.total                        | Uncorrectable ECC errors        |
| gpu.throttle.reasons                        | Thermal / power throttle flags  |

Every metric carries GPU topology (gpu.uuid, gpu.index, gpu.model, gpu.vendor) and workload attribution (k8s.pod.name, k8s.namespace.name, slurm.job.id, etc.). See the full metrics reference for the complete list.
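Because attribution travels as attributes on each data point, grouping by pod, namespace, or Slurm job needs no joins on the backend. A minimal Python sketch of this data model with illustrative values, mirroring the "power draw per namespace" query from the verification section:

```python
# Sketch of the l9gpu data-point shape (values and UUIDs are illustrative).
# Each point carries GPU topology plus workload attribution as attributes.
datapoints = [
    {"metric": "gpu.power.watts", "value": 310.0,
     "attributes": {"gpu.uuid": "GPU-aaa", "gpu.index": 0, "gpu.vendor": "nvidia",
                    "k8s.namespace.name": "training", "k8s.pod.name": "llm-train-0"}},
    {"metric": "gpu.power.watts", "value": 285.0,
     "attributes": {"gpu.uuid": "GPU-bbb", "gpu.index": 1, "gpu.vendor": "nvidia",
                    "k8s.namespace.name": "training", "k8s.pod.name": "llm-train-1"}},
    {"metric": "gpu.power.watts", "value": 95.0,
     "attributes": {"gpu.uuid": "GPU-ccc", "gpu.index": 0, "gpu.vendor": "amd",
                    "k8s.namespace.name": "inference", "k8s.pod.name": "serve-0"}},
]

def power_by_namespace(points):
    """Equivalent of: sum by (k8s_namespace_name) (gpu_power_watts)."""
    totals = {}
    for p in points:
        if p["metric"] != "gpu.power.watts":
            continue
        ns = p["attributes"]["k8s.namespace.name"]
        totals[ns] = totals.get(ns, 0.0) + p["value"]
    return totals

print(power_by_namespace(datapoints))
# {'training': 595.0, 'inference': 95.0}
```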

Troubleshooting

No metrics appearing in Last9

  1. Check DaemonSet logs: kubectl -n l9gpu logs ds/l9gpu-monitoring.
  2. Verify the Secret is mounted: kubectl -n l9gpu describe pod <pod>.
  3. Confirm the node has GPUs visible to the agent: kubectl -n l9gpu exec <pod> -- nvidia-smi (or rocm-smi, hl-smi).

gpu.utilization present but no k8s.pod.name

The enrichment Collector needs RBAC to list pods on each node. Verify:

kubectl -n l9gpu get clusterrolebinding l9gpu -o yaml

If missing, reinstall with --set enrichment.rbac.create=true.

Slurm metrics missing slurm.job.id

The slurm_monitor unit runs sacct / scontrol, which must be in PATH for the cluster_monitor user. Add the Slurm bin dir to the unit’s Environment=PATH=....
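One way to do that without editing the shipped unit is a systemd drop-in. A sketch, assuming Slurm is installed under /opt/slurm (substitute your site's install prefix):

```ini
# /etc/systemd/system/slurm_monitor.service.d/10-path.conf
[Service]
Environment="PATH=/usr/local/bin:/usr/bin:/bin:/opt/slurm/bin"
```

Then run `sudo systemctl daemon-reload && sudo systemctl restart slurm_monitor`.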

Resources