# GPU Telemetry (l9gpu)

Vendor-agnostic GPU monitoring for AI/ML and HPC clusters with workload attribution, exported to Last9 via OpenTelemetry.
Monitor NVIDIA, AMD, and Intel Gaudi GPUs with a single agent that emits OpenTelemetry to Last9. Get per-GPU utilization, memory, temperature, and power — plus workload attribution that ties every GPU metric to the Kubernetes pod or Slurm job that owns it.
## What is l9gpu?
l9gpu is an open-source (MIT) GPU telemetry agent that normalizes hardware counters across GPU vendors into the OpenTelemetry `gpu.*` namespace. One DaemonSet per cluster (or one systemd unit per HPC node) covers the whole fleet.
Key capabilities:
- **Vendor-agnostic** — NVIDIA (NVML, DCGM), AMD (`amdsmi`), Intel Gaudi (`hl-smi`)
- **Workload attribution** — `k8s.pod.name`, `k8s.namespace.name`, `k8s.deployment.name`, `slurm.job.id`, `slurm.user`, and `slurm.partition` on every GPU metric
- **Fleet health** — XID errors, ECC trends, NCCL errors, thermal throttling
- **Cost attribution** — `$/token`, `tokens/watt`, idle-GPU cost per team
- **OTLP-native** — no Prometheus scrape config to maintain
## Why not DCGM Exporter?
| Capability | l9gpu | DCGM Exporter | NVIDIA GPU Operator | Datadog GPU |
|---|---|---|---|---|
| NVIDIA support | ✅ | ✅ | ✅ | ✅ |
| AMD support | ✅ | ❌ | ❌ | ❌ |
| Intel Gaudi support | ✅ | ❌ | ❌ | ❌ |
| Workload attribution (pod/job) | ✅ | ❌ | ❌ | ✅ |
| Slurm HPC attribution | ✅ | ❌ | ❌ | ❌ |
| OTLP-native | ✅ | ❌ | ❌ | ❌ |
| Open source | ✅ MIT | ✅ | ✅ | ❌ |
DCGM Exporter is great at reading NVIDIA hardware but leaves workload attribution and multi-vendor support to you. l9gpu bundles both.
## Prerequisites
Before you start:
- **Last9 account** — sign up at app.last9.io.
- **OTLP credentials** — in Last9, go to **Integrations → OpenTelemetry** and copy the OTLP endpoint and `Authorization` header.
- **Kubernetes 1.24+** (or Slurm 22.05+) with GPU nodes.
- **Helm 3.14+**.
Current release: chart 0.2.0, Python agent 0.2.0, built on OpenTelemetry Collector v0.150.
## Install on Kubernetes
1. Add the Helm repository.

   ```shell
   helm repo add l9gpu https://last9.github.io/gpu-telemetry
   helm repo update
   ```

2. Create a Secret with your Last9 OTLP credentials.

   ```shell
   kubectl create namespace l9gpu
   kubectl -n l9gpu create secret generic l9gpu-otlp \
     --from-literal=OTEL_EXPORTER_OTLP_ENDPOINT="otlp.last9.io:443" \
     --from-literal=OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <YOUR_AUTH_HEADER>"
   ```

3. Install the chart.

   ```shell
   helm install l9gpu l9gpu/l9gpu \
     --version 0.2.0 \
     --namespace l9gpu \
     --set otlpSecretName=l9gpu-otlp \
     --set monitoring.sink=otel \
     --set monitoring.cluster=prod-gpu-us-east \
     --set collectors.nvidia=true \
     --set monitoring.nodeSelector."nvidia\.com/gpu\.present"=true
   ```

   For AMD or Gaudi nodes, set `collectors.amd=true` or `collectors.gaudi=true` and adjust the node selector.

4. Verify the DaemonSet is running.

   ```shell
   kubectl -n l9gpu get pods -o wide
   ```

   You should see one `l9gpu-monitoring-*` pod per GPU node.
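The `--set` flags above can also live in a values file. A sketch, under the assumption that the chart's value keys mirror the flags one-to-one (check the chart's own `values.yaml` for the authoritative schema):

```yaml
# values.yaml — hypothetical equivalent of the --set flags above
otlpSecretName: l9gpu-otlp
monitoring:
  sink: otel
  cluster: prod-gpu-us-east
  nodeSelector:
    nvidia.com/gpu.present: "true"
collectors:
  nvidia: true
```

Then install with `helm install l9gpu l9gpu/l9gpu --version 0.2.0 --namespace l9gpu -f values.yaml`, which is easier to diff and review than a long flag list.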
## Install on Slurm / bare metal
1. Install the Python agent.

   ```shell
   pip install 'l9gpu==0.2.0'
   ```

2. Copy the systemd units.

   ```shell
   sudo cp /usr/local/share/l9gpu/systemd/*.service /etc/systemd/system/
   sudo cp /usr/local/share/l9gpu/systemd/*.slice /etc/systemd/system/
   ```

3. Configure OTLP credentials in `/etc/default/l9gpu`.

   ```shell
   OTEL_EXPORTER_OTLP_ENDPOINT=otlp.last9.io:443
   OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <YOUR_AUTH_HEADER>"
   ```

4. Enable and start.

   ```shell
   sudo systemctl daemon-reload
   sudo systemctl enable --now l9gpu_nvml_monitor slurm_monitor
   ```
## Verify data in Last9
After ~1 minute, open the Metrics Explorer and try:
# Per-pod GPU utilizationavg by (k8s_pod_name, gpu_uuid) (gpu_utilization)
# Cluster-wide idle GPUs (utilization < 5% for 15 min)count(avg_over_time(gpu_utilization[15m]) < 5)
# Power draw per namespacesum by (k8s_namespace_name) (gpu_power_watts)
# Slurm jobs consuming most GPU memorytopk(10, sum by (slurm_job_id, slurm_user) (gpu_memory_used_bytes))Pre-built dashboards and alerts
l9gpu ships a set of Grafana dashboards and Prometheus alert rules you can import directly from the repo:
- **Dashboards** — `dashboards/grafana/` (Fleet Overview, Workload, DCGM, Single-GPU, Insights)
- **Alerts** — `alerts/` (Grafana and Prometheus formats)
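To turn the idle-GPU query from the verification step into a per-team cost figure, one option is a small post-processing script. A sketch: the sample rows and the $2.50/hr rate below are invented inputs, not values l9gpu exports.

```python
from collections import Counter

# (gpu_uuid, k8s_namespace_name) pairs returned by the idle-GPU query;
# values are made up for illustration.
idle_gpus = [
    ("GPU-aaa", "ml-prod"),
    ("GPU-bbb", "ml-prod"),
    ("GPU-ccc", "research"),
]

HOURLY_RATE_USD = 2.50  # assumed per-GPU on-demand rate

# Count idle GPUs per namespace and price the daily waste.
by_team = Counter(ns for _, ns in idle_gpus)
for team, count in sorted(by_team.items()):
    daily = count * HOURLY_RATE_USD * 24
    print(f"{team}: {count} idle GPUs, ~${daily:.2f}/day")
```

Because every metric already carries `k8s.namespace.name` (or `slurm.user`), this group-by is a label join rather than a lookup against an external inventory.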
## Metrics reference
| Metric | Description |
|---|---|
| `gpu.utilization` | SM utilization (%) |
| `gpu.memory.used.bytes` | VRAM in use |
| `gpu.memory.total.bytes` | VRAM capacity |
| `gpu.temperature.celsius` | GPU die temperature |
| `gpu.power.watts` | Instantaneous power draw |
| `gpu.sm.clock.hertz` / `gpu.memory.clock.hertz` | SM and memory clock frequencies |
| `gpu.errors.xid.total` | XID error count (NVIDIA) |
| `gpu.errors.ecc.total` | Uncorrectable ECC errors |
| `gpu.throttle.reasons` | Thermal / power throttle flags |
Every metric carries GPU topology (`gpu.uuid`, `gpu.index`, `gpu.model`, `gpu.vendor`) and workload attribution (`k8s.pod.name`, `k8s.namespace.name`, `slurm.job.id`, etc.). See the full metrics reference for the complete list.
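Concretely, one exported data point carries both attribute groups. A mock in plain Python (every value below is invented for illustration; only the attribute keys follow the conventions listed above):

```python
# Shape of a single gpu.utilization data point with topology and
# workload-attribution attributes. All values are made-up examples.
sample_point = {
    "name": "gpu.utilization",
    "value": 87.5,
    "attributes": {
        # GPU topology
        "gpu.uuid": "GPU-00000000-1111-2222-3333-444444444444",
        "gpu.index": 0,
        "gpu.model": "NVIDIA H100 80GB HBM3",
        "gpu.vendor": "nvidia",
        # Workload attribution
        "k8s.pod.name": "trainer-7f9c",
        "k8s.namespace.name": "ml-prod",
    },
}

# Attribution turns a fleet metric into a per-team question:
team = sample_point["attributes"]["k8s.namespace.name"]
print(f"{sample_point['name']}={sample_point['value']} in namespace {team}")
```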
## Troubleshooting
### No metrics appearing in Last9
- Check the DaemonSet logs: `kubectl -n l9gpu logs ds/l9gpu-monitoring`.
- Verify the Secret is mounted: `kubectl -n l9gpu describe pod <pod>`.
- Confirm the node has GPUs visible to the agent: `kubectl -n l9gpu exec <pod> -- nvidia-smi` (or `rocm-smi`, `hl-smi`).
### `gpu.utilization` present but no `k8s.pod.name`
The enrichment Collector needs RBAC permission to list pods on each node. Verify the binding exists:

```shell
kubectl -n l9gpu get clusterrolebinding l9gpu -o yaml
```

If it is missing, reinstall with `--set enrichment.rbac.create=true`.
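If you manage RBAC outside the chart, the minimal grant would look roughly like this. This is a sketch: the object and ServiceAccount names are assumptions, and the chart-created objects should be treated as authoritative.

```yaml
# Hypothetical minimal RBAC for pod enrichment; names are assumed.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: l9gpu
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: l9gpu
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: l9gpu
subjects:
  - kind: ServiceAccount
    name: l9gpu
    namespace: l9gpu
```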
### Slurm metrics missing `slurm.job.id`
The `slurm_monitor` unit runs `sacct` / `scontrol`, which must be on the `PATH` of the `cluster_monitor` user. Add the Slurm bin directory to the unit's `Environment=PATH=...`.
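A systemd drop-in keeps that override out of the packaged unit file. A sketch, where `/opt/slurm/bin` is a placeholder for your site's Slurm prefix:

```ini
# /etc/systemd/system/slurm_monitor.service.d/10-path.conf
[Service]
Environment="PATH=/opt/slurm/bin:/usr/local/bin:/usr/bin:/bin"
```

Apply it with `sudo systemctl daemon-reload && sudo systemctl restart slurm_monitor`.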
## Resources
- **Source** — [github.com/last9/gpu-telemetry](https://github.com/last9/gpu-telemetry)
- **Helm chart** — [last9.github.io/gpu-telemetry](https://last9.github.io/gpu-telemetry)
- **Artifact Hub** — [artifacthub.io/packages/search?repo=l9gpu](https://artifacthub.io/packages/search?repo=l9gpu)
- **PyPI** — [pypi.org/project/l9gpu](https://pypi.org/project/l9gpu)
- **Example configs** — [github.com/last9/opentelemetry-examples/tree/main/gpu-telemetry](https://github.com/last9/opentelemetry-examples/tree/main/gpu-telemetry)