GPU Telemetry (l9gpu)
Vendor-agnostic GPU monitoring for AI/ML and HPC clusters with workload attribution exported to Last9 via OpenTelemetry
Monitor NVIDIA, AMD, and Intel Gaudi GPUs with a single agent that emits OpenTelemetry to Last9. Get per-GPU utilization, memory, temperature, and power — plus workload attribution that ties every GPU metric to the Kubernetes pod or Slurm job that owns it.
What is l9gpu?
l9gpu is an open-source (MIT) GPU telemetry agent that normalizes hardware counters across GPU vendors into the OpenTelemetry gpu.* namespace. One DaemonSet per cluster (or one systemd unit per HPC node) covers the whole fleet.
Key capabilities:
- Vendor-agnostic — NVIDIA (NVML, DCGM), AMD (amdsmi), Intel Gaudi (hl-smi)
- Workload attribution — `k8s.pod.name`, `k8s.namespace.name`, `k8s.deployment.name`, `slurm.job.id`, `slurm.user`, `slurm.partition` on every GPU metric
- Fleet health — XID errors, ECC trends, NCCL errors, thermal throttling
- Cost attribution — `$/token`, `tokens/watt`, and idle-GPU cost per team (a sample query follows this list)
- OTLP-native — no Prometheus scrape config to maintain
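For example, once the attribution attributes are in place, idle-GPU cost per team can be approximated with a single query. A sketch in the same style as the queries further below, assuming a namespace maps to a team and a placeholder rate of $2.50 per GPU-hour:

```promql
# GPUs averaging <5% utilization over the last hour, per namespace,
# priced at an assumed $2.50/GPU-hour (adjust to your actual rate)
count by (k8s_namespace_name) (avg_over_time(gpu_utilization[1h]) < 5) * 2.50
```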
Why not DCGM Exporter?
| Capability | l9gpu | DCGM Exporter | NVIDIA GPU Operator | Datadog GPU |
|---|---|---|---|---|
| NVIDIA support | ✅ | ✅ | ✅ | ✅ |
| AMD support | ✅ | ❌ | ❌ | ❌ |
| Intel Gaudi support | ✅ | ❌ | ❌ | ❌ |
| Workload attribution (pod/job) | ✅ | ❌ | ❌ | ✅ |
| Slurm HPC attribution | ✅ | ❌ | ❌ | ❌ |
| OTLP-native | ✅ | ❌ | ❌ | ❌ |
| Open source | ✅ MIT | ✅ | ✅ | ❌ |
DCGM Exporter is great at reading NVIDIA hardware but leaves workload attribution and multi-vendor support to you. l9gpu bundles both.
Prerequisites
Before you start:
- Last9 Account — Sign up at app.last9.io.
- OTLP credentials — From Last9, go to Integrations → OpenTelemetry and copy the OTLP endpoint and `Authorization` header.
- Kubernetes 1.24+ (or Slurm 22.05+) with GPU nodes.
- Helm 3.14+.
Current release: chart 0.2.0, Python agent 0.2.0, built on OpenTelemetry Collector v0.150.
Installation
Kubernetes (Helm)

1. Add the Helm repository.

   ```bash
   helm repo add l9gpu https://last9.github.io/gpu-telemetry
   helm repo update
   ```

2. Create a Secret with your Last9 OTLP credentials.

   ```bash
   kubectl create namespace l9gpu
   kubectl -n l9gpu create secret generic l9gpu-otlp \
     --from-literal=OTEL_EXPORTER_OTLP_ENDPOINT="otlp.last9.io:443" \
     --from-literal=OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <YOUR_AUTH_HEADER>"
   ```

3. Install the chart.

   ```bash
   helm install l9gpu l9gpu/l9gpu \
     --version 0.2.0 \
     --namespace l9gpu \
     --set otlpSecretName=l9gpu-otlp \
     --set monitoring.sink=otel \
     --set monitoring.cluster=prod-gpu-us-east \
     --set collectors.nvidia=true \
     --set monitoring.nodeSelector."nvidia\.com/gpu\.present"=true
   ```

   For AMD or Gaudi nodes, set `collectors.amd=true` or `collectors.gaudi=true` and adjust the node selector. If you prefer a values file to `--set` flags, see the sketch after this list.

4. Verify the DaemonSet is running.

   ```bash
   kubectl -n l9gpu get pods -o wide
   ```

   You should see one `l9gpu-monitoring-*` pod per GPU node.
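If you would rather keep the configuration in version control than pass `--set` flags, the same settings can be expressed as a values file. A sketch, assuming the value keys mirror the flags used in step 3:

```bash
# Write the chart values (keys assumed to mirror the --set flags in step 3)
cat > l9gpu-values.yaml <<'EOF'
otlpSecretName: l9gpu-otlp
monitoring:
  sink: otel
  cluster: prod-gpu-us-east
  nodeSelector:
    nvidia.com/gpu.present: "true"
collectors:
  nvidia: true
EOF

# Install (or upgrade) the release from the values file
helm upgrade --install l9gpu l9gpu/l9gpu \
  --version 0.2.0 \
  --namespace l9gpu \
  -f l9gpu-values.yaml
```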
Slurm / HPC nodes (systemd)

1. Install the Python agent.

   ```bash
   pip install 'l9gpu==0.2.0'
   ```

2. Copy the systemd units.

   ```bash
   sudo cp /usr/local/share/l9gpu/systemd/*.service /etc/systemd/system/
   sudo cp /usr/local/share/l9gpu/systemd/*.slice /etc/systemd/system/
   ```

3. Configure OTLP credentials in `/etc/default/l9gpu`.

   ```bash
   OTEL_EXPORTER_OTLP_ENDPOINT=otlp.last9.io:443
   OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <YOUR_AUTH_HEADER>"
   ```

4. Enable and start the units.

   ```bash
   sudo systemctl daemon-reload
   sudo systemctl enable --now l9gpu_nvml_monitor slurm_monitor
   ```
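To confirm both units came up cleanly, a quick check with systemd's own tooling:

```bash
# Both units should report "active (running)"
systemctl --no-pager status l9gpu_nvml_monitor slurm_monitor

# Tail recent agent output if either unit is failing
journalctl -u l9gpu_nvml_monitor -u slurm_monitor --since "10 min ago"
```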
Verify data in Last9
After ~1 minute, open the Metrics Explorer and try:
```promql
# Per-pod GPU utilization
avg by (k8s_pod_name, gpu_uuid) (gpu_utilization)

# Cluster-wide idle GPUs (utilization < 5% for 15 min)
count(avg_over_time(gpu_utilization[15m]) < 5)

# Power draw per namespace
sum by (k8s_namespace_name) (gpu_power_watts)

# Slurm jobs consuming the most GPU memory
topk(10, sum by (slurm_job_id, slurm_user) (gpu_memory_used_bytes))
```

Pre-built dashboards and alerts
l9gpu ships a set of Grafana dashboards and Prometheus alert rules you can import directly from the repo:
- Dashboards — `dashboards/grafana/` (Fleet Overview, Workload, DCGM, Single-GPU, Insights)
- Alerts — `alerts/` (Grafana and Prometheus formats)
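If you provision Grafana outside the UI, one way to load a dashboard is Grafana's HTTP dashboard API. A sketch, assuming a dashboard JSON from `dashboards/grafana/` (the file name below is illustrative) and a service-account token with dashboard write access:

```bash
# Wrap the dashboard JSON in an import payload and POST it to Grafana
jq '{dashboard: del(.id), overwrite: true}' dashboards/grafana/fleet-overview.json |
  curl -sS -X POST "$GRAFANA_URL/api/dashboards/db" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-
```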
Metrics reference
| Metric | Description |
|---|---|
| `gpu.utilization` | SM utilization (%) |
| `gpu.memory.used.bytes` | VRAM in use |
| `gpu.memory.total.bytes` | VRAM capacity |
| `gpu.temperature.celsius` | GPU die temperature |
| `gpu.power.watts` | Instantaneous power draw |
| `gpu.sm.clock.hertz` / `gpu.memory.clock.hertz` | Clocks |
| `gpu.errors.xid.total` | XID error count (NVIDIA) |
| `gpu.errors.ecc.total` | Uncorrectable ECC errors |
| `gpu.throttle.reasons` | Thermal / power throttle flags |
Every metric carries GPU topology (`gpu.uuid`, `gpu.index`, `gpu.model`, `gpu.vendor`) and workload attribution (`k8s.pod.name`, `k8s.namespace.name`, `slurm.job.id`, etc.). See the full metrics reference for the complete list.
Resources
- Source — github.com/last9/gpu-telemetry
- Helm chart — last9.github.io/gpu-telemetry
- Artifact Hub — artifacthub.io/packages/search?repo=l9gpu
- PyPI — pypi.org/project/l9gpu
- Example configs — github.com/last9/opentelemetry-examples/tree/main/gpu-telemetry
Troubleshooting
Common Issues:
- No metrics appearing in Last9: Check DaemonSet logs with `kubectl -n l9gpu logs ds/l9gpu-monitoring`, verify the Secret is mounted with `kubectl -n l9gpu describe pod <pod>`, and confirm the node has GPUs visible to the agent via `kubectl -n l9gpu exec <pod> -- nvidia-smi` (or `rocm-smi`, `hl-smi`).
- `gpu.utilization` present but no `k8s.pod.name`: The enrichment Collector needs RBAC to list pods on each node. Verify with `kubectl -n l9gpu get clusterrolebinding l9gpu -o yaml`; if missing, reinstall with `--set enrichment.rbac.create=true` (a manual RBAC sketch follows this list).
- Slurm metrics missing `slurm.job.id`: The `slurm_monitor` unit runs `sacct`/`scontrol`, which must be in `PATH` for the `cluster_monitor` user. Add the Slurm bin directory to the unit's `Environment=PATH=...`.
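If reinstalling isn't convenient, the missing permissions can be created by hand. A minimal sketch, assuming the enrichment Collector runs under an `l9gpu` ServiceAccount in the `l9gpu` namespace (check your release for the actual ServiceAccount name):

```bash
cat <<'EOF' | kubectl apply -f -
# Read-only access to pods and nodes so the Collector can attach workload attributes
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: l9gpu
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: l9gpu
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: l9gpu
subjects:
  - kind: ServiceAccount
    name: l9gpu        # assumption: use your release's ServiceAccount name
    namespace: l9gpu
EOF
```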
Please get in touch with us on Discord or Email if you have any questions.