GPU Telemetry (l9gpu)
Vendor-agnostic GPU monitoring for AI/ML and HPC clusters with workload attribution exported to Last9 via OpenTelemetry
Monitor NVIDIA, AMD, and Intel Gaudi GPUs with a single agent that emits OpenTelemetry to Last9. Get per-GPU utilization, memory, temperature, and power — plus workload attribution that ties every GPU metric to the Kubernetes pod or Slurm job that owns it.
What is l9gpu?
l9gpu is an open-source (MIT) GPU telemetry agent that normalizes hardware counters across GPU vendors into the OpenTelemetry gpu.* namespace. One DaemonSet per cluster (or one systemd unit per HPC node) covers the whole fleet.
Key capabilities:
- Vendor-agnostic — NVIDIA (NVML, DCGM), AMD (amdsmi), Intel Gaudi (hl-smi)
- Workload attribution — `k8s.pod.name`, `k8s.namespace.name`, `k8s.deployment.name`, `slurm.job.id`, `slurm.user`, `slurm.partition` on every GPU metric
- Fleet health — XID errors, ECC trends, NCCL errors, thermal throttling
- Cost attribution — `$/token`, `tokens/watt`, and idle-GPU cost per team (a sample query follows this list)
- OTLP-native — no Prometheus scrape config to maintain
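For example, once the attribution attributes are in place, idle-GPU cost per team can be approximated with a single query. A sketch in the same style as the queries further below, assuming a namespace maps to a team and a placeholder rate of $2.50 per GPU-hour:

```promql
# GPUs averaging <5% utilization over the last hour, per namespace,
# priced at an assumed $2.50/GPU-hour (adjust to your actual rate)
count by (k8s_namespace_name) (avg_over_time(gpu_utilization[1h]) < 5) * 2.50
```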
Why not DCGM Exporter?
| Capability | l9gpu | DCGM Exporter | NVIDIA GPU Operator | Datadog GPU |
|---|---|---|---|---|
| NVIDIA support | ✅ | ✅ | ✅ | ✅ |
| AMD support | ✅ | ❌ | ❌ | ❌ |
| Intel Gaudi support | ✅ | ❌ | ❌ | ❌ |
| Workload attribution (pod/job) | ✅ | ❌ | ❌ | ✅ |
| Slurm HPC attribution | ✅ | ❌ | ❌ | ❌ |
| OTLP-native | ✅ | ❌ | ❌ | ❌ |
| Open source | ✅ MIT | ✅ | ✅ | ❌ |
DCGM Exporter is great at reading NVIDIA hardware but leaves workload attribution and multi-vendor support to you. l9gpu bundles both.
Prerequisites
Before you start:
- Last9 Account — Sign up at app.last9.io.
- OTLP credentials — From Last9, go to Integrations → OpenTelemetry and copy the OTLP endpoint and `Authorization` header.
- Kubernetes 1.24+ (or Slurm 22.05+) with GPU nodes.
- Helm 3.14+.
Current release: chart 0.2.0, Python agent 0.2.0, built on OpenTelemetry Collector v0.150.
Installation
Kubernetes (Helm)

1. Add the Helm repository.

   ```bash
   helm repo add l9gpu https://last9.github.io/gpu-telemetry
   helm repo update
   ```

2. Create a Secret with your Last9 OTLP credentials.

   ```bash
   kubectl create namespace l9gpu
   kubectl -n l9gpu create secret generic l9gpu-otlp \
     --from-literal=OTEL_EXPORTER_OTLP_ENDPOINT="otlp.last9.io:443" \
     --from-literal=OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <YOUR_AUTH_HEADER>"
   ```

3. Install the chart.

   ```bash
   helm install l9gpu l9gpu/l9gpu \
     --version 0.2.0 \
     --namespace l9gpu \
     --set otlpSecretName=l9gpu-otlp \
     --set monitoring.sink=otel \
     --set monitoring.cluster=prod-gpu-us-east \
     --set collectors.nvidia=true \
     --set monitoring.nodeSelector."nvidia\.com/gpu\.present"=true
   ```

   For AMD or Gaudi nodes, set `collectors.amd=true` or `collectors.gaudi=true` and adjust the node selector. If you prefer a values file to `--set` flags, see the sketch after this list.

4. Verify the DaemonSet is running.

   ```bash
   kubectl -n l9gpu get pods -o wide
   ```

   You should see one `l9gpu-monitoring-*` pod per GPU node.
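If you would rather keep the configuration in version control than pass `--set` flags, the same settings can be expressed as a values file. A sketch, assuming the value keys mirror the flags used in step 3:

```bash
# Write the chart values (keys assumed to mirror the --set flags in step 3)
cat > l9gpu-values.yaml <<'EOF'
otlpSecretName: l9gpu-otlp
monitoring:
  sink: otel
  cluster: prod-gpu-us-east
  nodeSelector:
    nvidia.com/gpu.present: "true"
collectors:
  nvidia: true
EOF

# Install (or upgrade) the release from the values file
helm upgrade --install l9gpu l9gpu/l9gpu \
  --version 0.2.0 \
  --namespace l9gpu \
  -f l9gpu-values.yaml
```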
Slurm / HPC nodes (systemd)

1. Install the Python agent.

   ```bash
   pip install 'l9gpu==0.2.0'
   ```

2. Copy the systemd units.

   ```bash
   sudo cp /usr/local/share/l9gpu/systemd/*.service /etc/systemd/system/
   sudo cp /usr/local/share/l9gpu/systemd/*.slice /etc/systemd/system/
   ```

3. Configure OTLP credentials in `/etc/default/l9gpu`.

   ```bash
   OTEL_EXPORTER_OTLP_ENDPOINT=otlp.last9.io:443
   OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <YOUR_AUTH_HEADER>"
   ```

4. Enable and start the units.

   ```bash
   sudo systemctl daemon-reload
   sudo systemctl enable --now l9gpu_nvml_monitor slurm_monitor
   ```
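To confirm both units came up cleanly, a quick check with systemd's own tooling:

```bash
# Both units should report "active (running)"
systemctl --no-pager status l9gpu_nvml_monitor slurm_monitor

# Tail recent agent output if either unit is failing
journalctl -u l9gpu_nvml_monitor -u slurm_monitor --since "10 min ago"
```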
Verify data in Last9
After ~1 minute, open the Metrics Explorer and try:
```promql
# Per-pod GPU utilization
avg by (k8s_pod_name, gpu_uuid) (gpu_utilization)

# Cluster-wide idle GPUs (utilization < 5% for 15 min)
count(avg_over_time(gpu_utilization[15m]) < 5)

# Power draw per namespace
sum by (k8s_namespace_name) (gpu_power_watts)

# Slurm jobs consuming the most GPU memory
topk(10, sum by (slurm_job_id, slurm_user) (gpu_memory_used_bytes))
```

Pre-built dashboards and alerts
l9gpu ships a set of Grafana dashboards and Prometheus alert rules you can import directly from the repo:
- Dashboards — `dashboards/grafana/` (Fleet Overview, Workload, DCGM, Single-GPU, Insights)
- Alerts — `alerts/` (Grafana and Prometheus formats)
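If you provision Grafana outside the UI, one way to load a dashboard is Grafana's HTTP dashboard API. A sketch, assuming a dashboard JSON from `dashboards/grafana/` (the file name below is illustrative) and a service-account token with dashboard write access:

```bash
# Wrap the dashboard JSON in an import payload and POST it to Grafana
jq '{dashboard: del(.id), overwrite: true}' dashboards/grafana/fleet-overview.json |
  curl -sS -X POST "$GRAFANA_URL/api/dashboards/db" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-
```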
Metrics reference
| Metric | Description |
|---|---|
| `gpu.utilization` | SM utilization (%) |
| `gpu.memory.used.bytes` | VRAM in use |
| `gpu.memory.total.bytes` | VRAM capacity |
| `gpu.temperature.celsius` | GPU die temperature |
| `gpu.power.watts` | Instantaneous power draw |
| `gpu.sm.clock.hertz` / `gpu.memory.clock.hertz` | Clocks |
| `gpu.errors.xid.total` | XID error count (NVIDIA) |
| `gpu.errors.ecc.total` | Uncorrectable ECC errors |
| `gpu.throttle.reasons` | Thermal / power throttle flags |
Every metric carries GPU topology (`gpu.uuid`, `gpu.index`, `gpu.model`, `gpu.vendor`) and workload attribution (`k8s.pod.name`, `k8s.namespace.name`, `slurm.job.id`, etc.). See the full metrics reference for the complete list.
Resources
- Source — github.com/last9/gpu-telemetry
- Helm chart — last9.github.io/gpu-telemetry
- Artifact Hub — artifacthub.io/packages/search?repo=l9gpu
- PyPI — pypi.org/project/l9gpu
- Example configs — github.com/last9/opentelemetry-examples/tree/main/gpu-telemetry
Troubleshooting
Common Issues:
- No metrics appearing in Last9: Check DaemonSet logs with `kubectl -n l9gpu logs ds/l9gpu-monitoring`, verify the Secret is mounted with `kubectl -n l9gpu describe pod <pod>`, and confirm the node has GPUs visible to the agent via `kubectl -n l9gpu exec <pod> -- nvidia-smi` (or `rocm-smi`, `hl-smi`).
- `gpu.utilization` present but no `k8s.pod.name`: The enrichment Collector needs RBAC to list pods on each node. Verify with `kubectl -n l9gpu get clusterrolebinding l9gpu -o yaml`; if missing, reinstall with `--set enrichment.rbac.create=true` (a manual RBAC sketch follows this list).
- Slurm metrics missing `slurm.job.id`: The `slurm_monitor` unit runs `sacct`/`scontrol`, which must be in `PATH` for the `cluster_monitor` user. Add the Slurm bin directory to the unit's `Environment=PATH=...`.
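If reinstalling isn't convenient, the missing permissions can be created by hand. A minimal sketch, assuming the enrichment Collector runs under an `l9gpu` ServiceAccount in the `l9gpu` namespace (check your release for the actual ServiceAccount name):

```bash
cat <<'EOF' | kubectl apply -f -
# Read-only access to pods and nodes so the Collector can attach workload attributes
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: l9gpu
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: l9gpu
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: l9gpu
subjects:
  - kind: ServiceAccount
    name: l9gpu        # assumption: use your release's ServiceAccount name
    namespace: l9gpu
EOF
```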
Please get in touch with us on Discord or Email if you have any questions.