Your GPU cluster dashboard shows 73% utilization. Is that good? It depends. That number doesn't tell you whether the GPU is compute-bound or memory-bound, whether ECC errors are silently corrupting your training run, or whether your inference engine is about to start dropping requests.
Most teams monitor two or three GPU metrics — utilization, temperature, maybe memory. There are over fifty signals that matter, and the ones most teams skip are often the ones that would have prevented their worst outages.
This is a guide to what to actually monitor on GPUs, why each signal matters, and what thresholds to set. It's vendor-neutral — the concepts apply whether you're running NVIDIA H100s, AMD MI300X, or Intel Gaudi 3.
Why GPU Monitoring Is Different
GPUs aren't CPUs. When you see 90% CPU utilization, you have a reasonable intuition for what that means. GPU utilization is more nuanced — it measures the percentage of time the Streaming Multiprocessors (SMs) had at least one kernel active, not how efficiently they're being used. A GPU can report 90% utilization while being severely memory-bandwidth-bound, with the tensor cores mostly idle.
The other gap is correlation. CPU metrics come with process-level attribution by default — you know which process is consuming the cycles. GPU metrics don't. You see per-GPU utilization, but not which pod, job, or model is driving it. That requires additional instrumentation we'll cover later in this series.
The result: most teams are flying with partial instruments. They know a GPU is "busy" but not whether it's busy doing useful work, not which workload is responsible, and not whether the hardware is healthy enough to keep doing it.
Compute and Memory Metrics
GPU Utilization (gpu.utilization) measures SM activity as a percentage. It answers a simple question: is the GPU doing anything?
- Below 5% for 15 minutes means the GPU is idle and costing you money. Alert on it.
- Above 80% looks healthy, but check memory controller utilization alongside it. A GPU at 85% utilization with memory controller at 95% is memory-bandwidth-bound — the SMs are stalling on data fetches, not doing math.
# Idle GPU detection — alert if <5% for 15 minutes
gpu_utilization_ratio{gpu_task_type="compute"} < 0.05
Memory Utilization (gpu.memory.used.percent) tells you how much VRAM is in use. This is critical for inference workloads where KV-cache allocation determines how many concurrent requests you can serve.
- Above 95% is an OOM risk. For training, this means your batch size is too large or your model doesn't fit. For inference, it means your KV-cache is overcommitted.
- For inference: watch this alongside vllm.cache.usage — the inference engine's own view of cache pressure is more actionable than raw VRAM utilization.
Memory Controller Utilization (gpu.memory.utilization) is the metric most teams miss. It measures how busy the memory subsystem is — the path between GPU cores and HBM. When this is above 80%, your workload is memory-bandwidth-bound. Tensor cores are waiting for data, not crunching numbers. MFU (Model FLOPs Utilization) will be poor no matter how high GPU utilization looks.
This distinction — compute-bound vs memory-bound — is the single most important thing GPU utilization alone doesn't tell you.
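That check can be expressed as an alert rule. A minimal sketch, assuming the memory controller metric is exported as gpu_memory_utilization_ratio alongside gpu_utilization_ratio (names follow the conventions used above):

# Memory-bandwidth-bound: busy SMs and a saturated memory path
gpu_utilization_ratio > 0.8 and gpu_memory_utilization_ratio > 0.8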
Power and Thermal Metrics
Temperature (gpu.temperature) is straightforward but has important nuances. Modern GPUs expose three sensors:
| Sensor | What it measures | Threshold |
|---|---|---|
| Edge | Board surface temperature | >85C warn, >95C critical |
| Hotspot | Hottest point on the die | >90C warn (typically 5-10C above edge) |
| Memory junction | HBM stack temperature | >95C warn (HBM has higher thermal tolerance) |
The edge sensor is the one most dashboards show. But the hotspot sensor is what triggers thermal throttling — a GPU can report 82C edge while the hotspot is already at 93C and quietly throttling clocks.
# Temperature alert — edge sensor
gpu_temperature_celsius{gpu_temperature_sensor="edge"} > 85
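Since the hotspot sensor is what actually triggers throttling, it deserves a rule of its own. This assumes the exporter labels that sensor gpu_temperature_sensor="hotspot", mirroring the edge rule:

# Hotspot throttling risk - fires before the edge sensor looks alarming
gpu_temperature_celsius{gpu_temperature_sensor="hotspot"} > 90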
Power Draw (gpu.power.draw) in watts tells you how hard the GPU is working in terms of electrical power. Combined with P-state (gpu.power.state, values 0-8), it paints a picture of the GPU's power management state. P0 is maximum performance. P8 is idle.
A GPU sustained at P0 drawing its full TDP (Thermal Design Power) — 700W for an H100 SXM, 400W for an A100 SXM — is running flat out. If utilization is also high, that's expected. If utilization is moderate but power is at TDP, something is wrong — the GPU is working hard but not producing proportional compute output.
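One way to catch the "power at TDP, output not proportional" case as a rule. The metric names here (gpu_power_draw_watts for current draw, gpu_power_limit_watts for the board's configured limit) are illustrative, and the sketch assumes both series carry matching labels:

# High power, moderate utilization - investigate for wasted watts
gpu_power_draw_watts > 0.95 * gpu_power_limit_watts and gpu_utilization_ratio < 0.5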
Thermal Throttling (gpu.throttle.reason) is a bitmask that tells you why the GPU slowed down. Active throttle reasons include:
- power_software — Power cap is limiting clock speed
- temp_hardware — Die temperature hit the hardware thermal limit
- temp_software — Driver-level thermal management kicked in
- syncboost — Multi-GPU clock synchronization (normal on NVLink systems)
Any active throttle (except syncboost) lasting more than 5 minutes warrants investigation. Throttling silently reduces performance — your workload runs slower with no visible error.
# Throttle alert — ignore syncboost
gpu_throttle_reason{gpu_throttle_cause=~"power_software|temp_hardware|temp_software"} == 1
Clock Frequencies (gpu.clock.frequency) for both graphics and memory clocks are the downstream effect of throttling. If your graphics clock drops below the GPU's base clock, the GPU is being throttled. Monitoring clock frequency gives you a continuous performance signal, while throttle reason gives you the cause.
Reliability and Health Metrics
This is the category most teams skip entirely — and it's the one that saves the most money.
ECC Errors (gpu.ecc.errors) come in two flavors:
- Correctable (single-bit, SBE) — The GPU's error correction code detected and fixed a bit flip in HBM. One isolated SBE is normal. An SBE rate above 10 per hour that's trending upward means the memory is degrading. Meta's fleet research (2025) found that ECC SBE trends predict GPU failure 48-72 hours in advance with 89-96% accuracy.
- Uncorrectable (double-bit, DBE) — The ECC could not fix the error. This is always critical. A DBE means data corruption has already occurred. Training checkpoints may be silently corrupted. Inference outputs may be wrong. Any increase in DBEs should trigger immediate investigation and GPU replacement scheduling.
# Critical: any new uncorrectable ECC errors in 5 minutes
increase(gpu_ecc_errors{gpu_ecc_error_type="uncorrectable", gpu_ecc_count_type="volatile"}[5m]) > 0
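The SBE-rate side of the same counter covers the degradation trend, using the 10-per-hour threshold from above:

# Degrading HBM - more than 10 correctable errors in the last hour
increase(gpu_ecc_errors{gpu_ecc_error_type="correctable", gpu_ecc_count_type="volatile"}[1h]) > 10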
XID Errors (gpu.xid.errors) are NVIDIA's hardware-level fault codes. They're logged to dmesg and visible through NVML. The critical ones:
| XID | Meaning | Severity |
|---|---|---|
| 79 | GPU fell off bus | Critical — GPU is dead or PCIe link failed |
| 63 | Row remap failure | Critical — ECC repair exhausted |
| 48 | Double-bit ECC error | Critical — data corruption |
| 74 | NVLink error | Warning — check cables and connectors |
| 45 | Preemptive row remap | Info — GPU is self-repairing (track rate) |
Any non-zero XID error rate on a production GPU should alert. XID 79 ("GPU fell off bus") during a multi-day training run is one of the most expensive failures in AI infrastructure — the entire run must restart from the last checkpoint.
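A catch-all rule for that, assuming XID events are exported as a counter named gpu_xid_errors labeled by code (an assumed name, extrapolated from the metrics above); route the alert per XID code for triage:

# Any XID event on a production GPU warrants a look
increase(gpu_xid_errors[5m]) > 0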
Row Remapping (gpu.row_remap.count, gpu.row_remap.available) tracks HBM self-repair. When a memory row has persistent errors, the GPU remaps it to a spare row. This is healthy — but spare rows are a finite resource. When gpu.row_remap.available hits zero, the next uncorrectable error can't be repaired. The GPU needs replacement.
Think of it like spare tires: using one is fine. Running out means the next flat strands you.
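The corresponding alert is a simple equality check, assuming gpu_row_remap_available is exported as a gauge of remaining spare rows:

# No spare rows left - the next uncorrectable error cannot be repaired
gpu_row_remap_available == 0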
PCIe Replay Count (gpu.pcie.replay.count) tracks how many times the PCIe link had to retransmit a packet. An increasing count indicates link integrity degradation — bad slot seating, damaged cable, or a failing riser card. It doesn't cause immediate failure, but it adds latency to every GPU-to-host transfer and gets worse over time.
# PCIe link degradation — replay count increasing over 15 minutes
increase(gpu_pcie_replay_count[15m]) > 0
NVLink Errors are similar — CRC error counters on NVLink lanes indicate connector or cable issues on multi-GPU nodes. On 8-GPU HGX systems with NVSwitch, a single bad NVLink connection can bottleneck all-reduce operations for the entire node.
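If your exporter publishes NVLink CRC counters, the same "any increase" pattern applies. The metric name gpu_nvlink_crc_errors here is illustrative; check what your collector actually emits:

# NVLink lane integrity - CRC errors accumulating over 15 minutes
increase(gpu_nvlink_crc_errors[15m]) > 0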
Inference Engine Metrics
If you're serving LLMs, these metrics matter more than GPU utilization.
Time-to-First-Token (TTFT) (vllm.ttft) measures the time from receiving a request to generating the first output token. This is the latency your users feel — it's the "thinking" time before the response starts streaming. TTFT is primarily driven by prompt processing (prefill) and is affected by prompt length, KV-cache availability, and batch contention.
- P50 > 500ms — Worth investigating. Could be normal for long prompts, or could indicate cache pressure.
- P95 > 2s — Critical. Users are waiting over 2 seconds before seeing any response. Something is wrong.
# TTFT regression — P95 above 2 seconds
vllm_ttft_seconds{quantile="p95"} > 2
KV-Cache Usage (vllm.cache.usage) is the most important capacity metric for inference. The KV-cache stores computed attention keys and values for active requests. When it fills up, the inference engine must either evict (lose cached computation), swap to CPU (massive latency hit), or reject new requests.
- Above 80% — Eviction pressure begins. Performance starts degrading.
- Above 90% — Active queuing. New requests wait for cache slots. TTFT and E2E latency spike.
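As an alert, assuming the engine exports cache usage as a 0-1 ratio named vllm_cache_usage_ratio (the exact name varies by exporter):

# KV-cache saturation - new requests are about to queue
vllm_cache_usage_ratio > 0.9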
Queue Depth — Three metrics tell the capacity story:
- vllm.requests.running — Requests actively being processed
- vllm.requests.waiting — Requests in queue, waiting for resources
- vllm.requests.swapped — Requests evicted from GPU to CPU memory
waiting > 0 sustained for more than 5 minutes means your serving capacity is saturated. You need more replicas, a smaller model, or shorter maximum context length.
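"Sustained" is the key word: a momentary blip in the waiting queue is normal. min_over_time expresses that cleanly, assuming the gauge is named vllm_requests_waiting:

# Saturation: the waiting queue has been non-empty for a full 5 minutes
min_over_time(vllm_requests_waiting[5m]) > 0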
Token Throughput (vllm.generation.throughput) in tokens per second is the capacity metric. If it drops more than 50% compared to an hour ago, something has changed — a GPU may have degraded, the model may have reloaded, or a configuration change is causing a regression.
# Throughput regression — >50% drop vs 1 hour ago
(sum(vllm_generation_throughput_per_second) / sum(vllm_generation_throughput_per_second offset 1h)) < 0.5
Training Metrics
MFU (Model FLOPs Utilization) (training.mfu) is the single most important training efficiency metric. It measures the ratio of actually achieved FLOPS to the theoretical peak of your hardware. An H100 SXM peaks at 989 TFLOPS for BF16. If your training job achieves 450 TFLOPS, your MFU is ~45%.
Good MFU values depend on model size and hardware, but generally:
- > 50% — Excellent
- 30-50% — Typical for most setups
- < 30% — Investigate. DataLoader bottleneck, poor batch size, communication overhead, or hardware issue.
A sudden MFU drop during training (e.g., from 45% to 31%) is a strong signal that something changed — often a thermally throttled GPU, a slow network link, or a straggler GPU in the collective.
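A relative-drop rule catches that sudden change regardless of the absolute level. This sketch assumes MFU is exported as a 0-1 gauge named training_mfu_ratio (an illustrative name):

# MFU dropped more than 25% compared to an hour ago
training_mfu_ratio < 0.75 * (training_mfu_ratio offset 1h)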
Gradient Health — training.gradient.norm (L2 norm of gradients), training.gradient.nan_count, and training.gradient.clip_rate catch training instability early. A spike in gradient norm or any NaN/Inf values indicates the model is heading toward divergence. Clip rate above 50% means your learning rate may be too high.
DataLoader Blocking (training.dataloader.wait) measures how long the training loop waits for the next batch of data. If this exceeds a few percent of step time, your data pipeline is the bottleneck — not the GPU. No amount of GPU optimization will help if the GPUs are starved for data.
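Expressed as a fraction of step time, with illustrative metric names (training_dataloader_wait_seconds and training_step_duration_seconds as per-step gauges sharing labels):

# Data pipeline bottleneck - GPUs starved for more than 5% of each step
training_dataloader_wait_seconds / training_step_duration_seconds > 0.05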
Checkpoint Duration (training.checkpoint.save_duration) matters at scale. Saving a 70B parameter checkpoint to shared storage can take 30-60 seconds. At scale, checkpoint I/O competes with training data I/O and can stall the entire training run.
The Monitoring Checklist
Here's the reference table. Pin it to your team's Slack channel.
| Metric | What It Tells You | Threshold | Action |
|---|---|---|---|
| GPU utilization | Is the GPU doing anything? | <5% for 15min = idle | Deallocate or investigate |
| Memory utilization | VRAM pressure | >95% = OOM risk | Reduce batch size or model size |
| Memory controller util | Compute-bound vs memory-bound | >80% = memory-bound | Optimize memory access patterns |
| Temperature (edge) | Thermal health | >85C warn, >95C critical | Check cooling, reduce load |
| Throttle reason | Why is the GPU slow? | Any active (non-syncboost) for 5min | Investigate cooling or power cap |
| ECC SBE rate | Memory degradation trend | >10/hr trending up | Schedule replacement in 48-72hrs |
| ECC DBE | Data corruption | Any increase | Immediate: stop workload, replace GPU |
| XID errors | Hardware fault | Any non-zero | Investigate per XID code |
| Row remap available | ECC repair capacity | 0 remaining | Schedule replacement |
| PCIe replay count | Link integrity | Increasing over 15min | Check slot seating and cables |
| TTFT | User-facing latency | P50 >500ms warn, P95 >2s critical | Check cache, queue, GPU health |
| KV-cache usage | Inference capacity | >80% warn, >90% critical | Scale out or reduce context |
| Queue depth (waiting) | Serving saturation | >0 for 5min sustained | Add replicas |
| Token throughput | Serving capacity | >50% drop vs 1hr ago | Investigate GPU, model, config |
| MFU | Training efficiency | <30% | Profile DataLoader, comms, batch size |
| Gradient NaN count | Training stability | >0 | Check learning rate, data quality |
These are the signals that separate teams who get paged at 3 AM from teams who catch problems during business hours. The hardware is expensive — the telemetry to protect it shouldn't be an afterthought.
Last9 collects all of these metrics across NVIDIA, AMD, and Intel Gaudi hardware, with pre-built dashboards and alert rules for each category. If you're running GPU infrastructure and want this out of the box, get started with Last9.
