Predicting GPU Failures Before They Cost You

XID error 79 — “GPU fell off bus.”

When this hits at step 47,000 of a 72-hour Llama 3 training run, you lose everything since the last good checkpoint. In the worst case that is the entire run: on a 64-GPU cluster at $200/hour, up to $14,400 in wasted compute. The run has to restart from the last checkpoint, assuming that checkpoint wasn’t corrupted by the same failing GPU that wrote part of it.

The GPU showed warning signs 48 hours before it died. Nobody was watching.

This post covers the signals that predict GPU hardware failure, how to turn them into a composite health score, and how to integrate that score into your operations so you catch failing GPUs before they cost you a training run.

How GPUs Fail

GPU failure is rarely instant. The “GPU fell off bus” XID 79 that kills your training run is the final event in a degradation sequence that usually starts days earlier. Understanding the sequence is the key to predicting it.

ECC memory degradation is the most common failure path. HBM (High Bandwidth Memory) cells degrade over time under heat and electrical stress, and cosmic rays add occasional transient bit flips on top of that. The GPU’s Error Correcting Code fixes single-bit errors (SBE) transparently, so the application never knows they happened. But as cells degrade, SBE frequency increases. When multiple bits in the same word fail, you get an uncorrectable double-bit error (DBE). At that point, data has been corrupted.

Meta published fleet-level research in 2025 showing that ECC single-bit error trends predict GPU failure 48-72 hours in advance with 89-96% accuracy. The signal isn’t “this GPU had an ECC error” — isolated SBEs are normal. The signal is “this GPU’s SBE rate is increasing over time.”

Row remapping exhaustion is the next stage. When a memory row has persistent errors, the GPU automatically remaps it to a spare row (like a hard drive remapping bad sectors). HBM has a finite pool of spare rows. When they’re exhausted (row_remap_available = 0), the next uncorrectable error can’t be repaired. The GPU must be replaced.

Thermal runaway happens when cooling fails — a dead fan, blocked airflow in a dense rack, or a failed liquid cooling pump. The GPU temperature climbs steadily. The driver applies thermal throttling to prevent damage, but if the cooling problem isn’t fixed, the GPU eventually hits the hardware thermal limit and shuts down. If it’s in the middle of an all-reduce step in a distributed training job, all other GPUs in the collective stall waiting for it.

PCIe link downtraining is subtler. A GPU that negotiated PCIe Gen5 x16 but is now running at Gen3 x8 has lost most of its host-to-device bandwidth. This happens when the PCIe link is electrically marginal — bad slot seating, a damaged riser card, or a connector with oxidation. The link “works” but at a fraction of its rated speed, silently slowing every data transfer.

XID faults are NVIDIA’s hardware fault reporting mechanism. They cover everything from corrected errors (informational) to fatal hardware failures. XID 79 (GPU fell off bus) is the most feared: it means the GPU is no longer reachable on the PCIe bus. XID 64 (row remapping failure) means the GPU tried to remap a bad memory row and failed, typically because spare rows are exhausted. XID 48 is a double-bit ECC error during computation.

According to fleet data, XID 79 affects approximately 3.2% of H100 fleets in the first year of operation.

From Point-in-Time Readings to Trends

The failure signals described above exist in NVML — the same library that nvidia-smi uses. But most monitoring tools treat them as point-in-time snapshots: “current ECC error count is 12.” That’s nearly useless.

A GPU with 12 lifetime ECC errors that accumulated over 6 months is healthy. A GPU that went from 2 to 12 ECC errors in the last hour is failing. The absolute number doesn’t matter. The rate of change matters.

The approach is sliding-window analysis. For each GPU, maintain a rolling window of observations (default: 5 minutes for fast signals like temperature, 1 hour for slow signals like ECC) and compute rates from the window boundaries:

sbe_rate = (newest_count - oldest_count) / elapsed_hours

This turns a meaningless counter into an actionable signal: “GPU-7 is accumulating single-bit ECC errors at 14 per hour, up from 2 per hour yesterday.”
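The rate formula above can be sketched as a small rolling window over cumulative counter samples. This is a minimal illustration, not a reference implementation: the class name `SlidingWindowRate` and its methods are hypothetical, and treating a counter that goes backwards (for example after a driver reload) as a zero delta is an assumption rather than something the formula specifies.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SlidingWindowRate:
    """Rolling window of (timestamp_seconds, cumulative_count) samples (hypothetical helper)."""

    window_seconds: float
    samples: deque = field(default_factory=deque)

    def add(self, timestamp: float, count: int) -> None:
        """Record a new counter reading and evict samples older than the window."""
        self.samples.append((timestamp, count))
        while self.samples and timestamp - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()

    def rate_per_hour(self) -> float:
        """(newest_count - oldest_count) / elapsed_hours over the current window."""
        if len(self.samples) < 2:
            return 0.0
        t_old, c_old = self.samples[0]
        t_new, c_new = self.samples[-1]
        elapsed_hours = (t_new - t_old) / 3600.0
        if elapsed_hours <= 0:
            return 0.0
        # Assumption: a negative delta means the counter was reset, so report zero.
        return max(c_new - c_old, 0) / elapsed_hours


# One-hour window for a slow signal such as the ECC single-bit error counter.
ecc = SlidingWindowRate(window_seconds=3600)
ecc.add(0, 2)
ecc.add(1800, 7)
ecc.add(3600, 12)
sbe_rate = ecc.rate_per_hour()  # 10 errors accumulated over exactly one hour
```

The same class works for fast signals by constructing it with a shorter `window_seconds`.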

The same applies to temperature. A temperature reading of 82C is normal for a loaded H100. But a temperature that was 74C ten minutes ago and is now 82C — a ramp rate of 0.8C/minute — means something is changing. If the ramp rate exceeds 2C/minute, the cooling system is failing and the GPU will hit thermal throttle limits within minutes.

The Five Predictive Signals

Five signals, tracked as rates over time, form the basis of GPU health prediction.

1. ECC Single-Bit Error Rate (errors/hour)

The strongest single predictor of impending failure. Computed as the delta in the cumulative SBE counter over the observation window, normalized to errors per hour.

0-2 errors/hour — Normal background rate. No action needed.
2-10 errors/hour — Elevated. Worth tracking. If trending upward over 24 hours, the GPU is degrading.
Above 10 errors/hour and increasing — This GPU will likely fail within 48-72 hours. Schedule replacement.

# ECC SBE rate elevated; set for: 10m on the alert rule to require 10+ minutes
gpu_ecc_sbe_rate > 10

The key nuance: a single spike to 15 errors/hour that drops back to 2 is likely a transient event (cosmic ray shower, power fluctuation). A sustained trend from 2 to 5 to 8 to 14 over 48 hours is memory degradation. The trend matters more than any single reading.
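One way to separate a transient spike from a sustained trend is to combine a persistence check with a simple slope estimate. The sketch below is illustrative only: the function names are hypothetical, and a least-squares slope over evenly spaced samples is one possible trend measure among many.

```python
def sustained_above(rates, threshold, min_samples):
    """True if the most recent `min_samples` readings all exceed `threshold`."""
    if len(rates) < min_samples:
        return False
    return all(r > threshold for r in rates[-min_samples:])


def trend_slope(rates):
    """Least-squares slope of rate against sample index (evenly spaced samples assumed)."""
    n = len(rates)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    numerator = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(rates))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


spike = [2, 2, 15, 2, 2]   # one transient burst, then back to baseline
trend = [2, 5, 8, 14]      # steady climb, as in the 48-hour example above

spike_slope = trend_slope(spike)   # close to zero: no underlying trend
trend_slope_value = trend_slope(trend)  # clearly positive: the rate keeps rising
spike_persists = sustained_above(spike, threshold=10, min_samples=3)
```

A spike fails the persistence check and shows a flat slope; a degrading GPU shows a positive slope even before any single reading crosses the alert threshold.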

2. XID Error Rate (events/hour)

XID events are discrete hardware faults. Any non-zero rate on a production GPU is concerning, but the severity depends on the XID code:

XID 79 (GPU fell off bus): Immediately critical. GPU is dead or dying.
XID 64 (row remapping failure): Critical. ECC repair capacity exhausted.
XID 48 (double-bit ECC error): Critical. Active data corruption.
XID 74 (NVLink error): Warning. Check cables and connectors.
XID 63 (row remapping event): Informational. GPU is self-repairing, so track the rate.

# Any XID errors in the last 5 minutes
increase(gpu_xid_errors[5m]) > 0
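Severity-by-code can be expressed as a lookup table. The sketch below is hypothetical and deliberately partial: it includes only three of the codes discussed above, and defaulting unknown codes to "warning" is an assumption based on the idea that any XID on a production GPU deserves a look.

```python
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}

# Partial table for illustration; extend it with the other codes you track.
XID_SEVERITY = {
    79: "critical",  # GPU fell off bus
    48: "critical",  # double-bit ECC error
    74: "warning",   # NVLink error
}


def classify_xid(code, table=XID_SEVERITY, default="warning"):
    """Return the severity for an XID code; unknown codes get `default` (an assumption)."""
    return table.get(code, default)


def worst_severity(codes, table=XID_SEVERITY):
    """Return the highest severity among the observed codes, or None if there were none."""
    severities = [classify_xid(code, table) for code in codes]
    if not severities:
        return None
    return max(severities, key=SEVERITY_ORDER.__getitem__)


recent = [74, 79]
overall = worst_severity(recent)
```

Keeping the table as data rather than branching logic makes it easy to review and to adjust per fleet.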

3. Thermal Ramp Rate (C/minute)

Measures how fast the GPU temperature is changing, not the absolute temperature. Computed as temperature delta over elapsed time.

Below 0.5 C/min — Normal fluctuation from workload changes.
0.5-2.0 C/min — Possible workload spike. Check if a new job just started.
Above 2.0 C/min — Cooling system failure. Fan dead, airflow blocked, or liquid cooling pump failed. The GPU will hit thermal throttle limits within minutes.

This signal is valuable because temperature itself is workload-dependent — a loaded GPU at 80C is fine, an idle GPU at 80C is not. The ramp rate normalizes for workload: a rapid climb regardless of load level means something physical is wrong.
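The ramp-rate bands above translate directly into a small classifier. This is a sketch under stated assumptions: the function names and the band labels are hypothetical, and negative ramps (a GPU cooling down) are treated as normal.

```python
def thermal_ramp_c_per_min(temp_old_c, temp_new_c, elapsed_seconds):
    """Temperature delta divided by elapsed time, in degrees C per minute."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (temp_new_c - temp_old_c) / (elapsed_seconds / 60.0)


def classify_ramp(ramp_c_per_min):
    """Map a ramp rate onto the three bands described above (labels are illustrative)."""
    if ramp_c_per_min > 2.0:
        return "cooling_failure"
    if ramp_c_per_min >= 0.5:
        return "possible_workload_spike"
    return "normal"  # includes a GPU that is cooling down


# 74C ten minutes ago, 82C now.
ramp = thermal_ramp_c_per_min(74.0, 82.0, elapsed_seconds=600)
band = classify_ramp(ramp)
```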

4. PCIe Link Downtraining

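Detects when a GPU is running at a lower PCIe generation or width than its maximum capability. An H100 SXM should run at PCIe Gen5 x16. If it’s reporting Gen3 x8, it has lost approximately 87.5% of its host bandwidth: Gen3 signals at a quarter of the Gen5 per-lane rate (8 GT/s versus 32 GT/s), and x8 halves the lane count.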

This isn’t a prediction — it’s an active degradation that causes immediate performance loss. Every CPU-to-GPU data transfer (model loading, KV-cache swap, checkpoint writes) runs at a fraction of rated speed.

pcie_downtraining = (pcie_gen_current < pcie_gen_max) or (pcie_width_current < pcie_width_max)

Caveat: Some systems legitimately reduce PCIe link state during idle periods (PCIe ASPM power management). Only alert on downtraining when the GPU is actively under load.
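The downtraining check and the load caveat can be combined in one function. In this sketch the `PcieLinkState` type is hypothetical; its fields could be populated from NVML's current and maximum PCIe link generation and width queries plus the utilization query. The 10% utilization gate is a configurable assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PcieLinkState:
    """Snapshot of one GPU's PCIe link (hypothetical container)."""

    gen_current: int
    gen_max: int
    width_current: int
    width_max: int
    gpu_utilization_pct: float


def is_downtrained(link, min_utilization_pct=10.0):
    """Flag downtraining only when the GPU is under load.

    Idle GPUs may legitimately sit at a reduced link state, so they are skipped.
    """
    if link.gpu_utilization_pct <= min_utilization_pct:
        return False
    return (link.gen_current < link.gen_max) or (link.width_current < link.width_max)


loaded_bad = PcieLinkState(3, 5, 8, 16, gpu_utilization_pct=95.0)
idle_low = PcieLinkState(1, 5, 16, 16, gpu_utilization_pct=0.0)
loaded_ok = PcieLinkState(5, 5, 16, 16, gpu_utilization_pct=80.0)

flags = [is_downtrained(s) for s in (loaded_bad, idle_low, loaded_ok)]
```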

5. Row Remap Exhaustion

HBM spare rows are the GPU’s self-repair budget. Each time a memory row is permanently damaged, the GPU remaps it to a spare. The total available spares vary by GPU model (typically 128-512 per HBM stack).

Above 50% remaining — Healthy. Row remaps are occurring but the budget is ample.
Below 25% remaining — Watch closely. The GPU is consuming its repair budget faster than expected.
Zero remaining — The next uncorrectable ECC error cannot be repaired. The GPU is on borrowed time.

Unlike ECC rate, which is a trending signal, row remap exhaustion is a threshold: it either has spares or it doesn’t. But tracking the rate of spare consumption gives you a time-to-exhaustion estimate.
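A time-to-exhaustion estimate can be as simple as linear extrapolation from two readings of the spare-row count. The sketch below assumes consumption continues at the observed average rate, which is a simplification; the function name and the example numbers are hypothetical.

```python
def hours_to_exhaustion(spares_then, spares_now, elapsed_hours):
    """Estimate hours until spare rows run out, by linear extrapolation.

    Returns None when no spares were consumed during the interval,
    because there is no rate to extrapolate from.
    """
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    consumed = spares_then - spares_now
    if consumed <= 0:
        return None
    # spares_now / (consumed / elapsed_hours), rearranged to avoid a tiny intermediate value.
    return spares_now * elapsed_hours / consumed


# Illustrative numbers: 40 spares a week ago (168 hours), 32 spares now.
estimate = hours_to_exhaustion(40, 32, elapsed_hours=168)
estimate_days = estimate / 24
```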

Composite Health Score

Five signals are useful for root-cause analysis, but for a fleet dashboard where you need to scan 1,000+ GPUs and quickly identify the ones that need attention, you want a single number.

The health score starts at 100 and deducts points for each active degradation signal:

Signal	Penalty	Rationale
PCIe downtraining	-10	Performance loss, hardware issue
SBE rate (linear up to 10/hr)	-20 max	Strongest failure predictor
Any double-bit ECC errors	-30	Active data corruption
Row remap exhausted (0 avail)	-15	No more self-repair capacity
XID error rate (linear up to 5/hr)	-15 max	Hardware faults occurring
Thermal ramp > 2 C/min	-10	Cooling system failure

The penalties are additive and floor at 0. A GPU with double-bit ECC errors (-30), exhausted row remaps (-15), and an elevated SBE rate of 10/hr (-20) scores 35 — deep in critical territory.
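The penalty table can be sketched as a single scoring function. This is an illustration rather than the production implementation: the `GpuSignals` type and field names are hypothetical, and "linear up to N/hr" is interpreted here as scaling from zero penalty at a rate of 0 to the maximum penalty at N.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuSignals:
    """Current degradation signals for one GPU (hypothetical container)."""

    pcie_downtrained: bool = False
    sbe_rate_per_hour: float = 0.0
    dbe_count: int = 0
    row_remaps_available: Optional[int] = None  # None means "not reported"
    xid_rate_per_hour: float = 0.0
    thermal_ramp_c_per_min: float = 0.0


def _linear_penalty(value, cap, max_penalty):
    """Scale linearly from 0 at value=0 to max_penalty at value>=cap (an assumed reading)."""
    return max_penalty * min(max(value, 0.0), cap) / cap


def health_score(s):
    """Start at 100, subtract additive penalties, floor at 0."""
    penalty = 0.0
    if s.pcie_downtrained:
        penalty += 10
    penalty += _linear_penalty(s.sbe_rate_per_hour, cap=10.0, max_penalty=20.0)
    if s.dbe_count > 0:
        penalty += 30
    if s.row_remaps_available == 0:
        penalty += 15
    penalty += _linear_penalty(s.xid_rate_per_hour, cap=5.0, max_penalty=15.0)
    if s.thermal_ramp_c_per_min > 2.0:
        penalty += 10
    return max(0.0, 100.0 - penalty)


# The worked example above: DBE present, remaps exhausted, SBE rate at 10/hr.
example = GpuSignals(dbe_count=1, row_remaps_available=0, sbe_rate_per_hour=10.0)
score = health_score(example)  # 100 - 30 - 15 - 20
```

Because every penalty is a plain addition, the component breakdown behind any score can be reproduced by hand.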

Alert thresholds:

Below 80 — Warning. One or more degradation signals active. Investigate.
Below 50 — Critical. Multiple signals active. Schedule replacement or drain the GPU from workloads.

# Health score critical (ratio 0.50 = score 50), likely multiple degradation signals
gpu_health_score_ratio < 0.50

Why double-bit ECC gets the heaviest penalty (-30): A double-bit error means the ECC correction capacity has been exceeded — data corruption has already happened. Training checkpoints written while a DBE was active may be silently corrupted. Inference outputs may have been wrong. This isn’t a “might fail soon” signal — it’s a “damage has already occurred” signal.

The scores are intentionally not machine-learned. They’re simple, transparent, and auditable. When the on-call sees a GPU at score 45, they can look at the component signals and understand exactly why. No black box.

Operational Integration

A health score is only useful if it’s wired into your operational workflows.

Kubernetes: When a GPU’s health score drops below 50, automatically cordon the node to prevent new pod scheduling. Cordoning keeps new jobs off degrading hardware without disrupting jobs already running on other GPUs on the same node. Draining goes further: it evicts every pod on the node, including those on healthy GPUs, so reserve it for when the running jobs can tolerate being rescheduled onto healthy nodes.

Slurm: Use epilog hooks to check GPU health between jobs. If the GPU health score is below threshold after a job completes, mark the node as draining in Slurm — the scheduler won’t assign new jobs to it, but existing jobs on the node finish normally. This is critical for multi-day training runs: you don’t want to kill an active job, but you do want to prevent the next job from landing on a failing GPU.
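The scheduler actions above can be sketched as a function that decides which command to run for a node. The function names and the reason string are hypothetical; the commands themselves are the standard `kubectl cordon` and `scontrol update ... State=DRAIN` invocations. The sketch only builds the argument list and does not execute anything, and for Kubernetes it covers the cordon step only, since draining evicts running pods.

```python
from typing import List, Optional


def cordon_command(node: str) -> List[str]:
    """Kubernetes: stop new pods from being scheduled onto the node."""
    return ["kubectl", "cordon", node]


def slurm_drain_command(node: str, reason: str) -> List[str]:
    """Slurm: let running jobs finish but accept no new ones."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


def plan_action(node: str, min_gpu_score: float, scheduler: str,
                threshold: float = 50.0) -> Optional[List[str]]:
    """Return the command to run for a node, or None if its worst GPU is healthy enough.

    `min_gpu_score` is the lowest health score among the node's GPUs.
    """
    if min_gpu_score >= threshold:
        return None
    if scheduler == "kubernetes":
        return cordon_command(node)
    if scheduler == "slurm":
        return slurm_drain_command(node, f"GPU health score {min_gpu_score:.0f}")
    raise ValueError(f"unknown scheduler: {scheduler}")


k8s_cmd = plan_action("gpu-node-07", 42.0, "kubernetes")
slurm_cmd = plan_action("gpu-node-07", 42.0, "slurm")
healthy = plan_action("gpu-node-08", 93.0, "slurm")
```

In a Slurm epilog, the returned list could be passed to `subprocess.run` once the job has completed; keeping decision and execution separate makes the policy easy to test without touching a cluster.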

Fleet dashboards: A single “health score” column in a GPU fleet table gives operators immediate visibility:

# Fleet-level: what percentage of GPUs have degraded health?
count(gpu_health_score_ratio < 0.80) / count(gpu_health_score_ratio) * 100

If more than 5% of your fleet has degraded health scores, something systemic may be happening — a batch of GPUs from the same manufacturing lot, a cooling issue in a specific rack row, or a driver bug triggering false XID errors.
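To tell a systemic problem from scattered failures, the same percentage can be broken down by rack, lot, or driver version. The sketch below is illustrative: the function names and the `(group, score)` input shape are assumptions, and scores use the 0 to 100 scale.

```python
from collections import defaultdict


def degraded_percent(scores, threshold=80.0):
    """Percentage of scores below `threshold`; 0.0 for an empty fleet."""
    scores = list(scores)
    if not scores:
        return 0.0
    return 100.0 * sum(1 for s in scores if s < threshold) / len(scores)


def degraded_percent_by_group(gpus, threshold=80.0):
    """Break the degraded percentage down by group, given (group, score) pairs."""
    grouped = defaultdict(list)
    for group, score in gpus:
        grouped[group].append(score)
    return {group: degraded_percent(vals, threshold) for group, vals in grouped.items()}


fleet = [
    ("rack-a", 100.0), ("rack-a", 95.0),
    ("rack-b", 70.0), ("rack-b", 45.0),
]
overall = degraded_percent(score for _, score in fleet)
by_rack = degraded_percent_by_group(fleet)
```

When the degraded GPUs cluster in one group, as they do in `rack-b` here, the cause is more likely environmental than a run of unrelated hardware faults.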

What We Learned Running This in Production

False positives from PCIe ASPM: Power management can legitimately reduce PCIe link width during idle periods. We initially alerted on any downtraining event, which triggered false alarms on idle GPUs. The fix: only evaluate PCIe downtraining when GPU utilization is above 10%.

SBE bursts vs trends: A cosmic ray event can cause a burst of 20+ SBEs in a minute, then nothing for days. Early versions of the alerting flagged these as critical. The fix: require the elevated SBE rate to persist for at least 10 minutes before alerting.

Window duration matters: The 5-minute default window is great for thermal ramp detection (fast signal) but too short for ECC trends (slow signal). In production, we use 5-minute windows for thermal and 1-hour windows for ECC rate. The code supports different windows per signal, but the default configuration didn’t make this clear enough in early documentation.

The value of the composite score: The biggest operational win wasn’t any individual signal — it was the single health score on the fleet dashboard. Before: operators checked 5 separate metric panels per GPU, across 500 GPUs. After: one column, sorted ascending. The GPUs that need attention float to the top. Time-to-detection for degraded GPUs dropped from hours to minutes.

These signals are available on current NVIDIA data center GPUs (row remapping arrived with the Ampere generation; older parts used page retirement instead). The data is free: NVML exposes it. The challenge isn’t collection, it’s turning point-in-time readings into trends and scoring them into something actionable.

Last9 computes GPU health scores continuously across your entire fleet and surfaces them in fleet dashboards with automatic alerting. One glance tells you which GPUs need attention — before they cost you a training run. Get started with Last9.