Apr 21st, 2026

From GPU Silicon to Business Metrics: The 8 Layers of GPU Observability

GPU observability isn't one thing: it's eight connected layers, from silicon to cost. See why correlation across layers is what cuts debugging from 2 hours to 2 minutes, and why most teams instrument only one or two.

Your inference latency just spiked. Is it the model? The GPU? The node? The network? A thermal throttle? A memory leak?

If your monitoring can't answer this in under 60 seconds, you're paying for a GPU cluster you can't debug. And the problem isn't that you're missing data. You probably have GPU utilization from DCGM, inference metrics from Prometheus, pod status from kubectl, and cost numbers in a spreadsheet. The problem is that none of these systems talk to each other.

GPU observability today is siloed. Hardware metrics live in one dashboard. Inference metrics live in another. Workload identity requires a kubectl command. Cost is a monthly email from finance. When something goes wrong, you're Alt-Tabbing between four tools, mentally joining data that should already be correlated.

This post introduces a framework for thinking about GPU observability as eight connected layers — and explains why correlation across those layers is the difference between debugging in 2 minutes versus 2 hours.

This is the anchor post for a 6-part series on GPU observability. Each layer below links to a deep-dive post. Read this end-to-end, then pick the rabbit holes that apply to your cluster.


The 8 Layers

GPU observability isn't one thing. It's eight layers stacked on top of each other, each answering different questions, each producing different signals. Most teams instrument one or two layers and wonder why they can't find root causes.

| Layer | What It Covers | The Question It Answers |
|---|---|---|
| L1 — GPU Silicon | Utilization, memory, temperature, ECC, power, NVLink | "Is the GPU healthy and doing useful work?" |
| L2 — CUDA / NCCL | AllReduce bandwidth, straggler detection, kernel profiling | "Is multi-GPU communication the bottleneck?" |
| L3 — Host / OS | CPU, RAM, disk I/O, network, swap | "Is the host — not the GPU — the problem?" |
| L4 — K8s / Slurm | GPU-to-pod mapping, namespace, deployment, job identity | "Which workload owns this GPU?" |
| L5 — Training | MFU, gradient health, DataLoader wait, checkpoint I/O | "Is the training run efficient, or wasting GPU cycles?" |
| L6 — Inference | TTFT, throughput, KV-cache, queue depth, ITL | "Is the inference engine keeping up with demand?" |
| L7 — GenAI Semantics | Unified gen_ai.* OTel namespace across engines | "Can I compare vLLM, Triton, and SGLang on the same dashboard?" |
| L8 — Business / Cost | Cost per GPU-hour, cost per token, idle waste, carbon | "What is this costing us and who's paying?" |

Layers 1-3 are infrastructure. Layer 4 is the bridge — it connects anonymous GPU metrics to named workloads. Layers 5-6 are application-level. Layers 7-8 are about standardization and business outcomes.

Most existing tools cover L1 (DCGM, nvidia-smi) and maybe part of L3 (node_exporter). Some cover L6 (Prometheus scraping vLLM). Almost nobody covers L2, L4, L5, L7, or L8.

The gap isn't in any single layer — it's in the connections between them.

Deep dive on L1: The GPU Metrics That Actually Matter — the 50+ signals most teams skip.


Why Correlation Matters More Than Coverage

Having all eight layers instrumented is necessary but not sufficient. The value is in correlation — being able to start with a symptom at any layer and trace the cause across layers.

Here's a concrete example of what this looks like:

Symptom (L6): TTFT P95 spikes from 200ms to 2.1 seconds. The inference alert fires.

Layer 6 investigation: KV-cache usage is at 92%. That explains the TTFT spike — new requests are queuing because the cache is full. But why is the cache full? It was at 70% an hour ago with the same traffic volume.

Deep dive on L6: Your LLM Is Slower Than You Think — why GPU utilization hides inference problems.

Layer 1 investigation: GPU utilization across the four GPUs in the serving pool: GPU-0 at 45%, GPU-1 at 42%, GPU-2 at 44%, GPU-3 at 100%. One GPU is pegged. Checking further: GPU-3 temperature is 91C and climbing. Thermal throttling is active — the GPU is running at reduced clocks to prevent hardware damage.

A throttled GPU processes tokens slower, so each request occupies its KV-cache slot longer. Slots don't free up as fast. Cache fills. New requests queue.
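
The L1 step of this investigation — spotting the one GPU whose utilization and temperature diverge from the rest of the pool — is mechanical enough to sketch. The sample data mirrors the four-GPU pool above; the threshold and field names are illustrative assumptions, not output from a real exporter:

```python
# Sketch: flag GPUs whose temperature or utilization diverges from the
# serving pool, as in the GPU-3 example above. Data and thresholds are
# illustrative, not from a real exporter.

THROTTLE_TEMP_C = 85  # hypothetical alert threshold below the hardware slowdown point

def find_outliers(samples):
    """samples: list of dicts with keys gpu, util_pct, temp_c."""
    median_util = sorted(s["util_pct"] for s in samples)[len(samples) // 2]
    outliers = []
    for s in samples:
        hot = s["temp_c"] >= THROTTLE_TEMP_C
        pegged = s["util_pct"] >= median_util + 40  # far above pool median
        if hot or pegged:
            outliers.append({**s, "hot": hot, "pegged": pegged})
    return outliers

pool = [
    {"gpu": "GPU-0", "util_pct": 45, "temp_c": 62},
    {"gpu": "GPU-1", "util_pct": 42, "temp_c": 61},
    {"gpu": "GPU-2", "util_pct": 44, "temp_c": 63},
    {"gpu": "GPU-3", "util_pct": 100, "temp_c": 91},  # throttling candidate
]

for o in find_outliers(pool):
    print(o["gpu"], "hot" if o["hot"] else "", "pegged" if o["pegged"] else "")
# -> GPU-3 hot pegged
```

In a real deployment this comparison runs as an alert rule over the pool's metrics, not as ad-hoc code; the point is that it only needs L1 data that every DCGM-style exporter already emits.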

Deep dive on failure prediction: Predicting GPU Failures Before They Cost You — XID errors, ECC trends, and thermal signatures that precede hardware failure.

Layer 4 investigation: GPU-3 is assigned to pod vllm-prod-3 in the ml-serving namespace, owned by Deployment vllm-prod. This narrows the blast radius — it's one pod in the serving pool, not a systemic issue.

Layer 3 investigation: The node hosting GPU-3 (ip-192-168-33-18) has a degraded fan sensor. The node's ambient cooling is failing, not just the GPU. Other GPUs on this node will follow the same path if not addressed.

Action: Cordon the node, drain the pod to a healthy node. The remaining three GPUs in the pool resume normal operation. KV-cache drops back to 70%. TTFT recovers within minutes.

Layer 8 impact: The degraded GPU caused 6 hours of impacted throughput before detection. At the serving pool's rate, that's approximately $2,400 in degraded capacity — requests that were served slowly or timed out.

With correlation, this investigation takes 2 minutes. You see the TTFT alert, check the GPU breakdown, spot the thermal outlier, identify the node, and drain it.

Without correlation, it takes 2 hours. You see the TTFT alert, check the vLLM dashboard (cache is full — but why?), check the GPU dashboard in a separate tool (can't tell which pod is on which GPU), SSH into nodes to check temperature manually, try to correlate timestamps across tools, eventually find the fan issue by process of elimination.

The time difference matters because every minute of degraded performance is serving slow or failed responses to your users.


The Bridge Layer: GPU-to-Workload Identity

Layer 4 — workload identity — is the layer that makes cross-layer correlation possible. Without it, you have "GPU-3 is hot" but not "the vllm-prod pod on GPU-3 is affected." Without it, every investigation requires manual kubectl get pods correlation.

The technical challenge is non-trivial. In Kubernetes, the NVIDIA device plugin assigns GPU devices to pods through environment variables (NVIDIA_VISIBLE_DEVICES) and device file mounts. But this mapping isn't exposed through standard Kubernetes APIs. You have to:

  1. Query the Kubernetes API for pods on each node that have GPU resource requests
  2. Read each pod's environment to determine which GPU indices are assigned
  3. Resolve owner references (pod → ReplicaSet → Deployment) to get workload identity
  4. Cache the results (GPU-to-pod mapping changes infrequently)
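
Steps 1-3 can be sketched in a few lines. The pod records here are stubbed stand-ins for what a real agent would read from the Kubernetes API (pod spec, container env, owner references), and the sketch only handles comma-separated GPU indices, not UUID-form `NVIDIA_VISIBLE_DEVICES` values:

```python
# Sketch of steps 1-3: resolve GPU index -> workload identity from pod data.
# Pods are stubbed; a real agent reads them from the Kubernetes API and
# resolves pod -> ReplicaSet -> Deployment via owner references.

def build_gpu_map(pods):
    """Return {gpu_index: identity} for pods with NVIDIA_VISIBLE_DEVICES set."""
    mapping = {}
    for pod in pods:
        devices = pod["env"].get("NVIDIA_VISIBLE_DEVICES", "")
        if not devices or devices == "void":
            continue
        identity = {
            "pod": pod["name"],
            "namespace": pod["namespace"],
            # Step 3: owner references pre-resolved to a Deployment here
            "deployment": pod["owner_deployment"],
        }
        for idx in devices.split(","):  # index form only; UUIDs not handled
            mapping[int(idx)] = identity
    return mapping

pods = [
    {"name": "vllm-prod-3", "namespace": "ml-serving",
     "owner_deployment": "vllm-prod",
     "env": {"NVIDIA_VISIBLE_DEVICES": "3"}},
]

print(build_gpu_map(pods)[3]["deployment"])  # -> vllm-prod
```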

On Slurm clusters, it's a similar challenge with different mechanics — reading GRES allocations from slurmctld or inspecting /proc for SLURM_JOB_ID environment variables.

Once the mapping exists, every GPU metric automatically carries workload context. A temperature reading isn't just "GPU-3 at 91C" — it's "GPU-3 at 91C, assigned to pod vllm-prod-3 in namespace ml-serving, owned by Deployment vllm-prod, running model Llama-3-70B."

This is the enrichment that turns raw hardware telemetry into actionable operational data. It's also the enrichment that enables Layer 8 — you can't attribute cost to a team if you don't know which team's workload is running on each GPU.


What End-to-End Looks Like

When all eight layers are connected through a single observability platform, the experience changes fundamentally.

Fleet dashboard: A table of every GPU in the cluster. Columns: GPU index, node, health score, utilization, temperature, workload (pod/job), inference latency (if applicable), cost rate. Sorted by health score ascending. The GPUs that need attention are always at the top.

Drill-down: Click any GPU. See its L1 hardware metrics, L3 host metrics, L4 workload identity, and L6 inference metrics (if it's running an inference engine) — all on the same page, same time range, same context.

Alert correlation: An alert fires — "TTFT P95 > 2s." The alert is enriched with L4 context (which cluster, which namespace, which deployment). The alert links directly to the GPU breakdown for that deployment, showing which specific GPUs are affected and their L1 health signals. The operator doesn't need to open four tools — the correlation is pre-computed.

Cost attribution: The L8 cost dashboard shows GPU spend by namespace, deployment, or team. Because every GPU metric carries workload identity (L4), aggregating cost by team is a GROUP BY — not a spreadsheet exercise.
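
Once usage samples carry namespace labels, that GROUP BY is a trivial aggregation. A minimal sketch, where the $/GPU-hour rate and usage figures are illustrative assumptions:

```python
from collections import defaultdict

# Sketch: aggregate GPU spend by namespace from L4-enriched usage samples.
# The rate and sample data are illustrative assumptions.
RATE_PER_GPU_HOUR = 4.00  # hypothetical blended $/GPU-hour

def cost_by_namespace(samples):
    """samples: (namespace, gpu_hours) pairs derived from enriched metrics."""
    totals = defaultdict(float)
    for ns, gpu_hours in samples:
        totals[ns] += gpu_hours * RATE_PER_GPU_HOUR
    return dict(totals)

usage = [("ml-serving", 24.0), ("ml-training", 96.0), ("ml-serving", 8.0)]
print(cost_by_namespace(usage))
# -> {'ml-serving': 128.0, 'ml-training': 384.0}
```

In practice the same aggregation runs in the TSDB as a label-based `sum by (namespace)` rather than in application code, but it only works because the namespace label is on every sample.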

Deep dive on L8: Every Token Has a Price: Per-Request GPU Cost Attribution — why flat per-token rates are wrong, and how to attribute cost to individual requests.

Training debugging: An MFU drop (L5) triggers investigation. The training dashboard shows which rank (GPU) is the straggler. L2 NCCL metrics show AllReduce bandwidth dropped for that rank. L1 shows the underlying GPU has PCIe link downtraining. L3 shows the host's network interface had a burst of TCP retransmits. Root cause: network flap caused PCIe renegotiation and NVLink degradation. Total investigation time: 3 minutes.


The Gap in Existing Tools

| Capability | Datadog | Grafana + DCGM | CloudWatch | Last9 |
|---|---|---|---|---|
| GPU hardware metrics (L1) | NVIDIA only | NVIDIA only | 16 NVIDIA metrics | NVIDIA, AMD, Intel Gaudi |
| NCCL / collective comms (L2) | No | No | No | Yes |
| Host metrics (L3) | Yes | Yes | Yes | Yes |
| GPU-to-pod attribution (L4) | Partial (beta) | Partial (GPU Operator labels) | No | Yes (automatic) |
| Training metrics (L5) | No | No | No | Yes |
| Inference engine metrics (L6) | vLLM, NIM | Manual Prometheus scrape | No | vLLM, SGLang, Triton, TGI, NIM |
| Unified GenAI namespace (L7) | No | No | No | Yes |
| Cost attribution (L8) | No | No | No | Yes |

The pattern: existing tools cover 2-3 layers. Nobody covers all eight in a single, correlated platform.

Datadog has the broadest GPU support among traditional observability vendors, but it's NVIDIA-only and doesn't connect hardware metrics to inference engine metrics to cost. Grafana + DCGM gives you raw hardware counters and requires manual setup for everything else. CloudWatch provides basic metrics with no attribution, no inference visibility, and no multi-vendor support.

The fundamental issue is architectural: traditional monitoring tools collect metrics in silos. GPU hardware metrics come from one integration, inference metrics from another, Kubernetes metadata from a third. Each has different attribute schemas, different retention policies, and different query languages. Joining them requires manual correlation — which is what turns a 2-minute investigation into a 2-hour one.


The OpenTelemetry Advantage

The reason end-to-end correlation is possible is OpenTelemetry. When every signal — GPU hardware, host metrics, inference engine, training library, workload identity — is emitted as OTel metrics, logs, or traces with shared resource attributes, correlation becomes a query-time join rather than a manual investigation.

Every data point carries the same resource attributes:

  • k8s.cluster.name — Which cluster
  • host.name — Which node
  • gpu.index — Which GPU
  • k8s.pod.name — Which workload (enriched by the processor)
  • k8s.namespace.name — Which team/environment

When you query "show me all signals for GPU-3 on node X in the last hour," you get hardware metrics, inference metrics, host metrics, and cost data — all in one result, all aligned by timestamp and GPU identity.
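
The query-time join can be sketched as a filter over rows that share resource attributes. The rows here are illustrative stand-ins for OTel data points, and the metric names are examples rather than exact semantic-convention identifiers:

```python
# Sketch: when every signal carries the same resource attributes, "all
# signals for GPU-3 on node X" is a filter, not a cross-tool investigation.
# Rows and metric names are illustrative stand-ins for OTel data points.

signals = [
    {"signal": "gpu.temperature", "host.name": "ip-192-168-33-18",
     "gpu.index": 3, "value": 91},
    {"signal": "gen_ai.server.time_to_first_token.p95",
     "host.name": "ip-192-168-33-18", "gpu.index": 3, "value": 2.1},
    {"signal": "gpu.temperature", "host.name": "ip-192-168-33-18",
     "gpu.index": 0, "value": 62},
]

def signals_for(rows, host, gpu_index):
    """Query-time join: shared resource attributes, one filter."""
    return [r for r in rows
            if r["host.name"] == host and r["gpu.index"] == gpu_index]

hits = signals_for(signals, "ip-192-168-33-18", 3)
print([h["signal"] for h in hits])
```

Hardware and inference signals come back in one result because they agree on `host.name` and `gpu.index`; that agreement is exactly what the shared attribute schema buys you.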

This is what makes the 8-layer framework practical rather than theoretical. Without a common data model, eight layers means eight tools. With OTel, eight layers means eight categories in one platform.

Deep dive on scale: 10,000 GPUs, One TSDB: Cardinality at GPU Scale — how to keep this unified data model from melting your time-series database.


We built l9gpu to make this real

Most GPU monitoring starts and stops at Layer 1. That's like monitoring a web application by watching CPU utilization alone — you can tell something is busy, but not why, not for whom, and not at what cost.

The eight-layer framework is a way to think about what "complete" GPU observability means: from the silicon that does the work, through the software that orchestrates it, to the business that pays for it.

l9gpu is the open-source agent we built to cover all eight layers — NVIDIA, AMD, and Intel Gaudi; Kubernetes and Slurm; every major inference engine. It's a single DaemonSet (or systemd unit) that emits OpenTelemetry metrics, logs, and traces. Point it at any OTLP-compatible backend and you get the data model this post describes.

If you want the managed version — all 8 layers pre-built as dashboards and alerts, with a TSDB engineered for GPU-scale cardinality — Last9 ships l9gpu with its backend, so you can go from helm install to a live fleet dashboard in minutes.


The series

  1. From GPU Silicon to Business Metrics (this post) — the 8-layer framework
  2. The GPU Metrics That Actually Matter — L1 silicon signals
  3. Your LLM Is Slower Than You Think — L6 inference observability
  4. Predicting GPU Failures Before They Cost You — reliability patterns across L1/L3
  5. Every Token Has a Price: Per-Request GPU Cost Attribution — L8 cost
  6. 10,000 GPUs, One TSDB: Cardinality at GPU Scale — the infra that makes the data model work

About the authors
Shekhar

AI Infrastructure @Last9
