Your vLLM instance is running at 60% GPU utilization and you think you have headroom. Then a customer reports 3-second response times. You check the logs — no errors. The GPU dashboard shows plenty of capacity. What happened?
The GPU was "utilized" but your KV-cache was at 94%, requests were queuing, and speculative decoding was rejecting 80% of draft tokens. GPU utilization told you nothing useful. The metrics that actually predict user-facing performance are entirely different from the ones you'd monitor for a traditional GPU workload.
This is a guide to the inference metrics that matter — what to watch, what the thresholds mean, and what to do when they go wrong.
The Utilization Lie
GPU utilization measures one thing: the percentage of time the Streaming Multiprocessors had at least one active kernel. For training workloads, this is a reasonable proxy for "is the GPU busy doing useful work." For inference, it's nearly useless.
An LLM inference server does two fundamentally different things in the same GPU second:
- Prefill — Process the input prompt. This is compute-heavy, matrix-multiplication-intensive work. It lights up the tensor cores and drives high utilization.
- Decode — Generate output tokens one at a time. This is memory-bandwidth-bound — the GPU reads the entire KV-cache for each token. Utilization stays high (the GPU is technically busy), but throughput per watt of GPU power is far lower.
A GPU at 90% utilization that's doing mostly decode work is fundamentally less productive than a GPU at 70% utilization that's doing mostly prefill. The utilization number looks fine, but user-facing latency is degrading because the KV-cache is full, every token generation requires reading gigabytes of cached state, and the memory controller is saturated.
The metrics that actually predict user experience are TTFT, KV-cache pressure, queue depth, and throughput. GPU utilization is the denominator in an efficiency equation — not the metric you alert on.
Time-to-First-Token: The Metric Your Users Feel
Time-to-First-Token (TTFT) measures the elapsed time from when the server receives a request to when it produces the first output token. For streaming responses (which is how most LLM APIs work), this is the delay before the user sees anything — the "thinking" time.
TTFT is driven by three factors:
- Prompt length — Longer prompts require more prefill computation. A 4K-token prompt takes roughly 4x longer to prefill than a 1K-token prompt (more at very long contexts, where quadratic attention starts to dominate). This is expected and not a sign of problems.
- KV-cache availability — If the KV-cache is full when a request arrives, the server must either evict another request's cache (expensive), swap it to CPU (very expensive), or make the new request wait in the queue.
- Batch contention — During prefill, the GPU is doing heavy matrix multiplication that blocks decode steps for other requests in the batch. Large prefills cause a "convoy effect" where existing decode-phase requests stall.
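To build intuition for the prompt-length factor, here's a back-of-envelope prefill estimate. This is a sketch, not a benchmark: the 70B parameter count, the H100 BF16 peak, and the 50% utilization factor are all illustrative assumptions, and real prefill time depends on batch composition and kernel efficiency.

```python
# Rough prefill-time estimate (a sketch, not a benchmark).
# Assumes prefill FLOPs ~ 2 * params * prompt_tokens for a dense transformer,
# discounted by an assumed model FLOPs utilization (MFU).

def estimate_prefill_ms(prompt_tokens: int,
                        model_params: float = 70e9,   # 70B model (assumption)
                        peak_tflops: float = 989.0,   # H100 BF16 dense peak (assumption)
                        mfu: float = 0.5) -> float:
    """Back-of-envelope prefill latency in milliseconds."""
    flops = 2.0 * model_params * prompt_tokens        # forward-pass FLOPs
    seconds = flops / (peak_tflops * 1e12 * mfu)
    return seconds * 1e3

for tokens in (1_000, 4_000, 32_000):
    print(f"{tokens:>6} tokens -> ~{estimate_prefill_ms(tokens):.0f} ms")
```

The estimate scales linearly with prompt length, which is why the 1K-vs-4K comparison above comes out to roughly 4x.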
Thresholds:
| Percentile | Good | Investigate | Critical |
|---|---|---|---|
| P50 | < 200ms | 200-500ms | > 500ms |
| P95 | < 500ms | 500ms-2s | > 2s |
A sustained P95 above 2 seconds means 1 in 20 requests waits more than 2 seconds before any output appears. In a chat application, that's a noticeable, uncomfortable pause.
# TTFT P95 regression — critical above 2 seconds
vllm_ttft_seconds{quantile="p95"} > 2
When TTFT spikes, don't look at GPU utilization first. Check KV-cache usage and queue depth — they're almost always the root cause.
KV-Cache: The Hidden Bottleneck
The KV-cache is the most important resource in LLM inference, and the least understood outside of ML infrastructure teams.
What it is: During transformer inference, the model computes attention keys and values for every token in the prompt and every generated token. These are cached in GPU memory so they don't need to be recomputed on each decode step. Without the cache, generating each new token would require reprocessing the entire sequence — O(n^2) instead of O(n).
Why it's finite: KV-cache size is proportional to (context_length x batch_size x num_layers x hidden_dim). On an H100 with 80GB VRAM, after the model weights consume 40-60GB (depending on model size), there's 20-40GB left for KV-cache. This limits how many concurrent requests the server can handle and how long their context windows can be.
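A rough sizing sketch makes that arithmetic concrete. The model shape below (80 layers, 8 KV heads, head dimension 128, FP16 cache) is illustrative, roughly Llama-3-70B-like with grouped-query attention, not a measurement of any particular deployment:

```python
# KV-cache sizing sketch. Per-token cache bytes = 2 (K and V) * layers
# * kv_heads * head_dim * dtype_bytes. Shapes below are illustrative
# assumptions, not authoritative figures for any specific model.

def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_concurrent_requests(free_vram_gb: float, avg_context_tokens: int) -> int:
    """How many requests fit in the VRAM left over after model weights."""
    per_request = kv_bytes_per_token() * avg_context_tokens
    return int(free_vram_gb * 1024**3 // per_request)

print(f"{kv_bytes_per_token() / 1024:.0f} KiB of cache per token")
print(max_concurrent_requests(free_vram_gb=30, avg_context_tokens=8192),
      "concurrent 8K-context requests in 30 GB")
```

With these assumed shapes, each token costs 320 KiB of cache, so 30 GB of free VRAM holds only about a dozen concurrent 8K-context requests — which is why the cache, not the compute, usually caps concurrency.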
What happens when it fills:
- 80-90% usage — The inference engine starts getting selective. New requests with long prompts may be deferred. Prefix caching hit rates drop because the cache is evicting older entries to make room.
- 90-95% usage — Active queuing begins. The engine can't admit new requests until running ones complete and release their cache slots. TTFT for new requests spikes because they're waiting in line, not because the GPU is slow.
- Above 95% — Preemption and swapping. The engine starts evicting in-progress requests from GPU memory to CPU memory (swap). When those requests need to continue generating, their KV-cache is paged back from CPU — adding hundreds of milliseconds of latency per swap.
# KV-cache pressure — warn at 90%, page at 95%
vllm_cache_usage_ratio{cache_type="gpu"} > 0.90
The cascade: the KV-cache fills, new requests queue, TTFT spikes, clients time out and retry, and the retries pile yet more load onto a saturated server until requests start aborting. The problem looks like an "error spike," but the root cause is capacity.
The fix is almost always one of: add more replicas, reduce --max-model-len to limit per-request cache consumption, or use a smaller model whose weights leave more VRAM for cache.
Queue Depth and Continuous Batching
Modern inference engines — vLLM, SGLang, TGI — use continuous batching, which is fundamentally different from traditional request-response processing. Understanding it is critical to interpreting queue metrics correctly.
In traditional batching, you fill a batch of N requests, run them all through the model together, and wait for the slowest request to finish before starting the next batch. If one request generates 500 tokens and another generates 10, the short request waits idle while the long one finishes.
In continuous batching, requests enter and leave the batch independently. When a request finishes generating (hits its stop token or max length), its slot is immediately given to a waiting request. Prefill and decode can happen in the same batch cycle. This dramatically improves throughput — the GPU is rarely idle between batches.
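The scheduling difference can be shown with a toy simulation. This is an illustration of slot reuse, not how any real engine schedules: each request needs a fixed number of decode steps, the batch holds four requests, and one "step" advances every active request by one token.

```python
# Toy static-vs-continuous batching comparison (an illustration, not an
# engine implementation). Returns the step at which each request finishes.

def static_finish_steps(token_counts, batch_size=4):
    finish, start = [], 0
    for i in range(0, len(token_counts), batch_size):
        batch = token_counts[i:i + batch_size]
        finish += [start + t for t in batch]
        start += max(batch)                  # next batch waits for the slowest
    return finish

def continuous_finish_steps(token_counts, batch_size=4):
    remaining = {i: t for i, t in enumerate(token_counts[:batch_size])}
    waiting = list(enumerate(token_counts))[batch_size:]
    finish, step = {}, 0
    while remaining:
        step += 1
        for i, t in list(remaining.items()):
            if t == 1:
                finish[i] = step             # last token generated this step
                del remaining[i]
            else:
                remaining[i] = t - 1
        while waiting and len(remaining) < batch_size:
            i, t = waiting.pop(0)            # freed slot reused immediately
            remaining[i] = t
    return [finish[i] for i in sorted(finish)]

# One 500-token response sharing the server with seven 10-token responses:
reqs = [500, 10, 10, 10, 10, 10, 10, 10]
print("static:    ", static_finish_steps(reqs))
print("continuous:", continuous_finish_steps(reqs))
```

In the static case the second batch of short requests waits 500 steps behind the long one; with continuous batching they slip into freed slots and finish within a few dozen steps. That gap is the throughput win the prose above describes.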
Three queue states tell the capacity story:
Running (vllm.requests.running) — Requests actively being processed. This is the current batch size. Higher is generally better (better GPU utilization), up to the point where KV-cache fills.
Waiting (vllm.requests.waiting) — Requests in queue, ready to be admitted to the batch but blocked because there aren't enough resources (usually KV-cache slots). This is the signal to watch. A waiting count above zero sustained for more than 5 minutes means your serving capacity is saturated.
# Queue saturation — more than 10 requests waiting for 5+ minutes
sum by (k8s_cluster_name) (vllm_requests_waiting) > 10
Swapped (vllm.requests.swapped) — Requests that were being processed but got evicted from GPU memory to make room for others. They'll be paged back later at significant latency cost. Any non-zero swapped count is a sign of memory pressure.
The prefill vs decode contention: Prefill (processing a long input prompt) is compute-intensive and temporarily monopolizes the GPU. While a large prefill is happening, decode steps for other requests in the batch are delayed. This is why a single request with a 32K-token prompt can spike TTFT for all other requests on the same instance — it's not a bug, it's a scheduling tradeoff inherent to continuous batching.
Some engines (vLLM with chunked prefill, SGLang) mitigate this by splitting large prefills into chunks interleaved with decode steps. But the tradeoff is real: longer prompts impose a latency tax on the shorter ones sharing the same GPU.
Inter-Token Latency and Throughput
Inter-Token Latency (ITL) (vllm.itl) measures the time between consecutive generated tokens. While TTFT is the initial "thinking" delay, ITL determines the streaming speed — how fast text appears to the user after it starts.
ITL is primarily affected by:
- Batch size — More requests in the batch means each decode step takes longer (more KV-cache reads)
- Model size — Larger models have more layers and more memory to read per token
- Memory bandwidth saturation — When the memory controller is maxed, each token generation slows down
For a chat application, ITL below 50ms feels "instant" to users. Between 50 and 100ms, the text streams visibly but smoothly. Above 200ms, the stream feels halting and frustrating.
Generation Throughput (vllm.generation.throughput) in tokens per second is the capacity planning metric. It tells you the effective output bandwidth of your serving infrastructure. To convert this to concurrent users:
concurrent_users ≈ throughput_tokens_per_sec × avg_response_time_sec / avg_tokens_per_response
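A worked example with illustrative numbers (the formula is Little's law applied to token bandwidth: requests in flight equal arrival rate times time in system, where arrival rate is token throughput divided by tokens per request):

```python
# Worked example of the capacity formula above; the input numbers are
# illustrative, not measurements.

def concurrent_users(throughput_tps: float,
                     avg_response_time_s: float,
                     avg_tokens_per_response: float) -> float:
    # tokens/sec * sec / tokens -> dimensionless count of in-flight requests
    return throughput_tps * avg_response_time_s / avg_tokens_per_response

# 2,000 tok/s of generation, 8 s average response, 400 tokens per response:
print(concurrent_users(2_000, 8.0, 400))  # -> 40.0
```

So an instance sustaining 2,000 tokens/sec with those response characteristics supports roughly 40 concurrent conversations, and halving throughput halves that ceiling.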
A sudden throughput drop — more than 50% compared to an hour ago — is a critical signal. Possible causes: a GPU degraded (thermal throttle, ECC errors), the model reloaded (cold cache), a configuration change (reduced max batch size), or increased prompt lengths (more prefill, less decode throughput).
# Throughput regression — >50% drop vs 1 hour ago
(sum(vllm_generation_throughput_per_second) / sum(vllm_generation_throughput_per_second offset 1h)) < 0.5
Speculative Decoding: When It Helps, When It Hurts
Speculative decoding is an optimization where a smaller "draft" model generates several candidate tokens at once, and the main model verifies them in a single forward pass. When it works well, it can 2-3x decode throughput because verification is nearly as fast as generating one token.
The key metric: acceptance rate (vllm.spec_decode.acceptance_rate).
- Above 70% — Speculative decoding is providing significant speedup. The draft model is a good match for your workload.
- 50-70% — Marginal. The overhead of running the draft model may offset the speedup.
- Below 50% — The draft model's predictions don't match the main model's outputs for your prompts. Speculative decoding is adding latency, not reducing it. Turn it off.
Low acceptance rates typically mean the draft model is too different from the main model for your specific distribution of prompts. Code generation tends to have higher acceptance rates than open-ended creative writing because code is more predictable.
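A rough model shows why ~50% is the breakeven zone. If each draft token were accepted independently with probability p and the draft proposes k tokens per step, the expected tokens emitted per verification step is a geometric series (the accepted prefix plus the one token the verifier always emits). Real acceptance is correlated across tokens, so treat this as an idealized sketch that also ignores the draft model's own overhead:

```python
# Idealized expected tokens per verification step for speculative decoding,
# assuming independent per-token acceptance probability p and draft length k.
# An upper-bound sketch: real acceptance is correlated and the draft model
# itself costs time.

def expected_tokens_per_step(p: float, k: int) -> float:
    # Geometric series: sum_{i=0..k} p^i = (1 - p^(k+1)) / (1 - p)
    if p == 1.0:
        return k + 1.0
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.8, 0.6, 0.4):
    print(f"acceptance {p:.0%}: ~{expected_tokens_per_step(p, k=4):.2f} tokens/step")
```

Under this model, 80% acceptance yields over three tokens per step, while 40% yields barely more than the single token you'd get without speculation — before paying the draft model's cost, which is why low acceptance rates turn the optimization into a net loss.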
The Inference Dashboard
Here's what an inference monitoring dashboard should show, at a glance:
| Metric | What It Tells You | Alert When |
|---|---|---|
| TTFT P95 | User-perceived "thinking" time | > 2s for 5 min |
| ITL P95 | Streaming speed | > 200ms sustained |
| KV-cache usage | Memory pressure | > 90% for 5 min |
| Requests waiting | Queue saturation | > 0 for 5 min |
| Requests swapped | Memory eviction | > 0 |
| Generation throughput | Serving capacity | > 50% drop vs 1hr |
| Prefill duration P95 | Prompt processing time | Correlate with TTFT |
| Error rate (aborts) | Failed requests | > 5% |
| Spec decode acceptance | Draft model quality | < 50% (consider disabling) |
The first three rows — TTFT, KV-cache, and queue depth — form a causal chain. TTFT spikes are almost always caused by cache pressure, which causes queuing. Fix the cache, the queue drains, TTFT recovers.
These metrics are engine-agnostic in concept — TTFT, KV-cache, and queue depth exist in vLLM, SGLang, TGI, and Triton. The metric names differ, but the signals and thresholds are the same. Last9 monitors all of them through a unified dashboard — same metrics, same thresholds, regardless of which inference engine you run. One dashboard for your entire serving fleet.
If you're running LLM inference in production and want these metrics out of the box, get started with Last9.
