Apr 17th, 2026

Every Token Has a Price: Per-Request GPU Cost Attribution

Flat per-token pricing is wrong by 10–50× per request. Prefill vs decode, batch sharing, and cache effects break the math. How to attribute real GPU cost — compute, energy, and dollars — to each inference request.

Your GPU cluster costs $100K/month. Finance asks: "How much did customer X's API calls cost us?" You do what everyone does — divide total cost by total tokens. That gives you a flat per-token rate and a clean spreadsheet.

It's also wrong.

A 2,000-token generation on a cold KV-cache uses 10x more GPU compute than a 100-token completion riding a warm cache. A request that hits during a full batch shares GPU cycles with 30 other requests. A request that arrives when the batch is nearly empty gets the GPU mostly to itself. Flat per-token pricing hides these differences — the same way averaging CPU usage across a Kubernetes cluster hides noisy neighbors.

Real GPU cost attribution — knowing what each individual request actually cost in compute, energy, and dollars — is one of the biggest unsolved problems in AI infrastructure FinOps.


Why Flat Per-Token Pricing Is Wrong

The standard approach to GPU cost attribution:

cost_per_token = total_gpu_cost / total_tokens_served

This assumes every token consumes the same amount of GPU. That assumption breaks in at least three ways:

Prefill vs decode asymmetry. Processing a 4,000-token input prompt (prefill) is compute-intensive — heavy matrix multiplication that saturates the tensor cores. Generating a single output token (decode) is memory-bandwidth-bound — reading the KV-cache is the bottleneck, not compute. A request with a 4K prompt and 100-token output consumed far more GPU than a request with a 100-token prompt and 4K-token output, even though both involved roughly the same total token count.
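The asymmetry above is a roofline argument: prefill pushes many tokens through each weight read, decode only a few. A small sketch makes it concrete — the hardware numbers (H100 spec-sheet peak FLOPs and HBM3 bandwidth) and the simplified "2 FLOPs per weight per token" model are illustrative assumptions, not measurements:

```python
# Roofline-style sketch: prefill is compute-bound, decode is memory-bound.
# Arithmetic intensity = FLOPs performed per byte of weights read from HBM.
# H100 numbers are spec-sheet values; the model is a deliberate simplification.

H100_PEAK_TFLOPS = 989    # BF16 dense peak
H100_MEM_BW_TBPS = 3.35   # HBM3 bandwidth

# Ridge point: below this intensity the GPU is bandwidth-bound.
RIDGE = (H100_PEAK_TFLOPS * 1e12) / (H100_MEM_BW_TBPS * 1e12)  # ~295 FLOPs/byte

def arithmetic_intensity(tokens_per_step: int, bytes_per_param: int = 2) -> float:
    # Each weight is read once per forward step and does ~2 FLOPs per token
    # processed in that step (fp16/bf16 weights -> 2 bytes per parameter).
    return 2 * tokens_per_step / bytes_per_param

prefill = arithmetic_intensity(4000)  # 4,000 prompt tokens in one parallel pass
decode = arithmetic_intensity(16)     # one new token each for a 16-request batch

print(f"ridge ~{RIDGE:.0f} FLOPs/byte")
print(f"prefill intensity {prefill:.0f} -> compute-bound: {prefill > RIDGE}")
print(f"decode intensity  {decode:.0f} -> compute-bound: {decode > RIDGE}")
```

Prefill at 4,000 tokens sits far above the ridge point (tensor cores saturated); decode at batch size 16 sits far below it (waiting on memory), which is why equal token counts do not imply equal GPU time.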

Batch sharing. Modern inference engines use continuous batching — 8 to 32 requests share the GPU simultaneously. When a request arrives during a full batch, it gets a thin slice of each GPU cycle. When it arrives during a near-empty batch, it gets most of the GPU to itself. The "cost" of a request depends on how many other requests were co-scheduled with it.

Cache effects. With prefix caching enabled, requests that share common prompt prefixes (system prompts, few-shot examples) can reuse previously computed KV-cache entries. A cache-hit request skips prefill entirely — it's dramatically cheaper than an identical request that missed the cache.

The result: flat per-token pricing can be off by 10-50x for individual requests. Your highest-cost customers may be underpaying, and your cheapest customers may be subsidizing them.


The Key Insight: Instrument Inside the Batch

The fundamental challenge is that GPU compute is shared. You can't just time a request end-to-end and multiply by GPU cost — the GPU was simultaneously serving other requests during that time.

The solution is to instrument inside the inference engine's batch scheduler. At the level where the GPU actually executes model forward passes, you know:

  1. Which requests are in the current batch
  2. How many tokens each request contributed to this batch step
  3. How long the batch step took on the GPU

With those three pieces of information, you can attribute GPU time proportionally. If a batch of two requests — one with 512 tokens and one with 256 tokens — takes 12 milliseconds to execute, the first request gets 8ms of GPU time and the second gets 4ms. This attribution happens at every single batch step across the hundreds of decode iterations that make up a request's lifetime.

The result is a gpu_time_share value per request — the total GPU seconds that request consumed, accumulated across all batch steps it participated in.
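The per-step accounting described above can be sketched in a few lines. This is a minimal illustration, not an engine integration — the function and variable names (`attribute_step`, `gpu_time_share`) are hypothetical, and a real deployment would hook the scheduler loop of the inference engine itself:

```python
# Sketch: proportional GPU-time attribution at each batch step.
# Each step, split the measured step duration across the requests in the
# batch by their token contribution, and accumulate per-request totals.
from collections import defaultdict

def attribute_step(tokens_by_request: dict[str, int],
                   step_gpu_seconds: float,
                   totals: dict[str, float]) -> None:
    """Split one batch step's GPU time across requests by token share."""
    total_tokens = sum(tokens_by_request.values())
    for req_id, tokens in tokens_by_request.items():
        totals[req_id] += step_gpu_seconds * tokens / total_tokens

# Accumulated GPU seconds per request, across every step it participated in.
gpu_time_share: dict[str, float] = defaultdict(float)

# The worked example from the text: a 12 ms batch step shared by a
# 512-token request and a 256-token request.
attribute_step({"req-a": 512, "req-b": 256}, 0.012, gpu_time_share)
print(gpu_time_share)  # req-a accrues 8 ms, req-b accrues 4 ms
```

Over a request's lifetime this runs at every decode iteration, so the final `gpu_time_share` reflects how crowded the batch was at each step, not just the request's own token count.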


Three Signals, One Join

Per-request GPU cost attribution requires joining three independent data sources:

Signal 1: Inference traces — The instrumented inference engine emits an OpenTelemetry span for each request with the gpu_time_share attribute. The span also carries request identity (trace_id, span_id), token counts (input + output), model name, and the GPU index.

Signal 2: GPU hardware metrics — DCGM or NVML provides per-GPU, per-minute metrics: actual utilization percentage and power draw in watts. This tells you what the GPU was actually doing during the time window the request was active.

Signal 3: Pricing — The cost per GPU-hour, either a per-GPU rate derived from cloud instance pricing (e.g., $6.88 per H100 GPU-hour on a p5.48xlarge) or an internal chargeback rate.

The attribution computation:

attributed_gpu_seconds = gpu_time_share × avg_gpu_utilization
attributed_energy_joules = gpu_time_share × avg_power_draw_watts
attributed_cost_usd = gpu_time_share × (cost_per_gpu_hour / 3600)

The utilization adjustment is important: it separates "GPU time this request occupied" from "GPU time this request actually consumed doing useful work." A request that runs during a period of low utilization (batch was thin, GPU was partly idle) gets a lower attributed cost than the same request during peak load.
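The three formulas above amount to a small join over already-correlated fields. A minimal sketch, assuming the field names shown (`gpu_time_share` from the trace span, per-minute averages from GPU metrics) and the example GPU-hour rate from the text:

```python
# Sketch of the three-signal attribution computation.
# Inputs: gpu_time_share (seconds, from the instrumented engine's span),
# avg utilization and power for the window (from DCGM/NVML), and a price.

GPU_HOUR_USD = 6.88  # example per-GPU-hour rate from the text

def attribute(gpu_time_share_s: float,
              avg_util: float,       # 0.0 - 1.0 average GPU utilization
              avg_power_w: float,    # average power draw in watts
              cost_per_gpu_hour: float = GPU_HOUR_USD) -> dict[str, float]:
    return {
        # GPU time actually spent doing useful work, not just occupied.
        "gpu_seconds": gpu_time_share_s * avg_util,
        # Watts x seconds = joules attributed to this request.
        "energy_joules": gpu_time_share_s * avg_power_w,
        # Convert the hourly rate to per-second before multiplying.
        "cost_usd": gpu_time_share_s * (cost_per_gpu_hour / 3600.0),
    }

# A request that accumulated 4.5 GPU-seconds while the GPU averaged
# 70% utilization and 620 W:
result = attribute(4.5, 0.70, 620.0)
print(result)
```

Note the cost term uses occupancy (`gpu_time_share`) directly while the compute term is utilization-adjusted — you pay for the GPU whether or not it was saturated, but "work consumed" should reflect what the silicon was actually doing.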

The time boundary problem: A long inference request might span multiple minutes. GPU utilization varies minute-to-minute — the batch was full for the first minute, partially drained during the second, full again in the third. Using a single average across the entire request lifetime would be inaccurate.

The solution is minute-bucket slicing: split each request's gpu_time_share proportionally across the minute boundaries it spans, then compute attribution separately for each minute using that minute's actual GPU utilization and power draw. A request from 10:01:45 to 10:02:30 gets attributed separately for its 15 seconds in the 10:01 bucket and its 30 seconds in the 10:02 bucket, each with the correct GPU metrics for that minute.
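Minute-bucket slicing is straightforward to express in code. A sketch, using epoch-second timestamps and a hypothetical per-minute utilization lookup (the 10:01:45 → 10:02:30 example is expressed as seconds since midnight for readability):

```python
# Sketch: split a request's wall-clock span across minute boundaries,
# then weight each slice with that minute's own GPU metrics.

def slice_by_minute(start_s: float, end_s: float) -> dict[int, float]:
    """Return {minute_start_epoch_seconds: seconds spent in that minute}."""
    slices: dict[int, float] = {}
    t = start_s
    while t < end_s:
        minute_start = int(t // 60) * 60
        boundary = min(minute_start + 60, end_s)
        slices[minute_start] = boundary - t
        t = boundary
    return slices

# The worked example: a request from 10:01:45 to 10:02:30.
start, end = 10 * 3600 + 1 * 60 + 45.0, 10 * 3600 + 2 * 60 + 30.0
slices = slice_by_minute(start, end)
print(slices)  # 15 s in the 10:01 bucket, 30 s in the 10:02 bucket

# Split gpu_time_share proportionally by wall-clock seconds, then apply
# each minute's own utilization (values here are hypothetical).
util_by_minute = {36060: 0.95, 36120: 0.60}
gpu_time_share = 9.0  # GPU seconds accumulated across batch steps
wall_clock = end - start
for minute, secs in slices.items():
    share = gpu_time_share * secs / wall_clock
    print(minute, share * util_by_minute[minute])
```

The same slicing applies to the energy and cost terms — each minute's slice is priced with that minute's power draw, so a request that straddles a load spike is charged for the spike only during the seconds it overlapped it.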


What This Unlocks

Per-request GPU cost attribution enables things that flat per-token pricing fundamentally cannot:

Per-customer cost tracking. "Customer X's API calls cost us $4,200 last month." Not estimated from token counts — measured from actual GPU resource consumption. When customer Y sends long system prompts that blow the prefix cache, their actual cost is captured, not averaged away.

Per-model cost comparison. "Llama-3-70B costs $0.003 per request on average, Qwen-2.5-3B costs $0.0002." But more importantly, the distribution is visible — some Llama-3-70B requests cost 100x more than others depending on prompt length and cache hit rate. Flat per-token pricing would say Llama-3-70B is ~15x more expensive. Per-request attribution reveals it's 10-100x more expensive depending on the request.

Waste detection. "15% of GPU time went to requests that timed out and were never delivered to the user." These wasted requests are invisible in aggregate metrics but show up clearly in per-request attribution — they consumed real GPU seconds and contributed zero value.

Carbon accounting. Energy attribution per request (joules) enables sustainability reporting at the API level: "This API endpoint consumed 47 kWh last month." Not an estimate from total cluster power — a bottom-up calculation from individual request energy consumption.

Capacity planning with real data. "At current request mix and utilization, adding 100 more customers of similar profile requires 2 additional GPUs — not 4." Because you know the actual per-request cost distribution, capacity models use real data instead of averages that systematically overestimate or underestimate.


How Last9 Makes This Possible

Per-request GPU cost attribution isn't a separate product. It's what happens when you already have end-to-end correlation.

Last9 already collects all three required signals in a single platform:

  • Inference traces with per-request spans, token counts, and GPU identity
  • GPU hardware metrics with per-minute utilization and power draw
  • Workload identity linking every GPU metric to the pod, namespace, and team

All three signal types share the same resource attributes — cluster, node, GPU index, pod name. The attribution computation is a lightweight join across data that's already correlated, not a new data pipeline.

This is the payoff of the eight-layer observability framework: when your data is already connected from silicon to workload identity, cost attribution is a query — not an infrastructure project.

If you're running LLM inference at scale and need real cost attribution — not per-token estimates — talk to us at Last9.

About the author

Shekhar

AI Infrastructure @Last9
