What is cost-per-token and why does it matter?

Cost-per-token is the metric that divides total inference spend by the number of tokens processed. Without it, you don't know if you're overpaying for an endpoint, which model consumes the most budget, or where the waste is. Running without it is like managing cloud costs without knowing the hourly price of your instances.

Is 15-30% GPU utilization normal?

It's common, but it shouldn't be treated as normal. At 15-30% utilization, most of your inference budget pays for idle hardware: roughly 70-85% of the GPU capacity you pay for sits unused. The root cause is endpoints provisioned for peak that bill at full rate 24/7. At 3am with zero traffic, the GPU is still billed. Dynamic batching and GPU-metric autoscaling fix this.

Does CPU-based autoscaling work for GPU workloads?

No. CPU-based autoscaling doesn't reflect real GPU utilization. An endpoint can have low CPU but a saturated GPU, or the opposite. The right approach is KEDA with the NVIDIA GPU Operator, scaling on GPU utilization and memory usage, plus request queue depth.

Is on-prem inference worth it?

For stable, high-volume inference workloads, break-even can arrive in as little as 3 months, according to Signal 65 / Futurum Group data. But it depends on the workload: if volume is sporadic, cloud with autoscaling still wins. The calculation must include hardware cost, power, cooling, and operations staff.

What are spend caps and how do they work?

Spend caps are spending limits configured per endpoint or project that automatically pause execution when the budget is reached. Google Cloud launched this feature in 2026. Without spend caps, an uncontrolled agentic loop can generate costs that look like legitimate traffic until the bill arrives.

FinOps for AI: Cost-per-Token and the GPU You Don't Use

73% of AI projects blow their budget, per the FinOps Foundation's State of FinOps 2026 report, some of them by 2.4x. The average enterprise AI budget jumped from $1.2M to $7M in two years, and most CFOs don't know what each token costs. At Tech86, we've seen companies burn $2.3M in costs nobody predicted because nobody instrumented. Without cost-per-token, any optimization is guesswork.

The visibility problem

The Anthropic bill arrives as a single line: $100K/month. No breakdown by client, feature, or workflow. One banking query fires 1 orchestrator, 3 retrievers, 4 tool calls, and 7 model invocations. Cost buried 6 levels deep. Cost Explorer can't see it.
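To make "cost buried 6 levels deep" concrete, here is a minimal sketch of rolling nested call costs up to the user-facing request. It assumes each call (orchestrator, retriever, tool, model) is logged as a span with a parent link and its own cost. The `Span` type, its field names, and the dollar figures are hypothetical, not taken from any specific tracing tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root, i.e. the user-facing request
    kind: str                 # "orchestrator", "retriever", "tool" or "model"
    cost_usd: float           # cost of this call alone, excluding its children


def root_of(span_id: str, parents: dict) -> str:
    """Follow parent links until reaching the span that has no parent."""
    while parents[span_id] is not None:
        span_id = parents[span_id]
    return span_id


def cost_per_request(spans: list) -> dict:
    """Sum every span's cost into the root request that triggered it."""
    parents = {s.span_id: s.parent_id for s in spans}
    totals = defaultdict(float)
    for s in spans:
        totals[root_of(s.span_id, parents)] += s.cost_usd
    return dict(totals)


# Hypothetical trace: one query fans out into nested retriever, tool and model calls.
spans = [
    Span("q1", None, "orchestrator", 0.002),
    Span("r1", "q1", "retriever", 0.001),
    Span("m1", "r1", "model", 0.010),
    Span("t1", "q1", "tool", 0.0),
    Span("m2", "t1", "model", 0.020),
]
totals = cost_per_request(spans)
print(totals)
```

Once every cost lands on a root request, tagging that request with a client, feature, or workflow gives the breakdown the provider invoice doesn't.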

The problem isn't lack of tools — it's lack of granularity. Traditional FinOps operates at the instance level. FinOps for AI needs to operate at the token level. If you don't know what each token costs per endpoint, you don't know where you're burning money. And 98% of FinOps teams now manage AI spend, per the State of FinOps 2026 report — up from 31% in 2024. Demand exploded. The playbook still doesn't exist.

80-90% of spend is inference — and the GPU is idle

Training dominates the hype. Inference dominates the bill. Per industry data, 80-90% of AI spend is inference, not training. And average GPU utilization sits at 15-30%, per FinOps analyses. At that level, well over half the budget pays for idle hardware.

The main culprit: endpoints provisioned for peak. You size the infrastructure for Monday business hours, and it bills at full rate 24/7. At 3am, zero traffic, GPU still charging. Utilization below 50% is recoverable spend. It's like renting a truck to deliver a letter and paying the full daily rate.

At Tech86, when we audit AI workloads, the first number we look for is GPU utilization per endpoint and time window. If it's below 50%, we already know there's significant savings margin.
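As a rough illustration of that audit, the sketch below takes hourly utilization samples for one endpoint, estimates how much of the bill went to idle capacity, and lists the hours under the 50% mark. The hourly rate and the sample values are made up, and treating "1 minus utilization" as the idle share of spend is a simplifying assumption.

```python
def idle_spend(hourly_rate_usd: float, hourly_utilization: list) -> float:
    """Each hour is billed in full; only the utilized fraction did useful work."""
    return sum(hourly_rate_usd * (1.0 - u) for u in hourly_utilization)


def hours_below(hourly_utilization: list, threshold: float = 0.5) -> list:
    """Indices of the hourly windows whose utilization is under the threshold."""
    return [hour for hour, u in enumerate(hourly_utilization) if u < threshold]


# Hypothetical day: 8 business hours at 60% utilization, the other 16 at 5%.
samples = [0.05] * 9 + [0.60] * 8 + [0.05] * 7
rate = 4.0  # assumed USD per GPU-hour

total = rate * len(samples)
idle = idle_spend(rate, samples)
avg_util = sum(samples) / len(samples)
low_hours = hours_below(samples)

print(f"average utilization: {avg_util:.0%}")
print(f"idle spend: ${idle:.2f} of ${total:.2f}")
print(f"hours under 50%: {len(low_hours)}")
```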

Cost-per-token: the metric that separates control from guesswork

Cost-per-token is simple in theory: inference spend for a period (a day, say) divided by the tokens processed in that period. In practice, it requires instrumentation at the application layer. You need to track tokens per endpoint, per model, per client, per feature. Without this, you're flying blind.
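A minimal sketch of that instrumentation, assuming the application logs one usage record per model call. The `UsageRecord` fields and the figures are hypothetical. The point is that the same records can be sliced by any dimension.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    endpoint: str
    model: str
    client: str
    feature: str
    tokens: int
    cost_usd: float


def cost_per_token(records: list, dimension: str) -> dict:
    """Spend divided by tokens, grouped by one attribute of the usage record."""
    spend = defaultdict(float)
    tokens = defaultdict(int)
    for r in records:
        key = getattr(r, dimension)
        spend[key] += r.cost_usd
        tokens[key] += r.tokens
    return {k: spend[k] / tokens[k] for k in spend if tokens[k] > 0}


# Hypothetical records for one day.
records = [
    UsageRecord("/chat", "frontier", "acme", "support", 1_000_000, 20.0),
    UsageRecord("/chat", "small", "acme", "support", 1_000_000, 1.0),
    UsageRecord("/summarize", "small", "globex", "digest", 2_000_000, 2.0),
]

by_model = cost_per_token(records, "model")
by_endpoint = cost_per_token(records, "endpoint")
print(by_model)
print(by_endpoint)
```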

A concrete example: Opus 4.6 costs 21x more than MiniMax M2.5 for a 0.6% benchmark difference. If you don't have cost-per-token per endpoint, you don't know which model is consuming what. You don't know if Opus is running commodity tasks that MiniMax would handle. You don't know if a client representing 12% of revenue is consuming 78% of LLM spend. That asymmetry is invisible without the right metric.
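The revenue-versus-spend asymmetry can be surfaced with a few lines once spend is attributed per client. The sketch below flags clients whose share of LLM spend is a multiple of their share of revenue. The client names, amounts, and the 2x threshold are illustrative assumptions.

```python
def shares(amounts: dict) -> dict:
    """Convert absolute amounts into fractions of the total."""
    total = sum(amounts.values())
    return {k: v / total for k, v in amounts.items()}


def asymmetric_clients(llm_spend: dict, revenue: dict, ratio_threshold: float = 2.0) -> dict:
    """Clients whose spend share is at least ratio_threshold times their revenue share."""
    spend_share = shares(llm_spend)
    revenue_share = shares(revenue)
    flagged = {}
    for client, s in spend_share.items():
        r = revenue_share.get(client, 0.0)
        if r == 0.0 or s / r >= ratio_threshold:
            flagged[client] = (s, r)
    return flagged


# Hypothetical monthly figures in USD.
llm_spend = {"client_a": 78_000, "client_b": 12_000, "client_c": 10_000}
revenue = {"client_a": 120_000, "client_b": 500_000, "client_c": 380_000}

flagged = asymmetric_clients(llm_spend, revenue)
for client, (s, r) in flagged.items():
    print(f"{client}: {s:.0%} of LLM spend, {r:.0%} of revenue")
```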

Cost-per-token turns optimization from guesswork into data-driven decisions. With it, you compare models on real cost per task, identify overprovisioned endpoints, and quantify the impact of every architecture change.

The levers that work

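Dynamic batching and caching are the fastest win. Batching groups concurrent requests so the GPU does more work per pass, and caching serves frequent responses without recomputing them. With both in place, GPU utilization can jump from 30% to 70%. On the same hardware, that cuts cost per token by more than half, since 0.30 / 0.70 ≈ 0.43 of the original cost. This isn't theory: it's what happens when the GPU stops processing one request at a time and starts working in batches.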
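The arithmetic behind the utilization claim can be checked in a few lines. This is a back-of-the-envelope sketch that assumes a fixed hourly GPU price and throughput that scales linearly with utilization. The rate and peak throughput are hypothetical.

```python
def cost_per_million_tokens(hourly_rate_usd: float,
                            peak_tokens_per_hour: float,
                            utilization: float) -> float:
    """Fixed hourly price spread over the tokens actually produced in that hour."""
    tokens_per_hour = peak_tokens_per_hour * utilization
    return hourly_rate_usd / tokens_per_hour * 1_000_000


RATE = 4.0        # assumed USD per GPU-hour
PEAK = 8_000_000  # assumed tokens per hour at 100% utilization

before = cost_per_million_tokens(RATE, PEAK, 0.30)
after = cost_per_million_tokens(RATE, PEAK, 0.70)
saving = 1.0 - after / before

print(f"30% utilization: ${before:.2f} per million tokens")
print(f"70% utilization: ${after:.2f} per million tokens")
print(f"saving: {saving:.0%}")
```

The saving depends only on the two utilization figures, not on the assumed rate or throughput, which is why utilization is such a useful audit number.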

Model routing by complexity is the second lever. Commodity tasks on cheap models, reasoning on frontier. 5-10x savings. We covered this in detail in our model selection article — the focus here is that without cost-per-token, you can't tell if routing is actually working.
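A deliberately naive sketch of complexity-based routing, with cost-per-token used to check whether it pays off. The tier names, prices, keyword list, and word limit are all hypothetical. A production router would more likely use a classifier than keywords.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_million_tokens: float


# Hypothetical tiers and prices, for illustration only.
CHEAP = ModelTier("cheap-model", 0.50)
FRONTIER = ModelTier("frontier-model", 5.00)

# Crude stand-in for a complexity classifier.
REASONING_HINTS = frozenset({"prove", "plan", "analyze", "debug", "why"})


def route(prompt: str, max_cheap_words: int = 200) -> ModelTier:
    """Send long or reasoning-heavy prompts to the frontier tier, the rest to the cheap one."""
    words = re.findall(r"[a-z]+", prompt.lower())
    needs_reasoning = not REASONING_HINTS.isdisjoint(words)
    too_long = len(words) > max_cheap_words
    return FRONTIER if needs_reasoning or too_long else CHEAP


def cost_usd(tier: ModelTier, tokens: int) -> float:
    return tier.usd_per_million_tokens * tokens / 1_000_000


# Hypothetical traffic: nine commodity requests and one reasoning request, 1,000 tokens each.
traffic = [("Translate this sentence to French.", 1000)] * 9
traffic += [("Analyze why the reconciliation job fails and plan a fix.", 1000)]

routed = sum(cost_usd(route(prompt), tokens) for prompt, tokens in traffic)
baseline = sum(cost_usd(FRONTIER, tokens) for _, tokens in traffic)
print(f"routed: ${routed:.4f}, all-frontier: ${baseline:.4f}, ratio: {baseline / routed:.1f}x")
```

The last line is the part that matters here: comparing routed cost against an all-frontier baseline is how you tell whether routing is actually working.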

GPU-metric autoscaling is the third. CPU-based autoscaling doesn't work for GPU workloads: an endpoint can have low CPU but a saturated GPU. The right approach is KEDA with the NVIDIA GPU Operator, scaling on real GPU utilization. Scale to zero during low-demand periods. Without this, the peak-provisioned endpoint keeps billing 24/7.
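The scaling decision itself is simple to sketch. The function below is not KEDA configuration. It is an illustration of the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × metric / target)), with a scale-to-zero branch added. The parameter names, the 60% target, and the replica limit are assumptions.

```python
import math


def desired_replicas(current: int,
                     gpu_util_pct: int,
                     target_util_pct: int,
                     queue_depth: int,
                     max_replicas: int = 10) -> int:
    """Replica count driven by GPU utilization and request queue depth, not CPU."""
    # Nothing queued and the GPUs are idle: release the hardware entirely.
    if queue_depth == 0 and gpu_util_pct == 0:
        return 0
    # Work arrived while scaled to zero: bring one replica up first.
    if current == 0:
        return 1
    # Proportional rule: scale so average utilization moves toward the target.
    proportional = math.ceil(current * gpu_util_pct / target_util_pct)
    return max(1, min(max_replicas, proportional))


print(desired_replicas(2, 90, 60, queue_depth=5))   # GPUs saturated: scale out
print(desired_replicas(4, 30, 60, queue_depth=0))   # GPUs underused: scale in
print(desired_replicas(3, 0, 60, queue_depth=0))    # no traffic: scale to zero
print(desired_replicas(0, 0, 60, queue_depth=4))    # requests waiting: wake up
```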

Spend caps and on-prem: when cloud isn't the answer

Spend caps are the brake that was missing. Google Cloud launched them in 2026: auto-pause when budget is hit. Without this, an uncontrolled agentic loop generates costs that look like legitimate traffic until the bill arrives. Spend caps per endpoint and per project are basic governance — not a luxury.
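The same idea can be enforced at the application layer. The sketch below is not the Google Cloud feature. It is a hypothetical in-process guard showing the behavior: every call is charged against a cap, and once the cap would be exceeded the workload is paused instead of billed.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the configured cap."""


class SpendCap:
    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.paused = False

    def charge(self, cost_usd: float) -> None:
        """Record a call's cost, or pause and refuse it if the cap would be exceeded."""
        if self.paused or self.spent_usd + cost_usd > self.cap_usd:
            self.paused = True
            raise BudgetExceeded(f"cap of ${self.cap_usd:.2f} reached")
        self.spent_usd += cost_usd


def run_agent_loop(cap: SpendCap, cost_per_call_usd: float, max_iterations: int) -> int:
    """Simulated agent that keeps calling the model until the cap stops it."""
    calls = 0
    for _ in range(max_iterations):
        try:
            cap.charge(cost_per_call_usd)
        except BudgetExceeded:
            break
        calls += 1
    return calls


# Hypothetical runaway loop: 1,000 planned iterations against a $1.00 cap.
cap = SpendCap(cap_usd=1.00)
calls = run_agent_loop(cap, cost_per_call_usd=0.25, max_iterations=1000)
print(f"calls made: {calls}, spent: ${cap.spent_usd:.2f}, paused: {cap.paused}")
```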

On-prem inference is the math few teams do. For stable, high-volume workloads, break-even can arrive in as little as 3 months, according to Signal 65 / Futurum Group data. Cloud wins when volume is sporadic or unpredictable, because autoscaling compensates for the higher hourly rate. But if your inference runs 24/7 with predictable load, on-prem can be significantly cheaper. The calculation must include hardware, power, cooling, and operations staff.
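The break-even calculation can be sketched as below. Every figure is illustrative and not taken from the cited data. The model is deliberately simple: an upfront hardware cost recovered through the monthly difference between the cloud bill and on-prem running costs.

```python
from typing import Optional


def breakeven_months(hardware_usd: float,
                     power_usd: float,
                     cooling_usd: float,
                     staff_usd: float,
                     cloud_monthly_usd: float) -> Optional[float]:
    """Months until upfront hardware cost is recovered, or None if cloud stays cheaper."""
    onprem_monthly = power_usd + cooling_usd + staff_usd
    monthly_saving = cloud_monthly_usd - onprem_monthly
    if monthly_saving <= 0:
        return None
    return hardware_usd / monthly_saving


# Hypothetical steady 24/7 workload with a large monthly cloud bill.
steady = breakeven_months(hardware_usd=300_000, power_usd=4_000, cooling_usd=2_000,
                          staff_usd=14_000, cloud_monthly_usd=70_000)

# Hypothetical sporadic workload: autoscaling keeps the cloud bill low.
sporadic = breakeven_months(hardware_usd=300_000, power_usd=4_000, cooling_usd=2_000,
                            staff_usd=14_000, cloud_monthly_usd=15_000)

print(f"steady workload breaks even after {steady:.1f} months")
print(f"sporadic workload: {'never breaks even' if sporadic is None else sporadic}")
```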

Conclusion

FinOps for AI is a practice that runs in parallel to traditional FinOps: token-level instrumentation, application-layer allocation, model routing governance, budget enforcement at the API call. No surprise bills. No agentic runaway that looks like normal traffic. No reserved GPU that nobody uses.

The questions for the CFO: what does each token in your chatbot cost? Which model runs for which query? Which client consumes 78% of LLM spend while paying 12% of revenue? If the answer to any of these is "I don't know", you're burning money.

At Tech86, we design FinOps for AI with cost-per-token, model routing, and GPU utilization as core metrics. If you don't know what each token costs, it's time to find out.
