73% of AI projects blow their budget. Some 2.4x over. The average enterprise AI budget jumped from $1.2M to $7M in two years — and most CFOs don't know what each token costs. At Tech86, we've seen companies burn $2.3M in costs nobody predicted because nobody instrumented. Without cost-per-token, any optimization is guesswork.
The visibility problem
The Anthropic bill arrives as a single line: $100K/month. No breakdown by client, feature, or workflow. One banking query fires 1 orchestrator, 3 retrievers, 4 tool calls, and 7 model invocations. Cost buried 6 levels deep. Cost Explorer can't see it.
The problem isn't lack of tools — it's lack of granularity. Traditional FinOps operates at the instance level. FinOps for AI needs to operate at the token level. If you don't know what each token costs per endpoint, you don't know where you're burning money. And 98% of FinOps teams now manage AI spend — up from 31% in 2024. Demand exploded. The playbook still doesn't exist.
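To make the granularity point concrete, here is a minimal sketch of application-layer attribution for a single query like the banking example above. The span fields (trace_id, client, feature, kind, cost_usd) and the per-call costs are assumptions made up for illustration; they are not a provider schema or real prices.

```python
from collections import defaultdict

def make_spans(trace_id, client, feature):
    """Build hypothetical spans for one query:
    1 orchestrator, 3 retrievers, 4 tool calls, 7 model invocations."""
    # (kind, how many calls, assumed cost per call in USD) -- illustrative values
    plan = [
        ("orchestrator", 1, 0.0),
        ("retriever", 3, 0.0005),
        ("tool", 4, 0.001),
        ("model", 7, 0.004),
    ]
    return [
        {"trace_id": trace_id, "client": client, "feature": feature,
         "kind": kind, "cost_usd": cost}
        for kind, count, cost in plan
        for _ in range(count)
    ]

def rollup(spans, tag):
    """Sum span cost by any tag the application attached (client, feature, kind)."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[tag]] += span["cost_usd"]
    return dict(totals)

spans = make_spans("trace-001", "bank-x", "loan-faq")
by_kind = rollup(spans, "kind")
by_client = rollup(spans, "client")
```

The point of the sketch is that the breakdown only exists if the application tags every call at the moment it is made. The provider invoice cannot reconstruct it afterwards.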
80-90% of spend is inference — and the GPU is idle
Training dominates the hype. Inference dominates the bill: 80-90% of AI spend is inference, not training. And average GPU utilization sits at 15-30%, so well over half of the GPU budget pays for idle hardware.
The main culprit: endpoints provisioned for peak. You size the infrastructure for Monday business hours, and it bills at full rate 24/7. At 3am, zero traffic, GPU still charging. Utilization below 50% is recoverable spend. It's like renting a truck to deliver a letter and paying the full daily rate.
At Tech86, when we audit AI workloads, the first number we look for is GPU utilization per endpoint and time window. If it's below 50%, we already know there's significant savings margin.
Cost-per-token: the metric that separates control from guesswork
Cost-per-token is simple in theory: daily inference spend divided by tokens processed. In practice, it requires instrumentation at the application layer. You need to track tokens per endpoint, per model, per client, per feature. Without this, you're flying blind.
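As a minimal sketch of that division, assuming the application already logs one record per call with hypothetical fields (endpoint, model, tokens, cost in USD), the per-endpoint figure could be computed like this. The endpoint names, model names, and amounts are placeholders.

```python
from collections import defaultdict

def cost_per_token(records):
    """Return {(endpoint, model): usd_per_token} from per-call usage records."""
    tokens = defaultdict(int)
    spend = defaultdict(float)
    for endpoint, model, n_tokens, cost_usd in records:
        key = (endpoint, model)
        tokens[key] += n_tokens
        spend[key] += cost_usd
    # Skip keys with zero tokens to avoid dividing by zero.
    return {key: spend[key] / tokens[key] for key in tokens if tokens[key] > 0}

# Hypothetical daily records: (endpoint, model, tokens, cost_usd)
records = [
    ("/chat", "model-a", 1_000_000, 15.0),
    ("/chat", "model-a", 1_000_000, 15.0),
    ("/summarize", "model-b", 4_000_000, 2.0),
]
rates = cost_per_token(records)
```

The same function extends to per-client or per-feature views by widening the grouping key, as long as those tags are recorded on each call.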
A concrete example: Opus 4.6 costs 42x more than MiniMax M2.5 for a 0.6% benchmark difference. If you don't have cost-per-token per endpoint, you don't know which model is consuming what. You don't know if Opus is running commodity tasks that MiniMax would handle. You don't know if a client representing 12% of revenue is consuming 78% of LLM spend. That asymmetry is invisible without the right metric.
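The client asymmetry above can be surfaced with a simple share comparison once spend is attributed per client. This is a sketch with made-up client names and amounts; the rule of flagging any client whose spend share exceeds twice its revenue share is an arbitrary assumption, not a recommended threshold.

```python
def shares(clients):
    """clients: {name: (revenue_usd, llm_spend_usd)}
    Returns {name: (revenue_share, spend_share)}."""
    total_revenue = sum(revenue for revenue, _ in clients.values())
    total_spend = sum(spend for _, spend in clients.values())
    return {
        name: (revenue / total_revenue, spend / total_spend)
        for name, (revenue, spend) in clients.items()
    }

# Hypothetical figures chosen to mirror the 12% / 78% example.
clients = {
    "client_a": (12_000, 78_000),
    "client_b": (88_000, 22_000),
}
client_shares = shares(clients)

# Assumed rule: flag clients whose share of spend is more than 2x their share of revenue.
flagged = [
    name for name, (revenue_share, spend_share) in client_shares.items()
    if spend_share > 2 * revenue_share
]
```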
Cost-per-token turns optimization from guesswork into data-driven decisions. With it, you compare models on real cost per task, identify overprovisioned endpoints, and quantify the impact of every architecture change.
The levers that work
Dynamic batching and caching are the fastest win. Grouping requests and caching frequent responses can lift GPU utilization from 30% to 70%. Cost per token drops roughly in proportion, because the same hardware bill is spread over more than twice the work. This isn't theory. It's what happens when the GPU stops processing one request at a time and starts working in batches.
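A back-of-the-envelope version of that arithmetic, under two simplifying assumptions: the GPU bills at a fixed hourly rate regardless of load, and token throughput scales linearly with utilization. The hourly rate and peak throughput below are placeholders.

```python
def cost_per_million_tokens(hourly_rate_usd, peak_tokens_per_hour, utilization):
    """Cost per 1M tokens for a GPU billed at a flat hourly rate.
    Assumes throughput scales linearly with utilization (a simplification)."""
    effective_tokens = peak_tokens_per_hour * utilization
    return hourly_rate_usd / effective_tokens * 1_000_000

# Placeholder numbers: $4/hour GPU, 10M tokens/hour at full utilization.
before = cost_per_million_tokens(4.0, 10_000_000, 0.30)
after = cost_per_million_tokens(4.0, 10_000_000, 0.70)

# Fractional reduction in cost per token; equals 1 - 0.30/0.70 under these assumptions.
reduction = 1 - after / before
```

Under these assumptions the reduction depends only on the two utilization figures, not on the hourly rate: going from 30% to 70% cuts cost per token by a little over half.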
Model routing by complexity is the second lever. Commodity tasks on cheap models, reasoning on frontier. 5-10x savings. We covered this in detail in our model selection article — the focus here is that without cost-per-token, you can't tell if routing is actually working.
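A minimal sketch of complexity-based routing, to show where cost-per-token plugs in. Everything here is hypothetical: the model names, the prices, and the keyword heuristic, which stands in for whatever classifier a real router would use.

```python
import re

CHEAP, FRONTIER = "cheap-model", "frontier-model"  # hypothetical model names
PRICE_PER_MTOK = {CHEAP: 0.5, FRONTIER: 15.0}      # made-up USD per 1M tokens

# Naive stand-in for a complexity classifier: reasoning-style keywords or long prompts.
REASONING_HINTS = {"why", "prove", "plan", "analyze", "compare"}

def route(prompt):
    words = re.findall(r"[a-z]+", prompt.lower())
    if len(words) > 200 or REASONING_HINTS.intersection(words):
        return FRONTIER
    return CHEAP

def routed_cost(requests):
    """requests: list of (prompt, tokens). Total USD cost after routing."""
    return sum(PRICE_PER_MTOK[route(p)] * t / 1_000_000 for p, t in requests)

def frontier_only_cost(requests):
    """Baseline: every request goes to the frontier model."""
    return sum(PRICE_PER_MTOK[FRONTIER] * t / 1_000_000 for _, t in requests)

# Illustrative mix: nine commodity requests and one reasoning request, 1,000 tokens each.
requests = [("Extract the invoice date from this email", 1000)] * 9
requests.append(("Analyze why churn rose and plan a response", 1000))
savings_factor = frontier_only_cost(requests) / routed_cost(requests)
```

Comparing the two totals on real traffic, rather than on an assumed mix like this one, is what tells you whether routing is actually working.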
GPU-metric autoscaling is the third. CPU-based autoscaling doesn't work for GPU workloads: an endpoint can show low CPU while its GPU is saturated. The right approach is KEDA scaling on real GPU utilization, using the metrics exposed by the NVIDIA GPU Operator. Scale to zero during low-demand periods. Without this, the peak-provisioned endpoint keeps billing 24/7.
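The scaling decision itself can be sketched in a few lines. This is a simplified model, not KEDA's implementation: it uses the proportional rule that the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current * metric / target)) and adds an assumed scale-to-zero rule. The 70% target, the replica cap, and the pending-request signal are illustrative assumptions.

```python
import math

def desired_replicas(current, gpu_util, target_util=0.7,
                     pending_requests=0, max_replicas=8):
    """Simplified GPU-utilization scaling decision (illustrative only).

    current: replicas running now
    gpu_util: average GPU utilization across replicas, 0.0 to 1.0
    """
    if current == 0:
        # Scaled to zero: wake one replica only if work is waiting.
        return 1 if pending_requests > 0 else 0
    if gpu_util == 0 and pending_requests == 0:
        # No load at all: release the GPUs instead of billing for idle time.
        return 0
    # Proportional rule: scale so that utilization moves toward the target.
    wanted = math.ceil(current * gpu_util / target_util)
    return max(1, min(max_replicas, wanted))
```

For example, two replicas at 95% utilization scale out to three, while four idle replicas at 3am scale to zero.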
Spend caps and on-prem: when cloud isn't the answer
Spend caps are the brake that was missing. Google Cloud launched them in 2026: auto-pause when budget is hit. Without this, an uncontrolled agentic loop generates costs that look like legitimate traffic until the bill arrives. Spend caps per endpoint and per project are basic governance — not a luxury.
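The same idea can be enforced at the application layer. The sketch below is a hypothetical in-process guard, not the Google Cloud feature: the class names and the flat cost per call are assumptions, and a production version would need shared state across workers.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push spend past the cap."""

class SpendCap:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        """Record a call's cost, refusing it if the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"cap {self.limit_usd} USD reached at {self.spent_usd} USD"
            )
        self.spent_usd += cost_usd

def run_agent_loop(cap, cost_per_call_usd, max_steps):
    """Simulate an agentic loop; returns how many calls ran before the cap paused it."""
    steps = 0
    try:
        for _ in range(max_steps):
            cap.charge(cost_per_call_usd)  # check budget before each model call
            steps += 1
    except BudgetExceeded:
        pass  # auto-pause: stop issuing calls
    return steps

cap = SpendCap(limit_usd=1.0)
steps_run = run_agent_loop(cap, cost_per_call_usd=0.25, max_steps=100)
```

Here a loop that would have made 100 calls is stopped after the budget is used up, which is the behavior the runaway scenario lacks.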
On-prem inference is the math few do. For stable, high-volume workloads, break-even can arrive in as little as 3 months according to Signal 65 / Futurum Group data. Cloud wins when volume is sporadic or unpredictable, because autoscaling compensates for the higher hourly rate. But if your inference runs 24/7 with predictable load, on-prem can be significantly cheaper. The calculation must include hardware, power, cooling, and operations staff.
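The break-even calculation is simple enough to sketch. The figures below are invented for illustration and are not taken from the Signal 65 / Futurum Group data; the function assumes a one-time hardware purchase and flat monthly costs on both sides.

```python
def break_even_months(hardware_usd, onprem_monthly_opex_usd, cloud_monthly_usd):
    """Months until on-prem hardware pays for itself versus a steady cloud bill.

    onprem_monthly_opex_usd should include power, cooling, and operations staff.
    Returns None when on-prem running costs are not lower than cloud.
    """
    monthly_saving = cloud_monthly_usd - onprem_monthly_opex_usd
    if monthly_saving <= 0:
        return None  # on-prem never breaks even
    return hardware_usd / monthly_saving

# Invented example: $300K of hardware, $20K/month to run it, $120K/month in the cloud.
steady = break_even_months(300_000, 20_000, 120_000)

# Same hardware against a small, sporadic cloud bill of $15K/month.
sporadic = break_even_months(300_000, 20_000, 15_000)
```

The second call shows the other half of the argument: when the cloud bill is small or sporadic, the function returns None and the hardware never pays for itself.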
Conclusion
FinOps for AI is a practice parallel to traditional FinOps, with its own toolkit: token-level instrumentation, application-layer allocation, model routing governance, budget enforcement at the API-call level. No surprise bills. No agentic runaway that looks like normal traffic. No reserved GPU that nobody uses.
The question for the CFO: what does each token in your chatbot cost? Which model runs for which query? Which client consumes 78% of LLM spend while generating 12% of revenue? If the answer is "I don't know" to any of these, you're burning money.
At Tech86, we design FinOps for AI with cost-per-token, model routing, and GPU utilization as core metrics. If you don't know what each token costs, it's time to find out.
