FinOps for AI: Cost-per-Token and the GPU You Don't Use

Gabriel Ferraresi · CEO | Tech86 · May 28, 2026 · 4 min
Tags: finops, ai, gpu, cost-per-token, cloud

73% of AI projects blow their budget, some by 2.4x. The average enterprise AI budget jumped from $1.2M to $7M in two years, and most CFOs don't know what each token costs. At Tech86, we've seen companies burn $2.3M in costs nobody predicted because nobody instrumented. Without cost-per-token, any optimization is guesswork.

The visibility problem

The Anthropic bill arrives as a single line: $100K/month. No breakdown by client, feature, or workflow. One banking query fires 1 orchestrator, 3 retrievers, 4 tool calls, and 7 model invocations. Cost buried 6 levels deep. Cost Explorer can't see it.

The problem isn't lack of tools — it's lack of granularity. Traditional FinOps operates at the instance level. FinOps for AI needs to operate at the token level. If you don't know what each token costs per endpoint, you don't know where you're burning money. And 98% of FinOps teams now manage AI spend — up from 31% in 2024. Demand exploded. The playbook still doesn't exist.

80-90% of spend is inference — and the GPU is idle

Training dominates the hype. Inference dominates the bill, at 80-90% of AI spend. And average GPU utilization sits at 15-30%, so half the budget pays for idle hardware.

The main culprit is endpoints provisioned for peak. You size the infrastructure for Monday business hours, and it bills at full rate 24/7. At 3am, with zero traffic, the GPU is still charging. Utilization below 50% signals recoverable spend. It's like renting a truck to deliver a letter and paying the full daily rate.

At Tech86, when we audit AI workloads, the first number we look for is GPU utilization per endpoint and time window. If it's below 50%, we already know there's significant savings margin.
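
As a rough illustration of that audit, the sketch below takes hourly utilization samples per endpoint and estimates how much of each hour's bill paid for unused GPU capacity. The `UtilizationSample` record, the endpoint name, and the $4/hour rate are hypothetical; real samples would come from whatever GPU monitoring you already run.

```python
from dataclasses import dataclass


@dataclass
class UtilizationSample:
    endpoint: str
    hour: int               # hour of day, 0-23
    gpu_utilization: float  # average for that hour, 0.0-1.0
    hourly_cost_usd: float  # what the endpoint billed for that hour


def idle_spend_by_endpoint(samples):
    """Sum the share of each hour's cost that paid for unused GPU capacity."""
    idle = {}
    for s in samples:
        wasted = s.hourly_cost_usd * (1.0 - s.gpu_utilization)
        idle[s.endpoint] = idle.get(s.endpoint, 0.0) + wasted
    return idle


def average_utilization(samples):
    """Mean GPU utilization per endpoint across all sampled hours."""
    totals = {}
    for s in samples:
        util_sum, count = totals.get(s.endpoint, (0.0, 0))
        totals[s.endpoint] = (util_sum + s.gpu_utilization, count + 1)
    return {name: util_sum / count for name, (util_sum, count) in totals.items()}


# Hypothetical samples: idle at 3am, busier during business hours.
samples = [
    UtilizationSample("chat", 3, 0.0, 4.0),
    UtilizationSample("chat", 10, 0.5, 4.0),
    UtilizationSample("chat", 14, 0.75, 4.0),
]

idle = idle_spend_by_endpoint(samples)
avg = average_utilization(samples)
flagged = [name for name, util in avg.items() if util < 0.5]
```

Here `flagged` lists the endpoints whose average utilization falls below the 50% threshold mentioned above. Treating idle cost as linear in utilization is a simplification, but it is enough to rank endpoints by recoverable spend.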

Cost-per-token: the metric that separates control from guesswork

Cost-per-token is simple in theory: daily inference spend divided by tokens processed. In practice, it requires instrumentation at the application layer. You need to track tokens per endpoint, per model, per client, per feature. Without this, you're flying blind.
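
A minimal sketch of that calculation, assuming the application already logs one usage record per model call. The field names (`endpoint`, `model`, `client`, `cost_usd`, and the token counts) and all values are illustrative, not a standard schema.

```python
from collections import defaultdict


def cost_per_token(records, key):
    """Group usage records by `key` and return spend divided by tokens per group."""
    spend = defaultdict(float)
    tokens = defaultdict(int)
    for r in records:
        spend[r[key]] += r["cost_usd"]
        tokens[r[key]] += r["input_tokens"] + r["output_tokens"]
    # Skip groups with zero tokens to avoid dividing by zero.
    return {k: spend[k] / tokens[k] for k in spend if tokens[k] > 0}


# Hypothetical usage log: one record per model invocation.
records = [
    {"endpoint": "/chat", "model": "model-a", "client": "acme",
     "input_tokens": 800, "output_tokens": 200, "cost_usd": 0.02},
    {"endpoint": "/chat", "model": "model-a", "client": "globex",
     "input_tokens": 900, "output_tokens": 100, "cost_usd": 0.04},
    {"endpoint": "/search", "model": "model-b", "client": "acme",
     "input_tokens": 3500, "output_tokens": 500, "cost_usd": 0.004},
]

by_endpoint = cost_per_token(records, "endpoint")
by_client = cost_per_token(records, "client")
```

The same function answers the per-model and per-feature questions by changing `key`, which is why the tagging has to happen at the application layer where those dimensions are known.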

A concrete example: Opus 4.6 costs 42x more than MiniMax M2.5 for a 0.6% benchmark difference. If you don't have cost-per-token per endpoint, you don't know which model is consuming what. You don't know if Opus is running commodity tasks that MiniMax would handle. You don't know if a client representing 12% of revenue is consuming 78% of LLM spend. That asymmetry is invisible without the right metric.
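
The client asymmetry can be made visible with a few lines once spend is attributed per client. The sketch below is an assumption-laden illustration: the client names and dollar amounts are made up, chosen only to reproduce the 12% of revenue versus 78% of spend split described above.

```python
def spend_vs_revenue(clients):
    """Compare each client's share of LLM spend with its share of revenue."""
    total_spend = sum(c["llm_spend_usd"] for c in clients.values())
    total_revenue = sum(c["revenue_usd"] for c in clients.values())
    report = {}
    for name, c in clients.items():
        spend_share = c["llm_spend_usd"] / total_spend
        revenue_share = c["revenue_usd"] / total_revenue
        report[name] = {
            "spend_share": spend_share,
            "revenue_share": revenue_share,
            # A ratio above 1 means the client costs more than it contributes.
            "ratio": spend_share / revenue_share,
        }
    return report


# Hypothetical figures chosen to match the 12% / 78% example.
clients = {
    "client_a": {"revenue_usd": 12_000, "llm_spend_usd": 7_800},
    "client_b": {"revenue_usd": 88_000, "llm_spend_usd": 2_200},
}

report = spend_vs_revenue(clients)
```

In this made-up case `client_a` consumes 6.5 times its revenue share in LLM spend, which is the kind of number that should trigger a pricing or routing conversation.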

Cost-per-token turns optimization from guesswork into data-driven decisions. With it, you compare models on real cost per task, identify overprovisioned endpoints, and quantify the impact of every architecture change.

The levers that work

Dynamic batching and caching are the fastest win. By grouping requests and caching frequent responses, GPU utilization can jump from 30% to 70%. Cost per token drops roughly in inverse proportion, because the same hourly GPU bill is spread over more tokens. This isn't theory: it's what happens when the GPU stops processing one request at a time and starts working in batches.
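
To put a number on the utilization effect, the sketch below assumes a fixed hourly GPU price and throughput that scales linearly with utilization. Both are simplifications, and the $4/hour rate and peak throughput are hypothetical.

```python
def cost_per_million_tokens(hourly_cost_usd, peak_tokens_per_hour, utilization):
    """Cost per 1M tokens when the GPU runs at a given fraction of peak throughput."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_hour * utilization
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical GPU: $4/hour, 2M tokens/hour at full utilization.
before = cost_per_million_tokens(4.0, 2_000_000, 0.30)
after = cost_per_million_tokens(4.0, 2_000_000, 0.70)
reduction = 1 - after / before
```

Under these assumptions the reduction is 1 - 0.30/0.70, or about 57%, regardless of the hourly price or peak throughput chosen.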

Model routing by complexity is the second lever. Commodity tasks go to cheap models and reasoning goes to frontier models, for savings of 5-10x. We covered this in detail in our model selection article. The focus here is that without cost-per-token, you can't tell if routing is actually working.
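
A deliberately naive sketch of complexity-based routing follows. The model names, the keyword list, and the word-count threshold are all hypothetical; production routers typically use a trained classifier or a cheap model as the judge. The point is only that the routing decision is an explicit, loggable step.

```python
import re

CHEAP_MODEL = "cheap-model"        # hypothetical model identifier
FRONTIER_MODEL = "frontier-model"  # hypothetical model identifier

# Hypothetical hints that a prompt needs multi-step reasoning.
REASONING_HINTS = {"why", "prove", "analyze", "compare", "derive"}


def choose_model(prompt, max_cheap_words=200):
    """Send long or reasoning-heavy prompts to the frontier model, the rest to the cheap one."""
    words = re.findall(r"[a-z]+", prompt.lower())
    needs_reasoning = any(w in REASONING_HINTS for w in words)
    is_long = len(words) > max_cheap_words
    return FRONTIER_MODEL if needs_reasoning or is_long else CHEAP_MODEL


commodity = choose_model("Translate 'hello' to Portuguese")
reasoning = choose_model("Analyze why churn rose in Q3")
```

Logging the chosen model next to each request's token counts is what lets cost-per-token confirm whether the router is actually shifting spend.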

GPU-metric autoscaling is the third. CPU-based autoscaling doesn't work for GPU: an endpoint can have low CPU but saturated GPU. The right approach is KEDA with NVIDIA GPU Operator, scaling on real GPU utilization (the GPU Operator's DCGM exporter publishes those metrics). Scale to zero during low-demand periods. Without this, the peak-provisioned endpoint keeps billing 24/7.
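
The scaling decision itself can be sketched in a few lines. This is not KEDA configuration. It illustrates the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric)), applied to GPU utilization, with a simplified scale-to-zero condition added. The 70% target, replica limit, and `pending_requests` signal are assumptions.

```python
import math


def desired_replicas(current_replicas, gpu_utilization, target_utilization=0.7,
                     pending_requests=0, max_replicas=8):
    """Proportional scaling on GPU utilization, with scale-to-zero when fully idle."""
    if current_replicas == 0:
        # Wake up from zero only when there is work waiting.
        return 1 if pending_requests > 0 else 0
    if gpu_utilization == 0 and pending_requests == 0:
        # No load at all: release the GPUs instead of billing for idle time.
        return 0
    desired = math.ceil(current_replicas * gpu_utilization / target_utilization)
    return max(1, min(desired, max_replicas))


scale_up = desired_replicas(2, 0.95)    # saturated GPUs: add a replica
scale_down = desired_replicas(4, 0.20)  # mostly idle: shrink
scale_zero = desired_replicas(2, 0.0)   # 3am, no traffic
```

In a real cluster this logic lives in the autoscaler rather than in application code. The sketch only shows why the input has to be a GPU metric: feeding CPU utilization into the same formula would produce replica counts unrelated to actual GPU load.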

Spend caps and on-prem: when cloud isn't the answer

Spend caps are the brake that was missing. Google Cloud launched them in 2026: auto-pause when budget is hit. Without this, an uncontrolled agentic loop generates costs that look like legitimate traffic until the bill arrives. Spend caps per endpoint and per project are basic governance — not a luxury.
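
Where a provider-level cap is not available, the same idea can be approximated in the application. The sketch below is a hypothetical in-process guard, not Google Cloud's feature or API: every call is charged against a budget, and the loop pauses once the next call would exceed it.

```python
class SpendCapExceeded(Exception):
    """Raised when a charge would push spend past the configured cap."""


class SpendCap:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.paused = False

    def charge(self, cost_usd):
        """Record a cost, or pause and refuse if it would exceed the cap."""
        if self.paused or self.spent_usd + cost_usd > self.limit_usd:
            self.paused = True
            raise SpendCapExceeded(
                f"cap of ${self.limit_usd:.2f} reached at ${self.spent_usd:.2f}"
            )
        self.spent_usd += cost_usd


def run_agent_loop(cap, cost_per_step_usd, max_steps):
    """Simulate an agent loop that stops as soon as the spend cap trips."""
    steps = 0
    try:
        for _ in range(max_steps):
            cap.charge(cost_per_step_usd)  # check the budget before each call
            steps += 1
    except SpendCapExceeded:
        pass
    return steps


cap = SpendCap(limit_usd=1.0)
steps_run = run_agent_loop(cap, cost_per_step_usd=0.25, max_steps=100)
```

A runaway loop that would have made 100 calls stops after 4 here. In practice the counter would need to live in shared storage so that every replica of the endpoint sees the same budget.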

On-prem inference is the math few do. For stable, high-volume workloads, break-even can arrive in as little as 3 months according to Signal 65 / Futurum Group data. Cloud wins when volume is sporadic or unpredictable, because autoscaling compensates for the higher hourly rate. But if your inference runs 24/7 with predictable load, on-prem can be significantly cheaper. The calculation must include hardware, power, cooling, and operations staff.
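
The break-even math can be sketched as upfront hardware cost divided by the monthly saving over cloud. All figures below are hypothetical, chosen to land on a 3-month break-even for illustration. They are not taken from the Signal 65 / Futurum Group data.

```python
def break_even_months(hardware_usd, onprem_monthly_opex_usd, cloud_monthly_usd):
    """Months until on-prem hardware pays for itself, or None if it never does."""
    monthly_saving = cloud_monthly_usd - onprem_monthly_opex_usd
    if monthly_saving <= 0:
        return None  # cloud is cheaper every month, so there is no break-even
    return hardware_usd / monthly_saving


# Hypothetical stable 24/7 workload.
months = break_even_months(
    hardware_usd=300_000,            # servers and GPUs, paid upfront
    onprem_monthly_opex_usd=20_000,  # power, cooling, operations staff
    cloud_monthly_usd=120_000,       # equivalent always-on cloud capacity
)

# Hypothetical sporadic workload where autoscaled cloud costs less per month.
sporadic = break_even_months(300_000, 20_000, 15_000)
```

The second call returns `None`, which is the sporadic-volume case where cloud wins.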

Conclusion

FinOps for AI is a parallel practice: token-level instrumentation, application-layer allocation, model routing governance, budget enforcement at the API call. No surprise bills. No agentic runaway that looks like normal traffic. No reserved GPU that nobody uses.

The question for the CFO: what does each token in your chatbot cost? Which model runs for which query? Which client consumes 78% of LLM spend while paying 12% of revenue? If the answer is "I don't know" to any of these, you're burning money.

At Tech86, we design FinOps for AI with cost-per-token, model routing, and GPU utilization as core metrics. If you don't know what each token costs, it's time to find out.

Frequently Asked Questions

Cost-per-token is the metric that divides total inference spend by the number of tokens processed. Without it, you don't know if you're overpaying for an endpoint, which model consumes the most budget, or where the waste is. It's the equivalent of managing cloud costs without knowing the hourly price of your instance.

GPU utilization of 15-30% is common, but it's not normal. It means half your inference budget pays for idle hardware. The root cause is endpoints provisioned for peak that bill at full rate 24/7: at 3am with zero traffic, the GPU still charges. Dynamic batching and GPU-metric autoscaling fix this.

CPU-based autoscaling does not work for GPU endpoints, because CPU load doesn't reflect real GPU utilization. An endpoint can have low CPU but saturated GPU, or the opposite. The right approach is KEDA with NVIDIA GPU Operator, scaling on GPU metrics like utilization, memory usage, and queue depth.

For stable, high-volume inference workloads, break-even can hit 3 months according to Signal 65 / Futurum Group data. But it depends on the workload: if volume is sporadic, cloud with autoscaling still wins. The calculation must include hardware cost, power, cooling, and operations staff.

Spend caps are spending limits configured per endpoint or project that automatically pause execution when the budget is reached. Google Cloud launched this feature in 2026. Without spend caps, an uncontrolled agentic loop can generate costs that look like legitimate traffic until the bill arrives.
