Pular para o conteúdo principal
Close
FinOps

AI Inference FinOps Playbook: 5 Levers in the Right Order

Gabriel Ferraresi· CEO | Tech86June 6, 20265 min
finopsaiinferencecachingmodel routing

80-90% of AI cost goes to inference. And most teams are paying 50-90% more than they need to. Not for lack of tools — for lack of sequence. There are five measurable levers to cut that waste. The order in which you apply them matters more than any single lever. This is the playbook we use at Tech86.

Lever 1: Prompt/KV Caching — up to 90% reduction, zero risk

Every LLM request carries a repeated block: system prompt, tool calling instructions, RAG context, few-shot examples. These tokens are processed from scratch on every call. Prompt caching eliminates that redundancy.

The numbers are unambiguous. On Anthropic Sonnet 4.6, cache reads cost $0.30/MTok vs $3.00/MTok uncached. 90% reduction. The break-even for a 1-hour cache: fewer than 3 reuses/hour. Most workloads exceed that threshold trivially — a corporate chatbot with a 2K-token system prompt and 5K-token RAG context reprocesses 7K identical tokens on every interaction.

Effort: low. It is a flag on the API call or endpoint configuration. Zero quality impact — the output is identical to uncached. It is the first mover for an obvious reason: it eliminates cost without touching the architecture.

Lever 2: Semantic Caching — 61-69% fewer API calls

Prompt caching solves exact match. Semantic caching solves semantic equivalence. "What is the price of the Pro plan" and "How much does the Pro plan cost" generate distinct API calls, but they ask the same question. Semantic cache understands this.

Research confirms: 61-69% reduction in API calls with 97%+ accuracy preserved. The mechanism is an embedding pipeline that converts each query to a vector, compares it against the existing cache by cosine similarity, and returns the cached response when the score exceeds the threshold.

The effort is medium because it requires infrastructure: embedding model, vector store, ingestion pipeline, and — critically — threshold tuning. The threshold defines the trade-off between savings and accuracy. Too aggressive returns wrong answers. Too conservative does not cache enough. Each domain needs its own calibration.

Lever 3: Async Batching — 50% discount guaranteed

Not every LLM call needs a real-time response. Ticket classification, data enrichment, embedding generation, batch sentiment analysis — these workloads tolerate hours of latency. The OpenAI Batch API offers a 50% discount with a 24-hour window. Not a promise — a published price.

Implementation is minimal: package requests in JSONL, submit via API, receive results within 24h. For workloads already running in async pipelines, migration is trivial. The savings are unconditional — they do not depend on volume, prompt distribution, or benchmarks.

Effort: low. Zero quality impact — same model, same output, different timing. The only constraint: it does not work for anything interactive. If a user is waiting for a response, it is not batch.

Lever 4: Model Routing/Cascade — 30%+ savings + 5%+ accuracy

A single model for everything is waste. Simple queries — "what are the business hours", "summarize this text", "extract entities" — do not need a frontier model. Model routing directs each query to the most appropriate model based on complexity.

Intelligent routers predict the best model per query. Vendor data: 30%+ cost savings and 5%+ accuracy gains over single-model. The counter-intuitive accuracy gain comes from the fact that small models frequently outperform large models on simple tasks — less hallucination, less unnecessary reasoning, more direct answers.

The effort is medium because it requires benchmarking against your real prompt distribution. Public benchmarks are not enough — you need to run your actual traffic against multiple models, measure cost and quality per complexity tier, and calibrate the router. Without that benchmark, routing is guesswork.

Lever 5: FP8 Quantization — effectively lossless, self-hosted only

FP8 is the safe point of quantization. Study with 500,000+ evaluations on the Llama-3.1 family: zero accuracy degradation. INT8 adds 1-3% loss. INT4 is variable and depends on the model and task. FP8 reduces memory consumption and increases throughput — it is free performance on supported hardware.

But this lever only exists for self-hosted inference. For managed APIs (OpenAI, Anthropic, Google), you do not control quantization. The provider decides. If your inference runs on your own GPUs or on-prem, FP8 is the first hardware optimization step. If it runs on an API, skip this lever.

Effort: high. Requires serving infrastructure (vLLM, TensorRT-LLM), hardware compatibility validation, and regression testing. ROI appears at scale — for a few GPUs, the implementation effort may not be worth it.

The metric that aligns engineering with outcomes

Cost-per-token is an input metric. Cost-per-successful-output is the metric that aligns engineering with outcomes. Optimizing cost-per-token at the expense of retry rates, hallucination rates, or task completion rates cuts the token bill while increasing the real cost per result.

A model that costs 50% less per token but generates 3x more retries is not cheaper — it is more expensive per useful result. An aggressive cache that returns wrong answers 10% of the time is not saving money — it is shifting cost from the API to support. Track both: cost-per-token and cost-per-successful-output. The difference between the two is your actual waste.

The order is not stylistic

Starting with lever 3 means optimizing a model whose cost you cannot quantify — because you have not cached repeated prompts. Starting with lever 5 means quantizing before eliminating unnecessary calls. It is like replacing the engine before closing the window: the effort is real, the savings are marginal.

The path: cache first (zero risk, highest savings) → batch (50% guaranteed) → route (benchmark) → quantize (self-hosted only). Each lever reduces the volume the next one needs to process. Caching reduces tokens. Batching cuts the cost of what remains. Routing directs the rest to the right model. Quantization optimizes what actually runs on your own hardware.

At Tech86, we design inference architectures with integrated FinOps — from model selection to caching and routing pipelines. If your AI bill scales faster than your revenue, these levers existed from day one. You just needed to apply them in the right order.

Interested in this solution?

Explore our managed services and infrastructure.

Explore Cloud Hosting

Frequently Asked Questions

Because each lever depends on the previous one. Starting with batch (lever 3) means optimizing a model whose cost you cannot quantify — because you have not cached repeated prompts. Starting with quantization (lever 5) means reducing precision before eliminating unnecessary calls. The order is not stylistic — it is a dependency chain.

It works for any workload with repeated prompts. System prompts, RAG contexts, tool calling instructions — these repeat on every request. The break-even is fewer than 3 reuses per hour. Most workloads exceed that threshold trivially.

With proper threshold tuning, accuracy stays above 97%. The key is calibrating the similarity threshold for your domain. Queries like "what is the price of the Pro plan" and "how much does the Pro plan cost" are semantically identical — the cache gets it right. Ambiguous queries need a more conservative threshold.

No. Batch API has a 24-hour window — it is not for anything that needs a real-time response. It is for classification, enrichment, embedding generation, batch sentiment analysis, log processing. If a user is waiting for a response, it is not batch.

The study with 500,000+ evaluations on the Llama-3.1 family shows zero accuracy degradation with FP8. INT8 adds 1-3% loss. INT4 is variable and model-dependent. FP8 is the safe point — effectively lossless. But it only applies to self-hosted inference. For managed APIs, you have no control over quantization.

Blog — Get in Touch

Have a question about our articles or services? Our team is ready to help.

Schedule a Meeting

Book a time slot.

Schedule Now

Email

Send us a message.

[email protected]

WhatsApp

Quick conversation.

Address

Avenida Paulista, 1636 - São Paulo - SP - 01310-200

Tech86 Specialist

Online now

Hello! How can we help scale your business today?

Tech86 Engineering

We Value Your Privacy

We use cookies and similar technologies to optimize your experience, analyze site traffic, and personalize content. By clicking "Accept All", you agree to the use of all cookies. Read our Privacy Policy.