A fixed-fee diagnostic, run on your own invoices and observability rather than a questionnaire, that scores your AI estate against twelve levers and gives you a savings number you can act on, with a costed roadmap behind it.
The score
We build a spend cube from your billing and observability, reconcile it to the invoice, and score each of the twelve levers from the evidence in your data rather than from what anyone reports. The result is a single number and a band, from Tokenmaxxer to Yield-optimised, and an overlap-adjusted estimate of recoverable spend. The paid audit replaces every benchmark default with a measured value from your own data.
The taxonomy
How the levers actually work
We do not swap models on a hunch. Each workload gets a golden set and a pass-rate floor; a cheaper model is promoted only after it clears that floor offline and then holds it in a live A/B holdout. For workloads with a hard tail, a confidence-gated cascade answers with a cheap model first and escalates only the difficult cases. Routing stays inside your approved, security-vetted model set.
On current frontier-provider pricing, a cache read costs roughly a tenth of a fresh input token and a write about a quarter more, so any context resent more than twice pays for itself. The detail that matters: cacheable content has to sit at the prompt prefix, so we restructure prompts to put the stable system prompt and retrieved context first and the variable query last, which maximizes the cacheable prefix and the hit rate.
For a stable, high-volume task we collect labeled traces from the frontier model, fine-tune a small approved model, evaluate it against a held-out golden set to a quality floor you set, take it through your model-risk process, and deploy it behind the same eval gate as routing.
Quality-sensitive changes are proven with an A/B holdout: the optimized path runs against a concurrent control on comparable traffic, so the saving is the measured difference between the two arms. Everything is computed on your own invoices and observability.
We run on usage metadata, not prompt content. The spend cube is built from your billing exports and observability data: per request, the model, input and output token counts, cached-token counts, latency, retries, and team and application tags. Redacted is fine. We prefer read-only access inside your environment. Prompt and response content is needed only for a specific use-case deep-dive, and only with your sign-off.
Your data stays inside your environment wherever possible, and we never train any model on it. We work under your NDA and data-processing agreement, take only the access an engagement needs, and delete what we hold on request. For a regulated buyer we complete your third-party-risk and security review before any access is granted.
The engagement
Define the perimeter and access, then build the spend cube and reconcile it to the invoice. Attribution is the visibility lever in action.
Score all twelve levers from your own data, with at least one quantified finding per lever.
Set the baseline, compute the score and the overlap-adjusted savings, and flip at least one lever live so the estimate becomes an observed delta.
A readout, a written report, a tier-sequenced roadmap, and an implementation proposal priced on verified savings.
Remediate in tier order, verify on your own ledger against a quality floor, sign off monthly.
Stand up FinOps for AI so the savings stay won.
The self-serve scorecard gives you an indicative number for free. The paid audit replaces it with a measured one.