A delivered engagement

A hedge fund took $8.7M off its AI bill in just over a week, and more than doubled research per dollar.

The fund was running about $24M a year through large language models and could not say what the spend returned. In just over a week we removed $8.7M from the annual bill, with no research workload dropping below the quality floor the fund's own researchers set, and the desk ended up running more than twice the effective research per dollar. Here it is lever by lever, measured against the fund's own evals and signed off.

$8.7M

verified savings per year

−36%

total AI spend ($24M to $15.3M)

2×

research per AI dollar

26→71

Token Efficiency Score

The $8.7M is an annualized run-rate: the changes were made in just over a week, then verified and signed off month by month. Published with the client's permission and anonymized at its request. Every figure here was measured on the fund's own invoices and observability and signed off by its finance and research leads before any fee was invoiced.

The diagnosis

A sophisticated fund, scored 26 out of 100.

The fund put roughly $24M a year through large language models with no clean answer to what it returned. Twenty-two percent of that spend could not be attributed to any team or research process: a desk that ties every basis point of P&L to a strategy could not attribute a fifth of its AI spend to a research purpose. Fifty-eight percent of token volume ran on the top reasoning-grade model regardless of task. The same context was re-sent uncached on nearly every call, a 6% cache-hit rate. An agentic research loop had been stuck retrying a failed data fetch for weeks, producing nothing at a roughly seven-figure annualized rate. That puts the fund at 26 out of 100, the Tokenmaxxer band: spend running with no one accountable for what it returns. Because waste like that buys no alpha, removing it cost the fund nothing it valued.

Where the $8.7M came from

How the saving breaks down, lever by lever.

Tier 0verified / yr

zombiesKilled the looping research agent and two abandoned pilots; backoff, retry caps, dev and backtest traffic off production pricing. $1.5M

visibility & capsAttribution to under 5% unallocated, plus enforced per-key and per-agent budgets. The enabling layer for everything below.

Tier 1verified / yr

routingRouted extraction, pre-screening, and classification to the cheapest approved model that cleared an eval gate, with a cascade on the hard tail. $3.4M

cachingCached the static research prompts and retrieved context behind a stable prefix. $1.3M

batchMoved overnight backtest summarization and eval suites to the Batch API. $0.7M

arbitrageRebalanced heavy users of the coding assistant between seats and metered API. $0.3M

Tier 2verified / yr

hygieneOutput caps and prompt trimming on the summarization workloads. $0.6M

distillA small fine-tuned model for the high-volume signal pre-screen and filing extraction. $0.5M

semanticDe-duplicated near-identical research queries. $0.15M

Tier 3verified / yr

triageCut two research pilots that never produced a live signal; concentrated the surviving spend on what reaches production. $0.2M direct, and the lever behind the effectiveness gain.

finopsA standing operating model so the savings stay won.

Why this matters more than the bill

Cost per validated signal fell about 57%.

For an alpha shop the figure that matters is not the bill but the cost per validated signal: the all-in inference cost to carry one hypothesis from idea, through backtest and evaluation, to a signal cleared for a live book. It fell about 57%. Two forces drove it. Every stage of the pipeline got cheaper through routing, caching, and the distilled pre-screen. And the compute that used to die in looping agents and abandoned pilots left the denominator, because we switched it off. Before the engagement only about 22% of AI spend was tied to research that reached a live signal; afterward it was roughly 80%. Combined with a cost base a third lower, effective research per AI dollar more than doubled, and on the flagship workload the team backtested far more ideas for the same money.

How a fund lets you book it

Measured on the fund's own ledger, and signed off by it.

The baseline is theirs

Co-defined in writing on the fund's own usage data before any work begins, then frozen, with one-off anomalies normalized out. We never touch the ledger.

Measured three ways

Unit-cost reduction by default; frozen run-rate for the one-time eliminations; A/B holdout for the quality-sensitive levers like routing and distillation, so the counterfactual is measured directly rather than argued. A fund understands a holdout better than anyone.

A quality floor the research desk owns

The eval pass rates and signal-quality metrics, defined and owned by the fund's researchers, must hold or improve, or the affected saving does not count. Cutting cost by letting research quality slip would count for nothing.

Paid only on verified savings

A fixed audit fee plus a minority share of the savings the fund's finance and research leads sign off each month. Provider price cuts and organic volume are carved out, so we are paid only for the savings we actually caused.