
When inference is almost free

The price of running a capable model keeps falling by roughly an order of magnitude a year. That does not just make AI features cheaper; it moves the line for where you can put a model in the first place.

This month brought another round of cheaper model tiers from most of the large labs, the same pattern we have watched for two years now. Epoch AI puts numbers on it: for a fixed level of capability, the cost of inference has been falling by something close to an order of magnitude a year, though the rate varies a lot depending on how hard the task is. Andreessen Horowitz gave the trend a name, LLMflation, when it noted that the cost of matching a 2022 flagship's capability had fallen roughly a thousandfold in three years.

The interesting bit is not that AI features got cheaper. It is that a falling price does not just shrink the bill for things you already do. It moves the boundary of what is worth doing at all.

The decision that quietly flips

Eighteen months ago, putting a model call in the hot path of a request was a decision you had to justify. It cost real money per call, so you reserved it for the premium feature, the thing the user explicitly asked for. Everywhere else you reached for a rule, a regex, a lookup table, because a model on every request did not pencil out.

When a competent model costs a fraction of a penny for a typical call, that maths inverts. Take classifying every inbound message, tidying every messy address, pulling structured fields out of a scanned document, or checking that a form answer reads sensibly before it is submitted. None of these were worth a model call at the old price. Several of them are now cheaper to do with a model than to build and maintain the bespoke logic they replace.

The question stops being "can we afford to call the model here?" It becomes "what would we build differently if reasoning were nearly free?"

What stays expensive

Cheap per token is not the same as cheap in practice, and this is where a studio earns its keep. Three costs do not fall with the token price. Latency, because a model in the request path still adds hundreds of milliseconds, and a user waiting on a form does not care that the call was cheap. Aggregate spend, because near-free multiplied by a large enough volume is a real number again, and the systems that get into trouble are the ones that treated a cheap call as a free one. And reliability, because a probabilistic component in a place that used to be deterministic needs a fallback, a validation step, and a sensible answer for the times it is wrong.
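To make the aggregate-spend point concrete, here is a back-of-envelope sketch. Every figure in it is an assumption chosen for illustration: the token counts, the per-million-token prices and the call volume are hypothetical, not quotes from any provider.

```python
def cost_per_call(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost in USD of one call, given prices per million tokens."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def monthly_spend(calls_per_day, per_call_usd, days=30):
    """Aggregate spend in USD over a month of steady traffic."""
    return calls_per_day * per_call_usd * days


# Hypothetical figures: a short classification call on a cheap model tier.
per_call = cost_per_call(
    input_tokens=800,
    output_tokens=100,
    usd_per_m_input=0.10,   # assumed price, USD per million input tokens
    usd_per_m_output=0.40,  # assumed price, USD per million output tokens
)

# Assumed volume: a model call on every one of 5 million daily requests.
monthly = monthly_spend(calls_per_day=5_000_000, per_call_usd=per_call)

print(f"per call: ${per_call:.5f}")
print(f"per month: ${monthly:,.0f}")
```

Under these assumptions a single call costs about a hundredth of a cent, and the monthly bill still lands around $18,000. Neither number is wrong. They answer different questions, and the second is the one that shows up on the invoice.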

So the work moves. It is less about whether you can afford the model and more about designing the surface around it: where the call sits, what it is allowed to touch, what happens on the slow path or the wrong answer, and how you cap the bill before a loop runs up a surprise.
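One way to picture that surface is a thin wrapper around the call. The sketch below is a minimal illustration, not a production design: `call_model` stands in for whatever client you actually use, and the label set, cost estimate, timeout and `rule_based_label` fallback are all hypothetical names and values picked for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

ALLOWED_LABELS = {"billing", "support", "sales"}  # hypothetical label set


class Budget:
    """Tracks estimated spend and refuses calls past a hard cap."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self._lock = threading.Lock()

    def try_spend(self, amount_usd):
        with self._lock:
            if self.spent_usd + amount_usd > self.cap_usd:
                return False
            self.spent_usd += amount_usd
            return True


def rule_based_label(message):
    """Deterministic fallback: the kind of rule the model call replaced."""
    return "billing" if "invoice" in message.lower() else "support"


_pool = ThreadPoolExecutor(max_workers=4)


def classify(message, call_model, budget, est_cost_usd=0.0001, timeout_s=0.5):
    """Return (label, source). Falls back to the rule on any failure."""
    # Cap the bill first: no budget, no model call.
    if not budget.try_spend(est_cost_usd):
        return rule_based_label(message), "fallback:budget"

    future = _pool.submit(call_model, message)
    try:
        # Slow path: stop waiting after timeout_s. The worker thread is not
        # killed; it finishes in the background and its result is discarded.
        label = future.result(timeout=timeout_s)
    except FutureTimeout:
        return rule_based_label(message), "fallback:timeout"
    except Exception:
        return rule_based_label(message), "fallback:error"

    # Wrong answer: accept only output from the allowed set.
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return rule_based_label(message), "fallback:invalid"
    return label, "model"
```

The second element of the return value records which path produced the answer. Logging it is a cheap way to see how often the model is slow, wrong or over budget before any of those becomes a surprise.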

How we read it

For anyone building operational software right now, the useful move is to revisit the decisions you made when a model call was expensive. Some of the hand-rolled logic you wrote to avoid the cost is now the more expensive option to keep. That does not mean spraying a model across everything. It means the honest comparison, model call versus bespoke code, has quietly changed, and it is worth running again on the parts of the system where accuracy used to be too costly to buy. The price of inference is going one way. The engineering discipline around it is what still separates a system that holds up from one that surprises you.

Talk to Remiam about a system like this.