AI inference costs will drop more than 90%, but the total bill won’t fall nearly as much

The economics of generative artificial intelligence is set to change radically over the next decade. According to a new forecast from Gartner, by 2030, running inference on a large language model with 1 trillion parameters will cost AI providers more than 90% less than in 2025. The firm adds that LLMs in 2030 could be up to 100 times more cost-efficient than comparably sized early models developed in 2022. That is an eye-catching figure, but also a potentially misleading one if taken out of context.

The key point isn’t just that inference will get cheaper. The cost per token will fall, but overall token consumption will grow even faster. This means that tech companies, AI providers, and product teams won’t be able to rely solely on hardware or model cost reductions to solve the economic equation of advanced AI, especially as autonomous systems and complex reasoning flows become mainstream in production.
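A quick back-of-the-envelope calculation makes the dynamic concrete. The sketch below uses purely illustrative numbers (an assumed $10 per million tokens in 2025 and a 20x consumption multiplier, neither of which is a Gartner figure) alongside the forecast’s 90% price cut:

```python
# Illustrative arithmetic only: the 2025 price and the 20x consumption
# multiplier are assumptions for the sake of the example, not Gartner data.

price_2025 = 10.00                 # assumed $ per million tokens in 2025
price_2030 = price_2025 * 0.10     # Gartner: >90% cheaper per token by 2030

tokens_2025 = 1_000_000            # tokens per workload with a simple chatbot
tokens_2030 = tokens_2025 * 20     # assumed agentic workload (within the 5-30x range)

bill_2025 = price_2025 * tokens_2025 / 1_000_000
bill_2030 = price_2030 * tokens_2030 / 1_000_000

print(f"2025 bill: ${bill_2025:.2f}")  # $10.00
print(f"2030 bill: ${bill_2030:.2f}")  # $20.00 -- higher, despite 90% cheaper tokens
```

Under these assumptions the bill doubles even as the unit price collapses, which is exactly the tension the rest of the forecast turns on.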

Cheap tokens don’t fix bad architecture

Gartner attributes this future cost reduction to a combination of factors that will be familiar to any market observer: improvements in semiconductors, greater infrastructure efficiency, innovation in model design, better chip utilization, greater use of specialized inference silicon, and a larger role for edge devices in specific use cases. Simply put: there will be better chips, more optimized models, and platforms more effective at squeezing the available hardware.

The consulting firm even divides its scenarios into two categories. On one hand are frontier scenarios based on cutting-edge chips. On the other are legacy blend scenarios built on a representative mix of available semiconductors. In these latter cases, modeled costs remain significantly higher than in cutting-edge scenarios, precisely because computational power is lower. The clear technical conclusion is: cost reductions will not be uniform across the market. Not all companies will access the same efficiency levels, nor will they deploy on the same hardware classes.

This has a clear implication for the technology sector: the future of inference won’t depend solely on models becoming cheaper to run, but on who controls the best infrastructure, who gains early access to specialized hardware, and who designs products capable of intelligently orchestrating multiple model tiers. The competition won’t just be about unit cost but about orchestration.

Agents will consume many more tokens than a chatbot

This is the most important nuance in the entire forecast. Gartner warns that lower unit costs won’t fully transfer to enterprise clients. Furthermore, it emphasizes that “frontier intelligence” will require many more tokens than current applications: according to the firm, agent-based models could need between 5 and 30 times more tokens per task than a standard generative chatbot.

This difference is significant. A typical chatbot receives a query, processes limited context, and responds. An agent, by contrast, can decompose the problem, review documents, consult tools, call APIs, generate intermediate plans, correct course, validate results, and execute multiple steps before completing a task. All of this multiplies the number of tokens processed, both input and output. And if models with enhanced reasoning capabilities are involved, the count climbs even higher.
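A rough accounting sketch shows how the multiplication happens. The step names and token counts below are hypothetical, chosen only to land inside Gartner’s 5-30x range:

```python
# Hypothetical token accounting for one task: a single-shot chatbot call
# versus a multi-step agent loop. Step names and counts are illustrative.

CHATBOT = [("answer", 500, 400)]            # (step, input_tokens, output_tokens)

AGENT = [
    ("plan",         1_500,   600),         # decompose the problem
    ("read_docs",    4_000,   300),         # pull documents into context
    ("tool_call",    2_000,   200),         # consult an external tool or API
    ("draft",        4_000, 1_200),         # generate an intermediate result
    ("self_check",   5_500,   400),         # validate and correct course
    ("final_answer", 5_000,   800),         # produce the deliverable
]

def total_tokens(steps):
    return sum(inp + out for _, inp, out in steps)

ratio = total_tokens(AGENT) / total_tokens(CHATBOT)
print(f"agent/chatbot ratio: {ratio:.0f}x")  # ~28x with these assumed numbers
```

In practice the gap can widen further, because each agent step typically re-sends the accumulated context as input, and reasoning-heavy models generate long intermediate chains on top of the visible output.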

That’s why Gartner issues a warning worth noting: product teams shouldn’t conflate the dropping price of “commodity tokens” with automatic democratization of advanced reasoning. In other words, lower-cost tokens don’t mean that more sophisticated AI will become trivial or nearly free. The computing power and systems needed to support high-level reasoning will remain scarce and expensive in contexts where it matters most.

Cheap is basic; expensive will remain the differentiator

The overarching conclusion points towards a very clear market segmentation. Simpler, repetitive, high-volume AI will tend to become a sort of cheap utility. Routine tasks, highly structured workflows, and general-use assistants with low complexity will fall into this category. Conversely, high-cost inference based on frontier models will continue to be reserved for scenarios where advanced reasoning justifies the expense: high-value automation, complex agent-based software, science, engineering, critical business analysis, or premium products with strong margins.

Gartner frames this in terms of platforms: value will concentrate among those who can orchestrate workloads across a diverse portfolio of models. Routine tasks should be delegated to smaller or domain-specialized models, which can outperform large general models in certain workflows at a fraction of the cost. Meanwhile, high-cost inference from frontier models should be tightly controlled and reserved for complex reasoning and cases where it truly makes a difference.

From a technical perspective, this means that competitive advantage will come not only from access to the best model but also from designing architectures that can decide which model to use, when to use it, and how much context to provide. Prompt optimization, context management, memory compression, model routing, and cost observability will shift from operational details to central elements of product design.
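In code, the core of such an architecture can be as small as a routing policy plus a cost estimate. The sketch below is a minimal illustration; the model names, prices, and the reasoning flag are all assumptions, not any vendor’s actual API:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    usd_per_million_tokens: float   # assumed list price, for illustration

# Hypothetical two-tier portfolio: a cheap specialized model and a frontier model.
SMALL = Model("small-domain-model", 0.20)
FRONTIER = Model("frontier-reasoner", 15.00)

def route(needs_reasoning: bool) -> Model:
    """Routing policy: routine work goes to the cheap tier; the frontier
    model is reserved for tasks where advanced reasoning justifies the cost."""
    return FRONTIER if needs_reasoning else SMALL

def estimated_cost(model: Model, est_tokens: int) -> float:
    return model.usd_per_million_tokens * est_tokens / 1_000_000

# Cost observability in miniature: log the tier and estimated spend per task.
for task, needs_reasoning, tokens in [
    ("summarize support ticket", False, 2_000),
    ("multi-step financial analysis", True, 250_000),
]:
    m = route(needs_reasoning)
    print(f"{task!r} -> {m.name}: ~${estimated_cost(m, tokens):.4f}")
```

A production router would classify tasks automatically and track actual rather than estimated spend, but the design point is the same: the routing decision, not the unit token price, determines the bill.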

The major implications for 2030

Gartner’s forecast doesn’t predict cheap AI in an absolute sense. It predicts a much more efficient AI ecosystem, but one where actual usage will be more intensive, more complex, and more dependent on good systems engineering. This directly impacts hyperscalers, model providers, chip manufacturers, infrastructure startups, and developers of agent-based applications.

For the tech industry, the clear lesson is: the next big battleground won’t just be about training the most powerful model, but about making its widespread use economically sustainable. In this race, hardware matters, but so do inference software, deployment topologies, model specialization, and architectural discipline. Tokens will be cheaper, yes. But the future will favor those who best manage this new relative abundance.

Frequently Asked Questions

What exactly does Gartner say about inference costs in 2030?
Gartner predicts that by 2030, running inference on a 1-trillion-parameter LLM will cost AI providers more than 90% less than in 2025.

Why will costs decrease so much?
Due to improvements in chips, infrastructure, model design, hardware utilization, specialized inference silicon, and increased edge processing.

So, will advanced AI be much cheaper for companies?
Not necessarily. Gartner warns that the reduction in token costs won’t fully pass through to clients, and that agent-based systems will consume many more tokens per task.

How much more can AI agents consume compared to a chatbot?
According to Gartner, between 5 and 30 times more tokens per task than a standard generative chatbot.

Source: AI will be cheaper
