Blackwell Ultra Accelerates: 50 Times More Performance Per Megawatt for the AI Agent Era

Inference, not just training, is becoming the real bottleneck of the new wave of artificial intelligence. The reason is simple: agents and programming assistants are consuming tokens at a rate that forces a rethink of computing economics. According to OpenRouter’s State of AI report, programming-related queries went from approximately 11% of total token volume to over 50% in recent weeks. This shift is not just statistical: it marks a transition from exploratory uses to applied tasks such as debugging, code generation, scripting, and workflows with integrated tools.

In this context, NVIDIA has published new data aimed at quantifying a question that worries both sysadmins and platform teams: how much does it cost to serve AI in real time when every millisecond and every watt counts? The company relies on measurements from the SemiAnalysis InferenceX benchmark to claim that its GB300 NVL72 systems (Blackwell Ultra platform) can deliver up to 50 times more performance per megawatt and, as a result, up to 35 times less cost per token compared to Hopper-generation hardware, especially in the low-latency scenarios typical of “agentic” applications (multi-step, iterative, with continuous interaction).
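The relationship between those two figures can be seen with simple arithmetic. The sketch below is a back-of-the-envelope model of the energy component of cost per token; every number in it is a placeholder assumption, not a benchmark result.

```python
# Back-of-the-envelope: how throughput per megawatt maps to energy cost per token.
# All inputs are hypothetical placeholders, not measured figures.

def energy_cost_per_million_tokens(tokens_per_sec_per_mw: float,
                                   price_usd_per_mwh: float) -> float:
    """Energy-only cost (USD) of producing one million tokens."""
    tokens_per_hour_per_mw = tokens_per_sec_per_mw * 3600
    mwh_per_million_tokens = 1_000_000 / tokens_per_hour_per_mw
    return mwh_per_million_tokens * price_usd_per_mwh

baseline = energy_cost_per_million_tokens(tokens_per_sec_per_mw=200_000,
                                          price_usd_per_mwh=100.0)
improved = energy_cost_per_million_tokens(tokens_per_sec_per_mw=200_000 * 50,
                                          price_usd_per_mwh=100.0)
print(f"baseline: ${baseline:.4f}/M tokens   50x per-MW: ${improved:.4f}/M tokens")
```

The energy term scales inversely with performance per megawatt; cost per token also carries components that do not scale that way (hardware amortization, for instance), which is one plausible reason the per-token and per-megawatt multipliers quoted above differ.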

Why these numbers matter in a data center (and not just in marketing)

In a real-world environment, raw performance is no longer enough. The focus has shifted toward tokens per watt, cost per million tokens, density per rack, and operational complexity. When a platform promises to multiply performance “per megawatt,” the implicit message is that the limit isn’t demand but energy, cooling, and the ability to deploy at scale without operational costs skyrocketing.

For an IT management audience, what matters isn’t just the “up to 50 times” but how the platform gets there: NVIDIA emphasizes its extreme co-design approach (chip + system + software) and highlights that the improvements come not only from hardware but also from continuous optimization of the software stack. They cite advances in tools and libraries like TensorRT-LLM, NVIDIA Dynamo, Mooncake, and SGLang, aimed at improving inference performance for Mixture-of-Experts (MoE) models across various latency targets.

In other words: in the inference war, the winner isn’t the one with the most theoretical FLOPS, but the one that delivers more useful tokens with less power, at latencies that preserve the user experience.

The role of software: from “kernel” to token economy

A key detail from the announcement is that the library improvements are not isolated one-offs. NVIDIA states that updates to TensorRT-LLM have achieved up to 5 times more performance in low-latency workloads on GB200 compared to just four months ago. This points to a reality familiar to many SRE/infrastructure teams: inference performance in production is a mix of runtime, scheduling, kernels, GPU communication, and efficient memory use.

In this vein, the company highlights three technical ingredients that, in practical terms, are relevant for anyone managing AI infrastructure:

  • Higher-performance kernels optimized for efficiency and low latency, to maximize GPU utilization when the goal isn’t “huge batch,” but immediate response.
  • NVLink Symmetric Memory, enabling direct GPU-to-GPU access for better communication and reduced penalties.
  • Programmatic dependent launch, aimed at reducing idle times by launching the next kernel’s setup phase before the current one finishes (illustrated conceptually in the sketch after this list).
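The third ingredient is the easiest to picture without GPU internals. The real mechanism lives at the CUDA driver and kernel level, but the scheduling idea, overlapping the setup of step N+1 with the execution of step N so the hardware never sits idle between them, can be sketched conceptually in Python. Everything below, including the timings and function names, is an invented analogy, not NVIDIA’s implementation.

```python
import asyncio
import time

# Conceptual analogy only: overlap the *setup* of the next stage with the
# *execution* of the current one, instead of serializing setup -> run -> setup -> run.
# asyncio.sleep stands in for work; all durations are invented.

async def setup(stage: str, seconds: float = 0.2) -> None:
    await asyncio.sleep(seconds)          # argument marshalling, configs, launch overhead

async def run(stage: str, seconds: float = 0.5) -> None:
    await asyncio.sleep(seconds)          # the stage's actual execution

async def serialized(stages: list[str]) -> None:
    for s in stages:
        await setup(s)                    # setup only starts after the previous run ends
        await run(s)

async def overlapped(stages: list[str]) -> None:
    running = None
    for s in stages:
        preparing = asyncio.create_task(setup(s))   # setup overlaps the previous run
        if running is not None:
            await running
        await preparing
        running = asyncio.create_task(run(s))
    await running

stages = ["kernel_a", "kernel_b", "kernel_c"]
for mode in (serialized, overlapped):
    t0 = time.perf_counter()
    asyncio.run(mode(stages))
    print(f"{mode.__name__}: {time.perf_counter() - t0:.2f}s")
```

The number to watch is the gap that disappears between back-to-back stages; that gap is where idle GPU time hides in low-latency serving.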

These engineering pieces don’t usually make headlines in mainstream media, but they ultimately determine whether a cluster can support interactive assistants with stable latencies… or never gets beyond the demo stage.

Long context: when the “agent” reads the entire repository

The other battleground is long context. If agents need to reason across entire codebases, the cost in attention and memory skyrockets. NVIDIA states that, in scenarios with 128,000 input tokens and 8,000 output tokens, a very typical profile for programming assistants navigating large repositories, GB300 NVL72 can provide up to 1.5 times lower cost per token compared to GB200 NVL72.
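To make that 128K/8K profile tangible, here is a quick per-request calculation with placeholder prices (the dollar figures are illustrative, not any provider’s rates); it shows why the input side dominates the bill in this kind of workload.

```python
# Hypothetical per-request cost for the long-context profile above
# (128,000 input tokens, 8,000 output tokens). Prices are illustrative, not real quotes.

input_tokens, output_tokens = 128_000, 8_000
price_in_per_m, price_out_per_m = 0.50, 2.00   # USD per million tokens (assumed)

cost_in = input_tokens / 1_000_000 * price_in_per_m
cost_out = output_tokens / 1_000_000 * price_out_per_m
total = cost_in + cost_out

print(f"input: ${cost_in:.4f}  output: ${cost_out:.4f}  total: ${total:.4f} per request")
print(f"input share of the bill: {cost_in / total:.0%}")
print(f"with 1.5x lower cost per token across the board: ${total / 1.5:.4f} per request")
```

With sixteen times more input than output tokens, anything that cheapens prefill and attention has an outsized effect on what a session actually costs.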

This benefits developers directly: according to the company, the Blackwell Ultra platform delivers 1.5 times higher NVFP4 compute performance and twice the attention processing speed, which helps sustain long sessions without the “cost of context” impairing product viability.

Who is deploying this and what it means for operations

NVIDIA claims that cloud and inference providers are already making moves. They cite adoption of Blackwell by inference providers like Baseten, DeepInfra, Fireworks AI, and Together AI, with token-cost reductions of up to 10 times compared with previous generations. For Blackwell Ultra, they state that Microsoft, CoreWeave, and Oracle Cloud Infrastructure are deploying GB300 NVL72 for low-latency, long-context use cases focused on agentic coding and interactive assistants.

For platform teams managing daily operations, this means the conversation shifts from “which GPU to buy” to “which architecture to operate”: integration with serving stacks, latency observability, queues, user limits, capacity planning, and an uncomfortable truth: at equal demand, costs are no longer dictated solely by GPUs, but by the sum of energy + cooling + runtime efficiency.
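In budget terms, that sum can be made concrete with a simple planning estimate. The sketch below is a generic model with placeholder inputs; the throughput, power, PUE, energy price, and amortization figures are assumptions, not numbers from NVIDIA or SemiAnalysis.

```python
# Sketch of an all-in cost-per-million-tokens estimate for capacity/budget planning.
# Every number is a placeholder assumption; plug in your own contract and utilization data.

def cost_per_million_tokens(
    tokens_per_sec: float,        # sustained cluster throughput
    it_power_kw: float,           # IT (GPU + host) power draw
    pue: float,                   # power usage effectiveness (cooling/overhead multiplier)
    energy_usd_per_kwh: float,    # blended energy price
    hw_usd_per_hour: float,       # amortized hardware + maintenance per hour
) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    energy_usd_per_hour = it_power_kw * pue * energy_usd_per_kwh
    usd_per_hour = energy_usd_per_hour + hw_usd_per_hour
    return usd_per_hour / tokens_per_hour * 1_000_000

estimate = cost_per_million_tokens(
    tokens_per_sec=400_000, it_power_kw=1_200, pue=1.3,
    energy_usd_per_kwh=0.10, hw_usd_per_hour=900.0,
)
print(f"~${estimate:.2f} per million tokens under these assumptions")
```

The useful part is not the final number but the shape of the formula: energy and cooling enter through power and PUE, while runtime efficiency enters through sustained tokens per second.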

Next stop: Rubin (and another step down in cost)

In the same announcement, NVIDIA looks ahead to its Rubin platform, promising up to 10 times more performance per megawatt than Blackwell in MoE inference, which would translate to one-tenth the cost per million tokens. They also claim Rubin could train large MoE models with a quarter of the GPUs compared to Blackwell. While ambitious, this aligns with market trends: every generation aims to make AI a cheaper, more ubiquitous, and more “industrialized” service.

Frequently Asked Questions

What does “cost per token” mean for a sysadmin or platform team?
It’s a practical way to translate infrastructure spend into dollars per token: how much it costs to generate or process a given volume of tokens once energy, hardware, cooling, and software efficiency are factored in. It’s useful for comparing platforms and sizing inference budgets.

Why is “tokens per megawatt” becoming a key metric in AI data centers?
Because many deployments are no longer limited by demand but by available power and cooling capacity. Improving performance per megawatt allows supporting more users or agents without drastically increasing energy footprint.
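As a rough illustration of how this metric feeds capacity planning, here is a hypothetical calculation of how many concurrent agent streams fit in a fixed power envelope; the power budget, platform efficiency, and per-stream rate are all invented.

```python
# How many concurrent agent streams fit in a fixed power envelope?
# All inputs are illustrative assumptions.

power_budget_mw = 2.0                   # facility power available for this workload
tokens_per_sec_per_mw = 500_000         # platform efficiency (hypothetical)
tokens_per_sec_per_stream = 60          # one interactive agent stream's average draw

cluster_tokens_per_sec = power_budget_mw * tokens_per_sec_per_mw
concurrent_streams = cluster_tokens_per_sec / tokens_per_sec_per_stream

print(f"{cluster_tokens_per_sec:,.0f} tokens/s -> ~{concurrent_streams:,.0f} concurrent streams")
# Doubling performance per megawatt doubles the streams at the same power footprint:
print(f"at 2x per-MW efficiency: ~{concurrent_streams * 2:,.0f} streams")
```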

When does “long context” matter more than low latency?
When the assistant needs to understand large volumes of information (repositories, extensive documentation, incident history). In such scenarios, attention and memory costs can dominate total expense, and the platform that manages them best usually wins in overall response cost.

What should be monitored in production if deploying agentic assistants?
Besides latency metrics like p95/p99, it’s wise to watch queues, tokens per second per user, retry ratios, time spent per phase (retrieval, tool calls, generation), and to correlate all of it with energy consumption and GPU-to-GPU interconnect saturation.
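A minimal sketch of that kind of telemetry crunching, assuming a hypothetical request-record schema (adapt the field names to whatever your serving stack actually logs):

```python
# Deriving p95/p99 latency and per-user tokens/s from request records.
# The record schema and sample data are hypothetical.
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestRecord:
    user: str
    latency_s: float       # end-to-end latency
    output_tokens: int
    duration_s: float      # wall-clock generation time

def latency_percentiles(records):
    lat = sorted(r.latency_s for r in records)
    qs = quantiles(lat, n=100)          # qs[94] ~ p95, qs[98] ~ p99
    return {"p95": qs[94], "p99": qs[98]}

def tokens_per_second_by_user(records):
    agg = defaultdict(lambda: [0, 0.0])
    for r in records:
        agg[r.user][0] += r.output_tokens
        agg[r.user][1] += r.duration_s
    return {u: tok / dur for u, (tok, dur) in agg.items() if dur > 0}

records = [   # toy data for demonstration
    RequestRecord("alice", 1.2, 450, 1.1),
    RequestRecord("alice", 2.8, 900, 2.6),
    RequestRecord("bob",   0.9, 300, 0.8),
]
print(latency_percentiles(records))
print(tokens_per_second_by_user(records))
```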

via: NVIDIA blogs
