NVIDIA Dominates MLPerf Inference v6.0 and Accelerates the AI Race

NVIDIA has once again turned MLPerf benchmarks into a showcase of strength. In the new round of MLPerf Inference v6.0, the company claims to have achieved the best results across the greatest number of tests and scenarios, leveraging its Blackwell Ultra platform, the GB300 NVL72 system, and a highly optimized combination of hardware, interconnects, and software. The announcement lands at a moment when inference is no longer measured solely in teraflops or chip specifications but in something far more directly tied to the business: how many tokens an infrastructure can produce, and at what cost.

This round’s significance goes beyond the usual “performance record” headline. MLCommons, the consortium responsible for MLPerf, has presented v6.0 as the most important update to the inference benchmark to date, with five of its eleven data center tests newly added or updated and a suite more representative of real-world AI deployments. Among the new features are a benchmark based on GPT-OSS-120B, an expansion of DeepSeek-R1 with an interactive scenario, a new DLRMv3 recommendation test, the suite’s first text-to-video test, and a vision-language model benchmark.

NVIDIA asserts that it was the only platform to deliver results across all of these new models and scenarios, achieving the highest processing rates in each. Its technical blog details impressive figures: 2,494,310 tokens per second on DeepSeek-R1 in the offline scenario and 1,555,110 tokens per second in the server scenario for the same model; 1,046,150 tokens per second on GPT-OSS-120B offline and 1,096,770 tokens per second in server; 79 samples per second on Qwen3-VL; and 104,637 samples per second on DLRMv3. (In MLPerf terms, offline measures maximum throughput over batched queries, server measures throughput under a latency constraint, and single stream measures the latency of one query at a time.) For WAN 2.2, the text-to-video model, the headline metric is single-stream latency: 21 seconds per request.

However, it’s important to introduce a key caveat. MLPerf is not an exact simulation of all production loads, but a standardized and auditable benchmark designed to compare platforms under defined conditions. Its value lies precisely in this reproducibility, but that doesn’t mean each figure directly translates to the behavior of a specific commercial application, a real API service, or an environment with mixed models, users, and operational limitations. MLCommons itself emphasizes that these results provide a rigorous basis for system comparison, not an automatic prediction of universal performance.

Blackwell Ultra is not just winning because of hardware

One of the most interesting aspects of NVIDIA’s announcement is not the chip itself but the software. The company claims that the same GB300 NVL72 system, introduced just six months ago, has significantly improved in several tests thanks to optimizations in TensorRT-LLM and the Dynamo distributed serving framework. According to NVIDIA, per-GPU performance on DeepSeek-R1 in the server scenario rose from 2,907 tokens per second in MLPerf v5.1 to 8,064 tokens per second in v6.0, a 2.77× improvement. Over the same period, Llama 3.1 405B also saw a 52% boost in server performance on the same infrastructure.
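As a quick sanity check on that claim, a few lines of Python reproduce the speedup from the per-GPU figures NVIDIA published; the numbers come straight from the blog post, and the script is nothing more than the arithmetic:

```python
# Per-GPU DeepSeek-R1 server-scenario throughput reported by NVIDIA
# for the same GB300 NVL72 hardware across two MLPerf rounds.
v5_1_tok_per_gpu = 2_907  # tokens/s/GPU in MLPerf Inference v5.1
v6_0_tok_per_gpu = 8_064  # tokens/s/GPU in MLPerf Inference v6.0

speedup = v6_0_tok_per_gpu / v5_1_tok_per_gpu
print(f"Software-driven speedup on identical hardware: {speedup:.2f}x")
# -> 2.77x
```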

This message is significant because it reinforces NVIDIA’s strategic narrative: the competitive advantage is no longer solely in selling GPUs but in controlling a complete inference stack. The company attributes these improvements to faster kernels, kernel fusion, improved Attention Data Parallel balancing, disaggregated serving, Wide Expert Parallel, Multi-Token Prediction, and KV-aware routing. In other words, the race is no longer just about silicon but about a finely integrated mix of model, runtime, memory, networking, and serving techniques.
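NVIDIA’s post doesn’t detail the internals of these techniques, but the idea behind one of them, KV-aware routing, is straightforward to sketch: send each incoming request to the worker whose KV cache already holds the longest matching prompt prefix, so fewer tokens have to be recomputed during prefill. The following is a minimal, hypothetical Python sketch of that policy; the Worker class, token tuples, and load-based tie-breaking are illustrative assumptions, not Dynamo’s actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    """Hypothetical inference worker that tracks which prompt
    prefixes are already resident in its KV cache."""
    name: str
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)
    load: int = 0  # in-flight requests, used only for tie-breaking

def longest_cached_prefix(worker: Worker, tokens: tuple[int, ...]) -> int:
    """Length of the longest cached prefix matching the new request."""
    best = 0
    for prefix in worker.cached_prefixes:
        n = 0
        for cached_tok, new_tok in zip(prefix, tokens):
            if cached_tok != new_tok:
                break
            n += 1
        best = max(best, n)
    return best

def route(workers: list[Worker], tokens: tuple[int, ...]) -> Worker:
    """KV-aware routing: prefer maximal KV-cache reuse, then low load."""
    return max(workers, key=lambda w: (longest_cached_prefix(w, tokens), -w.load))

# A follow-up turn in a conversation goes to the worker that already
# cached the conversation's prefix, instead of being assigned round-robin.
a = Worker("A")
b = Worker("B", cached_prefixes=[(1, 2, 3, 4)])
print(route([a, b], (1, 2, 3, 4, 5)).name)  # -> B
```

Production routers typically work over hashed KV blocks and weigh reuse against load and memory pressure, but the objective, maximizing prefix reuse, is the same.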

Additionally, NVIDIA emphasizes the role of its ecosystem. This round included, according to the company, 14 partners submitting results on its platform, more than on any other platform in this edition. Among them are ASUS, Cisco, CoreWeave, Dell, Google Cloud, HPE, Lenovo, Nebius, QCT, Red Hat, and Supermicro. This detail is notable: it indicates that a large part of the market still sees NVIDIA as the safest environment for building and tuning large-scale AI infrastructure.

Inference is now measured at factory scale

Another notable detail in MLPerf v6.0 is the growth of multi-node systems. MLCommons reports that this edition set a new record for large-scale submissions, up 30% over v5.1. Furthermore, 10% of all submitted systems exceeded ten nodes, compared with 2% in the previous round. The largest system in the set used 72 nodes and 288 accelerators, quadrupling the size of the largest system in the prior version.

NVIDIA fits perfectly into this trend. For DeepSeek-R1, results were presented with four GB300 NVL72 systems connected via Quantum-X800 InfiniBand, reaching those 2.49 million tokens per second offline and 1.55 million in server mode. The practical message: the company aims for the market to move beyond thinking about single GPUs and toward AI factories, complete infrastructures where the value isn’t just in processors but in the ability to produce profitable inference at scale.
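A back-of-the-envelope division puts those aggregates in per-GPU terms. This is a calculation on the published figures, not an official NVIDIA metric, and per-GPU rates from a four-rack run aren’t directly comparable to single-system numbers:

```python
# Four GB300 NVL72 racks, 72 Blackwell Ultra GPUs each,
# linked over Quantum-X800 InfiniBand.
total_gpus = 4 * 72  # 288 accelerators

offline_tok_s = 2_494_310  # DeepSeek-R1 offline aggregate
server_tok_s = 1_555_110   # DeepSeek-R1 server aggregate

print(f"{offline_tok_s / total_gpus:,.0f} tokens/s/GPU offline")  # ~8,661
print(f"{server_tok_s / total_gpus:,.0f} tokens/s/GPU server")    # ~5,400
```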

This vision also explains why NVIDIA heavily promotes various models within the same suite: advanced reasoning, vision-language, generative recommendation, and video. The company wants to demonstrate that Blackwell Ultra isn’t limited to pure LLMs but serves as a versatile platform for different types of inference. And that has an obvious commercial implication: if one infrastructure can handle more workloads and client profiles, its potential amortization improves.

A significant victory, but not definitive

The overall picture clearly favors NVIDIA. The company claims to have accumulated 291 victories in MLPerf training and inference benchmarks since 2018, about nine times more than all other participants combined. But perhaps the most interesting detail isn’t that; it’s the competitive context. MLPerf v6.0 drew submissions from 24 organizations, including AMD, Intel, Oracle, Google, Dell, Lenovo, HPE, Supermicro, and others in the ecosystem. The competition is real and keeps measuring itself on the same ground.

Nevertheless, NVIDIA emerges strengthened from this round for two reasons. First, it maintains its leadership in the industry’s most influential benchmark. Second, it successfully links that leadership to a very clear story for investors, hyperscalers, and data center operators: it doesn’t just sell accelerators but offers a complete platform optimized to produce tokens, reduce inference costs, and improve through software even on the same hardware. In today’s AI economy, that argument weighs almost as much as the raw benchmark numbers.

Frequently Asked Questions

What is MLPerf Inference v6.0 and why does it matter so much?
It’s the latest edition of MLCommons’ inference benchmark, a standardized and reproducible suite that compares the performance of AI systems on representative workloads. It matters because it has become an industry reference for measuring inference platforms under comparable conditions.

What exactly did NVIDIA achieve in this edition?
NVIDIA claims to have been the only platform to present results across all the new benchmarks and scenarios added in v6.0, achieving the best performance in each with its Blackwell Ultra and GB300 NVL72 systems.

Does this mean NVIDIA is automatically the best choice for any AI deployment?
Not necessarily. MLPerf provides a very valuable comparison, but it is no substitute for a real evaluation of cost, software, availability, power consumption, integration, and specific organizational needs.

What technical highlight stands out most in this round?
Probably the combination of more realistic new benchmarks with the performance improvements NVIDIA achieved on the same hardware through software like TensorRT-LLM and Dynamo—underscoring that inference now depends as much on the software stack as on the chip itself.

Source: developer.nvidia
