NVIDIA Breaks World Record with Llama 4 Maverick: Over 1,000 Tokens Per Second Per User Thanks to Blackwell

The company achieves unprecedented LLM inference speed by combining hardware and software optimization with advanced speculative decoding techniques.

NVIDIA has set a new milestone in the performance of large language models (LLMs). A single NVIDIA DGX B200 node, equipped with eight Blackwell GPUs, has reached a speed of over 1,000 tokens per second per user with Llama 4 Maverick, the largest model in the Llama 4 collection with 400 billion total parameters. The benchmark was independently verified by the benchmarking service Artificial Analysis.

This achievement makes Blackwell the optimal hardware platform for running Llama 4, whether the goal is to maximize throughput per server or to minimize latency in single-user scenarios. In its maximum-throughput configuration, the same node reaches 72,000 tokens per second per server.

Total Optimization: From CUDA to TensorRT-LLM

The success is attributable to a combination of architectural innovations and deep software-level optimization. NVIDIA used the TensorRT-LLM framework to tune every stage of inference, implementing CUDA kernel optimizations for critical operations such as GEMMs, mixture-of-experts (MoE) routing, and attention.

Noteworthy are the kernel fusions (such as merging AllReduce with RMSNorm) and the use of Programmatic Dependent Launch (PDL), a CUDA feature that lets consecutive kernels overlap their execution, eliminating idle time between launches and improving hardware utilization. In addition, key operations were run in FP8, a format that Blackwell's Tensor Cores execute at lower computational cost while preserving accuracy.
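To make the FP8 idea concrete, here is a minimal PyTorch sketch of per-tensor FP8 (E4M3) quantization around a GEMM. The function names and scaling scheme are illustrative assumptions, not NVIDIA's TensorRT-LLM code, which runs the matrix multiply natively on Blackwell's FP8 Tensor Cores rather than upcasting as done here for reference.

```python
import torch

# Hypothetical sketch: per-tensor FP8 (E4M3) quantization of GEMM inputs.
# Illustrative only; TensorRT-LLM's FP8 path feeds the Tensor Cores directly.

def quantize_fp8(x: torch.Tensor):
    """Scale a tensor into the E4M3 representable range and cast to FP8."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max        # ~448 for E4M3
    scale = x.abs().amax().clamp(min=1e-12) / fp8_max      # per-tensor scale
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_gemm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Quantize both operands to FP8, multiply, and rescale the result."""
    a_q, a_s = quantize_fp8(a)
    b_q, b_s = quantize_fp8(b)
    # Upcast for the reference matmul; real FP8 Tensor Cores consume FP8 directly.
    return (a_q.float() @ b_q.float()) * (a_s * b_s)

a = torch.randn(128, 256)
b = torch.randn(256, 64)
err = (fp8_gemm(a, b) - a @ b).abs().mean()
print(f"mean abs error vs. FP32 GEMM: {err.item():.4f}")
```

The key trade-off is that each tensor carries only a single scale factor, so the 8-bit format keeps enough dynamic range for GEMM inputs while halving memory traffic relative to FP16.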

Speculative Decoding: Speed Without Sacrificing Quality

One of the key factors was a custom speculative decoding scheme based on the EAGLE-3 architecture. In this technique, a small, fast draft model proposes several tokens ahead, which the main model then verifies in parallel, multiplying effective inference speed.

In this case, the best balance was found with draft sequences of three tokens, yielding a speedup of more than 2x without compromising output quality. The draft model itself was optimized with torch.compile(), cutting its overhead from 25% to 18%.
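As a rough illustration of the draft-and-verify loop, here is a minimal greedy sketch in PyTorch. It is not NVIDIA's EAGLE-3 implementation (which drafts from hidden features and verifies token trees); draft_model and target_model are hypothetical callables assumed to return per-position logits of shape [batch, seq_len, vocab].

```python
import torch

DRAFT_LEN = 3  # the draft length reported as the sweet spot in this work

@torch.no_grad()
def speculative_step(tokens, draft_model, target_model):
    """One decoding step: propose DRAFT_LEN tokens cheaply, verify them in a
    single pass of the large model, and keep the longest matching prefix."""
    # 1) The small draft model proposes DRAFT_LEN tokens autoregressively.
    draft = tokens
    for _ in range(DRAFT_LEN):
        next_tok = draft_model(draft)[:, -1].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    # 2) The large target model scores the entire draft in one forward pass,
    #    yielding its own greedy prediction at every position.
    preds = target_model(draft).argmax(dim=-1)

    # 3) Accept drafted tokens while they match the target's predictions;
    #    the first mismatch is replaced by the target's own token.
    accepted = tokens
    for i in range(DRAFT_LEN):
        pos = tokens.shape[-1] + i                 # index of the i-th drafted token
        target_tok = preds[:, pos - 1 : pos]       # what the target would emit there
        accepted = torch.cat([accepted, target_tok], dim=-1)
        if not torch.equal(target_tok, draft[:, pos : pos + 1]):
            break                                  # mismatch: stop after the correction
    return accepted
```

In this sketch, each call costs one large-model forward pass plus three cheap draft passes and advances the sequence by one to three tokens, which is where the reported speedup comes from when most drafted tokens are accepted.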

Real Impact: Towards Faster and More Useful AI

Reducing latency is crucial for real-time generative AI applications such as virtual assistants, software copilots, and autonomous agents. With these improvements, NVIDIA shows that a smooth, responsive experience is possible even with massive models.

This performance is more than a technical advance: it lays the groundwork for the next generation of AI agents capable of instant, effective interaction with humans, from conversational interfaces to complex cloud simulations.

Conclusion

With this achievement, NVIDIA not only reinforces its leadership in AI infrastructure but also opens the way to a new era of extreme AI performance, in which the combination of specialized hardware like Blackwell, advanced inference techniques, and low-level optimization enables the deployment of ever more powerful models in critical, high-demand scenarios.

via: NVIDIA Technical Blog
