CoreWeave demonstrates a 6.5x GPU performance leap with the NVIDIA GB300 NVL72 over the H100 on the DeepSeek R1 reasoning model

The race for next-generation artificial intelligence isn’t just about models; it’s also about the infrastructure that makes them possible. With the advent of reasoning models—capable of performing complex multi-step tasks, analyzing data, and acting as autonomous agents—the bottleneck has shifted from accuracy to latency and inference performance.

CoreWeave announced that its new accelerated instances with NVIDIA GB300 NVL72, based on Blackwell Ultra GPUs, achieved a 6.5-fold improvement in inference performance on the DeepSeek R1 model compared to an H100 GPU cluster.


The shift from basic generative models to reasoning models like DeepSeek R1 marks a qualitative leap: predicting the next word no longer suffices; instead, processes like chain-of-thought, which involve multiple iterations and heavier computations, are required.

The challenge: these models are extremely sensitive to latency. A delay in inference can render them useless in real-time applications such as coding copilots, financial agents, or scientific assistants.
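To make the latency point concrete, here is a rough back-of-envelope estimate (a sketch with illustrative numbers only; the token counts and 20 ms/token decode speed are assumptions, not figures from CoreWeave's benchmark) showing how the long chain-of-thought traces these models emit multiply response time:

```python
# Illustrative back-of-envelope estimate: reasoning models emit long
# chain-of-thought traces before the final answer, so per-token decode
# latency dominates end-to-end response time.
# All numbers below are assumptions for illustration, not benchmark results.
def response_time_s(output_tokens: int, ms_per_token: float) -> float:
    """End-to-end generation time for a purely sequential decode."""
    return output_tokens * ms_per_token / 1000.0

plain_answer = response_time_s(output_tokens=200, ms_per_token=20)      # ~4 s
reasoning_trace = response_time_s(output_tokens=4000, ms_per_token=20)  # ~80 s

print(f"plain generative reply: {plain_answer:.0f} s")
print(f"with chain-of-thought:  {reasoning_trace:.0f} s")
```

The same prompt that feels instant as a plain completion can take over a minute once thousands of intermediate reasoning tokens are generated, which is why per-token throughput is the metric that matters here.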


CoreWeave compared two configurations:

  • 16 NVIDIA H100 GPUs running the model with 16-way tensor parallelism (TP16).
  • 4 NVIDIA GB300 GPUs on the NVL72 infrastructure, using 4-way tensor parallelism (TP4), thanks to increased memory and bandwidth.

The result: with only a quarter of the GPUs, the GB300 setup delivered 6.5 times the tokens-per-second throughput while significantly reducing inter-GPU communication overhead.
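As a loose illustration of what changing the parallelism degree looks like in practice, here is a minimal sketch using the open-source vLLM engine (an assumption; the article does not say which serving stack CoreWeave used, and the sampling settings are placeholders):

```python
# Minimal sketch with the open-source vLLM engine (the serving stack used in
# CoreWeave's benchmark is not stated; this is only an illustration).
# The only change between the two configurations is the tensor-parallel degree.
from vllm import LLM, SamplingParams

# H100-style setup: weights sharded across 16 GPUs (TP16).
# llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=16)

# GB300-style setup: larger per-GPU memory lets the same model fit on 4 GPUs
# (TP4), so each token's all-reduce involves fewer partners and less
# inter-GPU communication. Additional engine flags may be required in practice.
llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=4)

params = SamplingParams(max_tokens=512, temperature=0.6)
outputs = llm.generate(["Explain step by step why the sky is blue."], params)
print(outputs[0].outputs[0].text)
```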

For clients, this translates into faster token generation, lower latency, and more efficient resource utilization.


The performance leap results from a radical redesign of the architecture:

  • Massive memory: roughly 37–40 TB of total memory in a single rack-scale system (GPU HBM3e plus Grace CPU memory), enabling trillion-parameter models to be deployed without memory fragmentation or sharding penalties.
  • Ultra-fast interconnects: fifth-generation NVLink provides 130 TB/s of aggregate bandwidth across the 72 interconnected Blackwell Ultra GPUs, reducing reliance on traditional PCIe.
  • End-to-end optimized network: NVIDIA Quantum-X800 InfiniBand keeps data flowing efficiently across the cluster, eliminating bottlenecks common in general-purpose cloud setups.
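A rough sanity check shows why the larger memory changes the parallelism math. This sketch uses approximate public figures (about 80 GB of HBM per H100, about 288 GB of HBM3e per Blackwell Ultra GPU, and DeepSeek R1's roughly 671 billion parameters at FP8, i.e. about one byte per parameter); exact capacities vary by SKU and serving configuration:

```python
# Back-of-envelope memory math with approximate public figures (not exact specs):
# why DeepSeek R1 typically needs ~16 H100s but can fit on 4 Blackwell Ultra GPUs.
PARAMS_B = 671            # DeepSeek R1 parameter count, in billions
BYTES_PER_PARAM = 1.0     # FP8 weights ~ 1 byte per parameter
weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~671 GB of weights alone

h100_hbm_gb = 80          # per-GPU HBM on H100
gb300_hbm_gb = 288        # per-GPU HBM3e on Blackwell Ultra (GB300)

print(f"H100 x16: {16 * h100_hbm_gb} GB of HBM for ~{weights_gb:.0f} GB of weights plus KV cache")
print(f"GB300 x4: {4 * gb300_hbm_gb} GB of HBM for ~{weights_gb:.0f} GB of weights plus KV cache")

# NVLink sanity check: 130 TB/s aggregate across 72 GPUs ~ 1.8 TB/s per GPU.
print(f"Per-GPU NVLink bandwidth: ~{130 / 72 * 1000:.0f} GB/s")
```

Both configurations have room for the FP8 weights, but the GB300 setup reaches that capacity with a quarter of the GPUs, which is exactly what allows the drop from TP16 to TP4.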

CoreWeave’s advantage isn’t just hardware. They’ve built a cloud AI stack that maximizes the potential of the GB300 NVL72:

  • Rack LifeCycle Controller: automates verification, firmware updates, and system imaging to keep racks stable.
  • Integration with Kubernetes (CKS) and Slurm on Kubernetes (SUNK), with topology-aware scheduling that keeps jobs within the same NVLink domain to maximize performance.
  • Advanced monitoring with Grafana dashboards providing real-time visibility into GPU utilization, NVLink traffic, and rack availability.
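As a loose illustration of what topology awareness can look like from the Kubernetes API side, here is a sketch using the official Python client; the `example.com/nvlink-domain` node label is a hypothetical placeholder, not CoreWeave's actual labeling scheme:

```python
# Sketch: group cluster nodes by a hypothetical NVLink-domain label so a job
# can be pinned inside a single NVL72 rack. The label key is an assumption,
# not CoreWeave's actual scheme.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()   # use load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

domains = defaultdict(list)
for node in v1.list_node().items:
    labels = node.metadata.labels or {}
    domains[labels.get("example.com/nvlink-domain", "unlabeled")].append(node.metadata.name)

for domain, nodes in sorted(domains.items()):
    print(f"NVLink domain {domain}: {len(nodes)} node(s)")
```

A topology-aware scheduler uses this kind of grouping to place all ranks of a job on nodes within one NVLink domain, so tensor-parallel traffic never has to leave the rack.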

The efficiency gains achieved by CoreWeave aren’t just technical milestones—they’re a paradigm shift for businesses:

  • Accelerate innovation: train larger models faster.
  • Reduce total cost of ownership (TCO): higher per-GPU performance and less communication overhead mean fewer GPUs for the same workload.
  • Deploy with confidence: a cloud optimized expressly for AI workloads, with enterprise-grade resilience and reliability.

The NVIDIA GB300 NVL72, deployed at scale by CoreWeave, demonstrates that reasoning models are no longer laboratory experiments but operational realities. The combination of greater memory, extreme bandwidth, and a purpose-built cloud environment lets next-generation models run in real time, at lower cost and with greater scalability than before.

As the industry moves toward trillion-parameter models, this benchmark suggests that the future of large-scale AI hinges on architectures like the GB300 NVL72, where hardware and software work seamlessly together.


Frequently Asked Questions (FAQ):

  • What distinguishes reasoning models from generative ones?
    Reasoning models not only generate text but also perform multi-step processes (chain-of-thought), analyze data, and act as autonomous agents.

  • What key advantage does GB300 have over H100?
    The ability to use fewer GPUs due to larger memory and bandwidth, reducing communication overhead and increasing throughput.

  • What does this mean for companies practically?
    Lower inference latency, greater scalability, and a better cost-performance ratio for critical AI workloads.

  • Why choose CoreWeave over a generic cloud?
    Because its infrastructure is specifically designed for AI: optimized racks, topology-aware scheduling for NVLink, and advanced monitoring that maximize performance.
