NVIDIA Rubin: Six Chips, a “Supercomputer,” and the Race to Lower Token Costs in the Reasoning AI Era

NVIDIA took center stage at CES in Las Vegas to introduce its next major AI computing platform. It’s called Rubin, in honor of astronomer Vera Florence Cooper Rubin, and it comes with a message the company emphasizes: demand for compute in training and inference is “skyrocketing,” driving up the cost of deploying advanced models. NVIDIA’s response is a unified architecture: six new chips working together as a single AI supercomputer.

The Rubin platform is based on a concept the company calls “extreme co-design”: CPU, GPU, network, security, operations, and storage evolve together so the system doesn’t bottleneck at the usual points. In practice, Rubin combines the NVIDIA Vera CPU, NVIDIA Rubin GPU, NVLink 6 switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet switch. The goal is clear: drastically reduce training times and, above all, the cost per token in inference, especially as models grow longer, more multilingual, and more “agentic”.

From “GPU-centricity” to AI factory

For years, the public discourse on AI infrastructure has revolved around a single word: GPU. Rubin shifts that focus toward a more industrial approach: the “AI factory”, where accelerators are important but so are the network fabric, data security, system resilience, and energy efficiency.

Jensen Huang, NVIDIA’s founder and CEO, framed the announcement in this context: Rubin arrives “at the right time,” with a yearly cadence of new “AI supercomputers,” and with integrated chips aiming to push toward “the next frontier” in the industry.

The company’s benchmark is ambitious: up to 10 times lower cost per token in inference compared to Blackwell, and the ability to train Mixture-of-Experts (MoE) models with four times fewer GPUs than previous generations, according to its figures. The approach targets workloads typical of research labs and large platforms: multi-step reasoning, longer memory, autonomous agents, and large-scale video generation.

Two formats for different profiles: NVL72 and HGX NVL8

Rubin is not presented as a standalone piece but as complete systems. NVIDIA highlights two principal configurations:

  • NVIDIA Vera Rubin NVL72, a “rack-scale” solution integrating 72 Rubin GPUs, 36 Vera CPUs, NVLink 6, ConnectX-9, BlueField-4, and Spectrum-6.
  • NVIDIA HGX Rubin NVL8, a server platform linking 8 Rubin GPUs via NVLink, designed for generative AI based on x86 architecture, plus high-performance computing and scientific workloads.

The implicit message is that not every organization will adopt Rubin at the same scale or degree of integration. For some, an 8-GPU board might be the next step; for others, operating entire racks as unified memory and compute units will be the goal.

Five innovations for a less “breakable,” less expensive AI

NVIDIA asserts Rubin introduces five key advancements addressing scaling challenges: GPU communication, efficiency, security, maintenance, and stable performance in production.

  1. Sixth-generation NVLink: Each GPU provides 3.6 TB/s bandwidth, and the NVL72 rack reaches a combined 260 TB/s. The company describes this as more bandwidth than “all of the Internet,” to illustrate magnitude. Additionally, the NVLink 6 switch includes compute “within the network” to accelerate collective operations, along with improved service and resilience features.
  2. NVIDIA Vera CPU: designed for agentic reasoning and energy efficiency, with 88 customized Olympus cores, full compatibility with Armv9.2, and NVLink-C2C connectivity for high-bandwidth links between CPU and GPU.
  3. NVIDIA Rubin GPU: features a Third-generation Transformer Engine with hardware-accelerated adaptive compression and delivers 50 petaflops of NVFP4 compute for inference, according to NVIDIA.
  4. Third-generation Confidential Computing: the NVL72 is the first “rack-scale” deployment extending data and workload security via domains spanning CPU, GPU, and NVLink—a design for protecting proprietary models and sensitive operations.
  5. Second-generation RAS Engine: supports real-time health checks, fault tolerance, and proactive maintenance. NVIDIA also highlights a modular “cable-less” tray design to speed up assembly and servicing compared to Blackwell.
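The aggregate NVLink figure follows directly from the per-GPU number. As a back-of-the-envelope check (the rounding to 260 TB/s is NVIDIA’s):

```python
# Sanity check of the NVLink 6 bandwidth figures quoted above.
per_gpu_tb_s = 3.6      # NVLink 6 bandwidth per Rubin GPU (TB/s)
gpus_per_rack = 72      # Rubin GPUs in a Vera Rubin NVL72 rack

aggregate = per_gpu_tb_s * gpus_per_rack
print(f"{aggregate:.1f} TB/s")  # 259.2 TB/s, which NVIDIA rounds to ~260 TB/s
```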

The overarching focus is clear: when working with massive clusters, the problem often isn’t just FLOPS; it’s system-wide stability—operational pace, deployment, maintenance, and security.

The “context” as new bottleneck: native inference storage

One of Rubin’s most revealing announcements is in storage—not the GPU, but the Inference Context Memory Storage Platform. This new infrastructure category is designed to scale something critical with modern models: the inference context.

In reasoning models and agents, the conversation isn’t just a single prompt. It involves multiple turns, long sessions, concurrent users, and chained tasks. In this scenario, the key-value cache gains importance, allowing reuse of intermediate states to avoid recomputation.

NVIDIA claims this platform—driven by BlueField-4 as a “storage processor”—enables sharing and reusing that cache across infrastructures, boosting responsiveness and performance, and supporting more predictable, energy-efficient scalability for agentic AI.
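NVIDIA has not published an API for this platform, but the underlying idea—reusing a key-value cache for a shared prompt prefix instead of recomputing it on every turn—can be sketched in a few lines. Everything below (the `KVCache` class, hashing prompt prefixes, the string standing in for attention state) is a hypothetical illustration, not NVIDIA software:

```python
import hashlib

# Toy illustration of key-value cache reuse across conversation turns.
# In a real transformer, the cached value would be per-layer attention
# keys/values; here a string stands in for that intermediate state.
class KVCache:
    def __init__(self):
        self.store = {}          # prefix hash -> precomputed "state"
        self.recomputed = 0      # how many prefixes required full compute

    def _key(self, prefix):
        return hashlib.sha256(prefix.encode()).hexdigest()

    def get_state(self, prefix):
        k = self._key(prefix)
        if k not in self.store:              # cache miss: compute once
            self.recomputed += 1
            self.store[k] = f"state({len(prefix.split())} tokens)"
        return self.store[k]                 # cache hit: reuse, no recompute

cache = KVCache()
history = "system prompt plus turn one"
cache.get_state(history)                     # first turn: state is computed
cache.get_state(history)                     # later turn: state is reused
print(cache.recomputed)                      # 1 — the prefix was computed once
```

Sharing such a cache across machines, as the platform promises, adds the hard parts—placement, eviction, and secure transport—which is where BlueField-4 comes in.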

The DPU plays a dual role here: apart from data movement, BlueField-4 introduces ASTRA (Advanced Secure Trusted Resource Architecture), a system-level “trust” approach providing operators with centralized provisioning, isolation, and management of multi-tenant and bare-metal environments without compromising performance. This directly addresses the needs of a market blending public clouds, “neoclouds,” and enterprise platforms with increasingly fragmented deployment models.

Photonics Ethernet and 800 Gb/s: network as catalyst, not bottleneck

Rubin also emphasizes the network as vital for supporting “east-west” (server-to-server) AI traffic. The Spectrum-6 Ethernet switch is presented as next-generation connectivity for AI factories, with 200G SerDes, co-packaged optics, and fabric designs optimized for AI.

Built on this foundation, NVIDIA introduces Spectrum-X Ethernet Photonics with co-packaged optics: claiming up to 10 times higher reliability, up to 5 times longer uptime, and up to 5 times better energy efficiency compared to traditional approaches—maximizing performance per watt.

The goal isn’t just speed: technologies like Spectrum-XGS aim to turn facilities separated by hundreds of kilometers into a single logical environment, enabling distributed data centers to operate as one unified AI factory.

Additionally, NVIDIA describes an end-to-end connectivity suite delivering 800 Gb/s through two pathways: Quantum-X800 InfiniBand (for dedicated clusters with minimal latency) and Spectrum-X Ethernet (for standard Ethernet protocols optimized for AI). In InfiniBand, features like SHARP v4 and adaptive routing are used to offload collective operations to the network fabric.

DGX SuperPOD: the blueprint for scaling Rubin

To operationalize Rubin as a reference design, NVIDIA highlights DGX SuperPOD as a deployment “blueprint.” The configuration based on DGX Vera Rubin NVL72 integrates eight NVL72 systems to deliver 576 Rubin GPUs, with a declared performance of 28.8 exaflops FP4 and 600 TB of fast memory. Each NVL72 includes 36 Vera CPUs, 72 Rubin GPUs, and 18 BlueField-4 DPUs, forming a coherent engine that reduces the need to split models.
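The SuperPOD figures are internally consistent with the per-GPU number quoted earlier (50 petaflops of NVFP4 per Rubin GPU). A quick cross-check:

```python
# Cross-check of NVIDIA's DGX SuperPOD figures against the per-GPU spec.
racks = 8                            # NVL72 systems per SuperPOD
gpus_per_rack = 72                   # Rubin GPUs per NVL72
nvfp4_pflops_per_gpu = 50            # per-Rubin-GPU NVFP4 figure quoted above

total_gpus = racks * gpus_per_rack               # 576 GPUs
total_eflops = total_gpus * nvfp4_pflops_per_gpu / 1000
print(total_gpus, total_eflops)      # 576 28.8
```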

There’s also a variation with DGX Rubin NVL8: a system of 64 units with 512 GPUs, designed to ease Rubin adoption with liquid cooling and x86 CPUs. NVIDIA claims each DGX Rubin NVL8 offers 5.5 times more NVFP4 FLOPS than equivalent Blackwell systems.

In operation, NVIDIA Mission Control acts as the orchestration and management layer, automating deployment, coordinating power and cooling events, enhancing resilience, and speeding up responses—featuring rapid leak detection and autonomous recovery.

Ecosystem and timeline: second half of 2026

Rubin isn’t positioned as a standalone leap but as part of a broader adoption plan. NVIDIA expects extensive industry uptake—from cloud providers and AI labs to server manufacturers. In the cloud, deployments based on Rubin are planned for the second half of 2026 by major players like AWS, Google Cloud, Microsoft, and Oracle Cloud Infrastructure, along with cloud partners such as CoreWeave, Lambda, Nebius, and Nscale.

Microsoft appears as a strategic partner: its upcoming Fairwater AI superfactories will include Vera Rubin NVL72 systems and aim to scale to hundreds of thousands of “superchips”, according to NVIDIA. CoreWeave is among the early adopters, offering Rubin deployments via Mission Control to facilitate side-by-side architectures without disrupting existing workloads.

The collaboration extends into enterprise software: Red Hat announced an expanded partnership with NVIDIA to provide an optimized Rubin stack with Red Hat Enterprise Linux, Red Hat OpenShift, and Red Hat AI, targeting corporate markets seeking to industrialize AI projects beyond research labs.


Frequently Asked Questions

What is NVIDIA Rubin, and why is it described as “six chips, one AI supercomputer”?
Rubin is a rack-scale platform that integrates CPU, GPU, NVLink interconnect, network, DPU, and SuperNIC as a unified system designed to reduce inference costs and accelerate large-scale training.

What’s the difference between Vera Rubin NVL72 and HGX Rubin NVL8 for deploying AI models?
NVL72 is a complete rack-scale system with 72 GPUs and 36 CPUs built to operate as a coherent engine; HGX NVL8 is a server platform linking 8 GPUs via NVLink, meant for integration into x86 infrastructures and more traditional scaling.

What is the Inference Context Memory Storage Platform, and why does it matter for agentic AI?
It’s designed to accelerate and scale the inference “context”—like the key-value cache—used across sessions and multi-turn reasoning, enhancing responsiveness and performance by sharing state between infrastructures.

When will Rubin-based systems arrive, and which providers plan to offer them in the cloud?
NVIDIA states Rubin is already in production, with products available through partners in the second half of 2026—including large cloud providers like AWS, Google Cloud, Microsoft, and OCI, along with partners like CoreWeave.

via: nvidianews.nvidia
