AMD and OpenAI Launch MRC, the Protocol Aiming to Prevent Bottlenecks in AI

AMD, OpenAI, Microsoft, and other major industry players have introduced MRC, short for Multipath Reliable Connection—a new networking protocol designed to enhance performance and resilience in large AI training clusters. The specification has been published through the Open Compute Project, aiming for industry adoption beyond the internal deployments of participating companies.

This news may seem highly technical, but it addresses one of the most critical challenges in AI today. Training advanced models no longer depends solely on buying more GPUs. At scale, real performance is also determined by the network connecting those GPUs. When hundreds of thousands of accelerators must exchange data continuously and synchronously, any congestion, unstable link, or switch failure can slow down or halt training jobs that cost millions.

Over the past few years, much of the AI infrastructure discussion has focused on GPUs, HBM memory, custom chips, and power consumption. MRC shifts the focus to another equally critical layer: how data moves inside the supercomputer. OpenAI summarizes it clearly: network design determines how much available compute capacity can actually be used.

What Changes Does MRC Bring Compared to Traditional Networks?

In a traditional network, a transfer typically follows a single path from source to destination. While this approach works well in conventional environments, it creates bottlenecks in large AI clusters: multiple transfers can collide on the same link, increasing latency and stalling collective operations in which all accelerators must progress at the same pace.
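As a point of reference, here is a minimal Python sketch of that single-path behavior, assuming an ECMP-style fabric (the path names and hash choice are illustrative, not taken from any real switch): the flow's 5-tuple is hashed once, and every packet of the flow is pinned to the resulting link.

```python
import hashlib

# Illustrative path names for a small two-plane fabric.
PATHS = ["plane0-spine1", "plane0-spine2", "plane1-spine1", "plane1-spine2"]

def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> str:
    """Classic ECMP: hash the flow's 5-tuple once, then pin every
    packet of that flow to the same path for its entire lifetime."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/udp".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return PATHS[digest % len(PATHS)]

# Two large transfers can hash onto the same path and collide for the
# whole transfer while other paths sit idle: the bottleneck described above.
print(ecmp_path("10.0.0.1", "10.0.1.9", 40000, 4791))
print(ecmp_path("10.0.0.2", "10.0.1.9", 40001, 4791))
```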

MRC transforms this model. Instead of sending all of a transfer's packets through a single route, it distributes them across multiple paths simultaneously. OpenAI describes this behavior as dispersing packets across hundreds of routes within multi-plane networks. Packets may arrive out of order, but each one carries the information needed to reassemble the transfer correctly at the destination.
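A minimal sketch of that principle, with invented names and sizes (the real protocol runs in NIC hardware with far more machinery): each packet carries a sequence number, packets are sprayed round-robin across the available paths, and the receiver restores the original order from the sequence numbers regardless of how the packets arrive.

```python
import itertools
from dataclasses import dataclass

@dataclass
class Packet:
    seq: int        # sequence number so the receiver can reorder
    path: int       # which route this packet was sprayed onto
    payload: bytes

def spray(message: bytes, mtu: int, num_paths: int) -> list[Packet]:
    """Split a transfer into packets and spread them round-robin
    across all available paths instead of pinning them to one."""
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)]
    paths = itertools.cycle(range(num_paths))
    return [Packet(seq, next(paths), chunk) for seq, chunk in enumerate(chunks)]

def reassemble(received: list[Packet]) -> bytes:
    """Packets may arrive in any order; the sequence number alone is
    enough to restore the original byte stream at the destination."""
    return b"".join(p.payload for p in sorted(received, key=lambda p: p.seq))

pkts = spray(b"gradient shard " * 200, mtu=512, num_paths=8)
pkts.reverse()  # simulate heavy reordering across paths
assert reassemble(pkts) == b"gradient shard " * 200
```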

The goal is to smooth congestion and prevent a specific link from becoming the bottleneck. In synchronous training, performance is usually dictated by the worst-case scenario rather than the average. If part of the network is slow, other GPUs may be left waiting. That’s why reducing latency variation is as important as increasing maximum bandwidth.
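A toy calculation makes the point (all timings invented for illustration): in a synchronous step, the whole operation waits for the slowest link, so one straggler dominates no matter how good the average looks.

```python
# Illustrative per-link completion times (ms) for one synchronous
# all-reduce step across 8 GPUs; one link is congested.
link_times_ms = [10.2, 10.4, 10.1, 10.3, 10.2, 10.5, 10.3, 42.0]

step_time = max(link_times_ms)                    # the step waits for the slowest link
average = sum(link_times_ms) / len(link_times_ms)

print(f"average link time: {average:.1f} ms")     # ~14.2 ms
print(f"actual step time:  {step_time:.1f} ms")   # 42.0 ms: the straggler wins
```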

MRC also incorporates failure detection and recovery mechanisms. If congestion is detected on a route, it can be replaced with another. If a packet is lost, it assumes a problem on that path, stops using it, and retransmits the necessary information. OpenAI states that combining multi-plane networking, load balancing, packet spraying, and packet trimming allows microsecond-scale failure recovery—far faster than the seconds or tens of seconds it may take a conventional network to stabilize.
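A hypothetical sketch of that control flow in Python (class and method names invented, timings omitted; the actual protocol does this in hardware at microsecond scale): a loss on a path retires the path and reassigns its in-flight packets to the surviving paths for retransmission.

```python
from collections import defaultdict

class PathMonitor:
    """Minimal sketch of MRC-style path health tracking: a lost packet
    is treated as evidence that its path is bad, so the path is retired
    and its in-flight traffic is resent over the remaining paths."""

    def __init__(self, paths: list[int]):
        self.healthy = set(paths)
        self.inflight: dict[int, list[int]] = defaultdict(list)  # path -> seqs

    def on_send(self, path: int, seq: int) -> None:
        self.inflight[path].append(seq)

    def on_loss(self, path: int) -> list[tuple[int, int]]:
        """Retire the suspect path and return (seq, new_path) pairs
        describing where each orphaned packet gets retransmitted."""
        self.healthy.discard(path)
        orphans = self.inflight.pop(path, [])
        alive = sorted(self.healthy)
        return [(seq, alive[i % len(alive)]) for i, seq in enumerate(orphans)]

mon = PathMonitor(paths=[0, 1, 2, 3])
for seq in range(6):
    mon.on_send(path=seq % 4, seq=seq)
print(mon.on_loss(path=2))  # [(2, 0)]: packet 2 is resent on path 0
```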

Another significant decision is the use of IPv6 Segment Routing (SRv6). With this approach, the sender can explicitly specify the path each packet should follow, reducing dependence on dynamic routing protocols like BGP within the fabric. For large AI clusters, this can simplify operations and make network behavior more predictable during failures.
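To illustrate the mechanism, here is a simplified model of an SRv6 segment list, loosely following RFC 8754 (field names simplified, addresses taken from the IPv6 documentation prefix): the sender enumerates the waypoints the packet must visit, and each hop simply consumes the next segment, with no dynamic routing decision involved.

```python
from dataclasses import dataclass, field

@dataclass
class SRv6Header:
    """Simplified Segment Routing Header: the sender lists the IPv6
    segments (waypoints) the packet must visit, stored last-first as
    in RFC 8754, and each hop just follows the list."""
    segments: list[str]                       # segment list, final segment first
    segments_left: int = field(init=False)    # index of the next waypoint

    def __post_init__(self):
        self.segments_left = len(self.segments) - 1

    def next_hop(self) -> str:
        return self.segments[self.segments_left]

    def advance(self) -> None:
        if self.segments_left > 0:
            self.segments_left -= 1

# The sender pins this packet to a specific leaf and spine before it
# reaches the destination, independent of what BGP would have chosen.
hdr = SRv6Header(segments=["2001:db8::dst", "2001:db8:1::spine", "2001:db8:1::leaf"])
while True:
    print("forward to", hdr.next_hop())
    if hdr.segments_left == 0:
        break
    hdr.advance()
```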

AMD Leverages MRC to Strengthen Its Commitment to Open Ethernet

For AMD, MRC arrives at a pivotal moment. The company competes not only in GPUs with the Instinct family but also in CPUs with EPYC and in networking through its Pensando technology. AMD’s message is that AI infrastructure needs an open, programmable, production-ready foundation—not a collection of closed, hard-to-adapt solutions.

AMD claims it played a significant role in shaping the MRC specification, contributing congestion control technology and deployment expertise. The company also states that it has already deployed MRC alongside its networking technology in large-scale test clusters with a major cloud provider. To be precise, this is initial validation in large-scale environments, not yet broad industry adoption.

The most visible hardware component is the AMD Pensando Pollara 400 AI NIC, a 400 Gbps network card designed for AI workloads. AMD highlights its P4 programmable engine, advanced RDMA capabilities, OCP 3.0 compatibility, and features like intelligent load balancing, fast fault recovery, and congestion control. According to AMD, the Pollara 400 can be upgraded to support evolving standards—a key point in a market where AI networking protocols are still changing.

AMD also links MRC with its upcoming AMD Pensando “Vulcano” 800G AI NIC, which will support the same transport protocol. The leap to 800G aligns with market trends: AI clusters need more bandwidth per node and greater resilience. If an 800G network underperforms in real-world conditions, raw speed becomes less meaningful. MRC aims to bridge the gap between theoretical speed and real-world performance.

A Partnership Among Rivals on a Shared Specification

The list of participants underscores the strategic importance of the protocol. The MRC specification published by OCP includes contributions from AMD, Broadcom, Intel, Microsoft, NVIDIA, and OpenAI. While this collaboration might seem unusual from a competitive standpoint, it makes sense given the scale of the problem—none of these companies can afford AI networks to remain a constant bottleneck.

NVIDIA has also announced support for MRC in Spectrum-X Ethernet, noting that the protocol can run on ConnectX SuperNICs and Spectrum-X switches as one of several supported RDMA transport options. This confirms that MRC is not just an AMD initiative but part of a broader industry conversation about Ethernet optimized for AI.

The publication through the Open Compute Project signals a significant industry move. The AI networking market is split among multiple approaches: InfiniBand, advanced Ethernet, Ultra Ethernet, proprietary solutions, programmable NICs, and specialized accelerator fabrics. By publishing MRC openly, the participants aim to create a common foundation for scaling training clusters without relying solely on closed solutions.

For cloud providers, enterprises, research centers, and sovereign AI projects, this openness can be meaningful. AI deployments are expanding beyond major US hyperscalers. Governments, universities, and regional suppliers seek to build their own capacity but need technologies that prevent lock-in to a single stack. While MRC alone doesn’t eliminate this dependence, it points toward a more interoperable and programmable network infrastructure.

Actual adoption will depend on multiple factors: hardware support, software maturity, integration with training frameworks, observability tools, operational costs, and equipment availability. It will also be interesting to see how MRC coexists with other standardization efforts such as Ultra Ethernet and specific vendor architectures.

The core message is clear: the next phase of AI won’t be won solely through more chips but through comprehensive systems capable of keeping those chips busy, synchronized, and operational—even amid partial infrastructure failures. MRC strives to turn the network into a more tolerant, less fragile layer, better suited to large-scale training realities.

If the protocol delivers on its promises, it could reduce downtime, improve GPU utilization, and enable larger clusters without increasing operational complexity. In an industry where every percentage point of accelerator utilization impacts costs, energy, and training schedules, the network shifts from being just a technical detail to a strategic competitive advantage.

Frequently Asked Questions

What is MRC?
MRC, or Multipath Reliable Connection, is a networking protocol designed for large AI training clusters. It disperses packets over multiple routes to reduce congestion and improve failure recovery.

Who developed MRC?
The specification incorporates contributions from AMD, Broadcom, Intel, Microsoft, NVIDIA, and OpenAI, and has been published through the Open Compute Project.

Why is it important for AI?
Because large models require thousands, or even hundreds of thousands, of GPUs to exchange data continuously. If the network fails or becomes congested, training slows down despite the available compute capacity.

What role does AMD play in MRC?
AMD states it has co-led the specification, contributed congestion control technology, and implemented MRC within its networking ecosystem, including work with the AMD Pensando Pollara 400 and future Vulcano 800G NICs.

via: AMD
