OpenAI Launches MRC: The Network Keeping 100,000 AI GPUs Busy

OpenAI has published the MRC specification, a new networking protocol for AI supercomputers, developed in collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The company has released it through the Open Compute Project with a clear goal: to let the industry use and improve a piece of infrastructure already running in its largest training clusters.

The news may not have the commercial flash of a new model, but it could be equally important for understanding the future direction of AI. Training frontier models isn’t just about having more GPUs. It also requires those GPUs to communicate with each other with extremely high precision. If a transfer arrives late, a link fails, or a switch introduces latency, thousands of accelerators could be left waiting. And a GPU halted in a training cluster isn’t just a technical problem: it means money, energy, and time lost.

Why OpenAI Needed a Different Network

OpenAI explains that training large models can involve millions of data transfers in a single step. In synchronous training, many GPUs work in concert on the same model. That means performance isn’t just determined by the network’s average but by the worst-case scenarios: late packets, congested routes, or failed links. The company describes these loads as a kind of “failure amplifier,” because the larger the system, the more likely a small problem will impact the whole.

MRC, or Multipath Reliable Connection, seeks to resolve that bottleneck. It’s an extension of RoCE (RDMA over Converged Ethernet), the technology that lets servers read and write each other’s memory directly over Ethernet without involving the CPU in the data path. The difference is that MRC distributes a single transfer across hundreds of routes, can route around failures in microseconds, and simplifies network control through static routing based on SRv6.

OpenAI says MRC is already deployed across all of its largest supercomputers equipped with NVIDIA GB200 hardware and used for training frontier models, including Oracle Cloud Infrastructure systems in Abilene, Texas, and Microsoft’s Fairwater supercomputers. The company also states it has trained several models with MRC on NVIDIA and Broadcom hardware.

Context is critical. OpenAI notes that ChatGPT exceeds 900 million weekly users—a scale that demands rethinking infrastructure below the models. The company no longer works solely with experimental clusters but with systems that are part of a global production chain of AI models, products, and services.

Key Idea: Divide, Distribute, and Survive Failures

One of MRC’s most interesting aspects is its topology. Instead of treating an 800 Gb/s network interface as a single link, OpenAI proposes splitting it into multiple smaller links. For example, one interface could connect to eight different switches, creating eight parallel planes of 100 Gb/s each. This decision transforms the cluster’s structure: a switch that connects 64 ports at 800 Gb/s could connect 512 ports at 100 Gb/s. According to OpenAI, this enables building a network capable of connecting around 131,000 GPUs with just two switch layers, compared to three or four levels required by traditional designs.
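The arithmetic behind those numbers can be sketched in a few lines. This is an illustration, not anything taken from the MRC spec: it assumes an idealized two-tier Clos/fat-tree, in which a switch of radix R can serve up to R²/2 hosts across the two layers.

```python
# Hedged sketch: how splitting one 800 Gb/s interface into eight 100 Gb/s
# planes changes the reachable cluster size, assuming an idealized
# two-tier fat-tree where a switch of radix R supports up to R**2 // 2 hosts.

def max_hosts_two_tier(radix: int) -> int:
    """Upper bound on hosts in an idealized two-tier fat-tree."""
    return radix ** 2 // 2

# The same switch silicon exposed as 64 x 800 Gb/s ports...
print(max_hosts_two_tier(64))    # 2048
# ...or as 512 x 100 Gb/s ports:
print(max_hosts_two_tier(512))   # 131072 -- roughly the ~131,000 GPUs cited
```

The second figure lines up with the roughly 131,000 GPUs OpenAI cites for a two-layer design, which is why the port split matters more than it first appears.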

Reducing switch levels isn’t trivial. Fewer switches mean less power consumption, fewer failure points, and reduced operational complexity. But dividing the network into multiple planes only makes sense if traffic can leverage these divisions effectively. That’s where adaptive “packet spraying” comes in: MRC doesn’t send a transfer through a single route but distributes its packets across many paths simultaneously.
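The core of packet spraying can be sketched in a few lines of Python. This is a hedged illustration, not OpenAI’s implementation: the plane names and the random path picker are invented, standing in for MRC’s adaptive path selection.

```python
# Hedged sketch of packet spraying: cut one transfer into packets and fan
# them out across parallel planes, instead of pinning the flow to one route.
import random

def spray(transfer: bytes, mtu: int, paths: list[str]) -> list[tuple[str, int, bytes]]:
    """Return (path, offset, payload) tuples. The offset lets the receiver
    place each packet independently of arrival order."""
    packets = []
    for offset in range(0, len(transfer), mtu):
        path = random.choice(paths)   # stand-in for adaptive path selection
        packets.append((path, offset, transfer[offset:offset + mtu]))
    return packets

pkts = spray(b"x" * 4096, mtu=1024, paths=[f"plane-{i}" for i in range(8)])
print(len(pkts))  # 4
```

A single congested plane now delays only a fraction of the packets rather than the whole transfer.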

| Issue in Large AI Clusters | Traditional Approach | What MRC Offers |
| --- | --- | --- |
| Congestion on specific links | Each flow usually follows one route | Distributes packets over hundreds of paths |
| Link or switch failures | Network recalculates routes, which can take seconds | Routes around failed paths in microseconds |
| Scaling beyond 100,000 GPUs | Requires more switch levels | Uses multi-plane networks with only two levels |
| Out-of-order packets | Can hurt performance | Each packet carries its final memory address |
| Control plane complexity | Dynamic protocols like BGP | Static routes with SRv6 and source-based control |
| Network maintenance | May require coordination with training | Allows repairs or restarts without stopping jobs |

In classic networks, sending packets via different routes can cause out-of-order arrivals. MRC handles this because each packet includes its final memory address. The recipient can place it correctly as it arrives, without waiting for the entire sequence to follow one route. This feature reduces congestion hotspots and prevents some transfers from becoming significantly slower than others.
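The receiving side of that idea can be sketched as follows. This is an illustration under a simplifying assumption: a byte offset stands in for the final memory address each MRC packet carries, so every packet can be committed on arrival, in any order.

```python
# Hedged sketch: each packet carries its destination offset (standing in
# for MRC's final memory address), so the receiver writes it into place
# on arrival -- no resequencing buffer, no waiting for in-order delivery.
import random

def receive(packets, total_len: int) -> bytes:
    buf = bytearray(total_len)
    for _path, offset, payload in packets:   # any arrival order works
        buf[offset:offset + len(payload)] = payload
    return bytes(buf)

msg = bytes(range(256)) * 4
pkts = [(None, off, msg[off:off + 256]) for off in range(0, len(msg), 256)]
random.shuffle(pkts)                          # simulate out-of-order arrival
assert receive(pkts, len(msg)) == msg
```

Because placement is independent per packet, no single slow route can stall the rest of the transfer behind it.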

The protocol also distinguishes better between packet loss due to failure and due to congestion. If a switch cannot forward a full packet, it can truncate the payload and send only the header. This “packet trimming” enables explicit retransmission requests without automatically interpreting the path as broken. It’s a way to reduce false positives and maintain useful routes when the problem is not a physical failure.
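A toy model makes the trimming behavior concrete. The structures here are hypothetical, not the real wire format; the point is that the receiver learns exactly which payloads congestion discarded.

```python
# Hedged sketch of packet trimming: under congestion, a switch keeps the
# header and drops the payload, so the receiver can name the missing packet
# and request just that retransmission instead of inferring a broken path.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    seq: int
    payload: Optional[bytes]   # None => trimmed: header survived, payload dropped

def forward(pkt: Packet, queue_full: bool) -> Packet:
    """Switch behavior: trim instead of silently dropping when congested."""
    if queue_full:
        return Packet(pkt.seq, None)
    return pkt

received = [forward(Packet(i, b"data"), queue_full=(i == 2)) for i in range(4)]
to_retransmit = [p.seq for p in received if p.payload is None]
print(to_retransmit)  # [2]
```

The header that survives is what turns a silent loss into an explicit, targeted retransmission request.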

SRv6 and Simpler Failure Handling

MRC introduces a crucial decision: replacing part of traditional dynamic routing with source routing via SRv6. Instead of switches calculating paths dynamically, the origin specifies the path each packet should follow. Switches merely read identifiers and follow preconfigured static tables.

This greatly simplifies operation. If a route fails, MRC stops using it. Switches don’t need to renegotiate or recalculate paths. For a cluster with millions of links, reducing this complexity can be as valuable as increasing bandwidth.
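That operational simplification can be illustrated with a small sketch. All names here are invented: with source-chosen paths, avoiding a failed plane is just the sender skipping it, while the switches’ static tables never change.

```python
# Hedged sketch of sender-side failure avoidance under source routing:
# switches keep static tables; the source simply stops selecting a path
# it has marked unhealthy. No renegotiation, no route recomputation.
healthy = {f"plane-{i}" for i in range(8)}   # invented plane identifiers

def pick_path(healthy_paths: set, key: int) -> str:
    """Deterministically spread keys over whatever paths remain healthy."""
    paths = sorted(healthy_paths)
    return paths[key % len(paths)]

healthy.discard("plane-3")                   # failure detected; stop using it
chosen = {pick_path(healthy, k) for k in range(100)}
assert "plane-3" not in chosen               # the failed plane is simply skipped
```

The control-plane work of a failover collapses into removing one entry from a set at the sender, which is why recovery can happen in microseconds rather than seconds.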

OpenAI provides real-world examples of this impact. During actual training runs, the company observed several transient failures per minute between level 0 and level 1 switches, with no measurable effect on synchronous pretraining jobs. In another instance, during the training of recent ChatGPT and Codex models, four level 1 switches had to be reset without coordinating with the training teams, something that previously would have required careful planning to avoid interruptions.

The difference isn’t just technical; it’s operational. A network that tolerates failures without halting training enables teams to perform maintenance, repair links, and operate massive clusters with less fear of stopping multi-million-dollar jobs. In AI supercomputing, resilience directly translates to productivity.

An Open Standard for the Infrastructure Race

OpenAI’s decision to release MRC through the Open Compute Project has strategic implications. The company isn’t keeping the protocol as a closed advantage but offering it as a specification for others—hardware vendors, cloud operators, and labs—to adopt. It signals that certain layers of AI infrastructure may require common standards to scale efficiently.

The AI race is no longer only about better models or more accelerators. It’s also about networks that keep those GPUs busy, data centers that consume less power, systems resilient to failures, and architectures that scale without exponential complexity. MRC sits squarely in that invisible layer, an unseen but crucial factor in whether a lab can train larger, faster models with less waste.

Another noteworthy aspect is the group of partners involved. AMD, Broadcom, Intel, Microsoft, and NVIDIA have different market interests in AI, but share a common challenge: if large clusters cannot communicate effectively, hardware performance is compromised. OpenAI also mentions deployment at scale with Microsoft Azure, OCI, NVIDIA, and Arista. This underscores that AI infrastructure is now an industrial effort involving many stakeholders, not just a software stack.

Adopting MRC won’t be immediate. It requires compatible hardware, integration, testing, and highly specialized network operation. Still, it sets a clear direction: clusters with over 100,000 GPUs need networks designed for constant fault tolerance—not just operation under ideal conditions.

Practically, MRC aims to keep GPUs working despite congestion, unstable links, maintenance, or degraded routes. This capability might seem minor compared to headlines about new models, but it’s a key enabler—one of the reasons those models can even exist. Frontier AI begins with algorithms but depends on cables, switches, protocols, and architectural decisions that must operate flawlessly around the clock.

FAQs

What is MRC?
MRC, or Multipath Reliable Connection, is a networking protocol developed by OpenAI with AMD, Broadcom, Intel, Microsoft, and NVIDIA to improve performance and resilience in large AI training clusters.

Why is it important for training large models?
Because distributed training involves moving data across thousands or hundreds of thousands of GPUs. If a transfer is delayed or fails, training slows down or stops.

What does it offer over traditional networks?
MRC distributes packets over hundreds of paths, uses multi-plane networks, routes around failures in microseconds, and simplifies routing with SRv6 and static routes.

Who can use MRC?
OpenAI has published the specification via the Open Compute Project so industry players can study, adopt, and build upon it.
