RoCE for AI: Technical Guide for Designing Lossless Ethernet in GPU Clusters

The network has become one of the most critical parts of artificial intelligence infrastructure. For years, many teams viewed Ethernet as a general-purpose layer: stable, well-known, relatively inexpensive, and sufficiently flexible for almost any enterprise workload. In a modern AI cluster, this perspective falls short. When thousands of GPUs train a distributed model, the network ceases to be just “transport” and becomes part of the computing system itself.

RoCE, RDMA over Converged Ethernet, carries RDMA over Ethernet so that servers can transfer data with less CPU and OS intervention. Its appeal is clear: low latency, high bandwidth, and improved efficiency in GPU-to-GPU communication. However, operating well in AI workloads requires the Ethernet fabric to behave almost losslessly, with controlled congestion, well-designed queues, tuned buffers, and sufficient telemetry to detect issues before they degrade training.

What problem does RoCE solve in AI workloads?

In distributed training, GPUs do not work in isolation. They exchange gradients, synchronize states, perform collective operations like AllReduce, and depend on all nodes progressing at similar rates. If part of the cluster is delayed due to congestion, packet loss, or variable latency, overall performance drops. An expensive GPU waiting for data is wasted capacity.
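A minimal sketch of why one delayed node slows the whole job, assuming a fully synchronous step with no overlap between compute and communication (real frameworks overlap part of it). The function names and the millisecond figures are illustrative, not measurements.

```python
def step_time_ms(compute_ms, comm_ms):
    """Synchronous step: all ranks wait for the slowest rank to finish."""
    return max(c + m for c, m in zip(compute_ms, comm_ms))


def idle_fraction(compute_ms, comm_ms):
    """Share of total GPU time not spent computing during one step."""
    step = step_time_ms(compute_ms, comm_ms)
    return 1.0 - sum(compute_ms) / (step * len(compute_ms))


# Hypothetical 8-GPU job: 100 ms of compute and 20 ms of communication per rank.
compute = [100.0] * 8
healthy = [20.0] * 8
congested = [20.0] * 7 + [60.0]  # one rank hits a congested path

print(step_time_ms(compute, healthy), idle_fraction(compute, healthy))
print(step_time_ms(compute, congested), idle_fraction(compute, congested))
```

In this toy model a single slow rank stretches the step from 120 ms to 160 ms for all eight GPUs.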

RDMA reduces memory copies and bypasses part of the traditional kernel and CPU path. Instead of treating each transfer as conventional traffic, it allows NICs to participate more directly in moving data between memories. RoCEv2 extends this logic over IP/Ethernet networks, enabling RDMA semantics in Ethernet-based environments.
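To make the encapsulation concrete, the sketch below estimates how much of the wire a RoCEv2 packet spends on payload. It assumes IPv4, no VLAN tag, and only the base transport header; extended headers that some RDMA operations add are ignored, so treat the result as an upper bound rather than an exact figure.

```python
# Per-packet overhead for RoCEv2 over IPv4 without a VLAN tag, in bytes.
PREAMBLE_AND_IFG = 8 + 12  # preamble/SFD plus minimum inter-frame gap
ETHERNET_HEADER = 14
IPV4_HEADER = 20
UDP_HEADER = 8             # RoCEv2 uses UDP destination port 4791
IB_BTH = 12                # InfiniBand Base Transport Header
ICRC = 4                   # invariant CRC carried inside the RoCE packet
ETHERNET_FCS = 4

OVERHEAD = (PREAMBLE_AND_IFG + ETHERNET_HEADER + IPV4_HEADER
            + UDP_HEADER + IB_BTH + ICRC + ETHERNET_FCS)


def wire_bytes(payload_bytes):
    """Bytes occupied on the link by one packet carrying this payload."""
    return payload_bytes + OVERHEAD


def payload_efficiency(payload_bytes):
    """Fraction of link capacity that carries RDMA payload."""
    return payload_bytes / wire_bytes(payload_bytes)


for rdma_mtu in (1024, 4096):
    print(rdma_mtu, wire_bytes(rdma_mtu), round(payload_efficiency(rdma_mtu), 4))
```

Under these assumptions a larger RDMA MTU amortizes the fixed 82 bytes of overhead over more payload.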

Google Cloud uses RoCEv2 on A3 Ultra and A4 machines for node-to-node communication and GPU-to-GPU traffic, with up to 3.2 Tbps of inter-node GPU-to-GPU traffic on A3 Ultra. Meta has also documented RoCE networks for large-scale distributed training, detailing clusters of 24,576 H100 GPUs used for Llama 3, one based on RoCE and another on InfiniBand. NVIDIA has developed Spectrum-X as an Ethernet platform optimized for AI with RoCE connectivity, congestion control, telemetry, and performance isolation.

| Network Option | Typical Use | Advantages | Challenges |
| --- | --- | --- | --- |
| Traditional Ethernet with TCP | General applications, cloud traffic, enterprise services | Maturity, interoperability, cost, known operation | Variable latency, retransmissions, loss tolerated by design |
| RoCEv2 over Ethernet | GPU clusters, distributed AI, HPC over Ethernet | RDMA over IP/Ethernet, lower CPU overhead, good multi-vendor support | Requires PFC, ECN, buffer tuning, QoS, and fine telemetry |
| InfiniBand | HPC, high-performance AI training, closed clusters | Low latency, mature RDMA, integrated stack | Less ubiquitous than Ethernet, more vendor dependency |
| Ultra Ethernet | Next-gen large-scale AI/HPC | Seeks to improve multipath, scalability, and extreme transport | Still maturing compared to current RoCE deployments |

Technical components: PFC, ECN, DCQCN, and QoS

RoCE alone does not convert Ethernet into a lossless network. The network must be configured to protect RDMA traffic classes and prevent drops at congestion points. Several mechanisms come into play, which should be understood as an integrated system, not as independent switch options.

PFC, Priority Flow Control, pauses a specific priority when its associated queue reaches a threshold. Its advantage is preventing loss in RDMA traffic. Its risk is that misconfiguration can propagate congestion, cause head-of-line blocking, and affect traffic that shouldn’t be blocked. It’s recommended to apply PFC only to necessary queues, not network-wide.
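One reason PFC thresholds cannot be copied blindly is headroom: after a queue crosses its pause threshold, data keeps arriving until the pause takes effect at the sender. The sketch below is a simplified estimate, not a vendor formula. It assumes roughly 5 ns of propagation delay per meter of cable, treats the peer's reaction time as a single hypothetical parameter (`response_ns`), and adds one maximum-size frame at each end that was already being transmitted.

```python
import math


def pfc_headroom_bytes(link_gbps, cable_m, mtu_bytes, response_ns, ns_per_m=5.0):
    """Rough buffer needed above the pause threshold to avoid drops.

    All inputs are assumptions to be replaced with values from the actual
    switch, NIC, and cabling documentation.
    """
    bytes_per_ns = link_gbps / 8.0            # 1 Gb/s equals 0.125 bytes/ns
    round_trip_ns = 2 * cable_m * ns_per_m    # pause frame out, in-flight data back
    in_flight = bytes_per_ns * (round_trip_ns + response_ns)
    return int(math.ceil(in_flight + 2 * mtu_bytes))


# Hypothetical 400 Gb/s link, 100 m of fiber, 4200-byte frames, 500 ns response.
print(pfc_headroom_bytes(400, 100, 4200, 500))
```

The estimate grows with link speed and cable length, which is why headroom that worked on a short 100 Gb/s link may be too small on a longer 400 Gb/s one.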

ECN, Explicit Congestion Notification, marks packets before they are dropped. In RoCEv2, the receiving NIC sees the mark and returns a congestion notification packet (CNP) to the sender, which then reduces its transmission rate. ECN is often combined with congestion control algorithms like DCQCN, Data Center Quantized Congestion Notification, which adapts the sending rate based on congestion marks. The key is to mark early enough but not too early; overly aggressive thresholds waste capacity, while lax thresholds let queues build and trigger PFC pauses.
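The threshold trade-off can be illustrated with the RED-style marking curve commonly used for ECN: no marking below a minimum queue depth, a linearly rising probability up to a maximum depth, and marking of every packet beyond it. The parameter names `kmin`, `kmax`, and `pmax` and the values used here are illustrative assumptions, not recommended settings.

```python
def ecn_mark_probability(queue_bytes, kmin, kmax, pmax):
    """Probability that a packet is ECN-marked at a given queue depth."""
    if queue_bytes <= kmin:
        return 0.0
    if queue_bytes >= kmax:
        return 1.0
    return pmax * (queue_bytes - kmin) / (kmax - kmin)


# Hypothetical thresholds: start marking at 100 KB, mark everything at 400 KB.
KMIN, KMAX, PMAX = 100_000, 400_000, 0.2

for depth in (50_000, 250_000, 500_000):
    print(depth, ecn_mark_probability(depth, KMIN, KMAX, PMAX))
```

Lowering `kmin` makes senders back off while the queue is still short, which costs throughput; raising it lets the queue approach the PFC threshold before any sender reacts.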

Queue separation is also critical. RDMA traffic should have its own class, queues, and policies. Congestion notification packets, like CNP, need to arrive promptly; delayed congestion signals cause the sender to continue injecting traffic into already saturated fabric. Cisco recommends differentiating RoCEv2 traffic, applying ECN and PFC on dedicated queues, and prioritizing congestion control traffic appropriately.
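To show why a late CNP matters, here is a heavily simplified sketch of a DCQCN-style sender. It follows the rate-cut and recovery rules described in the published DCQCN design, but merges the separate timers into one `on_quiet_period` call and omits the later additive and hyper-increase stages. The class name, method names, and the default gain `g` are assumptions for illustration; real NICs implement this in firmware with vendor-specific tuning.

```python
class DcqcnSender:
    """Toy model of a sender reacting to congestion notification packets."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.rate = float(line_rate_gbps)    # current sending rate
        self.target = float(line_rate_gbps)  # rate to recover toward
        self.alpha = 1.0                     # running congestion estimate
        self.g = g                           # gain used to update alpha

    def on_cnp(self):
        """A CNP arrived: remember the old rate, then cut by alpha / 2."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self):
        """No CNP for one period: decay alpha and move halfway to target."""
        self.alpha = (1 - self.g) * self.alpha
        self.rate = (self.target + self.rate) / 2


sender = DcqcnSender(400)
sender.on_cnp()
print(sender.rate)           # rate after the first cut
sender.on_quiet_period()
print(sender.rate)           # rate after one recovery step
```

The sender only slows down inside `on_cnp`, so every microsecond a CNP spends queued behind other traffic is time spent sending at the old rate into the congested port.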

| Component | Function | Potential Misconfiguration Risks |
| --- | --- | --- |
| PFC | Prevents loss by pausing a specific priority | Congestion propagation, blocking, excessive pauses |
| ECN | Marks congestion before packet drop | Late marking causes queues; early marking reduces performance |
| DCQCN | Adjusts sending rate in RoCE endpoints | Oscillations, underutilization, persistent congestion |
| QoS / Queues | Separates RDMA, control, storage, and general traffic | Mixing critical and non-critical traffic, jitter, drops |
| Buffers | Absorbs microbursts and incast | Drops, queue latency, unnecessary switch memory consumption |
| Telemetry | Provides visibility into PFC, ECN, queues, drops, and microbursts | Intermittent issues without clear cause |

Fabric design: topology, rails, and isolation

In AI, a generic leaf-spine network isn’t enough. The design should consider communication patterns: distributed training, parallel inference, storage, checkpoints, and control traffic behave differently. The network must avoid oversubscription where it affects critical traffic and should provide consistent paths between GPU nodes.
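Oversubscription at a leaf is simple to check on paper before any hardware is ordered. The sketch below compares server-facing capacity with spine-facing capacity and shows how a single uplink failure changes the ratio. The port counts and speeds are hypothetical.

```python
def oversubscription_ratio(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Server-facing capacity divided by spine-facing capacity at one leaf.

    A ratio of 1.0 means the leaf can forward all server traffic upstream;
    anything above 1.0 means uplinks can become the bottleneck.
    """
    if uplinks <= 0:
        raise ValueError("a leaf needs at least one working uplink")
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


# Hypothetical leaf: 32 x 400G toward GPUs, 32 x 400G toward spines.
nominal = oversubscription_ratio(32, 400, 32, 400)
one_uplink_down = oversubscription_ratio(32, 400, 31, 400)

print(nominal, round(one_uplink_down, 3))
```

A design that is exactly non-blocking in the nominal case becomes oversubscribed as soon as one uplink fails, which is one reason to decide in advance how much spare uplink capacity the training workload needs.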

Advanced deployments use rail-aligned or multi-rail designs, where GPU server NICs connect to parallel fabrics or designed network domains for predictable traffic balancing. Google, for example, uses a 4-way rail-aligned network in A3 Ultra to ensure non-blocking GPU-to-GPU traffic. In private environments, this requires careful cable symmetry, NIC distribution, ECMP hashing, hop count management, and physical node placement.

Isolation is also vital. A RoCE fabric for GPUs shouldn’t share the same links and queues with management, backup, general storage, or noisy monitoring traffic. Some designs benefit from physically separated networks, while others rely on logical separation, dedicated queues, and strict policies. The decision depends on scale, budget, risk, and expected utilization levels.

| Design Layer | Technical Recommendation | Rationale |
| --- | --- | --- |
| Topology | Symmetric leaf-spine or rail-aligned | Reduces uneven paths and improves predictability |
| Oversubscription | Avoid in critical GPU-to-GPU traffic | GPUs degrade quickly if they wait too long for communication |
| Traffic separation | RDMA on dedicated queues and priorities | Prevents general traffic from increasing latency and buffer contention |
| ECMP / hashing | Validate real flow distribution for RoCE traffic | Poor hashing can concentrate traffic on specific links |
| Cabling | Document rails, NICs, leafs, and domains | Physical errors cause hard-to-diagnose problems |
| Failures | Test link, leaf, and spine loss scenarios | The network should degrade predictably, not chaotically |
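The ECMP recommendation can be explored offline before touching the fabric. The sketch below hashes a handful of RoCEv2 flows onto uplinks and reports how uneven the result is. It uses CRC32 over the 5-tuple as a stand-in hash, because real switches use vendor-specific hash functions and field selections; the addresses and source ports are made up. RoCEv2 traffic always targets UDP port 4791, so the source port is typically the main source of per-flow entropy.

```python
import zlib
from collections import Counter

ROCEV2_UDP_PORT = 4791
UDP_PROTOCOL = 17


def ecmp_link(src_ip, dst_ip, src_port, n_links):
    """Pick an uplink index from the 5-tuple (CRC32 is only a stand-in)."""
    key = f"{src_ip}|{dst_ip}|{UDP_PROTOCOL}|{src_port}|{ROCEV2_UDP_PORT}"
    return zlib.crc32(key.encode()) % n_links


def link_load(flows, n_links):
    """Number of flows landing on each uplink."""
    counts = Counter(ecmp_link(s, d, p, n_links) for s, d, p in flows)
    return [counts.get(i, 0) for i in range(n_links)]


def imbalance(load):
    """Busiest link relative to the average; 1.0 means perfectly even."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean else 0.0


# Hypothetical: 16 long-lived flows between two hosts across 8 uplinks.
flows = [("10.0.0.1", "10.0.1.1", 49152 + i) for i in range(16)]
load = link_load(flows, 8)
print(load, round(imbalance(load), 2))
```

With a small number of large, long-lived flows, which is typical of collective traffic, a hash rarely spreads load evenly. That is why the table asks for validation against real flow distribution rather than trust in the algorithm.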

Practical checklist before deploying RoCE in production

The most common mistake is treating RoCE as an on/off feature. In reality, it’s an architecture that must be validated. Before deployment, build a test environment with the same NIC, firmware, OS, drivers, switch, NOS version, optics, and traffic pattern used in the real cluster. The difference between “works in a simple test” and “sustains training for weeks” is huge.

A solid validation plan should include sustained throughput tests, incast, microbursts, mixed traffic, link failures, switch resets, localized congestion, route changes, firmware updates, and real NCCL workloads. Synthetic tests like ib_write_bw or RDMA simulations aren’t enough. Measure application-level performance or at least traffic patterns similar to real workloads.
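When comparing runs from NCCL benchmarks, it helps to normalize results the way the nccl-tests suite does: algorithm bandwidth is message size divided by time, and for AllReduce the bus bandwidth scales it by 2(n-1)/n so that results are comparable across rank counts. The sketch below applies that conversion; the sample size, time, and rank count are hypothetical.

```python
def allreduce_bandwidth(size_bytes, time_s, n_ranks):
    """Return (algbw, busbw) in GB/s for one AllReduce measurement."""
    if n_ranks < 2:
        raise ValueError("AllReduce needs at least two ranks")
    algbw = size_bytes / time_s / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Hypothetical measurement: 1 GB reduced across 8 ranks in 10 ms.
algbw, busbw = allreduce_bandwidth(1_000_000_000, 0.010, 8)
print(algbw, busbw)
```

Tracking bus bandwidth over time, at several message sizes, gives a baseline that can be compared against switch counters when training slows down.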

Operational governance is also critical. Define who can change ECN thresholds, approved firmware versions, how PFC events are monitored, alerting policies for drops by priority, criteria for considering training degraded due to network issues, and how to correlate GPU, NCCL, and switch metrics. Without this framework, incidents will lead to endless conflicts between network, system, platform, and AI teams.

| Phase | Minimum Validation | Expected Result |
| --- | --- | --- |
| Laboratory | NIC, switch, firmware, and NOS identical to production | Reproducible configuration |
| Baseline | Throughput, latency, PFC, ECN, drops, queues without real load | Stable comparison point |
| Synthetic Load | RDMA, incast, microbursts, mixed traffic | Verify congestion thresholds |
| AI Workload | NCCL, AllReduce, distributed training | Measure real GPU impact |
| Failures | Link drops, switch resets, route changes | Predictable degradation |
| Operational | Alerts, dashboards, runbooks, responsible teams | Fast, repeatable diagnosis |

Metrics to monitor effectively

A RoCE fabric can appear healthy based on traditional metrics but still impair training performance. Link utilization isn’t enough; look at queues, pauses, ECN marks, and endpoint behavior.

Key metrics include PFC events per port and priority, pause durations, ECN-marked packets, CNP sent and received, drop counts per queue, buffer occupancy, queue latency, retransmissions if any, rail utilization, ECMP distribution, physical errors, CRC errors, port flaps, optical temperature, and NCCL throughput. At the GPU level, correlate accelerator utilization, wait times in collective communications, and training step durations.
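Most of these metrics are exposed as monotonically increasing counters, so the useful signal is the rate of change between two snapshots. The sketch below turns PFC pause counters into per-second rates and flags the busiest port and priority pairs. The snapshot format, port names, and alert threshold are all hypothetical; real counters would come from the switch or NIC telemetry interface in use.

```python
def pause_rates(previous, current, interval_s):
    """Pause frames per second for each (port, priority) key."""
    return {
        key: (count - previous.get(key, 0)) / interval_s
        for key, count in current.items()
    }


def noisy_queues(rates, threshold_per_s):
    """Keys whose pause rate exceeds the alert threshold, sorted for display."""
    return sorted(key for key, rate in rates.items() if rate > threshold_per_s)


# Hypothetical snapshots taken 10 seconds apart; priority 3 carries RDMA.
before = {("Ethernet1", 3): 1_000, ("Ethernet2", 3): 5_000}
after = {("Ethernet1", 3): 1_020, ("Ethernet2", 3): 45_000}

rates = pause_rates(before, after, interval_s=10)
print(rates)
print(noisy_queues(rates, threshold_per_s=100))
```

The same pattern applies to ECN marks, CNP counts, and per-queue drops, and the resulting rates can be lined up against training step durations on a shared timeline.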

NVIDIA promotes this approach with Spectrum-X and telemetry designed for AI factories, viewing network performance as integral to cluster efficiency, not an isolated component. The goal is for AI teams to identify when network issues reduce efficiency and for network teams to see when queues impact specific workloads.

Common Deployment Mistakes in RoCE

The first mistake is enabling PFC on too many priorities, turning localized protection into a systemic problem. The second is copying ECN values without adapting to switch configurations, buffer sizes, and traffic patterns. The third is mixing RDMA traffic with other data without sufficient separation. The fourth is testing only with synthetic workloads, assuming real training will behave similarly.

The importance of firmware is also often underestimated. NICs, drivers, switches, and OS must form a cohesive stack. An incompatible version can introduce latency, drops, or congestion behaviors that are hard to reproduce. Homogeneity in large clusters is as crucial as correct configuration.

Another critical mistake is not designing for failures. A fabric that works only in perfect conditions isn’t practical. Test scenarios should include link failures, queue saturation, leaf resets, and ECMP route changes. In AI, even a partial failure can reduce efficiency so much that training costs spike, even if the service remains operational.

RoCE is not a trend; it’s a new skill set

The growth of RoCE in AI reflects a deeper trend: hyperscalers and large cloud providers want high-performance Ethernet networks because they offer scale, vendor diversity, and broader operational foundations. InfiniBand will remain vital for HPC and AI, but Ethernet is entering territories once thought reserved for specialized fabrics.

For technical teams, this means an update in skills. Knowing VLANs, BGP EVPN, MLAG, or general QoS isn’t enough. In AI networks, understanding RDMA, training communication patterns, NCCL, congestion control, PFC, ECN, DCQCN, telemetry, and real workload validation is essential. It’s a discipline bridging networking, systems, and AI platform management.

The good news is that Ethernet retains its main advantages: it’s open, well-understood, and broadly supported. The bad news is that, with RoCE, it stops forgiving approximate configurations. A traditional Ethernet network can perform reasonably well even if not perfectly tuned. A RoCE fabric for AI demands precision.

The key takeaway is straightforward: before acquiring more GPUs, design a network capable of keeping them busy. AI doesn’t scale solely with accelerators; it scales with a fabric that moves memory efficiently, avoids losses, and manages congestion during days or weeks of continuous training. RoCE is one of the enabling technologies for this transition, but it requires real engineering; it’s not just flipping a switch on the switch.

Frequently Asked Questions

Does RoCE replace TCP across the entire data center?
No. RoCE is used for low-latency, high-performance RDMA traffic, especially between GPU nodes or HPC workloads. Other applications can continue to use traditional TCP/IP.

Is it mandatory to use PFC with RoCE?
In practice, many RoCEv2 architectures for AI enable PFC on RDMA queues to avoid loss, and combine it with ECN and congestion control algorithms. The key is precise application, not blanket usage.

What’s the difference between RoCE and InfiniBand?
RoCE runs RDMA over Ethernet/IP, leveraging the Ethernet ecosystem. InfiniBand is a specialized fabric with native RDMA and a deeply integrated stack. Both are valid options depending on scale, cost, infrastructure, and goals.

What should be tested before deployment?
Throughput, latency, microbursts, incast, PFC events, ECN marks, ECMP distribution, link failures, NCCL performance, and real distributed training workloads.

via: RoCE AI
