Catalina: the AI system with which Meta is redefining open data center infrastructure

Meta has revealed the details of Catalina, its new AI hardware architecture, which combines NVIDIA's Blackwell GB200 NVL72 platform, the Open Rack v3 (ORv3) standard, and high-density liquid cooling. Unveiled at the Open Compute Project (OCP) Summit, the system not only shows how the company is scaling its AI infrastructure but also underscores its commitment to open, standardized collaboration in a sector dominated by proprietary solutions.

In 2022, Meta was working with clusters of around 6,000 GPUs, used primarily for recommendation systems and ranking models. By 2023, driven by the rise of generative AI and large language models (LLMs), those clusters had grown roughly fourfold, to between 16,000 and 24,000 GPUs. By 2024 the company was operating more than 100,000 GPUs in production, and it expects that number to increase tenfold in the coming years. This growth supported training models such as Llama 3.1, with 405 billion parameters, which was trained on more than 15 trillion tokens using over 16,000 H100 GPUs.
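To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python using the common approximation of about 6 FLOPs per parameter per token; the per-GPU throughput and utilization below are illustrative assumptions, not numbers Meta has published:

    # Rough training-compute estimate for a 405B-parameter model on 15T tokens,
    # using the common ~6 * parameters * tokens FLOPs approximation.
    # GPU throughput below is an illustrative assumption, not a published figure.
    params = 405e9             # model parameters
    tokens = 15e12             # training tokens
    train_flops = 6 * params * tokens             # ~3.6e25 FLOPs

    gpus = 16_000              # H100-class accelerators
    sustained_flops_per_gpu = 400e12              # assumed ~400 TFLOPS sustained per GPU
    days = train_flops / (gpus * sustained_flops_per_gpu) / 86_400
    print(f"~{train_flops:.1e} FLOPs, roughly {days:.0f} days of training")

Under these assumptions the run works out to a few times 10^25 FLOPs and on the order of two months of wall-clock time, which is why interconnect and cluster scale matter as much as raw GPU count.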

Catalina was developed in response to this explosive demand for compute, where what matters is not just the number of GPUs but also interconnect capability, energy efficiency, and system-level scalability.

Catalina is a system of AI pods, where each pod comprises two IT racks forming a scale-up domain of 72 GPUs. Each rack includes 18 compute trays (top and bottom), nine NVSwitches on each side for GPU interconnection, NVLink connections that form a coherent memory domain, and air-assisted liquid cooling (ALC) to enable liquid cooling in traditional data centers. Its key feature is that it can be replicated and scaled: pods interconnect through the Disaggregated Scheduled Fabric (DSF), an open, modular network that links multiple pods, racks, and even entire buildings into a single supercluster optimized for AI.
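As a rough illustration of how those building blocks add up to a 72-GPU pod, here is a minimal Python sketch based only on the figures above; the structure, names, and per-tray GPU count are assumptions for illustration, not Meta's actual tooling:

    # Illustrative model of a Catalina pod: two ORv3 IT racks forming a
    # single 72-GPU NVLink scale-up domain. Per-tray GPU count is assumed.
    from dataclasses import dataclass, field

    @dataclass
    class Rack:
        compute_trays: int = 18     # split between top and bottom of the rack
        nvswitch_trays: int = 9     # NVSwitches on each side for GPU interconnect
        gpus_per_tray: int = 2      # assumption: two Blackwell GPUs per compute tray

        @property
        def gpu_count(self) -> int:
            return self.compute_trays * self.gpus_per_tray   # 36 GPUs per rack

    @dataclass
    class Pod:
        racks: list = field(default_factory=lambda: [Rack(), Rack()])

        @property
        def gpu_count(self) -> int:
            return sum(rack.gpu_count for rack in self.racks)  # 2 * 36 = 72

    print(Pod().gpu_count)  # -> 72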

The architecture also adopts the first high-power implementation of the Open Rack v3 (ORv3) standard promoted by the Open Compute Project. The standard supports up to 94 kW per rack (600 A) and is designed for the extreme power requirements of AI accelerators. ORv3's modularity allows for power shelves that convert 480 V input to 48 V DC, direct liquid-cooling connections, and enhanced safety through the Rack Management Controller (RMC), which monitors for leaks, controls valves, and orchestrates cooling.
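To make the RMC's role concrete, here is a minimal, hypothetical sketch of the kind of monitoring loop such a controller performs; the function names and logic are assumptions for illustration, since Meta has not published the controller's implementation:

    # Hypothetical Rack Management Controller (RMC) loop: poll leak sensors
    # and isolate the rack's liquid-cooling loop if a leak is detected.
    import time

    def read_leak_sensors() -> bool:
        """Placeholder: return True if any leak sensor in the rack trips."""
        return False

    def close_coolant_valves() -> None:
        """Placeholder: shut the rack's liquid-cooling supply and return valves."""
        print("coolant valves closed")

    def rmc_loop(poll_seconds: float = 1.0) -> None:
        while True:
            if read_leak_sensors():
                close_coolant_valves()              # isolate the liquid loop
                raise RuntimeError("leak detected: rack isolated, alert raised")
            time.sleep(poll_seconds)                # normal operation: keep polling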

For cooling and sustainability, Meta has implemented a hybrid system to handle the high power density of Blackwell GPUs: air-assisted liquid cooling units alongside the racks, a monitored liquid-management system with sensors that detect leaks, and compatibility with new-generation facility infrastructure that can deliver water directly to the racks. The design aims to maintain high energy efficiency while reducing operational risk, which is crucial as AI demand continues to grow.

While Catalina is based on NVIDIA's GB200 NVL72, Meta is also extending its open-platform strategy to other vendors. Its Grand Teton platform, launched in 2022, now supports AMD Instinct MI300X accelerators, a signal that the AI infrastructure of the future cannot depend solely on NVIDIA. In addition, Meta has introduced new 51.2 Tbps network switches based on Broadcom and Cisco ASICs, its first custom network ASIC (FBNIC) to optimize cluster communication, and the open, vendor-agnostic Disaggregated Scheduled Fabric (DSF) backend network, built on standards such as OCP-SAI and RoCE over Ethernet.

This strategy aims to break vendor lock-in and foster an open ecosystem in which different manufacturers can compete on equal footing. Meta collaborates closely with Microsoft within OCP, contributing to the Switch Abstraction Interface (SAI) standard and to initiatives such as the Open Accelerator Module (OAM). The two companies are working on Mount Diablo, a 400 VDC disaggregated power rack that allows more accelerators per rack with greater efficiency. This reflects a broader industry shift toward cooperation and standardization around open hardware and networking.

Meta emphasizes that openness—through modular, standardized hardware, disaggregated networks, and industry collaboration—is essential for democratizing AI and unlocking its full potential. Company officials state, “Opening our hardware designs is as important as releasing software frameworks. Only then can we democratize AI and ensure its benefits reach everyone.”

In conclusion, Catalina embodies three key forces shaping the next era of digital infrastructure: massive scale for training and deploying ever-larger generative AI models; energy efficiency and liquid cooling to make that growth sustainable; and technological openness through standards like ORv3 and DSF, paving the way for a collaborative AI ecosystem.

As data centers increasingly become “factory floors” for artificial intelligence, Meta’s Catalina project represents a strategic move to lead not only with cutting-edge technology but also with an open, inclusive philosophy that could transform industry standards.

Frequently Asked Questions:

What is Meta Catalina?
It is an AI infrastructure system based on NVIDIA GB200 NVL72, designed as a pod of two racks with 72 GPUs, liquid cooling, and a modular architecture.

Why is the Open Rack v3 standard important?
ORv3 supports up to 94 kW per rack, standardizes 48 VDC power delivery, and facilitates liquid cooling adoption in data centers.

How does Catalina connect at scale?
Via the Disaggregated Scheduled Fabric (DSF), which links multiple pods and buildings into a single AI cluster, with open support for Ethernet RoCE.

Does Meta work only with NVIDIA for AI?
No. Although Catalina is based on Blackwell, Meta is also expanding platforms such as Grand Teton to include AMD Instinct MI300X accelerators and is developing its own networking hardware (FBNIC).

Sources include engineering.fb.com, wccftech, and datacenterfrontier.
