The National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign has officially announced the launch of DeltaAI, its new advanced computing system designed to enhance research in artificial intelligence (AI) and high-performance computing (HPC). Funded with nearly $30 million from the National Science Foundation (NSF), DeltaAI is positioned as a key resource that promises to transform AI and HPC research in the United States.
A Complement to the Delta Supercomputer
DeltaAI serves as a complementary system to the Delta supercomputer, an HPE Cray-based installation that the NCSA implemented in 2021. While Delta set a milestone with its 338 nodes and Nvidia A100 GPUs, DeltaAI takes these capabilities to the next level, integrating cutting-edge technology such as Nvidia H100 Hopper GPUs and GH200 Grace Hopper superchips. This system not only doubles the performance of its predecessor but is specifically optimized for AI workloads, machine learning, and state-of-the-art language models.
Bill Gropp, director of the NCSA, highlighted that the design of DeltaAI responds to the growing demand for GPU-based resources, a trend that emerged rapidly following the implementation of the Delta system. “AI has grown exponentially, and with it the need for resources with higher memory capacity and performance,” Gropp stated during an interview at the SC2024 conference in Atlanta.
Optimized Performance for AI and HPC
DeltaAI delivers 633 petaflops of half-precision (FP16) performance for AI tasks, along with petaflops-scale double-precision (FP64) performance for scientific applications that require high numerical precision, such as climate modeling and fluid dynamics. Each node is equipped with four Nvidia Grace Hopper GPUs, each with 96 GB of memory, for a total of 384 GB of GPU memory per node. The system also features 14 PB of storage capable of handling up to 1 TB/second and a highly scalable interconnect.
This design not only enhances the performance of current applications but also enables researchers to tackle large-scale language models and more complex inference tasks. Gropp noted that the system will support key research in areas such as explainable artificial intelligence (XAI), aimed at unraveling the internal workings of AI models and improving their reliability.
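To make the memory figures quoted above concrete, here is a rough Python sketch of how one might check whether a model's weights fit in GPU memory. It assumes FP16 weights at 2 bytes per parameter and ignores activations, optimizer state, and inference caches, so real requirements are higher. The function names and the 70-billion-parameter example are illustrative and do not come from NCSA.

```python
import math

GPU_MEMORY_GB = 96          # per-GPU memory quoted above
NODE_GPU_MEMORY_GB = 384    # per-node GPU memory quoted above
BYTES_PER_FP16_PARAM = 2    # assumption: weights stored in half precision


def weights_gb(num_params: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_FP16_PARAM / 1e9


def min_gpus_for_weights(num_params: float) -> int:
    """Smallest number of 96 GB GPUs whose combined memory holds the weights."""
    return math.ceil(weights_gb(num_params) / GPU_MEMORY_GB)


if __name__ == "__main__":
    params = 70e9  # hypothetical 70-billion-parameter model
    size = weights_gb(params)
    print(f"weights: {size:.0f} GB, GPUs needed: {min_gpus_for_weights(params)}")
    print("fits in one node:", size <= NODE_GPU_MEMORY_GB)
```

Under these assumptions, such a model's weights would exceed a single 96 GB GPU but fit within the GPU memory of one node, which is the kind of workload the multi-GPU node design targets.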
Promoting Accessibility and Collaboration
DeltaAI will be available to researchers nationwide through the NSF ACCESS program and the pilot initiative of the National AI Research Resource (NAIRR). This broad accessibility aims to democratize research in AI and HPC, allowing more users to leverage the capabilities of this state-of-the-art system.
“The idea is to maximize collaborative impact,” Gropp explained. “We want more users to take advantage of our cutting-edge GPUs and collaborate with other groups to share resources and knowledge.”
The system is also designed to be versatile, catering to both specific AI needs and traditional HPC applications, such as molecular dynamics, fluid mechanics, and structural mechanics. Its architecture, based on multi-GPU nodes and unified memory, addresses common limitations like memory bandwidth, significantly enhancing performance for computationally intensive tasks.
Prepared for the Future
DeltaAI is part of an infrastructure design approach that seamlessly integrates its capabilities with those of Delta, utilizing the same Slingshot network and shared file systems. This design not only ensures resource usage efficiency but also lays the groundwork for future expansions. In fact, the NCSA already has plans to add new systems in the coming years, adopting a model of continuous upgrades rather than waiting for current hardware to become obsolete.
Gropp also emphasized the importance of balancing excitement for AI with practical scientific progress. “AI has tremendous potential, but there are things it will never be able to do with current technologies,” he cautioned. “DeltaAI will allow us to advance both scientific curiosity and practical applications that improve people’s lives.”
A Step Towards Leadership in AI and HPC
With DeltaAI, the NCSA reinforces its commitment to leading research in artificial intelligence and high-performance computing, providing a resource that combines power, versatility, and accessibility. This system not only promises to be a catalyst for new scientific and technological applications but also reaffirms the role of collaboration and transparency in advancing knowledge.
DeltaAI is an example of how technology can be used to address fundamental questions, enhance the reliability of AI, and translate these advances into tangible benefits for society.
Technical Summary: Hardware and Network of DeltaAI
DeltaAI is designed with state-of-the-art technology to meet the growing demands of AI and high-performance computing research. The system includes:
- 456 NVIDIA H100 GPUs, optimized for machine learning tasks and AI workloads.
- HPE Slingshot network with 200 Gb/s, providing high-performance, low-latency interconnect between nodes.
- Lustre file systems shared with the Delta supercomputer:
  - An HDD-based system for large volumes of data.
  - An NVMe-based system for handling small files and fast I/O operations.
- Access to the Taiga file system for center-level projects, based on Lustre.
- Personal directories hosted on Harbor, a VAST-based system for high-reliability storage.
High-Performance CPU-GPU Nodes
DeltaAI has 114 CPU-GPU nodes, each equipped with:
- 4 Grace Hopper GH200 superchips per node, each with:
  - 1 NVIDIA H100 GPU with 96 GB of HBM3 memory.
  - 1 72-core Grace ARM CPU with 120 GB of LPDDR5X memory.
- 4 Slingshot11 network connections, one for each superchip, maximizing communication efficiency.
- 1 NVMe drive of 3.5 TB per node, providing fast, local storage.
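The per-node figures in this summary can be cross-checked with a little arithmetic. The following Python sketch derives system-wide totals from the numbers listed here. The variable and function names are illustrative, and the totals are simple products of the listed values, not official NCSA figures.

```python
# Figures taken from the technical summary in this article.
NODES = 114
SUPERCHIPS_PER_NODE = 4
HBM_PER_GPU_GB = 96        # HBM3 per H100 GPU
LPDDR_PER_CPU_GB = 120     # LPDDR5X per Grace CPU
CORES_PER_CPU = 72


def system_totals() -> dict:
    """Return aggregate counts derived from the per-node figures."""
    gpus = NODES * SUPERCHIPS_PER_NODE  # one GPU and one CPU per superchip
    return {
        "gpus": gpus,
        "gpu_memory_per_node_gb": SUPERCHIPS_PER_NODE * HBM_PER_GPU_GB,
        "cpu_memory_per_node_gb": SUPERCHIPS_PER_NODE * LPDDR_PER_CPU_GB,
        "total_gpu_memory_gb": gpus * HBM_PER_GPU_GB,
        "total_cpu_cores": gpus * CORES_PER_CPU,
    }


if __name__ == "__main__":
    for name, value in system_totals().items():
        print(f"{name}: {value}")
```

The derived total of 456 GPUs agrees with the GPU count given at the top of the summary, and the 384 GB of GPU memory per node agrees with the figure quoted in the performance section.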
This hardware setup enables unprecedented performance for AI models, with an infrastructure that prioritizes both power and efficiency. DeltaAI is a key tool for researchers looking to tackle complex problems and scale their scientific and technological applications.
via: HPCwire and NCSA Delta