In 2022 and 2023, the world began to see the proliferation of artificial intelligence (AI) applications across industries. What is driving this revolution? Data centers, the beating heart of the AI boom, together with the advancement of GPUs, especially from NVIDIA.
The explosive growth of AI applications has demanded a complete reevaluation of traditional data centers. Existing infrastructure is generally neither designed nor equipped to handle the enormous parallel processing and memory capacity that AI workloads require. By 2024, the world is expected to generate 1.5 times as much digital data as it produced two years earlier.
Undoubtedly, demand from AI workloads will soon outgrow traditional cloud computing, and a one-size-fits-all approach cannot serve AI developers, who require solutions customized to their immense and specific requirements.
The problem with traditional data centers
Traditional data centers were primarily built to support general-purpose applications, offering a balance between performance and cost. Most of the computing power was designed for workloads such as web servers, e-commerce sites, and databases, not for the processing power needed to train a Large Language Model (LLM).
The main limitations of traditional data centers include:
– Performance and cost balance: Optimized for general use rather than any specific type of workload.
– Fragmented usage: Workloads scale incrementally, with no need for large-scale parallel processing or massive storage.
– CPU-centric workloads: CPUs are significantly less energy-intensive and generate far less heat than GPUs, so power and cooling were sized accordingly.
AI developers need customized solutions with high capacity, immediate availability, and high-level technical support. Existing data centers lack the architecture, cooling, and software needed to run AI workloads or accelerated computing.
Key components of the redesign
1. Architecture: Power density per server is roughly four times that of CPU servers. Traditional data centers are designed for an average density of 5 to 10 kW per rack, while AI data centers now require 60 kW or more per rack.
2. Cooling: Servers with multiple GPUs generate much more heat than a traditional server, presenting two main challenges:
– Current air cooling solutions are stretched to their limits and require GPU racks to be spaced farther apart to cool effectively.
– Next-generation racks can consume up to 120 kW of power per cabinet, generating heat that cannot be cooled by air.
3. Software: Traditional software is built with redundancy and can fail over to other hardware components when one fails. LLMs, in contrast, are trained across an entire cluster, so a hardware failure carries significant cost. What is needed is a software stack built specifically to optimize workload performance and to recover automatically from interruptions; a minimal sketch of that recovery pattern follows this list.
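To make the software point concrete, here is a minimal sketch of the checkpoint-and-resume pattern such a stack relies on. Everything in it is illustrative: the function names, the pickle format, and the checkpoint path are assumptions for the example, not any particular framework's API.

```python
import os
import pickle

CHECKPOINT_PATH = "checkpoint.pkl"  # hypothetical location on shared storage

def save_checkpoint(step, model_state):
    """Atomically persist training progress so a hardware failure
    loses at most the work done since the last checkpoint."""
    tmp_path = CHECKPOINT_PATH + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump({"step": step, "model_state": model_state}, f)
    os.replace(tmp_path, CHECKPOINT_PATH)  # atomic rename: never a half-written file

def load_checkpoint():
    """Resume from the last completed checkpoint, or start fresh."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["model_state"]
    return 0, {}  # no checkpoint yet: start from step 0

def train_step(model_state):
    # Stand-in for one real forward/backward pass across the GPU cluster.
    return model_state

def train(total_steps=1000, checkpoint_every=100):
    step, model_state = load_checkpoint()
    while step < total_steps:
        model_state = train_step(model_state)
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(step, model_state)
```

The atomic rename is the key design choice: a crash mid-write leaves the previous checkpoint intact, so the cluster always has a valid state to resume from.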
Transitioning data centers for AI: a comprehensive upgrade
Adapting existing data centers into AI facilities involves significant hardware upgrades, and even structural changes to the building, to handle the new types of workloads. This includes:
– Replacing hardware with components capable of processing and transmitting large amounts of data in real-time.
– Reworking the network to support much higher bandwidth, ensuring efficient communication between densely packed GPU racks and remote storage systems (the sketch after this list gives a feel for the speeds involved).
– Redesigning the layout, cooling, power, and wiring systems to accommodate the increased density and interconnectivity of GPU racks.
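As a rough illustration of the bandwidth point, the sketch below estimates how long it takes to move a large payload, say a model checkpoint between GPU racks and remote storage, over links of different speeds. The 1 TB payload size and the link speeds are assumptions chosen for the example.

```python
def transfer_time_seconds(bytes_to_move, link_gbps):
    """Time to move a payload over a single link, ignoring protocol overhead."""
    bits = bytes_to_move * 8
    return bits / (link_gbps * 1e9)

checkpoint_bytes = 1 * 10**12  # assumed: a 1 TB model checkpoint
for gbps in (10, 100, 400):    # typical Ethernet vs. modern AI-fabric speeds
    t = transfer_time_seconds(checkpoint_bytes, gbps)
    print(f"{gbps:>4} Gb/s link: {t:8.1f} s to move 1 TB")
```

At 10 Gb/s the transfer takes over 13 minutes; at 400 Gb/s, the class of fabric AI data centers deploy, it drops to about 20 seconds.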
Reimagining the data center
The first stage is power: redesigning the power system to handle these workloads happens at both the data center and the rack level. Future cooling systems will require liquid cooling throughout the data center, using less water than current air cooling systems. Incorporating liquid cooling into new data centers requires planning and investment in specialized infrastructure; the rough calculation below shows the flow rates involved.
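To see why liquid cooling becomes unavoidable at these densities, a first-principles estimate using Q = ṁ · c_p · ΔT gives the water flow needed to carry a rack's heat away. The 10 °C coolant temperature rise is an assumption for the sketch; real loops vary by design.

```python
def coolant_flow_lpm(heat_kw, delta_t_c=10.0, cp_j_per_kg_k=4186.0):
    """Liters per minute of water needed to remove `heat_kw` of heat with a
    `delta_t_c` temperature rise across the rack (Q = m_dot * c_p * dT).
    Assumes water density ~1 kg/L; real coolants differ."""
    mass_flow_kg_s = (heat_kw * 1000.0) / (cp_j_per_kg_k * delta_t_c)
    return mass_flow_kg_s * 60.0  # kg/s -> L/min for water

for rack_kw in (10, 60, 120):  # traditional rack vs. AI rack vs. next-gen rack
    print(f"{rack_kw:>3} kW rack: ~{coolant_flow_lpm(rack_kw):.0f} L/min of water")
```

A 120 kW rack needs on the order of 170 L/min of water at that temperature rise, a heat load far beyond what room-scale air handling can remove from a single cabinet.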
Transforming data center connectivity is not just about connecting servers but about enabling efficient, high-speed communication between GPUs. In an AI-driven environment, where parallel processing is the norm, the speed at which GPUs exchange data determines overall performance, as the estimate below illustrates.
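A standard way to quantify this is the cost model for a bandwidth-optimal ring all-reduce, the collective operation that synchronizes gradients across GPUs during training: each GPU must move roughly 2(N−1)/N times the payload. The gradient size and per-GPU link speeds below are assumptions for illustration.

```python
def ring_allreduce_seconds(payload_bytes, num_gpus, link_gbytes_per_s):
    """Lower-bound time for a bandwidth-optimal ring all-reduce:
    each GPU sends and receives 2*(N-1)/N times the payload size."""
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / (link_gbytes_per_s * 1e9)

grad_bytes = 14e9  # assumed: ~7B parameters in fp16 gradients
for bw in (25, 50, 400):  # GB/s per GPU: Ethernet-class vs. NVLink-class links
    t = ring_allreduce_seconds(grad_bytes, num_gpus=1024, link_gbytes_per_s=bw)
    print(f"{bw:>3} GB/s per GPU: {t*1000:6.1f} ms per gradient sync")
```

Moving from Ethernet-class to NVLink-class bandwidth cuts each synchronization from over a second to tens of milliseconds, a difference that compounds over the millions of steps in a training run.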
The result of this redesign is applications that run faster and more efficiently than they would on legacy infrastructure. Serverless Kubernetes deployments allow quick boot times, responsive automatic scaling, and the ability to dedicate thousands of GPUs to a single workload, on infrastructure built specifically for the challenges these large workloads pose; a toy version of such a scaling rule is sketched below.
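As a toy illustration of the autoscaling idea, the rule below sizes a pool of 8-GPU workers from the job queue. It is a hypothetical sketch, not the scheduling logic of any real serverless platform; all names and numbers are assumptions.

```python
import math

def desired_workers(queued_jobs, gpus_per_job, max_gpus, gpus_per_worker=8):
    """Hypothetical scale-out rule: provision enough 8-GPU workers to drain
    the queue, capped by the cluster's total GPU budget."""
    gpus_needed = min(queued_jobs * gpus_per_job, max_gpus)
    return math.ceil(gpus_needed / gpus_per_worker)

# Example: 12 queued fine-tuning jobs needing 16 GPUs each, 1,000-GPU budget
print(desired_workers(queued_jobs=12, gpus_per_job=16, max_gpus=1000))  # -> 24
```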
With these advancements, data centers are poised to support the artificial intelligence revolution and high-performance computing applications, marking the beginning of a new era in digital infrastructure.