Currently, the most advanced artificial intelligence models require massive computational infrastructure: servers equipped with high-performance GPUs like the NVIDIA H100 or A100, which can easily cost more than a house. However, a recent experiment has challenged this trend by using five Mac Studios to create an AI cluster capable of running large language models with EXO Labs, an emerging distributed-computing framework.
The Challenge: Running Llama 3.1 405B on Consumer Hardware
Llama 3.1 405B is a language model with 405 billion parameters, making it one of the most complex and demanding AIs in terms of hardware. Traditionally, such models can only be run in data centers with AI-optimized servers that have high-speed networks and specialized video memory (VRAM).
The goal of this experiment was to check if a cluster of five Mac Studios with M2 Ultra chips and 64 GB of unified memory each could handle the task, leveraging Apple’s unified memory architecture to compensate for the lack of dedicated VRAM.
Cluster Setup with EXO Labs
To connect the five Mac Studios and make them work together, the experiment used EXO Labs, an open-source framework that distributes AI workloads across multiple devices, including laptops, PCs, and servers.
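To give an idea of the kind of work such a framework does, here is a minimal sketch of splitting a model's layers across nodes in proportion to their memory. This is illustrative only, not EXO Labs' actual partitioning code; the layer count of 126 is Llama 3.1 405B's published depth.

```python
# Minimal sketch: assign contiguous layer ranges to nodes in proportion
# to each node's available memory (not EXO Labs' real algorithm).
def partition_layers(n_layers, node_mem_gb):
    total = sum(node_mem_gb)
    bounds, start = [], 0
    for mem in node_mem_gb:
        count = round(n_layers * mem / total)
        end = min(start + count, n_layers)
        bounds.append((start, end))
        start = end
    # hand any rounding remainder to the last node
    bounds[-1] = (bounds[-1][0], n_layers)
    return bounds

# Five identical 64 GB nodes, Llama 3.1 405B's 126 transformer layers
print(partition_layers(126, [64] * 5))
# → [(0, 25), (25, 50), (50, 75), (75, 100), (100, 126)]
```

With identical nodes the split is nearly even; heterogeneous clusters (the mixed laptops/PCs/servers EXO Labs targets) would get proportionally sized ranges.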
The interconnection network was a key point for performance:
- 10 Gbps Ethernet Network: Initially, the Mac Studios were connected through a 10 Gbps UniFi XG6 POE switch, but it soon became evident that this speed was insufficient to handle the required data traffic.
- Thunderbolt 4 Connection (40 Gbps): A Thunderbolt bridge was then tested to improve bandwidth and reduce latency, which noticeably improved communication between the cluster nodes.
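A quick back-of-the-envelope calculation shows why the 10 Gbps link struggled: simply streaming one node's share of the model weights (roughly 64 GB, matching each machine's memory) takes several times longer than over the Thunderbolt bridge, even at ideal throughput.

```python
# Ideal-case time to stream a 64 GB weight shard over each interconnect,
# ignoring protocol overhead (8 bits per byte, Gbps = 1e9 bits/s).
def transfer_seconds(gigabytes, link_gbps):
    return gigabytes * 8 / link_gbps

for name, gbps in [("10 GbE switch", 10), ("Thunderbolt 4 bridge", 40)]:
    print(f"{name}: ~{transfer_seconds(64, gbps):.0f} s for a 64 GB shard")
# → 10 GbE switch: ~51 s; Thunderbolt 4 bridge: ~13 s
```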
Initial Tests: Performance with Smaller Models
Before tackling Llama 3.1 405B, tests were conducted with smaller models:
- Llama 3.2 1B (1 billion parameters): This model ran smoothly on a single Mac Studio, with an acceptable inference speed.
- Llama 3.3 70B (70 billion parameters): This required the use of the cluster, distributing the load across several machines, with satisfactory results.
- Llama 3.1 405B (405 billion parameters): This is where the real challenges began.
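The progression above follows directly from the weight-only memory footprint of each model. The sketch below ignores activations, KV cache, and per-node overhead, which is why even a 4-bit 405B that nominally fits in 320 GB still leaves little practical headroom.

```python
# Weight-only memory footprint at common precisions (rough sketch;
# real usage adds activations, KV cache, and OS/runtime overhead).
def weights_gb(params_billion, bytes_per_param):
    # params_billion * 1e9 params * bytes, expressed in GB
    return params_billion * bytes_per_param

models = [("Llama 3.2 1B", 1), ("Llama 3.3 70B", 70), ("Llama 3.1 405B", 405)]
for name, p in models:
    print(f"{name}: ~{weights_gb(p, 2):.0f} GB fp16, ~{weights_gb(p, 0.5):.1f} GB 4-bit")
# 1B fits on one 64 GB machine even at fp16; 70B (140 GB fp16) needs the
# cluster; 405B (810 GB fp16) exceeds the cluster's 320 GB outright.
```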
Issues with the 405B Parameter Model
The main obstacle was the intensive memory usage. Despite having a total of 320 GB of unified RAM in the cluster, this was not enough to handle the model without resorting to swap memory, which severely affected performance.
Another problem was communication between the nodes. Although Thunderbolt 4 improved bandwidth, latency remained a limiting factor. In traditional data centers, GPUs are interconnected with 400 Gbps or 800 Gbps InfiniBand networks optimized for AI workloads, something this setup cannot replicate.
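It is worth noting why latency, rather than raw bandwidth, is the bottleneck during generation. In a pipeline-parallel setup each node forwards only one hidden-state vector per generated token, so the wire time is microseconds on either link; the fixed per-message latency of the software and network stack dominates. The hidden size of 16384 below is Llama 3.1 405B's, with fp16 activations assumed.

```python
# Per-token traffic between pipeline stages: one hidden-state vector
# per hop. Hidden size 16384 is Llama 3.1 405B's; fp16 = 2 bytes/value.
HIDDEN = 16384
BYTES_PER_TOKEN = HIDDEN * 2  # 32 KiB per hop per token

def hop_microseconds(link_gbps):
    return BYTES_PER_TOKEN * 8 / (link_gbps * 1e9) * 1e6

for name, gbps in [("Thunderbolt 4", 40), ("InfiniBand 400G", 400)]:
    print(f"{name}: {hop_microseconds(gbps):.1f} µs of wire time per hop per token")
# Wire time is tiny either way; per-message latency is what adds up
# across four hops on every single generated token.
```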
Additionally, the software and architecture of the Mac Studios are not optimized for AI to the same extent as NVIDIA GPUs with CUDA. While Apple offers MLX, its machine-learning framework for Apple silicon, it has not yet reached the level of optimization found in CUDA- and TensorRT-based AI environments.
Comparison with Traditional AI Hardware
Resource | Mac Studio M2 Ultra (x5) | AI Server with H100 GPUs
---|---|---
Total Memory (RAM/VRAM) | 320 GB (unified) | 1 TB+ (HBM + system RAM)
Internal Bandwidth | 40 Gbps (Thunderbolt) | 400-800 Gbps (InfiniBand)
Power Consumption | ~750W (total for 5 Mac Studios) | 3,000-5,000W (typical server)
Estimated Cost | $13,000 (total) | $200,000+
In terms of energy efficiency and costs, the Mac Studios have clear advantages. However, the lack of specialized VRAM and ultra-high-speed networks limits their ability to run large-scale AI models with the same efficiency as purpose-built servers.
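The cost and efficiency advantage can be made concrete with the table's own figures (illustrative arithmetic only; the server wattage uses the midpoint of the quoted 3,000-5,000W range):

```python
# Ratios implied by the comparison table above.
mac = {"cost_usd": 13_000, "watts": 750}
server = {"cost_usd": 200_000, "watts": 4_000}  # midpoint of 3,000-5,000W

print(f"cost ratio:  ~{server['cost_usd'] / mac['cost_usd']:.0f}x")   # ~15x
print(f"power ratio: ~{server['watts'] / mac['watts']:.1f}x")         # ~5.3x
```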
Conclusion: Is a Mac Studio Cluster Viable for AI?
The experiment with EXO Labs demonstrated that Mac Studios can run AI models, but with limitations. For small to medium models, they can be a viable alternative, especially if energy consumption is an important factor. However, for large-scale models like Llama 3.1 405B, the lack of AI-optimized hardware remains a significant hurdle.
Even so, the test opens new possibilities for distributed computing on consumer hardware, and with future improvements in software like EXO Labs, it could become a viable option for certain types of AI workloads.
Source: AI News