AMD Launches ROCm 6.4 with Major Enhancements for Accelerating AI and HPC

AMD has unveiled version 6.4 of its ROCm (Radeon Open Compute) software platform, an update that marks a significant step forward for teams developing and deploying artificial intelligence (AI) and high-performance computing (HPC) workloads on AMD Instinct™ GPUs. With this release, the company strengthens its commitment to a faster, more modular, and easier-to-manage ecosystem, tailored to the industry's growing demands for performance and scalability.

Optimized Containers for Training and Inference

One of the cornerstones of ROCm 6.4 is the introduction of pre-optimized containers for training and inference of large language models (LLMs). These containers are ready to use and eliminate the usual complexity of setting up custom environments.

Among them are:

  • vLLM: an inference container for models like Gemma 3, Llama, Mistral, and Cohere, designed to deliver low latency from day one.
  • SGLang: an inference container optimized for DeepSeek R1 and agent-based workflows, with support for FP8, DeepGEMM, and multi-head latent attention (MLA).
  • PyTorch and Megatron-LM: training containers tuned for Instinct MI300X GPUs, with support for advanced models like Llama 3.1 and DeepSeek-V2-Lite.

These solutions allow researchers, developers, and infrastructure engineers to quickly access reproducible, stable, and high-performance environments.
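As an illustration, once the vLLM container is serving a model, any OpenAI-compatible client can talk to it. Below is a minimal Python sketch, assuming the server is reachable at http://localhost:8000 and serving Llama 3.1 8B Instruct; both the address and the model name are placeholders for whatever your deployment actually uses:

    # Minimal sketch: querying a vLLM container through its
    # OpenAI-compatible API. The base_url and model name are
    # assumptions; substitute the values of your own deployment.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
        api_key="EMPTY",  # vLLM does not check the key unless configured to
    )

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": "Summarize ROCm 6.4 in one sentence."}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)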

Training Acceleration with Enhancements in PyTorch

ROCm 6.4 also introduces notable performance improvements in PyTorch, particularly in the attention mechanisms used by LLMs. The new version includes:

  • Flex Attention, which significantly improves training times and reduces memory usage.
  • TopK up to three times faster, enhancing performance in inference tasks.
  • SDPA (Scaled Dot-Product Attention) optimized for long contexts.

These enhancements make it possible to train larger models faster and at lower computational cost.
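To make the list above concrete, the sketch below shows what these code paths look like from user code, using only standard PyTorch APIs (Flex Attention has its own API under torch.nn.attention.flex_attention; this example sticks to SDPA and TopK). Tensor shapes are arbitrary, and a ROCm build of PyTorch addresses AMD GPUs through the usual "cuda" device type:

    import torch
    import torch.nn.functional as F

    device = "cuda"  # ROCm builds of PyTorch expose AMD GPUs as "cuda"

    # (batch, heads, sequence length, head dimension); sizes are arbitrary
    q = torch.randn(1, 8, 4096, 64, device=device, dtype=torch.float16)
    k = torch.randn(1, 8, 4096, 64, device=device, dtype=torch.float16)
    v = torch.randn(1, 8, 4096, 64, device=device, dtype=torch.float16)

    # Scaled dot-product attention dispatches to a fused kernel where
    # available, avoiding materializing the full attention matrix;
    # this is the path that matters most at long context lengths.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    # TopK, a hot operation when sampling tokens during inference.
    logits = torch.randn(1, 128_000, device=device)
    values, indices = torch.topk(logits, k=50, dim=-1)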

Next-Generation Inference with vLLM and SGLang

The new version also enhances large-scale inference, providing low latency and high throughput for advanced models like Llama 3.1 (8B, 70B, 405B), Gemma 3, and DeepSeek R1. In internal tests, the SGLang container achieved record performance with DeepSeek R1 on the Instinct MI300X GPU, while vLLM provides out-of-the-box support for deploying Gemma 3 in production environments.

The containers, which are updated weekly or bi-weekly, ensure stability and operational continuity in production environments.
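For offline batched inference, vLLM also exposes a simple Python API. A brief sketch, where the model identifier is again a placeholder:

    # Sketch of offline batched inference with vLLM's Python API.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Explain tensor parallelism in two sentences."], params)
    for output in outputs:
        print(output.outputs[0].text)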

Automated GPU Cluster Management with AMD GPU Operator

To simplify the management of complex infrastructures, ROCm 6.4 includes advancements in the AMD GPU Operator, a tool that automates tasks such as driver updates, GPU scheduling in Kubernetes clusters, and real-time monitoring.

The new features include:

  • Seamless automatic driver upgrades, with node cordon, drain, and reboot handled automatically.
  • Expanded compatibility with Red Hat OpenShift 4.16–4.17 and Ubuntu 22.04/24.04.
  • Exporting metrics with Prometheus for GPU status tracking.

This allows IT teams to reduce operational risks and ensure a more resilient infrastructure.
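As a sketch of how those exported metrics can be consumed, the snippet below reads GPU utilization from a Prometheus server over its standard HTTP query API. Both the server address and the metric name are assumptions; the actual names depend on the exporter deployed in your cluster:

    # Sketch: reading GPU metrics from Prometheus via its HTTP API.
    import requests

    PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder
    QUERY = "gpu_utilization"  # hypothetical metric name

    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": QUERY},
        timeout=10,
    )
    resp.raise_for_status()

    # Prometheus returns one result per labeled time series.
    for result in resp.json()["data"]["result"]:
        instance = result["metric"].get("instance", "?")
        _timestamp, value = result["value"]
        print(f"{instance}: {value}% utilization")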

Modular Architecture with the New Instinct GPU Driver

Finally, ROCm 6.4 introduces a new Instinct GPU driver with a modular architecture, decoupling the kernel driver from the ROCm user-space components. This development offers:

  • Greater flexibility to update components separately.
  • An extended compatibility window of 12 months.
  • Better integration with bare metal environments, containers, and third-party applications.

This modularity simplifies large-scale management, especially for cloud service providers, public administrations, and companies with high stability requirements.

With ROCm 6.4, AMD reaffirms its commitment to developing high-performance tools for AI and HPC, providing researchers, developers, and infrastructure managers with a more powerful, flexible, and scalable environment to tackle current technological challenges.
