AI Audio Generation on Smartphones: Arm and Stability AI’s Bet

The collaboration between Arm and Stability AI has brought AI audio generation directly onto mobile devices. With optimizations built on Arm KleidiAI, text-to-audio generation is now 30 times faster, making it practical to create content and digital experiences on a phone without an internet connection.


Audio Generation in Seconds with Stable Audio Open

Stability AI’s audio generation model, Stable Audio Open, allows users to create sound effects, ringtones, or even music tracks simply by writing a description. However, running these models on mobile devices without cloud connectivity posed a considerable technical challenge.

Initially, generating a single audio clip took more than four minutes, which was impractical for end users. With KleidiAI integrated, along with optimizations in XNNPACK and ExecuTorch, that time has dropped to just a few seconds on mobile devices with Arm processors.
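
As a rough consistency check, the two figures quoted in this article (a baseline of more than four minutes and a 30-fold speedup) can be combined in a few lines of Python. The 240-second baseline is an assumption, taken as the lower bound of "more than four minutes", and the variable names are illustrative only.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Assumption: "more than four minutes" is taken at its lower bound of 240 s.
baseline_seconds = 4 * 60
speedup = 30  # "30 times faster"

# Dividing the baseline by the speedup gives the estimated optimized time.
optimized_seconds = baseline_seconds / speedup
print(f"Estimated generation time: {optimized_seconds:.0f} s")
```

Under that assumption the estimate comes out at about 8 seconds, which fits the article's "just a few seconds".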

This improvement not only makes the use of generative AI in audio more accessible but also enables millions of devices worldwide to leverage this technology without relying on external servers.


How Arm and Stability AI Optimized Performance

To achieve these results, Stability AI worked with Arm to shrink and optimize the model so that it runs on mobile CPUs. This included:

  • Optimization of model parameters to balance performance and quality.
  • Utilization of KleidiAI, which enhances AI execution on Arm processors without requiring additional modifications by developers.
  • Running the entire process offline, ensuring greater privacy and lower energy consumption.

“As more companies and creators adopt generative AI, it is crucial that these models are accessible on any platform. Arm has been an ideal partner to make this possible,” said Prem Akkaraju, CEO of Stability AI.


Applications and Advantages of Generative Audio AI

This innovation has the potential to transform sectors such as:

  • Content Creation – Generating custom sound effects for videos, social media, and games.
  • Mobile Video Editing – Quick integration of audio without the need to download clips from the internet.
  • Entertainment and Personalization – Creating custom ringtones or alarms in seconds.
  • Accessibility and Education – Producing automatic narrations or enhanced audio assistants.


Demonstrations at MWC 2025

At Mobile World Congress 2025, Arm and Stability AI will showcase their solution at Arm’s booth in Hall 2, Stand I60. The demonstration will include devices such as the vivo X200 Series, which uses the MediaTek Dimensity 9400 processor based on the Armv9 architecture.

This collaboration is just the beginning of a new era of generative AI running on mobile devices, enabling faster, more private, and more accessible experiences. With future optimizations, Stability AI and Arm plan to extend this technology to images, video, and 3D models, redefining digital creation directly from smartphones.

via: ARM
