NVIDIA has introduced Fugatto, an innovative Artificial Intelligence (AI) model designed to transform and generate sounds in unprecedented ways. Dubbed the "Swiss Army knife of sound," the system lets users control audio through textual descriptions, opening up new possibilities in music, film, education, and video games.
A New Era in Sound Creation
Unlike other AI models focused on musical composition or voice modification, Fugatto stands out for its versatility and precision. Short for Foundational Generative Audio Transformer Opus 1, it can create mixes of music, voices, and sounds from textual descriptions and audio files. Its capabilities include crafting melodies from scratch, adding or removing instruments in an existing song, modifying the accent or emotion of a voice, and even generating entirely new sounds.
Ido Zmishlany, a multi-platinum music producer and co-founder of One Take Audio — a company within NVIDIA’s Inception program for innovative startups — described the model as “incredible.” “The ability to create completely new sounds in the studio is revolutionary. This marks a new chapter in the history of music,” he stated.
Potential Across Multiple Sectors
Fugatto is not just a tool for musicians. Notable use cases include:
- Music Production: Composers can prototype songs, experiment with different styles and instruments, and enhance the audio quality of existing tracks.
- Advertising: Agencies can customize campaigns by adapting voices with different accents and emotions for specific audiences.
- Education: Language learning tools can utilize customized voices, such as those of family members or friends.
- Video Games: Developers can modify pre-recorded sounds or generate new sound effects in real-time based on player actions.
The Technology Behind the Advancement
Fugatto has 2.5 billion parameters and was trained on NVIDIA DGX systems equipped with 32 NVIDIA H100 Tensor Core GPUs. Its ability to creatively combine instructions — such as generating a French-accented voice with a melancholic tone — is made possible through techniques like ComposableART. It can also interpolate between sounds over time, enabling dynamic soundscapes such as a storm dissipating into a sunrise filled with birdsong.
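NVIDIA has not published Fugatto's implementation details in this article, but combining independently weighted instructions is commonly done in generative models via composed classifier-free guidance: each instruction contributes its own guidance direction, scaled by its own weight, and varying the weights over time yields a crossfade between instructions. The sketch below is a minimal illustration of that general idea, with hypothetical names and plain Python lists standing in for model predictions — not Fugatto's actual code.

```python
def composed_guidance(uncond, conds, weights):
    """Combine an unconditional prediction with several conditional ones.

    Composed classifier-free guidance: the output is the unconditional
    prediction plus each instruction's guidance direction (conditional
    minus unconditional), scaled by that instruction's weight. All inputs
    are same-length lists of floats standing in for model outputs.
    """
    out = list(uncond)
    for cond, w in zip(conds, weights):
        # Push the output toward this instruction, proportionally to w.
        out = [o + w * (c - u) for o, c, u in zip(out, cond, uncond)]
    return out


# Hypothetical usage: crossfade two instructions over five steps, ramping
# the "storm" weight down while the "birdsong" weight ramps up.
uncond = [0.0, 0.0]          # placeholder unconditional prediction
storm = [1.0, -1.0]          # placeholder "storm" conditional prediction
birds = [-1.0, 1.0]          # placeholder "birdsong" conditional prediction
steps = 5
for i in range(steps):
    t = i / (steps - 1)
    mixed = composed_guidance(uncond, [storm, birds], [1 - t, t])
```

The per-instruction weights are what make the combination controllable: a listener-facing knob per instruction rather than a single global guidance scale.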
The model also excels at tasks it wasn’t specifically trained for, such as generating high-quality sung voices from simple textual descriptions.
A Global Collaboration
The development of Fugatto, led by a diverse team of researchers from countries including India, Brazil, China, Jordan, and South Korea, took over a year. The team trained the model on millions of audio samples and developed techniques that expanded its range and accuracy without requiring additional data.
According to Rafael Valle, NVIDIA’s director of applied audio research and one of the project leads, “Fugatto represents a step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale.”
Innovation that Inspires
The developers of Fugatto had memorable moments during development. One came when the model responded to a prompt to generate electronic music synchronized with barking dogs. "When the team burst into laughter, I knew we had created something special," Valle recalled.
Fugatto promises to transform the way sound is created and perceived, establishing itself as an essential tool for artists and creatives around the world. NVIDIA continues to demonstrate its leadership in leveraging AI to push the boundaries of technological innovation.