Generative multimodal artificial intelligence will drive the next major shift in enterprise software, combining text, voice, video, images, and numerical data into a single intelligent experience.
According to Gartner’s latest forecasts, 80% of enterprise software and applications will incorporate multimodal capabilities by 2030, up from less than 10% in 2024. This evolution is driven by advances in multimodal generative AI (GenAI) models, which can process multiple types of data simultaneously, from text and images to voice and video.
In its Emerging Tech Impact Radar: Generative AI report, the technology consulting firm highlights that multimodal GenAI models are already at the forefront of product innovation, especially in sectors like healthcare, finance, manufacturing, and retail. The transition from text-focused models to systems capable of understanding and generating content across different formats and contexts marks a turning point in the history of enterprise software.
“We are witnessing a fundamental transformation of enterprise software. AI’s ability to combine text, voice, images, and operational data in real-time enables a level of automation and contextual intelligence that was previously unimaginable,” explained Roberta Cozza, Gartner senior analyst.
📡 Multimodality: The Next Frontier in Software
Multimodality is defined as an AI model’s ability to work with various types of input and output data: text, audio, video, images, and numeric values. While many current models support two or three modalities—for example, text-to-image or voice-to-text—the trend points toward full integration of modalities in the coming years.
This means that, for instance, a healthcare app could read an MRI scan, interpret a written medical report, and generate a spoken response—all within the same intelligent system.
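To make the definition concrete, here is a minimal Python sketch of how a request mixing several modalities might be represented. The names `Modality`, `Part`, and `MultimodalRequest` are hypothetical and used for illustration only; they do not come from Gartner's report or from any particular vendor's API.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    """The data types listed in the definition above."""
    TEXT = "text"
    AUDIO = "audio"
    VIDEO = "video"
    IMAGE = "image"
    NUMERIC = "numeric"


@dataclass(frozen=True)
class Part:
    """One piece of input, tagged with its modality (hypothetical type)."""
    modality: Modality
    payload: bytes | str | list[float]


@dataclass
class MultimodalRequest:
    """Hypothetical container: several inputs plus the desired output modality."""
    inputs: list[Part]
    output_modality: Modality

    def modalities(self) -> set[Modality]:
        # Every modality the system must handle to serve this request.
        return {part.modality for part in self.inputs} | {self.output_modality}


# The healthcare example: MRI scan (image) + written report (text) -> spoken answer (audio).
# The payloads are placeholders, not real data.
request = MultimodalRequest(
    inputs=[
        Part(Modality.IMAGE, b"<mri-scan-bytes>"),
        Part(Modality.TEXT, "Written medical report ..."),
    ],
    output_modality=Modality.AUDIO,
)
print(sorted(m.value for m in request.modalities()))
```

Under this sketch, the healthcare request spans three of the five modalities; a fully integrated system in the sense described above would accept any combination of them.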
🧠 Generative AI as the Core of Product Decision-Making
Gartner emphasizes that product leaders should prepare to reevaluate their technology roadmaps. Incorporating multimodal capabilities is not just an aesthetic or interface improvement; it signifies a new development paradigm where software becomes a proactive layer of assistance, automation, and value creation.
“Companies that integrate multimodal capabilities will be able to deliver more human, natural, and efficient experiences. Software will evolve from being a tool to an intelligent collaborator,” added Cozza.
🏥🏛️🏭 Sectoral Impact: From Healthcare to Heavy Industry
Gartner highlights several sectors where multimodal GenAI will have an immediate and transformative impact:
- Healthcare: Analyzing medical images, interpreting clinical histories, generating spoken diagnostic reports.
- Finance: Reading financial documents, detecting patterns in voice and text, creating personalized reports.
- Industry: Predictive maintenance based on sensor data, visual recognition in production environments, real-time voice alerts.
🔄 User Experience Reimagined
One of the most significant changes will be in user interfaces. Applications will shift from being purely visual or textual to adopting combined conversational, visual, and auditory modes. An enterprise assistant, for example, will be able to receive a PDF, interpret it, chat with users to confirm data, and automatically generate dashboards based on extracted KPIs.
This paves the way for a new paradigm: software as an active interlocutor, capable of interacting across multiple channels simultaneously and coherently.
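The enterprise-assistant flow described above (receive a document, interpret it, confirm the data with the user, build a dashboard) can be sketched roughly as follows. This is a toy illustration under stated assumptions: the PDF is assumed to be already converted to plain text, a simple regular expression stands in for the model's interpretation step, and all function names and the dashboard format are hypothetical.

```python
import re

# Assumption: KPIs appear as "Name: value unit" lines. A real multimodal model
# would replace this pattern with its own document understanding.
KPI_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z ]+):[ \t]*(?P<value>-?\d+(?:\.\d+)?)[ \t]*(?P<unit>%|[A-Za-z]+)?[ \t]*$",
    re.MULTILINE,
)


def extract_kpis(document_text: str) -> dict[str, tuple[float, str]]:
    """Interpret step: pull (value, unit) pairs out of the document text."""
    return {
        m["name"].strip(): (float(m["value"]), m["unit"] or "")
        for m in KPI_PATTERN.finditer(document_text)
    }


def confirmation_prompts(kpis: dict[str, tuple[float, str]]) -> list[str]:
    """Chat step: one question per extracted KPI for the user to confirm."""
    return [
        f"I read {name} as {value:g} {unit}. Is that correct?"
        for name, (value, unit) in kpis.items()
    ]


def build_dashboard(kpis: dict[str, tuple[float, str]], confirmed: set[str]) -> dict:
    """Dashboard step: keep only the KPIs the user confirmed."""
    return {
        "widgets": [
            {"title": name, "value": value, "unit": unit}
            for name, (value, unit) in kpis.items()
            if name in confirmed
        ]
    }


# Placeholder text standing in for the contents of an uploaded PDF.
text = "Quarterly report\nRevenue: 4.2 M\nChurn: 3.1 %\n"
kpis = extract_kpis(text)
for prompt in confirmation_prompts(kpis):
    print(prompt)
print(build_dashboard(kpis, confirmed={"Revenue"}))
```

Keeping the confirmation step between extraction and dashboard generation reflects the article's point that the assistant converses with the user rather than acting on unverified data.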
🌐 An Opportunity… and a Regulatory Challenge
While the progress is promising, Gartner warns about inherent risks. Centralizing sensitive data within multimodal models, training on critical information, and designing conversational interfaces pose technical, legal, and ethical challenges. Transparency, traceability, and governance of these models will be essential.
🔮 Toward a New Generation of Autonomous Applications
Gartner’s vision extends beyond the technology itself: the firm sees multimodal AI as the engine for a new generation of proactive software capable of acting autonomously in specific scenarios. This will influence both architecture design and business strategies.
From hyper-automation of processes to predictive customer support, multimodal GenAI will fundamentally reshape enterprise software within the next five years.
📌 Key Insights from Gartner’s Report
| Year | % of enterprise software with multimodal capabilities | 
|---|---|
| 2024 | Less than 10% | 
| 2025 | Estimated 20-30% | 
| 2030 | 80% | 
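For a sense of scale, the endpoint figures in the table imply a steep adoption curve. The short Python sketch below computes the implied compound annual growth rate between 2024 and 2030. It assumes a 2024 share of exactly 10% (the table says "less than", so the result is a lower bound) and smooth year-over-year compounding, which is a simplifying assumption rather than anything Gartner states.

```python
def implied_annual_growth(start_share: float, end_share: float, years: int) -> float:
    """Compound annual growth rate needed to go from start_share to end_share."""
    return (end_share / start_share) ** (1 / years) - 1


# Assumption: 10% in 2024 (an upper bound on the real figure) and 80% in 2030.
rate = implied_annual_growth(0.10, 0.80, 2030 - 2024)
print(f"Implied growth in share: at least {rate:.1%} per year")
```

Under these assumptions the share of multimodal software would need to grow by roughly 41% per year for six consecutive years.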
📚 More Information
- Full report: Emerging Tech Impact Radar: Generative AI
- Executive summary: Top Use Cases for Generative AI
- Upcoming event: Gartner IT Symposium/Xpo 2025 — with special coverage on AI and enterprise tech
In summary: Multimodality isn’t a futuristic option; it’s the next natural step in software evolution. Organizations that don’t adapt their development strategies risk falling behind in an environment where AI will be omnichannel, ubiquitous, and increasingly intelligent.

