NVIDIA and Mistral AI join forces to bring the open Mistral 3 models into the era of “distributed intelligence”

The race for open AI innovation takes a major leap forward with the joint announcement from NVIDIA and French startup Mistral AI. Both companies have introduced the new Mistral 3 family of models, a series of open-source, multilingual, multimodal models optimized end-to-end for NVIDIA infrastructure — from superclusters with GB200 NVL72 to RTX GPUs on PCs and Jetson edge devices.

The centerpiece of this launch is Mistral Large 3, a mixture-of-experts (MoE) model combining efficiency and scale: instead of activating all of its parameters for each token, it routes each token through only the relevant “experts,” reducing compute costs without sacrificing performance. The clear goal is to make large-scale enterprise AI not only feasible but cost-effective.
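
To make the routing idea concrete, below is a minimal PyTorch sketch of top-k mixture-of-experts routing. The expert count, top-k value, and dimensions are arbitrary illustrations, not Mistral Large 3’s actual configuration.

    import torch
    import torch.nn.functional as F

    n_experts, top_k, d_model = 8, 2, 512   # illustrative sizes only

    router = torch.nn.Linear(d_model, n_experts)
    experts = torch.nn.ModuleList(
        torch.nn.Linear(d_model, d_model) for _ in range(n_experts)
    )

    def moe_forward(x):                                 # x: (tokens, d_model)
        probs = F.softmax(router(x), dim=-1)            # routing probabilities
        top_p, top_i = probs.topk(top_k, dim=-1)        # keep k experts per token
        top_p = top_p / top_p.sum(-1, keepdim=True)     # renormalize kept weights
        out = torch.zeros_like(x)
        for i, expert in enumerate(experts):            # only routed experts run
            hit = (top_i == i).any(-1)                  # tokens sent to expert i
            if hit.any():
                w = top_p[hit][top_i[hit] == i].unsqueeze(-1)
                out[hit] += w * expert(x[hit])
        return out

    print(moe_forward(torch.randn(4, d_model)).shape)   # torch.Size([4, 512])

Because only the selected experts execute for each token, the active parameter count (41 billion) can be a small fraction of the total (675 billion).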


Mistral Large 3: 675 billion parameters designed for Blackwell

According to the published technical data, Mistral Large 3 is a sparse model with 675 billion total parameters, of which only 41 billion are active per token, and it features a context window of 256,000 tokens. It is tailored for high-reasoning agents, extensive document analysis, and complex multimodal workflows.

The model was trained on NVIDIA Hopper GPUs and specifically optimized for the new Blackwell architecture, particularly for systems like GB200 NVL72, which combine 72 Blackwell GPUs into a single system with high-speed NVLink interconnects.

NVIDIA reports that, with this hardware and these software optimizations, Mistral Large 3 achieves up to 10x the inference performance of the previous H200 generation, surpassing 5 million tokens per second per megawatt while sustaining roughly 40 tokens per second per user. In practice, this translates to a better user experience, lower cost per token, and greater energy efficiency, a critical factor as AI models become major power consumers in data centers worldwide.
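
A quick sanity check on those round numbers (taken as reported, not independently measured) shows what they imply for concurrency:

    # Reported figures: aggregate throughput per megawatt and per-user speed
    tokens_per_s_per_mw = 5_000_000
    tokens_per_s_per_user = 40

    # Implied number of simultaneous user streams served per megawatt
    users_per_mw = tokens_per_s_per_mw / tokens_per_s_per_user
    print(f"~{users_per_mw:,.0f} concurrent streams per MW")  # ~125,000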

This leap is supported by several components of NVIDIA’s stack:

  • Wide Expert Parallelism in TensorRT-LLM, which dynamically distributes and balances model experts across the NVL72’s coherent memory domain.
  • NVFP4, a low-precision format tailored for Blackwell that reduces computation and memory usage while maintaining the accuracy needed for production (illustrated in the sketch after this list).
  • NVIDIA Dynamo, a low-latency distributed inference framework that separates prefill and decode phases, optimizing workloads for long-context scenarios.
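
To illustrate the NVFP4 idea mentioned above, here is a simplified numerical sketch of block-scaled 4-bit quantization: each 16-element block shares one scale, and every value snaps to the nearest FP4 (E2M1) representable number. Real NVFP4 additionally stores the per-block scales in FP8, a detail omitted here.

    import numpy as np

    # Positive FP4 (E2M1) magnitudes; the full grid is signed
    E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    GRID = np.concatenate([-E2M1[:0:-1], E2M1])      # skip the duplicate zero

    def quantize_block(block):
        # One shared scale per 16-element block, mapping it into [-6, 6]
        scale = max(np.abs(block).max() / 6.0, 1e-12)
        nearest = np.abs(block[:, None] / scale - GRID[None, :]).argmin(axis=1)
        return GRID[nearest] * scale                 # dequantized approximation

    block = np.random.randn(16).astype(np.float32)
    print(np.abs(block - quantize_block(block)).mean())  # mean quantization error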

Mistral 3 suite: compact models for RTX PCs, laptops, and Jetson devices

NVIDIA and Mistral’s strategy extends beyond the frontier. Alongside the large model, the French firm has launched the Mistral 3 suite, a collection of dense, high-performance models with 3B, 8B, and 14B parameters, available in Base, Instruct, and Reasoning variants — totaling nine models.

These models are designed for more modest but increasingly relevant environments:

  • PCs and laptops with GeForce RTX AI GPUs
  • NVIDIA DGX Spark workstations
  • Embedded devices like NVIDIA Jetson, targeting robotics, edge computing, and IoT applications

NVIDIA collaborated with popular open-source projects such as llama.cpp and Ollama so that developers and enthusiasts can test Mistral 3 locally, with low latency and stronger data privacy. On top-tier GPUs, such as the RTX 5090, the smaller models reach several hundred tokens per second, making them prime candidates for on-device assistants, edge agents, and disconnected applications.
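
For a local smoke test, a minimal sketch using Ollama’s Python client might look like the following; the model tag is hypothetical, so check Ollama’s model library for the name actually published for Mistral 3.

    import ollama  # pip install ollama; assumes a local Ollama server is running

    response = ollama.chat(
        model="mistral3:8b",  # hypothetical tag for the 8B Instruct variant
        messages=[{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    )
    print(response["message"]["content"])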


An open ecosystem: Apache 2.0, NeMo, and NIM

A key aspect of this announcement is the emphasis on openness. The Mistral 3 family is released under the Apache 2.0 license with open weights, allowing companies and researchers to download, fine-tune, and deploy these models in their own environments without the restrictions typical of many proprietary models.

Additionally, these models integrate with NVIDIA NeMo, the company’s open-source toolset for managing the AI agent lifecycle, including Data Designer, Customizer, Guardrails, and the NeMo Agent Toolkit. This enables organizations to do the following (a guardrails sketch follows the list):

  • Curate and prepare datasets
  • Fine-tune models for specific use cases
  • Implement security and filtering policies (guardrails)
  • Orchestrate complex agents based on Mistral 3
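
As one example of the guardrails step, a minimal NeMo Guardrails sketch could wrap a Mistral 3 backed application as shown below. The config directory is an assumption for illustration; in practice it holds the YAML and Colang files that define the model endpoint and the policies.

    # pip install nemoguardrails
    from nemoguardrails import LLMRails, RailsConfig

    # "./guardrails_config" is a hypothetical directory containing the
    # model endpoint definition and the input/output policy rules
    config = RailsConfig.from_path("./guardrails_config")
    rails = LLMRails(config)

    reply = rails.generate(
        messages=[{"role": "user", "content": "Draft a refund policy summary."}]
    )
    print(reply["content"])  # response after the configured rails are applied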

To facilitate deployment, NVIDIA has optimized inference frameworks such as TensorRT-LLM, vLLM, and SGLang for the entire Mistral 3 family, and announced that the models will be available as NVIDIA NIM microservices, ready to run on any GPU-accelerated infrastructure.
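
For self-hosted serving, a minimal vLLM sketch looks like the following; the Hugging Face model ID is a placeholder, since the announcement does not name the published repositories.

    # pip install vllm
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/<mistral-3-repo>")    # placeholder model ID
    params = SamplingParams(temperature=0.7, max_tokens=256)

    outputs = llm.generate(["Explain mixture-of-experts briefly."], params)
    print(outputs[0].outputs[0].text)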


What does this mean for businesses and developers?

The combination of a giant MoE model in data centers and compact Mistral 3 models at the edge reinforces the idea of “distributed intelligence” pushed by Mistral AI. Organizations can envision architectures where:

  • Advanced, high-reasoning agents run on clusters with GB200 NVL72, handling intensive analysis, planning, or multimodal generation tasks.
  • Clients, branches, factories, or vehicles execute Mistral 3 variants on RTX PCs or Jetson platforms, keeping part of the processing local and reducing cloud reliance.

Because these models are open and permissively licensed, this partnership marks a further step toward democratizing frontier-level AI technology across Europe and globally — crucial amid ongoing discussions on digital sovereignty, energy costs, and dependency on closed ecosystems.


FAQs about Mistral 3 and the NVIDIA partnership

What exactly is the Mistral 3 family?
Mistral 3 is a new generation of open-source, multilingual AI models combining a large mixture-of-experts model (Mistral Large 3, with 675 billion parameters) and a suite of smaller dense models — 3B, 8B, and 14B — available in Base, Instruct, and Reasoning variants. All are optimized to run on NVIDIA hardware, from GB200 NVL72 data centers to RTX GPUs and Jetson edge devices.

How does Mistral Large 3 differ from other large language models?
Its key differentiator is the MoE architecture and its specific optimization for Blackwell. By activating only a subset of experts per token, it reduces computational costs while maintaining high accuracy. On GB200 NVL72, it can deliver up to 10x the inference performance of the previous H200 generation. Its 256,000-token context window allows it to handle large documents and long sessions with ease.

What hardware is needed to run Mistral 3 locally?
Mistral 3 models are designed to run on consumer and edge NVIDIA hardware such as GeForce RTX cards, NVIDIA DGX Spark workstations, and Jetson modules, enabling deployment on desktops or embedded systems. Through integrations with llama.cpp and Ollama, they can run on modern workstations without data center infrastructure, provided there is sufficient video memory for the chosen model size.

Can organizations fine-tune and deploy these models in their own data centers?
Absolutely. With open weights and Apache 2.0 licensing, organizations can download the models, fine-tune with their own data, and deploy across NVIDIA GPU clusters—including H100, H200, GB200, or others—using frameworks like TensorRT-LLM, vLLM, or SGLang. NVIDIA will also offer NIM microservices to streamline deployment in hybrid and multi-cloud setups.

via: NVIDIA Blogs
