NVIDIA and OpenAI bring inference to 1.5 million tokens per second with GPT-OSS models on the Blackwell architecture

NVIDIA and OpenAI have made a significant leap in AI inference performance with the release of the open-weight models gpt-oss-20b and gpt-oss-120b, optimized for the Blackwell architecture. According to NVIDIA, the larger model reaches up to 1.5 million tokens per second (TPS) on an NVIDIA GB200 NVL72 system, enough to serve roughly 50,000 concurrent users.
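
As a quick sanity check on those figures (back-of-the-envelope arithmetic, not a number from the announcement), the aggregate throughput divided by the stated user count works out to roughly 30 tokens per second per user:

    # Rough per-user throughput implied by the reported figures (illustrative only).
    aggregate_tps = 1_500_000   # reported peak tokens/second on a GB200 NVL72
    concurrent_users = 50_000   # reported number of concurrent users
    per_user_tps = aggregate_tps / concurrent_users
    print(per_user_tps)         # 30.0 tokens/second per user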

These text-based reasoning models support chain-of-thought reasoning and tool calling, and use a Mixture of Experts (MoE) architecture with SwiGLU activations. Their attention layers use RoPE and support contexts of up to 128,000 tokens, alternating between full attention and a sliding window of 128 tokens.
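
A minimal sketch of what that alternation could look like, expressed as NumPy boolean masks; which layers use full attention versus the 128-token sliding window, and the exact masking details, are illustrative assumptions rather than the models' actual implementation:

    import numpy as np

    def attention_mask(seq_len: int, layer_idx: int, window: int = 128) -> np.ndarray:
        """Causal mask for one layer, alternating full and sliding-window attention."""
        # Causal constraint: each position attends only to itself and earlier positions.
        causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        if layer_idx % 2 == 0:
            # Assumed: even layers use full (dense) causal attention over the whole context.
            return causal
        # Assumed: odd layers restrict attention to the most recent `window` tokens.
        offsets = np.arange(seq_len)
        within_window = (offsets[:, None] - offsets[None, :]) < window
        return causal & within_window

    mask = attention_mask(seq_len=1024, layer_idx=1)  # sliding-window layer in this sketch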

Both versions are released in FP4 precision, which lets even the 120-billion-parameter model run on a single data center GPU with 80 GB of memory while taking full advantage of Blackwell’s native FP4 support.
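
That memory claim is easy to sanity-check with back-of-the-envelope arithmetic: at 4 bits per weight, the parameter counts from the specification table below translate into footprints under the stated GPU memory (ignoring activations, the KV cache, and quantization scale factors):

    # Approximate weight footprint in FP4 (4 bits = 0.5 byte per parameter).
    BYTES_PER_PARAM_FP4 = 0.5
    for name, params in [("gpt-oss-120b", 117e9), ("gpt-oss-20b", 20e9)]:
        weights_gb = params * BYTES_PER_PARAM_FP4 / 1e9
        print(f"{name}: ~{weights_gb:.1f} GB of weights")
    # gpt-oss-120b: ~58.5 GB -> fits on a single 80 GB data center GPU
    # gpt-oss-20b:  ~10.0 GB -> fits within 16 GB of GeForce RTX VRAM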


Training and Optimization

Training gpt-oss-120b required over 2.1 million GPU hours on NVIDIA H100 Tensor Core GPUs, while gpt-oss-20b needed roughly a tenth of that. To maximize performance, NVIDIA collaborated with Hugging Face Transformers, Ollama, and vLLM, alongside its own TensorRT-LLM, implementing targeted improvements to attention kernels, MoE routing, and preprocessing.

Key optimizations include:

  • TensorRT-LLM Gen for attention prefill and decode, and low-latency MoE
  • CUTLASS MoE kernels for Blackwell
  • XQA kernel specialized for Hopper
  • FlashInfer library for serving LLMs with optimized attention and accelerated MoE routing
  • Compatibility with the OpenAI Triton kernel for MoE in both TensorRT-LLM and vLLM

Flexible Deployment: From Data Centers to Local PCs

In Data Centers:

  • With vLLM, developers can launch an OpenAI-compatible web server with a single command, which downloads the model automatically (see the client sketch after this list).
  • With TensorRT-LLM, NVIDIA provides guides, Docker containers, and configurations to maximize performance in both low latency and high throughput scenarios.
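
As an illustration of that OpenAI-compatible workflow, the sketch below queries a locally running vLLM server with the standard openai Python client; the endpoint URL, port, and model identifier are assumptions for the example, not values given in the article:

    from openai import OpenAI

    # Assumes a vLLM OpenAI-compatible server is already running locally and
    # serving a gpt-oss checkpoint (URL, port, and model name are illustrative).
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": "Explain disaggregated inference in one paragraph."}],
        max_tokens=256,
    )
    print(response.choices[0].message.content)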

In Enterprise Infrastructure:

  • NVIDIA Dynamo, an open-source inference platform, improves interactivity by up to 4x for long input sequences (32K ISL) on Blackwell thanks to disaggregated serving, which separates the prefill and decode phases of inference across different GPUs.
  • Models are offered as NVIDIA NIM microservices, ready to deploy on any GPU-accelerated infrastructure, with control over privacy and security.

On Local Environments:

  • The gpt-oss-20b model can run on any PC with an NVIDIA GeForce RTX GPU with at least 16 GB of VRAM, or on professional workstations with RTX PRO GPUs. It is compatible with Ollama, llama.cpp, and Microsoft AI Foundry Local (see the sketch after this list).
  • Developers can test in RTX AI Garage with preconfigured environments.
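
For local experimentation, a minimal sketch using the ollama Python package is shown below; the model tag and prompt are illustrative, and it assumes Ollama is installed and the 20B model has already been pulled:

    import ollama

    # Assumes a local Ollama installation with the 20B model already pulled
    # (the "gpt-oss:20b" tag is illustrative).
    response = ollama.chat(
        model="gpt-oss:20b",
        messages=[{"role": "user", "content": "What is a Mixture of Experts model?"}],
    )
    print(response["message"]["content"])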

Designed for Scalability

The GB200 NVL72 system combines 72 Blackwell GPUs with fifth-generation NVLink and NVLink Switch, operating as a single large-scale GPU. The second-generation Transformer Engine with FP4 Tensor Cores, along with massive interconnect bandwidth, enables inference peaks previously unreachable for models of this size.

NVIDIA claims this advance reinforces the platform’s ability to serve cutting-edge models from day one, with high performance and low cost per token, both in cloud and on-premise environments.


Model Specifications

Model           Transformer Blocks   Total Parameters   Active Params per Token   Experts   Active Experts per Token   Max Context
gpt-oss-20b     24                   20B                3.6B                      32        4                          128K
gpt-oss-120b    36                   117B               5.1B                      128       4                          128K

Conclusion

The collaboration between NVIDIA and OpenAI on gpt-oss models sets a new benchmark in large-scale language model inference, not just through performance leaps but also through deployment flexibility: from cloud environments to desktop PCs and production-ready microservices.

With an optimized ecosystem integrating hardware, kernels, and frameworks, the goal is clear: to bring high-performance AI closer to any developer, anywhere.

via: developer.nvidia.com
