Red Hat AI 3 Brings Distributed Inference to Production: An Open Platform for Agents, Kubernetes, and Any Accelerator

Red Hat has announced Red Hat AI 3, a major evolution of its enterprise AI platform that integrates Red Hat AI Inference Server, RHEL AI, and OpenShift AI to address the biggest bottleneck of 2025: operationalizing inference at scale (the “doing” phase) and moving from proofs of concept to production without rebuilding infrastructure. The proposal revolves around three ideas: native distributed inference in Kubernetes (llm-d), a unified platform experience (internal MaaS, AI Hub, Gen AI Studio), and foundations for autonomous AI built on open APIs and standards.

The move comes as the market shifts from training to massive real-time inference and agents, and as CIOs seek to cut latency, per-token costs, and complexity without sacrificing privacy or hardware freedom. Underlying this is an uncomfortable reality: according to MIT’s NANDA project, around 95% of organizations still see no measurable returns despite a collective corporate spend of $40 billion. Red Hat AI 3 aims to close this gap with an open, multi-vendor platform that supports any model on any accelerator, from data centers to public cloud, sovereign environments, and edge.


From training to “doing”: llm-d turns vLLM into a distributed, Kubernetes-native service

The most striking technical innovation is the general availability of llm-d in OpenShift AI 3.0. Built on the vLLM engine, llm-d reimagines how LLMs are served within Kubernetes:

  • Distributed and intelligent inference: orchestration with Kubernetes and the Gateway API Inference Extension, inference-aware prioritization, disaggregated serving, and scheduling that accounts for variable load across prefill, decode, and long context windows.
  • Open components for performance: integration with NVIDIA Dynamo (NIXL) for KV transfer, and with DeepEP for Mixture-of-Experts (MoE) communication; designed for large models and high fan-out loads.
  • “Well-lit Paths”: prescriptive routes that standardize deployment and operations, preventing teams from building fragile stacks of disparate tools.
  • Cross-vendor accelerator support: NVIDIA and AMD out of the box, aiming to maximize the value of existing hardware investments.

Practically, llm-d takes the best of vLLM (high performance on a single node) and transforms it into a coherent, scalable inference service, complete with monitoring, reliability, and capacity planning oriented toward ROI. The message to executive leadership is clear: predictability and control over costs and performance when tokens are counted in millions.
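
A minimal sketch of what consuming such a service looks like from application code, assuming the platform exposes vLLM’s OpenAI-compatible HTTP API behind a cluster gateway. The gateway URL, model name, and auth header below are placeholder assumptions, not values defined by Red Hat AI 3.

```python
# Sketch: calling a model served by vLLM/llm-d through its OpenAI-compatible API.
# GATEWAY_URL, API_KEY, and the model name are illustrative assumptions.
import requests

GATEWAY_URL = "https://llm-gateway.example.internal/v1/chat/completions"  # assumed route
API_KEY = "sk-internal-placeholder"                                        # assumed auth scheme

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # whichever model your platform serves
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize our Q3 incident report in three bullets."},
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

resp = requests.post(
    GATEWAY_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The point is that client code stays the same whether the backend is a single vLLM pod or a disaggregated llm-d deployment; routing, prioritization, and scaling happen behind the gateway.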


A unified platform for collaboration (internal MaaS, AI Hub, and Gen AI Studio)

Red Hat AI 3 packages a platform experience designed so that platform engineering and AI teams can work within the same framework:

  • Model as a Service (MaaS): IT can operate as a model provider for the organization, serving common models from a central point with on-demand access for applications and developers. This enables cost management and reuse, and covers cases that cannot run on public services for privacy or sovereignty reasons (a minimal consumption sketch follows after this list).
  • AI Hub: a hub for exploring, deploying, and managing AI assets: a curated catalog of validated and optimized models, a registry for lifecycle management, and a deployment environment with configurable observability.
  • Gen AI Studio: an interactive environment for prototyping: a playground for testing prompts, adjusting parameters, and building chat or RAG prototypes, plus an AI assets endpoint for discovering available models and MCP servers (Model Context Protocol), essential when models need to call external tools.
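
To make the MaaS idea concrete, here is a minimal sketch of how an application team might consume a centrally served model and attribute token usage for chargeback. The endpoint, credential, model identifier, and per-token rates are illustrative assumptions; the only fact relied on is that OpenAI-compatible responses include a usage field with token counts.

```python
# Sketch: consuming an internal MaaS endpoint and estimating per-request cost.
# URL, key, model name, and chargeback rates below are placeholders.
import requests

MAAS_URL = "https://maas.example.internal/v1/chat/completions"  # assumed internal route
TEAM_KEY = "team-credential-placeholder"                         # assumed per-team credential

COST_PER_1K_PROMPT_USD = 0.0004      # internal chargeback rates: pure placeholders
COST_PER_1K_COMPLETION_USD = 0.0016

resp = requests.post(
    MAAS_URL,
    json={
        "model": "internal/default-chat-model",  # placeholder model identifier
        "messages": [{"role": "user", "content": "Classify this support ticket: ..."}],
    },
    headers={"Authorization": f"Bearer {TEAM_KEY}"},
    timeout=60,
)
resp.raise_for_status()
usage = resp.json()["usage"]  # OpenAI-compatible servers report prompt/completion token counts

estimated_cost = (
    usage["prompt_tokens"] / 1000 * COST_PER_1K_PROMPT_USD
    + usage["completion_tokens"] / 1000 * COST_PER_1K_COMPLETION_USD
)
print(f"prompt={usage['prompt_tokens']} completion={usage['completion_tokens']} "
      f"estimated cost=${estimated_cost:.6f}")
```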

In addition, Red Hat offers a set of validated models (e.g., gpt-oss, DeepSeek-R1, Whisper for speech-to-text, Voxtral Mini for voice agents), making it easier to get started without hunting for artifacts online or battling compatibility issues.


Laying the groundwork for autonomous AI (Llama Stack, MCP, and modular customization)

The third pillar is agent-centric. Red Hat OpenShift AI 3.0 introduces:

  • Unified API layer based on Llama Stack: aligning development with OpenAI-compatible protocols and reducing friction between tools.
  • Early adoption of MCP (Model Context Protocol): an emerging standard that lets models interact with external tools safely and uniformly, a key component for composable agents that perform actions (the request shape is sketched after this list).
  • Modular customization kit: built on InstructLab, with Python libraries for data processing (e.g., Docling to convert unstructured documents into an AI-readable format), synthetic data generation, fine-tuning, and an integrated Evaluation Hub for measuring and validating results. The idea is for customers to fine-tune models on their own data with control and traceability.
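
For orientation, the sketch below shows the general shape of the JSON-RPC messages MCP defines for tool discovery and invocation; a real client would use an MCP SDK and a proper transport (stdio or HTTP), and the tool name and arguments here are hypothetical.

```python
# Sketch: the JSON-RPC 2.0 message shapes MCP uses for tools/list and tools/call.
# This only builds the payloads; transport and session handling are omitted.
import json

list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",  # ask the MCP server which tools it exposes
}

call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",  # invoke a tool by name with structured arguments
    "params": {
        "name": "search_internal_docs",  # hypothetical tool exposed by an internal MCP server
        "arguments": {"query": "data residency policy", "top_k": 3},
    },
}

print(json.dumps(list_tools_request, indent=2))
print(json.dumps(call_tool_request, indent=2))
```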

If 2025 sparked the agent fever, 2026 will demand inference infrastructure capable of supporting complex autonomous workflows. Red Hat AI 3 positions itself as the platform where that software is developed, governed, and scaled.


Why this matters to CIOs and platform teams

  1. From silo to shared platform. Inference ceases to be a point service in a VM and becomes a shared capability within the cluster: policies, quotas, telemetry, and SLOs are comparable to any other workload.
  2. Accelerator-agnosticism. The promise of “any model, any accelerator” translates into less lock-in and longer lifespan for investments in Instinct (AMD) or NVIDIA, supported by open stacks like ROCm.
  3. Cost and latency reduction. Disaggregated serving, inference-aware scheduling, and high-performance open libraries push down token costs and stabilize latency.
  4. Compliance and sovereignty. The platform can be deployed across data centers, public clouds, sovereign environments, and edge, aligning privacy and jurisdiction with sector-specific realities.
  5. Standards. Embracing Kubernetes, vLLM, Gateway API, MCP, and Llama Stack mitigates the risk of building isolated technology silos.

What partners say (and what it means)

  • AMD emphasizes the combination of EPYC + Instinct + ROCm, aligning with Red Hat’s multi-vendor approach: not everything will be NVIDIA, especially for IO-bound workloads or where TCO is critical.
  • NVIDIA focuses on accelerated inference and welcomes compatibility with Dynamo/NIXL for KV transfer and with libraries that favor MoE.
  • Customers like ARSAT (Argentina’s connectivity infrastructure operator) highlight two key points: data sovereignty and rapid deployment (a case that went from need to production in 45 days), illustrating that the platform covers the entire lifecycle, not just deployment.
  • Analysts (IDC) point to 2026 as a turning point: the metric will become repeatable results with efficient inference. The edge will go to those who unify the orchestration of increasingly sophisticated workloads in hybrid cloud.

What to watch in the coming weeks if evaluating Red Hat AI 3

  • Benchmarks and playbooks for llm-d on OpenShift AI 3.0: latency SLOs, throughput by request phase (prefill/decode), cost per 1,000 tokens (a small calculation helper is sketched after this list), and KV-cache sharing across sessions.
  • Compatibility with your accelerator fleet (NVIDIA/AMD), drivers, and ROCm/CUDA versions, along with integrated observability (queue metrics, memory, fragmentation).
  • The AI Hub catalog and validation pipelines (quality, bias, guardrails) for regulated environments.
  • MCP integration with internal tools (document search, APIs, RPA) and security around agent tool usage.
  • Model governance: full lifecycle management (registry → deployment → rollbacks → A/B testing → deprecation) and traceability for audits.
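
As a back-of-the-envelope aid for the first point above, the sketch below computes a blended cost per 1,000 tokens and an approximate p95 latency from benchmark measurements you would collect yourself; every number in it is a placeholder, not vendor data.

```python
# Sketch: turning raw benchmark output into two figures worth tracking,
# cost per 1,000 tokens and p95 latency. All inputs below are placeholders.
import statistics

def cost_per_1k_tokens(total_cost_usd: float, total_tokens: int) -> float:
    """Blended cost per 1,000 tokens for a benchmark run."""
    return total_cost_usd / total_tokens * 1000

def p95_latency_ms(latencies_ms: list[float]) -> float:
    """Approximate p95 using the standard library's quantiles helper."""
    return statistics.quantiles(latencies_ms, n=20)[18]  # last cut point ≈ 95th percentile

# Hypothetical benchmark results
run_cost_usd = 12.40          # accelerator time attributed to the run
run_tokens = 9_800_000        # prompt + completion tokens processed
latencies = [420, 510, 395, 610, 880, 450, 530, 720, 470, 940]  # per-request latency in ms

print(f"cost per 1k tokens: ${cost_per_1k_tokens(run_cost_usd, run_tokens):.4f}")
print(f"p95 latency: {p95_latency_ms(latencies):.0f} ms")
```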

A critical note: value comes when inference becomes “boring”

The announcement hits the mark: inference, not heroic training, is what generates revenue. The challenge is making that “doing” predictable, observable, optimizable, and repeatable. If Red Hat AI 3 can make model serving in Kubernetes as routine as deploying a microservice, the conversation will shift from “which model?” to “what SLO does the business need, and at what cost?”. That is where the ROI gap highlighted by MIT’s NANDA project can start to close.


Conclusion

Red Hat AI 3 is, first and foremost, a commitment to standardizing AI within enterprises: llm-d for distributed inference with vLLM on Kubernetes, a unified platform that combines catalog, service, and studio, and a deliberate openness (Llama Stack, MCP, ROCm, Gateway API) to support models and agents on any infrastructure and with any accelerator. The challenge for 2025–2026 is not “more demos,” but turning that capability into SLAs, SLOs, and per-token costs that make the numbers add up. At the very least, the direction is clear.


Frequently Asked Questions

How is llm-d different from “using vLLM in a pod”?
llm-d takes vLLM and elevates it to an integrated, distributed serving system within Kubernetes: inference-aware scheduling, disaggregated serving, support for Gateway API, accelerated KV transfer, and MoE libraries; plus “Well-lit Paths” (prescriptive deployment routes) for reliable scaling.

How does the internal Model-as-a-Service compare to using external APIs?
The internal MaaS enables centralizing models, controlling costs, reusing assets, ensuring privacy, and maintaining sovereignty. External APIs still make sense for demand peaks or non-critical cases, but core business workloads usually require control over their own data and a predictable TCO.

Which accelerators does Red Hat AI 3 support?
The platform is multi-vendor and offers cross-vendor support for NVIDIA and AMD, with open stacks like ROCm and libraries such as Dynamo/NIXL for KV transfer. The goal is to maximize the performance per watt of existing hardware.

What do MCP and the Llama Stack-based layer bring to agents?
MCP standardizes how models use external tools, crucial for composable and secure agents. The unified API based on Llama Stack aligns protocols with ecosystems (including OpenAI compatibility), reducing integration and vendor lock-in barriers.
