Red Hat, a global leader in open-source solutions, has introduced llm-d, a new project aimed at addressing one of the most pressing challenges in the future of generative artificial intelligence: large-scale inference. The project focuses on improving the performance and efficiency with which large language models (LLMs) execute real tasks in production environments.
llm-d has been designed from the ground up for native Kubernetes environments, combining a distributed architecture based on vLLM with AI-aware intelligent network routing. Thanks to this combination, the system can power inference clouds that meet demanding operational and service-level objective (SLO) requirements, even in resource-intensive environments.
While model training remains a foundational pillar, the key to success in generative AI increasingly lies in the inference phase: the moment when pre-trained models are put to work to generate responses, content, or solutions. This is where models translate into real user experiences and business value.
In this regard, a recent report from Gartner highlights that “by 2028, more than 80% of workload accelerators in data centers will be dedicated to inference tasks, not training.” This statistic reinforces the need for tools like llm-d, designed to scale the execution of complex, large-scale models without encountering latency issues or disproportionate costs.
The centralization of inference in large servers is already showing limitations in the face of the increasing volume of requests and the complexity of current models. In this context, llm-d emerges as a flexible, scalable, and open alternative that will enable developers and organizations to deploy more distributed and sustainable inference infrastructures while maintaining high performance.
With this launch, Red Hat strengthens its commitment to open innovation and the evolution of the artificial intelligence ecosystem, providing tools that facilitate the responsible and efficient adoption of technologies based on generative models.
Addressing the Need for Scalable Generative AI Inference with llm-d
Red Hat and its industry partners are directly tackling this challenge with llm-d, a visionary project that amplifies the power of vLLM to transcend the limitations of a single server and unlock production at scale for AI inference. Leveraging Kubernetes’ proven orchestration prowess, llm-d integrates advanced inference capabilities into existing enterprise IT infrastructures. This unified platform allows IT teams to respond to the diverse service demands of mission-critical workloads while deploying innovative techniques to maximize efficiency and drastically reduce the total cost of ownership (TCO) associated with high-performance AI accelerators.
llm-d delivers a powerful suite of innovations, including:
- vLLM, which has quickly become the de facto open-source standard for inference serving, providing day-zero support for emerging frontier models and an extensive list of accelerators, now including Google Cloud’s Tensor Processing Units (TPUs); a minimal usage sketch follows this list.
- Prefill and decode disaggregation, which separates the input-context (prefill) and token-generation (decode) phases into discrete operations that can then be distributed across multiple servers.
- KV (key-value) cache offloading, based on LMCache, which shifts the KV cache memory load from GPU memory to more cost-effective and abundant standard storage, such as CPU memory or network storage.
- Kubernetes-based clusters and controllers for more efficient scheduling of compute and storage resources as workload demands fluctuate, maintaining performance and low latency.
- AI-aware network routing to direct incoming requests to the servers and accelerators most likely to hold "hot" caches from previous inference calculations; a simplified illustration of the idea follows this list.
- High-performance communication APIs for faster and more efficient data transfer between servers, with support for the NVIDIA Inference Xfer Library (NIXL).
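To make the vLLM building block concrete, here is a minimal sketch using vLLM's standard offline inference API. The model name is only a placeholder, and this is not llm-d-specific code; llm-d layers its distributed scheduling and routing on top of engines like this.

```python
# Minimal sketch of vLLM's offline inference API (pip install vllm).
# "facebook/opt-125m" is a placeholder; any supported causal LM can be used.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a batch of prompts.
outputs = llm.generate(["Explain what a KV cache does during inference."], params)
for output in outputs:
    print(output.outputs[0].text)
```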
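The AI-aware routing idea can also be illustrated with a deliberately simplified sketch: requests that share a prompt prefix (for example, the same system prompt) are hashed to the same replica, so that replica's prefill KV cache is more likely to be "hot". The replica pool, URLs, and route function below are hypothetical illustrations, not llm-d's actual routing logic.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    url: str

# Hypothetical pool of vLLM replicas sitting behind the router.
REPLICAS = [
    Replica("pod-a", "http://10.0.0.1:8000"),
    Replica("pod-b", "http://10.0.0.2:8000"),
]

def route(prompt: str, prefix_chars: int = 256) -> Replica:
    """Pick the replica most likely to hold a hot KV cache for this prompt.

    Hashing only the prompt prefix keeps requests that share a prefix
    pinned to the same replica, so its cached prefill work can be reused.
    """
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLICAS)
    return REPLICAS[index]

# Example: two requests with the same system prompt land on the same replica.
system = "You are a helpful assistant. "
print(route(system + "Summarize this report.").name)
print(route(system + "Translate this sentence.").name)
```

A production router would also weigh live load, cache occupancy, and SLO targets rather than a prefix hash alone; the sketch only conveys the cache-affinity principle.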
llm-d: Backed by Industry Leaders
This new open-source project has already garnered support from an impressive coalition of leading generative AI model providers, pioneers in AI accelerators, and prominent AI cloud platforms. CoreWeave, Google Cloud, IBM Research, and NVIDIA are founding collaborators, joined by partners like AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, highlighting the industry’s close collaboration to shape the future of large-scale LLM serving. The llm-d community also receives backing from the founders of the Sky Computing Lab at the University of California, Berkeley, creators of vLLM, and the LMCache Lab at the University of Chicago, creators of LMCache.
Rooted in its strong commitment to open collaboration, Red Hat recognizes the critical importance of dynamic and accessible communities in the fast-paced landscape of generative AI inference. Red Hat will actively promote the development of the llm-d community, fostering an inclusive environment for newcomers and driving its continual evolution.
Red Hat’s Vision: Any Model, Any Accelerator, Any Cloud
The future of AI must be defined by limitless opportunities, not by the limitations imposed by infrastructure silos. Red Hat envisions a future where organizations can deploy any model, on any accelerator, across any cloud, delivering an exceptional and more consistent user experience without exorbitant costs. To unlock the true potential of investments in generative AI, businesses need a universal inference platform: a standard for smoother, high-performance AI innovation, both now and in the future.
Just as Red Hat pioneered the open enterprise by transforming Linux into the foundation of modern IT, it is now poised to shape the future of AI inference. vLLM serves as a central axis for standardized generative AI inference, and Red Hat is committed to building a thriving ecosystem around not only the vLLM community but also llm-d for distributed inference at scale. The vision is clear: regardless of the AI model, the underlying accelerator, or the deployment environment, Red Hat intends to make vLLM the definitive open standard for inference in the new hybrid cloud.