AWS will bring Cerebras to Bedrock to speed up AI inference

Amazon Web Services aims to strengthen one of the most sensitive fronts in the current AI race: inference speed. AWS and Cerebras have announced a collaboration under which Amazon’s cloud will deploy Cerebras CS-3 systems in its data centers and make them available to customers through Amazon Bedrock. According to both companies, the service will launch in the coming months, and later in 2026 it will also support running prominent open models and Amazon Nova models on Cerebras hardware.

This news is significant because it’s not just about adding another hardware option to AWS’s catalog, but about testing a different architecture to serve large-scale generative models quickly. Instead of performing all inference on a single type of processor, Amazon and Cerebras want to separate two distinct phases: prefill, which processes the prompt or initial context, and decode, which generates output tokens. AWS argues that this division will allow each chip to do what it’s best suited for.
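
To make the split concrete, here is a minimal, illustrative sketch of the two phases in a generic autoregressive model. It is not AWS, Trainium, or Cerebras code: the `TinyToyModel` and its methods are hypothetical stand-ins, included only so the example runs end to end and shows where the announced design would place each phase (prefill on Trainium, decode on CS-3).

```python
# Illustrative sketch of the two inference phases described in the article.
# TinyToyModel is a placeholder, not a real runtime; only the structure matters.

class TinyToyModel:
    def prefill(self, prompt_tokens):
        # Prefill: the whole prompt is processed in one parallel pass and the
        # KV cache is built. This step is compute-heavy (large matrix multiplies).
        kv_cache = list(prompt_tokens)
        next_token = (sum(prompt_tokens) + 1) % 100  # toy "prediction"
        return kv_cache, next_token

    def decode_step(self, token, kv_cache):
        # Decode: one token per step. Every step rereads the model weights and the
        # growing KV cache, so this phase leans on memory bandwidth, not raw FLOPs.
        kv_cache.append(token)
        next_token = (token + 1) % 100  # toy "prediction"
        return kv_cache, next_token


def generate(model, prompt_tokens, max_new_tokens):
    kv_cache, token = model.prefill(prompt_tokens)   # phase 1: prefill
    output = []
    for _ in range(max_new_tokens):                   # phase 2: decode loop
        output.append(token)
        kv_cache, token = model.decode_step(token, kv_cache)
    return output


if __name__ == "__main__":
    print(generate(TinyToyModel(), prompt_tokens=[3, 7, 42], max_new_tokens=5))
```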

A disaggregated architecture for an increasingly visible bottleneck

The technical approach behind this alliance revolves around a concept that is simple to state but complex to execute: AWS will use Trainium for prefill, while Cerebras CS-3 will handle decode. The two parts will be connected via Elastic Fabric Adapter, Amazon’s high-performance interconnect. The company states that this “disaggregated” setup could offer up to five times more high-speed token capacity within the same physical hardware footprint. For now, this figure should be treated as a product promise announced by the companies, not an independently validated measurement in public production.

The reasoning behind this design makes considerable sense given the current state of AI. Prefill is a highly parallel, compute-intensive task, while decode depends on continuous memory access, because the model’s weights must be reread for every token it generates. Cerebras has long claimed that its advantage lies in this second phase. Its CS-3 system is built around the Wafer-Scale Engine and on-chip SRAM, with a bandwidth the company estimates at 21 PB/s, aiming to ease the typical GPU bottleneck of repeatedly fetching weights during generation.
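
A simplified, roofline-style estimate helps show why that bottleneck matters. The model size and GPU bandwidth below are illustrative assumptions, not figures from the announcement; only the 21 PB/s SRAM number comes from Cerebras.

```python
# Back-of-envelope estimate (not from the announcement): why decode is bandwidth-bound.
# In the simplest case (batch size 1, ignoring KV-cache traffic, quantization, and
# multi-device sharding), every generated token requires reading all weights once, so:
#     tokens_per_second ≈ memory_bandwidth / bytes_of_weights

params = 70e9                 # hypothetical 70B-parameter model
bytes_per_param = 2           # 16-bit weights
weight_bytes = params * bytes_per_param   # ~140 GB read per generated token

hbm_bandwidth = 3.35e12       # ~3.35 TB/s, roughly an H100-class HBM figure
sram_bandwidth = 21e15        # 21 PB/s, the on-wafer SRAM figure Cerebras cites

print(f"HBM-bound ceiling:  ~{hbm_bandwidth / weight_bytes:,.0f} tokens/s per sequence")
print(f"SRAM-bound ceiling: ~{sram_bandwidth / weight_bytes:,.0f} tokens/s per sequence")
# Real systems recover throughput with batching, sharding, and quantization; this is only
# a roofline-style illustration of where the per-sequence decode bottleneck sits.
```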

This narrative aligns with current market trends. Inference is no longer just a secondary phase after training; companies increasingly recognize that cost, latency, and response speed are decisive when deploying assistants, agents, or coding tools. In the official announcement, Cerebras states that agentic coding can generate approximately 15 times more tokens per query than a conventional chatbot, increasing the pressure on inference infrastructure. AWS frames this collaboration as a response to bottlenecks in demanding workloads like real-time code assistance and interactive applications.

AWS strengthens Bedrock without abandoning its own silicon

One of the most interesting aspects of the announcement is that Amazon is not replacing its internal chip strategy but expanding it. Trainium remains central to the joint design, positioned as the processor best suited for prefill. The company describes Trainium as a custom AI chip designed for scalability and cost-efficiency in generative workloads. Its latest documentation presents Trainium3 as its first 3 nm chip, aimed at workloads such as agentic reasoning and video generation.

This means that the Cerebras partnership does not contradict AWS’s commitment to Trainium but complements it where Amazon believes it can gain more performance. It also reinforces Amazon Bedrock’s role as an access layer for AI models and services. Bedrock supports both proprietary models like Amazon Nova and third-party models, with official documentation showing Nova’s integration and options for text, multimodal, and reasoning tasks. The new promise is that parts of this offering could benefit from a much faster inference layer.
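
For context, this is how a Nova model is invoked on Bedrock today through the Converse API with boto3. The announcement does not specify how the Cerebras-backed capacity will be selected or exposed, so this snippet only illustrates the access layer the article describes; the model ID, region, and parameters are illustrative and may differ per account (some regions require a regional inference profile ID for Nova).

```python
# Invoking an Amazon Nova model on Bedrock via the Converse API (boto3).
# Illustrative only: the announcement does not say how Cerebras-backed inference
# will surface here, and the model ID/region shown may not match every account.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-lite-v1:0",  # may need an inference profile ID in some regions
    messages=[{"role": "user", "content": [{"text": "Summarize prefill vs. decode."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```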

Strategically, AWS emphasizes that this new solution will run within the standard AWS cloud infrastructure and on the AWS Nitro System, ensuring CS-3 systems and Trainium servers maintain the same levels of isolation, security, and operational consistency that customers expect. This is an important message because Cerebras is traditionally seen as a highly specialized, differentiated platform, and AWS needs to present this integration as a natural extension of its cloud, not as an exotic standalone environment.

Speed matters more than ever, but real-world testing is still pending

The announcement is supported by bold figures. Cerebras claims to be running models for companies like OpenAI, Meta, and Cognition at speeds up to 3,000 tokens per second, asserting that its architecture can be up to 15 times faster than GPU-based alternatives in certain inference scenarios. These impressive numbers help explain AWS’s interest in this technology, but it’s important to distinguish between the performance demonstrated by Cerebras in its own environment and the real-world behavior of this integrated offering once it’s live on Amazon Bedrock, with diverse clients, models, and workloads.

Additionally, some caution is warranted. AWS and Cerebras have stated that both disaggregated and integrated configurations will be supported, so not everything will automatically shift to the Trainium-for-prefill, CS-3-for-decode setup. Both companies acknowledge that many clients operate with mixed workloads and shifting proportions of context and generation, meaning a traditional architecture will still make sense in certain cases. In other words, the collaboration aims to significantly improve inference performance for specific profiles, but it is not necessarily a universal replacement for conventional deployments.

Nonetheless, the broader message is clear. AWS wants to avoid the AI cloud battle becoming just a contest of who has the most GPUs. Instead, it is crafting a narrative where its combination of proprietary silicon, networking, Bedrock, and specialized partners can offer something different. Cerebras, meanwhile, gains entry into the largest cloud market with a proposition aligned to increasing demand: fast inference for agents, assistants, and applications that can no longer afford delays. It remains to be seen whether the promised performance will hold at scale, but the trend is clear: in this new phase of AI, response speed is becoming nearly as strategic as model quality.

Frequently Asked Questions

What exactly have AWS and Cerebras announced?

AWS announced it will deploy Cerebras CS-3 systems in its data centers and offer them to customers via Amazon Bedrock. Both companies are also collaborating on a disaggregated inference architecture that combines AWS Trainium for prefill and Cerebras for decode.

When will this new infrastructure be available on AWS?

According to Amazon’s official announcement, the solution will arrive in the coming months. The deployment of prominent open models and Amazon Nova on Cerebras hardware is expected later in 2026.

What is disaggregated inference?

It’s an approach that splits inference into two phases: prefill, which processes initial context, and decode, which generates response tokens one by one. AWS and Cerebras claim that using different hardware for each phase can improve speed and capacity in certain workloads.

Is Amazon Nova already part of Bedrock?

Yes. AWS already offers Amazon Nova models in Bedrock. What’s new in this announcement is the plan to run part of this offering on Cerebras-based accelerated infrastructure to prioritize inference speed.

via: cerebras.ai