Cerebras vs. NVIDIA: Why a Giant Chip Can Win in Inference

Cerebras is once again raising an uncomfortable question for the AI industry: what if the problem isn’t a shortage of GPUs, but an over-reliance on an architecture originally designed for different workloads? Andrew Feldman, co-founder and CEO of Cerebras, has long argued this thesis: large language model inference resembles neither graphics rendering nor traditional large-scale training. It’s primarily a memory problem.

The explanation is simple, though the implications are vast. To generate each token, a language model must move weights from memory to the compute units. If that flow is bottlenecked, a processor can have plenty of theoretical compute and still sit waiting for data. In that regime, speed depends not only on FLOPS but also on where the memory sits, how far it is from the compute units, and the bandwidth the system can actually sustain.
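A back-of-the-envelope sketch can make the memory argument concrete. It assumes a deliberately simplified model in which every generated token requires reading all the weights once and nothing else matters; it ignores batching, KV-cache traffic, speculative decoding, and compute time. The function name and the 8-billion-parameter model are hypothetical and chosen only for illustration; the 3.35 TB/s figure is the published HBM bandwidth of an H100 SXM.

```python
def max_decode_tokens_per_second(bytes_read_per_token: float,
                                 sustained_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed when weight traffic dominates."""
    if bytes_read_per_token <= 0 or sustained_bandwidth_bytes_per_s <= 0:
        raise ValueError("both arguments must be positive")
    return sustained_bandwidth_bytes_per_s / bytes_read_per_token


# Hypothetical dense model: 8 billion parameters stored at 2 bytes each,
# all of them read once per generated token (an assumption for illustration).
weight_bytes = 8e9 * 2

# Published HBM bandwidth of an H100 SXM GPU, in bytes per second.
hbm_bandwidth = 3.35e12

bound = max_decode_tokens_per_second(weight_bytes, hbm_bandwidth)
print(f"bandwidth-bound ceiling: {bound:.1f} tokens/s")
```

Under these assumptions the ceiling is set entirely by bandwidth divided by bytes moved: adding FLOPS does not raise it, while doubling sustained bandwidth doubles it.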

Cerebras doesn’t want many GPUs: they want a single wafer

Cerebras’s approach is radical because it changes the starting point. Where the industry traditionally cuts a silicon wafer into hundreds of small chips, Cerebras uses almost the entire wafer as a single processor. The WSE-3 measures 46,225 mm² and integrates 4 trillion transistors and 900,000 AI-optimized cores, delivering 125 petaFLOPS of peak AI compute, according to the company’s documentation.

The most critical aspect isn’t just size but memory. The WSE-3 includes 44 GB of on-chip SRAM, and Cerebras claims 21 PB/s of memory bandwidth. That figure is hard to compare directly with a conventional GPU because the architectures are so different, but it illustrates the core idea: bring memory and compute as close together as possible to reduce data movement. The CS-3 datasheet also lists the system’s maximum power consumption as 27 kW and describes liquid-cooled cluster configurations for large-scale inference deployments.

| Feature | Cerebras WSE-3 / CS-3 | NVIDIA H100 / DGX B200 |
| --- | --- | --- |
| Approach | Wafer-scale, a single large chip | Discrete GPUs connected in multi-GPU systems |
| Transistors | 4 trillion | H100 and B200 use much smaller individual chips |
| On-chip main memory | 44 GB SRAM on-chip | External HBM in package |
| Memory bandwidth | 21 PB/s in SRAM | H100 SXM: 3.35 TB/s; DGX B200: 64 TB/s combined in HBM3e |
| Dominant complexity | Manufacturing a large, defect-tolerant chip | Coordinating many GPUs, HBM memory, and interconnects |
| Main advantage | Low latency for specific inference tasks | Ecosystem, availability, software, and general performance |

Any comparison with NVIDIA must be made carefully. The H100 is a Hopper-generation GPU, not Blackwell, and a DGX B200 system combines eight GPUs with a total HBM3e bandwidth that NVIDIA rates at 64 TB/s. Still, the architectural difference remains clear: Cerebras concentrates compute and SRAM inside a single wafer; NVIDIA scales through multiple GPUs, HBM, NVLink, NVSwitch, software, and high-speed networks.
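To put the quoted bandwidth figures on one scale, the short sketch below converts them to bytes per second and computes a few ratios. The variable names are illustrative. The ratios compare vendor-stated peak numbers for different memory technologies with very different capacities (44 GB of SRAM versus far larger HBM pools), so they indicate an order of magnitude rather than a like-for-like benchmark.

```python
TB = 1e12  # bytes in a terabyte (decimal)
PB = 1e15  # bytes in a petabyte (decimal)

# Figures as quoted in the comparison above.
wse3_sram_bandwidth = 21 * PB          # Cerebras WSE-3, on-chip SRAM
h100_hbm_bandwidth = 3.35 * TB         # one H100 SXM GPU
dgx_b200_hbm_bandwidth = 64 * TB       # whole DGX B200 node, combined
gpus_per_dgx_b200 = 8

# Bandwidth available to each Blackwell GPU inside the node.
b200_per_gpu_bandwidth = dgx_b200_hbm_bandwidth / gpus_per_dgx_b200

ratio_vs_h100 = wse3_sram_bandwidth / h100_hbm_bandwidth
ratio_vs_dgx_b200 = wse3_sram_bandwidth / dgx_b200_hbm_bandwidth

print(f"B200 per-GPU HBM bandwidth: {b200_per_gpu_bandwidth / TB:.0f} TB/s")
print(f"WSE-3 vs one H100:          about {ratio_vs_h100:,.0f}x")
print(f"WSE-3 vs a DGX B200 node:   about {ratio_vs_dgx_b200:,.0f}x")
```

Even against the combined bandwidth of an eight-GPU node, the on-wafer SRAM figure is a few hundred times larger, which is the gap the wafer-scale design is built to exploit.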

Inference is changing the game

During the first phase of the AI boom, the focus was on training. NVIDIA’s GPUs dominated thanks to a hard-to-replicate combination: powerful hardware, CUDA, libraries, frameworks, operational experience, cloud providers, and a mature supply chain. But the next battleground is inference, especially for large models, agents, coding assistants, real-time voice, and multi-step workflows.

In interactive inference, users don’t just want the system to handle many requests in parallel; they want a fast response. If an agent needs to reason, call tools, read documents, and generate output across multiple steps, per-user latency becomes a product factor. A response that takes seconds may be acceptable; one that takes minutes breaks the experience.

That’s where Cerebras is trying to differentiate. In May 2025, the company announced that Artificial Analysis had measured their Llama 4 Maverick endpoint at 2,522 tokens per second per user, compared to 1,038 tokens per second for NVIDIA Blackwell on the same model. NVIDIA had announced days earlier that a DGX B200 node with eight Blackwell GPUs exceeded 1,000 tokens per second per user on Llama 4 Maverick, thanks to optimizations like TensorRT-LLM, FP8, and speculative decoding based on EAGLE-3.
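The sketch below turns the two reported throughput figures into generation times. The agent workload (10 steps of 1,000 output tokens each) is a hypothetical example, and the calculation assumes generation speed is constant and ignores time to first token, tool calls, and network latency.

```python
# Reported per-user throughput on Llama 4 Maverick, in tokens per second.
CEREBRAS_TPS = 2522.0
BLACKWELL_TPS = 1038.0


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent purely on token generation at a constant rate."""
    return output_tokens / tokens_per_second


speedup = CEREBRAS_TPS / BLACKWELL_TPS

# Hypothetical agent run: 10 sequential steps, 1,000 output tokens per step.
steps = 10
tokens_per_step = 1_000
total_tokens = steps * tokens_per_step

cerebras_time = generation_seconds(total_tokens, CEREBRAS_TPS)
blackwell_time = generation_seconds(total_tokens, BLACKWELL_TPS)

print(f"speedup: {speedup:.2f}x")
print(f"Cerebras:  {cerebras_time:.1f} s for {total_tokens} tokens")
print(f"Blackwell: {blackwell_time:.1f} s for {total_tokens} tokens")
```

For this assumed workload the difference is roughly four seconds versus nearly ten of pure generation time, which is the kind of gap a user notices in a multi-step interaction.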

The difference is significant but doesn’t mean Cerebras is “better than NVIDIA” across the board. It indicates that in specific low-latency inference scenarios and particular models, its architecture can offer a clear advantage. NVIDIA still holds a much broader position in training, ecosystem, enterprise support, cloud availability, development tools, and compatibility with nearly all modern AI software.

Simplifying the system by complicating manufacturing

Cerebras’s most intriguing insight is that they’ve shifted the difficulty to the manufacturing side. NVIDIA solves the problem by connecting many components: GPUs, HBM, NVLink, NVSwitch, InfiniBand, orchestration software, optimized kernels, and full servers. Cerebras attempts to eliminate part of that complexity by concentrating the system into a huge single piece of silicon.

This approach requires solving a challenge that seemed almost impossible for years: manufacturing a wafer-sized chip without defects ruining it. Cerebras addresses this with redundancy, alternative routes, and a fault-tolerant architecture that isolates defective zones and keeps functioning. They summarize this as a design meant to coexist with defects, not to pretend they don’t exist.
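A simple Poisson yield model, a common textbook approximation, shows why a wafer-sized chip cannot rely on being defect-free. The sketch is not Cerebras's method: the defect density of 0.1 defects per cm² is a hypothetical value, and the assumption that each defect disables exactly one core is a simplification made only to illustrate the scale of the redundancy needed.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Probability that a die of the given area contains zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)


DEFECT_DENSITY = 0.1           # defects per cm^2 (hypothetical value)
WAFER_AREA_CM2 = 46_225 / 100  # WSE-3 area: 46,225 mm^2 converted to cm^2
CORES = 900_000                # WSE-3 core count

expected_defects = WAFER_AREA_CM2 * DEFECT_DENSITY
defect_free_probability = poisson_yield(WAFER_AREA_CM2, DEFECT_DENSITY)

# Simplifying assumption: each defect disables exactly one core.
core_loss_fraction = expected_defects / CORES

print(f"expected defects per wafer: {expected_defects:.1f}")
print(f"chance of a perfect wafer:  {defect_free_probability:.1e}")
print(f"cores lost if routed around: {core_loss_fraction:.4%}")
```

Under these assumptions a perfect wafer is effectively impossible, yet the expected defects touch only a tiny fraction of the cores, so routing around them costs very little capacity.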

It’s a tough engineering gamble, but it offers a key conceptual advantage: if it works, it reduces some of the cost of moving data between chips. In AI, moving data costs energy and time and adds complexity. Hence the increasingly common phrase “memory is the bottleneck.” It’s no longer enough to just increase compute power if the model spends most of its time waiting for weights.

| Problem in generative AI | Typical GPU solution | Cerebras’s approach |
| --- | --- | --- |
| Large model | Split across many GPUs | Place extensive compute and memory on a wafer |
| Data movement | HBM, NVLink, NVSwitch, networks | On-chip SRAM and internal wafer network |
| Scaling | Multi-GPU clusters | Systems like CS-3 and wafer-scale clusters |
| Per-user latency | Kernel optimization and batching | Reduce trips between memory and compute |
| Distributed programming | Often necessary | Cerebras aims to simplify it |

Why NVIDIA isn’t defeated yet

The enthusiasm around Cerebras shouldn’t overshadow market realities. NVIDIA isn’t dominant just because of raw speed. Its platform—CUDA, TensorRT, Triton, cuDNN, NCCL, DGX, HGX, networking, documentation, cloud providers, enterprise integrations, and talent—creates a formidable barrier.

Furthermore, many workloads aren’t measured solely in tokens per second per user. In production, cost per million tokens, utilization rates, total throughput, capacity availability, model support, driver stability, framework compatibility, security, multi-tenant deployment, and ease of operation at scale all matter.

Cerebras has a strong story for fast inference. NVIDIA, however, offers a generalist platform already deployed in thousands of data centers. The landscape won’t be binary. More likely, the market will fragment: GPUs for training and broad workloads; ASICs, wafer-scale chips, and specialized accelerators for low-latency inference or specific models; and a mix based on cost, performance, and availability.

Cerebras is asking a different question: if inference becomes the main operational cost of AI, perhaps the most cost-effective architecture isn’t always a cluster of general-purpose GPUs. For agents, voice, generative search, coding assistants, and interactive reasoning, per-user speed can be highly valuable. Doubling response speed not only improves the user experience but also makes it possible to build products that were previously too slow.

Cerebras hasn’t found a magic way to bypass physics; they’ve chosen a different physics: less distance between memory and compute, less orchestration among chips, and greater complexity in manufacturing. If this approach scales well, NVIDIA will face real competition in one of AI’s most sensitive layers: fast inference.

Frequently Asked Questions

Why can Cerebras sometimes be faster than NVIDIA on certain models?
Because their architecture places large amounts of SRAM directly on the chip and provides very high bandwidth, reducing the bottleneck of moving weights during inference.

Is the Cerebras chip a GPU?
No. The WSE-3 is a wafer-scale processor specifically designed for AI. Its approach differs significantly from a conventional GPU.

Does this mean Cerebras surpasses NVIDIA everywhere?
No. Cerebras excels in specific low-latency inference scenarios, but NVIDIA maintains a significant lead in ecosystem, software, training, enterprise adoption, and broad availability.

Why is memory so critical in LLMs?
Because generating tokens requires repeatedly accessing the model’s weights. If memory is far away or bandwidth is limited, the calculation can stall waiting for data.

via: LinkedIn
