The conversation around AI infrastructure is shifting from a sole focus on “raw performance” to an obsession with something far more prosaic: how much it costs to serve each token when users demand quick responses, at scale, with a good “interaction feel.” In this domain, MoE (Mixture of Experts) models are confronting the industry with an awkward problem: inter-node communication and internal latency are becoming almost as important as computational power.
In this context, Signal65 has published an analysis centered on what they call “the new economy of inference” for MoE, comparing NVIDIA and AMD platforms with a basic idea: the relative cost per token depends on the platform’s cost and the tokens per second it actually delivers at a specific level of interactivity. The conclusion — with important nuances — is striking: in a MoE-oriented setup, a rack with NVIDIA GB200 NVL72 can deliver up to 28× more throughput per GPU than AMD MI355X at a high level of interactivity (75 tokens/sec/user), translating to up to 15× more “performance per dollar” at that point.
Why MoE is changing the rules: the bottleneck is no longer just “computing”
MoE models activate “experts” (specialized sub-networks) dynamically, making them more efficient than equivalent dense models, but at a cost: massive data exchange. In practice, when scaling an MoE, “all-to-all” communication patterns emerge that penalize latency and stress internal bandwidth. In simple terms: you can have blazing-fast GPUs, but if expert coordination stalls, the interactive experience suffers.
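To ground the idea, here is a minimal Python sketch (not from the Signal65 report; the device and expert counts are made up and the gating network is replaced by a random stand-in) of why routing tokens to experts spread across GPUs produces “all-to-all”-style traffic:

```python
# Illustrative sketch: tokens resident on one GPU are routed to top-k experts
# that may live on other GPUs, so each device ends up exchanging activations
# with almost every other device at every MoE layer.
import random
from collections import Counter

NUM_DEVICES = 8           # hypothetical GPUs in the serving domain
EXPERTS_PER_DEVICE = 8    # hypothetical experts hosted per GPU
TOP_K = 2                 # experts activated per token (a common MoE choice)
TOKENS_PER_DEVICE = 1024  # tokens resident on each GPU in this step

num_experts = NUM_DEVICES * EXPERTS_PER_DEVICE
traffic = Counter()  # (src_device, dst_device) -> tokens dispatched

for src in range(NUM_DEVICES):
    for _ in range(TOKENS_PER_DEVICE):
        # Stand-in for the learned gating network: pick top-k experts at random.
        for expert in random.sample(range(num_experts), TOP_K):
            dst = expert // EXPERTS_PER_DEVICE  # device hosting that expert
            if dst != src:
                traffic[(src, dst)] += 1

print(f"device pairs exchanging tokens: {len(traffic)} of {NUM_DEVICES * (NUM_DEVICES - 1)}")
print(f"cross-device token dispatches in one layer: {sum(traffic.values())}")
```

With these toy numbers, essentially every device pair ends up exchanging tokens each layer, which is exactly the traffic pattern that a fast, rack-scale interconnect is meant to absorb.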
This is where NVIDIA bets on rack-scale architecture: a large domain of high-speed compute and memory, designed to minimize these data movement penalties. Signal65 attributes much of the observed advantage to this “co-design” (hardware + interconnect + software) architecture and to large-scale configurations with fast shared memory.
What is (broadly) a GB200 NVL72 and why does it matter?
The market has popularized the term “NVL72” to describe NVIDIA racks containing 72 accelerators connected through a very high-speed internal interconnect and intended to act as a single “supernode” for AI. In the case of GB200, the family corresponds to NVIDIA’s Grace Blackwell platform; industry and technical literature describe it as a rack-scale system combining Grace CPUs and Blackwell GPUs in a tightly integrated architecture.
The core idea isn’t new, but its execution is extreme: in MoE, the value isn’t just in TFLOPS but in how many useful tokens you produce with acceptable latency, without the system “talking to itself” all day.
What AMD offers with MI355X
AMD pushes its Instinct line with an argument that weighs heavily in AI: memory and bandwidth. According to its official specifications, the MI355X is an accelerator based on the 4th-generation AMD CDNA architecture, featuring 288 GB of HBM3E and up to 8 TB/s of memory bandwidth, among other features geared toward AI workloads.
In other words: AMD offers a product that’s clearly aggressive in memory density and computational muscle for demanding scenarios. The debate is whether, in high-interactivity scaled MoE setups, the advantage shifts toward those who best master the system’s “connective tissue.”
Report figures: cost per token and “performance per dollar”
Signal65 explains that it uses third-party performance measurements and lays out the economic calculation separately so the reader can see the assumptions. Its comparison table (using publicly available Oracle Cloud prices for these platforms) works in terms of “relative cost,” where the cost per token derives from:
- the GPU-hour cost,
- divided by the tokens per second per GPU at the target level of interactivity,
- scaled to millions of tokens.
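As a minimal sketch of that arithmetic (the inputs below are made up, since the report publishes relative ratios rather than absolute tokens-per-second figures):

```python
# Minimal sketch of the cost-per-token calculation described above.

def cost_per_million_tokens(gpu_hour_cost_usd: float, tokens_per_sec_per_gpu: float) -> float:
    """USD to generate one million tokens on one GPU at a given interactivity point."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hour_cost_usd / tokens_per_hour * 1_000_000

# Hypothetical example: a $10/hour GPU sustaining 400 tokens/sec at the target
# interactivity costs roughly $6.94 per million tokens.
print(round(cost_per_million_tokens(10.0, 400.0), 2))
```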
Within this framework, for MoE:
- At 25 tokens/sec/user:
  - Price ratio (GB200 vs MI355X): 1.86×
  - Performance delta (per GPU): 5.85×
  - Performance per dollar: 3.1×
  - Relative cost per token: about 1/3 compared to MI355X
- At 75 tokens/sec/user:
  - Price ratio (GB200 vs MI355X): 1.86×
  - Performance delta (per GPU): 28×
  - Performance per dollar: 15×
  - Relative cost per token: about 1/15 compared to MI355X
To visualize with a quick table:
| Interactivity Target | Platform (Signal65 reference) | Price Ratio (vs MI355X) | Performance Delta (vs MI355X) | Performance/$ Advantage | Relative Cost per Token |
|---|---|---|---|---|---|
| 25 tokens/sec/user | GB200 NVL72 | 1.86× | 5.85× | 3.1× | 1/3 |
| 25 tokens/sec/user | MI355X | 1.0× | 1.0× | 1.0× | 1.0× |
| 75 tokens/sec/user | GB200 NVL72 | 1.86× | 28× | 15× | 1/15 |
| 75 tokens/sec/user | MI355X | 1.0× | 1.0× | 1.0× | 1.0× |
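Assuming Signal65 obtains “performance per dollar” by dividing the per-GPU performance delta by the price ratio (an interpretation, but one that matches the published figures), the last two columns can be reproduced in a few lines:

```python
# Reproducing the relative columns from the published ratios alone:
# performance per dollar = performance delta / price ratio, and the relative
# cost per token is its reciprocal.
for label, price_ratio, perf_delta in [("25 tok/s/user", 1.86, 5.85),
                                       ("75 tok/s/user", 1.86, 28.0)]:
    perf_per_dollar = perf_delta / price_ratio
    print(f"{label}: perf/$ ≈ {perf_per_dollar:.2f}x -> "
          f"relative cost per token ≈ 1/{perf_per_dollar:.1f}")
```

That yields roughly 3.1× and 15× advantages, i.e., a relative cost per token of about 1/3 and 1/15, consistent with the table.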
Signal65 also notes that the publicly available cloud price it references for the MI355X comes from Oracle, which has announced MI355X availability on OCI starting at $8.60 per hour (according to its publication).
What the headline doesn’t say: why you should read the fine print
These results are not “the universal truth” of NVIDIA vs AMD performance. Rather, they are a snapshot of a very specific scenario:
- MoE, where internal communication and latency are critical.
- An explicit interactivity target (25 vs 75 tokens/sec/user), which radically changes the operating point.
- Specific software stacks (Signal65 mentions combinations such as TensorRT-LLM, vLLM, and a “Dynamo” setup in its charts), which can drastically alter practical comparisons.
- Public cloud “list prices,” which rarely reflect what a large-scale hyperscaler pays with commitments, discounts, reservations, or capacity agreements.
There’s also a nearly philosophical point: if the industry moves toward increasingly interactive “chat” experiences, the tokens/sec metric at strict latency targets might become the dominant KPI. Conversely, if throughput in batches, dense models, or memory-centric workloads dominate, the overall landscape could shift.
Nevertheless, the underlying message appears solid: at scale, system architecture (interconnect + memory + software) can make enormous differences in inference economics—even when competitors offer very capable accelerators on paper.
Frequently Asked Questions
Why is “tokens per second per user” a key metric in generative AI?
Because it approximates the real experience: not just how many tokens the GPU can produce, but whether it can do so while maintaining smooth responses for many users simultaneously.
What’s the practical difference between a dense model and an MoE in inference costs?
MoE can be more compute-efficient, but it often requires more internal coordination and traffic: if the interconnect becomes a bottleneck, the cost per “useful” token rises in interactive scenarios.
Why can “cost per token” vary so much between cloud providers and on-prem setups?
In the cloud, factors like pricing, availability, quotas, and discounts matter; on-premises, amortization, energy, cooling, and utilization are key. The same platform can be “expensive” or “cheap” depending on occupancy and demand patterns.
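As an illustration of the on-prem side, here is a toy calculation of an effective cost per GPU-hour; every input (capex, power draw, energy price, utilization) is a placeholder, not a figure from the report:

```python
# Toy effective cost per GPU-hour on-premises: amortized capex plus energy,
# divided by the hours the GPU is actually serving traffic.
def on_prem_gpu_hour_cost(capex_usd: float, years: float, avg_power_kw: float,
                          usd_per_kwh: float, utilization: float) -> float:
    hours_total = years * 365 * 24
    hours_busy = hours_total * utilization
    energy_cost = avg_power_kw * usd_per_kwh * hours_total  # power is drawn around the clock
    return (capex_usd + energy_cost) / hours_busy

# Placeholder example: $40k per GPU (including its share of rack and networking),
# 4-year amortization, 1.2 kW average draw, $0.12/kWh, 60% utilization.
print(round(on_prem_gpu_hour_cost(40_000, 4, 1.2, 0.12, 0.60), 2))
```

Note how strongly the result depends on utilization: the same hardware at 30% occupancy costs roughly twice as much per useful GPU-hour as at 60%.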
Can I extrapolate a MoE benchmark to my use case (customer support, RAG, internal copilots)?
Caution is advised: model mixes, quantization, context length, concurrency, and latency goals all influence results. The best approach is to measure TPS for your target interactivity and convert it to cost per million tokens using your actual pricing.
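A minimal sketch of that workflow, where `generate_stream`, the price per GPU-hour, and the assumed concurrency are placeholders to replace with your own serving stack and contract terms:

```python
# Measure sustained tokens/sec per user for your own prompts, then convert to
# $/1M tokens with your actual pricing. `generate_stream` stands in for however
# your serving stack streams tokens back.
import time
from typing import Callable, Iterable, List

def measure_tokens_per_sec(generate_stream: Callable[[str], Iterable[str]],
                           prompts: List[str]) -> float:
    """Average per-request decode rate: tokens produced / wall-clock seconds."""
    rates = []
    for prompt in prompts:
        start, n_tokens = time.perf_counter(), 0
        for _ in generate_stream(prompt):   # consume the token stream
            n_tokens += 1
        rates.append(n_tokens / (time.perf_counter() - start))
    return sum(rates) / len(rates)

def usd_per_million_tokens(gpu_hour_usd: float, tokens_per_sec_per_gpu: float) -> float:
    return gpu_hour_usd / (tokens_per_sec_per_gpu * 3600) * 1_000_000

# Stub generator standing in for a real endpoint (~5 ms per token).
def fake_stream(prompt: str):
    for _ in range(128):
        time.sleep(0.005)
        yield "tok"

PRICE_PER_GPU_HOUR = 10.0   # placeholder: your contracted $/GPU-hour
ASSUMED_USERS_PER_GPU = 32  # placeholder: concurrent users one GPU sustains at this rate

tps_per_user = measure_tokens_per_sec(fake_stream, ["hello"] * 3)
per_gpu_tps = tps_per_user * ASSUMED_USERS_PER_GPU
print(f"~{tps_per_user:.0f} tokens/sec/user -> "
      f"~${usd_per_million_tokens(PRICE_PER_GPU_HOUR, per_gpu_tps):.2f} per 1M tokens")
```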

