With NVIDIA leading the way in AI hardware, from Blackwell to the new Rubin platform paired with Vera CPUs and contexts of up to "a million tokens," it might seem the game is settled. Positron AI disagrees. Its CEO, Mitesh Agarwal, argues there is room for alternatives focused on cheaper, more efficient inference, suited to air-cooled data centers and free of the urgency to switch to the liquid cooling that upcoming NVIDIA GPUs would require.
Positron’s bet is embodied in Atlas (first generation) and Asimov (upcoming), two accelerators that, according to Agarwal, consume 2 to 5 times less energy than an NVIDIA GPU on inference workloads and fit into conventional air-cooled racks. The message to hyperscalers, cloud providers, and hosting providers with existing real estate is direct: deploy on what you already have. "95% of current capacity is air-cooled. Rubin and Blackwell demand new construction; we go in where they are," summarizes Agarwal.
Air vs. liquid: costs, timelines, and complexity
For Positron, it’s not about sacrificing performance but about designing for inference, which is where, they argue, the business volume is won. The thermal comparison is telling: Blackwell hovers around 1,200 W per GPU, and Rubin, according to industry publications, could reach up to 2,000 W. At those densities air cooling becomes unworkable; switching to liquid cooling means substantial CAPEX (piping, heat exchangers, wet aisles, guaranteed water supply) plus ongoing OPEX for maintenance.
Agarwal also emphasizes an important nuance: liquid-cooled centers are more efficient at the building level, but they cost 40–50% more to build and take longer. In power-constrained markets, such as urban areas in the US or Europe, there often isn’t enough grid capacity available for a next-generation campus. If the accelerator fits in air-cooled racks and performs well in inference, demand can be met today without waiting for permits or substations.
Technical claim: contained power and token efficiency
The technical specs Positron touts are straightforward to remember:
- Power budget per chip: < 200 W in base designs and ~400 W in higher configurations, both air-cooled.
- Efficiency: 2–5× better than an NVIDIA GPU in inference (depending on the case).
- Performance per dollar: ≈ 3.5× compared to Hopper; up to 5× on memory/energy-intensive workloads.
- ROI: where NVIDIA would require ~ 2–2.5 years to recoup investment, Atlas would take 15–16 months, and the next silicon could drop that below 12 months; in extreme scenarios, ~ 6 months.
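The payback arithmetic behind these claims can be sketched in a few lines. All figures below are illustrative assumptions chosen to land near the article's quoted ranges (roughly 2.4 years for a GPU baseline, about 15 months for an inference ASIC), not vendor data:

```python
# Hypothetical payback-period sketch. Capex, token volume, pricing, and
# energy cost per million tokens are all assumed, not measured figures.
# The point: at equal token revenue, cheaper energy per token and lower
# capex shorten the months needed to recoup the hardware investment.

def payback_months(capex_usd, tokens_per_month, revenue_per_mtok,
                   energy_cost_per_mtok):
    """Months to recover capex from token margin (revenue minus energy)."""
    margin_per_mtok = revenue_per_mtok - energy_cost_per_mtok
    monthly_margin = tokens_per_month / 1e6 * margin_per_mtok
    return capex_usd / monthly_margin

# Same assumed workload and pricing for both systems.
TOKENS = 5e9    # tokens served per month (assumed)
REVENUE = 2.00  # $ per million tokens (assumed)

# GPU baseline: higher capex, higher energy cost per million tokens.
gpu = payback_months(capex_usd=200_000, tokens_per_month=TOKENS,
                     revenue_per_mtok=REVENUE, energy_cost_per_mtok=0.60)

# Inference ASIC: assumed ~3x cheaper energy per token, lower capex.
asic = payback_months(capex_usd=135_000, tokens_per_month=TOKENS,
                      revenue_per_mtok=REVENUE, energy_cost_per_mtok=0.20)

print(f"GPU payback:  {gpu:.1f} months")   # ~28.6 months (~2.4 years)
print(f"ASIC payback: {asic:.1f} months")  # 15.0 months
```

The model is deliberately crude (it ignores depreciation, utilization swings, and cooling capex), but it shows why energy per token dominates the ROI pitch.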
The consistent theme across all metrics is the same: infer more tokens per euro and per watt. Against a general-purpose GPU built for both training and inference (NVIDIA’s approach), Positron positions itself as a specialist in inference, precisely where the market is set to explode in 2025–2028.
Scarcity angle: different memory, another bottleneck
A significant part of the industry’s bottlenecks lies not at the wafer level but in advanced packaging and HBM memory (e.g., CoWoS), where all players compete for the same slots. Positron claims to sidestep that lane with its own memory architecture, which decouples its supply chain from that of NVIDIA/AMD/TPU. The result: less dependency on the HBM funnel and, in theory, more room to scale if production ramps up.
In production, Atlas is—according to the company—“US-made” with Intel Foundry and a domestic supply chain. Asimov (target: tape-out late 2026) would be produced on a mature node and more readily available, with manufacturing options also in Arizona. The takeaway: prioritize capacity over fighting for every micron on cutting-edge nodes.
System stack: x86 CPUs and “Archer” accelerators
Positron isn’t competing in CPUs: it uses AMD EPYC (it could also be Intel or Arm) and builds Atlas systems with 8 “Archer” accelerators, 24 DDR5 RDIMM channels, and dual Genoa EPYCs. The difference, Agarwal emphasizes, is in the accelerator and its memory:
- Memory bandwidth utilization: > 90% (compared to 40–50% “even in the best GPU case”).
- Capacity: Atlas currently prioritizes bandwidth; Asimov would raise the bar to about 2 TB per card (≈ 2,048 GB), roughly 5× more than anticipated for Rubin (288–384 GB of HBM3e, depending on version).
If the load limitation hinges on memory (pre-fill intensive, large contexts, prompt caches, vector databases), that combination—high utilization + large capacity—boosts throughput without fighting over every gigabyte of HBM.
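A rough roofline-style calculation shows why bandwidth utilization matters so much for decode. For a memory-bound model, each generated token effectively streams the full weight set once, so tokens per second is bounded by effective bandwidth divided by bytes moved per token. The peak bandwidth, utilization figures, and model size below are illustrative assumptions:

```python
# Memory-bound decode sketch: throughput is capped by effective memory
# bandwidth / bytes per token. All numbers are assumptions for
# illustration, not measured vendor figures.

def decode_tokens_per_s(peak_bw_gbs, utilization, model_gb):
    """Upper bound on decode throughput when each token streams
    the full weight set once from memory."""
    effective_bw = peak_bw_gbs * utilization  # GB/s actually achieved
    return effective_bw / model_gb            # tokens per second

MODEL_GB = 140  # assumed weights footprint (e.g., a 70B model at FP16)

# Same assumed peak bandwidth; only the achieved utilization differs.
gpu_tps = decode_tokens_per_s(peak_bw_gbs=3000, utilization=0.45,
                              model_gb=MODEL_GB)
asic_tps = decode_tokens_per_s(peak_bw_gbs=3000, utilization=0.90,
                               model_gb=MODEL_GB)

print(f"45% utilization: {gpu_tps:.1f} tok/s")
print(f"90% utilization: {asic_tps:.1f} tok/s")  # 2x at equal peak BW
```

Doubling utilization at equal peak bandwidth doubles the throughput ceiling, which is the core of Positron’s >90% claim; batching and KV-cache traffic complicate the real picture.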
Rubin CPX, pre-fill, and “collaborative competition”
NVIDIA’s response to the surge in inference is underway with Rubin CPX, an accelerator focused on pre-fill (token ingestion). Does Positron worry? “No,” says Agarwal: decoding (code, video, extended generation) will be the economic lever, and Positron is optimizing there. They even propose hybrid systems, Rubin CPX plus Positron, to maximize tokens per euro on mixed workloads.
The subtext is clear: training will continue to concentrate on a few generalist chips; inference will fragment into ASICs and task-specific GPUs. And the market size is huge: about $400 billion in 2028 for inference, the executive estimates.
Domestic competitors: Trainium and TPU
What about Trainium (AWS) or TPU (Google)? Agarwal distinguishes platforms and applications. If the key metric is performance/€ per token in state-of-the-art LLMs, Positron claims a ≈ 3.5× advantage over Hopper and a better performance-to-efficiency ratio than Trainium/TPU in cases where memory and power efficiency matter most. The message: you don’t have to be number one everywhere; just be number one where it counts for the bill.
Use cases and clients: brownfield before greenfield
The theory gains traction in centers with limited power. Agarwal cites Cloudflare, a public client, as an example of someone who cannot “request more megawatts from the grid” or retrofit liquid cooling in San Francisco, New York, or Chicago. Here Atlas delivers more tokens per watt in existing facilities.
Signals to watch (and cautionary notes)
- The 2,000 W figure for Rubin is unconfirmed by NVIDIA; it is an external estimate the company does not comment on.
- The timeline for Asimov (late 2026) and its 2 TB capacity are goals; real silicon will need to be seen.
- Liquid cooling will continue to grow due to energy efficiency and density: Positron admits this and promises dual support (air/liquid) depending on the layout.
- The success of the approach will depend on whether most high-value inference workloads remain tied to memory and decoding—where they have an edge—and whether the supply chain window stays outside the HBM funnel.
The essentials
In a market moving toward liquid-cooled racks, megawatts at the facility, and HBM by the kilogram, Positron makes a play for what’s already built, with 200–400 W chips, a distinct memory architecture, better utilization, and a promise that appeals to any CFO: recover your investment in months, not years, on real inference workloads. If Rubin is NVIDIA’s highway, Positron wants to fill the secondary roads that still carry traffic.
Frequently Asked Questions
Why does Positron emphasize air-cooled despite liquid cooling being more efficient?
Because 95% of the existing capacity is air-cooled and rebuilding for liquid costs 40–50% more and takes longer. For inference using 200–400 W chips, Positron argues it can fit into current racks and shorten the time-to-value.
What does “2–5× more efficient” compared to an NVIDIA GPU mean?
During inference, it’s more tokens per watt and tokens per euro. The company states about ≈ 3.5× performance/€ versus Hopper and up to 5× on memory and energy-sensitive workloads. The ROI could drop from 2–2.5 years to 15–16 months (Atlas) and below 12 months in the next generation.
How does Positron address HBM shortages?
It claims to use another memory architecture (not HBM/CoWoS), with > 90% memory bandwidth utilization, and ~2 TB capacity per card in Asimov. This decouples its production from the bottlenecks affecting NVIDIA/AMD/TPU.
Does Rubin CPX exclude Positron from inference?
Rubin CPX is focused on pre-fill. Positron emphasizes decoding (output): code, video, and lengthy generation as the cost driver, and even suggests combining Rubin CPX with Positron to maximize tokens per euro in mixed pipelines.
Where does Atlas fit in the stack?
It’s integrated into x86 servers with dual AMD EPYC CPUs, 24 DDR5 channels, and 8 Archer accelerators. Positron does not compete on CPUs; its strength lies in the accelerator and its memory.
Note: all figures and assertions derive from statements by Positron AI’s CEO in the referenced interview. Some power and memory estimates of external platforms (e.g., Rubin) are third-party estimations not publicly confirmed by manufacturers.
via: wccftech