
NVIDIA is trying to change the language of the entire data center industry. It’s no longer just about GPUs, servers, or accelerated clusters but about “AI factories”: artificial intelligence factories designed to produce tokens continuously, like an industrial plant generates electricity, steel, or components. The metaphor is commercial, but it helps to understand a real shift: AI can no longer be treated as a software layer running on generic infrastructure.

In NVIDIA’s vision, an AI factory converts energy into intelligence. The unit of production isn’t a physical part but the token a model generates when reasoning, responding, writing code, coordinating agents, or executing a task. That’s why the relevant metrics are starting to resemble those of heavy industry rather than SaaS applications: tokens per second, tokens per watt, cost per token, infrastructure utilization, and availability.

Inference is no longer an isolated query

The major shift is in the workload. For many users, generative AI started as a text box: you write a question, the model responds, and the interaction ends. Agentic AI breaks that pattern. An agent can plan, seek information, call tools, read documents, write code, query databases, create sub-agents, and make chained decisions.

This makes inference a longer, more interactive process that’s more challenging to orchestrate. It’s no longer enough to have a powerful GPU waiting for a request. Coordination of memory, storage, network, CPU, software, queues, and external services is needed to keep the entire flow moving without unnecessary delays.
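To make the orchestration point concrete, here is a minimal sketch of one agentic request broken into steps. The step names and durations are invented for illustration and do not come from NVIDIA. The sketch only shows that when tool calls and retrieval dominate the wall-clock time, the accelerator sits idle for much of the request.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    seconds: float   # hypothetical wall-clock duration of the step
    uses_gpu: bool   # True when the accelerator is doing the work


def summarize(steps):
    """Return total latency and the share of that time spent on the GPU."""
    total = sum(s.seconds for s in steps)
    gpu = sum(s.seconds for s in steps if s.uses_gpu)
    return total, gpu / total


# Hypothetical agent request: every duration is a made-up illustration value.
flow = [
    Step("plan (model call)", 0.5, True),
    Step("retrieve documents", 0.5, False),
    Step("call external tool", 1.0, False),
    Step("write code (model call)", 1.5, True),
    Step("run code in sandbox", 0.5, False),
]

total, gpu_share = summarize(flow)
print(f"end-to-end latency: {total:.1f} s, GPU busy for {gpu_share:.0%} of it")
```

In this made-up flow the GPU works for only half of the request, which is why the surrounding CPU, network, and storage layers matter as much as the accelerator.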

NVIDIA frames this as a full-stack problem. Models need accelerated compute, but also fast memory, context storage, low-latency networks for coordinating services, and software capable of maintaining high system utilization. If one layer lags, the cost per token rises, and the experience degrades.

Metric	What it measures in an AI factory
Tokens per second	Ability to produce responses and actions
Tokens per watt	Energy efficiency of the system
Cost per token	Economic viability of inference at scale
Utilization	Degree of GPU, CPU, memory, and network usage
Uptime	Continuity of AI production
Latency	Response time in agents and interactive applications
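As a rough illustration of how these metrics relate to one another, the sketch below derives per-token figures from a few system-level measurements. All input values are hypothetical, and the model assumes that power draw and hourly cost stay constant whether or not the system is busy, which is a simplification.

```python
def factory_metrics(tokens_per_second, power_watts, cost_per_hour, busy_fraction):
    """Derive per-token metrics from hypothetical system-level measurements.

    Assumes power and hourly cost are paid regardless of utilization.
    """
    effective_tps = tokens_per_second * busy_fraction  # idle time produces no tokens
    tokens_per_hour = effective_tps * 3600
    return {
        "tokens_per_second_per_watt": effective_tps / power_watts,
        "cost_per_million_tokens": cost_per_hour / tokens_per_hour * 1_000_000,
    }


# Hypothetical cluster: 10,000 tokens/s peak, 100 kW, 180 currency units per hour.
for busy in (1.0, 0.5):
    m = factory_metrics(10_000, 100_000, 180.0, busy)
    print(
        f"utilization {busy:.0%}: "
        f"{m['tokens_per_second_per_watt']:.3f} tokens/s per watt, "
        f"{m['cost_per_million_tokens']:.2f} per million tokens"
    )
```

Under these assumptions, halving utilization doubles the cost per token, which is why utilization appears alongside throughput and energy in the table.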

This perspective has implications for any company serious about deploying AI. The debate is no longer limited to choosing a model. Decisions now include where to run it, the cost per interaction, acceptable latency, how to maintain context, which data to retrieve, and the energy consumption of the infrastructure.

Data driving the new token economy

NVIDIA presents Blackwell Ultra and the GB300 NVL72 systems as answers to this new economy. According to the company, these systems can generate 50 times more tokens per megawatt than the Hopper generation and cut cost per token by a factor of 35. These figures come from NVIDIA and should be read within the company’s own comparison framework, but they illustrate the industry’s direction: producing more intelligence with less energy.
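Taken at face value, those two multipliers can be turned into per-token ratios. The short calculation below uses only NVIDIA's reported factors. It does not validate them; it only restates what they would imply relative to a Hopper baseline.

```python
TOKENS_PER_MW_GAIN = 50   # NVIDIA's reported figure, GB300 NVL72 vs Hopper
COST_PER_TOKEN_GAIN = 35  # NVIDIA's reported figure, GB300 NVL72 vs Hopper

# More tokens from the same power means proportionally less energy per token.
energy_per_token_ratio = 1 / TOKENS_PER_MW_GAIN
cost_per_token_ratio = 1 / COST_PER_TOKEN_GAIN

print(f"energy per token: {energy_per_token_ratio:.1%} of the baseline")
print(f"cost per token:   {cost_per_token_ratio:.1%} of the baseline")
```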

The company also highlights NVIDIA Dynamo, a framework designed to orchestrate long-context inference and high volumes of requests. In an AI factory, software makes many of the economic decisions. It routes requests, manages memory, balances latency and throughput, coordinates services, and prevents expensive hardware from waiting idle.
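The routing idea can be sketched with a generic least-loaded heuristic. This is not Dynamo's API or its actual scheduling policy. The `Worker` class, the `route` function, and the request sizes are all hypothetical, chosen only to show how a router can keep work spread across accelerators.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    queued_tokens: int = 0  # tokens of work already assigned to this worker


def route(workers, request_tokens):
    """Send the request to the worker with the least queued work."""
    target = min(workers, key=lambda w: w.queued_tokens)
    target.queued_tokens += request_tokens
    return target.name


workers = [Worker("gpu-0"), Worker("gpu-1"), Worker("gpu-2")]
requests = [8000, 500, 500, 4000, 500]  # hypothetical request sizes in tokens

assignments = [route(workers, r) for r in requests]
print(assignments)
print({w.name: w.queued_tokens for w in workers})
```

After the long first request lands on one worker, the short ones flow to the others instead of queuing behind it. A production router would also weigh memory, cached context, and latency targets, which this sketch ignores.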

Key Data	Reported Figure	Why It Matters
GB300 NVL72 vs Hopper	50x more tokens per megawatt	Measures improved AI production efficiency per unit of energy
GB300 NVL72 vs Hopper	35x lower cost per token	Directly impacts inference profitability
Vera Rubin with LPX	Up to 35x more performance per watt	Pushes the next generation of agentic AI and reasoning
Vera CPU	88 Olympus cores	Reinforces the CPU’s role in agents, runtimes, and orchestration
Vera’s memory bandwidth	Up to 1.2 TB/s	Helps sustain workloads with heavy memory pressure
Vera vs. Grace, per Phoronix	1.6x geometric mean performance	Shows significant generational leap in data center CPUs
Vera vs. a 128-core x86, per NVIDIA	1.5x overall performance	Positions Arm as a serious competitor in AI infrastructure
Linux kernel compile on Vera	20 seconds	Practical example of development workload performance
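The Phoronix comparison in the table is a single aggregate over many benchmarks, and such aggregates are usually geometric means rather than simple averages. The sketch below shows the difference using invented per-benchmark speedups; these are not Phoronix's results.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of per-benchmark speedup ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-benchmark speedups of a new CPU over an old one.
speedups = [2.0, 0.5, 4.0, 1.0]

print(f"geometric mean:  {geometric_mean(speedups):.2f}x")
print(f"arithmetic mean: {sum(speedups) / len(speedups):.2f}x")
```

The geometric mean is less swayed by a single outlier benchmark than the arithmetic mean, which is why it is the common choice for summarizing ratios.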

The next step is Vera Rubin. NVIDIA claims this platform, along with LPX, is designed to lift performance per watt again in reasoning workloads and agentic AI. The clear message: the company wants the conversation to shift from “which GPU should I buy” to “what AI factory can I operate at the lowest cost per token.”

This strategy also shields NVIDIA from increasingly specialized competition. ASICs, inference chips, LPUs, TPUs, and custom accelerators target specific market segments with better costs or latencies. NVIDIA responds by broadening the offering: it’s not just selling chips but entire architectures.

The CPU returns to the center of AI infrastructure

An AI factory is not built solely with GPUs. NVIDIA is also pushing Vera, its new data center CPU based on its own Olympus cores and Armv9.2 architecture. The technical message is significant because agents do not only run matrix operations on accelerators; they compile code, launch isolated environments, process data, manage runtimes, coordinate tools, execute Python or Java, and query databases.

Based on initial results published by Phoronix and shared by NVIDIA, Vera offers 88 Olympus cores, 176 threads, up to 1.2 TB/s of LPDDR5X memory bandwidth, 164 MB of unified L3 cache, PCIe Gen 6, and CXL 3.1 support. The tested chip had a maximum TDP of 450 W, with LPDDR5X memory consuming around 50 W or less, per Phoronix.

Features of NVIDIA Vera	Technical Data
Architecture	Armv9.2
Cores	88 Olympus
Threads	176
Memory bandwidth	Up to 1.2 TB/s
L2 cache	2 MB per core
Unified L3 cache	164 MB
Connectivity	PCIe Gen 6 and CXL 3.1
Socket TDP	450 W
Memory consumption in testing	Around 50 W or less
Expected availability	Second half of the year, via partners

The memory figure is especially important. Agentic workloads are not limited by core count alone; they require many parallel processes with good memory access and consistent latencies. NVIDIA states that Vera sustains 90% of its peak bandwidth in the STREAM Triad benchmark and offers over four times the bandwidth per core of traditional x86 CPUs. This is a clear attempt to address a classic data center bottleneck: moving data quickly without significantly increasing power consumption.
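The figures quoted above can be combined into a quick back-of-the-envelope calculation. The constants come from the article (NVIDIA and Phoronix). The `triad_bytes` helper is an illustrative addition that follows the usual STREAM convention of counting two reads and one write per element for the Triad kernel `a[i] = b[i] + q * c[i]`.

```python
PEAK_BANDWIDTH_GBPS = 1200.0  # "up to 1.2 TB/s", expressed in decimal GB/s
CORES = 88                    # Olympus cores
SUSTAINED_FRACTION = 0.90     # NVIDIA's STREAM Triad claim

sustained_gbps = PEAK_BANDWIDTH_GBPS * SUSTAINED_FRACTION
per_core_peak_gbps = PEAK_BANDWIDTH_GBPS / CORES


def triad_bytes(n, element_size=8):
    """Bytes counted for Triad over n elements: read b, read c, write a."""
    return 3 * element_size * n


print(f"sustained bandwidth: {sustained_gbps:.0f} GB/s")
print(f"peak bandwidth per core: {per_core_peak_gbps:.1f} GB/s")
print(f"Triad traffic for 1e9 doubles: {triad_bytes(10**9) / 1e9:.0f} GB")
```

These are derived from headline numbers only. Real per-core bandwidth depends on how many cores are active and on the memory access pattern.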

Design before building

AI factories cannot be improvised. A traditional data center could scale by adding servers, more storage, or new cabinets. In AI, power density, liquid cooling, interconnects, load distribution, and power supply demands require designing the system as a single integrated unit.

NVIDIA discusses extreme co-design: hardware, network, memory, storage, software, energy, and cooling all planned together from the outset. It also cites its DSX reference designs and the use of digital twins via Omniverse DSX Blueprint to model installations, equipment, cooling, and operations before actual deployment.

This is especially critical in projects involving hundreds of megawatts or even gigawatts. An electrical or thermal design error can limit expansion capacity for years. AI is unforgiving of waste: inefficiencies in energy, space, or cooling translate directly into higher token costs.
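A simple way to see how cooling and power-distribution losses reach the token price is to fold a facility overhead factor into the electricity cost. The sketch below is an assumption-laden illustration: the overhead factor (total facility power divided by IT power, similar in spirit to the PUE metric), the electricity price, and the throughput are all hypothetical values that do not appear in NVIDIA's material.

```python
def energy_cost_per_million_tokens(it_power_kw, overhead_factor, price_per_kwh, tokens_per_second):
    """Electricity cost per million tokens, including facility overhead.

    overhead_factor is total facility power divided by IT power
    (1.0 would mean no cooling or distribution losses).
    """
    facility_kw = it_power_kw * overhead_factor
    cost_per_hour = facility_kw * price_per_kwh
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical site: 1 MW of IT load, 0.10 per kWh, 1,000,000 tokens/s.
efficient = energy_cost_per_million_tokens(1000, 1.2, 0.10, 1_000_000)
wasteful = energy_cost_per_million_tokens(1000, 1.5, 0.10, 1_000_000)

print(f"overhead 1.2: {efficient:.4f} per million tokens")
print(f"overhead 1.5: {wasteful:.4f} per million tokens")
print(f"extra energy cost from the less efficient design: {wasteful / efficient - 1:.0%}")
```

With identical hardware and throughput, the less efficient facility pays 25% more for the electricity behind every token in this example.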

Layers of the AI Factory	Why It Matters
Accelerated compute	Runs models, reasoning, and inference
CPU	Coordinates agents, runtimes, processes, and services
Network	Connects thousands of accelerators and systems
Memory	Feeds models, long contexts, and parallel workloads
Storage	Stores data, vectors, checkpoints, and state
Software	Orchestrates workloads and maximizes utilization
Energy	Limits the economic size of deployment
Cooling	Enables operation at high densities without degradation

NVIDIA also aims to take this architecture beyond hyperscalers. It cites collaborations with Cisco, Dell, HPE, Lenovo, and Supermicro to bring AI infrastructure closer to enterprise data centers. The idea is that an AI factory can start with a specific business use and then scale to broader applications.

Companies building or renting intelligence

NVIDIA’s most ambitious claim is that every organization will need to build or rent an AI factory. Not all will do so with their own infrastructure; many will turn to cloud, neoclouds, colocation providers, or managed platforms. But the core idea remains: AI is shifting from an occasional tool to a permanent layer of work.

A financial institution might use agents for risk analysis, compliance, internal support, and software development. A pharma company could leverage AI for simulation, scientific documentation, and molecule discovery. An industrial company might deploy agents for maintenance, planning, robotics, and design. In all cases, the fundamental question is the same: how to produce intelligence safely, efficiently, and reliably?

The less comfortable part of this vision is its energy dimension. If an AI factory converts electricity into tokens, energy becomes the raw material for AI. That requires scrutinizing costs, electricity origin, thermal efficiency, and power availability just as thoroughly as software licenses were once considered.

The next stage of AI’s evolution won’t be decided solely by more capable models but also by who can serve them at lower cost per token, lower energy per response, and higher availability. NVIDIA aims for this battle to be fought within an architecture that controls every component end-to-end: GPU, CPU, network, software, systems, partners, and data center design.

Cloud promised to abstract infrastructure. AI is making it visible again. Behind every reasoning agent, assistant, and responding model is a physical factory tirelessly producing tokens.

Frequently Asked Questions

What does NVIDIA mean by an AI factory?
An infrastructure designed to produce tokens continuously through models, agents, accelerated compute, CPU, network, memory, storage, software, energy, and cooling coordinated as a single system.

Why is the cost per token so important?
Because it determines whether a company can scale AI profitably. Lower cost per token makes deploying models and agents in mass processes more viable.

What role does Vera CPU play?
Vera is intended for CPU-intensive tasks in agentic AI: compiling code, coordinating agents, running runtimes, processing data, querying databases, and keeping services operational in parallel.

Will all companies need to build their own AI factory?
Not necessarily. Some will do so for scale, security, or sovereignty reasons. Others will rent capacity from cloud, neocloud, or specialized providers. The key is to control cost, performance, security, and availability.

Via: Phoronix and NVIDIA blogs