NVIDIA is trying to change the language of the entire data center industry. It’s no longer just about GPUs, servers, or accelerated clusters but about “AI factories”: facilities designed to produce tokens continuously, the way an industrial plant produces electricity, steel, or components. The metaphor is a marketing one, but it helps explain a real shift: AI can no longer be treated as a software layer running on generic infrastructure.
In NVIDIA’s vision, an AI factory converts energy into intelligence. The unit of production isn’t a physical part but the token a model generates when reasoning, responding, writing code, coordinating agents, or executing a task. That’s why the relevant metrics are starting to resemble those of heavy industry rather than those of SaaS applications: tokens per second, tokens per watt, cost per token, infrastructure utilization, and availability.
Inference is no longer an isolated query
The major shift is in the workload. For many users, generative AI started as a text box: you write a question, the model responds, and the interaction ends. Agentic AI breaks that scheme. An agent can plan, seek information, call tools, read documents, write code, query databases, create sub-agents, and make chained decisions.
This makes inference a longer, more interactive process that’s more challenging to orchestrate. It’s no longer enough to have a powerful GPU waiting for a request. Coordination of memory, storage, network, CPU, software, queues, and external services is needed to keep the entire flow moving without unnecessary delays.
NVIDIA frames this as a full-stack problem. Models need accelerated compute, but also fast memory, context storage, low-latency networks for coordinating services, and software capable of maintaining high system utilization. If one layer lags, the cost per token rises, and the experience degrades.
| Metric | What it measures in an AI factory |
|---|---|
| Tokens per second | Ability to produce responses and actions |
| Tokens per watt | Energy efficiency of the system |
| Cost per token | Economic viability of inference at scale |
| Utilization | Degree of GPU, CPU, memory, and network usage |
| Uptime | Continuity of AI production |
| Latency | Response time in agents and interactive applications |
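
As a rough illustration of how the metrics in the table relate to one another, the sketch below derives them from a single measurement window. Everything here is an assumption for the sake of the example: the `FactoryWindow` fields, the `summarize` helper, and the sample numbers are hypothetical and do not come from NVIDIA. "Tokens per watt" is read here as tokens per second per watt, which is the same thing as tokens per joule.

```python
from dataclasses import dataclass


@dataclass
class FactoryWindow:
    """One measurement window for a hypothetical inference cluster."""
    tokens_generated: int    # tokens produced during the window
    window_seconds: float    # length of the window
    avg_power_watts: float   # average electrical draw of the whole system
    busy_seconds: float      # time the accelerators spent doing useful work
    cost_usd: float          # all-in cost attributed to the window


def summarize(w: FactoryWindow) -> dict:
    tokens_per_second = w.tokens_generated / w.window_seconds
    # A watt is one joule per second, so tokens/s divided by watts
    # is tokens per joule.
    tokens_per_second_per_watt = tokens_per_second / w.avg_power_watts
    return {
        "tokens_per_second": tokens_per_second,
        "tokens_per_second_per_watt": tokens_per_second_per_watt,
        "cost_per_million_tokens": w.cost_usd / w.tokens_generated * 1_000_000,
        "utilization": w.busy_seconds / w.window_seconds,
    }


# Illustrative numbers only, not measurements.
window = FactoryWindow(
    tokens_generated=36_000_000,
    window_seconds=3_600,
    avg_power_watts=100_000,
    busy_seconds=2_700,
    cost_usd=90.0,
)
metrics = summarize(window)
```

With these made-up inputs, the window works out to 10,000 tokens per second, 0.1 tokens per joule, $2.50 per million tokens, and 75% utilization. The point is only that all four figures come from the same handful of measurements, so a drop in utilization shows up directly in cost per token.
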
This perspective has implications for any company serious about deploying AI. The debate is no longer limited to choosing a model. Decisions now include where to run it, the cost per interaction, acceptable latency, how to maintain context, which data to retrieve, and the energy consumption of the infrastructure.
Data driving the new token economy
NVIDIA presents Blackwell Ultra and the GB300 NVL72 systems as answers to this new economy. According to the company, these systems can generate 50 times more tokens per megawatt than the Hopper generation and cut cost per token by a factor of 35. These figures come from NVIDIA and should be read within the company’s own comparison framework, but they illustrate the industry’s direction: producing more intelligence with less energy.
The company also highlights NVIDIA Dynamo, a framework designed to orchestrate long-context inference and high volumes of requests. In an AI factory, software makes many of the economic decisions. It routes requests, manages memory, balances latency and throughput, coordinates services, and prevents expensive hardware from waiting idle.
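
To make the routing idea concrete, here is a minimal sketch of one generic policy: send each request to the worker with the fewest queued tokens. This is not Dynamo's API or its actual scheduling algorithm; `route_requests`, the token counts, and the two-worker setup are hypothetical and chosen only to show why routing affects idle time.

```python
import heapq


def route_requests(request_tokens, worker_count):
    """Assign each request to the worker with the least queued tokens.

    Returns one list of request indices per worker.
    """
    # Heap entries are (queued_tokens, worker_id); the lightest load pops first.
    heap = [(0, worker_id) for worker_id in range(worker_count)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(worker_count)]
    for index, tokens in enumerate(request_tokens):
        load, worker_id = heapq.heappop(heap)
        assignments[worker_id].append(index)
        heapq.heappush(heap, (load + tokens, worker_id))
    return assignments


# Illustrative request sizes in tokens: one long-context job and several short ones.
requests = [8000, 500, 500, 500, 4000, 500]
plan = route_requests(requests, worker_count=2)
loads = [sum(requests[i] for i in worker) for worker in plan]
```

In this example the long 8,000-token request occupies the first worker while the remaining five requests go to the second, leaving loads of 8,000 and 6,000 tokens. A naive round-robin split of the same list would put 12,500 tokens on one worker and 1,500 on the other, which is the kind of imbalance that leaves hardware waiting.
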
| Key Data | Reported Figure | Why It Matters |
|---|---|---|
| GB300 NVL72 vs Hopper | 50x more tokens per megawatt | Measures improved AI production efficiency per unit of energy |
| GB300 NVL72 vs Hopper | 35x lower cost per token | Directly impacts inference profitability |
| Vera Rubin with LPX | Up to 35x more performance per watt | Pushes the next generation of agentic AI and reasoning |
| Vera CPU | 88 Olympus cores | Reinforces the CPU’s role in agents, runtimes, and orchestration |
| Vera’s memory bandwidth | Up to 1.2 TB/s | Helps sustain workloads with heavy memory pressure |
| Vera vs. Grace, per Phoronix | 1.6x geometric mean performance | Shows a significant generational leap in data center CPUs |
| Vera vs. a 128-core x86, per NVIDIA | 1.5x overall performance | Positions ARM as a serious competitor in AI infrastructure |
| Linux kernel compile on Vera | 20 seconds | Practical example of development workload performance |
The next step is Vera Rubin. NVIDIA claims this platform, along with LPX, is designed to lift performance per watt again in reasoning workloads and agentic AI. The clear message: the company wants the conversation to shift from “which GPU should I buy” to “what AI factory can I operate at the lowest cost per token.”
This strategy also shields NVIDIA from increasingly specialized competition. ASICs, inference chips, LPUs, TPUs, and custom accelerators aim to target specific market segments with better costs or latencies. NVIDIA responds by broadening the offering: it’s not just selling chips but entire architectures.
The CPU returns to the center of AI infrastructure
An AI factory is not built solely with GPUs. NVIDIA is also pushing Vera, its new data center CPU based on its own Olympus cores and Armv9.2 architecture. The technical message is significant because agents do not only run matrix operations on accelerators; they compile code, launch isolated environments, process data, manage runtimes, coordinate tools, execute Python or Java, and query databases.
Based on initial results published by Phoronix and shared by NVIDIA, Vera offers 88 Olympus cores, 176 threads, up to 1.2 TB/s of LPDDR5X memory bandwidth, 164 MB of unified L3 cache, PCIe Gen 6, and CXL 3.1 support. The tested chip had a maximum TDP of 450W, with LPDDR5X memory consuming around 50W or less, per Phoronix.
| Features of NVIDIA Vera | Technical Data |
|---|---|
| Architecture | Armv9.2 |
| Cores | 88 Olympus |
| Threads | 176 |
| Memory bandwidth | Up to 1.2 TB/s |
| L2 cache | 2 MB per core |
| Unified L3 cache | 164 MB |
| Connectivity | PCIe Gen 6 and CXL 3.1 |
| Socket TDP | 450 W |
| Memory consumption in testing | Around 50W or less |
| Expected availability | Second half of the year, via partners |
The memory figure is especially important. Agentic workloads are not bound by core count alone; they run many parallel processes that need good memory access and consistent latencies. NVIDIA states that Vera sustains 90% of its peak bandwidth in the STREAM Triad test and offers over four times the bandwidth per core of traditional x86 CPUs. This is a clear attempt to address a classic data center bottleneck: moving data quickly without significantly increasing power consumption.
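
The figures quoted above can be turned into per-core numbers with some simple arithmetic. The sketch below uses only the values stated in the article (1.2 TB/s peak, 88 cores, 90% sustained) and assumes 1 TB/s = 1,000 GB/s. The `triad_bytes_moved` helper is a hypothetical name; it reflects how the STREAM benchmark counts traffic for its Triad kernel, `a[i] = b[i] + q * c[i]`, with two arrays read and one written.

```python
PEAK_BANDWIDTH_GB_S = 1200.0   # "up to 1.2 TB/s", taken as 1,200 GB/s
CORES = 88
SUSTAINED_FRACTION = 0.90      # NVIDIA's stated STREAM Triad efficiency

sustained_gb_s = PEAK_BANDWIDTH_GB_S * SUSTAINED_FRACTION
peak_per_core_gb_s = PEAK_BANDWIDTH_GB_S / CORES
sustained_per_core_gb_s = sustained_gb_s / CORES


def triad_bytes_moved(n_elements: int, bytes_per_element: int = 8) -> int:
    """Bytes STREAM counts for one Triad pass over arrays of n_elements:
    b and c are read, a is written, so three arrays are touched."""
    return 3 * n_elements * bytes_per_element
```

Under these assumptions, sustained bandwidth is about 1,080 GB/s, or roughly 13.6 GB/s per core at peak and 12.3 GB/s per core sustained. One Triad pass over arrays of one million 8-byte elements counts as 24 MB of traffic.
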
Design before building
AI factories cannot be improvised. A traditional data center could scale by adding servers, more storage, or new cabinets. In AI, power density, liquid cooling, interconnects, load distribution, and power supply demands require designing the system as a single integrated unit.
NVIDIA discusses extreme co-design: hardware, network, memory, storage, software, energy, and cooling all planned together from the outset. It also cites its DSX reference designs and the use of digital twins via Omniverse DSX Blueprint to model installations, equipment, cooling, and operations before actual deployment.
This is especially critical in projects involving hundreds of megawatts or even gigawatts. An electrical or thermal design error can limit expansion capacity for years. AI is unforgiving of waste: inefficiencies in energy, space, or cooling translate directly into higher token costs.
| Layers of the AI Factory | Why It Matters |
|---|---|
| Accelerated compute | Runs models, reasoning, and inference |
| CPU | Coordinates agents, runtimes, processes, and services |
| Network | Connects thousands of accelerators and systems |
| Memory | Feeds models, long contexts, and parallel workloads |
| Storage | Stores data, vectors, checkpoints, and state |
| Software | Orchestrates workloads and maximizes utilization |
| Energy | Limits the economic size of deployment |
| Cooling | Enables operation at high densities without degradation |
NVIDIA also aims to take this architecture beyond hyperscalers. It cites collaborations with Cisco, Dell, HPE, Lenovo, and Supermicro to bring AI infrastructure closer to enterprise data centers. The idea is that an AI factory can start with a specific business use and then scale to broader applications.
Companies building or renting intelligence
NVIDIA’s most ambitious claim is that every organization will need to build or rent an AI factory. Not all will do so with their own infrastructure; many will turn to cloud, neoclouds, colocation providers, or managed platforms. But the core idea remains: AI is shifting from an occasional tool to a permanent layer of work.
A financial institution might use agents for risk analysis, compliance, internal support, and software development. A pharma company could leverage AI for simulation, scientific documentation, and molecule discovery. An industrial company might deploy agents for maintenance, planning, robotics, and design. In all cases, the fundamental question is the same: how to produce intelligence safely, efficiently, and reliably?
The less comfortable part of this vision is its energy dimension. If an AI factory converts electricity into tokens, energy becomes the raw material for AI. That requires scrutinizing costs, electricity origin, thermal efficiency, and power availability just as thoroughly as software licenses were once considered.
The next stage of AI’s evolution won’t be decided solely by more capable models but also by who can serve them at lower cost per token, lower energy per response, and higher availability. NVIDIA wants this battle to be fought within an architecture it controls end-to-end: GPU, CPU, network, software, systems, partners, and data center design.
Cloud promised to abstract infrastructure. AI is making it visible again. Behind every reasoning agent, assistant, and responding model is a physical factory tirelessly producing tokens.
Frequently Asked Questions
What does NVIDIA mean by an AI factory?
An infrastructure designed to produce tokens continuously through models, agents, accelerated compute, CPU, network, memory, storage, software, energy, and cooling coordinated as a single system.
Why is the cost per token so important?
Because it determines whether a company can scale AI profitably. Lower cost per token makes deploying models and agents in mass processes more viable.
What role does Vera CPU play?
Vera is intended for CPU-intensive tasks in agentic AI: compiling code, coordinating agents, running runtimes, processing data, querying databases, and keeping services operational in parallel.
Will all companies need to build their own AI factory?
Not necessarily. Some will do so for scale, security, or sovereignty reasons. Others will rent capacity from cloud, neocloud, or specialized providers. The key is to control cost, performance, security, and availability.
via: Phoronix and NVIDIA blogs

