Agentic AI is forcing a new way to measure data centers. It’s no longer enough to know how many tokens per second a model delivers in a single request. Agents operate over longer periods: they chain steps, call tools, maintain context, edit code, run tests, and reason again over the information they receive. This kind of usage shifts the pressure on infrastructure from single-request speed to sustained concurrency.

NVIDIA has published its first results in AA-AgentPerf, a new Artificial Analysis benchmark designed to measure how many AI agents a platform can sustain under realistic loads. The results clearly favor Blackwell Ultra: the NVIDIA GB300 NVL72 system reaches up to 20 times more capacity per megawatt than a Hopper-based HGX H200 in agentic coding workloads.

This figure summarizes the leap well. According to the published data, GB300 NVL72 supports 61,400 concurrent agents per MW compared to 2,600 for H200. In terms of capacity per GPU, the difference is also significant: 57.5 concurrent agents per accelerator versus 1.4 in the previous generation. These results come from tests with DeepSeek V4 Pro, a Mixture-of-Experts model used as a representative of modern agent workloads.

What AA-AgentPerf measures and why it matters

AA-AgentPerf doesn’t aim to measure simple chatbot conversations. Its goal is to evaluate infrastructure performance when many agents work simultaneously on long, variable tasks, similar to what is seen in AI-assisted development environments.

The benchmark uses real trajectories from programming agents. These include sessions with multiple turns, interleaved reasoning, tool calls, code editing, and highly variable context lengths. According to Artificial Analysis, input sequences can exceed 100,000 tokens, with an average around 27,000 tokens in the tested set.

This matters because agent workloads stress different parts of the system. An agent doesn’t just generate text. It reads context, waits for tool results, resumes sessions, reuses KV cache, alternates between prefill and decoding, and keeps many requests alive for extended periods. In production, this mix affects the scheduler, memory usage, the GPU interconnect, and the ability to keep latency and output speed within target levels.

AA-AgentPerf Metric	What it indicates
TTFT	Time to first token
Output speed	Tokens per second after output starts
System throughput	Tokens per second with concurrent agents
Concurrent agents per MW	Effective capacity per energy budget
Concurrent agents per GPU	Effective capacity per accelerator
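As a rough illustration of the first two metrics in the table, the sketch below computes TTFT and output speed from token arrival timestamps for a single request. The `RequestTrace` structure and the exact formulas are assumptions made for this example; the article does not describe how Artificial Analysis implements its measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one request (not an AA-AgentPerf data type)."""
    sent_at: float            # time the request was sent, in seconds
    token_times: List[float]  # arrival time of each output token, in seconds


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    return trace.token_times[0] - trace.sent_at


def output_speed(trace: RequestTrace) -> float:
    """Tokens per second after output starts (the first token is not counted)."""
    duration = trace.token_times[-1] - trace.token_times[0]
    if duration <= 0:
        return 0.0  # a single token gives no measurable decoding interval
    return (len(trace.token_times) - 1) / duration


# Made-up timestamps: first token after 0.5 s, then one token every 0.25 s.
trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(f"TTFT: {ttft(trace):.2f} s")
print(f"Output speed: {output_speed(trace):.1f} tokens/s")
```

Separating the two values matters for agents: a long input context mostly lengthens TTFT, while contention between many active sessions tends to show up in output speed.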

The most relevant metric for infrastructure operators is concurrent agents per megawatt. In an AI data center, energy has become as critical a constraint as GPU cost. Knowing how many agents a facility can run per MW helps estimate capacity, operational costs, and ROI on hardware investments.
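A minimal capacity-planning sketch shows how an operator might use this metric. It takes the per-MW figures quoted in this article and assumes, purely for illustration, that capacity scales linearly with the power budget; real facilities lose power to cooling, networking, and other overheads that this sketch ignores. The function names and the 5 MW site are hypothetical.

```python
# Concurrent agents per MW, as quoted in the published results.
AGENTS_PER_MW = {
    "GB300 NVL72": 61_400,
    "HGX H200": 2_600,
}


def agents_for_budget(platform: str, power_mw: float) -> int:
    """Agents a power budget could sustain, assuming linear scaling (an assumption)."""
    return int(AGENTS_PER_MW[platform] * power_mw)


def power_for_agents(platform: str, target_agents: int) -> float:
    """Power in MW needed for a target agent count, under the same assumption."""
    return target_agents / AGENTS_PER_MW[platform]


# Hypothetical 5 MW facility and a hypothetical target of 100,000 agents.
for name in AGENTS_PER_MW:
    capacity = agents_for_budget(name, 5.0)
    needed = power_for_agents(name, 100_000)
    print(f"{name}: {capacity:,} agents at 5 MW, {needed:.1f} MW for 100,000 agents")
```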

GB300 NVL72 versus H200: the leap of Blackwell Ultra

NVIDIA’s published data show a significant gap between GB300 NVL72 and H200 in agent workloads. The comparison covers not just raw GPU performance but the entire platform’s ability to sustain many agents within service-level targets.

Metric	NVIDIA GB300 NVL72	NVIDIA H200
Concurrent agents per MW	61,400	2,600
Concurrent agents per GPU	57.5	1.4
Approximate difference per MW	Up to 20x more	Reference
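The ratios behind the table can be checked with simple arithmetic. The sketch below only divides the quoted figures; it makes no claim about how NVIDIA derived its "up to 20x" headline number, which may rest on rounding or on a comparison point the published summary does not spell out.

```python
# Figures as quoted in the table above.
gb300 = {"agents_per_mw": 61_400, "agents_per_gpu": 57.5}
h200 = {"agents_per_mw": 2_600, "agents_per_gpu": 1.4}


def ratio(metric: str) -> float:
    """GB300 NVL72 value divided by the H200 value for the given metric."""
    return gb300[metric] / h200[metric]


per_mw = ratio("agents_per_mw")
per_gpu = ratio("agents_per_gpu")
print(f"Per MW:  {per_mw:.1f}x")
print(f"Per GPU: {per_gpu:.1f}x")
```

Dividing the quoted values gives roughly 23.6x per MW and 41.1x per GPU, so the per-GPU gap is wider than the per-MW gap.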

This advantage isn’t due to a single component. NVIDIA attributes the results to integrated hardware, software, and interconnect design. GB300 NVL72 connects 72 GPUs over a high-capacity NVLink domain, which is especially beneficial for MoE models like DeepSeek V4 Pro, where execution needs to be distributed among experts and remain coordinated without communication bottlenecks.

Inference software such as TensorRT LLM, SGLang, or vLLM also plays a role, along with techniques that separate prefill from decoding, improve KV cache utilization, and keep GPU utilization high as session counts grow. In agentic AI, the goal isn’t just fast response times but supporting thousands of active agents without latency or output speed falling outside agreed service levels.
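The idea of sustaining agents within agreed service levels can be sketched as a search over a concurrency sweep: raise the number of concurrent agents until TTFT or per-agent output speed breaks its target, and report the last level that passed. The data points, thresholds, and names below are invented for illustration and are not AA-AgentPerf's actual SLOs or procedure.

```python
from typing import List, NamedTuple, Optional


class SweepPoint(NamedTuple):
    """One hypothetical measurement at a fixed concurrency level."""
    concurrent_agents: int
    ttft_s: float               # measured time to first token, in seconds
    output_tokens_per_s: float  # measured per-agent output speed


def max_agents_within_slo(points: List[SweepPoint],
                          max_ttft_s: float,
                          min_tokens_per_s: float) -> Optional[int]:
    """Highest concurrency reached before either target is first violated."""
    best = None
    for p in sorted(points, key=lambda p: p.concurrent_agents):
        if p.ttft_s > max_ttft_s or p.output_tokens_per_s < min_tokens_per_s:
            break  # stop at the first violation; higher loads are not trusted
        best = p.concurrent_agents
    return best


# Made-up sweep: latency rises and speed falls as concurrency grows.
sweep = [
    SweepPoint(8, 0.8, 60.0),
    SweepPoint(16, 1.1, 52.0),
    SweepPoint(32, 1.9, 41.0),
    SweepPoint(64, 4.5, 22.0),
]
print(max_agents_within_slo(sweep, max_ttft_s=2.0, min_tokens_per_s=30.0))
```

Stopping at the first violation, rather than taking the largest passing point anywhere in the sweep, is a deliberate choice here: it avoids counting a noisy pass at a load level the system cannot reliably hold.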

This shifts the conversation for cloud providers, hyperscalers, AI labs, and companies deploying large-scale internal agents. The question evolves from “which GPU is fastest” to “how many useful agents can I run with my energy, space, and budget.” In this context, performance per MW becomes a planning metric.

Data centers for agents, not just models

The rise of AI agents means infrastructure increasingly resembles a factory for long-running processes. A programming assistant might receive an incident report, inspect files, suggest changes, run tests, fix bugs, and repeat the cycle multiple times. Each step involves new model calls and maintains accumulated context.
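To see why accumulated context weighs on infrastructure, the sketch below counts how many input tokens a session must process across its steps, with and without reuse of previously computed context. It is an idealized model with hypothetical numbers: it assumes a fixed number of new tokens per step and perfect cache reuse, neither of which comes from the benchmark.

```python
from typing import Tuple


def prefill_tokens(initial_context: int,
                   new_tokens_per_step: int,
                   steps: int) -> Tuple[int, int]:
    """Return (without_cache, with_cache) input tokens processed over a session.

    without_cache: every model call re-reads the whole accumulated context.
    with_cache:    only tokens added since the previous call are processed
                   (idealized, assumes the earlier context is fully reusable).
    """
    without_cache = 0
    with_cache = 0
    context = initial_context
    for step in range(steps):
        without_cache += context
        with_cache += initial_context if step == 0 else new_tokens_per_step
        context += new_tokens_per_step  # tool output and edits extend the context
    return without_cache, with_cache


# Hypothetical session: 10,000-token start, 2,000 new tokens per step, 5 steps.
no_cache, cached = prefill_tokens(10_000, 2_000, 5)
print(f"Without reuse: {no_cache:,} tokens, with reuse: {cached:,} tokens")
```

In this made-up session the model would process 70,000 input tokens without reuse and 18,000 with it, which is why cache capacity and context storage matter as session counts grow.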

This requires designing data centers with a different perspective. Memory, internal networking, cooling, energy efficiency, and orchestration software are more critical than in traditional inference tests. A poorly balanced system might have powerful GPUs but still deliver poor user experience if bottlenecks occur in cache, interconnects, scheduler, or context storage.

AA-AgentPerf aims to capture this new reality. It doesn’t replace other inference benchmarks but adds a layer aligned with how many companies expect AI to be used in the coming years. If agents transition from individual tools to fleets of autonomous processes working in parallel, infrastructure must be measured in sustained capacity, efficiency, and predictability.

It’s also wise to approach initial results with caution. These are based on specific configurations, with chosen models, SLOs, and optimizations. Not all enterprise workloads will behave the same. A programming agent differs from financial, legal, customer service, or scientific analysis agents. Nonetheless, the benchmark sets a clear direction: measuring agentic AI requires longer, more variable, near-production tests.

Rubin as the next frontier

The timing for releasing these results isn’t accidental. NVIDIA is already preparing for the transition to Vera Rubin, its next platform for large AI installations. The company announced that Vera Rubin is entering production for “AI factories,” with architecture combining Vera CPUs, Rubin GPUs, NVLink 6, BlueField-4, Spectrum-6, and new networking and storage systems optimized for agent workloads.

NVIDIA states that Rubin GPUs will deliver up to 50 PFLOPS of NVFP4 inference compute, with NVLink 6 providing 3.6 TB/s per GPU and 260 TB/s per Vera Rubin NVL72 rack. Vera is also presented as a CPU designed for agent workloads, with an emphasis on data movement and efficiency in scenarios where tool calls and shared context become increasingly important.
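The per-rack NVLink figure is consistent with the per-GPU figure if one assumes the "72" in NVL72 is the number of GPUs in the rack, as the article states for GB300 NVL72. A quick check under that assumption:

```python
NVLINK6_TB_S_PER_GPU = 3.6  # per-GPU NVLink 6 bandwidth quoted by NVIDIA
GPUS_PER_RACK = 72          # assumption: NVL72 means 72 GPUs per rack

rack_bandwidth_tb_s = NVLINK6_TB_S_PER_GPU * GPUS_PER_RACK
print(f"Aggregate NVLink bandwidth: {rack_bandwidth_tb_s:.1f} TB/s per rack")
```

The product is about 259.2 TB/s, which matches the quoted 260 TB/s once rounded.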

NVIDIA claims Vera Rubin can deliver up to 10x more agent throughput at scale compared to Grace Blackwell. While this promise will need to be validated through real deployments and independent benchmarks, it reflects market trends: more agents, more context, more concurrency, and higher energy demands.

For the cloud sector, the message is clear. Competitive advantage in AI will depend not only on access to the latest GPUs but also on designing racks, networks, inference software, security, multi-tenant isolation, context storage, and energy operation as a cohesive system. Companies deploying large-scale agents will seek not just raw power but effective capacity per MW, per rack, and per dollar spent.

Blackwell Ultra already exemplifies this shift. Hopper was a pivotal generation for generative AI growth, but agentic workloads are raising the bar further. GB300 NVL72 doesn’t come out ahead simply because it’s newer; it’s designed to keep many agents operating in parallel more efficiently.

Agentic AI remains in the early stages of enterprise adoption, but its infrastructure impact is measurable. If agents become a common layer in software development, customer support, analytics, IT operations, or industrial automation, data centers will need to scale for millions of persistent, intelligent processes. While Artificial Analysis’s benchmark doesn’t close the debate, it introduces a key metric: how many real agents a platform can support without degrading user experience.

Frequently Asked Questions

What is AA-AgentPerf?

It’s an Artificial Analysis benchmark that measures how many AI agents a platform can support under realistic loads while meeting output speed and time-to-first-token targets.

What results has NVIDIA GB300 NVL72 achieved?

It supports 61,400 concurrent agents per MW and 57.5 per GPU, versus 2,600 agents per MW and 1.4 per GPU for NVIDIA H200, in the published results.

Why are agentic loads different from traditional inference?

Because agents don’t just make one request. They reason across multiple turns, call tools, read and edit files, run tests, and maintain long contexts. This demands more memory, better scheduling, and greater overall system efficiency.

What role will NVIDIA Vera Rubin play?

Vera Rubin will be NVIDIA’s next large-scale AI platform. It combines Rubin GPUs, rated at up to 50 PFLOPS of NVFP4 inference compute, with Vera CPUs, NVLink 6, and new networking and storage systems, with the aim of improving agent performance at scale.

via: Nvidia

About The Author

Alex D. Smither W.