F5 (NASDAQ: FFIV) announced the expansion of BIG-IP Next for Kubernetes to the new NVIDIA BlueField-4 DPU, targeting gigascale AI factories. The combination promises up to 800 Gb/s of multi-tenant networking with intelligent control, zero-trust security enhancements, and acceleration of LLM inference workloads that, according to F5, translate into a 30% increase in token generation capacity while maintaining “cloud-grade” expectations.
This move aligns with the trend of bringing networking and security functions to the DPU (data processing unit) to free up GPUs from non-AI tasks and reduce p99 latencies. In environments with massive context windows, autonomous agents, and multi-model traffic, every microsecond counts.
What the F5 + BlueField-4 integration delivers (in brief)
- Performance, multi-tenancy, and security: F5 claims a 30% improvement in token capacity by offloading the data path and policy controls to the DPU while maintaining tenant isolation at up to 800 Gb/s.
- Optimized LLM inference: integration with NVIDIA Dynamo and the KV Cache Manager reduces latency, improves GPU utilization, enables disaggregated serving, and adapts to variable memory demands (shifting prompt and context sizes).
- Multi-model intelligent routing: via NVIDIA NIM microservices, F5’s control plane can steer traffic across multiple models to optimize for TTFT (Time to First Token), cost, or quality.
- Granular token governance: metrics and visibility for compliance, accounting, and risk, crucial in multi-team environments.
- Scalable and secure MCP: enhanced protection for the Model Context Protocol (MCP), so agents and MCP-dependent tools keep their speed without opening security gaps.
- Zero-trust on VMs and bare metal: supported by the NVIDIA DOCA Platform Framework (DPF), with tenant segregation and distributed AI networks that are secure by design.
- Programmability: F5 iRules applied to AI flows to create policies, rate limits, or customized security behaviors (a minimal sketch of this kind of per-tenant policy follows the next paragraph).
Practical translation: The DPU accelerates and isolates the fast data path (encryption, telemetry, segmentation, WAF/L4-L7, etc.), while F5 adds observability and control so the AI scheduler serves more tokens with less queuing.
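To make the programmability bullet concrete, here is a minimal Python sketch of per-tenant token rate limiting, the kind of policy an iRule would express on the data path. The tenant names, limits, and function shape are illustrative assumptions for this sketch, not F5 or NVIDIA APIs; production iRules are written in TCL against BIG-IP events and tables.

```python
import time

# Illustrative per-tenant budgets (tokens per minute); names and limits are
# assumptions for this sketch, not part of any F5 or NVIDIA API.
LIMITS = {"team-a": 100_000, "team-b": 25_000}
DEFAULT_LIMIT = 10_000
_buckets: dict[str, tuple[float, float]] = {}  # tenant -> (tokens_left, last_refill)

def allow_request(tenant: str, tokens_requested: int) -> bool:
    """Token-bucket check: refill proportionally to elapsed time, then spend.

    Returns False when the tenant is over budget; the caller would respond
    429 (ideally with Retry-After) instead of forwarding upstream.
    """
    now = time.monotonic()
    capacity = LIMITS.get(tenant, DEFAULT_LIMIT)
    tokens, last = _buckets.get(tenant, (float(capacity), now))
    tokens = min(float(capacity), tokens + (now - last) * capacity / 60.0)
    if tokens_requested > tokens:
        _buckets[tenant] = (tokens, now)
        return False
    _buckets[tenant] = (tokens - tokens_requested, now)
    return True
```

The same shape generalizes to the other policies the bullet names: fallbacks and circuit breaking are just different decisions at the same enforcement point.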
Why it matters for modern AI workloads
1) More tokens/second and better TTFT
- KV-cache kept hot and managed → fewer cache misses, fewer trips to memory, and a GPU that spends more time on actual compute (see the back-of-envelope model after this list).
- DPU offload → fewer context switches on the host CPU, less jitter, and a more predictable p99.
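As a back-of-envelope illustration of the KV-cache point (not vendor data), TTFT can be modeled as queue wait plus prefill of the uncached part of the prompt; the per-token prefill cost below is a placeholder:

```python
def estimated_ttft_ms(queue_wait_ms: float,
                      prompt_tokens: int,
                      cached_prefix_tokens: int,
                      prefill_ms_per_token: float = 0.5) -> float:
    """Back-of-envelope TTFT: queue wait + prefill of the *uncached* prompt part.

    prefill_ms_per_token is an illustrative placeholder, not a measured figure.
    A KV-cache hit on a long shared prefix (system prompt, few-shot examples)
    removes that prefix from the prefill bill, which is where the TTFT win lives.
    """
    uncached = max(0, prompt_tokens - cached_prefix_tokens)
    return queue_wait_ms + uncached * prefill_ms_per_token

# Example: an 8k-token prompt, with and without a 6k-token cached prefix
print(estimated_ttft_ms(20, 8000, 0))     # cold cache: 4020.0 ms
print(estimated_ttft_ms(20, 8000, 6000))  # warm cache: 1020.0 ms
```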
2) Efficiency in heterogeneous clusters
- With NIM, the control plane can balance across models and versions by cost, latency, or quality.
- Useful for canaries, A/B testing, fallbacks by region or SLA, or graceful degradation during spikes (a weighted-routing sketch follows).
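A minimal sketch of that selection logic, assuming hypothetical model names and a 90/10 canary split (NIM and F5 expose their own configuration for this; only the weighting idea is shown):

```python
import random

# Hypothetical model pool; names and the 90/10 split are illustrative, not
# actual NIM endpoints or F5 configuration.
POOL = [
    {"name": "llm-v1-stable", "weight": 0.9},
    {"name": "llm-v2-canary", "weight": 0.1},
]

def pick_model(pool=POOL, rng=random.random) -> str:
    """Weighted random selection. Shifting weights runs a canary or A/B test;
    zeroing a member out is a fallback when its region or SLO degrades."""
    r, cumulative = rng(), 0.0
    for member in pool:
        cumulative += member["weight"]
        if r < cumulative:
            return member["name"]
    return pool[-1]["name"]  # guard against floating-point rounding
```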
3) Security and multi-tenancy without performance penalties
- DOCA/DPF enables micro-segmentation of tenants and AI services (per project, team, or client) with encryption and policies close to the wire.
- Reduces the exposed attack surface on the host and eases compliance in regulated environments.
4) Usage governance
- Token accounting per model/tenant/queue → the foundation for chargeback/showback, budget limits, priority policies, and abuse detection (see the accounting sketch below).
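A minimal showback sketch under assumed field names and prices (nothing here is an F5 metric schema; it only shows how per-tenant token counts roll up into cost):

```python
from collections import defaultdict

# Aggregate usage per (tenant, model); field names are assumptions for the sketch.
usage: dict[tuple[str, str], dict[str, int]] = defaultdict(
    lambda: {"prompt_tokens": 0, "completion_tokens": 0, "requests": 0}
)

def record(tenant: str, model: str, prompt_tokens: int, completion_tokens: int) -> None:
    """Fold one request's token counts into the running per-tenant/model totals."""
    entry = usage[(tenant, model)]
    entry["prompt_tokens"] += prompt_tokens
    entry["completion_tokens"] += completion_tokens
    entry["requests"] += 1

def showback(price_per_1k_prompt: float, price_per_1k_completion: float) -> dict:
    """Turn raw token counts into a per-tenant/model cost view (chargeback/showback)."""
    return {
        key: round(v["prompt_tokens"] / 1000 * price_per_1k_prompt
                   + v["completion_tokens"] / 1000 * price_per_1k_completion, 4)
        for key, v in usage.items()
    }
```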
Where it fits within an “AI factory” stack
Physical/IO layer: BlueField-4 DPU (network acceleration, encryption, telemetry, DOCA).
L4-L7 network/security layer: F5 BIG-IP Next for Kubernetes (service proxy, WAF/API security, load balancing, iRules).
Model serving and orchestration layer: NVIDIA NIM + Dynamo + KV Cache Manager (runtimes, schedulers, memory/state management).
Application layer: AI gateways, multi-model routers, MCP, agents.
The proposition: disaggregate serving (state, cache, control) from GPU compute and push networking/security down to the DPU, so capacity scales in blocks (more nodes, same SLOs).
Design considerations (if planning adoption)
- Topologies: validate effective throughput per node (the 800 Gb/s figure is a ceiling; measure goodput with encryption, telemetry, and policies active).
- SLOs: define TTFT, tokens/sec, p95/p99, and error budgets per queue/model/tenant; drive autoscaling from real metrics (queue depth, utilization, cache hit rate). A percentile-check sketch follows this list.
- Policies and iRules: per-tenant rate limits, token caps, fallback strategies, and circuit breaking for saturated routes.
- Observability: end-to-end L7 traceability, token accounting, and GPU utilization; alerting on KV-cache degradation or latency drift.
- Security: DOCA/DPF for micro-segmentation, mTLS between microservices, WAF/API security on public endpoints, and hardened MCP policies.
- Cost efficiency: compare tokens per dollar of freed GPU capacity against DPU cost and the F5 footprint; measure the savings from consolidating network/security functions.
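As promised above, a small sketch of the SLO check that would gate autoscaling; the thresholds are placeholders, and in practice they fall out of the error budget per queue, model, and tenant:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (no interpolation); samples must be non-empty."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def should_scale_out(ttft_ms: list[float],
                     slo_p95_ms: float = 300.0,
                     slo_p99_ms: float = 800.0) -> bool:
    """Flag a scale-out when either tail breaches its SLO.

    The 300/800 ms thresholds are illustrative assumptions, not F5 or NVIDIA
    guidance; real values come from the error budgets defined per queue/model/tenant.
    """
    return (percentile(ttft_ms, 95) > slo_p95_ms
            or percentile(ttft_ms, 99) > slo_p99_ms)
```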
Typical use cases
- Multi-model inference (large context windows) with low-TTFT SLOs.
- Multi-tenant traffic (teams/clients) with budget limits and priorities.
- Compliance (token accounting, decision traceability, policy auditability).
- Hybrid deployments: VMs and bare metal on-prem/colo with a consistent zero-trust posture.
Fine print
F5 frames this announcement as an expansion of its Kubernetes solution to BlueField-4. The benefits cited (like the 30% token-capacity gain) depend on design, load, and tuning. Like any product note, it includes forward-looking statements, and results will vary with integration and environment.
In summary
F5 and NVIDIA drive networking and security to the DPU and disaggregate serving so GPUs focus on AI, not packet pushing. With BIG-IP Next for Kubernetes on BlueField-4, organizations can serve more tokens, faster, and with less jitter, maintain tenant isolation, and govern usage—all key for the next wave of AI factories and agent systems.
via: f5.com

