F5 (NASDAQ: FFIV) announced the expansion of BIG-IP Next for Kubernetes to the new NVIDIA BlueField-4 DPU, targeting gigascale AI factories. The combination promises up to 800 Gb/s of multi-tenant networking with intelligent control, zero-trust security enhancements, and acceleration of LLM inference workloads that, according to F5, translate into a 30% increase in token generation capacity while maintaining “cloud-grade” expectations.
This move aligns with the trend of bringing networking and security functions to the DPU (data processing unit) to free up GPUs from non-AI tasks and reduce p99 latencies. In environments with massive context windows, autonomous agents, and multi-model traffic, every microsecond counts.
What the F5 + BlueField-4 integration delivers (in brief)
- Performance, multi-tenancy, and security: F5 claims a 30% improvement in token capacity by offloading the data path and policy controls to the DPU while maintaining tenant isolation at up to 800 Gb/s.
- Optimized LLM inference: integration with NVIDIA Dynamo and the KV Cache Manager reduces latency, improves GPU utilization, enables disaggregated serving, and adapts to variable memory demands (shifting prompt and context sizes).
- Multi-model intelligent routing: via NVIDIA NIM microservices, F5’s control plane can steer traffic across multiple models to optimize for TTFT (Time to First Token), cost, or quality.
- Granular token governance: metrics and visibility for compliance, accounting, and risk, crucial in multi-team environments.
- Scalable and secure MCP: enhanced protection for the Model Context Protocol (MCP), so agents and MCP-dependent tools keep their speed without opening security gaps.
- Zero-trust on VMs and bare metal: supported by the NVIDIA DOCA Platform Framework (DPF), with tenant segregation and distributed AI networks that are secure by design.
- Programmability: F5 iRules applied to AI flows to create policies, rate limits, or customized security behaviors (a minimal sketch of this kind of per-tenant policy follows the next paragraph).
Practical translation: The DPU accelerates and isolates the fast data path (encryption, telemetry, segmentation, WAF/L4-L7, etc.), while F5 adds observability and control so the AI scheduler serves more tokens with less queuing.
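To make the programmability bullet concrete, here is a minimal Python sketch of per-tenant token rate limiting, the kind of policy an iRule would express on the data path. The tenant names, limits, and function shape are illustrative assumptions for this sketch, not F5 or NVIDIA APIs; production iRules are written in TCL against BIG-IP events and tables.

```python
import time

# Illustrative per-tenant budgets (tokens per minute); names and limits are
# assumptions for this sketch, not part of any F5 or NVIDIA API.
LIMITS = {"team-a": 100_000, "team-b": 25_000}
DEFAULT_LIMIT = 10_000
_buckets: dict[str, tuple[float, float]] = {}  # tenant -> (tokens_left, last_refill)

def allow_request(tenant: str, tokens_requested: int) -> bool:
    """Token-bucket check: refill proportionally to elapsed time, then spend.

    Returns False when the tenant is over budget; the caller would respond
    429 (ideally with Retry-After) instead of forwarding upstream.
    """
    now = time.monotonic()
    capacity = LIMITS.get(tenant, DEFAULT_LIMIT)
    tokens, last = _buckets.get(tenant, (float(capacity), now))
    tokens = min(float(capacity), tokens + (now - last) * capacity / 60.0)
    if tokens_requested > tokens:
        _buckets[tenant] = (tokens, now)
        return False
    _buckets[tenant] = (tokens - tokens_requested, now)
    return True
```

The same shape generalizes to the other policies the bullet names: fallbacks and circuit breaking are just different decisions at the same enforcement point.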
Why it matters for modern AI workloads
1) More tokens/second and better TTFT
- KV-cache kept hot and managed → fewer cache misses, fewer trips to memory, and a GPU that spends more time on actual compute (see the back-of-envelope model after this list).
- DPU offload → fewer context switches on the host CPU, less jitter, and a more predictable p99.
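As a back-of-envelope illustration of the KV-cache point (not vendor data), TTFT can be modeled as queue wait plus prefill of the uncached part of the prompt; the per-token prefill cost below is a placeholder:

```python
def estimated_ttft_ms(queue_wait_ms: float,
                      prompt_tokens: int,
                      cached_prefix_tokens: int,
                      prefill_ms_per_token: float = 0.5) -> float:
    """Back-of-envelope TTFT: queue wait + prefill of the *uncached* prompt part.

    prefill_ms_per_token is an illustrative placeholder, not a measured figure.
    A KV-cache hit on a long shared prefix (system prompt, few-shot examples)
    removes that prefix from the prefill bill, which is where the TTFT win lives.
    """
    uncached = max(0, prompt_tokens - cached_prefix_tokens)
    return queue_wait_ms + uncached * prefill_ms_per_token

# Example: an 8k-token prompt, with and without a 6k-token cached prefix
print(estimated_ttft_ms(20, 8000, 0))     # cold cache: 4020.0 ms
print(estimated_ttft_ms(20, 8000, 6000))  # warm cache: 1020.0 ms
```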
2) Efficiency in heterogeneous clusters
- With NIM, the control plane can balance across models and versions by cost, latency, or quality.
- Useful for canaries, A/B testing, fallbacks by region or SLA, or graceful degradation during spikes (a weighted-routing sketch follows).
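A minimal sketch of that selection logic, assuming hypothetical model names and a 90/10 canary split (NIM and F5 expose their own configuration for this; only the weighting idea is shown):

```python
import random

# Hypothetical model pool; names and the 90/10 split are illustrative, not
# actual NIM endpoints or F5 configuration.
POOL = [
    {"name": "llm-v1-stable", "weight": 0.9},
    {"name": "llm-v2-canary", "weight": 0.1},
]

def pick_model(pool=POOL, rng=random.random) -> str:
    """Weighted random selection. Shifting weights runs a canary or A/B test;
    zeroing a member out is a fallback when its region or SLO degrades."""
    r, cumulative = rng(), 0.0
    for member in pool:
        cumulative += member["weight"]
        if r < cumulative:
            return member["name"]
    return pool[-1]["name"]  # guard against floating-point rounding
```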
3) Security and multi-tenancy without performance penalties
- DOCA/DPF enables micro-segmentation of tenants and AI services (per project, team, or client) with encryption and policies close to the wire.
- Reduces the exposed attack surface on the host and eases compliance in regulated environments.
4) Usage governance
- Token accounting per model/tenant/queue → the foundation for chargeback/showback, budget limits, priority policies, and abuse detection (see the accounting sketch below).
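A minimal showback sketch under assumed field names and prices (nothing here is an F5 metric schema; it only shows how per-tenant token counts roll up into cost):

```python
from collections import defaultdict

# Aggregate usage per (tenant, model); field names are assumptions for the sketch.
usage: dict[tuple[str, str], dict[str, int]] = defaultdict(
    lambda: {"prompt_tokens": 0, "completion_tokens": 0, "requests": 0}
)

def record(tenant: str, model: str, prompt_tokens: int, completion_tokens: int) -> None:
    """Fold one request's token counts into the running per-tenant/model totals."""
    entry = usage[(tenant, model)]
    entry["prompt_tokens"] += prompt_tokens
    entry["completion_tokens"] += completion_tokens
    entry["requests"] += 1

def showback(price_per_1k_prompt: float, price_per_1k_completion: float) -> dict:
    """Turn raw token counts into a per-tenant/model cost view (chargeback/showback)."""
    return {
        key: round(v["prompt_tokens"] / 1000 * price_per_1k_prompt
                   + v["completion_tokens"] / 1000 * price_per_1k_completion, 4)
        for key, v in usage.items()
    }
```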
Where it fits within an “AI factory” stack
Physical/IO layer: BlueField-4 DPU (network acceleration, encryption, telemetry, DOCA).
L4-L7 network/security layer: F5 BIG-IP Next for Kubernetes (service proxy, WAF/API security, load balancing, iRules).
Model serving and orchestration layer: NVIDIA NIM + Dynamo + KV Cache Manager (runtimes, schedulers, memory/state management).
Application layer: AI gateways, multi-model routers, MCP, agents.
The proposition: disaggregate serving (state, cache, control) from GPU compute and push networking/security down to the DPU, so capacity scales in blocks (more nodes, same SLOs).
Design considerations (if planning adoption)
- Topologies: validate effective throughput per node (the 800 Gb/s figure is a ceiling; measure goodput with encryption, telemetry, and policies active).
- SLOs: define TTFT, tokens/sec, p95/p99, and error budgets per queue/model/tenant; drive autoscaling from real metrics (queue depth, utilization, cache hit rate). A percentile-check sketch follows this list.
- Policies and iRules: per-tenant rate limits, token caps, fallback strategies, and circuit breaking for saturated routes.
- Observability: end-to-end L7 traceability, token accounting, and GPU utilization; alerting on KV-cache degradation or latency drift.
- Security: DOCA/DPF for micro-segmentation, mTLS between microservices, WAF/API security on public endpoints, and hardened MCP policies.
- Cost efficiency: compare tokens per dollar of freed GPU capacity against DPU cost and the F5 footprint; measure the savings from consolidating network/security functions.
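As promised above, a small sketch of the SLO check that would gate autoscaling; the thresholds are placeholders, and in practice they fall out of the error budget per queue, model, and tenant:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (no interpolation); samples must be non-empty."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def should_scale_out(ttft_ms: list[float],
                     slo_p95_ms: float = 300.0,
                     slo_p99_ms: float = 800.0) -> bool:
    """Flag a scale-out when either tail breaches its SLO.

    The 300/800 ms thresholds are illustrative assumptions, not F5 or NVIDIA
    guidance; real values come from the error budgets defined per queue/model/tenant.
    """
    return (percentile(ttft_ms, 95) > slo_p95_ms
            or percentile(ttft_ms, 99) > slo_p99_ms)
```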
Typical use cases
- Multi-model inference (large context windows) with low-TTFT SLOs.
- Multi-tenant traffic (teams/clients) with budget limits and priorities.
- Compliance (token accounting, decision traceability, policy auditability).
- Hybrid deployments: VMs and bare metal on-prem/colo with a consistent zero-trust posture.
Fine print
F5 frames this announcement as an expansion of its Kubernetes solution to BlueField-4. The benefits cited (like the 30% token-capacity gain) depend on design, load, and tuning. Like any product note, it includes forward-looking statements, and results will vary with integration and environment.
In summary
F5 and NVIDIA drive networking and security to the DPU and disaggregate serving so GPUs focus on AI, not packet pushing. With BIG-IP Next for Kubernetes on BlueField-4, organizations can serve more tokens, faster, and with less jitter, maintain tenant isolation, and govern usage—all key for the next wave of AI factories and agent systems.
via: f5.com

