Zyphra has launched Zyphra Cloud, a new AI platform built on AMD infrastructure and designed to put large open-weight models into production. The move positions the San Francisco-based company in an increasingly competitive segment of the market: advanced model inference, where training a strong model is no longer enough; providers must also deliver fast, stable responses at a competitive cost.
The platform debuts with Zyphra Inference, a serverless inference service that provides access to models such as DeepSeek V3.2, Kimi K2.6, and GLM 5.1. According to the company, the service combines custom kernels, long-context algorithms, and advanced parallelism schemes to handle long-running workloads such as agentic coding, in-depth research, and complex automation workflows.
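Zyphra has not documented its public API in this announcement, but serverless inference services of this kind commonly expose an OpenAI-compatible chat endpoint. A minimal sketch under that assumption (the URL, model identifier, and auth header below are hypothetical, not confirmed values):

```python
import os
import requests

# Hypothetical endpoint and model ID -- Zyphra's actual API may differ.
API_URL = "https://api.zyphra.example/v1/chat/completions"
API_KEY = os.environ["ZYPHRA_API_KEY"]  # assumed auth scheme

payload = {
    "model": "deepseek-v3.2",  # illustrative identifier
    "messages": [
        {"role": "user", "content": "Summarize the last agent run."}
    ],
    "max_tokens": 512,
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```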
The deployment runs on AMD Instinct MI355X GPUs hosted by TensorWave, a cloud provider specializing in AI and HPC that works exclusively with AMD Instinct accelerators. For AMD, the announcement adds another piece to its strategy to compete in accelerated AI, an arena where NVIDIA holds a dominant position thanks to CUDA, its software ecosystem, and its strong presence in data centers.
Inference becomes the new battleground
In recent years, much of the AI conversation has focused on training large models. But as companies integrate assistants, agents, and automation systems into real-world processes, inference is gaining importance. Each query, each agent session, and each long workflow requires memory, bandwidth, and an architecture capable of maintaining context without unduly penalizing latency.
This is where Zyphra aims to differentiate itself. The company says Zyphra Inference is designed for large mixture-of-experts (MoE) models and for workloads with substantial context, where key-value (KV) and prefix caches can occupy a significant share of available memory. In these scenarios, having more HBM memory per GPU can reduce recomputation and increase the number of active sessions a node can support before performance degrades.
AMD Instinct MI355X GPUs fit this technical premise. According to AMD's specifications, each GPU carries 288 GB of HBM3E memory and 8 TB/s of memory bandwidth, along with support for low-precision formats such as MXFP8, MXFP6, and MXFP4. These formats matter for serving models with a smaller memory footprint and higher throughput, though final quality depends on the model, the quantization, and the specific implementation.
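A rough sense of why capacity and precision matter: the KV cache grows linearly with context length, layer count, and KV heads, and shrinks with narrower numeric formats. A minimal sketch with illustrative model dimensions (the layer and head counts below are assumptions for the example, not the published configurations of DeepSeek, Kimi, or GLM):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Memory one token occupies in the KV cache: one key and one value
    vector per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative dimensions for a large model (assumed, not actual specs).
n_layers, n_kv_heads, head_dim = 61, 8, 128
context_len = 256_000  # a 256K-token agent session

for name, nbytes in [("FP16", 2), ("FP8/MXFP8", 1), ("MXFP4", 0.5)]:
    per_token = kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, nbytes)
    session_gb = per_token * context_len / 1e9
    print(f"{name}: {session_gb:.1f} GB of KV cache per 256K-token session")
```

Under these assumed dimensions a single 256K session costs roughly 64 GB of cache at FP16 and about 16 GB at 4-bit precision, which is why both raw HBM capacity and low-precision support show up together in this pitch.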
Additionally, Zyphra published a technical analysis comparing, for a specific case with Kimi K2.6, the memory available for caches on a node with eight MI355X GPUs versus a node with eight B200 GPUs. Under its assumptions, an MI355X node can support about 184 active agents with 256K context, compared with roughly 100 in the B200 example. This is a partial estimate rather than an independent benchmark, but it clarifies the positioning: fewer delays, more sessions resident in memory, and better performance for long-running agents.
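Zyphra's headline numbers reduce to a memory-budget calculation of this style. A minimal sketch with assumed figures (the weight footprint and per-session cache size below are illustrative values chosen for the example, not numbers from Zyphra's analysis; the B200 figure is NVIDIA's published 192 GB of HBM3E per GPU):

```python
# Per-node HBM capacity, 8 GPUs per node.
MI355X_NODE_GB = 8 * 288   # 2304 GB (AMD spec)
B200_NODE_GB   = 8 * 192   # 1536 GB (NVIDIA spec)

# Assumed fixed cost and per-session footprint -- illustrative only.
WEIGHTS_GB = 550           # quantized model weights resident on the node
CACHE_PER_SESSION_GB = 9.5 # KV + prefix cache for one 256K-token agent

def max_sessions(node_gb):
    """Sessions that fit once weights are resident; the rest is cache."""
    return int((node_gb - WEIGHTS_GB) // CACHE_PER_SESSION_GB)

print("MI355X node:", max_sessions(MI355X_NODE_GB), "sessions")
print("B200 node:  ", max_sessions(B200_NODE_GB), "sessions")
```

Because the weight footprint is a fixed cost per node, the session ratio can exceed the raw 1.5x capacity ratio between the two nodes: subtracting the same constant from both budgets leaves proportionally more cache headroom on the larger one.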
AMD gains visibility in the AI cloud
The launch also raises AMD's profile. The company has long worked to strengthen its position in AI infrastructure with the Instinct family and ROCm, its software platform for accelerated computing. The challenge is not just selling powerful chips but demonstrating complete stacks capable of running advanced models in production.
Both Zyphra and TensorWave contribute to this narrative. TensorWave provides the AMD-based compute infrastructure, while Zyphra focuses on software, models, kernels, and inference services. The combined effort points to a clear trend: more providers are seeking alternatives to NVIDIA's dominant stack, not necessarily to replace it entirely but to open up options around cost, availability, and technological sovereignty.
The open-weight approach adds another layer. Companies and development teams increasingly want greater control over models, deployments, and costs. DeepSeek, Kimi, and GLM have gained prominence in this conversation, especially among teams building products on capable models without depending solely on proprietary services.
However, market decisions will not be based solely on technical specs. In AI inference, factors such as service stability, real response times, compatibility with common tools, quota management, pricing, documentation, and vendor trust are crucial. Zyphra starts with a technically ambitious message but will need to demonstrate real-world performance under production loads and with clients managing complete operational workflows, not just single models.
A platform aiming to go beyond model serving
Zyphra Cloud launches with inference capabilities but already signals plans to expand. Announced future features include distributed post-training, reinforcement learning, fine-tuning, isolated environments for agents, development environments on AMD EPYC CPUs, and access to dedicated GPU clusters and bare-metal infrastructure.
This matters because many AI projects now go beyond calling models via API. Companies need to adapt models, run agents in controlled environments, keep sensitive data under specific policies, and reserve capacity for predictable workloads. If Zyphra manages to integrate inference, post-training, and agent environments into a single platform, it could compete strongly in an area technical teams value highly: operational control.
There’s also a market perspective. Generative AI is shifting from isolated tests to systems that operate longer, consult tools, maintain session memory, and execute chained tasks. Such uses strain infrastructure more than simple chatbots with short responses. As a result, providers are moving away from talking about “models” in isolation and towards full platforms for agents, long contexts, and persistent workflows.
Zyphra Cloud has been available since 05/04/2026. The company has not disclosed public pricing, SLA details, or specific model limits—factors that will be key in assessing suitability for enterprise environments. For now, this launch signals that the AI race is not just about training models but also about deploying them efficiently, with sufficient memory, on increasingly specialized infrastructure.
FAQs
What is Zyphra Cloud?
Zyphra Cloud is an AI platform targeting developers, companies, and AI providers, starting with a serverless inference service for open-weight models.
Which models are available on Zyphra Inference?
The launch mentions access to DeepSeek V3.2, Kimi K2.6, and GLM 5.1. The company also plans to add more open models as they become available.
Why are AMD Instinct MI355X GPUs important?
These GPUs provide 288 GB of HBM3E memory and 8 TB/s bandwidth per unit, making them suitable for large, long-context inference workloads with many active sessions.
Is Zyphra Cloud only for inference?
No. Zyphra has announced plans to expand the platform to include fine-tuning, reinforcement learning, isolated agent environments, dedicated GPU clusters, and bare-metal infrastructure.
via: zyphra