In a move that reinforces its push to reduce external dependencies and gain traction in the domestic accelerator market, Huawei unveiled at Huawei Connect 2025 a roadmap for AI chips extending to 2028. According to a MyDrivers report on the presentation, the plan includes several Ascend families with substantial increases in compute power and, most notably, a turning point: proprietary HBM memory integrated into the new products.
The message is clear: a bet on “homegrown” competitiveness to meet Chinese demand for AI computing, which in some cases serves as an alternative to offerings like the NVIDIA H20 in the local market. The play is not purely about performance, either: Huawei is advocating technological sovereignty in critical components, starting with HBM (high-bandwidth memory), a segment historically dominated by a handful of global suppliers.
Ascend 950PR: Starting Point with Proprietary HBM and Focus on Inference
The first milestone in the plan is the Ascend 950PR, successor to the Ascend 910C. This chip marks the beginning of a transition toward an internal stack with Huawei’s HBM integration. Specifications promise:
- Low-precision compute support for FP8 and FP4, delivering 1 PFLOPS in FP8 and 2 PFLOPS in FP4.
- Interconnection bandwidth of 2 TB/s between components, key for scaling modern AI workloads.
- “HiBL 1.0” HBM memory with 128 GB capacity and 1.6 TB/s bandwidth.
Huawei positions the 950PR as an inference-oriented accelerator, optimized for prefill (the initial context provisioning phase in large models) and recommendation systems. The strategic reading is clear: start where most cloud AI consumption occurs today—large-scale inference—and ensure that the proprietary memory stack is running smoothly during that phase.
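A rough roofline-style calculation shows why an inference chip cares so much about HBM bandwidth: decode (token-by-token generation) reads every weight for each new token, while prefill amortizes weight reads across the whole prompt. The sketch below uses the announced 950PR figures; the 70B-parameter model and prompt length are illustrative assumptions, not Huawei data.

```python
# Roofline-style sketch: prefill vs. decode on a 950PR-class chip.
# The FLOPS and bandwidth figures are from the announcement; the model
# size and prompt length are illustrative assumptions.

FP8_FLOPS = 1e15    # 1 PFLOPS in FP8 (announced)
HBM_BW = 1.6e12     # 1.6 TB/s HiBL 1.0 (announced)

# Machine balance: FLOPs available per byte fetched from HBM.
print(f"machine balance: {FP8_FLOPS / HBM_BW:.0f} FLOPs/byte")

# Hypothetical dense 70B-parameter model stored as FP8 (1 byte/param).
weight_bytes = 70e9

# Decode: each new token reads every weight once for ~2 FLOPs, i.e.
# ~2 FLOPs/byte -- far below machine balance, so bandwidth-bound.
print(f"decode ceiling : {HBM_BW / weight_bytes:.0f} tokens/s at batch 1")

# Prefill: the whole prompt is processed per weight load, so arithmetic
# intensity scales with prompt length and the phase turns compute-bound.
prompt_tokens = 4096
print(f"prefill intensity: ~{2 * prompt_tokens} FLOPs/byte")
```

Under these assumptions, single-stream decode tops out near 23 tokens/s, while a 4k-token prefill is comfortably compute-bound, which is why an inference-first chip leans on both HBM bandwidth and FP8 throughput.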
Huawei’s “Made by Huawei” HBM: From HiBL 1.0 to HiZQ 2.0
The most significant technical milestone is the proprietary HBM. The company mentions a first-generation “HiBL 1.0” for the 950PR (128 GB, 1.6 TB/s) and hints at a second-generation “HiZQ 2.0” with 144 GB and 4 TB/s. For an accelerator maker, integrating its own HBM means fewer supply dependencies and, potentially, a custom-tuned balance of bandwidth, latency, power consumption, and form factor.
In market terms, this step is monumental: HBM has become the hard currency of generative AI. It’s not just about FLOPS; memory and its effective bandwidth determine the context window and how many simultaneous queries a cluster can handle. If Huawei can stabilize production and performance of HiBL/HiZQ, it will gain leverage on pricing, delivery timelines, and scalability in the Ascend ecosystem.
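To make “memory determines the context window and concurrency” concrete: a transformer’s KV cache grows linearly with context length, and whatever HBM remains after the weights caps how many requests can run at once. A minimal sketch against the announced 128 GB HiBL 1.0 capacity; every model dimension below is a hypothetical placeholder.

```python
# KV-cache sizing: how HBM capacity caps context length x concurrency.
# The 128 GB capacity is the announced HiBL 1.0 figure; every model
# dimension below is a hypothetical placeholder.

HBM_CAPACITY = 128e9          # bytes (HiBL 1.0 on the 950PR)

layers = 80                   # hypothetical LLM depth
kv_heads = 8                  # grouped-query attention KV heads
head_dim = 128
kv_bytes_per_token = layers * kv_heads * head_dim * 2  # K and V, FP8 (1 B)

weights = 70e9                # 70B params served in FP8 -> 70 GB
free_for_kv = HBM_CAPACITY - weights

context = 32_768              # tokens of context per request
per_request = kv_bytes_per_token * context
print(f"KV cache per token  : {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache per request: {per_request / 1e9:.2f} GB")
print(f"concurrent requests : {int(free_for_kv // per_request)}")
```

With these placeholder dimensions, each 32k-context request consumes over 5 GB of KV cache, leaving room for only about ten concurrent requests per device; every extra gigabyte of HBM translates directly into serving capacity.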
950DT (Q4 2026): The Training Phase
The next step in the 950 series is the Ascend 950DT, scheduled for Q4 2026. Unlike the 950PR, this model is focused on training and jumps to HiZQ 2.0 (144 GB HBM and 4 TB/s bandwidth), promising higher memory throughput and improved tensor core performance during extended, distributed workloads.
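The capacity jump makes sense in light of how training memory is usually accounted: besides the weights themselves, each parameter drags along gradients, FP32 master weights, and Adam optimizer moments. Below is a minimal sketch of that standard accounting, with a hypothetical per-device model shard.

```python
# Standard mixed-precision training memory accounting (per device),
# before activations. The 15B-parameter shard size is a hypothetical
# illustration, not an Ascend figure.

params = 15e9  # parameters held on one device

bytes_per_param = (
    2      # BF16/FP16 working weights
    + 2    # gradients at the same precision
    + 4    # FP32 master copy of the weights
    + 4    # Adam first moment (FP32)
    + 4    # Adam second moment (FP32)
)          # = 16 bytes per parameter

state_gb = params * bytes_per_param / 1e9
print(f"weights + grads + optimizer: {state_gb:.0f} GB")  # ~240 GB

# Even ignoring activations, a 15B shard wants ~240 GB of state, which
# is why training runs shard state across devices (ZeRO/FSDP-style) and
# why per-chip capacity (144 GB) and bandwidth (4 TB/s) are the levers.
```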
Separating inference (950PR) from training (950DT) enables silicon, memory, and networking segmentation for each use case, rather than a one-size-fits-all design. This approach, also seen at other vendors, involves tuning thermal profiles, schedulers, and precision modes (FP8/FP4) to maximize throughput for each scenario.
960 (Q4 2027): Expanded Memory, Bandwidth, and FLOPS
The roadmap continues with the Ascend 960, expected in Q4 2027, representing broad improvements across multiple fronts:
- Interconnect of 2.2 TB/s, facilitating denser cluster topologies and faster synchronization.
- Effective memory of 288 GB (likely HiZQ 2.0) and a memory bandwidth of 9.6 TB/s.
- Compute capacity of 2 PFLOPS (FP8) and 4 PFLOPS (FP4).
This leap suggests two priorities: first, continuing to attack the “memory bottleneck,” and second, sustained investment in low-precision compute (FP8/FP4), where the industry currently gets its best performance per watt for many model architectures.
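One way to quantify “attacking the memory bottleneck” is the compute-to-bandwidth ratio, or machine balance: how many FLOPs the chip can execute per byte fetched from HBM. The sketch below uses the announced figures; the 950DT’s FP8 rate was not published, so it is assumed here to match the 950PR.

```python
# Machine balance (FP8 FLOPs per HBM byte) across the announced roadmap.
# FLOPS and bandwidth figures come from the presentation, except the
# 950DT FP8 rate, which was not published and is assumed to match the
# 950PR.

chips = {
    # name:   (FP8 FLOPS, HBM bandwidth in bytes/s)
    "950PR": (1e15, 1.6e12),  # HiBL 1.0
    "950DT": (1e15, 4.0e12),  # HiZQ 2.0; FP8 rate is an assumption
    "960":   (2e15, 9.6e12),
}

for name, (flops, bw) in chips.items():
    print(f"{name:>6}: {flops / bw:5.0f} FP8 FLOPs per byte of HBM traffic")

# Output: 950PR ~625, 950DT ~250, 960 ~208. Bandwidth grows faster than
# compute, consistent with the "attack the memory bottleneck" reading.
```

A falling FLOPs-per-byte ratio means memory-bound workloads like LLM decode get proportionally more of the chip’s theoretical compute, which is exactly the direction this roadmap is moving.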
970 (2028): The “Grand Block” in Three Years
The plan closes with the Ascend 970, targeted for 2028. Huawei hints at “significant improvements” in memory and compute, but no official figures are provided. Based on the pattern of increases, a new step in HBM capacity, sustained bandwidth, interconnection, and FP8/FP4 FLOPS is expected. The critical question will be how to translate this “muscle” into total cost of ownership (TCO) advantages—density per rack, energy efficiency, and maintenance.
Strategic Outlook: An ‘Inside-Out’ Approach
Beyond the numbers, the roadmap reflects an “inside-out” design philosophy:
- Proprietary HBM First: controlling the most stressed part of the supply chain (memory) to protect supply and set prices.
- Inference now, training later: ensuring deployment capacity for immediate demand (950PR), then consolidating training with more memory and bandwidth (950DT, 960).
- Phased growth: annual/biennial iterations allowing lessons to be absorbed and risks to be diversified.
- Supporting the domestic market: aligning with the need for national AI computing capacity and digital services, with Ascend as the core piece of the stack.
Competition & Positioning: The Role of Ascend
Huawei isn’t new to AI acceleration. Its Ascend family has long targeted niches where NVIDIA dominates through ecosystems like CUDA and TensorRT. The differentiator in this new roadmap is sovereignty (own memory, internal stack) and local traction across cloud, government, and enterprise.
- Compared to NVIDIA H20: the message aims to compete domestically with “homegrown” solutions offering good performance per watt in FP8/FP4 and high memory bandwidth.
- Focus on software: hardware alone isn’t enough; success depends on toolchains, compilers, frameworks, and libraries. Huawei has historically promoted its CANN ecosystem for Ascend programming.
- Interconnects: with 2–2.2 TB/s of fabric bandwidth, scaling depends on low latencies, non-blocking topologies, and orchestration. This is a critical technical point that rivals FLOPS in importance, as the sketch after this list illustrates.
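To see why fabric bandwidth rivals FLOPS, consider the standard bandwidth lower bound for a ring all-reduce, the collective that synchronizes gradients in data-parallel training. The sketch plugs the announced 2.2 TB/s interconnect into that formula; the cluster sizes and gradient payload are hypothetical assumptions.

```python
# Bandwidth lower bound for a ring all-reduce: each device sends and
# receives 2*(N-1)/N * S bytes, where S is the gradient payload.
# The 2.2 TB/s link figure is the announced 960 interconnect; N and S
# are hypothetical.

def ring_allreduce_seconds(payload_bytes: float, link_bw: float, n: int) -> float:
    """Latency-free lower bound on a ring all-reduce over n devices."""
    return 2 * (n - 1) / n * payload_bytes / link_bw

LINK_BW = 2.2e12           # bytes/s, per the Ascend 960 figure
payload = 15e9 * 2         # 15B parameters, BF16 gradients (assumed)

for n in (8, 64, 512):
    t_ms = ring_allreduce_seconds(payload, LINK_BW, n) * 1e3
    print(f"{n:4d} devices: >= {t_ms:.1f} ms per gradient synchronization")
```

Even under this idealized bound, every training step pays tens of milliseconds of pure communication; real topologies, latency, and stragglers only add to that, which is why fabric design can make or break cluster-scale efficiency.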
Risks & Challenges: Packaging, Power, and Ecosystem Maturity
The plan is ambitious but pragmatic: it recognizes that the fight isn’t solely about FLOPS. Some emerging challenges include:
- Advanced packaging & thermal management: proprietary HBM involves complex die stacking and raises crosstalk, signal-integrity, and interposer-design challenges. Sustaining 9.6 TB/s across 288 GB requires refined packaging engineering and thermal management.
- Availability and sustained performance: announcing specifications is one thing; reaching volume production with competitive yields is another. Variability can impact costs, timelines, and binning.
- Software ecosystem: widespread adoption depends on compatibility with popular frameworks, optimizers, kernels, and efficient APIs.
- Energy & density: increased bandwidth and memory often mean higher power consumption. Success for data centers will depend on electrical efficiency and density per U.
- Global versus domestic market: the plan prioritizes China, where computing sovereignty is a national goal. Beyond that, regulatory and supply chain limitations will influence deployment.
Understanding FP8 and FP4 (and Why They Matter)
The focus on FP8 and FP4 isn’t incidental. The industry is shifting toward lower precision formats that maintain training and inference quality, aided by quantization, loss scaling, and calibration techniques. The result? Higher performance-per-watt and more efficient memory use. When Huawei claims 1 PFLOPS in FP8 and 2 PFLOPS in FP4 (for the 950PR), it indicates its vector engines are designed for the current mainstream LLMs, where FP8 is becoming standard in training, and FP4/INT4 is gaining ground in high-throughput inference.
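The mechanics behind those quantization and calibration techniques can be illustrated with a per-tensor symmetric scheme. The sketch below simulates uniform integer grids (int8/int4-style) in NumPy; real FP8 uses a non-uniform float grid, but the calibrate, scale, round, dequantize loop is the same idea.

```python
# Per-tensor symmetric quantization, simulated with uniform integer
# grids (int8/int4-style). Real FP8 uses a non-uniform float grid, but
# the calibrate -> scale -> round -> dequantize mechanic is the same.

import numpy as np

def fake_quant(x: np.ndarray, bits: int) -> np.ndarray:
    """Quantize-dequantize x with one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(x).max() / qmax        # the "calibration" step
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                      # back to float for comparison

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)

for bits in (8, 4):
    w_q = fake_quant(weights, bits)
    rel_err = np.abs(weights - w_q).mean() / np.abs(weights).mean()
    print(f"{bits}-bit grid: mean relative error {rel_err:.2%}")
```

The error gap between the two grids shows why 4-bit deployment leans on extra machinery (per-channel scales, outlier handling, calibration data) while 8-bit formats are already robust enough to be a training default.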
The Road to 2028: A Three-Step Ladder
- 2025–2026: consolidating large-scale inference with 950PR (HiBL 1.0 HBM) and entering training with 950DT (HiZQ 2.0 HBM).
- 2027: raising the bar with 960 (more memory & bandwidth, 2 PFLOPS FP8, 4 PFLOPS FP4).
- 2028: the Ascend 970 as an overall synthesis, with “significant” improvements in compute and memory.
Each step increases effective memory capacity and bandwidth; without this, FLOPS specifications on paper won’t tell the full story. This emphasizes Huawei’s primary thesis: “HBM First”.
Conclusion: A Memory Sovereignty Plan to Compete in the AI Era
The most significant aspect of this announcement isn’t a single figure but the architecture of the plan. Huawei presents a credible, incremental roadmap that combines scaling compute with tight memory control—the scarcest resource in AI—and a clear segmentation between inference and training.
If the company meets its milestones and maintains its pace with proprietary HBM (from HiBL 1.0 to HiZQ 2.0), it can secure capacity for the Chinese market and build cost advantages in Ascend clusters. The real unknown lies in software and industrial scale: two factors that, historically, separate ambitious roadmaps from market-changing deployments.
For now, the signals are strong: the 950PR, 950DT, 960, and 970 sketch a staircase to 2028 featuring 1–2 PFLOPS in FP8, 2–4 PFLOPS in FP4, up to 9.6 TB/s of memory bandwidth on the 960, 2–2.2 TB/s of interconnect, and proprietary HBM growing from 128 GB to 144 GB and 288 GB along the way. In a decade marked by memory shortages and the voracious appetite of LLMs, controlling HBM is, in itself, a strategic advantage.
FAQs
What is the Huawei Ascend 950PR and what is it designed for?
The Ascend 950PR is the successor to the 910C and the first accelerator in the series to incorporate proprietary HBM (HiBL 1.0, 128 GB, 1.6 TB/s). It offers 1 PFLOPS in FP8 and 2 PFLOPS in FP4, with 2 TB/s interconnection, aimed at inference (e.g., prefill in LLMs and recommendation systems).
How does the 950DT differ from the 950PR?
The 950DT is training-focused and is expected to launch in Q4 2026 with HiZQ 2.0 (144 GB HBM & 4 TB/s). It provides more bandwidth and capacity for longer, larger training batches and pipelines.
What improvements does the Ascend 960 (Q4 2027) bring?
The 960 increases interconnect bandwidth to 2.2 TB/s, raises memory to 288 GB (likely HiZQ 2.0), and boosts memory bandwidth to 9.6 TB/s. Compute-wise, it aims for 2 PFLOPS (FP8) and 4 PFLOPS (FP4).
What about the upcoming Ascend 970 in 2028?
Huawei talks about “significant” improvements in memory and compute, though no official specs are provided. It’s expected to continue the trend, with increased HBM capacity, bandwidth, and FLOPS, emphasizing HBM control and low-precision performance.
Does Huawei’s HBM replace third-party solutions now?
Huawei has introduced HiBL 1.0 (128 GB, 1.6 TB/s) for the 950PR and HiZQ 2.0 (144 GB, 4 TB/s) for the 950DT and later lines. The goal is to reduce dependencies and enable tailored optimization; how quickly the proprietary stacks displace third-party HBM will depend on production volume, performance, and validation progress.
Is this competing with NVIDIA H20?
In the Chinese domestic market, Huawei positions Ascend as a “homegrown” alternative. Exact comparisons depend on system metrics (not just chip specs): software, networking, HBM, efficiency, and total cost of ownership (TCO) of the cluster.
Why are FP8 and FP4 important?
Because they enable more performance per watt and better memory utilization without sacrificing quality in many models. Huawei’s numbers (1 PFLOPS FP8, 2 PFLOPS FP4 in 950PR; 960 doubling these) indicate their vector engines are tuned for current mainstream large language models, where low-precision formats are becoming standard for training and inference.
What are the main risks and challenges of this plan?
Packaging innovations, thermal management, software ecosystem maturity, power density, production volumes, and supply chain reliability all pose challenges that will determine real success in cost, timelines, and data center adoption.
via: wccftech