LineShine: The Chinese Supercomputer Training AI Without GPUs

China has just demonstrated an unconventional approach to training large-scale artificial intelligence models: a supercomputer built on Armv9 CPUs rather than the dominant GPU-accelerated cluster design. The system, called LineShine, is installed at the National Supercomputing Center in Shenzhen (NSCC-SZ) and is described in a scientific paper published as a preprint on arXiv on 09/05/2026.

The most striking aspect isn’t just the performance but the architecture. LineShine combines 20,480 compute nodes and 40,960 LX2 processors based on Armv9, which works out to two processors per node. Each processor includes 304 cores, so, based on the technical details in the paper, the total reaches 12,451,840 CPU cores. This is much higher than the 2.4 million cited in some quick readings of the system, a figure that doesn’t match the straightforward multiplication of processors by cores per processor described in the paper.
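The core count can be checked with a few lines of arithmetic. This is only a sanity check on the figures quoted above; the variable names are illustrative and do not come from the paper.

```python
# Figures as quoted in the article; names are illustrative.
nodes = 20_480
processors = 40_960
cores_per_processor = 304

# Two LX2 processors per node follows from the two totals.
processors_per_node = processors // nodes

# Total cores = processors x cores per processor.
total_cores = processors * cores_per_processor

print(processors_per_node)   # 2
print(f"{total_cores:,}")    # 12,451,840
```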

The project appears in a context marked by US restrictions on advanced chips for China, which since 2022 have affected advanced computing semiconductors and certain supercomputing applications. These limitations have accelerated China’s interest in developing independent architectures, national processors, and designs capable of supporting AI workloads without relying entirely on foreign GPUs.

A CPU supercomputer for scientific AI model training

LineShine has not been presented only as a hardware demonstration. The system has been used to train a generative compression model applied to Earth observation data. The goal of this work is to drastically reduce satellite data volumes, with compression ratios ranging from 100× to 10,000×, and then reconstruct the information using a model trained on historical Earth observation datasets.
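To get a feel for what those ratios mean in storage terms, the sketch below applies them to a hypothetical 100 PB archive. The archive size and the helper name `compressed_size_tb` are assumptions chosen for illustration, not values from the paper.

```python
def compressed_size_tb(archive_pb: float, ratio: float) -> float:
    """Return the compressed size in TB for an archive given in PB (decimal units)."""
    return archive_pb * 1_000 / ratio

ARCHIVE_PB = 100.0  # hypothetical archive size, for illustration only

for ratio in (100, 1_000, 10_000):
    size_tb = compressed_size_tb(ARCHIVE_PB, ratio)
    print(f"{ratio:>6}x -> {size_tb:,.0f} TB")
```

Under these assumptions, the compressed archive would range from about 1,000 TB at 100× down to about 10 TB at 10,000×.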

This approach makes sense because satellites repeatedly observe the same planet. This repetition produces geographic, temporal, and spectral patterns that can be learned. Instead of treating each image as an isolated file to transmit, store, and process almost in raw form, the system proposes using the global history of observations as a kind of generative memory. The model not only compresses; it learns prior knowledge about the terrain to better rebuild what is lost during compression.

According to the paper, training achieved 1.54 exaFLOP/s sustained in BFloat16 and a peak of 2.16 exaFLOP/s during evaluation. These numbers are significant because they don’t originate from a conventional GPU cluster but from an Armv9 CPU machine with hierarchical HBM and DDR memory, a custom interconnect network, and extensive software optimization efforts.

| Element | Described data |
| --- | --- |
| Compute nodes | 20,480 |
| LX2 processors | 40,960 |
| Cores per processor | 304 |
| Total CPU cores | 12,451,840 |
| Memory per processor | 32 GB HBM + 256 GB DDR |
| HBM bandwidth per processor | up to 4 TB/s |
| Network per node | LQLink, 1.6 Tb/s |
| Sustained performance reported | 1.54 exaFLOP/s |
| Peak performance reported | 2.16 exaFLOP/s |
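The headline figures can be broken down into rough per-node and per-core averages. The sketch below simply divides the reported totals evenly. The derived numbers are back-of-the-envelope estimates, not values stated in the paper, and the memory totals assume decimal units.

```python
# Headline figures as quoted in the article.
NODES = 20_480
PROCESSORS = 40_960
CORES_PER_PROCESSOR = 304
SUSTAINED_FLOPS = 1.54e18   # BFloat16, sustained during training
PEAK_FLOPS = 2.16e18        # peak reported during evaluation
HBM_GB_PER_PROCESSOR = 32
DDR_GB_PER_PROCESSOR = 256

total_cores = PROCESSORS * CORES_PER_PROCESSOR

# Simple averages: the sustained rate split evenly over nodes and cores.
tflops_per_node = SUSTAINED_FLOPS / NODES / 1e12
gflops_per_core = SUSTAINED_FLOPS / total_cores / 1e9

# Ratio of the two reported rates (they come from different phases).
sustained_to_peak = SUSTAINED_FLOPS / PEAK_FLOPS

# Aggregate memory, assuming 1 PB = 1,000,000 GB.
total_hbm_pb = PROCESSORS * HBM_GB_PER_PROCESSOR / 1e6
total_ddr_pb = PROCESSORS * DDR_GB_PER_PROCESSOR / 1e6

print(f"{tflops_per_node:.1f} TFLOP/s per node")
print(f"{gflops_per_core:.1f} GFLOP/s per core")
print(f"sustained/peak = {sustained_to_peak:.2f}")
print(f"HBM total = {total_hbm_pb:.2f} PB, DDR total = {total_ddr_pb:.2f} PB")
```

On these assumptions the machine averages roughly 75 TFLOP/s per node and about 124 GFLOP/s per core, with around 1.3 PB of HBM and 10.5 PB of DDR in total.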

The LX2 processor described in the paper integrates two compute dies and eight CPU clusters, and combines on-package HBM with external DDR. This architecture doesn’t aim to imitate a GPU exactly but to exploit a blend of many cores, high-bandwidth memory, larger-capacity memory, and very specific optimizations for intensive training operations.

Why it matters that it doesn’t use GPUs

Most modern large-scale AI training and inference relies on GPUs or specialized accelerators. NVIDIA dominates much of this market thanks to its chips, its software ecosystem, and its CUDA platform, which together form a barrier that is difficult to replicate. China showcasing an exascale training run on Armv9 CPUs is therefore significant: it doesn’t prove GPUs are obsolete, but it illustrates alternative paths for certain scientific workloads.

This distinction is key. LineShine shouldn’t be directly compared to large generative AI clusters used for training massive language models. Its use case is different: generative compression and reconstruction of multispectral satellite data. Data ingestion, memory management, communications, tensor organization, and the ability to sustain very long scientific workloads over huge datasets are central here.

The paper emphasizes that Earth observation datasets are already reaching hundreds of petabytes, and for many scientific tasks, moving and reprocessing data at this scale is becoming a bottleneck. The D2AR framework used in training aims to convert these historical datasets into a model capable of on-demand reconstructions at various compression levels.

This approach can also influence the design of future scientific infrastructures. Instead of each researcher downloading large data volumes, supercomputing centers could provide compressed representations, tailored reconstructions, or derived products near the storage, aligning with a broader trend of bringing analysis closer to data rather than moving vast datasets around.

The importance of co-optimization

LineShine’s performance isn’t just a matter of adding millions of cores. The technical work describes coordinated optimizations across model design, kernels, memory hierarchy, runtime, and parallelism. In CPU-based systems, scheduling, synchronization, and data movement costs can weigh more heavily than on GPUs if the software isn’t well adapted. Hence, the researchers developed strategies specific to Armv9 and its SVE (Scalable Vector Extension) and SME (Scalable Matrix Extension) instruction set extensions.

Memory management is a key challenge. Each cluster has limited local HBM, so not all model parameters, activations, gradients, and optimizer states can reside in the fastest memory. The system dynamically decides which tensors stay in HBM and which can reside in DDR, based on their impact on performance and their lifespan during training.
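A simplified way to picture this kind of tiering is a greedy placement that fills a fixed HBM budget with the tensors accessed most often per gigabyte and leaves the rest in DDR. The sketch below is an assumption-laden illustration, not the paper's algorithm: the `TensorInfo` class, the `place_tensors` function, the scoring rule, and the example sizes are all hypothetical, and tensor lifespan is ignored for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorInfo:
    name: str
    size_gb: float
    accesses_per_step: int  # hypothetical proxy for performance impact

def place_tensors(tensors, hbm_capacity_gb):
    """Greedily assign tensors to HBM by access density; the rest go to DDR."""
    placement = {}
    free_gb = hbm_capacity_gb
    # Highest accesses per GB first, so hot, small tensors win HBM space.
    ranked = sorted(tensors,
                    key=lambda t: t.accesses_per_step / t.size_gb,
                    reverse=True)
    for t in ranked:
        if t.size_gb <= free_gb:
            placement[t.name] = "HBM"
            free_gb -= t.size_gb
        else:
            placement[t.name] = "DDR"
    return placement

# Hypothetical tensors and a hypothetical 32 GB HBM budget.
tensors = [
    TensorInfo("activations", 10.0, 8),
    TensorInfo("weights_bf16", 12.0, 6),
    TensorInfo("gradients", 12.0, 2),
    TensorInfo("optimizer_state", 48.0, 1),
]
placement = place_tensors(tensors, hbm_capacity_gb=32.0)
for name, tier in placement.items():
    print(f"{name}: {tier}")
```

In this toy setup the activations and weights fit in HBM, while the gradients and the large, rarely touched optimizer state fall back to DDR. A real system would also account for when each tensor is live during the training step.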

Communication is also optimized. LineShine employs sequence parallelism and a hybrid data strategy aligned with the machine’s physical topology. The goal is to keep frequent communications within low-latency domains whenever possible and avoid unnecessary replication of optimizer states.

The reported performance jump is notable. For the 6-billion-parameter model, the per-step time on a node drops from 51.31 seconds to just 4.98 seconds after implementing memory management, optimized kernels, communication improvements, and an asynchronous runtime, a speedup of roughly 10×. This single-node improvement underpins the scaling to thousands of nodes.
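The size of that single-node gain follows directly from the two step times. A minimal check, using only the figures quoted above:

```python
baseline_s = 51.31   # per-step time before optimization (from the article)
optimized_s = 4.98   # per-step time after optimization (from the article)

speedup = baseline_s / optimized_s
time_saved_fraction = 1 - optimized_s / baseline_s

print(f"speedup: {speedup:.2f}x")
print(f"step time reduced by {time_saved_fraction:.1%}")
```

That is a speedup of a little over 10×, or about a 90% reduction in per-step time.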

At the final scale of 20,480 nodes, the system maintains a weak scaling efficiency of 76%. In weak scaling, the total workload grows in proportion to the number of nodes, so each node keeps the same amount of work; an efficiency of 76% means the full machine delivers about three quarters of the throughput that perfect scaling would give. For training on global historical datasets, this matters more than speeding up small, fixed-size tests.
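Weak scaling efficiency is commonly defined as the step time of a baseline run divided by the step time of the scaled run, with the work per node held constant. The sketch below uses that common definition with a normalized baseline time; the article does not state which baseline the paper uses, so the helper names and the baseline value are assumptions.

```python
def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Efficiency = baseline step time / scaled step time (work per node fixed)."""
    return t_base / t_scaled

def effective_nodes(nodes: int, efficiency: float) -> float:
    """Number of ideally scaling nodes that would give the same throughput."""
    return nodes * efficiency

NODES = 20_480
EFFICIENCY = 0.76  # figure quoted in the article

# With a normalized baseline step time of 1.0, the scaled step takes 1/0.76.
t_base = 1.0
t_scaled = t_base / EFFICIENCY

print(f"step-time inflation: {t_scaled / t_base:.3f}x")
print(f"efficiency: {weak_scaling_efficiency(t_base, t_scaled):.2f}")
print(f"effective nodes: {effective_nodes(NODES, EFFICIENCY):,.0f}")
```

Read this way, 76% efficiency corresponds to each step taking about 1.3 times as long as on the baseline, or to roughly 15,600 perfectly scaling nodes out of 20,480.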

LineShine demonstrates that China isn’t just seeking to replace Western GPUs with similar products. It is exploring comprehensive supercomputing designs where processor, network, memory, and software are tailored to specific workloads. This approach doesn’t eliminate the advantage of accelerators in commercial AI but broadens the strategic landscape.

The key takeaway for cloud and infrastructure sectors is that AI won’t be dominated by a single architecture. Conversational models, enterprise inference, scientific simulation, Earth observation, and generative compression may each require different combinations of compute, memory, network, and storage. LineShine belongs to the scientific side of that list: less visible to the general public than ChatGPT or DeepSeek but highly relevant for understanding how supercomputing is evolving amid technological rivalry.

Frequently Asked Questions

What is LineShine?
LineShine is a Chinese supercomputer installed at the National Supercomputing Center in Shenzhen. It is based on LX2 Armv9 processors and has been used to train large-scale scientific AI models at exascale.

How many cores does LineShine have?
According to the technical data in the paper, it has 20,480 nodes, two processors per node, and 304 cores per processor, totaling approximately 12.45 million CPU cores.

Does LineShine use GPUs?
The described architecture relies on Armv9 LX2 CPUs, and it has been presented as an exascale all-CPU machine. Its significance lies in demonstrating an AI training route that doesn’t depend on traditional GPU clusters.

Can it compete with NVIDIA’s large clusters?
It depends on the workload. For training massive language models, GPUs still dominate. LineShine excels in a specific scientific workload: exascale training of a generative model for satellite data compression and reconstruction.
