The ongoing technological rivalry between CPUs and GPUs in artificial intelligence has just taken an unexpected turn. Intel has demonstrated that its conventional processors, thanks to a low-level redesign built on microkernels optimized for specific instruction sets, can run large language models (LLMs) at speeds approaching those of the NVIDIA A100 GPU, for years a benchmark in AI training and inference.
This isn’t magic or a revolutionary new chip. The secret lies in how matrices are multiplied inside the CPU. By using microkernels designed to exploit AVX2 instructions to the full, together with new data layouts, Intel has enabled models quantized to 1 and 2 bits to deliver up to seven times the performance of traditional 16-bit inference.
From 16 bits to 2 bits: a paradigm shift
Until now, the de facto standard for efficient LLM inference was to use 16-bit weights (BF16 or FP16) or, more recently, 4-bit weights, with libraries such as llama.cpp or bitnet.cpp. These reductions save memory and energy, but always entail some loss of precision.
However, Intel has gone further:
- Their engineers designed microkernels for 1- and 2-bit weights, capable of packing the data into an extremely compact form (see the packing sketch after this list).
- Running these microkernels on modern x86 CPUs dramatically reduces memory footprint and bandwidth requirements.
- Tests show that, despite this extreme compression, model quality is preserved and inference runs up to 7 times faster than the 16-bit standard.
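For intuition, here is a minimal Python sketch (not Intel's kernel code; the helper names are illustrative) of how 2-bit quantized weights can be packed four to a byte and unpacked again, which is where the memory and bandwidth savings come from:

```python
import numpy as np

def pack_2bit(codes: np.ndarray) -> np.ndarray:
    """Pack 2-bit codes (values 0..3) into bytes, four codes per byte."""
    c = codes.reshape(-1, 4).astype(np.uint8)
    return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    """Recover the 2-bit codes from the packed byte stream."""
    return np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)

# 1,024 weights: 2,048 bytes in FP16, but only 256 bytes once packed to 2 bits.
codes = np.random.randint(0, 4, size=1024)
packed = pack_2bit(codes)
assert np.array_equal(unpack_2bit(packed), codes)
print(f"FP16: {codes.size * 2} bytes  ->  2-bit packed: {packed.nbytes} bytes (8x smaller)")
```

An 8x reduction in weight bytes is also an 8x reduction in the data that has to stream through the memory bus for every generated token, which is why the technique pays off on bandwidth-limited CPUs.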
In concrete numbers: while an NVIDIA A100 GPU achieves 250 tokens per second, the tested Intel Core Ultra processors reach between 82 and 110 tokens per second, depending on the CPU model. The gap is far smaller than one would expect, considering that the GPU has 17 to 20 times more memory bandwidth thanks to HBM2E, compared with the conventional DDR5 of the CPUs.
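The arithmetic behind that observation can be written out using only the ratios quoted above (absolute bandwidth figures are not given in the article, so none are assumed here):

```python
# Back-of-the-envelope comparison based solely on the ratios quoted above.
a100_tps = 250                       # tokens/s reported for the NVIDIA A100
cpu_tps_low, cpu_tps_high = 82, 110  # tokens/s reported for the Core Ultra CPUs
bw_gap_low, bw_gap_high = 17, 20     # A100 bandwidth advantage (HBM2E vs. DDR5)

gap_high = a100_tps / cpu_tps_low    # worst case for the CPU, about 3.0x slower
gap_low = a100_tps / cpu_tps_high    # best case for the CPU, about 2.3x slower
print(f"Throughput gap: {gap_low:.1f}x to {gap_high:.1f}x in favour of the GPU")
print(f"Bandwidth gap:  {bw_gap_low}x to {bw_gap_high}x in favour of the GPU")
print("Tokens per unit of memory bandwidth favour the CPU by roughly "
      f"{bw_gap_low / gap_high:.1f}x to {bw_gap_high / gap_low:.1f}x")
```

In other words, per unit of memory bandwidth, the CPU is extracting several times more tokens than the GPU.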
The experiments: three CPUs versus the A100 giant
Intel tested their microkernels on three recent consumer CPUs:
- Intel Core Ultra 9 285K with 24 cores (8 Performance cores and 16 Efficiency cores)
- Intel Core Ultra 7 255H with 14 cores
- Intel Core Ultra 7 258V with 8 cores
In all cases, they used representative models like Llama3-8B, Falcon3-1B, and MobileLLM-1.5B. The results were consistent: 2-bit models showed linear acceleration, approaching each processor’s theoretical performance ceiling.
Intel’s academic report (published on arXiv in August 2025) shows that:
- For Llama3-8B, acceleration with 2 bits reached up to 5.8 times compared to 16 bits.
- For MobileLLM-1.5B, the jump was 4.4 times in 1-bit configurations.
- Compared to bitnet.cpp (the benchmark for sub-2-bit models), Intel’s solution was up to 2.2 times faster on CPU.
How did they achieve this?
The key is what Intel calls “up-convert and compute” (sketched in code after the list below):
- Model weights are stored in 1 or 2 bits, drastically reducing data volume.
- During inference, they convert these weights into 8-bit integers.
- These are processed using FMA (fused multiply-add) operations, optimized with AVX2 instructions.
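To make that flow concrete, here is a NumPy sketch of “up-convert and compute” (an illustration only, not Intel’s microkernel: the 2-bit codebook, the single per-matrix scale, and the function names are assumptions): the packed 2-bit codes are expanded to int8 on the fly and pushed through an ordinary multiply-accumulate, the step the real kernels perform with AVX2 vector instructions.

```python
import numpy as np

# Assumed codebook: map the four possible 2-bit codes to small signed weights.
CODEBOOK = np.array([-2, -1, 0, 1], dtype=np.int8)

def upconvert_row(packed_row: np.ndarray) -> np.ndarray:
    """Up-convert one packed row (four 2-bit codes per byte) to int8 weights."""
    codes = np.stack([(packed_row >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return CODEBOOK[codes.reshape(-1)]

def matvec_2bit(packed_W: np.ndarray, x: np.ndarray, scale: float) -> np.ndarray:
    """y = (up-converted W) @ x: integer multiply-accumulate plus one float scale."""
    y = np.empty(packed_W.shape[0], dtype=np.float32)
    for i, row in enumerate(packed_W):
        w_int8 = upconvert_row(row)                                # up-convert: 2 bits -> int8
        acc = np.dot(w_int8.astype(np.int32), x.astype(np.int32))  # compute: multiply-accumulate
        y[i] = scale * acc
    return y

# Sanity check against a plain floating-point matrix-vector product.
M, K = 4, 64
codes = np.random.randint(0, 4, size=(M, K))
packed = (codes.reshape(M, K // 4, 4) << np.array([0, 2, 4, 6])).sum(axis=2).astype(np.uint8)
x = np.random.randint(-8, 8, size=K)
ref = (0.05 * CODEBOOK[codes].astype(np.float32)) @ x.astype(np.float32)
assert np.allclose(matvec_2bit(packed, x, scale=0.05), ref)
```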
To prevent unpacking time from negating these gains, Intel introduced a weight data layout format called VNNI4-interleaved, which minimizes data reorganization costs before vector operations.
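The exact VNNI4-interleaved format is not reproduced in the article, but the general idea of a VNNI-style layout can be illustrated: store the four K-values that each dot-product step consumes next to each other, so the kernel reads them contiguously with no shuffling at compute time. The [K/4, N, 4] shape below is an assumption used purely for illustration.

```python
import numpy as np

def to_vnni4(B: np.ndarray) -> np.ndarray:
    """Reorder int8 weights B[K, N] into a [K//4, N, 4] layout so that the four
    K-values consumed together sit next to each other in memory."""
    K, N = B.shape
    assert K % 4 == 0
    return np.ascontiguousarray(B.reshape(K // 4, 4, N).transpose(0, 2, 1))

# Emulate the per-step "4-wide dot product" that a VNNI-style kernel performs.
A = np.random.randint(-8, 8, size=(2, 8)).astype(np.int32)    # int8-range activations
B = np.random.randint(-2, 2, size=(8, 16)).astype(np.int32)   # up-converted weights
B_v = to_vnni4(B.astype(np.int8))

C = np.zeros((2, 16), dtype=np.int32)
for kb in range(B.shape[0] // 4):
    a_blk = A[:, 4 * kb: 4 * kb + 4]            # four consecutive activations
    C += a_blk @ B_v[kb].T.astype(np.int32)     # B_v[kb] is [N, 4], read contiguously
assert np.array_equal(C, A @ B)                 # same result as the plain GEMM
```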
Additionally, libraries like libxsmm and frameworks such as PyTorch-TPP were used to integrate these microkernels into complete inference workflows, proving that this is not just an isolated experiment but a practical optimization.
Impact: AI on any laptop
The most significant aspect of this breakthrough isn’t merely that a CPU can approach the performance of GPUs from a few years ago. It’s that it opens the door to running advanced LLMs on modest devices, such as laptops or desktops, without needing a dedicated GPU.
This has direct implications:
- Democratization of access: models like Falcon3-1B or Llama3-8B, once thought to be confined to data centers, could run on regular PCs.
- Energy savings: the 1- and 2-bit formats use 4 to 8 times less memory and reduce the energy spent per generated token.
- Edge scenarios: low-power devices like edge servers or embedded systems could run real-time LLMs independently of cloud infrastructure.
As Intel states: “We have demonstrated that ultra-low bit inference on CPUs can approach the performance of high-end GPUs.”
A strategic blow to NVIDIA?
NVIDIA has dominated AI with its high-bandwidth HBM memory GPUs. But Intel’s development poses a strategic challenge:
- Not all users need massive models for training; many only require inference.
- If inference can be effectively performed on standard CPUs, the attractiveness of GPUs diminishes.
- Costs also drop — there’s no longer a need to spend thousands of euros on a GPU for local open-source model deployment.
Although Intel acknowledges that a CPU doesn’t yet match the latency and massive parallelism of an A100 or the newer Blackwell GPUs, they show that, for certain cases, CPUs are enough.
Next steps: from x86 to ARM and AVX10
Intel isn’t stopping at x86. Its engineers are already working on porting these optimizations to ARM CPUs and SoCs, using AArch64 vector instructions such as SVE. This would allow smartphones, tablets, and ARM-based systems with integrated NPUs to benefit as well.
Looking ahead, the upcoming AVX10.2 extension, with vector widths of up to 512 bits (double the 256 bits of AVX2), promises to roughly double the throughput of these microkernels, bringing CPU performance even closer to that of GPUs.
Final thoughts
What once seemed impossible — running models with billions of parameters on a laptop — is becoming increasingly feasible. With 1 and 2-bit microkernels, Intel has not only challenged NVIDIA’s dominance but also opened a new era: accessible AI on any device, without the need for specialized hardware.
In a context where AI training and inference costs concern governments, companies, and consumers alike, this progress marks a turning point. Large-scale AI might no longer be confined to data centers but could make its way to desktops and laptops as a common feature.
FAQs
What does a 1- or 2-bit model mean?
It means each weight in the model is stored using only 1 or 2 bits, compared to the usual 16 or 32. This reduces size and memory consumption, but requires advanced techniques to maintain accuracy.
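As a rough illustration for the models named in this article (weights only; activations, the KV cache, and any layers kept at higher precision are ignored):

```python
# Approximate weight-storage footprint at different bit widths (weights only).
models = {"Llama3-8B": 8e9, "MobileLLM-1.5B": 1.5e9, "Falcon3-1B": 1e9}
for name, params in models.items():
    for bits in (16, 4, 2, 1):
        gib = params * bits / 8 / 2**30
        print(f"{name:>14} @ {bits:>2} bits: {gib:6.2f} GiB")
```

At 2 bits, even Llama3-8B needs less than 2 GiB for its weights, comfortably within the RAM of an ordinary laptop.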
Is Intel really competing with NVIDIA GPUs?
Not in training, but for inference of trained models, Intel’s microkernel-optimized CPUs can approach NVIDIA A100 speeds, which is quite remarkable.
Can I run Llama3 on my laptop because of this?
Yes, with a modern processor and sufficient memory, models from 1B to 8B parameters are feasible using these optimizations.
What does the future hold for PCs?
The AI-PC concept is gaining traction: portable or desktop systems equipped with CPUs capable of executing advanced AI models without dedicated GPUs. This could transform daily AI use.
Sources: El Chapuzas Informático, arXiv

