Intel and AMD prepare ACE, the x86 extension to accelerate AI from the CPU

Intel and AMD have taken another step in modernizing x86 with version 1.15 of the ACE specification, where ACE stands for AI Compute Extensions. The extension was created within the x86 Ecosystem Advisory Group, which the two companies formed to coordinate the future of the architecture and reduce platform fragmentation. Its primary goal is clear: accelerate AI and machine learning operations directly from the CPU, with particular focus on matrix multiplication and low-precision numeric formats.

ACE should not be seen as an integrated NPU or a GPU replacement. It is an extension of the x86 instruction set designed so that future CPUs can perform certain common AI calculations more efficiently, especially when it isn’t worth transferring data to an external accelerator or when low latency, system integration, or predictable CPU execution is needed.

The technical document describes ACE as an extension to accelerate compute tasks, initially focusing on matrix multiplication kernels and low-precision formats relevant for ML workloads. The specification introduces a new register state, data movement instructions, and operations that combine AVX vector registers with tile-type registers, aiming for higher computational density without breaking compatibility with existing x86 architectures.

Why ACE Matters for the Future of x86

Artificial intelligence has shifted much of the debate toward GPUs, NPUs, and dedicated accelerators. It makes sense: large models, training, and many heavy inference workloads require specialized hardware. But not all AI runs on large clusters. There are lightweight inference tasks, small models, functions embedded in apps, workstations, general-purpose servers, and laptops where the CPU remains a central component.

This is where ACE fits in. Matrix multiplication is one of the fundamental operations in neural networks, transformers, and machine learning systems. AVX10 already supports vectors and SIMD operations, but ACE’s specification recognizes that the compute density and scalability of traditional vector approaches have their limits. Therefore, it introduces matrix primitives with tile registers, closer to how these workloads are executed on modern accelerators.

Technology | Primary Role
AVX10 | Modern base vector instruction set for x86
ACE | Matrix extension for AI and ML workloads
Tile registers | Accumulate and operate on 2D blocks
Block Scale Registers | Block scaling for OCP MX formats
GPU | Massive acceleration for AI, graphics, and parallel compute
NPU | Efficient local inference in client devices

The move also has a broader strategic meaning. x86 competes with alternative architectures that have gained ground in efficiency, mobility, and integrated acceleration. Companies like Apple, Qualcomm, Arm, NVIDIA, and others are pushing designs where CPU, GPU, NPU, and memory work ever more closely together. Intel and AMD need x86 to evolve without repeating the fragmentation mistakes that complicated life for developers and manufacturers in the past.

The most cited precedent is AVX-512. Over the years, partial, uneven, or limited support across product lines forced developers to maintain multiple code paths, check capabilities carefully, and accept that not all x86 processors behaved the same. ACE aims to start from a different point: as a jointly developed specification, coordinated between Intel and AMD, allowing compilers, libraries, and frameworks to prepare based on a more unified foundation.

How ACE Works: Tiles, AVX, and Low Precision

ACE combines AVX registers with a new tile register state. According to the specification, the tile register file contains eight two-dimensional registers, each with 16 rows of 512 bits, so every row matches the size of an AVX-512 vector. In the initial version, accumulators mainly support 32-bit types such as FP32 or INT32.
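The sizes above imply some simple arithmetic worth spelling out. The sketch below only multiplies the figures quoted in this article (eight tiles, 16 rows, 512-bit rows, 32-bit accumulators); the constant and function names are illustrative and do not come from the specification.

```python
# Figures quoted in the article; names are illustrative, not from the spec.
NUM_TILES = 8        # tile registers in the register file
ROWS_PER_TILE = 16   # rows per tile
ROW_BITS = 512       # one row is as wide as an AVX-512 vector
ACC_BITS = 32        # FP32 or INT32 accumulators


def tile_geometry():
    """Derive per-tile and whole-file sizes from the constants above."""
    accs_per_row = ROW_BITS // ACC_BITS
    bytes_per_tile = ROWS_PER_TILE * ROW_BITS // 8
    return {
        "accumulators_per_row": accs_per_row,
        "accumulators_per_tile": ROWS_PER_TILE * accs_per_row,
        "bytes_per_tile": bytes_per_tile,
        "bytes_tile_file": NUM_TILES * bytes_per_tile,
    }


if __name__ == "__main__":
    for key, value in tile_geometry().items():
        print(f"{key}: {value}")
```

Under these assumptions a tile holds a 16 x 16 block of 32-bit accumulators (1 KiB), and the eight tiles together add 8 KiB of architectural state that the operating system has to save and restore.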

The extension also introduces a 1,024-bit Block Scale Register, split into two 512-bit halves that hold the scales for the two inputs of an operation. This register enables block scaling, an important technique for formats like OCP MX. In AI, such formats help reduce memory and bandwidth while maintaining usable results in quantized or low-precision models.
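To make block scaling concrete, here is a minimal sketch of the idea: a block of values shares one power-of-two scale, stored as an 8-bit biased exponent in the style of E8M0 (bias 127, code 255 reserved for NaN, as in the OCP MX format definition). The scale-selection rule and the function names are simplifications chosen for illustration; this is not the bit-exact OCP MX conversion and says nothing about how ACE instructions consume the scales.

```python
import math

E8M0_BIAS = 127  # E8M0 stores only an exponent: value = 2 ** (code - 127)


def e8m0_encode(exponent):
    """Encode the scale 2**exponent as an 8-bit biased code."""
    code = exponent + E8M0_BIAS
    if not 0 <= code <= 254:  # 255 is reserved for NaN
        raise ValueError("exponent outside the E8M0 range")
    return code


def e8m0_decode(code):
    """Return the power-of-two scale represented by an E8M0 code."""
    return 2.0 ** (code - E8M0_BIAS)


def quantize_block(values, qmax=127):
    """Choose one shared power-of-two scale, then round each value to int8."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return e8m0_encode(0), [0] * len(values)
    # Smallest power of two that keeps amax / scale within [-qmax, qmax].
    exponent = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exponent
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return e8m0_encode(exponent), quantized


def dequantize_block(code, quantized):
    """Rebuild approximate values from the shared scale and the integers."""
    scale = e8m0_decode(code)
    return [q * scale for q in quantized]


if __name__ == "__main__":
    block = [0.5, -1.0, 0.25, 3.0]
    code, q = quantize_block(block)
    print(code, q)                    # 122 [16, -32, 8, 96]
    print(dequantize_block(code, q))  # [0.5, -1.0, 0.25, 3.0]
```

Because the scale is a pure power of two, applying it only adjusts exponents, which is part of why this style of scaling is cheap to implement in hardware.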

ACE Component | What it adds
Tile registers | 2D registers for matrix operations
Block Scale Register | E8M0 scales for OCP MX operations
Outer product on tiles | Outer product operations that accumulate into tiles
AVX-tile moves | Transfers between AVX registers and ACE state
Format conversions | Conversions between FP32, FP16, BF16, FP8, FP6, FP4, and INT8
System management | XSAVE state, CPUID, and OS support

The core operation is the outer product. Simply put, ACE takes two input vectors, treats them as narrow matrices, multiplies them, and accumulates the result into a tile. The specification defines rank-2 and rank-4 operations for formats like BF16, INT8, MX FP8, and MX INT8. These are designed to build larger matrix multiplications through sequential steps.
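The way a matrix multiplication decomposes into accumulated outer products can be shown in plain Python. This is a sketch of the mathematical idea only: `outer_accumulate` and `matmul_by_outer_products` are illustrative names, the list-of-lists "tile" stands in for an accumulator register, and nothing here models actual ACE instructions or operand layouts.

```python
def outer_accumulate(tile, a_col, b_row):
    """One rank-1 update: tile[i][j] += a_col[i] * b_row[j]."""
    for i, a in enumerate(a_col):
        row = tile[i]
        for j, b in enumerate(b_row):
            row[j] += a * b


def matmul_by_outer_products(A, B):
    """Compute A @ B as a sum of outer products, one per shared index k."""
    m, k_dim, n = len(A), len(B), len(B[0])
    tile = [[0] * n for _ in range(m)]        # accumulator, starts at zero
    for k in range(k_dim):
        a_col = [A[i][k] for i in range(m)]   # column k of A
        b_row = B[k]                          # row k of B
        outer_accumulate(tile, a_col, b_row)
    return tile


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(matmul_by_outer_products(A, B))  # [[19, 22], [43, 50]]
```

Each pass of the loop consumes one column of A and one row of B. Folding two or four consecutive k steps into a single operation is what is usually called a rank-2 or rank-4 update; reading the specification's two- and four-step variants this way is an interpretation, not a quotation.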

Supported formats reflect industry trends. ACE covers INT8, INT32, FP32, BF16, FP16, E8M0, FP8, MX FP8, MX FP6, MX FP4, and MX INT8. These are not arbitrary choices: FP8, BF16, FP16, and INT8 are already common in AI acceleration, while FP6 and FP4 reduce precision even further, saving memory and moving more values per cycle when models tolerate it.

Format | Typical Use in AI
FP32 | High precision and accumulation
BF16 | Training and inference with good balance
FP16 | Low-precision workloads and acceleration
FP8 | Efficient inference and training in compatible models
FP6 / FP4 | Aggressive quantization and bandwidth savings
INT8 | Quantized inference
MX FP8 / MX INT8 | Block-scaled formats
E8M0 | Power-of-two scaling for OCP MX
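The memory argument for narrow formats is easy to quantify. The sketch below estimates storage per format from the element width alone; for block-scaled variants it adds one 8-bit E8M0 scale per 32 elements, which is the block size defined by the OCP MX format specification. Whether ACE uses exactly that block size is an assumption here, and the function names are illustrative.

```python
ELEMENT_BITS = {
    "FP32": 32, "BF16": 16, "FP16": 16,
    "FP8": 8, "INT8": 8, "FP6": 6, "FP4": 4,
}
MX_BLOCK_SIZE = 32  # elements sharing one scale (OCP MX); assumed for ACE
MX_SCALE_BITS = 8   # one E8M0 scale per block


def bits_per_element(fmt, block_scaled=False):
    """Average storage per value, including the amortized shared scale."""
    bits = ELEMENT_BITS[fmt]
    if block_scaled:
        bits += MX_SCALE_BITS / MX_BLOCK_SIZE
    return bits


def footprint_mib(num_values, fmt, block_scaled=False):
    """Storage in MiB for num_values values in the given format."""
    return num_values * bits_per_element(fmt, block_scaled) / 8 / 2**20


if __name__ == "__main__":
    n = 2**20  # about one million values
    for fmt, scaled in [("FP32", False), ("BF16", False),
                        ("FP8", True), ("FP6", True), ("FP4", True)]:
        label = ("MX " if scaled else "") + fmt
        print(f"{label:7s} {footprint_mib(n, fmt, scaled):.3f} MiB")
```

With these numbers a block-scaled FP4 tensor costs 4.25 bits per value on average, so the shared scale adds only about 6% over the raw 4-bit elements.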

Implementations compatible with ACE must build on at least an AVX10.1 foundation. Full support for ACE v1 is detected via CPUID and requires features such as ACE, ACE_VSN version 1 or higher, and AVX10_V2_AUX, plus the proper XSAVE states for tiles and scale registers. In other words, a compatible CPU alone won’t be enough; OS, compiler, library, and framework support will also be necessary.
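A library would typically turn these requirements into a single gate before choosing a kernel. The sketch below shows that gating logic only. `CpuFeatures` is a hypothetical snapshot type invented for this example: how its fields would be filled from CPUID and from the OS-enabled XSAVE state is deliberately left out, because the article does not give leaf or bit positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuFeatures:
    """Hypothetical snapshot of CPU and OS capabilities; not a real API."""
    ace: bool = False
    ace_vsn: int = 0
    avx10_version: int = 0           # 0 means AVX10 is absent
    avx10_v2_aux: bool = False
    os_saves_tile_state: bool = False
    os_saves_bsr_state: bool = False


def missing_for_ace_v1(f):
    """List unmet requirements; an empty list means the ACE path is usable."""
    checks = [
        (f.avx10_version >= 1, "AVX10.1 baseline"),
        (f.ace, "ACE"),
        (f.ace_vsn >= 1, "ACE_VSN >= 1"),
        (f.avx10_v2_aux, "AVX10_V2_AUX"),
        (f.os_saves_tile_state, "XSAVE tile state enabled by the OS"),
        (f.os_saves_bsr_state, "XSAVE scale-register state enabled by the OS"),
    ]
    return [name for ok, name in checks if not ok]


def pick_gemm_path(f):
    """Use the ACE kernel only when every requirement is met."""
    return "ace" if not missing_for_ace_v1(f) else "avx_fallback"


if __name__ == "__main__":
    cpu_only = CpuFeatures(ace=True, ace_vsn=1, avx10_version=2,
                           avx10_v2_aux=True)
    print(pick_gemm_path(cpu_only))       # avx_fallback
    print(missing_for_ace_v1(cpu_only))   # the two OS-side requirements
```

Returning the list of missing items rather than a bare boolean makes it easier to log why a machine fell back to the vector path, for example when the silicon supports ACE but the operating system has not enabled the new state.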

Not an Immediate Upgrade for Current Ryzen or Core CPUs

It’s important to set realistic expectations. ACE is an architectural specification, not a feature update that will magically boost the performance of current processors. The document itself notes that these technologies are still in the design phase and that product plans may change, so the instructions won’t reach silicon or software support right away.

The real impact depends on multiple layers. First, Intel and AMD must implement ACE in future CPU generations. Next, operating systems need to properly manage the new register states. Compilers must generate ACE instructions. And finally, libraries like BLAS, NumPy, SciPy, oneDNN, PyTorch, TensorFlow, and other inference layers will need to develop optimized routines.

Required layer | What needs to happen
CPU | Physical implementation of ACE in new architectures
Firmware | Proper CPUID exposure and configuration
Operating system | Management of XSAVE state for tiles and BSR
Compilers | Intrinsic, assembly, and code generation support
Math libraries | Optimized GEMM kernels and conversions
AI frameworks | Use of ACE paths when hardware supports it
Applications | Actual benefits in inference and specific workloads

Within the x86 Ecosystem Advisory Group, AMD has presented ACE as part of a broader roadmap alongside FRED, AVX10, and ChkTag. Meanwhile, some technical reports suggest that future AMD architectures, such as Zen 6 and Zen 7, will incorporate improvements related to AI, new data types, and matrix engines. However, until commercial products ship and independent benchmarks are available, any timeline should be treated with caution.

The Battle Is Not Just Performance, but Compatibility Too

Perhaps the most interesting aspect of ACE isn’t raw performance, but coordination. Intel and AMD have competed within x86 for decades, but the pressures of AI and alternative architectures force them to prioritize compatibility. For developers, the worst case isn’t a difficult instruction; it’s each vendor implementing incompatible variants or different subsets without a clear path forward.

ACE aims to provide a common foundation so that AI software can optimize for x86 without maintaining completely separate code paths. If successful, it will benefit servers, workstations, client devices, and embedded systems where local AI deployment is expected to grow over the coming years.

Historical risk | What ACE offers instead
Instruction fragmentation | A common base between Intel and AMD
Separate code paths | Less maintenance for libraries and frameworks
Unpredictable partial support | Clear detection via CPUID
Over-reliance on GPU/NPU | More options for CPU inference
Lack of modern formats | Direct support for low precision and OCP MX

This doesn’t mean ACE will replace GPUs. For training large models and massive inference workloads, accelerators will still hold an advantage. But many applications don’t require a dedicated GPU for every operation. On a laptop, a general-purpose server, or in a use case that runs close to the CPU, avoiding data transfers between devices can reduce latency and simplify execution.

In local AI, experience also depends on factors beyond the announced TOPS. Relevant aspects include the available memory, bandwidth, latency, energy efficiency, system integration, and ease of software deployment. ACE can give x86 an additional tool to compete effectively in this space.

A Signal of Where the General-Purpose Processor Is Heading

For years, it has been said that the general-purpose CPU is losing ground to specialized accelerators. The reality is more nuanced. The CPU continues to coordinate the system, run application logic, move data, manage memory, handle interrupts, and work on diverse workloads. If AI is integrated into all sorts of applications, the CPU needs to handle those computation patterns more efficiently.

ACE responds to this pressure. It brings matrix capabilities and modern AI formats into the core of x86, without turning the CPU into a GPU or attempting to handle all workloads. Its more pragmatic goal seems to be making the CPU a more efficient and predictable platform for certain AI calculations, especially inference, quantization, preprocessing, small operations, or scenarios where moving data outside the CPU isn’t justified.

Success will depend on execution. If Intel and AMD implement ACE consistently, OS support is solid, and frameworks adopt it, x86 will have a stronger foundation for local and enterprise AI. If support arrives late, fragments, or remains limited to certain product lines, the impact will be smaller.

The ACE v1.15 specification doesn’t improve the performance of any existing system today. But it indicates an important direction: Intel and AMD understand that AI requires a coordinated evolution of x86. It’s no longer enough to just add more cores or increase frequencies. Future CPUs will need to work better with matrices, low-precision formats, and models that run increasingly close to the user.

Frequently Asked Questions

What is ACE in x86?
ACE, or AI Compute Extensions, is a specification developed by Intel and AMD to add x86 instructions aimed at accelerating AI and machine learning operations, particularly matrix multiplication and low-precision formats.

Does ACE replace a GPU or NPU?
No. ACE does not replace dedicated accelerators for large workloads. Its role is to enhance the ability of future x86 CPUs to perform certain AI operations more efficiently.

What formats does ACE support?
The specification includes support for INT8, INT32, FP32, BF16, FP16, FP8, MX FP8, MX FP6, MX FP4, MX INT8, and E8M0 for block-scaling formats.

Will it arrive on current processors via update?
It’s unlikely to be an immediate upgrade for existing CPUs. ACE requires silicon support, along with OS, compiler, library, and framework readiness.
