Intel Aims to Maximize NVMe Performance on Linux: “Cluster-Aware” Patch Promises Up to 15% More on Multi-Core Systems

In the world of performance, big rewrites don’t always win. Sometimes the leap comes from a seemingly minor detail: where interrupts land. And that’s precisely where Intel engineers are pushing for a change in the Linux kernel to improve NVMe storage performance on modern multi-core servers.

The issue arises when there are fewer NVMe IRQs than CPUs, a relatively common situation on current platforms. In this scenario, multiple cores end up sharing the same interrupt, and if the IRQ affinity isn’t aligned with the actual processor topology, the cost shows up as increased latency and reduced performance. Intel summarizes it simply: when the interrupt and the CPU “group” that handles it aren’t close (in terms of caches and locality), penalties appear.
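
A quick way to see whether this applies to a given machine is to compare the number of NVMe interrupt lines with the number of online CPUs. Below is a minimal sketch, assuming the driver labels its queue interrupts with an “nvme” substring in /proc/interrupts:

```python
#!/usr/bin/env python3
"""Rough check: are there fewer NVMe IRQs than CPUs on this machine?

Minimal sketch; it simply counts /proc/interrupts lines whose name
contains "nvme", which is a simplifying assumption about how the
driver labels its queues.
"""
import os

with open("/proc/interrupts") as f:
    nvme_irqs = sum(1 for line in f if "nvme" in line)

cpus = os.cpu_count() or 1
print(f"NVMe IRQs: {nvme_irqs}, online CPUs: {cpus}")
if nvme_irqs and nvme_irqs < cpus:
    print("Fewer NVMe IRQs than CPUs: some cores will share an interrupt.")
```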

The bottleneck: Shared IRQs and the CPU’s “real” topology

Linux already has mechanisms for assigning affinities, but the reality of modern processors is that “NUMA” no longer explains everything. Within the same NUMA domain, cores can be organized into clusters (for example, groups sharing an intermediate cache), and depending on load distribution, a “flat” approach can cross internal boundaries that ideally shouldn’t be crossed.

Developers on the kernel mailing list describe a practical example: a CPU where 4 cores share an L2 cache, forming a cluster; if Linux distributes affinity without considering this grouping, the interrupt may end up “jumping” between clusters, degrading locality.
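
That grouping is visible from userspace. The sketch below reads the standard sysfs cache topology to list which CPUs share an L2 cache; the exact cache levels and sharing pattern vary by platform, so treat the output as descriptive rather than a tuning recommendation:

```python
#!/usr/bin/env python3
"""List which CPUs share an L2 cache, i.e. the "clusters" the patch cares about.

Sketch assuming the usual sysfs layout; the cache level is read from
sysfs rather than hard-coding a particular index.
"""
import glob

clusters = set()
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index*"):
    try:
        with open(f"{path}/level") as f:
            if f.read().strip() != "2":   # keep only L2 cache entries
                continue
        with open(f"{path}/shared_cpu_list") as f:
            clusters.add(f.read().strip())
    except FileNotFoundError:
        continue

def first_cpu(cpu_list: str) -> int:
    # shared_cpu_list looks like "0-3" or "0,64"; sort by the first CPU id
    return int(cpu_list.split(",")[0].split("-")[0])

for i, cpus in enumerate(sorted(clusters, key=first_cpu)):
    print(f"L2 group {i}: CPUs {cpus}")
```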

In other words: it isn’t enough to put the IRQ “on the same NUMA node”; on systems with multiple clusters per NUMA node, it can be more efficient for the interrupt to be handled by a “neighboring” core inside the same cluster.

What the patch changes: Linux becomes cluster-aware

The proposed improvement makes the kernel code that groups CPUs for affinity (in lib/group_cpus.c) aware of the clusters within each NUMA domain. The goal is to group cores more intelligently so that NVMe IRQ assignment maintains better locality between CPU and interrupt.

Intel engineer Wangyang Guo explains: as core counts increase, there may be fewer NVMe IRQs than CPUs, forcing multiple cores to share an interrupt; if the affinity doesn’t match the right cluster, penalties appear. The patch addresses this by grouping CPUs by cluster within each NUMA node.
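
The real change lives in the kernel’s lib/group_cpus.c, but the idea can be illustrated with a small userspace sketch: given an invented topology of clusters and fewer IRQ groups than CPUs, build each group from whole clusters where possible, so no group straddles a cluster boundary unnecessarily. This is only a toy model of the concept, not the kernel implementation:

```python
"""Toy model of cluster-aware CPU grouping (not the kernel code itself)."""

# Invented example topology: one NUMA node, 16 CPUs, clusters of 4 cores
# that share an L2 cache. Real topologies come from the hardware.
clusters = [list(range(c, c + 4)) for c in range(0, 16, 4)]
num_groups = 6  # fewer IRQ groups than CPUs, so CPUs must share

def group_cpus_cluster_aware(clusters, num_groups):
    """Build IRQ CPU groups that avoid straddling cluster boundaries."""
    groups = []
    if num_groups <= len(clusters):
        # Each group is assembled from whole clusters only.
        groups = [[] for _ in range(num_groups)]
        for i, cluster in enumerate(clusters):
            groups[i % num_groups].extend(cluster)
    else:
        # More groups than clusters: carve several groups out of each
        # cluster, so every group's CPUs still share the same cluster.
        base, extra = divmod(num_groups, len(clusters))
        for ci, cluster in enumerate(clusters):
            n = base + (1 if ci < extra else 0)
            size = -(-len(cluster) // n)  # ceiling division
            for g in range(n):
                groups.append(cluster[g * size:(g + 1) * size])
    return groups

for irq, cpus in enumerate(group_cpus_cluster_aware(clusters, num_groups)):
    print(f"IRQ group {irq}: CPUs {cpus}")
```

With the example topology above, every group lands inside a single L2 cluster, which is exactly the locality the patch tries to preserve for NVMe interrupts.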

Notable result: +15% in random reads with FIO

In publicly reported tests, the change showed a roughly 15% improvement in random reads using FIO with libaio, on an Intel Xeon E server.
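
The exact job behind that figure isn’t published, but a comparable random-read workload can be generated with FIO and libaio. The sketch below wraps an illustrative invocation; the block size, queue depth, job count, and target device are assumptions, and the device should be a disposable test namespace:

```python
#!/usr/bin/env python3
"""Illustrative FIO random-read run (not the exact job from the reported
tests; parameters here are assumptions). Requires fio to be installed."""
import subprocess

DEVICE = "/dev/nvme0n1"  # WARNING: use a disposable device/namespace

subprocess.run([
    "fio",
    "--name=randread",
    "--ioengine=libaio",   # the engine cited in the reported tests
    "--rw=randread",
    "--bs=4k",
    "--iodepth=32",
    "--numjobs=8",
    "--direct=1",
    "--time_based", "--runtime=60",
    "--group_reporting",
    f"--filename={DEVICE}",
], check=True)
```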

This is an impressive result, but with a key caveat: for now, the data relates to a specific case and no figures are provided for other I/O profiles (sequential, mixed, writes) or a wide range of platforms. Still, the approach aligns with a clear trend: performance in modern servers depends both on the device and on the “path” each event takes within the system.

Why this matters to sysadmins and system architects

For system professionals, the practical takeaway is immediate: IRQ affinity isn’t a cosmetic detail in NVMe environments with high parallelism. When many cores, NVMe queues, and intensive I/O workloads (databases, virtualization storage, ingestion queues, analytics) are involved, poor distribution can silently bottleneck performance.

This kind of improvement also reflects operational realities: in many infrastructures, NVMe is no longer “the fast component,” but rather part of a chain that must be finely tuned (scheduler, affinities, queues, NUMA, caches) to avoid wasting performance.

What can be checked today without waiting for future kernels

While this change matures, here are several checks that often reveal clear signals on systems with NVMe (a small inspection sketch follows the list):

  • Interrupt distribution: check whether one or a few NVMe IRQs handle a disproportionate share of the load, or whether handling “jumps” between CPUs.
  • Affinity and NUMA: verify whether NVMe IRQs are handled by CPUs on the NUMA node closest to the device (especially on multi-socket systems).
  • irqbalance behavior: in some environments its default policy isn’t ideal for deterministic workloads; in others it successfully avoids excessive concentration.
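
Here is a minimal inspection sketch covering the first two checks, using standard procfs/sysfs files; matching IRQs by the “nvme” substring is a simplifying assumption to adjust for your controller:

```python
#!/usr/bin/env python3
"""Dump NVMe IRQ affinities and each controller's NUMA node.

Sketch using standard procfs/sysfs files; matching IRQs by the "nvme"
substring in /proc/interrupts is a simplifying assumption.
"""
import glob
import re

# Which IRQ numbers belong to NVMe queues, according to /proc/interrupts?
nvme_irqs = []
with open("/proc/interrupts") as f:
    for line in f:
        if "nvme" in line:
            m = re.match(r"\s*(\d+):", line)
            if m:
                nvme_irqs.append(m.group(1))

# Which CPUs is each of those IRQs allowed to run on, and where does it
# actually run (effective affinity)?
for irq in nvme_irqs:
    try:
        with open(f"/proc/irq/{irq}/smp_affinity_list") as f:
            allowed = f.read().strip()
        with open(f"/proc/irq/{irq}/effective_affinity_list") as f:
            effective = f.read().strip()
        print(f"IRQ {irq}: allowed CPUs {allowed}, effective {effective}")
    except FileNotFoundError:
        pass

# On which NUMA node does each NVMe controller sit?
for node_file in sorted(glob.glob("/sys/class/nvme/nvme*/device/numa_node")):
    ctrl = node_file.split("/")[4]  # e.g. "nvme0"
    with open(node_file) as f:
        print(f"{ctrl} NUMA node: {f.read().strip()}")
```

Comparing the affinity lists with the controller’s NUMA node (and with the L2 groups from the earlier sketch) is usually enough to spot interrupts being served far from where the I/O actually happens.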

No universal recipe exists, but the conclusion is clear: in many-core platforms, topology matters. Each new generation tends to add more internal layers (clusters, chiplets, CCX/tiles) that aren’t always well-represented by older heuristics.

Status of the change: in the “mm-everything” branch, awaiting upcoming merge windows

According to the published information, the patch has entered the “mm-everything” branch maintained by Andrew Morton and could be picked up in an upcoming merge window (around Linux 6.20 or even 7.0, depending on progress and review).

As with many such optimizations, the real impact will be visible when more users test it across different configurations: diverse CPU topologies, multiple sockets, various NVMe controllers, and real workloads (not just microbenchmarks). But the approach points in the right direction: improving performance isn’t always “more IOPS”; sometimes it’s just less internal friction.


Frequently Asked Questions

Who benefits most from this NVMe patch in Linux?

Primarily systems with many cores where the number of NVMe IRQs/queues doesn’t scale proportionally, leading to shared IRQs and potential penalties from misaligned affinity with the CPU topology.

What does “CPU cluster-aware” mean in this context?

It indicates that the kernel attempts to group CPUs by cluster within each NUMA domain to assign IRQ affinities with better locality (e.g., avoiding jumps between groups of cores sharing internal caches).

Is the +15% improvement generalizable to any server?

Not necessarily. The reported increase came from an Intel Xeon E server with FIO/libaio in random reads. No broad results for other I/O profiles or hardware are currently published.

When might it arrive in the stable kernel?

The change is in the “mm-everything” branch and is a candidate for an upcoming merge window (around Linux 6.20 to 7.0, subject to review and acceptance).
