A team of engineers has achieved a historic milestone in the world of distributed storage: a Ceph cluster capable of sustaining 1 TiB/s of sequential read throughput, surpassing all previously known records. The achievement is the result of an extreme deployment combining cutting-edge hardware, a high-performance network, and months of fine-tuning to overcome unexpected technical challenges.
An architecture designed to squeeze every byte per second
The project began in 2023, when a leading company decided to migrate its hard-drive-based Ceph cluster to an all-NVMe infrastructure with a capacity of 10 PB. The final design, developed in partnership with Clyso, relied on 68 Dell PowerEdge R6615 nodes featuring AMD EPYC 9454P processors (48 cores / 96 threads), 192 GiB of DDR5 RAM, two Mellanox ConnectX-6 100 GbE interfaces per node, and ten enterprise NVMe drives of 15.36 TB each.
The cluster, distributed across 17 racks, was deployed with Ceph Quincy v17.2.7 and Ubuntu 20.04.6, reaching a total of 630 OSDs in production. The pre-existing high-performance network was crucial for maximizing the architecture’s potential.
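A quick back-of-the-envelope check helps put those figures in perspective. The sketch below is a minimal Python calculation using only the numbers quoted above (68 nodes, ten 15.36 TB drives per node, 630 OSDs in production, two 100 GbE ports per node); everything else is plain arithmetic, and line rate is taken as the theoretical maximum with no protocol overhead.

```python
# Back-of-the-envelope check of the published cluster figures.
# All inputs come from the article; the results are only indicative.

NODES = 68                 # Dell PowerEdge R6615 nodes
DRIVES_PER_NODE = 10       # enterprise NVMe drives per node
DRIVE_TB = 15.36           # capacity per drive, in decimal terabytes
OSDS_IN_PRODUCTION = 630   # OSDs actually deployed, per the article
NICS_PER_NODE = 2          # Mellanox ConnectX-6 100 GbE ports per node
NIC_GBPS = 100             # line rate per port, in gigabits per second

raw_capacity_pb = NODES * DRIVES_PER_NODE * DRIVE_TB / 1000
deployed_capacity_pb = OSDS_IN_PRODUCTION * DRIVE_TB / 1000

# Aggregate network line rate, converted from Gbit/s to GiB/s and TiB/s
# (ignores Ethernet/TCP overhead).
aggregate_gib_s = NODES * NICS_PER_NODE * NIC_GBPS / 8 * 1e9 / 2**30
aggregate_tib_s = aggregate_gib_s / 1024

print(f"Raw capacity (68 nodes x 10 drives): {raw_capacity_pb:.1f} PB")
print(f"Capacity behind the 630 OSDs:        {deployed_capacity_pb:.1f} PB")
print(f"Aggregate 100 GbE line rate:         {aggregate_gib_s:.0f} GiB/s "
      f"(~{aggregate_tib_s:.2f} TiB/s)")
```

The numbers line up: ten drives per node across 68 nodes gives roughly 10 PB of raw capacity, and the aggregate network line rate works out to about 1.5 TiB/s, so sustaining 1 TiB/s of reads means driving the fabric at a large fraction of its theoretical maximum.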
“The challenge wasn’t just to hit a record figure, but to do it in a realistic environment with production hardware and system stability,” explain the project engineers.
Three critical problems and their solutions
Reaching the 1 TiB/s mark was no easy feat. During the initial tests, performance was far lower than expected and erratic patterns appeared in the results. After weeks of analysis, three key bottlenecks were identified:
Power-saving states (C-states)
Ceph is highly sensitive to the latency introduced by CPU C-state management. Disabling them in the BIOS yielded an immediate 10–20% performance increase.
IOMMU lock contention
The kernel was spending a massive amount of time in native_queued_spin_lock_slowpath while managing DMA mappings for the NVMe devices. The solution was to disable the IOMMU at the kernel level, which unlocked read and write performance in multi-node tests (a quick way to check both of these settings is sketched below).
Suboptimal RocksDB compilation
The Debian/Ubuntu packages did not build RocksDB with the proper optimization flags. Rebuilding Ceph with the correct flags made compaction three times faster and doubled 4K random-write performance.
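As an illustration of the first two fixes, the following sketch is a hypothetical, Linux-only pre-flight check written for this article (it is not tooling from the actual deployment). It reads /proc/cmdline and the standard cpuidle sysfs tree to flag whether deep C-states are still active and whether an IOMMU-related boot parameter is present; which parameter a given platform actually needs (for example amd_iommu=off versus iommu=pt) depends on the hardware and kernel and should be verified against the kernel documentation.

```python
#!/usr/bin/env python3
"""Hypothetical pre-flight check: are deep C-states and the IOMMU still active?

Illustrative sketch only; it inspects standard Linux interfaces
(/proc/cmdline and the cpuidle sysfs tree) and prints its findings.
"""
from pathlib import Path


def kernel_cmdline() -> str:
    """Return the kernel boot command line."""
    return Path("/proc/cmdline").read_text()


def deep_cstates_enabled() -> bool:
    """Report True if any C-state deeper than C1 is still enabled on cpu0.

    Each cpuidle state directory exposes a 'disable' flag; enabled deep
    states add wake-up latency. If C-states were turned off in the BIOS,
    these directories may simply not exist.
    """
    states = sorted(
        Path("/sys/devices/system/cpu/cpu0/cpuidle").glob("state[2-9]")
    )
    for state in states:
        disable = state / "disable"
        if disable.exists() and disable.read_text().strip() == "0":
            return True
    return False


def iommu_flag_present(cmdline: str) -> bool:
    """Check for common boot parameters that disable or relax the IOMMU."""
    flags = ("amd_iommu=off", "intel_iommu=off", "iommu=pt", "iommu=off")
    return any(flag in cmdline for flag in flags)


if __name__ == "__main__":
    cmdline = kernel_cmdline()
    print("Deep C-states enabled:", deep_cstates_enabled())
    print("IOMMU boot flag found:", iommu_flag_present(cmdline))
    print("Boot cmdline:", cmdline.strip())
```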
Results: scaling up to break the barrier
With the issues resolved and the configuration fine-tuned (optimal number of PGs, threads, and shards per OSD), the cluster achieved:
- 1.025 TiB/s in 4 MB sequential reads with three replicas
- 270 GiB/s in sequential writes with three replicas
- 25.5 million IOPS in 4K random reads
- Under 6+2 erasure coding, over 500 GiB/s in reads and 387 GiB/s in writes
The key was to scale clients and OSDs proportionally, optimize asynchronous messaging threads, and prevent PGs from entering “laggy” states, which can temporarily halt I/O.
“Ceph is capable of saturating two 100 GbE interfaces per node. To go further, the future lies in 200 GbE or higher networks,” concludes the technical team.
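To put that quote into numbers, the following minimal sketch simply divides the published aggregates by the node and OSD counts given above (68 nodes, 630 OSDs) and compares the per-node figure against the theoretical line rate of two 100 GbE ports; it is an illustration, not a measurement from the deployment.

```python
# Per-node and per-OSD view of the published aggregates (illustrative only).
NODES, OSDS = 68, 630

read_tib_s = 1.025          # 4 MB sequential reads, 3x replication
read_gib_s = read_tib_s * 1024
iops_4k = 25.5e6            # 4K random reads

# Line rate of two 100 GbE ports per node, ignoring protocol overhead.
nic_gib_s = 2 * 100 / 8 * 1e9 / 2**30

per_node_gib_s = read_gib_s / NODES
per_osd_gib_s = read_gib_s / OSDS
per_osd_iops = iops_4k / OSDS

print(f"Per node: {per_node_gib_s:.1f} GiB/s of ~{nic_gib_s:.1f} GiB/s line rate")
print(f"Per OSD:  {per_osd_gib_s:.2f} GiB/s and ~{per_osd_iops / 1000:.0f}K read IOPS")
```

At roughly 15 GiB/s per node against about 23 GiB/s of theoretical line rate, each node is already pushing a large share of what two 100 GbE ports can carry, which is exactly what motivates the remark about moving to 200 GbE and beyond.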
The future of high-performance Ceph
This deployment demonstrates that Ceph can compete with proprietary extreme storage solutions while remaining open source. The lessons from this case—such as the sensitivity to kernel configuration, the importance of optimized compilation, and tuning PGs—are valuable for any large-scale deployment.
The next challenge is to improve efficiency in massive write operations and completely eliminate laggy PG issues. Additionally, developers indicate that surpassing the IOPS wall (~400–600K per node) will require rethinking parts of the OSD threading model.
Insights from Stackscale
David Carrero, co-founder of Stackscale (Grupo Aire), highlights that while few companies need the extreme figures achieved in this record-breaking deployment, Ceph’s underlying technology is perfectly applicable to real-world enterprise projects.
“At Stackscale, we offer clients the ability to deploy Ceph environments on dedicated infrastructure, whether as part of Proxmox-based projects or customized architectures. We don’t aim for 1 TiB/s, but we design solutions tailored to each case, with high availability, scalability, and the performance your business requires. Ceph is a key component for those seeking technological independence and flexibility in distributed storage,” Carrero emphasizes.
This perspective underscores that Ceph’s potential isn’t limited to technical records but serves as a versatile tool for companies aiming to control their data and optimize costs in private or hybrid environments.
Key project metrics
| Metric | 3× Replication | EC 6+2 |
|---|---|---|
| Sequential read (4 MB) | 1.025 TiB/s | 547 GiB/s |
| Sequential write (4 MB) | 270 GiB/s | 387 GiB/s |
| Random read (4K) | 25.5 M IOPS | 3.4 M IOPS |
| Random write (4K) | 4.9 M IOPS | 936 K IOPS |
Frequently Asked Questions (FAQ)
1. What is Ceph and why is this record significant?
Ceph is an open-source distributed storage system providing block, object, and file storage. This record showcases its ability to achieve extreme performance figures without relying on proprietary hardware.
2. What role did AMD EPYC processors play?
The AMD EPYC 9454P processors provided numerous cores, high DDR5 memory bandwidth, and energy efficiency, key factors for driving the ten NVMe OSDs in each node without starving them of CPU.
3. Why is tuning PGs (Placement Groups) important?
An optimal number of PGs per OSD improves data distribution and reduces internal contention, boosting performance in very fast clusters.
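As a rough illustration of that tuning, here is a minimal sketch of the classic PG-count rule of thumb (target on the order of ~100 PGs per OSD, divide by the replica count, round to a power of two). The exact target depends on pool layout, and in practice the Ceph pg_autoscaler or the official pgcalc tool should be preferred; this function is only a hypothetical helper for this article.

```python
# Minimal sketch of the classic PG-count rule of thumb (illustrative only;
# prefer the Ceph pg_autoscaler or pgcalc for real pools).

def suggested_pg_num(osds: int, replicas: int, target_pgs_per_osd: int = 100) -> int:
    """Round (osds * target_pgs_per_osd) / replicas to the nearest power of two."""
    raw = max(osds * target_pgs_per_osd / replicas, 1.0)
    lower = 1 << (int(raw).bit_length() - 1)   # power of two just below raw
    upper = lower * 2                          # power of two just above raw
    return upper if (raw - lower) > (upper - raw) else lower


if __name__ == "__main__":
    # Example with the figures from the article: 630 OSDs, 3x replication.
    print(suggested_pg_num(630, 3))  # -> 16384 with the default target of ~100 PGs/OSD
```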
4. Can Ceph be used with Proxmox in an enterprise setting?
Yes. Providers like Stackscale offer optimized infrastructure to deploy Ceph alongside Proxmox, tailoring design to each client’s performance, availability, and capacity requirements.
References: ceph.io and Micron PDF