Ceph Replica 3 vs Erasure Coding: Which One to Choose for Your Storage

Ceph offers several ways to protect data within a distributed cluster, but two approaches dominate most practical decisions: three-copy replication, commonly known as Replica 3, and Erasure Coding, a technique that divides data into fragments and adds parity so the data can be reconstructed after a failure. The choice is significant. It affects cost per usable terabyte, performance, latency, CPU usage, operational complexity, and the types of workloads each pool can reasonably host.

At a time when capacity prices, data growth, and cost reduction pressures are increasingly influencing storage architectures, Erasure Coding is again frequently discussed. But it is not a direct replacement for Replica 3. In Ceph, each method makes sense in different scenarios, and choosing poorly can turn a cost-saving measure into a performance or availability issue.

Replica 3: simple, fast, and capacity-intensive

Replica 3 is the easiest approach to understand. Each piece of data is stored in three copies distributed across different OSDs, typically spread among nodes or failure domains defined by CRUSH. If a disk fails, the cluster retains two other copies. Losing a second OSD under controlled conditions still leaves a copy available. Its simplicity explains why it’s so widely used in virtualization environments, databases, VM storage, and latency-sensitive workloads.

Its main drawback is cost: for every 1 TB of usable data, 3 TB of raw capacity is consumed. In other words, overhead is 200%. In exchange, this investment in redundancy buys straightforward operation, less computation, and very predictable read/write behavior. For RBD, virtual machines, container disks, and critical services, Replica 3 remains a very solid option.

The downside becomes apparent as volume grows. Storing tens or hundreds of terabytes in Replica 3 requires purchasing a lot of raw capacity. For cold data, large repositories, office files, images, videos, secondary copies, or read-heavy workloads with few modifications, it may be excessive.

Erasure Coding: more usable capacity, more computation, and more planning

Erasure Coding operates differently. Instead of storing full copies, Ceph splits each object into data fragments, defined by the parameter k, and adds parity fragments, defined by m. With these fragments, the system can reconstruct the data even if a certain number of OSDs or failure domains are lost. Ceph’s documentation explains this model as dividing the object into data chunks and coding chunks stored across different OSDs.
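To make the idea of data and parity fragments concrete, here is a minimal sketch of the simplest possible case: k data chunks plus a single XOR parity chunk (m = 1). It is only an illustration. Ceph's erasure-code plugins use more general codes, such as Reed-Solomon, to support m greater than 1, and the function names below are hypothetical, not part of any Ceph API.

```python
from __future__ import annotations


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_xor_parity(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal chunks (zero-padded) plus one XOR parity chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = bytes(chunk_len)  # all zeros
    for chunk in chunks:
        parity = xor_bytes(parity, chunk)
    return chunks + [parity]


def rebuild_missing(fragments: list[bytes | None]) -> list[bytes]:
    """Rebuild at most one missing fragment by XOR-ing all surviving ones."""
    missing = [i for i, frag in enumerate(fragments) if frag is None]
    if len(missing) > 1:
        raise ValueError("single parity (m=1) tolerates only one lost fragment")
    result = list(fragments)
    if missing:
        survivors = [frag for frag in fragments if frag is not None]
        rebuilt = bytes(len(survivors[0]))
        for frag in survivors:
            rebuilt = xor_bytes(rebuilt, frag)
        result[missing[0]] = rebuilt
    return result


# Example: lose the first data chunk of a 2+1 layout and rebuild it.
fragments = split_with_xor_parity(b"ceph object payload", k=2)
restored = rebuild_missing([None, fragments[1], fragments[2]])
print(restored[0] == fragments[0])
```

Losing two fragments raises an error here, which mirrors the rule that a profile can only survive as many lost fragments as it has parity fragments.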

A profile like 4+2, for example, divides data into four fragments and adds two parity fragments. This configuration tolerates the loss of two fragments and reduces raw capacity consumption relative to Replica 3. Practically, storing 100 TB of usable data might require around 150 TB of raw capacity, compared to 300 TB with Replica 3. Efficiency improves, but an additional cost appears: each write requires parity calculation, more coordination among OSDs, and increased sensitivity to latency, CPU, network, and block size.
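The capacity arithmetic behind these figures can be sketched in a few lines. The helper names are hypothetical, and the calculation deliberately ignores real-world factors such as free-space headroom and per-object metadata, so treat the results as lower bounds rather than sizing guidance.

```python
def raw_capacity_replica(usable_tb: float, copies: int = 3) -> float:
    """Raw capacity consumed when every object is stored `copies` times."""
    return usable_tb * copies


def raw_capacity_ec(usable_tb: float, k: int, m: int) -> float:
    """Raw capacity consumed by a k+m profile: each object occupies (k + m) / k of its size."""
    return usable_tb * (k + m) / k


usable = 100  # TB of usable data
print(f"Replica 3: {raw_capacity_replica(usable)} TB raw")
for k, m in [(2, 1), (4, 2), (6, 3), (8, 3), (8, 4)]:
    print(f"EC {k}+{m}: {raw_capacity_ec(usable, k, m)} TB raw")
```

Because the overhead depends only on the ratio (k + m) / k, profiles such as 2+1, 4+2, and 8+4 consume the same raw capacity while tolerating different numbers of lost fragments.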

| Profile | Approximate raw capacity needed for 100 TB of usable data | Failures tolerated | Typical use |
| --- | --- | --- | --- |
| Replica 3 | 300 TB | Two copies can be lost without losing data | Virtual machines, databases, critical services |
| EC 2+1 | 150 TB | One fragment | Testing or less critical environments |
| EC 4+2 | 150 TB | Two fragments | Large files, cold data, repositories, secondary backup |
| EC 6+3 | 150 TB | Three fragments | High-capacity storage with more resilience |
| EC 8+3 | 137.5 TB | Three fragments | Large-scale cold data |
| EC 8+4 | 150 TB | Four fragments | Large volumes with higher failure tolerance |

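This table highlights a point that is sometimes overlooked: adding parity fragments does not by itself change capacity efficiency. Raw overhead depends on the ratio (k + m) / k, which is why 2+1, 4+2, 6+3, and 8+4 all need 150 TB for 100 TB of usable data, while 8+3 needs 137.5 TB. What changes between profiles is the balance among efficiency, resilience, minimum number of OSDs or nodes, and reconstruction cost. The more ambitious the profile, the more critical it becomes to design the cluster carefully.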

Ceph recommends that most erasure-coded pool deployments have at least k + m CRUSH failure domains, which often correspond to hosts or racks. This is important: counting disks is not enough. If fragments are concentrated in too few failure domains, losing a single node could take out more fragments than the profile can tolerate.
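The failure-domain rule can be checked with a small sketch. In a real cluster, CRUSH decides where fragments land; the `placement` dictionary below is a hypothetical stand-in that maps each fragment index to the host holding it, and the function names are illustrative only.

```python
from __future__ import annotations

from collections import Counter


def max_fragments_per_domain(placement: dict[int, str]) -> int:
    """Largest number of fragments of one object held by a single failure domain."""
    return max(Counter(placement.values()).values())


def survives_single_domain_loss(placement: dict[int, str], m: int) -> bool:
    """True if losing any one failure domain removes at most m fragments."""
    return max_fragments_per_domain(placement) <= m


# 4+2 profile spread across six hosts: one fragment per host.
spread = {i: f"host{i}" for i in range(6)}
# Same profile squeezed onto two hosts: three fragments per host.
packed = {i: f"host{i % 2}" for i in range(6)}

print(survives_single_domain_loss(spread, m=2))
print(survives_single_domain_loss(packed, m=2))
```

With six hosts, a host failure costs one fragment and the object stays recoverable. With two hosts, a single failure removes three fragments, which is more than the two that a 4+2 profile can tolerate.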

Performance: where each option excels

Replica 3 generally performs better with small writes, random workloads, and latency-sensitive services. It does not need to calculate parity or reconstruct fragments for each normal operation. The system simply writes full copies, replicates, and confirms according to configuration. This makes it more suitable for RBD, VM disks, databases, queues, transactional systems, and applications with many small writes.

Erasure Coding shines when capacity efficiency is a priority and the data is large, infrequently changed, or mostly read. Large files, object storage, document repositories, multimedia content, secondary backups, scientific datasets, and archival data can benefit from its lower overhead. Ceph’s own documentation cites cold storage, with large objects that are written rarely and mostly read, as a typical use case for erasure-coded pools.

The problem arises when Erasure Coding is used as if it were Replica 3. Small writes can be penalized. Post-failure reconstruction consumes more CPU and network resources. Recovery may be slower and more demanding. Additionally, in degraded scenarios, the cluster must read multiple fragments and recalculate data, increasing system load.

RBD, CephFS, and allow_ec_overwrites

For years, a common limitation of erasure-coded pools was that they were mainly designed for full-object write operations, such as those used by RGW. From Ceph Luminous onward, allow_ec_overwrites can be enabled to allow partial writes on erasure-coded pools, enabling RBD, CephFS, and RADOS to store data in such pools.

This does not mean Erasure Coding is automatically recommended for any virtual machine disk. In RBD, a typical approach is to maintain a replicated pool for metadata and use an erasure-coded pool as the data pool. It’s a valid architecture but requires understanding its implications. For a virtual machine with light workloads, large files, or infrequently changing data, it may make sense. For active databases, high I/O systems, ERP, transactional workloads, or services with small and intense I/O, Replica 3 is usually safer.
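As a rough sketch of what this layout involves, the Python snippet below only assembles and prints the usual Ceph CLI commands; it does not run them. The profile, pool, and image names are hypothetical, the replicated metadata pool is assumed to exist already, and the exact options should be checked against the documentation for your Ceph release (Ceph's documentation notes, for example, that overwrites on erasure-coded pools require BlueStore OSDs).

```python
from __future__ import annotations

import shlex


def ec_rbd_commands(profile: str, k: int, m: int, ec_pool: str,
                    meta_pool: str, image: str, size: str) -> list[list[str]]:
    """Build the CLI commands for an RBD image whose data lives in an EC pool."""
    return [
        # Define the k+m profile with hosts as the CRUSH failure domain.
        ["ceph", "osd", "erasure-code-profile", "set", profile,
         f"k={k}", f"m={m}", "crush-failure-domain=host"],
        # Create the erasure-coded pool from that profile.
        ["ceph", "osd", "pool", "create", ec_pool, "erasure", profile],
        # Allow partial overwrites, which RBD and CephFS need.
        ["ceph", "osd", "pool", "set", ec_pool, "allow_ec_overwrites", "true"],
        # Image metadata stays in the replicated pool; data goes to the EC pool.
        ["rbd", "create", "--size", size, "--data-pool", ec_pool,
         f"{meta_pool}/{image}"],
    ]


# All names below are placeholders chosen for illustration.
for cmd in ec_rbd_commands("ec42", 4, 2, "ec_data", "rbd_meta", "vm-disk-1", "1G"):
    print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere and makes it easy to review each step before applying it to a cluster.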

In virtualization, the key question isn’t “Can I use Erasure Coding?”, but “What is the I/O pattern of this workload?” If the application writes heavily, in small blocks, with low latency requirements, disk savings may come at a high cost. If data is written infrequently, read occasionally, and takes up significant space, Erasure Coding can be a useful tool.
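That question can be turned into a small, deliberately simplistic decision helper. The `Workload` fields and the rule itself are assumptions made for illustration; they encode only the heuristic described above and are no substitute for measuring the real I/O pattern.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    small_writes: bool        # mostly small-block writes
    write_heavy: bool         # writes frequently
    latency_sensitive: bool   # needs low, predictable latency


def suggest_pool_type(workload: Workload) -> str:
    """Suggest 'replicated' or 'erasure-coded' from the workload's I/O pattern."""
    if workload.latency_sensitive:
        return "replicated"
    if workload.small_writes and workload.write_heavy:
        return "replicated"
    return "erasure-coded"


workloads = [
    Workload("transactional database", small_writes=True, write_heavy=True, latency_sensitive=True),
    Workload("video archive", small_writes=False, write_heavy=False, latency_sensitive=False),
]
for w in workloads:
    print(f"{w.name}: {suggest_pool_type(w)}")
```

The helper leans toward replication whenever latency matters or writes are both small and frequent, and suggests Erasure Coding only when neither condition holds.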

Practical decision: cost versus behavior

The comparison between Replica 3 and Erasure Coding typically starts with capacity considerations, but should ultimately focus on service type. Replica 3 consumes more disk space but reduces complexity and provides better performance for active workloads. Erasure Coding conserves capacity but requires more calculation, better network design, careful failure domain planning, and more meticulous profile management.

| Criterion | Replica 3 | Erasure Coding |
| --- | --- | --- |
| Capacity efficiency | Low: 3 TB of raw for 1 TB usable | High: depends on k + m |
| Latency | Better, especially for small writes | Higher due to computation and distribution |
| CPU usage | Lower | Higher due to coding and reconstruction |
| Operational simplicity | High | Medium or low, depending on profile |
| Recovery after failure | More straightforward | More intensive in computation and network |
| Critical virtual machines | Recommended | Only with caution and specific cases |
| Databases | Recommended | Generally not advised |
| Large files and cold data | May be costly | Very suitable |
| Secondary backups | Viable but expensive | Suitable if performance profile is acceptable |

The best architecture often combines both approaches. A Ceph cluster can use Replica 3 pools for critical VMs, databases, and latency-sensitive services, and erasure-coded pools for cold data, large repositories, or object storage. This separation helps optimize costs without betting the entire environment on a single protection scheme.

It’s also important to remember that Erasure Coding does not replace a backup strategy. It protects against disk or node failures within the cluster but does not prevent accidental deletion, logical corruption, ransomware, human errors, or application issues. For that, backups, retention policies, restore testing, and external replicas or offsite copies are still essential.

Ceph offers great flexibility, but that flexibility requires thoughtful design. Replica 3 is the conservative answer for critical and active workloads. Erasure Coding is a powerful tool to increase usable capacity when data patterns support it. The right choice isn’t picking a single winning technology but assigning each approach to where it delivers the most value.


Frequently Asked Questions

What is the difference between Replica 3 and Erasure Coding in Ceph?

Replica 3 stores three complete copies of data. Erasure Coding divides data into fragments and adds parity fragments for reconstruction in case of failure. Replica 3 consumes more capacity but often offers lower latency and simpler management.

What does a 4+2 profile mean in Erasure Coding?

It means each object is split into four data fragments and two parity fragments. It can tolerate losing two fragments and provides better capacity efficiency than Replica 3.

Can I use Erasure Coding for virtual machines?

Yes, with the proper configuration: enable allow_ec_overwrites on the erasure-coded pool and keep image metadata in a replicated pool while the erasure-coded pool serves as the data pool. However, it’s generally not recommended for critical VMs, databases, or workloads with many small, write-intensive operations.

When is it advisable to use Erasure Coding?

It makes sense for large files, cold data, document repositories, object storage, secondary backups, or content that changes infrequently and takes up a lot of space. For transactional workloads or those very sensitive to latency, Replica 3 tends to be the safer choice.
