The sector doubled down on investment in backups, deduplication, and retention after the wave of ransomware attacks of recent years. But when “D-day” arrives, the same failure repeats with alarming frequency: companies do have copies, but cannot restore them in time. The RTO (Recovery Time Objective) jumps from hours to days, and the operational and financial impact soars.
In conversations with this outlet, several specialists point to the same blind spot: backup repositories and appliances are optimized for ingest and storage (compression, deduplication, erasure coding), not for serving hundreds of concurrent restores with intensive rehydration and I/O spikes. The symptom is familiar: overwhelmed CPUs, disks at 100%, saturated IOPS, and endless queues. So is the diagnosis: the architecture is designed for writing, not for large-scale reading.
“The most expensive day in backup is the day you restore. If you rely on a vault finely tuned for deduplication but not for concurrent reads, you’ve bought storage, not business continuity,” summarizes David Carrero, co-founder of Stackscale (Grupo Aire), a European provider of private cloud and bare-metal infrastructure with experience in mission-critical environments and disaster recovery.
The “90:1” trap: when the deduplication ratio masks the real RTO cost
Small-block deduplication and aggressive compression work: they reduce capex, stretch data center capacity, and help meet retention policies. The problem arises during rehydration. Rebuilding a couple of terabytes of “logical” data can require hundreds of thousands of scattered reads. That pattern, random and massive, is the opposite of sequential backup ingest and multiplies the cost in latency and IOPS. And if encryption or inline compression is involved, the CPU itself becomes a bottleneck.
“Every extreme deduplication point is an I/O promise on D-day. The metric that matters isn’t the ‘ratio’, it’s how many TB/h you can deliver rehydrating with 100, 200, or 400 VMs in parallel,” warns Carrero.
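To get a feel for that read amplification, a rough back-of-the-envelope sketch helps; the chunk sizes below are assumptions for illustration, not figures from any particular product:

```python
# Rough sketch (assumed chunk sizes): how many scattered reads rehydrating
# a couple of terabytes of logical data can imply.

def rehydration_reads(logical_tb: float, avg_chunk_kb: int) -> int:
    """Number of chunks that must be fetched to rebuild `logical_tb` of logical data."""
    return int(logical_tb * 1024**3 / avg_chunk_kb)  # 1 TB = 1024**3 KB

for chunk_kb in (4096, 1024, 128):  # 4 MB, 1 MB and 128 KB average chunks
    print(f"2 TB with {chunk_kb} KB chunks -> {rehydration_reads(2, chunk_kb):,} random reads")
# 2 TB with 4096 KB chunks -> 524,288 random reads
# 2 TB with 1024 KB chunks -> 2,097,152 random reads
# 2 TB with 128 KB chunks -> 16,777,216 random reads
```

The smaller the average chunk, the more random reads the repository has to serve on D-day.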
The overlooked metric: TB/h delivered during restores (Recovery Performance Index)
Industry sources recommend institutionalizing a specific KPI, the Recovery Performance Index (RPI): the actual TB/h that the platform can serve with N simultaneous restores, in the real topology of the company (not laboratory conditions). This value, and not deduplication savings, should guide both purchasing decisions and continuity runbooks.
An illustrative example: restoring 50 TB with an effective throughput of 600 MB/s (already accounting for rehydration, queues, and latency) takes approximately 25 hours just to move data, not counting hypervisor rescans, massive boots (boot storms), application validations, or malware cleanup. With 100 TB, the figure doubles.
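The arithmetic is easy to reproduce. The sketch below uses only the illustrative figures above (50–100 TB at an effective 600 MB/s):

```python
# Sketch of the time needed just to move the data at a given effective
# restore throughput (rehydration, queues and latency already discounted).

def restore_hours(data_tb: float, effective_mb_per_s: float) -> float:
    """Hours to stream `data_tb` at a sustained effective throughput."""
    return data_tb * 1024**2 / effective_mb_per_s / 3600  # 1 TB = 1024**2 MB

for tb in (50, 100):
    print(f"{tb} TB at 600 MB/s -> ~{restore_hours(tb, 600):.0f} h of pure data movement")
# 50 TB at 600 MB/s -> ~24 h of pure data movement
# 100 TB at 600 MB/s -> ~49 h of pure data movement
```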
“Any serious backup program should report a quarterly RPI: TB/h delivered with X restores in parallel. That, and not the savings in TB, is what keeps the P&L healthy,” affirms Stackscale’s co-founder.
Design for recovery: separating “storing” from “serving”
The technical consensus points toward a layered redesign:
1) Hot layer for instant recovery
Repositories with flash/cache or read-optimized profiles that allow mounting VMs directly from the backup and providing “acceptable” service while migrating back to production. The cold layer (dedupe, object storage) stays outside the critical window of the first hours.
2) Replicas/snapshots for the 5–10% that cannot tolerate downtime
For workloads that cannot tolerate hours of outage: storage-array snapshots, replicas (synchronous/asynchronous), and, where appropriate, CDP with very low RPOs. It is expensive, yes, but it applies to a critical subset.
3) A restore plan, not just “testing a VM”
Orchestrating 200 VMs involves AD/DNS/PKI, queues, licenses, and application dependencies. A versioned runbook is necessary: startup order, temporary networks, verification criteria, and responsible personnel.
4) Concurrency and network provisioning
Sufficient backup proxies, QoS so no single job hogs the pipe, and 10/25/40/100 GbE depending on scale. Restoring in batches with storage affinity reduces contention (see the sketch below).
5) Security: immutability and “clean room”
Immutable copy (WORM/Object Lock, isolated vaults) and restoration in a clean environment with malware scanning before deploying to production. MFA and strict RBAC in the backup platform complete the circle.
“Always two questions: Can you bring back clean data in hours? And can you demonstrate that you don’t re-inject the malware? Without immutability and clean room, one of these two will fail,” notes Carrero.
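What “batches with storage affinity” from point 4 can look like is easier to see with a small sketch. All the names here (VMs, datastores, the per-datastore limit) are hypothetical; real orchestration would live in the backup platform, but the load-spreading logic is the same:

```python
# Minimal sketch of point 4: restore in batches with storage affinity,
# capping concurrent jobs per datastore so no backend saturates.

from collections import defaultdict
from typing import Iterator

def restore_batches(vms: list[dict], per_datastore_limit: int = 4) -> Iterator[list[dict]]:
    """Group pending VM restores by target datastore and yield batches where
    each datastore carries at most `per_datastore_limit` jobs at a time."""
    by_datastore: dict[str, list[dict]] = defaultdict(list)
    for vm in sorted(vms, key=lambda v: v["priority"]):  # vital-minimum services first
        by_datastore[vm["datastore"]].append(vm)

    while any(by_datastore.values()):
        batch = []
        for queue in by_datastore.values():
            batch.extend(queue[:per_datastore_limit])
            del queue[:per_datastore_limit]
        yield batch  # launch, verify, then move on to the next batch

# Hypothetical example: AD/DNS first, then application tiers, on two datastores.
pending = [
    {"name": "dc01",  "datastore": "ds-flash-1",  "priority": 0},
    {"name": "dns01", "datastore": "ds-flash-1",  "priority": 0},
    {"name": "app01", "datastore": "ds-hybrid-2", "priority": 1},
    {"name": "app02", "datastore": "ds-hybrid-2", "priority": 1},
]
for i, batch in enumerate(restore_batches(pending, per_datastore_limit=1), start=1):
    print(f"batch {i}: {[vm['name'] for vm in batch]}")
# batch 1: ['dc01', 'app01']
# batch 2: ['dns01', 'app02']
```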
Errors that turn an incident into a crisis
- Confusing tests: restoring a lab VM ≠ bringing up 80 services with real dependencies.
- Buying based on the datasheet: prioritizing a “90:1” deduplication ratio over TB/h of concurrent restore capacity.
- Under-sizing network and proxies: a shared 10 GbE link and CPU-starved proxies cannot handle the peaks.
- Ignoring the “vital minimum”: not all services carry the same weight for the business, yet no priorities are defined.
- Releasing unverified restores: putting restored systems straight into production without passing them through a clean room.
90-day action plan
Days 0–15
- Define the vital minimum with the business (the 20% of services that return 80% of the value) and set RTO/RPO per service.
- Measure the initial RPI with 50–100 simultaneous recoveries, including AD/DNS/PKI; record timings and lessons learned.
Days 16–45
- Introduce hot layer for instant restore.
- Scale proxies and network; orchestrate batch restorations.
- Prepare clean environment and verification procedures.
Days 46–90
- Harden immutable vaults and MFA/RBAC on the backup platform.
- Automate the runbooks (infrastructure as code for networks and boot sequences).
- Repeat large-scale testing; report RPI and time to “useful service” to management (a minimal reporting sketch follows below).
“If in 90 days you don’t show management a chart with RPI rising and RTO falling, you’re still managing storage, not business continuity,” summarizes the expert.
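That chart does not need much tooling. A minimal sketch of the quarterly RPI report the article proposes, with drill figures invented purely for illustration, could look like this:

```python
# Sketch of a quarterly RPI report: TB/h actually delivered per restore drill,
# with the concurrency used. The figures are invented for illustration.

from dataclasses import dataclass

@dataclass
class RestoreDrill:
    quarter: str
    tb_restored: float        # logical TB brought back to "useful service"
    hours_elapsed: float      # wall-clock duration of the drill
    concurrent_restores: int  # simultaneous restore streams

    @property
    def rpi(self) -> float:
        """Recovery Performance Index: TB/h delivered during the drill."""
        return self.tb_restored / self.hours_elapsed

drills = [
    RestoreDrill("Q1", tb_restored=50, hours_elapsed=26.0, concurrent_restores=100),
    RestoreDrill("Q2", tb_restored=50, hours_elapsed=14.5, concurrent_restores=100),
]
for d in drills:
    print(f"{d.quarter}: RPI {d.rpi:.1f} TB/h with {d.concurrent_restores} concurrent restores")
# Q1: RPI 1.9 TB/h with 100 concurrent restores
# Q2: RPI 3.4 TB/h with 100 concurrent restores
```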
Five incisive questions for providers
- TB/h of restore with 200 VMs rehydrating concurrently in the client’s real topology.
- Latency and stream limits when mounting VMs via instant restore.
- The real impact of Object Lock and encryption, once enabled, on read performance.
- Proxy and network requirements to sustain that performance.
- SLA for “D-day” support outside working hours.
Cost and false savings
A hot layer costs more per TB. But shutting the business down for two days because of a foreseeable bottleneck costs money too. The relevant TCO isn’t €/TB saved, but €/hour of downtime avoided. When a fast layer cuts a day of outage, it often pays for itself.
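That break-even is easy to sanity-check. The prices in the sketch below are placeholders, not market figures; the point is the ratio, not the numbers:

```python
# Sketch of the €/TB vs. €/hour-of-downtime trade-off (placeholder prices).

FAST_LAYER_CAPACITY_TB = 100
FAST_LAYER_PREMIUM_EUR_PER_TB = 300   # assumed extra cost vs. a dedupe-only tier
DOWNTIME_COST_EUR_PER_HOUR = 20_000   # assumed business impact of one hour of outage

extra_spend = FAST_LAYER_CAPACITY_TB * FAST_LAYER_PREMIUM_EUR_PER_TB
break_even_hours = extra_spend / DOWNTIME_COST_EUR_PER_HOUR

print(f"Extra spend on the hot layer: {extra_spend:,} EUR")
print(f"It pays for itself after avoiding {break_even_hours:.1f} h of downtime")
# Extra spend on the hot layer: 30,000 EUR
# It pays for itself after avoiding 1.5 h of downtime
```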
“The dedupe slide is shown before buying. The RTO clock is presented to the crisis committee. Guess which one determines your future,” quips Carrero.
Conclusion
Backup is no longer just a storage project: it is a continuity capability that should be measured in time and service. Most recent disasters are not due to lack of copies but to inability to serve them at the speed the business demands. Changing the mindset—buying for restore, measuring RPI, separating hot layer/retention/immutability, and practicing massive restores—marks the line between a well-resolved incident and a crisis.

