Commvault has introduced Clumio for Apache Iceberg on AWS, which the company describes as the first “Iceberg-aware” solution with an air-gapped copy to protect data lakehouses used for AI and large-scale analytics. The goal is to close resilience gaps that expose organizations to data loss, ransomware, and compliance risks when they rely solely on native snapshots or on copies that do not understand Apache Iceberg semantics.
Why an “Iceberg-aware” copy is needed
Apache Iceberg layers transactional tables (metadata, manifests, snapshots, and delete files) over object storage (e.g., Amazon S3) to enable atomic commits and consistent reads, time travel, and schema evolution. A backup that does not understand this structure forces teams to reconnect tables manually during restore, with the risk of inconsistencies and extended downtime. In addition, native snapshots are often stored within the same account and control domain as the source, so there is no air-gapped copy to fall back on after an account breach or malicious deletion.
Clumio for Apache Iceberg addresses both issues:
- Transactional consistency: captures the full state of tables (metadata + data) and supports point-in-time recovery, with snapshot-based, cross-region, cross-account, or in-place restores.
- Air-gapped, immutable copy: stored in an isolated environment designed to withstand ransomware, credential compromises, and accidental or malicious deletions.
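To make the idea of point-in-time recovery concrete, here is a minimal sketch of how a recovery point could be chosen from a table's snapshot history. It is not Clumio's implementation: the `Snapshot` class and the `snapshot_as_of` function are hypothetical, and only the two fields (a snapshot ID and a commit timestamp in milliseconds) are modeled on what Iceberg table metadata records for each snapshot.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical, simplified stand-in for one entry in a table's snapshot log.
    snapshot_id: int
    timestamp_ms: int  # commit time in epoch milliseconds


def snapshot_as_of(snapshots: Sequence[Snapshot], target_ms: int) -> Optional[Snapshot]:
    """Return the latest snapshot committed at or before target_ms, or None."""
    eligible = [s for s in snapshots if s.timestamp_ms <= target_ms]
    return max(eligible, key=lambda s: s.timestamp_ms, default=None)


# Illustrative history with made-up IDs and timestamps.
history = [
    Snapshot(snapshot_id=101, timestamp_ms=1_000),
    Snapshot(snapshot_id=102, timestamp_ms=2_000),
    Snapshot(snapshot_id=103, timestamp_ms=3_000),
]

chosen = snapshot_as_of(history, target_ms=2_500)
print(chosen.snapshot_id if chosen else "no snapshot before target")
```

Restoring "as of" a moment means restoring the metadata and data files that this one snapshot references, rather than whatever objects happen to sit in the bucket at that time.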
Key features
- Iceberg-aware backups: understands manifests, position and equality deletes, and the metastore, so tables can be restored without manual rewiring; this reduces errors and MTTR in data lakehouses.
- Isolation and immutability: copies are kept separate from the source account, with unlimited snapshot retention for compliance and governance and no impact on the performance of the active lake.
- Storage efficiency: after the initial backup, only changes are captured (incremental approach), which shortens backup windows and reduces TCO.
- Availability on AWS Marketplace: supports self-managed tables (AWS Glue Data Catalog) and managed tables (Amazon S3 Tables).
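The incremental approach mentioned above can be illustrated with a small sketch. It assumes, purely for illustration, that each backup records the set of file paths a table snapshot references; the function name `incremental_plan` and the example paths are hypothetical, and the product's actual change-detection mechanism is not described in the announcement.

```python
from typing import Dict, Set


def incremental_plan(previous_files: Set[str], current_files: Set[str]) -> Dict[str, Set[str]]:
    """Compare two file sets and decide what an incremental backup must copy."""
    return {
        # Files referenced now but absent from the previous backup.
        "to_copy": current_files - previous_files,
        # Files already protected; no need to transfer them again.
        "unchanged": current_files & previous_files,
        # Files the current snapshot no longer references.
        "no_longer_referenced": previous_files - current_files,
    }


# Made-up paths for a table before and after a compaction plus a new append.
previous = {"data/a.parquet", "data/b.parquet", "metadata/m1.avro"}
current = {"data/b.parquet", "data/c.parquet", "metadata/m2.avro"}

plan = incremental_plan(previous, current)
print(sorted(plan["to_copy"]))
```

Because Iceberg data files are immutable once written, a path that was already backed up does not need to be copied again, which is what makes this kind of set comparison a reasonable mental model.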
Market context
Adoption of Iceberg has surged, with public references including Netflix, Apple, and Airbnb, and industry surveys position the data lakehouse as the dominant analytics architecture over the next three years. However, many organizations have not extended resilience to the table layer: they protect S3 or the metastore, but do not ensure a coherent recovery of the entire dataset.
For AI and analytics, where datasets are critical assets, this gap presents a material risk: long downtimes and corrupted data can breach SLAs and compliance requirements.
How it fits into Commvault’s AWS strategy
Clumio for Iceberg complements Commvault's existing resilience capabilities for Amazon S3 and DynamoDB, aiming to cover the entire data pipeline on AWS: from object and NoSQL storage to the transactional table layer of the lakehouse. Commvault's key message is that no other vendor currently offers this combination of Iceberg awareness, air-gapped copies, and large-scale recovery at this depth.
Opinions
- Commvault: “The data fueling AI and analytics is the most valuable and often the most exposed; for the first time, it can be protected with an automated, isolated solution,” says Woon Jung (CTO, Cloud Native).
- IDC: according to Archana Venkatraman, “Iceberg-aware” protection with transactional recovery and air-gap “has become imperative” as lakehouse expansion for AI accelerates.
Considerations for data and security teams
- Threat model: beyond ransomware, consider account breaches and accidental or malicious deletions; the air-gapped copy is designed to address these risks.
- RPO/RTO: transactional capture and restore options (cross-account and cross-region) support low RPO and predictable RTO.
- Consistency: verify that the restored state (metastore + manifests + delete files) reflects a coherent point in time for queries and pipelines.
- Costs: the incremental approach reduces capacity and bandwidth compared with repeated full backups; weigh retention costs against compliance requirements.
- Operations: integrate with catalogs, orchestrators, and jobs that depend on the tables; plan regular restore testing.
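As part of restore testing, the consistency point above can be checked with a simple referential test: every manifest, data file, and delete file that the restored table references should actually exist in the restored storage. The sketch below assumes the manifest contents have already been read into a plain dictionary; the `find_missing` function and all paths are hypothetical and do not correspond to any Clumio or Iceberg API.

```python
from typing import Dict, List, Set


def find_missing(manifests: Dict[str, List[str]], restored_objects: Set[str]) -> List[str]:
    """Return referenced paths (manifests and their files) absent from the restore."""
    referenced = set(manifests)  # the manifest files themselves
    for files in manifests.values():
        referenced.update(files)  # data and delete files listed in each manifest
    return sorted(referenced - restored_objects)


# Made-up example: one delete file did not make it into the restored bucket.
manifests = {
    "metadata/m1.avro": ["data/a.parquet", "data/b.parquet"],
    "metadata/m2.avro": ["data/c.parquet", "deletes/d1.parquet"],
}
restored = {
    "metadata/m1.avro",
    "metadata/m2.avro",
    "data/a.parquet",
    "data/b.parquet",
    "data/c.parquet",
}

missing = find_missing(manifests, restored)
print(missing if missing else "restored state is referentially complete")
```

An empty result does not prove the data is correct, but a non-empty one is a clear signal that queries against the restored table would fail or return wrong rows, for example by ignoring a missing delete file.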
Availability
Clumio for Apache Iceberg on AWS is generally available (GA) in AWS Marketplace, with support for AWS Glue Data Catalog and Amazon S3 Tables. Commvault will expand the offering at SHIFT 2025 (November 11–12, NYC; virtual version on November 19).
Summary
The announcement raises the bar for cyber resilience in AI lakehouses: from backing up “files on S3” to protecting Iceberg tables with air-gapped, immutable copies and transactional recovery. For organizations relying on models and analytics with demanding SLAs, it is the difference between rapid, coherent recovery and days of manual rebuilding with a risk of inconsistencies.
via: Commvault