Why Backing Up VMs of Your Hadoop Infrastructure is Misguided

The other day a customer asked us why they shouldn’t just back up Hadoop by backing up the entire virtualized infrastructure that houses their Hadoop environment. This approach certainly has the benefit of simplicity, but it comes with its own set of issues.


  • The amount of data backed up will be three times higher (assuming three replicas in your Hadoop infrastructure) than our approach since Imanis Data backs up only one replica. This puts an unnecessary burden on your network and storage.
  • The backups will not be application-aware. In other words, there is no context of Hive or HBase tables or HDFS files or directories in your backup. This has two major implications. The first is that your backup AND recovery will be all or nothing since you are backing up the entire VM: you cannot back up or recover individual tables or databases. The second is that your de-duplication will not be as effective since there is no data awareness, which in turn reduces storage savings when you are dealing with hundreds of terabytes or more of data.
  • Because Hadoop is a distributed system, backing up individual VMs does not guarantee a consistent backup. If you have a 100-node Hadoop cluster, VM1 may get backed up at time T1, VM2 at time T2, and so on. During this process, your Hadoop infrastructure could change significantly, with new tables being created or existing ones deleted. As a result, your VM backups may be inconsistent with one another.
  • Recovery will not be granular down to the table or partition level, which may lengthen your RTO depending on the size and importance of your data sets.
  • Backing up VMs involves a lot of scripting, including scheduling snapshots, synchronizing snapshots across VMs and the applications, and copying snapshot data (the backup) to some secondary storage system. This entire process is error-prone and can result in failed or unrecoverable backups.
  • With VM-level backups, every Hadoop backup will be a full backup, and there is no concept of incremental-forever. Depending on the frequency of backups, this can have a significant impact on storage and bandwidth costs.
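
To put rough numbers on the replica and full-backup points above, here is a back-of-the-envelope sketch in Python. It is illustrative arithmetic only, not a description of how Imanis Data or any VM backup product actually works: the function names, the 100 TB data set, the 2% daily change rate, and the seven-day window are all hypothetical assumptions, and de-duplication and compression are ignored on both sides.

```python
def vm_level_backup_tb(logical_tb, replicas=3, num_backups=1):
    """Data copied when every VM disk is backed up in full each time.

    Assumes each backup captures all replicas (hypothetical model).
    """
    return logical_tb * replicas * num_backups


def single_replica_incremental_tb(logical_tb, change_rate, num_backups=1):
    """Data copied with one full backup of a single replica,
    followed by incrementals that copy only changed data.

    change_rate is the assumed fraction of data changed between backups.
    """
    if num_backups < 1:
        return 0.0
    incrementals = num_backups - 1
    return logical_tb + logical_tb * change_rate * incrementals


if __name__ == "__main__":
    # Hypothetical inputs: 100 TB of logical data, 3 replicas,
    # one backup per day for a week, 2% of data changing per day.
    vm_total = vm_level_backup_tb(100, replicas=3, num_backups=7)
    app_total = single_replica_incremental_tb(100, change_rate=0.02, num_backups=7)
    print(f"VM-level full backups:         {vm_total:.0f} TB copied")
    print(f"Single replica + incrementals: {app_total:.0f} TB copied")
```

Under these assumed inputs, the VM-level approach copies about 2,100 TB over the week, compared with about 112 TB for a single-replica, incremental-forever approach. Real figures depend on your replication factor, change rate, and backup schedule.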

These are just a few considerations to take into account when thinking about protecting your virtualized Hadoop infrastructure from accidental data loss. Imanis Data provides a granular, application- and content-aware approach to Hadoop backups with the ability to support any RTO/RPO requirement. Our ability to support both bare-metal and virtualized infrastructures across on-prem and cloud environments makes Imanis Data a compelling solution for companies that are looking to protect their Hadoop (and Cassandra, Couchbase, and MongoDB) environments. Take a look and let us know how we can help.
