Four Reasons Your Current Hadoop Backup and Recovery Solution is Falling Short

The rapid adoption of Hadoop within the enterprise has resulted in the deployment of a number of haphazard, quick-fix Hadoop backup and recovery mechanisms. These primitive solutions usually come bundled with the Hadoop distributions themselves, but they are also cobbled together by devops teams within organizations. While they may seem to work on the surface, they often put your data at significant risk, particularly as your systems grow bigger and more complex. In the event of a disaster, any downtime or data loss resulting from a failed recovery will severely impact your business in terms of reputation, costs, and/or time-to-market.


Digging deeper, the inadequacies of these solutions are better understood by examining the underlying misconceptions regarding Hadoop from a data protection perspective.


1. Relying on file system replicas for Hadoop backup and recovery


Replicas are a great way to protect data against hardware failures (such as one or more nodes going down, or disk drives failing). However, they do not protect your data against the more common scenarios in which user errors (for example, a DBA inadvertently dropping a Hive table) and application bugs corrupt data in the database. A delete or a corrupt write is applied to every replica, so all copies are lost or damaged together.


A large high-technology company was relying on three Hadoop replicas to protect its data. Because of a typo, a DBA accidentally deleted a 400-terabyte Hive table. Since the company had no true backup in place, it ended up recreating the data from the source, which took four weeks of elapsed time and numerous engineering resources. By the company's own estimate, the total cost of these resources and the associated downtime was $1.1M.


2. Using HDFS snapshots


HDFS provides snapshot capabilities that create point-in-time copies of specific files and directories.  While this may seem like a good data protection strategy, it has severe limitations as described below:

  • HDFS snapshots are file-level snapshots. As such, they do not work well with databases like Hive and HBase, because the associated schema definitions aren’t captured in the backups.
  • Recovering data is onerous as it requires one to manually locate the files being recovered by combing through all the snapshots, rebuild any schemas pertinent to the time of recovery, and finally recover the data files.
  • Since snapshots are stored on the same nodes as the data, a node or disk failure that takes out the data also takes out the snapshots meant to protect it.
  • Storing even a moderate number of snapshots will increase the storage requirements of the Hadoop cluster, thus limiting one’s ability to go further back in time for the purposes of data recovery.
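
To make the file-level nature of this approach concrete, the snapshot workflow can be sketched roughly as follows. This is a minimal Python sketch that only builds the command lines rather than running them. The hdfs dfsadmin -allowSnapshot, hdfs dfs -createSnapshot, and hdfs dfs -cp commands are standard HDFS CLI; the helper function names, the warehouse directory, and the snapshot name are hypothetical and chosen purely for illustration. Notice that nothing in it captures a Hive or HBase schema, and that a restore requires knowing exactly which snapshot and which file you want.

```python
from typing import List


def allow_snapshot_cmd(directory: str) -> List[str]:
    # An administrator must first mark the directory as snapshottable.
    return ["hdfs", "dfsadmin", "-allowSnapshot", directory]


def create_snapshot_cmd(directory: str, name: str) -> List[str]:
    # Creates a named point-in-time snapshot of the directory.
    return ["hdfs", "dfs", "-createSnapshot", directory, name]


def snapshot_path(directory: str, name: str, relative_file: str) -> str:
    # Snapshot contents are exposed under the read-only ".snapshot" directory.
    base = directory.rstrip("/")
    return f"{base}/.snapshot/{name}/{relative_file.lstrip('/')}"


def restore_file_cmd(directory: str, name: str, relative_file: str) -> List[str]:
    # Recovery is a plain file copy out of the chosen snapshot; any schema
    # that matches the restored files has to be rebuilt separately.
    source = snapshot_path(directory, name, relative_file)
    target = f"{directory.rstrip('/')}/{relative_file.lstrip('/')}"
    return ["hdfs", "dfs", "-cp", source, target]


if __name__ == "__main__":
    # Hypothetical Hive table directory and snapshot name.
    table_dir = "/user/hive/warehouse/sales.db/orders"
    for cmd in (
        allow_snapshot_cmd(table_dir),
        create_snapshot_cmd(table_dir, "nightly-01"),
        restore_file_cmd(table_dir, "nightly-01", "part-00000"),
    ):
        print(" ".join(cmd))
```

In practice, the manual step is finding the right snapshot name and file for every table being recovered, which is where the effort described above goes.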


3. Writing custom devops scripts for Hadoop backup and recovery


Many organizations with in-house devops teams resort to writing custom scripts for backing up their Hive and HBase databases and HDFS files. Often, several man-months are spent writing and testing these scripts to make sure they work under all scenarios. The scripts must then be updated periodically to handle larger datasets, upgrades to the Hadoop distribution, and any other non-trivial changes to the datacenter infrastructure. Like snapshots, scripts only take care of making copies of data. Recovery remains a completely manual process, as onerous and error-prone as it is with the snapshots approach. Unless they are tested regularly, scripts can also result in data loss, particularly if the devops team that wrote them is no longer around.


A retail organization had written scripts to back up its Hive and HBase databases. Although the scripts had to be run manually, failed frequently, and required regular changes, the process seemed to be working until the organization had a data loss incident. When the team tried to recover the data from the backups, they discovered that the backup script had been failing silently: the backups were being reported as successful when, in reality, they were failing. The backups failed the organization when it needed them most, resulting in data loss.
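
A silent failure of this kind typically comes from a script that ignores the exit status of the commands it runs and never checks what actually landed in the backup. The following is a minimal Python sketch of two guards against that, under stated assumptions: run_step, verify_backup, and BackupError are hypothetical names, and the manifests are assumed to be simple mappings from file path to size in bytes that the script has collected from the source and the backup target. It illustrates the idea and is not a complete backup tool.

```python
import subprocess
import sys
from typing import Dict, List


class BackupError(RuntimeError):
    """Raised when a backup step does not exit cleanly."""


def run_step(argv: List[str]) -> str:
    # Run one backup command and fail loudly on a non-zero exit status
    # instead of carrying on and reporting success.
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise BackupError(
            f"{argv[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


def verify_backup(source: Dict[str, int], backup: Dict[str, int]) -> List[str]:
    # Compare path -> size manifests; an empty list means every source file
    # is present in the backup with a matching size.
    problems = []
    for path, size in sorted(source.items()):
        if path not in backup:
            problems.append(f"missing: {path}")
        elif backup[path] != size:
            problems.append(f"size mismatch: {path}")
    return problems


if __name__ == "__main__":
    # Simulate a failing backup command with a child Python process.
    try:
        run_step([sys.executable, "-c", "import sys; sys.exit(3)"])
    except BackupError as err:
        print("backup step failed:", err)
    print(verify_backup({"/data/a": 10, "/data/b": 20}, {"/data/a": 10}))
```

Even with checks like these, only a regularly exercised test restore confirms that the backups are actually usable.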


4. Using backup tools from commercial Hadoop distributions


Commercial Hadoop distributions come packaged with basic backup tools such as Cloudera’s BDR. These tools provide very primitive backup capabilities and thus don’t usually meet an organization’s recovery point objective (RPO) and recovery time objective (RTO). They primarily provide a minimal user interface on top of HDFS snapshots, so all of the limitations of HDFS snapshots described above apply here as well. Since these tools do not provide any practically usable recovery mechanisms, recovery continues to be manual and error-prone.
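
Whether a given tool can meet an RTO is partly a matter of arithmetic: restore time is at least the data size divided by the sustained restore throughput. The sketch below is a rough, hypothetical estimate in Python. The function names are illustrative, the throughput figure is an assumed input rather than a measurement of any product, and the estimate ignores the manual steps (locating files, rebuilding schemas) that dominate the approaches described above.

```python
def restore_hours(size_tb: float, throughput_mb_per_s: float) -> float:
    # Lower bound on restore time, using decimal units
    # (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    size_bytes = size_tb * 10**12
    bytes_per_second = throughput_mb_per_s * 10**6
    return size_bytes / bytes_per_second / 3600


def meets_rto(size_tb: float, throughput_mb_per_s: float, rto_hours: float) -> bool:
    # True if the pure data-transfer time alone fits inside the RTO.
    return restore_hours(size_tb, throughput_mb_per_s) <= rto_hours


if __name__ == "__main__":
    # A 400 TB table at an assumed sustained 1000 MB/s.
    print(round(restore_hours(400, 1000), 1))  # 111.1
    print(meets_rto(400, 1000, rto_hours=24))  # False
```

For a table the size of the 400-terabyte one in the first example, even an assumed 1000 MB/s of sustained throughput means more than four days of pure transfer time, which is why restore scalability matters as much as backup speed.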


A Solid Hadoop Backup and Recovery Strategy


As Hadoop-based applications and databases become more critical, organizations need to take a more serious look at their recovery strategy for Hadoop. A well-thought-out Hadoop backup and recovery strategy is needed to ensure that data can be recovered reliably and quickly, and that backup operations do not consume too many engineering or devops resources.


A modern Hadoop backup and recovery solution must be one that:

  • Completely eliminates the need for scripting
  • Is fully automated and does not need dedicated resources
  • Requires very little Hadoop expertise
  • Is extremely reliable
  • Scales well enough to meet recovery time objectives
  • Integrates with cloud storage to reduce costs
  • Preserves multiple point-in-time copies of data
  • Is designed with recovery in mind
  • Is data aware and able to de-duplicate big data formats

Review our datasheet to get deeper insights into our capabilities for Hadoop backup and recovery.
