The rapid adoption of Hadoop within the enterprise has led to the deployment of a number of haphazard, quick-fix Hadoop backup and recovery mechanisms. These primitive solutions usually come bundled with the Hadoop distributions themselves, or are cobbled together by devops teams within organizations. While they may seem to work on the surface, they often put your data at significant risk, particularly as your systems become bigger and more complex. In the event of a disaster, any downtime or data loss caused by a failed recovery will severely impact your business in terms of reputation, cost, and time-to-market.
The inadequacies of these solutions are better understood by examining the underlying misconceptions about Hadoop from a data protection perspective.
Replicas are a great way to protect data against hardware failures (such as one or more nodes going down, or disk drives failing). However, they do not protect your data against the more common scenarios, where certain user errors (for example, the DBA inadvertently dropping a Hive table) and application bugs end up corrupting data in the database.
A large high-technology company was relying on three Hadoop replicas to protect its data. A DBA accidentally deleted a large 400 terabyte Hive table due to a typo. Since the company had no true backup in place, it ended up recreating the data from the source, which took four weeks and numerous engineering resources. By its own estimate, the total cost of these resources and the associated downtime was $1.1M.
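The difference between replication and backup can be illustrated with a small toy model. This is not Hadoop code: `ReplicatedStore` and its methods are hypothetical names invented for this sketch, and the only assumption it encodes is that a replicated system applies every operation, including a delete, to all of its replicas.

```python
import copy


class ReplicatedStore:
    """Toy model (hypothetical): every write and delete goes to all replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [dict() for _ in range(replica_count)]

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def delete(self, key):
        # A delete is just another operation, so replication copies it faithfully.
        for replica in self.replicas:
            replica.pop(key, None)

    def fail_node(self, index):
        # Simulate losing one node; the remaining replicas still hold the data.
        del self.replicas[index]

    def get(self, key):
        for replica in self.replicas:
            if key in replica:
                return replica[key]
        return None


store = ReplicatedStore()
store.put("sales_table", "table contents")

# An independent point-in-time copy, taken before the mistake.
backup = copy.deepcopy(store.replicas[0])

store.fail_node(0)
survives_hardware_failure = store.get("sales_table") is not None

store.delete("sales_table")  # the accidental drop
survives_user_error = store.get("sales_table") is not None
recoverable_from_backup = "sales_table" in backup
```

In this model the data survives the node failure but not the accidental delete, and only the independent copy still contains it.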
HDFS provides snapshot capabilities that create point-in-time copies of specific files and directories. While this may seem like a good data protection strategy, it has severe limitations as described below:
Organizations with in-house devops teams often resort to writing custom scripts to back up their Hive and HBase databases and HDFS files. Several man-months are often spent writing and testing these scripts to make sure they work under all scenarios. The scripts need to be updated periodically to handle larger datasets, upgrades to the Hadoop distribution, and any other non-trivial changes to the datacenter infrastructure. Like snapshots, scripts only take care of making copies of data. Recovery remains a completely manual process, and is as onerous and error-prone as it is with the snapshots approach. Unless tested regularly, scripts can also result in data loss, particularly if the devops team that wrote them isn’t around anymore.
A retail organization had written scripts to back up their Hive and HBase databases. Although the scripts had to be run manually, failed frequently, and required regular changes, the process seemed to be working until they had a data loss incident. When they tried to recover the data from their backups, they realized that the backup script had been failing silently: the backups were being reported as successful when, in reality, they were failing. Their backups failed them when they most needed them, resulting in data loss.
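One way to catch this kind of silent failure is to verify each backup independently rather than trusting the backup job's own status. The following is a minimal sketch, assuming for illustration that both the source and the backup are ordinary directories on a local filesystem; `verify_backup`, `file_digest`, and `demo` are hypothetical helper names. A real Hadoop deployment would need to compare data in HDFS, Hive, or HBase using whatever checksum or row-count mechanism fits that system.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Compare every file under source_dir with its copy under backup_dir.

    Returns a list of human-readable problems; an empty list means the
    backup matched the source at the time of the check.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            problems.append(f"missing: {relative}")
        elif file_digest(source_file) != file_digest(backup_file):
            problems.append(f"checksum mismatch: {relative}")
    return problems


def demo():
    """Build a tiny source tree and a deliberately incomplete backup."""
    with tempfile.TemporaryDirectory() as tmp:
        source, backup = Path(tmp, "source"), Path(tmp, "backup")
        source.mkdir()
        backup.mkdir()
        (source / "part-0").write_text("row1\nrow2\n")
        (source / "part-1").write_text("row3\n")
        # The simulated backup job copied the first file correctly
        # and silently skipped the second one.
        (backup / "part-0").write_text("row1\nrow2\n")
        return verify_backup(source, backup)


if __name__ == "__main__":
    for problem in demo():
        print(problem)
```

Running a check like this after every backup, and raising an alert whenever the list of problems is non-empty, surfaces a failing backup long before a recovery is attempted.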
Commercial Hadoop distributions come packaged with basic backup tools, such as Cloudera’s BDR. These tools provide very primitive backup capabilities and thus usually don’t meet an organization’s recovery point objective (RPO) and recovery time objective (RTO). They primarily provide a minimal user interface on top of HDFS snapshots, so all of the limitations associated with HDFS snapshots mentioned above show up here as well. Since these tools do not provide any practical recovery mechanisms, recovery continues to be manual and error-prone.
As Hadoop-based applications and databases become more critical, organizations need to take a more serious look at their recovery strategy for Hadoop. A proper, well-thought-out Hadoop backup and recovery strategy is needed to ensure that data can be recovered reliably and quickly, and that backup operations do not consume too many engineering or devops resources.
A modern Hadoop backup and recovery solution must have the following capabilities:
Review our datasheet to get deeper insights into our capabilities for Hadoop backup and recovery.