In this post we compare the Imanis Data HBase Connector with existing and proposed data protection solutions for Apache HBase.
Imanis Data HBase connector design
As with all Imanis Data connectors, the HBase connector automatically provides all the benefits of the Imanis Data platform: metadata catalog; Imanis Data FastFind, for rapid object discovery; storage optimization, to reduce capital and operating expenses; a scale-out architecture, to handle any size of production workload; and more. Review all the platform features here.
A few different capabilities, explained below, uniquely separate the Imanis Data connector design from that of other systems.
The new HBase connector design follows the same principles that govern other connectors and has an agentless architecture. That means customers need not change any configuration or install any software on their production cluster.
Imanis Data HBase backup
The Imanis Data HBase connector leverages HBase snapshots to take backups. HBase snapshots guarantee data consistency by flushing all in-memory data, committing it to persistent storage.
The process is as follows:
1. Take a full backup: Create an HBase table snapshot and copy all the files contained in the snapshot to Imanis Data. Do this only once for any backup workflow.
2. Take an incremental backup by creating a new HBase table snapshot.
3. Query the Imanis Data catalog and compare the files contained in the new snapshot with the files that were present in the previous iteration of the backup.
4. Copy the incremental data to Imanis Data.
5. After the backup is completed, immediately delete snapshots taken on the production cluster.
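The incremental steps above amount to a catalog diff: compare the file list in the new snapshot against what was copied before, and move only the delta. A minimal, hypothetical Python sketch (the `catalog` set stands in for the Imanis Data metadata catalog; the real system is far more involved):

```python
def incremental_backup(snapshot_files, catalog):
    """Return the snapshot files not yet present in the backup catalog.

    snapshot_files: file names contained in the new HBase snapshot
    catalog: set of file names copied during earlier backup iterations
    """
    new_files = [f for f in snapshot_files if f not in catalog]
    catalog.update(new_files)   # record them so the next run skips them
    return new_files            # in practice, these are copied to Imanis Data

catalog = set()
# The first run behaves like a full backup: every file is new.
print(incremental_backup(["f1", "f2", "f3"], catalog))        # ['f1', 'f2', 'f3']
# A later snapshot shares f1..f3, so only the new file is copied.
print(incremental_backup(["f1", "f2", "f3", "f4"], catalog))  # ['f4']
```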
It’s possible that the incremental backups capture extra data because of compaction on the production cluster. But thanks to Imanis Data’s data-aware de-duplication, that extra captured data does not consume extra space on the secondary cluster.
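One way to picture why compacted duplicates cost nothing extra: split content into chunks, hash each chunk, and store only chunks whose hashes have not been seen before. This is a simplified illustration of chunk-level de-duplication, not the actual Imanis Data implementation:

```python
import hashlib

def store(chunk_store, data, chunk_size=4):
    """Store only chunks whose content hash is not already present.

    Returns the number of genuinely new bytes retained.
    """
    added = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:
            chunk_store[digest] = chunk
            added += len(chunk)
    return added

chunk_store = {}
# Original HFile content: three distinct 4-byte chunks.
store(chunk_store, b"AAAABBBBCCCC")
# A compacted file repeats the same rows plus some fresh data; only the
# fresh chunk consumes additional space on the secondary cluster.
new_bytes = store(chunk_store, b"AAAABBBBCCCCDDDD")
print(new_bytes)   # 4 -> only the new 4-byte chunk is kept
```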
Imanis Data backup example
Below is a tabulation for a backup job during which we took a backup snapshot every night. We assumed that 20GB of data was added to the cluster every day. We also assumed duplicate data within the table and that Imanis Data would achieve a 5x reduction in data size.
Notice that on Day 4, files f3, f4, and f5 were combined because of compaction and a new file f6 was created by the daily addition of data. The incremental backup on Day 4 copied the new files as well as the additional data. 80GB of data were copied instead of the daily 20GB. But once de-duplication runs on the Imanis Data platform, all duplicate data will be eliminated and only unique chunks of data will be retained.
Additional space savings
HBase works on top of the Hadoop distributed file system (HDFS). Typically, the HDFS replication factor is set to three on production clusters, so a 100GB data set on the production cluster will occupy 300GB of disk space.
Example 1 highlights the storage efficiency of the Imanis Data platform. When data is backed up to the Imanis Data system, only unique data is saved after our data-aware de-duplication. A data set that takes 300GB of disk space on a production cluster can end up taking just 20GB of disk space on the Imanis Data platform.
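The arithmetic behind that claim, using the 3x HDFS replication factor on the production side and the 5x de-duplication ratio assumed in the example:

```python
logical_gb = 100          # logical size of the data set
hdfs_replication = 3      # typical HDFS replication factor in production
dedup_ratio = 5           # assumed Imanis Data reduction (from Example 1)

production_disk_gb = logical_gb * hdfs_replication
imanis_disk_gb = logical_gb / dedup_ratio

print(production_disk_gb)  # 300 GB of disk on the production cluster
print(imanis_disk_gb)      # 20.0 GB of disk on the Imanis Data platform
```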
Our backups are incremental forever and the platform also provides a restore-centric design. Our architecture optimizes a company’s recovery time objective (RTO). Unlike traditional backup-and-recovery methods that take periodic full backups and apply incrementals to them, every incremental backup image is a fully recoverable and independent snapshot of the production data. This allows Imanis Data to deliver a single-step restore process.
The following scenario and its results (shown in Example 2) illustrate the principle.
A customer creates a new backup job, which takes a nightly backup of 10 critical tables. Backup images are maintained for 90 days. Assume an original 1 terabyte data set and 50 gigabytes of daily changes.
After 80 days, the customer has one full backup and 79 incremental backups. On day 81, user error causes data corruption in some of the tables—the customer needs to recover all the data immediately!
A traditional recovery approach would recover the first full backup and then start applying changes from each of the 79 incremental backups. However, Imanis Data maintains a virtualized copy of the production data set in its restore point, so our restores are speedy and involve moving just a fraction of the original data. In this example, Imanis Data’s restore algorithms will restore from the virtualized restore point, thereby restoring only 1.2TB of data.
Data recovered by other solutions: 1TB + (50GB × 79) = 4.95TB
Data recovered by Imanis Data: 1.2TB (exactly the size of data on Day 80)
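The comparison works out as follows:

```python
full_tb = 1.0            # original full backup on day 1
daily_change_gb = 50     # daily change rate
incrementals = 79        # incremental backups accumulated by day 80

# Traditional full-plus-incrementals recovery moves everything ever captured.
traditional_tb = full_tb + incrementals * daily_change_gb / 1000

# Single-step restore moves only the virtualized restore point (from the example).
imanis_tb = 1.2

print(round(traditional_tb, 2))  # 4.95 TB moved by the traditional approach
print(imanis_tb)                 # 1.2 TB moved by Imanis Data
```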
With the Imanis Data HBase connector, customers can select backup-and-restore data sets at the namespace or table level. A customer can select a complete namespace or a set of tables when a new backup workflow is created. But during the restore, the customer can select any individual table or set of tables to be recovered to the same HBase cluster or to an alternate HBase cluster in the data center.
Even though Imanis Data uses an incremental forever approach for backup, all the restore points are completely virtualized on the Imanis Data cluster. That way, a customer need not restore the first full backup followed by incrementals. Instead, the restore point is instantly available to complete the restore.
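A toy model of how a virtualized restore point can make every incremental backup independently restorable (an assumed design for illustration, not the actual Imanis Data internals): each restore point records the complete file manifest for its point in time, while every file is physically stored only once.

```python
file_store = {}       # file name -> content, stored only once
restore_points = {}   # restore point label -> complete file manifest

def backup(label, files):
    """Incremental backup: store new files only, but record a full manifest."""
    for name, content in files.items():
        file_store.setdefault(name, content)
    restore_points[label] = sorted(files)

def restore(label):
    """Single-step restore: read one manifest, return the full data set."""
    return {name: file_store[name] for name in restore_points[label]}

backup("day1", {"f1": "a", "f2": "b"})
backup("day2", {"f1": "a", "f2": "b", "f3": "c"})   # only f3 is newly stored

# Restoring day2 requires no replay of day1 plus incrementals:
print(restore("day2"))   # {'f1': 'a', 'f2': 'b', 'f3': 'c'}
```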
We compared the Imanis Data HBase backup strategy against two other HBase backup offerings.
Backups with the Write-Ahead Log (WAL)
The WAL technique uses the HBase snapshot capability to take a full backup and then uses the WAL to take incremental backups. In HBase, all transactions are first written to the WAL before they are committed to actual HFiles. A WAL is maintained for each region server. All that makes for a workable scheme, but one with certain limitations.
One such limitation is that the restore procedure follows the traditional approach of full and incremental restores, so it suffers from the same storage bloat problems discussed above. Moreover, the captured WAL files must be converted to HFiles before they can be restored. As a result, RTOs are significantly higher with this procedure.
Another limitation is that the incremental backup relies on copying the WAL, and because the WAL is shared by all regions hosting various tables on a single region server, incremental backups include data for all tables in the deployment. Even if a customer selects just a single table for backup, the changes for all tables are captured, which extends the backup window and includes unnecessary data. That extra data has to be purged at the receiving cluster so that only the relevant data set is stored. And when two tables have to be backed up at different frequencies, say, once every hour vs. once every 12 hours, the WAL has to be copied multiple times.
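A simplified model makes the cost concrete: a region server's WAL interleaves edits for every table it hosts, so capturing it for one table drags along everything else, and the receiving side has to filter out what it never asked for.

```python
# One WAL per region server, shared by all of its regions: entries for
# different tables interleave in a single log (simplified model; the table
# and row names here are made up for illustration).
wal = [
    ("orders",  "put row1"),
    ("users",   "put row7"),
    ("orders",  "put row2"),
    ("metrics", "put row9"),
]

# Backing up just "orders" via the WAL still means capturing the whole log...
captured = list(wal)

# ...and the receiving cluster must discard the entries for other tables.
relevant = [entry for entry in captured if entry[0] == "orders"]
discarded = len(captured) - len(relevant)
print(len(relevant), discarded)   # 2 relevant entries, 2 copied needlessly
```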
We think these are two major disadvantages of a WAL-based backup.
Backup snapshot management
Simple snapshot management leverages HBase snapshots for data protection. A backup utility takes periodic snapshots according to a predefined policy. The snapshots are saved locally on the HBase cluster, but to assist in disaster recovery, they can be copied to different locations in the data center or to cloud storage environments like Amazon S3 or Azure Blob storage. Recovery involves copying the files of the snapshot to a temporary location on the HBase cluster and using HBase bulk load to recover the lost data.
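The policy such a utility applies is essentially snapshot rotation: take a snapshot per period, keep the newest N, and delete the rest. A minimal sketch, with a hypothetical date-based snapshot naming scheme:

```python
from datetime import date, timedelta

def apply_retention(snapshots, keep=7):
    """Keep only the newest `keep` snapshots; return (kept, expired)."""
    ordered = sorted(snapshots, reverse=True)   # ISO dates sort newest-first
    return ordered[:keep], ordered[keep:]

# Ten nightly snapshots, named by date (hypothetical naming scheme).
snaps = [f"tbl-{date(2018, 1, 1) + timedelta(days=i)}" for i in range(10)]

kept, expired = apply_retention(snaps, keep=7)
print(len(kept), len(expired))   # 7 3 -> the three oldest would be deleted
```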
That procedure is used by backup utilities provided with some HBase distributions, but it is too simple for today’s complex data protection needs.
Imanis Data provides a highly scalable solution to protect against accidental data loss in an HBase environment. It encompasses key functional attributes such as an agentless model, incremental-forever backup, and extremely rapid recovery aided by our metadata catalog. Watch our product video and review our architecture white paper to get a better understanding of how we bring technical and business value to the world of big data management.