
Cloud Data Management at Exabyte Scale

Earlier this year I hosted a webinar on the key data management challenges enterprises face when running or migrating big data workloads to the cloud. These workloads are moving to the cloud in growing numbers for well-known reasons: business agility, flexibility, and lower capital expenditures, among others. Yet one question is often left unanswered: how should companies optimize data storage and data management for these new workloads?

This post highlights how the Imanis Data architecture optimizes cloud data management. Our philosophy from the very beginning has been to be compatible with any infrastructure deployment: exclusively on-premises, exclusively in the cloud, or hybrid. This flexibility lets us support a wide variety of data backup, mirroring, and recovery use cases, and it is made possible by how the Imanis Data file system handles these different storage requirements.

The Imanis Data File System

The Imanis Data file system is built on a storage tiering model: it can transparently federate data across multiple tiers of storage based on user-defined policies. The first tier is typically block storage – examples include Amazon Elastic Block Store (EBS), Azure managed disks, direct-attached drives, and SAN/NAS devices. The second tier is an object storage platform such as Amazon S3 or Azure Blob Storage. The final tier is a cold storage platform such as Amazon Glacier. A user could define a policy to keep data in the block tier for 5 days, in the object tier for 25 days, and in the cold tier for 6 months; the Imanis Data file system transparently migrates data between tiers accordingly. Data access is equally transparent, because the file system natively retrieves data from each tier using the supported access protocols.
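
To make the tiering model concrete, here is a minimal sketch, in Java, of what such a user-defined policy might look like. The class and field names are hypothetical and purely illustrative; they are not the actual Imanis Data API.

    import java.time.Duration;

    // Hypothetical sketch of a user-defined tiering policy; not the actual Imanis Data API.
    public final class TieringPolicySketch {

        // How long data stays in each tier before being migrated to the next one.
        record TieringPolicy(Duration blockTier, Duration objectTier, Duration coldTier) {}

        public static void main(String[] args) {
            // The example from the text: 5 days on block storage (e.g. EBS),
            // 25 days on object storage (e.g. S3), 6 months on cold storage (e.g. Glacier).
            TieringPolicy policy = new TieringPolicy(
                    Duration.ofDays(5), Duration.ofDays(25), Duration.ofDays(180));

            System.out.printf("block: %d days, object: %d days, cold: %d days%n",
                    policy.blockTier().toDays(), policy.objectTier().toDays(), policy.coldTier().toDays());
        }
    }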


Data Backup and Mirroring in the Cloud

Storage in Imanis Data is unbounded: we built the Imanis Data file system so that it can automatically span the different types of storage highlighted above. For example, Imanis Data can back up a directory in which two files sit on local storage while the remaining six files have been moved to S3. More importantly, where the files are stored is completely transparent to the user, because our file system presents a unified namespace across the storage tiers. If a user migrates data from local storage to cloud storage using a policy they have created, the data movement happens asynchronously, without the user needing to be aware of the underlying migration process. As soon as the migration is done, the space occupied by the original files is freed and immediately available for reuse.

Let’s take this one step further. If your company deploys workloads in a multi-cloud environment, say AWS and Azure, Imanis Data can direct specific backup workflows to S3 and others to Azure Blob Storage. Because we separate the compute and storage layers, Imanis Data can handle huge amounts of storage relative to its compute requirements. Furthermore, the Imanis Data storage optimization engine stores de-duplicated data on S3/Azure Blob Storage, reducing your storage footprint even further.


Data Recovery from the Cloud

During recovery, the Imanis Data file system immediately determines whether the data resides, for example, on the local file system or on the object storage tier, and reads it from the appropriate location. All data restore operations are completely transparent to the user, and the data is seamlessly fetched from the storage tier where it resides. This is in stark contrast to a script-based approach, where you would need specific scripts for each storage location.


Conclusion

The Imanis Data architecture is flexible enough to address the key issues of scale, bandwidth, and cost of cloud data management. As a result, we’re used by some of the largest enterprises running the most demanding big data workflows. We encourage you to check out this video of how Imanis Data works and contact us to learn more about the ideal big data cloud management solution.


The Unique Capabilities of the Imanis Data HBase Connector

In this post we compare the Imanis Data HBase Connector with existing and proposed data protection solutions for Apache HBase.

Imanis Data HBase connector design

As with all Imanis Data connectors, the HBase connector automatically provides all the benefits of the Imanis Data platform: metadata catalog; Imanis Data FastFind, for rapid object discovery; storage optimization, to reduce capital and operating expenses; a scale-out architecture, to handle any size of production workload; and more. Review all the platform features here.

A few capabilities, explained below, set the Imanis Data connector design apart from other systems.

Agentless architecture

The new HBase connector design follows the same principles that govern other connectors and has an agentless architecture. That means customers need not change any configuration or install any software on their production cluster.

Imanis Data HBase backup

The Imanis Data HBase connector leverages HBase snapshots to take backups. HBase snapshots guarantee data consistency by flushing all in-memory data to persistent storage before the snapshot is taken.

The process is as follows (a code sketch of this flow appears after the list):

1. Take a full backup: Create an HBase table snapshot and copy all the files contained in the snapshot to Imanis Data. Do this only once for any backup workflow.

2. Take an incremental backup by creating a new HBase table snapshot.

3. Look up the Imanis Data catalog and compare the files contained in the new snapshot with the files that were present in the previous iteration of the backup.

4. Copy the incremental data to Imanis Data.

5. After the backup is completed, immediately delete snapshots taken on the production cluster.
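
The sketch below outlines this sequence in Java using the standard HBase Admin API, which provides the snapshot and deleteSnapshot operations. The catalog comparison and copy steps are internal to the Imanis Data platform, so they appear here as hypothetical placeholder methods; treat the whole block as an illustration of the flow, not the actual implementation.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    // Illustrative outline of the snapshot-based, incremental-forever backup flow.
    public class HBaseIncrementalBackupSketch {

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {

                TableName table = TableName.valueOf("ns:orders");
                String snapshot = "backup_" + table.getQualifierAsString() + "_" + System.currentTimeMillis();

                // Steps 1-2: snapshot the table on the production cluster
                // (HBase flushes in-memory data so the snapshot is consistent).
                admin.snapshot(snapshot, table);

                // Step 3: compare the files referenced by the new snapshot against the
                // files recorded in the backup catalog from the previous run.
                List<String> snapshotFiles = listSnapshotFiles(conf, snapshot);  // hypothetical helper
                List<String> newFiles = diffAgainstCatalog(snapshotFiles);       // hypothetical helper

                // Step 4: copy only the new files to secondary storage.
                copyToSecondaryStorage(conf, newFiles);                          // hypothetical helper

                // Step 5: delete the snapshot from the production cluster right away.
                admin.deleteSnapshot(snapshot);
            }
        }

        // Placeholders standing in for internal platform logic.
        static List<String> listSnapshotFiles(Configuration conf, String snapshot) { return List.of(); }
        static List<String> diffAgainstCatalog(List<String> files) { return files; }
        static void copyToSecondaryStorage(Configuration conf, List<String> files) { }
    }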

It’s possible that extra data is captured by an incremental backup because of compaction on the production cluster. But thanks to Imanis Data’s data-aware de-duplication, that extra data does not consume extra space on the secondary cluster.

Imanis Data backup example

Below is a tabulation for a backup job in which we took a backup snapshot every night. We assumed that 20GB of data were added to the cluster every day. We also assumed duplicate data within the table and that Imanis Data would achieve a 5x reduction in data size.

Notice that on Day 4, files f3, f4, and f5 were combined by compaction and a new file f6 was created by the daily addition of data. The incremental backup on Day 4 therefore copied the new compacted file as well as the additional data: 80GB instead of the daily 20GB (roughly the 60GB compacted file plus the new 20GB file). But once de-duplication runs on the Imanis Data platform, all duplicate data is eliminated and only unique chunks of data are retained.

Additional space savings

HBase works on top of the Hadoop Distributed File System (HDFS). Typically, the HDFS replication factor on production clusters is set to three, so a 100GB data set on the production cluster will occupy 300GB of disk space.

Example 1 highlights the storage efficiency of the Imanis Data platform. When data is backed up to the Imanis Data system, only unique data is saved after our data-aware de-duplication. A data set that takes 300GB of disk space on a production cluster can end up taking just 20GB of disk space on the Imanis Data platform.

Incremental forever

Our backups are incremental forever and the platform also provides a restore-centric design. Our architecture optimizes a company’s recovery time objective (RTO). Unlike traditional backup-and-recovery methods that take periodic full backups and apply incrementals to them, every incremental backup image is a fully recoverable and independent snapshot of the production data. This allows Imanis Data to deliver a single-step restore process.

The following scenario and its results (shown in Example 2) illustrate the principle.

A customer creates a new backup job, which takes a nightly backup of 10 critical tables. Backup images are maintained for 90 days. Assume an original 1 terabyte data set and 50 gigabytes of daily changes.

After 80 days, we see that the customer has one full backup and 79 incremental backups. On day 81, we see that user error caused data corruption in some of the tables—the customer needs to recover all the data immediately!

A traditional recovery approach would recover the first full backup and then start applying changes from each of the 79 incremental backups. However, Imanis Data maintains a virtualized copy of the production data set in its restore point, so our restores are speedy and involve moving just a fraction of the original data. In this example, Imanis Data’s restore algorithms will restore from the virtualized restore point, thereby restoring only 1.2TB of data.

Data recovered by other solutions: 1TB + (50GB × 79) = 4.95TB

Data recovered by Imanis Data: 1.2TB (exactly the size of data on Day 80)
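
As a quick sanity check on those numbers, here is a minimal Java snippet working through the same arithmetic; the figures are taken directly from the scenario above.

    // Worked arithmetic for the restore scenario above.
    public class RestoreVolumeComparison {
        public static void main(String[] args) {
            double fullBackupTb = 1.0;          // original data set: 1TB
            double dailyChangeTb = 0.05;        // 50GB of daily changes
            int incrementals = 79;              // incrementals applied by a traditional restore

            double traditional = fullBackupTb + dailyChangeTb * incrementals;  // 1TB + 50GB x 79
            double virtualized = 1.2;           // size of the Day-80 data set, restored in one step

            System.out.printf("Traditional restore moves %.2fTB; virtualized restore moves %.1fTB%n",
                    traditional, virtualized);  // 4.95TB vs 1.2TB
        }
    }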

Granular Restores

With the Imanis Data HBase connector, customers can select backup-and-restore data sets at the namespace or table level. A customer can select a complete namespace or a set of tables when a new backup workflow is created. But during the restore, the customer can select any individual table or set of tables to be recovered to the same HBase cluster or to an alternate HBase cluster in the data center.

Even though Imanis Data uses an incremental forever approach for backup, all the restore points are completely virtualized on the Imanis Data cluster. That way, a customer need not restore the first full backup followed by incrementals. Instead, the restore point is instantly available to complete the restore.

Imanis Data HBase connector vs. other HBase backup strategies

We compared the Imanis Data HBase backup strategy against two other HBase backup offerings.

Backups with the Write-Ahead Log (WAL)

The WAL technique uses the HBase snapshot capability to take a full backup and then uses the WAL to take incremental backups. In HBase, all transactions are first written to the WAL before they are committed to the actual HFiles. A WAL is maintained for each region server. That makes for a reasonable scheme, but it has certain limitations.

One such limitation is that the restore procedure follows the traditional approach of full plus incremental restores, so it suffers from the same storage bloat problems discussed above. Moreover, the captured WAL files must be converted to HFiles before they can be restored, and RTOs are significantly higher with this procedure.

Another limitation is that the incremental backup relies on the WAL, and because the WAL is shared by all regions hosting various tables on a single region server, incremental backups include data for all tables in the deployment. Even if a customer selects just a single table for backup, the changes for all tables are captured, which extends the backup window and includes unnecessary data. That extra data has to be purged at the receiving cluster so that only the relevant data set is stored. Moreover, when two tables have to be backed up at different frequencies, say, once every hour versus once every 12 hours, the WAL has to be copied multiple times.

We think these are two major disadvantages of a WAL-based backup.

Backup snapshot management

Simple snapshot management leverages HBase snapshots for data protection. A backup utility takes periodic snapshots according to a predefined policy. The snapshots are saved locally on the HBase cluster, but to assist in disaster recovery, they can be copied to different locations in the data center or to cloud storage environments like Amazon S3 or Azure Blob storage. Recovery involves copying the files of the snapshot to a temporary location on the HBase cluster and using HBase bulk load to recover the lost data.
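
For reference, HBase ships a bundled ExportSnapshot MapReduce tool that performs this copy-to-another-location step; the sketch below drives it from Java via ToolRunner. The snapshot name and S3 bucket are placeholders, and the exact invocation can vary between HBase versions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.snapshot.ExportSnapshot;
    import org.apache.hadoop.util.ToolRunner;

    // Copies an existing HBase snapshot to object storage using the bundled ExportSnapshot tool.
    public class SnapshotExportSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            int exitCode = ToolRunner.run(conf, new ExportSnapshot(), new String[] {
                    "-snapshot", "nightly_orders_snapshot",          // a snapshot created earlier
                    "-copy-to", "s3a://example-backup-bucket/hbase", // destination via the s3a connector
                    "-mappers", "8"                                  // copy parallelism
            });
            System.exit(exitCode);
        }
    }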

That procedure is used by backup utilities provided with some HBase distributions, but it is too simple for today’s complex data protection needs. Some of its limitations:

  • Backups are not incremental; rather, the whole snapshot is copied as part of every backup, resulting in large backup windows.
  • Secondary storage is needed to keep multiple restore points.
  • The number of restore points that can be saved on backup storage is limited because of excessive space consumption.
  • Snapshots cannot be recovered to another cluster in the data center because the backup utility is limited.

Conclusion

Imanis Data provides a highly scalable solution to protect against accidental data loss in an HBase environment, encompassing key functional attributes such as an agentless model, incremental-forever backup, and extremely rapid recovery aided by our metadata catalog. Watch our product video and review our architecture white paper to get a better understanding of how we bring technical and business value to the world of big data management.
