
Optimal Data Management Strategies in the Cloud


The next-generation architecture for protecting scale-out, mission-critical databases such as Apache Cassandra requires key features like incremental-forever backup and granular, fast recovery. Such a solution must also act as an active data platform so that it can manage secondary workloads for test and development clusters in the data center.

 

The specific backup and recovery requirements for these new data platforms are outlined in our blog here and here.

 

As cloud-based deployments become increasingly mainstream, companies must understand that reliance on scripts to drive data backups and restores leads to suboptimal backup strategies. The shortcomings we encountered with script reliance are noted below. We used Cassandra as our database example and Amazon Web Services (AWS) as our cloud environment example, but the problems noted are relevant to any Big Data platform deployed in the cloud.
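For illustration, here is a minimal sketch of what a script-driven backup of a Cassandra node to S3 typically looks like: a nodetool snapshot followed by copying SSTables with boto3. The bucket name, paths, and function names are hypothetical, and real scripts add scheduling, retries, and manifest handling on top of this.

```python
#!/usr/bin/env python3
"""Illustrative only: a typical script-driven Cassandra backup to S3.
Bucket, keyspace, and path names are hypothetical."""
import os
import subprocess
from datetime import datetime, timezone

import boto3  # assumes AWS credentials are already configured in the environment

BUCKET = "example-cassandra-backups"      # hypothetical bucket name
DATA_DIR = "/var/lib/cassandra/data"      # default Cassandra data directory
s3 = boto3.client("s3")

def full_backup(keyspace: str) -> None:
    """Take a nodetool snapshot, then copy every snapshotted SSTable to S3."""
    tag = datetime.now(timezone.utc).strftime("full-%Y%m%dT%H%M%S")
    subprocess.run(["nodetool", "snapshot", "-t", tag, keyspace], check=True)
    for root, _, files in os.walk(DATA_DIR):
        if f"snapshots/{tag}" not in root.replace(os.sep, "/"):
            continue
        for name in files:
            path = os.path.join(root, name)
            key = f"{tag}/{os.path.relpath(path, DATA_DIR)}"
            s3.upload_file(path, BUCKET, key)   # every full backup re-uploads all data

def incremental_backup(tag: str) -> None:
    """Copy SSTables from the per-table 'backups' directories
    (requires incremental_backups: true in cassandra.yaml)."""
    for root, _, files in os.walk(DATA_DIR):
        if not root.endswith("backups"):
            continue
        for name in files:
            path = os.path.join(root, name)
            key = f"{tag}/{os.path.relpath(path, DATA_DIR)}"
            s3.upload_file(path, BUCKET, key)
```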

 

Backups are not incremental forever.

 

Companies often apply traditional backup models with Amazon Simple Storage Service (S3) as the storage target for the backed-up data. In that model, full backups run one or more times a week, with incremental backups in between. Here’s why that model is an exceedingly inefficient way of doing backups:

 

  1. For Big Data, backup sets commonly run to terabytes or petabytes, so repeated, periodic full backups take a long time to complete. Regularly moving such volumes adversely affects the CPU, network, and storage resources of both the primary Cassandra cluster and the backup system.
  2. File and directory manifests and catalogs must be created for every increment to list the files present when the increment was taken. These manifests have to be searchable.
  3. The deletion of restore points is not trivial, since each backed-up file can have multiple references across different incremental backups. For example, file1 may be created as part of incremental #1 but also be referenced by incremental #2 because it has not been modified since. So the files associated with incremental #1 cannot be deleted without a reference check against the manifests of the other incremental backups (see the sketch after this list).
  4. To prevent incremental backups from accumulating, full backups must be done with reasonable frequency. For full backups, the data in the entire Cassandra database is read, resulting in performance issues for production workloads. Also, the resultant storage overhead significantly limits the number of full backups that can be preserved.
  5. Only a limited number of restore points can be effectively managed because of the aggregate size of metadata across all the incremental and full backups.
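To make point 3 concrete, the toy sketch below (with hypothetical backup IDs and file names) shows why a restore point’s files can only be deleted after checking every other manifest:

```python
# Illustrative sketch of why deleting a restore point requires reference checks.
# Each manifest maps a backup to the SSTable files it references (names are made up).
manifests = {
    "incremental_1": {"file1-Data.db", "file2-Data.db"},
    "incremental_2": {"file1-Data.db", "file3-Data.db"},  # file1 is shared
}

def files_safe_to_delete(backup_id: str) -> set[str]:
    """Return only the files that no other restore point still references."""
    still_referenced = set().union(
        *(files for bid, files in manifests.items() if bid != backup_id)
    )
    return manifests[backup_id] - still_referenced

print(files_safe_to_delete("incremental_1"))  # {'file2-Data.db'}; file1 must stay
```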

 

Recovery is not trivial.

 

Companies face lost time and extra work just to get back to a known restore point. Here are three examples:

  1. Restoring an incremental backup is time consuming: the last full backup must be identified, and all subsequent incremental backups must then be applied in sequence to reach the desired restore point. In essence, a “synthetic” full backup has to be reconstructed from the latest full backup and the intervening incremental restore points (see the sketch after this list).
  2. Restoring a single table requires extra work to identify only the files associated with that table across the last full backup and several incremental backups.
  3. Identifying the relevant restore points requires parsing through significant amounts of metadata.
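As a rough sketch of the “synthetic” full reconstruction described in the first example, the snippet below models each backup as a simple mapping of file names to S3 object keys. All names are made up, and a real restore also has to account for deletions and compaction.

```python
# Rough model of "synthetic full" reconstruction: each backup maps SSTable file
# names to the S3 objects that hold them (all names here are hypothetical).
def synthesize_full(last_full: dict[str, str],
                    incrementals: list[dict[str, str]]) -> dict[str, str]:
    """Overlay incrementals, oldest first, on top of the last full backup."""
    restore_view = dict(last_full)
    for inc in incrementals:      # must be applied in the order they were taken
        restore_view.update(inc)  # a newer copy of a file replaces the older one
    return restore_view

full = {"file1-Data.db": "s3://bucket/full-1/file1-Data.db"}
chain = [
    {"file2-Data.db": "s3://bucket/inc-1/file2-Data.db"},
    {"file1-Data.db": "s3://bucket/inc-2/file1-Data.db"},  # file1 changed later
]
print(synthesize_full(full, chain))  # the files needed for the desired restore point
```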

 

Metadata management for restore points can be complex.

 

Companies choosing to store data in S3 must separately manage the metadata associated with restore points (both full and incremental) outside that environment. The metadata stored for the various recovery points needs to be searchable. In addition, the metadata management layer has to implement complex reference counting of the objects stored in S3.
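One way to picture that metadata layer is a small, queryable index maintained outside S3. The schema and values below are purely hypothetical and only meant to show the kind of bookkeeping involved.

```python
# Purely illustrative: restore-point metadata that has to live outside S3,
# kept here in a small SQLite index so that it stays searchable.
import sqlite3

db = sqlite3.connect("restore_points.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS restore_files (
        backup_id  TEXT,    -- e.g. 'full-20240101' or 'inc-20240102' (hypothetical)
        keyspace   TEXT,
        table_name TEXT,
        s3_key     TEXT,    -- the object stored in S3
        ref_count  INTEGER  -- how many restore points still reference the object
    )
""")
db.commit()

# A typical lookup: which S3 objects make up one table in a given restore point?
rows = db.execute(
    "SELECT s3_key FROM restore_files "
    "WHERE backup_id = ? AND keyspace = ? AND table_name = ?",
    ("inc-20240102", "ks1", "users"),
).fetchall()
```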

 

Storage optimization techniques are difficult to implement.

 

Content-aware deduplication is a technique that reduces the number of replicas stored. Because Cassandra keyspaces and tables are written with n replicas, deduplication improves storage efficiency by keeping only one copy of replicated data. With support from a native file system, deduplication algorithms can rewrite the data after the redundant replicas have been removed.
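The core idea can be sketched in a few lines: hash each file’s contents and write only content that has not been seen before. Real deduplication works at the chunk level inside a file system, so this is only a simplified illustration with hypothetical names.

```python
# Illustrative sketch of content-aware deduplication: replicas of the same data
# hash to the same digest, so only one copy needs to be written to backup storage.
import hashlib
from pathlib import Path

seen: dict[str, Path] = {}   # content digest -> first file stored with that content

def store_deduplicated(path: Path) -> bool:
    """Return True if the file's content was new and actually stored."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in seen:
        return False          # a replica with identical content is already stored
    seen[digest] = path
    # ... write the file (or its chunks) to the backup file system here ...
    return True
```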

 

The reality is that the combination of scripting and an S3 storage target does not provide the native file system needed for deduplication.

 

Imanis Data provides an ideal solution for companies deploying Big Data platforms in the cloud because we’ve already thought through all of these issues and understand the optimal architecture for backup, recovery, and other data management needs.
