A next-generation data protection architecture for scale-out, mission-critical databases such as Apache Cassandra requires key features such as incremental-forever backups and granular, fast recovery. Such a solution must also act as an active data platform so that it can manage secondary workloads for test and development clusters in the data center.
As cloud-based deployments become increasingly mainstream, companies must understand that reliance on scripts to drive data backups and restores leads to suboptimal backup strategies. The shortcomings we encountered with script reliance are noted below. We used Cassandra as our database example and Amazon Web Services (AWS) as our cloud environment example, but the problems noted are relevant to any Big Data platform deployed in the cloud.
Backups are not incremental forever.
Companies often use traditional backup models with Amazon Simple Storage Service (S3) as the storage target for the backed-up data. In that model, full backups are taken one or more times a week, with incremental backups taken between them. Here’s why that model is an exceedingly inefficient way of doing backups:
Recovery is not trivial.
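To illustrate why recovery takes work under the full-plus-incremental model, here is a minimal Python sketch of what a restore script has to work out first: which backups to replay, and in what order. The function name, the "full" and "incr" labels, and the day-numbered schedule are assumptions made for illustration, not part of any particular tool.

```python
def restore_chain(backups, target):
    """Return the backups that must be replayed, in order, to restore `target`.

    `backups` is a list of (timestamp, kind) tuples sorted by timestamp,
    where kind is "full" or "incr" (hypothetical labels).
    """
    chain = []
    for ts, kind in backups:
        if ts > target:
            break
        if kind == "full":
            chain = [(ts, kind)]       # a newer full backup restarts the chain
        elif chain:
            chain.append((ts, kind))   # incrementals only count after a full
    return chain


# Hypothetical schedule: full backups on days 0 and 7, incrementals in between.
schedule = [(0, "full"), (1, "incr"), (2, "incr"), (3, "incr"), (7, "full"), (8, "incr")]
print(restore_chain(schedule, 3))
```

In this sketch, restoring to day 3 means replaying the day 0 full backup and then three incrementals in sequence. If any link in that chain is missing, the restore point is unusable.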
Metadata management for restore points can be complex.
Companies choosing to store data in S3 must separately manage the metadata associated with restore points (both full and incremental) outside that environment. Metadata stored for various recovery points needs to be searchable. In addition, the metadata management layer has to implement complex reference counting of various objects that are stored in S3.
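The reference counting described above can be sketched in a few lines of Python. This is a simplified in-memory illustration under stated assumptions: the class name, method names, and object keys are hypothetical, and a real catalog would need durable storage and would have to issue the actual S3 deletes itself.

```python
from collections import Counter


class RestorePointCatalog:
    """Hypothetical catalog mapping restore points to the objects they use."""

    def __init__(self):
        self.points = {}            # restore point id -> set of object keys
        self.refcounts = Counter()  # object key -> number of restore points using it

    def add_restore_point(self, point_id, object_keys):
        keys = set(object_keys)
        self.points[point_id] = keys
        for key in keys:
            self.refcounts[key] += 1

    def delete_restore_point(self, point_id):
        """Drop a restore point; return the keys no other restore point needs."""
        orphaned = []
        for key in self.points.pop(point_id):
            self.refcounts[key] -= 1
            if self.refcounts[key] == 0:
                del self.refcounts[key]
                orphaned.append(key)
        return sorted(orphaned)

    def find(self, object_key):
        """Search: which restore points contain this object?"""
        return sorted(p for p, keys in self.points.items() if object_key in keys)


catalog = RestorePointCatalog()
catalog.add_restore_point("mon", ["a.db", "b.db"])
catalog.add_restore_point("tue", ["b.db", "c.db"])
shared_by = catalog.find("b.db")                 # both restore points use b.db
removable = catalog.delete_restore_point("mon")  # only a.db is safe to delete
```

Even in this toy version, deleting a restore point is a two-step decision: the object shared with another restore point has to stay, and only the unreferenced one can be removed from storage.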
Storage optimization techniques are difficult to implement.
Content-aware deduplication is a technique that reduces the number of replicas stored. Since Cassandra keyspaces are created with a replication factor of n, every table's data exists in n replicas, and deduplication improves storage efficiency by removing the redundant copies. With support from a native file system, deduplication algorithms rewrite the data after replicas have been removed.
The reality is that the combination of scripting and an S3 storage target does not provide the native file system that deduplication needs.
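As a rough illustration of the idea, the following Python sketch removes replica copies by hashing row content rather than comparing files. It assumes rows can be represented as simple (key, timestamp, value) tuples and that replicas hold the same rows in possibly different order; the function name and data layout are hypothetical and do not reflect how any specific product implements deduplication.

```python
import hashlib


def dedupe_replicas(replicas):
    """Keep one copy of each distinct row across all replicas."""
    unique = {}
    for rows in replicas:
        for row in rows:
            # Hash the row content so identical rows from different
            # replicas map to the same digest (assumed row representation).
            digest = hashlib.sha256(repr(row).encode("utf-8")).hexdigest()
            unique.setdefault(digest, row)
    return list(unique.values())


# Three hypothetical replicas of the same two rows, in different orders.
replicas = [
    [("k1", 1, "a"), ("k2", 1, "b")],
    [("k2", 1, "b"), ("k1", 1, "a")],
    [("k1", 1, "a"), ("k2", 1, "b")],
]
stored = dedupe_replicas(replicas)
print(len(stored))
```

Six rows go in and two come out, which is the storage saving that a replication factor of three makes possible. The sketch also shows why the technique must be content-aware: the second replica orders its rows differently, so a byte-for-byte comparison of the replicas would not recognize them as duplicates.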
Imanis Data provides an ideal solution for companies deploying Big Data platforms in the cloud because we’ve already thought about all of these issues and understand the optimal architecture for backup, recovery, and other data management needs.