With the growing popularity of NoSQL databases (Cassandra, Couchbase, MongoDB, etc.), customers are now comfortable running large-scale, mission-critical applications on them in production. Most of these applications are critical to the success of the business, and any data availability issue can have severe consequences: millions of dollars in lost revenue, permanent data loss, customer attrition, negative brand perception, and higher IT costs. Data availability management, which entails database backup/restore, test/dev copies of production data, and disaster recovery, is a critical infrastructure element that needs to be carefully thought out.
Unlike the SQL database world, where robust data availability management solutions exist and are typically implemented before an application reaches production, the NoSQL world is not there yet. Data availability management tools lag in functionality and do not meet the requirements of mission-critical applications.
In this blog, I share several common data management challenges voiced by Cassandra and DataStax Enterprise customers; these situations are equally applicable to other NoSQL environments.
1. “Someone accidentally deleted data from a production database by using the Truncate command. Can I restore my data?”
Fortunately, this customer had enabled auto_snapshot, which takes a snapshot of a table before it is truncated or dropped; otherwise, the data would have been lost for good. However, each table's snapshots directory holds a number of snapshot files, and the customer does not know which ones to use for recovery. Locating the right files and manually restoring the data will take hours, and the application will be down the whole time. Data recovery need not be this cumbersome.
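As a sketch of what that manual recovery looks like, the steps below restore a truncated table from its auto_snapshot on one node. The keyspace and table names, data directory path, and snapshot tag are illustrative, and the nodetool step requires a live Cassandra node; the whole sequence must be repeated on every node in the cluster.

```shell
# Assumptions (illustrative): default data directory, keyspace "shop",
# table "orders". TRUNCATE with auto_snapshot enabled leaves a snapshot
# under the table's snapshots/ directory.

DATA_DIR=/var/lib/cassandra/data
KEYSPACE=shop
# The live table directory name includes a UUID suffix.
TABLE_DIR=$(ls -d "$DATA_DIR/$KEYSPACE"/orders-*)

# 1. Find the snapshot that the truncate created.
ls "$TABLE_DIR/snapshots/"

# 2. Copy its SSTable files back into the live table directory.
cp "$TABLE_DIR"/snapshots/truncated-*/* "$TABLE_DIR"/

# 3. Tell Cassandra to pick up the newly visible SSTables.
nodetool refresh "$KEYSPACE" orders
```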
2. “We have a lot of product quality issues because our QA team is not able to test with production data sets.”
The Cassandra database contains confidential and sensitive data that cannot be moved out of the production environment because of compliance risks. As a result, the QA team tests with static, fabricated data sets. Because this fake data does not allow real-world tests, issues can and do arise once the software is deployed in production.
3. “We constantly run out of space on our Cassandra nodes and every time this happens, it is a fire drill to add more storage.”
This customer backs up the Cassandra database daily with snapshots. A fresh snapshot takes up little space because it consists of hard links to existing SSTables, but every time compaction reorganizes the SSTables into new files, the snapshotted (now obsolete) SSTables can no longer be reclaimed. Snapshot storage utilization shoots up significantly, and storage consumption on the Cassandra cluster reaches 100 percent. Fire drill!
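A minimal sketch of keeping snapshot space under control with Cassandra's own tooling follows; it requires a running node, and the snapshot tags and keyspace name are illustrative.

```shell
# See which snapshots exist and how much "true" (non-shared,
# i.e. reclaimable) disk space each one is pinning.
nodetool listsnapshots

# Take a fresh, dated snapshot before deleting anything.
nodetool snapshot -t "pre-cleanup-$(date +%Y%m%d)" my_keyspace

# Remove an old snapshot whose hard links are holding
# obsolete SSTables on disk.
nodetool clearsnapshot -t backup-20240101 -- my_keyspace
```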
4. “We used to keep 2 weeks’ worth of Cassandra backups in snapshots. But since we added new brands to our application, we can keep only 2 days’ worth of backups, and that does not meet our service-level agreements.”
The issue is the same as in the third challenge. As more data is loaded into Cassandra, less space is available for snapshots; hence the reduction in retained backups and the customer's inability to meet its SLAs.
5. “My Cassandra production database is on-site and I am using Amazon S3 for storing Cassandra backups. Backups take a long time and my monthly Amazon bills are going up.”
This deployment raises two challenges. First, because backups travel over a wide area network, every large backup (full or otherwise) takes a long time to complete, and after each compaction the volume of data to back up can be sizable. That leads to the second challenge: the Amazon S3 bill. Because periodic full backups and compaction generate large volumes of data, storage requirements keep growing, and so does the monthly Amazon S3 bill.
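One common mitigation, independent of any particular backup product, is to enable Cassandra's incremental backups so that only newly flushed SSTables, rather than repeated fulls, are shipped over the WAN. A sketch, with illustrative paths and bucket names:

```shell
# cassandra.yaml: hard-link each newly flushed SSTable into the
# table's backups/ directory as it is written:
#   incremental_backups: true

# Or toggle the same behavior at runtime on each node:
nodetool enablebackup
nodetool statusbackup   # reports whether incremental backup is running

# Then ship only the new files to S3 (paths/bucket are illustrative),
# and delete the local hard links afterward to free node disk space.
aws s3 sync /var/lib/cassandra/data/my_keyspace/orders-*/backups/ \
    s3://my-backup-bucket/node1/orders/
```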
6. “For each of the Big Data stores (Cassandra, Couchbase, Vertica, Hadoop) in our environment, the backup/recovery tools and procedures are completely different. This makes our environment operationally difficult to manage.”
For this team, Big Data backup and recovery are complex and, furthermore, largely manual. Although each Big Data technology includes its own command-line interface for backup and recovery, a CLI alone is insufficient for automated, error-free backups. For each data store, wrapper scripts have to be written (and maintained) to automate the backup process on each node, manage space on each node, and clean up older backups that are no longer required. Then, for consistent backups and reliable recoveries, the Operations team must master each script.
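To make that maintenance burden concrete, here is a minimal sketch of the kind of per-node wrapper script such teams end up writing for Cassandra. The staging directory layout, keyspace name, and 14-day retention window are assumptions, and the nodetool call (which needs a live node) is left commented so the retention logic reads on its own; the copy-out step from the data directory is omitted entirely.

```shell
#!/bin/sh
# Per-node backup wrapper sketch: take a dated snapshot, stage it,
# then prune staged backups older than the retention window.
# Assumptions: backups are staged under $BACKUP_ROOT, one directory
# per day named snap-YYYYMMDD; retention is 14 days.

BACKUP_ROOT=${BACKUP_ROOT:-/tmp/cassandra-backups}
RETENTION_DAYS=${RETENTION_DAYS:-14}
TAG="snap-$(date +%Y%m%d)"

# 1. Take the snapshot (requires a live node; commented in this sketch).
# nodetool snapshot -t "$TAG" my_keyspace

# 2. Stage today's backup (copying files from the data dir is omitted).
mkdir -p "$BACKUP_ROOT/$TAG"

# 3. Prune staged backups older than the retention window, judged by
#    each directory's modification time.
find "$BACKUP_ROOT" -mindepth 1 -maxdepth 1 -type d \
    -name 'snap-*' -mtime +"$RETENTION_DAYS" -exec rm -rf {} +
```

Multiply this by every node and every data store (Cassandra, Couchbase, Vertica, Hadoop), each with its own CLI quirks, and the operational drag becomes clear.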
Having heard these stories multiple times, we at Imanis Data embarked on building an enterprise-grade backup, restore, and test/dev management solution for Cassandra. In my next blog, I will discuss how we address the challenges described above.