In a prior blog post I described some of the storage and management overhead challenges of using Cassandra snapshots as part of your strategic data protection policy. In this post, I’ll describe why restoring from Cassandra snapshots is also non-trivial.
Let’s take the example of a snapshot taken a week ago. That snapshot is the basis of a restore that needs to happen today as a result of a human error. In the intervening week, however, the topology of your production Cassandra cluster has changed: two new nodes have been added. When nodes are added, the token distribution changes. You can no longer simply copy the snapshot directory back into the table’s data directory and run nodetool refresh to pick up the restored SSTables. You will need to reshard the data to account for the two new nodes and the changed token map. The only way to do this is sstableloader, which is difficult to use and error-prone.
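To make the resharding problem concrete, here is a minimal sketch of a consistent-hash token ring in plain Python. The node names, token values, and hash function are all illustrative stand-ins (Cassandra actually uses the Murmur3 partitioner and vnodes), but the effect is the same: once two nodes join the ring, a meaningful fraction of keys belong to a different node than the one that holds them in the old snapshot.

```python
import bisect
import hashlib

def key_token(key: str) -> int:
    # Stand-in for Cassandra's Murmur3 partitioner: hash the key to a token.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 1000

def owner(ring: dict, key: str) -> str:
    """Return the node owning a key: the first node whose token is >= the
    key's token, wrapping around the ring."""
    tokens = sorted(ring)
    i = bisect.bisect_left(tokens, key_token(key))
    return ring[tokens[i % len(tokens)]]

# Original 4-node ring at snapshot time: token -> node (tokens are made up)
old_ring = {0: "node1", 250: "node2", 500: "node3", 750: "node4"}
# The same ring after two new nodes join and claim their own tokens
new_ring = {**old_ring, 125: "node5", 625: "node6"}

keys = [f"user-{i}" for i in range(1000)]
moved = [k for k in keys if owner(old_ring, k) != owner(new_ring, k)]
print(f"{len(moved)} of {len(keys)} keys now belong to a different node")
```

Every key in the `moved` list sits in an SSTable on a node that no longer owns its token range, which is why a plain copy-and-refresh restore cannot work and the data has to be streamed to its new owners instead.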
Another issue arises during Cassandra data recovery if the user has changed the properties of the table since the snapshot was taken, for example the compaction strategy, or the replication factor of its keyspace. This makes a restore from a snapshot impossible without resharding the data according to the changed property.
Finally, if you have a 30-node Cassandra cluster, the snapshot restore will have to be performed on every node of the cluster, an operational nightmare. Compounding matters, if multiple tables have been corrupted, your operational overhead multiplies. In addition, different tables may have been snapshotted at different intervals depending on their recovery point and recovery time objectives, which makes finding the suitable snapshot for restore a very manual task. Is Cassandra snapshot restore possible? Yes. Is it operationally feasible? No.
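As a back-of-the-envelope illustration of that overhead, the number of manual restore steps grows as nodes × tables. The host names, table names, snapshot path, and command line below are all hypothetical (this is a dry run that only prints what would need to happen, not a tested restore procedure):

```python
# Hypothetical 30-node cluster and three affected tables.
hosts = [f"cassandra-{i:02d}.prod.example.com" for i in range(1, 31)]
tables = ["ks1.users", "ks1.orders", "ks2.events"]

# One illustrative sstableloader invocation per (node, table) pair;
# the backup path layout here is an assumption, not a Cassandra default.
commands = [
    f"ssh {h} \"sstableloader -d {h} /backups/{t.replace('.', '/')}\""
    for h in hosts
    for t in tables
]
print(f"{len(commands)} separate restore operations for "
      f"{len(hosts)} nodes and {len(tables)} tables")
```

Ninety hand-run, error-prone operations for a modest three-table incident, and that is before anyone has worked out which snapshot of each table actually matches the desired recovery point.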
We’ve spent considerable time at Imanis Data designing what we believe is the right architectural approach to data protection for Cassandra and other modern data sources. You can read one of my earliest blog posts on this topic and review our white paper for an even more technical discussion.