Cassandra snapshots are often used to help customers go back in time to recover from a badly written command or an issue arising from application corruption. However, there are two significant limitations when it comes to using snapshots for Cassandra backup.
Snapshots Lead to Storage Amplification Due to Compaction
Snapshots in Cassandra rely on hard links, a feature of the underlying file system (ext4, XFS). When a snapshot of a table is taken, a hard link is created in the snapshot directory for every file in the table's storage directory, which increases the reference (link) count on each of those files. This ensures that if the table is dropped and the user tries to clean up the storage directory, the underlying data will not be deleted, since the snapshot still holds an additional reference to each file. However, this mechanism results in storage amplification because of another Cassandra process that runs in parallel: compaction.
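To make the reference counting concrete, here is a minimal Python sketch that imitates a snapshot with a single hard link. It does not touch Cassandra at all: the directory layout and the file name data.db are made up for illustration, and it assumes a file system that supports hard links.

```python
import os
import tempfile


def snapshot_demo():
    """Imitate a snapshot with one hard link and show the data outlives the original path."""
    with tempfile.TemporaryDirectory() as root:
        table_dir = os.path.join(root, "table")            # hypothetical storage directory
        snap_dir = os.path.join(table_dir, "snapshots", "demo")
        os.makedirs(snap_dir)

        sstable = os.path.join(table_dir, "data.db")       # hypothetical file name
        with open(sstable, "wb") as f:
            f.write(b"x" * 1024)

        links_before = os.stat(sstable).st_nlink           # only the original path refers to the data

        # "Snapshot": add a second name for the same file; no bytes are copied.
        snap_file = os.path.join(snap_dir, "data.db")
        os.link(sstable, snap_file)
        links_after = os.stat(sstable).st_nlink

        # Removing the original path only drops one reference.
        os.remove(sstable)
        surviving_bytes = os.path.getsize(snap_file)

        return links_before, links_after, surviving_bytes


if __name__ == "__main__":
    print(snapshot_demo())
```

The link count goes from 1 to 2 when the link is created, and the 1024 bytes remain readable through the snapshot path after the original name is removed.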
During compaction, a set of existing SSTables is merged into a new SSTable file on which cleanup has been done: expired tombstones have been removed, deleted columns have been cleaned up, and data has been sorted. After compaction has completed, the SSTables it replaced are typically deleted and their space is reclaimed. Once a snapshot has been taken, however, deleting those older SSTables from the storage directory frees no space, because the snapshot's hard links still reference the same files. There are now two sets of files on disk: one set referenced by the snapshot, and the other set created by the compaction process. This results in storage amplification. For example, if a snapshot of a table is taken when the table's storage directory is 1 TB in size, then the snapshot can potentially take up an additional 1 TB of space.
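The effect can be reproduced on a small scale without Cassandra. The sketch below is only a simulation under stated assumptions: "compaction" is modeled as concatenating the old files into one new file with nothing purged, and the file names and sizes are invented. It measures disk usage by counting each inode once, so hard links are not double counted.

```python
import os
import tempfile


def unique_bytes(root):
    """Sum file sizes under root, counting each inode once (hard links share one inode)."""
    seen = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.stat(os.path.join(dirpath, name))
            seen[(st.st_dev, st.st_ino)] = st.st_size
    return sum(seen.values())


def amplification_demo(sstable_bytes=1000, sstable_count=4):
    """Return disk usage before the snapshot, after it, and after a simulated compaction."""
    with tempfile.TemporaryDirectory() as table_dir:
        snap_dir = os.path.join(table_dir, "snapshots", "before-compaction")
        os.makedirs(snap_dir)

        # Older SSTables (hypothetical names).
        old = []
        for i in range(sstable_count):
            path = os.path.join(table_dir, f"old-{i}.db")
            with open(path, "wb") as f:
                f.write(b"x" * sstable_bytes)
            old.append(path)
        before = unique_bytes(table_dir)

        # Snapshot: one hard link per file, so no extra space is used yet.
        for path in old:
            os.link(path, os.path.join(snap_dir, os.path.basename(path)))
        after_snapshot = unique_bytes(table_dir)

        # Simulated compaction: write a merged file, then delete the inputs.
        merged = os.path.join(table_dir, "new-0.db")
        with open(merged, "wb") as out:
            for path in old:
                with open(path, "rb") as src:
                    out.write(src.read())
        for path in old:
            os.remove(path)
        after_compaction = unique_bytes(table_dir)

        return before, after_snapshot, after_compaction


if __name__ == "__main__":
    print(amplification_demo())
```

With the default arguments, usage is 4000 bytes before and immediately after the snapshot, and 8000 bytes after the simulated compaction: the merged file plus the old files that only the snapshot still references. This is the same doubling as in the 1 TB example, in miniature.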
When we talk to customers, they often tell us that they cannot take more than 2-3 snapshots of their Cassandra environment before they run out of storage space.
Snapshots Need a Scheduler to Work Effectively
How often snapshots are taken, and how long they are retained, will vary based on business requirements. For example, you may need to manage specific keyspaces and tables differently based on their relative value. Let’s say you have ten tables in your Cassandra environment. Three of them may require a higher degree of protection and hence need a snapshot every hour. The other seven may need a snapshot only once a day. You therefore need some form of automated scheduler with a policy engine that sits on top of your snapshot infrastructure. The policy engine will need to create snapshots at suitable user-defined intervals as well as delete snapshots at the end of their retention period. It is extremely difficult to script this process and manage this level of complexity on an ad hoc basis.
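To give a sense of what such a policy engine has to decide, here is a deliberately minimal Python sketch of its core. Everything in it is an assumption made for illustration: the Policy class, the plan function, the tag naming scheme, and the idea that the caller already knows which snapshots exist. It only builds nodetool command lines and never runs them; the snapshot and clearsnapshot options shown should be checked against the nodetool help for your Cassandra version. It also leaves out the hard parts, such as running on every node, handling failures, and monitoring disk usage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-table policy: how often to snapshot and how long to keep snapshots."""
    keyspace: str
    table: str
    interval: timedelta
    retention: timedelta


def plan(now, policies, existing):
    """Return the nodetool commands that are due at time `now`.

    `existing` maps (keyspace, table) to a list of (tag, taken_at) pairs
    describing the snapshots that are already on disk.
    """
    commands = []
    for p in policies:
        snaps = existing.get((p.keyspace, p.table), [])

        # Create a snapshot if none exists or the newest one is older than the interval.
        latest = max((taken_at for _tag, taken_at in snaps), default=None)
        if latest is None or now - latest >= p.interval:
            tag = f"{p.table}-{now:%Y%m%d%H%M}"  # assumed tag naming scheme
            commands.append(
                ["nodetool", "snapshot", "-t", tag, "--table", p.table, p.keyspace]
            )

        # Expire snapshots that have outlived the retention period.
        for tag, taken_at in snaps:
            if now - taken_at > p.retention:
                commands.append(["nodetool", "clearsnapshot", "-t", tag, p.keyspace])
    return commands


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    policies = [
        Policy("shop", "orders", timedelta(hours=1), timedelta(days=1)),
        Policy("shop", "audit_log", timedelta(days=1), timedelta(days=7)),
    ]
    existing = {("shop", "orders"): [("orders-202312300900", datetime(2023, 12, 30, 9, 0))]}
    for cmd in plan(now, policies, existing):
        print(" ".join(cmd))
```

Even this toy version has to track existing snapshots per table, and a real deployment would need to do so consistently across every node in the cluster.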
In a follow-up blog post, I’ll discuss why recovery from Cassandra snapshots is also non-trivial and time-consuming. Visit our resources section to learn more about our capabilities for Cassandra and DataStax Enterprise customers.