The rapid adoption of technologies such as social media, mobile devices, the Internet of Things, and the cloud has created large amounts of data that must be stored and managed efficiently. Traditional enterprise data warehouse (EDW) technologies cannot cope with the volume, variety, and velocity of that new data. Organizations therefore use modern analytics platforms such as Vertica to analyze those massive volumes of data with unprecedented speed and simplicity. Popular, business-critical use cases for Vertica databases include modeling healthcare costs, monitoring network performance, improving patient care, and detecting fraud.
Because these databases store and analyze an organization's most strategic asset, its data, it is critical to ensure that the data is adequately backed up, recoverable, and compliant with the business's recovery point objective (RPO) and recovery time objective (RTO). Vertica software ships with a simple and reliable backup and recovery script, called vbr.py, that addresses the most common backup and recovery scenarios. Beyond those, however, and based on conversations with dozens of Vertica clients who use the vbr.py script, we have identified some unique backup and recovery use cases, which we outline below.
1. True incremental-forever backups
As the volume of data in Vertica databases grows into the high terabyte or even petabyte range, performing periodic full backups is neither feasible nor efficient from a network or storage standpoint. An initial full backup followed by incremental-forever backups makes far more sense for Vertica databases. Each incremental backup should identify and back up only the changes made to the database since the last backup, and it should not re-copy the large amounts of data that a Tuple Mover mergeout rewrites into new storage files without changing the logical content.
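The idea of backing up only what changed since the last backup can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each backup records a manifest mapping a file path to a content hash; the function name `plan_incremental` and the manifest layout are hypothetical and are not part of vbr.py. The sketch works at file level only, so it does not by itself avoid re-copying data that a mergeout has rewritten into new files.

```python
from typing import Dict, List, Tuple


def plan_incremental(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Compare two manifests (path -> content hash) and plan an incremental backup.

    Returns (to_copy, to_expire):
      to_copy   - paths that are new or whose hash changed since the last backup
      to_expire - paths present in the last backup but no longer in the database
    Both the manifest format and this function are illustrative assumptions.
    """
    # A file needs copying if it is absent from the previous manifest
    # or if its recorded hash differs from the current one.
    to_copy = sorted(p for p, h in current.items() if previous.get(p) != h)
    # A file can be expired once it no longer appears in the current manifest.
    to_expire = sorted(p for p in previous if p not in current)
    return to_copy, to_expire


if __name__ == "__main__":
    last = {"a.dat": "h1", "b.dat": "h2", "d.dat": "h5"}
    now = {"a.dat": "h1", "b.dat": "h3", "c.dat": "h4"}
    copy, expire = plan_incremental(last, now)
    print("copy:", copy)
    print("expire:", expire)
```

Unchanged files never appear in the copy list, which is why the cost of each incremental run tracks the size of the change rather than the size of the database.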
2. Scalable and content-aware backup
Because Vertica databases are typically terabytes, and in some cases petabytes, in size, backups must be fast enough to fit within shrinking backup windows. Single-threaded backup scripts will not scale to meet this requirement. What is needed is a truly scalable, multithreaded backup solution that parallelizes backup operations across all available hardware resources. At those data volumes, content awareness is also important for optimizing the storage pipeline. Refer to this blog for more details on why a new backup architecture is required for Big Data.
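As a rough illustration of parallelizing backup work, the following Python sketch copies a set of files with a thread pool instead of one at a time. The function name `backup_files`, the flat destination layout, and the default worker count are assumptions made for this example; a real backup tool would also handle name collisions, retries, and verification.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def backup_files(sources: List[Path], dest_dir: Path, workers: int = 8) -> List[Path]:
    """Copy each source file into dest_dir using a pool of worker threads.

    Returns the destination paths in the same order as `sources`.
    Assumes source file names are unique, since the layout is flattened.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)

    def copy_one(src: Path) -> Path:
        target = dest_dir / src.name
        shutil.copy2(src, target)  # copies file contents and metadata
        return target

    # File copies are I/O bound, so threads can overlap the waiting time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_one, sources))
```

Threads suit this case because the work is dominated by disk and network waits; the achievable speedup still depends on how much parallel I/O the underlying storage can sustain.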
3. Fast search and recovery
The primary purpose of backing up a Vertica database is to ensure that data can be restored quickly and easily when the need arises, for example, when a user accidentally deletes data from a table or an application corrupts tables in the database. Because recoveries involve downtime, time spent locating where the backups are stored and identifying which backup to restore from is costly. Recovery should be as simple as a search for the affected table and a click to restore.
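The search step described above can be sketched as a lookup against a backup catalog. This Python example assumes a hypothetical catalog in which each record lists the tables a backup contains; the `BackupRecord` fields and the `find_restore_point` function are illustrative names, not an existing Vertica or vbr.py interface.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class BackupRecord:
    """One entry in a hypothetical backup catalog."""
    backup_id: str
    taken_at: datetime
    location: str
    tables: FrozenSet[str]


def find_restore_point(
    catalog: Iterable[BackupRecord], table: str, before: datetime
) -> Optional[BackupRecord]:
    """Return the most recent backup containing `table` taken at or before `before`.

    Returns None when no backup in the catalog qualifies.
    """
    candidates = [r for r in catalog if table in r.tables and r.taken_at <= before]
    return max(candidates, key=lambda r: r.taken_at, default=None)
```

Given the time at which a table was damaged, a lookup like this returns both which backup to use and where it is stored, which are the two questions that otherwise cost time during an outage.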
4. Granular recovery
Vertica DBAs generally back up the entire database or, in some cases, run a targeted backup of specific tables that are important and need to be protected. Recoveries, however, are quite the opposite: they typically involve restoring one table or just a few affected tables, and restoring the entire database is rarely needed. The most common recovery use case is therefore to restore an individual table from a full backup.
5. Recovery to a different-sized or alternative cluster
Given the rate at which Big Data is growing, Vertica database clusters are frequently expanded with the addition of more nodes. The cluster configuration at the time of restoration is therefore quite likely to differ from the configuration at the time of backup. Also, when data is copied from production to a test system, the test system is very likely to be smaller than the production system. Restoring to a cluster whose size or layout differs from the source is thus a common requirement for Vertica backup and restore.