Understanding Replication Versus Backup in a Big Data Environment
When we talk with customers who are implementing Big Data technologies, we are often presented with two scenarios leading to two questions:
We have multiple replicas of data on the production cluster. Do we really need a backup solution?
We use the replication feature provided by our Big Data vendor to replicate data on a remote cluster. Should we still care about backup?
The answer to both questions is a resounding “yes.” Let me explain why with a real-life example:
We recently worked with one of our customers on a proof of concept (POC) for Hadoop. While the executive team understood the importance of a backup solution, one of the engineers working with us was not convinced of the requirement: she was keeping three replicas of the data on the production cluster. The customer uses external tables in Hive and creates partitions for each week. The engineer had created temporary tables during the POC. Once the POC was completed, she used a regular expression (regex) at the command line to delete the HDFS locations of the temporary tables she had created. Soon after, she realized that she had accidentally deleted production data for a couple of tables that matched the regex.
Fortunately, she was able to recover the data—she had wisely backed it up during the Imanis Data POC—but the near-fiasco instantly converted her to our backup solution.
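The moral translates into a simple habit: enumerate what a pattern matches, review the list, and only then delete it. Here is a minimal sketch of that discipline in Python using a local scratch directory as a stand-in for an HDFS namespace (the directory names are hypothetical; the same two-step review applies to `hdfs dfs -rm` paths):

```python
import glob
import os
import shutil
import tempfile

# Scratch area standing in for an HDFS namespace; directory names are hypothetical.
work = tempfile.mkdtemp()
for d in ("tmp_poc_orders", "tmp_poc_users", "orders_prod"):
    os.mkdir(os.path.join(work, d))

# Step 1: enumerate what the pattern matches, and review it BEFORE deleting.
matches = sorted(glob.glob(os.path.join(work, "tmp_poc_*")))
print([os.path.basename(m) for m in matches])  # confirm no production paths match

# Step 2: delete exactly the reviewed list, nothing more.
for path in matches:
    shutil.rmtree(path)

print(sorted(os.listdir(work)))  # ['orders_prod']
```

Had the engineer in the story reviewed the match list first, the production tables caught by her regex would have stood out before anything was removed.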
Both backup and replication have been crucial elements for managing data availability in traditional enterprise environments. Similarly, for emerging Big Data platforms, data backup and replication are becoming mandatory as those platforms grow more business-critical. So let us step back for a moment to understand the need for backup and replication. IT administrators typically deploy backup or replication to protect data against some (or all) of the following situations:
Disaster at the data center. The entire data center can become unavailable, possibly resulting in prolonged downtime. Administrators should not have to wait for the data center to recover, but should be able to quickly make the data available from an alternative location.
Infrastructure failures. All hard drives eventually fail, more often than not without warning. The same goes for other hardware components such as RAM and power supplies. Administrators must ensure that data remains available in the event of such hardware outages.
Data corruption. Data corruption can happen when a faulty application deletes or edits the wrong files. In rare cases, it can be caused by infrastructure issues such as faulty RAM or failing hard disks. Faced with such corruption, administrators must be able to recover an older, non-corrupted copy of the data.
User errors. Data can get accidentally deleted due to user errors. Administrators need the capability to recover data lost due to such incidents.
So here’s how IT administrators can use backup or replication to recover from the above situations.
Disaster at the data center. To protect against disaster, administrators can replicate the data to a remote data center, with data being copied in near real time over a wide area network. Organizations must set up the appropriate infrastructure; especially important is a dedicated network link between the two data centers with enough bandwidth that data can be copied at nearly the same rate at which it is written to the production cluster. That way, data loss is minimal should a disaster strike the production data center, and applications can continue running in the remote data center.
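That bandwidth requirement is easy to size on the back of an envelope. As a sketch, assuming a hypothetical ingest rate of 2 TB per day, the dedicated link must sustain roughly 185 Mbit/s just to keep pace, before any protocol overhead:

```python
# Back-of-envelope sizing for the replication link; the ingest rate is hypothetical.
ingest_tb_per_day = 2.0          # data written to the production cluster per day
seconds_per_day = 24 * 3600
bits_per_tb = 8 * 10**12         # decimal terabytes

required_mbps = ingest_tb_per_day * bits_per_tb / seconds_per_day / 10**6
print(round(required_mbps, 1))   # 185.2 -- sustained Mbit/s, before overhead
```

In practice the link should be provisioned well above this average, since ingest is rarely uniform across the day.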
Infrastructure failures. The probability of infrastructure failures is much higher in Big Data environments because most Big Data platforms are deployed on scale-out commodity hardware. To alleviate this problem, Big Data platforms maintain multiple copies (a.k.a. replicas) of the data. Nevertheless, in the rare case of all the replicas having been lost or corrupted, administrators will need an independent backup system to rely on. Typically, such a backup system periodically copies the data from production clusters to other dedicated hardware, affording greater resilience against hardware failures. And it does so at a fraction of the infrastructure cost because it stores the backed-up data in an optimized format using deduplication and compression.
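The storage-reduction claim can be made concrete with a toy content-addressed block store. This is only an illustrative sketch (fixed-size blocks, arbitrary sample data); production backup systems use far more sophisticated variable-length chunking, but the principle is the same: store each unique block once, compressed.

```python
import hashlib
import zlib

BLOCK = 4096
store = {}          # sha256 digest -> compressed block (stored once)

def backup(data: bytes) -> list:
    """Split data into blocks; store each unique block once, compressed."""
    recipe = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:                   # deduplication
            store[digest] = zlib.compress(block)  # compression
        recipe.append(digest)
    return recipe

def restore(recipe) -> bytes:
    return b"".join(zlib.decompress(store[d]) for d in recipe)

# Highly repetitive data, as logs and table files often are:
data = b"2024-01-01 INFO request ok\n" * 100_000
recipe = backup(data)
stored = sum(len(v) for v in store.values())
print(len(data), stored)   # the stored size is a tiny fraction of the original
```

Because the sample data repeats, only a handful of distinct blocks ever reach the store, and each compresses well; the recipe of digests is all that is needed to reconstruct the original bytes exactly.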
Data corruption and user errors. To protect against data loss or corruption, administrators must rely on backups alone. Apart from the advantages mentioned above, enterprise-grade backup systems allow users to browse through multiple versions of the data objects (files, databases, tables) and select only the objects they wish to recover.
Some administrators opt to create periodic snapshots on the production cluster so that users can easily go back to older versions of the data should the current data become corrupted. This practice is highly discouraged for three reasons:
When a data set undergoes frequent mutations, such as updates to or deletions of rows in a database, snapshots must retain the older data so that users can recover what was lost. In such cases, snapshots consume a great deal of additional storage space. Depending on the data pattern, snapshots can even grow larger than the actual user data; that's because snapshots, unlike enterprise backup software, do not support storage reduction techniques such as deduplication and compression.
Depending on how the snapshot feature is implemented, a large number of snapshots can degrade the performance of applications that use the production data.
Snapshots usually share the same storage hardware as the original user data. Hence, data lost or corrupted because of underlying hardware failure is unrecoverable from the snapshots as well—which defeats the very purpose of the snapshots.
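The first reason above can be seen in a toy copy-on-write model (block counts and sizes are illustrative): once a snapshot exists, every block that is later overwritten must have its old version retained, so after a full rewrite of the data the snapshot pins as much storage as the live data itself.

```python
# Toy copy-on-write snapshot. Once a snapshot exists, every block that is
# later overwritten must keep its old version alive for the snapshot.
live = {f"blk{i}": b"v1" * 100 for i in range(1000)}   # live data: 1000 blocks

retained = {}                     # old block versions pinned by the snapshot
for key in list(live):            # workload rewrites every block post-snapshot
    retained[key] = live[key]     # copy-on-write: keep the pre-snapshot bytes
    live[key] = b"v2" * 100

live_bytes = sum(len(v) for v in live.values())
snap_bytes = sum(len(v) for v in retained.values())
print(live_bytes, snap_bytes)     # 200000 200000 -- the snapshot doubled the footprint
```

A backup system with deduplication would store those old versions far more compactly, and on separate hardware.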
Imanis Data is the only enterprise-grade product that unifies backup and replication for multiple modern data platforms, namely Hadoop, NoSQL stores like Cassandra, HBase, MongoDB, and Couchbase, and even data warehouses like Vertica. Imanis Data's flexible policies allow for periodic or ad hoc backups. With periodic backups, each time a data object, say a database, is modified, a new version is created on Imanis Data. A user can then quickly search for the object, thanks to Imanis Data FastFind™, and inspect the timeline of its versions. Once the right version is chosen, it can be recovered to the production cluster or even to a different cluster, as the user prefers.
Once data is backed up to Imanis Data, it is stored in a form that is highly optimized by deduplication and deep compression techniques. Further, depending on its configuration, the data is protected against multiple hardware failures by means of erasure coding. Should the production cluster fail catastrophically, the user can, with a single click, recover the last “good” version of the entire data set backed up by Imanis Data. We also let users create flexible workflows; for example, data can periodically be replicated to a remote data center, thus completing the spectrum for data management for Big Data platforms.