In a previous post, I outlined the major requirements for Big Data backup, including incremental-forever backups and fast, granular recovery. This post highlights additional requirements and summarizes how Imanis Data approaches Big Data backup.
Big Data cluster configurations are in a state of constant flux. Since commodity hardware is used to deploy platforms such as Hadoop, Cassandra, and Vertica, those clusters are configured to withstand or quickly recover from failures of various components such as drives, network adapters, and even nodes in the cluster.
A traditional backup solution deploys agents to primary nodes to schedule data transfers. That model will not be operationally feasible in a Big Data environment because new nodes are constantly commissioned and dead nodes are decommissioned. Monitoring the availability of agents that are deployed on the individual nodes is a non-trivial task due to the number of nodes involved in a Big Data cluster. There are also security implications in a datacenter where authorization from the security infrastructure team is typically required before additional daemons are deployed on the production nodes. For these operational reasons, any Big Data backup solution will need to incorporate an agentless model whereby no backup software is installed on the nodes of the primary cluster.
The number of objects that need to be versioned and monitored in the Big Data world is in the millions, and the catalog that supports this many objects will have to scale horizontally. For example, an HDFS data store might easily have a million files and directories. Assuming a reasonable change rate (e.g., some files and directories are deleted or appended, and new files and directories are created), every incremental backup will add a large number of objects. Each of these objects will have to be mapped to the appropriate recovery point. A catalog will need extensive search capabilities and must scale to Big Data levels. Metadata of the objects will need to be stored, and the mutations of the metadata must be searchable across different versions and transitions.
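To make the mapping between objects and recovery points concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions: the names `Catalog`, `ObjectVersion`, `record`, `lookup`, and `search` are hypothetical, recovery points are modeled as increasing integers, and a real catalog would be distributed and persistent rather than a single dictionary.

```python
import re
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ObjectVersion:
    recovery_point: int   # id of the backup run that captured this version
    deleted: bool         # True if the object was removed in that run
    metadata: dict        # e.g. size, owner, permissions


@dataclass
class Catalog:
    # path -> versions, appended in increasing recovery_point order (assumption)
    versions: Dict[str, List[ObjectVersion]] = field(default_factory=dict)

    def record(self, path: str, recovery_point: int,
               metadata: Optional[dict] = None, deleted: bool = False) -> None:
        """Add one version of an object, as an incremental backup would."""
        version = ObjectVersion(recovery_point, deleted, metadata or {})
        self.versions.setdefault(path, []).append(version)

    def lookup(self, path: str, recovery_point: int) -> Optional[ObjectVersion]:
        """Return the version of `path` visible at `recovery_point`, if any."""
        history = self.versions.get(path, [])
        points = [v.recovery_point for v in history]
        idx = bisect_right(points, recovery_point)  # latest version <= point
        if idx == 0:
            return None  # object did not exist yet
        version = history[idx - 1]
        return None if version.deleted else version

    def search(self, pattern: str, recovery_point: int) -> List[str]:
        """Paths matching a regular expression that exist at `recovery_point`."""
        regex = re.compile(pattern)
        return sorted(p for p in self.versions
                      if regex.search(p) and self.lookup(p, recovery_point))


catalog = Catalog()
catalog.record("/data/a.csv", 1, {"size": 100})
catalog.record("/data/a.csv", 3, {"size": 250})   # file was appended
catalog.record("/data/a.csv", 5, deleted=True)    # file was deleted
catalog.record("/data/b.log", 3, {"size": 10})
```

Because each incremental backup only appends the versions that changed, a lookup at any recovery point resolves to the most recent version at or before it. That is why the catalog grows with the change rate rather than with the total object count per backup.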
Application-Aware Backups and Restores
The Big Data world involves different applications with different types of data abstractions. For example, data in HDFS is stored in files and directories, while the abstraction layer for Hive or Impala focuses on databases, tables, and partitions. These differences impact Big Data backup requirements in two ways. First, the user setting up workflows needs to interact with the backup system at the data abstraction layer supported by the application. For example, a Cassandra workflow needs to be set up using keyspaces and tables. Second, all the metadata and attributes associated with the abstraction layer need protection along with the actual data. For example, the metadata in a Hive metastore will have to be protected in addition to the actual directories and files representing the database and tables.
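As a rough illustration of what "setting up a workflow at the abstraction layer" could look like, the Python sketch below describes a Cassandra backup in terms of keyspaces and tables rather than files. The class `CassandraWorkflow` and its fields are hypothetical names invented for this example; they do not describe any product's actual configuration format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CassandraWorkflow:
    name: str
    # keyspace -> tables; an empty list means "every table in the keyspace"
    selection: Dict[str, List[str]] = field(default_factory=dict)
    # protect schema, roles, and privileges along with the data (assumption)
    include_metadata: bool = True

    def covers(self, keyspace: str, table: str) -> bool:
        """True if this workflow is responsible for backing up the table."""
        if keyspace not in self.selection:
            return False
        tables = self.selection[keyspace]
        return not tables or table in tables


workflow = CassandraWorkflow(
    name="nightly-orders",
    selection={"orders": [], "analytics": ["daily_rollup"]},
)
```

The point of the sketch is that the user never names SSTable files or data directories. The backup system is expected to translate keyspaces and tables into whatever the underlying storage layout happens to be.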
The Imanis Data backup implementation is based on a wholly scale-out architecture built with commodity hardware having direct-attached, software-defined storage. We use an incremental-forever model to fetch only modified objects from the primary cluster. We are completely application-aware. For an application like Cassandra, we fetch all the metadata information related to keyspaces and column families. The metadata includes user information such as roles and privileges. Any recovery of the keyspace will ensure that the original set of roles and privileges is applied to the recovered keyspace. Another critical differentiator is that we are agentless. No Imanis Data software needs to be installed on any of the nodes of the primary cluster. The catalog is architected to host millions of versioned objects along with their attributes and properties. It is searchable by attribute and by regular expression.
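The incremental-forever idea can be sketched generically in Python: compare the current listing of the primary cluster with the state captured by the previous backup, and transfer only what differs. This is a simplified illustration, not a description of the Imanis Data implementation. The function `plan_incremental`, the `ObjectState` type, and the use of size plus modification time as the change signal are all assumptions made for the example.

```python
from typing import Dict, NamedTuple, Set, Tuple


class ObjectState(NamedTuple):
    size: int      # bytes
    mtime: float   # last-modified timestamp


def plan_incremental(
    previous: Dict[str, ObjectState],
    current: Dict[str, ObjectState],
) -> Tuple[Set[str], Set[str]]:
    """Return (paths to fetch, paths to mark deleted) for the next backup."""
    # New objects have no previous state; modified objects have a different one.
    to_fetch = {path for path, state in current.items()
                if previous.get(path) != state}
    # Objects that disappeared are recorded as deletions, not transferred.
    to_delete = set(previous) - set(current)
    return to_fetch, to_delete


previous = {
    "/warehouse/a": ObjectState(10, 1.0),
    "/warehouse/b": ObjectState(20, 1.0),
    "/warehouse/c": ObjectState(5, 1.0),
}
current = {
    "/warehouse/a": ObjectState(10, 1.0),   # unchanged, skipped
    "/warehouse/b": ObjectState(25, 2.0),   # appended, fetched
    "/warehouse/d": ObjectState(1, 2.0),    # new, fetched
}
to_fetch, to_delete = plan_incremental(previous, current)
```

After the first full backup, every later run follows this same path, so the data moved per run is proportional to the change rate rather than to the size of the cluster.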
I hope you enjoyed this view into how we envision backup in the Big Data world. Please let me know your thoughts on this post or our solution. Stay tuned for future posts that describe the file system architecture and other components of the Imanis Data software.