Big Data refers to immense amounts of structured and unstructured data that cannot be processed by traditional databases and software techniques. Examples of Big Data platforms include NoSQL databases like Cassandra and MongoDB, Hadoop components like HDFS, Hive and Impala, and modern data warehouses like Vertica. These new data platforms are built on a scale-out architecture supporting hundreds of commodity nodes. Backing up data at this scale needs a new architecture as well.
The key requirements for a Big Data backup solution are incremental-forever backups, fast and granular recovery, and parallel data transfers.
Each of these requirements, described in more detail below, is important when data sizes are orders of magnitude larger than traditional data sources.
Incremental-Forever Backups
Traditionally, organizations backed up data by creating a complete backup each week, followed by daily incremental backups. For recovery, the last full backup became the starting point to which the subsequent incremental backups were added, thus generating the image that needed to be recovered. Consider a Hive application with several databases with a total data size of 1 petabyte. Implementing a full weekly backup of a 1-petabyte data set is not feasible and can never meet any reasonable service-level agreement.
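A rough calculation shows why the weekly full backup breaks down at this size. The sketch below assumes a sustained transfer rate of 10 Gbit/s between the primary and backup systems; that figure is an illustrative assumption, not a number from a real deployment.

```python
# Back-of-the-envelope estimate of how long one full backup would take.
# The 10 Gbit/s sustained throughput is an assumed figure for illustration.

PETABYTE_BYTES = 10**15  # decimal petabyte


def full_backup_days(data_bytes: int, throughput_gbit_per_s: float) -> float:
    """Return the days needed to copy data_bytes at a sustained link rate."""
    bytes_per_second = throughput_gbit_per_s * 1e9 / 8
    seconds = data_bytes / bytes_per_second
    return seconds / 86_400  # seconds per day


days = full_backup_days(PETABYTE_BYTES, throughput_gbit_per_s=10)
print(f"Full backup of 1 PB at 10 Gbit/s: {days:.1f} days")
```

Under that assumption a single full copy takes longer than the week it is supposed to fit into, before any incremental backups are even considered.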
The only practical way to back up Big Data is with incremental-forever techniques: the full backup is done only once, when the backup workflow is initially set up. After that, only the changes are copied incrementally to the destination cluster. For example, when a new Hive partition is added, only the directories and files corresponding to that partition need to be copied to the backup cluster.
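One way to picture the incremental step is as a comparison of two file manifests: what the primary cluster holds now and what the backup cluster already has. The following is a minimal sketch under that assumption; the manifest layout, the fingerprint of size plus modification time, and the warehouse paths are all hypothetical.

```python
# Sketch of incremental-forever change detection using file manifests.
# A manifest maps a file path to a fingerprint (here: size and mtime).
# The manifest layout and the example paths are hypothetical.

from typing import Dict, List, Tuple

Fingerprint = Tuple[int, int]  # (size_bytes, modification_time)
Manifest = Dict[str, Fingerprint]


def files_to_copy(primary: Manifest, backed_up: Manifest) -> List[str]:
    """Return paths that are new or changed since the last backup."""
    return sorted(
        path
        for path, fingerprint in primary.items()
        if backed_up.get(path) != fingerprint
    )


# State after the initial full backup: one partition already copied.
backed_up = {"/warehouse/sales/dt=2024-01-01/part-0": (1024, 100)}

# A new Hive partition directory appears on the primary cluster.
primary = dict(backed_up)
primary["/warehouse/sales/dt=2024-01-02/part-0"] = (2048, 200)

changed = files_to_copy(primary, backed_up)
print(changed)
```

Only the file under the new partition directory is selected; the partition that was already backed up is left alone.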
Fast and Granular Recovery
Another requirement in the Big Data world is that incremental changes must be immediately merged into the full backup to create a complete image of the primary data at a particular time. This ensures that recovery can start immediately, without the lag associated with assembling the image at restore time. Data recovery must also be application-aware and granular. In the Hive example, the backup cluster must be able to recover a single partition of a table. Additionally, an entire database or schema comprising hundreds of tables might be backed up in a single workflow. The recovery workflow needs to be flexible enough to restore a single table from that backup.
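Both ideas can be sketched with a simple manifest model: each increment is merged at backup time into a new complete image, and a restore selects only the files under one table or partition directory. This is an illustrative sketch, not a description of any product's catalog; the block IDs, paths, and function names are hypothetical.

```python
# Sketch: merge increments into a full point-in-time image at backup time,
# then restore at table or partition granularity.
# Block IDs, paths, and function names are hypothetical.

from typing import Dict, List

Manifest = Dict[str, str]  # file path -> ID of the stored data block


def apply_increment(previous: Manifest, changed: Manifest, deleted: List[str]) -> Manifest:
    """Merge an increment into the previous image, yielding a new full image."""
    image = dict(previous)
    image.update(changed)
    for path in deleted:
        image.pop(path, None)
    return image


def select_for_restore(image: Manifest, prefix: str) -> Manifest:
    """Pick only the files under one table or partition directory."""
    prefix = prefix.rstrip("/") + "/"
    return {path: block for path, block in image.items() if path.startswith(prefix)}


day1 = {
    "/warehouse/sales/dt=2024-01-01/part-0": "blk-1",
    "/warehouse/customers/part-0": "blk-2",
}
# The second backup only carries the new partition, but the stored image is complete.
day2 = apply_increment(day1, {"/warehouse/sales/dt=2024-01-02/part-0": "blk-3"}, [])

partition = select_for_restore(day2, "/warehouse/sales/dt=2024-01-02")
table = select_for_restore(day2, "/warehouse/sales")
print(partition)
print(sorted(table))
```

Because `day2` is already a complete image, a restore reads it directly instead of replaying a chain of increments, and the prefix filter narrows the restore to one partition or one table.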
Parallel Data Transfers
All Big Data platforms, from Cassandra to Impala, share a loosely coupled, shared-nothing architecture built on commodity hardware with inexpensive direct-attached storage. Implicit in this design is that each application's data is distributed across several nodes on the primary cluster. Good performance, therefore, requires the backup workflow to be parallelizable. That is, each node containing data is contacted independently and its data is copied directly from that node on the primary cluster.
That requirement suggests that a monolithic backup solution will not scale to Big Data levels because of several chokepoints in the design. Hence, any workable backup implementation will have to run on a scale-out platform built on commodity hardware with direct-attached drives. All the nodes in the backup cluster will set up connections to one or more nodes in the primary cluster so that data can be transferred in parallel.
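The fan-out pattern described above can be sketched with a worker pool in which each worker talks to one primary node. This is only an illustration of the idea: the threads here stand in for separate backup nodes, `copy_from_node` is a hypothetical placeholder that counts files instead of transferring them, and the node names and paths are made up.

```python
# Sketch of parallel transfers: one independent task per primary node.
# copy_from_node is a hypothetical placeholder; node names are invented.

from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def copy_from_node(node: str, paths: List[str]) -> Tuple[str, int]:
    """Pretend to copy each file held by one primary node."""
    copied = 0
    for _path in paths:
        copied += 1  # a real implementation would stream the file here
    return node, copied


def parallel_backup(placement: Dict[str, List[str]], max_workers: int = 4) -> Dict[str, int]:
    """Contact every primary node independently and copy its files concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(copy_from_node, node, paths)
            for node, paths in placement.items()
        ]
        return dict(future.result() for future in futures)


# Which primary node holds which files (hypothetical placement).
placement = {
    "primary-node-1": ["/data/a/part-0", "/data/a/part-1"],
    "primary-node-2": ["/data/b/part-0"],
    "primary-node-3": ["/data/c/part-0", "/data/c/part-1", "/data/c/part-2"],
}

result = parallel_backup(placement)
print(result)
```

No single process sits between all the data and the backup store, which is the chokepoint a monolithic design would introduce. In a real scale-out deployment the tasks would be spread across backup nodes rather than threads in one process.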
In part 2, I’ll continue the discussion around backup requirements for Big Data specifically related to agents, catalogs, and the concept of application-aware backups. If you have questions about this post or any other aspect of the Imanis Data architecture, don’t hesitate to contact us.