Disasters in a Hadoop environment take various forms: a major natural disaster that takes out an entire data center, an extended power outage that makes the Hadoop platform unavailable, a DBA accidentally dropping an entire database, or an application bug corrupting data stored on HDFS. Proper Hadoop disaster recovery mechanisms must be in place before an application is rolled out to production so that data is protected in each of these scenarios. The right mechanism depends on several factors: the criticality of the application, the maximum amount of data the organization can afford to lose (the Recovery Point Objective, or RPO), the maximum time applications can be down during recovery (the Recovery Time Objective, or RTO), and the budget available for the disaster recovery infrastructure.
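As a rough illustration of how these factors interact, the sketch below checks candidate recovery mechanisms against RPO/RTO targets. All of the figures and strategy names are hypothetical, chosen only to show the comparison:

```python
from datetime import timedelta

# Hypothetical targets; in practice these come from business requirements.
rpo_target = timedelta(hours=4)   # max tolerable data loss
rto_target = timedelta(hours=2)   # max tolerable downtime

# Illustrative worst-case RPO/RTO for each mechanism (assumed numbers).
strategies = {
    "nightly backup":            (timedelta(hours=24), timedelta(hours=8)),
    "async replication (15min)": (timedelta(minutes=15), timedelta(hours=1)),
    "sync replication":          (timedelta(0), timedelta(minutes=5)),
}

meets_targets = {
    name: rpo <= rpo_target and rto <= rto_target
    for name, (rpo, rto) in strategies.items()
}

for name, ok in meets_targets.items():
    print(f"{name}: {'meets' if ok else 'misses'} RPO/RTO targets")
```

With these (assumed) targets, a nightly backup alone misses the RPO, while either replication approach satisfies both numbers; the rest of this article is about what the stricter options cost.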
Multiple replicas in Hadoop are a great way to protect against hardware failures such as disk drive failure or server failure. However, they do not protect against natural disasters, human errors, or application corruptions. One or more of the following options will have to be put in place to protect against these scenarios.
- Back up data on a regular basis to secondary storage located in the same data center, in a remote data center, or in the cloud. Regular backups help protect against human errors and application corruptions; if they are stored in a remote location, they also protect against natural disasters. Data recovery may take longer (higher RTO), and the recovered data may not be the most current, depending on how frequently backups are taken.
- Replicate data asynchronously from the production Hadoop cluster to a standby Hadoop cluster in a different data center. Since replication mirrors production onto the standby cluster, the standby holds no older copies of the data, so it provides no way to recover data lost to human error or application corruption. Replication does, however, protect against a natural disaster or power outage in the data center where the production Hadoop cluster is located. Because the data is always available on the standby cluster, recovery times (RTO) are short; RPO depends on how frequently data is copied to the remote data center.
- Synchronously replicate data from the production Hadoop cluster to another Hadoop cluster in a different data center. Like asynchronous replication, synchronous replication will not protect against human error or application corruption, but it will safeguard data in case of a data center outage. This solution gives the best RPO (no data loss) and RTO (very quick failover to the surviving Hadoop cluster).
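The difference between a backup and a replica is worth making concrete. The toy sketch below uses plain in-memory dictionaries standing in for cluster contents (no Hadoop APIs involved) to show why a mirrored standby cannot undo a bad write, while a point-in-time backup can:

```python
# Toy model: dicts stand in for cluster contents. This illustrates the
# concept only; it is not Hadoop replication code.
production = {"/data/tbl1": "v1", "/data/tbl2": "v1"}
standby = {}
backup = dict(production)        # point-in-time copy, taken earlier

def replicate(src, dst):
    """Mirror src onto dst; dst keeps no history."""
    dst.clear()
    dst.update(src)

replicate(production, standby)           # standby now matches production

production["/data/tbl1"] = "corrupted"   # human error or application bug
replicate(production, standby)           # the mirror faithfully copies it

print(standby["/data/tbl1"])             # the standby is corrupted too
print(backup["/data/tbl1"])              # the backup still holds the old version
```

This is the trade-off in miniature: replication minimizes RPO for site loss, while backups are what let you rewind past a logical error.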
So why not simply use synchronous data replication to protect against a data center failure? There are serious factors to consider before deploying an active synchronous data replication solution for Hadoop disaster recovery.
- Does your application need real-time data replication? Is it so critical that you cannot incur any downtime or data loss? In the real world, very few applications (particularly transactional applications) require such stringent RPO and RTO. If yours is one of those few, active replication may make sense, but it comes with its own limitations and cost considerations, noted below.
- Are your application users willing to take the performance hit? Synchronous replication will hurt application performance: every change made on the production system must be transmitted to, and acknowledged by, the remote Hadoop cluster before the application can proceed with the next change. The size of the impact depends on the network connectivity between the two clusters, which will most likely be a slower WAN link.
- Are you willing to risk potential disruption to the production environment? Synchronous data replication solutions require software to be installed on the production Hadoop cluster. This software intercepts every write to the file system, which can destabilize the production system and therefore demands extensive testing before it goes live. In addition, any disruption on the WAN will bring your applications to a halt, since data changes can neither be transmitted to the remote cluster nor acknowledged. This can mean downtime for your production applications.
- Can your wide area network (WAN) handle the additional traffic? With active real-time replication, all changes (temporary or permanent) are sent over the network to the remote Hadoop cluster. This places significantly more load on the WAN than an asynchronous approach, which transmits far less data because short-lived intermediate data never needs to cross the network.
- Do you have the budget required for an active disaster recovery solution? Typically these solutions have much higher hardware, software, and networking costs.
- Do you have basic data protection in place already? Human errors and application corruptions are far more likely than a natural disaster that takes out an entire data center, so protecting data against these likelier events should be the higher priority for an enterprise. An active disaster recovery solution will not protect data in these scenarios, since every change (intentional or accidental) is propagated to the disaster recovery copy almost instantaneously.
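Two of the questions above, performance and WAN load, lend themselves to back-of-envelope arithmetic. Every number below is an assumption chosen for illustration, not a measurement from any real cluster:

```python
# 1. Latency: a synchronous write cannot complete until the remote
#    cluster acknowledges it, so per-writer throughput is bounded by
#    the WAN round-trip time.
wan_rtt_s = 0.040                      # assumed 40 ms cross-site RTT
max_sync_writes_per_writer = 1 / wan_rtt_s
print(f"~{max_sync_writes_per_writer:.0f} synchronous writes/s per writer")

# 2. Bandwidth: synchronous replication ships every change, including
#    temporary files; asynchronous replication ships only the data that
#    still exists when the next copy cycle runs.
daily_writes_gb = 2000                 # assumed total daily write volume
surviving_fraction = 0.4               # assumed share that is permanent
sync_gb = daily_writes_gb
async_gb = daily_writes_gb * surviving_fraction
print(f"sync ships {sync_gb} GB/day, async ships {async_gb:.0f} GB/day")
```

Under these assumptions a single writer tops out around 25 acknowledged writes per second, and the synchronous link carries 2.5x the daily volume of the asynchronous one; plugging in your own RTT and change rates gives a quick first check before any deeper evaluation.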
In summary, although real-time replication gives the best possible RPO and RTO, it comes with limitations that need to be thought through carefully. An active Hadoop disaster recovery solution must be implemented in the context of the application's criticality to get the best return on investment. Otherwise, it can result in unnecessary expenditure, affect the availability of the production Hadoop system, and consume excessive resources in managing the production Hadoop environment.