Hadoop disaster recovery and preventing data loss
My client has a Hadoop cluster running Hortonworks HDP 2.1. He is very concerned about disaster recovery for the Hadoop cluster (how can we recover data if we lose the working production cluster for some reason?).
While searching the internet I found several solutions, such as:
Copy data periodically into a different cluster. (This doesn't look like a feasible solution, because we would have to set up an additional parallel cluster with the same storage capacity, and there would be overhead in copying the data, creating Hive tables similar to the originals, and loading data into the duplicate tables.)
Put another replica on a different rack in a geographically distant location. (As far as I know, Hadoop is not designed for data centers spread across multiple geographical locations, and even if it did support that, copying data to remote nodes could be slow because of network latency, and processing could also take longer.)
Apache Falcon: Falcon replicates HDFS files and Hive tables between different clusters for disaster recovery and multi-cluster data discovery scenarios. (But again we need to create a parallel cluster, and I am not sure whether it supports HDP 2.1. I don't have much understanding of Falcon.)
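For concreteness, I assume the "periodic copy" in the first option would be something like the following distcp invocation. This is only a sketch; the NameNode hosts (nn-prod, nn-dr) and the warehouse path are placeholders, not our real addresses:

```shell
# Sketch only: periodic HDFS copy to a DR cluster with distcp.
# -update copies only files that changed since the last run;
# -delete removes target files that no longer exist on the source,
# so the DR copy mirrors production.
hadoop distcp -update -delete \
  hdfs://nn-prod.example.com:8020/apps/hive/warehouse \
  hdfs://nn-dr.example.com:8020/apps/hive/warehouse
```

Even with this, the Hive metadata (table definitions) would presumably still have to be replicated separately, which is part of the overhead I mentioned.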
I am looking for a feasible solution, and for your experience in understanding and setting up disaster recovery on Hadoop.
First, if you want a DR solution you need to store the data somewhere outside of the main production site, which implies that the secondary site should have at least the same storage capacity as the main one. Remember that the main ideas behind HDFS are moving computation closer to the data and using cheap commodity hardware to store and process it. In general this means that using a Hadoop cluster to store the data on the secondary site is cheaper than acquiring an enterprise storage solution. So you cannot avoid the need for a secondary Hadoop cluster, but it can be "fatter" than the main one (for instance, by using newer HP servers that support up to 60 HDDs per server): more storage, fewer machines.
Second, you cannot avoid copying the data to the remote cluster. You can use Apache Falcon to create incremental backups (here is a guide: http://hortonworks.com/hadoop-tutorial/incremental-backup-data-hdp-azure-disaster-recovery-burst-capacity/; under the hood it uses distcp), or you can use a proprietary solution like WANdisco (here is another example, with some criticism of distcp: http://www.slideshare.net/hortonworks/hortonworks-wa-ndiscowebinarfinal).
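If you go the Falcon route, the workflow is roughly: define a cluster entity for each site, define a feed entity that names the primary cluster as the source and the DR cluster as the target, then submit and schedule the entities. A sketch with the Falcon CLI, where the entity file names and the feed name are assumptions for illustration, not real artifacts:

```shell
# Sketch: registering both sites with Falcon. The XML files
# (primary-cluster.xml, dr-cluster.xml, warehouse-feed.xml) are
# placeholders you would write for your own environment.
falcon entity -type cluster -submit -file primary-cluster.xml
falcon entity -type cluster -submit -file dr-cluster.xml

# The feed's XML lists the primary cluster as "source" and the DR
# cluster as "target"; once scheduled, Falcon runs distcp jobs at the
# interval declared in the feed's <frequency> element.
falcon entity -type feed -submit -file warehouse-feed.xml
falcon entity -type feed -schedule -name warehouse-feed
```

To answer the version question: as far as I know, Falcon is bundled with HDP starting from 2.1, so it should be available on your client's cluster.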
Third, replicating data once it is already in HDFS can be expensive, while duplicating the input data stream can be much simpler. The right approach depends heavily on the way you use your cluster.
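For example, if your data arrives as files in a staging area, "duplicating the input stream" could be as simple as landing each incoming file on both clusters at ingest time, so no bulk replication job is needed afterwards. A sketch, where the hosts and paths are placeholders:

```shell
# Sketch: dual-write at ingest time instead of replicating HDFS later.
# INPUT, the NameNode hosts, and the target directory are assumptions.
INPUT=/staging/incoming/events-$(date +%Y%m%d%H).log

# Write the same file to production and to the DR cluster; the second
# put only runs if the first succeeded, so retries are straightforward.
hdfs dfs -put "$INPUT" hdfs://nn-prod.example.com:8020/data/raw/ &&
hdfs dfs -put "$INPUT" hdfs://nn-dr.example.com:8020/data/raw/
```

In a real pipeline you would likely do this in your ingestion tool (e.g., a second sink) rather than in a shell script, but the idea is the same.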
Here's also a good presentation on this topic: http://www.slideshare.net/cloudera/hadoop-backup-and-disaster-recovery. In general, the resulting architecture depends heavily on your specific use case.
So in general, this is a decision that should be made by an experienced Hadoop architect; since you use HDP, I'd recommend reaching out to Hortonworks with this question directly.