The 4 Levels of Replication for Disaster Recovery in the Cloud

May 11, 2012 by Jake Robinson

Disaster Recovery as a Service (DRaaS or RaaS) is big on the hype meter right now, and cloud computing providers are clamoring to find solutions to fill the gap.

The biggest driver for DR in the Cloud is the utility-like aspect of purchasing infrastructure: you only pay for what you use. Disaster Recovery lends itself well to the concept of cloud computing, because you want to keep that insurance premium as low as possible. Virtualization has enabled some pretty amazing things with regard to Disaster Recovery as well. Recovery time objectives (RTO) and recovery point objectives (RPO) have dropped significantly since virtualization became mainstream, without adding cost.

On to the practical application: How on earth do I replicate my data to a public cloud provider?

Replication is essential to a low RPO. Full backups across a WAN link take so long to complete that you stand to lose a lot of data if disaster strikes between them. There are methods to replicate only changed blocks, but even then you could have some pretty hefty loss.
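To put rough numbers on that, here is a back-of-the-envelope sketch. The 500 GB data set, the 100 Mbps link, the 80% usable bandwidth, and the 2% daily change rate are all made-up figures for illustration, not measurements from any real environment.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to push data_gb across a WAN link of link_mbps megabits/s.

    efficiency is an assumed fraction of the link that is actually usable.
    """
    data_megabits = data_gb * 1000 * 8  # decimal GB -> megabits
    seconds = data_megabits / (link_mbps * efficiency)
    return seconds / 3600


# Assumed example: 500 GB of data, 100 Mbps link, 2% of blocks changed.
full_backup = transfer_hours(data_gb=500, link_mbps=100)
changed_only = transfer_hours(data_gb=500 * 0.02, link_mbps=100)
print(f"full backup: {full_backup:.1f} h, changed blocks only: {changed_only:.2f} h")
```

Under those assumptions the full copy takes most of a day, so the best RPO you could hope for is measured in hours. Shipping only the changes brings it down to minutes, which is still not the same as continuous replication.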

Here are the 4 levels of replication currently out on the market:

Application

Replication at the Application level has been around for some time. Relational database servers such as Microsoft SQL Server are a good example. You can replicate database transactions to a remote server, and restore the database when disaster strikes.

PROS: Public Cloud-friendly, very low RTO and RPO, can be physical to virtual, or any combination of the two.

CONS: Target server must be running in the cloud. OS and application must be set up properly and maintained (patches, OS, etc).

Guest OS

Vision Solutions' Double-Take product is a good example of this. The OS, application, and data can all be replicated on a block-level basis to a target machine. The OS files are "staged" on the target machine, and upon clicking the failover button, the target machine "becomes" the source machine. Pretty cool technology indeed.

PROS: Public cloud-friendly, Low RTO and RPO, physical to virtual, or any combination of the two. One click failover.

CONS: Agent overhead on source machine (CPU, Disk and Memory), license cost, target server must be running in the cloud.

VM/Hypervisor

There are two strategies at this level. The first is snapshot-based replication, and Veeam is a good example. A virtual machine snapshot is taken, much like it is for a normal VM backup, and the changed blocks are replicated to the target VM.
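The changed-block idea can be sketched in a few lines. This is only an illustration, not how Veeam or any other product is implemented: it finds changed blocks by hashing two in-memory "snapshots", whereas real products generally get the changed-block list from the hypervisor. The 4 KB block size, the function names, and the assumption that both snapshots are the same size are all mine.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the illustration


def block_hashes(disk: bytes) -> list[str]:
    """Hash each fixed-size block of a disk image."""
    return [hashlib.sha256(disk[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(disk), BLOCK_SIZE)]


def changed_blocks(previous: bytes, current: bytes) -> dict[int, bytes]:
    """Return {block index: new data} for blocks that differ between snapshots."""
    old = block_hashes(previous)
    new = block_hashes(current)
    delta = {}
    for index, digest in enumerate(new):
        if index >= len(old) or old[index] != digest:
            start = index * BLOCK_SIZE
            delta[index] = current[start:start + BLOCK_SIZE]
    return delta


def apply_delta(target: bytearray, delta: dict[int, bytes]) -> None:
    """Write the changed blocks onto the replica (assumes same-size disks)."""
    for index, data in delta.items():
        start = index * BLOCK_SIZE
        target[start:start + len(data)] = data


# Usage: a four-block disk where a single byte in block 2 changes.
previous = bytes(BLOCK_SIZE * 4)
current = bytearray(previous)
current[BLOCK_SIZE * 2 + 10] = 0xFF
delta = changed_blocks(previous, bytes(current))

replica = bytearray(previous)
apply_delta(replica, delta)
```

Only block 2 crosses the wire here, and the replica ends up identical to the current snapshot. The RPO is still bounded by how often you take snapshots.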

The second is hypervisor-based replication, which is new to VMware vSphere 5. ESXi 5 now has a low-level driver to capture VM disk data writes at the SCSI level. This is incredibly cool because it means I don't need to rely on snapshots. VMware SRM VR takes advantage of this, as does the ultra-cool Zerto.
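Conceptually, capturing writes as they happen looks something like the toy model below. To be clear, this is not the ESXi driver, SRM VR, or Zerto: `ReplicatedDisk`, `WriteRecord`, and `replay` are hypothetical names, and the "disk" is just a byte array in memory.

```python
from dataclasses import dataclass


@dataclass
class WriteRecord:
    """One captured write: where it landed and what was written."""
    offset: int
    data: bytes


class ReplicatedDisk:
    """Hypothetical disk wrapper that journals every write as it happens."""

    def __init__(self, size: int) -> None:
        self.blocks = bytearray(size)
        self.journal: list[WriteRecord] = []

    def write(self, offset: int, data: bytes) -> None:
        # Apply the write locally, then record it for the replica.
        self.blocks[offset:offset + len(data)] = data
        self.journal.append(WriteRecord(offset, data))

    def drain(self) -> list[WriteRecord]:
        """Hand over pending writes in order and start a fresh journal."""
        pending, self.journal = self.journal, []
        return pending


def replay(replica: bytearray, records: list[WriteRecord]) -> None:
    """Apply captured writes to the replica in the order they occurred."""
    for record in records:
        replica[record.offset:record.offset + len(record.data)] = record.data


# Usage: two writes on the source are shipped and replayed on the target.
source = ReplicatedDisk(16)
replica = bytearray(16)
source.write(0, b"ab")
source.write(4, b"cd")
replay(replica, source.drain())
```

Because every write is captured in order, there is no snapshot to take or roll up, and the replica only trails the source by whatever is still sitting in the journal.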

As we move down the stack, this is the first replication level where we have to think about VMware vCloud Director and its metadata. At the time of writing, neither VMware's VR nor Veeam replication has a solution for vCloud Director. Zerto supports vCloud Director in both private and public cloud scenarios.

The challenge for DR to public vCloud providers is direct access to the SAN. Most of the hypervisor level replication solutions have been built around the enterprise, not around a multi-tenant cloud. The public cloud is certainly NOT going to give you self-service, open access to the SAN.

PROS: Public vCloud friendly (Zerto), very low RTO and low RPO, replicates entire servers, storage agnostic, target VMs powered off and not consuming resources.

CONS: NOT public vCloud friendly (VMware SRM, Veeam), Snapshot rollups can hurt performance (Veeam), virtual only, not hypervisor agnostic.

SAN/LUN

SAN-based replication is a great solution for Enterprises looking to back up all or part of their infrastructure. The source and target in this case are not VMs, though. Each is an entire LUN. Typically you have multiple VMs running on a single LUN, which is nice if the group of VMs requires data/time consistency. VMware SRM uses SAN replication to achieve a fully orchestrated DR solution, and even physical servers with disks on the SAN can be replicated.

As stated previously, no public cloud provider is going to give you access to the SAN, which leaves you needing to buy another SAN of the same make and model, and slap it in a colocation rack.

But wait… SAN vendors also have virtual storage appliances (VSA), which do run in the cloud. So if you run SANs such as EMC, Nexenta, or HP, there is a possibility you could at least get your data off-site. The only challenge there is that you can't run the VMs in the cloud from the VSA. There would need to be a separate script to migrate VMs from the VSA to the public vCloud provider through the vCloud API. It's not impossible, but it would certainly add some time onto the RTO.

PROS: Public cloud friendly (maybe with a VSA), Low RTO and low RPO, replicates virtual and physical machines

CONS: NOT public cloud friendly (hardware SAN-to-SAN), NOT hardware, SAN, or hypervisor agnostic.

Don't Forget Failback!

One additional note: always ask how you fail back once your primary site is healthy. You don't want to be stuck in DR mode forever, and most of the solutions mentioned above require full replication of the systems back to your datacenter. Make that a top-ten question when looking for a DR solution. Disaster Recovery is not complete until you are back in your production datacenter!

So, what are your biggest disaster recovery challenges?