Disaster recovery is like insurance: paying for it hurts, but not having it when you need it hurts even more. The trick is to have the right insurance for your situation and risks, and the right DR too. The spending hurts because you get no value at the time you spend; the value comes when you claim on your insurance or have to fail over to your DR site. So how might you reduce the cost of your insurance without compromising your ability to claim (fail over in a disaster) later? Using a Disaster Recovery as a Service (DRaaS) platform can reduce your cost while still allowing rapid testing and recovery. Datrium briefed me on the technology behind the VMware DRaaS product before VMware acquired the company to get that technology. More recently, VMware presented DRaaS at Tech Field Day 22, and I was delighted to see presenters who worked with me on Build Day Live with Datrium way back in 2017. I must be getting old; it doesn’t seem so long ago.
DR Site Compute
With most DR environments, you have a replica of your production compute infrastructure, or a subset of it, often on a virtualization platform. You usually buy or lease this infrastructure and pay a fixed cost whether you use it or not. If you choose a DR-to-the-cloud solution, you should be able to spin up a compute environment only when you need it and pay nothing for compute while you are not using it. With on-demand compute, the cost of DR compute resources is more closely coupled to the DR environment’s value.
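To make the fixed versus on-demand trade-off concrete, here is a minimal sketch of the two cost models. The function names and every number in the usage lines are hypothetical placeholders for illustration, not real lease or cloud prices; substitute your own figures.

```python
def fixed_dr_compute_cost(monthly_cost: float, months: int = 12) -> float:
    """Cost of owned or leased DR hosts: paid every month, used or not."""
    return monthly_cost * months


def on_demand_dr_compute_cost(hourly_rate: float, hosts: int, hours_running: float) -> float:
    """Cost of cloud hosts that exist only during a DR test or failover."""
    return hourly_rate * hosts * hours_running


# Illustrative placeholder values only (assumptions, not quoted prices).
fixed = fixed_dr_compute_cost(monthly_cost=5000.0)
on_demand = on_demand_dr_compute_cost(hourly_rate=8.0, hosts=4, hours_running=72)
print(f"fixed: ${fixed:,.2f}  on-demand: ${on_demand:,.2f}")
```

The point of the comparison is the shape of the two formulas: the fixed cost scales with calendar time, while the on-demand cost scales with the hours you actually spend testing or failed over.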
DR Site Storage
The storage system at your DR site serves two purposes with very different cost and performance requirements. When there is no active DR activity, the storage receives and stores updates, a sequential, write-only workload. When DR is happening, either for a test or a failover, the storage runs the active workload and must deliver good transactional read/write performance. For both purposes, the data is valuable, so it needs to be persisted and protected. Transactional read/write storage typically costs much more than backup-optimized storage that suits sequential write workloads. On AWS, S3 object storage costs around 2.5c per GB per month and is great for backups, while EBS block storage for running VMs costs about 10c per GB per month, which is 4x the cost. To optimize your DR storage value, use backup-optimized storage and change the configuration to make the storage transactional when you need to run VMs to test or fail over. Unfortunately, the usual cloud process to get from backup storage to transactional storage is a full data copy. On AWS, that usually means copying data from S3 object storage to EBS block storage, which can take hours depending on the data volume.
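The arithmetic behind those figures can be sketched in a few lines. The per-GB rates are the approximate ones quoted above; the 10,000 GB capacity and the 500 MB/s copy throughput are assumptions chosen purely for illustration, and real copy rates vary widely.

```python
# Approximate rates from the text, in US cents per GB per month.
S3_CENTS_PER_GB_MONTH = 2.5
EBS_CENTS_PER_GB_MONTH = 10.0


def monthly_storage_cost_usd(capacity_gb: float, cents_per_gb_month: float) -> float:
    """Monthly cost in dollars for storing capacity_gb at the given rate."""
    return capacity_gb * cents_per_gb_month / 100


def full_copy_hours(capacity_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to copy capacity_gb at a sustained throughput (1 GB = 1000 MB)."""
    return capacity_gb * 1000 / throughput_mb_per_s / 3600


capacity_gb = 10_000  # assumed protected capacity
s3 = monthly_storage_cost_usd(capacity_gb, S3_CENTS_PER_GB_MONTH)
ebs = monthly_storage_cost_usd(capacity_gb, EBS_CENTS_PER_GB_MONTH)
print(f"S3: ${s3:,.2f}/month  EBS: ${ebs:,.2f}/month  ratio: {ebs / s3:.1f}x")

# Assumed sustained throughput; this is a guess, not a measured figure.
print(f"full copy at 500 MB/s: {full_copy_hours(capacity_gb, 500):.1f} hours")
```

Even with a generous throughput assumption, the copy step adds hours to recovery, which is the cost the next section's design avoids.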
VMware DRaaS
VMware DRaaS is a very different service; essentially, the protected VMs can be running before they are copied from S3. An S3 bucket provides low-cost storage when there is no recovery activity; deduplication and compression help control costs even further. When a DR activity takes place, storage controllers are started in EC2 instances. These storage controllers use the S3 bucket as the source and maintain a cache to provide better transactional performance. The separation of persistence from performance was a central part of the Datrium architecture. The storage controllers present the protected VMs inside an NFS share, which is mounted on ESXi hosts inside VMware Cloud on AWS (VMC) to recover the VMs. Once the VMs are up, you will probably migrate them onto the vSAN datastore in VMC and have the storage controller VMs shut down.
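The idea of separating persistence from performance can be illustrated with a toy read-through cache. This is only a conceptual sketch: the class names are hypothetical, an in-memory dict stands in for the S3 bucket, and it does not describe how the VMware or Datrium storage controllers are actually implemented.

```python
from collections import OrderedDict


class ObjectStore:
    """Stand-in for an S3 bucket: cheap and durable, but slow to read."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)
        self.reads = 0  # counts how often the slow tier is hit

    def get(self, key):
        self.reads += 1
        return self._blocks[key]


class CachingController:
    """Serves reads from a small LRU cache, falling back to the object store."""

    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity
        self.cache = OrderedDict()

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        data = self.store.get(key)  # cache miss: fetch from the slow tier
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used block
        return data


store = ObjectStore({"blk0": b"boot", "blk1": b"data"})
controller = CachingController(store, capacity=2)
controller.read("blk0")
controller.read("blk0")  # second read is served from the cache
print(f"object store reads: {store.reads}")
```

The workload can start reading immediately because nothing has to be copied up front; only the blocks that are actually touched are pulled from the durable tier, and repeat reads stay in the faster cache.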
Use the Cloud, Luke
One of the fundamental truths of public cloud services is that they have opinions about how they should be used, and the best value comes from respecting those opinions. The VMware DRaaS architecture uses some of the unique features and capabilities of the AWS platform. You would not build an identical architecture on-premises, but on AWS this architecture delivers a much shorter recovery time without higher AWS costs.
As a more general principle, you should not merely take your on-premises architectures into AWS and expect the job to be complete. Expect to re-architect to get the most value out of AWS, and make sure you understand the relevant AWS services.
© 2021, Alastair. All rights reserved.