Replication with Rubrik Edge

I am continuing my look at the Rubrik platform. In my previous blog post, I looked at the deployment process for the Rubrik Edge virtual appliance, as well as backups and restores from that Edge appliance. Today I want to dig a little deeper into the backup policies (SLA Domains in Rubrik terms) and look at using replication to protect against losing the Edge appliance itself. I will start with replication and then loop round to policies, since replication is driven by these policies.

Replication

Consider why we make backups. There are two fundamental reasons:

  1. Errors and minor failures: accidental deletion of files, toxic updates, malware infection, accidental VM deletion, and disk failures. All are data loss events that leave the ESXi host or cluster operational. Having a backup on that same infrastructure will allow rapid recovery after the event. This is the sort of data loss that the backup stored on the Rubrik Edge appliance will immediately address.
  2. Complete site loss. This is the disaster event: fire, flood, lightning strike, hardware theft, or whatever else can go wrong. Having a backup stored on the same equipment that we just lost will not help us to recover. Backups need to be available somewhere outside the disaster area. A disaster event is where we will need the replication capabilities of the Rubrik solution.

Replication is a core feature for Rubrik, as it must be for any product that stores backups on non-removable media. In the past, when tape was king, you would simply ship tapes off-site to protect against disasters. With the shift to disk-based backup, replication to another location provides the same protection, usually with the benefit of faster recovery if a disaster should occur. To replicate, you need another Rubrik cluster: an on-premises cluster in another data centre or a cluster in a public cloud. For my testing, I simply deployed another Rubrik Edge appliance on another ESXi server. Since the Edge runs the full Rubrik software stack, you can replicate between Edge appliances just like full Briks.

My second ESXi host has smaller drives, so I had to find a way past my previous issue with the 1TB data disk. First I deployed the appliance thin-provisioned, so that the default 1TB disk could be created. Before powering the appliance on, I deleted the 1TB disk and expanded the thin-provisioned system disk, then I added a thick-provisioned data disk that used almost all of the remaining capacity of the datastore. Once the disks are set up, the Edge appliance can be powered on and bootstrapped, just like the first one.

[Screenshot: Setup Replication]

Once the second cluster is available, the setup of the replication relationship is simple. From the gear icon in the top right of the Rubrik console, select Manage Replication. Then click the + in the top right corner and enter the details of the remote cluster. I set up replication in both directions, so that I could replicate VMs protected by either Edge appliance.

[Screenshot: Replication Status]

Keep in mind that the backups can be replicated to Briks in remote data centres, or to Rubrik clusters on public cloud platforms. Once the second cluster is connected, you can see replication status in the admin portal. At this point, the two Edges were just exchanging policy information. To replicate VM snapshots (backups), we will need a policy that demands replication.

Policies

I see policy-based management as the only way to scale operations to manage large numbers of VMs without large teams of operators. Rubrik backup policies are called SLA Domains and drive the protection of VMs. The default policies are the usual: Gold, Silver, and Bronze.

[Screenshot: Default Gold SLA]

You can modify the existing policies, or you can create your own. I started by reducing the frequency of the snapshots during the day, from every 4 hours to every 6 hours. My service level to the business (myself) doesn’t require a 4-hour RPO.
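As a rough sketch of the arithmetic behind that change: with snapshots on a fixed interval, the worst-case RPO is approximately one full interval, and the interval also sets how many snapshots are taken per day. The function names below are illustrative only and are not part of any Rubrik tooling. The sketch ignores the time a snapshot takes to complete.

```python
from datetime import timedelta


def snapshots_per_day(interval: timedelta) -> int:
    """Number of snapshots taken in 24 hours at a fixed interval."""
    return int(timedelta(days=1) / interval)


def worst_case_rpo(interval: timedelta) -> timedelta:
    """With periodic snapshots, at most one full interval of changes can be lost."""
    return interval


for hours in (4, 6):
    interval = timedelta(hours=hours)
    print(
        f"every {hours}h: {snapshots_per_day(interval)} snapshots/day, "
        f"worst-case RPO {worst_case_rpo(interval)}"
    )
```

Moving from a 4-hour to a 6-hour interval drops the daily snapshot count from six to four, at the cost of two extra hours of potential data loss.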

[Screenshot: Modify Gold-Local]

I have a small environment, so the three built-in policies are more than sufficient. At enterprise scale, you will have hundreds of applications and probably thousands of VMs. You can create additional SLA Domains to provide different levels of protection for different business requirements.

Local backup is excellent, but I need the backups replicated, so I clicked the Configure Remote Settings link at the bottom and turned on replication to my second Edge. As this is only for DR, I just need a few days of snapshots retained at the second site. Notice that the snapshot schedule is shared by the local and replicated storage; we simply adjust the retention at each location.

For a real Edge deployment, I might choose to keep only a month or so of snapshots on the relatively small Edge appliance to save space. The remaining snapshots would be replicated to a (larger) cluster in another location for longer-term retention.
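The idea of one shared schedule with separate retention at each location can be sketched as a small data model. This is a hypothetical simplification, not Rubrik's actual SLA Domain schema: the class, field names, and retention values (3, 30, and 365 days) are assumptions chosen to mirror the lab and ROBO scenarios described above, and real SLA Domains have tiered frequencies that this sketch leaves out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SlaDomain:
    """Hypothetical model of a policy; not Rubrik's real schema."""
    name: str
    snapshot_interval: timedelta  # one schedule, shared by both locations
    local_retention: timedelta    # how long the source cluster keeps snapshots
    remote_retention: timedelta   # how long the replication target keeps them


def retained(snapshots, retention, now):
    """Return the snapshots that are still inside the retention window."""
    return [taken for taken in snapshots if now - taken <= retention]


def kept_counts(sla, now, history=timedelta(days=400)):
    """Count snapshots kept locally and remotely after `history` of operation."""
    count = int(history / sla.snapshot_interval)
    snapshots = [now - i * sla.snapshot_interval for i in range(count)]
    local = retained(snapshots, sla.local_retention, now)
    remote = retained(snapshots, sla.remote_retention, now)
    return len(local), len(remote)


now = datetime(2017, 1, 31, 12, 0)  # arbitrary reference time
# Lab style: the remote copy is only for DR, so a few days is enough.
lab = SlaDomain("Gold-lab", timedelta(hours=6), timedelta(days=30), timedelta(days=3))
# ROBO style: short retention on the small Edge, long retention centrally.
robo = SlaDomain("Gold-robo", timedelta(hours=6), timedelta(days=30), timedelta(days=365))

for sla in (lab, robo):
    local, remote = kept_counts(sla, now)
    print(f"{sla.name}: {local} snapshots kept locally, {remote} at the remote site")
```

Both policies take exactly the same snapshots; only the retention window applied at each location differs.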

In my lab, the two Edge appliances reside on internal SSDs and replicate to protect against the failure of either host or of the SSD holding the workload VMs. In a production environment, the ROBO location might be a stand-alone ESXi server or a small cluster with a small shared storage array. The Edge appliance would provide local backup and fast recovery of deleted objects. At a central office, or in a public cloud, a larger Rubrik cluster would provide a replication target for a group of ROBO locations. The software-only option allows a modern data protection product to be deployed to ROBO locations, protecting data where it is generated. Replicating deduplicated data is very WAN-efficient (far more efficient than replicating changed blocks), which can reduce the monthly spend on WAN circuits.
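To illustrate why deduplicated replication can send less data than changed-block replication, here is a toy comparison. It is a generic content-hash deduplication sketch, not a description of how Rubrik actually implements replication. The block size and sample data are assumptions, and the sketch ignores compression and the small cost of exchanging hashes.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, an assumption


def changed_block_bytes(changed_blocks):
    """Changed-block replication ships every changed block as-is."""
    return sum(len(block) for block in changed_blocks)


def deduplicated_bytes(changed_blocks, target_has):
    """Ship a block only if its content hash is not already at the target."""
    seen = set(target_has)
    sent = 0
    for block in changed_blocks:
        digest = hashlib.sha256(block).digest()
        if digest not in seen:
            seen.add(digest)
            sent += len(block)
    return sent


# Toy data: the same patch lands on three VMs, plus one unique change.
patch = b"P" * BLOCK_SIZE
unique = b"U" * BLOCK_SIZE
changed = [patch, patch, patch, unique]

print(changed_block_bytes(changed))        # 16384
print(deduplicated_bytes(changed, set()))  # 8192
```

In this contrived case the identical patch block crosses the WAN once instead of three times. Real savings depend entirely on how much duplicate content the workloads contain.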

I would like to see a more unified management option for the Edge appliances. Keep in mind that the aim is to have dozens of Edge appliances, without needing actions to be repeated at dozens of sites. I would like to see central configuration for all the Edge appliances, covering the configuration of standard SLA Domains as well as allowing a default SLA Domain to be selected for all VMs at all sites. I would like all of this inside the Rubrik interface on the primary data centre’s cluster of Briks. I have to imagine that Rubrik is already working on these interface changes. Right now you could use the Rubrik API to integrate deploying and configuring Edge appliances into your existing ROBO office deployment automation. You do have automated ROBO deployments, don’t you? If you have dozens or hundreds of ROBOs you need that automation.
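A minimal sketch of what that automation could look like: a loop that makes sure every site has the same standard SLA Domains. Everything here is hypothetical. `EdgeClient`, `FakeEdge`, the policy fields, and the site names are stand-ins invented for illustration. A real implementation would wrap the Rubrik REST API, whose endpoints and payloads are not shown here.

```python
from typing import Protocol

# Hypothetical standard policies to enforce at every site.
STANDARD_SLAS = [
    {"name": "Gold", "snapshot_hours": 6},
    {"name": "Silver", "snapshot_hours": 12},
    {"name": "Bronze", "snapshot_hours": 24},
]


class EdgeClient(Protocol):
    """Hypothetical interface; a real client would call the Rubrik API."""

    def list_sla_names(self) -> set[str]: ...
    def create_sla(self, definition: dict) -> None: ...


class FakeEdge:
    """In-memory stand-in so the sketch runs without any appliance."""

    def __init__(self, existing=()):
        self.slas = set(existing)

    def list_sla_names(self) -> set[str]:
        return set(self.slas)

    def create_sla(self, definition: dict) -> None:
        self.slas.add(definition["name"])


def ensure_standard_slas(client: EdgeClient, standards) -> list[str]:
    """Create any standard SLA Domain the site lacks; return the names added."""
    existing = client.list_sla_names()
    created = []
    for definition in standards:
        if definition["name"] not in existing:
            client.create_sla(definition)
            created.append(definition["name"])
    return created


sites = {"site-a": FakeEdge({"Gold"}), "site-b": FakeEdge()}
for site, client in sites.items():
    print(site, "created:", ensure_standard_slas(client, STANDARD_SLAS))
```

The function only creates what is missing, so it is safe to run repeatedly from a deployment pipeline. A second run against the same site creates nothing.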

I have looked at the fundamentals of deploying a Rubrik Edge appliance and using it to protect a bunch of VMs against data loss and disasters that affect the whole site. There is far more capability to Rubrik; I have not touched on backups of physical servers and applications or using Rubrik as part of a test/dev automation (DevOps) workflow. I haven’t even looked at using an external storage device to expand the effective capacity of a Rubrik cluster. Maybe if I do some more work with Rubrik, I’ll get a chance to play with more of the product.

© 2017, Alastair. All rights reserved.

About Alastair

I am a professional geek, working in IT Infrastructure. Mostly I help to communicate and educate around the use of current technology and the direction of future technologies.