Most on-premises IT infrastructure designs treat a datacentre as a highly available platform; having an entire datacentre offline is a disaster. It is a bit of a surprise, then, that AWS recommends we treat a datacentre as a failure domain and plan to keep our applications operational even if a datacentre fails. AWS doesn’t actually expose individual datacentres in its services; it presents Availability Zones. An Availability Zone (AZ) is usually the smallest area we can select for running applications on AWS and is made up of one or more datacentres that are very close together. As customers, we treat an AZ like one datacentre. The EC2 service and its EBS storage are scoped to the AZ; an EC2 instance in one AZ cannot be powered on in another AZ. AWS recommends that we spread multiple EC2 instances across multiple AZs for high availability because an AZ, or an AZ-scoped service, can fail. If you take a look at the AWS Post Event Summaries page, you will see events where specific services were unavailable; the EC2 or EBS events usually impacted only a single AZ.
Multi-AZ is a standard design practice for production applications on AWS; DR is usually reserved for region-to-region failure. Failover between AZs is part of the application design, usually with scale-out EC2 for compute and a regionally scoped decoupling service such as a load balancer or queue. The regionally scoped service continues to operate even when one AZ fails, allowing the surviving EC2 instances to keep delivering application services. The ability to scale out to provide HA is part of the application design rather than a feature of the infrastructure.
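To make the pattern concrete, here is a minimal sketch in plain Python. It is a toy simulation, not AWS code: the `Instance` and `RegionalLoadBalancer` classes, the `spread` helper, and the AZ names are all hypothetical and exist only for illustration. It assumes instances are placed round-robin across three AZs and that the regionally scoped service simply stops routing to any AZ marked as failed.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    az: str  # an instance lives in exactly one AZ and cannot move


@dataclass
class RegionalLoadBalancer:
    """Toy stand-in for a regionally scoped service; not an AWS API."""
    instances: list
    failed_azs: set = field(default_factory=set)
    _next: int = 0  # round-robin counter

    def healthy(self):
        # Only instances in AZs that have not failed can serve traffic.
        return [i for i in self.instances if i.az not in self.failed_azs]

    def route(self):
        targets = self.healthy()
        if not targets:
            raise RuntimeError("no healthy instances in any AZ")
        target = targets[self._next % len(targets)]
        self._next += 1
        return target


def spread(count, azs):
    """Place `count` instances round-robin across the given AZs."""
    return [Instance(f"web-{n}", azs[n % len(azs)]) for n in range(count)]


azs = ["az-a", "az-b", "az-c"]  # hypothetical AZ names
lb = RegionalLoadBalancer(spread(6, azs))

before = [lb.route().az for _ in range(6)]
lb.failed_azs.add("az-a")  # simulate the loss of one AZ
after = [lb.route().az for _ in range(6)]

print("AZs serving before failure:", sorted(set(before)))
print("AZs serving after failure: ", sorted(set(after)))
```

The point of the sketch is where the resilience lives: nothing restarts the lost instances in another AZ. The application keeps working only because it was scaled out in advance and sits behind something that survives the AZ failure.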
The equivalent design practice on-premises is a highly redundant virtualization platform in a single datacentre; DR is used to recover to another datacentre. All of the redundancy and availability of the virtualization layer is invisible to the application, which is often unaware even of a DR failover, other than as an outage before regular service is restored. There are on-premises designs with storage and hypervisor clusters that span multiple datacentres, giving a scope equivalent to AWS multi-AZ. These Metro-Cluster solutions are usually very expensive and used only for highly critical applications. Metro-Cluster places all of the failover awareness and functionality in the infrastructure; applications are generally still unaware of the failover.
On AWS, a single datacentre is not enough for any production application deployment. Deploying highly available applications on AWS requires that the application be designed with an awareness of the AWS infrastructure. Cloud-native applications are designed with an awareness of the limitations of cloud-native infrastructure, whereas enterprise applications deployed on enterprise infrastructure expect perfect reliability from that infrastructure. Take a moment to look back at the Post Event Summaries page, think about the number of datacentres AWS operates (currently 76 AZs), and then think about whether your on-premises datacentres experience fewer outages than AWS.
© 2020, Alastair. All rights reserved.