My favourite quote from Werner Vogels is, “Everything fails, all the time.” One of the AWS design principles is to understand where things fail and to prevent a failure from stopping your application from doing its job. The guidance from AWS is to avoid Single Points of Failure (SPOFs). I don’t believe you can eliminate every SPOF, so you should understand and accept the ones that remain. This principle builds on the previous principles of designing services, automating, and using disposable resources. It adds the awareness that every AWS service has a scope and may fail at that scope. EC2 is scoped at the Availability Zone (AZ), so a single EC2 instance is susceptible to any failure within its AZ. We use Auto Scaling groups and Elastic Load Balancing across multiple AZs to remove the AZ as a SPOF, and now the regional services are our SPOF. While it is unusual for a regionally scoped AWS service to fail, such services can fail and have failed in the past. To eliminate a region as a SPOF, you use a global service like Route 53 to distribute application access across multiple regions, with load balancers and Auto Scaling groups in each region.
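To put a rough number on what removing a SPOF buys you, here is a minimal sketch of the usual redundancy arithmetic. It assumes the copies fail independently, which real AZs and regions only approximate, and the 99% figure is purely illustrative, not an AWS SLA. The function name `combined_availability` is my own.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant units when any one of them can serve.

    Assumes failures are independent, which is an idealisation.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


# Illustrative figure only: one unit that is available 99% of the time.
one_copy = combined_availability(0.99, 1)
two_copies = combined_availability(0.99, 2)
print(f"one copy: {one_copy:.4%}, two copies: {two_copies:.4%}")
```

Each extra copy shrinks the remaining downtime, but it never reaches zero, and whatever the copies share (the load balancer, the region, the DNS service) becomes the next SPOF.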
The problem is that each time we eliminate a SPOF, we at least double our cost and complexity. That additional cost and complexity are precisely why we may choose to leave a SPOF in place; eliminating it may cost more than the outage it would cause. The nature of the business may also be its own SPOF; a company that operates in one city may gain little from failover to another AWS region. For each SPOF, you will need to identify the cost of eliminating it and the risk of it failing. Everything fails, all the time. Ensure you know which single points of failure might cause your application to die and that the business (not IT) accepts the business risk of the possible outage.
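The comparison between the cost of elimination and the risk of failure can be sketched as a simple expected-loss calculation. This is only one way to frame it; the `Spof` class, its field names, and every number below are hypothetical placeholders that you would replace with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class Spof:
    name: str
    outages_per_year: float         # estimated frequency (assumption)
    hours_per_outage: float         # estimated duration (assumption)
    cost_per_hour: float            # business loss per hour of outage
    annual_elimination_cost: float  # extra yearly spend to remove the SPOF

    def expected_annual_loss(self) -> float:
        # Frequency x duration x hourly impact.
        return self.outages_per_year * self.hours_per_outage * self.cost_per_hour

    def worth_eliminating(self) -> bool:
        # Purely financial view; it ignores reputation, contracts, and so on.
        return self.expected_annual_loss() > self.annual_elimination_cost


# Made-up figures for illustration only.
single_az = Spof("single AZ", 0.5, 6, 5_000, 8_000)
single_region = Spof("single region", 0.2, 4, 5_000, 60_000)

for spof in (single_az, single_region):
    verdict = "eliminate" if spof.worth_eliminating() else "candidate to accept"
    print(f"{spof.name}: expected loss {spof.expected_annual_loss():,.0f}/year, {verdict}")
```

The output is an input to the conversation, not the decision itself: the business still has to look at the figures and explicitly accept whatever risk remains.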
© 2021, Alastair. All rights reserved.