My favourite quote from Werner Vogels is, “Everything fails, all the time.” One of the AWS design principles is to understand where things fail and prevent a failure from causing your application to stop doing its job. The guidance from AWS is to avoid Single Points of Failure (SPOF). I don’t believe you can eliminate every SPOF, so you should understand and accept your remaining SPOFs. This principle is related to the previous principles of designing services, automating, and using disposable resources. It adds awareness of the reality that every AWS service has a scope and may fail at that scope. EC2 is scoped at the Availability Zone (AZ), and a single EC2 instance is susceptible to failure within its AZ. We use autoscaling groups and elastic load balancing to remove the AZ as a SPOF, and now the regional services are our SPOF. While it is unusual for a regionally scoped AWS service to fail, these services can fail and have failed in the past. To eliminate a region as a SPOF, you use a global service like Route 53 to distribute application access across multiple regions, with load balancers and autoscaling groups in each region.
The problem is that each time we eliminate a SPOF, we at least double our cost and complexity. The additional cost and complexity are precisely why we may choose to leave a SPOF in place; eliminating the SPOF may be more expensive than the cost of an outage caused by the SPOF. The nature of the business may also be its own SPOF; a company that operates in one city may not suit failover to another AWS region. For each SPOF, you will need to identify the cost of elimination and the risk of failure. Everything fails all the time. Ensure you know what single points of failure might cause your application to die and that the business (not IT) accepts the business risk of the possible outage.
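The comparison between the cost of elimination and the risk of failure can be put into rough numbers. What follows is a minimal sketch, not a pricing model: the function names, parameters, and every figure are hypothetical, chosen only to show the shape of the calculation.

```python
def expected_outage_cost(outages_per_year: float,
                         hours_per_outage: float,
                         cost_per_hour: float) -> float:
    """Expected yearly loss from a SPOF: frequency x duration x hourly impact."""
    return outages_per_year * hours_per_outage * cost_per_hour


def should_eliminate_spof(elimination_cost_per_year: float,
                          outages_per_year: float,
                          hours_per_outage: float,
                          cost_per_hour: float) -> bool:
    """True when removing the SPOF costs less per year than living with it."""
    risk = expected_outage_cost(outages_per_year, hours_per_outage, cost_per_hour)
    return elimination_cost_per_year < risk


# Hypothetical figures: one 4-hour regional outage every two years,
# costing the business 10,000 per hour of downtime.
regional_risk = expected_outage_cost(0.5, 4, 10_000)
print(regional_risk)  # 20000.0

# A second region that costs 50,000 per year is not worth it on these numbers.
print(should_eliminate_spof(50_000, 0.5, 4, 10_000))  # False
```

The inputs are business estimates rather than technical measurements, which is why the business, not IT, should own the decision.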
Some of the AWS design principles highlight that AWS has many services to fulfill many different needs. The guidance for choosing the correct database solutions is not to say that you must standardize on one database for your application; quite the opposite. In a previous life, with on-premises enterprise IT, I was told that the database platform for critical production was Oracle. For non-critical systems, you could choose to use Microsoft SQL Server. There were only two database platforms (both relational), no matter what technical requirements came from your application. On AWS, it is easy to choose a suitable database for each section of data that your application requires. There are at least seven different database services on AWS, relational or not, transactional or analytical. There are plenty of options. There are even options that are specialized for recommendations or transaction immutability. Many of these databases are serverless, so you only pay for what you use rather than hourly charges for performance capacity that you may not be using. When the database is delivered as a service, there is a far lower cost to add a different database type to your application. On-premises, you would need a team to support the new platform, which might take months and cost thousands. Database as a service allows application teams to choose the right database platform for their requirements and to have multiple different database platforms within one application.
Before choosing a database solution, you need to understand your data structure and quantity and what you will do with that data. A few dozen gigabytes of data that you will use for ad-hoc monthly reporting (SQL, probably RDS) is a very different proposition to storing user profiles (DynamoDB) and high scores (ElastiCache with Redis) for millions of online gamers. The online game needs both a scale-out SSD-based JSON database for profiles and a RAM-based database for high scores. The application stores different information about the same people in different databases. Without that choice of databases, it is common to bend one database to multiple separate uses and find that it does a poor job. AWS makes it simpler to use the correct database type for the different data that your application requires. Choose the right database solutions.
Most of the AWS design principles are about using the unique features and limitations of the AWS platform. With on-premises enterprise infrastructure, applications can assume that the infrastructure is perfect and will handle failures without the application knowing. The result is that a single server delivering an application is an acceptable solution, because features such as vMotion and vSphere HA will keep the application operational. On AWS, applications must expect the infrastructure to fail and must continue to deliver services when there is a failure. On AWS, there is no equivalent to vMotion or HA; your application architecture must ensure service availability. It is uncommon, but not unknown, for the EC2 service to fail for an entire AZ or to have network or storage issues that affect some or all of an AZ. If you have a single EC2 instance as a server, any of these outages means your application is offline. The best practice is to have your application spread across multiple AZs and abstracted by a multi-AZ (regional) service.
As I’m sure you know, VMware has been making a big move into networking in the last few years. The acquisition of VeloCloud in 2017 added WAN capabilities to the data center networking of NSX, which came from the Nicira acquisition in 2012. I learned a lot about the newly renamed VMware SD-WAN solution when we did a Build Day TV series last year. I remembered from the original news that there is custom on-premises hardware (Edge device) and a cloud-based management platform (Orchestrator). The element that I was not aware of is the forwarding plane (Gateway), which can be a shared service cloud platform operated by VMware or enabled on a high-spec Edge device, and can be augmented with distributed peer-to-peer connections amongst Edge devices. As you probably know, I like policy-based management, and VMware SD-WAN is all about policies that are applied to groups of Edge devices while still allowing overrides and location-specific configuration for each device. There are a few more advanced use cases covered too: using an AWS EC2 instance as an Edge to provide SD-WAN into your VPC, and using cloud or on-Edge device network security services.
Here’s the list of Build Day TV videos where Rohan Naggi explains the solution and implementation to Jeffrey and me.
The beginners Guide to VMware SD-WAN
Unbox and Set Up VMware SD-WAN Locations
Cloud VPN and Routing of Your VMware SD-WAN
VMware SD-WAN Application Performance
Intrinsic Security with VMware SD-WAN
The next of our design principles on AWS is loose coupling. A part of this idea is to reduce the blast radius for problems in your application. Another aspect is to define and simplify the connection between parts of your application. An example of loose coupling is using a message queue between a website where customers place orders and a manufacturing plant that fulfills the orders. Once the website puts the order details in a message on the queue, the web server does not need to check that the factory is progressing the order. If the website is down, the factory can still process orders from the queue, and the website can take orders while the factory is closed. You might even use two queues, one for high priority orders and another for lower priority (possibly discounted) orders that are only processed if the high priority order queue is empty. Another example of loose coupling is using a load balancer in front of a farm of web servers. Clients connect to the load balancer, which directs them to a specific web server. The web servers may come and go when demand fluctuates or when updates are required, but the load balancer remains. In this way, we are decoupling access to the web servers from knowledge about individual servers.
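The two-queue idea can be sketched in a few lines. This is an illustration only: the in-memory `deque` objects stand in for real message queues such as SQS, and the names `place_order` and `next_order` are hypothetical.

```python
from collections import deque
from typing import Optional

# In-memory stand-ins for two message queues (assumption: a real system
# would use a durable queue service instead).
high_priority: deque = deque()
low_priority: deque = deque()


def place_order(order: dict, priority: bool = False) -> None:
    """Website side: put the order on a queue and return immediately."""
    (high_priority if priority else low_priority).append(order)


def next_order() -> Optional[dict]:
    """Factory side: take low priority work only when high priority is empty."""
    if high_priority:
        return high_priority.popleft()
    if low_priority:
        return low_priority.popleft()
    return None


# The website takes orders without knowing whether the factory is running.
place_order({"id": 1, "item": "widget"})
place_order({"id": 2, "item": "gadget"}, priority=True)
place_order({"id": 3, "item": "gizmo"})

# Later, the factory drains the queues at its own pace.
processed = []
while (order := next_order()) is not None:
    processed.append(order["id"])
print(processed)  # [2, 1, 3]
```

Neither side calls the other directly; the only thing they share is the queue, which is the point of the pattern.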
Several AWS services are specifically designed as loose coupling mechanisms; the Simple Queue Service (SQS) and Elastic Load Balancing (ELB) are the two I have already alluded to in the examples. You can also use services like S3 to loosely couple parts of your application; one part generates an object, and the other responds to the S3 event for the new object. The API Gateway service is another excellent loose coupler and provides a consolidated location to access multiple parts of your application. You might use API Gateway in front of a web server when the web application is replaced or enhanced with Lambda functions. The API Gateway path remains the same when you move a part of your application from the web server to a Lambda function. Even Lambda has its own loose coupling; you can use a Lambda alias to launch a specific version of a Lambda function and move the alias to a new version of the function. You can have different aliases for test and production on each Lambda function.
Loose coupling is not just about using these services; it is also about how you handle faults. One component failing should not prevent the rest of your application from working, albeit with impaired function. For example, your web page might have a list of products taken from a catalogue and a stock level for each product taken from the inventory system. Suppose the inventory system is offline for some reason, but the catalogue is still available. In that case, your website should still list the products even though it cannot show stock availability. Once the inventory system returns, so does the availability information.
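The catalogue and inventory example might look something like the sketch below. Everything here is hypothetical (the product data, the function names, and the use of `ConnectionError` to represent an offline inventory system); it only illustrates the idea of degrading rather than failing.

```python
from typing import Callable, Optional

# Hypothetical catalogue data; in a real application this comes from a service.
CATALOGUE = {"sku-1": "Kettle", "sku-2": "Toaster"}


def inventory_online(sku: str) -> int:
    """Stand-in for a healthy inventory system."""
    return {"sku-1": 12, "sku-2": 0}[sku]


def inventory_offline(sku: str) -> int:
    """Stand-in for an inventory system that is down."""
    raise ConnectionError("inventory system unavailable")


def product_listing(get_stock: Callable[[str], int]) -> list:
    """List every product; stock is None when the inventory lookup fails."""
    listing = []
    for sku, name in CATALOGUE.items():
        stock: Optional[int]
        try:
            stock = get_stock(sku)
        except ConnectionError:
            stock = None  # degrade: still show the product, without availability
        listing.append({"sku": sku, "name": name, "stock": stock})
    return listing


print(product_listing(inventory_online))
print(product_listing(inventory_offline))
```

With the inventory system offline, the page still lists both products and simply omits the stock level; when it comes back, the same code shows availability again.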
I wrote before about New Zealand being like a boy in a bubble; we are still in our bubble. The hardest thing is that most of my friends and clients are not in this bubble. Since most of my work is for companies and with people outside New Zealand, I have been doing a lot more remote work and missing my previous life. What I really miss are the week-long projects from the last eight years: projects I organize, like vBrownBag TechTalks and Build Day Live events, or ones that I attend, like Tech Field Day. These projects, where a small team travels, assembles, and then works hard for a week before dissolving back to real life, have been a part of my world since 2011 and have stopped since travel became restricted. I really miss the excitement of a time-limited shared objective. Being an introvert by nature, I am comfortable being at home with Tracey and our cats. I simply miss the shared objective and short project team. Hopefully this year we will see widespread vaccination and the end of the requirement for our New Zealand bubble. Maybe I will get to share meals with my short-term project teams; I really hope so.
Disaster recovery is like insurance; paying for it hurts, but not having it when you need it hurts even more. The trick is to have the right insurance for your situation and risks, and the right DR too. The reason that the spending hurts is that there is no value at the time you spend. The value comes when you claim on your insurance and have to fail over to your DR site. So how might you reduce the cost of your insurance without compromising your ability to claim (fail over in a disaster) later? Using a Disaster Recovery as a Service (DRaaS) platform can reduce your cost while still allowing rapid testing and recovery. I was briefed about the technology under the VMware DRaaS product by Datrium before VMware acquired them to get this same technology. More recently, VMware presented DRaaS at Tech Field Day 22, and I was delighted to see presenters who worked with me on the Build Day Live with Datrium way back in 2017. I must be getting old; it doesn’t seem so long ago.
One of the most exciting projects I worked on last year was the CTOAdvisor Datacentre project around using vSphere-on-Cloud services. The project was for the Oracle Cloud VMware Service (OCVS); they paid for the analysis. We started with the premise (and reality) of an overloaded on-premises infrastructure and a need to rapidly expand capacity to enable a large and sudden work-from-home requirement. The project as a whole looked at the Oracle service plus both VMConAWS and the Google vSphere service. We were unable to test the Azure vSphere service as it was transitioning from the platform built by CloudSimple to a Microsoft developed platform. The short answer is that the Oracle solution’s unique part is that it is not a managed service; you get complete control and responsibility. The other notable aspect is that on OCVS, the vSphere network is far better integrated with the cloud-native network than the other cloud vSphere platforms. My role was primarily around the Build Day TV videos we made, with Thom Greene being the hands-on technical expert. The videos show the practical, technical details of extending on-premises vSphere into OCVS.
Recycling is a good strategy for the physical world and also for your AWS resources. When you have an automated process to deploy parts of your application, you can often use that same automation to rebuild broken pieces rather than troubleshooting the failure. Naturally, there are parts of your applications that cannot simply be destroyed and replaced; there is always valuable persistent data somewhere. The design skill is to separate that data from the rest of your application components and handle the persistent and disposable parts differently. Later I will look at using managed services for that persistent data; for now, we will look at the disposable portion.
The company name VirtualMetric might lead you to believe that the product is all about the performance numbers. While the product definitely has plenty of metrics, I don’t think that is the biggest differentiator. What stood out for me in the demo was how much more information was gathered and available in the highly configurable console. There was both slow-changing information, such as installed applications, patches, and operating system configuration, and faster-changing details such as running processes, network connections, and performance metrics. The main dashboard views are infinitely customizable; the nicest customization is assembling a dashboard from any of the information collected and sharing that with other teams working on the same issue. There is also a mechanism to cycle through a series of dashboards on a timer, ideal for a large display in an operations center. Data collection is agentless; the VirtualMetric server pulls data across your network periodically. Static data such as installed applications is retrieved daily, log information more frequently, and select performance metrics as often as every second. Agentless data collection means that there is no requirement to deploy anything new onto your servers, but it limits data collection to what is published by the server’s operating system. The second challenge with agentless monitoring is that it has a higher network load, so different retrieval intervals for different data are essential.
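To illustrate why per-data-type intervals matter, here is a minimal sketch of a poller that decides which collections are due. This is not VirtualMetric’s implementation; the category names and the 300-second log interval are assumptions, and only the daily and per-second figures come from the description above.

```python
# Retrieval interval per data type, in seconds.
INTERVALS_SECONDS = {
    "installed_applications": 86_400,  # static data, retrieved daily
    "logs": 300,                       # assumed value for "more frequently"
    "performance_metrics": 1,          # as often as every second
}


def due_collections(now: int, last_run: dict) -> list:
    """Return the data types whose interval has elapsed since their last pull."""
    due = []
    for name, interval in INTERVALS_SECONDS.items():
        previous = last_run.get(name)
        # Never collected before, or the interval has fully elapsed.
        if previous is None or now - previous >= interval:
            due.append(name)
    return due


# Everything was last collected at t=0.
last_run = {name: 0 for name in INTERVALS_SECONDS}
print(due_collections(1, last_run))       # ['performance_metrics']
print(due_collections(300, last_run))     # ['logs', 'performance_metrics']
print(due_collections(86_400, last_run))  # all three data types
```

Pulling the slow-changing data only once a day keeps most polling cycles small, which is how an agentless collector can limit the extra network load.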
I like that VirtualMetric provides a single consolidated location to find a lot of information about my infrastructure and applications. Because there is so much data, having customizable dashboards is good. As a customer, be sure to have subject matter experts author dashboards that enable less specialist staff to identify and resolve issues. I would like the product to do more with that information: some analytics that proactively identify possible problems and provide me with recommended remediations. Ideally, I would like the product to allow me to simply accept the remediation, probably with integration into the corporate change management system. VirtualMetric has remediation actions on its roadmap, although that is over a year away. Right now, the product is read-only and will not make changes to the monitored network.