Most of the AWS design principles are about using the unique features and limitations of the AWS platform. With on-premises enterprise infrastructure, applications can assume that the infrastructure is perfect and will handle failures without the application knowing. The result of this enterprise infrastructure is that it is an acceptable solution to have a single server that delivers an application, features such as VMotion and vSphere HA will ensure the application is operational. On AWS, applications must expect the infrastructure to fail and must continue to deliver services when there is a failure. On AWS, there is no equivalent to VMotion or HA; your application architecture must ensure service availability. It is uncommon, but not unknown, for the EC2 service to fail for an entire AZ or to have network or storage issues that affect some or all of an AZ. If you have a single EC2 instance as a server, any of these outages means your application is offline. The best practice is to have your application spread across multiple AZs and abstracted by a multi-AZ (regional) service.
As I’m sure you know, VMware has been making a big move into networking in the last few years. The acquisition of VeloCloud in 2017 added WAN capabilities to the data center networking of NSX, from the Nicira acquisition in 2012. I learned a lot about the newly renamed VMware SD-WAN solution when we did a Build Day TV series last year. I remembered from the original news, that there is custom on-premises hardware (Edge device) and a cloud-based management platform (Orchestrator). The element that I was not aware of is the forwarding plane (Gateway) that can be a shared service cloud platform operated by VMware or enabled on a high spec Edge device and can be augmented with distributed peer-to-peer connections amongst Edge devices. As you probably know, I like policy-based management and the VMware SD-WAN is all about policies that are applied to groups of Edge devices while still allowing overrides and location-specific configuration for each device. There are a few more advanced use-cases covered too; using an AWS EC2 instance as an edge to provide SD-WAN into your VPC and using cloud on on-Edge device network security services.
Here’s the list of Build Day TV videos where Rohan Naggi explains the solution and implementation to Jeffrey and me.
The beginners Guide to VMware SD-WAN
Unbox and Set Up VMware SD-WAN Locations
Cloud VPN and Routing of Your VMware SD-WAN
VMware SD-WAN Application Performance
Intrinsic Security with VMware SD-WAN
The next of our design principles on AWS is loose coupling. A part of this idea is to reduce the blast radius for problems in your application. Another aspect is to define and simplify the connection between parts of your application. An example of loose coupling is using a message queue between a web site where customers place orders and a manufacturing plant that fulfills the orders. Once the web site puts the order details in a message on the queue, the webserver does not need to check that the factory is progressing the order. If the website is down, the factory can still process orders from the queue, and the website can take orders while the factory is closed. You might even use two queues, one for high priority orders and another for lower priority (possibly discounted) orders that are only processed if the high priority order queue is empty. Another example of loose coupling is using a load balancer in front of a farm of web servers. Clients connect to the load balancer, which directs them to a specific web server. The web servers may come and go when demand fluctuates or when updates are required, but the load balancer remains. In this way, we are decoupling access to the web servers from knowledge about individual servers.
Several AWS services specifically designed as loose coupling mechanisms, the Simple Queue Service (SQS) and Elastic Load Balancer (ELB), are the two I have already alluded to in the examples. You can also use services like S3 to loosely couple parts of your application; one part generates an object, and the other responds to the S3 event for the new object. The API Gateway service is another excellent loose coupler and allows a consolidated location to access multiple parts of your application. You might use API Gateway in front of a web server when the web application is replaced or enhanced with Lambda functions. The API gateway path remains the same when you move a part of your application from the webserver to a lambda function. Even Lambda has its own loose coupling; you can use a Lambda alias to launch a specific version of a Lambda function and move the alias to a new version of the function. You can have different aliases for test and production on each Lambda function.
Loose coupling is not just about using these services; it is also about how you handle faults. One component failing should not prevent the rest of your applications from working, although with impaired function. For example, your web page might have a list of products taken from a catalogue and a stock level for each product taken from the inventory system. Suppose the inventory system is offline for some reason, but the catalogue is still available. In that case, your web site should still list the products even though it cannot show stock availability. Once the inventory system returns, so does the availability information.
I wrote before about New Zealand being like a boy in a bubble, we are still in our bubble. The hardest thing is that most of my friends and clients are not in this bubble. Since most of my work is for companies and with people outside New Zealand, I have been doing a lot more remote work and missing my previous life. What I really miss are the week-long projects from the last eight years. Projects I organize like vBrownBag TechTalks and Build Day Live events or ones that I attend like Tech Field Day. These projects, where a small team travels, assemble, and then works hard for a week before dissolving back to real life, have been a part of my world since 2011 and have stopped since travel became restricted. I really miss the excitement of a time-limited shared objective. Being an introvert by nature, I am comfortable being at home with Tracey and our cats. I simply miss the shared objective and short project team. Hopefully this year we will see widespread vaccination and the end of the requirement for our New Zealand bubble. Maybe I will get to share meals with my short term project teams, I really hope so.
Disaster recovery is like insurance; paying for it hurts but not having it when you need it hurts even more. The trick is to have the right insurance for your situation and risks and the right DR too. The reason that the spending hurts is that there is no value at the time you spend. The value comes when you claim on your insurance and have to failover to your DR site. So how might you reduce the cost for your insurance without compromising your ability to claim (failover in a disaster) later? Using a Disaster Recovery as a Service (DRaaS) platform can reduce your cost while still allowing rapid testing and recovery. I was briefed about the technology under the VMware DRaaS product by Datrium before VMware acquired them to get this same technology. More recently, VMware presented DRaaS at Tech Field Day 22, and I was delighted to see presenters who worked with me on the Build Day Live with Datrium way back in 2017. I must be getting old; it doesn’t seem so long ago.
One of the most exciting projects I worked on last year was the CTOAdvisor Datacentre project around using vSphere-on-Cloud services. The project was for the Oracle Cloud VMware Service (OCVS); they paid for the analysis. We started with the premise (and reality) of an overloaded on-premises infrastructure and a need to rapidly expand capacity to enable a large and sudden work-from-home requirement. The project as a whole looked at the Oracle service plus both VMConAWS and the Google vSphere service. We were unable to test the Azure vSphere service as it was transitioning from the platform built by CloudSimple to a Microsoft developed platform. The short answer is that the Oracle solution’s unique part is that it is not a managed service; you get complete control and responsibility. The other notable aspect is that on OCVS, the vSphere network is far better integrated with the cloud-native network than the other cloud vSphere platforms. My role was primarily around the Build Day TV videos we made, with Thom Greene being the hands-on technical expert. The videos show the practical, technical details of extending on-premises vSphere into OCVS.
Recycling is a good strategy for the physical world and also for your AWS resources. When you have an automated process to deploy parts of your application, you can often use that same automation to rebuild broken pieces rather than troubleshooting the failure. Naturally, there are parts of your applications that cannot simply be destroyed and replaced; there is always valuable persistent data somewhere. The design skill separates that data from the rest of your application components and handles the persistent and disposable parts differently. Later I will look at using managed services for that persistent data; now, we will look at the disposable portion.
The company name VirtualMetric might lead you to believe that the product is all about the performance numbers. While the product definitely has plenty of metrics, I don’t think that is the biggest differentiator. What stood out for me in the demo was how much more information was gathered and available in the highly configurable console. There was both slow-changing information, such as installed applications, patches, and operating system configuration, as well as faster-changing details such as running processes, network connections, and performance metrics. The main dashboard views are infinitely customizable; the nicest customization is assembling a dashboard from any of the information collected and sharing that with other teams working on the same issue. There is also a mechanism to cycle through a series of dashboards on a timer, ideal for a large display in an operations center. Data collection is agentless; the VirtualMetric server pulls data across your network periodically. Static data such as installed applications is retrieved daily, log information more frequently, and select performance metrics as often as every second. Agentless data collection means that there is no requirement to deploy anything new onto your servers but limit data collection to what is published by the server’s operating system. The second challenge with agentless monitoring is that it has a higher network load, so different retrieval intervals for different data are essential.
I like that VirtualMetric provides a single consolidated location to find a lot of information about my infrastructure and applications. Because there is so much data, having customizable dashboards is good. As a customer, be sure to have subject matter experts author dashboards that enable less specialist staff to identify and resolve issues. I would like the product to do more with that information, some analytics that proactively identifies possible problems and provides me recommended remediations. Ideally, I would like the product to allow me to simply accept the remediation, probably with integration into the corporate change management system. VirtualMetric has remediation actions on its roadmap, although that is over a year away. Right now, the product is read-only and will not make changes to the monitored network.
In a long distant former life, I looked after a farm of around a hundred Citrix servers. It was so long ago that they were physical servers, and we built or rebuilt servers following dozens of pages of written instructions. You can imagine that there were plenty of helpdesk calls for faults in the build of individual servers. This environment typifies the “handcrafted perfection” that Enterprise IT operations teams had to build. Even back in 2001, I created an automated build process of these servers to avoid manual builds. Manual processes work at the speed of humans, and they are full of human errors. To work faster and smarter, we need methods that are protected from human errors. Operational processes must be executed precisely the same every time. With manual processes, you must get it right every time. With automation, you only need to get it right once. Automating build and deployment is an excellent start as it will deliver consistent and reliable infrastructure on demand.
Ideally, infrastructure automation should be more like software development and use a declarative configuration service to implement the specification in a design file. The design file is the source code for the built environment and is stored in version control, just like the source code for the application. Once you can deploy an environment automatically, the environment is automatically recreatable. You might recreate an environment that broke rather than troubleshoot the problem. You might recreate rather than restore from backups. You should also create a copy for testing, for both changes to the environment and changes to the deployed software.
With deployment automated, it becomes easy to build an environment for testing, then dispose of that environment when the testing is complete. I don’t mean at the end of the project, but at the end of the test. When we want to implement Continuous Integration and Continuous Delivery or Deployment (CI/CD), we need to run tests for each code change made by a developer. Without build automation, these tests simply cannot happen at the pace demanded by CI/CD methodologies. There is a strong argument to be made for keeping the infrastructure design file alongside the application source code, a single source location for both the application and the infrastructure that the application requires.
Infrastructure automation aims to increase the consistency and velocity of operations. Once the infrastructure lifecycle is automated, you unlock the ability to automate application lifecycles. With the ability to innovate in applications safely, business agility is easier and more reliable, providing greater value for overall IT spend.