The design principle of using caching is not unique to AWS; it is a common application design principle. A cache is a temporary storage location for a small amount of data that improves application performance. Sometimes the cache is distributed around the world to be close to users, in which case it might be called a content delivery network. Other times the cache is simply extra memory in your web or application servers that holds status information about currently active users. The idea that a cache is temporary is important: it is not the persistent storage location for the data. If the cache is lost, the data in it can be re-created from a persistent location. The idea that the cache is not the definitive source is also important: data in the cache is a copy of the persistent data at some point in the past. Some data can stay in the cache for a long time; the temperature recorded at 10 am yesterday will never change. The current temperature will change, so it shouldn’t be kept in the cache for long; it should have a short Time To Live (TTL).
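To make the TTL idea concrete, here is a minimal in-memory sketch: each cached entry carries an expiry time, and an expired entry is treated as a miss so the caller falls back to persistent storage. The class and method names are mine, not from any library.

```python
import time

class TTLCache:
    """A minimal illustrative cache: each entry expires after its TTL."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds):
        # Stable data (yesterday's temperature) can take a long TTL;
        # volatile data (the current temperature) should take a short one.
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # miss: fall back to the persistent store
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value
```

A caller would do `cache.put("temp@now", 21.0, ttl_seconds=60)` for volatile data and a much longer TTL for data that never changes.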
There are a few trigger points for considering a cache, usually centered on needing more application performance for transactional (rather than analytics) workloads. If increasing the performance of the database or other persistence tier seems inefficient, or you feel you are not getting consistent value for money, then caching might be a good option. This can also be a trigger for considering a different database for a subset of the data, which I mentioned in a previous principle. As an infrastructure person, I am used to providing transparent caches, where the application code is unaware of the cache. But software developers often use explicit caches, where the application code chooses what data to place in the cache and when to update or remove cached data. On AWS, the ElastiCache service provides RAM-based caching that developers can choose to use within their applications. Because it is an explicit cache, the application developer chooses what data to cache and whether to write to the cache on database updates or only on reads. It takes a lot of developer effort to get the most out of ElastiCache, but the potential performance improvement is huge.
Caching is an important tool for improving application performance everywhere from the end access device (a user’s laptop or phone), through the application servers, to the persistent storage at the back. Efficient use of caching does require good design, and the more application awareness you bring to that design, the more efficiently you can use the expensive cache. Allocating excess RAM to application servers is a simple but inefficient way to provide caching, particularly for applications that you cannot get rewritten.
Are databases too slow for your application? I don’t mean a poorly tuned database where queries take minutes to complete; I mean, is an optimized database still too slow for the rate at which things happen? That brings you into the stream-processing world, where data arrives very fast and you need to make decisions quickly and act on them immediately. One example is credit card processing, where instant fraud identification can prevent transactions from being approved. Another is real-time cyber-threat analytics, where every request to a website or application is validated before acceptance. In both cases, there is a massive number of transactions to monitor and complex scoring required within the allowed latency for the transaction. This is the space where Hazelcast plays, unifying large amounts of slower-changing data with fast-arriving streamed data. The slower-changing data might be machine learning models and reference data, which are then used to evaluate the faster-arriving data stream. This is not an infrastructure feature; it is an application platform service. To use Hazelcast, your application will be developed using the Hazelcast SDK. There will be fast infrastructure to support your Hazelcast application: fast networks and powerful servers. The architecture is a grid or cluster: several servers working together in a distributed architecture to provide a memory-first database and stream engine.
I heard from Hazelcast at Cloud Field Day 11; they have presented at several Tech Field Day events. My usual Tech Field Day disclaimer applies. If you have a big problem and a standard database just isn’t fast enough, maybe Hazelcast can handle your data rate.
This AWS design principle is based on the financial reality of using cloud services. The magic of AWS is that you can use as much or as little resource as you want and only pay for what you use. The tragedy of AWS is that every month you get charged for (more or less) every piece of resource that you use. Optimizing for cost is not about minimizing the amount you spend. Closing your AWS account will reduce the bill, but at what impact on the business? The objective is to get as much business value as possible and only pay for things that deliver business value.
My favourite quote from Werner Vogels is, “Everything fails, all the time.” One of the AWS design principles is to understand where things fail and to prevent a failure from stopping your application doing its job. The guidance from AWS is to avoid Single Points of Failure (SPOF). I don’t believe you can eliminate every SPOF, so you should understand and accept the SPOFs that remain. This principle is related to the previous principles of designing services, automating, and using disposable resources. It adds awareness of the reality that every AWS service has a scope and may fail at that scope. EC2 is scoped to the Availability Zone (AZ), and a single EC2 instance is susceptible to failure within its AZ. We use autoscaling groups and elastic load balancing to remove the AZ as a SPOF, and now the regional services are our SPOF. While it is unusual for a regionally scoped AWS service to fail, these services can and have failed in the past. To eliminate a region as a SPOF, you use a global service such as Route 53 to distribute application access across multiple regions, with load balancers and autoscaling groups in each region.
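The failover logic that Route 53 health checks automate can be sketched in a few lines: try the primary region’s endpoint and fall back to the next one when it fails. The region names and handler functions here are hypothetical stand-ins, not real endpoints.

```python
def serve(request, endpoints):
    """Try each (region, handler) pair in order; first healthy one wins."""
    for region, handler in endpoints:
        try:
            return handler(request)   # this region answered
        except ConnectionError:
            continue                  # region down: fail over to the next
    # If every region is down, the list itself is the remaining SPOF.
    raise RuntimeError("all regions failed")

def us_east_1(request):
    raise ConnectionError("regional outage")  # simulate a failed region

def eu_west_1(request):
    return f"served {request} from eu-west-1"
```

Calling `serve("order-1", [("us-east-1", us_east_1), ("eu-west-1", eu_west_1)])` returns the answer from the second region, which is the whole point: the caller never knows the first region was down.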
The problem is that each time we eliminated a SPOF, we at least doubled our cost and complexity. That additional cost and complexity are precisely why we may choose to leave a SPOF in place; eliminating the SPOF may cost more than the outage it could cause. It may also be that the nature of the business is its own SPOF; a company that operates in one city may not suit failover to another AWS region. For each SPOF, you will need to identify the cost of elimination and the risk of failure. Everything fails, all the time. Make sure you know which single points of failure might cause your application to die, and that the business (not IT) accepts the risk of the possible outage.
Some of the AWS design principles highlight that AWS has many services to fulfill many different needs. The guidance on choosing the correct database solutions does not say that you must standardize on one database for your application; quite the opposite. In a previous life, in on-premises enterprise IT, I was told that the database platform for critical production was Oracle; for non-critical, you could choose Microsoft SQL Server. There were only two database platforms (both relational), no matter what technical requirements came from your application. On AWS, it is easy to choose a suitable database for each section of data that your application requires. There are at least seven different database services on AWS, relational or not, transactional or analytical. There are plenty of options, including some specialized for recommendations or transaction immutability. Many of these databases are serverless, so you pay only for what you use rather than hourly charges for performance capacity that you may not be using. When the database is delivered as a service, there is a far lower cost to add a different database type to your application. On-premises, you would need a team to support the new platform, which might take months and cost thousands. Database as a service allows application teams to choose the right database platform for their requirements and to have multiple different database platforms within one application.
Before choosing a database solution, you need to understand the structure and quantity of your data and what you will do with it. A few dozen gigabytes of data used for ad-hoc monthly reporting (SQL, probably RDS) is a very different proposition from storing user profiles (DynamoDB) and high scores (ElastiCache for Redis) for millions of online gamers. The online game needs both a scale-out, SSD-based JSON database for profiles and a RAM-based database for high scores; the application stores different information about the same people in different databases. Without this choice of databases, it is common to bend one database to multiple separate uses and find that it does a poor job. AWS makes it simpler to use the correct database type for the different data that your application requires. Choose the right database solutions.
Most of the AWS design principles are about using the unique features and limitations of the AWS platform. With on-premises enterprise infrastructure, applications can assume that the infrastructure is perfect and will handle failures without the application knowing. The result is that a single server delivering an application is an acceptable solution on-premises; features such as vMotion and vSphere HA will keep the application operational. On AWS, applications must expect the infrastructure to fail and must continue to deliver services when there is a failure. There is no equivalent to vMotion or HA on AWS; your application architecture must ensure service availability. It is uncommon, but not unknown, for the EC2 service to fail for an entire AZ, or to have network or storage issues that affect some or all of an AZ. If you have a single EC2 instance as a server, any of these outages means your application is offline. The best practice is to spread your application across multiple AZs, abstracted by a multi-AZ (regional) service.
As I’m sure you know, VMware has been making a big move into networking over the last few years. The acquisition of VeloCloud in 2017 added WAN capabilities to the data center networking of NSX, which came from the Nicira acquisition in 2012. I learned a lot about the newly renamed VMware SD-WAN solution when we did a Build Day TV series last year. I remembered from the original news that there is custom on-premises hardware (the Edge device) and a cloud-based management platform (the Orchestrator). The element I was not aware of is the forwarding plane (the Gateway), which can be a shared-service cloud platform operated by VMware or enabled on a high-spec Edge device, and can be augmented with distributed peer-to-peer connections amongst Edge devices. As you probably know, I like policy-based management, and VMware SD-WAN is all about policies that are applied to groups of Edge devices while still allowing overrides and location-specific configuration for each device. A few more advanced use cases are covered too: using an AWS EC2 instance as an Edge to provide SD-WAN into your VPC, and using cloud or on-Edge-device network security services.
Here’s the list of Build Day TV videos where Rohan Naggi explains the solution and implementation to Jeffrey and me.
The Beginner’s Guide to VMware SD-WAN
Unbox and Set Up VMware SD-WAN Locations
Cloud VPN and Routing of Your VMware SD-WAN
VMware SD-WAN Application Performance
Intrinsic Security with VMware SD-WAN
The next of our AWS design principles is loose coupling. Part of this idea is to reduce the blast radius of problems in your application; another part is to define and simplify the connections between parts of your application. An example of loose coupling is using a message queue between a website where customers place orders and a manufacturing plant that fulfills the orders. Once the website puts the order details in a message on the queue, the web server does not need to check that the factory is progressing the order. If the website is down, the factory can still process orders from the queue, and the website can take orders while the factory is closed. You might even use two queues: one for high-priority orders and another for lower-priority (possibly discounted) orders that are only processed when the high-priority queue is empty. Another example of loose coupling is a load balancer in front of a farm of web servers. Clients connect to the load balancer, which directs them to a specific web server. The web servers may come and go as demand fluctuates or when updates are required, but the load balancer remains. In this way, we decouple access to the web servers from knowledge about individual servers.
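The two-queue ordering idea above can be sketched with standard-library queues standing in for SQS; the function and order names are illustrative, not a real integration:

```python
from queue import Queue, Empty

high, low = Queue(), Queue()   # priority orders and discounted orders

def website_place_order(order, priority=False):
    # The website drops the order on a queue and moves on; it never
    # needs to know whether the factory is running right now.
    (high if priority else low).put(order)

def factory_next_order():
    """Drain high-priority orders first; take a low-priority order
    only when the high-priority queue is empty."""
    for q in (high, low):
        try:
            return q.get_nowait()
        except Empty:
            continue
    return None   # nothing to build right now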
Several AWS services are specifically designed as loose-coupling mechanisms; the Simple Queue Service (SQS) and Elastic Load Balancer (ELB) are the two I have already alluded to in the examples. You can also use services like S3 to loosely couple parts of your application: one part generates an object, and the other responds to the S3 event for the new object. The API Gateway service is another excellent loose coupler and provides a consolidated location to access multiple parts of your application. You might use API Gateway in front of a web server when the web application is being replaced or enhanced with Lambda functions; the API Gateway path remains the same when you move a part of your application from the web server to a Lambda function. Even Lambda has its own loose coupling: you can use a Lambda alias to launch a specific version of a Lambda function and move the alias to a new version of the function. You can have different aliases for test and production on each Lambda function.
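A toy model of the alias idea: callers invoke by alias name, and moving the alias to a new version is a metadata change, not a redeploy. The version numbers, alias names, and handlers are all illustrative, not the real Lambda API.

```python
# Two deployed "versions" of a function, and aliases that point at them.
versions = {1: lambda event: "v1 handled " + event,
            2: lambda event: "v2 handled " + event}
aliases = {"prod": 1, "test": 2}   # test already points at the new code

def invoke(alias, event):
    # Callers only ever know the alias, never the version number.
    return versions[aliases[alias]](event)

def promote(alias, version):
    aliases[alias] = version       # "move the alias" to a new version
```

Because callers are coupled to the alias rather than the version, `promote("prod", 2)` cuts production over to the new code with no change on the calling side, and pointing the alias back is an equally cheap rollback.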
Loose coupling is not just about using these services; it is also about how you handle faults. One component failing should not prevent the rest of your application from working, albeit with impaired function. For example, your web page might show a list of products taken from a catalogue and a stock level for each product taken from the inventory system. Suppose the inventory system is offline for some reason, but the catalogue is still available. In that case, your website should still list the products even though it cannot show stock availability. Once the inventory system returns, so does the availability information.
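That degraded-but-working behaviour can be sketched as a fallback around the failing dependency; the service functions here are hypothetical stand-ins for the catalogue and inventory systems.

```python
def get_stock(product, inventory_up):
    """Stand-in for the inventory service call."""
    if not inventory_up:
        raise ConnectionError("inventory service offline")
    return 7   # pretend stock level

def render_product(product, inventory_up=True):
    """Render a catalogue entry, degrading gracefully without inventory."""
    try:
        stock = get_stock(product, inventory_up)
        return f"{product}: {stock} in stock"
    except ConnectionError:
        # Impaired, not broken: list the product without availability.
        return f"{product}: availability unknown"
```

The key design choice is that the exception is caught at the point where a sensible partial answer exists, rather than being allowed to take the whole page down.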
I wrote before about New Zealand being like a boy in a bubble; we are still in our bubble. The hardest thing is that most of my friends and clients are not in this bubble. Since most of my work is for companies and with people outside New Zealand, I have been doing a lot more remote work and missing my previous life. What I really miss are the week-long projects of the last eight years: projects I organize, like vBrownBag TechTalks and Build Day Live events, or ones that I attend, like Tech Field Day. These projects, where a small team travels, assembles, and then works hard for a week before dissolving back into real life, have been part of my world since 2011 and have stopped since travel became restricted. I really miss the excitement of a time-limited shared objective. Being an introvert by nature, I am comfortable being at home with Tracey and our cats; I simply miss the shared objective and the short project team. Hopefully this year will see widespread vaccination and the end of the requirement for our New Zealand bubble. Maybe I will get to share meals with my short-term project teams again; I really hope so.
Disaster recovery is like insurance: paying for it hurts, but not having it when you need it hurts even more. The trick is to have the right insurance for your situation and risks, and the right DR too. The spending hurts because there is no value at the time you spend; the value comes when you claim on your insurance or have to fail over to your DR site. So how might you reduce the cost of your insurance without compromising your ability to claim (fail over in a disaster) later? Using a Disaster Recovery as a Service (DRaaS) platform can reduce your cost while still allowing rapid testing and recovery. I was briefed on the technology behind the VMware DRaaS product by Datrium, before VMware acquired them for this same technology. More recently, VMware presented DRaaS at Tech Field Day 22, and I was delighted to see presenters who worked with me on the Build Day Live with Datrium way back in 2017. I must be getting old; it doesn’t seem so long ago.