AWS Design – Enable Scalability

One of the defining capabilities of public cloud is elasticity: the ability to use more or fewer resources over time to meet the load requirements of your application. When your application is quiet, you should consume and pay for fewer resources than when your application is busy. Not all AWS services have scalability built in; many require that you manage scalability yourself. Managed services like Lambda and Fargate manage capacity for you, delivering the resources that your workload requires. More lightly managed services, such as EC2 and RDS, leave scalability up to you, although they may provide tooling, such as Auto Scaling, that you can use.
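
For EC2, that tooling is EC2 Auto Scaling. Here is a minimal boto3 sketch of a group with a target-tracking policy that holds average CPU near 50%; the group name, launch template, and subnet IDs are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# An Auto Scaling group spread across two subnets (two AZs).
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=20,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
)

# Target tracking: add instances when average CPU rises above 50%,
# remove them when it falls, so you pay for capacity only while busy.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```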

Ten Design Principles on AWS

Having previously looked at some surprises I discovered as I learned about AWS, I’m going to take a look at some of the basic architectural design principles on AWS. As in the last series, there will be a blog post for each principle that goes into more detail. Here are the ten principles:

  1. Enable scalability: What happens if demand increases? Or doesn’t increase? What if demand goes up and down over time?
  2. Automate your environment: Computers are good at doing things the same way every time; humans are not.
  3. Use disposable resources: “Everything fails, all the time,” as Werner Vogels says. Replace broken things with brand new things, rather than spending a lot of time fixing them.
  4. Loosely couple your components: When one element of your application changes or has an issue, the rest of the application should still work.
  5. Design services, not servers: An EC2 instance should not be a single point of failure. Use several instances behind a load balancer or a queue (a minimal sketch follows this list).
  6. Choose the right database solutions: I don’t mean Microsoft SQL Server vs Oracle. I mean use the right database for the data you need to store; some data is better suited to non-relational databases.
  7. Understand your single points of failure: There are always SPOFs; make sure you know where they are and try to eliminate as many as possible.
  8. Optimize for cost: Your AWS bill will arrive every month, and you will pay for what you use. Make sure you are getting value for every dollar spent on that AWS bill.
  9. Use caching: Your data is not all of the same value or location, nor are resources all of the same cost. Caching uses a small amount of fast or nearby resource to serve the data you use most.
  10. Secure your infrastructure at every layer: “Dance like nobody’s watching, encrypt like everyone is,” as Werner Vogels says. By now, we should all understand that defense in depth is the only viable strategy.
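
As a minimal sketch of principles 4 and 5 together: a producer and a worker that share nothing but an SQS queue, so either side can fail, scale, or be replaced without the other noticing. The queue name and the process() function are hypothetical:

```python
import boto3

def process(body):
    # Hypothetical stand-in for real business logic.
    print("processing", body)

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="orders")["QueueUrl"]

# Producer: hand off the work and move on.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

# Worker: any one of several identical instances can pick this up.
response = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=10)
for message in response.get("Messages", []):
    process(message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```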

AWS Surprises – AWS Has Virtually Infinite Resources

Sometimes the AWS surprises are not so much about how AWS is different as about how you design solutions differently on AWS than on-premises. One of the significant differences is that you have a near-infinite amount of resources available on AWS, while on-premises, you are always aware of a finite resource limit. On-premises, your workload must fit inside those limited resources; on AWS, you can rent as much resource as your workload requires. One typical pattern on-premises is to defer reporting or bulk processing until off-peak hours, overnight when the office is empty. The office is never empty at AWS, so you might as well do that reporting or processing right away. The only time you might defer is if the Spot price for the EC2 instances you want is too high.

As an example, there are plenty of problems that we solve by throwing a lot of compute resource at them to get a timely answer. On-premises, we have a limited quantity of CPU time and RAM, and these resources (servers) have a lifespan of 3-5 years, so extra resources that will only be used for part of that life are expensive. On-premises, it is common to consume all of these limited resources for a long time to complete a complex task; we may have to wait hours or days for an answer. On AWS, we rent CPU time and RAM as EC2 instances and pay by the hour for what we use. On AWS, we can scale out just for the duration of the job and use maybe 50x as much resource to get an answer faster. There is no cost difference between using 5 EC2 instances for 100 hours and 250 EC2 instances for two hours, so scaling out massively is an option.
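
The arithmetic is easy to check: both options consume the same 500 instance-hours, so at a flat on-demand rate they cost the same; only the waiting differs. A trivial sketch with a hypothetical price:

```python
# Both plans burn the same instance-hours at a hypothetical
# on-demand rate of $0.10 per instance-hour.
price_per_hour = 0.10

slow = 5 * 100   # 5 instances x 100 hours = 500 instance-hours
fast = 250 * 2   # 250 instances x 2 hours  = 500 instance-hours

print(slow * price_per_hour)  # 50.0 dollars, answer in 100 hours
print(fast * price_per_hour)  # 50.0 dollars, answer in 2 hours
```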

Other near-infinite resources include storage, networking, and even application services. The Simple Storage Service (S3) allows you unlimited storage capacity and only charges you for what you actually store. The VPC network and its supporting features, such as ELB, provide colossal capacity that is available on-demand, and you are billed for consumption, not capacity. Even application services such as the Simple Queue Service (SQS) offer near-unlimited messages per second in a queue and only charge you for the transactions on that queue. There are a lot of AWS services that allow you to draw from a nearly limitless pool of resources and only pay for the resources that you use.

Capacity Is Never Infinite

One caveat is that while AWS has near-infinite capacity, there is always a finite amount, and, in some situations, that limited amount may not be as large as you might hope. When you start deploying unusual and new EC2 instance types, and particularly when you use them in their largest configurations, you may get Insufficient Compute Capacity Errors (ICCE, pronounced ice). Remember that each EC2 family and generation runs on its own dedicated physical servers: M5 instances only run on M5 servers, which in turn only run M5 instances. The larger the size within the family and the more instances you request, the more previously unused capacity is required. So, if you decide to deploy a cluster of six X1e.32XLarge instances across three availability zones, you may find that one of those AZs does not have two whole X1e hosts to dedicate to your cluster immediately. Hopefully, you have a good relationship with your local AWS team and can get this information before it causes you a problem. They may suggest that you use more, smaller instances, or that you will have a better result with a different region or a different EC2 family.
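
If you do hit the error at deployment time, one pragmatic response is to catch it and try the next AZ, or fall back to a different family. A sketch using boto3, with hypothetical subnet IDs and AMI; InsufficientInstanceCapacity is the error code EC2 returns in this situation:

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")
subnets_by_az = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]

for subnet_id in subnets_by_az:
    try:
        ec2.run_instances(
            ImageId="ami-12345678",       # hypothetical AMI
            InstanceType="x1e.32xlarge",  # a whole physical host per instance
            MinCount=1,
            MaxCount=1,
            SubnetId=subnet_id,
        )
        break  # capacity found in this AZ
    except ClientError as error:
        if error.response["Error"]["Code"] != "InsufficientInstanceCapacity":
            raise  # a different problem; don't mask it
        # No spare X1e capacity in this AZ right now; try the next one.
```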

If you had on-demand access to a virtually infinite amount of computing resources, how would your IT and business operate differently? On AWS, resources are available when you want them, and you pay for what you use each month. To get the best out of AWS, you should deploy the resources you need as you need them, and cast off the implicit limits of purchased on-premises IT.

AWS Surprises – You still need infrastructure architecture on AWS

It is a popular idea that “the cloud means I don’t have to care”; however, nothing could be further from the truth. It isn’t really an AWS surprise to me that infrastructure architecture is still essential for many customers on AWS. Naturally, there are many infrastructure elements that AWS manages; you don’t need to worry about racking and cabling servers, or power and cooling. You do still need to choose VM resources (EC2 instance families and sizes) for each application component. You do need to design the network connectivity and isolation when you put together a VPC. Applications that ran on-premises, which you migrate to AWS, will require cloud infrastructure that replicates the on-premises infrastructure.
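
That VPC work looks a lot like the address and subnet planning an infrastructure architect already does; it has just moved to an API. A minimal boto3 sketch, with hypothetical CIDR blocks and AZ names:

```python
import boto3

ec2 = boto3.client("ec2")

# Your design decisions: the address space, the subnet layout,
# and which AZ each subnet lives in.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]

ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.0.1.0/24",
                  AvailabilityZone="ap-southeast-2a")
ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.0.2.0/24",
                  AvailabilityZone="ap-southeast-2b")
```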

Similarly, applications built to on-premises architectures will require similar infrastructure on AWS. On-premises infrastructure architects can augment their skills to design infrastructure on AWS. As with any new platform, you will need to learn the capabilities and limitations of the AWS platform. You can find a few of the things I learned on my AWS Surprises page. One thing to prepare for is moving up the stack: expect to learn more about application and integration architecture as the infrastructure becomes more of a commodity.

No Infrastructure

Not everything on AWS requires conventional infrastructure; more serverless application components mean less infrastructure. It is entirely possible to build large and complex applications on AWS without requiring a single EC2 instance or subnet. You can assemble services like Lambda, DynamoDB, and API Gateway, and even older services like S3, SQS, and SNS, into a microservices-based application without a single VM. These services do not exist in on-premises enterprise datacentres, so only applications developed specifically for AWS will use them. With a fully serverless application, there is a large amount of application architecture to design rather than infrastructure architecture.
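
As an illustration, a whole application tier can be a few lines of code: a Lambda handler behind API Gateway that writes to a DynamoDB table, with no instance or subnet anywhere. The table name is hypothetical, and the event shape assumes an API Gateway proxy integration:

```python
import json
import boto3

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

def handler(event, context):
    order = json.loads(event["body"])  # API Gateway proxy request body
    table.put_item(Item=order)         # durable storage, no server to manage
    return {"statusCode": 201, "body": json.dumps({"ok": True})}
```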

Assumed Infrastructure

One thing to watch for is capabilities that on-premises infrastructure provides but that AWS does not deliver automatically. One example is data protection for backup/recovery, compliance, and disaster recovery. On AWS, these capabilities must be added to or configured for the services you use, where on-premises, they are often just a fundamental part of the infrastructure. Even if there is no infrastructure to design to support functional requirements, there are often non-functional requirements that the infrastructure team would usually handle.
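
Even something as routine as a nightly EBS snapshot is yours to arrange. A minimal boto3 sketch, assuming a hypothetical Backup=true tagging convention; in practice you would schedule this, or use a managed feature such as AWS Backup or Data Lifecycle Manager:

```python
import boto3

ec2 = boto3.client("ec2")

# Snapshot every volume that the (hypothetical) tagging
# convention marks for protection.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "tag:Backup", "Values": ["true"]}]
)["Volumes"]

for volume in volumes:
    ec2.create_snapshot(
        VolumeId=volume["VolumeId"],
        Description="nightly backup",
    )
```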

New Zealand Is Like the Boy in a Bubble

You may have seen the news: New Zealand has no active COVID-19 cases; the coronavirus has been eliminated from New Zealand. As of Monday, 8 June, the last infected person had recovered, and it has been over three weeks since the last new case was diagnosed. We have moved from having some of the strictest lockdown rules to being totally relaxed, at least within the country. There is almost no risk of COVID-19 transmission inside New Zealand, so we are now protecting ourselves at the border. Anybody arriving in New Zealand is subject to a two-week, government-controlled quarantine and a COVID-19 test. We have very little immunity to COVID-19 in New Zealand, with only 1,100 or so confirmed cases out of five million people. We now live in a bubble, surrounded by countries that still have active transmission, and any breach of our bubble will cause us to go back into lockdown. We will not be safe to leave the bubble until other countries eliminate COVID-19 or a vaccine is widespread.

I Want Network Integration, I’m Not Getting It

I like having consistent management interfaces and having a single operational model across as much of my IT estate as possible. I don’t like point solutions that function or are managed differently; they add up to more problems. With this in mind, I would like to see far deeper network integration between AWS and VMware Cloud on AWS (VMC) even though I know why I won’t get this integration for a while. At Cloud Field Day 7, we had two sessions that focussed on network connectivity between AWS (AWS presentation) and VMC (VMware presentation); neither said it works the same as everything else they offer.

AWS Surprises – One Datacentre Is Not Enough

Most on-premises IT infrastructure designs treat a datacentre as a highly available platform; having an entire datacentre off-line is a disaster. It is a bit of a surprise, then, that AWS recommends we treat a datacentre as a failure domain and plan to keep our applications operational even if a datacentre fails. AWS doesn’t actually expose individual datacentres in its services; it presents Availability Zones. An Availability Zone (AZ) is the smallest area we can usually select for running applications on AWS and is made up of one or more datacentres that are very close together. As far as customers are concerned, we treat an AZ like one datacentre. The EC2 service, and its EBS storage, is scoped to the AZ; an EC2 instance in one AZ cannot be powered on in another AZ. AWS recommends that we have multiple EC2 instances spread across multiple AZs for high availability, because an AZ, or an AZ-scoped service, can fail. If you take a look at the AWS Post Event Summaries page, you will see events where specific services were unavailable; usually, the EC2 or EBS events impacted only a single AZ.

Multi-AZ is a standard design practice for production applications on AWS; DR is usually reserved for region-to-region failover. Failover between AZs is part of the application design, usually with scale-out EC2 for compute and a decoupling service, like a load balancer or queue, that is regionally scoped. The regionally scoped service continues to operate even when one AZ fails, allowing the surviving EC2 instances to keep delivering application services. The ability to scale out to provide HA is a part of the application design, rather than a feature of the infrastructure.
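
A minimal boto3 sketch of that pattern: a regionally scoped load balancer in front of instances in two AZs. All IDs are hypothetical, and a real deployment would also need a listener, health checks, and security groups:

```python
import boto3

elbv2 = boto3.client("elbv2")

# The load balancer is regional: it spans AZs by taking one subnet in each.
lb = elbv2.create_load_balancer(
    Name="web-alb",
    Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],  # one per AZ
)["LoadBalancers"][0]

group = elbv2.create_target_group(
    Name="web-targets", Protocol="HTTP", Port=80, VpcId="vpc-12345678",
)["TargetGroups"][0]

# Instances in different AZs; if one AZ fails, the others keep serving.
elbv2.register_targets(
    TargetGroupArn=group["TargetGroupArn"],
    Targets=[{"Id": "i-0aaa1111"}, {"Id": "i-0bbb2222"}],
)
```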

The equivalent design practice on-premises is a highly redundant virtualization platform in a single datacentre; DR is used to recover to another datacentre. All of the redundancy and availability of the virtualization layer is invisible to the application, which is often even unaware of a DR failover, other than as an outage before regular service is restored. There are on-premises designs with storage and hypervisor clusters that span multiple datacentres, with the equivalent scope of AWS multi-AZ. These Metro-Cluster solutions are usually very expensive and used only for highly critical applications. A Metro-Cluster places all of the failover awareness and functionality in the infrastructure; applications are generally still unaware of the failover.

On AWS, a single datacentre is not enough for any production application deployment. Deploying highly available applications on AWS requires that the application be designed with an awareness of the AWS infrastructure. Cloud-native applications are designed with an awareness of the limitations of cloud infrastructure; enterprise applications deployed on enterprise infrastructure expect perfect reliability from it. Take a moment to look back at the Post Event Summaries page, think about the number of datacentres AWS operates (currently 76 AZs), and then ask whether your on-premises datacentres experience fewer outages than AWS.

AWS Surprises – Choose a configuration from a menu

On the surface, there is no surprise here: AWS offers a list of services, and you order what you want from the list. But the devil is always in the detail, or the operational consequence. This particular AWS surprise came when I first played with EC2 instances and looked at changing the configuration of an existing instance. One does not simply add 4GB of RAM to an instance. The sizes of EC2 instances are fixed by AWS; you choose a size option from the list. For each EC2 instance family, there is a fixed relationship between the number of CPU cores and the amount of RAM. Within the family, there are fixed sizes; most often, the next size up is exactly twice as much resource in each dimension. To get more RAM in an existing EC2 instance, you either double the size of the instance or choose a size from a whole new instance family.

The M5 family has 4GB per core, so an M5.Large has two cores and 8GB, while an M5.24XLarge has 96 cores and 384GB of RAM. From the M5.Large ($0.12 per hour in Sydney), the next size up is M5.XLarge; with four cores and 16GB of RAM, it is exactly twice the size of an M5.Large and twice the cost per hour at $0.24 per hour. That is a large increase in price if my application only wants 4GB more RAM. I am probably better off changing to an R5.Large, which has two cores and 16GB of RAM and will only cost me $0.15 per hour in Sydney. The R5 series is more RAM-heavy; the R5.24XLarge has 96 cores and 768GB of RAM. It is not just the CPU and RAM that are fixed per instance; the available network bandwidth is related to the size and family of the instance. Ephemeral local storage, called Instance Store, is also fixed per instance size, and most instance families don’t even have Instance Store.

While there are a few dozen instance families and a few hundred possible combinations of family and size, for any given application, there will only be a small selection that are suitable. Choosing the wrong compromise of instance resources and cost will seriously affect the viability of your application on AWS. Don’t simply consider doubling the size of an EC2 instance; choosing another instance family might be a better option. Just remember that you cannot change the resources separately; you can only select an EC2 configuration from the menu.
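
A quick way to read the menu is to ask EC2 itself. A minimal boto3 sketch comparing the candidates from the example above:

```python
import boto3

ec2 = boto3.client("ec2")

candidates = ["m5.large", "m5.xlarge", "r5.large"]
response = ec2.describe_instance_types(InstanceTypes=candidates)

# Print the fixed vCPU/RAM combination for each menu option.
for itype in response["InstanceTypes"]:
    print(
        itype["InstanceType"],
        itype["VCpuInfo"]["DefaultVCpus"], "vCPU,",
        itype["MemoryInfo"]["SizeInMiB"] // 1024, "GiB RAM",
    )
```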

Zoom Mute/Unmute Using Stream Deck

Zoom is the “new” way we are all doing meetings. Whether it was attending Cloud Field Day presentations last week or teaching AWS training this week, Zoom is the constant for meetings. With Zoom, microphone control is essential. You want to be able to interact immediately, but you don’t want to interrupt when you need to cough, or swear because you spilled your drink on the desk. The result is that we enter Zoom meetings muted and only unmute when we have something to say. The simplest way is to hold down the space bar; if Zoom has focus, then your mic is unmuted while you hold down the space bar. The obvious problem arises when you need to use another application while you are in Zoom; then you hold down the space bar and stay muted, as happened to me at least once last week.

Hardware Offloads, Not Everything Is x86

While software is busy eating the world, we still need hardware to run that software. One of the things that we are learning is that an x86 processor is not always the best way to solve every computing problem. The most obvious demonstration is the absence of x86-based smartphones; there have been a couple of attempts, but nothing successful. Of course, mobile is a very different use case to the data center, and most data centers are full of x86-based servers. What we are seeing is that the x86 CPU in these servers is being supplemented by increasing numbers of specialized processors that handle functions better suited to different processor architectures.

The first was network cards: NICs that could offload a lot of the computing work of handling Ethernet and TCP packets. Rather than tying up an x86 CPU core for every 1-2 Gbps of network throughput, a powerful NIC managing the network allows 10Gb and even 100Gb Ethernet to be utilized without saturating the main CPU. We have also seen GPUs being added to servers for some workloads, in particular workloads that suit parallel compute with moderate amounts of data. Another type of offload is computational storage from NGD Systems, which uses additional ARM cores inside SSDs to process data inside the SSD. Computational storage offload seems to be the reverse of GPU offload: huge data but not as much compute demand, although, with a lot of NGD SSDs, those ARM cores do add up. We have also seen more consolidated offload for virtualization with Amazon’s Nitro architecture, which offloads network, storage, and server management into a custom add-in card. What is clear is that general-purpose CPUs are not the right solution for every computational task.

The AWS Nitro card appears to have a cousin in the Pensando Distributed Services Card, which seems to be the hardware magic that delivers Pensando’s software-defined services. The Pensando web site talks a lot about software-defined edge services. I believe the edge that they mean is a telco point of presence: what used to be a telephone exchange, but is now really a datacenter close to the telco’s subscribers. It does appear that the target customer is a cloud provider, or a telco that delivers cloud-like services; lots of networking and security. The front page of Pensando’s web site suggested to me that this might be a platform for building business applications; it appears to be more for building network applications. Next week I will hear more detail from Pensando at Cloud Field Day 7; join me for the live stream, or catch up with the videos afterward.
