While I’m teaching the course “Architecting on AWS,” one of the central themes is that the highest value comes from using the specific capabilities of AWS services. Simply lifting your software onto EC2 instances is unlikely to give you a great result. Consequently, I am very interested in stories of how on-premises products have been re-platformed to be cloud-native. A little while ago, I had the opportunity to sit down at the Pure Storage office in Mountain View and hear about how their hardware array becomes a cloud platform. We looked at the dual-controller Flash Array product in our Build Day Live event with Pure; you can watch those videos here. Soon you will get all of the goodness of the on-premises Flash Array in a cloud-deployed form. I was very impressed that the Pure Storage team chose to use the native features of the AWS platform to deliver the same features as their on-premises hardware.
Disclaimer: Pure Storage briefed me about this product, but did not commission or pay for this article. This article is my thoughts only.
How to do it wrong
One of the ways that on-premises storage products get a public cloud counterpart is that the software that runs on-premises is deployed into EC2 instances with a lot of EBS volumes for persistent storage. You need two of these EC2 instances to provide a dual-controller configuration for higher availability. The EBS volumes need to be SSD-backed with Provisioned IOPS (AWS io1) to deliver great performance. Since EBS volumes cannot be shared between instances, they must be mirrored across the network by the software inside the EC2 instances. Now we have a storage array built with two copies of data on one of AWS’s most expensive storage types, and the overall solution is rather expensive. I did some back-of-the-napkin design and worked out the basic AWS cost for a possible configuration to deliver 60TB of capacity, excluding any cost for the Pure software:
An R5.16XLarge has 64 vCPUs and 512GB of RAM for $4.032 per hour in Ohio, which comes out at around $2,900 per month. Adding 60TB of EBS with Provisioned IOPS costs around $7,500 per month for capacity and $12,480 for 192,000 IOPS (the maximum for three EBS volumes on an AWS Nitro platform). Each server will cost $2,900 + $7,500 + $12,480 = $22,880 per month, the pair together $45,760 per month. You can and should reduce the cost by committing to pay AWS for three years of Reserved Instance (RI) pricing and get huge discounts off this on-demand pricing. A three-year, all-upfront R5.16XLarge RI will set you back $41,833, which is $1,162 per month. You then pay $1,162 + $7,500 + $12,480 = $21,142 per server, $42,284 per month for the pair. The IOPS are over half of this total cost; you could reduce the provisioned IOPS count and deliver a lower-performance array for a lower cost.
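For transparency, here is that napkin math as a minimal Python sketch. The io1 rates of $0.125 per GB-month and $0.065 per provisioned IOPS-month are my assumptions based on published 2019 Ohio pricing; plug in current rates before relying on it.

```python
# Napkin math for the "wrong way": dual controllers on EC2 mirroring io1 EBS.
HOURS_PER_MONTH = 720

r5_16xlarge_hourly = 4.032      # on-demand, per hour, Ohio
ebs_capacity_gb = 60 * 1000     # 60TB of io1 capacity per controller
io1_gb_month = 0.125            # io1 capacity, per GB-month (assumed 2019 rate)
io1_iops_month = 0.065          # io1 Provisioned IOPS, per IOPS-month (assumed 2019 rate)
provisioned_iops = 192_000      # maximum across three io1 volumes on Nitro

instance = r5_16xlarge_hourly * HOURS_PER_MONTH   # ~$2,900
capacity = ebs_capacity_gb * io1_gb_month          # $7,500
iops = provisioned_iops * io1_iops_month           # $12,480

per_server = instance + capacity + iops
print(f"On-demand: ${per_server:,.0f}/server, ${2 * per_server:,.0f}/month for the pair")

# 3-year all-upfront RI: $41,833 / 36 months = ~$1,162 per month.
ri_per_server = 41_833 / 36 + capacity + iops
print(f"With RI:   ${ri_per_server:,.0f}/server, ${2 * ri_per_server:,.0f}/month for the pair")
```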
If you’re gonna do it, do it right, right.
If you need a high-performance block storage solution on AWS, the highest-performing option is Instance Store with NVMe SSDs. These SSDs can deliver over 300,000 IOPS, far more than any reasonable amount of EBS. The problem is that Instance Store is volatile: if you power cycle the instance, the data is lost. You need another store for persistence and can use the Instance Store as a performance tier. You might also want some fast EBS on each controller as a persistent write buffer to mitigate slow writes to S3, with S3 as the final persistence tier. I did some napkin math again, using a configuration that I came up with and that may not match what Pure Cloud Block Store uses. Here are the numbers:
An I3en.24xlarge has 96 vCPUs and 768GB of RAM, along with 8 NVMe Instance Store SSDs of 7.5TB each (60TB total) for $10.848 per hour, around $7,800 per month. A 2TB EBS volume should provide an ample write buffer for $250 per month in capacity, plus 64,000 provisioned IOPS for $4,160. Each server is $12,210; two servers cost $24,420. Add 60TB of S3 for around $1,500, and your total monthly cost is $25,920. Using the same three-year, all-upfront RI at $107,967 yields just under $3,000 per month, so $3,000 + $250 + $4,160 = $7,410 per server, $14,820 for two servers, or $16,320 including the S3. Because most of the cost here is from EC2, the Reserved Instance discount is very beneficial.
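The same napkin math for this configuration, again as a minimal sketch using the 2019 prices quoted above; the S3 figure is the rough $1,500-per-month estimate rather than a per-GB calculation:

```python
# Napkin math for the "right way": Instance Store for performance, S3 for persistence.
HOURS_PER_MONTH = 720

i3en_24xlarge_hourly = 10.848    # on-demand, per hour; includes 8 x 7.5TB NVMe SSDs
write_buffer = 250               # 2TB io1 EBS capacity, per month
buffer_iops = 64_000 * 0.065     # 64K Provisioned IOPS = $4,160/month (assumed 2019 rate)
s3_60tb = 1_500                  # rough monthly estimate for 60TB of S3

on_demand = i3en_24xlarge_hourly * HOURS_PER_MONTH + write_buffer + buffer_iops
print(f"On-demand: ${on_demand:,.0f}/server, ${2 * on_demand + s3_60tb:,.0f} total/month")

# 3-year all-upfront RI: $107,967 / 36 months = ~$3,000 per month.
ri = 107_967 / 36 + write_buffer + buffer_iops
print(f"With RI:   ${ri:,.0f}/server, ${2 * ri + s3_60tb:,.0f} total/month")
```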
By using Instance Store for performance and S3 for persistence, you get a higher-performing solution for a lower cost. The biggest operational challenge is that after a controller node EC2 instance is power cycled, it will take quite a while to re-fill the 60TB of Instance Store from S3 or the other EC2 instance. During this re-fill, you will probably continue to serve the IO from the other controller.
Another interesting aspect is that you could choose not to keep all of the snapshots on the Instance Store. Older snapshot data might only be retained in S3, freeing space in the Instance Store to hold new data and resulting in a larger effective capacity. S3 capacity is exceptionally cost-effective, making it ideal for medium-term retention. There is no reason why AWS Glacier could not also be integrated for cost-effective long-term retention. Ideally, data would be re-hydrated before transfer to Glacier, making bulk restore simpler, while S3 and Instance Store data remains deduplicated for capacity efficiency.
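Pure has not said how a Glacier integration would work; purely as an illustration, an S3 lifecycle rule is the simplest mechanism for aging objects into Glacier. Here is a hypothetical boto3 sketch, where the bucket name and the snapshots/ prefix are my inventions, not anything Pure has described:

```python
import boto3

# Hypothetical: transition snapshot objects older than 90 days to Glacier.
# Bucket name and key prefix are illustrative placeholders only.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-cbs-snapshot-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-old-snapshots-to-glacier",
                "Filter": {"Prefix": "snapshots/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```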
Cloud Block Store is an array re-engineered for AWS
When the Pure team explained the architecture that they were building for Cloud Block Store on AWS, it made complete sense to me. Rather than directly replicating the architecture of the physical array, Cloud Block Store uses the unique features of the AWS platform to deliver a higher-performing and lower-cost solution. But why would I want to replicate an on-premises array in a public cloud at all?
The key here is that not all workloads in the cloud are cloud-native, and none of the cloud-native storage options offer the full set of enterprise array features. One simple example is that restoring a snapshot of an EBS volume takes a long time: the EBS snapshot is stored in S3 and must be copied from S3 back to EBS before the data is fully available. Next, consider how you would replicate data from on-premises to a public cloud, perhaps to run development, testing, reporting, or AI functions on public cloud platforms. An enterprise array such as the Pure Flash Array has replication built in.
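To make the snapshot-restore point concrete, here is a minimal boto3 sketch; the snapshot ID and Availability Zone are placeholders. The create call returns quickly, but EBS loads the volume's blocks from S3 lazily, so reads of not-yet-copied blocks see much higher latency until the whole volume has been hydrated:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

# Placeholders: substitute a real snapshot ID and Availability Zone.
volume = ec2.create_volume(
    AvailabilityZone="us-east-2a",
    SnapshotId="snap-0123456789abcdef0",
    VolumeType="io1",
    Iops=64000,
)
print(volume["VolumeId"])
# The volume is attachable almost immediately, but full performance only
# arrives once every block has been copied down from S3.
```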
Multi-AZ storage
There is a challenge on AWS: most of the example architectures use a multi-AZ design to allow an AZ failure without an application failure. The network between AZs is high bandwidth and low latency, but the latency is only designed to be “single-digit milliseconds,” which is rather larger than the sub-millisecond latency that we expect from a modern enterprise all-flash array. There is a trade-off here: do you put both controllers in one AZ for the lowest storage latency, or place the controllers in two AZs and suffer higher latency for higher availability? I believe that Pure will offer Cloud Block Store as a dual-controller, single-AZ, sub-millisecond configuration. Multi-AZ availability can be delivered by replicating to another dual-controller, single-AZ deployment in the same region, with the second cluster in standby mode while the first is operational. I would love to have the option of a dual-controller, dual-AZ, multi-millisecond configuration for situations where availability is more critical than latency.
I am very impressed with the use of cloud-native services and capabilities to deliver an enterprise storage platform inside AWS. I expect to see Pure Storage customers leverage the data services of the Flash Array product in conjunction with Pure Cloud Block Store for hybrid cloud operations.
© 2019, Alastair. All rights reserved.