I am continuing my series of blog posts inspired by listening to SimpliVity’s story while I was on a speaking tour with them. Almost everything in these posts applies generically to deduplicated primary storage rather than specifically to SimpliVity. So far I’ve mostly looked at the upside of deduplicated storage; it’s about time I looked at some challenges. Some design challenges involve metadata management and write latency. There are also operational challenges around the meaning of available capacity, and further challenges when you migrate off a deduplicated storage platform that is full.
The Meta problem
All arrays, whether they are deduplicated or not, have metadata to manage. Deduplicated arrays have more metadata because they store only unique data blocks and must map multiple VMs onto those blocks. Managing this metadata is one of the things that makes storage arrays very hard to engineer. With a dedupe array, there is no need to write duplicate data to disk, but the array does need to record which unique blocks make up a storage object. This list of blocks is one type of metadata; a dedupe array copies this metadata to clone a VM, and arrays that use metadata clones for backups tend to generate a lot of it. Another type of metadata is the record of how many objects are using a particular unique block: the block’s reference count. If no objects refer to a block (reference count = 0) then it need not be kept, and its disk space can be reclaimed through garbage collection.

Both the block lists and the reference counts are updated on every disk write, and the block list is read on every disk read. This means the metadata for a running VM is very frequently accessed. Active metadata needs to be stored on the fastest tier of storage, either fast SSD or power-protected RAM. The amount of this fast tier, and how the array stores metadata, will be unique to each deduplicated store. The metadata for VM backups doesn’t need to be on fast media until it is used to restore a VM. Putting backup-related metadata on slower media reduces the need for SSD or RAM, or allows more of those resources to be used for caching.
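To make the relationship between block lists, reference counts, and garbage collection concrete, here is a deliberately simplified sketch of the bookkeeping described above. It is a toy in-memory model, not any vendor’s implementation: real arrays keep this metadata on fast media and persist it carefully.

```python
import hashlib

class DedupStore:
    """Toy dedupe store: a block list per object, a reference count per
    unique block. Illustrative only -- not a real array's design."""

    def __init__(self):
        self.blocks = {}      # hash -> unique block data
        self.refcount = {}    # hash -> number of references to the block
        self.block_list = {}  # object name -> ordered list of block hashes

    def write_object(self, name, chunks):
        hashes = []
        for chunk in chunks:
            h = hashlib.sha256(chunk).hexdigest()
            if h not in self.blocks:          # only unique data is stored
                self.blocks[h] = chunk
            self.refcount[h] = self.refcount.get(h, 0) + 1
            hashes.append(h)
        self.block_list[name] = hashes

    def clone_object(self, src, dst):
        # A clone copies only metadata: the block list plus bumped refcounts.
        self.block_list[dst] = list(self.block_list[src])
        for h in self.block_list[dst]:
            self.refcount[h] += 1

    def delete_object(self, name):
        for h in self.block_list.pop(name):
            self.refcount[h] -= 1
            if self.refcount[h] == 0:         # garbage collection
                del self.refcount[h]
                del self.blocks[h]
```

Note that cloning touches no data blocks at all, which is why metadata clones are so cheap to take and why heavy use of them for backups multiplies the metadata an array must manage.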
How long will that take?
The second big challenge is making sure that the deduplication of incoming writes doesn’t add latency to application I/O. Application performance is often tied to disk write latency, particularly for database applications, where both the database log file and data file writes must complete in order for the transaction to complete. Adding milliseconds of latency to identify uniqueness is unacceptable for primary VM storage. It may well be fine for a backup target, but it’s no way to treat a live application inside a VM. Most vendors use Intel CPUs to do the deduplication processing, benefiting from Moore’s law delivering ever-increasing CPU power. For CPU-based deduplication, the maximum throughput depends on the available CPU time, so having excess CPU capacity will protect storage throughput. As far as I’m aware the only exception is SimpliVity, who use customized hardware in their OmniStack accelerator to do the deduplication. Dedicating physical resources to deduplication enables predictable deduplication performance.
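A quick way to see the CPU cost of inline deduplication is to measure how fast one core can fingerprint incoming write blocks: an inline-dedupe array cannot acknowledge a write faster than it can hash and index it. This is a rough, hypothetical benchmark sketch, not a vendor’s method:

```python
import hashlib
import time

def hash_throughput_mb_s(block_size=4096, blocks=25_000):
    """Rough single-core measure of write-fingerprinting speed.
    Inline dedupe must hash every incoming block before it can be
    looked up in the index, so this bounds ingest throughput."""
    data = b"\x00" * block_size
    start = time.perf_counter()
    for _ in range(blocks):
        hashlib.sha256(data).digest()
    elapsed = time.perf_counter() - start
    return (block_size * blocks) / elapsed / 1e6  # MB per second
```

Run it with different block sizes to see why a busy CPU directly translates into slower storage ingest, and why dedicated hashing hardware gives predictable performance regardless of host load.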
As my friend Stephen Foskett says, one does not simply return zero instead of one. A storage array that returns invalid data is worse than one that doesn’t return data at all; the creeping malaise of bit rot will destroy all the value in your data. A crucial part of any modern storage array is keeping some sort of correction code and using it to ensure the stored data is returned faithfully. The deduped arrays that I have looked at all store a cryptographic hash of the data as well as the data itself. Data is only returned if the hash of the read data matches the stored hash, which ensures that if the array returns data, it is exactly the correct data. But what does the array do if it finds that the hash of the read data does not match the stored hash? If the array stores two copies of the data, then it will read and hash-check the second copy. Hopefully, that copy hasn’t also been corrupted. If the array uses parity, then the parity segment can be used with each of the data segments until the reconstructed data has the right hash. The use of hashes allows certainty that the data returned is exactly the data that was stored. Modern CPUs have dedicated hardware for computing hashes, which allows far faster hash computation and, therefore, faster storage access.
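The read path described above, verify the hash, and fall back to another copy on a mismatch, can be sketched in a few lines. This assumes a simple two-copy (mirrored) layout rather than parity, purely for illustration:

```python
import hashlib

def verified_read(primary: bytes, mirror: bytes, stored_hash: str) -> bytes:
    """Return stored data only if it hashes to the recorded value.
    On a mismatch (bit rot detected), fall back to the mirror copy;
    if both copies fail verification, refuse to return anything."""
    for copy in (primary, mirror):
        if hashlib.sha256(copy).hexdigest() == stored_hash:
            return copy
    raise IOError("both copies failed hash verification: unrecoverable read")
```

The key design point is the last line: a verified read path never silently returns corrupt data; it either returns exactly what was written or reports an error so the failure is visible.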
Capacity management is a fact of life. What was once a largely empty array will fill over time and eventually run out of space. On a deduplicated array, the amount of free space does not equal the amount of additional data you can add. Since only unique data and metadata are stored, it is possible to add lots of duplicate data without significantly reducing available capacity. On the other hand, adding a lot of unique data will reduce available capacity. The trick is knowing whether the added data will be unique. Cloning an existing VM doesn’t add a lot of unique data. Importing a new VM can mean a lot of unique blocks, particularly if the VM has an encrypted disk, where all the blocks are effectively random and unique.

The other part is that deduplication isn’t always at ingest, as it is on SimpliVity. Some arrays will ingest data without deduplicating it, meaning that initially the full set of data takes up capacity. Later the array analyzes the data it has written and deduplicates it. This is called post-process deduplication. Often the dedupe phase happens overnight, during idle time, or is simply delayed by several hours. The result is that free space on the array declines over the working day, then recovers overnight when deduplication catches up. These arrays make it a little harder to predict available capacity simply because it is so volatile through the day and week. Free space is also a crucial commodity in a deduplicated array. Since the array is inherently overcommitted, there is a risk of running out of space and having nowhere to store incoming unique data. This is a catastrophic situation for any array. When you deploy any kind of thinly provisioned array, make sure you are managing free capacity in the array and watching data growth. There are toxic situations, such as turning on whole-disk encryption for existing VMs: imagine a group policy applied to a group of VMs for security compliance that encrypts all of their disks. Suddenly every block is unique, and the array can be made to fail faster because it is deduplicated.
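The overcommit risk is easier to see with some back-of-envelope arithmetic. The sketch below is hypothetical planning math, not any vendor’s capacity formula: it estimates how much new logical data fits into a given amount of free physical space, given an expected dedupe ratio for the incoming data.

```python
def logical_headroom_gb(free_physical_gb: float, expected_dedupe_ratio: float) -> float:
    """Estimate how much new logical data fits in the remaining physical
    space. A ratio of 1.0 means all-unique data (e.g. encrypted disks);
    a high ratio means mostly duplicate data (e.g. VM clones).
    Illustrative arithmetic only."""
    return free_physical_gb * expected_dedupe_ratio

# With 500 GB of physical free space:
clones_fit = logical_headroom_gb(500, 20.0)     # cloning VMs, ~20:1 dedupe
encrypted_fit = logical_headroom_gb(500, 1.0)   # encrypted VMs, no dedupe
```

The same 500 GB of free space might absorb ten terabytes of cloned VMs or only 500 GB of encrypted ones, which is exactly why "free space" on a dedupe array is such a slippery number.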
We often associate the Hotel California reference with a cloud service that is very hard to stop using. There is also a Hotel California-like situation when you migrate out of a deduplicated array to another array, particularly when the old array is running out of capacity and you migrate some VMs to a new array to reclaim space. Migrating a VM off a dedupe array does not necessarily free a lot of space on that array. A 500 GB VM might have only 5 GB of unique blocks, so when the VM is removed, the array gains only 5 GB of free space. Even more challenging is when we use storage array metadata clones as backups. That same VM might have 100 backups, each with a couple of gigabytes of unique blocks and metadata. When we migrate the VM off, the backups remain in place, and they may well consume more capacity than the VM itself. If we need to recover capacity in the array, it will probably be more effective to change our backup retention policy than to migrate VMs.
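The numbers in that example are worth working through. This hypothetical helper shows why migrating the VM alone reclaims almost nothing, while trimming backup retention reclaims far more:

```python
def space_reclaimed_gb(vm_unique_gb: float, backup_count: int,
                       unique_per_backup_gb: float,
                       delete_backups: bool) -> float:
    """Estimate space freed on a dedupe array by migrating a VM off it.
    Only blocks no longer referenced by anything are reclaimed; metadata-
    clone backups hold their unique blocks until retention expires.
    Illustrative arithmetic, not a vendor calculation."""
    freed = vm_unique_gb
    if delete_backups:
        freed += backup_count * unique_per_backup_gb
    return freed

# The 500 GB VM from the text: 5 GB unique, 100 backups of ~2 GB unique each
migrate_only = space_reclaimed_gb(5, 100, 2, delete_backups=False)   # 5 GB
with_backups = space_reclaimed_gb(5, 100, 2, delete_backups=True)    # 205 GB
```

Migrating the VM frees 5 GB; expiring its backups frees forty times that, which is the article’s point about retention policy being the more effective lever.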
Deduplicated storage is a great tool for efficiently storing VMs. But like any design decision, the choice has pluses and minuses. Be sure you are aware of the dangers of the solution you choose as well as its benefits.
Disclosure: I am doing a lot of work with SimpliVity; however, they did not request this article or review its content. Most of what I mention here is true of most deduplicated arrays.
© 2015, Alastair. All rights reserved.