Dedupe means data on disk is never re-written

I think that inline primary data deduplication is going to be a standard feature of storage arrays in the near(ish) future, even for transactional workloads like virtual machines and tier-1 applications. As many storage experts will tell you, the best IO is the one you don’t do. Once you have a copy of a unique data block, never needing to write that block to storage again avoids a lot of IO. Inline deduplication means only writing unique data to disk once.

Image: seven stored blocks become three unique blocks.

Disclosure: I’ve spent the last two months working with SimpliVity, so I’ve heard their pitch a lot and spent a while thinking about the consequences. SimpliVity have not paid for or requested this blog post and will only know its content when it is posted. Blog posts on Demitasse.nz do not contain paid content; you can find links to some of my paid-for writing on this page.

Write only unique data to disk

An inline deduplicated storage array analyses each write it receives to identify whether it has already stored the data that the write represents. If the array already has a copy of the data then only metadata needs to be updated; the actual data is already on disk. The metadata is what combines the unique blocks into a stored object, a LUN or a file. The same unique block can be part of multiple stored objects and can represent multiple parts of the same object.

As an example, consider the Windows kernel file. For every Windows server with the same Windows version, service pack and patches this file will be identical, so the blocks holding it will be identical. All of the Windows Server VMs stored on a deduplicated array can share just one copy of these blocks. This works within a VM too: Windows keeps a cache of the kernel file for system file protection, as well as a cache of the patch installer that last updated the kernel, so each Windows VM may hold three copies of those same blocks. Bear in mind that the unique blocks do not need to represent the same file, they just need exactly the same contents. The same block could exist in machines with different operating systems or applications, although matches are more likely if there is consistency across the VMs.
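
To make that concrete, here is a minimal sketch of an inline deduplicated block store, written in Python for illustration only. It is not any vendor’s actual implementation; the fixed 4KB block size, the SHA-256 content hash and the DedupStore name are all my own assumptions.

import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch

class DedupStore:
    def __init__(self):
        self.blocks = {}   # content hash -> block data (the "on disk" copy)
        self.objects = {}  # object name -> ordered list of block hashes (metadata)

    def write(self, name, data):
        """Store an object, writing only blocks the store has never seen before."""
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:  # unique data: written once
                self.blocks[digest] = block
            hashes.append(digest)          # duplicate data: metadata-only update
        self.objects[name] = hashes

    def read(self, name):
        """Reassemble an object from its metadata."""
        return b"".join(self.blocks[h] for h in self.objects[name])

# Two "VMs" writing an identical kernel file only cause its blocks to be stored once.
store = DedupStore()
kernel = b"\x90" * (3 * BLOCK_SIZE)
store.write("vm1/ntoskrnl.exe", kernel)
store.write("vm2/ntoskrnl.exe", kernel)
assert store.read("vm2/ntoskrnl.exe") == kernel
print(len(store.blocks), "unique block stored for two objects")  # prints 1

The second write only grows the metadata; the block dictionary does not change, which is exactly the IO saving the rest of this post is about.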

Usually deduplication is talked about for the capacity saving of only storing one copy of those blocks for all of your Windows VMs. This is definitely a huge benefit, but I’ve been thinking about the performance benefit too. The array only wrote those blocks to disk when the first Windows server saved its first copy. When the array received each subsequent write from the VMs there was no need to store the data; only metadata updates were needed. Particularly on HDD-based storage arrays, performance is often limited by the spinning disks. Every write you don’t need to send to disk is IO the disks can deliver for reads. This reduced load might remove the need to use solid state to achieve performance requirements, allowing a deduplicated array to use hard disks as bulk storage.
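
A rough back-of-the-envelope illustration of that point, with every figure below being an assumption I have picked for the example rather than a measurement from any real array:

# Illustrative arithmetic only; spindle count, per-disk IOPS, workload and
# duplicate ratio are all assumed numbers.
hdd_count = 12
iops_per_hdd = 150               # assume ~150 IOPS per 10K RPM spindle
spindle_budget = hdd_count * iops_per_hdd

offered_write_iops = 1000        # assumed write workload from the VMs
duplicate_ratio = 0.6            # assume 60% of writes are duplicate blocks

writes_reaching_disk = offered_write_iops * (1 - duplicate_ratio)
iops_freed_for_reads = offered_write_iops - writes_reaching_disk

print(f"Spindle budget: {spindle_budget} IOPS")                     # 1800
print(f"Writes that reach disk: {writes_reaching_disk:.0f} IOPS")   # 400
print(f"IOPS handed back to reads: {iops_freed_for_reads:.0f}")     # 600

In this made-up example a third of the whole spindle budget is handed back to reads, which is why deduplication can let plain hard disks keep up with workloads that would otherwise need solid state.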

Write Once, Read Many

Once a block is written to a deduplicated store that block is almost never overwritten. Only when no stored objects on the array use that data block will the block be released and become eligible to hold new data. Consider for a moment what rarely overwriting data means for low-cost flash in an enterprise storage array. If a NAND flash page is only ever written every few months then it is not going to wear out due to its write limit. Infrequent overwrites mean that the bulk of storage can be on the cheapest type of SSD. This dramatically reduces the cost of an all (NAND) flash storage configuration without reducing reliability. The capacity savings of deduplication can make solid state storage price competitive with hard disk based storage on a dollars-per-terabyte basis. This is the value proposition for Pure Storage all-flash arrays: they use deduplication and compression to make an all-flash array that is price competitive with hard disk arrays.
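
To put a rough number on that wear argument (again with assumed figures; real NAND endurance ratings vary widely by device):

# Assumed endurance figures for illustration only.
pe_cycle_rating = 1000           # assume low-cost TLC NAND rated for 1,000 program/erase cycles
overwrite_interval_days = 90     # a page overwritten "every few months"

lifetime_years = pe_cycle_rating * overwrite_interval_days / 365
print(f"About {lifetime_years:.0f} years before the page reaches its write limit")  # about 247 years

Even if my assumed endurance rating is off by an order of magnitude, the page comfortably outlives the array, so cheap high-capacity flash becomes viable as bulk storage.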

SimpliVity

Since I’ve spent quite a bit of time with SimpliVity I have learnt a bit about their product. SimpliVity have an inline deduplication strategy, so only unique data is written to disk. The primary storage is spinning rust: old-fashioned hard disks. Deduplication means they still get good performance from the spinning disks, and they also use SSD and RAM as cache, so there are multiple tiers of storage. The unique parts are that the deduplication (and compression) is done by a custom expansion card rather than the x86 CPUs, and that deduplicated data can be replicated to other sites for large WAN bandwidth savings.

Inline deduplicated storage arrays improve the performance of lower cost bulk storage media like hard disks and cheaper high capacity solid state. I will take a look at some of the challenges and consequences of deduplicated primary storage in future blog posts.

© 2015, Alastair. All rights reserved.

About Alastair

I am a professional geek, working in IT Infrastructure. Mostly I help to communicate and educate around the use of current technology and the direction of future technologies.

3 Responses to Dedupe means data on disk is never re-written

  1. Jeff Preou says:

    The hard part in realising those savings is estimating how much they will be pre-purchase. For some storage arrays, getting it wrong can be expensive to fix. Customers don’t want to purchase an array, and then buy a bit more and a bit more after discovering during implementation that the dedupe and compression ratios are not what they thought they would get. So there is a tendency to ‘go big’ and thus not realise the cost savings. This, I think, is why some of these technologies are not making quite the inroads that logic suggests they should; people are risk-averse and go back to what they know; ‘legacy’ storage. (heck, I’ve even had customers who can’t grasp storage pools over trad raid sets without considerable persuading, let alone dedupe!).
    Good insight as usual Alastair – might bump into you at some event or other one day…

  2. Steve Meier says:

    Dedupe is something I have been involved with for some time.

    Both inline and the less popular offline variety (surprise surprise) have their merits.

    Having used it, and still using it, in arrays as well as in devices like virtual tape library appliances and in complete software versions like Commvault’s, I have seen good and bad.

    The biggie is that the system has to be 100% robust, foolproof. I have suffered corruption in a dedupe database on Commvault… now while this is software, the same principles are used. Nothing was recoverable, which means we lost everything… we could not restore because even the database was backed up via dedupe… a design oversight.

    Having seen the work these systems do on large dedicated boxes, it makes me wonder how the storage arrays are going to cope and maintain the millions of metadata entries that represent a large array… At what point does this not scale well and actually slow things down? Worse still, what if it gets corrupted in something like a firmware upgrade… gulp.

    Like I said at the start, the benefits outweigh the risks… but I am still a wee bit wary of this tech.

    While the race to dedupe everything is what storage vendors are doing, I suspect that this technology may pass as SSD and storage become even cheaper.

    With new technologies on the roadmaps to shrink the footprint of storage devices while increasing the capacity they hold… this may make deduping anything redundant as the race for ever faster access continues.

    Love your blogs, by the way.

  3. Alastair says:

    Thanks Steve & Jeff.

    I have way more thoughts on dedupe storage and will be writing about some along the way.
    Definitely agree with both of you on some of the issues around using dedupe stores.
