Every so often a product comes along that works in a new way and we need to re-learn how to think about building an IT infrastructure. I spent some time with Datrium learning how their solution differs from others. I think of their product as a scale-out controller with a shared storage shelf. Both hyperconverged infrastructure (HCI) and scale-out storage have scale-out controllers and scale-out capacity; the difference is that HCI runs VMs on the same physical servers that provide the storage, while scale-out storage uses a separate set of servers. Datrium puts the controller, with a cache, on each scale-out host alongside the workload VMs, but keeps all persistent storage centralized and shared by all the hosts.
With Datrium the controllers scale out and live on the compute nodes, alongside the VMs. Each node has some solid-state storage as a cache but no persistent storage. All persistent storage is in a data node, separate from the compute nodes. The data node has local disks and NVRAM, but is only accessible through the compute nodes. Think of the data node as a disk shelf; a future release will allow multiple data nodes to be joined together. The compute nodes scale out, with up to 32 compute nodes able to access a single data node. A nice feature is the ability to have non-uniform compute nodes. You might have sixteen general-purpose compute nodes: dual socket, 256GB of RAM, and 1TB of SSD. Then maybe four nodes for large database VMs: quad socket, 1TB of RAM, and 8TB of SSD. All of these compute nodes can access the same data node.
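To make that topology concrete, here is a minimal Python sketch of a non-uniform cluster. It is purely my illustration of the numbers above (the 32-node limit and the two example node shapes), not anything from Datrium's software.

```python
# Illustrative sketch only -- not Datrium code. Models the topology
# described above: non-uniform compute nodes sharing one data node.
from dataclasses import dataclass, field

MAX_COMPUTE_NODES = 32  # per the stated limit of 32 compute nodes per data node

@dataclass
class ComputeNode:
    sockets: int
    ram_gb: int
    ssd_cache_tb: int  # local flash is cache only, not persistent storage

@dataclass
class DataNode:
    compute_nodes: list = field(default_factory=list)

    def attach(self, node: ComputeNode) -> None:
        if len(self.compute_nodes) >= MAX_COMPUTE_NODES:
            raise ValueError("a data node supports at most 32 compute nodes")
        self.compute_nodes.append(node)

# Sixteen general-purpose nodes and four large-database nodes,
# all sharing the same data node.
shelf = DataNode()
for _ in range(16):
    shelf.attach(ComputeNode(sockets=2, ram_gb=256, ssd_cache_tb=1))
for _ in range(4):
    shelf.attach(ComputeNode(sockets=4, ram_gb=1024, ssd_cache_tb=8))
print(len(shelf.compute_nodes), "compute nodes attached")
```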
Datrium’s architecture provides a lot of scale-out benefits without some of the challenges. In typical scale-out and hyperconverged architectures there is a lot of east-west network traffic between the storage nodes: data written to one node must also be written to another node, or two, to provide durability. There are also operational and availability issues with having storage capacity in your compute nodes. Taking an HCI node down for maintenance affects the redundancy of your storage, potentially reducing your failure tolerance. With Datrium the compute nodes seldom talk to each other; they almost exclusively talk to the data node. Having a compute node shut down or fail does not change your storage availability and resilience. With both HCI and scale-out storage you must have a minimum quorum of nodes operational before any storage is available; Datrium needs only the data node and one compute node to provide a working storage system.
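A toy comparison of those minimum-availability rules, assuming a simple majority quorum for the HCI case (actual quorum rules vary by product):

```python
# Illustrative sketch only: contrasts the minimum-availability rules
# described above. The majority-quorum rule for HCI is my assumption
# for the example, not a rule from any specific product.
def hci_storage_available(nodes_up: int, cluster_size: int) -> bool:
    # Typical HCI / scale-out storage: a majority of nodes must be up.
    return nodes_up > cluster_size // 2

def datrium_storage_available(data_node_up: bool, compute_nodes_up: int) -> bool:
    # Datrium: the data node plus at least one compute node.
    return data_node_up and compute_nodes_up >= 1

print(hci_storage_available(nodes_up=1, cluster_size=4))            # False
print(datrium_storage_available(True, compute_nodes_up=1))          # True
```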
Datrium is also designed to be simple to manage, which is a top value proposition for HCI too. Datrium has very few settings to configure: deduplication, erasure coding, and compression are always enabled and cannot be turned off. The only feature that can be turned on and off is full-system encryption. The encryption happens in the compute nodes. Data is encrypted after it is deduplicated and compressed, but before it leaves the compute node where the VM IO occurs. Data is encrypted across the storage network and at rest on the data node, with no need for self-encrypting hard disks.
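As a sketch of that write-path ordering, the snippet below dedupes, compresses, and then encrypts a block before it would leave the compute node. The SHA-256 fingerprinting, zlib compression, and Fernet cipher are stand-ins I chose for illustration; Datrium's actual algorithms aren't covered here.

```python
# Illustrative sketch only: the write-path ordering described above,
# dedupe -> compress -> encrypt, all on the compute node before data
# crosses the storage network. Fernet is a stand-in cipher, not
# Datrium's actual encryption.
import hashlib
import zlib
from cryptography.fernet import Fernet  # pip install cryptography

fingerprints = set()                    # content fingerprints already stored
cipher = Fernet(Fernet.generate_key())

def write_block(block: bytes):
    digest = hashlib.sha256(block).hexdigest()
    if digest in fingerprints:
        return None                     # duplicate: only a reference is written
    fingerprints.add(digest)
    compressed = zlib.compress(block)   # compress after dedupe
    return cipher.encrypt(compressed)   # encrypt last, before leaving the host

payload = b"hello " * 100
first = write_block(payload)
second = write_block(payload)           # deduplicated, nothing sent
print(len(payload), len(first), second)
```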
This architecture has some interesting consequences. It is going to take me a while to think through and talk about what the benefits and downsides are; there are always downsides. Hopefully I will get to do some more work with Datrium and we will all learn more about their cool product.
© 2017, Alastair. All rights reserved.