One of the hardest parts of designing a virtual infrastructure for VDI is getting the storage right. This is where the price of doing VDI becomes very obvious, so it’s tempting to try to cut costs. Unfortunately, not having enough performance in your VDI storage leads to unusable desktops. Below is an example I was involved in; the root cause of the inadequate storage wasn’t poor design so much as poor communication. The result was the same: unusable desktops and unhappy users.
I got called into an escalation: one Monday morning, user logon times had become extremely long. I asked about the disk system; there were 4 SSDs, 30 15K SAS disks and 60 SATA disks. The disk array was a nice enterprise one with sub-LUN tiering and lots of cache, so it could use all of the different types of disk efficiently.
I did some quick and rough math:
4 SSDs at 5,000 IOPS each: 4 x 5,000 = 20,000 disk IOPS
30 15K SAS disks at 200 IOPS each: 30 x 200 = 6,000 disk IOPS
60 SATA disks at 80 IOPS each: 60 x 80 = 4,800 disk IOPS
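For reference, here is the same sum as a quick Python sketch; the per-disk IOPS figures are the same rule-of-thumb numbers used above, not measured values.

```python
# Rule-of-thumb IOPS per spindle type (assumed figures, not measurements)
disks = {
    "SSD":     {"count": 4,  "iops_each": 5000},
    "15K SAS": {"count": 30, "iops_each": 200},
    "SATA":    {"count": 60, "iops_each": 80},
}

for name, d in disks.items():
    print(f"{name}: {d['count']} x {d['iops_each']} = {d['count'] * d['iops_each']} IOPS")

raw_total = sum(d["count"] * d["iops_each"] for d in disks.values())
print(f"Raw disk IOPS: {raw_total}")
```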
Overall that is 30,800 disk IOPS. However, not all of the IO is read, so we need to apply the RAID write penalty. RAID5 has a 4:1 penalty: one write IO to the array causes four IOs on the disks. RAID10 has a 2:1 penalty, and both RAID levels have no penalty for reads (unless RAID5 has a failed disk). As this is a VDI implementation I used 80% write as my number, which was validated by the performance information this customer had gathered.
More Math:
      | RAID5            | RAID10
Write | 80% x 4 = 3.2    | 80% x 2 = 1.6
Read  | 20% x 1 = 0.2    | 20% x 1 = 0.2
Total | 3.2 + 0.2 = 3.4  | 1.6 + 0.2 = 1.8
For every one IO to the RAID5 array the disks had to do an average of 3.4 IOs. For the RAID10 array the disks had to do 1.8 IOs.
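As a small sketch, the blended multiplier is just a weighted average of the read and write penalties; the 80/20 split is the workload mix from the customer’s performance data, and the helper function is my own naming.

```python
def io_multiplier(write_fraction, write_penalty, read_penalty=1):
    """Average back-end disk IOs generated per front-end IO to the array."""
    return write_fraction * write_penalty + (1 - write_fraction) * read_penalty

write_fraction = 0.8  # 80% write, typical for steady-state VDI

raid5_multiplier = io_multiplier(write_fraction, write_penalty=4)   # 3.4
raid10_multiplier = io_multiplier(write_fraction, write_penalty=2)  # 1.8
print(raid5_multiplier, raid10_multiplier)
```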
The RAID5 disks were SAS at 6,000 IOPS and SATA at 4,800 IOPS, for a total of 10,800 IOPS. This meant that the RAID5 array could do 10,800 / 3.4 = 3,176 IOPS and the RAID10 SSDs could do 20,000 / 1.8 = 11,111 IOPS at 80% write.
The array as a whole could do 3,176 + 11,111 = 14,287 IOPS at 80% write. Let us call this 14,000 IOPS to make the math easier.
A little questioning told me that things were fine the week before, when around 250 users had been active. 14,000 IOPS / 250 users = 56 IOPS per user. I’d be pretty happy with that, and the users were too.
On the fateful Monday a large number of users were migrated, bringing the total to around 2,000. 14,000 IOPS / 2,000 users = 7 IOPS per user. Well, there’s your problem. Even the lowest estimates call for 10 IOPS per user at steady state.
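Pulling the whole estimate together as one rough sketch, again using the assumed per-disk figures and the ~14,000 IOPS round number from above:

```python
# Back-end IOPS available to each RAID group (from the disk counts above)
raid5_backend = 6000 + 4800   # 15K SAS + SATA spindles, RAID5
raid10_backend = 20000        # SSDs, RAID10

# Front-end IOPS at an 80% write mix, using the multipliers worked out above
raid5_frontend = raid5_backend / 3.4     # ~3,176
raid10_frontend = raid10_backend / 1.8   # ~11,111
array_frontend = raid5_frontend + raid10_frontend  # ~14,287, call it 14,000

for users in (250, 2000):
    print(f"{users} users: {round(14000 / users)} IOPS per user")
# 250 users  -> 56 IOPS per user: happy desktops
# 2000 users -> 7 IOPS per user: well under the ~10 IOPS steady-state minimum
```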
The customer is now redesigning their array to deliver a lot more IOPS: RAID10 everywhere and probably a lot more disks. Luckily for them, there was an alternate array they could use to reduce the pain until the upgrade is in place.
The math I applied was very simple and quick; it ignored cache effects and the fact that not all of the IOPS of all the tiers are available at all times. It was good enough to tell me that I didn’t need to look any further to identify the cause of the slow logon problem. Designing the storage system properly requires a lot more detail and awareness of both the storage system and the customer.
I’m planning some follow-ups: why insufficient IOPS causes sudden catastrophic failure, steady-state IOPS versus peak IOPS, and a look at the cost of SSD versus spinning disk.