I’m starting a new series of blog posts, all about times when things went wrong and needed to be fixed. The point is to help you learn from other people’s mistakes and avoid them in future.
The Problem
This learning moment came from a health check that I did with a client. They had a small VMware estate with only three ESX servers and a small SAN. They asked for a health check because they had been let down by VMware High Availability (HA) when one of their ESX hosts crashed. None of the VMs on the failed ESX server had been restarted on the remaining ESX servers. Rather disappointing.
The Diagnosis
It only took me five minutes to find out why HA didn’t work. The storage configuration was the primary fault: only a couple of the six datastores provided by the SAN were available on all three ESX servers. HA can only restart a VM on a host that can reach all of that VM’s files, so most VMs had nowhere to restart. The second issue was that a critical VM had some of its vmdk files stored on a local datastore inside one of the ESX servers.
I used the vSphere client maps to identify both of these issues, enabling different combinations of relationships to expose different problems. In the picture of my lab below, Host3 is unable to access the datastore NFS01 where all the VMs reside. Also, the VM Lab-01 uses a datastore that is only available on Host2, since it is a local disk. Both of these issues will cause problems with HA and vMotion.
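The same check the maps give you visually can be written down as a small script. This is only a sketch in plain Python, assuming you have already exported which datastores each host can see. The inventory below is made up to mirror my lab, and the datastore name `Host2-local` and the function name are my own inventions for illustration, not anything the vSphere client produces.

```python
# Hypothetical inventory: the datastores each host can see.
# The names mirror the lab example; "Host2-local" is an assumed name
# for the local disk inside Host2.
host_datastores = {
    "Host1": {"NFS01"},
    "Host2": {"NFS01", "Host2-local"},
    "Host3": set(),
}


def datastores_missing_hosts(host_datastores):
    """Map each datastore to the sorted list of hosts that cannot see it."""
    all_datastores = set().union(*host_datastores.values())
    missing = {}
    for ds in sorted(all_datastores):
        absent = sorted(
            host for host, seen in host_datastores.items() if ds not in seen
        )
        if absent:
            missing[ds] = absent
    return missing


for ds, hosts in datastores_missing_hosts(host_datastores).items():
    print(f"{ds} is not visible to: {', '.join(hosts)}")
```

Any datastore that shows up in this report is one that HA cannot rely on for every host in the cluster.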
The Resolution
The first part was to have the storage array reconfigured so that all of the shared LUNs were presented to all of the hosts. Since this was a SAN change, it took a while to get through change management and be implemented. Presenting the LUNs and then rescanning storage on the ESX servers did not cause any VM or ESX outage. The second phase was to relocate some VMs so that only the shared datastores were used. Happily, Storage vMotion allows VMs to be relocated while they are in use, so again there was no outage. If you choose the advanced option on the Storage page of the Storage vMotion wizard, you can select which parts of the VM to migrate. In the picture below only the second hard disk is being moved. This selective move is useful if the VM only has some disks on the wrong datastore, or if you want to spread its disks across multiple shared datastores. You can choose a different destination for each disk and for the VM home directory. It is a good idea to keep the entire VM on one datastore if this is possible.
The Prevention
The primary reason that the environment got to this state is that the team did not understand how vSphere HA works. The environment had grown from a single ESX server with local storage into a larger environment. None of the team specialized in VMware, and the team had to look after every part of the infrastructure, including the operating system and applications inside the VMs. Things that were not broken did not get a lot of attention. To avoid the problem, one of the team should have been trained or given adequate time to gain the skills to manage the platform. If there is no chance of this training, it is a great idea to have an external expert come and do a health check to let you know if there are any actual or potential issues.
In addition, it is a great idea to set up some monitoring. I use the excellent vCheck script, which includes a report of VMs stored on datastores attached to only one host. vCheck has many other useful report elements, and I recommend it for any virtualization environment.
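To show the idea behind that kind of report, here is a minimal sketch in plain Python. It is not how vCheck itself is implemented. It assumes you already have two exported mappings, hosts to datastores and VMs to datastores. The sample data, the datastore name `Host2-local`, the VM `Lab-02`, and the function name are all made up for illustration.

```python
# Hypothetical inventory, loosely based on the lab example above.
host_datastores = {
    "Host1": {"NFS01"},
    "Host2": {"NFS01", "Host2-local"},
    "Host3": set(),
}
vm_datastores = {
    "Lab-01": {"NFS01", "Host2-local"},  # one disk on a local datastore
    "Lab-02": {"NFS01"},                 # assumed second VM, shared only
}


def vms_on_single_host_datastores(host_datastores, vm_datastores):
    """Return {vm: [datastores]} for VMs using a datastore seen by at most one host."""
    # Count how many hosts can see each datastore.
    host_count = {}
    for seen in host_datastores.values():
        for ds in seen:
            host_count[ds] = host_count.get(ds, 0) + 1

    report = {}
    for vm, used in sorted(vm_datastores.items()):
        pinned = sorted(ds for ds in used if host_count.get(ds, 0) <= 1)
        if pinned:
            report[vm] = pinned
    return report


for vm, stores in vms_on_single_host_datastores(host_datastores, vm_datastores).items():
    print(f"{vm} depends on single-host datastore(s): {', '.join(stores)}")
```

A VM listed here cannot be restarted by HA on another host until its files are moved to shared storage.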
© 2014, Alastair. All rights reserved.
Great idea for a series Alastair, and a good first post. I’ll stay tuned!
The maps are often overlooked as useful troubleshooting tools, good to see someone going back to elementary, but ever valuable steps.