Another simple little posting that has grown. It doesn’t cover every eventuality, but it can be a life saver.
This one comes up a bit: someone has lost a VMFS Datastore from all their ESX servers. It was there yesterday and now it’s gone! The hosts can still see the LUN, but not a Datastore.
First off some advice:
Don’t Panic
and don’t create a new VMFS over the existing one. If you do, the data on the disk is lost and only heroic effort by VMware support will be able to get it back for you.
Reducing the Risk
If you still have ESX 3.5U3 or later (not vSphere) then you can back up the VMFS metadata as well as documenting the partitioning. The backup process is in this VMware KB article. In theory you should be able to use the same thing on vSphere, however the tool does not ship with ESX 4.0.
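Even without the backup tool you can at least record the partition layout from the service console. A minimal sketch, assuming /dev/sdb is the LUN that holds the Datastore, that sfdisk is present in your service console build, and that the file names are just examples:
[root@localhost ~]# fdisk -lu /dev/sdb > /root/sdb-partition-table.txt
[root@localhost ~]# sfdisk -d /dev/sdb > /root/sdb-partition-table.dump
The first file is a human-readable record that captures the start block; the second is a dump that sfdisk can restore later.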
Background
The first thing to understand is how the VMKernel recognises a VMFS Datastore.
- First the VMKernel looks in the disk partition table for a partition of type "fb"; this partition type number identifies the contents as a VMFS datastore.
- Within the partition the VMKernel looks for VMFS metadata. If the partition is the first extent of the VMFS datastore then it will have VMFS metadata at the start of the partition.
When the Datastore is deleted only the partition table entry is removed, so the metadata is still in the right place on the disk. As a side note, only the first extent (partition) is deleted when the Datastore is deleted; any additional extents are left behind.
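A rough way to confirm the metadata really is still in place is to read the area just past where the partition used to start and look for the old Datastore label, which is stored as plain text in the metadata. A sketch, assuming /dev/sdb, a start block of 128 and a Datastore that was called datastore01 (substitute your own name):
[root@localhost ~]# dd if=/dev/sdb bs=512 skip=128 count=8192 2>/dev/null | grep -a -c datastore01
The dd reads the first 4MB after the old start block; a non-zero count from grep means the label was found and the metadata is almost certainly intact.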
To get the Datastore back we need to recover the partition table, either from a backup or by manually recreating it using the Linux fdisk tool.
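If you took an sfdisk dump of the partition table earlier, restoring it is a one-liner; a sketch, again assuming /dev/sdb and the example file name used above:
[root@localhost ~]# sfdisk /dev/sdb < /root/sdb-partition-table.dump
After that, refresh storage and you can skip the manual fdisk steps below. With no backup, carry on with the manual recreation.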
Things to know before you start
- The correct VMKernel path to the LUN
- The correct Linux disk ID for the partition
- The correct starting block for the partition
VMKernel Path
The VMKernel path is the path shown to the now empty LUN in the “Storage” view in the VI Client. Ideally you would document this with the other Datastore documentation. There may be multiple paths to the LUN, in which case you want the path with the lowest HBA and Target numbers.
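If you are not sure which paths exist to the LUN, you can also list the multipathing information from the service console with esxcfg-mpath; a quick check (the output format differs between ESX versions):
[root@localhost ~]# esxcfg-mpath -l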
Linux Disk ID
Hopefully you now know the VMKernel path to the LUN that held the Datastore. The Linux disk ID can be resolved from the VMKernel disk ID (vmhba…. number). On VI3 use esxcfg-vmhbadevs; on vSphere use "esxcfg-scsidevs -c".
In the VI3 example below, the VMKernel device vmhba0:0:25 maps to the Linux device /dev/sdb.
[root@desx1 root]# esxcfg-vmhbadevs
vmhba0:0:0 /dev/sda
vmhba0:0:25 /dev/sdb
vmhba0:0:26 /dev/sdc
vmhba0:0:31 /dev/sdd
vmhba1:0:31 /dev/sde
vmhba2:0:0 /dev/cciss/c0d0
In the vSphere example below, the VMKernel device vmhba0:C0:T1:L0 maps to the Linux device /dev/sdb.
[root@localhost ~]# esxcfg-scsidevs -c
Device UID Device Type Console Device Size Plugin Display Name
mpx.vmhba0:C0:T0:L0 Direct-Access /dev/sda 16384MB NMP Local VMware, Disk (mpx.vmhba0:C0:T0:L0)
mpx.vmhba0:C0:T1:L0 Direct-Access /dev/sdb 8192MB NMP Local VMware, Disk (mpx.vmhba0:C0:T1:L0)
mpx.vmhba32:C0:T0:L0 CD-ROM /dev/sr0 0MB NMP Local NECVMWar CD-ROM (mpx.vmhba32:C0:T0:L0)
Starting Block
On both VI3 and vSphere, determining the starting block requires documentation or an educated guess. To find the starting block of an existing VMFS partition use the command "fdisk -lu /dev/sdx", where /dev/sdx is the Linux device ID for the LUN that contains the Datastore. The Start column will show it; below is an example with a start block of 128, which should be recorded with the rest of the documentation for the Datastore. A value of 128 is a good guess if you don’t have documentation, as it is the default start block for all VMFS partitions created using the VI Client.
[root@localhost ~]# fdisk -lu /dev/sdb
Disk /dev/sdb: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 128 16771859 8385866 fb VMware VMFS
Recovery Process
First off, if you are uncertain, log a call with VMware Support; they fix these sorts of problems routinely and will make sure you don’t do more damage. Repair it yourself only if a support call is not an option.
- Before you try to recover the Datastore make sure it is really gone: refresh storage on every ESX server that can see the LUN and make sure none of them can see the Datastore.
- Choose an ESX server that is not too critical to complete the remaining steps, preferably one in maintenance mode so no VMs can be affected.
- Log on to the service console of the ESX server and elevate to root.
- Run "fdisk -l /dev/sdx", where /dev/sdx is the Linux device ID for the LUN that used to contain the Datastore. This will list the current partition table of the disk; if you see partitions then stop.
- If there is no partition table or if it is blank then run “fdisk /dev/sdx”
- In fdisk create a new primary partition occupying all of the available space
- In fdisk use "t" to change the partition type to "fb"
- In fdisk use "x" to go to expert mode, then use "b" to change the start block of the partition to the correct block
- In fdisk use “p” to print the current partition table and confirm that it corresponds to the recorded information
- Use “w” to quit fdisk and write the new partition table to disk.
- Use the VI Client to refresh storage on the ESX server, you should see your Datastore return
Below is a sequence that may help make it clearer:
The VM before the Datastore was deleted:
The VM after the Datastore was deleted:
Command sequence on the ESX server console:
[root@localhost ~]# esxcfg-scsidevs -c
Device UID Device Type Console Device Size Plugin Display Name
mpx.vmhba0:C0:T0:L0 Direct-Access /dev/sda 16384MB NMP Local VMware, Disk (mpx.vmhba0:C0:T0:L0)
mpx.vmhba0:C0:T1:L0 Direct-Access /dev/sdb 8192MB NMP Local VMware, Disk (mpx.vmhba0:C0:T1:L0)
mpx.vmhba32:C0:T0:L0 CD-ROM /dev/sr0 0MB NMP Local NECVMWar CD-ROM (mpx.vmhba32:C0:T0:L0)
[root@localhost ~]# fdisk /dev/sdb
The number of cylinders for this disk is set to 8192.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): p
Disk /dev/sdb: 8589 MB, 8589934592 bytes
64 heads, 32 sectors/track, 8192 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Device Boot Start End Blocks Id System
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-8192, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-8192, default 8192):
Using default value 8192
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fb
Changed system type of partition 1 to fb (VMware VMFS)
Command (m for help): x
Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (32-16777215, default 32): 128
Expert command (m for help): p
Disk /dev/sdb: 8589 MB, 8589934592 bytes
64 heads, 32 sectors/track, 8192 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 1 8192 8388544 fb VMware VMFS
Expert command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
[root@localhost ~]# fdisk -lu /dev/sdb
Disk /dev/sdb: 8589 MB, 8589934592 bytes
64 heads, 32 sectors/track, 8192 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 128 16777215 8388544 fb VMware VMFS
[root@localhost ~]#
© 2009, Alastair. All rights reserved.
Thanks for the article, I fixed it with:
esxcfg-volume -l
esxcfg-volume -m
Sharon
Hi Sharon,
Those commands will restore to service a datastore where the LUN presentation has changed, rather than one whose partition has been deleted.
More information is in the VMware KB: kb.vmware.com/kb/1011387
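For that scenario the rough sequence, assuming the volume shows up in the unresolved/snapshot list, is:
[root@localhost ~]# esxcfg-volume -l
[root@localhost ~]# esxcfg-volume -m <VMFS UUID or label>
Using -M instead of -m mounts the volume persistently across reboots.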