ESXi Hosts Disconnecting Randomly

A recent issue we experienced was hosts disconnecting from vCenter and then reconnecting.  Hosts would drop and come back at random over the course of an hour or more.  The VMs never saw any issues, nor was there any type of outage.  The only problem was that vCenter could no longer see the host.

After quite a bit of troubleshooting, I started digging around in the vCenter Server Settings (Administration > vCenter Server Settings).  In this menu, there is a tab for Runtime settings.  I noticed that we only had the vCenter Server Name filled in and not the vCenter Server Managed IP. The window looks as follows:

[Screenshot: vCenter Runtime Settings]

After completing all the fields in this window, the hosts magically all reconnected and have not dropped again.  The hosts use these settings to check in with the vCenter box, and the settings tell each host which vCenter is managing it.  As you can guess, if a host doesn’t know who’s managing it, it doesn’t know who to check in with.

The more curious part was that this field had never been filled out, yet the disconnects didn’t start immediately.  That made troubleshooting more difficult and made us all panic as we started getting numerous alerts for hosts dropping.

As a best practice, even if you only have one vCenter server, fill out all of these fields and ensure they are correct.  This is especially true if you want the hosts to check in with the correct vCenter server and you don’t want the heart attack of seeing numerous hosts suddenly disconnecting from vCenter.
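
One way to confirm which vCenter a host thinks is managing it is to look at the host's own agent configuration.  The sketch below is only an illustration: it assumes the host keeps that address in a `serverIp` element of its vpxa configuration file (commonly `/etc/vmware/vpxa/vpxa.cfg`, but the location and layout may differ by version), and the sample XML and IP address are made up for the example.

```python
import xml.etree.ElementTree as ET


def managing_server_ips(vpxa_cfg_xml):
    """Return every non-empty serverIp value found in the given XML text.

    Assumption: the managing vCenter address is stored in an element
    named serverIp somewhere in the file; the nesting is not relied on.
    """
    root = ET.fromstring(vpxa_cfg_xml)
    return [
        el.text.strip()
        for el in root.iter("serverIp")
        if el.text and el.text.strip()
    ]


# Hypothetical, heavily trimmed sample; a real file has many more settings.
sample = """<config>
  <vpxa>
    <serverIp>192.0.2.10</serverIp>
  </vpxa>
</config>"""

print(managing_server_ips(sample))
```

An empty result, or an address that is not your vCenter server, would be a hint to revisit the Runtime Settings described above.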

MTU Mismatch and Slow I/O

After a month or two of troubleshooting some storage issues we have been having with our NetApp system, we dug up an interesting piece of information.  When reviewing the MTU size on the host and on the NetApp, we noticed that the host was set for 1500 and the NetApp interface was set at 9000.  Whoops!

Before the fix, we were seeing about 2500 IOPS to the NetApp system.  After changing the MTU to match on both the ESXi host and the NetApp, we saw IOPS jump to close to 10,000.  Just a quick breakdown of what was happening here:

  1. The host would send data to the NetApp with an MTU of 1500.
  2. The NetApp would retrieve the data and try to send it back with an MTU of 9000.
  3. The switch would reject those frames, since its interface could only accept 1500.
  4. The NetApp would then have to break the data down to fit an MTU of 1500.

Basically, we were doubling the time it took to return the data to the host and in turn to the guest VM.  The slow I/O came from the extra work on the NetApp to get the data back to the host in a size the path would accept.  The switch interface was also set at 1500 and was rejecting the larger frames.

Word to the wise: always double-check MTU settings and ensure they are the same along the entire path back to the host.  Just another one of those things to have in your back pocket when troubleshooting.
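
A minimal sketch of that check, assuming you have already collected the MTU for each hop by hand.  The hop names and the target address are hypothetical; the MTU values are the ones from this post.  It also works out the payload size for a don't-fragment ping: the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header, which gives the 8972 commonly used with `vmkping -d -s` to test a 9000-byte path.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def find_mtu_mismatches(path):
    """Return the hops whose MTU differs from the first hop's MTU.

    `path` is an ordered list of (hop_name, mtu) tuples.
    """
    expected = path[0][1]
    return [(name, mtu) for name, mtu in path if mtu != expected]


def vmkping_command(target, mtu):
    """Build a don't-fragment vmkping command sized for the given MTU."""
    return ["vmkping", "-d", "-s", str(max_ping_payload(mtu)), target]


# Hypothetical hop names; MTU values as described in this post.
path = [("esxi-vmkernel", 1500), ("switch-port", 1500), ("netapp-interface", 9000)]

print(find_mtu_mismatches(path))
print(" ".join(vmkping_command("192.0.2.20", 9000)))
```

If the don't-fragment ping at the full size fails while a smaller one succeeds, something along the path is still set below the MTU you expected.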

Datastore not visible after upgrading to ESXi 5

After upgrading my dev datacenter and rebooting the first ESXi 5 host, I realized that one of my fiber datastores was missing.  The path to the datastore was still visible to the host under the HBA, but it was not showing as an available datastore in the storage view.  Upon investigation, the datastore had been tagged as a snapshot datastore and was not mounting properly to the host.  This can be found by running the following:

esxcli storage vmfs snapshot list

You will see an output similar to:

<UUID>
   Volume Name: <VOLUME_NAME>
   VMFS UUID: <UUID>
   Can mount: true
   Reason for un-mountability:
   Can resignature: false
   Reason for non-resignaturability: the volume is being actively used
   Unresolved Extent Count: 2

Next, I had to force mount the datastore from the CLI by first changing to “/var/log” and then running:

esxcli storage vmfs snapshot mount -u <UUID> -l <VOLUME_NAME>

The mount will be persistent across reboots.  If you would like to make it non-persistent, add “-n” to the command.  Once it has run, check your host and the datastore should be showing as an available datastore again.  No reboot is needed and the change takes effect immediately.
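
If you have more than one host or volume to deal with, the listing and mount steps above can be scripted.  The following is a rough sketch, not a tested tool: it assumes the listing has the layout shown earlier (an unindented UUID line followed by indented “key: value” lines), the sample UUID and volume name are invented, and it only builds the mount command rather than running it.  Note that it passes only “-u”, whereas the command above also passes “-l”.

```python
def parse_snapshot_list(text):
    """Parse snapshot-list style text into a list of dicts, one per volume."""
    volumes = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if not line[0].isspace():
            # An unindented line starts a new volume entry (its UUID).
            current = {"id": line.strip()}
            volumes.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return volumes


def mount_command(volume, persistent=True):
    """Build the force-mount command for one parsed volume entry."""
    cmd = ["esxcli", "storage", "vmfs", "snapshot", "mount",
           "-u", volume["VMFS UUID"]]
    if not persistent:
        cmd.append("-n")  # non-persistent mount, as described above
    return cmd


# Invented sample values, laid out like the listing shown earlier.
sample = """aaaaaaaa-bbbbbbbb-cccc-dddddddddddd
   Volume Name: dev_datastore
   VMFS UUID: aaaaaaaa-bbbbbbbb-cccc-dddddddddddd
   Can mount: true
   Reason for un-mountability:
   Can resignature: false
   Reason for non-resignaturability: the volume is being actively used
   Unresolved Extent Count: 2
"""

mountable = [v for v in parse_snapshot_list(sample) if v.get("Can mount") == "true"]
for volume in mountable:
    print(" ".join(mount_command(volume)))
```

Review the generated commands before running anything, since the caveats about force mounting still apply.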

You can also mount the datastore using the vSphere client by following the steps below:

  1. Go to your host in question
  2. On the storage tab, click add storage
  3. Choose disk/LUN
  4. Find the LUN that is missing. If it is not shown, you will need to use the above steps to mount using CLI
  5. Under mount options, choose “Keep Existing Signature” so the mount persists across reboots
  6. Click through to finish

There are a few caveats to force mounting a datastore though.  The datastore can only be mounted if a datastore with the same UUID is not already mounted.  If you choose to use the client to force mount the datastore, it cannot be mounted to other hosts in the same datacenter.  You will need to use the CLI steps posted above to mount it to other hosts.

For more information about this issue and steps to fix in ESX/ESXi 4 and 3.5, you can find the VMware KB here.