VM Delta Migration Procedure

A while back we were charged with moving VMs to a new data center while also keeping downtime to a minimum. My team and I came up with a VM Delta Migration process that ships the bulk of the VM ahead of time and only transfers a small delta (basically the snapshot) during the cutover. The basic process was to power off the VM, take a snapshot, copy the base VM files to external media, and power the VM back on. That media was then shipped to the new DC to import. Once the import was staged and ready, we shut down the VM again, transferred the snapshot files over SFTP, copied them into the new VM folder, and powered on the VM at the new site. Once the VM was powered on and verified working, we were able to remove the snapshot. I've documented the process below for anyone who may want to do something similar.

This article details the steps taken to migrate a large VM in two parts: Part 1 is a bulk data copy, sent via physical media because of the large files involved, and Part 2 is an incremental copy that lets us keep the VM available during the shipping window. When the VM is imported at its new home, both parts are combined.

Step 1:

Power off the VM, and create a snapshot.

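If you prefer to script this step, below is a minimal sketch using pyVmomi (the library choice is my assumption; the vCenter host, credentials and VM name are placeholders). The same task can of course be done from the vSphere Client.

# Minimal sketch: power off the VM and create the migration snapshot with pyVmomi.
# The vCenter host, credentials and VM name below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab shortcut; use proper certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)

# Locate the VM by name with a simple container view walk.
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "BigVM01")
view.DestroyView()

# Power off, then snapshot without memory (the VM is already off at this point).
if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn:
    WaitForTask(vm.PowerOffVM_Task())
WaitForTask(vm.CreateSnapshot_Task(name="delta-migration",
                                   description="Base copy shipped on physical media",
                                   memory=False, quiesce=False))
Disconnect(si)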

Step 2:

Browse to the datastore where the VM is located and copy all files in its folder to the bulk storage destination. Delete the vmware*.log files from the destination copy.

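Deleting the log files from the destination copy can be done by hand, or with a quick sketch like the one below (the destination path is a placeholder for wherever the VM folder landed on the external media).

# Minimal sketch: remove vmware*.log files from the bulk copy on the external media.
import glob
import os

dest = r"E:\bulk-copy\BigVM01"  # placeholder: the copied VM folder on the external media
for log in glob.glob(os.path.join(dest, "vmware*.log")):
    print("removing", log)
    os.remove(log)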

Step 3:

Power the VM back on, and ship the physical media over to the new location.

Step 4:

Once the media has been received, power the VM off again and copy the following files over to the SFTP server (a transfer sketch follows the list):

  • The .vmx file
  • The .nvram file
  • The *-000001.vmdk snapshot descriptor files
  • The *-000001-delta.vmdk snapshot delta files
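Here is a minimal sketch of the SFTP copy using paramiko (the library choice is my assumption; any SFTP client works). The server, credentials, VM name and remote directory are placeholders, and the remote directory must already exist.

# Minimal sketch: push the snapshot-era files to the SFTP server with paramiko.
import os
import paramiko

files = [
    "BigVM01.vmx",
    "BigVM01.nvram",
    "BigVM01-000001.vmdk",
    "BigVM01-000001-delta.vmdk",
]
local_dir = "/vmfs/volumes/datastore1/BigVM01"  # placeholder source VM folder

transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="migration", password="password")
sftp = paramiko.SFTPClient.from_transport(transport)
for name in files:
    sftp.put(os.path.join(local_dir, name), "/upload/BigVM01/" + name)
    print("uploaded", name)
sftp.close()
transport.close()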

Step 5:

At the new data center, copy the files from step 4 to the physical media from step 2. Overwrite any files that are duplicates.

Step 6:

Upload all files from the physical media to a datastore, then import the VM by right-clicking the .vmx file and choosing "Add to Inventory".

Step 7:

Power the VM on, and once everything is confirmed working, delete the snapshot.
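If you scripted the snapshot in Step 1, the removal can be scripted the same way; a minimal pyVmomi sketch, reusing the connection and VM lookup from that example:

# Minimal sketch: remove the migration snapshot chain and merge the deltas
# back into the base disks. Assumes 'vm' was located as in the Step 1 sketch.
from pyVim.task import WaitForTask

WaitForTask(vm.RemoveAllSnapshots_Task())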

 

I hope this helps anyone else who needs to migrate VMs between data centers while keeping downtime to a minimum.

What It Means to Be a VMware vExpert

What It Means to Be a VMware vExpert [rubrik.com]


Being a technical professional often results in no one knowing about the work being done–if things go according to plan. Our focus is on being invisible. We set up complicated servers, hook them all together, and offer up resources for virtual machines and data repositories. And in the end, an application is given a home, where visibility of the world is often abstracted from view.


VMware Social Media Advocacy

vExpert 2017 Award Announcement

Honored to be selected for my 3rd year in the vExpert program.


First we would like to say thank you to everyone who applied for the 2017 vExpert program. I'm pleased to announce the list of 2017 vExperts. Each of these vExperts has demonstrated significant contributions to the community and a willingness to share their expertise with others. Contributing is not always blogging or Twitter, as there are many public speakers, book authors, CloudCred task writers, script writers, VMUG leaders, VMTN community moderators and internal champions among this group.


VMware Social Media Advocacy

Get step-by-step guidance for installing, configuring & managing VMware products & solutions

The VMware Feature Walkthrough site provides step-by-step guidance for installing, configuring & managing VMware products & solutions.


VMware Social Media Advocacy

VMware Security Advisory for ESXi 6.0 and 5.5

VMware released a new security advisory today advising that ESXi versions 6.0 and 5.5 are vulnerable to cross-site scripting (XSS). The details of the advisory, as well as the current solution, can be found below.

Advisory ID: VMSA-2016-0023

Severity:    Important

Synopsis:    VMware ESXi updates address a cross-site scripting issue

Issue date:  2016-12-20

Updated on:  2016-12-20 (Initial Advisory)

CVE number:  CVE-2016-7463

  1. Summary

VMware ESXi updates address a cross-site scripting issue.

 

  2. Relevant Releases

VMware vSphere Hypervisor (ESXi)

 

  3. Problem Description

  a. Host Client stored cross-site scripting issue

 

The ESXi Host Client contains a vulnerability that may allow for stored cross-site scripting (XSS). The issue can be introduced by an attacker that has permission to manage virtual machines through ESXi Host Client or by tricking the vSphere administrator to import a specially crafted VM. The issue may be triggered on the system from where ESXi Host Client is used to manage the specially crafted VM.

VMware advises not to import VMs from untrusted sources.

VMware would like to thank Caleb Watt (@calebwatt15) for reporting this issue to us.

The Common Vulnerabilities and Exposures project (cve.mitre.org) has assigned the identifier CVE-2016-7463 to this issue.

 

Column 4 of the following table lists the action required to remediate the vulnerability in each release, if a solution is available.

VMware   Product  Running   Severity   Replace with/         Workaround
Product  Version  on                   Apply Patch*
=======  =======  ========  =========  ====================  ==========
ESXi     6.5      ESXi      N/A        not affected          N/A
ESXi     6.0      ESXi      Important  ESXi600-201611102-SG  None
ESXi     5.5      ESXi      Important  ESXi550-201612102-SG  None

*The fling version which resolves this issue is 1.13.0.

 

  4. Solution

Please review the patch/release notes for your product and version and verify the checksum of your downloaded file.
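For the checksum step, something like the sketch below works; the bundle file name and expected hash are placeholders you would replace with the values shown on the VMware download page.

# Minimal sketch: compare a downloaded patch bundle against its published checksum.
import hashlib

bundle = "ESXi600-201611001.zip"  # placeholder file name
expected = "0123456789abcdef..."  # placeholder: hash published on the download page

h = hashlib.sha256()
with open(bundle, "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        h.update(chunk)

print("OK" if h.hexdigest() == expected else "MISMATCH", h.hexdigest())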

 

ESXi 6.0

————-

Downloads:

https://www.vmware.com/patchmgr/findPatch.portal

Documentation:

http://kb.vmware.com/kb/2145815

 

ESXi 5.5

————

Downloads:

https://www.vmware.com/patchmgr/findPatch.portal

Documentation:

http://kb.vmware.com/kb/2148194

 

  5. References

http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-7463

Source: http://www.vmware.com/security/advisories/VMSA-2016-0023.html

Update Manager Error: sysimage.fault.SSLCertificateError

I had to redeploy a vCenter Server Appliance recently and got an error when opening vCenter: sysimage.fault.SSLCertificateError.

This is caused by the certificate on the vCenter Server changing, which means Update Manager needs to be re-registered with the vCenter Server. Luckily, VMware provides a utility to do just that. Go to your vSphere Update Manager server (the appliance does not include this, so it will typically be installed on a separate Windows Server).

Go to “C:\Program Files (x86)\VMware\Infrastructure\Update Manager” and open the file “VMwareUpdateManagerUtility”.

Next, enter the credentials for your vCenter Server and hit Login.

Once the window opens, click "Re-register to vCenter Server". This brings up another login screen; you can use the same credentials you used to open the utility.

Finally, click Apply and you will receive a notification that you need to restart the VMware vSphere Update Manager service in order for the settings to take effect.

Restart the service and open the vSphere Client again. The error will be gone and you'll be able to use Update Manager again.

 

ESXi 5.1 Host out of sync with VDS

A recent issue I ran into was that our ESXi 5.1 hosts would go "out of sync" with the VDS. The only fix that worked was rebooting the host. After digging into the log files, I discovered that the host was failing to get state information from the VDS. The entries are below:

value = "Failed to get DVS state from vmkernel Status (bad0014)= Out of memory",
             }
          ],
          message = "Operation failed, diagnostics report: Failed to get DVS state from vmkernel Status (bad0014)= Out of memory"

The issue is a bug in the version of 5.1 we were running (Update 2 at the time): a memory leak on the host when VMs use E1000 NICs. Because these VMs were created a long time ago, they defaulted to the E1000 adapter. The fix is updating to the latest build of ESXi, which addresses the leak. And also, don't use E1000 NICs; always go with VMXNET3. Problem solved!
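To find which VMs are still carrying E1000 adapters before swapping them for VMXNET3, here is a quick pyVmomi sketch (vCenter host and credentials are placeholders):

# Minimal sketch: report VMs that still have E1000/E1000e NICs.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)

for vm in view.view:
    if vm.config is None:  # skip VMs with no readable config
        continue
    for dev in vm.config.hardware.device:
        if isinstance(dev, (vim.vm.device.VirtualE1000, vim.vm.device.VirtualE1000e)):
            print(vm.name, dev.deviceInfo.label)

view.DestroyView()
Disconnect(si)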

Lost Path Redundancy to Storage Device

After installing 3 new hosts, I kept getting errors for Storage Connectivity stating "Lost path redundancy to storage device naa…….". We had 2 fibre cards and one of the paths was being marked as down. I spent a couple of weeks troubleshooting and trying different path selection techniques. Still, we would randomly get alerts that the redundant path had gone down. The only fix was to reboot the host, as not even a rescan would bring the path back up.

So after some trial and error, I found a solution. The RCA isn't complete yet, but I believe the problem was a combination of outdated firmware on the fibre switch and the new fibre cards in our hosts. When using the Fixed path selection policy, the host would randomly pick an HBA for each datastore. Some datastores would use path 2 and some would use path 4.

The solution I came up with was to manually set the preferred path on each datastore (we have about 40, so it was no easy task). Go into your host configuration, choose Storage, pick a datastore and open its properties. Inside this window, select Manage Paths from the bottom right and you should see your HBAs listed. There is a column marked Preferred with an asterisk showing which HBA is preferred for the datastore (see the image below). I went through and manually set the preferred path to be hba2 instead of letting VMware pick the path. The path selection persists across reboots as well when set manually.
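The same change can also be made from the command line with esxcli, which helps when you have ~40 datastores to touch. Below is a hedged sketch that drives esxcli over SSH with paramiko; the host, credentials, device IDs and path names are placeholders you would pull from "esxcli storage nmp path list".

# Minimal sketch: set the Fixed PSP preferred path for a list of devices via esxcli over SSH.
import paramiko

preferred = {  # placeholder device IDs mapped to their preferred paths
    "naa.60a98000486e2f34": "vmhba2:C0:T0:L10",
    "naa.60a98000486e2f35": "vmhba2:C0:T0:L11",
}

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("esxi01.example.com", username="root", password="password")

for device, path in preferred.items():
    cmd = ("esxcli storage nmp psp fixed deviceconfig set "
           "--device {0} --path {1}".format(device, path))
    stdin, stdout, stderr = ssh.exec_command(cmd)
    print(device, stdout.read().decode().strip() or "ok", stderr.read().decode().strip())

ssh.close()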

Since manually setting the preferred path, the hosts have been stable and we have not gotten any more errors about path redundancy. This is pretty much a band-aid fix, but at least we are not rebooting hosts 2-3 times per week.

MTU Mismatch and Slow I/O

After a month or two of troubleshooting some storage issues we have been having with our NetApp system, we dug up an interesting piece of information.  When reviewing the MTU size on the host and on the NetApp, we noticed that the host was set for 1500 and the NetApp interface was set at 9000.  Whoops!

Before troubleshooting, we were seeing I/O at a rate of about 2,500 IOPS to the NetApp system. However, after changing the MTU to match on both the ESXi host and the NetApp, we saw IOPS jump to close to 10,000. Just a quick breakdown of what was happening here:

  1. The host would send data with an MTU of 1500 to the NetApp.
  2. The NetApp would retrieve the data and try to send it back at 9000.
  3. The switch would reject it, stating it could only accept 1500.
  4. The NetApp would then have to repackage the data down to 1500.

Basically, we were doubling the time it took to return the data to the host, and in turn to the guest VM. The slow I/O was due to the translation time on the NetApp to get the data back to the host at the smaller size. The switch interface was also set at 1500 and was rejecting the jumbo traffic.

Word to the wise: always double-check MTU settings and ensure they match through the entire path back to the host. Just another one of those things to have in your back pocket when troubleshooting.
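A quick way to confirm the MTU end to end is a don't-fragment ping at the jumbo payload size; the sketch below wraps the Linux ping flags (on an ESXi host itself the equivalent is vmkping -d -s 8972). The target address is a placeholder.

# Minimal sketch: verify jumbo frames end to end with a don't-fragment ping.
# 8972 = 9000-byte MTU minus 20 bytes of IP header and 8 bytes of ICMP header.
import subprocess

target = "192.168.50.10"  # placeholder: NetApp data interface
for size in (1472, 8972):  # standard and jumbo payload sizes
    result = subprocess.run(
        ["ping", "-M", "do", "-s", str(size), "-c", "3", target],
        capture_output=True, text=True)
    status = "passes" if result.returncode == 0 else "FAILS (frames dropped or fragmentation needed)"
    print("MTU test with {0}-byte payload {1}".format(size, status))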

vNUMA: A VMware Admin’s Guide

Since I've been discussing this recently at work with developers and managers due to some SQL Server issues we have been having, I decided it was time to write about it. vNUMA was introduced along with vSphere 5 back in the day and is related to NUMA (Non-Uniform Memory Access). In a nutshell, vNUMA allows you to divide your CPU and memory resources into different nodes, which allows for faster memory access.

The image above shows how NUMA breaks up the resources to create the nodes. This is also a setting that can be controlled within VMware. vNUMA is typically used for your "Monster VMs", where your virtual CPUs span multiple physical CPUs. For example, if your virtual machine has 12 vCPUs and your physical host has 8 cores per CPU, you are spanning multiple physical CPUs. vNUMA is also recommended when using more than 8 vCPUs. To change the NUMA setting (keep in mind that the VM must be powered off to make this change):

  1. Select your VM
  2. Choose Edit Settings…
  3. Go to the Options tab
  4. Go to Advanced -> General
  5. Click the button “Configuration Parameters”
  6. Add the parameter "numa.vcpu.min" and set it to the number of vCPUs you want in each NUMA node (a scripted alternative is sketched after this list).
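If you would rather script the change than click through the GUI, here is a minimal pyVmomi sketch for adding the advanced parameter (vCenter details and the VM name are placeholders, and the VM still needs to be powered off):

# Minimal sketch: set numa.vcpu.min as an advanced configuration parameter with pyVmomi.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "SQL01")
view.DestroyView()

spec = vim.vm.ConfigSpec(
    extraConfig=[vim.option.OptionValue(key="numa.vcpu.min", value="6")])
WaitForTask(vm.ReconfigVM_Task(spec))
Disconnect(si)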

One thing to keep in mind with NUMA is that you want to size the NUMA nodes according to the number of cores in your physical host. For example, my physical hosts have a six-core processor, so the NUMA node size on the SQL Server is set to 6. This gives my 12-vCPU SQL Server 2 NUMA nodes with 6 CPUs each. When you create a VM with more than 8 vCPUs, vNUMA is enabled by default. Using the instructions above, you can change the vNUMA value or set it on VMs with fewer than 8 vCPUs.

vNUMA is very specific to certain use cases and should be used with caution. Incorrectly configuring NUMA can cause more problems than leaving it at the default. Be sure to test your settings on a non-production server and confirm the results are as expected. One final thing to keep in mind: when vNUMA is in use, make sure the hosts in your cluster have similar NUMA configurations to avoid issues when the VM vMotions to a different host.

Image courtesy of brentozar.com