MTU Mismatch and Slow I/O

After a month or two of troubleshooting some storage issues we have been having with our NetApp system, we dug up an interesting piece of information.  When reviewing the MTU size on the host and on the NetApp, we noticed that the host was set for 1500 and the NetApp interface was set at 9000.  Whoops!

Before troubleshooting, we were seeing I/O at a rate of about 2500 IOPS to the NetApp system. However, when making the MTU change to match on both the ESXi host and the NetApp, we saw IOPS jump to close to 10,000.  Just a quick breakdown of what was happening here:

  1. The host would send data with an MTU of 1500 to the NetApp.
  2. The NetApp would retrieve the data and try to send it back at 9000
  3. It would fail from the switch stating it could only accept 1500
  4. The NetApp would then have to translate the data down to 1500

Basically, we were doubling the time it took to return the data back to the host and in turn to the guest VM.  The slow I/O was due to the translation time on the NetApp to get the proper data back to the host.  The switch interface was also set at 1500 and was rejecting the traffic.

Word to the wise: Always double check MTU settings and ensure it is the same through the entire path back to the host.  Just another one of those things to have in your back pocket when troubleshooting.