Here is an excerpt from the same blog:
A World queue (a queue per virtual machine), an Adapter queue (a queue per HBA in the host), and a Device/LUN queue (a queue per LUN per Adapter). Finally at the bottom of the storage stack there are queues at the storage device, for instance the front-end storage port has a queue for all incoming I/Os on that port.
and also this:
As you can see, the I/O requests flow into the per virtual machine queue, which then flows into the per HBA queue, and then finally the I/O flows from the adapter queue into the per LUN queue for the LUN the I/O is going to. From the default sizes you can see that each VM is able to issue 32 concurrent I/O requests, the adapter queue beneath it is generally quite large and can normally accept all those I/O requests, but the LUN queue beneath that typically only has a size of 32 itself.
If you want to check what your queues are currently set to, you can take a look at VMware KB 1027901.
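If you would rather pull the numbers straight from the host, the device queue depth is also visible from the command line. A quick sketch (assuming ESXi 5.x; the exact field name may differ on older releases):

~ # esxcli storage core device list | grep "Device Max Queue Depth"
   Device Max Queue Depth: 32

The adapter and device queues can also be watched live in esxtop: press 'd' for the adapter view (the AQLEN column) and 'u' for the device view (the DQLEN column).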
Not a lot touches upon the Adapter Queue Length (AQLEN). The Emulex driver has an option for it (though apparently this is not settable after ESX 3.5; more information here):
~ # vmkload_mod -s lpfc820 | grep lpfc_hba_queue_depth -A 1
lpfc_hba_queue_depth: int
Max number of FCP commands we can queue to a lpfc HBA
If you are using Software iSCSI there is an option for this as well:
~ # vmkload_mod -s iscsi_vmk | grep iscsivmk_HostQDepth -A 1
iscsivmk_HostQDepth: int
Maximum Outstanding Commands Per Adapter
As mentioned in the VMware blog above, the typical size is 1024 (or something in the thousands), but the default size is usually good. The vendor knows the best option for its piece of hardware, and it's rarely possible to change that value anyway.
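That said, the software iSCSI initiator is one place where the adapter parameter is actually exposed. If you ever did need to change it, the change would look roughly like this (a sketch, assuming ESXi 5.x style esxcli module handling; a reboot is needed before the module picks the value up):

~ # esxcli system module parameters set -m iscsi_vmk -p iscsivmk_HostQDepth=1024
~ # esxcli system module parameters list -m iscsi_vmk | grep HostQDepth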
The interesting setting is the LUN/Device Queue Length (DQLEN). There are several ways to tweak this option. The first way is to do it via the driver (more on this in VMware KB 1267; a worked example follows the driver list below). For each of the drivers the options are the following:
~ # vmkload_mod -s lpfc820 | grep lpfc_lun_queue_depth -A 1
lpfc_lun_queue_depth: int
Max number of FCP commands we can queue to a specific LUN
~ # vmkload_mod -s iscsi_vmk | grep iscsivmk_LunQDepth -A 1
iscsivmk_LunQDepth: int
Maximum Outstanding Commands Per LUN
~ # vmkload_mod -s qla2xxx | grep ql2xmaxqdepth -A 1
ql2xmaxqdepth: int
Maximum queue depth to report for target devices.
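Changing any of these is a module parameter change followed by a reboot. A sketch using the QLogic option from above (assuming ESXi 5.x syntax; KB 1267 has the exact per-version commands, and older ESX releases use esxcfg-module instead):

~ # esxcli system module parameters set -m qla2xxx -p ql2xmaxqdepth=64
~ # esxcli system module parameters list -m qla2xxx | grep ql2xmaxqdepth
~ # reboot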
The Cisco VIC hardcodes the DQLEN value to 32. From the Cisco article “Cisco Unified Computing System (UCS) Storage Connectivity Options and Best Practices with NetApp Storage”:
Hardcoded Parameters: LUN Queue Depth. This value affects performance in a FC environment when the host throughput is limited by the various queues that exist in the FC driver and SCSI layer of the operating system. The Cisco VIC adapter sets this value to 32 per LUN on ESX and Linux and 255 on Windows, and does not expose this parameter in the FC adapter policy. Emulex and Qlogic expose this setting using their host based utilities. Many customers have asked about how to change this value using the Cisco VIC adapter. Cisco is considering this request as an enhancement for a future release. However, FC performance with the VIC adapter has been excellent and there are no cases in evidence (that the author is aware of) indicating that this setting is not optimal at its current value. It should be noted that this is the default value recommended by VMware for ESX and other operating system vendors.
The VMware white paper “Scalable Storage Performance” describes what DQLEN is:
The SCSI protocol allows multiple commands to be active on a LUN at the same time. SCSI device drivers have a configurable parameter called the LUN queue depth that determines how many commands can be active at one time to a given LUN. QLogic Fibre Channel HBAs support up to 255 outstanding commands per LUN, and Emulex HBAs support up to 128. However, the default value for both drivers is set to 32. If an ESX host generates more commands to a LUN than the LUN queue depth, the excess commands are queued in the ESX kernel, and this increases the latency. The queue depth is per-LUN, and not per-initiator. The initiator (or the host bus adapter) supports many more commands (typically 2,000 to 4,000 commands per port).
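That VMkernel queueing is easy to spot in esxtop. In the disk device view (press 'u') the columns line up with the white paper's description (column names as on ESXi 5.x):

DQLEN - the device/LUN queue depth described above
ACTV  - commands currently active on the LUN
QUED  - commands waiting in the VMkernel because DQLEN was exceeded
%USD  - how full the device queue is (ACTV as a percentage of DQLEN)

A steadily non-zero QUED is the sign that the excess commands mentioned above are piling up in the kernel.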
There is also another way of tweaking this value, and that is with Disk.SchedNumReqOutstanding (DSNRO). This queue actually sits in front of the LUN queue. There is a very good step-by-step diagram of all these queues at the Virtual Geek blog “VMware I/O queues, “micro-bursting”, and multipathing”; the picture from that post is worth a look.
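For completeness, DSNRO is an advanced host option, and changing it is a one-liner. A sketch (assuming a release where it is still the host-wide /Disk/SchedNumReqOutstanding option; the output format may vary by version):

~ # esxcfg-advcfg -g /Disk/SchedNumReqOutstanding
Value of SchedNumReqOutstanding is 32
~ # esxcfg-advcfg -s 64 /Disk/SchedNumReqOutstanding

Note that DSNRO only kicks in when more than one virtual machine is issuing I/O to the LUN; with a single VM the full driver DQLEN applies.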