VMware ESX 5.1u1: Increasing pvscsi Adapter Queue Depth... Hey, What on Earth are RequestRingPages?
Figured I'd write about this, because it won't be long before I forget it all again. A few weeks from now, I'll need this... and I won't be able to piece it together from memory or from VMware documentation. Then I'll google... find my own blog post... refresh my memory... rinse... repeat :)
SQL Server performance on virtual platforms is a big deal these days. It's pretty rare that folks really get to the guts of what may limit the performance of a particular workload on a virtual platform.
I'm lucky that the main SQL Server workflows I am concerned with all sit on the same side of the spectrum between latency-sensitive and bandwidth-hungry.
VMware ESX 5.1 update 1 includes a change that allows the vHBA adapter queue depth to be increased from the default of 256 to 1024. That's a good thing, since the default for Windows on a physical server with a QLogic, Brocade, or Emulex FC HBA is 1024 outstanding IOs per adapter port.
In fact, when my data-hungry workflows were tested in-house on VMware vSphere 5.0 and 5.1, it became apparent that the difference in adapter queue depth was a significant factor in the performance gap between SQL Server on a physical server and SQL Server on VMware on the same server. With the adapter queue depth increased and 4 vHBAs attached to the VM, performance of the target workflows was nearly indistinguishable from the physical server.
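To put rough numbers on that, here is a minimal Python sketch of the adapter-level ceiling for a VM with several vHBAs. The function name is made up for illustration, and the result is only an upper bound: it ignores per-LUN and host-side queue limits entirely.

```python
def total_outstanding_ios(adapters: int, depth_per_adapter: int) -> int:
    """Upper bound on outstanding IOs across all vHBAs of one VM.

    Assumption: adapter queue depth is the only limit considered here;
    LUN-level and host-level queue depths are ignored.
    """
    return adapters * depth_per_adapter


# 4 vHBAs, as in the tests described above.
default_total = total_outstanding_ios(adapters=4, depth_per_adapter=256)
raised_total = total_outstanding_ios(adapters=4, depth_per_adapter=1024)
print("default adapter depth:", default_total)
print("raised adapter depth:", raised_total)
```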
The instructions for increasing the vHBA adapter queue depth are here:
Large-scale workloads with intensive I/O patterns might require queue depths significantly greater than PVSCSI default values (2053145)
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2053145
You might find the kb article a little cryptic :)
*/
As an aside, let me mention that the VMs I test use dedicated server hardware, dedicated HBAs, and dedicated LUNs. For that reason, parameters like Disk.SchedNumReqOutstanding (ESXi 4.x, 5.0, 5.1) or "--schednumreqoutstanding | -O" (ESXi 5.5) are not something I typically pay attention to. If your configuration does share resources such as vSphere host LUNs with other VMs, you may want to read these:
Setting the Maximum Outstanding Disk Requests for virtual machines
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1268
http://pubs.vmware.com/vsphere-55/index.jsp?topic=/com.vmware.vcli.ref.doc/esxcli_storage.html
/*
The VMware kb article gives examples of increasing the host LUN queue depth by setting ql2xmaxqdepth (QLogic) and lpfc_lun_queue_depth (Emulex) to 128.
QLogic has this related post also giving an example of setting LUN queue depth ql2xmaxqdepth to 128.
https://support.qlogic.com/app/answers/detail/a_id/2189/~/configuring-queue-depth-for-large-workloads-in-esxi5.1x
But apparently, in tests for IBM, ql2xmaxqdepth was set as high as 256.
http://www.vmware.com/a/assets/vmmark/pdf/2012-04-12-IBM-FlexSystemx240.pdf
(page 7)
The most recent Emulex documentation I could find still lists 128 as the maximum value for lpfc_lun_queue_depth.
http://www-dl.emulex.com/support/elx/rt960/b12/docs/vmware/vmware_manual_elx.pdf
(page 23)
Dell, for their part, documents setting both ql2xmaxqdepth and lpfc_lun_queue_depth to 255 in Compellent Best practices :)
http://en.community.dell.com/cfs-file.ashx/__key/telligent-evolution-components-attachments/13-4491-00-00-20-43-79-43/Compellent-Best-Practices-with-VMware-ESX-4.x.pdf
(page 10)
Because of some nasty experiences with out-of-bounds parameter values resulting in unwanted effective values (and unwanted performance consequences), we decided to stick with a host-level LUN queue depth of 128. It worked well enough for what we were doing. There does seem to be coalescing of guest IOs at the ESX level: even though we increased the guest LUN queue depth to 254 (with guest LUNs in a one-to-one relationship with host LUNs), we never overflowed the host LUN queue depth of 128.
Why does the kb article mention a new maximum Windows guest LUN queue depth of 256, while the example sets it to 254? My guess: the difference between actual and effective queue depth settings. In other VMware documentation, an allowed queue depth of 32 for a LUN results in explicitly setting the queue depth to 30. It's kinda similar with IBM Power AIX LPARs that use vscsi-attached devices served through VIOs: the vscsi adapter has a maximum queue depth of 512 but 2 are reserved, so it can be set to 510. The LUN queue depths (the sum of which should be less than the vscsi adapter queue depth) should also leave room for 3 outstanding IOs used by the virtualization layer. I expect something similar here: a LUN queue depth of 256 is allowed for the guest, but 2 need to be reserved for the virtualization layer, so the maximum effective LUN queue depth is 254.
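The reserved-slot arithmetic behind that guess can be sketched in a few lines of Python. This only illustrates my guess and is not taken from VMware or IBM documentation: the helper name `effective_queue_depth` and the default of 2 reserved slots are assumptions made up for this example.

```python
def effective_queue_depth(actual_limit: int, reserved: int = 2) -> int:
    """Return the largest usable queue depth once `reserved` slots are held back.

    Assumption: the virtualization layer keeps `reserved` slots for itself.
    That is a guess based on the examples in this post, not a documented rule.
    """
    if reserved < 0 or actual_limit <= reserved:
        raise ValueError("actual_limit must be greater than reserved")
    return actual_limit - reserved


# The three cases mentioned above, each assuming 2 reserved slots.
examples = {
    "Windows guest LUN on PVSCSI": 256,
    "LUN with an allowed depth of 32": 32,
    "AIX vscsi adapter": 512,
}
for name, limit in examples.items():
    print(f"{name}: limit {limit} -> effective {effective_queue_depth(limit)}")
```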
Then there's the question that motivated me to write this post: how does the vHBA queue depth actually get increased to 1024? And what on earth is RequestRingPages?
The kb article includes this somewhat cryptic example of a modification to the Windows registry.
REG ADD HKLM\SYSTEM\CurrentControlSet\services\pvscsi\Parameters\Device /v "DriverParameter" /t REG_SZ /d RequestRingPages=32,MaxQueueDepth=254
Hmmm... decoder ring maybe? :) MaxQueueDepth sounds like it should be the per-LUN queue depth in the Windows guest. And, if I'm right about 2 IO slots reserved for the virtualization platform, that makes sense. But what is RequestRingPages, and where do we set the PVSCSI vHBA adapter queue depth to 1024?
RequestRingPages is a configuration setting for the PVSCSI driver. And what if I told you that each page allows 32 slots for outstanding IO requests on the adapter? That would work out splendidly: with 32 pages of 32 slots each, this registry edit gives a queue depth of 1024 outstanding IO requests for the PVSCSI adapter, along with an effective LUN queue depth of 254 (the most allowed with an actual queue depth limit of 256, reserving 2 for the virtual platform).
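Here is a small Python sketch of that decoding, in case it helps next time I forget. It takes the DriverParameter string from the registry example and works out the adapter queue depth, assuming 32 slots per ring page as described above. The function names and the `SLOTS_PER_RING_PAGE` constant are my own inventions for illustration, not part of any VMware tool.

```python
# Assumption (from the reasoning in this post): each request ring page
# holds 32 outstanding IO slots.
SLOTS_PER_RING_PAGE = 32


def parse_driver_parameter(value: str) -> dict[str, int]:
    """Split a string like 'RequestRingPages=32,MaxQueueDepth=254' into a dict."""
    settings = {}
    for item in value.split(","):
        key, _, number = item.partition("=")
        settings[key.strip()] = int(number)
    return settings


def adapter_queue_depth(request_ring_pages: int) -> int:
    """Outstanding IO slots on the adapter for a given number of ring pages."""
    return request_ring_pages * SLOTS_PER_RING_PAGE


params = parse_driver_parameter("RequestRingPages=32,MaxQueueDepth=254")
print("adapter queue depth:", adapter_queue_depth(params["RequestRingPages"]))
print("per-LUN queue depth:", params["MaxQueueDepth"])
```

Under the same assumption, the default adapter queue depth of 256 would correspond to 8 ring pages.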