
ESXi 4.1 Emulex LPe11000 FC HBA, errors SCSILinuxAbortCommands

I recently upgraded a customer from ESXi 4.0 Update 1 to 4.1 Update 1. Two clusters were upgraded, consisting of five and eight hosts respectively. All of the hosts are IBM x3650 M3s, and all but four of them have Emulex LPe11000 FC HBAs.

In the week or so since the upgrade we have had six hosts fail. The VMs on these hosts are lost: the ESXi host becomes generally unresponsive, but it still responds to HA pings from the other hosts in the cluster, so HA never fails the VMs over to another host. The VMs therefore show as unknown, and restarting the management agents from the DCUI doesn’t help. Sometimes when this issue occurs the DCUI is unresponsive and the Alt-F11/Alt-F2 keys do nothing; no PSOD occurs.

This issue has only affected the hosts with two single-port LPe11000 FC HBAs, a card which is on the HCL for ESXi 4.1. We are using ESXi Embedded on USB keys, not boot from SAN.

We have reviewed the storage and networking, and the fault is isolated to the affected host at the time it occurs; the other hosts show no issues at the same time.

All of the affected hosts have the same HBA: Emulex LPe11000, firmware 52A3.
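
For reference, Tech Support Mode will confirm which adapters a host has, which driver module each vmhba is bound to, and which datastore sits on each naa.* device (a quick sketch; both flags are standard esxcfg-scsidevs options on 4.x):

--------------------------------------------------
# List all storage adapters and the driver module each vmhba uses
esxcfg-scsidevs -a

# Map VMFS datastore names to their naa.* device IDs
esxcfg-scsidevs -m
--------------------------------------------------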

We noticed a lot of storage-related errors in /var/log/messages:

——————————————————————
FEB 23 16:33:06 vobd: Feb 23 16:33:00.126: 248389618631us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.6006016019802900e634e73f72a6df11. Path vmhba4:C0:T1:L59 is down. Affected datastores: “DM01_SAP_Test_VMFS01”.
FEB 23 16:33:06 vmkernel: 2:20:59:46.445 cpu3:4142)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f9feb40) to NMP device “naa.6006016019802900e634e73f72a6df11” failed on physical path “vmhba4:C0:T1:L59” H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
FEB 23 16:33:06 vmkernel: 2:20:59:46.445 cpu3:4142)WARNING: NMP: nmp_DeviceRetryCommand: Device “naa.6006016019802900e634e73f72a6df11”: awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
————————————————-
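
If you want to pull these errors out quickly on a suspect host, something like this from Tech Support Mode works (a minimal sketch; on ESXi 4.x the log is /var/log/messages):

--------------------------------------------------
# Pull out the storage connectivity, NMP retry and SCSI abort errors
grep -E 'connectivity.lost|nmp_DeviceRetryCommand|SCSILinuxAbortCommands' /var/log/messages
--------------------------------------------------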

We contacted VMware GSS. Initially we had little help: the first technical support representative was not even able to see any storage errors, and only confirmed what we already knew, that “the host is being unresponsive” and that it could be a hardware issue.

Seeing that at this stage it had affected four different hosts, it clearly wasn’t a fault in any single piece of hardware; more likely a firmware and driver issue.

After losing another few hosts and applying some pressure to GSS, I talked to Aakash from VMware Global Support, who was great in helping us. He confirmed that we were experiencing SCSI abort commands on our FC HBAs and that storage connectivity was being lost. From his notes:

- We noticed APD (All Paths Down) messages around 25 Feb 2011, 14:59, based on the log snippet below.

———————————————
FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu12:10442)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]

FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu18:10525)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]

FEB 25 14:57:12 vmkernel: 0:09:47:02.081 cpu14:4135)WARNING: LinScsi: SCSILinuxAbortCommands: Failed, Driver lpfc820, for vmhba4

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu7:4263)ScsiDeviceIO: 1672: Command 0x12 to device “naa.60060160e8802900bdb0f3eb16a4df11” failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu7:4263)WARNING: NMP: nmp_DeviceStartLoop: NMP Device “naa.60060160e8802900bdb0f3eb16a4df11” is blocked. Not starting I/O from device.

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: VMW_VAAIP_CX: cx_claim_device: Inquiry to device naa.60060160e8802900bdb0f3eb16a4df11 failed

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device “naa.60060160e8802900c078f2bbc3a5df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device “naa.60060160e8802900c078f2bbc3a5df11” due to Not found

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_DeviceRetryCommand: Device “naa.60060160e8802900c078f2bbc3a5df11”: awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_DeviceStartLoop: NMP Device “naa.60060160e8802900c078f2bbc3a5df11” is blocked. Not starting I/O from device.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu4:4258)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e8802900c078f2bbc3a5df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu0:4255)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e8802900bdb0f3eb16a4df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu21:4257)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e880290079f48bf2c1a5df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu12:4260)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e880290004445743c3a5df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: NMP: nmpDeviceAttemptFailover: Retry world failover device “naa.60060160e8802900c078f2bbc3a5df11” – issuing command 0x41027ef92940

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device “naa.60060160e8802900c078f2bbc3a5df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: NMP: nmpDeviceAttemptFailover: Retry world failover device “naa.60060160e8802900c078f2bbc3a5df11” – failed to issue command due to Not found (APD), try again…

- It can be caused by the Emulex driver, since there are abort commands:

FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu12:10442)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]

FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu18:10525)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]

FEB 25 14:57:12 vmkernel: 0:09:47:02.081 cpu14:4135)WARNING: LinScsi: SCSILinuxAbortCommands: Failed, Driver lpfc820, for vmhba4
—————————————————
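
If you suspect a host is heading into the same APD state, it is worth checking the path status from Tech Support Mode (a sketch; esxcfg-mpath is standard on ESXi 4.x, and the naa ID below is one of our affected LUNs):

--------------------------------------------------
# Brief listing of every device and its paths; dead paths stand out quickly
esxcfg-mpath -b

# Full path detail for one of the LUNs named in the warnings
esxcfg-mpath -l -d naa.60060160e8802900c078f2bbc3a5df11
--------------------------------------------------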

We are rolling the hosts back to ESXi 4.0 Update 1; the driver for this HBA changed in 4.1 to 8.2.1.30.1-58vmw.
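
If you want to confirm which driver version a host is actually running before and after the rollback, the loaded module reports it (a quick sketch from Tech Support Mode):

--------------------------------------------------
# Show the loaded Emulex module's version string
# (8.2.1.30.1-58vmw on our 4.1 hosts)
vmkload_mod -s lpfc820 | grep -i version
--------------------------------------------------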

This KB article mentions there are issues with 4.0 and this adapter, but GSS confirmed with VMware engineering that the HBA is supported:

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1012547&sliceId=1&docTypeID=DT_KB_1_1&dialogID=161271506&stateId=0

  1. Eric
    March 12, 2011 at 1:48 am

    I have the exact same issues with ESX 4.1 U1 and a Dell EqualLogic PS6000E iSCSI array, using HP G7s and the BNX II onboard NICs. VMware has not been very helpful with me either. So rolling back fixed the issues? I ask because I have had the intermittent crashes also, and my vmkernel log reads exactly the same as yours. Have you had any more insight into the issue?

  2. March 12, 2011 at 8:53 am

    Eric, our issues were indeed resolved by rolling back to 4.0 and therefore switching back to the older driver for the Emulex FC HBA. VMware have requested we use serial logging and try to recreate the issue to help them identify the cause.
    Although your issue may show the same errors, seeing as you are using iSCSI storage our problems are entirely different (we have fibre channel storage). I would compare the NIC driver versions pre-4.1 and on 4.1 to see if your driver version changed, as in the sketch below. Is your NIC on the HCL? Also, have you had the issue with more than one type of adapter?
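
    For example, from Tech Support Mode on each build (a sketch; the bnx2 module name is an assumption based on the NICs you mention, esxcfg-nics shows the actual module):

    --------------------------------------------------
    # List the NICs and the driver module each one is bound to
    esxcfg-nics -l

    # Then check that module's version (bnx2 assumed here)
    vmkload_mod -s bnx2 | grep -i version
    --------------------------------------------------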

  3. June 6, 2011 at 9:12 am

    Hi Paul,

    When did you start seeing the errors? After how many days?

    We are running our Emulex LPe11002 4Gb Fibre Channel Host Adapters at the following firmware revision:
    42C2071 FV2.82A4 DV8.2.1.30.1-58vmw

    After 4-5 days’ running time, I have not observed any issues with the latest IBM + Emulex firmware on ESXi 4.1 U1.
    http://mainframe.clancampbell.id.au/?p=65

    Naaman.

    • June 6, 2011 at 9:29 am

      Ensure the hosts are under load, especially storage load, as it happened on the hosts with the larger VM workloads. It took between three and six days to occur on average.

  4. Aakash Jacob
    November 17, 2011 at 11:21 am

    Hey Folks

    My name is Aakash and I work for the VMware support team in Bangalore. Thanks for highlighting my name. I feel very motivated and will keep helping customers.
