Archive

Posts Tagged ‘ESXi’

ESXi 4.1 Emulex LPe11000 FC HBA, errors SCSILinuxAbortCommands

February 27, 2011 6 comments

I recently upgraded a customer from 4.0 Update 1 to 4.1 Update 1. Two clusters were upgraded and they consisted of five and eight hosts each, all of the Hosts are IBM x3650 M3 with all but four of them having the Emulex LPe11000 FC HBA’s.

Since the upgrade over about a week we have had six hosts fail, the VM’s on these hosts are lost, the ESXi becomes generally unresponsive but still respond to HA pings from the other hosts in the Cluster. The VM’s therefore become unknown and restarting management agents on the ESXi DCUI doesn’t help. Sometimes when this issue has occured the DCUI is unresponsive and Alt-F11 Alt-F2 keys do nothing, no PSOD is happening.

This issue has only affected the hosts with two LPe11000 single port FC HBA’s, which is on the HCL for 4.1 ESXi. We are using ESXi embedded with USB keys and not boot from SAN.

We have reviewed the Storage and Networking and the fault is isolated to the affected host at the time it occurs, the other hosts don’t have any issues occuring at the same time.

All the affected hosts have had the following HBA: Emulex, LPe11000, firmware 52A3.

We noticed there are alot of storage related errors in the messages.log

——————————————————————
FEB 23 16:33:06 vobd: Feb 23 16:33:00.126: 248389618631us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.6006016019802900e634e73f72a6df11. Path vmhba4:C0:T1:L59 is down. Affected datastores: “DM01_SAP_Test_VMFS01”..
FEB 23 16:33:06 vmkernel: 2:20:59:46.445 cpu3:4142)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f9feb40) to NMP device “naa.6006016019802900e634e73f72a6df11” failed on physical path “vmhba4:C0:T1:L59” H:0x1 D:0x0 P:0x0 Possible sense data: 0 FEB 23 16:33:06 x0 0x0 0x0.
FEB 23 16:33:06 vmkernel: 2:20:59:46.445 cpu3:4142)WARNING: NMP: nmp_DeviceRetryCommand: Device “naa.6006016019802900e634e73f72a6df11”: awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
————————————————-

We contacted VMware GSS, initially we had little help with the first technical support presentative not even able to see any Storage errors and only confirming what we already knew that “the hosts is being unresponsive”, could be a hardware issue.

Seeing at this stage it had affected four different hosts it clearly wasn’t a hardware fault, possibly hardware firmware and driver issue.

After losing another few hosts and applying some pressure GSS, I talked to Aakash from VMware Global Support, he was great in helping us. He confirmed that we were experiencing SCSI aborts commands on our FC HBA and that the Storage connectivity is lost.

-We noticed APD messages around 25th Feb,2011 14:59 based on the below log snippet.

———————————————
FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu12:10442)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46] FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu18:10525)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46] FEB 25 14:57:12 vmkernel: 0:09:47:02.081 cpu14:4135)WARNING: LinScsi: SCSILinuxAbortCommands: Failed, Driver lpfc820, for vmhba4 FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu7:4263)ScsiDeviceIO: 1672: Command 0x12 to device “naa.60060160e8802900bdb0f3eb16a4df11” failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu7:4263)WARNING: NMP: nmp_DeviceStartLoop: NMP Device “naa.60060160e8802900bdb0f3eb16a4df11” is blocked. Not starting I/O from device.

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: VMW_VAAIP_CX: cx_claim_device: Inquiry to device naa.60060160e8802900bdb0f3eb16a4df11 failed FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device “naa.60060160e8802900c078f2bbc3a5df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device “naa.60060160e8802900c078f2bbc3a5df11” due to Not found FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_DeviceRetryCommand: Device “naa.60060160e8802900c078f2bbc3a5df11”: awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.

FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_DeviceStartLoop: NMP Device “naa.60060160e8802900c078f2bbc3a5df11” is blocked. Not starting I/O from device.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu4:4258)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e8802900c078f2bbc3a5df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu0:4255)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e8802900bdb0f3eb16a4df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu21:4257)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e880290079f48bf2c1a5df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu12:4260)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e880290004445743c3a5df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: NMP: nmpDeviceAttemptFailover: Retry world failover device “naa.60060160e8802900c078f2bbc3a5df11” – issuing command 0x41027ef92940 FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device “naa.60060160e8802900c078f2bbc3a5df11”.

FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: NMP: nmpDeviceAttemptFailover: Retry world failover device “naa.60060160e8802900c078f2bbc3a5df11” – failed to issue command due to Not found (APD), try again…

-It can be caused due to emulex driver since there are abort commands.

FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu12:10442)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46] FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu18:10525)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46] FEB 25 14:57:12 vmkernel: 0:09:47:02.081 cpu14:4135)WARNING: LinScsi: SCSILinuxAbortCommands: Failed, Driver lpfc820, for vmhba4
—————————————————

We are rolling back hosts to ESXi 4.0 update 1, the driver for that HBA changed in 4.1 to 8.2.1.30.1-58vmw.

This article mentions there are issues with 4.0 and this adapter but GSS confirmed with VMware engineering that the HBA is supported.

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1012547&sliceId=1&docTypeID=DT_KB_1_1&dialogID=161271506&stateId=0 0 161275257

Categories: Uncategorized Tags: ,

vCenter AD authentication ESXi Host Profile issue

November 21, 2010 4 comments

There is a bug with 4.1 host profiles, profiles built from a reference host that were joined to a Windows Domain using Authentication Services cannot be successfully applied. The Host Profile prompts you for credentials to authenticate and gets stuck, next in the host profile wizard no longer works. I also notice if you select options in the host profile settings displayed you may receive an errror “an internal error occurred in the vSphere Client. Details: The given key was not present in the dictionary.”

According to this Communities thread it is indeed a bug and will be fixed in future releases.

In the meantime you can disable the Domain join setting in the host profile by editing the host profile “Configure a fixed domain name” in “Authentication Services/Active Directory Configuration/Domain Name” and set it to “Host not joined to any domain” or remove the reference host from the domain and updating the profile using that reference host.