Archive for February, 2011

ESXi 4.1 Emulex LPe11000 FC HBA, errors SCSILinuxAbortCommands

February 27, 2011 6 comments

I recently upgraded a customer from ESXi 4.0 Update 1 to 4.1 Update 1. Two clusters were upgraded, one of five hosts and one of eight. All of the hosts are IBM x3650 M3 servers, and all but four of them have Emulex LPe11000 FC HBAs.

In the week or so since the upgrade, six hosts have failed. When a host fails, its VMs are lost and ESXi becomes generally unresponsive, although it still responds to HA pings from the other hosts in the cluster. The VMs therefore show as unknown, and restarting the management agents from the ESXi DCUI doesn’t help. Sometimes when the issue occurs the DCUI itself is unresponsive and the Alt-F11 and Alt-F2 keys do nothing, yet there is no PSOD.

The issue has only affected hosts with two single-port LPe11000 FC HBAs, a card that is on the HCL for ESXi 4.1. We run ESXi embedded on USB keys, not boot from SAN.

We have reviewed the storage and networking, and the fault is isolated to the affected host at the time it occurs; the other hosts don’t have any issues at the same time.

All the affected hosts have had the following HBA: Emulex, LPe11000, firmware 52A3.

We noticed a lot of storage-related errors in messages.log:

——————————————————————
FEB 23 16:33:06 vobd: Feb 23 16:33:00.126: 248389618631us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.6006016019802900e634e73f72a6df11. Path vmhba4:C0:T1:L59 is down. Affected datastores: "DM01_SAP_Test_VMFS01"..
FEB 23 16:33:06 vmkernel: 2:20:59:46.445 cpu3:4142)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f9feb40) to NMP device "naa.6006016019802900e634e73f72a6df11" failed on physical path "vmhba4:C0:T1:L59" H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
FEB 23 16:33:06 vmkernel: 2:20:59:46.445 cpu3:4142)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.6006016019802900e634e73f72a6df11": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
————————————————-
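The H:, D: and P: fields in log lines like these are the host, device and plugin status codes for the failed command. As a minimal sketch (the function names are made up for illustration, and the host status table is an assumed, partial list of the commonly documented values, not something taken from this case), they can be pulled out of a line like this:

```python
import re

# Assumed, partial table of host (H:) status values as commonly
# documented for ESX/ESXi SCSI status lines. Not a complete list.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)


def decode_status(line):
    """Return (host, device, plugin) status codes as ints, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return tuple(int(group, 16) for group in match.groups())


def host_status_name(code):
    """Map a host status code to a name, falling back to UNKNOWN."""
    return HOST_STATUS.get(code, "UNKNOWN(0x%x)" % code)
```

Under that assumed table, the H:0x1 in the snippet above reads as NO_CONNECT, which fits the "Lost connectivity" message logged by vobd.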

We contacted VMware GSS. Initially we got little help: the first technical support representative was not even able to see any storage errors, only confirmed what we already knew (“the host is being unresponsive”), and suggested it could be a hardware issue.

Since at this stage it had affected four different hosts, it clearly wasn’t a hardware fault, though it could be a hardware firmware or driver issue.

After losing another few hosts and applying some pressure to GSS, I talked to Aakash from VMware Global Support, who was a great help. He confirmed that we were experiencing SCSI abort commands on our FC HBA and that storage connectivity was being lost.

- We noticed APD (All Paths Down) messages around 25 Feb 2011 14:59, based on the log snippet below.

———————————————
FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu12:10442)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]
FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu18:10525)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]
FEB 25 14:57:12 vmkernel: 0:09:47:02.081 cpu14:4135)WARNING: LinScsi: SCSILinuxAbortCommands: Failed, Driver lpfc820, for vmhba4
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu7:4263)ScsiDeviceIO: 1672: Command 0x12 to device "naa.60060160e8802900bdb0f3eb16a4df11" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu7:4263)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.60060160e8802900bdb0f3eb16a4df11" is blocked. Not starting I/O from device.
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: VMW_VAAIP_CX: cx_claim_device: Inquiry to device naa.60060160e8802900bdb0f3eb16a4df11 failed
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.60060160e8802900c078f2bbc3a5df11".
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.60060160e8802900c078f2bbc3a5df11" due to Not found
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.60060160e8802900c078f2bbc3a5df11": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.60060160e8802900c078f2bbc3a5df11" is blocked. Not starting I/O from device.
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu4:4258)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.60060160e8802900c078f2bbc3a5df11".
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu0:4255)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.60060160e8802900bdb0f3eb16a4df11".
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu21:4257)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.60060160e880290079f48bf2c1a5df11".
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu12:4260)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.60060160e880290004445743c3a5df11".
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: NMP: nmpDeviceAttemptFailover: Retry world failover device "naa.60060160e8802900c078f2bbc3a5df11" - issuing command 0x41027ef92940
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.60060160e8802900c078f2bbc3a5df11".
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: NMP: nmpDeviceAttemptFailover: Retry world failover device "naa.60060160e8802900c078f2bbc3a5df11" - failed to issue command due to Not found (APD), try again...

- The Emulex driver may be the cause, since abort commands are failing:

FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu12:10442)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]
FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu18:10525)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]
FEB 25 14:57:12 vmkernel: 0:09:47:02.081 cpu14:4135)WARNING: LinScsi: SCSILinuxAbortCommands: Failed, Driver lpfc820, for vmhba4
—————————————————
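When checking other hosts for the same pattern, it helps to count the failed aborts per driver and adapter, and to list the devices that could not select a path. The sketch below is one rough way to do that. The function name and regular expressions are my own, written against the message formats shown in the snippet above, so treat them as assumptions, not a general vmkernel log parser.

```python
import re
from collections import Counter

# Patterns written against the log lines quoted above (assumed format).
ABORT_RE = re.compile(
    r"SCSILinuxAbortCommands: Failed, Driver ([^,\s]+), for (vmhba\d+)"
)
NAA_RE = re.compile(r"naa\.[0-9a-f]+")


def summarize(lines):
    """Count failed aborts per (driver, adapter) and collect the
    devices named in path-selection or APD warnings."""
    aborts = Counter()
    devices = set()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            aborts[(match.group(1), match.group(2))] += 1
        if "Could not select path" in line or "(APD)" in line:
            devices.update(NAA_RE.findall(line))
    return aborts, devices
```

Passing an open log file to it, for example summarize(open("messages.log")), would return the counts and device set for that file; the file name here is only an example.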

We are rolling back the hosts to ESXi 4.0 Update 1. The driver for that HBA changed in 4.1, to 8.2.1.30.1-58vmw.

The following KB article mentions issues with 4.0 and this adapter, but GSS confirmed with VMware engineering that the HBA is supported.

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1012547&sliceId=1&docTypeID=DT_KB_1_1&dialogID=161271506&stateId=0 0 161275257

Categories: Uncategorized

VMware DCUI/Console ALT-F Keys

February 25, 2011 2 comments

ESXi

ALT-F1 = Switches to the console
ALT-F2 = Switches to the DCUI
ALT-F11 = Returns to the banner screen
ALT-F12 = Displays the VMkernel log on the console

ESX

ALT-F1 = Switches to the console of the service console
ALT-F2 = Switches to the console of the service console
ALT-F11 = Returns to the banner screen
ALT-F12 = Displays the VMkernel log on the console

Categories: VMware

VMware SIOC, first investigations

February 21, 2011

By default the SIOC latency contention threshold is 30 ms, and it is set per datastore.
SIOC only acts during contention: when latency to a datastore rises above 30 ms, it starts throttling VMs’ disk queue access across the hosts within vCenter that have disk queues to that datastore.

SIOC recommendations:

• Ensure that SIOC-enabled datastores are managed by a single vCenter Server.
• The storage media (spindles, SSD) on which SIOC-enabled datastores are located should not be shared with volumes used by non-vSphere workloads.
• Storage I/O Control is supported on Fibre Channel-connected and iSCSI-connected storage. NFS datastores and Raw Device Mapping (RDM) are not supported.
• SIOC does not support datastores with multiple extents.
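The recommendations above can be turned into a quick pre-flight check before enabling SIOC. This is only a sketch: the Datastore record and its field names are hypothetical, you would have to fill them in from your own inventory, and nothing here calls a VMware API.

```python
from dataclasses import dataclass


@dataclass
class Datastore:
    # Hypothetical record, populated by hand or from your own inventory.
    name: str
    storage_type: str              # "FC", "iSCSI", "NFS" or "RDM"
    extent_count: int
    vcenter_count: int             # vCenter Servers managing this datastore
    shared_with_non_vsphere: bool  # spindles shared with other workloads


def sioc_blockers(ds):
    """Return the reasons a datastore fails the recommendations above."""
    problems = []
    if ds.vcenter_count != 1:
        problems.append("not managed by a single vCenter Server")
    if ds.shared_with_non_vsphere:
        problems.append("storage media shared with non-vSphere workloads")
    if ds.storage_type not in ("FC", "iSCSI"):
        problems.append("storage type %s is not supported" % ds.storage_type)
    if ds.extent_count > 1:
        problems.append("datastore has multiple extents")
    return problems
```

An empty list means none of the four recommendations is violated; it does not prove the array itself is suitable.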

This is the technical deployment considerations guide, a great document on how to implement SIOC:
http://www.vmware.com/files/pdf/techpaper/VMW-vSphere41-SIOC.pdf

How to troubleshoot issues:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1022091

vSphere Resource Management Guide:

Before using Storage I/O Control on datastores that are backed by arrays with automated storage tiering capabilities, check the VMware Storage/SAN Compatibility Guide to verify whether your automated tiered storage array has been certified to be compatible with Storage I/O Control. Automated storage tiering is the ability of an array (or group of arrays) to migrate LUNs/volumes or parts of LUNs/volumes to different types of storage media (SSD, FC, SAS, SATA) based on user-set policies and current I/O patterns. No special certification is required for arrays that do not have these automatic migration/tiering features, including those that provide the ability to manually migrate data between different types of storage media.

http://www.vmware.com/pdf/vsphere4/r41/vsp_41_resource_mgmt.pdf

vSphere SAN compatibility guide:

SIOC feature is available with VMware vSphere 4.1 GA. SIOC is a QoS feature that is disabled by default. When enabled, this feature monitors ESX-array latency to determine when a datastore is congested. When congestion is detected (latency above a threshold), SIOC allocates the limited I/O resources to virtual machines in accordance to their relative importance (user-specified). All the storage devices listed on vSphere 4.1 Storage HCL are supported for use with SIOC. There is no special certification requirements for the SIOC support.

http://www.vmware.com/resources/compatibility/pdf/vi_san_guide.pdf

SIOC recommended thresholds for mixed storage: a datastore backed by a 50/50 mix of SSD and FC should have a threshold of 20 ms, the midpoint between the thresholds for the two types.

Chad Sakac recommends for EMC storage “the SIOC threshold should be the median between the slowest and fastest tier”
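As a worked example of that midpoint rule, here is a minimal sketch. The function name is made up, and the per-tier figures in the usage line (10 ms for SSD, 30 ms for FC) are assumptions chosen to reproduce the 20 ms figure above, so check your array vendor’s guidance for real values.

```python
def midpoint_threshold_ms(fastest_tier_ms, slowest_tier_ms):
    """Midpoint between the thresholds of the fastest and slowest
    tier backing a datastore, in milliseconds."""
    if fastest_tier_ms <= 0 or slowest_tier_ms <= 0:
        raise ValueError("thresholds must be positive")
    return (fastest_tier_ms + slowest_tier_ms) / 2


# Assumed per-tier thresholds: SSD 10 ms, FC 30 ms.
print(midpoint_threshold_ms(10, 30))  # 20.0
```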

External I/O workload detected on shared datastore running SIOC errors:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1020651

Categories: VMware

VMware vCenter Server, uninstalling SQL express issues

February 14, 2011 1 comment

I had an issue today where a customer had SQL Express installed on their vCenter Server and had moved their vCenter database to a SQL Server instance on another server.
Being thorough, they also uninstalled SQL Express from their vCenter Server. Unfortunately, uninstalling SQL Express also uninstalls the SQL Native Client and therefore removes any ODBC DSNs using that driver DLL.
Although the SQL Express uninstall had been done some time earlier, no error occurred until the VMware vCenter Server services were restarted (or the server was rebooted).

The error was as follows, and the vpxd log shows that the database cannot be located.
“Windows could not start VMware Virtual Center Service. Error code 2”

To resolve this issue re-install the SQL Native client and then recreate the ODBC DSN.

The DSN name can be found in the registry location:
HKLM\SOFTWARE\Wow6432Node\VMware, Inc\VMware VirtualCenter\DB\1
This string value contains the correct DSN name.

Create the DSN by running odbcad32 and creating a System DSN with the name from the registry setting. Then try restarting the services.

Categories: VMware

Exchange 2010 Granting Full Mailbox Access

February 9, 2011 1 comment

This command grants Frank Brown Full Access to Fred Smith’s mailbox:

Add-MailboxPermission -Identity "Fred Smith" -User "Frank Brown" -AccessRights FullAccess -InheritanceType All

http://technet.microsoft.com/en-us/library/bb124097.aspx

This command grants Frank Brown Send-As and Receive-As rights on Fred Smith’s mailbox:

Add-ADPermission -Identity "Fred Smith" -User "Frank Brown" -ExtendedRights Receive-As, Send-As

http://technet.microsoft.com/en-us/library/bb124403.aspx

To apply the Send-As and Receive-As rights to all mailboxes within one database:

Get-MailboxDatabase -Identity "Database01" | Add-ADPermission -User "Frank Brown" -ExtendedRights Receive-As, Send-As

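To grant Frank Brown Full Access to all mailboxes within an Exchange organisation, use this command (-ResultSize Unlimited lifts the default limit of 1000 results returned by Get-Mailbox):

Get-Mailbox -ResultSize Unlimited | Add-MailboxPermission -User "Frank Brown" -AccessRights FullAccess -InheritanceType All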

I have found that sometimes Add-MailboxPermission -AccessRights FullAccess doesn’t grant the user Full Access. In those cases, try Add-ADPermission with -AccessRights GenericAll instead:

Get-MailboxDatabase -Identity "Database01" | Add-ADPermission -User "Frank Brown" -AccessRights GenericAll

Categories: Exchange