Evaluating the following Windows performance counters and cross-matching them with the storage system’s performance statistics can assist in identifying the application workload pattern. The performance criteria we should be evaluating are broken down into two main sections: storage and Windows system “Workload” performance counters.
Storage performance monitoring should make up the bulk of the information we review to identify storage-related bottlenecks or performance problems. The latency and queue length counters can be used by the system owners and the application development team to self-diagnose potential storage issues. You could also roll out the performance logging via SCOM to the servers and alert on the thresholds defined below for further investigation.
Processor % utilisation
Processor Bandwidth MB/s
Processor Throughput IOPS
Processor Cache forced flushes
The above statistics and counters should be recorded and reviewed on the storage arrays regularly using the appropriate storage vendor tool sets (ECC and analyser). There are significantly more performance counters to track at the storage level but as a rule these will help set a baseline.
Windows Performance Counters
PhysicalDisk(*)\Avg. Disk sec/Read
PhysicalDisk(*)\Avg. Disk sec/Write
PhysicalDisk(*)\Avg. Disk sec/Transfer (combination of the above two counters)
The counter returns a value in seconds, therefore 0.010 is 10ms
Disk latency should fall within the following acceptance criteria:
• Less than 5ms is considered excellent
• Less than 10ms is considered good
• Less than 15ms is considered acceptable
• Less than 20ms is fair
• More than 20ms and less than 50ms is poor
• More than 50ms is substandard
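As a rough illustration of the bands above, the following Python sketch converts a perfmon sample (in seconds) to milliseconds and returns the matching rating. The function name and structure are my own assumptions, and because the list does not say which band exactly 20ms falls into, this sketch treats it as poor.

```python
def classify_latency(avg_disk_sec: float) -> str:
    """Map an 'Avg. Disk sec/...' sample (seconds) to a rating band."""
    latency_ms = avg_disk_sec * 1000.0  # perfmon reports seconds, e.g. 0.010 = 10ms
    bands = [
        (5, "excellent"),
        (10, "good"),
        (15, "acceptable"),
        (20, "fair"),
        (50, "poor"),  # assumption: exactly 20ms lands here
    ]
    for upper_ms, label in bands:
        if latency_ms < upper_ms:
            return label
    return "substandard"


if __name__ == "__main__":
    for sample in (0.003, 0.012, 0.030, 0.075):
        print(sample, classify_latency(sample))
```

Feeding it the average of a sustained sampling window rather than a single spike will give a more meaningful rating.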
Latency should be measured in the following locations to isolate the source of the latency when it is identified:
1. Within the Operating System (perfmon or top)
2. Within the Hypervisor (esxtop or resxtop)
3. At the Storage subsystem, LUN response times (latency)
Disk queue length
Disk queue length does not accurately reflect performance but can be used to assist in the diagnosis of performance issues. Queue lengths that grow significantly and exceed the expected performance of the underlying storage highlight bottlenecks, but those bottlenecks can exist in several locations.
Windows Performance Counters
\PhysicalDisk(*)\Avg. Disk Write Queue Length
\PhysicalDisk(*)\Avg. Disk Read Queue Length
\PhysicalDisk(*)\Current Disk Queue Length
\PhysicalDisk(*)\Avg. Disk Queue Length
• 2x to 3x the number of disk spindles used to create the volume (assuming they are dedicated)
• 30 plus sustained is a reason for investigation
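The two rules of thumb above could be combined into a simple check like the sketch below. It assumes the spindle rule is an upper bound (the 3x factor is the default here) and that "30 plus" means 30 or more; the function and parameter names are hypothetical.

```python
def queue_needs_investigation(avg_queue_length, spindles=None,
                              spindle_factor=3, absolute_limit=30):
    """Return True when a sustained average queue length looks suspicious.

    spindles: number of dedicated disks behind the volume, if known.
    spindle_factor: assumption, 2 to 3 per the rule of thumb above.
    """
    if avg_queue_length >= absolute_limit:
        return True
    if spindles is not None and avg_queue_length > spindle_factor * spindles:
        return True
    return False


if __name__ == "__main__":
    print(queue_needs_investigation(10, spindles=8))  # within 3 x 8
    print(queue_needs_investigation(26, spindles=8))  # above 3 x 8
    print(queue_needs_investigation(35))              # above the absolute limit
```

On shared or virtualised storage the spindle count is often unknown, in which case only the absolute limit applies.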
Disk queue length should be measured at the OS as well as all of the following queue lengths:
1. Disk queue length within the Operating System
2. Front end port queue length on the storage arrays
3. LUN queue length
The following counters can assist in identifying the type of workload within the Windows operating system
Windows performance counters
\PhysicalDisk(*)\Avg. Disk Bytes/Read
\PhysicalDisk(*)\Avg. Disk Bytes/Transfer
\PhysicalDisk(*)\Avg. Disk Bytes/Write
\PhysicalDisk(*)\Avg. Disk sec/Read
\PhysicalDisk(*)\Avg. Disk sec/Transfer
\PhysicalDisk(*)\Avg. Disk sec/Write
\PhysicalDisk(*)\Disk Read Bytes/sec
\PhysicalDisk(*)\Disk Write Bytes/sec
\PhysicalDisk(*)\Split IO/Sec (as few as possible)
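To show how these counters translate into a workload profile, here is a small Python sketch that derives the read/write mix, the average I/O size and an approximate IOPS figure from three of the values above. The function and key names are my own, and the read percentage is calculated by bytes, not by operation count.

```python
def summarise_workload(read_bytes_per_sec, write_bytes_per_sec,
                       avg_bytes_per_transfer):
    """Derive a rough workload profile from PhysicalDisk counter samples."""
    total = read_bytes_per_sec + write_bytes_per_sec
    read_pct = 100.0 * read_bytes_per_sec / total if total else 0.0
    # Throughput divided by average I/O size approximates transfers per second.
    iops = total / avg_bytes_per_transfer if avg_bytes_per_transfer else 0.0
    return {
        "read_pct": read_pct,
        "write_pct": 100.0 - read_pct if total else 0.0,
        "io_size_kib": avg_bytes_per_transfer / 1024,
        "throughput_mib_s": total / 2**20,
        "approx_iops": iops,
    }


if __name__ == "__main__":
    # Hypothetical sample: 6 MiB/s reads, 2 MiB/s writes, 8 KiB average I/O.
    print(summarise_workload(6 * 2**20, 2 * 2**20, 8192))
```

A small average I/O size with high IOPS points towards a transactional workload, while large I/O sizes with high throughput point towards sequential work such as backups.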
Additional Windows Performance Monitor counters should be collected to ensure there are no processor, memory, paging or other operations occurring within the operating system that could negatively impact the system and produce additional storage workload.
Windows perfmon counters
\Processor(*)\% Processor Time
\Processor(*)\% User Time
\Processor(*)\% Privileged Time
\System\Processor Queue Length
Available Mbytes, should be greater than 100MB
Pages Input/sec, should be less than 10
Pages/Sec, slow disk subsystem greater than 100, fast subsystem greater than 600
Memory Grants pending, at or close to zero, over zero or growing indicates an issue
Page Life Expectancy, should be greater than 300, lower or declining indicates memory pressure
The amount of the page file currently in use, by percentage, should be less than 70%
The peak amount of page file used since the server was last booted, by percentage, should be less than 70%
Processor: % Processor Time, 80% or less is ideal
Processor: % Privileged Time, should be less than 30% of the total % processor time
Processor: % User time, should be about 70% or more of the total % processor time
Process sqlservr % Processor Time, should be less than 80%
System: Processor Queue Length, should be less than 4, 5 – 8 good, over 8 to 12 fair
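A few of the thresholds above can be expressed as a simple rule table, as in the Python sketch below. The dictionary keys are illustrative labels I have chosen, not exact perfmon counter paths, and only the rules with a single clear limit are included.

```python
import operator

# Each entry is (comparison, limit); a counter is healthy when
# comparison(value, limit) is true. Limits are taken from the list above.
HEALTHY = {
    "Available MBytes": (operator.gt, 100),
    "Pages Input/sec": (operator.lt, 10),
    "Page Life Expectancy": (operator.gt, 300),
    "Paging File % Usage": (operator.lt, 70),
    "% Processor Time": (operator.le, 80),
    "Processor Queue Length": (operator.lt, 4),
}


def breached(sample):
    """Return the sorted names of counters that fail their healthy condition."""
    failures = []
    for name, value in sample.items():
        if name in HEALTHY:
            compare, limit = HEALTHY[name]
            if not compare(value, limit):
                failures.append(name)
    return sorted(failures)


if __name__ == "__main__":
    print(breached({"Available MBytes": 64, "Pages Input/sec": 2,
                    "Page Life Expectancy": 250, "% Processor Time": 80}))
```

Counters that are not in the table are ignored, so the same function can be fed a full perfmon export.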
FCoE is possible between two data centres by using VE ports to build multi-hop FCoE fabrics that interconnect FCFs. The VE port functions as an FC E-port on top of a lossless Ethernet fabric.
The maximum distance between two Cisco Nexus 5000 Series switches is 3000m for FCoE (lossless Ethernet) traffic with NX-OS 5.0(2)N1.1. It can be enabled using the pause no-drop buffer-size buffer-size pause-threshold xoff-size resume-threshold xon-size command in the QoS configuration. Here are the maximum distances:
Nexus 55xx to Nexus 55xx – 3km
Nexus 50x0 to Nexus 50x0 – 3km
Nexus 50x0 to Nexus 55xx – 3km
Nexus 55xx to Nexus 2232 – 300m
Nexus 50x0 to Nexus 2232 – 300m
My first Apple experience was an Apple IIe back in the late 1980s. It belonged to a friend of mine and was the first computer I ever used; we played Lode Runner most of the time. We also had Apple IIs in my primary school’s computer room but we barely got to use them. Later on, in intermediate school, we did touch typing on Apple computers.
I didn’t use an Apple computer again until August 2009, when I purchased a MacBook Pro (unibody). As it was the first Apple I owned, I made sure to purchase an extended AppleCare warranty. Two and a half years later I have never needed to use the warranty, and the MacBook continues to age gracefully and is by far the best engineered laptop I have ever owned.
Since 2009 we have been getting more Apple consumer devices and have slowly become an “Apple Family”. We now own six devices: the MacBook Pro, an iPad, two iPhone 4s and two iTouch for the kids. I never intended to purchase Apple exclusively, but each time I looked into the different products the Apple ones were always the best.
Today I made my first Apple warranty claim. It was for my iPhone, whose home button had become unreliable. I started by jumping on the web and going into the Apple support section. Apple has an express lane http://www.apple.com/au/support/contact/ that allows its customers to lodge a support request and provides online assistance, troubleshooting steps and contact numbers. I was provided a contact number in response to my answers and called it. You get prompted to leave your details with an IVR system and then wait for a call back. I got a call back immediately and the Apple support representative confirmed I had attempted the required troubleshooting steps, including a software upgrade and factory reset. Once that was confirmed she organised an appointment for me at the Chermside Apple store at midday the same day.
I arrived on time for my appointment and the store’s concierge checked me in and told me to wait next to the MacBooks for someone to come and assist. The Apple store easily had over a hundred people in it and was a hive of activity, with Apple staff of all shapes and sizes (most had a slight geek or alternative look, the hallmark of Apple). An Apple store employee, Dan, came up to me and advised he would be assisting me. I told him about the problem and he took the phone away to check it out. The phone still had 80 days of warranty left and was therefore covered by the standard Apple twelve-month warranty. Dan returned and informed me Apple would be replacing the phone completely and providing a 90-day warranty. As I had already backed up the phone, we wiped the old handset and I waited for a few minutes while Dan retrieved a new handset. We registered it over the store wireless and iCloud downloaded my contacts, so I was good to go. I thanked Dan and headed home after getting a new screen protector, the only very small downside to getting a new handset.
In short, from first contact to replacement handset it took 4 hours, and I talked to two Apple employees to get the warranty claim completed. Not only can Apple boast some of the best consumer products on the planet, but it also has the best support and warranty service I have ever experienced. Having had previous experience with several consumer and business IT product vendors’ support and warranty claims, I think Apple is setting the standard and the other vendors should take note. Most IT vendors are outsourcing, offshoring and reducing their bricks and mortar to reduce costs. Apple’s approach seems to be orientated around the customer’s experience and not cost cutting.
I’m no Apple fanboy and I certainly don’t buy brands for the sake of it or to be in the cool gang. Given the quality of their products and the support they provide, Apple is the leading IT consumer vendor!
Current customers with fibre channel environments looking to refresh their infrastructure always ask me one question: can we replace fibre channel with FCoE end to end for our whole environment?
For most customers, 100% virtualisation is a dream that won’t be happening in the near future, and replacing fibre channel with FCoE is in the same boat.
Unfortunately, unless you are replacing all of your compute, storage and connectivity, end to end FCoE is simply not possible, and if you have non-virtualised workloads it’s even further away!
I have vendors consistently telling me they have customers going 100% FCoE, only to find it is a small environment where the requirements are specific enough to make FCoE end to end happen.
Unfortunately we are at the stage where the hype and vendor claims don’t meet expectations, and that lands us fair and square in the trough of disillusionment.
Approximately three months ago I left my old job working as a Technical Architect for Corpnet. Following that I took a contract working in the Queensland Government as an Infrastructure/Storage Engineer. I had planned to stay contracting for at least six months but only lasted about 10 weeks, as I was offered an opportunity that I simply couldn’t pass up.
I have accepted the Solutions Architect role at Dimension Data working in the Data Centre Solutions practice. Dimension Data is a household name in the IT industry, an organisation with offices across the world and over 12,000 employees. Headquartered in Johannesburg, Dimension Data were late last year acquired by Nippon Telegraph and Telephone Corporation (NTT).
I will still continue to blog but it’s likely there will be some changes in the future; what they will be is yet to be seen.
I recently upgraded a customer from 4.0 Update 1 to 4.1 Update 1. Two clusters were upgraded, consisting of five and eight hosts respectively. All of the hosts are IBM x3650 M3s, with all but four of them having Emulex LPe11000 FC HBAs.
Since the upgrade, over about a week, we have had six hosts fail. The VMs on these hosts are lost and ESXi becomes generally unresponsive, but the hosts still respond to HA pings from the other hosts in the cluster. The VMs therefore become unknown, and restarting the management agents from the ESXi DCUI doesn’t help. Sometimes when this issue has occurred the DCUI is unresponsive and the Alt-F11 and Alt-F2 keys do nothing; no PSOD occurs.
This issue has only affected the hosts with two LPe11000 single port FC HBAs, which are on the HCL for ESXi 4.1. We are using ESXi embedded with USB keys and not boot from SAN.
We have reviewed the storage and networking, and the fault is isolated to the affected host at the time it occurs; the other hosts don’t have any issues occurring at the same time.
All the affected hosts have had the following HBA: Emulex, LPe11000, firmware 52A3.
We noticed there are a lot of storage-related errors in the messages.log:
FEB 23 16:33:06 vobd: Feb 23 16:33:00.126: 248389618631us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.6006016019802900e634e73f72a6df11. Path vmhba4:C0:T1:L59 is down. Affected datastores: “DM01_SAP_Test_VMFS01”..
FEB 23 16:33:06 vmkernel: 2:20:59:46.445 cpu3:4142)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f9feb40) to NMP device “naa.6006016019802900e634e73f72a6df11” failed on physical path “vmhba4:C0:T1:L59” H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
FEB 23 16:33:06 vmkernel: 2:20:59:46.445 cpu3:4142)WARNING: NMP: nmp_DeviceRetryCommand: Device “naa.6006016019802900e634e73f72a6df11”: awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
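When there are a lot of these entries, a short script can pull out which devices and paths are affected. The Python sketch below is only an illustration: the regular expression is modelled on the vobd line shown above, and the function name is hypothetical.

```python
import re

# Modelled on the vobd "connectivity lost" event in the excerpt above.
LOST_PATH = re.compile(
    r"Lost connectivity to storage device (?P<device>naa\.[0-9a-f]+)\. "
    r"Path (?P<path>vmhba\d+:C\d+:T\d+:L\d+) is down"
)


def find_lost_paths(lines):
    """Return (device, path) tuples for every connectivity-lost event."""
    hits = []
    for line in lines:
        match = LOST_PATH.search(line)
        if match:
            hits.append((match.group("device"), match.group("path")))
    return hits


if __name__ == "__main__":
    sample = [
        "FEB 23 16:33:06 vobd: Feb 23 16:33:00.126: 248389618631us: "
        "[esx.problem.storage.connectivity.lost] Lost connectivity to storage "
        "device naa.6006016019802900e634e73f72a6df11. "
        "Path vmhba4:C0:T1:L59 is down.",
    ]
    for device, path in find_lost_paths(sample):
        print(device, path)
```

Grouping the results by the vmhba prefix of the path is a quick way to see whether a single HBA is behind all of the events.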
We contacted VMware GSS. Initially we had little help, with the first technical support representative not even able to see any storage errors and only confirming what we already knew, that “the hosts is being unresponsive” and that it could be a hardware issue.
Seeing as at this stage it had affected four different hosts, it clearly wasn’t a hardware fault, but possibly a hardware firmware and driver issue.
After losing another few hosts and applying some pressure to GSS, I talked to Aakash from VMware Global Support, who was great in helping us. He confirmed that we were experiencing SCSI abort commands on our FC HBA and that the storage connectivity was lost.
-We noticed APD messages around 25th Feb,2011 14:59 based on the below log snippet.
FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu12:10442)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]
FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu18:10525)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]
FEB 25 14:57:12 vmkernel: 0:09:47:02.081 cpu14:4135)WARNING: LinScsi: SCSILinuxAbortCommands: Failed, Driver lpfc820, for vmhba4
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu7:4263)ScsiDeviceIO: 1672: Command 0x12 to device “naa.60060160e8802900bdb0f3eb16a4df11” failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu7:4263)WARNING: NMP: nmp_DeviceStartLoop: NMP Device “naa.60060160e8802900bdb0f3eb16a4df11” is blocked. Not starting I/O from device.
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: VMW_VAAIP_CX: cx_claim_device: Inquiry to device naa.60060160e8802900bdb0f3eb16a4df11 failed
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device “naa.60060160e8802900c078f2bbc3a5df11”.
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device “naa.60060160e8802900c078f2bbc3a5df11” due to Not found
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_DeviceRetryCommand: Device “naa.60060160e8802900c078f2bbc3a5df11”: awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
FEB 25 14:57:12 vmkernel: 0:09:47:02.529 cpu2:4259)WARNING: NMP: nmp_DeviceStartLoop: NMP Device “naa.60060160e8802900c078f2bbc3a5df11” is blocked. Not starting I/O from device.
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu4:4258)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e8802900c078f2bbc3a5df11”.
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu0:4255)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e8802900bdb0f3eb16a4df11”.
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu21:4257)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e880290079f48bf2c1a5df11”.
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu12:4260)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device “naa.60060160e880290004445743c3a5df11”.
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: NMP: nmpDeviceAttemptFailover: Retry world failover device “naa.60060160e8802900c078f2bbc3a5df11” – issuing command 0x41027ef92940
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device “naa.60060160e8802900c078f2bbc3a5df11”.
FEB 25 14:57:12 vmkernel: 0:09:47:02.679 cpu16:4511)WARNING: NMP: nmpDeviceAttemptFailover: Retry world failover device “naa.60060160e8802900c078f2bbc3a5df11” – failed to issue command due to Not found (APD), try again…
-It can be caused due to emulex driver since there are abort commands.
FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu12:10442)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]
FEB 25 14:57:12 vmkernel: 0:09:47:02.054 cpu18:10525)FS3: 7412: Waiting for timed-out heartbeat [HB state abcdef02 offset 3280896 gen 9 stamp 35219957746 uuid 4d6739f2-8ec6b4c0-23f3-e61f13594cb3 jrnl drv 8.46]
FEB 25 14:57:12 vmkernel: 0:09:47:02.081 cpu14:4135)WARNING: LinScsi: SCSILinuxAbortCommands: Failed, Driver lpfc820, for vmhba4
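To check how widespread the failed aborts are across hosts, something like the following Python sketch could be run over a copy of each host’s log. The pattern is based only on the SCSILinuxAbortCommands line above, and the function name is an assumption of mine.

```python
import re
from collections import Counter

# Based on the "SCSILinuxAbortCommands: Failed" warning in the excerpt above.
ABORT_FAILED = re.compile(
    r"SCSILinuxAbortCommands: Failed, Driver (?P<driver>[^,\s]+), "
    r"for (?P<hba>vmhba\d+)"
)


def count_failed_aborts(lines):
    """Count failed abort commands per (driver, HBA) pair."""
    counts = Counter()
    for line in lines:
        # finditer copes with several log entries fused onto one line.
        for match in ABORT_FAILED.finditer(line):
            counts[(match.group("driver"), match.group("hba"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "FEB 25 14:57:12 vmkernel: 0:09:47:02.081 cpu14:4135)WARNING: "
        "LinScsi: SCSILinuxAbortCommands: Failed, Driver lpfc820, for vmhba4",
    ]
    for (driver, hba), total in count_failed_aborts(sample).items():
        print(driver, hba, total)
```

If every failure points at the same driver across different hosts, that supports a driver or firmware problem rather than a fault in one piece of hardware.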
We are rolling back hosts to ESXi 4.0 update 1, the driver for that HBA changed in 4.1 to 220.127.116.11.1-58vmw.
This article mentions there are issues with 4.0 and this adapter but GSS confirmed with VMware engineering that the HBA is supported.
These are a collection of pictures of the flooding in Brisbane from the 11th to the 14th of Jan 2011.
Pictures at 9:30am, 11/01/2011 on the end of Kurilpa Road, the park slowly going under.
Pictures an hour later, about 11am, 11/01/2011, the river is creeping up.
Pictures at 11:45am, 11/01/2011, still slowly rising
Pictures at 1:15pm, 11/01/2011
Pictures taken around 4pm, 11/01/2011
Pictures taken approx 7:00am, 12/01/2011
Bailey Street off Kurilpa Road
Top end of Kurilpa Road near the Montague Road corner
Pictures 11am, 12/01/2011, Top of Kurilpa Road near the Montague Road corner
Pictures 11.30am, 12/01/2011, Top of Kurilpa Road near the Montague Road corner
Pictures 5.30am, 13/01/2011, Top of Kurilpa Road, Corpnet office 365 Montague Road
Corner of Harriet Street and Montague Road