Archive for March, 2012

NetApp flow control and 10GbE recommendations

March 19, 2012

I have been investigating NetApp’s “best practices” on flow control and found the following two references in which NetApp suggests that flow control should be avoided on modern 10GbE infrastructure.

TR-3802 – Ethernet Storage Best Practices

Page 22

CONGESTION MANAGEMENT WITH FLOW CONTROL

Flow control mechanisms exist at many different OSI Layers including the TCP window, XON/XOFF, and FECN/BECN for Frame Relay. In an Ethernet context, L2 flow control was unable to be implemented until the introduction of full duplex links, because a half duplex link is unable to send and receive traffic simultaneously. 802.3X allows a device on a point-to-point connection experiencing congestion to send a PAUSE frame to temporarily pause the flow of data. A reserved and defined multicast MAC address of 01-80-C2-00-00-01 is used to send the PAUSE frames, which also includes the length of pause requested.
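
To make the frame format concrete, below is a minimal Python sketch (mine, not from the TR) that lays out a PAUSE frame as described above. The source MAC is a made-up placeholder and the FCS is omitted.

import struct

PAUSE_DST_MAC = bytes.fromhex("0180C2000001")  # the reserved multicast address above
MAC_CONTROL_ETHERTYPE = 0x8808                 # EtherType for MAC Control frames
PAUSE_OPCODE = 0x0001                          # PAUSE operation

def build_pause_frame(src_mac: bytes, pause_quanta: int) -> bytes:
    # pause_quanta is the requested pause length in units of 512 bit times
    payload = struct.pack("!HH", PAUSE_OPCODE, pause_quanta)
    payload += b"\x00" * (46 - len(payload))   # pad to the 46-byte minimum payload
    return PAUSE_DST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE) + payload

# hypothetical locally administered source MAC
frame = build_pause_frame(bytes.fromhex("02AABBCCDDEE"), pause_quanta=0xFFFF)
print(len(frame))  # 60 bytes on the wire, before the FCS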

In simple networks, this method can work well. However, with the introduction of larger and larger networks along with more advanced network equipment and software, technologies such as TCP windowing, increased switch buffering, and end-to-end QoS negate the need for simple flow control throughout the network.

TR-3749 – NetApp Storage Best Practices for VMware vSphere

Page 25

FLOW CONTROL

Flow control is a low-level process for managing the rate of data transmission between two nodes to prevent a fast sender from overrunning a slow receiver. Flow control can be configured on ESX/ESXi servers, FAS storage arrays, and network switches. For modern network equipment, especially 10GbE equipment, NetApp recommends turning off flow control and allowing congestion management to be performed higher in the network stack. For older equipment, typically GbE with smaller buffers and weaker buffer management, NetApp recommends configuring the endpoints, ESX servers, and NetApp arrays with the flow control set to “send.”
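
To check what a host NIC is actually doing, the hedged sketch below shells out to ethtool (available on Linux hosts; tooling on ESX/ESXi varies) and parses its pause-frame settings. The function name and interface are my own.

import subprocess

def pause_settings(nic: str) -> dict:
    # "ethtool --show-pause <nic>" prints Autonegotiate/RX/TX on|off lines
    out = subprocess.run(["ethtool", "--show-pause", nic],
                         capture_output=True, text=True, check=True).stdout
    settings = {}
    for line in out.splitlines()[1:]:          # skip the "Pause parameters" header
        if ":" in line:
            key, value = line.split(":", 1)
            settings[key.strip()] = value.strip()
    return settings

print(pause_settings("eth0"))  # e.g. {'Autonegotiate': 'on', 'RX': 'on', 'TX': 'on'}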

Categories: NetApp

Storage and Windows systems monitoring criteria

March 18, 2012

Evaluating the following Windows performance counters and cross-matching them with storage system performance statistics can assist in identifying the application workload pattern. The performance criteria we should evaluate break down into two main sections: Storage and Windows systems (“workload”) performance counters.

Storage performance monitoring should make up the bulk of the information we review to identify storage-related bottlenecks or performance problems. The latency and queue length counters can be used by system owners and the application development team to self-diagnose potential storage issues. You could also roll out the performance logging via SCOM to the servers and alert on the thresholds defined below for further investigation.

Storage

Storage Processor

Processor % utilisation
Processor Bandwidth MB/s
Processor Throughput IOPS
Processor Cache forced flushes

The above statistics and counters should be recorded and reviewed on the storage arrays regularly using the appropriate storage vendor tool sets (e.g. ECC and Analyzer). There are significantly more performance counters to track at the storage level, but as a rule these will help set a baseline.

Latency

Windows Performance Counters

PhysicalDisk(*)\Avg. Disk sec/Read
PhysicalDisk(*)\Avg. Disk sec/Write
PhysicalDisk(*)\Avg. Disk sec/Transfer (combination of the above two counters)

These counters return values in seconds, so 0.010 equals 10ms.

Disk latency should fall within the following acceptance criteria:

• Less than 5ms is considered excellent
• Less than 10ms is considered good
• Less than 15ms is considered acceptable
• Less than 20ms is fair
• More than 20ms and less than 50ms is poor
• More than 50ms is substandard
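
As a worked example of the seconds-to-milliseconds conversion and the bands above, a minimal Python sketch (the function name is mine):

def rate_disk_latency(avg_disk_sec: float) -> str:
    # Classify a PhysicalDisk "Avg. Disk sec/Transfer" sample
    ms = avg_disk_sec * 1000.0                 # 0.010 s -> 10 ms
    if ms < 5:
        return "excellent"
    if ms < 10:
        return "good"
    if ms < 15:
        return "acceptable"
    if ms < 20:
        return "fair"
    if ms <= 50:
        return "poor"
    return "substandard"

print(rate_disk_latency(0.010))  # -> "good"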

Latency should be measured in the following locations to isolate the source of the latency when it is identified:

1. Within the Operating System (perfmon or top)
2. Within the Hypervisor (esxtop or resxtop)
3. At the Storage subsystem, LUN response times (latency)

Disk queue length

Disk queue length does not accurately reflect performance on its own, but it can assist in diagnosing performance issues. Queue lengths that grow significantly and exceed the expected performance of the underlying storage highlight bottlenecks, but those bottlenecks can exist in several locations.

Windows Performance Counters

\PhysicalDisk(*)\Avg. Disk Write Queue Length
\PhysicalDisk(*)\Avg. Disk Read Queue Length
\PhysicalDisk(*)\Current Disk Queue Length
\PhysicalDisk(*)\Avg. Disk Queue Length

Acceptance criteria:

• 2–3x the number of disk spindles used to create the volume (assuming the spindles are dedicated); see the sketch below
• A sustained queue length of 30 or more is a reason for investigation
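
A minimal sketch encoding the rule of thumb above (the names and the default 3x multiplier are my own choices):

def queue_length_ok(avg_queue_len: float, spindles: int, multiplier: int = 3) -> bool:
    # Flag sustained queues of 30+ regardless of spindle count
    if avg_queue_len >= 30:
        return False
    # Otherwise allow up to 2-3x the dedicated spindle count
    return avg_queue_len <= multiplier * spindles

# e.g. a 4-spindle volume averaging a queue of 14 exceeds the 3x limit
print(queue_length_ok(14, spindles=4))  # -> False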

Disk queue length should be measured in all of the following locations:

1. Disk queue length within the Operating System
2. Front end port queue length on the storage arrays
3. LUN queue length

Workload

The following counters can assist in identifying the type of workload within the Windows operating system:

Windows performance counters

\PhysicalDisk(*)\Avg. Disk Bytes/Read
\PhysicalDisk(*)\Avg. Disk Bytes/Transfer
\PhysicalDisk(*)\Avg. Disk Bytes/Write
\PhysicalDisk(*)\Avg. Disk sec/Read
\PhysicalDisk(*)\Avg. Disk sec/Transfer
\PhysicalDisk(*)\Avg. Disk sec/Write
\PhysicalDisk(*)\Disk Read Bytes/sec
\PhysicalDisk(*)\Disk Write Bytes/sec
\PhysicalDisk(*)\Split IO/Sec (as few as possible)
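
One way to sample these counters is Windows’ built-in typeperf utility. The hedged sketch below collects a few samples and derives the average IO size and the read/write mix, which is usually enough to tell small random IO from large sequential transfers; the (_Total) instance and the five-sample window are my assumptions.

import csv
import statistics
import subprocess

COUNTERS = [
    r"\PhysicalDisk(_Total)\Avg. Disk Bytes/Read",
    r"\PhysicalDisk(_Total)\Avg. Disk Bytes/Write",
    r"\PhysicalDisk(_Total)\Disk Read Bytes/sec",
    r"\PhysicalDisk(_Total)\Disk Write Bytes/sec",
]

# "-sc 5" collects five samples at the default one-second interval as CSV
out = subprocess.run(["typeperf", *COUNTERS, "-sc", "5"],
                     capture_output=True, text=True, check=True).stdout
rows = list(csv.reader(line for line in out.splitlines() if line.startswith('"')))
data = rows[1:]                                # rows[0] is the counter header

def avg(col: int) -> float:
    values = [float(r[col]) for r in data if r[col].strip()]
    return statistics.mean(values) if values else 0.0

read_io, write_io = avg(1), avg(2)             # average IO sizes in bytes
read_bps, write_bps = avg(3), avg(4)           # throughput in bytes/sec
total = (read_bps + write_bps) or 1.0
print(f"avg read IO {read_io / 1024:.0f}KB, avg write IO {write_io / 1024:.0f}KB")
print(f"read/write mix {100 * read_bps / total:.0f}%/{100 * write_bps / total:.0f}%")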

Windows Systems

Collect the following additional Windows performance counters to ensure there are no processor, memory, paging, or other operations occurring within the operating system that could negatively impact the system and produce additional storage workload.

Windows perfmon counters

\Paging File(*)\*
\Processor(*)\% Processor Time
\Processor(*)\% User Time
\Processor(*)\% Privileged Time
\System\Processor Queue Length
\Memory\Available MBytes
\Memory\Pages Input/sec
\Memory\Pages/sec

Memory

Available MBytes, should be greater than 100MB
Pages Input/sec, should be less than 10
Pages/sec, investigate if sustained above 100 on a slow disk subsystem or above 600 on a fast disk subsystem

Memory Manager

Memory Grants Pending, should be at or close to zero; a value over zero or growing indicates an issue
Page Life Expectancy, should be greater than 300 seconds; a lower or declining value indicates memory pressure

Paging

%Usage
The percentage of the page file currently in use; should be less than 70%
%Usage Peak
The peak percentage of the page file used since the server last booted; should be less than 70%

CPU Activity

Processor: % Processor Time, 80% or less is ideal
Processor: % Privileged Time, should be less than 30% of total % Processor Time
Processor: % User Time, should be about 70% or more of total % Processor Time
Process (sqlservr): % Processor Time, should be less than 80%
System: Processor Queue Length, less than 4 is ideal; 5–8 is good and 8–12 is fair
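
To tie the CPU, memory, and paging thresholds above together, a minimal sketch that encodes them as simple pass/fail checks (the structure and names are mine; the thresholds are those listed above):

CHECKS = {
    r"\Memory\Available MBytes":            lambda v: v > 100,
    r"\Memory\Pages Input/sec":             lambda v: v < 10,
    r"\Paging File(_Total)\% Usage":        lambda v: v < 70,
    r"\Processor(_Total)\% Processor Time": lambda v: v <= 80,
    r"\System\Processor Queue Length":      lambda v: v < 4,
}

def breaches(samples: dict) -> list:
    # Return the counters whose sampled value fails its acceptance test
    return [name for name, ok in CHECKS.items()
            if name in samples and not ok(samples[name])]

print(breaches({r"\System\Processor Queue Length": 9.0}))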

Categories: Uncategorized