EMC VPLEX vMSC (vSphere Metro Storage Cluster) support and certification process. I have compiled the following detail on the requirements from both VMware and EMC.
There is no formal process to certify a vMSC installation, but as long as the storage infrastructure is supported by EMC, the equipment is on the VMware HCL, and the KB articles below are followed, the environment will be supported. Ultimately, the configuration defined in the KB articles below was verified by and is directly supported by EMC.
VMware kb articles:
vSphere 4.x : Using VPLEX Metro with VMware HA (1026692)
vSphere 5.x : Implementing vSphere Metro Storage Cluster (vMSC) using EMC VPLEX (2007545)
VMware published the following best practices whitepaper
Duncan Epping also blogged about the PDL (permanent device loss) condition:
You need to ensure the attached documents are followed as per the detail below.
Attached is the EMC “Simple Support Matrix for VMware vSphere 5.x”. GeoSynchrony 5.1 references the following known issue, and therefore Patch 2 should be applied or the workaround deployed.
emc299427: VPLEX: Fabric frame drops due to SAN congestion
EMC Recommendations/Best Practices for Cluster cross-connect for VMWare ESXi (docu9765 Technical Notes, page 62)
EMC encourages any customer moving to a VPLEX Metro to move to ESX 5.0 Update 1 to benefit from all the HA enhancements in ESX 5.0 as well as the APD/PDL handling enhancements provided in Update 1.
• Applies to vSphere 4.1 and newer and VPLEX Metro Spanned SAN configuration
• HA/DRS cluster is stretched across the sites. This is a single HA/DRS cluster with ESXi hosts at each site
• A single standalone vCenter will manage the HA/DRS cluster
• The vCenter host will be located at the primary datacenter
• The HA/VM/Service Console/vMotion networks should use multiple NICs on each ESX host for redundancy
• The latency limitation of 1ms is applicable to both Ethernet Networks as well as the VPLEX FC WAN networks
• The ESXi servers should use internal disks or local SAN disks for booting. The Distributed Device should not be used as a boot disk
• All ESXi host initiators must be registered as “default” type in VPLEX
• VPLEX Witness must be installed at a third location isolating it from failures that could affect VPLEX clusters at either site
• It is recommended to place the VM in the preferred site of the VPLEX distributed volume (that contains the datastore)
• In the case of a Storage Volume failure or a BE array failure at one site, VPLEX will continue to operate with the site that is healthy. Furthermore, if a full VPLEX failure or WAN COM failure occurs and the cluster cross-connect is operational, these failures will be transparent to the host
• Create a common storage view for ESX nodes on site 1 on VPLEX cluster-1
• Create a common storage view for ESX nodes on site 2 on VPLEX cluster-2
• All Distributed Devices common to the same set of VMs should be in one consistency group
• All VMs associated with one consistency group should be collocated at the same site, with the bias set on the consistency group to that site
• If using ESX Native Multi-Pathing (NMP), make sure to use the Fixed policy, and make sure the path(s) to the local VPLEX are the primary path(s) and the path(s) to the remote VPLEX are stand-by only
• vMSC is supported for both non-uniform and uniform (cross-connect) host access configurations
The following configuration requirements are from the VMware article KB 2007545 above.
These requirements must be satisfied to support this configuration:
• The maximum round trip latency on both the IP network and the inter-cluster network between the two VPLEX clusters must not exceed 5 milliseconds round-trip-time for a non-uniform host access configuration and must not exceed 1 millisecond round-trip-time for a uniform host access configuration. The IP network supports the VMware ESXi hosts and the VPLEX Management Console. The interface between two VPLEX clusters can be Fibre Channel or IP.
• The ESXi hosts in both data centers must have a private network on the same IP subnet and broadcast domain.
• Any IP subnet used by the virtual machines must be accessible from ESXi hosts in both datacenters. This requirement is important so that clients accessing virtual machines running on ESXi hosts on both sides are able to function smoothly upon any VMware HA triggered virtual machine restart events.
• The data storage locations, including the boot device used by the virtual machines, must be active and accessible from ESXi hosts in both datacenters.
• vCenter Server must be able to connect to ESXi hosts in both datacenters.
• The VMware datastores for the virtual machines running in the ESX cluster are provisioned on Distributed Virtual Volumes.
• The number of hosts in the HA cluster must not exceed 32.
• The configuration option auto-resume for VPLEX consistency groups must be set to true.
• The ESXi hosts forming the VMware HA cluster can be distributed across two sites. HA clusters can start a virtual machine on the surviving ESXi host, and the ESXi host accesses the Distributed Virtual Volume through the storage paths at its site.
• VPLEX 5.0 and above and ESXi 5.0 are tested in this configuration with the VPLEX Witness.
For any additional requirement for VPLEX Distributed Virtual Volumes, see the EMC VPLEX best practices document.
• The front-end zoning should be done in such a manner that an HBA port is zoned to either the local or the remote VPLEX cluster.
• The path policy should be set to FIXED to avoid writes to both legs of the distributed volume by the same host.
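The round-trip latency limits above can be sketched as a simple check. This is a minimal illustration only; the 5 ms and 1 ms limits come from KB 2007545, while the function and dictionary names are mine.

```python
# Round-trip latency limits from KB 2007545; illustrative names.
RTT_LIMIT_MS = {"non-uniform": 5.0, "uniform": 1.0}

def rtt_supported(rtt_ms: float, host_access: str) -> bool:
    """Return True if the measured round-trip time is within the
    supported limit for the given host access configuration."""
    return rtt_ms <= RTT_LIMIT_MS[host_access]

print(rtt_supported(3.2, "non-uniform"))  # within the 5 ms limit: True
print(rtt_supported(3.2, "uniform"))      # exceeds the 1 ms limit: False
```

Note that the same measured latency can be supportable for a non-uniform configuration yet unsupported for a uniform (cross-connect) one, which is why the host access model must be known before signing off on the link.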
emc299427: Workaround and Permanent fixes for VPLEX GeoSynchrony 5.1
• VMware ESX and ESXi 5.x hosts can be configured to NOT send the VAAI-CAW command to the VPLEX. The following actions must be completed on all ESX and ESXi 5.x hosts connected to the VPLEX to accomplish this.
• The setting is represented by the “HardwareAcceleratedLocking” variable in ESX:
a. Using the vSphere Client, go to Host > Configuration > Software > Advanced Settings > VMFS3
b. Change the HardwareAcceleratedLocking value from 1 to 0. By default this is 1 in ESX/ESXi 5.x environments.
The change to the above setting can be verified by reviewing the VMkernel logs at /var/log/vmkernel or /var/log/messages:
cpuN:1234)Config: 297: "HardwareAcceleratedMove" = 1, Old Value: 0, (Status: 0x0)
cpuN:1234)Config: 297: "HardwareAcceleratedInit" = 1, Old Value: 0, (Status: 0x0)
cpuN:1234)Config: 297: "HardwareAcceleratedLocking" = 0, Old Value: 1, (Status: 0x0)
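If you are checking many hosts, the log lines above can be verified programmatically. This is a rough sketch under the assumption that the "Config:" line format matches the samples shown; log paths and formats can vary between ESXi builds, and the function name is mine.

```python
import re

# Matches the VMkernel "Config:" lines shown above, e.g.
#   "HardwareAcceleratedLocking" = 0, Old Value: 1
LINE = re.compile(r'"(\w+)" = (\d+), Old Value: (\d+)')

def parse_config_changes(log_text: str) -> dict:
    """Map each advanced-setting name to a (new_value, old_value) pair."""
    return {m.group(1): (int(m.group(2)), int(m.group(3)))
            for m in LINE.finditer(log_text)}

log = '''cpuN:1234)Config: 297: "HardwareAcceleratedMove" = 1, Old Value: 0, (Status: 0x0)
cpuN:1234)Config: 297: "HardwareAcceleratedLocking" = 0, Old Value: 1, (Status: 0x0)'''

changes = parse_config_changes(log)
# CAW (HardwareAcceleratedLocking) was changed from 1 to 0 as required
assert changes["HardwareAcceleratedLocking"] == (0, 1)
```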
• VPLEX GeoSynchrony 5.1 only utilizes the VAAI-CAW [HardwareAcceleratedLocking] command, and hence this is the only value that needs to be set to 0.
• The values of HardwareAcceleratedMove and HardwareAcceleratedInit can be either 1 or 0.
Caution! There is an option in VPlexcli to set the ‘caw-enabled’ property under the storage-views context. Do not turn off the Compare and Write feature using this property, as it may have unexpected negative consequences. This must not be done from the VPlexcli.
• Apply 5.1 Patch 2
Today Dimension Data released anti-affinity rules functionality on the Public Managed Cloud Platform. This allows our customers to define a rule to separate two virtual machines so that they don't run on the same VM host at the same time.
There is also new functionality to copy OVF templates between two international regions (Geos), allowing customers to import an OVF once and then copy it around as required. Up to two simultaneous copies can be in progress at one time for your organization between Geos. Copies in progress are visible until completed.
• Mixing RAID types within a pool or disk group set is possible
• Tiering is done at 2 MB by default, though administrators have the option to manage storage blocks at 512 KB or 4 MB pages
• Tiering profiles can be applied to single LUNs or groups of LUNs
• Profiles specify not only the tiers to be used for each volume, but also the disk types, rotational speeds and RAID levels within each tier.
• With Dell Compellent there is no write penalty for migrating data to lower tiers.
Tiered storage solutions within traditional storage architectures cannot deliver the same level of write performance. In those solutions, data is written to a particular block and kept in that block. If the block is migrated to tier 3 and a new write comes in for that volume, the write will occur on tier 3.
• Dell Compellent snapshots do not reside within the volume with production data — they sit outside the production volume in pages within the general virtual storage pool
• Dell Compellent Fast Track augments Automated Tiered Storage by delivering optimal placement of data on each individual disk.
Fast Track uses the intelligence continuously collected by the Dell Compellent system to identify the most active, frequently accessed blocks of data on each spindle. It then places those data blocks together on the outer tracks of each drive. Keeping the head focused on that one area of the disk, where the active data resides, delivers better performance than if the head were forced to move all over the disk.
• Writes are always written to tier 1, RAID 10; large sequential write workloads will also hit tier 1.
Migration occurs automatically at a set time defined by the user, or on demand, while the system is still online. The migration process runs in the background and does not affect data availability or application performance. There is no need to bring down an application, pause I/O or wait for a minimum I/O requirement. If a read request comes in for a page that is being moved, the request is satisfied from the original placement of the page. The page is then moved after the read is complete. If a write request comes in, it will not interfere with the migration process, as new information is always written to tier 1, RAID 10, and therefore is not eligible for migration. Overwriting a block of protected information also occurs on tier 1, RAID 10, so moving a block of data receiving writes simply will not occur. There is never a situation when application I/O is denied; application requests always take priority.
• Activating the tiering functionality requires a module-specific license, but perpetual licensing ensures that organizations only incur additional licensing expenses when adding more capacity to an existing system. Upgrading to a new controller with the latest technologies does not require a new software license, as is the case with other solutions.
• In the Dell Compellent architecture, new data is written by default to tier 1, RAID 10 storage to provide the best write performance. Replays move to a lower storage tier with RAID 5 or 6 protection during the next migration cycle within a 24-hour window. And over time, according to the tiering profile, infrequently accessed blocks of data move to a lower storage tier and RAID level, or to a different RAID level within the same tier. Moving this read-only data from RAID 10 to RAID 5 within the same tier enables administrators to maintain the same read performance.
Not sure that read performance will be the same when a block of data within a tier is moved between RAID types (RAID 5 read performance is superior to RAID 10).
• When new data needs to be written to an existing block that has since been converted to read-only and migrated to a lower tier, those writes are redirected to tier 1, RAID 10 storage. A new writable block is automatically allocated to provide the highest transaction performance. Virtual pointers utilize the usage characteristics of those blocks to maintain data continuity. All data is written to tier 1, yet snapshots move to the lowest tier available within 24 hours for the highest write and read performance possible. Virtual pointers retain continuity between all associated blocks.
The vSphere Operations Manager implementation uses a vApp, and if you are using vSphere Essentials or Advanced the deployment fails because the vApp requires DRS.
The following article explains how to get around this requirement by using a standalone ESX host.
It seems that there is an issue with EMC VNX and CX arrays with VMware 4.x when ALUA (failover mode 4) is used.
If there are no data LUNs numbered 0 presented to the VMware hosts, the hosts' failover mode can switch from 4 back to failover mode 1.
Refer to primus case, emc262738
Symptom: After a storage processor reboot (either because of a non-disruptive upgrade [NDU] or another reboot event), the failover mode for the ESX 4.x hosts changed from 4 (ALUA) to 1 on all host initiators.
Cause: On this particular array, a Host LUN Zero was not configured for each Storage Group. This allowed the array to present a “LUNZ” to the host. All host initiators had been configured to failover mode 4 (ALUA). When the storage processor rebooted due to a non-disruptive upgrade (NDU) and the connection was reestablished, the ESX host saw the LUNZ as an active/passive device and sent a command to the array to set the failover mode to 1. This changed the failover mode settings for all the LUNs in the Storage Group, and since the Failover Policy on the host was set to FIXED, when one SP was rebooting the host lost access to the LUNs.
Fix: VMware will fix this issue in an upcoming patch for ESX 4.0 and 4.1. ESX 5.x does not have this issue.
To work around this issue, you can bind a small LUN, add it to the Storage Group and configure it as Host LUN 0 (zero). You will need to reboot each host after adding the HLU 0. Each Storage Group will need an HLU 0. See solution emc57314 for information on changing the HLU.
These are the directions from VMware for the workaround:
Present a 1.5 GB or larger LUN 0 to all ESX hosts. (This volume does not need to be formatted, but must be equal to or larger than 1.5 GB.)
Roll a reboot through all hosts to guarantee that they see LUN 0 instead of the LUNZ. A rescan may work, but a reboot guarantees that they will not have any legacy data for the CommPath volume.
Thanks to Glen for pointing out the reason and solution.
There is a fix in ESX version 5, so those hosts aren’t affected.
I have been investigating NetApp’s “best practices” on flow control and found the following two references where NetApp suggests that on modern 10GbE infrastructure flow control should be avoided.
TR-3802 – Ethernet Storage Best Practices
CONGESTION MANAGEMENT WITH FLOW CONTROL
Flow control mechanisms exist at many different OSI layers, including the TCP window, XON/XOFF, and FECN/BECN for Frame Relay. In an Ethernet context, L2 flow control could not be implemented until the introduction of full duplex links, because a half duplex link is unable to send and receive traffic simultaneously. 802.3x allows a device on a point-to-point connection experiencing congestion to send a PAUSE frame to temporarily pause the flow of data. A reserved and defined multicast MAC address of 01-80-C2-00-00-01 is used to send the PAUSE frames, which also include the length of the pause requested.
In simple networks, this method can work well. However, with the introduction of larger and larger networks, along with more advanced network equipment and software, technologies such as TCP windowing, increased switch buffering, and end-to-end QoS negate the need for simple flow control throughout the network.
TR-3749 NetApp Storage Best Practices for VMware vSphere
Flow control is a low-level process for managing the rate of data transmission between two nodes to prevent a fast sender from overrunning a slow receiver. Flow control can be configured on ESX/ESXi servers, FAS storage arrays, and network switches. For modern network equipment, especially 10GbE equipment, NetApp recommends turning off flow control and allowing congestion management to be performed higher in the network stack. For older equipment, typically GbE with smaller buffers and weaker buffer management, NetApp recommends configuring the endpoints, ESX servers, and NetApp arrays with the flow control set to “send.”
Evaluation of the following Windows performance counters, cross-matched with storage system performance statistics, can assist in identifying the application workload pattern. The performance criteria we should be evaluating are broken down into two main sections: storage and Windows system “workload” performance counters.
Storage performance monitoring should make up the bulk of the information we review to identify storage-related bottlenecks or performance problems. The latency and queue length counters can be used by the system owners and the application development team to self-diagnose potential storage issues. You could also potentially roll out the performance logging via SCOM to the servers and alert on the thresholds defined below for further investigation.
Storage array performance counters
Processor % utilisation
Processor bandwidth MB/s
Processor throughput IOPS
Processor cache forced flushes
The above statistics and counters should be recorded and reviewed on the storage arrays regularly using the appropriate storage vendor tool sets (ECC and analyser). There are significantly more performance counters to track at the storage level but as a rule these will help set a baseline.
Windows Performance Counters
PhysicalDisk(*)\Avg. Disk sec/Read
PhysicalDisk(*)\Avg. Disk sec/Write
PhysicalDisk(*)\Avg. Disk sec/Transfer (a combination of the above two counters)
These counters return a value in seconds; therefore 0.010 is 10ms
Disk latency should fall within the following acceptance criteria:
• Less than 5ms is considered excellent
• Less than 10ms is considered good
• Less than 15ms is considered acceptable
• Less than 20ms is fair
• More than 20ms and less than 50ms is poor
• More than 50ms is substandard
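The acceptance bands above can be sketched as a small helper. The band boundaries come from the list; the function name and structure are illustrative only. Remember that the perfmon counters report seconds, so convert to milliseconds first.

```python
# Latency acceptance bands from the list above (upper bound in ms, rating).
BANDS = [(5, "excellent"), (10, "good"), (15, "acceptable"),
         (20, "fair"), (50, "poor")]

def rate_latency(avg_disk_sec: float) -> str:
    """Classify an Avg. Disk sec/Transfer sample (in seconds)
    against the acceptance bands."""
    ms = avg_disk_sec * 1000  # counter is in seconds, e.g. 0.010 -> 10 ms
    for limit, rating in BANDS:
        if ms < limit:
            return rating
    return "substandard"

print(rate_latency(0.008))  # 8 ms -> "good"
print(rate_latency(0.060))  # 60 ms -> "substandard"
```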
Latency should be measured in the following locations to isolate the source of the latency when it is identified:
1. Within the Operating System (perfmon or top)
2. Within the Hypervisor (esxtop or resxtop)
3. At the Storage subsystem, LUN response times (latency)
Disk queue length
Disk queue length does not accurately reflect performance on its own but can be used to assist in the diagnosis of performance issues. Queue lengths that grow significantly and that exceed the expected performance of the underlying storage highlight bottlenecks, but those bottlenecks can exist in several locations.
Windows performance counters
\PhysicalDisk(*)\Avg. Disk Write Queue Length
\PhysicalDisk(*)\Avg. Disk Read Queue Length
\PhysicalDisk(*)\Current Disk Queue Length
\PhysicalDisk(*)\Avg. Disk Queue Length
• A queue length of 2x to 3x the number of disk spindles used to create the volume is acceptable (assuming the spindles are dedicated)
• A sustained queue length of 30 or more is a reason for investigation
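Those two rules of thumb can be combined into a quick check. The 2x-3x spindle multiplier and the threshold of 30 are taken from the text above; the function name and the choice of the 3x upper bound are my assumptions.

```python
def queue_length_ok(avg_queue: float, spindles: int) -> bool:
    """Rough check: flag sustained queue lengths beyond ~3x the
    dedicated spindle count, or beyond 30, as worth investigating."""
    return avg_queue <= 3 * spindles and avg_queue < 30

print(queue_length_ok(12, 8))   # 12 <= 24 and < 30: acceptable -> True
print(queue_length_ok(35, 16))  # sustained 30+ warrants a look -> False
```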
Disk queue length should be measured at the OS as well as at all of the following locations:
1. Disk queue length within the Operating System
2. Front end port queue length on the storage arrays
3. LUN queue length
The following counters can assist in identifying the type of workload within the Windows operating system:
Windows performance counters
\PhysicalDisk(*)\Avg. Disk Bytes/Read
\PhysicalDisk(*)\Avg. Disk Bytes/Transfer
\PhysicalDisk(*)\Avg. Disk Bytes/Write
\PhysicalDisk(*)\Avg. Disk sec/Read
\PhysicalDisk(*)\Avg. Disk sec/Transfer
\PhysicalDisk(*)\Avg. Disk sec/Write
\PhysicalDisk(*)\Disk Read Bytes/sec
\PhysicalDisk(*)\Disk Write Bytes/sec
\PhysicalDisk(*)\Split IO/Sec (as few as possible)
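The counters above can be combined to characterise a workload, for example by deriving the average IO size and the read/write mix. This is a hedged sketch; the function name is mine and the sample values are made up for illustration.

```python
def workload_profile(read_bytes_per_io: float, write_bytes_per_io: float,
                     read_bps: float, write_bps: float) -> dict:
    """Summarise IO size and read ratio from PhysicalDisk samples:
    Avg. Disk Bytes/Read, Avg. Disk Bytes/Write,
    Disk Read Bytes/sec and Disk Write Bytes/sec."""
    total = read_bps + write_bps
    return {
        "avg_read_kb": read_bytes_per_io / 1024,
        "avg_write_kb": write_bytes_per_io / 1024,
        "read_pct": 100 * read_bps / total if total else 0.0,
    }

# Hypothetical samples: 64 KB reads, 8 KB writes, 30 MB/s read vs 10 MB/s write.
profile = workload_profile(65536, 8192, 30e6, 10e6)
# 75% read by throughput with large reads and small writes suggests a
# sequential-read-heavy workload with small random writes.
```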
Additional Windows Performance Monitor counters to collect to ensure there are no processor, memory, paging or other operations occurring within the operating system that could negatively impact the system and produce additional storage workload:
Windows perfmon counters
\Processor(*)\% Processor Time
\Processor(*)\% User Time
\Processor(*)\% Privileged Time
\System\Processor Queue Length
Available Mbytes, should be greater than 100MB
Pages Input/sec, should be less than 10
Pages/sec, a value greater than 100 (slow disk subsystem) or greater than 600 (fast subsystem) indicates excessive paging
Memory Grants Pending, should be at or close to zero; a value over zero or growing indicates an issue
Page Life Expectancy, should be greater than 300; a lower or declining value indicates memory pressure
The amount of the page file currently in use, by percentage, should be less than 70%
The peak amount of page file used since the server was last booted, by percentage, should be less than 70%
Processor: % Processor Time, 80% or less is ideal
Processor: % Privileged Time, should be less than 30% of the total % processor time
Processor: % User time, should be about 70% or more of the total % processor time
Process sqlservr % Processor Time, should be less than 80%
System: Processor Queue Length, less than 4 is ideal, 5 to 8 is good, 8 to 12 is fair
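The CPU thresholds above lend themselves to a simple alerting sketch, similar to what the SCOM rollout mentioned earlier would do. Only the 80% processor ceiling and the "less than 4 is ideal" queue length come from the text; the rule table and function names are my own.

```python
# Threshold rules for a couple of the counters above; illustrative only.
CPU_RULES = {
    r"\Processor(_Total)\% Processor Time": lambda v: v <= 80,  # 80% or less is ideal
    r"\System\Processor Queue Length":      lambda v: v < 4,    # less than 4 is ideal
}

def flag_counters(samples: dict) -> list:
    """Return the names of counters whose samples breach their threshold."""
    return [name for name, ok in CPU_RULES.items()
            if name in samples and not ok(samples[name])]

flags = flag_counters({
    r"\Processor(_Total)\% Processor Time": 92,  # breaches the 80% ceiling
    r"\System\Processor Queue Length": 2,        # within the ideal range
})
```

The same table-of-rules shape extends naturally to the memory and paging counters listed above.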