EMC VPLEX vMSC (vSphere Metro Storage Cluster) support and certification process. I have compiled the following details on the requirements from both VMware and EMC.
There is no formal process to certify a vMSC installation, but as long as the storage infrastructure is supported by EMC, the equipment is on the VMware HCL and the KB articles below are followed, the environment will be supported. Ultimately the configuration defined in the KB articles below was verified by, and is directly supported by, EMC.
VMware kb articles:
vSphere 4.x : Using VPLEX Metro with VMware HA (1026692)
vSphere 5.x : Implementing vSphere Metro Storage Cluster (vMSC) using EMC VPLEX (2007545)
VMware published the following best practices whitepaper
Duncan Epping has also blogged about the PDL (permanent device loss) condition:
You need to ensure the attached documents are followed as per the details below.
Attached is the EMC “Simple Support Matrix for VMware vSphere 5.x”; for GeoSynchrony 5.1 it references the following known issue, so Patch 2 should be applied or the workaround deployed.
emc299427: VPLEX: Fabric frame drops due to SAN congestion
EMC Recommendations/Best Practices for cluster cross-connect for VMware ESXi (docu9765 Technical Notes, page 62)
EMC encourages any customer moving to a VPLEX-Metro to move to ESX 5.0 Update 1 to benefit from all the HA enhancements in ESX 5.0 as well as the APD/PDL handling enhancements provided in update 1
• Applies to vSphere 4.1 and newer and VPLEX Metro Spanned SAN configuration
• HA/DRS cluster is stretched across the sites. This is a single HA/DRS cluster with ESXi hosts at each site
• A single standalone vCenter will manage the HA/DRS cluster
• The vCenter host will be located at the primary datacenter
• The HA/VM /Service console/vMotion networks should use multiple NIC cards on each ESX for redundancy
• The latency limitation of 1ms is applicable to both Ethernet Networks as well as the VPLEX FC WAN networks
• The ESXi servers should use internal disks or local SAN disks for booting. The Distributed Device should not be used as a boot disk
• All ESXi host initiators must be registered as “default” type in VPLEX
• VPLEX Witness must be installed at a third location isolating it from failures that could affect VPLEX clusters at either site
• It is recommended to place the VM in the preferred site of the VPLEX distributed volume (that contains the datastore)
• In case of a Storage Volume failure or a BE array failure at one site, VPLEX will continue to operate with the site that is healthy. Furthermore, if a full VPLEX failure or WAN COM failure occurs and the cluster cross-connect is operational, these failures will be transparent to the host
• Create a common storage view for ESX nodes on site 1 on VPLEX cluster-1
• Create a common storage view for ESX nodes on site 2 on VPLEX cluster-2
• All Distributed Devices common to the same set of VMs should be in one consistency group
• All VMs associated with one consistency group should be collocated at the same site, with the bias set on the consistency group to that site
• If using ESX Native Multi-Pathing (NMP), make sure to use the Fixed path policy, and ensure the path(s) to the local VPLEX cluster are the primary/preferred path(s) and the path(s) to the remote VPLEX cluster are standby only (see the sketch after this list)
• vMSC is supported for both non-uniform and uniform (cross-connect) host access configurations
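As a minimal sketch of the NMP recommendation above (assuming ESXi 5.x esxcli syntax; the device identifier and path name are placeholders to be replaced with your own values):
# List the paths for a VPLEX distributed device to identify local vs. remote VPLEX targets
esxcli storage nmp path list --device naa.<vplex_device_id>
# Set the path selection policy for the device to Fixed
esxcli storage nmp device set --device naa.<vplex_device_id> --psp VMW_PSP_FIXED
# Mark a path that goes to the local VPLEX cluster as the preferred (primary) path
esxcli storage nmp psp fixed deviceconfig set --device naa.<vplex_device_id> --path vmhba2:C0:T0:L10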
The following configuration requirements are from the VMware article KB 2007545 referenced above.
These requirements must be satisfied to support this configuration:
• The maximum round trip latency on both the IP network and the inter-cluster network between the two VPLEX clusters must not exceed 5 milliseconds round-trip-time for a non-uniform host access configuration and must not exceed 1 millisecond round-trip-time for a uniform host access configuration. The IP network supports the VMware ESXi hosts and the VPLEX Management Console. The interface between two VPLEX clusters can be Fibre Channel or IP.
• The ESXi hosts in both data centers must have a private network on the same IP subnet and broadcast domain.
• Any IP subnet used by a virtual machine must be accessible from ESXi hosts in both datacenters. This is important so that clients accessing virtual machines running on ESXi hosts on either side continue to function smoothly after any VMware HA triggered virtual machine restart event.
• The data storage locations, including the boot device used by the virtual machines, must be active and accessible from ESXi hosts in both datacenters.
• vCenter Server must be able to connect to ESXi hosts in both datacenters.
• The VMware datastores for the virtual machines running in the ESX cluster are provisioned on Distributed Virtual Volumes.
• The maximum number of hosts in the HA cluster must not exceed 32 hosts.
• The configuration option auto-resume for VPLEX consistency groups must be set to true.
• The ESXi hosts forming the VMware HA cluster can be distributed across two sites. HA can restart a virtual machine on a surviving ESXi host, and that host accesses the Distributed Virtual Volume through the storage paths at its site.
• VPLEX 5.0 and above and ESXi 5.0 are tested in this configuration with the VPLEX Witness.
For any additional requirement for VPLEX Distributed Virtual Volumes, see the EMC VPLEX best practices document.
• The front-end zoning should be done in such a manner that an HBA port is zoned to either the local or the remote VPLEX cluster.
• The path policy should be set to FIXED to avoid writes to both legs of the distributed volume by the same host.
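A related sketch for the FIXED policy requirement above: if the VPLEX volumes are claimed by the Invista SATP (VMW_SATP_INV), which is typical for VPLEX on ESXi 5.x (verify with the list command first), Fixed can be made the default PSP for that SATP so newly presented VPLEX devices pick it up automatically:
# Show the SATPs and their current default path selection policies
esxcli storage nmp satp list
# Make Fixed the default PSP for devices claimed by VMW_SATP_INV
esxcli storage nmp satp set --satp VMW_SATP_INV --default-psp VMW_PSP_FIXED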
emc299427: Workaround and Permanent fixes for VPLEX GeoSynchrony 5.1
• VMware ESX and ESXi 5.x hosts can be configured to NOT send the VAAI-CAW command to the VPLEX. On all ESX and ESXi 5.x hosts connected to the VPLEX, the following actions must be completed to accomplish this.
• The setting is represented by the “HardwareAcceleratedLocking” variable in ESX:
a. Using vSphere client, go to host > Configuration > Software > Advanced Settings > VMFS3
b. Set the HardwareAcceleratedLocking value from 1 to 0. By default this is 1 in ESX or ESXi 5.x environments.
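The same change can also be made from the command line; a sketch using esxcli on ESXi 5.x (esxcfg-advcfg offers an equivalent on classic ESX):
# Check the current value of the VAAI CAW (HardwareAcceleratedLocking) setting
esxcli system settings advanced list --option /VMFS3/HardwareAcceleratedLocking
# Disable hardware accelerated locking so the host no longer sends CAW to the VPLEX
esxcli system settings advanced set --option /VMFS3/HardwareAcceleratedLocking --int-value 0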
The change of the above settings can be verified by reviewing VMkernel logs at /var/log/vmkernel or /var/log/messages:
cpuN:1234)Config: 297: "HardwareAcceleratedMove" = 1, Old Value: 0, (Status: 0x0)
cpuN:1234)Config: 297: "HardwareAcceleratedInit" = 1, Old Value: 0, (Status: 0x0)
cpuN:1234)Config: 297: "HardwareAcceleratedLocking" = 0, Old Value: 1, (Status: 0x0)
• VPLEX GeoSynchrony 5.1 only utilizes the VAAI-CAW [HardwareAcceleratedLocking] command, hence this is the only value that needs to be set to 0.
• The values of HardwareAcceleratedMove and HardwareAcceleratedInit can be either 1 or 0.
Caution! There is an option in VPlexcli to set the ‘caw-enabled’ property under the storage-views context. Do not turn off the Compare and Write feature using this property, as doing so may have unexpected negative consequences.
• Alternatively, apply GeoSynchrony 5.1 Patch 2
I've done some reading up on Amazon’s Glacier product and, although I can see its uses in the consumer space, I'm not convinced there is a clear use case in the enterprise space.
Essentially Glacier exposes a simple, standards-based REST web service interface, along with Java and .NET SDKs, providing WORM-style HTTP storage organized into vaults (up to 1,000 per region per account), each holding archives (files or zips) from 1 byte to 40 TB.
It's driven via the API, but I didn't find any references to included archive software, and archive retrieval can take 3-5 hours, although Amazon claims it is not a tape-based solution.
The 99.999999% figure refers to data integrity (durability) rather than availability, and this raises questions about the platform, as Amazon doesn't guarantee data is recoverable. Amazon therefore only recommends Glacier where data loss, however unlikely, is acceptable; if data loss is unacceptable, Amazon recommends its S3 product.
I'm not sure what use cases exist for enterprise customers with Glacier unless they don't value the data or its timely retrieval. I'd see this as a cloud trash-bin service!
Today Dimension Data exposed anti-affinity rules functionality on the Public Managed Cloud Platform. This allows our customers to define a rule to separate two virtual machines so that they don't run on the same VM host at the same time.
There is also new functionality to copy OVF templates between two international regions (geos), allowing customers to import an OVF once and then copy it around as required. Up to two simultaneous copies can be in progress at any one time for your organization between geos, and copies in progress remain visible until completed.
• Mixing RAID types within a pool or disk group set is possible
• Tiering is done at 2 MB by default, though administrators have the option to manage storage blocks at 512 KB or 4 MB pages
• Tiering profiles can be applied to single LUNs or groups of LUNs
• Profiles specify not only the tiers to be used for each volume, but also the disk types, rotational speeds and RAID levels within each tier.
• With Dell Compellent there is no write penalty for migrating data to lower tiers.
Tiered storage solutions within traditional storage architectures cannot deliver the same level of write performance. In those solutions, data is written to a particular block and kept in that block. If the block is migrated to tier 3 and a new write comes in for that volume, the write will occur on tier 3.
• Dell Compellent snapshots do not reside within the volume with production data — they sit outside the production volume in pages within the general virtual storage pool
• Dell Compellent Fast Track augments Automated Tiered Storage by delivering optimal placement of data on each individual disk.
Fast Track uses the intelligence continuously collected by the Dell Compellent system to identify the most active, frequently accessed blocks of data on each spindle. It then places those data blocks together on the outer tracks of each drive. Keeping the head focused on that one area of the disk, where the active data resides, delivers better performance than if the head were forced to move all over the disk.
• Writes are always written to tier 1, RAID 10; large sequential write workloads will hit tier 1.
Migration occurs automatically at a set time defined by the user, or on demand, while the system is still online. The migration process runs in the background and does not affect data availability or application performance. There is no need to bring down an application, pause I/O or wait for a minimum I/O requirement. If a read request comes in for a page that is being moved, the request is satisfied from the original placement of the page; the page is then moved after the read is complete. If a write request comes in, it will not interfere with the migration process, as new information is always written to tier 1, RAID 10, and is therefore not eligible for migration. Overwriting a block of protected information also occurs on tier 1, RAID 10, so a block of data receiving writes will simply never be moved. There is never a situation where application I/O is denied; application requests always take precedence.
• Activating the tiering functionality requires a module-specific license, but perpetual licensing ensures that organizations only incur additional licensing expenses when adding more capacity to an existing system. Upgrading to a new controller with the latest technologies does not require a new software license, unlike some other solutions.
• In the Dell Compellent architecture, new data is written by default to tier 1, RAID 10 storage to provide the best write performance. Replays move to a lower storage tier with RAID 5 or 6 protection during the next migration cycle within a 24-hour window. And over time, according to the tiering profile, infrequently accessed blocks of data move to a lower storage tier and RAID level, or to a different RAID level within the same tier. Moving this read-only data from RAID 10 to RAID 5 within the same tier enables administrators to maintain the same read performance.
I'm not sure that read performance will be the same when a block of data within a tier is moved between RAID types (RAID 5 read performance is superior to RAID 10).
• When new data needs to be written to an existing block that has since been converted to read-only and migrated to a lower tier, those writes are redirected to tier 1, RAID 10 storage. A new writable block is automatically allocated to provide the highest transaction performance. Virtual pointers use the characteristics of those blocks to maintain data continuity. All data is written to tier 1, yet snapshots move to the lower tier available within 24 hours, for the highest write and read performance possible. Virtual pointers retain continuity between all associated blocks.
The vCenter Operations Manager implementation uses a vApp, and if you are using vSphere Essentials or Advanced the deployment fails because the vApp requires DRS.
The following article explains how to get around this requirement by using a standalone ESX host.
It seems that there is an issue with EMC VNX and CX with VMware 4.x when ALUA (failover mode 4) is used.
If no data LUN numbered 0 is presented to the VMware hosts, failover mode 4 can switch back to failover mode 1.
Refer to Primus case emc262738.
Symptom: After a storage processor reboot (either because of a non-disruptive upgrade [NDU] or another reboot event), the failover mode for the ESX 4.x hosts changed from 4 (ALUA) to 1 on all host initiators.
Cause: On this particular array, a Host LUN Zero was not configured for each Storage Group, which caused the array to present a “LUNZ” to the host. All host initiators had been configured to failover mode 4 (ALUA). When the storage processor rebooted due to a non-disruptive upgrade (NDU) and the connection was re-established, the ESX host saw the LUNZ as an active/passive device and sent a command to the array to set the failover mode to 1. This changed the failover mode settings for all the LUNs in the Storage Group, and since the Failover Policy on the host was set to FIXED, the host lost access to the LUNs while one SP was rebooting.
Fix: VMware will address this issue in an upcoming patch for ESX 4.0 and 4.1. ESX 5.x does not have this issue.
To work around this issue, you can bind a small LUN, add it to the Storage Group and configure it as Host LUN 0 (zero). You will need to reboot each host after adding the HLU 0, and each Storage Group will need an HLU 0. See solution emc57314 for information on changing the HLU.
These are the directions from VMware for the workaround:
Present a 1.5 GB or larger LUN 0 to all ESX hosts. (This volume does not need to be formatted, but must be equal to or larger than 1.5 GB.)
Roll a reboot through all hosts to guarantee that they are seeing the LUN 0 instead of the LUNZ. A rescan may work, but a reboot guarantees that they will not have any legacy data for the CommPath volume.
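For the array side, a sketch of adding an existing small LUN as Host LUN 0 with Navisphere Secure CLI (the SP address, storage group name and ALU number below are placeholders):
# Add the small LUN (ALU 200 in this example) to the storage group as host LUN 0
naviseccli -h <SP_A_address> storagegroup -addhlu -gname ESX_Cluster_SG -hlu 0 -alu 200
# Verify the HLU/ALU mapping for the storage group
naviseccli -h <SP_A_address> storagegroup -list -gname ESX_Cluster_SG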
Thanks to Glen for pointing out the reason and solution.
There is a fix in ESX version 5, so those hosts aren’t affected.
I have been investigating NetApp’s “best practices” on flow control and found the following two references where NetApp suggests that flow control should be disabled on modern 10GbE infrastructure.
TR-3802 – Ethernet Storage Best Practices
CONGESTION MANAGEMENT WITH FLOW CONTROL
Flow control mechanisms exist at many different OSI Layers including the TCP window, XON/XOFF, and FECN/BECN for Frame Relay. In an Ethernet context, L2 flow control was unable to be implemented until the introduction of full duplex links, because a half duplex link is unable to send and receive traffic simultaneously. 802.3X allows a device on a point-to-point connection experiencing congestion to send a PAUSE frame to temporarily pause the flow of data. A reserved and defined multicast MAC address of 01-80-C2-00-00-01 is used to send the PAUSE frames, which also includes the length of pause requested.
In simple networks, this method can work well. However, with the introduction of larger and larger networks along with more advanced network equipment and software, technologies such as TCP windowing, increased switch buffering, and end-to-end QoS negate the need for simple flow control throughout the network.
TR-3749 NetApp Storage Best Practices for VMware vSphere
Flow control is a low-level process for managing the rate of data transmission between two nodes to prevent a fast sender from overrunning a slow receiver. Flow control can be configured on ESX/ESXi servers, FAS storage arrays, and network switches. For modern network equipment, especially 10GbE equipment, NetApp recommends turning off flow control and allowing congestion management to be performed higher in the network stack. For older equipment, typically GbE with smaller buffers and weaker buffer management, NetApp recommends configuring the endpoints, ESX servers, and NetApp arrays with the flow control set to “send.”
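As a minimal sketch of applying the 10GbE recommendation on the array side (assuming Data ONTAP 7-Mode; the interface name is an example):
# Disable flow control on a 10GbE interface
ifconfig e1a flowcontrol none
# To persist the change across reboots, add the same flowcontrol option to that interface's ifconfig line in /etc/rc
The ESXi and switch ends should be set to match, per the vendor documentation for the NIC driver and switch model in use.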