Sunday, November 8, 2009

Designing a Disaster-Tolerant Multi-Site High Availability Solution Integrating Microsoft Failover Clustering

Today we are going to talk about the design concepts of a highly available disaster recovery solution based on Microsoft clustering, and the various options available in the market for geographic clusters. The purpose of the article is to give an insight into the design considerations for a geographic cluster. Why do we need a disaster-tolerant solution, and does it cover backup requirements? Disaster tolerance is the ability to restore applications and data services within a reasonable period of time after a disaster. Most people think of fire, flood, and earthquake as disasters, but a disaster can be any event that unexpectedly interrupts service or corrupts data in an entire data center. Disaster tolerance does not remove the need for an effective backup solution for data or application recovery on the cluster. Backup solutions enable us to go back in time for restoration, while high availability/clustering solutions make sure that applications and data services stay up and running all year round. The very essence of a geographic cluster is that data on site A needs to be replicated to site B to counter any disaster on site A, and vice versa. The maximum distance between the nodes in the cluster determines the data replication and networking technology.

The questions you need to ask before starting the design are:

1. Which applications are you going to run on the cluster nodes, and what kind of I/O will they do? How much data loss or lag can these applications and the business sustain? Many applications can recover from crash-consistent states; very few can recover from out-of-order I/O operation sequences.
2. How far apart will the cluster nodes be, and will the solution consist of two sites or more? Depending on this and on the answer to the first question, you have to choose between synchronous and asynchronous replication.
3. What will be the medium of data replication: Fibre Channel, LAN or WAN?
4. Which cluster extension would you like to use for the Microsoft failover clustering multi-site cluster? There are various solutions in the market, like HP CLX and EMC Cluster Enabler. The choice can be influenced by the storage solution you have, such as HP EVA/XP or EMC Symmetrix/CLARiiON.

This article is mostly about multi-site clustering for Microsoft failover clustering. However, there are other existing solutions in the market which can be leveraged, for example VMware vCenter Site Recovery Manager, IBM GPFS clusters using PowerHA SystemMirror for AIX Enterprise Edition, Veritas Cluster, HP PolyServe and Metrocluster.

In a two-site configuration, the nodes in Site A are connected directly to the storage in Site A, and the nodes in Site B are connected directly to the storage in Site B. The nodes in Site A can continue without accessing the storage in Site B, and vice versa. The storage fabric [HP Continuous Access or EMC SRDF] or host-based software [Double-Take or Microsoft Exchange CCR] provides a way to mirror or replicate data between the sites so that each site has a copy of the data. In Windows Server 2008 failover clustering, the concept of quorum has changed entirely: quorum now means a majority of votes. Prior to this, Windows Server 2003 used majority node set (MNS) clusters, which, as the name suggests, also worked on the concept of node majority, with the benefit of a file share witness that can provide an additional vote if required to achieve quorum.
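The "majority of votes" rule can be illustrated with a minimal sketch. This is a simplified model, not the cluster service's actual algorithm; the function name and the 2+2+1 layout are assumptions chosen for illustration.

```python
def has_quorum(votes_reachable: int, votes_total: int) -> bool:
    """A partition keeps quorum only with a strict majority of all configured votes."""
    return votes_reachable > votes_total // 2


# Hypothetical layout: 2 nodes in Site A, 2 nodes in Site B,
# plus a file share witness in a third site (5 votes in total).
TOTAL_VOTES = 2 + 2 + 1

# The inter-site link breaks; Site A can still reach the witness, Site B cannot.
site_a_votes = 2 + 1
site_b_votes = 2

print(has_quorum(site_a_votes, TOTAL_VOTES))  # True: Site A keeps running
print(has_quorum(site_b_votes, TOTAL_VOTES))  # False: Site B stops hosting resources

# Without a witness (4 votes), an even split leaves neither side with a majority.
print(has_quorum(2, 4))  # False
```

The last line shows why the extra witness vote matters in a cluster with an even number of nodes split across two sites.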
The essence of the solution is that we need to replicate our storage LUNs from site A to the storage LUNs of site B. This can be either synchronous or asynchronous replication, and it may run from site A LUNs --> site B LUNs or from site B LUNs --> site A LUNs. The replication direction can be controlled automatically by a cluster extension: if site A is active, replication runs from site A LUNs --> site B LUNs, and vice versa. It is recommended to put the file share witness in a third site.
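The direction rule described above can be modelled as a small state machine. This is only a toy sketch: the class, its field names and the LUN labels are hypothetical, and it does not represent how HP CLX or EMC Cluster Enabler are actually implemented.

```python
from dataclasses import dataclass


@dataclass
class ReplicatedLunPair:
    """Toy model of one replicated LUN pair (all names are hypothetical)."""
    site_a_lun: str
    site_b_lun: str
    active_site: str = "A"  # site currently hosting the clustered resource

    def direction(self) -> tuple:
        """Return (source, target): replication always flows away from the active site."""
        if self.active_site == "A":
            return (self.site_a_lun, self.site_b_lun)
        return (self.site_b_lun, self.site_a_lun)

    def fail_over(self) -> tuple:
        """Move the resource to the other site and return the reversed direction."""
        self.active_site = "B" if self.active_site == "A" else "A"
        return self.direction()


pair = ReplicatedLunPair(site_a_lun="siteA-lun1", site_b_lun="siteB-lun1")
print(pair.direction())  # ('siteA-lun1', 'siteB-lun1')
print(pair.fail_over())  # ('siteB-lun1', 'siteA-lun1')
```

A second call to `fail_over()` models failback and restores the original direction.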
*A major improvement to clustering in Windows Server 2008 is that cluster nodes can now reside on different subnets. As opposed to previous versions of clustering (as in Windows Server 2003 and Windows 2000 Server), cluster nodes in Windows Server 2008 can communicate across network routers. This means that you no longer have to stretch virtual local area networks (VLANs) to connect geographically separated cluster nodes, far reducing the complexity and cost of setting up and maintaining multi-site clusters. One consideration for subnet-spanning clusters can be client response time. Client computers cannot see a failed-over workload any faster than the DNS servers can update one another to point clients to the new server hosting that workload. For this reason, VLANs can make sense when keeping workload downtime to an absolute minimum is your highest priority.

Difference between synchronous and asynchronous data replication:
Synchronous replication is when an application performs an operation on one node at one site, and then that operation is not completed until the change has been made on the other sites. So, synchronous data replication holds the promise of no data loss in the event of failover for multi-site clusters that can take advantage of it. Using synchronous, block-level replication as an example, if an application at Site A writes a block of data to a disk mirrored to Site B, the input/output (I/O) operation will not be completed until the change has been made to both the disk on Site A and the disk on Site B. In general, synchronous data replication is best for multi-site clusters that can rely on high-bandwidth, low-latency connections. Typically, this will limit the application of synchronous data replication to geographically dispersed clusters whose nodes are separated by shorter distances. While synchronous data replication protects against data loss in the event of failover for multi-site clusters, it comes at the cost of the latencies of application write and acknowledgement times impacting application performance. Because of this potential latency, synchronous replication can slow or otherwise detract from application performance for your users.
Asynchronous replication is when a change is made to the data on Site A and that change eventually makes it to Site B. Multi-site clusters using asynchronous data replication can generally stretch over greater geographical distances with no significant application performance impact. In asynchronous replication, if an application at Site A writes a block of data to a disk mirrored to Site B, then the I/O operation is complete as soon as the change is made to the disk at Site A. The replication software transfers the change to Site B (in the background) and eventually makes that change to Site B. With asynchronous replication, the data at Site B can be out of date with respect to Site A at any point in time. This is because a node may fail after it has written an application transaction to storage locally but before it has successfully replicated that transaction to the other site or sites in the cluster; if that site goes down, the application failing over to another node will be unaware that the lost transaction ever took place. Preserving the order of application operations written to storage is also an issue with asynchronous data replication. Different vendors implement asynchronous replication in different ways. Some preserve the order of operations and others do not.

*Excerpt from Microsoft Windows Server 2008 Multi-Site Clustering Technical Decision-Maker White Paper
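To make the difference between the two modes concrete, here is a minimal in-memory sketch. The `ReplicatedVolume` class and its methods are hypothetical, and the FIFO queue is an assumption: as the excerpt notes, not every vendor preserves write order in asynchronous mode.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a disk at Site A mirrored to Site B (names are hypothetical)."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.site_a = []        # blocks committed at Site A
        self.site_b = []        # blocks committed at Site B
        self.pending = deque()  # acknowledged at Site A, not yet applied at Site B

    def write(self, block: str) -> None:
        """Return once the I/O would be acknowledged to the application."""
        self.site_a.append(block)
        if self.synchronous:
            # The I/O completes only after both sites hold the block.
            self.site_b.append(block)
        else:
            # The I/O completes now; Site B catches up in the background.
            self.pending.append(block)

    def drain(self, count: int) -> None:
        """Background transfer of up to `count` queued blocks, oldest first."""
        for _ in range(min(count, len(self.pending))):
            self.site_b.append(self.pending.popleft())

    def lost_if_site_a_fails(self) -> list:
        """Blocks the application believes are written but Site B never received."""
        return list(self.pending)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for block in ["t1", "t2", "t3"]:
    sync_vol.write(block)
    async_vol.write(block)

async_vol.drain(1)  # only the first block reaches Site B before the disaster

print(sync_vol.lost_if_site_a_fails())   # []
print(async_vol.lost_if_site_a_fails())  # ['t2', 't3']
print(async_vol.site_b)                  # ['t1']
```

The sketch leaves out the cost side of the trade-off: in synchronous mode every `write` would also wait for a round trip to Site B, which is why distance limits its use.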

The HP CLX extension is responsible for monitoring and recovering disk pair synchronization at the application level, and it offloads data replication tasks from the host using storage software such as Command View EVA/XP. CLX automates the time-consuming, labor-intensive processes required to verify the status of the storage as well as the server cluster, thus allowing the correct failover and failback decisions to be made to minimize downtime. It automatically manages recovery without human intervention. For more information on these please refer to
Similarly, we can use the EMC Cluster Enabler extension for Windows Server 2008 failover clusters. For more details on EMC Cluster Enabler and the recovery point solution, please refer to the following link. It does the same job for EMC CLARiiON/Symmetrix storage as HP CLX does for EVA/XP storage.
So, in a failover scenario, the resources will move to the other site and start using the storage at the disaster recovery site, and the LUN replication direction will be reversed by the cluster extension without any manual intervention. The file share witness quorum model helps retain cluster quorum [vote majority] in split-brain scenarios, and resources will remain highly available even if network communication breaks between the two sites [as long as one of the nodes can access the file share witness in the third site]. I hope this talk has given you an insight into the high-level design of multi-site clusters based on Microsoft failover clustering and what needs to be considered in the design process. Thanks for your time, and stay tuned to the blog for more interesting upcoming topics.

