Saturday, March 20, 2010

Storage Architecture Changes for Failover Clustering 2008/R2: How Persistent Reservation Works

Last month I decided to do a failover cluster blog series, and here is the first post. In it I will try to throw some light on the storage architecture changes and requirements of failover clustering. As we are aware, only storage that supports SCSI-3 persistent reservations is supported in failover clustering. Parallel SCSI based storage is being deprecated and won't be supported. There is a nice blog that explains why this might have been done. The good thing is that, because of this change, we no longer use SCSI bus resets, which can be disruptive on a SAN. In this article we will see what happens in the storage stack and how persistent reservation works, first for a physical disk resource and later for a cluster shared volume physical disk resource.

Fig 1

We will take the case of a physical disk resource first and the cluster shared volume later. Clusdisk.sys, which has been modified with the 2 functions seen in figure 1, issues persistent reservation [PR] requests to the class driver disk.sys. These flow down to MPIO.sys and then to the vendor based DSM.sys or the inbuilt MSdsm.sys [msdsm.sys is the inbuilt device specific module provided with 2008 R2 server]. The vendor based DSM driver is responsible for registering the PR on the storage object.

The storage object maintains a registration table that holds registration and reservation entries for all the paths available from all the nodes. We take the example of a 2 node cluster with dual HBAs, so every interface from both nodes has to register in the table, using a key that is unique to its node. The rule is that you cannot register and reserve at the same time, though there are some exceptions to that rule and we will discuss those later. The key is an 8 byte value unique to every cluster node. The low 32 bits of this key contain a mask that can be used to identify a persistent reservation that was placed specifically by a Microsoft failover cluster.

So, assuming node 1 HBA1 currently owns the physical disk resource, you will see both a registration and a reservation entry for the HBA1 interface. If HBA1 fails then, depending on the vendor DSM configuration, the storage stack automatically starts using HBA2 without any end user intervention or failover of the disk resource. Every 3 seconds node 1 comes back and checks the registration table. This is how nodes defend their reservations (the 3 second interval is configurable, and this setting might also help you in troubleshooting some issues).

During a split brain scenario, challenger node 2 comes and enters a registration entry in the table but, as per the rule, it cannot reserve at the same time. It needs to wait for at least 6 seconds before it comes back, enters its reservation entry and takes ownership of the storage object. Meanwhile defender node 1 comes back every 3 seconds, sees the registration entries from node 2 and scrubs them. Challenging node 2 comes back after 6 seconds but cannot add a reservation entry to the table, because its registration entries have already been scrubbed by node 1. This is the process of a successful defense.

Now there may be a legitimate case where both interfaces of node 1 are unable to access the storage, or node 1 has rebooted or blue screened. In that case node 2 should win the arbitration for the storage object. Let's see what happens. Node 1 HBA1 currently owns the physical disk resource, and you see both a registration and a reservation entry for the HBA1 interface. Node 1 blue screens, so this time when node 2 comes to register and reserve, it registers successfully and, on its revisit 6 seconds later, its registration is still in place, so it puts a reservation entry into the table and owns the storage object. This is how storage arbitration and SCSI-3 persistent reservation work. You can see the same process in detail during the cluster validation test "Validate SCSI-3 Persistent Reservation", as seen below in figure 2.

Fig 2

Validate SCSI-3 Persistent Reservation
Validate that storage supports the SCSI-3 Persistent Reservation commands.
Validating Cluster Disk 0 for Persistent Reservation support
Registering PR key for cluster disk 0 from node Node 1
Putting PR reserve on cluster disk 0 from node Node 1
Attempting to read PR on cluster disk 0 from node Node 1.
Attempting to preempt PR on cluster disk 0 from unregistered node Node 2. Expecting to fail
Registering PR key for cluster disk 0 from node Node 2
Putting PR reserve on cluster disk 0 from node Node 2
Unregistering PR key for cluster disk 0 from node Node 2
Trying to write to sector 11 on cluster disk 0 from node Node 1
Trying to read sector 11 on cluster disk 0 from node Node 1
Attempting to read drive layout of Cluster disk 0 from node Node 1 while the disk has PR on it
Trying to read sector 11 on cluster disk 0 from node Node 2
Attempting to read drive layout of Cluster disk 0 from node Node 2 while the disk has PR on it
Trying to write to sector 11 on cluster disk 0 from node Node 2
Registering PR key for cluster disk 0 from node Node 2
Trying to write to sector 11 on cluster disk 0 from node Node 2
Trying to read sector 11 on cluster disk 0 from node Node 2
Unregistering PR key for cluster disk 0 from node Node 2
Releasing PR reserve on cluster disk 0 from node Node 1
Attempting to read PR on cluster disk 0 from node Node 1.
Unregistering PR key for cluster disk 0 from node Node 1
Registering PR key for cluster disk 0 from node Node 2
Putting PR reserve on cluster disk 0 from node Node 2
Attempting to read PR on cluster disk 0 from node Node 2.
Attempting to preempt PR on cluster disk 0 from unregistered node Node 1. Expecting to fail
Registering PR key for cluster disk 0 from node Node 1
Putting PR reserve on cluster disk 0 from node Node 1
Unregistering PR key for cluster disk 0 from node Node 1
Trying to write to sector 11 on cluster disk 0 from node Node 2
Trying to read sector 11 on cluster disk 0 from node Node 1
Attempting to read drive layout of Cluster disk 0 from node Node 1 while the disk has PR on it
Trying to read sector 11 on cluster disk 0 from node Node 2
Attempting to read drive layout of Cluster disk 0 from node Node 2 while the disk has PR on it
Trying to write to sector 11 on cluster disk 0 from node Node 1
Registering PR key for cluster disk 0 from node Node 1
Trying to write to sector 11 on cluster disk 0 from node Node 1
Trying to read sector 11 on cluster disk 0 from node Node 1
Unregistering PR key for cluster disk 0 from node Node 1
Releasing PR reserve on cluster disk 0 from node Node 2
Attempting to read PR on cluster disk 0 from node Node 2.
Unregistering PR key for cluster disk 0 from node Node 2
Cluster Disk 0 supports Persistent Reservation
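
To make the challenge and defense timing from the arbitration walkthrough concrete, here is a minimal Python sketch of it. This is only a toy model of my description, not the actual clusdisk.sys logic: the names PRTable and arbitrate, the string keys "node1" and "node2", and the one-second time steps are all assumptions made for illustration. The 3 second and 6 second values come from the text above.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

DEFEND_INTERVAL = 3  # seconds between the owner's checks (value from the article)
CHALLENGE_WAIT = 6   # seconds a challenger waits before trying to reserve


@dataclass
class PRTable:
    """Toy model of the registration/reservation table on the storage object."""
    registrations: Set[str] = field(default_factory=set)
    holder: Optional[str] = None  # key that currently holds the reservation

    def register(self, key: str) -> None:
        self.registrations.add(key)

    def reserve(self, key: str) -> bool:
        # Only a key that is still registered may take the reservation.
        # (A real target would need a preempt here; the toy model folds
        # that into a single step.)
        if key not in self.registrations:
            return False
        self.holder = key
        return True

    def scrub_others(self, owner_key: str) -> None:
        # The defender removes every registration that is not its own.
        self.registrations &= {owner_key}


def arbitrate(defender_alive: bool) -> Optional[str]:
    """Run one challenge and return the key that owns the disk afterwards."""
    table = PRTable()
    table.register("node1")
    table.reserve("node1")       # node 1 starts as the owner

    table.register("node2")      # t = 0: the challenger registers only
    for t in range(1, CHALLENGE_WAIT + 1):
        if defender_alive and t % DEFEND_INTERVAL == 0:
            table.scrub_others("node1")  # the defender's periodic check
    table.reserve("node2")       # t = 6: the challenger returns and tries to reserve
    return table.holder


if __name__ == "__main__":
    print("defender alive:", arbitrate(defender_alive=True))
    print("defender down: ", arbitrate(defender_alive=False))
```

With a live defender the challenger's registration is gone by the time it returns, so its reserve attempt fails. With the defender down nothing scrubs the table, and the challenger takes ownership.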

But there is an exception. Remember: [The rule is that you cannot register and reserve at the same time, though there are some exceptions to that rule and we will discuss those later.] OK, let us see where we use this exception. The failover cluster team brought various changes to the storage architecture, and one is that we do not arbitrate the same way as we used to for MSCS clustering. We now never arbitrate for physical disk resources during a controlled manual movement of resources across nodes [manual move group process or quick migration], though we always arbitrate for witness disk resources. For other disk resources we arbitrate only in failure conditions. For controlled moves we use a fast path algorithm. Why do we do this? Because it was a requirement for effectively supporting Hyper-V virtual machines as resources in failover clustering. When we move a virtual machine resource from node 1 to node 2, the storage object containing the .VHD file needs to move quickly and cannot wait 6 seconds for arbitration to take place [in case of a manual move group process or quick migration]. The fast path algorithm solves this for us. With it, a physical disk resource in use by a virtual machine group can move across nodes in approximately 1 second or less.

So let's see what happens when I manually move a physical disk resource that is currently owned by node 1, i.e. node 1 has entries in the registration table for the storage object. Node 2 wipes the registration table, scrubbing all the entries from all the paths from all nodes. It then places a new registration and, within milliseconds of successful registration, puts a reservation entry in the storage object's PR table, which means that it now owns the physical disk resource and can bring it online. In this scenario node 2 does not wait 6 seconds for node 1, and the complete operation takes less than approximately 1 second. This is what happens in case of quick migration and when you manually move a group containing a physical disk resource. However, as said earlier, we arbitrate for witness disk resources in all circumstances.
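
The fast path steps in the paragraph above (wipe the table, register, reserve straight away) can be sketched in a few lines of Python. Again this is a toy model under my own assumptions: PRTable, fast_path_move and the path names such as "node2-hba1" are hypothetical and do not correspond to any real cluster API.

```python
from typing import Dict, List, Optional


class PRTable:
    """Toy model of the PR table on the storage object (not a real API)."""

    def __init__(self) -> None:
        self.registrations: Dict[str, str] = {}  # path name -> node key
        self.holder: Optional[str] = None        # node key holding the reservation

    def register(self, path: str, key: str) -> None:
        self.registrations[path] = key

    def reserve(self, key: str) -> bool:
        # Only a node with at least one registered path may reserve.
        if key not in self.registrations.values():
            return False
        self.holder = key
        return True

    def clear(self) -> None:
        # Scrub every registration from every path and drop the reservation.
        self.registrations.clear()
        self.holder = None


def fast_path_move(table: PRTable, new_key: str, paths: List[str]) -> bool:
    """Controlled move: no 6 second wait, the new owner reserves immediately."""
    table.clear()
    for path in paths:
        table.register(path, new_key)
    return table.reserve(new_key)


if __name__ == "__main__":
    table = PRTable()
    table.register("node1-hba1", "node1")
    table.register("node1-hba2", "node1")
    table.reserve("node1")
    fast_path_move(table, "node2", ["node2-hba1", "node2-hba2"])
    print(table.holder, sorted(table.registrations))
```

The point of the sketch is the missing wait: unlike arbitration in a failure condition, nothing here pauses between registering and reserving.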





We all know that we can put a physical disk resource in maintenance mode to run chkdsk [for exclusive access] or other maintenance operations. So what happens when we put a physical disk resource in maintenance mode? All the non-owner cluster nodes [which have their entries in the registration table] lose access to the storage object, as the disk resource is fenced from all of them, and then the owner cluster node also removes its persistent reservation from the storage object's PR table. In other words, we temporarily make this storage object a non-clustered storage object, enabling chkdsk or other maintenance operations to run on it.

Now let's jump to cluster shared volumes in failover clustering 2008 R2. Cluster shared volumes [CSV] is a feature of Windows Server 2008 R2 clustering that allows the different nodes of a cluster to have concurrent access to the LUN where a highly available virtual machine's VHD is stored. CSV allows multiple VHDs per LUN and removes the traditional one VM per LUN issue. Earlier, clustered virtual machines could only fail over independently if each virtual machine had its own LUN, which made the management of LUNs and clustered virtual machines more difficult. As of now the CSV feature is supported for use with the Windows Server 2008 R2 Hyper-V role only. Obviously no arbitration happens for a CSV physical disk resource, so how do all the nodes access the storage object at the same time? Again we have the same concept of a PR table on the storage object, and we take the example of a 2 node cluster. Node 1 is the coordinator node in this example, which means that node 1 owns the CSV resource. Node 1 has a unique key in the reservation table, which grants it access as the owner node, and all NTFS metadata writes go through this node. So if the NTFS metadata of this CSV physical disk resource needs to be modified from node 2, the request is passed to node 1, and node 1 takes care of it. However, node 2 can still read from and write to the CSV physical disk resource. The difference here is that the non-coordinator nodes always have their read/write keys in the registration table, while the coordinator node has a unique key in the reservation table of the storage object.
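
The split between ordinary read/write and NTFS metadata writes on a CSV can be illustrated with a small Python sketch. This models only the routing idea described above and is not how the CSV filter driver is implemented. The class CsvVolume, its methods and the log strings are hypothetical names I made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CsvVolume:
    """Toy model of I/O routing on a cluster shared volume."""
    coordinator: str                      # node that owns the CSV resource
    log: List[str] = field(default_factory=list)

    def data_io(self, node: str, op: str) -> None:
        # Any node may read from or write to the volume.
        self.log.append(f"{node}: {op}")

    def metadata_write(self, node: str, change: str) -> None:
        # Metadata changes are always applied by the coordinator node;
        # other nodes pass the request to it first.
        if node != self.coordinator:
            self.log.append(f"{node}: pass '{change}' to {self.coordinator}")
        self.log.append(f"{self.coordinator}: apply '{change}'")


if __name__ == "__main__":
    vol = CsvVolume(coordinator="node1")
    vol.data_io("node2", "write")
    vol.metadata_write("node2", "extend file")
    for line in vol.log:
        print(line)
```

In this model a data write from node 2 is logged against node 2 itself, while a metadata change from node 2 shows up as a hand-off followed by the coordinator applying it.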

So how can I remove a PR manually? cluster.exe node [node-name] /CLEAR[PR]:device-number is the command to clear a persistent reservation manually. You might have to use this command while troubleshooting PR issues. You can also change the disk arbitration interval from 3 seconds to a custom value, but remember that modifying this resource private property has repercussions and needs proper testing with the storage vendor. Also remember that this setting comes into the picture only in failure conditions and not for manually controlled operations.

I hope this article has given you an insight into how persistent reservations work in failover clustering for physical disk resources and cluster shared volume disk resources, and that with this understanding you can do more effective troubleshooting and planning of storage related clustering issues. Thanks for your time; I hope this is helpful.

GAURAV ANAND

The information is provided "AS IS" and is based on my personal understanding of the technology.


5 comments:

  1. Great post! I really appreciate your effort. We are really missing multipathing knowledge and articles around to better understand planning of storage and combining with clustering and CSV. I was very pleased with your first article about mpio and you are getting even better ;) THANK YOU VERY MUCH!

  2. Really Amazing post for understanding Persistent Reservation

  3. Great! I was always wondering how does Persistent reservation work. Though I am working on Linux (mostly VCS) and Windows clustering is not in my experience but from storage & SCSI concept view point I found this post very useful.

  4. This is great explanation, could you please also give more insight on how IO fencing work in case of CSV, while network partition(split brain)

  5. Iordan Harizanov: Great article!
