Tuesday, March 30, 2010

How configuring storage with Volume GUIDs works in failover clustering

There are a couple of new features and architectural changes in failover clustering, and in this post I am going to talk about the new functionality of using Volume GUIDs instead of drive letters. It is recommended that you apply the following hotfix to get this increased functionality.

951308 Increased functionality and virtual machine control in the Windows Server 2008 Failover Cluster Management console for the Hyper-V role [This is not required for Windows Server 2008 R2]

http://support.microsoft.com/default.aspx?scid=kb;EN-US;951308

Let’s dig in and see how this works behind the scenes. I have a 2-node Hyper-V cluster with Node and File Share Majority as the quorum model. To investigate this GUID behavior I presented a new storage disk to my cluster and noted how it appeared in Disk Management, the Failover Cluster Management console, the mounted devices registry key (HKEY_LOCAL_MACHINE\SYSTEM\MountedDevices) and the mountvol.exe output. Node 1 is the current owner of this disk. On node 1, the newly presented disk shows up as disk 4 in Disk Management with drive letter H:\, but in the Failover Cluster Management console it appears as cluster disk 5. There is no correlation between the two numbers, so do not let that confuse you. (See fig 1)


On node 2 the disk got the next available drive letter, which is F:\, and its own unique Volume GUID. So the two nodes have different local Volume GUIDs for the same disk. Everything is as expected so far: each machine assigns its own GUID to the disk, and node 2 simply picked the next available drive letter (see fig 2).

FIG 1


                                                                 FIG 2
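
If you want to follow along on your own cluster, the same information can be read from the command line on each node. The volume GUID below is a made-up value for illustration only; your output will show your own volume names and mount points:

    C:\>mountvol
        \\?\Volume{1b6e2c34-0000-0000-0000-100000000000}\
            H:\

    C:\>reg query HKLM\SYSTEM\MountedDevices

Running mountvol with no arguments lists every volume name in the \\?\Volume{GUID}\ form along with its current mount points, and the reg query dumps the MountedDevices key so you can compare the \DosDevices\ and \??\Volume{GUID} entries between node 1 and node 2.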

Now I went ahead and removed the drive letter so that we can use this disk via its Volume GUID (see fig 3).

FIG 3

This is what I observed in the Failover Cluster Management console (see fig 4): it now displays a Cluster Volume GUID for the disk, which is the same as the disk’s local Volume GUID on node 1 (remember, node 1 owns the disk). On node 1 the mountvol.exe output now shows a new entry marked “NO MOUNT POINTS” which was not there earlier, confirming that this disk is no longer using a drive letter. On node 2 there is no change in either the mountvol.exe output or the mounted devices registry key, as seen in fig 5.
FIG 5
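
If you prefer the command line, the drive letter can also be removed with mountvol; the GUID shown here is again only an illustrative value:

    C:\>mountvol H:\ /D

    C:\>mountvol
        \\?\Volume{1b6e2c34-0000-0000-0000-100000000000}\
            *** NO MOUNT POINTS ***

mountvol H:\ /D deletes the H:\ mount point without touching the data on the volume, and re-running mountvol shows the volume listed with *** NO MOUNT POINTS ***, matching the “NO MOUNT POINTS” entry described above.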

Now I opened the Hyper-V console on node 1 and created a virtual machine named “Guid behavior”, as shown in fig 6. Here we provide the storage path using the Volume GUID. Since I am going to make this virtual machine highly available later on, I will use the Volume GUID displayed in the Failover Cluster Management console for cluster disk 5, which is the recommended way of creating a highly available VM. (We will see why this is the recommended way later in the blog.)


FIG 6
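
For reference, a path that uses a Volume GUID instead of a drive letter has the form below. The GUID and folder names here are made up for illustration; the real value to use is the one shown for the cluster disk in the Failover Cluster Management console:

    \\?\Volume{1b6e2c34-0000-0000-0000-100000000000}\Guid behavior\Guid behavior.vhd

The whole \\?\Volume{GUID}\ prefix simply takes the place of a drive letter such as H:\ in the virtual machine and VHD paths.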

This raises a question: the Volume GUID displayed in the Failover Cluster Management console is the same as the local Volume GUID on node 1, but node 2 has no reference to this Volume GUID at all. So what will happen when this highly available VM moves or fails over to node 2? How will it come online if node 2 knows nothing about the Volume GUID shown in the Failover Cluster Management console? Let’s see what happens.

I then made the virtual machine highly available, and this is what I observe on node 1 and in the Failover Cluster Management console (see fig 7); it is all as expected.
 
FIG 7

Now let’s look at node 2. We still see exactly the same information (see fig 8) in the mounted devices registry key and the mountvol.exe output as before, so no changes yet. Next, let’s move this VM from node 1 to node 2 and see what happens. In the mounted devices registry key on node 2 we now see two GUIDs for the disk with signature 6fd2cfff: one is the local Volume GUID that was originally present on node 2 before the move, and the other is the Cluster Volume GUID (replicated by the cluster), which is the GUID displayed in the Failover Cluster Management console and is also the local Volume GUID of the disk with signature 6fd2cfff on node 1 (see fig 9).

In other words, whichever node owned the disk before it was made highly available replicates its Volume GUID to all the remaining nodes; the cluster then uses that Volume GUID, and it is known as the Cluster Volume GUID. Each node still keeps its own local Volume GUID alongside the Cluster Volume GUID. You can easily tell the local Volume GUID and the Cluster Volume GUID apart by looking at the mountvol.exe output: the Cluster Volume GUID does not show up in the mountvol.exe output on node 2, but it does show up on node 1.

FIG 8


FIG 9
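
A quick way to tell the two apart is to run mountvol and dump the MountedDevices key on node 2 and compare the output:

    C:\>mountvol
        (the Cluster Volume GUID shown in the console is not listed here)

    C:\>reg query HKLM\SYSTEM\MountedDevices
        (two \??\Volume{...} entries now exist for the disk with signature 6fd2cfff:
         the local Volume GUID and the replicated Cluster Volume GUID)

The GUID that appears in MountedDevices on node 2 but not in its mountvol output is the Cluster Volume GUID; on node 1 the same GUID shows up in both places, because there it is also the local Volume GUID.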

That covers the main discussion: we now know how to use Volume GUIDs instead of drive letters when configuring storage, and how the cluster replicates these Volume GUIDs. There is one more word of caution I would like to share. A virtual machine can be placed either via Hyper-V or via System Center Virtual Machine Manager (SCVMM). Using SCVMM, I placed the newly created virtual machine “Guid behavior” on node 1. While creating the virtual machine we have to provide storage for the .vhd file, and as you can see in fig 10, SCVMM picks up the Volume GUID information automatically. There are scenarios where it may pick the local Volume GUID instead of the Cluster Volume GUID, which is why we recommend that if you are creating a highly available VM, you confirm that the GUID used is the one displayed in the Failover Cluster Management console.


                                                                FIG 10
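
A simple sanity check (the GUID values below are made up for illustration) is to compare the {GUID} portion of the .vhd path in the VM settings with the Volume GUID shown for the cluster disk in the Failover Cluster Management console:

    VHD path in the VM settings:       \\?\Volume{1b6e2c34-...}\Guid behavior\Guid behavior.vhd
    Cluster disk 5 in the console:     \\?\Volume{1b6e2c34-...}\

If the two GUIDs do not match, the VM was configured with a local Volume GUID rather than the Cluster Volume GUID and should be corrected before you rely on failover.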

This ends our discussion for today, and I hope you enjoyed reading the blog. Please come back to this blog where we will keep sharing information, and thanks for your valuable time. There is already a very nice post written by Chuck Timon on this topic, and I strongly advise you to read it too.

Configuring Storage Using Volume GUIDs in Hyper-V

http://blogs.technet.com/askcore/archive/2008/10/29/configuring-storage-using-volume-guids-in-hyper-v.aspx

GAURAV ANAND

Saturday, March 20, 2010

Storage architecture changes for failover clustering 2008/R2 - how Persistent Reservation works

Last month I decided to do a failover cluster blog series, and here is the first post. In it I will try to shed some light on the storage architecture changes and requirements of failover clustering. As we are aware, only storage that supports SCSI-3 persistent reservations is supported in failover clustering; parallel SCSI based storage has been deprecated and is not supported. There is a nice blog post that explains why this change was made. The good thing is that, because of this change, we no longer use SCSI bus resets, which can be disruptive on a SAN. In this article we will see what happens in the storage stack and how persistent reservation works for a physical disk resource, and later for a cluster shared volume physical disk resource.

                                                                                 Fig 1

We will take the case of a physical disk resource first and look at Cluster Shared Volumes later. Clusdisk.sys, which has been modified with the two functions shown in figure 1, issues the persistent reservation (PR) to the class driver disk.sys, and the request flows down through MPIO.sys to a vendor DSM or the inbuilt msdsm.sys (msdsm.sys is the device-specific module that ships with Windows Server 2008 R2). The DSM driver is responsible for registering the PR on the storage object. The storage object maintains a registration table that contains registration and reservation entries for every path available from every node.

Take the example of a 2-node cluster where each node has dual HBAs: every interface from both nodes has to register in the table, using a key that is unique to each node. The rule is that you cannot register and reserve at the same time, although there are exceptions to that rule which we will discuss later. The key is an 8-byte key unique to every cluster node; the low 32 bits of the key contain a mask that identifies a persistent reservation placed specifically by a Microsoft failover cluster. Assuming node 1’s HBA1 currently owns the physical disk resource, you will see both a registration entry and a reservation entry for the HBA1 interface. If HBA1 fails, then depending on the vendor DSM configuration the storage stack will automatically start using HBA2 without any end-user intervention and without a failover of the disk resource.

Every 3 seconds node 1 comes back and checks the registration table; this is how nodes defend their reservations (the 3-second interval is configurable, and tuning it can help in troubleshooting some issues). During a split-brain scenario, the challenger, node 2, places a registration entry in the table, but per the rule above it cannot register and reserve at the same time. It has to wait at least 6 seconds before it comes back to place its reservation entry and take ownership of the storage object. Meanwhile the defender, node 1, returns every 3 seconds, sees the registration entries from node 2 and scrubs them. When node 2 comes back after 6 seconds it cannot add its reservation entry, because its registration entries have already been scrubbed by node 1. That is a successful defense.

Now there may be a legitimate case where both of node 1’s interfaces are unable to access the storage, or node 1 has rebooted or blue-screened; in that case node 2 should win the arbitration for the storage object. Let’s see what happens then. Node 1’s HBA1 currently owns the physical disk resource, so you see both registration and reservation entries for the HBA1 interface. Node 1 blue-screens, so this time when node 2 comes to register it succeeds, and 6 seconds later, on its revisit, it places a reservation entry in the table and owns the storage object. This is how storage arbitration and SCSI-3 persistent reservation work. You can see the same process in detail during the cluster validation test “Validate SCSI-3 Persistent Reservation”, shown below in figure 2.

                                                                             Fig 2

Validate SCSI-3 Persistent Reservation
Validate that storage supports the SCSI-3 Persistent Reservation commands.
Validating Cluster Disk 0 for Persistent Reservation support
Registering PR key for cluster disk 0 from node Node 1
Putting PR reserve on cluster disk 0 from node Node 1
Attempting to read PR on cluster disk 0 from node Node 1.
Attempting to preempt PR on cluster disk 0 from unregistered node Node 2. Expecting to fail
Registering PR key for cluster disk 0 from node Node 2
Putting PR reserve on cluster disk 0 from node Node 2
Unregistering PR key for cluster disk 0 from node Node 2
Trying to write to sector 11 on cluster disk 0 from node Node 1
Trying to read sector 11 on cluster disk 0 from node Node 1
Attempting to read drive layout of Cluster disk 0 from node Node 1 while the disk has PR on it
Trying to read sector 11 on cluster disk 0 from node Node 2
Attempting to read drive layout of Cluster disk 0 from node Node 2 while the disk has PR on it
Trying to write to sector 11 on cluster disk 0 from node Node 2
Registering PR key for cluster disk 0 from node Node 2
Trying to write to sector 11 on cluster disk 0 from node Node 2
Trying to read sector 11 on cluster disk 0 from node Node 2
Unregistering PR key for cluster disk 0 from node Node 2
Releasing PR reserve on cluster disk 0 from node Node 1
Attempting to read PR on cluster disk 0 from node Node 1.
Unregistering PR key for cluster disk 0 from node Node 1
Registering PR key for cluster disk 0 from node Node 2
Putting PR reserve on cluster disk 0 from node Node 2
Attempting to read PR on cluster disk 0 from node Node 2.
Attempting to preempt PR on cluster disk 0 from unregistered node Node 1. Expecting to fail
Registering PR key for cluster disk 0 from node Node 1
Putting PR reserve on cluster disk 0 from node Node 1
Unregistering PR key for cluster disk 0 from node Node 1
Trying to write to sector 11 on cluster disk 0 from node Node 2
Trying to read sector 11 on cluster disk 0 from node Node 1
Attempting to read drive layout of Cluster disk 0 from node Node 1 while the disk has PR on it
Trying to read sector 11 on cluster disk 0 from node Node 2
Attempting to read drive layout of Cluster disk 0 from node Node 2 while the disk has PR on it
Trying to write to sector 11 on cluster disk 0 from node Node 1
Registering PR key for cluster disk 0 from node Node 1
Trying to write to sector 11 on cluster disk 0 from node Node 1
Trying to read sector 11 on cluster disk 0 from node Node 1
Unregistering PR key for cluster disk 0 from node Node 1
Releasing PR reserve on cluster disk 0 from node Node 2
Attempting to read PR on cluster disk 0 from node Node 2.
Unregistering PR key for cluster disk 0 from node Node 2
Cluster Disk 0 supports Persistent Reservation
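
To summarize the defense sequence described above, here is a rough timeline; the timings are illustrative and assume the default 3-second defense interval:

    T+0s   Node 1 holds a registration and the reservation; node 2 (the challenger) places a registration entry.
    T+3s   Node 1 revisits the table, sees node 2's registration and scrubs it.
    T+6s   Node 2 returns to place its reservation, finds its registration gone, and the defense succeeds.

If node 1 had failed in the meantime, its scrub at T+3s would never happen, so at T+6s node 2 would place its reservation and take ownership of the storage object.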

But there is an exception. Remember the rule: you cannot register and reserve at the same time, though there are exceptions to that rule which we said we would discuss later. Let’s see where this exception is used. The failover cluster team made various changes to the storage architecture, and one of them is that we do not arbitrate the same way we used to in MSCS clustering. We now never arbitrate for physical disk resources during a controlled, manual movement of resources across nodes (a manual move-group operation or a quick migration), though we always arbitrate for witness disk resources, and we arbitrate for other disk resources only in failure conditions. For the controlled case we use a fast path algorithm. Why? Because it was a requirement for effectively supporting Hyper-V virtual machines as resources in failover clustering: when we move a virtual machine resource from node 1 to node 2, the storage object containing the .VHD file needs to move quickly and cannot wait 6 seconds for arbitration to take place. Leveraging the fast path algorithm fixes this for us: a physical disk resource in use by a virtual machine group can move across nodes in roughly a second or less.

So let’s see what happens when I manually move a physical disk resource that is currently owned by node 1, i.e. node 1 has entries in the registration table for the storage object. Node 2 wipes the registration table, scrubbing all the entries from all paths on all nodes. It then places a new registration and, within milliseconds of a successful registration, puts a reservation entry in the storage object’s PR table, which means it now owns the physical disk resource and can bring it online. In this scenario node 2 does not wait 6 seconds for node 1, and the complete operation takes roughly less than a second. This is what happens during a quick migration and when you manually move a group containing a physical disk resource. However, as noted earlier, we always arbitrate for witness disk resources.

We all know that we can put a physical disk resource into maintenance mode to run chkdsk (which needs exclusive access) or other maintenance operations. So what happens when we put a physical disk resource into maintenance mode? All the non-owner cluster nodes (which have their entries in the registration table) lose access to the storage object because the disk resource is fenced off from them, and the owner node then also removes its persistent reservation from the storage object’s PR table. In other words, we temporarily turn this storage object into a non-clustered storage object so that chkdsk or other maintenance operations can run on it.
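
As an illustration, if the disk has no drive letter (as in the Volume GUID post above), you can put the disk resource into maintenance mode from the Failover Cluster Management console and then point chkdsk at the volume name directly; the GUID below is a made-up value:

    C:\>chkdsk /f \\?\Volume{1b6e2c34-0000-0000-0000-100000000000}

chkdsk accepts a volume name in the \\?\Volume{GUID} form in place of a drive letter, so the check can run even though the volume has no mount point. When it finishes, turn maintenance mode back off so the cluster resumes protecting the disk.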

Now let’s move on to Cluster Shared Volumes in failover clustering 2008 R2. Cluster Shared Volumes (CSV) is a feature of Windows Server 2008 R2 clustering that allows the different nodes of the cluster to have concurrent access to the LUN where a highly available virtual machine’s VHD is stored. CSV allows multiple VHDs per LUN and removes the traditional one-VM-per-LUN limitation. Previously, clustered virtual machines could only fail over independently if each virtual machine had its own LUN, which made managing LUNs and clustered virtual machines more difficult. As of now, the CSV feature is supported only with the Windows Server 2008 R2 Hyper-V role. Since there is clearly no arbitration for a CSV physical disk resource, how do all the nodes access the storage object at the same time? Again, we have the same concept of a PR table on the storage object; take the example of a 2-node cluster. Node 1 is the coordinator node in this example, which means node 1 owns the CSV resource. Node 1 has a unique key in the reservation table that grants it access as the owner node, and all NTFS metadata writes go through this node. So if NTFS metadata of this CSV physical disk resource needs to be modified from node 2, the request is passed to node 1 and node 1 takes care of it. However, node 2 can still read from and write to the CSV physical disk resource. The difference is that the non-coordinator nodes always keep a read/write key in the registration table, while the coordinator node holds a unique key in the reservation table of the storage object.

So how can I remove a PR manually? Cluster.exe node [node-name] /CLEAR[PR]:device-number is the command to clear a persistent reservation manually, and you may need it while troubleshooting PR issues. You can also change the disk arbitration interval from 3 seconds to a custom value, but remember that modifying this resource private property has repercussions and needs proper testing with your storage vendor, and keep in mind that this setting only comes into play in failure conditions, not during manually controlled operations.
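
For example, to clear the persistent reservation on the disk that shows up as disk 4 in Disk Management, from a node named Node2 (both values are illustrative; substitute your own node name and disk number), you would run:

    C:\>cluster.exe node Node2 /clearpr:4

Be careful with this command: clearing a reservation on a disk that a cluster node is actively defending can cause the disk resource to fail, so use it only as a troubleshooting step, for example to clean up a stale reservation left behind on a LUN.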

I hope this article has given you some insight into how persistent reservations work in failover clustering for physical disk resources and Cluster Shared Volume disk resources, and that with this understanding you can do more effective troubleshooting and planning of storage-related clustering issues. Thanks for your time, and I hope this is helpful.

GAURAV ANAND

The information is provided "AS IS" and is based on my personal understanding of the technology.