Monday, April 26, 2010

How Cluster Shared Volumes Direct and Redirected IO Work

Cluster Shared Volumes [CSV] is a great feature offered with Windows Server 2008 R2 failover clustering. CSV gives the different nodes of a cluster concurrent access to the LUN where a highly available virtual machine's VHD is stored. So during a live migration, quick migration, manual move or failover, the LUN (which has been configured as a physical disk resource in the cluster) does not have to be dismounted and mounted again, as was the case up to Windows Server 2008. In Server 2008 R2 both nodes have simultaneous access to the LUN and can read and write on the LUN where the VHDs of multiple highly available virtual machines are placed. This allows multiple VHDs per LUN and removes the traditional one-VHD-per-LUN restriction. Earlier, clustered virtual machines could fail over independently only if each virtual machine had its own LUN, which made managing LUNs and clustered virtual machines more difficult. That limitation is gone with CSV.

From the last blog we know how persistent reservations work for Cluster Shared Volumes and how both nodes are able to read and write to the CSV disk resource at the same time. In this blog we will see how it works at the file system level. But before we do that, let's have a quick recap of Cluster Shared Volumes.

To understand Direct IO, we will take a two-node cluster scenario. Node A is the coordinator node, as shown in fig 1, and holds the CSV volume containing the VHD files of VM 1 and VM 2.

                                                             FIG 1

Node A sends read/write IO and metadata IO [file renames, attribute changes, new file creation etc.] directly to the volume, as shown by the red and yellow lines. At the same time I collected a perfmon trace with the new counters for Cluster Shared Volumes enabled. We can see in fig 1 that we have read/write IO along with metadata IO; however, as both VMs are running on the coordinator node, no IO request is passed to the SMB mini-redirector and none of the IO flows over the network. Once you move VM2 to node B, its metadata IO goes via the network [see fig 2]. The CSV redirector on node B hands the request to the SMB mini-redirector, which passes it from the client side to the server side, and the IO request flows to the CSV redirector on the coordinator, i.e. node A. The CSV redirector there passes the IO to ntfs.sys and returns the result to node B via the same channel. This is the main reason why you are asked to enable the SMB protocol on the CSV network.

                                                                  FIG 2
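The two figures can be summarised as a small routing model. This is only an illustrative sketch of the paths described above, not how the drivers are implemented: the `IoRequest` class, the `route` function and the hop labels are names made up for this example, and the `redirected_access` flag stands for the redirected access mode covered later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoRequest:
    node: str  # node that issues the IO, e.g. "A" or "B"
    kind: str  # "data" for read/write IO, "metadata" for renames, attribute changes, ...


def route(request: IoRequest, coordinator: str, redirected_access: bool = False) -> list[str]:
    """Return the hops an IO request takes in this simplified model."""
    if request.kind not in ("data", "metadata"):
        raise ValueError(f"unknown IO kind: {request.kind!r}")

    # Fig 1: the VM runs on the coordinator node, so nothing crosses the network.
    if request.node == coordinator:
        return [f"ntfs.sys on {coordinator}", "storage"]

    # Fig 2: metadata IO from a non-coordinator node is sent to the coordinator.
    # In redirected access mode the read/write IO takes the same path.
    if request.kind == "metadata" or redirected_access:
        return [
            f"CSV redirector on {request.node}",
            f"SMB mini-redirector on {request.node}",
            "network",
            f"CSV redirector on {coordinator}",
            f"ntfs.sys on {coordinator}",
            "storage",
        ]

    # Direct IO: read/write IO from a non-coordinator node goes straight to storage.
    return ["storage"]


if __name__ == "__main__":
    for req in (IoRequest("A", "metadata"), IoRequest("B", "data"), IoRequest("B", "metadata")):
        print(req, "->", " -> ".join(route(req, coordinator="A")))
```

Calling `route(IoRequest("B", "data"), "A", redirected_access=True)` shows the same request taking the network path once the volume is in redirected access mode.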

The Disk Control Manager [DCM] is responsible for managing the CSV resources and for the read/write policy set on non-coordinator nodes for access to CSV disk resources. In the cluster.log excerpt below you can see that [DCM] is responsible for starting csvfilter.sys and creating CSV disk resources.

From cluster.log

01255 00000724.00000388::2009/10/08-18:01:35.592 INFO [DCM] Cluster Shared Volume Root is C:\ClusterStorage
01257 00000724.00000388::2009/10/08-18:01:35.607 INFO [DCM] service/driver CSVFilter started
01258 00000724.00000388::2009/10/08-18:01:35.607 INFO [DCM] short name is C:\CLUSTE~1
01259 00000724.00000388::2009/10/08-18:01:35.607 INFO [DCM] Filter.CfsSetRootFolder (, RootFolder=\ClusterStorage\)
01260 00000724.00000388::2009/10/08-18:01:35.607 INFO [DCM] SetRoot message sent
01261 00000724.00000388::2009/10/08-18:01:35.607 INFO [DCM] Pnp CfsFilter Launching Filter Listener
01269 00000724.00000388::2009/10/08-18:01:35.607 INFO [DCM] db.CreateDcmDisk 'SR' 8e62ea00-9763-4c21-86d1-4a05708be24a ----- SR is the name of my CSV disk resource
02000 00000724.00000928::2009/10/08-18:01:41.957 INFO [DCM] FsFilterCanUseDirectIO is called for \\?\Volume{f98a57c7-9ec6-11de-ad77-806e6f6e6963}\
02001 00000724.00000928::2009/10/08-18:01:41.957 INFO [DCM] PostOnline. CanUseDirectIO for Volume1 => true
02008 00000724.00000928::2009/10/08-18:01:41.957 INFO [DCM] ClearVolumeStates: resource 'SR' states
02011 00000724.00000928::2009/10/08-18:01:41.957 INFO [DCM] Reservation.SetMembership(SR,(1 2))
02019 00000724.000008b4::2009/10/08-18:01:41.972 INFO [DCM] volume 'Volume1' is already paused
02020 00000724.000008b4::2009/10/08-18:01:41.972 INFO [DCM] CreateLink C:\ClusterStorage\Volume1 => \\?\Volume{f98a57c7-9ec6-11de-ad77-806e6f6e6963}\
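If you want to pull the [DCM] entries out of a larger cluster.log, a short script can do it. The sketch below assumes every line follows the layout seen in the excerpt above (an optional leading line number, two hexadecimal IDs which I take to be process and thread ID, a timestamp, a level, a component in square brackets, then the message). The function names are made up for this example, and the `=> false` case is an assumption, since the excerpt only shows `=> true`.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above:
# [line number] <pid>.<tid>::<timestamp> <LEVEL> [<component>] <message>
LINE_RE = re.compile(
    r"^(?:\d+\s+)?"
    r"(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w+)\s+\[(?P<component>\w+)\]\s+(?P<message>.*)$"
)

# Message text taken from the excerpt; "false" is assumed to be the other value.
DIRECT_IO_RE = re.compile(r"CanUseDirectIO for (\S+) => (true|false)")


def parse_line(line):
    """Parse one cluster.log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return record


def direct_io_status(lines):
    """Map each volume name to the last CanUseDirectIO value logged by [DCM]."""
    status = {}
    for line in lines:
        record = parse_line(line)
        if record is None or record["component"] != "DCM":
            continue
        match = DIRECT_IO_RE.search(record["message"])
        if match:
            status[match.group(1)] = match.group(2) == "true"
    return status


if __name__ == "__main__":
    sample = [
        "01257 00000724.00000388::2009/10/08-18:01:35.607 INFO [DCM] service/driver CSVFilter started",
        "02001 00000724.00000928::2009/10/08-18:01:41.957 INFO [DCM] PostOnline. CanUseDirectIO for Volume1 => true",
    ]
    print(direct_io_status(sample))
```

Lines that do not match the assumed layout are skipped rather than raising an error, so the script can be pointed at a full log that also contains entries from other components.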

The interesting part is that when a CSV resource is created, it is put in its own group, which has a unique GUID instead of a name. When you run the “cluster.exe group” command, these GUID-based CSV groups are not displayed, because Microsoft does not want users to mess with them. These groups also have a lot of limitations; for example, you cannot add any resources to them. However, running “cluster.exe res” will show you the CSV group’s GUID name.

As we know, if you are accessing a CSV disk resource in “Redirected access” mode, you are bound to take a performance hit, as all the read/write and metadata IO from all non-coordinator nodes flows over the network instead of going directly to the storage. You can measure the performance impact by capturing a perfmon baseline during direct IO and comparing it with a capture taken during redirected IO.
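As a rough sketch of that comparison, the snippet below computes the percentage change of each counter between the two captures. The counter names and the numbers are placeholders invented for this example, not real perfmon counter names or measured results; substitute the averages from your own captures.

```python
def percent_change(baseline, measured):
    """Percentage change of `measured` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (measured - baseline) / baseline * 100.0


def compare_captures(direct, redirected):
    """Compare counters present in both captures; returns name -> percent change."""
    return {
        name: percent_change(direct[name], redirected[name])
        for name in direct
        if name in redirected
    }


if __name__ == "__main__":
    # Placeholder values only, to show the arithmetic.
    direct_io = {"throughput_mb_per_sec": 200.0, "avg_latency_ms": 4.0}
    redirected_io = {"throughput_mb_per_sec": 150.0, "avg_latency_ms": 10.0}
    for name, change in compare_captures(direct_io, redirected_io).items():
        print(f"{name}: {change:+.1f}%")
```

A negative change in a throughput counter or a positive change in a latency counter both point to the cost of sending the IO over the network.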

In Redirected IO, we leverage the SMB mini-redirector and csvfilter.sys and send all the read/write IO along with NTFS metadata over the network instead of the storage path. CSV thereby increases fault tolerance: even if the storage path to the LUN fails on a particular node, that node keeps accessing the LUN via the coordinator node and redirects all IO over the network as seen above [though with performance constraints]. We will see more details in the next part of this series on Cluster Shared Volumes. We will also look at the options and best practices for backing up VHD files on Cluster Shared Volumes. Hope you found this information interesting and liked today's blog. Thanks for your time and goodbye till the next blog.


  1. Could anyone explain to me: if I have my CSV working in redirected mode and a VM running on it, is that normal, or can I switch it to online mode?

  2. Hi Никита,

    That's normal but not a good practice. This is because in redirected mode all the IO traffic goes via the network, and this will hit performance.

  3. What about backups? How can I take a snapshot of a host if it's running in redirected mode?

  4. Very descriptive and informative blog.

  5. I have a doubt: for two nodes, if any node gets disturbed, then redirected access mode will be enabled and the LUN will be accessed through a file share. If I have four nodes and one node is disturbed, what will be the status of the LUN? Will the CSV go into redirected access mode for all the nodes? If yes, then due to one connection problem all the servers will access the coordinator node via the file share and performance will be degraded, right?