Tuesday, January 4, 2011

Cluster Shared Volume disks may not participate in validation tests even though manually taken offline

Happy New Year to everyone. Welcome to 2011; I hope you enjoyed Christmas and the winter vacation. This is the time when most people go on vacation or spend time with their near and dear ones. Sadly, some IT folks have to work even on these days, either because of project deadlines or critical issues. I was approached about one such issue recently: a cluster environment was not stable, and strangely, when the engineers ran the cluster validation test, it did not show any errors.

The golden rule is to start with the cluster validation test; it is the most basic troubleshooting step for most failover clustering issues. I asked the engineer about this and asked him to follow the steps below, in which one has to manually take the Cluster Shared Volumes [CSV] offline. [Yes, the cluster validation wizard gives a warning for non-CSV disk resources but not for CSV disk resources, so you need to take them offline manually before running the test; otherwise the CSV disk resources will not participate in the validation tests.]

  • In the Failover Cluster Manager snap-in, expand the console tree and click Cluster Shared Volumes. In the center pane, expand the listing for the volume that you are gathering information about. View the status of the volume.
  • Still in the center pane, to prepare for testing a disk in Cluster Shared Volumes, right-click the disk, click Take this resource offline, and then if prompted, confirm your choice. Repeat this action for any other disks that you want to test.
  • Right-click the cluster containing the Cluster Shared Volumes, and then click Validate This Cluster.
Strangely, when I looked at the validation report, I saw only one warning for the CSV disk resource. [In our case it is "SR", as seen below.]

"This resource is marked with a state of 'Offline'. The functionality that this resource provides is not available while it is in the offline state. The resource may be put in this state by an administrator or program. It may also be a newly created resource which has not been put in the online state or the resource may be dependent on a resource that is not online. Resources can be brought online by choosing the 'Bring this resource online' action in Failover Cluster Manager."




Because of this warning, the CSV disk does not participate in any test. Strangely, one gets this warning even though the CSV disk is offline [the way it should be] and its cluster validation eligibility status is true (as seen below).




The cluster validation tool itself gives a very subtle clue about why this happens, and I was quickly able to reproduce the problem and show why. The problem is that the CSV disk was in either Redirected Access mode or Maintenance mode when it was taken offline for the cluster validation test.


To fix the issue, bring the CSV disk online, take it out of Redirected Access mode by turning that mode off so the disk switches back to Direct I/O mode, then take the CSV disk offline again and run the validation test. This issue may confuse customers and engineers: they do not get any error in the validation test and may think that all is good and working, but in reality the CSV disks do not participate in most of the cluster storage validation tests, and you may end up seeing storage issues later on. I will try to bring this to the Microsoft product team's notice through the right channels so that the validation test can have better wording for this warning and for what needs to be done to fix the issue. Hey, wait: what was that subtle clue in the validation report?


The clue was that the volume consistency test reported that the CSV disk resource was not in the correct state. Once you make the suggested changes, you no longer get that message, and the dirty bit is checked on the CSV disk, as shown above. Microsoft says that a cluster is supported as long as it passes the validation test and has the logo, but in a tricky case like this, even though you do not get an error in the validation test, you are hardly in supported territory, because the storage validation tests were never performed on the CSV disk resource. I hope this information is useful and helps you if you ever hit the same issue, or at least in understanding that with the cluster validation test it is better to check every warning, what it really means, and its implications.
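
The fix described above can be written down as a small pre-validation checklist. The sketch below is plain Python that only models the decision; it does not talk to a cluster, and every name in it (`CsvState`, `plan_validation_prep`, the step strings) is hypothetical and made up for illustration. Turning off Maintenance mode as a separate step is my assumption; the actual actions are still performed in Failover Cluster Manager.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical step labels; they describe manual actions, not API calls.
BRING_ONLINE = "bring the CSV disk online"
REDIRECT_OFF = "turn off Redirected Access (back to Direct I/O)"
MAINT_OFF = "turn off Maintenance mode"  # assumption, not spelled out in the post
TAKE_OFFLINE = "take the CSV disk offline"
VALIDATE = "run cluster validation"


@dataclass
class CsvState:
    """Hypothetical snapshot of one CSV disk, filled in by hand."""
    name: str
    online: bool
    redirected_access: bool
    maintenance_mode: bool


def plan_validation_prep(csv: CsvState) -> List[str]:
    """Return the ordered manual steps needed before validating this disk."""
    steps: List[str] = []
    needs_reset = csv.redirected_access or csv.maintenance_mode

    # A disk taken offline while redirected or in maintenance must come
    # back online first so that the mode can be cleared.
    if needs_reset and not csv.online:
        steps.append(BRING_ONLINE)
    if csv.redirected_access:
        steps.append(REDIRECT_OFF)
    if csv.maintenance_mode:
        steps.append(MAINT_OFF)

    # The disk has to be offline (in a clean state) to be tested.
    if needs_reset or csv.online:
        steps.append(TAKE_OFFLINE)

    steps.append(VALIDATE)
    return steps


if __name__ == "__main__":
    # The situation from this post: "SR" was offline but still redirected.
    sr = CsvState("SR", online=False, redirected_access=True, maintenance_mode=False)
    for number, step in enumerate(plan_validation_prep(sr), start=1):
        print(f"{number}. {step}")
```

For the disk in this post (offline, but left in Redirected Access mode), the plan is the four steps described earlier: bring it online, turn off Redirected Access, take it offline again, and only then run validation.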


GAURAV ANAND


This blog is based on my personal understanding of the technologies mentioned above, and the information is provided AS IS.