File Share Witness and Datacenter Failback
This afternoon we ran across an issue with a fairly new Exchange 2010 Database Availability Group (DAG) made up of 3 nodes, all running SP1 with Update Rollup 3. The primary datacenter had 2 nodes with a local file share witness (FSW), while the 3rd node and the alternate file share witness were in a DR site. We had also recently performed a successful datacenter failover and failback test that went swimmingly, so everything was back up and running in the primary datacenter.
What we noticed today was that after the failback, the cluster quorum and file share witness settings were still configured as node and file share majority instead of reverting to node majority, which is what a 3-node DAG should be using. Exchange should only use node and file share majority when the DAG has an even number of members, so that the witness can act as a tie-breaker. Without reproducing this again, my best guess is a timing issue: when one of the primary datacenter nodes is added back and the DAG has an even number of members, the quorum model is switched to node and file share majority, but when the 3rd and final node rejoins, the quorum model is not switched back. This leaves us with a node and file share majority that uses the alternate FSW in the DR site as its witness.
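To make the parity rule and the suspected timing gap concrete, here is a small Python model. It is a hypothetical simulation, not Exchange's actual code: the class and function names are made up for illustration, the starting state of a single DR member is an assumption, and the `reevaluate=False` flag simply stands in for the quorum re-evaluation that appears to be skipped when the last node rejoins.

```python
from enum import Enum


class QuorumModel(Enum):
    NODE_MAJORITY = "Node Majority"
    NODE_AND_FILE_SHARE_MAJORITY = "Node and File Share Majority"


def expected_model(member_count: int) -> QuorumModel:
    """Parity rule: an odd member count uses node majority; an even count adds the FSW as a tie-breaker."""
    if member_count < 1:
        raise ValueError("a DAG needs at least one member")
    if member_count % 2 == 1:
        return QuorumModel.NODE_MAJORITY
    return QuorumModel.NODE_AND_FILE_SHARE_MAJORITY


def votes_needed(member_count: int) -> int:
    """Votes required to keep quorum: a strict majority of all voters."""
    voters = member_count
    if expected_model(member_count) is QuorumModel.NODE_AND_FILE_SHARE_MAJORITY:
        voters += 1  # the file share witness casts one extra vote
    return voters // 2 + 1


class SimulatedDag:
    """Hypothetical model of the suspected timing issue (NOT Exchange's implementation)."""

    def __init__(self, members: int) -> None:
        self.members = members
        self.model = expected_model(members)

    def add_member(self, reevaluate: bool = True) -> None:
        # A node rejoins; normally the quorum model is recomputed from the new count.
        self.members += 1
        if reevaluate:
            self.model = expected_model(self.members)

    def set_dag(self) -> None:
        # Stand-in for running Set-DatabaseAvailabilityGroup: re-apply the parity rule.
        self.model = expected_model(self.members)

    def is_correct(self) -> bool:
        return self.model is expected_model(self.members)


if __name__ == "__main__":
    # Assumed failback sequence: only the DR node remains, then both primary nodes rejoin.
    dag = SimulatedDag(members=1)
    dag.add_member()                  # 2 members: switches to node and file share majority
    dag.add_member(reevaluate=False)  # 3 members, but the suspected re-evaluation is missed
    print(dag.members, dag.model.value, dag.is_correct())
    dag.set_dag()
    print(dag.members, dag.model.value, dag.is_correct())
```

The vote math also shows why either model is safe with 3 voters: 3 nodes under node majority and 2 nodes plus the FSW under node and file share majority both need 2 votes to stay up. The problem is only that, after failback, the witness counted is the one in the DR site.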
If you open the Cluster MMC, you can see that our DAG is still operating under the node and file share majority model even though all 3 nodes are online.
The fix for the issue is really easy: just run the Set-DatabaseAvailabilityGroup cmdlet with no parameters other than the DAG's identity. This does not take the databases or the cluster offline, but you'll see the DAG detect that it is using the wrong quorum model for an odd number of nodes and adjust itself accordingly.
After the change, you can verify in the Cluster MMC that the quorum setting has been corrected to node majority.
I'm sure there's a rational reason behind this behavior, but I haven't nailed down why it happens yet. In the meantime, it's just one more step to add to your DR documentation!