I have a Failover Cluster running on two Server 2012 R2 Datacenter nodes hosting our Hyper-V environment. Recently, we have run into an issue where the VMs won’t migrate to the opposite node unless the VM is rebooted or the Saved State data is deleted.
The VMs are stored either on an SOFS share hosted by a separate FO Cluster or on a CSV volume that both nodes are connected to. The problem affects VMs in both storage locations.
The testing I’ve done is below (the PowerShell equivalents of these tests are sketched after the list). Note that I only list one direction, but the behavior is the same moving in the opposite direction as well:
- Live Migration: if a VM is on Node1 and I tell it to Live Migrate to Node2, it begins the process in the console and shows Node2 for a split second, then immediately flips back to Node1. If the VM has been rebooted since the last migration, it will go ahead and migrate to Node2, but it will not migrate back until the VM has been rebooted again. The Event Log shows IDs 1205 and 1069. 1069 states: “Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.” All resources show Online in PowerShell.
- Quick Migration: I initiate a Quick Migration and the VM will move from Node1 to Node2, but it fails to start on Node2. Checking the Event Log, I see Event IDs 1205 and 1069. 1069 states: “Cluster resource 'Virtual Machine IDF' of type 'Virtual Machine' in clustered role 'IDF' failed. The error code was '0xc0370027' ('Cannot restore this virtual machine because the saved state data cannot be read. Delete the saved state data and then try to start the virtual machine.').” After deleting the Saved State data, the VM starts right up and can be Live or Quick Migrated once.
- Shutdown VM and Quick Migration: I have not seen this method fail so far.
- Rebooting the Nodes has had no discernible effect on the situation.
- I’ve shut down a VM and moved its storage from the SOFS share to the CSV and still have the same issues as above. I moved the VHDX, the configuration file, and the Saved State data (which was empty while the VM was powered down) to the CSV.
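For reference, the PowerShell equivalents of the tests above look roughly like this. This is only a sketch: the role/VM name 'IDF', the node names, and the CSV path are placeholders from my environment, so adjust as needed.

    # Live Migration attempt (this is the one that flips back to the original node)
    Move-ClusterVirtualMachineRole -Name 'IDF' -Node 'Node2' -MigrationType Live

    # Quick Migration attempt (moves, but the VM then fails to start with 0xc0370027)
    Move-ClusterVirtualMachineRole -Name 'IDF' -Node 'Node2' -MigrationType Quick

    # Clear the Saved State when the VM refuses to start (run on the node that owns the VM)
    Get-VM -Name 'IDF' | Remove-VMSavedState

    # Confirm resource state after a failed Live Migration (everything shows Online)
    Get-ClusterResource | Format-Table Name, ResourceType, OwnerNode, State

    # Storage move from the SOFS share to the CSV, done while the VM was shut down
    Move-VMStorage -VMName 'IDF' -DestinationStoragePath 'C:\ClusterStorage\Volume1\IDF'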
Items from the FO Cluster Validation Report:
1. The following virtual machines have referenced paths that do not appear accessible to all nodes of the cluster. Ensure all storage paths in use by virtual machines are accessible by all nodes of the cluster.
Virtual Machines Storage Paths That Cannot Be Accessed By All Nodes:
    Virtual Machine: VM1
    Storage Path: \\sofs\vms
    Nodes That Cannot Access the Storage Path: Node1
I’m not sure what to make of this error, as most of the VMs live on this SOFS share and are running on Node1 and Node2. If Node1 really couldn’t access the share, none of the VMs would run on Node1. (A way to re-check access from each node is sketched after the report items below.)
2. Validating cluster resource File Share Witness (2) (\\sofs\HVQuorum).
This resource is configured to run in a separate monitor. By default, resources are configured to run in a shared monitor. This setting can be changed manually to keep it from affecting or being affected by other resources. It can also be set automatically
by the failover cluster. If a resource fails it will be restarted in a separate monitor to try to reduce the impact on other resources if it fails again. This value can be changed by opening the resource properties and selecting the 'Advanced Policies' tab.
There is a check-box 'run this resource in a separate Resource Monitor'.
I checked on this and the check-box is indeed unchecked and both Nodes report the same setting (or lack thereof).
3. Validating cluster resource Virtual Machine VM2.
Validating cluster resource Virtual Machine VM3.
Both VM resources show the same message as the File Share Witness entry above: the resource is configured to run in a separate monitor rather than the default shared monitor, and the report says the setting can be changed on the resource properties' 'Advanced Policies' tab via the 'run this resource in a separate Resource Monitor' check-box.
I can’t find a place to see this check-box for the VMs; the properties on the roles don’t contain the ‘Advanced Policies’ tab. (Checking the setting from PowerShell instead is sketched below the report items.)
All other portions of the Validation Report are clean.
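For what it’s worth, both of the flagged items can also be checked from PowerShell. Again just a sketch, with the share path, server name, node names, and resource names taken from my environment:

    # Run locally on each node (a console/RDP session on Node1, then Node2) to confirm the
    # SOFS share is reachable; testing through Invoke-Command can fail for Kerberos
    # double-hop reasons, so I would not trust a remoted Test-Path here.
    Test-Path '\\sofs\vms'
    Get-SmbConnection -ServerName 'sofs'    # shows the active SMB connections to the SOFS

    # Cluster-wide: list which resources are set to run in a separate resource monitor
    Get-ClusterResource | Format-Table Name, ResourceType, State, SeparateMonitor

    # Failover Cluster Manager shows no 'Advanced Policies' tab for the VM resources, but the
    # property itself appears to be settable directly (0 = shared monitor, 1 = separate):
    (Get-ClusterResource 'Virtual Machine VM2').SeparateMonitor = 0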
So far, I haven’t found any answers in several days of Google searching and trying different tactics. I’m hoping someone here has run into a similar situation and can help steer me in the right direction to get this resolved. The goal is to be able
to Live Migrate freely so I can reboot the Nodes one at a time for Microsoft Updates without having to bring down all the VMs in the process.
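For context, the patching workflow I’m ultimately trying to get working is just the usual drain-and-patch cycle, roughly (same placeholder node names as above):

    # Drain Node1: its VMs should Live Migrate to Node2
    Suspend-ClusterNode -Name 'Node1' -Drain -Wait

    # ...install Windows Updates on Node1 and reboot it...

    # Bring Node1 back into the cluster (and optionally fail roles back)
    Resume-ClusterNode -Name 'Node1' -Failback Immediate

    # Then repeat the same steps for Node2.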