Hi Folks,
I'm performing a live migration of HA roles (VMs) between two 2016 clusters, using SCVMM.
When I live migrate a virtual machine (including its storage) to a 2016 cluster, virtual machine storage resiliency powers off the VM soon after the migration completes successfully.
Error Message: "Virtual hard disk resiliency failed to recover the drive. The virtual machine will be powered off. Current status: Permanent Failure."
Event ID: 12630
I've captured and studied the logs below extensively and done a lot of research, all in vain.
HyperVHighAvailability-Admin
HyperVStorageVSP-Admin
HyperVWorker-Admin
Below are the critical events/errors I noticed most commonly in those logs (taken from the exact time the issue was reproduced).
HyperVHighAvailability-Admin:
21120 Information 'SCVMM <VMName> Configuration' successfully registered the configuration for the virtual machine.
21119 Information 'SCVMM <VMName>' successfully started the virtual machine.
HyperVStorageVSP-Admin:
Information 6 Storage device '\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx' received an IO failure with error = SRB_STATUS_ERROR_RECOVERY. Current device state = No Errors, New state = Recoverable Error Detected, Current status = No Errors.
Information 4 Storage device '\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx' received a recovery status notification. Current device state = Recoverable Error Detected, Last status = No Errors, New status = Disconnected.
Information 5 Storage device '\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx' changed recovery state. Previous state = Recoverable Error Detected, New state = Recoverable Error Detected.
Information 4 Storage device '\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx' received a recovery status notification. Current device state = Recoverable Error Detected, Last status = Disconnected, New status = Permanent Failure.
Information 5 Storage device '\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx' changed recovery state. Previous state = Recoverable Error Detected, New state = Unrecoverable Error.
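To make the StorageVSP sequence above easier to read, here is a minimal Python sketch that reduces those five messages to their state and status transitions. It is only an illustration: the regular expressions are written against the message wording in this excerpt, and the names `DEV`, `SAMPLE` and `transitions` are made up for the example, not part of any Hyper-V tooling.

```python
import re

# Patterns written against the message text pasted above (an assumption:
# other builds or locales may word these events differently).
IO_FAILURE = re.compile(
    r"error = (?P<err>\w+)\. Current device state = (?P<prev>[^,]+), "
    r"New state = (?P<new>[^,]+),"
)
STATUS_CHANGE = re.compile(
    r"Last status = (?P<last>[^,]+), New status = (?P<new>[^.]+)\."
)
STATE_CHANGE = re.compile(
    r"Previous state = (?P<prev>[^,]+), New state = (?P<new>[^.]+)\."
)

# Device path exactly as it appears (anonymized) in the post.
DEV = r"'\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx'"

SAMPLE = [
    "Information 6 Storage device " + DEV + " received an IO failure with error = SRB_STATUS_ERROR_RECOVERY. Current device state = No Errors, New state = Recoverable Error Detected, Current status = No Errors.",
    "Information 4 Storage device " + DEV + " received a recovery status notification. Current device state = Recoverable Error Detected, Last status = No Errors, New status = Disconnected.",
    "Information 5 Storage device " + DEV + " changed recovery state. Previous state = Recoverable Error Detected, New state = Recoverable Error Detected.",
    "Information 4 Storage device " + DEV + " received a recovery status notification. Current device state = Recoverable Error Detected, Last status = Disconnected, New status = Permanent Failure.",
    "Information 5 Storage device " + DEV + " changed recovery state. Previous state = Recoverable Error Detected, New state = Unrecoverable Error.",
]


def transitions(lines):
    """Return (kind, old, new) tuples in the order the events were logged."""
    out = []
    for line in lines:
        m = IO_FAILURE.search(line)
        if m:
            out.append(("io_failure:" + m["err"], m["prev"], m["new"]))
            continue
        m = STATUS_CHANGE.search(line)
        if m:
            out.append(("status", m["last"], m["new"]))
            continue
        m = STATE_CHANGE.search(line)
        if m:
            out.append(("state", m["prev"], m["new"]))
    return out


for kind, old, new in transitions(SAMPLE):
    print(f"{kind}: {old} -> {new}")
```

Read this way, the status goes No Errors -> Disconnected -> Permanent Failure, while the device state moves from Recoverable Error Detected to Unrecoverable Error.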
HyperVWorker-Admin:
Information 12597 '<VMName>' <VMName> (<GUID>) Connected to virtual network. (Virtual Machine ID <GUID>)
Information 12582 '<VMName>' <VMName> (<GUID>) started successfully. (Virtual Machine ID <GUID>)
Information 12635 '<VMName>': Virtual hard disk '\\FQDN\abc\<VMName>\Clean_disk_1.vhdx' received a resiliency status notification. Current status: Disconnected. (Virtual machine ID<GUID>)
Warning 12636 '<VMName>': Virtual hard disk '\\\\FQDN\abc\<VMName>\Clean_disk_1.vhdx' has detected a critical error. Current status: Disconnected. (Virtual machine ID<GUID>)
Information 12635 '<VMName>': Virtual hard disk '\\FQDN\abc\<VMName>\Clean_disk_1.vhdx' received a resiliency status notification. Current status: Permanent Failure. (Virtual machine ID<GUID>)
Error 12630 '<VMName>': Virtual hard disk resiliency failed to recover the drive '\\\\FQDN\abc\<VMName>\Clean_disk_1.vhdx'. The virtual machine will be powered off. Current status: Permanent Failure. (Virtual machine ID <GUID>)
Information 18524 '<VMName>' was paused for critical error. (Virtual machine ID<GUID>)
Information 12598 '<VMName>' <VMName> (<GUID>) Disconnected from virtual network. (Virtual Machine ID<GUID>)
Information 18528 '<VMName>' was turned off as it could not recover from a critical error. (Virtual machine ID <GUID>)
System Event Logs:
2:34:02 PM Information 7040 Service Control Manager The start type of the Windows Modules Installer service was changed from auto start to demand start.
2:34:02 PM Information 7036 Service Control Manager The Volume Shadow Copy service entered the running state.
2:34:02 PM Information 7036 Service Control Manager The Microsoft Software Shadow Copy Provider service entered the running state.
2:35:52 PM Information 7036 Service Control Manager The WMI Performance Adapter service entered the stopped state.
2:36:03 PM Information 7036 Service Control Manager The Windows Modules Installer service entered the stopped state.
2:37:02 PM Information 7036 Service Control Manager The Volume Shadow Copy service entered the stopped state.
2:37:41 PM Information 7036 Service Control Manager The Windows Modules Installer service entered the running state.
2:39:43 PM Information 7036 Service Control Manager The Windows Modules Installer service entered the stopped state.
2:40:02 PM Information 7036 Service Control Manager The Microsoft Software Shadow Copy Provider service entered the stopped state.
2:43:01 PM Information 7036 Service Control Manager The Microsoft Storage Spaces SMP service entered the running state.
2:46:06 PM Information 7036 Service Control Manager The WinHTTP Web Proxy Auto-Discovery Service service entered the stopped state.
2:49:06 PM Information 7036 Service Control Manager The WinHTTP Web Proxy Auto-Discovery Service service entered the running state.
2:49:57 PM Information 7036 Service Control Manager The WMI Performance Adapter service entered the running state.
2:50:10 PM Information 233 Microsoft-Windows-Hyper-V-VmSwitch The operation 'Create' succeeded on nic C0470977-2D74-4F23-B695-B60A74E5100A--FD0F5C61-44A8-4C23-ACC1-B262965E22D8 (Friendly Name: ).
2:50:10 PM Information 232 Microsoft-Windows-Hyper-V-VmSwitch NIC C0470977-2D74-4F23-B695-B60A74E5100A--FD0F5C61-44A8-4C23-ACC1-B262965E22D8 (Friendly Name: hogia-cl561) successfully connected to port 608710AB-5CDD-449D-B3DE-801891384C7E (Friendly Name: d932b689-5f4f-4513-92ea-f12f3ca415ab) on switch FF9A59EE-0D6C-468D-98B0-DE0008045F13(Friendly Name: vSwitch).
2:50:14 PM Information 21500 Microsoft-Windows-Hyper-V-High-Availability 'SCVMM hogia-cl561 Configuration' successfully registered the configuration for the virtual machine.
2:50:14 PM Information 21500 Microsoft-Windows-Hyper-V-High-Availability 'SCVMM hogia-cl561' successfully started the virtual machine.
2:50:20 PM Information 234 Microsoft-Windows-Hyper-V-VmSwitch NIC C0470977-2D74-4F23-B695-B60A74E5100A--FD0F5C61-44A8-4C23-ACC1-B262965E22D8 successfully disconnected from port .
2:50:20 PM Information 233 Microsoft-Windows-Hyper-V-VmSwitch The operation 'Delete' succeeded on nic C0470977-2D74-4F23-B695-B60A74E5100A--FD0F5C61-44A8-4C23-ACC1-B262965E22D8 (Friendly Name: hogia-cl561).
2:51:57 PM Information 7036 Service Control Manager The WMI Performance Adapter service entered the stopped state.
2:52:50 PM Information 7036 Service Control Manager The Smart Card Device Enumeration Service service entered the running state.
2:52:51 PM Information 7036 Service Control Manager The Device Setup Manager service entered the running state.
2:53:40 PM Information 7036 Service Control Manager The Device Setup Manager service entered the stopped state.
3:05:58 PM Information 7036 Service Control Manager The WMI Performance Adapter service entered the running state.
3:07:58 PM Information 7036 Service Control Manager The WMI Performance Adapter service entered the stopped state.
Apart from everything above, I noticed one specific error that appears to be related to the SCSI bus:
SRB_STATUS_ERROR_RECOVERY
However, I couldn't find any explanation of this error despite searching many blogs. Your help is much appreciated.
--------------------------------------------------------------------------
SYNOPSIS of my infrastructure:
Source Cluster OS: Windows 2016 Standard Edition (Build: 10.0.14393.0)
Target Cluster OS: Windows 2016 Standard Edition (Build: 10.0.14393.0)
Hyper-V Version: 10.0.14393.2758 (Build: 10.0.14393.0)
Storage: Dedicated SMB3 File share Storage Spaces 2016
Management Tool: System Center Virtual Machine Manager (SCVMM) 2016
Dedicated 2 x 10 Gb NICs and physical switches for SMB3 and management (jumbo frames 9014 enabled)
Dedicated 2 x 10 Gb NICs and physical switches for Tenant/VM traffic
We are not using RDMA or DCB
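Since jumbo frames are enabled on the SMB3 path, here is a small worked calculation of the largest ping payload that should pass unfragmented, in case it helps anyone checking the path end to end. It assumes the NIC driver's 9014 value includes the 14-byte Ethernet header, which is common but vendor-dependent; the function name `max_ping_payload` is just for this example.

```python
ETHERNET_HEADER = 14  # bytes: destination MAC + source MAC + EtherType
IPV4_HEADER = 20      # bytes: IPv4 header without options
ICMP_HEADER = 8       # bytes: ICMP echo request header


def max_ping_payload(jumbo_setting: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented frame.

    Assumes the driver's jumbo setting counts the Ethernet header
    (so 9014 corresponds to an IP MTU of 9000).
    """
    ip_mtu = jumbo_setting - ETHERNET_HEADER
    return ip_mtu - IPV4_HEADER - ICMP_HEADER


print(max_ping_payload(9014))  # 8972
```

Under that assumption, `ping -f -l 8972 <target>` from a Hyper-V host to the file server should succeed if every hop honors the jumbo setting; `<target>` is a placeholder.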
Everything works, and has worked, on 2012R2 for many years. However, migrating a VM from
a 2012R2 cluster to 2016 causes the VM to power off. We also found out today that migration between
the new 2016 clusters causes the VM to power off as well.
I have found this thread and also posted in it without any resolution.
https://social.technet.microsoft.com/Forums/en-US/4cb2c0a0-cedc-4a71-886b-3146f0943d8d/migrate-vm-from-2012r2-cluster-to-2016-cluster-causes-outage?forum=winserverhyperv
Ramkumar