Hey everyone,
I started running into a weird issue this week. We have a 2-node cluster (Server 2016) set up with Hyper-V VDI and some pooled desktops, and we cannot live migrate these VDI machines consistently. At some point during a bulk live migration one VM
fails, which causes the rest to stop migrating. Errors are few and far between; the main one just says that the live migration failed, with no error message attached to that event in Failover Cluster Manager (FCM). There is an entry
in Event Viewer, in the Hyper-V VMMS log, but its description can't be found. The event ID is 22040 and the error code at the end of the message is 0x800705B4, which from my research is a timeout (ERROR_TIMEOUT).
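For anyone wanting to look at the same data, here is a minimal sketch of pulling event 22040 from the VMMS admin log and decoding the error code. It assumes the standard channel name Microsoft-Windows-Hyper-V-VMMS-Admin and PowerShell 5.1 (the Server 2016 default); the 20-event cap is arbitrary.

```powershell
# Pull recent occurrences of event 22040 from the Hyper-V VMMS admin log.
# Run this on each cluster node; the 20-event cap is an arbitrary choice.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-Hyper-V-VMMS-Admin'
    Id      = 22040
} -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message |
    Format-List

# The low 16 bits of the HRESULT hold the underlying Win32 error code
# (0x05B4 = 1460). Win32Exception turns that code into its system message.
$hresult   = 0x800705B4
$win32Code = $hresult -band 0xFFFF
([System.ComponentModel.Win32Exception]::new($win32Code)).Message
```

The second half runs on any Windows machine, so the code can be decoded even when the event description itself cannot be found.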
There are two weird things about this problem. The first is that although the machines fail when migrated together (I tested by draining the roles and it still fails), I can migrate them one at a time. If I migrate them one at a time there are
no errors, ever, and every machine migrates perfectly fine. The second is that I wrote a PowerShell script to move the VMs in parallel, with a foreach loop, and all of the machines migrate just fine. I believe that is because the script calls
one command at a time to migrate each virtual machine, but I am not sure why that would work.
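The script itself isn't shown, so the following is only a sketch of what a loop of that kind might look like, assuming the FailoverClusters module and a hypothetical destination node name (NODE2). One thing it illustrates: when Move-ClusterVirtualMachineRole is called without -Wait 0 it blocks until the move completes, so a plain foreach loop ends up running one live migration at a time rather than several at once.

```powershell
Import-Module FailoverClusters

# Hypothetical node name; replace with the real destination node.
$destination = 'NODE2'

# Select the clustered VM roles currently owned by the node running the script.
$vmGroups = Get-ClusterGroup |
    Where-Object {
        $_.GroupType -eq 'VirtualMachine' -and
        $_.OwnerNode.Name -eq $env:COMPUTERNAME
    }

foreach ($group in $vmGroups) {
    # Without -Wait 0 the cmdlet waits for each move to finish,
    # so only one live migration is in flight at any moment.
    Move-ClusterVirtualMachineRole -Name $group.Name -Node $destination -MigrationType Live
}
```

If that matches the real script, its behavior would be consistent with the one-at-a-time migrations succeeding, since neither path ever has several migrations running concurrently.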
We are currently rebuilding our master image to see if something has gone wrong with it, but I don't have much faith in that. I think the issue lies somewhere in FCM, but I am not sure of that either.
I have already checked the simultaneous live migrations setting in Hyper-V Settings. We use Kerberos, but have tried CredSSP as well. Since the machines live migrate fine one by one, I don't think authentication is the issue. The servers are connected with
a direct-attached 10 Gb link, which is the only network configured for live migration traffic. We also have a duplicate system in our primary location, with identical servers and identical peripherals, and it doesn't have this issue. The only difference is
the pools/master images. Both servers are connected to a Nimble iSCSI SAN, and so is the duplicate system, just on a separate unit. Everything is identical between the two setups, from the switches they connect to down to the direct-attached
coax.
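The settings mentioned above can also be dumped from both nodes and compared against the working site. This is a rough sketch that assumes the Hyper-V PowerShell module is installed and uses hypothetical node names (NODE1, NODE2).

```powershell
# Hypothetical node names; replace with the real cluster nodes.
$nodes = 'NODE1', 'NODE2'

# Live migration settings per host: concurrency limit, authentication
# type (Kerberos or CredSSP), and performance option.
Get-VMHost -ComputerName $nodes |
    Select-Object ComputerName,
                  VirtualMachineMigrationEnabled,
                  MaximumVirtualMachineMigrations,
                  VirtualMachineMigrationAuthenticationType,
                  VirtualMachineMigrationPerformanceOption,
                  UseAnyNetworkForMigration |
    Format-List

# Networks each host is allowed to use for live migration traffic.
Get-VMMigrationNetwork -ComputerName $nodes
```

Running the same commands against the duplicate system and comparing the output side by side would show any setting that differs between the two sites.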
One other note: the two servers in each location are slightly different from each other. One server runs the V2 revision of the processor with 8 cores, while the other runs V1 with 6 cores. However, the exact same situation exists at our other site
and it works fine.
Thanks for any help you all can provide. For now I can use the PowerShell script, but I need to figure out what this issue is in case it is a precursor of what's to come.