Channel: High Availability (Clustering) forum

Large virtual machine reboot during live migration - WS 2012R2


Hello,

Live migrating large VMs often fails on our 2012 R2 cluster, the same situation as described here: https://social.technet.microsoft.com/Forums/en-US/3436c57b-8832-4981-a09f-47361ed5db1d/live-migration-of-big-vms-fail. There is even a solution provided in that thread:

On all hosts, go to the following location in the registry:
HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization\Migration
Create a new DWORD value named "NetworkBufferCount" with a value of "1024"
Reboot the host

Unfortunately it does not say whether that value is meant to be hexadecimal or decimal. It would also be nice to know how this setting works.
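In case it helps, the value can also be set from PowerShell, where the base is unambiguous. This is only a sketch and assumes the decimal value 1024 (hex 0x400) is what the original posting meant:

# Sketch: set NetworkBufferCount on a host, assuming decimal 1024 (= hex 0x400) is intended.
$path = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization\Migration'
New-Item -Path $path -Force | Out-Null                  # create the Migration key if it does not exist yet
New-ItemProperty -Path $path -Name 'NetworkBufferCount' -PropertyType DWord -Value 1024 -Force
# Reboot the host afterwards, as described above.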

Any help would be highly appreciated!

Regards

Ueli


Storage Pool - Lost Communication - StorageJob - Suspended

Hello everyone,
I have 7 hard drives in my pool in Windows 10. Two of them got disconnected, and after I reconnected them I get the message "Warning". In PowerShell these disks show "Lost Communication". How can I add them again without data loss? So I started to test a few things.

So far I have added an eighth hard drive. Additionally, I set the status of the broken disk to Retired, which spread its data over the whole pool. But now the repair process is hanging. What could be the reason? What else can I do?
Thanks for all of your help.

PS C:\Users\Steidle> Get-VirtualDisk






FriendlyName     ResiliencySettingName FaultDomainRedundancy OperationalStatus HealthStatus  Size FootprintOnPool StorageEfficiency
------------     --------------------- --------------------- ----------------- ------------  ---- --------------- -----------------
Datenfestplatten Parity                1                     Degraded          Unhealthy    32 TB        36.02 TB           66,66 %


PS C:\Users\Steidle> get-physicaldisk

DeviceId FriendlyName                          SerialNumber   MediaType   CanPool OperationalStatus  HealthStatus Usage            Size
-------- ------------                          ------------   ---------   ------- -----------------  ------------ -----            ----
5        ST8000DM004-2CX188                    ZCT039J7       HDD         False   OK                 Healthy      Auto-Select   7.28 TB
0        ST8000AS0002-1NA17Z                   Z8416XR7       HDD         False   OK                 Healthy      Auto-Select   7.28 TB
3        ATA Crucial_CT750MX3                  16161312104B   SSD         False   OK                 Healthy      Auto-Select 698.64 GB
         ST8000AS 0002-1NA17Z SCSI Disk Device Z840KHSV       HDD         False   Lost Communication Warning      Retired       7.28 TB
1        TOSHIBA MD04ACA400 SCSI Disk Device   15Q3KFPLFSAA   HDD         False   OK                 Healthy      Auto-Select   3.64 TB
8        SAMSUNG HD501LJ                       S0MUJ1CQA11444 Unspecified False   OK                 Healthy      Auto-Select 465.76 GB
7        ST8000DM 004-2CX188 SCSI Disk Device  ZCT0DRCZ       Unspecified False   OK                 Healthy      Auto-Select   7.28 TB
2        ST8000AS 0002-1NA17Z SCSI Disk Device Z840F1JP       HDD         False   OK                 Healthy      Auto-Select   7.28 TB
4        TOSHIBA HDWE140 SCSI Disk Device      95CEKHKQF58D   HDD         False   OK                 Healthy      Auto-Select   3.64 TB
6        WDC WD80EZAZ-11TDBA0                  7SJPREBW       HDD         False   OK                 Healthy      Auto-Select   7.28 TB


PS C:\Users\Steidle> get-storagepool

FriendlyName    OperationalStatus HealthStatus IsPrimordial IsReadOnly     Size AllocatedSize
------------    ----------------- ------------ ------------ ----------     ---- -------------
Datenfestplatte OK                Healthy      False        False      50.94 TB      36.02 TB
Primordial      OK                Healthy      True         False       44.8 TB      43.66 TB


PS C:\Users\Steidle> get-storagejob

Name                      IsBackgroundTask ElapsedTime   JobState  PercentComplete BytesProcessed BytesTotal
----                      ---------------- -----------   --------  --------------- -------------- ----------
Datenfestplatte-Rebalance True             3575.13:08:27 Running   0                          0 B    1.25 GB
Datenfestplatten-Repair   True             00:58:57      Suspended 0                          0 B     4.5 GB
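For reference, a minimal sketch of the retire-and-repair sequence described above, using the pool and disk names from the output (the Remove-PhysicalDisk step is only safe once the repair has finished):

# Sketch: retire the unreachable disk, repair the virtual disk, then remove the retired disk from the pool.
$bad = Get-PhysicalDisk | Where-Object OperationalStatus -eq 'Lost Communication'
$bad | Set-PhysicalDisk -Usage Retired
Repair-VirtualDisk -FriendlyName 'Datenfestplatten' -AsJob
Get-StorageJob                        # the Repair job only makes progress while enough healthy disks are connected
# After the repair completes:
# Remove-PhysicalDisk -PhysicalDisks $bad -StoragePoolFriendlyName 'Datenfestplatte'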


Cluster problems


Hi.

I have a Hyper-V cluster (2 nodes) on Windows Server 2019 (with the latest updates).

When I create VMs in this cluster, the first VM is fine, but the next VM I add shows as "non clustered" in Hyper-V Manager (the VMs are added in Failover Cluster Manager). Each subsequently added VM makes the previously added VM show as clustered, while the newest one remains non-clustered.

Then I simulated a Node1 failure (by stopping the Cluster service in the Services console); the VMs successfully moved to Node2. In Hyper-V Manager on Node2, lots of the VMs show as non-clustered.

Has anyone met a problem like that?
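For reference, a minimal sketch of how to check which VMs the cluster actually owns and how to cluster one that was missed ('TestVM' is a placeholder name):

# Sketch: list the VM roles the cluster knows about, then add a missing VM as a clustered role.
Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine' | Format-Table Name, OwnerNode, State
Add-ClusterVirtualMachineRole -VMName 'TestVM'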

SET switch on 1GB adapters maxing core 0


I have a brand new 2016 cluster sitting on 2x DL360 G10 nodes.

2x CPUs of 8 cores each (so 16 cores / 32 logical processors)

I've joined all 4x 1 Gbit NICs into the SET switch.

Created vNICs for LM, Cluster and Management (no SMB, as it's direct SAS)

I created the SET switch, but whenever I live migrate VMs around, Core 0 maxes out, which means I get only around 1.6 Gbit out of the NICs (the team shows as 4 Gbit).

I understand VMQ is a big no-no for 1 Gbit adapters (I'm used to working with 10 Gbit).

I set RSS to move away from core 0, but it still does the same:

Set-NetadapterRSS -Name OB1 -BaseProcessorNumber 2 -NumaNode 0
Set-NetadapterRSS -Name OB2 -BaseProcessorNumber 8 -NumaNode 0
Set-NetadapterRSS -Name OB3 -BaseProcessorNumber 2 -NumaNode 1
Set-NetadapterRSS -Name OB4 -BaseProcessorNumber 8 -NumaNode 1
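For reference, a quick sketch of how to confirm what the adapters actually report after those changes (adapter names OB1-OB4 as above):

# Sketch: verify the effective RSS and VMQ settings on the teamed adapters.
Get-NetAdapterRss -Name OB* | Format-List Name, Enabled, BaseProcessorNumber, MaxProcessorNumber, MaxProcessors
Get-NetAdapterVmq -Name OB*           # VMQ is usually left disabled on 1 Gbit adapters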

What am I missing? What have I done wrong?

Thanks 

How to set HangRecoveryAction in PowerShell on Server 2019


I have a Server 2019 cluster. When I run validation I get a warning: "The setting for HangRecoveryAction on this cluster is not the default and recommended setting." Can someone tell me how to set it back?
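For what it's worth, HangRecoveryAction appears to be a cluster common property, so something like the following should read and reset it. This is only a sketch; the default is commonly documented as 3, but please check the value your validation report expects:

# Sketch: read the current value, then set it back to the documented default (assumed to be 3).
(Get-Cluster).HangRecoveryAction
(Get-Cluster).HangRecoveryAction = 3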

Thanks

Storage Spaces Direct (S2D) - Poor write performance with 5 nodes with 24 Intel P3520 NVME SSDs each over 40Gb IB network


Need a little help with my S2D cluster which is not performing as I had expected.

Details:

5 x Supermicro SSG-2028R-NR48N servers with 2 x Xeon E5-2643v4 CPUs and 96GB RAM

Each node has 24 x Intel P3520 1.2TB NVME SSDs

The servers are connected over an Infiniband 40Gb network, RDMA is enabled and working.

All 120 SSDs are added to S2D storage pool as data disks (no cache disks). There are two 30TB CSVs configured with hybrid tiering (3TB 3-way mirror, 27TB Parity)

I know these are read-intensive SSDs and that parity write performance is generally pretty bad, but I was expecting slightly better numbers than I'm getting:

Tested using CrystalDiskMark and diskspd.exe

Multithreaded Read speeds: < 4GBps (seq) / 150k IOPs (4k rand)

Singlethreaded Read speeds: < 600MBps  (seq) 

Multithreaded Write speeds: < 400MBps  (seq) 

Singlethreaded Write speeds: < 200MBps (seq) / 5k IOPS (4k rand)

I did manage to up these numbers by configuring a 4GB CSV cache on the CSVs and forcing write through on the CSVs:

Max reads: 23 GBps (seq) / 500K IOPS (4K rand); max writes: 2 GBps (seq) / 150K IOPS (4K rand)

That high read performance is due to the CSV cache, which uses memory. Write performance is still pretty bad though. In fact it's only slightly better than the performance I would get from a single one of these NVME drives. I was expecting much better performance from 120 of them!

I suspect that the issue here is that Storage Spaces is not recognising that these disks have power-loss protection (PLP), which you can see here:

Get-storagepool "*S2D*" | Get-physicaldisk |Get-StorageAdvancedProperty

FriendlyName          SerialNumber       IsPowerProtected IsDeviceCacheEnabled
------------          ------------       ---------------- --------------------                   
NVMe INTEL SSDPE2MX01 CVPF7165003Y1P2NGN            False                     
WARNING: Retrieving IsDeviceCacheEnabled failed with ErrorCode 1.
NVMe INTEL SSDPE2MX01 CVPF717000JR1P2NGN            False                     
WARNING: Retrieving IsDeviceCacheEnabled failed with ErrorCode 1.
NVMe INTEL SSDPE2MX01 CVPF7254009B1P2NGN            False                     
WARNING: Retrieving IsDeviceCacheEnabled failed with ErrorCode 1.
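For reference, a hedged sketch of the usual workaround when the drives really do have power-loss protection but report IsPowerProtected as False. Only do this if every capacity device genuinely has PLP, otherwise data can be lost on power failure:

# Sketch: tell Storage Spaces the pool is power protected, then re-check the advanced properties.
Get-StoragePool -FriendlyName "*S2D*" | Set-StoragePool -IsPowerProtected $true
Get-StoragePool "*S2D*" | Get-PhysicalDisk | Get-StorageAdvancedProperty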

Any help with this issue would be appreciated.

Thanks.

Storage Spaces Direct / Cluster Virtual Disk goes offline when rebooting a node


Hello

We have several hyper-converged environments based on HP ProLiant DL360/DL380.
We have 3-node and 2-node clusters running Windows Server 2016 with current patches, firmware updates done, and a witness configured.

The following issue occurs with at least one 3 Node and one 2 Node cluster:
When we put one node into maintenance mode (correctly, as described in the Microsoft docs, and after checking that everything is fine) and reboot that node, it can happen that one of the Cluster Virtual Disks goes offline. It is always the disk "Performance" with the SSD-only storage in each environment. The issue occurs only sometimes, not always. So sometimes I can reboot the nodes one after the other several times in a row and everything is fine, but sometimes the disk "Performance" goes offline. I cannot bring this disk back online until the rebooted node comes back online. After the node that was down for maintenance is back online, the virtual disk can be taken online without any issues.
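For reference, the drain/reboot/resume sequence from the docs boils down to roughly the following (a sketch; 'Node1' is a placeholder, and the storage jobs should be idle before the reboot):

# Sketch: documented maintenance flow for one node of an S2D cluster.
Suspend-ClusterNode -Name Node1 -Drain -Wait
Get-StorageJob                                   # wait until no repair/rebalance jobs are running
Restart-Computer -ComputerName Node1
Resume-ClusterNode -Name Node1 -Failback Immediate
Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus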

We have created 3 Cluster Virtual Disks & CSV Volumes on these clusters:
1x Volume with only SSD Storage, called Performance
1x Volume with Mixed Storage (SSD, HDD), called Mixed
1x Volume with Capacity Storage (HDD only), called Capacity

Disk Setup for Storage Spaces Direct (per Host):
- P440ar Raid Controller
- 2 x HP 800 GB NVME (803200-B21)
- 2 x HP 1.6 TB 6G SATA SSD (804631-B21)
- 4 x HP 2 TB 12G SAS HDD (765466-B21)
- No spare Disks
- Network Adapter for Storage: HP 10 GBit/s 546FLR-SFP+ (2 storage networks for redundancy)
- 3 Node Cluster Storage Network Switch: HPE FlexFabric 5700 40XG 2QSFP+ (JG896A), 2 Node Cluster directly connected with each other

Cluster Events Log is showing the following errors when the issue occurs:

Error 1069 FailoverClustering
Cluster resource 'Cluster Virtual Disk (Performance)' of type 'Physical Disk' in clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Warning 5120 FailoverClustering
Cluster Shared Volume 'Performance' ('Cluster Virtual Disk (Performance)') has entered a paused state because of 'STATUS_NO_SUCH_DEVICE(c000000e)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Error 5150 FailoverClustering
Cluster physical disk resource 'Cluster Virtual Disk (Performance)' failed.  The Cluster Shared Volume was put in failed state with the following error: 'Failed to get the volume number for \\?\GLOBALROOT\Device\Harddisk10\ClusterPartition2\ (error 2)'

Error 1205 FailoverClustering
The Cluster service failed to bring clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Error 1254 FailoverClustering
Clustered role '6ca63b55-1a16-4bb2-ac53-2b23619e258a' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

Error 5142 FailoverClustering
Cluster Shared Volume 'Performance' ('Cluster Virtual Disk (Performance)') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.

Any hints / inputs appreciated. Has anyone had something similar?

Thanks in advance

Philippe



VM shuts down when adding the highly available role in a Windows 2016 Hyper-V Failover Cluster

Hi! We have a Hyper-V failover cluster on Windows 2016 (after a clean upgrade from 2012 R2). We have a SoFS cluster on 2016 as file storage for the VMs. We also use VMM.
Let's analyze the problem in steps:
1.) Create a non-highly-available VM "vmtest1" in VMM on the Hyper-V 2016 cluster and start it.
2.) Add the Virtual Machine role for VM "vmtest1" in the Failover Cluster snap-in.
3.) After ~5-10 min we have error in eventlog Hyper-V-SynthStor (Event ID 12630) - 'vmtest1': Virtual hard disk resiliency failed to recover the drive '\\test1.test.consto.ru\VD0\test1\vmtest1.vhdx'. The virtual machine will be powered off. Current status: Permanent Failure.
4.) Now virtual machine "vmtest1" is powered off.
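For reference, step 2 is the equivalent of running the standard cmdlet against the running VM (a sketch):

# Sketch: make the already-running VM highly available and check the resulting cluster resources.
Add-ClusterVirtualMachineRole -VMName 'vmtest1'
Get-ClusterResource | Where-Object Name -like '*vmtest1*' | Format-Table Name, ResourceType, State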
This situation repeats on another Windows 2016 cluster.
We have had this problem since upgrading the cluster from Windows 2012 R2 to Windows 2016.
On 2012 R2 clusters the problem was not seen.
It happens only when we add the "highly available" role in the cluster while the VM is running. If we just create a "highly available" VM in VMM from the start, everything goes well.
All cluster servers and the SoFS have the latest updates. Some events:


Microsoft-Windows-Hyper-V-Worker/Admin:
'vmtest1': Virtual hard disk '\\test1.test.consto.ru\VD0\test1\vmtest1.vhdx' received a resiliency status notification. Current status: Disconnected.
'vmtest1': Virtual hard disk '\\test1.test.consto.ru\VD0\test1\vmtest1.vhdx' has detected a recoverable error. Current status: Disconnected.
'vmtest1': Virtual hard disk resiliency failed to recover the drive '\\test1.test.consto.ru\VD0\test1\vmtest1.vhdx'. The virtual machine will be powered off. Current status: Permanent Failure.
'vmtest1' was paused for critical error
'vmtest1' was turned off as it could not recover from a critical error. 

Microsoft-Windows-Hyper-V-StorageVSP/Microsoft-Hyper-V-StorageVSP-Admin:
Storage device '\\?\UNC\test1.test.consto.ru\VD0\test1\vmtest1.vhdx' changed recovery state. Previous state = Recoverable Error Detected, New state = Unrecoverable Error.
Storage device '\\?\UNC\test1.test.consto.ru\VD0\test1\vmtest1.vhdx' received a recovery status notification. Current device state = Recoverable Error Detected, Last status = Disconnected, New status = Permanent Failure.
Storage device '\\?\UNC\test1.test.consto.ru\VD0\test1\vmtest1.vhdx' received a recovery status notification. Current device state = No Errors, Last status = No Errors, New status = Disconnected.
Storage device '\\?\UNC\test1.test.consto.ru\VD0\test1\vmtest1.vhdx' changed recovery state. Previous state = No Errors, New state = Recoverable Error Detected.

Microsoft-Windows-FailoverClustering/Diagnostic:
[RES] Virtual Machine Configuration <Virtual Machine Configuration vmtest1>: Current state 'Online', event 'UpdateVmConfigurationProperties'
[RES] Virtual Machine Configuration <Virtual Machine Configuration vmtest1>: Updated VmStoreRootPath property to '\\?\UNC\test1.test.consto.ru\VD0\test1\vmtest1.vhdx'
[RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine Configuration vmtest1', gen(0) result 0/0.
[RCM] Virtual Machine Configuration vmtest1: Flags 1 added to StatusInformation. New StatusInformation 1 
[RCM] vmtest1: Added Flags 1 to StatusInformation. New StatusInformation 1 
[RHS] Resource Virtual Machine vmtest1 called SetResourceLockedMode. LockedModeEnabled1, LockedModeReason0.
[RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine vmtest1', gen(0) result 0/0.
[RCM] Virtual Machine vmtest1: Flags 1 added to StatusInformation. New StatusInformation 1 
[GUM] Node 16: Processing RequestLock 16:1953
[RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine vmtest1', gen(0) result 0/0.
[RHS] Resource Virtual Machine Configuration vmtest1 called SetResourceLockedMode. LockedModeEnabled0, LockedModeReason0.
[RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine Configuration vmtest1', gen(0) result 0/0.
[RCM] Virtual Machine Configuration vmtest1: Flags 1 removed from StatusInformation. New StatusInformation 0 
[RHS] Resource Virtual Machine vmtest1 called SetResourceLockedMode. LockedModeEnabled0, LockedModeReason0.
[RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine vmtest1', gen(0) result 0/0.
[RCM] Virtual Machine vmtest1: Flags 1 removed from StatusInformation. New StatusInformation 0 
[RCM] vmtest1: Removed Flags 1 from StatusInformation. New StatusInformation 0 
[RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine vmtest1', gen(0) result 0/0.
[RCM] Virtual Machine vmtest1: Flags 1 removed from StatusInformation. New StatusInformation 0 
[RES] Virtual Machine <Virtual Machine vmtest1>: Current state 'Online', event 'VmStopped'
[RCM] vmtest1: Removed Flags 1 from StatusInformation. New StatusInformation 0 
[RES] Virtual Machine <Virtual Machine vmtest1>: State change 'Online' -> 'Offline'
[RCM] rcm::RcmApi::OfflineResource: (Virtual Machine vmtest1, 1)
[RCM] Res Virtual Machine vmtest1: Online -> WaitingToGoOffline( StateUnknown )
[RCM] TransitionToState(Virtual Machine vmtest1) Online-->WaitingToGoOffline.
[RCM] rcm::RcmGroup::UpdateStateIfChanged: (vmtest1, Online --> Pending)
[RCM] Res Virtual Machine vmtest1: WaitingToGoOffline -> OfflineCallIssued( StateUnknown )
[RCM] TransitionToState(Virtual Machine vmtest1) WaitingToGoOffline-->OfflineCallIssued.
[RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine vmtest1', gen(0) result 0/0.









VM gets powered off soon after successful migration between 2016 clusters


Hi Folks,

I'm performing a live migration of HA roles (VMs) between two 2016 clusters, using SCVMM.

When live migrating a virtual machine (including its storage) to 2016, virtual machine storage resiliency powers off the VM soon after the migration completes successfully.

Error Message: "Virtual hard disk resiliency failed to recover the drive. The virtual machine will be powered off. Current status: Permanent Failure."

Event ID: 12630

I've captured and studied the below logs extensively, and also did a lot of research - in vain.

HyperVHighAvailability-Admin

HyperVStorageVSP-Admin

HyperVWorker-Admin

Below are the critical events/errors I most commonly noticed in those logs (with the exact timestamps from when the issue was reproduced).

HyperVHighAvailability-Admin:

21120 Information 'SCVMM <VMName> Configuration' successfully registered the configuration for the virtual machine.
21119 Information 'SCVMM <VMName>' successfully started the virtual machine.

HyperVStorageVSP-Admin:

Information 6 Storage device '\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx' received an IO failure with error = SRB_STATUS_ERROR_RECOVERY. Current device state = No Errors, New state = Recoverable Error Detected, Current status = No Errors.
Information 4 Storage device '\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx' received a recovery status notification. Current device state = Recoverable Error Detected, Last status = No Errors, New status = Disconnected.
Information 5 Storage device '\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx' changed recovery state. Previous state = Recoverable Error Detected, New state = Recoverable Error Detected.
Information 4 Storage device '\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx' received a recovery status notification. Current device state = Recoverable Error Detected, Last status = Disconnected, New status = Permanent Failure.
Information 5 Storage device '\\?\UNC\xyz\abc\<VMName>\Clean_disk_1.vhdx' changed recovery state. Previous state = Recoverable Error Detected, New state = Unrecoverable Error.

HyperVWorker-Admin:

Information 12597 '<VMName>' <VMName> (<GUID>) Connected to virtual network. (Virtual Machine ID <GUID>)
Information 12582 '<VMName>' <VMName> (<GUID>) started successfully. (Virtual Machine ID <GUID>)
Information 12635 '<VMName>': Virtual hard disk '\\FQDN\abc\<VMName>\Clean_disk_1.vhdx' received a resiliency status notification. Current status: Disconnected. (Virtual machine ID<GUID>)
Warning 12636 '<VMName>': Virtual hard disk '\\FQDN\abc\<VMName>\Clean_disk_1.vhdx' has detected a critical error. Current status: Disconnected. (Virtual machine ID<GUID>)
Information 12635 '<VMName>': Virtual hard disk '\\FQDN\abc\<VMName>\Clean_disk_1.vhdx' received a resiliency status notification. Current status: Permanent Failure. (Virtual machine ID<GUID>)
Error 12630 '<VMName>': Virtual hard disk resiliency failed to recover the drive '\\FQDN\abc\<VMName>\Clean_disk_1.vhdx'. The virtual machine will be powered off. Current status: Permanent Failure. (Virtual machine ID <GUID>)
Information 18524 '<VMName>' was paused for critical error. (Virtual machine ID<GUID>)
Information 12598 '<VMName>' <VMName> (<GUID>) Disconnected from virtual network. (Virtual Machine ID<GUID>)
Information 18528 '<VMName>' was turned off as it could not recover from a critical error. (Virtual machine ID <GUID>)

System Event Logs:

2:34:02 PM Information 7040 Service Control Manager The start type of the Windows Modules Installer service was changed from auto start to demand start.
2:34:02 PM Information 7036 Service Control Manager The Volume Shadow Copy service entered the running state.
2:34:02 PM Information 7036 Service Control Manager The Microsoft Software Shadow Copy Provider service entered the running state.
2:35:52 PM Information 7036 Service Control Manager The WMI Performance Adapter service entered the stopped state.
2:36:03 PM Information 7036 Service Control Manager The Windows Modules Installer service entered the stopped state.
2:37:02 PM Information 7036 Service Control Manager The Volume Shadow Copy service entered the stopped state.
2:37:41 PM Information 7036 Service Control Manager The Windows Modules Installer service entered the running state.
2:39:43 PM Information 7036 Service Control Manager The Windows Modules Installer service entered the stopped state.
2:40:02 PM Information 7036 Service Control Manager The Microsoft Software Shadow Copy Provider service entered the stopped state.
2:43:01 PM Information 7036 Service Control Manager The Microsoft Storage Spaces SMP service entered the running state.
2:46:06 PM Information 7036 Service Control Manager The WinHTTP Web Proxy Auto-Discovery Service service entered the stopped state.
2:49:06 PM Information 7036 Service Control Manager The WinHTTP Web Proxy Auto-Discovery Service service entered the running state.
2:49:57 PM Information 7036 Service Control Manager The WMI Performance Adapter service entered the running state.
2:50:10 PM Information 233 Microsoft-Windows-Hyper-V-VmSwitch The operation 'Create' succeeded on nic C0470977-2D74-4F23-B695-B60A74E5100A--FD0F5C61-44A8-4C23-ACC1-B262965E22D8 (Friendly Name: ).
2:50:10 PM Information 232 Microsoft-Windows-Hyper-V-VmSwitch NIC C0470977-2D74-4F23-B695-B60A74E5100A--FD0F5C61-44A8-4C23-ACC1-B262965E22D8 (Friendly Name: hogia-cl561) successfully connected to port 608710AB-5CDD-449D-B3DE-801891384C7E (Friendly Name: d932b689-5f4f-4513-92ea-f12f3ca415ab) on switch FF9A59EE-0D6C-468D-98B0-DE0008045F13(Friendly Name: vSwitch).
2:50:14 PM Information 21500 Microsoft-Windows-Hyper-V-High-Availability 'SCVMM hogia-cl561 Configuration' successfully registered the configuration for the virtual machine.
2:50:14 PM Information 21500 Microsoft-Windows-Hyper-V-High-Availability 'SCVMM hogia-cl561' successfully started the virtual machine.
2:50:20 PM Information 234 Microsoft-Windows-Hyper-V-VmSwitch NIC C0470977-2D74-4F23-B695-B60A74E5100A--FD0F5C61-44A8-4C23-ACC1-B262965E22D8 successfully disconnected from port .
2:50:20 PM Information 233 Microsoft-Windows-Hyper-V-VmSwitch The operation 'Delete' succeeded on nic C0470977-2D74-4F23-B695-B60A74E5100A--FD0F5C61-44A8-4C23-ACC1-B262965E22D8 (Friendly Name: hogia-cl561).
2:51:57 PM Information 7036 Service Control Manager The WMI Performance Adapter service entered the stopped state.
2:52:50 PM Information 7036 Service Control Manager The Smart Card Device Enumeration Service service entered the running state.
2:52:51 PM Information 7036 Service Control Manager The Device Setup Manager service entered the running state.
2:53:40 PM Information 7036 Service Control Manager The Device Setup Manager service entered the stopped state.
3:05:58 PM Information 7036 Service Control Manager The WMI Performance Adapter service entered the running state.
3:07:58 PM Information 7036 Service Control Manager The WMI Performance Adapter service entered the stopped state.

Apart from everything above, I've noticed a specific error somewhat related to SCSI Bus:

SRB_STATUS_ERROR_RECOVERY

However, I couldn't find an explanation of the error despite searching all the blogs. Your kind help is much appreciated.
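For anyone reproducing this, the per-VM storage resiliency behaviour behind event 12630 can be inspected and, as a diagnostic step only, relaxed with the standard Hyper-V cmdlets (a sketch; '<VMName>' is a placeholder and this is not a root-cause fix):

# Sketch: show the resiliency settings, then either disable the pause-on-critical-error behaviour
# or lengthen its timeout (the timeout value is in minutes).
Get-VM -Name '<VMName>' | Format-List AutomaticCriticalErrorAction, AutomaticCriticalErrorActionTimeout
Set-VM -Name '<VMName>' -AutomaticCriticalErrorAction None
# Set-VM -Name '<VMName>' -AutomaticCriticalErrorActionTimeout 120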

--------------------------------------------------------------------------

SYNOPSIS of my infrastructure:

Source Cluster OS: Windows 2016 Standard Edition (Build:  10.0.14393.0)

Target Cluster OS: Windows 2016 Standard Edition (Build:  10.0.14393.0)

Hyper-V Version: 10.0.14393.2758 (Build:  10.0.14393.0)

Storage: Dedicated SMB3 File share Storage Spaces 2016

Management Tool: System Center Virtual Machine Manager (SCVMM) 2016

Dedicated 2x10GB nics and physical switches for SMB3 and management (Jumbo frames 9014 enabled)
Dedicated 2x10GB nics and physical switches for Tenant/VM traffic
We are not using RDMA or DCB

Everything works and has worked on 2012 R2 for many years. However, migrating a VM from a 2012 R2 cluster to 2016 causes the VM to power off. We also found out today that migration between the new 2016 clusters causes the VM to power off as well.
I have found this thread and also posted in it without any resolution.

https://social.technet.microsoft.com/Forums/en-US/4cb2c0a0-cedc-4a71-886b-3146f0943d8d/migrate-vm-from-2012r2-cluster-to-2016-cluster-causes-outage?forum=winserverhyperv


Ramkumar

S2D on WS2019 - Disks added - not rebalancing


Hi,

We just added 4 more disks (2 per server) to our Windows Server 2019 HCI (lab) cluster and we are wondering why it does not rebalance the data. We have already expanded the vDisk; that action used some of the space on the new disks, but the rest of the data is not being rebalanced.


Even if we use the cmdlet "Optimize-StoragePool", it doesn't do anything: it just counts straight up to 100% and the job is done within a few seconds. The new disks remain at 48% while the other disks are filled up to 91%...
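For reference, this is roughly how the rebalance is invoked and monitored (a sketch; the pool is selected generically):

# Sketch: start the rebalance explicitly and watch the storage job.
Get-StoragePool -IsPrimordial $false | Optimize-StoragePool
Get-StorageJob                  # the Optimize job should report BytesTotal > 0 while it is actually moving data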

Any ideas how we could rebalance the data?

We used nested resiliency with a nested mirror and just one vDisk (2.5 TB in size).


Thanks for your ideas!

André



VMs Unable to Live Migrate


I have a Failover Cluster running on two Server 2012 R2 Datacenter nodes hosting our Hyper-V environment.  Recently, we have run into an issue where the VMs won’t migrate to the opposite node unless the VM is rebooted or the Saved State data is deleted.  The VMs are stored either on an SOFS volume on a separate FO Cluster or a CSV volume both nodes are connected to.  The problem occurs to VMs in either storage location.

Testing I’ve done is below.  Note that I only list one direction, but the behavior is the same moving in the opposite direction, as well:

- Live Migration: if a VM is on Node1 and I tell it to Live Migrate to Node2, it begins the process in the console and for a split second shows Node2.  It immediately flips back to Node1.  If the VM has rebooted since the last migration, it will go ahead and migrate to Node2.  It will not migrate back until the VM has been rebooted again.  The Event Log shows IDs 1205 and 1069.  1069 states “Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.”  All resources show Online in Powershell.

- Quick Migration: I initiate a Quick Migration and the VM will move from Node1 to Node2, but will fail to start on Node2.  Checking the Event Log I see Event IDs 1205 and 1069.  1069 states “Cluster resource 'Virtual Machine IDF' of type 'Virtual Machine' in clustered role 'IDF' failed. The error code was '0xc0370027' ('Cannot restore this virtual machine because the saved state data cannot be read. Delete the saved state data and then try to start the virtual machine.').”  After deleting the Saved State Data (see the PowerShell sketch after this list), the VM will start right up and can be Live or Quick Migrated once.

- Shutdown VM and Quick Migration: I have not had an occasion of this method fail so far.

- Rebooting the Nodes has had no discernable effect on the situation.

- I’ve shut down a VM and moved its storage from SOFS to the CSV and still have the same issues as above.  I moved the VHDX, the config file, and saved state data (which was empty while the VM was powered down) to the CSV.
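A sketch of the saved-state workaround mentioned in the Quick Migration test above, done from PowerShell on the owning node ('IDF' matches the role name from the 1069 event; the VM must not be running):

# Sketch: discard the unreadable saved state and start the VM again.
Get-VM -Name 'IDF' | Remove-VMSavedState
Start-VM -Name 'IDF'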

Items from the FO Cluster Validation Report:
1. The following virtual machines have referenced paths that do not appear accessible to all nodes of the cluster. Ensure all storage paths in use by virtual machines are accessible by all nodes of the cluster.
Virtual Machines Storage Paths That Cannot Be Accessed By All Nodes 
Virtual Machine       Storage Path      Nodes That Cannot Access the Storage Path 
VM1                       \\sofs\vms         Node1

I’m not sure what to make of this error as most of the VMs live on this SOFS share and are running on Nodes1 and 2.  If Node1 really couldn’t access the share, none of the VMs would run on Node1.

2. Validating cluster resource File Share Witness (2) (\\sofs\HVQuorum).
This resource is configured to run in a separate monitor. By default, resources are configured to run in a shared monitor. This setting can be changed manually to keep it from affecting or being affected by other resources. It can also be set automatically by the failover cluster. If a resource fails it will be restarted in a separate monitor to try to reduce the impact on other resources if it fails again. This value can be changed by opening the resource properties and selecting the 'Advanced Policies' tab. There is a check-box 'run this resource in a separate Resource Monitor'.

I checked on this and the check-box is indeed unchecked and both Nodes report the same setting (or lack thereof).

3. Validating cluster resource Virtual Machine VM2.
This resource is configured to run in a separate monitor. By default, resources are configured to run in a shared monitor. This setting can be changed manually to keep it from affecting or being affected by other resources. It can also be set automatically by the failover cluster. If a resource fails it will be restarted in a separate monitor to try to reduce the impact on other resources if it fails again. This value can be changed by opening the resource properties and selecting the 'Advanced Policies' tab. There is a check-box 'run this resource in a separate Resource Monitor'.

Validating cluster resource Virtual Machine VM3.
This resource is configured to run in a separate monitor. By default, resources are configured to run in a shared monitor. This setting can be changed manually to keep it from affecting or being affected by other resources. It can also be set automatically by the failover cluster. If a resource fails it will be restarted in a separate monitor to try to reduce the impact on other resources if it fails again. This value can be changed by opening the resource properties and selecting the 'Advanced Policies' tab. There is a check-box 'run this resource in a separate Resource Monitor'.

I can’t find a place to see this check-box for the VMs.  The properties on the roles don’t contain the ‘Advanced Policies’ tab.

All other portions of the Validation Report are clean.

So far, I haven’t found any answers in several days of Google searching and trying different tactics.  I’m hoping someone here has run into a similar situation and can help steer me in the right direction to get this resolved.  The goal is to be able to Live Migrate freely so I can reboot the Nodes one at a time for Microsoft Updates without having to bring down all the VMs in the process.




Cluster Management Crashes Upon Starting to Create a Cluster (Server 2016)


Previously on My Life Sucks, I could not get the Cluster service to start:

https://social.technet.microsoft.com/Forums/windowsserver/en-US/7a6431c1-3fec-437e-b733-092f5b056cb5/cluster-service-will-not-start-hyperv-2016?forum=winserverClustering

After resolving this issue, I was able to start the process of clustering my 2 nodes.

About 15 seconds after beginning the cluster process, the cluster management GUI crashes and the cluster fails to create.

I tried ensuring that the iscsi targets were identical thanks to this article: 

https://www.asp.be/hyper-v-2016-cluster-crash-troubleshooting/

I did have a similar issue, where, on one node, the 2 SAN volumes were listed twice each (4 items). I removed 2 items and now both servers had the same number of targets. I confirmed that the SAN volumes were still online on what I wanted to be the primary node.

Just to be sure this wasn't completely the issue, I unchecked the box for "add all storage" to the cluster during the Failover Cluster wizard.

It still crashes about 15 seconds into cluster creation.

I'm using my Windows 10 Pro PC to create the Failover Cluster because there's a plug-in for RSAT that allows for this. Unfortunately, these are currently the only 2 Server 2016 servers in the domain. The whole point of this is to get our Hyper V Machines into High Availability so we can eventually remove our Server 2012 servers from the network.
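In case it helps to rule out the GUI itself, cluster creation can also be attempted from PowerShell on one of the nodes (a sketch; the cluster name, node names and IP address are placeholders):

# Sketch: validate, then create the cluster without adding any storage.
Test-Cluster -Node Node1, Node2
New-Cluster -Name HVCLUSTER01 -Node Node1, Node2 -StaticAddress 192.168.1.50 -NoStorage
# Storage can be added once the cluster itself forms cleanly:
# Get-ClusterAvailableDisk | Add-ClusterDisk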

S2D 2 node cluster


Hello,

We have a 2-node S2D cluster with Windows Server 2019. Between the two nodes we have a directly connected RDMA storage network (Cluster only) and a client-facing network based on LACP teaming on each node (Cluster and Client). We have done a failover test and it works: when we power off one node, the virtual machines migrate to the other host as expected. But when we unplug the client-facing adapters (two adapters in LACP) on the node where the VMs reside, VM migration fails, and after some time the Cluster Network Name and Cluster IP Address resources also fail. When we plug the client-facing adapters back into the failed node, the cluster IP address recovers and the VM client network works again. So the problem is: cluster migration fails after an unexpected shutdown of the client-facing network on the node where the VMs reside, even though the nodes can still communicate with each other through the storage network and all nodes are up in Failover Cluster Manager. When the client network is down, the VMs should migrate to the other node with a working client-facing network, but instead the cluster fails and the VMs do not migrate. Where can we fix this behaviour? Has anyone met this before?
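For reference, a quick sketch of how to show the network roles the cluster has assigned, since live migration and client access depend on them ('ClusterAndClient' carries client traffic, 'Cluster' is internal only, 'None' is excluded):

# Sketch: list the cluster networks with their roles and states.
Get-ClusterNetwork | Format-Table Name, Role, State, Address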

Windows Server 2012-R2 or 2016 Failover cluster manager: multiple online resources


I was wondering if anybody experienced and/or resolved the following issue:

Windows Failover cluster Setup:

  • Two Windows 2016 or 2012-R2 server nodes: A and B with current Windows patches.
  • Generic Application DLL resource: implements IsAlive(), LookAlive(), Online() and Offline()
  • Virtual IP address resource: as a dependency of the Generic Application
  • Policy: configured to failover at the first failure
    1. Period for Restarts=15:00
    2. Maximum restarts in the specified period=0
    3. Delay between restarts=0
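For reference, the same policy expressed in PowerShell (a sketch; 'MyAppResource' is a placeholder, and the restart properties are the standard resource common properties, stored in milliseconds):

# Sketch: configure fail-over-on-first-failure on the generic application resource.
$res = Get-ClusterResource -Name 'MyAppResource'
$res.RestartPeriod    = 900000   # "Period for Restarts" = 15:00
$res.RestartThreshold = 0        # "Maximum restarts in the specified period"
$res.RestartDelay     = 0        # "Delay between restarts"
$res.RestartAction    = 2        # restart / fail over to the other node when the threshold is exceeded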

Issue:

When IsAlive() fails on primary server A, the cluster manager:

  • Does not call Offline() on A (leaving A online)
  • Moves VIP address from A to B
  • Calls Online() on B

As a result, both A and B Application resources are online.



Errors when running cluster-aware updating

I currently have two clusters. All VMs are running on Cluster 2 at the moment. When I manually run Cluster-Aware Updating, everything works perfectly on Cluster 1. All the roles then drain from Cluster 2 over to 1, no problem. After Cluster 2 updates, all roles move back and everything seems fine, but I get a status of "partially failed". There is a warning, "an error occurred resuming node 2, the cluster is not paused", and an error, "cluster aware updating Node failed to leave maintenance mode". I'm not sure what is going on here, any ideas?
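For reference, a small sketch of how to check whether CAU left the node paused and how to resume it by hand ('Node2' is a placeholder taken from the warning text):

# Sketch: show node states after the CAU run, then resume a node that is still paused.
Get-ClusterNode | Format-Table Name, State, DrainStatus
Resume-ClusterNode -Name Node2 -Failback Immediate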

Node Quarantined


I have a peculiar issue. One of the nodes in my cluster is listed as quarantined.

I tried running the Start-ClusterNode -ClearQuarantine command, but it does not make any difference:

BuildNumber        : 14393
Cluster            : A-FSC-PRD1
CSDVersion         : 
Description        : 
DrainStatus        : NotInitiated
DrainTarget        : 4294967295
DynamicWeight      : 1
Id                 : 1
MajorVersion       : 10
MinorVersion       : 0
Name               : W-FSC-vPRD1
NeedsPreventQuorum : 0
NodeHighestVersion : 589832
NodeInstanceID     : 00000000-0000-0000-0000-000000000001
NodeLowestVersion  : 589832
NodeName           : W-FSC-vPRD1
NodeWeight         : 1
FaultDomain        : {Site:, Rack:, Chassis:}
Model              : VMware Virtual Platform
Manufacturer       : VMware, Inc.
SerialNumber       : VMware-42 09 e3 d0 15 42 b5 8d-5e 11 69 26 d7 79 42 02
State              : Up
StatusInformation  : Quarantined
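For reference, a sketch of the quarantine-related settings and the manual clear attempt (the defaults are commonly documented as 7200 seconds and 3 failures per hour, but please verify against your build):

# Sketch: inspect the quarantine settings and try clearing quarantine on the affected node.
(Get-Cluster).QuarantineDuration
(Get-Cluster).QuarantineThreshold
Start-ClusterNode -Name W-FSC-vPRD1 -ClearQuarantine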


SK

Error validating cluster computer resource name (Server 2016 Datacenter Cluster)


    An error occurred while executing the test.
    The operation has failed. An error occurred while checking the Active Directory organizational unit for the cluster name resource.

    The parameter is incorrect

Interestingly enough, the cluster name was created successfully in the Computers OU, and the cluster can be taken offline and brought back online with no problem. The DNS entry is correct and the cluster name pings to the correct IP. Changing the name of the cluster updates the cluster computer name in AD with no errors.


Failed Server 2016 Rolling cluster upgrade.


Hi,

I have a 2-node Hyper-V Server 2012 R2 cluster. I would like to upgrade the cluster to 2016 to take advantage of the new nested virtualisation feature. I have been following this guide (https://technet.microsoft.com/en-us/windows-server-docs/failover-clustering/cluster-operating-system-rolling-upgrade) and have got to the point of adding the first upgraded node back into the cluster, but it fails with the following error.

Cluster service on node CAMHVS02 did not reach the running state. The error code is 0x5b4. For more information check the cluster log and the system event log from node CAMHVS02. This operation returned because the timeout period expired.

In the event log of the 2016 node I have these errors.

FAILOVER CLUSTERING LOG

mscs_security::BaseSecurityContext::DoAuthenticate_static: (30)' because of '[Schannel] Received wrong header info: 1576030063, 4089746611'

cxl::ConnectWorker::operator (): HrError(0x0000001e)' because of '[SV] Security Handshake failed to obtain SecurityContext for NetFT driver'

[QUORUM] Node 2: Fail to form/join a cluster in 6-7 minutes

[QUORUM] An attempt to form cluster failed due to insufficient quorum votes. Try starting additional cluster node(s) with current vote or as a last resort use Force Quorum option to start the cluster. Look below for quorum information,

[QUORUM] To achieve quorum cluster needs at least 2 of quorum votes. There is only 1 quorum votes running

[QUORUM] List of running node(s) attempting to form cluster: CAMHVS02, 

[QUORUM] List of running node(s) with current vote: CAMHVS02, 

[QUORUM] Attempt to start some or all of the following down node(s) that have current vote: CAMHVS01, 

join/form timeout (status = 258)

join/form timeout (status = 258), executing OnStop

SYSTEM LOG

Cluster node 'CAMHVS02' failed to join the cluster because it could not communicate over the network with any other node in the cluster. Verify network connectivity and configuration of any network firewalls.

Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates. .

The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 15000 milliseconds: Restart the service.

The Service Control Manager tried to take a corrective action (Restart the service) after the unexpected termination of the Cluster Service service, but this action failed with the following error: 
The service cannot be started, either because it is disabled or because it has no enabled devices associated with it.

For information, Node 2 is 2016. Windows Firewall is disabled on both nodes. All interfaces can ping each other between nodes. There was one error in the validation wizard which may be related, but I thought it was because of the different versions of Windows.

CAMHVS01.gemalto.com
Device-specific module (DSM) Name Major Version Minor Version Product Build QFE Number 
Microsoft DSM 6 2 9200 17071 


Getting information about registered device-specific modules (DSMs) from node CAMHVS02.gemalto.com.


CAMHVS02.gemalto.com
Device-specific module (DSM) Name Major Version Minor Version Product Build QFE Number 
Microsoft DSM 10 0 14393 0 


For the device-specific module (DSM) named Microsoft DSM, versions do not match between node CAMHVS01.gemalto.com and node CAMHVS02.gemalto.com.
For the device-specific module (DSM) named Microsoft DSM, versions do not match between node CAMHVS02.gemalto.com and node CAMHVS01.gemalto.com.
Stop: 31/10/2016 09:03:47.
Indicates two revision levels are incompatible
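For reference, while the cluster runs in mixed mode the rolling-upgrade guide suggests keeping an eye on the functional level and per-node versions; a sketch run from a node that is up:

# Sketch: mixed-mode checks during a rolling upgrade (the functional level stays at 8 until the upgrade is committed).
Get-Cluster | Format-List Name, ClusterFunctionalLevel
Get-ClusterNode | Format-Table Name, State, MajorVersion, MinorVersion, BuildNumber
# Only after every node runs 2016:
# Update-ClusterFunctionalLevel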

Any help would be really appreciated.

Thanks


Best practices for Windows Search on a file Failover Cluster with folder-mounted drives


Hi there,

We have an "active/active" file server failover cluster with two roles:

Failoverresource01

Failoverresource02

Both have their own drives and file shares. Our drives are mounted under one "master" drive, like this:

Masterdrive is mounted on O:

Slavedrive is mounted on O:\Slavedrive

First question
In the indexing options I can't select O:\Slavedrive. How do I select this drive for indexing?

Second question:
Where should I put the storage for Windows Search? If I put it in one of the cluster resources, I can only index one of the resource groups... So should I leave it local, so that the second node only indexes when the resource is active on that node?

Thanks in advance

New-Volume cmdlet does not create requested size volume.


Hello,

I don't know if this is the best forum for this as it is PowerShell-related, but it is a cluster volume I am creating.

When I create a 500 MB volume for DTC, the volume actually gets created as 8 GB. I would like to know why, and what to do to fix this behavior.

Command I ran:

New-Volume -StoragePoolFriendlyName S2D* -FriendlyName VDiskDBDTC -FileSystem CSVFS_REFS -Size 500MB

Result:

DriveLetter FileSystemLabel FileSystem DriveType HealthStatus OperationalStatus SizeRemaining    Size
----------- --------------- ---------- --------- ------------ ----------------- -------------    ----
            VDiskDBDTC      CSVFS      Fixed     Healthy      OK                      7.21 GB 7.94 GB

The other volumes I created, of various sizes in GB, are the correct sizes. The help page for this command states that you can use MB when specifying size, so I don't know why it won't work correctly.

https://docs.microsoft.com/en-us/powershell/module/storage/new-volume?view=win10-ps
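For reference, a sketch of how to inspect the underlying virtual disk and see how the 500 MB request was actually provisioned:

# Sketch: compare the requested size with what Storage Spaces actually allocated for the new volume.
Get-VirtualDisk -FriendlyName VDiskDBDTC |
    Format-List Size, AllocatedSize, FootprintOnPool, ResiliencySettingName, NumberOfColumns, Interleave, WriteCacheSize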

Thanks,
Chris
