Channel: High Availability (Clustering) forum

Live Migration and Workgroup Cluster on Windows Server 2019


Hi ,

I found the following document about live migration and workgroup clusters on Windows Server 2016.

https://techcommunity.microsoft.com/t5/Failover-Clustering/Workgroup-and-Multi-domain-clusters-in-Windows-Server-2016/ba-p/372059

I understand that live migration is not supported on workgroup clusters and that only quick migration is supported. Is it the same on Windows Server 2019, or are there any plans to change this?
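For context, a minimal sketch (not from the thread; names are placeholders) of the general pattern the linked article describes for creating a workgroup cluster, using a DNS administrative access point instead of an Active Directory one:

# Sketch only: create a workgroup/multi-domain cluster with a DNS access point
New-Cluster -Name WGCLUSTER -Node NODE1, NODE2 -AdministrativeAccessPoint DNS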



Failed Server 2016 Rolling cluster upgrade.


Hi,

I have a 2 node Hyper-V server 2012 R2 cluster. I would like to upgrade the cluster to 2016 to take advantage of the new nested virtualisation feature. I have been following this guide (https://technet.microsoft.com/en-us/windows-server-docs/failover-clustering/cluster-operating-system-rolling-upgrade) and have got to the point of adding the first upgraded node back into the cluster but it fails with the following error. 

Cluster service on node CAMHVS02 did not reach the running state. The error code is 0x5b4. For more information check the cluster log and the system event log from node CAMHVS02. This operation returned because the timeout period expired.

In the event log of the 2016 node I have these errors.

FAILOVER CLUSTERING LOG

mscs_security::BaseSecurityContext::DoAuthenticate_static: (30)' because of '[Schannel] Received wrong header info: 1576030063, 4089746611'

cxl::ConnectWorker::operator (): HrError(0x0000001e)' because of '[SV] Security Handshake failed to obtain SecurityContext for NetFT driver'

[QUORUM] Node 2: Fail to form/join a cluster in 6-7 minutes

[QUORUM] An attempt to form cluster failed due to insufficient quorum votes. Try starting additional cluster node(s) with current vote or as a last resort use Force Quorum option to start the cluster. Look below for quorum information,

[QUORUM] To achieve quorum cluster needs at least 2 of quorum votes. There is only 1 quorum votes running

[QUORUM] List of running node(s) attempting to form cluster: CAMHVS02, 

[QUORUM] List of running node(s) with current vote: CAMHVS02, 

[QUORUM] Attempt to start some or all of the following down node(s) that have current vote: CAMHVS01, 

join/form timeout (status = 258)

join/form timeout (status = 258), executing OnStop

SYSTEM LOG

Cluster node 'CAMHVS02' failed to join the cluster because it could not communicate over the network with any other node in the cluster. Verify network connectivity and configuration of any network firewalls.

Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates. .

The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 15000 milliseconds: Restart the service.

The Service Control Manager tried to take a corrective action (Restart the service) after the unexpected termination of the Cluster Service service, but this action failed with the following error: 
The service cannot be started, either because it is disabled or because it has no enabled devices associated with it.

For information, node 2 is the 2016 node. Windows Firewall is disabled on both nodes, and all interfaces can ping each other between nodes. There was one error in the validation wizard which may be related, but I assumed it was simply due to the different Windows versions:

CAMHVS01.gemalto.com
Device-specific module (DSM) Name   Major Version   Minor Version   Product Build   QFE Number
Microsoft DSM                       6               2               9200            17071


Getting information about registered device-specific modules (DSMs) from node CAMHVS02.gemalto.com.


CAMHVS02.gemalto.com
Device-specific module (DSM) Name   Major Version   Minor Version   Product Build   QFE Number
Microsoft DSM                       10              0               14393           0


For the device-specific module (DSM) named Microsoft DSM, versions do not match between node CAMHVS01.gemalto.com and node CAMHVS02.gemalto.com.
For the device-specific module (DSM) named Microsoft DSM, versions do not match between node CAMHVS02.gemalto.com and node CAMHVS01.gemalto.com.
Stop: 31/10/2016 09:03:47.
Indicates two revision levels are incompatible.
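As the 0x5b4 timeout suggests, the full cluster log from the failing node usually has more detail than the event log excerpts. A minimal sketch, assuming the FailoverClusters module and that the node is reachable (node name taken from the post):

# During a rolling upgrade the cluster stays in mixed mode; 2012 R2 level reports as 8
Get-Cluster | Select-Object Name, ClusterFunctionalLevel
# Collect the last 15 minutes of cluster log from the node that failed to join
Get-ClusterLog -Node CAMHVS02 -TimeSpan 15 -Destination C:\Temp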

Any help would be really appreciated.

Thanks


Performance Issue on Storage Spaces Direct (Server 2019) - Getting High Read and Write Latency


Hello All,

On S2D I am having a performance issue with high read and write latency. For the last few days it has been getting worse: IOPS are not constant, reaching thousands one second and dropping to hundreds the next, and the same is happening with read and write throughput. We had a performance issue earlier as well, but at least the IOPS were constant then. In Admin Center the IOPS and throughput graphs show spikes, and because of this the hosted VPSs hang and run slowly.

I have configured S2D with 4 nodes that have NVMe for caching and SSD for storage, as below:

Node 1 : 1x250 Nvme, 3x1TB SSD, Not Having Hyper-v role

Node 2 : 1x500 Nvme, 3x1TB SSD, Not Having Hyper-v role

Node 3 : 2X250 Nvme, 4x1TB SSD, Having Hyper-v role

Node 4 : 2X250 Nvme, 4x1TB SSD, Not Having Hyper-v role

Node 5,6,7 : Not having any SSD or Nvme for storage,  Only having Hyper-V role

All servers are connected with 10 GbE, and the VM files are stored on CSVs.
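To narrow down whether a particular physical disk is driving the latency, a rough sketch (assuming the standard inbox Storage module cmdlets, run on one of the storage nodes):

# Per-disk max latency and error counters, worst write latency first
Get-PhysicalDisk |
    Get-StorageReliabilityCounter |
    Select-Object DeviceId, ReadLatencyMax, WriteLatencyMax, ReadErrorsTotal, WriteErrorsTotal |
    Sort-Object WriteLatencyMax -Descending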

Please suggest how to resolve the issue.

Hyper-V failover cluster backup leaving .avhdx files


I am running a 4-node 2016 Hyper-V failover cluster with 30+ servers. I just noticed that one of my servers is leaving behind a .avhdx, a .avhdx.mrt and a .avhdx.rct file after every backup with DPM 2016.

The machine lists no checkpoints under Hyper-V, either in the GUI or in PowerShell. These files are being created when DPM backs up (which completes successfully). How do I figure out what is causing this, and how do I go about cleaning it up?
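For context (hedged, not specific to this cluster): 2016 backups create a recovery checkpoint that is merged away when the backup completes, and the .rct/.mrt files are part of resilient change tracking and normally stay next to the disk; a left-over .avhdx usually points at a checkpoint that was not merged. A minimal sketch, assuming the Hyper-V and FailoverClusters modules ("VMNAME" is a placeholder), to confirm from every node whether any checkpoint object still exists for the VM:

# Query every cluster node for checkpoints belonging to the affected VM
Get-ClusterNode | ForEach-Object {
    Get-VMSnapshot -ComputerName $_.Name -VMName VMNAME -ErrorAction SilentlyContinue
}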

Syncing updates between domain controllers, DHCP, SQL and IIS servers


Hello,

So I'm curious about having Windows updates synced between machines. From my understanding, if you have two domain controllers on the same domain (Server 2016) and they're set to Automatic Updates, they coordinate their reboots in some way so that one DC is always up and they don't end up out of sync with anything. Is that possible for DHCP, SQL and IIS servers if those are set up as a cluster or in a failover configuration? It's something I'm trying to research and am just not 100% sure on. I know there's SCCM (which I'm also looking at), but if I can just cluster/fail over the servers, set them to Automatic Updates and move on to other things, that would be ideal. Any help appreciated. Thanks in advance.
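Not mentioned above, but for the clustered cases the built-in mechanism that staggers updates node by node (draining and failing over roles so one node stays up) is Cluster-Aware Updating. A hedged sketch, assuming the ClusterAwareUpdating module and a placeholder cluster name:

# Preview which updates each node would install, then run a node-by-node updating run
Invoke-CauScan -ClusterName MYCLUSTER -CauPluginName Microsoft.WindowsUpdatePlugin
Invoke-CauRun  -ClusterName MYCLUSTER -CauPluginName Microsoft.WindowsUpdatePlugin -MaxFailedNodes 1 -RequireAllNodesOnline -Force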

Windows file share witness is not accessible after patching


Hi Experts,

Our Windows team has applied patches on the two nodes of a cluster. Post-patch, the file share witness is accessible from one server but not from the other.

After a deeper look, we could see that the extra security patches below had been applied on the server where the file share witness is not accessible.

KB3161949
KB3172729
KB3173424
KB3175024
KB4338824
KB4499165
KB4503290



We are not sure which patch is causing this problem, as we don't see any official MS doc on this. Please let us know if you have any information on this matter.
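For reference, a minimal sketch (assuming the FailoverClusters module; the share path shown is a placeholder, use the SharePath value returned by the first command) to confirm which share the cluster is pointing at and whether the affected node can reach it:

# Show the witness share the cluster is configured to use
Get-ClusterResource "File Share Witness" | Get-ClusterParameter SharePath
# From the node that cannot see the witness, test basic reachability of that share
Test-Path "\\fileserver\witness"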

Also, please advise if there is any forum where bug details can be checked quickly.

Many thanks in advance ! 

Regards,
Naren poosa

Updating a Hyper-V cluster


Hi,

I know it's not recommended to upgrade the OS, okay!

If we choose to upgrade the Hyper-V cluster nodes from 2012 R2 to 2016, what are the recommended steps?

- Remove the node from the cluster;
- Remove the roles;
- Upgrade to Windows 2016;
- Reinstall the roles;
- Add the node back to the cluster.

Or can we upgrade the node in place while it is still a member of the cluster? (See the sketch of the node-by-node path below.)
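For reference, a hedged sketch of the node-by-node (cluster OS rolling upgrade) path, assuming the FailoverClusters module; node and cluster names are placeholders, and each node gets a clean install rather than an in-place upgrade while it is a cluster member:

Suspend-ClusterNode -Name HV01 -Drain           # drain the roles off the node
Remove-ClusterNode  -Name HV01                  # evict it from the cluster
# ...clean-install Windows Server 2016 plus the Hyper-V and Failover Clustering roles on HV01...
Add-ClusterNode -Name HV01 -Cluster HVCLUSTER   # re-join the now mixed-mode cluster
# Only after every node is on 2016:
Update-ClusterFunctionalLevel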


Thank you.

Microsoft Hyper-V Cluster “CSV Auto Pause due to STATUS_IO_TIMEOUT (c00000b5)”


Hello Team,

We received event ID 5120 on a couple of our Hyper-V nodes which are in the cluster.

The error mentioned in the title shows up with the description below:

Cluster Shared Volume ‘Volume1’ (‘name’) is no longer available on this node because of ‘STATUS_IO_TIMEOUT(c00000b5)’. All I/O will temporarily be queued until a path to the volume is reestablished

All servers are running Windows Server 2012 R2. Another event ID generated at that time is 5217.

I would like to know if there are any current hotfixes for this. All the Hyper-V nodes were updated with the March 2019 security patches.

Can anyone please help me find a solution for this, and how to find the actual cause of the issue?
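For reference, a minimal sketch (standard Get-WinEvent, no extra modules) to pull the 5120 occurrences from each node's System log so the timestamps can be lined up with storage-path or network events:

# CSV auto-pause events (5120) are logged to the System log by FailoverClustering
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 5120 } | Select-Object TimeCreated, MachineName, Message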

Regards

SJ


Test-Clusterhealth


We have set up a 6-node cluster using an iWARP configuration with QL41262 25 GbE network adapters. I have enabled RDMA everywhere I can see; I did this via the Dell setup guide.

Running the Test-ClusterHealth command I get the results below.

I'm really struggling with the RDMA failures, and now I'm also getting SMB failures reporting disconnects. I'm getting multiple reports of lag and performance issues in guest VMs.

PS C:\Scripts> .\Test-Clusterhealth.ps1
Detected RDMA adapters: will require RDMA
******************** Basic Health Checks (3.6s)
All cluster nodes Up
Cluster node uptime:

PSComputerName Uptime
-------------- ------
S2D-NODE01     30d:00h:48m.50s
S2D-NODE02     0d:19h:47m.54s
S2D-NODE03     40d:16h:21m.24s
S2D-NODE04     0d:22h:14m.54s
S2D-NODE05     0d:02h:53m.07s
S2D-NODE06     2d:15h:45m.03s


Clustered storage subsystem Healthy
All pools Healthy
******************** Clusport Device Symmetry Check (2.1s)
********** Total
Pass with 72 per node
********** Disk Type
Pass with 60 per node
********** Solid/Non-Rotational Media
Pass with 12 per node
********** Enclosure Type
Pass with 6 per node
********** Virtual
Pass with none on any node
******************** Enclosure View Symmetry Check (4.1s)
********** Total
Pass with 6 per node
******************** Operational Issues and Storage Jobs (116.2s)
No storage rebuild or regeneration jobs are active
******************** Physical Disk Health (2.2s)
All physical disks are in normal auto-select or journal state
******************** Physical Disk View Symmetry Check (4.1s)
********** Total
Pass with 60 per node
******************** RDMA Adapter IP Check (8.9s)
*************** RDMA Adapter IP Check
********** Total
Pass with none on any node
*************** RDMA Adapter (Virtual) IP Check
********** Total
Fail

Count Name
----- ----
    6 S2D-NODE01
    4 S2D-NODE02
    4 S2D-NODE03
    6 S2D-NODE04
    6 S2D-NODE05
    7 S2D-NODE06


*************** RDMA Adapter (Physical) IP Check
********** Total
Pass with none on any node
******************** RDMA Adapters Symmetry Check (3.4s)
********** Total
Fail

Count Name
----- ----
    5 S2D-NODE01
    4 S2D-NODE02
    4 S2D-NODE03
    5 S2D-NODE04
    5 S2D-NODE05
    5 S2D-NODE06


********** Operational
Fail

Count Name
----- ----
    5 S2D-NODE01
    4 S2D-NODE02
    4 S2D-NODE03
    5 S2D-NODE04
    5 S2D-NODE05
    5 S2D-NODE06


********** Up
Fail

Count Name
----- ----
    5 S2D-NODE01
    4 S2D-NODE02
    4 S2D-NODE03
    5 S2D-NODE04
    5 S2D-NODE05
    5 S2D-NODE06


******************** SMB Connectivity Error Check - Connect Failures (2.4s)

PSComputerName RDMA Last5Min RDMA LastDay RDMA LastHour TCP Last5Min TCP LastDay TCP LastHour
-------------- ------------- ------------ ------------- ------------ ----------- ------------
S2D-NODE01                 0            0             0            0          20           0
S2D-NODE02                 0            0             0            0          10           0
S2D-NODE03                 0            0             0            0          13           0
S2D-NODE04                 0            0             0            0          12           0
S2D-NODE05                 0            0             0            0          10           0
S2D-NODE06                 0            0             0            0          14           0


******************** SMB Connectivity Error Check - Disconnect Failures (2.5s)
WARNING: the SMB Client is receiving RDMA disconnects. This is an error whose root
         cause may be PFC/CoS misconfiguration (RoCE) on hosts or switches, physical
         issues (ex: bad cable), switch or NIC firmware issues, and will lead to severely
         degraded performance. Additional triage is included in other tests.

PSComputerName RDMA Last5Min RDMA LastDay RDMA LastHour TCP Last5Min TCP LastDay TCP LastHour
-------------- ------------- ------------ ------------- ------------ ----------- ------------
S2D-NODE01                 0           16             0            0           3           0
S2D-NODE02                 0           11             0            0          11           0
S2D-NODE03                 0           17             0            0           3           0
S2D-NODE04                 0            8             0            0           1           0
S2D-NODE05                 0           12             0            0          12           0
S2D-NODE06                 0           18             0            0           6           0


******************** SMB CSV Multichannel Symmetry Check (2.5s)
********** Total
Fail

Count Name
----- ----
   16 S2D-NODE01
   10 S2D-NODE02
   18 S2D-NODE03
   12 S2D-NODE04
    8 S2D-NODE05
   14 S2D-NODE06


********** RDMA Capable
Fail

Count Name
----- ----
   16 S2D-NODE01
   10 S2D-NODE02
   18 S2D-NODE03
   12 S2D-NODE04
    8 S2D-NODE05
   14 S2D-NODE06


********** Selected & Non-Failed
Fail

Count Name
----- ----
   16 S2D-NODE01
   10 S2D-NODE02
   18 S2D-NODE03
   12 S2D-NODE04
    8 S2D-NODE05
   14 S2D-NODE06


******************** SMB SBL Multichannel Symmetry Check (2.6s)
********** Total
Fail

Count Name
----- ----
   10 S2D-NODE01
   10 S2D-NODE02
   16 S2D-NODE03
   10 S2D-NODE04
   10 S2D-NODE05
   10 S2D-NODE06


********** RDMA Capable
Fail

Count Name
----- ----
   10 S2D-NODE01
   10 S2D-NODE02
   16 S2D-NODE03
   10 S2D-NODE04
   10 S2D-NODE05
   10 S2D-NODE06


********** Selected & Non-Failed
Fail

Count Name
----- ----
   10 S2D-NODE01
   10 S2D-NODE02
   16 S2D-NODE03
   10 S2D-NODE04
   10 S2D-NODE05
   10 S2D-NODE06


******************** Virtual Disk Health (2.1s)
All operational virtual disks Healthy
PS C:\Scripts>
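For reference, a hedged sketch of the usual end-to-end checks to confirm RDMA is actually negotiated on each node (inbox cmdlets; adapter names vary per node):

# Is RDMA enabled on the NICs the cluster/SMB traffic uses?
Get-NetAdapterRdma | Format-Table Name, Enabled
# Does the SMB client consider those interfaces RDMA-capable?
Get-SmbClientNetworkInterface | Format-Table InterfaceIndex, FriendlyName, RdmaCapable, IpAddresses
# Are current SMB connections actually selecting RDMA-capable paths?
Get-SmbMultichannelConnection -IncludeNotSelected | Format-Table ServerName, Selected, ClientRdmaCapable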

Read Scale availability group


We are designing a new SQL farm in our company. They want HA/DR (so a WSFC cluster), but for certain databases we only need DR, not HA, because of the number of databases. These databases will be on other, separate VMs due to their size and for separation.

If you have a cluster with three nodes (two in one site for HA and one in another site for DR), what should you do for the databases with no need for HA? I had planned to put the non-HA databases in a read-scale availability group between two VMs outside the cluster to provide the DR capability.

But thinking about it, would it not make more sense to have the two VMs in the cluster, with no quorum votes, with just a synchronous, manual-failover availability group between these two VMs?
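For what it's worth, removing those two nodes' votes is a one-line change per node; a minimal sketch (assuming the FailoverClusters module; node names are placeholders):

# Strip the cluster vote from the two non-HA nodes so they cannot affect quorum
(Get-ClusterNode -Name "SQLDR1").NodeWeight = 0
(Get-ClusterNode -Name "SQLDR2").NodeWeight = 0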

Is there ever a situation where you would use a read-scale availability group when you already have a cluster available?

quorum


Hi,

If there are two nodes and both are up and running but the heartbeat is lost, how does each node decide which one will remain active?
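Not stated in the question, but the usual way a two-node cluster breaks this tie is a third, witness vote (file share or cloud witness), so the node that can still claim the witness keeps quorum. A minimal sketch, assuming the FailoverClusters module and a placeholder share path:

# Give the two-node cluster a file share witness as the tie-breaking vote
Set-ClusterQuorum -NodeAndFileShareMajority "\\fileserver\witness"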

Thanks

AD-Detached Cluster and Access to WSFC console (Access Denied with local admin or from any node except 1)


Hello,

Due to specific constraints in my environment, I've had to build an AD-detached WSFC to host a SQL AAG.

The cluster was configured with a specific user created on each machine of the cluster, so that the credentials are consistent on each machine despite the lack of AD.

The current setup is as follows:

3 VMs: two Windows 2016 servers for SQL (let's call them SQL1 and SQL2) plus one Windows 2016 server acting as a quorum node with only the WSFC role (called QRM01). There is no possibility of shared storage or a witness due to the AD-detached environment constraints, which is why a 3-node (node majority) configuration was chosen to achieve quorum and avoid split-brain.

The cluster was created by specifying the "ClusterUser" credentials.

The issue I encounter is the following :

I can manage the cluster ONLY from the QRM01 server, and only with the account the cluster was created under (ClusterUser).

If I try to manage the cluster using the WSFC mmc either running under a local admin on any node or under the ClusterUser account on one of the 2 SQL nodes, I get an Access Denied error : "Access is denied. (Exception from HRESULT: 0x80070005 (E_ACCESSDENIED))"

Get-ClusterAccess shows Full access for all my local admin users.

I'm sure it was working perfectly before adding the QRM01 server; I can't be sure whether it worked after that, but at the moment it doesn't.

Why can't I access my cluster with the local Administrator or ClusterUser accounts (both members of the local Administrators group) from my two SQL nodes?
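One thing worth double-checking (hedged, since it may already be in place): the workgroup/AD-detached cluster guidance requires UAC remote token filtering to be disabled on every node, otherwise remote management with local accounts gets access denied. A minimal sketch:

# On each node: allow full admin tokens for remote logons with local accounts
New-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System" -Name LocalAccountTokenFilterPolicy -Value 1 -PropertyType DWord -Force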

Thanks for your insights and help.


How to configure a mirror with 2 DC servers and 1 DR server in SQL Server 2016


Hi,

How do I configure a mirror with 2 DC servers and 1 DR server in SQL Server 2016?

Please suggest.

Server 2016 MSMQ cluster role - bind to multiple cluster IPs


We have a two-node cluster (Server 2016 Datacenter) where I've added the MSMQ cluster role. I have two networks this cluster can talk to, 192.168.0.0 and 10.10.0.0. The Message Queuing role has both cluster IPs added as dependencies. If I run netstat (netstat -abno | findstr 1801) I can see the 192.168.0.0 address is listening on port 1801 but not the 10.10.0.0 cluster IP. I've tried adding the "BindInterfaceIP" string to the registry key below but it doesn't change anything. I've gone as far as rebooting both nodes after making the registry changes. The firewall is completely turned off on both nodes. I feel like I'm missing something small to make the cluster listen on the 10.10.0.0 IP as well as the 192.168.0.0 IP. Has anyone seen this before or have an idea of what else to try?

Registry Key

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSMQ\Clustered QMs\<Clustered Message Queuing Name>\Parameters
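For reference, a hedged equivalent of the netstat filter above using an inbox cmdlet, to list exactly which local addresses have a listener on 1801:

# Show every listening endpoint on port 1801 and the owning process ID
Get-NetTCPConnection -LocalPort 1801 -State Listen | Select-Object LocalAddress, LocalPort, OwningProcess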

SQL Cluster node not working


Hi Team,

I have a two-node Server 2012 R2 SQL cluster.

Problem: on Server A, the D: drive cannot be brought online (see the screenshot below for reference), and the quorum drive on the server that holds the D: drive is also gone.


S2D down after adding hard drives


S2D newbie here. This is a test environment, so the hardware is probably not supported.

The setup

I'm running two OptiPlex 3050s with Windows Server 2016. They each have a spinning disk and an SSD via SATA. The spinning disk is partitioned for the operating system and the rest I put into the S2D pool. With this hardware I set up a failover cluster, with the quorum witness on my domain controller and the cluster's storage coming from the S2D pool. Everything was working well, but terribly slow.

The issue

To cure the slowness I decided to add a PCIe M.2 drive to each node. After adding it to one machine, the cluster and the S2D drive came back with an error, so I ran a repair inside Server Manager. After that completed I added the same drive to the other machine, and my S2D drive has been gone since. I've tried removing the last drive I added and rebooting each node several times, with no luck.

Errors

When I look at the critical events for the cluster's disk in Failover Cluster Manager, there are a lot of repeating event IDs: 5142, 1069 and 1793.
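Before trying further repairs, it may help to see what the pool, disks and any running jobs currently report; a rough sketch, assuming the inbox Storage module:

# Current state of the S2D pool, its physical disks, virtual disks and any repair jobs
Get-StoragePool -IsPrimordial $false | Format-Table FriendlyName, OperationalStatus, HealthStatus
Get-PhysicalDisk | Format-Table FriendlyName, MediaType, Usage, OperationalStatus, HealthStatus
Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus
Get-StorageJob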

Any help would be greatly appreciated. I'd like to see if I can fix this in a test environment before I see it in production.

many thanks!


IT guy

Cluster upgrade same computer name and ip


Hello All,

I am about to start a cluster upgrade from 2012 R2 to 2016. There are a lot of good guides out there and the process seems straightforward. I would like to know if it is OK to keep the same computer names and IPs for my servers once I add them back to the cluster.
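A hedged note rather than an answer: reusing the same names and IPs is the common pattern in the rolling-upgrade guides; the usual precaution is to clear the old cluster configuration from the evicted node before re-adding it under the same name. A minimal sketch (node name is a placeholder):

# Run on (or against) the evicted node to wipe its stale cluster state
Clear-ClusterNode -Name HVNODE1 -Force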

Thanks,

Scott

Can't bring the DAG cluster back online


Hi,

Good Day!

May I ask if anyone has encountered the error below? Please see the screenshot for reference.

Thanks,

Raymond

Partition information lost on cluster shared disk


Hi everyone,


We've got a cluster virtual disk where the partition table and volume name broke. Has anyone experienced a similar problem and got some hints on how to recover?


The problem occurred last Friday. I restarted node3 for Windows updates. During the restart, node1 had a bluescreen and also restarted. The failover cluster manager tried to bring the cluster resources online but failed several times. Finally the resource swapping came to a rest on node1, which came back up soon after the crash. Many virtual disks were in an unhealthy state, but the repair process managed to repair all disks, so they are now in a healthy state. We aren't able to explain why node1 crashed. Since the storage pool is in dual parity mode, the disks should be able to keep working even if only 2 nodes are running.

One virtual disk, however, lost its partition information.


Network config:

Hardware: 2x Emulex OneConnect OCe14102-NT, 2x Intel(R) Ethernet Connection X722 for 10GBASE-T

Backbone-Network: On the "right" Emulex network card (only members in this subnet are the 4 nodes)

Client-access teaming network: emulex "left" and intel "left" cards in team; 1 untagged network and 2 tagged networks


Software Specs:

    • Windows Server 2016
    • Cluster with 4 cluster nodes
    • Failover Cluster Manager + File Server roles running on the cluster
    • 1 storage pool with 36 HDDs / 12 SSDs (9 HDD / 3 SSD on each node)
    • Virtual disks are configured to use dual parity:

Get-VirtualDisk Archiv | Get-StorageTier | fl

FriendlyName           : Archiv_capacity
MediaType              : HDD
NumberOfColumns        : 4
NumberOfDataCopies     : 1
NumberOfGroups         : 1
ParityLayout           : Non-rotated Parity
PhysicalDiskRedundancy : 2
ProvisioningType       : Fixed
ResiliencySettingName  : Parity

Hardware Specs per Node:

  • 2x Intel Xeon Silver 4110
  • 9HDDs à 4 TB and 3 SSD à 1 TB
  • 32GB RAM on each node

Additional information:

The virtualdisk is currently in Healthy state:

Get-VirtualDisk -FriendlyName Archiv

FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach   Size

------------ --------------------- ----------------- ------------ --------------   ----
Archiv                             OK                Healthy      True           500 GB


The storagepool is also healthy:

PS C:\Windows\system32> Get-StoragePool
FriendlyName   OperationalStatus HealthStatus IsPrimordial IsReadOnly

------------   ----------------- ------------ ------------ ----------
Primordial     OK                Healthy      True         False
Primordial     OK                Healthy      True         False
tn-sof-cluster OK                Healthy      False        False


Since the incident the event log (of current master: Node2) has various errors for this disk like:

[RES] Physical Disk <Cluster Virtual Disk (Archiv)>: VolumeIsNtfs: Failed to get volume information for \\?\GLOBALROOT\Device\Harddisk13\ClusterPartition2\. Error: 1005.
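Since that error is about failing to read volume information, it may be worth capturing what the disk behind the "Archiv" virtual disk still reports for partitions and volumes; a hedged sketch, assuming the Storage module pipelines, run from the owner node:

# Dump partition and volume info for the disk backing the Archiv virtual disk
Get-VirtualDisk -FriendlyName Archiv | Get-Disk | Get-Partition | Get-Volume | Format-Table DriveLetter, FileSystemLabel, FileSystem, HealthStatus, Size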


Before the incident we also had errors that might indicate a problem:

[API] ApipGetLocalCallerInfo: Error 3221356570 calling RpcBindingInqLocalClientPID.


Our suspicions so far:

We made registry changes to SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0001 (through 0009) and set the value PnPCapabilities to 280 (disabling the checkbox "Allow the computer to turn off this device to save power"), but not all network adapters support this checkbox, so this may have had some side effects.



One curiosity: after the error we noticed that one of the 2 tagged networks had the wrong subnet on two nodes. This may have caused some of the failover role switches that occurred on Friday, but we're unsure about the reason, since they had been configured correctly some time before.

We've had a similar problem in our test environment after activating jumbo frames on the network interfaces. In that case we lost more and more filesystems after moving the file server role to another server. In the end all filesystems were lost and we reinstalled the whole cluster without enabling jumbo frames.

We now suspect that maybe two different network cards in the same network team may cause this problem.

What are your ideas? What may have caused the problem and how can we prevent this from happening again?

We could endure the loss of this virtual disk since it was only archive data and we have a backup, but we'd like to be able to fix this problem.

Best regards

Tobias Kolkmann


Event id 153 ONLY when host has a VM on it


I have a 3-node cluster with an iSCSI / MPIO CSV. This has been running for about a year and a half with no issues.

Host 1 is the current owner of the CSV. I verified that by going to Disk Management on host 2 and host 3, and seeing that for the CSV disk, they both have the 'disk is offline because of a policy set by an administrator' message. Host 1 does not have that message which makes it the owner.

Recently, whenever any VM operation is attempted on host 2 (such as starting a VM, live-migrating a VM, shutting down a VM, etc.), I get a non-stop flow of event ID 153, and whatever process was started takes longer than normal to complete. If the VM does start after a long delay, access to the VM is slow and choppy from the end user's standpoint.

If I migrate or shutdown all VMs on host 2, the 153 messages stop.

Host 2 itself is never slow or laggy. Only the VM operations are slow.

Host 1 and host 3 DO NOT have ANY event id 153. 
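For reference, event ID 153 is logged by the disk class driver when an I/O is retried, so lining up the occurrences on host 2 with the MPIO path state may show which path is timing out; a minimal sketch (inbox cmdlets plus the MPIO feature's mpclaim tool):

# Most recent disk-retry events on this host
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'disk'; Id = 153 } -MaxEvents 50 | Select-Object TimeCreated, Message
# Summary of MPIO disks, paths and load-balancing policy
mpclaim -s -d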

Does anyone have any ideas why this single node is displaying this behavior? 

Thanks in advance!
