Channel: High Availability (Clustering) forum
Viewing all 4519 articles

Cluster fail after majority failed


Hi, 

My topology is as the following:

HQ: ServerHQ1 / ServerHQ2

DR: ServerDR1 / ServerDR2

Witness: on Azure.

The link between the two sites is a 50 Mbps MPLS link.

I did the following test:

I shut down both servers in HQ and blocked the connection to the quorum witness. The cluster failed because the majority was lost, but after I restored the connection to the witness, the cluster remained in the failed state.

The cluster did not come back up until I restored the connection to the HQ site.

Why did the cluster remain in the failed state after I restored the connection to the witness?
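
The expectation behind this question can be written as plain vote arithmetic. The sketch below assumes one vote per node plus one for the witness (5 votes in total) and a static majority rule. It deliberately ignores dynamic quorum and any vote adjustments the cluster may make, so it only shows what the arithmetic would predict, not why the cluster actually stayed down. The function name is made up for illustration.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Static majority rule: strictly more than half of all configured votes."""
    return votes_online > total_votes // 2


# Assumed from the post: 4 nodes (2 in HQ, 2 in DR) plus 1 Azure witness.
TOTAL_VOTES = 5

# HQ nodes shut down and witness blocked: only the 2 DR nodes can vote.
during_test = has_quorum(votes_online=2, total_votes=TOTAL_VOTES)

# Witness reachable again: the 2 DR nodes plus the witness.
after_witness_restored = has_quorum(votes_online=3, total_votes=TOTAL_VOTES)

print(during_test, after_witness_restored)
```

By this arithmetic alone, 3 of 5 votes would be a majority once the witness is back, which is why the observed behavior is surprising.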


Cluster aware updating - server 2012 R2


Hi All, 

I have a question regarding cluster aware updating, since we would like to use that in our company. 

To test this, we created a test cluster with 2 hosts running Windows Server 2012 R2 and then added 2 VMs on each host.

Running the command below seems to work fine, except that when the last host finishes updating, it takes all 4 VMs, leaving the other host empty. Is that normal, or is there a way to get CAU to balance the VMs back to the way they were before starting CAU? Would that be done with PowerShell?

Invoke-CauRun -MaxRetriesPerNode 3 -FailbackMode Immediate -MaxFailedNodes 0 -RebootTimeoutMinutes 15 -RequireAllNodesOnline -ClusterName ************

Allow failback has also been enabled on the cluster and set to Immediately.

Thanks, Martin



Network load balancing trouble when adding another virtual machine to the cluster

I have Windows Server 2008 R2 Hyper-V with two virtual machines, each with one network adapter (first VM: IP 192.168.99.36, GW 192.168.99.1; second VM: IP 192.168.99.38, GW 192.168.99.1; NLB cluster IP 192.168.99.40, GW 192.168.99.1). I installed NLB on both VMs with 'Enable spoofing of MAC addresses' enabled. NLB is in unicast mode. I added one VM to the NLB cluster without a problem and the NLB cluster works fine, but when I add the other VM, that VM loses network connectivity and the NLB cluster stops working. How can I resolve this problem?

Symptom: cluster server hangs after applying the August MS security patches


1. A few minutes ago, the active cluster node for the SQL service hung, so I powered it down manually.

2. Confirmed that the SQL service failed over to the passive node normally.

3. Other servers running IIS have also shown the same behavior at random.

4. The only recent change on those servers is the MS security patches released in August.

5. Confirmed that there were no changes to hardware components.

6. The database is not heavily used.

Are there any tools or methods to figure out or trace why this happens on those servers?


Can't connect to cluster from Cluster Nodes Hyper-V Cluster Server 2016



Hi,

I have set up a 2-node Server 2016 Hyper-V cluster and passed the validation wizard. I have a team switch created from 4 NICs, Windows Firewall is disabled on both nodes, and no antivirus software is installed. I confirmed network connectivity. But while performing some tests I encountered the following issue:

When the team switch is disabled or there is a network failure, cluster communication is lost. Once the switch is enabled again or the network connection is restored, we are unable to manage the cluster from the local cluster nodes. Remote cluster management, however, does work.
The error that is occurring is:

Error   FailoverClustering-Manager   4683   MMC Snapin
Failover Cluster Manager failed while managing one or more cluster. The error was 'An attempt to connect to the cluster failed due to one or more nodes not responding to WMI calls. This is usually caused by a problem with the WMI infrastructure on the node(s).

The following is a list of the nodes that encountered this problem when the connection to the cluster was attempted:

NODE2
  '. For more information see the Failover Cluster Manager Diagnostic channel.

And this error

Error   DistributedCOM   10028   None
DCOM was unable to communicate with the computer SVR3-NYC1 using any of the configured protocols; requested by PID     214c (C:\Windows\system32\mmc.exe).

The only way for me to resolve this issue is to reboot both nodes.

All my searches suggest that it's a WMI bug, but the problem with that suggestion is that I CAN connect to the cluster remotely without a problem. Here are some articles with similar issues, but none seem to resolve mine. The only way I am able to resolve it is by rebooting, but I can't just reboot in a live environment:

https://social.technet.microsoft.com/Forums/en-US/99aa09c9-6d68-4e9a-bb20-9b34a468eb42/unable-to-connect-to-cluster-using-failover-cluster-manager?forum=winserverClustering

https://blog.workinghardinit.work/2017/09/08/an-error-occurred-connecting-to-the-cluster/

https://blogs.msdn.microsoft.com/clustering/2010/11/23/trouble-connecting-to-cluster-nodes-check-wmi/

https://sqlsanctum.wordpress.com/2016/09/21/failover-cluster-manager-connection-error-fix/

https://community.spiceworks.com/topic/639445-connecting-to-server-2012-hyper-v-cluster-throws-the-rpc-server-is-unavailable
 

Hyper-V Stretched Cluster with Storage Replica, Unplanned failover


Hi,

We are using a Windows Server 2016 Hyper-V stretched cluster between two sites.

When we fail over a site, it fails over correctly. But when the site is restored, the failed-over CSVs do not fail back to their original site, and we have to move the cluster group manually.

Jiwan Sharma

Windows Failover Clustered IP question


Hello,

Over this past weekend, I created a test environment in my home lab. This environment contains a 2-node Windows Server 2016 Failover Cluster, with each node having SQL Server 2017 Developer Edition installed and one AlwaysOn Availability Group created and configured. There are actually three Windows Server 2016 machines in this test lab: one domain controller and the 2 clustered nodes. Here's my question:

While creating the 2-node WSFC, I remember there was a screen asking me to enter the Windows Server cluster IP, so I provided one. While creating the AlwaysOn Availability Group, there was also a screen asking me to enter the AG name and an IP, so I again provided a static IP, different from the WSFC IP. Once I got everything set up and running, I was able to connect to the SQL Server 2017 instance using only the AG cluster name, which is the whole point behind an HA environment. So in SSMS, I connect to the AG using only the AG cluster name. Everything works beautifully, even after a manual failover to my secondary replica. My question is: what is the static IP for the WSFC used for (the one you are asked for when you set up the WSFC)? When do I use it?

Thanks

Cluster validation failure : On List disk


We are running a Windows Server 2016 cluster with a SQL Server 2016 cluster installed. I now have a 4-node cluster.

Originally there were two failover instances installed in the cluster, and each instance had 3 cluster disk resources. Today I removed one of the SQL instances by removing it from the cluster nodes one by one. The instance was removed successfully.
Then I tried to install another SQL Server failover instance, but got this error on the rule check page:

"
Rule "Microsoft Cluster Service (MSCS) cluster verification errors" failed.

 The cluster either has not been verified or there are errors or failures in the verification report. Refer to KB953748 or SQL Server Books Online for more information.
"

Then I ran the cluster validation, and it failed with the following error:

I found that ONE of the cluster disks is still ONLINE on one of the nodes; however, the Failover Cluster Manager view shows it as offline.

Is there any workaround? What is the reason for this?


Exchange Cluster Service offline

Hi all,

We have Exchange Server 2013 running a DAG. We have an issue with cluster services because the cluster name is not coming online, and the cluster event log shows the error below:
" Cluster network name resource 'Cluster Name' cannot be brought online. Ensure that the network adapters for dependent IP address resources have access to at least one DNS server. Alternatively, enable NetBIOS for dependent IP addresses "

What we have tried is removing the cluster's A record from DNS, creating it again with all permissions, and setting the option to allow any authenticated user to update DNS.

The computer name object (CNO) for the cluster is now online.

Any suggestions?

Cannot connect to the cluster name: You do not have administrative privileges on the cluster, contact your network administrator to request access. Error Code: 0x80070005 Access is Denied


Cannot connect to the cluster name.
You do not have administrative privileges on the cluster, contact your network administrator to request access.

Error Code: 0x80070005
Access is Denied

Screenshot attached.

We are facing this issue on both nodes. Kindly help us fix it.





CPU Usage: 50% or above on all Server 2016 Hyper-V Clustered nodes with 3 VM's


Hello,

We have a 3-node Server 2016 cluster based on Dell R730 servers. All hosts show a CPU usage of 50% or above, while Task Manager shows the correct value (around 0%).

Only 3 small test VMs are running on those hosts. Has anyone seen this behavior?

We would like to resolve this.

Regards,

Jan



2016 Hyper-V Cluster - Balancer not migrating machines when over the threshold.


Hi,

We have a 2016 Hyper-V cluster set up with the Balancer setting at Medium. According to info online, this means it should migrate VMs off hosts when they hit 70% utilization. However, this is not happening.

We applied Windows updates yesterday, so the hosts were drained for rebooting. Today the load has not evened out.

Here are the current RAM usage listings:

Free RAM   Total RAM   Percent Used
120        256         53%
53         256         79%
42         256         84%
280        575         51%
496        575         14%

So why is the cluster not migrating things off Host 2 and 3 since they are both over the 70% utilization threshold?  Or am I misunderstanding how the balancer is supposed to work?
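
The listing above can be read as free RAM, total RAM and percent used per host, and the percentages can be rechecked with a few lines. This is only a sketch: the 70% figure is the threshold the post attributes to the Medium setting, the helper names are made up for illustration, and it says nothing about how the balancer itself decides when to move VMs.

```python
# (free RAM, total RAM) per host, in the order listed in the post; units as reported there.
HOSTS = [(120, 256), (53, 256), (42, 256), (280, 575), (496, 575)]

# 70% is the threshold the post attributes to the Medium balancer setting.
THRESHOLD_PERCENT = 70


def percent_used(free: float, total: float) -> float:
    """Share of RAM in use, as a percentage of the total."""
    return (total - free) / total * 100


def hosts_over_threshold(hosts, threshold=THRESHOLD_PERCENT):
    """Return 1-based host numbers whose RAM usage exceeds the threshold."""
    return [
        number
        for number, (free, total) in enumerate(hosts, start=1)
        if percent_used(free, total) > threshold
    ]


if __name__ == "__main__":
    for number, (free, total) in enumerate(HOSTS, start=1):
        print(f"Host {number}: {percent_used(free, total):.0f}% used")
    print("Over threshold:", hosts_over_threshold(HOSTS))
```

The recomputed values match the percentages in the listing, and only hosts 2 and 3 exceed 70% of RAM in use.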

Thanks!

VM Cluster


I was reading the MS documentation on Failover Clustering: https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-R2-and-2012/dn265972(v%3dws.11). There is this note that I don't understand: "Make sure that there is not more than one virtual machine in a virtual machine clustered role. Starting with Windows Server 2012, we do not support this configuration. An example of this scenario is where multiple virtual machines have files on a common physical disk that is not part of Cluster Shared Volumes. A single virtual machine per clustered role improves the management experience and the functionality of virtual machines in a clustered environment, such as virtual machine mobility."

Can anyone elaborate on this for me, please?

tia,

gix



Can Cluster-Aware Updating be configured to accept that certain virtual machines, which are OFF, should be left on their current node during update run?


I have a few Hyper-V virtual machines in a 3-node Windows Server 2016 Failover Cluster that I do not want to move from their current node.

I want to just shut them down manually before any maintenance. But neither CAU nor the Failover Cluster Manager node Pause/Drain seems to accept that they should just be left where they are in their Off state.

Is there a way I can tell CAU or the Failover Cluster to leave them where they are?

I know that I can Choose Pause/No-Drain in Failover Cluster Manager, but there does not seem to be a similar option in CAU.

Shadow Copies on 2012 R2 File Server Cluster


Hello all!

I've inherited a two node physical 2012 R2 file server cluster that contains a few SMB shares on a single clustered disk.  I'd like to enable shadow copies for this shared disk but want to store the shadow copy data on its own shared disk as per shadow copy best practices (at least on non-clustered file servers).

I created a second cluster disk, assigned it to the same resources as the SMB clustered disk. 

Historically I've enabled shadow copies through Computer Management, but I want to ensure the shadow copies are cluster aware. So in Failover Cluster Manager I open Storage | Disks, right-click the SMB cluster disk, and click the Shadow Copies tab. From the tab I can enable shadow copies, which sounds like what I'm looking for. Unfortunately, it does not give me an option to choose a disk/volume to store my shadow copy data, so this can't be right.

My next step was to connect Computer Management to the cluster virtual server name (NOCFS4). Under System Tools | Shared Folders | All Tasks | Configure Shadow Copies it shows me the correct number of shares on the SMB cluster disk, and I can see the Settings button for configuring the location and size limit for shadow copies. However, once I tweak the shadow copy location and size settings, click OK, and click Enable to turn on the shadow copies, I get a long pause and an error about not being able to create a schedule. So it seems connecting to the cluster virtual object is not the answer either.

That leaves using computer management to connect to one of the physical server nodes of the cluster. When I open the shadow copy interface on the physical node I note that it shows a 0 for the number of shares it detects on the SMB clustered disk. This doesn't surprise me since this interface isn't cluster resource aware.

So I'm stuck. Does anyone know the "Microsoft way" to enable shadow copies on a clustered disk while storing the shadow copy data on a second cluster disk both attached to the same cluster resource?


MSCluster_service using c#


I have Windows Failover Clustering deployed, with one or more clusters and the services and storage associated with them.

I am looking for some C# examples of how to report on the above. I am trying to report on the cluster name, the server names in the cluster, the services associated with the cluster, and the storage nodes associated with the cluster.

Are there any useful examples that I can refer to?  

Thank you.

HA for mail relay server


We currently have a mail relay server that is configured to receive messages from different applications and forward them to O365 using a connector. It is running the SMTP service and IIS.

What HA solution should I use?

Are there any gotchas that I should look out for?

Any help or advice would be appreciated.

Windows Failover Cluster Best Practice - Should the Cluster Name and Resource Name be part of an OU where Baseline GPOs are applied


Hi All,

For Windows Failover Cluster, what is the best practice?

Should the cluster name and resource name be part of an OU where baseline GPOs are applied, or not?

Taking > 5 Minutes to Connect to SQL Server from Remote Computers After a Fail-over


Hi, all.

I want to bring this up front before I begin.  I am, by no means, an expert at clustering, or SQL Server.  However, I want to try and figure out why it takes so long for our tools to connect to the database server after a fail over.

Background:

OS: Server 2008

SQL Server 2005

2-node cluster server with NetApp SAN (connected to each node using an iSCSI connection)

Observations:

  • Happens whether fail-over is REAL or simulated (simulated being through the cluster manager).
  • When fail over occurs, all services go offline.. and then come back online in less than a minute (~45-60 seconds).
  • Active node does switch successfully when fail over-occurs.
  • I can see the cluster storage / SAN (storage drive, Quorum, etc.) on the active node.
  • I am able to access the database locally, on either node, using Enterprise Management Studio.
  • What I can't do is access the database from another computer outside the network for a certain period of time.
  • Time that the database is unavailable can range from 6 minutes to up to 13 minutes.  However, I think the time differences here are simulated fail-over events (cluster manager) vs. forced fail-over events (shutting down active node).
  • Using NETSTAT -A, I could not see anything as far as lost packets or collisions.
  • Ping test is fine from server to clients.
  • All clients are outside of the subnet, but within the factory network.
  • When the clients are connected to the server's database, there are no issues with connectivity.
  • This "pause" in DB connection from outside clients was not happening when it was 1st installed roughly a year ago (So, I am GUESSING it's not a setup or configuration issue).

I can't just assume this is a network issue because my equipment is always "guilty until proven innocent."  At least in the IT department's eyes.. :), so I want to exhaust all possibilities that this could be an issue with the server itself.

Can anyone think of any more tests to try to rule out the server?  I'm all ears.  Please remember, I am not an expert, so sorry if I ask how to do half of the actions you suggest... :)
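
One possible test is to measure the outage window from a remote client rather than estimate it. The sketch below simply retries a TCP connection until it succeeds and reports how long that took. The host name is a placeholder, and port 1433 is only the SQL Server default, so both are assumptions to replace with the clustered instance's real network name and port. It checks TCP reachability only, not whether a database login works.

```python
import socket
import time


def seconds_until_reachable(host, port, give_up_after=900.0, interval=5.0, connect_timeout=3.0):
    """Retry a TCP connect until it succeeds; return elapsed seconds, or None on give-up."""
    start = time.monotonic()
    while True:
        try:
            # A successful connect means the port is accepting connections again.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return time.monotonic() - start
        except OSError:
            # Covers refused connections, timeouts and name-resolution failures.
            pass
        if time.monotonic() - start >= give_up_after:
            return None
        time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical name and port: replace with the clustered SQL Server network name
    # and the port the instance listens on (1433 is the SQL Server default).
    elapsed = seconds_until_reachable("sqlcluster.example.local", 1433)
    print("never reachable" if elapsed is None else f"reachable after {elapsed:.0f} s")
```

Started on a client right after a fail-over, and again on a machine inside the server's subnet, it would show whether the delay depends on where the client sits.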

Thank you so much for any suggestions!

Best Regards,

Bill

CAU error when previewing updates


I am attempting to get CAU working for a 2-node HA cluster, both nodes running 2012 R2. I get the following error when I use "Preview updates for this cluster" in CAU:

The Microsoft.WindowsUpdatePlugin plug-in reported a failure while attempting to scan for applicable updates on node "<NETBIOS NAME>". Additional information reported by the plug-in: (ClusterUpdateException) There was a failure in a Common Information Model (CIM) operation, that is, an operation performed by software that Cluster-Aware Updating depends on. The computer was "<NETBIOS NAME>", and the operation was "ScanUpdates[Info,CauNodeWCD[<NETBIOS NAME>]]". The failure was: (CimException) Internal server error (500).  HRESULT 0x801901f4 ==> (CimException) Internal server error (500).  HRESULT 0x801901f4

The above error is repeated for both nodes (i.e. one <NETBIOS NAME> per node)

The "analyse cluster updating readiness" check comes back all green (except a yellow warning for the machine proxy, which won't apply to the node servers).  Windows updates run just fine when I run them manually.

Any ideas?



