Channel: High Availability (Clustering) forum
Viewing all 4519 articles

Clustered VM

Hi!

Is it possible to have a highly available VM also clustered so that if the VM itself goes down the clustered copy comes online?

Thanks.


Loopback adapters and DSR: DAG Cluster node--which is not Cluster Host--crashes when another node restarts

An all-hardware Exchange 2010 SP3 UR4 DAG cluster is having an issue when the Microsoft Loopback adapter is installed (from Device Manager...Add Legacy Hardware) to support DSR operations with hardware load balancer (HLB).

  • The HLB provides the HA endpoint for RPC Client Access, SMTP, etc. DSR is required to preserve the source IP, on which Exchange receive connectors that filter by source IP for security depend.
  • It is a five-server DAG, with 3 x production servers at the datacenter and 2 x DAG DR servers located in a DR site.
  • Only the 3 x production servers at the main site have the loopback adapter installed.
  • The loopback-DSR-specific settings like 'weakhostreceive', etc. are in effect.

The problem only involves the 3 servers in the DAG with loopback adapters.

The issue is that when a DAG member restarts, it sometimes causes the online production cluster node that is not the Cluster Host Server to fail. Consider:

  • DAGNode1, Loopback enabled, Healthy, Is Cluster Host Server
  • DAGNode2, Loopback enabled, Healthy
  • DAGNode3, Loopback enabled, is Restarted

In this scenario, the cluster service on DAGNode2 will experience a loss of network connectivity when DAGNode3 rejoins the cluster (DAGNode2 reports cluster failure on all other nodes) and shortly afterwards the Cluster Service on DAGNode2 will terminate. FailoverClustering 1572 is seen on DAGNode2:

Node 'DAGNode2' failed to join the cluster because it could not send and receive failure detection network messages with other cluster nodes. Please run the Validate a Configuration wizard to ensure network settings. Also verify the Windows Firewall 'Failover Clusters' rules.

Interestingly, if you disable the Loopback on DAGNode3, DAGNode2 will immediately rejoin the cluster! Re-enable the Loopback on DAGNode3 and DAGNode2 immediately fails again! With some more server restarts possibly, you get a stable cluster again with Loopback enabled on all production nodes. The status of the loopback (enabled or not) on the Cluster Host does not impact this issue.

As I mentioned, this occurs only on some restarts; usually there is no problem. Also note that the Loopback network/adapters do not appear in Cluster Manager and are not listed as cluster networks with cluster.exe. The Cluster Validation Wizard passes everything except for noting that every node has a duplicate IP on an installed adapter.

Looking for others with experience that have combined DSR-based HLB with CAS/Hub/MBX DAG Cluster on same Exchange computers and were able to use reliably.

There is an unanswered thread from 2010 on this topic:

http://social.technet.microsoft.com/Forums/windowsserver/en-US/7616b0e5-6fb6-4be7-a859-14baa2e9b925/cluster-network-is-partitioned-due-to-loopback-adapter?forum=winserverClustering

Some questions / any answers are very welcome!

  • Can I add the Loopback adapter to the cluster configuration so that I can use Cluster.exe to ignore the loopback adapter?
  • Can I prevent other cluster nodes from seeing the loopback adapters in the other nodes? Is there an ‘ignore partner adapter’ setting?

Thank you!


John Joyner MVP-SC-CDM

P.S. I add this information 3/1/2014:

The link below suggests that if you allow the cluster network to partition, it will discover the loopback adapters and they will appear in Cluster Manager. (I did this by enabling IPv6 on the Loopback adapter, which made the Loopback network appear in Cluster Manager. I then used Cluster.exe to set IgnoreNetwork=$true on the Loopback network.) Result: no change. Enabling Loopback on the third production node, which is not the Cluster Group host, still caused a cluster communication outage.

http://social.technet.microsoft.com/Forums/windowsserver/en-US/311f7763-9f72-4dfe-bb35-3fd1a1dc567c/adding-additional-network-adapters-to-a-cluster?forum=winserverClustering

Developed: A workaround!

1. Just before restarting a node, after drain stop in NS, and after running StartDAGServerMaintenance.PS1 (which pauses the node in Cluster Manager) disable the Loopback Adapter so that when the computer restarts, Loopback is disabled.

2. After node restarts and rejoins cluster in Paused status, and after running StopDAGServerMaintenance.PS1, issue this command to move the Cluster Group host to the computer that was restarted and has the Loopback disabled.

cluster <clustername> group "Cluster Group" /moveto:<nodename>

3. Then safely enable the Loopback on the computer that was restarted and is now the Cluster Group host.

4. Then take the computer out of drain stop in the NS.

This of course only applies to controlled restarts.

In the event of an unexpected server crash and recovery, there is nothing stopping this from happening when the crashed server restarts. Still need a real fix! If you know how to defuse the situation when it happens (disable loopback on the production node that is not the Cluster Group host), the condition clears immediately. You can then fix it with steps 2 and 3 of the workaround.

"Optimize" a CSV volume

I have a Hyper-V 2012 cluster with 3 nodes. I use a CSV volume to store the VMs, and all the latest patches are installed. An EqualLogic SAN (PS4000) with the latest firmware provides the LUN for the CSV. Everything in my environment is supposed to support re-thinning (or unmapping, or whatever the right term is) the LUN.

I have about 500 GB of unused space on the 2 TB volume, and the volume was thin provisioned. A restore of some very large VHD files from backup caused the thin-provisioned volume to grow and, at one point, to use almost the entire volume, but the corrupt VHDs have since been deleted. Now I have "dirty" blocks in the LUN that I want to reclaim as free space on the SAN.

This all happens, apparently, when Server 2012 performs an "Optimize" on the disks. In my environment this is scheduled to happen once a week. It apparently did do something this last week, because my volume utilization on the SAN went from 96% to 91%. Not even close to reclaiming all dirty blocks, but it's a start, I guess. So I went into the "Defragment and Optimize Drives" utility and told it to start a manual optimization. Nothing happens, and Event Viewer gives me this error:

The volume VMStorage1 (C:\ClusterStorage\Volume1) was not optimized because an error was encountered: CSVFS failed operation as volume is not in redirected mode. (0x8007174F)
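As a reading aid for the code in that message, here is a minimal Python sketch that splits the HRESULT 0x8007174F into its standard bit fields. The function name `decode_hresult` is made up for illustration; the only assumption is the documented HRESULT layout (failure bit, 11-bit facility, 16-bit code). It does not explain why the optimize failed, only what the number encodes.

```python
FACILITY_WIN32 = 7  # HRESULTs in this facility wrap a plain Win32 error code


def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into failure flag, facility, and code."""
    return {
        "failure": bool((hresult >> 31) & 0x1),  # top bit set means failure
        "facility": (hresult >> 16) & 0x7FF,     # 11-bit facility field
        "code": hresult & 0xFFFF,                # low 16 bits: the error code
    }


fields = decode_hresult(0x8007174F)
print(fields)
if fields["facility"] == FACILITY_WIN32:
    print("Win32 error code:", fields["code"])
```

The low word, 0x174F, is decimal 5967, so the event is reporting Win32 error 5967 wrapped in an HRESULT. On a Windows machine, `net helpmsg 5967` should print the system's text for that code.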

So my questions are these:

Shouldn't it put the CSV in redirect mode if it needs to do this in order to optimize the drive automatically?

If it can't do this automatically, how did it return 5% of the CSV SAN volume to free space last week?

Can I put the volume in redirect mode manually and do the optimize manually? Redirect mode is not supposed to be necessary in 2012 CSV any more, at least not for backup.  Why here?

Will my environment re-thin, unmap, whatever?  It appears it MIGHT.  Does it take several iterations (i.e., weeks)?

Can anyone explain this incredibly vague and cloaked process from a Windows server 2012 perspective?

Thank you for any help!

DML


DLovitt



Node failed to join the cluster because it could not send and receive failure detection network messages

One of my customers has a Windows Server 2008 R2 cluster for an Exchange 2010 Mailbox Database Availability Group.  Lately, they've been having problems with one of their nodes (the one node that is on a different subnet in a different datacenter) where their Exchange databases aren't replicating.  While looking into this issue it seems that the problem is the Network Manager isn't started because the cluster service is failing.  Since the issue seems to be with the cluster service, and not Exchange, I'm asking here. 

When the cluster service starts, it appears to start working, but within a few minutes the following is logged in the system event log.

FailoverClustering

1572

Critical

Cluster Virtual Adapter

Node 'nodename' failed to join the cluster because it could not send and receive failure detection network messages with other cluster nodes. ...

It seems that the problem is with the 169.254 address on the cluster virtual adapter.  An entry in the cluster.log file says: Aborting connection because NetFT route to node nodename on virtual IP 169.254.1.44:~3343~ has failed to come up. 
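For anyone scanning a cluster.log for similar entries, the following Python sketch pulls the address and port out of the quoted message and checks whether the address is in the IPv4 link-local range (169.254.0.0/16). The regular expression and the name `netft_endpoint` are assumptions made for this example, based only on the single log line quoted above; other log lines may be formatted differently.

```python
import ipaddress
import re

# The entry quoted from cluster.log in the post above.
LOG_ENTRY = ("Aborting connection because NetFT route to node nodename "
             "on virtual IP 169.254.1.44:~3343~ has failed to come up.")

# Matches "a.b.c.d:~port~" as it appears in the quoted entry.
ENDPOINT = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):~(\d+)~")


def netft_endpoint(entry: str):
    """Return (IPv4Address, port) from a log entry, or None if absent."""
    match = ENDPOINT.search(entry)
    if match is None:
        return None
    return ipaddress.ip_address(match.group(1)), int(match.group(2))


address, port = netft_endpoint(LOG_ENTRY)
print(address, port, address.is_link_local)
```

The check only confirms that the virtual IP is a link-local address and that the port in the entry is 3343; it says nothing about why the route failed to come up.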

In my experience, you never have to mess with the cluster virtual adapter.  I'm not sure what happened here, but I doubt it has been modified.  I need the cluster to communicate with its other nodes on our routed 10. network.  I've never experienced this before and found little in my searches on the subject.  Any idea how I can fix this?

Thanks,

Joe


Joseph M. Durnal MCM: Exchange 2010 MCITP: Enterprise Messaging Administrator, Exchange 2010 MCITP: Enterprise Messaging Administrator, MCITP: Enterprise Administrator

Nics for a Hyper-V cluster

Hello!

http://alexappleton.net/post/44748523400/step-by-step-configuration-of-2-node-hyper-v-cluster-in

Each server has a total of 8 NICs and they will be used for the following:

1 – Dedicated for management of the nodes, and heartbeat
1 – Dedicated for Hyper-V live migration
2 – To connect to the shared storage appliance directly
4 - For virtual machine network connections

Tell me please whether I'm right or not:

1) 1 – Dedicated for management of the nodes, and heartbeat - it's not good: it's better to have a separate NIC dedicated to heartbeat and a NIC dedicated to management.

2) I can use the same NIC(s) for node management and for virtual machine network connections.

Thank you in advance,

Michael

UPS Shutdown of cluster

We're setting up a simple HyperV cluster, just 3 Nodes, and it occurred to me to give some thought to unattended shutdown during extended power failure.

If I understand the workings of the cluster correctly, simply having each host independently shutdown at 50% battery (or whatever) is not going to produce a graceful shutdown of the cluster itself.

Is there any common way to link the shutdown signal from the UPS software to the cluster shutdown command (and then subsequently shutdown the machine itself)?

HOW TO CONFIGURE SEPARATE NETWORK FOR HEARTBEAT FOR ALWAYS ON CLUSTERING

I need to configure a separate network for heartbeat communication (Server 2012, SQL 2012, file share witness). How can I do this?

Thank you for your help.

SQL Server Failover Clustering error

Hi All,

I am currently setting up a 2-node SQL Server cluster, but I am getting an error when doing the failover test from Node2 to Node1.

Here is the quick overview of what I have so far.

1. Set up the failover cluster for both nodes: public and private network, cluster disks for Quorum, MSDTC and SQL, etc.

2. Ran configuration validation before creating the cluster. The validation report completed successfully with no errors or warnings.

3. Created the cluster, created the MSDTC cluster, and installed SQL Server on both nodes.

Now I am running failover tests to check whether cluster resources fail over from Node1 to Node2 and from Node2 to Node1.

Failover Test: Active Node is Node1.

1. Disable Public network on Node1.

2. Failover to Node2 -> successful

3. Enable Public network on Node1.

Problem:

After the failover to Node2, I tried to fail back the resources from Node2 to Node1 by disabling the public network on Node2 (which is the active node after the failover from Node1 to Node2), but the cluster resources won't fail back to Node1.

Failback from Node2 to Node1 -> failed

1. Disable Public network on Node2.

2. Failback to Node1 -> failed

                  - Cluster Name and Cluster IP ->failed

                  - SQL cluster group (SQL name, SQL IP address, Analysis, SQL server and SQL Server Agent) ->failed

                  -MSDTC cluster group -> failed back successfully to Node1

3. Enable Public network on Node 2.

4. Manually online Cluster Group and SQL cluster group

I tried to manually bring the Cluster Group and SQL cluster group online, but they CANNOT come online unless I enable the Public network on Node2. I checked the cluster event log and found Event ID 1077, 1069, and 1205 errors.

Here are some of the logs on the cluster events.

Event ID 1069: The Cluster service failed to bring clustered service or application 'SQL_Group' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

Event ID 1205: Cluster resource 'SQL IP Address 2 (db-vip)' in clustered service or application 'SQL_Group' failed.

Has anyone experienced the same issue before? I would appreciate it if someone could point me in the right direction to resolve the issue.

Thanks in advance for your feedback.

BTW, failover and failback works perfectly when I try to reboot the Active node. Resources failed over successfully from Node1 to Node2 and vice versa when I reboot the server.

Thanks again.

Regards,

Ivan


File Share Witness Resource Errors in a SQL 2012 AlwaysOn Availability Group Environment

Hi, I am getting the following errors in WFC Manager and in my system event log:

Event ID 1564:

File share witness resource 'File Share Witness' failed to arbitrate for the file share '\\SQL2012ClusterWitnessPath'. Please ensure that file share '\\SQL2012ClusterWitnessPath' exists and is accessible by the cluster.

Event ID 1069: 

Cluster resource 'File Share Witness' of type 'File Share Witness' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Event ID 1205:

The Cluster service failed to bring clustered service or application 'Cluster Group' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

These errors showed up every hour on the hour and then suddenly stopped.  I tried looking at the cluster.log file but there wasn't anything recorded there.  The file share witness shows as online and my AG did not fail over to another node.  The cluster has read and write permissions to the share.  I did not find any error messages about the witness share on the remote server.

I am wondering what caused this series of events to occur?

Thanks.

An error occured attempting to read properties for 'Cluster Group' group. The remote procedure call failed. Error ID:1726 (000006be).

Hi All,

I have a two-node 2003 cluster. When whichever node is holding the resources goes down, the resources do not fail over to the running node.
On the running node I receive the pop-up error message
"An error occured attempting to read properties for 'Cluster Group' group. The remote procedure call failed. Error ID:1726 (000006be)."

After I click OK on the pop-up error message, the resources come online on the running node. If I don't click OK, the Cluadmin screen stops responding and the resources do not come online.

In the cluster log I see the messages below related to Error ID 1726:

00000874.00000b10::2014/03/17-23:38:58.276 WARN [EVT] EvtBroadcaster: EvPropEvents for node 2 failed. status 1726
00000874.00000b10::2014/03/17-23:38:58.276 INFO [NM] RpcExtErrorInfo: ProcessId= 2164
00000874.00000b10::2014/03/17-23:38:58.276 INFO [NM] RpcExtErrorInfo: SystemTime= 3/17/2014 23:38:58:276
00000874.00000b10::2014/03/17-23:38:58.276 INFO [NM] RpcExtErrorInfo: GeneratingComponent= 2
00000874.00000b10::2014/03/17-23:38:58.276 INFO [NM] RpcExtErrorInfo: Status= 0xc002100b
00000874.00000b10::2014/03/17-23:38:58.276 INFO [NM] RpcExtErrorInfo: Detection Location= 641
00000874.00000b10::2014/03/17-23:38:58.276 INFO [NM] RpcExtErrorInfo: Flags= 0x0
00000874.00000b10::2014/03/17-23:38:58.276 INFO [NM] RpcExtErrorInfo: Number of Parameters= 2
00000874.00000b10::2014/03/17-23:38:58.276 INFO [NM] RpcExtErrorInfo: Long Val= 32000
00000874.00000b10::2014/03/17-23:38:58.276 INFO [NM] RpcExtErrorInfo: Long Val= 32000
00000874.00000b10::2014/03/17-23:38:58.276 INFO [NM] RpcExtErrorInfo: ProcessId= 2164
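To make entries like these easier to filter, here is a minimal Python sketch that splits one line into fields. The pattern is inferred only from the lines pasted above (hex process and thread IDs, timestamp, level, bracketed component, message), so treat it as an assumption about the format rather than a specification; `parse_line` is a name chosen for this example.

```python
import re
from datetime import datetime

# Layout inferred from the pasted lines: pid.tid::timestamp LEVEL [COMPONENT] message
LINE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) \[(?P<component>[A-Z]+)\] (?P<message>.*)$"
)


def parse_line(line: str):
    """Return a dict of fields for one cluster log line, or None if it does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["pid"] = int(entry["pid"], 16)  # IDs are written in hex in the log
    entry["tid"] = int(entry["tid"], 16)
    entry["stamp"] = datetime.strptime(entry["stamp"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


sample = ("00000874.00000b10::2014/03/17-23:38:58.276 WARN [EVT] "
          "EvtBroadcaster: EvPropEvents for node 2 failed. status 1726")
entry = parse_line(sample)
print(entry["pid"], entry["level"], entry["component"], hex(1726))
```

Two small cross-checks fall out of this: the leading hex field 00000874 is decimal 2164, which matches the `ProcessId= 2164` lines, and status 1726 is 0x6be, the same value shown as 000006be in the pop-up.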

How can I fix this issue?

Regards,
Stunner.


SELF account permissions reset for cluster account.

Hi,

Windows 2008 R2 cluster.

The cluster was created with a non domain-admin account, just the permissions needed to create all the AD objects in the Active Directory and give them the needed permissions.

I've been checking the permissions needed for the cluster objects according to Microsoft documentation (http://technet.microsoft.com/en-us/library/cc731002(v=ws.10).aspx), and the cluster object in AD has Full Control permissions for itself (there's an ACE for the SELF account within the ACL with Full Control permissions).

The permissions for the SELF account are reset every month, changing from Full Control to only Change Password, as if the cluster itself were resetting its own permissions after a password change. The reset happens just after the cluster account password change.

Why are the permissions for the cluster itself changed after the machine password change?

Thank you.

SoFS multi-site

Is there any MS documentation on SoFS in a multi-site architecture?

Cluster Host Master Node

Hi Team,

I need to increase the cluster host master shifting interval. To the best of my knowledge it is currently 5 seconds: if the cluster gets no response within 5 seconds, it starts shifting the Exchange DBs to the next available node.

Can I increase this time interval?

OS: Windows Server 2008 R2 SP1

Application: Exchange 2010 SP3 in a 4-node DAG.

Any help?
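One way to read the 5-second figure is as the product of a heartbeat interval and a missed-heartbeat threshold. The Python sketch below assumes that this is what the poster is seeing and that the cluster uses 1000 ms and 5 for those two values (the `SameSubnetDelay` and `SameSubnetThreshold` properties on a 2008 R2 cluster, as far as I know); the second pair of values is purely hypothetical, to show how the arithmetic scales.

```python
def heartbeat_tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Time a node can stay silent before it is considered down."""
    return delay_ms * threshold / 1000


# Assumed defaults: one heartbeat per 1000 ms, 5 missed heartbeats tolerated.
default = heartbeat_tolerance_seconds(1000, 5)

# Hypothetical relaxed settings, only to illustrate the calculation.
relaxed = heartbeat_tolerance_seconds(2000, 10)

print(default, relaxed)
```

Under these assumptions the default works out to 5 seconds, and raising either value lengthens the window proportionally. Whether a given pair of values is accepted depends on the limits the cluster enforces for those properties.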


Muhammad Nadeem Ahmed Sr System Support Engineer Premier Systems (Pvt) Ltd T. +9221-2429051 Ext-226 F. +9221-2428777 M. +92300-8262627 Web. www.premier.com.pk

NodeInstanceID - Allocation of a specific NodeID

When nodes are added to a cluster, their NodeID is allocated as an increment to the last one. But when we are starting to build a cluster, say with 5 or 6 nodes, can we expect/guess the node ID that will be allocated to each of them? This number doesn’t look random to me, and it’s certainly not using the alphabetical order of the node names or any sequence based on the node that’s being used for the build.

All these are from a Win 6.3 build done over a virtual env.

Cluster setup was first done from machine w63sAcNs3 and then, on the same environment, using w63sAcNs4 (the environment was refreshed using VM snapshots). The NodeID is allocated in the same order. So what’s the logic behind numbering these machines like this?

A similar setup was redone from machine w63sAcNs3, after adding another node, w63sAcNs6. This time the sequence was the same until #3, then it changed. So what’s triggering this?

Making VM dependent on Storage Volume

Hi,

I was wondering if it is possible to make a VM dependent or to always migrate a VM to the same node as a CSV on a Hyper-V cluster.

For example, I have a VM with directly attached volumes as hard drives. These volumes sit on the same storage fabric as the CSV on which the VM resides. If the HBA on the owner node, Node1, fails, the cluster will move the CSV to Node2. If the VM is on Node1 it will lose the attached volumes, but if it migrates to the same node as the CSV owner the volumes will be restored.

So it would be preferable if the VM always moved to the same node the CSV is on after the CSV comes back online.

Any suggestions are welcome, thanks.



Terminology for cluster networks

Hello!

I've already asked the question on cluster communication network here
http://social.technet.microsoft.com/Forums/windowsserver/en-US/2f650bfb-3544-4290-84e0-7138159e4cf2/cluster-heartbeat?forum=winserverfiles

"Q2: Do all of these three types of cluster communication (including heartbeat) occur via the cluster network configured as in the preceding screenshot (10.1.1.0) or, for example, Network Health monitoring (including heartbeats) uses one network while Intra-Cluster communication uses some other?"

and got the answer, but now I'd like to clarify one more thing: if all types of cluster communication (heartbeat + InterClusterCommunication + CSV) occur via a single network (it's often called the "Heartbeat" network), why do some admins designate two different networks for Heartbeat and CSV traffic at the same time:

http://andreagx.blogspot.ru/2011/12/hyper-v-cluster-network-configuration.html

...while others do NOT designate separate network for the Heartbeat:

http://social.technet.microsoft.com/Forums/windowsserver/en-US/078950b0-b9e8-44a9-9f90-71a67df92cad/nic-assignments-for-2012-hyperv-cluster?forum=winserverClustering

"NIC 1 Management LAN
NIC 2 Cluster Shared Volume
NIC 3 Live Migration
NIC 4 iSCSI /MPIO
NIC 5 iSCSI /MPIO
NIC 6 Virtual Switch - Teamed
NIC 7 Virtual Switch - Teamed
NIC 8 Virtual Switch - Teamed"

So either the terms "Heartbeat network/nic" and "CSV network/nic" are used interchangeably or I don't understand what CSV network means...

Thank you in advance,

Michael

Storage Migration (Migrating from HP XP20000 to the VPLEX Server 2012 WSFC)

We are currently in the process of migrating from our old storage to the VNX with VPLEX from EMC.

We have managed to successfully claim the LUNs from the HP XP for 2003 MSCS and 2008 MSCS.

Not so successful with 2012 WSFC.

I am able to claim the LUN from the HP XP, but as soon as the server is brought back up after creating the storage view, the disks become critical on the VPLEX and RAW in Windows, and obviously not usable.

I have logged a call with EMC but no joy and posted on an EMC forum with someone replying.

https://community.emc.com/thread/191242

Would simply running Clear-ClusterDiskReservation -Disk # fix this issue?

The thing is, the disks are not clustered; they are presented from separate storage arrays, and the WSFC is for SQL 2012 AlwaysOn.

SIMPLE QUESTION: HOW TO MIGRATE FROM WINDOWS 2008 R2 + SQL 2012 FAILOVER CLUSTER to WINDOWS SERVER 2012 CLUSTER WITH ALWAYS ON AVAILABILITY GROUP

Hello,

We have a 2-node Windows 2008 R2 Enterprise Edition failover cluster with Fibre shared storage (SAN) running SQL Server 2012 SP1. Below is the current configuration - very simple and classic, I would say everything by the book:

This is what I think we want to achieve:

Objectives:

1. Upgrade Windows Operating System from Windows Server 2008 R2 to Windows Server 2012

2. Migrate to SQL Server 2012 Always On Availability Group (AAG) for High Availability and Disaster Recovery

My question is how to achieve both goals?

If possible I would like to upgrade the OS first. Ideally I would like to upgrade on the same hardware (because it should have minimal impact - no need to migrate data). If this is not possible, we also have new hardware I can use. But I guess that will have more impact, and actual data migration will be required.

For AAG, what I'm honestly missing is what the name of the second SQL Server would be. Let's say my servers are called DB1 and DB2, and the SQL Server is called DB. If I create an AAG and fail over to the replica server, would the SQL Server name be DB as well?

I know there is lots of documentation on AAG and I went through it but I cannot find any specific information about names.

Another question: would a 3rd server (DB3) be part of the same MSCS cluster, or would it be a separate server? How exactly does failover work? Do I use Failover Cluster Manager to initiate failover?

Sorry for lots of questions, but any information would be appreciated very much.

Thanks!



Without Storage

Hi!

Is it possible to have a 2-node cluster without any kind of external storage? Secondly, is this also possible with Hyper-V clustering for highly available VMs?

Thanks.

Unable to move quorum to other node

Hi, 

We have a 2-node cluster (2008 R2) configured with Node and Disk Majority. For the last 2 weeks I have been unable to move the quorum resource to the second node. The move is allowed, but the quorum resource then fails on the second node.
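For context on what is at stake when the witness disk will not come online on one node, here is a small Python sketch of the vote arithmetic behind a node and disk majority configuration: each node has one vote, the disk witness has one vote, and the cluster needs more than half of the votes. The function names are invented for this example, and the sketch models only the counting, not the cluster service's actual behavior.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that is more than half of the total."""
    return total_votes // 2 + 1


def has_quorum(nodes_up: int, witness_up: bool, total_nodes: int = 2) -> bool:
    """Node and disk majority: every node plus the disk witness gets one vote."""
    total = total_nodes + 1
    present = nodes_up + (1 if witness_up else 0)
    return present >= votes_needed(total)


print(votes_needed(3))       # votes required in a 2-node + witness cluster
print(has_quorum(1, True))   # one node plus the witness disk
print(has_quorum(1, False))  # one node alone, witness unavailable
```

With two nodes and a witness there are three votes and two are required, so a single surviving node keeps quorum only if it can also bring the witness disk online.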
