Channel: High Availability (Clustering) forum
Viewing all 4519 articles

Running chkdsk causes Cluster Shared Volumes to go offline


Basic setup at Production site:

2x Dell R730 running Windows 2012 R2 Standard (Windows updates current as of June 2016). The cluster passed validation tests, with the exception of some warnings due to the iSCSI NICs not having a default gateway (those NICs can't ping the LAN interfaces).

3x Cluster Shared Volumes (one witness, two data) with a single Generic Service role, which has both data volumes as dependencies. All CSVs are located on a Dell EqualLogic iSCSI SAN group; data volumes are replicated regularly to a DR site.

For backups, we have scripted mounting recent replicas on the DR EqualLogic group to a server at the DR site. This setup has been working fine; however, the CSV replicas, when mounted to the DR system, have their filesystems marked as dirty. The filesystems on the production CSVs are not marked dirty (verified by fsutil and Action Center). This happens occasionally because all replicas/snapshots are taken on the hardware side and the filesystems are not quiesced (which is acceptable for our purposes), but it has been occurring on all replicas this past week.
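For reference, the dirty flag on the mounted replicas can be checked from an elevated prompt like this (X: is a placeholder for the replica's drive letter):

```powershell
# Read-only check of the NTFS dirty bit on a mounted volume
fsutil dirty query X:
```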

Issue:

To ensure the CSVs were clean, I opted to run a read-only chkdsk from the command line ("chkdsk X:", no switches specified) on each volume. When I did so, both CSVs were taken offline, which is contrary to my understanding that 2012 R2 CSVs should remain online for analysis and spot fixing. On two other (non-clustered) servers, mounting a replica and running chkdsk leaves the volumes online, so I'm a bit puzzled why the cluster decided to take the volumes offline, and how to prevent this from happening in the future.

Do I need to specify the "/scan" switch or some other parameter? Does this need to be run through Action Center or Server Manager? Do I need to be running a File Server role in the Cluster?
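For context, my understanding of how the online scan is supposed to be invoked on 2012 R2 — untested on this cluster, and the path and drive letter below are placeholders:

```powershell
# Online analysis: the volume should stay mounted while errors are logged
chkdsk C:\ClusterStorage\Volume1 /scan

# Equivalent via the Storage module; -Scan is the online pass,
# -SpotFix is the brief targeted repair
Repair-Volume -DriveLetter D -Scan
```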

I appreciate any help and thoughts on this!

Thanks!


Cannot add a second node!


Hi,

I have a two-node Hyper-V cluster. One of the nodes failed (crashed), so I reinstalled the server and set it up as it was before the crash occurred. I tried to re-add it, but it keeps failing with the following error messages:

On the server I tried to add:

[QUORUM] An attempt to form cluster failed due to insufficient quorum votes. Try starting additional cluster node(s) with current vote or as a last resort use Force Quorum option to start the cluster. Look below for quorum information,
00000fcc.00000204::2016/09/07-16:20:48.958 ERR   [QUORUM] To achieve quorum cluster needs at least 2 of quorum votes. There is only 1 quorum votes running
00000fcc.00000204::2016/09/07-16:20:48.958 ERR   [QUORUM] List of running node(s) attempting to form cluster: VM01,
00000fcc.00000204::2016/09/07-16:20:48.958 ERR   [QUORUM] List of running node(s) with current vote: VM01,
00000fcc.00000204::2016/09/07-16:20:48.958 ERR   [QUORUM] Attempt to start some or all of the following down node(s) that have current vote: EXITINGVM, EXISTINGVM0,
00000fcc.00000204::2016/09/07-16:20:48.958 ERR   join/form timeout (status = 258)
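From what I have read, the recovery would go something like this — run on the surviving node; the node names are taken from the log above, and I have not tried this yet:

```powershell
# Last resort: force the single surviving node to form the cluster
# even though quorum is lost
Start-ClusterNode -Name VM01 -FixQuorum

# Evict the stale entry for the rebuilt node, then join it again
Remove-ClusterNode -Name EXISTINGVM0 -Force
Add-ClusterNode -Name EXISTINGVM0
```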

Any help will be appreciated; thanks in advance.

Replace Shared Physical Disk Windows Server 2008 R2 Cluster


Hi Experts,

I am planning to replace physical disk resources in a SQL DB cluster (Win 2k8) after the data is copied to alternate disks. I had done this kind of operation on a Win 2003 cluster, and at that time I used dumpcfg to update the disk signature. Do we need to do a similar operation on Win 2k8 as well? I was checking the disk signatures under HKLM\Cluster\Resources\<GUID>\Parameters\DiskSignature, but all the disk signatures look the same (value is 0). Do we really need to worry about disk signatures in Win 2k8 clusters?
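For what it's worth, if the FailoverClusters PowerShell module is available (2008 R2), the signatures can also be read through the cluster API rather than the registry; the resource name below is a placeholder:

```powershell
# DiskSignature is a private property of the physical disk resource
Get-ClusterResource "Cluster Disk 1" | Get-ClusterParameter DiskSignature

# Legacy equivalent: dump all private properties with cluster.exe
cluster res "Cluster Disk 1" /priv
```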

SMB Continuous Availability not supported!


Hi Guys,

I have a Windows Server 2012 R2 cluster. I created a Scale-Out File Server and want to put applications on it, so I tried using an SMB Share - Applications profile.

However i get this:

I also tried adding a normal SMB share without Continuous Availability. It gets created, but when I then try to add it I get this:

I looked in the event log and i have this error: CA failure - Failed to set continuously available property on a new or existing file share as Resume Key filter is not started or has failed to attach to the underlying volume.

Looking it up, I saw that this was related to 8dot3name. I tried the manual fsutil change, but it didn't work and did not persist after a restart.

So I made a GPO that sets HKLM\SYSTEM\CurrentControlSet\Control\FileSystem\NtfsDisable8dot3NameCreation to 1.

It applies, but I still get the same errors.
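For reference, the manual fsutil attempt mentioned above was along these lines:

```powershell
# Query the current 8dot3 name creation behavior (0 = enabled, 1 = disabled,
# 2 = per-volume, 3 = disabled except on the system volume)
fsutil 8dot3name query

# Disable 8dot3 short name creation system-wide
fsutil behavior set disable8dot3 1
```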

This cluster is made from VMware VMs with RDM disks attached. The cluster itself works; I just can't get Continuous Availability working.

Any Ideas?

Duplicate Services in Cluster

I have an issue where, if a service is started manually on a cluster node, the Cluster Service will not bring it down, and I end up with duplicates of services running across cluster nodes. Is there no functionality in a cluster that checks to ensure that services only run on one node at a time?
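To illustrate, this is how I compare where the cluster thinks the service runs versus where the Windows service is actually started; resource, service, and node names are placeholders:

```powershell
# Which node owns the clustered generic service resource right now?
Get-ClusterResource "My Generic Service" |
    Select-Object Name, State, OwnerNode

# Is the underlying Windows service also running locally on another node?
Get-Service -ComputerName NODE2 -Name MyService
```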

Live migration of 'Virtual Machine ADVM-01' failed. Event ID: 21502


I have an HA cluster running on Windows 2012 R2 with failover clustering configured. It's running Windows 2008, 2008 R2, and 2012 VMs.

Integration Services are already installed. When I try to live migrate a VM to the other node, it fails.

The event viewer shows the error message below.

" Live migration of 'Virtual Machine ADVM-01' failed.

Virtual machine migration operation for 'ADVM-01' failed at migration source 'NODE01'. (Virtual machine ID D840382C-194B-4B4F-8BF5-19552537D0EF)

'ADVM-01' failed to delete configuration: The request is not supported. (0x80070032). (Virtual machine ID D840382C-194B-4B4F-8BF5-19552537D0EF) "

Please advise.


Regards, COMDINI

Cluster Shared Volume error after server not shutting down properly


Hi,
We have two IBM x240 servers (call them server A and server B) connected to an IBM V3700 disk system via fibre HBA.

Both servers run Windows 2012 R2.

We have implemented a VM cluster and everything was working well.

Last week both servers went down due to a power outage in my server room.

After turning on server A, the error below came up:

Windows failed to start. A recent hardware or software change might be the cause.
File: \windows\system32\drivers\msdsm.sys
Status: 0xc0000017
Info: The operating system couldn't be loaded because a critical system driver is missing or contains errors.

After using the Last Known Good Configuration, we could log in to the system and turn on the clustered virtual machines.

Everything seemed fine at that point.

So I went and started server B, logging in to the system using the same method as on server A.

I found that all the VMs would shut down or hit errors due to Cluster Shared Volume errors.

Below are some errors captured from the system logs.

* Event 5142, Cluster Shared Volume 'Volume7' ('Cluster Disk 10') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.

* Event 5120, Cluster Shared Volume 'Volume3' ('Cluster Disk 4') has entered a paused state because of '(c00000be)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Now we can only turn on one server and must shut the other down; if I turn on both servers, the errors come back and the server goes down.

Any suggestions? Or let me know if you need more information.

Thanks.

Any better way to do it?


We have a 2-node Windows 2008 R2 cluster with Node and File Share Majority quorum. The cluster resource (DHCP) is on cluster disk F:. Now we need to decommission the storage that holds both the file share and the cluster disk, and switch them to new storage. My plan is:

1. Take the cluster disk offline, add the new one, and change the new one to drive F:.
2. Copy the files from the old disk to the new one.
3. Change the quorum from Node and File Share Majority to Node Majority, then change it back to Node and File Share Majority (with the share on the new storage).
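In PowerShell terms (FailoverClusters module), the quorum switch I'm planning would look roughly like this; the share path is a placeholder:

```powershell
# Step 1: drop the file share witness, falling back to node majority
Set-ClusterQuorum -NodeMajority

# Step 2: re-add the witness, now pointing at the share on the new storage
Set-ClusterQuorum -NodeAndFileShareMajority "\\NEWFILER\ClusterWitness"
```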

Any better way to do it?

Thank you!


File Share Witness on DFS


Hi

Is it supported by Microsoft to place a File Share Witness on a DFS share in Windows Server 2012 R2 (no DFS-R involved)? All I have found is some documentation for Windows Server 2008 R2 stating that you should not place your File Share Witness on DFS, but no indication of whether it is supported or not.

I fully understand that it is a very bad idea to use DFS replication on this share, due to the fact that DFS replication isn't real-time.

Kind regards

Michael Buchardt

Replace a physical disk resource in a Windows 2003 SQL cluster


Hello All.

I am in a situation where I need to replace a SAN disk in a Windows 2003 SQL cluster with another, larger disk. I cannot expand the existing drive.

Here is my plan. Since I have not tried this before, I would like to confirm that it works fine, or learn of any issues, as I have read a lot about problems caused by changes in disk signatures:

1. Assign the new SAN disks to the nodes.

2. Bring up the disk on one of the nodes and assign a temporary drive letter.

3. Take the SQL instance offline and copy the data from the old disk to the newly added disk.

4. Remove the cluster dependencies and delete the old physical disk resource from the cluster.

5. Add the new disk as a physical disk resource to the SQL cluster group.

6. Configure the dependencies.

7. Bring the SQL instance online.
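Since this is Windows 2003, the cluster-side part of the plan (steps 4-7) would be done with cluster.exe; a rough sketch, with the group and resource names as placeholders:

```powershell
# 4. Remove the dependency, then delete the old physical disk resource
cluster res "SQL Server" /removedep:"Disk F:"
cluster res "Disk F:" /delete

# 5. Create the new physical disk resource in the SQL group
cluster res "Disk F: (new)" /create /group:"SQL Group" /type:"Physical Disk"

# 6. Re-create the dependency, then 7. bring the group online
cluster res "SQL Server" /adddep:"Disk F: (new)"
cluster group "SQL Group" /online
```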

Please advise if there is any harm in switching over the disks as described above.

Thanks in advance

Define timeout for CSVs / I/O queue time?


I am running a Hyper-V cluster on 2 nodes, which use an HA NAS cluster to store the virtual machines and their disks.

The NAS cluster has built-in failover features which work nicely, but failover can take up to 3 minutes to complete.

During this time, CSVs are reported offline and running VMs start to fail, entering an undetermined state:

- Some just report "failed".
- Some report "failed" but start again, ending up on "missing or invalid boot drive".
- Some (especially Linux VMs) end up reporting disk errors, shutting everything down until a manual reboot.

I read that Hyper-V will queue disk I/O for a certain amount of time in case a CSV goes down. It looks like VMs stay healthy for 30 seconds, then start to fail.

Is there a way to extend this time to, let's say, 5 minutes?

What about read I/O happening in the meantime?
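One lead I found, which is an assumption on my part and untested: the 30-second window may be the guest's own disk I/O timeout (the Disk\TimeOutValue registry entry), which could be raised inside each Windows VM, e.g.:

```powershell
# Inside the guest: raise the disk I/O timeout (in seconds) to 5 minutes;
# takes effect after a guest reboot
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Disk" `
    -Name TimeOutValue -Value 300 -Type DWord
```

For the Linux guests, I assume the equivalent knob would be /sys/block/&lt;dev&gt;/device/timeout.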

best,

dognose


Cluster network degradation


Hi all, I'm just wondering if anyone else has experienced this.

We have a 3-node 2012 R2 cluster, all nodes running Core. It's fine; the only test it seems to fail is the cluster communication networks all being on the same subnet. If I set them to the same subnet, the cluster test complains; if I set them to separate subnets, the Dell HIT Kit complains, so it's lose-lose.

The cluster will tick along nicely; we can move things around, etc. Then, after some time (?), event log errors start popping up. These can be (sorry, the list is long): 1038, 1069, 1126, 1127, 1129, 1135, 1137, 1146, 1155, 1205, 1254, 5120, 5142.

We don't necessarily get all of these errors, and I'm not sure which ones crop up first, but it seems like the host networking gets... clogged up? That sounds daft, but if we reboot (drain) the hosts then the problem is resolved, and the cluster carries on for however long.

Microsoft have, in the past, suggested settings to switch off (TCP Chimney etc., a bunch of stuff) on each host NIC, and the Dell HIT Kit is installed on anything directly accessing EqualLogic volumes. We patch the hosts and run the Dell SUU CD against them once in a while to keep drivers and firmware up to date.

I'd be grateful for any help. Like I said, however daft it sounds, it just seems like the networking gets clogged up with data after a while, so the adapters freeze up.

Node in cluster - status changes to "paused"


We have seven Windows 2012 R2 nodes in a Hyper-V cluster. They are all identical hardware (HP BladeSystem). For a while, we had only six nodes, and there were no problems.

Recently, we added the seventh node, and its status keeps reverting to "Paused". I can't find any errors that directly point to why this is happening - either in the System or Application log of the server, in the various FailoverClustering logs, or in the Cluster Events. I created a cluster.log using the Get-ClusterLog cmdlet, but if it explains why this is happening, I can't figure it out (it's also a very large file - 150 MB - so it's difficult to determine which lines are the important ones).
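To make the log more manageable, it can be regenerated for just the one node and a narrow window around a pause event (the node name is a placeholder), e.g.:

```powershell
# cluster.log for the new node only, covering the last 15 minutes,
# with local timestamps, written to C:\Temp
Get-ClusterLog -Node NODE7 -TimeSpan 15 -UseLocalTime -Destination C:\Temp
```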

As far as I can tell, everything on this new node is the same as the previous ones - the software versions, network settings, etc. The Cluster Validation report also doesn't give me anything helpful.

Any ideas on how to go about investigating this? Even before I can solve the problem, I'd like to at least know when and why the status reverts to paused.

Thanks,

David

Not able to see cluster resources in the failover cluster manager


We are using Windows Server 2008 R2 and configured SQL Server Cluster services.

Randomly, every 10-15 days, the cluster service hangs, and loading the resource list in Failover Cluster Manager takes a very long time (almost 10-15 minutes). While this is happening, if we try to add any new DB in SQL Server, it does not let us create it, and the query runs on and on.

From the SQL Server logs we found the wait type PREEMPTIVE_CLUSAPI_CLUSTERRESOURCECONTROL, meaning the cluster service is hung somewhere.

I'm not sure how to troubleshoot this issue; I have checked the event log for the cluster service but am not getting any clue.

When we restart the server, the cluster service becomes normal again, and after 10-15 days it starts behaving the same way.

I need help finding out where the cluster service is hung.

Cluster Shared Volume disappears... STATUS_MEDIA_WRITE_PROTECTED (c00000a2)


Hi all, I am having an issue that hopefully someone can help me with. I have recently inherited a 2-node cluster; both nodes are one half of an ASUS RS702D-E6/PS8, so they should be near identical. They are both running Hyper-V Server 2008 R2, hosting some 14 VMs.

Each node is hooked up via Cat5e to a Promise VessRAID 1830i via iSCSI, using one of the server's onboard NICs each. That cluster network is set up as Disabled for cluster use (the way I think it is supposed to be, not the way I originally inherited it), on its own class A subnet and its own private physical switch.

The SAN hosts a 30 GB CSV witness disk and two 2 TB CSV volumes, one for each node, labeled Volume1 and Volume2, with some VHDs on each.

The cluster clients connect to the rest of the company via the virtual external NIC adapters created in Hyper-V Manager, but physically they run off Intel ET dual-gigabit adapters wired into our main core switch, which is set up with class C subnets.

I also have a crossover cable running between the other ports on the Intel ET dual-port NICs, using yet a third class B subnet, configured in Failover Cluster Manager as internal; so there are 3 IPv4 cluster networks total.

Even though the cluster passes the validation tests with flying colors, I am not convinced all is well. With Hyperv1 (node 1), I can move the CSVs and machines over to Hyperv2 (node 2), stop the cluster service on node 1, and perform maintenance such as a reboot or patch installs if needed. When it reboots, or I restart the cluster service to bring it back online, it is well behaved, leaving Hyperv2 the owner of all 3 CSVs (witness, Volume1 and Volume2). I can then pass them back or split them up any which way, and at no point is cluster service interrupted or noticed by users. Duh, I know this is how it is SUPPOSED to work, but...

If I try the same thing with node 2 — move the witness and volumes to node 1 as owner and migrate all VMs over, stop the cluster service on node 2, do whatever I have to do, and reboot — then as soon as node 2 tries to go back online, it tries to snatch Volume2 back. It never succeeds, and the following error is logged in the cluster event log:

Hyperv1

Event ID: 5120

Source: Microsoft-Windows-FailoverClustering

Task Category: Cluster Shared Volume

The listed message is: Cluster Shared Volume 'Volume2' ('HyperV1 Disk') is no longer available on this node because of 'STATUS_MEDIA_WRITE_PROTECTED(c00000a2)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Followed 4 seconds later by:

Hyperv1

event ID: 1069

Source: Microsoft-Windows-FailoverClustering

Task Category: Resource Control Manager

Message: Cluster resource 'Hyperv1 Disk' in clustered service or application '75d88aa3-8ecf-47c7-98e7-6099e56a097d' failed.

- AND -

2 of the following:

Hyperv1

event ID: 1038

Source: Microsoft-Windows-FailoverClustering

Task Category: Physical Disk Resource

Message: Ownership of cluster disk 'HyperV1 Disk' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.

Followed 1 second later by another 1069 and then various machines are failing messages.

If you browse to \\hyperv-1\c$\ClusterStorage\ or \\hyperv-2\c$\ClusterStorage\, Volume2 is indeed missing!

This has caused me to panic a few times; the first time I saw it, I thought everything was lost. But I can get it back by stopping the service on node 1 (or shutting it down), restarting node 2 (or the service on node 2), and waiting forever for the disk to list as failed; shortly thereafter it comes back online. I can then boot node 1 back up and let it start servicing the cluster again. It doesn't pull the same craziness node 2 does when it comes online; it leaves all ownership with node 2 unless I tell it to move.

I am very new to clusters, and all I know at this point is that this is pretty cool stuff. Basically, "if it is running, don't mess with it" is the attitude I have taken, but there is a significant amount of money tied up in this hardware, and we should be able to leverage it as needed, not wonder if it is going to act up again.

To me it seems that a "failover" cluster should be way more robust than this...

I can go into way more detail if needed, but I didn't see any other posts on this specific issue no matter what forum I scoured. I'm obviously looking for advice on how to get this resolved, as well as on whether or not I wired the cluster networks correctly. I am also not sure what protocols are bound to which NICs anymore, or what the binding order should be; could this be what is causing my issue?

I have NVSPBIND and NVSPSCRUB on both boxes if needed.

Thanks!

-LW


Node failed to join cluster - Cluster group could not be found


Hi,

I'm having problems joining a Hyper-V cluster node to the cluster after a reboot. It gives me a critical error in the event log: Event ID 1070 - The node failed to join the failover cluster <clustername> due to error code 5013. Error code 5013 seems to mean 'cluster group could not be found'.

I have googled, but didn't find any useful answers. Perhaps some of you know how to troubleshoot this issue? Thanks...

High availability with Hyper-V


Hi,

Our customer uses Hyper-V VMs (running Linux) on a primary site, and a multi-site cluster for their disaster recovery plan.

They plan to install an IBM product (based on WebSphere Application Server).

My questions:

Is it necessary to install the product on both sites? How does the Hyper-V cluster work: does it synchronize VM data automatically, or is it necessary to trigger the switch manually?

Can a VM clone or replica be used? In that case, I think the IP would change, and that could impact the application configuration.

Thanks a lot for your information on this subject.
