I have a 2016 WSFC with the file server role: 2 nodes in the cluster with shared storage. We lost power to Node2, which died, and when bringing it back up it won't join the cluster (shows 'Down' in Failover Cluster Manager). If I shut down the entire cluster completely and start it on Node2 first, Node2 runs the cluster fine but Node1 now won't join the cluster (shows 'Down').
As far as I can tell all connectivity is fine: I've turned off Windows Firewall, the network between the two servers is working, and there are no firewalls in between the two nodes. Other clusters are running on the same infrastructure.
The only hint in Failover Cluster Manager is that the network connection for Node2 shows as offline (the network is up and working, has the allow traffic and management options ticked, and I can ping, RDP etc.).
When I shut down and then restart the entire cluster with Node2 first, the roles are reversed: Node1 now shows its network as offline. The information details and critical events for the network have no entries.
Critical events for Node2 itself, when in the Down state, show: Error 1653 Cluster node 'Node2' failed to join the cluster because it could not communicate over the network with any other node in the cluster. Verify network connectivity and configuration of any network firewalls. However, I'm not convinced this is actually the issue because of the error messages below:
The failover clustering log is as follows:
00000774.00001c4c::2018/05/15-16:48:50.659 INFO [Schannel] Server: Negotiation is done, protocol: 10, security level: Sign
00000774.00001c4c::2018/05/15-16:48:50.663 DBG [Schannel] Server: Receive, type: MSG_AUTH_PACKAGE::Schannel, buf: 161
00000774.00001c4c::2018/05/15-16:48:50.712 DBG [Schannel] Server: ASC, sec: 90312, buf: 2059
00000774.00001c4c::2018/05/15-16:48:50.728 DBG [Schannel] Server: Receive, type: MSG_AUTH_PACKAGE::Schannel, buf: 1992
00000774.00001c4c::2018/05/15-16:48:50.730 DBG [Schannel] Server: ASC, sec: 0, buf: 51
00000774.00001c4c::2018/05/15-16:48:50.730 DBG [Schannel] Server: Receive, type: MSG_AUTH_PACKAGE::Synchronize, buf: 0
00000774.00001c4c::2018/05/15-16:48:50.730 INFO [Schannel] Server: Security context exchanged for cluster
00000774.00001c4c::2018/05/15-16:48:50.735 DBG [Schannel] Client: ISC, sec: 90312, buf: 178
00000774.00001c4c::2018/05/15-16:48:50.736 DBG [Schannel] Client: Receive, type: MSG_AUTH_PACKAGE::Schannel, buf: 60
00000774.00001c4c::2018/05/15-16:48:50.736 DBG [Schannel] Client: ISC, sec: 90312, buf: 210
00000774.00001c4c::2018/05/15-16:48:50.749 DBG [Schannel] Client: Receive, type: MSG_AUTH_PACKAGE::Schannel, buf: 2133
00000774.00001c4c::2018/05/15-16:48:50.752 DBG [Schannel] Client: ISC, sec: 90364, buf: 58
00000774.00001c4c::2018/05/15-16:48:50.753 DBG [Schannel] Client: ISC, sec: 90364, buf: 14
00000774.00001c4c::2018/05/15-16:48:50.753 DBG [Schannel] Client: ISC, sec: 90312, buf: 61
00000774.00001c4c::2018/05/15-16:48:50.754 DBG [Schannel] Client: Receive, type: MSG_AUTH_PACKAGE::Schannel, buf: 75
00000774.00001c4c::2018/05/15-16:48:50.754 DBG [Schannel] Client: ISC, sec: 0, buf: 0
00000774.00001c4c::2018/05/15-16:48:50.754 INFO [Schannel] Client: Security context exchanged for netft
00000774.00001c4c::2018/05/15-16:48:50.756 WARN [ClRtl] Cannot open crypto container (error 2148073494). Giving up.
00000774.00001c4c::2018/05/15-16:48:50.756 ERR mscs_security::SchannelSecurityContext::AuthenticateAndAuthorize: (-2146893802)' because of 'ClRtlRetrieveServiceSecret(&secretBLOB)'
00000774.00001c4c::2018/05/15-16:48:50.756 WARN mscs::ListenerWorker::operator (): HrError(0x80090016)' because of '[SV] Schannel Authentication or Authorization Failed'
00000774.00001c4c::2018/05/15-16:48:50.756 DBG [CHANNEL 172.23.1.15:~56287~] Close().
Specifically:
Server: Negotiation is done (i.e. they talked to each other?)
[ClRtl] Cannot open crypto container (error 2148073494). Giving up.
mscs_security::SchannelSecurityContext::AuthenticateAndAuthorize: (-2146893802)' because of 'ClRtlRetrieveServiceSecret(&secretBLOB)'
mscs::ListenerWorker::operator (): HrError(0x80090016)' because of '[SV] Schannel Authentication or Authorization Failed'
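The three numbers in those lines look like the same 32-bit value written three ways (unsigned decimal, signed decimal, and hex). A minimal sketch in Python to check that. The helper name `to_hresult_hex` is made up for illustration and is not part of any cluster tooling:

```python
def to_hresult_hex(value: int) -> str:
    """Normalize a signed or unsigned 32-bit error value to 0xXXXXXXXX form."""
    # Masking with 0xFFFFFFFF maps a negative (signed) value onto its
    # unsigned 32-bit two's-complement representation.
    return f"0x{value & 0xFFFFFFFF:08X}"


# Values copied from the cluster log lines quoted above.
codes = [2148073494, -2146893802, 0x80090016]
normalized = {to_hresult_hex(c) for c in codes}
print(normalized)  # {'0x80090016'}
```

All three come out as 0x80090016, which is NTE_BAD_KEYSET ("Keyset does not exist"), so the log seems to be reporting one failure three times rather than three different problems.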
I can't find many, if any, articles dealing with these messages. The only ones I can find say to make sure permissions are correct on %SystemRoot%\Users\All Users\Microsoft\Crypto\RSA\MachineKeys
I did have to change some of the permissions on these files but still couldn't join the cluster. Other than that I'm struggling to find any actual issues (SMB access from Node1 to Node2 appears to be fine, SMB access from Node2 to Node1 appears to be fine, DNS appears to be working fine, the file share witness seems to be fine).
Finally, the cluster validation report shows these errors as the only errors with the cluster:
Validate disk Arbitration: Failed to release SCSI reservation on Test Disk 0 from node Node2.domain: Element not found.
Validate CSV Settings: Failed to validate Server Message Block (SMB) share access through the IP address of the fault tolerant network driver for failover clustering (NetFT). The connection was attempted with the Cluster Shared Volumes test user account, from node Node1.domain to the share on node Node2.domain. The network path was not found.
Validate CSV Settings: Failed to validate Server Message Block (SMB) share access through the IP address of the fault tolerant network driver for failover clustering (NetFT). The connection was attempted with the Cluster Shared Volumes test user account, from node Node2.domain to the share on node Node1.domain. The network path was not found.
Other errors from the event logs:
ID5398 Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates.
Votes required to start cluster: 2
Votes available: 1
Nodes with votes: Node1 Node2
Guidance: Attempt to start the cluster service on all nodes in the cluster so that nodes with the latest copy of the cluster configuration data can first form the cluster. The cluster will be able to start and the nodes will automatically obtain the updated cluster configuration data. If there are no nodes available with the latest copy of the cluster configuration data, run the 'Start-ClusterNode -FQ' Windows PowerShell cmdlet. Using the ForceQuorum (FQ) parameter will start the cluster service and mark this node's copy of the cluster configuration data to be authoritative. Forcing quorum on a node with an outdated copy of the cluster database may result in cluster configuration changes that occurred while the node was not participating in the cluster to be lost.
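The "votes required: 2, votes available: 1" part of that event is plain majority arithmetic. A rough sketch in Python, assuming a simple majority of the configured votes and ignoring any dynamic quorum adjustments. `votes_required` is a hypothetical helper name, not a cluster API:

```python
def votes_required(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority of total_votes."""
    return total_votes // 2 + 1


# Two node votes only (as listed in the event), or two nodes plus a witness.
for total in (2, 3):
    print(total, votes_required(total))
```

Either way a lone node holding one vote is short of the two it needs, which matches the event text.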
ID4350 Cluster API call failed with error code: 0x80070046. Cluster API function: ClusterResourceTypeOpenEnum Arguments: hCluster: 4a398760 lpszResourceTypeName: Distributed Transaction Coordinator lpcchNodeName: 2
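The 0x80070046 in that event can be pulled apart using the standard HRESULT bit layout (bit 31 severity, bits 16-26 facility, low 16 bits code). A small sketch in Python. The function name `split_hresult` is my own and only for illustration:

```python
def split_hresult(hr: int) -> dict:
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    hr &= 0xFFFFFFFF  # accept signed or unsigned input
    return {
        "failure": bool(hr >> 31),       # bit 31: 1 means failure
        "facility": (hr >> 16) & 0x7FF,  # bits 16-26
        "code": hr & 0xFFFF,             # bits 0-15
    }


print(split_hresult(0x80070046))  # {'failure': True, 'facility': 7, 'code': 70}
```

Facility 7 is FACILITY_WIN32, so the low word is an ordinary Win32 error: 70, which as far as I can tell is ERROR_SHARING_PAUSED ("The remote server has been paused or is in the process of being started").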
Lastly, I built another server, Node3, to see if I could join it to the cluster, but this fails:
The server 'Node3.domain' could not be added to the cluster. An error occurred while adding node 'Node3.domain' to cluster 'CLUS1'. Keyset does not exist
I've done the steps here with no joy: http://chrishayward.co.uk/2015/07/02/windows-server-2012-r2-add-cluster-node-cluster-service-keyset-does-not-exist/