Channel: High Availability (Clustering) forum

WS2016 Multi-Subnet Cluster Communication Issues - Port 3343


Hi,

We’ve recently been extending a number of single-site SQL failover clusters (on WS2016) into multi-site geoclusters. Our environment is a mix of physical and virtual nodes. To simplify my question, I will discuss a single-site, multi-subnet cluster that is experiencing exactly the same issues as our multi-site geoclusters.

The single-site, multi-subnet cluster is set up as follows:

2 x nodes in network “A”

2 x nodes in network “B”

Each node is on identical infrastructure and has a “Data” network (for client and cluster communication) and a dedicated “Heartbeat” network (for cluster communication only). The heartbeat network is routable between the two subnets. Static routes have been added to each host.

When we run a validation test on a multi-site (or multi-subnet) cluster, we get an error in the network validation test stating the following:

Node site1-node1 is reachable from Node site2-node1 by multiple communication paths, but each path includes network interface site2-node1 - Heartbeat. This network interface may be a single point of failure for communication within the cluster. Please verify that this network interface is highly available or consider adding additional networks or network interfaces to the cluster.

Delving deeper, the validation report shows a failure communicating on UDP port 3343 between the data networks at the two sites. We’ve run the report numerous times and never get a network failure between local nodes, only between nodes on different subnets. We also never see a communication issue on the heartbeat network (dedicated to cluster communication).

The data network errors appear intermittently in the validation report. Sometimes the report passes without any errors. Other times it shows that certain cross-subnet nodes can’t communicate, and occasionally that no nodes can communicate across subnets. We seem to be able to recreate the issue simply by restarting all of the cluster nodes or the cluster service on each node. Stranger still, if we reboot the nodes or restart the cluster service once we are getting validation errors (as above), the errors clear for a period of time.

We’ve tested disabling the heartbeat network (which we created specifically for cluster communication) on all nodes and then running the validation test. The tests pass, showing that the data-to-data networks between nodes and subnets/sites can communicate. As soon as we re-enable the heartbeat NICs and rerun the validation test, it begins erroring again.

We’ve tested with a UDP port emulator: we disabled the cluster service (after the validation test had reported that the cluster wasn’t communicating over the data network) and then sent UDP packets over port 3343 to confirm that they reach the node that failed the validation test. We’ve also run packet traces. Both tests show that the respective ports between the hosts are open. Windows Firewall is turned off on all nodes, and the two subnets used don’t pass through a firewall. The hosts are also fully patched.
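
A UDP reachability check of this kind can be sketched roughly as below. This is only an illustrative stand-in, not the emulator used in the tests above: the function names `listen_once` and `probe`, the payload, the timeouts, and the optional `source_addr` argument are all assumptions. Binding the sender to a specific local address is one way to make sure the packet leaves through the data NIC rather than the heartbeat NIC. The listener can only bind port 3343 while the cluster service on that node is stopped.

```python
import socket

CLUSTER_PORT = 3343  # UDP port named in the validation report


def listen_once(bind_addr, port, timeout=5.0, ready=None):
    """Wait for one datagram on (bind_addr, port) and echo it back.

    Returns (payload, sender_address), or None if nothing arrives in time.
    `ready` is an optional threading.Event set once the socket is bound.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, port))
        sock.settimeout(timeout)
        if ready is not None:
            ready.set()
        try:
            data, sender = sock.recvfrom(2048)
        except OSError:  # includes socket.timeout
            return None
        sock.sendto(data, sender)
        return data, sender


def probe(target_addr, port, payload=b"probe", timeout=2.0, source_addr=None):
    """Send one datagram and report whether the same payload is echoed back.

    If source_addr is given, the socket is bound to that local address first,
    so the packet is sent from the chosen interface (e.g. the data NIC).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        if source_addr:
            sock.bind((source_addr, 0))
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (target_addr, port))
            data, _ = sock.recvfrom(2048)
        except OSError:  # timeout, or an ICMP "port unreachable" reset
            return False
        return data == payload
```

To use it, one would run `listen_once("<data NIC address>", CLUSTER_PORT)` on the node that failed validation and `probe("<that node's data address>", CLUSTER_PORT, source_addr="<local data NIC address>")` on a node in the other subnet; the addresses are placeholders. A `True` result only shows that a datagram made the round trip at that moment, so an intermittent fault would need repeated probes to catch.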

It doesn’t appear to be a networking issue as the packets are reaching the nodes, but for some reason the validation report is intermittently failing.

Any help with this would be greatly appreciated.

R



