HP Integrity Superdome SX2000 Cluster Installation and Configuration Guide - W - Page 29

Troubleshooting the Cluster, What to Do if Validation Tests Fail



• The NIC and switch redundancy layer is transparent to the IP layer.
• It may use standby, redundant team members to load balance your network traffic and
improve performance for transmitted and received packets on the individual cluster node.
• It may use advanced redundancy mechanisms to improve the detection of failures in your
network infrastructure, and to provide a proactive response to them. For example, cluster
nodes continuously test their connectivity with each other, but they cannot detect path failures
when there is an external switch upstream. Active Path Failover is an advanced teaming
feature that detects such failures and fails over to a NIC that has a path to an Echo Node
device (an external switch upstream).
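The failover decision described for Active Path Failover can be sketched in a few lines of Python. This is an illustration only, not HP's teaming implementation: the TeamMember type, the echo_reachable flag (standing in for the result of a probe to the Echo Node), and the select_active function are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TeamMember:
    """One NIC in a team (hypothetical model, not a real teaming API)."""
    name: str
    link_up: bool          # local link state, visible to the node itself
    echo_reachable: bool   # assumed result of probing the Echo Node upstream


def select_active(members: List[TeamMember], current: str) -> Optional[str]:
    """Return the NIC that should carry traffic, or None if no path exists.

    Keep the current NIC while it still reaches the Echo Node; otherwise
    fail over to the first member that has both link and an upstream path.
    """
    healthy = [m for m in members if m.link_up and m.echo_reachable]
    if not healthy:
        return None
    for m in healthy:
        if m.name == current:
            return current
    return healthy[0].name


# The link on nic0 is still up, but its upstream path is broken, so a
# link-state check alone would not trigger a failover.
team = [
    TeamMember("nic0", link_up=True, echo_reachable=False),
    TeamMember("nic1", link_up=True, echo_reachable=True),
]
print(select_active(team, current="nic0"))
```

In this sketch the team moves to nic1 because nic0 no longer reaches the Echo Node, even though its local link is up.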
If you are going to implement NIC teaming in your cluster networks, you should complete the
following steps:
1. Plan your network infrastructure according to the cluster demands, taking into account NIC
teaming configuration, redundant switches, routers, and so on.
2. Create the teams planned in the previous step for every cluster node.
3. Validate your cluster configuration.
4. Create your cluster.
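Before creating the teams, it can help to write the plan from step 1 down in a form that can be checked automatically. The sketch below is one possible way to do that, not a procedure from this guide: the plan layout (node, then team, then a list of NIC and switch pairs), the names in it, and the find_plan_problems function are hypothetical. It assumes that a redundant team needs at least two NICs attached to at least two different switches.

```python
from typing import Dict, List, Tuple

# node -> team -> list of (nic, switch) pairs; all names are made up
Plan = Dict[str, Dict[str, List[Tuple[str, str]]]]


def find_plan_problems(plan: Plan) -> List[str]:
    """List teams that would not survive a single NIC or switch failure."""
    problems = []
    for node, teams in plan.items():
        for team, members in teams.items():
            if len(members) < 2:
                problems.append(f"{node}/{team}: fewer than two NICs")
                continue
            switches = {switch for _nic, switch in members}
            if len(switches) < 2:
                problems.append(f"{node}/{team}: all NICs on one switch")
    return problems


plan = {
    "node1": {"public": [("nic0", "switchA"), ("nic1", "switchB")]},
    "node2": {"public": [("nic0", "switchA"), ("nic1", "switchA")]},
}
for problem in find_plan_problems(plan):
    print(problem)
```

Here node2 is flagged because both of its team members depend on switchA.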
For more information about NIC teaming issues in clustered environments, see the following
document:
http://support.microsoft.com/kb/254101
Troubleshooting the Cluster
What to Do if Validation Tests Fail
In most cases, if any tests in the cluster validation wizard fail, then Microsoft does not consider
the solution to be supported. There are exceptions to this rule, such as the case with multi-site
(geographically dispersed) clusters where there is no shared storage. In this scenario the expected
result of the validation wizard is that the storage tests will fail. This is still a supported solution
if the remainder of the tests complete successfully.
The type of test that fails indicates which corrective action to take. For example, if the storage
test "List all disks" fails, and subsequent storage tests do not run (because these would also fail),
contact the storage vendor to troubleshoot. Similarly, if a network test related to IP addresses
fails, consult your network infrastructure team. Resolving most warnings or errors involves
working with internal teams or with a specific hardware vendor.
After the issues have been resolved, rerun the cluster validation wizard. For the configuration
to be considered supported, all tests must run and complete without failures.
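The support rule and its multi-site exception can be summarized as a small decision function. This is a minimal sketch under stated assumptions, not a Microsoft or HP tool: the evaluate function, the category keys ("storage", "network"), and the contact strings are hypothetical, and the input is assumed to be a hand-made summary of the failed tests from one validation run.

```python
from typing import Dict, List, Tuple

# Hypothetical first points of contact per test category.
CONTACTS = {
    "storage": "storage vendor",
    "network": "network infrastructure team",
}
DEFAULT_CONTACT = "internal teams or the hardware vendor"


def evaluate(failed: Dict[str, List[str]],
             shared_storage: bool) -> Tuple[bool, List[str]]:
    """Apply the support rule to the failed tests of one validation run.

    `failed` maps a test category to the names of its failed tests.
    Returns (supported, follow_ups).
    """
    follow_ups = []
    for category, tests in failed.items():
        if not tests:
            continue
        # Expected failure: storage tests on a cluster without shared storage.
        if category == "storage" and not shared_storage:
            continue
        contact = CONTACTS.get(category, DEFAULT_CONTACT)
        follow_ups.append(f"{category}: {', '.join(tests)} -> contact {contact}")
    return (not follow_ups, follow_ups)


# Multi-site cluster with replicated (not shared) storage.
print(evaluate({"storage": ["List all disks"]}, shared_storage=False))
# Same failure on a cluster that relies on shared storage.
print(evaluate({"storage": ["List all disks"]}, shared_storage=True))
```

A non-empty follow-up list means the configuration is not yet supported; after the listed issues are resolved, the validation wizard has to be rerun and the new results evaluated again.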
Validation Issues for Multi-site or Geographically Dispersed Failover Clusters
Failover cluster solutions that do not have a common shared disk and instead leverage data
replication between nodes might not pass the cluster validation "storage" tests. This is a common
configuration in cluster solutions where nodes are stretched across geographic regions. If a cluster
solution does not require external storage to fail over from one node to another, it does not need
to pass the "storage" tests to be a fully supported solution.
For more information on multi-site or geographically dispersed clusters, see the following white
paper:
http://go.microsoft.com/fwlink/?LinkId=112125
Troubleshooting
See the following documents for more information about troubleshooting errors and interpreting
system event descriptions in clusters: