HP Cluster Platform Interconnects v2010 Quadrics QsNetII Interconnect - Page 134



b. On the second occurrence of an error at the same link location, repeat
   the procedure described in Step a.
c. On the third occurrence of an error at the same location, replace the cable.
d. On the first occurrence of an error in the same location with the new
   cable, replace one of the switch cards to which the cable is connected. If
   the errors are cleared then the replaced switch card is at fault. If the
   error persists then replace the other switch card.
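
The escalation in Steps b through d can be summarized as a small decision function. This is only an illustrative sketch: the function name, its parameters, and the returned strings are hypothetical and are not part of any QsNetII tool. Treating the first occurrence as Step a is inferred from the wording of Step b.

```python
def next_action(occurrence: int,
                cable_replaced: bool = False,
                first_card_replaced: bool = False) -> str:
    """Suggest the next step for an error recurring at one link location.

    occurrence counts errors seen at this location since the last
    hardware change (hypothetical bookkeeping, not a tool feature).
    """
    if occurrence < 1:
        raise ValueError("occurrence must be 1 or greater")
    if cable_replaced:
        # Step d: the error came back even with the new cable.
        if first_card_replaced:
            return "replace the other switch card"
        return "replace one of the switch cards"
    if occurrence <= 2:
        # Step b repeats Step a, which is defined earlier in the procedure.
        return "perform Step a"
    # Step c: third occurrence at the same location.
    return "replace the cable"
```

For example, next_action(3) suggests replacing the cable, and next_action(1, cable_replaced=True) suggests replacing one of the two switch cards.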

A link appears to be disconnected

A disconnected link is indicated in the Links in Reset panel in the output from
the qsnetstat command. Proceed as follows:
1. Links will go through reset if they are the down links to nodes that have
   been rebooted or powered down. Any uplinks from federated node-level
   interconnects to top-level interconnects which have been powered off will
   also appear in reset. In these cases no further action is required.
2. Links with high error counts may go into reset. Physical disconnection of the
   link cable from the switch card port will also result in the link going into
   reset. Follow the procedures described in the preceding diagnostic (A high
   occurrence of network errors seen using the qsnetstat or qsneterr
   diagnostics).
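
The two cases above amount to a simple triage of each link listed in the Links in Reset panel. The sketch below assumes you have already determined, by your own means, whether the far end of the link was deliberately rebooted or powered off. The ResetLink type, its fields, and the example location string are hypothetical and do not reflect qsnetstat output.

```python
from dataclasses import dataclass


@dataclass
class ResetLink:
    location: str            # hypothetical label for the link location
    far_end_offline: bool    # node rebooted or powered down, or the
                             # top-level interconnect powered off


def triage(link: ResetLink) -> str:
    """Return the action for one link that appears in reset."""
    if link.far_end_offline:
        # Case 1: the reset is expected.
        return "no action required"
    # Case 2: high error counts or a physically disconnected cable.
    return "follow the network errors diagnostic"
```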

Poor application performance indicated by nodes hanging or low bandwidth or high latency

Poor performance might be caused by unexpected processes running on the
compute nodes of a cluster. If these can be ruled out then a possible cause is a
high occurrence of network errors. Proceed as follows:
1. Run qsnetstat to identify the link(s) with high error rates.
2. Examine monitoring log histories to look for a specific event that might
   have caused the problem or that enables you to identify when the problem
   started.
3. Use the link location as an argument to the qsctrl command and configure
   the link out of the network.
4. If application characteristics are restored then diagnose the fault using the
   process described in the diagnostic (A high occurrence of network errors seen
   using the qsnetstat or qsneterr diagnostics). If the application is still not
   behaving as expected then the network is probably not the cause. Escalate the
   problem to the next level of support.

You suspect that a link cable replacement is required

You suspect that a link cable needs replacement due to high error counts or possible
physical damage to the cable (such as exceeding its bend radius, which might cause
invisible damage). Proceed as follows:
1. Route a temporary link cable in place of the suspected bad cable and run
   diagnostics with the temporary cable in place.
2. Do not replace the original cable until the test is complete. It is possible
   that the fault is in the port connector or the switch card itself. If the
   fault persists with the temporary cable then further diagnosis is required.
3. If the test shows that the original cable is not at fault then reconnect the
   original cable and remove the temporary link.
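
The outcome of the temporary-cable test can be sketched as follows. The function name and returned strings are hypothetical. The branch for a fault that clears with the temporary cable is inferred from Step 2 (the original cable is replaced only after the test implicates it) and is not stated explicitly in the procedure.

```python
from typing import List


def after_cable_test(fault_persists: bool) -> List[str]:
    """List follow-up actions once diagnostics have run on the temporary cable."""
    if fault_persists:
        # The original cable is not the cause; look at the port
        # connector or the switch card instead.
        return [
            "reconnect the original cable",
            "remove the temporary link",
            "continue diagnosis (port connector or switch card)",
        ]
    # Inferred: the fault followed the original cable.
    return ["replace the original cable"]
```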

13-4 Troubleshooting Nodes and Links