HP Cluster Platform Interconnects v2010 Quadrics QsNetII Interconnect - Page 133

Troubleshooting Link Problems



2. Test the node using qselantest and qsnetcabletest to confirm whether the
   node is functional or not.
3. Test every other node in the segment by using qselantest and
   qsnetcabletest to verify their integrity. Using a tool such as pdsh with
   dshbak is helpful for sorting the diagnostic output.
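
To illustrate why sorting the output helps, the following minimal Python sketch groups nodes that report identical diagnostic text, similar in spirit to what dshbak does when it coalesces pdsh output. It assumes each input line carries the "host: " prefix that pdsh adds. The function name group_by_output and the sample lines are hypothetical placeholders, not real qselantest output.

```python
from collections import defaultdict


def group_by_output(lines):
    """Group hosts that produced identical output text.

    Each line is assumed to look like "host: text", the prefix pdsh adds.
    """
    per_host = defaultdict(list)
    for line in lines:
        host, sep, text = line.partition(": ")
        if not sep:
            continue  # skip lines that lack the "host: " prefix
        per_host[host].append(text)

    groups = defaultdict(list)
    for host, texts in per_host.items():
        groups["\n".join(texts)].append(host)
    return {output: sorted(hosts) for output, hosts in groups.items()}


# Placeholder lines for illustration only; NOT real qselantest output.
sample = [
    "node1: test passed",
    "node2: test passed",
    "node3: test FAILED",
]

for output, hosts in group_by_output(sample).items():
    print(",".join(hosts))
    print("  " + output)
```

Nodes that share a result collapse into one group, so the node that differs from the rest of the segment stands out.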

13.2 Troubleshooting Link Problems

A high occurrence of network errors seen using the qsnetstat or qsneterr
diagnostics

Network errors are often due to a badly seated or failing link component. You
can determine the location of the error by using the qsnetstat or qsneterr
diagnostic tools. The cause of the errors might be in the link cable, the
cable connectors, or the switch cards. If errors are reported on the receiver
during data transfer, the cause might also be at the other end of the
interconnecting link. Proceed as follows:
1. Identify the end points of the link reporting errors. Error locations are
   identified by the switch module name, a switch card id, and either a port
   id or a chip and link. Typical output from the qsnetstat command is as
   follows:

   Name    B  C:L/Port   State  CRC      Clock      Data     Protocol
   QR1N07  6  07  ULink  R      0 (0/0)  342 (8/2)  0 (0/0)  0 (0/0)
   QR0N11  2  0:6 Intnl  N      4 (2/0)  0 (0/0)    0 (0/0)  0 (0/0)

   The first line indicates a port; the second line has the colon delimiter,
   indicating a chip:link location. Port ids are used for locations where a
   link cable connects to a switch card. Chip and link ids are used for
   internal links, either between chips on the same switch card or between
   chips on different switch cards, connecting through the midplane of the
   interconnect.
2. If the error is contained within a single switch card, replace the card and
   retest the network with the replacement.
3. Errors in links between switch cards that pass through the midplane of a
   federated node-level interconnect can be in either of the cards or in the
   midplane itself. Reseat both cards connected by the problem link as the
   first step. If this fails to clear the error, replace the QM502, which
   carries the up-links to the top-level interconnects. If the problem
   persists, swap the QM501, which carries the down-links to the cluster
   nodes. If swapping both cards doesn’t fix the problem, the midplane may be
   at fault and a replacement is required.
4. Errors reported at the ports of the switch cards may be due to the card,
   the port, or the interconnecting cable. Diagnosing the cause of the problem
   may disrupt the availability of the nodes local to the fault. To minimize
   downtime, configure out the link by using the qsctrl -o command. You can
   then rectify the problem during a scheduled maintenance period. Configure
   the link back in by using qsctrl -i before diagnosis proceeds.
   Diagnosis starts with the cables and moves on to the switch cards in the
   following steps:
   a. Visually inspect and reseat both ends of the link cable. On reseat, a
      solid green LED should be lit next to the port, and the red LED should
      not be lit. Use qsnetstat to monitor the link. If the link comes out of
      reset and the error counts stop incrementing, note its location and
      make it clean. If it fails to come out of reset after a number of
      reseat attempts, replace the cable.
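
The distinction in step 1 between port locations and chip:link locations decides which of steps 2 to 4 applies. The following minimal Python sketch shows one way to make that distinction from the sample qsnetstat rows above. It assumes the whitespace-separated layout of that sample holds in general, and it treats column B as the switch card id, which is an interpretation rather than something the text states. The names LinkRow and parse_row are hypothetical. Only the leading number of each counter is read; the parenthesised values are left uninterpreted.

```python
from dataclasses import dataclass

# Counter columns, in the order shown in the sample header line.
COUNTERS = ("CRC", "Clock", "Data", "Protocol")


@dataclass
class LinkRow:
    name: str      # switch module name
    card: str      # column "B" (assumed here to be the switch card id)
    location: str  # port id, or chip:link
    kind: str      # "port" or "chip:link"
    counts: dict   # leading error count per counter column


def parse_row(line):
    """Parse one data row laid out like the sample qsnetstat output."""
    tokens = line.split()
    name, card, location = tokens[0], tokens[1], tokens[2]
    # A colon in the location marks an internal chip:link, otherwise a port.
    kind = "chip:link" if ":" in location else "port"
    # tokens[3] and tokens[4] hold the link type and state. The counters
    # follow as "<count> (<x>/<y>)" pairs, so counts sit at 5, 7, 9 and 11.
    counts = {label: int(tokens[5 + 2 * i]) for i, label in enumerate(COUNTERS)}
    return LinkRow(name, card, location, kind, counts)


rows = [
    "QR1N07 6 07 ULink R 0 (0/0) 342 (8/2) 0 (0/0) 0 (0/0)",
    "QR0N11 2 0:6 Intnl N 4 (2/0) 0 (0/0) 0 (0/0) 0 (0/0)",
]

for row in map(parse_row, rows):
    nonzero = {label: n for label, n in row.counts.items() if n}
    print(row.name, row.card, row.location, row.kind, nonzero)
```

For the two sample rows, the first is classified as a port with a non-zero Clock count, which points to the cable and port checks in step 4. The second is classified as a chip:link with a non-zero CRC count, which points to the card and midplane checks in steps 2 and 3.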

Troubleshooting Nodes and Links 13-3