HP Cluster Platform Interconnects v2010 Quadrics QsNetII Interconnect - Page 132


QM500 driver unable to determine network position
The QM500 (Elan) driver reports that it is unable to determine its network
position. The QM500 PCI adapter is found, but the driver is unable to
communicate with the network through the card. Proceed as follows:
1. Verify that the card is functioning by running qselantest. It is possible
   that the driver is able to communicate only partially with the card. If
   qselantest fails, it is likely that the card is poorly seated in its PCI
   connector. Reseat the card.
2. Check the green LED at both ends of the link cable. If the green LEDs are
   not lit (or are lit at only one end), it is likely that the cable is
   faulty. Try reseating the cable connections.
3. If reseating the cable connections does not help, swap the cable for a
   replacement that you know to be good.
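
The order of the checks above can be written down as a small decision function. This is only a sketch in Python for illustration: the function name, its parameters, and the returned strings are hypothetical, and the inputs are results you gather by hand (the diagnostic outcome and the LED states), not values read from any Quadrics interface.

```python
def next_action(card_test_passed, near_led_lit, far_led_lit, cable_reseated=False):
    """Return the next step suggested by the procedure above.

    All names here are hypothetical. The arguments are observations made
    by the operator, not values obtained from a Quadrics API.
    """
    if not card_test_passed:
        # Step 1: a failing card test points at a poorly seated card.
        return "reseat card"
    if near_led_lit and far_led_lit:
        # Both link LEDs are lit, so these steps give no further guidance.
        return "no cable fault indicated"
    if not cable_reseated:
        # Step 2: one or both LEDs are dark, so try the connections first.
        return "reseat cable connections"
    # Step 3: reseating did not help, so swap in a known-good cable.
    return "replace cable"
```

Calling the function again after each physical action, with the updated observations, walks through the steps in the same order as the list.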

Node has an incorrect nodeset
You can determine the nodeset by examining the /proc/elan/device0/nodeset
file. An anomalous nodeset can mean either that the QM500 network adapter is
malfunctioning intermittently, or that there is a fault in the interconnect
network above the problem node. Proceed as follows:
1. Use a tool such as pdsh with dshbak to view the nodeset on every node and
   collate the returned data. The nodeset information is held in procfs, in
   the text file /proc/qsnet/ep/rail0/nodeset.
2. A contiguous group of nodes with a broken nodeset suggests that the error
   is in the interconnect network. Run network diagnostics.
3. Isolated nodes with broken nodesets are more likely to have a broken or
   poorly seated QM500 card. Reseat the card.
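
The distinction drawn in steps 2 and 3 can be sketched in Python. This is a rough illustration and not part of the Quadrics tools. It assumes that the nodeset text from every node has already been gathered (for example with pdsh) into a dictionary, that the most common value is the correct one, and that node names end in a number whose adjacency roughly mirrors position in the network. The nodeset text is treated as an opaque string, and the names node_index and classify_nodesets are hypothetical.

```python
import re
from collections import Counter


def node_index(name):
    """Extract the trailing number from a node name such as 'n12' (assumed naming)."""
    match = re.search(r"(\d+)$", name)
    if match is None:
        raise ValueError(f"node name has no numeric suffix: {name!r}")
    return int(match.group(1))


def classify_nodesets(nodesets):
    """Split nodes whose nodeset differs from the majority into groups and singles.

    nodesets maps node name -> nodeset text, compared as an opaque string.
    Returns (majority_value, contiguous_groups, isolated_nodes), where the
    last two hold node index numbers.
    """
    # Assumption: the value reported by most nodes is the healthy one.
    expected, _ = Counter(nodesets.values()).most_common(1)[0]
    bad = sorted(node_index(n) for n, v in nodesets.items() if v != expected)

    # Collect consecutive index numbers into runs.
    runs = []
    for idx in bad:
        if runs and idx == runs[-1][-1] + 1:
            runs[-1].append(idx)
        else:
            runs.append([idx])

    groups = [r for r in runs if len(r) > 1]        # suggests a network fault
    isolated = [r[0] for r in runs if len(r) == 1]  # suggests a card problem
    return expected, groups, isolated
```

On a cluster whose node numbering does not follow the switch topology, the contiguity test would need to be replaced by a lookup of which nodes share a switch.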

QM500 (Elan) driver displays unusual messages
Unexpected driver messages might be displayed, such as the following:
   Rev A switch detected...
   ...change in network level....
You might see these messages in conjunction with a nodeset problem, as
described in the preceding troubleshooting symptoms. Proceed as follows:
1. The QM500 network adapter is either faulty or needs to be reseated in its
   PCI connector. Test the card with a diagnostic and reseat the card.
2. A useful way of detecting nodes with Elan driver problems is to route all
   syslog kernel messages from the nodes to a log host. Configure this
   routing in syslog.conf in the node system images. You can then examine
   the syslog log file on the log host by using the tail command.
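
Once the messages are collected on a log host, a short script can show which nodes report them most often. The following Python sketch is illustrative only. It assumes the traditional syslog line layout, in which the fourth whitespace-separated field is the host name, and it matches only the message fragments quoted above without guessing the elided parts. The name count_elan_messages is hypothetical, and the location of the log file depends on your site configuration.

```python
from collections import Counter

# Message fragments quoted in the text above; the elided parts are not guessed.
PATTERNS = ("Rev A switch detected", "change in network level")


def count_elan_messages(lines):
    """Count matching lines per host.

    Assumes lines of the form 'Mon DD HH:MM:SS host tag: message', so the
    host name is the fourth whitespace-separated field.
    """
    counts = Counter()
    for line in lines:
        if not any(pattern in line for pattern in PATTERNS):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # line is not in the assumed layout
        counts[fields[3]] += 1
    return counts
```

The function accepts any iterable of lines, so an open log file can be passed in directly, and calling most_common() on the result lists the noisiest nodes first.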

Applications receive signal 6 (I/O trap) on the node
Signal 6 indicates a QM500 hardware exception. Further information can be
found by using edb on the core produced (this is done by default when the
exception occurs). Exceptions usually mean that a node is generating hardware
errors. Proceed as follows:
1. It is possible that this node is on the receiving end of a hardware error
   generated elsewhere in the network. Configure the node out of the network
   by using qsctrl -o. If the exception moves to another node, it is a sign
   that the node itself is not the cause of the problem.
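
When tracking whether the exception moves to another node, it can help to confirm that a job really ended on signal 6 rather than failing for another reason. The Python sketch below is an illustration under stated assumptions: it runs on a POSIX system such as Linux, where signal 6 is SIGABRT (historically also named SIGIOT), and it detects only the signal, not its hardware cause. The name killed_by_signal_6 is hypothetical.

```python
import subprocess


def killed_by_signal_6(argv):
    """Run the command in argv and report whether signal 6 terminated it.

    On POSIX systems, subprocess reports a child killed by signal N with
    a return code of -N.
    """
    result = subprocess.run(argv)
    return result.returncode == -6
```

A wrapper like this could be used around a small test job on each suspect node before and after the node is configured out of the network.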

13-2 Troubleshooting Nodes and Links