HP Cluster Platform Interconnects v2010 Quadrics QsNetII Interconnect - Page 120



since the last time that all the boards and chips were selected and cleared. When
using this raw format of error data, you must decide whether the registers are
reporting genuine link errors or simply errors due to node reboots. Look for a
link that shows errors repeatedly, every day, during normal production mode
testing. Use the following procedure to run this test:
1. Open a connection to the interconnect’s master control card, or launch the
   jtest utility remotely as described in Section 11.2.
2. At the jtest utility prompt, select all boards as follows:
   # jtest> b -1
   board in slot 0 is of type QM501_CU
   board in slot 4 is of type QM502_CU
   board in slot 8 is of type QM503
   board in slot 9 is of type QM503
3. At the jtest utility prompt, select all switch chips as follows:
   # jtest> c -1
4. At the jtest utility prompt, enter the error command:
   # jtest> error
   jtest: no errors on boards 0 4 8 9 chips : 0 1 2 3 4 5 6 7
   jtest>
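When the procedure above is run regularly, it can help to record which boards jtest reported in step 2. The following is a minimal Python sketch, assuming the jtest session output has already been captured as text by some other means. The helper name `parse_boards` is hypothetical and not part of jtest; the sample text is copied from the transcript above.

```python
from __future__ import annotations

import re

# Matches the "board in slot N is of type T" lines printed after "b -1".
BOARD_LINE = re.compile(r"board in slot (\d+) is of type (\S+)")


def parse_boards(output: str) -> dict[int, str]:
    """Return {slot number: board type} for every board line in the text."""
    return {int(m.group(1)): m.group(2) for m in BOARD_LINE.finditer(output)}


# Sample text copied from the step 2 transcript.
sample = """\
jtest> b -1
board in slot 0 is of type QM501_CU
board in slot 4 is of type QM502_CU
board in slot 8 is of type QM503
board in slot 9 is of type QM503
"""

boards = parse_boards(sample)
print(boards)  # {0: 'QM501_CU', 4: 'QM502_CU', 8: 'QM503', 9: 'QM503'}
```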
If you see the same error recurring on a link, that error indicates a potential
fault. The error registers do not count the number of errors; they indicate only
that at least one error has occurred since the register was last cleared.
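Because each register only records that at least one error occurred, repetition has to be judged across separate readings. The following Python sketch illustrates one way to do that, assuming you keep one set of flagged B:C:L identifiers per day. The function name `repeated_links`, the `min_days` threshold, and the snapshot data are all hypothetical and serve only as an illustration.

```python
from collections import Counter


def repeated_links(daily_snapshots, min_days=2):
    """Return the links that were flagged on at least min_days snapshots.

    Each snapshot is a collection of "B:C:L" strings for the links that
    showed an error when the registers were read on that day.
    """
    counts = Counter(
        link for snapshot in daily_snapshots for link in set(snapshot)
    )
    return sorted(link for link, days in counts.items() if days >= min_days)


# Hypothetical readings taken on three consecutive days.
snapshots = [
    {"0:0:0", "4:2:1"},  # day 1
    {"0:0:0"},           # day 2
    {"0:0:0", "8:1:3"},  # day 3
]

print(repeated_links(snapshots))  # ['0:0:0']
```

A link that appears only once, such as 4:2:1 in this made-up data, is more likely to reflect a node reboot than a genuine link fault.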
The jtest error command generates the following information:
• B:C:L  The board, chip, and link being reported.
• E      An error has occurred.
• RtCRC  CRC error on route byte (packet and transaction error). This indicates
         some bit errors on the route values.
• TrCRC  CRC error on transaction (packet and transaction error). This indicates
         some bit errors in one of the transactions.
• RcvLk  Receiver lock error (low level line error). Problems with the received
         or local clock.
• Dskew  Deskew error (low level line error). Only likely to be caused by a hard
         failing data bit.
• Phase  Phase error (low level line error). Probably a missed clock on the
         incoming link.
• DataE  Data error (low level line error). Not a valid data value or a valid
         token.
• ChM45  Mod 4/5 change detected on link (low level line error).
• Fifo0  FIFO overrun on virtual channel 0 (protocol error).
• Fifo1  FIFO overrun on virtual channel 1 (protocol error).
• OpenT  Packet has been open at the input for too long (protocol error).
• PktRT  Packet acknowledge return error (protocol error).
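The field list above sorts each error into one of three classes. As a quick reference, that grouping can be written as a lookup table. This is a Python sketch only; the names `ERROR_CLASSES` and `classify` are hypothetical, and the class labels are taken from the parenthetical notes in the list.

```python
# Error class for each field, as given in the field descriptions.
ERROR_CLASSES = {
    "RtCRC": "packet and transaction",
    "TrCRC": "packet and transaction",
    "RcvLk": "low level line",
    "Dskew": "low level line",
    "Phase": "low level line",
    "DataE": "low level line",
    "ChM45": "low level line",
    "Fifo0": "protocol",
    "Fifo1": "protocol",
    "OpenT": "protocol",
    "PktRT": "protocol",
}


def classify(flagged):
    """Group the given field names by error class, preserving input order."""
    groups = {}
    for name in flagged:
        groups.setdefault(ERROR_CLASSES[name], []).append(name)
    return groups


print(classify(["Phase", "Fifo0", "Fifo1", "DataE"]))
# {'low level line': ['Phase', 'DataE'], 'protocol': ['Fifo0', 'Fifo1']}
```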
Protocol errors are normally caused by very high rates of errors on another
link. They can only be caused by double or triple bit errors converting one type
of token into another valid token. Note that data errors occur when a node is
reset. The following example demonstrates a protocol error:
B:C:L  E  RtCRC  TrCRC  RcvLk  Dskew  Phase  Fifo0  Fifo1  OpenT  PktRT  ChM45  DataE  Value
0:0:0  1  0      0      10     0      1      1      1      0      0      0      1      00f022
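To read a row like the example above, pair each value with its column heading. The following Python sketch does this for the example row, assuming the columns are separated by whitespace. The helper `parse_row` is hypothetical, and treating any non-zero value as "flagged" is an assumption made for illustration; values are kept exactly as printed.

```python
HEADER = (
    "B:C:L E RtCRC TrCRC RcvLk Dskew Phase Fifo0 Fifo1 "
    "OpenT PktRT ChM45 DataE Value"
)
ROW = "0:0:0 1 0 0 10 0 1 1 1 0 0 0 1 00f022"


def parse_row(header, row):
    """Pair each whitespace-separated value with its column name."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    return dict(zip(names, values))


record = parse_row(HEADER, ROW)
board, chip, link = (int(part) for part in record["B:C:L"].split(":"))

# Error columns are everything except the link id, the summary flag,
# and the raw register value.
flagged = [
    name
    for name, value in record.items()
    if name not in ("B:C:L", "E", "Value") and value != "0"
]

print(board, chip, link)  # 0 0 0
print(flagged)            # ['RcvLk', 'Phase', 'Fifo0', 'Fifo1', 'DataE']
```

In this row, the Fifo0 and Fifo1 columns are the protocol errors, alongside low level line errors on the same link.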
12-18 Maintenance and Diagnostic Procedures