HP Cluster Platform Interconnects v2010 HP Cluster Platform InfiniBand Interco - Page 108

Debugging a Fabric Failure by Using Performance Management (PM)



3. Check the Link LEDs on the line board front panels for the various ports. In the event of a
   system lock-up in the ISR 9XXX, use the reset button to reset and reboot the system.
   The system has two reset buttons: one is located on the front panel of the sFU-8 fan assembly
   module and one on the front panel of the sCTRL board. To reset the router, press the button
   with a thick wire or the tip of a pen until the system reboots. Remove the wire immediately
   afterwards.
4. If the boot information and CLI prompt are not displayed, verify that the PC terminal
   emulation program is set correctly and that the PC is connected properly to the Console
   management port.
9.3 Debugging a Fabric Failure by Using Performance Management (PM)
The interconnect's fabric manager enables you to debug problems with fabric connections by
using the performance management (PM) features. The following two PM functions support
fabric debugging:
• Port counter monitoring and reporting. The PM periodically generates a port counter report
  file in comma-separated value (CSV) text format. You can import this file into spreadsheets
  such as Microsoft Excel for analysis. The PM monitors port error counters and reports every
  port that has exceeded its user-configured error threshold. Examine the report file to find
  problem ports.
• Event logging. The PM creates an event log file for both IB traps and SM internal events.
  Examine the log to find problem ports.
Refer to the Voltaire InfiniBand Fabric Management and Diagnostic Guide and the HP Cluster
Platform InfiniBand Fabric Management and Diagnostic Guide for information on using the PM
and interpreting the log files.
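
As an illustration of examining the port counter report outside a spreadsheet, the following Python sketch scans a CSV report and lists every port whose error counter exceeds a threshold. The column names (`node`, `port`, `symbol_errors`, `link_downed`), the threshold values, and the sample rows are all hypothetical; the actual report layout is described in the guides referenced above and may differ.

```python
import csv
import io

# Hypothetical counters and limits. The real PM report columns and the
# user-configured thresholds are defined by the fabric manager setup.
THRESHOLDS = {"symbol_errors": 10, "link_downed": 1}

# Made-up sample data standing in for a PM port counter report file.
SAMPLE_REPORT = """node,port,symbol_errors,link_downed
switch-a,1,0,0
switch-a,2,25,0
switch-b,7,3,2
"""


def find_problem_ports(csv_text, thresholds=THRESHOLDS):
    """Return (node, port, counter, value) for each counter above its limit."""
    problems = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for counter, limit in thresholds.items():
            value = int(row.get(counter) or 0)  # treat a missing field as 0
            if value > limit:
                problems.append((row["node"], row["port"], counter, value))
    return problems


if __name__ == "__main__":
    for node, port, counter, value in find_problem_ports(SAMPLE_REPORT):
        print(f"{node} port {port}: {counter}={value}")
```

To use the sketch on a real report, read the file contents into a string and adjust `THRESHOLDS` and the column names to match the report header.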
9.4 Identifying a Leaf or Spine Port Malfunction on an ISR 9XXX
You can identify a failed leaf or spine port by using the fabric manager as follows:
1. Launch the fabric manager and select the relevant ISR 9XXX interconnect in the topology
   map of the main window.
2. Click the nodes information icon in the toolbar to launch a new browser window.
3. A separate browser window displays the following data:
   • ISR 9XXX node information.
   • The leaf/spine chip status.
   • The port connection matrix, including:
     - Each ISR 9XXX external port.
     - Each internal chip port.
     - The channel adapter connection.
   For example:
   L4:4, Anafa:1, LID:0680, Port:1 Attached to CA, LID:06A9, Port:2, GUID:0008f10403965020
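
If you copy connection matrix entries out of the browser window for further analysis, a short script can split them into fields. The following Python sketch parses the example entry above. The pattern is inferred from that single line only, and the field names (for instance, reading `L4:4` as a line board number followed by a port number) are assumptions; other entries may use a different layout.

```python
import re

# Pattern inferred from the one example entry in this section.
# Field names are illustrative guesses, not documented identifiers.
ENTRY_RE = re.compile(
    r"L(?P<board>\d+):(?P<board_port>\d+),\s*"
    r"Anafa:(?P<chip>\d+),\s*"
    r"LID:(?P<lid>[0-9A-Fa-f]+),\s*"
    r"Port:(?P<port>\d+)\s+"
    r"Attached to CA,\s*"
    r"LID:(?P<ca_lid>[0-9A-Fa-f]+),\s*"
    r"Port:(?P<ca_port>\d+),\s*"
    r"GUID:(?P<ca_guid>[0-9A-Fa-f]{16})"
)

EXAMPLE = (
    "L4:4, Anafa:1, LID:0680, Port:1 Attached to CA, "
    "LID:06A9, Port:2, GUID:0008f10403965020"
)


def parse_entry(line):
    """Return a dict of string fields, or None if the line does not match."""
    match = ENTRY_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_entry(EXAMPLE))
```

All values are returned as strings so that the hexadecimal LIDs and the GUID keep their leading zeros.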
9.5 Detecting a Failed Port
By combining the node information and the port counter file information described in the
Voltaire InfiniBand Fabric Management and Diagnostic Guide and the HP Cluster Platform
InfiniBand Fabric Management and Diagnostic Guide, you can identify any fabric port
connectivity failure. A port is considered failed if either of the following conditions occurs:
• It does not respond to a NodeInfo request.
• It has a GUID that is a duplicate of another port's GUID.
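
The two failure conditions above can be sketched as a simple check over a list of port records. In this Python example, the record layout (`name`, `guid`, `responds`) is hypothetical, and the sample names and GUIDs are made up for illustration. Collecting the actual NodeInfo responses and GUIDs is done with the fabric manager and is not shown here.

```python
from collections import defaultdict


def find_failed_ports(ports):
    """Map each failed port name to the list of reasons it is considered failed.

    `ports` is a list of dicts with hypothetical keys:
      name     - any label that identifies the port
      guid     - the port GUID as a string, or None if unknown
      responds - True if the port answered a NodeInfo request
    """
    # Group port names by GUID so duplicates can be detected.
    names_by_guid = defaultdict(list)
    for port in ports:
        if port["guid"] is not None:
            names_by_guid[port["guid"]].append(port["name"])

    failed = {}
    for port in ports:
        reasons = []
        if not port["responds"]:
            reasons.append("no NodeInfo response")
        if port["guid"] is not None and len(names_by_guid[port["guid"]]) > 1:
            reasons.append("duplicate GUID")
        if reasons:
            failed[port["name"]] = reasons
    return failed


# Made-up sample records; two of them share a GUID on purpose.
SAMPLE_PORTS = [
    {"name": "port-a", "guid": "0008f10403965020", "responds": True},
    {"name": "port-b", "guid": "0008f10403965021", "responds": False},
    {"name": "port-c", "guid": "0008f10403965020", "responds": True},
    {"name": "port-d", "guid": "0008f10403965022", "responds": True},
]

if __name__ == "__main__":
    for name, reasons in find_failed_ports(SAMPLE_PORTS).items():
        print(f"{name}: {', '.join(reasons)}")
```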
108
Postinstallation Troubleshooting and Diagnostics