HP Integrity rx2800 rx2800 i2 User Service Guide - Page 80

Fault management overview, HP-UX fault management, WBEM indication providers

Page 80 highlights

Fault management overview The goal of fault management and monitoring is to increase system availability, by moving from a reactive fault detection, diagnosis, and repair strategy to a proactive fault detection, diagnosis, and repair strategy. The objectives are as follows: • To detect problems automatically, as nearly as possible to when they actually occur. • To diagnose problems automatically, at the time of detection. • To automatically report in understandable text a description of the problem, the likely causes of the problem, the recommended actions to resolve the problem, and detailed information about the problem. • To ensure that tools are available to repair or recover from the fault. HP-UX fault management Proactive fault prediction and notification is provided on HP-UX by SysFaultMgmt WBEM indication providers. WBEM provideS frameworks for monitoring and reporting events. SysFaultMgmt WBEM indication providers allow users to monitor the operation of a wide variety of hardware products, and alert them immediately if any failure or other unusual event occurs. By using hardware event monitoring, users can virtually eliminate undetected hardware failures that could interrupt system operation or cause data loss. WBEM indication providers Hardware monitors are available to monitor the following components (These monitors are distributed free on the OE media): • Server/fans/environment • CPU monitor • UPS monitor* • FC hub monitor* • FC switch monitor* • Memory monitor • Core electronics components • Disk drives • Ha_disk_array NOTE: No SysFaultMgmt WBEM indication provider is currently available for components followed by an asterisk. Errors and reading error logs Event log definitions Often the underlying root cause of an MCA event is captured by system or BMC firmware in both the System Event and Forward Progress Event Logs (SEL and FP, respectively). These errors are easily matched with MCA events by their timestamps. For example, the loss of a CPU VRM might cause a CPU fault. Decoding the MCA error logs would only identify the failed CPU as the most likely faulty CRU. Following are some important points to remember about events and event logs: • Event logs are the equivalent of the old server logs for status or error information output. • Symbolic names are used in the source code; for example, MC_CACHE_CHECK. 80 Troubleshooting

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109
  • 110
  • 111
  • 112
  • 113
  • 114
  • 115
  • 116
  • 117
  • 118
  • 119
  • 120
  • 121
  • 122
  • 123
  • 124
  • 125
  • 126
  • 127
  • 128
  • 129
  • 130
  • 131
  • 132
  • 133
  • 134
  • 135
  • 136
  • 137
  • 138
  • 139
  • 140
  • 141
  • 142
  • 143
  • 144
  • 145
  • 146
  • 147
  • 148
  • 149
  • 150
  • 151

Fault management overview
The goal of fault management and monitoring is to increase system availability, by moving from
a reactive fault detection, diagnosis, and repair strategy to a proactive fault detection, diagnosis,
and repair strategy. The objectives are as follows:
To detect problems automatically, as nearly as possible to when they actually occur.
To diagnose problems automatically, at the time of detection.
To automatically report in understandable text a description of the problem, the likely causes
of the problem, the recommended actions to resolve the problem, and detailed information
about the problem.
To ensure that tools are available to repair or recover from the fault.
HP-UX fault management
Proactive fault prediction and notification is provided on HP-UX by SysFaultMgmt WBEM indication
providers. WBEM provideS frameworks for monitoring and reporting events.
SysFaultMgmt WBEM indication providers allow users to monitor the operation of a wide variety
of hardware products, and alert them immediately if any failure or other unusual event occurs. By
using hardware event monitoring, users can virtually eliminate undetected hardware failures that
could interrupt system operation or cause data loss.
WBEM indication providers
Hardware monitors are available to monitor the following components (These monitors are distributed
free on the OE media):
Server/fans/environment
CPU monitor
UPS monitor*
FC hub monitor*
FC switch monitor*
Memory monitor
Core electronics components
Disk drives
Ha_disk_array
NOTE:
No SysFaultMgmt WBEM indication provider is currently available for components
followed by an asterisk.
Errors and reading error logs
Event log definitions
Often the underlying root cause of an MCA event is captured by system or BMC firmware in both
the System Event and Forward Progress Event Logs (SEL and FP, respectively). These errors are
easily matched with MCA events by their timestamps. For example, the loss of a CPU VRM might
cause a CPU fault. Decoding the MCA error logs would only identify the failed CPU as the most
likely faulty CRU. Following are some important points to remember about events and event logs:
Event logs are the equivalent of the old server logs for status or error information output.
Symbolic names are used in the source code; for example,
MC_CACHE_CHECK
.
80
Troubleshooting