Intel S2600GZ S2600GZ/GL - Page 69

Fault Resilient Booting FRB, Sensor Monitoring

Page 69 highlights

Platform Management Functional Overview Intel® Server Board S2600GZ/GL TPS The BMC watchdog feature will only allow up to three resets of the BMC CPU (such as HW reset) or entire FW stack (such as a SW reset) before giving up and remaining in the uBOOT code. This count is cleared upon cycling of power to the BMC or upon continuous operation of the BMC without a watchdog-generated reset occurring for a period of > 30 minutes. The BMC FW logs a SEL event indicating that a watchdog-generated BMC reset (either soft or hard reset) has occurred. This event may be logged after the actual reset has occurred. Refer sensor section for details for the related sensor definition. The BMC will also indicate a degraded system status on the Front Panel Status LED after a BMC HW reset or FW stack reset. This state (which follows the state of the associated sensor) will be cleared upon system reset or (AC or DC) power cycle. Note: A reset of the BMC may result in the following system degradations that will require a system reset or power cycle to correct: 1. Timeout value for the rotation period can be set using this parameterPotentially incorrect ACPI Power State reported by the BMC. 2. Reversion of temporary test modes for the BMC back to normal operational modes. 3. FP status LED and DIMM fault LEDs may not reflect BIOS detected errors. 6.5 Fault Resilient Booting (FRB) Fault resilient booting (FRB) is a set of BIOS and BMC algorithms and hardware support that allow a multiprocessor system to boot even if the bootstrap processor (BSP) fails. Only FRB2 is supported using watchdog timer commands. FRB2 refers to the FRB algorithm that detects system failures during POST. The BIOS uses the BMC watchdog timer to back up its operation during POST. The BIOS configures the watchdog timer to indicate that the BIOS is using the timer for the FRB2 phase of the boot operation. After the BIOS has identified and saved the BSP information, it sets the FRB2 timer use bit and loads the watchdog timer with the new timeout interval. If the watchdog timer expires while the watchdog use bit is set to FRB2, the BMC (if so configured) logs a watchdog expiration event showing the FRB2 timeout in the event data bytes. The BMC then hard resets the system, assuming the BIOS-selected reset as the watchdog timeout action. The BIOS is responsible for disabling the FRB2 timeout before initiating the option ROM scan and before displaying a request for a boot password. If the processor fails and causes an FRB2 timeout, the BMC resets the system. The BIOS gets the watchdog expiration status from the BMC. If the status shows an expired FRB2 timer, the BIOS enters the failure in the system event log (SEL). In the OEM bytes entry in the SEL, the last POST code generated during the previous boot attempt is written. FRB2 failure is not reflected in the processor status sensor value. The FRB2 failure does not affect the front panel LEDs. 6.6 Sensor Monitoring The BMC monitors system hardware and reports system health. Some of the sensors include those for monitoring  Component, board, and platform temperatures  Board and platform voltages 56 Revision 1.1 Intel order number G24881-004

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109
  • 110
  • 111
  • 112
  • 113
  • 114
  • 115
  • 116
  • 117
  • 118
  • 119
  • 120
  • 121
  • 122
  • 123
  • 124
  • 125
  • 126
  • 127
  • 128
  • 129
  • 130
  • 131
  • 132
  • 133
  • 134
  • 135
  • 136
  • 137
  • 138
  • 139
  • 140
  • 141
  • 142
  • 143
  • 144
  • 145
  • 146
  • 147
  • 148
  • 149
  • 150
  • 151
  • 152
  • 153
  • 154
  • 155
  • 156
  • 157
  • 158
  • 159
  • 160
  • 161
  • 162
  • 163
  • 164
  • 165
  • 166
  • 167
  • 168
  • 169
  • 170
  • 171
  • 172
  • 173
  • 174
  • 175
  • 176
  • 177
  • 178
  • 179
  • 180
  • 181
  • 182
  • 183
  • 184
  • 185
  • 186
  • 187
  • 188
  • 189
  • 190
  • 191
  • 192
  • 193
  • 194
  • 195
  • 196
  • 197
  • 198
  • 199
  • 200
  • 201
  • 202
  • 203
  • 204
  • 205
  • 206
  • 207
  • 208
  • 209
  • 210
  • 211
  • 212
  • 213
  • 214
  • 215
  • 216
  • 217
  • 218
  • 219
  • 220
  • 221
  • 222
  • 223
  • 224
  • 225
  • 226
  • 227
  • 228
  • 229
  • 230
  • 231
  • 232
  • 233
  • 234
  • 235
  • 236
  • 237
  • 238
  • 239
  • 240
  • 241
  • 242
  • 243
  • 244
  • 245
  • 246
  • 247
  • 248
  • 249
  • 250
  • 251
  • 252
  • 253
  • 254
  • 255
  • 256
  • 257
  • 258
  • 259
  • 260
  • 261
  • 262
  • 263
  • 264

Platform Management Functional Overview
Intel® Server Board S2600GZ/GL TPS
Revision 1.1
Intel order number G24881-004
56
The BMC watchdog feature will only allow up to three resets of the BMC CPU (such as HW
reset) or entire FW stack (such as a SW reset) before giving up and remaining in the uBOOT
code. This count is cleared upon cycling of power to the BMC or upon continuous operation of
the BMC without a watchdog-generated reset occurring for a period of > 30 minutes. The BMC
FW logs a SEL event indicating that a watchdog-generated BMC reset (either soft or hard reset)
has occurred. This event may be logged after the actual reset has occurred. Refer sensor
section for details for the related sensor definition. The BMC will also indicate a degraded
system status on the Front Panel Status LED after a BMC HW reset or FW stack reset. This
state (which follows the state of the associated sensor) will be cleared upon system reset or (AC
or DC) power cycle.
Note:
A reset of the BMC may result in the following system degradations that will require a
system reset or power cycle to correct:
1.
Timeout value for the rotation period can be set using this parameterPotentially incorrect
ACPI Power State reported by the BMC.
2.
Reversion of temporary test modes for the BMC back to normal operational modes.
3.
FP status LED and DIMM fault LEDs may not reflect BIOS detected errors.
6.5
Fault Resilient Booting (FRB)
Fault resilient booting (FRB) is a set of BIOS and BMC algorithms and hardware support that
allow a multiprocessor system to boot even if the bootstrap processor (BSP) fails. Only FRB2 is
supported using watchdog timer commands.
FRB2 refers to the FRB algorithm that detects system failures during POST. The BIOS uses the
BMC watchdog timer to back up its operation during POST. The BIOS configures the watchdog
timer to indicate that the BIOS is using the timer for the FRB2 phase of the boot operation.
After the BIOS has identified and saved the BSP information, it sets the FRB2 timer use bit and
loads the watchdog timer with the new timeout interval.
If the watchdog timer expires while the watchdog use bit is set to FRB2, the BMC (if so
configured) logs a watchdog expiration event showing the FRB2 timeout in the event data bytes.
The BMC then hard resets the system, assuming the BIOS-selected reset as the watchdog
timeout action.
The BIOS is responsible for disabling the FRB2 timeout before initiating the option ROM scan
and before displaying a request for a boot password. If the processor fails and causes an FRB2
timeout, the BMC resets the system.
The BIOS gets the watchdog expiration status from the BMC. If the status shows an expired
FRB2 timer, the BIOS enters the failure in the system event log (SEL). In the OEM bytes entry
in the SEL, the last POST code generated during the previous boot attempt is written. FRB2
failure is not reflected in the processor status sensor value.
The FRB2 failure does not affect the front panel LEDs.
6.6
Sensor Monitoring
The BMC monitors system hardware and reports system health. Some of the sensors include
those for monitoring
Component, board, and platform temperatures
Board and platform voltages