Intel X38ML Product Specification - Page 103

Error Handling and Logging


Intel® Server Board X38ML
Error Reporting and Handling
Revision 1.3
Intel order number E15331-006
6.2 Error Handling and Logging
This section defines how the system BIOS handles errors, including the role of the BIOS in
error handling and the interaction between the BIOS, platform hardware, and server management
firmware. It also describes error-logging techniques and defines the error codes.
6.2.1 Error Sources and Types
One of the major requirements of server management is to correctly and consistently handle
system errors. System errors that can be enabled and disabled individually or as a group can be
categorized as follows:
  • PCI bus
  • Memory single- and multi-bit errors
  • Sensors
  • Errors detected during POST, logged as POST errors
Sensors are managed by the BMC. The BMC can receive event messages from individual sensors
and log system events. For more information on BMC-logged errors, see the BMC EPS.
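
The idea that error sources can be enabled and disabled individually or as a group can be sketched with a bit-flag type. This is only an illustration: the names `ErrorSource`, `set_enabled`, and `is_enabled` are hypothetical and do not come from the BIOS or the BMC interface.

```python
from enum import Flag, auto


class ErrorSource(Flag):
    """Error categories listed in Section 6.2.1 (identifier names are illustrative)."""
    PCI_BUS = auto()
    MEMORY = auto()    # single- and multi-bit memory errors
    SENSORS = auto()   # managed by the BMC
    POST = auto()      # errors detected during POST


# The whole group, for enabling or disabling everything at once.
ALL_SOURCES = (ErrorSource.PCI_BUS | ErrorSource.MEMORY
               | ErrorSource.SENSORS | ErrorSource.POST)


def set_enabled(mask, sources, on):
    """Return a new mask with `sources` switched on or off."""
    return (mask | sources) if on else (mask & ~sources)


def is_enabled(mask, source):
    """True if every flag in `source` is set in `mask`."""
    return (mask & source) == source


# Enable the group, then disable one source individually.
mask = set_enabled(ErrorSource(0), ALL_SOURCES, True)
mask = set_enabled(mask, ErrorSource.PCI_BUS, False)
```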
6.2.2 Error Logging via SMI Handler
The SMI handler is used to handle and log system-level events that are not visible to the
server management firmware. The SMI handler pre-processes all system errors, including errors
that can generate an NMI.
The SMI handler sends a command to the BMC to log the event and provides the data to be
logged. For example, the BIOS programs the hardware to generate an SMI on a single-bit
memory error and logs the location of the failed DIMM in the system event log. System events
handled by the BIOS generate an SMI. After the BIOS finishes logging the error, it asserts the
NMI if needed.
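
The ordering described above (log through the BMC first, assert the NMI afterwards and only if needed) can be modeled as a small sketch. Everything here is a stand-in: `Event`, `FakeBmc`, `smi_handler`, and the `needs_nmi` field are hypothetical names, and the payload format is not the one defined by this specification.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str                # e.g. "single_bit_memory_error"
    data: bytes              # payload to log (format is hypothetical)
    needs_nmi: bool = False  # assumption: decided by platform policy


@dataclass
class FakeBmc:
    """Stand-in for the BMC; a real handler sends a command to the BMC."""
    sel: list = field(default_factory=list)  # models the system event log

    def log_event(self, kind, data):
        self.sel.append((kind, data))


def smi_handler(event, bmc, assert_nmi):
    """Log the event via the BMC, then assert the NMI if the event needs one."""
    bmc.log_event(event.kind, event.data)  # step 1: logging always comes first
    if event.needs_nmi:
        assert_nmi()                       # step 2: NMI only after logging
        return True
    return False
```

Passing `assert_nmi` as a callback keeps the sketch independent of how the platform raises the NMI, which this section does not describe.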
6.2.2.1 PCI Bus Error
The PCI bus defines two error pins, PERR# and SERR#, which report PCI parity errors and
system errors, respectively. The BIOS can be instructed to enable or disable reporting of
PERR# and SERR# through the NMI. Disabling the NMI for PERR# and/or SERR# also disables
logging of the corresponding event.
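
The coupling between the NMI setting and logging can be shown with a minimal sketch. The type `PciNmiSetup` and the function `should_log` are hypothetical; they only restate the rule that a signal whose NMI reporting is disabled is not logged either.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PciNmiSetup:
    """Hypothetical view of the two setup options described above."""
    perr_nmi: bool = True
    serr_nmi: bool = True


def should_log(setup, signal):
    """Logging follows the NMI setting: no NMI for a signal means no log entry."""
    if signal == "PERR#":
        return setup.perr_nmi
    if signal == "SERR#":
        return setup.serr_nmi
    raise ValueError(f"unknown PCI error signal: {signal!r}")
```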
In the case of PERR#, the PCI bus master has the option to retry the offending transaction, or to
report it using SERR#. All other PCI-related errors are reported by SERR#. All PCI-to-PCI
bridges are configured so that they generate an SERR# on the primary interface whenever
there is an SERR# on the secondary side, as long as SERR# is enabled in BIOS Setup. The
same is true for PERR#. The format of the data bytes is described in Section 6.2.3.3.
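
The bridge behavior above can be sketched as follows. The bit positions are the standard ones from the PCI specifications (SERR# Enable is bit 8 of the Command register and bit 1 of the Bridge Control register). That the BIOS Setup option maps onto exactly these two bits on this board is an assumption, and the function names are hypothetical.

```python
PCI_COMMAND_SERR_ENABLE = 1 << 8     # Command register, bit 8
PCI_BRIDGE_CTL_SERR_ENABLE = 1 << 1  # Bridge Control register, bit 1


def configure_bridge(command, bridge_ctl, serr_enabled_in_setup):
    """Return (command, bridge_ctl) with SERR# forwarding set per BIOS Setup."""
    if serr_enabled_in_setup:
        return (command | PCI_COMMAND_SERR_ENABLE,
                bridge_ctl | PCI_BRIDGE_CTL_SERR_ENABLE)
    return (command & ~PCI_COMMAND_SERR_ENABLE & 0xFFFF,
            bridge_ctl & ~PCI_BRIDGE_CTL_SERR_ENABLE & 0xFFFF)


def forwards_serr(command, bridge_ctl):
    """A bridge passes secondary-side SERR# to its primary side only if both bits are set."""
    return bool(command & PCI_COMMAND_SERR_ENABLE) and \
        bool(bridge_ctl & PCI_BRIDGE_CTL_SERR_ENABLE)


def serr_reaches_root(bridges):
    """bridges: (command, bridge_ctl) pairs from the failing device up to the root bus."""
    return all(forwards_serr(cmd, ctl) for cmd, ctl in bridges)
```

A single bridge with forwarding disabled is enough to stop an SERR# from propagating upstream, which is why the text states that all bridges are configured the same way. The PERR# case would use the Parity Error Response bits in the same two registers.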
6.2.2.2 PCI Express* Errors
The hardware is programmed to generate an SMI on PCI Express* correctable, uncorrectable
non-fatal, and uncorrectable fatal errors. The correctable PCI Express* errors are reported to