HP Integrity rx2800 i2 User Service Guide - Page 87

Troubleshooting the server memory, Memory DIMM load order, Memory subsystem behaviors



Table 36 CPU events that may light SID LEDs (continued)

| Diagnostic LEDs | Sample IPMI Events | Cause | Source | Notes |
|---|---|---|---|---|
| CPUs | Type E0h, 33d:26d BOOT_CPU_EARLY_TEST_FAIL | A logical CPU (thread) failed early self test | SFW | |
| CPUs | Type 02h, 25h:71h:80h MISSING_FRU_DEVICE | No physical CPU cores present | BMC | Possible seating or failed CPU |

Troubleshooting the server memory
Memory DIMM load order
For a minimally loaded server, two equal-size DIMMs must be installed in the DIMM slots, as per
Table 14 (page 35).
Memory subsystem behaviors
The CPU and its integrated memory controller provide increased reliability of DIMMs. The memory
controller built into the 9300 series CPU doubles memory rank error correction from 4 bytes to 8
bytes of a 128-byte cache line, during cache line misses initiated by CPU cache controllers and
by Direct Memory Access (DMA) operations initiated by I/O devices. This feature is called double
DRAM sparing, as 2 of 72 DRAMs in any DIMM pair can fail without any loss of server
performance.
Corrective action (DIMM/memory expander replacement) is required in any of the following cases:
a threshold is reached for multiple double-byte errors from one or more DIMMs in the same rank;
any uncorrectable memory error (more than 2 bytes) occurs; or no pair of like DIMMs is loaded in
rank 0 of side 0. All other causes of memory DIMM errors are corrected by the CPU and reported to
the Page Deallocation Table (PDT) / diagnostic LED panel.
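The three conditions that call for corrective action can be sketched as a single check. This is only an illustration, not server firmware logic: the `RankStatus` type, its field names, and the `double_byte_threshold` parameter are hypothetical, and this page does not state the actual threshold value.

```python
from dataclasses import dataclass


@dataclass
class RankStatus:
    """Hypothetical snapshot of one memory rank; field names are illustrative."""
    double_byte_errors: int          # double-byte errors counted for this rank
    uncorrectable_error: bool        # an error wider than 2 bytes was seen
    like_pair_in_rank0_side0: bool   # a pair of like DIMMs is loaded in rank 0, side 0


def needs_corrective_action(status: RankStatus, double_byte_threshold: int) -> bool:
    """Return True when any of the three conditions from the text holds.

    double_byte_threshold is a placeholder: the guide does not give the value here.
    """
    if status.double_byte_errors >= double_byte_threshold:
        return True  # threshold reached for double-byte errors in the same rank
    if status.uncorrectable_error:
        return True  # uncorrectable error (more than 2 bytes)
    if not status.like_pair_in_rank0_side0:
        return True  # no pair of like DIMMs in rank 0 of side 0
    return False
```

Everything that returns `False` here corresponds to the errors that the text says are corrected by the CPU and reported to the PDT or the diagnostic LED panel.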
Customer messaging policy
• Only light a diagnostic LED for memory DIMM errors when the error is isolated to a specific
  memory DIMM. If there is any uncertainty about a specific DIMM, point the customer to the SEL
  for any action and do not light the suspect DIMM CRU LED on the System Insight Display.
• For configuration-style errors, for example, no DIMMs installed in 0A and 0B, follow the HP
  ProLiant policy of lighting all of the CRU LEDs on the diagnostic LED panel for all of the DIMMs
  that are missing.
• No diagnostic messages are reported for single-byte errors that are corrected in both ICH10
  caches and DIMMs during corrected platform error (CPE) events. Diagnostic messages are
  reported for CPE events when thresholds are exceeded for both single-byte and double-byte
  errors; all fatal memory subsystem errors cause global MCA events.
• PDT logs for all double-byte errors are permanent; single-byte errors are initially logged as
  transient errors. If the server logs 2 single-byte errors within 24 hours, they are upgraded to
  permanent in the PDT.
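The PDT logging rule in the last item can be modeled with a small sketch. It rests on assumptions that are not in the guide: entries are keyed by a hypothetical page address, the 24-hour window is applied per address, and errors arrive in chronological order. The `PdtLog` class and its method names are made up for illustration.

```python
from datetime import datetime, timedelta

UPGRADE_WINDOW = timedelta(hours=24)  # from the text: 2 single-byte errors within 24 hours


class PdtLog:
    """Toy model of PDT entries keyed by a hypothetical page address."""

    def __init__(self):
        self.state = {}        # address -> "transient" or "permanent"
        self.last_single = {}  # address -> time of the most recent single-byte error

    def record(self, address: int, width: str, when: datetime) -> str:
        """Log one corrected error; width is "single" or "double"."""
        if width not in ("single", "double"):
            raise ValueError("width must be 'single' or 'double'")
        if width == "double" or self.state.get(address) == "permanent":
            # Double-byte errors are always permanent; permanent entries stay permanent.
            self.state[address] = "permanent"
            return "permanent"
        previous = self.last_single.get(address)
        self.last_single[address] = when
        if previous is not None and when - previous <= UPGRADE_WINDOW:
            # Second single-byte error inside the window: upgrade the entry.
            self.state[address] = "permanent"
        else:
            self.state[address] = "transient"
        return self.state[address]
```

For example, two single-byte errors at the same address 10 hours apart would leave that entry permanent, while two errors 30 hours apart would leave it transient under these assumptions.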