Intel S2600GZ S2600GZ/GL - Page 38

Non-Maskable Interrupt NMI, the Uncorrectable Memory ECC Error is logged to the SEL,

Page 38 highlights

Intel® Server Board S2600GZ/GL TPS Product Architecture Overview have to match but timings will be set to support all DIMMs populated (that is, DIMMs with slower timings will force faster DIMMs to the slower common timing modes). 3.2.4.5.1 Single Device Data Correction (SDDC) SDDC - Single Device Data Correction is a technique by which data can be replaced by the IMC from an entire x4 DRAM device which is failing, using a combination of CRC plus parity. This is an automatic IMC driven hardware. It can be extended to x8 DRAM technology by placing the system in Channel Lockstep Mode. 3.2.4.5.2 Error Correction Code (ECC) Memory ECC uses "extra bits" - 64-bit data in a 72-bit DRAM array - to add an 8-bit calculated "Hamming Code" to each 64 bits of data. This additional encoding enables the memory controller to detect and report single or multiple bit errors when data is read, and to correct single-bit errors. 3.2.4.5.2.1 Correctable Memory ECC Error Handling A "Correctable ECC Error" is one in which a single-bit error in memory contents is detected and corrected by use of the ECC Hamming Code included in the memory data. For a correctable error, data integrity is preserved, but it may be a warning sign of a true failure to come. Note that some correctable errors are expected to occur. The system BIOS has logic to cope with the random factor in correctable ECC errors. Rather than reporting every correctable error that occurs, the BIOS has a threshold and only logs a correctable error when a threshold value is reached. Additional correctable errors that occur after the threshold has been reached are disregarded. In addition, on the expectation the server system may have extremely long operational runs without being rebooted, there is a "Leaky Bucket" algorithm incorporated into the correctable error counting and comparing mechanism. The "Leaky Bucket" algorithm reduces the correctable error count as a function of time - as the system remains running for a certain amount of time, the correctable error count will "leak out" of the counting registers. This prevents correctable error counts from building up over an extended runtime. The correctable memory error threshold value is a configurable option in the BIOS Setup Utility, where you can configure it for 20/10/5/ALL/None Once a correctable memory error threshold is reached, the event is logged to the System Event Log (SEL) and the appropriate memory slot fault LED is lit to indicate on which DIMM the correctable error threshold crossing occurred. 3.2.4.5.2.2 Uncorrectable Memory ECC Error Handling All multi-bit "detectable but not correctable" memory errors are classified as Uncorrectable Memory ECC Errors. This is generally a fatal error. However, before returning control to the OS drivers from Machine Check Exception (MCE) or Non-Maskable Interrupt (NMI), the Uncorrectable Memory ECC Error is logged to the SEL, the appropriate memory slot fault LED is lit, and the System Status LED state is changed to a solid Amber. Revision 1.1 25 Intel order number G24881-004

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109
  • 110
  • 111
  • 112
  • 113
  • 114
  • 115
  • 116
  • 117
  • 118
  • 119
  • 120
  • 121
  • 122
  • 123
  • 124
  • 125
  • 126
  • 127
  • 128
  • 129
  • 130
  • 131
  • 132
  • 133
  • 134
  • 135
  • 136
  • 137
  • 138
  • 139
  • 140
  • 141
  • 142
  • 143
  • 144
  • 145
  • 146
  • 147
  • 148
  • 149
  • 150
  • 151
  • 152
  • 153
  • 154
  • 155
  • 156
  • 157
  • 158
  • 159
  • 160
  • 161
  • 162
  • 163
  • 164
  • 165
  • 166
  • 167
  • 168
  • 169
  • 170
  • 171
  • 172
  • 173
  • 174
  • 175
  • 176
  • 177
  • 178
  • 179
  • 180
  • 181
  • 182
  • 183
  • 184
  • 185
  • 186
  • 187
  • 188
  • 189
  • 190
  • 191
  • 192
  • 193
  • 194
  • 195
  • 196
  • 197
  • 198
  • 199
  • 200
  • 201
  • 202
  • 203
  • 204
  • 205
  • 206
  • 207
  • 208
  • 209
  • 210
  • 211
  • 212
  • 213
  • 214
  • 215
  • 216
  • 217
  • 218
  • 219
  • 220
  • 221
  • 222
  • 223
  • 224
  • 225
  • 226
  • 227
  • 228
  • 229
  • 230
  • 231
  • 232
  • 233
  • 234
  • 235
  • 236
  • 237
  • 238
  • 239
  • 240
  • 241
  • 242
  • 243
  • 244
  • 245
  • 246
  • 247
  • 248
  • 249
  • 250
  • 251
  • 252
  • 253
  • 254
  • 255
  • 256
  • 257
  • 258
  • 259
  • 260
  • 261
  • 262
  • 263
  • 264

Intel® Server Board S2600GZ/GL TPS
Product Architecture Overview
Revision 1.1
Intel order number G24881-004
25
have to match but timings will be set to support all DIMMs populated (that is, DIMMs with slower
timings will force faster DIMMs to the slower common timing modes).
3.2.4.5.1
Single Device Data Correction (SDDC)
SDDC
Single Device Data Correction is a technique by which data can be replaced by the
IMC from an entire x4 DRAM device which is failing, using a combination of CRC plus parity.
This is an automatic IMC driven hardware. It can be extended to x8 DRAM technology by
placing the system in Channel Lockstep Mode.
3.2.4.5.2
Error Correction Code (ECC) Memory
ECC uses “extra bits” –
64-bit data in a 72-bit DRAM array
to add an 8-bit calculated
“Hamming Code” to each 64 bits of data. This additional encoding enables the memory
controller to detect and report single or multiple bit errors when data is read, and to correct
single-bit errors.
3.2.4.5.2.1
Correctable Memory ECC Error Handling
A “Correctable ECC Error” is one in which a single
-bit error in memory contents is detected and
corrected
by use of the ECC Hamming Code included in the memory data. For a correctable
error, data integrity is preserved, but it may be a warning sign of a true failure to come. Note that
some correctable errors are expected to occur.
The system BIOS has logic to cope with the random factor in correctable ECC errors. Rather
than reporting every correctable error that occurs, the BIOS has a threshold and only logs a
correctable error when a threshold value is reached. Additional correctable errors that occur
after the threshold has been reached are disregarded. In addition, on the expectation the server
system may have extremely long operational runs without being rebooted, there is a “Leaky
Bucket” algorithm incorporated into the
correctable error counting and comparing mechanism.
The
Leaky Bucket
algorithm reduces the correctable error count as a function of time
as the
system remains running for a certain amount of time, the correctable error
count will “leak out”
of the counting registers. This prevents correctable error counts from building up over an
extended runtime.
The correctable memory error threshold value is a configurable option in the <F2> BIOS Setup
Utility, where you can configure it for 20/
10
/5/ALL/None
Once a correctable memory error threshold is reached, the event is logged to the System Event
Log (SEL) and the appropriate memory slot fault LED is lit to indicate on which DIMM the
correctable error threshold crossing occurred.
3.2.4.5.2.2
Uncorrectable Memory ECC Error Handling
All multi-
bit “detectable but not correctable“ memory error
s are classified as Uncorrectable
Memory ECC Errors. This is generally a fatal error.
However, before returning control to the OS drivers from Machine Check Exception (MCE) or
Non-Maskable Interrupt (NMI), the Uncorrectable Memory ECC Error is logged to the SEL, the
appropriate memory slot fault LED is lit, and the System Status LED state is changed to a solid
Amber.