HP ProLiant DL288 HP Advanced Memory Error Detection Technology - Page 4


The industry has traditionally classified memory errors by the number of bits affected and the causes
of the errors. But for systems with large memory footprints, it's more meaningful to classify errors as
correctable or uncorrectable. The following sections explain this distinction.
Traditional memory error classifications
Memory errors are commonly classified according to the number of bits affected in a 64-bit data
word. An error in one bit of a data word is a single-bit error. An error in more than one bit of a data
word is a multi-bit error.
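
The bit-count classification can be illustrated with a short sketch. It assumes the expected and observed values of a data word are both known, which is true only in a test harness; the function name `classify_bit_error` is hypothetical and not part of any HP tool.

```python
def classify_bit_error(expected, observed, width=64):
    """Classify a data-word error by the number of flipped bits."""
    word_mask = (1 << width) - 1
    # XOR leaves a 1 in every bit position where the two words differ.
    diff = (expected ^ observed) & word_mask
    flipped = bin(diff).count("1")
    if flipped == 0:
        return "no error"
    if flipped == 1:
        return "single-bit error"
    return "multi-bit error"


if __name__ == "__main__":
    print(classify_bit_error(0xFF, 0xFE))  # one bit differs
    print(classify_bit_error(0xFF, 0xF0))  # four bits differ
```
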
Memory errors are also classified as 'hard' or 'soft' depending on what caused them. DRAM defects,
bad solder joints, and data pin issues cause hard errors, in which the device consistently returns
incorrect results. For example, a “stuck” memory cell returns the same bit value, even when a different
bit is written to it. In contrast, soft errors are transient and non-repeating. They can be caused by an
electrical disturbance inside the memory array or on the memory interface.
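
The difference between a repeating (hard) and a non-repeating (soft) fault can be sketched with a simulated cell. Everything here is an illustrative assumption: `SimulatedCell` and `looks_hard` are hypothetical names, and real diagnostics work on whole DIMMs through firmware rather than on individual cells.

```python
class SimulatedCell:
    """Hypothetical one-bit memory cell, optionally stuck at a fixed value."""

    def __init__(self, stuck_at=None):
        self.stuck_at = stuck_at  # None means the cell is healthy
        self.value = 0

    def write(self, bit):
        self.value = bit

    def read(self):
        # A stuck cell returns the same value no matter what was written.
        return self.value if self.stuck_at is None else self.stuck_at


def looks_hard(cell, trials=8):
    """Return True if writing some bit value fails read-back on every trial."""
    for bit in (0, 1):
        failures = 0
        for _ in range(trials):
            cell.write(bit)
            if cell.read() != bit:
                failures += 1
        if failures == trials:
            return True  # consistent, repeatable failure
    return False


if __name__ == "__main__":
    print(looks_hard(SimulatedCell(stuck_at=1)))  # stuck cell
    print(looks_hard(SimulatedCell()))            # healthy cell
```

A transient disturbance would fail at most a few of the trials, so this check would not flag it as hard.
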
Correctable and uncorrectable errors
The outcome of a memory error depends on whether it can be corrected. Some row failures and
column failures are correctable depending on both the DIMM configuration (x4 or x8) and the error
correction capability of the system. ECC can correct single-bit errors within a single x4 or x8 DRAM
chip, but ECC can only detect a multi-bit error. Only x4 DRAM chips allow the use of advanced
error-correction control technologies¹ in server environments.
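
The "correct one bit, detect more" behavior of basic ECC can be illustrated with a toy code. The sketch below uses an extended Hamming code over 4 data bits (8 bits in total) purely for readability. It is an assumption chosen for illustration, not the code that server memory controllers use, which protect 64-bit data words with additional check bits. The names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are the Hamming(7,4)
    positions, with check bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    parity_broken = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected only.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # inject a single-bit error
    print(decode(word))
    word[6] ^= 1                 # inject a second error in the same word
    print(decode(word))
```
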
Advanced error-correction control technologies can detect and correct multi-bit failures in a single x4
DRAM chip. Their algorithms can correct any single-bit or multi-bit errors in a 4-bit symbol, also
known as a symbol error. This allows recovery from a x4 DRAM chip failure. The algorithms can also
detect two symbol errors across two x4 DRAM chips.
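
A symbol error can be pictured by grouping the flipped bits of a word into 4-bit symbols. The sketch below assumes that bits 4k to 4k+3 of the word all come from the same x4 DRAM chip; the real bit-to-chip mapping depends on the memory controller, and `affected_symbols` is a hypothetical helper.

```python
def affected_symbols(error_mask, width=64, symbol_bits=4):
    """Return the indexes of the symbols that contain at least one flipped bit.

    `error_mask` has a 1 in every bit position that was read back incorrectly.
    """
    symbol_mask = (1 << symbol_bits) - 1
    symbols = []
    for index in range(width // symbol_bits):
        if (error_mask >> (index * symbol_bits)) & symbol_mask:
            symbols.append(index)
    return symbols


if __name__ == "__main__":
    # Four flipped bits, all inside symbol 0: one symbol error.
    print(affected_symbols(0b1111))
    # Two flipped bits in different symbols: two symbol errors.
    print(affected_symbols(0x11))
```

Under this view, a multi-bit error confined to one symbol is no harder to correct than a single-bit error, which is why the failure of one x4 chip is recoverable.
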
Intel® Xeon®- and AMD Opteron™-based systems use advanced error-correction control technologies
to correct one 4-bit symbol error and detect two symbol errors (single-symbol correct, double-symbol
detect). If errors occur in more than two symbols, the technologies may not be able to detect them.
Another technology known as Double Device Data Correction (DDDC) can correct errors in two
symbols and detect errors in three symbols (double-symbol correct, triple-symbol detect). This means
that if one DRAM chip fails, but the DIMM remains in operation, DDDC will continue to work even if a
second chip has an error or fails. Intel Xeon systems support DDDC in lockstep memory mode. In
lockstep mode, two channels operate as a single channel so that each write and read operation
moves a cache line two channels wide. Both channels split the cache line to provide 2x 8-bit error
detection and 8-bit error correction within a single DRAM.
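
The correct and detect limits described above can be summarized in a small lookup. This is only a restatement of the figures in the text, not a model of how the hardware decides; the scheme labels and the `outcome` function are hypothetical.

```python
# (symbols correctable, symbols detectable), as stated in the text above
SCHEMES = {
    "SSC-DSD": (1, 2),  # single-symbol correct, double-symbol detect
    "DDDC": (2, 3),     # double-symbol correct, triple-symbol detect
}


def outcome(bad_symbols, scheme):
    """Describe what a scheme guarantees for a given number of symbol errors."""
    correct_limit, detect_limit = SCHEMES[scheme]
    if bad_symbols == 0:
        return "no error"
    if bad_symbols <= correct_limit:
        return "corrected"
    if bad_symbols <= detect_limit:
        return "detected, uncorrectable"
    return "detection not guaranteed"


if __name__ == "__main__":
    for scheme in SCHEMES:
        for count in range(5):
            print(scheme, count, outcome(count, scheme))
```
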
Why memory errors are increasing
Two trends increase the likelihood of memory errors in servers:
  • Server memory capacity is increasing.
  • DRAM technology is changing to meet the demand for higher DIMM storage capacity.
Server memory capacity is increasing
The growth of high-performance computing (HPC) and virtualized IT environments is driving operating
systems to address more memory. This is causing manufacturers to expand the memory capacity of
servers. In the last 5 years, the average memory capacity per server has grown by more than
500%, from 5.6 GB to 33 GB, across all HP ProLiant server lines.
¹ Intel® Single Device Data Correction and IBM Chipkill™