HP ProLiant DL288 HP Advanced Memory Error Detection Technology - Page 4


The industry has traditionally classified memory errors by the number of bits affected and the causes

of the errors. But for systems with large memory footprints, it's more meaningful to classify errors as

correctable or uncorrectable. The following sections explain this distinction.

Traditional memory error classifications

Memory errors are commonly classified according to the number of bits affected in a 64-bit data

word. An error in one bit of a data word is a single-bit error. An error in more than one bit of a data

word is a multi-bit error.
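The bit-count classification above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_by_bit_count` and the idea of comparing an expected word against an observed word are assumptions made for the example, not part of any HP tool.

```python
WORD_BITS = 64  # the text classifies errors within a 64-bit data word


def classify_by_bit_count(expected: int, observed: int) -> str:
    """Classify an error by how many bits differ between two 64-bit words."""
    mask = (1 << WORD_BITS) - 1
    # XOR leaves a 1 in every bit position where the two words disagree.
    diff = (expected ^ observed) & mask
    flipped = bin(diff).count("1")
    if flipped == 0:
        return "no error"
    if flipped == 1:
        return "single-bit error"
    return "multi-bit error"
```

For instance, comparing `0b1000` against `0b1001` reports a single-bit error, while comparing it against `0b1011` reports a multi-bit error.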

Memory errors are also classified as 'hard' or 'soft' depending on what caused them. DRAM defects,

bad solder joints, and data pin issues cause hard errors so that the device consistently returns

incorrect results. For example, a “stuck” memory cell returns the same bit value, even when a different

bit is written to it. In contrast, soft errors are transient and non-repeating. They can be caused by an

electrical disturbance inside the memory array or on the memory interface.
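The "stuck" cell example can be modeled with a small simulation. The sketch below is hypothetical: `StuckBitWord` and `looks_like_hard_error` are names invented for illustration, and real diagnostics rely on hardware error logs rather than a write/read loop like this one.

```python
WORD_MASK = (1 << 64) - 1


class StuckBitWord:
    """Hypothetical 64-bit word in which one cell is stuck at a fixed value."""

    def __init__(self, stuck_bit: int, stuck_value: int) -> None:
        self.stuck_bit = stuck_bit
        self.stuck_value = stuck_value
        self._data = 0

    def write(self, value: int) -> None:
        self._data = value & WORD_MASK

    def read(self) -> int:
        bit = 1 << self.stuck_bit
        # The stuck cell ignores whatever was written to it.
        if self.stuck_value:
            return self._data | bit
        return self._data & ~bit & WORD_MASK


def looks_like_hard_error(word: StuckBitWord, pattern: int, trials: int = 3) -> bool:
    """Return True if every write/read pass yields the same incorrect value."""
    results = []
    for _ in range(trials):
        word.write(pattern)
        results.append(word.read())
    consistent = len(set(results)) == 1
    always_wrong = all(value != pattern for value in results)
    return consistent and always_wrong
```

Note that a stuck-at-1 cell only shows up when the written pattern has a 0 in that position, which is why memory tests typically write more than one pattern. A soft error, by contrast, would not repeat across passes.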

Correctable and uncorrectable errors

The outcome of a memory error depends on whether it can be corrected. Some row failures and

column failures are correctable depending on both the DIMM configuration (x4 or x8) and the error

correction capability of the system. ECC can correct single-bit errors within a single x4 or x8 DRAM

chip, but ECC can only detect a multi-bit error. Only x4 DRAM chips allow the use of advanced error-correction control technologies¹ in server environments.

Advanced error-correction control technologies can detect and correct multi-bit failures in a single x4

DRAM chip. Their algorithms can correct any single-bit or multi-bit errors in a 4-bit symbol, also

known as a symbol error. This allows recovery from a x4 DRAM chip failure. The algorithms can also

detect two symbol errors across two x4 DRAM chips.
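To make the notion of a symbol error concrete, the sketch below groups a 64-bit word into sixteen 4-bit symbols and lists the symbols that contain at least one flipped bit. The mapping of bit position to symbol index (bit `i` belongs to symbol `i // 4`) is an assumption for illustration only; the actual mapping of data bits to DRAM chips depends on the DIMM and memory controller.

```python
from __future__ import annotations

WORD_BITS = 64
SYMBOL_BITS = 4  # one symbol per x4 DRAM chip, as described in the text


def affected_symbols(expected: int, observed: int) -> list[int]:
    """Return the indices of the 4-bit symbols that differ between two words."""
    diff = (expected ^ observed) & ((1 << WORD_BITS) - 1)
    symbols = []
    for index in range(WORD_BITS // SYMBOL_BITS):
        nibble = (diff >> (index * SYMBOL_BITS)) & 0xF
        if nibble:  # any flipped bit inside the symbol counts as one symbol error
            symbols.append(index)
    return symbols
```

Under this assumed layout, four flipped bits in the lowest nibble still count as a single symbol error, whereas two flipped bits in neighboring nibbles count as two symbol errors.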

Intel® Xeon®- and AMD Opteron™-based systems use advanced error-correction control technologies

to correct one 4-bit symbol error and detect two symbol errors (single-symbol correct, double-symbol

detect). If errors occur in more than two symbols, the technologies may not be able to detect them.

Another technology known as Double Device Data Correction (DDDC) can correct errors in two

symbols and detect errors in three symbols (double-symbol correct, triple-symbol detect). This means

that if one DRAM chip fails, but the DIMM remains in operation, DDDC will continue to work even if a

second chip has an error or fails. Intel Xeon systems support DDDC in lockstep memory mode. In

lockstep mode, two channels operate as a single channel so that each write and read operation

moves a cache line two channels wide. Both channels split the cache line to provide 2x 8-bit error

detection and 8-bit error correction within a single DRAM.
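The correction and detection limits described above can be summarized as a lookup. This is a simplified sketch: the scheme labels `"SSC-DSD"` and `"DDDC"` and the function `expected_outcome` are illustrative names, and the limits are taken directly from the text rather than from any controller specification.

```python
# (symbols that can be corrected, symbols that can be detected), per the text.
SCHEME_LIMITS = {
    "SSC-DSD": (1, 2),  # single-symbol correct, double-symbol detect
    "DDDC": (2, 3),     # double-symbol correct, triple-symbol detect
}


def expected_outcome(scheme: str, symbols_in_error: int) -> str:
    """Map a count of symbols in error to the outcome the text describes."""
    correct_limit, detect_limit = SCHEME_LIMITS[scheme]
    if symbols_in_error == 0:
        return "no error"
    if symbols_in_error <= correct_limit:
        return "corrected"
    if symbols_in_error <= detect_limit:
        return "detected, uncorrectable"
    # Beyond the detection limit the text says detection is not guaranteed.
    return "may go undetected"
```

For example, two symbols in error are detected but not corrected under the single-symbol-correct scheme, while DDDC corrects them.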

Why memory errors are increasing

Two trends increase the likelihood of memory errors in servers:

Server memory capacity is increasing.

DRAM technology is changing to meet the demand for higher DIMM storage capacity.

Server memory capacity is increasing

The growth of high-performance computing (HPC) and virtualized IT environments is driving operating

systems to address more memory. This is causing manufacturers to expand the memory capacity of

servers. In the last 5 years, the average memory capacity per server has grown by more than 500%, from 5.6 GB to 33 GB per server across all HP ProLiant server lines.

¹ Intel® Single Device Data Correction and IBM Chipkill™
