HP DL740 hot plug RAID memory technology for fault tolerance and scalability - Page 3

Two kinds of errors can typically occur in a memory system. The first is called a hard
error and is characterized by the fact that it is repeatable, though it may show up
inconsistently. In this situation, a piece of hardware is broken and will continue to
exhibit incorrect behavior over time. For example, a bit may be stuck so that it always
returns “0”, even when a “1” is written to it. Hard errors indicate physical problems
such as memory defects or a broken connection.

The second kind is a soft error, and most errors that occur in the memory subsystem are
of this type. A soft error is a randomly occurring event that causes the data stored in
a device to be changed. Because a soft error is not caused by a problem with the circuit,
once the data is corrected, the error will not recur.
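
The difference between the two error types can be illustrated with a toy model of a one-bit memory cell. This is only a sketch: the `MemoryCell` class, its `stuck_at` parameter, and the `error_recurs` helper are hypothetical names introduced for illustration and do not describe any real hardware interface.

```python
class MemoryCell:
    """Toy one-bit cell. `stuck_at` (hypothetical) models a hard fault."""

    def __init__(self, stuck_at=None):
        self.stuck_at = stuck_at  # None = healthy; 0 or 1 = bit stuck at that value
        self._value = 0

    def write(self, bit):
        self._value = bit

    def read(self):
        # A stuck cell ignores what was written and always returns the stuck value.
        return self._value if self.stuck_at is None else self.stuck_at

    def upset(self):
        """Model a soft error: the stored value flips once."""
        self._value ^= 1


def error_recurs(cell, bit=1):
    """Rewrite the correct data, then report whether the read-back is still wrong."""
    cell.write(bit)
    return cell.read() != bit


# Soft error: the data is corrupted once, but rewriting it clears the problem.
soft = MemoryCell()
soft.write(1)
soft.upset()
soft_error_seen = soft.read() != 1
soft_recurs = error_recurs(soft)

# Hard error: the cell is stuck at 0, so the error persists after every rewrite.
hard = MemoryCell(stuck_at=0)
hard_recurs = error_recurs(hard)
```

In this model, rewriting the data is enough to clear the soft error, while the stuck cell keeps returning the wrong value no matter how often it is rewritten.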

error detection and correction

The only true protection from memory errors is to use some sort of error detection or
correction protocol. Some protocols can only detect errors, while others can both detect
and correct memory problems seamlessly.

parity checking

Parity checking is the most basic form of memory error detection. Although it detects
many errors, it does have some drawbacks. Parity checking can only reliably detect a
single-bit error; an error that flips an even number of bits leaves the parity unchanged
and goes unnoticed. In addition, parity checking cannot locate or correct erroneous data,
so when it does detect an error, the server will halt operation.
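
A minimal sketch of even parity over an 8-bit word shows both the detection and its blind spot. The function names `parity_bit` and `check` are illustrative assumptions, not part of any HP interface, and real memory controllers compute parity in hardware.

```python
def parity_bit(data: int) -> int:
    """Even parity: returns 1 if `data` has an odd number of 1 bits, else 0."""
    return bin(data).count("1") % 2


def check(data: int, stored_parity: int) -> bool:
    """True if the data still agrees with the parity bit stored alongside it."""
    return parity_bit(data) == stored_parity


word = 0b10110010
stored = parity_bit(word)

single_flip = word ^ 0b00000100   # one bit changed
double_flip = word ^ 0b00000110   # two bits changed

single_detected = not check(single_flip, stored)
double_detected = not check(double_flip, stored)
```

The single-bit flip changes the parity and is caught. The double-bit flip restores the original parity, so the check passes even though the data is wrong. In neither case does the parity bit say which bit changed, which is why correction is not possible.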

error checking and correcting

ECC memory is now standard in all ProLiant servers and significantly reduces the
probability of fatal memory failures. The ECC commonly used in industry-standard
servers is superior to parity checking because it not only detects both single-bit
and multibit errors but also corrects single-bit errors.

Moreover, this ECC will detect (but not correct) errors of two, three, or even four bits.
ECC-protected memory systems handle these multibit errors much as parity checking
handles single-bit errors: by generating a nonmaskable interrupt (NMI) that instructs the
system to shut down to avoid data corruption.
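
The correct-one, detect-more behavior can be sketched with a textbook extended Hamming (8,4) code. This is an assumption made for illustration: the source does not say which code ProLiant servers use, and this small code only guarantees detection of double-bit errors, not the three- and four-bit detection described above. The names `encode` and `decode` are hypothetical.

```python
def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d      # data bits
    code[1] = code[3] ^ code[5] ^ code[7]       # check bit covering positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]       # check bit covering positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]       # check bit covering positions 4,5,6,7
    code[0] = sum(code[1:]) % 2                 # overall parity over positions 1..7
    return code


def decode(received: list[int]):
    """Return (nibble, status); nibble is None when the error cannot be corrected."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of the positions holding a 1
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: a double-bit error.
        # A real system would raise an NMI at this point instead of returning.
        return None, "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)

one_bad = list(word)
one_bad[5] ^= 1                                 # single-bit error
single_result = decode(one_bad)

two_bad = list(word)
two_bad[2] ^= 1
two_bad[6] ^= 1                                 # double-bit error
double_result = decode(two_bad)
```

Here the single-bit error is located by the syndrome and repaired, while the double-bit error is only flagged. Three or more flipped bits can fool this small code, which is one reason production ECC uses wider words and stronger codes.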

potential for system failures

Research has shown that the number of soft errors increases as memory capacity
increases. Some percentage of these soft errors will be multibit errors that ECC cannot
correct, so the potential for failure in ECC systems also increases as memory capacity
increases. In fact, servers with 1 GB of memory using ECC are protected against memory
failures only about as well as servers with 64 MB of memory using parity checking
(figure 1). With each new generation of servers, memory capacity increases, and so
does the potential for system failures.
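
The scaling argument can be made concrete with a deliberately simple linear model. Everything here is an assumption for illustration: the model treats soft errors as proportional to capacity, treats every soft error as fatal under parity, and treats only an uncorrectable fraction as fatal under ECC. The function names and the rate value are hypothetical, and the break-even fraction is a consequence of this model, not a measured figure from the source.

```python
def fatal_error_rate(capacity_mb: float, soft_errors_per_mb: float,
                     fatal_fraction: float) -> float:
    """Expected fatal errors per unit time under a purely linear model (assumption)."""
    return capacity_mb * soft_errors_per_mb * fatal_fraction


def break_even_fraction(parity_capacity_mb: float, ecc_capacity_mb: float) -> float:
    """Uncorrectable fraction at which an ECC system is as exposed as a parity system."""
    return parity_capacity_mb / ecc_capacity_mb


RATE = 1.0  # placeholder soft-error rate per MB; not a measured value

# Doubling capacity doubles the expected number of fatal errors in this model.
rate_1gb = fatal_error_rate(1024, RATE, 0.01)
rate_2gb = fatal_error_rate(2048, RATE, 0.01)

# Fraction of soft errors that would have to be uncorrectable for 1 GB with ECC
# to match 64 MB with parity, under the assumptions above.
fraction = break_even_fraction(64, 1024)
parity_64mb = fatal_error_rate(64, RATE, 1.0)
ecc_1gb = fatal_error_rate(1024, RATE, fraction)
```

Under these assumptions the two systems have the same exposure when the uncorrectable fraction equals the ratio of the two capacities. The point of the sketch is only that a fixed fraction of uncorrectable errors grows in absolute terms as capacity grows.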
